## How are people feeling about AI and its impact in the workplace?
### Web crawling from pre-defined seeds
<hr>
I have selected some web pages that discuss AI and its potential impact on the workplace. These pages serve as 'seeds': starting points from which I follow hyperlinks to other pages that may help me gauge the general sentiment around AI.

Links to visit: 
- https://futureoflife.org/open-letter/pause-giant-ai-experiments/
- https://www.bbc.com/worklife/article/20230418-ai-anxiety-artificial-intelligence-replace-jobs#:~:text=In%20March%2C%20Goldman%20Sachs%20published,by%20technology%20in%20three%20years.


In [1]:
# Libraries for requesting HTML
from bs4 import BeautifulSoup
import requests


# natural language toolkit
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Diana\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Diana\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [2]:
first_seed_url = "https://futureoflife.org/open-letter/pause-giant-ai-experiments/"
page = requests.get(first_seed_url)
soup = BeautifulSoup(page.content, features="html.parser")
#soup

In [3]:
# Find the <a> tags in the HTML and collect their href values (the new links)
links = []

for a in soup.find_all("a"):
    href = a.get("href")  # None when the tag has no href attribute
    if href is not None:
        links.append(href)
    
for link in links:
    print(link)

#main-content
https://futureoflife.org/
http://futureoflife.org/our-mission/
#dropdown
https://futureoflife.org/cause-areas/
https://futureoflife.org/cause-area/artificial-intelligence/
https://futureoflife.org/cause-area/biotechnology/
https://futureoflife.org/cause-area/nuclear/
https://futureoflife.org/cause-area/climate/
#dropdown
https://futureoflife.org/our-work/
https://futureoflife.org/our-work/policy-work/
https://futureoflife.org/our-work/outreach-work/
https://futureoflife.org/our-work/grantmaking-work/
https://futureoflife.org/our-work/events-work/
https://futureoflife.org/project/mitigating-the-risks-of-ai-integration-in-nuclear-launch/
https://futureoflife.org/fli-area/policy/
https://futureoflife.org/project/future-of-life-award/
https://futureoflife.org/fli-area/outreach/
https://futureoflife.org/project/nist-framework/
https://futureoflife.org/fli-area/policy/
https://futureoflife.org/project/eu-ai-act/
https://futureoflife.org/fli-area/policy/
https://futureoflife.org
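
The raw list above mixes in-page anchors (such as `#main-content`), absolute URLs, and repeated entries. One possible way to tidy such a list is sketched below using only the standard library. The helper name `normalize_links` is hypothetical and is not part of the original notebook; it assumes that links should be resolved against the page they were found on, that fragments can be discarded, and that only `http`/`https` links are of interest.

```python
from urllib.parse import urljoin, urldefrag

def normalize_links(base_url, hrefs):
    """Resolve hrefs against base_url, strip fragments, and de-duplicate in order.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    seen = set()
    cleaned = []
    for href in hrefs:
        # Skip empty values and pure in-page anchors such as "#dropdown"
        if not href or href.startswith("#"):
            continue
        absolute = urljoin(base_url, href)          # resolves relative paths
        absolute, _fragment = urldefrag(absolute)   # drops any trailing "#..." part
        # Keep web pages only (this drops mailto:, javascript:, etc.)
        if not absolute.startswith(("http://", "https://")):
            continue
        if absolute in seen:
            continue
        seen.add(absolute)
        cleaned.append(absolute)
    return cleaned
```

In this notebook it could be called as `normalize_links(first_seed_url, links)` before filtering.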

### Filtering the links to articles and other relevant pages
<hr>
The Future of Life Institute website links to many papers and research articles on AI. I extract the links that lead to those papers, and I will try to tokenize the PDFs where they are accessible.

In [4]:
filtered_first_seed = [link for link in links if link.startswith((
    'https://futureoflife.org/cause-area/artificial-intelligence/', 
    'https://abcnews.go.com/',
    'https://arxiv.org/abs/',
    'https://openai.com/',
    'https://futureoflife.org/ai/',
    'https://futureoflife.org/open-letter/ai-principles/',
    'https://time.com/'))]

## insert the original seed link at the front of the list
filtered_first_seed.insert(0, first_seed_url)

for link in filtered_first_seed:
    print(link)


https://futureoflife.org/open-letter/pause-giant-ai-experiments/
https://futureoflife.org/cause-area/artificial-intelligence/
https://futureoflife.org/open-letter/ai-principles/
https://openai.com/blog/planning-for-agi-and-beyond
https://futureoflife.org/ai/faqs-about-flis-open-letter-calling-for-a-pause-on-giant-ai-experiments/
https://arxiv.org/abs/2209.10604
https://arxiv.org/abs/2206.13353
https://arxiv.org/abs/2303.10130
https://arxiv.org/abs/2206.05862
https://arxiv.org/abs/2209.00626
https://arxiv.org/abs/2112.04359
https://abcnews.go.com/Technology/openai-ceo-sam-altman-ai-reshape-society-acknowledges/story?id=97897122
https://time.com/6246119/demis-hassabis-deepmind-interview/
https://arxiv.org/abs/2303.12712
https://futureoflife.org/cause-area/artificial-intelligence/
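
Prefix matching is sensitive to the scheme: a prefix starting with `https://` will not match the same page linked over `http://`, and the raw list does contain one such link (`http://futureoflife.org/our-mission/`). A coarser alternative, sketched below under that assumption, is to filter on the host name instead. The function name `filter_by_host` is hypothetical; note that it keeps every page on an allowed host, so it is less selective than the path-based prefixes used above.

```python
from urllib.parse import urlparse

def filter_by_host(links, allowed_hosts):
    """Keep only links whose host name is in allowed_hosts (case-insensitive).

    Hypothetical helper for illustration; not part of the original notebook.
    """
    allowed = {host.lower() for host in allowed_hosts}
    kept = []
    for link in links:
        host = urlparse(link).netloc.lower()  # empty string for anchors and relative paths
        if host in allowed:
            kept.append(link)
    return kept
```

For example, `filter_by_host(links, ["arxiv.org", "time.com"])` would keep the arXiv and TIME links regardless of scheme.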


### Creating functions that will allow me to extract html href elements  

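The previous process can be repeated for every other seed. The following functions wrap the two steps used above: fetching and parsing a page, and collecting the href values of its anchor tags.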

In [5]:
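def soup_object(link):
    """Download a page and parse it into a BeautifulSoup object."""
    page = requests.get(link, timeout=30)  # avoid hanging on an unresponsive server
    soup = BeautifulSoup(page.content, features='html.parser')
    return soup

def extract_links(soup):
    """Return the href value of every <a> tag that has one."""
    new_links = []
    for a in soup.find_all("a"):
        href = a.get("href")  # None when the tag has no href attribute
        if href is not None:
            new_links.append(href)

    return new_links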

In [6]:
# Find the 'sub-links' on each page in the filtered link list
for link in filtered_first_seed:
    soup = soup_object(link)
    links = extract_links(soup)

    #print('This is a primary link', link)
    #print(links)
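
The loop above overwrites `links` on every iteration, so only the sub-links of the last page survive. If the sub-links of every page are wanted, one option is to store them in a dictionary keyed by the primary link. The sketch below is an assumption about how that could look; `collect_sub_links` and the stand-in `fake_fetch` are hypothetical names, and the fetcher is passed in as an argument so the example runs without network access.

```python
def collect_sub_links(primary_links, fetch_links):
    """Map each primary link to the list of links found on that page.

    fetch_links is any callable that takes a URL and returns a list of hrefs.
    Hypothetical helper for illustration; not part of the original notebook.
    """
    sub_links = {}
    for link in primary_links:
        try:
            sub_links[link] = fetch_links(link)
        except Exception as error:
            # A failed page (network error, missing page) is recorded as empty
            print(f"Skipping {link}: {error}")
            sub_links[link] = []
    return sub_links

# Stand-in data and fetcher so the sketch runs offline (made-up example pages)
fake_pages = {
    "https://example.org/a": ["https://example.org/b", "#top"],
}

def fake_fetch(url):
    return fake_pages[url]  # raises KeyError for pages not in fake_pages

sub_links = collect_sub_links(
    ["https://example.org/a", "https://example.org/missing"], fake_fetch
)
```

In the notebook, the fetcher could be `lambda url: extract_links(soup_object(url))`, called with `filtered_first_seed` as the list of primary links.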

In [7]:
# Using second seed to find more relevant links
second_seed = "https://www.bbc.com/worklife/article/20230418-ai-anxiety-artificial-intelligence-replace-jobs#:~:text=In%20March%2C%20Goldman%20Sachs%20published,by%20technology%20in%20three%20years."
soup2 = soup_object(second_seed)
links = extract_links(soup2)

filtered_second_seed = [link for link in links if link.startswith((
    'https://www.pwc.com/', 
    'https://www.technologyreview.com/',
    'https://journals.sagepub.com/',
    'https://news.byu.edu/',
    'https://www.bbc.com/worklife/'))]

## insert the original seed link at the front of the list
filtered_second_seed.insert(0, second_seed)


for i in filtered_second_seed:
    print(i)

https://www.bbc.com/worklife/article/20230418-ai-anxiety-artificial-intelligence-replace-jobs#:~:text=In%20March%2C%20Goldman%20Sachs%20published,by%20technology%20in%20three%20years.
https://www.pwc.com/gx/en/issues/workforce/hopes-and-fears-2022.html
https://www.technologyreview.com/2023/02/08/1068068/chatgpt-is-everywhere-heres-where-it-came-from/
https://journals.sagepub.com/doi/full/10.1177/23780231221131377
https://news.byu.edu/intellect/robots-are-taking-over-jobs-but-not-at-the-rate-you-might-think-says-byu-research
https://www.bbc.com/worklife/work-in-progress


### Using a third seed

In [8]:
third_seed = "https://mitsloan.mit.edu/ideas-made-to-matter/report-finds-employees-embrace-ai-when-they-see-its-value"
soup3 = soup_object(third_seed)
links = extract_links(soup3)

filtered_third_seed = [link for link in links if link.startswith((
    '/ideas-made-to-matter/'
    ))]

## insert the original seed link (as a relative path) at the front of the list
filtered_third_seed.insert(0, '/ideas-made-to-matter/report-finds-employees-embrace-ai-when-they-see-its-value')

filtered_third_seeds = []
for i in filtered_third_seed:
    completeLink= 'https://mitsloan.mit.edu'+ i
    filtered_third_seeds.append(completeLink)
print(filtered_third_seeds)


['https://mitsloan.mit.edu/ideas-made-to-matter/report-finds-employees-embrace-ai-when-they-see-its-value', 'https://mitsloan.mit.edu/ideas-made-to-matter/delta-model-how-arnoldo-hax-reprioritized-corporate-strategy', 'https://mitsloan.mit.edu/ideas-made-to-matter/its-time-to-rechart-course-technology-here-are-4-ways-to-start', 'https://mitsloan.mit.edu/ideas-made-to-matter/sec-commissioner-hester-peirce-not-a-fan-interventionist-approaches', 'https://mitsloan.mit.edu/ideas-made-to-matter/topics/artificial-intelligence', 'https://mitsloan.mit.edu/ideas-made-to-matter/sara-brown', 'https://mitsloan.mit.edu/ideas-made-to-matter/new-book-explores-how-ai-really-changes-way-we-work', 'https://mitsloan.mit.edu/ideas-made-to-matter/3-requirements-successful-artificial-intelligence-programs', 'https://mitsloan.mit.edu/ideas-made-to-matter/mit-sloan-research-artificial-intelligence-and-machine-learning', 'https://mitsloan.mit.edu/ideas-made-to-matter/its-time-to-rechart-course-technology-here-a

### Using a fourth seed

In [9]:
fourth_seed = "https://www.bbvaopenmind.com/en/articles/artificial-intelligence-in-workplace-what-is-at-stake-for-workers/"
soup4 = soup_object(fourth_seed)
links = extract_links(soup4)

# for i in links:
#     print(i)

filtered_fourth_seed = [link for link in links if link.startswith((
    'https://newsroom.ibm.com/',
    'https://www2.deloitte.com/',
    'https://osha.europa.eu/',
    'https://www.pwc.com/',
    'https://www.bbvaopenmind.com/en/articles',
    'https://www.businessinsider.in/'
    ))]

## insert the original seed link at the front of the list
filtered_fourth_seed.insert(0, fourth_seed)

for i in filtered_fourth_seed:
    print(i)

https://www.bbvaopenmind.com/en/articles/artificial-intelligence-in-workplace-what-is-at-stake-for-workers/
https://www.bbvaopenmind.com/en/articles/artificial-intelligence-in-workplace-what-is-at-stake-for-workers/
https://www.bbvaopenmind.com/en/articles/artificial-intelligence-in-workplace-what-is-at-stake-for-workers/
https://www2.deloitte.com/insights/us/en/focus/human-capital-trends/2017/people-analytics-in-hr.html
https://osha.europa.eu/en/tools-and-publications/publications/foresight-new-and-emerging-occupational-safety-and-health-risks/view
https://www.businessinsider.in/i-tried-the-software-that-uses-ai-to-scan-job-applicants-for-companies-like-goldman-sachs-and-unilever-before-meeting-them-and-its-not-as-creepy-as-it-sounds/articleshow/60196231.cms
https://osha.europa.eu/en/tools-and-publications/publications/future-work-crowdsourcing/view
https://newsroom.ibm.com/2018-11-28-IBM-Talent-Business-Uses-AI-To-Rethink-The-Modern-Workforce
https://osha.europa.eu/en/tools-and-publi

### Using fifth seed

In [10]:
fifth_seed = "https://www.bbvaopenmind.com/en/books/work-in-the-age-of-data/"
soup5 = soup_object(fifth_seed)
links = extract_links(soup5)


filtered_fifth_seed = [link for link in links if link.startswith((
    'https://www.bbvaopenmind.com/en/articles/'
    ))]

## insert the original seed link at the front of the list
filtered_fifth_seed.insert(0, fifth_seed)

for i in filtered_fifth_seed:
    print(i)

https://www.bbvaopenmind.com/en/books/work-in-the-age-of-data/
https://www.bbvaopenmind.com/en/articles/on-the-effects-of-artificial-intelligence-on-growth-and-employment/
https://www.bbvaopenmind.com/en/articles/on-the-effects-of-artificial-intelligence-on-growth-and-employment/
https://www.bbvaopenmind.com/en/articles/measuring-productivity-in-the-context-of-technological-change/
https://www.bbvaopenmind.com/en/articles/measuring-productivity-in-the-context-of-technological-change/
https://www.bbvaopenmind.com/en/articles/inequality-in-the-digital-era/
https://www.bbvaopenmind.com/en/articles/inequality-in-the-digital-era/
https://www.bbvaopenmind.com/en/articles/intangible-capital-productivity-and-labor-markets/
https://www.bbvaopenmind.com/en/articles/intangible-capital-productivity-and-labor-markets/
https://www.bbvaopenmind.com/en/articles/causes-and-consequences-of-job-polarization-and-their-future-perspectives/
https://www.bbvaopenmind.com/en/articles/causes-and-consequences-of

### Using sixth seed

In [11]:
sixth_seed = "https://www.weforum.org/agenda/2022/01/artificial-intelligence-ai-technology-trust-survey/"

soup6 = soup_object(sixth_seed)
links = extract_links(soup6)

# for i in links:
#     print(i)

filtered_sixth_seed = [link for link in links if link.startswith((
    "https://www.weforum.org/topics/",
    "https://www.weforum.org/impact/",
    "https://www.weforum.org/agenda/",
     ))]

# insert the original seed link at the front of the list
filtered_sixth_seed.insert(0, sixth_seed)

for i in filtered_sixth_seed:
    print(i)

https://www.weforum.org/agenda/2022/01/artificial-intelligence-ai-technology-trust-survey/
https://www.weforum.org/topics/artificial-intelligence-and-robotics
https://www.weforum.org/agenda/authors/joe-myers-e357a678-184f-4be2-9895-dad15a2435b3
https://www.weforum.org/impact/ai-for-agriculture-in-india/
https://www.weforum.org/agenda/2021/12/ai-mental-health-cbt-therapy
https://www.weforum.org/agenda/2021/12/how-to-keep-human-in-human-resources-with-ai-based-tools
https://www.weforum.org/topics/artificial-intelligence-and-robotics
https://www.weforum.org/agenda/archive/emerging-technology
https://www.weforum.org/topics/emerging-technologies
https://www.weforum.org/topics/artificial-intelligence-and-robotics
https://www.weforum.org/agenda/2023/05/jobs-ai-cant-replace/
https://www.weforum.org/agenda/2023/05/jobs-ai-cant-replace/
https://www.weforum.org/agenda/2023/05/emerging-technologies-human-crises-artificial-intelligence/
https://www.weforum.org/agenda/2023/05/emerging-technologies-h

### Using seventh seed

In [12]:
seventh_seed = "https://ai100.stanford.edu/2021-report/standing-questions-and-responses/sq6-how-has-public-sentiment-towards-ai-evolved-and"

soup7 = soup_object(seventh_seed)
links = extract_links(soup7)

# for i in links:
#     print(i)

filtered_seventh_seed = [link for link in links if link.startswith((
    "/2021-report/standing-questions-and-responses/",
    "gathering-strength-gathering-storms-one-hundred-year-study-artificial-intelligence-ai100-2021-study",
    "/2021-report/standing-questions-and-section-summaries/"
     ))]

# insert the original seed link at the front of the list (currently disabled)
#filtered_seventh_seed.insert(0, sixth_seed)

filtered_seventh_seeds = []
for i in filtered_seventh_seed:
    remove = "#"  ## strip in-page anchors from each link; if they are not removed, the same page can appear several times
    completeLink= 'https://ai100.stanford.edu'+ i
    completeLink = completeLink.split(remove,1)[0]
    filtered_seventh_seeds.append(completeLink)
print(filtered_seventh_seeds)    

['https://ai100.stanford.edu/2021-report/standing-questions-and-responses/sq1-what-are-some-examples-pictures-reflect-important', 'https://ai100.stanford.edu/2021-report/standing-questions-and-responses/sq2-what-are-most-important-advances-ai', 'https://ai100.stanford.edu/2021-report/standing-questions-and-section-summaries/sq2-what-are-most-important-advances-ai', 'https://ai100.stanford.edu/2021-report/standing-questions-and-section-summaries/sq2-what-are-most-important-advances-ai', 'https://ai100.stanford.edu/2021-report/standing-questions-and-section-summaries/sq2-what-are-most-important-advances-ai', 'https://ai100.stanford.edu/2021-report/standing-questions-and-section-summaries/sq2-what-are-most-important-advances-ai', 'https://ai100.stanford.edu/2021-report/standing-questions-and-section-summaries/sq2-what-are-most-important-advances-ai', 'https://ai100.stanford.edu/2021-report/standing-questions-and-section-summaries/sq2-what-are-most-important-advances-ai', 'https://ai100.st

### Using eighth seed

In [13]:
eighth_seed = "https://www.theguardian.com/global-development/2023/may/12/why-would-we-employ-people-experts-on-five-ways-ai-will-change-work"

soup8 = soup_object(eighth_seed)
links = extract_links(soup8)

# for i in links:
#     print(i)

filtered_eighth_seed = [link for link in links if link.startswith((
    "https://www.theguardian.com/technology/2023/",
    "https://www.theguardian.com/global-development/2023",
     ))]

# insert the original seed link at the front of the list
filtered_eighth_seed.insert(0, eighth_seed)

for i in filtered_eighth_seed:
    print(i)

https://www.theguardian.com/global-development/2023/may/12/why-would-we-employ-people-experts-on-five-ways-ai-will-change-work
https://www.theguardian.com/global-development/2023/mar/30/the-future-of-work-a-guardian-series
https://www.theguardian.com/technology/2023/feb/18/the-ai-industrial-revolution-puts-middle-class-workers-under-threat-this-time
https://www.theguardian.com/technology/2023/may/09/techscape-artificial-intelligence-risk


### Using ninth seed

In [26]:
ninth_seed = "https://www.bbntimes.com/technology/advantages-disadvantages-of-ai-in-the-workplace"

soup9 = soup_object(ninth_seed)
links = extract_links(soup9)

# for i in links:
#     print(i)

filtered_ninth_seed = [link for link in links if link.startswith((
    "/technology/crossing-the-threshold-into-the-ai-renaissance",
    "/politics/is-it-time-for-a-secretary-of-technology",
    "/science/postmodernism-and-the-hyperreality-of-gpt",
    "/financial/gpt-4-will-revolutionise-the-financial-services-industry",
    "/technology/preparing-for-a-future-dominated-by-artificial-intelligence",
    "/environment/how-ai-is-revolutionizing-smart-waste-management",
    "/society/gpt-and-the-quest-for-the-ghost-in-the-machine"
     ))]

filtered_ninth_seeds = []
for i in filtered_ninth_seed:
    completeLink= 'https://www.bbntimes.com'+ i
    filtered_ninth_seeds.append(completeLink)
#print(filtered_ninth_seeds)

for i in filtered_ninth_seeds:
    print(i)
# insert the original seed link at the front of the list (currently disabled)
#filtered_ninth_seed.insert(0, ninth_seed)

https://www.bbntimes.com/technology/crossing-the-threshold-into-the-ai-renaissance
https://www.bbntimes.com/technology/preparing-for-a-future-dominated-by-artificial-intelligence
https://www.bbntimes.com/technology/preparing-for-a-future-dominated-by-artificial-intelligence
https://www.bbntimes.com/technology/crossing-the-threshold-into-the-ai-renaissance
https://www.bbntimes.com/technology/crossing-the-threshold-into-the-ai-renaissance
https://www.bbntimes.com/environment/how-ai-is-revolutionizing-smart-waste-management
https://www.bbntimes.com/financial/gpt-4-will-revolutionise-the-financial-services-industry
https://www.bbntimes.com/politics/is-it-time-for-a-secretary-of-technology
https://www.bbntimes.com/science/postmodernism-and-the-hyperreality-of-gpt
https://www.bbntimes.com/society/gpt-and-the-quest-for-the-ghost-in-the-machine
https://www.bbntimes.com/technology/preparing-for-a-future-dominated-by-artificial-intelligence
https://www.bbntimes.com/technology/crossing-the-thresh

### Creating a CSV file that saves all the links that are relevant

After crawling all nine seeds and finding links that might contain relevant vocabulary, I save them to a CSV file. This may be useful at a later stage for keeping track of the words found in each link.

In [27]:
final_list_of_links = filtered_first_seed + filtered_second_seed + filtered_third_seeds + filtered_fourth_seed + filtered_fifth_seed + filtered_sixth_seed + filtered_seventh_seeds + filtered_eighth_seed + filtered_ninth_seeds

## Adding extra websites that have relevant content but didn't have enough links to become a seed
extra_research_links = [
    "https://futurism.com/the-byte/fear-ai-anxiety",
    "https://www.beekeeper.io/blog/3-reasons-you-want-ai-in-the-workplace/",
    "https://www.farrer.co.uk/news-and-insights/blogs/artificial-intelligence-in-the-workplace-helpful-or-harmful/",
    "https://fortune.com/2023/03/13/artificial-intelligence-make-workplace-decisions-human-intelligence-remains-vital-careers-tech-gary-friedman/",
    "https://www.mckinsey.com/featured-insights/future-of-work/ai-automation-and-the-future-of-work-ten-things-to-solve-for",
    "https://www.bbntimes.com/technology/advantages-disadvantages-of-ai-in-the-workplace",
    "https://en.wikipedia.org/wiki/Workplace_impact_of_artificial_intelligence",
    "https://www.akerman.com/en/perspectives/hr-def-how-to-be-smart-about-using-artificial-intelligence-in-the-workplace.html",
    "https://planergy.com/blog/how-ai-is-transforming-the-workplace/",
]

final_list_of_links.extend(extra_research_links)


final_list_of_links = list(dict.fromkeys(final_list_of_links)) ##removing duplicate links
print(final_list_of_links)

['https://futureoflife.org/open-letter/pause-giant-ai-experiments/', 'https://futureoflife.org/cause-area/artificial-intelligence/', 'https://futureoflife.org/open-letter/ai-principles/', 'https://openai.com/blog/planning-for-agi-and-beyond', 'https://futureoflife.org/ai/faqs-about-flis-open-letter-calling-for-a-pause-on-giant-ai-experiments/', 'https://arxiv.org/abs/2209.10604', 'https://arxiv.org/abs/2206.13353', 'https://arxiv.org/abs/2303.10130', 'https://arxiv.org/abs/2206.05862', 'https://arxiv.org/abs/2209.00626', 'https://arxiv.org/abs/2112.04359', 'https://abcnews.go.com/Technology/openai-ceo-sam-altman-ai-reshape-society-acknowledges/story?id=97897122', 'https://time.com/6246119/demis-hassabis-deepmind-interview/', 'https://arxiv.org/abs/2303.12712', 'https://www.bbc.com/worklife/article/20230418-ai-anxiety-artificial-intelligence-replace-jobs#:~:text=In%20March%2C%20Goldman%20Sachs%20published,by%20technology%20in%20three%20years.', 'https://www.pwc.com/gx/en/issues/workforc

In [30]:
import pandas as pd

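# Build a one-column DataFrame from the filtered links
final_links = final_list_of_links
data = {'links': final_links}  # named 'data' so the built-in dict (used earlier in dict.fromkeys) is not shadowed

df = pd.DataFrame(data)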
df
#df.to_csv('Crawl_final_links.csv')
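
Since the goal is to keep track of what was found in each link later on, it may also help to record which seed each link came from. The sketch below is one way to do that with the standard `csv` module; the function name `links_to_csv`, the `seed` and `link` column names, and the sample data layout (a dictionary from seed URL to its filtered links) are assumptions, not something the original notebook defines.

```python
import csv
import io

def links_to_csv(links_by_seed, handle):
    """Write one (seed, link) row per unique link, keeping first-seen order.

    links_by_seed maps a seed URL to the list of links kept for that seed.
    Hypothetical helper for illustration; not part of the original notebook.
    """
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["seed", "link"])  # header row
    seen = set()
    for seed, links in links_by_seed.items():
        for link in links:
            if link in seen:  # a link is attributed to the first seed that produced it
                continue
            seen.add(link)
            writer.writerow([seed, link])

# Write to an in-memory buffer here; a real file handle works the same way
buffer = io.StringIO()
links_to_csv(
    {
        "https://futureoflife.org/open-letter/pause-giant-ai-experiments/": [
            "https://arxiv.org/abs/2209.10604",
            "https://arxiv.org/abs/2209.10604",
        ],
    },
    buffer,
)
csv_text = buffer.getvalue()
```

To write to disk instead, the handle could come from `open('Crawl_final_links.csv', 'w', newline='')`.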