## How are people feeling about AI and its impact in the workplace?
### Web crawling from pre-defined seeds
<hr>
I have selected some web links that discuss AI and its potential impact on the workplace. These links will serve as 'seeds' from which I can find other hyperlinks that might contribute to my search for a general 'sentiment' around AI.

Links to visit: 
- https://futureoflife.org/open-letter/pause-giant-ai-experiments/
- https://www.bbc.com/worklife/article/20230418-ai-anxiety-artificial-intelligence-replace-jobs#:~:text=In%20March%2C%20Goldman%20Sachs%20published,by%20technology%20in%20three%20years.


In [1]:
# Libraries for requesting HTML
from bs4 import BeautifulSoup
import requests
import random

# natural language toolkit
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Diana\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Diana\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [2]:
first_seed_url = "https://futureoflife.org/open-letter/pause-giant-ai-experiments/"
page = requests.get(first_seed_url)
soup = BeautifulSoup(page.content, features="html.parser")
#soup

<!DOCTYPE html>

<html lang="en-US">
<head>
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<link as="style" href="https://fonts.googleapis.com/css?family=Inter:100,200,300,400,500,600,700,800,900|Inter:100,200,300,400,500,600,700,800,900" rel="preload"/>
<link href="https://fonts.googleapis.com/css?family=Inter:100,200,300,400,500,600,700,800,900|Inter:100,200,300,400,500,600,700,800,900" rel="stylesheet"/>
<link as="style" href="https://use.typekit.net/srt8amd.css" rel="preload"/>
<link href="https://use.typekit.net/srt8amd.css" rel="stylesheet"/>
<meta content="index, follow, max-image-preview:large, max-snippet:-1, max-video-preview:-1" name="robots">
<title>Pause Giant AI Experiments: An Open Letter - Future of Life Institute</title>
<link href="https://futureoflife.org/open-letter/pause-giant-ai-experiments/" rel="canonical">
<meta content="en_US" property="og:locale">
<meta content="article" property="og:type"/>
<meta content="Paus

In [3]:
# Find the <a> tags in the HTML; their href attributes hold the new links
links = []

# href=True skips anchors without an href, so no try/except is needed
for a in soup.find_all("a", href=True):
    links.append(a["href"])

for link in links:
    print(link)

#main-content
https://futureoflife.org/
http://futureoflife.org/our-mission/
#dropdown
https://futureoflife.org/cause-areas/
https://futureoflife.org/cause-area/artificial-intelligence/
https://futureoflife.org/cause-area/biotechnology/
https://futureoflife.org/cause-area/nuclear/
https://futureoflife.org/cause-area/climate/
#dropdown
https://futureoflife.org/our-work/
https://futureoflife.org/our-work/policy-work/
https://futureoflife.org/our-work/outreach-work/
https://futureoflife.org/our-work/grantmaking-work/
https://futureoflife.org/our-work/events-work/
https://futureoflife.org/project/mitigating-the-risks-of-ai-integration-in-nuclear-launch/
https://futureoflife.org/fli-area/policy/
https://futureoflife.org/project/future-of-life-award/
https://futureoflife.org/fli-area/outreach/
https://futureoflife.org/project/nist-framework/
https://futureoflife.org/fli-area/policy/
https://futureoflife.org/project/eu-ai-act/
https://futureoflife.org/fli-area/policy/
https://futureoflife.org

### Filtering the links to articles and other relevant pages
<hr>
Something I found very interesting is that the 'Future of Life' website has a lot of links to papers and research on AI. I have extracted the links that lead to those papers, and I will try to tokenize each PDF (if it is accessible).

In [4]:
filtered_first_seed = [link for link in links if link.startswith((
    'https://futureoflife.org/cause-area/artificial-intelligence/', 
    'https://abcnews.go.com/',
    'https://arxiv.org/abs/',
    'https://openai.com/',
    'https://futureoflife.org/ai/',
    'https://futureoflife.org/open-letter/ai-principles/',
    'https://time.com/'))]

for link in filtered_first_seed:
    print(link)


https://futureoflife.org/cause-area/artificial-intelligence/
https://futureoflife.org/open-letter/ai-principles/
https://openai.com/blog/planning-for-agi-and-beyond
https://futureoflife.org/ai/faqs-about-flis-open-letter-calling-for-a-pause-on-giant-ai-experiments/
https://arxiv.org/abs/2209.10604
https://arxiv.org/abs/2206.13353
https://arxiv.org/abs/2303.10130
https://arxiv.org/abs/2206.05862
https://arxiv.org/abs/2209.00626
https://arxiv.org/abs/2112.04359
https://abcnews.go.com/Technology/openai-ceo-sam-altman-ai-reshape-society-acknowledges/story?id=97897122
https://time.com/6246119/demis-hassabis-deepmind-interview/
https://arxiv.org/abs/2303.12712
https://futureoflife.org/cause-area/artificial-intelligence/
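
Several of the filtered links point to arXiv abstract pages rather than to the PDFs themselves. As a minimal sketch, assuming arXiv's usual URL layout (an `/abs/<id>` page has a matching `/pdf/<id>` file), the abstract links could be mapped to PDF links before any tokenizing is attempted. The helper name `arxiv_pdf_url` and the `sample_links` list are hypothetical and only for illustration.

```python
def arxiv_pdf_url(abs_url):
    """Map an arXiv abstract URL to the matching PDF URL (assumed URL layout)."""
    prefix = "https://arxiv.org/abs/"
    if not abs_url.startswith(prefix):
        return None  # not an arXiv abstract link
    paper_id = abs_url[len(prefix):]
    return "https://arxiv.org/pdf/" + paper_id

# Two links taken from the filtered output above
sample_links = [
    "https://arxiv.org/abs/2209.10604",
    "https://time.com/6246119/demis-hassabis-deepmind-interview/",
]
pdf_links = [url for url in map(arxiv_pdf_url, sample_links) if url is not None]
print(pdf_links)
```

Whether a given PDF can actually be downloaded and tokenized still has to be checked link by link.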


### Creating functions that will allow me to extract HTML href elements

The previous process can be repeated with a second seed. The following functions wrap the request and link-extraction steps so that they can be reused.

In [5]:
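def soup_object(link):
    """Request a page and parse its HTML into a BeautifulSoup object."""
    page = requests.get(link)
    soup = BeautifulSoup(page.content, features='html.parser')
    return soup

def extract_links(soup):
    """Return the href value of every <a> tag that has one."""
    new_links = []
    # href=True skips anchors without an href attribute
    for a in soup.find_all("a", href=True):
        new_links.append(a["href"])

    return new_links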

In [7]:
# Finding 'sub-links' for each element of the filtered link list
sub_links = {}
for link in filtered_first_seed:
    soup = soup_object(link)
    # Store results per primary link so one iteration does not overwrite the last
    sub_links[link] = extract_links(soup)

    #print('This is a primary link', link)
    #print(sub_links[link])
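
The loop above follows links one level down from a seed. As a rough sketch of how the same idea generalizes, a breadth-first crawl can track which pages it has already visited and stop at a chosen depth. The function `crawl`, its `get_links` and `max_depth` parameters, and the `fake_web` graph are all hypothetical names introduced here; `fake_web` stands in for real pages so the example runs without network access.

```python
from collections import deque

def crawl(seeds, get_links, max_depth=1):
    """Breadth-first crawl from `seeds`, following links up to `max_depth` hops.

    `get_links` is any callable that maps a URL to a list of URLs.
    """
    visited = set()
    queue = deque()
    for seed in seeds:
        if seed not in visited:
            visited.add(seed)
            queue.append((seed, 0))

    order = []
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth >= max_depth:
            continue  # do not expand pages at the depth limit
        for next_url in get_links(url):
            if next_url not in visited:
                visited.add(next_url)
                queue.append((next_url, depth + 1))
    return order

# Hypothetical link graph standing in for real web pages
fake_web = {
    "seed": ["a", "b"],
    "a": ["b", "c"],
    "b": ["seed"],
    "c": [],
}
print(crawl(["seed"], lambda url: fake_web.get(url, []), max_depth=1))
```

In this notebook, `get_links` could be something like `lambda url: extract_links(soup_object(url))`, although a real crawl would also want filtering by prefix, as done for each seed here.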

In [18]:
# Using second seed to find more relevant links
second_seed = "https://www.bbc.com/worklife/article/20230418-ai-anxiety-artificial-intelligence-replace-jobs#:~:text=In%20March%2C%20Goldman%20Sachs%20published,by%20technology%20in%20three%20years."
soup2 = soup_object(second_seed)
links = extract_links(soup2)

filtered_second_seed = [link for link in links if link.startswith((
    'https://www.pwc.com/', 
    'https://www.technologyreview.com/',
    'https://www.nytimes.com/',
    'https://journals.sagepub.com/',
    'https://news.byu.edu/',
    'https://www.bbc.com/worklife/'))]

for i in filtered_second_seed:
    print(i)

https://www.pwc.com/gx/en/issues/workforce/hopes-and-fears-2022.html
https://www.technologyreview.com/2023/02/08/1068068/chatgpt-is-everywhere-heres-where-it-came-from/
https://www.nytimes.com/2023/02/03/technology/chatgpt-openai-artificial-intelligence.html
https://journals.sagepub.com/doi/full/10.1177/23780231221131377
https://news.byu.edu/intellect/robots-are-taking-over-jobs-but-not-at-the-rate-you-might-think-says-byu-research
https://www.bbc.com/worklife/work-in-progress


### Using a third seed

In [22]:
third_seed = "https://mitsloan.mit.edu/ideas-made-to-matter/report-finds-employees-embrace-ai-when-they-see-its-value"
soup3 = soup_object(third_seed)
links = extract_links(soup3)

# These links are relative, so a single path prefix is enough to filter them
filtered_third_seed = [link for link in links if link.startswith(
    '/ideas-made-to-matter/')]

for i in filtered_third_seed:
    print(i)

/ideas-made-to-matter/technology-expert-to-business-leader-evolution-cio
/ideas-made-to-matter/9-carbon-busting-startups-mit-sustainability-summit
/ideas-made-to-matter/how-to-use-competitive-insight
/ideas-made-to-matter/topics/artificial-intelligence
/ideas-made-to-matter/sara-brown
/ideas-made-to-matter/new-book-explores-how-ai-really-changes-way-we-work
/ideas-made-to-matter/3-requirements-successful-artificial-intelligence-programs
/ideas-made-to-matter/mit-sloan-research-artificial-intelligence-and-machine-learning
/ideas-made-to-matter/job-seekers-ai-boosted-resumes-more-likely-to-be-hired
/ideas-made-to-matter/new-podcast-explores-whether-data-can-solve-big-problems
/ideas-made-to-matter/making-most-ai-latest-lessons-mit-sloan-management-review


### Creating a CSV file that saves all the links that are relevant

After crawling the first and second seeds and finding links that might contain relevant vocabulary, I will save them into a CSV file. This might be useful at a later stage to keep track of the words I have found in each link.

In [23]:
final_list_of_links = filtered_first_seed + filtered_second_seed 
print(final_list_of_links)

['https://futureoflife.org/cause-area/artificial-intelligence/', 'https://futureoflife.org/open-letter/ai-principles/', 'https://openai.com/blog/planning-for-agi-and-beyond', 'https://futureoflife.org/ai/faqs-about-flis-open-letter-calling-for-a-pause-on-giant-ai-experiments/', 'https://arxiv.org/abs/2209.10604', 'https://arxiv.org/abs/2206.13353', 'https://arxiv.org/abs/2303.10130', 'https://arxiv.org/abs/2206.05862', 'https://arxiv.org/abs/2209.00626', 'https://arxiv.org/abs/2112.04359', 'https://abcnews.go.com/Technology/openai-ceo-sam-altman-ai-reshape-society-acknowledges/story?id=97897122', 'https://time.com/6246119/demis-hassabis-deepmind-interview/', 'https://arxiv.org/abs/2303.12712', 'https://futureoflife.org/cause-area/artificial-intelligence/', 'https://www.pwc.com/gx/en/issues/workforce/hopes-and-fears-2022.html', 'https://www.technologyreview.com/2023/02/08/1068068/chatgpt-is-everywhere-heres-where-it-came-from/', 'https://www.nytimes.com/2023/02/03/technology/chatgpt-ope

In [24]:
import pandas as pd

# List of filtered links
final_links = final_list_of_links
# Avoid naming the variable `dict`, which would shadow the built-in type
links_data = {'links': final_links}

df = pd.DataFrame(links_data)
df
#df.to_csv('Crawl_final_links.csv')
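
The combined list contains at least one repeated link (the Future of Life AI cause-area page appears twice in the first seed's output). As a sketch only, the links could be de-duplicated in first-seen order before saving. This version uses the standard-library `csv` module instead of pandas so that it has no dependencies; `write_links_csv` and the `sample` list are hypothetical.

```python
import csv
import io

def write_links_csv(links, handle):
    """Write unique links, in first-seen order, to an open text handle as CSV."""
    unique_links = list(dict.fromkeys(links))  # dicts preserve insertion order
    writer = csv.writer(handle)
    writer.writerow(["links"])  # header row
    for link in unique_links:
        writer.writerow([link])
    return unique_links

sample = [
    "https://futureoflife.org/cause-area/artificial-intelligence/",
    "https://arxiv.org/abs/2209.10604",
    "https://futureoflife.org/cause-area/artificial-intelligence/",
]
buffer = io.StringIO()  # in-memory stand-in for a real file
unique = write_links_csv(sample, buffer)
print(buffer.getvalue())
```

To write to disk instead, the handle could come from `open('Crawl_final_links.csv', 'w', newline='')`. With pandas, `df.drop_duplicates()` followed by `df.to_csv(...)` would be the closer equivalent to the cell above.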