## How are people feeling about AI and its impact in the workplace?
### Accessing the content of the links collected during web crawling and cleaning the data
<hr>
After crawling relevant websites, I built a CSV file listing the articles and PDFs I would like to analyse. In this stage, I tokenize the content of those articles and remove irrelevant words. The cleaning process leaves the meaningful words that will feed the sentiment analysis stage.


In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from string import punctuation
import re 
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Diana\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Diana\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [2]:
dataFrame = pd.read_csv("Crawl_final_links.csv")
dataFrame.drop('Unnamed: 0', inplace=True, axis =1)
dataFrame

Unnamed: 0,links
0,https://futureoflife.org/cause-area/artificial...
1,https://futureoflife.org/open-letter/ai-princi...
2,https://openai.com/blog/planning-for-agi-and-b...
3,https://futureoflife.org/ai/faqs-about-flis-op...
4,https://arxiv.org/abs/2209.10604
5,https://arxiv.org/abs/2206.13353
6,https://arxiv.org/abs/2303.10130
7,https://arxiv.org/abs/2206.05862
8,https://arxiv.org/abs/2209.00626
9,https://arxiv.org/abs/2112.04359


In [3]:
##Accessing the HTML content of the link
# individual_url = dataFrame['links'][11]
# individual_url
for i in dataFrame['links']:
    print(i)

https://futureoflife.org/cause-area/artificial-intelligence/
https://futureoflife.org/open-letter/ai-principles/
https://openai.com/blog/planning-for-agi-and-beyond
https://futureoflife.org/ai/faqs-about-flis-open-letter-calling-for-a-pause-on-giant-ai-experiments/
https://arxiv.org/abs/2209.10604
https://arxiv.org/abs/2206.13353
https://arxiv.org/abs/2303.10130
https://arxiv.org/abs/2206.05862
https://arxiv.org/abs/2209.00626
https://arxiv.org/abs/2112.04359
https://abcnews.go.com/Technology/openai-ceo-sam-altman-ai-reshape-society-acknowledges/story?id=97897122
https://time.com/6246119/demis-hassabis-deepmind-interview/
https://arxiv.org/abs/2303.12712
https://futureoflife.org/cause-area/artificial-intelligence/
https://www.pwc.com/gx/en/issues/workforce/hopes-and-fears-2022.html
https://www.technologyreview.com/2023/02/08/1068068/chatgpt-is-everywhere-heres-where-it-came-from/
https://www.nytimes.com/2023/02/03/technology/chatgpt-openai-artificial-intelligence.html
https://journals.

In [4]:
#Get an individual link and convert it into a soup object

def tokenized(url):
    #Fetch the page for the URL passed in and parse its HTML
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    readable_text = soup.get_text()
    
    #Tokenize the lower-cased text
    tokenized_text = word_tokenize(readable_text.lower())
    return tokenized_text
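
Some tokens scraped this way can come out garbled (for example `'dallâ·e'` in the OpenAI output further down). One plausible cause, stated here as an assumption rather than a diagnosis, is that the response bytes are UTF-8 but get decoded as Latin-1, which `requests` can fall back to when a `text/*` response declares no charset. The sketch below reproduces the effect using only the standard library; the sample string is illustrative.

```python
# Bytes as a server might send them: UTF-8 encoded text.
raw_bytes = "DALL·E".encode("utf-8")

# Decoding with the wrong codec produces mojibake.
garbled = raw_bytes.decode("latin-1")

# Decoding with the right codec recovers the original text.
correct = raw_bytes.decode("utf-8")

print(garbled.lower())   # garbled form, as seen in the scraped tokens
print(correct.lower())   # intended form
```

If this is indeed the cause, two options with `requests` are to set `page.encoding = page.apparent_encoding` before reading `page.text`, or to pass `page.content` (raw bytes) to `BeautifulSoup` and let it detect the encoding.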

In [5]:
#Removing stopwords and punctuation
stop_words = set(stopwords.words('english'))
punctuation = list(punctuation)

##extra punctuation that I have found within the texts
extra_punctuation = ['...', '©', '’', '“', '–', '”']
#extend once, outside clean(), so repeated calls do not keep growing the list
punctuation.extend(extra_punctuation)

#pattern matching digits, used to strip numbers from tokens
digit_pattern = re.compile(r'[0-9]')

def clean(tokenized_text):
    #keep tokens that are neither stopwords nor punctuation, with their digits removed
    filtered_text = []

    for word in tokenized_text:
        if word not in stop_words and word not in punctuation:
            nword = digit_pattern.sub('', word)
            filtered_text.append(nword)
    return filtered_text

#print(clean(tokenized(i)))

In [6]:
#Convert punctuation that is within strings to a comma, so that I can further divide the words

def second_clean(filtered_text):
    cleaned_text = []

    for word in filtered_text:
        for char in word:
            if char in punctuation:
                word = word.replace(char, ",")
        cleaned_text.append(word)
    return cleaned_text    

#print(second_clean(clean(tokenized(i))))


In [13]:
#Remove empty strings, and words that are not long enough to be relevant

def final_clean(cleaned_text): 
    cleaned_text_v2 = []

    for word in cleaned_text:
        if ',' in word:
            cleaned_text_v2.extend(word.split(','))
        else:
            cleaned_text_v2.append(word)
            
    #removing empty strings
    cleaned_text_v2 = list(filter(None, cleaned_text_v2))
    
    ##Keeping only words that are between 2 and 11 characters long
    final_text = list(filter(lambda word: len(word) > 1 and len(word)<12, cleaned_text_v2))
    return final_text

# print(final_clean(second_clean(clean(tokenized(i)))))    
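
To make the combined effect of the three cleaning steps easier to see, here is a minimal single-pass sketch on a made-up sentence. It is an alternative formulation, not the pipeline above: it splits on anything that is not a letter instead of replacing punctuation with commas, so results can differ slightly (for instance, a token such as `gpt4` becomes `gpt` in both, but `a1b` would be split in two here). `clean_tokens` and `SAMPLE_STOP_WORDS` are hypothetical names, and the tiny stop-word set stands in for NLTK's English list.

```python
import re

# Tiny stand-in stop-word list (assumption); the notebook uses NLTK's English list.
SAMPLE_STOP_WORDS = {"the", "is", "a", "of", "and", "to", "in"}

def clean_tokens(text, stop_words=SAMPLE_STOP_WORDS, min_len=2, max_len=11):
    """Lower-case the text, split it into runs of letters, then filter."""
    # [^\W\d_]+ matches runs of letters only (no digits, no underscore)
    words = re.findall(r"[^\W\d_]+", text.lower())
    # Drop stop words and words outside the 2 to 11 character range
    return [w for w in words if w not in stop_words and min_len <= len(w) <= max_len]

sample = "The self-driving car of 2023 is changing lives, and AI-risk is real."
sample_tokens = clean_tokens(sample)
print(sample_tokens)
# → ['self', 'driving', 'car', 'changing', 'lives', 'ai', 'risk', 'real']
```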
    

## Looping through all the links in my dataFrame and cleaning the text they contain
<hr>
After creating the functions that clean the content of the articles, I used them to loop through all the links in my initial dataFrame. This process extracts meaningful words from the text content of each article's HTML. <br>
<strong>What is the best format to save this file?</strong>

In [14]:
for individual_url in dataFrame['links']:
    # dataFrame['Tokens'] = pd.array(final_clean(second_clean(clean(tokenized(individual_url)))))
    print(individual_url, final_clean(second_clean(clean(tokenized(individual_url)))))

https://futureoflife.org/cause-area/artificial-intelligence/ ['future', 'life', 'institute', 'skip', 'content', 'mission', 'cause', 'area', 'change', 'workour', 'work', 'risks', 'ai', 'nuclear', 'life', 'nist', 'ai', 'risk', 'european', 'ai', 'lethal', 'life', 'institute', 'podcasts', 'resources', 'usabout', 'us', 'us', 'take', 'action', 'search', 'take', 'action', 'home', 'cause', 'self', 'driving', 'cars', 'ai', 'changing', 'lives', 'impact', 'magnifies', 'risks', 'self', 'driving', 'cars', 'racing', 'forward', 'today', 'narrow', 'ai', 'systems', 'perform', 'isolated', 'tasks', 'already', 'pose', 'major', 'risks', 'erosion', 'processes', 'financial', 'flash', 'crashes', 'arms', 'race', 'weapons', 'looking', 'ahead', 'many', 'pursuing', 'agi', 'general', 'ai', 'perform', 'well', 'better', 'humans', 'wide', 'range', 'cognitive', 'tasks', 'ai', 'systems', 'design', 'smarter', 'systems', 'may', 'hit', 'explosion', 'quickly', 'leaving', 'humanity', 'behind', 'could', 'eradicate', 'poverty

https://openai.com/blog/planning-for-agi-and-beyond ['planning', 'agi', 'beyond', 'submit', 'skip', 'main', 'dallâ·e', 'customer', 'mobile', 'closesite', 'submit', 'planning', 'agi', 'beyondour', 'mission', 'ensure', 'general', 'systems', 'generally', 'smarter', 'ofâ', 'humanity', 'justin', 'jay', 'wang', 'ã\x97', 'mission', 'ensure', 'general', 'systems', 'generally', 'smarter', 'ofâ', 'humanity', 'if', 'agi', 'created', 'could', 'help', 'us', 'elevate', 'humanity', 'abundance', 'global', 'economy', 'aiding', 'discovery', 'new', 'knowledge', 'changes', 'limits', 'ofâ', 'agi', 'potential', 'give', 'everyone', 'new', 'imagine', 'world', 'us', 'access', 'help', 'almost', 'cognitive', 'task', 'providing', 'great', 'force', 'human', 'ingenuity', 'andâ', 'on', 'hand', 'agi', 'would', 'also', 'come', 'serious', 'risk', 'misuse', 'drastic', 'accidents', 'societal', 'upside', 'agi', 'great', 'believe', 'possible', 'desirable', 'society', 'stop', 'forever', 'instead', 'society', 'agi', 'figure'

https://futureoflife.org/ai/faqs-about-flis-open-letter-calling-for-a-pause-on-giant-ai-experiments/ ['faqs', 'fli', 'open', 'letter', 'calling', 'pause', 'giant', 'ai', 'future', 'life', 'institute', 'skip', 'content', 'mission', 'cause', 'area', 'change', 'workour', 'work', 'risks', 'ai', 'nuclear', 'life', 'nist', 'ai', 'risk', 'european', 'ai', 'lethal', 'life', 'institute', 'podcasts', 'resources', 'usabout', 'us', 'us', 'take', 'action', 'search', 'take', 'action', 'home', 'faqs', 'fli', 'open', 'letter', 'calling', 'pause', 'giant', 'ai', 'post', 'minute', 'read', 'faqs', 'fli', 'open', 'letter', 'calling', 'pause', 'giant', 'ai', 'please', 'note', 'faq', 'prepared', 'fli', 'reflect', 'views', 'letter', 'live', 'document', 'continue', 'updated', 'response', 'questions', 'media', 'elsewhere', 'faqs', 'reference', 'pause', 'giant', 'ai', 'open', 'letter', 'letter', 'call', 'calling', 'pause', 'training', 'models', 'larger', 'gpt', 'months', 'imply', 'pause', 'ban', 'ai', 'research

https://arxiv.org/abs/2206.13353 ['power', 'seeking', 'ai', 'risk', 'skip', 'main', 'content', 'support', 'fromthe', 'simons', 'member', 'cs', 'arxiv', 'help', 'advanced', 'search', 'fields', 'title', 'author', 'abstract', 'comments', 'journal', 'reference', 'acm', 'msc', 'report', 'number', 'arxiv', 'doi', 'orcid', 'arxiv', 'author', 'id', 'help', 'pages', 'full', 'text', 'search', 'open', 'search', 'go', 'open', 'menu', 'quick', 'links', 'login', 'help', 'pages', 'computer', 'science', 'computers', 'society', 'arxiv', 'cs', 'submitted', 'jun', 'title', 'power', 'seeking', 'ai', 'risk', 'authors', 'joseph', 'carlsmith', 'download', 'pdf', 'paper', 'titled', 'power', 'seeking', 'ai', 'risk', 'joseph', 'carlsmith', 'download', 'pdf', 'abstract', 'report', 'examines', 'see', 'core', 'argument', 'concern', 'risk', 'proceed', 'two', 'stages', 'first', 'lay', 'backdrop', 'picture', 'informs', 'concern', 'picture', 'agency', 'extremely', 'powerful', 'force', 'creating', 'agents', 'much', 'us

https://arxiv.org/abs/2206.05862 ['risk', 'analysis', 'ai', 'research', 'skip', 'main', 'content', 'support', 'fromthe', 'simons', 'member', 'cs', 'arxiv', 'help', 'advanced', 'search', 'fields', 'title', 'author', 'abstract', 'comments', 'journal', 'reference', 'acm', 'msc', 'report', 'number', 'arxiv', 'doi', 'orcid', 'arxiv', 'author', 'id', 'help', 'pages', 'full', 'text', 'search', 'open', 'search', 'go', 'open', 'menu', 'quick', 'links', 'login', 'help', 'pages', 'computer', 'science', 'computers', 'society', 'arxiv', 'cs', 'submitted', 'jun', 'last', 'revised', 'sep', 'version', 'title', 'risk', 'analysis', 'ai', 'research', 'authors', 'dan', 'hendrycks', 'mantas', 'mazeika', 'download', 'pdf', 'paper', 'titled', 'risk', 'analysis', 'ai', 'research', 'dan', 'hendrycks', 'authors', 'download', 'pdf', 'abstract', 'ai', 'potential', 'greatly', 'improve', 'society', 'powerful', 'comes', 'risks', 'current', 'ai', 'research', 'lacks', 'manage', 'long', 'tail', 'risks', 'ai', 'systems'

https://abcnews.go.com/Technology/openai-ceo-sam-altman-ai-reshape-society-acknowledges/story?id=97897122 ['openai', 'ceo', 'sam', 'altman', 'says', 'ai', 'reshape', 'society', 'risks', 'little', 'bit', 'scared', 'abc', 'news', 'abc', 'addedwe', 'll', 'notify', 'news', 'aboutturn', 'desktop', 'breaking', 'stories', 'interest', 'onopenai', 'ceo', 'sam', 'altman', 'says', 'ai', 'reshape', 'society', 'risks', 'little', 'bit', 'scared', 'greatest', 'humanity', 'yet', 'developed', 'said', 'byvictor', 'ordonez', 'taylor', 'dunn', 'eric', 'nollmarch', 'pm', 'openai', 'ceo', 'sam', 'altman', 'speaks', 'abc', 'news', 'mar', 'abc', 'newsthe', 'ceo', 'behind', 'company', 'created', 'chatgpt', 'believes', 'reshape', 'society', 'know', 'believes', 'comes', 'real', 'dangers', 'also', 'greatest', 'humanity', 'yet', 'developed', 'improve', 'lives', 've', 'got', 'careful', 'said', 'sam', 'altman', 'ceo', 'openai', 'think', 'people', 'happy', 'little', 'bit', 'scared', 'altman', 'sat', 'exclusive', 'int



https://arxiv.org/abs/2303.12712 ['sparks', 'general', 'early', 'gpt', 'skip', 'main', 'content', 'support', 'fromthe', 'simons', 'member', 'cs', 'arxiv', 'help', 'advanced', 'search', 'fields', 'title', 'author', 'abstract', 'comments', 'journal', 'reference', 'acm', 'msc', 'report', 'number', 'arxiv', 'doi', 'orcid', 'arxiv', 'author', 'id', 'help', 'pages', 'full', 'text', 'search', 'open', 'search', 'go', 'open', 'menu', 'quick', 'links', 'login', 'help', 'pages', 'computer', 'science', 'language', 'arxiv', 'cs', 'submitted', 'mar', 'last', 'revised', 'apr', 'version', 'title', 'sparks', 'general', 'early', 'gpt', 'authors', 'sébastien', 'bubeck', 'varun', 'ronen', 'eldan', 'johannes', 'gehrke', 'eric', 'horvitz', 'ece', 'kamar', 'peter', 'lee', 'yin', 'tat', 'lee', 'yuanzhi', 'li', 'scott', 'lundberg', 'harsha', 'nori', 'hamid', 'palangi', 'marco', 'tulio', 'ribeiro', 'yi', 'zhang', 'download', 'pdf', 'paper', 'titled', 'sparks', 'general', 'early', 'gpt', 'ebastien', 'bubeck', 'a

https://www.technologyreview.com/2023/02/08/1068068/chatgpt-is-everywhere-heres-where-it-came-from/ ['chatgpt', 'came', 'mit', 'review', 'need', 'enable', 'view', 'site', 'skip', 'review', 'came', 'breakout', 'hit', 'overnight', 'built', 'decades', 'research', 'douglas', 'stephanie', 'arnett', 'mittr', 'tech', 'review', 'explains', 'let', 'writers', 'untangle', 'complex', 'messy', 'world', 'help', 'coming', 'next', 'read', 'reached', 'peak', 'chatgpt', 'released', 'end', 'november', 'web', 'app', 'san', 'francisco', 'based', 'firm', 'openai', 'chatbot', 'exploded', 'almost', 'overnight', 'according', 'estimates', 'fastest', 'growing', 'internet', 'service', 'ever', 'reaching', 'million', 'users', 'january', 'two', 'months', 'launch', 'openai', 'billion', 'deal', 'microsoft', 'tech', 'built', 'office', 'software', 'bing', 'search', 'engine', 'stung', 'action', 'newly', 'awakened', 'onetime', 'rival', 'battle', 'search', 'google', 'fast', 'tracking', 'rollout', 'chatbot', 'lamda', 'even'

https://www.nytimes.com/2023/02/03/technology/chatgpt-openai-artificial-intelligence.html ['nytimes', 'complease', 'enable', 'js', 'disable', 'ad', 'blocker']
https://journals.sagepub.com/doi/full/10.1177/23780231221131377 ['moment', 'enable', 'cookies', 'continue']
https://news.byu.edu/intellect/robots-are-taking-over-jobs-but-not-at-the-rate-you-might-think-says-byu-research ['robots', 'taking', 'jobs', 'rate', 'might', 'think', 'says', 'byu', 'research', 'byu', 'news', 'close', 'home', 'burger', 'menu', 'icon', 'skip', 'main', 'content', 'news', 'menu', 'search', 'faith', 'intellect', 'character', 'events', 'intellect', 'robots', 'taking', 'jobs', 'rate', 'might', 'think', 'says', 'byu', 'research', 'tyler', 'stahle', 'november', 'study', 'found', 'robots', 'replacing', 'humans', 'rate', 'people', 'think', 'people', 'prone', 'rate', 'robot', 'takeover', 'photo', 'jaren', 'wilkey', 'byu', 'photo', 'study', 'found', 'robots', 'replacing', 'humans', 'rate', 'people', 'think', 'people',
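
Two of the results above (nytimes.com and journals.sagepub.com) contain only a short blocking notice rather than article text. A possible way to flag such pages before the sentiment analysis is to set a minimum token count. The sketch below assumes a hypothetical helper `split_by_token_count` and an arbitrary threshold of 20 tokens; the third entry is an abbreviated stand-in for a full article.

```python
def split_by_token_count(tokens_by_url, min_tokens=20):
    """Separate pages with enough tokens from pages that look blocked or empty."""
    kept, skipped = {}, []
    for url, tokens in tokens_by_url.items():
        if len(tokens) >= min_tokens:
            kept[url] = tokens
        else:
            skipped.append(url)
    return kept, skipped

# The first two token lists are copied from the output above;
# the third is a shortened stand-in for a real article.
results = {
    "https://www.nytimes.com/2023/02/03/technology/chatgpt-openai-artificial-intelligence.html":
        ["nytimes", "complease", "enable", "js", "disable", "ad", "blocker"],
    "https://journals.sagepub.com/doi/full/10.1177/23780231221131377":
        ["moment", "enable", "cookies", "continue"],
    "https://news.byu.edu/intellect/robots-are-taking-over-jobs-but-not-at-the-rate-you-might-think-says-byu-research":
        ["robots", "taking", "jobs"] * 10,
}

kept, skipped = split_by_token_count(results)
print(len(kept), "pages kept")
print("Skipped:", skipped)
```

The threshold is a judgment call; too high a value would also drop genuinely short pages.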

In [31]:
# dataFrame['clean'] = dataFrame.apply(lambda row: final_clean(row['links']), axis =1)
#Apply the full cleaning pipeline to the link in each row
dataFrame['clean'] = dataFrame.apply(lambda row: final_clean(second_clean(clean(tokenized(row['links'])))), axis=1)
# dataFrame

In [18]:
#dataFrame.to_csv('Cleaned_data.csv')
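
On the question of which format to save in: a CSV stores each token list as a single string, so the lists have to be parsed again after loading. One option that keeps them as lists is JSON Lines (one JSON object per line). The sketch below uses only the standard library; the two records, the file name `cleaned_data.jsonl`, and the temporary directory are illustrative assumptions.

```python
import json
import os
import tempfile

# Stand-in for the cleaned results: one token list per link (first tokens only).
records = [
    {"link": "https://arxiv.org/abs/2206.13353", "tokens": ["power", "seeking", "ai", "risk"]},
    {"link": "https://arxiv.org/abs/2206.05862", "tokens": ["risk", "analysis", "ai", "research"]},
]

# Hypothetical output path; any writable location would do.
out_path = os.path.join(tempfile.gettempdir(), "cleaned_data.jsonl")

# Write one JSON object per line.
with open(out_path, "w", encoding="utf-8") as fh:
    for record in records:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")

# Read the file back; the token lists come back as Python lists.
with open(out_path, encoding="utf-8") as fh:
    restored = [json.loads(line) for line in fh]

print(restored == records)
```

With pandas, `dataFrame.to_json(path, orient="records", lines=True)` and `pd.read_json(path, lines=True)` should give a comparable round trip directly from the dataFrame.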