# Creation of an American word lexicon/vocabulary

This notebook builds a lexicon of American words, which is then used to assess the "percentage" of American culture in each plot summary (see notebook *results_P3.ipynb*). The procedure is as follows. First, the text of the Wikipedia pages of several countries, including the United States, is retrieved. Then, the raw text of each page is tokenized with the NLP pipeline of the Python library *spaCy*. Finally, the lexicon of typical American words is created by taking the set difference between the tokens of the United States page and the tokens of all the other pages, i.e. by keeping only the words that appear on the US page and on none of the others.
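
The final step of this procedure can be illustrated on toy data. The sketch below is only an illustration: the token lists and the names `us_tokens` and `other_tokens_by_page` are hypothetical stand-ins for the lemmatized Wikipedia pages, not the notebook's actual data.

```python
# Hypothetical token lists standing in for the lemmatized Wikipedia pages.
us_tokens = ["texas", "president", "hamburger", "city", "texas"]
other_tokens_by_page = {
    "France": ["president", "city", "wine"],
    "Japan": ["city", "island"],
}

# Union of every token seen on at least one non-US page.
other_tokens = set().union(*other_tokens_by_page.values())

# Set difference: tokens seen on the US page and on no other page.
us_only = set(us_tokens) - other_tokens

print(sorted(us_only))
```

A word is dropped as soon as it occurs once on any other page, so adding more country pages can only shrink the resulting lexicon.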

In [1]:
# Needed libraries
import spacy # to implement NLP on the wikipedia page text
import wikipedia # to retrieve wikipedia page text content

In [None]:
# Initialize the Spacy analyzer in English since all the wikipedia pages are analysed in English
nlp = spacy.load("en_core_web_sm")

# Function to process a Wikipedia page
def process_page(page_name):
    page_content = wikipedia.page(page_name).content.replace('==', '').replace('\n', '')
    doc = nlp(page_content) # tokenizing each page
    # Keep the lowercase lemmas of alphabetic (.is_alpha), non-stopword tokens; lemmatizing merges close forms of the same word
    return [token.lemma_.lower() for token in doc if not token.is_stop and token.is_alpha]

# Process each page separately for verification and clarity purposes
us_words = process_page('United States') # https://en.wikipedia.org/wiki/United_States 
fr_words = process_page('France') # https://en.wikipedia.org/wiki/France 
uk_words = process_page('United Kingdom') # https://en.wikipedia.org/wiki/United_Kingdom 
de_words = process_page('Germany') # https://en.wikipedia.org/wiki/Germany 
#it_words = process_page('Italy') # https://en.wikipedia.org/wiki/Italy 
jp_words = process_page('Japan') # https://en.wikipedia.org/wiki/Japan 
ch_words = process_page('Switzerland') # https://en.wikipedia.org/wiki/Switzerland 
ir_words = process_page('Ireland') # https://en.wikipedia.org/wiki/Ireland --> since Ireland has a strong impact on American culture

# Combine the words from all non-US pages (FR, UK, DE, JP, CH, IE) and keep only the unique ones for faster processing
other_words = set(fr_words + uk_words + de_words + jp_words + ch_words + ir_words)

# Extract the words found only on the US page with a set difference
unique_us_words = set(us_words) - other_words

print(f"The list of unique US words is: \n {unique_us_words}")
print(f"Unique US words: {len(unique_us_words)}")


The list of unique US words is: 
 {'battery', 'boston', 'chef', 'marsden', 'trademark', 'franklin', 'zee', 'accessible', 'salvador', 'taco', 'ignite', 'stowe', 'hispanic', 'forgiveness', 'pew', 'detention', 'tornado', 'cheap', 'hamburger', 'correctional', 'articles', 'amassing', 'firmly', 'electrification', 'failing', 'corridor', 'cap', 'philippines', 'humanities', 'flour', 'herman', 'albany', 'federalist', 'jay', 'katharine', 'nevada', 'jim', 'naturalism', 'payload', 'plymouth', 'concentrated', 'sumter', 'morton', 'pollock', 'aggression', 'warhol', 'texas', 'reed', 'americans', 'proximity', 'incorporated', 'defendant', 'reiterate', 'unsheltered', 'sonoran', 'pitt', 'aeronautics', 'kentucky', 'vacation', 'trail', 'minstrel', 'sneaker', 'persistent', 'négritude', 'enroll', 'harlem', 'gazette', 'angell', 'enduring', 'outlet', 'temper', 'appomattox', 'diaspora', 'unorganized', 'transcendentalism', 'productivity', 'qaeda', 'analysis', 'redistribute', 'midwestern', 'skateboard', 'englishmen

In [10]:
print(type(unique_us_words))
if 'thanksgiving' in unique_us_words:
    print("yes")

<class 'set'>


### Verification of the results with a small list of very American words

One notices that *hollywood*, *thanksgiving*, *cowboy* and *halloween* are not in the *unique_us_words* set although they are "very" American words. This happens because other Wikipedia pages also contain these words. For example, the word *hollywood* appears on the Wikipedia page of France and *thanksgiving* on the page of Japan, which removes them from *unique_us_words*.
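
To see which page is responsible for removing a given word, one could look the word up in each page's token list. The snippet below is a minimal sketch under assumptions: `pages_containing` is a hypothetical helper, and the token lists are toy data; in the notebook they would be the real lists such as `fr_words` and `jp_words`.

```python
def pages_containing(word, tokens_by_page):
    """Return the sorted names of the pages whose token list contains `word`."""
    return sorted(name for name, tokens in tokens_by_page.items() if word in set(tokens))

# Hypothetical token lists; replace them with fr_words, jp_words, etc.
tokens_by_page = {
    "France": ["cinema", "hollywood", "paris"],
    "Japan": ["thanksgiving", "tokyo"],
    "Germany": ["berlin", "beer"],
}

for word in ["hollywood", "thanksgiving", "cowboy"]:
    print(word, "->", pages_containing(word, tokens_by_page))
```

An empty result for a missing word would mean that the word is absent from the US page itself rather than removed by another page.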

In [9]:
list_straightforward_American_words = ['hollywood', 'cowboy', 'thanksgiving', 'donut', 'broadway', 'sheriff', 'mcdonald', 'doughnut', 'hamburger', 'pentagon', 'halloween']

words_in_list = []
words_not_in_list = []

for american_word in list_straightforward_American_words:
    if american_word in unique_us_words:
        words_in_list.append(american_word)
    else:
        words_not_in_list.append(american_word)

print(f"The straightforward American words that are in the US-only Wikipedia lexicon are {words_in_list}")
print(f"The straightforward American words that are NOT in the US-only Wikipedia lexicon are {words_not_in_list}")

The straightforward American words that are in the US-only Wikipedia lexicon are ['donut', 'broadway', 'sheriff', 'mcdonald', 'doughnut', 'hamburger', 'pentagon']
The straightforward American words that are NOT in the US-only Wikipedia lexicon are ['hollywood', 'cowboy', 'thanksgiving', 'halloween']


### Saving the lexicon of US words for later use

In [None]:
import pickle

# Convert to a set (if not already)
unique_us_words_set = set(unique_us_words)

# Save the set
with open('data/unique_us_words_set.pkl', 'wb') as f:
    pickle.dump(unique_us_words_set, f)
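
For later use, the saved set can be read back with `pickle.load`. The round trip below is a minimal sketch: it writes to a temporary directory rather than to `data/`, and `lexicon` is a small hypothetical stand-in for `unique_us_words_set`.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical stand-in for unique_us_words_set.
lexicon = {"texas", "hamburger", "sheriff"}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "unique_us_words_set.pkl"

    # Save the set in binary mode, as in the cell above.
    with open(path, "wb") as f:
        pickle.dump(lexicon, f)

    # Reload it; pickle restores the original Python type (a set).
    with open(path, "rb") as f:
        reloaded = pickle.load(f)

print(reloaded == lexicon)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.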


In [7]:
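from nltk.corpus import cmudict

# Load the CMU Pronouncing Dictionary from NLTK (requires a prior nltk.download('cmudict'))
cmu_dict = cmudict.dict()
# The dictionary keys are the words themselves
american_words = set(cmu_dict.keys())
print(len(american_words))
print(american_words)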

123455


In [8]:
from nltk.corpus import wordnet as wn

# Collect every English lemma name in WordNet (requires a prior nltk.download('wordnet'))
american_words = set(lemma.name() for synset in wn.all_synsets() for lemma in synset.lemmas(lang='eng'))
print(len(american_words))


148730


In [None]:
# old chloé
# Extract United States 
import wikipedia
import spacy 
nlp = spacy.load('en_core_web_sm')

list_wiki_pages = ['France', 'United Kingdom', 'Germany'] # list of wikipedia pages used to extract the lexicon of american culture
text_FR_DE_UK = []

# Extract the text of the Wikipedia pages specified above
for page in list_wiki_pages:
    page = wikipedia.page(page) # access the page
    text = page.content # extract the raw text content, excluding images, tables, etc.
    text = text.replace('==', '')
    text_FR_DE_UK.append(text)

wiki_us = wikipedia.page('United States') # Keep the US wikipedia page on the side since it's the target 

# Extract the plain text content of the page, excluding images, tables, and other data.
text_us = wiki_us.content


# Replace '==' with '' (an empty string)
text_us = text_us.replace('==', '')

#print(text_us)
#print("Last 12 characters before slicing:", text_us.replace('\n', '')[-12:])

# Replace '\n' (a new line) with ''; the final [:-12] slice is currently commented out
text = text_us.replace('\n', '')#[:-12]

encoded_text = nlp(text)
print(encoded_text)


#encoded_text[1]


encoded_text[1].text

if any(token.text == "Hollywood" for token in encoded_text):
    print("bla is in the text!")

# Words that appear in USA summaries but not in foreign ones
american_vocab = [i.text for i in encoded_text]
print(american_vocab)
print(len(american_vocab))

no_dupli = list(set(american_vocab))
len(no_dupli)

bla is in the text!
