# steps


### Part 3
**[TextBlob](https://textblob.readthedocs.io/en/dev/quickstart.html)** returns the polarity and subjectivity of a text. Polarity lies in the range [-1, 1], where -1 is the most negative sentiment and 1 the most positive. Subjectivity lies in the range [0, 1] and measures how much of the text is personal opinion rather than factual information: the higher the subjectivity, the more opinionated the text.
1. Install textblob for sentiment analysis and wordcloud for word clouds (`pip install textblob wordcloud`), and download the VADER lexicon (`nltk.download('vader_lexicon')`)
2. Find the polarity and subjectivity of each text (Hint: `TextBlob(text).sentiment`)
3. Is there a correlation between negativity and recession years?
4. Create a word cloud for the cleaned up speeches of both Trump and Obama. What can be learned from the word clouds?
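
For question 3, one simple check is the Pearson correlation between each year's polarity and that year's GDP change. The sketch below is only an illustration under stated assumptions: the GDP changes are the `change-chained` values for 1930-1934 shown in the `df_gdp.head()` output in Part 2, the polarity values are made-up placeholders, and `pearson` is a hypothetical helper, not part of TextBlob.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# change-chained values for 1930-1934, copied from the df_gdp.head() output
gdp_change = {"1930": -6.4, "1931": -12.9, "1932": -1.3, "1933": 10.8, "1934": 8.9}

# PLACEHOLDER polarities, invented for illustration only. In the exercise they
# would come from TextBlob(text).sentiment.polarity for each year's speech.
polarity = {"1930": 0.05, "1931": 0.02, "1932": 0.08, "1933": 0.15, "1934": 0.12}

common_years = sorted(gdp_change)
r = pearson(
    [gdp_change[y] for y in common_years],
    [polarity[y] for y in common_years],
)
print(f"correlation between GDP change and polarity: {r:.2f}")
```

A value near +1 would suggest that speeches are more positive in growth years and more negative in recession years. With real data, a yes/no recession flag (GDP change below zero) could be used in place of the raw change.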

### Part 1
1. Get data from: "https://en.wikisource.org/wiki/Portal:State_of_the_Union_Speeches_by_United_States_Presidents"
2. Using BeautifulSoup, get all the speeches from 1900-2022
3. Load all speech urls into a dictionary with year as key
4. Loop through dictionary and save content of each speech in [year].txt files
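
Steps 2 and 3 boil down to pulling a year out of each link's text and mapping it to an absolute URL. The following is a minimal sketch using only the standard library. The `items` list is a hypothetical stand-in for the parsed `<li>` elements, and the exact link text on the portal page is an assumption.

```python
import re
from urllib.parse import urljoin

BASE = "https://en.wikisource.org"
# Non-capturing group, so match.group() is the full four-digit year
year_reg = re.compile(r"\b(?:19|20)\d{2}\b")

# Hypothetical (link text, href) pairs standing in for the parsed list items
items = [
    ("William McKinley's Fourth State of the Union Address (1900)",
     "/wiki/William_McKinley%27s_Fourth_State_of_the_Union_Address"),
    ("Theodore Roosevelt's First State of the Union Address (1901)",
     "/wiki/Theodore_Roosevelt%27s_First_State_of_the_Union_Address"),
    ("An entry without a year", "/wiki/Some_Other_Page"),
]

year_url_dict = {}
for text, href in items:
    match = year_reg.search(text)
    if match is None:
        # Skip list items that carry no year from 1900-2099
        continue
    year_url_dict[match.group()] = urljoin(BASE, href)

print(year_url_dict)
```

Checking for `None` before calling `.group()` keeps the loop from crashing on a list item that has no year in its text.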

In [179]:
import requests as req
import bs4
import re

In [180]:
url = "https://en.wikisource.org/wiki/Portal:State_of_the_Union_Speeches_by_United_States_Presidents"
html = req.get(url)
soup = bs4.BeautifulSoup(html.text, 'html.parser')

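# Match a four-digit year starting with 19 or 20 in the link text
year_reg = re.compile(r"\b(19|20)\d{2}\b")
# Hard-coded slice of the page's list items that covers the speeches from 1900 onward.
# It depends on the current page layout and needs rechecking if the portal changes.
events = soup.select('li')[144:269]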

years = []
urls = []

for e in events:
    urls.append('https://en.wikisource.org'+e.a['href'])
    years.append(year_reg.search(e.text).group())

In [181]:
year_url_dict = dict(zip(years,urls))

In [182]:
year_url_dict

{'1900': 'https://en.wikisource.org/wiki/William_McKinley%27s_Fourth_State_of_the_Union_Address',
 '1901': 'https://en.wikisource.org/wiki/Theodore_Roosevelt%27s_First_State_of_the_Union_Address',
 '1902': 'https://en.wikisource.org/wiki/Theodore_Roosevelt%27s_Second_State_of_the_Union_Address',
 '1903': 'https://en.wikisource.org/wiki/Theodore_Roosevelt%27s_Third_State_of_the_Union_Address',
 '1904': 'https://en.wikisource.org/wiki/Theodore_Roosevelt%27s_Fourth_State_of_the_Union_Address',
 '1905': 'https://en.wikisource.org/wiki/Theodore_Roosevelt%27s_Fifth_State_of_the_Union_Address',
 '1906': 'https://en.wikisource.org/wiki/Theodore_Roosevelt%27s_Sixth_State_of_the_Union_Address',
 '1907': 'https://en.wikisource.org/wiki/Theodore_Roosevelt%27s_Seventh_State_of_the_Union_Address',
 '1908': 'https://en.wikisource.org/wiki/Theodore_Roosevelt%27s_Eighth_State_of_the_Union_Address',
 '1909': 'https://en.wikisource.org/wiki/William_Howard_Taft%27s_First_State_of_the_Union_Address',
 '191

In [121]:
for year in year_url_dict:
    filename = year+'.txt'
    url = year_url_dict[year]
    html = req.get(url)
    soup = bs4.BeautifulSoup(html.text, 'html.parser')
    events = soup.select("div > p")[1:-1]
    
    with open('data/'+filename,'w') as output:
        for e in events:
            output.write(e.text)
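
The loop above assumes that a `data/` folder already exists, because `open()` raises an error otherwise. As a hedged alternative, the sketch below creates the folder first and writes with an explicit encoding. `save_speeches` is a hypothetical helper name, and the sample text is a placeholder rather than a real speech.

```python
import tempfile
from pathlib import Path


def save_speeches(year_text, out_dir="data"):
    """Write each speech to <out_dir>/<year>.txt and return the paths written."""
    out_path = Path(out_dir)
    # Create the folder (and any parents) if it is missing
    out_path.mkdir(parents=True, exist_ok=True)
    written = []
    for year, text in year_text.items():
        target = out_path / f"{year}.txt"
        target.write_text(text, encoding="utf-8")
        written.append(target)
    return written


# Usage with a temporary folder so nothing is left behind on disk
with tempfile.TemporaryDirectory() as tmp:
    paths = save_speeches({"1900": "placeholder speech text"}, out_dir=tmp)
    print([p.name for p in paths])
```

Files written this way should also be read back with `encoding="utf-8"`.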

### Part 2
1. Install nltk: `pip install nltk`
2. From the data/gdp.csv file create a dataframe with year and GDP
3. From the data/US presidents.csv file create a dataframe with year, president and party
4. From the text files created in Part 1, create a dictionary with year:speech
5. Clean the text by changing everything to lowercase and removing '\n'
6. Get words from the texts (`from nltk.tokenize import word_tokenize`). Clean the text by removing stop words (`from nltk.corpus import stopwords`) and all non-alphabetic characters (including , and .)
7. Use `WordNetLemmatizer` (`from nltk.stem import WordNetLemmatizer`) to lemmatize all texts

In [183]:
import pandas as pd

df_gdp = pd.read_csv('../../data/gdp.csv',sep=',')
df_presidents = pd.read_csv('../../data/US presidents.csv',sep=';')

In [184]:
df_gdp.head()

Unnamed: 0,date,level-current,level-chained,change-current,change-chained
0,1930,92.2,966.7,-16.0,-6.4
1,1931,77.4,904.8,-23.1,-12.9
2,1932,59.5,788.2,-4.0,-1.3
3,1933,57.2,778.3,16.9,10.8
4,1934,66.8,862.2,11.1,8.9


In [185]:
df_presidents.head()

Unnamed: 0,Years (after inauguration),President,Party
0,1900,William McKinley,Republican
1,1901,Theodore Roosevelt,Republican
2,1902,Theodore Roosevelt,Republican
3,1903,Theodore Roosevelt,Republican
4,1904,Theodore Roosevelt,Republican


In [186]:
year_speech_dict = {}


for year in years:
    with open('data/'+year+'.txt') as speech_file:
        year_speech_dict[year] = speech_file.read()
        

In [187]:
for year in year_speech_dict:
    year_speech_dict[year] = re.sub("\n","",year_speech_dict[year].lower())
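
One thing to watch with this step: replacing `"\n"` with an empty string joins the last word of one line to the first word of the next. The sketch below compares that with collapsing whitespace into a single space. The sample string is a made-up placeholder, and the second variant is an alternative, not what the cell above does.

```python
import re

raw = "To the Senate\nand House of Representatives"

# Same approach as the cell above: newlines are dropped entirely
fused = re.sub("\n", "", raw.lower())

# Alternative: collapse any run of whitespace into one space
spaced = re.sub(r"\s+", " ", raw.lower()).strip()

print(fused)   # to the senateand house of representatives
print(spaced)  # to the senate and house of representatives
```

Whether this matters depends on how the paragraphs were written to the text files. If words are fused, they turn into tokens such as `senateand`, which stop-word removal will not catch.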

In [190]:
import nltk
from nltk.tokenize import word_tokenize

tokens = []

for year in year_speech_dict:
    # Tokenize each speech and collect all words in one flat list
    t = word_tokenize(year_speech_dict[year])
    tokens.extend(t)

tokens[:10]

['to',
 'the',
 'senate',
 'and',
 'house',
 'of',
 'representatives',
 ':',
 'at',
 'the']

In [191]:
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
tokens = [w for w in tokens if w not in stop_words]
print(tokens[0:10])

['senate', 'house', 'representatives', ':', 'outgoing', 'old', 'incoming', 'new', 'century', 'begin']


In [194]:
tokens_cleaned = []

for token in tokens:
    if token.isalpha():
        tokens_cleaned.append(token)

tokens_cleaned[0:10]

['senate',
 'house',
 'representatives',
 'outgoing',
 'old',
 'incoming',
 'new',
 'century',
 'begin',
 'last']

[nltk_data] Downloading package omw-1.4 to /home/jovyan/nltk_data...


True

In [207]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

# Lemmatize every token, treating it as a verb (pos='v')
lemmatized_words=[lemmatizer.lemmatize(word=word,pos='v') for word in tokens_cleaned]
lemmatizeddf= pd.DataFrame({'original_word': tokens_cleaned,'lemmatized_word': lemmatized_words})
lemmatizeddf=lemmatizeddf[['original_word','lemmatized_word']]
lemmatizeddf[:50]

Unnamed: 0,original_word,lemmatized_word
0,senate,senate
1,house,house
2,representatives,representatives
3,outgoing,outgo
4,old,old
5,incoming,incoming
6,new,new
7,century,century
8,begin,begin
9,last,last
