## Andy Dsida (ad3678)
### <font color = "blue">Final Assignment - Analyzing Immigration Sentiment of Trump vs. Other Presidents</font>

Many political pundits and news reporters have advanced the narrative that Donald Trump's discussion of immigration is a departure from traditional norms.  He has been labeled a racist and a xenophobe, and divisive terms like "build the wall", "detention centers" and "Muslim ban" have become synonymous with him.  Is this fair?  Or does it reflect a progressively harsher discourse on the topic, a justifiable posture given the current period of high immigration in the United States, or, as President Trump has often asserted, a media attack on him for bucking the system?

I analyzed the State of the Union address, the common denominator and most formal discourse of all United States presidents.  By comparing Trump's speech content to that of all previous presidents, of modern (post-WWII) presidents, of presidents who presided over periods of similarly high immigration and of the most recent president from his party, George W. Bush, I try to quantify the answer to some of these questions.

### <font color = "blue">Loading Environment</font>

In [1]:
# Getting the directory for the import set up
import os
#cwd = os.getcwd()
#print(cwd)
os.chdir("//users/adsida/OneDrive/QMSS/SOU")
os.getcwd()

'/Users/adsida/OneDrive/QMSS/SOU'

In [2]:
# import libraries
import pandas as pd
import nltk
import spacy
from textblob import TextBlob
from textblob.sentiments import NaiveBayesAnalyzer
import sklearn as sk
from sklearn.feature_extraction.text import CountVectorizer
from string import punctuation
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk import word_tokenize, sent_tokenize

### <font color = "blue">Loading data files</font>
Loading text files that contain the following State of the Union sets found here:
https://www.kaggle.com/rtatman/state-of-the-union-corpus-1989-2017

Individual text files were combined using Unix (command line) commands to form the following sets of speeches:
<ul>All speeches<br>
    Trump speeches<br>
    George W. Bush speeches<br>
    Post-WWII speeches<br>
    Speeches by presidents during periods of high immigration (immigrants 13%+ of the total population)</ul>

In [3]:
## load set of all speeches
with open ('all.txt') as a:
    all_text = a.read()

## Trump speeches
with open ('trump.txt') as b:
    trump_text = b.read()

## George W. Bush speeches
with open ('gwb.txt') as c:
    gwb_text = c.read()

## Post-WWII speeches
with open ('modern.txt') as d:
    modern_text = d.read()

print(len(all_text),len(trump_text),len(gwb_text),len(modern_text))

10542905 59616 237345 2735817


In [6]:
# open file containing immigration percentage and send to dataframe
import csv
immigrant_percent = pd.read_csv('/Users/adsida/Downloads/immigrant_percent.csv')    
immigrant_percent[0:5]

Unnamed: 0,year,immigrants,percent
0,1850,2244600,9.7
1,1860,4138700,13.2
2,1870,5567200,14.4
3,1880,6679900,13.3
4,1890,9249500,14.8


In [7]:
# filter to find periods of comparably high immigrant percentages
immigrant_percent['percent'] = pd.to_numeric(immigrant_percent['percent'],errors='coerce')
is_high = immigrant_percent['percent'] >= 13
high_imm = immigrant_percent[is_high]
high_imm

Unnamed: 0,year,immigrants,percent
1,1860,4138700,13.2
2,1870,5567200,14.4
3,1880,6679900,13.3
4,1890,9249500,14.8
5,1900,10341300,13.6
6,1910,13515900,14.7
7,1920,13920700,13.2
17,2011,40377900,13.0
18,2012,40824700,13.0
19,2013,41348100,13.1


In [7]:
# Loading speeches from Presidents during high immigration periods (minus Trump himself)

with open ('high.txt') as e:
    high_text = e.read()
    
print(len(high_text))

3879836


### <font color = "blue">Basic text parsing</font>

In [12]:
# Report word and sentence counts for each speech set.
# The token lists themselves are rebuilt and stored in the next cell for later use.

text_sets = {"all": all_text, "modern": modern_text, "high": high_text, "trump": trump_text, "gwb": gwb_text}

def tokens(name, text_file):
    text_words = word_tokenize(text_file)
    text_sents = sent_tokenize(text_file)
    print("Number of words for", name, "is", len(text_words))
    print("Number of sentences for", name, "is", len(text_sents))

# Pass the loop variable (not a fixed set) so each set is counted in turn
for name, text in text_sets.items():
    tokens(name, text)

Number of words for all is 1921579
Number of sentences for all is 60794
Number of words for modern is 513184
Number of sentences for modern is 21963
Number of words for high is 704063
Number of sentences for high is 20583
Number of words for trump is 11591
Number of sentences for trump is 613
Number of words for gwb is 45417
Number of sentences for gwb is 2175


In [8]:
all_words = word_tokenize(all_text)
all_sents = sent_tokenize(all_text)
print("Number of words for all speeches is ", len(all_words))
print("Number of sentences for all speeches is ", len(all_sents))

trump_words = word_tokenize(trump_text)
trump_sents = sent_tokenize(trump_text)
print("Number of words for Trump speeches is ", len(trump_words))
print("Number of sentences for Trump speeches is ", len(trump_sents))

gwb_words = word_tokenize(gwb_text)
gwb_sents = sent_tokenize(gwb_text)
print("Number of words for GWB speeches is ", len(gwb_words))
print("Number of sentences for GWB speeches is ", len(gwb_sents))

modern_words = word_tokenize(modern_text)
modern_sents = sent_tokenize(modern_text)
print("Number of words for post-WWII speeches is ", len(modern_words))
print("Number of sentences for post-WWII speeches is ", len(modern_sents))

high_words = word_tokenize(high_text)
high_sents = sent_tokenize(high_text)
print("Number of words for high immigration speeches is ", len(high_words))
print("Number of sentences for high immigration speeches is ", len(high_sents))

Number of words for all speeches is  1921579
Number of sentences for all speeches is  60794
Number of words for Trump speeches is  11591
Number of sentences for Trump speeches is  613
Number of words for GWB speeches is  45417
Number of sentences for GWB speeches is  2175
Number of words for post-WWII speeches is  513184
Number of sentences for post-WWII speeches is  21963
Number of words for high immigration speeches is  704063
Number of sentences for high immigration speeches is  20583


### <font color = "blue">Checking for immigration mentions in the various speech groups</font>

In [9]:
immigrant_words = ["immigrant", "immigrate", "immigrants", "immigration", "alien", "aliens", "migrant", "migrants", "foreigners", "foreigner", "border", "borders"]


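# Note: a sentence is appended once for every term it contains, and matching is by
# substring (e.g. "migrant" also matches inside "immigrants"), so the totals below
# count term matches rather than unique sentences.
all_imm_sents = []
for sents in all_sents:
    for words in immigrant_words:
        if words in sents.lower():
            all_imm_sents.append(sents)

print("Number of sentences about immigration is ",len(all_imm_sents))
# The value printed below is a fraction (0.019 means 1.9%), not a percentage
all_imm_perc = len(all_imm_sents)/len(all_sents)
print("This is ", all_imm_perc, "percent of the speech.")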

Number of sentences about immigration is  1180
This is  0.01940981017863605 percent of the speech.
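
The loop above appends a sentence once for every term it contains, and substring matching means "migrant" also matches inside "immigrants". The following is a minimal sketch of a stricter count, assuming that whole-word matching and one count per sentence are what is wanted. The helper name `find_term_sentences` and the sample sentences are hypothetical and not part of the original analysis.

```python
import re

def find_term_sentences(sentences, terms):
    """Return each sentence that contains at least one term as a whole word.

    A sentence is returned at most once, however many terms it contains.
    """
    # \b anchors keep "migrant" from matching inside "immigrants"
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    return [s for s in sentences if pattern.search(s)]

# Hypothetical sample data for illustration only
sample_terms = ["immigrant", "immigrants", "migrant", "migrants", "border", "borders"]
sample_sents = [
    "Immigrants built this nation.",
    "We will secure the border and welcome immigrants.",
    "The economy is strong.",
]
matches = find_term_sentences(sample_sents, sample_terms)
print(len(matches), "of", len(sample_sents), "sentences match")
```

With the notebook's variables this could be called as `find_term_sentences(trump_sents, immigrant_words)`. The resulting counts can be no higher than those printed by the substring loop.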


In [10]:
trump_imm_sents = []
for sents in trump_sents:
    for words in immigrant_words:
        if words in sents.lower():
            trump_imm_sents.append(sents)

print("Number of sentences about immigration is ",len(trump_imm_sents))
trump_imm_perc = len(trump_imm_sents)/len(trump_sents)
print("This is ", trump_imm_perc, "percent of the speech.")

Number of sentences about immigration is  44
This is  0.07177814029363784 percent of the speech.


In [11]:
gwb_imm_sents = []
for sents in gwb_sents:
    for words in immigrant_words:
        if words in sents.lower():
            gwb_imm_sents.append(sents)

print("Number of sentences about immigration is ",len(gwb_imm_sents))
gwb_imm_perc = len(gwb_imm_sents)/len(gwb_sents)
print("This is ", gwb_imm_perc, "percent of the speech.")

Number of sentences about immigration is  51
This is  0.023448275862068966 percent of the speech.


In [12]:
modern_imm_sents = []
for sents in modern_sents:
    for words in immigrant_words:
        if words in sents.lower():
            modern_imm_sents.append(sents)

print("Number of sentences about immigration is ",len(modern_imm_sents))
modern_imm_perc = len(modern_imm_sents)/len(modern_sents)
print("This is ", modern_imm_perc, "percent of the speech.")

Number of sentences about immigration is  305
This is  0.013886991758867186 percent of the speech.


In [13]:
high_imm_sents = []
for sents in high_sents:
    for words in immigrant_words:
        if words in sents.lower():
            high_imm_sents.append(sents)

print("Number of sentences about immigration is ",len(high_imm_sents))
high_imm_perc = len(high_imm_sents)/len(high_sents)
print("This is ", high_imm_perc, "percent of the speech.")

Number of sentences about immigration is  587
This is  0.028518680464460964 percent of the speech.


<font color = "green"><b>Trump mentions immigration terms at a conspicuously high rate.</b></font><br>
<br>Terms like 'immigrant' and 'foreigner' match at a rate of approximately 7.2% of the sentences in Trump's State of the Union speeches (44 matches across 613 sentences).  This is far above the norm for presidential SOTU speeches, which is less than 2%.  Because a sentence is counted once for each term it contains, these rates are best read as term matches per sentence, computed the same way for every group.<br><br>
Is that because times have changed?  It doesn't appear so.  Presidents since WWII have mentioned these terms even less (about 1.4%) than Presidents who preceded them.  <br><br>
Maybe it is a phenomenon of speaking during a time of high immigration?  SOTU speeches given during periods of high immigration (13%+, roughly equal to or greater than the immigrant percentage during Trump's time in office) do mention immigration terms more often.  But the rate is still only approximately 2.9%, less than half that of Trump.<br><br>
How about testing whether this is a modern Republican talking point: how often did George W. Bush bring up these topics?  More often than the norm, but still in only about 2.3% of his sentences, which is less than 1/3 as often as Donald Trump.
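
The comparisons above rest on simple ratios of the printed match rates. As a quick check, the sketch below uses only the values reported in the cell outputs; the dictionary name `rates` is introduced here for illustration.

```python
# Match rates copied from the cell outputs above (matches per sentence)
rates = {
    "all": 0.01940981017863605,
    "trump": 0.07177814029363784,
    "gwb": 0.023448275862068966,
    "modern": 0.013886991758867186,
    "high": 0.028518680464460964,
}

# Express each group's rate as a fraction of Trump's rate
for name, rate in rates.items():
    if name != "trump":
        ratio = rate / rates["trump"]
        print(f"{name}: {rate:.1%} of sentences, {ratio:.2f}x Trump's rate")
```

The high immigration group comes out below one half of Trump's rate and George W. Bush below one third, which matches the statements in the text.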

### <font color = "blue">Looking at sentiment of sentences that discuss immigration</font>

In [14]:
trump_neg = 0
trump_pos = 0
trump_p = []
for phrase in trump_imm_sents:
    N = TextBlob(phrase, analyzer = NaiveBayesAnalyzer()).sentiment
    if N[0] == 'neg':
        trump_neg += 1
    else:
        trump_pos += 1
    P = TextBlob(phrase).sentiment.polarity
    trump_p.append(P)

In [19]:
print("Naive Bayes showed ", trump_neg, "negative sentences and ", trump_pos, "positive sentences.")
print("Average TextBlob sentiment polarity is ", sum(trump_p)/len(trump_p))
#print(trump_p)

Naive Bayes showed  4 negative sentences and  40 positive sentences.
Average TextBlob sentiment polarity is  0.05404177051904326


In [61]:
all_neg = 0
all_pos = 0
all_p = []
for phrase in all_imm_sents:
    N = TextBlob(phrase, analyzer = NaiveBayesAnalyzer()).sentiment
    if N[0] == 'neg':
        all_neg += 1
    else:
        all_pos += 1
    P = TextBlob(phrase).sentiment.polarity
    all_p.append(P)

In [62]:
print("Naive Bayes showed ", all_neg, "negative sentences and ", all_pos, "positive sentences.")
print("Average TextBlob sentiment polarity is ", ((sum(all_p))/len(all_p)))

Naive Bayes showed  55 negative sentences and  1125 positive sentences.
Average TextBlob sentiment polarity is  0.10206988255948621


<font color = "green"><b>The sentiment of Trump's comments on immigration does not appear to be overly negative.</b></font><br><br>  Using standard TextBlob polarity (P) and the TextBlob Naive Bayes classifier (Bayes), the Trump immigration sentences seem slightly positive, with a polarity of around 0.054 (P) and 40 positive sentences to 4 negative sentences (Bayes).  TextBlob's Naive Bayes analyzer is trained on movie reviews, so its labels for political speech are a rough guide only.
<br><br>
This compares with the entire set of SOTU speeches, which showed an overall polarity for immigration sentences of approximately 0.1 (P), with 55 negative sentences and 1125 positive sentences (Bayes).  So the average presidential speaker was more positive on immigration than Trump, but both were more positive than negative.
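
The raw counts are hard to compare directly because the two sets differ greatly in size. The small sketch below converts them to shares, using only the counts printed above; the names `counts` and `negative_share` are introduced here for illustration.

```python
# Naive Bayes counts copied from the cell outputs above
counts = {
    "trump": {"neg": 4, "pos": 40},
    "all": {"neg": 55, "pos": 1125},
}

def negative_share(group):
    """Fraction of a group's immigration sentences labeled negative."""
    return group["neg"] / (group["neg"] + group["pos"])

for name, group in counts.items():
    print(f"{name}: {negative_share(group):.1%} labeled negative")
```

On these numbers roughly 9% of the Trump matches and roughly 5% of all matches are labeled negative, which points the same way as the polarity gap.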

### <font color = "blue">TF/IDF, Bag of Words</font>

In [15]:
stop_words = stopwords.words('english') + list(punctuation)
#print(stop_words[0:30])

In [16]:
# Cleanup of bag of words from each text set

all_clean = []
for word in all_words:
    word = word.lower()
    if word not in stop_words:
        if not word.isdigit():
            all_clean.append(word)

trump_clean = []
for word in trump_words:
    word = word.lower()
    if word not in stop_words:
        if not word.isdigit():
            trump_clean.append(word)
    
gwb_clean = []
for word in gwb_words:
    word = word.lower()
    if word not in stop_words:
        if not word.isdigit():
            gwb_clean.append(word)

modern_clean = []
for word in modern_words:
    word = word.lower()
    if word not in stop_words:
        if not word.isdigit():
            modern_clean.append(word)
    
high_clean = []
for word in high_words:
    word = word.lower()
    if word not in stop_words:
        if not word.isdigit():
            high_clean.append(word)
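
The five cleanup loops above differ only in their input and output names, so they could be folded into one helper. This is a minimal sketch under that assumption; the function name `clean_tokens` and the demo data are hypothetical, and the stop word list is passed in rather than loaded from NLTK so that the example stands alone.

```python
from string import punctuation

def clean_tokens(words, stop_words):
    """Lowercase tokens and drop stop words, punctuation, and pure digits."""
    stop_set = set(stop_words)  # set membership is faster than list lookup
    cleaned = []
    for word in words:
        word = word.lower()
        if word not in stop_set and not word.isdigit():
            cleaned.append(word)
    return cleaned

# Hypothetical stand-in for the notebook's NLTK stop word list
demo_stop_words = ["the", "is", "a"] + list(punctuation)
demo_words = ["The", "border", "is", "secure", ",", "2019", "."]
print(clean_tokens(demo_words, demo_stop_words))
```

In the notebook this would replace each loop with a single call such as `trump_clean = clean_tokens(trump_words, stop_words)`.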

In [38]:
all_quotes =  " ".join(all_clean)
trump_quotes = " ".join(trump_clean)
gwb_quotes = " ".join(gwb_clean)
modern_quotes = " ".join(modern_clean)
high_quotes = " ".join(high_clean)
#all_quotes[0:50]

In [39]:
new_list = [all_quotes, trump_quotes, gwb_quotes, modern_quotes, high_quotes]

In [40]:
vectorizer = TfidfVectorizer(ngram_range=(1, 2))

trump_vectors = vectorizer.fit_transform(new_list)
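
For reference when reading the scores below: with its default settings (`smooth_idf=True`, `norm='l2'`, raw term counts), scikit-learn's `TfidfVectorizer` weights a term $t$ in document $d$ as follows, where $n$ is the number of documents (5 here), $\mathrm{tf}(t,d)$ is the count of $t$ in $d$ and $\mathrm{df}(t)$ is the number of documents containing $t$. The symbols are notation introduced here.

$$
\mathrm{idf}(t) = \ln\frac{1+n}{1+\mathrm{df}(t)} + 1,
\qquad
\mathrm{tfidf}(t,d) = \mathrm{tf}(t,d)\cdot\mathrm{idf}(t)
$$

Each document's vector is then scaled to unit Euclidean length. With only five documents, $\mathrm{idf}$ ranges from $1$ (term in all five) to $\ln 3 + 1 \approx 2.1$ (term in one), so the scores behave mostly like length-normalized term frequencies.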

In [41]:
# Requires scikit-learn >= 1.0; older versions use get_feature_names()
feature_names = vectorizer.get_feature_names_out()

dense = trump_vectors.todense()
denselist = dense.tolist()

In [42]:
# creating a Pandas dataframe with the  feature names as columns and the documents as rows
wordsdf = pd.DataFrame(denselist, columns=feature_names) #, index=speaker_names)

In [43]:
#transpose dataframe 
wordsdf_t = wordsdf.T

In [56]:
wordsdf_t.columns = ['All Presidents', 'Trump', 'GW Bush', 'Post WWII', "High Immigration Presidents"]

In [57]:
wordsdf_t[0:20]

Unnamed: 0,All Presidents,Trump,GW Bush,Post WWII,High Immigration Presidents
00,0.005701,0.0,0.0,0.000517,0.007249
00 085,7.6e-05,0.0,0.0,0.0,0.0
00 107,7.6e-05,0.0,0.0,0.0,0.0
00 12,0.000152,0.0,0.0,0.0,0.0
00 127,7.6e-05,0.0,0.0,0.0,0.0
00 14,0.000228,0.0,0.0,0.0,0.0
00 148,7.6e-05,0.0,0.0,0.0,0.0
00 15,7.6e-05,0.0,0.0,0.0,0.0
00 17,0.000152,0.0,0.0,0.0,0.0
00 171,7.6e-05,0.0,0.0,0.0,0.0


In [46]:
top_terms = []
for columns in wordsdf_t:
    top_terms.append(wordsdf_t.nlargest(30, columns))

In [68]:
print("Trump's top spoken terms are ", top_terms[1])
print("This is fairly generic and doesn't immediately point to differences with the overall top terms: ", top_terms[0])

Trump's top spoken terms are                        0         1         2         3         4
american       0.094964  0.240279  0.115533  0.122301  0.092462
america        0.062005  0.210244  0.270743  0.160903  0.030470
us             0.087539  0.165192  0.147042  0.157594  0.055592
people         0.145270  0.153929  0.197223  0.200730  0.133152
new            0.105793  0.150174  0.138873  0.178917  0.074122
one            0.101483  0.150174  0.099195  0.113355  0.100485
americans      0.029699  0.146420  0.108531  0.083944  0.017193
country        0.122453  0.138911  0.131871  0.090439  0.118156
tonight        0.018181  0.135157  0.085191  0.061150  0.008788
year           0.143641  0.116385  0.099195  0.174015  0.146238
great          0.116224  0.108876  0.074688  0.079777  0.110419
world          0.088734  0.101368  0.144708  0.180388  0.042219
also           0.063200  0.097613  0.084024  0.080390  0.050434
must           0.116332  0.097613  0.219395  0.206613  0.066194
nation    

<font color = "green"><b>There does not appear to be anything peculiar about Trump's top spoken terms overall.</b></font><br>

They seem to be as general and pertinent to the overall state of the country as the most common terms used by other presidents.
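
Top terms by raw score are dominated by words that every president uses. One possible way to surface what sets one corpus apart is to rank terms by the difference between two score columns. The sketch below assumes plain dictionaries of term scores; the function name `distinctive_terms` is hypothetical, and the sample values are copied from the output printed above (column 1 for Trump, column 0 for all presidents).

```python
def distinctive_terms(scores_a, scores_b, top_n=3):
    """Rank terms by how much their score in A exceeds their score in B."""
    diffs = {
        term: score - scores_b.get(term, 0.0)
        for term, score in scores_a.items()
    }
    return sorted(diffs, key=diffs.get, reverse=True)[:top_n]

# Scores copied from the output above: Trump (column 1), all presidents (column 0)
trump_scores = {"american": 0.240279, "america": 0.210244, "tonight": 0.135157,
                "year": 0.116385, "great": 0.108876, "must": 0.097613}
all_scores = {"american": 0.094964, "america": 0.062005, "tonight": 0.018181,
              "year": 0.143641, "great": 0.116224, "must": 0.116332}

print(distinctive_terms(trump_scores, all_scores))
```

On the full table a comparable ranking could be obtained with `(wordsdf_t['Trump'] - wordsdf_t['All Presidents']).nlargest(30)`, assuming the column names set earlier in the notebook.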

### <font color = "blue">Use of abrasive language</font>
<br>
Let's see whether potentially abrasive terms like 'border wall' and 'illegal immigrant' are more prevalent in Trump's speeches than in others'.

In [66]:
# I'll look for some terms that might have negative connotations about immigration and immigrants.
# Rows follow the order of new_list: 0 = all presidents, 1 = Trump, 2 = GW Bush,
# 3 = post-WWII, 4 = high immigration presidents.

wordsdf[["illegal immigrant", "illegal", "wall", "border"]]

Unnamed: 0,illegal immigrant,illegal,wall,border
0,0.000000,0.002390,0.001050,0.003948
1,0.015758,0.011263,0.007509,0.022526
2,0.000000,0.010503,0.003501,0.019839
3,0.000000,0.003921,0.002941,0.003799
4,0.000000,0.001815,0.001337,0.003725


<font color = "green"><b>Trump does use potentially divisive terms around immigration much more often.</b></font><br>

Trump used the term "illegal immigrant" (TF-IDF weight 0.0158), and this exact term was not found for any other president: its weight is 0 in every other corpus, including the one containing all State of the Union speeches.

Trump's weight for the term "illegal" was slightly higher than George W. Bush's (0.0113 vs 0.0105) and much higher than that of the other groupings.  So this term appears to be part of the modern parlance more than it is specific to Trump.  It may also reflect both Bush's and Trump's appeal as law and order presidents, each asserting himself against groups of people that his supporters find problematic.

The term "wall" is weighted much more heavily for Trump than for Bush (0.0075 to 0.0035), and more heavily still relative to the other presidential groupings.  Trump also mentions "border" more than Bush and much more than all, recent or high immigration era presidents.  This is notable not just because it appears to show a stronger push for controlling immigration, but also because it means he uses these terms more often than some presidents who presided over wars that occurred on our border.



### <font color = "blue">Searching for a single, negative phrase</font>

In [65]:
trump_illegal = []
for sents in trump_sents:
    if "illegal" in sents.lower():
        trump_illegal.append(sents)
print("There are ", len(trump_illegal), " times Trump used the term illegal.  Here they are:", trump_illegal)


There are  3  times Trump used the term illegal.  Here they are: ["Jamiel's 17-year-old son was viciously\nmurdered by an illegal immigrant gang member who had just been released\nfrom prison.", 'These brave\nmen were viciously gunned down by an illegal immigrant with a criminal\nrecord and two prior deportations.', 'Here are the four pillars of our plan:\n\nThe first pillar of our framework generously offers a path to\ncitizenship for 1.8 million illegal immigrants who were brought here by\ntheir parents at a young age -- that covers almost three times more\npeople than the previous administration.']


<font color = "green"><b>A qualitative look at Trump's use of 'illegal immigrant(s)'</b></font><br>

Trump used the phrase 'illegal immigrant(s)' three times in his speeches.  While this term is in common parlance and is not automatically ill-intentioned, Trump clearly used it in a negative context in two of the three sentences, attributing the murder of a child and then of multiple men to an illegal immigrant.  This is a small data set, but it offers an illustration of the attitudes that the larger set analysis might imply.