<a id='title'></a>

# Analyzing the 2020 American Presidential Election using Twitter
## Page 3: Twitter Sentiment Analysis

<i> David Grinberg</i>
___________________

<a id='contents'> </a>
## Table of Contents
1. [<b>Project Introduction and 2020 Background Info](./Final_Project_1.ipynb)<br>
1. [<b>Analyzing Polling Data](./Final_Project_2.ipynb)<br>
1. [<b>Twitter Sentiment Analysis](#title)<br>
    3.1 [Packages Used](#packageimports)<br>
    3.2 [Reading Data into a DataFrame](#readcsv)<br>
    3.3 [Sorting and Cleaning Data](#sorting)<br>
    3.4 [Running Sentiment Analysis](#sentimentanalysis)<br>
    3.5 [Exporting the Data](#export)<br>    
1. [<b>Comparing Polling Data with Sentiment Analysis Data](./Final_Project_4.ipynb)<br>
1. [<b>Project Conclusion](./Final_Project_5.ipynb)<br>




________________________

<a id='packageimports'></a>
### Packages used


- Pandas is used to import, edit, manipulate, and create .csv files
- NLTK is used to perform Natural Language Processing on the individual tweets used (NLP).

[Table of Contents](#contents)
________________________

In [1]:
import pandas as pd
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

<div class="alert alert-block alert-danger"> 
<b> WARNING: Running this notebook uses a large amount of memory and time. This code takes at least 5 hours to run, so please plan accordingly if you intend to run the code in this notebook.
</b>    
    
</div>


<a id='readcsv'></a>
### Reading the data into a Pandas DataFrame

In this notebook, we will be using a data set of over 6 million tweets, downloaded from Twitter by Christo Tarazi and Wei Xiao. Like the previous notebook, we will be reading the .csv file into a pandas DataFrame object. Once in a DataFrame, we can then manipulate the data.


[[Table of Contents]](#contents)

<br>

___________________

In [2]:
pd.options.display.max_colwidth=999

df_1= pd.read_csv(r'D:/Final Project Econ/Sep1toSep16.csv',dtype={'id_str':str, 'created_at':str, 'text':str, 'user_screen_name':str,
                                           'user_created_at':str, 'favorite_count':str, 'user_favourites_count':str,
                                           'user_followers_count':str,'user_friends_count':str, 'is_quote':str,
                                           'retweeted':str, 'user_listed_count':str,'retweet_count':str, 'user_statuses_count':str,
                                           'user_id_str':str, 'user_verified':str,'user_description':str, 'user_location':str,
                                           'user_name':str, 'geo':str, 'longitutde':str,'latitude':str, 'place':str, 'lang':str,
                                           'reply_count':str, 'quote_count':str, 'candidate':str,'polarity':str, 'polarity_origin':str})



In [3]:
df_1=df_1.loc[:,['created_at','text']]
df_1['created_at']=pd.to_datetime(df_1['created_at']).dt.date


In [4]:
df_2= pd.read_csv(r'D:/Final Project Econ/Sep16toNov20.csv',dtype={'id_str':str, 'created_at':str, 'text':str, 'user_screen_name':str,
                                           'user_created_at':str, 'favorite_count':str, 'user_favourites_count':str,
                                           'user_followers_count':str,'user_friends_count':str, 'is_quote':str,
                                           'retweeted':str, 'user_listed_count':str,'retweet_count':str, 'user_statuses_count':str,
                                           'user_id_str':str, 'user_verified':str,'user_description':str, 'user_location':str,
                                           'user_name':str, 'geo':str, 'longitutde':str,'latitude':str, 'place':str, 'lang':str,
                                           'reply_count':str, 'quote_count':str, 'candidate':str,'polarity':str, 'polarity_origin':str})



df_2=df_2.loc[:,['created_at','text']]
df_2.sort_values(by=['created_at'],ascending=False)

Unnamed: 0,created_at,text
11734319,2020-11-28 22:06:12,@CB618444 @realDonaldTrump If you are fully supporting President Trump that question should not cross your lips\r\nThank you for saying what needed to be said
11734318,2020-11-28 22:06:12,Im starting to think Donald Trumps lawyers arent very good at this.
11734317,2020-11-28 22:06:12,President Trump @realDonaldTrump in disguise.
11734316,2020-11-28 22:06:12,Those in the GOP who arent supporting President Trump.... \r\n\r\nYou better get in step ......or get the hell out of our way.
11734315,2020-11-28 22:06:12,So let me get this straight. Trump is asking state legislatures to overturn the will of the people.\r\n\r\nQuestion: Do you believe State Legislators will go against the will of the people who ELECTED THEM and appoint Electors who will place Trump in Office?
...,...,...
12,2020-09-16 19:00:01,Trump last night A lot of people think that masks are not good Trump s CDC Director today This face mask is more guaranteed to protect me against COVID than when I take a COVID vaccine Trump s cavalier irresponsibility about masks will cause the death of more Americans
13,2020-09-16 19:00:01,Regina is a LincolnVoter from Arizona She s a Navy veteran who will be voting against Trump because of his devastating response to covid https t co TdlKwkePsD
14,2020-09-16 19:00:01,Why does Joe Biden only call on a pre approved list of reporters Because it s all a CHARADE
15,2020-09-16 19:00:01,Line up the donkey dicks because they can suck all of them


In [5]:
df_2['created_at']=pd.to_datetime(df_2['created_at']).dt.date
df_2=df_2.astype(str)
df_2=df_2[df_2['created_at']!= "2020-09-16"]

In [6]:
df_1=df_1.drop_duplicates(subset=['text'])
df_2=df_2.drop_duplicates(subset=['text'])

In [7]:
frames=[df_1,df_2]
df=pd.concat(frames)

del(df_1,df_2)
df.sort_index(ascending=True)

Unnamed: 0,created_at,text
0,2020-09-01,Anyone else notice that Trump supporters have huge patriotic motorcycle and boat rallies while Biden supporters are rioting and burning down cities?
1,2020-09-01,Where in the f*ck is he?????
2,2020-09-01,First #TheCorruptPartyOfTrump #LootingDecency grabbed Nazi suburban housewives by their real estate values #IMpotus science fiction theater presents: The Alien Invasion of Nazi Suburbs \__/
3,2020-09-01,Political and cultural elites like this rarely have to suffer the consequences of their call to violence. It's always the ordinary men and women. Disgusting.
4,2020-09-01,"Can we take the gloves off and tell the truth? Trump is deliberately killing people. He holds rallies where people get infected. On Thursday, no social distancing or masks, sending a clear message that the CDC should be ignored. His plan is to kill people. Let's just say it."
...,...,...
11734304,2020-11-28,@KlasfeldReports why is @MikeKellyPA so hellbent on subverting the will of the people? This would be the time to DISTANCE one's self from this shitshow -- not Mike. Remember this sycophant -- He chose Trump over all else. He doesn't deserve the office he currently resides in.
11734305,2020-11-28,So he remembers his name
11734306,2020-11-28,"@notcapnamerica They should start using the Governors Mansion for storage. \r\n\r\nThis couldve been avoided if @GregAbbott_TX acted in time. Followed the #CovidIdiots Trump and his AG @KenPaxtonTX, and now Texans are dying at an alarming rate. Incompetent bastards."
11734311,2020-11-28,"Trump was in a frenzy to finalize his bird-killer policy, David Yarnold, pres of the National Audubon Society, said in a statement Friday. \r\n\r\nWhy! \r\n\r\nReinstating this 100-year-old bedrock law must be a top conservation priority for the Biden-Harris Administration &amp; Congress."


In [8]:
df['created_at']=pd.to_datetime(df['created_at'])
df.set_index('created_at',inplace=True)
df.head(2)

Unnamed: 0_level_0,text
created_at,Unnamed: 1_level_1
2020-09-01,Anyone else notice that Trump supporters have huge patriotic motorcycle and boat rallies while Biden supporters are rioting and burning down cities?
2020-09-01,Where in the f*ck is he?????


In [9]:
df.drop_duplicates(inplace=True)
df = df.replace(r'\n',' ', regex=True) 
df = df.replace(r'\r',' ', regex=True) 

<a id='sorting'></a>
### Sorting and Cleaning the Data

On Twitter, there are a lot of bots that are programmed to post the same Tweets at the same time. Before performing sentiment analysis, we will clean up the dataset by removing duplicate tweets. Once clean, we can isolate tweets based on which candidate they reference.



[[Table of Contents]](#contents)


___________________

In [10]:
df.dropna(inplace=True)
trump=df.copy()
biden=df.copy()
kamala=df.copy()
pence=df.copy()
dem=df.copy()
gop=df.copy()

In [11]:
trump = df['text'].str.contains("trump|don",case=False)
trump=df[trump]

In [12]:
biden = df['text'].str.contains("biden|joe",case=False)
biden=df[biden]

In [13]:
pence= df['text'].str.contains("pence|mike",case=False)
pence=df[pence]

In [15]:
kamala= df['text'].str.contains("harris|kamala",case=False)
kamala=df[kamala]

In [11]:
dem= df['text'].str.contains("democrat|dem |dems",case=False)
dem=df[dem]

In [17]:
gop= df['text'].str.contains("gop|repub|republican",case=False)
gop=df[gop]

In [12]:
del(df)

<a id='sentimentanalysis'></a>
### Running a Sentiment Analysis 

Sentiment Analysis is a Natural Language Processing (NLP) technique that is used to analyze whether a piece of text is positive, neutral, or negative. There are several types of Sentiment Analysis with their own methods, however, here we will use the VADER Sentiment Analysis technique. VADER (Valence Aware Dictionary and sEntiment Reasoner) is a sentiment analysis technique that uses a lexicon (dictionary) of words that have values between -4 and 4 attributed to them. Words with positive meanings will be given larger numbers than words with negative meanings. 

From this lexicon, an algorithm was created to assign values to text snippets. When running the Vader algorithm, 4 values will be returned: Negative, Neutral, Positive, and Compound. The Negative, Neutral, and Positive values represent the likelihood that a tweet is Negative, Neutral, or Positive, this will be represented as a number between 0-1. The Compound Value, however, is a number between -1 and 1. A highly negative compound value means that a sentence that is *most likely* negative, and a highly positive number will mean that the sentence is *most likely* positive.<br>
<br>

**For example:**

| Tweet | Negative  | Neutral  | Positive| Compound Value | 
| :--- | ---    | ---    | ---    | ---    |
| anyone else notice that trump supporters have huge patriotic motorcycle and boat rallies while biden supporters are rioting and burning down cities  | 0.000 | 0.701    | 0.299 | 0.7964    |


<br>
<br>

If you are interested in learning more about VADER, you can read the original academic paper [here](https://ojs.aaai.org/index.php/ICWSM/article/download/14550/14399/18068), or you can check out the [VADER Github](https://github.com/cjhutto/vaderSentiment) repository.


[[Table of Contents]](#contents)

<br>

___________________




In [14]:
nltk.download('vader_lexicon')
analyzer = SentimentIntensityAnalyzer()

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\capma\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


In [None]:
trump['neg'] = trump['text'].apply(lambda x:analyzer.polarity_scores(x)['neg'])
print("Trump1/4")

trump['neu'] = trump['text'].apply(lambda x:analyzer.polarity_scores(x)['neu'])
print("Trump2/4")

trump['pos'] = trump['text'].apply(lambda x:analyzer.polarity_scores(x)['pos'])
print("Trump 3/4")

trump['compound'] = trump['text'].apply(lambda x:analyzer.polarity_scores(x)['compound'])
print("Trump Sentiment Analysis Completed")
trump.to_csv(r'D:/Final Project Econ/trump sentiment and tweets.csv',index=True)
del trump

In [None]:
biden['neg'] = biden['text'].apply(lambda x:analyzer.polarity_scores(x)['neg'])
print("Biden 1/4")

biden['neu'] = biden['text'].apply(lambda x:analyzer.polarity_scores(x)['neu'])
print("Biden2/4")

biden['pos'] = biden['text'].apply(lambda x:analyzer.polarity_scores(x)['pos'])
print("Biden 3/4")

biden['compound'] =  biden['text'].apply(lambda x:analyzer.polarity_scores(x)['compound'])
print("Biden Sentiment Analysis Completed")
biden.to_csv(r'D:/Final Project Econ/biden sentiment and tweets.csv',index=True)

del biden

In [20]:
kamala['neg'] = kamala['text'].apply(lambda x:analyzer.polarity_scores(x)['neg'])
print("kamala 1/4")

kamala['neu'] = kamala['text'].apply(lambda x:analyzer.polarity_scores(x)['neu'])
print("kamala2/4")

kamala['pos'] = kamala['text'].apply(lambda x:analyzer.polarity_scores(x)['pos'])
print("kamala 3/4")

kamala['compound'] =  kamala['text'].apply(lambda x:analyzer.polarity_scores(x)['compound'])
print("kamala Sentiment Analysis Completed")
kamala.to_csv(r'D:/Final Project Econ/kamala sentiment and tweets.csv',index=True)

del kamala

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  kamala['neg'] = kamala['text'].apply(lambda x:analyzer.polarity_scores(x)['neg'])


kamala 1/4


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  kamala['neu'] = kamala['text'].apply(lambda x:analyzer.polarity_scores(x)['neu'])


kamala2/4


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  kamala['pos'] = kamala['text'].apply(lambda x:analyzer.polarity_scores(x)['pos'])


kamala 3/4


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  kamala['compound'] =  kamala['text'].apply(lambda x:analyzer.polarity_scores(x)['compound'])


kamala Sentiment Analysis Completed


In [21]:
pence['neg'] = pence['text'].apply(lambda x:analyzer.polarity_scores(x)['neg'])
print("pence 1/4")

pence['neu'] = pence['text'].apply(lambda x:analyzer.polarity_scores(x)['neu'])
print("pence2/4")

pence['pos'] = pence['text'].apply(lambda x:analyzer.polarity_scores(x)['pos'])
print("pence 3/4")

pence['compound'] =  pence['text'].apply(lambda x:analyzer.polarity_scores(x)['compound'])
print("pence Sentiment Analysis Completed")
pence.to_csv(r'D:/Final Project Econ/pence sentiment and tweets.csv',index=True)

del pence

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  pence['neg'] = pence['text'].apply(lambda x:analyzer.polarity_scores(x)['neg'])


pence 1/4


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  pence['neu'] = pence['text'].apply(lambda x:analyzer.polarity_scores(x)['neu'])


pence2/4


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  pence['pos'] = pence['text'].apply(lambda x:analyzer.polarity_scores(x)['pos'])


pence 3/4


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  pence['compound'] =  pence['text'].apply(lambda x:analyzer.polarity_scores(x)['compound'])


pence Sentiment Analysis Completed


In [15]:
dem['neg'] = dem['text'].apply(lambda x:analyzer.polarity_scores(x)['neg'])
print("dem 1/4")

dem['neu'] = dem['text'].apply(lambda x:analyzer.polarity_scores(x)['neu'])
print("dem 2/4")

dem['pos'] = dem['text'].apply(lambda x:analyzer.polarity_scores(x)['pos'])
print("dem 3/4")

dem['compound'] =  dem['text'].apply(lambda x:analyzer.polarity_scores(x)['compound'])
print("dem Sentiment Analysis Completed")
dem.to_csv(r'D:/Final Project Econ/dem sentiment and tweets.csv',index=True)

del dem

dem 1/4
dem2/4
dem 3/4
dem Sentiment Analysis Completed


In [23]:
gop['neg'] = gop['text'].apply(lambda x:analyzer.polarity_scores(x)['neg'])
print("gop 1/4")

gop['neu'] = gop['text'].apply(lambda x:analyzer.polarity_scores(x)['neu'])
print("gop2/4")

gop['pos'] = gop['text'].apply(lambda x:analyzer.polarity_scores(x)['pos'])
print("gop 3/4")

gop['compound'] =  gop['text'].apply(lambda x:analyzer.polarity_scores(x)['compound'])
print("gop Sentiment Analysis Completed")
gop.to_csv(r'D:/Final Project Econ/gop sentiment and tweets.csv',index=True)

del gop

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  gop['neg'] = gop['text'].apply(lambda x:analyzer.polarity_scores(x)['neg'])


gop 1/4


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  gop['neu'] = gop['text'].apply(lambda x:analyzer.polarity_scores(x)['neu'])


gop2/4


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  gop['pos'] = gop['text'].apply(lambda x:analyzer.polarity_scores(x)['pos'])


gop 3/4


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  gop['compound'] =  gop['text'].apply(lambda x:analyzer.polarity_scores(x)['compound'])


gop Sentiment Analysis Completed


### Exporting the Dataframes into Files
<a id='export'></a>

We will be exporting the two DataFrames we created into .csv files. This will allow us to access the data that we sorted, cleaned, and analyzed without having to re-run any code.

The following code will create two files: **trump sentiment and tweets.csv** and **biden sentiment and tweets.csv**. Do not run the following cell if you do not wish to create new files on your computer.


In [None]:
#trump.to_csv('trump sentiment and tweets.csv',index=True)
#biden.to_csv('biden sentiment and tweets.csv',index=True)

<a href="#contents">


<div class="alert alert-block alert-info"> 
<b>Click this box to return to the table of contents
</b>    
    
</div>
    </a>

<a href="./Final_Project_4.ipynb">


<div class="alert alert-block alert-success"> 
<b>Click this box to continue on to the comparison between Twitter sentiment analysis and polling data
</b>    
    
</div>

</a>

<a href="./Final_Project_2.ipynb">


<div class="alert alert-block alert-danger"> 
<b>Click this box to return to the analysis of 2020 presidential polling data
</b>    
    
</div>

</a>