## US Election 2020 Tweets Analysis

The ["US Election 2020 Tweets" dataset](https://www.kaggle.com/datasets/manchunhui/us-election-2020-tweets/data) provides an extensive collection of tweets from around the world that include hashtags related to the 2020 U.S. presidential candidates, Biden and Trump.

### Objectives
For the purposes of this analysis, we focus exclusively on English-language tweets from the United States and conduct a text analysis with three main goals:

- **Tweet Distribution Analysis**: Describe how tweets are distributed between the two U.S. candidates across federal states;
- **Global Sentiment Analysis**: Assess the sentiment of the overall population of Twitter users in the sample towards Biden and Trump, evaluating which sentiment prevails for each of them;
- **Federal States Sentiment Analysis**: Assess the sentiment of Twitter users in each federal state towards Biden and Trump, evaluating which sentiment prevails for each of them and whether it is consistent with the outcome of the 2020 election.

---------------

# Federal States Sentiment Analysis

In [1]:
# Import Libraries 
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
import plotly.express as px 
import plotly.io as pio
from pathlib import Path
import os
# Libraries for Sentiment Analysis 
import re 
import nltk 
from nltk.corpus import stopwords 
from nltk.corpus import wordnet 
from nltk.stem import WordNetLemmatizer 
from textblob import TextBlob 
from wordcloud import WordCloud 
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import words  # stopwords is already imported above
from nltk.probability import FreqDist

In [3]:
# Define the base path
base_path = Path("C:/Users/Davide/Desktop/Alma Mater/SECOND YEAR/PYTHON/Python_project")
# Change the working directory
os.chdir(base_path)

# Define the full path to the CSV file for Trump and Biden
merged_data = base_path / "data.csv" ###this dataset is already restricted to the US

# Print the current working directory
print("Current Working Directory:", Path.cwd())

Current Working Directory: C:\Users\Davide\Desktop\Alma Mater\SECOND YEAR\PYTHON\Python_project


In [36]:
try:
    data = pd.read_csv(merged_data, encoding="utf-8", engine='python', on_bad_lines='skip')
    print("First 5 rows of the DataFrame:")
    print(data.head())
except Exception as e:
    print("Error loading the file:", e)

First 5 rows of the DataFrame:
   index           created_at      tweet_id  \
0      0  2020-10-15 00:00:01  1.316529e+18   
1      2  2020-10-15 00:00:02  1.316529e+18   
2      4  2020-10-15 00:00:08  1.316529e+18   
3      5  2020-10-15 00:00:17  1.316529e+18   
4      7  2020-10-15 00:00:18  1.316529e+18   

                                               tweet  likes  retweet_count  \
0  #Elecciones2020 | En #Florida: #JoeBiden dice ...    0.0            0.0   
1  #Trump: As a student I used to hear for years,...    2.0            1.0   
2  You get a tie! And you get a tie! #Trump ‘s ra...    4.0            3.0   
3  @CLady62 Her 15 minutes were over long time ag...    2.0            0.0   
4  @DeeviousDenise @realDonaldTrump @nypost There...    0.0            0.0   

                source       user_id  \
0            TweetDeck  3.606665e+08   
1      Twitter Web App  8.436472e+06   
2   Twitter for iPhone  4.741380e+07   
3  Twitter for Android  1.138416e+09   
4   Twitter for i

In [None]:
# Retain only the columns "tweet", "state_code" and "candidate"
data = data[['tweet', 'state_code', 'candidate']]
print(data)

In [39]:
# Check whether there are missing values in any of the three retained columns.
# For a clearer view of the missing data, count the NaN entries in each variable/column.
print('NaN for each variable:\n', data.isna().sum(axis=0))  # NaN count per column (summed over rows)
print('\nTotal NaN ', data.isna().sum(axis=0).sum())

NaN for each variable:
 tweet             0
state_code    61931
candidate         0
dtype: int64

Total NaN  61931


In [40]:
# The only missing values are in state_code, so we drop every row where the state code is missing
data = data.dropna(subset=['state_code'], how='any')  # a row is dropped if its state_code is missing
data.shape  # the resulting DataFrame has 332464 rows and 3 columns

(332464, 3)

In [None]:
# Now that the dataset has only the useful columns and no missing values, we can start the analysis by US state.
# First we check how many tweets there are per candidate in each state, to ensure that the sample is balanced
# (i.e., we don't want a state with tweets about only one candidate, as this would bias the analysis) --> see chart by Davide with tweets for each candidate by state

#check that we have all US states. 
unique_US_states_count = data['state_code'].nunique()
unique_US_states = data['state_code'].unique()
print(unique_US_states)
# There are 54 codes, but only 50 states. The 4 extra codes do not indicate states: one is a federal district and three are US territories.
# DC - District of Columbia (federal district)
# PR - Puerto Rico (U.S. territory)
# GU - Guam (U.S. territory)
# MP - Northern Mariana Islands (U.S. territory)

# Drop the 3 US territories (DC is kept) in order to perform the analysis by US state
codes_to_drop = ['PR', 'GU', 'MP']
data = data[~data['state_code'].isin(codes_to_drop)]  # exclude rows whose state code is any of the three listed in codes_to_drop

# Check if any of the unwanted codes are still in the 'state_code' column
unwanted_codes_present = data['state_code'].isin(codes_to_drop).any()
if unwanted_codes_present:
    print("There are still unwanted codes in the data.")
else:
    print("All unwanted codes have been successfully removed from the data.")




['FL' 'OR' 'DC' 'CA' 'OH' 'PA' 'IL' 'MI' 'NJ' 'MA' 'NH' 'TX' 'SD' 'GA'
 'MO' 'NY' 'CO' 'SC' 'VA' 'AL' 'AZ' 'NC' 'TN' 'NE' 'LA' 'NV' 'MN' 'IN'
 'WA' 'HI' 'WV' 'VT' 'ID' 'PR' 'IA' 'KY' 'ND' 'AR' 'WI' 'UT' 'MT' 'KS'
 'WY' 'ME' 'CT' 'MD' 'NM' 'OK' 'AK' 'DE' 'RI' 'MS' 'GU' 'MP']
All unwanted codes have been successfully removed from the data.
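
The balance check mentioned in the cell above (tweets per candidate in each state) could be sketched with a cross-tabulation. This is only an illustration on made-up rows: the `toy` frame stands in for `data`, and the names `counts`, `shares` and `one_sided` are hypothetical helpers, not part of the notebook.

```python
import pandas as pd

# Toy stand-in for `data`; these rows are invented for illustration only.
toy = pd.DataFrame({
    "state_code": ["CA", "CA", "CA", "WY", "WY", "FL"],
    "candidate":  ["trump", "biden", "biden", "trump", "trump", "biden"],
})

# Rows = state codes, columns = candidates, cells = number of tweets.
counts = pd.crosstab(toy["state_code"], toy["candidate"])

# Share of each state's tweets that refer to each candidate.
shares = counts.div(counts.sum(axis=1), axis=0)

# States where one of the two candidates has no tweets at all.
one_sided = counts.index[(counts == 0).any(axis=1)].tolist()

print(counts)
print(shares.round(2))
print("States with tweets about only one candidate:", one_sided)
```

On the real data, the same `pd.crosstab(data['state_code'], data['candidate'])` call would presumably give the table behind the per-state chart.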


In [42]:
#now we have to group tweets for both candidates by State. Once we have done this, we perform our sentiment analysis for each state. 
#Then, for each state we consider the results of our sentiment analysis and check whether they are a good proxy for the electoral outcome in that State
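
One possible way to turn the per-state sentiment counts into a comparison with the electoral outcome is sketched below. The "net sentiment" score (positive share minus negative share) is an assumption made for this example, not a metric defined in the notebook, and the counts and the outcome for the placeholder state are invented.

```python
def net_sentiment(counts):
    """Positive share minus negative share, in percentage points."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no tweets to score")
    return 100.0 * (counts.get("positive", 0) - counts.get("negative", 0)) / total


def sentiment_leader(trump_counts, biden_counts):
    """Return the candidate with the higher net sentiment, or 'tie'."""
    trump_score = net_sentiment(trump_counts)
    biden_score = net_sentiment(biden_counts)
    if trump_score == biden_score:
        return "tie"
    return "trump" if trump_score > biden_score else "biden"


# Hypothetical counts for a placeholder state (not taken from the dataset).
trump_counts = {"positive": 40, "neutral": 40, "negative": 20}
biden_counts = {"positive": 45, "neutral": 45, "negative": 10}
hypothetical_winner = "biden"  # made-up outcome, for illustration only

leader = sentiment_leader(trump_counts, biden_counts)
print("Net sentiment (Trump):", net_sentiment(trump_counts))
print("Net sentiment (Biden):", net_sentiment(biden_counts))
print("Sentiment leader matches outcome:", leader == hypothetical_winner)
```

Other scores, such as mean polarity, would work with the same comparison; the choice of score is a modelling decision and may change which candidate leads in a given state.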


In [64]:
#step1: define the functions needed for cleaning the data and performing sentiment analysis
import unicodedata

# Build the stopword set and the lemmatizer once, rather than on every call to clean()
STOP_WORDS = set(stopwords.words('english'))
LEMMATIZER = WordNetLemmatizer()

def clean(text):
    # Remove URLs
    text = re.sub(r'https?://\S+|www\.\S+', '', str(text))

    # Turn styled characters (e.g. bold or italic Unicode letters) into plain ASCII, then lowercase
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8').lower()

    # Replace anything other than the letters a-z with a space
    text = re.sub('[^a-z]', ' ', text)

    # Split the text into single words
    tokens = text.split()

    # Remove stopwords and lemmatize the remaining words
    tokens = [LEMMATIZER.lemmatize(word) for word in tokens if word not in STOP_WORDS]

    # Join the words back into a single string
    return ' '.join(tokens)


def getpolarity(text): 
    return TextBlob(text).sentiment.polarity 

def getsubjectivity(text): 
    return TextBlob(text).sentiment.subjectivity 

def getAnalysis(score): 
    if score < 0: 
        return 'negative'
    elif score == 0: 
        return 'neutral'
    else: 
        return 'positive'
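
The normalization step inside `clean` can be illustrated on its own. The sketch below uses only the standard library; `to_plain_lower` is a hypothetical helper name that isolates that single line, and the sample string is invented. NFKD decomposition maps styled Unicode letters and accented characters to their plain counterparts, and the ASCII round trip then drops whatever cannot be represented.

```python
import re
import unicodedata


def to_plain_lower(text):
    # Decompose compatibility characters (styled letters, accents) ...
    text = unicodedata.normalize("NFKD", text)
    # ... then drop everything that is not plain ASCII and lowercase the rest.
    return text.encode("ascii", "ignore").decode("ascii").lower()


# "Biden" written in mathematical sans-serif bold letters, plus an accented word.
styled = "\U0001D5D5\U0001D5F6\U0001D5F1\U0001D5F2\U0001D5FB caf\u00e9 2020!"

plain = to_plain_lower(styled)
# Same follow-up step as in clean(): keep only the letters a-z.
letters_only = re.sub("[^a-z]", " ", plain).split()

print(plain)
print(letters_only)
```

Because the encoding step ignores non-ASCII characters, emoji and text in non-Latin scripts are removed entirely rather than transliterated.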

In [66]:
#step 2: group tweets for both candidates by US state 

# Create a dictionary to hold dataframes for each state
state_dataframes = {}

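# Loop through each state code still present in the data and create a separate DataFrame for it
# (unique_US_states was computed before the territories were dropped, so it is not reused here)
for state in data['state_code'].unique():
    state_dataframes[state] = data[data['state_code'] == state][['tweet', 'candidate']]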

# Example: Accessing the DataFrame for California (CA)
california_df = state_dataframes['CA']  
print("California DataFrame:")
print(california_df)
# Example: Accessing the DataFrame for Wyoming (WY)
Wyoming_df = state_dataframes['WY']  
print("Wyoming DataFrame:")
print(Wyoming_df)

California DataFrame:
                                                    tweet candidate
3       @CLady62 Her 15 minutes were over long time ag...     trump
7       #Trump #PresidentTrump #Trump2020LandslideVict...     trump
15      #BlacksForTrump \n#BlackVoicesForTrump \n#Bide...     trump
19      #TheWeek: "#Trump in Penn: "I saved suburbia. ...     trump
21      #Trump is tearing up #Biden at the #TrumpRally...     trump
...                                                   ...       ...
394346  On what date can we officially start blaming a...     biden
394370  Oigan como que ganó #Biden la #precidencia y d...     biden
394375  LIONZ DEN PRESENTS TO YOU \n\n“THE WHITE OBAMA...     biden
394381  #Election2020 President #Trump addresses Joe #...     biden
394386  #Biden 🗽🇺🇸👍🏽 | Images 📷 @ Santa Maria, CA.  | ...     biden

[56964 rows x 2 columns]
Wyoming DataFrame:
                                                    tweet candidate
486     Confessions of the secret suburban Trump 
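
The per-state split in step 2 could also be written with `groupby`, which only yields the state codes that actually occur in the data. A minimal sketch on made-up rows (the `toy` frame and the `state_frames` name are illustrative, not from the notebook):

```python
import pandas as pd

# Toy stand-in for `data`; the rows are invented for illustration only.
toy = pd.DataFrame({
    "tweet": ["tweet a", "tweet b", "tweet c", "tweet d"],
    "state_code": ["CA", "CA", "WY", "WY"],
    "candidate": ["trump", "biden", "trump", "trump"],
})

# One DataFrame per state code, keeping only the columns used later on.
state_frames = {
    state: group[["tweet", "candidate"]]
    for state, group in toy.groupby("state_code")
}

for state, frame in state_frames.items():
    print(state, len(frame), "tweets")
```

Since the groups are derived from the current contents of the frame, codes that were filtered out earlier never produce empty entries in the dictionary.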

In [72]:
#step 3: Within each dataframe relating to a specific state, consider separately the tweets for Trump and the tweets for Biden. Apply to each subset of
#tweets the cleaning and sentiment analysis functions defined in step 1. Display the results of the sentiment analysis for both candidates in each dataframe

# Iterate over each state's DataFrame in the dictionary
for state, df in state_dataframes.items():

    #first consider only tweets referring to trump
    trump_tweets = df[df['candidate'] == 'trump'].copy() #to prevent SettingWithCopyWarning
    # Clean the text in the tweet column
    trump_tweets['cleaned_tweet'] = trump_tweets['tweet'].apply(clean)
    # Apply sentiment analysis functions
    trump_tweets['polarity'] = trump_tweets['cleaned_tweet'].apply(getpolarity)
    trump_tweets['subjectivity'] = trump_tweets['cleaned_tweet'].apply(getsubjectivity)
    trump_tweets['sentiment'] = trump_tweets['polarity'].apply(getAnalysis)
    #display sentiment analysis results in numbers and percentages for trump
    print(f"Sentiment analysis results for Trump in {state}:")
    print(trump_tweets['sentiment'].value_counts())  #in numbers
    print(trump_tweets['sentiment'].value_counts(normalize=True) * 100)  #in percentages
    
    #then consider only tweets referring to biden
    biden_tweets = df[df['candidate'] == 'biden'].copy()
    # Clean the text in the tweet column
    biden_tweets['cleaned_tweet'] = biden_tweets['tweet'].apply(clean)
    # Apply sentiment analysis functions
    biden_tweets['polarity'] = biden_tweets['cleaned_tweet'].apply(getpolarity)
    biden_tweets['subjectivity'] = biden_tweets['cleaned_tweet'].apply(getsubjectivity)
    biden_tweets['sentiment'] = biden_tweets['polarity'].apply(getAnalysis)
    #display sentiment analysis results in numbers and percentages for biden
    print(f"Sentiment analysis results for Biden in {state}:")
    print(biden_tweets['sentiment'].value_counts())  #in numbers
    print(biden_tweets['sentiment'].value_counts(normalize=True) * 100)  #in percentages
    
   
    

Sentiment analysis results for Trump in FL:
sentiment
neutral     7733
positive    5533
negative    3292
Name: count, dtype: int64
sentiment
neutral     46.702500
positive    33.415871
negative    19.881628
Name: proportion, dtype: float64
Sentiment analysis results for Biden in FL:
sentiment
neutral     6545
positive    4546
negative    2187
Name: count, dtype: int64
sentiment
neutral     49.292062
positive    34.237084
negative    16.470854
Name: proportion, dtype: float64
Sentiment analysis results for Trump in OR:
sentiment
positive    1128
neutral     1126
negative     872
Name: count, dtype: int64
sentiment
positive    36.084453
neutral     36.020473
negative    27.895074
Name: proportion, dtype: float64
Sentiment analysis results for Biden in OR:
sentiment
neutral     962
positive    871
negative    417
Name: count, dtype: int64
sentiment
neutral     42.755556
positive    38.711111
negative    18.533333
Name: proportion, dtype: float64
Sentiment analysis results for Trump in DC: