## US Election 2020 Tweets Analysis

The ["US Election 2020 Tweets" dataset](https://www.kaggle.com/datasets/manchunhui/us-election-2020-tweets/data) provides an extensive collection of tweets from around the world that include hashtags related to the 2020 U.S. presidential candidates, Biden and Trump.

### Objectives
For the purposes of this analysis, we will focus exclusively on tweets from the United States, in English, to conduct text analysis with Three main goals:

- **Tweets distribution Analysis**: Describe how tweets are distributed between the two U.S. candidates across federal states;
- **Global Sentiment Analysis**: Assess the sentiment of the global population of Twitter users towards Biden and Trump evaluating which sentiment preveals on each of them;
- **Federal States Sentiment Analysis**: Assess the sentiment of each Federal state Twitter users towards Biden and Trump evaluating which sentiment preveals on each of them and if it coherent with the outcome of 2020 elections.

------

# Tweets Distribution Analysis

### Libraries to Import

In [None]:
# Import Libraries 
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
import plotly.express as px 
import plotly.io as pio
from pathlib import Path
import os

# Libraries for Sentiment Analysis 
import re 
import nltk 
from nltk.corpus import stopwords 
from nltk.corpus import wordnet 
from nltk.stem import WordNetLemmatizer 
from textblob import TextBlob 
from wordcloud import WordCloud 
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords, words
from nltk.probability import FreqDist

### Set the directory

In [14]:
# Define the base path
base_path = Path("C:/Users/Davide/Desktop/Alma Mater/SECOND YEAR/PYTHON/Python_project")
# Change the working directory
os.chdir(base_path)

# Define the full path to the CSV file for Trump and Biden
csv_path_trump = base_path / "data" / "hashtag_donaldtrump.csv"
csv_path_biden = base_path / "data" / "hashtag_joebiden.csv"

# Print the current working directory
print("Current Working Directory:", Path.cwd())

Current Working Directory: C:\Users\Davide\Desktop\Alma Mater\SECOND YEAR\PYTHON\Python_project


### Load Data

Import the "US Election 2020 Tweets" dataset for hashtags of Trump

In [15]:
## Trump 
try:
    trump = pd.read_csv(csv_path_trump, encoding="utf-8", engine='python', on_bad_lines='skip')
    print("First 5 rows of the DataFrame for Trump:")
    print(trump.head())
except Exception as e:
    print("Error loading the file:", e)

First 5 rows of the DataFrame for Trump:
            created_at                tweet_id  \
0  2020-10-15 00:00:01   1.316529221557252e+18   
1  2020-10-15 00:00:01  1.3165292227484303e+18   
2  2020-10-15 00:00:02   1.316529228091847e+18   
3  2020-10-15 00:00:02   1.316529227471237e+18   
4  2020-10-15 00:00:08  1.3165292523014513e+18   

                                               tweet likes  retweet_count  \
0  #Elecciones2020 | En #Florida: #JoeBiden dice ...   0.0            0.0   
1  Usa 2020, Trump contro Facebook e Twitter: cop...  26.0            9.0   
2  #Trump: As a student I used to hear for years,...   2.0            1.0   
3  2 hours since last tweet from #Trump! Maybe he...   0.0            0.0   
4  You get a tie! And you get a tie! #Trump ‘s ra...   4.0            3.0   

               source               user_id              user_name  \
0           TweetDeck           360666534.0     El Sol Latino News   
1    Social Mediaset            331617619.0            

Import the "US Election 2020 Tweets" dataset for hashtags of Biden

In [16]:
## Biden 
try:
    biden = pd.read_csv(csv_path_biden, encoding="utf-8", engine='python', on_bad_lines='skip')
    print("First 5 rows of the DataFrame for Biden:")
    print(biden.head())
except Exception as e:
    print("Error loading the file:", e)

First 5 rows of the DataFrame for Biden:
            created_at                tweet_id  \
0  2020-10-15 00:00:01   1.316529221557252e+18   
1  2020-10-15 00:00:18    1.31652929585929e+18   
2  2020-10-15 00:00:20  1.3165293050069524e+18   
3  2020-10-15 00:00:21  1.3165293080815575e+18   
4  2020-10-15 00:00:22   1.316529312741253e+18   

                                               tweet likes  retweet_count  \
0  #Elecciones2020 | En #Florida: #JoeBiden dice ...   0.0            0.0   
1  #HunterBiden #HunterBidenEmails #JoeBiden #Joe...   0.0            0.0   
2  @IslandGirlPRV @BradBeauregardJ @MeidasTouch T...   0.0            0.0   
3  @chrislongview Watching and setting dvr. Let’s...   0.0            0.0   
4  #censorship #HunterBiden #Biden #BidenEmails #...   1.0            0.0   

               source                user_id           user_name  \
0           TweetDeck            360666534.0  El Sol Latino News   
1    Twitter for iPad            809904438.0         Cheri 

### Data Destription

In the this subsection some descriptive analysis on the dataset concerning ***Trump*** are performed 

In [6]:
# Perform descriptive analysis
print("\nDescriptive Analysis:")
print(trump.head(10))  # Show first 10 rows


Descriptive Analysis:
            created_at                tweet_id  \
0  2020-10-15 00:00:01   1.316529221557252e+18   
1  2020-10-15 00:00:01  1.3165292227484303e+18   
2  2020-10-15 00:00:02   1.316529228091847e+18   
3  2020-10-15 00:00:02   1.316529227471237e+18   
4  2020-10-15 00:00:08  1.3165292523014513e+18   
5  2020-10-15 00:00:17   1.316529291052675e+18   
6  2020-10-15 00:00:17   1.316529289949569e+18   
7  2020-10-15 00:00:18  1.3165292934979625e+18   
8  2020-10-15 00:00:20  1.3165293013329183e+18   
9  2020-10-15 00:00:21  1.3165293085763092e+18   

                                               tweet likes  retweet_count  \
0  #Elecciones2020 | En #Florida: #JoeBiden dice ...   0.0            0.0   
1  Usa 2020, Trump contro Facebook e Twitter: cop...  26.0            9.0   
2  #Trump: As a student I used to hear for years,...   2.0            1.0   
3  2 hours since last tweet from #Trump! Maybe he...   0.0            0.0   
4  You get a tie! And you get a tie! #Tru

In [7]:
print("DataFrame Info:")
trump.info()  # Display DataFrame info

DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 971087 entries, 0 to 971086
Data columns (total 21 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   created_at            971087 non-null  object 
 1   tweet_id              971073 non-null  object 
 2   tweet                 971073 non-null  object 
 3   likes                 971045 non-null  object 
 4   retweet_count         970933 non-null  float64
 5   source                970057 non-null  object 
 6   user_id               970929 non-null  object 
 7   user_name             970911 non-null  object 
 8   user_screen_name      970933 non-null  object 
 9   user_description      869661 non-null  object 
 10  user_join_date        970779 non-null  object 
 11  user_followers_count  970917 non-null  object 
 12  user_location         675830 non-null  object 
 13  lat                   445702 non-null  object 
 14  long                  445705 non-nul

In [8]:
print("Shape of DataFrame:", trump.shape)  # Show shape of the DataFrame

Shape of DataFrame: (971087, 21)


In [9]:
print("Columns in DataFrame:", trump.columns)  # List column names


Columns in DataFrame: Index(['created_at', 'tweet_id', 'tweet', 'likes', 'retweet_count', 'source',
       'user_id', 'user_name', 'user_screen_name', 'user_description',
       'user_join_date', 'user_followers_count', 'user_location', 'lat',
       'long', 'city', 'country', 'continent', 'state', 'state_code',
       'collected_at'],
      dtype='object')


In [10]:
print("Data Types of Columns:")
print(trump.dtypes)  # Print data types of each column

Data Types of Columns:
created_at               object
tweet_id                 object
tweet                    object
likes                    object
retweet_count           float64
source                   object
user_id                  object
user_name                object
user_screen_name         object
user_description         object
user_join_date           object
user_followers_count     object
user_location            object
lat                      object
long                     object
city                     object
country                  object
continent                object
state                    object
state_code               object
collected_at             object
dtype: object


In [None]:
print("Statistical Description:")
print(trump.describe())  # Get statistical description of numeric columns: non informative for our study

Statistical Description:
       retweet_count
count   9.709330e+05
mean    6.950235e+12
std     2.844675e+15
min     0.000000e+00
25%     0.000000e+00
50%     0.000000e+00
75%     0.000000e+00
max     1.322588e+18


Drop NA values in Trump's tweets

In [None]:
# Check for missing values in the DataFrame
print("Missing values in each column:")
print(trump.isnull().sum())  # Show number of missing values per column

Missing values in each column:
created_at                   0
tweet_id                    14
tweet                       14
likes                       42
retweet_count              154
source                    1030
user_id                    158
user_name                  176
user_screen_name           154
user_description        101426
user_join_date             308
user_followers_count       170
user_location           295257
lat                     525385
long                    525382
city                    743907
country                 528355
continent               528338
state                   650473
state_code              670673
collected_at               322
dtype: int64
DataFrame after dropping missing tweets:
            created_at                tweet_id  \
0  2020-10-15 00:00:01   1.316529221557252e+18   
1  2020-10-15 00:00:01  1.3165292227484303e+18   
2  2020-10-15 00:00:02   1.316529228091847e+18   
3  2020-10-15 00:00:02   1.316529227471237e+18   
4  2020-10-15 

In [14]:
# Drop rows with missing values in the 'tweet' column if necessary
if trump['tweet'].isna().sum() > 0:
    trump = trump.dropna(subset=['tweet'])
    print("DataFrame after dropping missing tweets:")
    print(trump.head())  # Display the first few rows of the cleaned DataFrame

In the this subsection some descriptive analysis on the dataset concerning ***Biden*** are performed 

In [17]:
# Perform descriptive analysis
print("\nDescriptive Analysis:")
print(biden.head(10))  # Show first 10 rows


Descriptive Analysis:
            created_at                tweet_id  \
0  2020-10-15 00:00:01   1.316529221557252e+18   
1  2020-10-15 00:00:18    1.31652929585929e+18   
2  2020-10-15 00:00:20  1.3165293050069524e+18   
3  2020-10-15 00:00:21  1.3165293080815575e+18   
4  2020-10-15 00:00:22   1.316529312741253e+18   
5  2020-10-15 00:00:23  1.3165293165079306e+18   
6  2020-10-15 00:00:25  1.3165293244182405e+18   
7  2020-10-15 00:00:31  1.3165293476086784e+18   
8  2020-10-15 00:00:36  1.3165293692009513e+18   
9  2020-10-15 00:00:41  1.3165293928273428e+18   

                                               tweet likes  retweet_count  \
0  #Elecciones2020 | En #Florida: #JoeBiden dice ...   0.0            0.0   
1  #HunterBiden #HunterBidenEmails #JoeBiden #Joe...   0.0            0.0   
2  @IslandGirlPRV @BradBeauregardJ @MeidasTouch T...   0.0            0.0   
3  @chrislongview Watching and setting dvr. Let’s...   0.0            0.0   
4  #censorship #HunterBiden #Biden #Biden

In [None]:
print("DataFrame Info:")
biden.info()  # Display DataFrame info

In [None]:
print("Statistical Description:")
print(biden.describe())  # Get statistical description of numeric columns

In [None]:
print("Shape of DataFrame:", biden.shape)  # Show shape of the DataFrame

In [None]:
print("Columns in DataFrame:", biden.columns)  # List column names

In [None]:
print("Data Types of Columns:")
print(biden.dtypes)  # Print data types of each column

In [None]:
# Get the data type of the 'tweet' column
tweet_type = biden['tweet'].dtype
print(f"\nThe data type of the 'tweet' column is: {tweet_type}")

Drop NA values in Biden's tweets

In [None]:
# Check for missing values in the DataFrame
print("Missing values in each column:")
print(biden.isnull().sum())  # Show number of missing values per column
missing_tweets = biden['tweet'].isna().sum()
print(f"\nNumber of missing tweets: {missing_tweets}")  # Count missing tweets

In [None]:
# Drop rows with missing values in the 'tweet' column
biden = biden.dropna(subset=['tweet'])
print("DataFrame after dropping missing tweets:")
print(biden.head())  # Display the first few rows of the cleaned DataFrame

### Merging the Biden and Trump Dataset

Merge the dataset with Biden and Trump hashtags into a global US tweets data and save it

In [17]:
# creating a new column 'candidate' to differentiate between tweets of Trump and Biden upon concatination 
trump['candidate'] = 'trump'
biden['candidate'] = 'biden'

In [18]:
# combining the dataframes providing a final data shape
data = pd.concat([trump, biden]) 
print('Final Data Shape :', data.shape) 

Final Data Shape : (1748160, 22)


In [19]:
data.columns

Index(['created_at', 'tweet_id', 'tweet', 'likes', 'retweet_count', 'source',
       'user_id', 'user_name', 'user_screen_name', 'user_description',
       'user_join_date', 'user_followers_count', 'user_location', 'lat',
       'long', 'city', 'country', 'continent', 'state', 'state_code',
       'collected_at', 'candidate'],
      dtype='object')

## Data Cleaning

To prepare the dataset for our analysis, we need to focus on tweets from users eligible to vote in the U.S. elections. This involves filtering the data based on the following criteria:

- **Select U.S.-based Tweets**: We will retain only tweets originating from within the United States, excluding tweets from foreign users.

- **Filter for Federal States Only**: Tweets from U.S. territories (e.g., Puerto Rico, Guam) will be excluded from the analysis to focus solely on the 50 federal states.

- **Keep just a tweeter for each user_ID**: assuming that everyone is coeherent with is political ideas (pro Trump or pro Biden) just keep one tweet for each user_ID

**Select U.S.-based Tweets**

In [None]:
# Reset the index as numeric
data.reset_index(inplace=True)  # equivalent methodologies 

data.iloc[971085:971090, 22] 
data.loc[971085:971090, 'candidate']

971085    trump
971086    trump
971087    biden
971088    biden
971089    biden
971090    biden
Name: candidate, dtype: object

In [23]:
# Since we are interested in USA we just keep the tweets correspondig to Country USA
data['country'].unique()
data['country'].value_counts()

country
US                          394395
United Kingdom               58049
India                        40088
Germany                      35379
France                       35292
                             ...  
Florida                          1
Northern Mariana Islands         1
Saint Lucia                      1
Lesotho                          1
East Timor                       1
Name: count, Length: 191, dtype: int64

In [24]:
# There is not a unique name for US
data['country'] = data['country'].replace({'United States of America': "US",'United States': "US"}) 

In [25]:
data = data[data['country'] == 'US']

**Filter for Federal States Only**

In [26]:
unique_US_states_count = data['state_code'].unique()
print( unique_US_states_count)

['FL' 'OR' 'DC' 'CA' 'OH' 'PA' 'IL' 'MI' nan 'NJ' 'MA' 'NH' 'TX' 'SD' 'GA'
 'MO' 'NY' 'CO' 'SC' 'VA' 'AL' 'AZ' 'NC' 'TN' 'NE' 'LA' 'NV' 'MN' 'IN'
 'WA' 'HI' 'WV' 'VT' 'ID' 'PR' 'IA' 'KY' 'ND' 'AR' 'WI' 'UT' 'MT' 'KS'
 'WY' 'ME' 'CT' 'MD' 'NM' 'OK' 'AK' 'DE' 'RI' 'MS' 'GU' 'MP']


In [27]:
# define the US territories
# - **PR**: Puerto Rico
# - **GU**: Guam
# - **MP**: Northern Mariana Islands

# Define a list of territories and DC to exclude
excluded_states = ['PR', 'GU', 'MP']

# Filter out rows where 'state_code' is in the excluded list
federal_states = data[~data['state_code'].isin(excluded_states)]

**Keed a tweet fo every user**

In [28]:
# keep just a tweet for every user assuming that every comment to Trumo or Biden is coherent over time 
data.drop_duplicates(subset=['user_id'])
data.shape

(394395, 24)

### Save The Clean Data

In [29]:
data.to_csv('data/data.csv', index=False)

## Explanatory Analysis

In this section some graphic representation is provided to have an idea on how tweets and likes are distributed over federal states for each candidate 

In [30]:
### Group the data by 'candidate' and count the number of tweets for each candidate 
tweets_count = data.groupby('candidate')['tweet'].count().reset_index() 

# Interactive bar chart 
fig_tweets = px.bar(tweets_count, x='candidate', y='tweet', color='candidate', 
color_discrete_map={'Trump': 'pink', 'Biden': 'blue'}, 
labels={'candidate': 'Candidates', 'tweet': 'Number of Tweets'}, 
title='Tweets for Candidates') 

# Show the chart 
pio.show(fig_tweets)

In [45]:
# Top10 states tweets Counts 
fed_states = data.groupby('state_code')['tweet'].count().sort_values(ascending=False).reset_index() 
 
# Interactive bar chart 
fig_tweet_bystates = px.bar(fed_states, x='state_code', y='tweet', 
template='plotly_dark', 
color_discrete_sequence=px.colors.qualitative.Dark24_r, 
title='Top10 states tweets') 

# To view the graph 
pio.show(fig_tweet_bystates)

In [46]:
# the number of tweets done for each candidate by all the states. 
tweet_df = data.groupby(['state_code', 'candidate'])['tweet'].count().reset_index() 

# Candidate for top 20 state_code tweet 
tweeters = tweet_df[tweet_df['state_code'].isin(fed_states.state_code)] 

# Plot for tweet counts for each candidate in the top 20 state_code 
fig_state_tweet = px.bar(tweeters, x='state_code', y='tweet', color='candidate', 
                         labels={'state_code': 'state_code', 'tweet': 'Number of Tweets', 
                                 'candidate': 'Candidate'}, 
                                 title='Tweet Counts for Each Candidate in the federal states',
                                 template='plotly_dark', 
                                 barmode='group') 

# Show the chart 
pio.show(fig_state_tweet) 