In [1]:
import pandas as pd
import re


NLP First Step: Convert text data into numeric data

In [2]:
# Load DataFrame into variable df
df = pd.read_csv('ExtractedTweets.csv')
df.head()

Unnamed: 0,Party,Handle,Tweet
0,Democrat,RepDarrenSoto,"Today, Senate Dems vote to #SaveTheInternet. P..."
1,Democrat,RepDarrenSoto,RT @WinterHavenSun: Winter Haven resident / Al...
2,Democrat,RepDarrenSoto,RT @NBCLatino: .@RepDarrenSoto noted that Hurr...
3,Democrat,RepDarrenSoto,RT @NALCABPolicy: Meeting with @RepDarrenSoto ...
4,Democrat,RepDarrenSoto,RT @Vegalteno: Hurricane season starts on June...


In [3]:
#Check for null values in the dataset
df.isna().sum()

Party     0
Handle    0
Tweet     0
dtype: int64

In [4]:
# Check the number of tweets
df.shape[0]

86460

### Data Cleaning


In [5]:
# Count the unique values in each column
for col in df.columns:
    selected = df[col]
    n_unique = selected.unique().shape[0]
    print("The column {:40s} has {:6d} unique values".format(col,n_unique))

The column Party                                    has      2 unique values
The column Handle                                   has    433 unique values
The column Tweet                                    has  84502 unique values


There are more tweets (86,460) than unique tweets (84,502). The repeated texts might be retweets, so they are worth a closer look.

The difference is 1,958 tweets.
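
The relationship between these counts can be checked directly. The snippet below is a minimal sketch on a toy DataFrame (the `toy` data is hypothetical, not taken from `ExtractedTweets.csv`); it shows that the gap between the row count and the unique count equals the number of rows flagged by `duplicated()`.

```python
import pandas as pd

# Hypothetical toy data standing in for the real tweets.
toy = pd.DataFrame({
    "Handle": ["a", "a", "b", "c"],
    "Tweet": ["hello", "hello", "hello", "bye"],
})

n_rows = len(toy)                             # total rows
n_unique = toy["Tweet"].nunique()             # distinct tweet texts
n_repeats = toy["Tweet"].duplicated().sum()   # every copy after the first

# The gap between total and unique tweets equals the number of repeated rows.
assert n_rows - n_unique == n_repeats
```

On the real data the same identity gives 86,460 - 84,502 = 1,958.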

Now we will check for duplicate values

In [6]:
df[df['Tweet'].duplicated()]

Unnamed: 0,Party,Handle,Tweet
675,Democrat,RepEspaillat,"RT @OfficialCBC: Join us on Tuesday, May 8, 20..."
702,Democrat,RepEspaillat,RT @DemsEspanol: El senador @marcorubio lo ha ...
1100,Democrat,RepBarragan,RT @NRDems: BREAKING: The Trump admin is movin...
1438,Democrat,RepRoKhanna,RT @Nextgov: Bill from @RepRoKhanna and @RepRa...
1451,Democrat,RepRoKhanna,RT @Nextgov: Bill from @RepRoKhanna and @RepRa...
...,...,...,...
86027,Republican,WaysandMeansGOP,RT @NFIB: Today is the first #TaxDay in years ...
86035,Republican,WaysandMeansGOP,RT @SteveScalise: The bad news → today is Tax ...
86047,Republican,WaysandMeansGOP,RT @RepKevinBrady: Championed by both Republic...
86057,Republican,WaysandMeansGOP,RT @RepKevinBrady: Tomorrow is the LAST time y...


There are 1,958 duplicated rows, which explains the difference between total tweets and unique tweets obtained above.

Next, we will store the duplicated Tweets in a variable and then delete them.

In [7]:
# Store every repeated tweet (all copies after the first occurrence).
duplicate_tweets = df[df['Tweet'].duplicated()]
duplicate_tweets

Unnamed: 0,Party,Handle,Tweet
675,Democrat,RepEspaillat,"RT @OfficialCBC: Join us on Tuesday, May 8, 20..."
702,Democrat,RepEspaillat,RT @DemsEspanol: El senador @marcorubio lo ha ...
1100,Democrat,RepBarragan,RT @NRDems: BREAKING: The Trump admin is movin...
1438,Democrat,RepRoKhanna,RT @Nextgov: Bill from @RepRoKhanna and @RepRa...
1451,Democrat,RepRoKhanna,RT @Nextgov: Bill from @RepRoKhanna and @RepRa...
...,...,...,...
86027,Republican,WaysandMeansGOP,RT @NFIB: Today is the first #TaxDay in years ...
86035,Republican,WaysandMeansGOP,RT @SteveScalise: The bad news → today is Tax ...
86047,Republican,WaysandMeansGOP,RT @RepKevinBrady: Championed by both Republic...
86057,Republican,WaysandMeansGOP,RT @RepKevinBrady: Tomorrow is the LAST time y...


In [8]:
# Delete the duplicate values
# Make sure we only delete duplicate tweets that come from THE SAME account.
# Duplicate tweets from different accounts are treated as separate tweets.

# First, show rows where both Handle and Tweet are duplicated.
# keep=False marks every copy as a duplicate, so we can inspect all of them.
duplicated_df = df[df.duplicated(subset=['Handle','Tweet'], keep = False)]
duplicated_df

Unnamed: 0,Party,Handle,Tweet
1412,Democrat,RepRoKhanna,RT @Nextgov: Bill from @RepRoKhanna and @RepRa...
1438,Democrat,RepRoKhanna,RT @Nextgov: Bill from @RepRoKhanna and @RepRa...
1451,Democrat,RepRoKhanna,RT @Nextgov: Bill from @RepRoKhanna and @RepRa...
3664,Democrat,RepValDemings,Good morning Central Florida!
3683,Democrat,RepValDemings,Good morning Central Florida!
...,...,...,...
83346,Republican,cathymcmorris,Be confident &amp; tell your story. https://t....
84409,Republican,virginiafoxx,Save the Date - Veterans Information Session o...
84418,Republican,virginiafoxx,Save the Date - Veterans Information Session o...
84744,Republican,LamarSmithTX21,RT @Austin_Police: APD is asking the public to...
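
The effect of the `keep` argument is easier to see on a small example. This is a sketch on hypothetical toy data, not on the real dataset: `keep="first"` flags only the later copies, while `keep=False` flags every copy of a repeated (Handle, Tweet) pair.

```python
import pandas as pd

# Hypothetical toy data: handle "a" posts the same text twice.
toy = pd.DataFrame({
    "Handle": ["a", "a", "b"],
    "Tweet": ["hello", "hello", "hello"],
})
subset = ["Handle", "Tweet"]

first_mask = toy.duplicated(subset=subset, keep="first")  # later copies only
all_mask = toy.duplicated(subset=subset, keep=False)      # every copy

print(first_mask.tolist())  # [False, True, False]
print(all_mask.tolist())    # [True, True, False]
```

The row from handle "b" is never flagged, because the same text from a different account counts as a separate tweet.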


In [9]:
# Second, delete the duplicates and keep only the first occurrence
df_cleaned = df.drop_duplicates(subset=['Handle', 'Tweet'], keep='first')
df_cleaned.shape

(86403, 3)

From the original 86,460 tweets, we have deleted 57 tweets that were duplicated within the same Handle.

As a sanity check, we will verify that every duplicated Tweet remaining in the dataset comes from a different Handle.

In [10]:
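# Count the duplicated tweets that have a different 'Handle'.
# Note: this check runs on the original df; the second condition already
# excludes rows that are repeated within the same Handle.

# Identify rows where 'Tweet' is duplicated but the ('Tweet', 'Handle') pair is not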
duplicates_with_different_handle = df[df.duplicated(subset=['Tweet'], keep=False) & ~df.duplicated(subset=['Tweet', 'Handle'], keep=False)]

# Display the resulting DataFrame
duplicates_with_different_handle

Unnamed: 0,Party,Handle,Tweet
8,Democrat,RepDarrenSoto,RT @HispanicCaucus: Trump's anti-immigrant pol...
63,Democrat,RepDarrenSoto,RT @DemsEspanol: El senador @marcorubio lo ha ...
82,Democrat,RepDarrenSoto,RT @HispanicCaucus: Our Members are demanding ...
83,Democrat,RepDarrenSoto,"RT @RepAnthonyBrown: When I served in Iraq, I ..."
114,Democrat,RepDarrenSoto,RT @CaucusOnClimate: 11 lost lives. 5 million ...
...,...,...,...
86027,Republican,WaysandMeansGOP,RT @NFIB: Today is the first #TaxDay in years ...
86035,Republican,WaysandMeansGOP,RT @SteveScalise: The bad news → today is Tax ...
86047,Republican,WaysandMeansGOP,RT @RepKevinBrady: Championed by both Republic...
86057,Republican,WaysandMeansGOP,RT @RepKevinBrady: Tomorrow is the LAST time y...


There are 2,908 tweets that are duplicated across different Handles. Next, we count the number of unique Handles for each duplicated Tweet. If these counts sum to 2,908, every one of these duplicates comes from a distinct Handle.

In [11]:
# Initialize a variable to accumulate the unique handles count
total_unique_handles_count = 0

# Iterate over duplicates and count unique handles
for tweet, group in duplicates_with_different_handle.groupby('Tweet'):
    
    # Obtain all of the unique Handles in the subset group, for each unique Tweet
    unique_handles_count = group['Handle'].nunique()
    
    # Add to the total count, the amount of unique handles per unique Tweet
    total_unique_handles_count += unique_handles_count

# Print the total unique handles count for all tweets
print(f"Total Unique Handles Count for All Tweets: {total_unique_handles_count}")

Total Unique Handles Count for All Tweets: 2908


The sum of unique Handles per Tweet (2,908) equals the number of rows in the duplicates_with_different_handle DataFrame, so each of these duplicated Tweets appears only once per Handle. This is the result we wanted from the cleaning step.
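
A more direct way to confirm the cleaning is to test the cleaned frame itself. The sketch below uses hypothetical toy data and illustrative variable names (`cleaned`, `shared`); it checks that no (Handle, Tweet) pair is repeated after `drop_duplicates`, and computes the same unique-handle total as the loop above with a single `groupby`.

```python
import pandas as pd

# Hypothetical toy data: handle "a" repeats "hello"; "b" posts it once.
toy = pd.DataFrame({
    "Handle": ["a", "a", "b", "c"],
    "Tweet": ["hello", "hello", "hello", "bye"],
})
cleaned = toy.drop_duplicates(subset=["Handle", "Tweet"], keep="first")

# No (Handle, Tweet) pair should appear twice after cleaning.
has_same_handle_dupes = cleaned.duplicated(subset=["Handle", "Tweet"]).any()

# Tweets still shared by several handles, and a vectorised version of the
# loop: unique handles per tweet, summed over all tweets.
shared = cleaned[cleaned.duplicated(subset=["Tweet"], keep=False)]
total_unique_handles = shared.groupby("Tweet")["Handle"].nunique().sum()

assert not has_same_handle_dupes
assert total_unique_handles == len(shared)
```

Applied to the notebook's data, the first assertion would run on `df_cleaned` instead of `toy`.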

### Check for Retweets
A binary column 'Retweet' will be created with the value 1 if the Tweet is a retweet.

In [12]:
# Count the retweets in our dataset
df_cleaned[df_cleaned['Tweet'].str.contains('RT')].shape[0]

19194

In [16]:
# Percentage of retweets in the cleaned dataset
# (numerator and denominator both come from df_cleaned)

pct_of_rt = df_cleaned[df_cleaned['Tweet'].str.contains('RT')].shape[0] / df_cleaned.shape[0] * 100
print(f"The percentage of retweets in the dataset is {pct_of_rt:.2f}%")

The percentage of retweets in the dataset is 22.21%


Now we know that roughly 22.21% of the tweets in the cleaned dataset (19,194 of 86,403) are retweets. That is a substantial share.

We will create a new binary column in the dataset with a value of 1 if the tweet is a retweet and 0 if it is not.

In [20]:
# Create a new empty column that will flag whether each tweet is a retweet
df_cleaned['Retweet'] = None

# Check results
df_cleaned['Retweet']

0        None
1        None
2        None
3        None
4        None
         ... 
86455    None
86456    None
86457    None
86458    None
86459    None
Name: Retweet, Length: 86403, dtype: object

In [21]:
# Fill the column with binary values: 1 if the tweet is a retweet, 0 otherwise

df_cleaned['Retweet'] = df_cleaned['Tweet'].apply(lambda x: 1 if 'RT' in x else 0)

# Check that the number of rows flagged as retweets matches the 19,194 calculated previously
df_cleaned[df_cleaned['Retweet']==1].shape

(19194, 4)
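
One caveat: the test `'RT' in x` matches the letters "RT" anywhere in the text, so a tweet containing a word such as "SUPPORT" is also flagged. The sketch below, on hypothetical example strings, compares that substring test with a stricter pattern. It assumes retweets begin with `RT @handle`, as the sample rows above do; this is an assumption about the data, not something verified for every row.

```python
import pandas as pd

# Hypothetical example tweets.
tweets = pd.Series([
    "RT @NBCLatino: example retweet",
    "Proud to SUPPORT our veterans",  # contains "RT" but is not a retweet
    "Regular tweet",
])

loose = tweets.str.contains("RT")      # substring match anywhere in the text
strict = tweets.str.match(r"RT @\w+")  # must start with "RT @handle"

print(loose.tolist())   # [True, True, False]
print(strict.tolist())  # [True, False, False]
```

Whether the stricter rule changes the retweet count on the real dataset would have to be checked by running it.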

In [22]:
# Check results and new table
df_cleaned.head(50)

Unnamed: 0,Party,Handle,Tweet,Retweet
0,Democrat,RepDarrenSoto,"Today, Senate Dems vote to #SaveTheInternet. P...",0
1,Democrat,RepDarrenSoto,RT @WinterHavenSun: Winter Haven resident / Al...,1
2,Democrat,RepDarrenSoto,RT @NBCLatino: .@RepDarrenSoto noted that Hurr...,1
3,Democrat,RepDarrenSoto,RT @NALCABPolicy: Meeting with @RepDarrenSoto ...,1
4,Democrat,RepDarrenSoto,RT @Vegalteno: Hurricane season starts on June...,1
5,Democrat,RepDarrenSoto,RT @EmgageActionFL: Thank you to all who came ...,1
6,Democrat,RepDarrenSoto,Hurricane Maria left approx $90 billion in dam...,0
7,Democrat,RepDarrenSoto,RT @Tharryry: I am delighted that @RepDarrenSo...,1
8,Democrat,RepDarrenSoto,RT @HispanicCaucus: Trump's anti-immigrant pol...,1
9,Democrat,RepDarrenSoto,RT @RepStephMurphy: Great joining @WeAreUnidos...,1


Now we will extract all of the hashtags in the dataset.

### Create a new column with all of the hashtags for each tweet

In [24]:
# Create a new column with the hashtags for each tweet
df_cleaned['Hashtags'] = None

# Iterate over each row of df_cleaned
for index, row in df_cleaned.iterrows():
    # Use regex findall to collect all hashtags in the row's tweet
    hashtags = re.findall(r'#(\w+)', row['Tweet'])
    if hashtags:
        # Store the list of hashtags, if any, in the 'Hashtags' column for this row
        df_cleaned.at[index, 'Hashtags'] = hashtags

In [26]:
df_cleaned['Hashtags']

0        [SaveTheInternet, NetNeutrality]
1                                    None
2                                    None
3                      [NALCABPolicy2018]
4                                    None
                       ...               
86455                                None
86456                                None
86457                                None
86458                  [CobbBackToSchool]
86459                              [Zika]
Name: Hashtags, Length: 86403, dtype: object
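
The row-by-row loop can also be written with pandas string methods. This is a sketch on a hypothetical toy frame, offered as an alternative rather than as the notebook's method: `str.findall` returns one list per row, and empty lists are mapped to `None` to mirror the loop's output.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({"Tweet": [
    "Vote to #SaveTheInternet and protect #NetNeutrality",
    "No tags here",
]})

# One list of hashtags per row; rows without hashtags get an empty list.
tags = toy["Tweet"].str.findall(r"#(\w+)")

# Map empty lists to None so the column matches the loop's result.
toy["Hashtags"] = tags.apply(lambda found: found if found else None)

print(toy["Hashtags"].tolist())
# [['SaveTheInternet', 'NetNeutrality'], None]
```

This avoids `iterrows()`, which tends to be slow on tens of thousands of rows.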

In [27]:
# Use a regular expression to extract every hashtag into one flat Series (one row per hashtag)
hashtag_list = df_cleaned['Tweet'].str.extractall(r'#(\w+)').reset_index()[0]
hashtag_list

0         SaveTheInternet
1           NetNeutrality
2        NALCABPolicy2018
3           NetNeutrality
4                 Orlando
               ...       
36621     OpeningCeremony
36622             TeamUSA
36623                Zika
36624    CobbBackToSchool
36625                Zika
Name: 0, Length: 36626, dtype: object

In [29]:
# Check Results
df_cleaned.head(10)

Unnamed: 0,Party,Handle,Tweet,Retweet,Hashtags
0,Democrat,RepDarrenSoto,"Today, Senate Dems vote to #SaveTheInternet. P...",0,"[SaveTheInternet, NetNeutrality]"
1,Democrat,RepDarrenSoto,RT @WinterHavenSun: Winter Haven resident / Al...,1,
2,Democrat,RepDarrenSoto,RT @NBCLatino: .@RepDarrenSoto noted that Hurr...,1,
3,Democrat,RepDarrenSoto,RT @NALCABPolicy: Meeting with @RepDarrenSoto ...,1,[NALCABPolicy2018]
4,Democrat,RepDarrenSoto,RT @Vegalteno: Hurricane season starts on June...,1,
5,Democrat,RepDarrenSoto,RT @EmgageActionFL: Thank you to all who came ...,1,
6,Democrat,RepDarrenSoto,Hurricane Maria left approx $90 billion in dam...,0,
7,Democrat,RepDarrenSoto,RT @Tharryry: I am delighted that @RepDarrenSo...,1,[NetNeutrality]
8,Democrat,RepDarrenSoto,RT @HispanicCaucus: Trump's anti-immigrant pol...,1,
9,Democrat,RepDarrenSoto,RT @RepStephMurphy: Great joining @WeAreUnidos...,1,[Orlando]


### Checkpoint
So far, we have done the following with our dataset:
1. Check for null values.
2. Remove duplicate Tweets where the Handle was also duplicated.
3. Create a Retweet binary column to mark all tweets which are RTs: 1 if it is a retweet, otherwise 0.
4. Create a Hashtags column which shows all of the hashtags, if any, for every tweet.
5. Create a Series called **hashtag_list** with all of the hashtags that appear in the dataset.


Our cleaned DataFrame is now saved as the variable **df_cleaned**.

- The original **df** DataFrame is outdated from this point on.
- The DataFrame **duplicate_tweets** holds every repeated occurrence of a Tweet (all copies after the first), regardless of Handle.
- The DataFrame **duplicates_with_different_handle** holds the duplicated Tweets that come from different Handles.

In [30]:
#Save the new DataFrame

df_cleaned.to_csv('PoliticalTweets.csv')
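
Note that a CSV file stores the lists in the Hashtags column as plain text. The sketch below uses a hypothetical toy frame and an in-memory buffer instead of `PoliticalTweets.csv`; it shows one way to restore the lists after reloading, using `ast.literal_eval`. Passing `index=False` is an optional choice made here to avoid an extra unnamed index column on reload.

```python
import ast
import io

import pandas as pd

# Hypothetical toy data with a list-valued column.
toy = pd.DataFrame({
    "Tweet": ["Vote to #SaveTheInternet", "No tags here"],
    "Hashtags": [["SaveTheInternet"], None],
})

# Write to an in-memory buffer rather than to disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

reloaded = pd.read_csv(buffer)

# Lists come back as strings such as "['SaveTheInternet']"; missing values
# come back as NaN. Parse the strings and map everything else to None.
reloaded["Hashtags"] = reloaded["Hashtags"].apply(
    lambda value: ast.literal_eval(value) if isinstance(value, str) else None
)
```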