# COVID-19 Vaccine Twitter Data Wrangling

## Contents
1. [Imports](#1.-Imports)
2. [Load Covid Vaccine Twitter Data](#2.-Load-the-Covid-Vaccine-Twitter-Data)
3. [Explore the Data](#3.-Explore-the-Data)
4. [Check Missing Values](#4.-Check-Missing-Values)
5. [Preprocess/Clean hashtags](#5.-Preprocess/Clean-hashtags)
6. [Preprocess/Clean text](#6.-Preprocess/Clean-text)
7. [Save Clean Data](#7.-Save-Clean-Data)

## 1. Imports

In [1]:
# Import packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
import re
import warnings
warnings.filterwarnings('ignore')

## 2. Load Covid Vaccine Twitter Data

In [2]:
# Load Data - CSV file: 'covidvaccine.csv'
file = '../data/covidvaccine.csv'
df = pd.read_csv(file)

## 3. Explore the Data

In [3]:
df.shape

(328619, 13)

In [4]:
df.head()

Unnamed: 0,user_name,user_location,user_description,user_created,user_followers,user_friends,user_favourites,user_verified,date,text,hashtags,source,is_retweet
0,MyNewsNE,Assam,MyNewsNE a dedicated multi-lingual media house...,24-05-2020 10:18,64.0,11.0,110.0,False,18-08-2020 12:55,Australia to Manufacture Covid-19 Vaccine and ...,['CovidVaccine'],Twitter Web App,False
1,Shubham Gupta,,I will tell about all experiences of my life f...,14-08-2020 16:42,1.0,17.0,0.0,False,18-08-2020 12:55,#CoronavirusVaccine #CoronaVaccine #CovidVacci...,"['CoronavirusVaccine', 'CoronaVaccine', 'Covid...",Twitter for Android,False
2,Journal of Infectiology,,Journal of Infectiology (ISSN 2689-9981) is ac...,14-12-2017 07:07,143.0,566.0,8.0,False,18-08-2020 12:46,Deaths due to COVID-19 in Affected Countries\n...,,Twitter Web App,False
3,Zane,,Fresher than you.,18-09-2019 11:01,29.0,25.0,620.0,False,18-08-2020 12:45,@Team_Subhashree @subhashreesotwe @iamrajchoco...,,Twitter for Android,False
4,Ann-Maree O’Connor,"Adelaide, South Australia",Retired university administrator. Melburnian b...,24-01-2013 14:53,83.0,497.0,10737.0,False,18-08-2020 12:45,@michellegrattan @ConversationEDU This is what...,,Twitter Web App,False


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 328619 entries, 0 to 328618
Data columns (total 13 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   user_name         328613 non-null  object 
 1   user_location     286675 non-null  object 
 2   user_description  317700 non-null  object 
 3   user_created      197198 non-null  object 
 4   user_followers    197197 non-null  float64
 5   user_friends      197197 non-null  object 
 6   user_favourites   197197 non-null  object 
 7   user_verified     197197 non-null  object 
 8   date              197195 non-null  object 
 9   text              197197 non-null  object 
 10  hashtags          135581 non-null  object 
 11  source            194798 non-null  object 
 12  is_retweet        197189 non-null  object 
dtypes: float64(1), object(12)
memory usage: 32.6+ MB


In [6]:
# Set 'date' and 'user_created' columns as Datetime
df['date'] = pd.to_datetime(df.date,errors='coerce').dt.strftime('%Y-%m-%d %H:%M')
df['date'] = pd.to_datetime(df.date)

df['user_created'] = pd.to_datetime(df.user_created,errors='coerce').dt.strftime('%Y-%m-%d %H:%M')
df['user_created'] = pd.to_datetime(df.user_created)

In [7]:
# Sort by 'date'
df.sort_values(by='date',inplace=True)

In [8]:
# Set 'user_friends' and 'user_favourites' as float
df['user_friends'] = pd.to_numeric(df.user_friends,errors='coerce')
df['user_favourites'] = pd.to_numeric(df.user_favourites,errors='coerce')

In [9]:
# Set 'user_verified' and 'is_retweet' columns as bool
df['user_verified'] = df.user_verified.astype(bool)
df['is_retweet'] = df.is_retweet.astype(bool)

In [10]:
# Check datatypes
df.dtypes

user_name                   object
user_location               object
user_description            object
user_created        datetime64[ns]
user_followers             float64
user_friends               float64
user_favourites            float64
user_verified                 bool
date                datetime64[ns]
text                        object
hashtags                    object
source                      object
is_retweet                    bool
dtype: object

## 4. Check Missing Values

In [11]:
# Check missing values in columns
def missing_values(data):
    missing = pd.concat([data.isnull().sum(),100*data.isnull().mean()],axis=1)
    missing.columns = ['count','%']
    missing.sort_values(by=['count','%'],ascending=False,inplace=True)
    return missing

In [12]:
missing_values(df)

Unnamed: 0,count,%
hashtags,193038,58.742191
source,133821,40.722235
user_friends,131429,39.99434
user_favourites,131429,39.99434
date,131429,39.99434
user_created,131428,39.994036
user_followers,131422,39.99221
text,131422,39.99221
user_location,41944,12.763717
user_description,10919,3.322693


<br>

***

- **Note:** There are several fields that have around 131,428 missing values. Let's check to see if there are duplicate entries.

***

In [13]:
# Check duplicates
df[df.duplicated()]

Unnamed: 0,user_name,user_location,user_description,user_created,user_followers,user_friends,user_favourites,user_verified,date,text,hashtags,source,is_retweet
185732,Fay Moody,Newcastle,A good head and a good heart are always a form...,NaT,,,,True,NaT,,,,True
185733,Fay Moody,Newcastle,A good head and a good heart are always a form...,NaT,,,,True,NaT,,,,True
185734,Fay Moody,Newcastle,A good head and a good heart are always a form...,NaT,,,,True,NaT,,,,True
185735,Fay Moody,Newcastle,A good head and a good heart are always a form...,NaT,,,,True,NaT,,,,True
185736,Fay Moody,Newcastle,A good head and a good heart are always a form...,NaT,,,,True,NaT,,,,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...
327675,Chelsea Baird,"Dundas, Ontario",waiting for the next big thing to happen in my...,NaT,,,,True,NaT,,,,True
327676,Chelsea Baird,"Dundas, Ontario",waiting for the next big thing to happen in my...,NaT,,,,True,NaT,,,,True
327677,Chelsea Baird,"Dundas, Ontario",waiting for the next big thing to happen in my...,NaT,,,,True,NaT,,,,True
327678,Chelsea Baird,"Dundas, Ontario",waiting for the next big thing to happen in my...,NaT,,,,True,NaT,,,,True


In [14]:
df_dups = df[df.duplicated(keep=False)]
dup_count = df_dups.duplicated(keep=False).groupby(df_dups['user_name']).value_counts()

print(dup_count)
print('Total number of duplicates: '+str(dup_count.sum())) 

user_name          
Chelsea Baird  True    62504
Fay Moody      True    10877
Mr. W. L.      True    58037
dtype: int64
Total number of duplicates: 131418


<br>

***

- **Note:** We found 131,418 duplicate records spread across 3 user_names. We will go ahead and drop them from our dataframe.

***

In [15]:
# Drop duplicates
df.drop_duplicates(keep=False,inplace=True)

In [16]:
# Recheck missing values
missing_values(df)

Unnamed: 0,count,%
hashtags,61620,31.247306
user_location,41944,21.269669
user_description,10919,5.53699
source,2403,1.218554
user_friends,11,0.005578
user_favourites,11,0.005578
date,11,0.005578
user_created,10,0.005071
user_name,6,0.003043
user_followers,4,0.002028


In [17]:
# Let's filter and check the records where there are missing values in the 'user_name' and 'date' column
df.loc[df[['user_name','date']].isnull().any(axis=1)]

Unnamed: 0,user_name,user_location,user_description,user_created,user_followers,user_friends,user_favourites,user_verified,date,text,hashtags,source,is_retweet
51425,,,@PelosiLovesDJT's account is temporarily unava...,2021-01-10 04:52:00,90.0,36.0,37.0,True,2021-01-12 04:17:00,@PelosiLovesDJT's account is temporarily unava...,,Twitter for Android,False
121242,,,@farrahraja's account has been withheld in Ind...,2015-08-22 22:43:00,4989.0,1482.0,145584.0,True,2021-02-06 11:41:00,@farrahraja's account has been withheld in Ind...,,Twitter for Android,False
140102,,,@EngrMuhammadQa9's account has been withheld i...,2020-11-03 09:16:00,6.0,79.0,2528.0,False,2021-02-24 12:41:00,@EngrMuhammadQa9's account has been withheld i...,,Twitter for Android,False
151183,,,@furqanraja1122's account has been withheld in...,2015-10-06 11:25:00,2224.0,2816.0,3591.0,False,2021-02-28 14:36:00,@furqanraja1122's account has been withheld in...,,Twitter Web App,False
167083,,,@SouthwickAlexa's account has been withheld in...,2020-02-20 20:10:00,208.0,1112.0,5247.0,False,2021-03-24 11:04:00,@SouthwickAlexa's account has been withheld in...,,Twitter for Android,False
196815,,,@_AzeemButt's account has been withheld in Ind...,2019-01-14 16:51:00,2696.0,637.0,7863.0,True,2021-04-12 08:31:00,@_AzeemButt's account has been withheld in Ind...,,Twitter for Android,False
23986,#edutwitter #CovidVaccine,"['edutwitter', 'CovidVaccine']",Twitter for iPhone,NaT,,,,True,NaT,,,,True
27430,Samuel,"SA,Mpumalanga secunda",Life is a Gift and every day it a Celebration.,NaT,,,,True,NaT,,,,True
27431,265208E2 #BeyHive #SameLove,2014-05-03 07:38:07,129,NaT,444.0,,,True,NaT,Twitter for Android,False,,True
45326,JTKohlrieser,O-H-I-O,Don’t go around saying the world owes you a li...,NaT,,,,True,NaT,,,,True


<br>

***

- **Note:** Many of these records contain NaN/NaT values in numerous fields including the 'text' and 'hashtags' columns which are going to be key features for our model. We'll go ahead and drop these rows since they do not contain any useful information that we could use.

***

In [18]:
df = df.loc[~df[['user_name','date']].isnull().any(axis=1)]

In [19]:
missing_values(df)

Unnamed: 0,count,%
hashtags,61610,31.244929
user_location,41938,21.26846
user_description,10919,5.537468
source,2392,1.21308
user_name,0,0.0
user_created,0,0.0
user_followers,0,0.0
user_friends,0,0.0
user_favourites,0,0.0
user_verified,0,0.0


In [20]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 197184 entries, 4126 to 183494
Data columns (total 13 columns):
 #   Column            Non-Null Count   Dtype         
---  ------            --------------   -----         
 0   user_name         197184 non-null  object        
 1   user_location     155246 non-null  object        
 2   user_description  186265 non-null  object        
 3   user_created      197184 non-null  datetime64[ns]
 4   user_followers    197184 non-null  float64       
 5   user_friends      197184 non-null  float64       
 6   user_favourites   197184 non-null  float64       
 7   user_verified     197184 non-null  bool          
 8   date              197184 non-null  datetime64[ns]
 9   text              197184 non-null  object        
 10  hashtags          135574 non-null  object        
 11  source            194792 non-null  object        
 12  is_retweet        197184 non-null  bool          
dtypes: bool(2), datetime64[ns](2), float64(3), object(6)
mem

## 5. Preprocess/Clean hashtags

In [21]:
# Drop NaN values in hashtags column
df.dropna(subset=['hashtags'],axis=0,inplace=True)

In [22]:
# Clean hashtags column
df['clean_hashtags'] = df['hashtags'].astype('str')
df['clean_hashtags'] = df['clean_hashtags'].apply(lambda x:x[1:-2]).str.replace(r"[\"\'\-\ー\_]",'').str.lower()

In [23]:
df.head()

Unnamed: 0,user_name,user_location,user_description,user_created,user_followers,user_friends,user_favourites,user_verified,date,text,hashtags,source,is_retweet,clean_hashtags
4126,SouthSide,,"‘ologist, Feminist, mother and grandmother. St...",2013-10-15 18:42:00,40.0,33.0,29884.0,True,2020-01-09 00:04:00,Don’t politicize the #CDC and the #FDA. I’m p...,"['CDC', 'FDA']",Twitter for iPhone,False,"cdc, fda"
4125,Candidate PAJ,,"Assuming Twitter exists in the future, let thi...",2019-01-18 02:06:00,2.0,166.0,41.0,True,2020-01-09 00:05:00,Whomever the President is needs to get the #co...,['covidvaccine'],Twitter for iPhone,False,covidvaccine
4124,💧Salty Noulty💧,"Brisbane, Australia","Inspiring carbon neutral living, science and h...",2007-09-28 19:20:00,398.0,602.0,11076.0,True,2020-01-09 00:09:00,@GregHuntMP You forgot to mention $5 million i...,['covidvaccine'],Twitter for iPhone,False,covidvaccine
4123,PETER MAER,"Washington, DC",Retired White House Corr. \nEdward R. Murrow A...,2010-12-03 15:20:00,13514.0,11646.0,1563.0,True,2020-01-09 00:49:00,Is Trump pressuring #FDA for an October #Covid...,"['FDA', 'Covidvaccine']",Twitter Web App,False,"fda, covidvaccine"
4122,B Sabs,,Wife to my man of choice. Mother of 2 awesome ...,2014-09-09 16:09:00,100.0,316.0,19193.0,True,2020-01-09 00:59:00,Here we go. #COVIDvaccine https://t.co/FZgtcJ6XDP,['COVIDvaccine'],Twitter for iPhone,False,covidvaccine


In [24]:
# Check/Sort unique hashtags
def clean_hashtags(data):
    tokens = [re.sub("'",'',token) for token in data.values]
    clean_hashtag = ', '.join(tokens).split()
    while ',' in clean_hashtag:
        clean_hashtag.remove(',')
    global list_hashtag
    list_hashtag = ''.join(clean_hashtag).lower().split(',')
    print('There are {} unique hashtags'.format(len(set(list_hashtag))))

In [25]:
clean_hashtags(df['clean_hashtags'])

There are 27244 unique hashtags


<br>

***

- **Note:** We found 27,244 unique hashtags. Let's check if there are non-english or non-ASCII characters in the clean_hashtags column.   

***

In [26]:
def isascii_counter(data):
    global isascii_counts
    isascii_counts = [x for x in list_hashtag if x.isascii()==False]
    print('There are {} non-english/non-ASCII hashtags \n'.format(len(isascii_counts)))

In [27]:
isascii_counter(list_hashtag)
print(isascii_counts)

There are 319 non-english/non-ASCII hashtags 

['aqours5th上映会day1', '𝗳𝗮𝗰𝗲𝗺𝗮𝘀𝗸', '𝗳𝗮𝗰𝗲𝗺𝗮𝘀𝗸', 'सुशांतसिंहराजपूत', 'परशुरामनहींकांशीरामचाहिए', 'संदीपक्षमांयाचस्व', 'कृषिनिजीकरणरोको', 'covıd19', '𝐍𝐄𝐖𝐒𝐔𝐏𝐃𝐀𝐓𝐄𝐒', 'भारतकीशानबॉलीवुड', 'ม็อบ17ตุลา', 'حملةمقاطعةالبضائعالتركية', 'iyikivarsıneren', '大演練', 'राहतइंदौरी', 'चलोचलेंमथुराधाम', '10kasım', 'amérique', 'gfriend𓈉', '𝑽𝒂𝒄𝒄𝒊𝒏𝒆', 'كوفيد١٩', 'لقاحكورونا', 'اليومالعالميللاعترافات', 'گلگتبلتستانعمرانکا', 'இந்துக்களின்எதிரிமோடி', 'bjpभगाओदेशबचाओ', 'இந்துக்களின்எதிரிமோடி', 'இந்துக்களின்எதிரிமோடி', 'ม็อบ17พฤศจิกา', 'özlemtüreci', 'गुलनाजकोन्यायदो', 'ม็อบ17พฤศจิกา', 'ประชุมสภา', 'แก้รัฐธรรมนูญ', 'หยุดคุกคามประชาชน', 'كورونا', 'i字バランス部', 'kurallarauyalım', 'kurallarauyalım', 'ม็อบ18พฤศจิกา', 'لعمان', 'ذكرىالبيعةالسادسه', 'الرياضالان', '𝘾𝙤𝙫𝙞𝙙', '𝙐𝙠𝙧𝙖𝙞𝙣𝙚', '𝘾𝙤𝙫𝙞𝙙', '𝙐𝙠𝙧𝙖𝙞𝙣𝙚', '𝗖𝗢𝗩𝗜𝗗𝟭𝟵', 'covi̇d19', 'nowplaying️', 'reet2020noticificationजारीकरे', 'पूछताहैभारत', 'हरहरमहादेव', 'سیاستنہیںجانبچاؤ', 'واكسنبخرید', 'واكسنكرونامطالبهملى', 'كوفيد19', 'employeedelas

<br>

***

- **Note:** Let's go ahead and remove these non-english/non-ASCII hashtags from our clean_hashtags column

***

In [28]:
# Filter hashtags that contain ASCII characters
df = df[df['clean_hashtags'].map(lambda x:x.isascii())]

In [29]:
# Recheck: unique hashtags and non-ASCII hashtags
clean_hashtags(df['clean_hashtags'])
isascii_counter(df['clean_hashtags'])

There are 26930 unique hashtags
There are 0 non-english/non-ASCII hashtags 



<br>

***

- **Note:** After cleaning and filtering hashtags, we found 26,930 unique English/ASCII hashtags.

***

## 6. Preprocess/Clean text

In [30]:
# Let's go ahead and apply some initial preprocessing
def clean_tweets(text):
    text = re.sub(r"(?:\@|https?\://)\S+","",text)
    text = re.sub("[^A-Za-z0-9 ]+","",text)
    text = re.sub("(?<=[a-z])'(?=[a-z])","",text)
    text = re.sub("RT @[\w]*:","",text)
    text = re.sub("@[\w]*","",text)
    text = re.sub("\n","",text)
    text = re.sub(" +"," ",text)
    text = re.sub(r"(?<!\d)[.,;:](?!\d)"," ",text)
    text = re.sub("^\s+|\s+$","",text)
    text = re.sub(r"\s+"," ",text)
    return text.lower()

In [31]:
df['clean_text'] = df['text'].apply(lambda x:clean_tweets(x))

In [32]:
df.head(20)

Unnamed: 0,user_name,user_location,user_description,user_created,user_followers,user_friends,user_favourites,user_verified,date,text,hashtags,source,is_retweet,clean_hashtags,clean_text
4126,SouthSide,,"‘ologist, Feminist, mother and grandmother. St...",2013-10-15 18:42:00,40.0,33.0,29884.0,True,2020-01-09 00:04:00,Don’t politicize the #CDC and the #FDA. I’m p...,"['CDC', 'FDA']",Twitter for iPhone,False,"cdc, fda",dont politicize the cdc and the fda im perfect...
4125,Candidate PAJ,,"Assuming Twitter exists in the future, let thi...",2019-01-18 02:06:00,2.0,166.0,41.0,True,2020-01-09 00:05:00,Whomever the President is needs to get the #co...,['covidvaccine'],Twitter for iPhone,False,covidvaccine,whomever the president is needs to get the cov...
4124,💧Salty Noulty💧,"Brisbane, Australia","Inspiring carbon neutral living, science and h...",2007-09-28 19:20:00,398.0,602.0,11076.0,True,2020-01-09 00:09:00,@GregHuntMP You forgot to mention $5 million i...,['covidvaccine'],Twitter for iPhone,False,covidvaccine,you forgot to mention 5 million in funding fir...
4123,PETER MAER,"Washington, DC",Retired White House Corr. \nEdward R. Murrow A...,2010-12-03 15:20:00,13514.0,11646.0,1563.0,True,2020-01-09 00:49:00,Is Trump pressuring #FDA for an October #Covid...,"['FDA', 'Covidvaccine']",Twitter Web App,False,"fda, covidvaccine",is trump pressuring fda for an october covidva...
4122,B Sabs,,Wife to my man of choice. Mother of 2 awesome ...,2014-09-09 16:09:00,100.0,316.0,19193.0,True,2020-01-09 00:59:00,Here we go. #COVIDvaccine https://t.co/FZgtcJ6XDP,['COVIDvaccine'],Twitter for iPhone,False,covidvaccine,here we go covidvaccine
5969,RecallGavinNewsom,United States,Demand CDC investigation. SB277 warrior mom|sw...,2015-03-31 04:14:00,579.0,561.0,2945.0,True,2020-01-09 01:04:00,It's a #Scamdemic and only the sheeple who wea...,"['Scamdemic', 'CovidVaccine']",Twitter Web App,False,"scamdemic, covidvaccine",its a scamdemic and only the sheeple who wear ...
4121,Name cannot be blank,"New York, USA",Awake to the fact there is only 1 political pa...,2013-07-19 12:21:00,73.0,240.0,12048.0,True,2020-01-09 01:09:00,@USlawreview Yet #Fauci is saying a vaccine mi...,"['Fauci', 'covid']",Twitter for Android,False,"fauci, covid",yet fauci is saying a vaccine might not help d...
4120,rosannemiller,USA,“When all of us band together against injustic...,2011-05-02 19:26:00,2727.0,3564.0,37157.0,True,2020-01-09 01:28:00,"""Step Up To The Plate"" \ninitiate a #uniteforf...","['uniteforfreedom', 'lockdown', 'NewNormal', '...",Twitter Web App,False,"uniteforfreedom, lockdown, newnormal, covidvac...",step up to the plate initiate a uniteforfreedo...
4119,John J. Seng,"Washington, DC","Sr. Advisor, Prosper Group, counsel to PR firm...",2008-02-12 04:42:00,832.0,495.0,317.0,True,2020-01-09 01:56:00,Consider enrolling in this #COVID19 #CovidVacc...,"['COVID19', 'CovidVaccine']",Twitter Web App,False,"covid19, covidvaccine",consider enrolling in this covid19 covidvaccin...
4116,Victoria Hudson,"Craftsbury, VT",We only asked for leopards to guard our thinni...,2009-08-18 22:29:00,1436.0,1818.0,61022.0,True,2020-01-09 02:53:00,.Meanwhile in Canada 🇨🇦 normalcy and true lead...,"['Canada', 'CovidVaccine']",Twitter for iPhone,False,"canada, covidvaccine",meanwhile in canada normalcy and true leadersh...


## 7. Save Clean Data

In [33]:
df.shape

(135322, 15)

In [34]:
# df.to_csv('../data/covidvaccine_cleaned.csv', index=False)