# Sentiment Analysis of COVID-19 Tweets: When did the Public Panic Set In? Part 2: Processing Tweets

    Notebook by Allison Kelly - allisonkelly42@gmail.com
    
The following notebook picks up where <a href="https://github.com/akelly66/COVID-Tweet-Sentiment/blob/master/tweet-scraping/Twitter-API-Scraping.ipynb">Part 1: Scraping Tweets</a> left off. In Part 2, I aim to process the tweet text to get it into a manageable form for modeling. Once the processing functions have been finalized, I will process the training data according to the same rules. You can learn about the training data in Part 3.

# Imports

In [2]:
%matplotlib inline

# Generic Imports
import pandas as pd
pd.set_option('display.max_colwidth', 150) # See more text
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime, time

# Get JSON
import json
import ast

# Text preprocessing libraries
import string
import contractions
import re
import nltk
from nltk.tokenize import TweetTokenizer
from nltk.corpus import stopwords, wordnet
from nltk.stem.snowball import SnowballStemmer
from nltk.stem import WordNetLemmatizer
from nltk import FreqDist
from nltk.collocations import *
from nltk.collocations import BigramAssocMeasures

# Exploratory data analysis libraries
from wordcloud import WordCloud

# Obtain Data

The method used to obtain the data is documented <a href="https://github.com/akelly66/COVID-Tweet-Sentiment/blob/master/tweet-scraping/COVID-tweets-true.ipynb">here</a>. <br>
<br>The tweet query parameters were as follows:

- <b>Keywords: </b> "coronavirus OR Wuhan virus OR 2019-nCoV OR China flu"<br>
- <b>Date Range: </b> 28 Jan 2020 - 03 Feb 2020<br>
- <b>Location:</b> United States of America<br><br>


In [3]:
df = pd.read_csv("expanded_query_tweets.csv")
df.drop_duplicates(inplace=True)
df = df.query("lang == 'en'")
df.head()

Unnamed: 0,contributors,coordinates,created_at,display_text_range,entities,extended_entities,extended_tweet,favorite_count,favorited,filter_level,...,quoted_status_id_str,quoted_status_permalink,reply_count,retweet_count,retweeted,retweeted_status,source,text,truncated,user
0,,,Sun Feb 02 23:59:59 +0000 2020,,"{'hashtags': [], 'urls': [], 'user_mentions': [{'screen_name': 'QuestForSense', 'name': 'Atakan Derelioglu, PhD', 'id': 1495052767, 'id_str': '149...",,,0,False,low,...,,,0,0,False,"{'created_at': 'Sun Feb 02 20:44:31 +0000 2020', 'id': 1224071120212627456, 'id_str': '1224071120212627456', 'text': 'Amazing Timelapse as China C...","<a href=""http://twitter.com/download/android"" rel=""nofollow"">Twitter for Android</a>","RT @QuestForSense: Amazing Timelapse as China Completes First of Two Hospitals in Wuhan within 10 days having 1,000 beds and 1,400 medical…",False,"{'id': 184207003, 'id_str': '184207003', 'name': '☮️Ope', 'screen_name': 'The_Ope_', 'location': 'Third rock from the sun', 'url': None, 'descript..."
1,,,Sun Feb 02 23:59:58 +0000 2020,,"{'hashtags': [{'text': 'coronavirus', 'indices': [37, 49]}], 'urls': [], 'user_mentions': [{'screen_name': 'selinawangtv', 'name': 'Selina Wang', ...",,,0,False,low,...,,,0,0,False,"{'created_at': 'Sun Feb 02 23:44:46 +0000 2020', 'id': 1224116481950011393, 'id_str': '1224116481950011393', 'text': 'Bloomberg SCOOP on #coronavi...","<a href=""http://twitter.com/#!/download/ipad"" rel=""nofollow"">Twitter for iPad</a>","RT @selinawangtv: Bloomberg SCOOP on #coronavirus impact: Chinese oil demand said to have dropped by about three million barrels a day, or…",False,"{'id': 561036180, 'id_str': '561036180', 'name': 'JJK', 'screen_name': 'jjkenny1', 'location': None, 'url': None, 'description': None, 'translator..."
2,,,Sun Feb 02 23:59:58 +0000 2020,,"{'hashtags': [], 'urls': [], 'user_mentions': [{'screen_name': 'Marfoogle', 'name': 'MARFOOGLE NEWS (OFFICIAL)', 'id': 961504257051521024, 'id_str...",,,0,False,low,...,,,0,0,False,"{'created_at': 'Sun Feb 02 22:31:43 +0000 2020', 'id': 1224098097468305408, 'id_str': '1224098097468305408', 'text': 'I have become Ill. But no wo...","<a href=""http://twitter.com/download/android"" rel=""nofollow"">Twitter for Android</a>","RT @Marfoogle: I have become Ill. But no worries, Its just stuff related to my existing G.I. issues. So No coronavirus here. I saw emails c…",False,"{'id': 1211438984229818368, 'id_str': '1211438984229818368', 'name': 'Laura Turner', 'screen_name': 'LauraTu85646722', 'location': 'Utah, USA', 'u..."
3,,,Sun Feb 02 23:59:58 +0000 2020,,"{'hashtags': [], 'urls': [{'url': 'https://t.co/4LkEEVSqqg', 'expanded_url': 'https://www.npr.org/2020/02/02/802087551/u-s-hospitals-unprepared-fo...",,,0,False,low,...,,,0,0,False,"{'created_at': 'Sun Feb 02 22:26:33 +0000 2020', 'id': 1224096797364088832, 'id_str': '1224096797364088832', 'text': 'U.S. Hospitals Unprepared Fo...","<a href=""http://twitter.com/#!/download/ipad"" rel=""nofollow"">Twitter for iPad</a>",RT @NPRHealth: U.S. Hospitals Unprepared For A Quickly Spreading Coronavirus https://t.co/4LkEEVSqqg,False,"{'id': 2227143195, 'id_str': '2227143195', 'name': 'Dr. Scott Newton 😷↔️😷', 'screen_name': 'DrScottNewton', 'location': 'USA & Global', 'url': 'ht..."
4,,,Sun Feb 02 23:59:58 +0000 2020,,"{'hashtags': [], 'urls': [], 'user_mentions': [{'screen_name': 'PuffDragon11', 'name': 'Puff Dragon', 'id': 1057666719127240704, 'id_str': '105766...",,,0,False,low,...,,,0,0,False,"{'created_at': 'Sat Feb 01 03:05:07 +0000 2020', 'id': 1223442123761975296, 'id_str': '1223442123761975296', 'text': 'Just read the @zerohedge pie...","<a href=""http://twitter.com/download/android"" rel=""nofollow"">Twitter for Android</a>",RT @PuffDragon11: Just read the @zerohedge piece on the coronavirus. My PhD is in Mol. Bio. No doubt this is an engineered bioweapon and no…,False,"{'id': 1443971138, 'id_str': '1443971138', 'name': 'Josh', 'screen_name': 'JoshAFC', 'location': 'London, England', 'url': 'https://www.arsenal.co..."


In [4]:
print(len(df))
print(df.info())
df.describe()

2996
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2996 entries, 0 to 2999
Data columns (total 36 columns):
contributors                 0 non-null float64
coordinates                  0 non-null float64
created_at                   2996 non-null object
display_text_range           228 non-null object
entities                     2996 non-null object
extended_entities            244 non-null object
extended_tweet               234 non-null object
favorite_count               2996 non-null int64
favorited                    2996 non-null bool
filter_level                 2996 non-null object
geo                          0 non-null float64
id                           2996 non-null int64
id_str                       2996 non-null int64
in_reply_to_screen_name      141 non-null object
in_reply_to_status_id        133 non-null float64
in_reply_to_status_id_str    133 non-null float64
in_reply_to_user_id          146 non-null float64
in_reply_to_user_id_str      146 non-null float64
is_

Unnamed: 0,contributors,coordinates,favorite_count,geo,id,id_str,in_reply_to_status_id,in_reply_to_status_id_str,in_reply_to_user_id,in_reply_to_user_id_str,quote_count,quoted_status_id,quoted_status_id_str,reply_count,retweet_count
count,0.0,0.0,2996.0,0.0,2996.0,2996.0,133.0,133.0,146.0,146.0,2996.0,226.0,226.0,2996.0,2996.0
mean,,,1.985981,,1.223419e+18,1.223419e+18,1.223356e+18,1.223356e+18,2.864927e+17,2.864927e+17,0.062083,1.222852e+18,1.222852e+18,0.164553,0.622497
std,,,40.995987,,130385900000000.0,130385900000000.0,238033400000000.0,238033400000000.0,4.388216e+17,4.388216e+17,0.856762,2313826000000000.0,2313826000000000.0,1.68603,9.376703
min,,,0.0,,1.223394e+18,1.223394e+18,1.221944e+18,1.221944e+18,786764.0,786764.0,0.0,1.190002e+18,1.190002e+18,0.0,0.0
25%,,,0.0,,1.223394e+18,1.223394e+18,1.223346e+18,1.223346e+18,64123000.0,64123000.0,0.0,1.222953e+18,1.222953e+18,0.0,0.0
50%,,,0.0,,1.223395e+18,1.223395e+18,1.223389e+18,1.223389e+18,1342405000.0,1342405000.0,0.0,1.22324e+18,1.22324e+18,0.0,0.0
75%,,,0.0,,1.223395e+18,1.223395e+18,1.223393e+18,1.223393e+18,7.661313e+17,7.661313e+17,0.0,1.223373e+18,1.223373e+18,0.0,0.0
max,,,2013.0,,1.22412e+18,1.22412e+18,1.224119e+18,1.224119e+18,1.202716e+18,1.202716e+18,29.0,1.224101e+18,1.224101e+18,48.0,393.0


Many of the tweets are truncated because they are retweets. The full, original tweet can be found in the retweeted_status column. The responses from the Twitter API are nested JSON objects; however, when I converted them into a dataframe, the nested objects became dictionary-like strings. The following cells use ast.literal_eval, which safely parses a string containing a Python literal, to convert each string back into a dictionary and pull the full text of the original tweet. I believe it's important to the sentiment analysis to interpret the original content, as it is the content supported by the profile doing the retweeting.

In [5]:
def evaluate(row):
    '''
    This function, when applied to
    the series containing 
    dictionary-like strings will 
    convert each instance to
    actual dictionaries and return
    the dictionary.
    '''
    
    row = ast.literal_eval(row) 
    return row

# Saving dictionaries in new column
df['expanded'] = df.retweeted_status.dropna().apply(evaluate)
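
To make the conversion step concrete, here is a minimal sketch of why ast.literal_eval is needed instead of json.loads. The string below is a shortened, made-up stand-in for a retweeted_status cell, not a value taken from the dataset.

```python
import ast
import json

# Made-up example of how a nested tweet field looks after a CSV round trip
raw = "{'id': 1, 'text': 'Hello', 'truncated': False, 'url': None}"

# json.loads rejects it: JSON requires double quotes and true/false/null
try:
    json.loads(raw)
    parsed_as_json = True
except json.JSONDecodeError:
    parsed_as_json = False

# ast.literal_eval only accepts Python literals (dicts, lists, strings,
# numbers, booleans, None) and never executes code, unlike eval
record = ast.literal_eval(raw)

print(parsed_as_json)                         # False
print(type(record).__name__, record['text'])  # dict Hello
```

The single quotes, False, and None are Python syntax rather than JSON, which is why the strings have to be parsed as Python literals.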

In [6]:
# Creating features from dictionary keys in new dataframe
expanded_df = df['expanded'].apply(pd.Series)

# Again, unnesting another dictionary to get to the full_text column
expanded_df = expanded_df.extended_tweet.apply(pd.Series)

# Dropping rows corresponding to original tweets (not retweeted text)
expanded_df = expanded_df.full_text.dropna()

In [7]:
# Joining with original dataframe
df = df.join(expanded_df)

In [8]:
# Swapping NaNs for original tweets in the full text column
df['full_text'] = df['full_text'].fillna(df['text'])
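
The unnest, join, and fill steps above can also be read as a single rule per tweet: use the retweeted tweet's extended full_text when it exists, otherwise keep the tweet's own text. Below is a rough sketch of that rule on plain dictionaries. The helper name original_text and the sample dictionaries are hypothetical, and this is an illustration of the logic rather than the code the notebook ran.

```python
def original_text(tweet):
    """Return the retweeted tweet's full text if present, else the tweet's own text."""
    retweeted = tweet.get('retweeted_status')
    if isinstance(retweeted, dict):
        extended = retweeted.get('extended_tweet')
        if isinstance(extended, dict) and extended.get('full_text'):
            return extended['full_text']
    # Original tweets, and retweets short enough not to be extended, fall back here
    return tweet['text']


# Made-up records shaped like the API response
long_retweet = {
    'text': 'RT @someone: a long tweet that was cut…',
    'retweeted_status': {'extended_tweet': {'full_text': 'a long tweet that was cut off here'}},
}
short_retweet = {'text': 'RT @someone: short tweet', 'retweeted_status': {'text': 'short tweet'}}
plain_tweet = {'text': 'my own tweet', 'retweeted_status': None}

print(original_text(long_retweet))
print(original_text(short_retweet))
print(original_text(plain_tweet))
```

As in the cells above, a short retweet keeps its "RT @handle:" prefix at this stage; that prefix is handled later during preprocessing.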

# Preprocess Tweet Functions

Though the entirety of the dataset is a treasure trove of information, I've singled out just the text portion to process for the sentiment analysis. 

In [24]:
tweet_df = df.loc[:,['full_text']]
tweet_df.head()

Unnamed: 0,full_text
0,"Amazing Timelapse as China Completes First of Two Hospitals in Wuhan within 10 days having 1,000 beds and 1,400 medical staff to treat those infec..."
1,"Bloomberg SCOOP on #coronavirus impact: Chinese oil demand said to have dropped by about three million barrels a day, or 20% of total consumption...."
2,"I have become Ill. But no worries, Its just stuff related to my existing G.I. issues. So No coronavirus here. I saw emails concerning my visit to ..."
3,RT @NPRHealth: U.S. Hospitals Unprepared For A Quickly Spreading Coronavirus https://t.co/4LkEEVSqqg
4,Just read the @zerohedge piece on the coronavirus. My PhD is in Mol. Bio. No doubt this is an engineered bioweapon and not natural. Statistically ...


In [59]:
def remove_url_and_RT(row):
    '''
    This function takes each tweet
    and removes the urls and retweet
    indicator from them.
    '''
    
    row = re.sub('https://[A-Za-z0-9./]+',"",row)
    row = re.sub('http://[A-Za-z0-9./]+',"",row)
    row = re.sub('^RT',"", row)
    return row

tweet_df.full_text = tweet_df.full_text.apply(remove_url_and_RT)
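
The character class in the patterns above stops at the first character that is not a letter, digit, period, or slash, so a URL containing a hyphen or query string would be only partly removed. A possible variant, sketched here under the assumption that a URL ends at the next whitespace, is shown below. The names strip_url_and_rt, URL_PATTERN, and RT_PATTERN are hypothetical and not part of the notebook.

```python
import re

# Assumption: a URL runs from its scheme to the next whitespace character
URL_PATTERN = re.compile(r'https?://\S+')
# The retweet marker only counts at the very start of the tweet
RT_PATTERN = re.compile(r'^RT\s+')


def strip_url_and_rt(text):
    """Remove URLs and a leading retweet marker, then trim surrounding whitespace."""
    text = URL_PATTERN.sub('', text)
    text = RT_PATTERN.sub('', text)
    return text.strip()


example = ('RT @NPRHealth: U.S. Hospitals Unprepared For A Quickly '
           'Spreading Coronavirus https://t.co/4LkEEVSqqg')
print(strip_url_and_rt(example))
```

The trade-off is that \S+ also swallows punctuation attached to the end of a URL, which is acceptable here because punctuation is removed in the next step anyway.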

In [32]:
tweet_df.head()

Unnamed: 0,full_text
0,"Amazing Timelapse as China Completes First of Two Hospitals in Wuhan within 10 days having 1,000 beds and 1,400 medical staff to treat those infec..."
1,"Bloomberg SCOOP on #coronavirus impact: Chinese oil demand said to have dropped by about three million barrels a day, or 20% of total consumption...."
2,"I have become Ill. But no worries, Its just stuff related to my existing G.I. issues. So No coronavirus here. I saw emails concerning my visit to ..."
3,@NPRHealth: U.S. Hospitals Unprepared For A Quickly Spreading Coronavirus
4,Just read the @zerohedge piece on the coronavirus. My PhD is in Mol. Bio. No doubt this is an engineered bioweapon and not natural. Statistically ...


In [33]:
def clean_tweet(tweet):
    
    '''
    This function takes a tweet string,
    removes punctuation (except '#') and
    non-ASCII characters, sets all words
    to lowercase, strips trailing whitespace,
    and returns the cleaned tweet as a
    single-element list.
    '''
    
    # Grabbing most common punctuation symbols and ellipsis symbol
    punctuation_list = list(string.punctuation)+ ["…"] + ['’']
    punctuation_list.remove('#')
    
    # Removing each punctuation symbol in turn
    for symbol in punctuation_list:
        tweet = tweet.replace(symbol, "")
    
    tweet = tweet.lower()
    
    # Cleaning non-ASCII characters
    tweet = re.sub("([^\x00-\x7F])+","",tweet)
    
    # Removing trailing whitespace
    tweet = tweet.rstrip()
    
    return [tweet]

cleaned_tweet_test = clean_tweet(tweet_df.full_text[1])
cleaned_tweet_test        

['bloomberg scoop on #coronavirus impact chinese oil demand said to have dropped by about three million barrels a day or 20 of total consumption china is the worlds largest oil importer w outsized impact on the global energy mkt business quicktake']
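
For comparison, the same cleaning rules can be expressed with str.translate, which removes all unwanted punctuation in one pass instead of one replace call per symbol. This is only a sketch: the name clean_text is hypothetical, and unlike clean_tweet it returns a plain string rather than a one-element list.

```python
import re
import string

# Every ASCII punctuation mark except '#', so hashtags survive
_DROP = string.punctuation.replace('#', '')
_TABLE = str.maketrans('', '', _DROP)


def clean_text(tweet):
    """Drop punctuation (keeping '#'), lowercase, remove non-ASCII, trim the end."""
    tweet = tweet.translate(_TABLE).lower()
    # The ellipsis and curly apostrophe are non-ASCII, so this step removes them too
    tweet = re.sub(r'[^\x00-\x7F]+', '', tweet)
    return tweet.rstrip()


print(clean_text('Bloomberg SCOOP on #coronavirus impact: 20% of total… '))
# bloomberg scoop on #coronavirus impact 20 of total
```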

In [34]:
def tokenize(clean_tweet):
    
    '''
    This function takes a cleaned tweet,
    joins it into one string (if not already),
    runs the tweet through NLTK's TweetTokenizer,
    removes English stopwords, replaces "us"
    with "usa", removes numbers, and returns
    the tokenized tweet in list format.
    '''
    
    joined_tweet = ' '.join(clean_tweet)
    stopwords_list = stopwords.words('english')
    
    tokenizer = TweetTokenizer()
    tokenized_tweet = tokenizer.tokenize(joined_tweet)
    # Removing stopwords
    tokenized_tweet = [word for word in tokenized_tweet if word not in stopwords_list]
    
    # Subbing 'usa' for 'us'
    tokenized_tweet = ['usa' if word == 'us' else word for word in tokenized_tweet]
    
    # Removing numbers
    tokenized_tweet = [word for word in tokenized_tweet if not word.isnumeric()]
    
    return tokenized_tweet

    

tokenized_tweet_test = tokenize(cleaned_tweet_test)
tokenized_tweet_test

['bloomberg',
 'scoop',
 '#coronavirus',
 'impact',
 'chinese',
 'oil',
 'demand',
 'said',
 'dropped',
 'three',
 'million',
 'barrels',
 'day',
 'total',
 'consumption',
 'china',
 'worlds',
 'largest',
 'oil',
 'importer',
 'w',
 'outsized',
 'impact',
 'global',
 'energy',
 'mkt',
 'business',
 'quicktake']

In [37]:
def lem_tweet(tweet):
    '''
    This function takes a tweet in
    the form of a tokenized
    word list and lemmatizes it.
    '''
    lemmatizer = WordNetLemmatizer()
    
    lemmed_tweet = [lemmatizer.lemmatize(word) for word in tweet]
    
    return lemmed_tweet

lemmed_tweet_test = lem_tweet(tokenized_tweet_test)

In [38]:
lemmed_tweet_test

['bloomberg',
 'scoop',
 '#coronavirus',
 'impact',
 'chinese',
 'oil',
 'demand',
 'said',
 'dropped',
 'three',
 'million',
 'barrel',
 'day',
 'total',
 'consumption',
 'china',
 'world',
 'largest',
 'oil',
 'importer',
 'w',
 'outsized',
 'impact',
 'global',
 'energy',
 'mkt',
 'business',
 'quicktake']

In [39]:
def stem_tweet(tweet):
    '''
    This function takes a tweet in
    the form of a tokenized
    word list and stems each word.
    '''
    stemmer = SnowballStemmer('english')
    stemmed_tweet = [stemmer.stem(word) for word in tweet]
    
    return stemmed_tweet

stem_test = stem_tweet(lemmed_tweet_test)
stem_test

['bloomberg',
 'scoop',
 '#coronavirus',
 'impact',
 'chines',
 'oil',
 'demand',
 'said',
 'drop',
 'three',
 'million',
 'barrel',
 'day',
 'total',
 'consumpt',
 'china',
 'world',
 'largest',
 'oil',
 'import',
 'w',
 'outsiz',
 'impact',
 'global',
 'energi',
 'mkt',
 'busi',
 'quicktak']

In [60]:
def process_tweet(tweet):
    '''
    This function takes an original 
    tweet, cleans, tokenizes, 
    and lemmatizes the tweet.
    '''
    
    cleaned = clean_tweet(tweet)
    tokenized = tokenize(cleaned)
#     stemmed_tweet = stem_tweet(tokenized)
    lemmed_tweet = lem_tweet(tokenized)
    
    return lemmed_tweet

In [43]:
#Preprocessing COVID tweets
tweet_df['processed_tweets'] = tweet_df['full_text'].apply(process_tweet)

# Resetting index
tweet_df = tweet_df.reset_index(drop=True)
tweet_df.head()

Unnamed: 0,full_text,processed_tweets
0,"Amazing Timelapse as China Completes First of Two Hospitals in Wuhan within 10 days having 1,000 beds and 1,400 medical staff to treat those infec...","[amazing, timelapse, china, completes, first, two, hospital, wuhan, within, day, bed, medical, staff, treat, infected, #coronavirus, #coronaviruso..."
1,"Bloomberg SCOOP on #coronavirus impact: Chinese oil demand said to have dropped by about three million barrels a day, or 20% of total consumption....","[bloomberg, scoop, #coronavirus, impact, chinese, oil, demand, said, dropped, three, million, barrel, day, total, consumption, china, world, large..."
2,"I have become Ill. But no worries, Its just stuff related to my existing G.I. issues. So No coronavirus here. I saw emails concerning my visit to ...","[become, ill, worry, stuff, related, existing, gi, issue, coronavirus, saw, email, concerning, visit, provedence, hospitalhome, first, usa, case, ..."
3,@NPRHealth: U.S. Hospitals Unprepared For A Quickly Spreading Coronavirus,"[nprhealth, usa, hospital, unprepared, quickly, spreading, coronavirus]"
4,Just read the @zerohedge piece on the coronavirus. My PhD is in Mol. Bio. No doubt this is an engineered bioweapon and not natural. Statistically ...,"[read, zerohedge, piece, coronavirus, phd, mol, bio, doubt, engineered, bioweapon, natural, statistically, improbably, segment, map, completely, d..."


Looking good! Below, I'll process the training tweets. The dataset will be explored in Part 3. 

In [47]:
training_set = pd.read_csv('training_tweets.csv',index_col=0)
training_set.head()


Unnamed: 0,polarity,tweet_id,date,query,twitter_handle,tweet,processed_tweets
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D","['switchfoot', 'httptwitpiccom', '2y1zl', 'awww', 'thats', 'bummer', 'shoulda', 'got', 'david', 'carr', 'third', 'day']"
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by texting it... and might cry as a result School today also. Blah!,"['upset', 'cant', 'update', 'facebook', 'texting', 'might', 'cry', 'result', 'school', 'today', 'also', 'blah']"
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Managed to save 50% The rest go out of bounds,"['kenichan', 'dived', 'many', 'time', 'ball', 'managed', 'save', 'rest', 'go', 'bound']"
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire,"['whole', 'body', 'feel', 'itchy', 'like', 'fire']"
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all. i'm mad. why am i here? because I can't see you all over there.","['nationwideclass', 'behaving', 'im', 'mad', 'cant', 'see']"


In [53]:
training_set = training_set.loc[:,["polarity","tweet"]]
training_set.head()

Unnamed: 0,polarity,tweet
0,0,"@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D"
1,0,is upset that he can't update his Facebook by texting it... and might cry as a result School today also. Blah!
2,0,@Kenichan I dived many times for the ball. Managed to save 50% The rest go out of bounds
3,0,my whole body feels itchy and like its on fire
4,0,"@nationwideclass no, it's not behaving at all. i'm mad. why am i here? because I can't see you all over there."


In [61]:
#Preprocessing training tweets
training_set['processed_train'] = training_set['tweet'].apply(process_tweet)

# Resetting index
training_set = training_set.reset_index(drop=True)
training_set.head()

Unnamed: 0,polarity,tweet,processed_train
0,0,"@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D","[switchfoot, httptwitpiccom, 2y1zl, awww, thats, bummer, shoulda, got, david, carr, third, day]"
1,0,is upset that he can't update his Facebook by texting it... and might cry as a result School today also. Blah!,"[upset, cant, update, facebook, texting, might, cry, result, school, today, also, blah]"
2,0,@Kenichan I dived many times for the ball. Managed to save 50% The rest go out of bounds,"[kenichan, dived, many, time, ball, managed, save, rest, go, bound]"
3,0,my whole body feels itchy and like its on fire,"[whole, body, feel, itchy, like, fire]"
4,0,"@nationwideclass no, it's not behaving at all. i'm mad. why am i here? because I can't see you all over there.","[nationwideclass, behaving, im, mad, cant, see]"


## Word Vectorization

This portion will be moved to another notebook. Please ignore for now.

In [2]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [3]:
tweet_df = pd.read_csv('processed_tweets.csv')

In [4]:
tweet_list = tweet_df.processed_tweets.apply(('').join)

In [5]:
tweet_list

0       ['amazing', 'timelapse', 'china', 'completes', 'first', 'two', 'hospital', 'wuhan', 'within', 'day', 'bed', 'medical', 'staff', 'treat', 'infected...
1       ['bloomberg', 'scoop', '#coronavirus', 'impact', 'chinese', 'oil', 'demand', 'said', 'dropped', 'three', 'million', 'barrel', 'day', 'total', 'con...
2       ['become', 'ill', 'worry', 'stuff', 'related', 'existing', 'gi', 'issue', 'coronavirus', 'saw', 'email', 'concerning', 'visit', 'provedence', 'hos...
3                                                                       ['nprhealth', 'usa', 'hospital', 'unprepared', 'quickly', 'spreading', 'coronavirus']
4       ['read', 'zerohedge', 'piece', 'coronavirus', 'phd', 'mol', 'bio', 'doubt', 'engineered', 'bioweapon', 'natural', 'statistically', 'improbably', '...
                                                                                ...                                                                          
2991    ['due', 'threat', 'novel', 'coronavirus', 's
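
Note that the output above still shows brackets and quotes. Because processed_tweets was read back from a CSV, each entry is a string that looks like a list, so joining it with an empty string returns it unchanged. The sketch below, using a value copied from the table above, shows the difference between that and parsing the string first; joining with a space afterwards is my assumption about the intended result.

```python
import ast

# How a processed tweet looks after being saved to CSV and read back
stored = "['nprhealth', 'usa', 'hospital', 'unprepared', 'quickly', 'spreading', 'coronavirus']"

# Iterating a string yields its characters, so ''.join gives the same string back
unchanged = ''.join(stored)

# Parsing first recovers the token list, which can then be joined with spaces
tokens = ast.literal_eval(stored)
document = ' '.join(tokens)

print(unchanged == stored)  # True
print(document)             # nprhealth usa hospital unprepared quickly spreading coronavirus
```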

In [6]:
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(tweet_list)
dense_matrix = tfidf_matrix.todense()
dense_list = dense_matrix.tolist()

In [7]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
tfidf_df = pd.DataFrame(dense_list, columns = vectorizer.get_feature_names_out(), index=tweet_list)

In [8]:
tfidf_df.head()

processed_tweets,aa,aaaaggghhhhh,aampe,aasma,abbvie,abc,abcchicago,abcnews,abcnewsbayarea,abdirashidm,...,zhengli,zhou,zimbabwe,zimbabwean,zls,zombie,zone,zoonotic,zorrillaalex,zxrnoh
"['amazing', 'timelapse', 'china', 'completes', 'first', 'two', 'hospital', 'wuhan', 'within', 'day', 'bed', 'medical', 'staff', 'treat', 'infected', '#coronavirus', '#coronavirusoutbreak']",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
"['bloomberg', 'scoop', '#coronavirus', 'impact', 'chinese', 'oil', 'demand', 'said', 'dropped', 'three', 'million', 'barrel', 'day', 'total', 'consumption', 'china', 'world', 'largest', 'oil', 'importer', 'w', 'outsized', 'impact', 'global', 'energy', 'mkt', 'business', 'quicktake']",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
"['become', 'ill', 'worry', 'stuff', 'related', 'existing', 'gi', 'issue', 'coronavirus', 'saw', 'email', 'concerning', 'visit', 'provedence', 'hospitalhome', 'first', 'usa', 'case', 'contact', 'anyone', 'kept', 'distance', 'hospital']",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
"['nprhealth', 'usa', 'hospital', 'unprepared', 'quickly', 'spreading', 'coronavirus']",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
"['read', 'zerohedge', 'piece', 'coronavirus', 'phd', 'mol', 'bio', 'doubt', 'engineered', 'bioweapon', 'natural', 'statistically', 'improbably', 'segment', 'map', 'completely', 'different', 'virus', 'hiv', 'conservation']",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [9]:
tfidf_df.shape

(2996, 5740)

From 2,996 tweets, there is a total of 5,740 unique words. This includes hashtags and tagged handles. 
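
One caveat when reading that vocabulary: TfidfVectorizer's default token_pattern is (?u)\b\w\w+\b, which keeps only runs of two or more word characters. The sketch below mimics that pattern with the re module alone (it does not call scikit-learn), using a shortened version of the second tweet's token list, to show what happens to a hashtag and a one-letter token.

```python
import re

# The default token_pattern used by scikit-learn's text vectorizers
TOKEN_PATTERN = re.compile(r'(?u)\b\w\w+\b')

# Shortened token list from the second tweet, in its stored string form
stored = "['bloomberg', 'scoop', '#coronavirus', 'w', 'mkt']"

# The vectorizer lowercases by default before applying the pattern
tokens = TOKEN_PATTERN.findall(stored.lower())
print(tokens)  # ['bloomberg', 'scoop', 'coronavirus', 'mkt']
```

The '#' is dropped, so a hashtag and the plain word share one column, and single-character tokens such as 'w' never reach the vocabulary. This also explains why the brackets and quotes in the stored strings do not show up as features.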

In [10]:
train_df = pd.read_csv("processed_train.csv")

In [11]:
train_tweet_list = train_df.processed_tweets.apply(('').join)

In [12]:
train_tweet_list

0          ['switchfoot', 'httptwitpiccom', '2y1zl', 'awww', 'thats', 'bummer', 'shoulda', 'got', 'david', 'carr', 'third', 'day']
1                  ['upset', 'cant', 'update', 'facebook', 'texting', 'might', 'cry', 'result', 'school', 'today', 'also', 'blah']
2                                          ['kenichan', 'dived', 'many', 'time', 'ball', 'managed', 'save', 'rest', 'go', 'bound']
3                                                                               ['whole', 'body', 'feel', 'itchy', 'like', 'fire']
4                                                                      ['nationwideclass', 'behaving', 'im', 'mad', 'cant', 'see']
                                                                    ...                                                           
1599995                                                                              ['woke', 'school', 'best', 'feeling', 'ever']
1599996                                           ['thewdbcom', 'cool', 'hear', 'ol

In [None]:
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(train_tweet_list)
dense_matrix = tfidf_matrix.todense()
dense_list = dense_matrix.tolist()

In [None]:
# The index must be the training tweets, not the COVID tweets
tfidf_df = pd.DataFrame(dense_list, columns = vectorizer.get_feature_names_out(), index=train_tweet_list)