# Sentiment Analysis

What is the overall sentiment of tweets pertaining to COVID? How does it change by province and over time? Do tweets about different topics reveal specific pain points?

In [11]:
from utils import DTYPE, PARSE_DATES, PROV_CONSOLIDATION, CONSOLIDATED_PROVINCES, CONVERTERS
from tqdm.auto import tqdm
import plotly.graph_objects as go
import plotly.express as px
import pandas as pd
import numpy as np
import glob
tqdm.pandas()


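# Map each province to its consolidated name; values without a mapping pass through unchanged.
prov_map = lambda x: PROV_CONSOLIDATION.get(x, x)

# Load every processed CSV, then combine, index by tweet id, and sort chronologically.
data_paths = glob.glob("../data/processed_data/2*.csv")
frames = [pd.read_csv(f, header=0, dtype=DTYPE, converters=CONVERTERS, parse_dates=PARSE_DATES) for f in tqdm(data_paths)]
total_df = pd.concat(frames, axis=0, ignore_index=True).set_index("id").sort_values("created_at")
# Keep only the first occurrence of each tweet id.
total_df = total_df[~total_df.index.duplicated()]

# Truncate timestamps to the day (this drops timezone information).
total_df["created_at"] = total_df["created_at"].dt.to_period("D").dt.to_timestamp('s')
total_df["province"] = total_df["province"].apply(prov_map)
# Drop tweets that have no cleaned text to score.
total_df = total_df[total_df.clean_text.notnull()]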
print(len(total_df))
total_df.head()





Converting to PeriodArray/Index representation will drop timezone information.



473805


id,created_at,screen_name,source,clean_text,original_text,is_retweet,favorite_count,retweet_count,hashtags,urls,mentions,city,province,longitude,latitude
1228469111451242497,2020-02-15,Transport_gc,Hootsuite Inc.,improve many priority pilot project important ...,Improving #RoadSafety in #Canada is one of our...,False,9,1,"[RoadSafety, Canada, seatbelts]",[https://twitter.com/i/web/status/122846911145...,,,,-113.64258,60.10867
1228470050996113408,2020-02-15,4Everanimalz1,Twitter for iPad,improve many priority pilot project important ...,Improving #RoadSafety in #Canada is one of our...,True,0,0,"[RoadSafety, Canada, seatbelts]",,[Transport_gc],Calgary,Alberta,-114.08529,51.05011
1228470466668564481,2020-02-15,Mom_ASDadvocate,Twitter for iPhone,student tcdsb would nice walk place know safe ...,"to me, a student in the TCDSB, this would be s...",True,0,0,,,[leahbanning],Toronto,Ontario,-79.4163,43.70011
1228470535530647552,2020-02-15,camille4change,Twitter for iPhone,student tcdsb would nice walk place know safe ...,"to me, a student in the TCDSB, this would be s...",True,0,0,,,[leahbanning],Hamilton,Ontario,-79.84963,43.25011
1228472099464810498,2020-02-15,DianneWatts4BC,Twitter for iPhone,still discuss pilot do already safety kid rid ...,Why is this still being discussed and piloted ...,False,27,4,,[https://twitter.com/i/web/status/122847209946...,,Surrey,British Columbia,-122.82509,49.10635


In [18]:
from textblob import Blobber

# Blobber() with no arguments uses TextBlob's default pattern-based analyzer.
tb = Blobber()

# Score each cleaned tweet; polarity ranges from -1.0 (negative) to 1.0 (positive).
total_df["polarity"] = total_df["clean_text"].progress_apply(lambda text: tb(text).polarity)
# Label scores above 0.3 as positive and below -0.3 as negative; everything in between is "NA".
total_df["polarity_bucket"] = total_df["polarity"].map(lambda x: "positive" if x > 0.3 else "negative" if x < -0.3 else "NA")
total_df.head()



id,created_at,screen_name,source,clean_text,original_text,is_retweet,favorite_count,retweet_count,hashtags,urls,mentions,city,province,longitude,latitude,polarity,polarity_bucket
1228469111451242497,2020-02-15,Transport_gc,Hootsuite Inc.,improve many priority pilot project important ...,Improving #RoadSafety in #Canada is one of our...,False,9,1,"[RoadSafety, Canada, seatbelts]",[https://twitter.com/i/web/status/122846911145...,,,,-113.64258,60.10867,0.395238,positive
1228470050996113408,2020-02-15,4Everanimalz1,Twitter for iPad,improve many priority pilot project important ...,Improving #RoadSafety in #Canada is one of our...,True,0,0,"[RoadSafety, Canada, seatbelts]",,[Transport_gc],Calgary,Alberta,-114.08529,51.05011,0.395238,positive
1228470466668564481,2020-02-15,Mom_ASDadvocate,Twitter for iPhone,student tcdsb would nice walk place know safe ...,"to me, a student in the TCDSB, this would be s...",True,0,0,,,[leahbanning],Toronto,Ontario,-79.4163,43.70011,0.575,positive
1228470535530647552,2020-02-15,camille4change,Twitter for iPhone,student tcdsb would nice walk place know safe ...,"to me, a student in the TCDSB, this would be s...",True,0,0,,,[leahbanning],Hamilton,Ontario,-79.84963,43.25011,0.575,positive
1228472099464810498,2020-02-15,DianneWatts4BC,Twitter for iPhone,still discuss pilot do already safety kid rid ...,Why is this still being discussed and piloted ...,False,27,4,,[https://twitter.com/i/web/status/122847209946...,,Surrey,British Columbia,-122.82509,49.10635,0.0,NA
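
The bucketing rule used in the cell above can also be written in vectorized form. The following is a minimal sketch, not the notebook's own code: the helper name `bucket_polarity` and its `threshold` parameter are hypothetical, and the sample scores mix three values from the table above with two made-up ones (`-0.5` and `0.3`) to show the boundary behaviour.

```python
import numpy as np
import pandas as pd


def bucket_polarity(polarity: pd.Series, threshold: float = 0.3) -> pd.Series:
    """Label scores as positive/negative/NA (hypothetical helper).

    Uses strict inequalities, so a score exactly at the threshold is "NA",
    matching the nested conditional in the cell above.
    """
    conditions = [polarity > threshold, polarity < -threshold]
    labels = np.select(conditions, ["positive", "negative"], default="NA")
    return pd.Series(labels, index=polarity.index)


# Illustrative scores only: the last two values are not from the dataset.
scores = pd.Series([0.395238, 0.575, 0.0, -0.5, 0.3])
buckets = bucket_polarity(scores)
print(buckets.tolist())
```

Exposing the threshold as a parameter makes it easier to check how sensitive the positive/negative split is to the choice of 0.3.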


In [19]:
# Count tweets per bucket; size() counts every row, whereas count() on "urls" would skip rows where urls is null.
c = total_df.groupby("polarity_bucket").size().reset_index(name="count")
fig = px.pie(c, values='count', names='polarity_bucket')
fig.show()


In [20]:
# Count tweets per day and bucket; size() includes rows where urls is null.
by_date = total_df.groupby(["created_at", "polarity_bucket"]).size().reset_index(name="count")
px.area(by_date, x="created_at", y="count", color="polarity_bucket")
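
Raw counts depend on how many tweets were collected on a given day or in a given province, so comparing groups may be easier with shares than with counts. The sketch below is one possible way to do that, under stated assumptions: the helper `bucket_shares` is hypothetical, and the toy rows are invented for illustration rather than taken from the dataset.

```python
import pandas as pd


def bucket_shares(df: pd.DataFrame, by: str) -> pd.DataFrame:
    """Share of each polarity bucket within every value of `by` (hypothetical helper)."""
    counts = df.groupby([by, "polarity_bucket"]).size().reset_index(name="count")
    # Divide each count by the total for its group so shares sum to 1 per group.
    counts["share"] = counts["count"] / counts.groupby(by)["count"].transform("sum")
    return counts


# Toy rows for illustration only (not taken from the dataset).
toy = pd.DataFrame({
    "province": ["Ontario", "Ontario", "Ontario", "Alberta"],
    "polarity_bucket": ["positive", "NA", "NA", "negative"],
})
shares = bucket_shares(toy, by="province")
print(shares)
```

The same helper could be called as `bucket_shares(total_df, "created_at")` and passed to `px.area` with `y="share"` to plot the daily mix of buckets instead of daily volume.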

In [17]:
list(total_df.sort_values("polarity",ascending=False).head()["original_text"])

['Premier Doug Ford’s “best plan in the country” is receiving a ton of criticism - and rightfully so. #onpoli #onted #SafeSeptember \nhttps://t.co/qAdeJ7UGgL',
 'You can tell there are a lot of teachers chalking MLA sidewalks because the printing is perfect. 🖍️ #SafeSeptemberAB',
 "Even if cases remain a trickle, for every COVID case discovered in a school, two dozen or more kids will be sent home for two weeks of 'isolation'. And that's the best case scenario where this doesn't spread like wildfire in November and everything is shut down again. 1/2",
 'As parents navigate the return to school, in whatever way is best for each family, @HamHealthSci has created this set of videos to help kids and their caregivers deal with new stresses and challenges.  \nhttps://t.co/Yg9kAgSbGD\n\n#Covid_19 \n#mentalhealth \n#BackToSchool\n#parenting https://t.co/S0ofhAZYSS',
 "Broadway's @LauraBenanti asks kids to share their school performances after Coronavirus cancellations. Spoiler alert: The resul