This file contains functions used to analyze the data collected by reddit_crawler.py.

In [66]:
import pandas as pd
from textblob import TextBlob as tb

__author__ = "Aamir Hasan"
__version__ = "1.0"
__email__ = "hasanaamir215@gmail.com"

datafile_path = "../data/got_crawler_data.csv"

In [67]:
data = pd.read_csv(datafile_path)

The data collected by the GoT crawler is loaded into the dataframe named data. Let's look at what we have.

In [68]:
data.head()

Unnamed: 0,id,text,score,subreddit,author,time
0,e52w1bo,The determination of the gender of reptiles is...,1,gameofthrones,LegendaryDragonRider,1535604000.0
1,e52w0az,Your submission has been automatically removed...,1,gameofthrones,AutoModerator,1535604000.0
2,e52vzcx,"Well, between s6 and 7 they didnt really have ...",1,gameofthrones,minemoney123,1535604000.0
3,e52vmb4,There is a few actually. \n\nMy theory is that...,1,gameofthrones,mohelgamal,1535604000.0
4,e52vf6u,[NO SPOILERS] means any comments with spoilers...,1,gameofthrones,AutoModerator,1535603000.0


The data was collected from three subreddits: r/gameofthrones, r/freefolk and r/asoiaf. Let's see how many rows we have from each subreddit.

In [69]:
data['subreddit'].value_counts()

freefolk         1000
asoiaf            998
gameofthrones     982
Name: subreddit, dtype: int64

Now let's convert the time column, which holds Unix epoch seconds, to datetime values we can understand better.

In [70]:
data['time'] = pd.to_datetime(data['time'], unit='s', utc=True)
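As a quick sanity check on the conversion, here is a minimal sketch that converts a single epoch value. The value 1535604171 is an assumed example, chosen to be consistent with the rounded 1535604000.0 displayed in the first row; it is not read from the data file.

```python
import pandas as pd

# Assumed epoch value (seconds since 1970-01-01 UTC), for illustration only.
epoch_seconds = 1535604171

# unit='s' tells pandas the number counts seconds;
# utc=True makes the resulting timestamp timezone-aware (UTC).
stamp = pd.to_datetime(epoch_seconds, unit='s', utc=True)
print(stamp)
```

The printed timestamp should fall on 30 August 2018 in UTC, in line with the converted values shown for the most recent comments.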

In [73]:
data.sort_values('time').tail(10)

Unnamed: 0,id,text,score,subreddit,author,time
985,e52vokx,Probably the same way the high technology of t...,1,asoiaf,yeaokbb,2018-08-30 04:34:49+00:00
1982,e52vp11,badly executed Chicken Myrcella/Marsala pun,1,freefolk,turtleduck,2018-08-30 04:35:05+00:00
1981,e52vtw6,Hold the door,1,freefolk,Aury121,2018-08-30 04:38:07+00:00
984,e52vuk5,There is an island in the middle of a lake nam...,1,asoiaf,erichiro,2018-08-30 04:38:32+00:00
983,e52vxxk,One big reason the analyses of the Winds sampl...,1,asoiaf,GyantSpyder,2018-08-30 04:40:42+00:00
2,e52vzcx,"Well, between s6 and 7 they didnt really have ...",1,gameofthrones,minemoney123,2018-08-30 04:41:35+00:00
982,e52vzj3,Robert's bastards that are still alive like My...,1,asoiaf,GoAvs14,2018-08-30 04:41:42+00:00
1,e52w0az,Your submission has been automatically removed...,1,gameofthrones,AutoModerator,2018-08-30 04:42:12+00:00
0,e52w1bo,The determination of the gender of reptiles is...,1,gameofthrones,LegendaryDragonRider,2018-08-30 04:42:51+00:00
1980,e52w29p,Did you mean to post this on r/Emiliaclarke? ...,1,freefolk,ChiefStark1893,2018-08-30 04:43:27+00:00


It might be interesting to see how many comments come in per day, in what way they are biased against a character, and how that trend changes with time. This could be particularly useful for characters like Cersei, Jaime and Theon, whose ethics change throughout the show.

To do this, let's first add a date column to the dataframe, then extract just the text and the date columns to see how sentiments change.

In [83]:
# Format each timestamp as a day/month/year string, e.g. 30/8/2018
data['date'] = data['time'].apply(lambda x: "%d/%d/%d" % (x.day, x.month, x.year))
data.head(10)

Unnamed: 0,id,text,score,subreddit,author,time,date
0,e52w1bo,The determination of the gender of reptiles is...,1,gameofthrones,LegendaryDragonRider,2018-08-30 04:42:51+00:00,30/8/2018
1,e52w0az,Your submission has been automatically removed...,1,gameofthrones,AutoModerator,2018-08-30 04:42:12+00:00,30/8/2018
2,e52vzcx,"Well, between s6 and 7 they didnt really have ...",1,gameofthrones,minemoney123,2018-08-30 04:41:35+00:00,30/8/2018
3,e52vmb4,There is a few actually. \n\nMy theory is that...,1,gameofthrones,mohelgamal,2018-08-30 04:33:23+00:00,30/8/2018
4,e52vf6u,[NO SPOILERS] means any comments with spoilers...,1,gameofthrones,AutoModerator,2018-08-30 04:29:00+00:00,30/8/2018
5,e52veod,I meant satisfactory from a story telling stan...,1,gameofthrones,mohelgamal,2018-08-30 04:28:42+00:00,30/8/2018
6,e52vdw3,That I must bow so low,1,gameofthrones,shearhead9001,2018-08-30 04:28:14+00:00,30/8/2018
7,e52v79a,Lines on the left are much darker than the rig...,1,gameofthrones,kaylethpop,2018-08-30 04:24:10+00:00,30/8/2018
8,e52v564,Your submission has been automatically removed...,1,gameofthrones,AutoModerator,2018-08-30 04:22:53+00:00,30/8/2018
9,e52uunz,"Erm, you use that anyway.",1,gameofthrones,Aldebaran135,2018-08-30 04:16:39+00:00,30/8/2018


Now we will see how many comments we have for each day.

In [88]:
data['date'].value_counts()

29/8/2018    2128
30/8/2018     714
28/8/2018     138
Name: date, dtype: int64
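Note that the date strings are plain day/month/year text, so they do not sort chronologically, and value_counts orders its result by count rather than by date. One possible alternative, sketched here on a small made-up frame (the sample timestamps are assumptions, not rows from the crawl), floors each timestamp to the start of its day and sorts the counts by date.

```python
import pandas as pd

# Made-up timestamps standing in for the 'time' column after conversion.
sample = pd.DataFrame({
    "time": pd.to_datetime(
        ["2018-08-28 23:10:00", "2018-08-29 01:00:00",
         "2018-08-29 17:45:00", "2018-08-30 04:42:51"],
        utc=True,
    )
})

# Floor each timestamp to the start of its UTC day. The result is still a
# datetime, so it sorts chronologically (unlike a "30/8/2018" string).
sample["day"] = sample["time"].dt.floor("D")

# Count comments per day and order the result by date instead of by count.
per_day = sample["day"].value_counts().sort_index()
print(per_day)
```

Keeping the day as a datetime also makes later steps such as resampling or plotting over time more straightforward once a larger dataset is available.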

Having just three dates isn't enough; we need a larger dataset.

Now, let's extract only the text and the date for the comments.

In [91]:
sentiment_overtime = data[['text', 'date']]
sentiment_overtime.head(10)

Unnamed: 0,text,date
0,The determination of the gender of reptiles is...,30/8/2018
1,Your submission has been automatically removed...,30/8/2018
2,"Well, between s6 and 7 they didnt really have ...",30/8/2018
3,There is a few actually. \n\nMy theory is that...,30/8/2018
4,[NO SPOILERS] means any comments with spoilers...,30/8/2018
5,I meant satisfactory from a story telling stan...,30/8/2018
6,That I must bow so low,30/8/2018
7,Lines on the left are much darker than the rig...,30/8/2018
8,Your submission has been automatically removed...,30/8/2018
9,"Erm, you use that anyway.",30/8/2018
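
With the text and date columns isolated, the per-character idea described earlier could be sketched roughly as follows. This is only an illustration under stated assumptions: daily_sentiment and textblob_polarity are hypothetical helper names, matching a character by a plain substring of the comment text is a simplification, and TextBlob's polarity score is used because the notebook imports TextBlob, not because it is known to suit this data.

```python
import pandas as pd


def textblob_polarity(text):
    """Return TextBlob's polarity score for a string, in the range [-1, 1]."""
    # Imported lazily so daily_sentiment can also be used with another scorer.
    from textblob import TextBlob
    return TextBlob(text).sentiment.polarity


def daily_sentiment(frame, name, scorer=textblob_polarity):
    """Mean sentiment per date for comments mentioning `name`.

    Hypothetical helper: expects 'text' and 'date' columns, as in
    sentiment_overtime.
    """
    # Case-insensitive literal substring match; missing text counts as no match.
    mask = frame["text"].str.contains(name, case=False, regex=False, na=False)
    subset = frame.loc[mask, ["text", "date"]].copy()
    # Score each matching comment, then average the scores within each date.
    subset["polarity"] = subset["text"].apply(scorer)
    return subset.groupby("date")["polarity"].mean()
```

For example, daily_sentiment(sentiment_overtime, "Jaime") would return one mean polarity value per date on which a comment mentions Jaime. With only three dates in the current data, the result would be too short to show a trend.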
