## 02_Data Cleaning, EDA, & Preprocessing

This notebook reads in the data that was scraped from three subreddits with the Pushshift API and exported by the previous notebook. Here the data undergoes further cleaning and preprocessing for natural language processing.

## Contents
- [Import Packages](#Import-Packages)
- [Read in Data](#Read-in-Data)
- [Combine Dataframes for Comparison](#Combine-Dataframes-for-Comparison:)
- [Create Target Column](#Create-Target-Column)
- [Clean & Tokenize Submissions and Comments](#Clean-&-Tokenize-Submissions-and-Comments)
- [Stem and Lemmatize Tokens](#Stem-and-Lemmatize-Tokens)
- [Sentiment Analysis Processing](#Sentiment-Analysis-Processing)
- [Export Processed Dataframes for Modeling](#Export-Processed-Dataframes-for-Modeling)

## Import Packages

In [36]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import re
%matplotlib inline

from nltk.tokenize import RegexpTokenizer
from nltk import WordNetLemmatizer
from nltk.stem import PorterStemmer
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from afinn import Afinn

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

from collections import Counter


### Read in Data

In [3]:
sdc_sub = pd.read_csv('./datasets/selfdriving_subs.csv', index_col=[0], na_filter = False)
fut_sub = pd.read_csv('./datasets/future_subs.csv', index_col=[0], na_filter = False)
tech_sub = pd.read_csv('./datasets/tech_subs.csv', index_col=[0], na_filter = False)
ai_sub = pd.read_csv('./datasets/ai_subs.csv', index_col=[0], na_filter = False)

sdc_com = pd.read_csv('./datasets/selfdriving_coms.csv', index_col=[0], na_filter = False)
fut_com = pd.read_csv('./datasets/future_coms.csv', index_col=[0], na_filter = False)
tech_com = pd.read_csv('./datasets/tech_coms.csv', index_col=[0], na_filter = False)
ai_com = pd.read_csv('./datasets/ai_coms.csv', index_col=[0], na_filter = False)

sdc_srch_sub = pd.read_csv('./datasets/selfdriving_srch_sub.csv', index_col=[0], na_filter = False)
fut_srch_sub = pd.read_csv('./datasets/future_srch_sub.csv', index_col=[0], na_filter = False)
tech_srch_sub = pd.read_csv('./datasets/tech_srch_sub.csv', index_col=[0], na_filter = False)
ai_srch_sub = pd.read_csv('./datasets/ai_sub.csv', index_col=[0], na_filter = False)

sdc_srch_com = pd.read_csv('./datasets/selfdriving_srch_com.csv', index_col=[0], na_filter = False)
fut_srch_com = pd.read_csv('./datasets/future_com.csv', index_col=[0], na_filter = False)
tech_srch_com = pd.read_csv('./datasets/tech_com.csv', index_col=[0], na_filter = False)
ai_srch_com = pd.read_csv('./datasets/ai_srch_com.csv', index_col=[0], na_filter = False)

In [4]:
search_terms = ['self-driving', 'self driving', 'autonomous vehicle', 'driverless']

In [5]:
# EDA and PreProcessing

# Ignore AI data from here on.  
# Will Focus on the other three.

In [6]:
# Check for Nulls
#sdc_sub.isnull().sum()
#fut_sub.isnull().sum()
#tech_sub.isnull().sum()
#sdc_com.isnull().sum()
#fut_com.isnull().sum()
tech_com.isnull().sum()

author          0
body            0
created_utc     0
id              0
parent_id       0
subreddit       0
subreddit_id    0
dtype: int64
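
Because every file above was read with `na_filter = False`, pandas leaves empty fields as empty strings instead of converting them to `NaN`, so `isnull().sum()` will report zero even when text is missing. A minimal sketch on made-up rows (the inline CSV and its values are illustrative, not taken from the datasets) shows the difference, along with a check that also catches blank strings:

```python
import io
import pandas as pd

# Hypothetical two-row CSV; the second row has an empty body.
raw = "id,author,body\n0,alice,hello world\n1,bob,\n"

with_filter = pd.read_csv(io.StringIO(raw), index_col=[0])
no_filter = pd.read_csv(io.StringIO(raw), index_col=[0], na_filter=False)

nulls_with_filter = int(with_filter['body'].isnull().sum())  # empty field becomes NaN
nulls_no_filter = int(no_filter['body'].isnull().sum())      # empty field stays ''

# Count blank or whitespace-only strings, which isnull() cannot see
blank_count = int((no_filter['body'].str.strip() == '').sum())

print(nulls_with_filter, nulls_no_filter, blank_count)
```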

In [7]:
# Check Dtypes
sdc_sub.dtypes

author          object
created_utc      int64
id              object
full_link       object
num_comments     int64
subreddit       object
subreddit_id    object
title           object
dtype: object

In [8]:
# Function to drop duplicates when passed a list of dataframes

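def drop_dups(df_list, typ):
    """Drop duplicated rows in place for each dataframe in df_list.

    typ='sub' compares on 'title'; anything else compares on 'body'.
    keep=False removes every copy of a duplicated value, not just the extras.
    """
    col = 'title' if typ == 'sub' else 'body'
    for df in df_list:
        df.drop_duplicates(subset=col, keep=False, inplace=True)
    return df_list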
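
Note that `keep=False` removes every copy of a repeated title or body, not just the extras. The sketch below, on a small made-up dataframe, contrasts that with `keep='first'`, which retains one row per distinct value. Which behavior is preferable depends on whether repeated posts should count at all.

```python
import pandas as pd

# Hypothetical titles; 'Repeat post' appears twice.
df = pd.DataFrame({
    'id': ['a1', 'a2', 'a3'],
    'title': ['Repeat post', 'Unique post', 'Repeat post'],
})

dropped_all = df.drop_duplicates(subset='title', keep=False)    # removes both copies
kept_first = df.drop_duplicates(subset='title', keep='first')   # keeps the first copy

print(dropped_all['title'].tolist())
print(kept_first['title'].tolist())
```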


In [9]:
# Assign dataframes to lists
sub_df_list = [sdc_sub, fut_sub, tech_sub, sdc_srch_sub, fut_srch_sub, tech_srch_sub]
com_df_list = [sdc_com, fut_com, tech_com, sdc_srch_com, fut_srch_com, tech_srch_com]

# Call drop duplicate function and reassign to lists
sub_df_list = drop_dups(sub_df_list,'sub')
com_df_list = drop_dups(com_df_list,'com')

# Unpack lists back into the df variables
sdc_sub, fut_sub, tech_sub, sdc_srch_sub, fut_srch_sub, tech_srch_sub = sub_df_list
sdc_com, fut_com, tech_com, sdc_srch_com, fut_srch_com, tech_srch_com = com_df_list

In [10]:
# Check count of empty title fields. 

sdc_sub.loc[sdc_sub['title'] == '',:].count()

author          0
created_utc     0
id              0
full_link       0
num_comments    0
subreddit       0
subreddit_id    0
title           0
dtype: int64

### Combine Dataframes for Comparison: 
1. Selfdrivingcars vs Futurology
2. Selfdrivingcars vs Technology
3. Searched Submissions: Selfdrivingcars, Futurology, Technology with 'Self-driving'
4. Searched Comments: Selfdrivingcars, Futurology, Technology with 'Self-driving'

#### Combine Dataframes

In [11]:
# Combine Submission Dataframes

sdc_fut_sub = pd.concat([sdc_sub, fut_sub], axis=0)
sdc_tech_sub = pd.concat([sdc_sub, tech_sub], axis=0)
srch_sub = pd.concat([sdc_srch_sub, fut_srch_sub, tech_srch_sub], axis=0)

# Combine Comments Dataframes

sdc_fut_com = pd.concat([sdc_com, fut_com], axis=0)
sdc_tech_com = pd.concat([sdc_com, tech_com], axis=0)
srch_com = pd.concat([sdc_srch_com, fut_srch_com, tech_srch_com], axis=0)

In [12]:
# Check for duplicates in merged dataframes

sdc_fut_sub[sdc_fut_sub.duplicated(['title'])].count()

author          34
created_utc     34
id              34
full_link       34
num_comments    34
subreddit       34
subreddit_id    34
title           34
dtype: int64

In [13]:
sdc_fut_com[sdc_fut_com.duplicated(['body'])].count()

author          3
body            3
created_utc     3
id              3
parent_id       3
subreddit       3
subreddit_id    3
dtype: int64

In [14]:
# Drop Duplicates from merged dataframes

# SelfdrivingCars | Futurology
sdc_fut_sub = drop_dups([sdc_fut_sub],'sub')[0]
sdc_fut_com = drop_dups([sdc_fut_com],'com')[0]

# SelfdrivingCars | Technology
sdc_tech_sub = drop_dups([sdc_tech_sub],'sub')[0]
sdc_tech_com = drop_dups([sdc_tech_com],'com')[0]

srch_sub = drop_dups([srch_sub],'sub')[0]
srch_com = drop_dups([srch_com],'com')[0]


In [15]:
sdc_fut_sub.head()

Unnamed: 0,author,created_utc,id,full_link,num_comments,subreddit,subreddit_id,title
0,aibits,1571168021,did2fk,https://www.reddit.com/r/SelfDrivingCars/comme...,14,SelfDrivingCars,t5_2udmw,"Keen to develop self-driving cars, Hyundai Mot..."
1,tamu,1571156136,dia5o0,https://www.reddit.com/r/SelfDrivingCars/comme...,0,SelfDrivingCars,t5_2udmw,Texas A&amp;M Lands $7 Million Federal Grant T...
2,aibits,1571152446,di9aub,https://www.reddit.com/r/SelfDrivingCars/comme...,0,SelfDrivingCars,t5_2udmw,Welcome to the 2019 DriveML Huawei Autonomous ...
3,Avenue21,1571151126,di8zvp,https://www.reddit.com/r/SelfDrivingCars/comme...,1,SelfDrivingCars,t5_2udmw,There is a puzzle picturing a driverless Paris...
4,aibits,1571118986,di3hc7,https://www.reddit.com/r/SelfDrivingCars/comme...,11,SelfDrivingCars,t5_2udmw,Top 5 self-driving trucks in the world


In [16]:
# Function to reduce all submissions dataframes

def sub_reduc(df_list):
    # Return reduced copies; assigning to the loop variable alone would discard them
    return [df.loc[:, ['author', 'id', 'title', 'subreddit']] for df in df_list]

In [17]:
# Reduce submissions dataframes to only desired columns

sdc_fut_sub = sdc_fut_sub.loc[:, ['author', 'id','title', 'subreddit']]
sdc_tech_sub = sdc_tech_sub.loc[:, ['author', 'id','title', 'subreddit']]
srch_sub = srch_sub.loc[:, ['author', 'id','title', 'subreddit', 'year']]

sdc_fut_sub.head()


Unnamed: 0,author,id,title,subreddit
0,aibits,did2fk,"Keen to develop self-driving cars, Hyundai Mot...",SelfDrivingCars
1,tamu,dia5o0,Texas A&amp;M Lands $7 Million Federal Grant T...,SelfDrivingCars
2,aibits,di9aub,Welcome to the 2019 DriveML Huawei Autonomous ...,SelfDrivingCars
3,Avenue21,di8zvp,There is a puzzle picturing a driverless Paris...,SelfDrivingCars
4,aibits,di3hc7,Top 5 self-driving trucks in the world,SelfDrivingCars


In [18]:
# Reduce comments dataframe to only desired columns

sdc_fut_com = sdc_fut_com.loc[:, ['author', 'id','body', 'subreddit']]
sdc_tech_com = sdc_tech_com.loc[:, ['author', 'id','body', 'subreddit']]
srch_com = srch_com.loc[:, ['author', 'id','body', 'subreddit','year']]

sdc_fut_com.head()

Unnamed: 0,author,id,body,subreddit
0,LeoBrasnar,f3wml7d,And [Locomation](https://locomation.ai/).,SelfDrivingCars
1,Lancaster61,f3wkqaf,“If we build rail tracks and all cars are trai...,SelfDrivingCars
2,loose_sweater,f3wcjjw,Prius prime!,SelfDrivingCars
3,VirtuallyChris,f3wa351,"Yes, I've called Geico and asked them to add t...",SelfDrivingCars
4,JacobHSR,f3w3m4f,Goodness gracious me! \n\nIs any car factory h...,SelfDrivingCars


In [19]:
# Lowercase subreddit categories
sdc_fut_sub['subreddit'] = sdc_fut_sub['subreddit'].str.lower()
sdc_fut_com['subreddit'] = sdc_fut_com['subreddit'].str.lower()
sdc_tech_sub['subreddit'] = sdc_tech_sub['subreddit'].str.lower()
sdc_tech_com['subreddit'] = sdc_tech_com['subreddit'].str.lower()
srch_sub['subreddit'] = srch_sub['subreddit'].str.lower()
srch_com['subreddit'] = srch_com['subreddit'].str.lower()

### Create Target Column

In [20]:
# Create target columns with self-driving cars being our reference class

sdc_fut_sub['target'] = sdc_fut_sub['subreddit'].map({'futurology':0, 'selfdrivingcars':1})
sdc_fut_com['target'] = sdc_fut_com['subreddit'].map({'futurology':0, 'selfdrivingcars':1})
sdc_tech_sub['target'] = sdc_tech_sub['subreddit'].map({'technology':0, 'selfdrivingcars':1})
sdc_tech_com['target'] = sdc_tech_com['subreddit'].map({'technology':0, 'selfdrivingcars':1})

sdc_fut_sub.sample(5)

Unnamed: 0,author,id,title,subreddit,target
212,ruperap,d2qthh,RLE Test Vehicle Spotted in Germany,selfdrivingcars,1
24,darealmvp1,ag9xkp,Do self driving cars have a kill feature? If n...,selfdrivingcars,1
80,kailashsuresh,bl7vuj,Autonomous Vehicles - Market Update May 09,selfdrivingcars,1
478,NegentropicBoy,cjqd9o,The Self-Driving Car Capital Of The World - BB...,selfdrivingcars,1
763,walky22talky,9qd470,NHTSA directs driverless shuttle to stop trans...,selfdrivingcars,1


### Clean & Tokenize Submissions and Comments

In [21]:
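# Remove numbers from Titles and Comments
# Note: only sdc_fut_sub overwrites 'title'; the 'tokens' columns written here
# are replaced when the tokenizer is applied to 'title'/'body' below.

sdc_fut_sub['title'] = sdc_fut_sub['title'].str.replace(r'\d+', '', regex=True)
sdc_fut_com['tokens'] = sdc_fut_com['body'].str.replace(r'\d+', '', regex=True)
sdc_tech_sub['tokens'] = sdc_tech_sub['title'].str.replace(r'\d+', '', regex=True)
sdc_tech_com['tokens'] = sdc_tech_com['body'].str.replace(r'\d+', '', regex=True)

srch_sub['tokens'] = srch_sub['title'].str.replace(r'\d+', '', regex=True)
srch_com['tokens'] = srch_com['body'].str.replace(r'\d+', '', regex=True)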

In [22]:
sdc_fut_sub.head()

Unnamed: 0,author,id,title,subreddit,target
0,aibits,did2fk,"Keen to develop self-driving cars, Hyundai Mot...",selfdrivingcars,1
1,tamu,dia5o0,Texas A&amp;M Lands $ Million Federal Grant To...,selfdrivingcars,1
2,aibits,di9aub,Welcome to the DriveML Huawei Autonomous Vehi...,selfdrivingcars,1
3,Avenue21,di8zvp,There is a puzzle picturing a driverless Paris...,selfdrivingcars,1
4,aibits,di3hc7,Top self-driving trucks in the world,selfdrivingcars,1


In [23]:
# Define Tokenizer

def tokenize(x): 
    tokenizer = RegexpTokenizer(r'\w+')
    return tokenizer.tokenize(x)
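
With the pattern `\w+`, the tokenizer keeps runs of letters, digits, and underscores and drops everything else, so possessives and HTML entities are split into fragments (for example, `A&amp;M` becomes `A`, `amp`, `M`). The sketch below uses `re.findall`, which should behave like `RegexpTokenizer(r'\w+')` for this pattern, and shows one possible remedy: unescaping HTML entities with the standard library first. The sample title and the helper name `simple_tokenize` are made up for illustration.

```python
import html
import re

def simple_tokenize(text):
    """Return runs of word characters, mirroring the word-character pattern used above."""
    return re.findall(r'\w+', text)

title = "Texas A&amp;M tests Google's self-driving car"  # hypothetical title

raw_tokens = simple_tokenize(title)
clean_tokens = simple_tokenize(html.unescape(title))  # '&amp;' becomes '&' before tokenizing

print(raw_tokens)
print(clean_tokens)
```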

In [24]:
# Add tokens columns for all merged dataframes

sdc_fut_sub['tokens'] = sdc_fut_sub['title'].map(tokenize)
sdc_fut_com['tokens'] = sdc_fut_com['body'].map(tokenize)
sdc_tech_sub['tokens'] = sdc_tech_sub['title'].map(tokenize)
sdc_tech_com['tokens'] = sdc_tech_com['body'].map(tokenize)

srch_sub['tokens'] = srch_sub['title'].map(tokenize)
srch_com['tokens'] = srch_com['body'].map(tokenize)

In [25]:
sdc_fut_sub.head()

Unnamed: 0,author,id,title,subreddit,target,tokens
0,aibits,did2fk,"Keen to develop self-driving cars, Hyundai Mot...",selfdrivingcars,1,"[Keen, to, develop, self, driving, cars, Hyund..."
1,tamu,dia5o0,Texas A&amp;M Lands $ Million Federal Grant To...,selfdrivingcars,1,"[Texas, A, amp, M, Lands, Million, Federal, Gr..."
2,aibits,di9aub,Welcome to the DriveML Huawei Autonomous Vehi...,selfdrivingcars,1,"[Welcome, to, the, DriveML, Huawei, Autonomous..."
3,Avenue21,di8zvp,There is a puzzle picturing a driverless Paris...,selfdrivingcars,1,"[There, is, a, puzzle, picturing, a, driverles..."
4,aibits,di3hc7,Top self-driving trucks in the world,selfdrivingcars,1,"[Top, self, driving, trucks, in, the, world]"


In [26]:
srch_sub.head()

Unnamed: 0,author,id,title,subreddit,year,tokens
1,Sidewinder77,vn2lo,Sebastian Thrun on Google's Self-Driving Car,selfdrivingcars,2011,"[Sebastian, Thrun, on, Google, s, Self, Drivin..."
2,Sidewinder77,vn2tg,5 Ways Self-Driving Cars Will Make You Love Co...,selfdrivingcars,2011,"[5, Ways, Self, Driving, Cars, Will, Make, You..."
3,Sidewinder77,vn2y4,Volvo's self-driving 'convoy' hits the Spanish...,selfdrivingcars,2011,"[Volvo, s, self, driving, convoy, hits, the, S..."
4,Sidewinder77,vnuw3,Google Self-Driving Car License Approved in Ne...,selfdrivingcars,2011,"[Google, Self, Driving, Car, License, Approved..."
5,Sidewinder77,vq0pl,Could self-driving cars reduce future health c...,selfdrivingcars,2011,"[Could, self, driving, cars, reduce, future, h..."


#### Stem and Lemmatize Tokens 

In [27]:
def stemmer(x): 
    stemmer = PorterStemmer()
    return ' '.join([stemmer.stem(word) for word in x])

In [28]:
def lemmatize(x): 
    lemmatizer = WordNetLemmatizer()
    return ' '.join([lemmatizer.lemmatize(word) for word in x])

In [29]:
# Stem and Lemmatize all Dataframes

sdc_fut_sub['lemma'] = sdc_fut_sub['tokens'].map(lemmatize)
sdc_fut_com['lemma'] = sdc_fut_com['tokens'].map(lemmatize)
sdc_tech_sub['lemma'] = sdc_tech_sub['tokens'].map(lemmatize)
sdc_tech_com['lemma'] = sdc_tech_com['tokens'].map(lemmatize)

sdc_fut_sub['stems'] = sdc_fut_sub['tokens'].map(stemmer)
sdc_fut_com['stems'] = sdc_fut_com['tokens'].map(stemmer)
sdc_tech_sub['stems'] = sdc_tech_sub['tokens'].map(stemmer)
sdc_tech_com['stems'] = sdc_tech_com['tokens'].map(stemmer)

srch_sub['lemma'] = srch_sub['tokens'].map(lemmatize)
srch_com['lemma'] = srch_com['tokens'].map(lemmatize)
srch_sub['stems'] = srch_sub['tokens'].map(stemmer)
srch_com['stems'] = srch_com['tokens'].map(stemmer)

sdc_fut_sub.sample(5)

Unnamed: 0,author,id,title,subreddit,target,tokens,lemma,stems
951,Wagamaga,8ogsth,Renewable power accounted for percent of net ...,futurology,0,"[Renewable, power, accounted, for, percent, of...",Renewable power accounted for percent of net a...,renew power account for percent of net addit t...
178,walky22talky,9cvgsm,How Aurora Plans to Make Robocars Real,selfdrivingcars,1,"[How, Aurora, Plans, to, Make, Robocars, Real]",How Aurora Plans to Make Robocars Real,how aurora plan to make robocar real
659,KidKilobyte,9t9hck,Tesla's Summon upgrade turns vehicles into rem...,selfdrivingcars,1,"[Tesla, s, Summon, upgrade, turns, vehicles, i...",Tesla s Summon upgrade turn vehicle into remot...,tesla s summon upgrad turn vehicl into remot c...
942,REIGuy3,8nt8bm,Why GM and Waymo Rely on Allies in Self-Drivin...,selfdrivingcars,1,"[Why, GM, and, Waymo, Rely, on, Allies, in, Se...",Why GM and Waymo Rely on Allies in Self Drivin...,whi GM and waymo reli on alli in self drive race
84,GrillaNea,7v7n1c,"Photos leaked on Wednesday, Jan. , have caused...",futurology,0,"[Photos, leaked, on, Wednesday, Jan, have, cau...",Photos leaked on Wednesday Jan have caused man...,photo leak on wednesday jan have caus mani to ...


In [30]:
srch_sub.head()

Unnamed: 0,author,id,title,subreddit,year,tokens,lemma,stems
1,Sidewinder77,vn2lo,Sebastian Thrun on Google's Self-Driving Car,selfdrivingcars,2011,"[Sebastian, Thrun, on, Google, s, Self, Drivin...",Sebastian Thrun on Google s Self Driving Car,sebastian thrun on googl s self drive car
2,Sidewinder77,vn2tg,5 Ways Self-Driving Cars Will Make You Love Co...,selfdrivingcars,2011,"[5, Ways, Self, Driving, Cars, Will, Make, You...",5 Ways Self Driving Cars Will Make You Love Co...,5 way self drive car will make you love commut
3,Sidewinder77,vn2y4,Volvo's self-driving 'convoy' hits the Spanish...,selfdrivingcars,2011,"[Volvo, s, self, driving, convoy, hits, the, S...",Volvo s self driving convoy hit the Spanish mo...,volvo s self drive convoy hit the spanish moto...
4,Sidewinder77,vnuw3,Google Self-Driving Car License Approved in Ne...,selfdrivingcars,2011,"[Google, Self, Driving, Car, License, Approved...",Google Self Driving Car License Approved in Ne...,googl self drive car licens approv in nevada
5,Sidewinder77,vq0pl,Could self-driving cars reduce future health c...,selfdrivingcars,2011,"[Could, self, driving, cars, reduce, future, h...",Could self driving car reduce future health ca...,could self drive car reduc futur health care cost


In [39]:
# Count tokens within each title, keyed by (row position, token)

#sdc_fut_sub.groupby('lemma').count()
token_cts = pd.Series(Counter([(i, t) for i, l in enumerate(srch_sub['tokens']) for t in l]))

In [44]:
token_cts.head()

0  Sebastian    1
   Thrun        1
   on           1
   Google       1
   s            1
dtype: int64

In [43]:
token_cts.sort_values(ascending=False).head(50)

5275  s          6
5349  the        5
2939  a          5
2955  the        5
4298  to         5
674   the        5
1247  the        5
2842  s          4
4276  a          4
3353  to         4
3939  is         4
880   to         4
1247  fire       4
4281  the        4
4378  the        4
4620  a          4
5900  the        4
3861  the        4
4304  system     4
4743  the        4
5924  a          4
6567  the        4
4022  to         4
5280  to         4
2399  for        4
2335  cars       4
4453  the        4
6606  to         4
2647  a          4
4650  to         4
4280  the        4
1120  of         4
6031  a          4
2194  that       4
2334  r          4
4397  to         4
5191  a          4
1024  the        4
1877  will       4
4040  will       4
3955  to         4
1161  the        3
3057  the        3
2252  and        3
5496  driving    3
4141  the        3
5727  a          3
3320  is         3
2864  car        3
2400  and        3
dtype: int64
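
The Series above is keyed by (row position, token) pairs, so each count is the number of times a token appears within a single title rather than across the corpus. If corpus-wide frequencies are the goal, one option is to flatten the token lists and lowercase them first. The token lists here are made up for illustration.

```python
from collections import Counter

# Hypothetical token lists standing in for a 'tokens' column
token_lists = [
    ['The', 'car', 'drives', 'the', 'car'],
    ['A', 'driverless', 'car'],
]

# Per-title counts: keys are (row position, token) pairs
per_title = Counter((i, t) for i, toks in enumerate(token_lists) for t in toks)

# Corpus-wide counts: flatten all rows and ignore case
corpus = Counter(t.lower() for toks in token_lists for t in toks)

print(per_title[(0, 'car')], corpus['car'], corpus['the'])
```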

### Sentiment Analysis Processing

Source: Afinn Sentiment package: 
Finn Årup Nielsen, "A new ANEW: evaluation of a word list for sentiment analysis in microblogs", Proceedings of the ESWC2011 Workshop on 'Making Sense of Microposts': Big things come in small packages. Volume 718 in CEUR Workshop Proceedings: 93-98. 2011 May. Matthew Rowe, Milan Stankovic, Aba-Sah Dadzie, Mariann Hardey (editors)
https://github.com/fnielsen/afinn
http://www2.imm.dtu.dk/pubdb/views/edoc_download.php/6006/pdf/imm6006.pdf

Source for Implementation of package:
https://www.kdnuggets.com/2018/08/emotion-sentiment-analysis-practitioners-guide-nlp-5.html

Afinn Sentiment Analysis 

In [407]:
af = Afinn()

# Compute sentiment scores and map to new column

sdc_fut_sub['sent_score'] = sdc_fut_sub['title'].map(lambda x: af.score(x))
sdc_fut_com['sent_score'] = sdc_fut_com['body'].map(lambda x: af.score(x))
sdc_tech_sub['sent_score'] = sdc_tech_sub['title'].map(lambda x: af.score(x))
sdc_tech_com['sent_score'] = sdc_tech_com['body'].map(lambda x: af.score(x))

srch_sub['sent_score'] = srch_sub['title'].map(lambda x: af.score(x))
srch_com['sent_score'] = srch_com['body'].map(lambda x: af.score(x))
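
As the AFINN paper cited above describes it, the method is lexicon based: each word in the list carries an integer valence, and a text's score is the sum of the valences of the words it contains. The sketch below imitates that idea with a tiny hand-made lexicon. The words and valences are illustrative placeholders, not the real AFINN list, and `toy_score` is a hypothetical helper, not part of the `afinn` package.

```python
import re

# Toy lexicon with made-up valences (NOT the actual AFINN word list)
TOY_LEXICON = {'love': 3, 'approved': 2, 'crash': -2}

def toy_score(text, lexicon=TOY_LEXICON):
    """Sum the valence of every lexicon word found in the text."""
    words = re.findall(r'\w+', text.lower())
    return float(sum(lexicon.get(w, 0) for w in words))

print(toy_score('Cars will make you love commuting'))
print(toy_score('License approved after crash'))
```

Because a score of this kind is a sum, longer texts such as comments can reach larger magnitudes than short titles, which is worth keeping in mind when comparing the two.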

In [408]:
srch_sub.head()

Unnamed: 0,author,id,title,subreddit,year,tokens,lemma,stems,sent_score
1,Sidewinder77,vn2lo,Sebastian Thrun on Google's Self-Driving Car,selfdrivingcars,2011,"[Sebastian, Thrun, on, Google, s, Self, Drivin...",Sebastian Thrun on Google s Self Driving Car,sebastian thrun on googl s self drive car,0.0
2,Sidewinder77,vn2tg,5 Ways Self-Driving Cars Will Make You Love Co...,selfdrivingcars,2011,"[5, Ways, Self, Driving, Cars, Will, Make, You...",5 Ways Self Driving Cars Will Make You Love Co...,5 way self drive car will make you love commut,3.0
3,Sidewinder77,vn2y4,Volvo's self-driving 'convoy' hits the Spanish...,selfdrivingcars,2011,"[Volvo, s, self, driving, convoy, hits, the, S...",Volvo s self driving convoy hit the Spanish mo...,volvo s self drive convoy hit the spanish moto...,0.0
4,Sidewinder77,vnuw3,Google Self-Driving Car License Approved in Ne...,selfdrivingcars,2011,"[Google, Self, Driving, Car, License, Approved...",Google Self Driving Car License Approved in Ne...,googl self drive car licens approv in nevada,2.0
5,Sidewinder77,vq0pl,Could self-driving cars reduce future health c...,selfdrivingcars,2011,"[Could, self, driving, cars, reduce, future, h...",Could self driving car reduce future health ca...,could self drive car reduc futur health care cost,2.0


In [409]:
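# numeric_only=True keeps text columns out of the mean (required by recent pandas versions)
srch_sub.groupby(['subreddit', 'year']).mean(numeric_only=True)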

subreddit,year,sent_score
futurology,2011,0.333333
futurology,2013,0.225
futurology,2014,0.118644
futurology,2015,0.01084
futurology,2016,-0.033149
futurology,2017,0.061765
futurology,2018,0.0
futurology,2019,0.122137
selfdrivingcars,2011,0.22807
selfdrivingcars,2013,0.300395


### Export Processed Dataframes for Modeling

In [410]:
sdc_fut_sub.to_csv('./datasets/selfdrive_fut_sub.csv')
sdc_fut_com.to_csv('./datasets/selfdrive_fut_com.csv')
sdc_tech_sub.to_csv('./datasets/selfdrive_tech_sub.csv')
sdc_tech_com.to_csv('./datasets/selfdrive_tech_com.csv')

srch_sub.to_csv('./datasets/search_sub.csv')
srch_com.to_csv('./datasets/search_com.csv')
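
One caveat when exporting: CSV has no list type, so the `tokens` column is written as the string form of each list and comes back as text when the file is read again. A minimal sketch, using an in-memory buffer and made-up rows, shows the round trip and one way to restore the lists with `ast.literal_eval`:

```python
import ast
import io
import pandas as pd

# Hypothetical frame with a list-valued column
df = pd.DataFrame({'tokens': [['self', 'driving'], ['car']]})

buffer = io.StringIO()
df.to_csv(buffer)
buffer.seek(0)

loaded = pd.read_csv(buffer, index_col=[0])
type_after_load = type(loaded.loc[0, 'tokens'])  # str, not list

# Parse the string representation back into Python lists
loaded['tokens'] = loaded['tokens'].map(ast.literal_eval)
print(type_after_load.__name__, loaded.loc[0, 'tokens'])
```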
