# Project 3: Reddit Post Classification

<i>Pulling information and classifying posts via Pushshift's API</i>

**Author: Brendan McDonnell**

## Step 2: EDA Part 2

Exploring, visualizing, and pulling information from the two datasets before modeling.

## Relative Links
- [Importing Libraries and Datasets Needed](#Importing-Libraries-and-Datasets-Needed)
- [Adding Textblob Sentiment Features](#Adding-Textblob-Sentiment-Features)
- [Further Exploration](#Further-Exploration)
- [Export the Data to CSV](#Export-the-Data-to-CSV)

## Importing Libraries and Datasets Needed

In [1]:
import pandas as pd
import numpy as np
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from textblob import TextBlob, Word, Blobber 
import matplotlib.pyplot as plt
import chart_studio.plotly as py  # plotly.plotly moved to the chart_studio package in plotly 4
import plotly.graph_objs as go
import cufflinks as cf

In [2]:
df = pd.read_csv('data/data_comb_w_sent.csv')

In [3]:
df.columns

Index(['title', 'body', 'is_the_donald', 'vad_title_neg', 'vad_title_neu',
       'vad_title_pos', 'vad_title_compound', 'vad_body_neg', 'vad_body_neu',
       'vad_body_pos', 'vad_body_compound'],
      dtype='object')

In [4]:
df[df.duplicated()]

| | title | body | is_the_donald | vad_title_neg | vad_title_neu | vad_title_pos | vad_title_compound | vad_body_neg | vad_body_neu | vad_body_pos | vad_body_compound |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 22 | House Democrat wants to prosecute people who m... | _ | 0 | 0.150 | 0.667 | 0.183 | 0.1531 | 0.0 | 0.0 | 0.0 | 0.0 |
| 136 | It is time to make it clear that Qatar is fina... | _ | 0 | 0.267 | 0.581 | 0.151 | -0.4588 | 0.0 | 0.0 | 0.0 | 0.0 |
| 290 | Less Than Half of the Children in the U.S. Are... | _ | 0 | 0.000 | 1.000 | 0.000 | 0.0000 | 0.0 | 0.0 | 0.0 | 0.0 |
| 523 | Yep | _ | 0 | 0.000 | 0.000 | 1.000 | 0.2960 | 0.0 | 0.0 | 0.0 | 0.0 |
| 561 | \*cough cough\* leftists | _ | 0 | 0.000 | 1.000 | 0.000 | 0.0000 | 0.0 | 0.0 | 0.0 | 0.0 |
| 610 | Please consider signing this CitisenGO petitio... | _ | 0 | 0.000 | 0.815 | 0.185 | 0.3400 | 0.0 | 0.0 | 0.0 | 0.0 |
| 624 | Please consider signing this CitisenGO petitio... | _ | 0 | 0.000 | 0.815 | 0.185 | 0.3400 | 0.0 | 0.0 | 0.0 | 0.0 |
| 645 | AOC encourages illegal immigrants to hide from... | _ | 0 | 0.415 | 0.478 | 0.107 | -0.8126 | 0.0 | 0.0 | 0.0 | 0.0 |
| 902 | Donald Trump Sets His Sights on Following His ... | _ | 0 | 0.000 | 1.000 | 0.000 | 0.0000 | 0.0 | 0.0 | 0.0 | 0.0 |
| 923 | What Explains the Ferocity of the Attack Again... | _ | 0 | 0.375 | 0.625 | 0.000 | -0.5859 | 0.0 | 0.0 | 0.0 | 0.0 |


In [5]:
df.drop_duplicates(inplace=True)

## Adding Textblob Sentiment Features

In [6]:
# Score each post body with TextBlob: polarity in [-1, 1], subjectivity in [0, 1]
body_dicts = []

for text in list(df['body']):
    col_dict = {}
    pol, sub = TextBlob(text).sentiment
    col_dict['polarity_bod'] = pol
    col_dict['subjectivity_bod'] = sub
    body_dicts.append(col_dict)

# Repeat for titles, storing the scores under separate column names
title_dicts = []

for text in list(df['title']):
    col_dict = {}
    pol, sub = TextBlob(text).sentiment
    col_dict['polarity_tit'] = pol
    col_dict['subjectivity_tit'] = sub
    title_dicts.append(col_dict)

In [7]:
df = df.merge(pd.DataFrame(title_dicts), left_index=True, right_index=True)
df = df.merge(pd.DataFrame(body_dicts), left_index=True, right_index=True)
df.shape

(28427, 15)
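
One detail worth checking in the merge above: `drop_duplicates` leaves gaps in `df`'s index, while a frame built from a list of dicts gets a fresh 0..n-1 index, so an index-based merge can pair scores with the wrong rows or drop rows. The sketch below shows one way to keep rows aligned, by building the score frame on `df.index`. The toy data and the `toy_score` function are hypothetical stand-ins (for the real dataset and for `TextBlob(text).sentiment`) and are not part of the original notebook.

```python
import pandas as pd

# Toy data standing in for the real posts; row 1 duplicates row 0.
df = pd.DataFrame({
    "title": ["good news", "good news", "bad news", "plain"],
    "body": ["_", "_", "_", "_"],
})
df = df.drop_duplicates()  # index is now [0, 2, 3], with a gap at 1


def toy_score(text):
    """Hypothetical stand-in for TextBlob(text).sentiment: (polarity, subjectivity)."""
    if "good" in text:
        return 1.0, 1.0
    if "bad" in text:
        return -1.0, 1.0
    return 0.0, 0.0


# Build the scores on df's own index so each score stays with its row.
scores = pd.DataFrame(
    [toy_score(t) for t in df["title"]],
    columns=["polarity_tit", "subjectivity_tit"],
    index=df.index,
)
aligned = df.join(scores)

# For comparison: a fresh RangeIndex (0, 1, 2) no longer lines up with df.
misaligned = df.merge(
    pd.DataFrame(scores.to_numpy(), columns=scores.columns),
    left_index=True,
    right_index=True,
)

print(aligned)
print(misaligned)
```

In the toy case the misaligned merge loses a row and attaches the score for "plain" to "bad news". An alternative with the same effect would be to call `reset_index(drop=True)` right after dropping duplicates.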

## Further Exploration

In [8]:
# The frog emoji (a popular alt-right symbol for Pepe) doesn't appear in any r/Republican titles

# df[df['title'].str.contains('🐸')]

In [9]:
# df[df['title'].str.contains('immigrant', case=False)]

In [10]:
# cf.set_config_file(offline=False, world_readable=True, theme='ggplot')

# df.groupby('is_the_donald').mean(numeric_only=True).T.iplot(kind='bar',
#                yTitle='Analysis score',
#                title='Mean Sentiment Analysis by Subreddit',
#                filename='Average_Sentiments')
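
The commented-out cell plots the mean of each sentiment score per subreddit. The table behind that chart can be built with plain pandas, as in the rough sketch below. The toy rows are hypothetical; only the column names come from the notebook. Passing `numeric_only=True` keeps text columns such as `title` out of the average, which recent pandas versions would otherwise reject with a `TypeError`.

```python
import pandas as pd

# Toy rows for illustration only; not taken from the real dataset.
toy = pd.DataFrame({
    "title": ["a", "b", "c", "d"],
    "is_the_donald": [0, 0, 1, 1],
    "vad_title_compound": [0.2, 0.4, -0.6, -0.2],
    "polarity_tit": [0.1, 0.3, 0.0, -0.4],
})

# One row per score, one column per subreddit label (0 or 1).
means = toy.groupby("is_the_donald").mean(numeric_only=True).T
print(means)
```

From here, `means.plot(kind='bar')` would give an offline matplotlib version of the same bar chart.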

In [11]:
# 9 of the 10 titles with the lowest VADER compound scores are from r/The_Donald

df.sort_values('vad_title_compound').head(10)

# The post bodies tell a different story

| | title | body | is_the_donald | vad_title_neg | vad_title_neu | vad_title_pos | vad_title_compound | vad_body_neg | vad_body_neu | vad_body_pos | vad_body_compound | polarity_tit | subjectivity_tit | polarity_bod | subjectivity_bod |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 11826 | Child rape victim won't forgive Hillary Clinto... | _ | 0 | 0.445 | 0.509 | 0.047 | -0.9807 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.1 | 0.0 | 0.0 |
| 24010 | Are you f—king kidding me right now? Warren an... | _ | 1 | 0.433 | 0.544 | 0.022 | -0.9791 | 0.0 | 0.0 | 0.0 | 0.0 | -1.0 | 1.0 | 0.0 | 0.0 |
| 16670 | AOC is just another pathetic LiB - TaRd nutcas... | _ | 1 | 0.415 | 0.585 | 0.0 | -0.9783 | 0.0 | 0.0 | 0.0 | 0.0 | -0.1275 | 0.366667 | 0.0 | 0.0 |
| 27932 | Imprisoned for Killing Terrorist, Seeks Justic... | _ | 1 | 0.437 | 0.51 | 0.053 | -0.9774 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 24833 | “If you know the enemy and know yourself, you ... | We may have been quarantined but we will have ... | 1 | 0.355 | 0.578 | 0.067 | -0.9762 | 0.0 | 0.786 | 0.214 | 0.8103 | 0.0 | 0.0 | 0.0 | 0.0 |
| 20652 | Brian Stelter is forced to condemn the assault... | _ | 1 | 0.39 | 0.61 | 0.0 | -0.9739 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 15272 | It’s one propaganda assault after another. Pro... | _ | 1 | 0.395 | 0.605 | 0.0 | -0.9734 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 23118 | Admins STILL haven’t intervened in an Anarchis... | _ | 1 | 0.486 | 0.476 | 0.038 | -0.9733 | 0.0 | 0.0 | 0.0 | 0.0 | -0.166667 | 0.433333 | 0.0 | 0.0 |
| 20866 | There's a video with a different angle of the ... | _ | 1 | 0.412 | 0.588 | 0.0 | -0.9732 | 0.0 | 0.0 | 0.0 | 0.0 | 0.125 | 0.216667 | 0.0 | 0.0 |
| 21473 | Illegal alien gets 4 years in prison for hit a... | _ | 1 | 0.503 | 0.497 | 0.0 | -0.9729 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
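
The "9 of 10" observation can be checked without counting rows by eye, and the same check can be repeated for post bodies. The sketch below assumes the column names shown in `df.columns`; the helper name `share_in_most_negative` and the toy rows are hypothetical.

```python
import pandas as pd


def share_in_most_negative(frame, score_col, k=10, label_col="is_the_donald"):
    """Fraction of the k lowest-scoring rows whose label is 1."""
    lowest = frame.sort_values(score_col).head(k)
    return lowest[label_col].mean()


# Toy rows for illustration only; not taken from the real dataset.
toy = pd.DataFrame({
    "is_the_donald":      [1, 1, 0, 1, 0, 0],
    "vad_title_compound": [-0.9, -0.8, -0.7, -0.6, 0.1, 0.5],
    "vad_body_compound":  [0.0, 0.2, -0.5, 0.3, -0.9, -0.4],
})

title_share = share_in_most_negative(toy, "vad_title_compound", k=4)
body_share = share_in_most_negative(toy, "vad_body_compound", k=3)
print(title_share, body_share)
```

On the real data, `share_in_most_negative(df, 'vad_title_compound')` should agree with the table above, and swapping in `'vad_body_compound'` gives the corresponding figure for bodies. Most bodies are the `_` placeholder with a compound score of 0.0, so the body ranking only separates the posts that have text.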


## Export the Data to CSV 

In [12]:
# export to csv for manipulation
# df.to_csv('data/final_cleaned.csv', index=False)