# Project 3: Reddit Post Classification

<i>Pulling information and classifying posts via Pushshift's API</i>

**Author: Brendan McDonnell**

## Step 2: EDA Part 1

Exploring, visualizing, and pulling information from the two datasets before modeling.

## Relative Links
- [Importing Libraries and Datasets Needed](#Importing-Libraries-and-Datasets-Needed)
- [Visualizing and Exploring Data](#Visualizing-and-Exploring-Data)
- [A Few Notes About the Data](#A-Few-Notes-About-the-Data)
- [Export the Data to CSV](#Export-the-Data-to-CSV)

## Importing Libraries and Datasets Needed

In [None]:
import pandas as pd
import numpy as np
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

In [None]:
df_don = pd.read_csv('data/the_donald.csv')
df_rep = pd.read_csv('data/republican.csv')

## Visualizing and Exploring Data

In [None]:
df_don.drop(columns='Unnamed: 0', inplace=True)
df_rep.drop(columns='Unnamed: 0', inplace=True)
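
An `Unnamed: 0` column usually appears when a CSV was written with its index and then read back without declaring it. A minimal sketch, assuming that is what happened with these files (the toy CSV text below is made up for illustration), shows `index_col=0` as a possible alternative to dropping the column afterwards:

```python
import io

import pandas as pd

# Hypothetical CSV text that was saved with its index as the first column
csv_text = ",title,user\n0,First post,alice\n1,Second post,bob\n"

# Default read: the unnamed index column comes back as 'Unnamed: 0'
with_extra = pd.read_csv(io.StringIO(csv_text))

# Declaring the first column as the index avoids the extra column
as_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(with_extra.columns))
print(list(as_index.columns))
```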

In [None]:
# 15,000 posts from r/The_Donald
df_don.head(10)

In [None]:
# 14,999 posts from r/Republican
df_rep.head(10)

In [None]:
# nulls in body: the post has no body text
# nulls in user: the user has deleted their account
df_don.isnull().sum()

In [None]:
# nulls in body: the post has no body text
# nulls in user: the user has deleted their account
df_rep.isnull().sum()

In [None]:
# number of unique users behind all 15,000 posts (NaN counts as one value)
len(df_don.user.unique())

In [None]:
# number of unique users behind all 14,999 posts (NaN counts as one value)
len(df_rep.user.unique())

In [None]:
# 85 overlapping users (the count returned also includes nan, which is not a real user)
# the posting user could be a good feature for predicting which subreddit a post belongs to
len(set(df_rep.user.unique()).intersection(list(df_don.user.unique())))
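
Because deleted accounts show up as NaN in the `user` column, the raw intersection counts NaN as a shared "user". A minimal sketch of one way to exclude it, using toy stand-ins for the two `user` columns (the usernames are hypothetical):

```python
import numpy as np
import pandas as pd

# Toy stand-ins for df_rep['user'] and df_don['user']
users_rep = pd.Series(['alice', 'bob', np.nan, 'carol'])
users_don = pd.Series(['bob', np.nan, 'dave', 'carol'])

# dropna() removes deleted accounts before the sets are compared
shared = set(users_rep.dropna()) & set(users_don.dropna())
n_shared = len(shared)

print(sorted(shared), n_shared)
```

Applied to the real columns, the same pattern should report only named accounts that appear in both subreddits.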

In [None]:
# r/Republican data starts at July 4, 2019 and goes back to May 15, 2018
df_rep.head()

In [None]:
# r/The_Donald data starts at July 4, 2019 and goes back to June 28, 2019
df_don.head()

In [None]:
# count the posts in each subreddit whose URL contains 'breitbart'
list_don = [1 if 'breitbart' in value else 0 for value in list(df_don['url'].values)]
list_rep = [1 if 'breitbart' in value else 0 for value in list(df_rep['url'].values)]
sum(list_don), sum(list_rep)

In [None]:
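# share of titles in each subreddit that mention 'trump' (case-insensitive)
trump_don = [1 if 'trump' in title.lower() else 0 for title in list(df_don['title'].values)]
trump_rep = [1 if 'trump' in title.lower() else 0 for title in list(df_rep['title'].values)]
# divide by the actual row counts instead of hard-coded totals
sum(trump_don) / len(df_don), sum(trump_rep) / len(df_rep)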
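
The keyword checks above can also be written with pandas string methods. This is a minimal sketch on made-up titles, not the notebook's actual data; `na=False` is included so a missing title would count as "no match" rather than raising an error:

```python
import pandas as pd

# Hypothetical titles standing in for df_don['title'] or df_rep['title']
titles = pd.Series([
    'Trump rally tonight',
    'Tax bill passes',
    'TRUMP tweets again',
    'Border news',
])

# Boolean mask of titles that mention the keyword, ignoring case
mentions = titles.str.contains('trump', case=False, na=False)

# The mean of a boolean Series is the share of True values
trump_share = mentions.mean()
print(trump_share)
```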

## A Few Notes About the Data

1. r/Republican has 3,012 unique users making 14,999 posts over a little more than a year, whereas r/The_Donald has 4,893 unique users making 15,000 posts over just 6 days. 85 of those users have posted in both subreddits.
    - r/Republican is less active AND has fewer unique posters. The posting user will probably be a good indicator of which subreddit a post belongs to.
2. Breitbart articles get posted less often on r/Republican, but the difference is smaller than I expected.
    - This could indicate heavier alt-right leanings on r/Republican, but that is a very broad assumption. There are probably plenty of less famous alt-right websites I should check first.
3. Trump's name pops up in a LOT of titles on both subreddits; 11.3% of The_Donald titles and 14.8% of Republican titles in the datasets.
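
To put the activity gap from note 1 in numbers, the sketch below turns the date ranges and post counts reported in this notebook into rough posts-per-day rates. Treating those dates as the exact window boundaries is an assumption:

```python
from datetime import date

# Date ranges and post counts as noted earlier in this notebook
rep_days = (date(2019, 7, 4) - date(2018, 5, 15)).days
don_days = (date(2019, 7, 4) - date(2019, 6, 28)).days

rate_rep = 14_999 / rep_days
rate_don = 15_000 / don_days

print(f"r/Republican: {rate_rep:.1f} posts/day over {rep_days} days")
print(f"r/The_Donald: {rate_don:.1f} posts/day over {don_days} days")
```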

**For the final model, I will only be using the Title and Body columns to predict the subreddit a post belongs to.**

I will need to impute some values for the nulls in the body column before I start manipulating the data.

In [None]:
# update dataframes and append to make one big DF
df_don['is_the_donald'] = 1
df_rep['is_the_donald'] = 0

In [None]:
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
df = pd.concat([df_rep, df_don], ignore_index=True)

In [None]:
df_rep

In [None]:
df = df.drop(columns=['id', 'score', 'url', 'comms_num', 'created', 'user'])

In [None]:
# replace '[removed]' / '[deleted]' placeholders with a character that has no sentiment or meaning
df['body'] = df['body'].apply(lambda text: '_' if text == '[removed]' or text == '[deleted]' else text)

In [None]:
# fill the remaining nulls with the same neutral character
df.fillna('_', inplace=True)
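
The two imputation steps above can be checked on a small example. This is a minimal sketch with a made-up frame; it uses `Series.replace` in place of the lambda, which should behave the same for these two placeholder strings:

```python
import numpy as np
import pandas as pd

# Hypothetical rows covering the three cases: placeholder, null, real text
toy = pd.DataFrame({
    'title': ['a', 'b', 'c', 'd'],
    'body': ['[removed]', np.nan, 'real text', '[deleted]'],
})

# Swap placeholders for '_' and then fill true nulls with the same character
toy['body'] = toy['body'].replace(['[removed]', '[deleted]'], '_').fillna('_')

print(toy['body'].tolist())
```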

In [None]:
df.head()

In [None]:
df['title'][0]

In [None]:
SentimentIntensityAnalyzer().polarity_scores(df['title'][0])

In [None]:
# reuse one analyzer; creating a new one for every row reloads the lexicon each time
analyzer = SentimentIntensityAnalyzer()
df['vad_title_neg'] = df['title'].apply(lambda text: analyzer.polarity_scores(text)['neg'])
df['vad_title_neu'] = df['title'].apply(lambda text: analyzer.polarity_scores(text)['neu'])
df['vad_title_pos'] = df['title'].apply(lambda text: analyzer.polarity_scores(text)['pos'])
df['vad_title_compound'] = df['title'].apply(lambda text: analyzer.polarity_scores(text)['compound'])

In [None]:
# reuse one analyzer; creating a new one for every row reloads the lexicon each time
analyzer = SentimentIntensityAnalyzer()
df['vad_body_neg'] = df['body'].apply(lambda text: analyzer.polarity_scores(text)['neg'])
df['vad_body_neu'] = df['body'].apply(lambda text: analyzer.polarity_scores(text)['neu'])
df['vad_body_pos'] = df['body'].apply(lambda text: analyzer.polarity_scores(text)['pos'])
df['vad_body_compound'] = df['body'].apply(lambda text: analyzer.polarity_scores(text)['compound'])
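
Each cell above scores every text four times, once per output column. A possible alternative is to score each text once and spread the resulting dictionary across columns. In the sketch below, `add_score_columns` and `fake_scores` are hypothetical names introduced for illustration; `fake_scores` stands in for the analyzer's `polarity_scores` method so the example runs without vaderSentiment installed:

```python
import pandas as pd

def add_score_columns(frame, column, prefix, score_fn):
    """Score each text once and join the score dict as prefixed columns."""
    scores = pd.DataFrame(frame[column].map(score_fn).tolist(), index=frame.index)
    return frame.join(scores.add_prefix(prefix))

# Hypothetical stand-in that returns the same keys VADER reports
def fake_scores(text):
    return {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}

toy = pd.DataFrame({'title': ['first post', 'second post']})
toy = add_score_columns(toy, 'title', 'vad_title_', fake_scores)

print(toy.columns.tolist())
```

With the real library, `score_fn` would be the `polarity_scores` method of a single `SentimentIntensityAnalyzer()` instance created once before the call.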

In [None]:
df.head()

In [None]:
df.shape

## Export the Data to CSV

In [None]:
# combined data
# df.to_csv('data/data_comb_w_sent.csv', index=False)

# NOTE: End of Part 1 of EDA. The sentiment-scoring cells above take a long time to run.