In this notebook I pre-process my data by combining the Republican and Democratic subreddit posts into one data frame. I then tokenize the text with a count vectorizer and add sentiment-analysis columns to the resulting term matrix. Lastly, I train-test split both the reddit data frame and the sentiment-augmented term matrix so they can be used for modeling.

In [1]:
import requests
import time
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
import csv
import seaborn as sns
import pickle
import matplotlib.pyplot as plt
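from sklearn.feature_extraction import text as stop_words  # the stop_words module was removed in newer scikit-learn; text also exposes ENGLISH_STOP_WORDS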
from scipy.stats import ttest_ind
from nltk.sentiment.vader import SentimentIntensityAnalyzer

%matplotlib inline



### Reading in my data

In [45]:
republican = pd.read_csv('../Data/republican.csv')
democrat = pd.read_csv('../Data/democrat.csv')

Setting my columns

In [46]:
democrat.columns = ['name','text','title','subreddit']
republican.columns = ['name','text','title','subreddit']

### Undersampling my democrat data in order to balance my two classes

In [47]:
republican.shape

(710, 4)

In [48]:
democrat.shape

(995, 4)

In [49]:
democrat = democrat.sample(n=710, random_state=42)

In [50]:
democrat.shape

(710, 4)

In [51]:
republican.shape

(710, 4)
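
The undersampling step above hard-codes `n=710`. As a rough sketch of the same idea without a fixed number, the snippet below samples every class down to the size of the smallest one. The helper name `undersample_to_smallest`, the toy rows, and the label strings are assumptions made for illustration; they are not part of the original notebook.

```python
import pandas as pd

def undersample_to_smallest(df, label_col, random_state=42):
    """Hypothetical helper: sample each class down to the smallest class size."""
    n_min = df[label_col].value_counts().min()
    parts = [
        group.sample(n=n_min, random_state=random_state)
        for _, group in df.groupby(label_col)
    ]
    return pd.concat(parts, ignore_index=True)

# Toy data standing in for the two subreddits (made-up rows and labels).
toy = pd.DataFrame({
    "title": [f"post {i}" for i in range(10)],
    "subreddit": ["democrat"] * 6 + ["republican"] * 4,
})

balanced = undersample_to_smallest(toy, "subreddit")
counts = balanced["subreddit"].value_counts().to_dict()
print(counts)
```

Passing `ignore_index=True` to `pd.concat` also gives the combined frame a clean 0..n-1 index, which avoids a separate index reset later.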

### Creating one data frame from my two subreddits

In [52]:
reddit = pd.concat([republican,democrat])
reddit.fillna('', inplace=True)

In [53]:
reddit.shape

(1420, 4)

Creating a column that contains both the title and body of the post

In [54]:
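# Join with a space so the last word of the title and the first word of the body stay separate tokens
reddit['text_title'] = reddit['title'] + ' ' + reddit['text']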

### Saving my data frame

In [55]:
reddit.to_csv('../Data/reddit.csv')

### Train test split and save

In [13]:
y = reddit['subreddit']

In [14]:
X = reddit['text_title']

In [15]:
# random_state makes the split reproducible across runs
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

In [16]:
X_train.to_csv('../Data/X_train.csv')

In [17]:
X_test.to_csv('../Data/X_test.csv')

In [18]:
y_train.to_csv('../Data/y_train.csv')

In [19]:
y_test.to_csv('../Data/y_test.csv')
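
To check what `stratify=y` does in the split above, here is a small sketch on made-up data. The toy series and the `random_state` value are assumptions for illustration only; the point is that each part keeps the same class balance as the full data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-ins for reddit['text_title'] and reddit['subreddit'] (made-up data).
X_toy = pd.Series([f"post {i}" for i in range(40)])
y_toy = pd.Series(["republican"] * 20 + ["democrat"] * 20)

# random_state is an assumption added here so the split is repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, stratify=y_toy, random_state=42
)

# With the default test_size of 0.25, 10 of the 40 rows go to the test set,
# and stratify keeps the two classes evenly represented in each part.
train_share = y_tr.value_counts(normalize=True).to_dict()
test_share = y_te.value_counts(normalize=True).to_dict()
print(len(X_tr), len(X_te))
print(train_share, test_share)
```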

### Creating a count vectorizer data frame

In [20]:
y = list(reddit['subreddit'])

Creating a custom stop word list that includes some numbers and words found in URLs

In [79]:
custom_stopwords = list(stop_words.ENGLISH_STOP_WORDS)
custom_stopwords.extend(['10', '12', '13', '14', '15', '18', '25','200', '000','https', 'com', 'youtube', 'www'])

In [80]:
cvec = CountVectorizer(stop_words = custom_stopwords, min_df = 5, max_df = .9)

In [81]:
X_cvec = cvec.fit_transform(X)

In [82]:
term_mat = pd.DataFrame(X_cvec.toarray(), columns=cvec.get_feature_names())

In [83]:
term_mat.head()

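|   | 2014 | 2016 | 2017 | 2018 | 2020 | 2nd | 40 | able | according | account | ... | work | workers | working | world | wrong | year | years | york | young | youtu |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |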
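
The vectorizer above combines a stop word list with `min_df` and `max_df`. The sketch below shows, on a made-up four-sentence corpus, how those three settings prune the vocabulary. The corpus and the thresholds are assumptions chosen to keep the example small; they are not the notebook's data or settings.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up toy corpus, not taken from the subreddit data.
docs = [
    "vote early vote often",
    "vote for the tax bill",
    "tax cuts and the vote",
    "the senate will vote today",
]

# stop_words removes the listed tokens before counting;
# min_df=2 drops terms found in fewer than 2 documents;
# max_df=0.9 drops terms found in more than 90% of documents.
toy_cvec = CountVectorizer(stop_words=["the"], min_df=2, max_df=0.9)
toy_counts = toy_cvec.fit_transform(docs)

# vocabulary_ maps each kept term to its column index.
kept_terms = sorted(toy_cvec.vocabulary_)
print(kept_terms)
print(toy_counts.toarray())
```

Here "vote" appears in every document, so `max_df` removes it; "tax" appears in two documents and survives; every other word appears only once and falls below `min_df`.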


### Instantiating my sentiment analyzer

In [84]:
sid = SentimentIntensityAnalyzer()

### Adding columns for sentiment to my data frame

Each post is given a proportion score between 0 and 1 for each sentiment column, in the hope that this will help my models differentiate between the subreddits. This is because I assume that Republicans and Democrats will speak with different tones about a given topic or news headline.

In [85]:
reddit['positive'] = reddit['text_title'].map(lambda x: sid.polarity_scores(x)['pos'])

In [86]:
reddit['negative'] = reddit['text_title'].map(lambda x: sid.polarity_scores(x)['neg'])

In [87]:
reddit['neutral'] = reddit['text_title'].map(lambda x: sid.polarity_scores(x)['neu'])
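
The three cells above score every post three times, once per column. A possible alternative, sketched below, scores each post once and expands the returned dictionary into columns. The helper `add_sentiment_columns` and the stand-in scorer `fake_scores` are hypothetical; `fake_scores` uses made-up rules and is not VADER. With NLTK and its `vader_lexicon` installed, `sid.polarity_scores` could be passed as `score_fn` instead, since it returns a dictionary with `'pos'`, `'neg'`, `'neu'`, and `'compound'` keys.

```python
import pandas as pd

def add_sentiment_columns(df, text_col, score_fn):
    """Hypothetical helper: call score_fn once per row and expand the result into columns."""
    scores = pd.DataFrame([score_fn(t) for t in df[text_col]], index=df.index)
    out = df.copy()
    out["positive"] = scores["pos"]
    out["negative"] = scores["neg"]
    out["neutral"] = scores["neu"]
    return out

def fake_scores(text):
    """Stand-in scorer with made-up rules (NOT VADER): share of 'good' and 'bad' words."""
    words = text.lower().split()
    n = max(len(words), 1)
    pos = sum(w == "good" for w in words) / n
    neg = sum(w == "bad" for w in words) / n
    return {"pos": pos, "neg": neg, "neu": 1 - pos - neg}

# Made-up posts for illustration.
toy = pd.DataFrame({"text_title": ["good good bad plan", "bad idea"]})
scored = add_sentiment_columns(toy, "text_title", fake_scores)
print(scored)
```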

### Adding sentiment columns to my term matrix

In [88]:
reddit.reset_index(inplace=True)

In [89]:
reddit.drop(columns='index', inplace=True)

In [90]:
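term_mat_sen = term_mat  # same object as term_mat, not a copy; the sentiment columns are dropped from term_mat again further down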

In [91]:
term_mat_sen['positive'] = reddit['positive']

In [92]:
term_mat_sen['negative'] = reddit['negative']

In [93]:
term_mat_sen['neutral'] = reddit['neutral']
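
The index reset before these assignments matters because pandas matches rows by index label when a Series is assigned to a DataFrame column. A small sketch on made-up frames shows why; all names and values here are assumptions for illustration.

```python
import pandas as pd

# Two made-up frames; concat keeps each frame's original row labels.
a = pd.DataFrame({"score": [0.1, 0.2]})
b = pd.DataFrame({"score": [0.3, 0.4]})
stacked = pd.concat([a, b])
print(list(stacked.index))  # labels repeat: 0, 1, 0, 1

# A separately built frame has a fresh 0..3 index, like term_mat.
target = pd.DataFrame({"count": [5, 6, 7, 8]})

# After reset_index(drop=True) the labels are 0..3, so rows line up by position.
aligned = stacked.reset_index(drop=True)
target["score"] = aligned["score"]
print(target)
```

`reset_index(drop=True)` does in one call what the `reset_index` and `drop(columns='index')` cells do in two.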

### Train, test split and save my new sentiment term matrix

In [94]:
y = reddit['subreddit']

In [95]:
# random_state makes the split reproducible across runs
X_train_sen, X_test_sen, y_train_sen, y_test_sen = train_test_split(term_mat_sen, y, stratify=y, random_state=42)

In [96]:
X_train_sen.to_csv('../Data/X_train_sen.csv')

In [97]:
X_test_sen.to_csv('../Data/X_test_sen.csv')

In [98]:
y_train_sen.to_csv('../Data/y_train_sen.csv')

In [99]:
y_test_sen.to_csv('../Data/y_test_sen.csv')

In [100]:
term_mat.drop(columns=['positive','negative','neutral'],inplace=True)

In [101]:
term_mat['new_target'] = y
term_mat.to_csv('../Data/term_mat.csv')
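
The sentiment columns had to be dropped from `term_mat` because `term_mat_sen = term_mat` gives a second name to the same DataFrame rather than making a copy. The sketch below, on a made-up two-row frame, contrasts plain assignment with `.copy()`; the variable names are assumptions for illustration.

```python
import pandas as pd

base = pd.DataFrame({"tax": [1, 0], "vote": [0, 2]})

# Plain assignment: both names refer to the same DataFrame.
alias = base
alias["positive"] = [0.3, 0.1]
alias_changed_base = "positive" in base.columns

# Start again from a frame without the extra column.
base = base.drop(columns="positive")

# .copy() creates an independent DataFrame.
independent = base.copy()
independent["positive"] = [0.3, 0.1]
copy_changed_base = "positive" in base.columns

print(alias_changed_base, copy_changed_base)
```

With `.copy()`, the original term matrix would keep only the word counts and no later drop would be needed.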

In my next notebook, EDA, I will use this term matrix to show the most prevalent words for Democrats and Republicans in bar plots. I will also use my train-test split data for modeling.