# Preprocessing

This notebook generates the training, validation, and test sets that we should all use with our models. It also includes all the visualization. Here are some general notes on the process.

### Data Cleaning

The data cleaning here isn't perfect. There are instances where words like "don't" become "don t", along with similar artifacts. But overall it's good enough for now. Here's a good tutorial that covers some of the important steps:

https://medium.com/@annabiancajones/sentiment-analysis-of-reviews-text-pre-processing-6359343784fb
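
One way to avoid the "don't" becoming "don t" artifact is to expand contractions before non-letter characters are replaced with spaces. The sketch below is only an illustration, not the cleaning code used in this notebook. `CONTRACTIONS`, `expand_contractions`, and `letters_only` are hypothetical names, and the curly-apostrophe normalization assumes that some raw comments use a typographic apostrophe.

```python
import re

# Hypothetical, deliberately tiny mapping; a real one would be much longer.
CONTRACTIONS = {"don't": "do not", "can't": "can not", "it's": "it is", "i'm": "i am"}

# One alternation with word boundaries, so only whole words are matched.
_pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in CONTRACTIONS) + r")\b")

def expand_contractions(text):
    # Assumption: some comments use a curly apostrophe (U+2019) instead of '.
    text = text.lower().replace("\u2019", "'")
    return _pattern.sub(lambda m: CONTRACTIONS[m.group(1)], text)

def letters_only(text):
    # Same idea as the notebook's cleaning: keep a-z only, collapse whitespace.
    return " ".join(re.sub(r"[^a-z]", " ", text).split())

raw = "I don\u2019t think it's true"
print(letters_only(raw.lower()))               # contractions left as-is
print(letters_only(expand_contractions(raw)))  # contractions expanded first
```

The first call shows the stray single letters that appear when the apostrophe is simply replaced by a space; the second shows the same sentence with the contractions expanded beforehand.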



### Sampling

We need to pay attention to exactly how we sample the comments from each subreddit. There are two important things to focus on.

#### 1) Length of comments

One subreddit might have a higher average comment length than another, so we want to eliminate any effect this may have on our model. 

Also, if the goal of our model is to distinguish a comment as coming from a "conservative" subreddit versus a "liberal" subreddit, then longer comments are better. Short comments like "sure haha", "thanks!", "I totally agree", etc. don't give any information that could help the model's prediction. 

Since r/democrats is much less popular than r/Conservative and r/politics, there aren't as many lengthy discussions with long comments. You can see this by running `dem_comments_filtered = filter_by_word_count(df=dem_comments, min_words=10)` and comparing it to r/politics and r/Conservative (r/democrats also has fewer total comments).

#### 2) Size of dataset

The models should be trained on a dataset which combines an equal number of "liberal" and "conservative" comments. We have 411,811 comments from r/Conservative, 77,763 from r/democrats, and 1,481,227 from r/politics. Therefore we need to take an equal number from each dataset through some type of random sampling. But before we do the random sampling we need to filter the dataset so that we're working only with comments that are above some word count threshold. 

With these two points in mind, it seems like we will have more flexibility if we use r/politics as our "liberal" subreddit instead of r/democrats, since we can construct a dataset with a large number of long comments. And if you take a look at r/politics vs. r/democrats, they are very similar in overall tone and political leaning. 
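
The filter-then-sample order described above can be sketched on toy data. `balanced_sample` is a hypothetical helper, not a function defined in this notebook, and the fixed `seed` is an assumption added so the draw is reproducible.

```python
import pandas as pd

def balanced_sample(df_a, df_b, min_words, n=None, seed=0):
    """Filter both frames by word count, then draw the same number of rows from each."""
    a = df_a[df_a["body_word_count"] > min_words]
    b = df_b[df_b["body_word_count"] > min_words]
    # By default, use as many rows as the smaller filtered frame allows.
    n = min(len(a), len(b)) if n is None else n
    return a.sample(n, random_state=seed), b.sample(n, random_state=seed)

# Toy data standing in for two subreddits of very different sizes.
big = pd.DataFrame({"body_word_count": [5, 12, 40, 55, 70, 90, 33, 8]})
small = pd.DataFrame({"body_word_count": [3, 35, 60, 9]})

big_s, small_s = balanced_sample(big, small, min_words=30)
print(len(big_s), len(small_s))
```

Filtering first matters: if we sampled first and filtered afterwards, the two classes would usually end up with different sizes again.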


In [None]:
import re
import random 
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import plotly.graph_objects as go
from wordcloud import WordCloud, STOPWORDS
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.feature_extraction.text import TfidfVectorizer, HashingVectorizer, CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

pd.set_option("display.precision", 2)
%config InlineBackend.figure_format = 'retina'
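# 'seaborn-whitegrid' was renamed to 'seaborn-v0_8-whitegrid' in matplotlib 3.6; use whichever exists
plt.style.use('seaborn-v0_8-whitegrid' if 'seaborn-v0_8-whitegrid' in plt.style.available else 'seaborn-whitegrid')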
plt.rc('xtick',labelsize=18)
plt.rc('ytick',labelsize=18)

In [None]:
# from google.colab import drive
# drive.mount('/content/drive')

In [None]:
# from psutil import virtual_memory
# ram_gb = virtual_memory().total / 1e9
# print('Your runtime has {:.1f} gigabytes of available RAM\n'.format(ram_gb))

# if ram_gb < 20:
#   print('To enable a high-RAM runtime, select the Runtime > "Change runtime type"')
#   print('menu, and then select High-RAM in the Runtime shape dropdown. Then, ')
#   print('re-execute this cell.')
# else:
#   print('You are using a high-RAM runtime!')

In [None]:
comment_col_names = ["body_text", "author", "score", "created_utc", "post_id"]

Import the data and add a word count column. Remove rows that have missing values, and remove duplicate rows that share the same `body_text`.

In [None]:
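def import_comments(filename):
    df = pd.read_csv(filename, names=comment_col_names)
    # Drop rows with missing values in any column
    df = df.dropna()
    # astype returns a new Series, so the result has to be assigned back
    df["body_text"] = df["body_text"].astype(str)
    # Drop empty comments and comments with duplicate text
    df = df[df["body_text"] != ""].drop_duplicates(subset=["body_text"]).copy()
    # Word count of the raw (uncleaned) comment
    df["body_word_count"] = df["body_text"].apply(lambda x: len(x.split()))
    return df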

In [None]:
con_comments_raw = import_comments(filename="reddit_data/conservative_comments.csv")

In [None]:
dem_comments_raw = import_comments(filename="reddit_data/democrats_comments.csv")

In [None]:
pol_comments_raw = import_comments(filename="reddit_data/politics_comments.csv")

Use the Python regex module (`re`) to do some cleaning. See this link for a basic tutorial:
https://www.w3schools.com/python/python_regex.asp

A couple of points:
- `str.strip()` (a string method, not part of `re`) removes whitespace from the beginning and end of the comment.
- `re.sub()` works like this: 

`cleaned_comment = re.sub(r'things you want to replace', 'what you want to replace it with', comment)`
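
As a small illustration of the two points above, here is a sketch that chains a few `re.sub()` calls and a final `strip()` on a made-up comment. The comment text and the `example.com` URL are invented for the example.

```python
import re

comment = "   Check this out: https://example.com/page   it's   GREAT!!!  "

no_url = re.sub(r"http\S+", "", comment)       # drop anything that looks like a URL
letters = re.sub(r"[^a-zA-Z]", " ", no_url)    # every non-letter becomes a space
single_spaced = re.sub(r"\s+", " ", letters)   # collapse runs of whitespace
cleaned = single_spaced.strip().lower()        # str.strip() trims both ends

print(cleaned)
```

Note that the apostrophe in "it's" is turned into a space by the second step, which is how single-letter tokens like "s" and "t" end up in the cleaned text.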

In [None]:
punctuation = r'["\'?,.]'  # regex character class; all of these punctuation characters are replaced with ''
abbr_dict={
    "what's":"what is",
    "what're":"what are",
    "who's":"who is",
    "who're":"who are",
    "where's":"where is",
    "where're":"where are",
    "when's":"when is",
    "when're":"when are",
    "how's":"how is",
    "how're":"how are",

    "i'm":"i am",
    "we're":"we are",
    "you're":"you are",
    "they're":"they are",
    "it's":"it is",
    "he's":"he is",
    "she's":"she is",
    "that's":"that is",
    "there's":"there is",
    "there're":"there are",

    "i've":"i have",
    "we've":"we have",
    "you've":"you have",
    "they've":"they have",
    "who've":"who have",
    "would've":"would have",
    "not've":"not have",

    "i'll":"i will",
    "we'll":"we will",
    "you'll":"you will",
    "he'll":"he will",
    "she'll":"she will",
    "it'll":"it will",
    "they'll":"they will",

    "isn't":"is not",
    "wasn't":"was not",
    "aren't":"are not",
    "weren't":"were not",
    "can't":"can not",
    "couldn't":"could not",
    "don't":"do not",
    "didn't":"did not",
    "shouldn't":"should not",
    "wouldn't":"would not",
    "doesn't":"does not",
    "haven't":"have not",
    "hasn't":"has not",
    "hadn't":"had not",
    "won't":"will not",
    punctuation:'',
    r'\s+':' ', # replace runs of whitespace with one single space
}

In [None]:
def clean_comment(w):
    """This is a pretty important function. Try commenting out / uncommenting some of these lines."""
    w = w.strip()
    # Collapse runs of spaces and double quotes into a single space
    w = re.sub(r'[" ]+', " ", w)
    # Replace everything except letters (a-z, A-Z) with a space (or add more stuff like ?,!,..etc).
    # This is where brackets go away ([hello] --> hello) and where an apostrophe becomes a space.
    w = re.sub(r"[^a-zA-Z]", " ", w)
    # Remove all duplicate whitespaces ("   hello   my name   is" --> "hello my name is")
    w = " ".join(w.split())
    w = w.strip()
    return w

In [None]:
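def clean_data(df):
    df['body_text'] = df["body_text"].str.lower()
    # Replace literal "\n" sequences (escaped newlines) with a space
    df["body_text"] = df["body_text"].apply(lambda x: re.sub(r"\\n", " ", x))
    # Remove any remaining backslashes
    df["body_text"] = df["body_text"].apply(lambda x: re.sub(r"\\", "", x))
    # Remove double quotes, question marks, commas, and periods (apostrophes are kept for the next steps)
    df["body_text"] = df["body_text"].apply(lambda x: re.sub(r'["?,.]', "", x))
    # A typographic apostrophe (U+2019) never matches the keys of abbr_dict and is later turned into
    # a space by clean_comment ("don t"). Assumption: normalizing it here is what we want.
    df["body_text"] = df["body_text"].str.replace("\u2019", "'", regex=False)
    # Expand contractions. Contractions that are not keys of abbr_dict (for example "i'd" or "let's")
    # are not expanded; they only lose their apostrophe.
    df["body_text"] = df["body_text"].replace(abbr_dict, regex=True)
    # Remove URLs
    df["body_text"] = df["body_text"].apply(lambda x: re.sub(r'http\S+', '', x))
    # Keep letters only and collapse whitespace
    df['body_text'] = df['body_text'].apply(clean_comment)
    return df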

In [None]:
con_comments = con_comments_raw.copy(deep=True)
con_comments = clean_data(df=con_comments)

In [None]:
dem_comments = dem_comments_raw.copy(deep=True)
dem_comments = clean_data(df=dem_comments)

In [None]:
pol_comments = pol_comments_raw.copy(deep=True)
pol_comments = clean_data(df=pol_comments)

Save the final cleaned comments as `.pkl` files. Then whenever you run the notebook again you don't need to rerun all the data cleaning steps.
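
The save-once, load-afterwards pattern can also be wrapped in a small helper. This is only a sketch: `load_or_build` is a hypothetical function that is not used elsewhere in this notebook, and the toy `build` function stands in for the expensive cleaning step.

```python
import os
import tempfile
import pandas as pd

def load_or_build(path, build_fn):
    """Return the pickled DataFrame at `path`, building and saving it first if needed."""
    if os.path.exists(path):
        return pd.read_pickle(path)
    df = build_fn()
    df.to_pickle(path)
    return df

# Toy demonstration in a temporary directory.
calls = []

def build():
    calls.append(1)  # count how often the expensive step actually runs
    return pd.DataFrame({"body_text": ["hello world"]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "comments_cleaned.pkl")
    first = load_or_build(path, build)   # builds and saves
    second = load_or_build(path, build)  # loads from disk

print(len(calls))
```

If the cleaning code changes, the stale `.pkl` file has to be deleted by hand, since the helper only checks whether the file exists.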

In [None]:
con_comments.to_pickle("reddit_data/conservative_comments_cleaned.pkl")
dem_comments.to_pickle("reddit_data/democrats_comments_cleaned.pkl")
pol_comments.to_pickle("reddit_data/politics_comments_cleaned.pkl")

Read in the cleaned data if it was already saved.

In [None]:
con_comments = pd.read_pickle("reddit_data/conservative_comments_cleaned.pkl")
dem_comments = pd.read_pickle("reddit_data/democrats_comments_cleaned.pkl")
pol_comments = pd.read_pickle("reddit_data/politics_comments_cleaned.pkl")

In [None]:
print("{} comments from r/Conservative".format(len(con_comments)))
print("{} comments from r/democrats".format(len(dem_comments)))
print("{} comments from r/politics".format(len(pol_comments)))

50,000 seems like a good number to sample from each subreddit (we're using r/Conservative and r/politics). So I set the `min_words` to be the highest number that would return over 50,000 comments. 
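
Finding "the highest `min_words` that still returns over N comments" does not have to be done by trial and error. The sketch below computes it directly from a list of word counts. `highest_min_words` is a hypothetical helper, and it ignores any `max_words` upper bound, so it only applies when filtering on the lower bound alone.

```python
def highest_min_words(word_counts, target):
    """Largest min_words such that more than `target` comments have word count > min_words.

    Returns None if even min_words=0 does not leave enough comments.
    """
    counts = sorted(word_counts, reverse=True)
    if len(counts) <= target:
        return None
    # counts[target] is the (target + 1)-th longest comment. A threshold just below
    # its length keeps at least target + 1 comments; any higher threshold keeps too few.
    threshold = counts[target] - 1
    return threshold if threshold >= 0 else None

toy_counts = [5, 12, 40, 55, 70, 90, 33, 8]
best = highest_min_words(toy_counts, target=3)
print(best, sum(c > best for c in toy_counts))
```

On the real data this would be called with something like `df.body_word_count.tolist()` and the desired sample size as `target`.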

In [None]:
def filter_by_word_count(df, min_words, max_words=None):
    """Keep comments with body_word_count > min_words (and < max_words, if given). Both bounds are exclusive."""
    mask = df.body_word_count > min_words
    if max_words is not None:
        mask &= df.body_word_count < max_words
    df = df.loc[mask, :]
    print("{} comments found".format(len(df)))
    return df

In [None]:
con_comments_filtered = filter_by_word_count(df=con_comments, min_words=30, max_words=110)

In [None]:
dem_comments_filtered = filter_by_word_count(df=dem_comments, min_words=10)

In [None]:
pol_comments_filtered = filter_by_word_count(df=pol_comments, min_words=30, max_words=110)

Randomly sample an equal number of comments from each subreddit.

In [None]:
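# Number of comments drawn per subreddit; it must not exceed the size of either filtered set.
# The note above mentions 50,000, so keep the two in sync.
n_samples = 100000
con_comments = con_comments_filtered.sample(n_samples)
pol_comments = pol_comments_filtered.sample(n_samples)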

con_comments = con_comments.reset_index(drop=True)
pol_comments = pol_comments.reset_index(drop=True)

## Handling stop words

"Stop words" are common words like "the", "and", "is", etc. 



For visualization, we need to remove them or else they will fill up the word cloud. Removing them might also help with model performance, but it depends on which type of model you are using (not sure on this). Right now I am removing stop words for our final dataset, but we should revisit this and maybe train our models with stop words included. 

Also, from earlier versions of the word cloud I noticed certain errors in the data cleaning, like the letter "t" and the letter "m" showing up as distinct words. 
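
To make the single-letter problem concrete, here is a small sketch with a made-up sentence and a hypothetical mini stop list (the notebook itself uses `wordcloud.STOPWORDS` plus a few extras). Because the filter works token by token after splitting on whitespace, a two-word entry such as "don t" can never match; the pieces "don" and "t" have to be listed separately.

```python
import re

# Hypothetical mini stop list, including the stray single letters.
stop_words = {"the", "and", "is", "i", "t", "s", "m", "don"}

def strip_stop_words(text):
    # Keep only the tokens that are not in the stop list.
    return " ".join(w for w in text.split() if w not in stop_words)

raw = "i don't think the plan is good and it's late"
# Letters only, as in the cleaning step: the apostrophes become spaces.
cleaned = " ".join(re.sub(r"[^a-z]", " ", raw).split())

print(cleaned)
print(strip_stop_words(cleaned))
```

The first printed line contains the stray "t" and "s" tokens; the second shows the text after they and the ordinary stop words are removed.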

In [None]:
# Add stuff to the stopwords list 
SW_list = list(STOPWORDS)
# SW_list.extend(["n","want","will","one","s","dont","don","don t","t","people","think","even","thing","m","u"])
SW_list.extend(["n","s","don","don t","t","m","u"])
STOPWORDS = set(SW_list) 
# print(STOPWORDS)

In [None]:
strip_words = lambda x: ' '.join([item for item in x.split() if item not in STOPWORDS])
con_comments["body_text"] = con_comments["body_text"].apply(strip_words)
pol_comments["body_text"] = pol_comments["body_text"].apply(strip_words)

# Data Visualization



## Word Cloud

In [None]:
def convert_to_wordlist(df):
    """Flatten the body_text column into one big list of words."""
    words = []
    for comment in df["body_text"].astype(str):
        # split() with no argument never produces empty strings,
        # even for comments that are empty after stop word removal
        words.extend(comment.split())
    # Final cleaning step (might not be necessary): drop the literal string 'nan'
    cleanedList = [x for x in words if x != 'nan']
    return cleanedList

In [None]:
con_wordlist = convert_to_wordlist(df=con_comments)
pol_wordlist = convert_to_wordlist(df=pol_comments)

con_temp_df = pd.DataFrame({'col':con_wordlist})
pol_temp_df = pd.DataFrame({'col':pol_wordlist})

print(con_wordlist[0:10])
print(pol_wordlist[0:10])

In [None]:
plt.figure(figsize=(16,16))

plt.subplot(1,2,1)
wordcloud = WordCloud(stopwords = STOPWORDS, background_color = 'white', width = 1200,  height = 1200,
                      max_words = 150).generate(' '.join(con_wordlist))
plt.imshow(wordcloud)
plt.axis('off')
plt.title('r/Conservative Word Cloud',fontsize = 20)

plt.subplot(1,2,2)
wordcloud = WordCloud(stopwords = STOPWORDS, background_color = 'white', width = 1200,  height = 1200,
                      max_words = 150).generate(' '.join(pol_wordlist))
plt.imshow(wordcloud)
plt.axis('off')
plt.title('r/politics Word Cloud',fontsize = 20)

plt.tight_layout()
plt.show()

In [None]:
n_words = 30
rot = 45

plt.figure(figsize=(16,16))
color = plt.cm.copper(np.linspace(0, 1, n_words))

plt.subplot(2,1,1)
con_temp_df['col'].value_counts().head(n_words).plot.bar(color = color)
plt.title('r/Conservative Most Used Words', fontsize = 20)
plt.xticks(rotation = rot)
plt.grid()

plt.subplot(2,1,2)
pol_temp_df['col'].value_counts().head(n_words).plot.bar(color = color)
plt.title('r/politics Most Used Words', fontsize = 20)
plt.xticks(rotation = rot)
plt.grid()

plt.tight_layout()
plt.show()

# Visualizing Comment Clusters

Set aside a copy of the comments and format them in a special way (wrap them so that a long comment doesn't show up as one super long line of text that goes off the page).
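
The wrapping idea can be tried on a single string with the standard library's `textwrap` module. This is a standalone sketch; `wrap_for_hover` is a hypothetical helper and the comment text is made up.

```python
import textwrap

def wrap_for_hover(text, width=30):
    """Wrap text at `width` characters and join the lines with <br> for Plotly hover labels."""
    return "<br>".join(textwrap.wrap(text, width=width))

comment = "this is a fairly long comment that would otherwise run off the side of the plot"
wrapped = wrap_for_hover(comment)
print(wrapped)
```

Plotly renders `<br>` as a line break in hover text, which is why the newlines are swapped for that tag.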

In [None]:
def format_df_for_plotting(df):
    df_formatted = df.copy(deep=True)
    df_formatted.body_text = df_formatted.body_text.str.wrap(30)
    df_formatted.body_text = df_formatted.body_text.apply(lambda x: x.replace('\n', '<br>'))
    return df_formatted

In [None]:
con_comments_formatted = format_df_for_plotting(df=con_comments)
pol_comments_formatted = format_df_for_plotting(df=pol_comments)

Vectorize the comments. Here I follow the project 2 notebook and use tf-idf. 
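
One thing to keep in mind with tf-idf: calling `fit_transform` on each subreddit separately learns a separate vocabulary each time, so column 0 of one matrix is not the same word as column 0 of the other. That is fine for plotting each subreddit on its own. If the two matrices ever need to be compared or stacked, one option (an assumption about a possible future use, not something this notebook does) is to fit once on both corpora, as in this toy sketch.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny made-up corpora standing in for the two subreddits.
con_texts = ["taxes border freedom", "freedom market taxes"]
pol_texts = ["healthcare climate vote", "vote taxes climate"]

# Fit once on both corpora so every column means the same word in both matrices.
shared = TfidfVectorizer(max_features=2**12)
shared.fit(con_texts + pol_texts)
X_con_shared = shared.transform(con_texts)
X_pol_shared = shared.transform(pol_texts)

print(X_con_shared.shape, X_pol_shared.shape)
print(sorted(shared.vocabulary_))
```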

In [None]:
vectorizer = TfidfVectorizer(max_features=2**12) # Mess around with this number 

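# fit_transform refits the vectorizer each time, so X_con and X_pol are built from
# different vocabularies and their columns are not comparable with each other
X_con = vectorizer.fit_transform(con_comments['body_text'].values)
X_pol = vectorizer.fit_transform(pol_comments['body_text'].values)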

print(X_con.shape)
print(X_pol.shape)

## Generate the cluster labels. 



#### Reduce Dimensions Using PCA


t-SNE would take forever to compute on an N x 4096 array. So use PCA to get it down to around N x 20 (or some other number), and then use t-SNE. Might be worth experimenting with different PCA dimensions to see how it affects the clusters.
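
One simple way to experiment with the number of PCA dimensions is to look at how much variance the first k components retain. The sketch below does this on random data, which stands in for the dense tf-idf matrix, so the printed numbers say nothing about the real comments.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))   # stand-in for the dense tf-idf matrix

# Fit once with the largest number of components we want to consider.
pca_probe = PCA(n_components=30).fit(X)
cumulative = np.cumsum(pca_probe.explained_variance_ratio_)

for k in (5, 10, 20, 30):
    print(k, round(float(cumulative[k - 1]), 3))
```

Retained variance is only a rough guide here, since the final judgment is how the t-SNE plot looks, but it gives a quick sense of how much information a given dimension throws away.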

In [None]:
pca = PCA(n_components=20) 

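# PCA needs a dense array; toarray() on a 100000 x 4096 float matrix takes roughly 3 GB of RAM.
# fit_transform refits pca each time, so the two projections are independent of each other.
con_pca_result = pca.fit_transform(X_con.toarray())
pol_pca_result = pca.fit_transform(X_pol.toarray())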

#### Cluster Using Gaussian Mixtures

Here I use Gaussian Mixtures. We can't apply it to the full N x 4096 matrix, so instead we apply it to the array we get after doing PCA.

In [None]:
gm = GaussianMixture(n_components=20, n_init=1, verbose=0)

y_con_pred = gm.fit_predict(con_pca_result)
y_pol_pred = gm.fit_predict(pol_pca_result)
print(y_pol_pred.shape)

In [None]:
tsne = TSNE(
    verbose=0, 
    n_components=2,
    perplexity=30, # good values are 10-50. 30 is default.
)

In [None]:
X_con_embedded = tsne.fit_transform(con_pca_result)

In [None]:
X_pol_embedded = tsne.fit_transform(pol_pca_result)

In [None]:
# For 3D plot, change the n_components parameter of TSNE to 3. 
fig = go.Figure(
    data=go.Scattergl( #go.Scatter3d  for 3D plot, go.Scattergl for 2D
        name="",
        x=X_con_embedded[:,0],
        y=X_con_embedded[:,1],
        # z=X_con_embedded[:,2], # for 3D plot
        mode='markers',
        marker=dict(
            size=5, # size=2 for 3D plot, size=5 for 2D
            opacity=0.7,
            color = y_con_pred,
            colorscale="jet"
        ),
        text=con_comments_formatted['body_text'],
        hovertemplate = "</br> %{text}",
    )
) 

fig.update_layout(
    title='t-SNE r/Conservative Comments',
    template="ggplot2",
    height=800,
    hoverlabel=dict(
        bgcolor="white",
        font_size=16,
    )
)
fig.show()

In [None]:
fig = go.Figure(
    data=go.Scattergl(
        name="",
        x=X_pol_embedded[:,0],
        y=X_pol_embedded[:,1],
        # z=X_pol_embedded[:,2],
        mode='markers',
        marker=dict(
            size=5,
            opacity=0.7,
            color = y_pol_pred,
            colorscale="jet"
        ),
        text=pol_comments_formatted['body_text'],
        hovertemplate = "</br> %{text}",
    )
) 

fig.update_layout(
    title='t-SNE r/politics Comments',
    template="ggplot2",
    height=800,
    hoverlabel=dict(
        bgcolor="white",
        font_size=16,
    )
)
fig.show()

Here I made labels 1 for conservative and 0 for liberal. Either r/democrats or r/politics can be used for the "liberal" label; the cell below uses r/politics.

In [None]:
# Labels: 1 = conservative (r/Conservative), 0 = liberal (r/politics)
con_comments["label"] = 1
pol_comments["label"] = 0

df = pd.concat([con_comments, pol_comments])

# Shuffle the data
df = df.sample(frac=1).reset_index(drop=True)

Sample 15% of the data for the test set 

In [None]:
test_df = df.sample(frac=0.15, replace=False)

test_df.to_pickle("reddit_data/TEST_DATASET.pkl")

Save the remaining data as the training set. The training set can then be split into training and validation sets when experimenting with models. Only use the test set for final evaluations
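
For the training set to really be "the remaining data", the test rows have to be removed from it before saving. Here is a minimal sketch of that on toy data; the `_toy` names are made up, and the fixed `random_state` is an assumption added so the split is reproducible.

```python
import pandas as pd

df_toy = pd.DataFrame({
    "body_text": [f"comment {i}" for i in range(20)],
    "label": [0, 1] * 10,
})

test_toy = df_toy.sample(frac=0.15, replace=False, random_state=0)  # 15% held out
train_toy = df_toy.drop(test_toy.index)                             # everything else

print(len(train_toy), len(test_toy))
```

Dropping by index only works if the index is unique, which it is here and also after a `reset_index(drop=True)`.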

In [None]:
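# Drop the test rows first so the training set does not overlap with the test set
train_df = df.drop(test_df.index)
train_df.to_pickle("reddit_data/DATASET.pkl")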