# Part 2: Preprocessing & Modeling

## Imports

In [None]:
import pandas                        as pd
import numpy                         as np
import seaborn                       as sns
import matplotlib.pyplot             as plt
from sklearn.ensemble                import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.linear_model            import LogisticRegression
from sklearn.metrics                 import confusion_matrix
from sklearn.model_selection         import GridSearchCV, train_test_split, cross_val_score
from sklearn.pipeline                import Pipeline
from sklearn.tree                    import DecisionTreeClassifier
from sklearn.neighbors               import KNeighborsClassifier
from nltk.stem                       import WordNetLemmatizer
from nltk.tokenize                   import word_tokenize 
from IPython.display                 import display, HTML
display(HTML("<style>.container { width:95% !important; }</style>"))
sns.set(style = "white", palette = "deep")
%matplotlib inline

## Table Of Contents

-------

1. [Reading In The Data](#Reading-In-The-Data)
    - [Overview](#Overview)
    - [Visuals](#Visuals)
    
    
2. [Preprocessing](#Preprocessing)


3. [Establishing The Baseline](#Establishing-The-Baseline)




## Reading In The Data

### Overview

In [None]:
model_data = pd.read_csv("../Data/model_data.csv")

In [None]:
# Checking the data's head

model_data.head()

In [None]:
# Checking for null values

model_data.isnull().sum()

In [None]:
# Checking data types

model_data.info()

### Visuals

#### Functions

In [None]:
def plot_text_length_dist(lengths):
    # Plots a histogram of text lengths; `lengths` is a sequence of numbers
    plt.figure(figsize = (18,6))
    # histplot replaces the deprecated distplot(kde = False)
    sns.histplot(x = lengths, color = "black",
                 bins = 60)
    plt.title("Distribution Of Text Lengths", size = 18)
    plt.xlabel("Length (Characters)", size = 16)
    plt.ylabel("Frequency", size = 16)
    plt.xticks(np.arange(0,23500,1500), size = 14)
    plt.yticks(size = 14)
    plt.tight_layout()
    plt.show()

In [None]:
def plot_most_frequent_authors(df, col):
    
    plt.figure(figsize = (20,6))
    sns.barplot(x = df.index,
                y = col,
                data = df)
    plt.title("Most Common Posters", size = 18)
    plt.xlabel("Reddit User", size = 16)
    plt.ylabel("Number Of Posts", size = 16)
    plt.xticks(size = 13)
    plt.yticks(size = 14);

#### Text Length

In [None]:
# Generating a list of text lengths

length_list = [len(text) for text in model_data["text"]]

plot_text_length_dist(length_list)

Most of the posts are relatively short (<2,000 characters), but there are a few that are extremely long (>20,000 characters).  We expected that most posts would be shorter than a few thousand characters, which is true for the majority.
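Because `len()` on a string counts characters rather than words, the lengths plotted above are character counts.  The sketch below shows the difference on two made-up posts (the `sample_texts` list is hypothetical and stands in for `model_data["text"]`); splitting on whitespace is only a rough approximation of a word count.

```python
# Hypothetical toy posts standing in for model_data["text"]
sample_texts = [
    "How do I sear a steak?",
    "Looking for a good weeknight pasta recipe",
]

# len() on a string counts characters, including spaces and punctuation
char_lengths = [len(text) for text in sample_texts]

# Splitting on whitespace gives a rough word count instead
word_lengths = [len(text.split()) for text in sample_texts]

print(char_lengths)
print(word_lengths)
```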

####  Most Frequent Authors

In [None]:
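# Naming the count column explicitly so the plot does not depend on
# how value_counts() labels its output in a given pandas version

author_count = model_data["author"].value_counts().head(10).rename("posts").to_frame()

plot_most_frequent_authors(df  = author_count, 
                           col = "posts")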

We had no specific expectation for this graph beyond the general pattern that a few users post very frequently while most barely post at all.  We would have liked to look at the number of comments by each user in both subreddits as a measure of activity, but that is beyond the scope of this project.

#### Subreddit Of Origin

In [None]:
tick_labels = ["r/Cooking", "r/AskCulinary"]


plt.figure(figsize = (10,5))
sns.countplot(x = model_data["source"])
plt.title("Post Origin", size = 18)
plt.xlabel("Source", size = 16)
plt.ylabel("Number Of Posts", size = 16)
plt.xticks(np.arange(0,2,1), 
           labels = tick_labels, 
           size = 14)
plt.yticks(size = 14);

We were a little surprised that there are more r/AskCulinary posts because we had roughly equal numbers of pulls from each subreddit.

#### Visualizing Most Common Words

Before we start modeling, we need to know which words are most frequent in each subreddit, because words that are common to both may make it harder for our model to tell the two apart.

We will subset the dataframe into posts from r/Cooking and r/AskCulinary and use `CountVectorizer` to determine the most frequent words.  We will also remove stop words from the outset.

In [None]:
def plot_most_frequent_words(dataframes, titles):
    fig = plt.figure(figsize = (24,20))
    # One horizontal bar chart per dataframe: counts on x, words on y
    for d, dataframe in enumerate(dataframes):
        ax = fig.add_subplot(2, 2, d + 1)
        sns.barplot(x       = 0,
                    y       = dataframe.index,
                    data    = dataframe,
                    palette = "deep")
        plt.title(f"Most Common Words From {titles[d]}", size = 20)
        plt.xlabel("Number Of Occurrences", size = 18)
        plt.ylabel("Word", size = 18)
        plt.xticks(size = 16)
        plt.yticks(size = 16)

In [None]:
# Instantiating the count vectorizer

vectorizer = CountVectorizer()

# Instantiating one vectorizer per subreddit that removes English stop words

cvec_cooking     = CountVectorizer(stop_words = "english")
cvec_askculinary = CountVectorizer(stop_words = "english")

In [None]:
# Subsetting the dataframe

cooking     = model_data[model_data["target"] == 1]
askculinary = model_data[model_data["target"] == 0]

# Fit-transforming the vectorizer

vec_cooking     = cvec_cooking.fit_transform(cooking["text"])
vec_askculinary = cvec_askculinary.fit_transform(askculinary["text"])

In [None]:
# Saving the vectorized text to new dataframes
# (get_feature_names() was removed in newer scikit-learn releases)

cooking_vectorized = pd.DataFrame(vec_cooking.toarray(), 
                                  columns = cvec_cooking.get_feature_names_out())

askculinary_vectorized = pd.DataFrame(vec_askculinary.toarray(), 
                                      columns = cvec_askculinary.get_feature_names_out())

In [None]:
# Instantiating the lemmatizer

lemmatizer = WordNetLemmatizer()

In [None]:
# Getting the 15 most frequent words from each

vectorized_cooking     = pd.DataFrame(cooking_vectorized.sum().sort_values(ascending = False).head(15))
vectorized_askculinary = pd.DataFrame(askculinary_vectorized.sum().sort_values(ascending = False).head(15))

In [None]:
# Plotting the most common words

plot_most_frequent_words(dataframes = [vectorized_cooking, vectorized_askculinary],
                         titles     = ["r/Cooking", "r/AskCulinary"])

We can see that a lot of words occur in both subreddits.  Because of that, we decided to create a list of customized stop words.  Furthermore, we noticed that we have to lemmatize or stem the text column because multiple forms of the same word appear among the most frequent words, such as 'make' and 'making' or 'recipe' and 'recipes'.

## Preprocessing

## Establishing The Baseline

A baseline in classification gives us a reference point for judging how well the model is performing.  The baseline is the accuracy we would get by always predicting the most common class, which is simply that class's share of the data as a whole.

Posts from r/Cooking make up 41.44% of the data, so always guessing r/AskCulinary would be correct 58.56% of the time.  If our model has an accuracy of >58.56% we know that it is better than simply guessing the class of a post.

In [None]:
round(model_data["source"].value_counts(normalize = True)*100, 2)
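The class shares above can be turned into a single baseline accuracy by taking the share of the most common class.  Below is a minimal, dependency-free sketch of that idea; the `labels` list is hypothetical and stands in for `model_data["source"]`.

```python
from collections import Counter

# Hypothetical labels standing in for model_data["source"]
labels = ["AskCulinary", "AskCulinary", "AskCulinary", "Cooking", "Cooking"]

counts = Counter(labels)
majority_class, majority_count = counts.most_common(1)[0]

# Accuracy of a model that always predicts the majority class
baseline_accuracy = majority_count / len(labels)

print(majority_class, round(baseline_accuracy * 100, 2))
```

With pandas, the same number should be obtainable as `model_data["source"].value_counts(normalize = True).max()`.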