## Project 3: Subreddit Classification

### Contents

- [Probelem Scenario](#Problem-Scenario)
- [Executive Summary](#Executive-Summary)
- [Imports](#Imports)
- [The Data](#The-Data)
- [Preliminary EDA](#Preliminary-EDA)
- [Cleaning](#More-EDA-&-Cleaning)
- [Pre-processing](#Setting-up-our-data-for-modeling)
- [Modeling](#MODELING)
- [Evaluation](#Evaluation)
- [Conclusions](#Conclusions-&-Next-steps)

Note: Data Collection can be found in first notebook (title: Data Collection

### Executive Summary

The data in this project was acquired using Reddit's API to pull relevant data from both r/FORTnITE and r/Minecraft subreddits. 1000 posts were pulled from both subreddits which totalled 2000 posts exactly. The we extracted 'title' and 'subtext' from the raw data. This is the information that we wanted to analyze with our model. After reviewing,  the data in those features, it was decided that we would keep the title feature as it contained full information/title in each row with no missing/null values.

Next, a new column was created and labeled either 1 or 0. 1 was the label for the Fortnite subreddit and 0 was the label for the Minecraft subreddit. Both subreddits were then joined together to form a full dataframe and saved to a csv as preperation for EDA and cleaning.

The Cleaning process includes transforming the data to lowercase, remove punctuation, tokenizing, examining data using both lemmatizing and stemming, ultimately choosing lemmatizing, removing stop words, and then saving the updated information to an updated csv to prepare for modeling.

I used 5 different models on the data: Logistic Regression with Countvectorizer, Logistic Regression with TFIDF, Naive Bayes - Multinomial with Countvectorizer, Naive Bayes - Gaussian with TFIDF, and Decision Tree Classifier - TFIDF Vectorizer. A grid search was performed on all models except Multinomial and Gaussian.

The best model was Naive Bayes - Multinomial with Countvectorize.

### Imports

In [93]:
#potential needed imports

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

from sklearn.metrics import confusion_matrix, classification_report
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.feature_extraction import stop_words
from sklearn.naive_bayes import MultinomialNB, GaussianNB, BernoulliNB
from sklearn.ensemble import GradientBoostingClassifier, AdaBoostClassifier, VotingClassifier

import nltk
from scipy.sparse import csr_matrix, issparse
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer
from nltk.stem.snowball import SnowballStemmer
from nltk.stem import PorterStemmer
from collections import Counter 

### The Data

In [94]:
#reading the csv
df = pd.read_csv('./reddit.csv')
#checking out the first 5 rows
df.head()

Unnamed: 0,title,label
0,Wow. 20 Gold. I am so happy.,1
1,Missions Open lobby number count per missions ...,1
2,My new makeshift girlfriend in stw,1
3,Seeing as Gas traps changed should I reset my ...,1
4,Brainstorm Session: New Abilities,1


In [95]:
#checking out a sample of 10 posts to make sure the labels are correct in the df
df.sample(20)

Unnamed: 0,title,label
1208,What even is that?,0
901,Best weapon for every catagory,1
68,hoping that season 11 will actually give us co...,1
272,Please bring back the old transition animation...,1
1990,thingy i made,0
1968,New family member!,0
1490,I found ... and that's it,0
1563,[Windows] Visual glitch when opening. Launcher...,0
954,Ninja party,1
10,It's High Noon. A crazy strong and powerful Qu...,1


### Preliminary EDA

In [96]:
#shape
df.shape

(2000, 2)

In [97]:
df.dtypes

title    object
label     int64
dtype: object

In [98]:
#no nulls
df.isnull().sum().sum()

0

In [99]:
#lets set up our data for modeling
X = df['title']
y = df['label']

In [100]:
#equal amount of values for each labels
y.value_counts()

1    1000
0    1000
Name: label, dtype: int64

In [101]:
#baseline 50%
y.value_counts(normalize=True)

1    0.5
0    0.5
Name: label, dtype: float64

### More EDA & Cleaning

In this section, the focus is on changing the data so that it suits our model best. We will explore tokenizing, stemming, lemmatizing and removing stop words.

In [102]:
#making all letters lowercase
df['title'] = df['title'].str.lower()

In [103]:
#checking to confirm data has been converted
df

Unnamed: 0,title,label
0,wow. 20 gold. i am so happy.,1
1,missions open lobby number count per missions ...,1
2,my new makeshift girlfriend in stw,1
3,seeing as gas traps changed should i reset my ...,1
4,brainstorm session: new abilities,1
...,...,...
1995,odd zombie villager texturing,0
1996,my underground village is still very young but...,0
1997,a render of a map i made.,0
1998,i found a village halfway into a swamp!,0


In [104]:
#tokenizing example - I want to tokenize all words in df ( split the post text into individual words ).
example_post = df.loc[0]
print (nltk.word_tokenize(example_post['title']))

['wow', '.', '20', 'gold', '.', 'i', 'am', 'so', 'happy', '.']


In [105]:
def all_tokens(row):
    title = row['title']
    tokens = nltk.word_tokenize(title)
    # taken only words (not punctuation)
    token_words = [w for w in tokens if w.isalpha()]
    return token_words

df['tokenized'] = df.apply(all_tokens, axis=1)

#double checking to make sure tokenizing worked
df

Unnamed: 0,title,label,tokenized
0,wow. 20 gold. i am so happy.,1,"[wow, gold, i, am, so, happy]"
1,missions open lobby number count per missions ...,1,"[missions, open, lobby, number, count, per, mi..."
2,my new makeshift girlfriend in stw,1,"[my, new, makeshift, girlfriend, in, stw]"
3,seeing as gas traps changed should i reset my ...,1,"[seeing, as, gas, traps, changed, should, i, r..."
4,brainstorm session: new abilities,1,"[brainstorm, session, new, abilities]"
...,...,...,...
1995,odd zombie villager texturing,0,"[odd, zombie, villager, texturing]"
1996,my underground village is still very young but...,0,"[my, underground, village, is, still, very, yo..."
1997,a render of a map i made.,0,"[a, render, of, a, map, i, made]"
1998,i found a village halfway into a swamp!,0,"[i, found, a, village, halfway, into, a, swamp]"


In [106]:
def stem_list(row):
    stemming = PorterStemmer()
    my_list = row['tokenized']
    stemmed_list = [stemming.stem(word) for word in my_list]
    return (stemmed_list)

df['stemmed_words'] = df.apply(stem_list, axis=1)

#double checking title now that it has been lemmatized
df['stemmed_words']

0                           [wow, gold, i, am, so, happi]
1       [mission, open, lobbi, number, count, per, mis...
2               [my, new, makeshift, girlfriend, in, stw]
3       [see, as, ga, trap, chang, should, i, reset, m...
4                        [brainstorm, session, new, abil]
                              ...                        
1995                         [odd, zombi, villag, textur]
1996    [my, underground, villag, is, still, veri, you...
1997                     [a, render, of, a, map, i, made]
1998       [i, found, a, villag, halfway, into, a, swamp]
1999                          [a, small, bridg, i, built]
Name: stemmed_words, Length: 2000, dtype: object

In [107]:
def lem_list(row):
    lemmatizer = WordNetLemmatizer()
    my_list = row['tokenized']
    lematized_list = [lemmatizer.lemmatize(word) for word in my_list]
    return (lematized_list)

df['lematized'] = df.apply(lem_list, axis=1)

df['lematized']

0                           [wow, gold, i, am, so, happy]
1       [mission, open, lobby, number, count, per, mis...
2               [my, new, makeshift, girlfriend, in, stw]
3       [seeing, a, gas, trap, changed, should, i, res...
4                     [brainstorm, session, new, ability]
                              ...                        
1995                   [odd, zombie, villager, texturing]
1996    [my, underground, village, is, still, very, yo...
1997                     [a, render, of, a, map, i, made]
1998      [i, found, a, village, halfway, into, a, swamp]
1999                         [a, small, bridge, i, built]
Name: lematized, Length: 2000, dtype: object

In [108]:
#looking at the different features and checking out the changes that have been made
# lematized and stemmed_words are taken from tokenized words.
df.head()

Unnamed: 0,title,label,tokenized,stemmed_words,lematized
0,wow. 20 gold. i am so happy.,1,"[wow, gold, i, am, so, happy]","[wow, gold, i, am, so, happi]","[wow, gold, i, am, so, happy]"
1,missions open lobby number count per missions ...,1,"[missions, open, lobby, number, count, per, mi...","[mission, open, lobbi, number, count, per, mis...","[mission, open, lobby, number, count, per, mis..."
2,my new makeshift girlfriend in stw,1,"[my, new, makeshift, girlfriend, in, stw]","[my, new, makeshift, girlfriend, in, stw]","[my, new, makeshift, girlfriend, in, stw]"
3,seeing as gas traps changed should i reset my ...,1,"[seeing, as, gas, traps, changed, should, i, r...","[see, as, ga, trap, chang, should, i, reset, m...","[seeing, a, gas, trap, changed, should, i, res..."
4,brainstorm session: new abilities,1,"[brainstorm, session, new, abilities]","[brainstorm, session, new, abil]","[brainstorm, session, new, ability]"


In [109]:
#i like the lematized version best so we will stick with that version
df.drop(['title', 'tokenized', 'stemmed_words'], inplace=True, axis=1)

df

Unnamed: 0,label,lematized
0,1,"[wow, gold, i, am, so, happy]"
1,1,"[mission, open, lobby, number, count, per, mis..."
2,1,"[my, new, makeshift, girlfriend, in, stw]"
3,1,"[seeing, a, gas, trap, changed, should, i, res..."
4,1,"[brainstorm, session, new, ability]"
...,...,...
1995,0,"[odd, zombie, villager, texturing]"
1996,0,"[my, underground, village, is, still, very, yo..."
1997,0,"[a, render, of, a, map, i, made]"
1998,0,"[i, found, a, village, halfway, into, a, swamp]"


In [110]:
print(stopwords.words("english"))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [111]:
from nltk.corpus import stopwords
stops = set(stopwords.words("english"))    

def remove_stops(row):
    my_list = row['lematized']
    meaningful_words = [w for w in my_list if not w in stops]
    return (meaningful_words)

df['without_stops'] = df.apply(remove_stops, axis=1)
df.without_stops

0                                      [wow, gold, happy]
1       [mission, open, lobby, number, count, per, mis...
2                       [new, makeshift, girlfriend, stw]
3          [seeing, gas, trap, changed, reset, schematic]
4                     [brainstorm, session, new, ability]
                              ...                        
1995                   [odd, zombie, villager, texturing]
1996    [underground, village, still, young, many, sto...
1997                                  [render, map, made]
1998                     [found, village, halfway, swamp]
1999                               [small, bridge, built]
Name: without_stops, Length: 2000, dtype: object

In [112]:
#comparing the regular lematized words to words 
df.head(2)

Unnamed: 0,label,lematized,without_stops
0,1,"[wow, gold, i, am, so, happy]","[wow, gold, happy]"
1,1,"[mission, open, lobby, number, count, per, mis...","[mission, open, lobby, number, count, per, mis..."


In [113]:
#checking out what removing the stop words looks like
print(df['without_stops'][0:10])

0                                   [wow, gold, happy]
1    [mission, open, lobby, number, count, per, mis...
2                    [new, makeshift, girlfriend, stw]
3       [seeing, gas, trap, changed, reset, schematic]
4                  [brainstorm, session, new, ability]
5    [hidden, alert, reward, reward, comment, mythi...
6               [okay, seriously, need, new, game, ui]
7    [chicken, bender, got, ta, eat, chicken, tende...
8    [x, reward, someone, posted, got, mythic, lead...
9    [find, new, quest, page, clicking, past, busin...
Name: without_stops, dtype: object


In [114]:
#changing the documents from individual strings to one string per document
def rejoin_words(row):
    my_list = row['without_stops']
    joined_words = ( " ".join(my_list))
    return joined_words

df['processed'] = df.apply(rejoin_words, axis=1)

df.head(2)

Unnamed: 0,label,lematized,without_stops,processed
0,1,"[wow, gold, i, am, so, happy]","[wow, gold, happy]",wow gold happy
1,1,"[mission, open, lobby, number, count, per, mis...","[mission, open, lobby, number, count, per, mis...",mission open lobby number count per mission li...


In [115]:
#keeping the processed data as our new title
df.rename(columns={'processed':'title'}, inplace=True)

In [116]:
df.drop(['lematized', 'without_stops'], inplace=True, axis=1)

#new tokenized, lematized and stop words removed
df.head(2)

Unnamed: 0,label,title
0,1,wow gold happy
1,1,mission open lobby number count per mission li...


In [117]:
#saving the processed information to a csv incase we need to pull the information again
df.to_csv('df_processed.csv', index=False)

df.head(2)

Unnamed: 0,label,title
0,1,wow gold happy
1,1,mission open lobby number count per mission li...


In [118]:
#checking out the most common words in the dataframe overall
count = Counter(" ".join(df["title"]).split()).most_common(100)
count[:30]

[('minecraft', 105),
 ('new', 99),
 ('wa', 88),
 ('made', 88),
 ('game', 83),
 ('found', 76),
 ('mission', 75),
 ('like', 73),
 ('world', 72),
 ('get', 71),
 ('build', 67),
 ('epic', 63),
 ('think', 63),
 ('one', 62),
 ('first', 58),
 ('got', 57),
 ('stw', 54),
 ('know', 54),
 ('time', 53),
 ('look', 52),
 ('make', 47),
 ('good', 47),
 ('house', 45),
 ('block', 43),
 ('bug', 42),
 ('ha', 41),
 ('doe', 41),
 ('friend', 41),
 ('idea', 39),
 ('anyone', 37)]

### Setting up out data for modeling

In [119]:
#lets set up our data for modeling
X = df['title']
y = df['label']

### Dividing data set into two subsets:

In [120]:
# Split the data into the training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.33,
                                                    stratify=y,
                                                    random_state=42)

In [76]:
#instantiating pipeline
pipe = Pipeline(
    [
        ('cvec', CountVectorizer()),
        ('logreg', LogisticRegression(solver='liblinear'))
    ]
)

# Search over the following values of hyperparameters:
# Maximum number of features fit: 1000
# Minimum number of documents needed to include token: 10
# Maximum number of documents needed to include token: .15, .90
# Check (individual tokens) and also check (individual tokens and 2-grams).

pipe_params = {
    'cvec__max_features': [1000],
    'cvec__min_df': [10],
    'cvec__max_df': [.15, .90],
    'cvec__ngram_range': [(1,1), (1,2)]
}

# Instantiate GridSearchCV.

gs = GridSearchCV(pipe, # what object are we optimizing?
                  pipe_params, # what parameters values are we searching?
                  cv = 5) # 5-fold cross-validation.

# Fit GridSearch to training data.
gs.fit(X_train, y_train)

# What's the best score?
print("best score: ",gs.best_score_)

gs.best_params_

# Save best model as gs_model.
gs_model = gs.best_estimator_
gs_model

# Score model on training set.
print("training score:", gs_model.score(X_train, y_train))

# Score model on testing set.
print("test score:", gs_model.score(X_test, y_test))

best score:  0.7529850746268656
training score: 0.8216417910447761
test score: 0.7545454545454545


## MODELING

### 1.) Logistic Regresstion w/ Countvectorizer

In [77]:
#instantiating pipeline
pipe = Pipeline(
    [
        ('cvec', CountVectorizer()),
        ('logreg', LogisticRegression(solver='liblinear'))
    ]
)

# Search over the following values of hyperparameters:
# Maximum number of features fit: 1000
# Minimum number of documents needed to include token: 10
# Maximum number of documents needed to include token: .15, .90
# Check (individual tokens) and also check (individual tokens and 2-grams).

pipe_params = {
    'cvec__max_features': [1000],
    'cvec__min_df': [10],
    'cvec__max_df': [.15, .90],
    'cvec__ngram_range': [(1,1), (1,2)]
}

# Instantiate GridSearchCV.

gs = GridSearchCV(pipe, # what object are we optimizing?
                  pipe_params, # what parameters values are we searching?
                  cv = 5) # 5-fold cross-validation.

# Fit GridSearch to training data.
gs.fit(X_train, y_train)

# What's the best score?
print("best score: ",gs.best_score_)

gs.best_params_

# Save best model as gs_model.
gs_model = gs.best_estimator_
gs_model

# Score model on training set.
print("training score:", gs_model.score(X_train, y_train))

# Score model on testing set.
print("test score:", gs_model.score(X_test, y_test))

best score:  0.7529850746268656
training score: 0.8216417910447761
test score: 0.7545454545454545


In [78]:
gs.best_params_


{'cvec__max_df': 0.15,
 'cvec__max_features': 1000,
 'cvec__min_df': 10,
 'cvec__ngram_range': (1, 1)}

### Bag of Words 

In [79]:
train_bag_of_words = gs_model[0].fit_transform(X_train)

# gs_model[1].coef_[0]

sum_words = train_bag_of_words.sum(axis=0) 
words_freq = [(word, sum_words[0, idx]) for word, idx in gs_model[0].vocabulary_.items()]
words_df = pd.DataFrame(words_freq, columns=["words", "frequency"])
words_df["coefficient"] = gs_model[1].coef_[0]

In [80]:
minecraft_0 = words_df[words_df["coefficient"] < -0.5]
minecraft_0.head(10)

Unnamed: 0,words,frequency,coefficient
8,weapon,17,-1.678517
11,epic,45,-2.10583
15,sorry,13,-0.76816
16,people,18,-2.161831
20,man,10,-1.158151
22,villager,22,-1.11328
26,reward,14,-0.816243
27,someone,15,-0.623323
36,first,36,-1.338027
38,thought,29,-0.6432


In [81]:
fortnite_1= words_df[words_df["coefficient"] < 0.5]
fortnite_1.head(10)

Unnamed: 0,words,frequency,coefficient
1,found,55,-0.411424
2,glitch,12,0.337053
3,idea,27,0.001185
4,seed,13,-0.061988
6,use,16,-0.453681
7,block,31,-0.136832
8,weapon,17,-1.678517
10,find,12,0.386244
11,epic,45,-2.10583
14,one,40,-0.400693


### 2.) LogisticRegression (estimator) w/ TFIDF classifier (transformer)

In [82]:
#instantiating pipeline
pipe2 = Pipeline(
    [
        ('tfidf', TfidfVectorizer()),
        ('logreg', LogisticRegression(solver='liblinear'))
    ]
)

# Search over the following values of hyperparameters:
# Maximum number of features fit: 50
# Minimum number of documents needed to include token: 10
# Maximum number of documents needed to include token: 90%, 95%
# Check (individual tokens) and also check (individual tokens and 2-grams).

pipe2_params = {
    'tfidf__max_features': [1000],
    'tfidf__min_df': [10],
    'tfidf__max_df': [.9, .95],
    'tfidf__ngram_range': [(1,1), (1,2), (1,3)]
}

# Instantiate GridSearchCV.

gs2 = GridSearchCV(pipe2, # what object are we optimizing?
                  pipe2_params, # what parameters values are we searching?
                  cv = 5) # 5-fold cross-validation.

gs2.fit(X_train, y_train)

# What's the best score?
print("best score: ",gs2.best_score_)

gs2.best_params_

# Save best model as gs_model.
gs2_model = gs2.best_estimator_
gs2_model

# Score model on training set.
print("training score:", gs2_model.score(X_train, y_train))

# Score model on testing set.
print("test score:", gs2_model.score(X_test, y_test))

best score:  0.7559701492537314
training score: 0.8201492537313433
test score: 0.7575757575757576


In [83]:
#confusion matrix
pred = gs2_model.predict(X_train)

pred


cm = confusion_matrix(y_train, pred)

cm

array([[490, 180],
       [ 61, 609]])

### 3.) Niave Bayes (Multinomial w/ Count Vectorizer)
### Multinomial #1

In [84]:
cvec = CountVectorizer()

train_df = cvec.fit_transform(X_train)

test_df = cvec.transform(X_test)

mnb = MultinomialNB()

mnb.fit(train_df, y_train)

print("training score:",mnb.score(train_df, y_train))

print("test score:",mnb.score(test_df, y_test))

training score: 0.9574626865671642
test score: 0.8378787878787879


In [85]:
#confusion matrix
pred = mnb.predict(train_df)

pred

cm = confusion_matrix(y_train, pred)

cm

array([[645,  25],
       [ 32, 638]])

### Multinomial #2

In [86]:
cvec2 = CountVectorizer()

train_df = cvec2.fit_transform(X_train)

test_df = cvec2.transform(X_test)

mnb2 = MultinomialNB(alpha=0)

mnb2.fit(train_df, y_train)

print("training score:",mnb2.score(train_df, y_train))

print("test score:",mnb2.score(test_df, y_test))

training score: 0.9723880597014926
test score: 0.8257575757575758


  'setting alpha = %.1e' % _ALPHA_MIN)


In [87]:
#confusion matrix
pred = mnb2.predict(train_df)

pred

cm = confusion_matrix(y_train, pred)

cm

array([[653,  17],
       [ 20, 650]])

### 4.) Naive Bayes - Gaussian w/ TFIDF

In [88]:
tvec = TfidfVectorizer()

train_df2 = tvec.fit_transform(X_train)
train_df2 = train_df2.toarray()

test_df2 = tvec.transform(X_test)
test_df2 = test_df2.toarray()

gnb = GaussianNB()

gnb.fit(train_df2, y_train)

print("training score:",gnb.score(train_df2, y_train))

print("test score:",gnb.score(test_df2, y_test))

training score: 0.9656716417910448
test score: 0.7545454545454545


In [89]:
#confusion matrix
pred = gnb.predict(train_df2)

pred

cm = confusion_matrix(y_train, pred)

cm

array([[624,  46],
       [  0, 670]])

### .5) Decision Tree Classifier w TFIDF Vectorizer

In [90]:
# Import model.
from sklearn.tree import DecisionTreeClassifier

#instantiating pipeline
pipe3 = Pipeline(
    [
        ('tfidf', TfidfVectorizer(stop_words='english')),
        ('dt', DecisionTreeClassifier())
    ]
)

# Search over the following values of hyperparameters:
# Maximum number of features fit: 50
# Minimum number of documents needed to include token: 10
# Maximum number of documents needed to include token: 90%, 95%
# Check (individual tokens) and also check (individual tokens and 2-grams).

pipe3_params = {
    'tfidf__max_features': [1000],
    'tfidf__min_df': [10],
    'tfidf__max_df': [.15, .95],
    'tfidf__ngram_range': [(1,1), (1,2)]
}

gs3 = GridSearchCV(pipe3, # what object are we optimizing?
                  pipe3_params, # what parameters values are we searching?
                  cv = 5) # 5-fold cross-validation.

gs3.fit(X_train, y_train)

# What's the best score?
print("best score: ",gs3.best_score_)

gs3.best_params_

gs3_model = gs3.best_estimator_
gs3_model

# Score model on training set.
print("training score:",gs3_model.score(X_train, y_train))

# Score model on testing set.
print("testing score:",gs3_model.score(X_test, y_test))

best score:  0.7171641791044776
training score: 0.8432835820895522
testing score: 0.6878787878787879


### Evaluation

The Multinomial Naive Bayes model performed the best.I played around with parameters in this one and the setting alpha to 0.5 produced a slightly higher school. The best parameters being — alpha=0 and fit_prior being set True or False didnt make much of a difference .

Next was the logistic regression model with TFIDF Vectorizer. I used Pipeline and GridSearch to run several models and confirm which one would be best.  The accuracy score was 97% on training data and 82% on the test data. This means our model is overfit by quite a lot but I believe this can possibly be adjusted by tweaking the parameters or fitting a pipeline and gridsearch. This also means that 82% of our posts will be accurately classified by our model.

Decision Tree Classifier w TFIDF Vectorizer was quite unremarkable with a low accuracy score(see above for more info).

### Conclusions & Next steps

The Multinomial Naive Bayes model was the most outstanding. I would definitely use it as the one to classify my reddit post. However, if given more time and data to answer the problem I would recommend the following: 1)spending more time feature engineering, 2)exploring new features (e.g. upvotes or post comments),3)Using a larger dataset (is this corpus sufficient?), 4)Try Multinomial Naive Bayes model to classify other gaming subreddits. 