# Model Experimentation 

### Contents:
- [1. Multinomial Naive Bayes and Count Vectorizer](#1.-Multinomial-Naive-Bayes-and-Count-Vectorizer)
- [2. Multinomial Naive Bayes and TFIDF Vectorizer](#2.-Multinomial-Naive-Bayes-and-TFIDF-Vectorizer)
- [3. Logistic Regression with Count Vectorizer](#3.-Logistic-Regression-with-Count-Vectorizer)
- [4. Logistic Regression with TFIDF Vectorizer](#4.-Logistic-Regression-with-TFIDF-Vectorizer)
- [5. KNN and Count Vectorizer](#5.-KNN-and-Count-Vectorizer)
- [6. KNN with TFIDF Vectorizer](#6.-KNN-with-TFIDF-Vectorizer)
- [7. Decision Tree with Count Vectorization](#7.-Decision-Tree-with-Count-Vectorization)
- [8. Decision Tree with TFIDF Vectorizer](#8.-Decision-Tree-with-TFIDF-Vectorizer)
- [9. Bagging Classifier with TFIDF Vectorizer](#9.-Bagging-Classifier-with-TFIDF-Vectorizer)
- [10. Random Forest with TFIDF Vectorizer](#10.-Random-Forest-with-TFIDF-Vectorizer)
- [11. Extra Trees with TFIDF Vectorizer](#11.-Extra-Trees-with-TFIDF-Vectorizer)
- [12. AdaBoostClassifier and TfidfVectorizer](#12.-AdaBoostClassifier-and-TfidfVectorizer)
- [13. Gradient Boosting and TfidfVectorizer](#13.-Gradient-Boosting-and-TfidfVectorizer)
- [14. SVM and TfidfVectorizer](#14.-SVM-and-TfidfVectorizer)

In [4]:
# The long list of packages I need to run all my models
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns 
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
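# Note: this import works on older scikit-learn releases; the same stop-word list is also
# available as sklearn.feature_extraction.text.ENGLISH_STOP_WORDS
from sklearn.feature_extraction import stop_words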
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix
from sklearn.svm import SVC
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, ExtraTreesClassifier, AdaBoostClassifier, GradientBoostingClassifier

In [2]:
# Reading in my preprocessed csv to pandas
reddits = pd.read_csv('./data/reddits_preprocessed.csv')

In [3]:
# Checking out the data
reddits.head()

| Unnamed: 0 | subreddit | type | created_utc | words |
|---|---|---|---|---|
| 0 | 0 | comment | 1553281124 | she didnt mention this when i asked her she... |
| 1 | 0 | comment | 1553280963 | i mean i was but not for the sole purpose of... |
| 2 | 0 | comment | 1553280896 | hardly the best talent around in podcast has... |
| 3 | 0 | comment | 1553280716 | she cant do season 2 because gimlet owns the... |
| 4 | 0 | comment | 1553280571 | search party really excellent and critically... |


In [5]:
# Checking the shape of my data
reddits.shape

(28012, 4)

In [6]:
# Assigning my X and y variables. X is the words used to predict which subreddit and y is which subreddit
X = reddits['words']
y = reddits['subreddit']

In [7]:
# Train test split my data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=22)
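When no `test_size` is passed, `train_test_split` holds out 25% of the rows for testing. The sketch below checks what that means for a dataset with 28,012 rows (the row count reported above). It uses placeholder arrays rather than the real Reddit data, and the names `X_demo` and `y_demo` are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 28012                      # row count reported by reddits.shape
X_demo = np.arange(n_rows)          # placeholder features, not the real text
y_demo = np.arange(n_rows) % 2      # hypothetical two-class labels

# Same call as in the notebook: default test_size, fixed random_state
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=22)

train_size, test_size = len(X_tr), len(X_te)
print(train_size, test_size)
```

So every score labelled "testing" in this notebook is computed on roughly a quarter of the comments, which the grid search never sees.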

### 1. Multinomial Naive Bayes and Count Vectorizer 

In [8]:
# Making a pipeline for Multinomial Naive Bayes and CountVectorizer
pipe = Pipeline([('vect', CountVectorizer()),
                     ('nb', MultinomialNB())
                ])
# Setting pipeline parameters 
pipe_params = {
    'vect__max_features': [100, 1000, 10000],
    'vect__ngram_range': [(1,1), (1,2)],
    'vect__stop_words' : [None, stop_words.ENGLISH_STOP_WORDS],
}
# Instantiating my gridsearch
gs = GridSearchCV(pipe, 
                  param_grid=pipe_params
                 ) 
# Fit GridSearch to training data.
gs.fit(X_train, y_train)

GridSearchCV(cv=None, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('vect',
                                        CountVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.int64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(1, 1),
                                                        p

In [161]:
# Printing the best score 
print(gs.best_score_)

0.7936597788813231


In [22]:
# Looking at the best parameters for this model. Commented this out only because it takes up so much space
# and it's a lot of scrolling
# gs.best_params_
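The `best_params_` output is long mostly because the stop-word option prints an entire set of words. One way to keep it readable is to summarize any large collection before printing. `compact_params` below is a hypothetical helper written for this sketch (it is not part of scikit-learn), and `example_params` is a made-up stand-in for `gs.best_params_`.

```python
def compact_params(params, max_items=5):
    """Return a copy of a parameter dict with large collections summarized."""
    compact = {}
    for name, value in params.items():
        if isinstance(value, (set, frozenset, list)) and len(value) > max_items:
            # Replace the long collection with a one-line description
            compact[name] = f"<{type(value).__name__} of {len(value)} items>"
        else:
            compact[name] = value
    return compact


# Hypothetical stand-in for gs.best_params_
example_params = {
    'vect__max_features': 10000,
    'vect__ngram_range': (1, 1),
    'vect__stop_words': frozenset(f"word{i}" for i in range(300)),
}

summary = compact_params(example_params)
print(summary)
```

In the notebook, the equivalent call would be `compact_params(gs.best_params_)`.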

In [9]:
# Setting my best estimator to be my model for scoring
gs_model = gs.best_estimator_

In [10]:
# Training score
gs_model.score(X_train, y_train)

0.8369746299205103

In [11]:
# Testing score
gs_model.score(X_test, y_test)

0.790946737112666

### Conclusions about this model:

Naive Bayes models are known to work well for Natural Language Processing. They are probabilistic classifiers that predict the most likely label (here, the subreddit) for a piece of text. The model is "naive" because it assumes the features are independent of one another given the class. This model used the count vectorizer, which was generally outperformed by the TFIDF vectorizer because TFIDF down-weights words that appear in many documents, giving relatively more weight to distinctive words rather than just counting each word. The best parameters were max_features: 10000, ngram_range: (1, 1), and stop words included.
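To make the difference between the two vectorizers concrete, here is a small sketch on a made-up three-document corpus (not the Reddit data). It compares the inverse document frequency that `TfidfVectorizer` learns for a word that appears in every document with the one it learns for a word that appears in only one.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus: "the" and "was" appear everywhere, "podcast" only once
docs = [
    "the podcast was great",
    "the host was funny",
    "the episode was long",
]

count_vec = CountVectorizer().fit(docs)
tfidf_vec = TfidfVectorizer().fit(docs)

# Raw counts treat every word in the first document the same (each appears once)
count_row = count_vec.transform(docs[:1]).toarray()[0]
count_the = count_row[count_vec.vocabulary_['the']]
count_podcast = count_row[count_vec.vocabulary_['podcast']]

# TF-IDF learns a smaller idf for the word shared by all documents
idf_the = tfidf_vec.idf_[tfidf_vec.vocabulary_['the']]
idf_podcast = tfidf_vec.idf_[tfidf_vec.vocabulary_['podcast']]

# ...so the distinctive word ends up with the larger weight in the first document
tfidf_row = tfidf_vec.transform(docs[:1]).toarray()[0]
weight_the = tfidf_row[tfidf_vec.vocabulary_['the']]
weight_podcast = tfidf_row[tfidf_vec.vocabulary_['podcast']]

print(count_the, count_podcast)
print(round(idf_the, 3), round(idf_podcast, 3))
print(round(weight_the, 3), round(weight_podcast, 3))
```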

### 2. Multinomial Naive Bayes and TFIDF Vectorizer 

In [12]:
# Building a pipeline
pipe2 = Pipeline([('tfidf', TfidfVectorizer()),
                     ('nb', MultinomialNB())
                ])
# Setting the parameters of the pipeline 
pipe_params2 = {
    'tfidf__max_features': [100, 1000, 10000],
    'tfidf__ngram_range': [(1,1), (1,2)],
    'tfidf__stop_words' : [None, stop_words.ENGLISH_STOP_WORDS],
}
# Instantiated the grid search
gs2 = GridSearchCV(pipe2, 
                  param_grid=pipe_params2
                 ) 
# Fitting the model
gs2.fit(X_train, y_train)

GridSearchCV(cv=None, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('tfidf',
                                        TfidfVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.float64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(1, 1),
                                                      

In [24]:
# Looking at the best score
gs2.best_score_

In [24]:
# Looking at the best parameters for the gridsearch
# gs2.best_params_

In [13]:
# Setting the best estimator as the model
gs_model2 = gs2.best_estimator_

In [14]:
# Training score
gs_model2.score(X_train, y_train)

0.8587272121471751

In [15]:
# Testing score
gs_model2.score(X_test, y_test)

0.7948022276167357

### Conclusions about this model:

As stated above, Naive Bayes models are well suited to Natural Language Processing, and using the TFIDF vectorizer improved performance slightly (testing accuracy of 0.795 versus 0.791 with the count vectorizer). This was the best model, although the SVM model was also very accurate and the two were too close to call. The best parameters were the same as above: max_features: 10000, ngram_range: (1, 1), and stop words included.

### 3. Logistic Regression with Count Vectorizer

In [17]:
# Setting a pipeline including scaling my features
pipe3 = Pipeline([('vect', CountVectorizer()),
                  ('scaler',  StandardScaler(with_mean=False)),
                     ('lr', LogisticRegression())
                ])
# Setting pipeline parameters. The main issue I had with this model was that I could not set a large number
# of max features or I would get an error code. I had to limit the features to 200 for max_features for 
# this code to run without errors. 
pipe_params3 = {
    'vect__max_features': [100, 200],
    'vect__ngram_range': [(1,1), (1,2)],
    'vect__stop_words' : [None, stop_words.ENGLISH_STOP_WORDS],
}
# Instantiated a grid search
gs3 = GridSearchCV(pipe3, 
                  param_grid=pipe_params3) 
# Fitting the model
gs3.fit(X_train, y_train)

GridSearchCV(cv=None, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('vect',
                                        CountVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.int64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(1, 1),
                                                        p

In [25]:
# Checking the best score
gs3.best_score_

0.6879432618488764

In [27]:
# Checking the parameters
# gs3.best_params_

In [18]:
# Setting the best estimator as this model
gs_model3 = gs3.best_estimator_

In [19]:
# Checking the training score
gs_model3.score(X_train, y_train)

0.6998905231091437

In [20]:
# Checking the testing score
gs_model3.score(X_test, y_test)

0.697129801513637

### Conclusions about this model:

The logistic regression model was one of the weakest, and I could not run it with 10,000 features as I did in the previous models. The best parameters were max_features: 200, ngram_range: (1, 1), and stop words included.

### 4. Logistic Regression with TFIDF Vectorizer

In [29]:
# Building a pipeline, same as above but with TfidfVectorizer 
pipe4 = Pipeline([('tfidf', TfidfVectorizer()),
                  ('scaler',  StandardScaler(with_mean=False)),
                     ('lr', LogisticRegression())
                ])
# Setting pipeline parameters. I could not use more than 100 features without getting an error
pipe_params4 = {
    'tfidf__max_features': [100],
    'tfidf__ngram_range': [(1,1), (1,2)],
    'tfidf__stop_words' : [None, stop_words.ENGLISH_STOP_WORDS],
}
# Instantiated grid search
gs4 = GridSearchCV(pipe4, 
                  param_grid=pipe_params4) 
# Fitting the model
gs4.fit(X_train, y_train)

GridSearchCV(cv=None, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('tfidf',
                                        TfidfVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.float64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(1, 1),
                                                      

In [30]:
# Checking the best score
gs4.best_score_

0.6591935171936693

In [36]:
# Checking the best parameters
# gs4.best_params_

In [32]:
# Setting the best estimator to be the model
gs_model4 = gs4.best_estimator_

In [33]:
# Checking the training score
gs_model4.score(X_train, y_train)

0.6635251558855728

In [34]:
# Checking the testing score 
gs_model4.score(X_test, y_test)

0.6538626303012994

### Conclusions about this model:

This model was even worse than the previous one in spite of using the TFIDF Vectorizer, which worked better on the other models. I believe this is because I could only set max_features to 100 before I started to get errors that prevented the grid search from running. The best parameters were max_features: 100, ngram_range: (1, 1), and stop words included.
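The notebook does not record what the error was, so the following is only a sketch of one thing to try, under the assumption that the problem was the solver failing to converge on a wide feature matrix. It keeps the same three pipeline steps but raises `max_iter` on `LogisticRegression`. The toy documents and labels are made up for illustration, and whether this removes the error on the real data is untested.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data standing in for the Reddit comments
toy_docs = [
    "this podcast episode was great",
    "the host interviewed a guest on the podcast",
    "new episode of the podcast is out",
    "the podcast host is funny",
    "this recipe needs more salt",
    "bake the bread in a hot oven",
    "the soup recipe was tasty",
    "add salt to the bread dough",
]
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]

toy_pipe = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=10000)),
    ('scaler', StandardScaler(with_mean=False)),   # sparse-safe scaling, as in the notebook
    ('lr', LogisticRegression(max_iter=1000)),     # assumption: more iterations for the solver
])
toy_pipe.fit(toy_docs, toy_labels)

n_features = len(toy_pipe.named_steps['tfidf'].vocabulary_)
train_acc = toy_pipe.score(toy_docs, toy_labels)
print(n_features, train_acc)
```

If the original error had a different cause, for example memory limits, this change would not help, and the actual error message would be the place to start.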

### 5. KNN and Count Vectorizer 

In [40]:
# Setting the pipeline including scaling features
pipe5 = Pipeline([('vect', CountVectorizer()),
                  ('scaler',  StandardScaler(with_mean=False)),
                     ('knn', KNeighborsClassifier())
                ])
# Setting the pipe parameters 
pipe_params5 = {
    'vect__max_features': [100, 10000],
    'vect__ngram_range': [(1,1), (1,2)],
    'vect__stop_words' : [stop_words.ENGLISH_STOP_WORDS],
    'knn__n_neighbors': [3, 5, 7]
}
# Instantiated a grid search
gs5 = GridSearchCV(pipe5, 
                  param_grid=pipe_params5) 
# Fitting the model
gs5.fit(X_train, y_train)

GridSearchCV(cv=None, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('vect',
                                        CountVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.int64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(1, 1),
                                                        p

In [41]:
# The best score from the model
gs5.best_score_

0.6579561358716409

In [43]:
# The best parameters
# gs5.best_params_

In [44]:
# Setting the model to the best estimator 
gs_model5 = gs5.best_estimator_

In [45]:
# Training score
gs_model5.score(X_train, y_train)

0.8242657908515398

In [46]:
# Testing score 
gs_model5.score(X_test, y_test)

0.6621447950878195

### Conclusions about this model:

Based on the training and testing scores (0.824 versus 0.662), this model is overfit. The best parameters were max_features: 10000, ngram_range: (1, 1), stop_words: ENGLISH_STOP_WORDS (the only option in this grid), and n_neighbors: 3 for the KNN.
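A quick way to see the overfitting is to subtract the testing score from the training score for each model. The sketch below does that with the scores reported in sections 1 through 5 of this notebook. The dictionary labels are shorthand made up for this example.

```python
# (training score, testing score) as reported in sections 1-5
scores = {
    '1. NB + CountVectorizer': (0.8369746299205103, 0.790946737112666),
    '2. NB + TFIDF': (0.8587272121471751, 0.7948022276167357),
    '3. LogReg + CountVectorizer': (0.6998905231091437, 0.697129801513637),
    '4. LogReg + TFIDF': (0.6635251558855728, 0.6538626303012994),
    '5. KNN + CountVectorizer': (0.8242657908515398, 0.6621447950878195),
}

# A large positive gap means the model fits the training data much better than unseen data
gaps = {name: train - test for name, (train, test) in scores.items()}
largest_gap_model = max(gaps, key=gaps.get)

for name, gap in sorted(gaps.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: gap = {gap:.3f}")
```

The KNN model has by far the largest gap of the five, about 0.16, while both logistic regression models are within 0.01.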

### 6. KNN with TFIDF Vectorizer 

In [48]:
# Setting a pipeline for knn and tfidf
pipe6 = Pipeline([('tfidf', TfidfVectorizer()),
                  ('scaler',  StandardScaler(with_mean=False)),
                     ('knn', KNeighborsClassifier())
                ])
# Setting the pipeline parameters 
pipe_params6 = {
    'tfidf__max_features': [100, 10000],
    'tfidf__ngram_range': [(1,1), (1,2)],
    'tfidf__stop_words' : [stop_words.ENGLISH_STOP_WORDS],
    'knn__n_neighbors' : [3, 5, 7]
}
# Instantiated the grid search
gs6 = GridSearchCV(pipe6, 
                  param_grid=pipe_params6) 
# Fitting the model
gs6.fit(X_train, y_train)

GridSearchCV(cv=None, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('tfidf',
                                        TfidfVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.float64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(1, 1),
                                                      

In [60]:
# Checking the best score
gs6.best_score_

In [50]:
# Looking at the best parameters
# gs6.best_params_

In [51]:
# Setting the model to be the best estimator 
gs_model6 = gs6.best_estimator_

In [52]:
# The training score of the model
gs_model6.score(X_train, y_train)

0.6777571516968918

In [53]:
# The testing score of this model 
gs_model6.score(X_test, y_test)

0.6095958874767956

### Conclusions about this model:

Unlike most of the other models, KNN performed better with the count vectorizer than with the TFIDF vectorizer. This model was also overfit, but less so than the previous one with the count vectorizer. The best parameters were max_features: 100, ngram_range: (1, 1), stop_words: ENGLISH_STOP_WORDS (the only option in this grid), and n_neighbors: 7 for KNN.

### 7. Decision Tree with Count Vectorization 

In [54]:
# Setting the pipeline for decision tree and count vectorizer 
pipe7 = Pipeline([('vect', CountVectorizer()),
                     ('dt', DecisionTreeClassifier())
                ])
# Setting the pipe parameters
pipe_params7 = {
    'vect__max_features': [10000],
    'vect__ngram_range': [(1,1)],
    'vect__stop_words' : [stop_words.ENGLISH_STOP_WORDS],
    'dt__max_depth': [3, 10],
    'dt__min_samples_split': [5, 20],
    'dt__min_samples_leaf': [2, 7]
}
# Instantiated a grid search
gs7 = GridSearchCV(pipe7, 
                  param_grid=pipe_params7) 
# Fitting the model
gs7.fit(X_train, y_train)

GridSearchCV(cv=None, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('vect',
                                        CountVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.int64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(1, 1),
                                                        p

In [55]:
# Checking the best score
gs7.best_score_

0.6237326146026518

In [61]:
# Checking the best parameters
# gs7.best_params_

In [57]:
# Setting the best estimator as the model 
gs_model7 = gs7.best_estimator_

In [58]:
# Checking out the training score
gs_model7.score(X_train, y_train)

0.6251606454376696

In [59]:
# Checking out the testing score 
gs_model7.score(X_test, y_test)

0.6320148507782379

### Conclusions about this model:

This model was one of the worst performing on this particular dataset. The accuracy scores for the training and testing data are only about 10% more than the baseline. The best parameters were max_features: 10000, ngram_range: (1, 1), and stop_words: ENGLISH_STOP_WORDS (each the only value in this grid), and for the decision tree, max_depth: 10, min_samples_leaf: 2, min_samples_split: 20.
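The baseline itself is not computed in this notebook. Assuming it means the accuracy of always predicting the most common subreddit, it can be read off the class proportions. The sketch below uses a made-up label series, so the number it prints is not the project's actual baseline.

```python
import pandas as pd

# Hypothetical labels: six comments from subreddit 0, four from subreddit 1
y_demo = pd.Series([0, 0, 0, 1, 1, 0, 1, 0, 0, 1])

# Share of each class; the largest share is the majority-class baseline accuracy
class_shares = y_demo.value_counts(normalize=True)
baseline = class_shares.max()
print(class_shares)
print(baseline)
```

In the notebook, the same calculation would be `y_train.value_counts(normalize=True)`.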

### 8. Decision Tree with TFIDF Vectorizer

In [62]:
# Setting up a pipeline for Decision tree and tfidf Vectorizer
pipe8 = Pipeline([('tfidf', TfidfVectorizer()),
                     ('dt', DecisionTreeClassifier())
                ])
# I removed tfidf feature options so I could try more dt hyperparameters since there has been a lot of
# consistency with hyperparameters that work best
pipe_params8 = {
    'tfidf__max_features': [10000],
    'tfidf__ngram_range': [(1,1)],
    'tfidf__stop_words' : [stop_words.ENGLISH_STOP_WORDS],
    'dt__max_depth': [3, 10],
    'dt__min_samples_split': [5, 20],
    'dt__min_samples_leaf': [2, 7]
}
# Instantiated grid search
gs8 = GridSearchCV(pipe8, 
                  param_grid=pipe_params8) 
# Fitting the model
gs8.fit(X_train, y_train)

GridSearchCV(cv=None, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('tfidf',
                                        TfidfVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.float64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(1, 1),
                                                      

In [63]:
# Checking the best score
gs8.best_score_

0.6273975360686205

In [65]:
# Seeing what the best parameters are 
# gs8.best_params_

In [66]:
# Setting the best estimator to be the model
gs_model8 = gs8.best_estimator_

In [67]:
# Checking the training score for this model 
gs_model8.score(X_train, y_train)

0.6320624494264363

In [68]:
# Checking the testing score for this model 
gs_model8.score(X_test, y_test)

0.6375838926174496

### Conclusions about this model:

This model did only slightly better than with the Count Vectorizer. The best parameters were max_features: 10000, ngram_range: (1, 1), and stop_words: ENGLISH_STOP_WORDS (each the only value in this grid), and for the decision tree classifier, max_depth: 10, min_samples_leaf: 7, min_samples_split: 20.

### 9. Bagging Classifier with TFIDF Vectorizer

In [73]:
# Building the pipeline for a bagging classifier 
pipe9 = Pipeline([('tfidf', TfidfVectorizer()),
                     ('bag', BaggingClassifier())
                ])
# Setting the parameters
pipe_params9 = {
    'tfidf__max_features': [10000],
    'tfidf__ngram_range': [(1,1)],
    'tfidf__stop_words' : [stop_words.ENGLISH_STOP_WORDS],
    'bag__max_samples' : [.5, 1.0, 10],
    'bag__n_estimators' : [2, 6, 10]
}
# Instantiated the grid search
gs9 = GridSearchCV(pipe9, 
                  param_grid=pipe_params9) 
# Fitting the model to the data
gs9.fit(X_train, y_train)

GridSearchCV(cv=None, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('tfidf',
                                        TfidfVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.float64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(1, 1),
                                                      

In [81]:
# Checking the best score
gs9.best_score_

0.7232135523137042

In [88]:
# Checking the best parameters
# gs9.best_params_

In [84]:
# Setting the model to the best estimator 
gs_model9 = gs9.best_estimator_

In [85]:
# Checking the training score
gs_model9.score(X_train, y_train)

0.9658717692417536

In [86]:
# Checking the testing score 
gs_model9.score(X_test, y_test)

0.7256889904326718

### Conclusions about this model:

As we discussed in class, certain models, including the Bagging Classifier, can become overfit very quickly. That is evident here: the training score (0.966) is far above the testing score (0.726). Also, at this point I decided to use only the TFIDF Vectorizer, since it was performing better than the Count Vectorizer on most models.
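The same pattern is easy to reproduce away from the Reddit data. The sketch below fits a `BaggingClassifier` with its default decision-tree base estimator on synthetic data with noisy labels and compares training and testing accuracy. The dataset and its settings are assumptions chosen only to illustrate the gap.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

# Synthetic data with deliberately noisy labels (flip_y), not the Reddit comments
X_syn, y_syn = make_classification(
    n_samples=1000, n_features=20, n_informative=5,
    flip_y=0.3, random_state=22,
)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, random_state=22)

# Ten fully grown trees, each fit on a bootstrap sample of the training rows
bag = BaggingClassifier(n_estimators=10, random_state=22).fit(X_tr, y_tr)

bag_train_acc = bag.score(X_tr, y_tr)
bag_test_acc = bag.score(X_te, y_te)
print(round(bag_train_acc, 3), round(bag_test_acc, 3))
```

Because the unpruned trees can memorize noisy training labels, the training accuracy comes out well above the testing accuracy, which is the same shape as the 0.966 versus 0.726 result above.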

### 10. Random Forest with TFIDF Vectorizer

In [87]:
# Setting the pipeline for random forest 
pipe10 = Pipeline([('tfidf', TfidfVectorizer()),
                     ('rf', RandomForestClassifier())
                ])
# Pipeline parameters
pipe_params10 = {
    'tfidf__max_features': [10000],
    'tfidf__ngram_range': [(1,1)],
    'tfidf__stop_words' : [stop_words.ENGLISH_STOP_WORDS],
    'rf__n_estimators': [100, 150],
    'rf__max_depth': [None, 5, 6]
}
# Instantiating a grid search
gs10 = GridSearchCV(pipe10, 
                  param_grid=pipe_params10) 
# Fitting my model
gs10.fit(X_train, y_train)

GridSearchCV(cv=None, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('tfidf',
                                        TfidfVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.float64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(1, 1),
                                                      

In [89]:
# The best score for this model
gs10.best_score_

0.7564375042274222

In [92]:
# The best parameters for this model
# gs10.best_params_

In [93]:
# Setting the best estimator as this model
gs_model10 = gs10.best_estimator_

In [94]:
# Checking the training score
gs_model10.score(X_train, y_train)

0.9809605407206435

In [95]:
# Checking the testing score 
gs_model10.score(X_test, y_test)

0.7528202199057547

### Conclusions about this model:

As with the Bagging model, Random Forest models have a tendency to overfit, and this is an example of that (training score 0.981 versus testing score 0.753). Even so, the accuracy on the testing data is still near the higher end of the results. The best parameters were max_features: 10000, ngram_range: (1, 1), and stop_words: ENGLISH_STOP_WORDS (each the only value in this grid), and for the random forest, max_depth: None and n_estimators: 150.

### 11. Extra Trees with TFIDF Vectorizer

In [96]:
# Setting the pipeline for tfidf and extra trees
pipe11 = Pipeline([('tfidf', TfidfVectorizer()),
                     ('xt', ExtraTreesClassifier())
                ])
# Setting the pipeline parameters
pipe_params11 = {
    'tfidf__max_features': [10000],
    'tfidf__ngram_range': [(1,1)],
    'tfidf__stop_words' : [stop_words.ENGLISH_STOP_WORDS],
    'xt__n_estimators': [100, 150],
    'xt__max_depth': [None, 5, 6]
}
# Instantiating the grid search
gs11 = GridSearchCV(pipe11, 
                  param_grid=pipe_params11) 
# Fitting the model
gs11.fit(X_train, y_train)

GridSearchCV(cv=None, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('tfidf',
                                        TfidfVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.float64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(1, 1),
                                                      

In [97]:
# Checking the best score 
gs11.best_score_

0.7690991843582039

In [100]:
# Checking the best parameters 
# gs11.best_params_

In [101]:
# Setting the best estimator as the model 
gs_model11 = gs11.best_estimator_

In [102]:
# Checking the training score 
gs_model11.score(X_train, y_train)

0.9809605407206435

In [103]:
# Checking the testing score
gs_model11.score(X_test, y_test)

0.776095958874768

### Conclusions about this model:

Like the other tree-based models, this one was very overfit; however, its accuracy on the testing data was still near the top of the list of models. The best parameters were max_features: 10000, ngram_range: (1, 1), and stop_words: ENGLISH_STOP_WORDS (each the only value in this grid), and for Extra Trees, max_depth: None and n_estimators: 150.

### 12. AdaBoostClassifier and TfidfVectorizer 

In [111]:
# Setting the pipeline 
pipe12 = Pipeline([('tfidf', TfidfVectorizer(max_features=10000, 
                                           ngram_range=(1, 1), 
                                           stop_words=stop_words.ENGLISH_STOP_WORDS)),
                     ('ada', AdaBoostClassifier()),
])
# Setting the pipeline parameters
pipe_params12 = {
    'tfidf__max_df': (0.25, 0.5, 0.75),
}
# Instantiated a grid search
gs12 = GridSearchCV(pipe12, 
                  param_grid=pipe_params12) 
# Fitting the model
gs12.fit(X_train, y_train)

GridSearchCV(cv=None, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('tfidf',
                                        TfidfVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.float64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=10000,
                                                        min_df=1,
                                                        ngram_range=(1, 1),
                                                     

In [105]:
# Finding the best score for the model 
gs12.best_score_

0.6912749406574735

In [109]:
# Checking the best parameters 
gs12.best_params_

{'tfidf__max_df': 0.25}

In [106]:
# Setting the best estimator to be the model
gs_model12 = gs12.best_estimator_

In [107]:
# Checking the training score
gs_model12.score(X_train, y_train)

0.6998905231091437

In [108]:
# Checking the testing score
gs_model12.score(X_test, y_test)

0.6981293731258033

### Conclusions about this model:

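This was not one of the stronger models, with a cross-validated accuracy of about 0.69 and training and test scores of about 0.70. The training and test scores are nearly identical, so the model is not overfitting, but AdaBoost with its default settings does not look like a strong strategy for this data. I also searched over a new TFIDF parameter, max_df, and the best value was 0.25.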
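
To make the effect of `max_df` concrete, here is a minimal sketch on a made-up four-document corpus (the documents are purely illustrative and are not from this project's data). A float `max_df` drops any term whose document frequency is above that share of the documents, so a word that appears everywhere is removed from the vocabulary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical): "the" appears in every document
docs = ["the cat sat", "the dog ran", "the bird flew", "the fish swam"]

# max_df=1.0 (the default) keeps every term
keep_all = TfidfVectorizer(max_df=1.0).fit(docs)

# max_df=0.5 drops terms found in more than half of the documents
drop_common = TfidfVectorizer(max_df=0.5).fit(docs)

print(sorted(keep_all.vocabulary_))
print(sorted(drop_common.vocabulary_))
```

On real text, a low value such as 0.25 acts like an extra, corpus-specific stop word filter on top of the fixed English list.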

### 13. Gradient Boosting and TfidfVectorizer

In [112]:
# Making a pipeline
pipe13 = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=10000,
                              ngram_range=(1, 1),
                              # 'english' selects scikit-learn's built-in English stop word list
                              stop_words='english')),
    ('gbc', GradientBoostingClassifier()),
])
# Setting the pipeline parameters
pipe_params13 = {
    'tfidf__max_df': (0.25, 0.5, 0.75),
}
# Instantiating a grid search
gs13 = GridSearchCV(pipe13,
                    param_grid=pipe_params13)
# Fitting the model
gs13.fit(X_train, y_train)

GridSearchCV(cv=None, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('tfidf',
                                        TfidfVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.float64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=10000,
                                                        min_df=1,
                                                        ngram_range=(1, 1),
                                                     

In [113]:
# Getting the best score 
gs13.best_score_

0.697605588116698

In [114]:
# Setting the best estimator to the model
gs_model13 = gs13.best_estimator_

In [115]:
# Checking the training score 
gs_model13.score(X_train, y_train)

0.7138845256794707

In [116]:
# Checking the testing score 
gs_model13.score(X_test, y_test)

0.7061259460231329

### Conclusions about this model:

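This model did not perform strongly either, with about 0.70 cross-validated accuracy and 0.71 on the test set. Together with the AdaBoost result, this suggests that boosting, at least with default settings, is not a particularly effective approach for this data set.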

### 14. SVM and TfidfVectorizer

In [117]:
# Setting the pipeline. The TFIDF settings that have been most successful so far are fixed
# in the pipeline itself rather than searched over in the parameters.
pipe14 = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=10000,
                              ngram_range=(1, 1),
                              # 'english' selects scikit-learn's built-in English stop word list
                              stop_words='english')),
    ('svc', SVC(gamma='scale')),
])
# Setting the parameters
pipe_params14 = {
    'tfidf__max_df': (0.25, 0.5, 0.75),
}
# Instantiating a grid search
gs14 = GridSearchCV(pipe14,
                    param_grid=pipe_params14)
# Fitting the model
gs14.fit(X_train, y_train)

GridSearchCV(cv=None, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('tfidf',
                                        TfidfVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.float64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=10000,
                                                        min_df=1,
                                                        ngram_range=(1, 1),
                                                     

In [118]:
# Checking the best score
gs14.best_score_

0.7906613314003227

In [119]:
# Setting the best estimator to the model
gs_model14 = gs14.best_estimator_

In [120]:
# Checking the training score
gs_model14.score(X_train, y_train)

0.9611119044219144

In [121]:
# Checking the testing score 
gs_model14.score(X_test, y_test)

0.7996572897329716

### Conclusions about this model:

The SVM model was one of the strongest, with a test accuracy of about 0.80. It was basically neck and neck with Multinomial Naive Bayes. However, the training score of about 0.96 points to overfitting, and the model is very taxing on my computer's resources, so I did not dare run it with a larger hyperparameter grid.
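
A rough way to see why a larger grid gets expensive is to count the model fits before running anything. The sketch below assumes the 5-fold default that recent scikit-learn versions use when `cv=None`, plus the final refit of the best candidate. The helper `count_fits` and the extra `svc__C` values are hypothetical, added only for illustration.

```python
from sklearn.model_selection import ParameterGrid

def count_fits(param_grid, cv=5):
    """Candidates times folds, plus one final refit on the full training set."""
    return len(ParameterGrid(param_grid)) * cv + 1

# The grid used for the SVM above: 3 candidates
current_grid = {'tfidf__max_df': (0.25, 0.5, 0.75)}

# Hypothetical wider grid that also searches three values of the SVC penalty C
wider_grid = {'tfidf__max_df': (0.25, 0.5, 0.75),
              'svc__C': (0.1, 1, 10)}

print(count_fits(current_grid))  # 3 * 5 + 1 = 16
print(count_fits(wider_grid))    # 9 * 5 + 1 = 46
```

Each added hyperparameter multiplies the number of candidates, so even one more three-value parameter nearly triples the SVM training time.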

## Overall Conclusions 

Of the models I had the chance to run, the best for this type of data seem to be Multinomial Naive Bayes and SVM, whose accuracy results were very close. The downsides of SVM are that it seems prone to overfitting (about 0.96 training accuracy against 0.80 on the test set) and that it is very taxing on computer resources: it took a long time every time I ran it. For this type of data, I would choose Multinomial Naive Bayes in the future because it was the most efficient while matching the best accuracy. I am going to do some tweaking to both models and try both with authors to see what I get.
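
As a small worked check of the overfitting point, the sketch below lines up the scores reported for the last three models (rounded to four decimals from the outputs above) and computes the gap between training and test accuracy. Only models 12 to 14 are included here; the variable names are just for illustration.

```python
# Scores copied from the outputs above: (cross-validation, train, test)
scores = {
    'AdaBoost':          (0.6913, 0.6999, 0.6981),
    'Gradient Boosting': (0.6976, 0.7139, 0.7061),
    'SVM':               (0.7907, 0.9611, 0.7997),
}

# Train-minus-test gap: a large positive value points to overfitting
gaps = {name: round(train - test, 4) for name, (cv, train, test) in scores.items()}

for name, (cv, train, test) in scores.items():
    print(f"{name:<18} cv={cv:.3f} train={train:.3f} test={test:.3f} gap={gaps[name]:+.3f}")
```

The two boosting models have gaps under 0.01, while the SVM gap is about 0.16, which matches the impression that SVM scores best of the three but overfits the most.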