# PART TWO 

## Content List- Part 2

- [Data Cleaning and EDA](#Data-Cleaning-and-EDA)
- [Preprocessing and Modeling](#Preprocessing-and-Modeling)
- [Evaluation and Conceptual Understanding](#Evaluation-and-Conceptual-Understanding)
- [Conclusion and Recommendations](#Conclusion-and-Recommendations)

## Data Cleaning and EDA

### Importing packages

In [13]:
import pandas as pd
from sklearn.metrics import confusion_matrix
from xgboost import XGBClassifier
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split,GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_extraction import text, stop_words
from sklearn.metrics import accuracy_score,recall_score,precision_score, confusion_matrix

#### Importing the Dataframe from csv

In [2]:
master_df = pd.read_csv('./data/master_df.csv')

In [3]:
len(master_df)

1778

In [4]:
#check for nulls
master_df.isnull().sum()

ID                 0
Length of Title    0
Post Text          0
Subreddit          0
dtype: int64

## Preprocessing and Modeling

In [6]:
#check shape of new, combined dataframe
master_df.shape

(1778, 4)

In [7]:
master_df.columns

Index(['ID', 'Length of Title', 'Post Text', 'Subreddit'], dtype='object')

In [8]:
master_df['Length of Title'].mean()

100.04499437570304

In [9]:
#set feature and targets
X = master_df[['Post Text', 'Length of Title']]
y = master_df['Subreddit']

In [10]:
#train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify= y)

## Determining Baseline Score

As accuracy is our metric, we need a baseline to compare our results against. We get one by checking the distribution of the classes, which also shows whether there is any inherent imbalance.

In [11]:
# Baseline Accuracy
y_test.value_counts(normalize=True)

1    0.557303
0    0.442697
Name: Subreddit, dtype: float64

The baseline accuracy of 55.73% is the score a model would get by always predicting the majority class (class 1), which is slightly better than a coin flip. It is the benchmark against which every model below should be judged.
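As a quick check of the baseline arithmetic, the sketch below recomputes the majority-class share from the class counts in the 445-row test set (248 posts in class 1 and 197 in class 0, which are also the row totals of the confusion matrices later in this notebook). The variable names are illustrative only.

```python
from collections import Counter

# Class counts in the 445-row test set (class 1 is the majority class).
y_test_labels = [1] * 248 + [0] * 197

counts = Counter(y_test_labels)
majority_class, majority_count = counts.most_common(1)[0]

# Always predicting the majority class is correct this often.
baseline_accuracy = majority_count / len(y_test_labels)

print(majority_class, round(baseline_accuracy, 6))
```

scikit-learn's `DummyClassifier(strategy='most_frequent')` gives the same number when scored on the test set, if a model-shaped baseline is preferred.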

In [12]:
#show us the shape of our data
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(1333, 2)
(1333,)
(445, 2)
(445,)


## Amending stop word lists

In [16]:
additional_politics_english_stop = ['www', 'things', 'does', 'x200b', 'amp', 'want', 'watch',
                           'just', 'like', 'https', 'com', 'trump', 'republican', 'republicans',
                           'libertarians', 'democrats', 'democrat', 'people', 'libertarian',
                           'says', 'say', 'did', 'this', 'conservative', 'conservatives' ]

additional_english_stop = ['www', 'things', 'does', 'x200b', 'amp',
                           'just', 'like', 'https', 'com', 'watch', 'want',
                           'says', 'say', 'did', 'this']

new_stop_list = stop_words.ENGLISH_STOP_WORDS.union(additional_english_stop)
new_politics_english_stop_list = stop_words.ENGLISH_STOP_WORDS.union(additional_politics_english_stop)
print(len(stop_words.ENGLISH_STOP_WORDS))
print(len(additional_english_stop))
print(len(new_politics_english_stop_list))
print(len(new_stop_list))



318
15
342
332


In [17]:
new_stop_list

frozenset({'a',
           'about',
           'above',
           'across',
           'after',
           'afterwards',
           'again',
           'against',
           'all',
           'almost',
           'alone',
           'along',
           'already',
           'also',
           'although',
           'always',
           'am',
           'among',
           'amongst',
           'amoungst',
           'amount',
           'amp',
           'an',
           'and',
           'another',
           'any',
           'anyhow',
           'anyone',
           'anything',
           'anyway',
           'anywhere',
           'are',
           'around',
           'as',
           'at',
           'back',
           'be',
           'became',
           'because',
           'become',
           'becomes',
           'becoming',
           'been',
           'before',
           'beforehand',
           'behind',
           'being',
           'below',
           'beside',
  

## Pipeline & GridSearchCV

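Note: the grid searches below are fit on the Post Text column only. To use an additional feature such as the title length alongside a vectorizer, that feature has to be added to the vectorized X_train before the model is fit.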
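One way to combine a text column with a numeric column inside a single pipeline is scikit-learn's `ColumnTransformer`. The sketch below is an assumption about how that could look here, not what this notebook actually ran: `toy_df` is made-up data that only borrows the column names of `master_df`, and the step names `prep`, `cvec` and `length` are arbitrary.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Tiny made-up frame with the same column names as master_df.
toy_df = pd.DataFrame({
    'Post Text': ['lower taxes now', 'expand healthcare access',
                  'cut spending and taxes', 'healthcare for all workers'],
    'Length of Title': [15, 24, 22, 26],
    'Subreddit': [0, 1, 0, 1],
})

# Vectorize the text column and pass the numeric column through unchanged.
preprocess = ColumnTransformer([
    ('cvec', CountVectorizer(), 'Post Text'),        # text column: plain string selector
    ('length', 'passthrough', ['Length of Title']),  # numeric column: list selector
])

pipe = Pipeline([
    ('prep', preprocess),
    ('lr', LogisticRegression()),
])

features = toy_df[['Post Text', 'Length of Title']]
pipe.fit(features, toy_df['Subreddit'])
preds = pipe.predict(features)
print(preds.shape)
```

With this layout, vectorizer settings in a grid search would be addressed as `prep__cvec__min_df`, `prep__cvec__ngram_range`, and so on.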

### CountVectorizer with Logistic Regression

In [None]:
pipe_cvec_lr = Pipeline([
    ('cvec', CountVectorizer()),
    ('lr', LogisticRegression())
])

pipe_params_cvec_lr = {
    'cvec__max_features': [None,500,1000],
    'cvec__min_df': [2,3],
    'cvec__max_df': [.3,.4,],
    'cvec__ngram_range': [(1,2),(1,3)],
    'cvec__stop_words': [None,'english',new_stop_list],
    'lr__penalty': ['l2']
}

gs = GridSearchCV(pipe_cvec_lr, param_grid=pipe_params_cvec_lr, cv=5,n_jobs = -1,verbose = 1)

gs.fit(X_train['Post Text'],y_train)

print(f'Best CV Score: {gs.best_score_}')
print(f'Best Parameters: {gs.best_params_}')
print(f'Train Accuracy Score: {gs.score(X_train["Post Text"],y_train)}')
print(f'Test Accuracy Score: {gs.score(X_test["Post Text"],y_test)}')

CountVectorizer with Logistic Regression gave fairly strong results, with a best CV score of 0.79581. The best parameters were 'cvec__max_df': 0.3, 'cvec__max_features': None, 'cvec__min_df': 2, 'cvec__ngram_range': (1, 3), 'cvec__stop_words': 'english', 'lr__penalty': 'l2'.

Train Accuracy Score: 0.9798055347793567

Test Accuracy Score: 0.8094170403587444

The train score of approximately 0.9798 was much higher than the test score of 0.8094, which indicates that this model is overfit even after tuning the hyperparameters.
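One setting the grid above leaves at its default is the regularization strength `C` of `LogisticRegression`. Smaller values apply stronger L2 regularization, which is a common way to try to narrow a train/test gap. The sketch below shows how `C` could be added to a grid. It uses a tiny made-up corpus so that it runs quickly; the texts, labels, candidate values and the name `gs_demo` are assumptions for illustration, not results from this project.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Made-up posts and labels, only to make the example runnable.
texts = ['cut taxes', 'lower taxes now', 'taxes are too high', 'reduce taxes',
         'expand healthcare', 'healthcare for all', 'fund healthcare',
         'public healthcare now']
labels = [0, 0, 0, 0, 1, 1, 1, 1]

pipe = Pipeline([
    ('cvec', CountVectorizer()),
    ('lr', LogisticRegression()),
])

# Smaller C means stronger regularization.
params = {
    'cvec__ngram_range': [(1, 1), (1, 2)],
    'lr__C': [0.01, 0.1, 1.0],
}

gs_demo = GridSearchCV(pipe, param_grid=params, cv=2)
gs_demo.fit(texts, labels)
print(gs_demo.best_params_)
```

Whether a smaller `C` would actually help on the real data is an open question that only a rerun of the grid search can answer.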

### TF-IDF with Logistic Regression

In [None]:
pipe_tvec_lr = Pipeline([
    ('tvec', TfidfVectorizer()),
    ('lr', LogisticRegression())
])

pipe_params_tvec_lr = {
    'tvec__max_features': [None,1000],
    'tvec__min_df': [2,3,4],
    'tvec__max_df': [.3,.5],
    'tvec__ngram_range': [(1,1),(1,3)],
    'tvec__stop_words': [None, new_stop_list,'english'],
    'lr__penalty': ['l2']
}

gs = GridSearchCV(pipe_tvec_lr, param_grid=pipe_params_tvec_lr, cv=4, n_jobs=-1, verbose = 1)

gs.fit(X_train['Post Text'],y_train)

print(f'Best Score: {gs.best_score_}')
print(f'Best Parameters: {gs.best_params_}')
print(f'Train Accuracy Score: {gs.score(X_train["Post Text"],y_train)}')
print(f'Test Accuracy Score: {gs.score(X_test["Post Text"],y_test)}')

TFIDF with Logistic Regression reached a best CV score of ~0.7644. The optimal parameters were 'tvec__max_df': 0.3, 'tvec__max_features': None, 'tvec__min_df': 2, 'tvec__ngram_range': (1, 3), 'tvec__stop_words': new_stop_list, 'lr__penalty': 'l2'.

Train Accuracy Score: 0.9281974569932685

Test Accuracy Score: 0.7982062780269058

The train score of approximately 0.9282 was again higher than the test score of 0.7982, indicating that this model is also overfit despite tuning the hyperparameters.

### Count Vectorizer with Multinomial Naive Bayes

In [None]:
pipe_cvec_mnb = Pipeline([
    ('cvec', CountVectorizer()),
    ('mnb', MultinomialNB())
])

pipe_params_cvec_mnb = {
    'cvec__max_features': [None,500,1000,2500],
    'cvec__min_df': [2,3],
    'cvec__max_df': [.4, .8],
    'cvec__ngram_range': [(1,1),(1,2),(1,3)],
    'cvec__stop_words': [None, new_stop_list,'english']
}

gs = GridSearchCV(pipe_cvec_mnb, param_grid=pipe_params_cvec_mnb, cv=4, n_jobs = 4, verbose = 1)

gs.fit(X_train['Post Text'],y_train)

In [None]:
print(f'Best Score: {gs.best_score_}')
print(f'Best Parameters: {gs.best_params_}')
print(f'Train Accuracy Score: {gs.score(X_train["Post Text"],y_train)}')
print(f'Test Accuracy Score: {gs.score(X_test["Post Text"],y_test)}')

CountVectorizer with Multinomial Naive Bayes reached a best CV score of 0.7255. The optimal parameters were 'cvec__max_df': 0.4, 'cvec__max_features': None, 'cvec__min_df': 2, 'cvec__ngram_range': (1, 3), 'cvec__stop_words': 'english'.

Train Accuracy Score: 0.8833208676140614

Test Accuracy Score: 0.7556053811659192

The train score of approximately 0.8833 was much higher than the test score of 0.7556, indicating that this model is clearly overfit despite tuning the hyperparameters.

### TF-IDF with Multinomial Naive Bayes

In [None]:
pipe_tvec_mnb = Pipeline([
    ('tvec', TfidfVectorizer()),
    ('mnb', MultinomialNB())
])

pipe_params_tvec_mnb = {
    'tvec__max_features': [None,500,1000,3000],
    'tvec__min_df': [2,3],
    'tvec__max_df': [.2,.3,.4,],
    'tvec__ngram_range': [(1,1),(1,2),(1,3)],
    'tvec__stop_words': [None, new_stop_list,'english']
}

gs = GridSearchCV(pipe_tvec_mnb, param_grid=pipe_params_tvec_mnb, cv=4, n_jobs = -1, verbose = 1)

gs.fit(X_train['Post Text'],y_train)

print(f'Best Score: {gs.best_score_}')
print(f'Best Parameters: {gs.best_params_}')
print(f'Train Accuracy Score: {gs.score(X_train["Post Text"],y_train)}')
print(f'Test Accuracy Score: {gs.score(X_test["Post Text"],y_test)}')

Not bad! TFIDF with Multinomial Naive Bayes reached a best CV score of 0.7128. The optimal parameters were 'tvec__max_df': 0.4, 'tvec__max_features': 1000, 'tvec__min_df': 2, 'tvec__ngram_range': (1, 2), 'tvec__stop_words': 'english'.

Train Accuracy Score: 0.9012715033657442

Test Accuracy Score: 0.7623318385650224

The train score of approximately 0.9013 was much higher than the test score of 0.7623, indicating that this model is clearly overfit despite tuning the hyperparameters.

### Random Forest with CountVectorizer

In [None]:
from sklearn.ensemble import RandomForestClassifier
rf_pipe = Pipeline([
        ('cvec', CountVectorizer()),
        ('rfc', RandomForestClassifier())])

rf_params = [{
    'cvec__max_features': [None, 500,1000],
    'cvec__min_df': [2,3],
    'cvec__max_df': [.3,.4,.8],
    'cvec__ngram_range': [(1,1),(1,2),(1,3)],
    'rfc__bootstrap': [True],
    'rfc__max_features': [.5, .6],
    'rfc__min_samples_leaf': [3,6],
    'rfc__min_samples_split':[3,6],
    'rfc__n_estimators':[10,100]
}]

In [None]:
gs = GridSearchCV(rf_pipe, 
                   param_grid=rf_params, 
                   cv = 4,
                   verbose = 1,
                   n_jobs = -1)

gs.fit(X_train['Post Text'],y_train)

print(f'Best Score: {gs.best_score_}')
print(f'Best Parameters: {gs.best_params_}')
print(f'Train Accuracy Score: {gs.score(X_train["Post Text"],y_train)}')
print(f'Test Accuracy Score: {gs.score(X_test["Post Text"],y_test)}')

The combination of CountVectorizer and RandomForestClassifier reduced variance compared with the earlier models. The ideal parameters were as follows: 'cvec__max_df': 0.9, 'cvec__max_features': None, 'cvec__min_df': 2, 'cvec__ngram_range': (1, 1), 'rfc__bootstrap': True, 'rfc__max_features': 0.5, 'rfc__min_samples_leaf': 4, 'rfc__min_samples_split': 3, 'rfc__n_estimators': 100.

Train Accuracy Score: 0.868362004487659

Test Accuracy Score: 0.757847533632287

The train accuracy is still higher than the test accuracy (0.8684 vs 0.7578), so the model is still overfit, although the gap is narrower than for the previous models.

### Random Forest with TFIDF

In [None]:
rf_pipe = Pipeline([
        ('tvec', TfidfVectorizer()),
        ('rfc', RandomForestClassifier())])

rf_params = [{
    'tvec__max_features': [None],
    'tvec__min_df': [2,4],
    'tvec__max_df': [.3,.4, .5],
    'tvec__ngram_range': [(1,1),(1,2),(1,3)],
    'tvec__stop_words': [None],
    'rfc__bootstrap': [False, True],
    'rfc__n_estimators': [10,100],
    'rfc__max_features': [.5, .6, .7],
    'rfc__min_samples_leaf': [10],
    'rfc__min_samples_split':[3]
}]

In [None]:
gs= GridSearchCV(rf_pipe, 
                   param_grid=rf_params, 
                   cv = 4,
                   verbose = 1,
                   n_jobs = 3)

gs.fit(X_train['Post Text'],y_train)

print(f'Best Score: {gs.best_score_}')
print(f'Best Parameters: {gs.best_params_}')
print(f'Train Accuracy Score: {gs.score(X_train["Post Text"],y_train)}')
print(f'Test Accuracy Score: {gs.score(X_test["Post Text"],y_test)}')

This model, combining TFIDF and RandomForestClassifier, had an average score of 0.7457, a little lower than the prior model.

The ideal parameters were as follows: 'rfc__bootstrap': False, 'rfc__max_features': 0.5, 'rfc__min_samples_leaf': 10, 'rfc__min_samples_split': 3, 'rfc__n_estimators': 100, 'tvec__max_df': 0.3, 'tvec__max_features': None, 'tvec__min_df': 2, 'tvec__ngram_range': (1, 2), 'tvec__stop_words': None.

Train Accuracy Score: 0.8160059835452506

Test Accuracy Score: 0.7309417040358744

The train accuracy is still higher than the test accuracy (0.8160 vs 0.7309), so the model is still overfit, although the gap is narrower than for the previous model.

### Adaboost with CountVectorizer

In [16]:
from sklearn.ensemble import AdaBoostClassifier



In [None]:
ada_pipe = Pipeline([
        ('cvec', CountVectorizer()),
        ('ada', AdaBoostClassifier())
])

ada_params = {
    'cvec__max_features': [None,500,1000],
    'cvec__min_df': [3,5],
    'cvec__max_df': [.4,.3],
    'cvec__ngram_range': [(1,2),(2,3),(1,3)],
    'cvec__stop_words': [None, 'english', new_stop_list],
    'ada__learning_rate': [0.3,.5,.7]}

gs= GridSearchCV(ada_pipe, 
                   param_grid=ada_params, 
                   cv = 5,
                   verbose = 1,
                   n_jobs = -1)

gs.fit(X_train['Post Text'],y_train)

print(f'Best Score: {gs.best_score_}')
print(f'Best Parameters: {gs.best_params_}')
print(f'Train Accuracy Score: {gs.score(X_train["Post Text"],y_train)}')
print(f'Test Accuracy Score: {gs.score(X_test["Post Text"],y_test)}')

### AdaBoost with TFIDF

In [None]:
ada_pipe = Pipeline([
        ('tvec', TfidfVectorizer()),
        ('ada', AdaBoostClassifier())
])

ada_params = {
    'tvec__max_features': [None,500,1000],
    'tvec__min_df': [2,3,4],
    'tvec__max_df': [.5,.4,.3],
    'tvec__ngram_range': [(1,1),(1,3)],
    'tvec__stop_words': [None, 'english', new_stop_list],
    'ada__learning_rate': [.5]}

gs= GridSearchCV(ada_pipe, 
                   param_grid=ada_params, 
                   cv = 3,
                   verbose = 1,
                   n_jobs = -1)

gs.fit(X_train['Post Text'],y_train)

print(f'Best Score: {gs.best_score_}')
print(f'Best Parameters: {gs.best_params_}')
print(f'Train Accuracy Score: {gs.score(X_train["Post Text"],y_train)}')
print(f'Test Accuracy Score: {gs.score(X_test["Post Text"],y_test)}')

AdaBoost with TFIDF proved the best so far in terms of variance, with the smallest gap between train and test accuracy. The optimal settings were: 'ada__learning_rate': 0.5, 'tvec__max_df': 0.5, 'tvec__max_features': 500, 'tvec__min_df': 3, 'tvec__ngram_range': (1, 3), 'tvec__stop_words': new_stop_list.

Train Accuracy Score: 0.8032909498878086

Test Accuracy Score: 0.7645739910313901

The scores show that there is still slight overfitting, but all in all this model should generalize the best to new data, so we will make our predictions using it.

### XGBoost with CountVectorizer


In [21]:
xgb_pipe = Pipeline([
        ('cvec', CountVectorizer()),
        ('xgb', XGBClassifier())
])

xgb_params = {
    'cvec__max_features': [None,500,1000],
    'cvec__min_df': [3,5],
    'cvec__max_df': [.4,.3],
    'cvec__ngram_range': [(1,2),(2,3),(1,3)],
    'cvec__stop_words': [None, 'english', new_stop_list]}

gs= GridSearchCV(xgb_pipe, 
                   param_grid= xgb_params, 
                   cv = 5,
                   verbose = 1,
                   n_jobs = -1)

gs.fit(X_train['Post Text'],y_train)

print(f'Best Score: {gs.best_score_}')
print(f'Best Parameters: {gs.best_params_}')
print(f'Train Accuracy Score: {gs.score(X_train["Post Text"],y_train)}')
print(f'Test Accuracy Score: {gs.score(X_test["Post Text"],y_test)}')

Fitting 5 folds for each of 108 candidates, totalling 540 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   17.1s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:   45.5s
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed:  1.7min
[Parallel(n_jobs=-1)]: Done 540 out of 540 | elapsed:  2.1min finished


Best Score: 0.7569392348087022
Best Parameters: {'cvec__max_df': 0.4, 'cvec__max_features': None, 'cvec__min_df': 5, 'cvec__ngram_range': (1, 2), 'cvec__stop_words': frozenset({'from', 'might', 'hundred', 'each', 'sixty', 'five', 'elsewhere', 'ours', 'was', 'too', 'herself', 'found', 'ie', 'nor', 'fifteen', 'onto', 'i', 'herein', 'name', 'yet', 'sometimes', 'sometime', 'always', 'there', 'became', 'first', 'just', 'beside', 'un', 'via', 'once', 'last', 'of', 'couldnt', 'more', 'am', 'mine', 'done', 'inc', 'every', 'your', 'upon', 'again', 'amount', 'another', 'this', 'all', 'we', 'off', 'hasnt', 'had', 'become', 'fire', 'whole', 'becoming', 'while', 'things', 'myself', 'since', 'these', 'indeed', 'is', 'did', 'because', 'show', 'seemed', 'formerly', 'amoungst', 'in', 'itself', 'below', 'due', 'hereby', 'sincere', 'further', 'at', 'whereas', 'anyone', 'otherwise', 'least', 'some', 'someone', 'himself', 'those', 'eight', 'her', 'although', 'forty', 'which', 'thus', 'whereafter', 'www', '

### XGBoost with TF-IDF

In [19]:
xgb_pipe = Pipeline([
        ('tvec', TfidfVectorizer()),
        ('xgb', XGBClassifier())
])

xgb_params = {
    'tvec__max_features': [None,500,1000],
    'tvec__min_df': [2,3,4],
    'tvec__max_df': [.5,.4,.3],
    'tvec__ngram_range': [(1,1),(1,3)],
    'tvec__stop_words': [None, 'english', new_stop_list]}

gs= GridSearchCV(xgb_pipe, 
                   param_grid=xgb_params, 
                   cv = 3,
                   verbose = 1,
                   n_jobs = -1)

gs.fit(X_train['Post Text'],y_train)

print(f'Best Score: {gs.best_score_}')
print(f'Best Parameters: {gs.best_params_}')
print(f'Train Accuracy Score: {gs.score(X_train["Post Text"],y_train)}')
print(f'Test Accuracy Score: {gs.score(X_test["Post Text"],y_test)}')

Fitting 3 folds for each of 162 candidates, totalling 486 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   21.3s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:  1.1min
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed:  2.3min
[Parallel(n_jobs=-1)]: Done 486 out of 486 | elapsed:  2.5min finished


Best Score: 0.741185296324081
Best Parameters: {'tvec__max_df': 0.5, 'tvec__max_features': 1000, 'tvec__min_df': 3, 'tvec__ngram_range': (1, 3), 'tvec__stop_words': 'english'}
Train Accuracy Score: 0.8132033008252063
Test Accuracy Score: 0.7415730337078652


## Predictions Utilizing the Top Performing Models

I would classify two of the models as the best: the one with the highest overall score (lowest bias) and the one with the smallest overall difference between the train and test data (lowest variance). These models are tested out below with their optimized hyperparameters.
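To make that selection concrete, the sketch below tabulates the train and test accuracies reported in the sections above (rounded to four decimal places) and picks out the model with the highest test accuracy and the one with the smallest train/test gap. AdaBoost and XGBoost with CountVectorizer are left out because no complete scores are shown for them; the dictionary and variable names are illustrative.

```python
# (train accuracy, test accuracy) as reported in the sections above.
reported = {
    'CountVectorizer + LogisticRegression': (0.9798, 0.8094),
    'TF-IDF + LogisticRegression': (0.9282, 0.7982),
    'CountVectorizer + MultinomialNB': (0.8833, 0.7556),
    'TF-IDF + MultinomialNB': (0.9013, 0.7623),
    'CountVectorizer + RandomForest': (0.8684, 0.7578),
    'TF-IDF + RandomForest': (0.8160, 0.7309),
    'TF-IDF + AdaBoost': (0.8033, 0.7646),
    'TF-IDF + XGBoost': (0.8132, 0.7416),
}

# Train/test gap as a rough proxy for variance.
gaps = {name: train - test for name, (train, test) in reported.items()}
best_test = max(reported, key=lambda name: reported[name][1])
smallest_gap = min(gaps, key=gaps.get)

for name, (train, test) in reported.items():
    print(f'{name:40s} train={train:.4f} test={test:.4f} gap={gaps[name]:.4f}')
print('Highest test accuracy:', best_test)
print('Smallest train/test gap:', smallest_gap)
```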

In [13]:
master_df.head()

|   | ID | Length of Title | Post Text | Subreddit |
|---|---|---|---|---|
| 0 | t3_bafvy6 | 293 | In her own words... Antonia Okafor responds to... | 0 |
| 1 | t3_baf5b7 | 54 | Science shows that white liberals condescend t... | 0 |
| 2 | t3_bah4s9 | 20 | That's not racist...\n | 0 |
| 3 | t3_bagod9 | 237 | Sir Mick Jagger had a heart valve problem and ... | 0 |
| 4 | t3_bah4s7 | 24 | 🎶My Heart Will Go Onnnn🎶\n | 0 |


In [14]:
#define features
X = master_df['Post Text']
y = master_df['Subreddit']

#train test split
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    stratify=y,
                                                    random_state=42)

### AdaBoost with TF-IDF

In [26]:
#instantiate Adaboost with learning rate of 0.5 as optimized by GridSearch
ada = AdaBoostClassifier(learning_rate=0.5)

In [27]:
#instantiate TF-IDF and choose optimized hyperparameters from prior section's GridSearch
tf= TfidfVectorizer(max_df= 0.4, 
                max_features= None,
                min_df= 3,
                ngram_range=(1, 3),
                stop_words='english')


# Fit our TfidfVectorizer on the training data and transform training data.
X_train_tf = pd.DataFrame(tf.fit_transform(X_train).todense()
                           ,columns = tf.get_feature_names())

# Transform the test data with the TfidfVectorizer fitted on the training data.
X_test_tf = pd.DataFrame(tf.transform(X_test).todense()
                           ,columns = tf.get_feature_names())

In [28]:
#fit the model to our data
ada = ada.fit(X_train_tf, y_train)

In [20]:
X_test_tf.shape

(445, 1793)

In [21]:
y_test.shape

(445,)

In [22]:
ada.score(X_train_tf, y_train)

0.8079519879969993

In [23]:
ada.score(X_test_tf, y_test)

0.7617977528089888

### LogisticRegression and CountVectorizer

In [37]:
#instantiate countvectorizer 
cvec = CountVectorizer(stop_words= new_stop_list,
                       ngram_range=(1,2), min_df=2,
                       max_features=None, max_df = 0.4)

In [38]:
# Fit our CountVectorizer on the training data and transform training data.
X_train_cvec = pd.DataFrame(cvec.fit_transform(X_train).todense()
                           ,columns = cvec.get_feature_names())

# Transform the test data with the CountVectorizer fitted on the training data.
X_test_cvec = pd.DataFrame(cvec.transform(X_test).todense()
                           ,columns = cvec.get_feature_names())

In [39]:
#instantiate logisticregression
lr = LogisticRegression()
#fit data
lr = lr.fit(X_train_cvec, y_train)



In [40]:
#examine and verify shape
X_test_cvec.shape

(445, 3212)

In [41]:
#examine shape to verify a fit
y_test.shape

(445,)

In [42]:
#score our logistic regression model on our fitted training data
lr.score(X_train_cvec, y_train)

0.9789947486871718

In [43]:
#score our logistic regression model on our fitted testing data
lr.score(X_test_cvec, y_test)

0.8224719101123595

## Evaluation and Conceptual Understanding

Although our models performed well, there are inherent limitations. For starters, we have to choose between a model with very high variance (Logistic Regression) and one with slightly worse accuracy but much lower variance (AdaBoost). We are also limited by the computational cost of running every model through a grid search in order to tune its hyperparameters.

In [59]:
#generate predictions
pred = ada.predict(X_test_tf)

#generate confusion matrix
conf = confusion_matrix( y_test,# True values.
                     pred)# Predicted values.
tn, fp, fn, tp = conf.ravel()

In [60]:
#convert confusion matrix to dataframe
df_ada= pd.DataFrame(conf, index =  ['actual republican', 'actual democrats'], columns = ['predicted republican', 'predicted democrats'])


#### Confusion Matrix- Adaboost

In [61]:
df_ada

|   | predicted republican | predicted democrats |
|---|---|---|
| actual republican | 150 | 47 |
| actual democrats | 59 | 189 |


This provides another view of the accuracy score: 106 of the 445 test posts (about 24%, or roughly 1 in 4) are misclassified.

#### Confusion Matrix- Logistic Regression

In [56]:
#generate predictions
pred = lr.predict(X_test_cvec)

#generate confusion matrix
conf = confusion_matrix( y_test,# True values.
                     pred)# Predicted values.
tn, fp, fn, tp = conf.ravel()

In [57]:
#convert confusion matrix to dataframe
df_lr= pd.DataFrame(conf, index =  ['actual republican', 'actual democrats'], columns = ['predicted republican', 'predicted democrats'])


In [58]:
df_lr

|   | predicted republican | predicted democrats |
|---|---|---|
| actual republican | 163 | 34 |
| actual democrats | 45 | 203 |
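The counts in the two confusion matrices above can be turned into accuracy, precision and recall by hand. The sketch below does that with a small helper, `summarize`, which is a hypothetical name introduced here; class 1 (the "democrats" row and column) is treated as the positive class, matching the order returned by `conf.ravel()`.

```python
def summarize(tn, fp, fn, tp):
    """Return accuracy, precision and recall for the positive class (1)."""
    total = tn + fp + fn + tp
    return {
        'accuracy': (tn + tp) / total,
        'precision': tp / (tp + fp),
        'recall': tp / (tp + fn),
    }

# Counts copied from the two confusion matrices above, in conf.ravel() order.
ada_metrics = summarize(tn=150, fp=47, fn=59, tp=189)
lr_metrics = summarize(tn=163, fp=34, fn=45, tp=203)

for name, metrics in [('AdaBoost', ada_metrics),
                      ('Logistic Regression', lr_metrics)]:
    print(name, {key: round(value, 4) for key, value in metrics.items()})
```

The accuracies computed this way agree with the `score` outputs shown earlier for both models. The `precision_score` and `recall_score` functions imported at the top of the notebook would give the same precision and recall when called on `y_test` and `pred`.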



## Conclusion and Recommendations

All of our models performed better than the baseline accuracy of ~55.7%. Although almost all of them displayed varying degrees of bias, variance and overfitting, the top performers were LogisticRegression with CountVectorizer (highest test accuracy) and AdaBoost with TF-IDF (smallest gap between train and test accuracy). They were selected not only on overall raw accuracy, but also on variance and goodness of fit.

In recommending this model to advertising companies that wish to target potential clients, it is important to weigh the pros and cons of the 82.25% accuracy offered by the Logistic Regression version of our model. Roughly 4 out of 5 recipients would be targeted correctly, but there would still be a consistent share of about 1 in 5 (closer to 18%) who are not actually in the class predicted by our model.

Additional features could also improve the accuracy of our model. Three ideas for future iterations are:

1. Fixing typos or other spelling errors that may have impacted our model's ability to interpret text.

2. Incorporating a sentiment analysis aspect, which would involve creating two bags of words that define positive and negative sentiment words, then filtering and weighting them accordingly.

3. Incorporating a loudness aspect, in which we would look at the prevalence of capital letters in sequence. Although our preprocessing transforms all text to lowercase, there is an argument for including runs of uppercase text, as they usually convey intense emotion.
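The second and third ideas can be sketched in a few lines. The functions below are hypothetical helpers, not part of this project: `loudness` measures the share of words written entirely in capitals and would have to run on the raw text before any lowercasing, and `sentiment_counts` counts hits against two very small, made-up word lists that stand in for a real sentiment lexicon.

```python
import re

# Hypothetical word lists; a real lexicon would be much larger.
POSITIVE_WORDS = {'great', 'good', 'love', 'win'}
NEGATIVE_WORDS = {'bad', 'terrible', 'hate', 'lose'}

def loudness(text):
    """Share of words written entirely in capitals (two or more letters)."""
    words = re.findall(r'[A-Za-z]+', text)
    if not words:
        return 0.0
    shouted = [word for word in words if len(word) >= 2 and word.isupper()]
    return len(shouted) / len(words)

def sentiment_counts(text):
    """Count positive and negative lexicon hits in the lowercased text."""
    tokens = re.findall(r'[a-z]+', text.lower())
    positive = sum(token in POSITIVE_WORDS for token in tokens)
    negative = sum(token in NEGATIVE_WORDS for token in tokens)
    return positive, negative

print(loudness('THIS IS great news'))
print(sentiment_counts('I love this but hate that'))
```

Either value could then be added as an extra numeric column next to the vectorized text.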
      