### Data Modeling

In [59]:
from os import path
from PIL import Image
import numpy as np
import matplotlib.pyplot as plt
import os

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import confusion_matrix, plot_confusion_matrix

import warnings
warnings.simplefilter(action='ignore')

%matplotlib inline

In [60]:
#load the datasets
df=pd.read_csv('df_subs.csv')

In [61]:
df.head()

Unnamed: 0,subreddit,is_self,is_video,selftext,title,subreddit_id,created_utc,upvote_ratio,author,num_comments,...,title_lem,selftext_lem_stop,title_lem_stop,title_text_lem_stop,title_len,selftext_len,scores,compound,compound_score,is_ldr
0,lanadelrey,True,False,I really wanted to buy one and I finally saved...,What happened to the etsy shop that sold the h...,t5_2tegk,1640906654,1.0,artisticphangirl,0,...,What happened to the etsy shop that sold the h...,I really wanted buy I finally saved But shop D...,What happened etsy shop sold heart necklace,What happened etsy shop sold heart necklace I ...,7,11,"{'neg': 0.0, 'neu': 0.888, 'pos': 0.112, 'comp...",0.2263,positive,1
1,lanadelrey,True,False,i was lucky to find it for 56$ at a local reco...,is the standard black nfr vinyl rare?,t5_2tegk,1640901920,1.0,ambriebat,0,...,is the standard black nfr vinyl rare,lucky 56 local record shop couple day ago look...,standard black nfr vinyl rare,standard black nfr vinyl rare lucky 56 local r...,5,23,"{'neg': 0.044, 'neu': 0.791, 'pos': 0.165, 'co...",0.5859,positive,1
2,lanadelrey,True,False,Hi everyone We are a community-focused music j...,Is Blue Banisters the best folk (or folk-adjac...,t5_2tegk,1640901006,1.0,BLIGATORY,0,...,Is Blue Banisters the best folk or folk adjace...,Hi We community focused music journalism outle...,Is Blue Banisters best folk folk adjacent albu...,Is Blue Banisters best folk folk adjacent albu...,9,44,"{'neg': 0.0, 'neu': 0.736, 'pos': 0.264, 'comp...",0.9531,positive,1
3,lanadelrey,True,False,what Lana song has this effect on you? VG for ...,I can listen to Video Games every single day a...,t5_2tegk,1640900733,1.0,throwitawayar,0,...,I can listen to Video Games every single day a...,Lana song effect VG surreal,I listen Video Games single day I mesmerized b...,I listen Video Games single day I mesmerized b...,10,5,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...",0.0,neutral,1
4,lanadelrey,True,False,What album out of Lana’s discography would you...,What’s Lana’s Most “Lana Del Rey” Album,t5_2tegk,1640893285,1.0,Which_Relation_9766,0,...,What s Lana s Most Lana Del Rey Album,What album Lana discography consider Lana Del Rey,What Lana Most Lana Del Rey Album,What Lana Most Lana Del Rey Album What album L...,7,8,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...",0.0,neutral,1


### Baseline Model

To compare our models with a baseline model, we will first create the baseline model using the normalized value of our response (y) or in other words the percentage of y within our target. This will be the representative of the simplest model whereby assigning a post randomly will yield ~60% chance of correct classification.


In [62]:
df['subreddit'].value_counts()

Metallica     561
lanadelrey    521
Name: subreddit, dtype: int64

In [63]:
df['subreddit'].value_counts(normalize=True)

Metallica     0.518484
lanadelrey    0.481516
Name: subreddit, dtype: float64

In [64]:
X = df['title_text_lem_stop']
y = df['is_ldr']

### Train Test Split

In [65]:
# Split our data into train and test data. We will stratify during the split to ensure that the train and test sets 
# contains the same percentage of samplesto avoid imbalanced classes.
X_train,X_test,y_train,y_test = train_test_split(X,y,
                                                 random_state=42,
                                                 stratify=y) # account for slight class unbalanced 

### Model Preparation

In [66]:
# Vectorizer and Model imports:
import pandas as pd

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, HashingVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier,GradientBoostingClassifier, AdaBoostClassifier,VotingClassifier
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import confusion_matrix, classification_report, plot_roc_curve, roc_auc_score,accuracy_score, precision_score, recall_score, f1_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

import matplotlib.pyplot as plt
%matplotlib inline

In [67]:
#custom stop word list identified
custom_stopword=['lanadelrey','metallica','ha','lol','wa']

In [68]:
cv = CountVectorizer(stop_words=custom_stopword)
tvec = TfidfVectorizer(stop_words=custom_stopword)
hv= HashingVectorizer(stop_words=custom_stopword)

In [69]:
# Instantiate vectorizers
vectorizers = {'cvec': cv,
               'tvec': tvec,
               'hv': hv}

In [70]:
# Instiantiate models
models = {'lr': LogisticRegression(max_iter=1_000, random_state=42),
          'rf': RandomForestClassifier(random_state=42),
          'gb': GradientBoostingClassifier(random_state=42),
          'et': ExtraTreesClassifier(random_state=42),
          'ada': AdaBoostClassifier(random_state=42),
          'nb': MultinomialNB(),
          'svc': SVC(random_state=42)}

In [71]:
# Function to run model -- input vectorizer and model
def model_run(vec, mod, vec_params={}, mod_params={}, grid_search=False):
    
    results = {}
    
    pipe = Pipeline([
            (vec, vectorizers[vec]),
            (mod, models[mod])
            ])
    
    if grid_search:
        gs = GridSearchCV(pipe, param_grid = {**vec_params, **mod_params}, verbose=3, n_jobs=-1)
        gs.fit(X_train, y_train)
        pipe = gs
        
    else:
        pipe.fit(X_train, y_train)
    
    # Retrieve metrics
    results['model'] = mod
    results['vectorizer'] = vec
    results['train'] = pipe.score(X_train, y_train)
    results['test'] = pipe.score(X_test, y_test)
    predictions = pipe.predict(X_test)
    results['roc'] = roc_auc_score(y_test, predictions)
    results['precision'] = precision_score(y_test, predictions)
    results['recall'] = recall_score(y_test, predictions)
    results['f_score'] = f1_score(y_test, predictions)
    
    if grid_search:
        tuning_list.append(results)
        print('PARAMS')
        display(pipe.best_params_)
        
    else:
        eval_list.append(results)
    
    print('METRICS')
    display(results)
    
    tn, fp, fn, tp = confusion_matrix(y_test, predictions).ravel()
    print(f"True Negatives: {tn}")
    print(f"False Positives: {fp}")
    print(f"False Negatives: {fn}")
    print(f"True Positives: {tp}")
    
    return pipe

### Model testing

In [72]:
# Create list to store model testing results
eval_list = []

### Benchmark Model

Since there are significant differences on the numerical features of the posts extracted from both subreddits (e.g. title/post length and number of comments), we should also test out a model with only these features.

In [73]:
df.head()

Unnamed: 0,subreddit,is_self,is_video,selftext,title,subreddit_id,created_utc,upvote_ratio,author,num_comments,...,title_lem,selftext_lem_stop,title_lem_stop,title_text_lem_stop,title_len,selftext_len,scores,compound,compound_score,is_ldr
0,lanadelrey,True,False,I really wanted to buy one and I finally saved...,What happened to the etsy shop that sold the h...,t5_2tegk,1640906654,1.0,artisticphangirl,0,...,What happened to the etsy shop that sold the h...,I really wanted buy I finally saved But shop D...,What happened etsy shop sold heart necklace,What happened etsy shop sold heart necklace I ...,7,11,"{'neg': 0.0, 'neu': 0.888, 'pos': 0.112, 'comp...",0.2263,positive,1
1,lanadelrey,True,False,i was lucky to find it for 56$ at a local reco...,is the standard black nfr vinyl rare?,t5_2tegk,1640901920,1.0,ambriebat,0,...,is the standard black nfr vinyl rare,lucky 56 local record shop couple day ago look...,standard black nfr vinyl rare,standard black nfr vinyl rare lucky 56 local r...,5,23,"{'neg': 0.044, 'neu': 0.791, 'pos': 0.165, 'co...",0.5859,positive,1
2,lanadelrey,True,False,Hi everyone We are a community-focused music j...,Is Blue Banisters the best folk (or folk-adjac...,t5_2tegk,1640901006,1.0,BLIGATORY,0,...,Is Blue Banisters the best folk or folk adjace...,Hi We community focused music journalism outle...,Is Blue Banisters best folk folk adjacent albu...,Is Blue Banisters best folk folk adjacent albu...,9,44,"{'neg': 0.0, 'neu': 0.736, 'pos': 0.264, 'comp...",0.9531,positive,1
3,lanadelrey,True,False,what Lana song has this effect on you? VG for ...,I can listen to Video Games every single day a...,t5_2tegk,1640900733,1.0,throwitawayar,0,...,I can listen to Video Games every single day a...,Lana song effect VG surreal,I listen Video Games single day I mesmerized b...,I listen Video Games single day I mesmerized b...,10,5,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...",0.0,neutral,1
4,lanadelrey,True,False,What album out of Lana’s discography would you...,What’s Lana’s Most “Lana Del Rey” Album,t5_2tegk,1640893285,1.0,Which_Relation_9766,0,...,What s Lana s Most Lana Del Rey Album,What album Lana discography consider Lana Del Rey,What Lana Most Lana Del Rey Album,What Lana Most Lana Del Rey Album What album L...,7,8,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...",0.0,neutral,1


In [74]:
df['title_len'] = len(df['title_lem_stop'])
df['selftext_len'] = len(df['selftext_lem_stop'])

In [75]:
# bm stands for benchmark
X_bm =df[['upvote_ratio', 'num_comments', 'title_len', 'selftext_len', 'compound']]
y_bm = df['is_ldr']

In [76]:
# Split our data into train and test data
X_bm_train, X_bm_test, y_bm_train, y_bm_test = train_test_split(X_bm, y_bm, test_size=0.3, stratify=y, random_state=42)

In [77]:
logreg = LogisticRegression()
logreg.fit(X_bm_train, y_bm_train)

LogisticRegression()

In [78]:
logreg.score(X_bm_train, y_bm_train)

0.5759577278731837

In [79]:
logreg.score(X_bm_test, y_bm_test)

0.556923076923077

In [80]:
bm1_pred = logreg.predict(X_bm_test)

In [81]:
tn, fp, fn, tp = confusion_matrix(y_bm_test, bm1_pred).ravel()

print("True Negatives: %s" % tn)
print("False Positives: %s" % fp)
print("False Negatives: %s" % fn)
print("True Positives: %s" % tp)

True Negatives: 91
False Positives: 78
False Negatives: 66
True Positives: 90


True Positives are r/lanadelrey posts that were correctly classified by our model. True Negatives are r/metallica posts that were correctly classified by our model. We can see that the model still incorrectly classifies about 78 posts from lanadelrey subreddit and 60 from metallica subreddit which is not ideal. 

In [82]:
print(classification_report(y_bm_test, bm1_pred))

              precision    recall  f1-score   support

           0       0.58      0.54      0.56       169
           1       0.54      0.58      0.56       156

    accuracy                           0.56       325
   macro avg       0.56      0.56      0.56       325
weighted avg       0.56      0.56      0.56       325



### Logistic Regression

In [83]:
# Logistic Regression with CountVectorizer
cvec_lr = model_run('cvec', 'lr')

METRICS


{'model': 'lr',
 'vectorizer': 'cvec',
 'train': 0.9975339087546239,
 'test': 0.9114391143911439,
 'roc': 0.9109929078014184,
 'precision': 0.9140625,
 'recall': 0.9,
 'f_score': 0.9069767441860466}

True Negatives: 130
False Positives: 11
False Negatives: 13
True Positives: 117


In [84]:
# Logistic Regression with TfdifVectorizer
tvec_lr = model_run('tvec', 'lr')

METRICS


{'model': 'lr',
 'vectorizer': 'tvec',
 'train': 0.9901356350184957,
 'test': 0.9446494464944649,
 'roc': 0.9444080741953083,
 'precision': 0.9457364341085271,
 'recall': 0.9384615384615385,
 'f_score': 0.942084942084942}

True Negatives: 134
False Positives: 7
False Negatives: 8
True Positives: 122


In [85]:
# Logistic Regression with TfdifVectorizer
tvec_lr = model_run('hv', 'lr')

METRICS


{'model': 'lr',
 'vectorizer': 'hv',
 'train': 0.9852034525277436,
 'test': 0.9261992619926199,
 'roc': 0.9248772504091654,
 'precision': 0.9508196721311475,
 'recall': 0.8923076923076924,
 'f_score': 0.9206349206349206}

True Negatives: 135
False Positives: 6
False Negatives: 14
True Positives: 116


HashingVectorizer() was also experimented as it has low memory requirement by storing tokens as strings but because this we can no longer retrieve the features after vectorizing. Besides, we decided to drop using this in subsequent models due to weaker scores compared to the other two vectorizers.

Comparing TVEC,CVEC and HashVectorizer, clearly tvec yields a higher R2 score and f1-score.

### Random Forest / Extra Tree Classifier with CVEC

Random forest is a tree-based machine learning algorithm that leverages the power of multiple decision trees for making decisions. Each node in the decision tree works on a random subset of features to calculate the output which is aggregated to form the final output.

The Extra Trees classifier works similar to this, but incorporates bootstrap aggregation (or random sampling with replacement) in order to reduce variance and help with overfitting. In general, both methods provided worse results relative to logistic regression.

In [86]:
cvec_rf = model_run('cvec', 'rf')

METRICS


{'model': 'rf',
 'vectorizer': 'cvec',
 'train': 1.0,
 'test': 0.9261992619926199,
 'roc': 0.9281778505182761,
 'precision': 0.8819444444444444,
 'recall': 0.9769230769230769,
 'f_score': 0.927007299270073}

True Negatives: 124
False Positives: 17
False Negatives: 3
True Positives: 127


In [87]:
tvec_rf = model_run('tvec', 'rf')

METRICS


{'model': 'rf',
 'vectorizer': 'tvec',
 'train': 1.0,
 'test': 0.9188191881918819,
 'roc': 0.9207855973813421,
 'precision': 0.875,
 'recall': 0.9692307692307692,
 'f_score': 0.9197080291970802}

True Negatives: 123
False Positives: 18
False Negatives: 4
True Positives: 126


In [88]:
cvec_et = model_run('cvec', 'et')

METRICS


{'model': 'et',
 'vectorizer': 'cvec',
 'train': 1.0,
 'test': 0.9372693726937269,
 'roc': 0.9385160938352428,
 'precision': 0.9064748201438849,
 'recall': 0.9692307692307692,
 'f_score': 0.9368029739776952}

True Negatives: 128
False Positives: 13
False Negatives: 4
True Positives: 126


In [89]:
tvec_et = model_run('tvec', 'et')

METRICS


{'model': 'et',
 'vectorizer': 'tvec',
 'train': 1.0,
 'test': 0.9188191881918819,
 'roc': 0.9210856519367159,
 'precision': 0.8698630136986302,
 'recall': 0.9769230769230769,
 'f_score': 0.9202898550724639}

True Negatives: 122
False Positives: 19
False Negatives: 3
True Positives: 127


Both random forest and extra trees classifier did not score that well compared to logistic regression with tvec. We would need to explore further on how to get a better model.

### Naive Bayes

Naive Bayes classifiers works off Bayes' theroem, which describes the probability of an event, based on prior knowledge of conditions that might be related to the event.

In this case we will be using Multinomial Naive Bayes that looks at the frequency of the words present in our data.

In [90]:
cvec_nb = model_run('cvec', 'nb')

METRICS


{'model': 'nb',
 'vectorizer': 'cvec',
 'train': 0.9901356350184957,
 'test': 0.955719557195572,
 'roc': 0.9562465902891435,
 'precision': 0.9402985074626866,
 'recall': 0.9692307692307692,
 'f_score': 0.9545454545454547}

True Negatives: 133
False Positives: 8
False Negatives: 4
True Positives: 126


In [91]:
tvec_nb = model_run('tvec', 'nb')

METRICS


{'model': 'nb',
 'vectorizer': 'tvec',
 'train': 0.9889025893958077,
 'test': 0.955719557195572,
 'roc': 0.9562465902891435,
 'precision': 0.9402985074626866,
 'recall': 0.9692307692307692,
 'f_score': 0.9545454545454547}

True Negatives: 133
False Positives: 8
False Negatives: 4
True Positives: 126


The Multinomial Naive Bayes classifier with count vectorizer has an extremely high recall and f-score which is great at predicting r/lanadelrey posts.

### Support Vector Machine Classifier

Support vector machine classifer separates data points using a hyperplane with the largest amount of margin and classifies them accordingly. In other words, the algorithm determines the best decision boundary between vectors that belong to a given group (or category) and vectors that do not belong to it.

In [92]:
cvec_svc = model_run('cvec', 'svc')

METRICS


{'model': 'svc',
 'vectorizer': 'cvec',
 'train': 0.9605425400739828,
 'test': 0.8782287822878229,
 'roc': 0.8739770867430441,
 'precision': 0.970873786407767,
 'recall': 0.7692307692307693,
 'f_score': 0.8583690987124465}

True Negatives: 138
False Positives: 3
False Negatives: 30
True Positives: 100


tvec_svc = model_run('tvec', 'svc')

In [93]:
eval_df = pd.DataFrame(eval_list)

In [94]:
# Top results (Accuracy >= 0.790)
eval_df.sort_values(by='test', ascending=False).reset_index(drop=True).head(6)

Unnamed: 0,model,vectorizer,train,test,roc,precision,recall,f_score
0,nb,cvec,0.990136,0.95572,0.956247,0.940299,0.969231,0.954545
1,nb,tvec,0.988903,0.95572,0.956247,0.940299,0.969231,0.954545
2,lr,tvec,0.990136,0.944649,0.944408,0.945736,0.938462,0.942085
3,et,cvec,1.0,0.937269,0.938516,0.906475,0.969231,0.936803
4,lr,hv,0.985203,0.926199,0.924877,0.95082,0.892308,0.920635
5,rf,cvec,1.0,0.926199,0.928178,0.881944,0.976923,0.927007


we can see that both naive bayes with cvec & tvec rank did better in terms of test score, roc score, recall and f1-score.

### Model Tuning

In [97]:
# Instantiate list to store tuning results
tuning_list = []

In [98]:
cv.get_params()

{'analyzer': 'word',
 'binary': False,
 'decode_error': 'strict',
 'dtype': numpy.int64,
 'encoding': 'utf-8',
 'input': 'content',
 'lowercase': True,
 'max_df': 1.0,
 'max_features': None,
 'min_df': 1,
 'ngram_range': (1, 1),
 'preprocessor': None,
 'stop_words': ['lanadelrey', 'metallica', 'ha', 'lol', 'wa'],
 'strip_accents': None,
 'token_pattern': '(?u)\\b\\w\\w+\\b',
 'tokenizer': None,
 'vocabulary': None}

In [99]:
cvec_params = {
    # Setting a limit of n-number of features included/vocab size
    'cvec__max_features': [None, 12_000],

    # Setting a minimum number of times the word/token has to appear in n-documents
    'cvec__min_df':[1, 2, 3],
    
    # Setting an upper threshold/max percentage of n% of documents from corpus 
    'cvec__max_df': [0.1, 0.2, 1],
    
    # Testing with bigrams and trigrams
    'cvec__ngram_range':[(1,1), (1,2)],
}

In [100]:
tvec_params = {
    'tvec__max_features': [None],
    'tvec__min_df':[3, 4, 5],
    'tvec__max_df': [0.2, 0.3, 0.4],
    'tvec__stop_words': ['english'],
    'tvec__ngram_range':[(1,1), (1,2)]
}

### Model Params

In [101]:
MultinomialNB().get_params()

{'alpha': 1.0, 'class_prior': None, 'fit_prior': True}

In [102]:
lr_params = {
    # Trying different types of regularization
    'lr__penalty':['l2','l1'],

     # Trying different alphas of: 10, 1, 0.1 (C = 1/alpha)
    'lr__C':[0.1, 1, 10],
}

In [103]:
nb_params = {
    'nb__fit_prior': [True, False],
    'nb__alpha': [0.8, 0.9, 1.0],
}

In [104]:
svc_params = {
    'svc__C':[0.1, 1, 10],
    'svc__gamma':[0.01, 0.1, 0.3], 
    'svc__kernel':['linear','rbf'],
}

### Logistic Regression with CVEC

In [106]:
# Always stop_words & never trigrams (best_results without model tuning)
cvec_lr_gs = model_run('cvec', 'lr', vec_params=cvec_params, mod_params=lr_params, grid_search=True)

Fitting 5 folds for each of 216 candidates, totalling 1080 fits
PARAMS


{'cvec__max_df': 0.2,
 'cvec__max_features': None,
 'cvec__min_df': 1,
 'cvec__ngram_range': (1, 2),
 'lr__C': 0.1,
 'lr__penalty': 'l2'}

METRICS


{'model': 'lr',
 'vectorizer': 'cvec',
 'train': 0.9963008631319359,
 'test': 0.9003690036900369,
 'roc': 0.9012547735951992,
 'precision': 0.8759124087591241,
 'recall': 0.9230769230769231,
 'f_score': 0.8988764044943821}

True Negatives: 124
False Positives: 17
False Negatives: 10
True Positives: 120


### Logistic Regression with TVEC

In [108]:
tvec_lr_gs = model_run('tvec', 'lr', vec_params=tvec_params, mod_params=lr_params, grid_search=True)

Fitting 5 folds for each of 108 candidates, totalling 540 fits
PARAMS


{'lr__C': 1,
 'lr__penalty': 'l2',
 'tvec__max_df': 0.3,
 'tvec__max_features': None,
 'tvec__min_df': 3,
 'tvec__ngram_range': (1, 2),
 'tvec__stop_words': 'english'}

METRICS


{'model': 'lr',
 'vectorizer': 'tvec',
 'train': 0.9926017262638718,
 'test': 0.959409594095941,
 'roc': 0.9585924713584287,
 'precision': 0.976,
 'recall': 0.9384615384615385,
 'f_score': 0.9568627450980391}

True Negatives: 138
False Positives: 3
False Negatives: 8
True Positives: 122


### Multinomial Naive Bayes with CVEC

In [109]:
cvec_nb_gs = model_run('cvec', 'nb', vec_params=cvec_params, mod_params=nb_params, grid_search=True)

Fitting 5 folds for each of 216 candidates, totalling 1080 fits
PARAMS


{'cvec__max_df': 0.2,
 'cvec__max_features': None,
 'cvec__min_df': 1,
 'cvec__ngram_range': (1, 2),
 'nb__alpha': 0.9,
 'nb__fit_prior': False}

METRICS


{'model': 'nb',
 'vectorizer': 'cvec',
 'train': 0.9963008631319359,
 'test': 0.9520295202952029,
 'roc': 0.9524004364429897,
 'precision': 0.9398496240601504,
 'recall': 0.9615384615384616,
 'f_score': 0.9505703422053231}

True Negatives: 133
False Positives: 8
False Negatives: 5
True Positives: 125


### Support Machine Classifier with TVEC

In [110]:
tvec_svc_gs = model_run('tvec', 'svc', vec_params=tvec_params, mod_params=svc_params, grid_search=True)

Fitting 5 folds for each of 324 candidates, totalling 1620 fits
PARAMS


{'svc__C': 10,
 'svc__gamma': 0.01,
 'svc__kernel': 'rbf',
 'tvec__max_df': 0.4,
 'tvec__max_features': None,
 'tvec__min_df': 3,
 'tvec__ngram_range': (1, 1),
 'tvec__stop_words': 'english'}

METRICS


{'model': 'svc',
 'vectorizer': 'tvec',
 'train': 0.9728729963008631,
 'test': 0.940959409594096,
 'roc': 0.9405619203491545,
 'precision': 0.9453125,
 'recall': 0.9307692307692308,
 'f_score': 0.937984496124031}

True Negatives: 134
False Positives: 7
False Negatives: 9
True Positives: 121


### Final Model Selection

In [112]:
tuning_df = pd.DataFrame(tuning_list)

In [113]:
tuning_df.sort_values(by=['test', 'roc'], ascending=False).reset_index(drop=True)

Unnamed: 0,model,vectorizer,train,test,roc,precision,recall,f_score
0,lr,tvec,0.992602,0.95941,0.958592,0.976,0.938462,0.956863
1,nb,cvec,0.996301,0.95203,0.9524,0.93985,0.961538,0.95057
2,svc,tvec,0.972873,0.940959,0.940562,0.945312,0.930769,0.937984
3,lr,cvec,0.996301,0.900369,0.901255,0.875912,0.923077,0.898876


From the table above, linear regression with TfidfVectorizer returned the highest R2 accuracy of 0.96 in gridsearch while 0.96 w/o gridsearch even on the default params. This is because grid search creates subsamples of the data repeatedly. GridSearch is used for selecting a combination of hyperparameters, performance estimation has not yet happened. The only comparison we could be making is between the parameter combinations within the CV itself.

In other words, our model is able accurately predict about 96% of the test data based on our text features. The model also has the best AUC-ROC score of 0.96. We can interpret this metric as proof that that this model is the best at distinguishing between classes. The model does particularly very well in terms of recall (0.94), with only 8 false negatives (predicted r/metallica but actually belong to r/lanadelrey posts which is potentially a loss of Lana Del Rey fans).

In summary, our final model:

- The model uses Tfid Vectorization 
- ignores terms that that appear in more than 30% of posts


### Conclusion and Recommendations

Conclusion & Recommendations
In conclusion, the model we chosed to better classify the posts for r/lanadelrey and r/metallica is the linear regression model with tfid vectorizer due to its higher r2 score, recall and f-1 score as a whole.

Besides maximizing the funds and manpower to classify posts from two different subreddits based on their title and selftext, there are a number of other possible applications for this model.

By looking at the probabilities associated with each post, marketing teams can better appeal to the potential listeners when they are promoting to the LDR fans or Metallica fans. In fact, this can also be useful as words can be trending with time but withi this model we can accurately ride the trend with words that have high probability of being classified as artist names and song names.

The sentiment analysis we implemented can also determine the mood of the potential clients which either is a positive attraction or vice versa.

The recommendation we would propose at this point of time for the Moozeek is that we can focus on are as follows:

- When someone wants to listen to a certain genre, they do not want to be interupted with a sudden change of tone/tune. Classifying keywords should be a priority to categorize the correct artist to the correct genre. Also, makes a run for a great listening experience.

- Dig further into keyword predicitions for moods in regards to genre.