# Project 3
# Using NLP to classify posts to one of two subreddits

Initially, I'll build a simple logistic regression model. First, I'll split the data into a training set and a test set. Using the unedited text, I'll fit the count vectorizer on the training data, transform the training data with it, and then transform the test data with the same fitted vocabulary. Then, I'll fit a logistic regression to predict the target. I'll assess the accuracy of the model on both the training data and the testing data.

In [1]:
import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
import re

import warnings
warnings.filterwarnings('ignore')

## Initial Model to Establish Baseline Performance

In [2]:
jam = pd.read_csv("./jam0.csv")

X = jam["text"]
y = jam["subreddit"]

X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    random_state = 42,
                                                    stratify = y)

cv = CountVectorizer()
X_train_cv = cv.fit_transform(X_train)
X_test_cv = cv.transform(X_test)

lr = LogisticRegression()
lr.fit(X_train_cv, y_train)

print(f"number of features: {len(cv.vocabulary_)}")  # get_feature_names() was removed in newer scikit-learn
print("")
print(f"""training accuracy score: {lr.score(X_train_cv, y_train)}
 testing accuracy score: {lr.score(X_test_cv, y_test)}""")

number of features: 23133

training accuracy score: 0.9895916733386709
 testing accuracy score: 0.8929674099485421


The model already performs fairly accurately with hardly any cleaning or organizing of the data. It will be interesting to see how much I can improve the performance by implementing techniques like lemmatization and TF-IDF and by tuning hyperparameters like stop words, max features, and n-gram range. The variance of the model is high (training accuracy of about 0.99 against testing accuracy of about 0.89), so I will try to reduce the overfitting.

### Data Update: removed "\n", "[deleted]", "[removed]"

In [3]:
jam = pd.read_csv("./jam1.csv")

X = jam["text"]
y = jam["subreddit"]

X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    random_state = 42,
                                                    stratify = y)

cv = CountVectorizer()
X_train_cv = cv.fit_transform(X_train)
X_test_cv = cv.transform(X_test)

lr = LogisticRegression()
lr.fit(X_train_cv, y_train)

print(f"number of features: {len(cv.vocabulary_)}")  # get_feature_names() was removed in newer scikit-learn
print("")
print(f"""training accuracy score: {lr.score(X_train_cv, y_train)}
 testing accuracy score: {lr.score(X_test_cv, y_test)}""")

number of features: 23133

training accuracy score: 0.989820427770788
 testing accuracy score: 0.8926243567753002


This had essentially no effect. Judging by the unchanged number of features, it's possible that these 3 substrings were already getting filtered out somehow.
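
One way to probe that hypothesis is to look at what the default `CountVectorizer` analyzer does with these substrings. The sketch below uses a made-up sample string (it is not taken from the dataset). The default token pattern keeps only runs of two or more word characters, so the newline and the square brackets never become features, while the words `deleted` and `removed` still do.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical sample post, invented for illustration; not from jam0.csv.
sample = "Great solo!\n[deleted] [removed]"

# build_analyzer() returns the callable (lowercasing + tokenizing)
# that CountVectorizer applies to every document.
analyzer = CountVectorizer().build_analyzer()
tokens = analyzer(sample)

print(tokens)  # ['great', 'solo', 'deleted', 'removed']
```

If the real data behaves the same way, the punctuation and newlines were never features in the first place, and the feature count would stay unchanged as long as the words "deleted" and "removed" also occur in ordinary post text.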

### Data Update: removed urls


In [4]:
jam = pd.read_csv("./jam2.csv")

X = jam["text"]
y = jam["subreddit"]

X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    random_state = 42,
                                                    stratify = y)

cv = CountVectorizer()
X_train_cv = cv.fit_transform(X_train)
X_test_cv = cv.transform(X_test)

lr = LogisticRegression()
lr.fit(X_train_cv, y_train)

print(f"number of features: {len(cv.vocabulary_)}")  # get_feature_names() was removed in newer scikit-learn
print("")
print(f"""training accuracy score: {lr.score(X_train_cv, y_train)}
 testing accuracy score: {lr.score(X_test_cv, y_test)}""")

number of features: 20061

training accuracy score: 0.9887859022771485
 testing accuracy score: 0.8850377487989018


Interestingly, the text cleaning steps above barely changed the performance. Removing the URLs cut the vocabulary from 23133 to 20061 features, yet it lowered both accuracy scores slightly (testing accuracy fell from about 0.893 to 0.885). Moving forward, I'll focus on lemmatizing/stemming and tuning hyperparameters. I'll also try other types of models.
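
The cleaning code that produced `jam2.csv` is not part of this notebook. As a rough sketch of what URL removal might look like, assuming a simple regular expression is good enough (the pattern and the `strip_urls` name are illustrative, not the ones actually used):

```python
import re

# Assumed pattern: anything starting with http(s):// or www. up to the next whitespace.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")

def strip_urls(text):
    """Replace URL-like substrings with a space, then collapse repeated whitespace."""
    without_urls = URL_PATTERN.sub(" ", text)
    return " ".join(without_urls.split())

cleaned = strip_urls("check this out https://example.com/jam?id=1 so good")
print(cleaned)  # check this out so good
```

Because the default tokenizer splits a URL into several word-like pieces (domain parts, path segments, query keys), dropping URLs can remove many features at once, which would be consistent with the drop in feature count seen above.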

### Data Update: Lemmatized Text

In [5]:
jam = pd.read_csv("./jam3.csv")

X = jam["text"]
y = jam["subreddit"]

X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    random_state = 42,
                                                    stratify = y)

cv = CountVectorizer()
X_train_cv = cv.fit_transform(X_train)
X_test_cv = cv.transform(X_test)

lr = LogisticRegression()
lr.fit(X_train_cv, y_train)

print(f"number of features: {len(cv.vocabulary_)}")  # get_feature_names() was removed in newer scikit-learn
print("")
print(f"""training accuracy score: {lr.score(X_train_cv, y_train)}
 testing accuracy score: {lr.score(X_test_cv, y_test)}""")

number of features: 17960

training accuracy score: 0.9878676891381482
 testing accuracy score: 0.8887744593202883


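Lemmatization boosted the testing accuracy very slightly relative to the URL-stripped data (from about 0.885 to 0.889), although it is still just below the 0.893 of the initial model.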

## Model Comparison and Hyperparameter Tuning

For the next models, I'll use pipelines and grid searching in order to find an optimal model. I'll try using the following models:
- Logistic Regression
- Naive Bayes
- Support Vector Machines
- Random Forests
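
Before running a grid search it can help to know how many model fits it implies. The sketch below counts the candidates for a grid shaped like the logistic regression grid used in the next cell, using scikit-learn's `ParameterGrid`; the variable names `n_candidates`, `n_folds`, and `n_fits` are only for illustration.

```python
from sklearn.model_selection import ParameterGrid

# Same shape as the CountVectorizer + LogisticRegression grid in this notebook.
pipe_params = {
    "cvec__max_features": [None, 1_000, 2_000, 5_000, 10_000],
    "cvec__ngram_range": [(1, 1), (1, 2)],
    "cvec__stop_words": [None, "english"],
    "lr__C": [1, 10, 100, 0.1, 0.01],
}

n_candidates = len(ParameterGrid(pipe_params))  # 5 * 2 * 2 * 5 combinations
n_folds = 5                                     # matches cv=5
n_fits = n_candidates * n_folds + 1             # +1 for the refit on the full training set

print(n_candidates, n_fits)  # 100 501
```

Every extra hyperparameter value multiplies this count, which is one reason to keep the grids for slower models small.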

### Logistic Regression with Count Vectorizer

In [6]:
jam = pd.read_csv("./jam3.csv")

X = jam["text"]
y = jam["subreddit"]

X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    random_state = 42,
                                                    stratify = y)

pipe_cv_lr = Pipeline([
    ("cvec", CountVectorizer()),
    ("lr",   LogisticRegression())
])

pipe_params = {
    "cvec__max_features" : [None, 1_000, 2_000, 5_000, 10_000],
    "cvec__ngram_range"  : [(1,1), (1,2)],
    "cvec__stop_words"   : [None, "english"],
    "lr__C"              : [1, 10, 100, 0.1, 0.01]
}

gs_cv_lr = GridSearchCV(pipe_cv_lr, param_grid=pipe_params, cv=5)
gs_cv_lr.fit(X_train, y_train)

GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=Pipeline(memory=None,
     steps=[('cvec', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False))]),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'cvec__max_features': [None, 1000, 2000, 5000, 10000], 'cvec__ngram_range': [(1, 1), (1, 2)], 'cvec__stop_words': [None, 'english'], 'lr__C': [1, 10, 100, 0.1, 0.01]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [7]:
gs_cv_lr.best_params_

{'cvec__max_features': None,
 'cvec__ngram_range': (1, 2),
 'cvec__stop_words': 'english',
 'lr__C': 1}

In [8]:
gs_cv_lr.score(X_train, y_train)

0.9986265308458281

In [9]:
gs_cv_lr.score(X_test, y_test)

0.8966700995537247

This model is even more overfit than before, with training accuracy of about 0.999 against testing accuracy of about 0.897. That makes sense, given that the grid search chose not to limit the number of features and to include all two-word sequences.
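
To illustrate how quickly two-word sequences inflate the feature space, here is a toy comparison on an invented two-sentence corpus (not the project data):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, made up for illustration.
docs = ["the band played jazz", "the band played rock"]

unigrams_only = CountVectorizer(ngram_range=(1, 1)).fit(docs)
with_bigrams = CountVectorizer(ngram_range=(1, 2)).fit(docs)

n_unigram_features = len(unigrams_only.vocabulary_)
n_bigram_features = len(with_bigrams.vocabulary_)

# 5 single words; adding the 4 distinct word pairs gives 9 features.
print(n_unigram_features, n_bigram_features)  # 5 9
```

Even in this tiny case the vocabulary nearly doubles, and on real text most bigrams are rare, which gives the model many features it can use to memorize individual training posts.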

### Naive Bayes with Count Vectorizer

In [10]:
from sklearn.naive_bayes import MultinomialNB

In [11]:
jam = pd.read_csv("./jam3.csv")

X = jam["text"]
y = jam["subreddit"]

X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    random_state = 42,
                                                    stratify = y)

pipe_cv_nb = Pipeline([
    ("cvec", CountVectorizer()),
    ("nb",   MultinomialNB())
])

pipe_params = {
    "cvec__max_features" : [None, 1_000, 2_000, 5_000, 10_000],
    "cvec__ngram_range"  : [(1,1), (1,2)],
    "cvec__stop_words"   : [None, "english"],
    "nb__alpha"          : [0, 1, 0.1, 0.5, 5]
}

gs_cv_nb = GridSearchCV(pipe_cv_nb, param_grid=pipe_params, cv=5)
gs_cv_nb.fit(X_train, y_train)

GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=Pipeline(memory=None,
     steps=[('cvec', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)), ('nb', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))]),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'cvec__max_features': [None, 1000, 2000, 5000, 10000], 'cvec__ngram_range': [(1, 1), (1, 2)], 'cvec__stop_words': [None, 'english'], 'nb__alpha': [0, 1, 0.1, 0.5, 5]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [12]:
gs_cv_nb.best_params_

{'cvec__max_features': None,
 'cvec__ngram_range': (1, 2),
 'cvec__stop_words': 'english',
 'nb__alpha': 1}

In [13]:
gs_cv_nb.score(X_train, y_train)

0.9898134371065583

In [14]:
gs_cv_nb.score(X_test, y_test)

0.9028492962581531

These scores are close to what I've been getting with every model so far. The testing score improved very slightly, to about 0.903.

### Logistic Regression with TF-IDF

In [15]:
jam = pd.read_csv("./jam3.csv")

X = jam["text"]
y = jam["subreddit"]

X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    random_state = 42,
                                                    stratify = y)

pipe_tv_lr = Pipeline([
    ("tvec", TfidfVectorizer()),
    ("lr",   LogisticRegression())
])

pipe_params = {
    "tvec__max_features" : [None, 1_000, 2_000, 5_000, 10_000],
    "tvec__ngram_range"  : [(1,1), (1,2)],
    "tvec__stop_words"   : [None, "english"],
    "lr__C"              : [1, 10, 100, 0.1, 0.01]
}

gs_tv_lr = GridSearchCV(pipe_tv_lr, param_grid=pipe_params, cv=5)
gs_tv_lr.fit(X_train, y_train)

GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=Pipeline(memory=None,
     steps=[('tvec', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
...penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False))]),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'tvec__max_features': [None, 1000, 2000, 5000, 10000], 'tvec__ngram_range': [(1, 1), (1, 2)], 'tvec__stop_words': [None, 'english'], 'lr__C': [1, 10, 100, 0.1, 0.01]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [16]:
gs_tv_lr.best_params_

{'lr__C': 10,
 'tvec__max_features': None,
 'tvec__ngram_range': (1, 2),
 'tvec__stop_words': 'english'}

In [17]:
gs_tv_lr.score(X_train, y_train)

0.9991988096600664

In [18]:
gs_tv_lr.score(X_test, y_test)

0.9035358736697563

Again, the model is very overfit from including all features and two-word sequences.

### Naive Bayes with TF-IDF

In [19]:
jam = pd.read_csv("./jam3.csv")

X = jam["text"]
y = jam["subreddit"]

X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    random_state = 42,
                                                    stratify = y)

pipe_tv_nb = Pipeline([
    ("tvec", TfidfVectorizer()),
    ("nb",   MultinomialNB())
])

pipe_params = {
    "tvec__max_features" : [None, 1_000, 2_000, 5_000, 10_000],
    "tvec__ngram_range"  : [(1,1), (1,2)],
    "tvec__stop_words"   : [None, "english"],
    "nb__alpha"          : [0, 1, 0.1, 0.5, 5]
}

gs_tv_nb = GridSearchCV(pipe_tv_nb, param_grid=pipe_params, cv=5)
gs_tv_nb.fit(X_train, y_train)

GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=Pipeline(memory=None,
     steps=[('tvec', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
...True,
        vocabulary=None)), ('nb', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))]),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'tvec__max_features': [None, 1000, 2000, 5000, 10000], 'tvec__ngram_range': [(1, 1), (1, 2)], 'tvec__stop_words': [None, 'english'], 'nb__alpha': [0, 1, 0.1, 0.5, 5]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [23]:
gs_tv_nb.best_params_

{'nb__alpha': 0.5,
 'tvec__max_features': None,
 'tvec__ngram_range': (1, 2),
 'tvec__stop_words': 'english'}

In [24]:
gs_tv_nb.score(X_train, y_train)

0.9947350349090077

In [25]:
gs_tv_nb.score(X_test, y_test)

0.8973566769653278

There is only a small difference between the models that use count vectorizer and the models that use TF-IDF.
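
One reason the two vectorizers behave similarly is that they build the same vocabulary: TF-IDF only reweights and normalizes the counts. As a small check on an invented corpus, `TfidfVectorizer` with default settings should match `CountVectorizer` followed by `TfidfTransformer`:

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Toy corpus, made up for illustration.
docs = [
    "jazz guitar solo",
    "blues guitar jam",
    "jam band jam session",
]

# Two-step route: raw counts first, then TF-IDF weighting.
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# One-step route.
one_step = TfidfVectorizer().fit_transform(docs)

same_matrix = np.allclose(two_step.toarray(), one_step.toarray())
print(same_matrix)
```

Since the set of features is identical, any difference between the count-based and TF-IDF-based models comes only from how each term is weighted.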

### Support Vector Machine with Count Vectorizer

Next, I'll try support vector machine and random forest models.

In [26]:
from sklearn import svm

In [28]:
jam = pd.read_csv("./jam3.csv")

X = jam["text"]
y = jam["subreddit"]

X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    random_state = 42,
                                                    stratify = y)

pipe_cv_sv = Pipeline([
    ("cvec", CountVectorizer()),
    ("svc", svm.SVC())
])

pipe_params = {
    "cvec__max_features" : [None, 1_000, 2_000, 5_000, 10_000],
    "cvec__ngram_range"  : [(1,1), (1,2)],
    "cvec__stop_words"   : [None, "english"],
    "svc__C"             : [1, 0.1, 0.5, 10],
    "svc__kernel"        : ["rbf", "poly"],
    "svc__gamma"         : ["auto", "scale"]
}

gs_cv_sv = GridSearchCV(pipe_cv_sv, param_grid=pipe_params, cv=5)
gs_cv_sv.fit(X_train, y_train)

GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=Pipeline(memory=None,
     steps=[('cvec', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...f', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False))]),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'cvec__max_features': [None, 1000, 2000, 5000, 10000], 'cvec__ngram_range': [(1, 1), (1, 2)], 'cvec__stop_words': [None, 'english'], 'svc__C': [1, 0.1, 0.5, 10], 'svc__kernel': ['rbf', 'poly'], 'svc__gamma': ['auto', 'scale']},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [29]:
gs_cv_sv.best_params_

{'cvec__max_features': 5000,
 'cvec__ngram_range': (1, 1),
 'cvec__stop_words': 'english',
 'svc__C': 10,
 'svc__gamma': 'scale',
 'svc__kernel': 'rbf'}

In [30]:
gs_cv_sv.score(X_train, y_train)

0.9434588531532563

In [31]:
gs_cv_sv.score(X_test, y_test)

0.8918640576725025

With about the same performance on unseen data as the previous models, this model has substantially lower variance: training accuracy is about 0.943 rather than 0.99 or higher. This is possibly a result of limiting the features to 5000 single words.
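
For reference, `max_features` keeps only the terms with the highest total count across the corpus. A toy example on invented text:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, made up for illustration.
# Total counts: jam=4, session=2, tonight=1, guitar=1.
docs = ["jam jam jam session", "jam session tonight", "guitar"]

full = CountVectorizer().fit(docs)
capped = CountVectorizer(max_features=2).fit(docs)

all_terms = sorted(full.vocabulary_)
kept_terms = sorted(capped.vocabulary_)

print(all_terms)   # ['guitar', 'jam', 'session', 'tonight']
print(kept_terms)  # ['jam', 'session']
```

Capping the vocabulary this way discards rare terms, which are the ones a model is most likely to overfit on.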

### Random Forests with Count Vectorizer

In [33]:
from sklearn.ensemble import RandomForestClassifier

In [35]:
jam = pd.read_csv("./jam3.csv")

X = jam["text"]
y = jam["subreddit"]

X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    random_state = 42,
                                                    stratify = y)

pipe_cv_rf = Pipeline([
    ("cvec", CountVectorizer()),
    ("rf", RandomForestClassifier())
])

pipe_params = {
    "cvec__max_features"    : [None, 1_000, 2_000, 5_000, 10_000],
    "cvec__ngram_range"     : [(1,1), (1,2)],
    "cvec__stop_words"      : [None, "english"],
    "rf__n_estimators"      : [75, 100, 125, 150],
    "rf__max_depth"         : [3, 5, 8, 15],
    "rf__min_samples_split" : [3, 7, 13]
}

gs_cv_rf = GridSearchCV(pipe_cv_rf, param_grid=pipe_params, cv=5)
gs_cv_rf.fit(X_train, y_train)

GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=Pipeline(memory=None,
     steps=[('cvec', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...obs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False))]),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'cvec__max_features': [None, 1000, 2000, 5000, 10000], 'cvec__ngram_range': [(1, 1), (1, 2)], 'cvec__stop_words': [None, 'english'], 'rf__n_estimators': [75, 100, 125, 150], 'rf__max_depth': [3, 5, 8, 15], 'rf__min_samples_split': [3, 7, 13]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [36]:
gs_cv_rf.best_params_

{'cvec__max_features': None,
 'cvec__ngram_range': (1, 2),
 'cvec__stop_words': 'english',
 'rf__max_depth': 15,
 'rf__min_samples_split': 13,
 'rf__n_estimators': 125}

In [37]:
gs_cv_rf.score(X_train, y_train)

0.9011102208996223

In [38]:
gs_cv_rf.score(X_test, y_test)

0.8637143837967731

This model also has lower variance than the first few models, but it sacrifices substantial accuracy overall compared to the SVM model.

### Support Vector Machine with TF-IDF

In [40]:
jam = pd.read_csv("./jam3.csv")

X = jam["text"]
y = jam["subreddit"]

X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    random_state = 42,
                                                    stratify = y)

pipe_tv_sv = Pipeline([
    ("tvec", TfidfVectorizer()),
    ("svc", svm.SVC())
])

pipe_params = {
    "tvec__max_features"    : [None, 5_000],
    "tvec__ngram_range"     : [(1,1), (1,2)],
    "tvec__stop_words"      : [None, "english"],
    "svc__C"                : [1, 0.1, 10],
    "svc__kernel"           : ["rbf"],
    "svc__gamma"            : ["auto", "scale"]
}

gs_tv_sv = GridSearchCV(pipe_tv_sv, param_grid=pipe_params, cv=5)
gs_tv_sv.fit(X_train, y_train)

GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=Pipeline(memory=None,
     steps=[('tvec', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
...f', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False))]),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'tvec__max_features': [None, 5000], 'tvec__ngram_range': [(1, 1), (1, 2)], 'tvec__stop_words': [None, 'english'], 'svc__C': [1, 0.1, 10], 'svc__kernel': ['rbf'], 'svc__gamma': ['auto', 'scale']},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [41]:
gs_tv_sv.best_params_

{'svc__C': 10,
 'svc__gamma': 'scale',
 'svc__kernel': 'rbf',
 'tvec__max_features': 5000,
 'tvec__ngram_range': (1, 1),
 'tvec__stop_words': 'english'}

In [42]:
gs_tv_sv.score(X_train, y_train)

0.9442600434931899

In [43]:
gs_tv_sv.score(X_test, y_test)

0.8959835221421215

This model has a slight performance boost over the SVM/CountVectorizer model.

### Random Forests with TF-IDF

In [48]:
jam = pd.read_csv("./jam3.csv")

X = jam["text"]
y = jam["subreddit"]

X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    random_state = 42,
                                                    stratify = y)

pipe_tv_rf = Pipeline([
    ("tvec", TfidfVectorizer()),
    ("rf", RandomForestClassifier())
])

pipe_params = {
    "tvec__max_features"    : [None, 5_000],
    "tvec__ngram_range"     : [(1,1), (1,2)],
    "tvec__stop_words"      : [None, "english"],
    "rf__n_estimators"      : [125],
    "rf__max_depth"         : [25, 35],
    "rf__min_samples_split" : [17, 23]
}

gs_tv_rf = GridSearchCV(pipe_tv_rf, param_grid=pipe_params, cv=5)
gs_tv_rf.fit(X_train, y_train)

GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=Pipeline(memory=None,
     steps=[('tvec', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
...obs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False))]),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'tvec__max_features': [None, 5000], 'tvec__ngram_range': [(1, 1), (1, 2)], 'tvec__stop_words': [None, 'english'], 'rf__n_estimators': [125], 'rf__max_depth': [25, 35], 'rf__min_samples_split': [17, 23]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [49]:
gs_tv_rf.best_params_

{'rf__max_depth': 25,
 'rf__min_samples_split': 17,
 'rf__n_estimators': 125,
 'tvec__max_features': None,
 'tvec__ngram_range': (1, 1),
 'tvec__stop_words': 'english'}

In [50]:
gs_tv_rf.score(X_train, y_train)

0.9298386173743848

In [51]:
gs_tv_rf.score(X_test, y_test)

0.8805355303810505

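This model has about the same level of overfitting as the SVM models, but it has higher bias, with lower accuracy on both the training and the testing data. It seems that the SVM model with TF-IDF vectorization will be the best one to use for this analysis, since it combines a comparatively small gap between training and testing accuracy with testing accuracy close to that of the best models.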
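
To make the comparison easier to read, the grid search scores reported in this notebook can be collected in one place. The sketch below hard-codes those scores, rounded to four decimals, and sorts the models by the gap between training and testing accuracy, used here as a rough proxy for overfitting (the labels and the `gaps` name are just for illustration):

```python
# (training accuracy, testing accuracy) copied from the outputs above, rounded.
scores = {
    "LR + CountVectorizer":  (0.9986, 0.8967),
    "NB + CountVectorizer":  (0.9898, 0.9028),
    "LR + TF-IDF":           (0.9992, 0.9035),
    "NB + TF-IDF":           (0.9947, 0.8974),
    "SVM + CountVectorizer": (0.9435, 0.8919),
    "RF + CountVectorizer":  (0.9011, 0.8637),
    "SVM + TF-IDF":          (0.9443, 0.8960),
    "RF + TF-IDF":           (0.9298, 0.8805),
}

# Train/test gap per model.
gaps = {name: round(train - test, 4) for name, (train, test) in scores.items()}

# Print models from smallest to largest gap.
for name in sorted(gaps, key=gaps.get):
    train, test = scores[name]
    print(f"{name:<22} train={train:.4f}  test={test:.4f}  gap={gaps[name]:.4f}")
```

On these numbers, the SVM and random forest models show roughly half the gap of the logistic regression and naive Bayes models, and among them the SVM with TF-IDF has the highest testing accuracy.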