## Final Logistic Regression Model (Best Fit)

Imports, data loading, and `train_test_split`:

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.linear_model import LogisticRegression

In [2]:
df = pd.read_csv('../datasets/main_final.csv')[['comp_stem', 'subreddit']].reset_index(drop=True)

X = df['comp_stem']
y = df['subreddit']

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify = y)

After the initial round of modeling, a last pass of fine-tuning nudged accuracy slightly higher. Realistically, both this model and the final random forest model hit an accuracy ceiling between `0.83` and `0.84`. Logistic regression still edges ahead, though future exploration (once the looming project deadline is a thing of the past) might yield further gains. The final hyperparameters are as follows:

`TfidfVectorizer()`:
- `min_df` = `3`
- `max_df` = `0.9`
- `ngram_range` = `(1, 2)`

`LogisticRegression()`:
- `penalty` = `l2`
- `C` = `3`
- `solver` = `liblinear`
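The settings listed above could also be bundled into a single scikit-learn `Pipeline`, so that the vectorizer is refit inside each cross-validation fold rather than once up front. The following is only a minimal sketch of that idea, not the approach this notebook takes: `build_pipeline` is a hypothetical helper, and the toy corpus and labels are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


def build_pipeline(min_df=3, max_df=0.9):
    """Hypothetical helper: TF-IDF features feeding a logistic regression."""
    return Pipeline([
        ("tvec", TfidfVectorizer(min_df=min_df, max_df=max_df, ngram_range=(1, 2))),
        # The l2 penalty is scikit-learn's default, so it is not passed explicitly.
        ("logreg", LogisticRegression(C=3, solver="liblinear")),
    ])


# Toy corpus (an assumption for illustration; not the project's data).
docs = [
    "cat purr meow", "cat meow nap", "cat purr nap",
    "dog bark fetch", "dog fetch walk", "dog bark walk",
]
labels = ["cats", "cats", "cats", "dogs", "dogs", "dogs"]

pipe = build_pipeline()
pipe.fit(docs, labels)
predictions = pipe.predict(["cat meow", "dog bark"])
```

With a pipeline like this, grid-search keys would need step prefixes such as `logreg__C` or `tvec__min_df`.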

In [3]:
tvec = TfidfVectorizer(min_df = 3, max_df = 0.9, ngram_range = (1,2))
tvec.fit(X_train)
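# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() (1.0+) replaces it
X_train_tv = pd.DataFrame(tvec.transform(X_train).toarray(), columns = tvec.get_feature_names_out())
X_test_tv = pd.DataFrame(tvec.transform(X_test).toarray(), columns = tvec.get_feature_names_out())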

In [4]:
logreg = LogisticRegression()

pipe_params = {
    'penalty' : ['l2'],
    'C' : [3],
    'solver' : ['liblinear']
}

In [5]:
gs = GridSearchCV(logreg, param_grid = pipe_params, cv = 5)

In [7]:
gs.fit(X_train_tv, y_train)

GridSearchCV(cv=5, estimator=LogisticRegression(),
             param_grid={'C': [3], 'penalty': ['l2'], 'solver': ['liblinear']})

Based on previous modeling, the `0.838` cross-validation score below seems more trustworthy than the unusually high `0.855` test score, which comes from a single split. Either way, this model takes the blue ribbon, despite lingering overfitting: the training score is `0.976`.

In [8]:
gs.best_score_

0.838835399391993

In [10]:
gs.score(X_train_tv, y_train)

0.9755206025697829

In [11]:
gs.score(X_test_tv, y_test)

0.8551495016611296
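One way to judge how far a single test score can be trusted is to look at how much the individual cross-validation folds vary. The sketch below pulls the per-fold scores out of `cv_results_` on a fitted `GridSearchCV`. It uses synthetic data from `make_classification` as a stand-in (an assumption, since the notebook's TF-IDF features are not reproduced here), and `gs_demo` is a hypothetical name.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption; the notebook fits on TF-IDF features).
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

gs_demo = GridSearchCV(
    LogisticRegression(solver="liblinear"),
    param_grid={"C": [3]},
    cv=5,
)
gs_demo.fit(X_demo, y_demo)

# Collect the held-out accuracy of each of the 5 folds for the best candidate.
best = gs_demo.best_index_
fold_scores = np.array(
    [gs_demo.cv_results_[f"split{i}_test_score"][best] for i in range(5)]
)
fold_mean = fold_scores.mean()  # same value as gs_demo.best_score_
fold_std = fold_scores.std()    # spread across folds
```

If a test score sits within roughly one or two `fold_std` of `fold_mean`, the gap is plausibly just split-to-split noise.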

In [12]:
new_df = y_test.to_frame()

In [13]:
new_df['predictions'] = gs.predict(X_test_tv)

The results are written to a separate dataset for use in the main notebook:

In [14]:
new_df.to_csv('../datasets/logreg.csv', index = False)
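A frame shaped like `new_df` (true labels in `subreddit`, model output in `predictions`) lends itself to a quick confusion table in the downstream notebook. The following is a minimal sketch on a hypothetical stand-in frame; the labels `sub_a` and `sub_b` are placeholders, not the actual subreddits.

```python
import pandas as pd

# Hypothetical stand-in for new_df; the label values are placeholders.
demo = pd.DataFrame({
    "subreddit":   ["sub_a", "sub_a", "sub_a", "sub_b", "sub_b", "sub_b"],
    "predictions": ["sub_a", "sub_a", "sub_b", "sub_b", "sub_b", "sub_a"],
})

# Rows are true labels, columns are predicted labels.
confusion = pd.crosstab(demo["subreddit"], demo["predictions"])

# Share of rows where the prediction matches the true label.
accuracy = (demo["subreddit"] == demo["predictions"]).mean()
```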