# Modeling: 1st Iteration of Logistic Regression
---
#### Import libraries and read data


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import pickle
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import StandardScaler, FunctionTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix, accuracy_score
import warnings
warnings.filterwarnings('ignore')
np.random.seed(42)

In [2]:
df = pd.read_csv('../data/explored_skincare.csv')

df.head(2)

Unnamed: 0,author,title,selftext,num_comments,score,subreddit,is_ab,title_word_count,title_char_count,selftext_word_count,selftext_char_count,combined_text
0,AutoModerator,"Anti-Haul Monthly April 23, 2020",Are you on a no buy? Trying to stick to a more...,0,1,asianbeauty,1,4,32,22,132,"Anti-Haul Monthly April 23, 2020 Are you on a ..."
1,lululi_lululi,After working with seasoned estheticians throu...,Through our own project ([Glowism] my friend a...,4,1,asianbeauty,1,24,164,611,3798,After working with seasoned estheticians throu...


## Assign $X$ and $y$ variables
---

In [6]:
features = [
    'combined_text',
    'num_comments',
    'score',
    'title_word_count',
    'title_char_count',
    'selftext_word_count',
    'selftext_char_count']

In [7]:
X = df[features]
y = df['is_ab']

## Train-test-split data
---

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                   random_state = 42,
                                                   stratify = y)

## Set and fit model
---
**The Model**

After running a grid search over different combinations of parameters, the first iteration of the Logistic Regression model performs best with the following settings:
- `penalty` of `l2` (ridge regularization),
- `ngram_range` of (1, 2), i.e. sequences of one or two words, when `CountVectorizer` transforms the text data into a document-term matrix, and
- `cv` of 5 folds.
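
To make the `ngram_range` setting concrete, here is a minimal pure-Python sketch of which word sequences a range of (1, 2) produces. The helper `word_ngrams` is a hypothetical name for illustration only; it is not how `CountVectorizer` is implemented, and it skips the lowercasing and tokenization that the vectorizer also performs.

```python
def word_ngrams(tokens, ngram_range=(1, 2)):
    """Return every word n-gram for n in the inclusive range, shortest first."""
    min_n, max_n = ngram_range
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens across the token list
        for start in range(len(tokens) - n + 1):
            grams.append(' '.join(tokens[start:start + n]))
    return grams


tokens = 'korean skincare routine'.split()
print(word_ngrams(tokens))
# ['korean', 'skincare', 'routine', 'korean skincare', 'skincare routine']
```

Each of these unigrams and bigrams becomes its own column in the matrix, which is why two-word features such as `asian beauty` can show up among the coefficients.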



##### Step 1: Define functions to grab each type of feature separately

In [9]:
numeric = FunctionTransformer(lambda x: x.drop(columns = 'combined_text'), validate = False)

category = FunctionTransformer(lambda x: x['combined_text'], validate = False)

# Riley Dallas and/or Daniel Kim

##### Step 2: Set a pipeline with transformers and an estimator

In [10]:
pipe = Pipeline([
    ('features', FeatureUnion([
            ('numeric_pipe', Pipeline([
                ('selector', numeric),
                ('ss', StandardScaler())
            ])),
            ('category_pipe', Pipeline([
                ('selector', category),
                ('cvec', CountVectorizer())
            ]))
    ])),
    ('logreg', LogisticRegression())
])

# Daniel Kim

##### Step 3: Set parameters for gridsearch

In [11]:
params = {
    'features__category_pipe__cvec__ngram_range' : [(1,2)],
    'logreg__penalty' : ['l2']}
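
As a rough check on how much work the grid search does, the sketch below counts the model fits implied by this grid. It assumes the 5 cross-validation folds mentioned earlier and scikit-learn's default of refitting the best candidate once on the full training set (`refit=True`); the variable names are mine.

```python
from math import prod

# Grid copied from the cell above
params = {
    'features__category_pipe__cvec__ngram_range': [(1, 2)],
    'logreg__penalty': ['l2'],
}
cv_folds = 5  # assumption: the 5 folds described earlier in this notebook

# One candidate per combination of parameter values
n_candidates = prod(len(values) for values in params.values())
# Each candidate is fit once per fold
n_cv_fits = n_candidates * cv_folds
# Plus one final refit of the best candidate on all training data
n_total_fits = n_cv_fits + 1

print(n_candidates, n_cv_fits, n_total_fits)  # 1 5 6
```

With a single value per parameter, the grid holds only one candidate, so the search here mainly serves to cross-validate that one configuration.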

##### Step 4: Instantiate and fit gridsearch

In [12]:
gs = GridSearchCV(pipe, params, cv=5)

gs.fit(X_train, y_train)

GridSearchCV(cv=5, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('features',
                                        FeatureUnion(n_jobs=None,
                                                     transformer_list=[('numeric_pipe',
                                                                        Pipeline(memory=None,
                                                                                 steps=[('selector',
                                                                                         FunctionTransformer(accept_sparse=False,
                                                                                                             check_inverse=True,
                                                                                                             func=<function <lambda> at 0x1090dca70>,
                                                                                                             

## Evaluate model
---
**Baseline score**

The two classes are almost evenly balanced, so picking a class at random for any observation (or always predicting the majority class) is correct about 50% of the time. My model should at least beat this baseline.

**Accuracy Scores**

The model performed suspiciously well on the first try, with train and test accuracy scores above 90%, which suggests that certain keywords make the two subreddits easy to tell apart.

- Train Score: 0.996
- Test Score: 0.938

**Coefficient Weights of Features**

After investigating the coefficient weights of the top 50 most distinguishing features, I found that there are indeed keywords specific to the AsianBeauty subreddit. For example, acronyms of the subreddit (AB or ABers), Asian beauty brand names, Asian beauty stores (Jolse and Yesstyle), and Asian country names (Japan and Korea) are mentioned far more often in AsianBeauty than in SkincareAddiction.

Reflecting on the goal of the project, I want the chatroom to accurately recommend the subreddits where people share the same questions and concerns. I'm confident that if our (chatroom) user knows about the AsianBeauty subreddit or has questions about the 10-step *Korean* skincare routine, they already know where to go on reddit.com. Hence, in my next iteration of the model, I will add acronyms of the subreddit, Asian beauty stores, and Asian country names to a custom stopwords list, along with other noise and filler words. This should help me better understand the types of beauty and skincare problems that are more often discussed in the AsianBeauty subreddit.

However, I am keeping brand names in my model even though they are big tells. Considering the use case of the chatroom, a user might have a question about a brand they are not familiar with, without knowing the brand's origin (whether it is Asian or not). In this case, keeping these keywords in the model helps recommend the right subreddit to them.
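
The stopword plan above can be sketched in plain Python. The groupings and the helper `drop_stopwords` are hypothetical and only for illustration; the terms come from this notebook's feature list, and brand names such as `cosrx` are deliberately left out of the stopword set so that they survive.

```python
# Hypothetical groupings for illustration; the real list is built in the cleaning notebook
subreddit_acronyms = {'ab', 'abers'}
stores = {'jolse', 'yesstyle'}
places = {'asia', 'asian', 'japan', 'japanese', 'korea', 'korean'}

# Brand names are intentionally NOT included
custom_stopwords = subreddit_acronyms | stores | places


def drop_stopwords(tokens, stopwords):
    """Keep only the tokens whose lowercase form is not a stopword."""
    return [token for token in tokens if token.lower() not in stopwords]


tokens = 'Is the Cosrx snail essence from Yesstyle a Korean AB staple'.split()
print(drop_stopwords(tokens, custom_stopwords))
# ['Is', 'the', 'Cosrx', 'snail', 'essence', 'from', 'a', 'staple']
```

In the actual pipeline, a list like `sorted(custom_stopwords)` could instead be passed to `CountVectorizer` through its `stop_words` parameter, so the filtering happens during vectorization.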

### Baseline score
Because the classes are balanced, randomly predicting a class for any observation is correct about 50% of the time.

In [14]:
y.value_counts(normalize=True)

1    0.500269
0    0.499731
Name: is_ab, dtype: float64
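
The arithmetic behind the baseline can be spelled out with the class shares printed above. This is only a sketch with my own variable names: it compares always predicting the majority class with guessing each class with probability 0.5.

```python
# Class shares copied from the value_counts output above
class_share = {1: 0.500269, 0: 0.499731}

# Strategy 1: always predict the most frequent class
majority_class = max(class_share, key=class_share.get)
majority_accuracy = class_share[majority_class]

# Strategy 2: guess each class with probability 0.5, independent of the true label
random_guess_accuracy = sum(0.5 * share for share in class_share.values())

print(majority_class, majority_accuracy, random_guess_accuracy)
```

Both strategies land at roughly 0.50, so that is the accuracy the model has to beat.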

### Accuracy scores

In [15]:
print(f'Train Score: {gs.score(X_train, y_train)}')
print(f'Test Score: {gs.score(X_test, y_test)}')

Train Score: 0.9998802753666567
Test Score: 0.9455916681630454


### Coefficient weights

In [16]:
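# One coefficient per column of the FeatureUnion output
coefficients = gs.best_estimator_.named_steps['logreg'].coef_[0]

# FeatureUnion stacks the numeric columns first, then the CountVectorizer vocabulary
# Note: on scikit-learn >= 1.0 use get_feature_names_out(); get_feature_names() was removed in 1.2
feature_names = ['num_comments','score','title_word_count','title_char_count','selftext_word_count','selftext_char_count'] + \
gs.best_estimator_.named_steps['features'].transformer_list[1][1].named_steps['cvec'].get_feature_names()

# exp(coef) is the multiplicative change in the odds of is_ab = 1 per unit increase in a feature
coef_df = pd.DataFrame({'features': feature_names, 
              'coef' : coefficients,
              'exp_coef': np.exp(coefficients)
             })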

# Daniel Kim

In [17]:
coef_df.sort_values('exp_coef', ascending=False).head(50)

# Daniel Kim

Unnamed: 0,features,coef,exp_coef
1,score,4.179748,65.349351
144145,discussion,3.735036,41.889519
13137,ab,3.046533,21.042258
55787,asian,1.623538,5.071002
408414,removed,1.320767,3.746293
129413,cushion,1.008709,2.742058
67929,beauty,0.966811,2.629546
268647,korea,0.913609,2.493305
162158,essence,0.902441,2.465614
261078,japanese,0.867564,2.381104
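
To read the `exp_coef` column: exponentiating a logistic regression coefficient gives the factor by which the odds of `is_ab = 1` are multiplied when that feature increases by one unit, holding the other features fixed. The sketch below recomputes three rows of the table with the standard library; the coefficient values are copied from the output above.

```python
import math

# Coefficients copied from the table above
coefs = {'score': 4.179748, 'discussion': 3.735036, 'ab': 3.046533}

# exp(coef) = odds ratio for a one-unit increase in the feature
odds_ratios = {name: math.exp(coef) for name, coef in coefs.items()}

for name, ratio in odds_ratios.items():
    print(f'{name}: odds multiplied by about {ratio:.2f}')
```

Note that "one unit" differs by feature type. The numeric features pass through `StandardScaler`, so for `score` one unit is one standard deviation, while for the word counts one unit is one additional occurrence of the term in a post.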


## Pickle Top 50 Features to create custom stopwords in next model iteration
---
These sorted features feed the list of custom stopwords that I'm building in the cleaning notebook (002_clean.ipynb), so that the list can be reused in other notebooks, such as EDA.

In [20]:
sorted_coef_features = list(coef_df.sort_values('exp_coef', ascending=False)['features'].values)

file_name = '../assets/sorted_features_lg_model1.pkl'

# Use a context manager so the file handle is closed after writing
with open(file_name, 'wb') as f:
    pickle.dump(sorted_coef_features, f)

['score',
 'discussion',
 'ab',
 'asian',
 'removed',
 'cushion',
 'beauty',
 'korea',
 'essence',
 'japanese',
 'yesstyle',
 'do you',
 'actives',
 'innisfree',
 'sheet',
 'num_comments',
 'hg',
 'korean',
 'bb',
 'cosrx',
 'fluff',
 'http',
 'masks',
 'shop',
 'memebox',
 'edit',
 'product product',
 'post accutane',
 'jolse',
 'shiseido',
 'help removed',
 'accutane routine',
 'abers',
 'post',
 'mizon',
 'snail',
 'missha',
 'ampoule',
 'nice',
 'care routine',
 'your',
 'biore',
 'pack',
 'favourite',
 'ebay',
 'where',
 'power',
 'items',
 'in japan',
 'eyeliner',
 'have you',
 'atomy',
 'news',
 'japan',
 'double',
 'from amazon',
 'your routine',
 'western',
 'www',
 'asia',
 'lenses',
 'with an',
 'perfect',
 'img',
 'deals',
 'rice',
 'to look',
 'asian beauty',
 'lightening',
 'prone',
 'hadalabo',
 '10 step',
 'dewy',
 'usually',
 'shipping',
 'cool',
 'products',
 'labo',
 'ab products',
 'sulwhasoo',
 'brand',
 'in korea',
 'asian skincare',
 'eyes',
 'milk',
 'the sun',
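
A minimal sketch of how such a pickled list might be written and read back in another notebook, keeping only the top entries. It uses a temporary directory and a hypothetical file name rather than the real `../assets` path, and a cutoff of 3 instead of 50 to keep the example short.

```python
import os
import pickle
import tempfile

# First five names from the sorted list above
sorted_features = ['score', 'discussion', 'ab', 'asian', 'removed']
top_n = 3  # the notebook's heading uses 50; 3 keeps this example small

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name, for illustration only
    path = os.path.join(tmp_dir, 'sorted_features_example.pkl')

    # Write only the top-n features
    with open(path, 'wb') as f:
        pickle.dump(sorted_features[:top_n], f)

    # Read them back, as another notebook would
    with open(path, 'rb') as f:
        loaded = pickle.load(f)

print(loaded)  # ['score', 'discussion', 'ab']
```

The list keeps its order through the round trip, so the most distinguishing features stay at the front when the cleaning notebook loads it.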
