# Logistic Regression Modeling

In [1]:
#Import necessary libraries 

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

from nltk.corpus import stopwords

#For streamlit app
import pickle

In [2]:
#import the dataset
friends = pd.read_csv('../../Datasets/friends-modeling.csv')
friends.head()

Unnamed: 0,season,episode,character,dialogue
0,s01,e01,Monica Geller,There's nothing to tell! He's just some guy I ...
1,s01,e01,Joey Tribbiani,"C'mon, you're going out with the guy! There's ..."
2,s01,e01,Chandler Bing,"All right Joey, be nice. So does he have a hum..."
3,s01,e01,Phoebe Buffay,"Wait, does he eat chalk?"
4,s01,e01,Phoebe Buffay,"Just, 'cause, I don't want her to go through w..."


In [3]:
#Make sure no nulls
friends.isnull().sum()

season       0
episode      0
character    0
dialogue     0
dtype: int64

### Make X and y values 

In [4]:
X = friends['dialogue']
y = friends['character']

### Split into Train and Test

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify=y)
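Passing `stratify=y` asks `train_test_split` to keep the class proportions the same in both splits. A minimal sketch on made-up data (the `toy_` names and the 32/8 class mix are illustrative assumptions, not the Friends dataset):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 40 lines, 80% from speaker "A" and 20% from "B".
toy_X = pd.Series([f"line {i}" for i in range(40)])
toy_y = pd.Series(["A"] * 32 + ["B"] * 8)

# Same call pattern as the notebook; the default test size is 25%.
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, random_state=42, stratify=toy_y
)

# Both splits keep the 80/20 mix of the full label series.
print(toy_y_train.value_counts(normalize=True))
print(toy_y_test.value_counts(normalize=True))
```

Without stratification, a small or unlucky split could shift these shares, which would make the baseline in the next section harder to interpret.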

### Baseline Accuracy 

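Baseline to beat when making the models: always predicting the most frequent character (Rachel Green) would be correct about 17.8% of the time.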

In [6]:
y_train.value_counts(normalize=True)

Rachel Green      0.177657
Ross Geller       0.177152
Chandler Bing     0.169355
Monica Geller     0.167242
Joey Tribbiani    0.160725
Phoebe Buffay     0.147869
Name: character, dtype: float64
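The baseline is the top entry of the normalized value counts, since that is the accuracy of a model that always guesses the most frequent class. A small sketch of the calculation on a made-up label series (the toy labels are assumptions for illustration only):

```python
import pandas as pd

# Hypothetical toy labels, used only to illustrate the calculation.
toy_labels = pd.Series(["Rachel", "Ross", "Rachel", "Joey", "Rachel", "Ross"])

# Share of each class; value_counts sorts from most to least frequent.
class_shares = toy_labels.value_counts(normalize=True)

# The majority-class share is the accuracy of always predicting that class.
baseline_class = class_shares.index[0]
baseline_accuracy = class_shares.iloc[0]

print(baseline_class, baseline_accuracy)
```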

### Instantiating Count Vectorizer, Fit and Transform

In [7]:
cv = CountVectorizer()

Fit the vectorizer on the training data only, then use it to transform both X_train and X_test.

In [8]:
cv.fit(X_train)

CountVectorizer()

In [9]:
X_train_cv = cv.transform(X_train)
X_test_cv = cv.transform(X_test)
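Fitting on the training text alone means the vocabulary, and therefore the columns of both matrices, comes only from training data. A minimal sketch with made-up lines (the `toy_` names are illustrative assumptions):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy lines, for illustration only.
toy_train = ["we were on a break", "how you doin"]
toy_test = ["smelly cat on a break"]

toy_cv = CountVectorizer()
toy_cv.fit(toy_train)  # the vocabulary is learned from the training text only

toy_train_counts = toy_cv.transform(toy_train)
toy_test_counts = toy_cv.transform(toy_test)

# Both matrices share the same columns, one per training-vocabulary word.
print(toy_train_counts.shape, toy_test_counts.shape)

# Words that never appeared in training ("smelly", "cat") are ignored.
print(sorted(toy_cv.vocabulary_))
```

Note that the default tokenizer also drops one-character tokens such as "a", so they never enter the vocabulary.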

### Modeling 

Instantiate a logistic regression model as an instance of the LogisticRegression class.

In [14]:
logreg = LogisticRegression(max_iter=1000, random_state=42)

In [15]:
#fitting on the training set -- need to pass in X_train_cv!
logreg.fit(X_train_cv, y_train)

LogisticRegression(max_iter=1000, random_state=42)

In [111]:
#Scoring on the training and testing sets to see if there is overfitting or underfitting.
print(f'Train score: {logreg.score(X_train_cv, y_train)} \nTest score: {logreg.score(X_test_cv, y_test)}')

Train score: 0.5459469110820141 
Test score: 0.3064631315836458



---
**Making Predictions**

In [17]:
#Making predictions using X_test_cv
preds_1 = logreg.predict(X_test_cv)

I created a dataframe consisting of the predicted results, actual results, and the dialogue.

In [18]:
df_params_1 = pd.DataFrame(y_test)
df_params_1['predictions'] = preds_1 
df_params_1['dialogue'] = X_test
df_params_1.rename(columns={'character': 'actual'}, inplace=True)
df_params_1.head(10)

Unnamed: 0,actual,predictions,dialogue
8260,Monica Geller,Ross Geller,Then what's the problem?
12970,Phoebe Buffay,Rachel Green,"Yeah, well, everybody does! I'm a really cool ..."
9682,Rachel Green,Rachel Green,What? What? He's interested in you. He-he like...
22017,Monica Geller,Monica Geller,I've never loved anybody as much as I love you.
5611,Rachel Green,Joey Tribbiani,And I'm in it? Then let me read it.
22331,Joey Tribbiani,Joey Tribbiani,"Yeah, I gotta go! I got an acting job. Like yo..."
18609,Monica Geller,Phoebe Buffay,Great. So the ball is in his court?
23737,Monica Geller,Monica Geller,"Dad, please don't pick your teeth out here! Al..."
35446,Ross Geller,Chandler Bing,"Excellent! Excellent, now-now do you want anot..."
3756,Monica Geller,Monica Geller,How are you?


In [76]:
df_params_1['predictions'].value_counts()

Rachel Green      2233
Monica Geller     2096
Chandler Bing     1937
Ross Geller       1925
Joey Tribbiani    1741
Phoebe Buffay     1270
Name: predictions, dtype: int64

In [77]:
df_params_1['actual'].value_counts()

Rachel Green      1991
Ross Geller       1985
Chandler Bing     1897
Monica Geller     1873
Joey Tribbiani    1800
Phoebe Buffay     1656
Name: actual, dtype: int64

In [19]:
#How many rows were misclassified?
df_params_1.loc[df_params_1['actual']!= df_params_1['predictions']].count()

actual         7769
predictions    7769
dialogue       7769
dtype: int64

In [20]:
#How many rows were accurately predicted?
df_params_1.loc[df_params_1['actual']== df_params_1['predictions']].count()

actual         3433
predictions    3433
dialogue       3433
dtype: int64

---
**Predicting Some Phrases**

In [106]:
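#logreg was fit on count vectors, so raw text has to go through cv.transform first
#logreg.predict(cv.transform(["How you doin'?"]))[0]

In [107]:
#logreg.predict(cv.transform(['Smelly cat, smelly cat, what are they feeding you']))[0]

In [108]:
#logreg.predict(cv.transform(['We were on a break!']))[0]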
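A model trained outside a pipeline only understands the numeric matrix it was fit on, so a new phrase has to be vectorized with the same fitted `CountVectorizer` before calling `predict`. A minimal sketch on made-up data (the `toy_` names and lines are assumptions for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data, for illustration only.
toy_lines = [
    "how you doin",
    "smelly cat smelly cat",
    "how you doin tonight",
    "smelly cat song",
]
toy_speakers = ["Joey", "Phoebe", "Joey", "Phoebe"]

toy_cv = CountVectorizer()
toy_model = LogisticRegression(max_iter=1000, random_state=42)
toy_model.fit(toy_cv.fit_transform(toy_lines), toy_speakers)

# Vectorize the raw phrase with the fitted vectorizer, then predict.
new_phrase = ["smelly cat"]
toy_prediction = toy_model.predict(toy_cv.transform(new_phrase))[0]
print(toy_prediction)
```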

### Looking at the Coefficients from Logistic Regression Model 

Logistic regression fits one set of coefficients per character, six in total. Each coefficient shows how strongly a word pushes a prediction toward (positive) or away from (negative) that character, so the largest coefficients point to the best predictor words. Here I will show the top 10 words for each character, starting with Joey Tribbiani. To get this information, I build a dataframe from the coefficients. As constructed, its rows are the characters and its columns are the words, so I transpose it to put the words on the index and the six characters in the columns.

In [33]:
df = pd.DataFrame(
    logreg.coef_,
    columns=cv.get_feature_names_out(),  # get_feature_names() was removed in scikit-learn 1.2
    index=logreg.classes_
)
df = df.T
df.head()

Unnamed: 0,Chandler Bing,Joey Tribbiani,Monica Geller,Phoebe Buffay,Rachel Green,Ross Geller
0,1.122053,0.79146,-0.717463,-0.622547,-0.690586,0.117083
0,-0.179013,0.898924,-0.466402,-0.50638,0.417975,-0.165104
7,0.066757,-0.371595,0.991696,-0.343806,-0.393975,0.050923
2,-0.03684,-0.017912,0.585175,-0.233608,-0.073132,-0.223683
3815,-0.059028,-0.013153,0.247717,-0.022275,-0.004807,-0.148453


**Next, I will sort the values by each character and get the top 10 words for each.**

In [36]:
df.sort_values(by='Joey Tribbiani', ascending=False).head(10)

Unnamed: 0,Chandler Bing,Joey Tribbiani,Monica Geller,Phoebe Buffay,Rachel Green,Ross Geller
uhhh,-0.411649,2.063666,-0.602201,-0.835778,-0.057266,-0.156773
dude,0.021437,1.912205,-1.261027,-0.873892,-0.711424,0.912701
director,-0.535808,1.841766,-0.613894,-0.357207,0.017902,-0.352759
tribbiani,-0.775518,1.840583,-0.395308,-0.031483,-0.142568,-0.495706
everest,-0.25136,1.746315,-0.354284,-0.507239,-0.264366,-0.369066
scene,-0.577302,1.637579,-0.421642,-0.368242,-0.013835,-0.256558
estelle,-0.747482,1.631693,-0.640872,1.054406,-0.618321,-0.679425
folks,-0.341342,1.608462,0.022183,-0.393247,-0.533365,-0.362691
sakes,-0.503067,1.583188,-0.448587,-0.138323,-0.07534,-0.417871
sexually,-0.315251,1.556689,-0.357091,0.138526,-0.430931,-0.591942


In [37]:
df.sort_values(by='Phoebe Buffay', ascending=False).head(10)

Unnamed: 0,Chandler Bing,Joey Tribbiani,Monica Geller,Phoebe Buffay,Rachel Green,Ross Geller
philange,-0.51794,-0.387428,-0.564585,2.282863,-0.158441,-0.654469
minsk,-0.372431,-0.398096,-0.469877,2.167199,-0.586963,-0.339833
psychic,-0.57593,-0.465273,-0.439646,1.990957,-0.557295,0.047188
yay,0.046586,-1.143319,-0.949635,1.965664,0.119856,-0.039152
buffay,0.012476,0.180503,-1.205571,1.870891,-0.45205,-0.406247
ursula,-0.71535,0.048349,-0.632508,1.86497,-0.064709,-0.500753
frank,-0.15329,-0.682272,-0.061494,1.802164,-0.017816,-0.887291
maternity,-0.563503,-0.000965,-0.581183,1.583491,0.044476,-0.482316
client,-0.443758,-0.035795,-0.020833,1.568047,-0.709253,-0.358408
smelly,0.452394,-0.741656,-0.421347,1.50458,-0.33291,-0.461062


In [61]:
df.sort_values(by='Rachel Green', ascending = False).head(10)

Unnamed: 0,Chandler Bing,Joey Tribbiani,Monica Geller,Phoebe Buffay,Rachel Green,Ross Geller
joshua,-0.192291,-0.483839,-0.81041,-0.663109,2.35993,-0.210281
zelner,-0.371665,-0.310912,-0.44626,-0.405054,1.938983,-0.405091
gavin,-0.462715,-0.506766,0.15555,-0.528672,1.795289,-0.452686
amy,-0.470817,0.191751,-0.704352,-0.766115,1.784144,-0.034612
joanna,0.44009,-0.431358,-0.534815,-0.558607,1.782162,-0.697472
cart,-0.383549,-0.708348,-0.399471,-0.359724,1.759775,0.091317
barry,-0.990932,-0.351703,0.671269,-0.580861,1.620368,-0.368142
honey,0.453374,-1.873074,1.076049,-0.716581,1.607648,-0.547416
pierced,-0.174968,-0.395451,-0.295799,-0.220571,1.541924,-0.455135
spider,-0.547781,-0.625766,0.090334,0.183793,1.521354,-0.621933


In [62]:
df.sort_values(by='Monica Geller', ascending= False).head(10)

Unnamed: 0,Chandler Bing,Joey Tribbiani,Monica Geller,Phoebe Buffay,Rachel Green,Ross Geller
adopt,-0.551459,-0.399173,1.852177,-0.3974,-0.192521,-0.311624
pad,-0.296282,-0.439225,1.555357,-0.218264,-0.331562,-0.270024
mockolate,-0.236686,-0.281074,1.476158,-0.22473,-0.327675,-0.405992
sweetie,-1.131756,-0.888943,1.471105,-0.352228,0.135975,0.765847
pete,-0.085616,-0.28919,1.434152,-0.478153,0.045826,-0.627019
established,0.438517,-0.62231,1.427404,-0.333192,-0.422634,-0.487786
michelle,-0.332087,-0.492445,1.411618,-0.535546,-0.43852,0.38698
ovulating,0.386706,-0.490665,1.407336,-0.370582,-0.543232,-0.389563
chef,-0.64669,0.366563,1.4032,-0.090959,-0.948371,-0.083744
gosh,-0.384182,-0.90839,1.374677,-0.18004,0.815681,-0.717745


In [63]:
df.sort_values(by='Ross Geller', ascending=False).head(10)

Unnamed: 0,Chandler Bing,Joey Tribbiani,Monica Geller,Phoebe Buffay,Rachel Green,Ross Geller
correct,-0.210432,-0.044399,-1.046738,-0.197487,-0.269739,1.768795
students,-0.390423,-0.049648,-0.404242,-0.430721,-0.455758,1.730791
threesome,-0.533807,0.194043,-0.504462,-0.405019,-0.387902,1.637147
marcel,-1.165878,-0.090207,-0.293171,-0.297001,0.213047,1.63321
crab,-0.380383,0.253467,-0.318866,-0.378816,-0.652395,1.476993
rage,-0.34193,-0.200693,-0.275893,-0.295671,-0.336524,1.450712
lesabre,-0.302983,-0.182576,-0.416132,-0.171536,-0.344841,1.418068
hanukkah,0.418723,-0.317944,-0.605766,-0.435199,-0.47683,1.417016
bike,0.187085,-0.936251,-0.32438,0.745074,-1.084926,1.413398
force,0.090679,-0.224838,-0.237013,-0.281639,-0.756817,1.409628


In [64]:
df.sort_values(by='Chandler Bing', ascending=False).head(10)

Unnamed: 0,Chandler Bing,Joey Tribbiani,Monica Geller,Phoebe Buffay,Rachel Green,Ross Geller
exact,1.716326,-0.386254,-0.612808,-0.027325,-0.496346,-0.193593
eddie,1.699147,-0.698956,0.072689,-0.389698,-0.550914,-0.132268
fourth,1.657946,-0.460425,-0.21477,-0.26261,0.122516,-0.842657
stern,1.546066,-0.46322,-0.265259,-0.165591,-0.185327,-0.46667
unpack,1.54363,-0.465887,-0.089972,-0.212029,-0.339661,-0.436081
tulsa,1.516199,0.20372,0.724007,-0.736089,-0.883963,-0.823874
joe,1.470315,-0.814464,-0.689223,-0.412352,0.188005,0.257721
cameras,1.467605,0.188398,-0.188295,-0.654597,-0.688091,-0.125019
needy,1.460093,-0.205609,-0.387475,0.28645,-0.498177,-0.655281
bumped,1.431796,-0.412429,-0.389852,-0.322897,-0.376169,0.06955
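The six sort-and-head cells above repeat the same step once per character. One possible way to collect all the top words in a single pass is sketched below; `top_words` is a hypothetical helper, and the small coefficient table is made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical coefficient table: rows are words, columns are classes.
toy_coefs = pd.DataFrame(
    np.array([[1.5, -0.7], [-0.4, 2.1], [0.2, 0.3], [0.9, -1.0]]),
    index=["dude", "psychic", "okay", "scene"],
    columns=["Joey", "Phoebe"],
)


def top_words(coef_table, n=10):
    """Return {class: list of the n words with the largest coefficients}."""
    return {
        label: coef_table[label].nlargest(n).index.tolist()
        for label in coef_table.columns
    }


toy_top = top_words(toy_coefs, n=2)
print(toy_top)
```

Applied to the notebook's transposed `df`, the same helper would return one list of words per character.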



---
**Looking up words that are well known from the show to see which character each one most strongly predicts.**

In [69]:
df.loc["smelly"].sort_values(ascending=False) #the song smelly cat 

Phoebe Buffay     1.504580
Chandler Bing     0.452394
Rachel Green     -0.332910
Monica Geller    -0.421347
Ross Geller      -0.461062
Joey Tribbiani   -0.741656
Name: smelly, dtype: float64

In [68]:
df.loc["crap"].sort_values(ascending=False) #in regards to Phoebe's husband Mike 

Phoebe Buffay     0.496845
Chandler Bing     0.460297
Joey Tribbiani   -0.033165
Monica Geller    -0.173652
Rachel Green     -0.265924
Ross Geller      -0.484402
Name: crap, dtype: float64

In [67]:
df.loc['doin'].sort_values(ascending=False) #Joey's famous line "How you doin'?"

Joey Tribbiani    1.454687
Rachel Green      0.310109
Chandler Bing     0.045769
Monica Geller    -0.227674
Phoebe Buffay    -0.631454
Ross Geller      -0.951438
Name: doin, dtype: float64


<br>

-----
### Setting up a Pipe for all Logistic Regression Modeling

In [21]:
pipe = Pipeline(steps=[('cv', CountVectorizer()),
                      ('log', LogisticRegression(random_state=42))])

### Modeling: Basic Model with Default Parameters and using CountVectorizer

In [85]:
grid_d = {'cv__stop_words':[None, 'english'],
         'log__max_iter': [1000, 1250, 1500, 1750, 2000]}

In [86]:
# Instantiate a GridSearchCV
gs_d = GridSearchCV(estimator=pipe, param_grid=grid_d)
gs_d.fit(X_train, y_train)

GridSearchCV(estimator=Pipeline(steps=[('cv', CountVectorizer()),
                                       ('log',
                                        LogisticRegression(random_state=42))]),
             param_grid={'cv__stop_words': [None, 'english'],
                         'log__max_iter': [1000, 1250, 1500, 1750, 2000]})

In [87]:
gs_d.best_params_

{'cv__stop_words': None, 'log__max_iter': 1000}
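Grid keys follow the `step__parameter` convention, so `cv__stop_words` sets `stop_words` on the pipeline step named `cv`, and `log__max_iter` sets `max_iter` on the step named `log`. The sketch below counts how many candidates this grid produces; the fold count assumes the `GridSearchCV` default of 5-fold cross-validation, since no `cv` argument was passed:

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the notebook.
grid_example = {
    "cv__stop_words": [None, "english"],
    "log__max_iter": [1000, 1250, 1500, 1750, 2000],
}

# ParameterGrid expands the dictionary into every combination.
candidates = list(ParameterGrid(grid_example))
n_candidates = len(candidates)

# Assumption: default 5-fold cross-validation (scikit-learn 0.22 and later).
n_folds = 5
n_fits = n_candidates * n_folds

print(n_candidates, n_fits)
```

After those cross-validation fits, the search refits the best combination once on the full training set, which is the model that `gs_d.score` and `gs_d.predict` use.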

In [112]:
print(f"Train score: {gs_d.score(X_train, y_train)} \nTest score: {gs_d.score(X_test, y_test)}")

Train score: 0.5459469110820141 
Test score: 0.3064631315836458



---
**Making Predictions**

In [89]:
#Making predictions
preds_gs = gs_d.predict(X_test)

I created a dataframe consisting of the predicted results, actual results, and the dialogue.

In [90]:
df_gs = pd.DataFrame(y_test)
df_gs['predictions'] = preds_gs 
df_gs['dialogue'] = X_test
df_gs.rename(columns={'character': 'actual'}, inplace=True)
df_gs.head(10)

Unnamed: 0,actual,predictions,dialogue
8260,Monica Geller,Ross Geller,Then what's the problem?
12970,Phoebe Buffay,Rachel Green,"Yeah, well, everybody does! I'm a really cool ..."
9682,Rachel Green,Rachel Green,What? What? He's interested in you. He-he like...
22017,Monica Geller,Monica Geller,I've never loved anybody as much as I love you.
5611,Rachel Green,Joey Tribbiani,And I'm in it? Then let me read it.
22331,Joey Tribbiani,Joey Tribbiani,"Yeah, I gotta go! I got an acting job. Like yo..."
18609,Monica Geller,Phoebe Buffay,Great. So the ball is in his court?
23737,Monica Geller,Monica Geller,"Dad, please don't pick your teeth out here! Al..."
35446,Ross Geller,Chandler Bing,"Excellent! Excellent, now-now do you want anot..."
3756,Monica Geller,Monica Geller,How are you?


In [91]:
df_gs['predictions'].value_counts()

Rachel Green      2233
Monica Geller     2096
Chandler Bing     1937
Ross Geller       1925
Joey Tribbiani    1741
Phoebe Buffay     1270
Name: predictions, dtype: int64

In [92]:
df_gs['actual'].value_counts()

Rachel Green      1991
Ross Geller       1985
Chandler Bing     1897
Monica Geller     1873
Joey Tribbiani    1800
Phoebe Buffay     1656
Name: actual, dtype: int64

In [93]:
#How many rows were misclassified?
df_gs.loc[df_gs['actual']!= df_gs['predictions']].count()

actual         7769
predictions    7769
dialogue       7769
dtype: int64

In [94]:
#How many rows were accurately predicted?
df_gs.loc[df_gs['actual']== df_gs['predictions']].count()

actual         3433
predictions    3433
dialogue       3433
dtype: int64

---
**Predicting Some Phrases**

In [95]:
gs_d.predict(["How you doin'?"])[0]

'Joey Tribbiani'

In [96]:
gs_d.predict(['Smelly cat, smelly cat, what are they feeding you'])[0]

'Phoebe Buffay'

In [97]:
gs_d.predict(['We were on a break!'])[0]

'Ross Geller'

<br>

---- 
### Exporting the Model Using Pickle

In [114]:
#Export the fitted grid search (gs_d) for the Streamlit app
with open('../logistic-regression.pkl', mode='wb') as pickle_out:
    pickle.dump(gs_d, pickle_out)
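To use the exported file elsewhere, the object is read back with `pickle.load`. A minimal round-trip sketch follows; the stand-in object and temporary path are assumptions for illustration, and in the notebook the pickled object would be the fitted `gs_d`:

```python
import os
import pickle
import tempfile

# Stand-in object; in the notebook this would be the fitted gs_d.
toy_model = {"name": "stand-in for a fitted model"}

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy-model.pkl")  # hypothetical path

    # Write the object out in binary mode, as in the cell above.
    with open(toy_path, mode="wb") as pickle_out:
        pickle.dump(toy_model, pickle_out)

    # Read the object back, as an app would do at start-up.
    with open(toy_path, mode="rb") as pickle_in:
        loaded_model = pickle.load(pickle_in)

print(loaded_model == toy_model)
```

Because the whole pipeline is pickled, the loaded object can take raw strings directly. It is safest to load it with the same scikit-learn version that created it, and to unpickle only files from a trusted source.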