# Week 10 Classification Lecture Demo

This notebook demonstrates logistic regression with a worked example. The two datasets used in the demonstration are available in the `data` folder of this repo. We will continue to use the `statsmodels` and `scikit-learn` libraries for the analysis.

## Example: Logistic Regression

**Case Background**

The `Lasagna Triers Logistic Regression.csv` file contains data on 856 people who have either tried or not tried a company’s new frozen lasagna product. The categorical dependent variable, Have Tried, and several of the potential explanatory variables contain text. Using the numeric variables, including dummies, how well is logistic regression able to classify the triers and nontriers?

Therefore, the objective of this case is to use logistic regression to classify users as triers or nontriers, and to interpret the resulting output. 

<center><img src="../images/lasana.jpg" width=400 height=400 /></center>

In [1]:
#Importing the libraries we need to use

import pandas as pd
import statsmodels.formula.api as smf
import numpy as np

#Importing the dataset as a new dataframe and previewing the head of the dataframe
df_lasagna = pd.read_csv('../data/Lasagna Triers Logistic Regression.csv')
df_lasagna.head()

Unnamed: 0,Person,Age,Weight,Income,Pay Type,Car Value,CC Debt,Gender,Live Alone,Dwell Type,Mall Trips,Nbhd,Have Tried
0,1,48,175,65500,Hourly,2190,3510,Male,No,Home,7,East,No
1,2,33,202,29100,Hourly,2110,740,Female,No,Condo,4,East,Yes
2,3,51,188,32200,Salaried,5140,910,Male,No,Condo,1,East,No
3,4,56,244,19000,Hourly,700,1620,Female,No,Home,3,West,No
4,5,28,218,81400,Salaried,26620,600,Male,No,Apt,3,West,Yes


In [2]:
#Recall the renaming skills we learned earlier
#Variable names containing spaces should be renamed before the analysis in statsmodels

df_lasagna = df_lasagna.rename(columns={'Pay Type':'Pay_Type', 'Live Alone':'Live_Alone',
                                        'Dwell Type':'Dwell_Type','Have Tried':'Have_Tried',
                                        'Car Value':'Car_Value','CC Debt':'CC_Debt','Mall Trips':'Mall_Trips'})

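#Creating dummy variables for the categorical columns in df_lasagna and previewing the head of the dataframe again
#dtype=int keeps the dummies as 0/1 integers; newer pandas versions default to True/False

df_lasagna = pd.get_dummies(df_lasagna, columns=['Pay_Type','Gender','Live_Alone','Dwell_Type','Have_Tried'], dtype=int)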
df_lasagna.head()

Unnamed: 0,Person,Age,Weight,Income,Car_Value,CC_Debt,Mall_Trips,Nbhd,Pay_Type_Hourly,Pay_Type_Salaried,Gender_Female,Gender_Male,Live_Alone_No,Live_Alone_Yes,Dwell_Type_Apt,Dwell_Type_Condo,Dwell_Type_Home,Have_Tried_No,Have_Tried_Yes
0,1,48,175,65500,2190,3510,7,East,1,0,0,1,1,0,0,0,1,1,0
1,2,33,202,29100,2110,740,4,East,1,0,1,0,1,0,0,1,0,0,1
2,3,51,188,32200,5140,910,1,East,0,1,0,1,1,0,0,1,0,1,0
3,4,56,244,19000,700,1620,3,West,1,0,1,0,1,0,0,0,1,1,0
4,5,28,218,81400,26620,600,3,West,0,1,0,1,1,0,1,0,0,0,1
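
The model formula in the next cell keeps only one dummy per two-level variable and two of the three `Dwell_Type` dummies; the omitted level acts as the reference category. As a minimal sketch of an alternative (not what this notebook does), `pd.get_dummies` can drop the first level of each variable directly with `drop_first=True`. The small `sample` dataframe below is made up for illustration.

```python
import pandas as pd

# Hypothetical mini-sample with two of the categorical columns from the lasagna data
sample = pd.DataFrame({
    "Pay_Type": ["Hourly", "Hourly", "Salaried", "Hourly", "Salaried"],
    "Dwell_Type": ["Home", "Condo", "Condo", "Home", "Apt"],
})

# drop_first=True removes the alphabetically first level of each variable
# (Hourly and Apt here), which then serves as the reference category
dummies = pd.get_dummies(sample, columns=["Pay_Type", "Dwell_Type"],
                         drop_first=True, dtype=int)
print(dummies.columns.tolist())
```

With this option the remaining columns are exactly the dummies the formula would need, so no redundant columns are left in the dataframe.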


In [3]:
#Introducing a different way to write the statsmodels formula: build the formula string first, then pass it to logit
our_formula = 'Have_Tried_Yes ~ Age + Weight + Income \
            + Car_Value + CC_Debt + Mall_Trips \
           + Pay_Type_Salaried + Gender_Male \
           + Live_Alone_Yes + Dwell_Type_Condo + Dwell_Type_Home'
logitfit = smf.logit(formula=our_formula, data=df_lasagna).fit()
print(logitfit.summary())

Optimization terminated successfully.
         Current function value: 0.401836
         Iterations 7
                           Logit Regression Results                           
Dep. Variable:         Have_Tried_Yes   No. Observations:                  856
Model:                          Logit   Df Residuals:                      844
Method:                           MLE   Df Model:                           11
Date:                Wed, 01 Dec 2021   Pseudo R-squ.:                  0.4098
Time:                        07:58:36   Log-Likelihood:                -343.97
converged:                       True   LL-Null:                       -582.80
Covariance Type:            nonrobust   LLR p-value:                 1.853e-95
                        coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------------
Intercept            -2.5406      0.910     -2.793      0.005      -4.324      -0.758
Age     

In [4]:
#Using summary2 to avoid scientific notation in the coefficient table
print(logitfit.summary2())

                         Results: Logit
Model:              Logit            Pseudo R-squared: 0.410     
Dependent Variable: Have_Tried_Yes   AIC:              711.9429  
Date:               2021-12-01 07:58 BIC:              768.9701  
No. Observations:   856              Log-Likelihood:   -343.97   
Df Model:           11               LL-Null:          -582.80   
Df Residuals:       844              LLR p-value:      1.8527e-95
Converged:          1.0000           Scale:            1.0000    
No. Iterations:     7.0000                                       
-----------------------------------------------------------------
                   Coef.  Std.Err.    z    P>|z|   [0.025  0.975]
-----------------------------------------------------------------
Intercept         -2.5406   0.9097 -2.7928 0.0052 -4.3236 -0.7576
Age               -0.0697   0.0108 -6.4476 0.0000 -0.0909 -0.0485
Weight             0.0070   0.0038  1.8270 0.0677 -0.0005  0.0146
Income             0.0000   0.0000  

In [5]:
#The purpose of this step is to obtain the odds ratios for easier interpretation

model_odds = pd.DataFrame(np.exp(logitfit.params), columns= ['OR'])
model_odds

Unnamed: 0,OR
Intercept,0.07882
Age,0.932684
Weight,1.007058
Income,1.000005
Car_Value,0.999973
CC_Debt,1.000078
Mall_Trips,1.987755
Pay_Type_Salaried,3.79145
Gender_Male,1.291162
Live_Alone_Yes,3.753284
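
An odds ratio is often easier to read as a percentage change in the odds for a one-unit increase in the predictor, holding the other predictors fixed. The sketch below does that arithmetic for three of the odds ratios printed above; the helper name `percent_change_in_odds` is made up for this illustration.

```python
# Odds ratios copied from the output above (a subset, for illustration)
odds_ratios = {
    "Age": 0.932684,
    "Mall_Trips": 1.987755,
    "Pay_Type_Salaried": 3.79145,
}

def percent_change_in_odds(odds_ratio):
    """Percentage change in the odds for a one-unit increase in the predictor."""
    return (odds_ratio - 1) * 100

changes = {name: percent_change_in_odds(value) for name, value in odds_ratios.items()}
for name, change in changes.items():
    print(f"{name}: {change:+.2f}% change in the odds of being a trier")
```

Read this way, each additional year of age lowers the odds of being a trier by roughly 6.7%, while each additional mall trip nearly doubles them.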


In [6]:
#Creating a classification matrix to check the correctness of our model
logitfit.pred_table()

array([[280.,  81.],
       [ 73., 422.]])

In [7]:
#OK, this reminds me of the total number of observations in this dataframe
df_lasagna.shape[0]

856

### Explanation of the above matrix

In the upper-left corner, 280 observations are true negatives: their actual and predicted values of the outcome are both 0. The off-diagonal cells hold the misclassified observations: 81 nontriers were predicted to be triers (upper right) and 73 triers were predicted to be nontriers (lower left). In the bottom-right corner, 422 observations are true positives: their actual and predicted values of the outcome are both 1. More specifically, 422 of the 495 triers, or 85.25%, are classified correctly as triers.

Overall, our model correctly classified 702 of the 856 observations (see the cell above for the total number of observations), an accuracy of 82.01% (702/856).
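
The percentages quoted above can be reproduced directly from the classification matrix. The sketch below copies the four counts from the `pred_table()` output (rows are actual values, columns are predicted values) and also computes the share of nontriers classified correctly, which the text does not report. The variable names are chosen for this illustration.

```python
# Counts copied from logitfit.pred_table(): rows = actual, columns = predicted
table = [[280, 81],
         [73, 422]]

true_neg, false_pos = table[0]   # actual nontriers
false_neg, true_pos = table[1]   # actual triers
total = true_neg + false_pos + false_neg + true_pos

accuracy = (true_neg + true_pos) / total          # share of all observations classified correctly
sensitivity = true_pos / (true_pos + false_neg)   # share of triers classified correctly
specificity = true_neg / (true_neg + false_pos)   # share of nontriers classified correctly

print(f"Accuracy:    {accuracy:.2%}")
print(f"Sensitivity: {sensitivity:.2%}")
print(f"Specificity: {specificity:.2%}")
```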

In [8]:
#In-sample prediction
predict = logitfit.predict(df_lasagna)

#Creating a new variable in the original dataframe that holds the model prediction
#The Prediction variable gives the predicted probability that the observation is a trier

df_lasagna['Prediction'] = predict
df_lasagna.head()


Unnamed: 0,Person,Age,Weight,Income,Car_Value,CC_Debt,Mall_Trips,Nbhd,Pay_Type_Hourly,Pay_Type_Salaried,Gender_Female,Gender_Male,Live_Alone_No,Live_Alone_Yes,Dwell_Type_Apt,Dwell_Type_Condo,Dwell_Type_Home,Have_Tried_No,Have_Tried_Yes,Prediction
0,1,48,175,65500,2190,3510,7,East,1,0,0,1,1,0,0,0,1,1,0,0.752757
1,2,33,202,29100,2110,740,4,East,1,0,1,0,1,0,0,1,0,0,1,0.351476
2,3,51,188,32200,5140,910,1,East,0,1,0,1,1,0,0,1,0,1,0,0.07649
3,4,56,244,19000,700,1620,3,West,1,0,1,0,1,0,0,0,1,1,0,0.091847
4,5,28,218,81400,26620,600,3,West,0,1,0,1,1,0,1,0,0,0,1,0.602193


In [9]:
# If this probability (i.e. the prediction value) is greater than 0.5, the person is classified as a trier;
# otherwise (0.5 or less), the person is classified as a nontrier.
# The code below adds the resulting 0/1 classification as a new column for demonstration

def case(row):
    if row['Prediction'] > 0.5:
        val = 1
    else:
        val = 0
    return val

df_lasagna['Analysis_Case'] = df_lasagna.apply(case, axis='columns')
df_lasagna.head()

Unnamed: 0,Person,Age,Weight,Income,Car_Value,CC_Debt,Mall_Trips,Nbhd,Pay_Type_Hourly,Pay_Type_Salaried,...,Gender_Male,Live_Alone_No,Live_Alone_Yes,Dwell_Type_Apt,Dwell_Type_Condo,Dwell_Type_Home,Have_Tried_No,Have_Tried_Yes,Prediction,Analysis_Case
0,1,48,175,65500,2190,3510,7,East,1,0,...,1,1,0,0,0,1,1,0,0.752757,1
1,2,33,202,29100,2110,740,4,East,1,0,...,0,1,0,0,1,0,0,1,0.351476,0
2,3,51,188,32200,5140,910,1,East,0,1,...,1,1,0,0,1,0,1,0,0.07649,0
3,4,56,244,19000,700,1620,3,West,1,0,...,0,1,0,0,0,1,1,0,0.091847,0
4,5,28,218,81400,26620,600,3,West,0,1,...,1,1,0,1,0,0,0,1,0.602193,1
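
The row-wise `case` function can also be written as a single vectorized comparison, which tends to be faster on large dataframes and makes the cutoff easy to change. This is a minimal sketch using the five predictions shown above; the names `scores` and `cutoff` are introduced here for illustration only.

```python
import pandas as pd

# First five in-sample predictions copied from the output above
scores = pd.DataFrame({"Prediction": [0.752757, 0.351476, 0.07649, 0.091847, 0.602193]})

cutoff = 0.5  # same threshold as the case() function; adjust if a different rule is wanted

# The comparison yields True/False for every row at once; astype(int) converts to 1/0
scores["Analysis_Case"] = (scores["Prediction"] > cutoff).astype(int)
print(scores)
```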


In [10]:
#For a clearer presentation, we can create a new dataframe containing only the actual outcome and the predictions for investigation
df_lasgana_classification = df_lasagna[['Have_Tried_Yes','Prediction','Analysis_Case']].copy()
df_lasgana_classification

Unnamed: 0,Have_Tried_Yes,Prediction,Analysis_Case
0,0,0.752757,1
1,1,0.351476,0
2,0,0.076490,0
3,0,0.091847,0
4,1,0.602193,1
...,...,...,...
851,0,0.245208,0
852,1,0.976333,1
853,1,0.841151,1
854,1,0.702493,1


## What can we do next?

Explanatory values for new people, those whose trier status is unknown, could be fed into the logistic regression equation to score them with predicted probabilities of being triers. Then perhaps some incentives could be sent to the top scorers (or the middle scorers) to increase their chances of trying the product. The point is that logistic regression is then being used as a tool to identify the people most likely to be triers.

To demonstrate this step, we can apply the model to a new dataset (which we call the testing set) to make predictions. The file name of the testing set is `New_Customers.csv`.
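
To make "fed into the logistic regression equation" concrete: the score is the logistic function applied to the intercept plus the sum of each coefficient times its explanatory value. The sketch below shows that calculation by hand. The function name `trier_probability` and the `toy_` coefficients and person are hypothetical and are not the fitted lasagna model; in the notebook itself `logitfit.predict` does this work.

```python
import math

def trier_probability(intercept, coefficients, person):
    """Logistic regression score: 1 / (1 + exp(-(b0 + sum of b_k * x_k)))."""
    linear = intercept + sum(coefficients[name] * person[name] for name in coefficients)
    return 1.0 / (1.0 + math.exp(-linear))

# Hypothetical values for illustration only, not the fitted model
toy_coefficients = {"Age": -0.05, "Mall_Trips": 0.7}
toy_person = {"Age": 40, "Mall_Trips": 3}

probability = trier_probability(0.0, toy_coefficients, toy_person)
print(f"Predicted probability of being a trier: {probability:.3f}")
```

A linear predictor of zero gives a probability of exactly 0.5, which is why 0.5 is the natural default cutoff between triers and nontriers.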

In [11]:
df_testing = pd.read_csv('../data/New_Customers.csv')
df_testing.head()

Unnamed: 0,New_Person,Age,Weight,Income,Pay Type,Car Value,CC Debt,Gender,Live Alone,Dwell Type,Mall Trips,Nbhd
0,1,36,146,85568,Salaried,10213,510,Female,No,Condo,3,South
1,2,40,225,68725,Salaried,3041,80,Female,No,Home,9,West
2,3,48,197,86876,Salaried,4806,2100,Male,No,Condo,9,West
3,4,49,177,38436,Salaried,8679,590,Male,Yes,Apt,3,West
4,5,34,223,77784,Salaried,12456,590,Female,No,Condo,8,West


In [12]:
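#Applying the same renaming and dummy coding to the new customers so the column names match the model formula
df_testing = df_testing.rename(columns={'Pay Type':'Pay_Type', 'Live Alone':'Live_Alone',
                                        'Dwell Type':'Dwell_Type','Have Tried':'Have_Tried',
                                        'Car Value':'Car_Value','CC Debt':'CC_Debt','Mall Trips':'Mall_Trips'})

#dtype=int keeps the dummies as 0/1 integers; newer pandas versions default to True/False
df_testing = pd.get_dummies(df_testing, columns=['Pay_Type','Gender','Live_Alone','Dwell_Type'], dtype=int)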
df_testing.head()

Unnamed: 0,New_Person,Age,Weight,Income,Car_Value,CC_Debt,Mall_Trips,Nbhd,Pay_Type_Salaried,Gender_Female,Gender_Male,Live_Alone_No,Live_Alone_Yes,Dwell_Type_Apt,Dwell_Type_Condo,Dwell_Type_Home
0,1,36,146,85568,10213,510,3,South,1,1,0,1,0,0,1,0
1,2,40,225,68725,3041,80,9,West,1,1,0,1,0,0,0,1
2,3,48,197,86876,4806,2100,9,West,1,0,1,1,0,0,1,0
3,4,49,177,38436,8679,590,3,West,1,0,1,0,1,1,0,0
4,5,34,223,77784,12456,590,8,West,1,1,0,1,0,0,1,0
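
Before predicting, it may be worth confirming that every predictor in the model formula exists in the prepared testing data, because a category that never occurs in the new file produces no dummy column. The sketch below is one way to check, using the predictor names from the formula and the column names from the output above; the list names are chosen for this illustration.

```python
# Predictors on the right-hand side of our_formula
required = ["Age", "Weight", "Income", "Car_Value", "CC_Debt", "Mall_Trips",
            "Pay_Type_Salaried", "Gender_Male", "Live_Alone_Yes",
            "Dwell_Type_Condo", "Dwell_Type_Home"]

# Columns of df_testing after get_dummies, copied from the output above
testing_columns = ["New_Person", "Age", "Weight", "Income", "Car_Value", "CC_Debt",
                   "Mall_Trips", "Nbhd", "Pay_Type_Salaried", "Gender_Female",
                   "Gender_Male", "Live_Alone_No", "Live_Alone_Yes",
                   "Dwell_Type_Apt", "Dwell_Type_Condo", "Dwell_Type_Home"]

missing = [name for name in required if name not in testing_columns]
print("Missing predictors:", missing)
```

Here nothing is missing, even though the testing data has no `Pay_Type_Hourly` column, because the formula only uses `Pay_Type_Salaried`. In the notebook the same check could be run against `df_testing.columns`.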


In [13]:
#Applying our trained model logitfit to the testing dataset to make predictions on the newly collected data
new_predict = logitfit.predict(df_testing)
df_testing['New_Prediction'] = new_predict
df_testing.head()

def case(row):
    if row['New_Prediction'] > 0.5:
        val = 1
    else:
        val = 0
    return val

df_testing['Analysis_Case'] = df_testing.apply(case, axis='columns')
df_testing.head()

Unnamed: 0,New_Person,Age,Weight,Income,Car_Value,CC_Debt,Mall_Trips,Nbhd,Pay_Type_Salaried,Gender_Female,Gender_Male,Live_Alone_No,Live_Alone_Yes,Dwell_Type_Apt,Dwell_Type_Condo,Dwell_Type_Home,New_Prediction,Analysis_Case
0,1,36,146,85568,10213,510,3,South,1,1,0,1,0,0,1,0,0.369351,0
1,2,40,225,68725,3041,80,9,West,1,1,0,1,0,0,0,1,0.985216,1
2,3,48,197,86876,4806,2100,9,West,1,0,1,1,0,0,1,0,0.974404,1
3,4,49,177,38436,8679,590,3,West,1,0,1,0,1,1,0,0,0.564361,1
4,5,34,223,77784,12456,590,8,West,1,1,0,1,0,0,1,0,0.97041,1


### Explanation of the above code

We simply replicate the data preparation from the previous steps so that the variable names are consistent with the model specification. `logitfit` is the model we trained on the Lasagna Triers data. The difference here is that the trier status of the customers in `New_Customers.csv` is unknown, so we use `logitfit` to predict the classification of each new observation.
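
Once the new customers are scored, the targeting idea from "What can we do next?" amounts to ranking them by predicted probability. The sketch below ranks the five new customers shown above and keeps the top scorers; `new_scores`, `top_n`, and its value of 3 are assumptions made for this illustration.

```python
import pandas as pd

# Predictions for the first five new customers, copied from the output above
new_scores = pd.DataFrame({
    "New_Person": [1, 2, 3, 4, 5],
    "New_Prediction": [0.369351, 0.985216, 0.974404, 0.564361, 0.97041],
})

top_n = 3  # hypothetical campaign size

# Sort from most to least likely trier and keep the first top_n rows
top_scorers = new_scores.sort_values("New_Prediction", ascending=False).head(top_n)
print(top_scorers)
```

Targeting the middle scorers instead would only require filtering on a probability range rather than taking the head of the sorted dataframe.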