# Fitting a Logistic Regression Model - Lab

## Introduction
You were previously given a broad overview of logistic regression, including two separate packages for building logistic regression models. In this lab, you'll practice fitting logistic regression models with statsmodels.



## Objectives

You will be able to:
* Implement logistic regression with statsmodels
* Interpret the statistical results associated with regression model parameters


## Review

The statsmodels example we covered had four essential parts:
* Importing the data
* Defining X and y
* Fitting the model
* Analyzing model results

The code corresponding to these four steps was:

```python
import pandas as pd
import statsmodels.api as sm

# Step 1: Importing the data
salaries = pd.read_csv("salaries_final.csv", index_col=0)

# Step 2: Defining X and y
x_feats = ["Race", "Sex", "Age"]
X = pd.get_dummies(salaries[x_feats], drop_first=True, dtype=float)
y = pd.get_dummies(salaries["Target"], dtype=float)

# Step 3: Fitting the model
X = sm.add_constant(X)
logit_model = sm.Logit(y.iloc[:, 1], X)
result = logit_model.fit()

# Step 4: Analyzing model results
result.summary()
```

Most of this should be fairly familiar to you: importing data with pandas, initializing a regression object, and calling the fit method of that object. However, Step 2 warrants a slightly more in-depth explanation.

Recall that we fit the salary data using `Race`, `Sex`, and `Age`. Since `Race` and `Sex` are categorical, we converted them to dummy variables using the pandas `get_dummies()` function. By default, `get_dummies()` only converts `object` and `category` columns to dummy variables, so it is safe to pass `Age` along with them. Note that we also passed two additional arguments, `drop_first=True` and `dtype=float`. The `drop_first=True` argument removes the first level of each categorical variable, and the `dtype=float` argument makes every dummy column a float. The data must be float in order to obtain accurate statistical results from statsmodels. Finally, note that `y` is a pandas DataFrame with two columns, one per level of the originally categorical `Target`, which is why the model was fit on `y.iloc[:,1]`. With that, it's time to try and define a logistic regression model on your own!
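As a small illustration of what `get_dummies()` does with these arguments, here is a sketch on a made-up DataFrame. The values and the level names `A`, `B`, and `C` are hypothetical and are not taken from `salaries_final.csv`.

```python
import pandas as pd

# Hypothetical toy data, only meant to mimic the shape of the salary features
toy = pd.DataFrame({
    "Sex": ["Male", "Female", "Female", "Male"],
    "Race": ["A", "B", "C", "A"],
    "Age": [39, 50, 38, 53],
})

# Categorical columns become 0/1 float columns; the first level of each is dropped
X_toy = pd.get_dummies(toy, drop_first=True, dtype=float)

print(X_toy.columns.tolist())
print(X_toy.dtypes)
```

`Age` passes through unchanged and keeps its original integer dtype, because `dtype=float` only applies to the newly created dummy columns. That is why numeric columns may still need an explicit `.astype(float)`.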

## Your Turn - Step 1: Import the Data

Import the data stored in the file **titanic.csv**.

In [1]:
# Your code here
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
import sklearn.preprocessing as preprocessing
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from scipy import stats
from sklearn.linear_model import LinearRegression

In [2]:
titanic_df = pd.read_csv('titanic.csv')

In [3]:
titanic_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [4]:
titanic_df.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


## Step 2: Define X and y

For your first foray into logistic regression, you are going to attempt to build a model that classifies whether an individual survived the Titanic shipwreck or not (yes, it's a bit morbid). Follow the programming patterns described above to define X and y.

In [5]:
# Your code here
X_y = titanic_df[['Age', 'Sex', 'Pclass', 'SibSp', 'Parch', 'Fare', 'Embarked', 'Survived']]

In [6]:
X_y = X_y.loc[~X_y['Age'].isnull()]

In [7]:
X_y = X_y.loc[~X_y['Embarked'].isnull()]

In [8]:
X_y.head()

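| | Age | Sex | Pclass | SibSp | Parch | Fare | Embarked | Survived |
|---|---|---|---|---|---|---|---|---|
| 0 | 22.0 | male | 3 | 1 | 0 | 7.25 | S | 0 |
| 1 | 38.0 | female | 1 | 1 | 0 | 71.2833 | C | 1 |
| 2 | 26.0 | female | 3 | 0 | 0 | 7.925 | S | 1 |
| 3 | 35.0 | female | 1 | 1 | 0 | 53.1 | S | 1 |
| 4 | 35.0 | male | 3 | 0 | 0 | 8.05 | S | 0 |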


In [9]:
features = ['Age', 'Sex', 'Pclass', 'SibSp', 'Parch', 'Fare', 'Embarked']
y = X_y['Survived']
X = pd.get_dummies(X_y[features], drop_first=True, dtype=float)
X[['Pclass', 'SibSp', 'Parch']] = X[['Pclass', 'SibSp', 'Parch']].astype(float)

In [10]:
X.head()

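| | Age | Pclass | SibSp | Parch | Fare | Sex_male | Embarked_Q | Embarked_S |
|---|---|---|---|---|---|---|---|---|
| 0 | 22.0 | 3.0 | 1.0 | 0.0 | 7.25 | 1.0 | 0.0 | 1.0 |
| 1 | 38.0 | 1.0 | 1.0 | 0.0 | 71.2833 | 0.0 | 0.0 | 0.0 |
| 2 | 26.0 | 3.0 | 0.0 | 0.0 | 7.925 | 0.0 | 0.0 | 1.0 |
| 3 | 35.0 | 1.0 | 1.0 | 0.0 | 53.1 | 0.0 | 0.0 | 1.0 |
| 4 | 35.0 | 3.0 | 0.0 | 0.0 | 8.05 | 1.0 | 0.0 | 1.0 |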


In [11]:
log_reg = LogisticRegression(solver='liblinear')
log_reg.fit(X, y)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)

In [12]:
log_reg.coef_, log_reg.intercept_

(array([[-0.03004581, -0.86352712, -0.29957712, -0.05202131,  0.0046735 ,
         -2.33697214, -0.47279306, -0.18371987]]), array([3.97264629]))
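The coefficients printed above come from scikit-learn, whose `LogisticRegression` applies an L2 penalty by default (`penalty='l2'`, `C=1.0` in the output above). Penalized estimates are generally shrunk toward zero relative to unpenalized maximum-likelihood estimates, so they should not be expected to match a plain maximum-likelihood fit exactly. The sketch below illustrates the effect on synthetic data with a hand-written Newton solver. The function `fit_logit`, the simulated data, and the penalty strength are all illustrative assumptions, and this is not how either library is implemented internally.

```python
import numpy as np

# Synthetic data (hypothetical): one predictor plus an intercept column
rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
X_demo = np.column_stack([np.ones(n), x])
true_beta = np.array([0.5, -2.0])
prob = 1 / (1 + np.exp(-X_demo @ true_beta))
y_demo = (rng.random(n) < prob).astype(float)

def fit_logit(X, y, l2=0.0, n_iter=25):
    """Newton's method for logistic regression.

    Maximizes log-likelihood - (l2 / 2) * ||beta||^2; l2=0 gives the plain MLE.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = 1 / (1 + np.exp(-X @ beta))            # predicted probabilities
        grad = X.T @ (y - mu) - l2 * beta            # gradient of the penalized objective
        hess = X.T @ (X * (mu * (1 - mu))[:, None]) + l2 * np.eye(X.shape[1])
        beta = beta + np.linalg.solve(hess, grad)    # Newton update
    return beta

beta_mle = fit_logit(X_demo, y_demo)          # unpenalized
beta_l2 = fit_logit(X_demo, y_demo, l2=1.0)   # L2-penalized

print("unpenalized:", beta_mle)
print("penalized:  ", beta_l2)
```

The penalized coefficient vector has a smaller norm than the unpenalized one, which is the same kind of shrinkage to expect when comparing default scikit-learn coefficients with unpenalized ones.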

In [13]:
y_pred = log_reg.predict(X)

In [14]:
(y==y_pred).sum()/len(y)

0.7921348314606742

## Step 3: Fit the model

Now with everything in place, initialize a regression object and fit your model!

### Warning: If you receive an error of the form "LinAlgError: Singular matrix"

statsmodels was unable to fit the model because the design matrix is not full rank and therefore cannot be inverted. In plain terms, some of the columns are redundant: at least one of them can be written as a linear combination of the others. Try removing some features from the model and running it again.
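A common source of this redundancy is keeping every dummy level of a categorical variable alongside a constant. The following sketch uses a hypothetical toy column (echoing the `C`, `Q`, `S` levels of `Embarked`) and NumPy's `matrix_rank` to show the rank deficiency and how dropping one level removes it.

```python
import numpy as np

# Hypothetical toy column with three levels
embarked = np.array(["S", "C", "S", "Q", "C", "S"])
levels = ["C", "Q", "S"]

# One 0/1 column per level, with no level dropped
all_dummies = np.column_stack([(embarked == lvl).astype(float) for lvl in levels])
const = np.ones((len(embarked), 1))

X_full = np.hstack([const, all_dummies])             # constant + all 3 dummies
X_dropped = np.hstack([const, all_dummies[:, 1:]])   # constant + 2 dummies

rank_full = np.linalg.matrix_rank(X_full)
rank_dropped = np.linalg.matrix_rank(X_dropped)

print(X_full.shape[1], rank_full)        # 4 columns, rank 3
print(X_dropped.shape[1], rank_dropped)  # 3 columns, rank 3
```

In `X_full` the three dummy columns sum to the constant column, so the matrix has four columns but only rank 3. This is the situation that `drop_first=True` avoids.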

In [15]:
# Your code here
X_sm = sm.add_constant(X)
# fit model
logit_model = sm.Logit(y, X_sm)
# get results of the fit
result = logit_model.fit()

Optimization terminated successfully.
         Current function value: 0.444061
         Iterations 6




## Step 4: Analyzing results

Generate the summary table for your model. Then, comment on the p-values associated with the various features you chose.

In [17]:
# Your code here
result.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | Survived | No. Observations: | 712 |
| Model: | Logit | Df Residuals: | 703 |
| Method: | MLE | Df Model: | 8 |
| Date: | Fri, 31 Jan 2020 | Pseudo R-squ.: | 0.3419 |
| Time: | 12:55:46 | Log-Likelihood: | -316.17 |
| converged: | True | LL-Null: | -480.45 |
| Covariance Type: | nonrobust | LLR p-value: | 3.392e-66 |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 5.6374 | 0.635 | 8.884 | 0.000 | 4.394 | 6.881 |
| Age | -0.0433 | 0.008 | -5.266 | 0.000 | -0.059 | -0.027 |
| Pclass | -1.1993 | 0.165 | -7.285 | 0.000 | -1.522 | -0.877 |
| SibSp | -0.3632 | 0.129 | -2.815 | 0.005 | -0.616 | -0.110 |
| Parch | -0.0603 | 0.124 | -0.486 | 0.627 | -0.303 | 0.183 |
| Fare | 0.0014 | 0.003 | 0.566 | 0.572 | -0.004 | 0.006 |
| Sex_male | -2.6385 | 0.222 | -11.871 | 0.000 | -3.074 | -2.203 |
| Embarked_Q | -0.8235 | 0.600 | -1.372 | 0.170 | -2.000 | 0.353 |
| Embarked_S | -0.4012 | 0.270 | -1.484 | 0.138 | -0.931 | 0.129 |


## Your analysis here
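
One way to start this analysis is to flag the features whose p-values fall below a chosen significance level and to convert their coefficients to odds ratios. The sketch below is minimal: the coefficients and p-values are copied by hand from the summary table above, and the 0.05 threshold is a conventional choice assumed here rather than something the lab prescribes.

```python
import math

# (coef, p-value) pairs copied from the summary table above
summary_rows = {
    "Age": (-0.0433, 0.000),
    "Pclass": (-1.1993, 0.000),
    "SibSp": (-0.3632, 0.005),
    "Parch": (-0.0603, 0.627),
    "Fare": (0.0014, 0.572),
    "Sex_male": (-2.6385, 0.000),
    "Embarked_Q": (-0.8235, 0.170),
    "Embarked_S": (-0.4012, 0.138),
}
alpha = 0.05  # assumed significance level

# Features whose coefficient differs significantly from zero at level alpha
significant = [name for name, (_, p) in summary_rows.items() if p < alpha]

# exp(coef) is the multiplicative change in the odds of survival per unit increase
odds_ratios = {name: math.exp(coef) for name, (coef, _) in summary_rows.items()}

print(significant)
for name in significant:
    print(name, round(odds_ratios[name], 3))
```

Under that threshold, `Age`, `Pclass`, `SibSp`, and `Sex_male` are flagged, while `Parch`, `Fare`, and the two `Embarked` dummies are not. An odds ratio below 1 means the feature is associated with lower odds of survival, holding the other features fixed. With a fitted statsmodels result, the same inputs are available as `result.params` and `result.pvalues`.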

## Level-up

Create a new model, this time only using those features you determined were influential based on your analysis in Step 4.

In [18]:
# Your code here
X_2 = X[['Age', 'Pclass', 'Sex_male', 'SibSp']]

In [19]:
log_reg_2 = LogisticRegression(solver='liblinear')
log_reg_2.fit(X_2, y)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)

In [20]:
y_pred_2 = log_reg_2.predict(X_2)
(y==y_pred_2).sum()/len(y)

0.7907303370786517

In [21]:
X2_sm = sm.add_constant(X_2)
# fit model
logit_model = sm.Logit(y, X2_sm)
# get results of the fit
result = logit_model.fit()

Optimization terminated successfully.
         Current function value: 0.446755
         Iterations 6




In [22]:
result.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | Survived | No. Observations: | 712 |
| Model: | Logit | Df Residuals: | 707 |
| Method: | MLE | Df Model: | 4 |
| Date: | Fri, 31 Jan 2020 | Pseudo R-squ.: | 0.3379 |
| Time: | 12:55:53 | Log-Likelihood: | -318.09 |
| converged: | True | LL-Null: | -480.45 |
| Covariance Type: | nonrobust | LLR p-value: | 5.015e-69 |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 5.5908 | 0.543 | 10.288 | 0.000 | 4.526 | 6.656 |
| Age | -0.0446 | 0.008 | -5.457 | 0.000 | -0.061 | -0.029 |
| Pclass | -1.3139 | 0.141 | -9.324 | 0.000 | -1.590 | -1.038 |
| Sex_male | -2.6148 | 0.215 | -12.177 | 0.000 | -3.036 | -2.194 |
| SibSp | -0.3747 | 0.121 | -3.098 | 0.002 | -0.612 | -0.138 |
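
Because the four-feature model is nested inside the eight-feature model, the two fits can be compared with a likelihood-ratio test. The sketch below uses the log-likelihoods and model degrees of freedom reported in the two summaries. The helper `chi2_sf_even_df` is a hypothetical name introduced here; it uses a closed form for the chi-square upper tail that only holds for an even number of degrees of freedom, which happens to apply in this case.

```python
import math

# Values copied from the two summary tables
ll_full, df_full = -316.17, 8        # all eight predictors
ll_reduced, df_reduced = -318.09, 4  # Age, Pclass, Sex_male, SibSp

# Likelihood-ratio statistic and its degrees of freedom
lr_stat = 2 * (ll_full - ll_reduced)
df_diff = df_full - df_reduced

def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability; valid only when df is even."""
    half = x / 2
    return math.exp(-half) * sum(half**k / math.factorial(k) for k in range(df // 2))

p_value = chi2_sf_even_df(lr_stat, df_diff)
print(round(lr_stat, 2), df_diff, round(p_value, 3))
```

A large p-value here suggests that dropping `Parch`, `Fare`, and the `Embarked` dummies did not significantly worsen the fit. The log-likelihoods are rounded to two decimals in the summaries, so treat the result as approximate.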


In [23]:
X_3 = X[['Age', 'Pclass', 'Sex_male', 'SibSp']].copy()

In [24]:
# X_3 is an explicit copy, so adding columns does not raise SettingWithCopyWarning
X_3['Pclass_1'] = (X_3['Pclass'] == 1.0).astype(float)
X_3['Pclass_2'] = (X_3['Pclass'] == 2.0).astype(float)


In [25]:
X_3 = X_3.drop(['Pclass'], axis=1)

In [26]:
X_3.head()

| | Age | Sex_male | SibSp | Pclass_1 | Pclass_2 |
|---|---|---|---|---|---|
| 0 | 22.0 | 1.0 | 1.0 | 0.0 | 0.0 |
| 1 | 38.0 | 0.0 | 1.0 | 1.0 | 0.0 |
| 2 | 26.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 35.0 | 0.0 | 1.0 | 1.0 | 0.0 |
| 4 | 35.0 | 1.0 | 0.0 | 0.0 | 0.0 |


In [27]:
log_reg_3 = LogisticRegression(solver='liblinear')
log_reg_3.fit(X_3, y)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)

In [28]:
y_pred_3 = log_reg_3.predict(X_3)
(y==y_pred_3).sum()/len(y)

0.8089887640449438

In [29]:
X3_sm = sm.add_constant(X_3)
# fit model
logit_model = sm.Logit(y, X3_sm)
# get results of the fit
result = logit_model.fit()

Optimization terminated successfully.
         Current function value: 0.446657
         Iterations 6




In [30]:
result.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | Survived | No. Observations: | 712 |
| Model: | Logit | Df Residuals: | 706 |
| Method: | MLE | Df Model: | 5 |
| Date: | Fri, 31 Jan 2020 | Pseudo R-squ.: | 0.3381 |
| Time: | 12:56:17 | Log-Likelihood: | -318.02 |
| converged: | True | LL-Null: | -480.45 |
| Covariance Type: | nonrobust | LLR p-value: | 4.498e-68 |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 1.6804 | 0.302 | 5.567 | 0.000 | 1.089 | 2.272 |
| Age | -0.0449 | 0.008 | -5.456 | 0.000 | -0.061 | -0.029 |
| Sex_male | -2.6190 | 0.215 | -12.181 | 0.000 | -3.040 | -2.198 |
| SibSp | -0.3786 | 0.121 | -3.119 | 0.002 | -0.616 | -0.141 |
| Pclass_1 | 2.6450 | 0.286 | 9.251 | 0.000 | 2.085 | 3.205 |
| Pclass_2 | 1.2387 | 0.245 | 5.053 | 0.000 | 0.758 | 1.719 |
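
To see how much the dummy coding of `Pclass` changes the picture, the sketch below compares the log-odds gain over third class implied by each coding, using coefficients copied from the last two summaries. It also runs a likelihood-ratio test with one degree of freedom, since the numeric-`Pclass` model is the special case of the dummy model in which the first-class effect is exactly twice the second-class effect. The variable names are illustrative.

```python
import math

# Coefficients copied from the summaries
pclass_slope = -1.3139                # numeric Pclass model (per class step)
pclass_1, pclass_2 = 2.6450, 1.2387   # dummy model, 3rd class as the baseline

# Log-odds gain relative to 3rd class implied by the numeric coding
linear_gain_1st = -2 * pclass_slope   # two steps: class 3 -> class 1
linear_gain_2nd = -1 * pclass_slope   # one step:  class 3 -> class 2

print("1st vs 3rd:", round(linear_gain_1st, 4), "numeric vs", pclass_1, "dummy")
print("2nd vs 3rd:", round(linear_gain_2nd, 4), "numeric vs", pclass_2, "dummy")

# Likelihood-ratio test with 1 degree of freedom.
# For df = 1 the chi-square upper tail equals erfc(sqrt(x / 2)).
ll_numeric, ll_dummy = -318.09, -318.02
lr_stat = 2 * (ll_dummy - ll_numeric)
p_value = math.erfc(math.sqrt(lr_stat / 2))
print(round(lr_stat, 2), round(p_value, 3))
```

The implied gains are close under both codings and the test p-value is large, which suggests that treating `Pclass` as numeric loses little here. Because the log-likelihoods are rounded to two decimals, the test statistic is only approximate.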


## Summary

Well done! In this lab, you practiced using statsmodels to build a logistic regression model. You then interpreted the results, building on your previous statistics knowledge from linear regression. Continue on to take a closer look at building logistic regression models in scikit-learn!