# Fitting a Logistic Regression Model - Lab

## Introduction
In the last lecture, you were given a broad overview of logistic regression, including two separate packages for creating logistic regression models. We'll first investigate building logistic regression models with statsmodels.

## Objectives

You will be able to:
* Understand and implement logistic regression


## Review

The statsmodels example we covered had four essential parts:
* Importing the data
* Defining X and y
* Fitting the model
* Analyzing model results

The code corresponding to these four steps was:

```
import pandas as pd
from patsy import dmatrices
import statsmodels.api as sm

#Step 1: Importing the data
salaries = pd.read_csv("salaries_final.csv", index_col = 0)

#Step 2: Defining X and y
y, X = dmatrices('Target ~ Age  + C(Race) + C(Sex)',
                  salaries, return_type = "dataframe")

#Step 3: Fitting the model
logit_model = sm.Logit(y.iloc[:,1], X)
result = logit_model.fit()

#Step 4: Analyzing model results
result.summary()
```

Most of this should be fairly familiar to you: importing data with Pandas, initializing a regression object, and calling the fit method of that object. However, step 2 warrants a slightly more in-depth explanation.

The `dmatrices()` function above mirrors the R language's formula syntax. The first parameter is a string representing the conceptual formula for our model. After that, we pass the DataFrame where the data is stored, as well as an optional parameter for the format in which we would like the data returned. The general pattern for defining the formula string is: `y_feature_name ~ x_feature1_name + x_feature2_name + ... + x_featuren_name`. You should also notice that two of the x features, Race and Sex, are wrapped in `C()`. This indicates that these variables are *categorical* and that dummy variables need to be created in order to convert them to numerical quantities. Finally, note that `y` is itself returned as a Pandas DataFrame with two columns, because the target was originally a categorical variable; this is why the code above passes `y.iloc[:,1]` to `sm.Logit()`. With that, it's time to try defining a logistic regression model on your own!
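
As a quick illustration of the formula pattern, the sketch below runs `dmatrices()` on a tiny made-up DataFrame. The column names (`target`, `age`, `sex`) and their values are invented for this example and do not come from the salaries data.

```python
import pandas as pd
from patsy import dmatrices

# Toy data invented for illustration; not the salaries dataset
toy = pd.DataFrame({
    "target": ["yes", "no", "yes", "no"],
    "age": [25, 40, 31, 58],
    "sex": ["female", "male", "male", "female"],
})

# C(sex) asks patsy to dummy-encode sex; age is used as-is
y_toy, X_toy = dmatrices("target ~ age + C(sex)", toy, return_type="dataframe")

print(X_toy.columns.tolist())  # an Intercept column, one dummy for sex, and age
print(y_toy.columns.tolist())  # one column per level of the string-valued target
```

Because `target` holds strings, `y_toy` comes back with one column per level, which is why the review code selects a single column before fitting.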

## Your Turn - Step 1: Import the Data

Import the data stored in the file **titanic.csv**.

In [52]:
#Your code here
import pandas as pd
import numpy as np
from patsy import dmatrices
import statsmodels.api as sm

df = pd.read_csv('titanic.csv')
df.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | | S |


In [53]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB


In [54]:
df.describe()

| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% | 446.0 | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 668.5 | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |


## Step 2: Define X and y

For our first foray into logistic regression, we are going to attempt to build a model that classifies whether an individual survived the Titanic shipwreck or not (yes, it's a bit morbid). Follow the programming patterns described above to define X and y.

In [55]:
fit_df = df.drop(['PassengerId', 'Ticket', 'Cabin', 'Name'], axis=1)

In [56]:
#Your code here
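# Columns to wrap in C() so patsy treats them as categorical
cat_cols = ['Sex', 'Pclass', 'SibSp', 'Parch']
cat_eq = ' + '.join([f'C({x})' for x in cat_cols])
# Every remaining column except the target enters the formula unwrapped
other_cols = [x for x in fit_df.columns if x not in cat_cols + ['Survived']]
full_eq = 'Survived ~ ' + cat_eq + ' + ' + ' + '.join(other_cols)
y, X = dmatrices(full_eq, fit_df, return_type='dataframe')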

In [57]:
X.columns

Index(['Intercept', 'C(Sex)[T.male]', 'C(Pclass)[T.2]', 'C(Pclass)[T.3]',
       'C(SibSp)[T.1]', 'C(SibSp)[T.2]', 'C(SibSp)[T.3]', 'C(SibSp)[T.4]',
       'C(SibSp)[T.5]', 'C(SibSp)[T.8]', 'C(Parch)[T.1]', 'C(Parch)[T.2]',
       'C(Parch)[T.3]', 'C(Parch)[T.4]', 'C(Parch)[T.5]', 'C(Parch)[T.6]',
       'Embarked[T.Q]', 'Embarked[T.S]', 'Age', 'Fare'],
      dtype='object')
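
To see what the string-building code above hands to `dmatrices()`, the sketch below reproduces the same joins on a plain list of column names. The list `columns` is typed out by hand to stand in for `fit_df.columns`, so the example runs without the data file.

```python
# Column names of fit_df, written out by hand (assumed to match df after the drop)
columns = ["Survived", "Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]
cat_cols = ["Sex", "Pclass", "SibSp", "Parch"]

# Wrap each categorical column in C(...), then append the remaining predictors
cat_eq = " + ".join(f"C({col})" for col in cat_cols)
other_eq = " + ".join(col for col in columns if col not in cat_cols + ["Survived"])
full_eq = f"Survived ~ {cat_eq} + {other_eq}"

print(full_eq)
# Survived ~ C(Sex) + C(Pclass) + C(SibSp) + C(Parch) + Age + Fare + Embarked
```

`Embarked` is not wrapped in `C()`, but because it holds strings patsy still treats it as categorical, which is why `Embarked[T.Q]` and `Embarked[T.S]` appear in `X.columns`.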

## Step 3: Fit the model

Now with everything in place, initialize a regression object and fit your model!

### Warning: If you receive an error of the form "LinAlgError: Singular matrix"
This means statsmodels was unable to fit the model due to a linear algebra problem. Specifically, the matrix was not invertible because it was not full rank. In layman's terms, the data contained redundant, superfluous information. Try removing some features from the model and running it again.
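
The idea of a matrix that is not full rank can be sketched with NumPy. The tiny design matrix below is made up for illustration: it contains an intercept plus a dummy column for every level of a two-level variable, so one column is an exact combination of the others. This is only one way redundancy can arise, and not necessarily the cause in your model.

```python
import numpy as np

# Columns: intercept, is_male, is_female (hypothetical toy design matrix)
X_redundant = np.array([
    [1, 1, 0],
    [1, 0, 1],
    [1, 1, 0],
    [1, 0, 1],
], dtype=float)

# intercept == is_male + is_female, so only 2 of the 3 columns are independent
print(np.linalg.matrix_rank(X_redundant))  # 2

# Dropping one redundant column restores full column rank
X_reduced = X_redundant[:, :2]
print(np.linalg.matrix_rank(X_reduced))  # 2, equal to its number of columns
```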

In [58]:
# Your code here
logit_model = sm.Logit(y, X.drop(['C(SibSp)[T.8]', 'Embarked[T.S]', 'C(Pclass)[T.3]'], axis=1))
result = logit_model.fit()

         Current function value: 0.467459
         Iterations: 35




## Step 4: Analyzing results

Generate the summary table for your model. Then, comment on the p-values associated with the various features you chose.

In [59]:
#Your code here
result.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | Survived | No. Observations: | 712.0 |
| Model: | Logit | Df Residuals: | 695.0 |
| Method: | MLE | Df Model: | 16.0 |
| Date: | Wed, 27 Mar 2019 | Pseudo R-squ.: | 0.3073 |
| Time: | 13:44:32 | Log-Likelihood: | -332.83 |
| converged: | False | LL-Null: | -480.45 |
| | | LLR p-value: | 2.464e-53 |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 1.2029 | 0.310 | 3.874 | 0.000 | 0.594 | 1.811 |
| C(Sex)[T.male] | -2.4742 | 0.210 | -11.785 | 0.000 | -2.886 | -2.063 |
| C(Pclass)[T.2] | 0.3463 | 0.223 | 1.556 | 0.120 | -0.090 | 0.782 |
| C(SibSp)[T.1] | 0.0156 | 0.240 | 0.065 | 0.948 | -0.454 | 0.486 |
| C(SibSp)[T.2] | -0.8189 | 0.546 | -1.499 | 0.134 | -1.890 | 0.252 |
| C(SibSp)[T.3] | -2.6165 | 0.852 | -3.071 | 0.002 | -4.286 | -0.947 |
| C(SibSp)[T.4] | -2.0735 | 0.776 | -2.672 | 0.008 | -3.594 | -0.553 |
| C(SibSp)[T.5] | -19.7170 | 6640.173 | -0.003 | 0.998 | -1.3e+04 | 1.3e+04 |
| C(Parch)[T.1] | 0.2869 | 0.294 | 0.977 | 0.328 | -0.289 | 0.862 |
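
One way to start the analysis is to separate features by p-value. The sketch below copies the `P>|z|` column from the rows shown above into a plain dictionary and applies the conventional 0.05 cutoff, which is an assumption here rather than something the lab prescribes. With a fitted model, the same values are available as `result.pvalues`.

```python
# p-values copied by hand from the coefficient rows shown above
p_values = {
    "Intercept": 0.000,
    "C(Sex)[T.male]": 0.000,
    "C(Pclass)[T.2]": 0.120,
    "C(SibSp)[T.1]": 0.948,
    "C(SibSp)[T.2]": 0.134,
    "C(SibSp)[T.3]": 0.002,
    "C(SibSp)[T.4]": 0.008,
    "C(SibSp)[T.5]": 0.998,
    "C(Parch)[T.1]": 0.328,
}

alpha = 0.05  # conventional significance threshold (an assumption, not from the lab)

# Split the terms into those below and at-or-above the threshold
significant = [name for name, p in p_values.items() if p < alpha]
not_significant = [name for name, p in p_values.items() if p >= alpha]

print("significant:", significant)
print("not significant:", not_significant)
```

Note that the summary reports `converged: False`, and `C(SibSp)[T.5]` has a standard error in the thousands, so these p-values should be read with caution.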


## Your analysis here

## Level-up

Create a new model, this time using only those features you determined were influential based on your analysis in Step 4.

In [76]:
#Your code here
# Drop identifier-like columns, plus Parch and Fare
fit_df_new = df.drop(['PassengerId', 'Ticket', 'Cabin', 'Name', 'Parch', 'Fare'], axis=1)
# Collapse SibSp to a boolean: True if any siblings/spouses were aboard
fit_df_new['SibSp'] = fit_df_new['SibSp'] > 0
cat_cols = ['Sex', 'Pclass', 'SibSp']
cat_eq = ' + '.join([f'C({x})' for x in cat_cols])
other_cols = [x for x in fit_df_new.columns if x not in cat_cols + ['Survived']]
full_eq = 'Survived ~ ' + cat_eq + ' + ' + ' + '.join(other_cols)
y, X = dmatrices(full_eq, fit_df_new, return_type='dataframe')

In [77]:
X.columns

Index(['Intercept', 'C(Sex)[T.male]', 'C(Pclass)[T.2]', 'C(Pclass)[T.3]',
       'C(SibSp)[T.True]', 'Embarked[T.Q]', 'Embarked[T.S]', 'Age'],
      dtype='object')

In [78]:
logit_model = sm.Logit(y, X)
result = logit_model.fit()

Optimization terminated successfully.
         Current function value: 0.450173
         Iterations 6


In [79]:
result.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | Survived | No. Observations: | 712.0 |
| Model: | Logit | Df Residuals: | 704.0 |
| Method: | MLE | Df Model: | 7.0 |
| Date: | Wed, 27 Mar 2019 | Pseudo R-squ.: | 0.3329 |
| Time: | 13:47:33 | Log-Likelihood: | -320.52 |
| converged: | True | LL-Null: | -480.45 |
| | | LLR p-value: | 3.459e-65 |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 4.2429 | 0.465 | 9.130 | 0.000 | 3.332 | 5.154 |
| C(Sex)[T.male] | -2.5594 | 0.213 | -11.990 | 0.000 | -2.978 | -2.141 |
| C(Pclass)[T.2] | -1.1758 | 0.293 | -4.014 | 0.000 | -1.750 | -0.602 |
| C(Pclass)[T.3] | -2.4573 | 0.295 | -8.337 | 0.000 | -3.035 | -1.880 |
| C(SibSp)[T.True] | -0.2692 | 0.212 | -1.269 | 0.204 | -0.685 | 0.147 |
| Embarked[T.Q] | -0.8516 | 0.575 | -1.481 | 0.139 | -1.979 | 0.276 |
| Embarked[T.S] | -0.4883 | 0.268 | -1.824 | 0.068 | -1.013 | 0.037 |
| Age | -0.0379 | 0.008 | -4.803 | 0.000 | -0.053 | -0.022 |
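
Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios, which are often easier to talk about. The sketch below does this for a few coefficients copied by hand from the table above. With a fitted model, `np.exp(result.params)` would give the same quantities for every term.

```python
import math

# Coefficients copied by hand from the summary table above
coefs = {
    "C(Sex)[T.male]": -2.5594,
    "C(Pclass)[T.2]": -1.1758,
    "C(Pclass)[T.3]": -2.4573,
    "Age": -0.0379,
}

# exp(coef) is the multiplicative change in the odds of survival
odds_ratios = {name: math.exp(coef) for name, coef in coefs.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: odds ratio = {ratio:.3f}")
```

An odds ratio below 1 means the term is associated with lower odds of survival relative to the reference level (or per additional year, for `Age`), holding the other features fixed.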


## Summary 

Well done! In this lab, we practiced using statsmodels to build a logistic regression model. We then reviewed how to interpret the results, building upon our previous statistics knowledge from linear regression. Continue on to take a look at building logistic regression models in scikit-learn!