# Fitting a Logistic Regression Model - Lab

## Introduction

In the last lesson you were given a broad overview of logistic regression. This included an introduction to two separate packages for creating logistic regression models. In this lab, you'll fit logistic regression models with `statsmodels`. For your first foray into logistic regression, you'll build a model that classifies whether an individual survived the [Titanic](https://www.kaggle.com/c/titanic/data) shipwreck (yes, it's a bit morbid).


## Objectives

In this lab you will: 

* Implement logistic regression with `statsmodels` 
* Interpret the statistical results associated with model parameters

## Import the data

Import the data stored in the file `'titanic.csv'` and print the first five rows of the DataFrame to check its contents. 

In [3]:
import pandas as pd

# Import the data
df = pd.read_csv('titanic.csv', index_col=0)

# Display the first five rows of the DataFrame
df.head()


| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


## Define independent and target variables

Your target variable is in the column `'Survived'`. A `0` indicates that the passenger didn't survive the shipwreck, and a `1` indicates that they did. Print the total number of people who didn't survive the shipwreck. How many people survived?

In [4]:
# Total number of people who survived/didn't survive
# Define the target variable (y)
y = df['Survived']

# Print the total number of people who didn't survive (0) and those who survived (1)
num_not_survived = (y == 0).sum()
num_survived = (y == 1).sum()

print(f"Total number of people who didn't survive: {num_not_survived}")
print(f"Total number of people who survived: {num_survived}")

Total number of people who didn't survive: 549
Total number of people who survived: 342


Only consider the columns specified in `relevant_columns` when building your model. The next step is to create dummy variables from categorical variables. Remember to drop the first level for each categorical column and make sure all the values are of type `float`: 

In [5]:
# Create dummy variables
relevant_columns = ['Pclass', 'Age', 'SibSp', 'Fare', 'Sex', 'Embarked', 'Survived']
df_relevant = df[relevant_columns]

dummy_dataframe = pd.get_dummies(df_relevant, drop_first=True).astype(float)

dummy_dataframe.shape

(891, 8)
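
To see what `drop_first=True` and `.astype(float)` do, here is a minimal sketch on a hypothetical four-row DataFrame (the `toy` data is made up for illustration and is not taken from `titanic.csv`). It assumes the default `dummy_na=False`, under which a missing category becomes a row of zeros in the dummy columns rather than a `NaN`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from titanic.csv
toy = pd.DataFrame({
    'Age': [22.0, 38.0, np.nan, 35.0],
    'Sex': ['male', 'female', 'female', 'male'],
    'Embarked': ['S', 'C', 'Q', np.nan],
})

# drop_first=True drops the first level of each categorical column
# ('female' and 'C' here), which then act as the baseline categories
encoded = pd.get_dummies(toy, drop_first=True).astype(float)

print(encoded.columns.tolist())
print(encoded)
```

In this sketch the missing `Embarked` value in the last row is encoded as zeros in both dummy columns, so it looks the same as the baseline level, while the missing `Age` stays `NaN`.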

Did you notice above that the DataFrame contains missing values? To keep things simple, delete all rows with missing values.

> NOTE: You can use the [`.dropna()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html) method to do this. 

In [6]:
# Drop missing rows
dummy_dataframe = dummy_dataframe.dropna()
dummy_dataframe.shape

(714, 8)
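
Before dropping rows, it can help to check which columns actually contain missing values. The sketch below uses a hypothetical toy DataFrame (`toy` and its values are illustrative only) to show `isna().sum()` alongside `dropna()`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing values in one column only
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, np.nan],
    'Fare': [7.25, 71.28, 7.93, 53.10],
    'Sex_male': [1.0, 0.0, 0.0, 0.0],
})

# Count missing values per column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Drop every row that contains at least one missing value
cleaned = toy.dropna()
print(toy.shape, '->', cleaned.shape)
```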

Finally, assign the independent variables to `X` and the target variable to `y`: 

In [8]:
# Split the data into X and y
y = dummy_dataframe['Survived']
X = dummy_dataframe.drop(columns=['Survived'])

## Fit the model

Now with everything in place, you can build a logistic regression model using `statsmodels` (make sure you create an intercept term as we showed in the previous lesson).  

> Warning: Did you receive an error of the form "LinAlgError: Singular matrix"? This means that `statsmodels` was unable to fit the model because a matrix it needed to invert was not full rank. In other words, some features carry redundant information (for example, one column can be written as a linear combination of others). Try removing some features from the model and running it again.
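
One common way to end up with a rank-deficient matrix is to keep every dummy level alongside an intercept. The sketch below checks this on a hypothetical four-row design matrix with `numpy.linalg.matrix_rank`; it illustrates the idea only and does not reproduce what `statsmodels` does internally.

```python
import numpy as np

# Hypothetical design matrix columns: const, Sex_female, Sex_male
# Note that Sex_female + Sex_male always equals const
X_all_levels = np.array([
    [1.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
    [1.0, 0.0, 1.0],
])

# Same data with the first level (Sex_female) dropped
X_drop_first = X_all_levels[:, [0, 2]]

rank_all = np.linalg.matrix_rank(X_all_levels)
rank_drop = np.linalg.matrix_rank(X_drop_first)

print(f"All levels: rank {rank_all} for {X_all_levels.shape[1]} columns")
print(f"Drop first: rank {rank_drop} for {X_drop_first.shape[1]} columns")
```

When the rank is lower than the number of columns, at least one column is redundant, which is why the lab drops the first level of each categorical variable.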

In [9]:
# Build a logistic regression model using statsmodels
import statsmodels.api as sm

# Add an intercept term to X
X = sm.add_constant(X)

# Fit the logistic regression model
logit_model = sm.Logit(y, X)
result = logit_model.fit()

# Print the summary of the model
print(result.summary())

Optimization terminated successfully.
         Current function value: 0.443267
         Iterations 6
                           Logit Regression Results                           
Dep. Variable:               Survived   No. Observations:                  714
Model:                          Logit   Df Residuals:                      706
Method:                           MLE   Df Model:                            7
Date:                Mon, 19 Aug 2024   Pseudo R-squ.:                  0.3437
Time:                        20:39:51   Log-Likelihood:                -316.49
converged:                       True   LL-Null:                       -482.26
Covariance Type:            nonrobust   LLR p-value:                 1.103e-67
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const          5.6503      0.633      8.921      0.000       4.409       6.892
Pclass        -1.2118      0.

## Analyze results

Generate the summary table for your model. Then, comment on the p-values associated with the various features you chose.

### Summary table

#### Model overview

* *Dependent variable (Dep. Variable)*: `Survived`, which indicates whether a passenger survived (1) or not (0).
* *Number of observations (No. Observations)*: 714, the number of rows used to fit the model after dropping rows with missing values.
* *Log-likelihood (Log-Likelihood)*: -316.49, the log-likelihood of the fitted model.
* *Pseudo R-squared (Pseudo R-squ.)*: 0.3437, a measure of fit relative to the null model; higher values indicate a better fit.
* *LL-Null*: -482.26, the log-likelihood of the null model (intercept only, no predictors).
* *LLR p-value*: 1.103e-67, the p-value of the likelihood-ratio test comparing the fitted model with the null model. A value this small indicates that the predictors jointly improve the fit significantly.
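
The pseudo R-squared reported by `statsmodels` is McFadden's, which can be recomputed from the two log-likelihoods. A minimal sketch using the rounded values printed in the summary output above (so the last digit may differ slightly from what `statsmodels` computes internally):

```python
# Values copied from the summary output above
ll_model = -316.49   # Log-Likelihood
ll_null = -482.26    # LL-Null

# McFadden's pseudo R-squared: 1 - (LL of fitted model / LL of null model)
pseudo_r2 = 1 - ll_model / ll_null

# Likelihood-ratio statistic comparing the fitted model with the null model
llr_stat = 2 * (ll_model - ll_null)

print(f"Pseudo R-squared: {pseudo_r2:.4f}")
print(f"LLR statistic:    {llr_stat:.2f}")
```

The LLR p-value is the upper-tail probability of this statistic under a chi-squared distribution with `Df Model` (7 here) degrees of freedom.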

#### Coefficients table

The coefficients table provides information about each predictor in the model:

* *const*: The intercept, with a coefficient of 5.6503. This is the log-odds of survival when all predictors are zero, which is a reference point rather than a realistic passenger (for example, `Pclass` is never 0).
* *Pclass*: Coefficient of -1.2118. A one-unit increase in passenger class (for example, from 1st to 2nd) decreases the log-odds of survival by 1.2118, so lower-class passengers had lower odds of survival.
* *Age*: Coefficient of -0.0431. Each additional year of age decreases the log-odds of survival by 0.0431, so older passengers had lower odds of survival.
* *SibSp*: Coefficient of -0.3806. Each additional sibling or spouse on board decreases the log-odds of survival by 0.3806, so having more family members aboard was associated with lower odds of survival.
* *Fare*: Coefficient of 0.0012. Each additional unit of fare increases the log-odds of survival by 0.0012, though the effect is not statistically significant (p-value = 0.636).
* *Sex_male*: Coefficient of -2.6236. Being male decreases the log-odds of survival by 2.6236 compared with being female, so males had lower odds of survival.
* *Embarked_Q*: Coefficient of -0.8260. Embarking at port Q rather than port C (the dropped baseline level) decreases the log-odds of survival by 0.8260, though this effect is not statistically significant (p-value = 0.167).
* *Embarked_S*: Coefficient of -0.4130. Embarking at port S rather than port C decreases the log-odds of survival by 0.4130, which is also not statistically significant (p-value = 0.125).
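
Log-odds are hard to read directly, so it can help to convert them. The sketch below exponentiates the coefficients quoted above to get odds ratios, then applies the logistic function to two hypothetical passengers (`woman` and `man` are made up for illustration). It uses the rounded coefficients from the write-up, so the results are approximate.

```python
import math

# Coefficients copied from the summary discussed above
coefs = {
    'const': 5.6503,
    'Pclass': -1.2118,
    'Age': -0.0431,
    'SibSp': -0.3806,
    'Fare': 0.0012,
    'Sex_male': -2.6236,
    'Embarked_Q': -0.8260,
    'Embarked_S': -0.4130,
}

# Exponentiating a log-odds coefficient gives an odds ratio
odds_ratios = {name: math.exp(b) for name, b in coefs.items() if name != 'const'}
for name, ratio in odds_ratios.items():
    print(f"{name:<12}{ratio:.3f}")


def predicted_probability(passenger):
    """Apply the logistic function to the linear predictor."""
    log_odds = coefs['const'] + sum(coefs[k] * v for k, v in passenger.items())
    return 1 / (1 + math.exp(-log_odds))


# Hypothetical passengers: first class, age 30, no siblings/spouse,
# fare of 50, embarked at C (both Embarked dummies are 0)
woman = {'Pclass': 1, 'Age': 30, 'SibSp': 0, 'Fare': 50,
         'Sex_male': 0, 'Embarked_Q': 0, 'Embarked_S': 0}
man = {**woman, 'Sex_male': 1}

print(f"Woman: {predicted_probability(woman):.3f}")
print(f"Man:   {predicted_probability(man):.3f}")
```

An odds ratio below 1 means the odds of survival shrink as the feature increases. With a fitted `statsmodels` result, the same odds ratios can be obtained with `np.exp(result.params)`.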

### Interpretation

* *Coefficients (coef)*: Show the change in the log-odds of the outcome (survival) for a one-unit change in the predictor.
* *Standard error (std err)*: Measures the variability of the coefficient estimate.
* *z-value (z)*: The coefficient divided by its standard error; used to test the null hypothesis that the coefficient is zero.
* *P-value (P>|z|)*: Tests the null hypothesis that the coefficient is zero; a value less than 0.05 generally indicates statistical significance.
* *Confidence interval ([0.025, 0.975])*: Provides a range within which the true coefficient value is likely to fall, with 95% confidence.
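
These columns are tied together by simple arithmetic. As a rough check, the sketch below recomputes the z-value, two-sided p-value, and 95% confidence interval for the `const` row from its `coef` and `std err`. Because the printed standard error is rounded to three decimals, the results match the table only approximately.

```python
import math

# const row from the summary output: coef and std err
coef = 5.6503
std_err = 0.633

# Wald z-statistic: coefficient divided by its standard error
z = coef / std_err

# Two-sided p-value under a standard normal: 2 * (1 - Phi(|z|))
p_value = math.erfc(abs(z) / math.sqrt(2))

# 95% confidence interval: coef +/- 1.96 * std err
z_crit = 1.96
ci_low = coef - z_crit * std_err
ci_high = coef + z_crit * std_err

print(f"z = {z:.3f}, p = {p_value:.3g}")
print(f"95% CI: [{ci_low:.3f}, {ci_high:.3f}]")
```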


## Level up (Optional)

Create a new model, this time only using those features you determined were influential based on your analysis of the results above. How does this model perform?

In [10]:
# Select only the significant features (p-value below 0.05 in the first model)
significant_features = ['Pclass', 'Age', 'SibSp', 'Sex_male']
X_significant = dummy_dataframe[significant_features]
y_significant = dummy_dataframe['Survived']

# Add an intercept term to the significant features
X_significant = sm.add_constant(X_significant)

# Fit the logistic regression model with only significant features
logit_model_significant = sm.Logit(y_significant, X_significant)
result_significant = logit_model_significant.fit()

# Print the summary of the new model
print(result_significant.summary())

Optimization terminated successfully.
         Current function value: 0.445882
         Iterations 6
                           Logit Regression Results                           
Dep. Variable:               Survived   No. Observations:                  714
Model:                          Logit   Df Residuals:                      709
Method:                           MLE   Df Model:                            4
Date:                Mon, 19 Aug 2024   Pseudo R-squ.:                  0.3399
Time:                        20:51:56   Log-Likelihood:                -318.36
converged:                       True   LL-Null:                       -482.26
Covariance Type:            nonrobust   LLR p-value:                 1.089e-69
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const          5.6008      0.543     10.306      0.000       4.536       6.666
Pclass        -1.3174      0.

### Analysis

#### Pseudo R-squared

* Previous model: 0.3437
* New model: 0.3399

The pseudo R-squared of the new model is slightly lower than that of the previous model. Pseudo R-squared is not a proportion of variance explained; it measures the improvement in log-likelihood over the null model, so this indicates a marginally smaller improvement. The change is quite small.

#### Log-likelihood

* Previous model: -316.49
* New model: -318.36

The log-likelihood of the new model is slightly lower (more negative) than that of the previous model, which means the fit is marginally worse.

#### Coefficients and p-values

* *Intercept (const)*: Coefficient of 5.6008 (previously 5.6503) with a p-value of 0.000, similar to the previous model.
* *Pclass*: Coefficient of -1.3174 (previously -1.2118) with a p-value of 0.000, showing it is a strong predictor.
* *Age*: Coefficient of -0.0444 (previously -0.0431) with a p-value of 0.000, still significant.
* *SibSp*: Coefficient of -0.3761 (previously -0.3806) with a p-value of 0.002, also significant.
* *Sex_male*: Coefficient of -2.6235 (previously -2.6236) with a p-value of 0.000, maintaining its importance.

Overall, dropping `Fare` and the `Embarked` dummies barely changes the fit, so the simpler model performs about as well with three fewer parameters.
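
One way to judge whether the small drop in log-likelihood matters is a likelihood-ratio test between the two nested models, optionally alongside AIC. The sketch below is one possible check, not the lab's prescribed method. It uses the rounded log-likelihoods from the two summaries above and the standard 5% critical value for a chi-squared distribution with 3 degrees of freedom (7.815).

```python
# Values copied from the two summaries above
ll_full, k_full = -316.49, 8        # 7 predictors + intercept
ll_reduced, k_reduced = -318.36, 5  # 4 predictors + intercept

# Likelihood-ratio statistic for the nested models
lr_stat = 2 * (ll_full - ll_reduced)
df_diff = k_full - k_reduced

# Upper 5% critical value of a chi-squared distribution with 3 df
CHI2_CRIT_3DF = 7.815
dropped_features_matter = lr_stat > CHI2_CRIT_3DF

# AIC = 2k - 2 * log-likelihood (lower is better)
aic_full = 2 * k_full - 2 * ll_full
aic_reduced = 2 * k_reduced - 2 * ll_reduced

print(f"LR statistic: {lr_stat:.2f} on {df_diff} df")
print(f"Reject the reduced model at 5%? {dropped_features_matter}")
print(f"AIC full: {aic_full:.2f}, AIC reduced: {aic_reduced:.2f}")
```

With fitted `statsmodels` results, the AIC values are also available directly as `result.aic` and `result_significant.aic`.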

## Summary 

Well done! In this lab, you practiced using `statsmodels` to build a logistic regression model. You then interpreted the results, building on what you already know about reading regression output from linear regression. Continue on to take a look at building logistic regression models in scikit-learn!