# Building a Logistic Regression Model

This notebook will walk through the end to end process of preparing, executing, and improving a logistic regression model to predict Titanic survivors. Logistic Regressions are useful for predicting outcomes that are binary like survive/not survive in the case of the Titanic dataset.

In [169]:
import pandas as pd
import os
import numpy as np

### Bring in the data

In [170]:
train_file_path = r'C:\Users\Owner\Desktop\School\Prog in Pyth C996\Notebooks\Data for LogReg\train.csv'

test_file_path = r'C:\Users\Owner\Desktop\School\Prog in Pyth C996\Notebooks\Data for LogReg\test.csv'

train_df = pd.read_csv(filepath_or_buffer = train_file_path, index_col = 'PassengerId')
test_df = pd.read_csv(filepath_or_buffer = test_file_path, index_col = 'PassengerId')

display(train_df.info())
display(test_df.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 1 to 891
Data columns (total 33 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Survived            891 non-null    int64  
 1   Age                 891 non-null    float64
 2   Fare                891 non-null    float64
 3   FamilySize          891 non-null    int64  
 4   IsMother            891 non-null    int64  
 5   IsMale              891 non-null    int64  
 6   Deck_A              891 non-null    float64
 7   Deck_B              891 non-null    float64
 8   Deck_C              891 non-null    float64
 9   Deck_D              891 non-null    float64
 10  Deck_E              891 non-null    float64
 11  Deck_F              891 non-null    float64
 12  Deck_G              891 non-null    float64
 13  Deck_Z              891 non-null    float64
 14  Pclass_1            891 non-null    float64
 15  Pclass_2            891 non-null    float64
 16  Pclass_3

None

<class 'pandas.core.frame.DataFrame'>
Int64Index: 418 entries, 892 to 1309
Data columns (total 32 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Age                 418 non-null    float64
 1   Fare                418 non-null    float64
 2   FamilySize          418 non-null    int64  
 3   IsMother            418 non-null    int64  
 4   IsMale              418 non-null    int64  
 5   Deck_A              418 non-null    float64
 6   Deck_B              418 non-null    float64
 7   Deck_C              418 non-null    float64
 8   Deck_D              418 non-null    float64
 9   Deck_E              418 non-null    float64
 10  Deck_F              418 non-null    float64
 11  Deck_G              418 non-null    float64
 12  Deck_Z              418 non-null    float64
 13  Pclass_1            418 non-null    float64
 14  Pclass_2            418 non-null    float64
 15  Pclass_3            418 non-null    float64
 16  Title

None

Building a model to predict the likeliness of surviving for the 418 entries in the test data

### Data Prep

In [171]:
# Most algotrhims expect numeric arrays. Making arrays for inputs - X (everything but Survived) and output y 'Survived'

X = train_df.loc[:, 'Age':].to_numpy(dtype = 'float')
y = train_df['Survived'].ravel()

print(type(X), X.shape, type(y), y.shape)

<class 'numpy.ndarray'> (891, 32) <class 'numpy.ndarray'> (891,)


In [172]:
# Splitting the train dataframe into train and test with 20% for testing and 80% for training
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

(712, 32) (712,)
(179, 32) (179,)


712 rows in train and 179 in test. Next looking into how many positive class types are in the test and train sets

In [173]:
# average survival in train and test sets
print('Mean survival in train: {:.3f}'.format(np.mean(y_train)))
print('Mean survival in test: {:.3f}'.format(np.mean(y_test)))

Mean survival in train: 0.383
Mean survival in test: 0.385


Similar survival in the test and train sets ~39%. This is good because we ideally want them to be similar. One thing to note is the data is not balanced because of the low survival rates. Imbalance checks are good to carryout to see if there are any other steps in preparation needed.

# Baseline Model

Making a Baseline model is one of the first steps in creating a model. What it does is make a model that always outputs the majority class. In the case of the titanic dataset we observe that most did not survive. The baseline model will always output "0" for 'Survived'. This will provide an accuracy value that we can use to compare with the actual model accuracy. The final model accuracy should exceed the baseline accuracy.

In [174]:
# import function
from sklearn.dummy import DummyClassifier

In [175]:
# create model -- specifying 'most frequent' in this case "0" - not survive
model_dummy = DummyClassifier(strategy = 'most_frequent', random_state = 0)

# train model
model_dummy.fit(X_train, y_train)

# Testing model
print('Score for Baseline Model accuracy: {:.2f}'.format(model_dummy.score(X_test, y_test)))

Score for Baseline Model accuracy: 0.61


If you predict not survive for each outcome you would be 61.5% accurate

In [176]:
# Imports to calculate other performance metrics
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

In [177]:
# We are able to calculate accuracy score in another way
print('Accuracy for the Baseline Model: {:.2f}'.format(accuracy_score(y_test, model_dummy.predict(X_test))))

Accuracy for the Baseline Model: 0.61


In [178]:
# Confusion Matrix
print('Confusion Matrix for the Baseline Model: \n {0}'.format(confusion_matrix(y_test, model_dummy.predict(X_test))))

Confusion Matrix for the Baseline Model: 
 [[110   0]
 [ 69   0]]


In [179]:
# Precision and Recall scores
print('Precision for the Baseline Model: {:.2f}'.format(precision_score(y_test, model_dummy.predict(X_test))))
print('Recall for the Baseline Model: {:.2f}'.format(recall_score(y_test, model_dummy.predict(X_test))))

Precision for the Baseline Model: 0.00
Recall for the Baseline Model: 0.00


  _warn_prf(average, modifier, msg_start, len(result))


# Precision and Recall
Accuracy tells one thing about the data but recall and precision can help tell more information about how the model is performing.

The model predicted 179 non survivors out of what actually was 110 non survivors (True Negative) and 69 survivors (False Negative).

The base model's precision in this case "how many of the predicted survivors are actually survivors" was 0% because we told it to only predict negative results. The base model's recall in this case "how many of all the survivors were predicted" was also 0% for the same reason.

Definitions of an ideal value for one of these metrics in the real world changes like in the healthcare field. You could want recall to be 100% if you're trying to predict who has COVID-19 so that no one is incorrectly informed they are clear "False Negative"and go untreated. This case the accuracy and precision are okay to sacrifice to an extent.


### Precision

True Positives / (True Positives + False Positives)

Correctly Predicted Survivors / (Correctly Predicted Survivors + Wrongly Predicted Survivors)


### Recall
True Positives / (True Positives + False Negatives)

Correctly Predicted Survivors / (Correctly Predicted Survivors + Wrongly Predicted Non Survivors)

# Logistic Regression Model

In [180]:
from sklearn.linear_model import LogisticRegression

In [181]:
# create model
model_lr_1 = LogisticRegression(random_state = 0, max_iter = 1000) # 1000 max_iter to get rid of convergence error

# train model
model_lr_1.fit(X_train, y_train)

# evalute model on test
print('The accuracy for the Logistic Regression: {:.2f}'.format(model_lr_1.score(X_test, y_test)))

The accuracy for the Logistic Regression: 0.83


In [182]:
# Performance Metrics

# Accuracy
print('Accuracy for the regression: {:.2f}'.format(accuracy_score(y_test, model_lr_1.predict(X_test))))

# Confusion Matrix
print('Confusion Matrix for the regression: \n {0}'.format(confusion_matrix(y_test, model_lr_1.predict(X_test))))

# Precision and Recall scores
print('Precision - how good the model is at predicting survivors: {:.2f}'.format(precision_score(y_test, model_lr_1.predict(X_test))))
print('Recall - out of all survivors how many were predicted: {:.2f}'.format(recall_score(y_test, model_lr_1.predict(X_test))))

Accuracy for the regression: 0.83
Confusion Matrix for the regression: 
 [[95 15]
 [15 54]]
Precision - how good the model is at predicting survivors: 0.78
Recall - out of all survivors how many were predicted: 0.78


In [183]:
# Model Coefficients or Model Weights/Parameters for each of the inputs. Allows to see the impact of each input
model_lr_1.coef_

array([[-0.03269776,  0.00421194, -0.52678741,  0.63495839, -1.11331495,
        -0.01565171, -0.30461563, -0.53291311,  0.39978536,  0.97618985,
         0.2816164 , -0.28284974, -0.52112569,  0.58637084,  0.1314043 ,
        -0.71733942,  0.22267681,  1.10011527,  0.15521095, -1.55631686,
         0.73586292, -0.15641711, -0.50069625, -0.12249976, -0.04209096,
         0.00383228,  0.16119416,  0.12092417,  0.11394671, -0.23443516,
        -0.18259502,  0.18303074]])

# Improving the Model

##### Hyperparameter Optimization 
Allows us to use the model that has the best combination of parameters

##### K-Fold Cross Validation
Technique to improve model by using a cross validation set to build the best model then only test on the test dataset one time. Train data split into 3(folds), 1 for testing one for training and get a score. Then a different split of 3 is used to get a score. This method will help reduce the chances of over/underfitting

##### Feature Normalization and Standardization
Most algorithms perform better if the features are on the same scale like 0 to 1. Specifically Standardization is able to account for the distribution of each feature by ensuring that along with the 0 to 1 scale the mean is made to be 0 and the variance 1

## Hyperparameter Optimization + Cross Validation

In [184]:
from sklearn.model_selection import GridSearchCV

# Creating a model
model_lr = LogisticRegression(random_state = 0, max_iter = 5000) # 5000 to remove max iter error

# Creating parameters to try in the process
parameters = {'C' : [1.0, 10.0, 50.0, 100.0, 1000.0],
'penalty' : ['l1', 'l2'],
'solver' : ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']}

# Here the model is designed with the parameters that will be iterated through as well as the K = 3 cross validation
# The train will be further split into 3 folds during this process
clf = GridSearchCV(model_lr, param_grid = parameters, cv = 3, n_jobs=-1)

Where C parameter controls how complex or over/underfit the regression is, cv is the cross validation used to come up with the optimal arrangement, n_jobs = -1  = run everything in parallel

In [185]:
# Passing the train data to train different models with the parameter combinations specified above to find the best model
clf.fit(X_train, y_train)

clf.best_params_

{'C': 1.0, 'penalty': 'l1', 'solver': 'liblinear'}

In [186]:
# Prints the accuracy of the model with the best parameter arrangement
print('Best Score: {:.2f} Accuracy'.format((clf.best_score_)))

Best Score: 0.83 Accuracy


Not much improvement but it is still best practice to perform this method

In [187]:
# evalute model on test
print('The accuracy for the Logistic Regression V2: {:.2f}'.format(clf.score(X_test, y_test)))

The accuracy for the Logistic Regression V2: 0.83


 # Feature Normalization and Standardization

In [188]:
from sklearn.preprocessing import MinMaxScaler, StandardScaler

#### Feature Normalization

In [189]:
# Feature Normalization on the train data
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)

# Looking at min and max to check result
X_train_scaled[:,0].min(), X_train_scaled[:,0].max()

(0.0, 1.0)

In [190]:
# Normalize test data features as well
X_test_scaled = scaler.transform(X_test)

#### Feature Standardization

In [191]:
# Feature Standardization
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

#### Create Model After Standardizing

In [192]:
# Find the best model and perform cross validation with the standardized features
model_lr = LogisticRegression(random_state = 0, max_iter = 5000) # 5000 to remove max iter error
parameters = {'C' : [1.0, 10.0, 50.0, 100.0, 1000.0],
'penalty' : ['l1', 'l2'],
'solver' : ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']}

clf = GridSearchCV(model_lr, param_grid = parameters, cv = 3)
clf.fit(X_train_scaled, y_train)

ValueError: Solver newton-cg supports only 'l2' or 'none' penalties, got l1 penalty.

ValueError: Solver lbfgs supports only 'l2' or 'none' penalties, got l1 penalty.

ValueError: Solver sag supports only 'l2' or 'none' penalties, got l1 penalty.

ValueError: Solver newton-cg supports only 'l2' or 'none' penalties, got l1 penalty.

ValueError: Solver lbfgs supports only 'l2' or 'none' penalties, got l1 penalty.

ValueError: Solver sag supports only 'l2' or 'none' penalties, got l1 penalty.

ValueError: Solver newton-cg supports only 'l2' or 'none' penalties, got l1 penalty.

ValueError: Solver lbfgs supports only 'l2' or 'none' penalties, got l1 penalty.

ValueError: Solver sag supports only 'l2' or 'none' penalties, got l1 penalty.

ValueError: Solver newton-cg supports only 'l2' or 'none' penalties, got l1 penalty.

ValueError: Solver lbfgs supports only 'l2' or 'none' penalties, got l1 penalty.

ValueError: Solver sag supports only 'l2' or 'none' penalties, got l1 penalty.

ValueErr

GridSearchCV(cv=3, error_score=nan,
             estimator=LogisticRegression(C=1.0, class_weight=None, dual=False,
                                          fit_intercept=True,
                                          intercept_scaling=1, l1_ratio=None,
                                          max_iter=5000, multi_class='auto',
                                          n_jobs=None, penalty='l2',
                                          random_state=0, solver='lbfgs',
                                          tol=0.0001, verbose=0,
                                          warm_start=False),
             iid='deprecated', n_jobs=None,
             param_grid={'C': [1.0, 10.0, 50.0, 100.0, 1000.0],
                         'penalty': ['l1', 'l2'],
                         'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag',
                                    'saga']},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=0)

In [193]:
clf.best_score_

0.8132231795671855

In [194]:
# evalute model on test
print('The accuracy for the Logistic Regression V3 after standardization: {:.2f}'.format(clf.score(X_test_scaled, y_test)))

# Confusion Matrix
print('Confusion Matrix for the regression: \n {0}'.format(confusion_matrix(y_test, clf.predict(X_test))))

# Precision and Recall scores
print('Precision - how good the model is at predicting survivors: {:.2f}'.format(precision_score(y_test, clf.predict(X_test))))
print('Recall - out of all survivors how many were predicted: {:.2f}\n'.format(recall_score(y_test, clf.predict(X_test))))

#Prior Result
print('''\nAccuracy for the regression before standardization: 0.83\nConfusion Matrix for the regression: 
 [[95 15]
 [15 54]]
Precision - how good the model is at predicting survivors: 0.78
Recall - out of all survivors how many were predicted: 0.78''')

The accuracy for the Logistic Regression V3 after standardization: 0.84
Confusion Matrix for the regression: 
 [[106   4]
 [ 48  21]]
Precision - how good the model is at predicting survivors: 0.84
Recall - out of all survivors how many were predicted: 0.30


Accuracy for the regression before standardization: 0.83
Confusion Matrix for the regression: 
 [[95 15]
 [15 54]]
Precision - how good the model is at predicting survivors: 0.78
Recall - out of all survivors how many were predicted: 0.78


Standardization usually does not have a large affect on Logistic Regression Models. With the slight increase in accuracy there was a drop in Recall to 30% but an increase in Precision to 84%. With that the Logistic Regression Model has being optimized to predict the survivors of the Titantic with accuracy of 84%.

# Model Persistance
To make the model resuable to use whenever

In [195]:
import pickle

# file path to put model
model_file_path = r'C:\Users\Owner\Desktop\School\Prog in Pyth C996\Notebooks\lr_model.pkl'
scaler_file_path = r'C:\Users\Owner\Desktop\School\Prog in Pyth C996\Notebooks\lr_scaler.pkl'

# open the files to write
model_file_pickle = open(model_file_path, 'wb')
scaler_file_pickle = open(scaler_file_path, 'wb')

# Persist the model 
pickle.dump(clf, model_file_pickle)
pickle.dump(scaler, scaler_file_pickle)

# Close the file
model_file_pickle.close()
scaler_file_pickle.close()

In [196]:
# opening in read mode
model_file_pickle = open(model_file_path, 'rb')
scaler_file_pickle = open(scaler_file_path, 'rb')

# loading file
clf_loaded = pickle.load(model_file_pickle)
scaler_loaded = pickle.load(scaler_file_pickle)

# close
model_file_pickle.close()
scaler_file_pickle.close()

In [197]:
# Test to see if we can see the log reg classifier
clf_loaded

GridSearchCV(cv=3, error_score=nan,
             estimator=LogisticRegression(C=1.0, class_weight=None, dual=False,
                                          fit_intercept=True,
                                          intercept_scaling=1, l1_ratio=None,
                                          max_iter=5000, multi_class='auto',
                                          n_jobs=None, penalty='l2',
                                          random_state=0, solver='lbfgs',
                                          tol=0.0001, verbose=0,
                                          warm_start=False),
             iid='deprecated', n_jobs=None,
             param_grid={'C': [1.0, 10.0, 50.0, 100.0, 1000.0],
                         'penalty': ['l1', 'l2'],
                         'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag',
                                    'saga']},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=0)

In [198]:
# Test to see of the scaler is there
scaler_loaded

StandardScaler(copy=True, with_mean=True, with_std=True)

# The Final Test
Now we can finally test the fully functional algorithm on the test data we read in at the beginning

In [199]:
# Converting data into numeric
X_f_test = test_df.to_numpy(dtype = 'float')
# Scaling the data with the loaded scaler object to get the 0 to 1 ranges
X_f_test_scaled = scaler_loaded.transform(X_f_test)

In [207]:
# Predicting the suvivors of the test data csv

y_f = clf_loaded.predict(X_f_test_scaled)

In [211]:
# We have predictions for the 418 entries in the final test csv
len(y_f)

418

In [219]:
unique, counts = np.unique(y_f, return_counts=True)
dict(zip(unique, counts))

{0: 247, 1: 171}

Finally we can see the results of the model on unknown data where the actual values are not known. Because of this there is no way to check the accuracy of this output. The model predicted when looking at 418 instances that 171 would be survivors and 247 would unfortunately not survive, a survival rate of ~40%.