<a href="https://colab.research.google.com/github/DevanshParmar/ICG-Summer-Program-2021-DS/blob/main/Logistic_Regression_Model_on_Titanic_Survival_Dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Logistic Regression Model on Titanic Survival Dataset**
This notebook implements logistic regression, one of the most basic machine learning models, from scratch with NumPy and applies it to the Titanic survival dataset.

#### **Uploads**
Setting up libraries and uploading dataset files.


In [132]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from google.colab import files

In [133]:
upload_train = files.upload()
upload_test = files.upload()

Saving train.csv to train (1).csv


Saving test.csv to test (1).csv


#### **Data Preprocessing**
Load each CSV into a dataframe, drop the `PassId` column, and encode `Sex` as a single numeric `male` column (male = 1, female = 0).
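
As a small illustration of the encoding step, the sketch below applies `pd.get_dummies` with `drop_first=True` to a toy dataframe. The `passengers` frame and its values are made up for this example. Passing `dtype=int` is an addition not present in the notebook's cell: recent pandas versions return boolean dummy columns by default, so the explicit dtype keeps the 0/1 values shown in the outputs here.

```python
import pandas as pd

# Toy data, invented for illustration only
passengers = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Sex": ["male", "female", "female", "male"],
})

# Categories are sorted (female, male); drop_first removes "female",
# leaving one indicator column named "male"
is_male = pd.get_dummies(passengers["Sex"], drop_first=True, dtype=int)

# Replace the text column with the numeric indicator
encoded = pd.concat([passengers.drop(columns="Sex"), is_male], axis=1)
print(encoded)
```

Dropping the first category avoids carrying two perfectly correlated columns (`female` and `male`) into the model.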

In [134]:
col_names = ['PassId', 'Survived', 'PClass', 'Sex', 'Age', 'SibSp', 'ParCh', 'Fare']

print("Input Dataframes:")
data = pd.read_csv('/content/train.csv')
data.columns = col_names
print(data.head())
data.pop("PassId")
print(" ")

test = pd.read_csv('/content/test.csv')
test.columns = col_names
print(test.head())
test.pop("PassId")
print(" ")

print("Output Dataframes:")
ifmale = pd.get_dummies(data['Sex'], drop_first = True)
data = pd.concat([data, ifmale], axis = 1)
data.pop("Sex")
print(data.head())
print(" ")

ifmale = pd.get_dummies(test['Sex'], drop_first = True)
test = pd.concat([test, ifmale], axis = 1)
test.pop("Sex")
print(test.head())

Input Dataframes:
   PassId  Survived  PClass     Sex   Age  SibSp  ParCh     Fare
0       1         0       3    male  22.0      1      0   7.2500
1       2         1       1  female  38.0      1      0  71.2833
2       3         1       3  female  26.0      0      0   7.9250
3       4         1       1  female  35.0      1      0  53.1000
4       5         0       3    male  35.0      0      0   8.0500
 
   PassId  Survived  PClass   Sex   Age  SibSp  ParCh     Fare
0     621         0       3  male  27.0      1      0  14.4542
1     622         1       1  male  42.0      1      0  52.5542
2     623         1       3  male  20.0      1      1  15.7417
3     624         0       3  male  21.0      0      0   7.8542
4     625         0       3  male  21.0      0      0  16.1000
 
Output Dataframes:
   Survived  PClass   Age  SibSp  ParCh     Fare  male
0         0       3  22.0      1      0   7.2500     1
1         1       1  38.0      1      0  71.2833     0
2         1       3  26.0 

Fill missing (NaN) `Age` values in each dataset with that dataset's median age.
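
The cell below imputes each dataset with its own median. A common alternative, sketched here on toy data, is to compute the median on the training set only and reuse it for the test set, so that no statistic of the test data influences preprocessing. The frames `train_df` and `test_df` are hypothetical stand-ins, not the notebook's `data` and `test`.

```python
import numpy as np
import pandas as pd

# Toy frames, invented for illustration only
train_df = pd.DataFrame({"Age": [22.0, np.nan, 26.0, 35.0]})
test_df = pd.DataFrame({"Age": [np.nan, 40.0]})

# Series.median skips NaN by default, so this is the median of 22, 26, 35
train_median = train_df["Age"].median()

# Reuse the training median for both sets
train_filled = train_df["Age"].fillna(train_median)
test_filled = test_df["Age"].fillna(train_median)

print(train_median)
print(test_filled.tolist())
```

Whether this changes the results noticeably depends on how different the two medians are; it is shown only as a design option.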

In [135]:
data['Age'] = data['Age'].fillna(data['Age'].median())
test['Age'] = test['Age'].fillna(test['Age'].median())
print(data)
print(" ")
print(test)

     Survived  PClass   Age  SibSp  ParCh     Fare  male
0           0       3  22.0      1      0   7.2500     1
1           1       1  38.0      1      0  71.2833     0
2           1       3  26.0      0      0   7.9250     0
3           1       1  35.0      1      0  53.1000     0
4           0       3  35.0      0      0   8.0500     1
..        ...     ...   ...    ...    ...      ...   ...
615         1       2  24.0      1      2  65.0000     0
616         0       3  34.0      1      1  14.4000     1
617         0       3  26.0      1      0  16.1000     0
618         1       2   4.0      2      1  39.0000     0
619         0       2  26.0      0      0  10.5000     1

[620 rows x 7 columns]
 
     Survived  PClass   Age  SibSp  ParCh     Fare  male
0           0       3  27.0      1      0  14.4542     1
1           1       1  42.0      1      0  52.5542     1
2           1       3  20.0      1      1  15.7417     1
3           0       3  21.0      0      0   7.8542     1
4    

Convert the dataframes to NumPy arrays.

In [136]:
input_train = data[['PClass', 'Age', 'SibSp', 'ParCh', 'Fare', 'male']].to_numpy()
output_train = data['Survived'].to_numpy()
input_train.shape, len(output_train)

input_test = test[['PClass','Age','SibSp','ParCh','Fare','male']].to_numpy()
output_test = test['Survived'].to_numpy()
input_test.shape, len(output_test)

((271, 6), 271)

#### **Modeling**
In the next blocks, we define, train, and optimize our model.
1. The first block defines the sigmoid function, which logistic regression uses to map a linear score to a probability.
2. The second block defines the optimization function (batch gradient descent).
3. The third block initializes all parameters to 0.
4. The fourth block trains the model and prints the trained weights (theta parameters).
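
For reference, the computation performed by the optimization function in the second block can be written out as follows. The symbols are chosen here for illustration: $m$ is the number of training examples (`size` in the code), $\alpha$ the learning rate, $w$ the weights, $b$ the bias, and $X$ the matrix whose rows are the examples $x^{(i)}$.

$$h^{(i)} = g\left(b + w^\top x^{(i)}\right), \qquad g(z) = \frac{1}{1 + e^{-z}}$$

$$J(w, b) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h^{(i)} + \left(1 - y^{(i)}\right) \log\left(1 - h^{(i)}\right) \right]$$

$$\frac{\partial J}{\partial w} = \frac{1}{m} X^\top (h - y), \qquad \frac{\partial J}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} \left(h^{(i)} - y^{(i)}\right)$$

$$w \leftarrow w - \alpha \frac{\partial J}{\partial w}, \qquad b \leftarrow b - \alpha \frac{\partial J}{\partial b}$$

The two gradient expressions correspond to `del_weights` and `del_bias` in the code, and the last line is the update applied on every iteration.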


In [137]:
def g(z):
    return 1/(1+np.exp(-z))
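
The function `g` evaluates `np.exp(-z)` directly. For large negative `z` this overflows and NumPy emits a runtime warning, even though the final result still rounds to 0. A possible variant, sketched below under the assumption that `z` is a NumPy array of at least one dimension, only ever exponentiates non-positive numbers. The name `stable_sigmoid` is hypothetical and is not used elsewhere in this notebook.

```python
import numpy as np

def stable_sigmoid(z):
    """Sigmoid that avoids overflow by never calling exp on a large positive number."""
    z = np.asarray(z, dtype=float)
    out = np.empty_like(z)
    pos = z >= 0
    # For z >= 0, exp(-z) lies in (0, 1], so 1 / (1 + exp(-z)) is safe
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    # For z < 0, use the equivalent form exp(z) / (1 + exp(z))
    ez = np.exp(z[~pos])
    out[~pos] = ez / (1.0 + ez)
    return out

print(stable_sigmoid(np.array([-1000.0, 0.0, 1000.0])))
```

Both branches are algebraically the same function as `g`; only the order of operations differs.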

In [138]:
def optimize(x, y, learning_rate, N_iterations, parameters):
    size = x.shape[0]                                                           #number of training examples (620 here)
    weights = parameters["weights"]                                             #theta1-6
    bias = parameters["bias"]                                                   #theta0
    for i in range(N_iterations):
        h = g(bias + np.dot(x, weights))                                        #h becomes the hypothesis function
        loss = -1/size * np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))        #negative mean log-likelihood (cross-entropy); computed for reference only
        del_weights = 1/size * np.dot(x.T, (h-y))                               #change in weights
        del_bias = 1/size * np.sum(h-y)                                         #change in bias
        weights = weights - learning_rate * del_weights                         #learning
        bias = bias - learning_rate * del_bias                                  #learning
    
    parameters["weights"] = weights
    parameters["bias"] = bias
    return parameters
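
One way to check a gradient descent loop like `optimize` is to record the loss on every iteration and confirm that it falls. The sketch below does this on a tiny one-feature dataset invented for the purpose. The helper names (`sigmoid`, `cross_entropy`, `gradient_descent`) and the clipping constant `eps` are assumptions made for this example and are not part of the notebook's code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(x, y, weights, bias, eps=1e-12):
    # Clip probabilities away from 0 and 1 so the logs stay finite
    h = np.clip(sigmoid(bias + x @ weights), eps, 1 - eps)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

def gradient_descent(x, y, learning_rate, n_iterations):
    weights = np.zeros(x.shape[1])
    bias = 0.0
    history = []
    for _ in range(n_iterations):
        h = sigmoid(bias + x @ weights)
        # Same update rule as optimize(): step against the mean gradient
        weights = weights - learning_rate * (x.T @ (h - y)) / x.shape[0]
        bias = bias - learning_rate * np.mean(h - y)
        history.append(cross_entropy(x, y, weights, bias))
    return weights, bias, history

# Toy data, invented for illustration: larger x means label 1
x_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([0, 0, 1, 1])

w, b, losses = gradient_descent(x_toy, y_toy, learning_rate=0.1, n_iterations=500)
print(losses[0], losses[-1])
```

With all parameters at zero the loss equals log 2, so a recorded loss that starts below that value and keeps shrinking suggests the updates point in the right direction.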

In [139]:
initial_parameters = {}
initial_parameters["weights"] = np.zeros(input_train.shape[1])
initial_parameters["bias"] = 0

In [140]:
def train(x, y, lr, N_it):
    return optimize(x, y, lr, N_it, initial_parameters)

theta = train(input_train, output_train, lr = 0.005, N_it = 5000)
print(theta['weights'])

[-0.06204788  0.00468488 -0.31236042  0.06896576  0.03489376 -1.67715314]


#### **Predictions and Accuracy**
The next two blocks compute summary statistics for the model: accuracy, loss (here, the number of misclassified passengers), F1 score, sensitivity (recall), and precision.

In [141]:
def stats(dataset):
    # linear score for each passenger: bias + weights . features
    z = []
    for i in range(dataset.shape[0]):
        p  = 0.0
        p += theta["bias"]
        p += theta["weights"][0] * dataset['PClass'][i]
        p += theta["weights"][1] * dataset['Age'][i]
        p += theta["weights"][2] * dataset['SibSp'][i]
        p += theta["weights"][3] * dataset['ParCh'][i]
        p += theta["weights"][4] * dataset['Fare'][i]
        p += theta["weights"][5] * dataset['male'][i]
        z.append(p)
    # sigmoid turns each score into a survival probability
    sigmoids = []
    for val in z:
        sigmoids.append(g(val))
    # threshold at 0.5 to get a 0/1 prediction
    predictions = []
    for p in sigmoids:
        if p >= 0.5:
            predictions.append(1)
        else:
            predictions.append(0)
    prediction = np.array(predictions)
    survive_data = np.array(dataset['Survived'])
    # confusion-matrix counters; loss counts misclassified passengers
    loss = 0
    f_neg = 0
    f_pos = 0 
    t_neg = 0
    t_pos = 0
    #                              
    for i, j in zip(prediction, survive_data):
        if i == 1 and j == 1:
            t_pos+=1
        elif i == 1 and j == 0:
            f_pos+=1
            loss+=1
        elif i==0 and j == 1:
            f_neg+=1
            loss+=1
        else:
            t_neg+=1
    # recall (sensitivity), precision, accuracy and F1 score from the counts
    rec = t_pos / (t_pos + f_neg)
    prc = t_pos / (t_pos + f_pos)
    acc = (t_pos + t_neg) / (t_pos + t_neg + f_pos + f_neg)
    f1s = 2 * prc * rec / (prc + rec)
    #                              
    print('   Accuracy is {:.2f}%'.format(100*acc))
    print('       Loss is',loss)
    print('   F1 Score is {:.4f}'.format(f1s))
    print('Sensitivity is {:.4f}'.format(rec))
    print('  Precision is {:.4f}'.format(prc))
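
The counts that `stats` accumulates in a loop can also be obtained with boolean masks. The sketch below is one such variant on made-up labels. The function name `classification_metrics`, its dictionary keys, and the choice to return 0.0 when a denominator is empty are assumptions of this example, not behavior of `stats`.

```python
import numpy as np

def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary 0/1 labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    t_pos = int(np.sum((y_pred == 1) & (y_true == 1)))
    f_pos = int(np.sum((y_pred == 1) & (y_true == 0)))
    f_neg = int(np.sum((y_pred == 0) & (y_true == 1)))
    t_neg = int(np.sum((y_pred == 0) & (y_true == 0)))
    # Fall back to 0.0 when a denominator is empty (assumption of this sketch)
    precision = t_pos / (t_pos + f_pos) if (t_pos + f_pos) else 0.0
    recall = t_pos / (t_pos + f_neg) if (t_pos + f_neg) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (t_pos + t_neg) / len(y_true),
        "misclassified": f_pos + f_neg,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Made-up labels: 2 true positives, 1 false positive, 1 false negative, 2 true negatives
example = classification_metrics([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 1, 0])
print(example)
```

The `misclassified` entry corresponds to what `stats` prints as "Loss".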

In [142]:
print("Statistics for Training dataset are:")
print(" ")
stats(data)
print(" ")
print(" ")
print("Statistics for Test dataset are:")
print(" ")
stats(test)

Statistics for Training dataset are:
 
   Accuracy is 74.52%
       Loss is 158
   F1 Score is 0.7008
Sensitivity is 0.7582
  Precision is 0.6514
 
 
Statistics for Test dataset are:
 
   Accuracy is 78.23%
       Loss is 59
   F1 Score is 0.7177
Sensitivity is 0.7653
  Precision is 0.6757


#### **References**

1. These lecture notes were extremely helpful in understanding the mathematics of logistic regression: https://see.stanford.edu/materials/aimlcs229/cs229-notes1.pdf
2. Many Towards Data Science (TDS) articles were helpful, especially: www.towardsdatascience.com/building-a-logistic-regression-in-python-step-by-step-becd4d56c9c8
3. Another TDS blog was a good help: www.towardsdatascience.com/optimization-loss-function-under-the-hood-part-ii-d20a239cde11
4. This Exsilio blog was very helpful in visualizing the final statistics of the model: www.blog.exsilio.com/all/accuracy-precision-recall-f1-score-interpretation-of-performance-measures/