# What is Regression?
Regression is a type of analysis in machine learning that models the relationship between a dependent variable and one or more independent variables. For instance, whether a tumour is malignant or benign can be related to a patient's age, gender and weight.
Logistic regression is a type of regression used for classification. It gives 'Yes' or 'No' outputs, e.g. did an individual survive the Titanic or not?
The logistic regression equation traces a sigmoid (S-shaped) curve whose output lies between 0 and 1, whereas the linear regression equation describes a straight line.
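
To make the contrast concrete, here is a minimal sketch of the two curves using only the standard library. The slope `w` and intercept `b` are arbitrary values chosen for illustration and are not fitted to the Titanic data. The 0.5 cutoff is the usual default for turning a probability into a 'Yes' or 'No'.

```python
import math

def linear(x, w=1.0, b=0.0):
    """Straight line: the output is unbounded."""
    return w * x + b

def sigmoid(z):
    """Logistic (sigmoid) function: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

for x in (-4, -1, 0, 1, 4):
    z = linear(x)          # linear regression output
    p = sigmoid(z)         # logistic regression output, read as a probability
    label = int(p >= 0.5)  # 1 = 'Yes', 0 = 'No'
    print(f"x={x:>2}  line={z:>5.1f}  sigmoid={p:.3f}  class={label}")
```

The line keeps growing with `x`, while the sigmoid flattens out near 0 and 1, which is what makes it suitable for classification.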



Let's start by importing the necessary libraries we are going to use.


In [3]:
# data analysis and wrangling
import pandas as pd
import numpy as np

# for machine learning
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report


In [4]:
# Loading the dataset to a variable

df = pd.read_csv('titanic.csv')

The Titanic dataset can be explored and visualized in many ways, but here we'll just print the number of rows, the column names and data types, and the first ten rows to get a feel for the data.

In [5]:
# print number of rows
print("no of rows are :",len(df))

# print names of columns
print(df.columns.values)

# print column datatypes
print(df.dtypes)

# print the first 10 rows
print(df.head(10))

# print a description of the data
df.describe()

no of rows are : 891
['PassengerId' 'Survived' 'Pclass' 'Name' 'Sex' 'Age' 'SibSp' 'Parch'
 'Ticket' 'Fare' 'Cabin' 'Embarked']
PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object
   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   
5            6         0       3   
6            7         0       1   
7            8         0       3   
8            9         1       3   
9           10         1       2   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (

| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% | 446.0 | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 668.5 | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |


It's always good practice to check for null values in the dataset before training. 

In [6]:
# Check for null values in columns
print(df.isna())

# Get sum of null values in each column
df.isna().sum()

     PassengerId  Survived  Pclass   Name    Sex    Age  SibSp  Parch  Ticket  \
0          False     False   False  False  False  False  False  False   False   
1          False     False   False  False  False  False  False  False   False   
2          False     False   False  False  False  False  False  False   False   
3          False     False   False  False  False  False  False  False   False   
4          False     False   False  False  False  False  False  False   False   
..           ...       ...     ...    ...    ...    ...    ...    ...     ...   
886        False     False   False  False  False  False  False  False   False   
887        False     False   False  False  False  False  False  False   False   
888        False     False   False  False  False   True  False  False   False   
889        False     False   False  False  False  False  False  False   False   
890        False     False   False  False  False  False  False  False   False   

      Fare  Cabin  Embarked

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

The output shows null values in the 'Age', 'Cabin' and 'Embarked' columns, so we have to deal with them. Because more than 70% of the values in the Cabin column are missing (687 of 891), I'll drop that column.
For the Age column (177 of 891 missing, about 20%) I'll replace the null values with the mean age. The two missing 'Embarked' values need no treatment here because that column is dropped later on.
As a rule of thumb, this technique is advised only where no more than about 30% of the values are missing.
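
The decision can be written down as a small check. The sketch below uses plain Python and the counts copied from the `df.isna().sum()` output above. The 30% cutoff is the rule of thumb used in this tutorial, not a hard rule, and the helper name `decide` is made up for this illustration. With pandas, `df.isna().mean()` gives the same fractions directly.

```python
# Missing-value counts copied from the df.isna().sum() output.
n_rows = 891
missing = {"Age": 177, "Cabin": 687, "Embarked": 2}

# Rule-of-thumb cutoff (an assumption, adjust to your own data).
MAX_IMPUTE_FRACTION = 0.30

def decide(count, total, cutoff=MAX_IMPUTE_FRACTION):
    """Return the missing fraction and a suggested action for one column."""
    fraction = count / total
    action = "impute" if fraction <= cutoff else "drop column"
    return fraction, action

decisions = {col: decide(n, n_rows) for col, n in missing.items()}
for col, (fraction, action) in decisions.items():
    print(f"{col:<9}{fraction:6.1%}  -> {action}")
```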

In [7]:
# Drop cabin column

df.drop('Cabin',axis=1,inplace=True)

In [8]:
# Replacing the null values in Age with the mean of age

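# Assign the result back instead of using inplace=True on a column selection,
# which recent pandas versions warn about and may not apply to df
df['Age'] = df['Age'].fillna(df['Age'].mean())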

When you run ```df.dtypes``` you realize that some of the columns, like Sex, Ticket, Name and Embarked, are of type `object` (strings). Since our model only accepts numeric values, we will have to convert the useful columns to numbers and drop any columns that are not useful to the model.

<b>Let's start by converting the Sex column to integers</b><br>
Using the `get_dummies` function we get two columns, female and male, each holding an indicator value (1 or 0) of whether the passenger is female or male.

In [9]:
# getting boolean values for each column
pd.get_dummies(df['Sex'])

| | female | male |
|---|---|---|
| 0 | 0 | 1 |
| 1 | 1 | 0 |
| 2 | 1 | 0 |
| 3 | 1 | 0 |
| 4 | 0 | 1 |
| ... | ... | ... |
| 886 | 0 | 1 |
| 887 | 1 | 0 |
| 888 | 1 | 0 |
| 889 | 0 | 1 |


When you run the above code you realize that the two columns carry the same information, so one is enough. To keep only one we pass the argument `drop_first=True`, which drops the first column (female).

In [10]:
pd.get_dummies(df['Sex'],drop_first=True)

| | male |
|---|---|
| 0 | 1 |
| 1 | 0 |
| 2 | 0 |
| 3 | 0 |
| 4 | 1 |
| ... | ... |
| 886 | 1 |
| 887 | 0 |
| 888 | 0 |
| 889 | 1 |
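
Why is one column enough? With only two categories, each indicator is the complement of the other, so the second column adds no information. Here is a small plain-Python sketch using the Sex values of the first five passengers from the output above. The helper `one_hot` is hypothetical and written only for this illustration. It is not how `get_dummies` is implemented.

```python
# Sex values for the first five passengers, as shown in the outputs above.
sexes = ["male", "female", "female", "female", "male"]

def one_hot(values, category):
    """Return a 0/1 indicator list: 1 where the value equals `category`."""
    return [int(v == category) for v in values]

male = one_hot(sexes, "male")
female = one_hot(sexes, "female")

# Exactly one indicator is set per row, so female is always 1 - male.
redundant = all(f == 1 - m for f, m in zip(female, male))
print(male, female, redundant)
```

Dropping one of the two columns also avoids feeding the model two perfectly correlated features.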


Let's add a new Gender column to the dataset, with 1 for male and 0 for female.

In [11]:
# Add the Gender column
# select the single 'male' indicator column explicitly so a Series is assigned
df['Gender'] = pd.get_dummies(df['Sex'], drop_first=True)['male']

# Verifying the new Gender column
print(df.columns.values,df['Gender'])

['PassengerId' 'Survived' 'Pclass' 'Name' 'Sex' 'Age' 'SibSp' 'Parch'
 'Ticket' 'Fare' 'Embarked' 'Gender'] 0      1
1      0
2      0
3      0
4      1
      ..
886    1
887    0
888    0
889    1
890    1
Name: Gender, Length: 891, dtype: uint8


Drop the remaining non-numeric columns, which we won't use in this model.

In [12]:
df.drop(['Sex','Name','Embarked','Ticket'],axis=1,inplace=True)

Confirm that only numeric values are left in the dataset.

In [13]:
df.head()

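| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare | Gender |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | 22.0 | 1 | 0 | 7.25 | 1 |
| 1 | 2 | 1 | 1 | 38.0 | 1 | 0 | 71.2833 | 0 |
| 2 | 3 | 1 | 3 | 26.0 | 0 | 0 | 7.925 | 0 |
| 3 | 4 | 1 | 1 | 35.0 | 1 | 0 | 53.1 | 0 |
| 4 | 5 | 0 | 3 | 35.0 | 0 | 0 | 8.05 | 1 |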


## Separate the dependent and independent variables
The purpose of our model is to predict whether or not a passenger survived the Titanic. The dependent variable, or target, is `y` (the 'Survived' column) and the independent variables, or features, are `X`.

In [14]:
X = df[['Pclass','Age','SibSp','Parch' ,'Fare','Gender']]
y = df['Survived']

## Data splitting

Before building the model we'll have to split the dataset into training and testing datasets.

In [15]:
x_train,x_test,y_train,y_test = train_test_split(X,y,test_size = 0.33, random_state= 42)
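
To illustrate what the split does, the sketch below re-creates the idea with the standard library only: shuffle the row indices with a fixed seed, then hold out a fraction of them for testing. It is a stand-in for illustration, not scikit-learn's implementation, and it will not pick the same rows as `train_test_split`. The function name `simple_split` is hypothetical. Rounding the test share up gives 295 of the 891 rows, consistent with the support shown in the classification report further down.

```python
import math
import random

def simple_split(n_rows, test_size=0.33, seed=42):
    """Shuffle row indices reproducibly and hold out a fraction for testing."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)     # fixed seed -> same split every run
    n_test = math.ceil(test_size * n_rows)   # round the test share up
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = simple_split(891)
print(len(train_idx), len(test_idx))
```

Fixing the seed, like `random_state=42` in the cell above, keeps the split the same between runs so results can be compared.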

## Training the model
We'll train the model using the `LogisticRegression` class we imported from `sklearn.linear_model`.

In [16]:
# Model training
Model = LogisticRegression()

Model.fit(x_train,y_train)


## Making predictions

We give the model the testing data `x_test` so it can predict who survived the Titanic. These predictions are stored in the variable `predict`.

In [17]:

predict = Model.predict(x_test)

## Testing the model's performance
We can test the model's performance using a `confusion_matrix`, which outputs a matrix with the counts of true negatives, false positives, false negatives and true positives.


In [18]:

confusion_matrix(y_test,predict)


array([[156,  19],
       [ 34,  86]])

Since the raw output is not easy to read, we convert it to a labelled data frame.

In [19]:
pd.DataFrame(
    confusion_matrix(y_test, predict),
    columns=['Predicted did not survive', 'Predicted survived'],
    index=['Actually did not survive', 'Actually survived'],
)

| | Predicted did not survive | Predicted survived |
|---|---|---|
| Actually did not survive | 156 | 19 |
| Actually survived | 34 | 86 |


We can also generate a `classification_report`, which summarizes the model's performance in terms of precision, recall and f1-score for each class, along with the overall accuracy.


In [20]:
# Generating the classification report

print(classification_report(y_test,predict))

              precision    recall  f1-score   support

           0       0.82      0.89      0.85       175
           1       0.82      0.72      0.76       120

    accuracy                           0.82       295
   macro avg       0.82      0.80      0.81       295
weighted avg       0.82      0.82      0.82       295
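
The figures in the report can be recomputed by hand from the confusion matrix, which helps to see what each metric means. Below is a minimal sketch in plain Python using the four counts from the matrix above, with "survived" treated as the positive class. The variable names are my own.

```python
# Counts from the confusion matrix: rows are actual, columns are predicted.
tn, fp = 156, 19   # actually did not survive
fn, tp = 34, 86    # actually survived

precision = tp / (tp + fp)   # of the predicted survivors, how many survived
recall = tp / (tp + fn)      # of the actual survivors, how many were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.2f} recall={recall:.2f} "
      f"f1={f1:.2f} accuracy={accuracy:.2f}")
```

Rounded to two decimals these agree with row `1` of the report (0.82, 0.72, 0.76) and with the overall accuracy (0.82). Swapping the roles of the two classes gives row `0` in the same way.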



To improve the model's performance you can use more features, for example by encoding and including some of the columns we dropped, or try a different model.