# What is Regression?
Regression is a type of analysis in machine learning that models the relationship between a dependent variable and one or more independent variables. For instance, it can relate whether a tumour is malignant or benign to a patient's age, gender and weight.
Logistic regression is a type of regression used for classification. It gives 'Yes' or 'No' outputs, e.g. did an individual survive the Titanic or not?
The logistic regression equation traces a sigmoid curve, whereas the linear regression equation describes a straight line.
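
To make the contrast concrete, here is a minimal sketch in plain Python. The helper names `sigmoid` and `linear` and the slope of 0.5 are illustrative assumptions, not part of this notebook. It shows that a straight line is unbounded, while the sigmoid squeezes the same values into the range between 0 and 1, which is why its output can be read as a probability.

```python
import math


def sigmoid(z):
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def linear(x, slope=0.5, intercept=0.0):
    """A straight line; slope and intercept are made-up values."""
    return slope * x + intercept


for x in (-10, -2, 0, 2, 10):
    z = linear(x)
    # The linear value keeps growing; the sigmoid value levels off near 0 or 1
    print(f"x={x:>3}  linear={z:>5.1f}  sigmoid={sigmoid(z):.3f}")
```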



Let's start by importing the necessary libraries we are going to use.


In [1]:
# data analysis and wrangling
import pandas as pd
import numpy as np

# for machine learning
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report


In [2]:
# Loading the dataset to a variable

df = pd.read_csv('titanic.csv')

FileNotFoundError: [Errno 2] No such file or directory: 'titanic.csv'

The Titanic dataset can be explored and visualized in many ways. For now, we'll just print the number of rows, the column names and the first ten rows to get a sense of the columns and data types present in the data.

In [None]:
# print number of rows
print("no of rows are :",len(df))

# print names of columns
print(df.columns.values)

# print column datatypes
print(df.dtypes)

# print the first 10 rows
print(df.head(10))

# print a description of the data
df.describe()

It's always good practice to check for null values in the dataset before training. 

In [None]:
# Check for null values in columns
print(df.isna())

# Get sum of null values in each column
df.isna().sum()

The `Age` and `Cabin` columns contain null values, so we have to deal with them before training. Because more than 70% of the values in the `Cabin` column are missing, we'll drop that column entirely.
For the `Age` column, we'll replace the null values with the mean age.
As a rule of thumb, this technique is advised only when no more than 30% of the values are missing.

In [None]:
# Drop cabin column

df.drop('Cabin',axis=1,inplace=True)

In [None]:
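# Replacing the null values in Age with the mean of age
# (assigning the result back avoids the chained inplace call, which newer pandas versions warn about)

df['Age'] = df['Age'].fillna(df['Age'].mean())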
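
The 70% and 30% thresholds mentioned above can be checked directly by looking at the share of missing values per column. The following is a minimal sketch on a made-up toy frame (the values and the name `toy` are assumptions for illustration, not the real Titanic file).

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "Age":   [22.0, np.nan, 30.0, 40.0, 28.0],
    "Cabin": [np.nan, np.nan, "C85", np.nan, np.nan],
})

# isna() gives True/False per cell; the mean of that is the fraction missing
missing_share = toy.isna().mean()
print(missing_share)

# Drop columns where more than 70% of the values are missing
to_drop = missing_share[missing_share > 0.7].index.tolist()
toy = toy.drop(columns=to_drop)

# Fill the remaining gaps in Age with the column mean
toy["Age"] = toy["Age"].fillna(toy["Age"].mean())
print(toy)
```

Here `Cabin` is 80% missing and gets dropped, while `Age` is 20% missing and gets filled with the mean of the known ages.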

When you run `df.dtypes` you'll see that some of the columns, such as `Sex`, `Ticket`, `Name` and `Embarked`, hold text (the `object` dtype) rather than numbers. Since our model only accepts numeric input, we will have to convert the useful columns to integers and drop any columns that are not useful to the model.

<b>Let's start by converting the Sex column to integers</b><br>
By using the `get_dummies` function we get two columns, `female` and `male`, each holding a boolean value that says whether the passenger is female or male.

In [None]:
# getting boolean values for each column
pd.get_dummies(df['Sex'])

When you run the above code you'll realize that the two columns carry the same information, so one of them is enough. To keep just one, we pass the parameter `drop_first=True`, which drops the first column (`female`).

In [None]:
pd.get_dummies(df['Sex'],drop_first=True)
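
As a quick sanity check of that reasoning, here is a small sketch on a made-up `Sex` series (the values and the names `sex`, `both` and `one` are assumptions for illustration). It confirms that the two dummy columns are exact opposites and shows which column survives `drop_first=True`.

```python
import pandas as pd

# Toy series, invented for illustration only
sex = pd.Series(["male", "female", "female", "male"], name="Sex")

both = pd.get_dummies(sex)                   # one column per category
one = pd.get_dummies(sex, drop_first=True)   # first category (alphabetically) dropped

print(both.columns.tolist())
print(one.columns.tolist())

# Every row is either female or male, so the two columns always disagree
print((both["female"] != both["male"]).all())
```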

Let's add a new `Gender` column to the dataset, which indicates whether the passenger is male.

In [None]:
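# Add the Gender column (1 = male, 0 = female)
# Select the single 'male' column so that a Series, not a DataFrame, is assigned
df['Gender'] = pd.get_dummies(df['Sex'], drop_first=True)['male'].astype(int)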

# Verifying the new Gender column
print(df.columns.values,df['Gender'])

Drop all the non-numeric columns that are not useful to the model.

In [None]:
df.drop(['Sex','Name','Embarked','Ticket'],axis=1,inplace=True)

Confirm that only numeric values are left in the dataset.

In [None]:
df.head()

## Separate the dependent and independent variables
The purpose of our model is to predict whether or not a passenger survived the Titanic. The dependent variable, or target, is `Survived` and goes in `y`. The independent variables (the features) go in `X`.

In [None]:
X = df[['Pclass','Age','SibSp','Parch' ,'Fare','Gender']]
y = df['Survived']

## Data splitting

Before building the model we'll have to split the dataset into training and testing datasets.

In [None]:
x_train,x_test,y_train,y_test = train_test_split(X,y,test_size = 0.33, random_state= 42)
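
To see what `test_size` and `random_state` do, here is a minimal sketch on made-up arrays (`X_toy`, `y_toy` and the other variable names are assumptions for illustration). With 100 rows and `test_size=0.33`, 33 rows are held out for testing, and reusing the same `random_state` reproduces exactly the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 rows, 2 features, invented for illustration only
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.arange(100) % 2

a_train, a_test, ya_train, ya_test = train_test_split(
    X_toy, y_toy, test_size=0.33, random_state=42
)
# Same seed again: the shuffle, and therefore the split, is identical
b_train, b_test, yb_train, yb_test = train_test_split(
    X_toy, y_toy, test_size=0.33, random_state=42
)

print(len(a_train), len(a_test))
print(np.array_equal(a_test, b_test))
```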

## Training the model
We'll train the model using the `LogisticRegression` class we imported from `sklearn.linear_model`.

In [None]:
# Model training
Model = LogisticRegression()

Model.fit(x_train,y_train)


## Making predictions

We give the model the testing data `x_test` so it can predict who survived the Titanic. These predictions are stored in the variable `predict`.

In [None]:

predict = Model.predict(x_test)
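
Behind the scenes, a binary logistic regression model produces a probability for each row, and `predict` turns it into a class label. The sketch below, on made-up one-feature data (`X_toy`, `y_toy` and `toy_model` are assumptions for illustration), suggests that the predicted labels match a 0.5 cut-off on the output of `predict_proba`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data: one feature, invented for illustration only
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

toy_model = LogisticRegression().fit(X_toy, y_toy)

# Column 1 of predict_proba is the estimated probability of class 1
proba = toy_model.predict_proba(X_toy)[:, 1]
labels = toy_model.predict(X_toy)

print(np.round(proba, 2))
print(labels)

# The labels agree with thresholding the probability at 0.5
print(np.array_equal(labels, (proba > 0.5).astype(int)))
```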

## Testing the model's performance
We can test the model's performance by using a `confusion_matrix`, which outputs a matrix with the counts of true negatives, false positives, false negatives and true positives.


In [None]:

confusion_matrix(y_test,predict)


Since the raw output is not easy to read, we convert it to a data frame with labelled rows and columns.

In [None]:
pd.DataFrame(
    confusion_matrix(y_test, predict),
    columns=['Predicted did not survive', 'Predicted survived'],
    index=['Actually did not survive', 'Actually survived'],
)
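
If you need the four counts as separate numbers, the binary confusion matrix can be flattened with `ravel()`. The labels below are made up for illustration (`y_true` and `y_pred` are assumptions, not the notebook's results). scikit-learn puts the actual classes on the rows and the predicted classes on the columns, so the flattened order is true negatives, false positives, false negatives, true positives.

```python
from sklearn.metrics import confusion_matrix

# Toy labels, invented for illustration only (1 = survived, 0 = did not survive)
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 0, 0, 1, 0, 1, 0]

# Rows are actual classes, columns are predicted classes
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

print("true negatives: ", tn)
print("false positives:", fp)
print("false negatives:", fn)
print("true positives: ", tp)
```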

We can also generate a `classification_report`, which summarizes the model's performance through its precision, recall and f1-score, along with its overall accuracy.


In [None]:
# Generating the classification report

print(classification_report(y_test,predict))
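
For reference, the precision, recall and f1-score of a class can be worked out by hand from the confusion matrix counts. This is a minimal sketch in plain Python. The helper name `precision_recall_f1` and the counts passed to it are made up for illustration and are not results from this notebook. It also assumes the denominators are non-zero.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and f1 for the positive class.

    Assumes tp + fp > 0 and tp + fn > 0 (no zero-division handling).
    """
    precision = tp / (tp + fp)   # of those predicted positive, how many were right
    recall = tp / (tp + fn)      # of those actually positive, how many were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    return precision, recall, f1


# Made-up counts: 60 true positives, 20 false positives, 30 false negatives
precision, recall, f1 = precision_recall_f1(tp=60, fp=20, fn=30)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```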

To improve the model's performance, you can use more features by including some of the columns we dropped, or try a different model.
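
As one possible way to bring a dropped column back, a text column such as `Embarked` could be one-hot encoded before training. The sketch below uses a made-up frame (the values and the names `toy` and `encoded` are assumptions for illustration) and the `columns` and `dtype` parameters of `pd.get_dummies`.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "Fare": [7.25, 71.28, 8.05, 13.0],
    "Embarked": ["S", "C", "S", "Q"],
})

# Encode only the Embarked column; drop_first removes the first category ("C")
encoded = pd.get_dummies(toy, columns=["Embarked"], drop_first=True, dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

The resulting `Embarked_Q` and `Embarked_S` columns are numeric, so they could be added to the feature list alongside the others.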