## Titanic survivors
Logistic regression model using scikit-learn
<div style="text-align:center">
<img src="titanic.png" width=75%>
</div>

In [124]:
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import style
style.use('ggplot')

## Data
Let's load the raw data to see what it looks like.

In [160]:

data = pd.read_csv('dataset/train.csv')
test_data = pd.read_csv('dataset/test.csv')

In [161]:
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


# Cleaning data


## Removing unimportant columns
Some columns (the passenger ID, name, and ticket number) won't help the algorithm, so we are going to drop them.

In [162]:
columns_to_drop = ['PassengerId','Name','Ticket']
df = data.drop(columns=columns_to_drop)
df

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Cabin,Embarked
0,0,3,male,22.0,1,0,7.2500,,S
1,1,1,female,38.0,1,0,71.2833,C85,C
2,1,3,female,26.0,0,0,7.9250,,S
3,1,1,female,35.0,1,0,53.1000,C123,S
4,0,3,male,35.0,0,0,8.0500,,S
...,...,...,...,...,...,...,...,...,...
886,0,2,male,27.0,0,0,13.0000,,S
887,1,1,female,19.0,0,0,30.0000,B42,S
888,0,3,female,,1,2,23.4500,,S
889,1,1,male,26.0,0,0,30.0000,C148,C



Let's check for null values

In [163]:

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 9 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  891 non-null    int64  
 1   Pclass    891 non-null    int64  
 2   Sex       891 non-null    object 
 3   Age       714 non-null    float64
 4   SibSp     891 non-null    int64  
 5   Parch     891 non-null    int64  
 6   Fare      891 non-null    float64
 7   Cabin     204 non-null    object 
 8   Embarked  889 non-null    object 
dtypes: float64(2), int64(4), object(3)
memory usage: 62.8+ KB
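
From the output above, Age is missing in 177 rows (891 - 714), Cabin in 687 rows (891 - 204), and Embarked in 2 rows. The same counts can be read directly instead of subtracting by hand. A minimal sketch on a toy frame (the values below are illustrative, not the real data):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Titanic data (illustrative values only).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, "C123"],
    "Embarked": ["S", "C", "S", np.nan],
})

# Number of missing values per column, largest first.
null_counts = toy.isna().sum().sort_values(ascending=False)
print(null_counts)

# Share of missing values per column, as a fraction of all rows.
null_share = toy.isna().mean()
print(null_share)
```

On the real data, `df.isna().sum()` should give the same information as `df.info()`, just as missing counts rather than non-null counts.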


### Categorical Data
We have three categorical columns:
- Sex -> Binary nominal
- Embarked -> We will assume it is nominal (one-hot encoding)
- Cabin -> Ordinal
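
To make the distinction concrete, here is a small sketch of how a binary nominal column and an ordinal column could be turned into numbers. The `Deck` column and the deck order are assumptions made up for this illustration; they are not part of the dataset or of the notebook's own method.

```python
import pandas as pd

toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Deck": ["C", "F", "B"],  # hypothetical column, e.g. the first letter of Cabin
})

# Binary nominal: only two values, so any 0/1 assignment works.
toy["Sex_code"] = toy["Sex"].map({"female": 0, "male": 1})

# Ordinal: the order has to be stated explicitly.
# This deck order is an assumption used only for the example.
deck_order = ["A", "B", "C", "D", "E", "F", "G"]
toy["Deck_code"] = pd.Categorical(
    toy["Deck"], categories=deck_order, ordered=True
).codes

print(toy)
```

A nominal column with more than two values (like Embarked) has no such order, which is why one-hot encoding is the usual choice for it.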


Let's look at the cabin first.
We will write a function that replaces null values with one of the existing cabins. We will pick from the cheapest cabins, because we assume that passengers without a recorded cabin probably had cheap tickets.

Dealing with missing values: first, we drop the rows where Age is missing

In [164]:
df = df.dropna(subset=['Age'])
df

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Cabin,Embarked
0,0,3,male,22.0,1,0,7.2500,,S
1,1,1,female,38.0,1,0,71.2833,C85,C
2,1,3,female,26.0,0,0,7.9250,,S
3,1,1,female,35.0,1,0,53.1000,C123,S
4,0,3,male,35.0,0,0,8.0500,,S
...,...,...,...,...,...,...,...,...,...
885,0,3,female,39.0,0,5,29.1250,,Q
886,0,2,male,27.0,0,0,13.0000,,S
887,1,1,female,19.0,0,0,30.0000,B42,S
889,1,1,male,26.0,0,0,30.0000,C148,C


In [151]:
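# Returns the cabins whose name starts with "F" (assumed to be the cheapest)
def get_cheapest_cabin(cabin_list):
    cheapest = []
    for cabin in cabin_list:
        # Missing cabins show up as float NaN, so skip anything that is not a string
        if isinstance(cabin, str) and cabin.startswith("F"):
            cheapest.append(cabin)
    return cheapest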


In [152]:
cabins_list = df.Cabin.unique().tolist()
basic_cabins = get_cheapest_cabin(cabins_list)
basic_cabins


['F33', 'F G73', 'F2', 'F4', 'F G63']
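
The same filter can also be written with pandas string methods instead of an explicit loop. This is only a sketch of an alternative on a toy Series (illustrative values), not the notebook's own code:

```python
import numpy as np
import pandas as pd

# Toy cabin column with a missing value and a repeated cabin.
cabins = pd.Series(["C85", np.nan, "F33", "F G73", "C123", "F33"])

# na=False makes missing values count as "does not start with F".
is_f = cabins.str.startswith("F", na=False)

# unique() keeps the order of first appearance.
f_cabins = cabins[is_f].unique().tolist()
print(f_cabins)  # ['F33', 'F G73']
```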

In [165]:
import random

# Replace a missing cabin (float NaN) with a randomly chosen "F" cabin
def get_cabin(cabin):
    if isinstance(cabin, float):
        return random.choice(basic_cabins)
    return cabin

df['Cabin'] = df['Cabin'].map(get_cabin)
df


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['Cabin'] = df['Cabin'].map(get_cabin)


Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Cabin,Embarked
0,0,3,male,22.0,1,0,7.2500,F2,S
1,1,1,female,38.0,1,0,71.2833,C85,C
2,1,3,female,26.0,0,0,7.9250,F33,S
3,1,1,female,35.0,1,0,53.1000,C123,S
4,0,3,male,35.0,0,0,8.0500,F G73,S
...,...,...,...,...,...,...,...,...,...
885,0,3,female,39.0,0,5,29.1250,F G63,Q
886,0,2,male,27.0,0,0,13.0000,F4,S
887,1,1,female,19.0,0,0,30.0000,B42,S
889,1,1,male,26.0,0,0,30.0000,C148,C
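
The `SettingWithCopyWarning` printed above most likely appears because `df` is the result of a filtering step (`dropna`), and pandas cannot tell whether the assignment is meant for the filtered frame or the original one. A common way to make the intent explicit is to take a copy right after filtering. A minimal sketch on toy data (illustrative values; `fillna` with a fixed cabin stands in for the random choice used in the notebook):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0],
    "Cabin": [np.nan, "C85", np.nan],
})

# An explicit copy makes the filtered frame independent of the original,
# so later column assignments are unambiguous.
clean = toy.dropna(subset=["Age"]).copy()

# Fill the missing cabins on the copy; the original frame is left untouched.
clean["Cabin"] = clean["Cabin"].fillna("F2")
print(clean)
```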


Next we transform the string columns Sex and Cabin into integer codes. (Sex is not ordinal; it is just a binary value.)

In [166]:
labelencoder = LabelEncoder()
df['Sex'] = labelencoder.fit_transform(df['Sex'])
df['Cabin'] = labelencoder.fit_transform(df['Cabin'])
df

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['Sex'] = labelencoder.fit_transform(df['Sex'])
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['Cabin'] = labelencoder.fit_transform(df['Cabin'])


Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Cabin,Embarked
0,0,3,1,22.0,1,0,7.2500,129,S
1,1,1,0,38.0,1,0,71.2833,73,C
2,1,3,0,26.0,0,0,7.9250,130,S
3,1,1,0,35.0,1,0,53.1000,49,S
4,0,3,1,35.0,0,0,8.0500,128,S
...,...,...,...,...,...,...,...,...,...
885,0,3,0,39.0,0,5,29.1250,127,Q
886,0,2,1,27.0,0,0,13.0000,131,S
887,1,1,0,19.0,0,0,30.0000,26,S
889,1,1,1,26.0,0,0,30.0000,53,C
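
One thing worth keeping in mind: `LabelEncoder` assigns codes according to the sorted order of the labels, so the Cabin codes follow the alphabetical order of the cabin strings rather than any price order (in the table above, the F cabins end up with codes 127 to 131). A small sketch with a handful of cabin names taken from the tables above:

```python
from sklearn.preprocessing import LabelEncoder

cabins = ["F2", "C85", "F33", "C123", "B42"]

encoder = LabelEncoder()
codes = encoder.fit_transform(cabins)

# classes_ holds the sorted unique labels; code i corresponds to classes_[i].
print(encoder.classes_.tolist())  # ['B42', 'C123', 'C85', 'F2', 'F33']
print(codes.tolist())             # [3, 2, 4, 1, 0]

# The mapping can be reversed as long as this encoder object is kept around.
print(encoder.inverse_transform([0, 4]).tolist())  # ['B42', 'F33']
```

Using one encoder object per column, instead of refitting the same one, keeps `inverse_transform` available for each column.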


Let's remove the two rows where Embarked is null. Because it is the only remaining column with null values, we can just use df.dropna()

In [169]:
df = df.dropna()



## One hot encoding
We are going to convert our nominal column (Embarked) into one-hot encoded columns, one per category

In [171]:
df = pd.get_dummies(df,prefix=['Embarked'],columns=['Embarked'])
df

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Cabin,Embarked_C,Embarked_Q,Embarked_S
0,0,3,1,22.0,1,0,7.2500,129,0,0,1
1,1,1,0,38.0,1,0,71.2833,73,1,0,0
2,1,3,0,26.0,0,0,7.9250,130,0,0,1
3,1,1,0,35.0,1,0,53.1000,49,0,0,1
4,0,3,1,35.0,0,0,8.0500,128,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...
885,0,3,0,39.0,0,5,29.1250,127,0,1,0
886,0,2,1,27.0,0,0,13.0000,131,0,0,1
887,1,1,0,19.0,0,0,30.0000,26,0,0,1
889,1,1,1,26.0,0,0,30.0000,53,1,0,0
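
A minimal sketch of what `pd.get_dummies` does, on a toy frame with illustrative values. Recent pandas versions return boolean columns by default, so `dtype=int` is passed here to get the 0/1 values shown in the table above:

```python
import pandas as pd

toy = pd.DataFrame({
    "Fare": [7.25, 71.2833, 7.925],
    "Embarked": ["S", "C", "Q"],
})

# Each category of Embarked becomes its own 0/1 column;
# the original Embarked column is removed.
encoded = pd.get_dummies(toy, columns=["Embarked"], prefix="Embarked", dtype=int)
print(encoded)
```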


In [175]:
X_train = df.drop(columns=['Survived'])
y_train = df['Survived']

In [176]:
X_train

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare,Cabin,Embarked_C,Embarked_Q,Embarked_S
0,3,1,22.0,1,0,7.2500,129,0,0,1
1,1,0,38.0,1,0,71.2833,73,1,0,0
2,3,0,26.0,0,0,7.9250,130,0,0,1
3,1,0,35.0,1,0,53.1000,49,0,0,1
4,3,1,35.0,0,0,8.0500,128,0,0,1
...,...,...,...,...,...,...,...,...,...,...
885,3,0,39.0,0,5,29.1250,127,0,1,0
886,2,1,27.0,0,0,13.0000,131,0,0,1
887,1,0,19.0,0,0,30.0000,26,0,0,1
889,1,1,26.0,0,0,30.0000,53,1,0,0


In [178]:
model = LogisticRegression()
model.fit(X_train,y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
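
The warning above suggests two remedies: raising `max_iter` or scaling the features. Both can be combined in a pipeline. The sketch below uses synthetic data as a stand-in (it is not the Titanic data), and `max_iter=1000` is an arbitrary choice for illustration, not a tuned value:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in: two features on very different scales.
X = np.column_stack([rng.normal(30, 14, 200), rng.normal(0, 1, 200)])
y = (X[:, 1] > 0).astype(int)

# Standardize the features first, then fit the model with a higher iteration limit.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipeline.fit(X, y)

# Training accuracy only; a held-out split would be needed for a real estimate.
print(pipeline.score(X, y))
```

With the notebook's variables, the equivalent call would be `pipeline.fit(X_train, y_train)`.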
