# The Challenge

The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (i.e., name, age, gender, socio-economic class, etc.).

## What Data Will I Use in This Competition?

In this competition, I’ll gain access to two similar datasets that include passenger information like name, age, gender, socio-economic class, etc. One dataset is titled train.csv and the other is titled test.csv.

**Train.csv** will contain the details of a subset of the passengers on board (891 to be exact) and importantly, will reveal whether they survived or not, also known as the “ground truth”.

The **test.csv** dataset contains similar information but does not disclose the “ground truth” for each passenger. 

## Algorithm

### The Goal:
To build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (i.e., name, age, gender, socio-economic class, etc.).

#### Tasks:

**1. Formalized Task**

**2. Preprocessing Data**
        
        2.1. Importing the libraries
        2.2. Importing the dataset
        2.3. Dealing with the missing data in the dataset
        2.4. Encoding categorical data
        2.5. Splitting the dataset into training set and test set
        2.6. Feature scaling

**3. Building the model**

### Step 1. Formalized Task

When choosing an algorithm for the problem of predicting a survivor on the Titanic, we will be guided by the principle of model interpretability. This means that we will choose algorithms that make it easy to understand which factors influence survival prediction and how much.

For example, linear regression is one of the most interpretable algorithms, as it allows you to evaluate the impact of each of the factors on survival. If we see that simplifying the model to linear regression does not lead to a significant deterioration in the results, then we will use this algorithm.

However, if we see that simplifying the model to linear regression leads to a significant deterioration in the results, then we will use more complex algorithms, such as gradient boosting or neural networks. At the same time, we will take into account that the results of these algorithms may be less interpretable and require additional analysis and verification before making decisions based on these results.

Thus, when choosing an algorithm for a problem, we will pay special attention to the interpretability of the model and use more complex algorithms only if simplifying the model leads to worse results.

The formalized task at the first stage of model selection is therefore as follows:

**Model** with k features (superscripts index the features; they are not powers):
$$
a(x) = w_0 + w_1 x^1 + \dots + w_k x^k = \langle w, x \rangle,\\
w = (w_0, w_1, \dots, w_k), \quad x = (1, x^1, \dots, x^k)
$$


However, survival is a binary outcome. To solve the problem we will therefore use **logistic regression**, which is a special case of the generalized linear model.


We will train the linear model to predict a quantity that is tied to a probability but ranges over $$(-\infty, +\infty),$$ 

and then convert the model's output into a probability. That quantity is the **logit**, or **log odds**: the logarithm of the ratio of the probability of the positive event to the probability of the negative event, $$
\log\left(\frac{p}{1-p}\right).
$$

Inverting the logit gives the sigmoid, which maps the linear score to a probability: $$p = \sigma(\langle w, x \rangle) = \frac{1}{1 + e^{-\langle w, x \rangle}}$$
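The relationship between the linear score, the sigmoid, and the logit can be checked numerically. The following is a minimal sketch; the weights and the feature vector are hypothetical values chosen for illustration, not coefficients fitted to the Titanic data.

```python
import numpy as np

def linear_score(w, x):
    """Linear score <w, x>; x is expected to start with a constant 1."""
    return float(np.dot(w, x))

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    """Log odds of a probability p in (0, 1)."""
    return np.log(p / (1.0 - p))

# Hypothetical weights: intercept w_0 first, then one weight per feature.
w = np.array([-1.0, 2.0, 0.5])
# One hypothetical passenger; the leading 1 multiplies the intercept.
x = np.array([1.0, 1.0, 0.0])

z = linear_score(w, x)  # -1 + 2*1 + 0.5*0 = 1.0
p = sigmoid(z)          # probability of the positive class
print(z, p, logit(p))   # logit(p) recovers the score z
```

Because the logit and the sigmoid are inverses of each other, a linear model for the log odds is equivalent to a sigmoid applied to the linear score.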

Logistic regression shares several assumptions with linear regression, with the key difference that the dependent variable is binary:

1. the dependent variable must be binary, usually encoded as zero and one;
2. independence of observations from each other;
3. lack of multicollinearity among the predictors;
4. sufficient sample size (a common rule of thumb is at least thirty observations);
5. no strongly influential outliers.

### Step 2. Preprocessing Data

Basic **preprocessing** techniques used to convert raw data into clean data involve the following steps:
    
1. Conversion of data: Data in any form must be converted into numeric form for machine learning models to handle it.
2. Handling the missing values: Whenever any missing data in a data set is observed, one of the following can be considered:

    a. Ignoring the missing values: The corresponding row or column of data can be removed as needed.
    
    b. Filling the missing values: In this approach, the missing data can be filled manually. The value added in the missing data field can be the mean, median or the highest frequency value.
    
    c. Machine learning: Another approach is to predict the data that can be added in the empty position based on the existing data.
    
3. Outlier detection: Some observations may deviate drastically from the rest of the data set; such observations are called outliers and may indicate recording errors.

Thus, data preprocessing forms one of the most important steps in machine learning to build accurate machine learning models.
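The options listed above can be sketched on a small table. The data below is invented for illustration, and the 1.5 × IQR rule used for outliers is one common convention rather than something prescribed by this notebook.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 58.0, np.nan, 30.0],
    "port": ["S", "C", None, "S", "S", "Q"],
    "fare": [7.25, 71.28, 8.05, 512.33, 13.0, 26.0],
})

# (a) Ignore: drop every row that has at least one missing value.
dropped = toy.dropna()

# (b) Fill: median for a numeric column, most frequent value for a categorical one.
filled = toy.copy()
filled["age"] = filled["age"].fillna(filled["age"].median())
filled["port"] = filled["port"].fillna(filled["port"].mode()[0])

# Outlier detection: flag fares outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = toy["fare"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = toy[(toy["fare"] < q1 - 1.5 * iqr) | (toy["fare"] > q3 + 1.5 * iqr)]

print(len(dropped), filled["age"].tolist(), outliers["fare"].tolist())
```

Dropping rows is the simplest option but discards information; filling keeps every row at the cost of making the filled column look less variable than it really is.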

#### 2.1. Import Dependencies

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")

#### 2.2. Load Dataset

In [2]:
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

In [3]:
train.sample(5)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
293,294,0,3,"Haas, Miss. Aloisia",female,24.0,0,0,349236,8.85,,S
319,320,1,1,"Spedden, Mrs. Frederic Oakley (Margaretta Corn...",female,40.0,1,1,16966,134.5,E34,C
820,821,1,1,"Hays, Mrs. Charles Melville (Clara Jennings Gr...",female,52.0,1,1,12749,93.5,B69,S
866,867,1,2,"Duran y More, Miss. Asuncion",female,27.0,1,0,SC/PARIS 2149,13.8583,,C
693,694,0,3,"Saad, Mr. Khalil",male,25.0,0,0,2672,7.225,,C


In [4]:
test.sample(5)

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
307,1199,3,"Aks, Master. Philip Frank",male,0.83,0,1,392091,9.35,,S
234,1126,1,"Cumings, Mr. John Bradley",male,39.0,1,0,PC 17599,71.2833,C85,C
345,1237,3,"Abelseth, Miss. Karen Marie",female,16.0,0,0,348125,7.65,,S
63,955,3,"Bradley, Miss. Bridget Delia",female,22.0,0,0,334914,7.725,,Q
70,962,3,"Mulvihill, Miss. Bertha E",female,24.0,0,0,382653,7.75,,Q


Let's concatenate the two datasets to make further processing more convenient.

In [5]:
test['survived'] = np.nan

In [6]:
train['survived'] = train['Survived']

In [7]:
train = train.drop('Survived', axis=1)

In [8]:
test.columns

Index(['PassengerId', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch',
       'Ticket', 'Fare', 'Cabin', 'Embarked', 'survived'],
      dtype='object')

In [9]:
train.columns

Index(['PassengerId', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch',
       'Ticket', 'Fare', 'Cabin', 'Embarked', 'survived'],
      dtype='object')

In [10]:
df = pd.concat([train, test])

In [11]:
df.sample(5)

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,survived
159,160,3,"Sage, Master. Thomas Henry",male,,8,2,CA. 2343,69.55,,S,0.0
731,732,3,"Hassan, Mr. Houssein G N",male,11.0,0,0,2699,18.7875,,C,0.0
497,498,3,"Shellard, Mr. Frederick William",male,,0,0,C.A. 6212,15.1,,S,0.0
435,436,1,"Carter, Miss. Lucile Polk",female,14.0,1,2,113760,120.0,B96 B98,S,1.0
137,1029,2,"Schmidt, Mr. August",male,26.0,0,0,248659,13.0,,S,


For convenience, we will convert the feature names to lowercase.

In [12]:
df.columns = df.columns.str.lower()

#### 2.3. Missing Values

In [13]:
df.isnull().sum()/len(df)*100

passengerid     0.000000
pclass          0.000000
name            0.000000
sex             0.000000
age            20.091673
sibsp           0.000000
parch           0.000000
ticket          0.000000
fare            0.076394
cabin          77.463713
embarked        0.152788
survived       31.932773
dtype: float64

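From the output, we see that four features have missing values: ```age```, ```fare```, ```cabin```, and ```embarked```. The missing ```survived``` values are simply the test rows, for which the label is not provided.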

##### Figure out why the data is missing
Is a value missing because it wasn't recorded or because it doesn't exist?


Let's start with age.

In [14]:
df[df['age'].isnull()]

Unnamed: 0,passengerid,pclass,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,survived
5,6,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q,0.0
17,18,2,"Williams, Mr. Charles Eugene",male,,0,0,244373,13.0000,,S,1.0
19,20,3,"Masselmani, Mrs. Fatima",female,,0,0,2649,7.2250,,C,1.0
26,27,3,"Emir, Mr. Farred Chehab",male,,0,0,2631,7.2250,,C,0.0
28,29,3,"O'Dwyer, Miss. Ellen ""Nellie""",female,,0,0,330959,7.8792,,Q,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...
408,1300,3,"Riordan, Miss. Johanna Hannah""""",female,,0,0,334915,7.7208,,Q,
410,1302,3,"Naughton, Miss. Hannah",female,,0,0,365237,7.7500,,Q,
413,1305,3,"Spector, Mr. Woolf",male,,0,0,A.5. 3236,8.0500,,S,
416,1308,3,"Ware, Mr. Frederick",male,,0,0,359309,8.0500,,S,
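One way to probe why a value is missing is to compare the share of missing values across groups: if the share differs strongly between groups, the value is probably not missing purely at random. The sketch below uses invented toy data and a hypothetical grouping by passenger class; it is not a result computed from the Titanic files.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "pclass": [1, 1, 2, 2, 3, 3, 3, 3],
    "age": [38.0, 54.0, np.nan, 27.0, np.nan, np.nan, 22.0, np.nan],
})

# Share of missing ages within each passenger class.
missing_share = toy["age"].isnull().groupby(toy["pclass"]).mean()
print(missing_share)
```

On the real data the same expression could be applied to ```df``` with ```pclass```, ```sex```, or ```embarked``` as the grouping column.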


In [15]:
df.loc[df['age'].isnull(), 'age'] = df['age'].median()

In [16]:
df.loc[df['fare'].isnull(), 'fare'] = df['fare'].median()

In [17]:
df.loc[df['embarked'].isnull(), 'embarked'] = 'S'
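The cells above fill ```age``` and ```fare``` with the median of the combined data and ```embarked``` with the constant 'S'. A possible variant, shown here only as a sketch on invented toy data, computes the fill values from the labelled (training) rows alone, so that no statistic of the test rows leaks into the imputation. The variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration; NaN in `survived` marks test rows.
toy = pd.DataFrame({
    "age": [20.0, np.nan, 40.0, 60.0, np.nan],
    "embarked": ["S", "S", None, "C", "Q"],
    "survived": [0.0, 1.0, 1.0, np.nan, np.nan],
})

labelled = toy["survived"].notnull()

# Fill values computed from the labelled rows only.
age_median = toy.loc[labelled, "age"].median()       # median of 20 and 40
port_mode = toy.loc[labelled, "embarked"].mode()[0]  # most frequent port

imputed = toy.copy()
imputed["age"] = imputed["age"].fillna(age_median)
imputed["embarked"] = imputed["embarked"].fillna(port_mode)
print(imputed)
```

Computing the mode explicitly also documents where a constant such as 'S' comes from, instead of hard-coding it.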

My next hypothesis is that the location of a cabin may also have affected a passenger's chances of survival. Cabins with certain letters or numbers may have been much farther from rescue points or evacuation exits, which would have made it harder for their occupants to get out in an emergency.
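A rough way to examine this hypothesis is to take the first character of the cabin string as a deck label and compare survival rates per deck. The sketch below assumes that the leading letter identifies the deck and marks missing cabins with a placeholder "U"; both choices are assumptions, and the data is an invented toy sample.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "cabin": ["C85", np.nan, "E46", "B96 B98", np.nan, "C123"],
    "survived": [1, 0, 0, 1, 0, 1],
})

# Assumption: the first character of the cabin string is the deck letter.
# Missing cabins get the placeholder "U" (unknown).
toy["deck"] = toy["cabin"].str[0].fillna("U")

# Survival rate per deck.
rate_by_deck = toy.groupby("deck")["survived"].mean()
print(rate_by_deck)
```

With most cabin values missing in the real data, the "U" group would dominate, so any per-deck comparison should be read with caution.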

##### Dropping Features


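```PassengerId``` may be dropped from the dataset, as it is only a row identifier and carries no information about survival.

The ```Name``` feature is relatively non-standard and may not contribute directly to survival, so it may be dropped.

```Cabin``` and ```Ticket``` are dropped as well. Note that about 77% of the ```Cabin``` values are missing, which makes the cabin hypothesis above hard to test directly.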

In [18]:
df = df.drop(['passengerid', 'name', 'cabin'], axis=1)

In [19]:
df = df.drop(['ticket'], axis=1)

In [20]:
df.isnull().sum()/len(df)*100

pclass       0.000000
sex          0.000000
age          0.000000
sibsp        0.000000
parch        0.000000
fare         0.000000
embarked     0.000000
survived    31.932773
dtype: float64

#### 2.4. Encoding Categorical Data

This dataset contains both numerical and categorical features. The latter take string values, each representing a specific category. Logistic regression works only with numerical variables, so categorical data must be encoded before training the model.
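One-hot encoding can be sketched in a single call. The variant below differs from the cells that follow in two ways, both of which are choices made for this illustration: it prefixes the new columns with the original column name, and it drops the first level of each category (```drop_first=True```) so that the indicator columns are not perfectly collinear with an intercept.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "sex": ["male", "female", "female"],
    "embarked": ["S", "C", "Q"],
})

# dtype=int yields 0/1 integers instead of booleans.
encoded = pd.get_dummies(toy, columns=["sex", "embarked"],
                         drop_first=True, dtype=int)
print(encoded)
```

Here ```sex_female``` and ```embarked_C``` are the dropped reference levels: a row with zeros in all remaining indicators belongs to them.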

In [22]:
df['sex'].unique()

array(['male', 'female'], dtype=object)

In [23]:
df['embarked'].unique()

array(['S', 'C', 'Q'], dtype=object)

In [24]:
df = pd.concat([df, pd.get_dummies(df['embarked'])], axis=1)

In [25]:
df = df.drop('embarked', axis=1)

In [26]:
df = pd.concat([df, pd.get_dummies(df['sex'])], axis=1)

In [27]:
df = df.drop('sex', axis=1)

In [28]:
# Cast the one-hot indicator columns to integers (0/1)
dummy_cols = ['C', 'Q', 'S', 'female', 'male']
df[dummy_cols] = df[dummy_cols].astype(int)

#### 2.5. Train-Test Split

In [35]:
train = df[df['survived'].notnull()]

In [36]:
test = df[df['survived'].isnull()]

In [37]:
test = test.drop('survived', axis=1)

#### 2.6. Feature Scaling

#### Data Balance
🔍 Data imbalance is a property of the distribution of categorical data, where one class is represented significantly more than all the others in the sample.

In such cases, we might have to resample the data, for example by excluding some observations of the over-represented class, to make the comparison more accurate. Here we assume that the collected data reflects the general population, so we leave the class distribution unchanged.
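Before deciding whether any rebalancing is needed, the class shares can be inspected directly. The sketch below uses an invented toy column; on the real data the same expression would be applied to the labelled rows of ```survived```.

```python
import numpy as np
import pandas as pd

# Toy labels, invented for illustration; NaN marks unlabelled test rows.
toy = pd.DataFrame({"survived": [0.0, 1.0, 0.0, 0.0, 1.0, np.nan, np.nan]})

# Keep only labelled rows, then compute the share of each class.
labelled = toy["survived"].dropna()
class_share = labelled.value_counts(normalize=True)
print(class_share)
```

Shares close to each other suggest that accuracy is a reasonable metric; a heavily skewed split would call for other metrics or resampling.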

### Step 3. Building the Model

In [38]:
import statsmodels.api as sm
# GLM and families are used for binomial (logistic) regression
from statsmodels.genmod.generalized_linear_model import GLM
from statsmodels.genmod import families
# Sequential (forward) feature selection
from mlxtend.feature_selection import SequentialFeatureSelector
from sklearn import linear_model

We build as many models as we have predictors. In our case, that means seven models, one per feature remaining after the drops above (counted before one-hot encoding). We start with a model that has a single predictor and then add the remaining ones one at a time. We compare the models by their log-likelihood to understand which one best explains the data.

Note that if, when an additional predictor is included, another independent variable turns from insignificant to significant, then this will be a reason to think about the violation of the assumption of uncorrelated predictors.

If two variables are strongly related to each other, then their coefficients will be indeterminate, that is, an arbitrary mutual change in the coefficients in front of them will lead to the same model.
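The indeterminacy can be seen on a tiny example. With an intercept, the two indicators ```female``` and ```male``` always sum to one, so shifting weight between the intercept and the two indicators leaves every prediction unchanged. The weights below are hypothetical.

```python
import numpy as np

female = np.array([1.0, 0.0, 1.0, 0.0])
male = 1.0 - female  # fully determined by `female`

# Design matrix: intercept, female, male.
X = np.column_stack([np.ones(4), female, male])

# Two different (hypothetical) weight vectors.
w_a = np.array([0.5, 2.0, 0.0])
c = 3.0
w_b = w_a + np.array([c, -c, -c])  # add c to the intercept, subtract it from both indicators

scores_a = X @ w_a
scores_b = X @ w_b

print(np.allclose(scores_a, scores_b))  # the two models are indistinguishable
print(np.linalg.matrix_rank(X))         # rank is below the number of columns
```

This is why one of the two sex indicators is enough as a predictor when the model includes an intercept.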

To create a model, use the ```GLM``` class from ```sm``` and store the fitted result in the ```model_1``` object. The first argument is the dependent variable, survival: ```train['survived']```. The second argument is a data frame with all independent variables; let's start with the gender variable. As the last argument, we specify ```family=families.Binomial()``` to indicate that we are using binomial logistic regression.

Finally, we fit the model with the ```fit()``` method and display the results. In the cells from In [39] onward, the forward selection is instead automated with ```SequentialFeatureSelector``` from mlxtend, which scores candidate feature sets by accuracy.

In [39]:
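# Forward selection: start with no features and repeatedly add the one that
# improves accuracy the most, until three features are selected.
# cv=None disables cross-validation, so accuracy is measured on the training data.
sfs = SequentialFeatureSelector(linear_model.LogisticRegression(),
                                k_features=3,
                                forward=True,
                                scoring='accuracy',
                                cv=None)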

In [40]:
X = train.drop('survived', axis=1)

In [41]:
y = train['survived']

In [42]:
selected_features = sfs.fit(X, y)

In [43]:
selected_features.k_feature_names_ 

('pclass', 'sibsp', 'female')

In [44]:
X_tr = train.drop('survived', axis=1)

In [45]:
y_tr = train['survived']

In [46]:
logreg = linear_model.LogisticRegression()
logreg.fit(X_tr[['pclass', 'sibsp', 'female']], y_tr)

LogisticRegression()

In [47]:
X_te = test[['pclass', 'sibsp', 'female']]

In [48]:
y_pred = logreg.predict(X_te)

In [49]:
df

Unnamed: 0,pclass,age,sibsp,parch,fare,survived,C,Q,S,female,male
0,3,22.0,1,0,7.2500,0.0,0,0,1,0,1
1,1,38.0,1,0,71.2833,1.0,1,0,0,1,0
2,3,26.0,0,0,7.9250,1.0,0,0,1,1,0
3,1,35.0,1,0,53.1000,1.0,0,0,1,1,0
4,3,35.0,0,0,8.0500,0.0,0,0,1,0,1
...,...,...,...,...,...,...,...,...,...,...,...
413,3,28.0,0,0,8.0500,,0,0,1,0,1
414,1,39.0,0,0,108.9000,,1,0,0,1,0
415,3,38.5,0,0,7.2500,,0,0,1,0,1
416,3,28.0,0,0,8.0500,,0,0,1,0,1


In [50]:
from statsmodels.stats.outliers_influence import variance_inflation_factor
 
X = train[['pclass', 'sibsp', 'female']]
 
vif_data = pd.DataFrame()
vif_data["feature"] = X.columns
 
vif_data["VIF"] = [variance_inflation_factor(X.values, i)
                          for i in range(len(X.columns))]
  
print(vif_data)

  feature       VIF
0  pclass  1.525366
1   sibsp  1.252569
2  female  1.405486
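For reference, the variance inflation factor of a feature is 1 / (1 − R²), where R² comes from regressing that feature on the remaining ones. The sketch below is a plain NumPy version that includes an intercept in each auxiliary regression; the function name ```vif_with_intercept``` is hypothetical, and the data is synthetic. The statsmodels function used above does not add an intercept on its own, so its values can differ from this version.

```python
import numpy as np

def vif_with_intercept(X):
    """VIF per column, regressing each column on the others plus an intercept."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    vifs = []
    for j in range(k):
        y = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ coef
        r2 = 1.0 - resid.var() / y.var()
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

# Synthetic features: `c` is nearly a copy of `a`, `b` is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)

vifs = vif_with_intercept(np.column_stack([a, b, c]))
print(vifs)
```

Values near 1 indicate little linear dependence on the other features, while large values flag features that are almost a linear combination of the rest.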


In [51]:
test_pred = pd.read_csv('test.csv')

In [52]:
test_pred[['PassengerId']]

Unnamed: 0,PassengerId
0,892
1,893
2,894
3,895
4,896
...,...
413,1305
414,1306
415,1307
416,1308


In [53]:
final = pd.concat([test_pred[['PassengerId']], pd.DataFrame(y_pred)], axis=1)

In [54]:
final = final.rename(columns={0 : 'Survived'})

In [55]:
final['Survived']  = final['Survived'].astype(int)

In [56]:
final.to_csv('final_titanic.csv', index=False)

In [57]:
final

Unnamed: 0,PassengerId,Survived
0,892,0
1,893,1
2,894,0
3,895,0
4,896,1
...,...,...
413,1305,0
414,1306,1
415,1307,0
416,1308,0
