### 1. Libraries and packages
#### Exercise 1
********************
To be able to use some functions we need to load the required packages in our workspace. Run the following cell. 
********************

In [None]:
import os
import pandas as pd
import numpy as np

from sklearn import preprocessing
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn import tree

import statsmodels.api as sm
from scipy import stats

stats.chisqprob = lambda chisq, df: stats.chi2.sf(chisq, df)
#stats.chi2 - A chi-squared continuous random variable.
#.sf - Survival function (also defined as 1 - cdf (Cumulative distribution function), but sf is sometimes more accurate).

import matplotlib.pyplot as plt 
plt.rc("font", size=14)
import seaborn as sns
# white background with grid lines for seaborn plots
# (a single call is enough: each sns.set(...) call overrides the previous style)
sns.set(style="whitegrid", color_codes=True)

### 2. Data Access
#### Exercise 2
********************
Find a function to read the files "train.csv" and "test.csv". Then look at the output of the first 5 lines of the files (_Hint:_ **_DataFrame_.head(...)** may be helpful). What do you notice?
********************

In [None]:
# Get current working directory
#os.getcwd()
# Change directory if required
#os.chdir('REPLACE PATH HERE')

#---------------------------------Exercise 2---------------------------------------------------
#get titanic train & test csv files as a DataFrame

#train data 
titanic_train_df = #TO DO
#test data 
titanic_test_df = #TO DO

In [None]:
# preview train data
#TO DO

#### Data Dictionary 
Here you can find a description of the *columns*.
<br><br>
*survival*: Survival 0 = No, 1 = Yes 
<br>
*pclass*: Ticket class 1 = 1st (Upper), 2 = 2nd (Middle), 3 = 3rd (Lower)
<br>
*sex*: Sex male/female 
<br>
*Age*: Age in years (is fractional if less than 1)
<br>
*sibsp*: number of siblings (brother, sister, stepbrother, stepsister) / spouses (husband, wife) aboard the Titanic 
<br>
*parch*: number of parents(mother, father) / children (daughter, son, stepdaughter, stepson) aboard the Titanic 
<br>
*ticket*: Ticket number 
<br>
*fare*: Passenger fare 
<br>
*cabin*: Cabin number 
<br>
*embarked*: Port of Embarkation C = Cherbourg, Q = Queenstown, S = Southampton

In [None]:
# preview test data
#TO DO
#----------------------------------------------------------------------------------------------

### 3. Data Preparation
We want to prepare our data for modeling and analyzing. 
#### Exercise 3.1.
********************
Check the two imported files for the number of so-called NULL values. _Hint:_ You could combine the functions **.isnull()** and **.sum()** for this...
********************

In [None]:
#---------------------------------Exercise 3.1.------------------------------------------------
# check missing values in train dataset
#TO DO

In [None]:
# check missing values in test dataset
#TO DO
#----------------------------------------------------------------------------------------------

#### Exercise 3.2.
********************
To find out how high the proportion of missing values is, you can divide the output by the number of Passengers...( _Hint:_ len() ). 
********************
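
As a rough illustration of the hints in Exercises 3.1 and 3.2, the following sketch counts missing values and turns the counts into proportions. It uses a small invented DataFrame (`toy_df` and its values are assumptions, not the Titanic files), so the numbers say nothing about the real data.

```python
import numpy as np
import pandas as pd

# Toy data (invented for illustration, not the Titanic files)
toy_df = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 8.05, 53.10],
})

# isnull() marks every missing cell as True; sum() counts the True values per column
missing_counts = toy_df.isnull().sum()

# dividing by the number of rows turns the counts into proportions
missing_share = missing_counts / len(toy_df)

print(missing_counts)
print(missing_share)
```

The same two steps should carry over to any DataFrame, since `len(...)` of a DataFrame is its number of rows.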

In [None]:
#---------------------------------Exercise 3.2.------------------------------------------------

# proportion of missing values in the train data
#TO DO
#----------------------------------------------------------------------------------------------

#### Cabin 
~77% of the values in the column 'Cabin' are missing. With that many gaps, any imputed value would be mostly guesswork, so it is better to leave this variable out of the prediction. 

#### Exercise 3.3.
********************
Let's drop the variable "Cabin" (see Cheat Sheet) and save the new DataFrame as 'titanic_train_df_adj'.
********************

In [None]:
# drop the variable 'Cabin'
titanic_train_df_adj = #TO DO

#### Age
~20% of entries for passenger age and ~0.2% for 'Embarked' are missing. An idea would be to replace the missing values...


#### Exercise 3.4.
********************
Let's see what the 'Age' variable looks like in general. Plot a histogram of the 'Age' variable. What can you say about the distribution of the variable?
********************


In [None]:
#---------------------------------Exercise 3.4.------------------------------------------------
# histogram of Age
ax = #TO DO

#----------------------------------------------------------------------------------------------
# set axes 
ax.set(xlabel='Age', ylabel='Count')
plt.show()

"Age" is right-skewed, so the mean is pulled upward by the relatively few older passengers, and using it to replace the NAs might give us biased results. The median is more robust to this, so we could use it instead.
#### Exercise 3.5.
********************
Compute the mean and the median of the variable and then replace the missing values ( _Hint:_ **mean(), median(), fillna(...)** may help you). 
********************
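
To make the difference between mean and median concrete, here is a minimal sketch on an invented, right-skewed series (`toy_age` and its values are assumptions for illustration only). It also shows that `fillna(...)` returns a new Series, so the result has to be assigned back.

```python
import numpy as np
import pandas as pd

# Toy right-skewed ages with two gaps (values invented for illustration)
toy_age = pd.Series([20.0, 22.0, 24.0, 26.0, 80.0, np.nan, np.nan])

# mean() and median() skip missing values by default
age_mean = toy_age.mean()
age_median = toy_age.median()

# fillna() returns a new Series; the original toy_age is left unchanged
toy_age_filled = toy_age.fillna(age_median)

print(age_mean, age_median)
print(toy_age_filled)
```

In this toy series the single large value pulls the mean well above the median, which is the effect the skewness remark above refers to.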

In [None]:
#---------------------------------Exercise 3.5.------------------------------------------------
# compute mean
#TO DO

In [None]:
# compute median
#TO DO

In [None]:
# replace the NAs with the median age (28)
#TO DO
#----------------------------------------------------------------------------------------------

#### Embarked
There are only 2 missing values for "Embarked", so we can just impute with the port where most people boarded.
#### Exercise 3.6.
********************
Use the function **value_counts()** to count each value of the 'Embarked' variable. (An alternative is the function **countplot()** in the package **seaborn**, which visualizes the counts of the values.) Then replace the missing values with the most frequent value.
********************
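
The following sketch shows one possible way to find the most frequent category and use it for imputation. The series `toy_port` is invented for illustration; using `idxmax()` on the counts is just one option among several.

```python
import numpy as np
import pandas as pd

# Toy categorical data with two gaps (invented for illustration)
toy_port = pd.Series(["S", "C", "S", np.nan, "Q", "S", np.nan])

# value_counts() counts each distinct value; missing values are excluded by default
port_counts = toy_port.value_counts()

# idxmax() returns the label (here: the category) with the highest count
most_common = port_counts.idxmax()

# replace the gaps with the most frequent category
toy_port_filled = toy_port.fillna(most_common)

print(port_counts)
print(most_common)
```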

In [None]:
#---------------------------------Exercise 3.6.------------------------------------------------
# counts of values 
#TO DO

In [None]:
# optional countplot to visualize the counts
#TO DO
plt.show()

'S' (Southampton) is the port with the most passengers, so we will set the NAs to 'S'.

In [None]:
# impute with 'S' instead of NAs
#TO DO
#----------------------------------------------------------------------------------------------

#### Exercise 3.7.
********************
A) Investigate how the function **np.where()** works. <br>
B) **Create a new variable**
SibSp (number of siblings/spouses) and Parch (number of parents/children) both describe traveling with family. For simplicity's sake, let's create a new categorical variable "TravelAlone" that indicates whether the individual was traveling alone ("TravelAlone" = 1) or not ("TravelAlone" = 0). <br> 
C) Drop "SibSp" and "Parch" after creating the new variable.
********************
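
For part A, a small sketch of how **np.where()** behaves may help. The column `HighFare` and the threshold of 50 are hypothetical and chosen only to demonstrate the call; they are not part of the exercise.

```python
import numpy as np
import pandas as pd

# Toy data (invented for illustration)
toy_df = pd.DataFrame({"Fare": [7.25, 71.28, 8.05, 53.10]})

# np.where(condition, value_if_true, value_if_false) is evaluated element-wise:
# each row gets 1 where the condition holds and 0 where it does not
toy_df["HighFare"] = np.where(toy_df["Fare"] > 50, 1, 0)

print(toy_df)
```

The condition can be any boolean Series, including one built from the sum of two columns.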

In [None]:
#---------------------------------Exercise 3.7.------------------------------------------------
# create additional variable for traveling alone
#TO DO

In [None]:
titanic_train_df_adj.head(5)

In [None]:
# drop SibSp and Parch
#TO DO
#---------------------------------------------------------------------------------------------

#### Data Transformation
Let's encode nominal variables "Sex" and "Embarked"
#### Exercise 3.8.
********************
We want to create dummy variables from the nominal variables "Sex" and "Embarked". <br> 
A) Since the "Sex" variable has only two values (male and female), we first change the type of the variable to 'category' and then encode the values as 0 and 1 (_Hint_: cat.codes). Examine how **cat.codes** assigns values. <br> 
B) Add a function (_Hint_: cheatsheet may help) to create dummy variables from "Embarked". <br> 
C) Examine the output and drop the unnecessary variables!
********************
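
The sketch below illustrates, on invented data, how category codes and dummy variables are produced in pandas. The names `toy_df`, `Port` and `toy_dummies` are assumptions for this example. For string categories, pandas sorts the categories alphabetically before assigning codes, which is worth verifying on your own data.

```python
import pandas as pd

# Toy data (invented for illustration)
toy_df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Port": ["S", "C", "Q", "S"],
})

# convert to the 'category' dtype and inspect the order of the categories
sex_cat = toy_df["Sex"].astype("category")
print(list(sex_cat.cat.categories))

# cat.codes assigns each category its position in that order
toy_df["Sex_code"] = sex_cat.cat.codes

# get_dummies replaces 'Port' with one indicator column per value
toy_dummies = pd.get_dummies(toy_df, columns=["Port"])
print(toy_dummies)
```

With k dummy columns for k categories, one column is redundant (it is determined by the others), which is the idea behind dropping one of them in part C.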

In [None]:
#---------------------------------Exercise 3.8.------------------------------------------------
# make a copy of a DataFrame to have the changes in a new version and save the "original" values
titanic_train_df_adj_2 = titanic_train_df_adj.copy()
# A
# change type and encode 'Sex' to 1 = male, 0 = female
titanic_train_df_adj_2["Sex"] = #TO DO
# named "Sex_code" to match the test data and the column list in Exercise 5.1
titanic_train_df_adj_2["Sex_code"] = #TO DO

# B
# create dummy variables of 'Embarked' 
titanic_train_df_adj_2 = #TO DO
# C - examine
#TO DO

Drop unnecessary variables

In [None]:
# C - drop
#TO DO
#---------------------------------------------------------------------------------------------

In [None]:
train_final = titanic_train_df_adj_2
train_final.head(5)

#### Exercise 3.9.
********************
Understand the following lines and run the code to make the same changes in **test data** .
********************

In [None]:
#---------------------------------Exercise 3.9.------------------------------------------------
titanic_test_df_adj = titanic_test_df.copy()
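# assign the result back instead of using inplace=True on a single column,
# which newer pandas versions warn about and may not apply to the DataFrame
titanic_test_df_adj["Age"] = titanic_test_df_adj["Age"].fillna(28)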
titanic_test_df_adj.drop("Cabin", axis = 1, inplace = True)

In [None]:
# instead of Embarked we have to take a look at the variable "Fare" that has one missing value
# compute mean
titanic_test_df["Fare"].mean(skipna=True)

In [None]:
# compute median
titanic_test_df["Fare"].median(skipna=True)

In [None]:
# replace the missing fare with the (rounded) median computed above
titanic_test_df_adj["Fare"] = titanic_test_df_adj["Fare"].fillna(14.45)

In [None]:
# create additional variable for traveling alone
titanic_test_df_adj["TravelAlone"] = np.where(titanic_test_df_adj["SibSp"]+titanic_test_df_adj["Parch"] > 0, 0, 1)
titanic_test_df_adj.drop("SibSp", axis = 1, inplace = True)
titanic_test_df_adj.drop("Parch", axis = 1, inplace = True)

In [None]:
#titanic_test_df_adj_2 = pd.get_dummies(titanic_test_df_adj, columns = ["Sex", "Embarked", "Pclass"])
titanic_test_df_adj_2 = titanic_test_df_adj.copy()
# change type and encode 'Sex' to 1 = male, 0 = female
titanic_test_df_adj_2["Sex"] = titanic_test_df_adj_2["Sex"].astype('category')
titanic_test_df_adj_2["Sex_code"] = titanic_test_df_adj_2["Sex"].cat.codes
# create dummy variables of 'Embarked' 
titanic_test_df_adj_2 = pd.get_dummies(titanic_test_df_adj_2, columns = ["Embarked"])
# drop unnecessary variables
titanic_test_df_adj_2.drop("PassengerId", axis = 1 ,inplace = True)
titanic_test_df_adj_2.drop("Ticket", axis = 1 ,inplace = True)
titanic_test_df_adj_2.drop("Name", axis = 1 ,inplace = True)
titanic_test_df_adj_2.drop("Sex", axis = 1 ,inplace = True)

test_final = titanic_test_df_adj_2
test_final.head(5)
#--------------------------------------------------------------------------------------------------

### 4. Exploratory Data Analysis
#### Exercise 4.1.
********************
We want to plot the distribution of 'Age' conditioned on 'Survived = yes' or 'Survived = no'. 
<br> <br>
A) Add missing information and then run the code. What can you say about the age distribution? <br>
B) Add a barplot (package seaborn). What do you notice? <br>
C) Create a new dummy variable "Is_Minor" where the value is set to 1 if the passenger is under 16. 
********************
#### Age 

In [None]:
plt.figure(figsize = (17,8))
#---------------------------------Exercise 4.1.------------------------------------------------
# A
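# fill = True shades the area under the curve (replaces the deprecated shade = True, seaborn >= 0.11)
sns.kdeplot(train_final["Age"][#TO DO], color = 'r', fill = True)
sns.kdeplot(train_final["Age"][#TO DO], color = 'b', fill = True)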

plt.legend(['Survived', 'Died'])
plt.show()

In [None]:
# B
plt.figure(figsize = (20,8))
sns.barplot(#TO DO)
plt.show()

Considering the survival rate of passengers under 16, we will include another categorical variable in our dataset: "Is_Minor"

In [None]:
# C
# add a variable 'Is_Minor', which is set to 1 if the person is under 16
train_final['Is_Minor'] = np.where(#TO DO)
test_final['Is_Minor'] = np.where(#TO DO)
#----------------------------------------------------------------------------------------------

#### Fare
#### Exercise 4.2
********************
Plot the distribution of 'Fare' conditioned on 'Survived = yes' or 'Survived = no'. Add missing information and then run the code. What can you say about the distribution of 'Fare'? 
********************

In [None]:
#---------------------------------Exercise 4.2.------------------------------------------------
plt.figure(figsize = (17,8))
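# fill = True shades the area under the curve (replaces the deprecated shade = True, seaborn >= 0.11)
sns.kdeplot(train_final["Fare"][#TO DO], color = 'r', fill = True)
sns.kdeplot(train_final["Fare"][#TO DO], color = 'b', fill = True)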
plt.legend(['Survived', 'Died'])
plt.show()
#----------------------------------------------------------------------------------------------

As the distributions are different, it's likely that 'Fare' would be a significant predictor in our model. Passengers with lower fares seem to have been less likely to survive. 

#### Passenger Class 
#### Exercise 4.3
********************
Let's check whether survival is related to the passenger class ('Pclass') using a cross table. Add missing information and run the code! What can you see?
********************
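
To help read the cross tables in this section, here is a sketch of what `normalize = 'index'` and `margins = True` do, using invented data (`Group` and `Outcome` are hypothetical columns, not Titanic variables). Each row is divided by its own row total, so the entries of a row are shares that add up to 1.

```python
import pandas as pd

# Toy data (invented for illustration)
toy_df = pd.DataFrame({
    "Group": ["a", "a", "a", "a", "b", "b"],
    "Outcome": [1, 1, 1, 0, 0, 0],
})

# rows: Group, columns: Outcome; normalize='index' gives row-wise shares,
# margins=True adds an 'All' row with the shares over the whole table
toy_table = pd.crosstab(toy_df["Group"], toy_df["Outcome"],
                        margins=True, normalize="index")

print(toy_table)
```

Read this way, the column for outcome 1 in each row is the rate of that outcome within the group, and the 'All' row is the overall rate to compare against.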

In [None]:
#---------------------------------Exercise 4.3.------------------------------------------------
pd.crosstab(#TO DO , margins = True, normalize = 'index')
#----------------------------------------------------------------------------------------------

As expected, it was safest to be a first class passenger.

#### Embarked
#### Exercise 4.4.
********************
Explore if there is any correlation with Port of Embarkation ('Embarked')... Add missing information and run the code! What can you see?
********************

In [None]:
#---------------------------------Exercise 4.4.------------------------------------------------
pd.crosstab(#TO DO , margins = True, normalize = 'index' )
#----------------------------------------------------------------------------------------------

Passengers who boarded in Cherbourg, France, appear to have the highest survival rate.

#### TravelAlone
#### Exercise 4.5.
********************
Investigate if there is any difference between traveling alone and with family ... Add missing information and run the code! What can you see?
********************

In [None]:
#---------------------------------Exercise 4.5.------------------------------------------------
pd.crosstab(#TO DO, margins = True, normalize = 'index' )
#----------------------------------------------------------------------------------------------

Traveling with the family appears to be safer than traveling alone.

#### Gender
#### Exercise 4.6.
********************
Investigate if there is any difference between men and women ... Add missing information and run the code! What can you see? Then use the barplot below to visualize the values.
********************

In [None]:
#---------------------------------Exercise 4.6.------------------------------------------------
pd.crosstab(#TO DO , margins = True, normalize = 'index' )

In [None]:
# Let's use here a barplot for visualization
# x and y must be passed as keyword arguments in current seaborn versions
sns.barplot(x='Sex', y='Survived', data = titanic_train_df_adj, color="aquamarine")
plt.show()
#----------------------------------------------------------------------------------------------

There is a very obvious difference -  being female strongly increased your chance to survive.

### 5. Data Analysis

We will use a logistic regression to predict the likelihood of survival using the train_final dataset for the training. 

In [None]:
train_final.head(5)

#### Exercise 5.1.
********************
Add the missing information and run the code lines. <br><br>
A) For your first model, we suggest selecting the following <br>
independent variables: "Pclass", "Age", "Fare", "TravelAlone", "Sex_code", "Embarked_C", "Embarked_Q", "Is_Minor" <br>
and the following dependent/target variable: "Survived" .<br>
B) Use the **Logit(...)** function from the package we imported as **sm** to build the model. <br>
C) Take a look at the output (_Hint:_ **summary()**): In the column 'P>|z|' you find the p-values for the variables. Which variables are significant at the 0.05 alpha level (p < 0.05)?
********************

In [None]:
#---------------------------------Exercise 5.1.------------------------------------------------
# select variables for your first model
# A
cols = [#TO DO]
X_1 = train_final[cols]
# set the target variable
Y_1 = train_final[#TO DO]
# B
# logistic regression model 
logit_model_1 = #TO DO
# fit the model (maximum likelihood estimation)
result_1 = logit_model_1.fit()
# C
print(#TO DO)
#----------------------------------------------------------------------------------------------

#### Exercise 5.2.
********************
Run the following cell to compute the prediction score. 
********************

In [None]:
#---------------------------------Exercise 5.2.------------------------------------------------
# predict train labels
logreg = LogisticRegression()
logreg.fit(X_1, Y_1)
logreg.score(X_1, Y_1)
#----------------------------------------------------------------------------------------------

#### Exercise 5.3.
********************
We will validate the model with an 80/20 hold-out split: 80% of the labeled data is used for training and 20% for testing. <br> 
A) Split your **labeled** data into train and test data (_Hint:_ Function **train_test_split(...)**). <br> 
B) Use the same columns for your prediction. <br> 
C) Compute the model score after training <br> 
D) Test the model with the test values you created in A) <br> 
********************
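
As an illustration of part A, this sketch splits an invented DataFrame 80/20 with **train_test_split(...)**. The names `toy_df`, `toy_train` and `toy_test` are assumptions, and `random_state=0` is set only so that the split is reproducible.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy labeled data (invented for illustration): 10 rows
toy_df = pd.DataFrame({"x": range(10), "y": [0, 1] * 5})

# test_size=0.2 holds out 20% of the rows; the rows are shuffled before splitting
toy_train, toy_test = train_test_split(toy_df, test_size=0.2, random_state=0)

print(len(toy_train), len(toy_test))
```

Passing a whole DataFrame returns two DataFrames, so features and target can be selected from each part afterwards.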


In [None]:
#---------------------------------Exercise 5.3.------------------------------------------------
# A
train, test = #TO DO
# B
cols = #TO DO
X_2 = train[cols]
Y_2 = train[#TO DO]
logit_model_2 = sm.Logit(#TO DO)
result_2 = logit_model_2.fit()
print(result_2.summary())

In [None]:
# C
#TO DO

In [None]:
# D
#predict test labels
X_2_test = #TO DO
Y_2_test = #TO DO
Y_2_test_pred = logreg.predict(X_2_test)
print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logreg.score(X_2_test, Y_2_test)))
#----------------------------------------------------------------------------------------------

The model's out-of-sample performance is similar to its in-sample performance.

#### Random Forest with 100 trees
#### Exercise 5.4.
********************
A) Run the following lines to build a Random Forest model for prediction. <br>
B) Predict for the test values and count how many people would survive according to these values. (We have only unlabeled test data, so we can't use it to compute the accuracy.)
********************

In [None]:
#---------------------------------Exercise 5.4.------------------------------------------------
# A
random_forest = RandomForestClassifier(n_estimators=100)
random_forest.fit(X_2, Y_2)
random_forest.score(X_2, Y_2)

In [None]:
random_forest.score(test[cols], test['Survived'])

In [None]:
# B
# predict for the unlabeled test values (test_final), not the labeled hold-out split
Y_pred_RF = random_forest.predict(test_final[cols])
Y_pred_RF_0_1 = #TO DO
#TO DO
#----------------------------------------------------------------------------------------------

#### Decision Tree
Compare the results with a decision tree.
#### Exercise 5.5.
********************
A) Run the following lines to build a Decision tree model for prediction. <br>
B) Predict for the test values and count how many people would survive according to these values. (We have only unlabeled test data, so we can't use it to compute the accuracy.) <br>
C) Visualize the tree and interpret the result.
********************

In [None]:
# A
tree_1 = tree.DecisionTreeClassifier(criterion='gini', splitter='best',max_depth=3, min_samples_leaf=20)
tree_1.fit(X_2, Y_2)

In [None]:
# B
Y_pred_DT = tree_1.predict(test_final[cols])
Y_pred_DT_0_1 = #TO DO
#TO DO

In [None]:
# C
import graphviz 
# the tree was fitted on X_2, so its column names are the feature names
tree_1_view = tree.export_graphviz(tree_1, out_file=None, feature_names = X_2.columns.values, rotate=True) 
tree_1_viz = graphviz.Source(tree_1_view)
tree_1_viz