# Create your first ML Model - Titanic Dataset

In this workshop we will explore one of the "Hello-World" datasets of machine learning, the Titanic dataset. This was also a challenge on [Kaggle](https://www.kaggle.com/competitions/titanic/overview).

Goals:

* Create a machine learning model to predict whether a passenger survived
* Calculate the survival chance if you were on board

In [None]:
# import ML libraries
import numpy as np 
import pandas as pd 

# data visualization
import seaborn as sns
from matplotlib import pyplot

# Algorithms
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn import preprocessing
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import cross_val_predict

## Data Importing

In [None]:
# More information on the dataset https://www.kaggle.com/competitions/titanic/overview
test_df = pd.read_csv("titanic/test.csv")
train_df = pd.read_csv("titanic/train.csv")

## Data Exploration

### First Conclusions

**Useful Data**:

* Age: Age in years
* SibSp: # of siblings / spouses aboard the Titanic
* Parch: # of parents / children aboard the Titanic
* Fare: Passenger fare
* Survived: Survived or not
* Pclass: Ticket class
* Sex: Sex
* Embarked: Port of Embarkation

**Does it contain important information?**:

* PassengerId: Unique ID of a passenger
* Name: Name
* Ticket: Ticket number
* Cabin: Cabin number


**Contains null values**:

* Age
* Cabin
* Embarked
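
The list of columns with null values can be checked with pandas. Below is a minimal sketch that uses a small hypothetical frame (`sample_df`, made-up values, same column names) in place of `train_df` so that it runs on its own:

```python
import numpy as np
import pandas as pd

# Tiny stand-in for train_df (hypothetical values, real column names)
sample_df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0],
    "Cabin": [np.nan, "C85", np.nan],
    "Embarked": ["S", "C", np.nan],
    "Fare": [7.25, 71.28, 7.92],
})

# Count the missing values per column, largest count first
missing = sample_df.isna().sum().sort_values(ascending=False)
print(missing)
```

Running the same `isna().sum()` call on `train_df` should show which columns need attention during pre-processing.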

## Data Analysis

In [None]:
sns.displot(train_df['Age'])

### Finding Correlations

In [None]:
sns.lmplot(x='Pclass',y='Survived',data=train_df)

# Can you explain the model?

In [None]:
sns.boxplot(data=train_df,x="Pclass",y="Age")

# Can you explain the boxplot?

## Feature Engineering

### PassengerId

In [None]:
train_df["PassengerId"].value_counts()

# Is this useful data?

### Ticket

In [None]:
train_df["Ticket"].value_counts()

# Is this useful data? What could we do more to define it?

### Cabin

In [None]:
train_df["Cabin"].value_counts()

# Is this useful data? What could we do more to define it?

### Name

In [None]:
train_df["Name"].head(10)

# Is this useful data? Can we filter more information from the names?

In [None]:
# Creating a new column named Name_title
# We need to do that for the train AND test dataframe

both_dfs = [train_df, test_df]

for dataset in both_dfs:
    dataset["Name_title"] = dataset.Name.apply(lambda x: x.split(",")[1].split(".")[0].strip())
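
To see what the `split` chain in the lambda does, here is a small standalone sketch. The helper name `extract_title` is hypothetical, and the names are written in the dataset's "Surname, Title. First names" format:

```python
def extract_title(name: str) -> str:
    # Take the part after the comma, then everything before the first period
    return name.split(",")[1].split(".")[0].strip()

names = [
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
]

titles = [extract_title(n) for n in names]
print(titles)
```

Note that this approach assumes every name contains a comma followed by a title ending in a period; a name without a comma would raise an `IndexError`.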

## Data Pre-Processing

After our data analysis and feature engineering, we can now pre-process our data.

**Leave**:

* Survived
* Sex
* Age
* SibSp
* Parch
* Pclass
* Fare
* Embarked

**Drop**:

* PassengerId - just the unique ID
* Name - Extracting the title, then dropping the whole passenger name
* Ticket - maybe extract letters. Drop for now.
* Cabin - too many null data rows

**Add new column**:
    
* Name_title

### Sex

In [None]:
# We need to convert categorical features to numeric values.
# Otherwise the machine learning algorithm can't take those features as inputs directly.
# We define 0 for male and 1 for female.
# We need to apply this to the train AND test dataset. One possible solution:
genders = {"male": 0, "female": 1}
for dataset in [train_df, test_df]:
    dataset["Sex"] = dataset["Sex"].map(genders)


### Age

In [None]:
# As we know, Age values are missing in the train AND test df
# Therefore, we fill the gaps in each dataset with that dataset's mean age

both_dfs = [train_df, test_df]

for dataset in both_dfs:
    dataset["Age"] = dataset["Age"].fillna(dataset.Age.mean())

In [None]:
train_df["Age"].isna().sum() # double check if there are any na values

In [None]:
test_df["Age"].isna().sum() # double check if there are any na values
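
The loop above fills each dataframe with its own mean. A common alternative, shown here only as a sketch and not as the method used in this lab, is to compute the mean on the training data once and reuse it for the test data, so that no test-set statistics enter the pre-processing. The frames `train_sample` and `test_sample` are hypothetical stand-ins:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for train_df and test_df
train_sample = pd.DataFrame({"Age": [20.0, 40.0, np.nan]})
test_sample = pd.DataFrame({"Age": [np.nan, 60.0]})

# mean() ignores NaN values by default
train_mean_age = train_sample["Age"].mean()

# Fill both frames with the mean learned from the training data
train_sample["Age"] = train_sample["Age"].fillna(train_mean_age)
test_sample["Age"] = test_sample["Age"].fillna(train_mean_age)

print(train_sample["Age"].tolist(), test_sample["Age"].tolist())
```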

### Fare

In [None]:
# There is one missing Fare value in the test dataset, which the model cannot handle.
test_df.isna().sum().sort_values(ascending=False)

In [None]:
# We will just drop this data row from the test_df

test_df.dropna(subset=["Fare"],inplace = True)

### Embark

In [None]:
# First we drop the 2 data rows that have no embarkation port
# Alternatively, we could fill these values with the most common embarkation port

train_df.dropna(subset=["Embarked"],inplace = True)

# Same as with the genders we need to put this feature into a numeric value

cities = {"S": 0, "C": 1, "Q": 2}
both_dfs = [train_df, test_df]

for dataset in both_dfs:
    dataset['Embarked'] = dataset['Embarked'].map(cities)
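
The alternative mentioned in the comments, filling missing ports with the most common one instead of dropping the rows, could look roughly like this. It is a sketch on a hypothetical frame `sample_df`, not the approach taken in this lab:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Embarked column of train_df
sample_df = pd.DataFrame({"Embarked": ["S", "C", "S", np.nan, "Q"]})

# mode() returns the most frequent value(s); take the first one
most_common_port = sample_df["Embarked"].mode()[0]
sample_df["Embarked"] = sample_df["Embarked"].fillna(most_common_port)

# Same numeric mapping as in the cell above
cities = {"S": 0, "C": 1, "Q": 2}
sample_df["Embarked"] = sample_df["Embarked"].map(cities)

print(sample_df["Embarked"].tolist())
```

Filling keeps all rows available for training, at the cost of guessing a value for the affected passengers.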

Survived, SibSp, Parch and Pclass don't need any processing.

### Name & Name_title

In [None]:
# Let's check our extracted titles again

train_df["Name_title"].value_counts() #checking the unique value counts on the train_df

In [None]:
test_df["Name_title"].value_counts()

In [None]:
# We should clean up Name_title a little bit:

# aggregating some common titles into a new category
# correcting misspelled titles
# creating a numeric categorization

titles = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Special": 5}
both_dfs = [train_df, test_df]

for dataset in both_dfs:
    
    # replace titles with a more common title or as Special
    dataset['Name_title'] = dataset['Name_title'].replace(['Lady','the Countess','Capt', 'Col','Don', 'Dr','Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Special')
    dataset['Name_title'] = dataset['Name_title'].replace('Mlle', 'Miss')
    dataset['Name_title'] = dataset['Name_title'].replace('Ms', 'Miss')
    dataset['Name_title'] = dataset['Name_title'].replace('Mme', 'Mrs')
    
    # convert titles into numbers
    dataset['Name_title'] = dataset['Name_title'].map(titles)

train_df["Name_title"].value_counts() #checking the new unique value counts on the train_df

### Dropping Columns

In [None]:
# Dropping columns / data which is not needed
# we need to do that on both datasets!
    
train_df = train_df.drop(["Name","PassengerId","Ticket","Cabin"],axis=1)
test_df = test_df.drop(["Name","PassengerId","Ticket","Cabin"],axis=1)

## Final Data Checking

In [None]:
train_df.head()

In [None]:
# Are there any na fields?
train_df.isna().sum().sort_values(ascending=False)

In [None]:
test_df.head()

In [None]:
# Are there any na fields?
test_df.isna().sum().sort_values(ascending=False)

## Building the Machine Learning Model

In [None]:
# Now it's time to build the ML model

# Fortunately, the dataset is already split into a training and a test dataset

# For training the model:
x_train = train_df.drop("Survived", axis=1) # hiding the result
y_train = train_df["Survived"] #df with the result

# For testing the model:
x_test = test_df 

In [None]:
print(x_test)

### Logistic Regression

In [None]:
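# Since we have a classification problem, let's try a logistic regression model
# One possible solution (max_iter is raised to help the solver converge on unscaled features):
logmodel = LogisticRegression(max_iter=1000)
logmodel.fit(x_train, y_train)
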

In [None]:
predictions = cross_val_predict(logmodel, x_train, y_train)
confusion_matrix(y_train, predictions)
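
The confusion matrix can be turned into an accuracy figure and a per-class report. The sketch below is self-contained, so it uses synthetic data from `make_classification` instead of the Titanic frames; the names `X`, `y`, `model` and `preds` are stand-ins for `x_train`, `y_train`, `logmodel` and `predictions`:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import cross_val_predict

# Synthetic binary classification data (stand-in for the Titanic features)
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

model = LogisticRegression(max_iter=1000)
# Each sample is predicted by a model that did not see it during training
preds = cross_val_predict(model, X, y, cv=5)

# Rows are the true classes, columns the predicted classes
cm = confusion_matrix(y, preds)
tn, fp, fn, tp = cm.ravel()

# Accuracy is the share of correct predictions (the diagonal of the matrix)
accuracy = (tn + tp) / cm.sum()
print(cm)
print("Accuracy:", accuracy)
print(classification_report(y, preds))
```

In the lab itself, the same calls would be made with `y_train` and `predictions`.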

In [None]:
# The coefficients indicate how strongly each feature pushes the prediction
importance = logmodel.coef_[0]
# summarize the coefficient of each feature by column name
for name, v in zip(x_train.columns, importance):
    print('Feature: %s, Score: %.5f' % (name, v))
pyplot.bar(x_train.columns, importance)
pyplot.show()

## Test the Model yourself

In [None]:
# Pclass = 1, 2 or 3
# Sex = {"male": 0, "female": 1}
# Age = your age
# SibSp = with how many siblings/spouses did you travel?
# Parch = with how many children/parents did you travel?
# Fare = how much did you pay for the ticket?
# Embarked = {"S" (Southampton): 0, "C" (Cherbourg): 1, "Q" (Queenstown): 2}
# Name_title = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Special": 5}

yourself = [[1,1,31,0,0,50,0,1]]

predict_yourself = logmodel.predict(yourself)

print(predict_yourself)
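
`predict` only returns the class label (0 or 1). To get a survival chance, as stated in the goals, `predict_proba` can be used. The following sketch is self-contained and therefore trains a model on random, hypothetical data (`toy_x`, `toy_y`, `toy_model`); in the lab you would call `logmodel.predict_proba` instead:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

columns = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked", "Name_title"]

# Random stand-in for x_train (hypothetical values, not real passenger data)
rng = np.random.default_rng(0)
n = 200
toy_x = pd.DataFrame({
    "Pclass": rng.integers(1, 4, n),
    "Sex": rng.integers(0, 2, n),
    "Age": rng.uniform(1, 70, n),
    "SibSp": rng.integers(0, 3, n),
    "Parch": rng.integers(0, 3, n),
    "Fare": rng.uniform(5, 100, n),
    "Embarked": rng.integers(0, 3, n),
    "Name_title": rng.integers(1, 6, n),
})
# Toy target: Sex == 1 always "survives", Sex == 0 only sometimes
toy_y = ((toy_x["Sex"] + rng.random(n)) > 0.8).astype(int)

toy_model = LogisticRegression(max_iter=2000)
toy_model.fit(toy_x, toy_y)

# Passing a DataFrame with the training column names keeps the feature order explicit
yourself = pd.DataFrame([[1, 1, 31, 0, 0, 50, 0, 2]], columns=columns)

# Column 1 of predict_proba is the probability of class 1 (survived)
survival_chance = toy_model.predict_proba(yourself)[0, 1]
print(f"Survival chance: {survival_chance:.2%}")
```

The example passenger here uses `Name_title = 2` ("Miss") so that the title is consistent with `Sex = 1`.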

## Outcome of this lab

Thanks! I hope you've enjoyed this lab!

What did you learn today?

* Good Data is important!
* Data Cleaning + pre-processing takes a lot of time
* Domain knowledge is important
* Selecting the right model depends on the scenario