# Install necessary packages

We can install the required packages either by running `pip install --user <package_name>` for each one, or by listing them all in a `requirements.txt` file and running `pip install --user -r requirements.txt`. Since we need several packages, we will use the second option.

> NOTE: Do not forget the `--user` argument. It is necessary if you want to use Kale to transform this notebook into a Kubeflow pipeline.

In [None]:
!pip install --user -r requirements.txt

# Imports

In this section we import the packages we need for this example. Make it a habit to gather your imports in a single place. It will make your life easier if you later transform this notebook into a Kubeflow pipeline using Kale.

In [None]:
import os
import numpy as np
import pandas as pd

from sklearn.svm import SVC
from sklearn.metrics import log_loss
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Load the data

In this section, we load the data. The data come in CSV format, which `pandas` can read directly with `read_csv`.

In [None]:
data_path = "data/"
train_path = os.path.join(data_path, "train.csv")
test_path = os.path.join(data_path, "test.csv")

In [None]:
train_df = pd.read_csv(train_path)
test_df = pd.read_csv(test_path)

In [None]:
train_df.head()

# Data processing

We are now ready to preprocess the dataset. This includes cleaning the data, imputing missing values, scaling numerical values, and encoding categorical attributes.

In [None]:
train_df.info()

In [None]:
missing_values = train_df.isnull().sum().sort_values(ascending=False)
missing_values
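
The cells that follow refer to `strat_train_set` and `strat_test_set`, which are not defined in the cells above. One plausible way to obtain them, assuming they are a stratified hold-out split of the labelled training data, is scikit-learn's `train_test_split` with its `stratify` argument. The helper name `stratified_split`, the default 20% test size, the seed, and the toy frame below are all assumptions made for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def stratified_split(df, label_col="Survived", test_size=0.2, seed=42):
    """Split df so that both parts keep the class balance of label_col."""
    return train_test_split(
        df, test_size=test_size, stratify=df[label_col], random_state=seed
    )


# Toy stand-in for train_df: 10 passengers, 4 of whom survived.
toy_df = pd.DataFrame({
    "Fare": [7.3, 71.3, 7.9, 53.1, 8.1, 8.5, 51.9, 21.1, 11.1, 30.1],
    "Survived": [0, 1, 1, 1, 0, 0, 0, 0, 1, 0],
})

toy_train, toy_test = stratified_split(toy_df, test_size=0.5)

# Both halves should show the same share of survivors as the full frame.
print(toy_train["Survived"].mean(), toy_test["Survived"].mean())
```

In this notebook the equivalent call would presumably be `strat_train_set, strat_test_set = stratified_split(train_df)`.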

## Prepare the data

Let us now separate the features from the labels.

In [None]:
# Assumes `strat_train_set` holds the training part of a train/test split of `train_df`.
predictors = strat_train_set.drop("Survived", axis=1)
labels = strat_train_set["Survived"].copy()

### Data Cleaning

Most Machine Learning algorithms cannot work with missing features, so we need to take care of them. We noticed earlier that the `Cabin`, `Age` and `Embarked` attributes have some missing values, so let us fix this. We have three options:

* Get rid of the corresponding rows
* Get rid of the whole attribute
* Set the missing values (zero, mean, median, etc.)

We will drop the `Cabin` attribute entirely, since most of its values are missing, and later impute the `Age` and `Embarked` attributes in a pipeline.

In [None]:
predictors.drop("Cabin", axis=1, inplace=True)

In [None]:
num_imputer = SimpleImputer(strategy="median")
cat_imputer = SimpleImputer(strategy="most_frequent")
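
As a rough illustration of the three options listed above, the sketch below applies each of them to a small made-up frame. The frame and the variable names are assumptions for the example only; the column names simply mirror the Titanic attributes.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

toy = pd.DataFrame({
    "Age": [22.0, np.nan, 30.0, 40.0],
    "Embarked": ["S", "C", np.nan, "S"],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
})

# Option 1: get rid of the rows where Age is missing.
dropped_rows = toy.dropna(subset=["Age"])

# Option 2: get rid of the whole attribute.
dropped_column = toy.drop(columns=["Cabin"])

# Option 3: set the missing values to a statistic learned from the data.
age_filled = SimpleImputer(strategy="median").fit_transform(toy[["Age"]])
port_filled = SimpleImputer(strategy="most_frequent").fit_transform(toy[["Embarked"]])

print(age_filled.ravel())   # the gap is replaced by the median age
print(port_filled.ravel())  # the gap is replaced by the most common port
```

Fitting the imputer on the training data and reusing it on new data keeps the fill values consistent, which is why the notebook places the imputers inside a pipeline.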

### Handling Text and Categorical Attributes

Machine Learning algorithms work with numbers, so let's convert the text categories to numbers.

`scikit-learn` provides a transformer for this task called `OneHotEncoder`, which we will use to encode the `Embarked` and `Sex` attributes.

In [None]:
encoder = OneHotEncoder()
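
To make the encoding concrete, here is a minimal sketch on a made-up three-row frame (the frame and the name `toy_encoder` are assumptions for illustration). Each category becomes its own 0/1 column.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
})

toy_encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default; densify it for printing.
one_hot = toy_encoder.fit_transform(toy).toarray()

print(toy_encoder.categories_)  # categories found in each column, sorted
print(one_hot)                  # one row per passenger, one column per category
```

With two values for `Sex` and three for `Embarked`, the result has five columns.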

### Feature scaling

One of the most important transformations we need to apply to our data is feature scaling. With few exceptions, Machine Learning algorithms don't perform well when the input numerical attributes have very different scales. This is the case for the Titanic data, where the ranges of `Fare`, `Age` and `Pclass` differ significantly. Note that scaling the target values is generally not required.

There are two common ways to get all attributes to have the same scale: min-max scaling and standardization. Min-max scaling (many people call this normalization) is quite simple: values are shifted and rescaled so that they end up ranging from 0 to 1. We do this by subtracting the min value and dividing by the max minus the min. Scikit-Learn provides a transformer called `MinMaxScaler` for this. It has a `feature_range` hyperparameter that lets you change the range if you don’t want 0–1 for some reason.

Standardization is quite different: first it subtracts the mean value (so standardized values always have a zero mean), and then it divides by the standard deviation so that the resulting distribution has unit variance. Scikit-Learn provides a transformer called `StandardScaler` for standardization.

In [None]:
scaler = StandardScaler()
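
The two formulas can be checked by hand against the scikit-learn transformers. The sketch below uses a made-up column of fares (an assumption for illustration) and compares the manual computation with `MinMaxScaler` and `StandardScaler`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

fares = np.array([[5.0], [10.0], [15.0], [50.0]])

# Min-max scaling: (x - min) / (max - min), so values land in [0, 1].
manual_minmax = (fares - fares.min()) / (fares.max() - fares.min())

# Standardization: (x - mean) / standard deviation.
manual_standard = (fares - fares.mean()) / fares.std()

# Both comparisons should agree with the library transformers.
print(np.allclose(manual_minmax, MinMaxScaler().fit_transform(fares)))
print(np.allclose(manual_standard, StandardScaler().fit_transform(fares)))
```

Note how the single large fare squeezes the other min-max values toward 0, whereas standardization does not bound the values to a fixed range.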

### Putting it together

As we can see, there are many data transformation steps that need to be executed in the right order. Scikit-Learn provides the `Pipeline` class to help with such sequences of transformations, and the `ColumnTransformer` class to apply a different pipeline to each group of columns.

In [None]:
num_attribs = ["Pclass", "Age", "SibSp", "Parch", "Fare"]  # Age included so that it is imputed and scaled
cat_attribs = ["Sex", "Embarked"]

In [None]:
num_pipeline = Pipeline([
    ("num_imputer", num_imputer),
    ("std_scaler", scaler)
])

In [None]:
cat_pipeline = Pipeline([
    ("cat_imputer", cat_imputer),
    ("encoder", encoder)
])

In [None]:
full_pipeline = ColumnTransformer([
        ('num', num_pipeline, num_attribs),
        ('cat', cat_pipeline, cat_attribs),
    ])

predictors_prepared = full_pipeline.fit_transform(predictors)
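
To see what such a combined transformer produces, the following sketch builds a smaller version on a made-up frame. All names and values here are assumptions for illustration. Columns that are not listed in any branch are dropped by default, and `sparse_threshold=0` is set only so that the toy result is always a dense array.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

toy = pd.DataFrame({
    "Fare": [7.25, 71.28, np.nan, 8.05],
    "Sex": ["male", "female", "female", "male"],
    "Name": ["A", "B", "C", "D"],  # not listed below, so it is dropped
})

toy_num = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
toy_cat = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder()),
])

toy_pipeline = ColumnTransformer(
    [("num", toy_num, ["Fare"]), ("cat", toy_cat, ["Sex"])],
    sparse_threshold=0,  # always return a dense array
)

toy_prepared = toy_pipeline.fit_transform(toy)
print(toy_prepared.shape)  # one scaled Fare column plus one column per Sex value
```

Calling `fit_transform` on the training data and only `transform` on the test data, as the notebook does later, ensures the test set is scaled and imputed with statistics learned from the training set.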

# Model Training

Now that we have framed the problem, obtained and explored the data, sampled a training set and a test set, and written our transformation pipelines to clean up and prepare the data for Machine Learning algorithms, we are ready to select and train a machine learning model. We will use 4-fold cross-validation to evaluate five different models:

* Support Vector Machines
* Decision Trees
* K Nearest Neighbors
* Random Forests
* Logistic Regression

In [None]:
svc = SVC(gamma="auto")
svc.fit(predictors_prepared, labels)
svc_scores = cross_val_score(svc, predictors_prepared, labels, scoring="accuracy", cv=4)
print(svc_scores)

In [None]:
tree = DecisionTreeClassifier()
tree.fit(predictors_prepared, labels)
tree_scores = cross_val_score(tree, predictors_prepared, labels, scoring="accuracy", cv=4)
print(tree_scores)

In [None]:
knn = KNeighborsClassifier()
knn.fit(predictors_prepared, labels)
knn_scores = cross_val_score(knn, predictors_prepared, labels, scoring="accuracy", cv=4)
print(knn_scores)

In [None]:
random_forest = RandomForestClassifier(n_estimators=100)
random_forest.fit(predictors_prepared, labels)
random_forest_scores = cross_val_score(random_forest, predictors_prepared, labels, scoring="accuracy", cv=4)
print(random_forest_scores)

In [None]:
logistic_regression = LogisticRegression(solver="lbfgs")
logistic_regression.fit(predictors_prepared, labels)
logistic_regression_scores = cross_val_score(logistic_regression, predictors_prepared, labels, scoring="accuracy", cv=4)
print(logistic_regression_scores)
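
The five cells above follow the same pattern, so the comparison can also be written as a loop. The sketch below is one possible arrangement, run on a synthetic dataset from `make_classification` as a stand-in for `predictors_prepared` and `labels`. The dataset, the `models` dictionary and the fixed seeds are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for (predictors_prepared, labels).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

models = {
    "svc": SVC(gamma="auto"),
    "tree": DecisionTreeClassifier(random_state=0),
    "knn": KNeighborsClassifier(),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "logistic_regression": LogisticRegression(solver="lbfgs"),
}

cv_scores = {}
for name, model in models.items():
    # cross_val_score fits a clone of the model on each of the 4 folds,
    # so `model` itself stays unfitted after this call.
    cv_scores[name] = cross_val_score(model, X, y, scoring="accuracy", cv=4)
    print(f"{name}: mean={cv_scores[name].mean():.3f} std={cv_scores[name].std():.3f}")
```

Because `cross_val_score` works on clones, the notebook's separate `fit` calls are what make the models usable for prediction on the test set later.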

# Model Evaluation

We are now at the final stage of our experiment. We are ready to evaluate our algorithms using the test set.

In [None]:
# Assumes `strat_test_set` is the held-out part of the same split as `strat_train_set`.
X_test = strat_test_set.drop(["Survived", "Cabin"], axis=1)
y_test = strat_test_set["Survived"].copy()

In [None]:
X_test_prepared = full_pipeline.transform(X_test)

In [None]:
svc_predictions = svc.predict(X_test_prepared)
svc_accuracy = accuracy_score(y_test, svc_predictions)

In [None]:
tree_predictions = tree.predict(X_test_prepared)
tree_accuracy = accuracy_score(y_test, tree_predictions)

In [None]:
knn_predictions = knn.predict(X_test_prepared)
knn_accuracy = accuracy_score(y_test, knn_predictions)

In [None]:
random_forest_predictions = random_forest.predict(X_test_prepared)
random_forest_accuracy = accuracy_score(y_test, random_forest_predictions)

In [None]:
logistic_regression_predictions = logistic_regression.predict(X_test_prepared)
logistic_regression_accuracy = accuracy_score(y_test, logistic_regression_predictions)

In [None]:
print(svc_accuracy)
print(tree_accuracy)
print(knn_accuracy)
print(random_forest_accuracy)
print(logistic_regression_accuracy)
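
Accuracy is simply the fraction of predictions that match the true labels. As a small sketch, the example below computes it for two made-up sets of predictions and collects the results in a sorted `pandas` Series, which is easier to read than a column of bare numbers. The labels, predictions and model names are assumptions for illustration.

```python
import pandas as pd
from sklearn.metrics import accuracy_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
toy_predictions = {
    "model_a": [1, 0, 1, 1, 0, 0, 1, 0],  # 8 of 8 correct
    "model_b": [1, 0, 0, 1, 0, 1, 1, 0],  # 6 of 8 correct
}

# Map each model name to its accuracy and sort from best to worst.
summary = pd.Series(
    {name: accuracy_score(y_true, preds) for name, preds in toy_predictions.items()}
).sort_values(ascending=False)

print(summary)
```

The same pattern could be applied to the five accuracy variables computed above.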