# Install necessary packages

We can install the necessary packages either by running `pip install --user <package_name>` for each one, or by listing everything in a `requirements.txt` file and running `pip install --user -r requirements.txt`. Since we need several packages, we will use the second option.

> NOTE: Do not forget the `--user` argument. It is necessary if you want to use Kale to transform this notebook into a Kubeflow pipeline.

In [1]:
!pip install --user -r requirements.txt

Collecting numpy==1.16.4
  Downloading numpy-1.16.4-cp36-cp36m-manylinux1_x86_64.whl (17.3 MB)
     |████████████████████████████████| 17.3 MB 2.9 MB/s eta 0:00:01
Collecting pandas==0.25.1
  Downloading pandas-0.25.1-cp36-cp36m-manylinux1_x86_64.whl (10.5 MB)
     |████████████████████████████████| 10.5 MB 56.2 MB/s eta 0:00:01
Collecting scikit-learn==0.20.4
  Downloading scikit_learn-0.20.4-cp36-cp36m-manylinux1_x86_64.whl (5.4 MB)
     |████████████████████████████████| 5.4 MB 36.9 MB/s eta 0:00:01
Installing collected packages: numpy, pandas, scikit-learn
Successfully installed numpy-1.16.4 pandas-0.25.1 scikit-learn-0.20.4
You should consider upgrading via the '/usr/bin/python3 -m pip install --upgrade pip' command.


# Imports

In this section we import the packages we need for this example. Make it a habit to gather your imports in a single place. It will make your life easier if you are going to transform this notebook into a Kubeflow pipeline using Kale.

In [2]:
import os
import numpy as np
import pandas as pd

from sklearn.svm import SVC
from sklearn.metrics import log_loss
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Load the data

In this section, we load the data. The data come in CSV format, so we can read them directly into `pandas` DataFrames.

In [3]:
data_path = "dataset/"
train_path = os.path.join(data_path, "train.csv")
test_path = os.path.join(data_path, "test.csv")

In [4]:
train_df = pd.read_csv(train_path)
test_df = pd.read_csv(test_path)

In [5]:
train_df.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


# Data processing

We are now ready to preprocess the dataset. This includes cleaning the data, imputing missing values, scaling numerical values, and encoding categorical attributes.

In [6]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [7]:
missing_values = train_df.isnull().sum().sort_values(ascending=False)
missing_values

Cabin          687
Age            177
Embarked         2
Fare             0
Ticket           0
Parch            0
SibSp            0
Sex              0
Name             0
Pclass           0
Survived         0
PassengerId      0
dtype: int64
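
Absolute counts are easier to judge as a share of the rows: 687 of 891 `Cabin` values are missing (about 77%), against 177 of 891 for `Age` (about 20%). The sketch below shows one way to compute such shares with pandas. It uses a small made-up frame (`toy_df` and its values are illustrative, not the Titanic data), so it runs on its own.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train_df; the values are made up for illustration.
toy_df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

# isnull() yields booleans; their column-wise mean is the fraction of missing rows.
missing_share = toy_df.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

Applied to the real data, the same expression would be `train_df.isnull().mean()`.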

## Prepare the data

Let us now separate the features from the labels.

In [8]:
predictors = train_df.drop("Survived", axis=1)
labels = train_df["Survived"].copy()

### Data Cleaning

Most Machine Learning algorithms cannot work with missing features, so we need to take care of them. We noticed earlier that the `Cabin`, `Age` and `Embarked` attributes have some missing values. We have three options:

* Get rid of the corresponding rows
* Get rid of the whole attribute
* Set the missing values (zero, mean, median, etc.)

We will drop the `Cabin` attribute entirely, since most of its values are missing, and handle the remaining missing values later with imputers inside a pipeline.

In [9]:
predictors.drop("Cabin", axis=1, inplace=True)

In [10]:
num_imputer = SimpleImputer(strategy="median")
cat_imputer = SimpleImputer(strategy="most_frequent")
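
To make the three options listed above concrete, here is a minimal sketch on a made-up frame (`toy_df` and the variable names are illustrative assumptions, not part of the notebook). The third option uses the same two `SimpleImputer` strategies as the cell above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data with one missing numerical and one missing categorical value.
toy_df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Embarked": ["S", "C", np.nan, "S"],
})

# Option 1: get rid of every row that has a missing value.
rows_dropped = toy_df.dropna()

# Option 2: get rid of the whole attribute (column).
column_dropped = toy_df.drop("Age", axis=1)

# Option 3: set the missing values, here to the median / the most frequent value.
age_filled = SimpleImputer(strategy="median").fit_transform(toy_df[["Age"]])
embarked_filled = SimpleImputer(strategy="most_frequent").fit_transform(
    toy_df[["Embarked"]]
)

print(len(rows_dropped), list(column_dropped.columns))
print(age_filled.ravel(), embarked_filled.ravel())
```

Dropping rows loses two of the four toy rows, which is why imputation is usually preferred when only a few values are missing.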

### Handling Text and Categorical Attributes

Most Machine Learning algorithms work with numbers, so let’s convert the text attributes to numbers.

`scikit-learn` provides a transformer for this task called `OneHotEncoder`, which we will use to encode the `Embarked` and `Sex` attributes.

In [11]:
encoder = OneHotEncoder()
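
As a rough illustration of what the encoder produces, the sketch below one-hot encodes a tiny made-up `Sex` column (`toy_sex` and `toy_encoder` are illustrative names). Each category becomes its own 0/1 column, and the categories are sorted alphabetically.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# One categorical column with three made-up rows.
toy_sex = np.array([["male"], ["female"], ["male"]])

toy_encoder = OneHotEncoder()
# fit_transform returns a SciPy sparse matrix; toarray() makes it dense for display.
one_hot = toy_encoder.fit_transform(toy_sex).toarray()

print(toy_encoder.categories_)  # learned categories, one array per input column
print(one_hot)                  # first column: female, second column: male
```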

### Feature scaling

One of the most important transformations we need to apply to our data is feature scaling. With few exceptions, Machine Learning algorithms don’t perform well when the input numerical attributes have very different scales. This is the case for the Titanic data: the ranges of `Fare`, `Age` and `Pclass` differ significantly. Note that scaling the target values is generally not required.

There are two common ways to get all attributes to have the same scale: min-max scaling and standardization. Min-max scaling (many people call this normalization) is quite simple: values are shifted and rescaled so that they end up ranging from 0 to 1. We do this by subtracting the min value and dividing by the max minus the min. Scikit-Learn provides a transformer called `MinMaxScaler` for this. It has a `feature_range` hyperparameter that lets you change the range if you don’t want 0–1 for some reason.

Standardization is quite different: first it subtracts the mean value (so standardized values always have a zero mean), and then it divides by the standard deviation so that the resulting distribution has unit variance. Scikit-Learn provides a transformer called `StandardScaler` for standardization.

In [12]:
scaler = StandardScaler()
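
The two scaling schemes can be checked by hand. The sketch below applies min-max scaling, (x - min) / (max - min), and standardization, (x - mean) / standard deviation, to a few made-up fare values and compares the result with `MinMaxScaler` and `StandardScaler`. The array `fares` is an illustrative assumption.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# A single numerical column with made-up values.
fares = np.array([[7.25], [71.28], [7.92], [53.10]])

# Min-max scaling: shift by the minimum, divide by the range.
manual_min_max = (fares - fares.min()) / (fares.max() - fares.min())

# Standardization: subtract the mean, divide by the (population) standard deviation.
manual_standard = (fares - fares.mean()) / fares.std()

sk_min_max = MinMaxScaler().fit_transform(fares)
sk_standard = StandardScaler().fit_transform(fares)

print(np.allclose(manual_min_max, sk_min_max))
print(np.allclose(manual_standard, sk_standard))
```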

### Putting it together

As we can see, there are many data transformation steps that need to be executed in the right order. Scikit-Learn provides the Pipeline class to help with such sequences of transformations.

In [13]:
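# Columns not listed in either group (e.g. "Age", "Name", "Ticket") are dropped
# by the ColumnTransformer below, whose default is remainder="drop".
num_attribs = ["Pclass", "SibSp", "Parch", "Fare"]
cat_attribs = ["Sex", "Embarked"]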

We create one pipeline for the numerical data and one for the categorical data. We then bring everything together in a final `ColumnTransformer`, which prepares our predictors for fitting.

In [14]:
num_pipeline = Pipeline([
    ("num_imputer", num_imputer),
    ("std_scaler", scaler)
])

In [15]:
cat_pipeline = Pipeline([
    ("cat_imputer", cat_imputer),
    ("encoder", encoder)
])

In [16]:
full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", cat_pipeline, cat_attribs),
])

predictors_prepared = full_pipeline.fit_transform(predictors)
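
To see what a `ColumnTransformer` of this kind returns, here is a self-contained sketch on a made-up frame (`toy_df`, `toy_transformer` and the step names are illustrative assumptions). It shows two things: the listed columns are imputed, scaled and encoded, and any column that is not listed is dropped, because `remainder="drop"` is the default.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

toy_df = pd.DataFrame({
    "Fare": [7.25, 71.28, np.nan, 53.10],
    "Sex": ["male", "female", "female", "male"],
    "Name": ["a", "b", "c", "d"],  # not listed below, so it is dropped
})

toy_num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
toy_cat_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder()),
])

toy_transformer = ColumnTransformer([
    ("num", toy_num_pipeline, ["Fare"]),
    ("cat", toy_cat_pipeline, ["Sex"]),
])

toy_prepared = toy_transformer.fit_transform(toy_df)
# 1 scaled numerical column + 2 one-hot columns for Sex = 3 columns.
print(toy_prepared.shape)
```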

# Model Training

Now that we have loaded and explored the data and written our transformation pipelines to clean up and prepare it for Machine Learning algorithms, we are ready to select and train a machine learning model.

In this example, though, rather than picking a single model we will fit five different ones in parallel:

* Support Vector Machines
* Decision Trees
* K Nearest Neighbors
* Random Forests
* Logistic Regression

> Besides fitting the models, we use 4-fold cross-validation to evaluate them and keep the best fold score as each model's accuracy.

In [17]:
svc = SVC(gamma="auto")
svc.fit(predictors_prepared, labels)
svc_scores = cross_val_score(svc, predictors_prepared, labels, scoring="accuracy", cv=4)
svc_accuracy = max(svc_scores)

In [18]:
tree = DecisionTreeClassifier()
tree.fit(predictors_prepared, labels)
tree_scores = cross_val_score(tree, predictors_prepared, labels, scoring="accuracy", cv=4)
tree_accuracy = max(tree_scores)

In [19]:
knn = KNeighborsClassifier()
knn.fit(predictors_prepared, labels)
knn_scores = cross_val_score(knn, predictors_prepared, labels, scoring="accuracy", cv=4)
knn_accuracy = max(knn_scores)

In [20]:
random_forest = RandomForestClassifier(n_estimators=100)
random_forest.fit(predictors_prepared, labels)
random_forest_scores = cross_val_score(random_forest, predictors_prepared, labels, scoring="accuracy", cv=4)
random_forest_accuracy = max(random_forest_scores)

In [21]:
logistic_regression = LogisticRegression(solver="lbfgs")
logistic_regression.fit(predictors_prepared, labels)
logistic_regression_scores = cross_val_score(logistic_regression, predictors_prepared, labels, scoring="accuracy", cv=4)
logistic_regression_accuracy = max(logistic_regression_scores)
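
The cells above keep the highest of the four fold scores for each model. A common alternative is to report the mean and the spread across folds, which is less optimistic than the best fold. The sketch below shows that variant on synthetic data from `make_classification`. The data, the `toy_` names and the choice of two models are assumptions made so that the example runs on its own. Note also that `cross_val_score` fits clones of the estimator, so the separate `fit` calls above are what make the models usable for `predict` later.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for the prepared predictors.
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

toy_models = {
    "tree": DecisionTreeClassifier(random_state=0),
    "logistic_regression": LogisticRegression(solver="lbfgs"),
}

toy_summary = {}
for name, model in toy_models.items():
    scores = cross_val_score(model, X_toy, y_toy, scoring="accuracy", cv=4)
    # Keep the mean and standard deviation over the four folds.
    toy_summary[name] = (scores.mean(), scores.std())
    print(f"{name}: mean={scores.mean():.3f}, std={scores.std():.3f}")
```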

# Making Predictions

We are now at the final stage of our experiment. We are ready to make our predictions on the test set. First, we should process the test set the same way we processed the training set.

In [22]:
test_predictors = test_df.drop("Cabin", axis=1)

In [23]:
test_predictors_prepared = full_pipeline.transform(test_predictors)
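
Note that the cell above calls `transform`, not `fit_transform`: the test set is processed with the statistics learned from the training set. The sketch below illustrates the difference with a single `StandardScaler` on made-up numbers (the `toy_` arrays are illustrative assumptions).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

toy_train = np.array([[1.0], [2.0], [3.0]])
toy_test = np.array([[2.0], [10.0]])

toy_scaler = StandardScaler()
toy_scaler.fit(toy_train)                         # learns mean and scale from the training data only
toy_test_scaled = toy_scaler.transform(toy_test)  # reuses those training statistics

print(toy_scaler.mean_)   # mean of the training column
print(toy_test_scaled)    # test values expressed relative to the training mean
```

Calling `fit_transform` on the test set instead would re-estimate the mean and scale from the test data, so the model would see differently scaled inputs than it was trained on.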

In [24]:
svc_predictions = svc.predict(test_predictors_prepared)

In [25]:
tree_predictions = tree.predict(test_predictors_prepared)

In [26]:
knn_predictions = knn.predict(test_predictors_prepared)

In [27]:
random_forest_predictions = random_forest.predict(test_predictors_prepared)

In [28]:
logistic_regression_predictions = logistic_regression.predict(test_predictors_prepared)

Finally, we print the results in the last cell. This is what Kale will use to extract the pipeline metrics.

# Prepare Kaggle Submission

We can use the best classifier to prepare a CSV submission file for the corresponding Kaggle competition.

In [29]:
prediction_df = test_df.copy()
prediction_df["Survived"] = tree_predictions

submission_df = prediction_df[["PassengerId", "Survived"]]

In [30]:
submission_df.head()

| | PassengerId | Survived |
|---|---|---|
| 0 | 892 | 0 |
| 1 | 893 | 1 |
| 2 | 894 | 0 |
| 3 | 895 | 0 |
| 4 | 896 | 1 |


In [31]:
submission_df.to_csv("submission.csv", index=False)

# Export Pipeline Metrics

These metrics will be exported as the final pipeline metrics. We will be able to see them in the Kubeflow Pipelines (KFP) graphical user interface.

In [32]:
print(svc_accuracy)
print(tree_accuracy)
print(knn_accuracy)
print(random_forest_accuracy)
print(logistic_regression_accuracy)

0.820627802690583
0.8475336322869955
0.8295964125560538
0.8251121076233184
0.8071748878923767
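
The five values are printed in the order SVC, decision tree, k-nearest neighbors, random forest and logistic regression. As a small sketch, they can be collected in a dictionary to pick the best classifier by name. The numbers below are copied from the output above; the dictionary and variable names are illustrative. Since the decision tree and random forest are created without a fixed `random_state`, a rerun may give slightly different values.

```python
# Best fold accuracy per model, copied from the printed output above.
cv_best_scores = {
    "svc": 0.820627802690583,
    "tree": 0.8475336322869955,
    "knn": 0.8295964125560538,
    "random_forest": 0.8251121076233184,
    "logistic_regression": 0.8071748878923767,
}

# Pick the model name with the highest score.
best_name = max(cv_best_scores, key=cv_best_scores.get)
print(best_name, round(cv_best_scores[best_name], 4))
```

With these numbers the decision tree comes out on top, which matches the use of `tree_predictions` for the submission file.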
