# Model pipeline
Let's start with our prepared dataset:

In [1]:
import pandas as pd
penguins_train_scaled = pd.read_csv('data/penguins_train_scaled.csv')

To define a classifier, we need to specify the input data (`X`) and the target (`y`).
For now, we focus only on the numerical features.

In [2]:
numerical_features = penguins_train_scaled.columns[2:6]
print(numerical_features)

Index(['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g'], dtype='object')


In [3]:
X = penguins_train_scaled[numerical_features]
y = penguins_train_scaled['species']

We try out one of the simplest classification models: a k-nearest neighbor classifier. To classify an example, this model looks at the `k` nearest neighbors in the training set and chooses the majority class among them.
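
To make the majority vote concrete, here is a minimal from-scratch sketch of the idea. It is not how scikit-learn implements `KNeighborsClassifier`; the function name `knn_predict` and the toy points are made up for illustration, and ties are broken arbitrarily.

```python
from collections import Counter
from math import dist

def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Sort training indices by Euclidean distance to the query point
    order = sorted(range(len(train_points)), key=lambda i: dist(train_points[i], query))
    nearest_labels = [train_labels[i] for i in order[:k]]
    # The most frequent label among the k neighbors wins
    return Counter(nearest_labels).most_common(1)[0][0]

# Made-up toy data: two features per example
train_points = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.25), (0.9, 0.8), (0.8, 0.9)]
train_labels = ['Adelie', 'Adelie', 'Adelie', 'Gentoo', 'Gentoo']

prediction = knn_predict(train_points, train_labels, query=(0.2, 0.2), k=3)
print(prediction)  # the three closest points are all labeled 'Adelie'
```

Because the vote depends on distances, features on very different scales would dominate the result, which is why the data was scaled beforehand.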

In [4]:
from sklearn.neighbors import KNeighborsClassifier

In [5]:
# First we define the model and its parameters, in this case the number of neighbors
classifier = KNeighborsClassifier(n_neighbors=3)
# Now we train it on the data
classifier.fit(X, y)

KNeighborsClassifier(n_neighbors=3)

We can now use our classifier to make predictions:

In [6]:
predictions = classifier.predict(X)
predictions[:10]

array(['Adelie', 'Gentoo', 'Gentoo', 'Adelie', 'Chinstrap', 'Gentoo',
       'Adelie', 'Adelie', 'Gentoo', 'Chinstrap'], dtype=object)

We can then compare these predictions with the true target labels, for example using a confusion matrix. The rows in the confusion matrix are the true labels and the columns are the predicted labels.

In [7]:
from sklearn.metrics import confusion_matrix
labels = y.unique()
conf_mat = confusion_matrix(y, predictions, labels=labels)
pd.DataFrame(conf_mat, columns=labels, index=labels)

| | Adelie | Gentoo | Chinstrap |
|---|---|---|---|
| Adelie | 117 | 0 | 0 |
| Gentoo | 0 | 98 | 0 |
| Chinstrap | 0 | 0 | 58 |
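
The row and column convention is easier to see on a tiny example that contains a few mistakes. The labels below are made up for illustration and are not taken from the penguins data.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration only
y_true = ['Adelie', 'Adelie', 'Gentoo', 'Gentoo', 'Chinstrap']
y_pred = ['Adelie', 'Gentoo', 'Gentoo', 'Gentoo', 'Adelie']
toy_labels = ['Adelie', 'Gentoo', 'Chinstrap']

toy_conf_mat = confusion_matrix(y_true, y_pred, labels=toy_labels)
print(toy_conf_mat)
# Row 0 (true Adelie): one predicted as Adelie, one predicted as Gentoo
# Row 1 (true Gentoo): both predicted as Gentoo
# Row 2 (true Chinstrap): the single example was predicted as Adelie
```

Correct predictions end up on the diagonal; every off-diagonal entry is a mistake.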


### Exercise
What do you conclude from the confusion matrix? Does this confusion matrix tell us anything about how well the model will perform on new data?

Answer: The model seems to do really well: according to the confusion matrix above, it makes no mistakes on this data.
However, we predict on the same data that we trained on. To get a fair estimate of the performance, we need to apply the model to new data, such as the test set.
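
A small experiment can illustrate why training-set performance is too optimistic. The sketch below uses synthetic data from `make_classification` (not the penguins dataset) with some label noise, and a 1-nearest-neighbor classifier; all variable names are made up for this example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic, noisy data made up for illustration (not the penguins dataset)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=0,
    flip_y=0.2, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

# With k=1, every training point is its own nearest neighbor,
# so the model simply reproduces the training labels
demo_clf = KNeighborsClassifier(n_neighbors=1).fit(X_tr, y_tr)
train_acc = demo_clf.score(X_tr, y_tr)
test_acc = demo_clf.score(X_te, y_te)
print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```

The training accuracy is perfect by construction, so it says nothing about how the model handles examples it has not seen; only the held-out accuracy does.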

## Pipelines
Ok, suppose we want to predict on our test set. But remember, we also did a data transformation, namely scaling, in the preprocessing phase. We need to do the exact same steps for our test set!
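
Done by hand, "the exact same steps" means fitting the scaler on the training data only and reusing its learned minimum and maximum for the test data. A minimal sketch with made-up numbers (the array names are hypothetical):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up numbers for illustration
X_train_demo = np.array([[10.0], [20.0], [30.0]])
X_test_demo = np.array([[15.0], [40.0]])

scaler = MinMaxScaler()
# Learn the minimum and maximum from the training data only
X_train_demo_scaled = scaler.fit_transform(X_train_demo)
# Reuse the training minimum and maximum on the test data; do not fit again
X_test_demo_scaled = scaler.transform(X_test_demo)

print(X_train_demo_scaled.ravel())  # training values map onto [0, 1]
print(X_test_demo_scaled.ravel())   # test values can fall outside [0, 1]
```

Here the test value 40 lies above the training maximum of 30, so it is scaled to 1.5. That is expected: the test set must not influence the transformation.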

To make it easier to reproduce all steps for new datasets, and to make sure that the train-test split is respected in every step, sklearn provides the `Pipeline` class.
So we start again with the unscaled data, but now define a pipeline:

In [8]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline

In [9]:
penguins_train_nona = pd.read_csv('data/penguins_train_nona.csv')
X = penguins_train_nona[numerical_features]
y = penguins_train_nona['species']

In [10]:
# We give the pipeline a list of (step name, step object) tuples
pipe = Pipeline([
    ('scale', MinMaxScaler()),
    ('model', KNeighborsClassifier(n_neighbors=3))
])

In [11]:
pipe.fit(X, y)

Pipeline(steps=[('scale', MinMaxScaler()),
                ('model', KNeighborsClassifier(n_neighbors=3))])
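
To see what the fitted pipeline does on new data, the sketch below compares a pipeline with the same two steps performed by hand. It uses randomly generated data and hypothetical variable names, so it only illustrates the mechanics, not the penguins results.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
# Made-up data: two features on very different scales
X_a = rng.normal(size=(40, 2)) * [1.0, 1000.0]
y_a = (X_a[:, 0] > 0).astype(int)
X_b = rng.normal(size=(10, 2)) * [1.0, 1000.0]

demo_pipe = Pipeline([
    ('scale', MinMaxScaler()),
    ('model', KNeighborsClassifier(n_neighbors=3)),
])
demo_pipe.fit(X_a, y_a)

# The same steps by hand: fit the scaler on the training data,
# train the model on the scaled data, then transform and predict new data
manual_scaler = MinMaxScaler().fit(X_a)
manual_model = KNeighborsClassifier(n_neighbors=3).fit(manual_scaler.transform(X_a), y_a)
manual_pred = manual_model.predict(manual_scaler.transform(X_b))

# Both routes should give identical predictions
print(np.array_equal(demo_pipe.predict(X_b), manual_pred))
```

Calling `predict` on the pipeline applies the already fitted scaler to the new data before the classifier sees it, so no step can be forgotten or refitted by accident.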

Now we can test it on our test set (note that we still have to drop null values):

In [12]:
penguins_test = pd.read_csv('data/penguins_test.csv').dropna(subset=numerical_features)
X_test = penguins_test[numerical_features]
y_test = penguins_test['species']

In [13]:
pred_test = pipe.predict(X_test)
conf_mat = confusion_matrix(y_test, pred_test, labels=labels)
pd.DataFrame(conf_mat, columns=labels, index=labels)

| | Adelie | Gentoo | Chinstrap |
|---|---|---|---|
| Adelie | 34 | 0 | 0 |
| Gentoo | 0 | 25 | 0 |
| Chinstrap | 1 | 0 | 9 |
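
The test-set confusion matrix can be summarized in a single accuracy number by dividing the diagonal (correct predictions) by the total count. The sketch below copies the values from the output above into a hypothetical array `test_conf_mat`.

```python
import numpy as np

# Test-set confusion matrix copied from the output above
# (rows: true labels, columns: predicted labels; order Adelie, Gentoo, Chinstrap)
test_conf_mat = np.array([
    [34, 0, 0],
    [0, 25, 0],
    [1, 0, 9],
])

correct = np.trace(test_conf_mat)  # diagonal entries are correct predictions
total = test_conf_mat.sum()
accuracy = correct / total
print(f"{correct} of {total} correct, accuracy = {accuracy:.3f}")
# prints: 68 of 69 correct, accuracy = 0.986
```

The single mistake is a Chinstrap penguin that was predicted to be an Adelie.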
