# Pipelines 

We can chain several preprocessing steps and a machine learning model into a single pipeline.

For example, if all features were nominal (categorical without order), we could chain a `OneHotEncoder()` with a classifier such as `LogisticRegression()`.

The pipeline converts the features under the hood: calling `fit` or `predict` on the pipeline first runs the encoder on the input and then passes the encoded features to the classifier.

In [None]:
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import OrdinalEncoder

In [None]:
X = np.array([
    ['sunny', 'summer'],
    ['rainy', 'summer'],
    ['rainy', 'fall'],
    ['cloudy', 'winter'],
    ['very rainy', 'spring'],
    ['sunny', 'winter'],
    ['partially cloudy', 'spring']
])

y = np.array([
    'T-shirt',
    'T-shirt',
    'Coat',
    'Coat',
    'Coat',
    'Coat',
    'T-shirt'
])

In [None]:
X.shape

In [None]:
y.shape

In [None]:
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression

model = make_pipeline(OneHotEncoder(),
                      LogisticRegression())

In [None]:
model.fit(X, y)

In [None]:
model.score(X, y)

In [None]:
model.predict([['cloudy', 'spring']])

We can set up the same pipeline with an `OrdinalEncoder()`. However, these features have no natural order, so we should not: an ordinal encoding would impose an arbitrary ranking on the categories.
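
As a small sketch of why this is a problem, the snippet below encodes only the weather values with an `OrdinalEncoder()` (the `weather` array is an illustrative subset, not a new data set). By default the integer codes follow the alphabetical order of the category names, which has nothing to do with the weather itself.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Weather categories from the example, as a single-feature column
weather = np.array([
    ['sunny'],
    ['rainy'],
    ['cloudy'],
    ['very rainy'],
    ['partially cloudy'],
])

encoder = OrdinalEncoder()
codes = encoder.fit_transform(weather)

# Show which integer each category was mapped to
for category, code in zip(weather.ravel(), codes.ravel()):
    print(f"{category!r} -> {int(code)}")
```

A linear model would treat these codes as quantities, for instance reading 'very rainy' (4) as "more" than 'sunny' (3).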

In [None]:
model = make_pipeline(OrdinalEncoder(),
                      LogisticRegression())
model.fit(X, y)
model.score(X, y)

In [None]:
model.predict([['cloudy', 'spring']])

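We get a different answer, because the model now treats the arbitrary integer codes as ordered quantities. Note that decision tree methods might be less susceptible to this, since a tree can isolate individual codes with repeated threshold splits.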
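
To illustrate the remark about decision trees, the following sketch fits a `DecisionTreeClassifier()` on the same data with both encoders and compares the training accuracy. The names `X_demo`, `y_demo`, and `scores` are illustrative, and `random_state=0` is an assumption added only for reproducibility. Training accuracy alone says nothing about generalization, so treat this as a rough check rather than evidence that ordinal encoding is safe.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

# Same data as in the cells above
X_demo = np.array([
    ['sunny', 'summer'],
    ['rainy', 'summer'],
    ['rainy', 'fall'],
    ['cloudy', 'winter'],
    ['very rainy', 'spring'],
    ['sunny', 'winter'],
    ['partially cloudy', 'spring'],
])
y_demo = np.array(['T-shirt', 'T-shirt', 'Coat', 'Coat', 'Coat', 'Coat', 'T-shirt'])

scores = {}
for name, encoder in [('one-hot', OneHotEncoder()), ('ordinal', OrdinalEncoder())]:
    # An unrestricted tree keeps splitting until every leaf is pure
    tree_model = make_pipeline(encoder, DecisionTreeClassifier(random_state=0))
    tree_model.fit(X_demo, y_demo)
    scores[name] = tree_model.score(X_demo, y_demo)

print(scores)
```

Because all seven rows are distinct, a fully grown tree can separate them under either encoding; the encodings differ in how many splits are needed and in how the tree would treat unseen combinations.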

## Practice encoding and pipeline with a decision tree

In [None]:
X = np.array([
    ['sunny', 'summer'],
    ['rainy', 'summer'],
    ['rainy', 'fall'],
    ['cloudy', 'winter'],
    ['very rainy', 'spring'],
    ['sunny', 'winter'],
    ['partially cloudy', 'spring']
])

y = np.array([
    'T-shirt',
    'T-shirt',
    'Coat',
    'Coat',
    'Coat',
    'Coat',
    'T-shirt'
])

### Question: Can you build and visualize a decision tree using this data and a pipeline?

In [None]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

model = make_pipeline(OneHotEncoder(),
                      DecisionTreeClassifier())

In [None]:
# Fit model
model.fit(X, y)

In [None]:
# Get training score
model.score(X, y)

In [None]:
# Export the fitted tree in Graphviz DOT format
from sklearn.tree import export_graphviz

# make_pipeline names each step after its lowercased class name
export_graphviz(model.named_steps['decisiontreeclassifier'],
                feature_names=list(model.named_steps['onehotencoder'].get_feature_names_out()),
                class_names=list(np.unique(y)),
                out_file="weather-tree.dot", impurity=True, filled=True)

# Render the DOT file inline (requires the graphviz package)
import graphviz
from IPython.display import display

with open("weather-tree.dot") as f:
    dot_graph = f.read()
display(graphviz.Source(dot_graph))

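Gini impurity is $\sum_{i \in \text{classes}} p_{i} (1-p_{i})$, where $p_{i}$ is the fraction of samples in a node that belong to class $i$.

For a node with  
Number of samples = 5  
class 0 (Coat): 4  
class 1 (T-shirt): 1  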

$p_{coat} = 4/5$  

$p_{shirt} = 1/5$

In [None]:
# calculate gini impurity
gini = 4/5*(1-4/5) + 1/5*(1-1/5)
gini
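
The same calculation can be wrapped in a small helper that works for any node. This is a minimal sketch; `gini_impurity` is a hypothetical helper name, not a scikit-learn function, and it assumes the input is a list of per-class sample counts with at least one sample.

```python
import numpy as np

def gini_impurity(class_counts):
    """Gini impurity of a node, given the number of samples per class."""
    counts = np.asarray(class_counts, dtype=float)
    p = counts / counts.sum()          # class proportions p_i
    return float(np.sum(p * (1 - p)))  # sum of p_i * (1 - p_i)

# Node from the example above: 4 Coat, 1 T-shirt
node_gini = gini_impurity([4, 1])
print(node_gini)

# A pure node (all samples in one class) has impurity 0
pure_gini = gini_impurity([2, 0])
print(pure_gini)
```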