# Introduction
In this notebook, we will optimise a classification model so that it performs well on new (unseen) data.

Let's load in our data and import all the tools we will need.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, MinMaxScaler, Normalizer, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import GridSearchCV
URL = 'https://github.com/data-analytics-in-business/lasagna-pipeline-demo/raw/main/data/sample_lasagna.csv'
df = pd.read_csv(URL)
df.head()

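| | Person | Age | Weight | Income | Pay Type | Car Value | CC Debt | Gender | Live Alone | Dwell Type | Mall Trips | Nbhd | Have Tried |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 48 | 175 | 65500 | Hourly | 2190 | 3510 | Male | No | Home | 7 | East | No |
| 1 | 2 | 33 | 202 | 29100 | Hourly | 2110 | 740 | Female | No | Condo | 4 | East | Yes |
| 2 | 3 | 51 | 188 | 32200 | Salaried | 5140 | 910 | Male | No | Condo | 1 | East | No |
| 3 | 4 | 56 | 244 | 19000 | Hourly | 700 | 1620 | Female | No | Home | 3 | West | No |
| 4 | 5 | 28 | 218 | 81400 | Salaried | 26620 | 600 | Male | No | Apt | 3 | West | Yes |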


# Training and Testing
Let's specify our input $X$ and output $y$ variables, and split both into two sets of samples: one that we will use for training our model and one that we will use for testing it.

In [2]:
y = df['Have Tried']
X = df.drop(columns=['Person','Have Tried'])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
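
Because `train_test_split` shuffles the rows at random, the scores in this notebook change from run to run. The sketch below shows one way to make a split repeatable and to keep the Yes/No balance the same in both parts. It uses a small synthetic table, not the real file, and the `random_state` and `stratify` arguments are additions that the cell above does not use.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the lasagna data (made-up values, assumption for illustration).
toy = pd.DataFrame({
    "Age": range(20, 60),
    "Have Tried": ["Yes"] * 12 + ["No"] * 28,
})
X_toy = toy.drop(columns=["Have Tried"])
y_toy = toy["Have Tried"]

# random_state fixes the shuffle; stratify keeps the class proportions equal in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0, stratify=y_toy
)

print("Training rows:", len(X_tr), "Test rows:", len(X_te))
print("Share of 'Yes' in training:", (y_tr == "Yes").mean())
print("Share of 'Yes' in test:", (y_te == "Yes").mean())
```

Fixing `random_state` is useful when you want to compare runs, but any single split is still only one estimate of performance on unseen data.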


Given training and testing data, we can create a classification pipeline like before, but this time we will train it on the training data and test it on the test data.

In [3]:
# Define previous pipeline
numeric_features = ["Age", "Weight","Income","Car Value","CC Debt","Mall Trips"]
categorical_features = ["Pay Type","Gender","Live Alone","Dwell Type","Nbhd"]

preprocessor = ColumnTransformer(
    transformers=[
        ("num", MinMaxScaler(), numeric_features),
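        # sparse_output replaces the older sparse argument (scikit-learn >= 1.2; use sparse=False on older versions)
        ("cat", OneHotEncoder(drop='first', handle_unknown="ignore", sparse_output=False), categorical_features),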
    ],
)

clf_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression())
])

clf_pipeline.fit(X_train, y_train)
 
print(f'Training set score: {clf_pipeline.score(X_train, y_train)}')
print(f'Test set score: {clf_pipeline.score(X_test, y_test)}')

Training set score: 0.8286604361370716
Test set score: 0.8504672897196262
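
Fitting the whole pipeline on the training data means the preprocessing steps are also learned from the training rows only; the test rows are just transformed with what was learned. The toy sketch below illustrates this with a one-column array and hypothetical step names (`scale`, `clf`) that are not part of the pipeline above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Tiny made-up data: the single test value lies far outside the training range.
X_train_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train_toy = np.array([0, 0, 1, 1])
X_test_toy = np.array([[10.0]])

toy_pipe = Pipeline([("scale", MinMaxScaler()), ("clf", LogisticRegression())])
toy_pipe.fit(X_train_toy, y_train_toy)

# The scaler's minimum and maximum come from the training rows only.
toy_scaler = toy_pipe.named_steps["scale"]
print("Learned min and max:", toy_scaler.data_min_, toy_scaler.data_max_)

# The test value is scaled with the training range, so it can fall outside [0, 1].
print("Scaled test value:", toy_scaler.transform(X_test_toy))
```

This is the reason for passing raw `X_train` and `X_test` to the pipeline instead of scaling the full dataset first: the test score then reflects data the preprocessing has never seen.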


# Model Selection
We can use the idea of "training and testing" to optimise our model based on an estimate of its performance on unseen data.

To do this, we define a grid of parameters that we wish to search over. 

For this example, we will try different preprocessors for the numeric variables, and we will try different classifiers for the classifier "head" of our pipeline.

Run the code below to search over the parameter grid and print the results.
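
The keys of a parameter grid must match the parameter names that the pipeline exposes, with a double underscore separating a step name from whatever is nested inside it. A minimal sketch of how to look those names up follows; the column lists are placeholders chosen for illustration.

```python
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, StandardScaler

# Placeholder column lists (assumption); no data is needed to inspect parameter names.
toy_preprocessor = ColumnTransformer(
    transformers=[
        ("num", MinMaxScaler(), ["Age"]),
        ("cat", OneHotEncoder(), ["Gender"]),
    ]
)
toy_pipeline = Pipeline([
    ("preprocessor", toy_preprocessor),
    ("classifier", LogisticRegression()),
])

# Every key returned by get_params() is a valid key for a parameter grid.
param_names = sorted(toy_pipeline.get_params().keys())
print([name for name in param_names if name == "classifier" or "__num" in name])

# Swapping a whole step uses the same names that a grid search would use.
toy_pipeline.set_params(preprocessor__num=StandardScaler())
print(toy_pipeline.get_params()["preprocessor__num"])
```

Here `preprocessor__num` reads as "the `num` transformer inside the `preprocessor` step", which is why it can be replaced by a different scaler during the search.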

In [4]:
param_grid = {
    'preprocessor__num' : [StandardScaler(), MinMaxScaler(), Normalizer()],
    'classifier' : [LogisticRegression(), DecisionTreeClassifier(), GaussianNB()]
}

grid = GridSearchCV(clf_pipeline, param_grid).fit(X_train, y_train)
 
df_results = pd.DataFrame(grid.cv_results_)
df_results[['param_classifier','param_preprocessor__num','mean_test_score','rank_test_score']].head(30)

| | param_classifier | param_preprocessor__num | mean_test_score | rank_test_score |
|---|---|---|---|---|
| 0 | LogisticRegression() | StandardScaler() | 0.823983 | 1 |
| 1 | LogisticRegression() | MinMaxScaler() | 0.819283 | 2 |
| 2 | LogisticRegression() | Normalizer() | 0.741424 | 5 |
| 3 | DecisionTreeClassifier() | StandardScaler() | 0.733539 | 7 |
| 4 | DecisionTreeClassifier() | MinMaxScaler() | 0.739777 | 6 |
| 5 | DecisionTreeClassifier() | Normalizer() | 0.721148 | 8 |
| 6 | GaussianNB() | StandardScaler() | 0.803755 | 3 |
| 7 | GaussianNB() | MinMaxScaler() | 0.803755 | 3 |
| 8 | GaussianNB() | Normalizer() | 0.683854 | 9 |


**Question**: Which classification pipeline do you think will perform best on the unseen data?
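
One way to check an answer is to score the selected pipeline on the held-out test set. Note that `mean_test_score` is an average over cross-validation folds taken from the training data (five folds by default), so the test set has not been touched during the search. The sketch below uses synthetic data from `make_classification` and hypothetical step names, so its numbers say nothing about the lasagna data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (assumption for illustration only).
X_toy, y_toy = make_classification(n_samples=200, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.25, random_state=0)

toy_pipe = Pipeline([("scaler", StandardScaler()), ("classifier", LogisticRegression())])
toy_grid = {
    "scaler": [StandardScaler(), MinMaxScaler()],
    "classifier": [LogisticRegression(), DecisionTreeClassifier(random_state=0)],
}

# By default the best combination is refitted on all of the training data.
search = GridSearchCV(toy_pipe, toy_grid, cv=5).fit(X_tr, y_tr)

print("Best combination:", search.best_params_)
print("Mean cross-validated score:", search.best_score_)
print("Held-out test score:", search.score(X_te, y_te))
```

In this notebook the equivalent calls would be `grid.best_params_` and `grid.score(X_test, y_test)`. The test score can differ from the cross-validated score, since both are estimates based on a limited number of samples.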

# Exercise (Advanced)
Explore other [supervised learning](https://scikit-learn.org/stable/supervised_learning.html) classification methods and [preprocessing](https://scikit-learn.org/stable/modules/preprocessing.html) methods, and explore the parameters within those methods, to achieve the best `mean_test_score` you can on the `sample_lasagna.csv` data.
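
As a possible starting point (a sketch, not the intended solution), a grid can be given as a list of dictionaries so that each classifier is searched together with its own parameters. The classifiers, parameter values, and synthetic data below are arbitrary choices made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption); swap in X_train and y_train for the real exercise.
X_toy, y_toy = make_classification(n_samples=200, n_features=6, random_state=0)

toy_pipe = Pipeline([("scaler", StandardScaler()), ("classifier", LogisticRegression())])

# Each dictionary pairs one classifier with the parameters that belong to it.
toy_grid = [
    {
        "classifier": [LogisticRegression(max_iter=1000)],
        "classifier__C": [0.1, 1.0, 10.0],
    },
    {
        "classifier": [KNeighborsClassifier()],
        "classifier__n_neighbors": [3, 5, 11],
    },
]

search = GridSearchCV(toy_pipe, toy_grid, cv=5).fit(X_toy, y_toy)
print("Candidates tried:", len(search.cv_results_["params"]))
print("Best combination:", search.best_params_)
```

Keeping the parameters in separate dictionaries avoids combining, say, `n_neighbors` with a logistic regression, which would raise an error.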

In [5]:
# (SOLUTION)