# Demo Notebook for quick_regression utility script

### <font color="darkred">What if we could score 12+ common regression models on our data with just one line of code?</font>

For the Ames Housing Price competition I experimented with a scikit-learn Pipeline to automatically clean and prepare the data, handle numerical and categorical values, and cross-validate the most common regression models used by fellow Kagglers. See the complete notebook here:

https://www.kaggle.com/chmaxx/sklearn-pipeline-playground-for-10-classifiers

I extended this experimental playground into a small utility script that can be used on any training data for a regression problem:

https://www.kaggle.com/chmaxx/quick-regression

In [None]:
import pandas as pd

from tabulate import tabulate

# load_boston was removed in scikit-learn 1.2, so it is not imported here
from sklearn.datasets import load_diabetes
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# we import the three main functions from the utility script for scoring, training and prediction
from quick_regression import score_models
from quick_regression import train_models
from quick_regression import predict_from_models

BASE = "/kaggle/input"

Let's try the script on a first regression problem: predicting miles per gallon from car data. The data is [taken from the seaborn demo data](https://github.com/mwaskom/seaborn-data).

In [None]:
df = pd.read_csv(f"{BASE}/sample-data/mpg.csv")
# score_models() just expects your training data as a pandas DataFrame and the column name of the target variable
# the function prints out scoring values ("r2" by default) and processing times per model
scores_mpg = score_models(df, "mpg")

In just a couple of seconds we get a first impression of how several algorithms perform!

In [None]:
# the utility script returns a dataframe with a sorted list of scores of 14 regression models
print(tabulate(scores_mpg, showindex=False, floatfmt=".3f", headers="keys"))
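
To give an idea of what such a scoring function might do internally, here is a minimal sketch. It is not the code of `quick_regression`: the helper `score_regressors()`, the shortened model list and the column names are assumptions made for illustration, and synthetic data replaces the mpg file so the cell runs on its own.

```python
import time

import pandas as pd
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score


def score_regressors(X, y, scoring="r2", cv=5):
    """Cross-validate a few regressors and return a sorted score table.

    Hypothetical helper, not part of quick_regression.
    """
    # shortened, assumed model list; the utility script scores more estimators
    models = {
        "DummyRegressor": DummyRegressor(),
        "LinearRegression": LinearRegression(),
        "Ridge": Ridge(),
    }
    rows = []
    for name, model in models.items():
        start = time.perf_counter()
        scores = cross_val_score(model, X, y, scoring=scoring, cv=cv)
        rows.append({
            "model": name,
            "score_mean": scores.mean(),
            "score_std": scores.std(),
            "seconds": time.perf_counter() - start,
        })
    # best model first
    return (pd.DataFrame(rows)
            .sort_values("score_mean", ascending=False)
            .reset_index(drop=True))


# synthetic data so the sketch runs without any input files
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
demo_scores = score_regressors(X, y)
print(demo_scores)
```

The `DummyRegressor` always predicts the mean of the training target, so it lands at the bottom of the table and serves as the baseline every other model has to beat.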

We can tune the scoring with several parameters. The call below spells out the defaults.

In [None]:
scores_mpg = score_models(df=df, 
                          target_name="mpg", 
                          sample_size=None, 
                          impute_strategy="mean", 
                          scoring_metric="r2", 
                          log_x=False,
                          log_y=False, 
                          verbose=True,
                         )

If we have a larger dataset, we can score on a subsample of our data to speed up execution.

In [None]:
# the diamonds data set has more than 50k samples, which would take a while to cross-validate on 14 models
# we therefore reduce it to 1000 samples
df = pd.read_csv(f"{BASE}/sample-data/diamonds.csv")
scores_diamonds = score_models(df, "price", sample_size=1000, verbose=False)
print()
print(tabulate(scores_diamonds, showindex=False, floatfmt=".3f", headers="keys"))
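
How the script draws its subsample is not shown in this notebook. A plausible minimal sketch, assuming a simple random sample of rows, could look like this. The helper `subsample()` and the toy dataframe are made up for illustration.

```python
import numpy as np
import pandas as pd


def subsample(df, sample_size=None, random_state=42):
    """Return at most `sample_size` randomly chosen rows (hypothetical helper)."""
    if sample_size is None or sample_size >= len(df):
        return df  # nothing to reduce
    return df.sample(n=sample_size, random_state=random_state)


# toy stand-in for a large dataset
rng = np.random.default_rng(0)
big = pd.DataFrame({"carat": rng.random(50_000), "price": rng.random(50_000)})

small = subsample(big, sample_size=1000)
print(big.shape, small.shape)  # (50000, 2) (1000, 2)
```

Scores from a subsample are noisier than scores from the full data, so they are best read as a rough ranking and not as final numbers.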

We can log transform the target variable to see if it improves scoring. We can also log transform all the numerical predictive variables.

In [None]:
df = pd.read_csv(f"{BASE}/house-prices-advanced-regression-techniques/train.csv")
scores_ames = score_models(df, "SalePrice", verbose=False)
print(tabulate(scores_ames, showindex=False, floatfmt=".3f", headers="keys"))
print()

# now trying with log transformed target variable y
scores_ames = score_models(df, "SalePrice", log_y=True, verbose=False)
print(tabulate(scores_ames, showindex=False, floatfmt=".3f", headers="keys"))
print()

# now trying with log transformed predictive variables
scores_ames = score_models(df, "SalePrice", log_x=True, log_y=True, verbose=False)
print(tabulate(scores_ames, showindex=False, floatfmt=".3f", headers="keys"))
print()
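
One way to log transform a target in scikit-learn is `TransformedTargetRegressor`, which fits on the transformed target and converts predictions back automatically. The sketch below uses it on synthetic data. Whether `quick_regression` implements `log_y` this way, and on which scale it reports its scores, is not shown here, so treat this as an assumption.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# synthetic, strictly positive and right-skewed target (linear in log space)
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = np.exp(X @ np.array([0.8, -0.5, 0.3]) + rng.normal(scale=0.1, size=300))

plain = LinearRegression()
# fits on log(y) and applies exp() to the predictions
logged = TransformedTargetRegressor(
    regressor=LinearRegression(), func=np.log, inverse_func=np.exp
)

# both scores are computed on the original scale of y
r2_plain = cross_val_score(plain, X, y, scoring="r2", cv=5).mean()
r2_logged = cross_val_score(logged, X, y, scoring="r2", cv=5).mean()
print(f"R2 without log transform: {r2_plain:.3f}")
print(f"R2 with log transform:    {r2_logged:.3f}")
```

On this artificial target the log transformed model scores clearly higher because the relationship is linear only in log space. On real data such as house prices the gain depends on how skewed the target is.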

Now on to training and prediction... With just one more line we train all models on the full training set. The function returns the fitted scikit-learn Pipelines.

We can use these in the next step to predict from the test data. Just be aware: the first column contains the predictions from the DummyRegressor, which is only a baseline. Using it will very likely spoil your result... 😉

In [None]:
pipelines = train_models(df, "SalePrice", log_y=True)

df_test = pd.read_csv(f"{BASE}/house-prices-advanced-regression-techniques/test.csv")
predictions = predict_from_models(df_test, pipelines)
predictions.head()
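
The train-then-predict pattern can be sketched in a few lines. The helpers `fit_all()` and `predict_all()` below are hypothetical stand-ins, not the functions from `quick_regression`, and the data is synthetic.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler


def fit_all(X, y):
    """Fit one preprocessing + estimator pipeline per model (hypothetical helper)."""
    models = {
        "DummyRegressor": DummyRegressor(),
        "LinearRegression": LinearRegression(),
        "Ridge": Ridge(),
    }
    return {
        name: make_pipeline(SimpleImputer(), RobustScaler(), model).fit(X, y)
        for name, model in models.items()
    }


def predict_all(X_new, fitted):
    """Collect the predictions of every fitted pipeline, one column per model."""
    return pd.DataFrame({name: pipe.predict(X_new) for name, pipe in fitted.items()})


X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

fitted = fit_all(X_train, y_train)
preds = predict_all(X_test, fitted)
print(preds.head())
```

The `DummyRegressor` column holds the same value in every row (the mean of the training target), which is why it should be dropped before blending or submitting predictions.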

Let's try some more Kaggle datasets out of the box and see what happens.

In [None]:
df = pd.read_csv(f"{BASE}/tmdb-box-office-prediction/train.csv")
baseline_tmdb = score_models(df, "revenue", sample_size=1000)

## Behind the scenes
At the core of the utility script is scikit-learn's Pipeline class, which allows chaining arbitrary transformers with a final estimator. Let's look at a simple example.

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

In [None]:
df = pd.read_csv(f"{BASE}/house-prices-advanced-regression-techniques/train.csv")
X = df.select_dtypes("number").drop("SalePrice", axis=1)
y = df.SalePrice

# using the convenience function make_pipeline() to build a whole data pipeline in just one line of code
pipe = make_pipeline(SimpleImputer(), RobustScaler(), LinearRegression())
print(f"The R2 score is: {cross_val_score(pipe, X, y).mean():.4f}")
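
A small aside on `make_pipeline()` versus the `Pipeline` class: the convenience function names each step after the lowercased class name, while `Pipeline` lets us pick the names ourselves. The step names in the second pipeline are arbitrary choices for this illustration.

```python
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import RobustScaler

# step names are generated automatically
auto_named = make_pipeline(SimpleImputer(), RobustScaler(), LinearRegression())

# step names are chosen explicitly
explicit = Pipeline(steps=[
    ("imputer", SimpleImputer()),
    ("scaler", RobustScaler()),
    ("model", LinearRegression()),
])

auto_names = [name for name, _ in auto_named.steps]
explicit_names = [name for name, _ in explicit.steps]
print(auto_names)      # ['simpleimputer', 'robustscaler', 'linearregression']
print(explicit_names)  # ['imputer', 'scaler', 'model']
```

Explicit names are handy once a pipeline is nested or tuned, because parameters are addressed as `<step name>__<parameter>`.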

We can set up the same thing with two pipeline branches, one for numerical and one for categorical data.

In [None]:
num_cols = df.drop("SalePrice", axis=1).select_dtypes("number").columns
cat_cols = df.select_dtypes("object").columns

# we instantiate a first Pipeline, that processes our numerical values
numeric_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer()),
        ('scaler', RobustScaler())])

# we do the same for categorical data
# (fill_value only applies to strategy='constant', so it is omitted here)
categorical_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('onehot', OneHotEncoder(handle_unknown='ignore'))])
    
# a ColumnTransformer combines the two created pipelines
# each transformer gets the proper features according to «num_cols» and «cat_cols»
preprocessor = ColumnTransformer(
        transformers=[
            ('num', numeric_transformer, num_cols),
            ('cat', categorical_transformer, cat_cols)])

pipe = Pipeline(steps=[('preprocessor', preprocessor), ('regressor', LinearRegression())])

X = df.drop("SalePrice", axis=1)
y = df.SalePrice
print(f"The R2 score is: {cross_val_score(pipe, X, y).mean():.4f}")
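
To see what the two branches do to individual rows, here is the same kind of preprocessor on a tiny made-up dataframe with a missing number, a missing category and a category that never appears in the training data. The column names and values are invented for this sketch, and `sparse_threshold=0` is set only to get a dense array that is easy to print.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, RobustScaler

train = pd.DataFrame({
    "area": [50.0, 80.0, np.nan, 120.0],
    "zone": pd.Series(["A", "B", "A", np.nan], dtype=object),
})
# "C" was never seen during fitting
new = pd.DataFrame({"area": [100.0], "zone": pd.Series(["C"], dtype=object)})

numeric_branch = Pipeline(steps=[
    ("imputer", SimpleImputer()),   # fills the missing area with the mean
    ("scaler", RobustScaler()),
])
categorical_branch = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),  # fills with "A"
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

toy_preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_branch, ["area"]),
        ("cat", categorical_branch, ["zone"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

train_t = toy_preprocessor.fit_transform(train)
new_t = toy_preprocessor.transform(new)
print(train_t)
print(new_t)
```

The output has one scaled numeric column followed by one column per known category (`A`, `B`). The row with the missing zone is encoded as `A`, and the unseen category `C` becomes all zeros in the one-hot columns instead of raising an error.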


<font color="darkred">**The script simply abstracts all this away and, in addition, takes care of instantiating the models, cross-validation, training and prediction.**</font>

### References

[Alexis' Kaggle Tutorial](https://www.kaggle.com/alexisbcook/pipelines)<br>
[Dan Becker's Pipeline Tutorial](https://www.kaggle.com/dansbecker/pipelines)<br>
[Using the Column Transformer](https://scikit-learn.org/stable/auto_examples/compose/plot_column_transformer_mixed_types.html)<br>