## Auto-mpg
The workflow is:
1. Explore the Auto-mpg dataset
2. Train models to regress on Auto-mpg dataset
3. Apply explanation methods and visualize the effect
4. Evaluate explanation methods with several metrics

The focus is not on how accurate the model's predictions are. __Instead, the focus is on how well each explanation method explains the model <span style="color: red;">(fidelity, accuracy, predictability, stability, consistency, explicitness, certainty)</span>. Therefore, I only perform the necessary preprocessing steps.__ In addition, since the dataset is small, I extract extra, present-day data points for prediction.
<br>
My initial plan is:
1. Apply identical preprocessing to every model
2. Train 5 models independently: one underfitted, three properly fitted, and one overfitted.
3. Apply explanation methods
4. Evaluation <br>(Fidelity / Accuracy: MSE calculated against the model's predictions / the true values), <br>(Stability / Consistency: difference between explanations for a single instance across multiple models), <br>(Certainty: difference between explanations given by the same model for different instances with similar targets), <br>(<span style="color: red;">Explicitness</span>: audience-based; how to measure it remains unclear)
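
One way the fidelity / accuracy and stability / consistency measures listed above could be computed is sketched below. The function names `mse` and `mean_pairwise_distance`, the choice of Euclidean distance between attribution vectors, and the toy numbers are assumptions made for illustration; they are not the notebook's final metrics.

```python
from itertools import combinations

import numpy as np


def mse(a, b):
    """Mean squared error between two equally long sequences."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.mean((a - b) ** 2))


def mean_pairwise_distance(attributions):
    """Average Euclidean distance between every pair of attribution vectors.

    `attributions` has shape (n_models, n_features) and holds the explanations
    that different models produce for the same instance (hypothetical layout).
    """
    attributions = np.asarray(attributions, dtype=float)
    pairs = combinations(range(len(attributions)), 2)
    distances = [np.linalg.norm(attributions[i] - attributions[j]) for i, j in pairs]
    return float(np.mean(distances))


# Toy values, not taken from the Auto-mpg data
y_true = [18.0, 15.0, 24.0]
model_pred = [17.0, 16.0, 24.0]
surrogate_pred = [17.5, 15.5, 23.0]  # predictions reconstructed from an explanation

fidelity = mse(surrogate_pred, model_pred)  # explanation vs. model
accuracy = mse(surrogate_pred, y_true)      # explanation vs. ground truth

# Explanations of one instance from three models (rows) over two features (columns)
stability = mean_pairwise_distance([[1.0, 0.0], [1.0, 0.0], [0.0, 0.0]])
```

Lower values of `stability` here mean the models' explanations for that instance agree more closely.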

In [1]:
import pandas as pd
import numpy as np
import joblib
import os

autompg = pd.read_csv('Datasets/auto_mpg.csv')
print(f'Shape: {autompg.shape}')
autompg.head(2)

Shape: (398, 9)


Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin,name
0,18.0,8,307.0,130.0,3504,12.0,70,usa,chevrolet chevelle malibu
1,15.0,8,350.0,165.0,3693,11.5,70,usa,buick skylark 320


In [2]:
X, y = autompg.iloc[:, 1:], autompg.iloc[:, 0] # mpg is the target

#### Preprocessing
1. __Center around the median (but do not normalize with MAD)__: cylinders, displacement, weight, acceleration, model_year
2. __Impute with the median, then center around the median__: horsepower
3. __One-hot encode (dropping the first category)__: origin
4. __Extract the company, then one-hot encode (dropping the first category)__: name

Median-centering and dropping one category during one-hot encoding make it possible to compare feature importances against a reference point: the median value of each numeric feature and the dropped category of each categorical feature.

In [3]:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer

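class MedianCenterer(BaseEstimator, TransformerMixin): 
    """Subtract the per-column median learned during fit (no MAD scaling)."""
    def __init__(self):
        pass
    def fit(self, X, y=None):
        # column-wise medians of the training data
        self.medians = np.median(X, axis=0)
        # self.mads_ = np.median(np.abs(X - self.medians), axis=0)
        return self
    def transform(self, X):
        return X - self.medians
        # return (X - self.medians) / self.mads_
    
class CompanyNameExtractor(BaseEstimator, TransformerMixin):
    """Keep only the first word of each car name, i.e. the company."""
    def __init__(self):
        pass
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        # X is the single-column 'name' DataFrame; DataFrame.map is element-wise (pandas >= 2.1)
        return X.map(lambda name: name.split()[0]).values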
    
median_center_columns = ['cylinders', 'displacement', 'weight', 'acceleration', 'model_year']

impute_and_center_median_columns = ['horsepower']

one_hot_columns = ['origin']

company_name_columns = ['name']

preprocessor = ColumnTransformer(
    transformers = [
        ('median_center', MedianCenterer(), median_center_columns),
        ('impute_and_center_median', Pipeline([
            ('imputer', SimpleImputer(strategy='median')),
            ('centerer', MedianCenterer())
        ]), impute_and_center_median_columns),
        ('one_hot_origin', OneHotEncoder(sparse_output=False, drop='first', dtype=int), one_hot_columns),
        ('company_name', Pipeline([
            ('extractor', CompanyNameExtractor()),
            ('one_hot', OneHotEncoder(sparse_output=False, drop='first', dtype=int))
        ]), company_name_columns)
    ],
    remainder = 'passthrough'
)

def get_feature_names_out(pipeline):
    # Obtain the feature names from the pipeline
    
    preprocessor = pipeline.named_steps['preprocessor']
    origin_encoder = preprocessor.named_transformers_['one_hot_origin']
    company_name_encoder = preprocessor.named_transformers_['company_name'].named_steps['one_hot']
    
    origin_columns = origin_encoder.get_feature_names_out(['origin'])
    company_name_columns = company_name_encoder.get_feature_names_out(['company_name'])
    
    return (median_center_columns + impute_and_center_median_columns + list(origin_columns) + list(company_name_columns))

#### Model training stage
1. Underfitted model: sparse linear model with Lasso
2. Proper models: sparse linear model with Lasso (smaller penalty), SVR, and k-NN
3. Overfitted model: polynomial linear regression (degree 2)

__The trained models are stored to disk.__
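
Storing and reloading a fitted pipeline with `joblib` might look like the following minimal sketch. The synthetic data, the `StandardScaler` step, the `alpha` value, and the temporary file name are placeholders chosen for illustration, not the notebook's actual settings.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data standing in for the preprocessed features
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 3))
y_toy = X_toy @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=50)

pipeline = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('model', Lasso(alpha=0.01)),
])
pipeline.fit(X_toy, y_toy)

# Dump to a temporary directory, then load the pipeline back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_pipeline.pkl')
    joblib.dump(pipeline, path)
    restored = joblib.load(path)

# The restored pipeline should reproduce the original predictions
same_predictions = np.allclose(pipeline.predict(X_toy), restored.predict(X_toy))
print(same_predictions)
```

Because `joblib` relies on pickling, a pipeline that contains custom classes such as `MedianCenterer` can only be loaded where those class definitions are available under the same module path.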

In [37]:
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error
from pygam import LinearGAM, s, f, te

In [5]:
# training an underfitted model
"""
sparselr_underfit_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', Lasso(fit_intercept=False, alpha=.1, max_iter=100))
])
sparselr_underfit_pipeline.fit(X, y)
y_pred = sparselr_underfit_pipeline.predict(X)
mse = mean_squared_error(y, y_pred)
mse
#"""

In [38]:
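# training the overfitted model: degree-2 polynomial features + unregularized linear regression
sparsel_proper_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('poly_features', PolynomialFeatures(degree=2, include_bias=False)),
    ('model', LinearRegression(fit_intercept=False))
])
sparsel_proper_pipeline.fit(X, y)
# in-sample MSE: predictions are made on the same data used for fitting
y_pred = sparsel_proper_pipeline.predict(X)
mse = mean_squared_error(y, y_pred)
mse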

2.497834808573334

In [40]:
joblib.dump(sparsel_proper_pipeline, 'Models/Autompg_overfitted_poly.pkl')

['Models/Autompg_overfitted_poly.pkl']
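
The MSE above is computed on the training data, so on its own it does not show overfitting. A hedged sketch of how the gap between training and held-out error could be checked is given below. It uses synthetic data and a 50/50 split, both of which are assumptions for illustration; the notebook itself plans to use extra data points for prediction.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data: 60 rows, 8 features, target depends on two of them plus noise
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(60, 8))
y_toy = X_toy[:, 0] - 2.0 * X_toy[:, 1] + rng.normal(scale=0.5, size=60)

X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.5, random_state=0
)

# Same model family as the overfitted pipeline, without the preprocessor
poly_pipeline = Pipeline(steps=[
    ('poly_features', PolynomialFeatures(degree=2, include_bias=False)),
    ('model', LinearRegression(fit_intercept=False)),
])
poly_pipeline.fit(X_train, y_train)

train_mse = mean_squared_error(y_train, poly_pipeline.predict(X_train))
test_mse = mean_squared_error(y_test, poly_pipeline.predict(X_test))
print(f'train MSE: {train_mse:.4f}, held-out MSE: {test_mse:.4f}')
```

In this toy setup the degree-2 expansion yields 44 terms for only 30 training rows, so the model can fit the training rows almost exactly while the held-out error stays larger.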