# How to prepare a Pipeline model and dataset for EXPAI

Welcome to this EXPAI tutorial. It provides sample Python code for building a Scikit-Learn Pipeline. We strongly recommend packaging your model as a Pipeline object.

In [36]:
import os
import pickle

import pandas as pd
import xgboost as xgb
from sklearn import metrics as ms
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

### Loading the dataset

We use the Pandas `read_csv` function to read a CSV file. If your file is not a CSV, use the corresponding Pandas reader instead.

In [29]:
# Define the path to the sample file
original_sample_path = os.path.abspath("./car_ad_display.csv")

# Read the file
df = pd.read_csv(original_sample_path, encoding='iso-8859-1', sep = ";", index_col=0)
df.head()

| Unnamed: 0 | car | price | body | mileage | engV | engType | registration | year | model | drive |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Ford | 15500.0 | crossover | 68 | 2.5 | Gas | yes | 2010 | Kuga | full |
| 1 | Mercedes-Benz | 20500.0 | sedan | 173 | 1.8 | Gas | yes | 2011 | E-Class | rear |
| 2 | Mercedes-Benz | 35000.0 | other | 135 | 5.5 | Petrol | yes | 2008 | CL 550 | rear |
| 3 | Mercedes-Benz | 17800.0 | van | 162 | 1.8 | Diesel | yes | 2012 | B 180 | front |
| 5 | Nissan | 16600.0 | crossover | 83 | 2.0 | Petrol | yes | 2013 | X-Trail | full |
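For files that are not CSV, Pandas offers sibling readers such as `read_json`, `read_excel` and `read_parquet`. The sketch below shows one possible way to pick a reader from the file extension. The helper name `load_table` and the extension mapping are hypothetical and only serve as an illustration; they are not part of EXPAI.

```python
import os
import tempfile

import pandas as pd

# Hypothetical mapping from file extension to a pandas reader.
READERS = {
    ".csv": pd.read_csv,
    ".json": pd.read_json,
    ".xlsx": pd.read_excel,        # needs an Excel engine such as openpyxl
    ".parquet": pd.read_parquet,   # needs pyarrow or fastparquet
}


def load_table(path, **kwargs):
    """Read a tabular file with the pandas reader matching its extension."""
    ext = os.path.splitext(path)[1].lower()
    if ext not in READERS:
        raise ValueError(f"Unsupported file extension: {ext!r}")
    return READERS[ext](path, **kwargs)


# Small demonstration with a temporary JSON file (values are illustrative).
with tempfile.TemporaryDirectory() as tmp:
    json_path = os.path.join(tmp, "cars.json")
    pd.DataFrame(
        {"car": ["Ford", "Nissan"], "price": [15500.0, 16600.0]}
    ).to_json(json_path)
    demo_df = load_table(json_path)

print(demo_df)
```

Extra keyword arguments such as `encoding` or `sep` are passed through to the selected reader, so the call used in this tutorial would still work through the helper.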


### Simple data cleaning

In this section, we remove the outliers and corrupted entries we found in the dataset.

In [30]:
# Drop rows with a zero or negative price (corrupted)
df = df.drop(df[df.price <= 0 ].index)

# Drop outliers (engine volume above 40)
df = df.drop(df[df.engV > 40].index)

# Drop null values
df = df.dropna()
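The same cleaning rules can also be written as a single boolean mask, which makes it easy to check how many rows each run removes. The following is a minimal sketch on a made-up frame; the values are illustrative and are not taken from the car dataset.

```python
import numpy as np
import pandas as pd

# Synthetic rows that mimic the problems described above (values are made up).
toy = pd.DataFrame({
    "price": [15500.0, 0.0, 20500.0, 9000.0, 12000.0],
    "engV": [2.5, 1.8, 99.9, np.nan, 2.0],
})

# Keep rows with a positive price and no engine-volume outlier, then drop nulls.
mask = (toy["price"] > 0) & ~(toy["engV"] > 40)
clean = toy[mask].dropna()

print(f"{len(toy)} rows before cleaning, {len(clean)} after")
```

Logging the row counts before and after cleaning is a cheap way to notice when a threshold removes far more data than intended.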

### Train-Test split

Split the dataset into training and test sets.

In [31]:
# Select the target column
y = df["price"]

# Drop the target from the input features
X = df.drop(["price"], axis=1)

# Hold out 20% of the rows for testing; random_state makes the split reproducible
data_train, data_test, label_train, label_test = train_test_split(X, y, test_size=0.2, random_state=42)
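A quick sanity check after splitting is to confirm that the two parts do not overlap and that features and labels stay aligned. The sketch below does this on a small made-up frame; the column values are illustrative only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the features and target (values are made up).
X = pd.DataFrame({"mileage": range(10), "year": range(2000, 2010)})
y = pd.Series(range(10), name="price")

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# The two parts are disjoint and together cover every row.
assert set(X_tr.index).isdisjoint(X_te.index)
assert len(X_tr) + len(X_te) == len(X)

# Features and labels stay aligned because they share the same index.
assert X_tr.index.equals(y_tr.index)
assert X_te.index.equals(y_te.index)

print(len(X_tr), "training rows,", len(X_te), "test rows")
```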

### Store input data

It is essential to store exactly the same data that was used for the train-test split, because this file will be the input dataset for EXPAI.

In [None]:
df.to_csv('./expai_input_data.csv')
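Because `to_csv` writes the index as the first column, the file should be read back with `index_col=0` to recover the same frame. The sketch below checks this round trip on a made-up frame written to a temporary directory; the toy values and the temporary path are assumptions for illustration.

```python
import os
import tempfile

import pandas as pd

# Toy frame with a non-contiguous index, as left behind by dropping rows.
toy = pd.DataFrame(
    {"car": ["Ford", "Nissan"], "price": [15500.0, 16600.0]},
    index=[0, 5],
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "expai_input_data.csv")
    toy.to_csv(path)
    # index_col=0 restores the saved index instead of adding a new column.
    restored = pd.read_csv(path, index_col=0)

print(restored)
```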

### Create a Pipeline

Pipelines are implemented by Scikit-Learn and allow users to build a single object that covers the whole analytical process, from preprocessing to prediction.

In this case, we will build a Pipeline that:
- Encodes categorical variables
- Scales numerical variables
- Implements an XGBoost Regressor

In [33]:
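# Define a transformer for numerical and categorical variables.
# Note: recent scikit-learn releases reject handle_unknown='ignore' for OrdinalEncoder;
# there, use handle_unknown='use_encoded_value' together with an unknown_value such as -1.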
transformer = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), ["mileage", "engV", "year"]),
        ('cat', OrdinalEncoder(handle_unknown='ignore'), ["car", "body", "engType", "registration", "model", "drive"])
    ]
)
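Categories that appear only at prediction time need an explicit encoding. The sketch below shows how `OrdinalEncoder` can map them to a placeholder code with `handle_unknown="use_encoded_value"`. It assumes scikit-learn 0.24 or later, and the choice of `-1` as the placeholder is an illustrative assumption.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Unknown categories are mapped to -1, a value no known category uses.
# Assumes scikit-learn 0.24 or later, where these options are available.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
encoder.fit(np.array([["Ford"], ["Nissan"], ["Mercedes-Benz"]], dtype=object))

# "Tesla" was not seen during fit, so it receives the placeholder code.
codes = encoder.transform(np.array([["Nissan"], ["Tesla"]], dtype=object))
print(codes.ravel())
```

Known categories are numbered in sorted order, so the placeholder must lie outside the range of codes assigned during fitting.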

In [34]:
# Define XGBoost Regressor parameters
xgb_params = {
    'eta': 0.05,
    'max_depth': 5,
    'subsample': 0.7,
    'colsample_bytree': 0.7,
    'objective': 'reg:squarederror',
    'eval_metric': 'rmse',
    'silent': 1  # deprecated and ignored by XGBoost (see the warning below); 'verbosity' replaces it
}

# Init model
model = xgb.XGBRegressor(**xgb_params)

In [35]:
# Create Pipeline object where steps are transformation and model.
clf = Pipeline(steps=[
    ('preprocessor', transformer),
    ('model', model)
])

# Fit the pipeline
clf.fit(X = data_train, y = label_train)

Parameters: { silent } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.




Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num', StandardScaler(),
                                                  ['mileage', 'engV', 'year']),
                                                 ('cat',
                                                  OrdinalEncoder(handle_unknown='ignore'),
                                                  ['car', 'body', 'engType',
                                                   'registration', 'model',
                                                   'drive'])])),
                ('model',
                 XGBRegressor(base_score=0.5, booster='gbtree',
                              colsample_bylevel=1, colsample_bynode=1,
                              colsample_bytree=0.7, eta=0.05,
                              eval_...
                              importance_type='gain',
                              interaction_constraints='',
                              learning_rate=0.0500000007, max_delta_step

### Measure performance

Check the performance of the model we have just trained by predicting on the held-out test set and computing the mean squared error.

In [39]:
y_hat = clf.predict(data_test)
y_hat

array([ 6125.9272, 13683.277 , 18706.371 , ..., 35528.973 , 55109.83  ,
        2186.3943], dtype=float32)

In [41]:
ms.mean_squared_error(label_test, y_hat)

51754384.06679244
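The mean squared error is expressed in squared price units, which makes it hard to interpret. Taking its square root gives the RMSE, here roughly 7,194 in the same units as the price. The sketch below performs that conversion and also shows a hypothetical helper, `regression_report`, that gathers RMSE, MAE and R² with Scikit-Learn's metrics; the sample arrays are illustrative and not taken from the dataset.

```python
import math

import numpy as np
from sklearn import metrics as ms

# MSE reported above, in squared price units.
mse = 51754384.06679244

# RMSE is expressed in the same units as the price itself.
rmse = math.sqrt(mse)
print(f"RMSE: {rmse:.2f}")


def regression_report(y_true, y_pred):
    """Return RMSE, MAE and R^2 for a set of predictions."""
    return {
        "rmse": math.sqrt(ms.mean_squared_error(y_true, y_pred)),
        "mae": ms.mean_absolute_error(y_true, y_pred),
        "r2": ms.r2_score(y_true, y_pred),
    }


# Illustrative values only, not taken from the dataset.
report = regression_report(
    np.array([10.0, 20.0, 30.0]),
    np.array([12.0, 18.0, 33.0]),
)
print(report)
```

In the tutorial itself, the helper would be called as `regression_report(label_test, y_hat)`.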

### Export model using Pickle

We use Pickle to export the Pipeline object that will be uploaded to EXPAI.

In [38]:
model_path = os.path.abspath("./model_pipeline.pkl")
with open(model_path, 'wb') as f:
    pickle.dump(clf, f)
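Before uploading, it can be worth checking that the pickled file loads back and reproduces the original predictions. The sketch below does this with a small stand-in pipeline trained on made-up data and written to a temporary directory; `LinearRegression` replaces the XGBoost model purely to keep the example light.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small stand-in pipeline trained on made-up data.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])
pipe = Pipeline(steps=[("scaler", StandardScaler()), ("model", LinearRegression())])
pipe.fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model_pipeline.pkl")
    with open(path, "wb") as f:
        pickle.dump(pipe, f)
    # Only unpickle files you trust: loading a pickle can execute arbitrary code.
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored pipeline should reproduce the original predictions.
same = np.allclose(pipe.predict(X), restored.predict(X))
print(same)
```

A pickled Pipeline generally needs compatible library versions when it is loaded, so it helps to record the Scikit-Learn and XGBoost versions used for training alongside the file.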