# Machine Learning Model

The goal of this notebook is to build a pipeline with a machine learning model that predicts how many days after a shipment is created it gets picked up and arrives at the last-mile carrier hub.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
!pip install -e ../.

Obtaining file:///Users/christianklaus/code/christianklausML/dispatcher-project
  Preparing metadata (setup.py) ... done
Installing collected packages: dispatcher
  Attempting uninstall: dispatcher
    Found existing installation: dispatcher 1.0
    Uninstalling dispatcher-1.0:
      Successfully uninstalled dispatcher-1.0
  Running setup.py develop for dispatcher
Successfully installed dispatcher-1.0
You should consider upgrading via the '/Users/christianklaus/.pyenv/versions/3.8.12/envs/lewagon/bin/python3.8 -m pip install --upgrade pip' command.

In [3]:
import datetime
import math

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer, make_column_selector, make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

from dispatcher.data.join import JoinTables
from dispatcher.transformers.encoders import DayFeaturesEncoder, TimeFeaturesEncoder

plt.style.use("dark_background")

To generate our train-test split, we use the `join_tables` method of the `JoinTables` class, which joins and pre-processes the initial tables. It is predefined in the package module `join.py`.

In [28]:
X_train, X_test, y_train, y_test = JoinTables().join_tables()

  shipment_order = RawData.get_table_data('shipment', local=True,


ticket features loaded
order features loaded.
carrier features loaded.


  clean_shipment = RawData.get_table_data('shipment', local=True, clean=True)


warehouse features loaded
Tables joined.
NAs filled. (─‿‿─)
Number of rows: 1403405
                   percent_missing
WAREHOUSE_ID                     0
RELATION_DISTANCE                0
SCR_HRS                          0
SCR_MIN                          0
PPU_DOW                          0
0.0% of rows dropped.
train test split ready. ( ◑‿◑)


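The following class helps to debug the pipeline: placed between two steps, it prints the first rows and the shape of the data at that point and passes the data on unchanged.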

In [9]:
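class Debug(BaseEstimator, TransformerMixin):
    """Pass-through step that prints a preview of the data flowing through the pipeline."""

    def transform(self, X):
        # Show the first rows and the shape, then return the data unchanged.
        print(pd.DataFrame(X).head())
        print(X.shape)
        return X

    def fit(self, X, y=None, **fit_params):
        # Nothing to learn; this step only inspects the data.
        return self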
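
To see what the class prints, here is a minimal sketch on toy data. The array `X_toy` and the pipeline `toy_pipe` are hypothetical and only serve the illustration. The class is repeated so that the snippet runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


class Debug(BaseEstimator, TransformerMixin):
    # Same pass-through step as above, repeated so the snippet is self-contained.
    def fit(self, X, y=None, **fit_params):
        return self

    def transform(self, X):
        print(pd.DataFrame(X).head())
        print(X.shape)
        return X


# Toy data (hypothetical): 6 rows, 2 numeric columns.
X_toy = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0],
                  [4.0, 40.0], [5.0, 50.0], [6.0, 60.0]])

# Debug sits after the scaler, so it prints the scaled values.
toy_pipe = make_pipeline(StandardScaler(), Debug())
X_out = toy_pipe.fit_transform(X_toy)
```

Because `Debug` returns its input untouched, `X_out` is exactly what the scaler produced. Moving the step to another position in the pipeline shows the data at that stage instead.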

In [14]:
X_train.keys()

Index(['WAREHOUSE_ID', 'RELATION_DISTANCE', 'CARRIER_COMPANY_ID', 'SHOP_ID',
       'DESTINATION_ZIP_CODE', 'CUSTOMER_ADDRESS_ZIP_CODE',
       'CUSTOMER_ADDRESS_COUNTRY_ID', 'OC_MIN', 'OC_HRS', 'OC_DOW', 'OC_MONTH',
       'PPU_MIN', 'PPU_HRS', 'PPU_DOW', 'SCR_MIN', 'SCR_HRS', 'SCR_DOW'],
      dtype='object')

In [29]:
y_train

127874     1.0
198548     2.0
542386     1.0
1528287    0.0
1322138    1.0
          ... 
183554     4.0
1223641    1.0
192345     4.0
1220372    1.0
998077     0.0
Name: DIFF_TRUE, Length: 1122719, dtype: float64

In [None]:
X_train.sample(10)

In [6]:
# Render pipelines as HTML diagrams
from sklearn import set_config

set_config(display='diagram')

In [7]:
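# Workaround so the imputer can report feature names inside the pipeline;
# scikit-learn versions that already ship SimpleImputer.get_feature_names_out do not need it.
SimpleImputer.get_feature_names_out = (lambda self, names=None: self.feature_names_in_)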

In [32]:
num_transformer = Pipeline([
    ('imputer', SimpleImputer()),
    ('scaler', StandardScaler())])
# max_categories > 30 for one column, or > 20 for two columns, makes fitting very slow
cat_transformer = OneHotEncoder(max_categories=3, handle_unknown='infrequent_if_exist')

preprocessor = ColumnTransformer([
    ('num_tr', num_transformer, ['RELATION_DISTANCE']),
    ('cat_tr', cat_transformer, ['WAREHOUSE_ID', 'CARRIER_COMPANY_ID', 'SCR_DOW', 'SCR_HRS'])],
    remainder='drop')

pipe = make_pipeline(preprocessor, Debug(), LogisticRegression(max_iter=1000))
pipe

In [33]:
# Train pipeline
pipe.fit(X_train, y_train)

# Score model (mean accuracy on the test set)
pipe.score(X_test, y_test)

         0    1    2    3    4    5    6    7    8    9    10   11   12
0  0.198456  0.0  0.0  1.0  0.0  0.0  1.0  0.0  0.0  1.0  0.0  0.0  1.0
1  0.177652  0.0  0.0  1.0  0.0  0.0  1.0  0.0  0.0  1.0  0.0  0.0  1.0
2  0.494201  0.0  0.0  1.0  0.0  0.0  1.0  1.0  0.0  0.0  0.0  0.0  1.0
3 -1.165238  0.0  0.0  1.0  0.0  0.0  1.0  0.0  0.0  1.0  0.0  0.0  1.0
4  1.029845  1.0  0.0  0.0  1.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0  1.0
(1122719, 13)
         0    1    2    3    4    5    6    7    8    9    10   11   12
0  0.991325  1.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0  1.0  0.0  0.0  1.0
1 -0.783375  0.0  0.0  1.0  0.0  0.0  1.0  1.0  0.0  0.0  0.0  0.0  1.0
2 -0.785218  0.0  1.0  0.0  0.0  0.0  1.0  0.0  1.0  0.0  0.0  1.0  0.0
3 -0.785218  0.0  1.0  0.0  0.0  0.0  1.0  0.0  1.0  0.0  0.0  0.0  1.0
4 -1.083691  0.0  0.0  1.0  0.0  0.0  1.0  0.0  0.0  1.0  0.0  0.0  1.0
(280680, 13)


0.4484395040615648

This is our first score: a mean accuracy of about 0.45 on the test set. What could we improve?
- Try different models
- Tune hyperparameters
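
Both ideas can be sketched with `cross_validate` and `GridSearchCV`. The example below is only an illustration under stated assumptions: it runs on synthetic data from `make_classification` instead of the shipment tables, and the candidate models, the majority-class baseline, and the grid over `C` are hypothetical choices, not results from this project.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the shipment data (assumption: 3 classes, 8 features).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, n_informative=5, n_classes=3, random_state=0
)

# Candidate models, including a majority-class baseline for reference.
candidates = {
    "baseline": DummyClassifier(strategy="most_frequent"),
    "logreg": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Compare the candidates with 5-fold cross-validated accuracy.
cv_scores = {}
for name, model in candidates.items():
    result = cross_validate(model, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_scores[name] = result["test_score"].mean()
    print(f"{name}: {cv_scores[name]:.3f}")

# Tune the regularization strength C of the logistic regression.
param_grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(candidates["logreg"], param_grid, cv=5, scoring="accuracy")
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 3))
```

On the real data, the same pattern would wrap the `preprocessor` from above instead of `StandardScaler`. Given more than a million training rows, it is probably wise to tune on a sample first.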