## Packaging algorithm for Yandex.Market Hackathon

There is a Yandex.Market warehouse where orders are processed, packed, and sent for delivery. Packers ("users") take prepared goods from a list and pack them into suitable packaging - cardboard boxes, film, or bags.

**Task**: select the optimal packaging for each set of goods, so that the items fit inside and the box is not larger than necessary.

Currently, there is a heuristic algorithm that selects packaging based on the total volume of all items in the order and their sizes. However, users often reject its suggestions and pack the goods in a different box or packaging.

We need to find a way to improve this algorithm.

**Solution:** In this work I use the [Multi Instance Learning](https://nilg.ai/202105/an-introduction-to-multiple-instance-learning/) approach, which allows us to treat the task as a classical supervised machine learning problem.
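To make the idea concrete, here is a minimal sketch (not the notebook's actual pipeline) of how an order, viewed as a bag of items, can be collapsed into one fixed-length feature vector with a single label. The toy orders and the `bag_to_vector` helper are hypothetical.

```python
from collections import Counter

# Toy bags: each order is a list of item widths (hypothetical values),
# and each order has exactly one label: the carton the packer chose.
bags = {
    "order_1": [11, 11, 29],
    "order_2": [3],
}
labels = {"order_1": "MYC", "order_2": "MYA"}

# Vocabulary of all widths seen in any bag, in a fixed order.
vocab = sorted({w for items in bags.values() for w in items})

def bag_to_vector(items, vocab):
    """Count how many items in the bag take each value in the vocabulary."""
    counts = Counter(items)
    return [counts[v] for v in vocab]

X = [bag_to_vector(bags[k], vocab) for k in sorted(bags)]
y = [labels[k] for k in sorted(bags)]
print(vocab)  # [3, 11, 29]
print(X)      # [[0, 2, 1], [1, 0, 0]]
print(y)      # ['MYC', 'MYA']
```

Each row of `X` now describes a whole order, so any ordinary classifier can be trained on `(X, y)`.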

In [1]:
import pandas as pd
import numpy as np

from catboost import CatBoostClassifier

from sklearn.metrics import classification_report
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder
from skmultilearn.model_selection import iterative_train_test_split

## Loading and preprocessing the data

Items with their sizes

In [2]:
items = pd.read_csv('./data/sku.csv', index_col=0, low_memory=False)
items.columns = ['sku', 'width', 'height', 'depth']
items.head()

Unnamed: 0,sku,width,height,depth
0,8ba57dcdba9a58b0c4edd180bef6afc9,11.0,31.0,28.0
1,d9af6ce6f9e303f4b1a8cb47cde21975,29.0,14.0,40.0
2,8b91fd242bde88f0891380506d9c3caa,12.0,13.0,35.0
3,e8af308a7659e34194770d1e3a48e144,3.0,13.0,8.0
4,dc0e2542e122731217289b8e6d3bd3f8,96.0,18.0,56.0


In [3]:
items.describe()

Unnamed: 0,width,height,depth
count,6385961.0,6385961.0,6385961.0
mean,21.08468,12.03353,17.82524
std,18.90676,14.87745,15.08838
min,0.0,0.0,0.0
25%,10.0,3.0,8.0
50%,18.0,8.0,15.0
75%,28.0,16.0,24.0
max,6554.0,2050.0,593.0


In [4]:
items.nunique()

sku       6385961
width        2250
height       1948
depth        1844
dtype: int64

In [5]:
SIZE_COLS = ['width', 'height', 'depth']

We need fewer unique values, so the sizes should be split into groups. But first, let's round them up to the nearest integer.

In [6]:
items[SIZE_COLS] = np.ceil(items[SIZE_COLS]).astype(int)

In [7]:
items.nunique()

sku       6385961
width         367
height        344
depth         319
dtype: int64

We can shrink them more, but for now let's leave it as it is.
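If further shrinking is needed, one option is to bucket the sizes into coarse ranges. A minimal sketch, assuming hypothetical bin edges (the notebook does not specify any):

```python
import numpy as np
import pandas as pd

# Toy sizes in the same units as the `width` column.
sizes = pd.Series([0, 3, 11, 29, 96, 6554])

# Hypothetical bin edges; the open-ended last bin absorbs outliers.
edges = [-np.inf, 5, 10, 20, 50, 100, np.inf]

# labels=False returns the integer index of the bin for each value.
bins = pd.cut(sizes, bins=edges, labels=False)
print(bins.tolist())  # [0, 0, 2, 3, 4, 5]
```

With six bins per dimension, the count features would shrink from hundreds of columns per dimension to six, at the cost of losing fine-grained size information.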

Orders, which we need to convert into a supervised learning problem using Multi Instance Learning techniques

In [8]:
orders = pd.read_csv('./data/data.csv')
orders.head()

Unnamed: 0.1,Unnamed: 0,whs,orderkey,selected_cartontype,box_num,recommended_cartontype,selected_carton,sel_calc_cube,recommended_carton,pack_volume,rec_calc_cube,goods_wght,sku,who,trackingid
0,0,0,d48f3211c1ffccdc374f23139a9ab668,NONPACK,1,YML,NONPACK,0,YML,2046,108000,0.1,af49bf330e2cf16e44f0be1bdfe337bd,b7325da1af89a46059164618eb03ae38,6c304d5c2815ccd2ba5046c101294c24
1,1,0,d48f3211c1ffccdc374f23139a9ab668,NONPACK,1,YML,NONPACK,0,YML,2046,108000,0.1,af49bf330e2cf16e44f0be1bdfe337bd,b7325da1af89a46059164618eb03ae38,6c304d5c2815ccd2ba5046c101294c24
2,2,0,d48f3211c1ffccdc374f23139a9ab668,NONPACK,1,YML,NONPACK,0,YML,2046,108000,0.1,af49bf330e2cf16e44f0be1bdfe337bd,b7325da1af89a46059164618eb03ae38,6c304d5c2815ccd2ba5046c101294c24
3,3,0,d48f3211c1ffccdc374f23139a9ab668,NONPACK,1,YML,NONPACK,0,YML,2046,108000,0.1,af49bf330e2cf16e44f0be1bdfe337bd,b7325da1af89a46059164618eb03ae38,6c304d5c2815ccd2ba5046c101294c24
4,4,0,d48f3211c1ffccdc374f23139a9ab668,NONPACK,1,YML,NONPACK,0,YML,2046,108000,0.1,af49bf330e2cf16e44f0be1bdfe337bd,b7325da1af89a46059164618eb03ae38,6c304d5c2815ccd2ba5046c101294c24


In [9]:
# Column by which we will group/count orders (order or tracking id)
ORDER_COLUMN = 'trackingid'
TARGET_COLUMN = 'selected_cartontype'

In [10]:
orders[TARGET_COLUMN].value_counts()

selected_cartontype
MYB        55937
MYC        48837
NONPACK    30497
YMC        27149
MYD        24663
YMG        23610
MYA        20401
YMF        19256
YMW        19173
YMA        15795
YME        12685
STRETCH    12465
MYE         9719
YML         3282
MYF         1350
YMX          802
YMB            2
Name: count, dtype: int64

In [11]:
orders['box_num'].value_counts()

box_num
1      296683
2       15199
3        4311
4        1739
5         981
        ...  
87          1
86          1
85          1
84          1
210         1
Name: count, Length: 235, dtype: int64

Needs further analysis and discussion. From preliminary exploration: even orders with several boxes have the same carton type.
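One way to check this preliminary observation is to count, per order, how many boxes and how many distinct carton types were used. The sketch below runs on toy data and reuses the `orderkey`, `box_num`, and `selected_cartontype` column names from the table above. The values are made up, and it assumes `box_num` is the index of a box within an order.

```python
import pandas as pd

toy = pd.DataFrame({
    "orderkey": ["a", "a", "a", "b", "b"],
    "box_num": [1, 2, 2, 1, 2],
    "selected_cartontype": ["MYC", "MYC", "MYC", "MYB", "YMA"],
})

per_order = toy.groupby("orderkey").agg(
    n_boxes=("box_num", "nunique"),
    n_carton_types=("selected_cartontype", "nunique"),
)

# Orders that used several boxes but only one carton type.
same_type = per_order[(per_order.n_boxes > 1) & (per_order.n_carton_types == 1)]
print(same_type.index.tolist())  # ['a']
```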

Converting the data to a supervised learning problem

In [12]:
# Selecting the most valuable columns
orders = orders[[ORDER_COLUMN, TARGET_COLUMN, 'recommended_cartontype', 'pack_volume', 'goods_wght', 'sku']]
orders

Unnamed: 0,trackingid,selected_cartontype,recommended_cartontype,pack_volume,goods_wght,sku
0,6c304d5c2815ccd2ba5046c101294c24,NONPACK,YML,2046,0.100,af49bf330e2cf16e44f0be1bdfe337bd
1,6c304d5c2815ccd2ba5046c101294c24,NONPACK,YML,2046,0.100,af49bf330e2cf16e44f0be1bdfe337bd
2,6c304d5c2815ccd2ba5046c101294c24,NONPACK,YML,2046,0.100,af49bf330e2cf16e44f0be1bdfe337bd
3,6c304d5c2815ccd2ba5046c101294c24,NONPACK,YML,2046,0.100,af49bf330e2cf16e44f0be1bdfe337bd
4,6c304d5c2815ccd2ba5046c101294c24,NONPACK,YML,2046,0.100,af49bf330e2cf16e44f0be1bdfe337bd
...,...,...,...,...,...,...
325618,f94f078101752133502202383bc87743,MYC,YMC,2080,0.100,86dcc1a44eb2939fea4d2dd3604e1f9e
325619,f94f078101752133502202383bc87743,MYC,YMC,2080,0.100,86dcc1a44eb2939fea4d2dd3604e1f9e
325620,58054d533ef06746ffd8cf99fad4a8cb,YMC,YMC,3523,0.284,9db21acf9e6c1a66493c246c1461f989
325621,1666b5c878be124f05fb9a1d95dd8a68,MYB,YMU,552,0.230,4aedb72c5662562524f6119918c7179b


In [13]:
orders = orders.merge(items, on='sku', how='inner')
orders

Unnamed: 0,trackingid,selected_cartontype,recommended_cartontype,pack_volume,goods_wght,sku,width,height,depth
0,6c304d5c2815ccd2ba5046c101294c24,NONPACK,YML,2046,0.100,af49bf330e2cf16e44f0be1bdfe337bd,11,6,31
1,6c304d5c2815ccd2ba5046c101294c24,NONPACK,YML,2046,0.100,af49bf330e2cf16e44f0be1bdfe337bd,11,6,31
2,6c304d5c2815ccd2ba5046c101294c24,NONPACK,YML,2046,0.100,af49bf330e2cf16e44f0be1bdfe337bd,11,6,31
3,6c304d5c2815ccd2ba5046c101294c24,NONPACK,YML,2046,0.100,af49bf330e2cf16e44f0be1bdfe337bd,11,6,31
4,6c304d5c2815ccd2ba5046c101294c24,NONPACK,YML,2046,0.100,af49bf330e2cf16e44f0be1bdfe337bd,11,6,31
...,...,...,...,...,...,...,...,...,...
314825,05466bb9687134041a318a1578e9101e,YMG,YMW,1944,0.550,2a7f6883fadf6fa7a2306376e598b56c,18,6,18
314826,e0a3958ffb898492600503c7e000d961,YMW,YMW,6000,0.150,b7ab1a9260ebc1d53774714b24c02dd3,15,20,20
314827,9c3c0b98d0b2de8b9fe27d7d0770162b,MYE,MYE,1400,0.160,ba0b176dc645058663de285b479163b3,50,1,28
314828,dc68860f37af41c7b3c12257a9e7eff1,MYC,YMC,2688,0.140,0940ce4ed3c65c5713a169bdf824dfcc,28,8,12


Adding SKU cargotypes

In [14]:
item_cargotypes = pd.read_csv('./data/sku_cargotypes.csv')

In [15]:
item_cargotypes = item_cargotypes[['sku', 'cargotype']]
item_cargotypes.head()

Unnamed: 0,sku,cargotype
0,4862bf0e760a593b13f3f2fcf822e533,290
1,4862bf0e760a593b13f3f2fcf822e533,901
2,50d3c4fc66ad423b7feaadff2d682ee0,290
3,50d3c4fc66ad423b7feaadff2d682ee0,901
4,24ce9dba9f301ada55f60e25ee1498d2,290


In [16]:
orders = orders.merge(item_cargotypes, on='sku', how='left')
orders

Unnamed: 0,trackingid,selected_cartontype,recommended_cartontype,pack_volume,goods_wght,sku,width,height,depth,cargotype
0,6c304d5c2815ccd2ba5046c101294c24,NONPACK,YML,2046,0.100,af49bf330e2cf16e44f0be1bdfe337bd,11,6,31,290.0
1,6c304d5c2815ccd2ba5046c101294c24,NONPACK,YML,2046,0.100,af49bf330e2cf16e44f0be1bdfe337bd,11,6,31,600.0
2,6c304d5c2815ccd2ba5046c101294c24,NONPACK,YML,2046,0.100,af49bf330e2cf16e44f0be1bdfe337bd,11,6,31,610.0
3,6c304d5c2815ccd2ba5046c101294c24,NONPACK,YML,2046,0.100,af49bf330e2cf16e44f0be1bdfe337bd,11,6,31,950.0
4,6c304d5c2815ccd2ba5046c101294c24,NONPACK,YML,2046,0.100,af49bf330e2cf16e44f0be1bdfe337bd,11,6,31,970.0
...,...,...,...,...,...,...,...,...,...,...
1456741,dc68860f37af41c7b3c12257a9e7eff1,MYC,YMC,2688,0.140,0940ce4ed3c65c5713a169bdf824dfcc,28,8,12,980.0
1456742,fb38e0e582b9b2a6a3c7434820c9d829,NONPACK,NONPACK,1832,0.878,74ec431f87644bfed9baca351fffcdee,231,1,8,292.0
1456743,fb38e0e582b9b2a6a3c7434820c9d829,NONPACK,NONPACK,1832,0.878,74ec431f87644bfed9baca351fffcdee,231,1,8,300.0
1456744,fb38e0e582b9b2a6a3c7434820c9d829,NONPACK,NONPACK,1832,0.878,74ec431f87644bfed9baca351fffcdee,231,1,8,301.0


In [17]:
orders['cargotype'] = orders['cargotype'].fillna(-1)

In [18]:
orders['width'] = 'w' + orders['width'].astype(str)
orders['height'] = 'h' + orders['height'].astype(str)
orders['depth'] = 'd' + orders['depth'].astype(str)
orders['cargotype'] = 'ct' + orders['cargotype'].astype(str)

In [19]:
orders.head()

Unnamed: 0,trackingid,selected_cartontype,recommended_cartontype,pack_volume,goods_wght,sku,width,height,depth,cargotype
0,6c304d5c2815ccd2ba5046c101294c24,NONPACK,YML,2046,0.1,af49bf330e2cf16e44f0be1bdfe337bd,w11,h6,d31,ct290.0
1,6c304d5c2815ccd2ba5046c101294c24,NONPACK,YML,2046,0.1,af49bf330e2cf16e44f0be1bdfe337bd,w11,h6,d31,ct600.0
2,6c304d5c2815ccd2ba5046c101294c24,NONPACK,YML,2046,0.1,af49bf330e2cf16e44f0be1bdfe337bd,w11,h6,d31,ct610.0
3,6c304d5c2815ccd2ba5046c101294c24,NONPACK,YML,2046,0.1,af49bf330e2cf16e44f0be1bdfe337bd,w11,h6,d31,ct950.0
4,6c304d5c2815ccd2ba5046c101294c24,NONPACK,YML,2046,0.1,af49bf330e2cf16e44f0be1bdfe337bd,w11,h6,d31,ct970.0


Creating a separate column for every value of each dimension.

In [20]:
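# NOTE: here and in the cells below, a bare `%time` line times an empty statement,
# so the printed timings do not measure the pivot; `%%time` would time the whole cell.
%time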

orders_w_cnt = orders.pivot_table(index=ORDER_COLUMN, columns='width', values=TARGET_COLUMN, aggfunc='count', fill_value=0).reset_index()
orders_w_cnt

CPU times: user 1 µs, sys: 0 ns, total: 1 µs
Wall time: 1.67 µs


width,trackingid,w0,w1,w10,w100,w101,w102,w103,w104,w105,...,w90,w91,w92,w93,w94,w95,w96,w97,w98,w99
0,000086e759f059d7373bb1d332392f1c,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0000d2725b0d7c18bfa85dba8fe3fc75,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0000eabbdb272c339beef96b96fe71bc,0,0,2,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,00023dc128414e21b2f9f8307f627433,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0002506b15b27d032de84e195747c3e7,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
161827,fffc476844e0e3f8b8d465167e400ed9,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
161828,fffd2173857cc77c022234e36e28cdad,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
161829,fffe79e1af75d29bf334a4ef95fa9a2a,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
161830,fffe9bf58fbaacabe99822f3e93131e8,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [21]:
%time
orders_h_cnt = orders.pivot_table(index=ORDER_COLUMN, columns='height', values=TARGET_COLUMN, aggfunc='count', fill_value=0).reset_index()
orders_h_cnt

CPU times: user 1 µs, sys: 0 ns, total: 1 µs
Wall time: 3.1 µs


height,trackingid,h0,h1,h10,h100,h101,h102,h104,h108,h109,...,h89,h9,h90,h91,h92,h94,h95,h96,h97,h98
0,000086e759f059d7373bb1d332392f1c,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0000d2725b0d7c18bfa85dba8fe3fc75,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0000eabbdb272c339beef96b96fe71bc,0,4,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,00023dc128414e21b2f9f8307f627433,0,8,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0002506b15b27d032de84e195747c3e7,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
161827,fffc476844e0e3f8b8d465167e400ed9,0,6,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
161828,fffd2173857cc77c022234e36e28cdad,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
161829,fffe79e1af75d29bf334a4ef95fa9a2a,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
161830,fffe9bf58fbaacabe99822f3e93131e8,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [22]:
%time
orders_d_cnt = orders.pivot_table(index=ORDER_COLUMN, columns='depth', values=TARGET_COLUMN, aggfunc='count', fill_value=0).reset_index()
orders_d_cnt

CPU times: user 2 µs, sys: 1e+03 ns, total: 3 µs
Wall time: 3.1 µs


depth,trackingid,d0,d1,d10,d100,d101,d102,d104,d109,d11,...,d90,d91,d92,d93,d94,d95,d96,d97,d98,d99
0,000086e759f059d7373bb1d332392f1c,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0000d2725b0d7c18bfa85dba8fe3fc75,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0000eabbdb272c339beef96b96fe71bc,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,00023dc128414e21b2f9f8307f627433,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0002506b15b27d032de84e195747c3e7,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
161827,fffc476844e0e3f8b8d465167e400ed9,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
161828,fffd2173857cc77c022234e36e28cdad,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
161829,fffe79e1af75d29bf334a4ef95fa9a2a,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
161830,fffe9bf58fbaacabe99822f3e93131e8,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [23]:
%time
orders_ct_cnt = orders.pivot_table(index=ORDER_COLUMN, columns='cargotype', values=TARGET_COLUMN, aggfunc='count', fill_value=0).reset_index()
orders_ct_cnt

CPU times: user 2 µs, sys: 0 ns, total: 2 µs
Wall time: 3.1 µs


cargotype,trackingid,ct-1.0,ct0.0,ct1010.0,ct1011.0,ct110.0,ct120.0,ct130.0,ct1300.0,ct140.0,...,ct911.0,ct920.0,ct930.0,ct931.0,ct950.0,ct955.0,ct960.0,ct970.0,ct980.0,ct990.0
0,000086e759f059d7373bb1d332392f1c,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0000d2725b0d7c18bfa85dba8fe3fc75,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0000eabbdb272c339beef96b96fe71bc,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,1,0,0
3,00023dc128414e21b2f9f8307f627433,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
4,0002506b15b27d032de84e195747c3e7,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
161827,fffc476844e0e3f8b8d465167e400ed9,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
161828,fffd2173857cc77c022234e36e28cdad,0,0,1,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
161829,fffe79e1af75d29bf334a4ef95fa9a2a,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
161830,fffe9bf58fbaacabe99822f3e93131e8,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
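The four pivots above follow the same pattern: one row per order, one column per distinct value, and the cell holding a row count. Because they run after the cargotype merge, which emits one row per (item, cargotype) pair, the size counts are multiplied by the number of cargotypes each item has. The following sketch shows the effect on toy data, with `pd.crosstab` on the pre-merge frame as one possible alternative. The toy values are hypothetical.

```python
import pandas as pd

# One order with two physical items of width 11.
items_in_order = pd.DataFrame({
    "trackingid": ["t1", "t1"],
    "sku": ["s1", "s1"],
    "width": ["w11", "w11"],
})

# The SKU has three cargotypes, so a left merge triples the rows.
cargotypes = pd.DataFrame({"sku": ["s1"] * 3, "cargotype": [290, 600, 610]})
merged = items_in_order.merge(cargotypes, on="sku", how="left")

# Count rows per (order, width) before and after the merge.
before = pd.crosstab(items_in_order["trackingid"], items_in_order["width"])
after = pd.crosstab(merged["trackingid"], merged["width"])
print(before.loc["t1", "w11"])  # 2
print(after.loc["t1", "w11"])   # 6
```

Whether the inflated counts hurt the model is an open question, but computing the size pivots before the cargotype merge would keep them equal to the real number of items.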


In [24]:
orders.isna().sum()

trackingid                0
selected_cartontype       0
recommended_cartontype    0
pack_volume               0
goods_wght                0
sku                       0
width                     0
height                    0
depth                     0
cargotype                 0
dtype: int64

In [25]:
orders = orders.dropna()

Now let's create an aggregated table for orders with targets and other features.

In [26]:
orders_agg = orders.groupby(ORDER_COLUMN).agg({TARGET_COLUMN: lambda x: pd.Series.mode(x)[0],
                                'recommended_cartontype': lambda x: pd.Series.mode(x)[0],
                                'pack_volume': 'mean',
                                'goods_wght': 'mean',
                                'sku': 'count'}).reset_index()
orders_agg = orders_agg.rename(columns={'sku': 'items_count'})
orders_agg

Unnamed: 0,trackingid,selected_cartontype,recommended_cartontype,pack_volume,goods_wght,items_count
0,000086e759f059d7373bb1d332392f1c,MYB,MYC,2439.000000,1.027000,3
1,0000d2725b0d7c18bfa85dba8fe3fc75,MYC,MYC,5698.000000,0.256000,2
2,0000eabbdb272c339beef96b96fe71bc,MYC,MYC,485.833333,0.186667,12
3,00023dc128414e21b2f9f8307f627433,MYA,MYA,18.000000,0.010000,8
4,0002506b15b27d032de84e195747c3e7,MYB,YMA,1428.000000,0.285000,6
...,...,...,...,...,...,...
161827,fffc476844e0e3f8b8d465167e400ed9,MYB,MYA,204.000000,0.127273,11
161828,fffd2173857cc77c022234e36e28cdad,MYC,MYC,2184.000000,0.400000,7
161829,fffe79e1af75d29bf334a4ef95fa9a2a,MYA,MYF,299.000000,0.117000,7
161830,fffe9bf58fbaacabe99822f3e93131e8,NONPACK,NONPACK,11520.000000,1.710000,3
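One detail of the aggregation worth keeping in mind: `pd.Series.mode` returns all tied values in sorted order, so taking `[0]` breaks ties alphabetically. A small illustration with made-up labels:

```python
import pandas as pd

# 'MYC' and 'YMC' both occur twice, so both are modes.
labels = pd.Series(["YMC", "MYC", "YMC", "MYC", "MYB"])
modes = labels.mode()
print(modes.tolist())  # ['MYC', 'YMC']
print(modes[0])        # MYC
```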


In [27]:
orders_agg[TARGET_COLUMN].value_counts()

selected_cartontype
MYB        40052
MYC        29737
NONPACK    18952
MYA        14722
MYD        13659
STRETCH    10016
YMC         8236
YMA         6878
MYE         4708
YMF         4364
YMW         3986
YMG         3930
YME         1458
MYF          794
YML          203
YMX          136
YMB            1
Name: count, dtype: int64

In [28]:
orders_agg['recommended_cartontype'].value_counts()

recommended_cartontype
YMA        31105
YMC        18373
MYC        16882
MYA        13953
NONPACK    13029
YMF        10335
MYF        10266
MYB         9821
YMG         8718
YMW         5778
MYD         4958
YML         4739
YME         4701
MYE         2535
YMX         1919
STRETCH     1589
YMU          853
YMB          825
YMT          604
YMI          287
YMP          286
YMV          276
Name: count, dtype: int64

Merging tables

In [29]:
orders_agg = orders_agg.merge(orders_w_cnt, on=ORDER_COLUMN, how='inner')
orders_agg = orders_agg.merge(orders_h_cnt, on=ORDER_COLUMN, how='inner')
orders_agg = orders_agg.merge(orders_d_cnt, on=ORDER_COLUMN, how='inner')
orders_agg = orders_agg.merge(orders_ct_cnt, on=ORDER_COLUMN, how='inner')
orders_agg

Unnamed: 0,trackingid,selected_cartontype,recommended_cartontype,pack_volume,goods_wght,items_count,w0,w1,w10,w100,...,ct911.0,ct920.0,ct930.0,ct931.0,ct950.0,ct955.0,ct960.0,ct970.0,ct980.0,ct990.0
0,000086e759f059d7373bb1d332392f1c,MYB,MYC,2439.000000,1.027000,3,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0000d2725b0d7c18bfa85dba8fe3fc75,MYC,MYC,5698.000000,0.256000,2,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0000eabbdb272c339beef96b96fe71bc,MYC,MYC,485.833333,0.186667,12,0,0,2,0,...,0,0,0,0,1,0,0,1,0,0
3,00023dc128414e21b2f9f8307f627433,MYA,MYA,18.000000,0.010000,8,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
4,0002506b15b27d032de84e195747c3e7,MYB,YMA,1428.000000,0.285000,6,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
161827,fffc476844e0e3f8b8d465167e400ed9,MYB,MYA,204.000000,0.127273,11,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
161828,fffd2173857cc77c022234e36e28cdad,MYC,MYC,2184.000000,0.400000,7,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
161829,fffe79e1af75d29bf334a4ef95fa9a2a,MYA,MYF,299.000000,0.117000,7,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
161830,fffe9bf58fbaacabe99822f3e93131e8,NONPACK,NONPACK,11520.000000,1.710000,3,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0


## Modeling

In [30]:
orders_agg = orders_agg.drop(ORDER_COLUMN, axis=1)

In [31]:
# Removing the box type with a single occurrence
orders_agg = orders_agg.loc[orders_agg[TARGET_COLUMN] != 'YMB']

In [32]:
X = orders_agg.drop(['recommended_cartontype', TARGET_COLUMN], axis=1).reset_index(drop=True)
y = orders_agg[TARGET_COLUMN]
y_rec_algo = orders_agg['recommended_cartontype']

In [33]:
# Using OrdinalEncoder because the recommended carton type contains values unseen in the target
ord_encoder = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)

# Encoding before the split to allow stratification by target
y = ord_encoder.fit_transform(y.values.reshape(-1, 1))
y_rec_algo = ord_encoder.transform(y_rec_algo.values.reshape(-1, 1))
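To illustrate the encoder settings used above, here is a small sketch with made-up labels. A carton type that was never seen during `fit` (here `'YMU'`) is mapped to `-1` instead of raising an error.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

selected = np.array([["MYA"], ["MYB"], ["YMC"]])     # categories seen in fit
recommended = np.array([["MYB"], ["YMU"], ["YMC"]])  # 'YMU' was never selected

enc.fit(selected)
encoded = enc.transform(recommended).ravel().tolist()
print(encoded)  # [1.0, -1.0, 2.0]
```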

In [34]:
X_train, y_train, X_test, y_test = iterative_train_test_split(X.values, y, test_size = 0.2)
y_rec_algo_train, y_comp_train, y_rec_algo_test, y_comp_test = iterative_train_test_split(y_rec_algo, y, test_size = 0.2)

In [35]:
(y_comp_test==y_test).all()

True
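The two calls above are only useful if they produce aligned splits, which the check confirms for this run. An alternative sketch, not what the notebook does: split the row indices once with scikit-learn's `train_test_split` (stratified by the single-label target) and index every array with them. The toy arrays are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.repeat([0, 1, 2, 3], 25)            # encoded targets, 25 per class
y_rec_algo = rng.integers(0, 4, size=100)  # encoded baseline suggestions

# Split the indices once, stratified by the target.
idx = np.arange(len(y))
train_idx, test_idx = train_test_split(
    idx, test_size=0.2, stratify=y, random_state=42
)

# Every array is indexed with the same split, so rows stay aligned.
X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]
y_rec_algo_train, y_rec_algo_test = y_rec_algo[train_idx], y_rec_algo[test_idx]

# Each class keeps its share in the test set: 5 of 25 per class here.
print(np.bincount(y_test))  # [5 5 5 5]
```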

In [36]:
%time

clf = CatBoostClassifier(random_state=42)

clf.fit(X_train, y_train)

CPU times: user 1e+03 ns, sys: 0 ns, total: 1e+03 ns
Wall time: 1.67 µs
Learning rate set to 0.101401
0:	learn: 2.4395188	total: 656ms	remaining: 10m 55s
1:	learn: 2.2682655	total: 1.3s	remaining: 10m 46s
2:	learn: 2.1421884	total: 1.92s	remaining: 10m 37s
3:	learn: 2.0508663	total: 2.55s	remaining: 10m 35s
4:	learn: 1.9732353	total: 3.18s	remaining: 10m 33s
5:	learn: 1.9113603	total: 3.81s	remaining: 10m 32s
6:	learn: 1.8612527	total: 4.42s	remaining: 10m 27s
7:	learn: 1.8173865	total: 5.03s	remaining: 10m 23s
8:	learn: 1.7774669	total: 5.71s	remaining: 10m 28s
9:	learn: 1.7455759	total: 6.38s	remaining: 10m 31s
10:	learn: 1.7184582	total: 7.04s	remaining: 10m 32s
11:	learn: 1.6948821	total: 7.73s	remaining: 10m 36s
12:	learn: 1.6711489	total: 8.39s	remaining: 10m 37s
13:	learn: 1.6540221	total: 9.14s	remaining: 10m 43s
14:	learn: 1.6357559	total: 9.82s	remaining: 10m 44s
15:	learn: 1.6210600	total: 10.6s	remaining: 10m 49s
16:	learn: 1.6098103	total: 11.3s	remaining: 10m 52s
17:	lear

<catboost.core.CatBoostClassifier at 0x293431810>

In [37]:
y_preds = clf.predict(X_test)

Metrics for ML model.

In [38]:

print(classification_report(y_test, y_preds, target_names=ord_encoder.categories_[0], zero_division=0))

              precision    recall  f1-score   support

         MYA       0.57      0.39      0.46      2945
         MYB       0.59      0.71      0.64      7960
         MYC       0.48      0.55      0.51      6015
         MYD       0.37      0.39      0.38      2734
         MYE       0.37      0.10      0.16       980
         MYF       0.00      0.00      0.00       158
     NONPACK       0.62      0.78      0.69      3728
     STRETCH       0.41      0.26      0.32      1963
         YMA       0.48      0.22      0.30      1416
         YMC       0.40      0.38      0.39      1671
         YME       0.51      0.34      0.41       296
         YMF       0.33      0.23      0.27       852
         YMG       0.41      0.49      0.45       781
         YML       0.70      0.18      0.29        39
         YMW       0.36      0.28      0.31       799
         YMX       0.00      0.00      0.00        30

    accuracy                           0.51     32367
   macro avg       0.41   

Metrics for recommendation system.

In [39]:
print(classification_report(y_test, y_rec_algo_test, labels=np.unique(y_test), target_names=ord_encoder.categories_[0]))

              precision    recall  f1-score   support

         MYA       0.41      0.39      0.40      2945
         MYB       0.53      0.13      0.21      7960
         MYC       0.43      0.23      0.30      6015
         MYD       0.36      0.14      0.20      2734
         MYE       0.18      0.09      0.12       980
         MYF       0.02      0.32      0.05       158
     NONPACK       0.73      0.49      0.59      3728
     STRETCH       0.49      0.09      0.15      1963
         YMA       0.14      0.60      0.22      1416
         YMC       0.17      0.37      0.23      1671
         YME       0.09      0.28      0.13       296
         YMF       0.12      0.30      0.17       852
         YMG       0.12      0.25      0.16       781
         YML       0.03      0.82      0.07        39
         YMW       0.14      0.21      0.17       799
         YMX       0.02      0.20      0.03        30

   micro avg       0.26      0.26      0.26     32367
   macro avg       0.25   

In [40]:
ord_encoder.categories_[0]

array(['MYA', 'MYB', 'MYC', 'MYD', 'MYE', 'MYF', 'NONPACK', 'STRETCH',
       'YMA', 'YMC', 'YME', 'YMF', 'YMG', 'YML', 'YMW', 'YMX'],
      dtype=object)

We can see that even a simple baseline ML model improves on the current recommendation algorithm: 0.51 accuracy on the test set versus 0.26 (micro average) for the heuristic.
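The comparison boils down to the share of orders where the suggestion equals the carton the packer actually used. A minimal sketch with made-up encoded labels follows. Here `-1` stands for a recommended carton type that never appears among the selected ones, so it can never match.

```python
import numpy as np

y_true = np.array([0, 1, 1, 2, 3, 3])
y_model = np.array([0, 1, 2, 2, 3, 1])   # hypothetical model predictions
y_rule = np.array([0, -1, 1, 0, -1, 3])  # hypothetical heuristic suggestions

def accuracy(y_true, y_pred):
    """Share of positions where the prediction equals the true label."""
    return float(np.mean(y_true == y_pred))

acc_model = accuracy(y_true, y_model)
acc_rule = accuracy(y_true, y_rule)
print(round(acc_model, 3))  # 0.667
print(round(acc_rule, 3))   # 0.5
```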

**Further improvements:**
- Optimize targets: for identical groups of items, choose the optimal (smallest) package.
- Generate more features.
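The first idea could be sketched as follows: orders with an identical set of items get the smallest carton any packer used for that set. The `item_set` key and the `CARTON_VOLUME` values are hypothetical; real carton dimensions would have to come from a reference table.

```python
import pandas as pd

# Hypothetical carton volumes; not taken from the dataset.
CARTON_VOLUME = {"MYA": 1000, "MYB": 3000, "MYC": 6000}

orders_agg = pd.DataFrame({
    # Hypothetical key identifying the exact multiset of items in an order.
    "item_set": ["s1+s2", "s1+s2", "s1+s2", "s3"],
    "selected_cartontype": ["MYC", "MYB", "MYC", "MYA"],
})

orders_agg["volume"] = orders_agg["selected_cartontype"].map(CARTON_VOLUME)

# Row index of the smallest carton observed for each item set.
best_rows = orders_agg.groupby("item_set")["volume"].idxmin()
best = orders_agg.loc[best_rows].set_index("item_set")["selected_cartontype"]

# Relabel every order with the smallest carton seen for its item set.
orders_agg["optimized_target"] = orders_agg["item_set"].map(best)
print(orders_agg["optimized_target"].tolist())  # ['MYB', 'MYB', 'MYB', 'MYA']
```

This assumes that a carton that worked once for a given set of items always works for it, which would need to be validated against the real data.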