## Packaging algorithm for Yandex.Market Hackathon

There is a Yandex.Market warehouse where orders are processed, packed, and sent for delivery. Packers ("users") take prepared goods from a list and pack them into suitable packaging - cardboard boxes, film, or bags.

**Task**: to select the optimal packaging for each set of goods, so that the items can fit inside and the size of the box is not too large.

Currently, there is a heuristic algorithm that selects packaging based on the total volume of all items in the order and their sizes. However, users often reject its suggestions and pack the goods in a different box or packaging.

We need to find a way to improve this algorithm.

**Solution:** In this work I use an approach where every size dimension of every item in an order becomes a separate column holding the item's size in that dimension. Since the vast majority of orders (~95%) contain at most 3 items, the model covers only orders of up to 3 items.
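As a rough illustration of this feature layout, the sketch below flattens an order into a fixed-length row in plain Python. The function name `flatten_order` and the constant `MAX_ITEMS` are hypothetical and used only here; the notebook's own implementation appears later and works on pandas groups.

```python
from typing import List, Sequence, Tuple

MAX_ITEMS = 3  # assumption: the model only looks at the first three items of an order


def flatten_order(items: Sequence[Tuple[float, float, float]]) -> List[float]:
    """Return 9 numbers: three sorted dimensions for each of up to MAX_ITEMS items."""
    row: List[float] = []
    for dims in list(items)[:MAX_ITEMS]:
        # Sorting makes the features independent of how an item is oriented.
        row.extend(sorted(dims))
    # Pad missing items with zeros so every order has the same number of columns.
    row.extend([0] * (3 * MAX_ITEMS - len(row)))
    return row


example = flatten_order([(11, 31, 28), (3, 13, 8)])
print(example)  # [11, 28, 31, 3, 8, 13, 0, 0, 0]
```

Zero padding keeps the column count constant, at the cost of the model having to learn that an all-zero item means "no item".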

In [1]:
import pandas as pd
import numpy as np

from catboost import CatBoostClassifier

from sklearn.metrics import classification_report
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder
from skmultilearn.model_selection import iterative_train_test_split

## Loading and preprocessing the data

Items with their sizes

In [2]:
items = pd.read_csv('./data/sku.csv', index_col=0, low_memory=False)
items.columns = ['sku', 'width', 'height', 'depth']
items.head()

Unnamed: 0,sku,width,height,depth
0,8ba57dcdba9a58b0c4edd180bef6afc9,11.0,31.0,28.0
1,d9af6ce6f9e303f4b1a8cb47cde21975,29.0,14.0,40.0
2,8b91fd242bde88f0891380506d9c3caa,12.0,13.0,35.0
3,e8af308a7659e34194770d1e3a48e144,3.0,13.0,8.0
4,dc0e2542e122731217289b8e6d3bd3f8,96.0,18.0,56.0


In [3]:
items.describe()

Unnamed: 0,width,height,depth
count,6385961.0,6385961.0,6385961.0
mean,21.08468,12.03353,17.82524
std,18.90676,14.87745,15.08838
min,0.0,0.0,0.0
25%,10.0,3.0,8.0
50%,18.0,8.0,15.0
75%,28.0,16.0,24.0
max,6554.0,2050.0,593.0


In [4]:
items.nunique()

sku       6385961
width        2250
height       1948
depth        1844
dtype: int64

In [5]:
SIZE_COLS = ['width', 'height', 'depth']

We need fewer unique values, so the sizes should be split into groups. But first, let's round them up to the nearest integer.

In [6]:
items[SIZE_COLS] = np.ceil(items[SIZE_COLS]).astype(int)

In [7]:
items.nunique()

sku       6385961
width         367
height        344
depth         319
dtype: int64

We can shrink them more, but for now let's leave it as it is.
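If the sizes were to be shrunk further, one option would be to map every dimension to a coarse size group. The sketch below shows the idea with the standard `bisect` module; the edges in `BIN_EDGES` are made up for illustration and are not derived from the data.

```python
from bisect import bisect_left

# Hypothetical upper edges of the size groups (same units as the data);
# values above the last edge fall into one open-ended group.
BIN_EDGES = [5, 10, 20, 40, 80]


def size_group(value: float) -> int:
    """Map a dimension to the index of the first edge that is >= value."""
    return bisect_left(BIN_EDGES, value)


groups = [size_group(v) for v in (3, 10, 11, 96)]
print(groups)  # [0, 1, 2, 5]
```

In practice the edges would more likely be chosen from quantiles of the observed sizes or from the inner dimensions of the available cartons.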

Next come the orders, which need preprocessing before the problem can be solved with supervised machine learning algorithms.

In [8]:
orders = pd.read_csv('./data/data.csv')
orders

Unnamed: 0.1,Unnamed: 0,whs,orderkey,selected_cartontype,box_num,recommended_cartontype,selected_carton,sel_calc_cube,recommended_carton,pack_volume,rec_calc_cube,goods_wght,sku,who,trackingid
0,0,0,d48f3211c1ffccdc374f23139a9ab668,NONPACK,1,YML,NONPACK,0,YML,2046,108000,0.100,af49bf330e2cf16e44f0be1bdfe337bd,b7325da1af89a46059164618eb03ae38,6c304d5c2815ccd2ba5046c101294c24
1,1,0,d48f3211c1ffccdc374f23139a9ab668,NONPACK,1,YML,NONPACK,0,YML,2046,108000,0.100,af49bf330e2cf16e44f0be1bdfe337bd,b7325da1af89a46059164618eb03ae38,6c304d5c2815ccd2ba5046c101294c24
2,2,0,d48f3211c1ffccdc374f23139a9ab668,NONPACK,1,YML,NONPACK,0,YML,2046,108000,0.100,af49bf330e2cf16e44f0be1bdfe337bd,b7325da1af89a46059164618eb03ae38,6c304d5c2815ccd2ba5046c101294c24
3,3,0,d48f3211c1ffccdc374f23139a9ab668,NONPACK,1,YML,NONPACK,0,YML,2046,108000,0.100,af49bf330e2cf16e44f0be1bdfe337bd,b7325da1af89a46059164618eb03ae38,6c304d5c2815ccd2ba5046c101294c24
4,4,0,d48f3211c1ffccdc374f23139a9ab668,NONPACK,1,YML,NONPACK,0,YML,2046,108000,0.100,af49bf330e2cf16e44f0be1bdfe337bd,b7325da1af89a46059164618eb03ae38,6c304d5c2815ccd2ba5046c101294c24
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
325618,325618,7,0e4f34db53e37d6bf171c2e055e2b4e0,MYC,1,YMC,MYC,4560,YMC,2080,8525,0.100,86dcc1a44eb2939fea4d2dd3604e1f9e,be7c9ad8b9430d358e6c276b94e2beff,f94f078101752133502202383bc87743
325619,325619,7,0e4f34db53e37d6bf171c2e055e2b4e0,MYC,1,YMC,MYC,4560,YMC,2080,8525,0.100,86dcc1a44eb2939fea4d2dd3604e1f9e,be7c9ad8b9430d358e6c276b94e2beff,f94f078101752133502202383bc87743
325620,325620,7,e71d2e750ce9a7a39c273c634be1665d,YMC,1,YMC,YMC,8525,YMC,3523,8525,0.284,9db21acf9e6c1a66493c246c1461f989,be7c9ad8b9430d358e6c276b94e2beff,58054d533ef06746ffd8cf99fad4a8cb
325621,325621,7,2e2a642f611b5a6f2c404ab945fbc2a3,MYB,1,YMU,MYB,2816,YMU,552,2592,0.230,4aedb72c5662562524f6119918c7179b,be7c9ad8b9430d358e6c276b94e2beff,1666b5c878be124f05fb9a1d95dd8a68


In [9]:
# Column by which we will group/count orders (order or tracking id)
ORDER_COLUMN = 'trackingid'
TARGET_COLUMN = 'selected_cartontype'

In [10]:
orders[TARGET_COLUMN].value_counts()

selected_cartontype
MYB        55937
MYC        48837
NONPACK    30497
YMC        27149
MYD        24663
YMG        23610
MYA        20401
YMF        19256
YMW        19173
YMA        15795
YME        12685
STRETCH    12465
MYE         9719
YML         3282
MYF         1350
YMX          802
YMB            2
Name: count, dtype: int64

In [11]:
orders['box_num'].value_counts()

box_num
1      296683
2       15199
3        4311
4        1739
5         981
        ...  
87          1
86          1
85          1
84          1
210         1
Name: count, Length: 235, dtype: int64

In [12]:
orders.loc[orders['box_num'] == 3]

Unnamed: 0.1,Unnamed: 0,whs,orderkey,selected_cartontype,box_num,recommended_cartontype,selected_carton,sel_calc_cube,recommended_carton,pack_volume,rec_calc_cube,goods_wght,sku,who,trackingid
131,131,0,49b16c00343c611f66b59ca6019aacb7,STRETCH,3,STRETCH,STRETCH,0,STRETCH,5482,0,1.048,c6132928cba6b44adc82ceda7e3690fa,0d5a07f7ac939ca7a4cc4afe355c7234,3d67e3d5c6dc19f916c730ecce78dfc1
155,155,0,869feb74f3045d9ab14c8fba6987f41a,NONPACK,3,NONPACK,NONPACK,0,NONPACK,17600,0,0.750,2ae7a054de2b06402837bca5e75cb572,0d5a07f7ac939ca7a4cc4afe355c7234,b8b1b929db80e863e8ea01c4262a8ab1
167,167,0,12a08dd5869a1abc64d825c165afc9c2,NONPACK,3,NONPACK,NONPACK,0,NONPACK,17955,0,9.450,7f09eeeb1982a7ceb04c78c01789eac2,c1ef50c2d16368c7039ca5f842e4cb6d,1e5c8de9d013143c1cc894b62ecc6222
204,204,0,e18c79259aac13bf30ee63dcccb857f6,NONPACK,3,STRETCH,NONPACK,0,STRETCH,31744,0,0.700,e1a683071f58f26c7c7fc3bdeb64410f,fb5b37a93c08094e4ff31c07eafaed63,344df72c66bd275b309760d1bb33477c
228,228,0,9fd0b8e9c1144e56c509b3b56de392a8,NONPACK,3,NONPACK,NONPACK,0,NONPACK,1925,0,0.145,ce93956193dc197de190d74f5f116689,094d644282d8c426463160b1f08bdfe0,09979a27f0dace2e577e97c0e638e39f
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
325439,325439,7,1af40edb78f16537dd4e9930e569fd8a,NONPACK,3,NONPACK,NONPACK,0,NONPACK,8322,0,4.660,66d38708254823d36a8be0f49222ec22,ce21d9ab9bf6770610560ee7f6ec6602,ed5959ea626cd7fdec476ab82d9db41c
325500,325500,7,d06547ddebea26fd29d1851e64cc15b6,NONPACK,3,NONPACK,NONPACK,0,NONPACK,33660,0,38.900,1b37a7cb22aede48f5721cb87287f28e,027659f40187d994e0254a043e56aee7,be902a554595732928f25fff48acf302
325513,325513,7,78b241628f8feef6651c3913d99c2ec1,NONPACK,3,NONPACK,NONPACK,0,NONPACK,8322,0,4.660,66d38708254823d36a8be0f49222ec22,027659f40187d994e0254a043e56aee7,30e38480617f09c714917d2808b0ad26
325594,325594,7,7082728665f713581f584b2c4010bc42,NONPACK,3,NONPACK,NONPACK,0,NONPACK,117600,0,7.500,e0f94b79e5b02aecd844a2df53460f8b,eeafb158c529eaf1aa9b8f2068a82914,8f06ed7642247824c77709c7878fe2ee


This needs further analysis and discussion. From preliminary exploration: even when an order is split into several boxes, the boxes have the same carton type.
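One way to check this observation more systematically is to count the distinct carton types per order. The sketch below does this on made-up rows in plain Python; the tuple layout `(orderkey, box_num, selected_cartontype)` mirrors the column names of the orders table, but the values are invented.

```python
from collections import defaultdict

# Toy rows: (orderkey, box_num, selected_cartontype); values are made up.
rows = [
    ("order_a", 1, "NONPACK"),
    ("order_a", 2, "NONPACK"),
    ("order_b", 1, "MYC"),
    ("order_b", 2, "YMC"),
]

types_per_order = defaultdict(set)
for orderkey, _box_num, carton in rows:
    types_per_order[orderkey].add(carton)

# Orders whose boxes do not all share one carton type.
mixed = sorted(k for k, types in types_per_order.items() if len(types) > 1)
print(mixed)  # ['order_b']
```

If the list of mixed orders turned out to be short on the real data, treating each order as having a single carton type would be a reasonable simplification.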

Converting the data to a supervised learning problem

In [13]:
# Selecting the most valuable columns
orders = orders[[ORDER_COLUMN, TARGET_COLUMN, 'recommended_cartontype', 'pack_volume', 'goods_wght', 'sku']]
orders

Unnamed: 0,trackingid,selected_cartontype,recommended_cartontype,pack_volume,goods_wght,sku
0,6c304d5c2815ccd2ba5046c101294c24,NONPACK,YML,2046,0.100,af49bf330e2cf16e44f0be1bdfe337bd
1,6c304d5c2815ccd2ba5046c101294c24,NONPACK,YML,2046,0.100,af49bf330e2cf16e44f0be1bdfe337bd
2,6c304d5c2815ccd2ba5046c101294c24,NONPACK,YML,2046,0.100,af49bf330e2cf16e44f0be1bdfe337bd
3,6c304d5c2815ccd2ba5046c101294c24,NONPACK,YML,2046,0.100,af49bf330e2cf16e44f0be1bdfe337bd
4,6c304d5c2815ccd2ba5046c101294c24,NONPACK,YML,2046,0.100,af49bf330e2cf16e44f0be1bdfe337bd
...,...,...,...,...,...,...
325618,f94f078101752133502202383bc87743,MYC,YMC,2080,0.100,86dcc1a44eb2939fea4d2dd3604e1f9e
325619,f94f078101752133502202383bc87743,MYC,YMC,2080,0.100,86dcc1a44eb2939fea4d2dd3604e1f9e
325620,58054d533ef06746ffd8cf99fad4a8cb,YMC,YMC,3523,0.284,9db21acf9e6c1a66493c246c1461f989
325621,1666b5c878be124f05fb9a1d95dd8a68,MYB,YMU,552,0.230,4aedb72c5662562524f6119918c7179b


In [14]:
orders = orders.merge(items, on='sku', how='inner')
orders

Unnamed: 0,trackingid,selected_cartontype,recommended_cartontype,pack_volume,goods_wght,sku,width,height,depth
0,6c304d5c2815ccd2ba5046c101294c24,NONPACK,YML,2046,0.100,af49bf330e2cf16e44f0be1bdfe337bd,11,6,31
1,6c304d5c2815ccd2ba5046c101294c24,NONPACK,YML,2046,0.100,af49bf330e2cf16e44f0be1bdfe337bd,11,6,31
2,6c304d5c2815ccd2ba5046c101294c24,NONPACK,YML,2046,0.100,af49bf330e2cf16e44f0be1bdfe337bd,11,6,31
3,6c304d5c2815ccd2ba5046c101294c24,NONPACK,YML,2046,0.100,af49bf330e2cf16e44f0be1bdfe337bd,11,6,31
4,6c304d5c2815ccd2ba5046c101294c24,NONPACK,YML,2046,0.100,af49bf330e2cf16e44f0be1bdfe337bd,11,6,31
...,...,...,...,...,...,...,...,...,...
314825,05466bb9687134041a318a1578e9101e,YMG,YMW,1944,0.550,2a7f6883fadf6fa7a2306376e598b56c,18,6,18
314826,e0a3958ffb898492600503c7e000d961,YMW,YMW,6000,0.150,b7ab1a9260ebc1d53774714b24c02dd3,15,20,20
314827,9c3c0b98d0b2de8b9fe27d7d0770162b,MYE,MYE,1400,0.160,ba0b176dc645058663de285b479163b3,50,1,28
314828,dc68860f37af41c7b3c12257a9e7eff1,MYC,YMC,2688,0.140,0940ce4ed3c65c5713a169bdf824dfcc,28,8,12


Adding item cargo types

In [15]:
item_cargotypes = pd.read_csv('./data/sku_cargotypes.csv')

In [16]:
item_cargotypes = item_cargotypes[['sku', 'cargotype']]
item_cargotypes.head()

Unnamed: 0,sku,cargotype
0,4862bf0e760a593b13f3f2fcf822e533,290
1,4862bf0e760a593b13f3f2fcf822e533,901
2,50d3c4fc66ad423b7feaadff2d682ee0,290
3,50d3c4fc66ad423b7feaadff2d682ee0,901
4,24ce9dba9f301ada55f60e25ee1498d2,290


In [17]:
item_cargotypes['cargotype'].value_counts()

cargotype
290     5216064
950     1591250
200     1009438
970      793449
310      758212
         ...   
900          39
333          10
907           9
1300          2
10            1
Name: count, Length: 88, dtype: int64

In [18]:
orders = orders.merge(item_cargotypes, on='sku', how='left')
orders

Unnamed: 0,trackingid,selected_cartontype,recommended_cartontype,pack_volume,goods_wght,sku,width,height,depth,cargotype
0,6c304d5c2815ccd2ba5046c101294c24,NONPACK,YML,2046,0.100,af49bf330e2cf16e44f0be1bdfe337bd,11,6,31,290.0
1,6c304d5c2815ccd2ba5046c101294c24,NONPACK,YML,2046,0.100,af49bf330e2cf16e44f0be1bdfe337bd,11,6,31,600.0
2,6c304d5c2815ccd2ba5046c101294c24,NONPACK,YML,2046,0.100,af49bf330e2cf16e44f0be1bdfe337bd,11,6,31,610.0
3,6c304d5c2815ccd2ba5046c101294c24,NONPACK,YML,2046,0.100,af49bf330e2cf16e44f0be1bdfe337bd,11,6,31,950.0
4,6c304d5c2815ccd2ba5046c101294c24,NONPACK,YML,2046,0.100,af49bf330e2cf16e44f0be1bdfe337bd,11,6,31,970.0
...,...,...,...,...,...,...,...,...,...,...
1456741,dc68860f37af41c7b3c12257a9e7eff1,MYC,YMC,2688,0.140,0940ce4ed3c65c5713a169bdf824dfcc,28,8,12,980.0
1456742,fb38e0e582b9b2a6a3c7434820c9d829,NONPACK,NONPACK,1832,0.878,74ec431f87644bfed9baca351fffcdee,231,1,8,292.0
1456743,fb38e0e582b9b2a6a3c7434820c9d829,NONPACK,NONPACK,1832,0.878,74ec431f87644bfed9baca351fffcdee,231,1,8,300.0
1456744,fb38e0e582b9b2a6a3c7434820c9d829,NONPACK,NONPACK,1832,0.878,74ec431f87644bfed9baca351fffcdee,231,1,8,301.0


In [19]:
orders.isna().sum()

trackingid                  0
selected_cartontype         0
recommended_cartontype      0
pack_volume                 0
goods_wght                  0
sku                         0
width                       0
height                      0
depth                       0
cargotype                 216
dtype: int64

In [20]:
orders['cargotype'] = orders['cargotype'].fillna(-1)

In [21]:
orders = orders.dropna()

In [22]:
orders.head()

Unnamed: 0,trackingid,selected_cartontype,recommended_cartontype,pack_volume,goods_wght,sku,width,height,depth,cargotype
0,6c304d5c2815ccd2ba5046c101294c24,NONPACK,YML,2046,0.1,af49bf330e2cf16e44f0be1bdfe337bd,11,6,31,290.0
1,6c304d5c2815ccd2ba5046c101294c24,NONPACK,YML,2046,0.1,af49bf330e2cf16e44f0be1bdfe337bd,11,6,31,600.0
2,6c304d5c2815ccd2ba5046c101294c24,NONPACK,YML,2046,0.1,af49bf330e2cf16e44f0be1bdfe337bd,11,6,31,610.0
3,6c304d5c2815ccd2ba5046c101294c24,NONPACK,YML,2046,0.1,af49bf330e2cf16e44f0be1bdfe337bd,11,6,31,950.0
4,6c304d5c2815ccd2ba5046c101294c24,NONPACK,YML,2046,0.1,af49bf330e2cf16e44f0be1bdfe337bd,11,6,31,970.0


In [23]:
orders_agg = orders.groupby(ORDER_COLUMN).agg({TARGET_COLUMN: lambda x: pd.Series.mode(x)[0],
                                'recommended_cartontype': lambda x: pd.Series.mode(x)[0],
                                'pack_volume': 'mean',
                                'goods_wght': 'mean',
                                'sku': 'count'}).reset_index()
orders_agg = orders_agg.rename(columns={'sku': 'items_count'})
orders_agg.head()

Unnamed: 0,trackingid,selected_cartontype,recommended_cartontype,pack_volume,goods_wght,items_count
0,000086e759f059d7373bb1d332392f1c,MYB,MYC,2439.0,1.027,3
1,0000d2725b0d7c18bfa85dba8fe3fc75,MYC,MYC,5698.0,0.256,2
2,0000eabbdb272c339beef96b96fe71bc,MYC,MYC,485.833333,0.186667,12
3,00023dc128414e21b2f9f8307f627433,MYA,MYA,18.0,0.01,8
4,0002506b15b27d032de84e195747c3e7,MYB,YMA,1428.0,0.285,6


In [24]:
print(f'Percentage of orders with 1 item {len(orders_agg.loc[orders_agg["items_count"] == 1]) / len(orders_agg) * 100:.2f}%')

Percentage of orders with 1 item 7.14%


In [25]:
print(f'Percentage of orders with 2 items {len(orders_agg.loc[orders_agg["items_count"] == 2]) / len(orders_agg) * 100:.2f}%')

Percentage of orders with 2 items 9.43%


In [26]:
print(f'Percentage of orders with 3 items {len(orders_agg.loc[orders_agg["items_count"] == 3]) / len(orders_agg) * 100:.2f}%')

Percentage of orders with 3 items 10.13%


In [27]:
print(f'Percentage of orders with less than or equal 3 items {len(orders_agg.loc[orders_agg["items_count"] <= 3]) / len(orders_agg) * 100:.2f}%')

Percentage of orders with less than or equal 3 items 26.71%
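These shares are much lower than the ~95% quoted in the introduction. One possible explanation is that `items_count` is a row count taken after the cargo type merge, which emits one row per item and cargo type pair. The sketch below contrasts a row count with a distinct-SKU count on made-up rows; note that a distinct-SKU count would in turn ignore repeated units of the same SKU, so it is only an illustration, not a drop-in replacement.

```python
# Toy joined rows: (trackingid, sku, cargotype); values are made up.
joined = [
    ("t1", "sku_a", 290),
    ("t1", "sku_a", 600),
    ("t1", "sku_a", 950),
    ("t2", "sku_b", 290),
    ("t2", "sku_c", 340),
]

row_count = {}
distinct_skus = {}
for trackingid, sku, _cargotype in joined:
    # Counting rows counts every (item, cargo type) pair.
    row_count[trackingid] = row_count.get(trackingid, 0) + 1
    # Collecting SKUs in a set counts each item once.
    distinct_skus.setdefault(trackingid, set()).add(sku)

sku_count = {k: len(v) for k, v in distinct_skus.items()}
print(row_count)  # {'t1': 3, 't2': 2}
print(sku_count)  # {'t1': 1, 't2': 2}
```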


In [28]:
orders_agg[TARGET_COLUMN].value_counts()

selected_cartontype
MYB        40052
MYC        29737
NONPACK    18952
MYA        14722
MYD        13659
STRETCH    10016
YMC         8236
YMA         6878
MYE         4708
YMF         4364
YMW         3986
YMG         3930
YME         1458
MYF          794
YML          203
YMX          136
YMB            1
Name: count, dtype: int64

In [29]:
orders_agg['recommended_cartontype'].value_counts()

recommended_cartontype
YMA        31105
YMC        18373
MYC        16882
MYA        13953
NONPACK    13029
YMF        10335
MYF        10266
MYB         9821
YMG         8718
YMW         5778
MYD         4958
YML         4739
YME         4701
MYE         2535
YMX         1919
STRETCH     1589
YMU          853
YMB          825
YMT          604
YMI          287
YMP          286
YMV          276
Name: count, dtype: int64

Aggregating item sizes

In [30]:
MAX_ITEMS = 3  # the model uses at most 3 rows per order

def unroll_sizes(order):
    # Zero sizes and -1 cargo types pad orders with fewer than MAX_ITEMS rows
    items_sizes = [[0] * 3 for _ in range(MAX_ITEMS)]
    cargotypes = [-1] * MAX_ITEMS

    # Note: after the cargo type merge each row is an (item, cargo type) pair,
    # so consecutive rows may describe the same item
    for i, item in enumerate(order.itertuples()):
        if i >= MAX_ITEMS:
            break

        # Sort the dimensions so the features do not depend on item orientation;
        # the width/height/depth column names are therefore nominal (smallest to largest)
        items_sizes[i] = sorted([item.width, item.height, item.depth])
        cargotypes[i] = item.cargotype

    row_values = [order[ORDER_COLUMN].values[0]] + [dim for item in items_sizes for dim in item] + cargotypes
    columns = (
        [ORDER_COLUMN]
        + [f'{dim}_item_{n}' for n in range(1, MAX_ITEMS + 1) for dim in SIZE_COLS]
        + [f'cargotype_item_{n}' for n in range(1, MAX_ITEMS + 1)]
    )
    return pd.DataFrame([row_values], columns=columns)

orders_sizes = orders.groupby(ORDER_COLUMN).apply(unroll_sizes).reset_index(drop=True)


In [31]:
orders_sizes.head()

Unnamed: 0,trackingid,width_item_1,height_item_1,depth_item_1,width_item_2,height_item_2,depth_item_2,width_item_3,height_item_3,depth_item_3,cargotype_item_1,cargotype_item_2,cargotype_item_3
0,000086e759f059d7373bb1d332392f1c,7,17,22,7,17,22,7,17,22,200.0,290.0,340.0
1,0000d2725b0d7c18bfa85dba8fe3fc75,7,22,37,7,22,37,0,0,0,290.0,340.0,-1.0
2,0000eabbdb272c339beef96b96fe71bc,1,20,26,1,20,26,1,20,26,290.0,600.0,950.0
3,00023dc128414e21b2f9f8307f627433,1,3,6,1,3,6,1,3,6,290.0,410.0,440.0
4,0002506b15b27d032de84e195747c3e7,7,13,18,7,13,18,7,13,18,290.0,320.0,340.0


In [32]:
orders_agg = orders_agg.merge(orders_sizes, on=ORDER_COLUMN, how='inner')

In [33]:
orders_agg.head()

Unnamed: 0,trackingid,selected_cartontype,recommended_cartontype,pack_volume,goods_wght,items_count,width_item_1,height_item_1,depth_item_1,width_item_2,height_item_2,depth_item_2,width_item_3,height_item_3,depth_item_3,cargotype_item_1,cargotype_item_2,cargotype_item_3
0,000086e759f059d7373bb1d332392f1c,MYB,MYC,2439.0,1.027,3,7,17,22,7,17,22,7,17,22,200.0,290.0,340.0
1,0000d2725b0d7c18bfa85dba8fe3fc75,MYC,MYC,5698.0,0.256,2,7,22,37,7,22,37,0,0,0,290.0,340.0,-1.0
2,0000eabbdb272c339beef96b96fe71bc,MYC,MYC,485.833333,0.186667,12,1,20,26,1,20,26,1,20,26,290.0,600.0,950.0
3,00023dc128414e21b2f9f8307f627433,MYA,MYA,18.0,0.01,8,1,3,6,1,3,6,1,3,6,290.0,410.0,440.0
4,0002506b15b27d032de84e195747c3e7,MYB,YMA,1428.0,0.285,6,7,13,18,7,13,18,7,13,18,290.0,320.0,340.0


## Modelling

In [34]:
orders_agg = orders_agg.drop(ORDER_COLUMN, axis=1)

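Keeping only orders with at most 3 items (orders with more are removed). Note that `items_count` counts rows after the cargo type merge, one per item and cargo type pair, so the filter keeps only 26.71% of orders.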

In [35]:
orders_agg = orders_agg.loc[orders_agg['items_count'] <= 3]

In [36]:
# Removing the box type with a single occurrence
orders_agg = orders_agg.loc[orders_agg[TARGET_COLUMN] != 'YMB']

In [37]:
X = orders_agg.drop(['recommended_cartontype', TARGET_COLUMN], axis=1).reset_index(drop=True)
y = orders_agg[TARGET_COLUMN]
y_rec_algo = orders_agg['recommended_cartontype']

In [38]:
# Using OrdinalEncoder because the recommended carton types include values that never occur in the target
ord_encoder = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)

# Encoding before the split so that the split can be stratified by target
y = ord_encoder.fit_transform(y.values.reshape(-1, 1))
y_rec_algo = ord_encoder.transform(y_rec_algo.values.reshape(-1, 1))
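To make the effect of `unknown_value=-1` concrete, here is a plain-Python sketch of the same idea: codes are fitted on the selected carton types only, and any recommended type that was never selected (for example `YMU`) is mapped to -1. The helper names `fit_codes` and `encode` are hypothetical and only mimic the behaviour; they are not part of scikit-learn.

```python
from typing import Dict, Iterable, List


def fit_codes(values: Iterable[str]) -> Dict[str, int]:
    """Assign consecutive integer codes to the sorted distinct values."""
    return {label: code for code, label in enumerate(sorted(set(values)))}


def encode(values: Iterable[str], codes: Dict[str, int], unknown: int = -1) -> List[int]:
    """Encode values, mapping labels unseen during fitting to `unknown`."""
    return [codes.get(v, unknown) for v in values]


codes = fit_codes(["MYB", "MYC", "NONPACK", "MYB"])
encoded = encode(["MYC", "YMU", "NONPACK"], codes)
print(encoded)  # [1, -1, 2]
```

A recommendation encoded as -1 can never match a target label, so it always counts as a miss for the current algorithm.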

In [39]:
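# Split once so features, target and baseline codes stay row-aligned (two separate calls may partition the rows differently)
X_train, y_train, X_test, y_test = iterative_train_test_split(np.hstack([X.values, y_rec_algo]), y, test_size=0.2)
X_train, y_rec_algo_train = X_train[:, :-1], X_train[:, -1:]
X_test, y_rec_algo_test = X_test[:, :-1], X_test[:, -1:]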

In [40]:
%%time

clf = CatBoostClassifier(random_state=42)

clf.fit(X_train, y_train)

Learning rate set to 0.094798
0:	learn: 2.3854459	total: 288ms	remaining: 4m 47s
1:	learn: 2.1797140	total: 484ms	remaining: 4m 1s
2:	learn: 2.0377359	total: 698ms	remaining: 3m 52s
3:	learn: 1.9290618	total: 933ms	remaining: 3m 52s
4:	learn: 1.8431518	total: 1.04s	remaining: 3m 27s
5:	learn: 1.7738271	total: 1.26s	remaining: 3m 28s
6:	learn: 1.7135651	total: 1.41s	remaining: 3m 19s
7:	learn: 1.6645689	total: 1.54s	remaining: 3m 11s
8:	learn: 1.6230053	total: 1.68s	remaining: 3m 4s
9:	learn: 1.5846831	total: 1.8s	remaining: 2m 58s
10:	learn: 1.5536918	total: 1.9s	remaining: 2m 50s
11:	learn: 1.5258037	total: 2.11s	remaining: 2m 53s
12:	learn: 1.5019979	total: 2.35s	remaining: 2m 58s
13:	learn: 1.4802268	total: 2.46s	remaining: 2m 53s
14:	learn: 1.4600349	total: 2.61s	remaining: 2m 51s
15:	learn: 1.4412109	total: 2.76s	remaining: 2m 49s
16:	learn: 1.4266345	total: 2.92s	remaining: 2m 48s
17:	learn: 1.4134713	total: 

<catboost.core.CatBoostClassifier at 0x16a47d6d0>

In [41]:
y_preds = clf.predict(X_test)

Metrics for ML model.

In [42]:

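test_labels = np.unique(y_test)  # classes present in the test set; also used to pick matching class names
print(classification_report(y_test, y_preds, labels=test_labels, target_names=ord_encoder.categories_[0][test_labels.astype(int)], zero_division=0))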

              precision    recall  f1-score   support

         MYA       0.58      0.43      0.49       962
         MYB       0.62      0.73      0.67      2473
         MYC       0.53      0.54      0.53      1635
         MYD       0.45      0.57      0.50       849
         MYE       0.32      0.20      0.25       303
         MYF       0.00      0.00      0.00        27
     NONPACK       0.69      0.78      0.73      1287
     STRETCH       0.44      0.31      0.36       579
         YMA       0.49      0.11      0.18       162
         YMC       0.34      0.23      0.27       177
         YME       0.25      0.07      0.11        14
         YMF       0.24      0.09      0.13        69
         YMG       0.38      0.10      0.15        52
         YML       0.32      0.15      0.20        54
         YMW       0.50      1.00      0.67         1

    accuracy                           0.57      8644
   macro avg       0.41      0.35      0.35      8644
weighted avg       0.55   



Metrics for recommendation system.

In [43]:
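test_labels = np.unique(y_test)  # classes present in the test set; also used to pick matching class names
print(classification_report(y_test, y_rec_algo_test, labels=test_labels, target_names=ord_encoder.categories_[0][test_labels.astype(int)], zero_division=0))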

              precision    recall  f1-score   support

         MYA       0.42      0.50      0.46       962
         MYB       0.59      0.20      0.30      2473
         MYC       0.48      0.24      0.32      1635
         MYD       0.46      0.21      0.29       849
         MYE       0.25      0.14      0.18       303
         MYF       0.02      0.26      0.03        27
     NONPACK       0.79      0.60      0.68      1287
     STRETCH       0.51      0.12      0.19       579
         YMA       0.08      0.57      0.14       162
         YMC       0.10      0.42      0.17       177
         YME       0.02      0.29      0.04        14
         YMF       0.06      0.33      0.10        69
         YMG       0.04      0.31      0.07        52
         YML       0.03      0.09      0.04        54
         YMW       0.00      0.00      0.00         1

   micro avg       0.33      0.31      0.32      8644
   macro avg       0.26      0.29      0.20      8644
weighted avg       0.51   



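Even a simple baseline ML model improves on the current recommendation algorithm on the test set: the macro-averaged F1 rises from 0.20 to 0.35, and the model reaches an accuracy of 0.57 against a micro-averaged F1 of 0.32 for the current algorithm.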
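For readers who want to see what the macro average in the reports measures, the sketch below computes a macro-averaged F1 by hand over the classes present in the true labels. The label lists are toy values, not results from this notebook, and `macro_f1` is a hypothetical helper written for this illustration.

```python
from typing import Sequence


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 over the classes present in y_true."""
    scores = []
    for label in sorted(set(y_true)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # A class that is never predicted and never hit scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


y_true = ["MYB", "MYB", "MYC", "NONPACK"]
model_preds = ["MYB", "MYB", "MYC", "MYC"]
baseline_preds = ["YMA", "MYB", "YMC", "MYB"]

model_score = macro_f1(y_true, model_preds)
baseline_score = macro_f1(y_true, baseline_preds)
print(round(model_score, 3), round(baseline_score, 3))  # 0.556 0.167
```

Because every class gets the same weight, rare carton types influence this score as much as frequent ones, which is why it sits well below the accuracy in both reports.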

**Further improvements:**
- Optimize targets: for the same group of items we can choose the optimal (smallest) package among those that were used.
- Generate more features.
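The target optimization idea could be sketched as follows: group historical orders by their set of items and relabel each group with the smallest carton that was ever used for it. The volumes in `CARTON_VOLUME` and the history rows are made up; real volumes would have to come from a carton catalogue, which is not part of this notebook.

```python
from collections import defaultdict

# Hypothetical carton volumes; real values would come from the carton catalogue.
CARTON_VOLUME = {"MYA": 1000, "MYB": 3000, "MYC": 6000}

# Toy history: (item-set signature, carton the packer selected).
history = [
    (("sku_a", "sku_b"), "MYC"),
    (("sku_a", "sku_b"), "MYB"),
    (("sku_c",), "MYA"),
]

seen = defaultdict(set)
for signature, carton in history:
    seen[signature].add(carton)

# For each item set keep the smallest carton that was ever used for it.
best_target = {
    signature: min(cartons, key=CARTON_VOLUME.__getitem__)
    for signature, cartons in seen.items()
}
print(best_target[("sku_a", "sku_b")])  # MYB
```

This assumes that any carton a packer actually used is large enough for that item set, so the smallest observed one is a safe target.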
