## MeLi Data Challenge 2019

This notebook is part of a curated version of my original solution for the MeLi Data Challenge, hosted by [Mercado Libre](https://www.mercadolibre.com/) in 2019.

The goal of this first challenge was to create a model that would classify items into categories based solely on the item’s title. 

The title is free text entered by the seller, and it becomes the header of the listing.

<div class="alert alert-block alert-info">
<b>Note</b> <p>Only 10% of the data is used in the notebooks to improve the experience.</p>
    <p>Also, the data is not split by language in these notebooks, purely for simplicity.</p>
    <p>In the scripted version, 100% of the data is used to improve results.</p>
</div>

### 2 - Train Model

In this notebook, we train a CNN using all the data created in the previous steps.

### Import libraries

In [14]:
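import pandas as pd
import joblib
import numpy as np
from sklearn.model_selection import train_test_split
import copy
# Keras pieces used by cnn_model (assumed standalone Keras; use tensorflow.keras if needed)
from keras.models import Model
from keras.layers import (Input, Embedding, SpatialDropout1D, Reshape, Conv2D, MaxPooling2D,
                          concatenate, Flatten, Dense, Activation, Dropout)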

### Load data

In [4]:
df = pd.read_pickle('./data/df.pkl')
len_sent = joblib.load('./data/len_sent.h5')

### Encode categories and save the references

In [6]:
nb_classes = len(np.unique(df['category']))
labels, levels = pd.factorize(df['category'])        
joblib.dump(nb_classes,'./data/nb_classes')
joblib.dump(levels,'./data/levels')

['./data/levels']
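
As a small illustration of what `pd.factorize` returns, the sketch below uses a made-up `toy` series (not the challenge data, and the category names are invented). The first return value holds one integer code per row, and the second holds the unique categories in order of first appearance, so indexing the second with the first recovers the original values.

```python
import pandas as pd

# Toy category column; the values are illustrative only.
toy = pd.Series(["SHOES", "PHONES", "SHOES", "BOOKS"])

toy_labels, toy_levels = pd.factorize(toy)

# toy_labels: one integer code per row, numbered by first appearance.
# toy_levels: Index of unique categories; position in the Index == code.
decoded = toy_levels[toy_labels]

print(toy_labels)        # integer codes
print(list(toy_levels))  # unique categories
print(list(decoded))     # original values, recovered from the codes
```

This is why saving `levels` is enough to translate predicted class indices back into category names later.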

### Split data 

We now split the data into training and validation sets.
<div class="alert alert-block alert-info">
<b>Note</b> <p>In <b>2- PreProcess</b> we also split the data in order to have a testing set.</p>
</div>

In [12]:
train_df, val_df = train_test_split(df,test_size=0.1, stratify=df['category'])
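
To show what `stratify` does in the split above, here is a minimal sketch on a made-up, imbalanced frame. The names `toy_df`, `toy_train` and `toy_val` are hypothetical, and `random_state=0` is added only to make the example reproducible (the notebook's own call does not set a seed).

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with an imbalanced label: 80 rows of "A", 20 rows of "B".
toy_df = pd.DataFrame({"category": ["A"] * 80 + ["B"] * 20,
                       "input_data": range(100)})

toy_train, toy_val = train_test_split(
    toy_df, test_size=0.1, stratify=toy_df["category"], random_state=0
)

# With stratify, the 10-row validation set keeps the 80/20 class ratio.
val_counts = toy_val["category"].value_counts().to_dict()
print(val_counts)
```

Without `stratify`, a rare category could end up entirely in one of the two sets, which matters when there are many categories with few examples each.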

### Data preparation steps
Here, we extract the values from the dataframes and generate the one-hot encoded labels the model needs.

In [21]:
y = copy.deepcopy(train_df['category'].values)
x = copy.deepcopy(train_df['input_data'].values)
y_val_in = copy.deepcopy(val_df['category'].values)
x_val_in = copy.deepcopy(val_df['input_data'].values)

In [22]:
def indices_to_one_hot(data, nb_classes):
    """Convert an iterable of indices to one-hot encoded labels."""
    targets = np.array(data, dtype=np.int16).reshape(-1)
    return np.eye(nb_classes,dtype=np.int8)[targets]
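
The helper above works by indexing the rows of an identity matrix: row `k` of `np.eye(nb_classes)` is the one-hot vector for class `k`. A quick check on three made-up indices (the function is repeated so the snippet runs on its own):

```python
import numpy as np

def indices_to_one_hot(data, nb_classes):
    """Convert an iterable of indices to one-hot encoded labels."""
    targets = np.array(data, dtype=np.int16).reshape(-1)
    return np.eye(nb_classes, dtype=np.int8)[targets]

one_hot = indices_to_one_hot([2, 0, 1], nb_classes=3)
print(one_hot)
# [[0 0 1]
#  [1 0 0]
#  [0 1 0]]
```

Using `int8` keeps the label matrix small, which helps when the number of rows and classes is large.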

In [23]:
y = [np.where(levels==i)[0][0] for i in y]
y_val = [np.where(levels==i)[0][0] for i in y_val_in]

y = indices_to_one_hot(y, nb_classes)
y_val = indices_to_one_hot(y_val, nb_classes)
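
The list comprehensions above scan `levels` once per row with `np.where`. A possible alternative, sketched here on made-up data, is `Index.get_indexer`, which looks up all values in one vectorized call. The names `toy_levels` and `toy_y` are hypothetical, and this assumes `levels` is a pandas `Index`, as returned by `pd.factorize` on a Series.

```python
import numpy as np
import pandas as pd

toy_levels = pd.Index(["SHOES", "PHONES", "BOOKS"])
toy_y = np.array(["BOOKS", "SHOES", "BOOKS", "PHONES"])

# Lookup as written in the notebook: one np.where scan per row.
slow = [np.where(toy_levels == i)[0][0] for i in toy_y]

# Vectorized alternative: a single lookup call for all rows.
fast = toy_levels.get_indexer(toy_y)

print(slow)
print(fast.tolist())
```

Both produce the same integer codes; the second form should scale better when there are many rows and categories.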

### Define model

In [24]:
output_shape = y_val.shape[1]

In [None]:
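def cnn_model(input_dim, output_shape, path=''):
    """Build a CNN text classifier with parallel convolutions of several filter sizes.

    Relies on the globals num_filters, filter_sizes, drop and output_dim,
    which are not defined in this notebook and must be set before calling.
    """
    # Pretrained embedding matrix, expected shape (vocab_size, embedding_dim)
    weights = np.load(open(path+'/embeddings.npz', 'rb'))
    embedding_dim = weights.shape[1]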
    
    inputs = Input(shape=(input_dim,), dtype='int32')
    
    embedding = Embedding(output_dim=weights.shape[1], input_dim=weights.shape[0], input_length=input_dim,
                              weights=[weights], trainable=True)(inputs)
                              
    spatial_dropout = SpatialDropout1D(0.5)(embedding)
        
    reshape = Reshape((input_dim, embedding_dim, 1))(spatial_dropout)

    conv_0 = Conv2D(num_filters, (filter_sizes[0], embedding_dim), padding='valid', kernel_initializer='normal',
                           activation='sigmoid', data_format='channels_last')(reshape)
    conv_1 = Conv2D(num_filters, (filter_sizes[1], embedding_dim), padding='valid', kernel_initializer='normal',
                           activation='sigmoid', data_format='channels_last')(reshape)
    conv_2 = Conv2D(num_filters, (filter_sizes[2], embedding_dim), padding='valid', kernel_initializer='normal',
                           activation='sigmoid', data_format='channels_last')(reshape)
    conv_3 = Conv2D(num_filters, (filter_sizes[3], embedding_dim), padding='valid', kernel_initializer='normal',
                           activation='sigmoid', data_format='channels_last')(reshape)


    maxpool_0 = MaxPooling2D(pool_size=(input_dim - filter_sizes[0] + 1, 1), strides=(1, 1),
                             padding='valid', data_format='channels_last')(conv_0)
    maxpool_1 = MaxPooling2D(pool_size=(input_dim - filter_sizes[1] + 1, 1), strides=(1, 1),
                             padding='valid', data_format='channels_last')(conv_1)
    maxpool_2 = MaxPooling2D(pool_size=(input_dim - filter_sizes[2] + 1, 1), strides=(1, 1),
                             padding='valid', data_format='channels_last')(conv_2)
    maxpool_3 = MaxPooling2D(pool_size=(input_dim - filter_sizes[3] + 1, 1), strides=(1, 1),
                             padding='valid', data_format='channels_last')(conv_3)

    merged_tensor = concatenate([maxpool_0, maxpool_1, maxpool_2, maxpool_3], axis=1)
    
    flatten = Flatten()(merged_tensor)
    
    #dense1 = Dense(units=output_dim, kernel_regularizer=regularizers.l2(0.01))(flatten)
    dense1 = Dense(units=output_dim)(flatten)
    #dense1 = BatchNormalization()(dense1)
    dense1 = Activation('relu')(dense1)
    dense1 = Dropout(drop)(dense1)

    #dense2 = Dense(units=output_dim, kernel_regularizer=regularizers.l2(0.01))(dense1)
    dense2 = Dense(units=output_dim)(dense1)
    #dense2 = BatchNormalization()(dense2)
    dense2 = Activation('relu')(dense2)
    dense2 = Dropout(drop)(dense2)

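    #dense2 = Dense(units=output_dim, activation='relu')(dense1)
    #dense3 = Dense(units=output_dim, activation='relu')(dense2)
    # Note: the output layer is connected to dense1, so the dense2 block above is not part of the model graph
    output = Dense(units=output_shape)(dense1)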
    #output = BatchNormalization()(output)
    output = Activation('softmax')(output)


    #output = Dense(units=output_shape, activation='softmax')(normalized_1)
    
    model = Model(inputs=inputs, outputs=output)
    
    return model

In [None]:
model = cnn_model(len_sent, output_shape, path='data')
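
To make the shapes in `cnn_model` easier to follow, here is a minimal NumPy sketch of what one convolution filter followed by its max-pooling layer computes for a single title. It is an illustration under simplifying assumptions, not the Keras implementation: the bias and the sigmoid activation are omitted, and `seq_len`, `emb_dim` and `filter_size` are made-up stand-ins for `input_dim`, `embedding_dim` and one entry of `filter_sizes`.

```python
import numpy as np

rng = np.random.default_rng(0)

seq_len, emb_dim = 10, 4   # stand-ins for input_dim and embedding_dim
filter_size = 3            # stand-in for one entry of filter_sizes

x = rng.normal(size=(seq_len, emb_dim))      # one embedded title
w = rng.normal(size=(filter_size, emb_dim))  # one convolution filter

# A 'valid' convolution whose kernel spans the full embedding width yields
# one activation per window of filter_size consecutive tokens.
n_windows = seq_len - filter_size + 1
feature_map = np.array(
    [np.sum(x[i:i + filter_size] * w) for i in range(n_windows)]
)

# A pool of size (seq_len - filter_size + 1) covers every window, so each
# filter contributes exactly one value: its strongest response in the title.
pooled = feature_map.max()

print(feature_map.shape, pooled)
```

With `num_filters` filters for each of the four filter sizes, concatenating and flattening the pooled outputs gives a vector of length `4 * num_filters`, which is what feeds the dense layers.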