## MeLi Data Challenge 2019

This notebook is part of a curated version of my original solution for the MeLi Data Challenge hosted by [Mercado Libre](https://www.mercadolibre.com/) in 2019.

The goal of this first challenge was to create a model that would classify items into categories based solely on the item’s title. 

The title is free-text input from the seller that becomes the header of the listing.

<div class="alert alert-block alert-info">
<b>Note</b> <p>Only 10% of the data is used in the notebooks to improve the experience.</p>
    <p>Also, the data is not split by language in these notebooks, purely for simplicity.</p>
    <p>In the scripted version, 100% of the data is used to improve results.</p>
</div>

### 2 - PreProcess

In this notebook I collect all the pre-processing steps and alternatives applied to the data.

In [62]:
import pandas as pd
import unidecode
import re
from gensim.models.phrases import Phraser, Phrases
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from tqdm.auto import tqdm
tqdm.pandas()
import nltk
nltk.download('stopwords')
from statics import *
from gensim.models.word2vec import Word2Vec
import numpy as np
import json
from sklearn.model_selection import train_test_split
import joblib

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Basla\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### Load Data

It can be downloaded from here:
- [train](https://meli-data-challenge.s3.amazonaws.com/train.csv.gz)

- [test](https://meli-data-challenge.s3.amazonaws.com/test.csv)

In [2]:
# train_data = pd.read_csv('data/train.csv.gz', compression='gzip')
train_data = pd.read_csv('data/sample_train.csv')
test_data = pd.read_csv('data/test.csv')

In [3]:
train_data.head()

Unnamed: 0.1,Unnamed: 0,title,label_quality,language,category
0,0,Juego Sabanas Cannon Kids Niños Bebe Cuna Func...,unreliable,spanish,BED_SHEETS
1,1,Combo I7 7820x X299 Aorus Gaming 3 Corsair Lpx...,unreliable,spanish,COMPUTER_PROCESSORS
2,2,5 De Halloween Led Luz Completo -face Enmascar...,unreliable,spanish,BALLOONS
3,3,Mesa De Lcd A Medida,unreliable,spanish,TV_STORAGE_UNITS
4,4,Body Bebe Dragon Ball - Envios A Todo El Pa...,unreliable,spanish,BABY_BODYSUITS


### Clean titles

Many of the following steps should be run separately for each language.

Here we run everything together, but in the final script the data is split by language.

**Step 1**: run each title through unidecode to transliterate the accented characters used in Spanish and Portuguese into plain ASCII

In [4]:
train_data.head(50)

Unnamed: 0.1,Unnamed: 0,title,label_quality,language,category
0,0,Juego Sabanas Cannon Kids Niños Bebe Cuna Func...,unreliable,spanish,BED_SHEETS
1,1,Combo I7 7820x X299 Aorus Gaming 3 Corsair Lpx...,unreliable,spanish,COMPUTER_PROCESSORS
2,2,5 De Halloween Led Luz Completo -face Enmascar...,unreliable,spanish,BALLOONS
3,3,Mesa De Lcd A Medida,unreliable,spanish,TV_STORAGE_UNITS
4,4,Body Bebe Dragon Ball - Envios A Todo El Pa...,unreliable,spanish,BABY_BODYSUITS
5,5,Ventana De Hierro Demolicion,unreliable,spanish,WINDOWS
6,6,Conjunto Remera Gris Gimos Talle N 4 Y Jogging...,unreliable,spanish,T_SHIRTS
7,7,Juego Faro Auxiliar Ford Focus 1999 2001 2002 ...,unreliable,spanish,FOG_LIGHTS
8,8,Mosqueton Reforzado #2 X 100 Unidades - Acero ...,unreliable,spanish,CARABINERS
9,9,Regulador Nacional Co2 Caudalimetro Soldar Mig...,unreliable,spanish,PRESSURE_GAUGES


In [5]:
train_data.tail(50)

Unnamed: 0.1,Unnamed: 0,title,label_quality,language,category
199950,199950,3 Papai Noel: 1 Paraquedas + 1 Bolhas + 1 Escada,unreliable,portuguese,CHRISTMAS_TREE_ORNAMENTS
199951,199951,Blusa De Frio Moletom Banda Pink Floyd Rock Ca...,unreliable,portuguese,SWEATSHIRTS_AND_HOODIES
199952,199952,Bolsa Capa Case Retrô P/ Câmera Sony Cyber-sho...,unreliable,portuguese,CAMERA_CASES
199953,199953,Alto Falante Woofer Eros 12 Pol 400 Rms 312lc ...,reliable,portuguese,VEHICLE_SPEAKERS
199954,199954,Conjunto Agasalho Croacia Calça E Blusa De Frio,unreliable,portuguese,FOOTBALL_KITS
199955,199955,Fogão Lua De Cristal Maxi House,unreliable,portuguese,CAMPING_STOVES
199956,199956,Calcinha Iplay Com Fralda Reutilizável - Bem T...,unreliable,portuguese,BABY_GROOMING_KITS
199957,199957,2 Vinhos Franceses La Chanson Du Soleil Cuvée ...,unreliable,portuguese,WINES
199958,199958,30 Un - Isolador Amarelo 4 Ranhuras Com Suporte,unreliable,portuguese,DESKTOP_COMPUTER_CASES
199959,199959,Cabo Flat Lvds Tv Philips 32pfl5007g Original,unreliable,portuguese,COMPUTER_AND_TV_FLEX_CABLES


In [6]:
for text in train_data.iloc[[10, 199955]].title.values:
    print(f"\n Original text: {text}")
    print(f"Processed text: {unidecode.unidecode(text)}")


 Original text: Ciento Noventa Espejos (libros Hiperión)
Processed text: Ciento Noventa Espejos (libros Hiperion)

 Original text: Fogão Lua De Cristal Maxi House
Processed text: Fogao Lua De Cristal Maxi House
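To make the effect of this step easier to see without installing anything, here is a rough standard-library approximation. It is not what the notebook uses: it swaps unidecode for `unicodedata` NFKD decomposition, and the helper name `strip_accents` is hypothetical. Unlike unidecode, it leaves characters that have no decomposition (for example `ß`) untouched.

```python
import unicodedata

def strip_accents(text):
    """Remove diacritics via NFKD decomposition (hypothetical helper, not unidecode)."""
    decomposed = unicodedata.normalize("NFKD", text)
    # Combining marks (category "Mn") carry the accents; dropping them keeps the base letters
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

for title in ["Ciento Noventa Espejos (libros Hiperión)", "Fogão Lua De Cristal Maxi House"]:
    print(strip_accents(title))
```

For the Spanish and Portuguese accents shown above, the result should match the unidecode output.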


**Step 2**: Clean special cases using RegEx

In [7]:
for string in train_data.head().title.values:
    stringA = re.sub(r"\s{2,}", " ", string)
    stringB = re.sub(r'(.)\1{3,}', r'\1\1', stringA)
    if string != stringB:
        print(f"\n Original text: {string}")
        print(f"Processed text: {stringB}")


 Original text: Body Bebe Dragon Ball  -   Envios A Todo El Pais Db008
Processed text: Body Bebe Dragon Ball - Envios A Todo El Pais Db008


In [8]:
# Remove extra spaces and characters repeated more than 3 times in a row
for string in train_data.iloc[[233, 413]].title.values:
    stringA = re.sub(r"\s{2,}", " ", string)
    stringB = re.sub(r'(.)\1{3,}', r'\1\1', stringA)
    if string != stringB:
        print(f"\n Original text: {string}")
        print(f"Processed text: {stringB}")
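The two sampled rows above print nothing because neither regex changed them. A made-up title (an assumption for illustration, not taken from the dataset) shows both substitutions firing; `squeeze` is a hypothetical wrapper around the same two patterns.

```python
import re

def squeeze(text):
    """Collapse whitespace runs and long character repetitions, as in Step 2."""
    text = re.sub(r"\s{2,}", " ", text)          # 2+ whitespace characters -> one space
    return re.sub(r"(.)\1{3,}", r"\1\1", text)   # 4+ identical characters in a row -> 2

print(squeeze("Ofertaaaaa   Unicaaa!!!!!"))
```

Note that a run of exactly three identical characters is left alone, since the pattern needs one character plus at least three repeats.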

In [9]:
# Find spatial measures and replace them with a special tag
for string in train_data.head(50).title.values:
    stringB = re.sub(r"((\d)+(\,|\.){0,1}(\d)*( ){0,1}((mts*)|(pulgadas*)|('')|(polegadas*)|(m)|(mms*)|(cms*)|(metros*)|(mtrs*)|(centimetros*))+)(?!(?!x)[A-Za-z])", " <smeasure> ", string, flags=re.I)
    if string != stringB:
        print(f"\n Original text: {string}")
        print(f"Processed text: {stringB}")



 Original text: Puerta Embutir Corrediza Gromanti 70/15  9mm Ch18 Linea2000
Processed text: Puerta Embutir Corrediza Gromanti 70/15   <smeasure>  Ch18 Linea2000

 Original text: Olla A Presión 6 Litros Tramontina Vancouver Aluminio 24cm
Processed text: Olla A Presión 6 Litros Tramontina Vancouver Aluminio  <smeasure> 

 Original text: Caño Hierro Galvanizado 4'' X 6.4 M C/cupla Tenaris. 3df, Sm
Processed text: Caño Hierro Galvanizado  <smeasure>  X  <smeasure>  C/cupla Tenaris. 3df, Sm
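The spatial pattern is dense, so a simplified version written with `re.VERBOSE` may help when reading it. This is only a sketch: it is not equivalent to the expression above (it drops the repeated-unit group and the special handling of a trailing `x`), and the names `SPATIAL` and `tag_spatial` are hypothetical.

```python
import re

SPATIAL = re.compile(
    r"""
    \d+                      # integer part
    (?:[.,]\d+)?             # optional decimal part
    [ ]?                     # at most one space before the unit
    (?:centimetros?|metros?|pulgadas?|polegadas?|mtrs?|mts?|mms?|cms?|m|'')
    (?![A-Za-z])             # the unit must not run into a longer word
    """,
    flags=re.I | re.X,
)

def tag_spatial(title):
    """Replace every spatial measure with the <smeasure> tag."""
    return SPATIAL.sub("<smeasure>", title)

print(tag_spatial("Galvanizado 4'' X 6.4 M C/cupla"))
print(tag_spatial("Olla 6 Litros Aluminio 24cm"))
```

The negative lookahead is what keeps `6 Litros` from being read as `6` followed by a unit.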


**Step 3**
Putting it all together (and adding a "few" more expressions):

In [10]:
def clean_str(string):
    """
    Tokenization/string cleaning for datasets.
    Original taken from https://github.com/yoonkim/CNN_sentence/blob/master/process_data.py
    """
    
    string = unidecode.unidecode(string)
    string = re.sub(r"\s{2,}", " ", string)
    string = re.sub(r'(.)\1{3,}', r'\1\1', string)
    
    string = re.sub(r'\d{7,}', r' <longnumber> ', string)
    string = re.sub(r'(?<=[A-Za-z])[\d,]{10,}', r' <model> ', string)
    string = re.sub(r'[\d,]{10,}', r' <weird> ', string)
    
    # Spatial
    string = re.sub(r"((\d)+(\,|\.){0,1}(\d)*( ){0,1}((mts*)|(pulgadas*)|('')|"
                    "(polegadas*)|(m)|(mms*)|(cms*)|(metros*)|(mtrs*)|(centimetros*))+)"
                    "(?!(?!x)[A-Za-z])", " <smeasure> ", string, flags=re.I)
    string = re.sub(r"(mts)+( ){0,1}((\d)+(\,|\.){0,1}(\d)*)(?![A-Za-z])", " <smeasure> ", string, flags=re.I)
    string = re.sub(r"<smeasure> +[\/x] +<smeasure> +[\/x] +<smeasure>", " <smeasure> ", string, flags=re.I)
    string = re.sub(r"(<smeasure>|(\d)+(\,|\.)*(\d*)) +[\/x] "
                    "+(<smeasure>|(\d)+(\,|\.)*(\d*)) +[\/x] +<smeasure>", " <smeasure> ", string, 
                    flags=re.I)
    string = re.sub(r"<smeasure> +[\/x] +<smeasure>", " <smeasure> ", string, flags=re.I)
    string = re.sub(r"(?<![\dA-Za-z])(<smeasure>|(\d)+(\,|\.)*(\d*)) *[\/x-] *<smeasure>", 
                    " <smeasure> ", string, flags=re.I)
    string = re.sub(r"(?<![\dA-Za-z])((\d)+(\,|\.)*(\d*))x((\d)+(\,|\.)*(\d*))x +<smeasure>", " <smeasure> ", 
                    string, flags=re.I)
    
    # Electrical
    string = re.sub(r"(?<![\dA-Za-z])((\d)+(\,|\.)*(\d*)) *amperes", " <emeasure> ", string, flags=re.I)
    string = re.sub(r"(?<!(?<![\dx])[\dA-Za-z])((\d)+(\,|\.)*(\d*)) *amps*", " <emeasure> ", string, flags=re.I)
    string = re.sub(r"(?<![\dA-Za-z])((\d)+(\,|\.)*(\d*)) *mah", " <emeasure> ", string, flags=re.I)
    string = re.sub(r"(?<![\dA-Za-z])((\d)+(\,|\.)*(\d*)) *vol.", " <emeasure> ", string, flags=re.I)
    string = re.sub(r"\b(?<![\dA-Za-z])((\d)+(\,|\.)*(\d*)) *kw\b", " <emeasure> ", string, flags=re.I)
    string = re.sub(r"\b(?<![\dA-Za-z])((\d)+(\,|\.)*(\d*)) *v+\b", " <emeasure> ", string, flags=re.I)
    string = re.sub(r"(?<![\dA-Za-z])((\d)+(\,|\.)*(\d*)) *volts*", " <emeasure> ", string, flags=re.I)
    string = re.sub(r"(?<![\dA-Za-z])((\d)+(\,|\.)*(\d*)) *w(ts)*(?![\dA-Za-z])", " <emeasure> ", 
                    string, flags=re.I)
    
    # Pressure
    string = re.sub(r"(?<![\dA-Za-z])((\d)+(\,|\.)*(\d*)) *psi", " <pmeasure> ", string, flags=re.I)
    string = re.sub(r"(?<![\dA-Za-z])((\d)+(\,|\.)*(\d*)) *bar", " <pmeasure> ", string, flags=re.I)
    
    # Weights
    string = re.sub(r"(?<![\dA-Za-z])((\d)+(\,|\.)*(\d*)) *kgs*", " <wmeasure> ", string, flags=re.I)
    string = re.sub(r"(?<![\dA-Za-z])((\d)+(\,|\.)*(\d*)) *kilos*", " <wmeasure> ", string, flags=re.I)
    string = re.sub(r"\b(?<![\dA-Za-z])((\d)+(\,|\.)*(\d*)) *g\b", " <wmeasure> ", string, flags=re.I)
    
    # IT
    string = re.sub(r"(?<![\dA-Za-z])((\d)+(\,|\.)*(\d*)) *tb", " <itmeasure> ", string, flags=re.I)
    string = re.sub(r"(?<![\dA-Za-z])((\d)+(\,|\.)*(\d*)) *gb", " <itmeasure> ", string, flags=re.I)
    
    # Volume
    string = re.sub(r"(?<![\dA-Za-z])((\d)+(\,|\.)*(\d*)) *cc(?![0-9])", " <vmeasure> ", string, flags=re.I)
    string = re.sub(r"(?<![\dA-Za-z])((\d)+(\,|\.)*(\d*)) *litros*", " <vmeasure> ", string, flags=re.I)
    string = re.sub(r"(?<![\dA-Za-z])((\d)+(\,|\.)*(\d*))litrs*", " <vmeasure> ", string, flags=re.I)
    string = re.sub(r"\b(?<![\dA-Za-z])((\d)+(\,|\.)*(\d*)) *l+\b", " <vmeasure> ", string, flags=re.I)
    string = re.sub(r"(?<![\dA-Za-z])((\d)+(\,|\.)*(\d*)) *mls*", " <vmeasure> ", string, flags=re.I)
    string = re.sub(r"(?<![\dA-Za-z])((\d)+(\,|\.)*(\d*)) *ltrs*", " <vmeasure> ", string, flags=re.I)
    
    # Horse power
    string = re.sub(r"\b(?<![\dA-Za-z])(\d)+ *cv\b", " <hpmeasure> ", string, flags=re.I)
    string = re.sub(r"\b(?<![\dA-Za-z])((\d)+(\,|\.)*(\d*))+ *hp\b", " <hpmeasure> ", string, flags=re.I)
    
    # Time
    string = re.sub(r"\b(?<![\dA-Za-z])((\d)+(\:)*(\d*)) *hs*(?![\d])\b", " <time> ", string, flags=re.I)
    
    # Quantity
    string = re.sub(r"\bX\d{1,4}\b", " <quantity> ", string, flags=re.I)
    
    # Money
    string = re.sub(r"(?<![\dA-Za-z])\$ *((\d)+(\,|\.)*(\d*))", " <money> ", string, flags=re.I)
    
    # Dimension (could be smeasure too)
    string = re.sub(r"(((\d)+(\,|\.)*(\d*))+ {0,1}(x|\*){1} {0,1}((\d)+(\,|\.)*(\d*))+)"
                    "+( {0,1}(x|\*){1} {0,1}((\d)+(\,|\.)*(\d*))+)*", " <dimension> ", string, flags=re.I) 
    
    # Resolution
    string = re.sub(r"\b(?<![A-Za-z\-])\d+p\b", " <res> ", string)
    
    # Date
    string = re.sub(r"\b\d{2}-\d{2}-(19\d{2}|20\d{2})\b", " <date> ", string)

    # Model
    string = re.sub(r"(?<!\d{4})[A-Za-z\-]+\d+[A-Za-z\-\.0-9]*", " <model> ", string, flags=re.I)
    string = re.sub(r"[A-Za-z\-\.0-9]*\d+[A-Za-z\-](?!\d{4})", " <model> ", string, flags=re.I)
    string = re.sub(r"<model> \d+", " <model> ", string, flags=re.I)
    string = re.sub(r"\d+ <model>", " <model> ", string, flags=re.I)
    
    # Years (the alternation is grouped so both lookarounds apply to 19xx and 20xx)
    string = re.sub(r"(?<![A-Za-z0-9])(19\d{2}|20\d{2})(?![A-Za-z0-9])", " <year> ", string)
    
    # Numbers
    string = re.sub(r"[-+]?[.\d]*[\d]+[:,.\d]*", " <number> ", string)
    
    # String cleanup
    string = re.sub(r",", " , ", string)
    string = re.sub(r"/", " / ", string)
    string = re.sub(r"!", " ! ", string)
    string = re.sub(r"\*", " * ", string)
    string = re.sub(r"\(", " ( ", string)
    string = re.sub(r"\)", " ) ", string)
    string = re.sub(r"\?", " ? ", string)
    string = re.sub(r"#\S+", " <hashtag> " , string)
    string = re.sub(r"\\", " ", string)
    string = re.sub(r"\+", " ", string)
    string = re.sub(r"\d+", " <number> ", string)
    string = re.sub(r"[^A-Za-z0-9(),!?\'\`<>]", " ", string, flags=re.I)
    string = re.sub(r"\s{2,}", " ", string)
    
    return string.strip().lower()

In [11]:
for string in train_data.sample(5).title.values:
    print(f"\n Input string: {string}")
    print(f"Clean string: {clean_str(string)}")    


 Input string: Escritorio Y Sillon
Clean string: escritorio y sillon

 Input string: Triturador Picador Com Dispenser Paramont
Clean string: triturador picador com dispenser paramont

 Input string: Amortiguador Baul Escort 88 A 94 Con Limpia Luneta (kitx2)
Clean string: amortiguador baul escort <number> a <number> con limpia luneta ( <model> )

 Input string: Woofer Triton 12 400 Rms 4 Ohms.
Clean string: woofer triton <number> <number> rms <number> ohms

 Input string: Mercedes Benz A Control Remoto
Clean string: mercedes benz a control remoto


**Apply cleaning to entire dataset**

Considering that the full dataset has 20M rows, this process is quite time-consuming. That's the reason why it is performed separately and the results are stored.

In [12]:
train_data['clean_title'] = train_data['title'].progress_apply(clean_str)

  0%|          | 0/200000 [00:00<?, ?it/s]

**Save cleaned dataset**

In [13]:
train_data.to_csv('data/sample_clean_train.csv')

### Bigrams

In [14]:
sentences = train_data['clean_title'].astype(str)

In [15]:
# Now that the text is clean, we can split it into words just by splitting on spaces
x_text = [s.split(" ") for s in sentences]

In [16]:
bigram = Phraser(Phrases(x_text, min_count=100, threshold=20))

In [17]:
x_text = [bigram[text] for text in x_text]
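To give some intuition for `min_count` and `threshold`, the sketch below computes a collocation score in the spirit of the default scorer described in the gensim documentation: a pair is promoted to a bigram when its score exceeds `threshold`. This is a simplified illustration under stated assumptions, not gensim's implementation: the vocabulary size here counts unigrams only, and the toy sentences and the `bigram_scores` helper are hypothetical.

```python
from collections import Counter

def bigram_scores(sentences, min_count):
    """Score adjacent word pairs: (count(a,b) - min_count) * |vocab| / (count(a) * count(b))."""
    unigrams = Counter(w for s in sentences for w in s)
    bigrams = Counter(pair for s in sentences for pair in zip(s, s[1:]))
    vocab_size = len(unigrams)  # assumption: unigrams only
    return {
        (a, b): (n - min_count) * vocab_size / (unigrams[a] * unigrams[b])
        for (a, b), n in bigrams.items()
    }

toy = [["aire", "acondicionado", "split"]] * 4 + [["aire", "libre"], ["split", "usado"]]
for pair, score in bigram_scores(toy, min_count=1).items():
    print(pair, score)
```

Pairs that co-occur often relative to how frequent each word is on its own get a high score, while pairs seen only `min_count` times score zero.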

### Remove stopwords

In [18]:
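def remove_stopwords(sentences, lang='english'):
    """Drop stopwords for `lang`, falling back to Spanish if that language is unavailable."""
    try:
        stpwrds = set(stopwords.words(lang))
    except Exception:
        stpwrds = set(stopwords.words('spanish'))

    # Set membership is O(1), which matters when filtering a large number of titles
    out_sentences = [[w for w in sentence if w not in stpwrds] for sentence in sentences]

    return out_sentences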

In [19]:
x_text = remove_stopwords(x_text, lang='spanish')

### Stemming

In [20]:
def text_stemming(sentences, lang='english'):
    try:
        stemmer = SnowballStemmer(lang)
    except Exception:
        stemmer = SnowballStemmer('spanish')
    
    out_text = [[stemmer.stem(i) for i in text] for text in sentences]

    return out_text

In [21]:
x_text = text_stemming(x_text, lang='spanish')

### Pad sentences
In order to feed the data into a CNN, all input sentences must have the same length.

This next step adds a padding token to short sentences or truncates long ones so that they all have the same length.

In [22]:
def pad_sentences(sentences, padding_word="<PAD/>", len_sent=None):
    """
    Pads or truncates all sentences to the same length.
    The length is `len_sent` if given, otherwise that of the longest sentence.
    Returns padded sentences.
    """
    if len_sent is None:
        sequence_length = max(len(x) for x in sentences)
    else:
        sequence_length = len_sent
    padded_sentences = []
    for sentence in sentences:
        num_padding = sequence_length - len(sentence)
        if num_padding >= 0:
            new_sentence = sentence + [padding_word] * num_padding
        else:
            new_sentence = sentence[:sequence_length]
        padded_sentences.append(new_sentence)
    return padded_sentences

In [23]:
x_text = pad_sentences(x_text)

In [24]:
print(x_text[1])

['comb', '<model>', '<model>', '<quantity>', 'aorus', 'gaming', '<number>', 'cors', 'lpx', '<itmeasure>', '<model>', 'hz', '<PAD/>', '<PAD/>', '<PAD/>', '<PAD/>', '<PAD/>', '<PAD/>', '<PAD/>', '<PAD/>', '<PAD/>', '<PAD/>', '<PAD/>', '<PAD/>', '<PAD/>', '<PAD/>', '<PAD/>', '<PAD/>', '<PAD/>', '<PAD/>', '<PAD/>']
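The call above only exercises the padding branch, since the target length defaults to the longest sentence. A compact sketch of the same idea with an explicit length shows both branches; `pad_or_truncate` is a hypothetical helper and the tokens are made up.

```python
def pad_or_truncate(tokens, length, pad="<PAD/>"):
    """Return exactly `length` tokens: cut long lists, right-pad short ones."""
    return tokens[:length] + [pad] * max(0, length - len(tokens))

tokens = ["mesa", "lcd", "medida"]
print(pad_or_truncate(tokens, 5))  # shorter than the target: padded on the right
print(pad_or_truncate(tokens, 2))  # longer than the target: truncated
```

Passing a fixed length is what lets unseen data be shaped like the training data, which is presumably why `len_sent` is stored at the end of this notebook.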


### Word Embeddings

For this project I used Word2Vec, and this is a good point to train the vectors.

In [26]:
EMBEDDING_DIM = 300
model = Word2Vec(sentences = x_text, vector_size=EMBEDDING_DIM, sg=1, window=7, min_count=20, seed=42, workers=8)

We should also save the weights and vocabulary to be used later in the CNN.

In [35]:
weights = model.wv.vectors
# Note: np.save writes the .npy format, despite the .npz file name
with open('data/embeddings.npz', 'wb') as f:
    np.save(f, weights)
vocab = model.wv.key_to_index
with open('data/map.json', 'w') as f:
    json.dump(vocab, f)

### Build input data

Now that we have the weight vectors and the word indices, we can replace each word with its index to be used as input to the CNN.

In [37]:
train_data['sentences_padded'] = x_text

In [41]:
def load_vocab(vocab_path='/map.json', path=''):
    """
    Load word -> index and index -> word mappings
    :param vocab_path: where the word-index map is saved
    :return: word2idx, idx2word
    """

    with open(path+vocab_path, 'r') as f:
        data = json.loads(f.read())
    word2idx = data
    idx2word = dict([(v, k) for k, v in data.items()])
    return word2idx, idx2word

In [42]:
def build_input_data(df, path=''):
    """
    Maps sentences and labels to vectors based on a vocabulary.
    """
    word2idx, idx2word = load_vocab(vocab_path='/map.json', path=path)
    pad_idx = word2idx["<PAD/>"]
    # Words missing from the vocabulary fall back to the padding index
    df['input_data'] = df['sentences_padded'].apply(
        lambda words: np.array([word2idx.get(w, pad_idx) for w in words], dtype=np.int32))
    return df

In [44]:
train_data = build_input_data(train_data, path='data')
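The mapping itself is easy to check on a toy vocabulary. In the sketch below the vocabulary, the tokens, and the `encode` helper are all hypothetical; the point is that any word missing from the map (for example one that fell below Word2Vec's `min_count`) ends up with the same index as the padding token.

```python
def encode(tokens, word2idx, pad_token="<PAD/>"):
    """Map tokens to integer indices, sending unknown words to the padding index."""
    pad_idx = word2idx[pad_token]
    return [word2idx.get(tok, pad_idx) for tok in tokens]

toy_vocab = {"<PAD/>": 0, "mesa": 1, "lcd": 2}
print(encode(["mesa", "lcd", "rarisimo", "<PAD/>"], toy_vocab))
```

A consequence of this design is that the model cannot tell a rare word from padding; a dedicated unknown-word token would be one alternative.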

### Split data into train and test
Since we are using only 1% of the data, some categories won't have enough samples.

We will drop them for now, but this step won't be included in the final code.

In [45]:
# Clean title column is no longer needed
train_data.drop(columns=['clean_title'], inplace=True)

In [58]:
# Drop categories with fewer than 2 samples
category_counts = train_data['category'].value_counts()
rare_categories = category_counts[category_counts < 2].index
train_data = train_data[~train_data['category'].isin(rare_categories)]

In [59]:
train_df, test_df = train_test_split(train_data,test_size=0.1, stratify=train_data['category'])
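The reason for dropping single-sample categories is that a stratified split needs at least two samples per class, one for each side. As far as I know, scikit-learn raises an error otherwise. The filtering logic can be sketched without pandas; the labels and the `drop_rare` helper below are made up for illustration.

```python
from collections import Counter

def drop_rare(labels, min_samples=2):
    """Keep only labels whose class has at least `min_samples` occurrences."""
    counts = Counter(labels)
    return [lab for lab in labels if counts[lab] >= min_samples]

labels = ["WINES", "WINDOWS", "WINES", "BALLOONS", "WINDOWS", "WINES"]
kept = drop_rare(labels)
print(kept)           # BALLOONS appears once, so it is removed
print(Counter(kept))  # every remaining class has at least 2 samples
```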

### Save the data

Save the data to be used in the models

In [60]:
# All padded sentences share the same length, so the first one is enough
len_sent = len(train_df['sentences_padded'].iloc[0])

In [65]:
joblib.dump(len_sent, './data/len_sent.h5')      
train_df.to_pickle('./data/df.pkl')
test_df.to_pickle('./data/df_test.pkl')