## MeLi Data Challenge 2019

This notebook is part of a curated version of my original solution for the MeLi Data Challenge hosted by [Mercado Libre](https://www.mercadolibre.com/) in 2019.

The goal of this first challenge was to create a model that would classify items into categories based solely on the item’s title. 

The title is free-text input from the seller that becomes the header of the listing.

<div class="alert alert-block alert-info">
<b>Note</b> <p>Only 10% of the data is used in the notebooks to improve the experience.</p>
    <p>Also, only Spanish data is used in these notebooks, for simplicity.</p>
    <p>In the scripted version, 100% of the data is used to improve results.</p>
</div>

### 2 - PreProcess

In this notebook I collect all the pre-processing steps and alternatives applied to the data.

In [1]:
import pandas as pd
import unidecode
import re
from gensim.models.phrases import Phraser, Phrases
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from tqdm.auto import tqdm
tqdm.pandas()
import nltk
nltk.download('stopwords')
from gensim.models.word2vec import Word2Vec
import numpy as np
import json
from sklearn.model_selection import train_test_split
import joblib

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Basla\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### Load Data

It can be downloaded from here:
- [train](https://meli-data-challenge.s3.amazonaws.com/train.csv.gz)

- [test](https://meli-data-challenge.s3.amazonaws.com/test.csv)

In [2]:
# train_data = pd.read_csv('data/train.csv.gz', compression='gzip')
train_data = pd.read_csv('./../data/sample_train.csv')
test_data = pd.read_csv('./../data/test.csv')

In [3]:
train_data.head()

Unnamed: 0.1,Unnamed: 0,title,label_quality,language,category
0,0,Riñonera En Oferta Con Un Gran Espacio,unreliable,spanish,FANNY_PACKS
1,1,Uxcell Detector De Interruptor De Sensor De Pr...,unreliable,spanish,ALARMS_AND_SENSORS
2,2,Kit 8 Botadores Hidraulicos Ford Transit 2.4 Tdci,unreliable,spanish,ENGINE_TAPPET_GUIDE_HOLDS
3,3,"Bolso Bajo Asiento, Bicicleta, Ciclismo, Noaf.",unreliable,spanish,BICYCLE_BAGS
4,4,Funda Para Almohadon C Pelo Y Cierre,unreliable,spanish,CUSHION_COVERS


### Clean titles

Many of the following steps should be run separately for each language.

Here, we run them all together, but in the final script the data will be split by language.
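
As a rough illustration of the per-language split, the sketch below groups titles by their `language` value using only the standard library. The sample rows and the helper name `split_by_language` are made up for this example; with pandas, a `groupby` on the `language` column would play the same role.

```python
from collections import defaultdict

# Hypothetical rows mimicking the `title` and `language` columns of the dataset.
rows = [
    {"title": "Riñonera En Oferta", "language": "spanish"},
    {"title": "Bolsa Feminina De Couro", "language": "portuguese"},
    {"title": "Colchón Doble Inflable", "language": "spanish"},
]


def split_by_language(rows):
    """Group titles by language, preserving the original order inside each group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["language"]].append(row["title"])
    return dict(groups)


titles_by_language = split_by_language(rows)
for lang, titles in titles_by_language.items():
    # Language-dependent steps (stopword removal, stemming) would run here, once per group.
    print(lang, len(titles))
```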

**Step 1**: run each title through unidecode to transliterate the accented and special characters present in Spanish and Portuguese into plain ASCII

In [4]:
for text in train_data.iloc[[0, 6]].title.values:
    print(f"\n Original text: {text}")
    print(f"Processed text: {unidecode.unidecode(text)}")


 Original text: Riñonera En Oferta Con Un Gran Espacio
Processed text: Rinonera En Oferta Con Un Gran Espacio

 Original text: Colchón Doble Inflable 191x137x22cm
Processed text: Colchon Doble Inflable 191x137x22cm


**Step 2**: Clean special cases using RegEx

In [5]:
# Remove extra spaces and characters repeated more than 3 times in a row
for string in train_data.iloc[[40, 101]].title.values:
    stringA = re.sub(r"\s{2,}", " ", string)
    stringB = re.sub(r'(.)\1{3,}', r'\1\1', stringA)
    if string != stringB:
        print(f"\n Original text: {string}")
        print(f"Processed text: {stringB}")


 Original text: Cuerina P/tapiceria -  Caramelo 29 - Mostaza
Processed text: Cuerina P/tapiceria - Caramelo 29 - Mostaza

 Original text:  Bateria  Moura M22gd  Blindada Para Nissan Acenta 12v X 65
Processed text:  Bateria Moura M22gd Blindada Para Nissan Acenta 12v X 65


In [6]:
# Find spatial measures and replace them with a special tag
for string in train_data.head(50).title.values:
    stringB = re.sub(r"((\d)+(\,|\.){0,1}(\d)*( ){0,1}((mts*)|(pulgadas*)|('')|(polegadas*)|(m)|(mms*)|(cms*)|(metros*)|(mtrs*)|(centimetros*))+)(?!(?!x)[A-Za-z])", " <smeasure> ", string, flags=re.I)
    if string != stringB:
        print(f"\n Original text: {string}")
        print(f"Processed text: {stringB}")



 Original text: Colchón Doble Inflable 191x137x22cm
Processed text: Colchón Doble Inflable 191x137x <smeasure> 

 Original text: Rollo De Vinilo Blanco Brillo Base Gris Ritrama De 1.05x50 M
Processed text: Rollo De Vinilo Blanco Brillo Base Gris Ritrama De 1.05x <smeasure> 

 Original text: Extension Hexagonal 230 Mm C/mecha Guia Bremen®
Processed text: Extension Hexagonal  <smeasure>  C/mecha Guia Bremen®

 Original text: Puntilla De Nylon Blanca De 1 Cm X 10 Mts
Processed text: Puntilla De Nylon Blanca De  <smeasure>  X  <smeasure> 
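
The one-line pattern above is hard to read, so here is a simplified, commented variant written with `re.VERBOSE`. It is only a sketch of the same idea (a number, an optional space, a length unit that does not continue into a longer word), not an exact equivalent of the expression used in the notebook; the names `SPATIAL` and `tag_spatial` are mine.

```python
import re

# Simplified sketch of the spatial-measure rule; not identical to the notebook's regex.
SPATIAL = re.compile(
    r"""
    \d+                 # integer part
    (?:[.,]\d+)?        # optional decimal part, with either separator
    \s?                 # at most one space between number and unit
    (?:centimetros?|metros?|pulgadas?|polegadas?|mts?|cms?|mms?|m)  # length unit
    (?![A-Za-z])        # the unit must not be the start of a longer word
    """,
    flags=re.IGNORECASE | re.VERBOSE,
)


def tag_spatial(text):
    """Replace spatial measures with a tag and collapse the extra spaces."""
    tagged = SPATIAL.sub(" <smeasure> ", text)
    return re.sub(r"\s{2,}", " ", tagged).strip()


print(tag_spatial("Extension Hexagonal 230 Mm C/mecha"))
print(tag_spatial("Puntilla De Nylon Blanca De 1 Cm X 10 Mts"))
print(tag_spatial("Moto 150 Modelo"))  # "150 M" is followed by letters, so nothing is tagged
```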


**Step 3**
Putting it all together (and adding a "few" more expressions):

In [7]:
def clean_str(string):
    """
    Tokenization/string cleaning for datasets.
    Original taken from https://github.com/yoonkim/CNN_sentence/blob/master/process_data.py
    """
    
    string = unidecode.unidecode(string)
    string = re.sub(r"\s{2,}", " ", string)
    string = re.sub(r'(.)\1{3,}', r'\1\1', string)
    
    string = re.sub(r'\d{7,}', r' <longnumber> ', string)
    string = re.sub(r'(?<=[A-Za-z])[\d,]{10,}', r' <model> ', string)
    string = re.sub(r'[\d,]{10,}', r' <weird> ', string)
    
    # Spatial
    string = re.sub(r"((\d)+(\,|\.){0,1}(\d)*( ){0,1}((mts*)|(pulgadas*)|('')|"
                    "(polegadas*)|(m)|(mms*)|(cms*)|(metros*)|(mtrs*)|(centimetros*))+)"
                    "(?!(?!x)[A-Za-z])", " <smeasure> ", string, flags=re.I)
    string = re.sub(r"(mts)+( ){0,1}((\d)+(\,|\.){0,1}(\d)*)(?![A-Za-z])", " <smeasure> ", string, flags=re.I)
    string = re.sub(r"<smeasure> +[\/x] +<smeasure> +[\/x] +<smeasure>", " <smeasure> ", string, flags=re.I)
    string = re.sub(r"(<smeasure>|(\d)+(\,|\.)*(\d*)) +[\/x] "
                    r"+(<smeasure>|(\d)+(\,|\.)*(\d*)) +[\/x] +<smeasure>", " <smeasure> ", string,
                    flags=re.I)
    string = re.sub(r"<smeasure> +[\/x] +<smeasure>", " <smeasure> ", string, flags=re.I)
    string = re.sub(r"(?<![\dA-Za-z])(<smeasure>|(\d)+(\,|\.)*(\d*)) *[\/x-] *<smeasure>", 
                    " <smeasure> ", string, flags=re.I)
    string = re.sub(r"(?<![\dA-Za-z])((\d)+(\,|\.)*(\d*))x((\d)+(\,|\.)*(\d*))x +<smeasure>", " <smeasure> ", 
                    string, flags=re.I)
    
    # Electrical
    string = re.sub(r"(?<![\dA-Za-z])((\d)+(\,|\.)*(\d*)) *amperes", " <emeasure> ", string, flags=re.I)
    string = re.sub(r"(?<!(?<![\dx])[\dA-Za-z])((\d)+(\,|\.)*(\d*)) *amps*", " <emeasure> ", string, flags=re.I)
    string = re.sub(r"(?<![\dA-Za-z])((\d)+(\,|\.)*(\d*)) *mah", " <emeasure> ", string, flags=re.I)
    string = re.sub(r"(?<![\dA-Za-z])((\d)+(\,|\.)*(\d*)) *vol.", " <emeasure> ", string, flags=re.I)
    string = re.sub(r"\b(?<![\dA-Za-z])((\d)+(\,|\.)*(\d*)) *kw\b", " <emeasure> ", string, flags=re.I)
    string = re.sub(r"\b(?<![\dA-Za-z])((\d)+(\,|\.)*(\d*)) *v+\b", " <emeasure> ", string, flags=re.I)
    string = re.sub(r"(?<![\dA-Za-z])((\d)+(\,|\.)*(\d*)) *volts*", " <emeasure> ", string, flags=re.I)
    string = re.sub(r"(?<![\dA-Za-z])((\d)+(\,|\.)*(\d*)) *w(ts)*(?![\dA-Za-z])", " <emeasure> ", 
                    string, flags=re.I)
    
    # Pressure
    string = re.sub(r"(?<![\dA-Za-z])((\d)+(\,|\.)*(\d*)) *psi", " <pmeasure> ", string, flags=re.I)
    string = re.sub(r"(?<![\dA-Za-z])((\d)+(\,|\.)*(\d*)) *bar", " <pmeasure> ", string, flags=re.I)
    
    # Weights
    string = re.sub(r"(?<![\dA-Za-z])((\d)+(\,|\.)*(\d*)) *kgs*", " <wmeasure> ", string, flags=re.I)
    string = re.sub(r"(?<![\dA-Za-z])((\d)+(\,|\.)*(\d*)) *kilos*", " <wmeasure> ", string, flags=re.I)
    string = re.sub(r"\b(?<![\dA-Za-z])((\d)+(\,|\.)*(\d*)) *g\b", " <wmeasure> ", string, flags=re.I)
    
    # IT
    string = re.sub(r"(?<![\dA-Za-z])((\d)+(\,|\.)*(\d*)) *tb", " <itmeasure> ", string, flags=re.I)
    string = re.sub(r"(?<![\dA-Za-z])((\d)+(\,|\.)*(\d*)) *gb", " <itmeasure> ", string, flags=re.I)
    
    # Volume
    string = re.sub(r"(?<![\dA-Za-z])((\d)+(\,|\.)*(\d*)) *cc(?![0-9])", " <vmeasure> ", string, flags=re.I)
    string = re.sub(r"(?<![\dA-Za-z])((\d)+(\,|\.)*(\d*)) *litros*", " <vmeasure> ", string, flags=re.I)
    string = re.sub(r"(?<![\dA-Za-z])((\d)+(\,|\.)*(\d*))litrs*", " <vmeasure> ", string, flags=re.I)
    string = re.sub(r"\b(?<![\dA-Za-z])((\d)+(\,|\.)*(\d*)) *l+\b", " <vmeasure> ", string, flags=re.I)
    string = re.sub(r"(?<![\dA-Za-z])((\d)+(\,|\.)*(\d*)) *mls*", " <vmeasure> ", string, flags=re.I)
    string = re.sub(r"(?<![\dA-Za-z])((\d)+(\,|\.)*(\d*)) *ltrs*", " <vmeasure> ", string, flags=re.I)
    
    # Horse power
    string = re.sub(r"\b(?<![\dA-Za-z])(\d)+ *cv\b", " <hpmeasure> ", string, flags=re.I)
    string = re.sub(r"\b(?<![\dA-Za-z])((\d)+(\,|\.)*(\d*))+ *hp\b", " <hpmeasure> ", string, flags=re.I)
    
    # Time
    string = re.sub(r"\b(?<![\dA-Za-z])((\d)+(\:)*(\d*)) *hs*(?![\d])\b", " <time> ", string, flags=re.I)
    
    # Quantity
    string = re.sub(r"\bX\d{1,4}\b", " <quantity> ", string, flags=re.I)
    
    # Money
    string = re.sub(r"(?<![\dA-Za-z])\$ *((\d)+(\,|\.)*(\d*))", " <money> ", string, flags=re.I)
    
    # Dimension (could be smeasure too)
    string = re.sub(r"(((\d)+(\,|\.)*(\d*))+ {0,1}(x|\*){1} {0,1}((\d)+(\,|\.)*(\d*))+)"
                    r"+( {0,1}(x|\*){1} {0,1}((\d)+(\,|\.)*(\d*))+)*", " <dimension> ", string, flags=re.I)
    
    # Resolution
    string = re.sub(r"\b(?<![A-Za-z\-])\d+p\b", " <res> ", string)
    
    # Date
    string = re.sub(r"\b\d{2}-\d{2}-(19\d{2}|20\d{2})\b", " <date> ", string)

    # Model
    string = re.sub(r"(?<!\d{4})[A-Za-z\-]+\d+[A-Za-z\-\.0-9]*", " <model> ", string, flags=re.I)
    string = re.sub(r"[A-Za-z\-\.0-9]*\d+[A-Za-z\-](?!\d{4})", " <model> ", string, flags=re.I)
    string = re.sub(r"<model> \d+", " <model> ", string, flags=re.I)
    string = re.sub(r"\d+ <model>", " <model> ", string, flags=re.I)
    
    # Years
    # The alternation is grouped so that both look-arounds apply to 19xx and 20xx alike
    string = re.sub(r"(?<![A-Za-z0-9])(19\d{2}|20\d{2})(?![A-Za-z0-9])", " <year> ", string)
    
    # Numbers
    string = re.sub(r"[-+]?[.\d]*[\d]+[:,.\d]*", " <number> ", string)
    
    # String cleanup
    string = re.sub(r",", " , ", string)
    string = re.sub(r"/", " / ", string)
    string = re.sub(r"!", " ! ", string)
    string = re.sub(r"\*", " * ", string)
    string = re.sub(r"\(", " ( ", string)
    string = re.sub(r"\)", " ) ", string)
    string = re.sub(r"\?", " ? ", string)
    string = re.sub(r"#\S+", " <hashtag> " , string)
    string = re.sub(r"\\", " ", string)
    string = re.sub(r"\+", " ", string)
    string = re.sub(r"\d+", " <number> ", string)
    string = re.sub(r"[^A-Za-z0-9(),!?\'\`<>]", " ", string, flags=re.I)
    string = re.sub(r"\s{2,}", " ", string)
    
    return string.strip().lower()

In [8]:
for string in train_data.sample(5).title.values:
    print(f"\n Input string: {string}")
    print(f"Clean string: {clean_str(string)}")    


 Input string: Mesas De Pizarra.
Clean string: mesas de pizarra

 Input string: Termix C6250060 Cepillo Termico Ceramic, 60 Mm 
Clean string: termix c <longnumber> cepillo termico ceramic , <smeasure>

 Input string: Titanio Fidget Spinner Blade Pocket Mano Juguete Dedo Giro E
Clean string: titanio fidget spinner blade pocket mano juguete dedo giro e

 Input string: Bomba Embrague Chevrolet Silverado-grand Blazer 3/4
Clean string: bomba embrague chevrolet silverado grand blazer <number> <number>

 Input string: Bolsa De Polietileno En Alta Densidad En 30x40 Por 500 Unid.
Clean string: bolsa de polietileno en alta densidad en <dimension> por <number> unid
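
The order of the substitutions in `clean_str` matters: the unit-specific rules have to run before the generic number rule, otherwise the digits are already gone when the unit rule is tried. The sketch below shows this with a tiny rule table; the three rules are simplified stand-ins, and `RULES` and `apply_rules` are names I made up for the example.

```python
import re

# Simplified stand-ins for three of the rules; specific rules first, generic rule last.
RULES = [
    (re.compile(r"(?<![\dA-Za-z])\d+(?:[.,]\d+)? *kgs?\b", re.I), " <wmeasure> "),
    (re.compile(r"(?<![\dA-Za-z])\d+(?:[.,]\d+)? *gb\b", re.I), " <itmeasure> "),
    (re.compile(r"\d+"), " <number> "),
]


def apply_rules(text, rules=RULES):
    """Apply each (pattern, tag) pair in order, then normalise spaces and case."""
    for pattern, tag in rules:
        text = pattern.sub(tag, text)
    return re.sub(r"\s{2,}", " ", text).strip().lower()


print(apply_rules("Pesa Rusa 16 Kg"))
print(apply_rules("Pendrive 32gb Pack 2"))
# With the generic number rule first, the weight is no longer recognised:
print(apply_rules("Pesa Rusa 16 Kg", rules=list(reversed(RULES))))
```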


**Apply cleaning to entire dataset**

Considering that the full dataset has 20M rows, this process is quite time-consuming. That is why it is performed separately and the results are stored.
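
One possible way to organise the "clean once, store, reuse" idea is sketched below with the standard library only. The helper `clean_with_cache`, the toy cleaning function and the file name are hypothetical and are not part of the original solution.

```python
import csv
import tempfile
from pathlib import Path


def clean_with_cache(titles, cache_path, clean_fn):
    """Return cleaned titles, reading them from `cache_path` when it already exists."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        with cache_path.open(newline="", encoding="utf-8") as f:
            return [row[0] for row in csv.reader(f)]
    cleaned = [clean_fn(title) for title in titles]
    with cache_path.open("w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows([c] for c in cleaned)
    return cleaned


calls = []


def toy_clean(title):
    """Stand-in for the real cleaning function; records every call."""
    calls.append(title)
    return title.strip().lower()


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "clean_titles.csv"
    first = clean_with_cache(["  Mesas De Pizarra ", "Bolso"], cache, toy_clean)
    second = clean_with_cache(["  Mesas De Pizarra ", "Bolso"], cache, toy_clean)

print(first, len(calls))  # the second call is served from the cache
```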

In [9]:
train_data['clean_title'] = train_data['title'].progress_apply(clean_str)

  0%|          | 0/200000 [00:00<?, ?it/s]

**Save cleaned dataset**

In [10]:
train_data.to_csv('./../data/sample_clean_train.csv')

### Bigrams

In [11]:
sentences = train_data['clean_title'].astype(str)

In [12]:
# Now that the text is clean, we can split it into words just by splitting on spaces
x_text = [s.split(" ") for s in tqdm(sentences)]

  0%|          | 0/200000 [00:00<?, ?it/s]

In [13]:
bigram = Phraser(Phrases(x_text, min_count=100, threshold=20))

In [14]:
x_text = [bigram[text] for text in tqdm(x_text)]

  0%|          | 0/200000 [00:00<?, ?it/s]
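
To give an intuition for the `min_count` and `threshold` arguments used above: to my understanding, the default scorer in `Phrases` rates each pair of adjacent tokens roughly as `(count(a, b) - min_count) * vocab_size / (count(a) * count(b))` and merges the pairs that score above `threshold`. The pure-Python sketch below reimplements that idea on a toy corpus. It is a simplification (for example, the vocabulary size here counts distinct words only, which may differ from gensim's internal bookkeeping), and `bigram_scores` is a name of my own.

```python
from collections import Counter


def bigram_scores(sentences, min_count=1):
    """Score every adjacent token pair with a Phrases-like formula (simplified)."""
    unigrams = Counter(word for sent in sentences for word in sent)
    bigrams = Counter(pair for sent in sentences for pair in zip(sent, sent[1:]))
    vocab_size = len(unigrams)
    return {
        (a, b): (count - min_count) * vocab_size / (unigrams[a] * unigrams[b])
        for (a, b), count in bigrams.items()
    }


toy_corpus = [
    ["fidget", "spinner", "azul"],
    ["fidget", "spinner", "rojo"],
    ["spinner", "rojo"],
    ["funda", "azul"],
]

scores = bigram_scores(toy_corpus, min_count=1)
threshold = 0.5
selected = [pair for pair, score in scores.items() if score > threshold]
print(selected)
```

Pairs seen only `min_count` times score zero, so a higher `min_count` filters out rare co-occurrences, while a higher `threshold` keeps only pairs whose words appear together much more often than apart.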

### Remove stopwords

In [15]:
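def remove_stopwords(sentences, lang='english'):
    try:
        stpwrds = set(stopwords.words(lang))
    except Exception:
        # Fall back to Spanish when the requested language is not available
        stpwrds = set(stopwords.words('spanish'))

    # A set makes each membership test O(1) instead of scanning a list
    out_sentences = [[w for w in sentence if w not in stpwrds] for sentence in tqdm(sentences)]

    return out_sentences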

In [16]:
x_text = remove_stopwords(x_text, lang='spanish')

  0%|          | 0/200000 [00:00<?, ?it/s]

### Stemming

In [17]:
def text_stemming(sentences, lang='english'):
    try:
        stemmer = SnowballStemmer(lang)
    except Exception:
        stemmer = SnowballStemmer('spanish')
    
    out_text = [[stemmer.stem(i) for i in text] for text in tqdm(sentences)]

    return out_text

In [18]:
x_text = text_stemming(x_text, lang='spanish')

  0%|          | 0/200000 [00:00<?, ?it/s]

### Pad sentences
In order to feed the data into a CNN, all input sentences must have the same length.

This next step adds a padding token or truncates sentences so that all of them have the same length.

In [19]:
def pad_sentences(sentences, padding_word="<PAD/>", len_sent = None):
    """
    Pads or truncates all sentences to the same length.
    The length is `len_sent` if given, otherwise that of the longest sentence.
    Returns the padded sentences.
    """
    if len_sent is None:
        sequence_length = max(len(x) for x in sentences)
    else:
        sequence_length = len_sent
    padded_sentences = []
    for i in tqdm(range(len(sentences))):
        sentence = sentences[i]
        num_padding = sequence_length - len(sentence)
        if num_padding >= 0:
            new_sentence = sentence + [padding_word] * num_padding
        else:
            new_sentence = sentence[:sequence_length]
        padded_sentences.append(new_sentence)
    return padded_sentences

In [20]:
x_text = pad_sentences(x_text)

  0%|          | 0/200000 [00:00<?, ?it/s]

In [21]:
print(x_text[1])

['uxcell', 'detector', 'interruptor', 'sensor', 'proxim', 'induc', '<PAD/>', '<PAD/>', '<PAD/>', '<PAD/>', '<PAD/>', '<PAD/>', '<PAD/>', '<PAD/>', '<PAD/>', '<PAD/>', '<PAD/>', '<PAD/>', '<PAD/>', '<PAD/>', '<PAD/>', '<PAD/>', '<PAD/>', '<PAD/>']
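
The call above only pads, because the target length defaults to the longest sentence. When a fixed `len_sent` is passed (for instance, when new data must match the length used for training), longer sentences get truncated instead. A minimal, self-contained sketch of both cases for a single sentence follows; `pad_or_truncate` is a hypothetical helper, not a function from the original code.

```python
def pad_or_truncate(tokens, length, padding_word="<PAD/>"):
    """Return exactly `length` tokens: cut long inputs, pad short ones."""
    clipped = tokens[:length]
    return clipped + [padding_word] * (length - len(clipped))


short = ["funda", "almohadon"]
long = ["kit", "botador", "hidraul", "ford", "transit"]

print(pad_or_truncate(short, 4))  # padded up to 4 tokens
print(pad_or_truncate(long, 4))   # truncated down to 4 tokens
```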


### Word Embeddings

For this project I used Word2Vec, and this is a good point to compute the word vectors.

In [22]:
EMBEDDING_DIM = 300
model = Word2Vec(sentences = x_text, vector_size=EMBEDDING_DIM, sg=1, window=7, min_count=20, seed=42, workers=8)

We should also save the weights and vocabulary to be used later in the CNN

In [23]:
weights = model.wv.vectors
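# np.save writes the .npy format regardless of the file extension; the context manager closes the file
with open('./../data/embeddings.npz', 'wb') as f:
    np.save(f, weights)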
vocab = model.wv.key_to_index 
with open('./../data/map.json', 'w') as f:
    f.write(json.dumps(vocab))

### Build input data

Now that we have the weight vectors and the word indices, we can replace each word with its index, to be used as input to the CNN.
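
As a small illustration of this mapping, the sketch below encodes a padded sentence with a toy vocabulary. Words missing from the vocabulary fall back to the index of the padding token, which mirrors what the function below does; the vocabulary and the name `encode` are invented for the example.

```python
def encode(tokens, word2idx, fallback="<PAD/>"):
    """Map each token to its index; unknown tokens get the fallback token's index."""
    fallback_idx = word2idx[fallback]
    return [word2idx.get(token, fallback_idx) for token in tokens]


# Toy vocabulary; the real one comes from the trained Word2Vec model.
toy_vocab = {"<PAD/>": 0, "funda": 1, "almohadon": 2}

encoded = encode(["funda", "desconocida", "almohadon", "<PAD/>"], toy_vocab)
print(encoded)
```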

In [24]:
train_data['sentences_padded'] = x_text

In [25]:
def load_vocab(vocab_path='/map.json', path=''):
    """
    Load word -> index and index -> word mappings
    :param vocab_path: where the word-index map is saved
    :return: word2idx, idx2word
    """

    with open(path+vocab_path, 'r') as f:
        data = json.loads(f.read())
    word2idx = data
    idx2word = dict([(v, k) for k, v in data.items()])
    return word2idx, idx2word

In [26]:
def build_input_data(df, path=''):
    """
    Maps sentences and labels to vectors based on a vocabulary.
    """
    word2idx, idx2word = load_vocab(vocab_path='/map.json', path=path)
    pad_idx = word2idx["<PAD/>"]  # out-of-vocabulary words fall back to the padding index
    df['input_data'] = df['sentences_padded'].progress_apply(
        lambda words: np.array([word2idx.get(word, pad_idx) for word in words], dtype=np.int32))
    return df

In [27]:
train_data = build_input_data(train_data, path='./../data')

  0%|          | 0/200000 [00:00<?, ?it/s]

### Split data into train and test
Since we are using only 1% of the data, some categories won't have enough examples.

We will drop them for now, but this step won't be included in the final code.
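
The reason for dropping them is that a stratified split needs at least two examples per category, one for each side of the split. The sketch below shows the filtering idea with plain Python on made-up labels; `keep_frequent` is a hypothetical helper, and the notebook itself does the equivalent with pandas.

```python
from collections import Counter


def keep_frequent(labels, min_count=2):
    """Return the positions of the labels that occur at least `min_count` times."""
    counts = Counter(labels)
    return [i for i, label in enumerate(labels) if counts[label] >= min_count]


toy_labels = [
    "FANNY_PACKS",
    "FANNY_PACKS",
    "ALARMS_AND_SENSORS",  # appears once, so it cannot be stratified
    "BICYCLE_BAGS",
    "BICYCLE_BAGS",
]

kept = keep_frequent(toy_labels)
print(kept)
```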

In [28]:
# Clean title column is no longer needed
train_data.drop(columns=['clean_title'], inplace=True)

In [29]:
category_counts = train_data.category.value_counts()
train_data = train_data[~train_data.category.isin(category_counts[category_counts < 2].index)]

In [30]:
train_df, test_df = train_test_split(train_data, test_size=0.1, stratify=train_data['category'])

### Save the data

Save the data to be used in the models

In [31]:
# All sentences are padded to the same length, so the first one is enough
len_sent = len(train_df['sentences_padded'].iloc[0])

In [32]:
joblib.dump(len_sent, './../data/len_sent.h5')      
train_df.to_pickle('./../data/df.pkl')
test_df.to_pickle('./../data/df_test.pkl')