## Project of the course "_Development of Intelligent Computer Systems_": 1st Deliverable

The first part of this project implements all components of a training pipeline for a classification problem. In order, these components and their descriptions are as follows:

1. **Environment preparation** <br>
    Imports all packages and libraries, and instantiates the environment variables used by the next steps.

2. **Data extraction** <br>
    Loads a dataset with product data from a specified path available in the
    environment variable `DATASET_PATH`.

3. **Data preparation** <br>
    Explores the dataset and processes it for use in training and validation.

4. **Modeling the problem** <br>
    Specifies a model to handle the classification problem.

5. **Model validation** <br>
    Generates metrics on model performance (precision, recall, F1, etc.)
    for each category and exports them to a specified path available in the
    environment variable `METRICS_PATH`.

6. **Export model** <br>
    Exports a candidate model to a specified path available in the environment
    variable `MODEL_PATH`.

More specifically, we will train a model that should receive data related to products and return the best categories for them.

## Data

For this project, we will use data from an [open source dataset][1] made by [Elo7][2].
It contains data based on Elo7's search engine usage.

The columns of each example in the data are described below:

| **Field**           | **Description**                                                                              |
| ------------------- | -------------------------------------------------------------------------------------------- |
| `product_id`        | Product numeric identification                                                               |
| `seller_id`         | Seller numeric identification                                                                |
| `query`             | The text inserted by users                                                                   |
| `search_page`       | The page number on which the product appeared (min 1 and max 5)                              |
| `position`          | The position at which the product appeared on the search page (min 0 and max 38)             |
| `title`             | Product title                                                                                |
| `concatenated_tags` | Product tags inserted by the seller                                                          |
| `creation_date`     | The date of product registration in Elo7 platform                                            |
| `price`             | The product price (R$)                                                                       |
| `weight`            | The weight (grams) of a product unit                                                         |
| `express_delivery`  | Indicates if the product has already been made (1) or not (0)                                |
| `minimum_quantity`  | The minimum quantity the seller sells the product                                            |
| `view_counts`       | The number of times the product was clicked in the last three months                         |
| `order_counts`      | The number of times the product was purchased in the last three months                       |
| `category`          | Product category                                                                             |

[1]: https://github.com/elo7/data7_oss/tree/master/elo7-search
[2]: https://elo7.com.br/sobre

---

### Env. preparation

In [1]:
import os
import re
import time
import spacy
import pickle
import string
import unidecode

import numpy as np
import pandas as pd

from dotenv import load_dotenv

from sklearn.metrics import f1_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import precision_recall_fscore_support

from sklearn.model_selection import GridSearchCV 
from sklearn.model_selection import train_test_split

from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

In [2]:
_ = load_dotenv()

In [None]:
# Download the trained spaCy NLP pipeline for Portuguese.
# spaCy is an NLP library that offers tools for advanced
# natural language processing.

!python -m spacy download pt_core_news_sm

In [4]:
# Build a list of Portuguese stop words.
# The words were written by hand and are stored in the file
# stop_words.txt in this directory.
# The `with` block closes the file automatically, so no explicit close is needed.
with open('./stop_words.txt') as f:
    pt_br_stop_words = [
        word if len(word.split(' ')) == 1 else word.split(' ')[1]
        for word in f.read().split(',')
    ]
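
The comprehension above appears to handle entries written with a leading space after each comma. A minimal sketch of an alternative parser follows, assuming `stop_words.txt` is a single comma-separated list. The helper name `parse_stop_words` and the sample text are hypothetical and only illustrate the idea; the words are stripped of surrounding whitespace and stored in a set, which makes membership tests faster.

```python
def parse_stop_words(raw_text):
    """Split a comma-separated string into a set of clean stop words."""
    words = (word.strip() for word in raw_text.split(','))
    # Skip empty entries, e.g. those produced by a trailing comma.
    return {word for word in words if word}


# Hypothetical file content, used only for illustration.
sample_text = 'de, a, o, que, e\n'
stop_words = parse_stop_words(sample_text)
print(sorted(stop_words))
```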

In [5]:
nlp = spacy.load('pt_core_news_sm')

In [6]:
DATASET_PATH = os.getenv("DATASET_PATH")
METRICS_PATH = os.getenv("METRICS_PATH")
MODEL_PATH = os.getenv("MODEL_PATH")

In [7]:
print('Dataset Path: ' + DATASET_PATH + '\n' + 'Metrics Path: ' + METRICS_PATH + '\n' + 'Model Path: ' + MODEL_PATH)

Dataset Path: /usr/src/data/sample_products.csv
Metrics Path: /usr/src/data/metrics.txt
Model Path: /usr/src/data/model.pkl
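
`os.getenv` returns `None` when a variable is missing, in which case the string concatenation in the `print` call above would fail with a `TypeError`. A minimal sketch of a fail-fast check follows; the helper `require_env` is hypothetical and not part of the original notebook.

```python
import os


def require_env(name, env=os.environ):
    """Return the value of an environment variable or fail with a clear message."""
    value = env.get(name)
    if not value:
        raise RuntimeError(f'Environment variable {name} is not set')
    return value


# Demonstration with a plain dict standing in for the real environment.
fake_env = {'DATASET_PATH': '/usr/src/data/sample_products.csv'}
print(require_env('DATASET_PATH', fake_env))
```

In the notebook, the same helper could be called without the second argument for `DATASET_PATH`, `METRICS_PATH` and `MODEL_PATH`.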


---

### Data preparation

In [8]:
df1 = pd.read_csv(DATASET_PATH)

In [9]:
df1.head()

|   | product_id | seller_id | query | search_page | position | title | concatenated_tags | creation_date | price | weight | express_delivery | minimum_quantity | view_counts | order_counts | category |
| - | ---------- | --------- | ----- | ----------- | -------- | ----- | ----------------- | ------------- | ----- | ------ | ---------------- | ---------------- | ----------- | ------------ | -------- |
| 0 | 11394449 | 8324141 | espirito santo | 2 | 6 | Mandala Espírito Santo | mandala mdf | 2015-11-14 19:42:12 | 171.89 | 1200.0 | 1 | 4 | 244 |  | Decoração |
| 1 | 15534262 | 6939286 | cartao de visita | 2 | 0 | Cartão de Visita | cartao visita panfletos tag adesivos copos lon... | 2018-04-04 20:55:07 | 77.67 | 8.0 | 1 | 5 | 124 |  | Papel e Cia |
| 2 | 16153119 | 9835835 | expositor de esmaltes | 1 | 38 | Organizador expositor p/ 70 esmaltes | expositor | 2018-10-13 20:57:07 | 73.920006 | 2709.0 | 1 | 1 | 59 |  | Outros |
| 3 | 15877252 | 8071206 | medidas lencol para berco americano | 1 | 6 | Jogo de Lençol Berço Estampado | t jogo lencol menino lencol berco | 2017-02-27 13:26:03 | 118.770004 | 0.0 | 1 | 1 | 180 | 1.0 | Bebê |
| 4 | 15917108 | 7200773 | adesivo box banheiro | 3 | 38 | ADESIVO BOX DE BANHEIRO | adesivo box banheiro | 2017-05-09 13:18:38 | 191.81 | 507.0 | 1 | 6 | 34 |  | Decoração |


In [10]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38000 entries, 0 to 37999
Data columns (total 15 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   product_id         38000 non-null  int64  
 1   seller_id          38000 non-null  int64  
 2   query              38000 non-null  object 
 3   search_page        38000 non-null  int64  
 4   position           38000 non-null  int64  
 5   title              38000 non-null  object 
 6   concatenated_tags  37998 non-null  object 
 7   creation_date      38000 non-null  object 
 8   price              38000 non-null  float64
 9   weight             37942 non-null  float64
 10  express_delivery   38000 non-null  int64  
 11  minimum_quantity   38000 non-null  int64  
 12  view_counts        38000 non-null  int64  
 13  order_counts       17895 non-null  float64
 14  category           38000 non-null  object 
dtypes: float64(3), int64(7), object(5)
memory usage: 4.3+ MB


In [11]:
df1['category'].unique()

array(['Decoração', 'Papel e Cia', 'Outros', 'Bebê', 'Lembrancinhas',
       'Bijuterias e Jóias'], dtype=object)

A first look at the data, together with its description [above](#Data), suggests that
text variables should have more influence on the target variable than, for instance,
**creation_date** or **price** when classifying the category of a product. Some fields
(columns) intuitively offer little information about a product's category. For example,
**creation_date**, **price** and **weight** can be expected to vary widely within categories
as broad as these (38k products in _only_ 6 categories).

Moreover, the text fields (**query**, **title** and **concatenated_tags**) simplify data preparation,
since they are present in (almost) every example. Only two of the 38k samples have a _Null_
**concatenated_tags**, so removing these examples should not harm performance.
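
The sketch below illustrates why dropping nulls on the combined column is enough: in pandas, concatenating strings with a missing value yields a missing value, so a row lacking any of the three parts ends up with a null combined text. The toy rows are hypothetical and only mirror the three text columns.

```python
import pandas as pd

# Toy frame with one missing tag (hypothetical data).
toy = pd.DataFrame({
    'query': ['caneca', 'mandala'],
    'title': ['Caneca Branca', 'Mandala MDF'],
    'concatenated_tags': ['caneca porcelana', None],
})

# String concatenation with a missing value yields a missing value.
toy['seq'] = toy['query'] + ' ' + toy['title'] + ' ' + toy['concatenated_tags']

# Dropping on the combined column removes rows missing any of the parts.
clean = toy.dropna(subset=['seq']).reset_index(drop=True)
print(len(toy), len(clean))
```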

In [12]:
df2 = df1.copy()

In [13]:
df2['seq'] = df2['query'] + ' ' + df2['title'] + ' ' + df2['concatenated_tags']
seq_column = df2.pop('seq')
df2.insert(0, 'seq', seq_column)

In [14]:
df2 = df2[['seq', 'category']]

In [15]:
df2.head()

|   | seq | category |
| - | --- | -------- |
| 0 | espirito santo Mandala Espírito Santo mandala mdf | Decoração |
| 1 | cartao de visita Cartão de Visita cartao visit... | Papel e Cia |
| 2 | expositor de esmaltes Organizador expositor p/... | Outros |
| 3 | medidas lencol para berco americano Jogo de Le... | Bebê |
| 4 | adesivo box banheiro ADESIVO BOX DE BANHEIRO a... | Decoração |


In [16]:
df2.dropna(subset=['seq'], inplace=True)
df2.reset_index(inplace=True, drop=True)

Preprocessing is essential before feeding text data to any model. The preprocessing
steps used to prepare the text data are described below:

* **Disable Case Sensitivity** <br>
Words at the beginning of sentences, or simple writing errors, cause the same word to
appear with both uppercase and lowercase letters.
Converting the whole text to a single case (lowercase here) avoids this redundancy.

* **Stopwords Removal** <br>
Stopwords are words that carry little meaning of their own and mostly serve
to connect words and clauses within a sentence. Prepositions, conjunctions and
articles are some of the stopword categories. Since these words appear in almost
every sentence, they add no information about the category of a product.
Some examples of Portuguese stopwords:

> [de, a, o, que, e, do, da, em, um, para, com, uma, os, no, se, na, por, mais, as, dos, como, mas, ao, ele, das, à, seu, sua]

* **Numbers Removal** <br>
Our vector representation of the feature space assigns each word its own
vector dimension. Isolated numbers add little information to the text,
and often not even when they appear alongside the text's words.
Numbers present in the text are therefore removed, because statistically the
words are more representative of the product's category than the
numbers by themselves.

* **Punctuation Marks Removal** <br>
Punctuation marks, like numbers, carry little semantic information
and can be discarded during the preprocessing step.

* **Accent Marks Removal** <br>
Especially in Portuguese, accent marks are a problem in informal text, because
native speakers do not use them consistently. So, for the same reason as the
case normalization, the same word with or without accents is treated as a
single word.

* **Duplicate Removal** <br>
The query, title and tags of a product frequently repeat the same words. The code
below keeps each word once per example by converting the word list to a set, which
also discards the original word order.

* **Lemmatization** <br>
Lemmatization is a well-known linguistic process in which a word is reduced to its
base form (lemma). This mapping is done through a dictionary built from morphological
analysis of words. The spaCy library provides tools and dictionaries for lemmatization.
In the code below, only tokens tagged as verbs are replaced by their lemma.

* **Single Character Word Removal** <br>
Since the textual data mostly comes from (informal) text written by people on the web,
it is expected to contain errors and typos, as well as abbreviations. Removing
single-character words therefore helps reduce feature dimensionality and may improve the model's performance.
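
The steps above, except lemmatization and duplicate removal, can be sketched on a single string using only the standard library. This is an illustration, not the notebook's implementation: the function name `preprocess` and the sample stop-word set are hypothetical, and accent removal uses `unicodedata` here instead of the `unidecode` package.

```python
import re
import string
import unicodedata


def preprocess(text, stop_words):
    """Apply the non-lemmatization steps above to a single string."""
    # Lowercase, then drop stop words.
    words = [w for w in text.lower().split() if w not in stop_words]
    cleaned = []
    for word in words:
        word = re.sub(r'\d+', '', word)  # remove digits
        word = word.translate(str.maketrans('', '', string.punctuation))
        # Decompose accented characters and drop the combining marks.
        word = ''.join(
            ch for ch in unicodedata.normalize('NFKD', word)
            if not unicodedata.combining(ch)
        )
        if len(word) > 1:  # drop empty and single-character leftovers
            cleaned.append(word)
    return ' '.join(cleaned)


print(preprocess('Cartão de Visita p/ 70 esmaltes', {'de', 'para'}))
```

Working on one string at a time keeps the sketch easy to test; the notebook applies the same kind of transformation column-wise through `DataFrame.apply`.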


In [17]:
def processText(df_column, stop_words=pt_br_stop_words, lemma_dict=nlp):

    # Disable case sensitivity
    df_column = df_column.apply(
        lambda seq: ' '.join([word.lower() for word in seq.split(' ')])
    )

    # Remove stop words (uses the `stop_words` argument instead of the global list)
    df_column = df_column.apply(
        lambda seq: ' '.join([word for word in seq.split(' ') if word not in stop_words])
    )

    # Remove numbers
    df_column = df_column.apply(
        lambda seq: ' '.join([re.sub(r'\d+', '', word) for word in seq.split(' ')])
    )

    # Remove punctuation marks
    df_column = df_column.apply(
        lambda seq: ' '.join([
            word.translate(
                str.maketrans('','', string.punctuation)) for word in seq.split(' ')
        ])
    )

    # Remove accent marks
    df_column = df_column.apply(
        lambda seq: ' '.join([unidecode.unidecode(word) for word in seq.split(' ')])
    )

    # Remove duplicates
    df_column = df_column.apply(
        lambda seq: ' '.join(list(set(seq.split(' '))))
    )

    # Lemmatization
    df_column = df_column.apply(
        lambda seq: ' '.join([
            word.lemma_ if word.pos_ == 'VERB' else str(word) for word in lemma_dict(seq) 
        ])
    )

    # Remove single char words
    df_column = df_column.apply(
        lambda seq: ' '.join([
            word for word in seq.split(' ') if len(word) > 1
        ])
    )

    return df_column

In [18]:
df2['seq'] = processText(df2['seq'])
df2.rename(columns={'seq': 'seq_process'}, inplace=True)

In [19]:
df2.head()

|   | seq_process | category |
| - | ----------- | -------- |
| 0 | santo mdf mandala espirito | Decoração |
| 1 | canecas panfletos adesivos tag drink cartao vi... | Papel e Cia |
| 2 | expositor esmaltes organizador | Outros |
| 3 | americano jogo berco estampar menino medidas l... | Bebê |
| 4 | adesivo banheiro box | Decoração |


In [20]:
# Encoding categorical variables
df3 = df2.copy()

le = LabelEncoder()
df3['category'] = le.fit_transform(df3['category'])

In [21]:
sorted_cats = df3['category'].unique()
sorted_cats.sort()
le.inverse_transform(sorted_cats)

array(['Bebê', 'Bijuterias e Jóias', 'Decoração', 'Lembrancinhas',
       'Outros', 'Papel e Cia'], dtype=object)
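
As a small illustration of how the integer codes relate to the category names, the sketch below fits a `LabelEncoder` on a few labels taken from the dataset. The encoder stores the sorted unique labels in `classes_`, and each code is an index into that array.

```python
from sklearn.preprocessing import LabelEncoder

labels = ['Decoração', 'Bebê', 'Outros', 'Bebê']

encoder = LabelEncoder()
codes = encoder.fit_transform(labels)

# classes_ holds the sorted unique labels; each code is an index into it.
print(list(encoder.classes_))
print(list(codes))

# inverse_transform maps the codes back to the original names.
print(list(encoder.inverse_transform(codes)))
```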

In [22]:
df3.head()

|   | seq_process | category |
| - | ----------- | -------- |
| 0 | santo mdf mandala espirito | 2 |
| 1 | canecas panfletos adesivos tag drink cartao vi... | 5 |
| 2 | expositor esmaltes organizador | 4 |
| 3 | americano jogo berco estampar menino medidas l... | 0 |
| 4 | adesivo banheiro box | 2 |


In [23]:
df3.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37998 entries, 0 to 37997
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   seq_process  37998 non-null  object
 1   category     37998 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 593.8+ KB


In [24]:
# Absolute class count
val_count = df3['category'].value_counts()

val_count.sort_index()

0     6930
1      940
2     8722
3    17524
4     1132
5     2750
Name: category, dtype: int64

In [25]:
# Percentage class count
sum_val_count = val_count.values.sum()

(100*val_count / sum_val_count).sort_index()

0    18.237802
1     2.473814
2    22.953840
3    46.118217
4     2.979104
5     7.237223
Name: category, dtype: float64

---

### Train, validate, and export model

In [26]:
(X_train, X_test, 
 y_train, y_test) = train_test_split(df3['seq_process'], df3['category'], 
                                     test_size=.35, random_state=42)

In [27]:
print((X_train.shape, y_train.shape), (X_test.shape, y_test.shape))

((24698,), (24698,)) ((13300,), (13300,))


Since our independent variable is text, we need to transform it
into a numerical vector. There are several feature extraction
methods for this: _bag of words_ (BoW), _term frequency-inverse document frequency_
(TF-IDF), _word2vec_ and _GloVe_, to name a few.

However, some of these methods (the embedding-based ones) require models
pre-trained on the same language as the target one. Those methods are
therefore discarded; _word2vec_ and _GloVe_, cited above, fall into
that category.

Among the remaining ones, TF-IDF gives better word representations, because
it takes into account the inverse frequency of each word across all examples. This matters
because some words recur throughout the dataset while others are rare. Briefly,
words that are common across the examples receive low weight and have little effect on
the prediction. Rare words, on the other hand, which would have less prominence in
methods like BoW, receive greater weight in the model's classification.
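
To make the weighting concrete, the sketch below computes the inverse document frequency by hand for a toy corpus. It uses the smoothed formula that scikit-learn documents as the default for `TfidfVectorizer`, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`. The corpus and the helper `smoothed_idf` are hypothetical, and the sketch leaves out the term-frequency factor and the row normalization that `TfidfVectorizer` also applies.

```python
import math

# Toy corpus (hypothetical), already preprocessed into space-separated tokens.
docs = [
    'caneca personalizada festa',
    'caneca porcelana',
    'caneca mandala mdf',
]


def smoothed_idf(term, documents):
    """Smoothed IDF: ln((1 + n) / (1 + df)) + 1."""
    n_docs = len(documents)
    # df: number of documents that contain the term at least once.
    doc_freq = sum(term in doc.split() for doc in documents)
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1


# 'caneca' occurs in every document, 'mandala' in only one.
for term in ('caneca', 'mandala'):
    print(term, round(smoothed_idf(term, docs), 3))
```

The word that appears in every document gets the minimum weight of 1, while the word that appears in a single document gets a higher one.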

In [28]:
# Tf-idf transform
vectorizer = TfidfVectorizer()

tfidf_X_train = vectorizer.fit_transform(X_train)
tfidf_X_test = vectorizer.transform(X_test)

In the ML zoo, ensemble methods stand out across different tasks. Thus, a
Gradient Boosting Classifier is picked _a priori_. Its hyperparameters
were set by trial and error, since a cross-validated grid search over a tree-based
ensemble method takes too long.
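
One possible way to keep a search tractable is to run a deliberately small grid on a subsample of the training data. The sketch below shows the mechanics with `GridSearchCV` on a tiny hypothetical dataset; the texts, labels and grid values are illustrative assumptions, not tuned settings or results from this project.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Tiny hypothetical sample standing in for a subsample of the training data.
texts = ['caneca festa', 'caneca porcelana', 'lembranca festa', 'caneca branca',
         'berco lencol', 'lencol bebe', 'berco bebe', 'manta bebe']
labels = [0, 0, 0, 0, 1, 1, 1, 1]

# Vectorizer and classifier in one pipeline, so TF-IDF is refit on each fold.
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('gbc', GradientBoostingClassifier(random_state=0)),
])

# Deliberately small grid; the values are illustrative, not tuned.
param_grid = {
    'gbc__n_estimators': [10, 20],
    'gbc__max_depth': [2, 3],
}

search = GridSearchCV(pipeline, param_grid, cv=2, scoring='f1_macro')
search.fit(texts, labels)
print(search.best_params_)
```

Putting the vectorizer inside the pipeline means each fold learns its vocabulary only from its own training split.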

In [29]:
# Training model
start_time = time.time()

clf = GradientBoostingClassifier(
        n_estimators=1000, learning_rate=0.1,                             
        max_depth=3, random_state=0
    ).fit(tfidf_X_train, y_train)

end_time = time.time()

print(r'Execution time (s): {end_time:.2f}'.format(end_time=end_time-start_time))

Execution time (s): 307.04


In [30]:
print(r'Accuracy (Training data): {perc:.2f}%'.format(perc=100*clf.score(tfidf_X_train, y_train)))

Accuracy (Training data): 97.17%


In [31]:
# Model validation
print(r'Accuracy (Test data): {perc:.2f}%'.format(perc=100*clf.score(tfidf_X_test, y_test)))

Accuracy (Test data): 88.26%


In [32]:
# Model validation
print(classification_report(y_test, clf.predict(tfidf_X_test)))

              precision    recall  f1-score   support

           0       0.90      0.83      0.86      2426
           1       0.91      0.92      0.92       331
           2       0.89      0.89      0.89      3032
           3       0.88      0.94      0.91      6164
           4       0.81      0.63      0.71       385
           5       0.82      0.69      0.75       962

    accuracy                           0.88     13300
   macro avg       0.87      0.82      0.84     13300
weighted avg       0.88      0.88      0.88     13300



In [33]:
# Model validation
confusion_matrix(y_test, clf.predict(tfidf_X_test))

array([[2017,    1,  121,  265,    2,   20],
       [   0,  305,    2,   21,    2,    1],
       [  66,   12, 2704,  191,   28,   31],
       [ 135,    5,  122, 5802,   17,   83],
       [   9,    5,   30,   88,  243,   10],
       [  14,    6,   49,  218,    7,  668]])
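
The per-class figures in the classification report can be recomputed from this matrix, where rows are true classes and columns are predicted classes. The sketch below copies the matrix from the output above and derives precision, recall and accuracy with plain Python.

```python
# Confusion matrix copied from the output above
# (rows: true class, columns: predicted class).
cm = [
    [2017, 1, 121, 265, 2, 20],
    [0, 305, 2, 21, 2, 1],
    [66, 12, 2704, 191, 28, 31],
    [135, 5, 122, 5802, 17, 83],
    [9, 5, 30, 88, 243, 10],
    [14, 6, 49, 218, 7, 668],
]

n_classes = len(cm)
total = sum(sum(row) for row in cm)

# Accuracy: correct predictions (the diagonal) over all test examples.
accuracy = sum(cm[i][i] for i in range(n_classes)) / total

for i in range(n_classes):
    recall = cm[i][i] / sum(cm[i])                    # diagonal over row sum
    precision = cm[i][i] / sum(row[i] for row in cm)  # diagonal over column sum
    print(i, round(precision, 2), round(recall, 2))

print(round(100 * accuracy, 2))
```

Reading along row 4 (`Outros`), 88 of its 385 test examples are predicted as class 3 (`Lembrancinhas`), which accounts for most of that class's low recall.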

In [34]:
# Recording classification metrics (precision, recall, f1 score, accuracy)
# The `with` block closes the file automatically.
with open(METRICS_PATH, 'w+') as f:
    f.write(classification_report(y_test, clf.predict(tfidf_X_test)))

In [35]:
# Exporting model
# The `with` block closes the file automatically.
with open(MODEL_PATH, 'wb') as f:
    pickle.dump(clf, f)

In [36]:
# Loading the exported model (for testing purposes only!)
# with open(MODEL_PATH, 'rb') as f:
#     clf = pickle.load(f)
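
The exported file holds only the classifier, so whoever loads it also needs the fitted `TfidfVectorizer` to turn new text into features (and the `LabelEncoder` to recover category names). One possible alternative, sketched below under that assumption, is to bundle the vectorizer and classifier in a scikit-learn `Pipeline` and pickle the pipeline as a single object. The toy data and the small `n_estimators` value are hypothetical, and the round trip happens in memory instead of writing to `MODEL_PATH`.

```python
import pickle

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Hypothetical toy data; the notebook would use X_train and y_train instead.
texts = ['caneca festa', 'caneca porcelana', 'berco lencol', 'lencol bebe']
labels = [0, 0, 1, 1]

model = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('gbc', GradientBoostingClassifier(n_estimators=10, random_state=0)),
])
model.fit(texts, labels)

# Round trip through pickle in memory instead of writing to MODEL_PATH.
restored = pickle.loads(pickle.dumps(model))

# The restored pipeline accepts raw (preprocessed) text directly.
print(restored.predict(['caneca branca']))
```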

---