
MercadoLibre Data Challenge 2020

This is a repository with my solution to MercadoLibre's 2020 Data Challenge.

The challenge

Build a Machine Learning model to predict next purchase based on the user’s navigation history.

Mercado Libre hosts millions of product and service listings. Users navigate through them on countless occasions, looking for what they want to buy or just to get ideas on a new purchase.

Understanding better what they might be interested in is critical to our business, so we can help them find what they’re looking for by providing customized recommendations based on the listings that are most suitable for each user's needs and preferences.

Given a week of a user’s navigation records, we challenge you to predict (aka recommend) the ten most likely items of the next purchase.

My take

The following is a somewhat high-level description of my approach to solving this problem. You can find more insights by diving into the notebooks folder. I tried my best to keep them as organized and as informative as possible.

Feature engineering

The original datasets are essentially compressed JSON Lines files. Each row is structured as follows:

Raw dataset

In order to properly work with this data, the first step is to extract the most relevant information available from each user history. That way it can be further enhanced using pandas.
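As a rough sketch of that first step, the compressed JSON Lines file can be loaded and flattened with pandas. The file path and column names below are assumptions for illustration, not the exact ones from the challenge:

```python
import pandas as pd

# Hypothetical file name; the raw files are gzipped JSON Lines where each
# row holds a user_history list (and, in the train set, the purchased item).
df = pd.read_json("data/raw/train_dataset.jl.gz", lines=True, compression="gzip")

# Explode the nested navigation events into one row per event, so the usual
# pandas groupby/aggregation machinery can be used for feature engineering.
events = df.explode("user_history").reset_index().rename(columns={"index": "user_id"})
```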

The main features extracted from each user_history list were:

  • The two most and the two last viewed items
  • The two most and the two last searched terms/bi-grams

Getting the last/most viewed items is relatively straightforward.
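For instance, assuming each user_history entry is a dict with event_type and event_info keys (an assumption about the raw format, based on the row structure shown above), the viewed-item features can be computed with a simple counter:

```python
from collections import Counter

def viewed_item_features(user_history):
    """Return the two last and the two most viewed item ids from one user_history."""
    views = [ev["event_info"] for ev in user_history if ev["event_type"] == "view"]
    last_two = views[-2:][::-1]                                   # most recent first
    most_two = [item for item, _ in Counter(views).most_common(2)]
    return last_two, most_two

viewed_item_features([
    {"event_type": "view", "event_info": 111},
    {"event_type": "search", "event_info": "iphone 11"},
    {"event_type": "view", "event_info": 222},
    {"event_type": "view", "event_info": 222},
])
# -> ([222, 222], [222, 111])
```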

Dealing with the "searched terms" portion of the dataset, on the other hand, was a bit more work. I relied on some string manipulation and common preprocessing NLP techniques with the help of nltk, such as tokenization and stopword removal.
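A minimal sketch of that preprocessing with nltk follows; the stopword languages and the bi-gram construction are my assumptions about how this could look (the listings are mostly in Spanish and Portuguese):

```python
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)

# Spanish and Portuguese stopword lists, since listings come from both markets.
STOP = set(stopwords.words("spanish")) | set(stopwords.words("portuguese"))

def clean_search(query):
    """Lower-case, tokenize, drop stopwords/punctuation, and build bi-grams."""
    tokens = [t for t in word_tokenize(query.lower()) if t.isalnum() and t not in STOP]
    bigrams = [" ".join(b) for b in nltk.bigrams(tokens)]
    return tokens, bigrams

clean_search("funda para el iphone 11")
# -> (['funda', 'iphone', '11'], ['funda iphone', 'iphone 11'])
```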

Next, I decided to focus on recovering the domains for each viewed item and searched term. Each item belongs to at most one domain, such as MLB_CELLPHONES.

A simple join to the items dataset was enough to get the domain each most/last viewed item belonged to. The hard part was coming up with a way to find out the domain related to the search terms. This is certainly the most costly part of my approach.
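The join itself is a plain pandas merge. A self-contained sketch with toy data is shown below; the column names item_id, domain_id and last_viewed_item are assumptions for illustration:

```python
import pandas as pd

# Toy stand-ins: `items` mimics the raw item metadata and `user_features`
# the per-user_history frame built in the previous step.
items = pd.DataFrame({"item_id": [1, 2], "domain_id": ["MLB_CELLPHONES", "MLM_NOTEBOOKS"]})
user_features = pd.DataFrame({"last_viewed_item": [2, 1]})

user_features = (
    user_features.merge(items, left_on="last_viewed_item", right_on="item_id", how="left")
    .drop(columns="item_id")
    .rename(columns={"domain_id": "last_viewed_domain"})
)
```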

First, a "master title" for each domain is built. This feature is composed of the ten most common words found in the item titles from each respective domain. Then, I generate a vector for each of these "master titles" using a distilled RoBERTa multilingual pretrained model from UKPLab's sentence-transformers.

Next, I transform the most/last searched terms/bi-grams using this same embedder. Finally, I can determine which domain each term is closest to using PyNNDescent (a blazingly fast Nearest Neighbour Descent implementation) with the cosine metric.
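A sketch of that matching step is below. The checkpoint name is an assumption (any multilingual sentence-transformers model follows the same API), and the toy domain/title lists stand in for the real ones:

```python
from sentence_transformers import SentenceTransformer
from pynndescent import NNDescent

# Assumed checkpoint name; the project uses a distilled multilingual RoBERTa
# model from UKPLab's sentence-transformers, which exposes this same API.
embedder = SentenceTransformer("paraphrase-xlm-r-multilingual-v1")

domains = ["MLB_CELLPHONES", "MLM_NOTEBOOKS", "MLM_GRAPHICS-CARD"]
master_titles = ["celular smartphone liberado ...",
                 "notebook laptop core i5 ...",
                 "placa de video gamer ..."]

domain_vecs = embedder.encode(master_titles)            # (n_domains, dim)
# n_neighbors is tiny here only because this toy example has three domains.
index = NNDescent(domain_vecs, metric="cosine", n_neighbors=2)

term_vecs = embedder.encode(["iphone 11 usado", "placa rtx 2060"])
neighbors, _ = index.query(term_vecs, k=1)              # closest domain per term
term_domains = [domains[i] for i in neighbors[:, 0]]
```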

After gathering the domains from all these different key fields, they are used as features to estimate the bought item's domain. I use sklearn's Random Forest classifier for this task.
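A minimal sketch of that classifier, assuming the categorical domain columns are ordinally encoded before fitting (the encoding choice, column names and hyperparameters here are illustrative, not the project's exact settings):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OrdinalEncoder

# Toy frame: one row per user_history, columns are the extracted domain features.
X = pd.DataFrame({
    "last_viewed_domain":   ["MLB_CELLPHONES", "MLM_NOTEBOOKS"],
    "most_searched_domain": ["MLB_CELLPHONES", "MLM_GRAPHICS-CARD"],
})
y = ["MLB_CELLPHONES", "MLM_NOTEBOOKS"]   # domain of the item actually bought

encoder = OrdinalEncoder()
clf = RandomForestClassifier(n_estimators=200, random_state=42)
clf.fit(encoder.fit_transform(X), y)

predicted_domain = clf.predict(encoder.transform(X))
```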

In the end I have up to 8 different domains extracted from the original dataset for each user_history, plus a predicted domain from the Random Forest. After some experimentation, the best way to decide which domain is most likely to be the bought item's domain for each user_history turned out to be a simple voting rule.

Say we have the following (simplified) scenario for a given user_history:

  • the last viewed item's domain is MLB_CELLPHONES
  • the most viewed item's domain is MLM_NOTEBOOKS
  • the last searched term's domain is MLM_GRAPHICS-CARD
  • the Random Forest predicted domain is MLB_CELLPHONES.

In this case, we pick MLB_CELLPHONES as this user's predicted domain.
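That voting rule boils down to a majority count over the available domain features; a small sketch (tie-breaking by first occurrence is a simplification on my part):

```python
from collections import Counter

def vote_domain(domains):
    """Pick the domain that appears most often across the extracted features."""
    votes = Counter(d for d in domains if d is not None)
    return votes.most_common(1)[0][0] if votes else None

vote_domain([
    "MLB_CELLPHONES",      # last viewed item's domain
    "MLM_NOTEBOOKS",       # most viewed item's domain
    "MLM_GRAPHICS-CARD",   # last searched term's domain
    "MLB_CELLPHONES",      # Random Forest prediction
])
# -> 'MLB_CELLPHONES'
```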

Finally, to wrap things up, the recommendations.

A heuristic similar to the ones proposed during the challenge's workshop is used. The 10-item recommendation list for a given user is filled with the item ids from their last and most viewed items (which amount to 4), and the remaining 6 slots are filled with the most bought items from the user's predicted domain.

Going back to the example above, that user would receive as part of their recommendation list the top 6 most bought cellphones.
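A sketch of that fill-up heuristic; the function and variable names are hypothetical, and the duplicate check is my own simplification:

```python
def recommend(last_viewed, most_viewed, top_sellers_by_domain, domain, k=10):
    """Start from the user's own last/most viewed items, then pad the list
    with the best-selling items from the predicted domain, skipping duplicates."""
    recs = []
    for item in list(last_viewed) + list(most_viewed) + top_sellers_by_domain.get(domain, []):
        if item not in recs:
            recs.append(item)
        if len(recs) == k:
            break
    return recs

recommend(
    last_viewed=[111, 222],
    most_viewed=[222, 333],
    top_sellers_by_domain={"MLB_CELLPHONES": [10, 11, 12, 13, 14, 15, 16]},
    domain="MLB_CELLPHONES",
)
# -> [111, 222, 333, 10, 11, 12, 13, 14, 15, 16]
```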

Results

My strategy achieved an NDCG score of 0.23722, leaving me at the 55th position in the public ranking.

Final results

Usage

I wanted to make sure this project would be reasonably reproducible. That's why I went with the Cookiecutter Data Science project structure.

Once inside the project's root folder, all you have to do to reproduce my results is:

make data
make predict

These commands take care of:

  • setting up the python environment necessary to run this project
  • executing all the necessary data manipulation and feature engineering
  • generating the predictions for the given test dataset

Most of the heavy lifting is carried out during the feature engineering phase. This part is expected to take the longest. The prediction is based on fairly lightweight heuristics and should be finished in no time once all the features are available.

Logging is used throughout the code so you can keep an eye on how long each step is going to take.

Logging

Environment

This project was developed in a Linux environment using Python version 3.8.0.

A computer with at least 16 GB of RAM is recommended.

Project Organization

├── LICENSE
├── Makefile           <- Makefile with commands like `make data` or `make train`
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── docs               <- A default Sphinx project; see sphinx-doc.org for details
│
├── models             <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         the creator's initials, and a short `-` delimited description, e.g.
│                         `1.0-jqp-initial-data-exploration`.
│
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
│
├── setup.py           <- makes project pip installable (pip install -e .) so src can be imported
├── src                <- Source code for use in this project.
│   ├── __init__.py    <- Makes src a Python module
│   │
│   ├── data           <- Scripts to download or generate data
│   │   └── make_dataset.py
│   │
│   ├── features       <- Scripts to turn raw data into features for modeling
│   │   └── build_features.py
│   │
│   ├── models         <- Scripts to train models and then use trained models to make
│   │   │                 predictions
│   │   ├── predict_model.py
│   │   └── train_model.py
│   │
│   └── visualization  <- Scripts to create exploratory and results oriented visualizations
│       └── visualize.py
│
└── tox.ini            <- tox file with settings for running tox; see tox.readthedocs.io

Project based on the cookiecutter data science project template. #cookiecutterdatascience
