
Train Gensim's Word2Vec algorithm to learn item embeddings for downstream tasks such as recommendation systems. Includes techniques for hyperparameter optimization and early stopping.


Learning item embeddings with Gensim's Word2Vec

Gensim is one of the fastest libraries for training vector embeddings and provides the best-known and most widely used implementation of the Word2Vec algorithm. While word embeddings are the most common use case, they're certainly not the only one! Vector embeddings can be learned for just about anything: hotel listings, user profiles, products on an e-commerce website, and more. These embeddings can then serve as features for downstream tasks like classification, clustering, and recommendation.

The notebook in this repo is developed against Python 3.8 and demonstrates how to train Gensim's Word2Vec algorithm on non-traditional data (i.e., not human language) while identifying some conceptual and technical challenges, including:

  • how to structure non-language data for Word2Vec consumption
  • hyperparameter tuning with the Ray Tune library
  • how to use callbacks for early stopping (thus speeding up hyperparameter optimization)

We also demonstrate two simple methods for evaluating the learned embeddings:

  • qualitative comparison of similar embeddings
  • quantitative analysis of their performance in a simple recommendation system

Data

We make use of the Online Retail dataset, which consists of customer purchase orders through an e-commerce boutique over the course of a year. The data is open source with citation to the original authors. We obtained the data from the UCI Machine Learning Repository.

Chen, D., et al. (2012). Data mining for the online retail industry: A case study of RFM model-based customer segmentation using data mining. Journal of Database Marketing & Customer Strategy Management, 19(3), 197-208. [LINK]

Related Work

Some of the code in the Jupyter Notebook was originally developed for the Cloudera Fast Forward Labs report on Session-based Recommendation Systems. The original repo can be found here and includes scripts that perform the tasks we demonstrate in this notebook.

Cloudera Fast Forward also published a blog post that explores the why behind the early stopping mechanism used in this repo.

Deploying on Cloudera Machine Learning (CML)

There are three ways to launch this notebook on CML:

  • From Prototype Catalog - Navigate to the Prototype Catalog in a CML workspace, select the "Train Embeddings with Gensim" tile, click "Launch as Project", then click "Configure Project".
  • As an ML Prototype - In a CML workspace, click "New Project", add a Project Name, select "ML Prototype" as the Initial Setup option, paste in the repo URL, click "Create Project", then click "Configure Project".
  • Manual Setup - In a CML workspace, click "New Project", add a Project Name, select "Git" as the Initial Setup option, paste in the repo URL, then click "Create Project".

Once the project has been initialized in a CML workspace, run the notebook by starting a Python 3.8 JupyterLab session with at least 2 vCPU / 4 GiB of memory.
