📙 NLP and Data Viz Pipeline with GloVe, HDBSCAN, and t-SNE

This project uses NLP and unsupervised learning to visualize the text of the canonical machine learning book The Elements of Statistical Learning.

Click here to explore the data yourself.

The pipeline represents the text of the book with GloVe embeddings, clusters it with HDBSCAN, and visualizes it with t-SNE.

Pipeline steps:

1. Make an HTTP request to obtain the PDF
2. Convert the single PDF file to an array of PNG files
3. Use OCR to convert each image to text
4. Apply a rule-based pipeline to extract n-grams: a run of consecutive tokens is kept as one n-gram, with no fixed upper limit on n, as long as every token in it meets the rules
5. Map tokens to GloVe embeddings (averaging the token vectors where the n-gram has n > 1)
6. Normalize the vector embeddings
7. Cluster using HDBSCAN
8. Reduce dimensionality with PCA from (300,) --> (50,) for computational efficiency in the subsequent t-SNE step
9. Reduce dimensionality further with t-SNE from (50,) --> (3,)
10. Plot the vectors
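
The n-gram extraction, embedding, and normalization steps above can be sketched roughly as follows. This is a minimal illustration and not the code in `src`: the function names, the `keep` rule (alphabetic and not a stopword), the choice of L2 normalization, and the 3-dimensional toy vectors standing in for 300-dimensional GloVe vectors are all assumptions made for the example.

```python
import math
from typing import Callable, Dict, List, Optional, Sequence

Vector = List[float]


def extract_ngrams(
    tokens: Sequence[str], keep: Callable[[str], bool]
) -> List[List[str]]:
    """Return maximal runs of consecutive tokens that all satisfy `keep`."""
    ngrams: List[List[str]] = []
    run: List[str] = []
    for tok in tokens:
        if keep(tok):
            run.append(tok)
        elif run:
            # A rejected token closes the current n-gram.
            ngrams.append(run)
            run = []
    if run:
        ngrams.append(run)
    return ngrams


def embed_ngram(
    ngram: Sequence[str], vectors: Dict[str, Vector]
) -> Optional[Vector]:
    """Average the vectors of the tokens in `ngram`, skipping unknown tokens."""
    found = [vectors[tok] for tok in ngram if tok in vectors]
    if not found:
        return None  # no token of the n-gram is in the vocabulary
    dim = len(found[0])
    return [sum(vec[i] for vec in found) / len(found) for i in range(dim)]


def l2_normalize(vec: Vector) -> Vector:
    """Scale `vec` to unit length (assumed here to be the normalization used)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


# Hypothetical rule: keep alphabetic tokens that are not stopwords.
STOPWORDS = {"the", "is", "a"}


def keep(tok: str) -> bool:
    return tok.isalpha() and tok not in STOPWORDS


# Toy 3-dimensional vectors in place of real GloVe embeddings.
toy_vectors: Dict[str, Vector] = {
    "support": [1.0, 0.0, 0.0],
    "vector": [0.0, 1.0, 0.0],
    "machine": [0.0, 0.0, 1.0],
    "linear": [3.0, 4.0, 0.0],
}

tokens = "the support vector machine is a linear classifier".split()
ngrams = extract_ngrams(tokens, keep)
embeddings = [
    l2_normalize(vec)
    for vec in (embed_ngram(ng, toy_vectors) for ng in ngrams)
    if vec is not None
]
```

Here `ngrams` holds two runs, `support vector machine` and `linear classifier`, and each resulting embedding has unit length. Out-of-vocabulary tokens such as `classifier` are simply skipped when averaging, which is one possible design choice among several.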
