📙 End-to-end NLP and data visualization pipeline of the text from a machine learning textbook.

connor-mccarthy/nlp-visualization-of-statistical-learning-book

📙 NLP and Data Viz Pipeline with GloVe, HDBSCAN, and t-SNE

Python 3.7.10 Code style: black

This project uses NLP and unsupervised learning to visualize the text of the canonical machine learning book The Elements of Statistical Learning.

Click here to explore the data yourself.

The pipeline represents the text of the book with GloVe embeddings, clusters it with HDBSCAN, and visualizes it with t-SNE.

Pipeline steps:

  1. Make HTTP request to obtain PDF
  2. Convert single PDF file to array of PNG files
  3. Use OCR to convert image to text
  4. Apply rule-based pipeline to extract n-grams; the length n is theoretically unlimited, provided every token in the n-gram meets the rules
  5. Map tokens to GloVe embeddings (averaging where n-gram has n > 1)
  6. Normalize vector embeddings
  7. Cluster using HDBSCAN
  8. Reduce dimensionality with PCA from dimensions (300,) --> (50,) for computational efficiency in subsequent t-SNE step
  9. Reduce dimensionality further with t-SNE from dimensions (50,) --> (3,)
  10. Plot vectors
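
Steps 5 and 6 can be illustrated with a minimal, dependency-free sketch. It is not the repository's code: the function names `embed_ngram` and `l2_normalize` and the toy 3-dimensional vectors are hypothetical stand-ins for the 300-dimensional GloVe vectors, and L2 (unit-length) normalization is an assumption, since the steps above do not say which normalization is used.

```python
import math


def embed_ngram(tokens, vectors):
    """Average the vectors of all tokens in one n-gram (step 5)."""
    rows = [vectors[token] for token in tokens]
    dim = len(rows[0])
    return [sum(row[i] for row in rows) / len(rows) for i in range(dim)]


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (assumed reading of step 6)."""
    norm = math.sqrt(sum(x * x for x in vec))
    # Leave an all-zero vector unchanged to avoid dividing by zero.
    return [x / norm for x in vec] if norm > 0 else list(vec)


# Toy stand-ins for GloVe vectors (hypothetical values, 3 dimensions).
toy_vectors = {
    "gradient": [1.0, 0.0, 2.0],
    "descent": [3.0, 0.0, 2.0],
}

bigram = embed_ngram(["gradient", "descent"], toy_vectors)
unit = l2_normalize(bigram)
```

A unigram passes through `embed_ngram` unchanged, because the average of a single vector is the vector itself, so the same function covers both the n = 1 and n > 1 cases.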