This project uses NLP and unsupervised learning to visualize the text of the canonical machine learning book The Elements of Statistical Learning.
Click here to explore the data yourself.
The pipeline represents the text of the book with GloVe embeddings, clusters it with HDBSCAN, and visualizes it with t-SNE.
- Make HTTP request to obtain PDF
- Convert single PDF file to array of PNG files
- Use OCR to convert each image to text
- Apply a rule-based pipeline to extract n-grams; n is theoretically unbounded, as long as every token in the n-gram satisfies the rules
- Map tokens to GloVe embeddings (averaging where n-gram has n > 1)
- Normalize vector embeddings
- Cluster using HDBSCAN
- Reduce dimensionality with PCA from shape (300,) to (50,) for computational efficiency in the subsequent t-SNE step
- Reduce dimensionality further with t-SNE from shape (50,) to (3,)
- Plot vectors
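
The embedding and normalization steps in the list above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: `TOY_EMBEDDINGS`, `embed_ngram`, and `l2_normalize` are hypothetical names, the 3-dimensional vectors stand in for real 300-dimensional GloVe vectors, and L2 (unit-length) normalization is an assumption, since the steps above do not say which normalization is used. It relies only on the standard library so it runs without dependencies; in practice NumPy would be the natural choice.

```python
import math

# Hypothetical 3-dimensional stand-in for a GloVe lookup table
# (the embeddings in this pipeline have 300 dimensions).
TOY_EMBEDDINGS = {
    "statistical": [0.2, 0.4, 0.4],
    "learning": [0.6, 0.0, 0.8],
}


def embed_ngram(tokens, table):
    """Average the vectors of all tokens in an n-gram.

    For a unigram (n == 1) the average is just the token's own vector.
    """
    vectors = [table[token] for token in tokens]
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def l2_normalize(vector):
    """Scale a vector to unit Euclidean length (assumed normalization).

    A zero vector has no direction, so it is returned unchanged.
    """
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]


# Embed the bigram "statistical learning", then normalize it.
bigram = embed_ngram(["statistical", "learning"], TOY_EMBEDDINGS)
unit = l2_normalize(bigram)
```

After unit-length normalization, Euclidean distance between two embeddings depends only on the angle between them, which may be one reason to normalize before a distance-based clustering step such as HDBSCAN.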