Document Clustering with Python

In this guide, I will explain how to cluster a set of documents using Python. My motivating example is to identify the latent structures within the synopses of the top 100 films of all time (per an IMDB list). See the original post for a more detailed discussion of the example. This guide covers:

tokenizing and stemming each synopsis
transforming the corpus into vector space using tf-idf
calculating cosine distance between each pair of documents as a measure of similarity
clustering the documents using the k-means algorithm
using multidimensional scaling to reduce dimensionality within the corpus
plotting the clustering output using matplotlib and mpld3
conducting a hierarchical clustering on the corpus using Ward clustering
plotting a Ward dendrogram
topic modeling using Latent Dirichlet Allocation (LDA)

The 'cluster_analysis' workbook is fully functional; the 'cluster_analysis_web' workbook has been trimmed down for the purpose of creating this walkthrough. Feel free to download the repo and use 'cluster_analysis' to step through the guide yourself.

How the repo is set up

Once you've pulled down the repo, all you need to do is run 'cluster_analysis.ipynb'; it will find the various lists of synopses and titles. The 'Film_Scrape.ipynb' notebook contains the code I used to scrape the synopses, in case you are interested. The other items in the repo are mostly incidentals for setting up the webpage walk-through. There is also one pickled model.

At some point in the future I'll write up how I executed the web scraping in case it's of interest.


brandomr/document_cluster
