A guide to document clustering in Python
brandomr Merge pull request #3 from bmabey/requirements
Incorporates Fil's changes and adds a working requirements.txt file
Latest commit 1a11caa Jun 28, 2015

Document Clustering with Python

In this guide, I will explain how to cluster a set of documents using Python. My motivating example is to identify the latent structures within the synopses of the top 100 films of all time (per an IMDB list). See the original post for a more detailed discussion of the example. This guide covers:

The 'cluster_analysis' workbook is fully functional; the 'cluster_analysis_web' workbook has been trimmed down for the purpose of creating this walkthrough. Feel free to download the repo and use 'cluster_analysis' to step through the guide yourself.

How the repo is set up

Once you've pulled down the repo, all you need to do is run 'cluster_analysis.ipynb'; it will find the various lists of synopses and titles. The 'Film_Scrape.ipynb' notebook contains the code I used to scrape the synopses, in case you are interested. The other items in the repo are mostly incidentals for setting up the webpage walk-through. There is also one pickled model ('doc_cluster.pkl').
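If you want to read the lists outside the notebook, here is a minimal sketch of how that might look. The file names `title_list.txt` and `synopses_list_wiki.txt` are files in the repo, but the helper names (`load_list`, `load_corpus`) are hypothetical. The assumption that entries are separated by newlines is also mine; the notebook may split the synopses on a different delimiter, so check it before relying on this.

```python
from pathlib import Path


def load_list(path, delimiter="\n"):
    """Read a text file and split it into a list of non-empty, stripped entries.

    `delimiter` is an assumption: pass whatever separator the file actually uses.
    """
    text = Path(path).read_text(encoding="utf-8")
    return [entry.strip() for entry in text.split(delimiter) if entry.strip()]


def load_corpus(repo_dir,
                titles_file="title_list.txt",
                synopses_file="synopses_list_wiki.txt",
                synopsis_delimiter="\n"):
    """Load titles and synopses and pair them up by position.

    Raises ValueError if the two lists differ in length, since a mismatch
    usually means the synopsis delimiter is wrong.
    """
    repo_dir = Path(repo_dir)
    titles = load_list(repo_dir / titles_file)
    synopses = load_list(repo_dir / synopses_file, synopsis_delimiter)
    if len(titles) != len(synopses):
        raise ValueError(
            f"{len(titles)} titles but {len(synopses)} synopses; "
            "check the synopsis delimiter"
        )
    return list(zip(titles, synopses))
```

Calling `load_corpus(".")` from the repo root would then return a list of `(title, synopsis)` pairs, provided the delimiter assumption holds for your copy of the files.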

At some point in the future I'll write up how I executed the web scraping in case it's of interest.