A couple of reminders about how to use Jupyter Notebook: 
- Press Shift+Enter to run a code cell or to render a Markdown cell once you have finished typing in it
- You should not have to change much code here, but if you do, document your changes carefully

Import all the necessary packages into Python. If there are any you don't have, type "!pip install package-name" in a cell above the import statements and run it to install that package (note that sklearn is installed under the package name scikit-learn). You should only have to do that once, so you can delete that line right afterwards.
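
If you are not sure which packages are missing, a quick check like the one below can tell you before you run the imports. This is only a sketch: the helper name missing_packages and the required list are ours for illustration and are not part of the original notebook.

```python
import importlib.util

# Import names used in this notebook. Note that the import name "sklearn"
# corresponds to the pip package "scikit-learn".
required = ["sklearn", "pandas"]

def missing_packages(names):
    """Return the import names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

missing = missing_packages(required)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are installed.")
```

Anything listed as missing can then be installed with "!pip install package-name" as described above.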

In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
import pandas as pd

Here we create an empty documents list, which will store all of the documents that you want to assign to clusters. We've also created a function called "preprocess", which reads in a text file with documents on *separate lines*. If your text file does not have documents on separate lines, you need to either alter the code below or alter your input file so that each document is on its own line. If you want to read the whole text file in as a single "document", change "readlines()" to "read()", apply strip() to that one string instead of using the list comprehension, and get rid of the for loop on line 7, which currently adds each line of the file to the documents list (remember to change documents.append(x) to documents.append(filename)).

After this step, we read in the text files for this example. To run this on your own data, store your own text files in the same folder as this notebook and replace the file names passed to the preprocess() function below.

In [5]:
documents = []

def preprocess(file):
    with open(file, "r", encoding="utf-8") as myfile:
        filename=myfile.readlines()
        filename = [x.strip() for x in filename]
        for x in filename:
            documents.append(x)
          
preprocess("Obama_town_hall_pre.txt")
preprocess("Obama_town_hall_post.txt")
preprocess("Bush_town_hall_pre.txt")
preprocess("Bush_town_hall_post.txt")
preprocess("Clinton_town_hall_pre.txt")
preprocess("Clinton_town_hall_post.txt")
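
For the whole-file option described above, a minimal sketch might look like the following. The names read_whole_file and whole_documents are hypothetical, and the sample file is a throwaway created only so the example runs on its own; with your own data you would pass your own file names instead.

```python
import os
import tempfile

def read_whole_file(path):
    """Return the full contents of a text file as one stripped string."""
    with open(path, "r", encoding="utf-8") as myfile:
        return myfile.read().strip()

whole_documents = []

# Demonstration on a temporary file so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.txt")
    with open(sample_path, "w", encoding="utf-8") as f:
        f.write("First line of the speech.\nSecond line of the speech.\n")

    # The whole file becomes a single entry in the list.
    whole_documents.append(read_whole_file(sample_path))

print(len(whole_documents))
```

With this approach each file contributes exactly one document, so the number of documents equals the number of files you read in.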

Here we run the sklearn TfidfVectorizer to get the term frequency-inverse document frequency (TF-IDF), a weighting that scores a term higher the more often it appears in one document and lower the more documents it appears in overall. With stop_words='english', it also filters the documents through an English stop word list, and by default it lowercases and tokenizes the text for us in this one line. Note that it does not stem or lemmatize words, so "tax" and "taxes" are treated as separate terms.

In [9]:
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(documents)
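
To see what the vectorizer does, it can help to run it on a few made-up sentences first. The toy_documents below are invented for illustration and are not from the town hall files.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three invented documents; "plan" appears in all of them, "tax" in only one.
toy_documents = [
    "The tax plan will cut taxes",
    "The jobs plan will create jobs",
    "Health care and the insurance plan",
]

toy_vectorizer = TfidfVectorizer(stop_words="english")
toy_X = toy_vectorizer.fit_transform(toy_documents)

# vocabulary_ maps each kept term to its column index in the matrix.
vocabulary = toy_vectorizer.vocabulary_
print(toy_X.shape)         # (number of documents, number of distinct terms)
print(sorted(vocabulary))  # stop words such as "the" are gone

# idf_ holds the inverse document frequency weight for each column.
idf_plan = toy_vectorizer.idf_[vocabulary["plan"]]
idf_tax = toy_vectorizer.idf_[vocabulary["tax"]]
print(idf_plan < idf_tax)  # a term found in every document gets a lower weight
```

Both "tax" and "taxes" show up in the vocabulary, which is why related word forms can appear side by side in the top terms of a cluster.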

Then we set the number of clusters we want using the "true_k" variable below. We let sklearn know that we're going to be using the KMeans clustering algorithm and pass in several parameters. The most important is n_clusters, which is the number of clusters we assigned in the previous line of code. One important bit of information to get out of this model is the model labels, which tell you which document got assigned to which cluster. You can uncomment "print(model.labels_)" to see a list of labels.

Using these cluster labels, which are in the same order as documents, you can create a dictionary called "docs" that stores each document as a key with its cluster as the value. Uncomment "print(docs)" to see the results of the for loop below. Note that if the same line of text appears more than once in documents, the dictionary keeps only one entry for it.

In [14]:
true_k = 10
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
model.fit(X)
#print(model.labels_)

docs = {}
for i in range(0,len(documents)):
    docs[documents[i]] = model.labels_[i]

#print(docs)
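
Because dictionary keys must be unique, repeated lines collapse into a single entry. One alternative, sketched here on invented toy data, is to keep the documents and labels side by side in a pandas table. The names toy_documents, toy_model, as_dict and table are ours, and n_init=10 and random_state=0 are choices made for this example only.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

toy_documents = [
    "taxes and jobs",
    "taxes and jobs",          # exact duplicate on purpose
    "health care insurance",
    "health insurance costs",
]

toy_X = TfidfVectorizer(stop_words="english").fit_transform(toy_documents)
toy_model = KMeans(n_clusters=2, init="k-means++", max_iter=100,
                   n_init=10, random_state=0)
toy_model.fit(toy_X)

# A dictionary keyed by document text keeps one entry per distinct text.
as_dict = {doc: label for doc, label in zip(toy_documents, toy_model.labels_)}

# A table keeps one row per document, duplicates included.
table = pd.DataFrame({"document": toy_documents, "cluster": toy_model.labels_})

print(len(as_dict), len(table))
# Number of documents assigned to each cluster.
print(table["cluster"].value_counts().sort_index())
```

The value_counts() call gives the size of each cluster directly, which is the same information you would otherwise read off the sorted label list.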

The code below prints the top terms per cluster. This gives you an idea of the semantic content of each cluster. You can also uncomment the last line in this block of code to print the sorted cluster labels, which can give you an idea of which clusters are largest in your data. If most of your documents fall into the same cluster, it may be a good idea to change your true_k value above.

In [15]:
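print("Top terms per cluster:")
# Sort each centroid's term weights from highest to lowest.
order_centroids = model.cluster_centers_.argsort()[:, ::-1]
# get_feature_names() was removed in scikit-learn 1.2; on versions older than 1.0, use get_feature_names() instead.
terms = vectorizer.get_feature_names_out()
for i in range(true_k):
    # Clusters are printed 1-based here, while model.labels_ is 0-based.
    print("Cluster %d:" % (i+1))
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind])
    print()
#print(sorted(model.labels_))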

Top terms per cluster:
Cluster 1:
 president
 let
 state
 did
 right
 mr
 better
 vice
 thing
 say

Cluster 2:
 people
 thank
 tax
 said
 need
 time
 ll
 way
 work
 world

Cluster 3:
 going
 tax
 taxes
 jobs
 say
 tell
 says
 people
 make
 sure

Cluster 4:
 make
 want
 governor
 sure
 romney
 point
 just
 people
 plan
 jobs

Cluster 5:
 believe
 strongly
 ought
 people
 need
 local
 country
 say
 everybody
 schools

Cluster 6:
 don
 think
 want
 really
 government
 like
 federal
 right
 word
 role

Cluster 7:
 ve
 got
 make
 sure
 years
 energy
 people
 said
 plan
 president

Cluster 8:
 think
 people
 important
 right
 american
 economy
 way
 ought
 grows
 going

Cluster 9:
 just
 years
 health
 care
 let
 people
 12
 ago
 insurance
 ve

Cluster 10:
 senator
 mccain
 dole
 agree
 billion
 mentioned
 know
 going
 suggested
 earmarks

