In [15]:
# Import all of the things you need to import!#
!pip install scipy
!pip install sklearn
!pip install nltk



In [16]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
import re
from nltk.stem.porter import PorterStemmer
from sklearn.cluster import KMeans
import numpy as np

pd.options.display.max_columns = 30
%matplotlib inline

# Homework 14 (or so): TF-IDF text analysis and clustering

Hooray, we kind of figured out how text analysis works! Some of it is still magic, but at least the **TF** and **IDF** parts make a little sense. Kind of. Somewhat.

No, just kidding, we're *professionals* now.

## Investigating the Congressional Record

The [Congressional Record](https://en.wikipedia.org/wiki/Congressional_Record) is more or less what happened in Congress every single day. Speeches and all that. A good large source of text data, maybe?

Let's pretend it's totally secret but we just got it leaked to us in a data dump, and we need to check it out. It was leaked from [this page here](http://www.cs.cornell.edu/home/llee/data/convote.html).

In [17]:
# If you'd like to download it through the command line...
!curl -O http://www.cs.cornell.edu/home/llee/data/convote/convote_v1.1.tar.gz

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 9607k  100 9607k    0     0  6409k      0  0:00:01  0:00:01 --:--:-- 6413k


In [18]:
# And then extract it through the command line...
!tar -zxf convote_v1.1.tar.gz

You can explore the files if you'd like, but we're going to get the ones from `convote_v1.1/data_stage_one/development_set/`. It's a bunch of text files.

In [19]:
# glob finds files matching a certain filename pattern
import glob

# Give me all the text files
paths = glob.glob('convote_v1.1/data_stage_one/development_set/*')
paths[:5]

['convote_v1.1/data_stage_one/development_set/052_400011_0327014_DON.txt',
 'convote_v1.1/data_stage_one/development_set/052_400011_0327025_DON.txt',
 'convote_v1.1/data_stage_one/development_set/052_400011_0327044_DON.txt',
 'convote_v1.1/data_stage_one/development_set/052_400011_0327046_DON.txt',
 'convote_v1.1/data_stage_one/development_set/052_400011_1479036_DON.txt']

In [20]:
len(paths)

702

So great, we have 702 of them. Now let's import them.

In [21]:
speeches = []
for path in paths:
    with open(path) as speech_file:
        speech = {
            'pathname': path,
            'filename': path.split('/')[-1],
            'content': speech_file.read()
        }
    speeches.append(speech)
speeches_df = pd.DataFrame(speeches)
speeches_df.head()

Unnamed: 0,content,filename,pathname
0,"mr. chairman , i thank the gentlewoman for yie...",052_400011_0327014_DON.txt,convote_v1.1/data_stage_one/development_set/05...
1,"mr. chairman , i want to thank my good friend ...",052_400011_0327025_DON.txt,convote_v1.1/data_stage_one/development_set/05...
2,"mr. chairman , i rise to make two fundamental ...",052_400011_0327044_DON.txt,convote_v1.1/data_stage_one/development_set/05...
3,"mr. chairman , reclaiming my time , let me mak...",052_400011_0327046_DON.txt,convote_v1.1/data_stage_one/development_set/05...
4,"mr. chairman , i thank my distinguished collea...",052_400011_1479036_DON.txt,convote_v1.1/data_stage_one/development_set/05...


In class we had the `texts` variable. For the homework can just do `speeches_df['content']` to get the same sort of list of stuff.

**Take a look at the contents of the first 5 speeches**

In [22]:
for item in speeches_df['content'].head(5):
    print("++++++++++++++++++++NEW SPEECH+++++++++++++++++++++")
    print(item)
    print("     ")

++++++++++++++++++++NEW SPEECH+++++++++++++++++++++
mr. chairman , i thank the gentlewoman for yielding me this time . 
my good colleague from california raised the exact and critical point . 
the question is , what happens during those 45 days ? 
we will need to support elections . 
there is not a single member of this house who has not supported some form of general election , a special election , to replace the members at some point . 
but during that 45 days , what happens ? 
the chair of the constitution subcommittee says this is what happens : martial law . 
we do not know who would fill the vacancy of the presidency , but we do know that the succession act most likely suggests it would be an unelected person . 
the sponsors of the bill before us today insist , and i think rightfully so , on the importance of elections . 
but to then say that during a 45-day period we would have none of the checks and balances so fundamental to our constitution , none of the separation of powers 

# Doing our analysis

Use the `sklearn` package and a plain boring `CountVectorizer` to get a list of all of the tokens used in the speeches. If it won't list them all, that's ok! Make a dataframe with those terms as columns.

**Be sure to include English-language stopwords**

In [23]:
c_vectorizer = CountVectorizer(stop_words='english')
x = c_vectorizer.fit_transform(speeches_df['content'])
x

<702x9106 sparse matrix of type '<class 'numpy.int64'>'
	with 56106 stored elements in Compressed Sparse Row format>

In [24]:
df = pd.DataFrame(x.toarray(), columns=c_vectorizer.get_feature_names())

In [25]:
df

Unnamed: 0,000,00007,018,050,092,10,100,106,107,108,108th,109th,10th,11,110,...,yields,york,yorkers,young,younger,youngsters,youth,yuan,zero,zeroing,zeros,zigler,zirkin,zoe,zoellick
0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


Okay, it's **far** too big to even look at. Let's try to get a list of features from a new `CountVectorizer` that only takes the top 100 words.

In [26]:
#http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
c2_vectorizer = CountVectorizer(stop_words='english', max_features=100)
y = c2_vectorizer.fit_transform(speeches_df['content'])
y

<702x100 sparse matrix of type '<class 'numpy.int64'>'
	with 11088 stored elements in Compressed Sparse Row format>

Now let's push all of that into a dataframe with nicely named columns.

In [27]:
new_df = pd.DataFrame(y.toarray(), columns=c2_vectorizer.get_feature_names())
#new_df

Everyone seems to start their speeches with "mr chairman" - how many speeches are there total, and how many don't mention "chairman" and how many mention neither "mr" nor "chairman"?

In [28]:
#http://stackoverflow.com/questions/15943769/how-to-get-row-count-of-pandas-dataframe
total_speeches = len(new_df.index)
print("In total there are", total_speeches, "speeches.")

In total there are 702 speeches.


In [29]:
wo_chairman = new_df[new_df['chairman']==0]['chairman'].count()
print(wo_chairman, "speeches don't mention 'chairman'")

250 speeches don't mention 'chairman'


In [30]:
wo_mr_chairman = new_df[(new_df['chairman']==0) & (new_df['mr']==0)]['chairman'].count()
print(wo_mr_chairman, "speeches don't mention neither 'chairman' nor 'mr'")

76 speeches don't mention neither 'chairman' nor 'mr'


What is the index of the speech thank is the most thankful, a.k.a. includes the word 'thank' the most times?

In [31]:
#http://stackoverflow.com/questions/18199288/getting-the-integer-index-of-a-pandas-dataframe-row-fulfilling-a-condition
print("The speech with the most 'thank's has the index", np.where(new_df['thank']==new_df['thank'].max()))

The speech with the most 'thank's has the index (array([577]),)


If I'm searching for `China` and `trade`, what are the top 3 speeches to read according to the `CountVectoriser`?

In [32]:
china_trade_speeches = (new_df['china'] + new_df['trade']).sort_values(ascending = False).head(3)
china_trade_speeches

379    92
399    36
345    27
dtype: int64

Now what if I'm using a `TfidfVectorizer`?

In [33]:
porter_stemmer = PorterStemmer()

def stem_tokenizer(str_input):
    words = re.sub(r"[^A-Za-z0-9\-]", " ", str_input).lower().split()
    words = [porter_stemmer.stem(word) for word in words]
    return words

tfidf_vectorizer = TfidfVectorizer(stop_words='english', tokenizer=stem_tokenizer, use_idf=False, norm='l1', max_features=100)
X = tfidf_vectorizer.fit_transform(speeches_df['content'])
t_df = pd.DataFrame(X.toarray(), columns=tfidf_vectorizer.get_feature_names())

In [34]:
china_trade_speeches_v2 = (t_df['china'] + t_df['trade']).sort_values(ascending = False).head(3)
china_trade_speeches_v2

345    0.397059
336    0.281250
402    0.250000
dtype: float64

**What's the content of the speeches?** Here's a way to get them:

In [35]:
# index 0 is the first speech, which was the first one imported.
paths[0]

'convote_v1.1/data_stage_one/development_set/052_400011_0327014_DON.txt'

In [36]:
# Pass that into 'cat' using { } which lets you put variables in shell commands
# that way you can pass the path to cat
print("++++++++++NEW SPEECH+++++++++")
!cat {paths[345]}

print("++++++++++NEW SPEECH+++++++++")
!cat {paths[336]}

print("++++++++++NEW SPEECH+++++++++")
!cat {paths[402]}

++++++++++NEW SPEECH+++++++++
i thank the gentleman from new york for yielding me this time . 
madam speaker , we have been here before . 
congress has often resorted to bills and memoranda of understanding concerning china . 
but the u.s. trade deficit with china has continued to increase . 
so i am not going to stand here and argue process . 
we can look at the history and the fact of the whole architecture of agreements that we have had with china , memoranda of understanding , concerns that members of congress from both sides of the aisle brought to this floor in order to try to manage united states trade with china . 
remember we were told that a memorandum of understanding on prison labor with china would remove their competitive advantage and restore balanced trade . 
but the u.s. trade deficit with china worsened . 
remember the agreement to reaffirm the 1992 market access memorandum of understanding . 
we passed that , but the u.s. trade deficit with china grew worse . 
rememb

**Now search for something else!** Another two terms that might show up. `elections` and `chaos`? Whatever you thnik might be interesting.

In [37]:
new_df.columns

Index(['000', '11', 'act', 'allow', 'amendment', 'america', 'american', 'amp',
       'association', 'balance', 'based', 'believe', 'bipartisan', 'chairman',
       'children', 'china', 'civil', 'colleagues', 'committee', 'congress',
       'country', 'days', 'debate', 'discrimination', 'does', 'education',
       'election', 'elections', 'fact', 'faith', 'federal', 'frivolous',
       'funding', 'gentleman', 'going', 'good', 'government', 'gt', 'head',
       'health', 'help', 'house', 'important', 'issue', 'just', 'know', 'law',
       'legislation', 'let', 'like', 'lt', 'make', 'member', 'members',
       'million', 'money', 'mr', 'nation', 'national', 'nbsp', 'need', 'new',
       'order', 'organizations', 'people', 'percent', 'policy', 'president',
       'process', 'program', 'programs', 'provide', 'religious', 'right',
       'rights', 'rule', 'rules', 'say', 'school', 'services', 'speaker',
       'start', 'state', 'states', 'support', 'teachers', 'thank', 'think',
       'time

In [38]:
election_speeches = (new_df['discrimination'] + new_df['rights']).sort_values(ascending = False).head(3)
election_speeches

577    243
348     14
672     13
dtype: int64

# Enough of this garbage, let's cluster

Using a **simple counting vectorizer**, cluster the documents into **eight categories**, telling me what the top terms are per category.

Using a **term frequency vectorizer**, cluster the documents into **eight categories**, telling me what the top terms are per category.

Using a **term frequency inverse document frequency vectorizer**, cluster the documents into **eight categories**, telling me what the top terms are per category.

In [39]:
def new_stem_tokenizer(str_input):
    words = re.sub(r"[^A-Za-z0-9\-]", " ", str_input).lower().split()
    #With PorterStemmer implemented as above, the text was pretty crippled and hard to judge which made more sense.
    #that's why I have commented that line out for now
    
    #words = [porter_stemmer.stem(word) for word in words]
    return words

vectorizer_types = [
    {'name': 'CVectorizer', 'definition': CountVectorizer(stop_words='english', tokenizer=new_stem_tokenizer,  max_features=100)},
    {'name': 'TFVectorizer', 'definition': TfidfVectorizer(stop_words='english', tokenizer=new_stem_tokenizer,  max_features=100, use_idf=False)},
    {'name': 'TFVIDFVectorizer', 'definition': TfidfVectorizer(stop_words='english', tokenizer=new_stem_tokenizer,  max_features=100, use_idf=True)}
]

In [40]:
for vectorizer in vectorizer_types:
    X = vectorizer['definition'].fit_transform(speeches_df['content'])

    number_of_clusters = 8
    km = KMeans(n_clusters=number_of_clusters)
    km.fit(X)

    print("++++++++ Top terms per cluster -- using a", vectorizer['name'])
    order_centroids = km.cluster_centers_.argsort()[:, ::-1]
    terms = vectorizer['definition'].get_feature_names()
    for i in range(number_of_clusters):
        top_ten_words = [terms[ind] for ind in order_centroids[i, :7]]
        print("Cluster {}: {}".format(i, ' '.join(top_ten_words)))

++++++++ Top terms per cluster -- using a CVectorizer
Cluster 0: trade china s american mr vote speaker
Cluster 1: start head religious rights civil program discrimination
Cluster 2: nbsp amp lt gt trade -- s
Cluster 3: mr chairman gentleman amendment time yield speaker
Cluster 4: association national amp american new america --
Cluster 5: start head children program mr amendment programs
Cluster 6: mr time house s chairman amendment people
Cluster 7: rule 11 rules federal h r civil
++++++++ Top terms per cluster -- using a TFVectorizer
Cluster 0: china trade s speaker mr legislation american
Cluster 1: mr chairman yield gentleman 2 1 vote
Cluster 2: time mr chairman balance yield amendment gentleman
Cluster 3: amendment mr chairman gentleman s time order
Cluster 4: yield gentleman time legislation law education elections
Cluster 5: mr gentleman time chairman s house people
Cluster 6: mr speaker yield gentleman time committee balance
Cluster 7: start head children program amendment mr 

**Which one do you think works the best?**

*The last two seem to make more sense than the first one, telling from its cluster three. The last two are more human-readable. However human-readability ends with that distinction -- I can't tell which from the last two would be better, based on the top terms per cluster.*

# Harry Potter time

I have a scraped collection of Harry Potter fanfiction at https://github.com/ledeprogram/courses/raw/master/algorithms/data/hp.zip.

I want you to read them in, vectorize them and cluster them. Use this process to find out **the two types of Harry Potter fanfiction**. What is your hypothesis?

In [41]:
!curl -O https://github.com/ledeprogram/courses/raw/master/algorithms/data/hp.zip

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   149  100   149    0     0    192      0 --:--:-- --:--:-- --:--:--   193


In [42]:
import zipfile

In [43]:
import glob
potter_paths = glob.glob('hp/*')
potter_paths[:5]

['hp/10001898.txt',
 'hp/10004131.txt',
 'hp/10004927.txt',
 'hp/10007980.txt',
 'hp/10010343.txt']

In [44]:
potter = []
for path in potter_paths:
    with open(path) as potter_file:
        potter_text = {
            'pathname': path,
            'filename': path.split('/')[-1],
            'content': potter_file.read()
        }
    potter.append(potter_text)
potter_df = pd.DataFrame(potter)
potter_df.head()

Unnamed: 0,content,filename,pathname
0,Prologue: The MissionDisclaimer: All character...,10001898.txt,hp/10001898.txt
1,BlackDisclaimer: I do not own Harry PotterAuth...,10004131.txt,hp/10004131.txt
2,"Chapter 1""I'm pregnant.""""""""Mum please say some...",10004927.txt,hp/10004927.txt
3,"Author's Note: Hey, just so you know, this is ...",10007980.txt,hp/10007980.txt
4,Disclaimer: I do not own Harry Potter and frie...,10010343.txt,hp/10010343.txt


In [47]:
vectorizer = TfidfVectorizer(stop_words='english', tokenizer=new_stem_tokenizer, use_idf=True)
X = vectorizer.fit_transform(potter_df['content'])

number_of_clusters = 2
km = KMeans(n_clusters=number_of_clusters)
km.fit(X)

print("Top terms per cluster")
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names()
for i in range(number_of_clusters):
    top_ten_words = [terms[ind] for ind in order_centroids[i, :7]]
    print("Cluster {}: {}".format(i, ' '.join(top_ten_words)))

Top terms per cluster
Cluster 0: harry hermione t s draco ron said
Cluster 1: t s lily james sirius said remus


*The first cluster evolves around Harry and his friends Hermoine and Ron as well as his enemy Draco.*

*The second cluster evolves around Harry's family as well as his godfather Sirius and his mentor Lupin.*