# Week 7 

In [1]:
import matplotlib.pyplot as plt
import nltk
import requests, io, re
import cPickle as pickle
from os import listdir
from os.path import isfile, join

%matplotlib inline

## Processing real text (from out on the inter-webs)

*Exercise:* Just a couple of examples from the book: Work through the exercises NLPP1e 3.12: 6, 30.

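**Describe the class of strings matched by the following regular expressions.**
1. `[a-zA-Z]+` -> One or more ASCII letters, in any mix of lowercase and uppercase (a plain word with no digits or punctuation).
2. `[A-Z][a-z]*` -> A single capital letter followed by zero or more lowercase letters (a lone capital letter, or a capitalized word such as one at the beginning of a sentence).
3. `p[aeiou]{,2}t` -> A `p`, followed by zero, one or two vowels from `aeiou`, followed by a `t` (e.g. `pt`, `pat`, `pout`).
4. `\d+(\.\d+)?` -> An integer or a decimal number: one or more digits, optionally followed by a point and one or more digits. The group captures the fractional part together with its leading point.
5. `([^aeiou][aeiou][^aeiou])*` -> Zero or more repetitions of a three-character sequence: a character that is not a lowercase vowel, then a lowercase vowel, then a character that is not a lowercase vowel (e.g. `pit`, `pitcat`). It also matches the empty string.
6. `\w+|[^\w\s]+` -> Either a run of word characters (letters, digits, underscore) or a run of characters that are neither word characters nor whitespace (punctuation). It splits text into word and punctuation tokens and never matches whitespace.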
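
The descriptions above can be checked directly with Python's `re` module. The following is a minimal sketch: the helper `full_match` and the sample strings are hypothetical, chosen only to probe each pattern, and the `__future__` import is there so the snippet runs on both Python 2 and Python 3.

```python
from __future__ import print_function
import re

# Patterns from the exercise, keyed by their item number.
patterns = {
    1: r'[a-zA-Z]+',
    2: r'[A-Z][a-z]*',
    3: r'p[aeiou]{,2}t',
    4: r'\d+(\.\d+)?',
    5: r'([^aeiou][aeiou][^aeiou])*',
    6: r'\w+|[^\w\s]+',
}

def full_match(number, text):
    """Return True if pattern `number` matches the whole of `text`."""
    return re.match('(?:' + patterns[number] + r')\Z', text) is not None

# Pattern 3: at most two vowels between p and t.
print(full_match(3, 'pt'), full_match(3, 'pout'), full_match(3, 'peaat'))
# Pattern 4: the digits before the point are mandatory.
print(full_match(4, '42'), full_match(4, '3.14'), full_match(4, '.5'))
# Pattern 5: the empty string matches, an incomplete triple does not.
print(full_match(5, ''), full_match(5, 'pitcat'), full_match(5, 'pi'))

# Pattern 6 used as a simple tokenizer.
print(re.findall(patterns[6], "Don't panic, it's 1849!"))
```

Anchoring each pattern with `\Z` tests whether the whole string belongs to the described class, rather than whether some substring matches.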


** Use the Porter Stemmer to normalize some tokenized text, calling the stemmer on each word. Do the same thing with the Lancaster Stemmer and see if you observe any differences.**

In [2]:
raw= """
The gold dollar was a coin struck as a regular issue by the United States Bureau of the Mint from 1849 to 1889. The coin had three types over its lifetime, all designed by Mint Chief Engraver James B. Longacre. The Type 1 issue had the smallest diameter of any United States coin ever minted. A gold dollar had been proposed several times in the 1830s and 1840s, but was not initially adopted. Congress was finally galvanized into action by the increased supply of bullion from the California gold rush, and in 1849 authorized a gold dollar. In its early years, silver coins were being hoarded or exported, and the gold dollar found a ready place in commerce. Silver again circulated after Congress required in 1853 that new coins of that metal be made lighter, and the gold dollar became a rarity in commerce even before federal coins vanished from circulation amid the economic disruption of the American Civil War. Gold did not circulate again in most of the nation until 1879, and even then, the gold dollar did not regain its place in commerce. In its final years, struck in small numbers, it was hoarded by speculators and mounted in jewelry.
"""
tokens = nltk.word_tokenize(raw)

porter = nltk.PorterStemmer()

print [porter.stem(t) for t in tokens]

[u'The', u'gold', u'dollar', u'wa', u'a', u'coin', u'struck', u'as', u'a', u'regular', u'issu', u'by', u'the', u'Unit', u'State', u'Bureau', u'of', u'the', u'Mint', u'from', u'1849', u'to', u'1889', u'.', u'The', u'coin', u'had', u'three', u'type', u'over', u'it', u'lifetim', u',', u'all', u'design', u'by', u'Mint', u'Chief', u'Engrav', u'Jame', u'B.', u'Longacr', u'.', u'The', u'Type', u'1', u'issu', u'had', u'the', u'smallest', u'diamet', u'of', u'ani', u'Unit', u'State', u'coin', u'ever', u'mint', u'.', u'A', u'gold', u'dollar', u'had', u'been', u'propos', u'sever', u'time', u'in', u'the', u'1830', u'and', u'1840', u',', u'but', u'wa', u'not', u'initi', u'adopt', u'.', u'Congress', u'wa', u'final', u'galvan', u'into', u'action', u'by', u'the', u'increas', u'suppli', u'of', u'bullion', u'from', u'the', u'California', u'gold', u'rush', u',', u'and', u'in', u'1849', u'author', u'a', u'gold', u'dollar', u'.', u'In', u'it', u'earli', u'year', u',', u'silver', u'coin', u'were', u'be', u'h

In [3]:
lancaster = nltk.LancasterStemmer()

print [lancaster.stem(t) for t in tokens]

['the', 'gold', 'doll', 'was', 'a', 'coin', 'struck', 'as', 'a', 'regul', 'issu', 'by', 'the', 'unit', 'stat', 'bureau', 'of', 'the', 'mint', 'from', '1849', 'to', '1889', '.', 'the', 'coin', 'had', 'three', 'typ', 'ov', 'it', 'lifetim', ',', 'al', 'design', 'by', 'mint', 'chief', 'engrav', 'jam', 'b.', 'longacr', '.', 'the', 'typ', '1', 'issu', 'had', 'the', 'smallest', 'diamet', 'of', 'any', 'unit', 'stat', 'coin', 'ev', 'mint', '.', 'a', 'gold', 'doll', 'had', 'been', 'propos', 'sev', 'tim', 'in', 'the', '1830s', 'and', '1840s', ',', 'but', 'was', 'not', 'init', 'adopt', '.', 'congress', 'was', 'fin', 'galv', 'into', 'act', 'by', 'the', 'increas', 'supply', 'of', 'bul', 'from', 'the', 'californ', 'gold', 'rush', ',', 'and', 'in', '1849', 'auth', 'a', 'gold', 'doll', '.', 'in', 'it', 'ear', 'year', ',', 'silv', 'coin', 'wer', 'being', 'hoard', 'or', 'export', ',', 'and', 'the', 'gold', 'doll', 'found', 'a', 'ready', 'plac', 'in', 'commerc', '.', 'silv', 'again', 'circ', 'aft', 'congr

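In this output the Porter stemmer keeps the original capitalization (`The`, `Unit`, `State`), whereas the Lancaster stemmer lowercases every token. Lancaster is also more aggressive: it reduces `dollar` to `doll`, `regular` to `regul` and `over` to `ov`, all of which Porter leaves intact. Porter, on the other hand, strips a final `s` that Lancaster keeps here (`was` becomes `wa`, `1830s` becomes `1830`).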

## Words that characterize the branches

*Exercises:* TF-IDF and the branches of philosophy.

Setup. We want to start from a clean version of the philosopher pages with as little wiki markup as possible. We needed the markup earlier to get the links, etc., but now we want a readable version. We can get a fairly nice version directly from the Wikipedia API: simply call it with `prop=extracts&exlimit=max&explaintext` instead of `prop=revisions` as we did earlier. This makes the API return the text without links and other markup.

* **Use this method to retrieve a nice copy of all the philosophers' text. You can, of course, also clean the existing pages using regular expressions, if you like (but that's probably more work).**

In [4]:
wikipedia_root_api_url = "http://en.wikipedia.org/w/api.php"
philosophers_dir = './philosophers'

def download_wikipage(philosopher):
    """Fetch the plain-text extract of a philosopher's page; return None if there is none."""
    payload = {
        'action': 'query',
        'format': 'json',
        'prop': 'extracts',
        'exlimit': 'max',
        'explaintext': 'true',
        'titles': philosopher
    }
    
    response = requests.get(wikipedia_root_api_url, params=payload)
    content = response.json()
    
    # If there are no pages associated with the philosopher, just skip it
    if 'pages' not in content['query']:
        return None
        
    # The response holds a single page, keyed by its page id
    philosopher_pages = content['query']['pages']
    philosopher_content = list(philosopher_pages.values())[0]
    
    # If there's no content in the page, skip it as well
    if 'extract' not in philosopher_content:
        return None
    
    return content

def save_to_file(file_name, data):
    """Pickle data to <philosophers_dir>/<file_name>.pickle."""
    with io.open(join(philosophers_dir, file_name + '.pickle'), 'wb') as f:
        pickle.dump(data, f)
        
def load_philosophers_from_file(file_name):
    """Return the set of page titles linked from a list-of-philosophers file."""
    # re_wiki_link is defined in the next cell
    with io.open(file_name, 'r', encoding='utf-8') as f:
        # Find all matches
        philosophers_matches = re.findall(re_wiki_link, f.read())
    return set(philosophers_matches)
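
The checks in `download_wikipage` can be exercised without a network call. The sketch below assumes the response has the shape that function already relies on (`query`, then `pages`, then one entry keyed by page id, with an `extract` field when text is available). The helper `extract_text` and both sample dictionaries are hypothetical, hand-written for illustration rather than captured from the API.

```python
from __future__ import print_function

def extract_text(content):
    """Return the plain-text extract from a query response, or None."""
    pages = content.get('query', {}).get('pages')
    if not pages:
        return None
    # A single-title query yields one page, keyed by its page id.
    page = list(pages.values())[0]
    return page.get('extract')

# Hand-written stand-ins for API responses (not real data).
found = {'query': {'pages': {'123': {'title': 'Some Philosopher',
                                     'extract': 'Plain text of the page.'}}}}
missing = {'query': {'pages': {'-1': {'title': 'No Such Page'}}}}

print(extract_text(found))    # the extract string
print(extract_text(missing))  # None, so the caller can skip this title
```

Returning `None` for both failure cases lets the caller skip a title with a single `if not ...` test, which is how `download_wikipage` is used further down.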


In [11]:
# More advanced regex that captures links with whitespaces and doesn't require any manual pre-processing of the file
re_wiki_link = r'\*.*?\[\[([^\[\]|]+)[^\[\]]*\]\]' 


def create_philosophers_dict():
    philosophers_branches = {}
    
    aestheticians_matches = load_philosophers_from_file('philosophers_aestheticians.txt')
    epistemologists_matches = load_philosophers_from_file('philosophers_epistemologists.txt')
    ethicists_matches = load_philosophers_from_file('philosophers_ethicists.txt')
    logicians_matches = load_philosophers_from_file('philosophers_logicians.txt')
    metaphysicians_matches = load_philosophers_from_file('philosophers_metaphysicians.txt')
    sociopoliticians_matches = load_philosophers_from_file('philosophers_sociopolitical.txt')
    
    # Union of all branch lists; it is already a set, so duplicates are removed
    philosophers_unique = aestheticians_matches.union(epistemologists_matches) \
                                                .union(ethicists_matches) \
                                                .union(logicians_matches) \
                                                .union(metaphysicians_matches) \
                                                .union(sociopoliticians_matches)
    
    # If the philosopher is listed in branch_name, store the page text under that branch
    def if_philosopher_in_branch(philosopher, content, branch_name, branch_matches):
        if philosopher in branch_matches:
            if branch_name in philosophers_branches:
                philosophers_branches[branch_name][philosopher] = content
            else:
                # If the branch is not yet in the dictionary, create its dict with the current philosopher
                philosophers_branches[branch_name] = {philosopher: content}
                
    
    # Helper that downloads a philosopher's page and files it under every branch that lists them
    def check_if_philosopher_in_one_branch(philosopher):
        # Download content; skip philosophers without a usable page
        content = download_wikipage(philosopher)
        if not content:
            return
        philosopher_pages = content['query']['pages']
        philosopher_content = list(philosopher_pages.values())[0]

        philosopher_content = philosopher_content['extract']
            
        if_philosopher_in_branch(philosopher, philosopher_content, 'aestheticians', aestheticians_matches)
        if_philosopher_in_branch(philosopher, philosopher_content, 'epistemologists', epistemologists_matches)
        if_philosopher_in_branch(philosopher, philosopher_content, 'ethicists', ethicists_matches)
        if_philosopher_in_branch(philosopher, philosopher_content, 'logicians', logicians_matches)
        if_philosopher_in_branch(philosopher, philosopher_content, 'metaphysicians', metaphysicians_matches)
        if_philosopher_in_branch(philosopher, philosopher_content, 'sociopoliticians', sociopoliticians_matches)
    
    
    # For each philosopher, check which branches they belong to
    count = 0
    print len(philosophers_unique)
    for philosopher in philosophers_unique:
        check_if_philosopher_in_one_branch(philosopher)
        count += 1
        print count
        
    return philosophers_branches

# # Get all the files with the philosophers information
# philosopher_files = get_list_of_philosophers_files(philosophers_dir)

# philosophers_content = {}

# for philosopher_file in philosopher_files:
#     philosopher_wikipage = load_philosopher_from_file(philosopher_file)
    
#     philosopher_pages = philosopher_wikipage['query']['pages']
#     philosopher_content = philosopher_pages[philosopher_pages.keys()[0]]
#     philosopher_name = philosopher_content['title']
    
#     philosophers_content[philosopher_name] = philosopher_content
    
#     content = download_wikipage(philosopher_name)
#     save_to_file(philosopher_name + '-extract', content)
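
The link pattern `re_wiki_link` defined at the top of this cell can be checked on a small in-memory sample before running it over the list files. This is a minimal sketch: the pattern is copied from the cell, while the sample text is hand-written in the style of a wiki list page and is not real data.

```python
from __future__ import print_function
import re

# Same pattern as re_wiki_link in the cell above.
re_wiki_link = r'\*.*?\[\[([^\[\]|]+)[^\[\]]*\]\]'

# Hand-written sample in the style of a wiki list page (not real data).
sample = u"""Intro text with a [[Link outside any bullet]].
* [[Theodor Adorno]]
* [[Aristotle|The Stagirite]]
* [[Plato]], teacher of [[Someone Else]]
"""

titles = re.findall(re_wiki_link, sample)
print(titles)
```

Only the first link on each bullet line is captured, links outside bullets are ignored, and for piped links the page title (the part before `|`) is kept.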

In [None]:
philosopher_branches = create_philosophers_dict()

1013
[progress counter: 1 through 276, one number per line]

* **First, check out the Wikipedia page for TF-IDF. Explain in your own words the point of TF-IDF.**

TF-IDF is a measure used in information retrieval that captures how important a word is to a document relative to a collection of documents (a corpus). A term has a high TF-IDF if it appears frequently in one document but in few of the other documents, whereas a term that appears frequently in all documents has a low TF-IDF.
* 
    * **What does TF stand for?** TF stands for Term Frequency.
    * **What does IDF stand for?** IDF stands for Inverse Document Frequency.
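
To make the definition concrete, here is a minimal sketch on a toy corpus. It assumes one common weighting among several: raw counts for TF and an unsmoothed natural-log IDF, `log(N / df)`. The corpus, the function names and the variable names are all hypothetical.

```python
from __future__ import division, print_function
import math

# Hypothetical toy corpus: three tiny "documents", already tokenized.
documents = {
    'doc_a': ['beauty', 'art', 'beauty', 'the'],
    'doc_b': ['logic', 'proof', 'the', 'the'],
    'doc_c': ['virtue', 'the', 'duty'],
}

def term_frequency(tokens):
    """Raw count of each token in one document."""
    counts = {}
    for token in tokens:
        counts[token] = counts.get(token, 0) + 1
    return counts

def inverse_document_frequency(docs):
    """log(N / number of documents containing the term), natural log."""
    n_docs = len(docs)
    doc_freq = {}
    for tokens in docs.values():
        for token in set(tokens):
            doc_freq[token] = doc_freq.get(token, 0) + 1
    return {t: math.log(n_docs / df) for t, df in doc_freq.items()}

idf = inverse_document_frequency(documents)
tf_idf = {
    name: {t: tf * idf[t] for t, tf in term_frequency(tokens).items()}
    for name, tokens in documents.items()
}

print(tf_idf['doc_a']['beauty'])  # 2 * log(3), roughly 2.197
print(tf_idf['doc_b']['the'])     # 0.0: 'the' occurs in every document
```

Although `the` is the most frequent token in `doc_b`, its score is zero because it appears in every document, while `beauty` scores highly because it is frequent in `doc_a` and absent elsewhere.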


* **Since we want to find out which words are important for each branch, we're going to create six large documents, one per branch of philosophy. Tokenize the pages, and combine the tokens into one long list per branch. Remember the bullets below for success.**

In [None]:
philosopher_branches_tokens = {}

# Collect English stopwords
stopwords = set(nltk.corpus.stopwords.words('english'))

def is_alphanum(input_string):
    # True only if every character is a letter or digit, so tokens containing punctuation are rejected
    return all(char.isalnum() for char in input_string)

def is_number(s):
    # float() also accepts 'nan', 'inf' and 'infinity', which are words here, so require a digit
    if not any(char.isdigit() for char in s):
        return False
    try:
        float(s)
        return True
    except ValueError:
        return False

# Loop through each branch
for branch, philosophers in philosopher_branches.iteritems():
    # Token -> count for the current branch
    tokens_current_branch = {}
    for philosopher, content in philosophers.iteritems():
        tokens = nltk.word_tokenize(content)
        philosopher_name = philosopher.lower()
        
        # Go through each token and check if we keep it
        for token in tokens:
            token_lower = token.lower()
            # Discard the token if it is a single character, is not alphanumeric (like punctuation),
            # is a number, is a substring of the philosopher's name, or is a stopword
            if len(token_lower) == 1 \
            or not is_alphanum(token_lower) \
            or is_number(token_lower) \
            or token_lower in philosopher_name \
            or token_lower in stopwords:
                continue
            # Count the token for the current branch
            if token_lower in tokens_current_branch:
                tokens_current_branch[token_lower] += 1
            else:
                tokens_current_branch[token_lower] = 1
    
    # Store the token counts for the branch
    philosopher_branches_tokens[branch] = tokens_current_branch
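
The filtering rules above can be tried on a small string without NLTK. The sketch below is a simplification under stated assumptions: it uses a regex tokenizer instead of `nltk.word_tokenize`, a tiny hand-picked stopword set instead of NLTK's list, `str.isdigit` as the number test, and `collections.Counter` for counting. The names `count_tokens` and `toy_stopwords` and the sample sentence are hypothetical.

```python
from __future__ import print_function
import re
from collections import Counter

toy_stopwords = {'the', 'of', 'and', 'in', 'was', 'a'}  # stand-in for NLTK's list

def count_tokens(text, name, stopwords):
    """Count lowercase tokens, mirroring the discard rules of the cell above."""
    name = name.lower()
    counts = Counter()
    # \w+|[^\w\s]+ splits the text into word and punctuation tokens.
    for token in re.findall(r'\w+|[^\w\s]+', text.lower()):
        if (len(token) == 1
                or not token.isalnum()
                or token.isdigit()
                or token in name
                or token in stopwords):
            continue
        counts[token] += 1
    return counts

sample = ("Plato was a student of Socrates. In 387 BC Plato founded "
          "the Academy, and the Academy endured.")
counts = count_tokens(sample, 'Plato', toy_stopwords)
print(counts.most_common(2))
```

A `Counter` replaces the explicit `if ... in ... else` bookkeeping, and `most_common` gives the highest counts directly.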

* **Now, we're ready to calculate the TF for each word. Use the method of your choice to find the top 5 terms within each branch. **

In [None]:
philosopher_branches_tokens_frequency = {}

for branch, tokens in philosopher_branches_tokens.iteritems():
    # tokens already maps each token to its count, so we only need to sort by descending count
    freq_tokens_sorted = sorted(tokens.iteritems(), key=lambda kv: -kv[1])
    philosopher_branches_tokens_frequency[branch] = freq_tokens_sorted
    print "The top 5 terms for the branch %s are : %s" % (branch, freq_tokens_sorted[:5])

* 
    * **Describe similarities and differences between the branches.**
    
    A
    * **Why aren't the TFs necessarily a good description of the branches?**
    
    A

* **Next, we calculate IDF for every word. **

In [52]:
from __future__ import division
import numpy as np

N = len(philosopher_branches) # Number of branches

# Document frequency: number of branches in which each word occurs
word_df = {}
for branch, tokens in philosopher_branches_tokens.iteritems():
    for token in tokens:
        word_df[token] = word_df.get(token, 0) + 1

# IDF of each word, using the natural logarithm
word_idf = {}
for token, token_count in word_df.iteritems():
    word_idf[token] = np.log(N / token_count)


* 
    * ** What base logarithm did you use? Is that important? **
    
    A
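
One way to explore the question about the base is to compute the same IDF values with two different bases and compare them. The sketch below uses made-up document frequencies for three hypothetical words and six documents, matching the six branches of the exercise; it is an illustration, not the notebook's data.

```python
from __future__ import division, print_function
import math

n_docs = 6  # six branches, as in the exercise
# Hypothetical document frequencies for three made-up words.
doc_freq = {'common': 6, 'shared': 3, 'rare': 1}

idf_e = {w: math.log(n_docs / df) for w, df in doc_freq.items()}     # natural log
idf_2 = {w: math.log(n_docs / df, 2) for w, df in doc_freq.items()}  # base 2

# Changing the base multiplies every IDF by the same constant, 1 / ln(2).
for word in doc_freq:
    print(word, idf_e[word], idf_2[word])

rank_e = sorted(doc_freq, key=lambda w: -idf_e[w])
rank_2 = sorted(doc_freq, key=lambda w: -idf_2[w])
print(rank_e == rank_2)  # True: the ordering does not depend on the base
```

Because every value is scaled by the same positive constant, the ranking of words is identical for both bases; only the absolute scores change.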

* **We're ready to calculate TF-IDF. Do that for each branch. **

In [None]:
# TF-IDF per branch: a list of (word, tf * idf) pairs
philosophers_branches_idf = {}
for branch, tfs in philosopher_branches_tokens_frequency.iteritems():
    philosophers_branches_idf[branch] = []
    for word, tf in tfs:
        philosophers_branches_idf[branch].append((word, tf * word_idf[word]))

* 
    * **List the 10 top words for each branch.**

In [None]:
for branch, words_tf_idf in philosophers_branches_idf.iteritems():
    words_tf_idf_sorted = sorted(words_tf_idf, key=lambda (w, tf_idf): -tf_idf)
    print "Top 10 words for branch %s : %s" % (branch, words_tf_idf_sorted[:10])

* 
    * **Are these 10 words more descriptive of the branch? If yes, what is it about IDF that makes the words more informative?**
    
    A
* **Normally, TF-IDF is used for single documents. What does TF-IDF tell us about the content of a single document in a collection?**

A

## The word cloud.