In [4]:
import requests
import nltk, re, pprint, io, json
import matplotlib.pyplot as plt
import pandas as pd

In [36]:
def fetch_content(page):
    def build_query_url(page):
        # Build query
        queryUrl = "https://en.wikipedia.org/w/api.php?action=query"
        title = "titles=%s" % page 
        content = "prop=extracts&exlimit=max&explaintext"
        rvprop= "rvprop=timestamp|content"
        dataformat = "format=json"
        query = "%s&%s&%s&%s&%s" % (queryUrl, title, content, rvprop, dataformat)
        return query
    
    def get_content(url):
        # Send request and parse the JSON response
        json_response = requests.get(url).json()
        pages = json_response['query']['pages']
        # The response is keyed by page id; a single title gives a single entry
        key = next(iter(pages.keys()))
        content = pages[key]['extract']
        return content
    
    url = build_query_url(page)
    content = get_content(url)
    return content

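def save_to_file(content, page_name):
    # Assumes the congress115/ directory already exists
    filename = 'congress115/%s.txt' % page_name
    # Overwrite rather than append, so re-running the cell does not duplicate text
    with io.open(filename, "w", encoding="utf-8") as f:
        f.write(content)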

In [29]:
# Create a dataframe which contains page names for the 115th congress
url_h115 = 'https://raw.githubusercontent.com/suneman/socialgraphs2018/master/files/data_US_congress/H115.csv'
df = pd.read_csv(url_h115)
page_names = df.WikiPageName

In [38]:
%%time
# Fetch each wikipage and save to a txt file
for page_name in page_names:
    content = fetch_content(page_name)
    save_to_file(content, page_name)

CPU times: user 15.8 s, sys: 1.51 s, total: 17.3 s
Wall time: 3min 21s


## Exercises
### TF-IDF
**Explain in your own words the point of TF-IDF.**
* What does TF stand for?
* What does IDF stand for?

Answer:
* TF stands for term frequency and IDF stands for inverse document frequency. TF-IDF is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in information retrieval, text mining, and user modeling. The tf–idf value increases in proportion to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which adjusts for the fact that some words are more frequent in general. Tf–idf is one of the most widely used term-weighting schemes.

* Variations of the tf–idf weighting scheme are often used by search engines as a central tool for scoring and ranking a document's relevance to a user query. Tf–idf can also be used for stop-word filtering in various subject fields, including text summarization and classification.
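
To make the definition concrete, here is a minimal sketch of one common variant: raw term counts for TF and `log(N / df)` for IDF. The function name `tf_idf` and the toy corpus are made up for illustration, and other weightings (normalized TF, smoothed IDF) are equally valid choices.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: tf-idf score} dict per document.

    documents: list of token lists.
    Assumption: TF is the raw count and IDF is log(N / df).
    """
    n_docs = len(documents)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in documents:
        term_freq = Counter(tokens)
        scores.append({
            term: count * math.log(float(n_docs) / doc_freq[term])
            for term, count in term_freq.items()
        })
    return scores

# Toy corpus, invented for this example
docs = [
    ["tax", "reform", "tax", "bill"],
    ["health", "care", "bill"],
]
scores = tf_idf(docs)
```

A term that occurs in every document, such as `bill` here, gets an IDF of `log(1) = 0`, so its score vanishes no matter how often it is used. That is the "offset" described in the answer.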

### Tokenizing the Wikipedia Pages
We want to find out which words are important for each party, so we're going to create two large documents, one for the Democratic and one for the Republican party. Tokenize the pages, and combine the tokens into one long list including all the pages of the members of the same party. Remember the bullets below for success.
* Exclude the congress members' names (since we're interested in the words, not the names).
* Exclude punctuation.
* Exclude stop words (if you don't know what stop words are, go back and read NLPP1e again).
* Exclude numbers (since they're difficult to interpret in the word cloud).
* Set everything to lower case.
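
One possible way to apply the bullets above in a single pass is sketched here. It uses a plain regular expression instead of `nltk.word_tokenize`, and `STOP_WORDS` is a tiny hand-written stand-in (an assumption for this example); in practice a fuller list such as `nltk.corpus.stopwords.words('english')` would be the natural choice. The helper name `clean_tokens` is hypothetical.

```python
import re

# Tiny stand-in stop word list, for illustration only
STOP_WORDS = {"the", "a", "an", "and", "of", "in", "is", "was", "to", "for"}

def clean_tokens(text, name_parts):
    """Lowercase the text, keep alphabetic tokens, drop stop words and names."""
    excluded = STOP_WORDS | {part.lower() for part in name_parts}
    # Matching [a-z]+ on lowercased text discards punctuation and numbers
    words = re.findall(r"[a-z]+", text.lower())
    return [word for word in words if word not in excluded]

sample = "Adam Kinzinger was elected to the House in 2010, representing Illinois."
# Page names look like "Adam_Kinzinger", so splitting on "_" gives the name parts
tokens = clean_tokens(sample, "Adam_Kinzinger".split("_"))
print(tokens)
```

The `[a-z]+` pattern is deliberately crude: it splits words at apostrophes and drops accented letters, which is acceptable given that none of the steps has to be perfect.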

*Note that none of the above has to be perfect. It might not be easy to remove all representatives' names. And there's some room for improvisation. You can try using stemming. In my own first run the results didn't look so nice, because some pages are very detailed and repeat certain words again and again and again, whereas other pages are very short. For that reason, I decided to use the unique set of words from each page rather than each word in proportion to how it's actually used on that page. Choices like that are up to you.
Now, we're ready to calculate the TF for each word. Use the method of your choice to find the top 5 terms within each party.*
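
The "unique set of words from each page" idea can be sketched with `collections.Counter`: each page contributes a term at most once, so the resulting TF is the number of pages in the party that mention the term. The function name `party_term_counts` and the token lists are invented for illustration.

```python
from collections import Counter

def party_term_counts(pages):
    """Count, for each term, how many pages it appears on.

    pages: list of token lists, one per congress member page.
    set(tokens) ensures a long, repetitive page cannot dominate the counts.
    """
    counts = Counter()
    for tokens in pages:
        counts.update(set(tokens))
    return counts

# Made-up token lists standing in for cleaned Wikipedia pages
example_pages = [
    ["tax", "tax", "tax", "energy"],
    ["tax", "military"],
    ["energy", "tax"],
]
top_terms = party_term_counts(example_pages).most_common(5)
print(top_terms)
```

Here `tax` counts three times (once per page) even though it occurs five times in total, which is exactly the damping effect the note describes.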

In [None]:
# Read the page as text; word_tokenize expects a string, not encoded bytes
with io.open("congress115/Adam_Kinzinger.txt", 'r', encoding='utf-8') as f:
    article = f.read()

In [None]:
tokens = nltk.word_tokenize(article)