# 4.5.1 Unsupervised Learning Capstone
For this project you'll dig into a large amount of text and apply most of what you've covered in this unit and in the course so far.

First, pick a set of texts. This can be a series of novels, chapters, or articles: anything you'd like. It just has to have multiple entries of varying characteristics. At least 100 should be good. There should also be at least 10 different authors, but try to keep the texts related (either all on the same topic or from the same branch of literature; something to make classification a bit more difficult than obviously different subjects).

This capstone can be an extension of your NLP challenge if you wish to use the same corpus. If you found problems with that data set that limited your analysis, however, it may be worth using what you learned to choose a new corpus. Reserve 25% of your corpus as a test set.
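As a minimal sketch of reserving the 25% holdout, assuming scikit-learn is used and that the corpus is held as parallel lists of texts and author labels (the toy `texts` and `authors` below are placeholders, not real data):

```python
from sklearn.model_selection import train_test_split

# Placeholder corpus: 8 short texts by 2 hypothetical authors
texts = [f"text {i}" for i in range(8)]
authors = ["a", "b"] * 4

# Hold out 25% of the corpus; stratify so each author appears in both splits
X_train, X_test, y_train, y_test = train_test_split(
    texts, authors, test_size=0.25, stratify=authors, random_state=42
)
print(len(X_train), len(X_test))
```

Stratifying by author is a design choice rather than a requirement; it keeps the holdout from missing an author entirely when the corpus is small.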

The first technique is to create a series of clusters. Try several techniques and pick the one you think best represents your data. Make sure there is a narrative and reasoning around why you have chosen the given clusters. Are authors consistently grouped into the same cluster?
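One possible way to check whether authors land in the same cluster is sketched below. It assumes TF-IDF features and k-means, which are only one of several reasonable choices; the toy texts and author names are invented for illustration.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented toy corpus: two hypothetical authors writing on different beats
texts = [
    "stocks fell as the market closed lower",
    "investors sold shares and the market dropped",
    "the team won the game in overtime",
    "the coach praised the team after the game",
    "shares rallied as investors returned to the market",
    "fans cheered when the team scored late in the game",
]
authors = ["ann", "ann", "bob", "bob", "ann", "bob"]

# Vectorize the texts and assign each one to a cluster
X = TfidfVectorizer(stop_words="english").fit_transform(texts)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Cross-tabulate authors against clusters to see how consistently they group
table = pd.crosstab(
    pd.Series(authors, name="author"), pd.Series(labels, name="cluster")
)
print(table)
```

A table with most of each author's texts in a single column suggests the clusters track authorship; rows spread across columns suggest they track something else, such as topic.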

Next, perform some unsupervised feature generation and selection using the techniques covered in this unit and elsewhere in the course. Using those features, build models to classify your texts by author. Try different combinations of unsupervised and supervised techniques to see which perform best.
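As one example of pairing an unsupervised step with a supervised one, the sketch below chains TF-IDF, truncated SVD (latent semantic analysis) and logistic regression. The specific combination, the two components and the toy data are all assumptions made for illustration.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented training texts with hypothetical author labels
train_texts = [
    "the senate passed the budget bill",
    "congress debated the tax bill",
    "the striker scored twice in the final",
    "the goalkeeper saved a penalty in the final",
]
train_authors = ["ann", "ann", "bob", "bob"]

# Unsupervised feature generation (TF-IDF + SVD) feeding a supervised classifier
model = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=2, random_state=0),
    LogisticRegression(),
)
model.fit(train_texts, train_authors)

pred = model.predict(["the senate debated the budget"])
print(pred)
```

Swapping any stage (for example a different decomposition or classifier) only changes one argument to `make_pipeline`, which makes it easy to compare combinations.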

Lastly, return to your holdout group. Does your clustering on those members perform as you'd expect? Have your clusters remained stable or changed dramatically? What about your model? Is its performance consistent?

If there is a divergence in the relative stability of your model and your clusters, delve into why.
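One way to put a number on cluster stability is the adjusted Rand index, which compares two labelings regardless of how the cluster ids are numbered. The sketch below uses invented author and cluster labels for a training set and a holdout set; it assumes scikit-learn's `adjusted_rand_score`.

```python
from sklearn.metrics import adjusted_rand_score

# Invented labels: clusters match authors perfectly on the training set...
train_authors = ["a", "a", "b", "b", "c", "c"]
train_clusters = [0, 0, 1, 1, 2, 2]

# ...but not on the holdout set
holdout_authors = ["a", "b", "c", "a"]
holdout_clusters = [0, 1, 1, 2]

train_ari = adjusted_rand_score(train_authors, train_clusters)
holdout_ari = adjusted_rand_score(holdout_authors, holdout_clusters)
print(train_ari, holdout_ari)
```

A large drop from the training score to the holdout score, alongside a classifier whose accuracy holds steady, would be the kind of divergence worth investigating.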

Your end result should be a write-up of how clustering and modeling compare for classifying your texts. What are the advantages of each? Why would you want to use one over the other? Approximately 3-5 pages is a good length for your write-up, and remember to include visuals to help tell your story!

In [1]:
# Necessary imports
import numpy as np
import pandas as pd
import requests
import re
import spacy
import sklearn

## Text Selection
For my texts, I used the New York Times Archive API to pull articles from December of every tenth year from 1960 to 2010.

In [2]:
nyt_api = '5cb4f9a5273b4fbf97ef0d7d01eb6273'
# Archive API URL template; only the year changes between requests
archive_url = 'http://api.nytimes.com/svc/archive/v1/{}/12.json?api-key=' + nyt_api
# GET requests to pull JSON data for December of each year
request_2010 = requests.get(archive_url.format(2010))
request_2000 = requests.get(archive_url.format(2000))
request_1990 = requests.get(archive_url.format(1990))
request_1980 = requests.get(archive_url.format(1980))
request_1970 = requests.get(archive_url.format(1970))
request_1960 = requests.get(archive_url.format(1960))
# Gathering responses from JSON data
response_2010 = request_2010.json()
response_2000 = request_2000.json()
response_1990 = request_1990.json()
response_1980 = request_1980.json()
response_1970 = request_1970.json()
response_1960 = request_1960.json()

In [3]:
# Selecting document information from JSON
docs_2010 = response_2010['response']['docs']
docs_2000 = response_2000['response']['docs']
docs_1990 = response_1990['response']['docs']
docs_1980 = response_1980['response']['docs']
docs_1970 = response_1970['response']['docs']
docs_1960 = response_1960['response']['docs']
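The six near-identical requests could also be written as one helper called in a loop. The sketch below is one way to do that; `fetch_docs` is a hypothetical helper, and reading the key from an `NYT_API_KEY` environment variable is an assumption made here to keep the key out of the notebook. The actual download is left commented out because it needs network access and a valid key.

```python
import os
import requests

# Same endpoint as above, with the year and month left as placeholders
ARCHIVE_URL = 'http://api.nytimes.com/svc/archive/v1/{year}/{month}.json'
YEARS = range(1960, 2011, 10)

def fetch_docs(year, month, api_key):
    """Return the list of article records for one month of the archive."""
    response = requests.get(
        ARCHIVE_URL.format(year=year, month=month),
        params={'api-key': api_key},
        timeout=30,
    )
    # Fail loudly on HTTP errors instead of parsing an error page as JSON
    response.raise_for_status()
    return response.json()['response']['docs']

# Uncomment to download December of each year (assumes NYT_API_KEY is set):
# docs_by_year = {y: fetch_docs(y, 12, os.environ['NYT_API_KEY']) for y in YEARS}
print(list(YEARS))
```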

Great, now that I've gathered the texts, let's look at a sample lead paragraph to see what we're working with.

In [4]:
docs_2000[0]['lead_paragraph']

"Almost a month after rejecting PepsiCo's takeover offer as too low, the Quaker Oats Company is now close to reaching the same deal, to be acquired by PepsiCo for $13.7 billion, executives close to the negotiations said last night. The resumption of the talks follows a decision by Coca-Cola's board last week to abandon its $16 billion deal to buy Quaker Oats and its prize growth brand, Gatorade. That deal was scuttled at the 11th hour when a divided Coke board decided it would be paying too much for Quaker, the maker of Rice-A-Roni, Aunt Jemima and Cap'n Crunch."

Now, let's extract the lead paragraph from each of the first 100 articles from each year. 

In [5]:
# Concatenate the lead paragraphs of the first 100 articles for each year
nyt_2010 = ''.join(str(article['lead_paragraph']) for article in docs_2010[0:100])
nyt_2000 = ''.join(str(article['lead_paragraph']) for article in docs_2000[0:100])
nyt_1990 = ''.join(str(article['lead_paragraph']) for article in docs_1990[0:100])
nyt_1980 = ''.join(str(article['lead_paragraph']) for article in docs_1980[0:100])
nyt_1970 = ''.join(str(article['lead_paragraph']) for article in docs_1970[0:100])
nyt_1960 = ''.join(str(article['lead_paragraph']) for article in docs_1960[0:100])
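Two side effects of the concatenation above are worth noting: wrapping a missing paragraph in `str()` would insert the literal text `None`, and joining with no separator glues the last word of one paragraph to the first word of the next. A possible alternative, using a hypothetical `join_leads` helper and an invented sample, is sketched here.

```python
def join_leads(docs, n=100):
    """Join the lead paragraphs of the first n records, skipping empty ones."""
    leads = [doc.get('lead_paragraph') for doc in docs[:n]]
    # Drop None/empty values and separate paragraphs with a single space
    return ' '.join(lead.strip() for lead in leads if lead)

# Invented records, including one null and one missing lead paragraph
sample = [
    {'lead_paragraph': 'First lead.'},
    {'lead_paragraph': None},
    {'headline': 'no lead here'},
    {'lead_paragraph': 'Second lead.'},
]
print(join_leads(sample))  # First lead. Second lead.
```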

In [23]:
def text_cleaner(text):
    text = re.sub(r'--', ' ', text)
    text = re.sub(r'\d', '', text)
    text = re.sub(r'\.', '. ', text)
    text = text.lower()
    return text
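To see what each substitution does, the sketch below runs the cleaner on a short made-up sample. It also shows a hypothetical variant, `text_cleaner_v2`, that adds a space only where a period is glued to the next letter and then collapses repeated whitespace; that variant is a suggestion, not part of the analysis above.

```python
import re

def text_cleaner(text):
    # Same steps as the cleaner above
    text = re.sub(r'--', ' ', text)      # double dashes -> space
    text = re.sub(r'\d', '', text)       # drop digits
    text = re.sub(r'\.', '. ', text)     # space after every period
    return text.lower()

def text_cleaner_v2(text):
    text = re.sub(r'--', ' ', text)
    text = re.sub(r'\d', '', text)
    # Only separate a period from a letter that directly follows it
    text = re.sub(r'\.(?=[A-Za-z])', '. ', text)
    # Collapse the gaps left behind by removed digits
    text = re.sub(r'\s+', ' ', text)
    return text.lower().strip()

sample = 'Worth a visit.With 10 nods--Nicholas D. Kristof reports.'
print(repr(text_cleaner(sample)))
# 'worth a visit. with  nods nicholas d.  kristof reports. '
print(repr(text_cleaner_v2(sample)))
# 'worth a visit. with nods nicholas d. kristof reports.'
```

The original cleaner leaves double spaces after initials such as "d." and where digits were removed, which matches the `d.  kristof` and `with  nods` fragments in the cleaned output.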

In [27]:
nlp = spacy.load('en')

nyt_2010_doc = nlp(text_cleaner(nyt_2010))
nyt_2000_doc = nlp(text_cleaner(nyt_2000))
nyt_1990_doc = nlp(text_cleaner(nyt_1990))
nyt_1980_doc = nlp(text_cleaner(nyt_1980))
nyt_1970_doc = nlp(text_cleaner(nyt_1970))
nyt_1960_doc = nlp(text_cleaner(nyt_1960))

print('Dirty:', nyt_2010[0:200])
print()
print('Clean:', nyt_2010_doc[0:200])

Dirty: Boulder’s Uptown, with its new shops and restaurants, is worth a visit.With 10 nods for his comeback album, "Recovery," Eminem led the nominations for the 53rd annual Grammy Awards, which were announc

Clean: boulder’s uptown, with its new shops and restaurants, is worth a visit. with  nods for his comeback album, "recovery," eminem led the nominations for the rd annual grammy awards, which were announced on wednesday night in a televised ceremony from los angeles. the best-selling novelist brad meltzer leads a team of investigators in exploring mysteries of american history. nicholas d.  kristof visits a haitian cholera treatment center. nicholas d.  kristof reports from haiti about toilets that aim to address the sanitation problems that lead to cholera, while also providing fertilizer to help farmers. tuesday's meeting in tampa, fla. , between the yankees, derek jeter and his agent was set in motion when the agent, casey close, called hal steinbrenner, the team's managing par

The cells above clean the text for each year, removing double dashes and digits. Next, let's combine the raw paragraphs from all of the years into one string.

In [None]:
years = [nyt_2010, nyt_2000, nyt_1990, nyt_1980, nyt_1970, nyt_1960]
nyt_all = ''
for year in years:
    nyt_all = nyt_all + year

Just checking the combined length, in characters, of all the paragraphs.

In [11]:
len(nyt_all)

224856

The cleaned paragraphs were already parsed with spaCy above, so the remaining step is to group them into sentences.

In [33]:
years = [nyt_2010_doc, nyt_2000_doc, nyt_1990_doc, nyt_1980_doc, nyt_1970_doc, nyt_1960_doc]
sentences = []
for year in years:
    for sentence in year.sents:
        sentence = [
            token.lemma
            for token in sentence
            if not token.is_stop
            and not token.is_punct
        ]
        sentences.append([sentence, year])

print(sentences[2])
print('We have {} sentences and {} tokens.'.format(len(sentences), len(nyt_all)))

[[6250203919658647129, 8777643931089885836, 3537922552554931340, 2355514407854807862, 8022691765955359053, 82546335403996757, 4303356597303500170, 17774862659366831948, 17283579609114212050, 8626295335712477701, 6042939320535660714, 6249630248276940353], boulder’s uptown, with its new shops and restaurants, is worth a visit. with  nods for his comeback album, "recovery," eminem led the nominations for the rd annual grammy awards, which were announced on wednesday night in a televised ceremony from los angeles. the best-selling novelist brad meltzer leads a team of investigators in exploring mysteries of american history. nicholas d.  kristof visits a haitian cholera treatment center. nicholas d.  kristof reports from haiti about toilets that aim to address the sanitation problems that lead to cholera, while also providing fertilizer to help farmers. tuesday's meeting in tampa, fla. , between the yankees, derek jeter and his agent was set in motion when the agent, casey close, called ha

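This still isn't working as I would expect. Three things stand out in the output: `token.lemma` returns spaCy's integer hash for the lemma rather than the string (`token.lemma_`), appending `year` attaches the entire parsed document to every sentence instead of a year label, and the count reported as tokens is actually the character length of `nyt_all`. I'll keep working on this.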
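A sketch of the sentence grouping that collects lemma strings and stores a year label with each sentence is shown below. `sentences_with_labels` is a hypothetical helper. It relies only on the `.sents`, `.lemma_`, `.is_stop` and `.is_punct` attributes, so the demo uses small stand-in objects rather than loading a spaCy model; the stand-in tokens are invented.

```python
from types import SimpleNamespace

def sentences_with_labels(docs_by_year):
    """Return [lemmas, year] pairs, one per sentence.

    docs_by_year maps a year label to a parsed spaCy Doc (or any object
    exposing the same attributes).
    """
    rows = []
    for year, doc in docs_by_year.items():
        for sent in doc.sents:
            lemmas = [
                token.lemma_  # string lemma; token.lemma is the integer hash
                for token in sent
                if not token.is_stop and not token.is_punct
            ]
            rows.append([lemmas, year])  # label with the year, not the Doc
    return rows

# Stand-in token and document objects for the demo (not real spaCy output)
def fake_token(lemma, is_stop=False, is_punct=False):
    return SimpleNamespace(lemma_=lemma, is_stop=is_stop, is_punct=is_punct)

fake_doc = SimpleNamespace(sents=[
    [fake_token('boulder'), fake_token('be', is_stop=True),
     fake_token('worth'), fake_token('.', is_punct=True)],
    [fake_token('eminem'), fake_token('lead'), fake_token('nomination')],
])

rows = sentences_with_labels({2010: fake_doc})
print(rows[0])  # [['boulder', 'worth'], 2010]
```

With the parsed documents above, the call would look like `sentences_with_labels({2010: nyt_2010_doc, 2000: nyt_2000_doc})` extended to all six years, and a true token count could be taken as the sum of `len(doc)` over the parsed documents.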