# 4.5.1 Unsupervised Learning Capstone
For this project you'll dig into a large amount of text and apply most of what you've covered in this unit and in the course so far.

First, pick a set of texts. This can be a series of novels, chapters, or articles; anything you'd like. It just has to have multiple entries of varying characteristics. At least 100 should be good. There should also be at least 10 different authors, but try to keep the texts related (either all on the same topic or from the same branch of literature; something to make classification a bit more difficult than obviously different subjects).

This capstone can be an extension of your NLP challenge if you wish to use the same corpus. If you found problems with that data set that limited your analysis, however, it may be worth using what you learned to choose a new corpus. Reserve 25% of your corpus as a test set.

The first technique is to create a series of clusters. Try several techniques and pick the one you think best represents your data. Make sure there is a narrative and reasoning around why you have chosen the given clusters. Are authors consistently grouped into the same cluster?

Next, perform some unsupervised feature generation and selection using the techniques covered in this unit and elsewhere in the course. Using those features then build models to attempt to classify your texts by author. Try different permutations of unsupervised and supervised techniques to see which combinations have the best performance.

Lastly, return to your holdout group. Does your clustering on those members perform as you'd expect? Have your clusters remained stable or changed dramatically? What about your model? Is its performance consistent?

If there is a divergence in the relative stability of your model and your clusters, delve into why.

Your end result should be a write-up of how clustering and modeling compare for classifying your texts. What are the advantages of each? Why would you want to use one over the other? Approximately 3-5 pages is a good length for your write-up, and remember to include visuals to help tell your story!

In [1]:
# Necessary imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import requests
import re
import spacy
import sklearn
from sklearn.model_selection import train_test_split

## Text Selection and Cleaning
For my corpus, I used the New York Times Archive API to pull articles published in December of every tenth year from 1960 to 2010.

In [2]:
nyt_api = '5cb4f9a5273b4fbf97ef0d7d01eb6273'
# Archive endpoint for December of a given year; the key is passed as a query parameter
archive_url = 'http://api.nytimes.com/svc/archive/v1/{}/12.json?api-key=' + nyt_api
# Get Requests to pull JSON data
request_2010 = requests.get(archive_url.format(2010))
request_2000 = requests.get(archive_url.format(2000))
request_1990 = requests.get(archive_url.format(1990))
request_1980 = requests.get(archive_url.format(1980))
request_1970 = requests.get(archive_url.format(1970))
request_1960 = requests.get(archive_url.format(1960))
# Gathering responses from JSON data
response_2010 = request_2010.json()
response_2000 = request_2000.json()
response_1990 = request_1990.json()
response_1980 = request_1980.json()
response_1970 = request_1970.json()
response_1960 = request_1960.json()

In [3]:
# Selecting document information from JSON
docs_2010 = response_2010['response']['docs']
docs_2000 = response_2000['response']['docs']
docs_1990 = response_1990['response']['docs']
docs_1980 = response_1980['response']['docs']
docs_1970 = response_1970['response']['docs']
docs_1960 = response_1960['response']['docs']

Great, now that I've gathered the texts, let's look at a sample lead paragraph to see what we're working with.

In [4]:
[docs_2010[2]['lead_paragraph']]

['The best-selling novelist Brad Meltzer leads a team of investigators in exploring mysteries of American history.']

Now, let's extract the lead paragraph from each of the first 100 articles from each year. 

In [5]:
# Pair each lead paragraph with its publication year
nyt_2010 = [[article['lead_paragraph'], '2010'] for article in docs_2010[0:100]]
nyt_2000 = [[article['lead_paragraph'], '2000'] for article in docs_2000[0:100]]
nyt_1990 = [[article['lead_paragraph'], '1990'] for article in docs_1990[0:100]]
nyt_1980 = [[article['lead_paragraph'], '1980'] for article in docs_1980[0:100]]
nyt_1970 = [[article['lead_paragraph'], '1970'] for article in docs_1970[0:100]]
nyt_1960 = [[article['lead_paragraph'], '1960'] for article in docs_1960[0:100]]

Now that we have gathered the paragraphs and labeled each with its publication year, let's combine all of the years into one data frame that we can use for analysis and modeling.

In [6]:
years = [nyt_2010, nyt_2000, nyt_1990, nyt_1980, nyt_1970, nyt_1960]
# Build all rows first, then create the frame once. This iterates over each
# year's own length and avoids DataFrame.append, which was removed in pandas 2.0.
rows = [{'lead_paragraph': paragraph, 'year': label}
        for year in years
        for paragraph, label in year]
nyt_all = pd.DataFrame(rows, columns=['lead_paragraph', 'year'])
nyt_all.head()

| | lead_paragraph | year |
|---|---|---|
| 0 | Boulder’s Uptown, with its new shops and resta... | 2010 |
| 1 | With 10 nods for his comeback album, "Recovery... | 2010 |
| 2 | The best-selling novelist Brad Meltzer leads a... | 2010 |
| 3 | Nicholas D. Kristof visits a Haitian cholera t... | 2010 |
| 4 | Nicholas D. Kristof reports from Haiti about t... | 2010 |


To prepare the lead paragraphs for analysis, let's define a cleaning function that does four things: replaces double dashes with a space, since they trip up the language processing later on; removes all digits, which will not provide any decade-relevant information; lowercases all words for consistency; and collapses extra white space.

In [7]:
def text_cleaner(text):
    text = re.sub(r'--', ' ', text)
    text = re.sub(r'\d', '', text)
    #text = re.sub(r'\.', '. ', text)
    text = text.lower()
    text = ' '.join(text.split())
    return text
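
As a quick check of what the cleaner does, the sketch below runs the same steps on a made-up raw lead paragraph. The `raw` string is a hypothetical example written for illustration, not text taken from the API response.

```python
import re

def text_cleaner(text):
    text = re.sub(r'--', ' ', text)   # double dashes become a space
    text = re.sub(r'\d', '', text)    # drop every digit
    text = text.lower()               # lowercase for consistency
    return ' '.join(text.split())     # collapse runs of white space

# Hypothetical raw paragraph, for illustration only
raw = 'BUFFALO, Nov. 30 (UPI) -- A blizzard-led   storm system'
cleaned = text_cleaner(raw)
print(cleaned)
```

Note that the single hyphen in `blizzard-led` survives, because only double dashes are replaced.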

In [8]:
# Clean all lead paragraphs
nyt_all['lead_paragraph'] = nyt_all.lead_paragraph.map(lambda x: text_cleaner(str(x)))
nyt_all.head()

| | lead_paragraph | year |
|---|---|---|
| 0 | boulder’s uptown, with its new shops and resta... | 2010 |
| 1 | with nods for his comeback album, "recovery," ... | 2010 |
| 2 | the best-selling novelist brad meltzer leads a... | 2010 |
| 3 | nicholas d. kristof visits a haitian cholera t... | 2010 |
| 4 | nicholas d. kristof reports from haiti about t... | 2010 |


Finally, we will set aside 25% of the corpus as a test set. 

In [9]:
# Identifying variables
X = nyt_all['lead_paragraph']
y = nyt_all['year']

# Splitting into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

Now that we have our lead paragraphs labeled and cleaned, we are ready for further analysis of these texts.

## Turning Words into Values
### Tf-idf Vectorization
To cluster our data, we first turn each paragraph into a vector. Term frequency-inverse document frequency (tf-idf) vectorization weights each word by how often it appears in a paragraph, then scales that weight down for words that appear in many paragraphs, so words that are distinctive to a paragraph score highest. The result is one vector per paragraph, with one entry per word in the vocabulary.

For this vectorizer, we will use the following parameters:
- Drop the words that occur in more than 80% of the paragraphs (`max_df=0.8`)
- Only use words that appear in at least 10 paragraphs (`min_df=10`)
- Drop English stop words
- Skip the vectorizer's own lowercasing, since the cleaning step already lowercased the text
- Apply a correction factor so that longer paragraphs and shorter paragraphs get treated equally
- Add 1 to all document frequencies to prevent divide-by-zero errors
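
To make the weighting concrete, here is a minimal pure-Python sketch of smoothed tf-idf followed by L2 normalization, which is my understanding of what `TfidfVectorizer` computes with `smooth_idf=True` and `norm='l2'`. The three toy documents and the helper name `tfidf_vectors` are made up for illustration; this is not the library's implementation.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document (hypothetical helper)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed idf: adding 1 to both counts prevents division by zero
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        raw = {t: c * idf[t] for t, c in counts.items()}
        # L2 normalization so long and short documents are comparable
        norm = math.sqrt(sum(w * w for w in raw.values()))
        vectors.append({t: w / norm for t, w in raw.items()})
    return vectors

docs = ['storm hits buffalo', 'storm closes roads', 'senate votes today']
vectors = tfidf_vectors(docs)
print(vectors[0])
```

In the first toy document, the shared term `storm` ends up with a lower weight than `buffalo`, which appears in only one document.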

In [10]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_df=0.8,
                            min_df=10,
                            stop_words='english',
                            lowercase=False,
                            use_idf=True,
                            norm=u'l2',
                            smooth_idf=True)
# Applying the vectorizer
X_tfidf = vectorizer.fit_transform(X)
print('Number of features: {}'.format(X_tfidf.get_shape()[1]))

# Splitting into train and test sets
X_train_tfidf, X_test_tfidf, y_train, y_test = train_test_split(X_tfidf, y, test_size=0.25, random_state=42)

# Reshape vectorizer to readable content
X_train_tfidf_csr = X_train_tfidf.tocsr()

# Number of paragraphs
n = X_train_tfidf_csr.shape[0]

# A list of dictionaries, one per paragraph
tfidf_bypara = [{} for _ in range(0,n)]

# List of features (in scikit-learn 1.0 and later this method is named get_feature_names_out())
terms = vectorizer.get_feature_names()

# For each paragraph, lists the feature words and their tf-idf scores
for i, j in zip(*X_train_tfidf_csr.nonzero()):
    tfidf_bypara[i][terms[j]] = X_train_tfidf_csr[i, j]

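# With smooth_idf=True, every term present in a paragraph gets a positive weight,
# so only absent terms (or terms filtered out by the vectorizer) are missing from the dictionary.
# Row 0 of the training matrix is the first row of X_train, since both splits use random_state=42.
print('Original sentence:', X_train.iloc[0])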
print('Tf_idf vector:', tfidf_bypara[0])
    

Number of features: 216
Original sentence: buffalo, nov. (upi) a blizzard-led storm system crippled transportation in upstate cities today.
Tf_idf vector: {'nov': 0.7002728551885559, 'today': 0.7138752890288806}


### Dimension Reduction
Now, to make clustering more tractable, we will use dimensionality reduction in the form of singular value decomposition (SVD) to shrink the feature space.

In [11]:
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

# SVD data reducer.  We are going to reduce the feature space from 216 to 100.
svd = TruncatedSVD(100)
lsa = make_pipeline(svd, Normalizer(copy=False))
# Run SVD on the training data, then project the training data.
X_train_lsa = lsa.fit_transform(X_train_tfidf)

total_variance = svd.explained_variance_ratio_.sum()
print("Percent variance captured by all components:",total_variance*100)


Percent variance captured by all components: 81.14186195750321


With about 81% of the variance retained in 100 components, the reduction keeps most of the information in the tf-idf features. Now we're ready for clustering.
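
For reference, a variance figure like the one above can be reproduced by hand. The sketch below uses plain NumPy on a small random matrix (`X_toy`, a stand-in for the tf-idf matrix, not the real data) and follows my understanding of how `TruncatedSVD` reports `explained_variance_ratio_`: the variance of each projected component divided by the total variance of the original features.

```python
import numpy as np

rng = np.random.default_rng(0)
X_toy = rng.random((50, 8))      # toy stand-in for a (paragraphs x terms) matrix
k = 3                            # number of components to keep

# Full SVD, then keep the k largest singular directions
U, s, Vt = np.linalg.svd(X_toy, full_matrices=False)
X_reduced = U[:, :k] * s[:k]     # equivalent to X_toy @ Vt[:k].T

# Share of the total feature variance carried by each kept component
component_var = X_reduced.var(axis=0)
total_var = X_toy.var(axis=0).sum()
ratio = component_var / total_var
print('Percent variance captured:', ratio.sum() * 100)
```

Raising `k` can only increase the captured share, which is one way to pick the number of components for a target percentage.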


## Clustering
We will first try to create a series of clusters to group the paragraphs to see if the clusters group according to decades or other themes.  We will explore a couple of different clustering methods to see which best models the data.

### K-Means Clustering
We will start with K-Means clustering, which assigns each point to a cluster so as to minimize the inertia: the sum of squared distances between each data point and the mean of its cluster.
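
As a small illustration of the quantity K-Means minimizes, the sketch below computes inertia by hand for two labelings of four made-up 2-D points. The helper name `inertia` and the toy data are assumptions for illustration only.

```python
def inertia(points, labels):
    """Sum of squared distances from each point to its cluster mean."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    total = 0.0
    for members in clusters.values():
        dim = len(members[0])
        centroid = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        total += sum((p[d] - centroid[d]) ** 2 for p in members for d in range(dim))
    return total

points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
tight = inertia(points, [0, 0, 1, 1])   # each cluster holds nearby points
loose = inertia(points, [0, 1, 0, 1])   # each cluster mixes far-apart points
print(tight, loose)
```

The labeling that groups nearby points has far lower inertia, which is why K-Means prefers it.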

In [12]:
from sklearn.cluster import KMeans

# Calculate predicted values
# Note: k-means is fit on the full tf-idf matrix here; the later models use the LSA-reduced data
kmeans = KMeans(n_clusters=6, init='k-means++', random_state=42, n_init=10)
y_pred = kmeans.fit_predict(X_train_tfidf)

pd.crosstab(y_train, y_pred)

| year | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 1960 | 9 | 46 | 8 | 0 | 0 | 9 |
| 1970 | 6 | 38 | 9 | 0 | 2 | 24 |
| 1980 | 9 | 30 | 6 | 7 | 17 | 12 |
| 1990 | 11 | 18 | 5 | 16 | 27 | 1 |
| 2000 | 8 | 21 | 3 | 14 | 18 | 8 |
| 2010 | 3 | 46 | 0 | 10 | 7 | 2 |


In [13]:
from sklearn.metrics import adjusted_rand_score
from sklearn.metrics import silhouette_score

print('Adjusted Rand Score: {:0.7}'.format(adjusted_rand_score(y_train, y_pred)))
print('Silhouette Score: {:0.7}'.format(silhouette_score(X_train_lsa, y_pred, metric='euclidean')))

Adjusted Rand Score: 0.04835364
Silhouette Score: 0.06748597


An adjusted Rand score near 0 means the clusters agree with the decade labels no better than random assignment would, and a silhouette score near 0 means the clusters overlap heavily. Neither is good.  Let's try another model.
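
To show why a score near 0 reads as "random", here is a minimal pure-Python sketch of the adjusted Rand index computed from pair counts. The function name `adjusted_rand` and the toy labels are made up for illustration, and the sketch assumes the labelings are not both trivial (otherwise the denominator is zero).

```python
from collections import Counter
from math import comb

def adjusted_rand(labels_true, labels_pred):
    n = len(labels_true)
    # Pairs of items that share both a true label and a predicted cluster
    cells = Counter(zip(labels_true, labels_pred))
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())
    # Agreement expected by chance, and the maximum possible agreement
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    return (sum_cells - expected) / (max_index - expected)

true_years = ['1960', '1960', '1970', '1970', '1980', '1980']
perfect = adjusted_rand(true_years, [2, 2, 0, 0, 1, 1])   # same grouping, renamed
mixed = adjusted_rand(true_years, [0, 1, 0, 1, 0, 1])     # every cluster mixes decades
print(perfect, mixed)
```

Cluster names do not matter, only which items end up together, so the renamed grouping scores 1 while the mixed grouping scores at or below 0.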

### Mini Batch K Means

In [14]:
from sklearn.cluster import MiniBatchKMeans
minikmeans = MiniBatchKMeans(n_clusters=7, init='k-means++', random_state=42, init_size=1000, batch_size=1000)
y_pred2 = minikmeans.fit_predict(X_train_lsa)

pd.crosstab(y_train, y_pred2)

| year | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| 1960 | 63 | 0 | 0 | 0 | 0 | 9 | 0 |
| 1970 | 67 | 5 | 0 | 0 | 2 | 5 | 0 |
| 1980 | 60 | 7 | 6 | 0 | 3 | 5 | 0 |
| 1990 | 56 | 10 | 2 | 2 | 3 | 4 | 1 |
| 2000 | 57 | 1 | 1 | 3 | 3 | 6 | 1 |
| 2010 | 63 | 1 | 1 | 0 | 1 | 2 | 0 |


In [15]:
print('Adjusted Rand Score: {:0.7}'.format(adjusted_rand_score(y_train, y_pred2)))
print('Silhouette Score: {:0.7}'.format(silhouette_score(X_train_lsa, y_pred2, metric='euclidean')))

Adjusted Rand Score: 0.0004316068
Silhouette Score: 0.0599363


This model puts most paragraphs from every decade into a single cluster (cluster 0), so the adjusted Rand score is again essentially zero.  Let's try again.

### Spectral Clustering

In [16]:
from sklearn.cluster import SpectralClustering

n_clusters = 6
sc = SpectralClustering(n_clusters=n_clusters)
y_pred3 = sc.fit_predict(X_train_lsa)
print('Number of estimated clusters: {}'.format(n_clusters))

pd.crosstab(y_train, y_pred3)

Number of estimated clusters: 6


| year | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 1960 | 21 | 4 | 1 | 6 | 8 | 32 |
| 1970 | 22 | 13 | 1 | 16 | 5 | 22 |
| 1980 | 47 | 9 | 3 | 10 | 4 | 8 |
| 1990 | 49 | 15 | 8 | 1 | 4 | 1 |
| 2000 | 45 | 3 | 9 | 6 | 6 | 3 |
| 2010 | 56 | 1 | 4 | 0 | 2 | 5 |


In [17]:
print('Adjusted Rand Score: {:0.7}'.format(adjusted_rand_score(y_train, y_pred3)))
print('Silhouette Score: {:0.7}'.format(silhouette_score(X_train_lsa, y_pred3, metric='euclidean')))

Adjusted Rand Score: 0.05072691
Silhouette Score: -0.04634463


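The adjusted Rand score is ever so slightly higher, but it is still close to 0, so the grouping is no better than random assignment, and the negative silhouette score suggests many paragraphs sit closer to a neighboring cluster than to their own.  Let's try a final model.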
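
The silhouette score printed above is slightly negative, so it may help to see how the sign arises. The sketch below is a toy pure-Python version of the mean silhouette coefficient with Euclidean distance. The helper name `silhouette` and the four sample points are assumptions for illustration, and it expects at least two clusters.

```python
from math import dist

def silhouette(points, labels):
    """Mean silhouette coefficient with Euclidean distance (toy sketch)."""
    scores = []
    for i, (p, own) in enumerate(zip(points, labels)):
        # Distances from p to every other point, grouped by cluster
        by_cluster = {}
        for j, (q, lab) in enumerate(zip(points, labels)):
            if i != j:
                by_cluster.setdefault(lab, []).append(dist(p, q))
        if own not in by_cluster:          # a singleton cluster scores 0
            scores.append(0.0)
            continue
        a = sum(by_cluster[own]) / len(by_cluster[own])    # mean distance within own cluster
        b = min(sum(d) / len(d)                            # mean distance to nearest other cluster
                for lab, d in by_cluster.items() if lab != own)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
good = silhouette(points, [0, 0, 1, 1])   # neighbors share a cluster
bad = silhouette(points, [0, 1, 0, 1])    # neighbors are split apart
print(good, bad)
```

A point scores below 0 whenever it is, on average, closer to another cluster than to its own, which is what the second labeling forces.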

### Affinity Propagation

In [18]:
from sklearn.cluster import AffinityPropagation

af = AffinityPropagation(damping=0.6, max_iter=550, copy=False)
y_pred4 = af.fit_predict(X_train_lsa)

cluster_centers_indices = af.cluster_centers_indices_
n_clusters = len(cluster_centers_indices)
print('Number of estimated clusters: {}'.format(n_clusters))

pd.crosstab(y_train, y_pred4)

Number of estimated clusters: 22


| year | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1960 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 57 | 0 | 0 | ... | 1 | 1 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 5 |
| 1970 | 0 | 0 | 2 | 1 | 2 | 3 | 2 | 49 | 0 | 2 | ... | 3 | 0 | 0 | 2 | 2 | 1 | 0 | 1 | 1 | 3 |
| 1980 | 2 | 0 | 0 | 0 | 2 | 1 | 0 | 52 | 1 | 4 | ... | 0 | 0 | 4 | 1 | 1 | 2 | 2 | 1 | 2 | 3 |
| 1990 | 3 | 2 | 0 | 0 | 0 | 1 | 0 | 62 | 0 | 4 | ... | 1 | 1 | 0 | 1 | 0 | 2 | 0 | 1 | 0 | 0 |
| 2000 | 2 | 2 | 0 | 6 | 0 | 0 | 0 | 52 | 1 | 1 | ... | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 2 |
| 2010 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 55 | 2 | 0 | ... | 0 | 3 | 1 | 0 | 2 | 0 | 1 | 0 | 0 | 1 |


In [19]:
print('Adjusted Rand Score: {:0.7}'.format(adjusted_rand_score(y_train, y_pred4)))
print('Silhouette Score: {:0.7}'.format(silhouette_score(X_train_lsa, y_pred4, metric='euclidean')))

Adjusted Rand Score: 0.000568541
Silhouette Score: 0.07617274


None of these models groups paragraphs by decade any better than random assignment, even after trying to tune their parameters. This might be because the same kinds of words are used in every decade, and specific events weren't covered often enough in the data set to set the decades apart.

## Feature Generation and Selection


In [35]:
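# The 'en' shortcut works in spaCy 2.x; spaCy 3 and later need a full model name such as 'en_core_web_sm'
nlp = spacy.load('en')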

years = [nyt_2010, nyt_2000, nyt_1990, nyt_1980, nyt_1970, nyt_1960]
sentences = []
for year in years:
    for i in range(len(year)):
        nyt_clean = text_cleaner(str(year[i][0]))
        nyt_nlp_sent = nlp(nyt_clean)
        for sentence in nyt_nlp_sent.sents:
            sentence = [
                token.lemma_.lower()
                for token in sentence
                if not token.is_stop
                and not token.is_punct
            ]
            sentences.append(sentence)

print(sentences[2])

['best', 'sell', 'novelist', 'brad', 'meltzer', 'lead', 'team', 'investigator', 'explore', 'mystery', 'american', 'history']


In [49]:
import gensim
from gensim.models import word2vec

model = word2vec.Word2Vec(
    sentences,
    workers=2,     # Number of threads to run in parallel (if your computer does parallel processing).
    min_count=10,  # Minimum word count threshold.
    window=3,      # Number of words around target word to consider.
    sg=0,          # Use CBOW because our corpus is small.
    sample=1e-3 ,  # Penalize frequent words.
    size=300,      # Word vector length (this argument is named vector_size in gensim 4.0 and later).
    hs=1           # Use hierarchical softmax.
)

print('done!')

done!


In [51]:
# List of words in model (in gensim 4.0 and later, use model.wv.key_to_index instead of model.wv.vocab).
vocab = model.wv.vocab.keys()
vocab

dict_keys(['new', 'visit', 'lead', 'award', 'announce', 'wednesday', 'night', 'sell', 'team', 'american', 'history', 'treatment', 'center', 'report', 'problem', 'provide', 'help', 'tuesday', "'s", 'meeting', 'yankee', 'set', 'close', 'call', 'suggest', 'contract', 'health', 'organization', 'year', 'little', 'exchange', 'service', 'week', 'be', 'work', 'time', 'day', 'world', 'field', 'live', 'kid', 'speak', 'design', 'name', 'sunday', 'season', 'university', 'want', '$', 'million', 'art', 'money', 'plan', 'court', 'decision', 'sale', 'billion', 'high', 'price', 'pay', 'company', 'like', 'cut', 'rate', 'deal', 'continue', 'point', 'issue', 'long', 'right', 'require', 'rule', 'house', 'republican', 'national', 'committee', 'place', 'leave', 'month', 'investigation', 'water', 'democrat', 'bank', 'share', 'base', 'large', 'united', 'state', 'make', 'stock', 'see', 'york', 'open', 'big', 'san', 'area', 'west', 'south', 'seek', 'party', 'need', 'fund', 'network', 'hope', 'president', 'east',