# 4.5.1 Unsupervised Learning Capstone
For this project you'll dig into a large amount of text and apply most of what you've covered in this unit and in the course so far.

First, pick a set of texts. This can be a series of novels, chapters, or articles. Anything you'd like. It just has to have multiple entries of varying characteristics. At least 100 should be good. There should also be at least 10 different authors, but try to keep the texts related (either all on the same topic or from the same branch of literature - something to make classification a bit more difficult than obviously different subjects).

This capstone can be an extension of your NLP challenge if you wish to use the same corpus. If you found problems with that data set that limited your analysis, however, it may be worth using what you learned to choose a new corpus. Reserve 25% of your corpus as a test set.

The first technique is to create a series of clusters. Try several techniques and pick the one you think best represents your data. Make sure there is a narrative and reasoning around why you have chosen the given clusters. Are authors consistently grouped into the same cluster?

Next, perform some unsupervised feature generation and selection using the techniques covered in this unit and elsewhere in the course. Using those features, build models to attempt to classify your texts by author. Try different permutations of unsupervised and supervised techniques to see which combinations have the best performance.

Lastly, return to your holdout group. Does your clustering on those members perform as you'd expect? Have your clusters remained stable or changed dramatically? What about your model? Is its performance consistent?

If there is a divergence in the relative stability of your model and your clusters, delve into why.

Your end result should be a write up of how clustering and modeling compare for classifying your texts. What are the advantages of each? Why would you want to use one over the other? Approximately 3-5 pages is a good length for your write up, and remember to include visuals to help tell your story!

In [1]:
# Necessary imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import requests
import re
import spacy
from collections import Counter
import sklearn
from sklearn.model_selection import train_test_split

## Text Selection and Cleaning
For my selected texts, I decided to use the New York Times article API to get a selection of texts, all from December, every 10 years from 1960 to 2010. 

In [2]:
nyt_api = '5cb4f9a5273b4fbf97ef0d7d01eb6273'
# Get Requests to pull JSON data
request_2010 = requests.get('http://api.nytimes.com/svc/archive/v1/2010/12.json?api-key=5cb4f9a5273b4fbf97ef0d7d01eb6273')
request_2000 = requests.get('http://api.nytimes.com/svc/archive/v1/2000/12.json?api-key=5cb4f9a5273b4fbf97ef0d7d01eb6273')
request_1990 = requests.get('http://api.nytimes.com/svc/archive/v1/1990/12.json?api-key=5cb4f9a5273b4fbf97ef0d7d01eb6273')
request_1980 = requests.get('http://api.nytimes.com/svc/archive/v1/1980/12.json?api-key=5cb4f9a5273b4fbf97ef0d7d01eb6273')
request_1970 = requests.get('http://api.nytimes.com/svc/archive/v1/1970/12.json?api-key=5cb4f9a5273b4fbf97ef0d7d01eb6273')
request_1960 = requests.get('http://api.nytimes.com/svc/archive/v1/1960/12.json?api-key=5cb4f9a5273b4fbf97ef0d7d01eb6273')
# Gathering responses from JSON data
response_2010 = request_2010.json()
response_2000 = request_2000.json()
response_1990 = request_1990.json()
response_1980 = request_1980.json()
response_1970 = request_1970.json()
response_1960 = request_1960.json()
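
The six requests above share one URL pattern, so the addresses could also be generated in a loop. The sketch below only builds the URLs and makes no network calls. The helper name `archive_url` and the placeholder key `DEMO_KEY` are assumptions for illustration and are not part of the original notebook.

```python
def archive_url(year, month, api_key):
    """Build an Archive API URL following the pattern used in the requests above."""
    return 'http://api.nytimes.com/svc/archive/v1/{}/{}.json?api-key={}'.format(
        year, month, api_key)

# Placeholder key (assumption); in practice the key could be read from an
# environment variable instead of being written into the notebook.
api_key = 'DEMO_KEY'

# December of every tenth year from 1960 to 2010
urls = {year: archive_url(year, 12, api_key) for year in range(1960, 2011, 10)}

for year, url in urls.items():
    print(year, url)
```

Each URL could then be passed to `requests.get` exactly as in the cell above.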

In [3]:
# Selecting document information from JSON
docs_2010 = response_2010['response']['docs']
docs_2000 = response_2000['response']['docs']
docs_1990 = response_1990['response']['docs']
docs_1980 = response_1980['response']['docs']
docs_1970 = response_1970['response']['docs']
docs_1960 = response_1960['response']['docs']

Great, now that I've gathered the texts, let's see a sampling of a lead paragraph to see what we're getting into.

In [4]:
[docs_2010[2]['lead_paragraph']]

['The best-selling novelist Brad Meltzer leads a team of investigators in exploring mysteries of American history.']

Now, let's extract the lead paragraph from each of the first 100 articles from each year. 

In [5]:
nyt_2010 = []
for article in docs_2010[0:100]:
    art = [article['lead_paragraph']]
    for i in range(len(art)):
        nyt_2010.append([art[i], '2010'])

nyt_2000 = []
for article in docs_2000[0:100]:
    art = [article['lead_paragraph']]
    for i in range(len(art)):
        nyt_2000.append([art[i], '2000'])

nyt_1990 = []
for article in docs_1990[0:100]:
    art = [article['lead_paragraph']]
    for i in range(len(art)):
        nyt_1990.append([art[i], '1990'])

nyt_1980 = []
for article in docs_1980[0:100]:
    art = [article['lead_paragraph']]
    for i in range(len(art)):
        nyt_1980.append([art[i], '1980'])

nyt_1970 = []
for article in docs_1970[0:100]:
    art = [article['lead_paragraph']]
    for i in range(len(art)):
        nyt_1970.append([art[i], '1970'])
        
nyt_1960 = []
for article in docs_1960[0:100]:
    art = [article['lead_paragraph']]
    for i in range(len(art)):
        nyt_1960.append([art[i], '1960'])
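
The six loops above follow the same pattern, so they could be condensed into one helper. This is a minimal sketch, assuming each article is a dictionary with a `lead_paragraph` key as in the API response. The helper name `label_paragraphs` and the `fake_docs` list are hypothetical and used only for illustration.

```python
def label_paragraphs(docs, year, limit=100):
    """Return [lead_paragraph, year] pairs for the first `limit` articles."""
    return [[article['lead_paragraph'], year] for article in docs[:limit]]

# Made-up stand-in for the API response, for illustration only
fake_docs = [
    {'lead_paragraph': 'First paragraph.'},
    {'lead_paragraph': None},
    {'lead_paragraph': 'Third paragraph.'},
]

labeled = label_paragraphs(fake_docs, '2010', limit=2)
print(labeled)
```

With the real data, a call such as `label_paragraphs(docs_2010, '2010')` would produce the same list as the first loop above.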

Now that we have gathered the information and labeled each with its respective year published, let's combine all of these years into one data frame to then be able to manipulate and use for analysis and modeling.

In [6]:
years = [nyt_2010, nyt_2000, nyt_1990, nyt_1980, nyt_1970, nyt_1960]
# Flatten the per-year lists into one list of [lead_paragraph, year] rows.
# Iterating over each list directly avoids assuming every year has as many rows as 2010.
rows = [row for year in years for row in year]
nyt_all = pd.DataFrame(rows, columns=['lead_paragraph', 'year'])
nyt_all.head()

Unnamed: 0,lead_paragraph,year
0,"Boulder’s Uptown, with its new shops and resta...",2010
1,"With 10 nods for his comeback album, ""Recovery...",2010
2,The best-selling novelist Brad Meltzer leads a...,2010
3,Nicholas D. Kristof visits a Haitian cholera t...,2010
4,Nicholas D. Kristof reports from Haiti about t...,2010


To prepare these lead paragraphs for analysis, let's define a cleaning function that does four things: replaces double dashes, which the natural language processing tools do not handle well, with spaces; removes all digits, which do not provide any decade-relevant information; converts all words to lowercase for consistency; and collapses all extra white space.

In [7]:
def text_cleaner(text):
    text = re.sub(r'--', ' ', text)
    text = re.sub(r'\d', '', text)
    #text = re.sub(r'\.', '. ', text)
    text = text.lower()
    text = ' '.join(text.split())
    return text
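
To make the effect of each cleaning step concrete, here is the same function applied to a made-up dateline. The sample string is an assumption written for illustration and does not come from the corpus.

```python
import re

def text_cleaner(text):
    text = re.sub(r'--', ' ', text)   # replace double dashes with a space
    text = re.sub(r'\d', '', text)    # remove every digit
    text = text.lower()               # lowercase for consistency
    return ' '.join(text.split())     # collapse repeated white space

# Hypothetical input, for illustration only
sample = 'BUFFALO, Nov. 30 (UPI) -- A storm   hit 3 cities.'
cleaned = text_cleaner(sample)
print(cleaned)
```

Note that removing digits also removes dates such as the day of the month, which is why cleaned datelines in the corpus read "nov. (upi)".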

In [8]:
# Clean all lead paragraphs
nyt_all['lead_paragraph'] = nyt_all.lead_paragraph.map(lambda x: text_cleaner(str(x)))
nyt_all.head()

Unnamed: 0,lead_paragraph,year
0,"boulder’s uptown, with its new shops and resta...",2010
1,"with nods for his comeback album, ""recovery,"" ...",2010
2,the best-selling novelist brad meltzer leads a...,2010
3,nicholas d. kristof visits a haitian cholera t...,2010
4,nicholas d. kristof reports from haiti about t...,2010


Finally, we will set aside 25% of the corpus as a test set. 

In [9]:
# Identifying variables
X = nyt_all['lead_paragraph']
y = nyt_all['year']

# Splitting into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

Now that we have our lead paragraphs labeled and cleaned, we are ready for further analysis of these texts.

## Feature Generation 
### Tf-idf Vectorization
To cluster our data, we will first turn our paragraphs into vectors. Term frequency-inverse document frequency (Tf-idf) vectorization weights each word by how often it appears in a paragraph and down-weights words that appear in many paragraphs. The result is one vector per paragraph, with one entry for each word in the vocabulary.

For this vectorizer, we will use the following parameters:
- Drop the words that occur in more than half the paragraphs
- Only use words that appear in at least 3 paragraphs
- Drop English stop words
- Skip lowercasing, since the text was already lowercased during cleaning
- Apply a correction factor so that longer paragraphs and shorter paragraphs get treated equally
- Add 1 to all document frequencies to prevent divide-by-zero errors
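
The weighting behind these settings can be sketched in plain Python. The example below assumes the smoothed formula documented for scikit-learn, idf = ln((1 + n) / (1 + df)) + 1, followed by l2 normalization. The three toy documents and the function names are made up for illustration, and the sketch leaves out the `max_df`, `min_df`, and stop-word filtering.

```python
import math
from collections import Counter

# Toy corpus (assumption), already lowercased
docs = [
    'storm hits buffalo',
    'storm closes roads',
    'senate passes budget',
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: the number of documents containing each term
df = Counter(term for doc in tokenized for term in set(doc))

def smooth_idf(term):
    """Smoothed idf: adding 1 to both counts avoids division by zero."""
    return math.log((1 + n_docs) / (1 + df[term])) + 1

def tfidf_vector(doc):
    """Return the l2-normalized tf-idf weights of one tokenized document."""
    tf = Counter(doc)
    raw = {t: tf[t] * smooth_idf(t) for t in tf}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {t: v / norm for t, v in raw.items()}

vec = tfidf_vector(tokenized[0])
print(vec)
```

In this toy corpus "buffalo" appears in one document and "storm" in two, so "buffalo" receives the larger weight, and the l2 step gives every document vector unit length regardless of how long the document is.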

In [10]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_df=0.5,
                            min_df=3,
                            stop_words='english',
                            lowercase=False,
                            use_idf=True,
                            norm=u'l2',
                            smooth_idf=True)
# Applying the vectorizer
X_tfidf = vectorizer.fit_transform(X)
print('Number of features: {}'.format(X_tfidf.get_shape()[1]))

# Splitting into train and test sets
X_train_tfidf, X_test_tfidf, y_train, y_test = train_test_split(X_tfidf, y, test_size=0.25, random_state=42)

# Reshape vectorizer to readable content
X_train_tfidf_csr = X_train_tfidf.tocsr()

# Number of paragraphs
n = X_train_tfidf_csr.shape[0]

# A list of dictionaries, one per paragraph
tfidf_bypara = [{} for _ in range(0,n)]

# List of features
terms = vectorizer.get_feature_names()

# For each paragraph, lists the feature words and their tf-idf scores
for i, j in zip(*X_train_tfidf_csr.nonzero()):
    tfidf_bypara[i][terms[j]] = X_train_tfidf_csr[i, j]

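# With smooth_idf=True, every word present in a paragraph gets a positive tf-idf score;
# words that are absent from the paragraph are simply not stored in the dictionary.
# Use positional indexing so the printed sentence matches the first training vector.
print('Original sentence:', X_train.iloc[0])
print('Tf_idf vector:', tfidf_bypara[0])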
    

Number of features: 1599
Original sentence: buffalo, nov. (upi) a blizzard-led storm system crippled transportation in upstate cities today.
Tf_idf vector: {'buffalo': 0.4128383215031412, 'transportation': 0.3998365450453077, 'upi': 0.3634083755741318, 'cities': 0.3998365450453077, 'storm': 0.3998365450453077, 'nov': 0.20515178693203054, 'today': 0.20913675306100607, 'led': 0.3634083755741318}


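Finally, we will normalize the Tf-idf vectors so that each paragraph has unit length before clustering. Because the vectorizer already applies norm='l2', this step mostly acts as a safeguard.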

In [11]:
from sklearn.preprocessing import normalize
X_norm = normalize(X_train_tfidf)

### Natural Language Processing - spaCy
Next, we will tokenize each sentence to be able to extract information about parts of speech to add as features in our models.

In [12]:
# Instantiating spaCy
nlp = spacy.load('en')
X_train_words = []

for row in X_train:
    # Processing each row for tokens
    row_doc = nlp(row)
    # Calculating length of each sentence
    sent_len = len(row_doc) 
    # Initializing counts of different parts of speech
    advs = 0
    verb = 0
    noun = 0
    adj = 0
    for token in row_doc:
        # Identifying each part of speech and adding to counts
        if token.pos_ == 'ADV':
            advs +=1
        elif token.pos_ == 'VERB':
            verb +=1
        elif token.pos_ == 'NOUN':
            noun +=1
        elif token.pos_ == 'ADJ':
            adj +=1
    # Creating a list of all features for each sentence
    X_train_words.append([row_doc, advs, verb, noun, adj, sent_len])
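
The counting logic in the loop above could also be written with a `Counter`, which would let the training and test sets share one function. This is a sketch that works on a plain list of tags. The names `POS_FEATURES` and `pos_counts` and the hand-written tag list are assumptions, standing in for `[token.pos_ for token in row_doc]`.

```python
from collections import Counter

# Parts of speech tracked as features, in the same order as the loop above
POS_FEATURES = ['ADV', 'VERB', 'NOUN', 'ADJ']

def pos_counts(pos_tags):
    """Count the tracked parts of speech and append the token count."""
    counts = Counter(pos_tags)
    return [counts[tag] for tag in POS_FEATURES] + [len(pos_tags)]

# Hand-written tags (assumption) in place of real spaCy output
tags = ['DET', 'NOUN', 'VERB', 'ADV', 'ADP', 'DET', 'ADJ', 'NOUN', 'PUNCT']
print(pos_counts(tags))
```

The returned list follows the column order ADV, VERB, NOUN, ADJ, sent_length used in the data frame below.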

In order to be able to label each of these new features, we will need to re-index the y_train data so that they match up.

In [13]:
y_train_new = y_train.reset_index(drop=True)

Next, we combine our new features with the year labels for each paragraph.

In [14]:
# Data frame for features
nyt_bow = pd.DataFrame(data=X_train_words, columns=['BOW', 'ADV', 'VERB', 'NOUN', 'ADJ', 'sent_length'])
# Adding in year data
nyt_bow = pd.concat([nyt_bow, y_train_new], ignore_index=False, axis=1)
nyt_bow.head()

Unnamed: 0,BOW,ADV,VERB,NOUN,ADJ,sent_length,year
0,"(buffalo, ,, nov, ., (, upi, ), a, blizzard, -...",0,2,9,1,20,1960
1,"(an, intercity, helicopter, service, between, ...",0,1,15,4,34,1960
2,"(vivian, hedy, kahane, ,, daughter, of, elly, ...",1,6,21,5,60,1980
3,"(t, wo, rookies, ,, gilles, hamel, and, alan, ...",1,13,26,6,90,1980
4,(none),0,0,1,0,1,1960


Finally, we will combine our Tf-idf vectors and our newly created features into one data frame.

In [15]:
X_norm_df = pd.DataFrame(data=X_norm.toarray())
nyt_tfidf_bow = pd.concat([nyt_bow, X_norm_df], ignore_index=False, axis=1)
nyt_tfidf_bow.head()

Unnamed: 0,BOW,ADV,VERB,NOUN,ADJ,sent_length,year,0,1,2,...,1589,1590,1591,1592,1593,1594,1595,1596,1597,1598
0,"(buffalo, ,, nov, ., (, upi, ), a, blizzard, -...",0,2,9,1,20,1960,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,"(an, intercity, helicopter, service, between, ...",0,1,15,4,34,1960,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.195104,0.420927,0.0,0.0
2,"(vivian, hedy, kahane, ,, daughter, of, elly, ...",1,6,21,5,60,1980,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.15757,0.0,0.0,0.0
3,"(t, wo, rookies, ,, gilles, hamel, and, alan, ...",1,13,26,6,90,1980,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,(none),0,0,1,0,1,1960,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Feature Selection
To select the best features from this combined set, I will use SelectKBest to keep the 150 best features based on chi-squared values.

In [16]:
# Identifying features and labels to choose from
features = nyt_tfidf_bow.drop(['BOW', 'year'], axis=1)
y2_train = nyt_tfidf_bow.year

In [17]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

# Instantiating and fitting the 150 best features
kbest = SelectKBest(chi2, k=150)
X2_train = kbest.fit_transform(features, y2_train)
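
To show what the chi-squared score rewards, here is a plain-Python sketch of the statistic for a single non-negative feature. As far as I understand scikit-learn's `chi2`, it compares the feature total observed in each class with the total expected if the feature were unrelated to the class. The function name `chi2_score` and the toy values are assumptions for illustration.

```python
def chi2_score(values, labels):
    """Chi-squared statistic of one non-negative feature against class labels."""
    total = sum(values)
    score = 0.0
    for c in set(labels):
        # Feature mass actually observed in class c
        observed = sum(v for v, y in zip(values, labels) if y == c)
        # Feature mass expected if the feature were independent of the class
        expected = total * labels.count(c) / len(labels)
        if expected > 0:
            score += (observed - expected) ** 2 / expected
    return score

labels = ['1960', '1960', '2010', '2010']
concentrated = [1, 1, 0, 0]   # only appears in 1960 paragraphs
spread_out = [1, 0, 1, 0]     # appears equally in both decades

print(chi2_score(concentrated, labels))
print(chi2_score(spread_out, labels))
```

A feature concentrated in one decade scores high and is likely to be kept, while a feature spread evenly across decades scores zero.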


## Clustering
We will first try to create a series of clusters to group the paragraphs to see if the clusters group according to decades or other themes.  We will explore a couple of different clustering methods to see which best models the data.

### K-Means Clustering
We will start with K-Means clustering, where each point will be clustered based on minimizing the inertia, or sum of squared differences between the mean of the cluster and the data points of the cluster. 
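
The inertia described above can be computed by hand for a toy example. This is a minimal sketch with made-up two-dimensional points, and the function name `inertia` is an assumption. It only evaluates a given assignment and does not run the K-Means algorithm itself.

```python
def inertia(points, labels):
    """Sum of squared distances from each point to the mean of its cluster."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    total = 0.0
    for members in clusters.values():
        dim = len(members[0])
        centroid = [sum(m[d] for m in members) / len(members) for d in range(dim)]
        total += sum((m[d] - centroid[d]) ** 2 for m in members for d in range(dim))
    return total

# Two obvious groups, one near x=0 and one near x=10 (toy data)
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]

good = inertia(points, [0, 0, 1, 1])   # groups nearby points together
bad = inertia(points, [0, 1, 0, 1])    # pairs points from opposite sides
print(good, bad)
```

K-Means searches for the assignment with the lower inertia, which here is the one that keeps nearby points together.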

In [18]:
from sklearn.cluster import KMeans

# Calculate predicted values
kmeans = KMeans(n_clusters=6, init='k-means++', random_state=42, n_init=20)
y_pred = kmeans.fit_predict(X2_train)

pd.crosstab(y2_train, y_pred)

col_0,0,1,2,3,4,5
year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
1960,6,0,66,0,0,0
1970,9,0,70,0,0,0
1980,22,0,35,1,23,0
1990,39,1,7,0,26,5
2000,37,0,9,0,24,2
2010,16,0,51,0,1,0


In [19]:
from sklearn.metrics import adjusted_rand_score
from sklearn.metrics import silhouette_score

print('Adjusted Rand Score: {:0.7}'.format(adjusted_rand_score(y2_train, y_pred)))
print('Silhouette Score: {:0.7}'.format(silhouette_score(X2_train, y_pred, metric='euclidean')))

Adjusted Rand Score: 0.1247939
Silhouette Score: 0.5187936


According to the adjusted Rand score (about 0.12, where 0 corresponds to random assignment and 1 to perfect agreement), this model is only slightly better than random at grouping paragraphs by year. The silhouette score (about 0.52) indicates that the clusters themselves are reasonably well separated. These clusters might not be based on decades, but perhaps on topic or author. Let's try another model.
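
To make the adjusted Rand score more concrete, the sketch below computes it from pair counts using the standard formula. The function name `adjusted_rand` and the toy labels are assumptions, and the sketch does not handle the degenerate case where the denominator is zero.

```python
from math import comb
from collections import Counter

def adjusted_rand(labels_true, labels_pred):
    """Adjusted Rand index computed from pair counts."""
    n = len(labels_true)
    cells = Counter(zip(labels_true, labels_pred))
    # Pairs of points that share both a true label and a predicted cluster
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)   # agreement expected by chance
    maximum = (sum_rows + sum_cols) / 2           # agreement of a perfect match
    return (sum_cells - expected) / (maximum - expected)

truth = [0, 0, 1, 1]
same_groups = adjusted_rand(truth, [1, 1, 0, 0])   # same grouping, renamed clusters
mixed_groups = adjusted_rand(truth, [0, 1, 0, 1])  # every cluster mixes both classes
print(same_groups, mixed_groups)
```

The score ignores how clusters are numbered, so a renamed but identical grouping scores 1, while groupings that agree less than chance would predict score below 0.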

### Mini Batch K Means

In [20]:
from sklearn.cluster import MiniBatchKMeans
minikmeans = MiniBatchKMeans(n_clusters=7, init='k-means++', random_state=42, init_size=1000, batch_size=1000)
y_pred2 = minikmeans.fit_predict(X2_train)

pd.crosstab(y2_train, y_pred2)

col_0,0,1,2,3,4,5,6
year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
1960,0,29,0,0,0,42,1
1970,0,48,0,0,0,29,2
1980,0,32,0,12,1,10,26
1990,5,10,1,22,0,3,37
2000,2,16,0,17,0,4,33
2010,0,45,0,0,0,17,6


In [21]:
print('Adjusted Rand Score: {:0.7}'.format(adjusted_rand_score(y2_train, y_pred2)))
print('Silhouette Score: {:0.7}'.format(silhouette_score(X2_train, y_pred2, metric='euclidean')))

Adjusted Rand Score: 0.108482
Silhouette Score: 0.5746902


Again, this model is only slightly better than random at grouping paragraphs by decade, and its adjusted Rand score (0.108) is lower than that of K-Means (0.125). Its silhouette score (0.575) is the highest so far, so the clusters are tight, though perhaps organized by topic rather than by year. Let's try again.

### Spectral Clustering

In [22]:
from sklearn.cluster import SpectralClustering

n_clusters= 6
sc = SpectralClustering(n_clusters=n_clusters)
y_pred3 = sc.fit_predict(X2_train)

pd.crosstab(y2_train, y_pred3)



col_0,0,1,2,3,4,5
year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
1960,28,36,4,0,0,4
1970,36,28,13,0,0,2
1980,67,8,4,0,0,2
1990,73,2,1,0,2,0
2000,69,2,1,0,0,0
2010,54,2,5,2,0,5


In [23]:
print('Adjusted Rand Score: {:0.7}'.format(adjusted_rand_score(y2_train, y_pred3)))
print('Silhouette Score: {:0.7}'.format(silhouette_score(X2_train, y_pred3, metric='euclidean')))

Adjusted Rand Score: 0.05090334
Silhouette Score: -0.3117388


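According to the adjusted Rand score (0.05), this model is little different from random cluster assignment, and the negative silhouette score (-0.31) suggests that many paragraphs sit closer to a neighboring cluster than to their own. Let's try a final model.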
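
A toy calculation shows how a silhouette score can turn negative. The sketch below implements the usual definition, (b - a) / max(a, b), where a is the mean distance to a point's own cluster and b is the smallest mean distance to another cluster. The function name `mean_silhouette` and the points are assumptions for illustration, and the sketch expects at least two clusters.

```python
import math

def mean_silhouette(points, labels):
    """Average silhouette coefficient using Euclidean distance."""
    scores = []
    for i, (p, own) in enumerate(zip(points, labels)):
        # a: mean distance to the other members of the point's own cluster
        same = [math.dist(p, q) for j, (q, lab) in enumerate(zip(points, labels))
                if lab == own and j != i]
        if not same:
            scores.append(0.0)  # a singleton cluster is scored as 0 in this sketch
            continue
        a = sum(same) / len(same)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(p, q) for q, lab in zip(points, labels) if lab == other)
            / labels.count(other)
            for other in set(labels) if other != own
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Two compact groups, one near x=0 and one near x=5 (toy data)
points = [(0, 0), (0, 1), (5, 0), (5, 1)]

good = mean_silhouette(points, [0, 0, 1, 1])   # clusters match the groups
bad = mean_silhouette(points, [0, 1, 0, 1])    # clusters cut across the groups
print(round(good, 3), round(bad, 3))
```

When cluster assignments cut across the natural groups, each point is on average closer to the other cluster than to its own, and the score drops below zero.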

### Affinity Propagation

In [24]:
from sklearn.cluster import AffinityPropagation

af = AffinityPropagation(damping=0.85, max_iter=550, copy=False)
y_pred4 = af.fit_predict(X2_train)

cluster_centers_indices = af.cluster_centers_indices_
n_clusters = len(cluster_centers_indices)
print('Number of estimated clusters: {}'.format(n_clusters))

pd.crosstab(y2_train, y_pred4)

Number of estimated clusters: 17


col_0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16
year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1
1960,0,36,0,1,0,0,8,5,2,0,0,0,0,0,20,0,0
1970,0,28,0,2,0,0,10,14,4,0,0,0,0,0,21,0,0
1980,0,9,12,3,0,1,3,10,5,3,0,7,0,3,15,10,0
1990,1,2,7,9,1,0,2,3,6,7,1,9,1,4,2,20,3
2000,2,2,7,12,0,0,2,5,10,7,0,4,0,7,1,13,0
2010,0,9,1,1,0,0,13,14,7,0,0,0,0,0,19,4,0


In [25]:
print('Adjusted Rand Score: {:0.7}'.format(adjusted_rand_score(y2_train, y_pred4)))
print('Silhouette Score: {:0.7}'.format(silhouette_score(X2_train, y_pred4, metric='euclidean')))

Adjusted Rand Score: 0.0859515
Silhouette Score: 0.4694846


None of these models is particularly good at identifying the decade in which a paragraph was written. This might be because the same types of words are used in each decade, and specific events weren't covered enough in the data set to distinguish between decades. The models were, however, able to separate paragraphs fairly well based on other similarities.

The K-Means model has the highest adjusted Rand score (0.125) and a silhouette score (0.519) second only to Mini Batch K-Means (0.575), so it offers the best balance between matching the year and forming coherent clusters. I will add its cluster assignments as a feature to our training set to then use for supervised learning.

In [26]:
X2_train_c = pd.DataFrame(X2_train)
X2_train_c['kmeans_clust'] = y_pred

# Classification Models 
Now, I will use supervised classification models with and without the k-means clustering predictions as features to see how well our model performs.  I will start with the default settings in the models to get a baseline score.
## Random Forest Classifier

In [51]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rfc_c = RandomForestClassifier()
train_c = rfc_c.fit(X2_train_c, y2_train)
rfc_c_scores = cross_val_score(rfc_c, X2_train_c, y_train, cv=5)
print('Training set score with clustering: {:.5f}(+/- {:.3f})'.format(rfc_c_scores.mean(), rfc_c_scores.std()*2))

rfc = RandomForestClassifier()
train = rfc.fit(X2_train, y2_train)
rfc_scores = cross_val_score(rfc, X2_train, y_train, cv=5)
print('\nTraining set score without clustering:{:.5f}(+/- {:.3f})'.format(rfc_scores.mean(), rfc_scores.std()*2))

Training set score with clustering: 0.38639(+/- 0.109)

Training set score without clustering:0.40714(+/- 0.114)


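The random forest classifier scored slightly higher without the clustering prediction feature, although the difference (about 0.02) is well within the fold-to-fold spread. The cluster feature may add little because the clusters agree poorly with our ground truth labels. Let's see about another classification model.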

## Logistic Regression

In [52]:
from sklearn.linear_model import LogisticRegression

lr_c = LogisticRegression()
train_c = lr_c.fit(X2_train_c, y2_train)
lr_c_scores = cross_val_score(lr_c, X2_train_c, y_train, cv=5)
print('Training set score with clustering: {:.5f}(+/- {:.3f})'.format(lr_c_scores.mean(), lr_c_scores.std()*2))

lr = LogisticRegression()
train = lr.fit(X2_train, y2_train)
lr_scores = cross_val_score(lr, X2_train, y_train, cv=5)
print('\nTraining set score without clustering:{:.5f}(+/- {:.3f})'.format(lr_scores.mean(), lr_scores.std()*2))

Training set score with clustering: 0.41107(+/- 0.111)

Training set score without clustering:0.41569(+/- 0.113)


The logistic regression model also scored marginally higher without clustering than with it (0.416 versus 0.411). Both variants had a higher average accuracy than the random forest classifier. We will try one last model before making a decision.

## Gradient Boosting

In [53]:
from sklearn.ensemble import GradientBoostingClassifier

clf_c = GradientBoostingClassifier()
train_c = clf_c.fit(X2_train_c, y2_train)
clf_c_scores = cross_val_score(clf_c, X2_train_c, y_train, cv=5)
print('Training set score with clustering: {:.5f}(+/- {:.3f})'.format(clf_c_scores.mean(), clf_c_scores.std()*2))

clf = GradientBoostingClassifier()
train = clf.fit(X2_train, y2_train)
clf_scores = cross_val_score(clf, X2_train, y_train, cv=5)
print('\nTraining set score without clustering:{:.5f}(+/- {:.3f})'.format(clf_scores.mean(), clf_scores.std()*2))

Training set score with clustering: 0.47116(+/- 0.065)

Training set score without clustering:0.46911(+/- 0.052)


The gradient boosting classifier is marginally better with the clustering feature than without it (0.471 versus 0.469). Its accuracy is higher than that of the other two models, and it also has the smallest spread between folds. I will now attempt to optimize the parameters of this gradient boosting model to improve its accuracy.
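
One rough way to judge whether the cluster feature matters is to check whether the reported "mean +/- 2 standard deviations" ranges overlap. The sketch below uses only the scores printed above. The helper name `intervals_overlap` is an assumption, and overlapping ranges are an informal indication rather than a statistical test.

```python
# (mean, half-width) pairs copied from the printed cross-validation scores:
# first with the cluster feature, then without it
reported = {
    'random forest': ((0.38639, 0.109), (0.40714, 0.114)),
    'logistic regression': ((0.41107, 0.111), (0.41569, 0.113)),
    'gradient boosting': ((0.47116, 0.065), (0.46911, 0.052)),
}

def intervals_overlap(a, b):
    """True if the ranges mean +/- half-width of a and b share any value."""
    (mean_a, half_a), (mean_b, half_b) = a, b
    return abs(mean_a - mean_b) <= half_a + half_b

overlaps = {name: intervals_overlap(with_c, without_c)
            for name, (with_c, without_c) in reported.items()}
print(overlaps)
```

For all three models the two ranges overlap, so the differences between the runs with and without the cluster feature are small relative to the variation between folds.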

### Optimizing Parameters

In [63]:
from sklearn.model_selection import GridSearchCV

# Set of parameters to test for best score in Grid Search CV
parameters = {'loss':['deviance'],
               'min_samples_split':[50,100,200],
             'min_samples_leaf':[1,2,4],
             'max_depth':[3,4,5,6],
             'max_features':['sqrt'],
             'subsample':[0.6,0.8],
             'n_estimators':[50,100,150]}

#fitting model and printing best parameters and score from model
grid_clf_c = GridSearchCV(clf_c, param_grid=parameters)
grid_clf_c.fit(X2_train_c, y2_train)

print('Best Score:', grid_clf_c.best_score_)
best_params_clf_c = grid_clf_c.best_params_
print('Best Parameters:', best_params_clf_c)

Best Score: 0.5244444444444445
Best Parameters: {'loss': 'deviance', 'max_depth': 5, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 100, 'n_estimators': 100, 'subsample': 0.6}
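
For a sense of the size of this search, the number of parameter combinations in the grid can be counted directly. The sketch below copies the `parameters` dictionary from the cell above and only enumerates the combinations without fitting any models.

```python
from itertools import product

# Same grid as in the search above
parameters = {'loss': ['deviance'],
              'min_samples_split': [50, 100, 200],
              'min_samples_leaf': [1, 2, 4],
              'max_depth': [3, 4, 5, 6],
              'max_features': ['sqrt'],
              'subsample': [0.6, 0.8],
              'n_estimators': [50, 100, 150]}

# Every combination takes one value from each list
combinations = list(product(*parameters.values()))
print('Number of combinations:', len(combinations))
```

The grid search fits each combination once per cross-validation fold, so the total number of model fits is a multiple of this count.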


In [64]:
clf_c2 = GradientBoostingClassifier(**best_params_clf_c)
train = clf_c2.fit(X2_train_c, y2_train)
clf_c2_scores = cross_val_score(clf_c2, X2_train_c, y_train, cv=5)
print('Optimized training set score with clustering:{:.5f}(+/- {:.3f})'.format(clf_c2_scores.mean(), clf_c2_scores.std()*2))

Optimized training set score with clustering:0.51981(+/- 0.087)


## Test Set Processing and Modeling

Now that we've seen what works best on the training set, I will process the test set in the same manner as the training set in order to determine how well this model predicts the decade of each paragraph on a new set of data.

In [30]:
# Normalize Tf-idf vectors
X_test_norm = normalize(X_test_tfidf)

X_test_words = []

for row in X_test:
    # Processing each row for tokens
    row_doc = nlp(row)
    # Calculating length of each sentence
    sent_len = len(row_doc) 
    # Initializing counts of different parts of speech
    advs = 0
    verb = 0
    noun = 0
    adj = 0
    for token in row_doc:
        # Identifying each part of speech and adding to counts
        if token.pos_ == 'ADV':
            advs +=1
        elif token.pos_ == 'VERB':
            verb +=1
        elif token.pos_ == 'NOUN':
            noun +=1
        elif token.pos_ == 'ADJ':
            adj +=1
    # Creating a list of all features for each sentence
    X_test_words.append([row_doc, advs, verb, noun, adj, sent_len])

In [31]:
# Re-indexing y_test
y_test_new = y_test.reset_index(drop=True)

# Data frame for features
nyt_bow_test = pd.DataFrame(data=X_test_words, columns=['BOW', 'ADV', 'VERB', 'NOUN', 'ADJ', 'sent_length'])
# Adding in year data
nyt_bow_test = pd.concat([nyt_bow_test, y_test_new], ignore_index=False, axis=1)

# Combining features into one data frame
X_test_norm_df = pd.DataFrame(data=X_test_norm.toarray())
nyt_tfidf_bow_test = pd.concat([nyt_bow_test, X_test_norm_df], ignore_index=False, axis=1)
nyt_tfidf_bow_test.head()

# Identifying features and labels to choose from
features_test = nyt_tfidf_bow_test.drop(['BOW', 'year'], axis=1)
y2_test = nyt_tfidf_bow_test.year

In [32]:
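# Selecting the 150 best features for the test set
# Note: this fits a new selector on the test set and its labels, so the chosen columns
# may differ from those chosen on the training set; kbest.transform(features_test)
# would reuse the training-set selection instead
X2_test = SelectKBest(chi2, k=150).fit_transform(features_test, y2_test)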
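
Fitting a feature selector separately on the test set can pick different columns than it picked on the training set, which leaves downstream models looking at mismatched features. The plain-Python sketch below illustrates this with a simplified stand-in for SelectKBest. The functions `chi2_score`, `fit_top_k`, and `transform` and the toy data are all assumptions for illustration.

```python
def chi2_score(values, labels):
    """Chi-squared statistic of one non-negative feature against class labels."""
    total = sum(values)
    score = 0.0
    for c in set(labels):
        observed = sum(v for v, y in zip(values, labels) if y == c)
        expected = total * labels.count(c) / len(labels)
        if expected > 0:
            score += (observed - expected) ** 2 / expected
    return score

def fit_top_k(rows, labels, k):
    """Return the indices of the k highest-scoring columns."""
    n_cols = len(rows[0])
    scores = [chi2_score([row[c] for row in rows], labels) for c in range(n_cols)]
    ranked = sorted(range(n_cols), key=lambda c: scores[c], reverse=True)
    return sorted(ranked[:k])

def transform(rows, columns):
    """Keep only the given columns of every row."""
    return [[row[c] for c in columns] for row in rows]

train_rows = [[3, 1, 0], [3, 1, 1], [0, 1, 1], [0, 1, 0]]
train_labels = ['1960', '1960', '2010', '2010']
test_rows = [[1, 0, 2], [1, 0, 0]]
test_labels = ['1960', '2010']

train_cols = fit_top_k(train_rows, train_labels, k=1)   # chosen on training data
refit_cols = fit_top_k(test_rows, test_labels, k=1)     # chosen again on test data
aligned_test = transform(test_rows, train_cols)         # reuse the training choice

print(train_cols, refit_cols, aligned_test)
```

Reusing the columns chosen on the training set keeps the test features aligned with what the clustering and classification models were fit on, and it avoids using test labels during preprocessing.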

In [33]:
# Calculate predicted values
#kmeans = KMeans(n_clusters=6, init='k-means++', random_state=42, n_init=20)
y_pred_test = kmeans.predict(X2_test)

pd.crosstab(y2_test, y_pred_test)

col_0,0,2,3,4,5
year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1960,2,26,0,0,0
1970,4,17,0,0,0
1980,5,12,0,2,0
1990,11,1,1,8,1
2000,13,6,0,9,0
2010,10,20,0,2,0


In [34]:
print('Adjusted Rand Score: {:0.7}'.format(adjusted_rand_score(y2_test, y_pred_test)))
print('Silhouette Score: {:0.7}'.format(silhouette_score(X2_test, y_pred_test, metric='euclidean')))

Adjusted Rand Score: 0.09797155
Silhouette Score: 0.4846398


Not terrible. Both the adjusted Rand score and the silhouette score dropped by roughly 0.03 from the training set to the test set, so the clustering does not appear to be overfitting much. In the K-Means training set cross tabulation, the majority of paragraphs fell into clusters 0, 2, and 4, and in the test set the majority are likewise assigned to clusters 0 and 2.

In [58]:
X2_test_c = pd.DataFrame(X2_test)
X2_test_c['kmeans_clust'] = y_pred_test

In [65]:
clf_c2_scores_test = cross_val_score(clf_c2, X2_test_c, y_test, cv=5)
print('Test set score: {:.5f}(+/- {:.3f})'.format(clf_c2_scores_test.mean(), clf_c2_scores_test.std()*2))

Test set score: 0.21348(+/- 0.015)


This model is severely overfit: the accuracy on the training set (about 0.52) was more than twice that on the test set (about 0.21), which is only slightly above the roughly 0.17 expected from guessing among six decades. This means that the clusters and the models are not well able to distinguish between decades when given previously unseen text.

## Conclusion

Texts from various recent decades are difficult to cluster and classify with natural language processing. This is likely because the language we use to explain events hasn't changed much, even though the events themselves have. With our quick news cycle, there isn't enough continued coverage of any one event to characterize a decade and aid in clustering and classification.

With clustering of texts in particular, it might be better to cluster based on topic or author, as these have more distinctive words or types of words. As in our model, this cluster prediction could then be used to support the classification models, even if the clusters do not perfectly match the decade in which the piece was written. Interestingly, the clustering models improved after adding features that count the different parts of speech in each paragraph. This is likely because different events require different parts of speech to depict accurately.

The classification of texts from different decades was also challenging, even with an optimized gradient boosting classifier. With each paragraph carrying such different information, it was difficult to predict the decade in which it was written. This is evidenced by the test set accuracy, which was less than half of the cross-validated accuracy of the optimized model on the training set. Classification usually works best when the features are strongly associated with the classes.