# New York Times Over the Decades
## Unsupervised Learning Capstone
The world has changed over the past decades. Has what we read about in the media changed too? This project uses data from the [New York Times (NYT) archives API](https://developer.nytimes.com/archive_api.json).  For each article, the archive provides the lead paragraph along with other information about the article.  From these lead paragraphs, I used Term Frequency Inverse Document Frequency (Tf-idf) vectorization to turn each paragraph into a vector that weights words by how often they appear in the paragraph and how rare they are across the corpus. I also used spaCy to count the parts of speech used in each paragraph.  Using the best 5% of these features, I applied several clustering methods to group the paragraphs by similarity.  Finally, I used the features to predict the decade in which a sentence of a New York Times article was written.  

This modeling could be useful for authors who want to understand how patterns of writing have changed, or stayed the same, over time, both in the topics covered and in the ways writers help people better understand the world.

In [1]:
# Necessary imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# Natural Language Processing imports
import requests
import re
import spacy
from collections import Counter

# Modeling imports
from sklearn.model_selection import train_test_split

## Text Selection and Cleaning
For my selected texts, I decided to use the New York Times article API to get a selection of texts, all from December, every 10 years from 1960 to 2010. 

In [2]:
API_KEY = '5cb4f9a5273b4fbf97ef0d7d01eb6273'

# Get Requests to pull JSON data
# Reuse API_KEY instead of repeating the key in every URL
archive_url = 'http://api.nytimes.com/svc/archive/v1/{}/12.json?api-key={}'
request_2010 = requests.get(archive_url.format(2010, API_KEY))
request_2000 = requests.get(archive_url.format(2000, API_KEY))
request_1990 = requests.get(archive_url.format(1990, API_KEY))
request_1980 = requests.get(archive_url.format(1980, API_KEY))
request_1970 = requests.get(archive_url.format(1970, API_KEY))
request_1960 = requests.get(archive_url.format(1960, API_KEY))
# Gathering responses from JSON data
response_2010 = request_2010.json()
response_2000 = request_2000.json()
response_1990 = request_1990.json()
response_1980 = request_1980.json()
response_1970 = request_1970.json()
response_1960 = request_1960.json()

In [3]:
# Selecting document information from JSON
docs_2010 = response_2010['response']['docs']
docs_2000 = response_2000['response']['docs']
docs_1990 = response_1990['response']['docs']
docs_1980 = response_1980['response']['docs']
docs_1970 = response_1970['response']['docs']
docs_1960 = response_1960['response']['docs']

Great, now that I've gathered the texts, let's see a sampling of a lead paragraph to see what we're getting into.

In [4]:
[docs_2010[2]['lead_paragraph']]

['The best-selling novelist Brad Meltzer leads a team of investigators in exploring mysteries of American history.']

Now, let's extract the lead paragraph from each of the first 100 articles from each year. 

In [5]:
nyt_2010 = []
for article in docs_2010[0:100]:
    art = [article['lead_paragraph']]
    for i in range(len(art)):
        nyt_2010.append([art[i], '2010'])

nyt_2000 = []
for article in docs_2000[0:100]:
    art = [article['lead_paragraph']]
    for i in range(len(art)):
        nyt_2000.append([art[i], '2000'])

nyt_1990 = []
for article in docs_1990[0:100]:
    art = [article['lead_paragraph']]
    for i in range(len(art)):
        nyt_1990.append([art[i], '1990'])

nyt_1980 = []
for article in docs_1980[0:100]:
    art = [article['lead_paragraph']]
    for i in range(len(art)):
        nyt_1980.append([art[i], '1980'])

nyt_1970 = []
for article in docs_1970[0:100]:
    art = [article['lead_paragraph']]
    for i in range(len(art)):
        nyt_1970.append([art[i], '1970'])
        
nyt_1960 = []
for article in docs_1960[0:100]:
    art = [article['lead_paragraph']]
    for i in range(len(art)):
        nyt_1960.append([art[i], '1960'])
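
Since the six loops differ only in the year, the labeling step could be captured in a small helper. This is a sketch, not the code that produced the results here; `label_paragraphs` and `sample_docs` are hypothetical names, and the sample records are made up for illustration.

```python
def label_paragraphs(docs, year, limit=100):
    """Pair the lead paragraph of the first `limit` articles with the year label."""
    return [[article['lead_paragraph'], str(year)] for article in docs[:limit]]

# Made-up records with the same key the archive response uses
sample_docs = [{'lead_paragraph': 'First lead.'},
               {'lead_paragraph': 'Second lead.'},
               {'lead_paragraph': 'Third lead.'}]

labeled = label_paragraphs(sample_docs, 1960, limit=2)
print(labeled)
```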

Now that we have gathered the information and labeled each with its respective year published, let's combine all of these years into one data frame to then be able to manipulate and use for analysis and modeling.

In [6]:
years = [nyt_2010, nyt_2000, nyt_1990, nyt_1980, nyt_1970, nyt_1960]
# Flatten the per-year lists and build the data frame in one step
# (DataFrame.append was removed in pandas 2.0)
rows = [row for year in years for row in year]
nyt_all = pd.DataFrame(rows, columns=['lead_paragraph', 'year'])
nyt_all.head()

| | lead_paragraph | year |
|---|---|---|
| 0 | Boulder’s Uptown, with its new shops and resta... | 2010 |
| 1 | With 10 nods for his comeback album, "Recovery... | 2010 |
| 2 | The best-selling novelist Brad Meltzer leads a... | 2010 |
| 3 | Nicholas D. Kristof visits a Haitian cholera t... | 2010 |
| 4 | Nicholas D. Kristof reports from Haiti about t... | 2010 |


To prepare these lead paragraphs for analysis, let's define a function that replaces double dashes (which the natural language processing tools cannot process) with a space, removes all digits (which will not provide any decade-relevant information), lowercases all words for consistency, and then removes all extra white space. 

In [7]:
def text_cleaner(text):
    text = re.sub(r'--', ' ', text)
    text = re.sub(r'\d', '', text)
    #text = re.sub(r'\.', '. ', text)
    text = text.lower()
    text = ' '.join(text.split())
    return text
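
As a quick check of what the cleaner does, it can be run on a short made-up string. The function is repeated so the example is self-contained; `sample` is an illustrative input, not taken from the corpus.

```python
import re

def text_cleaner(text):
    text = re.sub(r'--', ' ', text)   # double dashes become a space
    text = re.sub(r'\d', '', text)    # drop every digit
    text = text.lower()               # lowercase for consistency
    text = ' '.join(text.split())     # collapse runs of white space
    return text

sample = 'In 1960 -- a  Big   Year'
cleaned = text_cleaner(sample)
print(cleaned)
```

Note that removing digits also removes numbers inside a sentence, so "With 10 nods" becomes "with nods".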

In [8]:
# Clean all lead paragraphs
nyt_all['lead_paragraph'] = nyt_all.lead_paragraph.map(lambda x: text_cleaner(str(x)))
nyt_all.head()

| | lead_paragraph | year |
|---|---|---|
| 0 | boulder’s uptown, with its new shops and resta... | 2010 |
| 1 | with nods for his comeback album, "recovery," ... | 2010 |
| 2 | the best-selling novelist brad meltzer leads a... | 2010 |
| 3 | nicholas d. kristof visits a haitian cholera t... | 2010 |
| 4 | nicholas d. kristof reports from haiti about t... | 2010 |


Finally, we will identify the paragraph text and the year labels as our variables. The 25% test set will be set aside later, just before clustering. 

In [9]:
# Identifying variables
X = nyt_all['lead_paragraph']
y = nyt_all['year']

Now that we have our lead paragraphs labeled and cleaned, we are ready for further analysis of these texts.

## Feature Generation 
### Tf-idf Vectorization
To best cluster our data, we will first turn our paragraphs into vectors.  Term Frequency Inverse Document Frequency (Tf-idf) vectorization weights each word by how many times it appears in a paragraph and by how rare it is across all paragraphs, producing one vector per paragraph. 

For this vectorizer, we will use the following parameters:
- Drop the words that occur in more than half the paragraphs (`max_df=0.5`)
- Only use words that appear in at least 3 paragraphs (`min_df=3`)
- Drop English stop words
- Skip lowercasing, since the text was already lowercased during cleaning (`lowercase=False`)
- Apply L2 normalization so that longer paragraphs and shorter paragraphs get treated equally
- Add 1 to all document frequencies to prevent divide-by-zero errors
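
To make the weighting concrete, here is a minimal pure-Python sketch of smoothed tf-idf with L2 normalization. It follows the smoothed formula documented for scikit-learn, idf = ln((1 + n) / (1 + df)) + 1, but it is an illustration only: it splits on white space and skips the stop-word and document-frequency filters, and `tfidf_l2` and `toy_docs` are hypothetical names.

```python
import math
from collections import Counter

def tfidf_l2(docs):
    """Return one {term: weight} dictionary per document."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed idf: adding 1 to the counts avoids division by zero
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1
           for term, count in df.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        raw = {term: tf[term] * idf[term] for term in tf}
        # L2 normalization; fall back to 1.0 for an empty document
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        vectors.append({term: w / norm for term, w in raw.items()})
    return vectors

toy_docs = ['new shops open', 'new visit', 'new worth']
vectors = tfidf_l2(toy_docs)
print(vectors[0])
```

In the toy corpus, "new" appears in every document and so receives a lower weight than "shops", which appears only once.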

In [10]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_df=0.5,
                            min_df=3,
                            stop_words='english',
                            lowercase=False,
                            use_idf=True,
                            norm=u'l2',
                            smooth_idf=True)
# Applying the vectorizer
X_tfidf = vectorizer.fit_transform(X)
print('Number of features: {}'.format(X_tfidf.get_shape()[1]))

# Reshape vectorizer to readable content
X_tfidf_csr = X_tfidf.tocsr()

# Number of paragraphs
n = X_tfidf_csr.shape[0]

# A list of dictionaries, one per paragraph
tfidf_bypara = [{} for _ in range(0,n)]

# List of features (get_feature_names() was removed in scikit-learn 1.2)
terms = vectorizer.get_feature_names_out()

# For each paragraph, lists the feature words and their tf-idf scores
for i, j in zip(*X_tfidf_csr.nonzero()):
    tfidf_bypara[i][terms[j]] = X_tfidf_csr[i, j]

# A word that is absent from a paragraph has a tf-idf score of 0, so only the feature words present in the paragraph are listed.
print('Original sentence:', X[0])
print('Tf_idf vector:', tfidf_bypara[0])
    

Number of features: 1599
Original sentence: boulder’s uptown, with its new shops and restaurants, is worth a visit.
Tf_idf vector: {'new': 0.26399997174868384, 'shops': 0.5922351826476437, 'worth': 0.5239574673092323, 'visit': 0.5522952795397594}


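Finally, we will normalize each vector to unit length to prepare the data for clustering. Because the vectorizer already applied `norm='l2'`, this step leaves the values unchanged.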

In [11]:
from sklearn.preprocessing import normalize
X_norm = normalize(X_tfidf)

### Natural Language Processing - spaCy
Next, we will tokenize each sentence to be able to extract information about parts of speech to add as features in our models.

In [12]:
# Instantiating spaCy (the 'en' shortcut was removed in spaCy 3, so the full model name is used)
nlp = spacy.load('en_core_web_sm')
X_words = []

for row in X:
    # Processing each row for tokens
    row_doc = nlp(row)
    # Calculating length of each sentence
    sent_len = len(row_doc) 
    # Initializing counts of different parts of speech
    advs = 0
    verb = 0
    noun = 0
    adj = 0
    for token in row_doc:
        # Identifying each part of speech and adding to counts
        if token.pos_ == 'ADV':
            advs +=1
        elif token.pos_ == 'VERB':
            verb +=1
        elif token.pos_ == 'NOUN':
            noun +=1
        elif token.pos_ == 'ADJ':
            adj +=1
    # Creating a list of all features for each sentence
    X_words.append([row_doc, advs, verb, noun, adj, sent_len])
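
The same counts could also be gathered with `collections.Counter`, which is already imported above. The sketch below uses stand-in token objects with a `pos_` attribute so that it runs without a spaCy model; `pos_features`, `POS_FEATURES`, and `fake_doc` are hypothetical names.

```python
from collections import Counter
from types import SimpleNamespace

POS_FEATURES = ['ADV', 'VERB', 'NOUN', 'ADJ']

def pos_features(doc):
    """Count the selected parts of speech and append the token count."""
    counts = Counter(token.pos_ for token in doc)
    # Counter returns 0 for tags that never occur
    return [counts[tag] for tag in POS_FEATURES] + [len(doc)]

# Stand-in tokens (assumption): real code would pass nlp(row) instead
fake_doc = [SimpleNamespace(pos_=tag)
            for tag in ['DET', 'ADJ', 'NOUN', 'VERB', 'ADV', 'NOUN', 'PUNCT']]

row_features = pos_features(fake_doc)
print(row_features)
```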

Next, we combine our new features with the year label for each sentence.

In [13]:
# Data frame for features
nyt_bow = pd.DataFrame(data=X_words, columns=['BOW', 'ADV', 'VERB', 'NOUN', 'ADJ', 'sent_length'])
# Adding in year data
nyt_bow = pd.concat([nyt_bow, y], ignore_index=False, axis=1)
nyt_bow.head()

| | BOW | ADV | VERB | NOUN | ADJ | sent_length | year |
|---|---|---|---|---|---|---|---|
| 0 | (boulder, ’s, uptown, ,, with, its, new, shops... | 0 | 1 | 5 | 3 | 16 | 2010 |
| 1 | (with, nods, for, his, comeback, album, ,, ", ... | 0 | 4 | 11 | 6 | 36 | 2010 |
| 2 | (the, best, -, selling, novelist, brad, meltze... | 1 | 3 | 7 | 1 | 19 | 2010 |
| 3 | (nicholas, d., kristof, visits, a, haitian, ch... | 0 | 3 | 4 | 1 | 10 | 2010 |
| 4 | (nicholas, d., kristof, reports, from, haiti, ... | 1 | 7 | 9 | 2 | 28 | 2010 |


Finally, we will combine our Tf-idf vectors and our newly created features into one data frame.

In [14]:
X_norm_df = pd.DataFrame(data=X_norm.toarray())
nyt_tfidf_bow = pd.concat([nyt_bow, X_norm_df], ignore_index=False, axis=1)
nyt_tfidf_bow.head()

| | BOW | ADV | VERB | NOUN | ADJ | sent_length | year | 0 | 1 | 2 | ... | 1589 | 1590 | 1591 | 1592 | 1593 | 1594 | 1595 | 1596 | 1597 | 1598 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | (boulder, ’s, uptown, ,, with, its, new, shops... | 0 | 1 | 5 | 3 | 16 | 2010 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | (with, nods, for, his, comeback, album, ,, ", ... | 0 | 4 | 11 | 6 | 36 | 2010 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | (the, best, -, selling, novelist, brad, meltze... | 1 | 3 | 7 | 1 | 19 | 2010 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | (nicholas, d., kristof, visits, a, haitian, ch... | 0 | 3 | 4 | 1 | 10 | 2010 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | (nicholas, d., kristof, reports, from, haiti, ... | 1 | 7 | 9 | 2 | 28 | 2010 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


### Feature Selection
To select the best features, I will use `SelectPercentile` to keep the top 5% of the features (about 80 of the 1,604 columns), ranked by their chi-squared scores against the year labels. 

In [15]:
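# Identifying features and labels to choose from
features = nyt_tfidf_bow.drop(['BOW', 'year'], axis=1)
# Recent scikit-learn versions require all column names to be strings
features.columns = features.columns.astype(str)
y = nyt_tfidf_bow.year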

In [16]:
from sklearn.feature_selection import SelectPercentile
from sklearn.feature_selection import chi2

# Instantiating and fitting a selector that keeps the top 5% of features
kpercentile = SelectPercentile(chi2, percentile=5)
X2 = kpercentile.fit_transform(features, y)
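
To illustrate how percentile-based selection behaves, here is a small sketch on made-up data. `X_toy` and `y_toy` are hypothetical: two columns differ between the classes and two are constant, so only the first two should survive a 50% cut.

```python
import numpy as np
from sklearn.feature_selection import SelectPercentile, chi2

# Columns 0 and 1 separate the two classes; columns 2 and 3 are constant
X_toy = np.array([[5, 0, 1, 2],
                  [5, 0, 1, 2],
                  [5, 0, 1, 2],
                  [0, 4, 1, 2],
                  [0, 4, 1, 2],
                  [0, 4, 1, 2]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

selector = SelectPercentile(chi2, percentile=50)
X_toy_selected = selector.fit_transform(X_toy, y_toy)

# Boolean mask of the columns that were kept
support = selector.get_support()
print(support, X_toy_selected.shape)
```

Chi-squared scoring requires non-negative features, which holds for both the tf-idf weights and the part-of-speech counts used here.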


## Clustering
We will first try to create a series of clusters to group the paragraphs to see if the clusters group according to decades or other themes.  We will explore a couple of different clustering methods to see which best models the data.  Before doing so, we will split the data into training and test sets.

In [36]:
X2_train, X2_test, y_train, y_test = train_test_split(X2, y, test_size=0.25, random_state=11)

### K-Means Clustering
We will start with K-Means clustering, which assigns each point to a cluster so as to minimize the inertia: the sum of squared distances between each data point and the mean (centroid) of its cluster. 
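
As a small illustration of inertia, the sketch below computes it by hand for two different groupings of four made-up points. `inertia` and `demo_points` are hypothetical names; this is the quantity K-Means tries to minimize, not a K-Means implementation.

```python
import numpy as np

def inertia(points, labels):
    """Sum of squared distances from each point to the mean of its cluster."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    total = 0.0
    for cluster in np.unique(labels):
        members = points[labels == cluster]
        centroid = members.mean(axis=0)
        total += ((members - centroid) ** 2).sum()
    return total

demo_points = [[0, 0], [0, 2], [10, 0], [10, 2]]

# Grouping nearby points gives a much lower inertia than grouping distant ones
tight = inertia(demo_points, [0, 0, 1, 1])
loose = inertia(demo_points, [0, 1, 0, 1])
print(tight, loose)
```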

In [37]:
from sklearn.cluster import KMeans

# Calculate predicted values
kmeans = KMeans(n_clusters=6, init='k-means++', random_state=42, n_init=20)
y_pred = kmeans.fit_predict(X2_train)

pd.crosstab(y_train, y_pred)

| year | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 1960 | 27 | 0 | 0 | 2 | 0 | 47 |
| 1970 | 54 | 0 | 0 | 1 | 0 | 21 |
| 1980 | 30 | 6 | 1 | 23 | 0 | 10 |
| 1990 | 11 | 19 | 1 | 36 | 4 | 2 |
| 2000 | 19 | 16 | 0 | 37 | 1 | 3 |
| 2010 | 49 | 1 | 0 | 9 | 0 | 20 |


In [38]:
from sklearn.metrics import adjusted_rand_score
from sklearn.metrics import silhouette_score

print('Adjusted Rand Score: {:0.7}'.format(adjusted_rand_score(y_train, y_pred)))
print('Silhouette Score: {:0.7}'.format(silhouette_score(X2_train, y_pred, metric='euclidean')))

Adjusted Rand Score: 0.1155519
Silhouette Score: 0.589875


The adjusted Rand score of about 0.12 is above the 0 expected from random assignment but far from the 1.0 of a perfect match, so the clusters only weakly follow the years. The silhouette score of about 0.59 shows that this model is fairly good at assigning paragraphs to well-separated clusters.  These clusters might not be based on decades, but maybe on topic or author.  Let's try another model.
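
The adjusted Rand score compares groupings rather than label values, so cluster ids do not need to equal the years. A short sketch with made-up labels (`decades`, `perfect_clusters`, and `merged_clusters` are illustrative):

```python
from sklearn.metrics import adjusted_rand_score

decades = [1960, 1960, 1970, 1970, 1980, 1980]

# Same grouping as the decades, just with arbitrary cluster ids
perfect_clusters = [2, 2, 0, 0, 1, 1]
# Everything lumped into one cluster
merged_clusters = [0, 0, 0, 0, 0, 0]

perfect_ari = adjusted_rand_score(decades, perfect_clusters)
merged_ari = adjusted_rand_score(decades, merged_clusters)
print(perfect_ari, merged_ari)
```

A grouping that matches the decades scores 1.0 regardless of how the clusters are numbered, while a single catch-all cluster scores 0.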

### Mini Batch K Means

In [39]:
from sklearn.cluster import MiniBatchKMeans
minikmeans = MiniBatchKMeans(n_clusters=7, init='k-means++', random_state=42, init_size=1000, batch_size=1000)
y_pred2 = minikmeans.fit_predict(X2_train)

pd.crosstab(y_train, y_pred2)

| year | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| 1960 | 2 | 0 | 27 | 0 | 0 | 47 | 0 |
| 1970 | 2 | 0 | 53 | 0 | 0 | 21 | 0 |
| 1980 | 22 | 0 | 29 | 9 | 0 | 9 | 1 |
| 1990 | 36 | 1 | 10 | 20 | 5 | 1 | 0 |
| 2000 | 34 | 0 | 16 | 22 | 1 | 3 | 0 |
| 2010 | 11 | 0 | 49 | 1 | 0 | 18 | 0 |


In [40]:
print('Adjusted Rand Score: {:0.7}'.format(adjusted_rand_score(y_train, y_pred2)))
print('Silhouette Score: {:0.7}'.format(silhouette_score(X2_train, y_pred2, metric='euclidean')))

Adjusted Rand Score: 0.120797
Silhouette Score: 0.5888146


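Again, this model is only somewhat okay at identifying the decade of a sentence. Its adjusted Rand score (0.121) is marginally higher than that of k-means (0.116), while its silhouette score (0.589) is essentially the same; note that this run uses 7 clusters rather than 6.  The model is pretty good at assigning sentences to tight clusters, maybe not based on year, but perhaps topic.  Let's try again.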

### Spectral Clustering

In [41]:
from sklearn.cluster import SpectralClustering

n_clusters= 6
sc = SpectralClustering(n_clusters=n_clusters)
y_pred3 = sc.fit_predict(X2_train)

pd.crosstab(y_train, y_pred3)



| year | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 1960 | 43 | 23 | 2 | 6 | 0 | 2 |
| 1970 | 21 | 45 | 3 | 7 | 0 | 0 |
| 1980 | 8 | 55 | 2 | 5 | 0 | 0 |
| 1990 | 1 | 67 | 0 | 3 | 2 | 0 |
| 2000 | 2 | 72 | 0 | 2 | 0 | 0 |
| 2010 | 1 | 65 | 3 | 8 | 0 | 2 |


In [42]:
print('Adjusted Rand Score: {:0.7}'.format(adjusted_rand_score(y_train, y_pred3)))
print('Silhouette Score: {:0.7}'.format(silhouette_score(X2_train, y_pred3, metric='euclidean')))

Adjusted Rand Score: 0.05318995
Silhouette Score: -0.313308


With an adjusted Rand score near zero and a negative silhouette score, this model is little better than random cluster assignment; it places most paragraphs into a single cluster.  Let's try a final model.

### Affinity Propagation

In [43]:
from sklearn.cluster import AffinityPropagation

af = AffinityPropagation(damping=0.85, max_iter=550, copy=False)
y_pred4 = af.fit_predict(X2_train)

cluster_centers_indices = af.cluster_centers_indices_
n_clusters = len(cluster_centers_indices)
print('Number of estimated clusters: {}'.format(n_clusters))

pd.crosstab(y_train, y_pred4)

Number of estimated clusters: 13


| year | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1960 | 0 | 0 | 0 | 2 | 0 | 23 | 0 | 4 | 47 | 0 | 0 | 0 | 0 |
| 1970 | 0 | 0 | 0 | 2 | 0 | 33 | 0 | 20 | 21 | 0 | 0 | 0 | 0 |
| 1980 | 0 | 6 | 1 | 6 | 3 | 18 | 0 | 11 | 9 | 1 | 4 | 0 | 11 |
| 1990 | 3 | 21 | 5 | 9 | 7 | 4 | 1 | 6 | 1 | 0 | 8 | 2 | 6 |
| 2000 | 0 | 10 | 7 | 18 | 6 | 5 | 0 | 11 | 3 | 0 | 6 | 1 | 9 |
| 2010 | 0 | 6 | 1 | 4 | 0 | 33 | 0 | 17 | 18 | 0 | 0 | 0 | 0 |


In [44]:
print('Adjusted Rand Score: {:0.7}'.format(adjusted_rand_score(y_train, y_pred4)))
print('Silhouette Score: {:0.7}'.format(silhouette_score(X2_train, y_pred4, metric='euclidean')))

Adjusted Rand Score: 0.1055845
Silhouette Score: 0.5018931


None of these models is particularly good at identifying the decade in which a sentence was written. This might be because the same types of words are used in each decade, and the specific events weren't covered enough in the data set to distinguish one decade from another.  Apart from spectral clustering, the models were able to separate the sentences into reasonably distinct groups based on similarity. 

The K-Means clustering model has the highest silhouette score and an adjusted Rand score close to the best one (mini batch K-Means), so it offers the best balance between identifying the year and clustering the sentences.  I will add its cluster assignments as a feature to our training set to then use for supervised learning.

In [45]:
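X2_train_c = pd.DataFrame(X2_train)
X2_train_c['kmeans_clust'] = y_pred
# Recent scikit-learn versions require all column names to be strings
X2_train_c.columns = X2_train_c.columns.astype(str)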

# Classification Models 
Now, I will use supervised classification models with and without the k-means clustering predictions as features to see how well our model performs.  I will start with the default settings in the models to get a baseline score.
## Random Forest Classifier

In [46]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rfc_c = RandomForestClassifier()
train_c = rfc_c.fit(X2_train_c, y_train)
rfc_c_scores = cross_val_score(rfc_c, X2_train_c, y_train, cv=5)
print('Training set score with clustering: {:.5f}(+/- {:.3f})'.format(rfc_c_scores.mean(), rfc_c_scores.std()*2))

rfc = RandomForestClassifier()
train = rfc.fit(X2_train, y_train)
rfc_scores = cross_val_score(rfc, X2_train, y_train, cv=5)
print('\nTraining set score without clustering:{:.5f}(+/- {:.3f})'.format(rfc_scores.mean(), rfc_scores.std()*2))

Training set score with clustering: 0.40478(+/- 0.056)

Training set score without clustering:0.39796(+/- 0.047)


The random forest classifier performed marginally better with the clustering prediction feature (0.405 vs. 0.398), although the difference is well within the spread across folds. Even though the clusters match the decade labels (our ground truth) poorly, they may still add some information to the model. Let's see about another classification model.

## Logistic Regression

In [47]:
from sklearn.linear_model import LogisticRegression

lr_c = LogisticRegression()
train_c = lr_c.fit(X2_train_c, y_train)
lr_c_scores = cross_val_score(lr_c, X2_train_c, y_train, cv=5)
print('Training set score with clustering: {:.5f}(+/- {:.3f})'.format(lr_c_scores.mean(), lr_c_scores.std()*2))

lr = LogisticRegression()
train = lr.fit(X2_train, y_train)
lr_scores = cross_val_score(lr, X2_train, y_train, cv=5)
print('\nTraining set score without clustering:{:.5f}(+/- {:.3f})'.format(lr_scores.mean(), lr_scores.std()*2))

Training set score with clustering: 0.39351(+/- 0.060)

Training set score without clustering:0.39842(+/- 0.084)


The logistic regression model performed slightly better without clustering than with it (0.398 vs. 0.394), although the difference is again within the spread across folds. Its best score is also slightly below the best score of the random forest classifier (0.405).  We will try one last model before making a decision.

## Gradient Boosting

In [48]:
from sklearn.ensemble import GradientBoostingClassifier

clf_c = GradientBoostingClassifier()
train_c = clf_c.fit(X2_train_c, y_train)
clf_c_scores = cross_val_score(clf_c, X2_train_c, y_train, cv=5)
print('Training set score with clustering: {:.5f}(+/- {:.3f})'.format(clf_c_scores.mean(), clf_c_scores.std()*2))

clf = GradientBoostingClassifier()
train = clf.fit(X2_train, y_train)
clf_scores = cross_val_score(clf, X2_train, y_train, cv=5)
print('\nTraining set score without clustering:{:.5f}(+/- {:.3f})'.format(clf_scores.mean(), clf_scores.std()*2))

Training set score with clustering: 0.49172(+/- 0.111)

Training set score without clustering:0.48072(+/- 0.083)


The gradient boosting classifier is marginally better with the clustering than without the clustering (0.492 vs. 0.481).  Its accuracy is higher than that of either of the other two models, but its spread across folds (+/- 0.111) is about twice that of the random forest (+/- 0.056). I will now attempt to optimize the parameters of this gradient boosting model to improve its accuracy.

### Optimizing Parameters

In [49]:
from sklearn.model_selection import GridSearchCV

# Set of parameters to test for best score in Grid Search CV
parameters = {'loss':['deviance'],
               'min_samples_split':[50,100,200],
             'min_samples_leaf':[1,2,4],
             'max_depth':[3,4,5,6],
             'max_features':['sqrt'],
             'subsample':[0.6,0.8],
             'n_estimators':[50,100,150]}

#fitting model and printing best parameters and score from model
grid_clf_c = GridSearchCV(clf_c, param_grid=parameters)
grid_clf_c.fit(X2_train_c, y_train)

print('Best Score:', grid_clf_c.best_score_)
best_params_clf_c = grid_clf_c.best_params_
print('Best Parameters:', best_params_clf_c)

Best Score: 0.54
Best Parameters: {'loss': 'deviance', 'max_depth': 6, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 100, 'n_estimators': 50, 'subsample': 0.6}
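
To get a feel for the cost of this search, the number of parameter combinations can be counted directly from the grid. The grid below is copied from the cell above; `cv_folds = 5` is an assumption for illustration, since the default number of folds in `GridSearchCV` depends on the scikit-learn version.

```python
import math

parameters = {'loss': ['deviance'],
              'min_samples_split': [50, 100, 200],
              'min_samples_leaf': [1, 2, 4],
              'max_depth': [3, 4, 5, 6],
              'max_features': ['sqrt'],
              'subsample': [0.6, 0.8],
              'n_estimators': [50, 100, 150]}

# Every combination of values is one candidate model
n_candidates = math.prod(len(values) for values in parameters.values())

cv_folds = 5  # assumption: not set explicitly in the search above
n_fits = n_candidates * cv_folds
print(n_candidates, n_fits)
```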


In [50]:
clf_c2 = GradientBoostingClassifier(**best_params_clf_c)
train = clf_c2.fit(X2_train_c, y_train)
clf_c2_scores = cross_val_score(clf_c2, X2_train_c, y_train, cv=5)
print('Optimized training set score with clustering:{:.5f}(+/- {:.3f})'.format(clf_c2_scores.mean(), clf_c2_scores.std()*2))

Optimized training set score with clustering:0.52027(+/- 0.047)


## Test Set Processing and Modeling

Now that we've tried out what works best with the training set, I will process the test set in an identical manner as the training set in order to determine how well this model predicts the decade from which the sentence came on a new set of data. 

In [51]:
# Calculate predicted values
y_pred_test = kmeans.predict(X2_test)

pd.crosstab(y_test, y_pred_test)

| year | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 1960 | 12 | 0 | 0 | 0 | 0 | 12 |
| 1970 | 12 | 0 | 0 | 1 | 0 | 11 |
| 1980 | 15 | 4 | 0 | 8 | 0 | 3 |
| 1990 | 3 | 6 | 1 | 14 | 2 | 1 |
| 2000 | 8 | 3 | 0 | 11 | 1 | 1 |
| 2010 | 10 | 0 | 0 | 5 | 0 | 6 |


In [52]:
print('Adjusted Rand Score: {:0.7}'.format(adjusted_rand_score(y_test, y_pred_test)))
print('Silhouette Score: {:0.7}'.format(silhouette_score(X2_test, y_pred_test, metric='euclidean')))

Adjusted Rand Score: 0.07415039
Silhouette Score: 0.5402269


Not terrible.  From the training set to the test set, the adjusted Rand score dropped by about 0.04 (0.116 to 0.074) and the silhouette score by about 0.05 (0.590 to 0.540).  This doesn't look like it's overfitting too much. According to the K-Means training set cross tabulation, the majority of sentences were clustered into clusters 0, 3, and 5, and the test set shows the same pattern, with most sentences labeled as 0, 3, or 5.  
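
To check which clusters hold the most paragraphs, the column totals of the K-Means training cross tabulation can be computed directly. The counts below are copied from that table; `train_crosstab` is just a name used for this sketch.

```python
import numpy as np

# Rows are the years 1960 to 2010; columns are clusters 0 to 5
train_crosstab = np.array([
    [27,  0, 0,  2, 0, 47],
    [54,  0, 0,  1, 0, 21],
    [30,  6, 1, 23, 0, 10],
    [11, 19, 1, 36, 4,  2],
    [19, 16, 0, 37, 1,  3],
    [49,  1, 0,  9, 0, 20],
])

# Number of training paragraphs assigned to each cluster
cluster_sizes = train_crosstab.sum(axis=0)

# Indices of the three largest clusters, in ascending order of index
largest_three = sorted(np.argsort(cluster_sizes)[-3:].tolist())
print(cluster_sizes, largest_three)
```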

In [53]:
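X2_test_c = pd.DataFrame(X2_test)
X2_test_c['kmeans_clust'] = y_pred_test
# Recent scikit-learn versions require all column names to be strings
X2_test_c.columns = X2_test_c.columns.astype(str)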

In [54]:
clf_c2_scores_test = cross_val_score(clf_c2, X2_test_c, y_test, cv=5)
print('Test set score: {:.5f}(+/- {:.3f})'.format(clf_c2_scores_test.mean(), clf_c2_scores_test.std()*2))

Test set score: 0.20065(+/- 0.023)


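This model generalizes poorly: the cross-validated accuracy on the training set (about 0.52) was roughly two and a half times the score on the test set (about 0.20), which is barely above the roughly 1-in-6 chance level for six decades.  Note that `cross_val_score` refits the model on folds of the small test set rather than scoring the model that was fitted on the training set, so the comparison is only approximate.  Either way, the clusters and the models are not well able to distinguish between decades when given a new, unseen piece of text.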
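
A more direct way to measure generalization is to fit once on the training split and score that same fitted model on the held-out split. The sketch below shows the pattern on synthetic stand-in data, so its numbers say nothing about the NYT corpus; `X_demo`, `y_demo`, and the model settings are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): six classes, like the six decades
X_demo, y_demo = make_classification(n_samples=600, n_features=20,
                                     n_informative=6, n_classes=6,
                                     n_clusters_per_class=1, random_state=11)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.25, random_state=11)

# Fit on the training split only
model = GradientBoostingClassifier(n_estimators=50, random_state=11)
model.fit(X_tr, y_tr)

# Score the already fitted model on data it has never seen
held_out_accuracy = model.score(X_te, y_te)
print('Held-out accuracy: {:.3f}'.format(held_out_accuracy))
```

Applied to this notebook, the equivalent call would score the fitted `clf_c2` on `X2_test_c` and `y_test` instead of cross-validating on the test set.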

## Conclusion

Texts from different recent decades are difficult to cluster and classify with natural language processing.  This is likely because the language we have used to explain events hasn't changed much, even though some of the events have.  With our quick news cycle, there isn't enough continued coverage of any one event to really characterize the decade and aid in clustering and classification.  

With clustering of texts in particular, it might be better to cluster based on topic or author, as these have more distinctive words or types of words. As in our model, this cluster prediction could then be used to support the classification models, even if the clusters do not perfectly match the decade in which the piece was written. Interestingly, the clustering models improved when features describing the number of different parts of speech in each sentence were added. This is likely because different events in the decades require different parts of speech to depict accurately. The poor clustering could also reflect a lack of evolution in the topics covered by the media, in that the public is still interested in the same types of articles today as it was in the 1960s. 

The classification of texts from different decades was also challenging, even with an optimized gradient boosting classifier.  With each sentence carrying such different information, it was hard to predict the decade in which a document was written.  This is evidenced by the test set accuracy (about 0.20), which was less than half the cross-validated accuracy of the optimized model (about 0.52).  Classification usually works best when the features are highly correlated with the classes and consistent across the data, which isn't necessarily the case with text data.