# Chapter Two - Footnotes: The Poetics of Progress

In this notebook, I walk through my steps in performing the [clustering](#Clustering) and [classification](#Classification) experiments found in chapter two. I've included pickled versions of the corpora I use in the tests. I've also included the metadata csv for the 18th-century progress poems I analyze in case you want to process the texts differently. 


In [None]:
%matplotlib inline
import sys

### These are function files. The path must be on sys.path before the local modules are imported.
sys.path.insert(0, '/Users/collinjennings/dropbox/eccoDriveWork') ### Change this to match github directory.

import csv
import pickle
import random
from collections import Counter

import matplotlib
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn import metrics
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

import bookFunctions
import graph_analyzer  ### Assumed to be a local module in the same directory as bookFunctions.


## Load Corpora

Both experiments compare 55 progress poems to 55 randomly selected works from the 18th century. The random sample includes poetry and prose. I explain why this is the case at length in the book. Here I've stored the two collections in one pickle file for easy handling. The pickle is a list of dictionaries, each holding a work's text, author, title, publication date, and class. For more bibliographical details, see the metadata csv.

In [None]:
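### Open the pickle in a with-block so the file handle is closed after loading.
with open('progressPlusSample.p', 'rb') as f:
    full_corpus = pickle.load(f)
### Each item's 'text' is a list of words; join it back into a single string.
full_text = [' '.join(item['text']) for item in full_corpus]
corpus_sample = full_corpus[54:]
print(len(full_corpus))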
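
To make the structure concrete, here is a minimal sketch with a made-up two-item corpus. The records and their values are placeholders for illustration only, not entries from `progressPlusSample.p`; the keys are the ones this notebook reads, and `dividing_index` is a hypothetical name.

```python
# Hypothetical records that mimic the layout described above.
toy_corpus = [
    {'text': ['the', 'progress', 'of', 'arts'], 'author': 'Author A',
     'title': 'Toy Poem', 'date': '1750', 'class': 'progress'},
    {'text': ['a', 'sample', 'of', 'prose'], 'author': 'Author B',
     'title': 'Toy Essay', 'date': '1760', 'class': 'sample'},
]

# Join each word list into one string, since the vectorizer expects raw text.
toy_text = [' '.join(item['text']) for item in toy_corpus]

# Slicing at a dividing index separates the two collections.
dividing_index = 1
toy_progress = toy_corpus[:dividing_index]
toy_sample = toy_corpus[dividing_index:]

print(toy_text)
print(len(toy_progress), len(toy_sample))
```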

## Clustering 

### Separate the metadata into lists

The lists make it a bit easier to create a dataframe for analyzing and visualizing the differences between the two corpora. 

In [None]:
classDividingLine = 54 ### Index where the progress poems end and the sample texts begin.
classes = []
authors = []
titles = []
dates = []
for idx, item in enumerate(full_corpus):
    ### Progress poems are labeled 'poetry'; sample texts keep their stored class.
    classes.append('poetry' if idx < classDividingLine else item['class'])
    authors.append(item['author'][:20])
    titles.append(item['title'])
    dates.append(int(item['date']))
        
### Shorten the titles for visualizing.
titles3 = [title[:20] for title in titles]

### Transform the text wordlists into vectors for cluster visualization
This version of the experiment limits the vocabulary to the top 400 bigrams that appear in at least 20% of the texts. The goal is to observe how the two classes differ in their use of common phrases.
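
As a rough illustration of what that filter does, the sketch below counts bigrams by hand on a toy corpus. It is a simplification under stated assumptions: it splits on whitespace rather than using scikit-learn's tokenizer, and `bigrams` and `frequent_bigrams` are hypothetical helper names, not part of this notebook or of scikit-learn.

```python
from collections import Counter

def bigrams(text):
    """Return the adjacent word pairs in a whitespace-tokenized text."""
    words = text.lower().split()
    return [' '.join(pair) for pair in zip(words, words[1:])]

def frequent_bigrams(texts, min_df=0.20, max_features=400):
    """Keep bigrams found in at least min_df of the texts, capped at max_features."""
    doc_freq = Counter()    # number of texts containing each bigram
    total_freq = Counter()  # total occurrences across all texts
    for text in texts:
        grams = bigrams(text)
        doc_freq.update(set(grams))
        total_freq.update(grams)
    threshold = min_df * len(texts)
    kept = [g for g in total_freq if doc_freq[g] >= threshold]
    # Rank by total frequency, breaking ties alphabetically for a stable result.
    kept.sort(key=lambda g: (-total_freq[g], g))
    return sorted(kept[:max_features])

toy_texts = [
    'the progress of arts',
    'the progress of science',
    'a history of arts',
    'the rise of commerce',
]
toy_vocab = frequent_bigrams(toy_texts, min_df=0.5)
print(toy_vocab)
```

With `min_df=0.5`, only the bigrams that occur in at least two of the four toy texts survive.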

In [None]:
vectorizer = TfidfVectorizer(input='content', decode_error='replace', ngram_range=(2,2), stop_words=None,
                             max_features=400, min_df=.20, norm='l2', use_idf=True)
dtm = vectorizer.fit_transform(full_text)
### get_feature_names() was removed in scikit-learn 1.2; on versions before 1.0, use it instead.
vocab = vectorizer.get_feature_names_out()
dtm = dtm.toarray()
### Cosine distance between every pair of texts.
dist = 1 - cosine_similarity(dtm)
print(len(vocab))
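
The distance used here is one minus the cosine similarity of each pair of row vectors. A minimal NumPy sketch of that calculation follows; `cosine_distance_matrix` is a hypothetical helper written for illustration, and unlike scikit-learn's `cosine_similarity` it assumes that no row is all zeros.

```python
import numpy as np

def cosine_distance_matrix(rows):
    """Return 1 - cosine similarity for every pair of row vectors."""
    rows = np.asarray(rows, dtype=float)
    norms = np.linalg.norm(rows, axis=1, keepdims=True)
    unit = rows / norms  # assumes no all-zero rows
    return 1 - unit @ unit.T

# Rows 0 and 2 point the same way; row 1 is orthogonal to both.
toy = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]]
toy_dist = cosine_distance_matrix(toy)
print(np.round(toy_dist, 3))
```

Vectors that point in the same direction have distance 0 regardless of their length, which is why texts of different sizes can still land close together.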


In [None]:
pca = PCA(n_components=2).fit(dist)
pos = pca.transform(dist)
xs, ys = pos[:, 0], pos[:, 1]
df = pd.DataFrame(dict(x=xs, y=ys, label=titles3, class2=classes)) 
groups = df.groupby('class2')

#### The `idList` filters the labels to make the graph more legible. You can change which labels are included on the plot.

In [None]:
idList = [ 3, 11, 14,19, 20, 22, 38, 39, 56, 75, 68, 87, 
          93, 98, 81, 85, 102, 48, 103, 108, 110, 111, 38, 114, 60]

### Visualize Poems
Here we visualize how the progress poems cluster in relation to the texts from the sample corpus. There are many ways to tinker with the visualization; see the `matplotlib` documentation.

In [None]:
cluster_colors = {'poetry': '#EBEBEB', 'sample': '#BBBABA'}

fig, ax = plt.subplots(figsize=(12, 12)) # set size
for name, group in groups:
    ax.plot(group.x, group.y, marker='o', linestyle='', ms=24, 
            label=name, color=cluster_colors[name]) 

for i in range(len(df)):
    if i in idList:
        ax.text(df.loc[i, 'x'], df.loc[i, 'y'], df.loc[i, 'label'],
                horizontalalignment='left', size=11, weight='bold')
ax.legend()
plt.title('Semantic Proximity between Progress Poetry and ECCO Sample - Top 400 Bigrams', fontsize=20)
plt.show()

### Observe the most informative bigrams

This is a simple way to observe similarities and differences between clusters of texts in the graph. Set `xValue` and `yValue` to coordinates on the plot, then set the `above` and `toTheLeft` booleans to say whether you want the data points above or below the y-value and to the left or right of the x-value. The `graph_analyzer` functions return the words or phrases with the highest frequency among the texts that fall within that area of focus.
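
The idea can be sketched in a few lines of NumPy. This is not the `graph_analyzer` implementation, which is not shown in this notebook; `select_region` and `rank_by_mean_difference` are hypothetical stand-ins, and the toy coordinates, weights, and vocabulary are made up.

```python
import numpy as np

def select_region(points, x_value, y_value, above=True, to_the_left=True):
    """Return a boolean mask of the points inside the chosen quadrant."""
    points = np.asarray(points, dtype=float)
    x_mask = points[:, 0] < x_value if to_the_left else points[:, 0] >= x_value
    y_mask = points[:, 1] > y_value if above else points[:, 1] <= y_value
    return x_mask & y_mask

def rank_by_mean_difference(matrix, mask, vocab, topn=3):
    """Rank terms by mean weight inside the region minus mean weight outside it."""
    matrix = np.asarray(matrix, dtype=float)
    diff = matrix[mask].mean(axis=0) - matrix[~mask].mean(axis=0)
    order = np.argsort(diff)[::-1][:topn]
    return [(vocab[i], round(float(diff[i]), 3)) for i in order]

toy_pos = [(-1.0, 2.0), (-0.5, 1.5), (1.0, -1.0), (2.0, 0.5)]
toy_dtm = [[0.9, 0.1, 0.0],
           [0.7, 0.3, 0.0],
           [0.0, 0.2, 0.8],
           [0.1, 0.2, 0.6]]
toy_vocab = ['of arts', 'of the', 'in the']

# Focus on the upper-left quadrant: left of x=0 and above y=1.
mask = select_region(toy_pos, x_value=0.0, y_value=1.0)
ranked = rank_by_mean_difference(toy_dtm, mask, toy_vocab)
print(mask)
print(ranked)
```

Terms at the top of the ranking are more typical of the selected region; terms at the bottom are more typical of everything outside it.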

In [None]:
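xValue = None ### Set to an x-coordinate on the plot.
yValue = None ### Set to a y-coordinate on the plot.
above = True ## keep points above yValue; set to False for points below
toTheLeft = True ## keep points to the left of xValue; set to False for points to the right

posGroup, notGroup = graph_analyzer.separator(pos, xValue, yValue, above=above, toTheLeft=toTheLeft)
### dtm is the document-term matrix built in the vectorizing step above.
avgArray, notAvgArray = graph_analyzer.analyzer(dtm, posGroup, notGroup)
allProg, nonProg = graph_analyzer.valueRanker(graph_analyzer.differencer(avgArray, notAvgArray),
                                              vocab, 60)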

In [None]:
print(sorted(allProg, key=lambda x:x[1], reverse=True))
print()
print(sorted(nonProg, key=lambda x:x[1], reverse=True))

## Classification


### Organize the corpus for classification

In [None]:
corpus_sample = full_corpus[54:]
progressCorpus = full_corpus[:54]

In [None]:
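### Assumption: these helpers live in bookFunctions, like buildMeanDF below; adjust the prefix if they are defined elsewhere.
progressTexts = bookFunctions.textFieldCorpus(progressCorpus, 'text', 'progress')

In [None]:
progressWordLabels = bookFunctions.nonParsing('progress', progressTexts)

In [None]:
sampGroup = bookFunctions.sampleProcessor(corpus_sample, 'text', 70)
sampWordLabels = bookFunctions.nonParsing('sample', sampGroup)

In [None]:
chunks, chunkLabels = bookFunctions.buildCorpus(progressWordLabels, sampWordLabels,
                                                bookFunctions.text_split, 300)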
print(len(chunks))
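
The helper functions are not shown in this notebook, so the following is only a guess at the general shape of the chunking step. It assumes that the final argument (300 above) is a chunk length in words and that each chunk inherits the label of its source text; `split_into_chunks` and `build_labeled_chunks` are hypothetical names, not the `bookFunctions` API.

```python
def split_into_chunks(words, chunk_size):
    """Split a word list into consecutive chunks, dropping a short final remainder."""
    return [' '.join(words[i:i + chunk_size])
            for i in range(0, len(words) - chunk_size + 1, chunk_size)]

def build_labeled_chunks(labeled_texts, chunk_size):
    """Turn (label, word_list) pairs into parallel lists of chunks and labels."""
    chunks, labels = [], []
    for label, words in labeled_texts:
        for chunk in split_into_chunks(words, chunk_size):
            chunks.append(chunk)
            labels.append(label)
    return chunks, labels

toy = [('progress', 'one two three four five six seven'.split()),
       ('sample', 'alpha beta gamma delta'.split())]
toy_chunks, toy_labels = build_labeled_chunks(toy, chunk_size=3)
print(toy_chunks)
print(toy_labels)
```

Equal-sized chunks keep long works from dominating the classifier simply because they contain more words.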

### Create a dataframe with the text chunks and their class labels. 

In [None]:
d = {'category': chunkLabels, 'text': chunks}
dfFull = pd.DataFrame(d)
dfFull['category_id'] = dfFull['category'].factorize()[0]

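tfidf = TfidfVectorizer(sublinear_tf=True, ngram_range=(1,2), min_df=.1, norm='l2', use_idf=False)

feats = tfidf.fit_transform(dfFull.text).toarray()
### get_feature_names() was removed in scikit-learn 1.2; on versions before 1.0, use it instead.
vocabulary = tfidf.get_feature_names_out()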
labels = dfFull.category_id

### Train the classifier and compare the results of different models

In [None]:
models = [
    RandomForestClassifier(n_estimators=200, max_depth=3, random_state=0),
    LinearSVC(),
    MultinomialNB(),
    LogisticRegression(random_state=0, multi_class='ovr', solver='liblinear',),
]
CV = 5 ### Number of cross-validation folds.
entries = []
for model in models:
    model_name = model.__class__.__name__
    accuracies = cross_val_score(model, feats, labels, scoring='accuracy', cv=CV) 
    for fold_idx, accuracy in enumerate(accuracies):
        entries.append((model_name, fold_idx, accuracy))
cv_df = pd.DataFrame(entries, columns=['model_name', 'fold_idx', 'accuracy'])
sns.boxplot(x='model_name', y='accuracy', data=cv_df)
sns.stripplot(x='model_name', y='accuracy', data=cv_df, 
              size=8, jitter=True, edgecolor="gray", linewidth=2)
plt.show()
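
For readers new to cross-validation, here is a small self-contained sketch of what `cross_val_score` returns. It uses synthetic data from `make_classification` as a stand-in for the tf-idf features, so the scores say nothing about the corpora in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

# Synthetic two-class data standing in for the tf-idf features and labels.
X, y = make_classification(n_samples=100, n_features=20, random_state=0)

# Each of the 5 folds is held out once while the model trains on the other 4.
scores = cross_val_score(LinearSVC(), X, y, scoring='accuracy', cv=5)
print(scores)
print(scores.mean())
```

The spread of the five scores is what the box plot above summarizes for each model.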

### Choose the model to use
Specify the model by indexing the one you want from the list above. `LinearSVC` tends to perform best across the book's experiments.

In [None]:
mod = models[1]
X_train, X_test, y_train, y_test, indices_train, indices_test = train_test_split(feats, labels, dfFull.index, 
                                                                                 test_size=0.20, random_state=0)
mod.fit(X_train, y_train)
y_pred = mod.predict(X_test)
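
Passing the index as a third array means it is shuffled and split in step with the features and labels, so each test row can be traced back to its chunk. A minimal sketch on toy arrays (all names here are illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

toy_feats = np.arange(20).reshape(10, 2)  # 10 rows of features
toy_labels = np.array([0, 1] * 5)
toy_index = np.arange(10)                 # original row positions

X_tr, X_te, y_tr, y_te, idx_tr, idx_te = train_test_split(
    toy_feats, toy_labels, toy_index, test_size=0.20, random_state=0)

# The index arrays let you look up each test row in the original data.
print(idx_te)
print(np.array_equal(toy_feats[idx_te], X_te))
```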

### Observe the classifier scores

In [None]:
print(metrics.classification_report(y_test, y_pred, 
                                    target_names=dfFull['category'].unique()))
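
The report lists precision, recall, and F1 for each class. As a reminder of how those figures are derived, the sketch below computes them by hand for one class on made-up labels; `precision_recall_f1` is a hypothetical helper, not a scikit-learn function.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Compute precision, recall, and F1 for one class from parallel label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

toy_true = ['progress', 'progress', 'progress', 'sample', 'sample', 'sample']
toy_pred = ['progress', 'progress', 'sample', 'sample', 'sample', 'progress']
p, r, f = precision_recall_f1(toy_true, toy_pred, 'progress')
print(p, r, f)
```

Precision asks how many chunks predicted as `progress` really were; recall asks how many of the real `progress` chunks were found.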

### Examine the results

In [None]:
### Code for all other models
### Assumption: mostInformTerms and docTermMatrix live in bookFunctions, like the two calls below them.
modTerms = bookFunctions.mostInformTerms(mod, topn=30)
dfTM = bookFunctions.docTermMatrix(feats, tfidf)
suppMeanDF, sampMeanDF = bookFunctions.buildMeanDF(dfTM, feats, 'progress', chunkLabels)
progWords, sampWords = bookFunctions.topGroupWord(suppMeanDF, sampMeanDF, modTerms, vocabulary)
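
One common way to find informative terms for a linear model is to sort its coefficients: in a two-class scikit-learn model such as `LinearSVC`, `coef_` holds one weight per feature, and positive weights push a prediction toward the class encoded as 1. The sketch below illustrates that idea on invented weights. It is not the `mostInformTerms` implementation, which is not shown here, and `top_terms` is a hypothetical name.

```python
import numpy as np

def top_terms(coefficients, vocab, topn=3):
    """Return the terms with the largest positive and largest negative weights."""
    coefficients = np.asarray(coefficients, dtype=float)
    order = np.argsort(coefficients)  # ascending: most negative first
    most_negative = [vocab[i] for i in order[:topn]]
    most_positive = [vocab[i] for i in order[::-1][:topn]]
    return most_positive, most_negative

# Made-up weights and vocabulary for illustration only.
toy_coef = [1.2, -0.8, 0.1, -1.5, 0.9]
toy_vocab = ['muse', 'act', 'the', 'parliament', 'arts']
positive, negative = top_terms(toy_coef, toy_vocab, topn=2)
print(positive)
print(negative)
```

With a fitted model, you would pass `mod.coef_[0]` and the vectorizer's vocabulary in place of the toy values.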