This notebook is designed to reproduce several findings from Andrew Piper's article "Novel Devotions: Conversional Reading, Computational Modeling, and the Modern Novel" (<i>New Literary History</i> 46.1 (2015), 63-98). See especially Fig. 2 (p. 72), Fig. 4 (p. 75), and Table 1 (p. 79).

Piper has made his research corpus of novels available here: http://txtlab.org/?p=601

## Pre-Processing
<li>Preparation</li>
<li>Term Frequency Revisited</li>
<li>Document-Term Matrix</li>
<li>Normalization</li>
<li>Streamlining</li>

## Textual Similarity
<li>Vector Space Model of Language</li>
<li>Visualizing Texts in Vector Space</li>
<li>Brief Aside: K-Means Clustering</li>
<li>The Conversional Novel</li>

# Preparation

In [None]:
%pylab inline
import numpy as np
from datascience import *

In [None]:
# Read the Confessions from file, split into Books

with open('Augustine - Confessions.txt') as file_in:
    confession = file_in.read()
confession_list = confession.split('\n'*6)

In [None]:
# Each list entry is a string containing a Book of the Confessions

confession_list

In [None]:
# Thirteen books in the Confessions, so hopefully that's the length of the list!

len(confession_list)

# Term Frequency Revisited

In [None]:
# Get a list of tokens from each text
first_book = confession_list[0]
first_token_list = first_book.lower().split()
first_token_list

In [None]:
# Then use Counter to return a dictionary of tokens and their frequencies
from collections import Counter
word_freq = Counter(first_token_list)
word_freq.most_common(20)
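
Splitting on whitespace leaves punctuation attached to words, so "read," and "read" are counted as different tokens. The sketch below shows one possible fix on a made-up sentence; the regular expression `[a-z]+` is an assumption for illustration, not the tokenizer used elsewhere in this notebook.

```python
import re
from collections import Counter

# A made-up sample sentence (not a quotation from the Confessions)
sample = "I read, and I read again; then I wept."

# Whitespace splitting keeps punctuation glued to the words
whitespace_tokens = sample.lower().split()

# Keeping only runs of letters strips the punctuation (illustrative pattern)
regex_tokens = re.findall(r"[a-z]+", sample.lower())

whitespace_freq = Counter(whitespace_tokens)
regex_freq = Counter(regex_tokens)

print(whitespace_freq["read"], regex_freq["read"])
```

With whitespace splitting, "read" is counted once because the other occurrence is stored as "read,"; with the regular expression both occurrences are counted together.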

In [None]:
# EX. Edit the script to return the ten most common words from the second book
#      of the Confessions. How similar are they to those of the first book?

# Document-Term Matrix

In [None]:
# If we plan to compare word frequencies across texts, scikit-learn's
# CountVectorizer streamlines the process: it tokenizes and counts every text at once.

from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()
dtm = cv.fit_transform(confession_list)

In [None]:
# Produces a 'sparse matrix.' Notice the dimensions.
dtm
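
To build intuition for what a document-term matrix holds, here is a minimal hand-rolled sketch on three made-up documents. It is not how CountVectorizer is implemented; in particular, this toy version keeps one-letter tokens and uses dense lists rather than a sparse matrix. The names `docs`, `vocabulary`, and `toy_dtm` are hypothetical.

```python
from collections import Counter

# Three tiny made-up documents
docs = ["i read and i wept", "i wept", "read and read again"]
tokenized = [doc.split() for doc in docs]

# The vocabulary is every distinct token in the corpus, in sorted order
vocabulary = sorted({token for doc in tokenized for token in doc})

# One row per document, one column per vocabulary word
toy_dtm = []
for doc in tokenized:
    counts = Counter(doc)
    toy_dtm.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in toy_dtm:
    print(row)
```

Each row is a document and each column a word, so a cell holds the number of times that word occurs in that document.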

In [None]:
# Let's make this human-readable to build our intuition

In [None]:
# De-sparsify
desparse = dtm.toarray()

# Create labels for columns
# (scikit-learn >= 1.0; older versions use cv.get_feature_names() instead)
word_list = cv.get_feature_names_out().tolist()

# Create a new Table
dtm_tb = Table(word_list).with_rows(desparse)

dtm_tb

In [None]:
# We can call up frequencies for a given word easily, since they are the column names

dtm_tb['read']

In [None]:
## Q. Check-in: What do the values in the word columns represent?

# Normalization

In [None]:
# Get the total number of words in the whole text
# (np.sum adds up every cell; Python's built-in sum would return per-column totals)
np.sum(desparse)

In [None]:
# In order to make apples-to-apples comparisons across Books, we can normalize our values
# by dividing each word count by the total number of words in its Book.

row_sums = np.sum(desparse, axis=1)
normed = desparse/row_sums[:,None]
dtm_tb = Table(word_list).with_rows(normed)

dtm_tb

In [None]:
dtm_tb['read']
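
The `row_sums[:,None]` idiom used for normalization relies on NumPy broadcasting: indexing with `None` turns the one-dimensional array of row totals into a column, so each row is divided by its own total. A minimal sketch on made-up counts (the array `counts` is hypothetical):

```python
import numpy as np

# Two made-up "Books": the second is five times longer, with the same proportions
counts = np.array([[2, 1, 1],
                   [10, 5, 5]])

row_sums = counts.sum(axis=1)         # shape (2,): total words per Book
column_of_sums = row_sums[:, None]    # shape (2, 1): one total per row

# Broadcasting divides every entry in a row by that row's total
normed = counts / column_of_sums

print(normed)
print(normed.sum(axis=1))
```

After normalization the two rows are identical, which is the apples-to-apples comparison the raw counts could not give.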

In [None]:
# For a variety of reasons we like to remove words like "the", "of", "and", etc.
# These are referred to as 'stopwords.'

# (older scikit-learn versions exposed this list in sklearn.feature_extraction.stop_words)
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
ENGLISH_STOP_WORDS

In [None]:
# Since we are using an older translation of Augustine, we have to remove archaic forms
# of these stopwords as well.

ye_olde_stop_words = ['thou','thy','thee', 'thine', 'ye', 'hath','hast', 'wilt','aught',\
                      'art', 'dost','doth', 'shall', 'shalt','tis','canst','thyself',\
                     'didst', 'yea', 'wert']

stop_words = list(ENGLISH_STOP_WORDS)+ye_olde_stop_words

# Remove stopwords from column list
dtm_tb = dtm_tb.drop(stop_words)

# It is often more efficient to perform operations on arrays rather than tables
dtm_array = dtm_tb.to_array()

In [None]:
## Q. In the script above, we normalized term frequencies before removing stopwords.
##    However, it would have been just as easy to do those steps in the opposite order.
##    Are there situations where this decision has more or less of an impact on the output?

# Streamlining

In [None]:
# In fact, we can simply instruct CountVectorizer to leave out stopwords altogether,
# and TfidfTransformer (with use_idf=False and norm='l1') handles the normalization.

from sklearn.feature_extraction.text import TfidfTransformer

cv = CountVectorizer(stop_words = stop_words)
dtm = cv.fit_transform(confession_list)
tt = TfidfTransformer(norm='l1',use_idf=False)
dtm_tf = tt.fit_transform(dtm)

# (scikit-learn >= 1.0; older versions use cv.get_feature_names() instead)
word_list = cv.get_feature_names_out().tolist()
dtm_array = dtm_tf.toarray()

In [None]:
# Note: If you are processing a text that uses only contemporary English, it may be
#       unnecessary to import the list of stopwords explicitly. Simply pass the value
#       "english" into the "stop_words" argument in CountVectorizer.

In [None]:
Table(word_list).with_rows(dtm_array)

In [None]:
# EX. How many unique words were included in our list? How many unique words
#     are there in total in the book (including stop words)?

# EX. What is the Type-Token Ratio of Augustine's Confessions?

# Vector Space Model of Language

In [None]:
# Let's treat each document as a point (or vector) in space

dtm_array = dtm_tf.toarray()
dtm_array

In [None]:
# Each vector has a number of coordinates equal to the number of
# unique words in the corpus.

dtm_array[0]

In [None]:
# Algebra 2: Euclidean Distance

a = (2,6)
b = (5,10)

euc_dist = sqrt( (a[0]-b[0])**2  +  (a[1]-b[1])**2 )
euc_dist

In [None]:
# It also works in three dimensions!

a = (2,6,15)
b = (5,10,3)

euc_dist = sqrt( (a[0]-b[0])**2 +  (a[1]-b[1])**2 + (a[2]-b[2])**2 )
euc_dist

In [None]:
from scipy.spatial import distance
distance.euclidean(a,b)
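
The same formula extends to any number of dimensions, which is what we need once each text has thousands of coordinates. A minimal sketch follows; the function name `euclidean_distance` is hypothetical, and `math.fsum` is used instead of `sum` because `%pylab inline` replaces the built-in `sum` with NumPy's.

```python
import math

def euclidean_distance(u, v):
    """Euclidean distance between two equal-length sequences of numbers."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same number of coordinates")
    # Square the difference along each coordinate, add them up, take the root
    return math.sqrt(math.fsum((x - y) ** 2 for x, y in zip(u, v)))

print(euclidean_distance((2, 6), (5, 10)))
print(euclidean_distance((2, 6, 15), (5, 10, 3)))
```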

In [None]:
# Pre-Calculus & Linear Algebra: Cosine Distance

a = (2,6)
b = (5,10)

# Don't worry about the formula so much as the intuition behind it: angle between vectors
cos_dist = 1 - (a[0]*b[0] + a[1]*b[1]) / (sqrt(a[0]**2 + a[1]**2) * sqrt(b[0]**2 + b[1]**2))
cos_dist

In [None]:
distance.cosine(a,b)
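
Cosine distance also generalizes to any number of dimensions: it is one minus the dot product of the two vectors divided by the product of their lengths. A minimal sketch, assuming neither vector is all zeros (the function name `cosine_distance` is hypothetical):

```python
import math

def cosine_distance(u, v):
    """One minus the cosine of the angle between two non-zero vectors."""
    dot = math.fsum(x * y for x, y in zip(u, v))
    length_u = math.sqrt(math.fsum(x * x for x in u))
    length_v = math.sqrt(math.fsum(y * y for y in v))
    return 1 - dot / (length_u * length_v)

print(cosine_distance((2, 6), (5, 10)))   # small angle between the vectors
print(cosine_distance((1, 0), (0, 1)))    # perpendicular vectors
print(cosine_distance((2, 6), (4, 12)))   # same direction, different length
```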

In [None]:
# EX. Try passing different values into both the euclidean and cosine
#     distance functions. What is your intuition about these different measurements?

#     Remember that all values in the Term-Frequency Matrix are non-negative,
#     in the interval [0,1], and that most are very small.

# Visualizing Texts in Vector Space

In [None]:
# Measure distances among multiple points

a = (2,6)
b = (5,10)
c = (14,11)

print(distance.euclidean(a,b))
print(distance.euclidean(a,c))
print(distance.euclidean(b,c))

In [None]:
# Represent points as rows of matrix

point_matrix = np.array([a,b,c])
point_matrix

In [None]:
# Calculate distances among all rows of matrix

from sklearn.metrics import pairwise
pairwise.pairwise_distances(point_matrix, metric='euclidean')
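
Conceptually, a pairwise distance matrix is just every point measured against every other point. The sketch below is a plain-Python illustration of that idea on the same three points, not scikit-learn's implementation; `pairwise_euclidean` is a hypothetical helper, and `math.dist` requires Python 3.8 or later.

```python
import math

def pairwise_euclidean(points):
    """Return a square list-of-lists where entry [i][j] is the distance from point i to point j."""
    return [[math.dist(p, q) for q in points] for p in points]

points = [(2, 6), (5, 10), (14, 11)]
matrix = pairwise_euclidean(points)

for row in matrix:
    print(row)
```

The diagonal is zero because every point is at distance zero from itself, and the matrix is symmetric because the distance from a to b equals the distance from b to a.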

In [None]:
# Calculate distances among texts in vector space

dist_matrix = pairwise.pairwise_distances(dtm_tf, metric='euclidean')

title_list = ['Book '+str(i+1) for i in range(len(confession_list))]
Table(title_list).with_rows(dist_matrix)

In [None]:
# Multi-Dimensional Scaling

# Measures the relative distances among points in high dimensional space
# and projects them into two-dimensional space. (hand-waving the MDS algorithm for now)

from sklearn.manifold import MDS

mds = MDS(n_components = 2, dissimilarity="precomputed")
embeddings = mds.fit_transform(dist_matrix)

_, ax = plt.subplots(figsize=(10,10))
ax.scatter(embeddings[:,0], embeddings[:,1], alpha=0)
for i in range(len(embeddings)):
    ax.annotate(i+1, ((embeddings[i,0], embeddings[i,1])))
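
For readers who want a peek behind the hand-waving, here is a sketch of classical (Torgerson) MDS in NumPy. This is a simpler, closed-form relative of the iterative stress-minimization (SMACOF) that scikit-learn's `MDS` performs, so its output will generally not match the plot above; the function name `classical_mds` is hypothetical.

```python
import numpy as np

def classical_mds(dist, n_components=2):
    """Classical MDS: recover coordinates from a matrix of pairwise distances."""
    dist = np.asarray(dist, dtype=float)
    n = dist.shape[0]
    # Double-center the squared distances to obtain a matrix of inner products
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering
    # The largest eigenvalues and their eigenvectors give the best low-dimensional fit
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:n_components]
    top_vals = np.clip(eigvals[order], 0, None)  # guard against tiny negative values
    return eigvecs[:, order] * np.sqrt(top_vals)

# The three toy points from earlier in this section
points = np.array([(2, 6), (5, 10), (14, 11)], dtype=float)
dist = np.sqrt(((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1))

coords = classical_mds(dist)
recovered = np.sqrt(((coords[:, None, :] - coords[None, :, :]) ** 2).sum(axis=-1))
print(np.round(recovered, 3))
```

The recovered coordinates may be rotated or reflected relative to the original points, but the distances among them are preserved, which is all that matters when reading an MDS plot.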

In [None]:
# EX. Try visualizing the textual similarities again using the Cosine distance.
#     How does that change the result? Why?

# Brief Aside: K-Means Clustering

In [None]:
# Tries to find natural groupings among points, once we tell it
# how many groups to look for. (Also hand-waving K-Means algorithm)

from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=2)
kmeans.fit_predict(dist_matrix)
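
The core loop of K-Means can be sketched in a few lines: assign each point to its nearest center, move each center to the mean of its points, and repeat. The version below is a bare-bones illustration with hand-picked starting centers; scikit-learn's `KMeans` adds smarter initialization and other refinements. The function name `kmeans_sketch` and the toy points are hypothetical.

```python
import numpy as np

def kmeans_sketch(points, init_centers, n_iter=10):
    """Alternate between assigning points to centers and re-averaging the centers."""
    points = np.asarray(points, dtype=float)
    centers = np.asarray(init_centers, dtype=float).copy()
    for _ in range(n_iter):
        # Distance from every point to every center, then pick the nearest
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of the points assigned to it
        for k in range(len(centers)):
            members = points[labels == k]
            if len(members):
                centers[k] = members.mean(axis=0)
    return labels, centers

toy_points = [(2, 6), (5, 10), (14, 11), (15, 12)]
labels, centers = kmeans_sketch(toy_points, init_centers=[(2, 6), (14, 11)])
print(labels)
print(centers)
```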

In [None]:
# EX. Try passing 'embeddings' and 'dtm_tf' as arguments into kmeans.fit_predict()
#     Why do the clusters vary?

# The Conversional Novel

In [None]:
# Operationalizing conversion in the novel

def text_splitter(text):
    n = int(len(text)/20)
    text_list = [text[i*n:(i+1)*n] for i in range(20)]
    return(text_list)

def text_distances(text_list):
    
    # (older scikit-learn versions exposed ENGLISH_STOP_WORDS in sklearn.feature_extraction.stop_words)
    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, ENGLISH_STOP_WORDS
    from sklearn.metrics import pairwise
    
    ye_olde_stop_words = ['thou','thy','thee', 'thine', 'ye', 'hath','hast', 'wilt','aught',\
                          'art', 'dost','doth', 'shall', 'shalt','tis','canst','thyself',\
                         'didst', 'yea', 'wert']
    stop_words = list(ENGLISH_STOP_WORDS)+ye_olde_stop_words
    cv = CountVectorizer(stop_words = stop_words, min_df=0.6)
    dtm = cv.fit_transform(text_list)
    tt = TfidfTransformer(norm='l1',use_idf=False)
    dtm_tf = tt.fit_transform(dtm)
    dist_matrix = pairwise.pairwise_distances(dtm_tf, metric='euclidean')
    return(dist_matrix)

def in_half_dist(matrix):
    n = len(matrix)
    d1 = []
    d2 = []
    for i in range(int(n/2)-1):
        for j in range(i+1, int(n/2)):
            d1.append(matrix[i,j])
    for i in range(int(n/2), n-1):
        for j in range(i+1, n):
            d2.append(matrix[i,j])
    return(abs(sum(d1)-sum(d2))/len(d1))


def cross_half_dist(matrix):
    n = len(matrix)
    d = []
    for i in range(int(n/2)):
        for j in range(int(n/2), n):
            d.append(matrix[i,j])
    return(sum(d)/len(d))

def text_measures(text):
    text_list = text_splitter(text)
    dist_matrix = text_distances(text_list)
    return(cross_half_dist(dist_matrix), in_half_dist(dist_matrix))
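
To see what the two measures capture, it helps to run them on a distance matrix small enough to check by hand. The sketch below re-implements the same arithmetic as `cross_half_dist` and `in_half_dist` with NumPy slicing; the helper name `half_measures` and the four-section toy "text" are hypothetical.

```python
import numpy as np

def half_measures(matrix):
    """Return (cross-half, in-half) measures for a square distance matrix."""
    matrix = np.asarray(matrix, dtype=float)
    n = len(matrix)
    half = n // 2
    # Cross-half: average distance between first-half and second-half sections
    cross = matrix[:half, half:].mean()
    # In-half: compare total distances within each half, counting each pair once
    first_pairs = matrix[:half, :half][np.triu_indices(half, k=1)]
    second_pairs = matrix[half:, half:][np.triu_indices(n - half, k=1)]
    in_half = abs(first_pairs.sum() - second_pairs.sum()) / len(first_pairs)
    return cross, in_half

# Four sections placed on a line at positions 0, 1, 10 and 12:
# the two halves are far apart, and the second half is slightly more spread out
positions = np.array([0, 1, 10, 12], dtype=float)
toy_matrix = np.abs(positions[:, None] - positions[None, :])

cross, in_half = half_measures(toy_matrix)
print(cross, in_half)
```

A large cross-half value means the two halves of a text use different vocabulary, while a large in-half value means one half is internally more varied than the other.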

In [None]:
# Test measurement on the Confessions

text_measures(confession)

In [None]:
# Get corpus metadata

metadata_tb = Table.read_table('2_txtlab_Novel450.csv')
metadata_tb

In [None]:
# We'll use just a single language, since there are likely to be different
# norms across languages.

metadata_tb = metadata_tb.where('language', "English")
metadata_tb

In [None]:
# Modify function to read texts from hard-drive
# Integrates with Tables' ".apply()" method and available metadata

corpus_path = '2_txtalb_Novel450/'

def text_measures_alt(text_name):
    with open(corpus_path+text_name, 'r') as file_in:
        text = file_in.read()
    text_list = text_splitter(text)
    dist_matrix = text_distances(text_list)
    return(cross_half_dist(dist_matrix), in_half_dist(dist_matrix))

In [None]:
measures = metadata_tb.apply(text_measures_alt, 'filename')

In [None]:
# Separate the values from 'measures' into two separate Table columns

metadata_tb['Cross-Half'] = measures[:,0]
metadata_tb['In-Half'] = measures[:,1]

In [None]:
metadata_tb

In [None]:
# Compute the Z-score of each value -- its number of standard deviations from the mean

def get_zscores(values):

    import numpy as np
    mn = np.mean(values)
    st = np.std(values)
    zs = []
    
    for x in values:
        z = (x-mn)/st
        zs.append(z)

    return zs

metadata_tb['Cross-Z-Score'] = get_zscores(measures[:,0])
metadata_tb['In-Z-Score'] = get_zscores(measures[:,1])
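
A quick worked example makes the Z-score concrete. The sketch below uses only the standard library and a made-up list of values; `statistics.pstdev` is the population standard deviation, which matches the default behavior of `np.std`. The function name `z_scores` is hypothetical.

```python
import statistics

def z_scores(values):
    """Number of (population) standard deviations each value lies from the mean."""
    mean = statistics.fmean(values)
    spread = statistics.pstdev(values)
    return [(x - mean) / spread for x in values]

# Made-up values with mean 5 and standard deviation 2
toy_values = [2, 4, 4, 4, 5, 5, 7, 9]
toy_z = z_scores(toy_values)
print(toy_z)
```

The value 9 is two standard deviations above the mean, so its Z-score is 2; the value 5 sits exactly on the mean, so its Z-score is 0.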

In [None]:
metadata_tb

In [None]:
# Let's visualize!
metadata_tb.scatter('In-Half', 'Cross-Half')

In [None]:
# Even bigger!
figure(figsize=(10,10))
xlim((0,0.1))
scatter(measures[:,1], measures[:,0])

In [None]:
# Create Rankings for novels' Cross-Half and In-Half values

cross_sort = metadata_tb.sort('Cross-Half', descending=True)['id']
in_sort = metadata_tb.sort('In-Half', descending=True)['id']

In [None]:
cross_sort

In [None]:
# Average the Rankings from the two lists

ranks = [ ( cross_sort.tolist().index(_id) + in_sort.tolist().index(_id) )/2 for _id in metadata_tb['id']]
metadata_tb['Ranking'] = ranks
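
Calling `.index()` inside a loop rescans the sorted list for every novel. An alternative sketch that gives the same averaged ranks builds a lookup from id to position once; the function `average_ranks` and the toy ids below are hypothetical.

```python
def average_ranks(ids, cross_order, in_order):
    """Average each id's position in two sorted lists (position 0 = highest value)."""
    cross_rank = {item: position for position, item in enumerate(cross_order)}
    in_rank = {item: position for position, item in enumerate(in_order)}
    return [(cross_rank[i] + in_rank[i]) / 2 for i in ids]

# Three made-up ids, sorted two different ways
toy_ids = ["a", "b", "c"]
toy_ranks = average_ranks(toy_ids,
                          cross_order=["b", "a", "c"],
                          in_order=["b", "c", "a"])
print(toy_ranks)
```

Here "b" tops both lists and so receives the best (lowest) averaged rank, while "a" and "c" tie.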

In [None]:
# Most conversional novels

columns = ['author', 'title', 'Cross-Half', 'Cross-Z-Score', 'In-Half', 'In-Z-Score', 'Ranking']
metadata_tb.select(columns).sort('Ranking')

In [None]:
# Q.  Piper includes only words that appeared in at least 60% of the book's sections.
#     How might that shape his findings? What if he had used a 50% threshold?

# EX. Try changing the 'min_df' argument to 0.5. How do the rankings change?
#     Try eliminating the 'min_df' altogether.

In [None]:
# EX. Visualize distances among the twenty sections of the top-ranked
#     conversional novel in the corpus using the MDS technique.

In [None]:
# EX. When we processed our texts most recently, we removed stopwords before normalizing.
#     Switch the order of these tasks. Does it change our findings? Why?