In [None]:
!wget https://ndownloader.figshare.com/files/3686778 -P data/

In [None]:
%%capture
!unzip data/3686778 -d data/

# The Conversional Novel

> The first step was to divide each novel into twenty equal parts. Rather than rely on the
irregularity of chapter divisions, which can vary within and between works, this process creates
standard units of analysis. [95]

Instead of actually using chapter divisions, Piper elects to split each novel into 20 equal parts. We can write a function `text_splitter` that will take in a `str` of the text and return a list of 20 equal parts:

In [None]:
def text_splitter(text):
    n = int(len(text)/20)  # get length n of each part
    text_list = [text[i*n:(i+1)*n] for i in range(20)]  # slice out the text
    return(text_list)
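
One property of this character-based slicing is worth seeing on a toy input: because `n` is rounded down, up to 19 trailing characters are left out, and a cut can fall in the middle of a word. The sketch below repeats the function so that it runs on its own; the sample string is invented for illustration.

```python
def text_splitter(text):
    # same logic as the cell above, repeated so this snippet is self-contained
    n = len(text) // 20  # length of each part, rounded down
    return [text[i*n:(i+1)*n] for i in range(20)]

sample = "abcdefghij" * 10 + "xyz"  # 103 characters (made-up text)
chunks = text_splitter(sample)

# characters that did not fit into 20 equal parts
dropped = len(sample) - sum(len(c) for c in chunks)
print(len(chunks), len(chunks[0]), dropped)
```

For a full-length novel a handful of dropped characters is negligible, but it is good to know the parts are equal in characters, not in words.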

> I then
calculated the Euclidean distance between each of the twenty parts of the work based on
the frequency of the remaining words and stored those results in a symmetrical distance
table. In the end, for each work I had a 20x20 table of distances between every part of
a work to every other, in which the distances are considered to be measures of the similarity
of the language between a work’s individual parts. [95]

Piper then calculates the ***Euclidean*** distance from each part to every other part. We can compute these with the `pairwise_distances` function in scikit-learn's `pairwise` module. We can write a function for that too! To make it fit with what we already have, let's have it take in the list of texts that our `text_splitter` outputs:

In [None]:
def text_distances(text_list):
    
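    # ENGLISH_STOP_WORDS lives in the `text` module; the old `stop_words` module was removed from scikit-learn
    from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer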
    from sklearn.metrics import pairwise
    
    ye_olde_stop_words = ['thou','thy','thee', 'thine', 'ye', 'hath','hast', 'wilt','aught',\
                          'art', 'dost','doth', 'shall', 'shalt','tis','canst','thyself',\
                         'didst', 'yea', 'wert']
    stop_words = list(ENGLISH_STOP_WORDS)+ye_olde_stop_words
    cv = CountVectorizer(stop_words = stop_words, min_df=0.6)
    dtm = cv.fit_transform(text_list)
    tt = TfidfTransformer(norm='l1',use_idf=False)
    dtm_tf = tt.fit_transform(dtm)
    dist_matrix = pairwise.pairwise_distances(dtm_tf, metric='euclidean')
    return(dist_matrix)
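
To make the two transformations inside `text_distances` concrete, here is a small sketch in plain NumPy rather than scikit-learn. The counts are invented. It divides each row by its total, which is what `norm='l1'` with `use_idf=False` amounts to for word counts, and then computes the Euclidean distance between every pair of rows.

```python
import numpy as np

# Toy word counts: 3 parts (rows) over a 3-word vocabulary (columns); invented numbers
counts = np.array([[2, 1, 1],
                   [1, 1, 2],
                   [4, 2, 2]], dtype=float)

# L1 normalisation: each row becomes relative frequencies that sum to 1
freqs = counts / counts.sum(axis=1, keepdims=True)

# Euclidean distance between every pair of rows, using broadcasting
diffs = freqs[:, None, :] - freqs[None, :, :]
dist = np.sqrt((diffs ** 2).sum(axis=2))
print(dist.round(3))
```

Parts 0 and 2 use the words in the same proportions, so their distance is zero even though the raw counts differ. The matrix is symmetric with zeros on the diagonal, which is why the functions below only need to look at pairs with `i < j`.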

Piper then introduces two new ideas.

> for the ***in-half distance*** I took the average distance of each part in the first half of a work to every other part in that half and subtracted it from the average distance of every part of the second half to itself. [95]

Let's write a function that does that, and have it take in our matrix returned by `text_distances`:

In [None]:
def in_half_dist(matrix):
    n = len(matrix)  # number of parts, should be 20
    d1 = []  # will hold distances within the first half
    d2 = []  # will hold distances within the second half
    for i in range(int(n/2)-1):  # loop through first half of work (10 parts in our case)
        for j in range(i+1, int(n/2)):  # pair each part with every later part of the first half
            d1.append(matrix[i,j])  # append distance between the two parts (both in first half)
    for i in range(int(n/2), n-1):  # same again for the second half
        for j in range(i+1, n):
            d2.append(matrix[i,j])
    # d1 and d2 have the same length when n is even, so this is the
    # absolute difference between the two halves' average distances
    return(abs(sum(d1)-sum(d2))/len(d1))
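
The same calculation can be written without explicit loops. The sketch below is a hypothetical variant (the name `in_half_dist_np` is ours, not Piper's) that uses `np.triu_indices` to pick out each pair once. The toy matrix comes from four invented points on a line, so the answer is easy to check by hand.

```python
import numpy as np

def in_half_dist_np(matrix):
    # hypothetical vectorised variant of in_half_dist
    matrix = np.asarray(matrix, dtype=float)
    n = len(matrix)
    h = n // 2
    iu1 = np.triu_indices(h, k=1)      # pairs i < j within the first half
    iu2 = np.triu_indices(n - h, k=1)  # pairs i < j within the second half
    first = matrix[:h, :h][iu1].mean()
    second = matrix[h:, h:][iu2].mean()
    return abs(first - second)

# four "parts" placed at invented positions on a line
pts = np.array([0.0, 1.0, 5.0, 8.0])
toy = np.abs(pts[:, None] - pts[None, :])

# first half: |0-1| = 1, second half: |5-8| = 3, difference = 2
result = in_half_dist_np(toy)
print(result)
```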

Great! And now for his second measure:

> For the cross-half distance, I took the average distance between
all of the first ten parts of a work to all of the second ten parts of a work, similar to the
process used in group average clustering. [95]

Let's write another function:

In [None]:
def cross_half_dist(matrix):
    n = len(matrix)  # number of parts, here 20
    d = []  # will hold distances
    for i in range(int(n/2)):  # loop through first half
        for j in range(int(n/2), n):  # loop through second half
            d.append(matrix[i,j])  # append distance between first and second
    return(sum(d)/len(d))  # take average
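
Since the cross-half pairs form a rectangular block of the distance matrix, the average can also be taken with a single slice. This is a sketch under the assumption that the matrix is a NumPy array; the name `cross_half_dist_np` and the toy points are invented for illustration.

```python
import numpy as np

def cross_half_dist_np(matrix):
    # hypothetical vectorised variant of cross_half_dist
    matrix = np.asarray(matrix, dtype=float)
    h = len(matrix) // 2
    # rows are first-half parts, columns are second-half parts
    return matrix[:h, h:].mean()

# four "parts" placed at invented positions on a line
pts = np.array([0.0, 1.0, 5.0, 8.0])
toy = np.abs(pts[:, None] - pts[None, :])

# cross pairs: |0-5|, |0-8|, |1-5|, |1-8| = 5, 8, 4, 7, so the mean is 6
cross = cross_half_dist_np(toy)
print(cross)
```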

Awesome! We can also write ourselves a quick function to call the four functions we just wrote:

In [None]:
def text_measures(text):
    text_list = text_splitter(text)
    dist_matrix = text_distances(text_list)
    return(cross_half_dist(dist_matrix), in_half_dist(dist_matrix))

`text_measures` should now return two values. The first value is the `cross_half_dist` and the second value is the `in_half_dist`. Let's test this out on Augustine's *Confessions*:

In [None]:
with open('data/Augustine-Confessions.txt') as f:
    confessions = f.read()

text_measures(confessions)

Looks good! Now we can read in the corpus Piper used:

In [None]:
from datascience import *
metadata_tb = Table.read_table('data/2_txtlab_Novel450.csv')
metadata_tb.show(5)

We'll stick with English so we don't have to think about the possible issues of going between languages:

In [None]:
metadata_tb = metadata_tb.where('language', "English")
metadata_tb.show(5)

We'll slightly change our `text_measures` function so that it takes a filename and reads the text from disk, instead of taking a string like the `confessions` one we already had:

In [None]:
corpus_path = 'data/2_txtalb_Novel450/'

def text_measures_alt(text_name):
    with open(corpus_path+text_name, 'r') as file_in:
        text = file_in.read()
    text_list = text_splitter(text)
    dist_matrix = text_distances(text_list)
    return(cross_half_dist(dist_matrix), in_half_dist(dist_matrix))

Now we can use `Table`'s `apply` method to call the function `text_measures_alt` on all the files in the corpus:

In [None]:
measures = metadata_tb.apply(text_measures_alt, 'filename')
measures

Let's add these measures to our `Table`:

In [None]:
metadata_tb['Cross-Half'] = measures[:,0]
metadata_tb['In-Half'] = measures[:,1]
metadata_tb.show(5)

If we want to see which novels stick out, we might be interested in the z-score for a particular novel. This is how many standard deviations the novel is away from the mean. Let's write a function:

In [None]:
def get_zscores(values):

    import numpy as np
    mn = np.mean(values)
    st = np.std(values)
    zs = []
    
    for x in values:
        z = (x-mn)/st
        zs.append(z)

    return zs
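
As a quick check of the formula, here is a vectorised sketch on invented numbers whose mean and standard deviation are easy to work out by hand. Note that `np.std` computes the population standard deviation by default (`ddof=0`). The name `get_zscores_np` is hypothetical.

```python
import numpy as np

def get_zscores_np(values):
    # hypothetical vectorised variant of get_zscores
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()  # population std (ddof=0)

# invented values with mean 5 and standard deviation 2
toy_values = [2, 4, 4, 4, 5, 5, 7, 9]
zs = get_zscores_np(toy_values)
print(zs)
```

A value of 9 sits two standard deviations above the mean of 5, so its z-score is 2.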

Now we can add these to the `Table` too:

In [None]:
metadata_tb['Cross-Z-Score'] = get_zscores(measures[:,0])
metadata_tb['In-Z-Score'] = get_zscores(measures[:,1])
metadata_tb.show(5)

Scatter plot, please!

In [None]:
metadata_tb.scatter('In-Half', 'Cross-Half')

## Homework

Use our z-scores to rank the novels. Which novels are most "conversional"?

Piper includes only words that appeared in at least 60% of the book's sections. How might that shape his findings? What if he had used a 50% threshold?

Try changing the `min_df` argument to 0.5. How do the rankings change? Try eliminating the `min_df` altogether.
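
Before rerunning everything, it may help to see what a document-frequency threshold does to a vocabulary. The sketch below is hand-rolled plain Python, not scikit-learn's implementation, and the four tiny "sections" are invented. It keeps a word only if the share of sections containing it is at least the threshold.

```python
def vocab_at_threshold(sections, min_share):
    # hypothetical helper: keep words found in at least `min_share` of the sections
    n = len(sections)
    doc_freq = {}
    for section in sections:
        for word in set(section.split()):  # count each word once per section
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return sorted(w for w, df in doc_freq.items() if df / n >= min_share)

# invented toy sections
sections = ["god heart soul", "god heart sin", "god soul", "god mercy"]

print(vocab_at_threshold(sections, 0.6))  # strictest threshold
print(vocab_at_threshold(sections, 0.5))
print(vocab_at_threshold(sections, 0.0))  # no threshold at all
```

Lowering the threshold lets rarer words into the distance calculation, which is the effect to watch for when you compare the rankings.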

## Bonus (not assigned)

Visualize distances among the twenty sections of the top-ranked conversional novel in the corpus using the MDS technique.