In this notebook, we'll build a predictive model for measuring the passage of narrative time using data from Underwood (2018), "Why Literary Time is Measured in Minutes", which annotates how many minutes pass in a 250-word chunk of text. We'll compare the accuracy of a regularized linear regression model with a baseline that assigns the *average* time in the training data to all test passages (think of this as an analogue of a majority-class baseline for real-valued predictions). Because the model is trained on the log of the annotated minutes, that average is taken over log-transformed times.
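
As a minimal sketch of the baseline idea described above (the helper names `mean_baseline` and `mse` and the toy numbers are invented for illustration; they are not part of the notebook's code), every test passage receives the same prediction, namely the mean of the training targets:

```python
from statistics import mean

def mean_baseline(train_targets, n_test):
    """Predict the training-set mean for every test item."""
    average = mean(train_targets)
    return [average] * n_test

def mse(gold, predicted):
    """Mean squared error between two equal-length sequences."""
    return sum((g - p) ** 2 for g, p in zip(gold, predicted)) / len(gold)

# Toy log-minute targets, invented for illustration only.
train_y = [0.0, 1.0, 2.0, 5.0]
test_y = [1.0, 3.0]

baseline_preds = mean_baseline(train_y, len(test_y))
print(baseline_preds)               # the same value for every test passage
print(mse(test_y, baseline_preds))  # error a real model has to beat
```

A model is only useful to the extent that its error falls below this number, since the baseline never looks at the text at all.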

In [1]:
from math import log, exp
from sklearn.model_selection import GroupKFold
import numpy as np
from collections import Counter
from sklearn import linear_model
from scipy import sparse
from sklearn.metrics import mean_squared_error, mean_absolute_error
import operator
import nltk

In [2]:
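def read_data(filename):
    X=[]
    Y=[]
    groups=[]
    with open(filename) as file:
        for line in file:
            # Tab-separated columns: 0 = book id, 2 = elapsed time in minutes, 6 = passage text.
            cols=line.split("\t")
            book_id=cols[0]
            time=float(cols[2])
            text=cols[6].rstrip()
            # Keep non-empty passages with a positive duration under 1,000,000 minutes.
            if len(text) > 0 and time > 0 and time < 1000000:
                tokens=nltk.word_tokenize(text.lower())
                X.append(tokens)
                # Model the natural log of the duration.
                Y.append(log(time))
                groups.append(book_id)

    # Token lists differ in length, so store them in a 1-D object array.
    return np.array(X, dtype=object), np.array(Y), groups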

In [3]:
X, Y, groups=read_data("../data/literary_time_data.txt")


In [4]:
def build_features(dataX, feature_functions):
    
    """ Featurize each tokenized document by applying every function in feature_functions and merging the resulting feature dictionaries. """
    
    data=[]
    for tokens in dataX:
        feats={}
        
        for function in feature_functions:
            feats.update(function(tokens))

        data.append(feats)
    return data

In [5]:
def features_to_ids(data, feature_vocab):
    
    """ 
    
    This helper function converts a list of feature dictionaries (one per document) to a
    sparse matrix that we can fit in a scikit-learn model.  This is important because almost
    all feature values will be 0 for most documents (why?), and we don't want to store them
    all in memory.  Features that are not in feature_vocab are ignored.

    """
    new_data=sparse.lil_matrix((len(data), len(feature_vocab)))
    for idx,doc in enumerate(data):
        for f in doc:
            if f in feature_vocab:
                new_data[idx,feature_vocab[f]]=doc[f]
    return new_data
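
To make the sparsity point in the docstring concrete, here is a small sketch in plain Python. The `density` helper and the toy documents are hypothetical; only the 250-word passage length and the 100,000-feature cap come from this notebook.

```python
def density(docs, feature_vocab):
    """Fraction of cells in the document-by-feature matrix that are nonzero."""
    nonzero = sum(1 for doc in docs for f in doc if f in feature_vocab)
    total = len(docs) * len(feature_vocab)
    return nonzero / total

# Toy feature dictionaries, invented for illustration.
docs = [
    {"UNIGRAM_the": 1, "UNIGRAM_day": 1},
    {"UNIGRAM_the": 1, "UNIGRAM_said": 1},
    {"UNIGRAM_voice": 1},
]
vocab = {"UNIGRAM_the": 0, "UNIGRAM_day": 1, "UNIGRAM_said": 2, "UNIGRAM_voice": 3}

print(density(docs, vocab))   # 5 nonzero cells out of 3 * 4

# Rough upper bound for this notebook's setting: a passage of about 250 tokens
# can switch on at most about 250 of up to 100,000 vocabulary columns.
print(250 / 100_000)
```

With real data the vocabulary is far larger than any single passage, so nearly every cell is zero and a sparse format stores only the few that are not.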

In [6]:
def create_vocab(data, top_n=None):
    
    """ 
    
    This helper function assigns a unique numerical id to each feature name.
    top_n limits the features to only the n most frequent features observed in the training data 
    (in terms of the number of documents that contain each feature).
    
    """
    
    counts=Counter()
    for doc in data:
        for feat in doc:
            counts[feat]+=1

    feature_vocab={}

    for idx, (k, v) in enumerate(counts.most_common(top_n)):
        feature_vocab[k]=idx
                
    return feature_vocab
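
The following sketch walks through the same counting logic on toy data (the documents are invented for illustration). Because each document is a dictionary, each feature is counted at most once per document, so the counts are document frequencies rather than token frequencies.

```python
from collections import Counter

# Toy feature dictionaries, invented for illustration.
docs = [
    {"UNIGRAM_the": 1, "UNIGRAM_day": 1},
    {"UNIGRAM_the": 1, "UNIGRAM_said": 1},
    {"UNIGRAM_the": 1, "UNIGRAM_day": 1, "UNIGRAM_voice": 1},
]

doc_freq = Counter()
for doc in docs:
    for feat in doc:          # iterating a dict visits each key once per document
        doc_freq[feat] += 1

# Keep the 2 most frequent features and number them in order of frequency.
top_2 = {feat: idx for idx, (feat, _) in enumerate(doc_freq.most_common(2))}

print(doc_freq)
print(top_2)
```

Features that fall outside the top `n` receive no id at all, which is why `features_to_ids` checks membership before writing a value.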

In [7]:
def pipeline(trainX, devX, trainY, devY, feature_functions):

    """ This function evaluates a list of feature functions on the training/dev data arguments """
    
    trainX_feat=build_features(trainX, feature_functions)
    devX_feat=build_features(devX, feature_functions)

    # just create vocabulary from features in *training* data.
    feature_vocab=create_vocab(trainX_feat, top_n=100000)

    trainX_ids=features_to_ids(trainX_feat, feature_vocab)
    devX_ids=features_to_ids(devX_feat, feature_vocab)
    
    clf = linear_model.Ridge(alpha=1)
    clf.fit(trainX_ids, trainY)
    
    predictions=clf.predict(devX_ids)
    
    return clf, predictions, feature_vocab
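
For comparison, scikit-learn's `DictVectorizer` follows the same train-only vocabulary pattern. This is an alternative sketch, not what the notebook does: `DictVectorizer` has no equivalent of the `top_n` cap used above, and the toy features and targets are invented.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import Ridge

# Toy feature dictionaries and log-minute targets, invented for illustration.
train_feats = [
    {"UNIGRAM_day": 1, "UNIGRAM_passed": 1},
    {"UNIGRAM_said": 1, "UNIGRAM_voice": 1},
]
train_y = [5.0, 0.5]
dev_feats = [{"UNIGRAM_day": 1, "UNIGRAM_zeppelin": 1}]

vec = DictVectorizer()
train_X = vec.fit_transform(train_feats)   # vocabulary is built from training data only
dev_X = vec.transform(dev_feats)           # features unseen in training are dropped

model = Ridge(alpha=1)
model.fit(train_X, train_y)

print(train_X.shape, dev_X.shape)
print(model.predict(dev_X))
```

Building the vocabulary from the training split alone keeps information about the dev passages out of the model, which is the same reason `pipeline` calls `create_vocab` only on `trainX_feat`.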

In [8]:
def unigram_feature(tokens):
    feats={}
    for word in tokens:
        feats["UNIGRAM_%s" % word]=1
    return feats

In [9]:
features=[unigram_feature]

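# GroupKFold keeps all passages from the same book in the same fold, so no book
# appears in both the training and test data of a split.
kf = GroupKFold(n_splits=10)
preds=[]
golds=[]
baselines=[]

for train_index, test_index in kf.split(X, Y, groups):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = Y[train_index], Y[test_index]
    clf, predictions, vocab=pipeline(X_train, X_test, y_train, y_test, features)
    preds.extend(predictions)
    golds.extend(y_test)
    # Baseline: predict the mean log-minutes of the training folds for every test passage.
    average_in_training=np.mean(y_train)
    baselines.extend([average_in_training]*len(y_test))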
        

# Undo the log transform so that errors can also be reported in minutes.
expgolds=np.exp(golds)
exppreds=np.exp(preds)
expbaselines=np.exp(baselines)

print ("Lower numbers are better:\n")

print ("===Mean squared error===\n")

print(f"Baseline MSE (log minutes): {mean_squared_error(golds, baselines):.1f}") 
print(f"Ridge MSE (log minutes): {mean_squared_error(golds, preds):.1f}")

print()

print(f"Baseline MSE (minutes): {mean_squared_error(expgolds, expbaselines):.1f}")
print(f"Ridge MSE (minutes): {mean_squared_error(expgolds, exppreds):.1f}")

print ("\n===Mean absolute error===\n")

print(f"Baseline MAE (log minutes): {mean_absolute_error(golds, baselines):.1f}")
print(f"Ridge MAE (log minutes): {mean_absolute_error(golds, preds):.1f}")

print()
print(f"Baseline MAE (minutes): {mean_absolute_error(expgolds, expbaselines):.1f}")  
print(f"Ridge MAE (minutes): {mean_absolute_error(expgolds, exppreds):.1f}")




Lower numbers are better:

===Mean squared error===

Baseline MSE (log minutes): 14.1
Ridge MSE (log minutes): 12.2

Baseline MSE (minutes): 5547264622.6
Ridge MSE (minutes): 5584585933.8

===Mean absolute error===

Baseline MAE (log minutes): 3.2
Ridge MAE (log minutes): 2.7

Baseline MAE (minutes): 15740.9
Ridge MAE (minutes): 16132.7
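
In the output above, ridge beats the baseline in log minutes but not in minutes. One possible reading, offered as an interpretation rather than a finding from the source, is that errors on the minute scale are dominated by a few very long passages. The sketch below reproduces that pattern with invented numbers; `mae` is a hypothetical helper.

```python
from math import exp, log

def mae(gold, pred):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(g - p) for g, p in zip(gold, pred)) / len(gold)

# Invented durations in minutes; the last passage is a deliberate outlier.
gold_minutes = [1.0, 10.0, 100.0, 100000.0]
# A hypothetical model: exact on two passages, 10x too high on one,
# and far too low on the outlier.
model_minutes = [1.0, 10.0, 1000.0, 50.0]

gold_log = [log(m) for m in gold_minutes]
model_log = [log(m) for m in model_minutes]

# Constant baseline: the mean of the log targets (computed on the same toy
# data for brevity; the notebook uses the training folds instead).
mean_log = sum(gold_log) / len(gold_log)
baseline_log = [mean_log] * len(gold_log)
baseline_minutes = [exp(v) for v in baseline_log]

print(f"log minutes: model {mae(gold_log, model_log):.2f}  baseline {mae(gold_log, baseline_log):.2f}")
print(f"minutes:     model {mae(gold_minutes, model_minutes):.1f}  baseline {mae(gold_minutes, baseline_minutes):.1f}")
```

In log space the toy model wins because it is close on most passages. In minutes, a single long passage accounts for almost all of the error, so small differences on that one item decide the comparison.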


What are the features that are predictive of the fast (and slow) passage of time?  Looking at these features, what aspects of the text do you think this model is latching onto in predicting time?

In [11]:
def print_weights(clf, vocab, n=10):
    weights=clf.coef_
    reverse_vocab=[None]*len(weights)
    for k in vocab:
        reverse_vocab[vocab[k]]=k

    for feature, weight in list(reversed(sorted(zip(reverse_vocab, weights), key = operator.itemgetter(1))))[:n]:
        print("%.3f\t%s" % (weight, feature))

    print()

    for feature, weight in sorted(zip(reverse_vocab, weights), key = operator.itemgetter(1))[:n]:
        print("%.3f\t%s" % (weight, feature))

In [12]:
# clf and vocab come from the last cross-validation fold.
print_weights(clf, vocab)

0.890	UNIGRAM_day
0.745	UNIGRAM_appear
0.697	UNIGRAM_return
0.591	UNIGRAM_agreed
0.532	UNIGRAM_aunt
0.527	UNIGRAM_story
0.523	UNIGRAM_days
0.513	UNIGRAM_suffered
0.501	UNIGRAM_joy
0.476	UNIGRAM_expected

-0.573	UNIGRAM_''
-0.557	UNIGRAM_replied
-0.523	UNIGRAM_speaking
-0.503	UNIGRAM_known
-0.496	UNIGRAM_wish
-0.451	UNIGRAM_voice
-0.438	UNIGRAM_said
-0.434	UNIGRAM_given
-0.434	UNIGRAM_dozen
-0.426	UNIGRAM_standing
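
Since the model predicts log minutes and the unigram features are binary, each weight can be read as a multiplicative factor on the predicted minutes, holding all other features fixed. The short sketch below converts a few of the weights printed above; `multiplier` is a hypothetical helper name.

```python
from math import exp

def multiplier(weight):
    """Factor by which a binary feature with this log-space weight scales the prediction."""
    return exp(weight)

# Weights copied from the output above.
examples = [
    ("UNIGRAM_day", 0.890),
    ("UNIGRAM_days", 0.523),
    ("UNIGRAM_said", -0.438),
    ("UNIGRAM_''", -0.573),
]

for feature, weight in examples:
    print(f"{feature}\t{weight:+.3f}\tx{multiplier(weight):.2f}")
```

A factor above 1 lengthens the predicted duration and a factor below 1 shortens it. These weights belong to the model from a single fold, so the exact values would likely shift somewhat from fold to fold.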


Now let's make some predictions for new ~250-word passages.

In [13]:
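def predict(text, feature_vocab):
    # Note: read_data lowercases text before tokenizing; this function does not, so
    # capitalized tokens will not match the (lowercased) training vocabulary.
    tokens=nltk.word_tokenize(text)
    feats=build_features([tokens], [unigram_feature])
    devX_ids=features_to_ids(feats, feature_vocab)
    # Uses the global clf, i.e. the model fit in the last cross-validation fold.
    prediction=clf.predict(devX_ids)
    # predict returns a length-1 array of log minutes; convert it back to minutes.
    return exp(prediction[0])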

“I know you heard all this before, Franklin, but it didn’t take, obviously. Maybe this time it will. Right now, all of you are Grubs. We have four ranks of behavior here—start as a Grub, work your way up to Explorer, then Pioneer, and finally, Ace. Earn merits for acting right, and you move on up the ladder. You work on achieving the highest rank of Ace and then you graduate and go home to your families.” He paused. “If they’ll have you, but that’s between y’all.” An Ace, he said, listens to the housemen and his house father, does his work without shirking and malingering, and applies himself to his studies. An Ace does not rough-house, he does not cuss, he does not blaspheme or carry on. He works to reform himself, from sunrise to sunset. “It’s up to you how much time you spend with us,” Spencer said. “We don’t mess around with idiots here. If you mess up, we have a place for you, and you will not like it. I’ll see to it personally.”   Spencer had a severe face, but when he touched the enormous key ring on his belt the corners of his mouth twitched in pleasure, it seemed, or to signal a murkier emotion. The supervisor turned to Franklin, the boy who’d come back for a second taste of Nickel. “Tell them, Franklin.”   Franklin’s voice cracked and he had to fix himself before he got out, “Yes, sir. You don’t want to step over the line in here.”

(Colson Whitehead, *The Nickel Boys*)

In [14]:
text="“I know you heard all this before, Franklin, but it didn’t take, obviously. Maybe this time it will. Right now, all of you are Grubs. We have four ranks of behavior here—start as a Grub, work your way up to Explorer, then Pioneer, and finally, Ace. Earn merits for acting right, and you move on up the ladder. You work on achieving the highest rank of Ace and then you graduate and go home to your families.” He paused. “If they’ll have you, but that’s between y’all.” An Ace, he said, listens to the housemen and his house father, does his work without shirking and malingering, and applies himself to his studies. An Ace does not rough-house, he does not cuss, he does not blaspheme or carry on. He works to reform himself, from sunrise to sunset. “It’s up to you how much time you spend with us,” Spencer said. “We don’t mess around with idiots here. If you mess up, we have a place for you, and you will not like it. I’ll see to it personally.”   Spencer had a severe face, but when he touched the enormous key ring on his belt the corners of his mouth twitched in pleasure, it seemed, or to signal a murkier emotion. The supervisor turned to Franklin, the boy who’d come back for a second taste of Nickel. “Tell them, Franklin.”   Franklin’s voice cracked and he had to fix himself before he got out, “Yes, sir. You don’t want to step over the line in here.”"
print(f"{predict(text, vocab):.1f} minutes")


1.2 minutes


“Shhh!” hissed Professor McGonagall, “you’ll wake the Muggles!”  “S-s-sorry,” sobbed Hagrid, taking out a large, spotted handkerchief and burying his face in it. “But I c-c-can’t stand it — Lily an’ James dead — an’ poor little Harry off ter live with Muggles —”    “Yes, yes, it’s all very sad, but get a grip on yourself, Hagrid, or we’ll be found,” Professor McGonagall whispered, patting Hagrid gingerly on the arm as Dumbledore stepped over the low garden wall and walked to the front door. He laid Harry gently on the doorstep, took a letter out of his cloak, tucked it inside Harry’s blankets, and then came back to the other two. For a full minute the three of them stood and looked at the little bundle; Hagrid’s shoulders shook, Professor McGonagall blinked furiously, and the twinkling light that usually shone from Dumbledore’s eyes seemed to have gone out. “Well,” said Dumbledore finally, “that’s that. We’ve no business staying here. We may as well go and join the celebrations.”   “Yeah,” said Hagrid in a very muffled voice, “I’ll be takin’ Sirius his bike back. G’night, Professor McGonagall — Professor Dumbledore, sir.” Wiping his streaming eyes on his jacket sleeve, Hagrid swung himself onto the motorcycle and kicked the engine into life; with a roar it rose into the air and off into the night.  “I shall see you soon, I expect, Professor McGonagall,” said Dumbledore, nodding to her. Professor McGonagall blew her nose in reply.

(J.K. Rowling, *Harry Potter and the Sorcerer's Stone*)

In [15]:
text="“Shhh!” hissed Professor McGonagall, “you’ll wake the Muggles!”  “S-s-sorry,” sobbed Hagrid, taking out a large, spotted handkerchief and burying his face in it. “But I c-c-can’t stand it — Lily an’ James dead — an’ poor little Harry off ter live with Muggles —”    “Yes, yes, it’s all very sad, but get a grip on yourself, Hagrid, or we’ll be found,” Professor McGonagall whispered, patting Hagrid gingerly on the arm as Dumbledore stepped over the low garden wall and walked to the front door. He laid Harry gently on the doorstep, took a letter out of his cloak, tucked it inside Harry’s blankets, and then came back to the other two. For a full minute the three of them stood and looked at the little bundle; Hagrid’s shoulders shook, Professor McGonagall blinked furiously, and the twinkling light that usually shone from Dumbledore’s eyes seemed to have gone out. “Well,” said Dumbledore finally, “that’s that. We’ve no business staying here. We may as well go and join the celebrations.”   “Yeah,” said Hagrid in a very muffled voice, “I’ll be takin’ Sirius his bike back. G’night, Professor McGonagall — Professor Dumbledore, sir.” Wiping his streaming eyes on his jacket sleeve, Hagrid swung himself onto the motorcycle and kicked the engine into life; with a roar it rose into the air and off into the night.  “I shall see you soon, I expect, Professor McGonagall,” said Dumbledore, nodding to her. Professor McGonagall blew her nose in reply."
print(f"{predict(text, vocab):.1f} minutes")

5.6 minutes


The station wagons arrived at noon, a long shining line that coursed through the west campus. In single file they eased around the orange I-beam sculpture and moved toward the dormitories. The roofs of the station wagons were loaded down with carefully secured suitcases full of light and heavy clothing; with boxes of blankets, boots and shoes, stationery and books, sheets, pillows, quilts; with rolled-up rugs and sleeping bags; with bicycles, skis, rucksacks, English and Western saddles, inflated rafts. As cars slowed to a crawl and stopped, students sprang out and raced to the rear doors to begin removing the objects inside; the stereo sets, radios, personal computers; small refrigerators and table ranges; the cartons of phonograph records and cassettes; the hairdryers and styling irons; the tennis rackets, soccer balls, hockey and lacrosse sticks, bows and arrows; the controlled substances, the birth control pills and devices; the junk food still in shopping bags—onion-and-garlic chips, nacho thins, peanut creme patties, Waffelos and Kabooms, fruit chews and toffee popcorn; the Dum-Dum pops, the Mystic mints.

(Don DeLillo, *White Noise*)

In [16]:
text="The station wagons arrived at noon, a long shining line that coursed through the west campus. In single file they eased around the orange I-beam sculpture and moved toward the dormitories. The roofs of the station wagons were loaded down with carefully secured suitcases full of light and heavy clothing; with boxes of blankets, boots and shoes, stationery and books, sheets, pillows, quilts; with rolled-up rugs and sleeping bags; with bicycles, skis, rucksacks, English and Western saddles, inflated rafts. As cars slowed to a crawl and stopped, students sprang out and raced to the rear doors to begin removing the objects inside; the stereo sets, radios, personal computers; small refrigerators and table ranges; the cartons of phonograph records and cassettes; the hairdryers and styling irons; the tennis rackets, soccer balls, hockey and lacrosse sticks, bows and arrows; the controlled substances, the birth control pills and devices; the junk food still in shopping bags—onion-and-garlic chips, nacho thins, peanut creme patties, Waffelos and Kabooms, fruit chews and toffee popcorn; the Dum-Dum pops, the Mystic mints."
print(f"{predict(text, vocab):.1f} minutes")

39.9 minutes


But the ring was not on the island; he had lost it, it was gone. His screech sent a shiver down Bilbo’s back, though he did not yet understand what had happened. But Gollum had at last leaped to a guess, too late. What has it got in its pocketses? he cried. The light in his eyes was like a green flame as he sped back to murder the hobbit and recover his ‘Precious’. Just in time Bilbo saw his peril, and he fled blindly up the passage away from the water; and once more he was saved by his luck. For as he ran he put his hand in his pocket, and the ring slipped quietly on to his finger. So it was that Gollum passed him without seeing him, and went to guard the way out, lest the ‘thief’ should escape. Warily Bilbo followed him, as he went along, cursing, and talking to himself about his ‘Precious’; from which talk at last even Bilbo guessed the truth, and hope came to him in the darkness: he himself had found the marvellous ring and a chance of escape from the orcs and from Gollum. At length they came to a halt before an unseen opening that led to the lower gates of the mines, on the eastward side of the mountains. There Gollum crouched at bay, smelling and listening; and Bilbo was tempted to slay him with his sword. But pity stayed him, and though he kept the ring, in which his only hope lay, he would not use it to help him kill the wretched creature at a disadvantage. In the end, gathering his courage, he leaped over Gollum in the dark, and fled away down the passage, pursued by his enemy’s cries of hate and despair: Thief, thief! Baggins! We hates it for ever!

(J.R.R. Tolkien, *The Lord of the Rings*)


In [17]:
text="But the ring was not on the island; he had lost it, it was gone. His screech sent a shiver down Bilbo’s back, though he did not yet understand what had happened. But Gollum had at last leaped to a guess, too late. What has it got in its pocketses? he cried. The light in his eyes was like a green flame as he sped back to murder the hobbit and recover his ‘Precious’. Just in time Bilbo saw his peril, and he fled blindly up the passage away from the water; and once more he was saved by his luck. For as he ran he put his hand in his pocket, and the ring slipped quietly on to his finger. So it was that Gollum passed him without seeing him, and went to guard the way out, lest the ‘thief’ should escape. Warily Bilbo followed him, as he went along, cursing, and talking to himself about his ‘Precious’; from which talk at last even Bilbo guessed the truth, and hope came to him in the darkness: he himself had found the marvellous ring and a chance of escape from the ores and from Gollum. At length they came to a halt before an unseen opening that led to the lower gates of the mines, on the eastward side of the mountains. There Gollum crouched at bay, smelling and listening; and Bilbo was tempted to slay him with his sword. But pity stayed him, and though he kept the ring, in which his only hope lay, he would not use it to help him kill the wretched creature at a disadvantage. In the end, gathering his courage, he leaped over Gollum in the dark, and fled away down the passage, pursued by his enemy’s cries of hate and despair: Thief, thief! Baggins! We hates it for ever!"
print(f"{predict(text, vocab):.1f} minutes")

160.4 minutes


A onelegged sailor, swinging himself onward by lazy jerks of his crutches, growled some notes. He jerked short before the convent of the sisters of charity and held out a peaked cap for alms towards the very reverend John Conmee S. J. Father Conmee blessed him in the sun for his purse held, he knew, one silver crown. Father Conmee crossed to Mountjoy square. He thought, but not for long, of soldiers and sailors, whose legs had been shot off by cannonballs, ending their days in some pauper ward, and of cardinal Wolsey's words: If I had served my God as I have served my king He would not have abandoned me in my old days. He walked by the treeshade of sunnywinking leaves: and towards him came the wife of Mr David Sheehy M.P. Very well, indeed, father. And you, father? Father Conmee was wonderfully well indeed. He would go to Buxton probably for the waters. And her boys, were they getting on well at Belvedere? Was that so? Father Conmee was very glad indeed to hear that. And Mr Sheehy himself? Still in London. The house was still sitting, to be sure it was. Beautiful weather it was, delightful indeed. Yes, it was very probable that Father Bernard Vaughan would come again to preach. O, yes: a very great success. A wonderful man really.

(James Joyce, *Ulysses*)

In [18]:
text="A onelegged sailor, swinging himself onward by lazy jerks of his crutches, growled some notes. He jerked short before the convent of the sisters of charity and held out a peaked cap for alms towards the very reverend John Conmee S. J. Father Conmee blessed him in the sun for his purse held, he knew, one silver crown. Father Conmee crossed to Mountjoy square. He thought, but not for long, of soldiers and sailors, whose legs had been shot off by cannonballs, ending their days in some pauper ward, and of cardinal Wolsey's words: If I had served my God as I have served my king He would not have abandoned me in my old days. He walked by the treeshade of sunnywinking leaves: and towards him came the wife of Mr David Sheehy M.P. Very well, indeed, father. And you, father? Father Conmee was wonderfully well indeed. He would go to Buxton probably for the waters. And her boys, were they getting on well at Belvedere? Was that so? Father Conmee was very glad indeed to hear that. And Mr Sheehy himself? Still in London. The house was still sitting, to be sure it was. Beautiful weather it was, delightful indeed. Yes, it was very probable that Father Bernard Vaughan would come again to preach. O, yes: a very great success. A wonderful man really."
print(f"{predict(text, vocab):.1f} minutes")

148.7 minutes


**Question for discussion**. The predictions here vary in their accuracy, and the empirical results show plenty of room for improvement: ridge regression improves only modestly on the baseline in log minutes (MAE 2.7 vs. 3.2) and does not improve on it when error is measured in minutes. Brainstorm some ideas for how you could improve the computational modeling of time, including new features you would use, other data collection strategies, or reframing the problem into one that you see as more tractable.
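
As one starting point for the "new features" part of the question, here is a sketch of two extra feature functions that could be passed to `build_features` alongside `unigram_feature`. Everything in it is hypothetical: the word lists, the feature names, and the choice of what to measure are illustrative assumptions, and whether they improve the results is an empirical question.

```python
# Hypothetical word lists, chosen for illustration only.
TIME_UNITS = {"second", "seconds", "minute", "minutes", "hour", "hours",
              "day", "days", "week", "weeks", "month", "months", "year", "years"}
QUOTE_TOKENS = {"``", "''", '"', "“", "”"}

def time_unit_feature(tokens):
    """Count how many tokens name a unit of time."""
    count = sum(1 for word in tokens if word in TIME_UNITS)
    return {"TIME_UNIT_COUNT": count}

def quote_ratio_feature(tokens):
    """Share of tokens that are quotation marks, a rough proxy for dialogue."""
    if not tokens:
        return {"QUOTE_RATIO": 0.0}
    quotes = sum(1 for word in tokens if word in QUOTE_TOKENS)
    return {"QUOTE_RATIO": quotes / len(tokens)}

# A toy token list, invented for illustration.
tokens = ["``", "three", "days", "passed", ",", "''", "she", "said", "."]
print(time_unit_feature(tokens))
print(quote_ratio_feature(tokens))
```

Both functions return a dictionary, so they fit the existing interface: setting `features=[unigram_feature, time_unit_feature, quote_ratio_feature]` would merge their output with the unigram features for each passage.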