# Exercises week 5

In these exercises, you will work with a set of 19th century novels from Project Gutenberg, specifically the 18 texts in the directory `data/gutenberg/training/` (which was part of the chapter 4 notebook).

In [3]:
def read_file(filename):
    with open(filename, encoding='utf8') as infile:
        contents = infile.read()
    return contents

import nltk
nltk.download('punkt')

def preProcess(text):
    return [token.lower() for token in nltk.word_tokenize(text) if token not in '''!()-[]{};:'"\,<>./?@#$%^&*_~''']

def tokenize_sent(sent):
    return [token.lower() for token in nltk.word_tokenize(sent)
           if token not in ".,?!:;()[]''``*"]

def preprocess(text):
    return [tokenize_sent(sent) for sent in nltk.sent_tokenize(text)]

from glob import glob

from os.path import splitext, basename

corpus = {}

for filepath in glob('data/gutenberg/training/*.txt'):
    text = read_file(filepath)
    corpus[splitext(basename(filepath))[0]] = preprocess(text)

[nltk_data] Downloading package punkt to /Users/rik/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## Readability in 19th century fiction

Read the corpus, split it into sentences, and tokenize and clean it in the same way as in notebook chapter 4.
It is useful to put the result into a dictionary `corpus`, with filenames as keys and the tokenized, sentence-split texts as values.

Implement a function `readability(text)`. It should use the ARI formula (see the slides from week 1) to estimate the readability of a tokenized text.

Apply this function to each novel, and store the results in a dictionary mapping filenames to readability scores.
Look at the results:

- Who is the most difficult to read?
- Do you see interesting or surprising results?

In [4]:
# your code here
def readability(text):
    """Estimate the ARI grade level of a text given as a list of tokenized sentences."""
    total_sentences = len(text)
    total_words = 0
    total_characters = 0
    for sentence in text:
        total_words += len(sentence)
        for word in sentence:
            total_characters += len(word)
    words_per_sent = total_words / total_sentences
    chars_per_word = total_characters / total_words
    # ARI: 0.5 * (words per sentence) + 4.71 * (characters per word) - 21.43
    grade_level = 0.5 * words_per_sent + 4.71 * chars_per_word - 21.43
    return grade_level

for text in corpus:
    print(text, readability(corpus[text]))

blake-poems 7.3330006097189155
whitman-poems 11.37180744901056
carroll-alice 5.890725046288502
shakespeare-caesar 4.258179826361175
whitman-leaves 15.565010325629153
milton-paradise 21.61959530472408
melville-piazza 11.838610293672211
blake-songs 8.439722163525623
austen-pride 9.550872361895102
whitman-patriotic 16.911261560073363
edgeworth-parents 6.3573986007179215
chesterton-thursday 6.8577625416159265
burgess-busterbrown 5.340729526600175
chesterton-ball 7.802034866703821
austen-emma 9.454080590744201
shakespeare-hamlet 4.231908935047343
austen-sense 11.669015186802994
bryant-stories 5.900936749832574


In [5]:
answers = {'austen-emma.txt': 9.454080590744201,
 'bryant-stories.txt': 5.900936749832574,
 'whitman-poems.txt': 11.37235359928885,
 'chesterton-thursday.txt': 6.8577625416159265,
 'burgess-busterbrown.txt': 5.340729526600175,
 'milton-paradise.txt': 21.61922616435404,
 'blake-poems.txt': 7.3330006097189155,
 'blake-songs.txt': 8.439722163525623,
 'edgeworth-parents.txt': 6.356779166270105,
 'shakespeare-caesar.txt': 4.258179826361175,
 'whitman-leaves.txt': 15.565010325629153,
 'shakespeare-hamlet.txt': 4.231908935047343,
 'whitman-patriotic.txt': 16.911261560073363,
 'austen-pride.txt': 9.550872361895102,
 'carroll-alice.txt': 5.890725046288502,
 'chesterton-ball.txt': 7.799681127108684,
 'melville-piazza.txt': 11.838489721191259,
 'austen-sense.txt': 11.668964287000264}

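Shakespeare's plays (`shakespeare-hamlet` and `shakespeare-caesar`, with scores of about 4.2 and 4.3) are the easiest texts, and Milton's `milton-paradise` (about 21.6) is the hardest. This might be because Shakespeare wrote plays for the general public.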
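One possible contributing factor is sentence length: in the `readability` function above, half of the average number of words per sentence goes directly into the score. The following is a minimal sketch on an invented toy text, not part of the exercise solution. The name `ari` is only a compact restatement of the same formula, and the example sentences are made up for illustration.

```python
def ari(text):
    """ARI for a text given as a list of tokenized sentences."""
    n_sents = len(text)
    n_words = sum(len(sent) for sent in text)
    n_chars = sum(len(word) for sent in text for word in sent)
    return 0.5 * n_words / n_sents + 4.71 * n_chars / n_words - 21.43

# Invented toy data: the same ten tokens as one long or two short sentences.
words = ['to', 'be', 'or', 'not', 'to', 'be', 'that', 'is', 'the', 'question']
one_long = [words]
two_short = [words[:6], words[6:]]

print(ari(one_long), ari(two_short))
difference = ari(one_long) - ari(two_short)
```

The characters-per-word term is identical in both cases, so the two scores differ only by 0.5 * (10 - 5) = 2.5 grade levels. Texts that are split into many short sentences, as dialogue often is, will therefore tend to score lower.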

## Sentiment and sensibility

Compute a sentiment score for each of the books, using the code on this week's slides. Create a function `sentiment(filename, positive_words, negative_words)` which returns a score for a given filename and sets of sentiment words.

- The link on my slides has sentiment lexicons for 81 languages, but not English ... Use the sentiment lexicon made available at: https://github.com/BijoySingh/east/tree/master/east/datasets/opinion_lexicon
  Click on the files and press the "raw" button to download the file; put them in the `data/` directory.
- Note that for this application, we don't care about sentences, so it is easier to read the text as one big list of tokens.
- The books have different lengths. Is this a problem? If so, can you think of something to correct for this?
- Do you see interesting/surprising patterns?
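
The hints above can be sketched as follows. This is a minimal sketch under stated assumptions, not the code from the slides: the helper names `load_lexicon` and `normalized_sentiment` are hypothetical, and it assumes the lexicon files hold one word per line, with blank lines and lines starting with `;` acting as comments. Storing the words in a set keeps each lookup fast, and dividing by the number of tokens is one way to correct for book length.

```python
def load_lexicon(lines):
    """Build a set of sentiment words from an iterable of lines.

    Assumption: blank lines and lines starting with ';' are comments.
    """
    return {line.strip() for line in lines
            if line.strip() and not line.startswith(';')}

def normalized_sentiment(tokens, positive_words, negative_words):
    """(#positive - #negative) / #tokens for a flat list of tokens."""
    if not tokens:
        return 0.0
    # True/False behave as 1/0, so each token contributes +1, -1 or 0.
    score = sum((tok in positive_words) - (tok in negative_words)
                for tok in tokens)
    return score / len(tokens)

# Toy demonstration with invented lexicon lines and tokens.
pos = load_lexicon(['; comment line', '', 'good', 'happy'])
neg = load_lexicon(['; comment line', 'bad', 'sad'])
tokens = ['a', 'good', 'day', 'but', 'a', 'sad', 'and', 'bad', 'end', 'too']
demo_score = normalized_sentiment(tokens, pos, neg)
print(demo_score)
```

With the real files, the lines could come from `read_file(path).splitlines()`, and a sentence-split text can be flattened with `[tok for sent in corpus[name] for tok in sent]`.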

In [6]:
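# Just the sentiment

def sentiment(filename, positive_words, negative_words):
    # Note: `filename` actually receives the tokenized text (a list of sentences);
    # the other two arguments are paths to the lexicon files.
    # Sets make the membership tests below much faster than lists.
    posWords = set(read_file(positive_words).splitlines())
    negWords = set(read_file(negative_words).splitlines())
    sentiment = 0
    for sentence in filename:
        for word in sentence:
            if word in posWords:
                sentiment += 1
            elif word in negWords:
                sentiment -= 1
    return sentiment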

for text in corpus:
    print(text, sentiment(corpus[text], "data/wordList/positive-words.txt", "data/wordList/negative-words.txt"))


blake-poems 66
whitman-poems 669
carroll-alice -76
shakespeare-caesar 88
whitman-leaves 835
milton-paradise 102
melville-piazza -718
blake-songs 56
austen-pride 1364
whitman-patriotic 170
edgeworth-parents 1424
chesterton-thursday -643
burgess-busterbrown 58
chesterton-ball -621
austen-emma 2255
shakespeare-hamlet -30
austen-sense 1214
bryant-stories 443


In [10]:
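# Accounting for number of total words

def sentiment(filename, positive_words, negative_words):
    # Note: `filename` actually receives the tokenized text (a list of sentences);
    # the other two arguments are paths to the lexicon files.
    # Sets make the membership tests below much faster than lists.
    posWords = set(read_file(positive_words).splitlines())
    negWords = set(read_file(negative_words).splitlines())
    sentiment = 0
    numWords = 0
    for sentence in filename:
        for word in sentence:
            numWords += 1
            if word in posWords:
                sentiment += 1
            elif word in negWords:
                sentiment -= 1
    # Normalize by text length so that books of different sizes are comparable.
    return sentiment / numWords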

for text in corpus:
    print(text, sentiment(corpus[text], "data/wordList/positive-words.txt", "data/wordList/negative-words.txt"))


blake-poems 0.009523809523809525
whitman-poems 0.009696776437847866
carroll-alice -0.0027716994894237783
shakespeare-caesar 0.0042395336512983575
whitman-leaves 0.0066238299222592415
milton-paradise 0.001271091393963562
melville-piazza -0.008892521859750811
blake-songs 0.0096005486027773
austen-pride 0.011091775497259584
whitman-patriotic 0.006250919252831299
edgeworth-parents 0.00833982441858422
chesterton-thursday -0.010964276579418536
burgess-busterbrown 0.0035624347398808425
chesterton-ball -0.007495654692931634
austen-emma 0.013702958745282962
shakespeare-hamlet -0.000997307270370001
austen-sense 0.009984537947823799
bryant-stories 0.009510927905878312


Some books have a negative score, whereas others have a positive one. Dividing by the total number of words turns the raw count into a proportion, which makes books of different lengths comparable.

For example, Jane Austen's `austen-emma` has the highest score (about 0.014) and is thus the most positive text in the corpus. Chesterton's `chesterton-thursday` has the most negative score, at about -0.011.
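
Claims like these are easier to check by sorting the scores than by reading down the list. A small sketch, using a handful of the normalized scores printed above (rounded to four decimals) and expressing them per 1,000 tokens for readability. The choice of which books to include here is arbitrary.

```python
# A few normalized scores copied from the output above, rounded to 4 decimals.
scores = {
    'austen-emma': 0.0137,
    'austen-pride': 0.0111,
    'carroll-alice': -0.0028,
    'melville-piazza': -0.0089,
    'chesterton-thursday': -0.0110,
}

# Sort from most positive to most negative.
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
for name, score in ranked:
    print(f'{name:22s} {score * 1000:6.1f} per 1,000 tokens')

most_positive = ranked[0][0]
most_negative = ranked[-1][0]
```

Applied to the full dictionary of scores, the same `sorted` call gives a complete ranking of the corpus.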