# Benjamin Wilke
# Homework 1

In [1]:
import nltk
from nltk.tokenize import word_tokenize

## Files Downloaded from Gutenberg

pg14640.txt
McGuffey's First Eclectic Reader, Revised Edition by William Holmes McGuffey
http://www.gutenberg.org/cache/epub/14640/pg14640.txt

pg14880.txt
McGuffey's Fourth Eclectic Reader by William Holmes McGuffey
http://www.gutenberg.org/cache/epub/14880/pg14880.txt

pg16751.txt
McGuffey's Sixth Eclectic Reader by William Holmes McGuffey
http://www.gutenberg.org/cache/epub/16751/pg16751.txt

pg19721.txt
The Literary World Seventh Reader by Browne, Metcalf, and Withers
http://www.gutenberg.org/cache/epub/19721/pg19721.txt


## Loader and Tokenizer Functions

In [2]:
from urllib import request

def open_Gutenberg(url):
    response = request.urlopen(url)
    return response.read().decode('utf8')

def loadAndPrepReader(file_contents):
    file_contents = file_contents.split('\n')   # split into lines
    file_contents = [line.strip('\r') for line in file_contents]  # remove carriage-returns
    # match on the shared prefix so both the "THIS PROJECT" and "THE PROJECT" marker wordings are found
    start_index = [i for i, s in enumerate(file_contents) if '*** START OF' in s] # get start index
    end_index = [i for i, s in enumerate(file_contents) if '*** END OF' in s] # get end index
    return file_contents[start_index[0]+1:end_index[0]-1] # return sliced file_contents (also drops the line just before the end marker)

def tokenizer(file_contents):
    tokenized_words = [word_tokenize(line) for line in file_contents] # tokenize our raw list of lines
    unroll = [item for sublist in tokenized_words for item in sublist] # unroll into single list
    return [w for w in unroll if w.isalnum()] # drop special characters, punctuation
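
A minimal sketch of what the `isalnum()` filter keeps and drops. The token list is hand-written to stand in for `word_tokenize` output (an assumption, so the example runs without NLTK data), and `keep_alnum` is a hypothetical helper name. Note that hyphenated words and contraction fragments are dropped along with punctuation, which affects the counts below.

```python
def keep_alnum(tokens):
    """Keep only tokens made up entirely of letters and digits."""
    return [t for t in tokens if t.isalnum()]

# Hand-written tokens standing in for word_tokenize output (assumption).
sample = ["The", "cat", "'s", "well-known", "hat", ",", "is", "n't", "red", "."]
filtered = keep_alnum(sample)
print(filtered)
```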

## Load Data

In [3]:
mcGuffey_one = tokenizer(loadAndPrepReader(open_Gutenberg("http://www.gutenberg.org/cache/epub/14640/pg14640.txt")))
mcGuffey_four = tokenizer(loadAndPrepReader(open_Gutenberg("http://www.gutenberg.org/cache/epub/14880/pg14880.txt")))
mcGuffey_six = tokenizer(loadAndPrepReader(open_Gutenberg("http://www.gutenberg.org/cache/epub/16751/pg16751.txt")))
literaryWorld_seven = tokenizer(loadAndPrepReader(open_Gutenberg("http://www.gutenberg.org/cache/epub/19721/pg19721.txt")))

## "Text Difficulty" Measurement Functions

In [4]:
# returns the ratio of unique words to total words (a fraction, not a percentage), "lexical diversity"
# I don't like that this lexical_diversity function doesn't account for case, but I'm following the example in the
# NLTK book

def lexical_diversity(text):
    return len(set(text)) / len(text)

def vocab_size(text):
    return len(set(word.lower() for word in text)) # this does account for case

def percentage(count, total):
    return 100 * count / total

## 3. Lexical Diversity

In [5]:
print("The lexical diversity of McGuffey's First Eclectic Reader: {0:.2f}".format(lexical_diversity(mcGuffey_one)))
print("The lexical diversity of McGuffey's Fourth Eclectic Reader: {0:.2f}".format(lexical_diversity(mcGuffey_four)))
print("The lexical diversity of McGuffey's Sixth Eclectic Reader: {0:.2f}".format(lexical_diversity(mcGuffey_six)))
print("The lexical diversity of The Literary World Seventh Reader: {0:.2f}".format(lexical_diversity(literaryWorld_seven)))

The lexical diversity of McGuffey's First Eclectic Reader: 0.18
The lexical diversity of McGuffey's Fourth Eclectic Reader: 0.14
The lexical diversity of McGuffey's Sixth Eclectic Reader: 0.12
The lexical diversity of The Literary World Seventh Reader: 0.11


These results align with my expectations: in the more basic readers, unique words make up a larger share of the total words (0.18 for the first reader versus 0.11 for the seventh). Lexical diversity falls steadily as the reader level rises, so it could serve as a rough inverse proxy for the level of the writing.
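
The comment on `lexical_diversity` notes that it does not account for case. A case-folded variant might look like the sketch below. The name `lexical_diversity_folded` and the toy token list are illustrative assumptions, not part of the assignment.

```python
def lexical_diversity_folded(text):
    """Unique lowercased tokens divided by total tokens."""
    if not text:
        return 0.0  # avoid dividing by zero on an empty token list
    return len(set(w.lower() for w in text)) / len(text)

# Toy tokens (assumption) where "The"/"the" and "dog"/"Dog" differ only by case.
toy = ["The", "dog", "saw", "the", "cat", "and", "the", "Dog", "ran"]
case_sensitive = len(set(toy)) / len(toy)    # same formula as lexical_diversity
case_folded = lexical_diversity_folded(toy)
print("{0:.2f} vs {1:.2f}".format(case_sensitive, case_folded))
```

On the toy list the case-sensitive ratio is higher, because capitalized and lowercase forms of the same word are counted as separate types.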

## 4. Vocabulary Size

In [6]:
print("The vocabulary size of McGuffey's First Eclectic Reader: {0}".format(vocab_size(mcGuffey_one)))
print("The vocabulary size of McGuffey's Fourth Eclectic Reader: {0}".format(vocab_size(mcGuffey_four)))
print("The vocabulary size of McGuffey's Sixth Eclectic Reader: {0}".format(vocab_size(mcGuffey_six)))
print("The vocabulary size of The Literary World Seventh Reader: {0}".format(vocab_size(literaryWorld_seven)))

The vocabulary size of McGuffey's First Eclectic Reader: 1141
The vocabulary size of McGuffey's Fourth Eclectic Reader: 7579
The vocabulary size of McGuffey's Sixth Eclectic Reader: 14038
The vocabulary size of The Literary World Seventh Reader: 10220


This also aligns with my expectations: the more advanced readers have a larger vocabulary. The trend is clear across the first, fourth, and sixth readers. The seventh reader drops back down, but that may simply be because it belongs to a different series.
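
One thing that may be worth checking before attributing the seventh reader's drop to the series is the total token count of each text, since a shorter text has fewer chances to introduce new words. A minimal sketch, using a hypothetical `size_summary` helper and toy token lists in place of the real readers:

```python
def size_summary(text):
    """Return (total tokens, case-folded vocabulary size) for a token list."""
    return len(text), len(set(w.lower() for w in text))

# Toy token lists (assumption) standing in for the tokenized readers.
toy_texts = {
    "short": ["a", "cat", "sat"],
    "long": ["a", "cat", "sat", "on", "a", "mat", "and", "a", "Cat", "slept"],
}
summaries = {name: size_summary(tokens) for name, tokens in toy_texts.items()}
for name, (tokens, vocab) in summaries.items():
    print("{0}: {1} tokens, {2} distinct words".format(name, tokens, vocab))
```

The same call could be applied to `mcGuffey_six` and `literaryWorld_seven` to see how their lengths compare.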

## 5. Vocabulary Size & Lexical Diversity Measures

The first issue is that lexical diversity on its own is a poor measure of "lexical richness": texts containing a large number of tokens will usually have lower lexical diversity scores simply because the writer must reuse function words (words that have little lexical meaning). A second issue is that a text may draw on a very large vocabulary but be succinct in its writing, producing a high lexical diversity score even though the vocabulary is quite advanced.

To address the first issue, lexical diversity makes more sense when comparing texts of similar total token count. The second issue can be analyzed by looking at both lexical diversity and vocabulary size. Thus I would argue that both measures are required to appropriately measure text difficulty.
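
One possible way to compare texts at a similar token count is to average the diversity score over fixed-size windows, so that every text is measured on chunks of equal length. This is a sketch under that assumption rather than a method from the NLTK book. The function name `windowed_diversity`, the default window size, and the toy token lists are all illustrative.

```python
def windowed_diversity(text, window=1000):
    """Mean type-token ratio over consecutive, non-overlapping windows."""
    ratios = []
    # Step through full windows only; a trailing partial window is ignored.
    for start in range(0, len(text) - window + 1, window):
        chunk = text[start:start + window]
        ratios.append(len(set(w.lower() for w in chunk)) / window)
    if not ratios:
        raise ValueError("text is shorter than one window")
    return sum(ratios) / len(ratios)

# Toy token lists (assumption) of equal length but different repetitiveness.
repetitive = ["the", "cat", "sat", "the", "cat", "sat", "the", "cat"]
varied = ["the", "cat", "sat", "on", "a", "warm", "red", "mat"]
print(windowed_diversity(repetitive, window=4))
print(windowed_diversity(varied, window=4))
```

Because every window has the same number of tokens, a longer text is not penalized merely for repeating function words over more pages.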