In [1]:
import pandas as pd
import numpy as np
import nltk
import requests

In [None]:
from nltk.tokenize import word_tokenize

Files Downloaded from Gutenberg
pg14640.txt McGuffey's First Eclectic Reader, Revised Edition by William Holmes McGuffey http://www.gutenberg.org/cache/epub/14640/pg14640.txt

pg14880.txt McGuffey's Fourth Eclectic Reader by William Holmes McGuffey http://www.gutenberg.org/cache/epub/14880/pg14880.txt

pg16751.txt McGuffey's Sixth Eclectic Reader by William Holmes McGuffey http://www.gutenberg.org/cache/epub/16751/pg16751.txt

pg19721.txt The Literary World Seventh Reader by Browne, Metcalf, and Withers http://www.gutenberg.org/cache/epub/19721/pg19721.txt

## Loader and Tokenizer Functions

In [None]:
from urllib import request

def open_Gutenberg(url):
    response = request.urlopen(url)
    return response.read().decode('utf8')

def loadAndPrepReader(file_contents):
    file_contents = file_contents.split('\n')   # split into lines
    file_contents = [line.strip('\r') for line in file_contents]  # remove carriage-returns
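    # Gutenberg marker wording varies between files ('START OF THIS PROJECT' vs. 'START OF THE PROJECT'),
    # so match only on the shared prefix
    start_index = [i for i, s in enumerate(file_contents) if '*** START OF' in s] # get start index
    end_index = [i for i, s in enumerate(file_contents) if '*** END OF' in s] # get end index
    return file_contents[start_index[0]+1:end_index[0]] # return sliced file_contents (slice end is exclusive, so the end marker is dropped)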

def tokenizer(file_contents):
    tokenized_words = [word_tokenize(line) for line in file_contents] # tokenize our raw list of lines
    unroll = [item for sublist in tokenized_words for item in sublist] # unroll into single list
    return [w for w in unroll if w.isalnum()] # drop special characters, punctuation

## Load Data

In [None]:
mcGuffey_one = tokenizer(loadAndPrepReader(open_Gutenberg("http://www.gutenberg.org/cache/epub/14640/pg14640.txt")))
mcGuffey_four = tokenizer(loadAndPrepReader(open_Gutenberg("http://www.gutenberg.org/cache/epub/14880/pg14880.txt")))
mcGuffey_six = tokenizer(loadAndPrepReader(open_Gutenberg("http://www.gutenberg.org/cache/epub/16751/pg16751.txt")))
literaryWorld_seven = tokenizer(loadAndPrepReader(open_Gutenberg("http://www.gutenberg.org/cache/epub/19721/pg19721.txt")))

### HW1 "Text Difficulty" Measurement Functions

In [None]:
# returns the ratio of unique tokens to total tokens, "lexical diversity"
# I don't like that this lexical_diversity function doesn't account for case ("The" and "the" count as
# two different words), but I'm following the example in the NLTK book

def lexical_diversity(text):
    return len(set(text)) / len(text)

def vocab_size(text):
    return len(set(word.lower() for word in text)) # this does account for case

def percentage(count, total):
    return 100 * count / total

### Lexical Diversity

In [None]:
print("The lexical diversity of McGuffey's First Eclectic Reader: {0:.2f}".format(lexical_diversity(mcGuffey_one)))
print("The lexical diversity of McGuffey's Fourth Eclectic Reader: {0:.2f}".format(lexical_diversity(mcGuffey_four)))
print("The lexical diversity of McGuffey's Sixth Eclectic Reader: {0:.2f}".format(lexical_diversity(mcGuffey_six)))
print("The lexical diversity of The Literary World Seventh Reader: {0:.2f}".format(lexical_diversity(literaryWorld_seven)))

These results seem to align with my expectations. In the more basic readers, the number of unique words is larger relative to the total number of words. Unique words therefore make up a bigger share of the text, which could serve as a proxy for the level of the writing style.
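To make the ratio concrete, here is a minimal sketch on a made-up token list. The `lexical_diversity_lower` variant is a hypothetical helper, not part of the notebook. It only illustrates how folding case before counting changes the score.

```python
def lexical_diversity(tokens):
    """Unique tokens divided by total tokens (case-sensitive, as in the notebook)."""
    return len(set(tokens)) / len(tokens)

def lexical_diversity_lower(tokens):
    """Hypothetical variant: lowercase every token before counting unique ones."""
    return len(set(t.lower() for t in tokens)) / len(tokens)

# Made-up tokens; "The" and "the" are distinct only in the case-sensitive version
tokens = ["The", "cat", "saw", "the", "dog", "and", "the", "dog", "saw", "The", "cat"]

print("case-sensitive: {0:.2f}".format(lexical_diversity(tokens)))
print("case-folded:    {0:.2f}".format(lexical_diversity_lower(tokens)))
```

The case-folded score can only be equal to or lower than the case-sensitive one, since folding merges tokens but never splits them.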

### Vocabulary Size

In [None]:
print("The vocabulary size of McGuffey's First Eclectic Reader: {0}".format(vocab_size(mcGuffey_one)))
print("The vocabulary size of McGuffey's Fourth Eclectic Reader: {0}".format(vocab_size(mcGuffey_four)))
print("The vocabulary size of McGuffey's Sixth Eclectic Reader: {0}".format(vocab_size(mcGuffey_six)))
print("The vocabulary size of The Literary World Seventh Reader: {0}".format(vocab_size(literaryWorld_seven)))

This also seems to align with my expectations. The more advanced readers have a larger total vocabulary, which is clear from the progression across the first, fourth, and sixth readers. The seventh reader drops back down, but that may simply reflect the fact that it belongs to a different series.

1. In Python, create a method for scoring the vocabulary size of a text, and normalize the score from 0 to 1. It does not matter what method you use for normalization as long as you explain it in a short paragraph.

## Vocabulary Size & Lexical Diversity Measures

The first issue is that lexical diversity on its own is a poor measure of "lexical richness": texts containing a large number of tokens will usually have lower lexical diversity scores simply because the writer must re-use function words (words that have little lexical meaning). A second issue is that a text may draw on a very large vocabulary but be succinct in its writing, producing a high lexical diversity score that says little about how advanced the vocabulary actually is. To address the first issue, lexical diversity makes more sense when comparing texts of similar total token count. The second issue can be analyzed by looking at both lexical diversity and vocabulary size. Thus I would argue that both measures are required to appropriately measure text difficulty.
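One way to compare texts at a similar token count is to truncate each one to the same length before computing the ratio. The sketch below assumes made-up sentences, and `diversity_at_equal_length` is a hypothetical helper rather than something the notebook defines.

```python
def lexical_diversity(tokens):
    """Unique tokens divided by total tokens."""
    return len(set(tokens)) / len(tokens)

def diversity_at_equal_length(texts, n_tokens=None):
    """Lexical diversity of the first n_tokens of each text.

    If n_tokens is None, the length of the shortest text is used,
    so every text is scored on the same number of tokens.
    """
    if n_tokens is None:
        n_tokens = min(len(t) for t in texts)
    return [lexical_diversity(t[:n_tokens]) for t in texts]

# Made-up texts: the long one starts with the same words as the short one
short_text = "the cat sat on the mat".split()
long_text = ("the cat sat on the mat and the dog sat on the rug "
             "and the cat saw the dog and the dog saw the cat").split()

# Raw scores: the longer text scores lower because words keep repeating
print(lexical_diversity(short_text), lexical_diversity(long_text))

# Scores on equal-length prefixes: the length effect disappears
print(diversity_at_equal_length([short_text, long_text]))
```

Truncating to a prefix is only one option. It ignores everything after the cutoff, so it suits a rough sanity check better than a final measure.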

## Normalized Vocabulary Size for Multiple Texts

In [None]:
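import numpy as np
def normalize(x):
    # min-max scaling: the smallest value maps to 0 and the largest to 1
    x = np.asarray(x, dtype=np.float32)
    denom = x.max() - x.min()  # note: this is zero if every value is identical
    return list((x - x.min()) / denom)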

In [None]:
my_texts = [mcGuffey_one, mcGuffey_four, mcGuffey_six, literaryWorld_seven]

In [None]:
[vocab_size(text) for text in my_texts]

In [None]:
vocab_size_normal = normalize([vocab_size(text) for text in my_texts])
vocab_size_normal

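My normalization uses min-max scaling, a common approach to feature scaling: Xnew = (X - Xmin) / (Xmax - Xmin). This makes the largest value equal to 1 and the smallest value equal to 0, with the intermediate values falling between 0 and 1. Because Xmin and Xmax come from this particular set of texts, the scores are only meaningful relative to the other texts in the set.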
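As a worked example of the formula, the sketch below applies it to made-up vocabulary counts in plain Python. The name `min_max_normalize` and the choice to return zeros when all values are equal are assumptions of this sketch, not part of the notebook's `normalize`.

```python
def min_max_normalize(values):
    """Scale values to [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # All values are equal, so the formula would divide by zero.
        # Returning zeros is an arbitrary choice made for this sketch.
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]

# Made-up vocabulary sizes for four texts
counts = [1000, 4000, 7000, 5500]
scaled = min_max_normalize(counts)
print(scaled)
```

With these counts, Xmin = 1000 and Xmax = 7000, so 4000 maps to (4000 - 1000) / 6000 = 0.5 and 5500 maps to 4500 / 6000 = 0.75.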

## Long words

In [None]:
def long_word_vocab_size(text, length):
    return len(set(word.lower() for word in text if len(word) > length))

def long_word_vocab_words(text, length):
    return sorted(set(word.lower() for word in text if len(word) > length))

In [None]:
long_word_vocab_size(mcGuffey_one, 11)

In [None]:
long_word_vocab_words(mcGuffey_one, 11)

In [None]:
[long_word_vocab_size(text, 7) for text in my_texts]

In [None]:
long_word_vocab_normal = normalize([long_word_vocab_size(text, 7) for text in my_texts])
long_word_vocab_normal

I'll consider words longer than 7 characters (that is, 8 or more) to be "long words" since we're dealing with adolescent-level reading material. I reuse the normalization technique from the total vocabulary size here to normalize the long-word vocabulary counts across the readers.
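The cutoff is a judgment call, so it can help to see how the count reacts to it. The sketch below repeats the counting logic on a made-up sentence and tries a few thresholds. The sentence and the thresholds are illustrative only.

```python
def long_word_vocab_size(tokens, length):
    """Count unique lowercased tokens strictly longer than `length` characters."""
    return len({t.lower() for t in tokens if len(t) > length})

# Made-up tokens; "elephant" and "Elephant" count once after lowercasing
tokens = "The elephant remembered the extraordinary Elephant parade yesterday".split()

for threshold in (5, 7, 11):
    print("longer than", threshold, "characters:", long_word_vocab_size(tokens, threshold))
```

Note that the comparison is strict: a threshold of 7 excludes 7-letter words and starts counting at 8 letters.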

## Combination "Text Difficulty Score"

In [None]:
lexical_diversity_scores = [lexical_diversity(text) for text in my_texts]  # save lexical diversity to list
lexical_diversity_scores

In [None]:
#lexical_diversity_scores = np.asarray(lexical_diversity_scores)
#lexical_diversity_scores = list(1 - lexical_diversity_scores)
#lexical_diversity_scores

As discussed in HW1, lexical diversity seems problematic: longer and more complicated texts receive a disproportionately low score simply because function words contribute so much to the total length of the text. In this sense the length of the text dilutes the lexical diversity score. My analysis showed this. The texts with low lexical diversity scores appear to be the more complex ones and arguably should score higher, but they are misrepresented because of their length.

I chose not to manipulate the lexical diversity scores from HW1 as I think that's the spirit of HW2.

I averaged the three scores (lexical diversity, normalized vocabulary size, and normalized long-word vocabulary size) to produce the final "Text Difficulty Score".

In [None]:
total_scores = np.asarray(vocab_size_normal) + np.asarray(long_word_vocab_normal) + np.asarray(lexical_diversity_scores)
text_difficulty_score = list(total_scores / 3)

In [None]:
print("The Text Difficulty Score of McGuffey's First Eclectic Reader: {0:.2f}".format(text_difficulty_score[0]))
print("The Text Difficulty Score of McGuffey's Fourth Eclectic Reader: {0:.2f}".format(text_difficulty_score[1]))
print("The Text Difficulty Score of McGuffey's Sixth Eclectic Reader: {0:.2f}".format(text_difficulty_score[2]))
print("The Text Difficulty Score of The Literary World Seventh Reader: {0:.2f}".format(text_difficulty_score[3]))

These final Text Difficulty Scores still seem to generally align with our expectations of the complexity of the reader as grade level increases. The normalization generally magnifies the differences between the measured complexity of this set of readers. The score also magnifies the difference between McGuffey's Sixth Eclectic Reader and The Literary World Seventh Reader, which by this measure appears to be significantly easier than the sixth reader.

Random Interview questions

In [1]:
import numpy as np

In [5]:
np.arange(0,20.1,2)

array([ 0.,  2.,  4.,  6.,  8., 10., 12., 14., 16., 18., 20.])

In [4]:
4 in np.arange(0,20,2)

True

In [None]:
nterms = int(input("How many terms? "))

# first two terms
n1, n2 = 0, 1
count = 0

# check if the number of terms is valid
if nterms <= 0:
    print("Please enter a positive integer")
elif nterms == 1:
    print("Fibonacci sequence up to", nterms, ":")
    print(n1)
else:
    print("Fibonacci sequence:")
    while count < nterms:
        print(n1)
        nth = n1 + n2
        # update values
        n1 = n2
        n2 = nth
        count += 1  # increment so the loop stops after nterms values