# Benjamin Wilke
# Homework 2

In [1]:
import nltk
from nltk.tokenize import word_tokenize

## Files Downloaded from Gutenberg

pg14640.txt
McGuffey's First Eclectic Reader, Revised Edition by William Holmes McGuffey
http://www.gutenberg.org/cache/epub/14640/pg14640.txt

pg14880.txt
McGuffey's Fourth Eclectic Reader by William Holmes McGuffey
http://www.gutenberg.org/cache/epub/14880/pg14880.txt

pg16751.txt
McGuffey's Sixth Eclectic Reader by William Holmes McGuffey
http://www.gutenberg.org/cache/epub/16751/pg16751.txt

pg19721.txt
The Literary World Seventh Reader by Browne, Metcalf, and Withers
http://www.gutenberg.org/cache/epub/19721/pg19721.txt


## Loader and Tokenizer Functions

In [2]:
from urllib import request

def open_Gutenberg(url):
    response = request.urlopen(url)
    return response.read().decode('utf8')

def loadAndPrepReader(file_contents):
    file_contents = file_contents.split('\n')   # split into lines
    file_contents = [line.strip('\r') for line in file_contents]  # remove carriage-returns
    start_index = [i for i, s in enumerate(file_contents) if '*** START OF THIS PROJECT' in s] # get start index
    end_index = [i for i, s in enumerate(file_contents) if '*** END OF THIS PROJECT' in s] # get end index
    return file_contents[start_index[0]+1:end_index[0]-1] # returned sliced file_contents

def tokenizer(file_contents):
    tokenized_words = [word_tokenize(line) for line in file_contents] # tokenize our raw list of lines
    unroll = [item for sublist in tokenized_words for item in sublist] # unroll into single list
    return [w for w in unroll if w.isalnum()] # drop special characters, punctuation
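
One fragile spot in the loader is the exact marker text. Project Gutenberg files do not all use the same wording (some use "START OF THE PROJECT GUTENBERG EBOOK" rather than "START OF THIS PROJECT"), and `start_index[0]` raises an IndexError when nothing matches. Below is a minimal sketch of a more tolerant variant. `strip_gutenberg_boilerplate`, `START_RE`, and `END_RE` are hypothetical names that are not part of the homework code, and the sample string is made up for illustration. Unlike `loadAndPrepReader`, this sketch keeps the line immediately before the END marker, so token counts could differ slightly.

```python
import re

# Accept both "*** START OF THIS PROJECT ..." and "*** START OF THE PROJECT ..."
START_RE = re.compile(r"\*\*\*\s*START OF (THIS|THE) PROJECT GUTENBERG", re.IGNORECASE)
END_RE = re.compile(r"\*\*\*\s*END OF (THIS|THE) PROJECT GUTENBERG", re.IGNORECASE)

def strip_gutenberg_boilerplate(raw_text):
    """Return the lines between the START and END markers (hypothetical helper)."""
    lines = raw_text.splitlines()  # splits on both \n and \r\n
    start = next((i for i, s in enumerate(lines) if START_RE.search(s)), None)
    end = next((i for i, s in enumerate(lines) if END_RE.search(s)), None)
    if start is None or end is None or end <= start:
        raise ValueError("Could not locate Project Gutenberg START/END markers")
    return lines[start + 1:end]

# Made-up sample text, not one of the readers
sample = (
    "header\r\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK FOO ***\r\n"
    "line one\r\n"
    "line two\r\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK FOO ***\r\n"
    "license text"
)
body_lines = strip_gutenberg_boilerplate(sample)
print(body_lines)
```

Raising a ValueError with a readable message makes a marker mismatch easier to diagnose than a bare IndexError.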

## Load Data

In [3]:
mcGuffey_one = tokenizer(loadAndPrepReader(open_Gutenberg("http://www.gutenberg.org/cache/epub/14640/pg14640.txt")))
mcGuffey_four = tokenizer(loadAndPrepReader(open_Gutenberg("http://www.gutenberg.org/cache/epub/14880/pg14880.txt")))
mcGuffey_six = tokenizer(loadAndPrepReader(open_Gutenberg("http://www.gutenberg.org/cache/epub/16751/pg16751.txt")))
literaryWorld_seven = tokenizer(loadAndPrepReader(open_Gutenberg("http://www.gutenberg.org/cache/epub/19721/pg19721.txt")))

## HW1 "Text Difficulty" Measurement Functions

In [4]:
# returns the ratio of unique tokens to total tokens ("lexical diversity"), a fraction between 0 and 1
# I don't like that this lexical_diversity function doesn't account for case, but I'm following the example in the
# NLTK book

def lexical_diversity(text):
    return len(set(text)) / len(text)

def vocab_size(text):
    return len(set(word.lower() for word in text)) # this does account for case

def percentage(count, total):
    return 100 * count / total

## 1.3 Lexical Diversity

In [5]:
print("The lexical diversity of McGuffey's First Eclectic Reader: {0:.2f}".format(lexical_diversity(mcGuffey_one)))
print("The lexical diversity of McGuffey's Fourth Eclectic Reader: {0:.2f}".format(lexical_diversity(mcGuffey_four)))
print("The lexical diversity of McGuffey's Sixth Eclectic Reader: {0:.2f}".format(lexical_diversity(mcGuffey_six)))
print("The lexical diversity of The Literary World Seventh Reader: {0:.2f}".format(lexical_diversity(literaryWorld_seven)))

The lexical diversity of McGuffey's First Eclectic Reader: 0.18
The lexical diversity of McGuffey's Fourth Eclectic Reader: 0.14
The lexical diversity of McGuffey's Sixth Eclectic Reader: 0.12
The lexical diversity of The Literary World Seventh Reader: 0.11


These results seem to align with my expectations. That is, in the more basic readers the number of unique words is larger relative to the total number of words. Unique words therefore make up a larger share of the total, which could be a proxy for the level of the writing style.

## 1.4 Vocabulary Size

In [6]:
print("The vocabulary size of McGuffey's First Eclectic Reader: {0}".format(vocab_size(mcGuffey_one)))
print("The vocabulary size of McGuffey's Fourth Eclectic Reader: {0}".format(vocab_size(mcGuffey_four)))
print("The vocabulary size of McGuffey's Sixth Eclectic Reader: {0}".format(vocab_size(mcGuffey_six)))
print("The vocabulary size of The Literary World Seventh Reader: {0}".format(vocab_size(literaryWorld_seven)))

The vocabulary size of McGuffey's First Eclectic Reader: 1141
The vocabulary size of McGuffey's Fourth Eclectic Reader: 7579
The vocabulary size of McGuffey's Sixth Eclectic Reader: 14038
The vocabulary size of The Literary World Seventh Reader: 10220


This also seems to align with my expectations: the more advanced readers have a larger total vocabulary. This is clear when comparing the first, fourth, and sixth readers. The seventh reader drops back down, but that may simply be because it belongs to a different series.

## 1.5 Vocabulary Size & Lexical Diversity Measures

The first issue is that lexical diversity on its own is a poor measure of "lexical richness": texts containing a large number of tokens will usually have lower lexical diversity scores simply because the writer must re-use function words (words that have little lexical meaning). A second issue is that a text may contain a very large vocabulary but be succinct in its writing, producing a high lexical diversity score even though its vocabulary is quite advanced; a high score alone therefore cannot distinguish such a text from a short, simple one. To address the first issue, lexical diversity makes more sense when comparing texts of similar total token count. The second issue can be analyzed by looking at both lexical diversity and vocabulary size. Thus I would argue that both measures are required to appropriately measure text difficulty.
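
To make the length effect concrete, here is a small sketch that compares lexical diversity on whole texts and on equal-sized samples. Truncating to the first n tokens is only one assumed way to equalize size, `lexical_diversity_at` is a hypothetical helper, and both toy texts are invented for illustration rather than taken from the readers.

```python
def lexical_diversity(tokens):
    """Ratio of unique tokens to total tokens, as in the HW1 function."""
    return len(set(tokens)) / len(tokens)

def lexical_diversity_at(tokens, n_tokens):
    """Lexical diversity over only the first n_tokens tokens (hypothetical helper)."""
    if len(tokens) < n_tokens:
        raise ValueError("text has fewer than n_tokens tokens")
    return lexical_diversity(tokens[:n_tokens])

# Toy data: a short, simple text and a longer text with harder words
simple_text = "the cat sat on the mat the cat sat on the mat".split()
advanced_sentence = "the committee examined several remarkable illustrations before adjourning"
advanced_text = (advanced_sentence.split()) * 6

# Whole-text scores: the longer text looks less diverse
raw_simple = lexical_diversity(simple_text)
raw_advanced = lexical_diversity(advanced_text)

# Equal-size scores: compare only the first n tokens of each text
n = min(len(simple_text), len(advanced_text))
equal_simple = lexical_diversity_at(simple_text, n)
equal_advanced = lexical_diversity_at(advanced_text, n)

print("whole texts:  simple {0:.3f}, advanced {1:.3f}".format(raw_simple, raw_advanced))
print("first {0} tokens: simple {1:.3f}, advanced {2:.3f}".format(n, equal_simple, equal_advanced))
```

On this toy data the ranking flips once both texts are measured at the same token count, which is the reason for comparing texts of similar size.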

## 2.1 Normalized Vocabulary Size for Multiple Texts

In [7]:
import numpy as np
def normalize(x):
    x = np.asarray(x, dtype=np.float32)
    denom = max(x) - min(x) 
    return list((x - min(x)) / denom)

In [8]:
my_texts = [mcGuffey_one, mcGuffey_four, mcGuffey_six, literaryWorld_seven]

In [9]:
[vocab_size(text) for text in my_texts]

[1141, 7579, 14038, 10220]

In [10]:
vocab_size_normal = normalize([vocab_size(text) for text in my_texts])
vocab_size_normal

[0.0, 0.49918586, 1.0, 0.70396215]

My normalization uses a common approach to feature scaling (min-max scaling): Xnew = (X - Xmin) / (Xmax - Xmin). This maps the largest value to 1 and the smallest value to 0, with the intermediate values falling between 0 and 1.
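
As a quick check of the formula above, the same scaling can be worked through in plain Python on the four vocabulary sizes reported earlier. This is a sketch, not a replacement for `normalize`. The name `min_max_normalize` is hypothetical, and returning zeros for constant input is my own assumption about how to avoid dividing by zero.

```python
def min_max_normalize(values):
    """Scale values to [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant input carries no spread, so map it to zeros
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

vocab_sizes = [1141, 7579, 14038, 10220]  # values from the vocab_size output above
scaled = min_max_normalize(vocab_sizes)

# Worked example for the fourth reader: (7579 - 1141) / (14038 - 1141) = 6438 / 12897
print(scaled)
```

The guard for `hi == lo` matters only when every text has the same value, a case the NumPy version would turn into a division by zero.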

## 2.2 Long-Word Vocabulary Size & Words

In [11]:
def long_word_vocab_size(text, length):
    return len(set(word.lower() for word in text if len(word) > length))

def long_word_vocab_words(text, length):
    return sorted(set(word.lower() for word in text if len(word) > length))

In [12]:
long_word_vocab_size(mcGuffey_one, 11)

10

In [13]:
long_word_vocab_words(mcGuffey_one, 11)

['alternatives',
 'combinations',
 'contemporary',
 'diacriticals',
 'explanations',
 'illustration',
 'illustrations',
 'representative',
 'representatives',
 'vocabularies']

In [14]:
[long_word_vocab_size(text, 7) for text in my_texts]

[112, 2152, 5707, 3877]

In [15]:
long_word_vocab_normal = normalize([long_word_vocab_size(text, 7) for text in my_texts])
long_word_vocab_normal

[0.0, 0.36461127, 1.0, 0.67292225]

I'll consider words longer than 7 characters to be "long words" since we're dealing with adolescent-level reading material. I re-use the normalization technique from the total vocabulary size to normalize the long-word vocabulary counts across the readers.

## 2.3 Combination "Text Difficulty Score"

In [16]:
lexical_diversity_scores = [lexical_diversity(text) for text in my_texts]  # save lexical diversity to list
lexical_diversity_scores

[0.17572269807280513,
 0.13828714155338873,
 0.11789498246304844,
 0.10917464242943518]

In [None]:
#lexical_diversity_scores = np.asarray(lexical_diversity_scores)
#lexical_diversity_scores = list(1 - lexical_diversity_scores)
#lexical_diversity_scores

As discussed in HW1, lexical diversity seems problematic: longer and more complicated texts get a disproportionately low score simply because of the number of function words contributing to the total length of the text. In other words, the length of the text dilutes the lexical diversity score. This showed up in my analysis: the texts with low lexical diversity scores seem to be the more complex ones and arguably should score higher, but they are being misrepresented because of their length.

I chose not to manipulate the lexical diversity scores from HW1 as I think that's the spirit of HW2.

I averaged the three scores (lexical diversity, normalized vocabulary size, and normalized long-word vocabulary size) to produce the final "Text Difficulty Score".



In [17]:
total_scores = np.asarray(vocab_size_normal) + np.asarray(long_word_vocab_normal) + np.asarray(lexical_diversity_scores)
text_difficulty_score = list(total_scores / 3)

In [18]:
print("The Text Difficulty Score of McGuffey's First Eclectic Reader: {0:.2f}".format(text_difficulty_score[0]))
print("The Text Difficulty Score of McGuffey's Fourth Eclectic Reader: {0:.2f}".format(text_difficulty_score[1]))
print("The Text Difficulty Score of McGuffey's Sixth Eclectic Reader: {0:.2f}".format(text_difficulty_score[2]))
print("The Text Difficulty Score of The Literary World Seventh Reader: {0:.2f}".format(text_difficulty_score[3]))

The Text Difficulty Score of McGuffey's First Eclectic Reader: 0.06
The Text Difficulty Score of McGuffey's Fourth Eclectic Reader: 0.33
The Text Difficulty Score of McGuffey's Sixth Eclectic Reader: 0.71
The Text Difficulty Score of The Literary World Seventh Reader: 0.50


These final Text Difficulty Scores still seem to generally align with our expectations of reader complexity as grade level increases. The normalization generally magnifies the differences in measured complexity across this set of readers. The score also magnifies the difference between McGuffey's Sixth Eclectic Reader and The Literary World Seventh Reader, and by this measure The Literary World Seventh Reader appears to be significantly easier.
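
To see why the normalized measures drive the final score, the sketch below recomputes the average from the values printed earlier and compares the spread (max minus min) of each component. The component values are copied from the outputs above, with lexical diversity rounded to eight decimals. The names `difficulty_score` and `spread` are hypothetical.

```python
titles = ["McGuffey First", "McGuffey Fourth", "McGuffey Sixth", "Literary World Seventh"]

# Component values copied from the notebook outputs
vocab_norm = [0.0, 0.49918586, 1.0, 0.70396215]
long_word_norm = [0.0, 0.36461127, 1.0, 0.67292225]
lex_div = [0.17572270, 0.13828714, 0.11789498, 0.10917464]

def difficulty_score(vocab, long_words, diversity):
    """Unweighted mean of the three component scores."""
    return (vocab + long_words + diversity) / 3

def spread(values):
    """Range covered by one component across the four readers."""
    return max(values) - min(values)

scores = [difficulty_score(v, l, d) for v, l, d in zip(vocab_norm, long_word_norm, lex_div)]
for title, score in zip(titles, scores):
    print("{0}: {1:.2f}".format(title, score))

# The two normalized components span the full 0..1 range by construction,
# while raw lexical diversity varies by less than 0.07 across the readers.
print("spread:", spread(vocab_norm), spread(long_word_norm), round(spread(lex_div), 4))
```

Because raw lexical diversity covers such a narrow range, it barely moves the average, and the ranking follows the two min-max scaled measures.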

## 2.4 Simple Neural Net for California Data Set

In [19]:
from sklearn import datasets
data = datasets.fetch_california_housing()
import pandas as pd
X = pd.DataFrame(data['data'])
X.columns = data['feature_names']
y = data['target']
X.head()

| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude |
|---|---|---|---|---|---|---|---|---|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 |


In [20]:
X.shape

(20640, 8)

In [21]:
import tensorflow as tf
from tensorflow.keras.layers import Dense
from tensorflow.keras import Sequential

In [22]:
model = tf.keras.Sequential()
model.add(Dense(100, input_dim=8, activation="relu"))
model.add(Dense(100, activation="relu"))
model.add(Dense(100, activation="relu"))
model.add(Dense(1))
          
model.compile(optimizer="adam", loss="mean_squared_error", metrics=['mean_squared_error'])

model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense (Dense)                (None, 100)               900       
_________________________________________________________________
dense_1 (Dense)              (None, 100)               10100     
_________________________________________________________________
dense_2 (Dense)              (None, 100)               10100     
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 101       
Total params: 21,201
Trainable params: 21,201
Non-trainable params: 0
_________________________________________________________________


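Following the example from Dr. Slater's office hours on deriving the parameter counts: each Dense layer multiplies a (1 x n) input by an (n x m) weight matrix and adds one bias per unit, so it has n * m weights plus m biases.

(1 x 8).(8 x 100) + 100 = 800 weights + 100 biases = 900 params

(1 x 100).(100 x 100) + 100 = 10000 weights + 100 biases = 10100 params

(1 x 100).(100 x 100) + 100 = 10000 weights + 100 biases = 10100 params

(1 x 100).(100 x 1) + 1 = 100 weights + 1 bias = 101 params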
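
The same arithmetic can be checked with a few lines of plain Python. This is a minimal sketch that assumes fully connected layers with a bias term, as in the model above. `dense_param_count` is a hypothetical helper, not a Keras function.

```python
def dense_param_count(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

# (inputs, units) for each Dense layer in the model above
layer_shapes = [(8, 100), (100, 100), (100, 100), (100, 1)]

counts = [dense_param_count(n_in, n_out) for n_in, n_out in layer_shapes]
total = sum(counts)
print(counts, total)
```

The per-layer counts and their sum match the `Param #` column and the `Total params` line in `model.summary()`.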

In [24]:
model.fit(X.values, y, epochs=10, batch_size=32)

Train on 20640 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x1a37ad3128>