# NLP - Homework 2
### Miguel Bonilla

- [Normalized Vocabulary Scoring](#Normalized-Vocabulary-Scoring)
- [Long Word Vocabulary Scoring](#Long-Word-Vocabulary-Scoring)
- [Text Difficulty Scoring](#Text-Difficulty-Scoring)

1. In Python, create a method for scoring the vocabulary size of a text, and normalize the score 
from 0 to 1. It does not matter what method you use for normalization as long as you explain it 
in a short paragraph. 
Some relevant resources that you can leverage:
https://docs.tibco.com/pub/spotfire/6.5.0/doc/html/norm/norm_scale_between_0_and_1.htm
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html
2. After consulting section 3.2 in chapter 1 of Bird-Klein, create a method for scoring the long-word 
vocabulary size of a text, and likewise normalize (and explain) the scoring as in step 1 above.
3. Now create a “text difficulty score” by combining the lexical diversity score from homework 1, 
and your normalized score of vocabulary size and long-word vocabulary size, in equal weighting. 
Explain what you see when this score is applied to the same graded texts you used in homework 1.

In [1]:
import nltk
from nltk.tokenize import word_tokenize
import numpy as np

- First-year text: https://www.gutenberg.org/cache/epub/14640/pg14640.txt
- Second-year text: https://www.gutenberg.org/cache/epub/14668/pg14668.txt
- Third-year text: https://www.gutenberg.org/cache/epub/14766/pg14766.txt

In [2]:
from urllib import request
url1 = "https://www.gutenberg.org/cache/epub/14640/pg14640.txt"
url2 = "https://www.gutenberg.org/cache/epub/14668/pg14668.txt"
url3 = "https://www.gutenberg.org/cache/epub/14766/pg14766.txt"

#import books
text1 = request.urlopen(url1).read().decode('utf8')
text2 = request.urlopen(url2).read().decode('utf8')
text3 = request.urlopen(url3).read().decode('utf8')

# tokenize books
text1 = word_tokenize(text1)
text2 = word_tokenize(text2)
text3 = word_tokenize(text3)

In [3]:
## text 1 find beginning and end
print('beginning of text 1 is: ',text1.index('LESSON')) #beginning
print('end of text 1 is: ',text1.index('PHONIC')-1) #end
## text 2 find beginning and end
print('beginning of text 2 is: ',text2.index('LESSON')) #beginning
print('end of text 2 is: ',text2.index('Tennyson')) #end
## text 3 find beginning and end
print('beginning of text 3 is: ',text3.index('SHEPHERD')-4) #beginning
print('end of text 3 is: ',text3.index('Follen')) #end

beginning of text 1 is:  1125
end of text 1 is:  9335
beginning of text 2 is:  1500
end of text 2 is:  22004
beginning of text 3 is:  2116
end of text 3 is:  34675


In [4]:
text1 = text1[1125:9335]
text2 = text2[1500:22004]
text3 = text3[2116:34675]

In [5]:
## lowercase every token so that case variants count as one word, and drop non-alphabetic tokens
## note: isalpha() also drops hyphenated words; this was deemed acceptable for this project
## since hyphenated words are combinations of other words
text1 = [w.lower() for w in text1 if w.isalpha()]
text2 = [w.lower() for w in text2 if w.isalpha()]
text3 = [w.lower() for w in text3 if w.isalpha()]
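
The effect of the cleaning step above can be checked on a handful of tokens. The following is a minimal sketch in plain Python; the token list `sample_tokens` is a made-up example, not taken from the three books.

```python
# Hypothetical token list, chosen to include each kind of token the filter touches.
sample_tokens = ["The", "well-known", "boy", "ran", ".", "3", "the"]

# Same expression as in the cell above: keep purely alphabetic tokens, lowercased.
cleaned = [w.lower() for w in sample_tokens if w.isalpha()]

print(cleaned)            # ['the', 'boy', 'ran', 'the']
print(len(set(cleaned)))  # 3 unique words once "The" and "the" are merged
```

The hyphenated token, the punctuation mark, and the digit are all dropped because `str.isalpha()` returns `False` for any string containing a non-letter character.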

## Normalized Vocabulary Scoring

In [6]:
texts = [text1,text2,text3] #list all 3 documents    

In [7]:
# define a function that iterates through document word counts and normalizes the counts into a score on a 0 to 1 scale
# sizes_list holds (label, count) tuples; the (label, score) results are appended to scores_list
def normalize_func(sizes_list,scores_list):
    counts = np.asarray([size for _, size in sizes_list])
    lowest, highest = counts.min(), counts.max()  # computed once instead of on every iteration
    for i, (_, size) in enumerate(sizes_list):
        norm = round((size - lowest) / (highest - lowest), 5)
        scores_list.append(("text{}".format(i+1),norm))

In [8]:
# define a function that iterates through the documents calculating the vocab size and then calls the previously defined normalize_func to calculate the corresponding normalized scores.
# it requires a text list and a blank scores_list which it passes to the normalize_func
# note: the vocab sizes are appended to the global list norm_sizes, which must be defined before the function is called
def vocab_score(text_list,scores_list):
    for i in range(len(text_list)):
        vocab_size = len(set(text_list[i]))
        norm_sizes.append(("text{}".format(i+1),vocab_size))
    normalize_func(norm_sizes,scores_list)

In [9]:
norm_scores = []
norm_sizes = []

vocab_score(texts,norm_scores)
print(norm_sizes)
norm_scores

[('text1', 882), ('text2', 2128), ('text3', 3095)]


[('text1', 0.0), ('text2', 0.56304), ('text3', 1.0)]

The function vocab_score first calculates the vocabulary size (the number of unique words) of each of the 3 texts. It then applies min-max normalization: for each document, the smallest vocabulary size among the documents is subtracted from that document's vocabulary size, and the result is divided by the range (largest minus smallest vocabulary size). This gives a comparable score between 0 and 1 for every document, where 0 and 1 are the scores of the documents with the smallest and largest vocabulary, respectively. Note that the score is relative: it depends on which documents are being compared, so adding or removing a document can change every score.
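
The normalization described above can be reproduced by hand. The following is a minimal sketch in plain Python, independent of the notebook's `normalize_func`; the helper name `min_max_normalize` and the handling of a zero range are assumptions made for illustration.

```python
def min_max_normalize(values, ndigits=5):
    """Rescale values to the range [0, 1] using (x - min) / (max - min)."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # Assumption: when every value is equal the range is zero,
        # so 0.0 is returned for all of them instead of dividing by zero.
        return [0.0 for _ in values]
    return [round((v - lowest) / (highest - lowest), ndigits) for v in values]

# Vocabulary sizes reported in the output above.
vocab_sizes = [882, 2128, 3095]
print(min_max_normalize(vocab_sizes))  # [0.0, 0.56304, 1.0]
```

For text2 this works out to (2128 - 882) / (3095 - 882) = 1246 / 2213, which rounds to 0.56304 and matches the score printed above.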

## Long Word Vocabulary Scoring

Since what counts as a "long" word is relative, and considering that the documents are written for first, second, and third year learners, the following uses a "long-word" cutoff of 9 characters. That is, words with 9 or more characters are counted as long.

In [10]:
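# define a function that counts the unique long words in each text, using a long-word cutoff of 9 characters and above
# it then calls the previously defined normalize_func to get the normalized scores in a range between 0 and 1
# note: the counts are appended to the global list long_sizes, which must be defined before the function is called
def long_word_score(text_list,scores_list):
    for i in range(len(text_list)):
        long_word_size = len([w for w in set(text_list[i]) if len(w) > 8])
        long_sizes.append(("text{}".format(1+i),long_word_size))
    # pass the scores_list argument through instead of relying on the global long_scores
    normalize_func(long_sizes,scores_list)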

In [11]:
long_scores = []
long_sizes = []

long_word_score(texts,long_scores)
print(long_sizes)
print(long_scores)

[('text1', 13), ('text2', 92), ('text3', 253)]
[('text1', 0.0), ('text2', 0.32917), ('text3', 1.0)]


The general approach here mirrors the one used for the Normalized Vocabulary Score above. Long words, defined here as words of 9 or more characters, are identified in each text, and the number of unique long words is counted. These counts are then passed to the same normalize function defined previously, which is called from within the long word score function, producing a score that is relative to the long-word counts of all the documents in question.
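
The counting step can be illustrated on a small example. The following is a sketch only; the helper name `long_word_vocab_size`, its `min_length` parameter, and the token list are hypothetical and not part of the notebook.

```python
def long_word_vocab_size(tokens, min_length=9):
    """Count the unique tokens that have at least min_length characters."""
    return len({w for w in tokens if len(w) >= min_length})

# Made-up tokens: "shepherd" has 8 characters, so it falls just under the cutoff.
sample = ["the", "shepherd", "carefully", "carefully",
          "watched", "beautiful", "butterflies"]

print(long_word_vocab_size(sample))                # 3
print(long_word_vocab_size(sample, min_length=8))  # 4
```

Exposing the cutoff as a parameter makes it easy to see how sensitive the count is to that choice: lowering it by one character adds "shepherd" to the long-word vocabulary in this example.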

## Text Difficulty Scoring

Calculate a text difficulty score as an equal-weight average of the normalized vocabulary score and the normalized long-word score. Since each component ranges from 0 to 1, the average also lies between 0 and 1 (at the extremes, (0+0)/2 = 0 and (1+1)/2 = 1).

#### TODO: add the lexical diversity score from homework 1

In [12]:
diff_scores=[]
for i in range(len(long_scores)):
    diff_scores.append(('text{}'.format(i+1),(long_scores[i][1]+norm_scores[i][1])/2))

In [13]:
diff_scores

[('text1', 0.0), ('text2', 0.446105), ('text3', 1.0)]
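
The equal-weight average used in this section generalizes to any number of component scores, which is one way the homework 1 lexical diversity score could later be included as a third component. The following is a minimal sketch under that assumption; the helper name `combine_scores` is hypothetical, and only the two score lists printed earlier are used, since the lexical diversity values are not part of this notebook.

```python
def combine_scores(*score_lists):
    """Average (label, score) lists position by position with equal weights."""
    combined = []
    for entries in zip(*score_lists):
        label = entries[0][0]  # take the label from the first list
        mean = sum(score for _, score in entries) / len(entries)
        combined.append((label, round(mean, 6)))
    return combined

# Scores copied from the outputs above.
vocab = [("text1", 0.0), ("text2", 0.56304), ("text3", 1.0)]
long_words = [("text1", 0.0), ("text2", 0.32917), ("text3", 1.0)]

print(combine_scores(vocab, long_words))
# [('text1', 0.0), ('text2', 0.446105), ('text3', 1.0)]
```

Passing a third list of `(label, score)` tuples would divide each sum by 3 instead of 2. The combined score stays within 0 to 1 only if every component does.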