# NLP - Homework 1
### Miguel Bonilla

* [Lexical Diversity Score](#Lexical-Diversity-Score)
* [Vocabulary Size](#Vocabulary-Size)
* [Conclusion](#Conclusion)

1. Install Python (if you don’t have it already) and install NLTK.  
2. Follow the instructions in chapter 1 of Bird-Klein for implementing a “lexical diversity” scoring 
routine.
3. Go to http://www.gutenberg.org/ebooks/bookshelf/215  and select texts of different grade levels 
(e.g., fourth reader, fifth reader et al)
Report the lexical diversity score of each. Explain whether the result was surprising.
4. Also compare the vocabulary size of the same three texts. Explain whether the result was 
surprising.  
5. Write a paragraph arguing whether vocabulary size and lexical diversity in combination could be 
a better measure of text difficulty (or reading level) than either measure is by itself.

In [1]:
import nltk
from nltk.tokenize import word_tokenize

- First Reader https://www.gutenberg.org/cache/epub/14640/pg14640.txt
- Second Reader https://www.gutenberg.org/cache/epub/14668/pg14668.txt
- Third Reader https://www.gutenberg.org/cache/epub/14766/pg14766.txt

In [2]:
from urllib import request
url1 = "https://www.gutenberg.org/cache/epub/14640/pg14640.txt" #first reader url
url2 = "https://www.gutenberg.org/cache/epub/14668/pg14668.txt" #second reader url
url3 = "https://www.gutenberg.org/cache/epub/14766/pg14766.txt" #third reader url

#import books
text1 = request.urlopen(url1).read().decode('utf8')
text2 = request.urlopen(url2).read().decode('utf8')
text3 = request.urlopen(url3).read().decode('utf8')

# tokenize books by words
text1 = word_tokenize(text1)
text2 = word_tokenize(text2)
text3 = word_tokenize(text3)

In [3]:
## text 1 find beginning and end
print('beginning of text 1 is: ',text1.index('LESSON')) #beginning
print('end of text 1 is: ',text1.index('PHONIC')-1) #end
## text 2 find beginning and end
print('beginning of text 2 is: ',text2.index('LESSON')) #beginning
print('end of text 2 is: ',text2.index('Tennyson')) #end
## text 3 find beginning and end
print('beginning of text 3 is: ',text3.index('SHEPHERD')-4) #beginning
print('end of text 3 is: ',text3.index('Follen')) #end

beginning of text 1 is:  1125
end of text 1 is:  9335
beginning of text 2 is:  1500
end of text 2 is:  22004
beginning of text 3 is:  2116
end of text 3 is:  34675


In [4]:
## trim texts removing beginning and end fluff
text1 = text1[1125:9335]
text2 = text2[1500:22004]
text3 = text3[2116:34675]
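
The hard-coded slice indices above come from the `index` lookups in the previous cell. As a minimal sketch, the lookup and the slice could be combined in one helper so the numbers never need to be copied by hand. `trim_between` is a hypothetical name (it is not part of NLTK), and the toy token list is made-up data standing in for a tokenized book.

```python
def trim_between(tokens, start_marker, end_marker, start_offset=0, end_offset=0):
    """Return tokens from the first start_marker up to, but not including, the first end_marker.

    The offsets shift each boundary, mirroring adjustments such as index('PHONIC') - 1.
    list.index raises ValueError if a marker is missing.
    """
    start = tokens.index(start_marker) + start_offset
    end = tokens.index(end_marker) + end_offset
    return tokens[start:end]


# Toy stand-in for a tokenized Gutenberg file (assumed data, not taken from the readers).
toy_tokens = ["Project", "Gutenberg", "header", "LESSON", "I",
              "The", "cat", "ran", ".", "PHONIC", "CHART"]
body = trim_between(toy_tokens, "LESSON", "PHONIC")
print(body)
```

Because `index` returns the first match only, a marker that also appears in the front matter (a table of contents, for example) would need a different marker or an offset.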

## Lexical Diversity Score

In [5]:
## define lexical_diversity (from textbook example)
def lexical_diversity(text):
    return len(set(text)) / len(text)
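
A small worked example may make the definition concrete. The token list below is made up for illustration. Note that the function does no normalization, so `The` and `the` count as different types and punctuation tokens are counted too.

```python
def lexical_diversity(text):
    # Distinct tokens divided by total tokens (the textbook definition).
    return len(set(text)) / len(text)


# Made-up token list: 10 tokens in total.
toy_tokens = ["The", "dog", "saw", "the", "cat", ".", "The", "cat", "ran", "."]

# Case-sensitive: The, dog, saw, the, cat, ., ran -> 7 distinct tokens.
case_sensitive_score = lexical_diversity(toy_tokens)

# Lowercased first: the, dog, saw, cat, ., ran -> 6 distinct tokens.
lowercased_score = lexical_diversity([w.lower() for w in toy_tokens])

print(case_sensitive_score, lowercased_score)
```

Lowercasing before counting gives a lower score here, which is worth keeping in mind when comparing this score with a vocabulary size computed on lowercased tokens.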

In [34]:
# calculate the lexical diversity scores for the 3 books
ld1 = lexical_diversity(text1)
ld2 = lexical_diversity(text2)
ld3 = lexical_diversity(text3)
# calculate text length
length1 = len(text1)
length2 = len(text2)
length3 = len(text3)

In [35]:
print(f'First Reader \n length: {length1} | lexical diversity: {ld1}')
print('--------------------------------')
print(f'Second Reader \n length: {length2} | lexical diversity: {ld2}')
print('--------------------------------')
print(f'Third Reader \n length: {length3} | lexical diversity: {ld3}')

First Reader 
 length: 8210 | lexical diversity: 0.14165651644336175
--------------------------------
Second Reader 
 length: 20504 | lexical diversity: 0.1472395630120952
--------------------------------
Third Reader 
 length: 32559 | lexical diversity: 0.11701833594397862


The Third Reader has the lowest lexical diversity score of the three, which is a little surprising. However, it is also much longer than the First and Second Readers (32,559 tokens versus 8,210 and 20,504). Because lexical diversity divides the number of distinct tokens by the total number of tokens, and common words repeat more often as a text grows, the score tends to fall with length. With texts this different in length, the scores are not a reliable direct comparison.
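
One way to reduce the length effect is to score equal-sized samples, for example the first `n` tokens of each text. The sketch below illustrates the idea on made-up token lists; `diversity_at_length` is a hypothetical helper, and truncating to the shortest text is an assumption about what counts as a fair comparison, not something the textbook prescribes.

```python
def lexical_diversity(tokens):
    return len(set(tokens)) / len(tokens)


def diversity_at_length(tokens, n):
    """Lexical diversity of the first n tokens only."""
    if n <= 0 or n > len(tokens):
        raise ValueError("n must be between 1 and len(tokens)")
    return lexical_diversity(tokens[:n])


# Made-up data: the long text uses more distinct words but repeats them.
short_text = ["a", "b", "c", "d", "a", "b"]
long_text = ["a", "b", "c", "d", "e", "f", "a", "b", "c", "d", "e", "f"]

n = min(len(short_text), len(long_text))

# Whole-text scores: the longer text looks less diverse.
print(lexical_diversity(short_text), lexical_diversity(long_text))

# Equal-length scores: the ordering flips.
print(diversity_at_length(short_text, n), diversity_at_length(long_text, n))
```

On the toy data the whole-text scores rank the short text higher, while the equal-length scores rank the long text higher. Whether the same happens for the three readers would have to be checked by running it on the trimmed token lists.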

## Vocabulary Size

In [38]:
# calculate vocabulary size of the 3 texts
vs1 = len(set(w.lower() for w in text1))
vs2 = len(set(w.lower() for w in text2))
vs3 = len(set(w.lower() for w in text3))

In [40]:
print(f'First Reader \n length: {length1} | vocabulary size: {vs1}')
print('--------------------------------')
print(f'Second Reader \n length: {length2} | vocabulary size: {vs2}')
print('--------------------------------')
print(f'Third Reader \n length: {length3} | vocabulary size: {vs3}')
print('--------------------------------')

First Reader 
 length: 8210 | vocabulary size: 1041
--------------------------------
Second Reader 
 length: 20504 | vocabulary size: 2626
--------------------------------
Third Reader 
 length: 32559 | vocabulary size: 3397
--------------------------------


The results are not surprising: vocabulary size increases significantly from one reader level to the next. It more than doubles from the First Reader to the Second Reader (a 152% increase) and grows by close to one third from the Second Reader to the Third Reader (a 29% increase).
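
The percentages quoted above can be reproduced from the printed vocabulary sizes. This is a small sketch; `percent_increase` is a helper name introduced here for illustration.

```python
def percent_increase(old, new):
    """Relative growth from old to new, expressed as a percentage."""
    return (new - old) / old * 100


# Vocabulary sizes copied from the output above.
vs1, vs2, vs3 = 1041, 2626, 3397

first_to_second = percent_increase(vs1, vs2)
second_to_third = percent_increase(vs2, vs3)

print(f"First -> Second: {first_to_second:.0f}%")
print(f"Second -> Third: {second_to_third:.0f}%")
```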

## Conclusion

Book | Length | Vocabulary Size | Lexical Diversity
-----|--------|---------|---------
First Reader | 8210 | 1041 | .1417
Second Reader | 20504 | 2626 | .1472
Third Reader | 32559 | 3397 | .1170

As noted above, comparing lexical diversity scores across texts of very different lengths gives misleading results for text difficulty, because the score is the number of unique tokens divided by the total number of tokens. Vocabulary size increased at each level, but from the Second to the Third Reader it grew by less than the text length did (about 29% versus about 59%), which pulls the lexical diversity score down.

Therefore, lexical diversity alone is not a good measure of text difficulty. Vocabulary size alone is not a great measure either, since it says nothing about length: one text could have a larger vocabulary than another despite a lower word count.

When analyzing text difficulty, it is probably better to look at the lexical diversity score together with vocabulary size, since a text with both a larger vocabulary and a higher word count can still produce a lower lexical diversity score when the increase in vocabulary is outpaced by the increase in length.
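
To see how the two measures interact, the growth in length can be set next to the growth in vocabulary for each step between levels. The sketch below uses the figures from the summary table. The variable names are illustrative, and because vocabulary size was computed on lowercased tokens while lexical diversity was not, the comparison is approximate.

```python
# (name, length in tokens, vocabulary size), copied from the summary table.
readers = [
    ("First Reader", 8210, 1041),
    ("Second Reader", 20504, 2626),
    ("Third Reader", 32559, 3397),
]

growth = []
for (name_a, len_a, vs_a), (name_b, len_b, vs_b) in zip(readers, readers[1:]):
    length_ratio = len_b / len_a   # how many times longer the next text is
    vocab_ratio = vs_b / vs_a      # how many times larger its vocabulary is
    growth.append((name_a, name_b, length_ratio, vocab_ratio))
    trend = "keeps pace with" if vocab_ratio >= length_ratio else "lags behind"
    print(f"{name_a} -> {name_b}: length x{length_ratio:.2f}, "
          f"vocabulary x{vocab_ratio:.2f} (vocabulary {trend} length)")
```

When vocabulary grows at least as fast as length, lexical diversity holds steady or rises; when it lags, the score falls even though the text uses more distinct words.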