#### 1. Install Python (if you don’t have it already) and install NLTK. 

Python and NLTK are installed.

#### 2. Follow the instructions in chapter 1 of Bird-Klein for implementing a “lexical diversity” scoring routine.

Lexical diversity measures how varied the vocabulary of a text is. It is calculated as the ratio of unique tokens (types) to the total number of tokens in the text.

In [23]:
import nltk
from nltk.book import *

In [24]:
def lexical_diversity(text):
    return len(set(text)) / len(text)

In [25]:
lexical_diversity(text3)

0.06230453042623537
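
To make the score easier to interpret, the sketch below computes the same ratio on a small hand-written token list (the tokens are invented for illustration and do not come from the NLTK corpora) and also prints its reciprocal, the average number of times each type is used. For `text3`, the reciprocal of 0.0623 is about 16, so each distinct token appears roughly 16 times on average.

```python
def lexical_diversity(tokens):
    """Ratio of distinct tokens (types) to total tokens."""
    return len(set(tokens)) / len(tokens)

# Hypothetical toy text, hand-written for illustration only.
toy_tokens = "the cat sat on the mat and the dog sat too".split()

diversity = lexical_diversity(toy_tokens)   # 8 types / 11 tokens
average_uses = 1 / diversity                # mean occurrences per type

print(f"types={len(set(toy_tokens))}, tokens={len(toy_tokens)}")
print(f"lexical diversity={diversity:.3f}, average uses per type={average_uses:.3f}")
```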

#### 3. Go to http://www.gutenberg.org/ebooks/bookshelf/215 and select texts of different grade levels (e.g., fourth reader, fifth reader). Report the lexical diversity score of each. Explain whether the result was surprising.

Below are links to the third, fourth and fifth grade readers:
 - Third grade: https://www.gutenberg.org/ebooks/14766
 - Fourth grade: https://www.gutenberg.org/ebooks/14880
 - Fifth grade: https://www.gutenberg.org/ebooks/15040

In [26]:
from urllib import request
import re

# returns text/tokens
def getTextFromUrl(url, text=False):
    # make a request
    response = request.urlopen(url)

    # extract the raw response
    raw = response.read().decode('utf8')

    # tokenize the words
    tokens = nltk.word_tokenize(raw)

    if text:
        textObj = nltk.Text(tokens)
        return textObj

    return tokens

# get the book title
def getBookTitle(url):
    # make a request
    response = request.urlopen(url)

    # extract the raw response
    raw = response.read().decode('utf8')

    match = re.search(r'Title: ([\w\']+) (.+)', raw)
    if match:
        title = match.group()
    else:
        return None

    return title.strip('\r')
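
Because `getBookTitle` returns `match.group()`, the result is the whole match, which is why the `Title: ` prefix shows up in the title strings. The sketch below illustrates this on a hand-written header fragment (an assumption that mimics the `\r\n` line endings of a Project Gutenberg plain-text file, not the actual downloaded text) and shows one possible way to get the title without the prefix by joining the two capture groups.

```python
import re

# Hypothetical header fragment, hand-written to mimic a Project Gutenberg
# plain-text file with \r\n line endings.
sample_header = "Title: McGuffey's Third Eclectic Reader\r\nAuthor: William Holmes McGuffey\r\n"

match = re.search(r"Title: ([\w']+) (.+)", sample_header)

# group() with no argument returns the whole match, including the
# "Title: " prefix and the trailing \r, because "." matches \r but not \n.
whole = match.group()

# Joining the two capture groups yields the title without the prefix.
title_only = f"{match.group(1)} {match.group(2)}".strip("\r")

print(repr(whole))
print(repr(title_only))
```

The trailing `\r` in the whole match is the reason the original function calls `strip('\r')` before returning.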

In [27]:
# url for third grade
url1 = "https://www.gutenberg.org/cache/epub/14766/pg14766.txt"

# url for fourth grade
url2 = "https://www.gutenberg.org/cache/epub/14880/pg14880.txt"

# url for fifth grade
url3 = "https://www.gutenberg.org/cache/epub/15040/pg15040.txt"

# create a list of URLs
urls = [url1, url2, url3]

In [28]:
# get the text for each book
texts = [getTextFromUrl(url, text=True) for url in urls]

# get the token count for each book (reuses the texts fetched above)
tokens = [len(text) for text in texts]

# get the number of unique types for each book
types = [len(set(text)) for text in texts]

# get book title 
titles = [getBookTitle(url) for url in urls]

# Calculate lexical diversity scores
lexicalDiversity = [lexical_diversity(text) for text in texts]

In [29]:
import pandas as pd
# create a dataframe
lexical_summary = pd.DataFrame({'Title': titles, 'Tokens': tokens, 'Types': types,
                               'Lexical_diversity': lexicalDiversity})

# sort by highest lexical_diversity score
lexical_summary.sort_values(by='Lexical_diversity', ascending=False)

| | Title | Tokens | Types | Lexical_diversity |
|---|---|---|---|---|
| 0 | Title: McGuffey's Third Eclectic Reader | 37993 | 4713 | 0.124049 |
| 1 | Title: McGuffey's Fourth Eclectic Reader | 84066 | 10380 | 0.123474 |
| 2 | Title: McGuffey's Fifth Eclectic Reader | 126615 | 14290 | 0.112862 |


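The third-grade reader has the highest lexical diversity score (0.124), slightly ahead of the fourth-grade reader (0.123) and clearly ahead of the fifth-grade reader (0.113). This is somewhat surprising, since one might expect the more advanced readers to score higher. A likely explanation is that the score depends on text length: the third-grade reader is by far the shortest of the three, and the longer a text is, the more often its common words are repeated.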
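
One factor worth checking when reading these scores is how the ratio behaves as a text gets longer. The sketch below uses a synthetic token stream, not the McGuffey readers: the 2,000-word vocabulary, the 1/rank sampling weights, and the prefix lengths are all assumptions chosen only to illustrate the effect. It measures lexical diversity on progressively longer prefixes of the same stream.

```python
import random

def lexical_diversity(tokens):
    """Ratio of distinct tokens (types) to total tokens."""
    return len(set(tokens)) / len(tokens)

# Synthetic "text": 20,000 tokens drawn from a fixed 2,000-word vocabulary.
# The 1/rank weights are an assumption that loosely mimics how a few words
# dominate natural text; none of this is data from the readers.
random.seed(0)
vocabulary = [f"w{i}" for i in range(2000)]
weights = [1 / (rank + 1) for rank in range(len(vocabulary))]
tokens = random.choices(vocabulary, weights=weights, k=20000)

# Score the same stream at increasing lengths.
for n in (1000, 5000, 20000):
    print(n, round(lexical_diversity(tokens[:n]), 3))
```

In this setup the score falls as the prefix grows, because new tokens are increasingly repeats of types already seen. A longer book can therefore score lower without being lexically simpler.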

#### 4. Also compare the vocabulary size of the same three texts. Explain whether the result was surprising. 

Vocabulary size is calculated by discarding non-alphabetic tokens, converting the remaining tokens to lower case, and counting the distinct words that are left.

In [30]:
# get the vocab size of the 3 texts
def getVocabSize(text):
    vocab = set(w.lower() for w in text if w.isalpha())
    return len(vocab)

In [31]:
vocabSize = [getVocabSize(text) for text in texts]
lexical_summary['Vocabulary_Size'] = vocabSize

In [32]:
# sort by highest lexical_diversity score
lexical_summary.sort_values(by='Lexical_diversity', ascending=False)

| | Title | Tokens | Types | Lexical_diversity | Vocabulary_Size |
|---|---|---|---|---|---|
| 0 | Title: McGuffey's Third Eclectic Reader | 37993 | 4713 | 0.124049 | 3684 |
| 1 | Title: McGuffey's Fourth Eclectic Reader | 84066 | 10380 | 0.123474 | 7626 |
| 2 | Title: McGuffey's Fifth Eclectic Reader | 126615 | 14290 | 0.112862 | 10832 |


After removing the non-alphabetic tokens and converting all tokens to lower case, the fifth-grade reader has the largest vocabulary (10,832 words), followed by the fourth-grade reader (7,626) and the third-grade reader (3,684). This makes sense: as the grade level increases, students are expected to learn more new words.
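
To see what the filter in `getVocabSize` keeps and drops, the sketch below applies the same logic to a short hand-written token list. The tokens are hypothetical, chosen to resemble tokenizer output, and `get_vocab` is an illustrative helper that returns the set itself rather than its size.

```python
def get_vocab(tokens):
    """Lower-cased alphabetic tokens only, mirroring the filter in getVocabSize."""
    return {w.lower() for w in tokens if w.isalpha()}

# Hypothetical tokens, hand-written to resemble tokenizer output.
sample_tokens = ["The", "the", "Reader", "'s", "well-known", "1879", ",", "lesson", "Lesson"]

vocab = get_vocab(sample_tokens)

print(sorted(vocab))                        # ['lesson', 'reader', 'the']
print(len(set(sample_tokens)), len(vocab))  # 9 3
```

Case variants collapse into one entry, while punctuation, numbers, the clitic `'s`, and hyphenated words such as `well-known` are dropped because `str.isalpha()` is false for them. This explains why the vocabulary sizes are noticeably smaller than the type counts.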

#### 5. Write a paragraph arguing whether vocabulary size and lexical diversity in combination could be a better measure of text difficulty (or reading level) than either measure is by itself.

When comparing the difficulty of texts that differ in length, lexical diversity alone is not an accurate measure. Lexical diversity depends on the total number of tokens: the fifth-grade reader has the largest vocabulary of the three texts, yet because it also has by far the most tokens, its lexical diversity is the lowest. Vocabulary size, for its part, can only grow as a text gets longer, so it is also hard to interpret on its own. Thus, lexical diversity should not be used by itself as a measure of text difficulty, but it can be informative when combined with a thorough vocabulary analysis.
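
One possible way to combine the two measures, sketched below under stated assumptions, is to compute both on samples of equal length so that text length no longer drives the comparison. The helper `text_profile`, the toy sentences, and the choice of the first `sample_size` tokens as the sample are all hypothetical and are not part of the analysis above.

```python
def text_profile(tokens, sample_size):
    """Vocabulary size and lexical diversity for the first sample_size tokens."""
    sample = tokens[:sample_size]
    words = [w.lower() for w in sample if w.isalpha()]
    return {
        "tokens": len(sample),
        "vocabulary_size": len(set(words)),
        "lexical_diversity": len(set(sample)) / len(sample),
    }

# Hypothetical toy texts of different lengths, hand-written for illustration.
short_text = "the cat sat on the mat".split()
long_text = ("the old sailor told a long story about the sea and the storm "
             "and the sailor told the story again and again").split()

# Whole texts: the longer one has more types but a lower ratio.
for name, toks in (("short", short_text), ("long", long_text)):
    print(name, "full", text_profile(toks, len(toks)))

# Equal-length samples: both measures are now computed on the same footing.
sample_size = min(len(short_text), len(long_text))
for name, toks in (("short", short_text), ("long", long_text)):
    print(name, "sampled", text_profile(toks, sample_size))
```

On the full toy texts the longer one scores lower on lexical diversity despite its larger vocabulary, while on equal-length samples it scores higher on both measures. Applied to the readers, `sample_size` would need to be no larger than the shortest book's token count.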