# DS7337 NLP - HW 1
#### Neil Benson

1. Follow the instructions in chapter 1 of Bird-Klein for implementing a “lexical diversity” scoring routine.

2. Go to http://www.gutenberg.org/wiki/Children%27s_Instructional_Books_(Bookshelf) and obtain three texts (of different grade levels) from the “Graded Readers” section.
  
* [McGuffey's First Eclectic Reader, Revised Edition](https://www.gutenberg.org/cache/epub/14640/pg14640.txt)
* [McGuffey's Sixth Eclectic Reader](https://www.gutenberg.org/cache/epub/16751/pg16751.txt)
* [The Ontario Readers: The High School Reader, 1886](https://www.gutenberg.org/cache/epub/19923/pg19923.txt)


In [22]:
import requests
import nltk
import re

nltk.download("punkt")

[nltk_data] Downloading package punkt to /home/bensonnd/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [23]:
# tokenize the text and wrap the tokens in an NLTK Text object
def get_text_tokens(decoded_text):
    tokens = nltk.word_tokenize(decoded_text)
    text = nltk.Text(tokens)
    return text


# fetch a book from a url, return its title and tokenized text
def text_getter(target_url):
    """
    Retrieves book text from a url and returns its title and tokens for NL analysis
    param:
        target_url (str): the url to retrieve the book text from
    return:
        title (str): the title parsed from the book text
        text (nltk.Text): the tokenized book text
    """
    response = requests.get(target_url, timeout=30)
    response.raise_for_status()  # fail early on an HTTP error instead of parsing an error page
    decoded_text = response.text

    return book_title(decoded_text), get_text_tokens(decoded_text)


# calculate the lexical diversity
def lexical_diversity(book_text):
    """
    Calculates the lexical diversity of the book text.
    param:
        book_text: The tokenized text of the book, used to calculate word count and vocabulary size
    return:
        diversity_score: The number of unique tokens in a book divided by the total number of tokens in the book.

    """
    vocabulary_size = len(set(book_text))
    total_word_count = len(book_text)
    # named diversity_score so the local variable does not shadow the function name
    diversity_score = vocabulary_size / total_word_count
    return diversity_score


# get the book titles
def book_title(book_text):
    """
    Extracts the title of a book.
    param:
        book_text: The text of the book to extract the book title from.
    return:
        title: The parsed title of the book.

    """
    # capture the rest of the first line that starts with "Title:"
    search = re.search(r"Title:(.*)", book_text)
    if search is None:
        raise ValueError("no 'Title:' line found in book text")
    title = search.group(1).replace("\r", " ").strip()
    return title

In [24]:
urls = [
    "https://www.gutenberg.org/cache/epub/16751/pg16751.txt",
    "https://www.gutenberg.org/cache/epub/14640/pg14640.txt",
    "https://www.gutenberg.org/cache/epub/19923/pg19923.txt",
]

In [26]:
# get the book titles, and their respective texts
books = [text_getter(url) for url in urls]

In [27]:
lex_diversity = {
    title: {
        "lexical_diversity": f"{lexical_diversity(text)*100:.2f}%",
        "vocabulary_size": len(set(text)),
    }
    for title, text in books
}

3. Report the lexical diversity score of each. Explain whether the result was surprising.

In [28]:
# reviewing/analyzing lexical diversity differences between the grades
for key, value in lex_diversity.items():
    print(f"{key}:\n lexical_diversity: {value['lexical_diversity']}\n")

McGuffey's Sixth Eclectic Reader:
 lexical_diversity: 10.09%

McGuffey's First Eclectic Reader, Revised Edition:
 lexical_diversity: 16.69%

The Ontario Readers: The High School Reader, 1886:
 lexical_diversity: 11.08%



It's interesting to see that the 1st grade reader is 50%+ more lexically diverse than both the 6th grade and high school readers. My initial feeling was that the high school reader would be the most lexically diverse of the three. It is slightly more diverse than the 6th grade reader (11.08% vs. 10.09%), but well below the 1st grade reader (16.69%).

4. Also compare the vocabulary size of the same three texts. Explain whether the result was surprising.

In [29]:
# reviewing/analyzing vocabulary differences between the grades
for key, value in lex_diversity.items():
    print(f"{key}:\n vocabulary_size: {value['vocabulary_size']}\n")

McGuffey's Sixth Eclectic Reader:
 vocabulary_size: 17259

McGuffey's First Eclectic Reader, Revised Edition:
 vocabulary_size: 2146

The Ontario Readers: The High School Reader, 1886:
 vocabulary_size: 17101



Here, we see that the 1st grade reader's vocabulary (2,146 distinct tokens) is far smaller than the other two, which is not surprising. I am surprised to see that the 6th grade reader has the largest vocabulary, although its lead over the high school reader is small (17,259 vs. 17,101). Does this mean that by 6th grade we have learned most of the vocabulary that we are going to learn from standard educational sources, and that we don't increase our vocabulary much between 6th grade and high school?
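
The counts above come from `len(set(text))` on raw NLTK tokens, so punctuation marks and case variants ("The" vs. "the") each count as separate vocabulary entries. The following is a minimal sketch of a stricter count, assuming one wants lowercase, purely alphabetic tokens only. `normalized_vocabulary` and `sample_tokens` are hypothetical names for illustration and are not part of the notebook above.

```python
def normalized_vocabulary(tokens):
    """Return the set of distinct lowercase, purely alphabetic tokens."""
    return {token.lower() for token in tokens if token.isalpha()}


# Toy token list standing in for a tokenized book.
sample_tokens = ["The", "cat", "saw", "the", "dog", ".", "The", "dog", "ran", "!"]

raw_size = len(set(sample_tokens))                           # counts ".", "!", "The" and "the" separately
normalized_size = len(normalized_vocabulary(sample_tokens))  # case folded, punctuation dropped
print(raw_size, normalized_size)
```

Note that `str.isalpha` also discards hyphenated words and contraction fragments such as "n't", so this filter is a rough one. Whether it changes the ranking of the three readers would have to be checked on the actual texts.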

5. Write a paragraph arguing whether vocabulary size and lexical diversity in combination could be a better measure of text difficulty (or reading level) than either measure is by itself.

In [30]:
# reviewing/analyzing lexical diversity AND vocabulary differences between the grades
for key, value in lex_diversity.items():
    print(f"{key}:\n  {value}\n")

McGuffey's Sixth Eclectic Reader:
  {'lexical_diversity': '10.09%', 'vocabulary_size': 17259}

McGuffey's First Eclectic Reader, Revised Edition:
  {'lexical_diversity': '16.69%', 'vocabulary_size': 2146}

The Ontario Readers: The High School Reader, 1886:
  {'lexical_diversity': '11.08%', 'vocabulary_size': 17101}



When evaluating lexical diversity, it's important to consider vocabulary size in relation to total word count. With a smaller denominator, a given change in the numerator moves the ratio more, so lexical diversity is much more sensitive to vocabulary changes in texts with small overall word counts.
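
One way to reduce this dependence on text length is to score fixed-size chunks and average the results, so that every ratio has the same denominator. The sketch below illustrates that idea under stated assumptions. It is not the routine used in this notebook, and `mean_chunk_diversity`, its `chunk_size` parameter, and the toy texts are hypothetical.

```python
def mean_chunk_diversity(tokens, chunk_size=1000):
    """Average type-token ratio over consecutive, non-overlapping chunks.

    Every chunk has the same length, so texts of different sizes are
    compared on an equal footing. A trailing partial chunk is ignored.
    """
    tokens = list(tokens)
    scores = [
        len(set(tokens[start:start + chunk_size])) / chunk_size
        for start in range(0, len(tokens) - chunk_size + 1, chunk_size)
    ]
    if not scores:
        raise ValueError("text is shorter than one chunk")
    return sum(scores) / len(scores)


# Toy data: the same five-word sentence repeated, so the whole-text ratio
# shrinks as the text grows while the per-chunk ratio does not.
sentence = ["the", "cat", "sat", "down", "again"]
short_text = sentence * 2
long_text = sentence * 20

for text in (short_text, long_text):
    whole_text_ratio = len(set(text)) / len(text)
    print(whole_text_ratio, mean_chunk_diversity(text, chunk_size=5))
```

Because `nltk.Text` objects are iterable, the `text` values produced by `text_getter` could be passed in directly. A suitable `chunk_size` for the three readers is an open choice.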

In this case, adding a single word to the 1st grade vocabulary increases the lexical diversity much more (+0.04%) than adding a single word to the 6th grade (+0.01%) or high school (+0.01%) vocabulary. By these rounded figures, one added word moves lexical diversity roughly 4x as much for 1st grade as for the upper grades.
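
The size of this effect depends on how the change is measured, in percentage points or relative to the starting score. The sketch below computes both on made-up numbers. `one_word_effect` is a hypothetical helper, and it assumes the total word count stays fixed while one repeated token is swapped for a new word.

```python
def one_word_effect(vocabulary_size, total_word_count):
    """Change in lexical diversity when the vocabulary grows by one word.

    Assumes the total word count is unchanged, i.e. one repeated token
    is replaced by a word that has not appeared before.
    """
    before = vocabulary_size / total_word_count
    after = (vocabulary_size + 1) / total_word_count
    return {
        "percentage_point_change": (after - before) * 100,
        "relative_change_percent": (after / before - 1) * 100,
    }


# Two made-up texts with the same lexical diversity (20%) but different sizes.
small = one_word_effect(vocabulary_size=200, total_word_count=1_000)
large = one_word_effect(vocabulary_size=2_000, total_word_count=20_000)
print(small)
print(large)
```

Under this assumption the relative change reduces to `1 / vocabulary_size`, so it depends only on vocabulary size, while the percentage-point change depends only on total word count.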

I'd be interested in knowing what the word repetition rate is. I imagine that for lower grades, words are repeated more than in upper grades.
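
A first look at repetition needs only token counts. The sketch below is one possible way to measure it, using `collections.Counter`. `repetition_stats` and the toy `sample` list are hypothetical and not part of the notebook.

```python
from collections import Counter


def repetition_stats(tokens):
    """Summarize how often distinct tokens repeat in a token sequence."""
    counts = Counter(tokens)
    total_tokens = sum(counts.values())
    distinct_tokens = len(counts)
    used_once = sum(1 for count in counts.values() if count == 1)
    return {
        # average number of times each distinct token appears
        "mean_occurrences_per_word": total_tokens / distinct_tokens,
        # fraction of the vocabulary that appears exactly once
        "share_used_once": used_once / distinct_tokens,
    }


sample = ["a", "a", "a", "b", "b", "c"]
stats = repetition_stats(sample)
print(stats)
```

The mean occurrences per word is the reciprocal of lexical diversity, so the scores reported above already imply roughly 6 occurrences per distinct token for the 1st grade reader and roughly 9 to 10 for the other two. That average is subject to the same text-length effect discussed earlier, so the share of words used only once may be the more informative number to compare across grades.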