#### Homework 2

In [1]:
# Load important libraries
import nltk
from nltk.book import *
import numpy as np
import pandas as pd
from sklearn.preprocessing import minmax_scale
from urllib import request
import re

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


#### 1.	In Python, create a method for scoring the vocabulary size of a text, and normalize the score from 0 to 1. It does not matter what method you use for normalization as long as you explain it in a short paragraph. 

Normalization can be done with `minmax_scale` from the sklearn module or directly with the simple min-max formula. The formula subtracts the smallest vocabulary size in the list from each text's vocabulary size, then divides by the difference between the largest and smallest vocabulary sizes: (x - min(x)) / (max(x) - min(x)). The text with the smallest vocabulary scores 0, the text with the largest scores 1, and every other text falls in between.
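As a minimal sketch of the formula on toy numbers (not the notebook's data), the helper below uses only the standard library. The name `min_max_normalize` and the handling of the case where all values are equal are assumptions made for illustration; the notebook itself relies on NumPy and sklearn.

```python
def min_max_normalize(values):
    """Scale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: when every value is equal the formula would divide by
        # zero, so this sketch returns 0.0 for each entry instead.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Toy vocabulary sizes, chosen so the arithmetic is easy to follow
toy_sizes = [10, 20, 50]
print(min_max_normalize(toy_sizes))  # [0.0, 0.25, 1.0]
```

The middle value is (20 - 10) / (50 - 10) = 0.25, which shows how each score depends on the smallest and largest values in the list being compared.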

In [2]:
def vocab_size(*arg):
    vocab_size = np.array([])
    vocab_size_norm = np.array([])
    vocab_size_norm_sklearn = np.array([])
    
    # Vocabulary size: count distinct alphabetic tokens, lowercased so that
    # "The" and "the" are counted once
    for text in arg:
        vocab_size = np.append(vocab_size,len(set(word.lower() for word in text if word.isalpha())))
        
    # Normalize with the formula (x - min(x)) / (max(x) - min(x)), applied to the whole array at once
    vocab_size_norm = (vocab_size - vocab_size.min()) / (vocab_size.max() - vocab_size.min())
    
    # Normalize with sklearn's minmax_scale for comparison
    vocab_size_norm_sklearn = minmax_scale(vocab_size, feature_range=(0,1), axis=0)
    
    return(vocab_size,vocab_size_norm,vocab_size_norm_sklearn)

In [3]:
vocab_size_dtls = vocab_size(text1,text2,text3,text4,text5,text6,text7,text8,text9)

In [4]:
# create a dataframe
vocab_size_summary = pd.DataFrame({'Vocabulary Size': vocab_size_dtls[0], 
                                   'Normalization using the simple formula': vocab_size_dtls[1], 
                                   'Normalization using the sklearn module': vocab_size_dtls[2]})

vocab_size_summary

| | Vocabulary Size | Normalization using the simple formula | Normalization using the sklearn module |
|---|---|---|---|
| 0 | 16948.0 | 1.0 | 1.0 |
| 1 | 6283.0 | 0.340731 | 0.340731 |
| 2 | 2615.0 | 0.113989 | 0.113989 |
| 3 | 9110.0 | 0.515485 | 0.515485 |
| 4 | 4671.0 | 0.241083 | 0.241083 |
| 5 | 1799.0 | 0.063547 | 0.063547 |
| 6 | 9109.0 | 0.515423 | 0.515423 |
| 7 | 771.0 | 0.0 | 0.0 |
| 8 | 6300.0 | 0.341782 | 0.341782 |


The normalized vocabulary scores for the texts from the NLTK library are identical whether they are calculated with the simple formula or with `minmax_scale` from the sklearn module.

#### 2.	After consulting section 3.2 in chapter 1 of Bird-Klein, create a method for scoring the long-word vocabulary size of a text, and likewise normalize (and explain) the scoring as in step 1 above.

The method below calculates the long-word vocabulary size of each text. It counts only distinct words that have more than 7 letters and occur more than 7 times in the text, which excludes frequent short words (e.g., "the") and infrequent long words. It then calculates the normalized score using both the simple formula and `minmax_scale` from the sklearn module.
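The filter described above can be sketched on a toy token list. This version uses `collections.Counter` from the standard library in place of NLTK's `FreqDist` (which is itself a `Counter` subclass), and the name `count_long_frequent_words` is hypothetical, chosen only for this illustration.

```python
from collections import Counter


def count_long_frequent_words(tokens, word_len=7, word_freq=7):
    """Count distinct words longer than word_len that occur more than word_freq times."""
    counts = Counter(tokens)
    return sum(1 for word, n in counts.items()
               if len(word) > word_len and n > word_freq)


# Toy tokens: only "lighthouse" is both long (10 letters) and frequent (8 times)
tokens = (["the"] * 20              # frequent but short
          + ["whale"] * 9           # frequent but short
          + ["extraordinary"] * 2   # long but infrequent
          + ["lighthouse"] * 8)     # long and frequent
print(count_long_frequent_words(tokens))  # 1
```

Both comparisons are strict, so a word with exactly 7 letters, or one that occurs exactly 7 times, is not counted.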

In [5]:
def long_word_score(*arg, word_len=7, word_freq=7):
    long_word_score = np.array([])
    long_word_score_norm = np.array([])
    long_word_score_norm_sklearn = np.array([])
    
    # Long-word vocabulary size: distinct words longer than word_len characters
    # that occur more than word_freq times
    for text in arg:
        fdist = FreqDist(text)
        size = len([word for word in set(text) if len(word) > word_len and fdist[word] > word_freq])
        long_word_score = np.append(long_word_score,size)
        
    # Normalize with the formula (x - min(x)) / (max(x) - min(x)), applied to the whole array at once
    long_word_score_norm = (long_word_score - long_word_score.min()) / (long_word_score.max() - long_word_score.min())
    
    # Normalize with sklearn's minmax_scale for comparison
    long_word_score_norm_sklearn = minmax_scale(long_word_score, feature_range=(0,1), axis=0)
    
    return(long_word_score,long_word_score_norm,long_word_score_norm_sklearn)

In [6]:
word_score = long_word_score(text1,text2,text3,text4,text5,text6,text7,text8,text9)

In [7]:
# create a dataframe
long_word_vocab_size_summary = pd.DataFrame({'Long Word Vocabulary Size': word_score[0], 
                                   'Normalization using the simple formula': word_score[1], 
                                   'Normalization using the sklearn module': word_score[2]})

long_word_vocab_size_summary

| | Long Word Vocabulary Size | Normalization using the simple formula | Normalization using the sklearn module |
|---|---|---|---|
| 0 | 555.0 | 0.827795 | 0.827795 |
| 1 | 399.0 | 0.592145 | 0.592145 |
| 2 | 48.0 | 0.061934 | 0.061934 |
| 3 | 669.0 | 1.0 | 1.0 |
| 4 | 19.0 | 0.018127 | 0.018127 |
| 5 | 12.0 | 0.007553 | 0.007553 |
| 6 | 369.0 | 0.546828 | 0.546828 |
| 7 | 7.0 | 0.0 | 0.0 |
| 8 | 101.0 | 0.141994 | 0.141994 |


The normalized long-word vocabulary scores (word length and word frequency thresholds both set to 7) for all texts from the NLTK library are identical whether they are calculated with the simple formula or with `minmax_scale` from the sklearn module.

#### 3.	Now create a “text difficulty score” by combining the lexical diversity score from homework 1, and your normalized score of vocabulary size and long-word vocabulary size, in equal weighting. Explain what you see when this score is applied to same graded texts you used in homework1.

In [8]:
# returns text/tokens
def getTextFromUrl(url, text=False):
    # make a request
    response = request.urlopen(url)

    # extract the raw response
    raw = response.read().decode('utf8')

    # tokenize the words
    tokens = nltk.word_tokenize(raw)

    if text:
        textObj = nltk.Text(tokens)
        return textObj

    return tokens

# get the book title
def getBookTitle(url):
    # make a request
    response = request.urlopen(url)

    # extract the raw response
    raw = response.read().decode('utf8')

    match = re.search(r'Title: ([\w\']+) (.+)', raw)
    if match:
        title = match.group()
    else:
        return None

    return title.strip('\r')

# get the lexical diversity 
def lexical_diversity(*arg):
    lexical_diversity = np.array([])
    
    for text in arg:
        lexical_diversity = np.append(lexical_diversity,len(set(text)) / len(text))
    
    return (lexical_diversity)

# get text difficulty score
def text_difficulty_score(*arg):
    # compute each component once; index [2] selects the sklearn-normalized scores
    diversity = lexical_diversity(*arg)
    vocab_norm = vocab_size(*arg)[2]
    long_word_norm = long_word_score(*arg)[2]
    
    print("Lexical Diversity:", diversity)
    print("Vocabulary Score:", vocab_norm)
    print("Long word Vocabulary score:", long_word_norm)
    
    # equal-weight average of the three components
    return (diversity + vocab_norm + long_word_norm) / 3

In [9]:
# url for third grade
url1 = "https://www.gutenberg.org/cache/epub/14766/pg14766.txt"

# url for fourth grade
url2 = "https://www.gutenberg.org/cache/epub/14880/pg14880.txt"

# url for fifth grade
url3 = "https://www.gutenberg.org/cache/epub/15040/pg15040.txt"

# create a list with URL's
urls = [url1, url2, url3]

In [10]:
# get the text for each book
texts = [getTextFromUrl(url, text=True) for url in urls]

# get book title 
titles = [getBookTitle(url) for url in urls]

# Calculate text difficulty score (stored under a new name so the function is not overwritten)
difficulty_scores = text_difficulty_score(texts[0],texts[1],texts[2])

Lexical Diversity: [0.12404917 0.12347441 0.11286183]
Vocabulary Score: [0.         0.55148293 1.        ]
Long word Vocabulary score: [0.         0.36111111 1.        ]


In [11]:
# create a dataframe with title of the book and difficulty score
summary = pd.DataFrame({'Title': titles, 'Text difficulty score': difficulty_scores})
summary

| | Title | Text difficulty score |
|---|---|---|
| 0 | Title: McGuffey's Third Eclectic Reader | 0.04135 |
| 1 | Title: McGuffey's Fourth Eclectic Reader | 0.345356 |
| 2 | Title: McGuffey's Fifth Eclectic Reader | 0.704287 |
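The equal weighting can be checked by hand from the three printed component arrays. The sketch below copies those printed values into plain lists (the variable names are chosen for this illustration only) and averages them per book.

```python
# Component scores copied from the printed output above,
# in the order third, fourth, fifth grade reader
diversity = [0.12404917, 0.12347441, 0.11286183]
vocab_norm = [0.0, 0.55148293, 1.0]
long_word_norm = [0.0, 0.36111111, 1.0]

# Equal weighting: each component contributes one third of the final score
combined = [(d + v + w) / 3
            for d, v, w in zip(diversity, vocab_norm, long_word_norm)]

for grade, score in zip(["Third", "Fourth", "Fifth"], combined):
    print(f"{grade}: {score:.6f}")
```

For the third grade reader both normalized scores are 0, so its difficulty score is simply one third of its lexical diversity.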


### Summary

When texts are ranked by the lexical diversity score from homework 1 alone, the third grade book (lexical diversity score 0.124049) appears more difficult than the fourth grade (0.123474) and fifth grade (0.112862) books. This ordering is misleading because lexical diversity depends on the total number of tokens: longer texts tend to repeat more words, so their ratio of distinct words to total words falls. The new measurement averages the lexical diversity score with the normalized vocabulary size and the normalized long-word vocabulary size, and it ranks the books in grade order: third (0.04135), fourth (0.345356), fifth (0.704287). I believe that this difficulty score is more useful than the one in homework 1 because, in addition to lexical diversity, it takes into account the size of the vocabulary and the number of longer words, which tend to be more difficult to read. Because only three texts are compared, min-max normalization assigns exactly 0 to the smallest value and 1 to the largest, so the scores are relative to this set of books rather than absolute.
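The dependence of lexical diversity on text length can be illustrated with a toy passage. The function name `type_token_ratio` is hypothetical, and repeating a passage is an artificial way to lengthen a text; it is used here only to isolate the effect of the token count.

```python
def type_token_ratio(tokens):
    """Lexical diversity: distinct tokens divided by total tokens."""
    return len(set(tokens)) / len(tokens)


passage = "the cat sat on the mat".split()  # 6 tokens, 5 distinct
longer = passage * 10                       # 60 tokens, still 5 distinct

print(type_token_ratio(passage))  # 5 / 6
print(type_token_ratio(longer))   # 5 / 60
```

The vocabulary is unchanged, yet the ratio drops by a factor of ten, which is why lexical diversity on its own can rank a longer book as easier.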