---

_You are currently looking at **version 1.0** of this notebook. To download notebooks and datafiles, as well as get help on Jupyter notebooks in the Coursera platform, visit the [Jupyter Notebook FAQ](https://www.coursera.org/learn/python-text-mining/resources/d9pwm) course resource._

---

# Assignment 2 - Introduction to NLTK

In part 1 of this assignment you will use nltk to explore the Herman Melville novel Moby Dick. Then in part 2 you will create a spelling recommender function that uses nltk to find words similar to the misspelling. 

## Part 1 - Analyzing Moby Dick

In [2]:
import nltk
#nltk.download('punkt')
import pandas as pd
import numpy as np

# If you would like to work with the raw text you can use 'moby_raw'
with open('moby.txt', 'r') as f:
    moby_raw = f.read()
    
# If you would like to work with the novel in nltk.Text format you can use 'text1'
moby_tokens = nltk.word_tokenize(moby_raw)
text1 = nltk.Text(moby_tokens)

### Example 1

How many tokens (words and punctuation symbols) are in text1?

*This function should return an integer.*

In [None]:
def example_one():
    
    return len(nltk.word_tokenize(moby_raw)) # or alternatively len(text1)

example_one()

### Example 2

How many unique tokens (unique words and punctuation) does text1 have?

*This function should return an integer.*

In [None]:
def example_two():
    
    return len(set(nltk.word_tokenize(moby_raw))) # or alternatively len(set(text1))

example_two()

### Example 3

After lemmatizing the verbs, how many unique tokens does text1 have?

*This function should return an integer.*

In [None]:
#nltk.download('wordnet')

In [None]:
from nltk.stem import WordNetLemmatizer

def example_three():

    lemmatizer = WordNetLemmatizer()
    lemmatized = [lemmatizer.lemmatize(w,'v') for w in text1]

    return len(set(lemmatized))

example_three()

### Question 1

What is the lexical diversity of the given text input? (i.e. ratio of unique tokens to the total number of tokens)

*This function should return a float.*

In [None]:
def answer_one():
    # Tokenize once and reuse the result for both counts
    tokens = nltk.word_tokenize(moby_raw)
    
    return len(set(tokens)) / len(tokens)

answer_one()
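
As a small worked illustration of the ratio described above, the sketch below computes lexical diversity on a toy token list. The list `toy_tokens` and the helper `lexical_diversity` are hypothetical names introduced only for this example; they stand in for the output of `nltk.word_tokenize(moby_raw)`.

```python
# Hypothetical toy data standing in for the tokenized novel.
toy_tokens = ['the', 'whale', ',', 'the', 'sea', ',', 'the', 'ship', '.']

def lexical_diversity(tokens):
    """Return the number of unique tokens divided by the total number of tokens."""
    if not tokens:
        return 0.0  # avoid dividing by zero on empty input
    return len(set(tokens)) / len(tokens)

# 6 unique tokens out of 9 total
print(lexical_diversity(toy_tokens))
```

Note that punctuation counts as a token here, just as it does in the questions above.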

### Question 2

What percentage of tokens is 'whale' or 'Whale'?

*This function should return a float.*

In [None]:
def answer_two():
    tokens = nltk.word_tokenize(moby_raw)
    counts = tokens.count('whale') + tokens.count('Whale')    
    
    return counts/len(tokens)*100

answer_two()

### Question 3

What are the 20 most frequently occurring (unique) tokens in the text? What is their frequency?

*This function should return a list of 20 tuples where each tuple is of the form `(token, frequency)`. The list should be sorted in descending order of frequency.*

In [None]:
def answer_three():
    from nltk import FreqDist
    fdist = FreqDist(nltk.word_tokenize(moby_raw))    
    return fdist.most_common(20)
# answer_three()
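
`FreqDist` behaves much like the standard library's `collections.Counter`, so the idea can be tried without loading the novel. The sketch below uses `Counter` on a hypothetical toy list (`toy_tokens` is made up for illustration) to show the `(token, frequency)` shape that `most_common` returns.

```python
from collections import Counter

# Hypothetical toy data; the real input would be the tokenized novel.
toy_tokens = ['the', 'whale', 'the', 'sea', 'the', 'whale', '.']

# most_common(n) returns the n highest counts as (token, frequency) tuples,
# already sorted in descending order of frequency.
top_two = Counter(toy_tokens).most_common(2)
print(top_two)  # [('the', 3), ('whale', 2)]
```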

### Question 4

What tokens have a length of greater than 5 and frequency of more than 150?

*This function should return an alphabetically sorted list of the tokens that match the above constraints. To sort your list, use `sorted()`*

In [None]:
def answer_four():
    from nltk.probability import FreqDist
    word_freq = FreqDist(nltk.word_tokenize(moby_raw))
    # Keep tokens longer than 5 characters that occur more than 150 times
    custom_tokens = [token for token, freq in word_freq.items() if len(token) > 5 and freq > 150]
    return sorted(custom_tokens)

# answer_four()

### Question 5

Find the longest word in text1 and that word's length.

*This function should return a tuple `(longest_word, length)`.*

In [None]:
def answer_five():
    # max() with key=len returns the first token of maximal length
    longest = max(text1, key=len)
    return longest, len(longest)
# answer_five()

### Question 6

What unique words have a frequency of more than 2000? What is their frequency?

Hint: you may want to use `isalpha()` to check if the token is a word and not punctuation.

*This function should return a list of tuples of the form `(frequency, word)` sorted in descending order of frequency.*

In [None]:
def answer_six():
    word_freq = nltk.FreqDist(nltk.word_tokenize(moby_raw))
    # Keep alphabetic tokens that occur more than 2000 times, as (frequency, word) pairs
    custom_tokens = [(freq, word) for word, freq in word_freq.items() if freq > 2000 and word.isalpha()]
    return sorted(custom_tokens, reverse=True)

answer_six()
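
The filtering step can be checked on a small input. This is a minimal sketch using `collections.Counter` in place of `FreqDist`; the names `toy_tokens` and `frequent_words`, and the threshold of 1, are assumptions chosen for the example.

```python
from collections import Counter

# Hypothetical toy data; ',' is frequent but must be excluded by isalpha().
toy_tokens = ['the', 'whale', ',', 'the', 'sea', ',', 'the', 'whale', '.', ',']

def frequent_words(tokens, min_count):
    """Return (frequency, word) pairs for alphabetic tokens seen more than min_count times."""
    counts = Counter(tokens)
    pairs = [(freq, word) for word, freq in counts.items()
             if freq > min_count and word.isalpha()]
    # Tuples sort by frequency first, so reverse=True gives descending frequency.
    return sorted(pairs, reverse=True)

print(frequent_words(toy_tokens, 1))  # [(3, 'the'), (2, 'whale')]
```

Putting the frequency first in each tuple is what lets a plain `sorted(..., reverse=True)` order the result by frequency.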

### Question 7

What is the average number of tokens per sentence?

*This function should return a float.*

In [10]:
for i in nltk.sent_tokenize(moby_raw):
    print(len(i.split()))

7
23
28
15
47
...


In [12]:
def answer_seven():
    sentences = nltk.sent_tokenize(moby_raw)
    lengths = [len(nltk.word_tokenize(sent)) for sent in sentences]
    
    return sum(lengths) / len(sentences) 

answer_seven()

25.88489646772229
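
The scratch loop earlier in this question counts whitespace-separated words with `split()`, while `answer_seven` counts tokens, which include punctuation. The sketch below shows how the two averages can differ on two toy sentences. The regex tokenizer `rough_tokenize` is a hypothetical stand-in and is not how `nltk.word_tokenize` works internally.

```python
import re

# Hypothetical toy sentences; the real input would come from nltk.sent_tokenize.
toy_sentences = ["Call me Ishmael.", "It was a damp, drizzly November."]

def rough_tokenize(sentence):
    """Split into runs of word characters and single punctuation marks (rough approximation)."""
    return re.findall(r"\w+|[^\w\s]", sentence)

# Average using whitespace splitting: punctuation stays attached to words.
split_avg = sum(len(s.split()) for s in toy_sentences) / len(toy_sentences)

# Average using token counts: punctuation marks are counted separately.
token_avg = sum(len(rough_tokenize(s)) for s in toy_sentences) / len(toy_sentences)

print(split_avg, token_avg)  # 4.5 6.0
```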

### Question 8

What are the 5 most frequent parts of speech in this text? What is their frequency?

*This function should return a list of tuples of the form `(part_of_speech, frequency)` sorted in descending order of frequency.*

In [31]:
def answer_eight():
    from collections import Counter
    tokens = nltk.word_tokenize(moby_raw)
    pos = nltk.pos_tag(tokens)
    pos_tags = [i[1] for i in pos]
    
    return Counter(pos_tags).most_common(5)

answer_eight()

[('NN', 32730), ('IN', 28658), ('DT', 25870), (',', 19204), ('JJ', 17619)]

## Part 2 - Spelling Recommender

For this part of the assignment you will create three different spelling recommenders, each of which takes a list of misspelled words and recommends a correctly spelled word for every word in the list.

For every misspelled word, the recommender should find the word in `correct_spellings` that starts with the same letter as the misspelled word and has the shortest distance* to it, and return that word as the recommendation.

*Each of the three different recommenders will use a different distance measure (outlined below).

Each of the recommenders should provide recommendations for the three default words provided: `['cormulent', 'incendenece', 'validrate']`.
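
All three recommenders share the same shape: filter the candidates by first letter, then take the candidate with the smallest distance. The sketch below shows that shape with a pluggable distance function. The names `recommend`, `toy_distance`, and `toy_words` are hypothetical, and `toy_distance` is a deliberately crude placeholder, not one of the metrics used in the questions.

```python
def recommend(misspelled, candidates, distance):
    """Return the same-first-letter candidate with the smallest distance, or None."""
    same_letter = [w for w in candidates if w and w[0] == misspelled[0]]
    if not same_letter:
        return None
    # min() keeps the first candidate when several share the smallest distance.
    return min(same_letter, key=lambda w: distance(misspelled, w))

def toy_distance(a, b):
    """Crude placeholder: mismatched positions plus the difference in length."""
    mismatches = sum(x != y for x, y in zip(a, b))
    return mismatches + abs(len(a) - len(b))

# Hypothetical miniature dictionary standing in for correct_spellings.
toy_words = ['corpulent', 'cormus', 'valid', 'incendiary']

print(recommend('cormulent', toy_words, toy_distance))  # corpulent
```

Swapping in a different `distance` callable is enough to move between the three variants.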

In [34]:
from nltk.corpus import words

correct_spellings = words.words()

In [35]:
correct_spellings

['A',
 'a',
 'aa',
 'aal',
 'aalii',
 ...]

### Question 9

For this recommender, your function should provide recommendations for the three default words provided above using the following distance metric:

**[Jaccard distance](https://en.wikipedia.org/wiki/Jaccard_index) on the trigrams of the two words.**

*This function should return a list of length three:
`['cormulent_recommendation', 'incendenece_recommendation', 'validrate_recommendation']`.*

In [88]:
def answer_nine(entries=['cormulent', 'incendenece', 'validrate']):
    from nltk.metrics.distance import jaccard_distance
    from nltk.util import ngrams
    
    best_matches = []
    for entry in entries:
        # Candidate spellings that start with the same letter as the entry
        candidates = [word for word in correct_spellings if word[0] == entry[0]]
        entry_ngrams = set(ngrams(entry, 3))
        # Pick the candidate whose trigram set has the smallest Jaccard distance
        best = min(candidates, key=lambda word: jaccard_distance(entry_ngrams, set(ngrams(word, 3))))
        best_matches.append(best)

    return best_matches
    
# answer_nine()
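
To make the metric concrete, the sketch below computes the Jaccard distance on character n-grams in plain Python, without `nltk`. The helpers `char_ngrams` and `jaccard_dist` are hypothetical names written for this example; they are meant to mirror the idea, not to reproduce the library's implementation.

```python
def char_ngrams(word, n):
    """Return the set of all length-n character substrings of word."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}

def jaccard_dist(a, b):
    """Return 1 - |a & b| / |a | b| for two sets (0.0 if both are empty)."""
    union = a | b
    if not union:
        return 0.0
    return 1 - len(a & b) / len(union)

entry = 'validrate'
for n in (3, 4):
    for candidate in ('validate', 'valid'):
        d = jaccard_dist(char_ngrams(entry, n), char_ngrams(candidate, n))
        print(n, candidate, round(d, 3))
```

On this pair of candidates, 'validate' is closer with trigrams (5/9 versus 4/7), while 'valid' is closer with 4-grams (2/3 versus 7/9), which shows how sensitive the recommendation is to the n-gram size.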

### Question 10

For this recommender, your function should provide recommendations for the three default words provided above using the following distance metric:

**[Jaccard distance](https://en.wikipedia.org/wiki/Jaccard_index) on the 4-grams of the two words.**

*This function should return a list of length three:
`['cormulent_recommendation', 'incendenece_recommendation', 'validrate_recommendation']`.*

In [89]:
def answer_ten(entries=['cormulent', 'incendenece', 'validrate']):
    
    from nltk.metrics.distance import jaccard_distance
    from nltk.util import ngrams
    
    best_matches = []
    for entry in entries:
        # Candidate spellings that start with the same letter as the entry
        candidates = [word for word in correct_spellings if word[0] == entry[0]]
        entry_ngrams = set(ngrams(entry, 4))
        # Pick the candidate whose 4-gram set has the smallest Jaccard distance
        best = min(candidates, key=lambda word: jaccard_distance(entry_ngrams, set(ngrams(word, 4))))
        best_matches.append(best)

    return best_matches
    
answer_ten()

['cormus', 'incendiary', 'valid']

### Question 11

For this recommender, your function should provide recommendations for the three default words provided above using the following distance metric:

**[Edit distance on the two words with transpositions.](https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance)**

*This function should return a list of length three:
`['cormulent_recommendation', 'incendenece_recommendation', 'validrate_recommendation']`.*

In [95]:
def answer_eleven(entries=['cormulent', 'incendenece', 'validrate']):
    from nltk.metrics.distance import edit_distance
    best_matches = []
    for entry in entries:
        # Candidate spellings that start with the same letter as the entry
        candidates = [word for word in correct_spellings if word[0] == entry[0]]
        # transpositions=True counts a swap of two adjacent letters as a single edit
        best = min(candidates, key=lambda word: edit_distance(entry, word, transpositions=True))
        best_matches.append(best)

    return best_matches
    
answer_eleven()

['corpulent', 'intendence', 'validate']
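
For readers who want to see what "edit distance with transpositions" means, here is a minimal pure-Python sketch of the optimal string alignment variant, which allows insertions, deletions, substitutions, and swaps of adjacent characters. The function `osa_distance` and its `allow_transpositions` flag are hypothetical names for this illustration and should not be read as the `nltk` implementation.

```python
def osa_distance(s, t, allow_transpositions=True):
    """Edit distance with unit costs; optionally counts an adjacent swap as one edit."""
    rows, cols = len(s) + 1, len(t) + 1
    # d[i][j] holds the distance between the first i chars of s and first j chars of t.
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters
    for j in range(cols):
        d[0][j] = j  # insert all j characters
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
            if (allow_transpositions and i > 1 and j > 1
                    and s[i - 1] == t[j - 2] and s[i - 2] == t[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent swap
    return d[-1][-1]

print(osa_distance('ab', 'ba'))                              # 1
print(osa_distance('ab', 'ba', allow_transpositions=False))  # 2
print(osa_distance('validrate', 'validate'))                 # 1
```

The first two calls show the effect of the transposition rule: swapping two adjacent letters costs one edit with it and two without it.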