# Introduction

Process a text corpus using the [NLTK](https://www.nltk.org/) or [spaCy](https://spacy.io/) library (ideally both). In particular, do the following:
- Load the `harry_potter` book. You can find this text corpus in the datasets folder.
- Segment the text of the book into sentences. How many sentences does this book have?
- Compute the frequency of each token in the book. What are the most frequent tokens?
- Choose a sentence from the book and analyze it by:
    - Calculating all [n-grams](https://en.wikipedia.org/wiki/N-gram).
    - Finding [POS tags](https://en.wikipedia.org/wiki/Part-of-speech_tagging) of tokens.
    - [Stemming](https://en.wikipedia.org/wiki/Stemming) and [lemmatizing](https://en.wikipedia.org/wiki/Lemmatisation) tokens.
- Check the documentation to identify the most important hyperparameters, attributes, and methods. Use them in practice.

In [42]:
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.probability import FreqDist
from nltk.util import ngrams
from nltk.tag import pos_tag
from nltk.stem import PorterStemmer, WordNetLemmatizer

In [2]:
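# Read the whole book into one string; the context manager closes the file (UTF-8 is assumed)
with open("../../datasets/harry_potter.txt", encoding="utf-8") as book:
    f = book.read()
f[:1000]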

"CHAPTER ONE THE BOY WHO LIVED \n\nMr. and Mrs. Dursley, of number four, Privet Drive, were proud to say that they were perfectly normal, thank you very much. They were the last people you'd expect to be involved in anything strange or mysterious, because they just didn't hold with such nonsense. \n\nMr. Dursley was the director of a firm called Grunnings, which made drills. He was a big, beefy man with hardly any neck, although he did have a very large mustache. Mrs. Dursley was thin and blonde and had nearly twice the usual amount of neck, which came in very useful as she spent so much of her time craning over garden fences, spying on the neighbors. The Dursleys had a small son called Dudley and in their opinion there was no finer boy anywhere. \n\nThe Dursleys had everything they wanted, but they also had a secret, and their greatest fear was that somebody would discover it. They didn't think they could bear it if anyone found out about the Potters. Mrs. Potter was Mrs. Dursley's si

In [46]:
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package punkt to /Users/tactlabs/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/tactlabs/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/tactlabs/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /Users/tactlabs/nltk_data...


True

In [11]:
# Segment the book into sentences and count them
sentences = sent_tokenize(f)
print(len(sentences))

6394


In [22]:
# Tokenize the book into words and count how often each token occurs
words = word_tokenize(f)
FreqDist(words)

FreqDist({',': 5658, '.': 5118, 'the': 3310, "''": 2443, '``': 2305, 'to': 1845, 'and': 1804, 'a': 1578, 'Harry': 1323, 'was': 1253, ...})
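The top of the distribution above is dominated by punctuation and quote marks. As a minimal sketch of one way to surface content words, the snippet below keeps only alphabetic tokens and lowercases them before counting. It uses the standard-library `collections.Counter` (which `FreqDist` subclasses, so `most_common` works the same way) and a short stand-in list, `sample_tokens`, which is a hypothetical name and not the book's real token stream.

```python
from collections import Counter

# Hypothetical stand-in for the output of word_tokenize(f)
sample_tokens = ["The", "boy", ",", "the", "owl", "and", "the", "cat", ".", "``", "Boy", "''"]

# Keep alphabetic tokens only and fold case so 'The' and 'the' count together
content_tokens = [tok.lower() for tok in sample_tokens if tok.isalpha()]

freq = Counter(content_tokens)
top_two = freq.most_common(2)
print(top_two)
```

Whether to lowercase is a judgment call: folding case merges `The` with `the`, but it also merges proper nouns with any identically spelled common words.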

In [38]:
# Pick one sentence to analyze (n-grams, POS tags, stems, lemmas)
sentence = sentences[34]
sentence

'He drummed his fingers on the steering wheel and his eyes fell on a huddle of these weirdos standing quite close by.'

In [39]:
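# Print every n-gram of the sentence, for n = 1 up to the number of tokens
tokens = sentence.split()  # whitespace split, so 'by.' keeps its period
for n in range(1, len(tokens) + 1):
    for item in ngrams(tokens, n):
        print(item)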

('He',)
('drummed',)
('his',)
('fingers',)
('on',)
('the',)
('steering',)
('wheel',)
('and',)
('his',)
('eyes',)
('fell',)
('on',)
('a',)
('huddle',)
('of',)
('these',)
('weirdos',)
('standing',)
('quite',)
('close',)
('by.',)
('He', 'drummed')
('drummed', 'his')
('his', 'fingers')
('fingers', 'on')
('on', 'the')
('the', 'steering')
('steering', 'wheel')
('wheel', 'and')
('and', 'his')
('his', 'eyes')
('eyes', 'fell')
('fell', 'on')
('on', 'a')
('a', 'huddle')
('huddle', 'of')
('of', 'these')
('these', 'weirdos')
('weirdos', 'standing')
('standing', 'quite')
('quite', 'close')
('close', 'by.')
('He', 'drummed', 'his')
('drummed', 'his', 'fingers')
('his', 'fingers', 'on')
('fingers', 'on', 'the')
('on', 'the', 'steering')
('the', 'steering', 'wheel')
('steering', 'wheel', 'and')
('wheel', 'and', 'his')
('and', 'his', 'eyes')
('his', 'eyes', 'fell')
('eyes', 'fell', 'on')
('fell', 'on', 'a')
('on', 'a', 'huddle')
('a', 'huddle', 'of')
('huddle', 'of', 'these')
('of', 'these', 'weirdos')
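The n-grams above come from `sentence.split()`, which is why the last token is `'by.'` with the period attached. To make clear what `ngrams` computes, here is a small standard-library sketch of a sliding window over a token list; `sliding_ngrams` is a hypothetical helper, not an NLTK function.

```python
def sliding_ngrams(tokens, n):
    """Return all contiguous n-token tuples, in order (empty list if n is out of range)."""
    if n < 1 or n > len(tokens):
        return []
    # zip over n staggered views of the list: tokens[0:], tokens[1:], ...
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = "He drummed his fingers on the steering wheel".split()
bigrams = sliding_ngrams(tokens, 2)
trigrams = sliding_ngrams(tokens, 3)
print(bigrams[:3])
print(len(bigrams), len(trigrams))
```

A sentence with k tokens has k - n + 1 n-grams of order n, which matches the 22 unigrams and 21 bigrams printed above for the 22 whitespace-separated tokens.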

In [40]:
# Tokenize the sentence (punctuation becomes its own token), then tag each token
words_1 = word_tokenize(sentence)
pos_tag(words_1)

[('He', 'PRP'),
 ('drummed', 'VBD'),
 ('his', 'PRP$'),
 ('fingers', 'NNS'),
 ('on', 'IN'),
 ('the', 'DT'),
 ('steering', 'NN'),
 ('wheel', 'NN'),
 ('and', 'CC'),
 ('his', 'PRP$'),
 ('eyes', 'NNS'),
 ('fell', 'VBD'),
 ('on', 'IN'),
 ('a', 'DT'),
 ('huddle', 'NN'),
 ('of', 'IN'),
 ('these', 'DT'),
 ('weirdos', 'NNS'),
 ('standing', 'VBG'),
 ('quite', 'RB'),
 ('close', 'JJ'),
 ('by', 'IN'),
 ('.', '.')]
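The tags above are Penn Treebank tags. As a reading aid, the sketch below groups a few of the tagged tokens by tag and attaches a plain-English gloss. The `TAG_GLOSS` dictionary is a hand-written assumption covering only the tags in this output; it is not something NLTK returns.

```python
from collections import defaultdict

# Hand-written glosses for the Penn Treebank tags seen in the output above
TAG_GLOSS = {
    "PRP": "personal pronoun", "PRP$": "possessive pronoun",
    "VBD": "verb, past tense", "VBG": "verb, gerund or present participle",
    "NN": "noun, singular or mass", "NNS": "noun, plural",
    "IN": "preposition or subordinating conjunction",
    "DT": "determiner", "CC": "coordinating conjunction",
    "RB": "adverb", "JJ": "adjective", ".": "sentence-final punctuation",
}

# A few (token, tag) pairs copied from the pos_tag output
tagged = [("He", "PRP"), ("drummed", "VBD"), ("his", "PRP$"), ("fingers", "NNS"),
          ("eyes", "NNS"), ("fell", "VBD"), ("standing", "VBG"), ("close", "JJ")]

# Collect the tokens that share each tag, preserving sentence order
by_tag = defaultdict(list)
for token, tag in tagged:
    by_tag[tag].append(token)

for tag, tokens in by_tag.items():
    print(f"{tag:5} {TAG_GLOSS[tag]:40} {tokens}")
```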

In [41]:
# Reduce each token to its Porter stem; stems need not be dictionary words
porter = PorterStemmer()
for word in words_1:
    print(porter.stem(word))

he
drum
hi
finger
on
the
steer
wheel
and
hi
eye
fell
on
a
huddl
of
these
weirdo
stand
quit
close
by
.


In [51]:
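lemmatizer = WordNetLemmatizer()
for word in words_1:
    # 'v' makes WordNet treat every token as a verb, so a plural noun such as 'weirdos' is left unchanged
    print(lemmatizer.lemmatize(word, 'v'))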

He
drum
his
finger
on
the
steer
wheel
and
his
eye
fell
on
a
huddle
of
these
weirdos
stand
quite
close
by
.
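Passing `'v'` for every token, as above, lemmatizes each word as if it were a verb. A common alternative is to derive the WordNet part of speech from each token's Penn Treebank tag. The sketch below shows only that mapping step; `penn_to_wordnet` is a hypothetical helper, and the single-letter codes are the values WordNet uses for noun, verb, adjective, and adverb.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS code, defaulting to noun."""
    if tag.startswith("JJ"):
        return "a"  # adjective
    if tag.startswith("VB"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # noun, also used as the fallback for all other tags

# (token, tag) pairs copied from the pos_tag output earlier in the notebook
tagged = [("drummed", "VBD"), ("fingers", "NNS"), ("weirdos", "NNS"),
          ("standing", "VBG"), ("quite", "RB"), ("close", "JJ")]

pairs = [(token, penn_to_wordnet(tag)) for token, tag in tagged]
print(pairs)
```

Each pair could then be passed on as `lemmatizer.lemmatize(token, pos)`, so that `'fingers'` and `'weirdos'` are looked up as nouns rather than verbs.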
