# Introduction

Process a text using the [NLTK](https://www.nltk.org/) or [spaCy](https://spacy.io/) library (ideally both). In particular, do the following:
- Load the `harry_potter` book. You can find this text corpus in the datasets folder.
- Segment the text of the book into sentences. How many sentences does this book have?
- Compute the frequency of each token in the book. What are the most frequent tokens?
- Choose a sentence from the book. Analyze this chosen sentence by
    - Calculating all [n-grams](https://en.wikipedia.org/wiki/N-gram).
    - Finding [POS tags](https://en.wikipedia.org/wiki/Part-of-speech_tagging) of tokens.
    - [Stemming](https://en.wikipedia.org/wiki/Stemming) and [lemmatizing](https://en.wikipedia.org/wiki/Lemmatisation) tokens.
- Check the documentation to identify the most important hyperparameters, attributes, and methods. Use them in practice.

# **import libraries**

In [None]:
import nltk
# resources used below: sentence/word tokenizer, POS tagger, lemmatizer data
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to /root/nltk_data...


True

# **load the book**

In [None]:
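# read the whole book; the file is closed automatically
# (encoding is an assumption: the file is taken to be UTF-8)
with open('/content/harry_potter.txt', encoding='utf-8') as f:
    text = f.read()
print(text[:1000])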

CHAPTER ONE THE BOY WHO LIVED 

Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say that they were perfectly normal, thank you very much. They were the last people you'd expect to be involved in anything strange or mysterious, because they just didn't hold with such nonsense. 

Mr. Dursley was the director of a firm called Grunnings, which made drills. He was a big, beefy man with hardly any neck, although he did have a very large mustache. Mrs. Dursley was thin and blonde and had nearly twice the usual amount of neck, which came in very useful as she spent so much of her time craning over garden fences, spying on the neighbors. The Dursleys had a small son called Dudley and in their opinion there was no finer boy anywhere. 

The Dursleys had everything they wanted, but they also had a secret, and their greatest fear was that somebody would discover it. They didn't think they could bear it if anyone found out about the Potters. Mrs. Potter was Mrs. Dursley's sister, b

# **sentence segmentation**

In [None]:
nltk_sentences=nltk.sent_tokenize(text)
print('number of sentences',len(nltk_sentences))

number of sentences 6394


# **Compute the frequency**

In [None]:
from collections import Counter

# count every token (punctuation included) across all sentences
all_tokens = Counter()
for s in nltk_sentences:
    all_tokens.update(nltk.word_tokenize(s))

# print the 20 most frequent tokens with their counts
for t, count in all_tokens.most_common(20):
    print(t, count)

, 5658
. 5119
the 3310
'' 2441
`` 2307
to 1845
and 1804
a 1578
Harry 1323
was 1253
of 1242
he 1208
's 997
in 933
I 919
it 897
his 896
you 837
n't 826
said 793
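
Punctuation and quote tokens dominate the raw counts above. As a minimal sketch (not part of the original cell), the counts could be restricted to alphabetic tokens and made case-insensitive. The helper name `word_frequencies` and the toy token list are illustrative assumptions; note that `str.isalpha` also drops clitic tokens such as `'s` and `n't`.

```python
from collections import Counter

def word_frequencies(tokens):
    """Count alphabetic tokens only, ignoring case (hypothetical helper)."""
    return Counter(t.lower() for t in tokens if t.isalpha())

# Toy token list standing in for the output of nltk.word_tokenize
tokens = ["The", "Dursleys", "had", "a", "small", "son", ",", "and",
          "the", "Potters", "had", "a", "son", ",", "too", "."]

freq = word_frequencies(tokens)
print(freq.most_common(3))
```

On the full book, the same helper could be fed the tokens of every sentence, for example `word_frequencies(t for s in nltk_sentences for t in nltk.word_tokenize(s))`.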


# **Calculating all N-grams.**

In [None]:
# select one sentence
sentence = nltk_sentences[10]
print(sentence)
# tokenize the selected sentence
sentence_tokens = nltk.tokenize.word_tokenize(sentence)
# compute the trigrams (n-grams with n=3)
ngrams = list(nltk.ngrams(sentence_tokens, 3))
ngrams


The Dursleys knew that the Potters had a small son, too, but they had never even seen him.


[('The', 'Dursleys', 'knew'),
 ('Dursleys', 'knew', 'that'),
 ('knew', 'that', 'the'),
 ('that', 'the', 'Potters'),
 ('the', 'Potters', 'had'),
 ('Potters', 'had', 'a'),
 ('had', 'a', 'small'),
 ('a', 'small', 'son'),
 ('small', 'son', ','),
 ('son', ',', 'too'),
 (',', 'too', ','),
 ('too', ',', 'but'),
 (',', 'but', 'they'),
 ('but', 'they', 'had'),
 ('they', 'had', 'never'),
 ('had', 'never', 'even'),
 ('never', 'even', 'seen'),
 ('even', 'seen', 'him'),
 ('seen', 'him', '.')]
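
The cell above lists only the trigrams, while the task asks for all n-grams. A rough sketch of one way to cover every n from 1 up to the sentence length is shown here in plain Python, so it does not depend on any NLTK resource. The function names `ngrams_of` and `all_ngrams` and the shortened token list are assumptions made for illustration.

```python
def ngrams_of(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))

def all_ngrams(tokens):
    """Map each n (1..len(tokens)) to the list of n-grams of that size."""
    return {n: ngrams_of(tokens, n) for n in range(1, len(tokens) + 1)}

# Shortened token list taken from the end of the chosen sentence
tokens = ["they", "had", "never", "even", "seen", "him", "."]

grams = all_ngrams(tokens)
for n, items in grams.items():
    print(n, len(items))
```

A sentence with k tokens yields k - n + 1 n-grams for each n, so the number of n-grams shrinks by one each time n grows by one.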

# **POS Tagging**

In [None]:
print(sentence)
tags=nltk.pos_tag(sentence_tokens)
tags

The Dursleys knew that the Potters had a small son, too, but they had never even seen him.


[('The', 'DT'),
 ('Dursleys', 'NNP'),
 ('knew', 'VBD'),
 ('that', 'IN'),
 ('the', 'DT'),
 ('Potters', 'NNPS'),
 ('had', 'VBD'),
 ('a', 'DT'),
 ('small', 'JJ'),
 ('son', 'NN'),
 (',', ','),
 ('too', 'RB'),
 (',', ','),
 ('but', 'CC'),
 ('they', 'PRP'),
 ('had', 'VBD'),
 ('never', 'RB'),
 ('even', 'RB'),
 ('seen', 'VBN'),
 ('him', 'PRP'),
 ('.', '.')]
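
To summarize a tagged sentence, the tags can be tallied. The sketch below copies the (token, tag) pairs from the output above into a literal list so that it runs on its own; in the notebook, the `tags` variable could be used instead.

```python
from collections import Counter

# (token, tag) pairs copied from the pos_tag output above
tagged = [("The", "DT"), ("Dursleys", "NNP"), ("knew", "VBD"), ("that", "IN"),
          ("the", "DT"), ("Potters", "NNPS"), ("had", "VBD"), ("a", "DT"),
          ("small", "JJ"), ("son", "NN"), (",", ","), ("too", "RB"),
          (",", ","), ("but", "CC"), ("they", "PRP"), ("had", "VBD"),
          ("never", "RB"), ("even", "RB"), ("seen", "VBN"), ("him", "PRP"),
          (".", ".")]

# Count how often each tag occurs in the sentence
tag_counts = Counter(tag for _, tag in tagged)
print(tag_counts.most_common(3))
```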

# **Stemming**

In [None]:
# reuse the sentence selected above and tokenize it
sentence = nltk_sentences[10]
sentence_tokens = nltk.tokenize.word_tokenize(sentence)
print(sentence)
print()
# apply the Porter stemmer to each token
porter = nltk.stem.PorterStemmer()
for word in sentence_tokens:
    print(word, '\t\t\t', porter.stem(word))

The Dursleys knew that the Potters had a small son, too, but they had never even seen him.

The 			 the
Dursleys 			 dursley
knew 			 knew
that 			 that
the 			 the
Potters 			 potter
had 			 had
a 			 a
small 			 small
son 			 son
, 			 ,
too 			 too
, 			 ,
but 			 but
they 			 they
had 			 had
never 			 never
even 			 even
seen 			 seen
him 			 him
. 			 .


# **Lemmatization**

In [None]:
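# without a POS argument, lemmatize() treats every token as a noun,
# so inflected verbs such as "knew" and "seen" come back unchanged
lemmatizer = nltk.stem.WordNetLemmatizer()
for t in sentence_tokens:
    print(t, '\t \t', lemmatizer.lemmatize(t))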

The 	 	 The
Dursleys 	 	 Dursleys
knew 	 	 knew
that 	 	 that
the 	 	 the
Potters 	 	 Potters
had 	 	 had
a 	 	 a
small 	 	 small
son 	 	 son
, 	 	 ,
too 	 	 too
, 	 	 ,
but 	 	 but
they 	 	 they
had 	 	 had
never 	 	 never
even 	 	 even
seen 	 	 seen
him 	 	 him
. 	 	 .
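
None of the tokens changes in the output above because `lemmatize` defaults to the noun part of speech. One possible refinement, sketched here under assumptions, is to pass a POS hint derived from the Penn Treebank tags produced by `nltk.pos_tag`. The helper names `penn_to_wordnet` and `lemmatize_tagged` are hypothetical; the single letters `n`, `v`, `a`, and `r` are the POS values that `WordNetLemmatizer.lemmatize` accepts through its `pos` argument.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter (noun by default)."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # noun and everything else

def lemmatize_tagged(tagged_tokens, lemmatizer):
    """Lemmatize (token, tag) pairs, passing a POS hint for each token.

    `lemmatizer` is any object with a lemmatize(word, pos=...) method,
    for example an nltk.stem.WordNetLemmatizer instance.
    """
    return [lemmatizer.lemmatize(tok, pos=penn_to_wordnet(tag))
            for tok, tag in tagged_tokens]

# Show the mapping for a few common tags
print([penn_to_wordnet(t) for t in ["VBD", "NN", "JJ", "RB"]])
```

With the objects defined in the cells above, this could be called as `lemmatize_tagged(tags, lemmatizer)`; with the verb hint, forms such as "knew" should then be reduced to their base form rather than left unchanged.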
