
# Introduction

You should process some texts using [NLTK](https://www.nltk.org/) or [spaCy](https://spacy.io/) libraries (ideally both). In particular, you should do the following:
- Load the `harry_potter` book. You can find this text corpus in the datasets folder.
- Segment the text of the book into sentences. How many sentences does this book have?
- Compute the frequency of each token in the book. What are the most frequent tokens?
- Choose a sentence from the book. Analyze this chosen sentence by
    - Calculating all [n-grams](https://en.wikipedia.org/wiki/N-gram).
    - Finding [POS tags](https://en.wikipedia.org/wiki/Part-of-speech_tagging) of tokens.
    - [Stemming](https://en.wikipedia.org/wiki/Stemming) and [lemmatizing](https://en.wikipedia.org/wiki/Lemmatisation) tokens.
- Check the documentation to identify the most important hyperparameters, attributes, and methods. Use them in practice.

# Import Libraries:

 - nltk: Natural Language Toolkit (basic NLP tasks like tokenization, tagging, stemming).

 - spacy: Advanced NLP library for fast sentence parsing, POS tagging, etc.

 - textacy: Built on top of spaCy, helps with advanced tasks like n-gram extraction.

 - nlp = spacy.load("en_core_web_sm"): Loads the English model into nlp.

In [75]:
import nltk
import spacy
import textacy
nlp = spacy.load("en_core_web_sm")


# Loading the File
Opens the file and reads the full Harry Potter text into the variable text.

In [84]:
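# Read the whole book into one string; the context manager closes the file afterwards
with open("/content/harry_potter.txt") as f:
    text = f.read()
print(text[:1000])  # Preview the first 1000 characters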

CHAPTER ONE THE BOY WHO LIVED 

Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say that they were perfectly normal, thank you very much. They were the last people you'd expect to be involved in anything strange or mysterious, because they just didn't hold with such nonsense. 

Mr. Dursley was the director of a firm called Grunnings, which made drills. He was a big, beefy man with hardly any neck, although he did have a very large mustache. Mrs. Dursley was thin and blonde and had nearly twice the usual amount of neck, which came in very useful as she spent so much of her time craning over garden fences, spying on the neighbors. The Dursleys had a small son called Dudley and in their opinion there was no finer boy anywhere. 

The Dursleys had everything they wanted, but they also had a secret, and their greatest fear was that somebody would discover it. They didn't think they could bear it if anyone found out about the Potters. Mrs. Potter was Mrs. Dursley's sister, b

# Segment the Text into Sentences

nltk.sent_tokenize: Splits the text into sentences using NLTK's Punkt tokenizer. It relies mainly on punctuation and on abbreviation statistics, so it can still break a sentence in the wrong place.

In [64]:

nltk_Sentences = nltk.sent_tokenize(text)
len(nltk_Sentences)


6394

nlp(text).sents: spaCy also segments sentences. Its default English pipeline derives sentence boundaries from the dependency parse, so it uses grammatical structure rather than punctuation alone and often gives cleaner splits.

In [65]:
doc = nlp(text)
spacy_sentences = list(doc.sents)
len(spacy_sentences)

6232
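
To see why sentence segmentation needs more than a punctuation rule, here is a minimal sketch of a naive splitter. It is not what NLTK or spaCy do internally; `naive_split` is a hypothetical helper that breaks after every `.`, `!`, or `?` followed by whitespace, applied to two sentences adapted from the book's opening.

```python
import re

def naive_split(text):
    """Hypothetical helper: split after ., ! or ? when followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

sample = (
    "Mr. Dursley was the director of a firm called Grunnings, which made drills. "
    "He was a big, beefy man with hardly any neck."
)

pieces = naive_split(sample)
for p in pieces:
    print(repr(p))
```

The abbreviation Mr. is treated as a sentence end, so two sentences become three pieces. Different handling of cases like this is a plausible reason why the two libraries report different sentence counts.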

# Word Tokenization


In [66]:
from nltk.tokenize import word_tokenize  # Splits a sentence into word tokens

tokens = {}  # Maps each token to its frequency

for s in nltk_Sentences:  # Go through each sentence, then each token in that sentence
    sentence_tokens = word_tokenize(s)
    for t in sentence_tokens:
        if t not in tokens:  # First occurrence: start the count at 0
            tokens[t] = 0
        tokens[t] += 1

# Sort tokens by frequency and keep the 20 most common
frequent_tokens = sorted(tokens, key=tokens.get, reverse=True)[:20]
for t in frequent_tokens:
    print(t, "\t\t", tokens[t])


, 		 5658
. 		 5119
the 		 3310
'' 		 2441
`` 		 2307
to 		 1845
and 		 1804
a 		 1578
Harry 		 1323
was 		 1253
of 		 1242
he 		 1208
's 		 997
in 		 933
I 		 919
it 		 897
his 		 896
you 		 837
n't 		 826
said 		 793
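
The counting loop above can also be written with `collections.Counter` from the standard library. The sketch below assumes a small hand-made token list (`sample_tokens`, not taken from the book) and additionally drops tokens that contain no letters or digits, since punctuation and quote marks dominate the top of the list above. The filtering rule is an assumption about what counts as a word, not part of the original exercise.

```python
from collections import Counter

# Hypothetical token list standing in for the output of word_tokenize
sample_tokens = ["Harry", "said", ",", "``", "the", "owl", "''", ".",
                 "the", "owl", "was", "n't", "Harry", "'s", "."]

# Keep only tokens that contain at least one letter or digit
word_tokens = [t for t in sample_tokens if any(ch.isalnum() for ch in t)]

counts = Counter(word_tokens)
for token, freq in counts.most_common(3):
    print(token, "\t\t", freq)
```

On the real data, `Counter(word_tokens).most_common(20)` would replace the manual dictionary and the `sorted` call.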


# N-Gram Computation

ngrams: Generates sequences of n consecutive tokens. n = 2 gives bigrams (2-word pairs), e.g.:

("Harry", "Potter")

In [80]:
from nltk import ngrams, word_tokenize

nltk_sentence = nltk_Sentences[50]  # Select the 51st sentence from the list of sentences

sentence_tokens = word_tokenize(nltk_sentence)  # Split that sentence into individual tokens
bigrams = list(ngrams(sentence_tokens, 2))  # Build bigrams: pairs of adjacent tokens

print(nltk_sentence)
bigrams


He didn't know why, but they made him uneasy.


[('He', 'did'),
 ('did', "n't"),
 ("n't", 'know'),
 ('know', 'why'),
 ('why', ','),
 (',', 'but'),
 ('but', 'they'),
 ('they', 'made'),
 ('made', 'him'),
 ('him', 'uneasy'),
 ('uneasy', '.')]

In [82]:
spacy_sentence = spacy_sentences[100]  # Pick the 101st sentence from the spaCy sentence list
sentence_doc = nlp(spacy_sentence.text)

# Extract bigrams with textacy; filter_stops=False keeps stopwords in the result.
# The result gets its own name so that nltk's ngrams function is not overwritten.
spacy_bigrams = list(textacy.extract.basics.ngrams(sentence_doc, 2, filter_stops=False))

print(spacy_sentence)
spacy_bigrams


Going to be any more showers of owls tonight, Jim?" 




[Going to,
 to be,
 be any,
 any more,
 more showers,
 showers of,
 of owls,
 owls tonight]
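
Both cells above compute only bigrams. Here is a minimal sketch for collecting n-grams of several sizes at once, using plain Python rather than NLTK or textacy; `all_ngrams` and its `max_n` parameter are hypothetical names, and the token list is copied from the NLTK output above.

```python
def all_ngrams(tokens, max_n):
    """Hypothetical helper: return {n: list of n-grams} for n = 1..max_n."""
    result = {}
    for n in range(1, max_n + 1):
        # Slide a window of width n over the token list
        result[n] = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return result

tokens = ["He", "did", "n't", "know", "why", ",", "but",
          "they", "made", "him", "uneasy", "."]

grams = all_ngrams(tokens, 3)
for n, items in grams.items():
    print(n, len(items), items[0])
```

A sentence with k tokens yields k - n + 1 n-grams of size n, so the 12 tokens here give 12 unigrams, 11 bigrams, and 10 trigrams.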

# POS Tagging (Part-of-Speech)

POS tagging labels each word with its grammatical role, such as noun, verb, or adjective.

In [69]:
import nltk
nltk.download('averaged_perceptron_tagger_eng')


[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!


True

Print each token with its tag, using NLTK.




In [70]:
pos_tags = nltk.pos_tag(sentence_tokens)
for t, tag in pos_tags:
    print(t, "\t\t", tag)


It 		 PRP
was 		 VBD
now 		 RB
sitting 		 VBG
on 		 IN
his 		 PRP$
garden 		 NN
wall 		 NN
. 		 .


POS tagging using spaCy: `t.pos_` holds spaCy's coarse-grained POS tag for each token (like NOUN, VERB, ADJ).

In [71]:
print(spacy_sentence)

for t in sentence_doc:
    print(t.text, "\t\t", t.pos_)


Going to be any more showers of owls tonight, Jim?" 


Going 		 VERB
to 		 PART
be 		 AUX
any 		 PRON
more 		 ADJ
showers 		 NOUN
of 		 ADP
owls 		 NOUN
tonight 		 NOUN
, 		 PUNCT
Jim 		 PROPN
? 		 PUNCT
" 		 PUNCT


 		 SPACE


# Stemming

Reduces each word to its stem by stripping suffixes with fixed rules. The stem is not always a real word: for example, the Porter stemmer turns happily into happili.



In [72]:
from nltk.stem import PorterStemmer

print(nltk_sentence)

porter = PorterStemmer()
for t in sentence_tokens:
    print(t, "\t\t", porter.stem(t))


It was now sitting on his garden wall.
It 		 it
was 		 wa
now 		 now
sitting 		 sit
on 		 on
his 		 hi
garden 		 garden
wall 		 wall
. 		 .
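
The stems wa and hi in the output above follow from the first rule group of the Porter algorithm, which strips a trailing s. Below is a minimal sketch of that rule group alone (step 1a of the original algorithm), written in plain Python. `porter_step_1a` is a hypothetical helper and not NLTK's implementation, which applies several further steps.

```python
def porter_step_1a(word):
    """Hypothetical helper: apply only step 1a of the original Porter rules."""
    word = word.lower()
    if word.endswith("sses"):   # caresses -> caress
        return word[:-2]
    if word.endswith("ies"):    # ponies -> poni
        return word[:-2]
    if word.endswith("ss"):     # caress -> caress (unchanged)
        return word
    if word.endswith("s"):      # cats -> cat, but also was -> wa
        return word[:-1]
    return word

for w in ["was", "his", "caresses", "ponies", "wall"]:
    print(w, "\t\t", porter_step_1a(w))
```

The rule has no dictionary to consult, so it cannot tell the plural s in cats from the final s in was or his.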


# Lemmatization

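Like stemming, lemmatization reduces a word to a base form, but the result (the lemma) is meant to be a valid dictionary word.

NLTK's `WordNetLemmatizer` treats every word as a noun unless a POS is passed, so results may be less accurate: in the output below sitting stays sitting, whereas `lemmatize("sitting", pos="v")` returns sit.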

In [73]:

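from nltk.stem import WordNetLemmatizer  # Needs the NLTK 'wordnet' data package

print(nltk_sentence)

lemmatizer = WordNetLemmatizer()  # Treats every token as a noun by default
for t in sentence_tokens:
    print(t, "\t\t", lemmatizer.lemmatize(t))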


It was now sitting on his garden wall.
It 		 It
was 		 wa
now 		 now
sitting 		 sitting
on 		 on
his 		 his
garden 		 garden
wall 		 wall
. 		 .
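
Because `lemmatize` treats every token as a noun unless told otherwise, one common workaround is to pass the POS tag along. Below is a minimal sketch of the tag conversion, assuming Penn Treebank tags like those printed by `nltk.pos_tag` earlier. `penn_to_wordnet` is a hypothetical helper, and the returned letters are the values WordNet uses for noun, verb, adjective, and adverb.

```python
def penn_to_wordnet(tag):
    """Hypothetical helper: map a Penn Treebank tag to a WordNet POS letter."""
    if tag.startswith("JJ"):
        return "a"  # adjective
    if tag.startswith("VB"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # everything else falls back to noun, like lemmatize() itself

# Tagged tokens copied from the NLTK POS tagging output
pos_tags = [("It", "PRP"), ("was", "VBD"), ("now", "RB"), ("sitting", "VBG"),
            ("on", "IN"), ("his", "PRP$"), ("garden", "NN"), ("wall", "NN"), (".", ".")]

for token, tag in pos_tags:
    print(token, "\t\t", penn_to_wordnet(tag))
```

The letter can then be passed as `lemmatizer.lemmatize(token, pos=penn_to_wordnet(tag))`, so that verb forms such as sitting are looked up as verbs rather than nouns.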


spaCy uses the context and the predicted POS tag of each token, so its lemmas are usually more accurate (below, Going becomes go).

In [74]:
print(spacy_sentence)

for t in sentence_doc:
    print(t.text, "\t\t", t.lemma_ )

Going to be any more showers of owls tonight, Jim?" 


Going 		 go
to 		 to
be 		 be
any 		 any
more 		 more
showers 		 shower
of 		 of
owls 		 owl
tonight 		 tonight
, 		 ,
Jim 		 Jim
? 		 ?
" 		 "


 		 


