<a href="https://colab.research.google.com/github/Z4HRA-S/NLP_Course_Spring2023/blob/main/NLTK_nlp_spring2023.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to NLTK 
### Spring 2023 | Session 1

The goal of this session is to get familiar with the `NLTK` package and some of the text pre-processing steps. 
We will cover the following concepts: 
* Tokenization
* N-grams
* Stemming
* Lemmatizing
* POS tagging

In [16]:

my_string = """Tolkien's fantasy books on Middle-earth, especially The Lord of the Rings and The Silmarillion,
 drew on a wide array of influences including his philological interest in language,
 Christianity, medievalism, mythology, archaeology, ancient and modern literature,
  and personal experience. His philological work centred on the study of Old English
   literature, especially Beowulf, and he acknowledged its importance to his writings.
   He was a gifted linguist, influenced by Germanic, Celtic, Finnish, and Greek language and mythology. """

We will first try two of Python's built-in string methods, `split` and `replace`.

In [17]:
my_string.split("\n")

["Tolkien's fantasy books on Middle-earth, especially The Lord of the Rings and The Silmarillion,",
 ' drew on a wide array of influences including his philological interest in language,',
 ' Christianity, medievalism, mythology, archaeology, ancient and modern literature,',
 '  and personal experience. His philological work centred on the study of Old English',
 '   literature, especially Beowulf, and he acknowledged its importance to his writings.',
 '   He was a gifted linguist, influenced by Germanic, Celtic, Finnish, and Greek language and mythology. ']

In [None]:
my_string.split(" ")
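
Splitting on a single space keeps newline characters attached to the words and yields empty strings wherever two spaces are adjacent. Here is a minimal sketch of the difference, using a short hypothetical `sample` string rather than `my_string`:

```python
# Hypothetical sample: a newline followed by two spaces, as in my_string.
sample = "modern literature,\n  and personal experience."

# split(" ") cuts at every single space: the newline stays attached and
# the two consecutive spaces produce an empty string.
by_space = sample.split(" ")
print(by_space)

# split() with no argument cuts on any run of whitespace.
by_whitespace = sample.split()
print(by_whitespace)
```

Calling `split()` without an argument is usually the more convenient choice when the text contains line breaks or indentation.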

In [19]:
print(my_string.replace("he","***Tolkien***"))

Tolkien's fantasy books on Middle-earth, especially T***Tolkien*** Lord of t***Tolkien*** Rings and T***Tolkien*** Silmarillion,
 drew on a wide array of influences including his philological interest in language,
 Christianity, medievalism, mythology, archaeology, ancient and modern literature,
  and personal experience. His philological work centred on t***Tolkien*** study of Old English
   literature, especially Beowulf, and ***Tolkien*** acknowledged its importance to his writings.
   He was a gifted linguist, influenced by Germanic, Celtic, Finnish, and Greek language and mythology. 
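
`str.replace` matches raw substrings, which is why "The" and "the" were altered in the output above while the capitalized "He" was left alone. One way to replace only the whole word, in either case, is a regular expression with word boundaries. This is a sketch using the standard `re` module on a hypothetical `sentence`:

```python
import re

sentence = "He studied Beowulf, and he acknowledged the importance of the poem."

# \b marks a word boundary, so the "he" inside "the" is not touched;
# re.IGNORECASE also matches the capitalized "He".
result = re.sub(r"\bhe\b", "***Tolkien***", sentence, flags=re.IGNORECASE)
print(result)
```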


### NLTK

In [8]:
import nltk
# Tokenizer models used by word_tokenize; recent NLTK releases may ask for "punkt_tab" instead.
nltk.download("punkt")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

## Tokenization

A definition of a token, from the Stanford NLP group's [*Introduction to Information Retrieval*](https://nlp.stanford.edu/IR-book/html/htmledition/tokenization-1.html): "Tokens are often loosely referred to as terms or words, but it is sometimes important to make a type/token distinction. A token is an instance of a sequence of characters in some particular document that are grouped together as a useful semantic unit for processing."

What is Tokenization? An answer by [Anni Burchfiel](https://www.tokenex.com/blog/ab-what-is-nlp-natural-language-processing-tokenization/): "Tokenization is used in natural language processing to split paragraphs and sentences into smaller units that can be more easily assigned meaning. The first step of the NLP process is gathering the data (a sentence) and breaking it into understandable parts (words)."
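
To see why a tokenizer is more than whitespace splitting, the sketch below compares `str.split` with a small regular expression that separates punctuation from words. The pattern is a simplified illustration written for this example, not the rule set NLTK uses.

```python
import re

text = "Tolkien's books on Middle-earth, especially Beowulf."

# Whitespace splitting leaves punctuation glued to the words.
whitespace_tokens = text.split()
print(whitespace_tokens)

# Simplified pattern: a possessive 's, a word (optionally hyphenated),
# or any single symbol that is neither a word character nor whitespace.
pattern = r"'s\b|\w+(?:-\w+)*|[^\w\s]"
simple_tokens = re.findall(pattern, text)
print(simple_tokens)
```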

In [20]:
from nltk.tokenize import word_tokenize

tokens = word_tokenize(my_string)
print(tokens)

['Tolkien', "'s", 'fantasy', 'books', 'on', 'Middle-earth', ',', 'especially', 'The', 'Lord', 'of', 'the', 'Rings', 'and', 'The', 'Silmarillion', ',', 'drew', 'on', 'a', 'wide', 'array', 'of', 'influences', 'including', 'his', 'philological', 'interest', 'in', 'language', ',', 'Christianity', ',', 'medievalism', ',', 'mythology', ',', 'archaeology', ',', 'ancient', 'and', 'modern', 'literature', ',', 'and', 'personal', 'experience', '.', 'His', 'philological', 'work', 'centred', 'on', 'the', 'study', 'of', 'Old', 'English', 'literature', ',', 'especially', 'Beowulf', ',', 'and', 'he', 'acknowledged', 'its', 'importance', 'to', 'his', 'writings', '.', 'He', 'was', 'a', 'gifted', 'linguist', ',', 'influenced', 'by', 'Germanic', ',', 'Celtic', ',', 'Finnish', ',', 'and', 'Greek', 'language', 'and', 'mythology', '.']


In [21]:
tokens = [t.lower() for t in tokens]
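
The type/token distinction from the definition above can be made concrete by counting. Tokens are all the items in the list, and types are the distinct ones. Lowercasing matters because it merges forms such as "The" and "the" into a single type. Here is a minimal sketch on a hypothetical short token list:

```python
from collections import Counter

sample_tokens = ["The", "Lord", "of", "the", "Rings", "and", "The", "Silmarillion"]

# Tokens are occurrences; types are distinct forms.
n_tokens = len(sample_tokens)
n_types_cased = len(set(sample_tokens))
print(n_tokens, n_types_cased)

# After lowercasing, "The" and "the" count as the same type.
lowered = [t.lower() for t in sample_tokens]
type_counts = Counter(lowered)
print(len(lowered), len(type_counts))
print(type_counts.most_common(1))
```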

### N-grams
An N-gram is a contiguous slice of N items taken from a text sequence. N is the slice size, and the items can be tokens, characters, words, and so on.
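
For intuition, N-grams without padding can be built in plain Python by sliding a window of size N over the sequence. The helper name `simple_ngrams` is made up for this sketch; the NLTK function is used in the rest of this section.

```python
def simple_ngrams(sequence, n):
    """Return all contiguous n-item slices of sequence as tuples (no padding)."""
    items = list(sequence)
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]

words = ["the", "lord", "of", "the", "rings"]
bigrams = simple_ngrams(words, 2)
trigrams = simple_ngrams(words, 3)
print(bigrams)
print(trigrams)
```

A sequence of length L yields L - N + 1 unpadded N-grams, and none at all when N is larger than L.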

In [22]:
from nltk.util import ngrams

# Build 4-grams, padding both ends with '_' so that the first and last tokens appear in every position.
list(ngrams(tokens, 4, pad_left=True, pad_right=True, left_pad_symbol='_', right_pad_symbol='_'))

[('_', '_', '_', 'tolkien'),
 ('_', '_', 'tolkien', "'s"),
 ('_', 'tolkien', "'s", 'fantasy'),
 ('tolkien', "'s", 'fantasy', 'books'),
 ("'s", 'fantasy', 'books', 'on'),
 ('fantasy', 'books', 'on', 'middle-earth'),
 ('books', 'on', 'middle-earth', ','),
 ('on', 'middle-earth', ',', 'especially'),
 ('middle-earth', ',', 'especially', 'the'),
 (',', 'especially', 'the', 'lord'),
 ('especially', 'the', 'lord', 'of'),
 ('the', 'lord', 'of', 'the'),
 ('lord', 'of', 'the', 'rings'),
 ('of', 'the', 'rings', 'and'),
 ('the', 'rings', 'and', 'the'),
 ('rings', 'and', 'the', 'silmarillion'),
 ('and', 'the', 'silmarillion', ','),
 ('the', 'silmarillion', ',', 'drew'),
 ('silmarillion', ',', 'drew', 'on'),
 (',', 'drew', 'on', 'a'),
 ('drew', 'on', 'a', 'wide'),
 ('on', 'a', 'wide', 'array'),
 ('a', 'wide', 'array', 'of'),
 ('wide', 'array', 'of', 'influences'),
 ('array', 'of', 'influences', 'including'),
 ('of', 'influences', 'including', 'his'),
 ('influences', 'including', 'his', 'philological

### Stemming

Stemming is a process by which word endings or other affixes are removed or modified in order that word forms which differ in non-relevant ways may be merged and treated as equivalent. [link](https://link.springer.com/referenceworkentry/10.1007/978-1-4899-7993-3_942-2)
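
The idea of merging word forms by trimming endings can be shown with a deliberately naive suffix stripper. This is not the Porter, Lancaster, or Snowball algorithm. The suffix list, the minimum stem length, and the `naive_stem` name are all assumptions made for illustration.

```python
# Hypothetical suffix list, checked in this order (longest first).
SUFFIXES = ("ings", "ing", "ed", "es", "s")

def naive_stem(word, min_stem_length=3):
    """Strip the first matching suffix if enough of the word remains."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_length:
            return word[:-len(suffix)]
    return word

for w in ["books", "influences", "influenced", "writings", "his"]:
    print(w, "->", naive_stem(w))
```

Both "influences" and "influenced" collapse to the same string, which is the merging effect described above. Real stemmers apply ordered rewrite rules with extra conditions rather than a flat suffix list, so their output differs from this sketch.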

In [23]:
from nltk.stem import (PorterStemmer, LancasterStemmer)
from nltk.stem.snowball import SnowballStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()
snowball = SnowballStemmer("english")

print([porter.stem(t) for t in tokens])
print([lancaster.stem(t) for t in tokens])
print([snowball.stem(t) for t in tokens])

['tolkien', "'s", 'fantasi', 'book', 'on', 'middle-earth', ',', 'especi', 'the', 'lord', 'of', 'the', 'ring', 'and', 'the', 'silmarillion', ',', 'drew', 'on', 'a', 'wide', 'array', 'of', 'influenc', 'includ', 'hi', 'philolog', 'interest', 'in', 'languag', ',', 'christian', ',', 'mediev', ',', 'mytholog', ',', 'archaeolog', ',', 'ancient', 'and', 'modern', 'literatur', ',', 'and', 'person', 'experi', '.', 'hi', 'philolog', 'work', 'centr', 'on', 'the', 'studi', 'of', 'old', 'english', 'literatur', ',', 'especi', 'beowulf', ',', 'and', 'he', 'acknowledg', 'it', 'import', 'to', 'hi', 'write', '.', 'he', 'wa', 'a', 'gift', 'linguist', ',', 'influenc', 'by', 'german', ',', 'celtic', ',', 'finnish', ',', 'and', 'greek', 'languag', 'and', 'mytholog', '.']
['tolky', "'s", 'fantasy', 'book', 'on', 'middle-earth', ',', 'espec', 'the', 'lord', 'of', 'the', 'ring', 'and', 'the', 'silmaril', ',', 'drew', 'on', 'a', 'wid', 'array', 'of', 'influ', 'includ', 'his', 'philolog', 'interest', 'in', 'langu

### Lemmatizing

Lemmatization is a process of determining a base or dictionary form (lemma) for a given surface form. Especially for languages with rich morphology it is important to be able to normalize words into their base forms to better support for example search engines and linguistic studies. [link](https://arxiv.org/abs/1902.00972)

In [24]:
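nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()

# pos="v" treats every token as a verb, so nouns and adjectives that look like
# verb forms are reduced too (for example 'writings' -> 'write', 'gifted' -> 'gift').
print([wnl.lemmatize(t, pos="v") for t in tokens])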

['tolkien', "'s", 'fantasy', 'book', 'on', 'middle-earth', ',', 'especially', 'the', 'lord', 'of', 'the', 'ring', 'and', 'the', 'silmarillion', ',', 'draw', 'on', 'a', 'wide', 'array', 'of', 'influence', 'include', 'his', 'philological', 'interest', 'in', 'language', ',', 'christianity', ',', 'medievalism', ',', 'mythology', ',', 'archaeology', ',', 'ancient', 'and', 'modern', 'literature', ',', 'and', 'personal', 'experience', '.', 'his', 'philological', 'work', 'centre', 'on', 'the', 'study', 'of', 'old', 'english', 'literature', ',', 'especially', 'beowulf', ',', 'and', 'he', 'acknowledge', 'its', 'importance', 'to', 'his', 'write', '.', 'he', 'be', 'a', 'gift', 'linguist', ',', 'influence', 'by', 'germanic', ',', 'celtic', ',', 'finnish', ',', 'and', 'greek', 'language', 'and', 'mythology', '.']


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


### POS
"Part-of-speech (POS) tagging, also called grammatical tagging, is the automatic assignment of part-of-speech tags to words in a sentence. A POS is a grammatical classification that commonly includes verbs, adjectives, adverbs, nouns, etc." [Chiche, A. 2022] [link](https://journalofbigdata.springeropen.com/articles/10.1186/s40537-022-00561-y) 


| Tag | Meaning | English Examples |
|-----|---------|------------------|
| ADJ | adjective | new, good, high, special, big, local |
| ADP | adposition | on, of, at, with, by, into, under |
| ADV | adverb | really, already, still, early, now |
| CONJ | conjunction | and, or, but, if, while, although |
| DET | determiner, article | the, a, some, most, every, no, which |
| NOUN | noun | year, home, costs, time, Africa |
| NUM | numeral | twenty-four, fourth, 1991, 14:24 |
| PRT | particle | at, on, out, over, per, that, up, with |
| PRON | pronoun | he, their, her, its, my, I, us |
| VERB | verb | is, say, told, given, playing, would |
| . | punctuation marks | . , ; ! |
| X | other | ersatz, esprit, dunno, gr8, univeristy |

In [25]:
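nltk.download("tagsets")
# Uncomment to print the full Penn Treebank tag list:
#nltk.help.upenn_tagset()
nltk.download('averaged_perceptron_tagger')
# pos_tag returns Penn Treebank tags (NN, VBD, JJ, ...) by default,
# which are finer-grained than the categories in the table above.
nltk.pos_tag(tokens)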

[nltk_data] Downloading package tagsets to /root/nltk_data...
[nltk_data]   Package tagsets is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


[('tolkien', 'NN'),
 ("'s", 'POS'),
 ('fantasy', 'JJ'),
 ('books', 'NNS'),
 ('on', 'IN'),
 ('middle-earth', 'NN'),
 (',', ','),
 ('especially', 'RB'),
 ('the', 'DT'),
 ('lord', 'NN'),
 ('of', 'IN'),
 ('the', 'DT'),
 ('rings', 'NNS'),
 ('and', 'CC'),
 ('the', 'DT'),
 ('silmarillion', 'NN'),
 (',', ','),
 ('drew', 'VBD'),
 ('on', 'IN'),
 ('a', 'DT'),
 ('wide', 'JJ'),
 ('array', 'NN'),
 ('of', 'IN'),
 ('influences', 'NNS'),
 ('including', 'VBG'),
 ('his', 'PRP$'),
 ('philological', 'JJ'),
 ('interest', 'NN'),
 ('in', 'IN'),
 ('language', 'NN'),
 (',', ','),
 ('christianity', 'NN'),
 (',', ','),
 ('medievalism', 'NN'),
 (',', ','),
 ('mythology', 'NN'),
 (',', ','),
 ('archaeology', 'NN'),
 (',', ','),
 ('ancient', 'NN'),
 ('and', 'CC'),
 ('modern', 'JJ'),
 ('literature', 'NN'),
 (',', ','),
 ('and', 'CC'),
 ('personal', 'JJ'),
 ('experience', 'NN'),
 ('.', '.'),
 ('his', 'PRP$'),
 ('philological', 'JJ'),
 ('work', 'NN'),
 ('centred', 'VBN'),
 ('on', 'IN'),
 ('the', 'DT'),
 ('study', 'NN')
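
The table above uses coarse categories, while the output of `nltk.pos_tag` uses finer-grained Penn Treebank tags. The sketch below groups a few Penn tags from that output into the table's categories. The `PENN_TO_COARSE` dictionary is a hand-written, partial mapping for illustration only; it is not NLTK's own mapping.

```python
from collections import Counter

# Hand-written partial mapping (an assumption for this example).
PENN_TO_COARSE = {
    "NN": "NOUN", "NNS": "NOUN",
    "VBD": "VERB", "VBG": "VERB", "VBN": "VERB",
    "JJ": "ADJ", "RB": "ADV", "IN": "ADP",
    "DT": "DET", "CC": "CONJ", "PRP$": "PRON",
    ",": ".", ".": ".",
}

# A few consecutive (token, tag) pairs copied from the output above.
tagged = [("drew", "VBD"), ("on", "IN"), ("a", "DT"), ("wide", "JJ"),
          ("array", "NN"), ("of", "IN"), ("influences", "NNS"), ("including", "VBG")]

# Tags missing from the mapping fall back to "X" (other).
coarse = [(token, PENN_TO_COARSE.get(tag, "X")) for token, tag in tagged]
counts = Counter(tag for _, tag in coarse)
print(coarse)
print(counts)
```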

If you want an informative description of each tag, you can use the code below.

In [15]:
nltk.help.upenn_tagset('VBN')

VBN: verb, past participle
    multihulled dilapidated aerosolized chaired languished panelized used
    experimented flourished imitated reunifed factored condensed sheared
    unsettled primed dubbed desired ...
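
The lemmatizer cell earlier passed `pos="v"` for every token. Penn tags can instead be used to choose the `pos` argument for each token. WordNet uses single-letter codes (`n`, `v`, `a`, `r`). The helper name `penn_to_wordnet_pos` and the fallback to noun are choices made for this sketch.

```python
def penn_to_wordnet_pos(penn_tag):
    """Map a Penn Treebank tag to a WordNet part-of-speech letter."""
    if penn_tag.startswith("VB"):
        return "v"  # verbs: VB, VBD, VBG, VBN, VBP, VBZ
    if penn_tag.startswith("JJ"):
        return "a"  # adjectives: JJ, JJR, JJS
    if penn_tag.startswith("RB"):
        return "r"  # adverbs: RB, RBR, RBS
    return "n"      # fallback: treat everything else as a noun

tagged = [("drew", "VBD"), ("wide", "JJ"), ("especially", "RB"), ("books", "NNS")]
mapped = [(token, penn_to_wordnet_pos(tag)) for token, tag in tagged]
print(mapped)
```

With the objects defined earlier, each pair could then be lemmatized with `wnl.lemmatize(token, pos=penn_to_wordnet_pos(tag))`.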


## Homework

1. Load the `Harry_potter_and_the_order_of_the_phoenix_chapter_1.txt` file from the `data` directory. Write a function that processes the whole text and returns one dictionary with each **verb's** root as the key and the list of positions where that **verb** occurs as the value. For example, for this sentence: 
 "*The injustice of it all burst inside him so that he wanted to yell with fury, and he wanted to tell the whole world.*" we would have this dictionary: 
```
{"want":[11,19], "yell":[13], "burst":[5],"tell":[21]}
```
Positions are zero-based token indices, and punctuation marks count as tokens.

2. Write another program that takes a verb from the user, looks it up in the dictionary you built in Q.1, and returns the sentences that contain the verb or another form of it.


3. Read about stop words and answer these questions: 
* What are stop words? 
* In which situations do they act as noise in the data, and when do they help produce better results?
* Can we remove stop words using NLTK? 

4. We have three texts: a selection of Shakespeare's poems (`shakespeare.txt`), a scientific article from Wikipedia (`cognitive_revolution.txt`), and one chapter of the Harry Potter book series (`Harry_potter_and_the_order_of_the_phoenix_chapter_1.txt`).

 We want to calculate the readability of these texts, that is, how easy each text is to understand. One of the most widely used readability measures is the [Flesch reading ease metric](https://readable.com/readability/flesch-reading-ease-flesch-kincaid-grade-level/), which is based on the average number of words per sentence and the average number of syllables per word. Its formula is: 
$206.835 - 1.015\left(\frac{\text{total words}}{\text{total sentences}}\right) - 84.6\left(\frac{\text{total syllables}}{\text{total words}}\right)$

Calculate the readability measure for these 3 texts.

Hint: The `nltk.tokenize.sent_tokenize` can be used for splitting the whole text into sentences. For `shakespeare.txt` you can use splitting by `\n` to have each line as a separate sentence. 
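
As a check on the arithmetic only, the formula can be written as a function of the three totals. Counting the words, sentences, and syllables is left to the exercise. The function name and the counts below are made up for illustration.

```python
def flesch_reading_ease(total_words, total_sentences, total_syllables):
    """Compute the Flesch reading ease score from raw counts."""
    words_per_sentence = total_words / total_sentences
    syllables_per_word = total_syllables / total_words
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word

# Hypothetical counts: 100 words, 5 sentences, 150 syllables.
score = flesch_reading_ease(100, 5, 150)
print(round(score, 3))
```

With these counts the two averages are 20 words per sentence and 1.5 syllables per word, so the score is 206.835 - 20.3 - 126.9.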


______________________


**Submit the code and the sample result for Q.1 and Q.2 and the result for Q.3 and Q.4 by 11-2-1402 at midnight. Please send an email to z.sarlak@iasbs.ac.ir with the title "NLTK_HW1".**