## NLTK Practice
- 1. Installing and Importing NLTK
- 2. Tokenization
- 3. Stemming
- 4. Lemmatization
---
### 1. Installing and Importing NLTK

In [42]:
# !pip install nltk

In [43]:
import nltk 

### 2. Tokenization
- Tokenization is the process of breaking down a text (like a sentence or paragraph) into smaller pieces called tokens. 
- It’s the first step in most NLP tasks (like translation, sentiment analysis, text classification).

#### 2.1 Turning CORPUS into DOCUMENTS

- If you have a corpus as a big chunk of text, you might want to split it into smaller pieces (the documents) so you can process or analyze them individually.
    - CORPUS ---> Paragraph 
    - DOCUMENTS ---> Sentences
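
To see why a dedicated sentence tokenizer is useful, here is a minimal sketch of a naive splitter built only on the standard `re` module. The function name `naive_sent_split` and the second sample sentence are illustrative assumptions, not part of NLTK.

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' whenever whitespace follows."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

print(naive_sent_split("I'm really excited! ready set go."))
# ["I'm really excited!", 'ready set go.']

print(naive_sent_split("Dr. Smith arrived. He was late."))
# ['Dr.', 'Smith arrived.', 'He was late.']  <- wrongly splits after the abbreviation
```

Abbreviations are the kind of case that trained sentence tokenizers, such as the Punkt model behind `sent_tokenize`, are designed to handle.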

In [44]:
corpus = """Hello welcome to my NLTK Prctice i.e., my rough work on nltk.
Let's explore what nltk can do.
I'm really excited! ready set go.
"""

In [45]:
print(corpus)

Hello welcome to my NLTK Prctice i.e., my rough work on nltk.
Let's explore what nltk can do.
I'm really excited! ready set go.



In [46]:
from nltk.tokenize import sent_tokenize
# nltk.download('punkt_tab')

In [47]:
documents = sent_tokenize(corpus)
documents

['Hello welcome to my NLTK Prctice i.e., my rough work on nltk.',
 "Let's explore what nltk can do.",
 "I'm really excited!",
 'ready set go.']

In [48]:
for sent in documents:
    print(sent)

Hello welcome to my NLTK Prctice i.e., my rough work on nltk.
Let's explore what nltk can do.
I'm really excited!
ready set go.


#### 2.2 Turning DOCUMENTS into WORDS
- Each document is a chunk of text, and word tokenization splits that text into individual words (tokens).

In [49]:
from nltk.tokenize import word_tokenize

In [50]:
word_tokenize(corpus)

['Hello',
 'welcome',
 'to',
 'my',
 'NLTK',
 'Prctice',
 'i.e.',
 ',',
 'my',
 'rough',
 'work',
 'on',
 'nltk',
 '.',
 'Let',
 "'s",
 'explore',
 'what',
 'nltk',
 'can',
 'do',
 '.',
 'I',
 "'m",
 'really',
 'excited',
 '!',
 'ready',
 'set',
 'go',
 '.']

In [51]:
for sent in documents:
    print(word_tokenize(sent))

['Hello', 'welcome', 'to', 'my', 'NLTK', 'Prctice', 'i.e.', ',', 'my', 'rough', 'work', 'on', 'nltk', '.']
['Let', "'s", 'explore', 'what', 'nltk', 'can', 'do', '.']
['I', "'m", 'really', 'excited', '!']
['ready', 'set', 'go', '.']


In [52]:
from nltk.tokenize import wordpunct_tokenize

wordpunct_tokenize(corpus) # splits on every punctuation run, so "i.e.," and "Let's" are broken into pieces

['Hello',
 'welcome',
 'to',
 'my',
 'NLTK',
 'Prctice',
 'i',
 '.',
 'e',
 '.,',
 'my',
 'rough',
 'work',
 'on',
 'nltk',
 '.',
 'Let',
 "'",
 's',
 'explore',
 'what',
 'nltk',
 'can',
 'do',
 '.',
 'I',
 "'",
 'm',
 'really',
 'excited',
 '!',
 'ready',
 'set',
 'go',
 '.']
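
The splits above follow from the regular expression that `wordpunct_tokenize` is built on. As a rough sketch, assuming the pattern documented for NLTK's `WordPunctTokenizer` (`\w+|[^\w\s]+`), the same behaviour can be reproduced with the standard `re` module. The helper name `wordpunct_sketch` is hypothetical.

```python
import re

# Runs of word characters, or runs of characters that are
# neither word characters nor whitespace (i.e. punctuation).
WORDPUNCT_PATTERN = r"\w+|[^\w\s]+"

def wordpunct_sketch(text):
    """Approximate wordpunct_tokenize with a single regex."""
    return re.findall(WORDPUNCT_PATTERN, text)

print(wordpunct_sketch("NLTK Prctice i.e., my rough work"))
# ['NLTK', 'Prctice', 'i', '.', 'e', '.,', 'my', 'rough', 'work']

print(wordpunct_sketch("Let's explore"))
# ['Let', "'", 's', 'explore']
```

Because adjacent punctuation characters form one run, `.,` comes out as a single token, while contractions are split at the apostrophe.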

In [53]:
from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()

tokenizer.tokenize(corpus)   # expects one sentence at a time: only the final full stop becomes its own token, earlier ones stay attached ('nltk.', 'do.')

['Hello',
 'welcome',
 'to',
 'my',
 'NLTK',
 'Prctice',
 'i.e.',
 ',',
 'my',
 'rough',
 'work',
 'on',
 'nltk.',
 'Let',
 "'s",
 'explore',
 'what',
 'nltk',
 'can',
 'do.',
 'I',
 "'m",
 'really',
 'excited',
 '!',
 'ready',
 'set',
 'go',
 '.']

### 3. Stemming
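- Stemming is the process of reducing a word to its **stem**, the base form to which affixes (prefixes and suffixes) attach. The dictionary form of a word is known as a **lemma**.
- Stemming is important in Natural Language Understanding (NLU) and Natural Language Processing (NLP).
- Stemming examples (the idealized goal; rule-based stemmers usually leave irregular forms such as "eaten" and "ran" unchanged)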
    - [eat, eating, eaten] --> eat (root word, stem word)
    - [running, run, ran] --> run  (root word, stem word)

In [54]:
words = ['playing', 'played', 'plays', 'flying', 'flies', 'cried', 'crying', 'happier', 'happyly', 'studies', 'studying']

#### 3.1 Porter Stemming
- The Porter Stemmer is a widely used algorithm in natural language processing (NLP) for word stemming, that is, reducing words to their base or root form.

In [55]:
from nltk.stem import PorterStemmer
stemming = PorterStemmer()

In [56]:
for word in words :
    print(f"{word} ---> {stemming.stem(word)}")
# some stems are not real words, e.g. [flying ---> fli], [crying ---> cri]

playing ---> play
played ---> play
plays ---> play
flying ---> fli
flies ---> fli
cried ---> cri
crying ---> cri
happier ---> happier
happyly ---> happyli
studies ---> studi
studying ---> studi


In [57]:
stemming.stem('Congratulations') 
# returns 'congratul', which is not a real word and loses the original meaning

'congratul'

In [58]:
print(stemming.stem('sitting')) # returns 'sit'
print(stemming.stem('ssitting')) # returns 'ssit': the rules are applied even to a misspelled word

# Stemmers never check whether the result is a real word; lemmatization helps with this problem

sit
ssit


#### 3.2 RegexpStemmer
- The RegexpStemmer (Regular Expression Stemmer) is a simple and customizable rule-based stemmer that removes suffixes from words using regular expressions.
- Unlike more complex stemmers like the PorterStemmer, which use rule sets and conditions, RegexpStemmer simply applies the regular expression you specify, which makes it very flexible but also very manual.
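
What a regular-expression stemmer does can be sketched with the standard `re` module alone: delete every match of the pattern, unless the word is shorter than a minimum length. This is a simplified stand-in under that assumption, not NLTK's own code, and the name `regexp_stem` is hypothetical.

```python
import re

def regexp_stem(word, pattern, min_len=0):
    """Delete every match of `pattern` from `word`; leave short words alone."""
    if len(word) < min_len:
        return word
    return re.sub(pattern, "", word)

anchored = r"ing$|s$|e$|able$"   # '$' ties each suffix to the end of the word

print(regexp_stem("eating", anchored, min_len=4))     # eat
print(regexp_stem("ingeating", anchored, min_len=4))  # ingeat
print(regexp_stem("ingeating", r"ing", min_len=4))    # eat (unanchored: both 'ing's removed)
print(regexp_stem("was", anchored, min_len=4))        # was (shorter than min_len)
```

The `$` anchor is what keeps the pattern from touching text in the middle or at the start of a word.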

In [59]:
from nltk.stem import RegexpStemmer

reg_stemmer = RegexpStemmer('ing$|s$|e$|able$', min=4)

In [60]:
print(reg_stemmer.stem('eating'))
print(reg_stemmer.stem('ingeating')) # returns 'ingeat' because '$' anchors 'ing' to the end of the word; with an unanchored 'ing', every 'ing' would be removed and the result would be 'eat'

eat
ingeat


In [61]:
for word in words :
    print(reg_stemmer.stem(word))

play
played
play
fly
flie
cried
cry
happier
happyly
studie
study


#### 3.3 Snowball Stemmer
-  The Snowball Stemmer uses an improved version of the original Porter algorithm (often called Porter2), which is generally considered more accurate.
-  Unlike the Porter Stemmer, which is designed for English, the Snowball Stemmer supports several languages.

In [62]:
from nltk.stem import SnowballStemmer

snowballstemmer = SnowballStemmer('english')

In [63]:
for word in words :
    print(f'{word} ---> {snowballstemmer.stem(word)}')

playing ---> play
played ---> play
plays ---> play
flying ---> fli
flies ---> fli
cried ---> cri
crying ---> cri
happier ---> happier
happyly ---> happyli
studies ---> studi
studying ---> studi


In [72]:
print('Porter : ' + stemming.stem('fairly'), stemming.stem('sportingly'))

print('Snowball : ' + snowballstemmer.stem('fairly'), snowballstemmer.stem('sportingly'))

Porter : fairli sportingli
Snowball : fair sport


### 4. Lemmatization

- Lemmatization is another text preprocessing technique in Natural Language Processing (NLP) that aims to reduce words to their base or dictionary form (called a lemma).
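- Unlike stemming, which often removes suffixes in a mechanical or rule-based manner, lemmatization takes into account the part of speech of a word (and, in some lemmatizers, its context). NLTK's WordNetLemmatizer looks the word up in WordNet using the part-of-speech tag you pass in.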
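
To illustrate why the part of speech matters, here is a toy sketch of a dictionary-based lemmatizer. The table `TOY_LEMMAS` is hand-written for illustration only and is not WordNet data; the function `toy_lemmatize` is hypothetical and only mimics the idea of a lookup keyed by word and part of speech.

```python
# Toy lemma table, hand-written for illustration only (not WordNet data).
TOY_LEMMAS = {
    ("meeting", "v"): "meet",
    ("better", "a"): "good",
    ("flies", "n"): "fly",
    ("flies", "v"): "fly",
}

def toy_lemmatize(word, pos="n"):
    """Return the lemma for (word, pos), or the word unchanged if not listed."""
    return TOY_LEMMAS.get((word, pos), word)

print(toy_lemmatize("meeting"))           # meeting (treated as a noun by default)
print(toy_lemmatize("meeting", pos="v"))  # meet
print(toy_lemmatize("better", pos="a"))   # good
```

The same surface form can map to different lemmas depending on the tag, which is why passing the right part of speech matters when lemmatizing.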

In [73]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

In [None]:
# nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\vaibh\AppData\Roaming\nltk_data...


True

#### 4.1 The Part-of-Speech Tag
Valid options for the `pos` argument are:
- "n" : nouns **(default)**
- "v" : verbs
- "a" : adjectives 
- "r" : adverbs 
- "s" : satellite adjectives

In [78]:
lemmatizer.lemmatize('going', pos = 'v')

'go'

In [79]:
for word in words:
    print(f"{word} ---> {lemmatizer.lemmatize(word, pos='v')}")  # double quotes outside so the f-string also parses before Python 3.12

playing ---> play
played ---> play
plays ---> play
flying ---> fly
flies ---> fly
cried ---> cry
crying ---> cry
happier ---> happier
happyly ---> happyly
studies ---> study
studying ---> study
