**Tokenization** is the process of splitting a text into smaller units called tokens, such as sentences, words, or punctuation marks. It is
usually the first step in preparing text data for further analysis.

**POS tagging** assigns part-of-speech tags to each word in a text. It is useful for understanding the grammatical structure and
meaning of the text.

**Stop words** are common words, such as "the", "a", and "an", that occur often in a text but carry little meaning on their own.
Removing stop words can help reduce noise in the text data.

**Stemming** is the process of reducing words to their base or root form. The result need not be a dictionary word: for example,
"application" becomes "applic". It helps reduce the dimensionality of the text data by grouping words with the same root. The
examples below use the Porter stemming algorithm.

**Lemmatization** is similar to stemming but aims to reduce words to their dictionary form, called a lemma. It takes the word's
part of speech into account, which generally gives more accurate base forms than stemming. The examples below use the WordNet
lemmatizer.

**Term Frequency** (TF) calculates the frequency of each word in a document. It represents how often a word appears in the
document.

**Inverse Document Frequency** (IDF) measures how informative a word is across a collection of documents. It down-weights words
that appear in many documents and gives more weight to rare words.
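
The two definitions above can be combined into a TF-IDF score. The following is a minimal sketch in plain Python, assuming TF is the raw count divided by the document length and IDF is `log(N / df)` without smoothing; libraries often use smoothed variants. The function names and the three toy documents are illustrative and not part of this notebook.

```python
import math
from collections import Counter

def term_frequency(tokens):
    """Relative frequency of each term within one tokenized document."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}

def inverse_document_frequency(documents):
    """log(N / df) per term, where df is the number of documents containing the term."""
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each term at most once per document
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}

# Toy corpus (hypothetical), already tokenized by whitespace
docs = [
    "django is a python based web framework".split(),
    "a framework is a collection of modules".split(),
    "python is a programming language".split(),
]

tf = term_frequency(docs[0])
idf = inverse_document_frequency(docs)
tfidf = {term: tf[term] * idf[term] for term in tf}

# Print the terms of the first document, highest score first
for term, score in sorted(tfidf.items(), key=lambda item: -item[1]):
    print(f"{term:10s} {score:.3f}")
```

Under these assumptions, a word that occurs in every document (such as "is" here) gets an IDF of zero, so its TF-IDF score is zero regardless of how often it appears.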

In [None]:
import nltk
nltk.download('punkt')


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
from nltk.tokenize import sent_tokenize

text = """ Django is a python based web application framework that is free and open source. A framework is simply a collection of module
        that facilitate development.Rapid Development and pragmatic design are key benefit of django.
        Django is python based programming framework. It's python coding tools that adds functionality and speed up the process. """

tokenized_text=sent_tokenize(text)
print(tokenized_text)

[' Django is a python based web application framework that is free and open source.', 'A framework is simply a collection of module\n        that facilitate development.Rapid Development and pragmatic design are key benefit of django.', 'Django is python based programming framework.', "It's python coding tools that adds functionality and speed up the process."]
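
In the output above, the second and third sentences of `text` are returned as one item because the source text has no space after the period in `development.Rapid`. One possible workaround, sketched here with only the standard library, is to repair the spacing before tokenizing. The helper name `repair_sentence_spacing` is hypothetical, and the pattern is a heuristic: it would also split strings such as `example.Com`, so it should be checked against the actual data.

```python
import re

def repair_sentence_spacing(raw_text):
    """Insert a space after a period that sits between a lowercase and an uppercase letter."""
    return re.sub(r"(?<=[a-z])\.(?=[A-Z])", ". ", raw_text)

sample = "A framework is a collection of module that facilitate development.Rapid Development is a key benefit."
print(repair_sentence_spacing(sample))
```

Passing the repaired string to `sent_tokenize` should then give the tokenizer a whitespace boundary to work with.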


**Word Tokenization**

In [None]:
from nltk.tokenize import word_tokenize    # word tokenization splits a piece of text into smaller units called tokens
tokenized_word = word_tokenize(text)
print(tokenized_word)

['Django', 'is', 'a', 'python', 'based', 'web', 'application', 'framework', 'that', 'is', 'free', 'and', 'open', 'source', '.', 'A', 'framework', 'is', 'simply', 'a', 'collection', 'of', 'module', 'that', 'facilitate', 'development.Rapid', 'Development', 'and', 'pragmatic', 'design', 'are', 'key', 'benefit', 'of', 'django', '.', 'Django', 'is', 'python', 'based', 'programming', 'framework', '.', 'It', "'s", 'python', 'coding', 'tools', 'that', 'adds', 'functionality', 'and', 'speed', 'up', 'the', 'process', '.']


**Frequency Distribution**

In [None]:
from nltk.probability import FreqDist
fdist = FreqDist(tokenized_word)
print(fdist)

<FreqDist with 39 samples and 57 outcomes>


In [None]:
fdist.most_common()

[('is', 4),
 ('.', 4),
 ('python', 3),
 ('framework', 3),
 ('that', 3),
 ('and', 3),
 ('Django', 2),
 ('a', 2),
 ('based', 2),
 ('of', 2),
 ('web', 1),
 ('application', 1),
 ('free', 1),
 ('open', 1),
 ('source', 1),
 ('A', 1),
 ('simply', 1),
 ('collection', 1),
 ('module', 1),
 ('facilitate', 1),
 ('development.Rapid', 1),
 ('Development', 1),
 ('pragmatic', 1),
 ('design', 1),
 ('are', 1),
 ('key', 1),
 ('benefit', 1),
 ('django', 1),
 ('programming', 1),
 ('It', 1),
 ("'s", 1),
 ('coding', 1),
 ('tools', 1),
 ('adds', 1),
 ('functionality', 1),
 ('speed', 1),
 ('up', 1),
 ('the', 1),
 ('process', 1)]

In [None]:
fdist.most_common(2)

[('is', 4), ('.', 4)]
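
The counts above treat `'Django'` and `'django'` as different samples and include the period as a token. If case-insensitive word counts are wanted, a sketch along the following lines could be used. It relies on `collections.Counter` from the standard library rather than `FreqDist`, and the short token list is a hand-picked illustration, not the notebook's `tokenized_word`.

```python
from collections import Counter

# Illustrative tokens (hypothetical subset of a tokenized text)
tokens = ['Django', 'is', 'a', 'python', 'framework', '.', 'django', 'is', 'free', '.']

# Lowercase every token and keep only those containing a letter or digit
words = [t.lower() for t in tokens if any(ch.isalnum() for ch in t)]

counts = Counter(words)
print(counts.most_common(3))
```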

**Stop Words**

In [None]:
import nltk
nltk.download('stopwords')                 # download the stop word lists (words such as "the", "a", "an")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
from nltk.corpus import stopwords
stop_word = set(stopwords.words("english"))

In [None]:
print(stop_word)

{'shan', 'its', 'that', 'i', "you've", "doesn't", 'while', 'have', 'your', "didn't", 'why', "it's", 'yourself', 'these', 'how', 'am', 'his', 're', "you're", 'up', 'those', 'll', 'it', 've', "hadn't", 'hers', 'needn', 'on', 'all', 'few', 'weren', 'this', 'each', "haven't", 'only', 'nor', 't', 'out', 'didn', 'through', 'what', 'over', 'off', 'ain', 'being', 'doing', 'under', 'as', 'against', "mustn't", 'be', 'until', 'mustn', 'o', 'their', 'because', 'then', 'you', 'theirs', "shouldn't", 'from', 'down', 'very', 'wouldn', 'no', 'above', 'now', 'at', 'ourselves', 'by', 'but', 'between', 'the', 'don', "aren't", 'does', 'm', 'he', "don't", 'shouldn', 'should', 'can', 'or', "you'll", 'whom', 'hasn', 'do', "isn't", 'd', 'both', 'themselves', 'was', 'had', 'once', 'into', 'wasn', 'just', "that'll", 'not', 'too', "mightn't", "won't", "shan't", 'most', 'she', 'haven', 'myself', 'further', "she's", 'some', 'other', 'y', 'my', "weren't", 'we', 'are', 'yours', 'a', 'such', 'own', 's', "wasn't", 'bel

**Stop Word Removal**

In [None]:
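# Keep only the tokens that are not in the stop word set.
# The comparison is case-sensitive and the set is lowercase, so capitalized tokens such as 'A' and 'It' are kept.
filtered_sent = []
for w in tokenized_word:
  if w not in stop_word:
    filtered_sent.append(w)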

print("Tokenized Sentence:", tokenized_word)
print("\nFiltered Sentence:", filtered_sent)

Tokenized Sentence: ['Django', 'is', 'a', 'python', 'based', 'web', 'application', 'framework', 'that', 'is', 'free', 'and', 'open', 'source', '.', 'A', 'framework', 'is', 'simply', 'a', 'collection', 'of', 'module', 'that', 'facilitate', 'development.Rapid', 'Development', 'and', 'pragmatic', 'design', 'are', 'key', 'benefit', 'of', 'django', '.', 'Django', 'is', 'python', 'based', 'programming', 'framework', '.', 'It', "'s", 'python', 'coding', 'tools', 'that', 'adds', 'functionality', 'and', 'speed', 'up', 'the', 'process', '.']

Filtered Sentence: ['Django', 'python', 'based', 'web', 'application', 'framework', 'free', 'open', 'source', '.', 'A', 'framework', 'simply', 'collection', 'module', 'facilitate', 'development.Rapid', 'Development', 'pragmatic', 'design', 'key', 'benefit', 'django', '.', 'Django', 'python', 'based', 'programming', 'framework', '.', 'It', "'s", 'python', 'coding', 'tools', 'adds', 'functionality', 'speed', 'process', '.']
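
The filtered list above still contains `'A'` and `'It'`, because the membership test is case-sensitive and the stop word set is lowercase. A minimal sketch of a case-insensitive variant follows. The function name `remove_stop_words` is hypothetical, and the tiny stop word set is a hand-written stand-in so the example runs without downloading the NLTK corpus; in the notebook, `stop_word` would take its place.

```python
def remove_stop_words(tokens, stop_words):
    """Drop tokens whose lowercase form is in stop_words, keeping the original casing of the rest."""
    return [t for t in tokens if t.lower() not in stop_words]

# Hand-written stand-in for the NLTK English stop word set (assumption for illustration)
sample_stop_words = {"a", "is", "it", "that", "the", "and", "of"}

sample_tokens = ['A', 'framework', 'is', 'simply', 'a', 'collection', 'of', 'module', '.', 'It', "'s", 'python']
print(remove_stop_words(sample_tokens, sample_stop_words))
```

Lowercasing only for the comparison keeps proper nouns such as `'Django'` intact in the result.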


**Stemming**

In [None]:
from nltk.stem import PorterStemmer                        # stemming reduces words to their base form
from nltk.tokenize import sent_tokenize, word_tokenize

In [None]:
ps = PorterStemmer()

stemmed_words=[]

for w in filtered_sent:
  stemmed_words.append(ps.stem(w))

print("Filtered Sentence: ",filtered_sent)
print("\nStemmed Sentence: ",stemmed_words)

Filtered Sentence:  ['Django', 'python', 'based', 'web', 'application', 'framework', 'free', 'open', 'source', '.', 'A', 'framework', 'simply', 'collection', 'module', 'facilitate', 'development.Rapid', 'Development', 'pragmatic', 'design', 'key', 'benefit', 'django', '.', 'Django', 'python', 'based', 'programming', 'framework', '.', 'It', "'s", 'python', 'coding', 'tools', 'adds', 'functionality', 'speed', 'process', '.']

Stemmed Sentence:  ['django', 'python', 'base', 'web', 'applic', 'framework', 'free', 'open', 'sourc', '.', 'a', 'framework', 'simpli', 'collect', 'modul', 'facilit', 'development.rapid', 'develop', 'pragmat', 'design', 'key', 'benefit', 'django', '.', 'django', 'python', 'base', 'program', 'framework', '.', 'it', "'s", 'python', 'code', 'tool', 'add', 'function', 'speed', 'process', '.']


**Lemmatization**

In [None]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [None]:
from nltk.stem.wordnet import WordNetLemmatizer
lem=WordNetLemmatizer()

from nltk.stem.porter import PorterStemmer
stem=PorterStemmer()

lemma_word_list=[]
for word in stemmed_words:               # note: the input is the already-stemmed list, so the lemmatizer changes nothing here
  lemmat = lem.lemmatize(word,"v")       # lemmatize each token as a verb (pos="v")
  lemma_word_list.append(lemmat)

print("Lemmatized Words:",lemma_word_list)
print("\nStemmed Word:",stemmed_words)


Lemmatized Words: ['django', 'python', 'base', 'web', 'applic', 'framework', 'free', 'open', 'sourc', '.', 'a', 'framework', 'simpli', 'collect', 'modul', 'facilit', 'development.rapid', 'develop', 'pragmat', 'design', 'key', 'benefit', 'django', '.', 'django', 'python', 'base', 'program', 'framework', '.', 'it', "'s", 'python', 'code', 'tool', 'add', 'function', 'speed', 'process', '.']

Stemmed Word: ['django', 'python', 'base', 'web', 'applic', 'framework', 'free', 'open', 'sourc', '.', 'a', 'framework', 'simpli', 'collect', 'modul', 'facilit', 'development.rapid', 'develop', 'pragmat', 'design', 'key', 'benefit', 'django', '.', 'django', 'python', 'base', 'program', 'framework', '.', 'it', "'s", 'python', 'code', 'tool', 'add', 'function', 'speed', 'process', '.']
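
The cell above passes `"v"` for every token, so each word is treated as a verb. A common alternative is to derive the part of speech from a tagger and pass the matching WordNet code to `lemmatize`. The sketch below maps Penn Treebank tags (the kind returned by `nltk.pos_tag`) to the single-letter WordNet codes `'n'`, `'v'`, `'a'`, and `'r'`. The helper name `penn_to_wordnet` is hypothetical, and falling back to noun for all other tags is an assumption that mirrors the lemmatizer's default.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank POS tag to a WordNet POS code ('n', 'v', 'a' or 'r')."""
    if tag.startswith("J"):
        return "a"  # adjective (JJ, JJR, JJS)
    if tag.startswith("V"):
        return "v"  # verb (VB, VBD, VBG, VBN, VBP, VBZ)
    if tag.startswith("R"):
        return "r"  # adverb (RB, RBR, RBS)
    return "n"      # noun, and the fallback for every other tag

# Illustrative (token, tag) pairs in the shape produced by a POS tagger
tagged = [("adds", "VBZ"), ("tools", "NNS"), ("free", "JJ"), ("simply", "RB")]
for token, tag in tagged:
    print(token, tag, penn_to_wordnet(tag))
```

Each token could then be lemmatized with `lem.lemmatize(token, penn_to_wordnet(tag))`, applied to the unstemmed tokens rather than to `stemmed_words`.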


**POS Tagging**

In [None]:
tokens = nltk.word_tokenize(text)
print(tokens)

['Django', 'is', 'a', 'python', 'based', 'web', 'application', 'framework', 'that', 'is', 'free', 'and', 'open', 'source', '.', 'A', 'framework', 'is', 'simply', 'a', 'collection', 'of', 'module', 'that', 'facilitate', 'development.Rapid', 'Development', 'and', 'pragmatic', 'design', 'are', 'key', 'benefit', 'of', 'django', '.', 'Django', 'is', 'python', 'based', 'programming', 'framework', '.', 'It', "'s", 'python', 'coding', 'tools', 'that', 'adds', 'functionality', 'and', 'speed', 'up', 'the', 'process', '.']


In [None]:
import nltk 
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [None]:
nltk.pos_tag(tokens)

[('Django', 'NNP'),
 ('is', 'VBZ'),
 ('a', 'DT'),
 ('python', 'NN'),
 ('based', 'VBN'),
 ('web', 'JJ'),
 ('application', 'NN'),
 ('framework', 'NN'),
 ('that', 'WDT'),
 ('is', 'VBZ'),
 ('free', 'JJ'),
 ('and', 'CC'),
 ('open', 'JJ'),
 ('source', 'NN'),
 ('.', '.'),
 ('A', 'DT'),
 ('framework', 'NN'),
 ('is', 'VBZ'),
 ('simply', 'RB'),
 ('a', 'DT'),
 ('collection', 'NN'),
 ('of', 'IN'),
 ('module', 'NN'),
 ('that', 'IN'),
 ('facilitate', 'JJ'),
 ('development.Rapid', 'NN'),
 ('Development', 'NNP'),
 ('and', 'CC'),
 ('pragmatic', 'JJ'),
 ('design', 'NN'),
 ('are', 'VBP'),
 ('key', 'JJ'),
 ('benefit', 'NN'),
 ('of', 'IN'),
 ('django', 'NN'),
 ('.', '.'),
 ('Django', 'NNP'),
 ('is', 'VBZ'),
 ('python', 'VBN'),
 ('based', 'VBN'),
 ('programming', 'VBG'),
 ('framework', 'NN'),
 ('.', '.'),
 ('It', 'PRP'),
 ("'s", 'VBZ'),
 ('python', 'JJ'),
 ('coding', 'NN'),
 ('tools', 'NNS'),
 ('that', 'WDT'),
 ('adds', 'VBZ'),
 ('functionality', 'NN'),
 ('and', 'CC'),
 ('speed', 'NN'),
 ('up', 'RP'),
 ('th