### Text Preprocessing

#### Import Required Libraries

In [3]:
import warnings
warnings.filterwarnings("ignore")
import nltk
from nltk import word_tokenize
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
nltk.download('stopwords')
import re
import numpy as np

[nltk_data] Downloading package stopwords to C:\Users\Anjali
[nltk_data]     Sharma\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


We use the following paragraph as sample text.

In [4]:
paragraph = """Thank you all so very much. Thank you to the Academy. Thank you to all of you in this room. 
            I have to congratulate the other incredible nominees this year. The Revenant was the product of the tireless efforts of an unbelievable cast and crew. 
            First off, to my brother in this endeavor, Mr. Tom Hardy. Tom, your talent on screen can only be surpassed by your friendship off screen,
            thank you for creating a transcendent cinematic experience. Thank you to everybody at Fox and New Regency my entire team. 
            I have to thank everyone from the very onset of my career … To my parents; 
            none of this would be possible without you. And to my friends, I love you dearly; you know who you are.

            And lastly, I just want to say this: Making The Revenant was about man's relationship to the natural world. 
            A world that we collectively felt in 2015 as the hottest year in recorded history. Our production needed to move to 
            the southern tip of this planet just to be able to find snow.Climate change is real, it is happening right now. 
            It is the most urgent threat facing our entire species, and we need to work collectively together and stop procrastinating. 
            We need to support leaders around the world who do not speak for the big polluters, but who speak for all of humanity, for the indigenous people of the world, 
            for the billions and billions of underprivileged people out there who would be most affected by this. For our children’s children, 
            and for those people out there whose voices have been drowned out by the politics of greed. I thank you all for this amazing award tonight. 
            Let us not take this planet for granted. I do not take tonight for granted. 
            Thank you so very much."""

We will use sent_tokenize, which returns the list of sentences in the paragraph above.

In [5]:
sentence = nltk.sent_tokenize(paragraph) #tokenize the paragraph into sentences

In [6]:
print(sentence)

['Thank you all so very much.', 'Thank you to the Academy.', 'Thank you to all of you in this room.', 'I have to congratulate the other incredible nominees this year.', 'The Revenant was the product of the tireless efforts of an unbelievable cast and crew.', 'First off, to my brother in this endeavor, Mr. Tom Hardy.', 'Tom, your talent on screen can only be surpassed by your friendship off screen,\n            thank you for creating a transcendent cinematic experience.', 'Thank you to everybody at Fox and New Regency my entire team.', 'I have to thank everyone from the very onset of my career … To my parents; \n            none of this would be possible without you.', 'And to my friends, I love you dearly; you know who you are.', "And lastly, I just want to say this: Making The Revenant was about man's relationship to the natural world.", 'A world that we collectively felt in 2015 as the hottest year in recorded history.', 'Our production needed to move to \n            the southern ti

In [7]:
print(len(sentence)) # number of sentences

20
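For contrast, a naive splitter that breaks after every ".", "!" or "?" shows why a trained sentence tokenizer is useful. The sketch below uses a hypothetical helper, naive_sent_split, which is not part of NLTK, on two sentences taken from the paragraph. It wrongly splits after the abbreviation "Mr.", whereas the sent_tokenize output above keeps "Mr. Tom Hardy" inside one sentence.

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' followed by whitespace (hypothetical helper)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

# Two sentences copied from the sample paragraph.
sample = ("First off, to my brother in this endeavor, Mr. Tom Hardy. "
          "Thank you to the Academy.")

pieces = naive_sent_split(sample)
for piece in pieces:
    print(repr(piece))
```

The naive rule produces three pieces instead of two because it cannot tell an abbreviation from the end of a sentence.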


Let's use word_tokenize(), which splits text into word and punctuation tokens.

In [8]:
words = word_tokenize(paragraph) # tokenize the paragraph into words

In [9]:
print(words)

['Thank', 'you', 'all', 'so', 'very', 'much', '.', 'Thank', 'you', 'to', 'the', 'Academy', '.', 'Thank', 'you', 'to', 'all', 'of', 'you', 'in', 'this', 'room', '.', 'I', 'have', 'to', 'congratulate', 'the', 'other', 'incredible', 'nominees', 'this', 'year', '.', 'The', 'Revenant', 'was', 'the', 'product', 'of', 'the', 'tireless', 'efforts', 'of', 'an', 'unbelievable', 'cast', 'and', 'crew', '.', 'First', 'off', ',', 'to', 'my', 'brother', 'in', 'this', 'endeavor', ',', 'Mr.', 'Tom', 'Hardy', '.', 'Tom', ',', 'your', 'talent', 'on', 'screen', 'can', 'only', 'be', 'surpassed', 'by', 'your', 'friendship', 'off', 'screen', ',', 'thank', 'you', 'for', 'creating', 'a', 'transcendent', 'cinematic', 'experience', '.', 'Thank', 'you', 'to', 'everybody', 'at', 'Fox', 'and', 'New', 'Regency', 'my', 'entire', 'team', '.', 'I', 'have', 'to', 'thank', 'everyone', 'from', 'the', 'very', 'onset', 'of', 'my', 'career', '…', 'To', 'my', 'parents', ';', 'none', 'of', 'this', 'would', 'be', 'possible', 'w

In [10]:
print(len(words))

343


### Stemming And Lemmatization

Stemming reduces related words to a common stem by stripping suffixes. The stem is not always a meaningful word, but stemming is fast, so it suits analyses where the exact meaning of each word is not important.

Lemmatization returns a meaningful dictionary word (the lemma), so it takes more processing time. Use lemmatization for analyses where the meaning of words is important.


![stemmingVsLemma.png](attachment:stemmingVsLemma.png)

### Observation

After stemming we get "chang", which is not a real word; after lemmatization we get "change".
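The contrast can be illustrated with a toy example. This is only a sketch: toy_stem and toy_lemmatize are hypothetical helpers with a hand-written suffix list and lookup table, not the rules that PorterStemmer or WordNetLemmatizer actually apply.

```python
# Toy rules for illustration only; PorterStemmer and WordNetLemmatizer
# use far richer rules and a real dictionary.
SUFFIXES = ("ing", "ed", "es", "s", "e")
LEMMA_TABLE = {"changes": "change", "changing": "change", "changed": "change"}

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Look the word up in a small dictionary of known forms."""
    word = word.lower()
    return LEMMA_TABLE.get(word, word)

for w in ["change", "changes", "changing", "changed"]:
    print(w, "->", toy_stem(w), "|", toy_lemmatize(w))
```

Rule-based stripping is cheap but can leave a non-word such as "chang", while the dictionary lookup returns a real word at the cost of needing the dictionary.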

### Applying Stemming to the Paragraph

In [11]:
stemmer = PorterStemmer()
sentence = nltk.sent_tokenize(paragraph) # tokenize the paragraph into sentences
# stemming

for i in range(len(sentence)): # loop over the sentence indices
    words = nltk.word_tokenize(sentence[i]) # tokenize the sentence into words
    newwords = [stemmer.stem(word) for word in words] # stem every word
    sentence[i] = " ".join(newwords) # join the stemmed words back into one sentence

In [12]:
print(sentence)

['thank you all so veri much .', 'thank you to the academi .', 'thank you to all of you in thi room .', 'I have to congratul the other incred nomine thi year .', 'the reven wa the product of the tireless effort of an unbeliev cast and crew .', 'first off , to my brother in thi endeavor , mr. tom hardi .', 'tom , your talent on screen can onli be surpass by your friendship off screen , thank you for creat a transcend cinemat experi .', 'thank you to everybodi at fox and new regenc my entir team .', 'I have to thank everyon from the veri onset of my career … To my parent ; none of thi would be possibl without you .', 'and to my friend , I love you dearli ; you know who you are .', "and lastli , I just want to say thi : make the reven wa about man 's relationship to the natur world .", 'A world that we collect felt in 2015 as the hottest year in record histori .', 'our product need to move to the southern tip of thi planet just to be abl to find snow.clim chang is real , it is happen righ

### Applying Lemmatization to the Paragraph

In [13]:
sentence = nltk.sent_tokenize(paragraph) # tokenize the paragraph into sentences
lemmatizer = WordNetLemmatizer()

# lemmatizing
for i in range(len(sentence)): # loop over the sentence indices
    words = nltk.word_tokenize(sentence[i]) # tokenize the sentence into words
    # lemmatize() treats every word as a noun by default, which is why "was" becomes "wa"
    newwords = [lemmatizer.lemmatize(word) for word in words]
    sentence[i] = " ".join(newwords) # join the lemmatized words back into one sentence

In [14]:
print(sentence)

['Thank you all so very much .', 'Thank you to the Academy .', 'Thank you to all of you in this room .', 'I have to congratulate the other incredible nominee this year .', 'The Revenant wa the product of the tireless effort of an unbelievable cast and crew .', 'First off , to my brother in this endeavor , Mr. Tom Hardy .', 'Tom , your talent on screen can only be surpassed by your friendship off screen , thank you for creating a transcendent cinematic experience .', 'Thank you to everybody at Fox and New Regency my entire team .', 'I have to thank everyone from the very onset of my career … To my parent ; none of this would be possible without you .', 'And to my friend , I love you dearly ; you know who you are .', "And lastly , I just want to say this : Making The Revenant wa about man 's relationship to the natural world .", 'A world that we collectively felt in 2015 a the hottest year in recorded history .', 'Our production needed to move to the southern tip of this planet just to b

### Stopword Removal

Stop-word removal is an important text preprocessing step. It removes very common words such as is, am, or, and not.

In [15]:

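stop_words = set(stopwords.words('english')) # build the stop-word set once
for i in range(len(sentence)):
    words = nltk.word_tokenize(sentence[i])
    # the comparison is case-sensitive, so capitalized words such as "The" and "I" are kept
    newwords = [word for word in words if word not in stop_words]
    sentence[i] = " ".join(newwords)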

In [16]:
print(sentence)

['Thank much .', 'Thank Academy .', 'Thank room .', 'I congratulate incredible nominee year .', 'The Revenant wa product tireless effort unbelievable cast crew .', 'First , brother endeavor , Mr. Tom Hardy .', 'Tom , talent screen surpassed friendship screen , thank creating transcendent cinematic experience .', 'Thank everybody Fox New Regency entire team .', 'I thank everyone onset career … To parent ; none would possible without .', 'And friend , I love dearly ; know .', "And lastly , I want say : Making The Revenant wa man 's relationship natural world .", 'A world collectively felt 2015 hottest year recorded history .', 'Our production needed move southern tip planet able find snow.Climate change real , happening right .', 'It urgent threat facing entire specie , need work collectively together stop procrastinating .', 'We need support leader around world speak big polluter , speak humanity , indigenous people world , billion billion underprivileged people would affected .', 'For 
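The output above still contains "I", "The" and "And" because the membership test is case-sensitive while the stop-word list is lowercase. A minimal sketch of a case-insensitive filter follows. It assumes a tiny hand-picked stop-word set so that it runs without any download; in the notebook the set would come from stopwords.words('english').

```python
# A tiny hand-picked stop-word set (an assumption for this sketch);
# in the notebook this would be set(stopwords.words('english')).
STOP_WORDS = {"i", "the", "and", "was", "of", "to", "you"}

def remove_stopwords(tokens, stop_words=STOP_WORDS):
    """Drop tokens whose lowercase form is a stop word."""
    return [tok for tok in tokens if tok.lower() not in stop_words]

tokens = ["The", "Revenant", "was", "the", "product", "of", "tireless", "efforts"]
filtered = remove_stopwords(tokens)
print(filtered)
```

Lowercasing only for the comparison keeps the original capitalization of the words that survive.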

#### How to find the part of speech of each word in the paragraph above

In [17]:
words = word_tokenize(paragraph)
tagged_word = nltk.pos_tag(words)
word_tag = []
for i in tagged_word:
    word_tag.append(i[0]+"_"+i[1])
tagged_paragraph = " ".join(word_tag)    

In [18]:
print(tagged_paragraph)

Thank_NNP you_PRP all_DT so_RB very_RB much_JJ ._. Thank_VB you_PRP to_TO the_DT Academy_NNP ._. Thank_NNP you_PRP to_TO all_DT of_IN you_PRP in_IN this_DT room_NN ._. I_PRP have_VBP to_TO congratulate_VB the_DT other_JJ incredible_JJ nominees_NNS this_DT year_NN ._. The_DT Revenant_NNP was_VBD the_DT product_NN of_IN the_DT tireless_NN efforts_NNS of_IN an_DT unbelievable_JJ cast_NN and_CC crew_NN ._. First_NNP off_RB ,_, to_TO my_PRP$ brother_NN in_IN this_DT endeavor_NN ,_, Mr._NNP Tom_NNP Hardy_NNP ._. Tom_NNP ,_, your_PRP$ talent_NN on_IN screen_NN can_MD only_RB be_VB surpassed_VBN by_IN your_PRP$ friendship_NN off_IN screen_NN ,_, thank_NN you_PRP for_IN creating_VBG a_DT transcendent_JJ cinematic_JJ experience_NN ._. Thank_NNP you_PRP to_TO everybody_VB at_IN Fox_NNP and_CC New_NNP Regency_NNP my_PRP$ entire_JJ team_NN ._. I_PRP have_VBP to_TO thank_VB everyone_NN from_IN the_DT very_RB onset_NN of_IN my_PRP$ career_NN …_NN To_TO my_PRP$ parents_NNS ;_: none_NN of_IN this_DT wo

#### POS Tag Meanings : Here are the meanings of the Parts-Of-Speech tags used in NLTK



CC - Coordinating conjunction

CD - Cardinal number

DT - Determiner

EX - Existential there

FW - Foreign word

IN - Preposition or subordinating conjunction

JJ - Adjective

JJR - Adjective, comparative

JJS - Adjective, superlative

LS - List item marker

MD - Modal

NN - Noun, singular or mass

NNS - Noun, plural

NNP - Proper noun, singular

NNPS - Proper noun, plural

PDT - Predeterminer

POS - Possessive ending

PRP - Personal pronoun

PRP$ - Possessive pronoun

RB - Adverb

RBR - Adverb, comparative

RBS - Adverb, superlative

RP - Particle

SYM - Symbol

TO - to

UH - Interjection

VB - Verb, base form

VBD - Verb, past tense

VBG - Verb, gerund or present participle

VBN - Verb, past participle

VBP - Verb, non-3rd person singular present

VBZ - Verb, 3rd person singular present

WDT - Wh-determiner

WP - Wh-pronoun

WP$ - Possessive wh-pronoun

WRB - Wh-adverb
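Once words are tagged, the tags can be used to filter or group them. The sketch below works on a few (word, tag) pairs copied from the tagged output above, in the same list-of-tuples shape that nltk.pos_tag returns; the grouping logic itself is an illustration, not something the notebook does.

```python
from collections import defaultdict

# A few (word, tag) pairs copied from the tagged output above.
tagged = [("Thank", "VB"), ("you", "PRP"), ("to", "TO"), ("the", "DT"),
          ("Academy", "NNP"), ("nominees", "NNS"), ("year", "NN")]

# Group the words by their tag.
by_tag = defaultdict(list)
for word, tag in tagged:
    by_tag[tag].append(word)

# Every noun tag (NN, NNS, NNP, NNPS) starts with "NN".
nouns = [word for word, tag in tagged if tag.startswith("NN")]

print(dict(by_tag))
print(nouns)
```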



### Bag Of Words

In [19]:
dataset = nltk.sent_tokenize(paragraph)
for i in range(len(dataset)):
    dataset[i] = dataset[i].lower() # convert the sentence to lower case
    dataset[i] = re.sub(r'\W',' ',dataset[i]) # replace each non-word character with a space
    dataset[i] = re.sub(r'\s+',' ',dataset[i]) # collapse runs of whitespace into a single space

In [20]:
print(dataset)

['thank you all so very much ', 'thank you to the academy ', 'thank you to all of you in this room ', 'i have to congratulate the other incredible nominees this year ', 'the revenant was the product of the tireless efforts of an unbelievable cast and crew ', 'first off to my brother in this endeavor mr tom hardy ', 'tom your talent on screen can only be surpassed by your friendship off screen thank you for creating a transcendent cinematic experience ', 'thank you to everybody at fox and new regency my entire team ', 'i have to thank everyone from the very onset of my career to my parents none of this would be possible without you ', 'and to my friends i love you dearly you know who you are ', 'and lastly i just want to say this making the revenant was about man s relationship to the natural world ', 'a world that we collectively felt in 2015 as the hottest year in recorded history ', 'our production needed to move to the southern tip of this planet just to be able to find snow climate
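The effect of the three cleaning steps is easiest to see on a single sentence. The sketch below wraps them in a hypothetical helper, clean_sentence, and adds one extra step that the notebook does not perform: strip() removes the trailing space that is visible in the output above.

```python
import re

def clean_sentence(sentence):
    """Lowercase, replace non-word characters, collapse whitespace, trim ends."""
    sentence = sentence.lower()
    sentence = re.sub(r"\W", " ", sentence)   # punctuation -> space
    sentence = re.sub(r"\s+", " ", sentence)  # runs of whitespace -> one space
    return sentence.strip()                   # extra step: drop the trailing space

cleaned = clean_sentence(
    "Making The Revenant was about man's relationship to the natural world."
)
print(cleaned)
```

Note that the apostrophe in "man's" is replaced too, which is why a stray "s" token appears among the frequent words later on.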

In [21]:
print(type(dataset))

<class 'list'>


In [22]:
# build the word-count histogram
word2count = {}
for data in dataset: # data is one cleaned sentence
    words = nltk.word_tokenize(data) # tokenize the sentence into words
    for word in words: # visit every word of the sentence
        if word not in word2count: # first occurrence of this word
            word2count[word] = 1 # start its count at 1
        else:
            word2count[word] += 1 # word already seen: increment its count

In [23]:
print(word2count)

{'thank': 8, 'you': 12, 'all': 4, 'so': 2, 'very': 3, 'much': 2, 'to': 16, 'the': 17, 'academy': 1, 'of': 10, 'in': 4, 'this': 9, 'room': 1, 'i': 6, 'have': 3, 'congratulate': 1, 'other': 1, 'incredible': 1, 'nominees': 1, 'year': 2, 'revenant': 2, 'was': 2, 'product': 1, 'tireless': 1, 'efforts': 1, 'an': 1, 'unbelievable': 1, 'cast': 1, 'and': 8, 'crew': 1, 'first': 1, 'off': 2, 'my': 5, 'brother': 1, 'endeavor': 1, 'mr': 1, 'tom': 2, 'hardy': 1, 'your': 2, 'talent': 1, 'on': 1, 'screen': 2, 'can': 1, 'only': 1, 'be': 4, 'surpassed': 1, 'by': 3, 'friendship': 1, 'for': 10, 'creating': 1, 'a': 2, 'transcendent': 1, 'cinematic': 1, 'experience': 1, 'everybody': 1, 'at': 1, 'fox': 1, 'new': 1, 'regency': 1, 'entire': 2, 'team': 1, 'everyone': 1, 'from': 1, 'onset': 1, 'career': 1, 'parents': 1, 'none': 1, 'would': 2, 'possible': 1, 'without': 1, 'friends': 1, 'love': 1, 'dearly': 1, 'know': 1, 'who': 4, 'are': 1, 'lastly': 1, 'just': 2, 'want': 1, 'say': 1, 'making': 1, 'about': 1, 'man
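The same histogram can be built with collections.Counter from the standard library. This is a sketch of an alternative, not the notebook's method; it uses str.split instead of nltk.word_tokenize, on the assumption that the cleaned sentences contain only words and single spaces.

```python
from collections import Counter

# The first two cleaned sentences of the dataset.
sentences = ["thank you all so very much", "thank you to the academy"]

# Counter starts a missing key at zero, so no membership test is needed.
word_counts = Counter(word for sentence in sentences for word in sentence.split())

print(word_counts["thank"])        # occurrences of one word
print(word_counts.most_common(3))  # the three most frequent words
```

Counter.most_common(n) plays the same role as the heapq.nlargest call used to pick the most frequent words.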

### Find most frequent words

In [24]:
import heapq

In [25]:
freq_words = heapq.nlargest(100,word2count,key = word2count.get)

In [26]:
print(freq_words)

['the', 'to', 'you', 'of', 'for', 'this', 'thank', 'and', 'i', 'my', 'all', 'in', 'be', 'who', 'world', 'very', 'have', 'by', 'we', 'our', 'is', 'not', 'people', 'out', 'so', 'much', 'year', 'revenant', 'was', 'off', 'tom', 'your', 'screen', 'a', 'entire', 'would', 'just', 's', 'collectively', 'planet', 'it', 'most', 'need', 'do', 'speak', 'billions', 'there', 'children', 'tonight', 'take', 'granted', 'academy', 'room', 'congratulate', 'other', 'incredible', 'nominees', 'product', 'tireless', 'efforts', 'an', 'unbelievable', 'cast', 'crew', 'first', 'brother', 'endeavor', 'mr', 'hardy', 'talent', 'on', 'can', 'only', 'surpassed', 'friendship', 'creating', 'transcendent', 'cinematic', 'experience', 'everybody', 'at', 'fox', 'new', 'regency', 'team', 'everyone', 'from', 'onset', 'career', 'parents', 'none', 'possible', 'without', 'friends', 'love', 'dearly', 'know', 'are', 'lastly', 'want']


In [27]:
X = []
for data in dataset:
    tokens = nltk.word_tokenize(data) # tokenize each sentence once
    # 1 if the frequent word occurs in the sentence, otherwise 0
    vector = [1 if word in tokens else 0 for word in freq_words]
    X.append(vector)

In [28]:
print(X)

[[0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,

The result is a list of lists. We need to convert it into a 2D array.

In [29]:
X = np.asarray(X)

In [30]:
# got the bag of words matrix
print(X)

[[0 0 1 ... 0 0 0]
 [1 1 1 ... 0 0 0]
 [0 1 1 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 1 ... 0 0 0]]


In [31]:
print(X.shape)

(20, 100)
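The matrix above is binary: each entry records only whether a word is present in a sentence. A count-based variant keeps the number of occurrences instead. The sketch below shows both on a small, hand-picked vocabulary; bow_vector is a hypothetical helper and the vocabulary is an assumption made for the example.

```python
# Small hand-picked vocabulary and two cleaned sentences from the dataset.
vocabulary = ["thank", "you", "the", "to"]
documents = ["thank you to all of you in this room", "thank you to the academy"]

def bow_vector(document, vocabulary, binary=True):
    """Return one bag-of-words row: presence flags or raw counts."""
    tokens = document.split()
    counts = [tokens.count(word) for word in vocabulary]
    return [int(c > 0) for c in counts] if binary else counts

binary_rows = [bow_vector(doc, vocabulary) for doc in documents]
count_rows = [bow_vector(doc, vocabulary, binary=False) for doc in documents]

print(binary_rows)
print(count_rows)
```

The two versions differ only where a word repeats, such as "you" in the first sentence.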


## TF-IDF

TF stands for "Term Frequency"

#### Formula :

TF = (Number of occurrences of a word in the document)/(Total number of words in that document)

#### Example:

For the document "to be or not to be", the TF values of the words "to", "be" and "or" are:
 1. for "to" = 2/6 = 0.33
 2. for "be" = 2/6 = 0.33
 3. for "or" = 1/6 = 0.17
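The TF arithmetic can be checked with a few lines of plain Python. This is a minimal sketch; term_frequency is a hypothetical helper that follows the formula above.

```python
def term_frequency(word, tokens):
    """Occurrences of `word` divided by the total number of tokens."""
    return tokens.count(word) / len(tokens)

tokens = "to be or not to be".split()

for word in ["to", "be", "or"]:
    print(word, round(term_frequency(word, tokens), 2))
```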

IDF stands for Inverse Document Frequency

#### Formula:

IDF = log_e((Number of documents)/(Number of documents containing the word))

#### Example:

doc1 : "to be or not to be"

doc2 : "i have to be"

doc3 : "you got to be"

1. IDF for word "to" = log_e(3/3) = 0
2. IDF for word "be" = log_e(3/3) = 0
3. IDF for word "or" = log_e(3/1) = 1.099
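The IDF values for these three documents can be checked with math.log, which computes the natural logarithm. With base e, the value for "or" is about 1.099; 0.477 would be the base-10 logarithm of 3. The helper below is a hypothetical sketch that assumes the word occurs in at least one document.

```python
import math

documents = ["to be or not to be", "i have to be", "you got to be"]
tokenized = [doc.split() for doc in documents]

def inverse_document_frequency(word, tokenized_docs):
    """ln(N / df); assumes the word occurs in at least one document."""
    doc_freq = sum(1 for tokens in tokenized_docs if word in tokens)
    return math.log(len(tokenized_docs) / doc_freq)

for word in ["to", "be", "or"]:
    print(word, round(inverse_document_frequency(word, tokenized), 3))
```

Words that appear in every document get an IDF of 0, so they contribute nothing to a TF-IDF score.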

In [32]:
word_idfs = {} # dictionary mapping each word to its IDF value
for word in freq_words: # take each word from freq_words
    doc_count = 0
    for data in dataset: # take each sentence from dataset
        words = nltk.word_tokenize(data) # tokenize the sentence into words
        if word in words: # count the sentence if it contains the word
            doc_count += 1
    # note: this evaluates log(N/doc_count + 1), a smoothed variant of the formula above
    word_idfs[word] = np.log(len(dataset)/doc_count+1)

In [33]:
print(word_idfs)

{'the': 1.0986122886681098, 'to': 1.0360919316867758, 'you': 1.1700712526502546, 'of': 1.4663370687934272, 'for': 1.4663370687934272, 'this': 1.1700712526502546, 'thank': 1.252762968495368, 'and': 1.349926716949016, 'i': 1.4663370687934272, 'my': 1.791759469228055, 'all': 1.791759469228055, 'in': 2.03688192726104, 'be': 1.791759469228055, 'who': 2.3978952727983707, 'world': 2.03688192726104, 'very': 2.03688192726104, 'have': 2.03688192726104, 'by': 2.03688192726104, 'we': 2.03688192726104, 'our': 2.03688192726104, 'is': 2.3978952727983707, 'not': 2.03688192726104, 'people': 2.3978952727983707, 'out': 2.3978952727983707, 'so': 2.3978952727983707, 'much': 2.3978952727983707, 'year': 2.3978952727983707, 'revenant': 2.3978952727983707, 'was': 2.3978952727983707, 'off': 2.3978952727983707, 'tom': 2.3978952727983707, 'your': 3.044522437723423, 'screen': 3.044522437723423, 'a': 2.3978952727983707, 'entire': 2.3978952727983707, 'would': 2.3978952727983707, 'just': 2.3978952727983707, 's': 2.39

## TF - Matrix

In [34]:
tf_matrix = {}
for word in freq_words:
    doc_tf = []
    for data in dataset:
        frequency = 0
        for w in nltk.word_tokenize(data):
            if word==w:
                frequency+=1
        tf_words = frequency/(len(nltk.word_tokenize(data)))        
        doc_tf.append(tf_words)
    tf_matrix[word] = doc_tf   

In [35]:
print(tf_matrix)

{'the': [0.0, 0.2, 0.0, 0.1, 0.2, 0.0, 0.0, 0.0, 0.043478260869565216, 0.0, 0.1, 0.06666666666666667, 0.03571428571428571, 0.05, 0.10638297872340426, 0.045454545454545456, 0.0, 0.0, 0.0, 0.0], 'to': [0.0, 0.2, 0.1111111111111111, 0.1, 0.0, 0.09090909090909091, 0.0, 0.08333333333333333, 0.08695652173913043, 0.07692307692307693, 0.1, 0.0, 0.14285714285714285, 0.05, 0.02127659574468085, 0.0, 0.0, 0.0, 0.0, 0.0], 'you': [0.16666666666666666, 0.2, 0.2222222222222222, 0.0, 0.0, 0.0, 0.045454545454545456, 0.08333333333333333, 0.043478260869565216, 0.23076923076923078, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.1111111111111111, 0.0, 0.0, 0.2], 'of': [0.0, 0.0, 0.1111111111111111, 0.0, 0.13333333333333333, 0.0, 0.0, 0.0, 0.08695652173913043, 0.0, 0.0, 0.0, 0.03571428571428571, 0.0, 0.06382978723404255, 0.045454545454545456, 0.0, 0.0, 0.0, 0.0], 'for': [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.045454545454545456, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0851063829787234, 0.09090909090909091, 0.1111111111111111, 0.125,

## TF-IDF Matrix

The TF-IDF score is the product of the TF and IDF values.

In [36]:
tfidf_matrix = []
for word in tf_matrix.keys():
    tf_idf = []
    for value in tf_matrix[word]:
        score = value*word_idfs[word]
        tf_idf.append(score)
    tfidf_matrix.append(tf_idf)

In [37]:
print(tfidf_matrix)

[[0.0, 0.21972245773362198, 0.0, 0.10986122886681099, 0.21972245773362198, 0.0, 0.0, 0.0, 0.047765751681222164, 0.0, 0.10986122886681099, 0.07324081924454065, 0.039236153166718205, 0.054930614433405495, 0.11687364773064998, 0.04993692221218681, 0.0, 0.0, 0.0, 0.0], [0.0, 0.20721838633735518, 0.11512132574297508, 0.10360919316867759, 0.0, 0.09419017560788871, 0.0, 0.08634099430723131, 0.09009495058145876, 0.07969937936052122, 0.10360919316867759, 0.0, 0.14801313309811082, 0.051804596584338794, 0.022044509184825017, 0.0, 0.0, 0.0, 0.0, 0.0], [0.1950118754417091, 0.23401425053005093, 0.2600158339222788, 0.0, 0.0, 0.0, 0.05318505693864794, 0.09750593772085454, 0.05087266315870672, 0.27001644291928956, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.1300079169611394, 0.0, 0.0, 0.23401425053005093], [0.0, 0.0, 0.16292634097704745, 0.0, 0.19551160917245697, 0.0, 0.0, 0.0, 0.12750757119942846, 0.0, 0.0, 0.0, 0.05236918102833668, 0.0, 0.09359598311447406, 0.06665168494515578, 0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.

In [38]:
tfidf_matrix = np.asarray(tfidf_matrix)

In [39]:
print(tfidf_matrix)

[[0.         0.21972246 0.         ... 0.         0.         0.        ]
 [0.         0.20721839 0.11512133 ... 0.         0.         0.        ]
 [0.19501188 0.23401425 0.26001583 ... 0.         0.         0.23401425]
 ...
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]]


In [40]:
print(tfidf_matrix.shape)

(100, 20)


In [41]:
tfidf_matrix = np.transpose(tfidf_matrix)

In [42]:
print(tfidf_matrix)

[[0.         0.         0.19501188 ... 0.         0.         0.        ]
 [0.21972246 0.20721839 0.23401425 ... 0.         0.         0.        ]
 [0.         0.11512133 0.26001583 ... 0.         0.         0.        ]
 ...
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.23401425 ... 0.         0.         0.        ]]
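After the transpose, each row corresponds to one sentence and each column to one word. The same layout can be produced directly by looping over documents first. The sketch below does this for the three-document example; it uses the plain log(N/df) formula, whereas the notebook cell adds 1 inside the logarithm, and the helpers tf and idf are hypothetical.

```python
import math

documents = ["to be or not to be", "i have to be", "you got to be"]
tokenized = [doc.split() for doc in documents]
vocabulary = ["to", "be", "or"]

def tf(word, tokens):
    """Term frequency of `word` within one tokenized document."""
    return tokens.count(word) / len(tokens)

def idf(word, tokenized_docs):
    """ln(N / df); assumes the word occurs in at least one document."""
    doc_freq = sum(1 for tokens in tokenized_docs if word in tokens)
    return math.log(len(tokenized_docs) / doc_freq)

# One row per document, one column per vocabulary word.
tfidf_rows = [[tf(w, tokens) * idf(w, tokenized) for w in vocabulary]
              for tokens in tokenized]

for row in tfidf_rows:
    print([round(value, 3) for value in row])
```

Only "or" gets a non-zero score, and only in the first document, because "to" and "be" occur in every document.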


## N-Gram Modeling 

In [51]:
text = """he term is frequently used interchangeably with the term climate change, though the latter refers to 
          both human- and naturally produced warming and the effects it has on our planet. 
          It is most commonly measured as the average increase in Earth’s global surface temperature."""

In [44]:
n = 2
ngrams = {}
for i in range(len(text)-n):
    gram = text[i:i+n] # character n-gram, e.g. text[0:2] for i = 0
    if gram not in ngrams:
        ngrams[gram] = []
    ngrams[gram].append(text[i+n]) # record the character that follows the gram

In [45]:
print(ngrams)

{'he': [' ', ' ', ' ', ' ', ' '], 'e ': ['t', 't', 'c', 'l', 'e', 'a', 'i', 'i', 't'], ' t': ['e', 'h', 'e', 'h', 'h', 'o', 'h', 'h', 'e'], 'te': ['r', 'r', 'r', ' ', 'r', 'm'], 'er': ['m', 'c', 'm', ' ', 's', 'a', 'a'], 'rm': [' ', ' ', 'i'], 'm ': ['i', 'c'], ' i': ['s', 'n', 't', 's', 'n', 'n'], 'is': [' ', ' '], 's ': ['f', 't', 'i', 'o', 'm', 't', 'g'], ' f': ['r'], 'fr': ['e'], 're': ['q', 'f', 'd', 'a', '.'], 'eq': ['u'], 'qu': ['e'], 'ue': ['n'], 'en': ['t'], 'nt': ['l', 'e'], 'tl': ['y'], 'ly': [' ', ' ', ' ', ' '], 'y ': ['u', 'w', 'p', 'm'], ' u': ['s'], 'us': ['e'], 'se': ['d', ' '], 'ed': [' ', ' ', ' '], 'd ': ['i', 'n', 'w', 't', 'a'], 'in': ['t', 'g', 'c', ' '], 'rc': ['h'], 'ch': ['a', 'a'], 'ha': ['n', 'n', 's'], 'an': ['g', 'g', '-', 'd', 'd', 'e'], 'ng': ['e', 'e', ' '], 'ge': ['a', ',', ' '], 'ea': ['b', 's', 's'], 'ab': ['l'], 'bl': ['y'], ' w': ['i', 'a'], 'wi': ['t'], 'it': ['h', ' '], 'th': [' ', 'e', 'o', 'e', ' ', 'e', 'e', '’'], 'h ': ['t', 't', 'h'], ' c': 
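A table like this one can drive a simple text generator: start from a gram and repeatedly sample one of the characters recorded after the last n characters. The sketch below is an illustration under stated assumptions. build_char_ngrams and generate are hypothetical helpers, the sample string is a shortened version of the text, and the random generator is seeded so that runs are repeatable.

```python
import random
from collections import defaultdict

def build_char_ngrams(text, n=2):
    """Map each n-character gram to the characters that follow it."""
    table = defaultdict(list)
    for i in range(len(text) - n):
        table[text[i:i + n]].append(text[i + n])
    return table

def generate(table, seed, length=40, rng=None):
    """Extend `seed` (whose length must equal n) by sampling followers."""
    rng = rng or random.Random(0)
    n = len(seed)
    out = seed
    for _ in range(length):
        followers = table.get(out[-n:])
        if not followers:  # gram never seen with a follower: stop early
            break
        out += rng.choice(followers)
    return out

sample = "the term is frequently used interchangeably with the term climate change"
table = build_char_ngrams(sample, n=2)
result = generate(table, "th")
print(result)
```

Because followers are stored with repetition, characters that follow a gram more often are sampled more often.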

In [52]:
ngrams1 = {}
words = nltk.word_tokenize(text)
for i in range(len(words)-n):
    gram = ' '.join(words[i:i+n])
    if gram not in ngrams1.keys():
        ngrams1[gram] = []
    ngrams1[gram].append(words[i+n])    

In [53]:
print(ngrams1)

{'he term': ['is'], 'term is': ['frequently'], 'is frequently': ['used'], 'frequently used': ['interchangeably'], 'used interchangeably': ['with'], 'interchangeably with': ['the'], 'with the': ['term'], 'the term': ['climate'], 'term climate': ['change'], 'climate change': [','], 'change ,': ['though'], ', though': ['the'], 'though the': ['latter'], 'the latter': ['refers'], 'latter refers': ['to'], 'refers to': ['both'], 'to both': ['human-'], 'both human-': ['and'], 'human- and': ['naturally'], 'and naturally': ['produced'], 'naturally produced': ['warming'], 'produced warming': ['and'], 'warming and': ['the'], 'and the': ['effects'], 'the effects': ['it'], 'effects it': ['has'], 'it has': ['on'], 'has on': ['our'], 'on our': ['planet'], 'our planet': ['.'], 'planet .': ['It'], '. It': ['is'], 'It is': ['most'], 'is most': ['commonly'], 'most commonly': ['measured'], 'commonly measured': ['as'], 'measured as': ['the'], 'as the': ['average'], 'the average': ['increase'], 'average incr
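The lists of following words can be turned into next-word probabilities by counting. A minimal sketch follows, using hypothetical helpers and a made-up word sequence in which the bigram "the term" repeats; in the notebook text every bigram shown above occurs only once, so each probability there would be 1.

```python
from collections import Counter, defaultdict

def build_word_ngrams(words, n=2):
    """Map each n-word gram to the words that follow it."""
    table = defaultdict(list)
    for i in range(len(words) - n):
        table[" ".join(words[i:i + n])].append(words[i + n])
    return table

def next_word_probabilities(table, gram):
    """Relative frequency of each word observed after `gram`."""
    followers = Counter(table.get(gram, []))
    total = sum(followers.values())
    return {word: count / total for word, count in followers.items()}

# Made-up sequence for illustration, chosen so that "the term" repeats.
words = ("the term is used with the term climate change "
         "and the term is common").split()

table = build_word_ngrams(words, n=2)
probs = next_word_probabilities(table, "the term")
print(probs)
```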

### Latent Semantic Analysis

In [58]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

#sample data
dataset = ["The amount of polution is increasing day by daya",
           "The concert was just great",
           "I Love to see Garden Ramsay Cook",
            "Google is introducting a new technology",
            "AI Robots are examples of great technology present to you",
            "All of us were singing in the concert",
            "We have Launch compaigns to stop pollution and global warming"]

dataset = [line.lower() for line in dataset]

vectorizer = TfidfVectorizer()
x = vectorizer.fit_transform(dataset)
print(x[0])

  (0, 10)	0.3604707823245737
  (0, 5)	0.3604707823245737
  (0, 9)	0.3604707823245737
  (0, 18)	0.3604707823245737
  (0, 20)	0.29922170630677863
  (0, 27)	0.3604707823245737
  (0, 25)	0.25576479528730944
  (0, 2)	0.3604707823245737
  (0, 35)	0.25576479528730944


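Observation: each output line has the form (row, column) value. The row is always 0 because x[0] holds a single document, and the column is the word's index in the vectorizer's alphabetically sorted vocabulary. For example, (0, 10) is the word "daya" with a TF-IDF value of 0.3604707823245737, and "the" is at column 35 with a value of 0.25576479528730944.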

In [60]:
print(x[1])

  (0, 15)	0.4213298560187446
  (0, 21)	0.5075738143811802
  (0, 39)	0.5075738143811802
  (0, 7)	0.4213298560187446
  (0, 35)	0.36013879374975194


Observation: in the second document, "the" (column 35) has a TF-IDF value of 0.36013879374975194. The value is higher than in the first document because the second document is shorter, so each of its words carries more weight after normalization.
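The column numbers in the sparse output can be mapped back to words. The sketch below rebuilds the index with the standard library only, assuming TfidfVectorizer's default behaviour: lowercased tokens of two or more word characters and a vocabulary sorted alphabetically. In the notebook itself, vectorizer.vocabulary_ holds the same word-to-column mapping.

```python
import re

# The sample sentences, copied verbatim (including their typos).
dataset = ["The amount of polution is increasing day by daya",
           "The concert was just great",
           "I Love to see Garden Ramsay Cook",
           "Google is introducting a new technology",
           "AI Robots are examples of great technology present to you",
           "All of us were singing in the concert",
           "We have Launch compaigns to stop pollution and global warming"]

# Assumption: mirrors the vectorizer's default tokenization, which keeps
# tokens of two or more word characters (so "i" and "a" are dropped).
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")

vocabulary = sorted({token
                     for line in dataset
                     for token in TOKEN_PATTERN.findall(line.lower())})

for column in (10, 35):
    print(column, vocabulary[column])
```

Under these assumptions, column 10 corresponds to "daya" and column 35 to "the", which matches the column numbers printed for the first two documents.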

In [66]:
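lsa = TruncatedSVD(n_components = 4,n_iter=100) # keep 4 components (concepts)
lsa.fit(x)


# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
terms = vectorizer.get_feature_names_out()
for i,comp in enumerate(lsa.components_):
    componentTerms = zip(terms,comp) # pair each term with its weight in this concept
    sortedTerms = sorted(componentTerms,key=lambda pair:pair[1],reverse=True) # highest weights first
    sortedTerms = sortedTerms[:10] # keep the top 10 terms
    print("\nconcept",i,":")
    for term in sortedTerms:
        print(term)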


concept 0 :
('the', 0.37931274666161524)
('concert', 0.3322524212584805)
('of', 0.30297933562408524)
('great', 0.2883406330890268)
('just', 0.22747274585828073)
('was', 0.22747274585828073)
('is', 0.1938158804638517)
('technology', 0.18182537167789728)
('all', 0.17278996134480654)
('in', 0.17278996134480654)

concept 1 :
('to', 0.3549401919583129)
('technology', 0.23931386858616868)
('cook', 0.21500250606772944)
('garden', 0.21500250606772944)
('love', 0.21500250606772944)
('ramsay', 0.21500250606772944)
('see', 0.21500250606772944)
('google', 0.15506130357728057)
('introducting', 0.15506130357728057)
('new', 0.15506130357728057)

concept 2 :
('is', 0.3648279017955781)
('google', 0.3151297909758329)
('introducting', 0.3151297909758329)
('new', 0.3151297909758329)
('technology', 0.26764814808496595)
('amount', 0.12437642265177631)
('by', 0.12437642265177631)
('day', 0.12437642265177631)
('daya', 0.12437642265177631)
('increasing', 0.12437642265177631)

concept 3 :
('and', 0.25491788470

### Word Synonyms and Antonyms using NLTK

In [69]:
from nltk.corpus import wordnet

synonyms = []
antonyms = []

for syn in wordnet.synsets("good"):
    for s in syn.lemmas():
        synonyms.append(s.name())
        for a in s.antonyms():
            antonyms.append(a.name())

In [71]:
print(set(synonyms))
print(set(antonyms))

{'commodity', 'unspoiled', 'just', 'good', 'goodness', 'dependable', 'full', 'undecomposed', 'skilful', 'expert', 'right', 'in_force', 'trade_good', 'upright', 'skillful', 'estimable', 'honorable', 'well', 'practiced', 'safe', 'in_effect', 'near', 'unspoilt', 'dear', 'honest', 'soundly', 'proficient', 'beneficial', 'ripe', 'adept', 'salutary', 'effective', 'serious', 'respectable', 'sound', 'thoroughly', 'secure'}
{'ill', 'badness', 'bad', 'evil', 'evilness'}
