## What is Textblob

* TextBlob is an open-source Python library for common NLP tasks such as lemmatization, stemming, tokenization, noun phrase extraction, POS tagging, n-grams, and sentiment analysis.

* It is built on top of NLTK and offers a simpler API; however, it does not provide functionality such as vectorization or dependency parsing.

* Text classification and sentiment analysis can also be performed with TextBlob.
* Official documentation: https://textblob.readthedocs.io/en/dev/

* Installation: pip install textblob

In [1]:
### Install Textblob
#!pip install nltk
#!pip install textblob

In [2]:
import nltk 
#nltk.download('popular')

In [3]:
import nltk
#nltk.download('averaged_perceptron_tagger')

### Functionalities of Textblob
* Language Translation
* Word Correction
* Word Count
* Phrase Extraction
* POS Tagging
* Tokenization
* Pluralization of words using Textblob
* Lemmatization using Textblob
* n-gram in Textblob

In [4]:
! pip install -U textblob
! python -m textblob.download_corpora

Finished.


[nltk_data] Downloading package brown to C:\Users\Harini
[nltk_data]     Balaji\AppData\Roaming\nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package punkt to C:\Users\Harini
[nltk_data]     Balaji\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to C:\Users\Harini
[nltk_data]     Balaji\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Harini Balaji\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package conll2000 to C:\Users\Harini
[nltk_data]     Balaji\AppData\Roaming\nltk_data...
[nltk_data]   Package conll2000 is already up-to-date!
[nltk_data] Downloading package movie_reviews to C:\Users\Harini
[nltk_data]     Balaji\AppData\Roaming\nltk_data...
[nltk_data

In [5]:
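from textblob import TextBlob
blob = TextBlob('TextBlob is a great tool for developers')
# Note: translate() calls the Google Translate web API, so it needs internet access.
# It was deprecated in TextBlob 0.16.0 and removed in 0.18.0, so this cell needs an older release.
print(blob.translate(from_lang='en', to='ta'))
print(blob.translate(from_lang='en', to='hi'))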

டெவலப்பர்களுக்கு TextBlob ஒரு சிறந்த கருவி
टेक्स्टब्लॉब डेवलपर्स के लिए एक बेहतरीन टूल है


In [6]:
blob = TextBlob("Hey John, How are You")
 
print("Input text in Spanish:",blob.translate(from_lang='en', to='es'))


Input text in Spanish: Hola Juan, ¿cómo estás?


#### Spelling Correction

In [7]:
text=""" ABCD Corp alays values ttheir employees!!!"""

In [8]:
print(text)

 ABCD Corp alays values ttheir employees!!!


In [9]:
blob=TextBlob(text)

In [10]:
blob

TextBlob(" ABCD Corp alays values ttheir employees!!!")

In [11]:
blob.correct()

TextBlob(" ABCD For always values their employees!!!")

In [12]:
TextBlob('hasss').correct()

TextBlob("has")

In [13]:
### Sometimes it fails as well
TextBlob('ur').correct()

TextBlob("or")

### Word Count 
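With word counts, we can find how often a word (or, via np_counts, a noun phrase) occurs in a given text. Lookups in word_counts are keyed by the lowercased word, which is why the capitalized lookups below return 0.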

In [14]:
text="Sentiment Analysis is a process by which we can find the sentiment of a text. Sentiment can be Positive, Negative or Neutral"

In [15]:
blob=TextBlob(text)

In [16]:
blob.word_counts["analysis"]

1

In [17]:
blob.word_counts["Sentiment"]

0

In [18]:
blob.word_counts["sentiment"]

3

In [19]:
blob.word_counts["Analysis"]

0
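
The results above suggest that `word_counts` is keyed by lowercased words. As a rough illustration only (plain Python with a simple regex tokenizer, which is an assumption and not TextBlob's actual implementation), the same behaviour can be sketched with `collections.Counter`:

```python
import re
from collections import Counter

text = ("Sentiment Analysis is a process by which we can find the sentiment "
        "of a text. Sentiment can be Positive, Negative or Neutral")

# Assumption: lowercase everything and treat runs of word characters as words.
words = re.findall(r"\w+", text.lower())
counts = Counter(words)

print(counts["sentiment"])  # counted case-insensitively
print(counts["Sentiment"])  # no such key, because every key was lowercased
print(counts["analysis"])
```

A `Counter` returns 0 for a missing key instead of raising an error, which mirrors the lookups shown above.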

### POS Tagging
With the tags property of TextBlob, we can label each word of a sentence with its part of speech, such as noun, pronoun, verb, adverb, or adjective. The tags follow the Penn Treebank tag set used by NLTK:


| Abbreviation | Meaning |
|---|---|
| CC | coordinating conjunction |
| CD | cardinal number |
| DT | determiner |
| EX | existential there |
| FW | foreign word |
| IN | preposition/subordinating conjunction |
| JJ | adjective (large) |
| JJR | adjective, comparative (larger) |
| JJS | adjective, superlative (largest) |
| LS | list marker |
| MD | modal (could, will) |
| NN | noun, singular (cat, tree) |
| NNS | noun, plural (desks) |
| NNP | proper noun, singular (Sarah) |
| NNPS | proper noun, plural (Indians, Americans) |
| PDT | predeterminer (all, both, half) |
| POS | possessive ending (parent's) |
| PRP | personal pronoun (I, she, him, himself) |
| PRP$ | possessive pronoun (her, his, my, our) |
| RB | adverb (occasionally, swiftly) |
| RBR | adverb, comparative (faster) |
| RBS | adverb, superlative (fastest) |
| RP | particle (about) |
| TO | infinitive marker (to) |
| UH | interjection (goodbye) |
| VB | verb, base form (ask) |
| VBG | verb, gerund (judging) |
| VBD | verb, past tense (pleaded) |
| VBN | verb, past participle (reunified) |
| VBP | verb, present tense, not 3rd person singular (wrap) |
| VBZ | verb, present tense, 3rd person singular (bases) |
| WDT | wh-determiner (that, what) |
| WP | wh-pronoun (who) |
| WRB | wh-adverb (how) |

In [20]:
from textblob import TextBlob
 
text = TextBlob("My name is Adam. I like to read about NLP. I work at ABCD Corp.")
print(text.tags)


[('My', 'PRP$'), ('name', 'NN'), ('is', 'VBZ'), ('Adam', 'NNP'), ('I', 'PRP'), ('like', 'VBP'), ('to', 'TO'), ('read', 'VB'), ('about', 'IN'), ('NLP', 'NNP'), ('I', 'PRP'), ('work', 'VBP'), ('at', 'IN'), ('ABCD', 'NNP'), ('Corp', 'NNP')]


In [21]:
new_tuple=[]
for i in text.tags:
    print(i)
    if 'VBP' not in i[1]:
        new_tuple.append(i) 

('My', 'PRP$')
('name', 'NN')
('is', 'VBZ')
('Adam', 'NNP')
('I', 'PRP')
('like', 'VBP')
('to', 'TO')
('read', 'VB')
('about', 'IN')
('NLP', 'NNP')
('I', 'PRP')
('work', 'VBP')
('at', 'IN')
('ABCD', 'NNP')
('Corp', 'NNP')


In [22]:
new_tuple

[('My', 'PRP$'),
 ('name', 'NN'),
 ('is', 'VBZ'),
 ('Adam', 'NNP'),
 ('I', 'PRP'),
 ('to', 'TO'),
 ('read', 'VB'),
 ('about', 'IN'),
 ('NLP', 'NNP'),
 ('I', 'PRP'),
 ('at', 'IN'),
 ('ABCD', 'NNP'),
 ('Corp', 'NNP')]

In [23]:
value=''
for i in new_tuple:
    value=value+" " + "".join(i[0])

In [24]:
value

' My name is Adam I to read about NLP I at ABCD Corp'
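
The filter-and-join steps above can also be written more compactly. The following is a minimal sketch that does not call TextBlob: the `tags` list is copied by hand from the `text.tags` output shown earlier, and the names `kept` and `sentence` are introduced here only for illustration.

```python
# Tags copied from the text.tags output above (not recomputed here).
tags = [('My', 'PRP$'), ('name', 'NN'), ('is', 'VBZ'), ('Adam', 'NNP'),
        ('I', 'PRP'), ('like', 'VBP'), ('to', 'TO'), ('read', 'VB'),
        ('about', 'IN'), ('NLP', 'NNP'), ('I', 'PRP'), ('work', 'VBP'),
        ('at', 'IN'), ('ABCD', 'NNP'), ('Corp', 'NNP')]

# Keep every word whose tag is not VBP, then join the words with single spaces.
kept = [word for word, tag in tags if tag != 'VBP']
sentence = " ".join(kept)
print(sentence)
```

Joining once with `" ".join(...)` avoids the leading space that repeated string concatenation produces.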

#### Tokenization

* Corpus (plural: corpora) - A corpus is a collection of text data. The text may be in a single language or in a combination of two or more.

* Token - A token is a single unit of text, typically a word, a number, or a punctuation mark. The token count of a text or corpus is the total number of such units, regardless of how often each one occurs. In the simplest case, a token is a string of contiguous characters that lies between two spaces, or between a space and a punctuation mark. For example, splitting the string "abc_123_defg" on underscores ("_") yields three tokens: "abc", "123" and "defg".

**What is tokenization?**

Tokenization is the process of splitting a sentence or corpus into its smallest units, i.e. tokens.
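
To make the definition concrete, here is a minimal plain-Python sketch. The helper name `simple_tokenize` and its regular expression are assumptions made for illustration; TextBlob's own tokenizer is more sophisticated.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    # \w+ matches a run of letters, digits or underscores;
    # [^\w\s] matches one punctuation character.
    return re.findall(r"\w+|[^\w\s]", text)

# Splitting on a chosen delimiter, as in the underscore example
parts = "abc_123_defg".split("_")
print(parts)

# Splitting a sentence into words and punctuation marks
tokens = simple_tokenize("It is free, and runs on macOS.")
print(tokens)
```

This sketch keeps punctuation marks as separate tokens; a tokenizer may instead discard them, depending on the task.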

In [25]:
text="""
R is a comprehensive statistical and graphical programming language, which is fast gaining popularity among data analysts. It is free and runs on a variety of platforms, including Windows, Unix, and macOS. It provides an unparalleled platform for programming new statistical methods in an easy and straightforward manner. 
"""

In [26]:
blob_object = TextBlob(text)

In [27]:
# Word tokenization of the sample corpus
corpus_words = blob_object.words

In [28]:
corpus_words

WordList(['R', 'is', 'a', 'comprehensive', 'statistical', 'and', 'graphical', 'programming', 'language', 'which', 'is', 'fast', 'gaining', 'popularity', 'among', 'data', 'analysts', 'It', 'is', 'free', 'and', 'runs', 'on', 'a', 'variety', 'of', 'platforms', 'including', 'Windows', 'Unix', 'and', 'macOS', 'It', 'provides', 'an', 'unparalleled', 'platform', 'for', 'programming', 'new', 'statistical', 'methods', 'in', 'an', 'easy', 'and', 'straightforward', 'manner'])

In [29]:
print(len(corpus_words))

48


In [30]:
corpus_sentences= blob_object.sentences

In [31]:
corpus_sentences

[Sentence("
 R is a comprehensive statistical and graphical programming language, which is fast gaining popularity among data analysts."),
 Sentence("It is free and runs on a variety of platforms, including Windows, Unix, and macOS."),
 Sentence("It provides an unparalleled platform for programming new statistical methods in an easy and straightforward manner.")]

In [32]:
print(len(corpus_sentences))

3


#### Pluralization of words using Textblob 

In [33]:
from textblob import Word
w = Word('Platform')
w.pluralize()

'Platforms'

In [34]:
from textblob import Word
w = Word('Platforms')
w.pluralize()

'Platformss'

In [35]:
blob = TextBlob("Great Learning is a great platform to learn data science. \n It helps community through blogs, Youtube, GLA,etc.")
for word,pos in blob.tags:
    if pos == 'NN':
        print (word.pluralize())

platforms
sciences
communities
etcs


#### Lemmatization using Textblob

In [36]:
blob = TextBlob("Great Learning is a great platform to learn data science. \n It helps community through blogs, Youtube, GLA,etc.")
words = blob.words

for word in words:
    print("ORIGINAL:", word, "| LEMMA:", word.lemmatize(), "| STEM:", word.stem())

ORIGINAL: Great | LEMMA: Great | STEM: great
ORIGINAL: Learning | LEMMA: Learning | STEM: learn
ORIGINAL: is | LEMMA: is | STEM: is
ORIGINAL: a | LEMMA: a | STEM: a
ORIGINAL: great | LEMMA: great | STEM: great
ORIGINAL: platform | LEMMA: platform | STEM: platform
ORIGINAL: to | LEMMA: to | STEM: to
ORIGINAL: learn | LEMMA: learn | STEM: learn
ORIGINAL: data | LEMMA: data | STEM: data
ORIGINAL: science | LEMMA: science | STEM: scienc
ORIGINAL: It | LEMMA: It | STEM: it
ORIGINAL: helps | LEMMA: help | STEM: help
ORIGINAL: community | LEMMA: community | STEM: commun
ORIGINAL: through | LEMMA: through | STEM: through
ORIGINAL: blogs | LEMMA: blog | STEM: blog
ORIGINAL: Youtube | LEMMA: Youtube | STEM: youtub
ORIGINAL: GLA | LEMMA: GLA | STEM: gla
ORIGINAL: etc | LEMMA: etc | STEM: etc


In [37]:
w = Word('learner')
w.lemmatize("n") ## n here represents noun

'learner'

In [38]:
w = Word('learning')
w.lemmatize("v") ## v here represents verb

'learn'

In [39]:
w = Word('booking')
w.lemmatize("v") 

'book'

In [40]:
w = Word('books')
w.lemmatize("n") 

'book'

In [41]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

example_sent = """This is a sample sentence,
showing off the stop words filtration."""

stop_words = set(stopwords.words('english'))

word_tokens = word_tokenize(example_sent)
# Lowercase each token and keep it only if it is not a stop word
filtered_sentence = [w for w in word_tokens if w.lower() not in stop_words]
# str() stores the list's printed form; the next cell strips its punctuation
filtered_sentence=str(filtered_sentence)
print(word_tokens)
print(filtered_sentence)


['This', 'is', 'a', 'sample', 'sentence', ',', 'showing', 'off', 'the', 'stop', 'words', 'filtration', '.']
['sample', 'sentence', ',', 'showing', 'stop', 'words', 'filtration', '.']


In [42]:
import re
clean_sent=re.sub(r'[^\w\s]','',filtered_sentence)
clean_sent

'sample sentence  showing stop words filtration '

In [43]:
lookup_dict = {'rt':'Retweet', 'dm':'direct message', "awsm" : "awesome", "luv" :"love"}
def _lookup_words(input_text):
    """Replace known abbreviations with their expanded form."""
    words = input_text.split()
    print(words)
    new_words = []
    for word in words:
        # Match case-insensitively against the lookup table
        if word.lower() in lookup_dict:
            word = lookup_dict[word.lower()]
        new_words.append(word)
    # Join once, after the loop, so empty input also returns a string
    new_text = " ".join(new_words)
    return new_text
output=_lookup_words("RT this is a retweeted tweet by Shivam Bansal")
output



['RT', 'this', 'is', 'a', 'retweeted', 'tweet', 'by', 'Shivam', 'Bansal']


'Retweet this is a retweeted tweet by Shivam Bansal'

#### n-gram in Textblob

An N-gram is an N-token sequence of words: a 2-gram (more commonly called a bigram) is a two-word sequence of words like “really good”, “not good”, or “your homework”, and a 3-gram (more commonly called a trigram) is a three-word sequence of words like “not at all”, or “turn off light”.
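
As a minimal sketch of the idea, n-grams can be built with a sliding window over a token list. The function name `ngrams` is chosen here for illustration, and whitespace splitting stands in for a real tokenizer.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens, in order."""
    # A list of k tokens has k - n + 1 windows of length n (none if n > k).
    return [tokens[i:i + n] for i in range(len(tokens) - n + 1)]

tokens = "Great Learning is a great platform".split()
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
print(bigrams)
print(trigrams)
```

Each window overlaps the previous one by n - 1 tokens, so a text of k tokens yields k - n + 1 n-grams.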

In [44]:
blob

TextBlob("Great Learning is a great platform to learn data science. 
 It helps community through blogs, Youtube, GLA,etc.")

In [45]:
blob.ngrams(n=1)

[WordList(['Great']),
 WordList(['Learning']),
 WordList(['is']),
 WordList(['a']),
 WordList(['great']),
 WordList(['platform']),
 WordList(['to']),
 WordList(['learn']),
 WordList(['data']),
 WordList(['science']),
 WordList(['It']),
 WordList(['helps']),
 WordList(['community']),
 WordList(['through']),
 WordList(['blogs']),
 WordList(['Youtube']),
 WordList(['GLA']),
 WordList(['etc'])]

In [46]:
blob.ngrams(n=2)

[WordList(['Great', 'Learning']),
 WordList(['Learning', 'is']),
 WordList(['is', 'a']),
 WordList(['a', 'great']),
 WordList(['great', 'platform']),
 WordList(['platform', 'to']),
 WordList(['to', 'learn']),
 WordList(['learn', 'data']),
 WordList(['data', 'science']),
 WordList(['science', 'It']),
 WordList(['It', 'helps']),
 WordList(['helps', 'community']),
 WordList(['community', 'through']),
 WordList(['through', 'blogs']),
 WordList(['blogs', 'Youtube']),
 WordList(['Youtube', 'GLA']),
 WordList(['GLA', 'etc'])]

In [47]:
blob.ngrams(n=3)

[WordList(['Great', 'Learning', 'is']),
 WordList(['Learning', 'is', 'a']),
 WordList(['is', 'a', 'great']),
 WordList(['a', 'great', 'platform']),
 WordList(['great', 'platform', 'to']),
 WordList(['platform', 'to', 'learn']),
 WordList(['to', 'learn', 'data']),
 WordList(['learn', 'data', 'science']),
 WordList(['data', 'science', 'It']),
 WordList(['science', 'It', 'helps']),
 WordList(['It', 'helps', 'community']),
 WordList(['helps', 'community', 'through']),
 WordList(['community', 'through', 'blogs']),
 WordList(['through', 'blogs', 'Youtube']),
 WordList(['blogs', 'Youtube', 'GLA']),
 WordList(['Youtube', 'GLA', 'etc'])]

In [48]:
blob.ngrams(n=4)

[WordList(['Great', 'Learning', 'is', 'a']),
 WordList(['Learning', 'is', 'a', 'great']),
 WordList(['is', 'a', 'great', 'platform']),
 WordList(['a', 'great', 'platform', 'to']),
 WordList(['great', 'platform', 'to', 'learn']),
 WordList(['platform', 'to', 'learn', 'data']),
 WordList(['to', 'learn', 'data', 'science']),
 WordList(['learn', 'data', 'science', 'It']),
 WordList(['data', 'science', 'It', 'helps']),
 WordList(['science', 'It', 'helps', 'community']),
 WordList(['It', 'helps', 'community', 'through']),
 WordList(['helps', 'community', 'through', 'blogs']),
 WordList(['community', 'through', 'blogs', 'Youtube']),
 WordList(['through', 'blogs', 'Youtube', 'GLA']),
 WordList(['blogs', 'Youtube', 'GLA', 'etc'])]

In [49]:
from sklearn.feature_extraction.text import TfidfVectorizer
obj = TfidfVectorizer()
corpus = ['This is sample document.', 'another random document.', 'third sample document text']
X = obj.fit_transform(corpus)
print (X)

  (0, 1)	0.34520501686496574
  (0, 4)	0.444514311537431
  (0, 2)	0.5844829010200651
  (0, 7)	0.5844829010200651
  (1, 3)	0.652490884512534
  (1, 0)	0.652490884512534
  (1, 1)	0.3853716274664007
  (2, 5)	0.5844829010200651
  (2, 6)	0.5844829010200651
  (2, 1)	0.34520501686496574
  (2, 4)	0.444514311537431


In [50]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
for i, name in enumerate(obj.get_feature_names_out()):
    print(i, name)


0 another
1 document
2 is
3 random
4 sample
5 text
6 third
7 this




The model builds a vocabulary and assigns a column index to each word. Each line of the printed sparse matrix holds a tuple (i, j) and the TF-IDF value of the word at column index j in document i. For example:

(0, 1) 0.34520501686496574: in the first document (index 0), the word at column index 1 ("document") has a TF-IDF score of approximately 0.345.

(0, 4) 0.444514311537431: in the first document (index 0), the word at column index 4 ("sample") has a TF-IDF score of approximately 0.445.

(0, 2) 0.5844829010200651: in the first document (index 0), the word at column index 2 ("is") has a TF-IDF score of approximately 0.584.

(1, 3) 0.652490884512534: in the second document (index 1), the word at column index 3 ("random") has a TF-IDF score of approximately 0.652.

(1, 0) 0.652490884512534: in the same second document (index 1), the word at column index 0 ("another") also has a TF-IDF score of approximately 0.652.

"random" and "another" have the same score in the second document because each appears exactly once in that document (same term frequency) and in no other document (same inverse document frequency). "document" scores lowest in every row because it appears in all three documents, so its inverse document frequency is the smallest. By default, TfidfVectorizer also scales each row to unit length, so the printed values are normalized rather than raw TF × IDF products.
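
The scores can be checked by hand. The sketch below is plain Python, not scikit-learn code; it assumes the vectorizer's default settings (lowercased tokens of two or more word characters, smoothed IDF of ln((1 + n) / (1 + df)) + 1, and L2 row normalization). The helper name `tfidf_row` is introduced here for illustration.

```python
import math
import re

corpus = ['This is sample document.', 'another random document.',
          'third sample document text']

# Assumption: lowercase tokens made of two or more word characters.
docs = [re.findall(r"\b\w\w+\b", d.lower()) for d in corpus]
vocab = sorted({w for doc in docs for w in doc})
n_docs = len(docs)

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
idf = {w: math.log((1 + n_docs) / (1 + sum(w in doc for doc in docs))) + 1
       for w in vocab}

def tfidf_row(doc):
    """TF-IDF weights for one document, scaled to unit L2 norm."""
    raw = {w: doc.count(w) * idf[w] for w in set(doc)}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {w: v / norm for w, v in raw.items()}

row = tfidf_row(docs[1])  # second document: 'another random document.'
for w in sorted(row):
    print(w, round(row[w], 4))
```

Under these assumptions, "another" and "random" receive identical weights because their term frequency and document frequency are identical, while "document" is down-weighted for appearing in every document.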

In [51]:
from gensim.models import Word2Vec
sentences = [['data', 'science'], ['vidhya', 'science', 'data', 'analytics'],['machine', 'learning'], ['deep', 'learning']]

# Train the model on the corpus.
# min_count=1 keeps words that appear only once in the training data,
# so they are included in the vocabulary and receive embeddings.
model1 = Word2Vec(sentences, min_count=1)
print (model1.wv.similarity('data', 'science'))
print (model1.wv.similarity('vidhya', 'science'))  
print (model1.wv.similarity('vidhya', 'vidhya'))

-0.02367166
-0.11410725
1.0


The value returned by model1.wv.similarity is the cosine similarity between the two word vectors, which ranges from -1 to 1.

Similarity scores near 1: the vectors point in nearly the same direction. Such words tend to occur in similar contexts and are usually related in meaning. A word compared with itself always scores 1.0, as in the last line of the output.

Similarity scores around 0: the vectors are close to unrelated. The words show no strong relationship in their co-occurrence patterns.

Similarity scores near -1: the vectors point in nearly opposite directions, so the words tend to occur in very different contexts. This does not necessarily mean that the words are antonyms.

The slightly negative scores above (-0.02 and -0.11) are close to 0. With a corpus of only four short sentences, the model has far too little data to learn meaningful vectors, so these values mostly reflect the random initialization and should not be read as a real semantic relationship.
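
For reference, cosine similarity itself is simple to compute. The sketch below uses small hand-written vectors rather than real embeddings; the function name `cosine_similarity` and the example vectors are chosen here for illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])
print(same_direction, orthogonal, opposite)
```

The three cases correspond to scores near 1, around 0, and near -1. Only the direction of the vectors matters, not their length.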