
Converting Text to Features Using One Hot Encoding
--
One Hot encoding is the traditional method of feature engineering, and
anyone who knows the basics of machine learning has almost certainly come
across it. It converts a categorical variable into features (columns) and
codes a one or a zero for the presence or absence of each category. We use
the same logic here: the number of features equals the number of distinct
tokens present in the whole corpus.

**One Hot encoding was also covered under "Machine Learning Course".**

Problem
--
You want to convert text to features using One Hot encoding.

Solution
--
One Hot encoding converts each word (or character) into a binary
vector, as shown below.

In [1]:
Text = "I am learning NLP and it is fun"

# Importing the library
import pandas as pd

# Generating the features
pd.get_dummies(Text.split())

Unnamed: 0,I,NLP,am,and,fun,is,it,learning
0,True,False,False,False,False,False,False,False
1,False,False,True,False,False,False,False,False
2,False,False,False,False,False,False,False,True
3,False,True,False,False,False,False,False,False
4,False,False,False,True,False,False,False,False
5,False,False,False,False,False,False,True,False
6,False,False,False,False,False,True,False,False
7,False,False,False,False,True,False,False,False


The output has 8 features because the input contains 8 distinct words.
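
To make the mechanics visible, the same encoding can be sketched in plain Python without pandas. This is only an illustrative sketch: the variable names are made up for this example, and the columns are sorted so that they line up with the `get_dummies` output above.

```python
# Minimal one-hot encoder written by hand (illustrative sketch).
text = "I am learning NLP and it is fun"
tokens = text.split()

# One column per distinct token; sorting mirrors the column order shown above.
vocabulary = sorted(set(tokens))

# One row per token: 1 in the token's own column, 0 everywhere else.
one_hot = [[1 if token == column else 0 for column in vocabulary]
           for token in tokens]

print(vocabulary)
for token, row in zip(tokens, one_hot):
    print(f"{token:>8} {row}")
```

Every row contains exactly one 1, which is why this representation records presence only and says nothing about how often a word occurs.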

Converting Text to Features Using Count Vectorizing
--
The One Hot encoding approach above has a disadvantage: it does not take
the frequency of a word into consideration. If a particular word appears
multiple times, that information is lost. A count vectorizer solves this
problem.

Problem
--
How do we convert text to features using a count vectorizer?

Solution
--
A count vectorizer is very similar to One Hot encoding. The only
difference is that, instead of checking whether a particular word is
present, it counts how many times each word occurs in the document.
Observe the example below. The words “I” and “NLP” each occur twice in
the document.

In [2]:
#importing the function
from sklearn.feature_extraction.text import CountVectorizer

# Text
text = ["I love nlp and I will learn NLP in 4 sessions"]

# create the CountVectorizer
vectorizer = CountVectorizer()

# tokenizing
vectorizer.fit(text)

# encode document
vector = vectorizer.transform(text)
# counting is case-insensitive (text is lowercased by default)

# summarize & generating output
print(vectorizer.vocabulary_)
print(vector.toarray())

{'love': 3, 'nlp': 4, 'and': 0, 'will': 6, 'learn': 2, 'in': 1, 'sessions': 5}
[[1 1 1 1 2 1 1]]


The fifth token, nlp (index 4), appears twice in the document.

**Note** : By default, CountVectorizer ignores single-character tokens such as `I`, `a` and `4`.
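
The counting step can be sketched in plain Python to show where the numbers come from. This is a rough approximation, not scikit-learn's implementation: it assumes the default behaviour of lowercasing the text and keeping only tokens with at least two word characters.

```python
import re
from collections import Counter

text = "I love nlp and I will learn NLP in 4 sessions"

# Lowercase, then keep only tokens made of two or more word characters.
# This is why "I" and "4" never reach the vocabulary.
tokens = re.findall(r"\b\w\w+\b", text.lower())

counts = Counter(tokens)

# Feature indices follow the alphabetical order of the terms.
vocabulary = {term: index for index, term in enumerate(sorted(counts))}
vector = [counts[term] for term in sorted(counts)]

print(vocabulary)
print(vector)
```

The resulting vector has a 2 at the index of `nlp` and a 1 everywhere else, in line with the output above.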

Generating N-grams
--
In the methods above, each word is treated as a separate feature.
The drawback is that neighbouring words are ignored, even when they are needed to give a word its proper and complete meaning.

For example, consider the phrase “not bad.” If it is split into individual
words, it no longer conveys “good,” which is what the phrase actually
means.

Many words only make sense once they are put together, so splitting them apart can lose information or insight. N-grams solve this problem.

An N-gram is a sequence of N adjacent letters or words. Because
neighbours are kept together, the previous and next words are captured.

• Unigrams are the individual words in the sentence. <br />
• Bigrams are sequences of 2 adjacent words. <br />
• Trigrams are sequences of 3 adjacent words, and so on. <br />

For example,
“I am learning NLP”

Unigrams: “I”, “am”, “learning”, “NLP”

Bigrams: “I am”, “am learning”, “learning NLP”

Trigrams: “I am learning”, “am learning NLP”
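
The sliding-window idea behind these lists can be sketched in a few lines of plain Python. The helper name `ngrams` is made up for this illustration and is not a library function.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    # zip stops at the shortest slice, so incomplete runs are dropped.
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = "I am learning NLP".split()
for n in (1, 2, 3):
    print(n, ngrams(tokens, n))
```

A sentence with k tokens yields k - n + 1 N-grams, so the four-word example gives 4 unigrams, 3 bigrams and 2 trigrams.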

Problem
--
Generate the N-grams for the given sentence.

Solution
--
Many packages can generate N-grams. TextBlob is one commonly used
option.

In [3]:
import nltk
nltk.download('punkt')

Text = "I am learning NLP"

# Use the below TextBlob function to create N-grams.
# Use the text that is defined above
# and mention the “n” based on the requirement.

#Import textblob
from textblob import TextBlob

#For unigram : Use n = 1
TextBlob(Text).ngrams(1)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


[WordList(['I']), WordList(['am']), WordList(['learning']), WordList(['NLP'])]

In [4]:
#For Bigram : use n = 2
TextBlob(Text).ngrams(2)

[WordList(['I', 'am']),
 WordList(['am', 'learning']),
 WordList(['learning', 'NLP'])]

In [5]:
#For trigram : use n = 3
TextBlob(Text).ngrams(3)

[WordList(['I', 'am', 'learning']), WordList(['am', 'learning', 'NLP'])]

Simple case study - to understand better
--

**input** :
hello, thank you all for participating in the workshop. Participation in the workshop was worth it.

**after lemmatization and basic text cleaning** :
hello thank you all for participate in the workshop. Participate in the workshop was worth it.


**n=4 (4-gram) model of the above** :
[hello thank you all], [thank you all for], [you all for participate] ...
[participate in the workshop], ... [Participate in the workshop], ...


**Now apply a count vectorizer (CV)** :
[participate in the workshop] -> 2
and every other 4-gram has a frequency of 1

The context of this text is therefore :
[participate in the workshop]

The most frequent N-gram brings out what the document is about, which helps to classify the document.
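
The case study can be reproduced with a short plain-Python sketch. It assumes that the cleaning step has also removed punctuation and lowercased the text, so that both occurrences of the phrase match exactly.

```python
from collections import Counter

# Cleaned text from the case study, with punctuation removed (assumption).
cleaned = ("hello thank you all for participate in the workshop "
           "participate in the workshop was worth it")
tokens = cleaned.lower().split()

# Build every 4-gram and count how often each one occurs.
n = 4
four_grams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
counts = Counter(four_grams)

# The most frequent 4-gram is taken as the context of the text.
phrase, frequency = counts.most_common(1)[0]
print(phrase, frequency)
```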

<hr />

**Now suppose I give you 20 such sentences and ask which of them are similar.**

<br />
Steps :
1. Apply the procedure above to every sentence.
2. Sentences with a similar context form a class.
<br />
For example :
Suppose sent_1 has the context [participate in the workshop]
Suppose sent_9 has the context [fee for workshop participate]
<br />
Hence both of these sentences would be classified under one class.
<br />
Key idea : this is document classification in NLP.

Generating Bigram-based features for a document using CountVectorizer
--

In the last code example we used the TextBlob class to generate N-grams. We can do the same with a count vectorizer, which also turns the N-grams into features.

In [None]:
#importing the function
from sklearn.feature_extraction.text import CountVectorizer

# Text
text = ["I love NLP and I will learn NLP in 2month"]

# create the transform
vectorizer = CountVectorizer(ngram_range=(1,2))
                            #ngram_range=(min,max)

# tokenizing
vectorizer.fit(text)

# encode document
vector = vectorizer.transform(text)

# summarize & generating output
print(vectorizer.vocabulary_)
print(vector.toarray())

{'love': 7, 'nlp': 9, 'and': 1, 'will': 12, 'learn': 5, 'in': 3, '2month': 0, 'love nlp': 8, 'nlp and': 10, 'and will': 2, 'will learn': 13, 'learn nlp': 6, 'nlp in': 11, 'in 2month': 4}
[[1 1 1 1 1 1 1 1 1 2 1 1 1 1]]


The output contains both unigram and bigram features. In our example the
count is one for every feature except `nlp`, which occurs twice.

**Note** : By default, CountVectorizer does not treat single-letter words such as `I` as tokens, which is why the bigram `and will` appears.
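
A plain-Python sketch of what `ngram_range=(1,2)` does may help when reading the vocabulary above. It is an approximation under the same assumptions as the default settings (lowercasing, tokens of at least two word characters), not scikit-learn's actual code.

```python
import re
from collections import Counter

text = "I love NLP and I will learn NLP in 2month"
tokens = re.findall(r"\b\w\w+\b", text.lower())

# Count unigrams and bigrams together, mirroring ngram_range=(1, 2).
features = Counter()
for n in (1, 2):
    for i in range(len(tokens) - n + 1):
        features[" ".join(tokens[i:i + n])] += 1

vocabulary = {term: index for index, term in enumerate(sorted(features))}
vector = [features[term] for term in sorted(features)]

print(vocabulary)
print(vector)
```

Because the single-letter "I" is dropped before the N-grams are formed, "and" and "will" become neighbours, which explains the `and will` feature.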

Hash Vectorizing
--
CountVectorizer has one limitation. In this method, the vocabulary can
become very large and cause memory/computation issues.

One of the ways to solve this problem is a Hash Vectorizer.

Problem
--
Understand and generate a Hash Vectorizer.

Solution
--
A Hash Vectorizer is memory efficient: instead of storing the tokens
as strings, it applies the hashing trick to encode them as numerical
indexes. The downside is that the mapping is one way; once the text is
vectorized, the original features cannot be retrieved.
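
The hashing trick itself can be sketched in plain Python. This is only an illustration: it uses MD5 from the standard library as a stand-in hash, whereas scikit-learn's HashingVectorizer uses a signed MurmurHash3 and L2-normalizes each row by default, so the numbers here will not match its output. The function name `hashed_counts` is made up for this example.

```python
import hashlib

def hashed_counts(text, n_features):
    """Map each token to a column by hashing it; no vocabulary is stored."""
    vector = [0] * n_features
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        # Take 4 bytes of the digest as an integer and fold it into range.
        column = int.from_bytes(digest[:4], "little") % n_features
        vector[column] += 1
    return vector

vector = hashed_counts("the quick brown fox and the lazy dog", n_features=3)
print(vector)
```

With only 3 columns for 7 distinct words, several words must share a column. Such collisions are the price of the fixed memory footprint, and they are also why the original tokens cannot be recovered from the vector.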

In [6]:
from sklearn.feature_extraction.text import HashingVectorizer

# list of text documents
text = ["The quick brown fox and the lazy dog."]

# transform
vectorizer = HashingVectorizer(n_features=3)
# recommended reading : https://machinelearningmastery.com/prepare-text-data-machine-learning-scikit-learn/
# https://datascience.stackexchange.com/questions/22250/what-is-the-difference-between-a-hashing-vectorizer-and-a-tfidf-vectorizer

# create the hashing vector
vector = vectorizer.transform(text)

# summarize the vector
print(vector.shape)
print(vector.toarray())

(1, 3)
[[-0.81649658  0.40824829 -0.40824829]]


It created a vector of size 3, which can now be used for any
supervised or unsupervised task.

The trainer and participants should discuss how to interpret the output vector above.

Recommended reading :

https://machinelearningmastery.com/prepare-text-data-machine-learning-scikit-learn/

https://datascience.stackexchange.com/questions/22250/what-is-the-difference-between-a-hashing-vectorizer

Converting Text to Features Using TF-IDF
--
The text-to-feature methods above still have a few drawbacks, which is
why TF-IDF was introduced.

The problem, and how TF-IDF addresses it:

• If a particular word appears in all the documents of the corpus, the previous methods give it high importance. That is bad for our analysis.

• TF-IDF reflects how important a word is to a document within a collection, and so down-weights words that appear frequently across all the documents.

Problem
--
Text to feature using TF-IDF.

Solution
--
Term frequency (TF): Term frequency is the ratio of the count of a
word in a document to the length of that document.

TF captures the importance of a word relative to the length of the
document. For example, a word that occurs 3 times in a document of 10
words is not the same as a word that occurs 3 times in a document of 100
words. It should get more importance in the first scenario; that is what
TF does.

Inverse Document Frequency (IDF): The IDF of a word is the log of the ratio of the total number of documents (rows) to the number of documents in which that word is present.

IDF = log(N/n), where N is the total number of documents and n is the
number of documents in which the word is present.

IDF measures the rareness of a term. Words like “a” and “the” show
up in all the documents of the corpus, but rare words do not. If a word
appears in almost all documents, it is of little use to us, since it does
not help with classification or information retrieval. IDF down-weights
such words.

TF-IDF is simply the product of TF and IDF, so both drawbacks are addressed, which makes predictions and information retrieval more relevant.
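
The definitions above translate directly into a few lines of plain Python. The sketch below uses a small corpus of four lowercased sentences with punctuation removed; the function names are made up for this illustration. Note that it follows the plain log(N/n) formula, whereas scikit-learn's TfidfVectorizer uses a smoothed variant with normalization, so its numbers differ.

```python
import math

documents = [
    "the quick brown fox jumped over the lazy dog".split(),
    "the dog".split(),
    "the fox".split(),
    "the is a dummy example".split(),
]

def tf(term, document):
    # Count of the term divided by the length of the document.
    return document.count(term) / len(document)

def idf(term, corpus):
    # log(N / n); assumes the term occurs in at least one document.
    containing = sum(1 for document in corpus if term in document)
    return math.log(len(corpus) / containing)

def tf_idf(term, document, corpus):
    return tf(term, document) * idf(term, corpus)

for term in ("the", "fox", "quick"):
    print(term, round(tf_idf(term, documents[0], documents), 4))
```

Under this definition "the" scores exactly 0 because it occurs in every document (log(4/4) = 0), while the rarer "quick" scores higher than "fox".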

In [7]:
Text = ["The quick brown fox jumped over the lazy dog.",
"The dog.",
"The fox",
"The is a dummy example"]

#Import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

#Create the TfidfVectorizer
vectorizer = TfidfVectorizer()

#Tokenize and build vocab
vectorizer.fit(Text)

#Summarize
print(vectorizer.vocabulary_)
#print(vectorizer.idf_)

{'the': 10, 'quick': 9, 'brown': 0, 'fox': 4, 'jumped': 6, 'over': 8, 'lazy': 7, 'dog': 1, 'is': 5, 'dummy': 2, 'example': 3}


If you observe, “the” appears in all 4 documents and so does not add
much value. Its IDF weight (what the commented-out `vectorizer.idf_` line
would print) is therefore 1, which is lower than that of every other token.

All the methods we have looked at so far are based on frequency and are
hence called frequency-based embeddings or features. In the next section,
let us look at prediction-based embeddings, typically called word
embeddings.

In [8]:
# encode document
vector = vectorizer.transform(["The quick brown fox jumped over the lazy dog"])

# summarize encoded vector
print(vector.shape)
print(vector.toarray())

(1, 11)
[[0.36929648 0.29115758 0.         0.         0.29115758 0.
  0.36929648 0.36929648 0.36929648 0.36929648 0.38542844]]
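
The numbers in this vector can be reproduced by hand. The sketch below assumes TfidfVectorizer's default settings: raw counts as term frequency, a smoothed IDF of ln((1 + N) / (1 + n)) + 1, and L2 normalization of each row. It is written in plain Python for illustration and is not scikit-learn's code.

```python
import math
from collections import Counter

corpus = [
    "the quick brown fox jumped over the lazy dog",
    "the dog",
    "the fox",
    "the is a dummy example",
]
documents = [doc.split() for doc in corpus]
n_docs = len(documents)

def smoothed_idf(term):
    # Smoothed IDF: ln((1 + N) / (1 + n)) + 1
    df = sum(1 for doc in documents if term in doc)
    return math.log((1 + n_docs) / (1 + df)) + 1

# Weight each term of the first document by count * idf, then L2-normalize.
counts = Counter(documents[0])
weights = {term: count * smoothed_idf(term) for term, count in counts.items()}
norm = math.sqrt(sum(w * w for w in weights.values()))
tfidf = {term: w / norm for term, w in weights.items()}

for term in ("the", "quick", "fox"):
    print(term, round(tfidf[term], 4))
```

This also explains why "the" still has the largest entry for this sentence: its IDF is the lowest possible, but it occurs twice in the sentence, and the raw count is multiplied in before normalization.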


Implementing Word Embeddings
--
This section assumes that you have a working knowledge of how a neural
network works and you know terms like :

a. Deep learning

b. Perceptron and Sigmoid

c. FFNN ( feed forward Neural Network)

d. RNN (Recurrent Neural Network )

**(If you are new to neural networks (NN), it is suggested that you go through Chapter 1 to gain a basic understanding of how an NN works.)**

Even though the previous methods solve most problems, they fail once we get into more complicated problems where we want to capture the semantic relations between words.

Below are the challenges:

• All these techniques fail to capture the context and meaning of the words. All the methods discussed so far depend on the appearance or frequency of the words. But we need to capture the context or semantic relations: that is, which words frequently appear close to each other.

>a. I am eating an apple.

>b. I am using apple.

If you observe the above example, “apple” has a different meaning depending on the adjacent words, “eating” and “using.”

• For a problem like document classification (for example, classifying books in a library), a document can be huge and generate an enormous number of tokens.
In these scenarios the number of features can get out of control, hampering both accuracy and performance.

A machine/algorithm can match two documents/texts and say whether they are the same or not. But how do we make machines tell you about cricket or Virat Kohli when you search for MS Dhoni? How do you make a machine understand that “Apple” in “Apple is a tasty fruit” is a fruit that can be eaten and not a company?

The answer to the above questions lies in creating a representation for words that captures their meanings, semantic relationships, and the different types of contexts they are used in.

> The above challenges are addressed by Word Embeddings.

Word embedding is a feature learning technique in which words from the vocabulary are mapped to vectors of real numbers that capture their contextual relationships.

If you observe the below table, every word is represented by a vector of 4 numbers. Using the word embeddings technique, we are going to derive such a vector for each and every word so that we can use it in future analysis. In the below example, the dimension is 4. But we usually use a dimension greater than 100.

<img src="https://drive.google.com/uc?id=165llWGYsReLC4BCtyZs6ZLYeggkg1k1m">


Problem
--
You want to implement word embeddings.

Solution
--
Word embeddings are prediction based: a shallow neural network is trained on a prediction task, and the weights it learns are used as the vector representations of the words.

word2vec
--
word2vec is a framework developed at Google for training word embeddings with shallow neural networks. It uses all the words of the whole corpus and learns to predict
the nearby words. It creates a vector for every word present in the
corpus in such a way that the context is captured. It performs well on
word similarity and word analogy tasks.

There are mainly 2 types of word2vec model.

• Skip-Gram

• Continuous Bag of Words (CBOW)

<img src="https://drive.google.com/uc?id=1ZC7kOYkuY2BGRCONWde38usTOCRJqJlR" altext = "word2vec_block_diagram">


The above figure shows the architecture of the CBOW and skip-gram
algorithms used to build word embeddings.

CBOW vs Skip-Gram
-------
CBOW (Continuous Bag-Of-Words) is about creating a network that tries to predict the word in the middle given some surrounding words: [W[-3], W[-2], W[-1], W[1], W[2], W[3]] => W[0]

Skip-Gram is the opposite of CBOW: it tries to predict the surrounding words given the word in the middle: W[0] => [W[-3], W[-2], W[-1], W[1], W[2], W[3]]
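
The two framings can be made concrete by generating the training examples for one sentence. The following is a simplified sketch with a made-up helper name; it uses a fixed window of 2 and ignores refinements that real implementations apply.

```python
def training_examples(tokens, window):
    """Return (context_words, target_word) for every position."""
    examples = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        examples.append((left + right, target))
    return examples

tokens = ['I', 'will', 'learn', 'nlp', 'in', '2', 'months']

# CBOW: the whole context is the input, the middle word is the label.
cbow_examples = training_examples(tokens, window=2)

# Skip-gram: the middle word is the input, each context word is a label.
skipgram_pairs = [(target, word)
                  for context, target in cbow_examples
                  for word in context]

print(cbow_examples[3])
print(skipgram_pairs[:5])
```

Both models see the same windows; they differ only in which side of each window is treated as the input and which as the prediction target.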


Let us see how these models work in detail.

Skip-Gram
--
The skip-gram model is used to predict the probabilities of the context words given a target word.

Let us take a small sentence and understand how it actually works.
Each position in the sentence yields a target word and its context, which
are the words nearby. The number of words considered on each side of the
target word is called the window size. The table below shows all the
possible target and context pairs for window size 2. The window size needs
to be selected based on the data and the resources at your disposal: the
larger the window size, the more computing power is required.

<img src="https://drive.google.com/uc?id=18nKDL_JAX96Zs_ILGMrcdd517GWLwrW2" altext = "skipgram_output">



Since it takes a lot of text and computing power, let us go ahead and
take sample data and build a skip-gram model.

As mentioned in Part 3 of this course, import the text corpus and break it into sentences. Then perform cleaning and preprocessing, such as removing
punctuation and digits, and split the sentences into words or tokens.

In [9]:
#Example sentences
sentences = [['I', 'love', 'nlp'],
['I', 'will', 'learn', 'nlp', 'in', '2','months'],
['nlp', 'is', 'future'],
[ 'nlp', 'saves', 'time', 'and', 'solves',
'lot', 'of', 'industry', 'problems'],
['nlp', 'uses', 'machine', 'learning']]

#import library
#!pip install gensim
import gensim
from gensim.models import Word2Vec
#from sklearn.decomposition import PCA
from matplotlib import pyplot

In [10]:
# training the model
skipgram = Word2Vec(sentences, vector_size=50, window = 3, min_count=1, sg=1)
# vector_size=50 -> size of the vector that represents each token or word (default 100)
# window=3 -> maximum distance between the target word and its neighboring words (default 5)
# min_count=1 -> minimum frequency count of words.
#                The model ignores words that occur fewer than min_count times.
#                Extremely infrequent words are usually unimportant. (default 5)
# workers -> number of worker threads used for training (default 3)
# sg -> training algorithm: CBOW (0, the default)
#                           or skip-gram (1).


# access vector for one word
print(skipgram.wv['nlp'])
## skipgram.wv['nlp'] accesses the word vector for the word 'nlp' in the Word2Vec model skipgram.

# Since our vector size parameter was 50, the model
# gives a vector of size 50 for each word.

[-1.0724545e-03  4.7286271e-04  1.0206699e-02  1.8018546e-02
 -1.8605899e-02 -1.4233618e-02  1.2917745e-02  1.7945977e-02
 -1.0030856e-02 -7.5267432e-03  1.4761009e-02 -3.0669428e-03
 -9.0732267e-03  1.3108104e-02 -9.7203208e-03 -3.6320353e-03
  5.7531595e-03  1.9837476e-03 -1.6570430e-02 -1.8897636e-02
  1.4623532e-02  1.0140524e-02  1.3515387e-02  1.5257311e-03
  1.2701781e-02 -6.8107317e-03 -1.8928028e-03  1.1537147e-02
 -1.5043275e-02 -7.8722071e-03 -1.5023164e-02 -1.8600845e-03
  1.9076237e-02 -1.4638334e-02 -4.6675373e-03 -3.8754821e-03
  1.6154874e-02 -1.1861792e-02  9.0324880e-05 -9.5074680e-03
 -1.9207101e-02  1.0014586e-02 -1.7519170e-02 -8.7836506e-03
 -7.0199967e-05 -5.9236289e-04 -1.5322480e-02  1.9229487e-02
  9.9641159e-03  1.8466286e-02]


Recommended reading :
--
https://medium.freecodecamp.org/how-to-get-started-with-word2vec-and-then-how-to-make-it-work-d0a2fca9dad3

https://nlpforhackers.io/word-embeddings/

https://machinelearningmastery.com/develop-word-embeddings-python-gensim/

In [16]:
# access the vector for 'nlp' again
print(skipgram.wv['nlp'])

[-1.0724545e-03  4.7286271e-04  1.0206699e-02  1.8018546e-02
 -1.8605899e-02 -1.4233618e-02  1.2917745e-02  1.7945977e-02
 -1.0030856e-02 -7.5267432e-03  1.4761009e-02 -3.0669428e-03
 -9.0732267e-03  1.3108104e-02 -9.7203208e-03 -3.6320353e-03
  5.7531595e-03  1.9837476e-03 -1.6570430e-02 -1.8897636e-02
  1.4623532e-02  1.0140524e-02  1.3515387e-02  1.5257311e-03
  1.2701781e-02 -6.8107317e-03 -1.8928028e-03  1.1537147e-02
 -1.5043275e-02 -7.8722071e-03 -1.5023164e-02 -1.8600845e-03
  1.9076237e-02 -1.4638334e-02 -4.6675373e-03 -3.8754821e-03
  1.6154874e-02 -1.1861792e-02  9.0324880e-05 -9.5074680e-03
 -1.9207101e-02  1.0014586e-02 -1.7519170e-02 -8.7836506e-03
 -7.0199967e-05 -5.9236289e-04 -1.5322480e-02  1.9229487e-02
  9.9641159e-03  1.8466286e-02]


Note : The lookup above succeeds because 'nlp' is in the training data. Asking
for a word that was never seen during training, for example `skipgram.wv['deep']`,
raises an error saying the word does not exist. This is the reason we need to train the
algorithm on as much data as possible, so that we do not miss out on words.

Continuous Bag of Words (CBOW)
--
Now let’s see how to build a CBOW model. (It is very similar to the skip-gram model.)

In [11]:
#import library
from gensim.models import Word2Vec
#from sklearn.decomposition import PCA
from matplotlib import pyplot

#Example sentences
sentences = [['I', 'love', 'nlp'],
['I', 'will', 'learn', 'nlp', 'in', '2','months'],
['nlp', 'is', 'future'],
[ 'nlp', 'saves', 'time', 'and', 'solves',
'lot', 'of', 'industry', 'problems'],
['nlp', 'uses', 'machine', 'learning']]

In [12]:
# training the model
cbow = Word2Vec(sentences, vector_size=50, window = 3, min_count=1, sg=0)
# vector_size=50 -> size of the vector that represents each token or word
# window=3 -> maximum distance between the target word and its neighboring words
# min_count=1 -> minimum frequency count of words.
#                The model ignores words that occur fewer than min_count times.
#                Extremely infrequent words are usually unimportant.
# workers -> number of worker threads used for training
# sg=0 -> CBOW (the default); sg=1 would select skip-gram


# access vector for one word
print(cbow.wv['nlp'])

[-1.0724545e-03  4.7286271e-04  1.0206699e-02  1.8018546e-02
 -1.8605899e-02 -1.4233618e-02  1.2917745e-02  1.7945977e-02
 -1.0030856e-02 -7.5267432e-03  1.4761009e-02 -3.0669428e-03
 -9.0732267e-03  1.3108104e-02 -9.7203208e-03 -3.6320353e-03
  5.7531595e-03  1.9837476e-03 -1.6570430e-02 -1.8897636e-02
  1.4623532e-02  1.0140524e-02  1.3515387e-02  1.5257311e-03
  1.2701781e-02 -6.8107317e-03 -1.8928028e-03  1.1537147e-02
 -1.5043275e-02 -7.8722071e-03 -1.5023164e-02 -1.8600845e-03
  1.9076237e-02 -1.4638334e-02 -4.6675373e-03 -3.8754821e-03
  1.6154874e-02 -1.1861792e-02  9.0324880e-05 -9.5074680e-03
 -1.9207101e-02  1.0014586e-02 -1.7519170e-02 -8.7836506e-03
 -7.0199967e-05 -5.9236289e-04 -1.5322480e-02  1.9229487e-02
  9.9641159e-03  1.8466286e-02]


Important Observation
--
Training these models on a realistic corpus requires a huge amount of
computing power. So, let us go ahead and use Google’s pre-trained model,
which has been trained on over 100 billion words.

<font color='red'> <u>Note</u> : The Google model is so large that loading it may raise a ValueError like this :  ValueError: array is too big; `arr.size * arr.dtype.itemsize` is larger than the maximum possible size. </font>

In [17]:
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

# import gensim package
import gensim

# load the saved model
#model = gensim.models.KeyedVectors.load_word2vec_format('datasets/GoogleNews-vectors-negative300.bin', binary=True)
import gensim.downloader as api
wv = api.load('word2vec-google-news-300')




In [21]:
#Checking how similarity works.
print (wv.similarity('her', 'she'))

0.7834683


In [25]:
#Lets check one more.
print (wv.similarity('pizza', 'water'))

0.12112989


> Analysis on similarity :

“her” and “she” are highly similar (around 78 %), but the similarity
between the words “pizza” and “water” is poor (just 12 %). For any given pair of words, the model takes the vectors of both words and calculates the similarity between them.
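
The similarity score is the cosine of the angle between the two word vectors. Here is a minimal sketch of that calculation in plain Python; the 3-dimensional vectors are hypothetical values made up for illustration and are not taken from the Google model.

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors, invented for this example only.
vectors = {
    "her": [0.9, 0.8, 0.1],
    "she": [0.8, 0.9, 0.2],
    "pizza": [0.1, -0.2, 0.9],
}

print(round(cosine_similarity(vectors["her"], vectors["she"]), 3))
print(round(cosine_similarity(vectors["her"], vectors["pizza"]), 3))
```

Vectors pointing in nearly the same direction score close to 1, and unrelated directions score close to 0, regardless of the vectors' lengths.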

In [26]:
# Finding the odd one out.
wv.doesnt_match('breakfast cereal dinner lunch'.split())

'cereal'

Of ‘breakfast’, ‘cereal’, ‘dinner’ and ‘lunch’, only ‘cereal’ is not a meal, so it is
the least related to the remaining 3 words.

In [27]:
# It can also find relations between words.
wv.most_similar(positive=['woman', 'king'],negative=['man'])
# default value of topn is 10

# try this too :
# wv.most_similar(positive=['woman', 'king'],negative=['man'], topn=1)

[('queen', 0.7118193507194519),
 ('monarch', 0.6189674139022827),
 ('princess', 0.5902431011199951),
 ('crown_prince', 0.5499460697174072),
 ('prince', 0.5377321839332581),
 ('kings', 0.5236844420433044),
 ('Queen_Consort', 0.5235945582389832),
 ('queens', 0.5181134343147278),
 ('sultan', 0.5098593831062317),
 ('monarchy', 0.5087411999702454)]

<img src="https://drive.google.com/uc?id=11Yu1Gj4Rw5BccL6KXnT_rXqYPyJbEUfZ" altext = "word2Vec_google_sample_output_image">
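
The analogy query boils down to vector arithmetic followed by a nearest-neighbour search. The sketch below illustrates the idea with hypothetical 2-dimensional vectors (roughly "royalty" and "gender"), invented for this example. It is a simplification: gensim normalizes the input vectors before combining them, which this sketch skips.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

# Hypothetical toy vectors: (royalty, gender). Values are made up.
vectors = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "prince": [0.8, 1.0],
    "apple": [-1.0, 0.1],
}

def analogy(positive, negative, topn=1):
    """Add the positive vectors, subtract the negative ones, rank the rest."""
    dims = len(next(iter(vectors.values())))
    query = [0.0] * dims
    for word in positive:
        query = [q + x for q, x in zip(query, vectors[word])]
    for word in negative:
        query = [q - x for q, x in zip(query, vectors[word])]
    # The input words themselves are excluded from the candidates.
    candidates = [w for w in vectors if w not in positive and w not in negative]
    ranked = sorted(candidates,
                    key=lambda w: cosine_similarity(query, vectors[w]),
                    reverse=True)
    return ranked[:topn]

print(analogy(positive=["woman", "king"], negative=["man"]))
```

Here king - man + woman lands exactly on the toy vector for "queen", which is the behaviour the real model approximates in 300 dimensions.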

Implementing fastText
--
fastText is another framework, developed by Facebook, for learning word
representations that capture context and meaning.

Problem
--
How to implement fastText in Python.

Solution
--
fastText is an improved version of word2vec. word2vec treats each whole
word as the unit when building representations, whereas fastText also uses
the character n-grams inside each word when computing its representation.
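
To see what "character n-grams" means in practice, here is a small sketch that splits a word into its subword units. The helper name `char_ngrams` is made up for this illustration; the `<` and `>` boundary markers and the 3-to-6 default range follow the usual description of fastText, and the real library additionally hashes these n-grams into buckets.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """List the character n-grams of a word wrapped in boundary markers."""
    wrapped = f"<{word}>"
    return [wrapped[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(wrapped) - n + 1)]

print(char_ngrams("where", 3, 3))
print(char_ngrams("nlp", 1, 2))
```

Because a word's vector is built from these pieces, a word that never appeared in training can still get a vector as long as its n-grams did.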

In [None]:
# Import FastText
from gensim.models import FastText

# Example sentences related to food ingredients and recipes
sentences = [
    ['pizza', 'toppings', 'include', 'cheese', 'tomato', 'pepperoni'],
    ['spaghetti', 'sauce', 'ingredients', 'include', 'tomato', 'garlic', 'onion'],
    ['chocolate', 'chip', 'cookie', 'recipe', 'calls', 'for', 'butter', 'flour', 'sugar', 'chocolate'],
    ['chicken', 'noodle', 'soup', 'requires', 'chicken', 'noodles', 'carrots', 'celery', 'broth'],
    ['vegetarian', 'taco', 'recipe', 'includes', 'black', 'beans', 'corn', 'bell', 'peppers', 'avocado']
]

# Train the FastText model
fast = FastText(sentences, vector_size=10, window=3, min_count=2, workers=5, min_n=1, max_n=2)
# vector_size=10 -> size of the vector that represents each token or word
# window=3 -> maximum distance between the target word and its neighboring words
# min_count=2 -> minimum frequency count of words.
#                The model ignores words that occur fewer than min_count times.
#                Extremely infrequent words are usually unimportant.
# workers -> number of worker threads used for training
# min_n=1, max_n=2 -> the model uses character n-grams of length 1 to 2.
# By default, min_n=3 and max_n=6 in the FastText model.

# Perform vector arithmetic to find a word analogy using the FastText model
similar_words = fast.wv.most_similar(positive=['cheese', 'bread'], negative=['pizza'], topn=3)
## Here we look for words close to "cheese" + "bread" - "pizza".
## With min_count=2, "cheese", "bread" and "pizza" are not in the vocabulary;
## FastText builds their vectors from character n-grams instead.
## On such a tiny corpus the result is illustrative rather than meaningful.

print(similar_words)

[('recipe', 0.7209869027137756), ('chicken', 0.4884187579154968), ('chocolate', -0.04585491493344307)]


Here's a use case that demonstrates the usage of FastText for building a text classification model:

### Use Case: Text Classification for Customer Reviews

#### Problem Statement:
An e-commerce company wants to classify customer reviews into different categories (e.g., positive, neutral, negative) to gain insights into customer sentiment and improve their products and services.

#### Solution with FastText:
We can use FastText to build a text classification model that analyzes customer reviews and predicts the sentiment category.

#### Steps:

1. **Data Collection**: Collect a dataset of customer reviews along with their sentiment labels (e.g., positive, neutral, negative).

2. **Data Preprocessing**: Preprocess the text data by removing stopwords, punctuation, and converting text to lowercase. Tokenization and lemmatization can also be applied.

3. **Feature Extraction**: Train a FastText model on the preprocessed text data to learn word embeddings. These embeddings capture semantic information about words and their contexts.

4. **Model Training**: Use the word embeddings learned by FastText to represent each review as a vector by averaging the embeddings of its constituent words. Train a classification model (e.g., logistic regression, support vector machine) on these vector representations to predict the sentiment category.

5. **Model Evaluation**: Evaluate the performance of the trained model using appropriate evaluation metrics such as accuracy, precision, recall, and F1-score on a held-out test set.

6. **Deployment**: Deploy the trained model into production to classify new customer reviews in real-time. Monitor the model's performance and fine-tune as necessary.

#### Benefits of Using FastText:

- **Efficiency**: FastText is known for its efficiency in training and inference, making it suitable for large-scale text classification tasks.
  
- **Semantic Information**: FastText captures semantic relationships between words, allowing the model to generalize well to unseen words and contexts.

- **Robustness**: FastText can handle out-of-vocabulary words, misspellings, and morphologically rich languages effectively, enhancing the robustness of the text classification model.

- **Interpretability**: The word embeddings learned by FastText provide interpretable representations of words, facilitating the understanding of the model's predictions.

#### Conclusion:
By leveraging FastText for text classification, the e-commerce company can effectively analyze customer reviews, gain insights into customer sentiment, and make data-driven decisions to enhance customer satisfaction and improve their products and services.

In [None]:
from gensim.models import FastText
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
import numpy as np
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import string

# Sample customer reviews on clothes and electronic products
clothes_reviews = [
    "The dress is beautiful and fits perfectly.",
    "The fabric of the shirt is soft and comfortable.",
    "I love the design of these jeans.",
    "The quality of the sweater is excellent.",
    "The skirt is too tight and uncomfortable.",
]

electronic_reviews = [
    "The phone is fast and has a great camera.",
    "The laptop has a long battery life and works smoothly.",
    "I'm impressed with the performance of the tablet.",
    "The sound quality of the headphones is amazing.",
    "The screen of the monitor is clear and sharp.",
]

# Labels for the reviews (0: clothes, 1: electronics)
labels = [0] * len(clothes_reviews) + [1] * len(electronic_reviews)

# Combine clothes and electronic reviews
all_reviews = clothes_reviews + electronic_reviews

# Preprocessing function to tokenize, remove stopwords, punctuation, and lemmatize the text data
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    tokens = word_tokenize(text.lower())
    tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words and word.isalpha()]
    return tokens

# Train FastText model on the preprocessed text data
fasttext_model = FastText([preprocess(review) for review in all_reviews], vector_size=100, window=5, min_count=1, workers=4)

# Convert each review into a vector representation using the trained FastText model
def review_to_vector(review):
    vectors = [fasttext_model.wv[word] for word in preprocess(review) if word in fasttext_model.wv]
    if vectors:
        return np.mean(vectors, axis=0)
    else:
        return np.zeros(fasttext_model.vector_size)

review_vectors = np.array([review_to_vector(review) for review in all_reviews])

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(review_vectors, labels, test_size=0.2, random_state=42)

# Train a Random Forest classifier on the vector representations of the reviews
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = rf_model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Classification report
print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=["Clothes", "Electronics"]))


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


Accuracy: 1.0
Classification Report:
              precision    recall  f1-score   support

     Clothes       1.00      1.00      1.00         1
 Electronics       1.00      1.00      1.00         1

    accuracy                           1.00         2
   macro avg       1.00      1.00      1.00         2
weighted avg       1.00      1.00      1.00         2



Logical explanation of the code:

1. **Data Preparation**:
   - We start by defining sample customer reviews for clothes and electronic products, along with their corresponding labels.
   - These reviews are then combined into a single list, `all_reviews`, with a parallel `labels` list (0 for clothes, 1 for electronics).

2. **Preprocessing**:
   - We perform text preprocessing on the reviews before training the model.
   - The `preprocess` function lowercases and tokenizes each review, removes stopwords and non-alphabetic tokens (including punctuation), and lemmatizes the remaining words to normalize the text.

3. **FastText Model Training**:
   - We train a FastText model on the preprocessed text data using Gensim.
   - FastText learns a vector representation (embedding) for each word in the vocabulary, building it from the word's character n-grams and capturing semantic meaning from the contexts in which the word appears in the reviews.

4. **Feature Extraction**:
   - For each review, we use the trained FastText model to convert the text into a fixed-length vector representation.
   - The `review_to_vector` function averages the vectors of the words that survive preprocessing (stopwords and non-alphabetic tokens are excluded) and returns a zero vector if no words remain.

5. **Train-Test Split**:
   - We split the data into training and testing sets using `train_test_split` from scikit-learn.
   - This ensures that the model is trained on a portion of the data and evaluated on a separate unseen portion.

6. **Model Training (Random Forest)**:
   - We use a Random Forest classifier to train the model on the vector representations of the reviews.
   - Random Forest is an ensemble learning method that builds multiple decision trees during training and combines their outputs: the most common class for classification, or the average prediction for regression.

7. **Model Evaluation**:
   - We make predictions on the test set using the trained model.
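   - Accuracy is calculated as the proportion of correctly classified samples out of the total number of samples. Here the test set contains only two reviews (one per class), so the reported accuracy of 1.0 is not a reliable estimate of performance on new data.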
   - The classification report provides precision, recall, and F1-score for each class (Clothes and Electronics), along with their average values.
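
The feature extraction step (step 4) can be illustrated without training a model. The sketch below is only an assumption-laden toy: `toy_embeddings`, `VECTOR_SIZE`, and `average_vector` are hypothetical names, and the 3-dimensional vectors are made up to stand in for the vectors a trained FastText model would provide.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings standing in for a trained model's word vectors
toy_embeddings = {
    "phone": np.array([1.0, 0.0, 2.0]),
    "fast": np.array([3.0, 2.0, 0.0]),
    "camera": np.array([2.0, 4.0, 1.0]),
}
VECTOR_SIZE = 3


def average_vector(tokens, embeddings, size):
    """Mean-pool the vectors of known tokens; fall back to zeros if none are known."""
    vectors = [embeddings[token] for token in tokens if token in embeddings]
    if not vectors:
        return np.zeros(size)
    return np.mean(vectors, axis=0)


pooled = average_vector(["phone", "fast", "camera"], toy_embeddings, VECTOR_SIZE)
empty = average_vector(["the", "is"], toy_embeddings, VECTOR_SIZE)

print(pooled)  # [2. 2. 1.]
print(empty)   # [0. 0. 0.]
```

Averaging gives every review a vector of the same length regardless of how many words it contains, which is what the classifier needs; the trade-off is that word order is lost.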

By following these steps, the code aims to train a model that can accurately classify customer reviews as either related to clothes or electronic products based on their text content.
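
To make the evaluation metrics in step 7 concrete, here is a minimal sketch that computes accuracy, precision, recall, and F1 by hand for one class. The function name `binary_metrics` and the label lists are hypothetical and chosen only for illustration; they are not the notebook's actual predictions.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return accuracy, precision, recall, f1


# Hypothetical labels: 0 = clothes, 1 = electronics
y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]

accuracy, precision, recall, f1 = binary_metrics(y_true, y_pred)
print("Accuracy:", accuracy)              # Accuracy: 0.6
print("Precision:", round(precision, 2))  # Precision: 0.67
print("Recall:", round(recall, 2))        # Recall: 0.67
print("F1:", round(f1, 2))                # F1: 0.67
```

`classification_report` repeats this calculation once per class and then averages the results.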

# **Homework - Use case**

Use the Continuous Bag of Words (CBOW) Word2Vec model for sentiment analysis on movie reviews:

### Use Case: Movie Review Sentiment Analysis

**Problem Statement**:
We want to classify movie reviews as positive or negative based on their sentiment.

**Dataset**:
Use the IMDb movie review dataset, which contains movie reviews labeled as positive or negative.

**Approach**:
1. **Data Preparation**:
   - Download and preprocess the IMDb movie review dataset (source: kaggle.com).
   - Split the dataset into training and testing sets.

2. **Word Embedding (CBOW Word2Vec)**:
   - Train a CBOW Word2Vec model on the preprocessed movie review text.
   - The CBOW model predicts the target word (current word) from its context words (surrounding words) within a fixed-size window.

3. **Feature Extraction**:
   - For each movie review, compute the average vector representation of all words using the trained CBOW Word2Vec model.
   - These average vectors serve as features for sentiment analysis.

4. **Model Training**:
   - Train a machine learning model (e.g., Logistic Regression, Random Forest, or Support Vector Machine) on the feature vectors extracted from the movie reviews.

5. **Model Evaluation**:
   - Evaluate the trained model on the testing set to measure its performance in classifying movie reviews as positive or negative.
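
The CBOW idea in step 2 can be sketched as the list of (context, target) training examples that a fixed-size window produces. This is a simplified illustration, not how Gensim trains internally; the helper name `cbow_pairs` and the sample sentence are hypothetical.

```python
def cbow_pairs(tokens, window=2):
    """Return (context_words, target_word) pairs for a fixed-size window."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]      # up to `window` words before the target
        right = tokens[i + 1:i + 1 + window]     # up to `window` words after the target
        context = left + right
        if context:
            pairs.append((context, target))
    return pairs


tokens = ["the", "movie", "was", "really", "good"]
pairs = cbow_pairs(tokens, window=1)

for context, target in pairs:
    print(context, "->", target)
# ['movie'] -> the
# ['the', 'was'] -> movie
# ['movie', 'really'] -> was
# ['was', 'good'] -> really
# ['really'] -> good
```

During training, the model learns word vectors so that the combined context vectors predict the target word well. Skip-gram reverses the direction and predicts context words from the target.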

**Code Implementation** (using Python and Gensim):
```python
from gensim.models import Word2Vec
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
import numpy as np

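# Load and preprocess the IMDb movie review dataset
# (to be filled in: `sentences` must end up as a list of token lists, one per review)

# Train CBOW Word2Vec model on preprocessed text data (sg=0 selects CBOW)
cbow_model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=0)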

# Convert each movie review into a vector representation using the trained CBOW Word2Vec model

# Split the data into train and test sets

# Train a machine learning model (e.g., Logistic Regression) on the feature vectors

# Evaluate the trained model on the testing set

# Print accuracy and classification report
```

This use case demonstrates how to apply the CBOW Word2Vec model to sentiment analysis on movie reviews. By learning vector representations of words from the contexts in which they appear in the reviews, the model can capture semantic meanings and relationships that a downstream classifier can use to separate positive from negative reviews.