# EMBEDDINGS

Embeddings are dense vectors that represent discrete items, such as words, as points in a continuous vector space. For a vocabulary, the embeddings are stored as a matrix of size (d, V), where d is the dimensionality of the embedding and V is the number of words in the vocabulary; each column holds the vector of one word. Mathematically, an embedding is a mapping:

> f: X -> Y

where the function is

•	**injective** (an **injective function** maps distinct elements of X to distinct elements of Y, so no two inputs share the same output)

•	**structure-preserving** (relationships that hold in X still hold after mapping; for example, if X1 < X2 in the space to which X belongs, then Y1 < Y2 in the space to which Y belongs)

In plain terms, word embedding **maps each word in the space X to a multi-dimensional vector in the space Y; that vector is the word's embedding in Y**.

1. Text data needs to be pre-processed into tensor form before it can be input to a neural network.
2. The process of dividing text into units is called tokenization, and the resulting units are called tokens.
3. Text can be divided into words, characters (a, b, c, ...), n-grams, and so on.
4. One-hot encoding or word embedding is generally used to turn words into numerical tensors.
5. One-hot encoding is simple but has no structure: the Euclidean distance between any two distinct words is √2.
6. The word-embedding space has few dimensions and has structure: similar words are close together, and unrelated words are far apart.
7. The embedding layer can be seen as a matrix that maps points in the high-dimensional (one-hot) space to a low-dimensional space.
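
The last point can be illustrated with a small sketch. Multiplying a (d, V) matrix by a one-hot vector simply selects one column, which is why an embedding layer can be implemented as a table lookup. The matrix values and the helper names (`one_hot`, `matvec`, `embed`) are made up for illustration and are not learned weights.

```python
# Toy embedding matrix E of shape (d, V) with d = 2 and V = 4.
# The numbers are arbitrary illustrative values, not learned weights.
E = [
    [0.1, 0.4, -0.3, 0.8],   # dimension 0 for each of the 4 words
    [0.7, -0.2, 0.5, 0.0],   # dimension 1 for each of the 4 words
]


def one_hot(index, size):
    """Return a list of zeros with a single 1 at position `index`."""
    return [1 if i == index else 0 for i in range(size)]


def matvec(matrix, vector):
    """Multiply a (d, V) matrix by a length-V vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def embed(index):
    """Read column `index` of E directly, without any multiplication."""
    return [row[index] for row in E]


word_index = 2
via_matmul = matvec(E, one_hot(word_index, 4))
via_lookup = embed(word_index)
print(via_matmul, via_lookup)
```

Both paths return the same vector; the lookup just skips the multiplications by zero.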

## WORD EMBEDDINGS

Word embedding is a form of word representation **that connects the human understanding of language to that of the machine**. Word embeddings are distributed representations of text in a multi-dimensional space, and they are central to much of the research in deep learning. **This way of representing words and documents may be considered one of the crucial breakthroughs of deep learning on challenging NLP problems.**

Word embeddings are a class of techniques **where each individual word is represented as a real-valued vector in a vector space**. The main idea is to use a dense, distributed representation for every word.
Each word is mapped to a single vector, and the vector values are learned in a way that resembles training a neural network, which is why the technique is often grouped under deep learning.

**A neural network cannot be trained on raw text directly. The text data must first be converted into numerical tensors. This process is called text vectorization.**

There are several strategies for text vectorization:
1. Split text into words and convert each word into a vector
2. Split text into characters and convert each character into a vector
3. Extract n-grams of words or characters and convert each n-gram into a vector
4. One-hot encoding

**The unit into which text is decomposed is called a token, and the process of decomposing text into tokens is called tokenization.**

To put it simply, we want to feed text data into a neural network and train it. However, neural networks cannot process text directly, so the text must be pre-processed into a format the network can work with:

**Text ----> Tokenization ----> Vectorization**
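
As a rough sketch of this pipeline, the snippet below tokenizes a short text into words and replaces each token with an integer index, which is the form an embedding layer or a one-hot encoder would consume. The regular expression and the helper names `tokenize` and `build_vocab` are assumptions chosen for this example; real tokenizers handle punctuation and casing in more careful ways.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_vocab(tokens):
    """Give each distinct token an integer index, in order of first appearance."""
    vocab = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab


text = "The boy is crying. The boy is sad."
tokens = tokenize(text)                     # Text -> tokens
vocab = build_vocab(tokens)                 # tokens -> vocabulary
indices = [vocab[tok] for tok in tokens]    # tokens -> numbers

print(tokens)
print(vocab)
print(indices)
```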


There are two main methods for word vectorization:
1. One-hot encoding
2. Word embedding

# ONE HOT ENCODING

**Why is it called one-hot?** 

**After each word is one-hot encoded, only one position has an element of 1 and the other positions are all 0.**

For example, take the sentence **"the boy is crying"** and assume the vocabulary contains only these four English words. After one-hot encoding:

**the corresponds to (1, 0, 0, 0)**

**boy corresponds to (0, 1, 0, 0)**

**is corresponds to (0, 0, 1, 0)**

**crying corresponds to (0, 0, 0, 1)**

Each word corresponds to a position in the vector, and this position represents the word.
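
A minimal sketch of this encoding is shown here. The function name `one_hot_encode` is hypothetical, and the vocabulary is built from the order in which the words first appear in the sentence.

```python
def one_hot_encode(tokens):
    """Map each distinct token to a one-hot tuple, in order of first appearance."""
    vocab = {}
    for tok in tokens:
        vocab.setdefault(tok, len(vocab))

    vectors = {}
    for word, idx in vocab.items():
        vec = [0] * len(vocab)
        vec[idx] = 1            # the single "hot" position for this word
        vectors[word] = tuple(vec)
    return vectors


encoding = one_hot_encode("the boy is crying".split())
for word, vec in encoding.items():
    print(word, vec)
```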

But this approach requires a very high dimension: if the vocabulary has 100,000 words, then each word needs to be represented by a vector of length 100,000.

**the corresponds to (1, 0, 0, 0, ..., 0) (length is 100,000)**

**boy corresponds to (0, 1, 0, 0, ..., 0)**

**is corresponds to (0, 0, 1, 0, ..., 0)**

**crying corresponds to (0, 0, 0, 1, ..., 0)**

The result is a set of high-dimensional, sparse tensors.

<figure>
<img src= "./res/04_1_one-hot.jpg" style="width:100%">
<figcaption align = "center"><b>Fig.1 -  One Hot Encoding</b></figcaption>
</figure>


## ADVANTAGES AND DISADVANTAGES OF ONE HOT ENCODING

### ADVANTAGES

* One-hot encoding is a simple way to represent words.
* One-hot encoding is a sparse representation.
* The working of one-hot encoding is easy to understand.

### DISADVANTAGES

* The length of the word vector equals the size of the vocabulary, and the vector is extremely sparse. When the vocabulary is large, the memory and computation costs become very high.

* Any two word vectors are orthogonal, so the relationship between words cannot be obtained from the one-hot code.

* The distance between any two words is equal, so the semantic relatedness of two words cannot be read from the distance.
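
The last two disadvantages can be checked numerically. The sketch below builds the four one-hot vectors of a four-word vocabulary and computes the dot product and Euclidean distance for every pair; the helper names are chosen only for this example.

```python
from itertools import combinations


def one_hot(index, size):
    """Return a list of zeros with a single 1 at position `index`."""
    return [1 if i == index else 0 for i in range(size)]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


vectors = [one_hot(i, 4) for i in range(4)]
pairs = list(combinations(vectors, 2))

dots = [dot(a, b) for a, b in pairs]             # every pair is orthogonal
distances = [euclidean(a, b) for a, b in pairs]  # every pair is equally far apart

print(dots)
print(distances)
```

Every dot product is 0 and every distance is √2, whichever two words are compared, so the encoding carries no information about similarity.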


# BAG OF WORDS - BOW

**The bag-of-words model is a commonly used document representation method in the field of information retrieval.**

In information retrieval, the bag-of-words (BoW) model ignores a document's word order, grammar, and syntax, and treats the document as an unordered collection of words. The appearance of each word in the document is treated as independent of whether other words appear. **(The collection is unordered.)**

#### Let's take an example

`John likes to watch movies. Mary likes too.`

`John also likes to watch football games.`

Build a dictionary based on the words that appear in the above two sentences:

`{"John": 1, "likes": 2, "to": 3, "watch": 4, "movies": 5, "also": 6, "football": 7, "games": 8, "Mary": 9, "too": 10}`


The dictionary contains 10 words, and each word has a unique index. Note that the index order is not tied to the order in which the words appear in the sentences. Using this dictionary, we can re-express the two sentences as the following two vectors:

`[1, 2, 1, 1, 1, 0, 0, 0, 1, 1]`

`[1, 1, 1, 1, 0, 1, 1, 1, 0, 0]`


These two vectors contain a total of 10 elements, where the i-th element represents the number of times the i-th word in the dictionary appears in the sentence. 
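
The counting step can be sketched in a few lines of plain Python. The vocabulary list mirrors the dictionary above, in index order; the function name `bag_of_words` and the `\w+` tokenization pattern are assumptions made for this example.

```python
import re
from collections import Counter

# Vocabulary in the same order as the dictionary indices 1..10.
vocabulary = ["John", "likes", "to", "watch", "movies",
              "also", "football", "games", "Mary", "too"]


def bag_of_words(sentence, vocabulary):
    """Count how often each vocabulary word occurs in the sentence."""
    counts = Counter(re.findall(r"\w+", sentence))
    return [counts[word] for word in vocabulary]


doc_1 = bag_of_words("John likes to watch movies. Mary likes too.", vocabulary)
doc_2 = bag_of_words("John also likes to watch football games.", vocabulary)

print(doc_1)
print(doc_2)
```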

Now imagine a **huge document set D with a total of M documents**. After all the words in the document are extracted, they form a dictionary containing N words. Using the Bag-of-words model, **each document can be represented as an N-dimensional vector**.


Therefore, the BoW model can be considered as a statistical histogram. It is used in text retrieval and processing applications.

## IMPLEMENTING BOW USING SCIKIT-LEARN

    import pandas as pd
    from sklearn.feature_extraction.text import CountVectorizer

    sentence_1 = "Ineuron is an Edtech company that provides a platform for students to learn and interact with the world around them."
    sentence_2 = "It is in the aim of providing an affordable and easy way for students to learn SOTA technologies."

    # Count single words (unigrams) and drop common English stop words.
    vectorizer = CountVectorizer(ngram_range=(1, 1), stop_words='english')

    # Learn the vocabulary and build the document-term count matrix.
    vector = vectorizer.fit_transform([sentence_1, sentence_2])

    # get_feature_names() was removed in scikit-learn 1.2;
    # get_feature_names_out() is its replacement.
    output = pd.DataFrame(vector.toarray(), columns=vectorizer.get_feature_names_out())
    print(output)

Output:

       affordable  aim  company  easy  edtech  ineuron  interact  learn  platform  \
    0           0    0        1     0       1        1         1      1         1   
    1           1    1        0     1       0        0         0      1         0   

       provides  providing  sota  students  technologies  way  world  
    0         1          0     0         1             0    0      1  
    1         0          1     1         1             1    1      0  




# TF-IDF VECTORIZATION

**TF-IDF (Term Frequency-Inverse Document Frequency)** is a commonly used weighting technique in information retrieval and text mining.

TF-IDF is a statistical method for evaluating how important a word is to one document within a collection of documents (a corpus). The importance of the word increases in proportion to the number of times it appears in the document, but decreases with how frequently it appears across the corpus.

<figure>
<img src= "./res/04_2_tfidf.png">
<figcaption><b>Fig.2 -  Formula for TF-IDF</b></figcaption>
</figure>

## TF(TERM FREQUENCY)

* **Term frequency (TF)**: the number of times a given word appears in the text. This number is usually normalized by the total number of words in the document to prevent a bias toward long documents, because a term is likely to appear more often in a long document than in a short one, whether or not it is important.

> **TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document).**

Term frequency (TF) indicates how often a term (keyword) appears in the text.

<figure>
<img src= "./res/04_3_tf.png">
<figcaption><b>Fig.3 -  Formula for TF</b></figcaption>
</figure>

where $n_{i,j}$ is the number of occurrences of the word $t_i$ in the file $d_j$, and the denominator is the sum of the occurrences of all words in the file $d_j$.

## IDF(INVERSE DOCUMENT FREQUENCY)

* **Inverse document frequency (IDF)**: a measure of the general importance of a word. The main idea is that the fewer documents contain the term t, the larger its IDF, which means the term is good at distinguishing between categories. The IDF of a specific word is calculated by dividing the total number of files by the number of files containing the word, and then taking the logarithm of the quotient.

>**IDF(t) = log_e(Total number of documents / Number of documents with term t in it).**

<figure>
<img src= "./res/04_4_idf.png">
<figcaption><b>Fig.4 -  Formula for IDF</b></figcaption>
</figure>

### Example:

Consider a document containing 100 words in which the word cat appears 3 times.

The **term frequency (Tf) for cat** is then **(3 / 100) = 0.03**. Now, assume we have 10 million documents and the word cat appears in one thousand of these.

Then, using a base-10 logarithm, the **inverse document frequency (Idf)** is calculated as **log10(10,000,000 / 1,000) = 4.** (With the natural logarithm used in the formula above, the value would be about 9.21; the choice of base only rescales the weights.)

Thus, the **Tf-idf** weight is the product of these quantities: **0.03 * 4 = 0.12.**
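
The arithmetic of this example can be reproduced with a short sketch. It assumes a base-10 logarithm, which is what makes the IDF come out to exactly 4; the function names are chosen for this example only.

```python
import math


def term_frequency(term_count, total_terms):
    """TF: occurrences of the term divided by the document length."""
    return term_count / total_terms


def inverse_document_frequency(num_documents, num_documents_with_term):
    """IDF with a base-10 logarithm (an assumption matching the example)."""
    return math.log10(num_documents / num_documents_with_term)


tf = term_frequency(3, 100)                             # cat: 3 of 100 words
idf = inverse_document_frequency(10_000_000, 1_000)     # cat: 1,000 of 10M docs
tf_idf = tf * idf

print(tf, idf, tf_idf)
```

Switching to `math.log` (natural logarithm) would multiply every IDF by the same constant, so the ranking of terms would not change.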

### APPLICATIONS OF TF-IDF

1.  **Search engine**
2.  **Keyword extraction**
3.  **Text similarity**
4.  **Text summary**

## PYTHONIC IMPLEMENTATION OF TF-IDF

    from collections import defaultdict
    import math
    import operator


    def loadDataSet():
        """Return a toy corpus of tokenized documents and one class label per document."""
        dataset = [ ['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
                    ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                    ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                    ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                    ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                    ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid'] ]
        classVec = [0, 1, 0, 1, 0, 1]
        return dataset, classVec


    def feature_select(list_words):
        """Score every word of the corpus by TF-IDF, highest score first."""
        # Total number of occurrences of each word across the whole corpus.
        doc_frequency = defaultdict(int)
        for word_list in list_words:
            for word in word_list:
                doc_frequency[word] += 1

        # TF is computed over the whole corpus here, not per document.
        total_words = sum(doc_frequency.values())
        word_tf = {word: count / total_words for word, count in doc_frequency.items()}

        # Number of documents that contain each word.
        doc_num = len(list_words)
        word_doc = defaultdict(int)
        for word in doc_frequency:
            for word_list in list_words:
                if word in word_list:
                    word_doc[word] += 1

        # IDF with 1 added to the document count in the denominator.
        word_idf = {word: math.log(doc_num / (word_doc[word] + 1)) for word in doc_frequency}

        word_tf_idf = {word: word_tf[word] * word_idf[word] for word in doc_frequency}

        # Sort the (word, score) pairs by score in descending order.
        return sorted(word_tf_idf.items(), key=operator.itemgetter(1), reverse=True)


    if __name__ == '__main__':
        data_list, label_list = loadDataSet()
        features = feature_select(data_list)
        print(features)
        print(len(features))

Output:

[('to', 0.0322394037469742), ('stop', 0.0322394037469742), ('worthless', 0.0322394037469742), ('my', 0.028288263356383563), ('dog', 0.028288263356383563), ('him', 0.028288263356383563), ('stupid', 0.028288263356383563), ('has', 0.025549122992281622), ('flea', 0.025549122992281622), ('problems', 0.025549122992281622), ('help', 0.025549122992281622), ('please', 0.025549122992281622), ('maybe', 0.025549122992281622), ('not', 0.025549122992281622), ('take', 0.025549122992281622), ('park', 0.025549122992281622), ('dalmation', 0.025549122992281622), ('is', 0.025549122992281622), ('so', 0.025549122992281622), ('cute', 0.025549122992281622), ('I', 0.025549122992281622), ('love', 0.025549122992281622), ('posting', 0.025549122992281622), ('garbage', 0.025549122992281622), ('mr', 0.025549122992281622), ('licks', 0.025549122992281622), ('ate', 0.025549122992281622), ('steak', 0.025549122992281622), ('how', 0.025549122992281622), ('quit', 0.025549122992281622), ('buying', 0.025549122992281622), ('f

## IMPLEMENTING TF-IDF USING NLTK

    from nltk.text import TextCollection
    from nltk.tokenize import word_tokenize

    sents = ['this is sentence one', 'this is sentence two', 'this is sentence three']
    sents = [word_tokenize(sent) for sent in sents]
    print(sents)
    # TextCollection treats the list of tokenized sentences as a corpus.
    corpus = TextCollection(sents)
    print(corpus)

    # TF of 'one' over the whole corpus: 1 occurrence out of 12 tokens.
    tf = corpus.tf('one', corpus)
    print(tf)

    # IDF of 'one': log(3 documents / 1 document containing the word).
    idf = corpus.idf('one')
    print(idf)

    tf_idf = corpus.tf_idf('one', corpus)
    print(tf_idf)

Output:

    [['this', 'is', 'sentence', 'one'], ['this', 'is', 'sentence', 'two'], ['this', 'is', 'sentence', 'three']]
    <Text: this is sentence one this is sentence two...>
    0.08333333333333333
    1.0986122886681098
    0.0915510240556758


## IMPLEMENTATION OF TF-IDF USING SCIKIT-LEARN

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_extraction.text import TfidfTransformer

    x_train = ['The main idea of TF-IDF is that algorithm is an important feature that can be separated from the corpus background']
    x_test = ['Original text marked ', ' main idea']

    # Keep only the 10 most frequent terms of the training text.
    vectorizer = CountVectorizer(max_features=10)
    tf_idf_transformer = TfidfTransformer()

    # Fit the vocabulary and the IDF weights on the training text.
    tf_idf = tf_idf_transformer.fit_transform(vectorizer.fit_transform(x_train))
    x_train_weight = tf_idf.toarray()

    # Reuse the fitted vocabulary and weights for the test text.
    tf_idf = tf_idf_transformer.transform(vectorizer.transform(x_test))
    x_test_weight = tf_idf.toarray()

    print('Output x_train text vector:')
    print(x_train_weight)
    print('Output x_test text vector:')
    print(x_test_weight)

Output:

    Output x_train text vector:
    [[0.22941573 0.22941573 0.22941573 0.45883147 0.22941573 0.22941573
      0.22941573 0.22941573 0.45883147 0.45883147]]
    Output x_test text vector:
    [[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
     [0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]]


# N-GRAM

**Wikipedia definition**: 

**In computational linguistics, an n-gram is a sequence of n consecutive items in a text (items can be phonemes, syllables, letters, words, or base pairs).**

N-grams of texts are widely used in text mining and natural language processing. They are essentially sets of co-occurring words within a defined window; when computing n-grams, we typically move the window forward one word at a time, or more depending on the scenario.

For example, take the sentence **“The cow jumps over the moon”**. If **N=2** (known as bigrams), then the n-grams would be:

* the cow
* cow jumps
* jumps over
* over the
* the moon

In n-gram, **n = 1 is a unigram**, **n = 2 is a bigram**, and **n = 3 is a trigram**.
For **n ≥ 4**, the number is used directly, as in **4-gram** and **5-gram**.
N-grams are often used for comparing sentence similarity, fuzzy search, checking whether a sentence is plausible, sentence correction, etc.

<figure>
<img src= "./res/04_5_ngram.png" style="width:100%">
<figcaption align="center"><b>Fig.5 -  N Gram Formula Approximation</b></figcaption>
</figure>

### Determination of N in N-gram

To choose the value of N, "Language Modeling with Ngrams" uses the **perplexity** metric. The lower the perplexity, the better the language model.

>The article uses a Wall Street Journal database with a dictionary size of 19,979. The training set contains 38 million words and the test set contains 1.5 million words. 

For different N-grams, calculate their respective perplexity.

<figure>
<img src= "./res/04_6_perp_formula.png" style="width:100%">
<figcaption align="center"><b>Fig.6 -  Perplexity Calculation formula</b></figcaption>
</figure>
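
As a hedged illustration, the sketch below computes perplexity under the usual definition, the inverse probability of the test sequence normalized by the number of words, which is assumed to be what the figure shows. The input is the list of probabilities a language model assigned to each word of a test sequence; the probability values are invented for the example.

```python
import math


def perplexity(token_probabilities):
    """Perplexity = exp of the negative mean log-probability of the tokens."""
    n = len(token_probabilities)
    log_sum = sum(math.log(p) for p in token_probabilities)
    return math.exp(-log_sum / n)


# A model that always hesitates between 4 equally likely words ...
uniform = perplexity([0.25] * 8)
# ... versus a model that gives each observed word probability 0.5.
confident = perplexity([0.5] * 8)

print(uniform, confident)
```

The first model has a perplexity of about 4 and the second about 2: the more probability a model gives to the words that actually occur, the lower its perplexity.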

The results show that Tri-gram's Perplexity is the smallest, so it works best.

<figure>
<img src= "./res/04_7_perp_result.png" style="width:100%">
<figcaption align="center"><b>Fig.7 -  Perplexity Result</b></figcaption>
</figure>

## UNIGRAM IMPLEMENTATION

    sent = "I will go to United States"
    lst_sent = sent.split(" ")
    # A unigram is a single token, so the unigram list is just the token list.
    of_unigrams_in = []
    for i in range(len(lst_sent)):
        of_unigrams_in.append(lst_sent[i])
    print(of_unigrams_in)

Output:

    ['I', 'will', 'go', 'to', 'United', 'States']


## BIGRAM IMPLEMENTATION

    sent = "I will go to United States"
    lst_sent = sent.split(" ")
    # Join each token with the token that follows it.
    of_bigram_in = []
    for i in range(len(lst_sent) - 1):
        of_bigram_in.append(lst_sent[i] + " " + lst_sent[i + 1])
    print(of_bigram_in)

Output:

    ['I will', 'will go', 'go to', 'to United', 'United States']


## TRIGRAM IMPLEMENTATION

    sent = "I will go to United States"
    lst_sent = sent.split(" ")
    # Join each token with the two tokens that follow it.
    of_trigram_in = []
    for i in range(len(lst_sent) - 2):
        of_trigram_in.append(lst_sent[i] + " " + lst_sent[i + 1] + " " + lst_sent[i + 2])
    print(of_trigram_in)

Output:

    ['I will go', 'will go to', 'go to United', 'to United States']
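
The three implementations above differ only in the window size, so they can be folded into one function. This is a minimal sketch; the name `ngrams` is chosen for the example and is not taken from a library.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, each joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "I will go to United States".split(" ")
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(unigrams)
print(bigrams)
print(trigrams)
```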


## USE OF NGRAM

* The CountVectorizer class of the scikit-learn package builds n-gram features through its `ngram_range` parameter
* N-gram counts are often computed as a step before TF-IDF vectorization, which reweights the resulting **co-occurrence matrix**

# GloVe

**GloVe is an unsupervised learning algorithm for obtaining vector representations of words. It is trained on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations show interesting linear substructures of the word vector space.**


Official website: <a href="https://nlp.stanford.edu/projects/glove/" target="_blank">https://nlp.stanford.edu/projects/glove/</a>

GitHub: <a href="https://github.com/stanfordnlp/GloVe" target="_blank">https://github.com/stanfordnlp/GloVe</a>

Paper: <a href="https://nlp.stanford.edu/pubs/glove.pdf" target="_blank">https://nlp.stanford.edu/pubs/glove.pdf</a>

## GloVe word vector format

GloVe is a type of word embedding. The text format written by the Stanford open-source GloVe code differs slightly from the word2vec format. **The first line of a model file trained by word2vec gives the vocabulary size and the vector dimension, while a GloVe file has no such line.**

Word2vec training format:

    Size Dimension

    Word1 vector1
    Word2 vector2
    ....
    WordN vectorN
    

GloVe training format:


    Word1 vector1
    Word2 vector2
    ....
    WordN vectorN

>Therefore, to use a model trained by GloVe in the same way as a word2vec model, add a header line with the vocabulary size and dimension at the top of the file. The official website provides many pretrained word vector models trained on large corpora, which can be downloaded and used directly, for example with the **gensim** package.
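
The header step can be sketched without any library. The function below takes the lines of a GloVe text file and prepends the `Size Dimension` line that the word2vec text format expects. The function name and the two sample vectors are made up for illustration, and the sketch assumes each line is a word followed by space-separated numbers.

```python
def glove_to_word2vec_lines(glove_lines):
    """Prepend a 'vocab_size dimension' header to GloVe-format lines."""
    rows = [line.rstrip("\n") for line in glove_lines if line.strip()]
    if not rows:
        raise ValueError("no vectors found")
    # Every row is: word, then one number per embedding dimension.
    dimension = len(rows[0].split(" ")) - 1
    header = f"{len(rows)} {dimension}"
    return [header] + rows


# Two made-up 3-dimensional vectors in GloVe format.
glove_sample = ["the 0.1 0.2 0.3\n", "boy 0.4 0.5 0.6\n"]
converted = glove_to_word2vec_lines(glove_sample)
print("\n".join(converted))
```

Recent gensim releases can also read header-less files directly through the `no_header=True` argument of `KeyedVectors.load_word2vec_format`, which avoids rewriting the file; check the documentation of the installed version before relying on it.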

# WORD2VEC

Word2Vec is a collection of models used to generate embeddings for words. The models are very shallow neural networks, consisting of only an input layer, one hidden layer, and an output layer.

The Word2Vec model utilizes two main types of architectures:
* Skip-gram
* CBOW

<figure>
<img src= "./res/04_8_word2vec.png" style="width:100%">
<figcaption align="center"><b>Fig.8 -  Word2Vec</b></figcaption>
</figure>

## CBOW(Continuous Bag of Words)

The CBOW model predicts the current word from the context words within a specific window of the text. The input layer consists of the context words and the output layer consists of the current word. The hidden layer is a fully connected layer whose size is the number of dimensions in which we want to represent the current word, i.e. the dimension of the word vector (embedding).

<figure>
<img src= "./res/04_9_CBOW.png" style="width:100%">
<figcaption align="center"><b>Fig.9 -  CBOW Architecture</b></figcaption>
</figure>
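
To make the input and output of CBOW concrete, the sketch below generates the (context words, current word) training pairs for a sentence. It only prepares the training examples and does not train a network; the function name `cbow_pairs` and the window size of 1 are assumptions for this example.

```python
def cbow_pairs(tokens, window):
    """Build (context words, target word) pairs for every position."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]       # up to `window` words before
        right = tokens[i + 1:i + 1 + window]      # up to `window` words after
        pairs.append((left + right, target))
    return pairs


tokens = "the cow jumps over the moon".split()
pairs = cbow_pairs(tokens, window=1)
for context, target in pairs:
    print(context, "->", target)
```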

## SKIP-GRAM

Skip-gram predicts the surrounding context words for a given current word within a specific window of text. In this architecture, the input layer contains the current word and the output layer contains the context words. The hidden layer has as many units as the number of dimensions in which we want to represent the current word at the input layer, i.e. the dimension of the word vector (embedding).

<figure>
<img src= "./res/04_10_skip.png" style="width:100%">
<figcaption align="center"><b>Fig.10 -  Skip Gram Architecture</b></figcaption>
</figure>
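
Skip-gram reverses the direction of prediction, so its training examples are (current word, context word) pairs. The sketch below only generates these pairs and does not train a network; the function name `skipgram_pairs` and the window size are assumptions for this example.

```python
def skipgram_pairs(tokens, window):
    """Build one (center word, context word) pair per neighbour in the window."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:                      # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


pairs = skipgram_pairs("the cow jumps".split(), window=1)
print(pairs)
```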

## WORD2VEC IMPLEMENTATION USING GENSIM

    # Implementing Word2Vec using gensim
    import gensim
    from gensim.models import Word2Vec

    sentence = "Ineuron is an Edtech company that provides a platform for students to learn and interact with the world around them."
    # simple_preprocess lowercases the text and splits it into word tokens.
    sentence = gensim.utils.simple_preprocess(sentence)
    print(sentence)

    # Passing the corpus to the constructor would already build the vocabulary
    # and train, so the model is created empty and both steps are run explicitly.
    w2vvectorizer = Word2Vec(window=5, min_count=1, workers=4)
    w2vvectorizer.build_vocab([sentence])
    w2vvectorizer.train([sentence], total_examples=w2vvectorizer.corpus_count, epochs=w2vvectorizer.epochs)
    print(w2vvectorizer.wv.most_similar('ineuron'))
    print(w2vvectorizer.wv.similarity('ineuron', 'edtech'))
    print(w2vvectorizer.wv.get_vector('ineuron'))

Output:

['ineuron', 'is', 'an', 'edtech', 'company', 'that', 'provides', 'platform', 'for', 'students', 'to', 'learn', 'and', 'interact', 'with', 'the', 'world', 'around', 'them']
[('world', 0.21892882883548737), ('students', 0.21617504954338074), ('and', 0.09320056438446045), ('platform', 0.0928734838962555), ('learn', 0.07977091521024704), ('to', 0.06299277395009995), ('the', 0.05441128462553024), ('for', 0.02752850018441677), ('provides', 0.016657229512929916), ('is', -0.010736537165939808)]
-0.111435644
[-5.3868821e-04  2.3831481e-04  5.1017073e-03  9.0057505e-03
 -9.3020368e-03 -7.1141296e-03  6.4635910e-03  8.9763561e-03
 -5.0250809e-03 -3.7653032e-03  7.3808716e-03 -1.5371587e-03
 -4.5333384e-03  6.5539577e-03 -4.8574321e-03 -1.8212905e-03
  2.8798427e-03  9.9633494e-04 -8.2910145e-03 -9.4572818e-03
  7.3053432e-03  5.0678705e-03  6.7687123e-03  7.6206285e-04
  6.3510793e-03 -3.4057270e-03 -9.4230933e-04  5.7612793e-03
 -7.5225709e-03 -3.9330292e-03 -7.5007030e-03 -9.3456963e-04
  9.533

# DOC2VEC

Doc2vec (also known as paragraph2vec or sentence embedding) is a modified version of word2vec. Its main objective is to convert a sentence or paragraph into vector (numeric) form. In natural language processing, Doc2Vec is used to find sentences related to a given sentence (rather than words related to a given word, as in Word2Vec).

The doc2vec models may be used in the following way: for training, a set of documents is required. A word vector W is generated for each word, and a document vector D is generated for each document. The model also trains weights for a softmax hidden layer. In the inference stage, a new document may be presented, and all weights are fixed to calculate the document vector.

Doc2Vec has two main architectures:
* PV-DM
* PV-DBOW

## PV-DM

PV-DM is similar to CBOW; it is a small extension of the CBOW model. Instead of using only words to predict the next word, it also adds a feature vector that is unique to the document.


As a result, when the word vectors W are trained, the document vector D is also trained, and at the end of training it holds a numeric representation of the document.


The model described above is known as the Distributed Memory version of the Paragraph Vector model (PV-DM). It acts as a memory that remembers what is missing from the current context, or the topic of the paragraph. The document vector seeks to represent the concept of a document, whereas the word vectors represent the concept of a word.

## PV-DBOW

Another algorithm, similar to skip-gram, is the Distributed Bag of Words version of Paragraph Vector (PV-DBOW). Because there is no need to store the word vectors, this approach is faster and uses less memory.


The authors of the paper recommend combining the two algorithms, although the PV-DM model is superior and can generally reach state-of-the-art results on its own.

## DOC2VEC IMPLEMENTATION USING GENSIM

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument
    from nltk.tokenize import word_tokenize

    doc = ["Ineuron is an Edtech company that provides a platform for students to learn and interact with the world around them.",
           "It aims to provide the platform in an affordable way.",
           "It encourages students to learn the trending technologies"]

    # Lowercase and tokenize each document, then tag it with its index so that
    # Doc2Vec can learn one vector per document.
    tokenized_doc = [word_tokenize(d.lower()) for d in doc]
    tagged_data = [TaggedDocument(d, [i]) for i, d in enumerate(tokenized_doc)]
    print(tagged_data)

Output:

[TaggedDocument(words=['ineuron', 'is', 'an', 'edtech', 'company', 'that', 'provides', 'a', 'platform', 'for', 'students', 'to', 'learn', 'and', 'interact', 'with', 'the', 'world', 'around', 'them', '.'], tags=[0]), TaggedDocument(words=['it', 'aims', 'to', 'provide', 'the', 'platform', 'in', 'an', 'affordable', 'way', '.'], tags=[1]), TaggedDocument(words=['it', 'encourages', 'students', 'to', 'learn', 'the', 'trending', 'technologies'], tags=[2])]


    # Passing tagged_data to the constructor would already build the vocabulary
    # and train once, so the model is created empty and both steps are explicit.
    d2vvectorizer = Doc2Vec(vector_size=20, window=2, min_count=1, workers=4, epochs=100)
    d2vvectorizer.build_vocab(tagged_data)
    d2vvectorizer.train(tagged_data, total_examples=d2vvectorizer.corpus_count, epochs=d2vvectorizer.epochs)
    print(d2vvectorizer.wv.most_similar('ineuron'))
    print(d2vvectorizer.wv.similarity('ineuron', 'edtech'))

Output:

[('that', 0.5715950727462769), ('around', 0.5659588575363159), ('the', 0.5330694913864136), ('it', 0.4953001141548157), ('learn', 0.49458616971969604), ('in', 0.4902716875076294), ('and', 0.4898565113544464), ('them', 0.44566160440444946), ('encourages', 0.40576207637786865), ('to', 0.3959006667137146)]
0.37990546


# FASTTEXT

FastText is a Word2Vec enhancement proposed by Facebook in 2016. Instead of feeding whole words to the neural network, FastText breaks each word into several character n-grams (sub-words). For example, the trigrams for the word apple are app, ppl, and ple (ignoring the markers for the start and end of the word). The word embedding vector for apple is the sum of the vectors of these n-grams. After training the neural network, we have embeddings for all n-grams seen in the training dataset. Rare words can now be represented adequately because some of their n-grams are likely to appear in other words.
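
The sub-word split can be sketched as follows. The function name `char_ngrams` is chosen for this example. The optional `<` and `>` boundary markers follow the convention described in the FastText paper, which also mixes several n-gram lengths rather than using trigrams alone.

```python
def char_ngrams(word, n, add_boundaries=False):
    """Return the character n-grams of a word, optionally with < > markers."""
    if add_boundaries:
        word = f"<{word}>"
    return [word[i:i + n] for i in range(len(word) - n + 1)]


plain = char_ngrams("apple", 3)
bounded = char_ngrams("apple", 3, add_boundaries=True)

print(plain)
print(bounded)
```

With the markers, the prefix `<ap` and the suffix `le>` become n-grams of their own, so the model can tell a word ending from the same letters inside a word.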

## FASTTEXT IMPLEMENTATION USING GENSIM

    from gensim.models import FastText
    from gensim.utils import simple_preprocess

    ftsentence = "Ineuron is an Edtech company that provides a platform for students to learn and interact with the world around them."
    # Each sentence must be a list of word tokens. Passing raw strings makes
    # gensim iterate over characters, so the vocabulary would hold single letters.
    ftsentences = [simple_preprocess(ftsentence)]
    # min_count=1 keeps every word; in a single sentence no word occurs 5 times.
    ftvectorizer = FastText(window=5, min_count=1, workers=4, sg=1)
    ftvectorizer.build_vocab(ftsentences)
    ftvectorizer.train(ftsentences, total_examples=ftvectorizer.corpus_count, epochs=ftvectorizer.epochs)
    print(ftvectorizer.wv.most_similar('ineuron'))
    print(ftvectorizer.wv.similarity('ineuron', 'edtech'))


# EMBEDDINGS FROM PRETRAINED TRANSFORMER MODELS

With state-of-the-art Transformer models such as BERT, XLNet, RoBERTa, XLM, DistilBERT, Flaubert, Electra, XLM-R, and XLM-T, we can easily get embeddings for any given sentence. The embeddings can be used for tasks such as text classification, text generation, or machine translation.

A specific variant is the sentence transformer, a model trained to generate sentence embeddings that can be used to measure the similarity between sentences or to match a search query.
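
Similarity between sentence embeddings is commonly measured with cosine similarity. The sketch below ranks two candidate sentences against a query; the three-dimensional vectors are invented stand-ins for real embeddings, which have hundreds of dimensions, and the names are chosen for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up embeddings standing in for the output of a sentence transformer.
query = [0.2, 0.1, 0.9]
candidates = {
    "sentence A": [0.1, 0.0, 1.0],
    "sentence B": [0.9, 0.4, 0.1],
}

# Rank the candidates from most to least similar to the query.
ranked = sorted(candidates,
                key=lambda name: cosine_similarity(query, candidates[name]),
                reverse=True)
print(ranked)
```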

## EMBEDDINGS FROM PRETRAINED SENTENCE TRANSFORMER MODELS

    from datasets import load_dataset

    # Load the training split of the 'emotion' dataset.
    data = load_dataset('emotion', split='train')



    # Return rows as pandas objects so the split can be used as a DataFrame.
    data.set_format("pandas")
    df = data[:]
    df.head()

Output:

Unnamed: 0,text,label
0,i didnt feel humiliated,0
1,i can go from feeling so hopeless to so damned...,0
2,im grabbing a minute to post i feel greedy wrong,3
3,i am ever feeling nostalgic about the fireplac...,2
4,i am feeling grouchy,3


    import pandas as pd
    from sentence_transformers import SentenceTransformer
    from tqdm import tqdm
    import numpy as np


    def transformer_embedding_generator(df, model):
        """Add an 'embeddings' column with one sentence embedding per row of df['text']."""
        tqdm.pandas()  # enables progress_apply on pandas objects
        model = SentenceTransformer(model)
        df['embeddings'] = df['text'].progress_apply(lambda x: model.encode(x))
        return df


    def coltoarr(df):
        """Stack the per-row embeddings into a single 2-D NumPy array."""
        emb_array = []
        for i in tqdm(range(len(df))):
            emb_array.append(np.array(df["embeddings"][i]))
        emb_array = np.array(emb_array)
        print(emb_array.shape)
        return emb_array


    sent_trans_model = "sentence-transformers/all-MiniLM-L6-v2"
    col_embeddings = transformer_embedding_generator(df, sent_trans_model)
    embeddings = coltoarr(col_embeddings)
    # Expand the embedding array into one DataFrame column per dimension.
    embed = pd.DataFrame(embeddings)
    df = pd.concat([df, embed], axis=1)
    df

Output:


100%|██████████| 16000/16000 [07:05<00:00, 37.61it/s]
100%|██████████| 16000/16000 [00:00<00:00, 56737.86it/s]


(16000, 384)


Unnamed: 0,text,label,embeddings,0,1,2,3,4,5,6,...,374,375,376,377,378,379,380,381,382,383
0,i didnt feel humiliated,0,"[-0.055050936, -0.0076969415, 0.06353022, -0.0...",-0.055051,-0.007697,0.063530,-0.039664,0.116901,-0.123296,0.058080,...,0.063319,-0.044138,-0.034640,0.021249,-0.029084,0.084679,0.016152,0.015425,-0.135161,-0.064534
1,i can go from feeling so hopeless to so damned...,0,"[0.009238858, -0.052964333, 0.01926254, 0.0340...",0.009239,-0.052964,0.019263,0.034021,0.125203,0.027428,0.077058,...,-0.016320,-0.024402,-0.044897,0.132352,-0.082222,0.003469,0.095559,-0.060182,-0.027176,-0.026275
2,im grabbing a minute to post i feel greedy wrong,3,"[-0.074502915, -0.010641919, -0.0034595819, -0...",-0.074503,-0.010642,-0.003460,-0.073246,-0.018509,-0.026024,0.023560,...,0.050347,-0.030673,-0.001017,0.019752,0.078385,-0.010269,0.041514,-0.024779,-0.042020,0.024512
3,i am ever feeling nostalgic about the fireplac...,2,"[0.10859433, 0.09532226, 0.036476813, 0.015178...",0.108594,0.095322,0.036477,0.015178,0.089073,-0.012647,-0.089686,...,0.019334,-0.076964,-0.004122,0.023587,0.056529,0.024166,0.103731,-0.044091,-0.109329,0.034851
4,i am feeling grouchy,3,"[-0.016712196, -0.078770876, 0.032170076, -0.0...",-0.016712,-0.078771,0.032170,-0.053829,0.115593,-0.051190,0.132093,...,-0.011990,0.003192,-0.077645,-0.016146,0.007182,0.029738,0.059137,-0.062703,-0.019559,-0.057704
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
15995,i just had a very brief time in the beanbag an...,0,"[-0.01957004, 0.06608577, 0.0778042, 0.0881720...",-0.019570,0.066086,0.077804,0.088172,0.091618,0.054001,0.057446,...,-0.027110,0.018090,0.031772,0.013136,0.034591,0.068123,-0.029136,-0.083160,-0.095693,-0.019246
15996,i am now turning and i feel pathetic that i am...,0,"[0.062311042, -0.112134434, 0.055730484, 0.021...",0.062311,-0.112134,0.055730,0.021169,0.025935,-0.031744,-0.054156,...,-0.060812,0.052070,-0.014596,0.011752,-0.119441,0.051849,0.018489,-0.046088,-0.026123,0.017155
15997,i feel strong and good overall,1,"[-0.07894578, -0.0861132, 0.028296798, 0.00922...",-0.078946,-0.086113,0.028297,0.009220,-0.014465,-0.012471,0.109142,...,-0.027028,0.037536,-0.066947,0.055402,-0.026460,-0.028652,0.054203,-0.070409,-0.009608,-0.012120
15998,i feel like this was such a rude comment and i...,3,"[-0.08410305, -0.040798713, 0.03854899, -0.073...",-0.084103,-0.040799,0.038549,-0.073090,0.067912,0.003937,0.060229,...,0.089333,0.017486,0.007523,0.003251,0.030494,0.037879,0.050217,-0.059059,0.027899,0.026300
