In [None]:
import pandas as pd
import numpy as np
import nltk
import re
import os
import random
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import CountVectorizer

<h1 style="text-align: center;" class="list-group-item list-group-item-action active">Table of Contents</h1>
<a class="list-group-item list-group-item-action" data-toggle="list" href = "#1" role="tab" aria-controls="settings">1. Introduction<span class="badge badge-primary badge-pill"></span></a>
<br>
<a class="list-group-item list-group-item-action" data-toggle="list" href = "#2" role="tab" aria-controls="settings">2. Load a Clean Dataset<span class="badge badge-primary badge-pill"></span></a>
<br>
<a class="list-group-item list-group-item-action"  data-toggle="list" href="#3" role="tab" aria-controls="settings">3. Basic Text Pre-Processing<span class="badge badge-primary badge-pill"></span></a>
<br>
   <a class="list-group-item list-group-item-action"  data-toggle="list" href="#4" role="tab" aria-controls="settings">4. One-Hot Encoding<span class="badge badge-primary badge-pill"></span></a>
   <br>
  <a class="list-group-item list-group-item-action"  data-toggle="list" href="#5" role="tab" aria-controls="settings">5. Bag of Words<span class="badge badge-primary badge-pill"></span></a>
  <br>
  <a class="list-group-item list-group-item-action"  data-toggle="list" href="#6" role="tab" aria-controls="settings">6. Bag of N-Grams<span class="badge badge-primary badge-pill"></span></a>
  <br>
  <a class="list-group-item list-group-item-action"  data-toggle="list" href="#7" role="tab" aria-controls="settings">7. TF-IDF<span class="badge badge-primary badge-pill"></span></a>
  <br>
  <a class="list-group-item list-group-item-action"  data-toggle="list" href="#8" role="tab" aria-controls="settings">8. Word2vec Word Embeddings<span class="badge badge-primary badge-pill"></span></a>  


<h1  style="text-align: center" class="list-group-item list-group-item-action active">Introduction</h1><a id = "1" ></a>


In Natural Language Processing (NLP), the conversion of raw text into numerical form is called <b>Text Representation</b>. It is one of the most important steps in the NLP pipeline: if we feed poor features into an ML model, we will get poor results. In computer science, this is often called “garbage in, garbage out.”

<b>In my experience, feeding a good text representation to an ordinary algorithm will get you much farther in NLP than applying a top-notch algorithm to an ordinary text representation.</b>

In this notebook, I will discuss various text-representation schemes along with their advantages and disadvantages, so that you can choose the scheme that best suits your task. Our main objective is to transform a given text into numerical form so that it can be fed into NLP and ML algorithms.

![](https://www.oreilly.com/library/view/practical-natural-language/9781492054047/assets/pnlp_0301.png)

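In this notebook, the focus is on the dotted box in the figure above.

We start by loading a clean dataset and applying a few basic pre-processing steps. We then work through the basic vectorization approaches (one-hot encoding, bag of words, bag of n-grams, and TF-IDF) and finish with distributed representations in the form of Word2vec word embeddings.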


But before moving on to the text representation step, we first need a clean dataset, which then has to be preprocessed. In this notebook, I will use only a few basic steps to preprocess the text data.

<h1  style="text-align: center" class="list-group-item list-group-item-action active">Load a Clean Dataset</h1><a id = "2" ></a>

Kaggle Datasets is one of the best sources for clean datasets. For this notebook, I will be using the [Twitter US Airline Sentiment](https://www.kaggle.com/crowdflower/twitter-airline-sentiment) dataset.


In [1]:
#import data
import kagglehub

# Download latest version
path = kagglehub.dataset_download("crowdflower/twitter-airline-sentiment")

print("Path to dataset files:", path)

Downloading from https://www.kaggle.com/api/v1/datasets/download/crowdflower/twitter-airline-sentiment?dataset_version_number=4...


100%|██████████| 2.55M/2.55M [00:00<00:00, 44.9MB/s]

Extracting files...





Path to dataset files: /root/.cache/kagglehub/datasets/crowdflower/twitter-airline-sentiment/versions/4


In [None]:
#top 5 tweet


In [None]:
# info

In [None]:
# plot for airline_sentiment count for each class

<h1  style="text-align: center" class="list-group-item list-group-item-action active">Basic Text Pre-Processing</h1><a id = "3" ></a>

Text preprocessing consists of a few essential tasks that further clean the available text data, such as:

**1. Stop-Word Removal** : In English, words like a, an, the, as, in, on, etc. are considered stop-words. Depending on our requirements, we can remove them to reduce the vocabulary size, since these words carry little specific meaning on their own.

**2. Lower Casing** : Convert all words to lower case, because upper or lower case may not make a difference for the problem. Doing so also reduces the vocabulary size.

**3. Stemming** : Stemming refers to the process of removing suffixes and reducing a word to some base form such that all different variants of that word can be represented by the same form (e.g., “walk” and “walking” are both reduced to “walk”).

**4. Tokenization** : NLP software typically analyzes text by breaking it up into words (tokens) and sentences.

Pre-processing of the text is not the main objective of this notebook, which is why I am only covering a few basic steps in brief.
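
The four steps above can be sketched with the standard library alone. This is a minimal illustration and not necessarily the `preprocess_text` used later in this notebook: the tiny `STOP_WORDS` set and the suffix-stripping `naive_stem` helper are hypothetical stand-ins for NLTK's stop-word list and `SnowballStemmer`, and both are switched off by default.

```python
import re

# Hypothetical, deliberately tiny stop-word list (NLTK ships a much longer one).
STOP_WORDS = {"a", "an", "the", "as", "in", "on", "is", "are", "am"}

def naive_stem(token):
    """Strip a few common suffixes; a crude stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess_text(text, remove_stop_words=False, stem=False):
    """Return a list of cleaned tokens for one piece of text."""
    text = re.sub(r"@\w+", " ", text)                 # drop @mentions
    tokens = re.findall(r"[a-z0-9]+", text.lower())   # lower-case and tokenize
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    if stem:
        tokens = [naive_stem(t) for t in tokens]
    return tokens

tweet = "@VirginAmerica I &lt;3 pretty graphics. so much better than minimal iconography. :D"
default_tokens = preprocess_text(tweet)
strict_tokens = preprocess_text("The cat is walking on the mat",
                                remove_stop_words=True, stem=True)
print(default_tokens)
print(strict_tokens)
```

Whether to enable stop-word removal and stemming depends on the task: both shrink the vocabulary, but they also throw away information that some models can use.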


In [None]:
# First of all, let's drop the columns we don't need


In [None]:
#


In [None]:
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer

# 1. Stop-Word Removal

# 2. Lower Casing

# 3. Stemming

# 4. Tokenization

[nltk_data] Downloading package stopwords to /usr/share/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
print(f"Original Text : {data.text[11]}")
print()
print(f"Preprocessed Text : {preprocess_text(data.text[11])}")

Original Text : @VirginAmerica I &lt;3 pretty graphics. so much better than minimal iconography. :D

Preprocessed Text : ['i', 'lt', '3', 'pretty', 'graphics', 'so', 'much', 'better', 'than', 'minimal', 'iconography', 'd']


In [None]:
data.text = data.text.map(preprocess_text)
data.head()

Unnamed: 0,airline_sentiment,text
0,neutral,"[what, said]"
1,positive,"[plus, you, ve, added, commercials, to, the, e..."
2,neutral,"[i, didn, t, today, must, mean, i, need, to, t..."
3,negative,"[it, s, really, aggressive, to, blast, obnoxio..."
4,negative,"[and, it, s, a, really, big, bad, thing, about..."


Now that the text data is preprocessed, we can move on and discuss the various text representation approaches in detail.

<h1  style="text-align: center" class="list-group-item list-group-item-action active">One-Hot Encoding</h1><a id = "4" ></a>


In one-hot encoding, each word w in the corpus vocabulary is given a unique integer ID (wid) between 1 and |V|, where V is the corpus vocabulary. Each word is then represented by a |V|-dimensional binary vector that is all 0s except at the index equal to wid, where we put a 1. The representations of the individual words are then combined to form a sentence representation.

Consider an example:

![](https://miro.medium.com/max/886/1*_da_YknoUuryRheNS-SYWQ.png)

In [None]:
#this is an example vocabulary just to make concept clear
sample_vocab = ['the', 'cat', 'sat', 'on', 'mat', 'dog', 'run', 'green', 'tree']
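
A minimal sketch of the idea on the sample vocabulary, using plain Python lists. The function name `get_onehot_representation` follows the comment in the cell further down, but the body here is only an assumption about how it might look. IDs are 0-based here (Python indexing) rather than 1 to |V|, and unknown words are left as all-zero rows.

```python
sample_vocab = ['the', 'cat', 'sat', 'on', 'mat', 'dog', 'run', 'green', 'tree']

# Map every vocabulary word to a unique integer ID (0-based).
word_to_id = {word: idx for idx, word in enumerate(sample_vocab)}

def get_onehot_representation(tokens, word_to_id):
    """Return one |V|-dimensional 0/1 vector per token."""
    vectors = []
    for token in tokens:
        vector = [0] * len(word_to_id)
        if token in word_to_id:          # out-of-vocabulary words stay all-zero
            vector[word_to_id[token]] = 1
        vectors.append(vector)
    return vectors

onehot = get_onehot_representation("the cat sat on the mat".split(), word_to_id)
print(len(onehot), len(onehot[0]))   # number of words, vocabulary size
for row in onehot:
    print(row)
```

The result has one row per word, so its shape depends on the sentence length, which is exactly the variable-length problem discussed in the disadvantages.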

In [None]:
# vocabulary of words present in dataset


In [None]:
#function to return one-hot representation of passed text get_onehot_representation()

#try One Hot Representation for sentence "the cat sat on the mat"


Shapes of a single sentence : (15, 14276)


In [None]:
#one-hot representation for dataset sentences


#if you run this cell it will give you a memory error 😂

One-hot encoding is intuitive to understand and straightforward to implement. However, it has several disadvantages, listed below:

1. The size of a one-hot vector is directly proportional to the size of the vocabulary. A real-world vocabulary may contain millions of words, and representing every single word with a million-dimensional, almost entirely zero vector is impractical.

2. One-hot representation does not give a fixed-length representation for text: a sentence with 32 words and a sentence with 40 words get representations of different sizes. But for most learning algorithms, we need the feature vectors to be of the same length.

3. One-hot representation gives each word the same weight, whether that word is important for the task or not.

4. One-hot representation does not capture the meaning of a word numerically the way embedding vectors do. For example, the words “read” and “reading” should have similar vector representations, but their one-hot vectors are as different as those of any two unrelated words.

5. Say we train a model on some articles and get a vocabulary of size 10,000. If we then apply it to text containing words that are not in the learned vocabulary, we have no way to represent them. This is known as the **Out Of Vocabulary (OOV)** problem.


<h1  style="text-align: center" class="list-group-item list-group-item-action active">Bag of Words</h1><a id = "5" ></a>

Bag of words (BoW) is a classical text representation technique that has been used commonly in NLP, especially in text classification problems. The key idea behind it is as follows: represent the text under consideration as a bag (collection) of words while ignoring the order and context.

Similar to one-hot encoding, BoW maps words to unique integer IDs between 1 and |V|. Each document in the corpus is then converted into a vector of |V| dimensions, where the i-th component of the vector, i = wid, is simply the number of times the word w occurs in the document. In other words, we score each word in V by its occurrence count in the document.

Consider an example:

Let's say we have a vocabulary **V consisting of the words --> {the, cat, sat, in, hat, with}**. The bag-of-words representation of a few sentences is then given as:

![](https://miro.medium.com/max/1400/1*3IACMnNpwVlCl8kSTJocPA.png)
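
The counting in this example can be reproduced with a few lines of plain Python. This is a sketch under simple assumptions (whitespace tokenization, a fixed vocabulary in the order listed above); the helper name `bow_vector` is hypothetical. scikit-learn's `CountVectorizer` does the same kind of counting but orders its columns alphabetically, so its column order will differ.

```python
from collections import Counter

# Vocabulary in the order used in the example above.
vocab = ["the", "cat", "sat", "in", "hat", "with"]

def bow_vector(sentence, vocab):
    """Count how often each vocabulary word occurs in the sentence."""
    counts = Counter(sentence.lower().split())
    return [counts[word] for word in vocab]   # missing words count as 0

sample_corpus = ["the cat sat", "the cat sat in the hat", "the cat with the hat"]
corpus_vectors = [bow_vector(sentence, vocab) for sentence in sample_corpus]
for sentence, vector in zip(sample_corpus, corpus_vectors):
    print(f"{sentence!r:28} -> {vector}")

repeated = bow_vector("the cat cat sat in the hat", vocab)
print(repeated)
```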

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

sample_bow = CountVectorizer()

# sample_corpus = [['the', 'cat', 'sat'],
#                  ['the', 'cat', 'sat', 'in', 'the', 'hat'],
#                  ['the', 'cat', 'with', 'the', 'hat']]

sample_corpus = ["the cat sat", "the cat sat in the hat", "the cat with the hat"]

# code


# Bag of word Representation of sentence 'the cat cat sat in the hat'

**Advantages of Bag of Words (BoW) encoding** :

1. Like one-hot encoding, BoW is fairly simple to understand and implement.

2. With this representation, documents having the same words will have their vector representations closer to each other in Euclidean space as compared to documents with completely different words.

    Consider an example where

    S1 = "cat on the mat" --> BoW Representation --> {0 1 1 0 1 0 1} <br>
    S2 = "mat on the cat" --> BoW Representation --> {0 1 1 0 1 0 1} <br>
    S3 = "dog in the mat" --> BoW Representation --> {0 1 0 1 1 1 0} <br>

    The distance between S1 and S2 is 0, whereas the distance between S1 and S3 is 2. Thus, the vector space resulting from the BoW scheme captures some semantic similarity between documents: if two documents have a similar vocabulary, they’ll be closer to each other in the vector space, and vice versa.

3. We have a fixed-length encoding for any sentence of arbitrary length.
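
The distances quoted in point 2 can be checked directly. The sketch below copies the three BoW vectors from the example and computes the Euclidean distance between them; `euclidean_distance` is a small helper written for this illustration.

```python
import math

# BoW vectors copied from the S1/S2/S3 example above.
s1 = [0, 1, 1, 0, 1, 0, 1]   # "cat on the mat"
s2 = [0, 1, 1, 0, 1, 0, 1]   # "mat on the cat"
s3 = [0, 1, 0, 1, 1, 1, 0]   # "dog in the mat"

def euclidean_distance(u, v):
    """Square root of the summed squared component differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

d12 = euclidean_distance(s1, s2)
d13 = euclidean_distance(s1, s3)
print(d12, d13)
```

S1 and S3 differ in four components, which gives a distance of sqrt(4) = 2, while S1 and S2 have identical vectors.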

**Disadvantages of Bag of Words (BoW) encoding** :

1. The size of the vector increases with the size of the vocabulary; in our case it is 14238-dimensional. Thus, sparsity continues to be a problem. One way to control it is to limit the vocabulary to the n most frequent words.

2. It does not capture the similarity between different words that mean the same thing. Say we have three documents: “walk”, “walked”, and “walking”. BoW vectors of all three documents will be equally apart.

3. This representation does not have any way to handle **out of vocabulary (OOV)** words (i.e., new words that were not seen in the corpus that was used to build the vectorizer).

4. As the name indicates, it is a “bag” of words: word order information is lost in this representation. S1 and S2 above have the same representation in this scheme, even though they say different things.


<h1  style="text-align: center" class="list-group-item list-group-item-action active">Bag of N-Grams</h1><a id = "6" ></a>

All the representation schemes we’ve seen so far treat words as independent units. There is no notion of phrases or word order. The bag-of-n-grams (BoN) approach tries to remedy this. It does so by breaking text into chunks of n contiguous words (or tokens). This can help us capture some context, which earlier approaches could not do. Each chunk is called an n-gram.

**One can simply say Bag of words (BoW) is a special case of the Bag of n-grams having n = 1.**

The corpus vocabulary, V, is then nothing but a collection of all unique n-grams across the text corpus. Then, each document in the corpus is represented by a vector of length |V|. This vector simply contains the frequency counts of n-grams present in the document and zero for the n-grams that are not present.

Consider an Example:

![](https://i.stack.imgur.com/8ARA1.png)


The following code cell shows an example of a BoN representation considering 1–3 n-gram word features to represent the corpus that we’ve used so far.
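
Before turning to a vectorizer, it may help to see the n-gram extraction itself. This is a dependency-free sketch with hypothetical helper names (`ngrams`, `bag_of_ngrams`) and plain whitespace tokenization; in scikit-learn the same range is selected with `CountVectorizer(ngram_range=(1, 3))`.

```python
from collections import Counter

def ngrams(tokens, n):
    """All chunks of n contiguous tokens, joined with a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bag_of_ngrams(sentence, n_min=1, n_max=1):
    """Frequency counts of every n-gram with n_min <= n <= n_max."""
    tokens = sentence.lower().split()
    counts = Counter()
    for n in range(n_min, n_max + 1):
        counts.update(ngrams(tokens, n))
    return counts

sentence = "the cat cat sat in the hat"
bigrams = ngrams(sentence.split(), 2)
bag = bag_of_ngrams(sentence, n_min=1, n_max=3)
print(bigrams)
print(bag)
```

Even for this seven-word sentence the bag already holds 16 distinct features, which hints at how quickly the vocabulary grows with n.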


In [None]:
# Bag of 1-gram (unigram)
from sklearn.feature_extraction.text import CountVectorizer

# code




# Bag of 1-gram (unigram) Representation of sentence 'the cat cat sat in the hat'

In [None]:
# Bag of 2-gram (bigram)
from sklearn.feature_extraction.text import CountVectorizer

# code



# Bag of 2-gram (bigram) Representation of sentence 'the cat cat sat in the hat'

In [None]:
# Bag of 3-gram (trigram)
from sklearn.feature_extraction.text import CountVectorizer


# code




# Bag of 3-gram (trigram) Representation of sentence 'the cat cat sat in the hat'

**Here are the main advantages and disadvantages of BoN Representation:**

1. It captures some context and word-order information in the form of n-grams.

2. Thus, the resulting vector space can capture some semantic similarity. Documents having the same n-grams will have their vectors closer to each other in Euclidean space as compared to documents with completely different n-grams.

3. As n increases, dimensionality (and therefore sparsity) increases rapidly.

4. It still provides no way to address the **out of vocabulary(OOV)** problem.

<h1  style="text-align: center" class="list-group-item list-group-item-action active">TF-IDF</h1><a id = "7" ></a>


In all three approaches we’ve seen so far, all the words in the text are treated as equally important; there’s no notion of some words in the document being more important than others. TF-IDF, or term frequency-inverse document frequency, addresses this issue. It aims to quantify the importance of a given word relative to other words in the document and in the corpus.

The intuition behind TF-IDF is as follows: if a word w appears many times in a sentence S1 but does not occur much in the other sentences of the corpus, then w must be of great importance to S1. The importance of w should increase in proportion to its frequency in S1 (how many times the word occurs in S1), but at the same time decrease in proportion to the word’s frequency in the other sentences of the corpus. **Mathematically, this is captured using two quantities: TF and IDF. The two are then multiplied to arrive at the TF-IDF score.**

**TF (term frequency) measures how often a term or word occurs in a given document.**

Mathematical Expression of TF

![image-2.png](attachment:image-2.png)

**IDF (inverse document frequency)** measures the importance of the term across a corpus. In computing TF, all terms are given equal importance (weightage). However, it’s a well-known fact that stop words like is, are, am, etc., are not important, even though they occur frequently. To account for such cases, IDF weighs down the terms that are very common across a corpus and weighs up the rare terms. IDF of a term t is calculated as follows:

![image.png](attachment:image.png)
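
In case the formula images above do not render, the usual textbook definitions are written out below. They are assumed to match the images; libraries such as scikit-learn use a smoothed variant of IDF, so their numbers can differ slightly.

$$
\mathrm{TF}(t, d) = \frac{\text{number of occurrences of term } t \text{ in document } d}{\text{total number of terms in document } d}
$$

$$
\mathrm{IDF}(t) = \log_e \frac{\text{total number of documents in the corpus}}{\text{number of documents containing term } t}
$$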

The TF-IDF score is a product of these two terms. Thus, TF-IDF score = TF * IDF. Let’s consider an example.

Sentence A = The Car is Driven on the Road <br>
Sentence B = The Truck is Driven on the highway <br>

The computation of the TF-IDF scores is shown below:

![](https://cdn-media-1.freecodecamp.org/images/1*q3qYevXqQOjJf6Pwdlx8Mw.png)

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()


# code



# TF-IDF Representation for sentence 'the cat sat in the hat'

IDF Values for sample corpus : [1.         1.28768207 1.69314718 1.28768207 1.         1.69314718]
TF-IDF Representation for sentence 'the cat sat in the hat' :
[[0.29903422 0.385061   0.50630894 0.385061   0.59806843 0.        ]]
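
The numbers printed above can be reproduced by hand. The sketch below assumes the three-sentence sample corpus from the Bag of Words section and scikit-learn's default `TfidfVectorizer` settings: raw term counts as TF, smoothed IDF computed as `ln((1 + N) / (1 + df)) + 1`, and L2 normalisation of each vector.

```python
import math
from collections import Counter

sample_corpus = ["the cat sat", "the cat sat in the hat", "the cat with the hat"]
docs = [doc.split() for doc in sample_corpus]

# Alphabetically sorted vocabulary: ['cat', 'hat', 'in', 'sat', 'the', 'with']
vocab = sorted({token for doc in docs for token in doc})

n_docs = len(docs)
# Document frequency: in how many documents does each term occur?
doc_freq = {term: sum(term in doc for doc in docs) for term in vocab}
# Smoothed IDF, as used by scikit-learn when smooth_idf=True (the default).
idf = {term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1 for term in vocab}

def tfidf_vector(sentence):
    """Raw count times IDF for each vocabulary term, then L2-normalised."""
    counts = Counter(sentence.split())
    raw = [counts[term] * idf[term] for term in vocab]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm if norm else 0.0 for x in raw]

idf_values = [idf[term] for term in vocab]
vector = tfidf_vector("the cat sat in the hat")
print([round(x, 8) for x in idf_values])
print([round(x, 8) for x in vector])
```

Note how “the” ends up with the largest weight in this sentence despite having the lowest IDF: it occurs twice, and with only three documents the IDF values are too close together to outweigh that.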


Similar to BoW, we can use the TF-IDF vectors to calculate the similarity between two texts using a measure such as Euclidean distance or cosine similarity. TF-IDF is a commonly used representation in application scenarios such as information retrieval and text classification. However, even though TF-IDF is better than the vectorization methods we saw earlier at capturing similarities between words, **it still suffers from the curse of high dimensionality.**

**Here are the main advantages and disadvantages of TF-IDF Representation:**

1. It is not as simple to implement as the techniques discussed above.
2. We have a fixed-length encoding for any sentence of arbitrary length.
3. The feature vectors are high-dimensional representations. The dimensionality increases with the size of the vocabulary.
4. It captures a bit of the semantics of the sentence.
5. It too cannot handle OOV words.

With this, we come to the end of basic vectorization approaches. Now, let’s start looking at distributed representations.

<h1  style="text-align: center" class="list-group-item list-group-item-action active">Word2vec Word Embeddings</h1><a id = "8" ></a>

**Word Embeddings** : These are real-valued vector representations of words that allow words with similar meanings to have similar representations. We can think of word embeddings as a projection of word meanings into a real-valued vector space.

Word2vec is a word embedding technique published in 2013. The word2vec algorithm uses a neural network model to learn word associations from a large corpus of text.

Word2vec operationalizes this by projecting the meaning of words into a vector space where words with similar meanings tend to cluster together, and words with very different meanings are far from one another.

**Using Pre-trained word2vec word embeddings** <br>
Training your own word embeddings is a pretty expensive process (in terms of both time and compute). Thankfully, for many scenarios, it’s not necessary to train your own embeddings. Someone has already done the hard work of training word embeddings on a large corpus, such as Wikipedia, news articles, or even the entire web, and has published the words and their corresponding vectors online. These embeddings can be downloaded and used to get the vectors for the words you want.

Some of the most popular pre-trained embeddings are Word2vec by Google, GloVe by Stanford, and fasttext embeddings by Facebook, to name a few.

The code cell below demonstrates how to use pre-trained word2vec word embeddings.

**Training our own embeddings**

Now we’ll focus on training our own word embeddings. For this, we’ll look at two architectural variants that were proposed in the original Word2vec approach. The two variants are:

1. Continuous bag of words (CBOW)
2. SkipGram

Both of these have a lot of similarities in many respects.

Throughout this section, we’ll use the sentence “The quick brown fox jumps over the lazy dog” as our example text.

**1. Continuous bag of words (CBOW)**

In CBOW, the primary task is to build a language model that correctly predicts the center word given the context words in which it appears. Consider our example sentence: if we take the word “jumps” as the center word, its context is formed by the words in its vicinity. With a context size of 2, the context is given by brown, fox, over, the. CBOW uses the context words to predict the target word, “jumps”, as shown in the figure below.
<br><br>

![image-1.png](attachment:image.png)

<br><br>
The next task is to create training samples of the form (X, Y), where X is the set of context words and Y is the center word. We use a context window of 2 in this case.
![image-3.png](attachment:image-2.png)
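
Generating these (X, Y) samples can be sketched as follows. The helper name `cbow_pairs` is hypothetical, and one assumption is worth flagging: words near the start or end of the sentence simply get a shorter context here, which may differ from how the figure treats the edges.

```python
def cbow_pairs(tokens, window=2):
    """Return (context_words, center_word) pairs for every position."""
    pairs = []
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window): i]       # up to `window` words before
        right = tokens[i + 1: i + 1 + window]      # up to `window` words after
        pairs.append((left + right, center))
    return pairs

tokens = "the quick brown fox jumps over the lazy dog".split()
pairs = cbow_pairs(tokens, window=2)
for context, center in pairs:
    print(context, "->", center)
```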

<br><br>
Now that we have the training data ready, let’s focus on the model. For this, we construct a shallow net (it’s shallow since it has a single hidden layer). We assume we want to learn d-dimensional word embeddings. Further, let V be the vocabulary of the text corpus.

![image-4.png](attachment:image.png)


<br><br>
The objective is to learn an embedding matrix $E_{|V| \times d}$. To begin with, we initialize the matrix randomly. Here, |V| is the size of the corpus vocabulary and d is the dimension of the embedding. Let’s break down the shallow net in the figure above layer by layer. In the input layer, the indices of the context words are used to fetch the corresponding rows from the embedding matrix $E_{|V| \times d}$. The fetched vectors are then added to get a single d-dimensional vector, which is passed to the next layer. The next layer simply takes this d-dimensional vector and multiplies it by another matrix $E'_{d \times |V|}$. This gives a 1 x |V| vector, which is fed to a softmax function to get a probability distribution over the vocabulary. This distribution is compared with the label, and backpropagation is used to update both matrices E and E’ accordingly. At the end of training, E is the embedding matrix we wanted to learn.
<br><br>
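
The forward pass described above can be sketched in plain Python. This is an illustration only: the embedding size `d = 4`, the random initialisation range, and the function name `cbow_forward` are arbitrary assumptions, and the backpropagation step that would actually update E and E’ is left out.

```python
import math
import random

random.seed(0)

vocab = ["the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"]
word_to_id = {word: idx for idx, word in enumerate(vocab)}
V, d = len(vocab), 4   # d = 4 is an arbitrary toy embedding size

# E is |V| x d and E_prime is d x |V|; both start out random.
E = [[random.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(V)]
E_prime = [[random.uniform(-0.5, 0.5) for _ in range(V)] for _ in range(d)]

def cbow_forward(context_words):
    """Return a probability distribution over the vocabulary."""
    # Input layer: fetch the context rows of E and add them up (d values).
    hidden = [sum(E[word_to_id[w]][k] for w in context_words) for k in range(d)]
    # Next layer: (1 x d) times (d x |V|) gives one score per vocabulary word.
    scores = [sum(hidden[k] * E_prime[k][j] for k in range(d)) for j in range(V)]
    # Softmax, shifted by the maximum score for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = cbow_forward(["brown", "fox", "over", "the"])
# Cross-entropy loss against the true center word "jumps".
loss = -math.log(probs[word_to_id["jumps"]])
print(len(probs), round(sum(probs), 6))
print(loss)
```

Training would repeat this for every (context, center) pair and nudge E and E’ to lower the loss; in practice a library such as gensim handles that loop.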

**2. SkipGram**

SkipGram is very similar to CBOW, with some minor changes. In SkipGram, the task is to predict the context words from the center word. For our toy corpus with context size 2, using the center word “jumps,” we try to predict every word in the context (“brown,” “fox,” “over,” “the”) as shown in the figure below.

![image-5.png](attachment:image.png)

Now we will create training samples of the form (X, Y) for this task, where X is the center word and Y is one of its context words.

<br>
<br>

![image-6.png](attachment:image-2.png)

<br>
<br>

![image-9.png](attachment:image-3.png)
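
The SkipGram training samples can be generated the same way as for CBOW, except that each context word becomes its own (X, Y) pair. The helper name `skipgram_pairs` is hypothetical, and words at the sentence edges are assumed to simply have fewer context words.

```python
def skipgram_pairs(tokens, window=2):
    """Return one (center_word, context_word) pair per context position."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:                      # skip the center word itself
                pairs.append((center, tokens[j]))
    return pairs

tokens = "the quick brown fox jumps over the lazy dog".split()
pairs = skipgram_pairs(tokens, window=2)
jumps_pairs = [pair for pair in pairs if pair[0] == "jumps"]
print(jumps_pairs)
print(len(pairs))
```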

<br>
<br>

The shallow network used to train the SkipGram model, shown in the figure below, is very similar to the network used for CBOW, with some minor changes. In the input layer, the index of the center word is used to fetch the corresponding row from the embedding matrix $E_{|V| \times d}$. The fetched vector is then passed to the next layer, which simply multiplies this d-dimensional vector by another matrix $E'_{d \times |V|}$. This gives a 1 x |V| vector, which is fed to a softmax function to get a probability distribution over the vocabulary. This distribution is compared with the label, and backpropagation is used to update both matrices E and E’ accordingly. At the end of training, E is the embedding matrix we wanted to learn.