# <div style="text-align:center; font-size:40px;"><span style="color:#3A506B">❗ Important ❗ </span></div>

Hey there! Whether you came from the ✅ Comprehensive Overview on NLP for Beginners 🥳 notebook or clicked on this notebook directly, I just want to let you know that this notebook is part 2.1 of the NLP beginner series. Here are the links to the whole series:
    
**✅ Comprehensive Overview on NLP for Beginners 🥳 (collection of all series)** <br>
 https://www.kaggle.com/code/crxxom/comprehensive-overview-on-nlp-for-beginners
 
    
**🔴 NLP Beginner Series Part 1: NLP Preprocessing** <br>
https://www.kaggle.com/code/crxxom/nlp-beginner-series-part-1-nlp-preprocessing

**🟡 NLP Beginner Series Part 2.1: Word Embeddings** <br>
https://www.kaggle.com/code/crxxom/nlp-beginner-series-part-2-1-word-embeddings

**🟢 NLP Beginner Series Part 2.2: Embedding Models** <br>
https://www.kaggle.com/code/crxxom/nlp-beginner-series-part-2-2-embedding-models

**🟣 NLP Beginner Series Part 3: Case Study** <br>
 https://www.kaggle.com/code/crxxom/nlp-beginner-series-part-3-case-study

# <div style="text-align:center; font-size:40px;"><span style="color:#3A506B">🔥  Overview of the Notebook  🔥</span></div>


## >> Important concepts and techniques you need to know to get you started with NLP explained in a **➡ beginner friendly way ⬅**

## >> **ONLY** pre-requisite you need: Python, Pandas and Numpy

## >> 🕑 Take your Time! (~ 1 hour)

## >> Is this notebook for me 🤔

This notebook **covers the absolute fundamentals of NLP to give you an overview of what exactly NLP is and how it works.** 

### 🌟 Skim through the notebook

Quickly scroll through the whole notebook to see if the style suits your learning needs!

### 🌟 This notebook is beginner friendly
This notebook is very much a **beginner friendly notebook**; the only pre-requisites you will need are Python and some common data science libraries such as Pandas and Numpy. All the concepts covered in this notebook are explained clearly but not always in depth, so you are encouraged to do additional research on your own if you would like to understand more about a particular topic. The goal of this notebook is to give you an overview of NLP and introduce you to the key vocabulary and concepts you will need to know.


### 🌟 Separate sections to suit your needs
**Each section of the notebook is separate**, i.e. you do NOT need to run the code in section 1 to be able to run the code in section 2. Libraries and frameworks are introduced and imported in each individual section, and datasets are not shared across sections. If you are only interested in a particular section, **you can skip ahead and run the code in that section only**. In fact, I try to make each codeblock as independent as possible so that you do not need to run codeblock A to be able to run B.


### 🌟 Images to help you learn
This notebook contains **a lot of images** to demonstrate the concepts and principles, as I find them to be the best way to understand complicated concepts. These images are mostly from Google and screenshots from YouTube tutorials. If any party finds it inappropriate for me to use their images/screenshots, please let me know and I will delete them as soon as possible.

### 🌟 I am a beginner also
This notebook is **written from the perspective of a beginner** to the world of machine learning. With that being said, there may be some concepts that aren't covered clearly or deeply. On the other hand, I feel it will be easier to follow if you are a beginner as well, since I will not be throwing complicated jargon at you or overloading you too much. I have also included a lot of resources in case you want deeper insight into each topic.


### 🌟 Notebook designed in a way to not overwhelm you
**Take your time!** Long walls of small text are hard and dull to read, which is why this notebook tries to **minimize the amount of text and use bold text, images and emojis** so that you can navigate the notebook faster. The notebook also attempts to **minimize the amount of code** as much as possible, to save you from being overwhelmed. 

**Try to run the codeblocks yourself**; running and changing the code in the codeblocks gives you a more immersive learning experience. 

If you appreciate my work or would like to express your views/make suggestions on some of my code and content, feel free to comment and correct me, I will make sure to keep updating and make improvements to this notebook! I hope you will find this notebook useful and an enjoyable journey.

## 🔎 Concepts and techniques you can take away with you from this notebook

#### 🟢 [Bag of Words (BOW)](#bag-of-words)

#### 🟢 [Bag of n-grams](#bag-of-n-grams)

#### 🟢 [Term Frequency - Inverse Document Frequency (TF-IDF)](#tf-idf)

#### 🟢 [Word Embeddings](#word-embeddings)

#### 🟢 [Cosine Similarity](#cosine-similarity)

#### 🟢 [Arithmetic of Word Embeddings](#arithmetic-of-word-embeddings)

# <a id="bag-of-words"></a><div style="text-align:center; font-size:40px;"><span style="color:#3A506B">Bag of Words (BOW)</span></div>


<div style="width:100%;text-align: center;"> <img align=middle src="https://i.imgur.com/iiIOKHx.png" style="height:200px;"> </div>

❗ *The bag-of-words model is a simplifying representation used in natural language processing and information retrieval. In this model, **a text is represented as the bag of its words**, disregarding grammar and even word order but keeping multiplicity. The bag-of-words model has also been used for computer vision.*

### Computers don't understand strings/words,
that's why we need algorithms and methods to convert words into numbers that a computer can work with. One of these techniques is Bag of Words (BOW). 

The idea behind Bag of Words is fairly simple: **collect every unique token across all the data you feed in and make each one a column**. For each sentence, we then keep a counter for each word/column, just as the image above illustrates. 
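To make the idea concrete before reaching for a library, here is a minimal from-scratch sketch in plain Python. The three toy documents and the helper name `bag_of_words` are made up for illustration, and splitting on whitespace is a simplifying assumption; this is not how sklearn is implemented internally.

```python
from collections import Counter

# Toy documents (hypothetical, already lowercased and free of punctuation)
docs = ["hey weather today", "apple day keep doctor away", "fun fun day"]

# One column per unique token, in sorted order
vocabulary = sorted({token for doc in docs for token in doc.split()})

def bag_of_words(doc, vocabulary):
    """Count how often each vocabulary token occurs in doc."""
    counts = Counter(doc.split())
    return [counts[token] for token in vocabulary]

vectors = [bag_of_words(doc, vocabulary) for doc in docs]

print(vocabulary)
for doc, vector in zip(docs, vectors):
    print(vector, "<-", doc)
```

Note that "fun" appears twice in the last document, so its counter is 2: BOW drops word order but keeps how many times each word occurs.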

### 🧑‍💻 Let's code 

There are certain limitations associated with BOW, but first let us see BOW in action with sklearn, a popular machine learning framework in Python.

In [1]:
# filter out the warnings
import warnings
warnings.filterwarnings('ignore')

In [2]:
# Prepare a very small dataset, you can change the text to test the difference!
import pandas as pd

corpus = ["Hey there, how's the weather today.",
          "Liverpool announced the signing of Szoboszlai",
          "An apple a day keeps the doctor away.",
          "Natural Language Processing is fun!"]

text_df = pd.DataFrame({"Text": corpus})

text_df

| | Text |
|---|---|
| 0 | Hey there, how's the weather today. |
| 1 | Liverpool announced the signing of Szoboszlai |
| 2 | An apple a day keeps the doctor away. |
| 3 | Natural Language Processing is fun! |


In [3]:
# Text preprocessing, we will be 
# 1. removing stop words
# 2. removing punctuations
# 3. changing word to lower case
# 4. lemmatization

import spacy

nlp = spacy.load("en_core_web_sm")

def preprocessing(text):
    doc = nlp(text.lower()) # lowercase
    preprocessed_text = []
    for token in doc:
        if token.is_stop or token.is_punct: # if the token is stop word/punctuation we do not add it to the list
            continue
        preprocessed_text.append(token.lemma_)
    
    return " ".join(preprocessed_text)

In [4]:
# Transform our dataset through applying our preprocessing function to the text_df

text_df["Text"] = text_df["Text"].apply(preprocessing)
text_df

| | Text |
|---|---|
| 0 | hey weather today |
| 1 | liverpool announce signing szoboszlai |
| 2 | apple day keep doctor away |
| 3 | natural language processing fun |


In [5]:
# Now let's do our BOW 

from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()
cv.fit(text_df['Text'])
cv.vocabulary_ # a dictionary that maps each unique token to its column index

{'hey': 6,
 'weather': 15,
 'today': 14,
 'liverpool': 9,
 'announce': 0,
 'signing': 12,
 'szoboszlai': 13,
 'apple': 1,
 'day': 3,
 'keep': 7,
 'doctor': 4,
 'away': 2,
 'natural': 10,
 'language': 8,
 'processing': 11,
 'fun': 5}

In [6]:
cv.get_feature_names_out() # we can also convert the vocabs to an array of the right order

array(['announce', 'apple', 'away', 'day', 'doctor', 'fun', 'hey', 'keep',
       'language', 'liverpool', 'natural', 'processing', 'signing',
       'szoboszlai', 'today', 'weather'], dtype=object)

In [7]:
# To get the vector of a particular document, we call .transform and convert the result to an array

cv.transform(['hey weather today']).toarray()

# In the result, three positions show '1':
# array[6], array[14], array[15] -> matching the indices of 'hey', 'today' and 'weather' in cv.vocabulary_

array([[0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1]])

In [8]:
# Let's visualize the result of countvectorizer

cv = CountVectorizer()
X = cv.fit_transform(text_df['Text'])

feature_names = cv.get_feature_names_out() # get all the unique tokens 

df = pd.DataFrame(X.toarray(), columns=feature_names, index=text_df['Text']) # create a pandas dataframe

df

| Text | announce | apple | away | day | doctor | fun | hey | keep | language | liverpool | natural | processing | signing | szoboszlai | today | weather |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| hey weather today | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| liverpool announce signing szoboszlai | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 |
| apple day keep doctor away | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| natural language processing fun | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |


Now I hope you grasp the general idea of how BOW works. It is indeed a pretty simple process, and you can use the resulting dataframe to train machine learning models for your NLP task. In reality, though, there are several big flaws in using BOW to vectorize the corpus. For instance, it faces the following problems:

### 1. **Sparsity** increases as the number of unique tokens (size of input) increases

<div style="width:100%;text-align: center;"> <img align=middle src="https://i.imgur.com/EdPwBdK.png" style="height:200px;"> </div>
<br>

If you look at the example above, most of the values in each row are zeros, and **the share of useful information in each row decreases as the size of the input increases**. Why? Imagine that instead of 4 sentences we had a million: just imagine how many zeros there would be in each row!
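As a rough illustration, the share of zeros can be measured directly. The sketch below hard-codes the 4 x 16 count matrix from the example above; `sparsity` is a hypothetical helper name, not a library function.

```python
def sparsity(matrix):
    """Fraction of entries in a count matrix that are zero."""
    total = sum(len(row) for row in matrix)
    zeros = sum(value == 0 for row in matrix for value in row)
    return zeros / total

# The 4 x 16 bag-of-words matrix from the example above
bow = [
    [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1],
    [1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0],
    [0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0],
]

print(f"{sparsity(bow):.2%} of the entries are zero")
```

Even with only four short sentences, three quarters of the table is zeros. Every new sentence with new words adds columns that are zero for almost all other rows.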

### 2. **Dimension** increases as the number of unique tokens (size of input) increases

Apart from the problem of sparsity, dimensionality is also a primary concern of BOW. Every unique token becomes its own feature/column, so with a million sentences the vocabulary, and therefore the number of columns, can grow very large, which makes the training process very inefficient.

### 3. BOW cannot handle the **Out of Vocabulary (OOV)** problem

Imagine we use our example dataframe to train a model and now want to classify the sentence "Machine Learning is awesome". Our model simply cannot give us an accurate prediction, because none of these tokens are among our features!
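A small sketch of what goes wrong, using the 16-token vocabulary from the example above. The `vectorize` helper is hypothetical and only mimics the general behaviour of a fitted count vectorizer, which ignores tokens it has never seen; the preprocessed form of the new sentence is also an assumption.

```python
# Vocabulary learned from the four training sentences in the example above
vocabulary = ['announce', 'apple', 'away', 'day', 'doctor', 'fun', 'hey', 'keep',
              'language', 'liverpool', 'natural', 'processing', 'signing',
              'szoboszlai', 'today', 'weather']

def vectorize(text, vocabulary):
    """Count only tokens that are already in the vocabulary; unseen tokens are dropped."""
    tokens = text.lower().split()
    return [tokens.count(word) for word in vocabulary]

# Assumed preprocessed form of "Machine Learning is awesome" (stop word removed)
new_sentence = "machine learning awesome"
vector = vectorize(new_sentence, vocabulary)

print(vector)
print("all zeros:", not any(vector))
```

The new sentence collapses to a vector of zeros, so the model receives no information about it at all.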

# <a id="bag-of-n-grams"></a><div style="text-align:center; font-size:40px;"><span style="color:#3A506B">Bag of n-grams</span></div>


<div style="width:100%;text-align: center;"> <img align=middle src="https://i.imgur.com/QLE08hJ.png" style="height:300px;"> </div>



Just by looking at the image above, you may already have a good guess of what bag of n-grams does if you have gone through the BOW section. Indeed, bag of n-grams simply increases the number of features/columns by changing how the tokens are split: besides single words, it also counts sequences of n consecutive words.
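Here is a minimal sketch of how n-grams are formed from a list of tokens. The function name `ngrams` is made up for illustration, and whitespace splitting is a simplifying assumption.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, each joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "hey weather today".split()

unigrams = ngrams(tokens, 1)  # single words, same as plain BOW
bigrams = ngrams(tokens, 2)   # pairs of neighbouring words

print(unigrams)
print(bigrams)
```

A bag of n-grams with a range of 1 to 2 simply uses the unigrams and the bigrams together as columns.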

### 🧑‍💻 Let's code 

Let's see bag of n-grams in action with sklearn.

In [9]:
# We will reuse the dataset from the previous section, so this is just copy and paste

# Dataset from BOW section
import pandas as pd

corpus = ["Hey there, how's the weather today.",
          "Liverpool announced the signing of Szoboszlai",
          "An apple a day keeps the doctor away.",
          "Natural Language Processing is fun!"]

text_df = pd.DataFrame({"Text": corpus})

# Preprocess function from BOW section

import spacy

nlp = spacy.load("en_core_web_sm")

def preprocessing(text):
    doc = nlp(text.lower()) # lowercase
    preprocessed_text = []
    for token in doc:
        if token.is_stop or token.is_punct: # if the token is stop word/punctuation we do not add it to the list
            continue
        preprocessed_text.append(token.lemma_)
    
    return " ".join(preprocessed_text)

text_df["Text"] = text_df["Text"].apply(preprocessing)
text_df

| | Text |
|---|---|
| 0 | hey weather today |
| 1 | liverpool announce signing szoboszlai |
| 2 | apple day keep doctor away |
| 3 | natural language processing fun |


In [10]:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

cv = CountVectorizer(ngram_range=(1,2)) 
# to switch to a bag of n-grams, we simply change the ngram_range parameter
# in this example we use both single words (unigrams) and pairs of words (bigrams)
# note that the parameter is a (min_n, max_n) range, so (1,3) means n-grams of 1, 2 and 3 words


# Same code from previous section
X = cv.fit_transform(text_df['Text'])

feature_names = cv.get_feature_names_out() # get all the unique tokens 

df = pd.DataFrame(X.toarray(), columns=feature_names, index=text_df['Text']) # create a pandas dataframe


# ----------------------------------------------------------
# just a function that better display large pandas dataframe
from IPython.display import display, HTML

def create_scrollable_table(df, table_id, title):
    html = f'<h3>{title}</h3>'
    html += f'<div id="{table_id}" style="height:200px; overflow:auto;">'
    html += df.to_html()
    html += '</div>'
    return html
# ----------------------------------------------------------

html_df = create_scrollable_table(df, 
                            'Bag of n-grams', 
                            'Bag of n-grams')
display(HTML(html_df))

| Text | announce | announce signing | apple | apple day | away | day | day keep | doctor | doctor away | fun | hey | hey weather | keep | keep doctor | language | language processing | liverpool | liverpool announce | natural | natural language | processing | processing fun | signing | signing szoboszlai | szoboszlai | today | weather | weather today |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| hey weather today | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |
| liverpool announce signing szoboszlai | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 |
| apple day keep doctor away | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| natural language processing fun | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |


The difference between Bag of n-grams and Bag of Words is that **bag of n-grams captures more of the context around each word than BOW**. For example, instead of capturing only 'hey', 'weather' and 'today', bag of n-grams captures 'hey', 'hey weather', 'weather', 'weather today' and 'today'.

At the same time, the sparsity and dimensionality flaws mentioned in the BOW section show up here as well, and even more strongly, since there are now more columns. Just like bag of words, bag of n-grams also cannot deal with Out of Vocabulary (OOV) tokens.

# <a id="tf-idf"></a><div style="text-align:center; font-size:40px;"><span style="color:#3A506B">🌟 Term Frequency - Inverse Document Frequency (TF-IDF)🌟 </span></div>

<div style="width:100%;text-align: center;"> <img align=middle src="https://i.imgur.com/4WcIN8I.png" style="height:300px;"> </div>


While going through the sections on [Bag of Words](#bag-of-words) and [Bag of n-grams](#bag-of-n-grams), did you notice that they both add a weight of 1 for each appearance of a word in the sentence? This might not make a big difference for the tiny dataset in our examples, but for a larger dataset it can be a problem: imagine a word like 'is' ending up with a value of 100 while an important word like 'Tesla', which carries most of the information, only has a value of 10. When we feed this dataset to our model for training, this will **give words like 'is' a relatively high weighting and reduce the accuracy of the model**, because words like 'is' will have a very high value in other sentences as well.

You may suggest that we could simply remove these [stop words](#stop-word) during preprocessing, but sometimes we do not want to do that. In this section I will show you a better technique to tackle this problem: TF-IDF.

❗ *In information retrieval, tf–idf, short for term frequency–inverse document frequency, is a numerical statistic that is intended to **reflect how important a word is to a document in a collection or corpus**. It is often **used as a weighting factor** in searches of information retrieval, text mining, and user modeling.*

To talk about TF-IDF, we have to introduce a little math and statistics to illustrate the concept. But don't get scared, the maths in TF-IDF is relatively easy, so stay with me.

Let's break TF-IDF into two parts, term frequency (TF) and inverse document frequency (IDF).

## Term Frequency (TF)

<div style="width:100%;text-align: center;"> <img align=middle src="https://i.imgur.com/RhnP5xc.png" style="height:120px;"> </div>

This looks a little overwhelming, so let's put the mathematical equation in words:

`Term Frequency(TF) = number of times the word appeared / total no of words in a document`

### ✍ Example: 

The best way to learn is always through examples, so let's see an example with this sentence

>> **"Natural Language Processing is fun Machine Learning is fun"**

❓ Question: Find the term frequency of the word *'is'*

✅ Total number of word in the document: 9 <br>
✅ number of times *is* appeared in the text: 2 <br>
🟰 TF(is): 2/9 

Similarly, what is TF(Machine)? Yes, you're right: 1/9.

🌟 If you think about it, the possible values of TF range from 0 to 1. The closer TF is to 1, the more frequently the word appears in the text.
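The calculation above can be sketched in a few lines of plain Python. `term_frequency` is a hypothetical helper name, and splitting on whitespace is a simplifying assumption (a real pipeline would use a proper tokenizer).

```python
def term_frequency(word, document):
    """Number of times word appears divided by the total number of words."""
    tokens = document.lower().split()
    return tokens.count(word.lower()) / len(tokens)

document = "Natural Language Processing is fun Machine Learning is fun"

tf_is = term_frequency("is", document)
tf_machine = term_frequency("Machine", document)

print(f"TF(is)      = {tf_is:.3f}")       # 2 out of 9 words
print(f"TF(Machine) = {tf_machine:.3f}")  # 1 out of 9 words
```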

Now let's move on to Inverse Document Frequency (IDF)

## Inverse Document Frequency (IDF)


<div style="width:100%;text-align: center;"> <img align=middle src="https://i.imgur.com/AvZkDVx.png" style="height:150px;"> </div>

<br>

To put it in words,

`Inverse Document Frequency(IDF) = log(Total number of documents (rows) / Number of documents (rows) containing the word)`

### ✍ Example: 

Let's see IDF in action with an example, imagine we have 2 documents (rows) 

>> **1. "Natural Language Processing is fun Machine Learning is fun"** <br>
>> **2. "This is the best thing I ever had"**

❓ Question: Find the inverse document frequency of the word *'is'*

✅ Total number of documents: 2 <br>
✅ Number of documents containing the word *is*: 2 <br>
🟰 IDF(is): log(2/2) = 0

Similarly, IDF(Machine) will be log(2/1), which is around 0.7 (note: in machine learning and statistics we usually use the natural logarithm, i.e. log base e).

🌟 If you think about it, IDF is never negative with this formula: it is 0 for a word that appears in every document and grows as the word appears in fewer documents. What we are really doing is giving a smaller weighting to words that appear frequently across all the documents. Now you see why this is a good replacement for simply removing all the stop words: we achieve a similar outcome with a more statistical and robust approach!
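The same example as a small Python sketch. `inverse_document_frequency` is a hypothetical helper name; it uses the natural logarithm as in the text and assumes the word occurs in at least one document (otherwise the division would fail).

```python
import math

def inverse_document_frequency(word, documents):
    """log(total number of documents / number of documents containing the word)."""
    n_containing = sum(word.lower() in doc.lower().split() for doc in documents)
    return math.log(len(documents) / n_containing)

documents = [
    "Natural Language Processing is fun Machine Learning is fun",
    "This is the best thing I ever had",
]

idf_is = inverse_document_frequency("is", documents)
idf_machine = inverse_document_frequency("Machine", documents)

print(f"IDF(is)      = {idf_is:.3f}")       # appears in both documents
print(f"IDF(Machine) = {idf_machine:.3f}")  # appears in one of two documents
```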

<br>

<div style="width:100%;text-align: center;"> <img align=middle src="https://i.imgur.com/SBwxNWb.png" style="height:300px;"> </div>

## TF-IDF


<div style="width:100%;text-align: center;"> <img align=middle src="https://i.imgur.com/7RDG6tk.png" style="height:200px;"> </div>

<br>

Now that you understand the concepts of TF and IDF, TF-IDF is a relatively simple idea. It is basically 

`Term frequency (TF) x Inverse document frequency (IDF)`

Let's try to understand this intuitively. 

🌟 TF-IDF = TF x IDF
> As TF increases, TF-IDF increases ➡ **If a word appears in a document more frequently, we give the word a higher weight and consider it 'more informative and important'** 
<br>
<br>
> As IDF decreases, TF-IDF decreases ➡ **If a word appears ACROSS documents more frequently, we give the word a lower weight and consider it a sort of 'stop word' that is less informative for each individual document**

Now let's see TF-IDF in action with sklearn. I will be using the same corpus as in the Bag of Words section so that we can see the difference TF-IDF makes.
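As a quick sanity check before that, the plain TF x IDF product can be computed by hand for the two example documents from the IDF section. This is a minimal sketch with hypothetical helper names (`tf`, `idf`, `tf_idf`) and whitespace tokenization; it follows the textbook formulas above, not sklearn's exact variant.

```python
import math

documents = [
    "Natural Language Processing is fun Machine Learning is fun",
    "This is the best thing I ever had",
]
tokenized = [doc.lower().split() for doc in documents]

def tf(word, tokens):
    """Share of the document's tokens that are this word."""
    return tokens.count(word) / len(tokens)

def idf(word, tokenized_docs):
    """Natural log of (number of documents / documents containing the word)."""
    n_containing = sum(word in tokens for tokens in tokenized_docs)
    return math.log(len(tokenized_docs) / n_containing)

def tf_idf(word, tokens, tokenized_docs):
    return tf(word, tokens) * idf(word, tokenized_docs)

# Scores for three words of the first document
scores = {word: tf_idf(word, tokenized[0], tokenized) for word in ["is", "fun", "machine"]}
for word, score in scores.items():
    print(f"{word}: {score:.3f}")
```

'is' appears twice in the first document but also in every document, so its score drops to 0. 'fun' appears twice and only in this document, so it gets the highest score of the three.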

### 🧑‍💻 Let's code 

In [11]:
# import the same dataset from the Bag of Word section
import pandas as pd

corpus = ["Hey there, how is the weather today.",
          "Liverpool announced the signing of Szoboszlai",
          "An apple a day keeps the doctor away.",
          "Natural Language Processing is fun!"]

text_df = pd.DataFrame({"Text": corpus})
# ----------------------------------------------------
# preprocess function copy and paste from Bag of Word section, but we DO NOT remove stop word
import spacy

nlp = spacy.load("en_core_web_sm")

def preprocessing(text):
    doc = nlp(text.lower()) # lowercase
    preprocessed_text = []
    for token in doc:
        if token.is_punct: # skip punctuation only; stop words are kept this time
            continue
        preprocessed_text.append(token.lemma_)
    
    return " ".join(preprocessed_text)

text_df["Text"] = text_df["Text"].apply(preprocessing)
# ----------------------------------------------------

# TF-IDF vectorizer

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer()

X = tfidf_vectorizer.fit_transform(text_df["Text"])

feature_names = tfidf_vectorizer.get_feature_names_out() # get all the unique tokens 

tfidf_df = pd.DataFrame(X.toarray(), columns=feature_names, index=text_df["Text"]) # create a pandas dataframe

# ----------------------------------------------------------
# just a function that better display large pandas dataframe
from IPython.display import display, HTML

def create_scrollable_table(df, table_id, title):
    html = f'<h3>{title}</h3>'
    html += f'<div id="{table_id}" style="height:200px; overflow:auto;">'
    html += df.to_html()
    html += '</div>'
    return html
# ----------------------------------------------------------

html_df = create_scrollable_table(tfidf_df, 
                            'TF-IDF', 
                            'TF-IDF')
display(HTML(html_df))

| Text | an | announce | apple | away | be | day | doctor | fun | hey | how | keep | language | liverpool | natural | of | processing | signing | szoboszlai | the | there | today | weather |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| hey there how be the weather today | 0.0 | 0.0 | 0.0 | 0.0 | 0.321093 | 0.0 | 0.0 | 0.0 | 0.407265 | 0.407265 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.259952 | 0.407265 | 0.407265 | 0.407265 |
| liverpool announce the signing of szoboszlai | 0.0 | 0.430037 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.430037 | 0.0 | 0.430037 | 0.0 | 0.430037 | 0.430037 | 0.274487 | 0.0 | 0.0 | 0.0 |
| an apple a day keep the doctor away | 0.395056 | 0.0 | 0.395056 | 0.395056 | 0.0 | 0.395056 | 0.395056 | 0.0 | 0.0 | 0.0 | 0.395056 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.252159 | 0.0 | 0.0 | 0.0 |
| natural language processing be fun | 0.0 | 0.0 | 0.0 | 0.0 | 0.366739 | 0.0 | 0.0 | 0.465162 | 0.0 | 0.0 | 0.0 | 0.465162 | 0.0 | 0.465162 | 0.0 | 0.465162 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
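You may notice that these numbers do not match the plain `TF x IDF` formula from earlier. With its default settings, sklearn's `TfidfVectorizer` uses the raw count as TF, a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, and then scales each row to unit length (L2 normalization). The sketch below reproduces the last row of the table by hand under those defaults; the variable and helper names are made up for illustration.

```python
import math

n_docs = 4  # number of documents in the corpus

# Document frequency of each token in the last row, read off the table above:
# 'be' appears in two documents, the other tokens in one
doc_freq = {"natural": 1, "language": 1, "processing": 1, "be": 2, "fun": 1}
counts = {token: 1 for token in doc_freq}  # each token occurs once in this row

def smoothed_idf(df, n_docs):
    """Smoothed IDF as used by TfidfVectorizer with default settings."""
    return math.log((1 + n_docs) / (1 + df)) + 1

# Raw count times smoothed IDF, then L2-normalize the row
raw = {token: counts[token] * smoothed_idf(doc_freq[token], n_docs) for token in counts}
norm = math.sqrt(sum(value * value for value in raw.values()))
row = {token: value / norm for token, value in raw.items()}

for token, value in row.items():
    print(f"{token}: {value:.6f}")
```

Because of the `+ 1` in the smoothed IDF, a word that appears in every document still gets a small positive weight instead of 0, which is why 'the' is not zero in the table.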


In [12]:
# Copy of what we did in Bag of Word Section

import pandas as pd

corpus = ["Hey there, how is the weather today.",
          "Liverpool announced the signing of Szoboszlai",
          "An apple a day keeps the doctor away.",
          "Natural Language Processing is fun!"]

text_df = pd.DataFrame({"Text": corpus})
# ----------------------------------------------------
# preprocess function copy and paste from Bag of Word section
import spacy

nlp = spacy.load("en_core_web_sm")

def preprocessing(text):
    doc = nlp(text.lower()) # lowercase
    preprocessed_text = []
    for token in doc:
        if token.is_punct or token.is_stop: # if the token is stop word/punctuation we do not add it to the list
            continue
        preprocessed_text.append(token.lemma_)
    
    return " ".join(preprocessed_text)

text_df["Text"] = text_df["Text"].apply(preprocessing)
# ----------------------------------------------------
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()
X = cv.fit_transform(text_df['Text'])

feature_names = cv.get_feature_names_out() # get all the unique tokens 

df = pd.DataFrame(X.toarray(), columns=feature_names, index=text_df['Text']) # create a pandas dataframe


# ----------------------------------------------------------
# just a function that better display large pandas dataframe
from IPython.display import display, HTML

def create_scrollable_table(df, table_id, title):
    html = f'<h3>{title}</h3>'
    html += f'<div id="{table_id}" style="height:200px; overflow:auto;">'
    html += df.to_html()
    html += '</div>'
    return html
# ----------------------------------------------------------

html_df = create_scrollable_table(df, 
                            'Bag-of-Words', 
                            'Bag-of-Words')
display(HTML(html_df))

| Text | announce | apple | away | day | doctor | fun | hey | keep | language | liverpool | natural | processing | signing | szoboszlai | today | weather |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| hey weather today | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| liverpool announce signing szoboszlai | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 |
| apple day keep doctor away | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| natural language processing fun | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |


Now you see the purpose and power of TF-IDF, and its purpose becomes even clearer when you work on a larger dataset. You can see that in the TF-IDF result the weight of words like 'the' is lower than that of the other words in the same row, whereas bag of words gives every occurrence the same weight. (Note that if a word appeared twice in a document, its bag-of-words value would be 2. Every value is 1 here only because each word appears once per document.)

# <a id="word-embeddings"></a><div style="text-align:center; font-size:40px;"><span style="color:#3A506B">🌟 Word Embeddings 🌟</span></div>

<div style="width:100%;text-align: center;"> <img align=middle src="https://i.imgur.com/FFdoj7s.png" style="height:400px;"> </div>

❗ *Word embedding or word vector is an approach with which we represent documents and words. It is defined as a **numeric vector input that allows words with similar meanings to have the same representation**. It can approximate meaning and represent a word in a lower dimensional space.*

This is probably THE MOST IMPORTANT section of the notebook and of the feature engineering process in NLP. Word embedding is a very important concept in NLP: it is basically a way to 'vectorize' words into word vectors (if you don't know what a vector is, think of it as a one-dimensional array).

The point of word embedding is not simply to assign a random vector to each word, but to assign vectors in such a way that words of similar meaning have similar vectors and sit closer together in the vector space.

The techniques we have just covered, like [TF-IDF](#tf-idf), can be seen as simple, count-based ways of turning text into vectors, but this section and the following sections will focus on more advanced word embedding strategies and methods, particularly those involving deep learning. 
<br>

<div style="width:100%;text-align: center;"> <img align=middle src="https://i.imgur.com/jsSEAHK.png" style="height:300px;"> </div>

<br>

But before we dive into the concepts and theory of word embeddings and related techniques like cosine similarity, I want to take another approach and show you what a word vector looks like using spaCy.

### 🧑‍💻 Let's code 

In [13]:
import spacy

nlp = spacy.load("en_core_web_lg")

doc = nlp("Natural Language Processing is fun asdasdasd")

for token in doc:
    print(f"Does the word {token} have a vector in spacy? {token.has_vector}")

Does the word Natural have a vector in spacy? True
Does the word Language have a vector in spacy? True
Does the word Processing have a vector in spacy? True
Does the word is have a vector in spacy? True
Does the word fun have a vector in spacy? True
Does the word asdasdasd have a vector in spacy? False


In [14]:
# Vector of the word "Natural"
doc[0].vector

array([-1.6209e+00, -4.2642e+00,  2.4627e+00, -8.7363e-01,  7.3510e-01,
       -1.2308e+00,  1.0982e+00, -3.9754e-01, -1.7352e+00,  1.0502e+00,
        1.4811e+00,  2.1259e+00, -3.6473e+00, -1.3301e+00,  1.6452e+00,
        1.6438e+00,  2.3485e+00,  1.6963e+00, -2.4390e+00, -4.2472e+00,
       -2.7240e+00,  3.0848e+00, -2.1869e+00,  4.6633e+00,  2.4792e+00,
       -1.3560e+00,  2.1133e+00, -3.9983e+00, -8.5734e-01, -4.3072e-01,
       -6.2560e-01, -2.8925e+00, -2.0539e-01,  2.5628e+00, -5.8583e+00,
       -5.8432e+00,  1.2105e+00,  1.2293e+00, -4.8879e+00,  2.1147e-01,
        1.3969e+00, -1.9619e+00, -1.4299e+00,  8.8560e-01, -2.2521e+00,
        5.8442e+00,  4.8863e+00,  5.6117e-01,  3.0167e+00, -5.3299e+00,
        1.2969e+00,  1.2631e-01, -1.2806e+00,  3.0387e+00, -2.1331e-01,
       -7.0597e-01, -3.4147e+00,  5.3751e-01, -2.2919e+00, -1.0904e+00,
       -1.0717e+00, -3.8249e+00,  1.4455e+00,  2.0177e+00,  2.0078e+00,
       -1.0829e+00, -3.1302e+00, -3.1663e-01,  5.8296e+00, -4.21

In [15]:
doc[0].vector.shape

(300,)

In fact, it is very common for word vectors to have a length of 300.

Now you might be wondering: where do the vectors come from? For now, all you need to know is that the spaCy pipeline provides the word vectors for the tokens. In other words, the vectors were obtained by training models beforehand, and users like us can simply load and use them.

In [16]:
doc = nlp("good happy sad bad")

good = doc[0]

for token in doc:
    print(f"Similarity between good and {token}: {token.similarity(good)}")

Similarity between good and good: 1.0
Similarity between good and happy: 0.5769590735435486
Similarity between good and sad: 0.30881333351135254
Similarity between good and bad: 0.7391888499259949


🌟 It is important to note that when we say two word embeddings are similar, we mean that **the two vectors representing the embeddings are close to each other in some sense, such as geometric distance or cosine similarity**. We are **not only referring to how similar their meanings are**: depending on the training method, similarity can also reflect how likely two words are to appear in the same context. That is why 'good' and 'bad' score so high above.

# <a id="cosine-similarity"></a><div style="text-align:center; font-size:40px;"><span style="color:#3A506B">🌟 Cosine Similarity 🌟</span></div>

<div style="width:100%;text-align: center;"> <img align=middle src="https://i.imgur.com/GhmO2iq.jpg" style=""> </div>

We have mentioned cosine similarity a lot in the previous section. But what exactly is cosine similarity?

❗ *Cosine similarity **measures the similarity between two vectors of an inner product space**. It is **measured by the cosine of the angle between two vectors and determines whether two vectors are pointing in roughly the same direction**. It is often used to measure document similarity in text analysis.*

### The content of this section is heavily inspired by this video by 

### StatQuest: https://www.youtube.com/watch?v=e9U0QAFbfLI 

### It is highly recommended that you watch this 10-minute video for awesome animations and a more detailed explanation. What follows is a summary of the content from the video.

One of the main features of cosine similarity is that it can be used regardless of how large or high-dimensional the dataset is. That is, we can still compute the angle between two vectors no matter how many words and dimensions there are, unlike other metrics such as Euclidean distance (the straight-line distance between two points), which may not work well when the dataset and its dimensionality are huge.

Before we dive into the maths, let's build an intuitive understanding of how cosine similarity works with a very simple example from the StatQuest video.

### ✍ Example (Image and example taken from the StatQuest video)

In particular, let us explore how to find the cosine similarity between the phrases "Hello" and "Hello World".

**1. The first step is to create a count vectorizer, similar to the idea of [bag of words](#bag-of-words).**

<div style="width:100%;text-align: center;"> <img align=middle src="https://i.imgur.com/OhZfQkx.png" style="height: 150px;"> </div>

<br>

**2. Now we plot it on a 2D graph, with each axis representing one unique token**

<div style="width:100%;text-align: center;"> <img align=middle src="https://i.imgur.com/ijzhyc2.png" style="height: 300px;"> </div>

**3. Calculate the cosine similarity. In this case, it is simply the cosine of the angle between the two lines drawn from the origin to each point**

<div style="width:100%;text-align: center;"> <img align=middle src="https://i.imgur.com/FWGI09l.png" style="height: 300px;"> </div>

I hope you grasp the general idea behind cosine similarity with these awesome images. In fact, the following example illustrates the earlier point that cosine similarity can be used regardless of how large the data is, which addresses a weakness of metrics such as Euclidean distance.

Let's look at an example that demonstrates the cosine similarity between the phrases "Hello Hello Hello" and "Hello World".

<div style="width:100%;text-align: center;"> <img align=middle src="https://i.imgur.com/5DoUown.png" style="height: 300px;"> </div>

<br>

This demonstrates that cosine similarity *is determined entirely by the angle between the lines and not by the lengths of the lines*.
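
The point above can be checked numerically. Below is a minimal sketch in plain Python (no external libraries). The helper names `cosine_sim` and `euclidean_dist` are made up for this illustration, and the count vectors are written by hand, assuming the columns are ordered as [Hello, World].

```python
import math

def cosine_sim(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_dist(a, b):
    """Straight-line distance between points a and b."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hand-written count vectors, columns assumed to be [Hello, World]
hello = [1, 0]        # "Hello"
hello_x3 = [3, 0]     # "Hello Hello Hello"
hello_world = [1, 1]  # "Hello World"

sim_short = cosine_sim(hello, hello_world)
sim_long = cosine_sim(hello_x3, hello_world)
dist_short = euclidean_dist(hello, hello_world)
dist_long = euclidean_dist(hello_x3, hello_world)

print(round(sim_short, 4), round(sim_long, 4))    # 0.7071 0.7071
print(round(dist_short, 4), round(dist_long, 4))  # 1.0 2.2361
```

Repeating "Hello" makes the vector longer, so the Euclidean distance changes, but the angle (and therefore the cosine similarity) stays the same.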

### ❗ The greater the value of cosine similarity, the more similar two vectors are to each other

(in general, the value of cosine similarity can be anywhere between -1 and 1, but for word embeddings it is almost always positive in practice, see: https://vaibhavgarg1982.medium.com/why-are-cosine-similarities-of-text-embeddings-almost-always-positive-6bd31eaee4d5#:~:text=Going%20by%20the%20definition%20of,the%20smallest%20score%20was%200.4522.)

The above examples explain the basic concept of cosine similarity. However, drawing the vectors on a graph stops being practical when there are more than 2 unique tokens, i.e. when there are more dimensions. To deal with this problem, we need to use the mathematical formula for cosine similarity.
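
To make the range from -1 to 1 concrete, here is a small sketch in plain Python using hand-picked 2D vectors (they are toy values, not real embeddings, and `cosine_sim` is a helper written just for this illustration).

```python
import math

def cosine_sim(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

same_direction = cosine_sim([1, 0], [2, 0])   # angle of 0 degrees
perpendicular = cosine_sim([1, 0], [0, 1])    # angle of 90 degrees
opposite = cosine_sim([1, 0], [-1, 0])        # angle of 180 degrees

print(same_direction, perpendicular, opposite)  # 1.0 0.0 -1.0
```

Note that count vectors never contain negative numbers, so their cosine similarity cannot go below 0. Negative values are only possible when vectors have negative components.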

<div style="width:100%;text-align: center;"> <img align=middle src="https://i.imgur.com/IxqVJiK.png" style="height: 200px;"> </div>


> cosine similarity = (dot product of A and B) / (magnitude of A x magnitude of B)

Let's see the formula in action with these two phrases
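
Written out in symbols, this is the standard definition of cosine similarity for two vectors $A$ and $B$ with $n$ components each, where $A_i$ and $B_i$ denote the $i$-th components and $\theta$ is the angle between the vectors:

$$\text{cosine similarity}(A, B) = \cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \, \sqrt{\sum_{i=1}^{n} B_i^2}}$$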

> **1. natural language processing is fun** <br>
> **2. machine learning is fun**

In [17]:
# Implement a count vectorizer (for implementation details, see the Bag of Words section)

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["natural language processing is fun",
          "machine learning is fun"]

text_df = pd.DataFrame({"Text": corpus})

cv = CountVectorizer()
X = cv.fit_transform(text_df['Text'])  # sparse matrix of token counts per phrase

feature_names = cv.get_feature_names_out()  # get all the unique tokens

df = pd.DataFrame(X.toarray(), columns=feature_names, index=text_df['Text'])  # create a pandas dataframe

df

| Text | fun | is | language | learning | machine | natural | processing |
|---|---|---|---|---|---|---|---|
| natural language processing is fun | 1 | 1 | 1 | 0 | 0 | 1 | 1 |
| machine learning is fun | 1 | 1 | 0 | 1 | 1 | 0 | 0 |


Now let's see how the maths works. Let's first focus on the **numerator**

<div style="width:100%;text-align: center;"> <img align=middle src="https://i.imgur.com/I6XXRjc.png" style="height: 100px;"> </div>

If you know your linear algebra, you will recognize this as simply the dot product of vector A and vector B, i.e. the dot product of \[1,1,1,0,0,1,1\] and \[1,1,0,1,1,0,0\], which expands to

> (1x1) + (1x1) + (1x0) + (0x1) + (0x1) + (1x0) + (1x0)

See where this is going? The first (1x1) corresponds to the word 'fun': the one in (*1*x1) is the value of vector A and the one in (1x*1*) is the value of vector B for the word 'fun'. We then sum these products over all the unique tokens to get the final result, which here is 2.

Now let us look at the **denominator**

<div style="width:100%;text-align: center;"> <img align=middle src="https://i.imgur.com/EGPoj5Z.png" style="height: 100px;"> </div>


Now that you have an idea of what the notation refers to, this should be very easy. In this example, it is simply

> $\sqrt{(1^2 + 1^2 + 1^2 + 0^2 + 0^2 + 1^2 + 1^2)}$ x $\sqrt{(1^2 + 1^2 + 0^2 + 1^2 + 1^2 + 0^2 + 0^2)}$

Where the first square root is taken over the sum of the squared values in vector A, and the second is the same for vector B


So the overall equation will be 

$\frac{(1*1) + (1*1) + (1*0) + (0*1) + (0*1) + (1*0) + (1*0)}{\sqrt{(1^2 + 1^2 + 1^2 + 0^2 + 0^2 + 1^2 + 1^2)}  *  \sqrt{(1^2 + 1^2 + 0^2 + 1^2 + 1^2 + 0^2 + 0^2)}}$

And if you do the maths, this is $2 / (\sqrt{5} \times \sqrt{4})$, which is roughly equal to 0.4472
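
If you would like to follow the calculation step by step, here is a minimal sketch in plain Python that mirrors the numerator and denominator described above. The variable names are chosen for this illustration only, and the vectors are copied by hand from the count vectorizer output.

```python
import math

# Count vectors, columns ordered as
# [fun, is, language, learning, machine, natural, processing]
A = [1, 1, 1, 0, 0, 1, 1]  # natural language processing is fun
B = [1, 1, 0, 1, 1, 0, 0]  # machine learning is fun

# Numerator: dot product of A and B
numerator = sum(a * b for a, b in zip(A, B))    # 2

# Denominator: magnitude of A times magnitude of B
magnitude_a = math.sqrt(sum(a * a for a in A))  # sqrt(5)
magnitude_b = math.sqrt(sum(b * b for b in B))  # sqrt(4) = 2.0
denominator = magnitude_a * magnitude_b

similarity = numerator / denominator
print(round(similarity, 4))  # 0.4472
```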

In fact we can easily verify it using sklearn

### 🧑‍💻 Let's code 

Using cosine_similarity in Python is relatively simple: you just need to import it from the sklearn library

In [18]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

A = np.array([1,1,1,0,0,1,1]) # natural language processing is fun
B = np.array([1,1,0,1,1,0,0]) # machine learning is fun

cosine_similarity([A], [B])

array([[0.4472136]])

# <a id="arithmetic-of-word-embeddings"></a><div style="text-align:center; font-size:40px;"><span style="color:#3A506B">Arithmetic of Word Embeddings</span></div>

<div style="width:100%;text-align: center;"> <img align=middle src="https://i.imgur.com/6Z1kIR8.png" style="height: 350px;"> </div>


Before we move on to fancier concepts and algorithms, I want to show you an awesome thing word vectors can do: the arithmetic of word embeddings.

As we mentioned, word embeddings are just vectors, so naturally we can do mathematical operations with them. The beauty of word embeddings lies exactly here. Let's take an example from the image above.

### ✍ Example: King 🤴 - Man 👱‍♂️ + Woman 👩 = Queen 👸

> Vector of King: \[0,0,1\] <br>
> Vector of Man: \[0,0,0\] <br>
> Vector of Woman: \[1,0,0\] <br>
> Vector of Queen: \[1,0,1\] <br>

Now when we do the maths, we can indeed verify that King - Man + Woman = Queen: \[0,0,1\] - \[0,0,0\] + \[1,0,0\] = \[1,0,1\]. How amazing! How does this happen? (look at the emoji to have a more intuitive understanding!)

### Did this question ever hit you while you were going through the previous sections -
### >> What does each value of a word vector mean? How do we determine what the vector should look like to achieve this amazing outcome?
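
The arithmetic can be sketched in a few lines of plain Python using the toy vectors from this example. Keep in mind that these are hand-picked illustrative values. With real trained embeddings the result would typically only land close to the vector for "queen" rather than match it exactly, so in practice one would look for the nearest word vector.

```python
# Toy vectors from the example above (hand-picked, not trained embeddings)
king = [0, 0, 1]
man = [0, 0, 0]
woman = [1, 0, 0]
queen = [1, 0, 1]

# Element-wise arithmetic: king - man + woman
result = [k - m + w for k, m, w in zip(king, man, woman)]

print(result)           # [1, 0, 1]
print(result == queen)  # True
```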

The above picture explains a bit. In it, each column of a word vector has a meaning: Femininity, Youth and Royalty. Based on these **semantics** (relating to meaning in language or logic), we assign each word a value for each of these meanings.

In reality, as with the vectors we saw in the spaCy library, we do not actually know what each column means. It is not that the information is undisclosed; we genuinely have no clue what each column represents. All we know is that each column carries its own **meaning**, and it is this **meaning** that allows the magic we saw in word vectors, like similarity and the clustering of similar words. How does this happen, you may ask. It is largely due to the methods and algorithms used to train these word embeddings, which will be introduced in the following section.


# <a id="summary-2"></a><div style="text-align:center; font-size:40px;"><span style="color:#3A506B">📝 Short Summary (Part 2.1: Word Embeddings) 📝</span></div>

Let's review what we have learnt in Part 2.1: Word Embeddings.

### 1. [Bag of Words](#bag-of-words)

A way of vectorizing text in which we use a count vectorizer to count how many times each word appears

### 2. [Bag of n-grams](#bag-of-n-grams)

A way of vectorizing text similar to bag of words, but instead of using only single words as columns, we use sequences of consecutive words as features

### 3. [TF-IDF](#tf-idf)

A way of vectorizing text in which we use a formula to assign a weight to each word in each document, which reduces the need to remove stop words

### 4. [Word Embeddings](#word-embeddings)

A way to represent text in which we turn words into numeric vectors in a meaningful way, so that words that are more similar are closer together in vector space

### 5. [Cosine Similarity](#cosine-similarity)

A popular metric to quantify the similarity between two vectors

# Continue on: **🟢 NLP Beginner Series Part 2.2:  Embedding Models** 
 https://www.kaggle.com/code/crxxom/nlp-beginner-series-part-2-2-embedding-models

<a id="hey-there"></a><div style="text-align:center; font-size:40px;"><span style="color:#3A506B">👋 Hey there 👋</span></div>

### 🌟 About the Author

Hey there, I am an undergraduate majoring in quantitative finance at the University of Hong Kong. I am a highly motivated person with a huge passion for mathematics and computer science related topics. 

When I said I am a beginner at the start of the notebook, I was not lying! In fact, I am a self learner in the field of machine learning and natural language processing, and at the moment I am writing this, I only started my machine learning journey a couple of months ago. 

Although I do not have a lot of experience and knowledge in the field, I am more than happy to connect and learn/work together on some projects and stuff like that.

Did you know Kaggle has a Discord server? https://discord.gg/kaggle (crxxom) <br>
My Linkedin: https://www.linkedin.com/in/jadon-ng-848a48263/

### 🌟 Upvote the notebook!

If you find the notebook useful, it would be great if you could show some support by upvoting the notebook. It means a lot to me! 

Also, if you want to make suggestions/corrections to the content of this notebook, don't hesitate to comment your thoughts. I will reply to you as soon as possible.

One more thing, if you want to share some of your resources/notebooks, drop your links in the comment section and let us learn! 


### 🌟 Will there be updates?

As of now, I will only be updating the notebook if there is some incorrect information or if I discover some useful resources to share. 