# 2. Word Vectorization

So far we have learned techniques for splitting sentences and paragraphs into individual words, reducing them to their base forms, and removing irrelevant words. However, these words cannot yet be used as model input because they are still in textual format.

**We need to convert input data from its raw textual format into vectors of real numbers**. This step is also known as **vectorization**.

In a way, vectorization can be seen as a feature extraction step: the main idea revolves around getting distinct features out of the text in order to train our model.

There are plenty of algorithms and pre-written models that can be used to vectorize textual data. In this tutorial, however, we will look only at **Bag of Words** and **TF-IDF**.

## Bag of Words

It is one of the simplest techniques out there and involves three steps:

1. **Tokenization**. We split the text into sentences and each sentence into a list of word tokens.

2. **Creating vocabulary**. To create the vocabulary, we extract the unique words from the whole word list and sort them alphabetically.

3. **Creating vector**. After counting how often each vocabulary word appears in each sentence, we construct a matrix. In this sparse matrix, each row is a sentence vector whose length is equal to the size of the vocabulary.
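
The three steps above can be sketched in plain Python before reaching for a library. This is a minimal illustration only: the function name `bag_of_words`, the simple `[a-z]+` token rule, and the two toy sentences are assumptions made for this example, not part of the tutorial's own code.

```python
import re

def bag_of_words(sentences):
    """Return (vocabulary, matrix) for a list of sentence strings."""
    # 1. Tokenization: lowercase each sentence and split it into word tokens.
    tokenized = [re.findall(r"[a-z]+", s.lower()) for s in sentences]
    # 2. Vocabulary: unique words across all sentences, sorted alphabetically.
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    # 3. Vectors: one row per sentence, one count per vocabulary word.
    matrix = [[tokens.count(word) for word in vocabulary] for tokens in tokenized]
    return vocabulary, matrix

# Hypothetical toy sentences, chosen only to keep the output small.
vocabulary, matrix = bag_of_words(["The cat sat.", "The cat saw the dog."])
print(vocabulary)  # ['cat', 'dog', 'sat', 'saw', 'the']
for row in matrix:
    print(row)     # [1, 0, 1, 0, 1] then [1, 1, 0, 1, 2]
```

Each row has exactly as many entries as the vocabulary, and most of them are zero once the vocabulary grows, which is why real implementations store the result as a sparse matrix.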

Without further ado, let's look at one possible implementation.

In [3]:
#Setting up
!pip install nltk
import nltk
nltk.download("popular")



In [11]:
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import sent_tokenize

Text = "Natural language processing (NLP) refers to the branch of computer science. To be more specific, the branch of artificial intelligence. It is concerned with giving computers the ability to understand text and spoken words in much the same way human beings can."

#Sentence tokenization
sent = sent_tokenize(Text)

#Creating CountVectorizer
cv = CountVectorizer()

X = cv.fit_transform(sent)

#Converting to array for visualization purposes
X = X.toarray()

Let's print our vocabulary to better understand the structure of our matrix:

In [12]:
sorted(cv.vocabulary_.keys())

['ability',
 'and',
 'artificial',
 'be',
 'beings',
 'branch',
 'can',
 'computer',
 'computers',
 'concerned',
 'giving',
 'human',
 'in',
 'intelligence',
 'is',
 'it',
 'language',
 'more',
 'much',
 'natural',
 'nlp',
 'of',
 'processing',
 'refers',
 'same',
 'science',
 'specific',
 'spoken',
 'text',
 'the',
 'to',
 'understand',
 'way',
 'with',
 'words']

A few things to note:
- Every row represents a sentence extracted from the initial text
- The length of each row (that is, the number of columns) is equal to the size of the vocabulary, 35 words here
- Each value represents the word's frequency in that sentence

As you can see, Bag of Words is a very simple yet effective algorithm that can be applied in many cases. However, its simplicity causes some limitations. Since the BoW model is concerned only with word frequency, articles, prepositions, and conjunctions get the same importance as other words, even though they do not actually carry the same 'weight'. This is where the **TF-IDF** algorithm comes in.
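
To make this limitation concrete, here is a small sketch that counts the words of the tutorial text with the standard library only. The token rule (lowercased words of two or more characters) is an assumption meant to approximate what `CountVectorizer` does by default.

```python
import re
from collections import Counter

text = (
    "Natural language processing (NLP) refers to the branch of computer science. "
    "To be more specific, the branch of artificial intelligence. "
    "It is concerned with giving computers the ability to understand text and "
    "spoken words in much the same way human beings can."
)

# Lowercase the text and keep tokens of at least two word characters.
tokens = re.findall(r"\b\w\w+\b", text.lower())
counts = Counter(tokens)

# The most frequent entries are function words, not topic words.
top = counts.most_common(2)
print(top)  # [('the', 4), ('to', 3)]
```

Raw counts therefore rank "the" and "to" above words such as "language" or "intelligence", even though the latter say far more about what the text is about.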

## TF-IDF

Term Frequency-Inverse Document Frequency is a numerical statistic intended to evaluate the importance of a word in a given textual sample.

The TF-IDF algorithm has two parts:
- **Term Frequency**. It can be understood as a normalized frequency score (its value never exceeds 1) that can be calculated using the following formula:
![TF](https://lh4.googleusercontent.com/qeUw5Ui1H-nvG-CjoSiJ7rA3vuiRN-YNRVNzY0rCroT36t71R36ZRckGdQAOzQdC7iXlBge8qlLcjtE2aITLRLoXrx56eEs2ucNtyEaSm6X5xXSutWB1ckUHkg6ScBxFUYJqVdTc)


- **Inverse Document Frequency**. Document Frequency (DF) itself is the proportion of documents that contain a certain word. If we invert DF and take the logarithm, we get the expression for IDF:
![IDF](https://lh5.googleusercontent.com/M5Zkpe89P6lDf4x__CzwQPIMkG7jJ7ediYWVkJYBsv1-eslTAfcCw0zbl5_2xrn3p2sRTgd-budDlwzjgj4lyny-WO5KLGIDosPRRMEj4zR_fNbl5SWEwF0Xm1m8UOK7pTOs6Zxz)
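
In case the formula images do not load, the two parts can be written out in their common textbook form. The notation is an assumption for this sketch: $f_{t,d}$ is the number of times term $t$ occurs in document $d$, $N$ is the total number of documents, and $n_t$ is the number of documents that contain $t$.

```latex
\mathrm{TF}(t, d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}},
\qquad
\mathrm{DF}(t) = \frac{n_t}{N},
\qquad
\mathrm{IDF}(t) = \log\frac{1}{\mathrm{DF}(t)} = \log\frac{N}{n_t}
```

A word that appears in every document has $n_t = N$, so its IDF is $\log 1 = 0$.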

Overall, the more common a word is across all documents, the lower its importance for the current document. The final TF-IDF score can be found as follows:

![TF-IDF](https://lh5.googleusercontent.com/LjP1udnGPjJHZ-qlkeO02ToZM21OykkOflxmi3cCVAljxHPpV1jf5PFHIwjkXoCKjAWMn1WB9Ln5fAcypxWppCsSlkorQYr0TZ1rGNYCfXNjHCWhM1RqwQTV8B0fRMtp_bAKjm5A)
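
The formulas above can be combined into a short plain-Python sketch. It assumes the textbook definitions (TF as a term's share of the document, IDF as the logarithm of the inverted document frequency, and their product as the final score). The function name `tf_idf` and the three toy documents are hypothetical.

```python
import math
import re

def tf_idf(documents):
    """Return one {word: score} dict per document (documents must be non-empty)."""
    tokenized = [re.findall(r"[a-z]+", d.lower()) for d in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each word occur?
    doc_freq = {}
    for tokens in tokenized:
        for word in set(tokens):
            doc_freq[word] = doc_freq.get(word, 0) + 1

    scores = []
    for tokens in tokenized:
        row = {}
        for word in set(tokens):
            tf = tokens.count(word) / len(tokens)      # share of the document
            idf = math.log(n_docs / doc_freq[word])    # log of inverted DF
            row[word] = tf * idf
        scores.append(row)
    return scores

# Hypothetical toy documents.
docs = ["the cat sat", "the cat saw the dog", "the bird sang"]
scores = tf_idf(docs)
print(scores[0])
```

Here "the" occurs in all three documents, so its score is exactly 0, while "sat", which occurs in only one, receives the highest score in the first document.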

Now, let's look at how all of this can be implemented in code.

In [16]:
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.tokenize import sent_tokenize
import pandas as pd

Text = "Natural language processing (NLP) refers to the branch of computer science. To be more specific, the branch of artificial intelligence. It is concerned with giving computers the ability to understand text and spoken words in much the same way human beings can."

# Sentence tokenization
sentences = sent_tokenize(Text)

To see which features are the most important in each sentence, let's construct a data frame with one column per feature name and one row of TF-IDF scores per sentence.

In [19]:
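# Fit the vectorizer and compute the TF-IDF scores of every sentence
tfidf = TfidfVectorizer()
transformed = tfidf.fit_transform(sentences)
# One row per sentence, one column per vocabulary word
df = pd.DataFrame(transformed.toarray(), columns=tfidf.get_feature_names_out())
df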

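| | ability | and | artificial | be | beings | branch | can | computer | computers | concerned | ... | science | specific | spoken | text | the | to | understand | way | with | words |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.255584 | 0.0 | 0.336062 | 0.0 | 0.0 | ... | 0.336062 | 0.0 | 0.0 | 0.0 | 0.198483 | 0.198483 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.381956 | 0.381956 | 0.0 | 0.290488 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.381956 | 0.0 | 0.0 | 0.225589 | 0.225589 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.21956 | 0.21956 | 0.0 | 0.0 | 0.21956 | 0.0 | 0.21956 | 0.0 | 0.21956 | 0.21956 | ... | 0.0 | 0.0 | 0.21956 | 0.21956 | 0.259351 | 0.129675 | 0.21956 | 0.21956 | 0.21956 | 0.21956 |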
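
Note that "the" and "to" do not score 0 in this table, even though they appear in all three sentences. That is because `TfidfVectorizer` does not use the plain textbook formula by default: it takes the raw count as TF, uses a smoothed IDF of the form ln((1 + N) / (1 + n_t)) + 1, and then scales every row to unit Euclidean length. The sketch below reimplements that default behaviour in plain Python as an illustration. The function name `sklearn_style_tfidf` is hypothetical, and the sentences are typed in by hand rather than produced by `sent_tokenize`.

```python
import math
import re

def sklearn_style_tfidf(documents):
    """Approximate TfidfVectorizer's defaults: smoothed IDF plus L2 row normalization."""
    # Lowercased tokens of two or more word characters.
    tokenized = [re.findall(r"\b\w\w+\b", d.lower()) for d in documents]
    n_docs = len(tokenized)
    vocabulary = sorted({w for tokens in tokenized for w in tokens})

    # Smoothed IDF: ln((1 + N) / (1 + n_t)) + 1, which is never zero.
    doc_freq = {w: sum(w in tokens for tokens in tokenized) for w in vocabulary}
    idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocabulary}

    rows = []
    for tokens in tokenized:
        raw = [tokens.count(w) * idf[w] for w in vocabulary]  # raw count times IDF
        norm = math.sqrt(sum(v * v for v in raw))             # Euclidean length of the row
        rows.append([v / norm for v in raw])
    return vocabulary, rows

sentences = [
    "Natural language processing (NLP) refers to the branch of computer science.",
    "To be more specific, the branch of artificial intelligence.",
    "It is concerned with giving computers the ability to understand text and "
    "spoken words in much the same way human beings can.",
]
vocabulary, rows = sklearn_style_tfidf(sentences)
col = vocabulary.index
print(round(rows[0][col("computer")], 6))  # compare with the 'computer' column, row 0
print(round(rows[2][col("the")], 6))       # compare with the 'the' column, row 2
```

Because of the smoothing, a word that occurs in every sentence keeps an IDF of 1 instead of 0, so it is down-weighted relative to rarer words rather than removed entirely.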


Although it is still simple, TF-IDF is widely used in tasks where a model has to evaluate the best response to a query, especially in chatbot systems, as well as in keyword extraction.