# _Natural Language Processing_
## _Building a Vocabulary Using a Tokenizer_

In [1]:
# import library
import numpy as np

In [3]:
sentence = "I am learning natural language processing with python in 2022."
# tokenize the sentence
tokenized_sentence = sentence.split()
tokenized_sentence

['I',
 'am',
 'learning',
 'natural',
 'language',
 'processing',
 'with',
 'python',
 'in',
 '2022.']

Python's built-in str.split() method does a decent job of tokenizing the sentence. But look at the last word: it makes a mistake there, keeping the sentence-ending punctuation attached to the token "_2022._"

A good tokenizer splits the sentence correctly. That means it does not attach the sentence-ending punctuation to the token, and yields "_2022_" instead.
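
As a rough illustration of what such a tokenizer might do, here is a minimal sketch using Python's standard `re` module. The pattern and the function name `simple_tokenize` are assumptions made for this example, not a production tokenizer: runs of word characters become tokens, and each punctuation mark becomes its own token.

```python
import re

def simple_tokenize(text):
    # \w+      -> a run of letters, digits, or underscores
    # [^\w\s]  -> any single character that is neither a word character nor whitespace
    return re.findall(r"\w+|[^\w\s]", text)

sentence = "I am learning natural language processing with python in 2022."
tokens = simple_tokenize(sentence)
print(tokens[-2:])  # the year and the period are now separate tokens
```

Note that this pattern would also break a contraction such as "don't" into three pieces, which is one reason real tokenizers are more involved.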

For now, let's ignore this mistake. We will deal with it later, in a more advanced phase.

Next, we will turn the words into a vector representation by one-hot encoding them.

In [4]:
# create a list of unique words
vocab = sorted(set(tokenized_sentence))

# Why use set()?
# Because it removes duplicate words from the sentence.
# sorted() then gives the vocabulary a fixed order.
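
The sentence above happens to contain no repeated words, so set() changes nothing there. The sketch below uses a made-up sentence (an assumption for illustration only) to show the effect when words do repeat.

```python
# Hypothetical sentence with a repeated word ("the" appears three times)
repeated_sentence = "the cat sat on the mat near the door"
repeated_tokens = repeated_sentence.split()

# set() drops the duplicates, sorted() puts the result in a fixed order
small_vocab = sorted(set(repeated_tokens))

print(len(repeated_tokens), len(small_vocab))
print(small_vocab)
```

Because sorted() compares strings by character code, digits and uppercase letters come before lowercase ones, which is why '2022.' and 'I' end up at the front of the vocabulary in this notebook.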

In [6]:
# join the words in the vocab list
', '.join(vocab)

'2022., I, am, in, language, learning, natural, processing, python, with'

In [7]:
num_tokens = len(tokenized_sentence)
vocab_size = len(vocab)
print(num_tokens, vocab_size)

10 10


In [8]:
onehot_encoding = np.zeros((num_tokens, vocab_size), int)

onehot_encoding is just a matrix of zeros with shape num_tokens x vocab_size.
- num_tokens = Rows (one row per token in the sentence)
- vocab_size = Columns (one column per word in the vocabulary)

In [9]:
onehot_encoding

array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])

In [12]:
# now convert tokens to one-hot encoding
for i, token in enumerate(tokenized_sentence):
    index = vocab.index(token)
    onehot_encoding[i, index] = 1

What just happened here?
- For each word in the sentence, we find the index of that word in vocab.
- Then we mark the column at that index with 1, in the row for that word.
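
The same two steps can be sketched without an explicit loop over the matrix. This is an alternative under stated assumptions, not the method used above: the names `token_to_index`, `indices`, and `onehot` are introduced here for illustration. A dictionary replaces the repeated vocab.index() search, and indexing an identity matrix picks out one row per token.

```python
import numpy as np

sentence = "I am learning natural language processing with python in 2022."
tokens = sentence.split()
vocab = sorted(set(tokens))

# Map each vocabulary word to its column index (one dictionary lookup per token)
token_to_index = {word: i for i, word in enumerate(vocab)}
indices = [token_to_index[token] for token in tokens]

# Row k of the identity matrix is the one-hot vector for vocabulary index k
onehot = np.eye(len(vocab), dtype=int)[indices]
print(onehot.shape)
```

For a ten-word sentence the difference is negligible, but with a large vocabulary the dictionary lookup avoids scanning the list for every token.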

In [15]:
print(" ".join(vocab))
print(onehot_encoding)

2022. I am in language learning natural processing python with
[[0 1 0 0 0 0 0 0 0 0]
 [0 0 1 0 0 0 0 0 0 0]
 [0 0 0 0 0 1 0 0 0 0]
 [0 0 0 0 0 0 1 0 0 0]
 [0 0 0 0 1 0 0 0 0 0]
 [0 0 0 0 0 0 0 1 0 0]
 [0 0 0 0 0 0 0 0 0 1]
 [0 0 0 0 0 0 0 0 1 0]
 [0 0 0 1 0 0 0 0 0 0]
 [1 0 0 0 0 0 0 0 0 0]]


In [16]:
# let's view it using pandas
import pandas as pd
pd.DataFrame(onehot_encoding,columns=vocab)

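| | 2022. | I | am | in | language | learning | natural | processing | python | with |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 8 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 9 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |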


Stacked together, one-hot vectors form a super-sparse matrix: you can see there are lots of zeros 😶 in it, with exactly one 1 per row. So working with one-hot vectors quickly becomes a dimensionality problem. 😥

E.g., if you have a sentence of 100 words, or a full novel, you will have a matrix of 100 x vocab_size 😨 or total_number_of_words_in_novel x vocab_size. 😱
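
To make the sparsity concrete, here is a small sketch on the example sentence. The names `dense`, `indices`, and `zero_fraction` are assumptions introduced here, and the dtype is fixed to 64-bit integers so the byte counts do not depend on the platform. It measures the share of zeros and shows that one integer index per row carries the same information as the whole row.

```python
import numpy as np

sentence = "I am learning natural language processing with python in 2022."
tokens = sentence.split()
vocab = sorted(set(tokens))

# One vocabulary index per token, plus the dense one-hot matrix built from it
indices = np.array([vocab.index(token) for token in tokens], dtype=np.int64)
dense = np.zeros((len(tokens), len(vocab)), dtype=np.int64)
dense[np.arange(len(tokens)), indices] = 1

zero_fraction = 1 - dense.sum() / dense.size
print(f"share of zeros: {zero_fraction:.0%}")
print(f"dense bytes: {dense.nbytes}, index bytes: {indices.nbytes}")

# argmax finds the position of the single 1 in each row, which recovers the tokens
recovered = [vocab[i] for i in dense.argmax(axis=1)]
print(recovered == tokens)
```

Even on this toy example 90% of the cells are zero, and the share only grows as the vocabulary gets larger.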

OK, let me give you an idea of the size and space this requires.

**LOOK 👇**

In [20]:
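# let's say you have 300 books with 4000 sentences each and 12 words per sentence.
# then, with one matrix row per word:
rows = 300 * 4000 * 12
# illustrative assumption: a vocabulary of 1,000,000 words and 1 byte per matrix cell
num_bytes = rows * 1000000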

in_gb = num_bytes / (1024 * 1024 * 1024)
in_tb = in_gb / 1024

print(f"Rows: {rows}")
print(f"Bytes: {num_bytes}")
print(f"GB: {in_gb}gb")
print(f"TB: {in_tb}tb")

print("SO, you have 13tb of data from a single corpus. That's a lot of data.")

Rows: 14400000
Bytes: 14400000000000
GB: 13411.04507446289gb
TB: 13.096723705530167tb
SO, you have 13tb of data from a single corpus. That's a lot of data.
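
For comparison, the sketch below repeats the estimate as a small function and sets it against storing only one vocabulary index per word. The helper names, the 1,000,000-word vocabulary, the 1 byte per cell, and the 4 bytes per index are all assumptions for illustration; a 4-byte integer is more than enough to address a vocabulary of that size.

```python
def dense_onehot_bytes(num_rows, vocab_size, bytes_per_cell=1):
    # Every row stores one cell per vocabulary word
    return num_rows * vocab_size * bytes_per_cell

def index_only_bytes(num_rows, bytes_per_index=4):
    # Every row stores just the position of its single 1
    return num_rows * bytes_per_index

corpus_rows = 300 * 4000 * 12      # books * sentences per book * words per sentence
assumed_vocab_size = 1_000_000     # illustrative vocabulary size

dense_total = dense_onehot_bytes(corpus_rows, assumed_vocab_size)
index_total = index_only_bytes(corpus_rows)

print(f"dense one-hot: {dense_total / 1024**4:.1f} TiB")
print(f"indices only:  {index_total / 1024**2:.1f} MiB")
```

The dense estimate also scales with the cell type: with 8-byte integers instead of 1-byte cells, it would be eight times larger.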


In [18]:
# want to see the actual sentence in matrix form? 😉
# OK, but this is for viewing only; don't feed this DataFrame to ML algorithms
df = pd.DataFrame(onehot_encoding,columns=vocab)
df[df==0] = " "
df

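| | 2022. | I | am | in | language | learning | natural | processing | python | with |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | | 1.0 | | | | | | | | |
| 1 | | | 1.0 | | | | | | | |
| 2 | | | | | | 1.0 | | | | |
| 3 | | | | | | | 1.0 | | | |
| 4 | | | | | 1.0 | | | | | |
| 5 | | | | | | | | 1.0 | | |
| 6 | | | | | | | | | | 1.0 |
| 7 | | | | | | | | | 1.0 | |
| 8 | | | | 1.0 | | | | | | |
| 9 | 1.0 | | | | | | | | | |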
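
One way to keep the numeric data safe is to blank out the zeros on a separate copy, as in the sketch below. The name `display_df` and the string-conversion approach are assumptions for this example, not the method used in the cell above; the original `df` stays numeric and usable for modelling.

```python
import numpy as np
import pandas as pd

sentence = "I am learning natural language processing with python in 2022."
tokens = sentence.split()
vocab = sorted(set(tokens))

onehot = np.zeros((len(tokens), len(vocab)), dtype=int)
for i, token in enumerate(tokens):
    onehot[i, vocab.index(token)] = 1

df = pd.DataFrame(onehot, columns=vocab)        # numeric, keep this one for ML
display_df = df.astype(str).replace("0", " ")   # string copy, for viewing only
print(display_df)
```

Since only the copy is modified, `df` can still be summed, sliced, or passed to an algorithm afterwards.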
