# _Natural Language Processing_
## _Building a Vocabulary Using a Tokenizer_

In [1]:
# import library
import numpy as np

In [3]:
sentence = "I am learning natural language processing with python in 2022."
# tokenize the sentence
tokenized_sentence = sentence.split()
tokenized_sentence

['I',
 'am',
 'learning',
 'natural',
 'language',
 'processing',
 'with',
 'python',
 'in',
 '2022.']

Python's built-in str.split() method does a decent job of tokenizing the sentence. But look at the last word: it makes a mistake there, keeping the sentence-ending punctuation in the token "_2022._"

A good tokenizer would tokenize the sentence correctly. That means it would not attach the sentence-ending punctuation to the last token, which would then be "_2022_".
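
As a rough preview of what a better tokenizer might do, here is a minimal sketch using the standard-library re module. The pattern `\w+` is an assumption chosen for this one sentence. It would also split contractions and hyphenated words, so it is not a general solution.

```python
import re

sentence = "I am learning natural language processing with python in 2022."

# Assumption: a token is a run of letters, digits, or underscores (\w+),
# so punctuation is dropped instead of being glued to a word.
tokens = re.findall(r"\w+", sentence)
print(tokens)
```

The rest of this notebook keeps using str.split(), so the token "_2022._" still appears in the outputs below.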

For now, let's set this mistake aside; we will deal with it later, in a more advanced phase.

Next, we will turn the words into a vector representation by one-hot encoding them.

In [4]:
# create a sorted list of unique words
vocab = sorted(set(tokenized_sentence))

# Why use set?
# Because it removes duplicate words from the sentence.

In [6]:
# join the words in the vocab list
', '.join(vocab)

'2022., I, am, in, language, learning, natural, processing, python, with'

In [7]:
num_tokens = len(tokenized_sentence)
vocab_size = len(vocab)
print(num_tokens, vocab_size)

10 10


In [8]:
onehot_encoding = np.zeros((num_tokens, vocab_size), int)

onehot_encoding is just a matrix of shape num_tokens x vocab_size, filled with zeros for now.
- num_tokens = Rows
- vocab_size = Columns

In [9]:
onehot_encoding

array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])

In [12]:
# now convert tokens to one-hot encoding
for i, token in enumerate(tokenized_sentence):
    index = vocab.index(token)
    onehot_encoding[i, index] = 1

What just happened here?
- For each word in the sentence, we find the index of that word in the vocab list.
- Then, in the row for that token, we mark the column for that word with 1.
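
The same two steps can also be sketched without the explicit loop. This is an alternative under stated assumptions, not the method used in this notebook: the names word_to_index, indices, and onehot are introduced here for illustration only.

```python
import numpy as np

tokens = "I am learning natural language processing with python in 2022.".split()
vocab = sorted(set(tokens))

# Map each vocabulary word to its column index once, instead of calling
# vocab.index(token) (a linear search) for every token.
word_to_index = {word: i for i, word in enumerate(vocab)}
indices = [word_to_index[token] for token in tokens]

# Row i of the identity matrix is the one-hot vector for column i,
# so indexing it with `indices` yields one row per token.
onehot = np.eye(len(vocab), dtype=int)[indices]
print(onehot.shape)
```

The dictionary lookup matters mostly for large vocabularies, where a list search per token becomes slow.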

In [15]:
print(" ".join(vocab))
print(onehot_encoding)

2022. I am in language learning natural processing python with
[[0 1 0 0 0 0 0 0 0 0]
 [0 0 1 0 0 0 0 0 0 0]
 [0 0 0 0 0 1 0 0 0 0]
 [0 0 0 0 0 0 1 0 0 0]
 [0 0 0 0 1 0 0 0 0 0]
 [0 0 0 0 0 0 0 1 0 0]
 [0 0 0 0 0 0 0 0 0 1]
 [0 0 0 0 0 0 0 0 1 0]
 [0 0 0 1 0 0 0 0 0 0]
 [1 0 0 0 0 0 0 0 0 0]]


In [16]:
# let's view it using pandas
import pandas as pd
pd.DataFrame(onehot_encoding,columns=vocab)

,2022.,I,am,in,language,learning,natural,processing,python,with
0,0,1,0,0,0,0,0,0,0,0
1,0,0,1,0,0,0,0,0,0,0
2,0,0,0,0,0,1,0,0,0,0
3,0,0,0,0,0,0,1,0,0,0
4,0,0,0,0,1,0,0,0,0,0
5,0,0,0,0,0,0,0,1,0,0
6,0,0,0,0,0,0,0,0,0,1
7,0,0,0,0,0,0,0,0,1,0
8,0,0,0,1,0,0,0,0,0,0
9,1,0,0,0,0,0,0,0,0,0


One-hot vectors form a very sparse matrix: you can see there are lots of zeros 😶 in it. So working with one-hot vectors quickly becomes a dimensionality problem. 😥

E.g., if you have a text of 100 words, or a full novel, you will have a matrix of 100 x vocab_size 😨 or total_number_of_words_in_novel x vocab_size. 😱

OK, let me give you an idea of the size and space this requires.

**LOOK 👇**

In [20]:
# let's say you have 300 books with 4000 sentences each and 12 words per sentence,
# a vocabulary of 1,000,000 tokens, and 1 byte per matrix cell.
# then,
rows = 300 * 4000 * 12
num_bytes = rows * 1000000

in_gb = num_bytes / (1024 * 1024 * 1024)
in_tb = in_gb / 1024

print(f"Rows: {rows}")
print(f"Bytes: {num_bytes}")
print(f"GB: {in_gb}gb")
print(f"TB: {in_tb}tb")

print("SO, you have 13tb of data from a single corpus. That's a lot of data.")

Rows: 14400000
Bytes: 14400000000000
GB: 13411.04507446289gb
TB: 13.096723705530167tb
SO, you have 13tb of data from a single corpus. That's a lot of data.
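
For a sense of scale, each one-hot row contains a single 1, so its column index carries the same information as the whole row. A back-of-the-envelope sketch, assuming the same corpus as above, a vocabulary of 1,000,000 tokens, and one 32-bit integer per stored index (both figures are assumptions for illustration):

```python
rows = 300 * 4000 * 12           # same token count as above
vocab_size = 1_000_000           # assumed vocabulary size, one byte per cell

dense_bytes = rows * vocab_size  # full one-hot matrix
index_bytes = rows * 4           # one 32-bit integer index per token (assumption)

print(f"Dense one-hot:   {dense_bytes / 1024**4:.1f} TiB")
print(f"Index per token: {index_bytes / 1024**2:.1f} MiB")
```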


In [18]:
# want to see the actual sentence in matrix form? 😉
# OK, but don't feed a DataFrame like this to ML algorithms
df = pd.DataFrame(onehot_encoding,columns=vocab)
df[df==0] = " "
df

,2022.,I,am,in,language,learning,natural,processing,python,with
0,,1.0,,,,,,,,
1,,,1.0,,,,,,,
2,,,,,,1.0,,,,
3,,,,,,,1.0,,,
4,,,,,1.0,,,,,
5,,,,,,,,1.0,,
6,,,,,,,,,,1.0
7,,,,,,,,,1.0,
8,,,,1.0,,,,,,
9,1.0,,,,,,,,,


### How to tackle the problem of one-hot vectors?
- There are many ways to tackle this problem; one is to use the _Bag-of-Words_ model.

## _Bag of Words_

In [21]:
sentence_bow = {}
for token in sentence.split():
    if token not in sentence_bow:
        sentence_bow[token] = 1
    else:
        sentence_bow[token] += 1
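
The counting loop above can also be written with collections.Counter from the standard library. A minimal sketch, assuming the same example sentence; the name bow_counts is introduced here for illustration:

```python
from collections import Counter

sentence = "I am learning natural language processing with python in 2022."

# Counter counts how many times each token occurs in the list
bow_counts = Counter(sentence.split())
print(sorted(bow_counts.items()))
```

A Counter returns 0 for tokens it has never seen, which is convenient when looking up words that are not in the sentence.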

In [23]:
sorted(sentence_bow.items())

[('2022.', 1),
 ('I', 1),
 ('am', 1),
 ('in', 1),
 ('language', 1),
 ('learning', 1),
 ('natural', 1),
 ('processing', 1),
 ('python', 1),
 ('with', 1)]

- Notice that the sorted function puts tokens starting with a digit before the words,
- and puts capitalized words before the lowercase words.
- Why?
  - Because that is the order of the characters in the ASCII and Unicode character sets.
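
The ordering can be checked directly with the built-in ord function, which returns a character's code point. A small sketch; the case-insensitive sort at the end is an optional alternative, not something used elsewhere in this notebook:

```python
# Code points: digits < uppercase letters < lowercase letters
for ch in ["2", "I", "a"]:
    print(ch, ord(ch))

tokens = ["am", "I", "2022."]
print(sorted(tokens))                 # default: compare by code point
print(sorted(tokens, key=str.lower))  # case-insensitive alternative
```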

In [28]:
# STEP 1: dictionary to pandas series
pd.Series(dict([(token, 1) for token in sentence.split()]))

I             1
am            1
learning      1
natural       1
language      1
processing    1
with          1
python        1
in            1
2022.         1
dtype: int64

In [31]:
# STEP 2: pandas series to DataFrame
pd.DataFrame(pd.Series(dict([(token, 1) for token in sentence.split()])))

,0
I,1
am,1
learning,1
natural,1
language,1
processing,1
with,1
python,1
in,1
2022.,1


In [32]:
# STEP 3: Transpose the DataFrame
pd.DataFrame(pd.Series(dict([(token, 1) for token in sentence.split()]))).T

,I,am,learning,natural,language,processing,with,python,in,2022.
0,1,1,1,1,1,1,1,1,1,1


Now do it all together.
👇

In [34]:
# now let's use an efficient pandas DataFrame to store the data
df_sentence = pd.DataFrame(pd.Series(dict([(token,1) for token in sentence.split()])), columns=['token']).T

df_sentence

,I,am,learning,natural,language,processing,with,python,in,2022.
token,1,1,1,1,1,1,1,1,1,1


### Construct a dataframe of Bag of Words vectors

In [35]:
# Crime and Punishment from Project Gutenberg

novel = """On an exceptionally hot evening early in July a young man came out of
the garret in which he lodged in S. Place and walked slowly, as though
in hesitation, towards K. bridge.

He had successfully avoided meeting his landlady on the staircase. His
garret was under the roof of a high, five-storied house and was more
like a cupboard than a room. The landlady who provided him with garret,
dinners, and attendance, lived on the floor below, and every time
he went out he was obliged to pass her kitchen, the door of which
invariably stood open. And each time he passed, the young man had a
sick, frightened feeling, which made him scowl and feel ashamed. He was
hopelessly in debt to his landlady, and was afraid of meeting her.

This was not because he was cowardly and abject, quite the contrary; but
for some time past he had been in an overstrained irritable condition,
verging on hypochondria. He had become so completely absorbed in
himself, and isolated from his fellows that he dreaded meeting, not
only his landlady, but anyone at all. He was crushed by poverty, but the
anxieties of his position had of late ceased to weigh upon him. He had
given up attending to matters of practical importance; he had lost all
desire to do so. Nothing that any landlady could do had a real terror
for him. But to be stopped on the stairs, to be forced to listen to her
trivial, irrelevant gossip, to pestering demands for payment, threats
and complaints, and to rack his brains for excuses, to prevaricate, to
lie--no, rather than that, he would creep down the stairs like a cat and
slip out unseen.

This evening, however, on coming out into the street, he became acutely
aware of his fears.

“I want to attempt a thing _like that_ and am frightened by these
trifles,” he thought, with an odd smile. “Hm... yes, all is in a man’s
hands and he lets it all slip from cowardice, that’s an axiom. It would
be interesting to know what it is men are most afraid of. Taking a new
step, uttering a new word is what they fear most.... But I am talking
too much. It’s because I chatter that I do nothing. Or perhaps it is
that I chatter because I do nothing. I’ve learned to chatter this
last month, lying for days together in my den thinking... of Jack the
Giant-killer. Why am I going there now? Am I capable of _that_? Is
_that_ serious? It is not serious at all. It’s simply a fantasy to amuse
myself; a plaything! Yes, maybe it is a plaything.”

The heat in the street was terrible: and the airlessness, the bustle
and the plaster, scaffolding, bricks, and dust all about him, and that
special Petersburg stench, so familiar to all who are unable to get out
of town in summer--all worked painfully upon the young man’s already
overwrought nerves. The insufferable stench from the pot-houses, which
are particularly numerous in that part of the town, and the drunken men
whom he met continually, although it was a working day, completed
the revolting misery of the picture. An expression of the profoundest
disgust gleamed for a moment in the young man’s refined face. He was,
by the way, exceptionally handsome, above the average in height, slim,
well-built, with beautiful dark eyes and dark brown hair. Soon he sank
into deep thought, or more accurately speaking into a complete blankness
of mind; he walked along not observing what was about him and not caring
to observe it. From time to time, he would mutter something, from the
habit of talking to himself, to which he had just confessed. At these
moments he would become conscious that his ideas were sometimes in a
tangle and that he was very weak; for two days he had scarcely tasted
food.
He was so badly dressed that even a man accustomed to shabbiness would
have been ashamed to be seen in the street in such rags. In that quarter
of the town, however, scarcely any shortcoming in dress would have
created surprise. Owing to the proximity of the Hay Market, the number
of establishments of bad character, the preponderance of the trading
and working class population crowded in these streets and alleys in the
heart of Petersburg, types so various were to be seen in the streets
that no figure, however queer, would have caused surprise. But there was
such accumulated bitterness and contempt in the young man’s heart, that,
in spite of all the fastidiousness of youth, he minded his rags least
of all in the street. It was a different matter when he met with
acquaintances or with former fellow students, whom, indeed, he disliked
meeting at any time. And yet when a drunken man who, for some unknown
reason, was being taken somewhere in a huge waggon dragged by a heavy
dray horse, suddenly shouted at him as he drove past: “Hey there, German
hatter” bawling at the top of his voice and pointing at him--the young
man stopped suddenly and clutched tremulously at his hat. It was a tall
round hat from Zimmerman’s, but completely worn out, rusty with age, all
torn and bespattered, brimless and bent on one side in a most unseemly
fashion. Not shame, however, but quite another feeling akin to terror
had overtaken him.

“I knew it,” he muttered in confusion, “I thought so! That’s the worst
of all! Why, a stupid thing like this, the most trivial detail might
spoil the whole plan. Yes, my hat is too noticeable.... It looks absurd
and that makes it noticeable.... With my rags I ought to wear a cap, any
sort of old pancake, but not this grotesque thing. Nobody wears such
a hat, it would be noticed a mile off, it would be remembered.... What
matters is that people would remember it, and that would give them
a clue. For this business one should be as little conspicuous as
possible.... Trifles, trifles are what matter! Why, it’s just such
trifles that always ruin everything....”

He had not far to go; he knew indeed how many steps it was from the gate
of his lodging house: exactly seven hundred and thirty. He had counted
them once when he had been lost in dreams. At the time he had put no
faith in those dreams and was only tantalising himself by their hideous
but daring recklessness. Now, a month later, he had begun to look upon
them differently, and, in spite of the monologues in which he jeered at
his own impotence and indecision, he had involuntarily come to regard
this “hideous” dream as an exploit to be attempted, although he
still did not realise this himself. He was positively going now for a
“rehearsal” of his project, and at every step his excitement grew more
and more violent."""

In [37]:
sentence = """On an exceptionally hot evening early in July a young man came out of the garret in which he lodged in S.\n"""
sentence += """Place and walked slowly, as though in hesitation, towards K. bridge.\n"""
sentence += """He had successfully avoided meeting his landlady on the staircase.\n"""
sentence += """His garret was under the roof of a high, five-storied house and was more
like a cupboard than a room.\n"""
sentence += """The landlady who provided him with garret, dinners, and attendance, lived on the floor below, and every time
he went out he was obliged to pass her kitchen, the door of which invariably stood open.\n"""

sentence += """And each time he passed, the young man had a sick, frightened feeling, which made him scowl and feel ashamed.\n"""

sentence += """He was hopelessly in debt to his landlady, and was afraid of meeting her."""

In [38]:
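corpus = {}
# Note: the line breaks inside the triple-quoted strings above are also '\n',
# so split('\n') yields 9 pieces here (sent0 to sent8), not 7.
for i, sent in enumerate(sentence.split('\n')):
    # one bag of words (token -> 1) per piece
    corpus['sent{}'.format(i)] = dict((tok, 1) for tok in sent.split())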

In [39]:
df_corpus = pd.DataFrame.from_records(corpus).fillna(0).astype(int).T

In [41]:
df_corpus[df_corpus.columns[:50]]

,On,an,exceptionally,hot,evening,early,in,July,a,young,...,was,under,roof,"high,",five-storied,house,more,like,cupboard,than
sent0,1,1,1,1,1,1,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0
sent1,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
sent2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
sent3,0,0,0,0,0,0,0,0,1,0,...,1,1,1,1,1,1,1,0,0,0
sent4,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,1,1,1
sent5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
sent6,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
sent7,0,0,0,0,0,0,0,0,1,1,...,0,0,0,0,0,0,0,0,0,0
sent8,0,0,0,0,0,0,1,0,0,0,...,1,0,0,0,0,0,0,0,0,0


In [42]:
df_corpus.size

801
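
The size of a DataFrame is its number of rows times its number of columns; here, 801 = 9 rows x 89 columns. The relationship is easier to check on a toy corpus. A minimal sketch, assuming two made-up sentences (not from the novel) and using pd.DataFrame.from_dict with orient="index", which places one sentence per row directly:

```python
import pandas as pd

# Hypothetical two-sentence corpus, only for illustration
toy_sentences = ["the cat sat", "the dog sat down"]

toy_corpus = {}
for i, sent in enumerate(toy_sentences):
    toy_corpus[f"sent{i}"] = {tok: 1 for tok in sent.split()}

# orient="index" makes each outer key a row; words missing from a
# sentence become NaN, which we replace with 0.
df_toy = pd.DataFrame.from_dict(toy_corpus, orient="index").fillna(0).astype(int)

print(df_toy.shape)  # (rows, columns) = (sentences, unique tokens)
print(df_toy.size)   # rows * columns
```

Every new unique token adds a column to every row, which is why the table grows quickly as more sentences are added.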