
#**Natural Language Processing from Scratch**

##Text Representation

###In this section we will learn how to represent text in our application.





In [0]:
import string
from collections import Counter
from pprint import pprint
import gzip
import matplotlib.pyplot as plt
import numpy as np

%matplotlib inline

In [0]:
text = """Mary had a little lamb, little lamb,
 little lamb. Mary had a little lamb
 whose fleece was white as snow.
 And everywhere that Mary went
 Mary went, Mary went. Everywhere
 that Mary went,
 The lamb was sure to go"""

##**Tokenization**

In [3]:
print(string.punctuation)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


In [0]:
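def extract_words(text):
  # Split on whitespace, then clean up each raw token
  temp = text.split()
  text_words = []

  for word in temp:
    # Strip leading and trailing punctuation in one step
    word = word.strip(string.punctuation)

    # Skip tokens made up entirely of punctuation (e.g. "--"),
    # which are empty after stripping
    if word:
      text_words.append(word.lower())

  return text_words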

In [5]:
text_words = extract_words(text)
print(text_words)

['mary', 'had', 'a', 'little', 'lamb', 'little', 'lamb', 'little', 'lamb', 'mary', 'had', 'a', 'little', 'lamb', 'whose', 'fleece', 'was', 'white', 'as', 'snow', 'and', 'everywhere', 'that', 'mary', 'went', 'mary', 'went', 'mary', 'went', 'everywhere', 'that', 'mary', 'went', 'the', 'lamb', 'was', 'sure', 'to', 'go']


In [0]:
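word_dict = {}       # word -> integer id
word_list = []       # integer id -> word
vocabulary_size = 0  # number of distinct words seen so far
text_tokens = []     # the text as a sequence of integer ids

for word in text_words:
  # Assign the next free id the first time a word is seen
  if word not in word_dict:
    word_dict[word] = vocabulary_size
    word_list.append(word)
    vocabulary_size += 1

  text_tokens.append(word_dict[word])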

In [7]:
print("Word list:", word_list, "\n\n Word dictionary:")
pprint(word_dict)

Word list: ['mary', 'had', 'a', 'little', 'lamb', 'whose', 'fleece', 'was', 'white', 'as', 'snow', 'and', 'everywhere', 'that', 'went', 'the', 'sure', 'to', 'go'] 

 Word dictionary:
{'a': 2,
 'and': 11,
 'as': 9,
 'everywhere': 12,
 'fleece': 6,
 'go': 18,
 'had': 1,
 'lamb': 4,
 'little': 3,
 'mary': 0,
 'snow': 10,
 'sure': 16,
 'that': 13,
 'the': 15,
 'to': 17,
 'was': 7,
 'went': 14,
 'white': 8,
 'whose': 5}


In [8]:
print(text_tokens)

[0, 1, 2, 3, 4, 3, 4, 3, 4, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 0, 14, 0, 14, 0, 14, 12, 13, 0, 14, 15, 4, 7, 16, 17, 18]
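
Because word_list stores each word at the position given by its id, the token sequence can be mapped back to the original words. The block below is a minimal sketch of that round trip. It assumes a shortened version of the rhyme as toy data so that it runs on its own; the name `decoded` is introduced here for illustration only.

```python
# Toy data: a shortened version of the rhyme, already split into words
text_words = ["mary", "had", "a", "little", "lamb", "little", "lamb"]

word_dict = {}    # word -> integer id
word_list = []    # integer id -> word
text_tokens = []  # the text as a sequence of ids

for word in text_words:
    if word not in word_dict:
        word_dict[word] = len(word_list)
        word_list.append(word)
    text_tokens.append(word_dict[word])

# Decode: look each id up in word_list to recover the words
decoded = [word_list[token] for token in text_tokens]

print(text_tokens)
print(decoded == text_words)
```

Keeping both word_dict and word_list around makes the encoding reversible at the cost of storing the vocabulary twice.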


##**One Hot encoding**

In [0]:
def one_hot(word, word_dict):
  vector = np.zeros(len(word_dict))
  vector[word_dict[word]] = 1

  return vector

In [10]:
fleece_hot = one_hot("fleece", word_dict)
print(fleece_hot)

[0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]


In [11]:
print(word_dict["fleece"])
fleece_hot[6] == 1

6


True

In [12]:
print(fleece_hot.sum())

1.0
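
The same encoding can be applied to a whole token sequence at once. The following is a sketch rather than part of the original notebook: it assumes a small toy vocabulary and uses row indexing into an identity matrix (`np.eye`) to build one one-hot row per token. The name `one_hot_matrix` is hypothetical.

```python
import numpy as np

# Toy vocabulary and token sequence (assumed for illustration)
word_dict = {"mary": 0, "had": 1, "a": 2, "little": 3, "lamb": 4}
text_tokens = [0, 1, 2, 3, 4, 3, 4]

# Row i of the identity matrix is the one-hot vector for id i,
# so indexing with the token list stacks one row per token
one_hot_matrix = np.eye(len(word_dict))[text_tokens]

print(one_hot_matrix.shape)        # (number of tokens, vocabulary size)
print(one_hot_matrix.sum(axis=1))  # every row contains exactly one 1
```

Note that this matrix grows with both the text length and the vocabulary size, so it is practical only for small examples like this one.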


##**Bag of words**

We can now use the one-hot encoded vector for each word to produce a vector representation of our original text, by simply adding up all the one-hot encoded vectors:

In [13]:
text_vector1 = np.zeros(vocabulary_size)

for word in text_words:
  hot_word = one_hot(word, word_dict)
  text_vector1 += hot_word

print(text_vector1)  

[6. 2. 2. 4. 5. 1. 1. 2. 1. 1. 1. 1. 2. 2. 4. 1. 1. 1. 1.]


In practice, we can also easily skip the encoding step at the word level by using the word_dict defined above:

In [18]:
text_vector = np.zeros(vocabulary_size)

for word in text_words:
  text_vector[word_dict[word]] += 1

print(text_vector)  

[6. 2. 2. 4. 5. 1. 1. 2. 1. 1. 1. 1. 2. 2. 4. 1. 1. 1. 1.]


This approach is equivalent to the previous one, and it is more efficient in both speed and memory because it never allocates a vocabulary-sized vector for each word.

This is known as the bag of words representation of the text. Note that this vector simply contains the number of times each word appears in our document, so we can easily tell that the word mary appears exactly 6 times in our little nursery rhyme.

In [19]:
text_vector[word_dict["mary"]]

6.0
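
When the text is already available as a sequence of integer ids, the counting loop can also be handed to NumPy. This sketch assumes a toy token sequence and uses `np.bincount`, which counts how often each non-negative integer occurs. Unlike the loop above, it returns integers rather than floats.

```python
import numpy as np

# Toy token sequence over a vocabulary of 5 words (assumed for illustration)
text_tokens = [0, 1, 2, 3, 4, 3, 4]
vocabulary_size = 5

# minlength guarantees one entry per vocabulary word,
# even if the highest ids never appear in the text
text_vector = np.bincount(text_tokens, minlength=vocabulary_size)

print(text_vector)
```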

A more Pythonic (and efficient) way of producing the same result is to use the Counter class from the standard collections module:

In [20]:
text_words

['mary',
 'had',
 'a',
 'little',
 'lamb',
 'little',
 'lamb',
 'little',
 'lamb',
 'mary',
 'had',
 'a',
 'little',
 'lamb',
 'whose',
 'fleece',
 'was',
 'white',
 'as',
 'snow',
 'and',
 'everywhere',
 'that',
 'mary',
 'went',
 'mary',
 'went',
 'mary',
 'went',
 'everywhere',
 'that',
 'mary',
 'went',
 'the',
 'lamb',
 'was',
 'sure',
 'to',
 'go']

In [21]:
word_counts = Counter(text_words)
pprint(word_counts)

Counter({'mary': 6,
         'lamb': 5,
         'little': 4,
         'went': 4,
         'had': 2,
         'a': 2,
         'was': 2,
         'everywhere': 2,
         'that': 2,
         'whose': 1,
         'fleece': 1,
         'white': 1,
         'as': 1,
         'snow': 1,
         'and': 1,
         'the': 1,
         'sure': 1,
         'to': 1,
         'go': 1})


From this we can easily generate the text_vector and word_dict data structures:

In [0]:
items = list(word_counts.items())

# Extract the word dictionary and the vector representation
word_dict2 = {word: i for i, (word, count) in enumerate(items)}
text_vector2 = [count for word, count in items]

In [23]:
word_counts['mary']

6

In [24]:
text_vector

array([6., 2., 2., 4., 5., 1., 1., 2., 1., 1., 1., 1., 2., 2., 4., 1., 1.,
       1., 1.])

In [25]:
print("Text vector:", text_vector2, "\n\nWord dictionary:")
pprint(word_dict2)

Text vector: [6, 2, 2, 4, 5, 1, 1, 2, 1, 1, 1, 1, 2, 2, 4, 1, 1, 1, 1] 

Word dictionary:
{'a': 2,
 'and': 11,
 'as': 9,
 'everywhere': 12,
 'fleece': 6,
 'go': 18,
 'had': 1,
 'lamb': 4,
 'little': 3,
 'mary': 0,
 'snow': 10,
 'sure': 16,
 'that': 13,
 'the': 15,
 'to': 17,
 'was': 7,
 'went': 14,
 'white': 8,
 'whose': 5}


In general, this approach may map words to different integer ids than the previous one (here the ids happen to coincide, because in Python 3.7 and later Counter keeps words in the order they are first encountered), but the count associated with each word is the same:

In [0]:
for word in word_dict.keys():
  if text_vector[word_dict[word]] != text_vector2[word_dict2[word]]:
    print("Error!")
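
To make the role of the id mapping concrete, the sketch below builds two vocabularies for the same toy text with deliberately different id orders, then re-expresses one count vector in the other's ordering. All names (`dict_a`, `dict_b`, `aligned`) and the toy data are assumptions introduced for illustration.

```python
import numpy as np

words = ["mary", "had", "a", "little", "lamb", "little", "lamb"]

# Vocabulary A: ids assigned in order of first appearance
dict_a = {}
for w in words:
    dict_a.setdefault(w, len(dict_a))

# Vocabulary B: ids assigned alphabetically (deliberately different)
dict_b = {w: i for i, w in enumerate(sorted(set(words)))}

def count_vector(words, mapping):
    """Bag-of-words counts laid out according to the given word -> id mapping."""
    vector = np.zeros(len(mapping))
    for w in words:
        vector[mapping[w]] += 1
    return vector

vec_a = count_vector(words, dict_a)
vec_b = count_vector(words, dict_b)

# dict_a iterates in id order, so this lists B's counts in A's layout
aligned = np.array([vec_b[dict_b[w]] for w in dict_a])

print(vec_a)
print(vec_b)
print(np.array_equal(aligned, vec_a))
```

The two raw vectors differ position by position, yet they describe the same counts once the ids are aligned, which is why a count vector is only meaningful together with its dictionary.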

##**Term Frequency**
The bag of words vector representation introduced above relies simply on the frequency of occurrence of each word. Following a long tradition of giving fancy names to simple ideas, this is known as Term Frequency.

Intuitively, we expect the frequency with which a given word is mentioned to correspond to the relevance of that word for the piece of text we are considering. For example, Mary is a pretty important word in our little nursery rhyme, and indeed it is the one that occurs most often:

In [29]:
sorted(items, key=lambda x: x[1], reverse=True)

[('mary', 6),
 ('lamb', 5),
 ('little', 4),
 ('went', 4),
 ('had', 2),
 ('a', 2),
 ('was', 2),
 ('everywhere', 2),
 ('that', 2),
 ('whose', 1),
 ('fleece', 1),
 ('white', 1),
 ('as', 1),
 ('snow', 1),
 ('and', 1),
 ('the', 1),
 ('sure', 1),
 ('to', 1),
 ('go', 1)]
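
Raw counts depend on the length of the text, so a common variant (an assumption here, not something the notebook itself computes) divides each count by the total number of words, giving a relative term frequency. The sketch below uses toy data and also shows `Counter.most_common`, which returns the same ranking as the explicit sort above. The name `term_frequency` is hypothetical.

```python
from collections import Counter

# Toy word list (assumed for illustration)
text_words = ["mary", "went", "mary", "went", "mary", "lamb"]

word_counts = Counter(text_words)
total_words = sum(word_counts.values())

# Relative term frequency: the fraction of the text made up by each word
term_frequency = {word: count / total_words
                  for word, count in word_counts.items()}

# most_common(n) returns the n highest counts, largest first
for word, count in word_counts.most_common(2):
    print(word, count, term_frequency[word])
```

By construction the relative frequencies sum to 1, which makes texts of different lengths easier to compare.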