# Introduction to Bag of words

This notebook introduces the basics of processing and representing text data.

The first part explores a naive way of processing a dataset, without any pre-processing tools, techniques, or libraries.
`data` is a list of three sentences, one string per list element. Words are identified by splitting on whitespace.

In [1]:
data = ['The boy has a green bird',
              'The bird eats a duck',
              'The duck eats a worm']

`str.split()`, called without arguments, splits a string on runs of whitespace.

In [2]:
sents = [sent.split() for sent in data]

In [3]:
sents

[['The', 'boy', 'has', 'a', 'green', 'bird'],
 ['The', 'bird', 'eats', 'a', 'duck'],
 ['The', 'duck', 'eats', 'a', 'worm']]

### Extracting the vocabulary from the sentences

The vocabulary is the list of distinct words present in the whole dataset. In the example below, `allwords` holds every word in the dataset, repetitions included; `set()` keeps only the unique words. A set is unordered, so the order of `vocab` can differ between runs.

In [4]:
allwords = [word for sent in sents for word in sent]

In [5]:
vocab = list(set(allwords))

In [6]:
vocab

['has', 'green', 'bird', 'worm', 'eats', 'duck', 'The', 'boy', 'a']
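Because the order of a set is arbitrary, the column order of the matrix built from `vocab` may not match the output shown here on another run. A minimal sketch of one way to get a reproducible vocabulary is to sort it; the name `vocab_sorted` is introduced only for this illustration.

```python
data = ['The boy has a green bird',
        'The bird eats a duck',
        'The duck eats a worm']
sents = [sent.split() for sent in data]

# A set comprehension removes repeats; sorted() fixes the order.
# Uppercase letters sort before lowercase ones, so 'The' comes first.
vocab_sorted = sorted({word for sent in sents for word in sent})
print(vocab_sorted)
# ['The', 'a', 'bird', 'boy', 'duck', 'eats', 'green', 'has', 'worm']
```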

In [7]:
sents

[['The', 'boy', 'has', 'a', 'green', 'bird'],
 ['The', 'bird', 'eats', 'a', 'duck'],
 ['The', 'duck', 'eats', 'a', 'worm']]

Now let's compute the bag of words, i.e. a matrix with one row per sentence and one column per vocabulary word, where each entry counts how many times that word occurs in that sentence.

In [8]:
sents_vec = [[sent.count(v) for v in vocab ] for sent in sents]

In [9]:
sents_vec

[[1, 1, 1, 0, 0, 0, 1, 1, 1],
 [0, 0, 1, 0, 1, 1, 1, 0, 1],
 [0, 0, 0, 1, 1, 1, 1, 0, 1]]

In [10]:
print(vocab)
print("S1 : ", sents_vec[0])
print("S2 : ", sents_vec[1])
print("S3 : ", sents_vec[2])

['has', 'green', 'bird', 'worm', 'eats', 'duck', 'The', 'boy', 'a']
S1 :  [1, 1, 1, 0, 0, 0, 1, 1, 1]
S2 :  [0, 0, 1, 0, 1, 1, 1, 0, 1]
S3 :  [0, 0, 0, 1, 1, 1, 1, 0, 1]
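The same matrix can be built with `collections.Counter`, which counts every word of a sentence in a single pass instead of calling `sent.count(v)` once per vocabulary word. This is only a sketch: the names `vocab_sorted` and `bow` are introduced for illustration, and the vocabulary is sorted here, so the columns are ordered differently from the output above.

```python
from collections import Counter

data = ['The boy has a green bird',
        'The bird eats a duck',
        'The duck eats a worm']
sents = [sent.split() for sent in data]
vocab_sorted = sorted({word for sent in sents for word in sent})

# One Counter per sentence; a Counter returns 0 for words it has not seen.
counts = [Counter(sent) for sent in sents]
bow = [[c[v] for v in vocab_sorted] for c in counts]

print(vocab_sorted)
for i, row in enumerate(bow, start=1):
    print(f"S{i} : ", row)
```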


This works, but computing the vocabulary and the counts by hand is inefficient: `sent.count(v)` rescans the sentence for every vocabulary word. Instead, we can use `CountVectorizer` from `sklearn`.

In [11]:
from sklearn.feature_extraction.text import CountVectorizer

In [12]:
cv = CountVectorizer()  # the default min_df=1 already keeps every term that occurs in at least one document

In [13]:
matrix = cv.fit_transform(data)  # learns the vocabulary and counts occurrences; returns a sparse matrix

In [14]:
cv.get_feature_names_out().tolist()  # vocabulary of the document-term matrix (get_feature_names() was removed in scikit-learn 1.2)

['bird', 'boy', 'duck', 'eats', 'green', 'has', 'the', 'worm']

In [15]:
print(matrix)

  (0, 6)	1
  (0, 1)	1
  (0, 5)	1
  (0, 4)	1
  (0, 0)	1
  (1, 6)	1
  (1, 0)	1
  (1, 3)	1
  (1, 2)	1
  (2, 6)	1
  (2, 3)	1
  (2, 2)	1
  (2, 7)	1


In [16]:
for i,x in enumerate(matrix):
    print(f"sentence {i}")
    print(x)
    
# Note that the word "a" is dropped by CountVectorizer: its default token pattern
# only matches tokens of two or more characters. Tokens are also lowercased by default.

sentence 0
  (0, 6)	1
  (0, 1)	1
  (0, 5)	1
  (0, 4)	1
  (0, 0)	1
sentence 1
  (0, 6)	1
  (0, 0)	1
  (0, 3)	1
  (0, 2)	1
sentence 2
  (0, 6)	1
  (0, 3)	1
  (0, 2)	1
  (0, 7)	1
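To see why `a` disappears and `The` becomes `the`, the default tokenization can be imitated with the standard library. This sketch assumes the documented `CountVectorizer` defaults (`lowercase=True` and `token_pattern=r"(?u)\b\w\w+\b"`); it illustrates that behavior and is not the library's actual code path. The names `TOKEN_PATTERN`, `tokens`, and `vocab_cv_like` are introduced here.

```python
import re

data = ['The boy has a green bird',
        'The bird eats a duck',
        'The duck eats a worm']

# Assumed default: only runs of two or more word characters count as tokens.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

# Lowercase first, then extract tokens; the one-letter word "a" never matches.
tokens = [TOKEN_PATTERN.findall(sent.lower()) for sent in data]
vocab_cv_like = sorted({tok for sent in tokens for tok in sent})

print(tokens[0])
print(vocab_cv_like)
```

The resulting vocabulary matches the feature names printed by `CountVectorizer` above.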


In practice, real-world text does not come with clean word and sentence boundaries. Grammar, punctuation, and writing styles vary, so whitespace alone is not enough to separate words: in the example below, `God!` keeps its exclamation mark.

In [17]:
foo = "Oh God! I haven't saved any of it's responses"

In [18]:
[w for w in foo.split()]

['Oh', 'God!', 'I', "haven't", 'saved', 'any', 'of', "it's", 'responses']
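A small regular expression already does better than `split()` on this sentence. The pattern below is a hand-written assumption for illustration, not what any library uses: it keeps a word together with an internal apostrophe and emits every other punctuation character as its own token.

```python
import re

sample = "Oh God! I haven't saved any of it's responses"

# \w+(?:'\w+)?  -> a word, optionally followed by an apostrophe and more letters
# [^\w\s]       -> any single character that is neither a word character nor whitespace
simple_tokens = re.findall(r"\w+(?:'\w+)?|[^\w\s]", sample)
print(simple_tokens)
# ['Oh', 'God', '!', 'I', "haven't", 'saved', 'any', 'of', "it's", 'responses']
```

Rules like this quickly grow complicated as more cases appear (URLs, hyphens, abbreviations), which is the motivation for using a dedicated tokenizer.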

### Tokenizers

Tokenization is the process of breaking a text into smaller chunks, either words or sentences, depending on the level of tokenization.

`nltk` is short for Natural Language Toolkit, one of the classic libraries for text processing. It provides functions for both word and sentence tokenization.

In [19]:
from nltk import word_tokenize

In [20]:
foo = "Oh God!\n I haven't saved any of it's responses"

In [21]:
print(word_tokenize(foo))

['Oh', 'God', '!', 'I', 'have', "n't", 'saved', 'any', 'of', 'it', "'s", 'responses']


Apart from general-purpose word tokenizers, there are also specialized ones, such as `TweetTokenizer` for tweets.

In [22]:
footweet = "@Tomato Wassssssssssup man, that's so cooool"

In [23]:
from nltk.tokenize import TweetTokenizer

In [24]:
word_tokenize(footweet)

['@', 'Tomato', 'Wassssssssssup', 'man', ',', 'that', "'s", 'so', 'cooool']

In [25]:
TweetTokenizer(strip_handles=True, reduce_len=True).tokenize(footweet)

['Wasssup', 'man', ',', "that's", 'so', 'coool']

Just as there are word tokenizers, there are also sentence tokenizers, which split a paragraph into sentences.

In [26]:
from nltk import sent_tokenize

In [27]:
bar = "Sent tokenize knows that time period from 10 a.m. to 1 p.m. are not sentence boundaries. neither are the names G.H.Hardy and J.J.Thompson. you can even start the sentence without Caps"

In [28]:
sent_tokenize(bar)

['Sent tokenize knows that time period from 10 a.m. to 1 p.m. are not sentence boundaries.',
 'neither are the names G.H.Hardy and J.J.Thompson.',
 'you can even start the sentence without Caps']
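For contrast, here is a sketch of a naive splitter that treats every period followed by whitespace as a sentence boundary. The regular expression and the names `paragraph` and `naive_sents` are assumptions made for this illustration.

```python
import re

paragraph = ("Sent tokenize knows that time period from 10 a.m. to 1 p.m. "
             "are not sentence boundaries. neither are the names G.H.Hardy "
             "and J.J.Thompson. you can even start the sentence without Caps")

# Split on whitespace that directly follows a period.
naive_sents = re.split(r"(?<=\.)\s+", paragraph)
for s in naive_sents:
    print(repr(s))
# 'Sent tokenize knows that time period from 10 a.m.'
# 'to 1 p.m.'
# 'are not sentence boundaries.'
# 'neither are the names G.H.Hardy and J.J.Thompson.'
# 'you can even start the sentence without Caps'
```

The naive rule yields five fragments because it breaks after `a.m.` and `p.m.`, whereas `sent_tokenize` returns the three intended sentences.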

Text data carries a lot of information, but depending on the use case it can be simplified to reduce space and time complexity. For example, words like bike, biking, and biked share the same meaning and differ only in grammatical form. When that distinction is not needed, words can be reduced to a root form using two NLP techniques: stemming and lemmatization.

### Stemming & Lemmatization

In [29]:
from nltk import PorterStemmer, SnowballStemmer

In [30]:
from nltk.stem import WordNetLemmatizer

In [31]:
ps = PorterStemmer()

In [32]:
foo = "My baby and all other babies in my buildings are crying laughing and playing all the time"

In [33]:
print([ps.stem(tok) for tok in foo.split()])

['My', 'babi', 'and', 'all', 'other', 'babi', 'in', 'my', 'build', 'are', 'cri', 'laugh', 'and', 'play', 'all', 'the', 'time']


We can see that `babies` is shortened to `babi`, which is not an English word, but as long as `baby` and `babies` both map to the same token `babi`, processing stays simple. If the output must consist of valid dictionary words and some extra computation is acceptable, use lemmatization instead. Lemmatization returns dictionary forms, but it is slower because it looks words up in a lexical database (WordNet, in the case of `WordNetLemmatizer`), and it works best when given the part of speech, as with `pos="v"` below.

In [34]:
wnl = WordNetLemmatizer()

In [35]:
bar = "Those people were crying and running every day"

In [36]:
[wnl.lemmatize(tok, pos="v") for tok in bar.split()]

['Those', 'people', 'be', 'cry', 'and', 'run', 'every', 'day']
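The difference between the two techniques can be sketched without NLTK. In the toy code below, the stemmer applies a few suffix rules and the lemmatizer is a dictionary lookup. Both `toy_stem` and `toy_lemmatize`, the rule list, and the tiny `TOY_LEMMAS` table are hypothetical and hand-written for this illustration; they are not the Porter algorithm or WordNet.

```python
def toy_stem(word):
    """Strip a few common suffixes (a crude illustration, not the Porter algorithm)."""
    w = word.lower()
    rules = (("ies", "i"), ("ing", ""), ("ed", ""), ("e", ""), ("y", "i"), ("s", ""))
    for suffix, replacement in rules:
        # Only strip when at least three characters of the word remain.
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[:len(w) - len(suffix)] + replacement
    return w

# Hand-written lookup table standing in for a real lexical database.
TOY_LEMMAS = {"babies": "baby", "were": "be", "crying": "cry", "running": "run"}

def toy_lemmatize(word):
    """Return the dictionary form if it is known, otherwise the lowercased word."""
    w = word.lower()
    return TOY_LEMMAS.get(w, w)

for word in ["baby", "babies", "running", "were"]:
    print(word, "->", toy_stem(word), "|", toy_lemmatize(word))
# baby -> babi | baby
# babies -> babi | baby
# running -> runn | run
# were -> wer | be
```

The rules are cheap but can produce non-words such as `runn`, while the lookup returns real words but only for entries it knows, including irregular forms like `were`.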

### Punctuation removal

In [37]:
import string

In [38]:
eggs = '''Phew! After lots of <br>, I finally cleared all these "English" and "Science" exams'''

In [39]:
punctuation_map = str.maketrans('', '', string.punctuation)

In [40]:
eggs.translate(punctuation_map)

'Phew After lots of br I finally cleared all these English and Science exams'
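In the result above, the HTML tag `<br>` loses its angle brackets but leaves the stray token `br` behind. One possible fix, sketched here under the assumption that tags are simple `<...>` spans, is to remove tags before stripping punctuation and then normalize the whitespace. The regular expression and the names `no_tags`, `no_punct`, and `cleaned` are introduced for this example; real HTML is better handled with a proper parser.

```python
import re
import string

eggs = '''Phew! After lots of <br>, I finally cleared all these "English" and "Science" exams'''

# 1. Replace anything that looks like a tag with a space.
no_tags = re.sub(r"<[^>]+>", " ", eggs)
# 2. Delete every ASCII punctuation character.
no_punct = no_tags.translate(str.maketrans('', '', string.punctuation))
# 3. Collapse the leftover runs of whitespace into single spaces.
cleaned = " ".join(no_punct.split())

print(cleaned)
# Phew After lots of I finally cleared all these English and Science exams
```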