# Bag of Words model
The bag of words model is one of several techniques from Natural Language Processing (NLP), a field of computer science, for extracting features from text.
- The model represents text as frequency counts.
- It does this by **counting how often each word occurs** in a document.

Input:
1. vocabulary
2. documents/sentences

The output of the bag of words model is a **frequency vector**.


### Motivation
Suppose we want a machine learning algorithm that classifies documents, for example labelling email as spam or non-spam. Such algorithms need numerical input, so the text must first be converted into feature vectors.


Ways to represent a document as features:

1. Word proportions
2. Raw word counts
3. Binary indicators (1 if the word appears)
4. TF-IDF
5. Feature vectors
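
As a rough illustration of how the first four items in the list above differ, the sketch below computes each of them for one tokenized document. The toy corpus, the function name `score_document`, and the particular TF-IDF variant (word proportion times `log(N / df)`, without smoothing) are assumptions chosen for illustration; libraries such as scikit-learn use a smoothed formula and will produce different numbers.

```python
import math
from collections import Counter

# Toy corpus, already tokenized and lower-cased (illustrative data only).
docs = [
    ["machine", "learning", "is", "great"],
    ["natural", "language", "processing", "is", "complex"],
    ["machine", "learning", "uses", "machine", "data"],
]
vocabulary = sorted({w for doc in docs for w in doc})

def score_document(doc, docs, vocabulary):
    """Return raw counts, proportions, binary flags and TF-IDF for one document."""
    counts = Counter(doc)
    total = len(doc)
    n_docs = len(docs)
    raw = [counts[w] for w in vocabulary]
    proportion = [c / total for c in raw]
    binary = [1 if c > 0 else 0 for c in raw]
    # Document frequency: number of documents that contain each word.
    df = [sum(1 for d in docs if w in d) for w in vocabulary]
    # Unsmoothed TF-IDF: proportion * log(N / df). Every vocabulary word
    # occurs in at least one document, so df is never zero here.
    tfidf = [p * math.log(n_docs / f) for p, f in zip(proportion, df)]
    return raw, proportion, binary, tfidf

raw, proportion, binary, tfidf = score_document(docs[2], docs, vocabulary)
print(vocabulary)
print(raw)     # [0, 1, 0, 0, 0, 1, 2, 0, 0, 1]
print(binary)  # [0, 1, 0, 0, 0, 1, 1, 0, 0, 1]
```

Raw counts and binary flags depend only on the document itself, while TF-IDF also depends on the rest of the corpus: a word that appears in every document would get a score of zero under this variant.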


Reference:

http://www.insightsbot.com/blog/R8fu5/bag-of-words-algorithm-in-python-introduction

In [16]:
import numpy as np
import re

### Step 1: Define the vocabulary

In [17]:
sentences = ["Machine learning is great","Natural Language Processing is a complex field","Natural Language Processing is used in machine learning"]

In [18]:
"""
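1. Extract the words from a document using a regular expression.
2. Convert all words to lower case and exclude our stop words.
"""
def extract_words(sentence):
    ignore_words = ['a']  # stop words, compared in lower case
    # Replace every non-word character with a space, then split on whitespace
    # (nltk.word_tokenize(sentence) would be an alternative).
    words = re.sub(r"[^\w]", " ", sentence).split()
    # Lower-case before filtering so that stop words such as "A" are removed too.
    words_cleaned = [w.lower() for w in words if w.lower() not in ignore_words]
    return words_cleaned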
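
To see what the regular expression used in `extract_words` does to punctuation, here is a small standalone sketch. The sample sentence is made up for illustration.

```python
import re

sample = "Don't stop, NLP!"
# Every character that is not a letter, digit or underscore becomes a space,
# so the apostrophe splits "Don't" into two tokens.
tokens = re.sub(r"[^\w]", " ", sample).split()
print(tokens)  # ['Don', 't', 'stop', 'NLP']
```

Contractions are split in two by this approach, which is one reason a dedicated tokenizer may be preferable for real text.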

In [19]:
"""
Builds our vocabulary by looping through all our documents (sentences),
extracting the words from each and removing duplicates using the set function.

Return
-------
a sorted list of words.
"""
def tokenize_sentences(sentences):
    words = []
    for sentence in sentences:
        w = extract_words(sentence)
        words.extend(w)
        
    words = sorted(list(set(words)))
    return words

### Step 2: Convert sentences into a frequency vector 
The result is a numerical vector that can be used as input to machine learning algorithms that classify documents.

In [20]:
"""
Implementation of the bag of words model.

Stage 1:
Takes a sentence and words (our vocabulary) as input.
It extracts the words from the input sentence using the previously defined function,
then creates a vector of zeros with the numpy zeros function, one entry per word in our vocabulary.

Stage 2:
For each word in our sentence, we loop through our vocabulary and, if the word matches, increase the count at that position by 1.
We return the numpy array of frequency counts.
"""

def bagofwords(sentence, words):
    #Stage 1:
    sentence_words = extract_words(sentence)
    bag = np.zeros(len(words)) # frequency word count
    # stage 2:
    for sw in sentence_words:
        for i,word in enumerate(words):
            if word == sw: 
                bag[i] += 1
                
    return np.array(bag)
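
The nested loop above compares every sentence word against every vocabulary word. One possible alternative, sketched here under the assumption that the sentence has already been tokenized, counts the tokens once with `collections.Counter` and then looks up each vocabulary word. The name `bag_of_words_counter` is hypothetical, and the function returns a plain list rather than a NumPy array.

```python
from collections import Counter

def bag_of_words_counter(tokens, vocabulary):
    """Count how often each vocabulary word occurs in an already tokenized sentence."""
    counts = Counter(tokens)  # one pass over the sentence
    # A Counter returns 0 for missing keys, so unseen vocabulary words get 0.
    return [counts[word] for word in vocabulary]

vocabulary = ["complex", "field", "great", "in", "is", "language",
              "learning", "machine", "natural", "processing", "used"]
tokens = ["machine", "learning", "is", "great"]
vector = bag_of_words_counter(tokens, vocabulary)
print(vector)  # [0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0]
```

Tokens that are not in the vocabulary are simply ignored, which matches the behaviour of the loop-based version.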

In [21]:
vocabulary = tokenize_sentences(sentences)  # build the vocabulary from Step 1
frequency_vector = bagofwords("Machine learning is great", vocabulary)
frequency_vector

array([0., 0., 1., 0., 1., 0., 1., 1., 0., 0., 0.])

### Vectorize sentences using scikit-learn CountVectorizer
Python’s scikit-learn provides a built-in class, `CountVectorizer`, that implements the bag of words model above.

Let’s reproduce all of the above in just four lines of code.



In [22]:
from sklearn.feature_extraction.text import CountVectorizer


sentences = ["Machine learning is great","Natural Language Processing is a complex field","Natural Language Processing is used in machine learning"]

vectorizer = CountVectorizer(analyzer = "word", tokenizer = None, preprocessor = None, stop_words = None, max_features = 5000) 
# create vocabulary
train_data_features = vectorizer.fit_transform(sentences)
# create frequency vector
vectorizer.transform(["Machine learning is great"]).toarray()


array([[0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0]])