# Bag of Words (BoW)

Bag of Words is a method for feature generation from textual documents.

In a text classification problem, we have a set of texts and their respective labels. A model cannot use raw text directly, so we need to convert each text into numbers or vectors of numbers.

The bag-of-words (BoW) model is the simplest way of extracting features from text. 
BoW converts text into a matrix of word occurrences within documents. 
The model is concerned with whether, or how often, given words appear in a document; it ignores the order of the words. Counting the occurrences of each word in each document yields a matrix of documents and words, known as the Document-Term Matrix (DTM).

The matrix described above uses single words (unigrams) as features. A feature can also be a combination of two or three consecutive words, called a bigram or a trigram; the general approach is called the n-gram model.

You can generate a document-term matrix with scikit-learn's CountVectorizer.






### Defining Bag of Words (BoW) Vectors

Bag of Words (BoW) is a representation of text that describes the presence of words within the text data. 

The intuition is that two similar texts will contain similar words and will therefore have similar bags of words. 

Further, from the words alone we can learn something about the meaning of the document.



Let us consider two strings/documents:


1: “i am happy to see you”

2: “the happy prince looked around”


The list of unique words (the vocabulary) is the following:

am, around, happy, i, looked, prince, see, you, the, to

A binary feature vector can represent each of the documents as follows:

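| Document | am | around | happy | i | looked | prince | see | you | the | to |
|----------|----|--------|-------|---|--------|--------|-----|-----|-----|----|
| 1        | 1  | 0      | 1     | 1 | 0      | 0      | 1   | 1   | 0   | 1  |
| 2        | 0  | 1      | 1     | 0 | 1      | 1      | 0   | 0   | 1   | 0  |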


Here 1 represents that the word is present in the string/document and 0 means the word is absent.
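The binary vectors for the two documents can be reproduced with a few lines of plain Python. This is only a sketch: the helper name `binary_vector` is made up for illustration, and splitting on whitespace is a simplifying assumption about tokenization.

```python
docs = {
    1: "i am happy to see you",
    2: "the happy prince looked around",
}

# Vocabulary in the same order as the list given earlier
vocab = ["am", "around", "happy", "i", "looked", "prince", "see", "you", "the", "to"]

def binary_vector(text, vocab):
    """Return 1 for each vocabulary word present in the text, else 0."""
    words = set(text.split())  # naive whitespace tokenization
    return [1 if word in words else 0 for word in vocab]

binary = {doc_id: binary_vector(text, vocab) for doc_id, text in docs.items()}
for doc_id, vector in binary.items():
    print(doc_id, vector)
```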




Similarly, if a word is present more than once in a document, the 1 can be replaced with the actual count of the word in the document.

Example:

If we have a third string/document

3: “the happy prince looked around to see you happy”


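| Document | am | around | happy | i | looked | prince | see | you | the | to |
|----------|----|--------|-------|---|--------|--------|-----|-----|-----|----|
| 3        | 0  | 1      | 2     | 0 | 1      | 1      | 1   | 1   | 1   | 1  |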
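A count vector for the third document can be sketched with `collections.Counter`. As before, whitespace splitting is a simplifying assumption, and the vocabulary order follows the list given earlier.

```python
from collections import Counter

vocab = ["am", "around", "happy", "i", "looked", "prince", "see", "you", "the", "to"]
doc3 = "the happy prince looked around to see you happy"

# Count every token, then read the counts off in vocabulary order
counts = Counter(doc3.split())
count_vector = [counts[word] for word in vocab]  # absent words count as 0
print(count_vector)
```

The repeated word "happy" receives a count of 2, while words that do not occur in the document, such as "am" and "i", receive 0.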





It is evident that the dimension of each feature vector equals the number of unique words in the whole collection of texts (strings or documents).

For a large collection of documents, many words present in one document will not be present in the others.
Therefore, the feature vectors for the documents will hold 0s in most of the positions. 

To overcome this problem, BoW vectors are normally stored as sparse vectors, which keep only the nonzero entries and so reduce the memory needed for the vector representations of the documents.
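The following sketch stores the two binary vectors from the earlier example in SciPy's compressed sparse row format, the format that `CountVectorizer` also returns. The variable names are illustrative.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Dense document-term matrix for documents 1 and 2 from the example
dense_dtm = np.array([
    [1, 0, 1, 1, 0, 0, 1, 1, 0, 1],
    [0, 1, 1, 0, 1, 1, 0, 0, 1, 0],
])

sparse_dtm = csr_matrix(dense_dtm)

print(sparse_dtm.shape)  # logical shape: documents x vocabulary size
print(sparse_dtm.nnz)    # number of entries actually stored
print(dense_dtm.size)    # number of cells in the dense matrix
```

The logical shape of the matrix stays the same; the saving comes from storing only the nonzero entries, which matters when the vocabulary has thousands of words.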

We generally perform stop-word removal before computing the BoW vectors.
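Stop-word removal can be sketched as a simple filter applied before counting. The stop-word set below is a deliberately tiny, hypothetical list chosen for this example; real lists, such as the one selected by `stop_words='english'` in scikit-learn, are much longer.

```python
# Hypothetical, deliberately tiny stop-word list for illustration
STOP_WORDS = {"i", "am", "to", "the", "you"}

def remove_stop_words(text, stop_words=STOP_WORDS):
    """Lowercase the text, split on whitespace, and drop stop words."""
    return [word for word in text.lower().split() if word not in stop_words]

filtered_1 = remove_stop_words("i am happy to see you")
filtered_2 = remove_stop_words("the happy prince looked around")
print(filtered_1)
print(filtered_2)
```

Removing these words shrinks the vocabulary, so the resulting BoW vectors have fewer columns.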




### Loading Data

Now you will learn text classification. We will perform Multinomial Naive Bayes classification using scikit-learn.

In the model-building part, you can use the "Sentiment Analysis on Movie Reviews" dataset available on Kaggle. The dataset is a tab-separated file with four columns: PhraseId, SentenceId, Phrase, and Sentiment.

This data has 5 sentiment labels:

0 - negative 
1 - somewhat negative 
2 - neutral 
3 - somewhat positive 
4 - positive

Here, you can build a model to classify the sentiment of each phrase. You can download the dataset from the following link: https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews/data

In [None]:
import pandas as pd
import matplotlib.pyplot as plt


In [None]:

data=pd.read_csv('movie-train.tsv', sep='\t')


## Attributes/Datatypes/Number-of-Entries in the DataFrame

In [None]:
data.info()

## Look into the First Ten Observations in the DataFrame

The "Phrase" column contains the text of the movie reviews.

The "Sentiment" column contains the label (category) of the corresponding phrase.

The sentiment labels are as follows:

0 - negative 
1 - somewhat negative 
2 - neutral 
3 - somewhat positive 
4 - positive


In [None]:
print(data[['Phrase','Sentiment']].head(10))

                                              Phrase  Sentiment
0  A series of escapades demonstrating the adage ...          1
1  A series of escapades demonstrating the adage ...          2
2                                           A series          2
3                                                  A          2
4                                             series          2
5  of escapades demonstrating the adage that what...          2
6                                                 of          2
7  escapades demonstrating the adage that what is...          2
8                                          escapades          2
9  demonstrating the adage that what is good for ...          2


## Count of Different Sentiment Categories

In [None]:
data.Sentiment.value_counts()

## Plot the Sentiment Categories in the Dataset

In [None]:
# Count the rows per sentiment label and plot them as a bar chart
sentiment_count = data.groupby('Sentiment').count()
plt.bar(sentiment_count.index.values, sentiment_count['Phrase'])
plt.xlabel('Review Sentiments')
plt.ylabel('Number of Reviews')
plt.show()

## Find BoW Feature Vectors

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import RegexpTokenizer

# Tokenizer that keeps runs of letters and digits, dropping punctuation and other symbols
token = RegexpTokenizer(r'[a-zA-Z0-9]+')

# Lowercase the text, remove English stop words, and count single words (unigrams)
cv = CountVectorizer(lowercase=True, stop_words='english', ngram_range=(1, 1), tokenizer=token.tokenize)
text_counts = cv.fit_transform(data['Phrase'])

In [None]:
print(text_counts)

  (0, 11671)	1
  (0, 4517)	1
  (0, 3444)	1
  (0, 294)	1
  (0, 5735)	2
  (0, 5751)	1
  (0, 5512)	1
  (0, 9065)	1
  (0, 593)	1
  (0, 584)	1
  (0, 12673)	1
  (1, 11671)	1
  (1, 4517)	1
  (1, 3444)	1
  (1, 294)	1
  (1, 5735)	1
  (1, 5751)	1
  (2, 11671)	1
  (4, 11671)	1
  (5, 4517)	1
  (5, 3444)	1
  (5, 294)	1
  (5, 5735)	1
  (5, 5751)	1
  (7, 4517)	1
  :	:
  (156050, 11305)	1
  (156050, 9054)	1
  (156051, 11305)	1
  (156051, 9054)	1
  (156052, 11305)	1
  (156053, 11281)	1
  (156053, 1281)	1
  (156053, 5252)	1
  (156053, 6156)	1
  (156053, 1006)	1
  (156053, 2271)	1
  (156054, 11281)	1
  (156054, 5252)	1
  (156054, 6156)	1
  (156054, 1006)	1
  (156054, 2271)	1
  (156055, 11281)	1
  (156055, 6156)	1
  (156056, 5252)	1
  (156056, 1006)	1
  (156056, 2271)	1
  (156057, 1006)	1
  (156057, 2271)	1
  (156058, 1006)	1
  (156059, 2271)	1


## Split Data into Training and Test Set

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(text_counts, data['Sentiment'], test_size=0.3, random_state=1)

## Classification with Naive Bayes Algorithm

In [None]:
# Import the class for MultinomialNaiveBayes from sklearn library
from sklearn.naive_bayes import MultinomialNB

# Import scikit-learn metrics module for accuracy calculation
from sklearn import metrics

### Train the Model on Training Data

In [None]:
# Model Generation Using Multinomial Naive Bayes
clf = MultinomialNB().fit(X_train, y_train)

### Evaluate the Model on Test Data

In [None]:
predicted= clf.predict(X_test)
print("MultinomialNB Accuracy:",metrics.accuracy_score(y_test, predicted))

MultinomialNB Accuracy: 0.6049169122986885
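The train-and-predict steps above can be tried end to end on a toy corpus. The four texts and their labels below are made up for illustration; the point of the sketch is that a new text must be transformed with the same fitted vectorizer before it is passed to the classifier.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus: 1 = positive, 0 = negative
toy_texts = ["good great fun", "great fun film", "bad boring dull", "dull bad plot"]
toy_labels = [1, 1, 0, 0]

# Fit the vocabulary and the classifier on the toy training texts
toy_vectorizer = CountVectorizer()
X_toy = toy_vectorizer.fit_transform(toy_texts)
toy_clf = MultinomialNB().fit(X_toy, toy_labels)

# Use transform (not fit_transform) so the columns match the training matrix
new_counts = toy_vectorizer.transform(["great fun plot"])
toy_prediction = int(toy_clf.predict(new_counts)[0])
print(toy_prediction)
```

Calling `fit_transform` on the new text instead would build a different vocabulary, and the resulting columns would no longer line up with what the classifier learned.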
