
In [1]:
import collections
import re

In [2]:
def bag_of_words(text):
    """Count word occurrences across a list of sentences."""
    counts = collections.Counter()
    for sentence in text:
        # \w+ matches runs of letters, digits, and underscores
        counts.update(re.findall(r'\w+', sentence))
    return counts

In [3]:
a=['In order to perform machine learning on text documents, the raw (text)',
 'data cannot be fed directly to algorithm as these algorithms expect numerical feature vectors',
 'so instead we need to turn the text content into numerical feature vectors.']

In [4]:
sample_word_tokens_bow = bag_of_words(text=a)
print(sample_word_tokens_bow)

Counter({'to': 3, 'text': 3, 'the': 2, 'numerical': 2, 'feature': 2, 'vectors': 2, 'In': 1, 'order': 1, 'perform': 1, 'machine': 1, 'learning': 1, 'on': 1, 'documents': 1, 'raw': 1, 'data': 1, 'cannot': 1, 'be': 1, 'fed': 1, 'directly': 1, 'algorithm': 1, 'as': 1, 'these': 1, 'algorithms': 1, 'expect': 1, 'so': 1, 'instead': 1, 'we': 1, 'need': 1, 'turn': 1, 'content': 1, 'into': 1})


# Extracting features, or feature encoding, from text files

To perform machine learning on text documents, the raw text cannot be fed directly to the algorithms, because they expect numerical feature vectors. Instead, we need to turn the text content into numerical feature vectors.

From the [scikit-learn documentation](https://scikit-learn.org/stable/modules/feature_extraction.html):
<b>
We call vectorization the general process of turning a collection of text documents into numerical feature vectors.
</b>

When building a classifier, it is important to decide which features of the input are relevant and how to encode them. For textual data such as a sentence or a document, the observable features are the counts and the order of the letters and words in the text, so we need a way to extract these features. There are several ways to extract features from textual data, but in this tutorial we consider a very common feature extraction procedure for sentences and documents known as the <b>bag-of-words (BOW) approach</b>, which looks at the histogram of the words in the text, treating each word count as a feature.


# Bag of Words (BOW)
Bag of words is a feature extraction technique for textual data. It is commonly used to prepare input for machine learning algorithms in problems such as language modeling and document classification. A bag-of-words is a representation of textual data that describes the occurrence of words within a sentence or document, disregarding grammar and the order of words.
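
To see what "disregarding grammar and the order of words" means in practice, here is a minimal sketch using only the standard library. The two sentences and the helper name `to_bag` are illustrative assumptions, not part of the notebook.

```python
from collections import Counter

def to_bag(sentence):
    # Hypothetical helper: lowercase, split on whitespace, count each word.
    return Counter(sentence.lower().split())

# Same words, different order (made-up sentences).
bag_1 = to_bag("Emma likes Mary")
bag_2 = to_bag("Mary likes Emma")

print(bag_1 == bag_2)  # True: word order is lost in a bag of words
```

The two sentences mean different things, yet their bags are identical, which is exactly the information a bag-of-words representation gives up.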

<p><b>How Does Bag of Words Work?</b></p>
To understand how bag of words works, let us assume we have two simple text documents:
```md
1. Boys like playing football and Emma is a boy so Emma likes playing football

2. Mary likes watching movies

```

Based on these two text documents, the list of tokens (words) for each document is as follows:

```javascript
'Boys', 'like', 'playing', 'football', 'and', 'Emma', 'is', 'a', 'boy', 'so', 'Emma', 'likes', 'playing', 'football'


'Mary', 'likes', 'watching', 'movies'
```


Denoting document 1 by doc1 and document 2 by doc2, we construct a dictionary (key->value pairs) of words for each document, where each key is a word and each value is the number of occurrences of that word in the document.

```javascript
doc1 = {'a': 1, 'and': 1, 'boy': 1, 'Boys': 1, 'Emma': 2, 'football': 2, 'is': 1, 'like': 1, 'likes': 1, 'playing': 2, 'so': 1}

doc2 = {'likes': 1, 'Mary': 1, 'movies': 1, 'watching': 1}
```
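
The same word-count dictionaries can be built with `collections.Counter`. This is a minimal sketch; the variable names are chosen for illustration and the tokenization is a plain whitespace split.

```python
from collections import Counter

doc1_tokens = "Boys like playing football and Emma is a boy so Emma likes playing football".split()
doc2_tokens = "Mary likes watching movies".split()

# Counter maps each token to the number of times it occurs.
doc1 = Counter(doc1_tokens)
doc2 = Counter(doc2_tokens)

print(doc1["Emma"], doc1["football"], doc1["playing"])  # 2 2 2
print(dict(doc2))
```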

<b>NOTE:</b> The order of the words is not important.


Putting everything together, lowercasing every word, and treating **a** as a stop word, the features extracted using bag of words for these documents are:

# vocabulary of words
and, boy, boys, emma,  football, is, like, likes, mary, movies, playing, so, watching



<style>
table, th, td {
    border: 1px solid black;
    border-collapse: collapse;
}
th, td {
    padding: 15px;
}
</style>
<table>
    <tr>
        <th></th><th>and</th><th>boy</th><th>boys</th><th>emma</th><th>football</th>
        <th>is</th> <th>like</th><th>likes</th><th>mary</th><th>movies</th>
        <th>playing</th><th>so</th>
        <th>watching</th>
    </tr>

<tr>
    <td>Doc1</td><td>1</td><td>1</td><td>1</td><td>2</td><td>2</td><td>1</td>
    <td>1</td><td>1</td><td>0</td><td>0</td><td>2</td><td>1</td><td>0</td>
</tr>
<tr>
    <td>Doc2</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td>
    <td>1</td><td>1</td><td>1</td><td>0</td><td>0</td><td>1</td>
</tr>
</table>
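
The table can be reproduced by hand with a few lines of standard-library Python. This sketch assumes lowercasing and treats `a` as the only stop word, as in the text; `STOP_WORDS` and `tokenize` are illustrative names.

```python
from collections import Counter

docs = [
    "Boys like playing football and Emma is a boy so Emma likes playing football",
    "Mary likes watching movies",
]
STOP_WORDS = {"a"}  # assumption: only "a" is dropped, as in the text

def tokenize(doc):
    # Lowercase, split on whitespace, and drop stop words.
    return [w for w in doc.lower().split() if w not in STOP_WORDS]

bags = [Counter(tokenize(doc)) for doc in docs]

# The vocabulary is the sorted set of all words seen in any document.
vocabulary = sorted(set().union(*bags))

# One row per document, one column per vocabulary word.
matrix = [[bag[word] for word in vocabulary] for bag in bags]

print(vocabulary)
for row in matrix:
    print(row)
```

A `Counter` returns 0 for a missing key, which is what fills the zero cells for words that do not occur in a document.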



In [5]:
c=['Boys like playing football and Emma is a boy so Emma likes playing football',
   "Mary likes watching movies"]

In [6]:
for i in c:
    print(i.split())

['Boys', 'like', 'playing', 'football', 'and', 'Emma', 'is', 'a', 'boy', 'so', 'Emma', 'likes', 'playing', 'football']
['Mary', 'likes', 'watching', 'movies']


In [7]:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

In [8]:
feature_extr=CountVectorizer()
model=feature_extr.fit_transform(c)

In [9]:
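# Column labels come from the fitted vocabulary (get_feature_names_out requires scikit-learn >= 1.0)
pd.DataFrame(model.toarray(), columns=feature_extr.get_feature_names_out(), index=['doc1', 'doc2'])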

,and,boy,boys,emma,football,is,like,likes,mary,movies,playing,so,watching
doc1,1,1,1,2,2,1,1,1,0,0,2,1,0
doc2,0,0,0,0,0,0,0,1,1,1,0,0,1


In [10]:
model.shape

(2, 13)
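
The matrix has 13 columns although the two documents contain 14 distinct words once lowercased. By default, `CountVectorizer` lowercases the text and uses the token pattern `(?u)\b\w\w+\b`, which only keeps tokens of two or more word characters, so the single-letter word `a` is dropped during tokenization. The sketch below mimics that behaviour with `re` alone; it is an approximation for illustration, not a call into scikit-learn.

```python
import re

# Same regular expression as CountVectorizer's default token_pattern:
# a token is a run of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

docs = [
    "Boys like playing football and Emma is a boy so Emma likes playing football",
    "Mary likes watching movies",
]

# Lowercase each document, tokenize it, and collect the distinct tokens.
vocabulary = sorted({tok for doc in docs for tok in TOKEN_PATTERN.findall(doc.lower())})

print(len(vocabulary))    # 13
print("a" in vocabulary)  # False
```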

<p><b>Disadvantages</b></p>
Although BOW is very simple to understand and implement, it has some disadvantages:

- It produces highly sparse vectors or matrices: for any given sentence, only the few dimensions corresponding to words that occur in it are non-zero.
- It leads to high-dimensional feature vectors, since the total dimension equals the vocabulary size.
- It does not capture the semantic relations between words, because it assumes the words are independent of each other.
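
The sparsity problem is already visible in the two-document example. The sketch below copies the count matrix for doc1 and doc2 and measures the share of zero entries; with a larger vocabulary this share typically grows, because each document uses only a small part of the vocabulary.

```python
# Count matrix copied from the doc1/doc2 example.
matrix = [
    [1, 1, 1, 2, 2, 1, 1, 1, 0, 0, 2, 1, 0],  # doc1
    [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1],  # doc2
]

total = sum(len(row) for row in matrix)
zeros = sum(value == 0 for row in matrix for value in row)

print(f"{zeros} of {total} entries are zero ({zeros / total:.0%})")
# 12 of 26 entries are zero (46%)
```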



# Building a Classifier with the Features Extracted Using BOW

In [11]:
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer

We will use the “Twenty Newsgroups” dataset, a collection of approximately 20,000 newsgroup documents partitioned (nearly) evenly across 20 different newsgroups. See the <a href='http://qwone.com/~jason/20Newsgroups/'>official description of the Twenty Newsgroups data</a>.

We will work on a partial dataset with only 11 of the 20 categories available in the dataset:

In [12]:
categories = ['alt.atheism', 'soc.religion.christian','comp.graphics', 'sci.med','sci.electronics',
              'sci.space','talk.politics.guns','talk.politics.mideast','talk.politics.misc',
              'talk.religion.misc','misc.forsale']

In [13]:
twenty_news_train=fetch_20newsgroups(subset='train',categories=categories,
                                remove=('footers','headers','quotes'))

In [14]:
twenty_news_train.target_names

['alt.atheism',
 'comp.graphics',
 'misc.forsale',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

In [15]:
print("\n".join(twenty_news_train.data[2].split(',')))


That's a revisionist account of what happened.  Gritz was well-aware
of Duke's presence on the ticket.  Given that Gritz is not at all shy
about associating and promoting other white supremacists (such as the
Christian Identity movement or Willis Carto)
 whatever reasons Gritz
had to leave the ticket had nothing to do with Duke's presence.


I believe Chip Berlet has a Populist Party newsletter from the time with
a photo of Gritz happily shaking hands with Duke.


In [16]:
twenty_news_train.target[0:4]

array([4, 0, 7, 1], dtype=int64)

In [17]:
for i in twenty_news_train.target[0:4]:
    print("{}==>{}".format(i,twenty_news_train.target_names[i]))

4==>sci.med
0==>alt.atheism
7==>talk.politics.guns
1==>comp.graphics


In [18]:
news_clf=Pipeline([('bows',CountVectorizer()),
                   ('sgd',SGDClassifier(max_iter=10,class_weight='balanced'))])

In [19]:
news_clf.fit(twenty_news_train.data,twenty_news_train.target)



Pipeline(steps=[('bows', CountVectorizer()),
                ('sgd', SGDClassifier(class_weight='balanced', max_iter=10))])

# PREDICTING NEW INSTANCES

In [20]:
docs_new = ['God is love', 'OpenGL on the GPU is fast','I am selling my car','Nvidia']
predict=news_clf.predict(docs_new)
for doc, pred in zip(docs_new,predict):
    print('{}=>{}'.format(doc,twenty_news_train.target_names[pred]))

God is love=>talk.religion.misc
OpenGL on the GPU is fast=>comp.graphics
I am selling my car=>misc.forsale
Nvidia=>misc.forsale


In [21]:
twenty_news_test=fetch_20newsgroups(subset='test',categories=categories,
                                remove=('footers','headers','quotes'))

# accuracy of the model

In [22]:
test_prediction=news_clf.predict(twenty_news_test.data)
np.mean(twenty_news_test.target==test_prediction)

0.6128140703517588
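
The expression `np.mean(twenty_news_test.target==test_prediction)` is the fraction of test documents whose predicted label matches the true label. A plain-Python sketch of the same computation follows; the labels are made up for illustration and are not taken from the dataset.

```python
# Hypothetical labels for five documents (not taken from the dataset).
y_true = [4, 0, 7, 1, 4]
y_pred = [4, 0, 1, 1, 2]

# Accuracy = number of correct predictions / number of predictions.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

print(accuracy)  # 0.6
```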