# Bag-of-words (BOW) Example

The bag-of-words model is a simplifying representation used in natural language processing and information retrieval (IR). In this model, a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity. 
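As a minimal sketch of this definition, independent of the libraries used below, Python's `collections.Counter` can play the role of the bag. The two sentences, the whitespace split, and the `bag_of_words` helper are illustrative assumptions, not part of the original example.

```python
from collections import Counter

def bag_of_words(text):
    """Return the multiset of whitespace-separated words in `text`."""
    return Counter(text.split())

# Two sentences with the same words in a different order
a = bag_of_words("the cat saw the dog")
b = bag_of_words("the dog saw the cat")

print(a["the"])   # 2: multiplicity is kept
print(a == b)     # True: word order is disregarded
```

The two bags compare equal even though the sentences differ, which is exactly the information the model throws away.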

Import libraries:

In [2]:
from keras.preprocessing.text import Tokenizer
from tokenize import generate_tokens
try:
    from StringIO import StringIO  # Python 2
except ImportError:
    from io import StringIO  # Python 3

Using TensorFlow backend.


Two small code snippets serve as the example documents:

In [3]:
# define documents
docs = ['print("Hello, World!")',
        '''def say_hello():
    print("Hello, World!")
say_hello()''',
        ]

**Tokenizer for Python source**: The tokenize module provides a lexical scanner for Python source code, implemented in Python. The scanner returns comments as tokens as well, which makes it useful for implementing “pretty-printers,” including colorizers for on-screen displays. To simplify token stream handling, all operator and delimiter tokens are returned using the generic token.OP token type. Here we use it to split each snippet into lexical tokens, which are then joined with spaces. See:

* https://docs.python.org/2/library/tokenize.html
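To make the token stream more concrete, here is a small sketch that prints the type and text of each token for a one-line source string. It assumes Python 3 (hence `io.StringIO` rather than the Python 2 `StringIO` module), and the `source` string is made up for illustration.

```python
import io
import tokenize

source = 'x = 1 + 2  # add\n'

# generate_tokens expects a readline callable that returns str lines
tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))

for tok in tokens:
    # tok.type is a numeric code; tokenize.tok_name maps it to a readable name
    print(tokenize.tok_name[tok.type], repr(tok.string))

# Keep only the tokens that carry visible text (drops NEWLINE and ENDMARKER here)
texts = [tok.string for tok in tokens if tok.string.strip()]
print(texts)
```

Both `=` and `+` are reported with the generic OP type, and the comment appears as a token of its own.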

In [4]:
# keep only the token text (t[1]) and join the tokens of each snippet with spaces
token_docs = [' '.join(t[1] for t in generate_tokens(StringIO(doc).readline)) for doc in docs]

Code snippets tokenized:

In [5]:
token_docs

['print ( "Hello, World!" ) ',
 'def say_hello ( ) : \n      print ( "Hello, World!" ) \n  say_hello ( ) ']

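Next, we use the Keras Tokenizer to split the tokenized snippets into words and build the bag-of-words model. The only filtered character is the newline, case is preserved, and words are split on spaces, so a string literal that contains a space (such as "Hello, World!") ends up as two separate words: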

In [6]:
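t = Tokenizer(num_words=None,    # keep every token in the vocabulary
              filters='\n',      # filter out only newlines; keep punctuation
              lower=False,       # preserve case
              split=' ',         # tokens are separated by spaces
              char_level=False)  # word-level, not character-level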

In [7]:
t.fit_on_texts(token_docs)

Word counts:

In [8]:
t.word_counts

OrderedDict([('print', 2),
             ('(', 4),
             ('"Hello,', 2),
             ('World!"', 2),
             (')', 4),
             ('def', 1),
             ('say_hello', 2),
             (':', 1)])

Vocabulary index assigned to each token (indexes start at 1, and more frequent tokens get lower indexes):

In [9]:
t.word_index

{'"Hello,': 4,
 '(': 1,
 ')': 2,
 ':': 8,
 'World!"': 5,
 'def': 7,
 'print': 3,
 'say_hello': 6}
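The counts and indexes above can be reproduced without Keras. The sketch below approximates what the Tokenizer does with these settings (newlines removed, split on spaces, case kept); it is not the library's implementation, and the helper names `split_words` and `build_vocab` are hypothetical. It assumes Python 3.7+, where dictionaries keep insertion order.

```python
from collections import Counter

# Tokenized snippets, copied from the output of token_docs above
token_docs = ['print ( "Hello, World!" ) ',
              'def say_hello ( ) : \n      print ( "Hello, World!" ) \n  say_hello ( ) ']

def split_words(text):
    """Replace newlines with spaces, split on single spaces, drop empty strings."""
    return [w for w in text.replace('\n', ' ').split(' ') if w]

def build_vocab(texts):
    """Return (word_counts, word_index); indexes start at 1, most frequent first."""
    counts = Counter()
    for text in texts:
        counts.update(split_words(text))
    # sorted() is stable, so words with equal counts keep first-seen order
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    index = {word: i for i, (word, _) in enumerate(ranked, start=1)}
    return counts, index

word_counts, word_index = build_vocab(token_docs)
print(word_counts['('], word_index['('])  # 4 1
```

Ranking by count with a stable sort explains why `(` and `)` share a count of 4 but receive indexes 1 and 2 in order of first appearance.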

Using this vocabulary, we can convert each document into a row of token counts. Column i holds the count of the token with index i; column 0 is unused because indexes start at 1:

In [10]:
t.texts_to_matrix(token_docs, mode='count')

array([[ 0.,  1.,  1.,  1.,  1.,  1.,  0.,  0.,  0.],
       [ 0.,  3.,  3.,  1.,  1.,  1.,  2.,  1.,  1.]])
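Each row of this matrix can be rebuilt by hand from the vocabulary indexes. The sketch below is a plain-Python approximation of the count mode, not the Keras implementation; `count_row` is a hypothetical helper, and the word lists are the two tokenized snippets with whitespace collapsed.

```python
# Vocabulary indexes, copied from the output of t.word_index above
word_index = {'(': 1, ')': 2, 'print': 3, '"Hello,': 4,
              'World!"': 5, 'say_hello': 6, 'def': 7, ':': 8}

def count_row(words, word_index):
    """Return one row of counts; position 0 stays 0 because indexes start at 1."""
    row = [0] * (len(word_index) + 1)
    for w in words:
        if w in word_index:          # words outside the vocabulary are skipped
            row[word_index[w]] += 1
    return row

first = 'print ( "Hello, World!" )'.split()
second = 'def say_hello ( ) : print ( "Hello, World!" ) say_hello ( )'.split()

print(count_row(first, word_index))   # [0, 1, 1, 1, 1, 1, 0, 0, 0]
print(count_row(second, word_index))  # [0, 3, 3, 1, 1, 1, 2, 1, 1]
```

The rows match the array above, with integers in place of floats.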