# Bag of Words - BoW

With acknowledgements to Rahul Vasaikar https://github.com/rahulvasaikar

This notebook builds a simple bag of words (BoW) model from first principles. We will then see how to use scikit-learn to build BoW models, and visualise one along a small number of axes.



First, we will import some useful modules for handling strings and collections, and also scikit-learn and pandas for our second version of BoW.

In [None]:
# Import some useful modules, to implement a Bag of Words from scratch
import string
import pprint
from collections import Counter

# Modules we will use to implement another version of Bag of Words with scikit-learn
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer


Let's make some simple documents, in a list.

In [None]:
documents = ['Klonopin 0.25 mg po every evening, Fluconazole 200 mg po daily, Synthroid 125 mcg po every day',
             'She will not consider switching to clozapine',
             'lovastatin 40 mg one half tab po daily, multivitamin daily, metformin 500 mg one tab po twice a day',
             'Aspirin 81 mg po once daily, Zoloft 25 mg po once daily, Calcium with vitamin D two tablets po once daily']



Let's "normalise" our documents, to remove punctuation and case differences. We could do more here - what NLP techniques might you apply to iron out differences between similar words?

In [None]:
normalised_documents = []
for doc in documents:
    # Drop every punctuation character, then lowercase what is left
    no_punctuation = ''.join(c for c in doc if c not in string.punctuation)
    normalised_documents.append(no_punctuation.lower())

for doc in normalised_documents:
    print(doc)

Now let's split them up into tokens, by splitting at whitespace. We could use a tokeniser for this, e.g. from nltk. Why might this be better?

In [None]:
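tokenised_documents = []
for doc in normalised_documents:
    # split() with no argument splits on any run of whitespace,
    # so repeated spaces do not produce empty tokens
    tokenised_documents.append(doc.split())

for tokens in tokenised_documents:
    print(tokens)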
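
One reason a tokeniser can do better: the normalisation step above strips the full stop along with all other punctuation, so `0.25` ends up as `025`. The sketch below is a minimal regular-expression tokeniser, not nltk's, that keeps decimal numbers whole. The pattern and the names `TOKEN_RE` and `regex_tokenise` are assumptions made for illustration.

```python
import re

# Hypothetical pattern for illustration: a number with an optional decimal part,
# or otherwise a run of word characters
TOKEN_RE = re.compile(r"\d+(?:\.\d+)?|\w+")

def regex_tokenise(text):
    """Lowercase the text and return its tokens, keeping decimals such as 0.25 whole."""
    return TOKEN_RE.findall(text.lower())

sample = 'Klonopin 0.25 mg po every evening, Fluconazole 200 mg po daily'
print(regex_tokenise(sample))
```

Because the pattern describes what a token is, rather than what separates tokens, commas and repeated spaces are skipped without a separate clean-up pass.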

Let's find the frequency of each unique token in our documents, i.e. the Bag of Words - BoW.

In [None]:
frequency_list = []

for i in tokenised_documents:
    frequency_list.append(Counter(i))

pp = pprint.PrettyPrinter(width=200)
pp.pprint(frequency_list)
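
Each `Counter` above only knows about the tokens in its own document. To compare documents, the counts need to be laid out against one shared vocabulary, which is what the scikit-learn version produces. The following is a minimal sketch of that step from first principles. It uses toy token lists rather than the documents above, and the names `toy_tokens`, `toy_counts`, `vocabulary` and `count_vectors` are illustrative.

```python
from collections import Counter

# Toy token lists, standing in for tokenised_documents
toy_tokens = [['mg', 'po', 'daily', 'mg'], ['po', 'once', 'daily']]
toy_counts = [Counter(tokens) for tokens in toy_tokens]

# Shared vocabulary: every unique token, sorted so the column order is stable
vocabulary = sorted(set().union(*toy_counts))

# One row per document, one column per vocabulary entry.
# A Counter returns 0 for a token it has never seen.
count_vectors = [[counts[word] for word in vocabulary] for counts in toy_counts]

print(vocabulary)     # ['daily', 'mg', 'once', 'po']
print(count_vectors)  # [[1, 2, 0, 1], [1, 0, 1, 1]]
```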

Now let's do the same with scikit-learn, using the CountVectorizer class. We define a token pattern that excludes numbers, and we also remove English stop words.

In [None]:
count_vector = CountVectorizer(token_pattern=r'\b[^\d\W]+\b', stop_words = 'english')
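
To see what the token pattern matches on its own, we can apply it directly with Python's `re` module. This only approximates what the vectorizer does internally: the text is lowercased by hand here, and stop word removal is a separate step that is not shown.

```python
import re

# The same pattern passed to CountVectorizer above:
# runs of word characters that contain no digits
pattern = re.compile(r'\b[^\d\W]+\b')

sample = 'Synthroid 125 mcg po every day'.lower()
print(pattern.findall(sample))  # ['synthroid', 'mcg', 'po', 'every', 'day']
```

The number `125` is dropped because no digit can be part of a match.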

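Now let's fit our vectorizer to build the vocabulary for the bag of words, and print the token features. Note that all punctuation has been removed, because the token pattern matches only word characters that are not digits, and that tokens are lowercased by default.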


In [None]:
count_vector.fit(documents)
count_vector.get_feature_names_out()

Let's transform our documents into count vectors and take a look:

In [None]:
doc_array = count_vector.transform(documents).toarray()
print(doc_array)

Here is how this encodes each document against the word dimensions:

In [None]:
frequency_matrix = pd.DataFrame(doc_array,index=documents,columns=count_vector.get_feature_names_out())
frequency_matrix

We can spot the difference between our documents. To imagine what it would look like if we plotted these in a multidimensional space, with one dimension for each word in our vocabulary, let's restrict our vocabulary to just three of the words for now.

In [None]:
count_vector = CountVectorizer(token_pattern=r'\b[^\d\W]+\b', stop_words = 'english', vocabulary=['daily','mg','po'])
count_vector.fit(documents)
doc_array = count_vector.transform(documents).toarray()
frequency_matrix = pd.DataFrame(doc_array,index=documents,columns=count_vector.get_feature_names_out())
frequency_matrix

Our plotting module needs one array for each of our three dimensions, instead of one for each document:

In [None]:
rotated = list(zip(*doc_array[::-1]))
print(rotated)
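
The `zip(*...)` idiom is what turns rows into columns. Here is a small sketch on a toy matrix, with the illustrative names `toy_rows` and `toy_columns`:

```python
# Toy count matrix: one row per document, columns in the order daily, mg, po
toy_rows = [[1, 2, 2],
            [0, 0, 0]]

# zip(*rows) pairs up the i-th entry of every row, giving one tuple per column
toy_columns = list(zip(*toy_rows))
print(toy_columns)  # [(1, 0), (2, 0), (2, 0)]
```

In the cell above, the `[::-1]` slice only reverses the order of the documents before transposing, so the same points are plotted either way.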

And now the plot:

In [None]:
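import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # registers the 3D projection on older matplotlib versions
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
# rotated holds one tuple per vocabulary word, in the order daily, mg, po
z, x, y = rotated
ax.scatter(list(x), list(y), list(z), zdir='z', c='red')
# Label each axis with the word it counts
ax.set_xlabel('mg')
ax.set_ylabel('po')
ax.set_zlabel('daily')
plt.show()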