# Bag of Words

This notebook is an introduction to writing your own bag-of-words function in Python; put simply, a function that counts word frequencies in a string or document. It also covers some useful add-on functions for a bag of words, such as removing stop words and keeping the n most frequent words in a bag. Note: written explanations are sparse in this notebook because the code is fairly straightforward. If you want more insight, check out the <a href="https://en.wikipedia.org/wiki/Bag-of-words_model">wiki page</a>.

## Counting Words in a String

To begin, let's split a string into a list of its components and count the frequency of each of the words in the list.

In [1]:
#the string we wish to count over
somewords="Mr. Obama acknowledged that the killings - an act not just of demented violence but of racial hatred - had exposed a fault line in American democracy. He said he understood if Americans questioned whether the racial divide would ever be bridged."

#strip punctuation and transform to lower case
import string
somewords=somewords.translate(string.maketrans("",""), string.punctuation).lower()

#split string into list of words
wordlist=somewords.split()

#get word count for every word in list
wordcount= [wordlist.count(word) for word in wordlist]
    
print somewords
print wordcount

mr obama acknowledged that the killings  an act not just of demented violence but of racial hatred  had exposed a fault line in american democracy he said he understood if americans questioned whether the racial divide would ever be bridged
[1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1]
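The cell above relies on Python 2's two-argument `translate` and the `print` statement. As a minimal sketch, assuming Python 3, the same steps might look like the following; the names `table` and `cleaned` are illustrative and not part of the notebook.

```python
import string

# the string we wish to count over (same text as in the cell above)
somewords = ("Mr. Obama acknowledged that the killings - an act not just of "
             "demented violence but of racial hatred - had exposed a fault line "
             "in American democracy. He said he understood if Americans "
             "questioned whether the racial divide would ever be bridged.")

# Python 3: build a translation table that deletes every punctuation character
table = str.maketrans("", "", string.punctuation)
cleaned = somewords.translate(table).lower()

# split on whitespace; the double spaces left by the removed dashes are collapsed
wordlist = cleaned.split()

# one count per position in the list, so repeated words appear more than once
wordcount = [wordlist.count(word) for word in wordlist]

print(cleaned)
print(wordcount)
```

The count list has one entry per word position, which is why a word such as "the" contributes a 2 in two different places.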


## Wrapping into a Dictionary and Making It a Function

Now that we have counted the number of times each word appears in the string, let's put the counts into a dictionary (hash). Because dictionary keys are unique, the duplicates we saw above will go away.

In [2]:
#create a dictionary with word count
def WordBag(aString):
    wordlist=aString.translate(string.maketrans("",""), string.punctuation).lower().split()
    wordcount = [wordlist.count(word) for word in wordlist]
    return dict(zip(wordlist,wordcount))

In [3]:
#test it on somewords
test=WordBag(somewords)

print test

{'divide': 1, 'just': 1, 'exposed': 1, 'an': 1, 'understood': 1, 'demented': 1, 'americans': 1, 'in': 1, 'racial': 2, 'if': 1, 'killings': 1, 'democracy': 1, 'had': 1, 'questioned': 1, 'bridged': 1, 'he': 2, 'ever': 1, 'be': 1, 'that': 1, 'mr': 1, 'but': 1, 'said': 1, 'not': 1, 'line': 1, 'the': 2, 'obama': 1, 'a': 1, 'would': 1, 'whether': 1, 'of': 2, 'violence': 1, 'acknowledged': 1, 'american': 1, 'act': 1, 'hatred': 1, 'fault': 1}
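Calling `wordlist.count(word)` for every word rescans the whole list each time, which gets slow on long documents. One possible alternative, sketched here under the assumption of Python 3, is `collections.Counter`, which tallies all words in a single pass. The name `word_bag` is hypothetical and chosen only to avoid clashing with `WordBag`.

```python
import string
from collections import Counter

def word_bag(text):
    """Return a dict mapping each lower-cased word in text to its count."""
    # delete punctuation, lower-case, and split on whitespace
    table = str.maketrans("", "", string.punctuation)
    words = text.translate(table).lower().split()
    # Counter tallies every word in one pass over the list
    return dict(Counter(words))

bag = word_bag("He said he understood. The racial divide, the racial hatred.")
print(bag)
```

The result has the same shape as the output of `WordBag`, so it should be usable with the ordering and filtering helpers that follow.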


## Write a Function to add Ordering

Yes, it is that simple. Now, we have one of the most basic tools for investigating text in machine learning. Let's add some extras. 

The first extra is just a simple function to order the bag.

In [4]:
#create ordering function, note: words with equal counts are always listed in alphabetical order
def OrderWordBag(bag, Descending=True):
    #build (count, word) pairs so the count is the primary sort key
    arr=[(bag[key], key) for key in bag]
    if Descending:
        return sorted(arr, key=lambda x: (-x[0], x[1]))
    else:
        return sorted(arr, key=lambda x: (x[0], x[1]))

In [5]:
#order descending
print OrderWordBag(test)

[(2, 'he'), (2, 'of'), (2, 'racial'), (2, 'the'), (1, 'a'), (1, 'acknowledged'), (1, 'act'), (1, 'american'), (1, 'americans'), (1, 'an'), (1, 'be'), (1, 'bridged'), (1, 'but'), (1, 'demented'), (1, 'democracy'), (1, 'divide'), (1, 'ever'), (1, 'exposed'), (1, 'fault'), (1, 'had'), (1, 'hatred'), (1, 'if'), (1, 'in'), (1, 'just'), (1, 'killings'), (1, 'line'), (1, 'mr'), (1, 'not'), (1, 'obama'), (1, 'questioned'), (1, 'said'), (1, 'that'), (1, 'understood'), (1, 'violence'), (1, 'whether'), (1, 'would')]


In [6]:
#order ascending
print OrderWordBag(test, False)

[(1, 'a'), (1, 'acknowledged'), (1, 'act'), (1, 'american'), (1, 'americans'), (1, 'an'), (1, 'be'), (1, 'bridged'), (1, 'but'), (1, 'demented'), (1, 'democracy'), (1, 'divide'), (1, 'ever'), (1, 'exposed'), (1, 'fault'), (1, 'had'), (1, 'hatred'), (1, 'if'), (1, 'in'), (1, 'just'), (1, 'killings'), (1, 'line'), (1, 'mr'), (1, 'not'), (1, 'obama'), (1, 'questioned'), (1, 'said'), (1, 'that'), (1, 'understood'), (1, 'violence'), (1, 'whether'), (1, 'would'), (2, 'he'), (2, 'of'), (2, 'racial'), (2, 'the')]
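The sort key is what makes ties come out alphabetically in both directions: the count is negated for descending order, while the word itself is always compared a to z. A small sketch on a toy bag, assuming Python 3, makes this easier to see; `order_word_bag` and `toy` are illustrative names, not part of the notebook.

```python
def order_word_bag(bag, descending=True):
    """Sort (count, word) pairs by count; equal counts are ordered a to z."""
    # flipping the sign of the count reverses the count order only
    sign = -1 if descending else 1
    pairs = [(count, word) for word, count in bag.items()]
    return sorted(pairs, key=lambda pair: (sign * pair[0], pair[1]))

toy = {"the": 2, "of": 2, "act": 1, "a": 1}
print(order_word_bag(toy))                    # highest counts first
print(order_word_bag(toy, descending=False))  # lowest counts first
```

In both calls "of" is listed before "the" and "a" before "act", because only the count part of the key changes sign.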


## Create function to take n-top words

The second extra is useful when we only care about the most frequent words. We can write a function that takes the output of our bag of words function and keeps just the n most frequent.

In [7]:
#function to take the n most counted words
#note: a_to_z=True lists words with equal counts alphabetically, a_to_z=False in reverse alphabetical order
def ntop(n,bag,a_to_z=True):
    arr=[(bag[key], key) for key in bag]
    if a_to_z:
        return sorted(arr, key=lambda x: (-x[0], x[1]))[0:n]
    else:
        return sorted(arr, key=lambda x: (x[0], x[1]), reverse=True)[0:n]

In [8]:
#8 top counted, a to z ordering
print ntop(8,test)

[(2, 'he'), (2, 'of'), (2, 'racial'), (2, 'the'), (1, 'a'), (1, 'acknowledged'), (1, 'act'), (1, 'american')]


In [9]:
#8 top counted, z to a ordering
print ntop(8,test, False)

[(2, 'the'), (2, 'racial'), (2, 'of'), (2, 'he'), (1, 'would'), (1, 'whether'), (1, 'violence'), (1, 'understood')]
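Sorting the whole bag just to keep a handful of entries is more work than needed on large vocabularies. As a hedged alternative, the standard library's `heapq.nsmallest` and `heapq.nlargest` can select the top n directly. This is a sketch assuming Python 3; `n_top` and `toy` are hypothetical names.

```python
import heapq

def n_top(n, bag, a_to_z=True):
    """Return the n highest-count (count, word) pairs from bag."""
    pairs = [(count, word) for word, count in bag.items()]
    if a_to_z:
        # smallest (-count, word) means highest count first, ties a to z
        return heapq.nsmallest(n, pairs, key=lambda p: (-p[0], p[1]))
    # largest (count, word) means highest count first, ties z to a
    return heapq.nlargest(n, pairs)

toy = {"the": 2, "of": 2, "act": 1, "a": 1}
print(n_top(3, toy))
print(n_top(3, toy, a_to_z=False))
```

Both heap functions return their results already ordered, so the output should match slicing the fully sorted list as `ntop` does.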


## Creating a Function to Remove Stop Words

The last add-on is removing 'stop words'. Stop words are words that appear in nearly all documents and, the thinking goes, therefore say little about any particular document. These are words like: the, a, are, is, it, etc. Below we create a list of stop words and a function to remove them, then test the function on a complete article that we read in.

In [10]:
#read an article into a string variable, and remove ugly encodings
#note: newlines are replaced with '' (not a space), so words on either side of a line break are fused together
with open("ObamaArticle.txt", "r") as myfile:
    article=myfile.read().decode('utf-8')
    #drop curly quotes and apostrophes
    article=article.replace(u'\u201d','').replace(u'\n','').replace(u'\u2019','')
    article=article.replace(u'\u201c','')
    #drop any remaining non-ASCII characters
    article=article.encode('ascii', 'ignore')
    
article

'DALLAS - President Obama said on Tuesday that the nation mourned along with Dallas for five police officers gunned down by a black Army veteran, but he implored Americans not to give in to despair or the fear that the center might not hold.Im here to say that we must reject such despair, Mr. Obama said at a memorial service for the officers in Dallas. Im here to insist that we are not so divided as we seem. I say that because I know America. I know how far weve come against impossible odds. I know well make it because of what Ive experienced in my own life.Mr. Obama acknowledged that the killings - an act not just of demented violence but of racial hatred - had exposed a fault line in American democracy. He said he understood if Americans questioned whether the racial divide would ever be bridged.Im not nave, he said. Ive spoken at too many memorials during the course of this presidency.Mr. Obama acknowledged the limitations of his own words, and quoted from the Gospel of John: Let us
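Because the cleaning step replaces newlines with an empty string, sentences on adjacent lines run together ("hold.Im"), and once punctuation is stripped this produces merged tokens such as 'holdim' and 'saidbehind' in the bags below. A minimal sketch of a variant that substitutes a space instead, assuming Python 3 and working on an in-memory string rather than the article file; `clean_text` and `sample` are hypothetical names.

```python
def clean_text(raw):
    """Strip curly quotes and apostrophes, join lines with spaces, keep ASCII only."""
    # remove the curly quotes and apostrophes the notebook also removes
    for ch in ("\u201c", "\u201d", "\u2019"):
        raw = raw.replace(ch, "")
    # a space (not '') keeps words on either side of a line break apart
    raw = raw.replace("\n", " ")
    # drop any remaining non-ASCII characters
    return raw.encode("ascii", "ignore").decode("ascii")

sample = "might not hold.\n\u201cI\u2019m here\u201d na\u00efve"
print(clean_text(sample))
```

Dropping non-ASCII characters still turns "naïve" into "nave", just as in the notebook's output; only the word-fusing behavior changes.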

In [11]:
art=WordBag(article)

print art

{'all': 1, 'bush': 3, 'listening': 1, 'just': 2, 'huddled': 1, 'over': 1, 'people': 2, 'thisformer': 1, 'month': 1, 'course': 1, 'through': 1, 'saidbehind': 1, 'go': 1, 'love': 2, 'fear': 1, 'debate': 1, 'before': 2, 'fearful': 1, 'killings': 1, 'police': 5, 'quoted': 1, 'comfort': 1, 'also': 1, 'justice': 1, 'truthmr': 1, 'concerned': 1, 'remember': 1, 'had': 3, 'about': 1, 'to': 12, 'heroism': 1, 'black': 2, 'questioned': 1, 'he': 9, 'might': 1, 'into': 1, 'divided': 1, 'do': 1, 'weve': 1, 'advantage': 1, 'far': 1, 'similar': 1, 'later': 1, 'saidi': 1, 'nation': 3, 'earnest': 1, 'grew': 1, 'five': 2, 'know': 3, 'words': 3, 'not': 9, 'during': 1, 'one': 1, 'him': 2, 'apart': 1, 'worsening': 1, 'like': 1, 'did': 1, 'across': 1, 'consensus': 1, 'this': 2, 'changed': 1, 'holdim': 1, 'simply': 1, 'gospel': 1, 'fault': 1, 'out': 1, 'clap': 1, 'secretary': 1, 'because': 2, 'nave': 1, 'exposed': 1, 'house': 1, 'cynicism': 1, 'stronger': 1, 'feel': 1, 'soul': 1, 'understood': 1, 'demented': 1

In [12]:
#order descending
print OrderWordBag(art)

[(33, 'the'), (18, 'of'), (14, 'that'), (13, 'a'), (13, 'in'), (12, 'and'), (12, 'to'), (9, 'he'), (9, 'not'), (9, 'obama'), (8, 'said'), (7, 'we'), (6, 'but'), (6, 'dallas'), (6, 'i'), (6, 'officers'), (6, 'too'), (6, 'with'), (5, 'are'), (5, 'for'), (5, 'it'), (5, 'mr'), (5, 'police'), (4, 'americans'), (4, 'at'), (4, 'have'), (4, 'his'), (4, 'ive'), (4, 'many'), (4, 'or'), (4, 'president'), (4, 'racial'), (4, 'us'), (4, 'who'), (3, 'as'), (3, 'bush'), (3, 'had'), (3, 'has'), (3, 'killing'), (3, 'know'), (3, 'nation'), (3, 'only'), (3, 'so'), (3, 'words'), (2, 'acknowledged'), (2, 'added'), (2, 'after'), (2, 'an'), (2, 'ask'), (2, 'because'), (2, 'before'), (2, 'black'), (2, 'by'), (2, 'despair'), (2, 'families'), (2, 'five'), (2, 'forces'), (2, 'here'), (2, 'him'), (2, 'is'), (2, 'its'), (2, 'just'), (2, 'love'), (2, 'make'), (2, 'memorial'), (2, 'monday'), (2, 'much'), (2, 'on'), (2, 'our'), (2, 'own'), (2, 'people'), (2, 'racism'), (2, 'say'), (2, 'such'), (2, 'this'), (2, 'time')

In [13]:
#list holding the words we would like to remove
stopwords = ['a', 'about', 'above', 'across', 'after', 'afterwards']
stopwords += ['again', 'against', 'all', 'almost', 'alone', 'along']
stopwords += ['already', 'also', 'although', 'always', 'am', 'among']
stopwords += ['amongst', 'amoungst', 'amount', 'an', 'and', 'another']
stopwords += ['any', 'anyhow', 'anyone', 'anything', 'anyway', 'anywhere']
stopwords += ['are', 'around', 'as', 'at', 'back', 'be', 'became']
stopwords += ['because', 'become', 'becomes', 'becoming', 'been']
stopwords += ['before', 'beforehand', 'behind', 'being', 'below']
stopwords += ['beside', 'besides', 'between', 'beyond', 'bill', 'both']
stopwords += ['bottom', 'but', 'by', 'call', 'can', 'cannot', 'cant']
stopwords += ['co', 'computer', 'con', 'could', 'couldnt', 'cry', 'de']
stopwords += ['describe', 'detail', 'did', 'do', 'done', 'down', 'due']
stopwords += ['during', 'each', 'eg', 'eight', 'either', 'eleven', 'else']
stopwords += ['elsewhere', 'empty', 'enough', 'etc', 'even', 'ever']
stopwords += ['every', 'everyone', 'everything', 'everywhere', 'except']
stopwords += ['few', 'fifteen', 'fifty', 'fill', 'find', 'fire', 'first']
stopwords += ['five', 'for', 'former', 'formerly', 'forty', 'found']
stopwords += ['four', 'from', 'front', 'full', 'further', 'get', 'give']
stopwords += ['go', 'had', 'has', 'hasnt', 'have', 'he', 'hence', 'her']
stopwords += ['here', 'hereafter', 'hereby', 'herein', 'hereupon', 'hers']
stopwords += ['herself', 'him', 'himself', 'his', 'how', 'however']
stopwords += ['hundred', 'i', 'ie', 'if', 'in', 'inc', 'indeed']
stopwords += ['interest', 'into', 'is', 'it', 'its', 'itself', 'keep']
stopwords += ['last', 'latter', 'latterly', 'least', 'less', 'ltd', 'made']
stopwords += ['many', 'may', 'me', 'meanwhile', 'might', 'mill', 'mine']
stopwords += ['more', 'moreover', 'most', 'mostly', 'move', 'much']
stopwords += ['must', 'my', 'myself', 'name', 'namely', 'neither', 'never']
stopwords += ['nevertheless', 'next', 'nine', 'no', 'nobody', 'none']
stopwords += ['noone', 'nor', 'not', 'nothing', 'now', 'nowhere', 'of']
stopwords += ['off', 'often', 'on','once', 'one', 'only', 'onto', 'or']
stopwords += ['other', 'others', 'otherwise', 'our', 'ours', 'ourselves']
stopwords += ['out', 'over', 'own', 'part', 'per', 'perhaps', 'please']
stopwords += ['put', 'rather', 're', 's', 'same', 'see', 'seem', 'seemed']
stopwords += ['seeming', 'seems', 'serious', 'several', 'she', 'should']
stopwords += ['show', 'side', 'since', 'sincere', 'six', 'sixty', 'so']
stopwords += ['some', 'somehow', 'someone', 'something', 'sometime']
stopwords += ['sometimes', 'somewhere', 'still', 'such', 'system', 'take']
stopwords += ['ten', 'than', 'that', 'the', 'their', 'them', 'themselves']
stopwords += ['then', 'thence', 'there', 'thereafter', 'thereby']
stopwords += ['therefore', 'therein', 'thereupon', 'these', 'they']
stopwords += ['thick', 'thin', 'third', 'this', 'those', 'though', 'three']
stopwords += ['through', 'throughout', 'thru', 'thus', 'to']
stopwords += ['together', 'too', 'top', 'toward', 'towards', 'twelve']
stopwords += ['twenty', 'two', 'un', 'under', 'until', 'up', 'upon']
stopwords += ['us', 'very', 'via', 'was', 'we', 'well', 'were', 'what']
stopwords += ['whatever', 'when', 'whence', 'whenever', 'where']
stopwords += ['whereafter', 'whereas', 'whereby', 'wherein', 'whereupon']
stopwords += ['wherever', 'whether', 'which', 'while', 'whither', 'who']
stopwords += ['whoever', 'whole', 'whom', 'whose', 'why', 'will', 'with']
stopwords += ['within', 'without', 'would', 'yet', 'you', 'your']
stopwords += ['yours', 'yourself', 'yourselves']

In [14]:
# Given a dict of words, remove any that are
# in a list of stop words.
# Note: the bag is modified in place and also returned.

def removeStopwords(bag, stopwords):
    for s in stopwords:
        if s in bag:
            del bag[s]
    return bag

In [15]:
print removeStopwords(art, stopwords)

{'bush': 3, 'listening': 1, 'just': 2, 'huddled': 1, 'people': 2, 'thisformer': 1, 'month': 1, 'course': 1, 'saidbehind': 1, 'love': 2, 'fear': 1, 'debate': 1, 'fearful': 1, 'killings': 1, 'police': 5, 'quoted': 1, 'comfort': 1, 'justice': 1, 'truthmr': 1, 'concerned': 1, 'remember': 1, 'heroism': 1, 'black': 2, 'questioned': 1, 'divided': 1, 'weve': 1, 'advantage': 1, 'far': 1, 'similar': 1, 'later': 1, 'saidi': 1, 'nation': 3, 'earnest': 1, 'grew': 1, 'know': 3, 'words': 3, 'apart': 1, 'worsening': 1, 'like': 1, 'consensus': 1, 'changed': 1, 'holdim': 1, 'simply': 1, 'gospel': 1, 'fault': 1, 'clap': 1, 'secretary': 1, 'nave': 1, 'exposed': 1, 'house': 1, 'cynicism': 1, 'stronger': 1, 'feel': 1, 'soul': 1, 'understood': 1, 'demented': 1, 'added': 2, 'forging': 1, 'home': 1, 'racial': 4, 'said': 8, 'acknowledgment': 1, 'griefstricken': 1, 'tensions': 1, 'confess': 1, 'effort': 1, 'experience': 1, 'row': 1, 'grieves': 1, 'limitations': 1, 'approached': 1, 'doubt': 1, 'bridgedim': 1, 'me

In [16]:
#print ordered without stopwords
print OrderWordBag(art)

[(9, 'obama'), (8, 'said'), (6, 'dallas'), (6, 'officers'), (5, 'mr'), (5, 'police'), (4, 'americans'), (4, 'ive'), (4, 'president'), (4, 'racial'), (3, 'bush'), (3, 'killing'), (3, 'know'), (3, 'nation'), (3, 'words'), (2, 'acknowledged'), (2, 'added'), (2, 'ask'), (2, 'black'), (2, 'despair'), (2, 'families'), (2, 'forces'), (2, 'just'), (2, 'love'), (2, 'make'), (2, 'memorial'), (2, 'monday'), (2, 'people'), (2, 'racism'), (2, 'say'), (2, 'time'), (2, 'violence'), (1, '11th'), (1, 'acknowledgment'), (1, 'act'), (1, 'action'), (1, 'advantage'), (1, 'america'), (1, 'american'), (1, 'apart'), (1, 'appealed'), (1, 'applaudedthe'), (1, 'approached'), (1, 'army'), (1, 'balanced'), (1, 'bias'), (1, 'biasthe'), (1, 'binding'), (1, 'blunt'), (1, 'bridgedim'), (1, 'center'), (1, 'change'), (1, 'changed'), (1, 'city'), (1, 'clap'), (1, 'come'), (1, 'comfort'), (1, 'concerned'), (1, 'confess'), (1, 'consensus'), (1, 'console'), (1, 'correctness'), (1, 'country'), (1, 'course'), (1, 'criminal'),
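Because `removeStopwords` deletes keys from the dictionary it is given, `art` itself no longer contains stop words, which is why the ordered output above is already filtered. If you would rather keep the original bag intact, one option is to build a new dictionary instead. This is a sketch assuming Python 3; `without_stopwords` is a hypothetical name, and the toy counts are only for illustration.

```python
def without_stopwords(bag, stopwords):
    """Return a new bag with stop words left out; the input bag is untouched."""
    # a set gives fast membership tests for long stop word lists
    stop = set(stopwords)
    return {word: count for word, count in bag.items() if word not in stop}

bag = {"the": 33, "obama": 9, "of": 18, "said": 8}
filtered = without_stopwords(bag, ["the", "of", "a"])
print(filtered)
print(bag)  # still contains 'the' and 'of'
```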

## Conclusion

Well, that's it. Now that you have bag of words in your toolbox, go forth and bag yourself some documents to play with. Also, I recommend you look into <a href="http://nlp.stanford.edu/IR-book/html/htmledition/tf-idf-weighting-1.html">Tf-idf</a> and <a href="https://turi.com/learn/userguide/feature-engineering/bm25.html">BM25</a>. As always, happy exploring.

