# Bag of Words Model in Pandas

A bag of words model represents a collection of documents as a matrix. Each column is a unique word and each row is a document. In the binary form used here, each cell records whether the word occurs in that document (1) or not (0). A dataframe representation is shown below.

![Image](./data/bagofwords.PNG)

Every column except doc_id is a word, and every row is a document; doc_id holds the name of the document. In the first row, document 1987_1 contains none of the words abalone, abbeel or zhou, so each of those cells is 0. If a document contains a word, the corresponding cell is 1.
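
To make the layout concrete, here is a minimal sketch that builds the same kind of presence/absence matrix in plain Python. The two toy documents and the names `toy_docs`, `vocab` and `matrix` are made up for illustration and are not the files used in this notebook.

```python
# Hypothetical toy documents, not the doc1.txt to doc5.txt files
toy_docs = {
    "toy_1": "neural networks learn representations",
    "toy_2": "networks of words",
}

# Vocabulary: every distinct word across the toy documents, sorted for a stable column order
vocab = sorted({word for text in toy_docs.values() for word in text.split()})

# One row per document: 1 if the word occurs in it, 0 otherwise
matrix = {
    doc_id: [1 if word in text.split() else 0 for word in vocab]
    for doc_id, text in toy_docs.items()
}

# Header line followed by one line per document
print(["doc_id"] + vocab)
for doc_id, row in matrix.items():
    print([doc_id] + row)
```

Each printed row mirrors one row of the dataframe in the image: the document name first, then one 0/1 flag per vocabulary word.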

We will build this bag of words model from 5 documents named doc1.txt, doc2.txt, doc3.txt, doc4.txt and doc5.txt.


**The dataframe should have 5 rows. The columns should be the unique words across all documents, restricted to words longer than 4 characters and with punctuation marks removed.**
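
The word-level requirements can be sketched as a small helper. This is only an illustration: the function name `clean_tokens` and its `min_len` parameter are assumptions, and the stripped characters are the same ones the notebook stores in `strips`.

```python
# Characters removed from both ends of each token (same set as the notebook's `strips`)
STRIPS = ',.-"()'

def clean_tokens(text, min_len=5):
    """Return the unique tokens that are at least min_len characters long after stripping."""
    tokens = (tok.strip(STRIPS) for tok in text.split())
    return {tok for tok in tokens if len(tok) >= min_len}

# Sample sentence fragment taken from doc1.txt
sample = 'we dig a little "deeper" into sentiment analysis.'
print(sorted(clean_tokens(sample)))
# ['analysis', 'deeper', 'little', 'sentiment']
```

Note that `str.strip` only removes characters at the two ends of a token, so internal punctuation such as the hyphen in `deep-learning` or the apostrophe in `Google's` is kept.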

In [1]:
#Read the raw text of each document, closing every file after it is read
texts=[]
for i in range(1,6):
    with open('data/Bag of Words Docs/doc%d.txt' % i,'r') as fp:
        texts.append(fp.read())
#characters to be stripped
strips=',.-"()'
#List containing words from each doc
doc=[]
for i,text in enumerate(texts):
    #As in the original cell, only doc4.txt (index 3) has its newlines replaced with spaces
    if i==3:
        text=text.replace('\n',' ')
    doc.append(text.split(' '))
doc

[['In',
  'this',
  'tutorial',
  'competition,',
  'we',
  'dig',
  'a',
  'little',
  '"deeper"',
  'into',
  'sentiment',
  'analysis.',
  "Google's",
  'Word2Vec',
  'is',
  'a',
  'deep-learning',
  'inspired',
  'method',
  'that',
  'focuses',
  'on',
  'the',
  'meaning',
  'of',
  'words.',
  'Word2Vec',
  'attempts',
  'to',
  'understand',
  'meaning',
  'and',
  'semantic',
  'relationships',
  'among',
  'words.',
  'It',
  'works',
  'in',
  'a',
  'way',
  'that',
  'is',
  'similar',
  'to',
  'deep',
  'approaches,',
  'such',
  'as',
  'recurrent',
  'neural',
  'nets',
  'or',
  'deep',
  'neural',
  'nets,',
  'but',
  'is',
  'computationally',
  'more',
  'efficient.',
  'This',
  'tutorial',
  'focuses',
  'on',
  'Word2Vec',
  'for',
  'sentiment',
  'analysis.'],
 ['Sentiment',
  'analysis',
  'is',
  'a',
  'challenging',
  'subject',
  'in',
  'machine',
  'learning.',
  'People',
  'express',
  'their',
  'emotions',
  'in',
  'language',
  'that',
  'is',
 

In [2]:
#List of sets having unique words in a doc
newdoc=[]
#Strip the characters in strips from both ends of each word; set() removes duplicates
for d in doc:
    newdoc.append(set(map(lambda x:x.strip(strips),d)))
l=0
#Print the number of unique words in the first doc, then the running total over all docs
print(len(newdoc[0]))
for d in newdoc:
    l+=len(d)
    print(l)
print(newdoc[0])

51
51
105
168
233
290
{'the', 'on', 'more', 'In', 'in', 'but', 'competition', 'tutorial', 'words', 'dig', 'little', 'neural', 'to', 'It', 'works', 'deep', 'recurrent', 'of', 'is', 'a', 'deeper', 'approaches', 'we', 'analysis', 'semantic', 'understand', 'efficient', 'This', 'method', 'for', 'focuses', 'meaning', 'similar', 'attempts', 'deep-learning', 'that', 'such', 'Word2Vec', 'this', 'sentiment', 'into', "Google's", 'as', 'relationships', 'or', 'computationally', 'among', 'inspired', 'nets', 'way', 'and'}


In [3]:
#Build the set of unique words with length>4 across all docs and show its size
docsset=set()
for d in newdoc:
    for w in d:
        if len(w)>4:
            docsset.add(w)
len(docsset)

133

In [11]:
#Make one row of 0/1 flags per doc (all 5 docs, so the rows match the 5 doc_id values)
Rows=[]
for i in range(len(newdoc)):
    Rows.append([1 if col in newdoc[i] else 0 for col in docsset])
Rows

[[0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  1,
  0,
  0,
  0,
  0,
  1,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  1,
  0,
  0,
  1,
  1,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  1,
  0,
  0,
  1,
  0,
  0,
  0,
  0,
  0,
  1,
  0,
  1,
  0,
  0,
  0,
  1,
  0,
  0,
  0,
  0,
  0,
  0,
  1,
  1,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  1,
  1,
  1,
  0,
  0,
  1,
  0,
  0,
  0,
  0,
  1,
  0,
  0,
  1,
  0,
  0,
  0,
  0,
  0,
  0,
  1,
  0,
  1,
  1,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  1,
  0,
  0,
  1,
  1,
  1,
  0,
  1,
  0,
  0,
  0,
  0,
  0],
 [1,
  1,
  0,
  0,
  0,
  0,
  0,
  1,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  1,
  0,
  0,
  0,
  1,
  1,
  0,
  1,
  0,
  0,
  1,
  0,
  0,
  0,
  0,
  1,
  0,
  0,
  0,
  1,
  1,
  0,
  1,
  0,
  1,
  1,
  1,
  0,
  0,
  0,
  0,
  0,
  1,
  0,
  0,
  0,
  1,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  1,
  0,
  1,
  0,
  1,
  0,
  0,

In [14]:
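#Create dataframe with unique words from all docs as columns and 0/1 values indicating presence of words in doc
import pandas as pd
#Pass the columns as a list: a set has no defined order, and newer pandas versions reject one here
df=pd.DataFrame(Rows,columns=list(docsset))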
df.insert(loc=0, column='doc_id',value=['doc1.txt','doc2.txt','doc3.txt','doc4.txt','doc5.txt'])
df

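| | doc_id | learning | subject | exists | prescriptive | papers | making | Words | computers | world | ... | method | focuses | deep-learning | analogies | Word2Vec | speech | published | reproduced | these | output |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | doc1.txt | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 1 | doc2.txt | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2 | doc3.txt | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 3 | doc4.txt | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 1 |
| 4 | doc5.txt | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | ... | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 |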


In [22]:
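#Print the column of every word that appears in exactly 4 of the documents
#doc_id is skipped because it holds names rather than 0/1 flags
for j in df.columns.drop('doc_id'):
    if df[j].sum()==4:
        print(df[j])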

0    1
1    1
2    0
3    1
4    1
Name: Word2Vec, dtype: int64
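
The same lookup can also be written without an explicit loop by summing every word column at once. The sketch below uses a small made-up frame named `toy` in place of `df`, and `words_in_n_docs` is a hypothetical helper name rather than anything defined in this notebook.

```python
import pandas as pd

# Hypothetical 0/1 frame standing in for the notebook's df
toy = pd.DataFrame({
    "doc_id": ["a", "b", "c"],
    "alpha": [1, 1, 0],
    "beta": [1, 1, 1],
    "gamma": [0, 0, 1],
})

def words_in_n_docs(frame, n):
    """Return the words whose column sums to n, i.e. words present in exactly n documents."""
    counts = frame.drop(columns="doc_id").sum()
    return counts[counts == n].index.tolist()

print(words_in_n_docs(toy, 2))
# ['alpha']
```

Assuming `df` has the layout shown above, calling `words_in_n_docs(df, 4)` would return the matching words as a list instead of printing whole columns.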
