# Bag of Words

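Bag-of-words is an approach for transforming text into numeric form: each document is represented by the counts of the words it contains, and word order is ignored.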

In [1]:
# Import the required libraries
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer


In [2]:
annak = ['Happy families are all alike;', 'every unhappy family is unhappy in its own way']

In [3]:
# Build the vectorizer and fit it
anna_vect = CountVectorizer()
anna_vect.fit(annak)

# Create the bow representation
anna_bow = anna_vect.transform(annak)


In [4]:
# Print the bag-of-words result 
print(anna_bow.toarray())

[[1 1 1 0 1 0 1 0 0 0 0 0 0]
 [0 0 0 1 0 1 0 1 1 1 1 2 1]]


We have transformed the first sentence of Anna Karenina into an array that counts how often each word occurs. However, the output is not very readable: the feature names (the word each column stands for) are still missing.

In [7]:
imdb = "E:/Education/NLP/IMDB Dataset.csv"
reviews = pd.read_csv(imdb)

In [8]:
# Build the vectorizer, specify max features 
vect = CountVectorizer(max_features=100)

# Fit the vectorizer
vect.fit(reviews.review)

# Transform the review column
X_review = vect.transform(reviews.review)


In [9]:
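# Put the bag-of-words counts in a DataFrame, one named column per word
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
X_df = pd.DataFrame(X_review.toarray(), columns=vect.get_feature_names_out())
print(X_df.head())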

   about  after  all  also  an  and  any  are  as  at  ...  well  were  what  \
0      1      1    1     0   1    6    0    2   4   0  ...     1     0     2   
1      1      0    2     0   0    7    0    2   0   0  ...     3     0     0   
2      0      0    0     0   0    4    0    1   0   1  ...     1     0     0   
3      0      0    3     0   0    4    0    2   2   0  ...     1     0     0   
4      2      0    2     0   0    5    0    1   1   0  ...     0     0     1   

   when  which  who  will  with  would  you  
0     0      1    2     0     5      1    3  
1     0      1    0     0     3      0    1  
2     1      0    0     0     2      0    0  
3     1      1    0     0     3      0    2  
4     0      1    0     0     1      0    0  

[5 rows x 100 columns]


In the examples above, bag-of-words does not preserve the order of the words, so context is lost. To retain some of that context we can count n-grams, that is, sequences of n consecutive words.
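The loss of word order is easy to see on a toy pair of sentences. This is a minimal sketch; the two sentences and the variable names are made up for illustration and are not taken from the IMDB data.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Same words, opposite meaning
docs = ['the dog bit the man', 'the man bit the dog']

# Single words only: both sentences get identical count vectors
unigram_counts = CountVectorizer(ngram_range=(1, 1)).fit_transform(docs).toarray()

# Pairs of consecutive words only: the vectors now differ
bigram_counts = CountVectorizer(ngram_range=(2, 2)).fit_transform(docs).toarray()

print((unigram_counts[0] == unigram_counts[1]).all())
print((bigram_counts[0] == bigram_counts[1]).all())
```

With single words the two rows cannot be told apart, while bigrams such as "dog bit" and "man bit" separate them. Setting `ngram_range=(1, 2)` keeps both the single words and the pairs.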

In [18]:
#from sklearn.feature_extraction.text import CountVectorizer 

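# Build the vectorizer: count unigrams and bigrams (ngram_range=(1, 2)),
# skip terms found in more than 200 reviews (max_df=200), keep at most 300 features
vect = CountVectorizer(max_features=300, ngram_range=(1, 2), max_df=200)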
vect.fit(reviews.review)


CountVectorizer(max_df=200, max_features=300, ngram_range=(1, 2))
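For reference, `max_features` keeps only the terms with the highest total count across the corpus, and an integer `max_df` drops terms that occur in more than that many documents. Here is a minimal sketch on a made-up three-document corpus; the sentences and variable names are illustrative assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = ['the cat sat', 'the cat ran', 'the dog']

# max_features=2 keeps the two terms with the highest total count
top_terms = CountVectorizer(max_features=2).fit(toy_docs)
kept_by_max_features = sorted(top_terms.vocabulary_)

# max_df=2 (an integer) drops terms found in more than 2 documents
rare_terms = CountVectorizer(max_df=2).fit(toy_docs)
kept_by_max_df = sorted(rare_terms.vocabulary_)

print(kept_by_max_features)
print(kept_by_max_df)
```

In the toy corpus "the" appears in all three documents, so `max_df=2` removes it. By the same logic, `max_df=200` in the call above removes any term found in more than 200 reviews, which would explain why very common words no longer appear among the features.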

In [19]:
# Transform the review column
X_review = vect.transform(reviews.review)

In [20]:
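# Put the n-gram counts in a DataFrame, one named column per term
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
X_df = pd.DataFrame(X_review.toarray(), columns=vect.get_feature_names_out())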

In [21]:
print(X_df.head())

   10 for  1930  2007  2nd  abc  agents  airplane  airport  alexander  altman  \
0       0     0     0    0    0       0         0        0          0       0   
1       0     0     0    0    0       0         0        0          0       0   
2       0     0     0    0    0       0         0        0          0       0   
3       0     0     0    0    0       0         0        0          0       0   
4       0     0     0    0    0       0         0        0          0       0   

   ...  warrior  waters  wax  welles  werewolf  willis  woody  wrestling  ya  \
0  ...        0       0    0       0         0       0      0          0   0   
1  ...        0       0    0       0         0       0      0          0   0   
2  ...        0       0    0       0         0       0      2          0   0   
3  ...        0       0    0       0         0       0      0          0   0   
4  ...        0       0    0       0         0       0      0          0   0   

   yellow  
0       0  
1       