# Bag of Words

Bag-of-words is an approach to transforming text into numeric form: each document becomes a vector that counts how often each vocabulary word occurs in it.

In [1]:
# Import the required packages
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer


In [2]:
annak = ['Happy families are all alike;', 'every unhappy family is unhappy in its own way']

In [3]:
# Build the vectorizer and fit it
anna_vect = CountVectorizer()
anna_vect.fit(annak)

# Create the bow representation
anna_bow = anna_vect.transform(annak)


In [4]:
# Print the bag-of-words result 
print(anna_bow.toarray())

[[1 1 1 0 1 0 1 0 0 0 0 0 0]
 [0 0 0 1 0 1 0 1 1 1 1 2 1]]


We have transformed the first sentence of Anna Karenina into an array that counts how often each word occurs. However, the output is not very readable: the columns are not labelled with the names of the features, i.e. the words they count.
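To make the columns readable, each one has to be paired with the word it counts. The sketch below rebuilds the same counts in plain Python, assuming a tokenizer that lowercases the text and keeps runs of two or more word characters. This mirrors the default behaviour of `CountVectorizer` but is not its implementation, and the helper names `tokenize` and `bag_of_words` are made up for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase, then keep runs of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())

def bag_of_words(docs):
    # The vocabulary is the sorted set of all tokens in the corpus
    tokenized = [tokenize(doc) for doc in docs]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    # One row per document, one column per vocabulary term
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts[term] for term in vocabulary])
    return vocabulary, rows

annak = ['Happy families are all alike;',
         'every unhappy family is unhappy in its own way']
vocabulary, rows = bag_of_words(annak)

# Print each row as word/count pairs, skipping the zeros
for row in rows:
    print({term: n for term, n in zip(vocabulary, row) if n})
```

With the fitted vectorizer itself, `anna_vect.get_feature_names_out()` (scikit-learn 1.0 and later) returns the column names in the same order.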

In [5]:
imdb = "E:/Education/NLP/IMDB Dataset.csv"
reviews = pd.read_csv(imdb)

In [6]:
# Build the vectorizer, specify max features 
vect = CountVectorizer(max_features=100)

# Fit the vectorizer
vect.fit(reviews.review)

# Transform the review column
X_review = vect.transform(reviews.review)


In [7]:
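# Create the bow representation as a DataFrame with the words as column names
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
X_df = pd.DataFrame(X_review.toarray(), columns=vect.get_feature_names_out())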
print(X_df.head())

   about  after  all  also  an  and  any  are  as  at  ...  well  were  what  \
0      1      1    1     0   1    6    0    2   4   0  ...     1     0     2   
1      1      0    2     0   0    7    0    2   0   0  ...     3     0     0   
2      0      0    0     0   0    4    0    1   0   1  ...     1     0     0   
3      0      0    3     0   0    4    0    2   2   0  ...     1     0     0   
4      2      0    2     0   0    5    0    1   1   0  ...     0     0     1   

   when  which  who  will  with  would  you  
0     0      1    2     0     5      1    3  
1     0      1    0     0     3      0    1  
2     1      0    0     0     2      0    0  
3     1      1    0     0     3      0    2  
4     0      1    0     0     1      0    0  

[5 rows x 100 columns]


In the examples above, the bag-of-words representation does not keep the order of the words, so context is lost. To retain some of that context, we use n-grams: sequences of n consecutive tokens.
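As a rough illustration of what n-grams add, the sketch below lists the unigrams and bigrams of a short phrase, which is the kind of feature set that `ngram_range=(1, 2)` asks for. `ngrams` is a hypothetical helper, and splitting on whitespace is a simplifying assumption.

```python
def ngrams(tokens, n_min=1, n_max=2):
    # Collect every run of n consecutive tokens, for n from n_min to n_max
    grams = []
    for n in range(n_min, n_max + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

tokens = "every unhappy family is unhappy".split()
features = ngrams(tokens)
print(features)
```

Bigrams such as `unhappy family` and `is unhappy` keep a little of the word order that single-word counts discard.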

In [8]:
#from sklearn.feature_extraction.text import CountVectorizer 

# Build the vectorizer, specify the n-gram range (unigrams and bigrams) and fit
vect = CountVectorizer(max_features=300, ngram_range=(1, 2), max_df=200)
vect.fit(reviews.review)


CountVectorizer(max_df=200, max_features=300, ngram_range=(1, 2))
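The two limits used here are easy to misread. An integer `max_df=200` drops every term that occurs in more than 200 reviews, and `max_features=300` then keeps the 300 most frequent of the remaining terms. The sketch below imitates that selection on a toy corpus. `select_vocabulary` is a hypothetical helper, whitespace tokenization is an assumption, and ties are broken alphabetically here, which scikit-learn does not promise.

```python
from collections import Counter

def select_vocabulary(docs, max_df, max_features):
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Term frequency: how often does each term occur in the whole corpus?
    term_freq = Counter(term for tokens in tokenized for term in tokens)
    # Drop terms that occur in more than max_df documents
    kept = [term for term in term_freq if doc_freq[term] <= max_df]
    # Keep the max_features most frequent survivors (ties: alphabetical)
    kept.sort(key=lambda term: (-term_freq[term], term))
    return sorted(kept[:max_features])

docs = ["the film was great",
        "the plot was thin thin thin",
        "the acting was great"]
vocab = select_vocabulary(docs, max_df=2, max_features=2)
print(vocab)
```

In the toy corpus, `the` and `was` appear in all three documents and are removed by the cap, so only less widespread terms can enter the vocabulary.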

In [9]:
# Transform the review column
X_review = vect.transform(reviews.review)

In [10]:
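# Create the bow representation as a DataFrame with the n-grams as column names
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
X_df = pd.DataFrame(X_review.toarray(), columns=vect.get_feature_names_out())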

In [11]:
print(X_df.head())

   10 for  1930  2007  2nd  abc  agents  airplane  airport  alexander  altman  \
0       0     0     0    0    0       0         0        0          0       0   
1       0     0     0    0    0       0         0        0          0       0   
2       0     0     0    0    0       0         0        0          0       0   
3       0     0     0    0    0       0         0        0          0       0   
4       0     0     0    0    0       0         0        0          0       0   

   ...  warrior  waters  wax  welles  werewolf  willis  woody  wrestling  ya  \
0  ...        0       0    0       0         0       0      0          0   0   
1  ...        0       0    0       0         0       0      0          0   0   
2  ...        0       0    0       0         0       0      2          0   0   
3  ...        0       0    0       0         0       0      0          0   0   
4  ...        0       0    0       0         0       0      0          0   0   

   yellow  
0       0  
1       

## Tokenize the string

Create a new feature for the length of each review, measured in tokens.

In [12]:
# Import the needed packages
from nltk import word_tokenize

# Tokenize each item in the review column 
word_tokens = [word_tokenize(review) for review in reviews.review]

# Print out the first item of the word_tokens list
print(word_tokens[0])

['One', 'of', 'the', 'other', 'reviewers', 'has', 'mentioned', 'that', 'after', 'watching', 'just', '1', 'Oz', 'episode', 'you', "'ll", 'be', 'hooked', '.', 'They', 'are', 'right', ',', 'as', 'this', 'is', 'exactly', 'what', 'happened', 'with', 'me.', '<', 'br', '/', '>', '<', 'br', '/', '>', 'The', 'first', 'thing', 'that', 'struck', 'me', 'about', 'Oz', 'was', 'its', 'brutality', 'and', 'unflinching', 'scenes', 'of', 'violence', ',', 'which', 'set', 'in', 'right', 'from', 'the', 'word', 'GO', '.', 'Trust', 'me', ',', 'this', 'is', 'not', 'a', 'show', 'for', 'the', 'faint', 'hearted', 'or', 'timid', '.', 'This', 'show', 'pulls', 'no', 'punches', 'with', 'regards', 'to', 'drugs', ',', 'sex', 'or', 'violence', '.', 'Its', 'is', 'hardcore', ',', 'in', 'the', 'classic', 'use', 'of', 'the', 'word.', '<', 'br', '/', '>', '<', 'br', '/', '>', 'It', 'is', 'called', 'OZ', 'as', 'that', 'is', 'the', 'nickname', 'given', 'to', 'the', 'Oswald', 'Maximum', 'Security', 'State', 'Penitentary', '.', 

In [13]:
# Count the tokens in each tokenized review
len_tokens = [len(tokens) for tokens in word_tokens]

# Create a new feature for the length of each review
reviews['n_words'] = len_tokens

In [14]:
print(reviews)

                                                  review sentiment  n_words
0      One of the other reviewers has mentioned that ...  positive      380
1      A wonderful little production. <br /><br />The...  positive      201
2      I thought this was a wonderful way to spend ti...  positive      205
3      Basically there's a family where a little boy ...  negative      175
4      Petter Mattei's "Love in the Time of Money" is...  positive      283
...                                                  ...       ...      ...
49995  I thought this movie did a down right good job...  positive      241
49996  Bad plot, bad dialogue, bad acting, idiotic di...  negative      138
49997  I am a Catholic taught in parochial elementary...  negative      271
49998  I'm going to have to disagree with the previou...  negative      240
49999  No one expects the Star Trek movies to be high...  negative      150

[50000 rows x 3 columns]
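Note that `n_words` counts every token, so punctuation marks and the pieces of `<br />` tags (visible in the token list above) are included. If a count of actual words is wanted, one option is to strip the tags and count only alphabetic tokens. The sketch below does this with a simple regular-expression tokenizer, which is an assumption standing in for `word_tokenize`; `count_words` is a hypothetical helper.

```python
import re

def count_words(text):
    # Replace HTML line-break tags such as <br /> with a space
    text = re.sub(r"<br\s*/?>", " ", text)
    # Split into runs of word characters or single punctuation marks
    tokens = re.findall(r"\w+|[^\w\s]", text)
    # Count only the purely alphabetic tokens
    return sum(token.isalpha() for token in tokens)

sample = "They are right, as this is exactly what happened with me.<br /><br />The first thing"
print(count_words(sample))
```

Applied to the dataset, this could be written as `reviews['review'].apply(count_words)`.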


Build a feature for the language of each review by running language detection on its text.

In [None]:
from langdetect import detect_langs
languages = [] 

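# Loop over the review texts and detect the candidate languages of each one
# (the text is in the 'review' column; column position 1 holds 'sentiment')
for review in reviews['review']:
    languages.append(detect_langs(review))

# Keep only the first language code, e.g. '[en:0.99]' -> 'en'
languages = [str(lang).split(':')[0][1:] for lang in languages]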

# Assign the list to a new feature 
reviews['language'] = languages

print(reviews.language.unique())
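The cleaning step relies on how a detection result prints: a list of candidates that looks like `[en:0.99]`, i.e. language code, colon, probability. The sketch below applies the same string operations to hand-written sample strings, so it runs without `langdetect`. The probabilities are made up for illustration and `top_language` is a hypothetical helper.

```python
def top_language(result_text):
    # '[en:0.99, fr:0.01]' -> '[en' -> 'en'
    return result_text.split(':')[0][1:]

# Hand-written stand-ins for str(detect_langs(text)); the values are made up
samples = ["[en:0.9999]", "[es:0.71, pt:0.28]"]
codes = [top_language(s) for s in samples]
print(codes)
```

Only the first candidate survives the split, which gives the most likely language, assuming the candidates are listed from most to least probable.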