## BAG OF WORDS

In [1]:
# Import the required function
from sklearn.feature_extraction.text import CountVectorizer

annak = ['Happy families are all alike;', 'every unhappy family is unhappy in its own way']

# Build the vectorizer and fit it
anna_vect = CountVectorizer()
anna_vect.fit(annak)

# Create the bow representation
anna_bow = anna_vect.transform(annak)

# Print the bag-of-words result 
print(anna_bow.toarray())

[[1 1 1 0 1 0 1 0 0 0 0 0 0]
 [0 0 0 1 0 1 0 1 1 1 1 2 1]]
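
To make the columns of the array above easier to read, here is a minimal pure-Python sketch of the same counting. It assumes the vectorizer's default behaviour (lowercasing, tokens of two or more word characters, alphabetically sorted vocabulary). The names `token_pattern`, `tokenized`, `vocabulary` and `bow` are illustrative only, and this is not how scikit-learn implements the step internally.

```python
import re

docs = ['Happy families are all alike;',
        'every unhappy family is unhappy in its own way']

# Lowercase the text and keep tokens of two or more word characters
token_pattern = re.compile(r"\b\w\w+\b")
tokenized = [token_pattern.findall(doc.lower()) for doc in docs]

# The vocabulary is the sorted set of all tokens; its order defines the columns
vocabulary = sorted({token for tokens in tokenized for token in tokens})

# One row per document, one count per vocabulary entry
bow = [[tokens.count(term) for term in vocabulary] for tokens in tokenized]

print(vocabulary)
for row in bow:
    print(row)
```

Each column corresponds to one vocabulary entry, so the `2` in the second row is the count of "unhappy", which appears twice in the second sentence.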


## BOW using product reviews

Now we will apply BOW to a sample of Amazon product reviews.

In [3]:
import pandas as pd
reviews = pd.read_csv('amazon_reviews_sample.csv')

In [4]:
reviews.head()

Unnamed: 0.1,Unnamed: 0,score,review
0,0,1,Stuning even for the non-gamer: This sound tr...
1,1,1,The best soundtrack ever to anything.: I'm re...
2,2,1,Amazing!: This soundtrack is my favorite musi...
3,3,1,Excellent Soundtrack: I truly like this sound...
4,4,1,"Remember, Pull Your Jaw Off The Floor After H..."


In [11]:
reviews.shape

(10000, 3)

In [14]:
from sklearn.feature_extraction.text import CountVectorizer 

# Build the vectorizer, specify max features 
vect = CountVectorizer(max_features=100)
# Fit the vectorizer
vect.fit(reviews.review)

# Transform the review column
X_review = vect.transform(reviews.review)

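# Create the bow representation
# (get_feature_names_out() needs scikit-learn >= 1.0; get_feature_names() was removed in 1.2)
X_df = pd.DataFrame(X_review.toarray(), columns=vect.get_feature_names_out())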
X_df.head()

Unnamed: 0,about,after,all,also,am,an,and,any,are,as,...,what,when,which,who,will,with,work,would,you,your
0,0,0,1,0,0,0,2,0,0,0,...,0,0,0,2,0,1,0,2,0,1
1,0,0,0,0,0,0,3,1,1,0,...,0,0,0,0,0,0,0,1,1,0
2,0,0,3,0,0,1,4,0,1,1,...,0,0,1,1,0,0,1,1,2,0
3,0,0,0,0,0,0,9,0,1,0,...,0,0,0,0,0,0,0,0,0,0
4,0,1,0,0,0,0,3,0,1,0,...,0,0,0,0,0,0,0,0,3,1


We have successfully built a BOW vocabulary from the reviews and transformed the review column into numeric features!
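
As a rough illustration of what `max_features` does, the sketch below ranks terms by their total count across a corpus and keeps only the most frequent ones. The toy reviews and variable names are made up for this example, whitespace splitting stands in for real tokenization, and tie-breaking between equally frequent terms may differ from scikit-learn.

```python
from collections import Counter

# Hypothetical toy corpus, not taken from the reviews dataset
toy_reviews = ["good sound good music", "good music bad price", "good price"]
max_features = 3

# Total number of occurrences of every term across all reviews
term_counts = Counter(token for review in toy_reviews for token in review.split())

# Keep the most frequent terms, then sort them to get the column order
top_terms = [term for term, _ in term_counts.most_common(max_features)]
vocabulary = sorted(top_terms)

print(term_counts)
print(vocabulary)
```

The rare terms "sound" and "bad" are dropped from the vocabulary, which is why the DataFrame above contains mostly very common words such as "and" or "you".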

## Specifying token sequence length with BOW
By specifying the length of the token sequences (n-grams) that the vectorizer counts, we can better capture the context in which words appear, which can be very important.

For this we will use only the first 100 Amazon product reviews, because the representation for the full dataset can occupy a very large amount of memory.
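
To show what an n-gram range of one to two words produces, here is a small sketch that builds the unigrams and bigrams of a single tokenized sentence. The helper `word_ngrams` is a hypothetical name written for this illustration and is not part of scikit-learn.

```python
def word_ngrams(tokens, ngram_range=(1, 2)):
    """Return all n-grams of the token list for every n in ngram_range (inclusive)."""
    min_n, max_n = ngram_range
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens over the sentence
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams


tokens = "every unhappy family".split()
features = word_ngrams(tokens, ngram_range=(1, 2))
print(features)
```

Three words already yield five features, which gives a sense of how quickly the vocabulary grows once bigrams are included.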

In [16]:
from sklearn.feature_extraction.text import CountVectorizer 

# Build the vectorizer, specify token sequence and fit
vect = CountVectorizer(ngram_range=(1,2))
vect.fit(reviews[:100].review)

# Transform the review column
X_review = vect.transform(reviews[:100].review)

# Create the bow representation
# (get_feature_names_out() needs scikit-learn >= 1.0; get_feature_names() was removed in 1.2)
X_df = pd.DataFrame(X_review.toarray(), columns=vect.get_feature_names_out())
print(X_df.head())

   10  10 95  10 cups  100  100 years  110  110 years  114622  \
0   0      0        0    0          0    0          0       0   
1   0      0        0    0          0    0          0       0   
2   0      0        0    0          0    0          0       0   
3   0      0        0    0          0    0          0       0   
4   0      0        0    0          0    0          0       0   

   114622 excellent  12  ...  youtube video  yr  yr old  yucky  yucky thick  \
0                 0   0  ...              0   0       0      0            0   
1                 0   0  ...              0   0       0      0            0   
2                 0   0  ...              0   0       0      0            0   
3                 0   0  ...              0   0       0      0            0   
4                 0   0  ...              0   0       0      0            0   

   zelbessdisk  zelbessdisk three  zen  zen baseball  zen motorcycle  
0            0                  0    0             0           

We have built a numeric representation of the review column using uni- and bigrams!

## BOW with n-grams and vocabulary size
We will build a bag-of-words once more, using the Amazon product reviews dataset. This time we will limit the size of the vocabulary and specify the length of the token sequence.
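
The cell for this section also passes `max_df=500`. With an integer value, terms that occur in more than that many documents are ignored. The sketch below illustrates the idea on a hypothetical toy corpus, using single words and a much smaller threshold for brevity; all names in it are illustrative.

```python
from collections import Counter

# Hypothetical, already tokenized toy reviews
toy_reviews = [["great", "book"],
               ["great", "sound"],
               ["great", "book", "great"],
               ["bad", "sound"]]
max_df = 2

# Document frequency: the number of reviews in which each term appears at least once
document_frequency = Counter(term for review in toy_reviews for term in set(review))

# Keep only the terms that appear in at most max_df reviews
kept = sorted(term for term, df in document_frequency.items() if df <= max_df)

print(document_frequency)
print(kept)
```

Here "great" occurs in three of the four reviews and is removed, which is the same effect `max_df` has on very common bigrams in the full dataset.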

In [18]:
#Import the vectorizer
from sklearn.feature_extraction.text import CountVectorizer

# Build the vectorizer: keep at most 1000 bigrams, ignore those
# appearing in more than 500 reviews, and fit
vect = CountVectorizer(max_features=1000, ngram_range=(2, 2), max_df=500)
vect.fit(reviews.review)

# Transform the review
X_review = vect.transform(reviews.review)

# Create a DataFrame from the bow representation
# (get_feature_names_out() needs scikit-learn >= 1.0; get_feature_names() was removed in 1.2)
X_df = pd.DataFrame(X_review.toarray(), columns=vect.get_feature_names_out())
print(X_df.head())

   able to  about how  about it  about the  about this  after reading  \
0        0          0         0          0           0              0   
1        0          0         0          0           0              0   
2        0          0         0          0           0              0   
3        0          0         0          0           0              0   
4        0          0         0          0           0              0   

   after the  again and  ago and  agree with  ...  you think  you to  you ve  \
0          0          0        0           0  ...          0       0       0   
1          0          0        0           0  ...          0       0       0   
2          0          0        0           0  ...          0       0       2   
3          0          0        0           0  ...          0       0       0   
4          0          0        0           0  ...          0       0       1   

   you want  you will  you won  you would  your money  your own  your time  
0  

## Language detection of product reviews

In [29]:
from langdetect import detect_langs
languages = [] 

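# Loop over the rows of the dataset and append the detection result
# for each review (column index 2 holds the review text)
for row in range(len(reviews.review)):
    languages.append(detect_langs(reviews.iloc[row, 2]))

# Clean the list: str(lang) looks like '[en:0.99...]', so keep only the language code
languages = [str(lang).split(':')[0][1:] for lang in languages]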


In [32]:
# Assign the list to a new feature 
reviews['language'] = languages

reviews[reviews['language'] != 'en']

Unnamed: 0.1,Unnamed: 0,score,review,language
169,169,1,Awesume! BEST BLOCKS EVER!: THIS TOY WAS OUR ...,de
1249,1249,1,Il grande ritorno!: E' dai tempi del tour di ...,it
1259,1259,1,La reencarnación vista por un científico: El ...,es
1261,1261,1,Magnifico libro: Brian Weiss ha dejado una ma...,es
1639,1639,1,El libro mas completo que existe para nosotra...,es
1745,1745,1,Excelente!: Una excelente guía para todos aqu...,es
2316,2316,1,Nightwish is unique and rocks for eva: Moi to...,fr
2486,2486,1,Palabras de aliento para tu caminar con Dios:...,es
2760,2760,0,Completement nul: Fait sur commande et ennuya...,fr
2903,2903,1,fabuloso: mil gracias por el producto fabulos...,es
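
The cleaning step in the language detection cell relies on the string form of the detection result. The sketch below walks through that parsing with a hypothetical stand-in class, `FakeLanguage`, so that it runs without langdetect installed. It assumes that the real result objects print as `lang:prob` and expose a `lang` attribute, which is worth checking against the installed version.

```python
class FakeLanguage:
    """Hypothetical stand-in for one entry of a detect_langs result."""

    def __init__(self, lang, prob):
        self.lang = lang
        self.prob = prob

    def __repr__(self):
        # Assumed to mirror how langdetect prints a result entry
        return f"{self.lang}:{self.prob}"


# Pretend detection result for one review, most likely language first
detected = [FakeLanguage("es", 0.9999), FakeLanguage("it", 0.0001)]

as_text = str(detected)                       # '[es:0.9999, it:0.0001]'
code_from_string = as_text.split(':')[0][1:]  # '[es' without the leading bracket
code_from_attribute = detected[0].lang        # same code without string parsing

print(as_text)
print(code_from_string, code_from_attribute)
```

Reading the attribute of the first entry avoids depending on the exact string format, at the cost of assuming that attribute name.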
