### **In this study, I apply feature extraction methods to the Amazon Reviews dataset**



---



In [1]:
import pandas as pd
import numpy as np
import re

In [2]:
df = pd.read_csv('output/Amazon_reviews_categorised.csv')
df.head()

Unnamed: 0,marketplace,customer_id,review_id,product_id,product_parent,product_title,product_category,star_rating,helpful_votes,total_votes,vine,verified_purchase,review_headline,review_body,review_date,review,polarities,sentiment_score,review_category
0,US,53096575,R2I0T26SV0ELPP,316219266,665813273,The Everything Store: Jeff Bezos and the Age o...,Books,1,4054,4756,N,Y,I wanted to like this book,"In the first chapter, the book sets the stage ...",2013-11-04,I wanted to like this book In the first chapte...,"{'neg': 0.055, 'neu': 0.78, 'pos': 0.164, 'com...",positive,product
1,US,12637794,RZGFXZ2HYHHRA,1780671067,509449366,Secret Garden: An Inky Treasure Hunt and Color...,Books,5,3416,3449,N,Y,CHOOSE THE RIGHT COLORING PENCILS,This is the list of items I use for coloring a...,2015-05-02,CHOOSE THE RIGHT COLORING PENCILS This is th...,"{'neg': 0.036, 'neu': 0.882, 'pos': 0.082, 'co...",positive,product
2,US,30381644,R25ITJRIMQW92F,805096663,647864157,Killing Kennedy: The End of Camelot,Books,1,2893,3589,N,Y,SORRY BILL,"I was working in downtown Washington, D.C. on ...",2012-10-15,SORRY BILL I was working in downtown Washingto...,"{'neg': 0.17, 'neu': 0.779, 'pos': 0.052, 'com...",negetive,product
3,US,26445230,R2I37K23W0YCC9,1623363586,465642569,Thug Kitchen: The Official Cookbook: Eat Like ...,Books,5,2746,2841,N,Y,"Great Taste, A Little Complicated","I'll start by saying, the food in this book is...",2014-11-02,"Great Taste, A Little Complicated I'll start b...","{'neg': 0.0, 'neu': 0.841, 'pos': 0.159, 'comp...",positive,product
4,US,52830381,R2ATIJCX4DJWBB,1400069289,136914857,The Power of Habit: Why We Do What We Do in Li...,Books,1,2551,3021,N,Y,"One or two chapters of interest, the rest filler",The first two chapters weren't bad. They made ...,2012-05-04,"One or two chapters of interest, the rest fill...","{'neg': 0.048, 'neu': 0.827, 'pos': 0.126, 'co...",positive,product


**Let's convert the 'review_date' column to the pandas datetime format**

In [3]:
df['review_date'] = pd.to_datetime(df['review_date'])

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1143230 entries, 0 to 1143229
Data columns (total 19 columns):
 #   Column             Non-Null Count    Dtype         
---  ------             --------------    -----         
 0   marketplace        1143230 non-null  object        
 1   customer_id        1143230 non-null  int64         
 2   review_id          1143230 non-null  object        
 3   product_id         1143230 non-null  object        
 4   product_parent     1143230 non-null  int64         
 5   product_title      1143230 non-null  object        
 6   product_category   1143230 non-null  object        
 7   star_rating        1143230 non-null  int64         
 8   helpful_votes      1143230 non-null  int64         
 9   total_votes        1143230 non-null  int64         
 10  vine               1143230 non-null  object        
 11  verified_purchase  1143230 non-null  object        
 12  review_headline    1143230 non-null  object        
 13  review_body        1143230 

In [5]:
df.describe().round(1)

Unnamed: 0,customer_id,product_parent,star_rating,helpful_votes,total_votes
count,1143230.0,1143230.0,1143230.0,1143230.0,1143230.0
mean,29796857.7,500246947.5,3.0,5.1,7.0
std,15315084.8,288002921.2,1.4,19.2,22.0
min,10135.0,5710.0,1.0,0.0,0.0
25%,15775476.5,251905084.0,2.0,1.0,1.0
50%,29348325.0,501126142.0,3.0,2.0,3.0
75%,44172192.8,748573501.0,4.0,5.0,7.0
max,53096584.0,999997596.0,5.0,4054.0,4756.0


# Basic Feature Extraction - 1

Originally, I tried to clean the data first. Then I realized that cleaning removes some characters (punctuation, uppercase letters, hashtags, emoticons) that are useful features in their own right. Therefore, feature extraction is split into two parts. Here, I extract the features that can't be extracted after data cleaning.
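A minimal sketch of why the order matters. It assumes a hypothetical `basic_clean` step that lowercases the text and drops every non-letter character; the cleaning actually applied later may differ. The sample sentence is made up for illustration.

```python
import re
import string


def basic_clean(text):
    """Hypothetical cleaning step: lowercase, then keep only letters and whitespace."""
    return re.sub(r"[^a-z\s]", "", text.lower())


def count_punct(text):
    """Count ASCII punctuation characters."""
    return sum(1 for char in text if char in string.punctuation)


def count_upper_words(text):
    """Count whitespace-separated words written entirely in uppercase."""
    return sum(1 for word in text.split() if word.isupper())


raw = "SORRY BILL, I paid $20 for this :-( #disappointed"
cleaned = basic_clean(raw)

# Both signals are present in the raw text and gone after cleaning.
print(count_punct(raw), count_punct(cleaned))
print(count_upper_words(raw), count_upper_words(cleaned))
```

Once the text has gone through a step like this, the punctuation and uppercase counts can no longer be recovered, which is why they are computed on the raw `review` column.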

### 1) Number of stopwords

In [6]:
!pip install -q wordcloud
import wordcloud
from nltk.corpus import stopwords
import nltk
import string
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
stop = stopwords.words('english')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Yogesh\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Yogesh\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Yogesh\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Yogesh\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [7]:
df['stopwords'] = df['review'].apply(lambda text: len([word for word in text.split() if word in stop]))

In [8]:
df[['review','stopwords']].head()

Unnamed: 0,review,stopwords
0,I wanted to like this book In the first chapte...,505
1,CHOOSE THE RIGHT COLORING PENCILS This is th...,95
2,SORRY BILL I was working in downtown Washingto...,680
3,"Great Taste, A Little Complicated I'll start b...",100
4,"One or two chapters of interest, the rest fill...",118


In [9]:
df.stopwords.loc[df.stopwords != 0].count()

1118818
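One thing to keep in mind: the NLTK English stopword list is all lowercase, so the comparison above is case-sensitive and words such as "I" or "In" are not counted. The sketch below shows a possible case-insensitive variant. `STOP` is a tiny stand-in set chosen for illustration (an assumption, not the NLTK list), and `count_stopwords` is a hypothetical helper name.

```python
# Stand-in stopword set for illustration only; in the notebook this
# could be set(stopwords.words('english')) instead.
STOP = {"i", "to", "this", "in", "the", "a", "of", "is"}


def count_stopwords(text, stop=STOP, ignore_case=False):
    """Count whitespace-separated words that appear in the stopword set."""
    words = text.split()
    if ignore_case:
        words = [word.lower() for word in words]
    return sum(1 for word in words if word in stop)


sample = "I wanted to like this book In the first chapter"
print(count_stopwords(sample))                    # case-sensitive, as in the cell above
print(count_stopwords(sample, ignore_case=True))  # also counts "I" and "In"
```

Using a `set` rather than a list also makes each membership test faster, which can matter with over a million reviews.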

### 2) Number of Punctuation

In [10]:
def count_punct(text):
    # Count the characters that appear in string.punctuation (ASCII punctuation)
    return sum(1 for char in text if char in string.punctuation)

df['punctuation'] = df['review'].apply(count_punct)

In [11]:
df[['review','punctuation']].head()

Unnamed: 0,review,punctuation
0,I wanted to like this book In the first chapte...,151
1,CHOOSE THE RIGHT COLORING PENCILS This is th...,33
2,SORRY BILL I was working in downtown Washingto...,421
3,"Great Taste, A Little Complicated I'll start b...",67
4,"One or two chapters of interest, the rest fill...",87


In [12]:
df.punctuation.loc[df.punctuation != 0].count()

1112075

### 3) Number of hashtags

Another feature we can extract from a review is the number of hashtags it contains, counted here as whitespace-separated words that start with `#`. This can add extra information from our text data.

In [13]:
df['hastags'] = df['review'].apply(lambda text: len([word for word in text.split() if word.startswith('#')]))
df[['review','hastags']].head()

Unnamed: 0,review,hastags
0,I wanted to like this book In the first chapte...,0
1,CHOOSE THE RIGHT COLORING PENCILS This is th...,1
2,SORRY BILL I was working in downtown Washingto...,0
3,"Great Taste, A Little Complicated I'll start b...",0
4,"One or two chapters of interest, the rest fill...",0


In [14]:
df.hastags.loc[df.hastags != 0].count()

4314
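The `startswith('#')` check misses hashtags that follow an opening bracket or quote, and it does not count mentions. A possible regex-based alternative is sketched below. The patterns and the helper names `count_hashtags` and `count_mentions` are assumptions for illustration, not part of the original notebook.

```python
import re

# '#' or '@' followed by word characters, not preceded by a word character
# (so the '@' inside an email address is not treated as a mention).
HASHTAG_RE = re.compile(r"(?<!\w)#\w+")
MENTION_RE = re.compile(r"(?<!\w)@\w+")


def count_hashtags(text):
    return len(HASHTAG_RE.findall(text))


def count_mentions(text):
    return len(MENTION_RE.findall(text))


def count_hashtags_split(text):
    """Same logic as the notebook cell: words that start with '#'."""
    return len([word for word in text.split() if word.startswith('#')])


sample = "Great set (#coloring) for my niece, thanks @artstore! #1 pick"
print(count_hashtags_split(sample))
print(count_hashtags(sample))
print(count_mentions(sample))
```

Both approaches count tokens like `#1`, which in product reviews is often a ranking rather than a hashtag, so the feature is best read as "words marked with #".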

### 4) Number of numerics
Counting the numeric tokens present in the reviews can be useful. At the very least, it doesn't hurt to have such data!

In [15]:
df['numerics'] = df['review'].apply(lambda text: len([word for word in text.split() if word.isdigit()]))
df[['review','numerics']].head()

Unnamed: 0,review,numerics
0,I wanted to like this book In the first chapte...,3
1,CHOOSE THE RIGHT COLORING PENCILS This is th...,6
2,SORRY BILL I was working in downtown Washingto...,13
3,"Great Taste, A Little Complicated I'll start b...",0
4,"One or two chapters of interest, the rest fill...",0


In [16]:
df.numerics.loc[df.numerics != 0].count()

210534
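Because `str.isdigit()` is applied to whole whitespace-separated tokens, numbers with attached punctuation or decimals (for example `2015,` or `$19.99`) are not counted. The sketch below shows one possible regex-based alternative. The pattern and the name `count_numbers` are assumptions for illustration, and the pattern is deliberately simple rather than a complete number parser.

```python
import re

# A run of digits, optionally with '.'/',' separated digit groups,
# that is not glued to letters or other digits on either side.
NUMBER_RE = re.compile(r"(?<![A-Za-z0-9])\d+(?:[.,]\d+)*(?![A-Za-z0-9])")


def count_numbers(text):
    return len(NUMBER_RE.findall(text))


def count_digit_tokens(text):
    """Same logic as the notebook cell: tokens made up only of digits."""
    return len([word for word in text.split() if word.isdigit()])


sample = "Bought 2 packs for $19.99 in 2015, used 3."
print(count_digit_tokens(sample))
print(count_numbers(sample))
```

Which definition is more useful depends on the downstream model; the token-based count is stricter, the regex-based count is more permissive.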

### 5) Number of Uppercase words
Anger or rage is often expressed by writing words in UPPERCASE, so it is worth counting those words.

In [17]:
df['upper'] = df['review'].apply(lambda text: len([word for word in text.split() if word.isupper()]))
df[['review','upper']].head()

Unnamed: 0,review,upper
0,I wanted to like this book In the first chapte...,22
1,CHOOSE THE RIGHT COLORING PENCILS This is th...,15
2,SORRY BILL I was working in downtown Washingto...,37
3,"Great Taste, A Little Complicated I'll start b...",7
4,"One or two chapters of interest, the rest fill...",10


In [18]:
df.upper.loc[df.upper != 0].count()

859296
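Note that `str.isupper()` is also true for single-letter words such as "I" and "A", which likely inflates these counts. A possible refinement, sketched below, only counts uppercase words with at least two letters. The helper name `count_shouted_words` and the `min_letters` threshold are assumptions for illustration.

```python
def count_upper_words(text):
    """Same logic as the notebook cell: every word for which isupper() is true."""
    return sum(1 for word in text.split() if word.isupper())


def count_shouted_words(text, min_letters=2):
    """Count uppercase words that contain at least `min_letters` letters."""
    return sum(
        1
        for word in text.split()
        if word.isupper() and sum(char.isalpha() for char in word) >= min_letters
    )


sample = "I bought A LOT of these and I LOVE them"
print(count_upper_words(sample))
print(count_shouted_words(sample))
```

The threshold still lets acronyms such as "US" or "DVD" through, so this is only a rough proxy for emphasis.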

### 6) Number of Emojis
Emojis can be an indicator of emotions related to customer satisfaction.

In [19]:
!pip install emot

import emot 
emot_obj = emot.core.emot() 

df['emoji'] = df['review'].apply(lambda x: len(emot_obj.emoji(x)["value"]))
df[['review','emoji']].head()



Unnamed: 0,review,emoji
0,I wanted to like this book In the first chapte...,0
1,CHOOSE THE RIGHT COLORING PENCILS This is th...,0
2,SORRY BILL I was working in downtown Washingto...,0
3,"Great Taste, A Little Complicated I'll start b...",0
4,"One or two chapters of interest, the rest fill...",0


In [20]:
df.emoji.loc[df.emoji != 0].count()

827

### 7) Number of Emoticons

***What is the difference between emoji and emoticons?***

*   :-) is an emoticon
*   😜 is an emoji
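To make the distinction concrete, here is a rough standard-library sketch: emoticons are sequences of ordinary ASCII characters, while emojis are dedicated Unicode code points. The small regex and the code point ranges are simplifying assumptions for illustration; they are not how the `emot` library detects either, and they cover only common cases.

```python
import re

# A few common ASCII emoticons such as :-) ;) :( =D :P
EMOTICON_RE = re.compile(r"[:;=][-']?[)(DPp]")

# Approximate emoji blocks (assumption): miscellaneous symbols and dingbats,
# plus the main pictograph and emoticon blocks.
EMOJI_RANGES = [(0x2600, 0x27BF), (0x1F300, 0x1FAFF)]


def count_emoticons(text):
    return len(EMOTICON_RE.findall(text))


def count_emojis(text):
    return sum(
        1
        for char in text
        if any(low <= ord(char) <= high for low, high in EMOJI_RANGES)
    )


sample = "Great book :-) my kids loved it 😜 ;)"
print(count_emoticons(sample))
print(count_emojis(sample))
```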

In [21]:
import emot 
emot_obj = emot.core.emot() 

df['emoticon'] = df['review'].apply(lambda x: len(emot_obj.emoticons(x)["value"]))
df[['review','emoticon']].head()

Unnamed: 0,review,emoticon
0,I wanted to like this book In the first chapte...,0
1,CHOOSE THE RIGHT COLORING PENCILS This is th...,0
2,SORRY BILL I was working in downtown Washingto...,2
3,"Great Taste, A Little Complicated I'll start b...",1
4,"One or two chapters of interest, the rest fill...",0


In [22]:
df.emoticon.loc[df.emoticon != 0].count()

48216



---



**Now, let's save this extracted data as a CSV file**

In [23]:
df.to_csv('output/Amazon_reviews_processed_1.csv', index=False)
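One caveat when reloading this file: CSV stores dates as plain text, so the datetime dtype of `review_date` is not preserved. The sketch below illustrates this with a small made-up frame written to an in-memory buffer (the two rows are illustrative, not taken from the dataset). `parse_dates` is a standard `pandas.read_csv` parameter.

```python
import io

import pandas as pd

# Tiny illustrative frame with a datetime column.
df_demo = pd.DataFrame({
    "review": ["Great book", "Not for me"],
    "review_date": pd.to_datetime(["2013-11-04", "2015-05-02"]),
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
df_demo.to_csv(buffer, index=False)

# Reading without parse_dates leaves the column as text.
buffer.seek(0)
plain = pd.read_csv(buffer)

# Reading with parse_dates restores the datetime dtype.
buffer.seek(0)
parsed = pd.read_csv(buffer, parse_dates=["review_date"])

print(plain["review_date"].dtype)
print(parsed["review_date"].dtype)
```

When the second part of the feature extraction reads `Amazon_reviews_processed_1.csv`, passing `parse_dates=['review_date']` would avoid repeating the conversion step.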