## Data cleaning goals
* Exploration of Amazon review data
* Data cleaning and pre-processing for text analytics & sentiment analysis
* Credit: https://github.com/EnesGokceDS/Amazon_Reviews_NLP_Capstone_Project/blob/master/1_Data_cleaning_and_feature_extraction.ipynb

## Part 1: Load data
* Load libraries
* Load data
* Quick exploration of data

In [11]:
# Import libraries
import pandas as pd
import nltk
from textblob import TextBlob

# Download the WordNet corpus
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\domen\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [18]:
# Get the Amazon reviews data, store as pandas df
df = pd.read_csv("dog-cameras-raw.csv")  # read_csv already returns a DataFrame
df.head(3)  # show first 3 rows

Unnamed: 0,product,date,title,rating,body
0,"Furbo Dog Camera: Treat Tossing, Full HD WiFi ...","Reviewed in Canada on December 14, 2018",Glorified Webcam,2.0,I bought the Furbo as a birthday gift for my b...
1,"Furbo Dog Camera: Treat Tossing, Full HD WiFi ...","Reviewed in Canada on August 15, 2018",Recieved Used Item!,1.0,Extremely disappointed. I recieved a Furbo tha...
2,"Furbo Dog Camera: Treat Tossing, Full HD WiFi ...","Reviewed in Canada on May 26, 2018",Furbo made miracle happen for me,5.0,I’ve been using furbo for 2.5 weeks now. It ha...


In [19]:
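# Rename body (review text) to text, then cast to string.
# Note: .astype(str) applies to every column, so rating also becomes a string
# and any missing values become the literal string 'nan'.
df = df.rename(columns = {'body': 'text'}).astype(str)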

# Get info from df
df.info()

# Describe the df
df.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 490 entries, 0 to 489
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   product  490 non-null    object
 1   date     490 non-null    object
 2   title    490 non-null    object
 3   rating   490 non-null    object
 4   text     490 non-null    object
dtypes: object(5)
memory usage: 19.3+ KB


Unnamed: 0,product,date,title,rating,text
count,490,490,490,490.0,490
unique,1,355,442,5.0,490
top,"Furbo Dog Camera: Treat Tossing, Full HD WiFi ...","Reviewed in Canada on January 9, 2020",Five Stars,5.0,The camera gives an incredibly clear and wide ...
freq,490,7,15,347.0,1


In [13]:
# Exploring missing values (run before the .astype(str) cast, which turns NaN into the string 'nan')
null_values = df.isna().sum()
null_values = pd.DataFrame(null_values,columns=['null'])
sum_tot = len(df)
null_values['percent'] = null_values['null']/sum_tot*100
round(null_values,3).sort_values('percent',ascending=False)

Unnamed: 0,null,percent
product,0,0.0
date,0,0.0
title,0,0.0
rating,0,0.0
body,0,0.0


## Part 2: Feature Extraction (before text cleaning)
* Count of Stopwords
* Count of Punctuation
* Count of hashtags (tokens starting with `#`)
* Count of numeric tokens
* Count of Emojis & Emoticons

In [14]:
# Load libraries
!pip install -q wordcloud
import wordcloud
from nltk.corpus import stopwords
import nltk
import string
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
stop = stopwords.words('english')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\domen\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\domen\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\domen\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\domen\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [17]:
# Create stopword count feature (case-sensitive match against NLTK's lowercase stopword list)
df['stopword_ct'] = df['text'].apply(lambda text: len([word for word in text.split() if word in stop]))

# See 3 rows
df[['text','stopword_ct']].head(3)

Unnamed: 0,text,stopword_ct
0,I bought the Furbo as a birthday gift for my b...,121
1,Extremely disappointed. I recieved a Furbo tha...,25
2,I’ve been using furbo for 2.5 weeks now. It ha...,358


In [22]:
# Create punctuation count feature
def count_punct(text):
    # Count characters that appear in string.punctuation (ASCII punctuation only)
    return sum(1 for char in text if char in string.punctuation)

df['punctuation_ct'] = df['text'].apply(count_punct)

# See 3 rows
df[['text', 'punctuation_ct']].head(3)

Unnamed: 0,text,punctuation_ct
0,I bought the Furbo as a birthday gift for my b...,52
1,Extremely disappointed. I recieved a Furbo tha...,6
2,I’ve been using furbo for 2.5 weeks now. It ha...,62


In [28]:
# Create hashtag count feature
df['hashtag_ct'] = df['text'].apply(lambda text: len([word for word in text.split() if word.startswith('#')]))

# See 3 rows
df[['text','hashtag_ct']].head(3)

# How many reviews have a nonzero hashtag count?
# df.hashtag_ct.loc[df.hashtag_ct != 0].count()

Unnamed: 0,text,hashtag_ct
0,I bought the Furbo as a birthday gift for my b...,0
1,Extremely disappointed. I recieved a Furbo tha...,0
2,I’ve been using furbo for 2.5 weeks now. It ha...,0


In [29]:
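# Create numeric count feature (counts whitespace-separated tokens made up only of digits, so "2.5" is not counted)
df['numeric_ct'] = df['text'].apply(lambda text: len([word for word in text.split() if word.isdigit()]))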

# See 3 rows
df[['text','numeric_ct']].head(3)

Unnamed: 0,text,numeric_ct
0,I bought the Furbo as a birthday gift for my b...,1
1,Extremely disappointed. I recieved a Furbo tha...,0
2,I’ve been using furbo for 2.5 weeks now. It ha...,1


In [43]:
# Load libraries for emoji & regex
import emoji
import regex

# Write a function to identify all the emojis, call it emoji_ct

# See 3 rows
# df[['text','emoji_ct']].head(3)
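
One possible way to fill in the emoji counter left as an exercise above is sketched below. It is an approximation, not a complete emoji definition: the Unicode ranges in `EMOJI_PATTERN` and the helper name `count_emojis` are assumptions made for illustration, and the sketch uses only the standard library `re` module rather than the `emoji` and `regex` packages imported in the cell.

```python
import re

# Approximate emoji matcher built from a few common Unicode blocks.
# These ranges are an assumption for illustration; they do not cover every emoji.
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F300-\U0001F5FF"  # symbols and pictographs
    "\U0001F600-\U0001F64F"  # emoticon faces
    "\U0001F680-\U0001F6FF"  # transport and map symbols
    "\U0001F900-\U0001F9FF"  # supplemental symbols and pictographs
    "\u2600-\u26FF"          # miscellaneous symbols
    "\u2700-\u27BF"          # dingbats
    "]"
)


def count_emojis(text):
    """Return the number of characters in text that match EMOJI_PATTERN."""
    return len(EMOJI_PATTERN.findall(text))
```

With this helper, the feature could be created as `df['emoji_ct'] = df['text'].apply(count_emojis)`. Emojis built from several code points (skin tones, joined sequences) may be counted more than once or missed, so the package's own lookup tables are likely a better choice for a final analysis.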



In [None]:
# Write a function to identify all the emoticons, call it emoticon_ct

# See 3 rows
# df[['text','emoticon_ct']].head(3)
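
For the emoticon counter, a regular expression over common Western-style emoticons is one option. The pattern below is a hypothetical starting point, not an exhaustive list, and `count_emoticons` is a name introduced here for illustration. It requires that the emoticon is not glued to a word character on either side, which keeps URLs such as `https://...` from matching but also misses cases like `thanks:)`.

```python
import re

# Hypothetical pattern for emoticons such as :) ;-) :D :'( =P and <3.
# Structure: eyes, optional nose or tear, mouth; or the heart "<3".
EMOTICON_PATTERN = re.compile(
    r"(?<!\w)(?:[:;=]['\-^]?[)(\]\[DPpOo/\\|*]|<3)(?!\w)"
)


def count_emoticons(text):
    """Return the number of non-overlapping emoticon matches in text."""
    return len(EMOTICON_PATTERN.findall(text))
```

The feature could then be created as `df['emoticon_ct'] = df['text'].apply(count_emoticons)`. It is worth checking a sample of matches by eye, since short punctuation sequences can produce false positives.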



## Part 3: Data & Text Cleaning
* Change to lower case
* Remove punctuation, stopwords, URLs, html tags, emojis, emoticons
* Spell correction
* Explore & remove custom stopwords
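
A minimal sketch of the first cleaning steps listed above (lower-casing, then removing URLs, HTML tags, punctuation, and stopwords) is shown here. The function name `clean_text`, the regular expressions, and the small `EXAMPLE_STOPWORDS` set are assumptions for illustration; in the notebook, the `stop` list loaded from NLTK could be passed in instead. Spell correction and emoji or emoticon removal are not covered.

```python
import re
import string

# Small illustrative stopword set (an assumption, not the NLTK list).
EXAMPLE_STOPWORDS = {"a", "an", "and", "the", "is", "it", "for", "my", "i", "this", "to", "of"}

URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
HTML_TAG_PATTERN = re.compile(r"<[^>]+>")
# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_text(text, stopwords=EXAMPLE_STOPWORDS):
    """Lowercase text and strip URLs, HTML tags, ASCII punctuation, and stopwords."""
    text = text.lower()
    text = URL_PATTERN.sub(" ", text)       # remove URLs before punctuation is stripped
    text = HTML_TAG_PATTERN.sub(" ", text)  # remove tags such as <br/>
    text = text.translate(PUNCT_TABLE)
    tokens = [token for token in text.split() if token not in stopwords]
    return " ".join(tokens)
```

It could be applied as `df['clean_text'] = df['text'].apply(clean_text)`, where `clean_text` as a column name is also hypothetical. Note that `string.punctuation` is ASCII only, so the curly apostrophe in reviews such as "I’ve been using furbo" would survive this step unless it is handled separately.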

## Part 4: Feature Extraction (after text cleaning)
* Word count
* Character count
* Avg/median word length
* Create date/time variable
* Create review country
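
The features listed above can be sketched with plain Python helpers. The date pattern is based on the raw strings shown in the sample rows (for example "Reviewed in Canada on December 14, 2018"); the function names, dictionary keys, and the pattern itself are assumptions for illustration and may need adjusting if other date formats appear in the data.

```python
import re
from datetime import datetime
from statistics import mean, median

# Assumed shape of the raw date column: "Reviewed in <country> on <Month> <day>, <year>".
DATE_PATTERN = re.compile(r"Reviewed in (?P<country>.+?) on (?P<date>\w+ \d{1,2}, \d{4})$")


def text_features(text):
    """Return word count, character count, and mean/median word length."""
    words = text.split()
    lengths = [len(word) for word in words]
    return {
        "word_ct": len(words),
        "char_ct": len(text),
        "avg_word_len": mean(lengths) if lengths else 0.0,
        "median_word_len": median(lengths) if lengths else 0.0,
    }


def parse_review_date(raw):
    """Split a raw date string into (country, date); return (None, None) if it does not match."""
    match = DATE_PATTERN.match(raw.strip())
    if match is None:
        return None, None
    # %B parses full month names and depends on the current locale (English assumed here).
    review_date = datetime.strptime(match.group("date"), "%B %d, %Y").date()
    return match.group("country"), review_date
```

For example, `parse_review_date(df['date'][0])` would split the first sample row into a country and a date, and `df['text'].apply(text_features).apply(pd.Series)` is one way to expand the dictionary into columns.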


## Part 5: Save Data
* Save cleaned data to CSV

In [None]:
# Function to find the polarity of a single review
def polarity(text):
    # TextBlob polarity ranges from -1.0 (negative) to 1.0 (positive)
    return TextBlob(text).sentiment.polarity

# Depending on the size of your data, this step may take some time.
df['polarity'] = df['text'].apply(polarity)