# Getting insight from customer reviews using Amazon Comprehend

## Introduction
<a id="Introduction"></a>



We will use [Amazon Comprehend](https://aws.amazon.com/comprehend/), an NLP AI service from Amazon Web Services, to solve the business problem described below. Amazon Comprehend is a natural language processing (NLP) service that uses machine learning (ML) to find insights and relationships in text. It can also be trained to recognize custom entities and to perform custom classification.

*Note*: `boto3`, the Python SDK for AWS, is used in the examples in this notebook. It is already installed if you are running this notebook in an Amazon SageMaker notebook environment.
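
As a rough illustration of how `boto3` talks to Amazon Comprehend, the sketch below wraps the `detect_sentiment` call. The helper names `make_comprehend_client` and `dominant_sentiment` and the default region are assumptions made for this example and are not part of the notebook. Running the real call requires AWS credentials with Comprehend permissions.

```python
def make_comprehend_client(region_name="us-east-1"):
    """Create a Comprehend client (the region is an assumption; use your own)."""
    import boto3  # imported lazily so the helper below can be tried without AWS access
    return boto3.client("comprehend", region_name=region_name)


def dominant_sentiment(client, text, language_code="en"):
    """Return the sentiment label Comprehend assigns to one piece of text."""
    response = client.detect_sentiment(Text=text, LanguageCode=language_code)
    # The response carries a label (POSITIVE, NEGATIVE, NEUTRAL or MIXED)
    # plus per-label confidence scores under "SentimentScore".
    return response["Sentiment"]


# Example (needs credentials):
# client = make_comprehend_client()
# print(dominant_sentiment(client, "These shoes fit perfectly."))
```

Passing the client in as an argument keeps the helper easy to exercise with a stand-in object when no AWS account is available.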

## Problem Statement
<a id="ProblemStatetement"></a>

Consumers are increasingly engaging with businesses through digital surfaces and multiple touch-points. Statistics show that the majority of shoppers use reviews to decide which products to buy and which services to purchase. Reviews have the power to influence consumer decisions and strengthen brand value. Customer reviews are a great tool for estimating product quality, identifying improvement opportunities, launching promotional campaigns, and making good product recommendations. We will use Amazon Comprehend to extract meaningful information from product reviews, analyze it to understand how users in different demographics react to products, and analyze aggregated information on user affinity towards a product.

## Use AWS NLP Service Amazon Comprehend as a Solution
<a id="Rescue"></a>

We will use natural language processing to solve the problem in three steps, each in its own notebook:

#### 1. Data Processing and Transformation Notebook
Exploratory Data Analysis to understand the dataset
#### 2. Comprehend Topic Modelling Job Notebook
Use Topic Modeling to generate topics
#### 3. Topic Mapping and Sentiment Generation Notebook
Use topics to understand segments and sentiment associated with each item




### Data Loading

#### Initialize Input & Output Paths
<a id="InitialiazeS3Data"></a>

In [None]:
# Library imports
import pandas as pd
import os

### Input-paths

In [None]:
# Bucket containing the data
BUCKET = 'clothing-shoe-jewel-tm-blog'

# Item ratings and metadata
S3_DATA_FILE = 'Clothing_Shoes_and_Jewelry.json.gz' # gzip-compressed
S3_META_FILE = 'meta_Clothing_Shoes_and_Jewelry.json.gz' # gzip-compressed

S3_DATA = 's3://' + BUCKET + '/' + S3_DATA_FILE
S3_META = 's3://' + BUCKET + '/' + S3_META_FILE

### Output-paths

In [None]:
# Transformed review, input for Comprehend
LOCAL_TRANSFORMED_REVIEW = os.path.join('data', 'TransformedReviews.txt')
S3_OUT = 's3://' + BUCKET + '/out/' + 'TransformedReviews.txt'

# Final dataframe where topics and sentiments are going to be joined
S3_FEEDBACK_TOPICS = 's3://' + BUCKET + '/out/' + 'FinalDataframe.csv'

#### Load Review and Meta Data into Dataframe

In [None]:
def convert_json_to_df(path):
    """Read the first 500k records of a gzipped JSON Lines file into one dataframe.

    The file is read in chunks of 10k records, which are then concatenated.
    """
    # Lazily iterate over up to 500k records, 10k records per chunk
    chunks = pd.read_json(path, orient='records',
                          lines=True,
                          nrows=500000,
                          chunksize=10000,
                          compression='gzip')
    # Concatenate once; calling pd.concat inside a loop re-copies the data each time
    load_df = pd.concat(chunks, axis=0)
    return load_df
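
To make the interplay of `nrows` and `chunksize` concrete, here is a minimal sketch that runs against a small throwaway file instead of the S3 data. The toy records, the temporary path, and the variable names are all made up for illustration.

```python
import gzip
import json
import os
import tempfile

import pandas as pd

# Write five toy reviews as gzip-compressed JSON Lines (one JSON object per line).
records = [{"asin": f"A{i}", "overall": i % 5 + 1} for i in range(5)]
tmp_path = os.path.join(tempfile.mkdtemp(), "toy_reviews.json.gz")
with gzip.open(tmp_path, "wt", encoding="utf-8") as handle:
    for record in records:
        handle.write(json.dumps(record) + "\n")

# nrows caps the total rows read; chunksize sets how many rows arrive per chunk.
reader = pd.read_json(tmp_path, orient="records", lines=True,
                      nrows=4, chunksize=2, compression="gzip")
chunk_sizes = []
frames = []
for chunk in reader:
    chunk_sizes.append(len(chunk))
    frames.append(chunk)

# Combine the chunks into a single dataframe in one call.
toy_df = pd.concat(frames, axis=0)
print(chunk_sizes, toy_df.shape)
```

Collecting the chunks in a list and concatenating once avoids copying the growing dataframe on every iteration.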

In [None]:
# Review data
original_df = convert_json_to_df(S3_DATA)

In [None]:
# Metadata
original_meta = convert_json_to_df(S3_META)

### Exploratory Data Analysis

In [None]:
# Shape of reviews and metadata
print('Shape of review data: ', original_df.shape)
print('Shape of metadata: ', original_meta.shape)

In [None]:
# We are interested in verified reviews only
# Also checking the amount of missing values in the review data
print('Frequency of verified/non verified review data: ', original_df['verified'].value_counts())
print('Frequency of missing values in review data: ', original_df.isna().sum())

In [None]:
# Sneak peek for review data
original_df.head()

In [None]:
# Sneak peek for metadata
original_meta.head()

In [None]:
# Count of each category for EDA.
print('Frequency of different item categories in metadata: ', original_meta['category'].value_counts())

In [None]:
# Checking null values for metadata
print('Frequency of missing values in metadata: ', original_meta.isna().sum())

In [None]:
# Checking for duplicated items. The metadata does contain duplicated ASINs.
print('Duplicate items in metadata: ', original_meta[original_meta['asin'].duplicated()])

### Preprocessing

In [None]:
def clean_text(df):
    """Preprocess the review text so that it is compatible with Comprehend.

    This is the most important preprocessing step.
    """
    # Work on a copy so the caller's dataframe is not modified in place
    df = df.copy()

    # Drop non-ASCII characters: encode to ASCII ignoring errors, then decode.
    # (Encoding to UTF-8 and decoding as ASCII raises on any non-ASCII character.)
    df['reviewText'] = df['reviewText'].str.encode('ascii', 'ignore')
    df['reviewText'] = df['reviewText'].str.decode('ascii')

    # Replace line breaks and tabs with a space
    df['reviewText'] = df['reviewText'].replace(r'\r+|\n+|\t+|\u2028', ' ', regex=True)

    # Remove punctuation
    df['reviewText'] = df['reviewText'].str.replace(r'[^\w\s]', '', regex=True)

    # Lowercase reviews
    df['reviewText'] = df['reviewText'].str.lower()
    return df
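
To see what these steps do to a single review, here is a minimal sketch in plain Python. The helper name `clean_review` and the sample string are made up for illustration, and the sketch assumes that the intent of the encode/decode step is to drop characters outside ASCII.

```python
import re


def clean_review(text):
    """Apply the cleaning steps to a single string (illustrative helper)."""
    # Drop characters that have no ASCII representation (assumed intent).
    text = text.encode("ascii", "ignore").decode("ascii")
    # Replace each run of carriage returns, newlines or tabs with one space.
    text = re.sub(r"\r+|\n+|\t+", " ", text)
    # Remove punctuation: anything that is neither a word character nor whitespace.
    text = re.sub(r"[^\w\s]", "", text)
    return text.lower()


sample = "Great fit!\nCaf\u00e9-ready."
print(clean_review(sample))
```

Note that removing punctuation also joins hyphenated words, and that the underscore survives because `\w` matches it.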

In [None]:
def prepare_input_data(df):
    """Keep only reviews whose size is within Comprehend's limit.

    The size of each review is measured in bytes after UTF-8 encoding.
    Comprehend requires each review input to be no more than 5000 bytes,
    so empty reviews and reviews of 5000 bytes or more are dropped.
    """
    df['review_size'] = df['reviewText'].apply(lambda x:len(x.encode('utf-8')))
    df = df[(df['review_size'] > 0) & (df['review_size'] < 5000)]
    df = df.drop(columns=['review_size'])
    return df
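
The size check counts bytes rather than characters because a single character can occupy more than one byte in UTF-8. The sketch below shows the difference on toy strings; `utf8_size` and `within_limit` are hypothetical helpers that mirror the filter above and are not part of the notebook.

```python
def utf8_size(text):
    """Size of a string in bytes once encoded as UTF-8."""
    return len(text.encode("utf-8"))


def within_limit(text, max_bytes=5000):
    """Mirror the filter above: keep non-empty text strictly under max_bytes."""
    return 0 < utf8_size(text) < max_bytes


accented = "\u00e9" * 3000  # 3000 accented characters, two bytes each in UTF-8
plain = "e" * 3000          # 3000 ASCII characters, one byte each
print(len(accented), utf8_size(accented), within_limit(accented))
print(len(plain), utf8_size(plain), within_limit(plain))
```

Both strings have the same character count, but only the ASCII one passes the byte limit.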

In [None]:
# Select only verified reviews whose review text is not missing
# (the mask is named verified_mask to avoid shadowing the built-in filter)
verified_mask = (original_df['verified'] == True) & (~original_df['reviewText'].isna())
filtered_df = original_df[verified_mask]

In [None]:
# Only a subset of fields is selected in this experiment.
filtered_df = filtered_df[['asin', 'reviewText', 'summary', 'unixReviewTime', 'overall', 'reviewerID']]

In [None]:
# Just in case, once again, dropping data points with missing review text
filtered_df = filtered_df.dropna(subset=['reviewText'])
print('Shape of review data: ', filtered_df.shape)

In [None]:
# Dropping duplicate items from metadata
original_meta = original_meta.drop_duplicates(subset=['asin'])
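
For reference, the toy example below shows how `duplicated` and `drop_duplicates` treat repeated keys. The dataframe contents are invented for illustration.

```python
import pandas as pd

toy_meta = pd.DataFrame({
    "asin": ["B001", "B002", "B001"],
    "title": ["Red scarf", "Blue hat", "Red scarf (relisted)"],
})

# duplicated() marks the second and later occurrences of each asin.
duplicate_mask = toy_meta["asin"].duplicated()

# drop_duplicates() keeps the first occurrence by default (keep="first").
deduplicated = toy_meta.drop_duplicates(subset=["asin"])
print(duplicate_mask.tolist())
print(deduplicated["title"].tolist())
```

Because the first occurrence wins, which metadata row survives depends on the row order of the input file.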

In [None]:
# Only a subset of fields is selected in this experiment.
original_meta = original_meta[['asin', 'category', 'title', 'description', 'brand', 'main_cat']]

In [None]:
# Clean reviews using text cleaning pipeline
df = clean_text(filtered_df)

In [None]:
# Reset index as we are merging metadata with reviews shortly
df = df.reset_index(drop=True)

In [None]:
# Merge metadata with review data
df = df.merge(original_meta, how='left', on='asin')
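
A left merge keeps every review even when no metadata row matches its `asin`. In that case the metadata columns are filled with missing values. The sketch below, on invented toy data, uses the `indicator` option of `merge` to list such unmatched reviews. This check is a suggestion and not part of the original pipeline.

```python
import pandas as pd

toy_reviews = pd.DataFrame({
    "asin": ["B001", "B002", "B003"],
    "reviewText": ["nice scarf", "hat too small", "great socks"],
})
toy_meta = pd.DataFrame({
    "asin": ["B001", "B002"],
    "brand": ["Acme", "Hatco"],
})

# how="left" keeps every review; indicator=True adds a _merge column that
# records whether a matching metadata row was found.
merged = toy_reviews.merge(toy_meta, how="left", on="asin", indicator=True)
unmatched = merged[merged["_merge"] == "left_only"]
print(len(merged), unmatched["asin"].tolist())
```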

In [None]:
# Drop empty and oversized reviews. Comprehend outputs (topics and sentiments) will later be added to this dataframe
df = prepare_input_data(df)

### Save Data in S3

In [None]:
# Saving dataframe on S3
df.to_csv(S3_FEEDBACK_TOPICS, index=False)

In [None]:
# Reviews are transformed per the Comprehend guideline: one review per line
# The txt file will be used as input for Comprehend
# We first save the input file locally, creating the output directory if needed
os.makedirs(os.path.dirname(LOCAL_TRANSFORMED_REVIEW), exist_ok=True)
with open(LOCAL_TRANSFORMED_REVIEW, "w") as outfile:
    outfile.write("\n".join(df['reviewText'].tolist()))
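
Since each line of the file is treated as one review, a stray line break inside a review would shift every later line relative to the dataframe rows. The following sketch shows one way to sanity-check the one-review-per-line layout. It uses a throwaway file and made-up reviews, and the check itself is a suggestion, not part of the original notebook.

```python
import os
import tempfile

reviews = ["great fit", "too small", "love the color"]

# Write one review per line, as in the cell above, to a throwaway file.
path = os.path.join(tempfile.mkdtemp(), "reviews.txt")
with open(path, "w", encoding="utf-8") as outfile:
    outfile.write("\n".join(reviews))

# Reading the file back should yield exactly one line per review.
with open(path, encoding="utf-8") as infile:
    lines = infile.read().split("\n")

assert len(lines) == len(reviews), "a review contained a line break"
```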

In [None]:
# Transferring the transformed review (input to Comprehend) to S3
!aws s3 mv {LOCAL_TRANSFORMED_REVIEW} {S3_OUT}