#### <a id="1"></a>
# <p style="background-color:#8DB600;font-family:newtimeroman;color:#FFF9ED;font-size:120%;text-align:center;border-radius:10px 10px;">Amazon Product Quality Report</p>

I scrape and clean Amazon review and rating data for multiple products, with the aim of creating a concise, customer-friendly report on the product. 
* I identify the probable presence of unreliable product quality and/or fraudulent reviews, using a chi-squared test. 
* I extract keywords used by happy versus unhappy customers, with an emphasis on keywords that express concrete features of the product, while excluding uninformative sentimental keywords. 
* I create a concise, balanced report on product pros and cons.

#### <a id="1"></a>
# <p style="background-color:#8DB600;font-family:newtimeroman;color:#FFF9ED;font-size:100%;text-align:center;border-radius:10px 10px;">Importing Libraries</p>

In [722]:
# Importing libraries 

# standard libraries
import numpy as np 
import pandas as pd 

# for file management
import os 

# for data scraping
import requests 
from bs4 import BeautifulSoup
from user_agent import generate_user_agent

# for formatting and cleaning
import re
from datetime import datetime

# language 
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# plotting
from matplotlib import pyplot as plt

# statistics
from scipy.stats import chisquare

#### <a id="1"></a>
# <p style="background-color:#8DB600;font-family:newtimeroman;color:#FFF9ED;font-size:100%;text-align:center;border-radius:10px 10px;">Scraper for Amazon Product Reviews</p>

I scraped data from Amazon reviews for four different Amazon products. 

The code for the scraping is below. However, at least for me, it no longer works. Amazon now detects me as a bot and blocks all access. I temporarily resolved this problem by attaching a randomly generated user agent to each of my requests, but this doesn't work anymore. I've given up, for now, on getting past their bot detection. Therefore I only have reviews from the four products I looked at before my scraper stopped working.  

The code below defines the Amazon review scraper. 
* The function 'get_soup' takes as input the URL of a webpage, and retrieves the HTML. 
* The function 'convert_Amazon_product_to_reviews' takes as input the URL of any product page on Amazon, and converts it to the base URL of the product's review pages. This URL ends in 'pageNumber=', so that moving through the pages of reviews simply involves appending a page number (1 or higher) to the end of the string. 
* The function 'extract_Amazon_reviews' takes as input the main product page URL and returns a dataframe including: the review title, name of reviewer, star rating, where and when the review was made, the specific product style purchased, whether the review arises from a 'verified' purchase, the review text, and the number of helpful votes. 


In [160]:
def get_soup(URL: str):
    s = requests.Session()
    
    # Random user agent.
    user_agent = generate_user_agent()
    headers = {'User-Agent': user_agent, 'Accept-Language': 'en-US, en;q=0.5'}        

    # Get soup 
    webpage = s.get(URL, headers = headers)
    soup = BeautifulSoup(webpage.text, 'lxml')
    
    # Return error message
    if ("Sorry, we just need to make sure you\'re not a robot" in str(soup)) or ("To discuss automated access" in str(soup)): # Part of error messages about bots
        print('Error. Amazon blocked access.') 
        return None

    # Return soup
    return soup

In [161]:
URL = 'https://www.amazon.com/Software-Design-Flexibility-Programming-Yourself-ebook/dp/B089423GC6/?pf_rd_r=V3Y43HK5TFK9VWA30XD6&pf_rd_p=935389f8-611a-4123-b867-b2d567ba3a96&pd_rd_r=7385a253-360e-4f17-83a6-03adefa6787f&pd_rd_w=Qe2rk&pd_rd_wg=P0a9Y&ref_=pd_gw_bmx_gp_1g4a5hlo'
get_soup(URL)

Error. Amazon blocked access.


In [96]:
def convert_Amazon_product_to_reviews(URL: str):
    return URL.replace('/dp', '/product-reviews').split('ref=')[0] + 'ref=cm_cr_arp_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber='
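
As a quick check of the conversion described above, the sketch below applies the same function to one of the product URLs used later in this notebook and then appends a page number. The variable names `product_url`, `reviews_base`, and `page_3` are introduced here only for illustration.

```python
def convert_Amazon_product_to_reviews(URL: str):
    # Same logic as the cell above: swap '/dp' for '/product-reviews',
    # drop everything from 'ref=' onward, and append the paging suffix.
    return URL.replace('/dp', '/product-reviews').split('ref=')[0] + 'ref=cm_cr_arp_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber='

product_url = "https://www.amazon.com/Nike-Mens-Monarch-Cross-Trainer/dp/B07JQKM2SP/ref=cm_cr_arp_d_product_top?ie=UTF8"
reviews_base = convert_Amazon_product_to_reviews(product_url)

# Appending a page number gives the URL of that page of reviews
page_3 = reviews_base + str(3)
print(page_3)
```

Note that the function only trims tracking parameters that begin with `ref=`. A URL that carries its tracking information as `ref_=` inside the query string would keep its original query string and have the suffix appended after it, so such URLs may need to be cleaned by hand first.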

In [85]:
def extract_Amazon_reviews(URL: str, page_no: int):
    # Get soup
    URL = convert_Amazon_product_to_reviews(URL)
    soup = get_soup(URL + str(page_no))
    if not soup: # request was blocked: return an empty dataframe so callers can check .empty
        return pd.DataFrame()
    
    # Extract sub-soup with all review information 
    reviewSection = soup.find("div", attrs = {"id": "cm_cr-review_list", "class" : "a-section a-spacing-none review-views celwidget"})
    if not reviewSection: # if no review section: stop!
        return pd.DataFrame()
    allReviews = reviewSection.find_all("div", attrs = {"data-hook" : "review"})
    if not allReviews: # if no reviews: stop!
        return pd.DataFrame()

    # Review Title - can appear under two possible tags 
    reviewTitle = []
    for review in allReviews:
        x = review.find("a", attrs = {"data-hook" : "review-title"})
        if x:
            reviewTitle.append(x.span.string)
        else:
            reviewTitle.append(review.find("span", attrs = {"data-hook" : "review-title"}).span.string)

    # Reviewer Name
    reviewerNames = [review.find("span", attrs = {"class": "a-profile-name"}).string for review in allReviews]
    
    # Star Rating
    ratings = [int(float(review.find("span", attrs = {"class": "a-icon-alt"}).string.split()[0])) for review in allReviews]
    
    # Where and When Review was Made
    reviewPlaceDate = [review.find("span", attrs = {"data-hook": "review-date"}).string.strip() for review in allReviews]
    
    # Product Style (if it exists)
    productStyle = [review.find("a", attrs = {"data-hook": "format-strip"}) for review in allReviews]
    productStyle = [(x.string if x else None) for x in productStyle]
    
    # Verified Purchase
    reviewType = [review.find("span", attrs = {"data-hook": "avp-badge"}) for review in allReviews]
    reviewType = [(x.string if x is not None else x) for x in reviewType]

    # Review Text
    reviewText = [review.find("span", attrs = {"data-hook": "review-body"}).span for review in allReviews]
    reviewText = [(review.contents if review else "") for review in reviewText] # handles possibility of empty review text
    reviewText = [(' '.join(list(filter(None, [x.string for x in review]))).replace('\n', ' ').strip() if review != "" else "") for review in reviewText]

    # Helpful Votes 
    helpfulVotes = [review.find("span", attrs = {"data-hook" : "helpful-vote-statement"}) for review in allReviews]
    helpfulVotes = [(x.string.split()[0] if x is not None else x) for x in helpfulVotes]

    # Make Dataframe 
    columns = ["Name", "Rating", "ReviewTitle", "PlaceDate", "ProductStyle", "IsVerified", "ReviewText", "HelpfulVotes"]
    data = pd.DataFrame(list(zip(reviewerNames, ratings, reviewTitle, reviewPlaceDate, productStyle, reviewType, reviewText, helpfulVotes)), 
                columns = columns)
    return data

#### <a id="1"></a>
# <p style="background-color:#8DB600;font-family:newtimeroman;color:#FFF9ED;font-size:100%;text-align:center;border-radius:10px 10px;">Data scraping and cleaning</p>

In this section, I scrape, clean, and save data for four miscellaneous, popular products.
* Pampers -- Baby Wipes: https://www.amazon.com/Choose-your-count-Sensitive-Hypoallergenic/dp/B079V67BFW/ref=cm_cr_arp_d_product_top?ie=UTF8
* FangTian -- N95 Masks: https://www.amazon.com/FANGTIAN-Particulate-Respirators-Protective-TC-84A-7861/dp/B087Z7N4XF/ref=cm_cr_arp_d_product_top?ie=UTF8
* Nature's Nutrition -- Turmeric Supplements: https://www.amazon.com/Curcuminoids-Absorption-Anti-Inflammatory-Natures-Nutrition/dp/B06X9T1Y3F/ref=cm_cr_arp_d_product_top?ie=UTF8
* Nike -- Men's Sneakers: https://www.amazon.com/Nike-Mens-Monarch-Cross-Trainer/dp/B07JQKM2SP/ref=cm_cr_arp_d_product_top?ie=UTF8

In [356]:
# Sample Product URLS 
# Adjacent string literals inside parentheses are concatenated, so no stray whitespace ends up in the URLs
URL_PampersWipes = ("https://www.amazon.com/Choose-your-count-Sensitive-Hypoallergenic/product-reviews/B079V67BFW/"
    "ref=cm_cr_getr_d_paging_btm_next_3?ie=UTF8&reviewerType=all_reviews&pageNumber=")
URL_N95 = ("https://www.amazon.com/FANGTIAN-Particulate-Respirators-Protective-TC-84A-7861/product-reviews/B087Z7N4XF/"
    "ref=cm_cr_arp_d_paging_?btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber=")
URL_TurmericSupplement = ("https://www.amazon.com/Curcuminoids-Absorption-Anti-Inflammatory-Natures-Nutrition/product-reviews/B06X9T1Y3F/"
    "ref=cm_cr_arp_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber=")
URL_Nike = ("https://www.amazon.com/Nike-Mens-Monarch-Cross-Trainer/product-reviews/B07JQKM2SP/"
    "ref=cm_cr_arp_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber=")

Sample_URLS = {"PampersWipes": URL_PampersWipes, "N95": URL_N95, "TurmericSupplement": URL_TurmericSupplement, "Nike": URL_Nike}

In [162]:
# Choose product and maximum number of pages you want to check 
product = "TurmericSupplement"
page_lim = 1000

# Product Name 
URL_base = Sample_URLS[product]
productName = get_soup(URL_base + "1").find("a", attrs = {"data-hook" : "product-link"}).string # get_soup takes a single URL, so append the page number

# Initialize dataframe
columns = ["Name", "Rating", "ReviewTitle", "PlaceDate", "ProductStyle", "IsVerified", "ReviewText", "HelpfulVotes"]
data = pd.DataFrame(columns = columns)

# Fill out dataframe
for page_no in range(1, page_lim):
    # Extract reviews (using the scraper function defined above)
    page_data = extract_Amazon_reviews(URL_base, page_no)
    
    # If no more reviews found (or access was blocked): break. Otherwise, concatenate to our growing dataframe. 
    if page_data is None or page_data.empty: 
        break 
    data = pd.concat([data, page_data])
    
    # Status update 
    if page_no % 20 == 0: print("Currently at page number {}".format(page_no))

# Data cleaning

# Reset index
data.reset_index(drop = True, inplace = True)

# Replace 'PlaceDate' column with 'Place' and 'Date' columns 
for index in data.index:
    placedate = data.loc[index, "PlaceDate"]
    a, b = re.search('^Reviewed in ', placedate).span()
    c, d = re.search(' on ', placedate).span()
    place, date = placedate[b:c], placedate[d:]
    data.loc[index, "Place"] = place
    data.loc[index, "Date"] = date
data.drop("PlaceDate", axis = 1, inplace = True)

# Label encode 'IsVerified' column (1 = verified purchase, 0 = otherwise)
data["IsVerified"] = (data["IsVerified"] == "Verified Purchase").astype(int)

# Make 'HelpfulVotes' column an integer type 
# (assigning back to the column avoids relying on inplace replacement of a column selection)
data["HelpfulVotes"] = data["HelpfulVotes"].replace({"One": "1"}).fillna("0")
data["HelpfulVotes"] = data["HelpfulVotes"].str.replace(',', '').astype(int)

# Make 'Date' column a datetime type 
data["Date"] = pd.to_datetime(data["Date"])

# Add 'Year' and 'Month' columns 
data["Year"] = data["Date"].dt.year
data["Month"] = data["Date"].dt.month
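
To illustrate the 'PlaceDate' cleaning step above, here is a minimal sketch that splits one such string with a single regular expression. The helper name `split_place_date` is hypothetical, and the date format "%B %d, %Y" (for example "September 19, 2018") is an assumption about how the scraped strings look. It is not taken from the scraper output itself.

```python
import re
from datetime import datetime

# 'Reviewed in <place> on <date>'; the non-greedy place stops at the first ' on '
PLACE_DATE = re.compile(r"^Reviewed in (?P<place>.+?) on (?P<date>.+)$")

def split_place_date(place_date: str):
    """Return (place, date) from a 'PlaceDate' string, or (None, None) if it does not match."""
    match = PLACE_DATE.match(place_date.strip())
    if match is None:
        return None, None
    # Assumed date format, e.g. 'September 19, 2018'
    date = datetime.strptime(match.group("date"), "%B %d, %Y")
    return match.group("place"), date

place, date = split_place_date("Reviewed in the United States on September 19, 2018")
print(place, date.year, date.month)
```

Returning `(None, None)` for unexpected strings is one possible design choice. It lets malformed rows be inspected later instead of stopping the whole cleaning run with an exception.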

In [363]:
# # Save (or update) dataframe
# filename = "Data_" + str(product) + ".csv"
# if os.path.exists(filename):
#     os.remove(filename)
# data.to_csv(filename)

See below an example of what our data looks like, for the Turmeric Supplements.

In [189]:
# Load data for desired product 
product = "TurmericSupplement" #N95, PampersWipes, TurmericSupplement, Nike
filename = "Data_" + str(product) + ".csv"
data = pd.read_csv(filename, index_col = 0)
data.head()

Unnamed: 0,Name,Rating,ReviewTitle,ProductStyle,IsVerified,ReviewText,HelpfulVotes,Place,Date,Year,Month
0,vagma,1,NOT THE SAME,Size: 120 Count (Pack of 1),1,I've been ordering these turmeric pills for a ...,2595,the United States,2018-09-19,2018,9
1,Amazon Customer,1,White Capsules Mixed in with Turmeric??,Size: 180 Count (Pack of 1),1,Was shocked to find white capsules mixed in wi...,1774,the United States,2019-05-16,2019,5
2,Amazon Customer,5,Love it.,Size: 180 Count (Pack of 1),1,Let me begin with that I have a degree in biol...,1156,the United States,2018-08-18,2018,8
3,Kathy Sneed,1,Beware,Size: 60 Count (Pack of 1),1,It worked get for arthritis but my husband had...,820,the United States,2018-10-05,2018,10
4,Johnny,1,Nausea and diarrhea,Size: 60 Count (Pack of 1),1,I took only 2 capsules instead of the 3 reccom...,614,the United States,2019-05-01,2019,5


This is the number of reviews I collected for each product. For products with over 5000 reviews, I collected only the 5000 most recent reviews. For all products, I collected every review made in 2021. 

In [485]:
for product in ["PampersWipes", "Nike", "TurmericSupplement", "N95"]:
    filename = "Data_" + str(product) + ".csv"
    data = pd.read_csv(filename, index_col = 0)
    print('We have {} reviews for the product {}'.format(data.shape[0], product))

We have 1319 reviews for the product PampersWipes
We have 5000 reviews for the product Nike
We have 5000 reviews for the product TurmericSupplement
We have 1213 reviews for the product N95


#### <a id="1"></a>
# <p style="background-color:#8DB600;font-family:newtimeroman;color:#FFF9ED;font-size:100%;text-align:center;border-radius:10px 10px;">Detecting Fluctuating Quality and/or Fraudulent Reviews</p>

We want to know what we're buying. The quality of a product should not rise or drop rapidly with changes in manufacturing or storage practices, or with changes in sellers. This is especially relevant on Amazon, where third parties are able to join legitimate product pages, sometimes selling fake versions of a product. The shoe company Nike, for example, chose to completely cease sales on Amazon in 2019 due to unauthorized third-party sales and counterfeit products. 

We also want to be able to trust the reviews we read. Fraudulent reviews can occur, either to increase sales with five-star reviews, or to decrease sales of competitors with one-star reviews. 

Both fluctuating quality and fraudulent reviews would be reflected in the typical star-ratings over time. Decreasing quality would increase the rate of 1-star reviews. Fraudulent reviews would likely cause irregular spikes in the numbers of either positive or negative reviews. Many news sources claim that fraudulent, paid reviews are extremely common on Amazon, though I did not personally check whether there is strong evidence for this claim. 

Let's define our null hypothesis as the hypothesis that a product has a consistent satisfaction rate. The percentage of negative ratings, let's say 1-, 2-, and 3-stars, should stay a constant fraction of the number of reviews, over time, though allowing for random fluctuations. We can detect statistically significant fluctuations using a chi-squared test, and compare the p-value with a statistical significance threshold to decide whether to reject the null hypothesis. Rejecting the null hypothesis would mean accepting the alternative hypothesis, that the percentage of apparently unhappy customers is changing statistically significantly over time. This can occur, for example, due to fluctuating quality or fraudulent reviews. 
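
The statistic behind this test can be written out as follows. The notation is introduced here only for illustration: $N_i$ is the total number of reviews in time period $i$, $O_i$ is the observed number of negative reviews in that period, $\hat{p}$ is the overall fraction of negative reviews across all $k$ periods, and $E_i$ is the number of negative reviews expected under the null hypothesis.

$$
E_i = \hat{p}\,N_i, \qquad \chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}
$$

With its default arguments, `scipy.stats.chisquare` compares this statistic to a chi-squared distribution with $k - 1$ degrees of freedom to obtain the p-value.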

In theory, the p-value should allow us to sometimes reject the null hypothesis, thereby labelling some products as having volatile ratings. This may not always be a valuable distinction, as other time-dependent factors like advertising might cause a spike in positive or negative ratings. As another caveat, products with fewer reviews will also tend to have higher p-values, because it is harder to reject the null hypothesis with fewer data points. 

In practice, then, we can only use the results of the chi-squared test if we've tested it on a large number of products, especially from the same category, and find that the results are sensible. This is not possible for me because my scraper was blocked by bot detection after I collected data for only four products. I will do the chi-squared test anyways on my four products, but a strong interpretation of the results would require much more data. 

In the code block below, I do a chi-squared test on the 'observed' number of negative ratings (1, 2, and 3 stars) versus the 'expected' number, defined as the mean percentage of negative ratings, times the total number of ratings within a given time period. I use only the data from the year 2021, the only full year when all of the products had a high enough number of sales and ratings for a chi-squared test to be valid. 
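
Before running the test on the real data, here is a small worked example of the same 'observed versus expected' construction. The counts are made up purely for illustration (four time periods, with a spike of negative ratings in the last one), and the variable names are not the ones used in the notebook.

```python
import numpy as np
from scipy.stats import chisquare

# Made-up counts for four time periods (illustration only)
total_by_period = np.array([100, 120, 80, 100])   # all ratings per period
bad_by_period = np.array([10, 12, 8, 30])         # observed 1-3 star ratings per period

# Under the null hypothesis, every period has the same overall bad-rating rate
overall_bad_rate = bad_by_period.sum() / total_by_period.sum()
expected_bad = total_by_period * overall_bad_rate

# Compare observed and expected counts of bad ratings
result = chisquare(bad_by_period, expected_bad)
print("chi-squared = {:.2f}, p-value = {:.2g}".format(result.statistic, result.pvalue))
```

Because the expected counts are built from the overall bad-rating rate, they sum to the same total as the observed counts, which is what `chisquare` requires. In this toy example the last period contributes most of the statistic, so the p-value is small and the null hypothesis of a constant rate would be rejected.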

In [471]:
def convert_monthly_series_to_list(series): 
    # Return a list of 12 monthly values, filling in 0 for months with no data
    lst = []
    for month in range(1, 13): 
        if month not in series.index:
            lst.append(0)
        else:
            lst.append(series.loc[month])
    return lst
    
for product in ["PampersWipes", "Nike", "TurmericSupplement", "N95"]:
    
    # Load data
    filename = "Data_" + str(product) + ".csv"
    data = pd.read_csv(filename)
    
    # Specify year and star-reviews 
    year = 2021
    star_ratings = [1,2,3] # All of these are bad ratings. Even with 3 star ratings, 1-3 account for at MOST 15% of all ratings in the products I'm looking at.

    # Total number of reviews, per month 
    totalByMonth = data.groupby(["Year", "Month"]).size().loc[year] 
    
    # Percent with specified star ratings 
    percent = data[(data["Year"] == year) & (data["Rating"].isin(star_ratings))].shape[0] / data[(data["Year"] == year)].shape[0]

    # Expected versus Observed Bad Ratings
    expected_series = totalByMonth * percent
    observed_series = data[data["Rating"].isin(star_ratings)].groupby(["Year", "Month"]).size().loc[year]
    expected, observed = convert_monthly_series_to_list(expected_series), convert_monthly_series_to_list(observed_series)
    
    # Divide up data in different ways: 12 data points (every month), 6 (every two months), 4 (every three months), 3 (every four months).
    # These can give different results in p-values, so it's worth checking the pvalue for all of them, and reporting the result. 
    min_pvalue = 1.0
    for k in range(1, 4 + 1):
        observed_k = [sum(observed[j] for j in range(i, i + k)) for i in range(0, 12, k)]
        expected_k = [sum(expected[j] for j in range(i, i + k)) for i in range(0, 12, k)]
        if any(x < 5 for x in (observed_k + expected_k)):
            continue
        min_pvalue = min(min_pvalue, chisquare(observed_k, expected_k).pvalue)
    
    print(product, ' has a chi-squared p-value of {0:1.2}'.format(min_pvalue))

PampersWipes  has a chi-squared p-value of 0.11
Nike  has a chi-squared p-value of 0.0036
TurmericSupplement  has a chi-squared p-value of 0.00096
N95  has a chi-squared p-value of 2.4e-20


With any conventional statistical significance threshold, we reject the null hypothesis for the Nike shoes, Turmeric Supplement, and N95 masks. They have statistically significant volatility in their star-ratings, indicating the possibility of volatile quality and/or fraudulent reviews.

As with all statistical significance tests, we cannot say that the null hypothesis is confirmed for Pampers Wipes, only that we fail to reject it. 

That said, we can take the higher p-value of Pampers Wipes as a vote of confidence in reliable, steady customer opinion. The average customer rating of 4.65 is something we can trust, at least relative to the average customer ratings of the other products. 

#### <a id="1"></a>
# <p style="background-color:#8DB600;font-family:newtimeroman;color:#FFF9ED;font-size:100%;text-align:center;border-radius:10px 10px;">Automatic Reporting of Product Pros and Cons</p>

Amazon reviews can be highly informative, giving specific information on what each customer liked or disliked about the product. However, based on my own experience buying items on Amazon, I personally believe there are a few weaknesses to Amazon reviews:
* It takes **a lot of time** to read enough reviews to get a balanced perspective on customer opinion. 
* **Information is sparse.** Many sentences are fluff, spent describing the product as amazing or horrible without making concrete statements about the product's features. These sentiment-focused sentences are not useful, because they do not give more information than the star-ratings provide. 
* The provided **key words** tend to be **positive or neutral**. This is true, at least, for the products we have data for, which are well-rated 4-5 star products. This makes it much harder to get a sense of the complaints that unhappy customers have with the product. 

In this section, I extract **positive and negative keywords** from customer reviews. I define positive (negative) keywords as words that meet these four criteria: they (1) occur much more often in positive (negative) reviews than in negative (positive) reviews, (2) occur at some minimum frequency in reviews, (3) are not in a list of standard English stopwords, and (4) are not in a handmade list of sentimental, low-content words like 'good' or 'bad'.
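
The four criteria can be sketched on a toy example as follows. This is a simplified illustration, not the notebook's implementation: the function name `split_keywords` and the token lists are made up, lemmatization is skipped, and the stopword lists of criteria (3) and (4) are represented by a single `excluded` set.

```python
from collections import Counter

def split_keywords(bad_tokens, good_tokens, ratio_threshold=3.0,
                   min_frequency=0.002, excluded=frozenset()):
    """Return (good_keywords, bad_keywords) based on relative term frequencies."""
    bad_counts, good_counts = Counter(bad_tokens), Counter(good_tokens)
    tf_bad = {w: c / len(bad_tokens) for w, c in bad_counts.items()}
    tf_good = {w: c / len(good_tokens) for w, c in good_counts.items()}

    good_keywords, bad_keywords = [], []
    for word in set(tf_bad) | set(tf_good):
        if word in excluded:          # criteria (3) and (4): skip stopwords
            continue
        b, g = tf_bad.get(word, 0.0), tf_good.get(word, 0.0)
        # Criterion (1): much more frequent on one side than the other
        # Criterion (2): frequent enough on that side
        if b > min_frequency and b > ratio_threshold * g:
            bad_keywords.append(word)
        elif g > min_frequency and g > ratio_threshold * b:
            good_keywords.append(word)

    # Most frequent first; ties broken alphabetically for a stable order
    bad_keywords.sort(key=lambda w: (-tf_bad[w], w))
    good_keywords.sort(key=lambda w: (-tf_good[w], w))
    return good_keywords, bad_keywords

# Made-up tokens standing in for the words of bad and good reviews
bad = "rash dry rash wipes good rash".split()
good = "soft wipes good soft price soft".split()
good_keywords, bad_keywords = split_keywords(bad, good, excluded={"good"})
print(good_keywords, bad_keywords)
```

Here 'wipes' is dropped because it is equally frequent in both groups, and 'good' is dropped because it is on the exclusion list, which leaves only words that are distinctive for one group.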

I will also print **brief snippets of reviews** that contain the keywords found. This is useful because the keywords are often, but not always, informative on their own. I believe the resulting collection of review snippets is both informative and concise, but this is up to personal opinion.

The code blocks below
* define functions to extract positive and negative keywords
* define functions to format printing output 

In [748]:
# Function to load data for any given product. 
def load_data(product: str):
    filename = "Data_" + product + ".csv"
    return pd.read_csv(filename, index_col = 0)


# Review stopwords: words that indicate quality, but do not provide concrete information about the product
review_stopwords = ['good', 'great', 'excellent', 'fantastic', 'well', 'better', 'best', 'perfect', # positive adjectives
                    'bad', 'terrible', 'horrible', 'worse', 'worst', # negative adjectives
                    'love', 'luv', 'like', # positive verbs
                    'hate', 'dislike', # negative verbs
                    'dont', # common misspellings of stop words
                    'really', 'seems', # other
                   ]

# Get key bad words and key good words
def get_key_words(data, ratio_threshold = 3.0, min_frequency = 0.002, limit = 5):
    # Join all bad vs good reviews, in a single string 
    bad_reviews = ' '.join(list(data[(data.Rating == 1) | (data.Rating == 2)].ReviewText.dropna())).lower()
    good_reviews = ' '.join(list(data[data.Rating == 5].ReviewText.dropna())).lower()
    
    # Get rid of punctuation
    punctuations = ['.', '!', '?', ';', ',', '(', ')', '/']
    for char in punctuations:
        bad_reviews = bad_reviews.replace(char, ' ')
        good_reviews = good_reviews.replace(char, ' ')
        
    # Lemmatize ==> get good or bad words (not unique) and all words together (unique)
    lemmatizer = WordNetLemmatizer()
    all_bad = [lemmatizer.lemmatize(word) for word in bad_reviews.split()]
    all_good = [lemmatizer.lemmatize(word) for word in good_reviews.split()]
    words = set(all_bad + all_good)
    
    # Fill out dictionaries:
    # frequency_bad_to_good: ratio of frequencies at which word appears, in bad reviews versus good reviews
    # tf_bad (tf_good): term frequency of word in bad (good) reviews
    frequency_bad_to_good, tf_bad, tf_good = {}, {}, {}
    total_bad, total_good = len(all_bad), len(all_good)
    for word in words:
        tf_bad[word] = all_bad.count(word) / total_bad
        tf_good[word] = all_good.count(word) / total_good
        frequency_bad_to_good[word] = tf_bad[word] / tf_good[word] if tf_good[word] else np.inf

    # Words to exclude: standard English stopwords plus the sentiment words listed above
    excluded_words = set(stopwords.words('english')) | set(review_stopwords)

    # Get key bad words, key good words
    key_bad_words = [word for word in words
         if frequency_bad_to_good[word] > ratio_threshold
         and tf_bad[word] > min_frequency
         and word not in excluded_words]
    key_good_words = [word for word in words
         if frequency_bad_to_good[word] < 1 / ratio_threshold
         and tf_good[word] > min_frequency
         and word not in excluded_words]
    
    # Keep only limit-# of key words, for each group 
    # Keep those with highest term frequency
    key_bad_words = sorted(key_bad_words, key = lambda word: tf_bad[word], reverse = True)[:limit]
    key_good_words = sorted(key_good_words, key = lambda word: tf_good[word], reverse = True)[:limit]
    
    return key_good_words, key_bad_words

In [751]:
# Print colors (for formatting)
class color:
   PURPLE = '\033[95m'
   CYAN = '\033[96m'
   DARKCYAN = '\033[36m'
   BLUE = '\033[94m'
   GREEN = '\033[92m'
   YELLOW = '\033[93m'
   RED = '\033[91m'
   BOLD = '\033[1m'
   UNDERLINE = '\033[4m'
   END = '\033[0m'
    
# Product names (For formatting)
product_names = {'TurmericSupplement': "Nature's Nutrition Turmeric Supplements",
                 'N95': 'Fangtian N95 Masks',
                 'PampersWipes' : 'Pampers Baby Wipes',
                 'Nike': "Nike Men's Sneakers"}


# Print key words
def print_key_words(product: str, key_good_words: list, key_bad_words: list):
    print(color.BOLD + product_names[product] + color.END + '\n')

    print(color.BOLD + color.GREEN + 'Key Words from Happy Customers:' + color.END)
    print(color.BOLD + color.GREEN + ',  '.join(key_good_words) + color.END + '\n')

    print(color.BOLD + color.RED + 'Key Words from Unhappy Customers: ' + color.END)
    print(color.BOLD + color.RED + ',  '.join(key_bad_words) + color.END + '\n')
    return


# Print review snippets
def print_review_snippets(data, product, key_good_words, key_bad_words):
    print(color.BOLD + product_names[product] + color.END + '\n')
    # Good and Bad reviews
    good = ' '.join(list(data[data.Rating == 5].ReviewText.dropna()))
    bad = ' '.join(list(data[data.Rating.isin([1,2])].ReviewText.dropna()))

    for ctr in range(2):
        # Define variables if GOOD (ctr = 0) or BAD (ctr = 1)
        if ctr == 0:
            reviews = good
            key_words = key_good_words
            mood_color = color.GREEN
            print(color.BOLD + 'Snippets of Reviews from HAPPY Customers' + color.END)
        elif ctr == 1:
            reviews = bad
            key_words = key_bad_words
            mood_color = color.RED
            print(color.BOLD + 'Snippets of Reviews from UNHAPPY Customers' + color.END)

        # Get all sentences from reviews
        sentences = [x.split('!') for x in reviews.split('.')]
        sentences = [item for sublist in sentences for item in sublist] 

        for word in key_words:
            # Get sentences that contain the key word
            highlighted_sentences = [sentence for sentence in sentences if word in sentence.split(' ')]

            # Print which key word you're looking at
            print('\n' + color.BOLD + mood_color + word.capitalize() + color.END + '\n')

            # Format sentences and print
            for sentence in highlighted_sentences[:3]:
                sentence_pieces = sentence.split(word)
                formatted_sentence = ''
                for piece in sentence_pieces[:-1]:
                    formatted_sentence += piece
                    formatted_sentence += color.BOLD + mood_color + word + color.END
                formatted_sentence += sentence_pieces[-1]
                formatted_sentence = formatted_sentence.strip().strip(')').strip() + '.'
                print(formatted_sentence)

        print('\n')
    return

See below for the positive keywords and negative keywords for each product. 

In [747]:
products = ['PampersWipes', 'TurmericSupplement', 'Nike', 'N95']
for product in products:
    data = load_data(product)
    key_good_words, key_bad_words = get_key_words(data)
    print_key_words(product, key_good_words, key_bad_words)

Pampers Baby Wipes

Key Words from Happy Customers:
soft,  price,  work,  always,  gift

Key Words from Unhappy Customers: 
rash,  dry,  bottom,  know,  old

Nature's Nutrition Turmeric Supplements

Key Words from Happy Customers:
pain,  help,  joint,  inflammation,  arthritis

Key Words from Unhappy Customers: 
bottle,  smell,  taste,  label,  per

Nike Men's Sneakers

Key Words from Happy Customers:
fit,  comfortable,  price,  husband,  support

Key Words from Unhappy Customers: 
month,  squeak,  toe,  week,  sole

Fangtian N95 Masks

Key Words from Happy Customers:
comfortable,  n95,  feel,  seal,  glass

Key Words from Unhappy Customers: 
small,  bought,  uncomfortable,  size,  smell



Some comments on the keywords above:
* Some key words are informative only when we account for the context - whether the key word came from positive or negative reviews. For example, the keyword 'price' under Pampers Baby Wipes is immediately informative because it comes from happy customers; people believe the price is good. The keyword 'month' under Nike Men's Sneakers comes from unhappy customers, so we know that something goes wrong over the course of a month-long timescale. 
* A few keywords, especially 'know' (Pampers Baby Wipes), 'per' (Turmeric Supplements), and 'bought' (N95 Masks), don't tell us anything on their own. Fortunately these are uncommon. 
* The N95 mask keyword 'glass' should actually be 'glasses'. The keyword became 'glass' because of the lemmatizer I used. I don't know how I would avoid this problem, except manually or with a lemmatizer that can take context into account. 
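
The 'manual' option mentioned in the last point could look something like the sketch below, which wraps a lemmatizer with a keep-list of words that should be left untouched. Everything here is hypothetical: `lemmatize_with_exceptions` and `toy_lemmatize` are made-up names, and `toy_lemmatize` is only a crude stand-in so that the example runs without NLTK.

```python
def lemmatize_with_exceptions(word, lemmatize, keep_as_is=frozenset({"glasses"})):
    """Apply `lemmatize` to a word unless it is on the manual keep-list."""
    return word if word in keep_as_is else lemmatize(word)

# Toy stand-in for a real lemmatizer, used only to keep this example self-contained
def toy_lemmatize(word):
    if word.endswith("es"):
        return word[:-2]
    if word.endswith("s"):
        return word[:-1]
    return word

tokens = ["glasses", "masks", "boxes"]
protected = [lemmatize_with_exceptions(t, toy_lemmatize) for t in tokens]
print(protected)
```

In the notebook, the second argument would be the `lemmatize` method of the `WordNetLemmatizer` instance. The obvious drawback is that the keep-list has to be maintained by hand for each product category.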

Next, I print review snippets for each product. I believe these overviews are concise and possibly even more useful than raw Amazon reviews, because they remove distracting and time-consuming-to-read text and group together reviews by topic. However, this is up to personal opinion.

One flaw is that each snippet is only a single sentence containing the key word, and so some sentences are not informative because they are taken out of the context of their review. The context could be improved with more sentences in each snippet, but this would come at the cost of conciseness.
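
The trade-off between context and conciseness can be explored with a sketch like the one below, which returns each matching sentence together with a configurable number of neighbouring sentences. The function name `snippets_with_context`, the `context` parameter, and the sample text are all made up for illustration. Unlike the notebook's code, it also splits on question marks and matches keywords case-insensitively.

```python
import re

def snippets_with_context(text, keyword, context=1, limit=3):
    """Return up to `limit` snippets: each sentence containing `keyword`,
    plus `context` sentences before and after it."""
    # Split after sentence-ending punctuation, keeping the punctuation
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    snippets = []
    for i, sentence in enumerate(sentences):
        words = re.findall(r"[a-z0-9']+", sentence.lower())
        if keyword in words:
            start, end = max(0, i - context), min(len(sentences), i + context + 1)
            snippets.append(" ".join(sentences[start:end]))
    return snippets[:limit]

# Made-up review text
review = "I bought these last year. They squeak when I walk. Would not buy again."
print(snippets_with_context(review, "squeak", context=0))
print(snippets_with_context(review, "squeak", context=1))
```

Setting `context=0` gives a single sentence per snippet, while larger values give more context at the cost of longer output.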

Scroll down to see the review snippets. Note that this is otherwise the end of our Jupyter notebook.

In [752]:
products = ['PampersWipes', 'TurmericSupplement', 'Nike', 'N95']
for product in products:
    data = load_data(product)
    key_good_words, key_bad_words = get_key_words(data)
    print_review_snippets(data, product, key_good_words, key_bad_words)

Pampers Baby Wipes

Snippets of Reviews from HAPPY Customers

Soft

I would highly recommend if you’re baby had a sensitive bottom, I love how soft they are too.
:) Absolutely the best for babies and young kids, soft for tender skin including their face.
Very soft and much larger then all the other brands I've tried.

Price

These are definitely my favorite wipes for the price.
and u get only one pack for the same price.
You get more for price of 2 wipes.

Work

After food shopping and touching the wagons , after lunch especially if you have pizza or finger foods , at the laundromat if your detergent spills on your hands , and on a hot day coming home from work it’s a great way to start makeup removal , at the gym to freshen up as well.
We need to buy laptop for work so I can make more money.
I use these at work and 