Websites such as Amazon contain large numbers of product reviews.  This presents a rich source of information that we can use to understand more about these products, as well as how people communicate positive and negative sentiments more generally.

The dataset comprises reviews of more than 1000 products.

The dataset is a CSV file containing a number of fields:

ID: Indicates a unique ID number for each product in the dataset
product_name: The name of the product, as displayed on the Amazon website
category: Each product is assigned to a single category indicating the type of product
noRatings: This represents the number of positive or negative ratings (not reviews) of the product
cost: How much the product sells for.  Note that many products have a price range rather than a single price, typically meaning they can be customized when purchased.
REVIEWLIST: A list of reviews for the product, expressed as a JSON string
product_url: A link to the product's page on Amazon

Note that there are some data integrity issues present in the file.  For example, some products are missing reviews and others are missing prices.
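
One way to see how widespread these gaps are is to count the missing values in each column before doing any analysis. The sketch below uses a tiny hand-made DataFrame in place of the real file: the rows are invented for illustration, only the column names come from the field list above, and missing_counts is a hypothetical helper name.

```python
import pandas as pd

# Invented stand-in rows; only the column names come from the dataset description
sample = pd.DataFrame({
    "ID": [0, 1, 2],
    "category": ["Pet Supplies", "Toys", "Toys"],
    "cost": ["$5.99", None, "$3.00 - $7.00"],
    "REVIEWLIST": ['[{"review_body": "great product"}]', None, "[]"],
})

def missing_counts(df):
    """Count the missing (NaN/None) entries in each column."""
    return df.isna().sum().to_dict()

counts = missing_counts(sample)
print(counts)
```

An empty JSON list such as "[]" is not counted as missing here, so products whose review list is present but empty would need a separate check.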

Using this dataset, you will have an opportunity to learn how to identify positive and negative sentiments by analysing the linguistic patterns in the reviews.

Task 1: Loading data

Implement the function task1() in task1.py that outputs a JSON file called task1.json in the following format:
{"Number of Products:": X, "Number of Categories:": Y}
where X and Y are the number of products and the number of categories respectively.

In [1]:
import numpy as np
import pandas as pd
import json

def task1():
    dataset = pd.read_csv(r'/course/data/dataset.csv')

    # count distinct products and distinct categories
    num_product = int(dataset['ID'].nunique())
    num_category = int(dataset['category'].nunique())

    json_dict = {"Number of Products:": num_product, "Number of Categories:": num_category}
    with open("task1.json", "w") as f:
        json.dump(json_dict, f)

    return 0


Task 2: Data aggregation

Each review contains a review_star field that records how many stars (out of a possible 5) the reviewer gave the product.  For example, a 5 star review contains the following:

"review_star": "a-icon a-icon-star a-star-5 review-rating"

Implement the function task2() in task2.py which determines the average review score for each product.  Any review with a missing or invalid review_star field should not be included in the calculation. 

If a product contains no valid reviews, you should assign it an average score of 0 or None.
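
A minimal sketch of these rules is shown below. It assumes that every valid review_star value contains a token of the form a-star-N with N between 1 and 5, as in the example above; the helper names parse_star and average_score are made up for illustration.

```python
import re

def parse_star(review_star):
    """Return the star rating as an int, or None if the field is missing or invalid."""
    if not isinstance(review_star, str):
        return None
    match = re.search(r"a-star-([1-5])\b", review_star)
    return int(match.group(1)) if match else None

def average_score(review_stars):
    """Average the valid ratings only; 0 when there are none."""
    valid = [s for s in map(parse_star, review_stars) if s is not None]
    return round(sum(valid) / len(valid), 2) if valid else 0

stars = [
    "a-icon a-icon-star a-star-5 review-rating",
    "a-icon a-icon-star a-star-4 review-rating",
    None,             # missing field: ignored
    "review-rating",  # no star token: ignored
]
print(average_score(stars))  # 4.5
print(average_score([]))     # 0
```

Dividing by the number of valid ratings, rather than by the number of reviews, keeps invalid entries from dragging the average down.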

Your function should save its output to a CSV file called task2.csv, which contains the following headings: ID, category, average_score. Each row in the file should contain the details of one product, with

ID and category containing the original values in the data file.

average_score being the average review score for each product

The rows in task2.csv should be in ascending order of ID.

In [2]:
import pandas as pd
import json
import csv
import re

def findscore(text):
    # extract the star rating from a string such as
    # "a-icon a-icon-star a-star-5 review-rating"; return None if invalid
    if not isinstance(text, str):
        return None
    match = re.search(r'a-star-([1-5])\b', text)
    return int(match.group(1)) if match else None

def task2():
    dataset = pd.read_csv(r'/course/data/dataset.csv')
    df = dataset[['ID', 'category', 'REVIEWLIST']]

    result_lst = []
    for product_id, category, reviews in df.itertuples(index=False):
        # a missing REVIEWLIST is read as NaN, so treat it as an empty list
        review_list = json.loads(reviews) if isinstance(reviews, str) else []

        # keep only the reviews with a valid star rating
        scores = [findscore(review.get('review_star')) for review in review_list]
        scores = [score for score in scores if score is not None]

        # products with no valid reviews get an average score of 0
        average = round(sum(scores) / len(scores), 2) if scores else 0
        result_lst.append([product_id, category, average])

    # rows must be in ascending order of ID
    result_lst.sort(key=lambda row: row[0])

    # generate csv
    header = ['ID', 'category', 'average_score']
    with open('task2.csv', 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(result_lst)

    return 0


Task 3: Calculating the average product price 

Each product comes with a cost field, which specifies the sale price of the item. The format of this field is not consistent - some products have a single cost whereas others have a price range.

Implement the function task3() in task3.py that calculates the average cost for each product:

If a product contains only a single price, that price should be the average cost.
If a product contains a price range (e.g. $X - $Y), then the average cost should be (X+Y)/2
If a product contains an invalid or missing price, then the average cost should be zero.

All average costs should be listed as a single numeric figure, rounded to two decimal places, with no dollar signs present.
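
The three rules above can be sketched as a single helper. This is an illustration under stated assumptions rather than a reference solution: it treats any field that does not contain exactly one or two dollar amounts as invalid, it assumes commas appear only as thousands separators, and the name average_cost is hypothetical.

```python
import re

def average_cost(cost):
    """Average the one or two dollar amounts in a cost string; 0.0 if missing or invalid."""
    if not isinstance(cost, str):
        return 0.0
    # Drop thousands separators, then pull out every dollar amount
    amounts = re.findall(r"\$\s*(\d*\.?\d+)", cost.replace(",", ""))
    if len(amounts) not in (1, 2):
        return 0.0
    return round(sum(float(a) for a in amounts) / len(amounts), 2)

print(average_cost("$12.50"))           # 12.5
print(average_cost("$10.00 - $20.00"))  # 15.0
print(average_cost("$1,299.99"))        # 1299.99
print(average_cost(None))               # 0.0
```

The helper returns a number, so a value such as 12.5 loses its trailing zero; formatting it with f"{value:.2f}" when writing the file restores the two decimal places.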

Your function should save its output to a CSV file called task3.csv, which contains the following headings: ID, category, average_cost. Each row in the file should contain the details of one product, with

ID and category containing the original values in the data file.

average_cost being the average cost as determined above.

The rows in task3.csv should be in ascending order of ID.

In [3]:
import pandas as pd
import csv
import re

def parse_cost(cost):
    # average the one or two dollar amounts in the cost field;
    # a missing or invalid price gives an average cost of zero
    if not isinstance(cost, str):
        return 0.0
    prices = re.findall(r'\$\s*(\d*\.?\d+)', cost.replace(',', ''))
    if len(prices) not in (1, 2):
        return 0.0
    return sum(float(p) for p in prices) / len(prices)

def task3():
    dataset = pd.read_csv(r'/course/data/dataset.csv')

    # one row per product, with the cost written to two decimal places
    # and no dollar sign
    average_cost = [[product_id, category, f'{parse_cost(cost):.2f}']
                    for product_id, category, cost
                    in zip(dataset['ID'], dataset['category'], dataset['cost'])]

    # rows must be in ascending order of ID
    average_cost.sort(key=lambda row: row[0])

    header = ['ID', 'category', 'average_cost']
    with open('task3.csv', 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(average_cost)

    return 0


Task 4: Plotting the average review score

For this task, consider only the 'Pet Supplies' category of products.

Implement the function task4() in task4.py to generate a plot allowing you to compare the average price with the average review score for each product in 'Pet Supplies'.  Save your plot as task4.png.

In [4]:
import pandas as pd
import matplotlib.pyplot as plt

def task4():
    task2_data = pd.read_csv(r'task2.csv')
    task3_data = pd.read_csv(r'task3.csv')

    # pair each product's score with its cost by ID rather than by row position
    merged = task2_data.merge(task3_data[['ID', 'average_cost']], on='ID')
    pet_supplies = merged[merged['category'] == 'Pet Supplies']

    plt.clf()
    plt.scatter(pet_supplies['average_score'], pet_supplies['average_cost'])
    plt.xlabel('Pet Supplies Average Score')
    plt.ylabel('Pet Supplies Average Price')
    plt.title('Average Price vs Average Review Score')
    plt.savefig('task4.png')

    return 0

Task 5: Comparing the review scores between categories

Implement the function task5() in task5.py which outputs a file called task5.png comparing the means of the average review scores of products in each category.
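
As a rough illustration of what "the mean of the average review scores" means, the sketch below groups a few invented rows, laid out like task2.csv, by category. The data values are made up; only the column names come from Task 2.

```python
import pandas as pd

# Invented rows in the shape of task2.csv
scores = pd.DataFrame({
    "ID": [0, 1, 2, 3],
    "category": ["Pet Supplies", "Pet Supplies", "Toys", "Toys"],
    "average_score": [4.0, 5.0, 3.0, 4.0],
})

# Mean of the per-product averages within each category
category_means = scores.groupby("category")["average_score"].mean()
print(category_means.to_dict())  # {'Pet Supplies': 4.5, 'Toys': 3.5}
```

Because this is a mean of per-product averages, every product carries the same weight regardless of how many reviews it has.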

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import csv 

def task5():
    task2_data = pd.read_csv(r'task2.csv')

    # mean of the per-product average scores in each category, lowest first
    avg_cat_score = task2_data.groupby('category')['average_score'].mean().round(2).sort_values()

    # Plot the data
    fig, ax = plt.subplots(figsize=(25, 10))
    bars = ax.barh(avg_cat_score.index, avg_cat_score.values)

    # label each bar with its value
    for bar, value in zip(bars.patches, avg_cat_score.values):
        ax.text(bar.get_width() + 0.02, bar.get_y() + 0.2, f'{value:.2f}',
                fontsize=12, color='red')

    ax.set_title('Average Review Scores of Products in each Category', fontsize=40)
    ax.set_xlabel('Mean Average Score', fontsize=20)
    ax.set_ylabel('Categories', fontsize=30)

    fig.savefig('task5.png')

    return 0

Task 6: Text processing

We would now like to develop a way of understanding whether a review is favourable or unfavourable towards a product, based on the text of the review.  To do this, we would like to consider each sequential pair of words in the review.  For example, intuitively if the sequence 'great product' appears in a review, we might conclude that the review is favourable to the product.  Building this type of system requires us to pre-process the text of the reviews.

The text content of a review is in the JSON formatted REVIEWLIST field's review_body value.

Implement the function task6() in task6.py that performs the following pre-processing steps on the content of the reviews:

1. Convert all non-alphabetic characters (for example, numbers, apostrophes and punctuation), except for spacing characters (for example, whitespaces, tabs and newlines) to single-space characters. For example, ‘&’ should be converted to ‘ ’. You should consider non-English alphabetic characters as non-alphabetic for the purposes of this conversion.

2. Convert all spacing characters such as tabs and newlines into single-space characters, and ensure that only one whitespace character exists between each token.

3. Change all uppercase characters to lowercase.

4. Remove all stop words in nltk’s list of English stop words from the review.

5. Remove all remaining words that are only one or two characters long from the review.

6. Generate each sequential pair of words that occur in the review (i.e. word bigrams without padding).  For example, the review 'great product great price' should generate the following list: ['great product', 'product great', 'great price']

Once steps 1 -- 6 are done, build a JSON file representing each review in the dataset.  The JSON file should contain a list of objects.  Each object should represent one review and contain the following key/value pairs:

score: containing the score for that review

bigrams: containing the list of word bigrams appearing in the review as described above

Any reviews that don't contain a valid score or contain no bigrams after pre-processing should be ignored.

Your file should be saved as task6.json. 
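
The pre-processing steps and the shape of one output object can be sketched as follows. To keep the example self-contained it uses a tiny stand-in stop word set instead of nltk's list, which is an assumption made purely for illustration; review_bigrams is a hypothetical helper name.

```python
import re

# Tiny stand-in for nltk's English stop word list (an assumption for this sketch)
STOP_WORDS = {"the", "is", "a", "and", "this", "it", "for"}

def review_bigrams(text, stop_words=STOP_WORDS):
    """Apply the six pre-processing steps to one review body and return its bigrams."""
    # Steps 1-3: runs of non-alphabetic characters become one space, then lowercase
    text = re.sub(r"[^A-Za-z]+", " ", text).lower()
    # Steps 4-5: drop stop words, then words of one or two characters
    words = [w for w in text.split() if w not in stop_words and len(w) > 2]
    # Step 6: pair each word with its successor (no padding)
    return [f"{first} {second}" for first, second in zip(words, words[1:])]

print(review_bigrams("great product great price"))
# ['great product', 'product great', 'great price']

# One object of the output list, for a hypothetical 5-star review
record = {"score": 5,
          "bigrams": review_bigrams("This is a GREAT product & it's great for the price!")}
print(record)
```

Replacing each run of non-alphabetic characters with one space handles steps 1 and 2 in a single pass, since split() then ignores any leading or trailing space.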

The creation of vocabulary should be implemented reasonably efficiently.  The run time of task 6 should be no more than 45 seconds. Excessively long execution time for this task will result in a deduction of up to 2 marks.


In [None]:
import pandas as pd
import json
import re
from nltk.corpus import stopwords
from nltk import ngrams

def findscore(text):
    # extract the star rating from a string such as
    # "a-icon a-icon-star a-star-5 review-rating"; return None if invalid
    if not isinstance(text, str):
        return None
    match = re.search(r'a-star-([1-5])\b', text)
    return int(match.group(1)) if match else None

def bigram(words):
    return [f"{word1} {word2}" for word1, word2 in ngrams(words, 2)]

def task6():
    dataset = pd.read_csv(r'/course/data/dataset.csv')

    # build the stop word set once rather than once per review
    stop_words = set(stopwords.words('english'))

    task6_lst = []
    for reviews in dataset['REVIEWLIST']:
        # a missing REVIEWLIST is read as NaN, so skip it
        if not isinstance(reviews, str):
            continue
        for review in json.loads(reviews):
            # use the score of this review, not the product's average score
            score = findscore(review.get('review_star'))
            review_body = review.get('review_body')
            if score is None or not isinstance(review_body, str):
                continue

            # convert runs of non-alphabetic and spacing characters into a
            # single space, then convert uppercase characters to lowercase
            review_body = re.sub(r'[^a-zA-Z]+', ' ', review_body).lower()

            # remove all stop words, then all remaining words that are
            # only one or two characters long
            words = [w for w in review_body.split()
                     if w not in stop_words and len(w) > 2]

            # generate word bigrams; ignore reviews that have none
            review_body_bigrams = bigram(words)
            if review_body_bigrams:
                task6_lst.append({'score': score, 'bigrams': review_body_bigrams})

    with open("task6.json", "w") as f:
        json.dump(task6_lst, f)

    return 0






Task 7: Detecting the most indicative bigrams of positive reviews

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from nltk.corpus import stopwords
from nltk import ngrams
import math
import json
import csv
import re

def findscore(text):
    num = re.findall(r'\d+', text)
    for i in num:
        return i

def bigram(text):
    return [f"{word1} {word2}" for word1, word2 in ngrams(text.split(), 2)]

# build the stop word set once rather than once per review
STOP_WORDS = set(stopwords.words('english'))

def process(text):
    # convert runs of non-alphabetic and spacing characters into a single space
    text = re.sub(r'[^a-zA-Z]+', ' ', text)

    # convert all uppercase characters to lowercase
    text = text.lower()

    # remove all stop words, then all remaining words that are only
    # one or two characters long
    words = [w for w in text.split() if w not in STOP_WORDS and len(w) > 2]

    # generate word bigrams
    return bigram(' '.join(words))

def task7():
    dataset = pd.read_csv(r'/course/data/dataset.csv')

    df = pd.DataFrame(dataset, columns = ['ID', 'category','REVIEWLIST'])
    
    ID = df['ID']
    category = df['category']
    review = df['REVIEWLIST']

    positive_review = {}
    negative_review = {}
    for i in range(len(ID)):
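        rating = review[i]
        # a missing REVIEWLIST is read as NaN, so treat it as an empty list
        rating_dict = json.loads(rating) if isinstance(rating, str) else []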

        if len(rating_dict)!= 0:
            for j in range(len(rating_dict)):
                star_review = rating_dict[j]['review_star']      
                num = findscore(star_review)  
                review_body = process(rating_dict[j]['review_body'])
                if (type(num) == str):
                    if len(review_body) == 0:
                        continue
                    if int(num) == 1:
                        for bigram in review_body:
                            if bigram in negative_review:
                                negative_review[bigram] += 1
                            else:
                                negative_review[bigram] = 1
                    elif int(num) == 5:
                        for bigram in review_body:
                            if bigram in positive_review:
                                positive_review[bigram] += 1
                            else:
                                positive_review[bigram] = 1

    # total number of bigram occurrences in the 5-star and in the 1-star reviews
    total_pos = sum(positive_review.values()) 
    total_neg = sum(negative_review.values())

    # replace each count with [proportion, odds], where odds = p / (1 - p);
    # odds of 0 or infinity are avoided because only bigrams that appear in
    # both dictionaries are compared below
    for bigram in positive_review:
        positive_review[bigram] = [positive_review[bigram]/total_pos, (positive_review[bigram]/total_pos)/(1 - positive_review[bigram]/total_pos)]
    for bigram in negative_review:
        negative_review[bigram] = [negative_review[bigram]/total_neg, (negative_review[bigram]/total_neg)/(1 - negative_review[bigram]/total_neg)]

    # log10 odds ratio for every bigram seen in both 5-star and 1-star reviews
    odd_ratio_dict = {}
    for bigram in positive_review:
        if bigram in negative_review:
            odds_ratio = positive_review[bigram][1]/negative_review[bigram][1]
            odd_ratio_dict[bigram] = round(math.log10(odds_ratio), 4)
    
    # TASK 7a
    task7a_data = dict(sorted(odd_ratio_dict.items(), key=lambda item: item[1]))
    
    header = ['bigram', 'log_odds_ratio']
    with open('task7a.csv', 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(header)
        # write every row through the csv writer so line endings are consistent
        writer.writerows(task7a_data.items())

    # TASK 7b
    log_odds_ratio_data = list(odd_ratio_dict.values())
    plt.clf()    
    task7b_plot = plt.hist(log_odds_ratio_data, bins = 20)
    plt.xlabel('Log odds ratio', fontsize = 30)
    plt.ylabel('Frequency', fontsize = 30)
    plt.title("Bigrams' log odds ratio and its Frequency", fontsize = 40)
    task7b = plt.savefig('task7b.png')

    # TASK 7c
    sorted_task7c_data = sorted(((value, key) for (key,value) in task7a_data.items()), reverse = True)
    task7c_data = {k: v for v, k in sorted_task7c_data}
    top10_bigram = list(task7c_data.keys())[0:10]
    top10_bigram_logoddratio = list(task7c_data.values())[0:10]
    # the ten lowest values, including the final entry
    last10_bigram = list(task7c_data.keys())[-10:]
    last10_bigram_logoddratio = list(task7c_data.values())[-10:]

    plt.clf()    
    fig, (top10, last10) = plt.subplots(2, figsize=(15, 25))
    top10.bar(top10_bigram, top10_bigram_logoddratio)
    top10.tick_params(axis='x', labelrotation=45, labelsize=15)
    last10.bar(last10_bigram, last10_bigram_logoddratio)
    last10.tick_params(axis='x', labelrotation=45, labelsize=15)

    fig.suptitle('Top 10 bigrams and Last 10 bigrams log odds ratio', fontsize=40)
    last10.set_xlabel('Bigrams', fontsize=30)
    top10.set_ylabel('Log odds ratio', fontsize=20)
    last10.set_ylabel('Log odds ratio', fontsize=20)
    plt.savefig('task7c.png')

    return 0
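
The calculation inside task7() can be illustrated on toy numbers. This is only a sketch of what the code above computes, not a statement of the task's official definition: p is a bigram's share of all bigram occurrences in 5-star (or 1-star) reviews, the odds are p / (1 - p), and the result is the base-10 logarithm of the ratio of the two odds. The counts and the name log_odds_ratio are invented for illustration.

```python
import math

def log_odds_ratio(pos_count, pos_total, neg_count, neg_total):
    """log10 of (odds in positive reviews) / (odds in negative reviews)."""
    p_pos = pos_count / pos_total
    p_neg = neg_count / neg_total
    odds_pos = p_pos / (1 - p_pos)
    odds_neg = p_neg / (1 - p_neg)
    return math.log10(odds_pos / odds_neg)

# Invented counts: the bigram accounts for 20 of 100 bigram occurrences in
# positive reviews and 5 of 100 in negative reviews
value = log_odds_ratio(20, 100, 5, 100)
print(round(value, 4))
```

A value above zero means the bigram is relatively more common in positive reviews, a value below zero means it is more common in negative reviews, and swapping the two groups flips the sign.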



