Dataset Documentation

A. General Information:

- Dataset Name: We call this dataset 'Amazon Sales'. It contains ratings and reviews for over a thousand Amazon products, as listed on Amazon's official website.
- Source: The dataset was obtained from Kaggle, a Google subsidiary that hosts published datasets. It was scraped from Amazon's official website using Python with BeautifulSoup and WebDriver.
- Date of Collection: This dataset was collected in 2022.
- Owner/Provider: The information originates from Amazon's official website and was collected by Karkavelraja J., a contributor on the Kaggle platform.
- License: This dataset is released under the CC BY-NC-SA 4.0 license, which permits sharing and adaptation. Appropriate credit must be given, a link to the license must be provided, and any changes to the dataset must be indicated. The material may not be used for commercial purposes, and adaptations must be distributed under the same license.

In [19]:
# Load the required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer

# Download the Amazon Sales dataset from the link below and save it to your device
# https://www.kaggle.com/datasets/karkavelrajaj/amazon-sales-dataset/data

# Change your working directory, if needed, to the folder that contains the file
import os
os.getcwd()
# os.chdir("...")

In [None]:
# 1. Data Inspection and removal of unnecessary columns
amazon_sales = pd.read_csv("amazon.csv")

# View the first few rows of the data set
amazon_sales.head()

In [None]:
# View the data type of each feature
amazon_sales.dtypes

B. Dataset Description

The first few rows of the data set are shown above. Every feature is stored with the pandas `object` dtype, meaning the values are read in as strings. There are 1465 entries and 16 columns.

The Amazon Sales dataset is a single table with the following features:
1. `product_id`: Product ID of the item purchased
2. `product_name`: Name of the product
3. `category`: Category the product belongs to
4. `discounted_price`: The discounted price of the product
5. `actual_price`: The actual price of the product
6. `discount_percentage`: Percentage of the discount of its original price
7. `rating`: Rating of the product on Amazon's website, a scale from 0 to 5
8. `rating_count`: Number of people who voted for the Amazon rating for that product
9. `about_product`: A description of the product
10. `user_id`: User ID of the consumer that purchased the product
11. `user_name`: The username of the consumer
12. `review_id`: ID of the review given by the consumer who purchased the product
13. `review_title`: Title of the review given
14. `review_content`: Content of the review given by the consumer
15. `img_link`: A link containing an image of the product
16. `product_link`: The link of the product on the Amazon website
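
The feature list above can double as a simple schema check after loading. The following is a minimal sketch, assuming the column names are spelled exactly as listed; the helper `check_columns` and the `toy` frame are hypothetical and only serve as illustration.

```python
import pandas as pd

# The 16 column names from the feature list above
EXPECTED_COLUMNS = [
    "product_id", "product_name", "category", "discounted_price",
    "actual_price", "discount_percentage", "rating", "rating_count",
    "about_product", "user_id", "user_name", "review_id",
    "review_title", "review_content", "img_link", "product_link",
]

def check_columns(df, expected=EXPECTED_COLUMNS):
    """Return (missing, unexpected) column names relative to the expected list."""
    missing = [c for c in expected if c not in df.columns]
    unexpected = [c for c in df.columns if c not in expected]
    return missing, unexpected

# Hypothetical empty frame that lacks one column and carries one extra column
toy = pd.DataFrame(
    columns=[c for c in EXPECTED_COLUMNS if c != "img_link"] + ["extra_col"]
)
missing, unexpected = check_columns(toy)
print("missing:", missing)
print("unexpected:", unexpected)
```

On the real data, `check_columns(amazon_sales)` should return two empty lists if the file matches the description.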

In [None]:
amazon_sales.isnull().sum()

The output shows that the `rating_count` feature has 2 missing entries, which is about 0.14% of all observations (2 of 1465).
There is also a missing entry for the `rating` feature, which is about 0.068% of all observations (1 of 1465).
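
The share of missing values per column can be computed directly from the null counts. The following is a minimal sketch; the helper `missing_summary` and the `toy` rows are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

def missing_summary(df):
    """Return the count and percentage of missing values for each column."""
    counts = df.isnull().sum()
    percent = (counts / len(df) * 100).round(3)
    return pd.DataFrame({"missing": counts, "percent": percent})

# Hypothetical toy frame for illustration only
toy = pd.DataFrame({
    "rating": ["4.2", "4.0", None, "3.9"],
    "rating_count": ["24,269", None, None, "94,363"],
})
summary = missing_summary(toy)
print(summary)
```

On the real data, `missing_summary(amazon_sales)` reports the same counts as `isnull().sum()` together with the percentage of rows affected.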

In [None]:
amazon_sales.duplicated().value_counts()

C. Data Quality:
- There are missing values associated with two features in the data set.
- There are no duplicate rows in the data, as seen above, so no de-duplication is needed.
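
Had duplicate rows been present, they could be counted and removed as in the minimal sketch below. The `toy` frame is hypothetical and deliberately contains one repeated row.

```python
import pandas as pd

# Hypothetical toy frame with one fully duplicated row
toy = pd.DataFrame({
    "product_id": ["A1", "A2", "A2", "A3"],
    "rating": ["4.2", "4.0", "4.0", "3.9"],
})

# duplicated() marks every repeat of an earlier row as True
n_duplicate_rows = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each row
deduplicated = toy.drop_duplicates().reset_index(drop=True)

print(n_duplicate_rows)
print(deduplicated)
```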

D. Data Structure
- Schema Diagram (if applicable, include an entity-relationship diagram of how the data tables relate to each other)
- Relationships: Describe any relationships between different tables using key pairing

Preprocessing Documentation

Feature Selection:

First, we remove several features, namely `about_product`, `user_id`, `user_name`, `rating_count`, `review_content`, `review_id`, `review_title`, `img_link` and `product_link`, because they are not needed in later analyses that combine this dataset with others in a relational database. (TO BE DETERMINED)

In [39]:
# Remove unnecessary columns
amazon_sales = amazon_sales.drop(columns = ['about_product', 'user_id', 'user_name', 'rating_count', 'review_content', 'review_id', 'review_title', 'img_link', 'product_link'])

The code chunk below converts the features `actual_price`, `discounted_price`, `rating` and `discount_percentage` to numeric data types.
It then renames the columns so that each name states its unit.

Comment: `actual_price` and `discounted_price`, which are in rupees, are converted to USD at a fixed rate of 0.012 USD per rupee, because USD is the standard currency across all our relational databases.

In [None]:
# 2. Data type conversions (strip symbols so the numeric columns can be parsed)

# Remove the currency symbol and thousands separators from the price columns
for col in ['actual_price', 'discounted_price']:
    amazon_sales[col] = (amazon_sales[col].astype(str)
                         .str.replace('₹', '', regex=False)
                         .str.replace(',', '', regex=False))
    amazon_sales[col] = pd.to_numeric(amazon_sales[col])

# Remove the percent sign from the discount column
amazon_sales['discount_percentage'] = pd.to_numeric(
    amazon_sales['discount_percentage'].astype(str).str.replace('%', '', regex=False))

# Strip any stray '|' from the ratings. regex=False treats '|' as a literal
# character; errors='coerce' turns values that cannot be parsed into NaN
amazon_sales['rating'] = pd.to_numeric(
    amazon_sales['rating'].str.replace('|', '', regex=False), errors='coerce')

# Convert prices from rupees to USD at a fixed rate of 0.012 USD per rupee
amazon_sales['actual_price'] = (amazon_sales['actual_price'] * 0.012).round(2)
amazon_sales['discounted_price'] = (amazon_sales['discounted_price'] * 0.012).round(2)

amazon_sales = amazon_sales.rename(columns = {'actual_price': 'actual_price (USD)', 'discounted_price': 'discounted_price (USD)', 'discount_percentage': 'discount (%)'})
amazon_sales.head()
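
The price clean-up can also be packaged as a reusable function. The following is a minimal sketch, assuming the same fixed rate of 0.012 USD per rupee; the helper name `rupees_to_usd` and the sample price strings are hypothetical.

```python
import pandas as pd

INR_TO_USD = 0.012  # fixed conversion rate, as used in the cell above

def rupees_to_usd(prices, rate=INR_TO_USD):
    """Parse strings such as '₹1,099' and return the prices in USD."""
    cleaned = (
        prices.astype(str)
        .str.replace("₹", "", regex=False)   # drop the currency symbol
        .str.replace(",", "", regex=False)   # drop thousands separators
    )
    # Values that cannot be parsed become NaN instead of raising an error
    return (pd.to_numeric(cleaned, errors="coerce") * rate).round(2)

# Hypothetical sample values for illustration only
toy_prices = pd.Series(["₹1,099", "₹399", "not a price"])
usd = rupees_to_usd(toy_prices)
print(usd)
```

Keeping the rate in a named constant makes it easier to update if a different exchange rate is chosen later.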

We encode the feature `category` as a categorical variable. Before encoding, we trim each entry to its top-level category, which gives a shorter and cleaner label (for example 'Electronics' or 'Computers&Accessories'). Since we have decided on 'Home Appliances' as one of our main categories, we map 'Home&Kitchen' and 'HomeImprovement' to 'Home Appliances', and we map 'Computers&Accessories' to 'Electronics'.


In [None]:
amazon_sales['category'] = amazon_sales['category'].str.split('|').str[0]
amazon_sales['category'].unique()

There are 9 distinct top-level categories of products in this dataset. We want to focus our project analysis on just three main categories: Electronics, Home Appliances and Clothing.
Therefore, we remove all observations whose category is one of "MusicalInstruments", "OfficeProducts", "Toys&Games", "Car&Motorbike" or "Health&PersonalCare".

In [None]:
remove_set = {"MusicalInstruments", "OfficeProducts", "Toys&Games", "Car&Motorbike", "Health&PersonalCare"}
# Take a copy so that later column assignments do not act on a view of the original frame
amazon_sales = amazon_sales[~amazon_sales['category'].isin(remove_set)].copy()
amazon_sales['category'].unique()

In [None]:
# Merge the remaining categories into the project's main categories
amazon_sales['category'] = amazon_sales['category'].replace({
    "Home&Kitchen": "Home Appliances",
    "HomeImprovement": "Home Appliances",
    "Computers&Accessories": "Electronics",
})

amazon_sales['category'] = amazon_sales['category'].astype('category')
amazon_sales['category'].unique()
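
The trimming and merging steps can be combined into one function. The following is a minimal sketch; the helper `top_level_category` and the sample category paths are hypothetical, while the mapping mirrors the replacements in the cell above.

```python
import pandas as pd

# Mapping from top-level category to the project's main category
CATEGORY_MAP = {
    "Home&Kitchen": "Home Appliances",
    "HomeImprovement": "Home Appliances",
    "Computers&Accessories": "Electronics",
}

def top_level_category(raw):
    """Keep the text before the first '|' and merge it into a main category."""
    top = raw.str.split("|").str[0]
    return top.replace(CATEGORY_MAP).astype("category")

# Hypothetical category paths for illustration only
toy = pd.Series([
    "Computers&Accessories|Accessories|Cables",
    "Electronics|Televisions",
    "Home&Kitchen|Kettles",
])
result = top_level_category(toy)
print(result.tolist())
```

Labels that are not in the mapping, such as 'Electronics', pass through unchanged.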

We remove the rows with a missing `rating` rather than imputing them, because very few observations are affected (fewer than 5 in the original data set), and we assume that removing these rows will not significantly affect the data analysis.

In [None]:
amazon_sales.dropna(subset = ['rating'], inplace = True)
amazon_sales.isnull().sum()
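
To confirm that only a handful of rows are lost, the number of dropped rows can be reported alongside the result. The following is a minimal sketch; the helper `drop_missing` and the `toy` frame are hypothetical.

```python
import numpy as np
import pandas as pd

def drop_missing(df, column):
    """Drop rows where `column` is missing and report how many were removed."""
    before = len(df)
    result = df.dropna(subset=[column]).copy()
    return result, before - len(result)

# Hypothetical toy frame with one missing rating
toy = pd.DataFrame({
    "product_id": ["A1", "A2", "A3"],
    "rating": [4.2, np.nan, 3.9],
})
cleaned, n_removed = drop_missing(toy, "rating")
print(f"removed {n_removed} row(s), {len(cleaned)} remain")
```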

No duplicate records were found in our earlier inspection of the data set.

As there are no date and time records, date and time handling for this data set is not required.

Below, we handle outliers using z-scores, removing observations whose absolute z-score is 3 or greater in either price column.

In [32]:
# 4. Handling outliers
# Use visualization plots to identify potential outliers in numerical columns
# sns.boxplot(x = amazon_sales["actual_price (USD)"])

# Use Z-score normalization to identify and remove outliers
from scipy import stats

# Select the two price columns by name; rating and discount are excluded
# because their values already lie within a small, bounded range
amazon_numeric = amazon_sales[['discounted_price (USD)', 'actual_price (USD)']]


#  Calculate the z-scores for the subset of the dataset
z_scores = stats.zscore(amazon_numeric)

# Convert the z-scores to absolute values and filter out the outliers
z_scores = np.abs(z_scores)
filters = (z_scores < 3).all(axis=1)
amazon_sales = amazon_sales[filters]

In [None]:
filters.value_counts()

Using the z-score filter above, with -3 and 3 as the thresholds for removing outliers in the numerical columns, we found many possible outliers in the `discounted_price (USD)` feature.
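
For reference, the z-score of a value x is z = (x - mean) / std, and a row is kept only if |z| < 3 in every selected column. The following is a minimal sketch of the same filter written with pandas only; the helper `zscore_filter` and the toy prices are hypothetical. It uses the population standard deviation (`ddof=0`), which is the default of `stats.zscore`.

```python
import pandas as pd

def zscore_filter(df, columns, threshold=3.0):
    """Keep rows whose absolute z-score is below `threshold` in all `columns`."""
    values = df[columns]
    # Population standard deviation (ddof=0), matching the scipy default
    z = (values - values.mean()) / values.std(ddof=0)
    keep = (z.abs() < threshold).all(axis=1)
    return df[keep], keep

# Hypothetical prices: 19 ordinary values and one extreme value
actual = [float(v) for v in range(10, 29)] + [500.0]
toy = pd.DataFrame({
    "actual_price (USD)": actual,
    "discounted_price (USD)": [v / 2 for v in actual],
})
filtered, keep = zscore_filter(
    toy, ["actual_price (USD)", "discounted_price (USD)"]
)
print(keep.value_counts())
```

One difference to be aware of: pandas `mean` and `std` skip missing values, whereas `stats.zscore` with its default `nan_policy` propagates them, so the two approaches can differ if the price columns contain NaN.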

For now, we will not add any new features to the data set.

Data storage

- We will store the cleaned and preprocessed data in a database.
- The cleaned data is stored as a CSV file, the same format as the original data. If the cleaned data needs to be stored in a SQL database, we can accommodate that.
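
Both storage options can be sketched in a few lines. The following is a minimal example, assuming SQLite as the SQL database; the file name `amazon_sales_cleaned.csv`, the table name `amazon_sales` and the `toy` rows are hypothetical choices for illustration.

```python
import sqlite3
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-in for the cleaned data
toy = pd.DataFrame({
    "product_id": ["A1", "A2"],
    "category": ["Electronics", "Home Appliances"],
    "rating": [4.2, 3.9],
})

# Option 1: write a CSV file (a temporary folder is used here)
out_dir = Path(tempfile.mkdtemp())
csv_path = out_dir / "amazon_sales_cleaned.csv"
toy.to_csv(csv_path, index=False)

# Option 2: write to a SQL table (an in-memory SQLite database is used here)
conn = sqlite3.connect(":memory:")
toy.to_sql("amazon_sales", conn, index=False, if_exists="replace")
round_trip = pd.read_sql_query("SELECT * FROM amazon_sales", conn)
conn.close()

print(round_trip)
```

With the real data, `toy` would be replaced by `amazon_sales` and the paths by the project's storage location.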