# Sampling Large Datasets
In data processing, a great deal of computing involves analysing large amounts of text mixed with numerical data. This is the kind of workload Spark is particularly suited for. Sampling is an essential pre-processing step when building a machine learning proof of concept.

## RecBole dataset
RecBole is a powerful training and evaluation platform for recommender systems. It has many built-in datasets (https://recbole.io/dataset_list.html), some of which are too large to process on a single computer. I will use Spark to preprocess them and shrink their size.

In [1]:
!pip install pyspark

Collecting pyspark
  Downloading pyspark-3.5.1.tar.gz (317.0 MB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
[?25h  Created wheel for pyspark: filename=pyspark-3.5.1-py2.py3-none-any.whl size=317488493 sha256=5c414f85c4eb88f571626769d4fee3db60653989828d045e0930b53ef5cce719
  Stored in directory: /root/.cache/pip/wheels/80/1d/60/2c256ed38dddce2fdd93be545214a63e02fbd8d74fb0b7f3a6
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.5.1


In [2]:

!rm url.yaml
!wget https://raw.githubusercontent.com/RUCAIBox/RecBole/master/recbole/properties/dataset/url.yaml
!pip install pyyaml

import yaml

# Specify the path to the YAML file
file_path = "url.yaml"

# Open the file and load the YAML contents
with open(file_path, "r") as file:
    dataset_urls = yaml.safe_load(file)
   
# print only the first 5 entries
for key in list(dataset_urls.keys())[:5]:
    print(key, ":", dataset_urls[key])

rm: cannot remove 'url.yaml': No such file or directory
--2024-05-03 23:40:36--  https://raw.githubusercontent.com/RUCAIBox/RecBole/master/recbole/properties/dataset/url.yaml
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 16548 (16K) [text/plain]
Saving to: 'url.yaml'


2024-05-03 23:40:36 (27.2 MB/s) - 'url.yaml' saved [16548/16548]

adult : https://recbole.s3-accelerate.amazonaws.com/ProcessedDatasets/Adult/adult.zip
alibaba-ifashion : https://recbole.s3-accelerate.amazonaws.com/ProcessedDatasets/Alibaba-iFashion/Alibaba-iFashion.zip
aliec : https://recbole.s3-accelerate.amazonaws.com/ProcessedDatasets/AliEC/AliEC.zip
amazon-apps-for-android : https://recbole.s3-accelerate.amazonaws.com/ProcessedDatasets/Amazon_ratings/Amazon_Apps_for_Android.z

Set the datasets to download and process

In [3]:
datasets_to_download = ['amazon-digital-music', 'amazon-video-games']
# datasets_to_download = ['amazon-cds-vinyl']
import os
# Path to the folder where the zip file will be extracted
input_folder_path = "input"

# Create input folder if it doesn't exist
os.makedirs(input_folder_path, exist_ok=True)
    
# Path to the folder where processed file will be saved
output_folder_path = "output"

# Create output folder if it doesn't exist
os.makedirs(output_folder_path, exist_ok=True)

In [4]:
!pip install requests
import requests
import zipfile
import io

def download_upzip(url, dataset_name):
    # Download the zip file into memory and fail early on HTTP errors
    response = requests.get(url)
    response.raise_for_status()
    zip_file = zipfile.ZipFile(io.BytesIO(response.content))

    # Extract the zip file to the specified folder of dataset_name
    folder_path = os.path.join(input_folder_path, dataset_name)
    if not os.path.exists(folder_path):
        os.makedirs(folder_path)
    zip_file.extractall(folder_path)

    #TODO: if extracted file is a directory, move all files to the parent directory
    # for root, dirs, files in os.walk(folder_path):
    #     for file in files:
    #         os.rename(os.path.join(root, file), os.path.join(folder_path, file))
    #     for dir in dirs:
    #         os.rmdir(os.path.join(root, dir))

    # Close the zip file
    zip_file.close()

# download all datasets listed in datasets_to_download
for dataset in datasets_to_download:
    download_upzip(dataset_urls[dataset], dataset)

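The TODO in the cell above (moving files out of a nested folder after extraction) could be handled along these lines. This is only a sketch: `flatten_directory` is a hypothetical helper that is not part of the notebook, and it assumes file names do not collide once they are moved up.

```python
import os
import shutil
import tempfile

def flatten_directory(folder_path):
    """Move every file below folder_path up into folder_path, then remove the emptied subfolders."""
    # Walk bottom-up so that each subfolder is already empty when we remove it
    for root, _dirs, files in os.walk(folder_path, topdown=False):
        if root == folder_path:
            continue
        for name in files:
            target = os.path.join(folder_path, name)
            if os.path.exists(target):
                # Assumption: refuse to overwrite instead of silently losing a file
                raise FileExistsError(f"{target} already exists")
            shutil.move(os.path.join(root, name), target)
        os.rmdir(root)

# Demo on a throwaway directory that mimics a zip extracted into a nested folder
with tempfile.TemporaryDirectory() as tmp:
    nested = os.path.join(tmp, "Nested_Folder")
    os.makedirs(nested)
    for name in ("demo.inter", "demo.item"):
        with open(os.path.join(nested, name), "w") as f:
            f.write("placeholder\n")
    flatten_directory(tmp)
    flattened = sorted(os.listdir(tmp))

print(flattened)
```

In the notebook this would be called as `flatten_directory(folder_path)` right after `extractall`, so that the `.inter` and `.item` files always sit directly under `input/<dataset>`.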


In [5]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Amazon Sampling").getOrCreate()
spark.catalog.clearCache()

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/05/03 23:41:16 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [6]:
from pyspark.sql.functions import col, when, count

# read from file into dataframe
dfs = {}
for dataset in datasets_to_download:
    dataset_path = os.path.join(input_folder_path, dataset)
    dfs[dataset] = {}
    for file in os.listdir(dataset_path):
        file_path = os.path.join(dataset_path, file)
        df = spark.read.option("delimiter",'\t').option("header", True).csv(file_path)
        dfs[dataset][file] = df
        print(f"Dataset: {dataset}, File: {file}")
        df.show(5)
        
        print(f'num of {file}:',df.count())

        # check the uniqueness of keys; we assume a key column name ends with _id before the :token suffix, e.g. item_id:token
        # find the headers ending with _id:token (the loop variable is named c so it does not shadow the imported col function)
        key_columns = [c for c in df.columns if c.endswith('_id:token')]
        for key_column in key_columns:
            print(f"Number of disintict {key_column}:", df.select(key_column).distinct().count())
            

        # check the completeness of each column
        print("Number of non-null values in each column:")
        df.select([count(when(col(c).isNotNull() , c)).alias(c) for c in df.columns]).show()

Dataset: amazon-digital-music, File: Amazon_Digital_Music.item
+-------------+--------------------+-----------+----------------+----------------+--------------------+-----------+
|item_id:token|         title:token|price:float|sales_type:token|sales_rank:float|categories:token_seq|brand:token|
+-------------+--------------------+-----------+----------------+----------------+--------------------+-----------+
|   5555991584|     Memory of Trees|       9.49|           Music|        939190.0|'CDs & Vinyl', 'P...|       NULL|
|   6308051551|Don't Drink His B...|       8.91|            NULL|            NULL|'Digital Music', ...|       NULL|
|   7901622466|             On Fire|      11.33|           Music|         58799.0|'CDs & Vinyl', 'C...|       NULL|
|   B0000000ZW|      Changing Faces|      23.64|           Music|         68784.0|'CDs & Vinyl', 'P...|       NULL|
|   B00000016W|          Pet Sounds|       9.49|           Music|         77205.0|'CDs & Vinyl', 'C...|       NULL|
+--------

                                                                                

Number of disintict item_id:token: 279899
Number of non-null values in each column:


                                                                                

+-------------+-----------+-----------+----------------+----------------+--------------------+-----------+
|item_id:token|title:token|price:float|sales_type:token|sales_rank:float|categories:token_seq|brand:token|
+-------------+-----------+-----------+----------------+----------------+--------------------+-----------+
|       279899|       7321|      86174|            9531|            9531|              279899|        554|
+-------------+-----------+-----------+----------------+----------------+--------------------+-----------+

Dataset: amazon-digital-music, File: Amazon_Digital_Music.inter
+--------------+-------------+------------+---------------+
| user_id:token|item_id:token|rating:float|timestamp:float|
+--------------+-------------+------------+---------------+
|A2EFCYXHNK06IS|   5555991584|         5.0|      978480000|
|A1WR23ER5HMAA9|   5555991584|         5.0|      953424000|
|A2IR4Q0GPAFJKW|   5555991584|         4.0|     1393545600|
|A2V0KUVAB9HSYO|   5555991584|         4

                                                                                

Number of disintict user_id:token: 478235


                                                                                

Number of disintict item_id:token: 266414
Number of non-null values in each column:


                                                                                

+-------------+-------------+------------+---------------+
|user_id:token|item_id:token|rating:float|timestamp:float|
+-------------+-------------+------------+---------------+
|       836006|       836006|      836006|         836006|
+-------------+-------------+------------+---------------+

Dataset: amazon-video-games, File: Amazon_Video_Games.inter
+--------------+-------------+------------+---------------+
| user_id:token|item_id:token|rating:float|timestamp:float|
+--------------+-------------+------------+---------------+
| AB9S9279OZ3QO|   0078764343|         5.0|     1373155200|
|A24SSUT5CSW8BH|   0078764343|         5.0|     1377302400|
| AK3V0HEBJMQ7J|   0078764343|         4.0|     1372896000|
|A10BECPH7W8HM7|   043933702X|         5.0|     1404950400|
|A2PRV9OULX1TWP|   043933702X|         5.0|     1386115200|
+--------------+-------------+------------+---------------+
only showing top 5 rows



                                                                                

num of Amazon_Video_Games.inter: 1324753


                                                                                

Number of disintict user_id:token: 826767


                                                                                

Number of disintict item_id:token: 50210
Number of non-null values in each column:


                                                                                

+-------------+-------------+------------+---------------+
|user_id:token|item_id:token|rating:float|timestamp:float|
+-------------+-------------+------------+---------------+
|      1324753|      1324753|     1324753|        1324753|
+-------------+-------------+------------+---------------+

Dataset: amazon-video-games, File: Amazon_Video_Games.item
+-------------+-----------+----------------+----------------+--------------------+-----------+-----------+
|item_id:token|price:float|sales_type:token|sales_rank:float|categories:token_seq|title:token|brand:token|
+-------------+-----------+----------------+----------------+--------------------+-----------+-----------+
|   0078764343|      37.98|     Video Games|         28655.0|'Games', 'Video G...|       NULL|       NULL|
|   043933702X|       23.5|     Video Games|         44080.0|'Games', 'Video G...|       NULL|       NULL|
|   0439339987|       8.95|     Video Games|         49836.0|'Games', 'Video G...|       NULL|       NULL|
|  

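The column headers shown above follow a `name:type` pattern (`token`, `token_seq`, `float`). Because the files are read without a schema, Spark treats every column as a string, so the type suffix is the only hint about the intended type. A small plain-Python sketch of how such a header can be split and how key columns can be picked out; `parse_atomic_header` and `key_fields` are hypothetical helper names, not RecBole or Spark functions.

```python
def parse_atomic_header(header_line, sep="\t"):
    """Split a header such as 'user_id:token<TAB>rating:float' into (name, type) pairs."""
    fields = []
    for column in header_line.rstrip("\n").split(sep):
        # partition keeps working even if a column has no ':type' suffix
        name, _, feature_type = column.partition(":")
        fields.append((name, feature_type))
    return fields

def key_fields(fields):
    """Return the names of columns that look like keys: '<something>_id' with type 'token'."""
    return [name for name, feature_type in fields
            if name.endswith("_id") and feature_type == "token"]

header = "user_id:token\titem_id:token\trating:float\ttimestamp:float"
fields = parse_atomic_header(header)
keys = key_fields(fields)
print(fields)
print(keys)
```

This mirrors the `endswith('_id:token')` check used in the cell above, just with the name and the type separated.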

## Data Processing

In [7]:
inter_map = {}
# map each dataset to its interaction (.inter) file
for dataset in datasets_to_download:
    dataset_path = os.path.join(input_folder_path, dataset)
    for file in os.listdir(dataset_path):
        if file.endswith('.inter'):
            inter_map[dataset] = file

### Filter out inactive users and items

In [8]:
user_inter_threshold = 10
item_inter_threshold = 10

# filter out users and items with fewer interactions than the thresholds
for dataset in datasets_to_download:
    print('-----------------------------------')
    print(f"Dataset: {dataset}")
    inter_df = dfs[dataset][inter_map[dataset]]
    
    print(f'num of iteractions:',inter_df.count())

    # print(f'num of {inter_map[dataset]}:',inter_df.count())
    print(f'num of user_id:',inter_df.select('user_id:token').distinct().count())
    print(f'num of item_id:',inter_df.select('item_id:token').distinct().count())
    # count the number of interactions for each user and item and rename the count column
    user_count_df = inter_df.groupBy('user_id:token').count().withColumnRenamed('count','count_user')
    item_count_df = inter_df.groupBy('item_id:token').count().withColumnRenamed('count','count_item')

    # append the count of user and item to the original df
    inter_df = inter_df.join(user_count_df, on='user_id:token', how='inner')
    inter_df = inter_df.join(item_count_df, on='item_id:token', how='inner')
    inter_df.show(5)
    
    # keep only rows whose user and item both reach their thresholds
    inter_df = inter_df.filter((col('count_user') >= user_inter_threshold) & (col('count_item') >= item_inter_threshold))
    
    print(f'filtered num of iteractions:',inter_df.count())
    
    # replace the stored DataFrame with the filtered one (helper count columns dropped)
    dfs[dataset][inter_map[dataset]] = inter_df.drop('count_user','count_item')
    

-----------------------------------
Dataset: amazon-digital-music
num of iteractions: 836006


                                                                                

num of user_id: 478235


                                                                                

num of item_id: 266414


                                                                                

+-------------+--------------+------------+---------------+----------+----------+
|item_id:token| user_id:token|rating:float|timestamp:float|count_user|count_item|
+-------------+--------------+------------+---------------+----------+----------+
|   B0000001Q8|A3U3UXV7VQ6GGD|         5.0|     1222387200|         6|         3|
|   B0000001Q8|A247RM73M8B176|         4.0|     1108166400|         1|         3|
|   B0000001Q8|A1X67QWGL8QVX9|         5.0|     1168300800|        15|         3|
|   B0000001SH|A1LIOQNKOYCWBN|         5.0|     1370995200|         1|         5|
|   B0000001SH|A29PW3RBYPME61|         5.0|     1363564800|         1|         5|
+-------------+--------------+------------+---------------+----------+----------+
only showing top 5 rows



                                                                                

filtered num of iteractions: 65344
-----------------------------------
Dataset: amazon-video-games
num of iteractions: 1324753


                                                                                

num of user_id: 826767


                                                                                

num of item_id: 50210


                                                                                

+-------------+--------------+------------+---------------+----------+----------+
|item_id:token| user_id:token|rating:float|timestamp:float|count_user|count_item|
+-------------+--------------+------------+---------------+----------+----------+
|   B00014WNE6|A100UZ3LRLU135|         4.0|     1120780800|         1|        36|
|   B002I0HBZW|A100VTYZQ17B4I|         5.0|     1305590400|         1|       413|
|   B000TKB28K|A101SY98T3JQMU|         5.0|     1231632000|         1|        47|
|   B00BMFIXT2|A10584T58O3B5Y|         5.0|     1394841600|        21|       810|
|   B000GCGB3M|A105XFXLYACZZZ|         5.0|     1355184000|         1|       139|
+-------------+--------------+------------+---------------+----------+----------+
only showing top 5 rows



                                                                                

filtered num of iteractions: 141608

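The filter above counts interactions once and then drops rows in a single pass, so a user who passed the threshold before filtering can end up below it afterwards, once some of their items are removed. The toy example below, in plain Python rather than Spark, shows the single-pass logic and an optional repeat-until-stable variant. Both function names and the toy data are made up for illustration.

```python
from collections import Counter

def filter_by_activity(interactions, min_user=2, min_item=2):
    """Single pass: keep rows whose user and item both reach their thresholds."""
    user_counts = Counter(user for user, _ in interactions)
    item_counts = Counter(item for _, item in interactions)
    return [(user, item) for user, item in interactions
            if user_counts[user] >= min_user and item_counts[item] >= min_item]

def filter_until_stable(interactions, min_user=2, min_item=2):
    """Repeat the single pass until no more rows are removed."""
    current = list(interactions)
    while True:
        filtered = filter_by_activity(current, min_user, min_item)
        if len(filtered) == len(current):
            return filtered
        current = filtered

toy = [("u1", "a"), ("u1", "b"), ("u2", "a"), ("u2", "b"), ("u3", "a"), ("u3", "c")]
single_pass = filter_by_activity(toy)
stable = filter_until_stable(toy)
print(len(single_pass), len(stable))
```

In the toy data, item `c` is dropped in the first pass, which leaves `u3` with a single interaction; only the repeated variant removes `u3` as well. Whether that stricter guarantee is needed depends on the downstream model.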

### Output overlapping users between datasets

In [9]:
# list of output folders
output_folder_list = []

In [10]:
base_dataset = datasets_to_download[0]
# find the common users between base_dataset and other datasets
for j in range(1,len(datasets_to_download)):
        dataset1 = base_dataset
        dataset2 = datasets_to_download[j]
        inter1 = dfs[dataset1][inter_map[dataset1]]
        inter2 = dfs[dataset2][inter_map[dataset2]]
        inter1.createOrReplaceTempView("inter1")
        inter2.createOrReplaceTempView("inter2")

        print(f"Common users between {dataset1} and {dataset2}")    
        # get the distinct users and then intersect
        inter1_dist = inter1.select('user_id:token').distinct()
        # inter1_dist.show(5)
        common_users = inter1_dist.join(inter2, inter1_dist['user_id:token'] == inter2['user_id:token'],'leftsemi')
        common_users.show(5)

        print(f'num of common_users:',common_users.count())
        # print the items count of each inter of common users
        inter1_com_user = inter1.join(common_users, 'user_id:token')
        inter2_com_user = inter2.join(common_users, 'user_id:token')
        # statistics of inter1
        inter1_com_user_count = inter1_com_user.count()
        inter1_com_item_count = inter1_com_user.select('item_id:token').distinct().count()
        print(f'num of interactino of common users in {dataset1}:',inter1_com_user_count)
        print(f'num of related items in the interaction:',inter1_com_item_count)
        print(f'density of {dataset1} inetraction :',inter1.count()/inter1_com_user_count/inter1_com_item_count)

        # save filtered datasets to file
        inter1_out_path = os.path.join(output_folder_path, f"{dataset1}_{dataset2}")
        inter1_com_user.show(5)
        inter1_com_user.repartition(1).write.option("header", "true").csv(inter1_out_path, mode='overwrite', sep='\t')
        output_folder_list.append(inter1_out_path)
        # output_folder_map[dataset1] = inter1_out_path
        
        # statistics of inter2
        inter2_com_user_count = inter2_com_user.count()
        inter2_com_item_count = inter2_com_user.select('item_id:token').distinct().count()
        print(f'num of interactino of common users in {dataset2}:',inter2_com_user_count)
        print(f'num of related items in the interaction:',inter2_com_item_count)
        print(f'density of {dataset2} inetraction :',inter2.count()/inter2_com_user_count/inter2_com_item_count)

        # save filtered datasets to file
        inter2_out_path = os.path.join(output_folder_path, f"{dataset2}_{dataset1}")
        inter2_com_user.show(5) 
        inter2_com_user.repartition(1).write.option("header", "true").csv(inter2_out_path, mode='overwrite', sep='\t')
        output_folder_list.append(inter2_out_path)
        # output_folder_map[dataset2] = inter2_out_path

Common users between amazon-digital-music and amazon-video-games


                                                                                

+--------------+
| user_id:token|
+--------------+
|A2P6QCZWW3H1X6|
|A37Z81LW79DUZ8|
|A1MCQLJGZ2ODCK|
|A2Z4H7PQHDPUWF|
|A15OGDJS69EUCP|
+--------------+
only showing top 5 rows



                                                                                

num of common_users: 265


                                                                                

num of interactino of common users in amazon-digital-music: 5366
num of related items in the interaction: 3002


                                                                                

density of amazon-digital-music inetraction : 0.004056433492096088


                                                                                

+--------------+-------------+------------+---------------+
| user_id:token|item_id:token|rating:float|timestamp:float|
+--------------+-------------+------------+---------------+
|A105S56ODHGJEK|   B00000613H|         5.0|      990230400|
|A105S56ODHGJEK|   B00001QGPS|         4.0|      944611200|
|A105S56ODHGJEK|   B000002LJF|         5.0|      987120000|
|A105S56ODHGJEK|   B000003C3V|         5.0|      944870400|
|A105S56ODHGJEK|   B0000013GT|         5.0|      953769600|
+--------------+-------------+------------+---------------+
only showing top 5 rows



                                                                                

num of interactino of common users in amazon-video-games: 9055
num of related items in the interaction: 5264


                                                                                

density of amazon-video-games inetraction : 0.002970868669847722


                                                                                

+--------------+-------------+------------+---------------+
| user_id:token|item_id:token|rating:float|timestamp:float|
+--------------+-------------+------------+---------------+
|A2P6QCZWW3H1X6|   B003QOWPWS|         4.0|     1379116800|
|A2P6QCZWW3H1X6|   B0016BVYA2|         4.0|     1372636800|
|A2P6QCZWW3H1X6|   B00FM5IY4W|         2.0|     1389744000|
|A2P6QCZWW3H1X6|   B002BSA298|         3.0|     1379030400|
|A2P6QCZWW3H1X6|   B00D7NQP9M|         4.0|     1383868800|
+--------------+-------------+------------+---------------+
only showing top 5 rows



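A note on the density figures printed above: interaction density is usually defined as interactions / (users × items), whereas the expression in the cell divides the full filtered interaction count by the common-user interaction count and item count. A minimal sketch of the conventional definition, fed with the counts reported above (265 common users); `interaction_density` is a hypothetical helper name.

```python
def interaction_density(num_interactions, num_users, num_items):
    """Fraction of the user-item matrix that is filled."""
    if num_users == 0 or num_items == 0:
        return 0.0
    return num_interactions / (num_users * num_items)

# Counts taken from the output above for the 265 common users
density_music = interaction_density(5366, 265, 3002)
density_games = interaction_density(9055, 265, 5264)
print(density_music, density_games)
```

With these counts the densities come out at roughly 0.0067 for amazon-digital-music and 0.0065 for amazon-video-games.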
                                                                                

In [11]:
# find the common users among all downloaded datasets
for dataset in datasets_to_download:
    inter = dfs[dataset][inter_map[dataset]]
    inter.createOrReplaceTempView("inter")

    print(f"Common users among all datasets")    
    # get the distinct users and then intersect
    inter_dist = inter.select('user_id:token').distinct()
    inter_dist.show(3)
    if dataset == datasets_to_download[0]:
        common_users = inter_dist
    else:
        common_users = common_users.join(inter_dist, 'user_id:token','inner')
    print(f'num of common_users after merge with {dataset}:',common_users.count())

common_users.show(3)

# export inter of common users to file
for dataset in datasets_to_download:
    inter = dfs[dataset][inter_map[dataset]]
    inter.createOrReplaceTempView("inter")
    inter_com_user = inter.join(common_users, 'user_id:token')
    inter_com_user_count = inter_com_user.count()
    inter_com_item_count = inter_com_user.select('item_id:token').distinct().count()
    print(f'num of interactino of common users in {dataset}:',inter_com_user_count)
    print(f'num of {dataset} :',inter_com_item_count)
    print(f'density of {dataset} inetraction :',inter.count()/inter_com_user_count/inter_com_item_count)

    # save filtered datasets to file
    inter_out_path = os.path.join(output_folder_path, f"{dataset}_common")
    inter_com_user.show(5)
    inter_com_user.repartition(1).write.option("header", "true").csv(inter_out_path, mode='overwrite', sep='\t')
    output_folder_list.append(inter_out_path)
    # output_folder_map[dataset] = inter_out_path

Common users among all datasets


                                                                                

+--------------+
| user_id:token|
+--------------+
|A1KISBM4ST9O3U|
|A2XQT9ZMXJ8NF3|
|A3B01BU84T197B|
+--------------+
only showing top 3 rows



                                                                                

num of common_users after merge with amazon-digital-music: 5729
Common users among all datasets


                                                                                

+--------------+
| user_id:token|
+--------------+
| A12LH2100CKQO|
|A1NTA4K5DS2V80|
|A2P6QCZWW3H1X6|
+--------------+
only showing top 3 rows



                                                                                

num of common_users after merge with amazon-video-games: 265


                                                                                

+--------------+
| user_id:token|
+--------------+
|A2P6QCZWW3H1X6|
|A37Z81LW79DUZ8|
|A2Z4H7PQHDPUWF|
+--------------+
only showing top 3 rows



                                                                                

num of interactino of common users in amazon-digital-music: 5366
num of amazon-digital-music : 3002


                                                                                

density of amazon-digital-music inetraction : 0.004056433492096088


                                                                                

+--------------+-------------+------------+---------------+
| user_id:token|item_id:token|rating:float|timestamp:float|
+--------------+-------------+------------+---------------+
|A105S56ODHGJEK|   B00000613H|         5.0|      990230400|
|A105S56ODHGJEK|   B00001QGPS|         4.0|      944611200|
|A105S56ODHGJEK|   B000002LJF|         5.0|      987120000|
|A105S56ODHGJEK|   B000003C3V|         5.0|      944870400|
|A105S56ODHGJEK|   B0000013GT|         5.0|      953769600|
+--------------+-------------+------------+---------------+
only showing top 5 rows



                                                                                

num of interactino of common users in amazon-video-games: 9055
num of amazon-video-games : 5264


                                                                                

density of amazon-video-games inetraction : 0.002970868669847722


                                                                                

+--------------+-------------+------------+---------------+
| user_id:token|item_id:token|rating:float|timestamp:float|
+--------------+-------------+------------+---------------+
|A105S56ODHGJEK|   B0057FANEQ|         3.0|     1323820800|
|A105S56ODHGJEK|   B000MUW98O|         3.0|     1205971200|
|A105S56ODHGJEK|   B001NX6GBK|         4.0|     1270252800|
|A105S56ODHGJEK|   B004R9OVEG|         5.0|     1307664000|
|A105S56ODHGJEK|   B0058SHNF4|         4.0|     1326672000|
+--------------+-------------+------------+---------------+
only showing top 5 rows



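The loop above keeps joining the running set of common users with the next dataset's distinct users. The same idea in plain Python is a set intersection folded over all datasets. The sketch below uses made-up toy data, and `users_in_all` is a hypothetical helper name.

```python
from functools import reduce

def users_in_all(datasets):
    """Return the users that appear in every dataset's interaction list."""
    user_sets = [{user for user, _ in rows} for rows in datasets.values()]
    if not user_sets:
        return set()
    return reduce(set.intersection, user_sets)

toy = {
    "music": [("u1", "m1"), ("u2", "m2"), ("u3", "m3")],
    "games": [("u2", "g1"), ("u3", "g2"), ("u4", "g3")],
    "books": [("u3", "b1"), ("u5", "b2")],
}
shared = users_in_all(toy)
# Restrict every dataset to the shared users, as the export step does
restricted = {name: [row for row in rows if row[0] in shared]
              for name, rows in toy.items()}
print(shared, restricted)
```

With more than two datasets the intersection can shrink quickly, which is worth checking before exporting.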
                                                                                

In [12]:
dataset_itemfile_map = {}
def get_itemfile_path(dataset):
    dataset_path = os.path.join(input_folder_path, dataset)
    for file in os.listdir(dataset_path):
        if file.endswith('.item'):
            return os.path.join(dataset_path, file)
    return None

for output_folder in output_folder_list:
    # take the dataset name from the first part of the folder name
    dataset = os.path.basename(output_folder).split('_')[0]
    # copy the .item file from the corresponding input folder to the output folder
    itemfile_path = get_itemfile_path(dataset)
    if itemfile_path:
        print(f"copy from {itemfile_path} to {output_folder} for {dataset} ")
        out_path = os.path.join(output_folder, f"{dataset}.item")
        !cp $itemfile_path $out_path
    else:
        print(f"item file not found for {dataset}")

    for file in os.listdir(output_folder):
        # rename the exported csv as .inter
        if file.endswith('.csv'):
            # rename file to {dataset}.inter
            file_path = os.path.join(output_folder, file)
            out_path = os.path.join(output_folder, f"{dataset}.inter")
            !mv $file_path $out_path

copy from input/amazon-digital-music/Amazon_Digital_Music.item to output/amazon-digital-music_amazon-video-games for amazon-digital-music 
copy from input/amazon-video-games/Amazon_Video_Games.item to output/amazon-video-games_amazon-digital-music for amazon-video-games 
copy from input/amazon-digital-music/Amazon_Digital_Music.item to output/amazon-digital-music_common for amazon-digital-music 
copy from input/amazon-video-games/Amazon_Video_Games.item to output/amazon-video-games_common for amazon-video-games 

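The `!cp` and `!mv` lines above only work inside IPython. If the same step has to run as a plain Python script, something like the sketch below could replace the rename. It assumes the output folder holds exactly one `part-*.csv` file (which is what `repartition(1)` is used for above), and `rename_single_part_file` is a hypothetical helper name.

```python
import glob
import os
import shutil
import tempfile

def rename_single_part_file(folder, new_name):
    """Rename the single part-*.csv file in folder to new_name and return the new path."""
    parts = glob.glob(os.path.join(folder, "part-*.csv"))
    if len(parts) != 1:
        raise ValueError(f"expected exactly one part file in {folder}, found {len(parts)}")
    target = os.path.join(folder, new_name)
    shutil.move(parts[0], target)
    return target

# Demo on a throwaway folder that imitates a Spark CSV output directory
with tempfile.TemporaryDirectory() as tmp:
    for name in ("part-00000-demo.csv", "_SUCCESS", ".part-00000-demo.csv.crc"):
        with open(os.path.join(tmp, name), "w") as f:
            f.write("placeholder\n")
    rename_single_part_file(tmp, "demo.inter")
    remaining = sorted(os.listdir(tmp))

print(remaining)
```

Copying the `.item` file can be done the same way with `shutil.copy(itemfile_path, out_path)`.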

## Analyze Chronological Characteristics
TBD

## Sampling
Stratified sampling based on the hotness (interaction rate) of items.

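This step is not implemented in the notebook yet. As a starting point, here is a plain-Python sketch of one possible reading of it: rank items by interaction count, cut the ranking into equally sized strata, and draw the same fraction of items from each stratum. The function name, the number of strata and the toy data are all assumptions made for illustration.

```python
import random
from collections import Counter

def stratified_item_sample(interactions, fraction, n_strata=3, seed=0):
    """Keep the interactions of items sampled evenly across popularity strata."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    counts = Counter(item for _, item in interactions)
    if not counts:
        return []
    rng = random.Random(seed)
    # Rank items from most to least popular; ties are broken by item id
    ranked = sorted(counts, key=lambda item: (-counts[item], item))
    stratum_size = -(-len(ranked) // n_strata)  # ceiling division
    sampled_items = set()
    for start in range(0, len(ranked), stratum_size):
        stratum = ranked[start:start + stratum_size]
        # Draw at least one item per stratum so that no popularity level disappears
        k = max(1, round(len(stratum) * fraction))
        sampled_items.update(rng.sample(stratum, k))
    return [(user, item) for user, item in interactions if item in sampled_items]

# Toy data: item i0 has 9 interactions, i1 has 8, ..., i8 has 1
toy = [(f"u{n}", f"i{idx}") for idx in range(9) for n in range(9 - idx)]
sample = stratified_item_sample(toy, fraction=1 / 3)
kept_items = sorted({item for _, item in sample})
print(kept_items)
```

In Spark, a similar effect could be obtained by adding a stratum column to the interaction DataFrame and sampling per stratum, but that is left for the actual implementation.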
## Release all resources

In [13]:
# unpersist the dfs
for dataset in datasets_to_download:
    for key in dfs[dataset]:
        dfs[dataset][key].unpersist()
        
# Stop the Spark session
spark.stop()

## Summary
Spark is a powerful and efficient tool for sampling large-scale data.
* flexible and powerful functionality
* runs fast even on my laptop
* easy to apply to similar datasets (Amazon has datasets for different categories); I only focused on one category this time.