## Sampling large Datasets
In data processing, a great deal of computing involves analysing large amounts of text mixed with numerical data.  This is what Spark is particularly suited for. Sampling is an essential pre-processing for machine leanring for proof of concept

### Amazon dataset
The file Amazon_xx.inter comprising the ratings of users over the items.
Each record/line in the file has the following fields: user_id, item_id, rating and timestamp.

* user_id: the id of the users and its type is token. 
* item_id: the id of the items and its type is token.
* rating: the rating of the users over the item, and its type is float.
* timestamp: the UNIX time of the interaction, and its type is float.

The file Amazon_xx.item comprising the attributes of the items.
Each record/line in the file has the following fields: item_id, title, price, sales_type, sales_rank, brand, categories
 
* item_id: the id of the item and its type is token.
* title: the title of the item, and its type is token.
* price: the price of the item, and its type is float.
* sales_type: the type sales rank in, and its type is token. 
* sales_rank: sales rank, and its type is float.
* brand: the brand name of the item, and its type is token.
* categories: the categories of the item, and its type is token_seq.

In [1]:

!rm url.yaml
!wget https://raw.githubusercontent.com/RUCAIBox/RecBole/master/recbole/properties/dataset/url.yaml
!pip install pyyaml

import yaml

# Specify the path to the YAML file
file_path = "url.yaml"

# Open the file and load the YAML contents
with open(file_path, "r") as file:
    dataset_urls = yaml.safe_load(file)
   
# only print the first 5 lines
for key in list(dataset_urls.keys())[:5]:
    print(key, ":", dataset_urls[key])

--2024-04-13 22:25:12--  https://raw.githubusercontent.com/RUCAIBox/RecBole/master/recbole/properties/dataset/url.yaml
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 2606:50c0:8003::154, 2606:50c0:8001::154, 2606:50c0:8000::154, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|2606:50c0:8003::154|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 16548 (16K) [text/plain]
Saving to: ‘url.yaml’


2024-04-13 22:25:12 (470 KB/s) - ‘url.yaml’ saved [16548/16548]

adult : https://recbole.s3-accelerate.amazonaws.com/ProcessedDatasets/Adult/adult.zip
alibaba-ifashion : https://recbole.s3-accelerate.amazonaws.com/ProcessedDatasets/Alibaba-iFashion/Alibaba-iFashion.zip
aliec : https://recbole.s3-accelerate.amazonaws.com/ProcessedDatasets/AliEC/AliEC.zip
amazon-apps-for-android : https://recbole.s3-accelerate.amazonaws.com/ProcessedDatasets/Amazon_ratings/Amazon_Apps_for_Android.zip
amazon-automotive : https://recbole.s3-accelerate

Set the datasets to donwload and 

In [18]:
datasets_to_download = ['amazon-movies-tv','amazon-books']
# datasets_to_download = ['amazon-cds-vinyl']
import os
# Path to the folder where the zip file will be extracted
input_folder_path = "input"

# Create input folder if it doesn't exist
if not os.path.exists(input_folder_path):
    os.makedirs(input_folder_path)
    
# Path to the folder where processed file will be saved
output_folder_path = "output"

# Create out folder if it doesn't exist
if not os.path.exists(output_folder_path):
    os.makedirs(output_folder_path)

In [44]:
!pip install requests
import requests
import zipfile
import io

def download_upzip(url, dataset_name):
    # Download the zip file
    response = requests.get(url)
    zip_file = zipfile.ZipFile(io.BytesIO(response.content))

    # Extract the zip file to the specified folder of dataset_name
    folder_path = os.path.join(input_folder_path, dataset_name)
    if not os.path.exists(folder_path):
        os.makedirs(folder_path)
    zip_file.extractall(folder_path)

    #TODO: if extracted file is a directory, move all files to the parent directory
    # for root, dirs, files in os.walk(folder_path):
    #     for file in files:
    #         os.rename(os.path.join(root, file), os.path.join(folder_path, file))
    #     for dir in dirs:
    #         os.rmdir(os.path.join(root, dir))

    # Close the zip file
    zip_file.close()

#  download all dataset from datasets_to_download
for dataset in datasets_to_download:
    download_upzip(dataset_urls[dataset], dataset)



In [2]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Amazon Sampling").getOrCreate()

24/04/13 22:25:27 WARN Utils: Your hostname, qingdeMacBook-Pro-2.local resolves to a loopback address: 127.0.0.1; using 10.0.0.13 instead (on interface en0)
24/04/13 22:25:27 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/04/13 22:25:29 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
24/04/13 22:25:32 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.


In [11]:
from pyspark.sql.functions import col, when, count

# read from file into dataframe
dfs = {}
for dataset in datasets_to_download:
    dataset_path = os.path.join(input_folder_path, dataset)
    dfs[dataset] = {}
    for file in os.listdir(dataset_path):
        file_path = os.path.join(dataset_path, file)
        df = spark.read.option("delimiter",'\t').option("header", True).csv(file_path)
        dfs[dataset][file] = df
        print(f"Dataset: {dataset}, File: {file}")
        df.show(5)
        
        print(f'num of {file}:',df.count())

        # check the uniqueness of key, we assume key name is ending with _id bofore :token i.e. item_id:token
        # find the header ending with _id:token
        key_columns = [col for col in df.columns if col.endswith('_id:token')]
        for key_column in key_columns:
            print(f"Number of disintict {key_column}:", df.select(key_column).distinct().count())
            
        print()
        # check the completeness of each column
        df.select([count(when(col(c).isNotNull() , c)).alias(c) for c in df.columns]).show()

                                                                                

Dataset: amazon-movies-tv, File: Amazon_Movies_and_TV.item
+-------------+--------------------+--------------------+-----------+----------------+----------------+-----------+
|item_id:token|categories:token_seq|         title:token|price:float|sales_type:token|sales_rank:float|brand:token|
+-------------+--------------------+--------------------+-----------+----------------+----------------+-----------+
|   0000143561|'Movies', 'Movies...|Everyday Italian ...|      12.99|     Movies & TV|        376041.0|       null|
|   0000589012|'Movies', 'Movies...|Why Don't They Ju...|      15.95|     Movies & TV|       1084845.0|       null|
|   0000695009|'Movies', 'Movies...|Understanding Sei...|       null|     Movies & TV|       1022732.0|       null|
|   000107461X|'Movies', 'Movies...|Live in Houston [...|       null|     Movies & TV|        954116.0|       null|
|   0000143529|'Movies', 'Movies...|My Fair Pastry (G...|      19.99|     Movies & TV|        463562.0|       null|
+------------

                                                                                

num of Amazon_Movies_and_TV.item: 208328


                                                                                

Number of disintict item_id:token: 208326



                                                                                

+-------------+--------------------+-----------+-----------+----------------+----------------+-----------+
|item_id:token|categories:token_seq|title:token|price:float|sales_type:token|sales_rank:float|brand:token|
+-------------+--------------------+-----------+-----------+----------------+----------------+-----------+
|       208328|              208325|     107676|     155624|          204904|          204902|      12314|
+-------------+--------------------+-----------+-----------+----------------+----------------+-----------+

Dataset: amazon-movies-tv, File: Amazon_Movies_and_TV.inter
+--------------+-------------+------------+---------------+
| user_id:token|item_id:token|rating:float|timestamp:float|
+--------------+-------------+------------+---------------+
|A3R5OBKS7OM2IR|   0000143502|         5.0|     1358380800|
|A3R5OBKS7OM2IR|   0000143529|         5.0|     1380672000|
| AH3QC2PC1VTGP|   0000143561|         2.0|     1216252800|
|A3LKP6WPMP9UKX|   0000143588|         5.0| 

                                                                                

num of Amazon_Movies_and_TV.inter: 4607047


                                                                                

Number of disintict user_id:token: 2088620


                                                                                

Number of disintict item_id:token: 200941



                                                                                

+-------------+-------------+------------+---------------+
|user_id:token|item_id:token|rating:float|timestamp:float|
+-------------+-------------+------------+---------------+
|      4607047|      4607047|     4607047|        4607047|
+-------------+-------------+------------+---------------+

Dataset: amazon-books, File: Amazon_Books.item
+-------------+----------------+----------------+--------------------+--------------------+-----------+-----------+
|item_id:token|sales_type:token|sales_rank:float|categories:token_seq|         title:token|price:float|brand:token|
+-------------+----------------+----------------+--------------------+--------------------+-----------+-----------+
|   0001048791|           Books|       6334800.0|             'Books'|The Crucible: Per...|       null|       null|
|   0001048775|           Books|      13243226.0|             'Books'|Measure for Measu...|       null|       null|
|   0001048236|           Books|       8973864.0|             'Books'|The She

                                                                                

Number of disintict item_id:token: 2370604



                                                                                

+-------------+----------------+----------------+--------------------+-----------+-----------+-----------+
|item_id:token|sales_type:token|sales_rank:float|categories:token_seq|title:token|price:float|brand:token|
+-------------+----------------+----------------+--------------------+-----------+-----------+-----------+
|      2370604|         1891174|         1891163|             2370585|    1938767|    1679399|        106|
+-------------+----------------+----------------+--------------------+-----------+-----------+-----------+

Dataset: amazon-books, File: Amazon_Books.inter
+--------------+-------------+------------+---------------+
| user_id:token|item_id:token|rating:float|timestamp:float|
+--------------+-------------+------------+---------------+
| AH2L9G3DQHHAJ|   0000000116|         4.0|     1019865600|
|A2IIIDRK3PRRZY|   0000000116|         1.0|     1395619200|
|A1TADCM7YWPQ8M|   0000000868|         4.0|     1031702400|
| AWGH7V0BDOJKB|   0000013714|         4.0|     13831776

                                                                                

num of Amazon_Books.inter: 22507155


                                                                                

Number of disintict user_id:token: 8026324


                                                                                

Number of disintict item_id:token: 2330066





+-------------+-------------+------------+---------------+
|user_id:token|item_id:token|rating:float|timestamp:float|
+-------------+-------------+------------+---------------+
|     22507155|     22507155|    22507155|       22507155|
+-------------+-------------+------------+---------------+



                                                                                


## Data Processing

In [None]:
inter_map = {}
# analyze the sparse of the dataset
for dataset in datasets_to_download:
    dataset_path = os.path.join(input_folder_path, dataset)
    for file in os.listdir(dataset_path):
        if file.endswith('.inter'):
            inter_map[dataset] = file

### filter out inactive user/items

In [21]:
user_inter_threshold = 5
item_inter_threshold = 5

# filter out the user and item with less than threshold interactions
for dataset in datasets_to_download:
    print('-----------------------------------')
    print(f"Dataset: {dataset}")
    inter_df = dfs[dataset][inter_map[dataset]]
    inter_df.show(5)
    # print(f'num of {inter_map[dataset]}:',inter_df.count())
    print(f'num of user_id:',inter_df.select('user_id:token').distinct().count())
    print(f'num of item_id:',inter_df.select('item_id:token').distinct().count())
    user_count_df = inter_df.groupBy('user_id:token').count()
    filtered_df = user_count_df.filter(user_count_df['count'] >= user_inter_threshold)
    print(f'num of user_id: with interaction bigger than {user_inter_threshold}:',filtered_df.select('user_id:token').distinct().count())
    inter_df = inter_df.join(filtered_df, on='user_id:token', how='inner')
    
    item_count_df = inter_df.groupBy('item_id:token').count()
    filtered_df = item_count_df.filter(item_count_df['count'] >= item_inter_threshold)
    print(f'num of item_id: with interaction bigger than {item_inter_threshold}:',filtered_df.select('item_id:token').distinct().count())
    inter_df = inter_df.join(filtered_df, on='item_id:token', how='inner')
    
    dfs[dataset][inter_map[dataset]] = inter_df
    

Dataset: amazon-movies-tv
+--------------+-------------+------------+---------------+
| user_id:token|item_id:token|rating:float|timestamp:float|
+--------------+-------------+------------+---------------+
|A3R5OBKS7OM2IR|   0000143502|         5.0|     1358380800|
|A3R5OBKS7OM2IR|   0000143529|         5.0|     1380672000|
| AH3QC2PC1VTGP|   0000143561|         2.0|     1216252800|
|A3LKP6WPMP9UKX|   0000143588|         5.0|     1236902400|
| AVIY68KEPQ5ZD|   0000143588|         5.0|     1232236800|
+--------------+-------------+------------+---------------+
only showing top 5 rows



                                                                                

num of user_id: 2088620


                                                                                

num of item_id: 200941


24/04/13 23:00:54 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
24/04/13 23:00:54 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
24/04/13 23:00:54 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
24/04/13 23:00:54 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
24/04/13 23:00:54 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
24/04/13 23:00:54 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
24/04/13 23:00:54 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
24/04/13 23:00:54 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
                                                                

num of user_id: with interaction bigger than 5 139138


24/04/13 23:01:01 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
24/04/13 23:01:01 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
24/04/13 23:01:01 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
24/04/13 23:01:02 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
24/04/13 23:01:02 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
24/04/13 23:01:02 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
24/04/13 23:01:02 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
24/04/13 23:01:02 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
24/04/13 23:01:03 WARN RowBasedKeyValueBatch: Calling spill() on

num of item_id: with interaction bigger than 5 52474
Dataset: amazon-books
+--------------+-------------+------------+---------------+
| user_id:token|item_id:token|rating:float|timestamp:float|
+--------------+-------------+------------+---------------+
| AH2L9G3DQHHAJ|   0000000116|         4.0|     1019865600|
|A2IIIDRK3PRRZY|   0000000116|         1.0|     1395619200|
|A1TADCM7YWPQ8M|   0000000868|         4.0|     1031702400|
| AWGH7V0BDOJKB|   0000013714|         4.0|     1383177600|
|A3UTQPQPM4TQO0|   0000013714|         5.0|     1374883200|
+--------------+-------------+------------+---------------+
only showing top 5 rows



                                                                                

num of user_id: 8026324


                                                                                

num of item_id: 2330066


24/04/13 23:01:25 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
24/04/13 23:01:25 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
24/04/13 23:01:25 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
24/04/13 23:01:25 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
24/04/13 23:01:25 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
24/04/13 23:01:25 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
24/04/13 23:01:25 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
24/04/13 23:01:25 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
24/04/13 23:01:25 WARN RowBasedKeyValueBatch: Calling spill() on

num of user_id: with interaction bigger than 5 813534


24/04/13 23:01:45 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
24/04/13 23:01:45 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
24/04/13 23:01:45 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
24/04/13 23:01:45 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
24/04/13 23:01:45 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
24/04/13 23:01:45 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
24/04/13 23:01:45 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
24/04/13 23:01:46 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
24/04/13 23:01:46 WARN RowBasedKeyValueBatch: Calling spill() on

num of item_id: with interaction bigger than 5 413158


                                                                                

### Overlaped users between datasets 

In [22]:
for i in range(len(datasets_to_download)):
    for j in range(i+1, len(datasets_to_download)):
        dataset1 = datasets_to_download[i]
        dataset2 = datasets_to_download[j]
        inter1 = dfs[dataset1][inter_map[dataset1]]
        inter2 = dfs[dataset2][inter_map[dataset2]]
        inter1.createOrReplaceTempView("inter1")
        inter2.createOrReplaceTempView("inter2")
        
        print(f"Common users between {dataset1} and {dataset2}")    
        # get the distinct users and then intersect
        inter1_dist = inter1.select('user_id:token').distinct()
        inter1_dist.show(5)
        common_users = inter1_dist.join(inter2, inter1_dist['user_id:token'] == inter2['user_id:token'],'leftsemi')
        common_users.show(5)
    
        print(f'num of common_users:',common_users.count())
        # print the items count of each inter of common users
        inter1_com_user = inter1.join(common_users, 'user_id:token')
        inter2_com_user = inter2.join(common_users, 'user_id:token')
        # statictics of inter 1
        inter1_com_user_count = inter1_com_user.count()
        inter1_com_item_count = inter1_com_user.select('item_id:token').distinct().count()
        print(f'num of {dataset1} common_users:',inter1_com_user_count)
        print(f'num of {dataset1} interaction:',inter1_com_item_count)
        print(f'density of {dataset1} inetraction :',inter1.count()/inter1_com_user_count/inter1_com_item_count)
        
        # save filtered datasets to file
        inter1_out_path = os.path.join(output_folder_path, f"{dataset1}_common_user")
        inter1_com_user.write.option("header", "true").csv(inter1_out_path)
        
        # statictics of inter 2
        inter2_com_user_count = inter2_com_user.count()
        inter2_com_item_count = inter2_com_user.select('item_id:token').distinct().count()
        print(f'num of {dataset2} common_users:',inter2_com_user_count)
        print(f'num of {dataset2} interaction:',inter2_com_item_count)
        print(f'density of {dataset2} inetraction :',inter2.count()/inter2_com_user_count/inter2_com_item_count)
        
        # save filtered datasets to file
        outer2_out_path = os.path.join(output_folder_path, f"{dataset2}_common_user")
        inter2_com_user.write.option("header", "true").csv(outer2_out_path)

Common users between amazon-movies-tv and amazon-books


24/04/13 23:02:29 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
24/04/13 23:02:29 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
24/04/13 23:02:29 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
24/04/13 23:02:29 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
24/04/13 23:02:29 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
24/04/13 23:02:29 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
24/04/13 23:02:29 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
24/04/13 23:02:29 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
24/04/13 23:02:34 WARN RowBasedKeyValueBatch: Calling spill() on

+--------------+
| user_id:token|
+--------------+
| AB6TIXAB6ZI7V|
|A29R4FCO6RFX4K|
|A3H6ILRU4OBLD5|
| AH7UDG89WBTZU|
|A10NSL34GNSTTG|
+--------------+
only showing top 5 rows



24/04/13 23:02:46 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
24/04/13 23:02:46 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
24/04/13 23:02:46 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
24/04/13 23:02:46 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
24/04/13 23:02:46 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
24/04/13 23:02:46 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
24/04/13 23:02:46 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
24/04/13 23:02:46 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
24/04/13 23:02:57 WARN RowBasedKeyValueBatch: Calling spill() on

+--------------+
| user_id:token|
+--------------+
|A1006V961PBMKA|
|A100NGGXRQF0AQ|
|A100UD67AHFODS|
|A101OMG474Q26I|
|A102B8D74H64TO|
+--------------+
only showing top 5 rows



24/04/13 23:03:44 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
24/04/13 23:03:44 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
24/04/13 23:03:44 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
24/04/13 23:03:44 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
24/04/13 23:03:44 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
24/04/13 23:03:44 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
24/04/13 23:03:44 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
24/04/13 23:03:44 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
24/04/13 23:03:53 WARN RowBasedKeyValueBatch: Calling spill() on

num of common_users: 50885


24/04/13 23:04:40 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
24/04/13 23:04:40 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
24/04/13 23:04:40 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
24/04/13 23:04:40 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
24/04/13 23:04:40 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
24/04/13 23:04:40 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
24/04/13 23:04:40 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
24/04/13 23:04:40 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
24/04/13 23:04:50 WARN RowBasedKeyValueBatch: Calling spill() on

num of amazon-movies-tv common_users: 919625
num of amazon-movies-tv interaction: 52032


24/04/13 23:06:38 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
24/04/13 23:06:38 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
24/04/13 23:06:38 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
24/04/13 23:06:38 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
24/04/13 23:06:38 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
24/04/13 23:06:38 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
24/04/13 23:06:38 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
24/04/13 23:06:43 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
24/04/13 23:06:43 WARN RowBasedKeyValueBatch: Calling spill() on

density of amazon-movies-tv inetraction : 3.672020154345896e-05


24/04/13 23:06:57 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
24/04/13 23:06:58 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
24/04/13 23:06:58 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
24/04/13 23:06:58 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
24/04/13 23:06:58 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
24/04/13 23:06:58 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
24/04/13 23:06:58 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
24/04/13 23:06:58 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
24/04/13 23:06:58 WARN RowBasedKeyValueBatch: Calling spill() on

num of amazon-books common_users: 1393839
num of amazon-books interaction: 303405


24/04/13 23:08:56 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
24/04/13 23:08:56 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
24/04/13 23:08:56 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
24/04/13 23:08:56 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
24/04/13 23:08:56 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
24/04/13 23:08:56 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
24/04/13 23:08:56 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
24/04/13 23:08:56 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
24/04/13 23:08:57 WARN RowBasedKeyValueBatch: Calling spill() on

density of amazon-books inetraction : 2.2969738306390776e-05


24/04/13 23:09:30 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
24/04/13 23:09:30 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
24/04/13 23:09:30 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
24/04/13 23:09:30 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
24/04/13 23:09:30 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
24/04/13 23:09:30 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
24/04/13 23:09:30 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
24/04/13 23:09:30 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
24/04/13 23:09:42 WARN RowBasedKeyValueBatch: Calling spill() on

AnalysisException: [COLUMN_ALREADY_EXISTS] The column `count` already exists. Consider to choose another name or rename the existing column.

### Filter out the datasets
* filter datasets interaction is low trival(less than thresthold)

## Sampling
Stratified sampling based on hotness(interaction rate) of items

In [8]:
# bining based on the histogram
from pyspark.ml.feature import Bucketizer

#specify bin ranges and column to bin
bucketizer = Bucketizer(splits=[0, 5, 10, 15, 20, float('Inf')],
                        inputCol='points',
                        outputCol='bins')

#perform binning based on values in 'points' column
df_bins = bucketizer.setHandleInvalid('keep').transform(df)

#view new DataFrame
df_bins.show()

# sampling by bsed on  interaction rate of items


NameError: name 'df' is not defined

## Sammary
Spark is a powerful and efficient tool to handle sample on large scale of data. 
* flexible and powerful functionality
* runs super fast even on my laptop
* easy to apply to similar datasets(Amazon have dataset of different categories), I only focused on one categoy this time. 