<h1>Moby Pick - Reformat Interactions CSV</h1>

This notebook adapts the interactions CSV generated by the `PERSONALIZE_clean_books_data.csv` notebook into the format that Amazon Personalize requires for an interactions dataset, as described in the documentation:
https://docs.aws.amazon.com/personalize/latest/dg/VIDEO-ON-DEMAND-interactions-dataset.html#VIDEO-ON-DEMAND-interactions-schema
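
As a rough sketch of the target format, the interactions schema for this notebook's output could look like the following. The four field names match the columns produced later in this notebook (`USER_ID`, `ITEM_ID`, `TIMESTAMP`, `EVENT_TYPE`). The field types and the `name`, `namespace`, and `version` values are assumptions based on the general shape of Personalize Avro schemas, so check them against the documentation page above before relying on them.

```python
import json

# Assumed Avro-style schema for the interactions dataset.
# Field names follow the columns this notebook writes; types are assumptions.
interactions_schema = {
    "type": "record",
    "name": "Interactions",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {"name": "USER_ID", "type": "string"},
        {"name": "ITEM_ID", "type": "string"},
        {"name": "TIMESTAMP", "type": "long"},
        {"name": "EVENT_TYPE", "type": "string"},
    ],
    "version": "1.0",
}

# Serialize to the JSON string form in which a schema is usually supplied.
schema_json = json.dumps(interactions_schema, indent=2)
print(schema_json)
```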

We use the Faker library to generate fake interaction dates, because Personalize requires a timestamp for every interaction.

In [1]:
!pip install faker

Defaulting to user installation because normal site-packages is not writeable
You should consider upgrading via the '/Applications/Xcode.app/Contents/Developer/usr/bin/python3 -m pip install --upgrade pip' command.


In [2]:
import sys
import os
import itertools
import datetime
import random
from operator import add
from csv import reader
from itertools import chain
from pyspark import SparkContext, SparkConf
from pyspark.sql import Row
from pyspark.sql.functions import col, expr, udf, collect_list, struct, array, lit, rand, unix_timestamp
from pyspark.sql.types import FloatType, StringType, ArrayType, IntegerType, StructType, StructField, LongType

In [3]:
from faker import Faker
fake = Faker()

In [4]:
from pyspark.sql import SparkSession

# Run the driver in client mode and reuse an existing SparkContext if one is already running.
cf = SparkConf()
cf.set("spark.submit.deployMode", "client")
sc = SparkContext.getOrCreate(cf)

spark = (
    SparkSession.builder
    .appName("Python Spark SQL basic example")
    .config("spark.some.config.option", "some-value")
    .getOrCreate()
)

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/12/13 10:02:17 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


<h4> Read the interactions data generated by the previous notebook: </h4>

In [5]:
path = "./interactions_data.csv/interactions_data.csv"
intsDF = spark.read.csv(path, header=True)

In [6]:
readEvents = intsDF.filter(intsDF.is_read == True).select(intsDF.user_id, intsDF.book_id)

In [7]:
readEvents.show(10)
#readEvents.count() # 1381758

+-------+-------+
|user_id|book_id|
+-------+-------+
|      0|    929|
|      0|    890|
|      0|    870|
|      0|    865|
|      0|    830|
|      0|    827|
|      0|    816|
|      0|    706|
|      0|    667|
|      0|    662|
+-------+-------+
only showing top 10 rows



In [8]:
reviewEvents = intsDF.filter(intsDF.is_reviewed == True).select(intsDF.user_id, intsDF.book_id)

In [9]:
#path = "../goodreads_book_genres_initial.json"
path = "./full_book_data.csv/full_book_data.csv"
booksDF = spark.read.csv(path, header=True)

In [10]:
booksDF.printSchema()

root
 |-- best_book_id: string (nullable = true)
 |-- original_publication_day: string (nullable = true)
 |-- original_publication_month: string (nullable = true)
 |-- original_publication_year: string (nullable = true)
 |-- original_title: string (nullable = true)
 |-- rating_dist: string (nullable = true)
 |-- ratings_count: string (nullable = true)
 |-- ratings_sum: string (nullable = true)
 |-- reviews_count: string (nullable = true)
 |-- text_reviews_count: string (nullable = true)
 |-- work_id: string (nullable = true)
 |-- avg_rating: string (nullable = true)
 |-- inferred_language_id: string (nullable = true)
 |-- book_id: string (nullable = true)
 |-- genres: string (nullable = true)
 |-- det_book_id: string (nullable = true)
 |-- det_work_id: string (nullable = true)
 |-- isbn: string (nullable = true)
 |-- isbn13: string (nullable = true)
 |-- title: string (nullable = true)
 |-- format: string (nullable = true)
 |-- url: string (nullable = true)
 |-- image_url: string (null

In [11]:
booksDF.filter(booksDF.best_book_id == 3).select(
    booksDF.best_book_id,
    booksDF.original_title,
    booksDF.original_publication_year,
    booksDF.ratings_count,
).show(10, truncate=False)

+------------+----------------------------------------+-------------------------+-------------+
|best_book_id|original_title                          |original_publication_year|ratings_count|
+------------+----------------------------------------+-------------------------+-------------+
|3           |Harry Potter and the Philosopher's Stone|1997                     |4972886      |
+------------+----------------------------------------+-------------------------+-------------+



In [12]:
readEvents.filter(readEvents.book_id == 19557).show()

+-------+-------+
|user_id|book_id|
+-------+-------+
+-------+-------+



The most popular books, ranked by the number of distinct users who read them:

In [13]:
readEvents.createOrReplaceTempView("f_events")
users_per_book = "SELECT book_id, count(distinct user_id) as num_users FROM f_events GROUP BY book_id ORDER BY num_users DESC"
spark.sql(users_per_book).show(15)

+-------+---------+
|book_id|num_users|
+-------+---------+
|    968|   176099|
|    706|    71805|
|   7510|    43152|
|   1067|    33783|
|   1202|    31081|
|   6969|    28197|
|   1371|    20198|
|  12948|    20011|
|   8282|    17998|
|   5413|    17096|
|   6854|    16993|
|   1241|    15577|
|   6294|    15456|
|   1554|    15011|
|   6853|    14995|
+-------+---------+
only showing top 15 rows




<h3>Dates</h3>

Personalize's interactions dataset requires timestamps in Unix epoch time format, in seconds.
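
To make the target format concrete, here is a small standard-library sketch that converts a `yyyy-MM-dd` date string to epoch seconds. It assumes the date is interpreted as midnight UTC, and the helper name `to_epoch_seconds` is hypothetical rather than part of this notebook. Spark's `unix_timestamp`, by contrast, parses the string in the session time zone, so the values it produces can differ from these by the local UTC offset.

```python
from datetime import datetime, timezone

def to_epoch_seconds(date_str: str) -> int:
    """Interpret a yyyy-MM-dd string as midnight UTC and return epoch seconds."""
    dt = datetime.strptime(date_str, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    return int(dt.timestamp())

print(to_epoch_seconds("2017-01-01"))  # 1483228800
```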

In [14]:
start_date = datetime.date(year=2017, month=1, day=1)
mock_dates = []
for i in range(0, 50):
    mock_date = fake.date_time_between(start_date=start_date)
    #mock_dates.append(Row(mock_date.timestamp()))
    mock_dates.append(Row(mock_date.strftime("%Y-%m-%d")))

In [15]:
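def get_random_date():
    # Pick one of the pre-generated mock dates; each Row holds a single date string.
    return random.choice(mock_dates)[0]

# Mark the UDF as nondeterministic so Spark does not assume repeated calls return the same value.
get_random_date_udf = udf(get_random_date, StringType()).asNondeterministic()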

In [16]:
#readEvents = readEvents.withColumn('timestamp', lit(fake.date_between(start_date=start_date)))
#readEvents = readEvents.withColumn('timestamp', timeDF.sample(False, 0.1, seed=0).limit(1))
#readEvents = readEvents.withColumn('timestamp', timeDF.sample(False, 0.1, seed=0).limit(1).__getitem__('rand_time'))
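# Assign each event a random mock date and use the column names Personalize expects.
readEvents = readEvents.withColumn('timestamp_dt', get_random_date_udf()) \
    .withColumnRenamed('book_id', 'ITEM_ID').withColumnRenamed('user_id', 'USER_ID')
# Convert the date string to epoch seconds (midnight in the Spark session time zone).
readEvents = readEvents.withColumn('TIMESTAMP', unix_timestamp('timestamp_dt', 'yyyy-MM-dd')).drop('timestamp_dt')
readEvents = readEvents.withColumn('EVENT_TYPE', lit("read"))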

In [17]:
readEvents.filter(readEvents.ITEM_ID == 3).show(10)

+-------+-------+----------+----------+
|USER_ID|ITEM_ID| TIMESTAMP|EVENT_TYPE|
+-------+-------+----------+----------+
|    274|      3|1631937600|      read|
|   1159|      3|1670994000|      read|
|   1810|      3|1503979200|      read|
|   2492|      3|1663905600|      read|
|   2496|      3|1670994000|      read|
|   3392|      3|1506830400|      read|
|   3749|      3|1681876800|      read|
|   5067|      3|1593144000|      read|
|   5368|      3|1681876800|      read|
|   6227|      3|1613797200|      read|
+-------+-------+----------+----------+
only showing top 10 rows



In [18]:
# Collapse to a single partition so that Spark writes a single CSV part file.
readEvents = readEvents.repartition(1)
#readEvents.write.csv("personalize_read_events.csv", header=True, mode="overwrite")
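
Before uploading the exported file, it may be worth sanity-checking that it has the expected columns and integer timestamps. The following is a minimal standard-library sketch, not part of the original pipeline: the function name `check_interactions_csv` and the specific checks are assumptions. It accepts any iterable of CSV lines, so locating and opening the part file that Spark writes is left to the caller.

```python
import csv
import io

REQUIRED_COLUMNS = ["USER_ID", "ITEM_ID", "TIMESTAMP", "EVENT_TYPE"]

def check_interactions_csv(lines):
    """Return a list of problems found in an iterable of CSV lines."""
    rows = csv.DictReader(lines)
    missing = [c for c in REQUIRED_COLUMNS if c not in (rows.fieldnames or [])]
    if missing:
        return [f"missing columns: {missing}"]
    problems = []
    # Data rows start on line 2, after the header.
    for line_no, row in enumerate(rows, start=2):
        if not (row["TIMESTAMP"] or "").isdigit():
            problems.append(f"line {line_no}: TIMESTAMP is not an integer")
        if not row["USER_ID"] or not row["ITEM_ID"]:
            problems.append(f"line {line_no}: empty USER_ID or ITEM_ID")
    return problems

# Two rows shaped like the output above; the second has a deliberately bad timestamp.
sample = io.StringIO(
    "USER_ID,ITEM_ID,TIMESTAMP,EVENT_TYPE\n"
    "274,3,1631937600,read\n"
    "1159,3,not-a-date,read\n"
)
print(check_interactions_csv(sample))
```

An empty list means no problems were found by these particular checks; it does not guarantee that Personalize will accept the file.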