<a href="https://colab.research.google.com/github/amien1410/colab-notebooks/blob/main/Colab_Pyspark_H%26M_EDA_Recommendation_Part2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
#@title Install Kaggle modules and download the dataset

from google.colab import drive
drive.mount('/content/drive')

!pip install kaggle
import os
os.environ['KAGGLE_CONFIG_DIR'] = '/content/drive/MyDrive/kaggle'
!kaggle datasets download -d odins0n/hm256x256
!unzip -q "/content/hm256x256.zip"

Mounted at /content/drive
Dataset URL: https://www.kaggle.com/datasets/odins0n/hm256x256
License(s): other
Downloading hm256x256.zip to /content
 96% 2.05G/2.13G [00:12<00:02, 36.4MB/s]
100% 2.13G/2.13G [00:12<00:00, 182MB/s] 


✨ **Getting Started** ✨

Before we dive into building our recommendation system, we need to set up our environment. This involves installing the necessary libraries and downloading the dataset we'll be working with.

In [2]:
!pip install pyspark



⚙️ **Setting up Spark** ⚙️

Before we can use PySpark, we need to start a Spark session. We'll also import some useful libraries that will help us work with data and build our recommendation model.

In [3]:
import numpy as np
import pandas as pd
import pyspark
from pyspark.ml.recommendation import ALS
from pyspark.sql import SparkSession, SQLContext
from pyspark.sql.types import (StructType, StructField, StringType, IntegerType,
                               ArrayType, DoubleType, BooleanType)
from pyspark.sql.functions import (col, array_contains, udf, when,
                                   to_timestamp, date_format)

# Wildcard imports are kept because later cells rely on them (e.g. from_unixtime,
# year, month). Note that they shadow Python builtins such as min, max and sum.
from pyspark.sql.types import *
from pyspark.sql.functions import *
from pyspark.sql.window import *

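# getOrCreate() already returns a SparkSession, so it does not need to be wrapped again
spark = (
    SparkSession.builder
    .appName("Recommendations")
    .config("spark.sql.files.maxPartitionBytes", 5000000)
    .getOrCreate()
)
sc = spark.sparkContext  # the underlying SparkContext, if needed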

💾 **Loading the Transaction Data** 💾

We'll load the transaction data into a Spark DataFrame. Each row is one purchase (date, customer, article, price and sales channel), which is the raw signal our recommendation system learns from. Since we don't pass a schema, every column is read as a string.

In [4]:
transaction = spark.read.option("header",True) \
              .csv("/content/transactions_train.csv")
transaction.printSchema()

root
 |-- t_dat: string (nullable = true)
 |-- customer_id: string (nullable = true)
 |-- article_id: string (nullable = true)
 |-- price: string (nullable = true)
 |-- sales_channel_id: string (nullable = true)



📅 **Checking the Date Range** 📅

It's always good to know the time period our data covers. Let's find the earliest and latest transaction dates in the dataset.

In [5]:
# Aliases make it explicit that these are Spark's min/max, not Python's builtins
from pyspark.sql.functions import min as min_, max as max_
min_date, max_date = transaction.select(min_("t_dat"), max_("t_dat")).first()
min_date, max_date

('2018-09-20', '2020-09-22')

In [6]:
from pyspark.sql.functions import min as min_, max as max_

# Cache the DataFrame if reused multiple times
transaction.cache()

# Aggregate min and max in one pass
date_range = transaction.agg(
    min_("t_dat").alias("min_date"),
    max_("t_dat").alias("max_date")
).collect()[0]

min_date = date_range['min_date']
max_date = date_range['max_date']
print(min_date, max_date)

2018-09-20 2020-09-22
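
Both cells above take the minimum and maximum of `t_dat` while it is still a string column. That works here because zero-padded `yyyy-MM-dd` strings sort in the same order as the dates they represent. Here is a minimal sketch in plain Python, using made-up dates rather than values from the dataset:

```python
from datetime import date

# Hypothetical sample of ISO-formatted date strings (not taken from the dataset)
raw_dates = ["2019-03-05", "2018-09-20", "2020-09-22", "2019-12-31"]

# Earliest and latest values when the raw strings are sorted
sorted_strings = sorted(raw_dates)
str_min, str_max = sorted_strings[0], sorted_strings[-1]

# Earliest and latest values when the strings are parsed into real dates first
sorted_dates = sorted(date.fromisoformat(d) for d in raw_dates)
date_min, date_max = sorted_dates[0], sorted_dates[-1]

# For zero-padded yyyy-MM-dd strings the two orderings agree
same_order = (str_min == date_min.isoformat()) and (str_max == date_max.isoformat())
print(str_min, str_max, same_order)
```

The shortcut would not hold for formats such as `dd/MM/yyyy`, where parsing the column first would be the safer route.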


🧹 **Preparing the Data** 🧹

We'll transform the transaction data to get it ready for our recommendation system. We keep only the purchases made on the last day in the dataset (2020-09-22), then count how many times each customer bought each article on that day.

In [7]:
from pyspark.sql.functions import col, lit, to_date

# Parse t_dat (a 'yyyy-MM-dd' string) into a proper date column
hm = transaction.withColumn('date', to_date(col('t_dat'), 'yyyy-MM-dd'))

# Keep only the last day in the dataset: 2020-09-22
hm = hm.filter(col('date') == to_date(lit('2020-09-22')))
transaction.unpersist()

# One row per (customer, article) pair, with the number of purchases as `count`
hm = hm.groupby('customer_id', 'article_id').count()
hm.show(5)

+--------------------+----------+-----+
|         customer_id|article_id|count|
+--------------------+----------+-----+
|00f7bc5c0df4c615b...|0780418013|    1|
|02094817e46f3b692...|0791587001|    1|
|0333e5dda0257e9f4...|0839332002|    2|
|07c7a1172caf8fb97...|0573085043|    1|
|081373184e601470c...|0714790020|    1|
+--------------------+----------+-----+
only showing top 5 rows



In [8]:
print((hm.count(), len(hm.columns)))

(29486, 3)
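
To make the `count` column concrete, here is a rough plain-Python equivalent of the `groupby('customer_id', 'article_id').count()` step, run on a few invented rows. The customer and article IDs are placeholders, not values from the dataset.

```python
from collections import Counter

# Hypothetical purchase rows: (customer_id, article_id), one row per item bought
rows = [
    ("cust_a", "0001"),
    ("cust_a", "0001"),  # cust_a bought article 0001 twice
    ("cust_a", "0002"),
    ("cust_b", "0001"),
]

# Count how many times each (customer, article) pair occurs
pair_counts = Counter(rows)

for (customer_id, article_id), count in sorted(pair_counts.items()):
    print(customer_id, article_id, count)
```

Each output row is one distinct customer-article pair, so the 29,486 rows reported above are distinct pairs for that day, not the number of items sold.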


🔍 **Checking Data Sparsity** 🔍

Sparsity is a fancy word that tells us how "empty" our data is. In our case, it's the share of all possible customer-article pairs for which no purchase was recorded. Knowing this helps us understand how little each customer has actually interacted with the catalogue.

In [9]:
# Count the observed (customer, article) pairs, i.e. the rows of hm
numerator = hm.count()

# Count the distinct customers and the distinct articles
num_users = hm.select("customer_id").distinct().count()
num_articles = hm.select("article_id").distinct().count()

# Total number of possible (customer, article) pairs
denominator = num_users * num_articles

# Sparsity: percentage of possible pairs with no recorded purchase
sparsity = (1.0 - numerator / denominator) * 100
print("Sparsity: ", "%.2f" % sparsity + "%.")

Sparsity:  99.96%.
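
The calculation above boils down to sparsity = (1 - observed pairs / (customers x articles)) x 100. A tiny worked example in plain Python, on invented pairs, may help to check the arithmetic by hand:

```python
# Hypothetical observed (customer, article) pairs; the IDs are placeholders
observed_pairs = {
    ("cust_a", "0001"), ("cust_a", "0002"),
    ("cust_b", "0001"),
    ("cust_c", "0003"), ("cust_c", "0004"),
}

num_users = len({customer for customer, _ in observed_pairs})    # 3 customers
num_articles = len({article for _, article in observed_pairs})   # 4 articles

# Every customer could in principle have bought every article
possible_pairs = num_users * num_articles                         # 12 pairs

# Share of the customer x article matrix that is empty, as a percentage
sparsity = (1.0 - len(observed_pairs) / possible_pairs) * 100
print(f"Sparsity: {sparsity:.2f}%")  # Sparsity: 58.33%
```

Here 5 of the 12 possible pairs were observed, so 7 of 12 (about 58%) are empty. In the real data almost every cell is empty, which is typical for purchase data.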


📊 **Analyzing Customer Activity** 📊

It's interesting to see how active our customers are. Since `hm` has one row per customer-article pair, counting rows per customer tells us how many distinct articles each customer bought that day. Let's see which customers top the list.

In [10]:
userId_count = hm.groupBy("customer_id").count().orderBy('count', ascending=False)
userId_count.show()

+--------------------+-----+
|         customer_id|count|
+--------------------+-----+
|30b6056bacc5f5c9d...|   28|
|5e8fb4d457fdffc61...|   28|
|dc1b173e541f8d3c1...|   27|
|6335d496ef463bc40...|   25|
|1796e87fd2e88932b...|   25|
|f50287d9cf052d4b4...|   24|
|54e8ebd39543b5a4d...|   23|
|fd5ce8716faf00f6a...|   23|
|850ec77661a417d27...|   22|
|32f3a6a7ce63d302c...|   21|
|fc783381f1ea2174c...|   21|
|ad3663a848dccbdda...|   21|
|b606fe5786c00151a...|   21|
|298523b6637340717...|   21|
|a08e284bb18add2d7...|   21|
|383e1b07e2c1fe169...|   21|
|b49647f84a99ced53...|   21|
|3ca77aab50ae4532b...|   20|
|2a721767cd9864ed5...|   20|
|af5166e0f89b0d433...|   19|
+--------------------+-----+
only showing top 20 rows
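
Because `hm` already holds one row per customer-article pair, counting rows per customer gives the number of distinct articles, while adding up the `count` column would give the number of items bought. A small sketch with invented rows shows the difference:

```python
from collections import defaultdict

# Hypothetical aggregated rows shaped like `hm`: (customer_id, article_id, count)
hm_rows = [
    ("cust_a", "0001", 2),
    ("cust_a", "0002", 1),
    ("cust_b", "0001", 3),
]

distinct_articles = defaultdict(int)  # number of rows per customer
items_bought = defaultdict(int)       # total of the count column per customer

for customer_id, _article_id, count in hm_rows:
    distinct_articles[customer_id] += 1
    items_bought[customer_id] += count

print(dict(distinct_articles))  # {'cust_a': 2, 'cust_b': 1}
print(dict(items_bought))       # {'cust_a': 3, 'cust_b': 3}
```

In PySpark, the second figure could be obtained with something like `hm.groupBy("customer_id").agg({"count": "sum"})`.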



👚 **Analyzing Article Popularity** 👖

Let's see which articles are flying off the shelves! We'll count how many distinct customers bought each article that day to find the most popular items.

In [11]:
articleId_count = hm.groupBy("article_id").count().orderBy('count', ascending=False)
articleId_count.show()

+----------+-----+
|article_id|count|
+----------+-----+
|0924243002|   91|
|0918522001|   88|
|0866731001|   78|
|0751471001|   75|
|0448509014|   73|
|0714790020|   72|
|0762846027|   68|
|0928206001|   67|
|0893432002|   66|
|0918292001|   65|
|0915529005|   64|
|0788575004|   63|
|0915529003|   63|
|0863583001|   60|
|0930380001|   59|
|0573085028|   59|
|0919273002|   58|
|0850917001|   57|
|0573085042|   56|
|0874110016|   53|
+----------+-----+
only showing top 20 rows



📇 **Indexing Customer and Article IDs** 📇

Before we can train our recommendation model, we need to convert the customer and article IDs into numerical indexes, because Spark's ALS expects numeric user and item IDs rather than strings. We'll use a `StringIndexer` for this.
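
By default, `StringIndexer` orders labels by descending frequency (`stringOrderType="frequencyDesc"`), so the most common value gets index 0.0. The sketch below imitates that ordering in plain Python on invented article IDs. It illustrates the idea rather than Spark's implementation, and the alphabetical tie-break is an assumption based on recent Spark versions.

```python
from collections import Counter

# Hypothetical article IDs, one entry per row in the data
article_ids = ["0003", "0001", "0003", "0002", "0003", "0001"]

freq = Counter(article_ids)

# Most frequent label first; ties broken alphabetically (assumed behaviour)
ordered_labels = sorted(freq, key=lambda label: (-freq[label], label))

# Indexes are floats, matching the 0.0, 1.0, ... values StringIndexer emits
label_to_index = {label: float(i) for i, label in enumerate(ordered_labels)}
print(label_to_index)  # {'0003': 0.0, '0001': 1.0, '0002': 2.0}
```

This is why a popular article such as 0918522001 ends up with a small index (1.0) in the output of the next cell.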

In [12]:
from pyspark.ml import Pipeline
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.feature import StringIndexer
from pyspark.ml.recommendation import ALS

# One StringIndexer per ID column. A list comprehension keeps the column order
# stable, whereas a set difference does not guarantee any order.
id_columns = [column for column in hm.columns if column != 'count']
indexer = [StringIndexer(inputCol=column, outputCol=column + "_index") for column in id_columns]
pipeline = Pipeline(stages=indexer)
transformed = pipeline.fit(hm).transform(hm)
transformed.show()

+--------------------+----------+-----+-----------------+----------------+
|         customer_id|article_id|count|customer_id_index|article_id_index|
+--------------------+----------+-----+-----------------+----------------+
|00f7bc5c0df4c615b...|0780418013|    1|            783.0|          2237.0|
|02094817e46f3b692...|0791587001|    1|            785.0|            35.0|
|0333e5dda0257e9f4...|0839332002|    2|           4098.0|           732.0|
|07c7a1172caf8fb97...|0573085043|    1|           1702.0|            44.0|
|081373184e601470c...|0714790020|    1|           4146.0|             5.0|
|09bec2a61046ccbea...|0860336002|    1|           6792.0|          2368.0|
|0be4f1ecce204ee32...|0573085028|    1|            799.0|            14.0|
|0c4b30343292b5101...|0918522001|    1|           6825.0|             1.0|
|0e10e02358875468b...|0579541001|    1|           2689.0|            53.0|
|0fc371e67e61a31d7...|0907170001|    1|           1737.0|          1978.0|
|10817b19177f6a53e...|071