<a href="https://colab.research.google.com/github/amien1410/colab-notebooks/blob/main/Colab_Pyspark_H%26M_EDA_Recommendation_Part2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
#@title Install Kaggle modules and download the dataset

from google.colab import drive
drive.mount('/content/drive')

!pip install kaggle
import os
os.environ['KAGGLE_CONFIG_DIR'] = '/content/drive/MyDrive/kaggle'
!kaggle datasets download -d odins0n/hm256x256
!unzip -q "/content/hm256x256.zip"

Mounted at /content/drive
Dataset URL: https://www.kaggle.com/datasets/odins0n/hm256x256
License(s): other
Downloading hm256x256.zip to /content
 96% 2.05G/2.13G [00:12<00:02, 36.4MB/s]
100% 2.13G/2.13G [00:12<00:00, 182MB/s] 


✨ **Getting Started** ✨

Before we dive into building our recommendation system, we need to set up our environment. This involves installing the necessary libraries and downloading the dataset we'll be working with.
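
The Kaggle CLI looks for a `kaggle.json` API token inside the directory named by `KAGGLE_CONFIG_DIR`, so the download fails if that file is missing from Drive. A minimal sketch of a pre-flight check follows; the helper name `kaggle_token_path` is hypothetical and not part of the Kaggle package.

```python
import os

def kaggle_token_path(config_dir):
    """Return the path to kaggle.json inside config_dir, or None if it is absent."""
    token = os.path.join(config_dir, "kaggle.json")
    return token if os.path.isfile(token) else None

# Hypothetical usage with the directory configured in the first cell.
config_dir = os.environ.get("KAGGLE_CONFIG_DIR", "/content/drive/MyDrive/kaggle")
if kaggle_token_path(config_dir) is None:
    print(f"No kaggle.json found in {config_dir}; the download step will fail.")
```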

In [2]:
!pip install pyspark



⚙️ **Setting up Spark** ⚙️

Before we can use PySpark, we need to start a Spark session. We'll also import some useful libraries that will help us work with data and build our recommendation model.

In [3]:
import numpy as np
import pandas as pd
import pyspark
from pyspark.ml.recommendation import ALS
from pyspark.sql import SparkSession, SQLContext
from pyspark.sql.functions import (
    array_contains, col, date_format, to_timestamp, udf, when,
)
from pyspark.sql.types import (
    ArrayType, BooleanType, DoubleType, IntegerType,
    StringType, StructField, StructType,
)

# Wildcard imports are kept because later cells use these names unqualified.
# Caution: pyspark.sql.functions shadows Python built-ins such as min, max, and sum.
from pyspark.sql.types import *
from pyspark.sql.functions import *
from pyspark.sql.window import *

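# getOrCreate() already returns a SparkSession, so it does not need to be wrapped again.
# maxPartitionBytes caps each input split at about 5 MB, giving many small partitions.
spark = (
    SparkSession.builder
    .appName("Recommendations")
    .config("spark.sql.files.maxPartitionBytes", 5000000)
    .getOrCreate()
)
sc = spark  # alias kept because the original notebook bound the session to `sc`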
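
Setting `spark.sql.files.maxPartitionBytes` to 5,000,000 caps each input split at roughly 5 MB, so a large CSV is read as many small partitions. The sketch below is a simplified model of how Spark sizes file splits, not Spark's actual code; the 4 MB open cost is the documented default of `spark.sql.files.openCostInBytes`, and the file size and core count in the usage line are made-up illustration values.

```python
import math

def estimate_partitions(file_bytes, max_partition_bytes=5_000_000,
                        open_cost_bytes=4 * 1024 * 1024, cores=2):
    """Rough estimate of the number of input partitions for one splittable file."""
    # Spark spreads the total cost (data plus a per-file open cost) over the cores...
    bytes_per_core = (file_bytes + open_cost_bytes) / cores
    # ...and caps the resulting split size at maxPartitionBytes.
    split_bytes = min(max_partition_bytes, max(open_cost_bytes, bytes_per_core))
    return math.ceil(file_bytes / split_bytes)

# Illustration only: a hypothetical 1 GB file on a 2-core runtime.
print(estimate_partitions(1_000_000_000))
```

The actual count can be checked in the notebook with `transaction.rdd.getNumPartitions()` once the data is loaded.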

💾 **Loading the Transaction Data** 💾

We'll load the transaction data into a Spark DataFrame. This dataset records every purchase, which is the core signal for our recommendation system. No schema is supplied, so every column is read as a string.

In [4]:
transaction = spark.read.option("header",True) \
              .csv("/content/transactions_train.csv")
transaction.printSchema()

root
 |-- t_dat: string (nullable = true)
 |-- customer_id: string (nullable = true)
 |-- article_id: string (nullable = true)
 |-- price: string (nullable = true)
 |-- sales_channel_id: string (nullable = true)



📅 **Checking the Date Range** 📅

It's always good to know the time period our data covers. Let's find the earliest and latest transaction dates in the dataset.

In [5]:
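# Aliases avoid rebinding Python's built-in min and max in this cell.
from pyspark.sql.functions import min as min_, max as max_

# t_dat is a zero-padded yyyy-MM-dd string, so string order matches date order.
min_date, max_date = transaction.select(min_("t_dat"), max_("t_dat")).first()
min_date, max_date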

('2018-09-20', '2020-09-22')
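
`t_dat` is still a string at this point, so the aggregation compares text rather than dates. That is safe only because the values use the zero-padded `yyyy-MM-dd` layout, where alphabetical order matches chronological order. A small plain-Python sketch with made-up dates shows the difference:

```python
from datetime import datetime

iso_dates = ["2020-09-22", "2018-09-20", "2019-01-05"]  # year first, zero-padded
us_dates = ["09/22/2020", "09/20/2018", "01/05/2019"]   # month first

# sorted() is used because the notebook's wildcard import shadows min and max.
# ISO strings: text order equals date order.
iso_sorted = sorted(iso_dates)
print(iso_sorted[0], iso_sorted[-1])

# Month-first strings: text order picks the wrong earliest date.
text_min = sorted(us_dates)[0]
true_min = sorted(us_dates, key=lambda s: datetime.strptime(s, "%m/%d/%Y"))[0]
print(text_min, true_min)
```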

In [6]:
from pyspark.sql.functions import min as min_, max as max_

# Cache the DataFrame if reused multiple times
transaction.cache()

# Aggregate min and max in one pass
date_range = transaction.agg(
    min_("t_dat").alias("min_date"),
    max_("t_dat").alias("max_date")
).collect()[0]

min_date = date_range['min_date']
max_date = date_range['max_date']
print(min_date, max_date)

2018-09-20 2020-09-22


🧹 **Preparing the Data** 🧹

We'll clean and transform the transaction data to get it ready for our recommendation system. This includes keeping only the transactions from 2020-09-22, the last day in the dataset, and counting how many times each customer bought a specific article.

In [None]:
# Parse the string column into a proper date and derive the calendar parts.
hm = transaction.withColumn('date', to_date(col('t_dat'), 'yyyy-MM-dd'))
hm = hm.withColumn('year', year(col('date')))
hm = hm.withColumn('month', month(col('date')))
hm = hm.withColumn('day', dayofmonth(col('date')))

# Keep only the last day in the dataset (2020-09-22).
hm = hm.filter((col('year') == 2020) & (col('month') == 9) & (col('day') == 22))

# Prepare the dataset: count purchases per (customer, article) pair.
hm = hm.groupby('customer_id', 'article_id').count()
hm.show(5)

# Release the cache only after the action above has run. Spark is lazy,
# so unpersisting earlier would drop the cached rows before they are used.
transaction.unpersist()
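
The `groupby(...).count()` step collapses repeated purchases into one row per customer and article, and the resulting `count` column can serve as an implicit-feedback signal for a recommender. A toy pandas sketch with invented rows shows the shape of the result; pandas is used here only so the example runs without a Spark session.

```python
import pandas as pd

# Invented rows; the real customer and article IDs are much longer strings.
toy = pd.DataFrame({
    "customer_id": ["c1", "c1", "c1", "c2"],
    "article_id":  ["a1", "a1", "a2", "a1"],
})

# One row per (customer, article) pair with the number of purchases.
counts = (
    toy.groupby(["customer_id", "article_id"])
       .size()
       .reset_index(name="count")
)
print(counts)
```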