<a href="https://colab.research.google.com/github/amien1410/colab-notebooks/blob/main/Colab_Pyspark_H%26M_EDA_Recommendation_Part2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
#@title Install Kaggle modules and download the dataset

from google.colab import drive
drive.mount('/content/drive')

!pip install kaggle
import os
os.environ['KAGGLE_CONFIG_DIR'] = '/content/drive/MyDrive/kaggle'
!kaggle datasets download -d odins0n/hm256x256
!unzip -q "/content/hm256x256.zip"

Mounted at /content/drive
Dataset URL: https://www.kaggle.com/datasets/odins0n/hm256x256
License(s): other
Downloading hm256x256.zip to /content
 96% 2.05G/2.13G [00:12<00:02, 36.4MB/s]
100% 2.13G/2.13G [00:12<00:00, 182MB/s] 


✨ **Getting Started** ✨

Before we dive into building our recommendation system, we need to set up our environment. This involves installing the necessary libraries and downloading the dataset we'll be working with.

In [2]:
!pip install pyspark



⚙️ **Setting up Spark** ⚙️

Before we can use PySpark, we need to start a Spark session. We'll also import some useful libraries that will help us work with data and build our recommendation model.

In [3]:
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType,StructField, StringType, IntegerType
from pyspark.sql.types import ArrayType, DoubleType, BooleanType
from pyspark.sql.functions import col,array_contains
from pyspark.sql import SQLContext
from pyspark.ml.recommendation import ALS
from pyspark.sql.functions import udf,col,when
from pyspark.sql.functions import to_timestamp,date_format
import numpy as np
import pandas as pd
from pyspark.sql.types import *
from pyspark.sql.functions import *
from pyspark.sql.window import *

# getOrCreate() already returns a SparkSession, so we can use it directly
spark = (
    SparkSession.builder
    .appName("Recommendations")
    .config("spark.sql.files.maxPartitionBytes", 5000000)
    .getOrCreate()
)

💾 **Loading the Transaction Data** 💾

We'll load the transaction data into a Spark DataFrame. This dataset contains all the purchase information, which is super important for our recommendation system.

In [4]:
transaction = spark.read.option("header",True) \
              .csv("/content/transactions_train.csv")
transaction.printSchema()

root
 |-- t_dat: string (nullable = true)
 |-- customer_id: string (nullable = true)
 |-- article_id: string (nullable = true)
 |-- price: string (nullable = true)
 |-- sales_channel_id: string (nullable = true)



📅 **Checking the Date Range** 📅

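It's always good to know the time period our data covers. Let's find the earliest and latest transaction dates in the dataset. Since `t_dat` was read as a string in `yyyy-MM-dd` format, comparing the strings gives the same answer as comparing the dates.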

In [5]:
from pyspark.sql.functions import min, max
from pyspark.sql.functions import unix_timestamp, lit
min_date, max_date = transaction.select(min("t_dat"), max("t_dat")).first()
min_date, max_date

('2018-09-20', '2020-09-22')

In [6]:
from pyspark.sql.functions import min as min_, max as max_

# Cache the DataFrame if reused multiple times
transaction.cache()

# Aggregate min and max in one pass
date_range = transaction.agg(
    min_("t_dat").alias("min_date"),
    max_("t_dat").alias("max_date")
).collect()[0]

min_date = date_range['min_date']
max_date = date_range['max_date']
print(min_date, max_date)

2018-09-20 2020-09-22
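
Why does taking the min and max of a string column give the right dates? Here is a small plain-Python sketch (no Spark needed). The sample values are made up for illustration; the only assumption is that they follow the same `yyyy-MM-dd` layout as `t_dat`.

```python
from datetime import date

# Hypothetical sample of t_dat values in the same yyyy-MM-dd layout
raw_dates = ["2020-09-22", "2018-09-20", "2019-12-01", "2019-02-15"]

# Plain string comparison, which is what min/max do on a string column
min_str, max_str = min(raw_dates), max(raw_dates)

# Chronological comparison on parsed dates, for reference
parsed = [date.fromisoformat(d) for d in raw_dates]
min_parsed, max_parsed = min(parsed), max(parsed)

print(min_str, max_str)
print(min_parsed.isoformat(), max_parsed.isoformat())
```

Both lines print the same pair, because zero-padded year-month-day strings sort in the same order as the dates they represent. This would not hold for a layout like `dd/MM/yyyy`.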


🧹 **Preparing the Data** 🧹

We'll clean and transform the transaction data to get it ready for our recommendation system. This includes filtering down to a single day (2020-09-22, the last date in the data) and counting how many times each customer bought a specific article on that day.

In [7]:
# t_dat is already a string; parse it so we can extract the date parts
hm = transaction.withColumn('date', from_unixtime(unix_timestamp('t_dat', 'yyyy-MM-dd')))
hm = hm.withColumn('year', year(col('date')))
hm = hm.withColumn('month', month(col('date')))
hm = hm.withColumn('day', dayofmonth(col('date')))

# Keep only transactions from 2020-09-22, the last day in the dataset
hm = hm.filter((col('year') == 2020) & (col('month') == 9) & (col('day') == 22))

# Release the cached full transaction table
transaction.unpersist()

# Prepare the dataset
hm = hm.groupby('customer_id', 'article_id').count()
hm.show(5)

+--------------------+----------+-----+
|         customer_id|article_id|count|
+--------------------+----------+-----+
|00f7bc5c0df4c615b...|0780418013|    1|
|02094817e46f3b692...|0791587001|    1|
|0333e5dda0257e9f4...|0839332002|    2|
|07c7a1172caf8fb97...|0573085043|    1|
|081373184e601470c...|0714790020|    1|
+--------------------+----------+-----+
only showing top 5 rows
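
To make the filter-then-count step concrete, here is a tiny plain-Python sketch of the same idea. The rows and IDs (`c1`, `a1`, ...) are hypothetical stand-ins, not values from the dataset.

```python
from collections import Counter

# Hypothetical rows: (t_dat, customer_id, article_id)
rows = [
    ("2020-09-21", "c1", "a1"),
    ("2020-09-22", "c1", "a1"),
    ("2020-09-22", "c1", "a1"),
    ("2020-09-22", "c1", "a2"),
    ("2020-09-22", "c2", "a1"),
]

target_day = "2020-09-22"

# Keep one day, then count rows per (customer, article) pair
pair_counts = Counter(
    (customer, article)
    for day, customer, article in rows
    if day == target_day
)

for (customer, article), n in sorted(pair_counts.items()):
    print(customer, article, n)
```

Each key ends up as one row of the grouped table, and the count plays the role of the "rating" that ALS will learn from later.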



In [8]:
print((hm.count(), len(hm.columns)))

(29486, 3)


🔍 **Checking Data Sparsity** 🔍

Sparsity is a fancy word that tells us how "empty" our data is. In our case, it's about how many possible customer-article purchases didn't actually happen. Knowing this helps us understand our data better.

In [9]:
# Count the observed (customer, article) pairs; hm has one row per pair
numerator = hm.count()

# Count the number of distinct customers and distinct articles
num_users = hm.select("customer_id").distinct().count()
num_articles = hm.select("article_id").distinct().count()

# The denominator is every possible (customer, article) pair
denominator = num_users * num_articles

# Sparsity is the share of possible pairs that were never observed
sparsity = (1.0 - (numerator * 1.0) / denominator) * 100
print("Sparsity: ", "%.2f" % sparsity + "%.")

Sparsity:  99.96%.
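
If the formula feels abstract, here is a toy version you can check by hand. The helper name `sparsity_pct` and the sample pairs are made up for this sketch.

```python
def sparsity_pct(num_pairs, num_users, num_items):
    """Percentage of the customer x article grid with no recorded purchase."""
    return (1.0 - num_pairs / (num_users * num_items)) * 100


# Hypothetical toy data: 3 customers, 4 articles, 5 observed pairs
observed = {
    ("c1", "a1"), ("c1", "a2"),
    ("c2", "a1"),
    ("c3", "a3"), ("c3", "a4"),
}
users = {customer for customer, _ in observed}
items = {article for _, article in observed}

toy_sparsity = sparsity_pct(len(observed), len(users), len(items))
print(f"{toy_sparsity:.2f}%")
```

With 5 observed pairs out of 3 x 4 = 12 possible ones, 7 of the 12 cells are empty. Real purchase data is far emptier than this toy grid, which is why the value above is so close to 100%.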


📊 **Analyzing Customer Activity** 📊

It's interesting to see how active our customers are. Because `hm` now has one row per customer-article pair, counting rows per customer tells us how many distinct articles each customer bought on this day. Let's see which customers are the most active.

In [10]:
userId_count = hm.groupBy("customer_id").count().orderBy('count', ascending=False)
userId_count.show()

+--------------------+-----+
|         customer_id|count|
+--------------------+-----+
|30b6056bacc5f5c9d...|   28|
|5e8fb4d457fdffc61...|   28|
|dc1b173e541f8d3c1...|   27|
|6335d496ef463bc40...|   25|
|1796e87fd2e88932b...|   25|
|f50287d9cf052d4b4...|   24|
|54e8ebd39543b5a4d...|   23|
|fd5ce8716faf00f6a...|   23|
|850ec77661a417d27...|   22|
|32f3a6a7ce63d302c...|   21|
|fc783381f1ea2174c...|   21|
|ad3663a848dccbdda...|   21|
|b606fe5786c00151a...|   21|
|298523b6637340717...|   21|
|a08e284bb18add2d7...|   21|
|383e1b07e2c1fe169...|   21|
|b49647f84a99ced53...|   21|
|3ca77aab50ae4532b...|   20|
|2a721767cd9864ed5...|   20|
|af5166e0f89b0d433...|   19|
+--------------------+-----+
only showing top 20 rows



👚 **Analyzing Article Popularity** 👖

Let's see which articles are flying off the shelves! Since each row of `hm` is one customer-article pair, counting rows per article tells us how many distinct customers bought it, which is a handy way to find the most popular items.

In [11]:
articleId_count = hm.groupBy("article_id").count().orderBy('count', ascending=False)
articleId_count.show()

+----------+-----+
|article_id|count|
+----------+-----+
|0924243002|   91|
|0918522001|   88|
|0866731001|   78|
|0751471001|   75|
|0448509014|   73|
|0714790020|   72|
|0762846027|   68|
|0928206001|   67|
|0893432002|   66|
|0918292001|   65|
|0915529005|   64|
|0788575004|   63|
|0915529003|   63|
|0863583001|   60|
|0930380001|   59|
|0573085028|   59|
|0919273002|   58|
|0850917001|   57|
|0573085042|   56|
|0874110016|   53|
+----------+-----+
only showing top 20 rows



📇 **Indexing Customer and Article IDs** 📇

Before we can train our recommendation model, we need to convert the customer and article IDs into numerical indexes. This helps the model work efficiently. We'll use a `StringIndexer` for this.

In [12]:
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer

# One StringIndexer per ID column; each adds a numeric "<column>_index" column
indexer = [
    StringIndexer(inputCol=column, outputCol=column + "_index")
    for column in ["customer_id", "article_id"]
]
pipeline = Pipeline(stages=indexer)
transformed = pipeline.fit(hm).transform(hm)
transformed.show()

+--------------------+----------+-----+-----------------+----------------+
|         customer_id|article_id|count|customer_id_index|article_id_index|
+--------------------+----------+-----+-----------------+----------------+
|00f7bc5c0df4c615b...|0780418013|    1|            783.0|          2237.0|
|02094817e46f3b692...|0791587001|    1|            785.0|            35.0|
|0333e5dda0257e9f4...|0839332002|    2|           4098.0|           732.0|
|07c7a1172caf8fb97...|0573085043|    1|           1702.0|            44.0|
|081373184e601470c...|0714790020|    1|           4146.0|             5.0|
|09bec2a61046ccbea...|0860336002|    1|           6792.0|          2368.0|
|0be4f1ecce204ee32...|0573085028|    1|            799.0|            14.0|
|0c4b30343292b5101...|0918522001|    1|           6825.0|             1.0|
|0e10e02358875468b...|0579541001|    1|           2689.0|            53.0|
|0fc371e67e61a31d7...|0907170001|    1|           1737.0|          1978.0|
|10817b19177f6a53e...|071
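
Notice that popular articles get small index values. By default, `StringIndexer` orders labels by descending frequency, so the most common label gets index 0.0. Here is a rough plain-Python sketch of that idea. The function name `frequency_index` is made up, and breaking ties alphabetically is an assumption of this sketch, so check the `stringOrderType` documentation for your Spark version.

```python
from collections import Counter


def frequency_index(labels):
    """Map each label to a float index, most frequent label first."""
    counts = Counter(labels)
    # Sort by descending count; ties broken alphabetically (assumption)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(position) for position, label in enumerate(ordered)}


# Hypothetical article IDs: a1 appears 3 times, a2 twice, a3 once
article_ids = ["a3", "a1", "a1", "a2", "a1", "a2"]
index = frequency_index(article_ids)
print(index)
```

The indexes are floats, just like the `customer_id_index` and `article_id_index` columns in the table above.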

✂️ **Splitting the Data** ✂️

To evaluate how well our recommendation model works, we'll split our indexed data into two sets: a training set to teach the model and a testing set to see how accurately it makes predictions on data it hasn't seen before.

In [13]:
(training,test)=transformed.randomSplit([0.8, 0.2])
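
One thing to keep in mind: without a seed, each run produces a different split, so your numbers will differ from the ones shown here. `randomSplit` accepts an optional `seed` argument, for example `transformed.randomSplit([0.8, 0.2], seed=42)`. The plain-Python sketch below illustrates the general idea of a seeded, row-by-row random split. It is a simplified stand-in, not Spark's actual implementation, and `random_split` is a made-up helper.

```python
import random


def random_split(rows, train_fraction, seed):
    """Assign each row to train or test with an independent random draw."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test


rows = list(range(1000))
train_a, test_a = random_split(rows, 0.8, seed=42)
train_b, test_b = random_split(rows, 0.8, seed=42)

print(len(train_a), len(test_a))
print(train_a == train_b)
```

Because every row is drawn independently, the split sizes are only approximately 80/20, and the same seed reproduces the same split.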

🏋️‍♀️ **Training the Recommendation Model** 🏋️‍♀️

Now for the core of our recommendation system! We'll train an ALS model using our training data. We'll also tune it with a single train/validation split (`TrainValidationSplit`), which is faster than full cross-validation, to find the best settings for our model so it can make accurate recommendations.

In [16]:
from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit

# ✅ Step 1: Cache your training data to prevent recomputation
training.cache()

# ✅ Step 2: Create ALS model with cold start handling
als = ALS(
    userCol="customer_id_index",
    itemCol="article_id_index",
    ratingCol="count",
    coldStartStrategy="drop",  # drops NaN predictions
    nonnegative=True,
    implicitPrefs=False  # if explicit feedback; set True for implicit feedback
)

# ✅ Step 3: Build a smaller, faster hyperparameter grid
param_grid = ParamGridBuilder()\
    .addGrid(als.rank, [10, 20])\
    .addGrid(als.maxIter, [10])\
    .addGrid(als.regParam, [0.1, 0.15])\
    .build()

# ✅ Step 4: Use RMSE for evaluation
evaluator = RegressionEvaluator(
    metricName="rmse",
    labelCol="count",
    predictionCol="prediction"
)

# ✅ Step 5: Use TrainValidationSplit for faster tuning
tvs = TrainValidationSplit(
    estimator=als,
    estimatorParamMaps=param_grid,
    evaluator=evaluator,
    trainRatio=0.8,  # 80% train / 20% validation
    parallelism=4    # run up to 4 models in parallel (tune based on your cluster)
)

# ✅ Step 6: Train the model (this will take much less time now)
model = tvs.fit(training)

# ✅ Step 7: Best model retrieval and evaluation
best_model = model.bestModel
print("Best rank:", best_model.rank)
print("Best regParam:", best_model._java_obj.parent().getRegParam())
print("Best maxIter:", best_model._java_obj.parent().getMaxIter())

Best rank: 20
Best regParam: 0.1
Best maxIter: 10
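
How many models did the tuning step actually fit? The grid is the Cartesian product of the value lists passed to `addGrid`. This small sketch enumerates the same combinations with `itertools.product`; it only mirrors the grid values used above and does not touch Spark.

```python
from itertools import product

# Same value lists as the ParamGridBuilder calls above
grid = {
    "rank": [10, 20],
    "maxIter": [10],
    "regParam": [0.1, 0.15],
}

names = list(grid)
combos = [dict(zip(names, values)) for values in product(*grid.values())]

for combo in combos:
    print(combo)
print("Total combinations:", len(combos))
```

`TrainValidationSplit` evaluates each combination once on a single validation split, so the cost grows with the number of combinations. A `CrossValidator` with k folds would fit each combination k times.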


🔬 **Evaluating the Model** 🔬

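Now that we have our best model, let's see how well it predicts! We'll use the test data to generate predicted purchase counts and then measure the accuracy using RMSE (Root Mean Squared Error). A lower RMSE means better predictions! Keep in mind that with `coldStartStrategy="drop"`, test rows whose customer or article never appeared in training are left out of the score.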

In [17]:
#Generate predictions and evaluate using RMSE
predictions = best_model.transform(test)
rmse = evaluator.evaluate(predictions)

In [18]:
print("RMSE =" + str(rmse))

RMSE =0.20660370130920125
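
To see what RMSE measures, here is the calculation written out by hand on a few made-up numbers. The `actual` and `predicted` lists are hypothetical, not taken from the model above.

```python
import math


def rmse(actual, predicted):
    """Square root of the mean squared difference between two sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical purchase counts and model predictions
actual = [1, 1, 2, 1]
predicted = [0.9, 1.2, 1.5, 1.0]

score = rmse(actual, predicted)
print(round(score, 4))
```

Squaring the errors means one large miss (like predicting 1.5 for a count of 2) weighs more than several small ones.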


👍 **Generating Recommendations** 👍

With our trained model, we can now generate recommendations. Let's start from the article side: `recommendForAllItems(10)` returns, for each article, the 10 customers the model scores highest.

In [19]:
# show() returns None, so keep the DataFrame in its own variable first
item_recs = best_model.recommendForAllItems(10)
item_recs.show(10)

+----------------+--------------------+
|article_id_index|     recommendations|
+----------------+--------------------+
|               1|[{9001, 5.1390114...|
|               3|[{9001, 6.4745083...|
|               5|[{4907, 5.561627}...|
|               6|[{4907, 5.64958},...|
|               9|[{9001, 4.6373963...|
|              12|[{4907, 4.751717}...|
|              13|[{9001, 5.167022}...|
|              15|[{4907, 6.066818}...|
|              16|[{9001, 5.2180023...|
|              17|[{4907, 5.821047}...|
+----------------+--------------------+
only showing top 10 rows



👥 **Recommendations for All Users** 👥

The previous cell listed the top customers for each article. Now let's go the other way and generate the top 10 articles for every customer. This gives us a comprehensive list of what each customer might like.

In [20]:
df_recom = best_model.recommendForAllUsers(10)
df_recom.show(10)

+-----------------+--------------------+
|customer_id_index|     recommendations|
+-----------------+--------------------+
|                1|[{1661, 2.5185127...|
|                3|[{1661, 2.822737}...|
|                5|[{1661, 3.1315286...|
|                6|[{1661, 5.926103}...|
|                9|[{1661, 2.3304753...|
|               12|[{1661, 2.6639178...|
|               13|[{1661, 3.0723724...|
|               15|[{1661, 2.4710696...|
|               16|[{1661, 2.6907763...|
|               17|[{1661, 2.535876}...|
+-----------------+--------------------+
only showing top 10 rows



🔄 **Transforming Recommendations** 🔄

We'll select the customer and article indexes from the recommendations and then convert the Spark DataFrame to a pandas DataFrame. This can make it easier to work with if you prefer pandas for certain tasks.

In [21]:
df_recom = df_recom.select("customer_id_index","recommendations.article_id_index")
df_recom.show(10)
df_recom = df_recom.toPandas()

+-----------------+--------------------+
|customer_id_index|    article_id_index|
+-----------------+--------------------+
|                1|[1661, 5111, 1891...|
|                3|[1661, 5111, 4249...|
|                5|[1661, 1891, 5111...|
|                6|[1661, 1910, 6383...|
|                9|[1661, 5111, 1221...|
|               12|[1661, 1891, 5111...|
|               13|[1661, 5111, 1891...|
|               15|[1661, 5111, 1891...|
|               16|[1661, 5111, 1891...|
|               17|[1661, 1891, 5111...|
+-----------------+--------------------+
only showing top 10 rows



In [22]:
df_recom.sort_values('customer_id_index')

Unnamed: 0,customer_id_index,article_id_index
4821,0,"[1661, 5111, 1891, 1221, 4249, 3870, 3950, 707..."
0,1,"[1661, 5111, 1891, 1221, 1035, 4405, 3950, 424..."
4822,2,"[1661, 1891, 4405, 3870, 5111, 1035, 1221, 251..."
1,3,"[1661, 5111, 4249, 4039, 7073, 4405, 2511, 395..."
4823,4,"[1661, 3870, 1221, 5111, 1035, 1891, 5891, 638..."
...,...,...
9648,10521,"[1661, 4405, 5111, 1035, 5891, 7074, 3018, 327..."
9649,10522,"[1661, 5111, 6383, 5040, 4405, 1221, 4249, 395..."
4818,10524,"[1661, 1891, 1221, 3870, 1035, 5111, 1765, 589..."
4819,10526,"[3870, 4405, 1661, 5702, 4894, 1035, 4808, 386..."


🗺️ **Mapping IDs** 🗺️

To make our recommendations more understandable, we need a way to convert the indexed customer and article IDs back to their original values. We'll create a mapping DataFrame for this.

In [23]:
# Build a lookup table between the original IDs and their numeric indexes
md = transformed.select('article_id', 'article_id_index', 'customer_id', 'customer_id_index')
md = md.toPandas()
md

Unnamed: 0,article_id,article_id_index,customer_id,customer_id_index
0,0780418013,2237.0,00f7bc5c0df4c615b2502a2c2e9ef9eff988c81dec2e5e...,783.0
1,0791587001,35.0,02094817e46f3b692149b06cf9577e42848c2294e78598...,785.0
2,0839332002,732.0,0333e5dda0257e9f498be52f1e569bfae576caed0cbdcd...,4098.0
3,0573085043,44.0,07c7a1172caf8fb9784b28e51b25b985ab6a1ec7ce923e...,1702.0
4,0714790020,5.0,081373184e601470cc9911f33d3eeebc6f33ed79222573...,4146.0
...,...,...,...,...
29481,0889574009,7169.0,e38b916888d34fad15caa3597a4d52e696f0d4074402d0...,10095.0
29482,0895610004,766.0,662e8285f18b4ee6e871a3ca2f95938a36e9746a2c43c2...,5107.0
29483,0674606068,524.0,813f8e4991473fd81da8f683362998ccac07e50ec96c02...,234.0
29484,0688728023,611.0,ea9fa86df5414dd80e1230b4efebde1e5b89ca80694534...,379.0


🔗 **Mapping Indexed IDs Back to Original** 🔗

To make our recommendations readable, we'll use the mapping DataFrame we created to replace the numerical indexes with the original customer and article IDs.

In [24]:
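# Lookup tables from numeric index back to the original ID
dict1 = dict(zip(md['article_id_index'], md['article_id']))
dict2 = dict(zip(md['customer_id_index'], md['customer_id']))

# Translate each list of article indexes, skipping any index missing from the lookup
df_recom['article_id'] = df_recom['article_id_index'].map(lambda x: [dict1[y] for y in x if y in dict1])
df_recom['customer_id'] = df_recom['customer_id_index'].map(dict2)
df_recom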

Unnamed: 0,customer_id_index,article_id_index,article_id,customer_id
0,1,"[1661, 5111, 1891, 1221, 1035, 4405, 3950, 424...","[0297078008, 0757971006, 0871638002, 087290100...",5e8fb4d457fdffc61e235328ba7e43a4139c94c5f9d52a...
1,3,"[1661, 5111, 4249, 4039, 7073, 4405, 2511, 395...","[0297078008, 0757971006, 0502869002, 092643500...",1796e87fd2e88932b50966a07cc18b490cd5e1474dbee3...
2,5,"[1661, 1891, 5111, 6874, 4249, 5894, 4212, 122...","[0297078008, 0871638002, 0757971006, 087726100...",f50287d9cf052d4b423fc3d4d7a0c306de8c752583544b...
3,6,"[1661, 1910, 6383, 5040, 42, 5111, 1035, 3950,...","[0297078008, 0880479001, 0857347002, 075048101...",54e8ebd39543b5a4d69c3e7d79977558d2a606e6540ba0...
4,9,"[1661, 5111, 1221, 1891, 1035, 3950, 7073, 440...","[0297078008, 0757971006, 0872901005, 087163800...",298523b6637340717e19df4e2a46a7ce7d80434c985a84...
...,...,...,...,...
9645,10515,"[1661, 5891, 1553, 3018, 3031, 2752, 7398, 424...","[0297078008, 0825109005, 0895730002, 074256100...",fefb56faca51b2e9de0082a3da3379e1fd41709509f6a4...
9646,10516,"[1661, 5111, 6383, 5040, 1910, 4405, 1035, 122...","[0297078008, 0757971006, 0857347002, 075048101...",ff09354db173e36e7148bd2da4da7890eaa95b00556014...
9647,10519,"[1661, 5111, 1891, 1221, 3870, 6383, 5040, 424...","[0297078008, 0757971006, 0871638002, 087290100...",ff240ee1590922141103063f7b4212c3832f0f5b0e0eb2...
9648,10521,"[1661, 4405, 5111, 1035, 5891, 7074, 3018, 327...","[0297078008, 0571048002, 0757971006, 088380800...",ff6d8d22b25287dfb2b0bbec08d4425aa67fbf02911fe0...
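
A small detail worth knowing: the mapping table stores indexes as floats (like `2237.0`), while the recommendation lists contain integers (like `1661`). The lookup still works because Python treats `2` and `2.0` as the same dictionary key. Here is a minimal sketch with made-up labels:

```python
# Hypothetical lookup with float keys, like the index columns in md
index_to_article = {0.0: "article_A", 1.0: "article_B", 2.0: "article_C"}

# Recommendation lists hold integer indexes; 7 is deliberately unknown
recommended = [2, 0, 7]

# Same pattern as the cell above: translate, skipping unknown indexes
mapped = [index_to_article[i] for i in recommended if i in index_to_article]
missing = [i for i in recommended if i not in index_to_article]

print(mapped)
print(missing)
```

Note that the `if y in dict1` filter silently drops any index it cannot find, so a quick check like `missing` can be useful if a list comes back shorter than expected.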


Great! To present our final recommendations, let's remove the indexed ID columns and keep only the original customer and article IDs. This gives us a clean list of recommendations. 👇

In [25]:
recom_final = df_recom.drop(['customer_id_index','article_id_index'], axis = 1)
finalpre=recom_final[['customer_id','article_id']]
finalpre

Unnamed: 0,customer_id,article_id
0,5e8fb4d457fdffc61e235328ba7e43a4139c94c5f9d52a...,"[0297078008, 0757971006, 0871638002, 087290100..."
1,1796e87fd2e88932b50966a07cc18b490cd5e1474dbee3...,"[0297078008, 0757971006, 0502869002, 092643500..."
2,f50287d9cf052d4b423fc3d4d7a0c306de8c752583544b...,"[0297078008, 0871638002, 0757971006, 087726100..."
3,54e8ebd39543b5a4d69c3e7d79977558d2a606e6540ba0...,"[0297078008, 0880479001, 0857347002, 075048101..."
4,298523b6637340717e19df4e2a46a7ce7d80434c985a84...,"[0297078008, 0757971006, 0872901005, 087163800..."
...,...,...
9645,fefb56faca51b2e9de0082a3da3379e1fd41709509f6a4...,"[0297078008, 0825109005, 0895730002, 074256100..."
9646,ff09354db173e36e7148bd2da4da7890eaa95b00556014...,"[0297078008, 0757971006, 0857347002, 075048101..."
9647,ff240ee1590922141103063f7b4212c3832f0f5b0e0eb2...,"[0297078008, 0757971006, 0871638002, 087290100..."
9648,ff6d8d22b25287dfb2b0bbec08d4425aa67fbf02911fe0...,"[0297078008, 0571048002, 0757971006, 088380800..."


Now that we have our final recommendations in a clean format, let's save them to a CSV file. This way, you can easily access and use the recommendations. 👇

In [27]:
my_pred = finalpre
my_pred.to_csv('my_pred.csv',index=False)
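
One caveat when reading the file back: a CSV cell can only hold text, so each list of article IDs is stored as its string form (something like `"['a1', 'a2']"`). The sketch below imitates that round trip with the standard `csv` module and turns the text back into a list with `ast.literal_eval`. The rows are hypothetical, and writing `str(list)` by hand is an assumption about how the lists end up in the file.

```python
import ast
import csv
import io

# Hypothetical recommendations: one list of article IDs per customer
rows = [{"customer_id": "c1", "article_id": ["0001", "0002"]}]

# Write to an in-memory CSV, storing each list as its string form
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["customer_id", "article_id"])
writer.writeheader()
for row in rows:
    writer.writerow(
        {"customer_id": row["customer_id"], "article_id": str(row["article_id"])}
    )

# Read it back and parse the string form into a real list again
buffer.seek(0)
restored = []
for row in csv.DictReader(buffer):
    row["article_id"] = ast.literal_eval(row["article_id"])
    restored.append(row)

print(restored)
```

Because the IDs are quoted strings inside the list, their leading zeros survive the round trip.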

💾 **Saving the Trained Model** 💾

Let's save the trained `best_model` so you can easily load and use it again without retraining.

In [28]:
# Define the path to save the model
model_path = "/content/hm_recommendation_model"

# Save the best model (overwrite so re-running this cell doesn't fail if the path exists)
best_model.write().overwrite().save(model_path)

print(f"Model saved successfully to {model_path}")

Model saved successfully to /content/hm_recommendation_model


🎉 **Tutorial Complete: Insights and Next Steps** 🎉

Congratulations! You've successfully built a basic H&M fashion recommendation system using PySpark. Throughout this tutorial, we've covered several key steps:

1.  **Data Loading and Exploration:** We started by loading the transaction data and exploring its characteristics, including the date range and data sparsity.
2.  **Data Preparation:** We cleaned and transformed the data, filtering for a specific date and aggregating transactions by customer and article.
3.  **Feature Engineering:** We indexed the customer and article IDs to prepare the data for the ALS model.
4.  **Model Training and Evaluation:** We trained an ALS collaborative filtering model, tuned its hyperparameters using a train/validation split, and evaluated its performance using RMSE.
5.  **Recommendation Generation:** We generated recommendations for all users and mapped the indexed IDs back to the original values for interpretability.
6.  **Saving Results:** We saved the final recommendations to a CSV file and the trained model for future use.

**Key Insights:**

*   We observed the sparsity of the transaction data, which is common in recommendation systems and highlights the need for techniques like collaborative filtering to make predictions for unseen interactions.
*   Analyzing customer and article activity gave us a better understanding of the data distribution and popular items/customers.
*   The RMSE value provides a measure of how well our model's predictions align with actual purchase counts.

**Next Steps and Further Exploration:**

*   **Model Improvement:** You could explore different ALS hyperparameters, try alternative collaborative filtering algorithms, or incorporate additional features (e.g., article metadata, customer demographics) to potentially improve recommendation quality.
*   **Evaluation Metrics:** Besides RMSE, consider using metrics more relevant to recommendation systems, such as Precision@k, Recall@k, or Mean Average Precision (MAP).
*   **Deployment:** Think about how you would deploy this model to generate real-time recommendations for users in a production environment.
*   **Cold Start Problem:** Address the cold start problem for new users or items that have little or no interaction data.
*   **Different Recommendation Approaches:** Explore content-based filtering or hybrid recommendation systems.

This tutorial provides a solid foundation. Feel free to experiment further and adapt the code to your specific needs! Happy coding! 😊

📂 **Bonus** 📂

You can easily load the saved model back into your Spark environment and use it to generate recommendations again. Note that an ALS model can only score customers and articles whose indexes it saw during training.

In [None]:
from pyspark.ml.recommendation import ALSModel

# Define the path where the model was saved
loaded_model_path = "/content/hm_recommendation_model"

# Load the saved model
loaded_model = ALSModel.load(loaded_model_path)

# Now you can use the loaded_model to generate recommendations.
# For example, to recommend items for a subset of users
# (the DataFrame needs a customer_id_index column):
# new_user_data = spark.createDataFrame([...]) # Replace with your new data
# new_recommendations = loaded_model.recommendForUserSubset(new_user_data, 10)
# new_recommendations.show()

# Or to recommend users for a subset of items
# (the DataFrame needs an article_id_index column):
# new_item_data = spark.createDataFrame([...]) # Replace with your new data
# new_recommendations = loaded_model.recommendForItemSubset(new_item_data, 10)
# new_recommendations.show()

# To generate recommendations for all users (similar to what we did before):
all_user_recs = loaded_model.recommendForAllUsers(10)
all_user_recs.show(5)

# You can then apply the same mapping logic as before to get original article and customer IDs.

📂 **Moving the Model to Google Drive** 📂

To keep your project files together, let's move the saved model directory from the Colab environment to your Kaggle folder in Google Drive.

In [29]:
!mv /content/hm_recommendation_model /content/drive/MyDrive/kaggle/