# **Welcome to the Notebook**

### Task 1 - Set up the project

Install the required modules.

In [1]:
!pip install openai python-dotenv pyspark sentence-transformers plotly

# !pip install --upgrade httpx openai python-dotenv pyspark



Import the modules

In [2]:
from dotenv import load_dotenv
import os
from openai import OpenAI
import pandas as pd
import numpy as np

from pyspark.sql import SparkSession
# explicit import; a wildcard import would shadow Python builtins such as sum and max
from pyspark.sql.functions import concat_ws
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, FloatType

from pyspark.ml.feature import VectorAssembler, PCA
from pyspark.ml.clustering import KMeans
import plotly.express as px

Set up the OpenAI API

In [3]:
# Load the API key from the dotenv file
load_dotenv(dotenv_path='apikey.env.txt')

# Retrieve the API key from the environment variable
APIKEY = os.getenv("APIKEY")

# Create an instance of the OpenAI client with the provided API key
client = OpenAI(api_key=APIKEY)

client

<openai.OpenAI at 0x7f00a5c289d0>

In [4]:
# Confirm the key was loaded without printing the secret itself
print(f"API key loaded: {APIKEY is not None}")

API key loaded: True
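
Printing a full key leaves it in the notebook output, where it can leak when the notebook is shared. If some visual confirmation of which key is in use is wanted, a minimal sketch is to show only the last few characters. `mask_secret` is a hypothetical helper written for illustration; it is not part of `openai` or `python-dotenv`.

```python
def mask_secret(secret, visible=4):
    """Return a masked form of a secret, showing only the last `visible` characters.

    Hypothetical helper for illustration only.
    """
    if not secret:
        return "<not set>"
    if len(secret) <= visible:
        # Too short to reveal any part of it safely
        return "*" * len(secret)
    return "*" * (len(secret) - visible) + secret[-visible:]


# Example with a made-up value, not a real key
print("API key loaded:", mask_secret("sk-example-1234"))
```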


Create a Spark session:
the entry point to Spark functionality. It connects your application to the Spark cluster and lets you load, process, and analyse large datasets.

In [5]:
spark = SparkSession.builder.appName("OpenAI_Embedded_Recommender_System").getOrCreate()
spark

Loading the dataset

In [6]:
file_path = "products_dataset.csv"
data = spark.read.csv(file_path, header=True, inferSchema=True, samplingRatio=1)

data.show()

+----------+--------------------+--------------------+
|product_id|               title|         description|
+----------+--------------------+--------------------+
|        P0|Men's 3X Large Ca...|This heavyweight,...|
|        P1|Turmode 30 ft. RP...|If you need more ...|
|        P2|Large Tapestry Bo...|Polyester cover r...|
|        P3|16-Gauge-Sinks Ve...|It features a rec...|
|        P4|Men's Crazy Horse...|This 9 in. black ...|
|        P5|Mariana 6 ft. Mul...|With robust struc...|
|        P6|5 gal. #650C-2 Po...|BEHR PRO i300 Sem...|
|        P7|7/8 in. x 4-1/2 i...|DEWALT High Perfo...|
|        P8|  Ring Gold Bar Cart|This Ring Bar Car...|
|        P9|Traditional Silve...|This transitional...|
|       P10|15 in. x 59 in. O...|Its easy to add a...|
|       P11|1 qt. #350F-7 Wil...|BEHR PREMIUM PLUS...|
|       P12|Anthracite Cordle...|BlindsAvenue ligh...|
|       P13|SlimGrip 78-Inch ...|Luverne SlimGrip ...|
|       P14|6 in. x 28 in. x ...|Our Rustic Collec...|
|       P1

List of 8 products recently viewed by the user.

In [7]:
recently_viewed_products = [
    'P316',
    'P333',
    'P1115',
    'P1691',
    'P1082',
    'P397',
    'P1441',
    'P1054',
]

### Task 2 - Prepare the dataset

Combine `title` and `description` Columns

In [8]:
data = data.withColumn("combined_text", concat_ws(" ", "title", "description"))
# concat_ws - concatenate with separator " "
data.show()

+----------+--------------------+--------------------+--------------------+
|product_id|               title|         description|       combined_text|
+----------+--------------------+--------------------+--------------------+
|        P0|Men's 3X Large Ca...|This heavyweight,...|Men's 3X Large Ca...|
|        P1|Turmode 30 ft. RP...|If you need more ...|Turmode 30 ft. RP...|
|        P2|Large Tapestry Bo...|Polyester cover r...|Large Tapestry Bo...|
|        P3|16-Gauge-Sinks Ve...|It features a rec...|16-Gauge-Sinks Ve...|
|        P4|Men's Crazy Horse...|This 9 in. black ...|Men's Crazy Horse...|
|        P5|Mariana 6 ft. Mul...|With robust struc...|Mariana 6 ft. Mul...|
|        P6|5 gal. #650C-2 Po...|BEHR PRO i300 Sem...|5 gal. #650C-2 Po...|
|        P7|7/8 in. x 4-1/2 i...|DEWALT High Perfo...|7/8 in. x 4-1/2 i...|
|        P8|  Ring Gold Bar Cart|This Ring Bar Car...|Ring Gold Bar Car...|
|        P9|Traditional Silve...|This transitional...|Traditional Silve...|
|       P10|
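
`concat_ws` joins the given columns with the separator and skips null values instead of producing a null result, so a product with a missing description still gets a usable `combined_text`. The pure-Python sketch below mimics that behaviour on single values. `concat_ws_py` is an illustrative stand-in, not a PySpark function, and the strings are made up.

```python
def concat_ws_py(sep, *values):
    """Join values with `sep`, skipping None, similar to how Spark's concat_ws treats nulls."""
    return sep.join(str(v) for v in values if v is not None)


# Made-up example values
combined = concat_ws_py(" ", "Example title", "Example description")
missing_description = concat_ws_py(" ", "Example title", None)

print(combined)
print(missing_description)
```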

Get the `combined_text` column and convert it into a list

In [9]:
list_combined_text = data.select("combined_text").rdd.flatMap(lambda x: x).collect()
print(list_combined_text[:3])

["Men's 3X Large Carbon Heather Cotton/Polyester Rain Defender Paxton Heavyweight Hooded Zip-Front Sweatshirt This heavyweight, water-repellent hooded sweatshirt has a zip front for fast layering. ORIGINAL FIT. 13 oz., 75% cotton/25% polyester blend with Rain Defender durable water repellent. Attached, jersey-lined three-piece hood with drawcord closure. Antique-finish brass front zipper. Two front hand-warmer pockets have a hidden security pocket inside. Stretchable, spandex-reinforced rib-knit cuffs and waistband. Locker loop facilitates hanging.", "Turmode 30 ft. RP TNC Female to RP TNC Male Adapter Cable If you need more length between your existing wireless device and Hi-Gain Antenna, this is the product for you. It's compatible with most Wi-Fi Antennas, so it is easy for you to extend your wireless network. Just replace your existing cable that runs between your wireless device and Antenna and you're ready to use your network with extended range.", 'Large Tapestry Bolster Bed Pol

Note: Keep in mind that this data lives in Spark, distributed across the cluster, and is not directly accessible in local memory. To transform or manipulate it, we cannot simply treat it as a Python object.

To work with this data, we use an RDD (Resilient Distributed Dataset): a distributed collection of data across the Spark cluster that supports parallel, fault-tolerant transformations. Calling `.rdd` on the DataFrame returns this RDD object, which exposes the transformation methods. Here we use `flatMap`, which applies a function to each element and flattens the results. With `lambda x: x`, each `Row` is returned as-is and flattened into its single `combined_text` value. The result is still inside the Spark cluster, so we call `collect()`, one of the Spark actions that triggers execution and brings the data from the cluster into local memory. The result is a plain Python list.
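
The effect of `flatMap(lambda x: x)` on single-column rows can be sketched in plain Python. Here `flat_map` is an illustrative stand-in for the RDD method, and each tuple stands in for a Spark `Row` with made-up text.

```python
from itertools import chain


def flat_map(func, items):
    """Apply `func` to each item and flatten the resulting iterables into one list."""
    return list(chain.from_iterable(func(item) for item in items))


# Each tuple stands in for a Spark Row with a single `combined_text` field
rows = [
    ("Example title A Example description A",),
    ("Example title B Example description B",),
]

# Returning the row itself and flattening yields the bare strings
texts = flat_map(lambda row: row, rows)
print(texts)
```

An equivalent that avoids the RDD API is a list comprehension over `data.select("combined_text").collect()`, reading `row.combined_text` from each `Row`.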

Use OpenAI text embedding model to create the vector embeddings.

In [10]:
# # Using the OpenAI API for text embedding (paid)
# response = client.embeddings.create(
#     input=list_combined_text,
#     model="text-embedding-3-small",
#     dimensions=512,
# )

# embedding_vectors = [d.embedding for d in response.data]
# embedding_vectors[:3]
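
Embedding APIs generally limit how much input a single request may carry, so a long list of texts is usually sent in batches. The sketch below assumes a hypothetical `embed_batch` callable that takes a list of strings and returns one vector per string (for example, a thin wrapper around the commented-out `client.embeddings.create` call). The batch size of 100 is an arbitrary assumption, not a documented limit.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements each."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_in_batches(texts, embed_batch, batch_size=100):
    """Embed `texts` batch by batch and return the vectors in the original order."""
    vectors = []
    for batch in batched(texts, batch_size):
        vectors.extend(embed_batch(batch))
    return vectors


# Stand-in embedder for illustration: maps each text to a 1-dimensional "vector"
def fake_embed_batch(batch):
    return [[float(len(text))] for text in batch]


vectors = embed_in_batches(["a", "bb", "ccc"], fake_embed_batch, batch_size=2)
print(vectors)
```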


In [11]:
# Using HuggingFace Model for text embedding - Free
from sentence_transformers import SentenceTransformer
#model = SentenceTransformer('paraphrase-MiniLM-L6-v2')  # Use any other model

model = SentenceTransformer('paraphrase-TinyBERT-L6-v2')

embedding_vectors = model.encode(list_combined_text)

print(f"Dimensionality of embedding: {len(embedding_vectors[0])}")
embedding_vectors[:3]

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Dimensionality of embedding: 768


array([[ 0.11626028, -0.15627941, -0.2368016 , ..., -0.08178298,
        -0.20101297,  0.10670185],
       [ 0.01522942, -0.13704216, -0.24675126, ...,  0.30454403,
        -0.2014109 , -0.08399742],
       [-0.12096605, -0.13888785, -0.25137565, ...,  0.15209316,
        -0.1651402 ,  0.0063173 ]], dtype=float32)

Let's put the embedding vectors into our original dataframe

Convert the embedding vectors into a PySpark DataFrame

In [12]:
features_column_names = [f"embedding_{i}" for i in range(len(embedding_vectors[0]))]

# convert the NumPy embedding array into a PySpark DataFrame, one column per dimension
embedding_data =  spark.createDataFrame(embedding_vectors, schema=features_column_names)
embedding_data.show()


(Output truncated: a wide table with one `embedding_i` column per embedding dimension.)

Add a unique `row_id` to each row in the PySpark dataframe

In [13]:
embedding_data = embedding_data.repartition(1).withColumn("row_id", F.monotonically_increasing_id())
embedding_data.show()

(Output truncated: the same wide table of `embedding_i` columns, now with a `row_id` column.)

## Important Note:
When working with Spark, data is often split across multiple partitions to enable parallel processing. While this boosts performance, it makes it hard to align rows between two DataFrames, because each partition handles its data independently and `monotonically_increasing_id()` encodes the partition index in the IDs it generates. To keep the row IDs aligned across both DataFrames, we repartition each one into a single partition before generating the IDs. The IDs are then sequential (0, 1, 2, ...) in both DataFrames, which lets us join them row by row. Note that this relies on both DataFrames keeping their original row order through the repartition, which Spark does not formally guarantee. It works for a small dataset like this one, but the alignment should be verified on larger data.
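
Spark's documentation describes the IDs from `monotonically_increasing_id()` as 64-bit integers with the partition ID in the upper 31 bits and the record number within the partition in the lower 33 bits. The sketch below reproduces that layout in plain Python to show why a single partition yields 0, 1, 2, ... while a second partition jumps to very large values. `spark_style_id` is an illustrative name, not a Spark function.

```python
def spark_style_id(partition_index, row_in_partition):
    """Compose an ID from a partition index (upper bits) and a row number (lower 33 bits)."""
    return (partition_index << 33) + row_in_partition


# One partition: IDs are consecutive and line up across DataFrames
single_partition_ids = [spark_style_id(0, i) for i in range(3)]

# Two partitions with two rows each: the second partition starts at 2**33
two_partition_ids = [spark_style_id(p, i) for p in range(2) for i in range(2)]

print(single_partition_ids)
print(two_partition_ids)
```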

Add a unique `row_id` to each row in our main PySpark dataframe `data`

In [14]:
data = data.repartition(1).withColumn("row_id", F.monotonically_increasing_id())
data.show()

+----------+--------------------+--------------------+--------------------+------+
|product_id|               title|         description|       combined_text|row_id|
+----------+--------------------+--------------------+--------------------+------+
|        P0|Men's 3X Large Ca...|This heavyweight,...|Men's 3X Large Ca...|     0|
|        P1|Turmode 30 ft. RP...|If you need more ...|Turmode 30 ft. RP...|     1|
|        P2|Large Tapestry Bo...|Polyester cover r...|Large Tapestry Bo...|     2|
|        P3|16-Gauge-Sinks Ve...|It features a rec...|16-Gauge-Sinks Ve...|     3|
|        P4|Men's Crazy Horse...|This 9 in. black ...|Men's Crazy Horse...|     4|
|        P5|Mariana 6 ft. Mul...|With robust struc...|Mariana 6 ft. Mul...|     5|
|        P6|5 gal. #650C-2 Po...|BEHR PRO i300 Sem...|5 gal. #650C-2 Po...|     6|
|        P7|7/8 in. x 4-1/2 i...|DEWALT High Perfo...|7/8 in. x 4-1/2 i...|     7|
|        P8|  Ring Gold Bar Cart|This Ring Bar Car...|Ring Gold Bar Car...|     8|
|   

Let's join the two dataframes

In [15]:
data = data.join(embedding_data, on="row_id", how="inner").drop("row_id")
data.show()

(Output truncated: the product columns followed by one `embedding_i` column per embedding dimension.)

### Task 3 - Cluster products using K-means

Assemble the Embedding Columns (768 with the model used above) into a Single 'features' Column

In [16]:
# VectorAssembler is a feature transformer that combines
# multiple numeric columns into a single vector column in PySpark

assembler = VectorAssembler(inputCols=features_column_names, outputCol="features")
data = assembler.transform(data)

# drop the individual embedding columns, keeping only the assembled vector
data = data.select("product_id", "title", "description", "features")

data.show()

+----------+--------------------+--------------------+--------------------+
|product_id|               title|         description|            features|
+----------+--------------------+--------------------+--------------------+
|        P0|Men's 3X Large Ca...|This heavyweight,...|[0.11626028269529...|
|        P1|Turmode 30 ft. RP...|If you need more ...|[0.01522942166775...|
|        P2|Large Tapestry Bo...|Polyester cover r...|[-0.1209660470485...|
|        P3|16-Gauge-Sinks Ve...|It features a rec...|[-0.0138693470507...|
|        P4|Men's Crazy Horse...|This 9 in. black ...|[0.13365463912487...|
|        P5|Mariana 6 ft. Mul...|With robust struc...|[0.12027258425951...|
|        P6|5 gal. #650C-2 Po...|BEHR PRO i300 Sem...|[-0.1115422695875...|
|        P7|7/8 in. x 4-1/2 i...|DEWALT High Perfo...|[-0.0019701037090...|
|        P8|  Ring Gold Bar Cart|This Ring Bar Car...|[-0.2596883475780...|
|        P9|Traditional Silve...|This transitional...|[-0.2690607309341...|
|       P10|

In [17]:
# check the 20th row's features value and size
row_features = data.select('features').collect()[19]
print(row_features)
# Get the length of the DenseVector
vector_length = len(row_features['features'])
print(vector_length)

Row(features=DenseVector([-0.0356, -0.2896, -0.2711, 0.2623, -0.0849, -0.1187, -0.0722, 0.0536, 0.0999, 0.031, 0.0463, 0.1741, 0.4183, 0.0105, -0.0697, -0.0561, -0.0203, -0.1501, -0.3006, 0.1975, -0.1251, 0.0763, 0.064, 0.0578, 0.1605, 0.1923, 0.202, 0.0521, 0.1808, 0.3557, -0.1614, -0.0269, 0.1534, -0.1119, 0.0011, 0.0752, 0.03, -0.0924, 0.0484, -0.0173, 0.3248, 0.0327, 0.1905, -0.3359, 0.215, 0.2038, 0.3394, 0.0094, 0.2596, 0.0311, 0.1486, -0.2727, 0.1117, -0.0515, -0.0298, -0.0034, 0.1553, 0.0369, 0.0619, 0.0799, -0.0896, -0.3364, 0.257, 0.0287, -0.0788, 0.2046, 0.0445, -0.0569, -0.0327, -0.0159, -0.3711, -0.058, -0.0018, 0.0515, 0.189, 0.1604, -0.0328, -0.1881, -0.0699, -0.0798, -0.4047, 0.066, -0.2359, -0.117, -0.1422, -0.2098, -0.4698, 0.1062, -0.3156, 0.0207, -0.0032, 0.1666, 0.2584, 0.0052, -0.1225, 0.1396, -0.4644, -0.0135, -0.0532, -0.338, 0.0295, -0.0652, -0.1284, 0.1193, -0.0533, -0.2279, 0.3069, -0.0293, 0.0884, 0.2084, -0.0762, -0.0395, 0.1102, 0.1271, -0.1259, -0.004, 0.

Apply K-Means Clustering with 5 Clusters on the `features` Column

In [18]:
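kmeans = KMeans(k=5, featuresCol='features', predictionCol="cluster")
# use a distinct name so the SentenceTransformer `model` defined earlier is not overwritten
kmeans_model = kmeans.fit(data)
clustered_data = kmeans_model.transform(data)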
clustered_data.show()

+----------+--------------------+--------------------+--------------------+-------+
|product_id|               title|         description|            features|cluster|
+----------+--------------------+--------------------+--------------------+-------+
|        P0|Men's 3X Large Ca...|This heavyweight,...|[0.11626028269529...|      3|
|        P1|Turmode 30 ft. RP...|If you need more ...|[0.01522942166775...|      3|
|        P2|Large Tapestry Bo...|Polyester cover r...|[-0.1209660470485...|      2|
|        P3|16-Gauge-Sinks Ve...|It features a rec...|[-0.0138693470507...|      1|
|        P4|Men's Crazy Horse...|This 9 in. black ...|[0.13365463912487...|      3|
|        P5|Mariana 6 ft. Mul...|With robust struc...|[0.12027258425951...|      1|
|        P6|5 gal. #650C-2 Po...|BEHR PRO i300 Sem...|[-0.1115422695875...|      0|
|        P7|7/8 in. x 4-1/2 i...|DEWALT High Perfo...|[-0.0019701037090...|      3|
|        P8|  Ring Gold Bar Cart|This Ring Bar Car...|[-0.2596883475780...| 
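
K-means alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its assigned points. The following pure-Python sketch of that loop (Lloyd's algorithm) is for intuition only. It is not how `pyspark.ml.clustering.KMeans` is implemented, and the toy points and starting centroids are made up.

```python
def assign(points, centroids):
    """Return, for each point, the index of the nearest centroid (squared Euclidean distance)."""
    labels = []
    for p in points:
        dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
        labels.append(dists.index(min(dists)))
    return labels


def update(points, labels, centroids):
    """Move each centroid to the mean of its members; keep it unchanged if it has none."""
    new_centroids = []
    for k, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == k]
        if not members:
            new_centroids.append(old)
            continue
        new_centroids.append(
            tuple(sum(m[d] for m in members) / len(members) for d in range(len(old)))
        )
    return new_centroids


def kmeans(points, centroids, max_iter=20):
    """Run assignment and update steps until the labels stop changing."""
    centroids = [tuple(c) for c in centroids]
    labels = None
    for _ in range(max_iter):
        new_labels = assign(points, centroids)
        if new_labels == labels:
            break
        labels = new_labels
        centroids = update(points, labels, centroids)
    return labels, centroids


toy_points = [(0, 0), (0, 1), (10, 10), (10, 11)]
toy_labels, toy_centroids = kmeans(toy_points, centroids=[(0, 0), (10, 10)])
print(toy_labels, toy_centroids)
```

The result of K-means depends on the starting centroids. Spark's `KMeans` accepts a `seed` parameter, which can be set when reproducible cluster assignments are needed.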

### Task 4 - Visualize the clusters

Let's reduce the dimensionality of our features for visualization purposes

`768 dimensions => 2 dimensions`

In [19]:
pca = PCA(k=2, inputCol="features", outputCol="pca_features")
pca_model = pca.fit(clustered_data)

pca_results = pca_model.transform(clustered_data)
pca_results.show()

+----------+--------------------+--------------------+--------------------+-------+--------------------+
|product_id|               title|         description|            features|cluster|        pca_features|
+----------+--------------------+--------------------+--------------------+-------+--------------------+
|        P0|Men's 3X Large Ca...|This heavyweight,...|[0.11626028269529...|      3|[-0.1491433843630...|
|        P1|Turmode 30 ft. RP...|If you need more ...|[0.01522942166775...|      3|[0.62450573790398...|
|        P2|Large Tapestry Bo...|Polyester cover r...|[-0.1209660470485...|      2|[0.13092184088527...|
|        P3|16-Gauge-Sinks Ve...|It features a rec...|[-0.0138693470507...|      1|[-0.3537227685660...|
|        P4|Men's Crazy Horse...|This 9 in. black ...|[0.13365463912487...|      3|[0.80800804841524...|
|        P5|Mariana 6 ft. Mul...|With robust struc...|[0.12027258425951...|      1|[-0.3008007430855...|
|        P6|5 gal. #650C-2 Po...|BEHR PRO i300 Sem...|[
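
PCA projects the centered data onto the directions of greatest variance. As a rough illustration of what the first component is, the sketch below finds it with power iteration on the covariance matrix in plain Python. This is a teaching aid under simplifying assumptions (tiny dense data, a start vector that is not orthogonal to the answer), not the method `pyspark.ml.feature.PCA` uses internally, and the sample points are made up.

```python
import math


def first_principal_component(points, iterations=100):
    """Estimate the direction of greatest variance via power iteration."""
    n = len(points)
    dim = len(points[0])
    mean = [sum(p[d] for p in points) / n for d in range(dim)]
    centered = [[p[d] - mean[d] for d in range(dim)] for p in points]
    # Sample covariance matrix of the centered data
    cov = [
        [sum(row[i] * row[j] for row in centered) / (n - 1) for j in range(dim)]
        for i in range(dim)
    ]
    # Arbitrary start vector; assumed not orthogonal to the leading eigenvector
    v = [1.0 / (i + 1) for i in range(dim)]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(dim)) for i in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    return mean, v


def project(point, mean, component):
    """Coordinate of `point` along `component` after centering."""
    return sum((x - m) * c for x, m, c in zip(point, mean, component))


sample_points = [(1, 1), (2, 2), (3, 3), (4, 4)]
sample_mean, sample_component = first_principal_component(sample_points)
print(sample_component, project((4, 4), sample_mean, sample_component))
```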

In [20]:
pca_df = pca_results.select(['product_id','pca_features','cluster']).toPandas()
pca_df['x'] = pca_df['pca_features'].apply(lambda x: x[0])
pca_df['y'] = pca_df['pca_features'].apply(lambda x: x[1])
pca_df.head()

  product_id                                 pca_features  cluster         x         y
0         P0  [-0.1491433843630932, 0.021793771492380404]        3 -0.149143  0.021794
1         P1     [0.6245057379039884, 0.3556715923902566]        3  0.624506  0.355672
2         P2   [0.13092184088527842, -0.9346148036886412]        2  0.130922 -0.934615
3         P3   [-0.35372276856609774, 0.8137301318819877]        1 -0.353723   0.81373
4         P4      [0.8080080484152448, 0.464817419232602]        3  0.808008  0.464817


Let's plot the Clusters

In [21]:
def plot_clusters(pca_df, num_clusters=5):
    """
    Plots a 2D visualization of clusters using Plotly Express.

    Parameters:
    - pca_df (DataFrame): A Pandas DataFrame containing columns 'x', 'y', and 'cluster'.
      'x' and 'y' are the 2D PCA components, and 'cluster' indicates the cluster label.
    - num_clusters (int): The number of unique clusters to display.

    This function creates an interactive scatter plot where each point is colored according to its cluster.

    Returns:
    - fig (Figure): The Plotly figure object for the plot.
    """

    # Create the base cluster plot
    fig = px.scatter(
        pca_df,
        x='x',
        y='y',
        opacity=0.6,
        size_max=4,
        color=pca_df.cluster.astype(str),
        title='2D Visualization of Clusters',
        labels={'x': 'PCA Component 1', 'y': 'PCA Component 2'},
        # cluster labels are strings after astype(str), so the ordering uses strings too
        category_orders={'cluster': [str(i) for i in range(num_clusters)]},
        # show the product id in the tooltip
        hover_data={'product_id': True}

    )

    # Update layout to add legend title and adjust plot settings
    fig.update_layout(legend_title_text='Clusters', legend=dict(x=1, y=1), width=600, height=500)

    return fig

fig = plot_clusters(pca_df)
fig.show()

### Task 5 - Highlight recently viewed products

In [22]:
print("The user has recently viewed the following products: ", recently_viewed_products)

The user has recently viewed the following products:  ['P316', 'P333', 'P1115', 'P1691', 'P1082', 'P397', 'P1441', 'P1054']


Let's have a look at the records in our `clustered_data` dataframe related to the recently viewed products.

In [23]:
filter_data = clustered_data.filter(clustered_data.product_id.isin(recently_viewed_products))
#filter_data.show()

unique_cluster = filter_data.select('cluster').distinct().rdd.flatMap(lambda x: x).collect()
unique_cluster

[4, 0]

### Note:
Here we identify the distinct clusters among the recently viewed products. Basing the recommendations on these clusters keeps the suggested products close in meaning to the items the user has already shown interest in. To get them, we select the `cluster` column from `filter_data`, call `distinct()`, and then, because we want a plain Python list, use `.rdd.flatMap(lambda x: x).collect()` and store the result in `unique_cluster`.
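
To see how the recently viewed products are spread across clusters, and not only which clusters occur, the cluster labels can be counted. The sketch below uses a hypothetical `product_to_cluster` mapping with made-up IDs and labels. In the notebook, the same pairs could be collected from `filter_data`.

```python
from collections import Counter

# Made-up product IDs and cluster labels, for illustration only
product_to_cluster = {"P_a": 4, "P_b": 4, "P_c": 0}

# How many recently viewed products fall into each cluster
cluster_counts = Counter(product_to_cluster.values())

# The distinct clusters, in sorted order
distinct_clusters = sorted(cluster_counts)

print(cluster_counts)
print(distinct_clusters)
```

Counts like these could be used to draw more recommendations from the cluster the user viewed most often.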

### Task 6 - Recommend products based on recently viewed products

Let's have a look at the titles of the recently viewed products

In [24]:
filter_data.select('title').rdd.flatMap(lambda x: x).collect()

["Mystic Fitz Roy Beige 9' 0 x 12' 0 Area Rug",
 'Florida Shag Beige/Multi 3 ft. x 5 ft. Floral Area Rug',
 '1 gal. #M250-3 Apple Turnover Extra Durable Flat Interior Paint & Primer',
 '1 gal. #HDPG60 Misty Emerald Lake Flat Interior Paint and Primer',
 '1 qt. #S220-7 Molasses Extra Durable Flat Interior Paint & Primer',
 'Modern Gray/Multi 9 ft. x 12 ft. Vibrant Abstract Polyester Area Rug',
 '1 qt. #PPU6-06 Honey Locust Eggshell Enamel Low Odor Interior Paint & Primer',
 'Genet Rust/Red-Brown 8 ft. x 11 ft. Abstract Wool Area Rug']

Let's see the distinct clusters of the recently viewed products.

In [25]:
print(unique_cluster)

[4, 0]


Let's find the possible products for the recommendation.

In [26]:
# keep products from the same clusters, excluding those the user has already viewed
possible_recommendations = clustered_data.filter(clustered_data['cluster'].isin(unique_cluster)).filter(~clustered_data['product_id'].isin(recently_viewed_products))
possible_recommendations.show()

+----------+--------------------+--------------------+--------------------+-------+
|product_id|               title|         description|            features|cluster|
+----------+--------------------+--------------------+--------------------+-------+
|        P6|5 gal. #650C-2 Po...|BEHR PRO i300 Sem...|[-0.1115422695875...|      0|
|       P11|1 qt. #350F-7 Wil...|BEHR PREMIUM PLUS...|[0.12200022488832...|      0|
|       P16|5 gal. #BL-W10 Ma...|BEHR PREMIUM PLUS...|[0.04143549501895...|      0|
|       P18|1 qt. #M400-5 Bab...|BEHR PREMIUM PLUS...|[0.17083416879177...|      0|
|       P21|Whimsicle Blue Mu...|In true bohemian ...|[-0.1538960188627...|      4|
|       P24|5-gal. #HDGO64U C...|The Glidden 5-gal...|[-0.0141344917938...|      0|
|       P26|5 gal. #W-B-320 W...|BEHR ULTRA SCUFF ...|[0.12073547393083...|      0|
|       P30|1 qt. Bermuda San...|This Glidden Exte...|[0.18607740104198...|      0|
|       P33|1 gal. #N210-6 Sw...|Love your space l...|[0.09998241066932...| 

Let's perform a groupby and generate a list of product IDs that can be recommended for each of the clusters.

In [27]:
recommendations = possible_recommendations.groupby('cluster').agg(F.collect_list('product_id').alias('recommendations'))
#recommendations.show()

recommendations_df = recommendations.toPandas()
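# sample up to 5 products per cluster; min() avoids an error when a cluster has fewer than 5 candidates
recommendations_df['random_recommendations'] = recommendations_df['recommendations'].apply(lambda x: np.random.choice(x, min(5, len(x)), replace=False).tolist())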
recommendations_df.head()

   cluster                                    recommendations            random_recommendations
0        0  [P6, P11, P16, P18, P24, P26, P30, P33, P40, P...    [P391, P33, P1042, P1718, P24]
1        4  [P21, P52, P60, P71, P87, P101, P108, P119, P1...  [P1826, P405, P145, P1437, P534]
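
Sampling without replacement fails when a cluster has fewer candidates than the requested sample size, and unseeded sampling gives different recommendations on every run. A minimal sketch of a guarded, reproducible alternative using the standard library follows. `sample_recommendations`, the seed value, and the product IDs are illustrative assumptions.

```python
import random


def sample_recommendations(candidates, k=5, seed=None):
    """Pick up to `k` distinct candidates; a fixed `seed` makes the pick repeatable."""
    rng = random.Random(seed)
    k = min(k, len(candidates))  # never ask for more items than exist
    return rng.sample(list(candidates), k)


# Made-up product IDs
small_cluster = ["P_a", "P_b", "P_c"]
large_cluster = [f"P_{i}" for i in range(20)]

small_pick = sample_recommendations(small_cluster, k=5, seed=42)
large_pick = sample_recommendations(large_cluster, k=5, seed=42)
print(small_pick)
print(large_pick)
```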


In [28]:
# display the titles of the recommended products for one cluster
def display_recommendations(row):
    # look up the titles of the sampled product IDs in the Spark dataframe
    product_ids = row['random_recommendations']
    cluster = row['cluster']

    titles = (
        data.filter(data["product_id"].isin(product_ids))
        .select("title")
        .collect()
    )

    print("\n")
    print("Recommendations for Cluster:", cluster)
    for title in titles:
        print(title[0])

recommendations_df.apply(display_recommendations, axis=1)



Recommendations for Cluster: 0
5-gal. #HDGO64U Classic Ivory Semi-Gloss Latex Exterior Paint
1 gal. #N210-6 Swiss Brown Eggshell Enamel Interior Paint & Primer
1 gal. #M450-6 Bubble Turquoise Semi-Gloss Interior Paint
1 gal. #200B-7 Fireglow Matte Interior Stain-Blocking Paint & Primer
1 qt. #PPL-68 Summer Moon Satin Enamel Low Odor Interior Paint & Primer


Recommendations for Cluster: 4
Eternity Barcelona Blue 7 ft. 10 in. x 7 ft. 10 in. Round Area Rug
Amherst Dark Gray/Beige 2 ft. x 7 ft. Geometric Interlace Runner Rug
Antiquity Beige/Multi 8 ft. x 11 ft. Border Area Rug
8 in. x 8 in. Pattern Carpet Sample - Sequin Sash -Color Woodland
Odell Distressed Persian Silver 3 ft. x 8 ft. Runner


0    None
1    None
dtype: object
