# **Recommendation Systems**

In this part, we implement a recommendation system for properties for sale in Argentina. Given a property id, the system recommends the 10 most similar properties, ranked by the cosine similarity between their feature vectors and that of the specified property.

In [1]:
! pip install pyspark seaborn



In [4]:
# libraries

import numpy as np  # imported explicitly; used by the cosine similarity UDF
import sklearn
import pandas as pd
from pyspark.sql.functions import *
from pyspark.sql import SparkSession
from pyspark.sql.types import *
from scipy.spatial.distance import cosine
from pyspark.ml.feature import Normalizer
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import StandardScaler
from pyspark.sql.window import Window

In [2]:
from google.colab import drive
drive.mount('/content/gdrive')
google_drive_path = "/content/gdrive/MyDrive/Colab Notebooks/Project/"

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [5]:
# we've configured the Spark session to handle large datasets with high memory and parallelism settings
spark = SparkSession.builder.appName("Project511_RS").master("local[*]") \
          .config("spark.driver.memory", "10g") \
          .config("spark.executor.memory", "10g") \
          .config("spark.driver.maxResultSize", "2g") \
          .config("spark.sql.shuffle.partitions", "200") \
          .config("spark.default.parallelism", "100") \
          .getOrCreate()

# Read Datasets

We are going to load two versions of our dataset: one with numeric representations (for similarity calculations) and another with human-readable property characteristics (for presenting recommendations).

First, we read the latest version of our sale df, which includes all the preprocessing and other transformations we applied to the data.

In [6]:
sale_df = spark.read.format("parquet").load(google_drive_path + "sale_ml.parquet")

In [7]:
sale_df.printSchema()

root
 |-- final_features: vector (nullable = true)
 |-- price_bucket: double (nullable = true)
 |-- start_year: integer (nullable = true)
 |-- end_year: integer (nullable = true)
 |-- is_active: integer (nullable = true)
 |-- l3_one_hot: vector (nullable = true)
 |-- l2_one_hot: vector (nullable = true)
 |-- property_one_hot: vector (nullable = true)
 |-- encoded_start_month: vector (nullable = true)
 |-- encoded_end_month: vector (nullable = true)
 |-- encoded_start_day_of_week: vector (nullable = true)
 |-- encoded_end_day_of_week: vector (nullable = true)
 |-- geo_cluster: integer (nullable = true)
 |-- rooms_scaled: double (nullable = true)
 |-- bedrooms_scaled: double (nullable = true)
 |-- bathrooms_scaled: double (nullable = true)
 |-- surface_total_scaled: double (nullable = true)
 |-- surface_covered_scaled: double (nullable = true)



This df (sale_df) holds only the numeric representation of each property (the text descriptions were converted in the Text Analytics file), so it cannot be shown to a user directly. We therefore also read the processed df without any encoded or scaled features, which contains the actual, readable characteristics of the properties. Both datasets have the same rows in the same order. To show the description of each property we recommend, we later merge the two datasets on an id column that we create afterwards.

In [23]:
df = spark.read.format("parquet").load(google_drive_path + "ar_properties_processed.parquet")

This is the dataset with the descriptions and all the info not encoded.

In [24]:
df.show()

+----------+----------+--------------------+-------+-----+--------+---------+-------------+---------------+--------------------+--------------------+-------------+--------------+--------------------+-----------------+
|start_date|  end_date|                  l2|     l3|rooms|bedrooms|bathrooms|surface_total|surface_covered|               title|         description|property_type|operation_type|         coordinates|   amount_in_euro|
+----------+----------+--------------------+-------+-----+--------+---------+-------------+---------------+--------------------+--------------------+-------------+--------------+--------------------+-----------------+
|2020-11-02|2020-11-16|Bs.As. G.B.A. Zon...|  Tigre|    2|       2|        1|        550.0|          267.0|Venta lote Las Ti...|Lote interno en z...|          Lot|          Sale|[-34.406894683837...|         154700.0|
|2020-11-02|9999-12-31|            Santa Fe| Alvear|    2|       2|        1|       438.17|         347.07|Vendo lote de 268...|

We apply the same filters that we used during the machine learning task, so that this dataset stays consistent with the sale df.

In [25]:
df = df.filter(col("operation_type") == "Sale")

In [26]:
df.count()

210341

In [27]:
df = df.filter((col("amount_in_euro") <= 9000000000) & (col("amount_in_euro") > 11))

In [28]:
df.count()

210339

We add a unique property id to both datasets using a PySpark window function, so that we can merge them later.

# Index

## Readable Dataset

In [29]:
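# add column called 'id' that contains row numbers from 1 to n
# note: ordering by a constant does not guarantee a stable row order, so the ids of the
# two datasets line up only if Spark reads both in the same order
w = Window().orderBy(lit('A'))
df = df.withColumn('id', row_number().over(w))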

In [30]:
df.show()

+----------+----------+--------------------+-------+-----+--------+---------+-------------+---------------+--------------------+--------------------+-------------+--------------+--------------------+-----------------+---+
|start_date|  end_date|                  l2|     l3|rooms|bedrooms|bathrooms|surface_total|surface_covered|               title|         description|property_type|operation_type|         coordinates|   amount_in_euro| id|
+----------+----------+--------------------+-------+-----+--------+---------+-------------+---------------+--------------------+--------------------+-------------+--------------+--------------------+-----------------+---+
|2020-11-02|2020-11-16|Bs.As. G.B.A. Zon...|  Tigre|    2|       2|        1|        550.0|          267.0|Venta lote Las Ti...|Lote interno en z...|          Lot|          Sale|[-34.406894683837...|         154700.0|  1|
|2020-11-02|9999-12-31|            Santa Fe| Alvear|    2|       2|        1|       438.17|         347.07|Vendo

## Sale Dataset - Encoded

In [31]:
sale_df = sale_df.withColumn('id', row_number().over(w))

In [32]:
sale_df.show()

+--------------------+------------+----------+--------+---------+-------------+--------------+----------------+-------------------+-----------------+-------------------------+-----------------------+-----------+-------------------+--------------------+-------------------+--------------------+----------------------+---+
|      final_features|price_bucket|start_year|end_year|is_active|   l3_one_hot|    l2_one_hot|property_one_hot|encoded_start_month|encoded_end_month|encoded_start_day_of_week|encoded_end_day_of_week|geo_cluster|       rooms_scaled|     bedrooms_scaled|   bathrooms_scaled|surface_total_scaled|surface_covered_scaled| id|
+--------------------+------------+----------+--------+---------+-------------+--------------+----------------+-------------------+-----------------+-------------------------+-----------------------+-----------+-------------------+--------------------+-------------------+--------------------+----------------------+---+
|(17000,[0,1,2,3,6...|         4.0|  

In [33]:
# reorder the features in order to have id first
sale_df = sale_df.select("id", "final_features", "start_year", "end_year", "is_active", "l3_one_hot", "l2_one_hot", "property_one_hot", \
               "encoded_start_month", "encoded_end_month", "encoded_start_day_of_week",  "encoded_end_day_of_week", "geo_cluster", "rooms_scaled",
               "bedrooms_scaled", "bathrooms_scaled", "surface_total_scaled", "surface_covered_scaled", "price_bucket")

In [34]:
sale_df.show()

+---+--------------------+----------+--------+---------+-------------+--------------+----------------+-------------------+-----------------+-------------------------+-----------------------+-----------+-------------------+--------------------+-------------------+--------------------+----------------------+------------+
| id|      final_features|start_year|end_year|is_active|   l3_one_hot|    l2_one_hot|property_one_hot|encoded_start_month|encoded_end_month|encoded_start_day_of_week|encoded_end_day_of_week|geo_cluster|       rooms_scaled|     bedrooms_scaled|   bathrooms_scaled|surface_total_scaled|surface_covered_scaled|price_bucket|
+---+--------------------+----------+--------+---------+-------------+--------------+----------------+-------------------+-----------------+-------------------------+-----------------------+-----------+-------------------+--------------------+-------------------+--------------------+----------------------+------------+
|  1|(17000,[0,1,2,3,6...|      2020|

In [35]:
sale_df.count()

210339

# Scale final_features

The final_features column is the numeric representation of each property's description, so we scale it before use. Different features can have very different ranges, and scaling removes these differences.

This step prepares the dataset for more effective similarity calculations by ensuring that all features contribute equally, improving the quality of recommendations in the recommendation system.
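The effect of standardization can be seen on a toy example. The sketch below is plain Python, independent of Spark, and the two columns are made-up values (not taken from our dataset). It uses the sample standard deviation, which, as far as we know, is also what Spark's StandardScaler uses.

```python
import statistics

def standardize(column):
    """Rescale a list of numbers to mean 0 and sample standard deviation 1."""
    mean = statistics.mean(column)
    std = statistics.stdev(column)  # sample (n - 1) standard deviation
    if std == 0:
        # a constant column carries no information; map it to zeros
        return [0.0 for _ in column]
    return [(value - mean) / std for value in column]

# hypothetical columns with very different ranges
surface_m2 = [40.0, 80.0, 120.0, 400.0]
rooms = [1.0, 2.0, 3.0, 6.0]

surface_z = standardize(surface_m2)
rooms_z = standardize(rooms)

print(surface_z)
print(rooms_z)
```

After standardization both columns live on a comparable scale, so neither dominates a distance or similarity computation just because of its units.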

In [36]:
# Scaling 'final_features' to ensure they have a mean of 0 and a standard deviation of 1.
scaler = StandardScaler(inputCol="final_features", outputCol="scaled_final_features", withStd=True, withMean=True)

scalerModel = scaler.fit(sale_df) # fit the scaler model on the sale df to compute the mean and standard deviation of each feature

sale_df = scalerModel.transform(sale_df) # transform the sale df


In [37]:
# drop the original 'final_features' column as it is no longer needed after scaling
sale_df = sale_df.drop("final_features")

# Feature Vector Construction

The goal of constructing a feature vector is to combine all relevant features into a single vector that represents each property. This vector is then utilized for calculating similarities between properties in the recommendation system.

This process is vital for ensuring that the properties are represented in a way that enhances the effectiveness and accuracy of the recommendation system, leveraging both numerical and categorical data to identify the most similar properties based on a variety of characteristics.
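Conceptually, assembling a feature vector means concatenating scalar columns and vector columns into one flat vector, in the order the columns are listed. The following is a minimal plain-Python sketch of that idea; the helper `assemble` and the toy row are hypothetical and only illustrate the layout of the result, not Spark's implementation.

```python
def assemble(row, columns):
    """Concatenate scalar and list-valued columns into one flat list of floats."""
    features = []
    for name in columns:
        value = row[name]
        if isinstance(value, (list, tuple)):
            # vector column (e.g. a one-hot encoding): append every component
            features.extend(float(v) for v in value)
        else:
            # scalar column: append the single value
            features.append(float(value))
    return features

# hypothetical row with one bucket, one one-hot vector and one scaled numeric value
toy_row = {
    "price_bucket": 4.0,
    "property_one_hot": [0.0, 1.0, 0.0],
    "rooms_scaled": -0.35,
}
toy_columns = ["price_bucket", "property_one_hot", "rooms_scaled"]

toy_features = assemble(toy_row, toy_columns)
print(toy_features)
```

The length of the assembled vector is the sum of the sizes of its inputs, so wide inputs contribute many more components than single scalar columns.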

In [38]:
# list with both scaled numeric features and categorical features that have been one-hot encoded
feature_columns = [
    'scaled_final_features', 'price_bucket', 'l3_one_hot', 'l2_one_hot', 'property_one_hot',
    'encoded_start_month', 'encoded_end_month', 'encoded_start_day_of_week', 'encoded_end_day_of_week',
    'rooms_scaled', 'bedrooms_scaled', 'bathrooms_scaled', 'surface_total_scaled', 'surface_covered_scaled'
]

# VectorAssembler is a transformer that combines a given list of columns into a single vector column
assembler = VectorAssembler(inputCols=feature_columns, outputCol="features")
sale_df = assembler.transform(sale_df) # transform df

# Sample

Unfortunately, due to constraints in computational resources and to ensure the system remains performant, we decided to proceed with only 10% of the full dataset. 

:(

Fortunately, as we will see later on, the results weren't that disappointing.

:)

In [39]:
sale_df = sale_df.sample(fraction=0.1, seed=1234)
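Sampling by fraction keeps each row independently with the given probability, so the sample size is close to, but not exactly, 10% of the rows, and the seed makes the draw repeatable. The sketch below mimics this behaviour in plain Python; `bernoulli_sample` is a hypothetical helper, and it will not select the same ids as Spark does for the same seed.

```python
import random

def bernoulli_sample(ids, fraction, seed):
    """Keep each id independently with probability `fraction`."""
    rng = random.Random(seed)  # a fixed seed makes the sample reproducible
    return [i for i in ids if rng.random() < fraction]

toy_ids = list(range(1, 1001))  # hypothetical ids 1..1000

sample_a = bernoulli_sample(toy_ids, fraction=0.1, seed=1234)
sample_b = bernoulli_sample(toy_ids, fraction=0.1, seed=1234)

print(len(sample_a))         # close to 100, but not guaranteed to be exactly 100
print(sample_a == sample_b)  # same seed, same sample
```

Because only a subset of ids survives, a property id has to be taken from the sampled df (as we do in the example that follows) rather than chosen freely.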

In [40]:
sale_df.select("id").show()

+---+
| id|
+---+
| 31|
| 36|
| 38|
| 47|
| 51|
| 55|
| 61|
| 63|
| 88|
| 89|
| 93|
| 98|
|121|
|132|
|134|
|142|
|143|
|165|
|171|
|174|
+---+
only showing top 20 rows



**Example:** 

For demonstration purposes, we select an example property from the sampled dataset. Recommendations for this property will showcase how the system can identify and suggest similar properties based on the learned features.

In [41]:
property_id = sale_df.first()["id"]

In [42]:
property_id

31

# Normalization

We normalize the feature vectors so that each one has unit length. For two unit-length vectors A and B, the cosine similarity is simply their dot product, which makes the computation simpler and faster. Note that the recommender further down still computes the full cosine on the features column; this yields the same scores, because cosine similarity does not depend on vector length.
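The identity behind this step can be checked on a small example. The sketch below is plain Python with two made-up vectors; the helper names (`l2_normalize`, `dot`, `toy_cosine`) are ours and are not part of the notebook.

```python
import math

def dot(a, b):
    """Dot product of two equally long lists."""
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(vector):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(dot(vector, vector))
    if norm == 0:
        return list(vector)
    return [v / norm for v in vector]

def toy_cosine(a, b):
    """Cosine similarity computed from the full definition."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a = [3.0, 4.0, 0.0]
b = [1.0, 2.0, 2.0]

full = toy_cosine(a, b)
shortcut = dot(l2_normalize(a), l2_normalize(b))

print(full, shortcut)  # both equal 11 / 15
```

Once the vectors are normalized, the two norms in the denominator are both 1, so only the dot product remains.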

In [43]:
# initialize the Normalizer which will be used to normalize the feature vectors
normalizer = Normalizer(inputCol="features", outputCol="normFeatures")
sale_df = normalizer.transform(sale_df) # transform sale df


In [44]:
#  caching the df to optimize performance as it will be accessed multiple times during the recommendation process
sale_df.cache()

DataFrame[id: int, start_year: int, end_year: int, is_active: int, l3_one_hot: vector, l2_one_hot: vector, property_one_hot: vector, encoded_start_month: vector, encoded_end_month: vector, encoded_start_day_of_week: vector, encoded_end_day_of_week: vector, geo_cluster: int, rooms_scaled: double, bedrooms_scaled: double, bathrooms_scaled: double, surface_total_scaled: double, surface_covered_scaled: double, price_bucket: double, scaled_final_features: vector, features: vector, normFeatures: vector]

In [45]:
# repartitioning the df by 'id' into 50 partitions to spread the rows evenly, improving the efficiency of the query operations that follow
sale_df = sale_df.repartition(50, "id")

# Similarity-Based Recommender

Since we are not dealing with user interactions but with item similarity, traditional collaborative filtering techniques like ALS (Alternating Least Squares) are not a good fit: they need user-item ratings, which we do not have. Instead, we use content-based filtering on the feature vectors we've constructed, so every recommendation is driven by the properties' own characteristics.
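The core of a content-based recommender of this kind is a top-k ranking by cosine similarity that leaves out the query item. The following plain-Python sketch shows that logic on a made-up catalog of four items; `top_k_similar` and the toy vectors are hypothetical and only mirror the idea, not our Spark implementation.

```python
import math

def toy_cosine(a, b):
    """Cosine similarity between two equally long lists."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k_similar(catalog, target_id, k):
    """Return the k (id, similarity) pairs most similar to the target, excluding the target."""
    target = catalog[target_id]
    scored = [
        (prop_id, toy_cosine(target, vector))
        for prop_id, vector in catalog.items()
        if prop_id != target_id
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# hypothetical feature vectors keyed by property id
toy_catalog = {
    31: [1.0, 0.0, 1.0],
    47: [1.0, 0.1, 0.9],
    51: [0.0, 1.0, 0.0],
    55: [0.5, 0.5, 0.5],
}

toy_recs = top_k_similar(toy_catalog, target_id=31, k=2)
print(toy_recs)
```

In this sketch the target is removed before the top k are taken, so exactly k results come back whenever the catalog holds at least k other items.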


In [46]:
# UDF for calculating cosine similarity between two feature vectors
def cosine_similarity(x, y):
    return float(1 - cosine(np.array(x), np.array(y))) # convert vectors to numpy arrays and compute the cosine similarity


In [47]:
# register the function as a Spark UDF that returns a float
cosine_similarity_udf = udf(cosine_similarity, FloatType())

Function to recommend properties based on similarity of feature vectors

In [48]:
def recommend_properties(property_id, num_recommendations=10):
    # retrieve the feature vector for the specified property
    target_row = sale_df.filter(col("id") == property_id).select("features").first()
    if target_row is None:
        # raise instead of returning a message, so callers always get a DataFrame back
        raise ValueError(f"No property found with ID {property_id}")

    # convert the sparse vector to a numpy array for similarity computation
    target_feature_vector = target_row["features"].toArray()

    # pass the target vector as a literal column and score every property against it
    similarities = sale_df.withColumn("similarity", cosine_similarity_udf(col("features"), lit(target_feature_vector)))

    # take one extra row, since the property itself ranks first, then exclude the property itself
    recommendations = similarities.orderBy("similarity", ascending=False).limit(num_recommendations + 1)
    recommendations = recommendations.filter(col("id") != property_id)

    return recommendations.select("id", "similarity")


In [49]:
recommended_properties = recommend_properties(property_id).cache()

In [50]:
recommended_properties.show()

+------+----------+
|    id|similarity|
+------+----------+
| 23616|0.74673873|
| 19007| 0.7411356|
| 15503|   0.72375|
| 87674| 0.7216825|
|120358| 0.6843912|
|103114| 0.6728982|
|  4956| 0.6450466|
| 19967| 0.6422261|
|107779|0.63657355|
| 38311| 0.6215914|
+------+----------+



As we can observe, the recommended properties have cosine similarities between roughly 0.62 and 0.75 with the specified property, which indicates that the recommendation system finds properties with clearly related characteristics. In a real-world scenario, such similarity scores can help users discover properties that closely match their preferences or needs.

# Merge Datasets

We merge the two datasets, in order to get a detailed description in a readable representation for each recommended property.
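The merge is an inner join on the id column: each recommended id picks up the readable columns of the matching listing. The plain-Python sketch below illustrates this with made-up ids and values; `join_on_id` is a hypothetical helper. It also sorts by similarity after joining, since a Spark join does not promise to keep the order of its input.

```python
def join_on_id(recommendations, listings):
    """Inner join: attach listing details to each recommendation with a matching id."""
    joined = []
    for prop_id, similarity in recommendations:
        if prop_id in listings:
            joined.append({"id": prop_id, "similarity": similarity, **listings[prop_id]})
    # sort explicitly so the best match comes first
    return sorted(joined, key=lambda row: row["similarity"], reverse=True)

# hypothetical recommendations (id, similarity) and readable listings
toy_recommendations = [(55, 0.82), (47, 0.99), (99, 0.75)]
toy_listings = {
    47: {"l3": "Tigre", "property_type": "Lot"},
    55: {"l3": "Saavedra", "property_type": "Apartment"},
}

toy_merged = join_on_id(toy_recommendations, toy_listings)
for row in toy_merged:
    print(row)
```

Ids without a matching listing (99 in this toy example) drop out of an inner join, which is why both datasets need to carry the same ids.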

In [51]:
merged_df = recommended_properties.join(df, "id")

In [52]:
merged_df.show()

+------+----------+----------+----------+--------------------+------------------+-----+--------+---------+-------------+---------------+--------------------+--------------------+-------------+--------------+--------------------+--------------+
|    id|similarity|start_date|  end_date|                  l2|                l3|rooms|bedrooms|bathrooms|surface_total|surface_covered|               title|         description|property_type|operation_type|         coordinates|amount_in_euro|
+------+----------+----------+----------+--------------------+------------------+-----+--------+---------+-------------+---------------+--------------------+--------------------+-------------+--------------+--------------------+--------------+
|  4956| 0.6450466|2020-09-12|2020-11-30|     Capital Federal|          Saavedra|    2|       1|        1|       438.17|         347.07|Excelente 2 amb c...|Departamento de 2...|    Apartment|          Sale|[-34.879692077636...|       97370.0|
| 15503|   0.72375|2020-

Recommendation Score: The similarity scores indicate that the system can identify properties similar to a given reference property.

Information Richness: The extensive details accompanying each recommendation enhance transparency and trust, enabling users to understand why these properties were suggested.


Additionally, we tried to implement a recommendation system that first clustered the properties with the KMeans algorithm, using the cluster assignments as a form of "ground truth". This way, we could run our calculations on both the train and the test set and evaluate our recommendations. However, we ran into several complications when evaluating on the test set, so we decided to leave that Jupyter file out of our submission.