# Smart Adaptive Recommendations (SAR) Model

In [32]:
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Load original catalogue data
clean_df = pd.read_csv('https://raw.githubusercontent.com/Workspace-Recommendation-Engine/workspace_data/main/workspaces_clean.csv', index_col=0)
clean_df

Unnamed: 0,Workspace_Id,Name,Rating,Review_count,Price_range,Category,Address,Latitude,Longitude,Next_status
0,0,Eugenio Trias Municipal Public Library,3.8,800,0,Public library,"P.º de Fernán Núñez, 24",40.416705,-3.679161,Opens 8:30 AM Mon
1,1,Iván de Vargas Library,4.3,313,0,Public library,"C. de San Justo, 5",40.413991,-3.709750,Opens 8:30 AM Mon
2,2,Biblioteca Mario Vargas Llosa,3.8,178,0,Public library,"C. de Barceló, 4",40.426713,-3.699394,Opens 8:30 AM Mon
3,3,Pedro Salinas Library,4.0,337,0,Public library,"Gta. de la Prta de Toledo, 1",40.407074,-3.710894,Opens 9 AM Mon
4,4,Acuna Public Library,2.9,118,0,Public library,"C. de Quintana, 9",40.427932,-3.716937,Opens 9 AM Mon
...,...,...,...,...,...,...,...,...,...,...
259,259,Harina,3.9,434,2,Coffee shop,"C. de Velázquez, 61",40.429262,-3.684050,Closes 9 PM
260,260,The Coffee Corner,4.3,314,1,Coffee shop,"Av. de Valladolid, 41",40.428630,-3.729667,Closes 9 PM
261,261,The Bear and the Madroño,4.4,590,1,Espresso bar,"C. del Doce de Octubre, 16",40.415687,-3.675956,Closes 10:30 PM
262,262,Cafés Pozo,4.6,52,0,Coffee store,"C. de Miguel Arredondo, 4",40.394994,-3.695993,Closes 2 PMReopens 5 PM


- The code above imports the necessary libraries and reads the cleaned workspace data file 'workspaces_clean.csv' into a Pandas DataFrame called clean_df.

- The index_col=0 parameter specifies that the first column of the CSV file should be used as the index of the DataFrame.

- The clean_df DataFrame contains the cleaned catalogue data for the workspace recommendation engine, with one row per workspace: its ID, name, rating, review count, price range, category, address, coordinates, and next opening or closing status.
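To illustrate what index_col=0 does, here is a minimal sketch that reads a small in-memory CSV. The two sample rows and the names sample_csv and sample_df are made up for illustration and are not taken from the real file.

```python
import io
import pandas as pd

# Hypothetical two-row sample mimicking the layout of workspaces_clean.csv;
# the first (unnamed) column holds the saved row index.
sample_csv = io.StringIO(
    ",Workspace_Id,Name,Rating\n"
    "0,0,Library A,3.8\n"
    "1,1,Cafe B,4.3\n"
)

sample_df = pd.read_csv(sample_csv, index_col=0)

# The unnamed first column becomes the index rather than a data column
print(sample_df.columns.tolist())
print(sample_df.index.tolist())
```

Without index_col=0, the same column would instead be loaded as a regular data column named "Unnamed: 0".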


In [33]:
# Drop constantly changing Next_status column
clean_df.drop("Next_status", axis=1, inplace=True)

- The code above drops the Next_status column, which is irrelevant here because it changes constantly with the current time.

In [34]:
# Get 100 row indices labels following pattern:
# User_1, User_2, User_3 ... User_100
user_row_indices = []
for i in range(1, 101):
    user_row_indices.append(f"User_{i}")

# Dictionary to store weighted location ratings to add to dataframe
data = {
    "Workspace_Id": clean_df["Workspace_Id"],
    "Workspace": clean_df["Name"], # workspace locations
    "Category": clean_df["Category"]
}

# number of ratings to generate for each user
num_rows = len(clean_df)

# initialise random_seed to fixed value to always produce same results
random_seed = 2023
for row in user_row_indices:

    # set random seed
    np.random.seed(random_seed)

    # for each user, generate a random rating for each workspace and add it to the data dictionary
    data[row] = np.random.uniform(1, 5, num_rows).round(1)

    # increment random seed at each iteration so that each user has different randomly generated values
    random_seed += 1

# create dataframe holding each user's synthetic rating for each workspace
weighted_clean_df = pd.DataFrame(data = data)
print("\nWeighted Average User Rating for each Workspace")
weighted_clean_df


Weighted Average User Rating for each Workspace


Unnamed: 0,Workspace_Id,Workspace,Category,User_1,User_2,User_3,User_4,User_5,User_6,User_7,...,User_91,User_92,User_93,User_94,User_95,User_96,User_97,User_98,User_99,User_100
0,0,Eugenio Trias Municipal Public Library,Public library,2.3,3.4,1.5,1.9,4.2,1.3,1.6,...,2.8,4.4,2.9,2.3,1.9,3.8,4.1,3.6,2.0,4.9
1,1,Iván de Vargas Library,Public library,4.6,3.8,4.6,2.7,4.6,1.7,2.1,...,1.3,2.9,4.2,2.4,1.5,4.0,4.6,1.8,4.9,3.9
2,2,Biblioteca Mario Vargas Llosa,Public library,3.4,1.8,4.7,4.9,4.1,3.7,4.5,...,1.4,2.9,1.5,3.5,3.5,4.3,1.9,2.2,4.3,3.2
3,3,Pedro Salinas Library,Public library,1.5,1.2,2.8,1.4,4.5,2.1,3.5,...,4.4,2.5,1.9,3.9,2.4,4.9,4.4,4.8,4.6,1.7
4,4,Acuna Public Library,Public library,1.6,1.8,2.6,2.9,3.1,3.4,2.8,...,1.8,3.1,2.2,3.3,1.6,2.7,4.8,2.3,3.5,3.1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
259,259,Harina,Coffee shop,4.1,3.7,3.1,4.3,1.6,4.6,2.9,...,4.2,4.4,2.2,4.0,4.5,1.2,2.2,3.7,4.3,3.6
260,260,The Coffee Corner,Coffee shop,1.4,2.0,2.3,3.4,1.7,2.6,2.2,...,4.9,2.5,4.1,4.0,4.9,2.5,1.9,3.5,1.0,4.8
261,261,The Bear and the Madroño,Espresso bar,4.3,4.7,4.7,5.0,2.9,4.5,1.6,...,4.0,2.6,3.2,4.0,1.4,1.2,4.0,1.5,1.3,1.5
262,262,Cafés Pozo,Coffee store,1.4,2.0,4.0,1.6,2.2,2.1,4.3,...,2.0,1.3,3.0,1.6,1.1,2.3,4.2,1.9,3.1,4.4


- Here, we generate a synthetic dataset for the SAR model: 100 users and their ratings of every workspace in the clean_df DataFrame. The users are labelled "User_1" to "User_100".

- The data dictionary is initialized with the workspace information from clean_df: the workspace ID, name, and category.

- The **num_rows** variable is set to the number of rows in clean_df, i.e. the number of ratings to generate for each user.

- For each user, the code draws a random rating for each workspace using **np.random.uniform(1, 5, num_rows)**, which samples floating-point numbers uniformly from the half-open interval [1, 5). The ratings are rounded to 1 decimal place using the **.round(1)** method and added to the data dictionary under the user's label. The seed is set to a different fixed value for each user (starting at 2023), so the results are reproducible while differing between users.

- The **weighted_clean_df** DataFrame is then created by passing the data dictionary to the **pd.DataFrame()** constructor. Despite its name and printed title, it holds each user's randomly generated rating for each workspace rather than a computed weighted average.
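As a possible alternative to reseeding inside the loop, the sketch below draws all ratings in one call from a single seeded NumPy Generator. This is an assumption about how the step could be written, not the notebook's method, and it will not reproduce the exact numbers shown above because it uses a different random number generator. The sizes and the names rng, ratings and ratings_df are hypothetical.

```python
import numpy as np
import pandas as pd

n_workspaces, n_users = 5, 3  # small hypothetical sizes for illustration
rng = np.random.default_rng(2023)  # one seeded generator for all users

# Draw every rating in one call: one row per workspace, one column per user
ratings = rng.uniform(1, 5, size=(n_workspaces, n_users)).round(1)

user_labels = [f"User_{i}" for i in range(1, n_users + 1)]
ratings_df = pd.DataFrame(ratings, columns=user_labels)
print(ratings_df)
```

A single generator keeps the run reproducible without having to manage one seed per user.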

In [35]:
category_averages_df = weighted_clean_df.groupby("Category").mean(numeric_only=True).round(1).T
print("\nUser Average Rating for Workspace Categories")
category_averages_df


User Average Rating for Workspace Categories


Category,Bakery,Brunch,Business center,Cafe,Cafeteria,Coffee roasters,Coffee shop,Coffee store,Coworking space,Dog cafe,Donuts,Espresso bar,Library,Public library,Records storage facility,Restaurant,Tea store,University library
Workspace_Id,244.0,235.5,170.0,197.1,206.5,172.0,195.6,247.1,138.0,172.0,237.0,242.8,63.2,35.9,80.0,254.0,250.0,73.6
User_1,3.6,2.3,2.7,3.4,3.1,3.2,3.1,2.6,2.9,2.0,3.7,2.7,2.9,2.9,4.3,3.0,2.0,3.2
User_2,3.4,3.3,2.4,3.5,3.4,4.5,3.0,3.3,3.1,4.1,3.0,3.5,3.0,3.0,3.0,2.9,1.1,3.3
User_3,2.2,3.1,2.5,3.1,3.4,2.2,3.2,3.3,2.8,2.8,2.5,4.0,3.0,3.1,4.4,2.3,1.9,2.3
User_4,3.4,2.2,2.4,2.6,3.0,3.0,3.1,3.3,3.2,4.9,2.9,2.6,3.0,2.8,2.1,4.4,2.2,4.2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
User_96,1.5,4.3,3.9,3.2,3.4,3.1,2.7,2.8,3.1,2.3,1.9,2.3,3.3,3.1,1.3,3.8,4.4,2.9
User_97,3.3,1.3,3.4,2.7,2.8,1.2,3.0,2.7,2.9,1.9,1.2,3.2,3.0,2.9,1.9,1.0,2.1,3.0
User_98,1.1,2.9,1.2,3.0,3.2,2.6,3.1,2.8,3.0,3.0,4.3,3.2,2.6,3.0,3.2,1.1,2.6,3.1
User_99,4.7,1.8,4.5,3.3,3.3,3.7,3.0,2.5,2.7,4.2,2.7,3.1,3.0,3.0,4.9,2.1,1.9,2.9


- In this code, we calculate each user's average rating for each workspace category, using the **groupby()** method on the weighted_clean_df DataFrame. The groupby() method groups the rows by the "Category" column, and **.mean(numeric_only=True)** averages every numeric column within each group; the results are rounded to 1 decimal place. At this point the category names form the index and the users form the columns.

- Then, the **.T** attribute transposes the DataFrame, so that the categories become the columns and the users become the rows.

- Finally, we display **category_averages_df**, which shows each user's average rating for each workspace category based on the synthetic dataset generated in the previous code. Because Workspace_Id is numeric, it is averaged as well and appears as an extra row.
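The group-then-transpose pattern can be seen on a toy table. The three rows and the names toy_ratings, per_category and per_user below are invented for illustration.

```python
import pandas as pd

# Hypothetical ratings: two cafes and one library, rated by two users
toy_ratings = pd.DataFrame({
    "Category": ["Cafe", "Cafe", "Library"],
    "User_1": [4.0, 2.0, 3.0],
    "User_2": [1.0, 3.0, 5.0],
})

# Mean rating per category: rows are categories, columns are users
per_category = toy_ratings.groupby("Category").mean(numeric_only=True)

# Transpose so that rows are users and columns are categories
per_user = per_category.T
print(per_user)
```

Here User_1's "Cafe" value is the mean of 4.0 and 2.0, i.e. 3.0.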

In [36]:
# Drop Workspace_Id row
category_averages_df.drop("Workspace_Id", axis=0, inplace=True)
category_averages_df

Category,Bakery,Brunch,Business center,Cafe,Cafeteria,Coffee roasters,Coffee shop,Coffee store,Coworking space,Dog cafe,Donuts,Espresso bar,Library,Public library,Records storage facility,Restaurant,Tea store,University library
User_1,3.6,2.3,2.7,3.4,3.1,3.2,3.1,2.6,2.9,2.0,3.7,2.7,2.9,2.9,4.3,3.0,2.0,3.2
User_2,3.4,3.3,2.4,3.5,3.4,4.5,3.0,3.3,3.1,4.1,3.0,3.5,3.0,3.0,3.0,2.9,1.1,3.3
User_3,2.2,3.1,2.5,3.1,3.4,2.2,3.2,3.3,2.8,2.8,2.5,4.0,3.0,3.1,4.4,2.3,1.9,2.3
User_4,3.4,2.2,2.4,2.6,3.0,3.0,3.1,3.3,3.2,4.9,2.9,2.6,3.0,2.8,2.1,4.4,2.2,4.2
User_5,1.5,3.1,3.4,2.8,2.8,3.6,3.2,2.4,3.2,4.4,1.0,2.8,3.2,3.0,1.7,4.1,3.8,2.6
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
User_96,1.5,4.3,3.9,3.2,3.4,3.1,2.7,2.8,3.1,2.3,1.9,2.3,3.3,3.1,1.3,3.8,4.4,2.9
User_97,3.3,1.3,3.4,2.7,2.8,1.2,3.0,2.7,2.9,1.9,1.2,3.2,3.0,2.9,1.9,1.0,2.1,3.0
User_98,1.1,2.9,1.2,3.0,3.2,2.6,3.1,2.8,3.0,3.0,4.3,3.2,2.6,3.0,3.2,1.1,2.6,3.1
User_99,4.7,1.8,4.5,3.3,3.3,3.7,3.0,2.5,2.7,4.2,2.7,3.1,3.0,3.0,4.9,2.1,1.9,2.9


- Here, we drop the "Workspace_Id" row from the category_averages_df DataFrame using the **drop()** method, which removes the specified row or column from the DataFrame. The **axis=0** parameter specifies that the row should be dropped, and the **inplace=True** parameter ensures that the DataFrame is modified in place.

- The resulting DataFrame shows the user average rating for each workspace category based on the synthetic dataset generated in the previous code block, with the "Workspace_Id" row removed.
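An alternative sketch, under the assumption that all rating columns start with "User_", is to select those columns before averaging so that no "Workspace_Id" row is produced in the first place. The toy data and the names toy, user_cols and averages are hypothetical.

```python
import pandas as pd

# Hypothetical table with an id column, a category and one user's ratings
toy = pd.DataFrame({
    "Workspace_Id": [0, 1, 2],
    "Category": ["Cafe", "Cafe", "Library"],
    "User_1": [4.0, 2.0, 3.0],
})

# Keep only the rating columns so the ids are never averaged
user_cols = [c for c in toy.columns if c.startswith("User_")]
averages = toy.groupby("Category")[user_cols].mean().T
print(averages)
```

Either approach should give the same table; dropping the row afterwards, as the notebook does, is equally valid.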

In [37]:
# Copy clean_df into a dataframe called relevant_train_df
# relevant_train_df will store only the features useful for comparing workspaces
relevant_train_df = clean_df.copy()

# Label encode categories with rank
# This allows us to create numeric coding in alphabetical order without having to sort
# the dataframe, which would rearrange the order of the workspace Ids
# It is important to keep the workspace Ids in their original order because we will later use
# them to label the rows and columns of the workspace-to-workspace similarity matrix
relevant_train_df['Category'] = relevant_train_df['Category'].rank(method='dense').astype(int) - 1

# Workspace Ids in their original order (the dataframe was not sorted, so no reordering is needed)
workspace_ids = [i for i in weighted_clean_df["Workspace_Id"]]

relevant_train_df

Unnamed: 0,Workspace_Id,Name,Rating,Review_count,Price_range,Category,Address,Latitude,Longitude
0,0,Eugenio Trias Municipal Public Library,3.8,800,0,13,"P.º de Fernán Núñez, 24",40.416705,-3.679161
1,1,Iván de Vargas Library,4.3,313,0,13,"C. de San Justo, 5",40.413991,-3.709750
2,2,Biblioteca Mario Vargas Llosa,3.8,178,0,13,"C. de Barceló, 4",40.426713,-3.699394
3,3,Pedro Salinas Library,4.0,337,0,13,"Gta. de la Prta de Toledo, 1",40.407074,-3.710894
4,4,Acuna Public Library,2.9,118,0,13,"C. de Quintana, 9",40.427932,-3.716937
...,...,...,...,...,...,...,...,...,...
259,259,Harina,3.9,434,2,6,"C. de Velázquez, 61",40.429262,-3.684050
260,260,The Coffee Corner,4.3,314,1,6,"Av. de Valladolid, 41",40.428630,-3.729667
261,261,The Bear and the Madroño,4.4,590,1,11,"C. del Doce de Octubre, 16",40.415687,-3.675956
262,262,Cafés Pozo,4.6,52,0,7,"C. de Miguel Arredondo, 4",40.394994,-3.695993


In [38]:
# Store irrelevant column names (those not needed for workspace to workspace comparison) in a list
irrelevant_cols = ["Name", "Review_count", "Address"]

# Drop irrelevant columns from relevant_train_df so that we are left with only
# Workspace Id, rating, price range, category, latitude and longitude
# Workspace Id will not be used in the similarity comparison but just to identify workspaces
relevant_train_df.drop(irrelevant_cols, axis=1, inplace=True)
relevant_train_df

Unnamed: 0,Workspace_Id,Rating,Price_range,Category,Latitude,Longitude
0,0,3.8,0,13,40.416705,-3.679161
1,1,4.3,0,13,40.413991,-3.709750
2,2,3.8,0,13,40.426713,-3.699394
3,3,4.0,0,13,40.407074,-3.710894
4,4,2.9,0,13,40.427932,-3.716937
...,...,...,...,...,...,...
259,259,3.9,2,6,40.429262,-3.684050
260,260,4.3,1,6,40.428630,-3.729667
261,261,4.4,1,11,40.415687,-3.675956
262,262,4.6,0,7,40.394994,-3.695993


- Here, we create a dataframe called relevant_train_df which contains the workspace Id along with the rating, price range, category, latitude and longitude, as we believe only these features are relevant for comparing workspaces.

- **It is important to note that the workspace Id is not taken into account in the actual comparisons; it is kept only to identify the workspaces being compared.**
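The rank-based label encoding of the Category column can be checked on a few made-up values. The sketch below also compares it with pandas' categorical codes, which is offered only as an equivalent alternative for this toy input; the names categories, codes and cat_codes are hypothetical.

```python
import pandas as pd

# Hypothetical category values in arbitrary (unsorted) row order
categories = pd.Series(["Public library", "Cafe", "Coffee shop", "Cafe"])

# Dense rank gives 1, 2, 3, ... in alphabetical order; subtract 1 to start at 0
codes = categories.rank(method="dense").astype(int) - 1

# Categorical codes are also assigned in sorted order of the distinct values
cat_codes = categories.astype("category").cat.codes

print(codes.tolist())
print(cat_codes.tolist())
```

Both encodings leave the row order untouched, which is the property the notebook relies on.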

In [39]:
# Get indices and columns for workspaces which correspond to the workspace Ids
workspace_ids = [i for i in weighted_clean_df["Workspace_Id"]]

# Get data which is cosine similarity between each workspace based on rating, price_range, category, latitude and longitude
data = cosine_similarity(relevant_train_df.iloc[:, 1:], relevant_train_df.iloc[:, 1:])

# Create and display workspace to workspace similarity dataframe
workspace_workspace_df = pd.DataFrame(data=data, index=workspace_ids, columns=workspace_ids)
workspace_workspace_df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,254,255,256,257,258,259,260,261,262,263
0,1.000000,0.999932,1.000000,0.999989,0.999779,1.000000,0.999866,0.999956,0.999744,0.999990,...,0.997947,0.998780,0.977434,0.990230,0.986287,0.985638,0.986415,0.998587,0.990187,0.990169
1,0.999932,1.000000,0.999932,0.999976,0.999467,0.999932,0.999608,0.999997,0.999683,0.999974,...,0.997785,0.998947,0.977747,0.990406,0.986497,0.985652,0.986542,0.998698,0.990390,0.990371
2,1.000000,0.999932,1.000000,0.999989,0.999780,1.000000,0.999866,0.999956,0.999746,0.999990,...,0.997944,0.998784,0.977451,0.990242,0.986299,0.985651,0.986428,0.998590,0.990198,0.990180
3,0.999989,0.999976,0.999989,1.000000,0.999670,0.999989,0.999779,0.999989,0.999734,1.000000,...,0.997900,0.998861,0.977566,0.990311,0.986379,0.985651,0.986475,0.998645,0.990278,0.990260
4,0.999779,0.999467,0.999780,0.999670,1.000000,0.999779,0.999989,0.999539,0.999513,0.999674,...,0.997893,0.998143,0.976556,0.989590,0.985584,0.985290,0.985865,0.998047,0.989496,0.989479
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
259,0.985638,0.985652,0.985651,0.985651,0.985290,0.985649,0.985428,0.985644,0.989045,0.985743,...,0.978711,0.991865,0.997356,0.998436,0.999563,1.000000,0.999659,0.992776,0.998400,0.998405
260,0.986415,0.986542,0.986428,0.986475,0.985865,0.986426,0.986048,0.986511,0.989828,0.986563,...,0.978293,0.992835,0.998430,0.999408,0.999971,0.999659,1.000000,0.993128,0.999394,0.999398
261,0.998587,0.998698,0.998590,0.998645,0.998047,0.998590,0.998211,0.998683,0.999358,0.998669,...,0.995528,0.999709,0.985953,0.995359,0.993106,0.992776,0.993128,1.000000,0.995351,0.995338
262,0.990187,0.990390,0.990198,0.990278,0.989496,0.990197,0.989706,0.990347,0.993064,0.990353,...,0.981932,0.995661,0.997361,0.999996,0.999417,0.998400,0.999394,0.995351,1.000000,1.000000


- Here, we create **a workspace to workspace affinity matrix** from the workspace features in relevant_train_df:

1. First, a list of workspace IDs is created from the "Workspace_Id" column of the weighted_clean_df DataFrame.

2. We compute the cosine similarity between every pair of rows of the relevant train data (excluding the workspace Id column, via **iloc[:, 1:]**) and store the result in a variable called data.

3. We create a dataframe called workspace_workspace_df from that data, with both the index and the columns set to the workspace Ids. This dataframe acts as a workspace to workspace affinity matrix which measures the cosine similarity between each pair of workspaces based on rating, price range, category, latitude and longitude.
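To make the similarity computation concrete, here is a minimal NumPy sketch of the cosine similarity definition applied to three made-up feature rows. The values and the names features, unit_rows and similarity are hypothetical; for the same input, scikit-learn's cosine_similarity should return the same matrix.

```python
import numpy as np

# Hypothetical feature rows: rating, price range, category code, latitude, longitude
features = np.array([
    [3.8, 0.0, 13.0, 40.42, -3.68],
    [4.4, 1.0, 11.0, 40.42, -3.68],
    [4.6, 0.0, 7.0, 40.39, -3.70],
])

# Cosine similarity is the dot product of the rows after scaling each to unit length
unit_rows = features / np.linalg.norm(features, axis=1, keepdims=True)
similarity = unit_rows @ unit_rows.T

print(similarity.round(4))
```

The result is symmetric with ones on the diagonal, since every row is identical in direction to itself. Because only the angle between rows matters, features with large magnitudes (here latitude) contribute most to the result.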

In [40]:
# Recommendation scores are obtained by multiplying the workspace-to-workspace affinity matrix
# by the User_1 affinity vector
rec_scores = workspace_workspace_df.values.dot(weighted_clean_df["User_1"].values)

# Get data_frame for User_1 Workspace recommendation scores (descending order)
# Index equates to the workspace Id for each workspace
data = {"User_1_Recommendations": rec_scores}
user_1_rec = pd.DataFrame(data=data, index=workspace_workspace_df.index)
user_1_rec.sort_values("User_1_Recommendations", ascending=False, inplace=True)
user_1_rec

Unnamed: 0,User_1_Recommendations
172,786.468505
145,786.409809
138,786.409095
163,786.408672
128,786.408240
...,...
67,772.978181
56,772.841733
29,772.485903
235,772.405017


Here, we calculate the recommendation scores for each workspace for the hypothetical user "User_1". We use the workspace to workspace affinity matrix (workspace_workspace_df) and the affinity vector for "User_1" **(weighted_clean_df["User_1"])** generated in the code blocks prior:

1. The first line of the code calculates the recommendation scores by performing a **dot product between the workspace-to-workspace affinity matrix** (workspace_workspace_df.values) and the User_1 affinity vector (weighted_clean_df["User_1"].values).

2. The resulting recommendation scores are then added to a new DataFrame called **user_1_rec** with the column name **"User_1_Recommendations"**. The index of the DataFrame corresponds to the workspace IDs, and the cells represent the recommendation scores for each workspace.

3. Finally, the user_1_rec DataFrame is sorted **in descending order** based on the recommendation scores. 

The resulting DataFrame shows the recommended workspaces for User_1, with the top recommended workspace at the top of the DataFrame.
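The scoring step can be worked through by hand on a tiny example. The 3x3 similarity matrix and the ratings below are invented for illustration, as are the names similarity, user_ratings, scores and ranking.

```python
import numpy as np

# Hypothetical workspace-to-workspace similarity matrix for three workspaces
similarity = np.array([
    [1.0, 0.5, 0.0],
    [0.5, 1.0, 0.5],
    [0.0, 0.5, 1.0],
])

# Hypothetical ratings one user gave to the same three workspaces
user_ratings = np.array([4.0, 2.0, 1.0])

# Each score is a similarity-weighted sum of the user's ratings:
# [1*4 + 0.5*2 + 0*1, 0.5*4 + 1*2 + 0.5*1, 0*4 + 0.5*2 + 1*1] = [5.0, 4.5, 2.0]
scores = similarity.dot(user_ratings)

# Workspace positions ordered from highest to lowest score
ranking = np.argsort(-scores)
print(scores, ranking)
```

A workspace scores highly when it is similar to workspaces the user rated highly, which is why the scores are sums rather than values on the original 1 to 5 scale.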

## Print Recommendations

In [41]:
# Get sub-dataframe with top 5 scored workspaces
top_5_workspaces = user_1_rec.head(5)
print("\nTop 5 Workspaces for User 1")
top_5_workspaces


Top 5 Workspaces for User 1


Unnamed: 0,User_1_Recommendations
172,786.468505
145,786.409809
138,786.409095
163,786.408672
128,786.40824


Here, we extract the top 5 recommended workspaces for "User_1" from the user_1_rec DataFrame generated in the previous code block. We create a new DataFrame called **top_5_workspaces** that contains the top 5 recommended workspaces with their corresponding recommendation scores.

The **head(5)** method is used to extract the first 5 rows (i.e., the top 5 recommended workspaces) of the user_1_rec DataFrame. 

The resulting top_5_workspaces DataFrame is printed.
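As a side note, pandas can also pick the top rows without a prior sort. The sketch below compares sort-then-head with nlargest on made-up scores; the data and the names toy_scores, top_2_sorted and top_2_direct are hypothetical.

```python
import pandas as pd

# Hypothetical recommendation scores indexed by workspace id
toy_scores = pd.DataFrame(
    {"User_1_Recommendations": [2.0, 5.0, 4.5, 3.0]},
    index=[10, 11, 12, 13],
)

# Approach used in the notebook: sort in descending order, then take the first rows
top_2_sorted = toy_scores.sort_values("User_1_Recommendations", ascending=False).head(2)

# Alternative: ask directly for the rows with the largest scores
top_2_direct = toy_scores.nlargest(2, "User_1_Recommendations")

print(top_2_direct)
```

Both approaches select workspaces 11 and 12 here; head(5) is the natural choice in the notebook because user_1_rec is already sorted.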

In [42]:
def print_workspace(workspace_id):
    # Get workspace row based on id
    workspace = clean_df[clean_df["Workspace_Id"] == workspace_id]

    # Map price range category codes to display strings
    # Code 0 (and any unexpected code) has no price range to display
    price_range_symbols = {1: "€", 2: "€€", 3: "€€€"}
    price_range = price_range_symbols.get(workspace["Price_range"].values[0])

    # Print workspace details
    print(workspace["Name"].values[0])
    print(workspace["Address"].values[0])
    print(workspace["Category"].values[0])
    if price_range is not None:
        print(f"Price range: {price_range}")
    print(f"Overall Rating: {workspace['Rating'].values[0]}")

Here, we define a function called **print_workspace** that takes a **workspace_id** as an input parameter.

The function first retrieves the row in the clean_df dataframe that corresponds to the given workspace id. It then extracts the details about the workspace such as **its name, address, category, price range and overall rating**, and prints them out to the console in a structured format.

The function is meant to be used to display details of a workspace to a user, given a workspace id.

In [43]:
# Print top 5 recommended workspaces for User_1

print("\nWorkspace Recommendations for User 1\n")

# Get index/workspace Id of each top 5 workspace, counting ranks from 1
for top, workspace_id in enumerate(top_5_workspaces.index, start=1):
    # Print details for each top choice
    print(f"Top {top} Choice\n")
    print_workspace(workspace_id)
    print("\n\n")


Workspace Recommendations for User 1

Top 1 Choice

1000 Cups Specialty Coffee & Food
Cmo. de Ganapanes, 1
Dog cafe
Overall Rating: 4.2



Top 2 Choice

WeWork - Espacio de oficinas y coworking
P.º de la Castellana, 43
Coworking space
Overall Rating: 4.4



Top 3 Choice

WeWork - Espacio de oficinas y coworking
P.º de la Castellana, 77
Coworking space
Overall Rating: 4.4



Top 4 Choice

Talent Garden Madrid
C. de Juan de Mariana, 15
Coworking space
Overall Rating: 4.5



Top 5 Choice

la raum de chamberi coworking
Calle de Modesto Lafuente, 7
Coworking space
Overall Rating: 4.3



