# Task 2: Recommendation Engine - Skeleton Notebook

This notebook provides a very basic example for the notebook you are expected to submit for Task 2 of the Final Project. The main purpose is that we can try different examples to get a better sense of your approach. Compared to Task 1 (Kaggle Competition), we don't have any objective means to evaluate the recommendations. 

Some general comments:
* You can import any data you need. This particularly includes your cleaned version of the properties dataset (incl. the auxiliary data or any other data you might have collected); there's no need to show the data cleaning / preprocessing steps in this notebook.
* You can also import your code in form of external Python (.py) script. You're actually encouraged to do so to keep this notebook light and uncluttered.
* **Important:** Please consider this notebook as an example and not to set specific requirements. Your notebook is likely to look very different. As long there is a section where we can easily test your solution, it should be fine.

## Setting up the Notebook

In [138]:
import numpy as np
import pandas as pd
from tqdm import tqdm

from dataloader import read_csv
from preprocessing import DataPreprocessor

from sklearn.metrics.pairwise import cosine_similarity, cosine_distances
from sklearn.neighbors import NearestNeighbors

In [139]:
# Some more magic so that the notebook will reload external python modules;
# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## Load the Data

For this example, we use a simplified version of the dataset with only 2k+ data samples, each with only a subset of features.

In [248]:
trainX, trainY = read_csv('data/train.csv', ylabel='price')

In [249]:
x_index = set(trainX.index)
y_index = set(trainY.to_frame().index)
len(x_index.symmetric_difference(y_index))

0

In [212]:
auxSubzone, _ = read_csv('data/auxiliary-data/sg-subzones.csv')

auxInfraDict = {}
Infralist = ['sg-commerical-centres', 'sg-mrt-stations', 'sg-primary-schools', 'sg-secondary-schools', 'sg-shopping-malls']
for ele in Infralist:
    auxInfra, _ = read_csv('data/auxiliary-data/' + ele + '.csv')
    auxInfraDict[ele] = auxInfra

In [255]:
data_preprocessor = DataPreprocessor(auxSubzone, auxInfraDict)
trainX, trainY = data_preprocessor.fit_transform(trainX, trainY)

In [256]:
df_data = pd.concat([trainX, trainY], axis=1)

In [257]:
df_data.head()

Unnamed: 0,address,property_name,property_type,built_year,num_beds,num_baths,size_sqft,total_num_units,lat,lng,...,furnishing_unfurnished,furnishing_fully,floor_level_nan,floor_level_low,floor_level_ground,floor_level_high,floor_level_penthouse,floor_level_mid,floor_level_top,price
0,0.0,0.0,672050.6,1988.0,3.0,2.0,1115.0,116.0,1.414399,103.837196,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,514500.0
1,1.0,1.0,672050.6,1992.0,4.0,2.0,1575.0,407.0,1.372597,103.875625,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,995400.0
2,2.0,2.0,2965595.0,2022.0,4.0,6.0,3070.0,56.0,1.298773,103.895798,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,8485000.0
3,3.0,3.0,2965595.0,2023.0,3.0,2.0,958.0,638.0,1.312364,103.803271,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,2626000.0
4,4.0,4.0,2965595.0,2026.0,2.0,1.0,732.0,351.0,1.273959,103.843635,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1764000.0


In [258]:
df_data.shape

(17685, 37)

In [259]:
df_data.drop_duplicates().shape

(17630, 37)

## Computing the Top Recommendations

The method `get_top_recommendations()` shows an example of how to get the top recommendations for a given data sample (data sample = row in the dataframe of the dataset). The input is a row from the dataset and a list of optional input parameters which will depend on your approach; `k` is the number of returned recommendations seems useful, though.

The output should be a `pd.DataFrame` containing the recommendations. The output dataframe should have the same columns as the row + any additional columns you deem important (e.g., any score or tags that you might want to add to your recommendations).

In principle, the method `get_top_recommendations()` may be imported from a external Python (.py) script as well.

### The input

In [275]:
row = df_data.iloc[[0]]
row

Unnamed: 0,address,property_name,property_type,built_year,num_beds,num_baths,size_sqft,total_num_units,lat,lng,...,furnishing_unfurnished,furnishing_fully,floor_level_nan,floor_level_low,floor_level_ground,floor_level_high,floor_level_penthouse,floor_level_mid,floor_level_top,price
0,0.0,0.0,672050.61116,1988.0,3.0,2.0,1115.0,116.0,1.414399,103.837196,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,514500.0


In [265]:
data_preprocessor.inverse_transform(row)

Unnamed: 0,address,property_name,property_type,built_year,num_beds,num_baths,size_sqft,total_num_units,lat,lng,...,nearest_sg-primary-schools,density_sg-primary-schools,nearest_sg-secondary-schools,density_sg-secondary-schools,nearest_sg-shopping-malls,density_sg-shopping-malls,subzone_area_size,price,furnishing,floor_level
0,sembawang / yishun (d27),866 yishun street 81,hdb,1988.0,3.0,2.0,1115.0,116.0,1.414399,103.837196,...,0.002479,3.53113,0.001648,3.694676,0.005589,2.501239,1.3402,514500.0,unspecified,


## Cosine similarity
Using cosine distance which is 1 - cosine_similarity

In [272]:
def get_top_consine_distance(row, X, k=10):
    distances = cosine_distances(row, X).flatten()
    return distances.argsort()[-k:][::-1]

In [273]:
index_list = get_top_consine_distance(row, df_data)

In [274]:
data_preprocessor.inverse_transform(df_data.iloc[index_list])

Unnamed: 0,address,property_name,property_type,built_year,num_beds,num_baths,size_sqft,total_num_units,lat,lng,...,nearest_sg-primary-schools,density_sg-primary-schools,nearest_sg-secondary-schools,density_sg-secondary-schools,nearest_sg-shopping-malls,density_sg-shopping-malls,subzone_area_size,price,furnishing,floor_level
746,1 keppel bay view,reflections at keppel bay,condo,2011.0,5.0,4.0,13293.0,1129.0,1.266718,103.811493,...,0.010052,2.504167,0.015009,1.80506,0.008824,2.937104,2.6632,71400000.0,unfurnished,
16748,queen astrid park,queen astrid park,bungalow,1989.0,10.0,10.0,16000.0,264.0,1.317659,103.790578,...,0.00639,2.692514,0.01553,3.053193,0.008347,2.991965,2.0385,114450000.0,unspecified,
5723,bukit timah road,maplewoods,bungalow,1997.0,5.0,5.0,7000.0,697.0,1.33436,103.784785,...,0.009347,2.413163,0.017979,1.872844,0.007514,3.206461,3.3872,115500000.0,unspecified,
5976,tanglin / holland (d10),oei tiong ham park,bungalow,1997.0,3.0,2.0,5000.0,39.0,1.31558,103.792829,...,0.008738,2.71055,0.01509,3.099769,0.005407,3.082039,2.0385,105000000.0,unspecified,
16654,dalvey road,the glencaird residences,bungalow,2000.0,4.0,4.0,3000.0,12.0,1.315477,103.819725,...,0.012693,2.326284,0.010716,3.296112,0.011434,3.713243,2.0961,99750000.0,unspecified,
5446,dalvey road,the glencaird residences,bungalow,2000.0,5.0,4.0,3000.0,12.0,1.315477,103.819725,...,0.012693,2.326284,0.010716,3.296112,0.011434,3.713243,2.0961,99750000.0,unspecified,
3275,bukit batok / bukit panjang / choa chu kang (d23),789 choa chu kang north 6,hdb,1996.0,4.0,2.0,1453.0,87.0,1.396419,103.751106,...,0.000543,3.816078,0.002907,3.796787,0.004503,3.670209,1.0886,8400000.0,unspecified,
4590,14 nassim road,les maisons nassim,condo,2023.0,4.0,5.0,6090.0,14.0,1.309019,103.824156,...,0.014628,2.291475,0.013368,3.082906,0.004209,4.669659,2.0961,46200000.0,unspecified,
17494,14 nassim road,les maisons nassim,condo,2023.0,4.0,4.0,6090.0,14.0,1.309019,103.824156,...,0.014628,2.291475,0.013368,3.082906,0.004209,4.669659,2.0961,44100000.0,unspecified,
4220,14a nassim road,les maisons nassim,condo,2023.0,4.0,5.0,6092.0,14.0,1.308666,103.823856,...,0.015088,2.292408,0.013806,3.079245,0.003846,4.658043,2.0961,44100000.0,unspecified,


## Nearest Neighbors

In [269]:
nearest_neighbors = NearestNeighbors(n_neighbors=10).fit(df_data)

In [270]:
distances, index_list = nearest_neighbors.kneighbors(row)

In [271]:
data_preprocessor.inverse_transform(df_data.iloc[index_list.flatten()])

Unnamed: 0,address,property_name,property_type,built_year,num_beds,num_baths,size_sqft,total_num_units,lat,lng,...,nearest_sg-primary-schools,density_sg-primary-schools,nearest_sg-secondary-schools,density_sg-secondary-schools,nearest_sg-shopping-malls,density_sg-shopping-malls,subzone_area_size,price,furnishing,floor_level
0,sembawang / yishun (d27),866 yishun street 81,hdb,1988.0,3.0,2.0,1115.0,116.0,1.414399,103.837196,...,0.002479,3.53113,0.001648,3.694676,0.005589,2.501239,1.3402,514500.0,unspecified,
4291,sembawang / yishun (d27),871 yishun street 81,hdb,1988.0,3.0,2.0,1163.0,70.0,1.413099,103.837412,...,0.003588,3.423291,0.002763,3.606851,0.006419,2.405091,1.3402,515400.0,unspecified,
11856,sembawang / yishun (d27),641 yishun street 61,hdb,1992.0,3.0,2.0,979.0,70.0,1.420758,103.838035,...,0.002588,3.921689,0.001196,3.97934,0.004084,2.842468,1.3402,514500.0,unspecified,
2416,sembawang / yishun (d27),873 yishun street 81,hdb,1988.0,3.0,2.0,1119.0,116.0,1.414486,103.836871,...,0.00259,3.537153,0.001783,3.698565,0.005772,2.499728,1.3402,509200.0,unspecified,
234,sembawang / yishun (d27),nee soon central meadows,hdb,1986.0,3.0,2.0,1119.0,69.0,1.420939,103.836109,...,0.004482,3.922934,0.002085,3.975242,0.005807,2.828076,1.3402,522900.0,unspecified,
5946,sembawang / yishun (d27),645 yishun street 61,hdb,1992.0,3.0,2.0,979.0,78.0,1.41807,103.838797,...,0.001576,3.778817,0.001879,3.882468,0.002462,2.739844,1.3402,522900.0,unspecified,
3200,sembawang / yishun (d27),817 yishun street 81,hdb,1987.0,3.0,2.0,1119.0,24.0,1.413534,103.836625,...,0.003514,3.458675,0.002684,3.633707,0.006602,2.419902,1.3402,524000.0,unspecified,
261,sembawang / yishun (d27),858 yishun avenue 4,hdb,1988.0,3.0,2.0,1119.0,83.0,1.418044,103.84056,...,0.002588,3.751281,0.003247,3.85246,0.000721,2.728155,1.3402,525000.0,unspecified,
1336,sembawang / yishun (d27),863 yishun avenue 4,hdb,1988.0,3.0,2.0,1119.0,88.0,1.418044,103.84056,...,0.002588,3.751281,0.003247,3.85246,0.000721,2.728155,1.3402,525000.0,unspecified,
7263,sembawang / yishun (d27),641 yishun street 61,hdb,1992.0,3.0,2.0,1001.0,70.0,1.420758,103.838035,...,0.002588,3.921689,0.001196,3.97934,0.004084,2.842468,1.3402,525000.0,unspecified,


In [None]:
def get_top_recommendations(row, **kwargs) -> pd.DataFrame:
    
    #####################################################
    ## Initialize the required parameters
    
    # The number of recommendations seem recommended
    # Additional input parameters are up to you
    k = 10
    
    # Extract all **kwargs input parameters
    # and set the used paramaters (here: k)
    for key, value in kwargs.items():
        if key == 'k':
            k = value
    
    
    

        
    # Return the dataset with the k recommendations
    return df_result


## Testing the Recommendation Engine

This will be the main part of your notebook to allow for testing your solutions. Most basically, for a given listing (defined by the row id in your input dataframe), we would like to see the recommendations you make. So however you set up your notebook, it should have at least a comparable section that will allow us to run your solution for different inputs.

### Pick a Sample Listing as Input

In [None]:
# Pick a row id of choice
row_id = 10
#row_id = 20
#row_id = 30
#row_id = 40
#row_id = 50

# Get the row from the dataframe (an valid row ids will throw an error)
row = df_sample.iloc[row_id]

# Just for printing it nicely, we create a new dataframe from this single row
pd.DataFrame([row])

## Compute and Display the recommendations

Since the method `get_top_recommendations()` returns a `pd.DataFrame`, it's easy to display the result.

In [None]:
k = 3

df_recommendations = get_top_recommendations(row, k=k)

df_recommendations.head(k)