# Traveloka Take-home Assessment
This is a notebook for "Traveloka Take-home Assessment"

## Problem
Traveloka is launching a food delivery service. Given customers' locations, order history, and restaurant information, build a recommendation system that recommends the nearest restaurants.

## Approach
There are two classic approaches to recommendation systems: user-based and item-based collaborative filtering. User-based collaborative filtering emphasizes qualitative measurements such as the ratings and scores given by each user, while item-based filtering focuses on information collected from past transactions that reflect similar tastes or preferences. Thus, in my opinion, item-based collaborative filtering is more appropriate in this case. Let's see the implementation details below.

In [1]:
import pandas as pd
import numpy as np

from haversine import haversine
from sklearn.neighbors import NearestNeighbors
import ml_metrics as metrics

import warnings
warnings.filterwarnings(action='ignore')

### Dataset
Here we have several datasets that we can use to build our recommendation model. `vendors.csv` contains information about all restaurants (lat-long, category, tags, etc.), and we will use it to build our "vendor matrix". From `train_full.csv` and `test_full.csv`, we can build "user profiles" based on what customers ordered previously (more on that later). Since both `train_full.csv` and `test_full.csv` are large (~3GB and ~880MB respectively), we reduce them by sampling a small percentage of rows, which cuts the computation and time needed to process them.
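As a side note, the sampling idea can also be applied while the file is being read, so that the full CSV never has to sit in memory. The sketch below is only an illustration: `sample_csv_in_chunks` and its parameters are hypothetical names, not part of this notebook, and sampling a fixed fraction from every chunk is not exactly the same as one `DataFrame.sample` call over the whole file.

```python
import pandas as pd

def sample_csv_in_chunks(path_or_buffer, frac=0.05, chunksize=100_000, random_state=42):
    """Read a CSV chunk by chunk and keep a random fraction of each chunk."""
    sampled_chunks = []
    reader = pd.read_csv(path_or_buffer, chunksize=chunksize)
    for i, chunk in enumerate(reader):
        # Vary the seed per chunk so the same row positions are not picked every time.
        sampled_chunks.append(chunk.sample(frac=frac, random_state=random_state + i))
    return pd.concat(sampled_chunks)

# Hypothetical usage with the path used in this notebook:
# train_sample = sample_csv_in_chunks('./Data/train_full.csv', frac=0.05)
```

Only one chunk plus the rows kept so far are held in memory at a time, at the cost of a slightly different sampling scheme.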

For customer details, we have `train_customers.csv`, `train_locations.csv`, `test_customers.csv`, and `test_locations.csv`. However, we will not use them in the final model, and I will explain why in a later part of this notebook. `orders.csv`, on the other hand, seems unnecessary to include: most of its data describes the transaction details of each order, while customer preferences can be derived from our training and test data.

In [2]:
vendors = pd.read_csv('./Data/vendors.csv')
orders = pd.read_csv('./Data/orders.csv')
# train data
train_full = pd.read_csv('./Data/train_full.csv')
train_locations = pd.read_csv('./Data/train_locations.csv')
train_customers = pd.read_csv('./Data/train_customers.csv')
# test data
test_locations = pd.read_csv('./Data/test_locations.csv')
test_full = pd.read_csv('./Data/test_full.csv')
test_customers = pd.read_csv('./Data/test_customers.csv')

In [3]:
def sample_train_and_test_data(df_train, df_test, train_frac=0.05, test_frac=0.1, random_state=42):
    sampled_train_data = df_train.sample(frac=train_frac, random_state=random_state)
    sampled_test_data = df_test.sample(frac=test_frac, random_state=random_state)

    return sampled_train_data, sampled_test_data

In [4]:
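# note: test_frac defaults to 0.1, so the test sample is 10% of test_full despite the variable name
train_full_5_percent, test_full_1_percent = sample_train_and_test_data(train_full, test_full)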
len(train_full_5_percent), len(test_full_1_percent)

(290120, 167200)

### Vector Space
Now, we should ask ourselves this question: **If a friend asks you for a good restaurant around your area, what will you recommend to him/her?** Well, you probably have many restaurants in mind, because **you don't know his/her food preferences. Does he/she like Western? Perhaps Japanese? Or maybe Chinese?** By asking this beforehand, you can narrow down your recommendations, and you don't have to worry about your friend hating the food because it doesn't match his/her preferences.

The same goes for our model: we need to use the information provided for each restaurant so that we know "Restaurant A is more suitable for people who like pasta, Restaurant B is more suitable for people who like sushi, etc." If we look at the column `vendor_tag_name` in `vendors.csv`, each row (restaurant) has multiple tags separated by commas. Great, we can process this data further into what is called a "Vector Space". A vector space is an n-dimensional space in which each item is stored as a vector of its attributes. If we simplify this to just 3-dimensional vectors, we end up with something like this:

<img src='./img/vector_space.png' alt="3-D Vector Space">

Restaurant A might make good pizza and pasta but only a decent burger, while Restaurant B is well known for its burger and pasta but not its pizza. This way, although different restaurants carry different sets of tags, we can still find the "common ground" between them and recommend them to someone with the same tastes and preferences.


In [5]:
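def create_vector_space(df, features='vendor_tag_name'):
    # note: this overwrites df[features] in place, turning each tag string into a list
    df[features] = df[features].fillna('None')
    df[features] = df[features].apply(lambda x: x.split(','))
    # one row per (item, tag), one-hot encode the tags, then sum back to one row per item;
    # `sum(level=0)` was removed in pandas 2.0, so group by the first index level instead
    # (sort=False keeps the original row order, which the callers rely on)
    stacked_tags = df[features].apply(pd.Series).stack()
    vector_space = pd.get_dummies(stacked_tags, dtype=int).groupby(level=0, sort=False).sum()

    return vector_space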

In [6]:
vendor_profiles = create_vector_space(vendors)
vendor_profiles.index = vendors['id']
vendor_profiles.head()

id,American,Arabic,Asian,Bagels,Biryani,Breakfast,Burgers,Cafe,Cakes,Chinese,...,Smoothies,Soups,Spanish Latte,Steaks,Sushi,Sweets,Thai,Thali,Vegetarian,Waffles
4,0,1,0,0,0,1,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
13,0,0,0,0,0,1,0,0,1,0,...,0,1,0,0,0,0,0,0,0,0
20,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
23,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
28,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Since we cannot visualize a space with more than three dimensions, we can only identify the tags by a number ("1" indicates that the restaurant serves that kind of food, while "0" indicates that it doesn't).

Now in vector space, we can determine the similarity between vectors by calculating the angles between them. This creates "neighbors" where similar restaurants are closely placed with one another.
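To make the "angle between vectors" idea concrete, here is a minimal sketch of cosine similarity on three toy tag vectors. The restaurants and the three tags are made up for illustration. Note that scikit-learn's `metric='cosine'` works with cosine distance, which is 1 minus this similarity, so a smaller distance means a closer neighbor.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)

# Hypothetical tag order: [Burgers, Pasta, Pizzas]
restaurant_a = [0, 1, 1]
restaurant_b = [1, 1, 0]
restaurant_c = [0, 1, 1]

sim_ab = cosine_similarity(restaurant_a, restaurant_b)  # they share only Pasta
sim_ac = cosine_similarity(restaurant_a, restaurant_c)  # identical tags
```

Restaurants A and C point in the same direction, so they would be each other's nearest neighbors, while A and B only partly overlap.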

In [7]:
def train_model(df_profile, n_recommendations=20, metric='cosine', algorithm='brute'):
    # n_neighbors sets how many vendors kneighbors() returns per query by default
    knn = NearestNeighbors(metric=metric, algorithm=algorithm, n_neighbors=n_recommendations)
    knn.fit(df_profile.values)
    return knn

In [11]:
knn = train_model(vendor_profiles)

The same concept can be applied to customers, except that the same customer (identified by `customer_id`) may appear more than once in the `train_full.csv` data. To capture a customer's historical transactions, we simply sum their rows, so the value in a category can exceed 1.
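A toy illustration of this summing step, using only the standard library, is sketched below. The rows and the helper `build_user_profiles` are hypothetical and stand in for the pandas version used in this notebook.

```python
from collections import Counter, defaultdict

# Hypothetical rows: (customer_id, comma-separated tags of the vendor involved)
toy_rows = [
    ("A", "Burgers,Fries"),
    ("A", "Burgers,Desserts"),
    ("B", "Sushi"),
]

def build_user_profiles(rows):
    """Sum tag occurrences per customer across all of their rows."""
    profiles = defaultdict(Counter)
    for customer_id, tags in rows:
        profiles[customer_id].update(tag.strip() for tag in tags.split(","))
    return profiles

toy_profiles = build_user_profiles(toy_rows)
# Customer "A" appears twice with Burgers, so that count is 2
```

Counts above 1 act as a weight: the more often a tag shows up in a customer's history, the more it pulls their profile vector toward vendors with that tag.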

In [8]:
# create user profile for training data
train_user_profiles = create_vector_space(train_full_5_percent)
train_user_profiles.index = train_full_5_percent['customer_id']
train_user_profiles = train_user_profiles.groupby(train_user_profiles.index).sum()
train_user_profiles.head()

customer_id,American,Arabic,Asian,Bagels,Biryani,Breakfast,Burgers,Cafe,Cakes,Chinese,...,Smoothies,Soups,Spanish Latte,Steaks,Sushi,Sweets,Thai,Thali,Vegetarian,Waffles
000THBA,3,1,0,1,0,1,3,1,0,0,...,1,1,0,1,0,0,0,0,0,0
001XN9X,1,1,0,0,1,1,1,0,0,1,...,1,2,0,0,0,0,1,0,0,0
001ZNTK,0,1,1,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
002510Y,0,1,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
005ECL6,2,0,1,0,0,1,4,0,0,0,...,0,0,0,0,1,0,0,0,0,0


In [30]:
# create user profile for test data
test_user_profiles = create_vector_space(test_full_1_percent)
test_user_profiles.index = test_full_1_percent['customer_id']
test_user_profiles = test_user_profiles.groupby(test_user_profiles.index).sum()

### Predicting
Onto the exciting part: prediction time! We will now take one `customer_id` and feed it into our recommendation model. One more thing: if you have ever ordered food from other applications like Grab or Gojek, you will have noticed that they recommend the restaurants nearest to your current position. Worry not, we have that in our data!

To calculate the distance between lat-long pairs, we use the `haversine` package, which computes the Haversine distance: the great-circle distance between two points, under the assumption that the Earth is a sphere.
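For reference, the Haversine formula itself fits in a few lines. The function below is a from-scratch sketch, not the `haversine` package's code. The name `haversine_km` and the radius constant are assumptions made here, so results may differ very slightly from the package.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # assumed mean Earth radius in kilometers

def haversine_km(point1, point2):
    """Great-circle distance in km between two (latitude, longitude) pairs in degrees."""
    lat1, lon1 = map(math.radians, point1)
    lat2, lon2 = map(math.radians, point2)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    # Haversine of the central angle between the two points
    h = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))

# A quarter of the way around the equator
quarter_equator = haversine_km((0.0, 0.0), (0.0, 90.0))
```

Like the package, the function takes each point as a (latitude, longitude) pair, in that order.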

In [9]:
train_full_5_percent.customer_id.unique()

array(['HTIV4W4', 'E40FE9A', '3PC1YSR', ..., 'ESBKBFL', 'S2IZOY6',
       'EOZA199'], dtype=object)

Recall from the earlier part that we will not use the customer details data. This is because, when using them, I found something unusual while calculating the Haversine distance.

In [12]:
user_id = 'LG46F74'
# get lat-long from train_locations.csv and train_customers.csv
train_user_locations = pd.merge(train_customers, train_locations, left_on=['akeed_customer_id'], right_on=['customer_id'], how='left')
selected_user = train_user_locations[train_user_locations['akeed_customer_id']==user_id]
selected_user = selected_user.iloc[0] # a customer may have several rows; always reduce to the first one

# find the vendors whose tag vectors are closest to this user's profile
_, vendor_idx = knn.kneighbors(np.array(train_user_profiles.loc[user_id]).reshape(1,-1))
recommended_vendor_ids = vendor_profiles.iloc[vendor_idx[0]].index
recommended_vendors = vendors[vendors['id'].isin(recommended_vendor_ids)].copy() # copy so we can add a column safely
# distance in km between the user and each recommended vendor
recommended_vendors['distance_from_user'] = [
    haversine([selected_user['latitude'], selected_user['longitude']], [row['latitude'], row['longitude']])
    for _, row in recommended_vendors.iterrows()
]
recommended_vendors = recommended_vendors.sort_values(by=['distance_from_user'])

If we look at the `distance_from_user` column, the closest recommended vendor is about 26 km away from customer `LG46F74`, and nobody wants their food to be delivered from 26 km away!

In [13]:
recommended_vendors

Unnamed: 0,id,authentication_id,latitude,longitude,vendor_category_en,vendor_category_id,delivery_charge,serving_distance,is_open,OpeningTime,...,vendor_tag,vendor_tag_name,one_click_vendor,country_id,city_id,created_at,updated_at,device_type,display_orders,distance_from_user
95,849,130455.0,-1.58806,-0.066441,Restaurants,2.0,0.0,10.0,1.0,,...,145689130434824,"[American, Breakfast, Burgers, Cafe, Desserts,...",Y,1.0,1.0,2019-12-21 12:47:39,2020-04-07 20:01:33,3,1,26.172458
71,304,118906.0,-1.267463,0.028361,Restaurants,2.0,0.7,10.0,1.0,10:00AM-11:15PM,...,1453016,"[American, Breakfast, Burgers, Fries, Sandwiches]",Y,1.0,1.0,2019-07-03 10:38:21,2020-04-02 12:35:56,3,1,62.232279
92,843,130447.0,-1.269317,0.082343,Restaurants,2.0,0.0,5.0,1.0,,...,145689130434824,"[American, Breakfast, Burgers, Cafe, Desserts,...",Y,1.0,1.0,2019-12-21 12:31:16,2020-04-07 19:01:17,3,1,63.090618
56,237,118838.0,-0.943419,0.081702,Restaurants,2.0,0.7,15.0,0.0,08:30PM-11:59PM,...,1585730272416,"[American, Burgers, Desserts, Donuts, Fries, P...",Y,1.0,1.0,2019-04-30 16:15:30,2020-04-07 18:45:33,3,1,98.736645
7,44,118640.0,-0.936556,0.081933,Restaurants,2.0,0.7,15.0,1.0,11:00AM-11:45PM,...,153016,"[American, Burgers, Fries, Sandwiches]",Y,1.0,1.0,2018-06-20 13:11:17,2020-04-07 20:09:27,3,1,99.495782
36,160,118758.0,-0.933981,0.081365,Restaurants,2.0,0.7,15.0,1.0,10:00AM-11:45PM,...,154816,"[American, Burgers, Kids meal, Sandwiches]",Y,1.0,1.0,2019-01-28 20:37:49,2020-04-03 22:36:25,3,1,99.77039
33,154,118752.0,-0.872019,0.099434,Restaurants,2.0,0.7,15.0,1.0,11:00AM-11:45PM,...,15472416,"[American, Burgers, Mishkak, Salads, Sandwiches]",Y,1.0,1.0,2019-01-17 11:43:15,2020-04-02 12:02:15,3,1,106.87854
73,356,118958.0,-0.845096,0.067013,Restaurants,2.0,0.0,15.0,1.0,11:00AM-111:00PM,...,148271524,"[American, Kids meal, Pasta, Pizzas, Salads]",Y,1.0,1.0,2019-07-25 13:07:10,2020-04-07 21:53:47,3,1,109.373286
94,846,130451.0,-0.441823,0.099479,Restaurants,2.0,0.0,6.0,1.0,,...,145689130434824,"[American, Breakfast, Burgers, Cafe, Desserts,...",Y,1.0,1.0,2019-12-21 12:40:36,2020-04-06 16:19:44,3,1,154.342916
75,391,118994.0,-0.624702,0.72512,Restaurants,2.0,0.7,5.0,1.0,9:00AM-11:00PM,...,14583424162835,"[American, Breakfast, Burgers, Desserts, Itali...",Y,1.0,1.0,2019-08-08 14:16:34,2020-04-02 00:51:51,3,1,158.281286


Instead, we will retrieve the customer's position from our training data. There are two lat-long pairs there: (`latitude_x`, `longitude_x`) and (`latitude_y`, `longitude_y`). The `_x` pair yields results similar to the previous ones, while the `_y` pair results in more plausible distances.

In [18]:
def recommend_a_user(user_id, df_train, df_vendor, user_profile, vendor_profile):
    # note: relies on the global `knn` model fitted above
    selected_user = df_train[df_train['customer_id']==user_id]
    selected_user = selected_user.iloc[0] # a customer may have several rows; always reduce to the first one

    # find the vendors whose tag vectors are closest to this user's profile
    _, vendor_idx = knn.kneighbors(np.array(user_profile.loc[user_id]).reshape(1,-1))
    recommended_vendor_ids = vendor_profile.iloc[vendor_idx[0]].index
    recommended_vendors = df_vendor[df_vendor['id'].isin(recommended_vendor_ids)].copy() # copy so we can add a column safely
    # distance in km between the user and each recommended vendor
    recommended_vendors['distance_from_user'] = [
        haversine([selected_user['latitude_y'], selected_user['longitude_y']], [row['latitude'], row['longitude']])
        for _, row in recommended_vendors.iterrows()
    ]
    recommended_vendors = recommended_vendors.sort_values(by=['distance_from_user'])

    return recommended_vendors

In [19]:
recommend_a_user('LG46F74', train_full_5_percent, vendors, train_user_profiles, vendor_profiles)

Unnamed: 0,id,authentication_id,latitude,longitude,vendor_category_en,vendor_category_id,delivery_charge,serving_distance,is_open,OpeningTime,...,vendor_tag,vendor_tag_name,one_click_vendor,country_id,city_id,created_at,updated_at,device_type,display_orders,distance_from_user
1,13,118608.0,-0.471654,0.74447,Restaurants,2.0,0.7,5.0,1.0,08:30AM-10:30PM,...,44151342715241628,"[Breakfast, Cakes, Crepes, Italian, Pasta, Piz...",Y,1.0,1.0,2018-05-03 12:32:06,2020-04-05 20:46:03,3,1,0.015291
91,841,130436.0,-0.496138,0.740214,Restaurants,2.0,0.0,6.0,1.0,,...,145689130434824,"[American, Breakfast, Burgers, Cafe, Desserts,...",Y,1.0,1.0,2019-12-21 00:16:09,2020-04-07 15:09:12,3,1,2.760849
58,250,118851.0,-0.511584,0.758308,Restaurants,2.0,0.7,5.0,1.0,09:00AM-11:45PM,...,14452416,"[American, Breakfast, Rolls, Salads, Sandwiches]",Y,1.0,1.0,2019-05-12 17:17:13,2020-04-05 18:12:28,3,1,4.68954
3,23,118619.0,-0.585385,0.753811,Restaurants,2.0,0.0,5.0,1.0,10:59AM-10:30PM,...,583024,"[Burgers, Desserts, Fries, Salads]",Y,1.0,1.0,2018-05-06 19:20:48,2020-04-02 00:56:17,3,1,12.682594
0,4,118597.0,-0.588596,0.754434,Restaurants,2.0,0.0,6.0,1.0,11:00AM-11:30PM,...,2458912212241623,"[Arabic, Breakfast, Burgers, Desserts, Free De...",Y,1.0,1.0,2018-01-30 14:42:04,2020-04-07 15:12:43,3,1,13.044189
75,391,118994.0,-0.624702,0.72512,Restaurants,2.0,0.7,5.0,1.0,9:00AM-11:00PM,...,14583424162835,"[American, Breakfast, Burgers, Desserts, Itali...",Y,1.0,1.0,2019-08-08 14:16:34,2020-04-02 00:51:51,3,1,17.150385
6,43,118639.0,-0.11501,0.545973,Restaurants,2.0,0.7,15.0,1.0,11:00AM-11:45PM,...,153016,"[American, Burgers, Fries, Sandwiches]",Y,1.0,1.0,2018-06-20 12:28:00,2020-04-07 16:56:57,3,1,45.396934
98,858,130468.0,0.019817,0.587087,Restaurants,2.0,0.0,3.0,1.0,,...,145689130434824,"[American, Breakfast, Burgers, Cafe, Desserts,...",Y,1.0,1.0,2019-12-21 13:12:09,2020-04-07 14:26:08,3,1,57.392058
94,846,130451.0,-0.441823,0.099479,Restaurants,2.0,0.0,6.0,1.0,,...,145689130434824,"[American, Breakfast, Burgers, Cafe, Desserts,...",Y,1.0,1.0,2019-12-21 12:40:36,2020-04-06 16:19:44,3,1,71.808855
93,845,130450.0,-0.116904,0.181583,Restaurants,2.0,0.0,5.0,1.0,,...,145689130434824,"[American, Breakfast, Burgers, Cafe, Desserts,...",Y,1.0,1.0,2019-12-21 12:37:34,2020-04-08 03:43:30,3,1,73.997704


### Model Assessment
MAP (Mean Average Precision) and MAR (Mean Average Recall) are common metrics for judging whether a recommendation model is good or bad, usually reported as MAP@K and MAR@K. MAP@K (precision) tells us how relevant our restaurant recommendations are to the customers, whereas MAR@K (recall) tells us how well the model recalls the restaurants the customer has actually ordered from. Generally, the higher those values are, the better the recommendation model. In this notebook we only compute MAP@K.
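To show what goes into MAP@K, here is a small sketch of average precision at K written from the common definition. It is not the `ml_metrics` source, and the function names are chosen here for illustration only.

```python
def average_precision_at_k(actual, predicted, k=10):
    """AP@K: average of precision values at each rank where a relevant item appears."""
    predicted = predicted[:k]
    hits = 0
    score = 0.0
    for rank, item in enumerate(predicted, start=1):
        # Count an item only the first time it appears in the ranking
        if item in actual and item not in predicted[:rank - 1]:
            hits += 1
            score += hits / rank
    if not actual:
        return 0.0
    return score / min(len(actual), k)

def mean_average_precision_at_k(actual_lists, predicted_lists, k=10):
    """MAP@K: mean of AP@K over all users."""
    scores = [average_precision_at_k(a, p, k) for a, p in zip(actual_lists, predicted_lists)]
    return sum(scores) / len(scores)

# One relevant vendor (id 3) ranked second out of three recommendations
ap_single = average_precision_at_k([3], [1, 3, 5], k=3)
```

With a single relevant vendor per row, as in the evaluation in this notebook, AP@K reduces to 1 divided by the rank at which that vendor appears, or 0 if it is not recommended at all.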

In [26]:
# assess model on training or test data
def evaluate_model(df_train_or_test, df_vendor, vendor_profile, user_profile):
    # note: relies on the global `knn` model fitted above
    user_locations = df_train_or_test[['id', 'latitude_y', 'longitude_y', 'customer_id']]
    user_locations = user_locations[user_locations['customer_id'].notnull()] # filter data with null customer_id

    total_apk = 0
    for idx, user_row in user_locations.iterrows():
        # recommend vendors by tag similarity, then rank them by distance from the user
        _, vendor_idx = knn.kneighbors(np.array(user_profile.loc[user_row['customer_id']]).reshape(1,-1))
        recommended_vendor_ids = vendor_profile.iloc[vendor_idx[0]].index
        recommended_vendors = df_vendor[df_vendor['id'].isin(recommended_vendor_ids)].copy() # copy so we can add a column safely
        distance_from_user = []
        for _, row in recommended_vendors.iterrows():
            dist = haversine([user_row['latitude_y'], user_row['longitude_y']], [row['latitude'], row['longitude']])
            distance_from_user.append(dist)
        recommended_vendors['distance_from_user'] = distance_from_user
        recommended_vendors = recommended_vendors.sort_values(by=['distance_from_user'])
        # the vendor id in this row is the single "relevant" item; K is the number of recommendations
        total_apk += metrics.apk([user_row.id], recommended_vendors.id.tolist(), len(recommended_vendors))

    # mean of the per-row average precision values
    return total_apk/len(user_locations)

In [28]:
train_mapk = evaluate_model(train_full_5_percent, vendors, vendor_profiles, train_user_profiles)
print("Training MAP@K:", train_mapk)

Training MAP@K: 0.4755514959327175


In [31]:
test_mapk = evaluate_model(test_full_1_percent, vendors, vendor_profiles, test_user_profiles)
print("Test MAP@K:", test_mapk)

Test MAP@K: 0.3297412604130271
