In this notebook we process the data used to train the BPR (Bayesian Personalized Ranking) algorithm with Matrix Factorization as the underlying model. The model tries to learn the relation between users and items and makes predictions based on the similarities between them. Our problem has a clear issue: all the items, i.e. the offers, are in principle unique, so it is not possible to learn these user-item relations. To solve this, we propose to treat two offers as equal if each transport mode covers the same fraction of the trip, measured here as the share of legs that use that mode. To simplify the problem, we only consider five transport modes: walk, bike, car, public transport and long public transport. Hence, a trip composed of car-walk-walk is the same as another trip composed of walk-walk-car.
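
As a rough illustration of this equivalence rule, the sketch below computes the share of legs per mode in a fixed mode order, so the result does not depend on the order of the legs. The helper `mode_signature` and the constant `MODES` are hypothetical names used only for this example; they are not part of the notebook.

```python
from collections import Counter

# Hypothetical helper (not part of the notebook): share of legs per mode,
# reported in a fixed mode order so that leg order does not matter.
MODES = ('walking', 'car', 'bike', 'pubtrans', 'longpubtrans')

def mode_signature(leg_modes):
    counts = Counter(leg_modes)  # missing modes count as 0
    return tuple(counts[m] / len(leg_modes) for m in MODES)

offer_a = ['car', 'walking', 'walking']
offer_b = ['walking', 'walking', 'car']
print(mode_signature(offer_a) == mode_signature(offer_b))  # True
```

Under this rule the two offers collapse into a single item, which is what gives the model repeated user-item observations to learn from.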

First of all, we need to store the transport mode of every leg of every trip alternative. The trip alternatives are not given directly, so we generate them by combining the leg alternatives.

In [1]:
import numpy as np
import pandas as pd
import json
import random

In [2]:
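def random_wit_n_digits(n):
    """Return a random integer with exactly n decimal digits."""
    range_start = 10**(n-1)
    range_end = (10**n)-1
    # random.randint includes both endpoints
    return random.randint(range_start, range_end)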

In [4]:
def combine_alternatives(legs):
    """Return all possible combinations of leg alternatives that form
        trip alternatives. The input is a list containing the leg alternatives
        of one trip from the routeRank dataset."""

    trip_dict = dict()
    trip_dict.setdefault('from', legs[0]['from'])
    trip_dict.setdefault('to', legs[-1]['to'])
    trip_dict.setdefault('date', legs[0]['date'])
    trip_dict.setdefault('tripId', legs[0]['tripId'])
    trip_dict.setdefault('places', {})
    number_alternatives_trip = []
    for trip in legs:
        trip_dict['places'].update(trip['places'])
        number_alternatives_trip.append(len(trip['alternatives']))
    number_alternatives_trip.append(1)
    number_alternatives = np.prod(number_alternatives_trip)
    l = [[] for i in range(number_alternatives)]
    trip_dict.setdefault('alternatives', l)
    for j in range(len(legs)):
        k = np.prod(number_alternatives_trip[j + 1:])
        v = 0
        for alternative in legs[j]['alternatives']:
            for i in range(number_alternatives):
                if i % np.prod(number_alternatives_trip[j:]) == 0:
                    for m in range(k):
                        trip_dict['alternatives'][i + m + v].append(alternative)
            v += k
    return trip_dict
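
The index arithmetic in `combine_alternatives` enumerates the Cartesian product of the per-leg alternatives. As a cross-check on made-up data, and assuming only the list of combinations is of interest (not the `from`, `to`, `date` or `places` fields), the same enumeration can be sketched with `itertools.product`. The name `toy_legs` and its labels are invented for this example.

```python
from itertools import product

# Toy leg alternatives (made-up labels): 2 options for the first leg, 3 for the second.
toy_legs = [
    {'alternatives': ['a1', 'a2']},
    {'alternatives': ['b1', 'b2', 'b3']},
]

# One trip alternative per element of the Cartesian product;
# the alternatives of the last leg vary fastest.
trip_alternatives = [
    list(combo)
    for combo in product(*(leg['alternatives'] for leg in toy_legs))
]

print(len(trip_alternatives))  # 6
print(trip_alternatives[:3])   # [['a1', 'b1'], ['a1', 'b2'], ['a1', 'b3']]
```

The number of combinations is the product of the number of alternatives per leg, so it grows quickly with the number of legs.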


In [7]:
# load the data (the files are closed as soon as they have been parsed)
with open('data/final1.json') as f:
    routerank_alternatives_1 = json.load(f)
with open('data/final2.json') as f:
    routerank_alternatives_2 = json.load(f)
leg_alternatives = routerank_alternatives_1 + routerank_alternatives_2

In [6]:
new_trips = []
with open('data/final_unique_resquests_ids.txt', 'r') as f:
    unique_requests_id = f.read().split('\n')

In [8]:
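# generate the trip alternatives
k = 1  # 1-based position of the current request
for unique_id in unique_requests_id[:-1]:
    # collect every leg entry that belongs to this request
    legs = []
    for trip in leg_alternatives:
        if trip['tripId'] == unique_id:
            legs.append(trip)
    if len(legs) < 8:
        new_trips.append(combine_alternatives(legs))
    else:
        # requests with 8 or more legs are skipped; print their position
        print(k)
    if k % 100 == 0:
        print('{k} trips combined'.format(k=k))
    k += 1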

100 trips combined
200 trips combined
300 trips combined
400 trips combined
500 trips combined
600 trips combined
700 trips combined
800 trips combined
900 trips combined
1000 trips combined
1100 trips combined
1200 trips combined
1300 trips combined
1400 trips combined
1500 trips combined
1600 trips combined
1700 trips combined
1800 trips combined
1900 trips combined
2000 trips combined
2100 trips combined
2200 trips combined
2300 trips combined
2400 trips combined
2500 trips combined
2600 trips combined
2700 trips combined
2800 trips combined
2900 trips combined
3000 trips combined
3100 trips combined
3200 trips combined
3300 trips combined
3400 trips combined
3500 trips combined
3600 trips combined
3700 trips combined
3800 trips combined
3900 trips combined
4000 trips combined
4100 trips combined
4200 trips combined
4300 trips combined
4400 trips combined
4500 trips combined
4600 trips combined
4700 trips combined
4800 trips combined
4900 trips combined
5000 trips combined
5100 trip
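
The loop above rescans the full `leg_alternatives` list once per request id. One possible alternative, sketched here on made-up records, is to group the legs by `tripId` in a single pass and then look each request up directly. The names `toy_legs` and `legs_by_trip` and the `'leg'` field are hypothetical and only serve the example.

```python
from collections import defaultdict

# Made-up records standing in for leg_alternatives; only 'tripId' matters here.
toy_legs = [
    {'tripId': 'r1', 'leg': 0},
    {'tripId': 'r2', 'leg': 0},
    {'tripId': 'r1', 'leg': 1},
]

# Single pass: bucket every leg under its request id, preserving input order.
legs_by_trip = defaultdict(list)
for leg in toy_legs:
    legs_by_trip[leg['tripId']].append(leg)

print([leg['leg'] for leg in legs_by_trip['r1']])  # [0, 1]
```

Because the legs are appended in input order, each group keeps the same leg order that the nested loop would produce.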

Now that the trip alternatives have been generated, we iterate over the legs of these offers and store their modes of transport, mapped into the five modes listed above.

In [9]:
# map transport modes
dict_transport_modes = {
    'train': 'pubtrans',
    'taxi': 'pubtrans',
    'change': 'pubtrans',
    'bus': 'pubtrans',
    'subway': 'pubtrans',
    'tram': 'pubtrans',
    'bikesharing': 'bike',
    'carsharing': 'car',
    'genericpubtrans': 'pubtrans',
    'boat': 'longpubtrans',
    'funicular': 'pubtrans'
}

def map_transport_modes(mode):
    # modes without an entry in the mapping (e.g. 'car') are returned unchanged
    return dict_transport_modes.get(mode, mode)

In [14]:
np.random.seed(0)
random.seed(0)

modes_offer = dict()
modes_offer_list = list()

for trips in new_trips: #this loop goes through the trips
    trip = trips['alternatives']
    for alternative in trip: #this loop goes through the alternatives (offers)
        modes = list()
        offer_id = random_wit_n_digits(16)
        # modes_offer.setdefault(offer_id,list())
        for segments in alternative:
            for segment in segments['segments']: #this loop goes through the segments
                for leg in segment['legs']: #this loop goes through the legs
                    modes.append(map_transport_modes(leg['transport']))
        modes_offer.setdefault(offer_id, modes)
        modes_offer_list.append(modes)

In [16]:
print('Number of different offers:', len(modes_offer))

Number of different offers: 385865


In [22]:
# example
print(modes_offer[4469980719646669])

['car', 'car', 'car', 'car', 'walking']


In the example above, 80% of the legs of the offer are travelled by car and the remaining 20% on foot. All offers with this mix are considered equal. To compare offers this way, we convert the previous list into a vector containing the fraction of the trip covered by each transport mode.

In [17]:
vector_modes = {
    'walking':0,
    'car':1,
    'bike':2,
    'pubtrans':3,
    'longpubtrans':4
}

# create the vectors containing the fraction of the trip covered by each transport mode
modes_offer_vector = dict()
for key, value in modes_offer.items():
    counts = np.zeros(5)
    for mode in value:
        counts[vector_modes[mode]] += 1
    # divide the per-mode leg counts by the total number of legs
    modes_offer_vector[key] = list(counts / len(value))

In [20]:
# example
print(modes_offer_vector[4469980719646669])

[0.2, 0.8, 0.0, 0.0, 0.0]


Now we need to extract the unique vectors and assign an identifier to each of them.

In [23]:
unique_data = [list(x) for x in set(tuple(x) for x in list(modes_offer_vector.values()))]
dict_unique_data = dict(zip(np.arange(1,len(unique_data)+1), unique_data))

In [25]:
print('Number of different offers (based on fraction of the trip covered by each transport mode):', len(dict_unique_data))

Number of different offers (based on fraction of the trip covered by each transport mode): 1030


The next step is to map each original (unique) offer identifier to the new identifier of its vector.

In [26]:
def find_key_from_value(val):
    for key, value in dict_unique_data.items():
        if val == value:
            return key

In [27]:
offerid_2_new_id = dict()
for key, value in modes_offer_vector.items():
    new_id = find_key_from_value(value)
    offerid_2_new_id.setdefault(key, new_id)
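
`find_key_from_value` scans `dict_unique_data` once per offer. A possible alternative, shown here only as a sketch on toy data, is to invert the dictionary once and then look every vector up directly. The names `toy_unique`, `toy_vectors`, `vector_to_id` and `toy_mapping` and their values are invented for this example.

```python
# Toy stand-ins for dict_unique_data and modes_offer_vector (made-up values).
toy_unique = {1: [0.2, 0.8, 0.0, 0.0, 0.0], 2: [1.0, 0.0, 0.0, 0.0, 0.0]}
toy_vectors = {111: [1.0, 0.0, 0.0, 0.0, 0.0], 222: [0.2, 0.8, 0.0, 0.0, 0.0]}

# Invert once: lists are unhashable, so each vector is keyed as a tuple.
vector_to_id = {tuple(vec): new_id for new_id, vec in toy_unique.items()}

# Direct lookup per offer instead of a linear scan.
toy_mapping = {
    offer_id: vector_to_id[tuple(vec)]
    for offer_id, vec in toy_vectors.items()
}
print(toy_mapping)  # {111: 2, 222: 1}
```

Unlike the linear scan, which returns `None` when no match is found, the direct lookup raises a `KeyError` for an unknown vector.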

In [28]:
# build a two-column frame: original offer_id -> new id
offerid_2_new_id_df = pd.DataFrame(list(offerid_2_new_id.items()), columns=['offer_id', 'id'])

Now we need to load the training set and replace the original offer identifiers with the new ones.

In [32]:
df = pd.read_csv('../categorized_offers/trips_combined_final.csv').drop(columns=['Unnamed: 0'])

In [36]:
df_matrix_fact = pd.merge(df, offerid_2_new_id_df, on='offer_id')[['request_id','user_id','id','Response']]
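
`pd.merge` performs an inner join by default, so training rows whose `offer_id` has no entry in the mapping are dropped without notice. The sketch below, on made-up frames, shows one way to check for this using the `how`, `indicator` and `validate` parameters of `pd.merge`. The frames `toy_df` and `toy_map` are hypothetical stand-ins for `df` and `offerid_2_new_id_df`.

```python
import pandas as pd

# Made-up frames standing in for df and offerid_2_new_id_df.
toy_df = pd.DataFrame({'offer_id': [10, 11, 12], 'Response': [1, 0, 1]})
toy_map = pd.DataFrame({'offer_id': [10, 11], 'id': [5, 7]})

# how='left' keeps unmatched rows, indicator=True labels them in '_merge', and
# validate='many_to_one' raises if an offer_id appears twice in toy_map.
checked = pd.merge(toy_df, toy_map, on='offer_id', how='left',
                   indicator=True, validate='many_to_one')
unmatched = checked[checked['_merge'] == 'left_only']
print(len(unmatched))  # 1
```

If the count of unmatched rows is zero, the inner join used above keeps every row of the training set.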

In [39]:
# save the dataframe
df_matrix_fact.to_csv('data/df_matrix_factorization.csv')