Once the category scores for the routeRANK dataset have been computed, the results need some cleaning. First, we remove the mobility requests for which the actual trip was not found among the routeRANK alternatives.

In [1]:
import numpy as np
import pandas as pd

The trip-alternatives file is very large, and processing it all at once is computationally expensive. We therefore processed it in three separate runs, which produced three files that now need to be merged.

In [2]:
trips_combined_1 = pd.read_csv('categorized_offers/df_combined_5000.csv')
trips_combined_2 = pd.read_csv('categorized_offers/df_combined_5000_12100.csv')
trips_combined_3 = pd.read_csv('categorized_offers/df_combined_12100_end.csv')

In [3]:
trip_id_no_alt_1 = open('categorized_offers/request_id_no_solution_5000.txt','r').read().split('\n')
trip_id_no_alt_2 = open('categorized_offers/request_id_no_solution_5000_12100.txt','r').read().split('\n')
trip_id_no_alt_3 = open('categorized_offers/request_id_no_solution_12100_end.txt','r').read().split('\n')
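
The ID files above are read with `open(...).read().split('\n')`, which never closes the file handle and leaves an empty last element whenever the file ends with a newline. A minimal sketch of an alternative, assuming one ID per line; `read_request_ids` and the demo file are hypothetical names used only for illustration:

```python
import tempfile
from pathlib import Path

def read_request_ids(path):
    """Return the non-empty lines of a text file as a list of strings."""
    # read_text opens and closes the file; blank lines are skipped
    lines = Path(path).read_text().splitlines()
    return [line.strip() for line in lines if line.strip()]

# demo on a throwaway file (hypothetical content)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / 'demo_ids.txt'
    demo_path.write_text('a1\nb2\n\nc3\n')
    demo_ids = read_request_ids(demo_path)
```

With a reader like this, no trailing empty string needs to be sliced off afterwards.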

In [4]:
trips_combined_total = pd.concat([trips_combined_1, trips_combined_2, trips_combined_3])
trip_id_no_alt = trip_id_no_alt_1[:-1] + trip_id_no_alt_2[:-1] + trip_id_no_alt_3[:-1]

In [5]:
# remove the offers with no actual trip in motiv
# a single vectorised filter replaces the per-request concat loop,
# which copies the growing dataframe on every iteration
no_alt_mask = trips_combined_total['request_id'].isin(trip_id_no_alt)
cleaned_df = trips_combined_total[~no_alt_mask]
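
One thing worth checking in the filter above: the IDs read from the text files are strings, while `pd.read_csv` may parse `request_id` as integers if the IDs are purely numeric. In that case nothing matches and no request is removed. A small sketch with hypothetical toy data (the real dtype of `request_id` is not shown in this notebook):

```python
import pandas as pd

# hypothetical toy data: request_id parsed as integers, IDs from a text file as strings
toy_offers = pd.DataFrame({'request_id': [101, 101, 102, 103],
                           'offer_id': [1, 2, 3, 4]})
toy_no_alt = ['102']

# comparing integers with strings matches nothing, so no row is removed
unaligned = toy_offers[~toy_offers['request_id'].isin(toy_no_alt)]

# casting the column to str makes the comparison meaningful
aligned = toy_offers[~toy_offers['request_id'].astype(str).isin(toy_no_alt)]
```

Comparing the number of unique requests before and after the filter is a quick way to confirm that the expected requests were actually dropped.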

In [6]:
cleaned_df.to_csv('categorized_offers/df_combined.csv')

Next, we delete the mobility requests that have only one offer/alternative. The algorithm we want to implement, BPR (Bayesian Personalized Ranking), is trained pairwise, so each request must provide at least two offers to form a pair.
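
To illustrate why single-offer requests are useless for pairwise training, here is a rough sketch that builds (preferred offer, other offer) pairs per request. The `chosen` column marking the actual trip and the function `preference_pairs` are assumptions made for this example; the notebook does not show how the actual trip is flagged.

```python
from itertools import product

import pandas as pd

# hypothetical toy data; the `chosen` flag is an assumed column
toy = pd.DataFrame({'request_id': ['r1', 'r1', 'r1', 'r2'],
                    'offer_id':   [10, 11, 12, 20],
                    'chosen':     [True, False, False, True]})

def preference_pairs(offers):
    """Return (request_id, preferred_offer, other_offer) triples."""
    pairs = []
    for request_id, group in offers.groupby('request_id'):
        preferred = group.loc[group['chosen'], 'offer_id']
        others = group.loc[~group['chosen'], 'offer_id']
        # a request with a single offer has no "other" offer, hence no pair
        pairs.extend((request_id, p, o) for p, o in product(preferred, others))
    return pairs

pairs = preference_pairs(toy)
```

Request `r2` has only one offer and therefore contributes no pair at all, which is why such requests are removed.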

In [6]:
cleaned_df = pd.read_csv('categorized_offers/df_combined.csv')

In [7]:
print('Unique mobility requests: ', cleaned_df.request_id.nunique())

Unique mobility requests:  15967


In [13]:
# keep the request ids that have more than 1 offer (single-offer requests must be deleted)
counts = cleaned_df.groupby('request_id').count().sort_values('offer_id').reset_index()[['request_id','offer_id']]
trips_ids_to_save = counts[counts['offer_id']>1]
trips_combined_gt_1_offer = pd.merge(cleaned_df, trips_ids_to_save['request_id'], on='request_id')
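
The same filter can be written without the intermediate `counts` table and the merge. A minimal sketch on hypothetical toy data, using `groupby(...).transform('count')` to attach the per-request offer count to every row:

```python
import pandas as pd

# hypothetical toy data
toy = pd.DataFrame({'request_id': ['r1', 'r1', 'r2', 'r3', 'r3', 'r3'],
                    'offer_id':   [1, 2, 3, 4, 5, 6]})

# number of offers of each row's request, aligned with the rows of `toy`
offers_per_request = toy.groupby('request_id')['offer_id'].transform('count')

# keep only the rows whose request has more than one offer
toy_gt_1_offer = toy[offers_per_request > 1]
```

This keeps the original row order and index, whereas a merge returns a freshly indexed dataframe.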

In [15]:
print('Unique mobility requests: ', trips_combined_gt_1_offer.request_id.nunique())

Unique mobility requests:  15855


Let us save the identifiers of the mobility requests that will actually be used by the ranking algorithm.

In [17]:
unique_requests_ids = trips_combined_gt_1_offer.request_id.unique()
np.savetxt('categorized_offers/final_unique_resquests_ids.txt', unique_requests_ids, fmt='%s')

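The next step is to assign a unique identifier to each offer. The code below draws random 16-digit integers independently, so collisions are extremely unlikely but not strictly ruled out.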

In [19]:
import random
from random import randint
random.seed(0)
def random_with_n_digits(N, n):
    # draw N random integers with exactly n digits each
    range_start = 10 ** (n - 1)
    range_end = (10 ** n) - 1
    return [randint(range_start, range_end) for _ in range(N)]

In [21]:
unique_offer_ids = random_with_n_digits(len(trips_combined_gt_1_offer), 16)
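
Since the identifiers are drawn independently, uniqueness is not guaranteed by construction. A possible variant, sketched under the assumption that any distinct 16-digit integers are acceptable, samples without replacement instead. `unique_ids_with_n_digits` is a hypothetical helper and produces different values than `random_with_n_digits`:

```python
import random

def unique_ids_with_n_digits(N, n, seed=0):
    """Return N distinct random integers with exactly n digits each."""
    rng = random.Random(seed)  # local generator, leaves the global seed untouched
    # sampling without replacement from a range cannot produce duplicates
    return rng.sample(range(10 ** (n - 1), 10 ** n), N)

demo_ids = unique_ids_with_n_digits(1000, 16)
```

Alternatively, a check such as `len(set(unique_offer_ids)) == len(unique_offer_ids)` on the existing list would confirm that no collision occurred.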

In [23]:
# clean the columns
trips_combined_gt_1_offer.drop(columns=['Unnamed: 0', 'Unnamed: 0.1', 'Unnamed: 0.1.1', 'offer_id'],
                               inplace=True)

In [25]:
# add the new offer identifiers
trips_combined_gt_1_offer['offer_id'] = unique_offer_ids

Moreover, because of the way we combined the leg alternatives to form the trip alternatives, there are requests with too many offers (some with more than 1000). For now, we delete the requests with 1000 or more offers, because we will not use them.

In [31]:
# delete the requests for which we have too many offers
counts = trips_combined_gt_1_offer.groupby('request_id').count().sort_values('offer_id').reset_index()[['request_id','offer_id']]
trips_ids_to_save = counts[counts['offer_id']<1000]
df = pd.merge(trips_combined_gt_1_offer, trips_ids_to_save['request_id'], on='request_id')
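
Before fixing a threshold, it can help to see how many requests it would remove. A small sketch on hypothetical toy data; `max_offers` is set to 5 here only so that the toy example drops something, whereas the notebook uses 1000:

```python
import pandas as pd

# hypothetical toy data: r1 has 2 offers, r2 has 3, r3 has 7
toy = pd.DataFrame({'request_id': ['r1'] * 2 + ['r2'] * 3 + ['r3'] * 7})

max_offers = 5  # hypothetical threshold for the toy data
offers_per_request = toy['request_id'].value_counts()

# requests that reach the threshold and would be deleted
dropped_requests = offers_per_request[offers_per_request >= max_offers]
dropped_share = len(dropped_requests) / len(offers_per_request)
```

In this notebook the printed counts go from 15855 to 15820 unique requests, so the threshold of 1000 removes 35 requests.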

In [38]:
print('Unique mobility requests: ', df.request_id.nunique())

Unique mobility requests:  15820


Finally, for convenience, let us replace the user identifiers, which are currently character strings, with numeric values.

In [35]:
def user_ids_mapping(df):
    # dataframe mapping the original ids to the new ones (1, 2, ... in order of first appearance)
    unique_user_ids = df.user_id.unique()
    convert_userid_df = pd.DataFrame({'new_user_id': np.arange(1, len(unique_user_ids) + 1),
                                      'user_id': unique_user_ids})
    # merge to perform the mapping, then replace the old column with the new one
    df2 = pd.merge(df, convert_userid_df,
                   on='user_id').drop(columns=['user_id']).rename(columns={'new_user_id': 'user_id'})
    return df2
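
The same mapping can also be obtained with `pd.factorize`, which numbers values in order of first appearance, just like `unique()`. A minimal sketch on hypothetical toy data; `id_lookup` is an illustrative name for a table that maps the new IDs back to the original ones:

```python
import pandas as pd

# hypothetical toy data
toy = pd.DataFrame({'user_id': ['u_b', 'u_a', 'u_b', 'u_c'],
                    'offer_id': [1, 2, 3, 4]})

# codes start at 0, so add 1 to obtain ids 1, 2, 3, ...
codes, original_ids = pd.factorize(toy['user_id'])
toy_numeric = toy.assign(user_id=codes + 1)

# lookup from the new numeric id back to the original string id
id_lookup = pd.Series(original_ids, index=range(1, len(original_ids) + 1))
```

Keeping a lookup such as `id_lookup` makes it possible to trace results back to the original users later.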

In [39]:
df2 = user_ids_mapping(df)

In [40]:
df2.to_csv('categorized_offers/trips_combined_final.csv')