### Dream Team:  Xingce Bao, Sohyeong Kim, Guilio Masinelli, Silvio Zanoli


# PART III - Build a graph for the travel

In this part, we build an isochrone map by running a breadth-first search from Zürich HB to find all the other stations that can be reached within the selected time interval.

When building the graph, the user sets a probability threshold: a transfer is considered only if the probability of making it satisfies that threshold.

In the last section of this part, we create an adjacency matrix which stores the direct connections between pairs of stations. This matrix is then saved and used by the routing algorithm in the next part.
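The breadth-first idea can be sketched on a toy graph before bringing in timetables and probabilities. This is only an illustration under simplifying assumptions: the station names, the travel times and the function `reachable_within` are made up, each edge has a fixed travel time in seconds, and delays are ignored.

```python
from collections import deque

def reachable_within(graph, source, budget):
    """Return {station: earliest travel time} for stations reachable within `budget` seconds."""
    best = {source: 0}
    queue = deque([source])
    while queue:
        station = queue.popleft()
        for neighbour, seconds in graph.get(station, []):
            arrival = best[station] + seconds
            # Keep the neighbour only if it fits in the budget and improves on what we know
            if arrival <= budget and arrival < best.get(neighbour, float("inf")):
                best[neighbour] = arrival
                queue.append(neighbour)
    return best

# Hypothetical toy network: station -> list of (next station, travel time in seconds)
toy_graph = {
    "A": [("B", 120), ("C", 300)],
    "B": [("C", 60), ("D", 400)],
    "C": [("D", 200)],
}
reach = reachable_within(toy_graph, "A", 400)
print(reach)  # {'A': 0, 'B': 120, 'C': 180, 'D': 380}
```

The functions in this notebook follow the same first-in-first-out pattern, but replace the fixed travel times with arrival-time distributions and a probability threshold.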

In [1]:
import getpass
import pyspark
from pyspark.sql import SparkSession
import os
import numpy as np
import pandas as pd
from datetime import timedelta
import pickle

First, we import the processed data created and stored in the previous parts.

In [2]:
# Import the list of stations within 10 km of Zürich HB
with open("./data/train_station_id.p", "rb") as f:
    station_data = pickle.load(f)

# Days in the order used by pandas' dayofweek (Monday = 0)
days = ['monday', 'tuesday', 'wednesday', 'thursday', 'friday', 'saturday', 'sunday']

# Import the processed actual data containing mean and variance
data_mean_variance_list = [pd.read_csv('./data/{}_processed.csv'.format(day)) for day in days]

# Import the schedule of the transportation for each day
data_schedule_list = [pd.read_csv('./data/{}_schedule.csv'.format(day)) for day in days]

## CASE 1. Simple case

We start with the simple case: travel without transfers. This amounts to boarding a single line and riding it for as long as the time limit and the probability threshold are satisfied.

To realize this case, we implemented the functions below. Each function is described in its accompanying comments.

In [3]:
# Define a function to compute the distance from longitude and latitude
# This function is reused from PART 0.
from math import sin, cos, sqrt, atan2, radians
zurich_longtitude = 8.540192
zurich_latitude = 47.378177
def compute_distance(parameter,longtitude2 = zurich_longtitude,latitude2 = zurich_latitude):
    # approximate radius of earth in km
    R = 6373.0
    longtitude1,latitude1 = parameter
    longtitude1 = float(longtitude1)
    latitude1 = float(latitude1)
    lat1 = radians(latitude1)
    lon1 = radians(longtitude1)
    lat2 = radians(latitude2)
    lon2 = radians(longtitude2)

    dlon = lon2 - lon1
    dlat = lat2 - lat1

    a = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlon / 2)**2
    c = 2 * atan2(sqrt(a), sqrt(1 - a))

    distance = R * c
    return distance

In [4]:
import scipy.stats
'''
This function returns the probability P(X < x), where X is a gamma-distributed random variable.
If the variance k*theta^2 is negligible, X is treated as a constant equal to its mean k*theta.
'''
def compute_probability(parameter,x):
    k = parameter[0]
    theta = parameter[1]
    if k*theta*theta < 0.01:
        if k*theta < x:
            return 1.0
        else:
            return 0.0
    dist = scipy.stats.gamma(k, 0, theta)
    return dist.cdf(x)

In [5]:
'''
This function estimates the probability P(X < Y), where X (arrival) and Y (departure) are both
gamma-distributed. We do not compute this analytically; we estimate it by Monte Carlo simulation.
'''
def compute_probability_sample(parameter_departure,parameter_arrival,N=1000):
    k_arrival, theta_arrival = parameter_arrival
    k_departure, theta_departure = parameter_departure
    # If the train never departs, k_departure is -1; we return 0, which means there is no
    # chance of catching any train from here
    if k_departure < 0 or theta_departure < 0:
        return 0.0
    if np.isnan(k_departure) or np.isnan(theta_departure):
        return 0.0
    # Build two distribution
    dist_arrival = scipy.stats.gamma(k_arrival, 0, theta_arrival+1e-20)
    dist_departure = scipy.stats.gamma(k_departure, 0, theta_departure+1e-20)
    # Draw samples
    arrival_s = dist_arrival.rvs(size=N)
    departure_s = dist_departure.rvs(size=N)
    # Return the probability by simulation
    return np.sum(arrival_s<departure_s)/N
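The Monte Carlo estimate above can be reproduced with the standard library alone. The sketch below is an illustration, not the notebook's implementation: it uses `random.gammavariate` instead of `scipy.stats.gamma`, the helper names `gamma_params` and `prob_arrive_before_departure` are hypothetical, and the means and variances are made-up values in seconds.

```python
import random

def gamma_params(mean, variance):
    """Moment matching: mean = k * theta and variance = k * theta**2."""
    theta = variance / mean
    k = mean / theta
    return k, theta

def prob_arrive_before_departure(arrival, departure, n=20000, seed=0):
    """Estimate P(arrival time < departure time) from (mean, variance) pairs."""
    rng = random.Random(seed)  # fixed seed so that repeated runs give the same estimate
    k_a, theta_a = gamma_params(*arrival)
    k_d, theta_d = gamma_params(*departure)
    hits = sum(
        rng.gammavariate(k_a, theta_a) < rng.gammavariate(k_d, theta_d)
        for _ in range(n)
    )
    return hits / n

# Arriving around second 100, with the connection leaving around second 160: almost always caught
p_easy = prob_arrive_before_departure((100.0, 25.0), (160.0, 25.0))
# Identical distributions: roughly a coin flip
p_even = prob_arrive_before_departure((100.0, 25.0), (100.0, 25.0))
print(p_easy, p_even)
```

Because the estimate is a sample mean of `n` Bernoulli trials, its standard error is at most `0.5 / sqrt(n)`, so the default `N=1000` in `compute_probability_sample` gives an accuracy of roughly ±0.016.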

In [6]:
'''
This function returns the trains you can catch, given the station number and the distribution of your arrival time.
'''
def get_train_list(station_no,time_mean,time_variance,end_second,dayofweek,probability):
    theta = time_variance/time_mean
    k = time_mean/theta
    # Filter only keeps the data with relevant day
    data_mean_variance = data_mean_variance_list[dayofweek]
    data_schedule = data_schedule_list[dayofweek]
    # Filter only keeps the data with relevant station
    station_frame = data_mean_variance[data_mean_variance.station_id == station_no]
    # If no data , return -1 (-1 means no data in this program)
    if station_frame.shape[0] == 0:
        return -1
    # Merge the schedule data
    train_frame = pd.merge(station_frame, data_schedule, how='left', on=['train_number', 'station_id','line_id'])
    # Drop the trains that are irrelevant to our problem (their departure time is already outside our time interval)
    train_frame = train_frame[train_frame.departuretimeoffsetschedule<=end_second]
    # Use the MC simulation function to compute the probability
    train_frame_probability_temp = train_frame[['departure_k','departure_theta']].apply(compute_probability_sample, axis=1,args=((k,theta),))
    # Keep the trains which satisfy the probability lower bound
    if train_frame_probability_temp.shape[0] == 0:
        return -1
    train_frame = train_frame[train_frame_probability_temp>probability]
    return train_frame
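The first two lines of `get_train_list` turn a mean and a variance into gamma parameters by moment matching. As a short worked derivation, writing $\mu$ for the mean and $\sigma^2$ for the variance (symbols introduced here for illustration), a gamma distribution with shape $k$ and scale $\theta$ satisfies

$$
\mu = k\theta, \qquad \sigma^2 = k\theta^2
\quad\Longrightarrow\quad
\theta = \frac{\sigma^2}{\mu}, \qquad k = \frac{\mu}{\theta} = \frac{\mu^2}{\sigma^2}.
$$

For instance, an arrival with $\mu = 57011$ s and $\sigma^2 = 330$ s² gives $\theta = 330 / 57011 \approx 0.005788$ and $k = 57011^2 / 330 \approx 9.849 \times 10^{6}$, which agrees with the `arrival_theta` and `arrival_k` columns in the first row of the first example output further down.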

In [7]:
'''
Now we define another function to find the stations that can be reached with the trains that can be taken.
Basically, we use the schedule to find the stations each train serves after the current station.
'''
def get_reachable_station(time_mean,time_variance,end_second,station_no,dayofweek,probability):
    # Find the train can be taken
    train_list = get_train_list(station_no,time_mean,time_variance,end_second,dayofweek,probability)
    # If no train(which train list == -1),return -1
    if type(train_list) is int:
        return -1
    data_schedule = data_schedule_list[dayofweek]
    train_list = train_list.reset_index()
    train_dataFrame_list = []
    # Search every train in the list
    for i in range(train_list.shape[0]):
        # Get the train number and the departure time
        departure_time = train_list.at[i,"departuretimeoffsetschedule"]
        train_number = train_list.at[i,"train_number"]
        line_number = train_list.at[i,"line_id"]
        # Filter the relevant data
        train_reachable_place = data_schedule[data_schedule.train_number == train_number]
        train_reachable_place = train_reachable_place[train_reachable_place.line_id == line_number]
        # Keep the station after the departure
        train_reachable_place = train_reachable_place[train_reachable_place.arrivaltimeoffsetschedule > departure_time]
        train_reachable_place = train_reachable_place[train_reachable_place.arrivaltimeoffsetschedule <= end_second]
        # If no data, use -1 to represent no data 
        if train_reachable_place.shape[0] == 0:
            train_reachable_place = -1
        train_dataFrame_list.append(train_reachable_place)
    # Remove all -1
    train_dataFrame_list = list(filter(lambda a: type(a) is not int, train_dataFrame_list))
    # If no data, return -1 to represent no data 
    if train_dataFrame_list == []: 
        return -1
    return pd.concat(train_dataFrame_list)
        

In [8]:
'''
This function computes all the stations that you can reach from a single departure station.
It basically uses the functions above.
'''
def one_line_find(time_mean,time_variance,end_second,station_no,dayofweek,probability):
    # Get station list
    station_list = get_reachable_station(time_mean,time_variance,end_second,station_no,dayofweek,probability)
    # Filter only keeps the data with relevant day
    data_mean_variance = data_mean_variance_list[dayofweek]
    # No train can be taken
    if type(station_list) is int:
        return -1
    print ("Search for the departure from station ",station_no)
    # Join the data with the distribution data
    result = pd.merge(station_list, data_mean_variance, how='left', on=['train_number', 'station_id','line_id'])
    result = result.sort_values(by = ["train_number",'line_id',"arrivaltimeoffsetschedule"])
    result["probability"] = result[['arrival_k','arrival_theta']].apply(compute_probability, axis=1,args=(end_second,))
    # Keep the data which satisfy the probability lower bound
    result = result[result.probability>probability].reset_index().drop(columns="index")
    # If the result is Null
    if result.shape[0] == 0:
        return -1
    return result

In [9]:
'''
Now we define the function to give the simple case --- No transfer!
station_no is the departure station
'''
def simple_case(datetime,time_interval,station_no,probability):
    # Convert the date string to a pandas datetime
    time = pd.to_datetime(datetime,dayfirst=True)
    end_time = time + timedelta(seconds = time_interval)
    start_second = time.timestamp()%(24*3600)
    end_second = end_time.timestamp()%(24*3600)
    dayofweek = time.dayofweek
    # The first station is left at a fixed time,
    # so start_second is the mean of the departure time, with a very small variance (0.01).
    # The true variance is 0, but 0 breaks the gamma parameters (division by zero), and 0.01 second changes nothing
    result = one_line_find(start_second,0.01,end_second,station_no,dayofweek,probability)
    return result
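The time handling in `simple_case` can be checked in isolation. The sketch below uses only the standard library instead of pandas; the helper names `seconds_of_day` and `to_day_seconds` are hypothetical, and the date format string is an assumption based on the day-first strings used in the examples.

```python
from datetime import datetime, timedelta

def seconds_of_day(moment):
    """Number of seconds elapsed since midnight."""
    return moment.hour * 3600 + moment.minute * 60 + moment.second

def to_day_seconds(date_string, time_interval):
    """Return (start_second, end_second, dayofweek) for a day-first date string."""
    start = datetime.strptime(date_string, "%d.%m.%Y %H:%M:%S")
    end = start + timedelta(seconds=time_interval)
    # weekday() uses Monday = 0, the same convention as pandas' dayofweek
    return seconds_of_day(start), seconds_of_day(end), start.weekday()

start_second, end_second, dayofweek = to_day_seconds("28.05.2018 15:45:00", 600)
print(start_second, end_second, dayofweek)  # 56700 57300 0
```

Note that, as in `simple_case`, an interval that crosses midnight would produce an `end_second` smaller than `start_second`; none of the examples in this notebook hit that case.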

### Example of simple case

Here are examples listing all the transportation we can take from Zürich HB (8503000).

1. 
   - Datetime : 28.05.2018 15:45:00 
   - Allowed time : 10 minutes (= 600 seconds)
   - Probability : 90%
   
2. 
    - Datetime : 06.06.2018 11:00:00 
    - Allowed time : 7 minutes (= 420 seconds)
    - Probability : 80%
    
3. 
    - Datetime : 06.06.2018 11:00:00 
    - Allowed time : 7 minutes (= 420 seconds)
    - Probability : 50%   
    
4. 
    - Datetime : 16.06.2018 15:45:00 
    - Allowed time : 10 minutes (= 600 seconds)
    - Probability : 90%   

In [10]:
# First example
simple_case('28.05.2018 15:45:00',600,8503000,0.9)

Search for the departure from station  8503000


Unnamed: 0,train_number,station_id,line_id,arrivaltimeoffsetschedule,departuretimeoffsetschedule,avg(arrivaltimeoffset),avg(departuretimeoffset),var(arrivaltimeoffset),var(departuretimeoffset),arrival_theta,departure_theta,arrival_k,departure_k,probability
0,85:11:18259:001,8503011,Zug:18259:S2,56940,56940,57011.0,57045.0,330.0,368.0,0.005788,0.006451,9849255.0,8842750.0,1.0
1,85:11:18259:001,8503010,Zug:18259:S2,57120,57180,57120.0,57218.0,394.0,99.0,0.006898,0.00173,8280950.0,33069690.0,1.0
2,85:11:18259:002,8503011,Zug:18259:S2,56940,56940,57006.0,57046.0,245.0,324.0,0.004298,0.00568,13264020.0,10043970.0,1.0
3,85:11:18259:002,8503010,Zug:18259:S2,57120,57180,57125.0,57226.0,403.0,135.0,0.007055,0.002359,8097433.0,24257890.0,1.0
4,85:11:19259:001,8503003,Zug:19259:S12,56940,57000,56984.0,57055.0,2207.0,2035.0,0.03873,0.035667,1471308.0,1599643.0,1.0
5,85:11:19659:001,8503003,Zug:19659:S16,56820,56880,56880.0,56963.0,3314.0,4931.0,0.058263,0.086565,976262.6,658037.6,1.0
6,85:11:19659:001,8503004,Zug:19659:S16,57000,57060,57098.0,57156.0,5216.0,4662.0,0.091352,0.081566,625034.8,700731.1,0.997398
7,85:11:19959:001,8503006,Zug:19959:S19,57240,57300,57255.0,57350.0,441.0,159.0,0.007702,0.002772,7433413.0,20685680.0,0.98392


In [11]:
# second example
simple_case('06.06.2018 11:00:00',420,8503000,0.8)

Search for the departure from station  8503000


Unnamed: 0,train_number,station_id,line_id,arrivaltimeoffsetschedule,departuretimeoffsetschedule,avg(arrivaltimeoffset),avg(departuretimeoffset),var(arrivaltimeoffset),var(departuretimeoffset),arrival_theta,departure_theta,arrival_k,departure_k,probability
0,85:11:18341:001,8503003,Zug:18341:S3,39900,39960,39949.0,40027.0,3343.0,3690.0,0.083682,0.092188,477392.3,434189.9,0.890226
1,85:11:18640:001,8503020,Zug:18640:S6,39780,39780,39824.0,39876.0,800.0,886.0,0.020088,0.022219,1982439.0,1794690.0,1.0


In [12]:
# Third example
simple_case('06.06.2018 11:00:00',420,8503000,0.5)

Search for the departure from station  8503000


Unnamed: 0,train_number,station_id,line_id,arrivaltimeoffsetschedule,departuretimeoffsetschedule,avg(arrivaltimeoffset),avg(departuretimeoffset),var(arrivaltimeoffset),var(departuretimeoffset),arrival_theta,departure_theta,arrival_k,departure_k,probability
0,85:11:18341:001,8503003,Zug:18341:S3,39900,39960,39949.0,40027.0,3343.0,3690.0,0.083682,0.092188,477392.3,434189.9,0.890226
1,85:11:18639:001,8503003,Zug:18639:S6,39720,39780,39787.0,39857.0,26387.0,26628.0,0.663207,0.668088,59991.87,59658.27,0.924062
2,85:11:18639:001,8503004,Zug:18639:S6,39900,39960,39963.0,40015.0,435.0,453.0,0.010885,0.011321,3671359.0,3534658.0,0.996851
3,85:11:18640:001,8503020,Zug:18640:S6,39780,39780,39824.0,39876.0,800.0,886.0,0.020088,0.022219,1982439.0,1794690.0,1.0


In [13]:
# Fourth example
simple_case('16.06.2018 15:45:00',600,8503000,0.9)

Search for the departure from station  8503000


Unnamed: 0,train_number,station_id,line_id,arrivaltimeoffsetschedule,departuretimeoffsetschedule,avg(arrivaltimeoffset),avg(departuretimeoffset),var(arrivaltimeoffset),var(departuretimeoffset),arrival_theta,departure_theta,arrival_k,departure_k,probability
0,85:11:18259:001,8503011,Zug:18259:S2,56940,56940,57016.0,57052.0,405.0,622.0,0.007103,0.010902,8026727.0,5233008.0,1.0
1,85:11:18259:001,8503010,Zug:18259:S2,57120,57180,57129.0,57219.0,613.0,193.0,0.01073,0.003373,5324180.0,16963800.0,1.0
2,85:11:18259:002,8503011,Zug:18259:S2,56940,56940,57008.0,57043.0,228.0,335.0,0.003999,0.005873,14254000.0,9713146.0,1.0
3,85:11:18259:002,8503010,Zug:18259:S2,57120,57180,57119.0,57216.0,338.0,19.0,0.005917,0.000332,9652604.0,172298500.0,1.0
4,85:11:18758:001,8503020,Zug:18758:S7,57060,57060,57089.0,57141.0,373.0,510.0,0.006534,0.008925,8737678.0,6402145.0,1.0
5,85:11:19259:001,8503003,Zug:19259:S12,56940,57000,56971.0,57046.0,207.0,337.0,0.003633,0.005908,15679690.0,9656517.0,1.0
6,85:11:19658:001,8503020,Zug:19658:S16,56880,56880,56914.0,56972.0,592.0,733.0,0.010402,0.012866,5471627.0,4428116.0,1.0
7,85:11:19658:001,8503006,Zug:19658:S16,57180,57240,57187.0,57277.0,698.0,76.0,0.012206,0.001327,4685319.0,43166510.0,0.99999
8,85:11:19659:001,8503003,Zug:19659:S16,56820,56880,56856.0,56925.0,167.0,171.0,0.002937,0.003004,19356910.0,18950030.0,1.0
9,85:11:19659:001,8503004,Zug:19659:S16,57000,57060,57059.0,57118.0,219.0,291.0,0.003838,0.005095,14866340.0,11211220.0,1.0


As the examples show, lowering the probability threshold gives us more journeys to choose from (compare example 2 and example 3).

The date also affects the result, since the timetable and the delay statistics differ between days of the week; for instance, some journeys may not operate at weekends (compare example 1 and example 4).

## CASE 2. Complicated case

Now we implement the complicated case, which takes transfers into account. We consider not only direct transfers (changing to a different line at the same station) but also transfers that involve walking to another station.

Here, we assume a walking speed of 1.75 m/s (the code divides the distance in metres by `walk_speed` to obtain seconds) and only transfer to stations within a 100 m radius.

To realize this case, we implemented more functions, shown below. The `full_transfer` function searches for all the journeys, including possible transfers, that meet our time and probability criteria.

In [14]:
'''
This function returns the list of stations within `radius` (in km) of the given station,
together with the walking time (in seconds) needed to reach each of them.
'''
def get_near_station(station_no,walk_speed = 1.75,radius = 0.1):
    # Get this station data position
    station = station_data[station_data.station_number == str(station_no)].reset_index()
    station_longtitude = float(station.at[0,"longtitude"])
    station_latitude = float(station.at[0,"latitude"])
    # Apply the function to add the column distance_temp to store distance relative to this station
    station_data['distance_temp'] = station_data[['longtitude','latitude']].apply(compute_distance, axis=1,args=(station_longtitude,station_latitude))
    # Filter the station in the radius
    station_in_radius = station_data[station_data.distance_temp < radius]
    # Get the list of station number and the distance (divide by walk_speed we will have the time)
    station_list = station_in_radius.station_number.values.tolist()
    time_list = np.around(station_in_radius.distance_temp.values*1000/walk_speed).tolist()
    # Change all the str to int for convenience
    station_list = [int(i) for i in station_list]
    # Delete the station itself or it will cause infinite loop in the function after
    for i,now_station in enumerate(station_list):
        if now_station == station_no:
            del station_list[i]
            del time_list[i]
            break
    return station_list,time_list
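The distance and walking-time arithmetic behind `get_near_station` can be tried on its own. This is a minimal sketch with hypothetical helper names (`haversine_km`, `walking_seconds`); it assumes `walk_speed` is in metres per second, which is what dividing metres by it to obtain seconds implies. The second coordinate pair is one of the station positions printed later in this notebook.

```python
from math import sin, cos, sqrt, atan2, radians

EARTH_RADIUS_KM = 6373.0  # same approximate radius as in compute_distance

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (latitude, longitude) points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return EARTH_RADIUS_KM * 2 * atan2(sqrt(a), sqrt(1 - a))

def walking_seconds(distance_km, walk_speed=1.75):
    """Walking time in seconds, with walk_speed in metres per second."""
    return round(distance_km * 1000 / walk_speed)

# Zürich HB to another station position (roughly 1.9 km apart)
distance = haversine_km(47.378177, 8.540192, 47.385195, 8.517106)
print(distance)

# The default radius of 0.1 km corresponds to just under a minute of walking
print(walking_seconds(0.1))  # 57
```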

In [15]:
'''
This function returns all the stations we can transfer to, with their arrival-time distributions.
It mainly loops over the stations, applying the function above to each of them.
'''
def find_all_station(station_list,time_mean,time_variance):
    all_station_list = station_list
    all_time_mean = time_mean
    all_time_variance = time_variance
    for index,station_no in enumerate(station_list):
        # Get the only this station data
        near_station,time_list = get_near_station(station_no)
        # For all station, walking do not change the variance that we arrive but the mean
        time_mean_temp = np.array(time_list) + time_mean[index]
        time_mean_temp = list(time_mean_temp)
        # Create a variance list with the same length with the mean list
        time_variance_temp = len(time_list)*[time_variance[index]]
        # Concat all the lists
        all_station_list = all_station_list + near_station
        all_time_mean = all_time_mean + time_mean_temp 
        all_time_variance = all_time_variance + time_variance_temp
    return all_station_list,all_time_mean,all_time_variance

In [16]:
'''
This function computes the case of transferring between means of transportation.
'''
def transfer_case(station_list,time_mean,time_variance,end_second,probability,dayofweek,\
                                            pass_station_no_list,pass_station_mean_list,pass_station_var_list):
    result = []
    result_fifo = []
    depth = -1
    while True:
        # Find all the station in the radius
        all_station_list,all_time_mean,all_time_variance = find_all_station(station_list,time_mean,time_variance)
        # Do a loop for the station list to compute the simple case
        for station_no,station_time_mean,station_time_variance in zip(all_station_list,all_time_mean,all_time_variance):
            i = -1
            # If already passed that station, then compute the probability for the first pass
            if station_no in pass_station_no_list:
                for i,no_temp in enumerate(pass_station_no_list):
                    if no_temp == station_no:
                        last_mean = pass_station_mean_list[i] 
                        last_var = pass_station_var_list[i]
                        break
                p = compute_probability_sample((station_time_mean**2/station_time_variance,station_time_variance\
                                                /station_time_mean),(last_mean**2/last_var,last_var/last_mean))
                print ('The probability of first pass this station is ',p)
                # If greater than 85 percent it is a duplicate pass, then remove it
                if p > 0.85:
                    continue

            temp = one_line_find(station_time_mean,station_time_variance,end_second,station_no,dayofweek,probability)
            # Update the passed station information
            if i==-1:
                pass_station_no_list.append(station_no) 
                pass_station_mean_list.append(station_time_mean) 
                pass_station_var_list.append(station_time_variance)
            else:
                pass_station_no_list[i] = station_no 
                pass_station_mean_list[i] = station_time_mean
                pass_station_var_list[i] = station_time_variance
            result.append((temp,depth+1))
            result_fifo.append((temp,depth+1))
            #print ("depth of transfer: ",depth)
        # Filter all -1
        result = list(filter(lambda a: type(a[0]) is not int, result))
        result_fifo = list(filter(lambda a: type(a[0]) is not int, result_fifo))
        print ("****** First IN First OUT ******* ",len(result_fifo))
        if len(result_fifo) == 0:
            break
        # Get the next result in the breadth-first-search fifo
        temp,depth = result_fifo[0]
        del result_fifo[0]
        if type(temp) is not int:
            station_list = list(temp.station_id.values)
            time_mean = list(temp["avg(arrivaltimeoffset)"].values)
            time_variance = list(temp["var(arrivaltimeoffset)"].values)
    # Filter all -1
    result = list(filter(lambda a: type(a[0]) is not int, result))
    if result == []:
        return -1
    return pd.concat([i[0] for i in result],axis=0)    

In [17]:
'''
This function computes the complicated case: starting from a single station at a fixed time,
it returns all the stations that can be reached within the time interval, transfers included.
'''
def full_transfer(datetime,time_interval,station_no,probability):
    # Convert the date string to a pandas datetime
    time = pd.to_datetime(datetime,dayfirst=True)
    end_time = time + timedelta(seconds = time_interval)
    dayofweek = time.dayofweek
    # Get the time offset with integer
    start_second = time.timestamp()%(24*3600)
    end_second = end_time.timestamp()%(24*3600)
    pass_station_no_list = []
    pass_station_mean_list = []
    pass_station_var_list = []
    # Begin with a single station at a fixed time. The variance is set to 0.01 to avoid a zero variance
    result = transfer_case([station_no],[start_second],[0.01],end_second,probability,dayofweek,\
                           pass_station_no_list,pass_station_mean_list,pass_station_var_list)
    if type(result) is int:
        return result
    return result.drop_duplicates().reset_index().drop(columns="index")

### Example of complicated case

Here are examples listing all the transportation we can take from Zürich HB (8503000).

1. 
   - Datetime : 28.05.2018 15:45:00 
   - Allowed time : 10 minutes (= 600 seconds)
   - Probability : 90%
   
2. 
    - Datetime : 06.06.2018 11:00:00 
    - Allowed time : 7 minutes (= 420 seconds)
    - Probability : 80% 

3. 
    - Datetime : 15.06.2018 17:00:00 
    - Allowed time : 15 minutes (= 900 seconds)
    - Probability : 70%  

In [18]:
result1 = full_transfer('28.05.2018 15:45:00',600,8503000,0.9)

Search for the departure from station  8503000
****** First IN First OUT *******  1
Search for the departure from station  8503011
The probability of first pass this station is  0.417
Search for the departure from station  8503011
The probability of first pass this station is  0.562
The probability of first pass this station is  0.09
Search for the departure from station  8503003
Search for the departure from station  8573710
Search for the departure from station  8591058
The probability of first pass this station is  0.435
Search for the departure from station  8573710
The probability of first pass this station is  0.555
Search for the departure from station  8591058
The probability of first pass this station is  0.572
Search for the departure from station  8503059
The probability of first pass this station is  0.069
Search for the departure from station  8503059
Search for the departure from station  8576182
****** First IN First OUT *******  5
The probability of first pass this stat

In [19]:
result2 = full_transfer('06.06.2018 11:00:00',420,8503000,0.8)

Search for the departure from station  8503000
****** First IN First OUT *******  1
Search for the departure from station  8591060
****** First IN First OUT *******  0


In [20]:
result3 = full_transfer('15.06.2018 17:00:00',900,8503000,0.7)

Search for the departure from station  8503000
****** First IN First OUT *******  1
Search for the departure from station  8503003
Search for the departure from station  8503020
The probability of first pass this station is  0.021
Search for the departure from station  8503003
Search for the departure from station  8503004
Search for the departure from station  8503100
Search for the departure from station  8503141
Search for the departure from station  8503101
The probability of first pass this station is  0.0
Search for the departure from station  8503020
Search for the departure from station  8503006
Search for the departure from station  8503007
Search for the departure from station  8503010
The probability of first pass this station is  0.998
The probability of first pass this station is  1.0
The probability of first pass this station is  1.0
The probability of first pass this station is  1.0
The probability of first pass this station is  0.497
Search for the departure from statio

Search for the departure from station  8503141
The probability of first pass this station is  0.464
Search for the departure from station  8503101
The probability of first pass this station is  0.499
****** First IN First OUT *******  11
The probability of first pass this station is  0.516
Search for the departure from station  8503101
The probability of first pass this station is  0.515
****** First IN First OUT *******  10
The probability of first pass this station is  0.548
The probability of first pass this station is  0.552
Search for the departure from station  8576202
The probability of first pass this station is  0.548
The probability of first pass this station is  0.638
The probability of first pass this station is  0.655
The probability of first pass this station is  0.679
The probability of first pass this station is  0.429
The probability of first pass this station is  0.429
Search for the departure from station  8576202
The probability of first pass this station is  0.41
S

As the results above show, more stations can be reached once transfers are considered.

The next step is to visualize the stations that we can reach from Zürich HB when transfers are taken into account.

## Visualization

For the visualization we use the Google Maps API through the `gmaps` library, displaying a marker on the map for each station that can be reached within the selected time interval and with the chosen probability. We use a result computed beforehand, since computing the full transfer for a long time interval takes a long time.

The result we show is:

`result = full_transfer('08.01.2018 11:55:00',1200,8503000,0.9)`

This journey departs on 8 January 2018 at 11:55, and we want to know how far we can get within 20 minutes. We load the result that we computed before.

In [21]:
result = pd.read_csv('./data/1155MondayResult.csv')

In [22]:
# Get all the stations we are able to reach from the result
stations = []
for station in result['station_id']:
    infos = {}
    infos['name'] = station_data[station_data.station_number==str(station)]['name'].item()
    infos['station_number'] = station_data[station_data.station_number==str(station)]['station_number'].item()
    infos['position'] =(float(station_data[station_data.station_number==str(station)]['latitude'].item()), (float(station_data[station_data.station_number==str(station)]['longtitude'].item())))
    infos['probability'] = round(result[result.station_id==(station)]['probability'].min(),2)
    stations.append(infos)
    
# Get the location of the stations
station_locations = [station['position'] for station in stations]
station_locations

[(47.385195, 8.517106),
 (47.391481, 8.48894),
 (47.399175, 8.447228),
 (47.398875, 8.420423),
 (47.366611, 8.548466),
 (47.397213, 8.596132),
 (47.420195, 8.619255),
 (47.366611, 8.548466),
 (47.366611, 8.548466),
 (47.385195, 8.517106),
 (47.391481, 8.48894),
 (47.366611, 8.548466),
 (47.350124, 8.561372),
 (47.337332, 8.569717),
 (47.326854, 8.575951),
 (47.385195, 8.517106),
 (47.411529, 8.544115),
 (47.411529, 8.544115),
 (47.412717, 8.591911),
 (47.420195, 8.619255),
 (47.411529, 8.544115),
 (47.412717, 8.591911),
 (47.420195, 8.619255),
 (47.371472, 8.523462),
 (47.364099, 8.530805),
 (47.371472, 8.523462),
 (47.364099, 8.530805),
 (47.366611, 8.548466),
 (47.397213, 8.596132),
 (47.400076, 8.623407),
 (47.384381, 8.658659),
 (47.385195, 8.517106),
 (47.366611, 8.548466),
 (47.411529, 8.544115),
 (47.450383, 8.562386),
 (47.411529, 8.544115),
 (47.450383, 8.562386),
 (47.411529, 8.544115),
 (47.411529, 8.544115),
 (47.378177, 8.540192),
 (47.378177, 8.540192),
 (47.385195, 8.517

#### WARNING: The gmaps library has a display bug. If nothing is shown when you run the cells below, restart the computer and try again.
#### WARNING: Replace the API key below with your own Google API key. A credit card is required to obtain one.

In [23]:
import gmaps
key = 'AI...'
if key == 'AI...':
    raise Exception('You need to put your Gmap key here!')
gmaps.configure(api_key=key)

In [24]:
# Define the format of the info box
info_box_template = """
<dl>
 <center><dt>Name</dt><dd>{name}</dd> 
<dt>Station number</dt><dd>{station_number}</dd>
<dt>Probability</dt><dd>{probability}</dd></center>
</dl>
"""
stat_info = [info_box_template.format(**station) for station in stations]

#### Visualization 1.
Here we display each reachable station as a marker with an info box showing the name of the station, the station number and the probability of reaching it (if the station can be reached by more than one means of transportation, the lowest probability is displayed).
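The "lowest probability per station" rule can be illustrated without pandas. In this sketch the function name `lowest_probability_per_station` is hypothetical and the probabilities are made up purely for illustration; only the station numbers appear in the results above.

```python
def lowest_probability_per_station(rows):
    """Collapse repeated (station_id, probability) rows to the minimum probability per station."""
    lowest = {}
    for station_id, probability in rows:
        # A probability is at most 1.0, so 1.0 is a safe starting value
        lowest[station_id] = min(probability, lowest.get(station_id, 1.0))
    return lowest

# Made-up rows: the same station reached by two different trains
rows = [(8503003, 0.95), (8503003, 0.91), (8503020, 1.0)]
print(lowest_probability_per_station(rows))  # {8503003: 0.91, 8503020: 1.0}
```

Reporting the minimum is the conservative choice: the displayed value is a lower bound that holds whichever of the listed journeys is taken.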

In [25]:
marker_layer = gmaps.marker_layer(station_locations, info_box_content=stat_info)
fig = gmaps.figure()
fig.add_layer(marker_layer)
fig

This is a captured image of the output of the cell above.
<img src="./images/map1.png">
</img>

#### Visualization 2.

We can also display a heatmap showing the density of the stations.

In [26]:
fig = gmaps.figure()

heatmap = gmaps.heatmap_layer(station_locations, max_intensity=2)
marker_layer = gmaps.marker_layer(station_locations, info_box_content=stat_info)
fig.add_layer(marker_layer)
heatmap.point_radius = 50
fig.add_layer(heatmap)
fig

This is a captured image of the output of the cell above.
<img src="./images/map2.png">
</img>

#### Visualization 3. Convex Hull
This time, we visualize the result by drawing the convex hull of the reachable stations, whose vertices are the outermost stations around Zürich HB.

In [27]:
from scipy.spatial import ConvexHull
hull = ConvexHull(station_locations)

In [28]:
# Add the hull vertices
loc = []
for index in hull.vertices:
    loc.append(station_locations[index])
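To see what the hull computation does without depending on scipy, here is a small pure-Python sketch based on Andrew's monotone chain algorithm. It is a stand-in for illustration, not the Qhull-based routine that `scipy.spatial.ConvexHull` uses, and the function names and sample points are made up.

```python
def cross(o, a, b):
    """Z-component of the cross product of vectors OA and OB (positive for a left turn)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Return the hull vertices in counter-clockwise order, without duplicates."""
    pts = sorted(set(points))  # duplicates are common when stations repeat in the result
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each half is the first point of the other half
    return lower[:-1] + upper[:-1]

# A square with one interior point and one repeated corner
sample = [(0, 0), (2, 0), (2, 2), (0, 2), (1, 1), (0, 0)]
print(convex_hull(sample))  # [(0, 0), (2, 0), (2, 2), (0, 2)]
```

Interior points such as `(1, 1)` are dropped, which is why only the outermost stations end up as vertices of the polygon drawn on the map.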

In [29]:
fig = gmaps.figure(center=(47.378177, 8.540192), zoom_level=12)
myMap = gmaps.Polygon(
    loc,
    stroke_color='blue',
    fill_color='blue'
)
drawing = gmaps.drawing_layer(
    features=[myMap],
    show_controls=True
)
fig.add_layer(drawing)

fig

This is a captured image of the output of the cell above.
<img src="./images/map3.png">
</img>

#### Visualization 4. 
Same as before, but including the markers.

In [30]:
stat_info = [info_box_template.format(**station) for station in stations if (station['position'] in loc)]

In [31]:
fig = gmaps.figure(center=(47.378177, 8.540192), zoom_level=12)
myMap = gmaps.Polygon(
    loc,
    stroke_color='blue',
    fill_color='blue'
)
drawing = gmaps.drawing_layer(
    features=[myMap],
    show_controls=True
)
fig.add_layer(drawing)

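# Look up the info box of each hull vertex so that boxes and markers stay in the same order
info_by_position = {station['position']: info_box_template.format(**station) for station in stations}
marker_layer = gmaps.marker_layer(loc, info_box_content=[info_by_position[position] for position in loc])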
fig.add_layer(marker_layer)

fig

This is a captured image of the output of the cell above.
<img src="./images/map4.png">
</img>

## Verification

To make sure that our result is reasonable, we not only **compared the results manually** (when the result was small enough to be checked by hand) but also **checked the isochrone map from another site**: [https://app.traveltimeplatform.com]

We first use a small dataset and check that what we computed is correct. Once we are confident that our algorithm works correctly, we use the whole dataset and visualize the results.

In this part, we plot a heatmap and the convex hull map to compare our result with the other site's result for 15.06.2018 at 17:00.

In [32]:
# Choose the current data result to compare. (15.06.2018 17:00, 900 seconds, 70%)
result = result3

# Get all the stations we are able to reach from the result
stations = []
for station in result['station_id']:
    infos = {}
    infos['name'] = station_data[station_data.station_number==str(station)]['name'].item()
    infos['station_number'] = station_data[station_data.station_number==str(station)]['station_number'].item()
    infos['position'] =(float(station_data[station_data.station_number==str(station)]['latitude'].item()), (float(station_data[station_data.station_number==str(station)]['longtitude'].item())))
    infos['probability'] = round(result[result.station_id==(station)]['probability'].min(),2)
    stations.append(infos)
    
# Get the location of the stations
station_locations = [station['position'] for station in stations]

In [33]:
# Define the format of the info box
info_box_template = """
<dl>
 <center><dt>Name</dt><dd>{name}</dd> 
<dt>Station number</dt><dd>{station_number}</dd>
<dt>Probability</dt><dd>{probability}</dd></center>
</dl>
"""
stat_info = [info_box_template.format(**station) for station in stations]

In [34]:
fig = gmaps.figure()

heatmap = gmaps.heatmap_layer(station_locations, max_intensity=2)
marker_layer = gmaps.marker_layer(station_locations, info_box_content=stat_info)
fig.add_layer(marker_layer)
heatmap.point_radius = 50
fig.add_layer(heatmap)
fig

This is a captured image of the output of the cell above.

<img src="./images/map_validation1.png">
</img>

In [35]:
hull = ConvexHull(station_locations)

# Add the hull vertices
loc = []
for index in hull.vertices:
    loc.append(station_locations[index])
    
stat_info = [info_box_template.format(**station) for station in stations if (station['position'] in loc)]

In [36]:
fig = gmaps.figure(center=(47.378177, 8.540192), zoom_level=12)
myMap = gmaps.Polygon(
    loc,
    stroke_color='blue',
    fill_color='blue'
)
drawing = gmaps.drawing_layer(
    features=[myMap],
    show_controls=True
)
fig.add_layer(drawing)

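# Look up the info box of each hull vertex so that boxes and markers stay in the same order
info_by_position = {station['position']: info_box_template.format(**station) for station in stations}
marker_layer = gmaps.marker_layer(loc, info_box_content=[info_by_position[position] for position in loc])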
fig.add_layer(marker_layer)

fig

This is a captured image of the output of the cell above.

<img src="./images/map_validation2.png">
</img>

Below we can see the deterministic isochrone map obtained from the following website, to be compared with the non-deterministic one computed by us (convex hull and heatmap).
[https://app.traveltimeplatform.com/#/search/0_lng=8.53945&0_tt=15&0_time=d1529074809491&0_title=Z%C3%BCrich%20HB%20SZU%2C%20Zurich%2C%20Switzerland&0_lat=47.37737]
<img src="./images/map_validation3.png">
</img>


Both maps were computed for 15/06/2018 at 17:00, and we used a confidence level of 70%. Overall, the results are promising. Because of the confidence level, we miss some of the stations on the west side of the centre of Zurich. This may be because travelling within the city centre implies taking buses, which are statistically more prone to delays, so those stations get excluded from our reachable set. At the same time, the furthest stations are reached in both maps, probably because they are connected by trains, which statistically have fewer delays.

The deterministic isochrone map is a sparse representation of the reachable stations: it only highlights a small area around each station reached. It is therefore more similar to our heatmap than to our convex hull.

In this example, our algorithm finds a whole set of reachable stations along the east coast of the lake, while the deterministic map only finds two. The two stops found by the deterministic isochrone map are quite far from the south-east side of the lake. The intermediate stations that we find may be absent there because the algorithm running behind the website (to which we do not have access) only computes the furthest reachable station, without the intermediate stops.

## Build an adjacency matrix of direct transfers

Before moving on to the next part, which finds the routing plan, we build a matrix that records whether there is a direct transfer between two stations. We also count walking to a nearby station as a direct transfer, so that we have complete information about transfers between stations. Stations connected by the same train are marked with 10.0, and stations connected by walking are marked with 1.0. Pairs without a direct connection keep a large value (1e9) that stands for infinity.
To build this matrix, we use the (fixed) arrival and departure schedule, and we save the matrix with pickle.
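
The encoding described above can be sketched on toy data in pure Python. This is only an illustration under stated assumptions: `build_adjacency`, the toy schedule rows and the walking pairs are hypothetical, and the sketch groups stations by trip instead of testing every pair of stations separately as the notebook does.

```python
from itertools import combinations

INF = 1e9         # no direct connection
WALK = 1.0        # reachable on foot
TRANSPORT = 10.0  # served by the same (train_number, line_id)

def build_adjacency(stations, schedule_rows, walking_pairs):
    """Build a symmetric matrix of direct transfers as nested lists."""
    index = {station: i for i, station in enumerate(stations)}
    n = len(stations)
    matrix = [[INF] * n for _ in range(n)]

    # Collect the set of stations served by each trip
    stops_by_trip = {}
    for train_number, line_id, station_id in schedule_rows:
        if station_id in index:
            stops_by_trip.setdefault((train_number, line_id), set()).add(station_id)

    # Every pair of stations on the same trip is directly connected
    for stops in stops_by_trip.values():
        for a, b in combinations(sorted(stops), 2):
            matrix[index[a]][index[b]] = TRANSPORT
            matrix[index[b]][index[a]] = TRANSPORT

    # Walking takes precedence, as in the check order used in this notebook
    for a, b in walking_pairs:
        matrix[index[a]][index[b]] = WALK
        matrix[index[b]][index[a]] = WALK
    return matrix

# Toy data, made up for illustration
toy_stations = [1, 2, 3, 4]
toy_schedule = [(10, "S1", 1), (10, "S1", 2), (11, "B2", 3)]
toy_walking = [(2, 3)]
toy_matrix = build_adjacency(toy_stations, toy_schedule, toy_walking)
for row in toy_matrix:
    print(row)
```

Grouping by trip touches each schedule row once, which may be noticeably faster than filtering the schedule for every pair of stations, although this has not been measured on the real data.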

In [37]:
# Get all the schedule data
data_schedule_all = pd.concat(data_schedule_list,axis=0)

# Make a list for stations
stations = [int(i) for i in station_data.station_number.values]

# Init a connection matrix
inf = 1e9
adj_map = np.zeros((len(stations),len(stations)))+inf

In [38]:
def find_connection(station1, station2):
    '''
    Return whether there is a direct connection between two stations.
    Walking to another station counts as a direct transfer, but with a different weight.
    '''
    # Get the stations that can be reached on foot
    walk_station = get_near_station(station1)
    # Use 1.0 to denote a connection on foot
    if station2 in walk_station[0]:
        return 1.0
    # Find whether a public transport trip serves both stations
    trains_1 = data_schedule_all[data_schedule_all.station_id == station1][["train_number","line_id"]]
    trains_2 = data_schedule_all[data_schedule_all.station_id == station2][["train_number","line_id"]]
    inter = pd.merge(trains_1,trains_2,how = "inner", on = ["train_number","line_id"])
    # If a public transport connection exists, use 10.0 to denote it
    if inter.shape[0] != 0:
        return 10.0
    # Otherwise there is no direct connection
    return inf

In [39]:
# Compute the adjacency matrix
# This block takes about 3 hours. We have already saved the result, so you can load it directly instead.
k = 0
for i in range(len(stations)):
    for j in range(i+1,len(stations)):
        k = k + 1
        if k%100 == 0:
            #print ("processing, " , k)
            pass
        connect = find_connection(stations[i],stations[j])
        adj_map[i,j]=connect
        adj_map[j,i]=connect

In [40]:
import pickle

# Save the matrix to pickle
pickle.dump(adj_map,open("connection.p","wb"))