In [None]:
# Make sure to run this cell to import the necessary packages.
from routing import TaxiRouting, create_dataframes, plot_returns
import numpy as np
import pandas as pd
import random
from bokeh.io import output_notebook
output_notebook()

# Min-Cost Flow Lab

**Objectives:**

* Understand the min-cost flow problem.
* Describe feasibility conditions for the min-cost flow problem.
* Use the min-cost flow problem to solve the shortest path problem.
* Apply min-cost flow to the taxi-routing problem.
* Analyze solutions to the taxi-routing problem.
    
<font color='red'> **Instructor Comments** </font>

<font color='blue'> **Solutions** </font>

**Review:** Recall the min-cost flow problem. For our input, we have a directed graph $G = (N,A)$. For each arc $(i,j) \in A$, we have a per-unit cost $c(i,j)$ and capacity $u(i,j)$. For each node $i \in N$, we have a demand / supply $b(i)$. If $b(i) > 0$, we have supply at node $i$. If $b(i) < 0$, we have demand at node $i$. Our goal is to distribute the supply to the demand nodes at minimal cost. A solution to the min-cost flow problem can be described by giving a flow $f(i,j)$ for every arc $(i,j) \in A$. The objective value of a solution is given by $\sum_{(i,j) \in A} c(i,j) \cdot f(i,j)$.
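
To make these definitions concrete, here is a minimal sketch in plain Python that evaluates a candidate flow on a small, made-up instance. The node names, the arc data, and the helpers `flow_cost` and `is_feasible` are illustrative assumptions and are not part of the lab's `routing` module. The sketch also assumes total supply equals total demand, so conservation is checked with equality.

```python
# Hypothetical toy instance: arcs map (i, j) -> (cost c(i,j), capacity u(i,j)).
arcs = {
    ("a", "b"): (2, 4),
    ("a", "c"): (1, 2),
    ("c", "b"): (0, 2),
}
# b(i) > 0 is supply, b(i) < 0 is demand.
b = {"a": 3, "b": -3, "c": 0}

def flow_cost(arcs, flow):
    """Objective value: sum of c(i,j) * f(i,j) over all arcs."""
    return sum(arcs[arc][0] * flow.get(arc, 0) for arc in arcs)

def is_feasible(arcs, b, flow):
    """Check capacity bounds and conservation (outflow - inflow = b(i))."""
    for arc, (_, cap) in arcs.items():
        if not 0 <= flow.get(arc, 0) <= cap:
            return False
    for node, supply in b.items():
        out_flow = sum(f for (i, _), f in flow.items() if i == node)
        in_flow = sum(f for (_, j), f in flow.items() if j == node)
        if out_flow - in_flow != supply:
            return False
    return True

flow = {("a", "b"): 1, ("a", "c"): 2, ("c", "b"): 2}
print(is_feasible(arcs, b, flow), flow_cost(arcs, flow))  # True 4
```

Sending all three units along the direct arc `("a", "b")` is also feasible here, but costs 6 instead of 4.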

## Part 1: Feasible Flow Conditions

**Q:** What must be true about the relationship between supply and demand for there to be a feasible solution to a min-cost flow input?

**A:** <font color='blue'> The total supply must be greater than or equal to the total demand.</font>

Consider the graph below. The arc labels indicate the capacity on that arc and the node labels indicate the supply or demand at each node (if no label is present, there is no supply or demand at that node).

<img src="images-lab/min-cost_flow_set.png" alt="min-cost_flow_set" style="width: 400px;"/>

**Q:** Is the total supply greater than or equal to the total demand for this min-cost flow input?

**A:** <font color='blue'> Yes, the total supply and demand are both 8.</font>

**Q:** What is the net demand for the nodes shaded blue? And the non-shaded nodes?

**A:** <font color='blue'> The net demand for blue shaded nodes is -7. The net demand for non-shaded nodes is 7.  </font>

**Q:** Consider the cut on this directed graph given by the nodes shaded blue: $S = \{1,2,3,4\}$ and $T = \{5,6\}$. What is the capacity of this cut? (Hint: the capacity of a cut is the sum of arc capacities over arcs $(i,j)$ where $i \in S$ and $j \in T$.)

**A:** <font color='blue'> The capacity of this cut is $u(3,5) + u(2,5) + u(4,5) = 6$.</font>

**Q:** Is there a feasible flow for this input? Why or why not?

**A:** <font color='blue'> No, the net demand for the set $T = \{5,6\}$ is 7. However, the total capacity of edges into $T$ is only 6.</font>

**Q:** Consider a cut $S \subset N$ on the directed graph $G = (N,A)$. What must be true about the capacity of the cut $S$ for there to exist a feasible flow on $G$? 

**A:** <font color='blue'>The capacity of the cut $S$ must be greater than or equal to the net demand of the set $T = N \setminus S$.</font>
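
The cut condition can be checked mechanically. The following is a small sketch on a made-up instance (it is not the graph in the figure); the helper names `cut_capacity` and `net_demand` are assumptions introduced for illustration.

```python
def cut_capacity(capacity, S):
    """Sum of u(i,j) over arcs that leave S (i in S, j not in S)."""
    return sum(u for (i, j), u in capacity.items() if i in S and j not in S)

def net_demand(b, nodes):
    """Net demand of a node set: minus the sum of b(i), since b(i) < 0 is demand."""
    return -sum(b[i] for i in nodes)

# Hypothetical instance: capacities u(i,j) and supplies/demands b(i).
capacity = {(1, 2): 5, (1, 3): 4, (2, 4): 2, (3, 4): 3}
b = {1: 6, 2: 0, 3: 0, 4: -6}

S = {1, 2, 3}
T = set(b) - S

# A feasible flow needs cut_capacity(S) >= net_demand(T) for every cut.
print(cut_capacity(capacity, S), net_demand(b, T))  # 5 6
```

Here the cut carries at most 5 units while node 4 demands 6, so this instance has no feasible flow even though total supply equals total demand.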

## Part 2: Shortest Path Formulation

Recall the shortest path problem. We have a directed graph $G = (N,A)$ with length $\ell(i,j)$ for all arcs $(i,j) \in A$ and source/sink $s,t \in N$. Our goal is to find an $s-t$ path of minimum length. How could we use min-cost flow to solve a shortest path problem? The next questions walk through the formulation.

**Hint**: Consider a *feasible* unit flow on some graph (like the one below). Note how the unit of flow must leave every node it enters because of flow conservation. How could one interpret a feasible unit flow as a path in the graph?

<img src="images-lab/flow_path.png" alt="flow_path" style="width: 400px;"/>

First, we must show how to construct a min-cost flow input from the shortest path input. Let's use the directed graph $G = (N,A)$ from the shortest path input as our directed graph in the min-cost flow input.

**Q:** We need to assign a cost $c(i,j)$ for all arcs $(i,j) \in A$. What should the cost be?

**A:** <font color='blue'>$c(i,j) = \ell(i,j)$</font>

**Q:** We also need to assign a capacity $u(i,j)$ for all arcs $(i,j) \in A$. What should the capacity be?

**A:** <font color='blue'>$u(i,j) = 1$</font>

**Q:** Lastly, we need to assign a supply/demand $b(i)$ to every node $i \in N$. What should the supplies/demands be?

**A:** <font color='blue'>$b(s) = 1$, $b(t) = -1$, and $b(i) = 0$ otherwise.</font>
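
Putting the three answers together, the transformation from a shortest path input to a min-cost flow input can be sketched as below. The function name `shortest_path_to_min_cost_flow`, the dictionary representation, and the example graph are assumptions made for illustration only.

```python
def shortest_path_to_min_cost_flow(nodes, lengths, s, t):
    """Return (cost, capacity, b) dictionaries describing the min-cost flow input."""
    cost = dict(lengths)                     # c(i,j) = l(i,j)
    capacity = {arc: 1 for arc in lengths}   # u(i,j) = 1
    b = {i: 0 for i in nodes}                # b(i) = 0 unless i is s or t
    b[s] = 1                                 # one unit of supply at the source
    b[t] = -1                                # one unit of demand at the sink
    return cost, capacity, b

# Hypothetical shortest path input with source 1 and sink 4.
nodes = [1, 2, 3, 4]
lengths = {(1, 2): 3, (1, 3): 1, (3, 2): 1, (2, 4): 2}

cost, capacity, b = shortest_path_to_min_cost_flow(nodes, lengths, s=1, t=4)
print(b)  # {1: 1, 2: 0, 3: 0, 4: -1}
```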

If you are unsure about your formulation, reach out to a TA before heading on!

Let's say we have a black box that solves min-cost flow problems. Given an input, it returns the optimal solution which defines a flow $f(i,j)$ for every arc $(i,j) \in A$. If every $u(i,j)$ and $b(i)$ is integral, the **integrality property** states that the black box will return an all integral solution.

**Q:** If $0 \leq x \leq 1$ and $x$ is integral, what values can $x$ be?

**A:** <font color='blue'> Either $x = 1$ or $x = 0$</font>

**Q:** Suppose you have constructed a min-cost flow input for a shortest path problem. What do the integrality property and Q10 imply about the optimal solution returned by the black box?

**A:** <font color='blue'>The optimal solution will be integral. Furthermore, since $0 \leq f(i,j) \leq 1$, we know that $f(i,j)$ will be 1 or 0. </font>

**Q:** How can we interpret a min-cost flow solution as a solution to the shortest path problem?

**A:** <font color='blue'>If $f(i,j) = 1$, then $(i,j)$ is in the path. Otherwise, it is not. </font>

You should now be very familiar with the steps required to show how one problem can be used to solve another. So far, we have done the first step: created a way to transform shortest-path inputs into min-cost flow inputs. The remaining steps are to show there is a one-to-one correspondence between feasible solutions to the shortest-path problem and feasible solutions to the min-cost flow problem. Then, we must show the objective values are the same (or differ by a constant). Think about how this argument might look. The next questions address one complication you will come across.

Suppose we have the following graph for the shortest path problem. 

<img src="images-lab/shortest_path_input.png" alt="shortest_path_input" style="width: 400px;"/>

We construct the corresponding min-cost flow input. Consider the following feasible flow, given in the usual way: the boxed numbers indicate the flow and the unboxed numbers indicate the capacity. The arc costs are omitted from the diagram.

<img src="images-lab/unit_flow.png" alt="unit_flow" style="width: 400px;"/>

Now, we want to convert this feasible flow back to a feasible shortest path solution. You should have found in your formulation that we can do this by selecting only arcs with one unit of flow to be in the path. Hence, the bold arcs are selected.

<img src="images-lab/corresponding_path.png" alt="corresponding_path" style="width: 400px;"/>

**Q:** Does this set of edges form a feasible $s-t$ path? Why or why not?

**A:** <font color='blue'>No, because $(2,5),(5,3)$, and $(3,2)$ form a cycle. </font>

**Q:** Does the feasible solution **contain** an $s-t$ path? If so, what is it?

**A:** <font color='blue'>Yes, $(1,2),(2,4),(4,6)$ is an $s-t$ path. </font>

**Q:** How did you arrive at an $s-t$ path?

**A:** <font color='blue'> Cut out the cycle. </font>

**Q:** Consider a feasible solution to the min-cost flow problem that corresponds to a path with some cycles. If we cut out the cycles to obtain the corresponding $s-t$ path, how will the cost of the min-cost flow solution compare to the length of the corresponding $s-t$ path?

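**A:**  <font color='blue'> The $s-t$ path is never longer than the cost of the flow, provided every arc length is nonnegative. It is strictly shorter whenever a removed cycle has positive length. </font>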
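
Cutting out cycles can be automated. The sketch below walks along arcs carrying one unit of flow and, whenever the walk returns to a node it has already visited, discards the cycle it just closed. The function name `extract_path` is an assumption, the arc list mirrors the arcs named in the answers above, and the unit arc lengths are hypothetical because the figure's lengths are not reproduced here.

```python
def extract_path(flow_arcs, s, t):
    """Follow unit-flow arcs from s to t, dropping any cycle the walk closes."""
    unused = {}
    for i, j in flow_arcs:
        unused.setdefault(i, []).append(j)
    path = [s]
    while path[-1] != t:
        nxt = unused[path[-1]].pop()
        if nxt in path:
            # Revisiting a node closes a cycle: cut the walk back to that node.
            path = path[: path.index(nxt) + 1]
        else:
            path.append(nxt)
    return path

# Arcs with one unit of flow: the path arcs plus the cycle 2 -> 5 -> 3 -> 2.
flow_arcs = [(1, 2), (2, 4), (4, 6), (2, 5), (5, 3), (3, 2)]
lengths = {arc: 1 for arc in flow_arcs}  # hypothetical lengths

path = extract_path(flow_arcs, s=1, t=6)
path_length = sum(lengths[arc] for arc in zip(path, path[1:]))
flow_cost = sum(lengths[arc] for arc in flow_arcs)
print(path, path_length, flow_cost)  # [1, 2, 4, 6] 3 6
```

The sketch assumes the input is a feasible unit flow; on anything else the walk may get stuck and raise an error.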

## Part 3: The Taxi-Routing Problem

Suppose we are a New York City taxi company and we know all the ride requests we will receive over some time horizon. Each trip has a start/end location, start time, trip time, and some value (this could be the revenue it generates, the number of passengers, or constant across all rides). Furthermore, we know the number of taxi cabs (which we denote as $B$) and the layout of Manhattan. We can represent Manhattan as a grid where every street intersection is a node and every street segment is an edge. We also know the time it takes to traverse any street segment. Our goal is to maximize the value of the trips we pick up. Let's use the min-cost flow problem to solve this problem!

First, we must construct an input to the min-cost flow problem. Let us discretize time into minutes. We will have a time horizon of $T$ minutes. Next, let's index every location node so that we have $L$ nodes indexed $0,\dots,L-1$ for all $L$ locations in Manhattan. We now construct the directed graph $G$ that will be the input to the min-cost flow problem. There is a node for every time and location combination where node $(\ell,t)$ is the node representing location $\ell$ at time $t$.

Furthermore, we will have nodes $s$ and $f$. There is an edge from $s$ to node $(\ell,0)$ for all locations and an edge from node $(\ell,T)$ to $f$ for all locations. The idea is that a unit of flow through the graph will represent where a given taxi is at any given time over the time horizon. If you would like to see a visual to ease your understanding of the formulation, run the two code cells before Q19.

**Q:** What should the supplies for the $s$, $f$, and $(\ell,t)$ nodes be?

**A:** <font color='blue'> The supply at $s$ is $B$, the supply at any $(\ell,t)$ is 0, and the supply at $f$ is $-B$. </font>

We are almost there! We need to define the other edges in the graph and determine the cost and capacity for each edge. Since we know the layout of Manhattan, we know how long it takes to get from one location to another. In other words, we have a list of arcs of the form $(a,b,d)$ where $a$ is the start location, $b$ is the end location, and $d$ is the time it takes to travel between the two locations. For each arc of this type, we will introduce edges from $(a,t)$ to $(b,t+d)$ for every $t$ such that $0 \leq t$ and $t+d \leq T$. Each edge of this type will have a capacity of $B$, because any taxi can traverse this edge, and a cost of 0, since it is of zero value to drive with no passengers. Most importantly, we need edges to represent each trip we could take.

**Q:** Suppose we have a trip from location $a$ to $b$ that starts at time $t$ with duration $d$ and value $v$. What edge should be added and what are its cost and capacity? (Hint: Costs can be negative.)

**A:** <font color='blue'> We should have an edge from $(a,t)$ to $(b,t+d)$ with cost $-v$ and capacity 1. </font>

Let's look at a small input!

In [None]:
# start, end, start_time, trip_time, value
trips = [(2,1,1,2,2),
         (0,1,0,1,2),
         (0,1,2,1,1)]

# start, end, trip_time
arcs = []

B = 2  # number of taxis
L = 3  # number of locations
T = max(np.array(list(zip(*trips))[2]) + np.array(list(zip(*trips))[3]))  # time horizon

trips_df, nodes_df, arcs_df = create_dataframes(trips, arcs, L)
small_ex = TaxiRouting(trips_df, nodes_df, arcs_df, 0, T, B)

Let's look at the corresponding min-cost flow input to this problem. The label on each edge is the flow on that edge.

In [None]:
small_ex.draw_graph()

**Q:** Is this what you expected? If not, why?

**A:** <font color='blue'> Depends. </font>

**Q:** The edges $(\ell,t)$ to $(\ell,t+1)$ for all locations $\ell$ and $0 \leq t \leq T-1$ are automatically added to the graph despite not being explicitly defined. What do they represent?

**A:** <font color='blue'> Staying at the same location. </font>
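
For reference, the whole construction can be sketched in plain Python. This is only an illustration of the formulation described above and is not how `TaxiRouting` is implemented. The function name `build_time_expanded_graph` and the edge-list representation are assumptions, as is the choice of capacity $B$ on the source, sink, and waiting edges, which the text does not specify.

```python
def build_time_expanded_graph(trips, arcs, L, T, B):
    """Return (edges, supply); each edge is (tail, head, cost, capacity)."""
    edges = []
    for loc in range(L):
        edges.append(("s", (loc, 0), 0, B))               # a taxi may start anywhere
        edges.append(((loc, T), "f", 0, B))               # and finish anywhere
        for t in range(T):
            edges.append(((loc, t), (loc, t + 1), 0, B))  # wait one minute
    for a, dest, d in arcs:
        for t in range(T - d + 1):                        # all t with t + d <= T
            edges.append(((a, t), (dest, t + d), 0, B))   # drive without a passenger
    for a, dest, t, d, v in trips:
        edges.append(((a, t), (dest, t + d), -v, 1))      # serve a trip, cost -v
    supply = {"s": B, "f": -B}                            # every other node has b = 0
    return edges, supply

# Same small input as above: (start, end, start_time, trip_time, value).
trips = [(2, 1, 1, 2, 2), (0, 1, 0, 1, 2), (0, 1, 2, 1, 1)]
edges, supply = build_time_expanded_graph(trips, arcs=[], L=3, T=3, B=2)
print(len(edges))  # 18
```

With 3 locations and $T = 3$ there are 3 source edges, 3 sink edges, 9 waiting edges, and 3 trip edges. Edges are stored in a list rather than a dictionary because a trip edge and an empty-driving edge can share the same endpoints.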

We can now solve the problem and look at the optimal flow.

In [None]:
small_ex.optimize()
small_ex.draw_graph()

**Q:** What is the cost of this solution? Recall that we used a constant value of 1 for each trip.

**A:** <font color='blue'> -2 because we satisfied 2 trips. </font>

**Q:** Interpret this solution as a solution to the taxi-routing problem. That is, give a schedule for every taxi.

**A:** <font color='blue'> One taxi starts at location 0 and takes a trip from location 0 to 1. After that, they stay at location 1 for the rest of the time horizon. The other taxi starts at location 2 and waits until $t = 1$ to take a trip from location 2 to 1. </font>     

**Q:** We have a trip from location 0 to 1 that only takes 1 minute. Suppose this trip uses a two-way road, so we can also get from location 1 to 0 in 1 minute. Add this arc to the input and see how the optimal solution changes.

In [None]:
# (start, end, trip_time)
# TODO: add the arc to the list of arcs
arcs = [()]

### BEGIN SOLUTION
arcs = [(1,0,1)]
### END SOLUTION

trips_df, nodes_df, arcs_df = create_dataframes(trips, arcs, L)
small_ex = TaxiRouting(trips_df, nodes_df, arcs_df, 0, T, B)
small_ex.optimize()

# when we draw the graph this time, we only draw edges with positive flow
small_ex.draw_graph(draw_all=False)

We can use the following command to get the taxi paths:

In [None]:
paths = small_ex.taxi_paths()

This returns a list of paths, one for each taxi. Let's look at the first taxi.

In [None]:
paths[0]

We can interpret this output as follows: this taxi took a trip from location 0 to 1, went back to location 0 without a passenger, and then took another trip from location 0 to 1.

**Q:** Interpret the path of the other taxi.

In [None]:
paths[1]

**A:** <font color='blue'> This taxi stayed at location 2 then took a trip from location 2 to 1.</font>

## Part 4: The Taxi-Routing Problem (At Scale)

Let's look at the taxi routing problem at scale! First, we need to input some data:

In [None]:
trips_df = pd.read_csv('data/2013-09-01_trip_data_manhattan.csv').drop(columns='id')
trips_df['revenue'] = 2.50 + 1.56*trips_df.trip_distance + 0.50*trips_df.trip_time
trips_df.revenue = trips_df.revenue.apply(lambda x: round(x,2))

nodes_df = pd.read_csv('data/nyc_nodes_manhattan.csv').drop(columns='Unnamed: 0')
arcs_df = pd.read_csv('data/nyc_links_manhattan.csv').drop(columns='Unnamed: 0')

**Q:** Take a look at the 3 dataframes.

In [None]:
trips_df.head()

In [None]:
nodes_df.head()

In [None]:
arcs_df.head()

**Q:** The `trips_df` dataframe has a list of trips but is missing a field for value. If we want to create a taxi schedule maximizing revenue, what should be the value for each trip?

**A:** <font color='blue'> The value of each trip should be the revenue that trip generates.</font>

**Q:** The following code creates a field in `trips_df` called `value` and sets it equal to another field in the dataframe. Fill in the missing field name based on your answer to Q26, then run the next cell to make sure the field was added properly.

In [None]:
# TODO: uncomment and replace field with the appropriate field name
# trips_df['value'] = trips_df['FIELD']

### BEGIN SOLUTION
trips_df['value'] = trips_df['revenue']
### END SOLUTION

In [None]:
trips_df.head()

We can now pass this input to the TaxiRouting solver! In `trips_df`, the `start_time` field is given in minutes since midnight. Let's first look at the time window of 5:00 PM to 5:15 PM. This corresponds to 1020 to 1035. Furthermore, we will have 300 taxis on the road.
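
The conversion to minutes since midnight is simple arithmetic. As a quick check, a hypothetical helper (not part of the lab's `routing` module) reproduces the window endpoints:

```python
def minutes_since_midnight(hour, minute=0):
    """Convert a 24-hour clock time to minutes since midnight."""
    return 60 * hour + minute

start = minutes_since_midnight(17)      # 5:00 PM
end = minutes_since_midnight(17, 15)    # 5:15 PM
print(start, end)  # 1020 1035
```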

In [None]:
nyc_taxi = TaxiRouting(trips_df, nodes_df, arcs_df, 1020, 1035, 300)

Now, let's solve it and print some statistics about the solution.

In [None]:
nyc_taxi.optimize()
nyc_taxi.get_stats()

In [None]:
nyc_taxi.plot_stats()

It turns out this fleet was not large enough to serve all the requested rides in this time period. The percentages of potential trips, passengers, and revenue that were achieved are given in the parentheses following the corresponding statistic (e.g., there were 792 trips that could have been taken but only 628, or 79%, were).

Furthermore, here is a quick description of each statistic:

* **Average Moving Percentage:** The moving percentage for a taxi is the percent of time intervals the taxi is moving from one location to another (it may or may not have a passenger). This is the average moving percentage across all taxis.
* **Average On Trip Percentage:** The on trip percentage for a taxi is the percent of time intervals the taxi has a passenger. This is the average on trip percentage across all taxis.
* **Average Total Distance of Trips:** The total distance travelled (in km) during trips is computed for each taxi and then averaged.
* **Average Revenue:** Average revenue collected per taxi cab.
* **Total Trips:** Total number of trips the taxi routing schedule could accommodate.
* **Total Passengers:** Total number of passengers the taxi routing schedule could accommodate.
* **Total Revenue:** Total revenue generated from the taxi routing schedule.
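
To illustrate the two per-taxi percentages, here is a rough sketch. It assumes a hypothetical schedule representation, a list of `(minutes, is_moving, has_passenger)` segments, which is not the format returned by the lab's solver; the function name `taxi_percentages` is likewise made up.

```python
def taxi_percentages(segments):
    """Return (moving %, on-trip %) for one taxi's schedule."""
    total = sum(minutes for minutes, _, _ in segments)
    moving = sum(minutes for minutes, is_moving, _ in segments if is_moving)
    on_trip = sum(minutes for minutes, _, has_passenger in segments if has_passenger)
    return 100 * moving / total, 100 * on_trip / total

# Hypothetical 15-minute schedule: wait, trip, drive empty, trip.
segments = [
    (3, False, False),
    (6, True, True),
    (3, True, False),
    (3, True, True),
]
print(taxi_percentages(segments))  # (80.0, 60.0)
```

The on-trip percentage can never exceed the moving percentage, since a taxi with a passenger is always moving.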

To look at the statistics for an individual taxi, we use the following command, where the index is the id of the taxi.

In [None]:
nyc_taxi.taxi_stats[12]

We can also plot the path of a set of taxis! The path of each taxi is color-coded. The circle represents its start location and the lower opacity edges indicate the taxi has a passenger.

In [None]:
nyc_taxi.plot_taxi_route([12])

**Q:** What if we wanted to maximize the number of passengers we accommodated instead of maximizing revenue? How would we change the value? Adjust the value field accordingly.

In [None]:
# TODO: uncomment and set the value field
# trips_df['value'] = 

### BEGIN SOLUTION
trips_df['value'] = trips_df['passenger_count']
### END SOLUTION

**Q:** What was the total revenue generated in the previous solution, where we tried to maximize revenue? How many passengers were accommodated? Before re-solving with a new objective, what do you already know about these values in the new solution? (Hint: How can you bound these values?)

**A:** <font color='blue'> The revenue was 6319.47 and the number of accommodated passengers was 1182. In the new solution, the revenue must be $\leq 6319.47$ and the number of passengers must be $\geq 1182$. </font>
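
The bounding argument only relies on both objectives being optimized over the same set of feasible schedules (same time window and fleet size). A tiny sketch with made-up numbers shows the pattern; the `(revenue, passengers)` pairs below are hypothetical and do not come from the taxi data.

```python
# Hypothetical feasible schedules, each summarized as (revenue, passengers).
schedules = [(120.0, 9), (110.0, 12), (95.0, 14)]

best_by_revenue = max(schedules, key=lambda s: s[0])
best_by_passengers = max(schedules, key=lambda s: s[1])

# The passenger-maximizing schedule cannot earn more than the revenue optimum,
# and it cannot carry fewer passengers than the revenue-maximizing schedule.
assert best_by_passengers[0] <= best_by_revenue[0]
assert best_by_passengers[1] >= best_by_revenue[1]
print(best_by_revenue, best_by_passengers)  # (120.0, 9) (95.0, 14)
```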

In [None]:
nyc_taxi = TaxiRouting(trips_df, nodes_df, arcs_df, 1020, 1035, 300)  # same fleet size as the revenue run
nyc_taxi.optimize()
nyc_taxi.get_stats()

In [None]:
nyc_taxi.plot_stats()

**Q:** On average, what was the total distance a taxi drove on trips in the previous solution (maximizing revenue)? What about the new solution (maximizing passengers)? How do these values compare? Why might this be?

**A:** <font color='blue'> Before, it was 3.16. Now, it is 2.62. It is much less in this solution. It looks like most taxis will only make one ride in this 15 minute time span. The longer the ride, the higher the revenue will be. Since revenue and trip distance are correlated, the trip distance will naturally be pretty high when we maximize revenue. However, the trip distance is not correlated with number of passengers so it is not weighted in the objective function. This leads to a lower value in the new solution.</font>

**Q:** Lastly, what if we wanted to maximize the number of trips? How would we change the value? Adjust the value field accordingly.

In [None]:
# TODO: uncomment and set the value field
# trips_df['value'] = 

### BEGIN SOLUTION
trips_df['value'] = 1
### END SOLUTION

In [None]:
nyc_taxi = TaxiRouting(trips_df, nodes_df, arcs_df, 1020, 1035, 500)
nyc_taxi.optimize()
nyc_taxi.get_stats()

To finish, let's look over a wider time horizon. Since we were only looking at a 15 minute interval before, each taxi only had 1 ride on average. This is more of an assignment problem of taxis to rides and does not capture the full complexity of a schedule with taxis stringing together multiple rides. Let's look at 5:00 PM to 6:30 PM with 1500 taxis now. We will return to our revenue maximizing objective.

Run the cell below to compute the taxi schedule. Note that this takes about 2 minutes: the corresponding min-cost flow formulation has 278,735 nodes and 855,945 edges!

In [None]:
trips_df['value'] = trips_df['revenue']
nyc_taxi = TaxiRouting(trips_df, nodes_df, arcs_df, 1020, 1110, 1500)
nyc_taxi.optimize()
nyc_taxi.get_stats()

In [None]:
nyc_taxi.plot_stats()

**Q:** How many rides does each taxi make on average?

**A:** <font color='blue'> There are 1500 taxis and 8806 rides made so 5.87 $\approx$ 6 rides per taxi</font>

Let's look at a few random taxi paths in this solution.

In [None]:
random.seed(1101)  # set random seed
# randint includes both endpoints, so stop at 1499 (assuming taxi ids run 0-1499)
taxis = [random.randint(0, 1499) for _ in range(3)]
nyc_taxi.plot_taxi_route(taxis)

**Q:** Let's look at the solution maximizing the number of trips. Write the code to change the objective (change the value field), create the new problem, solve it, and print the summary statistics.

In [None]:
# TODO: Add your code here

### BEGIN SOLUTION
trips_df['value'] = 1
nyc_taxi = TaxiRouting(trips_df, nodes_df, arcs_df, 1020, 1110, 1500)
nyc_taxi.optimize()
nyc_taxi.get_stats()
### END SOLUTION

In [None]:
nyc_taxi.plot_stats()

Again, let's look at the paths of the same 3 random taxis.

In [None]:
nyc_taxi.plot_taxi_route(taxis)

**Q:** Visually compare the paths of the three taxis in the two different solutions. What do you notice? Try to give some explanation.

**A:** <font color='blue'>It appears the taxis do not cover as much ground as they did previously. They stay in one region as opposed to jumping all over the map. Recall that the solution maximizing revenue tends to have longer trips than the solution maximizing trips. Since the trips are shorter and taxis are unlikely to get multiple trips headed in the same direction, this leads to taxis covering smaller regions. </font>

**Bonus:** Play around with other objectives, time windows, number of taxis, etc. Use the summary statistics and path plotting functionalities to compare their solutions. What types of things did you notice? Was it what you expected? If you find something interesting, feel free to share!

### Additional Components

 <font color='red'> The function below plots marginal returns. Need to incorporate this into the lab in some way. Maybe ask them a question about interpreting these plots and how they could be used to identify when to stop adding taxis to the fleet. Kind of surprised that there is not some increasing marginal return in the beginning...</font>

In [None]:
trips_df['value'] = trips_df['revenue']

In [None]:
plot_returns(trips_df, nodes_df, arcs_df, 1020, 1040, 250, 1150, 9)

In [None]:
plot_returns(trips_df, nodes_df, arcs_df, 1020, 1060, 1, 6, 6)

The following heatmaps show the flow of bikeshare bikes in Manhattan at two different times of the day. Red indicates a bike is departing that location and blue indicates a bike is arriving at that location.

<img src="images-lab/bikeshare_heatmaps.png" alt="bikeshare heatmap" style="width: 600px;"/>

*From the Cornell Bikeshare Research Group*

 <font color='red'> Add question about what time of day these snapshots might be from.</font>

<font color='red'> Create questions that use the new plot heatmap function.</font>

In [None]:
# Morning rush hour
TaxiRouting(trips_df, nodes_df, arcs_df, 500, 700, 300).plot_heatmap()

In [None]:
# Evening rush hour
TaxiRouting(trips_df, nodes_df, arcs_df, 1000, 1200, 300).plot_heatmap()