# How to Make All the Flights You Want

## Part 1: Framing the Problem

My father and I were born a week short of 41 years apart from one another. As you might expect, there are many things on which we (respectfully) don't always see eye-to-eye. But few topics are as contentious between us as the appropriate amount of lead time before a flight with which one should arrive at the airport.

To be fair, his grievances with my airport habits are not without some empirical basis. My record, particularly in my early twenties, has been less than stellar when it comes to missing flights. But even with these, let's call them missteps, I find it hard to justify the two hour window for which he frequently advocates.

In fact, I find it interesting just how lacking in consensus the public is on the issue (as Twitter responses to an Atlantic article from a couple years ago brought into focus). Here is a phenomenon that many of us experience hundreds, sometimes even _thousands_ of times in our lives, and yet we arrive at widely different conclusions about how to approach it.

This notebook will attempt to address this apparent inconsistency first by framing an argument, mathematically, about _why_ one would arrive at the airport at one time compared to another, and then by calculating (or, rather, estimating) the arrival time which _optimizes_ that criterion. In doing so, we can point out which steps in the argument lead to disagreement, and we can evaluate how we can argue about those points.

### How much a dollar cost?
So let's begin by framing the problem. First off, I'll mention that I'm not interested in diversions of the form "well what if you have TSA pre?" or "well what if my flight is at midnight on a Wednesday?". As we'll see shortly, these protests do nothing to alter the form of the argument, only particular _parameters_ in it that we'll treat in a fully general manner. For the purposes of discussion, just assume I'm talking about a generic heavily trafficked airport at a generic heavily trafficked time and day, say 9 AM on a Monday or 5 PM on a Friday. Once the form of the argument becomes clear, feel free to sub in the appropriate language for your own hypothetical.

I'm also not concerned with what time you leave your house, just what time you arrive at the airport. I trust that your sense of traffic, or at the very least your smart phone's, is sufficiently well honed that you can map from one to the other. If it isn't, well that's another issue entirely.

With that in mind, let's begin with this question: why don't you just arrive at the airport five minutes before boarding? The obvious answer is that this would make you very likely to miss your flight, which most people would find unpleasant, both because it might cost them money in the form of a re-booking fee and because it might keep them from getting to the place that they would rather be.

On the flip side, why not just arrive, say, 5 hours early? You'd be almost guaranteed to never miss a flight, and you probably have emails you need to catch up on anyway. The once again obvious answer is that most people just don't really enjoy being at airports. Even if you do find waiting at the gate to be a nice opportunity to read or respond to people you've been ignoring for weeks, the returns on that joy diminish pretty quickly after an hour or so.

In both cases, it seems, we're interested in avoiding some __cost__. Those costs can be either material, in the form of fees or lost wages, or more abstract, in the form of missed time at your destination or boredom at the gate. The time we choose to arrive reflects, at least implicitly, our best efforts to balance between these costs.

The issue is that we don't, in general, know in advance how much of each cost we're likely to incur. If I _knew_ that it was going to take me exactly ten minutes and twenty three seconds to get from the doors of the airport to my gate, I would just arrive ten minutes and twenty three seconds before boarding. This way, I would never spend any time waiting, and I would be guaranteed to make my flight.

But even though we don't _know_ what particular value these __random variables__ are going to take on any particular trip, we tend to act on estimates, again at the very least implicit ones, of how they're __distributed__. I don't arrive 5 hours early because I know that the likelihood of getting to the gate taking that long is very low. Accordingly, my likelihood of incurring costs associated with missing my flight is low as well. On the other hand, my likelihood of incurring a lot of bored-at-the-gate cost is high. So we're not just interested in balancing cost, but in balancing __risk__: cost times the probability of incurring that cost.

So what if we described these costs and their probabilities mathematically as a function of how early we arrive at the airport? Then we could find the arrival time which minimizes the __expected cost__, the cost we'll incur _on average_ over the hundreds of flights in our lifetime. We might not act optimally on any one trip (i.e. not arrive at the gate precisely when boarding starts), but, over many trips, we'll incur (approximately) the least possible cost given our uncertainty.

### But why models?
Here comes the math, so if this isn't your thing feel free to skim. But I still think there's some value in the structure of the argument, even for the lay reader. What I'm going to do is propose a __model__ for the total cost incurred $C$ as a function of the amount of time $\tau$ we arrive before our flight (or we can measure this with respect to boarding time if you prefer, it doesn't really make a difference), $C(\tau)$.

One key point to keep in mind here is that our own personal cost functions emerge from some complicated, subconsciously computed combination of real costs and our own personal tastes and preferences (e.g. some people like hanging out at airports more than others). It would be foolish to believe that we could come up with some formula that succinctly captures these bizarre subtleties of human psychology. Rather, I'll propose an extremely simple model with easily interpretable __parameters__ that you can plug in based on your _own_ preferences. The question I'm ultimately interested in, therefore, is this: if we're willing to quantify our own cost functions within the context of this model, can we derive a corresponding optimal arrival time?

Let's make things more concrete before we dig more into that. The model I'll adopt is this: let's assume that waiting at the gate accrues some constant cost per unit time $c_w$, so that if we wait at the gate for $T$ minutes the total cost we incur is $c_{w}T$, in some as-yet-undefined cost unit (remember, the unit itself doesn't matter, since the time that minimizes the cost is the same whatever unit we measure it in). However, if we miss our flight, we incur some fixed cost $C_M$, which does _not_ depend on time. Therefore, if we assume that getting from the door to the gate takes some amount of time $t$, then we can describe $C(\tau)$ as:

$$C(\tau) = \begin{cases}
    c_{w}(\tau-t) & t \leq \tau \\
    C_M & t > \tau \\
    \end{cases}
$$
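
To make the two branches concrete, here is a minimal sketch of the cost model as a function. The name `trip_cost` and the default values of `c_w` and `C_M` are placeholders chosen for illustration, not values proposed anywhere in this notebook.

```python
def trip_cost(tau, t, c_w=1.0, C_M=120.0):
    """Cost of one trip under the piecewise model above.

    tau : minutes of lead time we allowed ourselves
    t   : minutes it actually took to get from the door to the gate
    c_w : cost per minute of waiting (illustrative value)
    C_M : fixed cost of missing the flight (illustrative value)
    """
    if t <= tau:
        return c_w * (tau - t)  # made it: pay for the time spent waiting
    return C_M                  # missed it: pay the fixed penalty


print(trip_cost(tau=60, t=25))  # 35.0
print(trip_cost(tau=20, t=25))  # 120.0
```

With these placeholder numbers the cost ratio is 120 minutes: two hours at the gate costs the same as a missed flight.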

So let's start with the question: why this model? After all, the implication seems to be that each minute spent waiting at the airport is as painful as the last, and anyone who has had a flight delayed multiple hours knows that the first minute of the fourth hour is much worse than the first minute of the first hour. And the fixed cost for missing a flight seems off too: missing a flight by an hour makes it harder to talk your way onto the next one, and means you'll be even later to wherever it is you want to go.

These issues notwithstanding, I'm adopting this model for the same reasons we adopt most models in science: it turns out we can solve it for some real answers! And what's more, the answers we _get_ from it turn out to be pretty compelling. Once we have these answers, we can analyze any deficiencies they might have and tie them back to deficiencies in the model in order to improve it.

Another part of the answer to this question is that I really don't have any good reason to pick more complicated models. Or, rather, I don't have any good reason to pick one more complicated model over another. Maybe there should be a $t^2$ term somewhere in there, but why not a $t^{1.5}$, or $\log{t}$? (I'm being a bit asinine here: there are good reasons to believe most functions of interest can be approximated by polynomials.) Without good _a priori_ motivation, picking one over the other could only be validated by checking whether it produces "good" answers anyway, so we may as well start simple and go from there.

Moreover, the cost of this added complexity is less "interpretability" (a term I put in quotes because I have a lot of issues with how it gets thrown around these days, but that's a notebook for another day). What's nice about this model is that, as we'll see, it ends up depending on just one parameter that you, the person using it for your decision making, need to pick in order to reflect your own cost functions. This value is $\frac{C_M}{c_w}$, which I'll refer to as your __cost ratio__ $R$. Looking at the units of $R$, we see that it has units of $\frac{\text{cost}}{\frac{\text{cost}}{\text{time}}} = \text{time}$. We can interpret $R$ as the _amount of time that you would need to wait at the airport before you considered it as bad as missing your flight_. This value will be unique not just to each person, but even to each trip for each person (e.g. missing your friend's wedding would presumably be worse than starting a vacation a couple hours later). The point is, if we're willing to _pick a value for_ $R$, we'll be able to find the ideal arrival time _for_ that value.

To this end, I want to draw a distinction here between disagreements that might occur due to issues with the model itself, to which I would respectfully respond that you provide a model of your own to compare, versus those disagreements that arise _from_ the model. And here we've already encountered one: what value of $R$ do we choose? My value of $R$ clearly differs from my father's, and this plays a big part in why we arrive at such different conclusions for when to arrive at the airport.

We are, of course, free to disagree about _why_ we've picked one $R$ value versus another. My dad might think that my relatively low $R$ choice reflects financial imprudence or a lack of punctuality, and I might think that his relatively high $R$ choice insufficiently values time spent at home doing nothing. But at least now we know that we're arguing about _values_, and we know _which values_ we're arguing about. Moreover, we'll develop the mathematical scaffolding to map from this values choice to an arrival time, and have a precise sense for how our values disagreements give rise to disparate conclusions.

### Math for real this time
Ok, this has been a lot of talking. Let's dig into this model. So we have our cost function

$$C(\tau) = \begin{cases}
    c_{w}(\tau-t) & t \leq \tau \\
    C_M & t > \tau \\
    \end{cases}.
$$
But as we pointed out above, the key problem in this whole affair is that $t$ is a random variable (or, more precisely in the context above, represents a particular realization of a random variable $T$). So, as we mentioned above, we want to minimize our expected cost over some distribution $p(t)$ of this random variable
$$\mathbb{E}_{t \sim p(t)}\big{[}C(\tau)\big{]} = \int_{0}^{\tau}(\tau - t)c_{w}p(t)dt + \int_{\tau}^{\infty}C_Mp(t)dt.$$
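
As a rough sketch, the two integrals above can be evaluated numerically for any candidate distribution. The helper name `expected_cost`, the cost values, and the gamma distribution used below are all assumptions made for illustration; nothing here is fitted to data.

```python
from scipy import integrate, stats


def expected_cost(tau, dist, c_w=1.0, C_M=120.0):
    """Expected cost of arriving tau minutes early, for door-to-gate time ~ dist."""
    # Waiting term: integral over 0..tau of (tau - t) * c_w * p(t) dt
    wait, _ = integrate.quad(lambda t: (tau - t) * c_w * dist.pdf(t), 0, tau)
    # Missing term: C_M times the probability that t exceeds tau
    miss = C_M * dist.sf(tau)
    return wait + miss


# Placeholder distribution (an assumption for illustration, not fitted to anything)
dist = stats.gamma(a=2, scale=10)
for tau in (15, 30, 60, 120):
    print(tau, round(expected_cost(tau, dist), 2))
```

Scanning over `tau` like this is the brute-force version of the minimization carried out analytically in the rest of this section.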

So in order to find the value of $\tau$ that minimizes this, which we'll call $\tau^*$, we'll just take the derivative with respect to $\tau$ and set it equal to $0$

$$\frac{d\mathbb{E}[C(\tau)]}{d\tau} = c_w\int_{0}^{\tau}{p(t)dt} - C_M p(\tau)$$
$$\Rightarrow\ \ \ \ c_w\int_{0}^{\tau^*}{p(t)dt} - C_M p(\tau^*) = 0$$

If we let $P(t)$ represent the __cumulative distribution function__ of $p(t)$, $P(t) = \int_{0}^{t}p(t')dt'$, then we can finagle this into the neat form

$$\frac{d\log{P(t)}}{dt}\biggr\rvert_{\tau^*} = \frac{1}{R}.$$
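
The finagling is short enough to spell out. Writing the first-order condition in terms of $P$, dividing through by $C_M P(\tau^*)$ (which assumes $P(\tau^*) > 0$), and using $p(t) = \frac{dP(t)}{dt}$:

$$c_w P(\tau^*) = C_M p(\tau^*)\ \ \ \ \Rightarrow\ \ \ \ \frac{p(\tau^*)}{P(\tau^*)} = \frac{c_w}{C_M} = \frac{1}{R}$$
$$\frac{p(t)}{P(t)} = \frac{1}{P(t)}\frac{dP(t)}{dt} = \frac{d\log{P(t)}}{dt}$$

Evaluating the second line at $t = \tau^*$ and substituting into the first gives the condition above.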

I will admit that I don't as of yet have a nice intuitive explanation for this equation, but it is both exceptionally convenient and _fully general_ to _any_ distribution $P(t)$ (which we haven't yet specified). This makes this equation _exact_, conditioned on our acceptance of the underlying cost model. Let's start with a simple distribution for $T$, the exponential, and see if we can solve this equation exactly.

### First cut: An exponential distribution

More precisely, let's assume that the distribution of time it takes for us to get from the door to the gate is

$$p(T = t) = \lambda e^{-\lambda t}$$

where $\lambda$ is called a parameter of the distribution. The choice of this form of distribution picks some general family of _shapes_ that our distribution can take, and the choice of value of the parameter(s) decides which member of this family we use. How we pick a particular parameter value is the key exercise in the science of statistics, and is generally done by ensuring that the corresponding distribution matches well with some observed data (this is called __fitting__ the parameters to the data). We'll take a look at how we can do this in the next section.

For now, let's take a look at a few different values of $\lambda$ to see what I mean by the "shape" of the distribution.

In [1]:
from scipy.stats import expon
from bokeh.io import output_notebook, show
import plot_utils
output_notebook()

dists = {f"Lambda {l}": expon(scale=1./l) for l in [0.02, 0.05, 0.1]}
plot_kwargs = {
    "title": "Exponential Distributions for Airport Time",
    "x_axis_label": "Minutes from door to gate",
    "y_axis_label": "Probability"
}
p = plot_utils.plot_distributions(dists, x_max=100, plot_kwargs=plot_kwargs)
show(p)

So as we can see from the plot above, higher values of $\lambda$ make longer wait times less likely. But one observation should bother us: the most likely time (the highest $y$ value in the plot above) for _all_ values of $\lambda$ is 0! I may not know in advance how long security is going to take, but I can say for sure that it will take _some_ time. If anything, the value of $p(0)$ should be 0, not a maximum!

Obviously, this is a bad model. But just like our choice of cost model above, it has the extremely desirable property that we can _solve_ it. That is, we can find an equation for the ideal arrival time $\tau^*$ as a function of $\lambda$ and $R$. Once we pick the first one based on data, and pick the second one based on our own personal values and preferences, we can plug them in to find out when we should arrive at the airport!

The answer that comes out of this will be __biased__, since it was achieved with a model distribution that we _know_ doesn't "look" like the real distribution. But it will once again provide a nice launching point, and we can explore more complicated distributions from there.

So the cumulative distribution function for the exponential distribution looks like
$$P(T \leq t) = 1 - e^{-\lambda t},$$
which we can plug into our equation above to get
$$\frac{\lambda e^{-\lambda \tau^*}}{1 - e^{-\lambda \tau^*}} = \frac{1}{R}$$
$$\Rightarrow\ \ \ \  e^{\lambda \tau^*} = \lambda R + 1$$
$$\Rightarrow\ \ \ \ \tau^* = \frac{1}{\lambda}\log(\lambda R + 1).$$

So here's our formula for the ideal arrival time! And it makes a fair amount of sense: as $\lambda$ gets larger, we're more likely to be able to get to our gate faster, and so $\tau^*$ decreases, meaning we don't need to allow ourselves as much time. Likewise, as $R$ gets larger, which means we have _more_ tolerance for waiting (it takes more hours of waiting to be as bad as missing a flight), $\tau^*$ gets larger too, since we can get there earlier and risk spending some time waiting around. And finally if $R=0$, then _any_ amount of waiting becomes unacceptable to us, so we should not allow ourselves _any_ lead time (this is an interesting artifact of the model's assumption that 0 is the most likely wait time).
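
As a sanity check on the algebra, the closed form can be compared against a direct numerical minimization of the expected cost. This is a sketch under assumed values: `lam`, `c_w` and `C_M` below are arbitrary illustrative numbers, and `expected_cost_exponential` is a helper introduced only for this check.

```python
import numpy as np
from scipy.optimize import minimize_scalar


def expected_cost_exponential(tau, lam, c_w, C_M):
    """Closed-form expected cost when door-to-gate time is exponential with rate lam."""
    miss_prob = np.exp(-lam * tau)
    # integral over 0..tau of (tau - t) * lam * exp(-lam * t) dt
    mean_wait = tau - (1.0 - miss_prob) / lam
    return c_w * mean_wait + C_M * miss_prob


lam, c_w, C_M = 0.1, 1.0, 120.0  # illustrative values only
R = C_M / c_w

closed_form = np.log(lam * R + 1.0) / lam
numeric = minimize_scalar(
    expected_cost_exponential,
    bounds=(0.0, 600.0),
    args=(lam, c_w, C_M),
    method="bounded",
).x

print(closed_form, numeric)
```

The two printed values should agree to within the optimizer's tolerance. Only the ratio `C_M / c_w` affects the minimizer, which is why $R$ is the single parameter that appears in the formula.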

The difference between $\lambda$ and $R$, however, is that there is no "right" value of $R$ for _everyone, all the time_. We need to make this decision for ourselves each time we have a new flight to make. $\lambda$, on the other hand, _does_ have a "true" value (for a given time of day, airport, etc.). Or rather, it has a "best" value that makes our distribution as close as possible to the real one. If we could plug this value in, we could plot $\tau^*$ as a function of _just_ $R$, and you could consult this graph to find your ideal time for your own $R$ value.

This sounds easier than it is, but if we're willing to chalk all of the time that goes into $t$ up to security wait times (a reasonably fair approximation for most trips, since we can always add a roughly constant offset to account for walking time, etc.), we can take a first swing using [this highly questionable data](https://www.tsa.gov/news/press/releases/2019/01/19/tsa-statement-checkpoint-operations-january-19) put out by the TSA a couple years ago. In particular, we can use the observation that
> 91.2 percent of passengers waited less than 15 minutes

combined with the [quantile function](https://en.wikipedia.org/wiki/Exponential_distribution#Quantiles) to compute a __maximum likelihood estimate__ of $\lambda$:

In [2]:
import numpy as np
from bokeh.models import Range1d, LinearAxis

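p = 0.912  # fraction of passengers who waited less than F minutes
F = 15     # wait-time threshold in minutes
# invert the exponential CDF, p = 1 - exp(-l * F), to get the rate l (per minute)
l = -np.log(1-p) / F
dist = expon(scale=1./l)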

plot_kwargs = {
    "title": "Likelihood of Wait Times",
    "x_axis_label": "Time (minutes)",
    "y_axis_label": "Probability",
    "y_range": (0, 1.02*l)
}
p = plot_utils.plot_distributions({"PDF": dist}, x_max=20, plot_kwargs=plot_kwargs)

# plot the CDF as well on a second axis for clarity
cdf_x = np.linspace(0, 20, 100)
cdf_y = dist.cdf(cdf_x)
p.extra_y_ranges = {"cdf": Range1d(start=0, end=1)}
p.line(x=cdf_x, y=cdf_y, line_color=plot_utils.palette[1], legend_label="CDF", y_range_name="cdf")
p.add_layout(LinearAxis(y_range_name="cdf", axis_label="Cumulative Probability"), "right")

p.legend.title = "MLE Estimate: lambda = {:0.3f}".format(l)
p.legend.location = "center_right"
show(p)

So this says that our wait time will be 10 minutes or less about 80% of the time. For most of us, this should ring absurd. But remember, we used a _model_ that was biased towards lower wait times (it treats 0 as the most likely wait time), and fed it _data_ that was likely biased towards lower wait times too. That press release was put out in the midst of a government shutdown that was causing travel delays for which the Trump administration was bearing much of the political cost, calling into question its motives. It's also far too blanket, including travellers from off hours whose wait times don't reflect the average traveller that we're concerned with. _Of course_ the results came out low! We'll work on refining this in the next section, but first let's do what we set out to do and plot the ideal arrival time as a function of the cost ratio $R$.
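
The numbers behind that reading of the plot can be checked directly. This is a small sketch that refits the rate from the single TSA figure quoted above; the variable name `lam` is just a local choice.

```python
import numpy as np
from scipy.stats import expon

# Rate implied by "91.2 percent of passengers waited less than 15 minutes"
lam = -np.log(1 - 0.912) / 15
dist = expon(scale=1.0 / lam)

print(dist.cdf(15))  # recovers the 0.912 we fit to
print(dist.cdf(10))  # probability of waiting 10 minutes or less
print(dist.mean())   # implied average wait in minutes, equal to 1 / lam
```

Note that SciPy parameterizes the exponential by `scale`, which is the reciprocal of the rate $\lambda$ used in this notebook.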

In [4]:
# R is given in hours and l is per minute, so multiply R by 60 to convert it to minutes
f = lambda R: np.log(l*R*60 + 1) / l
plot_kwargs = {
    "title": "Cost Ratio vs. Ideal Arrival Time",
    "x_axis_label": "Cost Ratio (hours)",
    "y_axis_label": "Ideal Arrival Time (minutes)"
}
p = plot_utils.plot_functions(f, x_min=0.5, x_max=8, plot_kwargs=plot_kwargs)
show(p)
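
For anyone reading without the rendered plot, a few points on the same curve can be printed directly. This sketch recomputes the rate from the TSA figure and does not rely on the notebook's plotting helper; `ideal_lead_time` is a name introduced here for illustration.

```python
import numpy as np

lam = -np.log(1 - 0.912) / 15  # per-minute rate from the TSA figure


def ideal_lead_time(R_hours, lam=lam):
    """tau* in minutes for a cost ratio given in hours."""
    return np.log(lam * R_hours * 60 + 1) / lam


for R_hours in (0.5, 1, 2, 4, 8):
    print(f"R = {R_hours:>3} h -> arrive {ideal_lead_time(R_hours):.1f} min early")
```

Because $\tau^*$ grows only logarithmically in $R$, doubling the cost ratio adds just a few minutes of lead time under this fit.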

So as we expect, the ideal arrival times are a bit low. For a cost ratio of 8 hours, meaning we would have to wait a full 8 hours before deciding that we would rather have just missed the flight, the recommended arrival lead time is only about 27 minutes. Clearly, there are some flaws here. Let's take a second to review the steps in this argument thus far which might lead to disagreement with these conclusions, and how they lead to them. I'll ignore the choice of the cost model, which we're going to accept as true for now.
- The choice of the exponential distribution, which biases our estimates towards lower wait times
- The choice of the $\lambda$ parameter in that distribution, which was calculated based off of shaky data
- The choice of $R$, which will vary from person to person and place them at a different point on the curve above.

For the first and second points, the disagreements ultimately come down to data modelling decisions, and can only be settled by validating models on *new* data (though compelling _prior_ reasons to prefer one distribution over another, e.g. our observation about 0 being the most likely time in the exponential, certainly help to motivate our decision making). The last, as we've discussed, is a disagreement over _values_ that can only be settled by arguing about why one set of tastes is better than another. So we haven't quite settled anything, but at least now we know what we're arguing about.

In the next section, we'll make our model a bit more complicated, and take a look at some better data that we can use to fit it more appropriately.