# The Myopic Mechanic

Consider a car owner (let's call him "Mitt") facing the problem of when to take his car to the mechanic for service. We assume that Mitt makes one vehicle service decision per month, at the beginning of each month.

Let $x$ denote the total miles on the car, and $z$ denote the number of miles since the last service visit (both measured in thousands). Conditional on the total mileage $x$ and the miles since last service $z$, Mitt perceives expected beginning-of-month benefit from operating the car given by
$$
U(x, z) = a_0 + a_1 x - \rho_1 z - \rho_2 \cdot z \mathbb{I}[x \geq 100],
$$
where $\rho_1 > 0$ and $\rho_2 > 0$ represent expected current costs associated with increasing miles since the last service visit, including both routine costs and potential breakdown risks, and $\mathbb{I}[x \geq 100]$ is an indicator for whether the car has at least 100,000 miles. Meanwhile, visiting the mechanic involves an average cost
$$
C(x, z) = c_0 + c_1 z + c_2 \cdot z \mathbb{I}[x \geq 100],
$$
where we allow costs of the service visit to increase with miles since last service, but assume that $\rho_1 > c_1$ and $\rho_2 > c_2$ (so that as $z$ increases, perceived costs of inaction increase faster than perceived costs of service).

If Mitt visits the mechanic, he will incur an expected cost $C(x, z)$ specified above. However, visiting the mechanic resets the number of miles $z$ since the last service visit to zero, allowing Mitt to realize benefit $U(x,0)$ over the rest of the month. 

At the beginning of every month, Mitt decides whether or not to take his car to the mechanic. 
Toward this end, he compares the net monthly benefit of taking the car to the mechanic $(d=1)$, 
\begin{equation}
V_1 = U(x,0) - C(x, z) + \epsilon_1 = a_0 + a_1 x - c_0 - c_1 z - c_2 \cdot z \mathbb{I}[x \geq 100] + \epsilon_1,
\end{equation}
to the net monthly benefit of operating the car without maintenance ($d=0$),
$$
V_0 = U(x, z) + \epsilon_0 = a_0 + a_1 x - \rho_1 z - \rho_2 \cdot z \mathbb{I}[x \geq 100] + \epsilon_0,
$$
where $\epsilon_0$ and $\epsilon_1$ are i.i.d. Type I Extreme Value utility shocks representing idiosyncratic variation in Mitt's monthly tastes for mechanic visits (much more on these soon!). If $V_1 > V_0$, Mitt takes the car in ($d=1$); otherwise, he doesn't ($d=0$).

Note that Mitt is *myopic*, in the sense that he considers only costs and benefits of maintenance within the current month, not prospective benefits in future months from maintaining the car today. We will return to the case of a forward-looking mechanic (Harold Zurcher, the subject of a seminal 1987 paper by John Rust) when we introduce dynamic discrete choice analysis.

## Predicted probability of service

We first compute the predicted probability that Mitt takes the car for service $(d=1)$ given $x$ and $z$: denote this conditional choice probability by $P(x,z)$. By definition, this probability is given by
\begin{align*}
P(x, z) &= P(V_1 - V_0 \geq 0|x, z) \\
&= P(-c_0 + (\rho_1 - c_1)z + (\rho_2 - c_2) \cdot z \mathbb{I} [x \geq 100] \geq \epsilon_0 - \epsilon_1) \\
&= \frac{\exp(-c_0 + (\rho_1 - c_1) z + (\rho_2 - c_2) \cdot z \mathbb{I} [x \geq 100])}
{1 + \exp(-c_0 + (\rho_1 - c_1) z + (\rho_2 - c_2) \cdot z \mathbb{I} [x \geq 100])},
\end{align*}
where the last line follows by the i.i.d. Type I extreme value assumption on the idiosyncratic utility shocks $\epsilon_0, \epsilon_1$ (again, much more on this soon!). 
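
The closed form above can be checked by brute force. The following is a minimal sketch, not part of the original exercise: it draws pairs of Type I extreme value shocks by inverse-CDF sampling and compares the simulated frequency of $V_1 \geq V_0$ with the logistic formula. The function names and the value `v_example` are hypothetical, chosen only for illustration.

```julia
using Random

# Draw a standard Type I extreme value (Gumbel) shock by inverse-CDF sampling:
# if U ~ Uniform(0,1), then -log(-log(U)) has CDF exp(-exp(-e)).
draw_t1ev(rng) = -log(-log(rand(rng)))

# Closed-form logit probability that V_1 >= V_0, where v is the
# difference in mean utilities between d = 1 and d = 0.
logit_prob(v) = 1 / (1 + exp(-v))

# Simulated frequency with which v + e1 >= e0 across n_draws shock pairs.
function simulated_prob(v, n_draws; rng = MersenneTwister(1234))
    hits = 0
    for _ in 1:n_draws
        e0 = draw_t1ev(rng)
        e1 = draw_t1ev(rng)
        hits += (v + e1 >= e0)
    end
    return hits / n_draws
end

# Hypothetical mean-utility difference, chosen only for illustration.
v_example = -1.5
p_closed = logit_prob(v_example)
p_sim = simulated_prob(v_example, 200_000)
println("closed form: ", p_closed, "  simulated: ", p_sim)
```

With a large number of draws the two numbers should agree up to simulation noise.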

Note the following features of the predicted probability-of-service function $P(x,z)$:

- The terms $a_0 + a_1x$ which appear in both $V_1$ and $V_0$ disappear from the choice probability function. We thus cannot identify either $a_0$ or $a_1$, since these do not affect the choice we observe. In other words, we cannot identify the effects of mileage per se on utility from driving the car; we can only identify effects of mileage insofar as they affect the relative costs of service.
	
- Only differences $(\rho_1 - c_1)$ and $(\rho_2 - c_2)$ show up in the choice probability function. In this particular example, we can thus only identify the effect of mileage since last service on the differential costs of service versus not.
	
- Since the scale of utility is arbitrary, we can identify these differences only up to scale. In assuming that $\epsilon_0$ and $\epsilon_1$ were i.i.d. Type I extreme value, we implicitly imposed a normalization on the variances of $\epsilon_0$ and $\epsilon_1$. Scaling all terms in utility (including the shocks) by any positive constant would lead to the same choice probabilities.
	
- Are these identified objects enough? It depends on the counterfactual of interest. For example, if we want to determine how cutting the base cost $c_0$ of a service visit in half would affect the frequency of service, we could do so. If we wanted to determine how a dollar subsidy to service would affect frequency of service, we could not, since in this simple exercise we have no way to convert estimated utilities into dollar terms. For this, we would need to observe variation in price of service, from which we abstract for the moment (as price raises a separate set of endogeneity questions which we will address in detail later). 
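
As a rough illustration of the first counterfactual, the sketch below evaluates the service probability before and after halving the base cost $c_0$, holding the identified differences fixed. All numerical values here are hypothetical placeholders rather than estimates, and `service_prob` is a name introduced only for this example.

```julia
# Service probability in terms of the identified objects:
# c0, g1 = rho_1 - c_1, g2 = rho_2 - c_2 (all in units of the shock scale).
function service_prob(c0, g1, g2, x, z)
    v = -c0 + g1 * z + g2 * z * (x >= 100)
    return 1 / (1 + exp(-v))
end

# Hypothetical parameter values and state, chosen only for illustration.
c0, g1, g2 = 4.0, 1.0, 0.5
x, z = 120.0, 3.0

p_base = service_prob(c0, g1, g2, x, z)        # baseline
p_half = service_prob(c0 / 2, g1, g2, x, z)    # base cost cut in half
println("baseline: ", p_base, "  halved base cost: ", p_half)
```

No dollar conversion is needed here because the counterfactual is stated relative to $c_0$ itself, which is measured on the same normalized utility scale as the other terms.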

Bearing the above caveats in mind, redefine $\gamma_0=c_0$, $\gamma_1 = \rho_1 - c_1$, and $\gamma_2 = \rho_2 - c_2$. These are the primitives which data on Mitt's service choices can identify. 

## Objectives of the exercise

Suppose we observe panel data $(d_{it}, x_{it}, z_{it})_{t=1}^T$ on monthly mileage and service decisions for a collection of individuals $i=1,...,N$, interpreted as a random sample of the population. For simplicity, assume a balanced panel (i.e., the same $T$ for all $i$), although this is inessential.

We aim to estimate the parameter vector $\gamma = (\gamma_0, \gamma_1, \gamma_2)$, assuming that each individual is making auto maintenance choices according to the model described above. Toward this end, we will consider:

1. CMLE estimation based on the predicted choice probability function $P(x_{it}, z_{it}; \gamma)$ derived above.

2. GMM estimation based on the conditional mean restriction 
	$$E[d_{it} - P(x_{it}, z_{it}; \gamma) | x_{it}, z_{it}] = 0.$$ 

In this case, given that we have a fully specified choice model, CMLE will be more efficient, but we also consider GMM for illustrative purposes.
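
To make the GMM option concrete, here is a minimal sketch of one way to turn the conditional mean restriction into sample moments. It assumes the covariate vector $w_{it} = (-1, z_{it}, z_{it}\mathbb{I}[x_{it} \geq 100])$ is used as the instrument vector, which is one choice among many; the function names, the toy data, and the identity weighting matrix are all hypothetical.

```julia
using LinearAlgebra

# Logit choice probability for index v = w'gamma.
logit_prob(v) = 1 / (1 + exp(-v))

# Sample analogue of E[w_it * (d_it - P(w_it; gamma))] = 0.
# D is a vector of 0/1 choices and W is an (observations x 3) matrix whose
# rows are w_it = (-1, z_it, z_it * 1[x_it >= 100]).
function sample_moments(gamma, D, W)
    n_obs, n_par = size(W)
    gbar = zeros(n_par)
    for it in 1:n_obs
        w_it = W[it, :]
        resid = D[it] - logit_prob(dot(w_it, gamma))
        gbar .+= w_it .* resid
    end
    return gbar ./ n_obs
end

# GMM objective gbar' * A * gbar for a weighting matrix A.
function gmm_objective(gamma, D, W, A)
    gbar = sample_moments(gamma, D, W)
    return dot(gbar, A * gbar)
end

# Hypothetical toy data, for illustration only.
D = [0.0, 0.0, 1.0, 0.0, 1.0]
W = [-1.0 2.0 0.0; -1.0 3.0 0.0; -1.0 4.5 4.5; -1.0 1.0 1.0; -1.0 5.0 0.0]
A = Matrix{Float64}(I, 3, 3)
println(gmm_objective(zeros(3), D, W, A))
```

With this particular instrument choice the moments are proportional to the logit score, so the just-identified GMM estimator would coincide with CMLE; other functions of $(x_{it}, z_{it})$ could be used as instruments instead.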

To gain a sense for how the estimators compare, we will simulate several Monte Carlo datasets, then explore the performance of each estimator in these simulations. 
I will provide code for this exercise in Julia, although (for those using other programs) it may be worthwhile to replicate this exercise in other languages.

# Step 1: Drawing data from the model.

We first write a few simple functions to generate simulated data from the model above. These use functionality provided by several packages in the Julia language, which we load next.

In [1]:
# Load packages
using Distributions, LinearAlgebra, DataFrames

We next specify the parameters of the data generating process. We specify the parameters governing service choice as $\gamma = (5.0, 1.0, 0.2)$. We draw initial mileage $x_{i0}$ from an exponential distribution with mean $60$. We assume the monthly mileage for each individual evolves as $x_{i,t+1} = x_{it} + \Delta x_{it}$, with $\Delta x_{it}$ drawn from an exponential distribution with mean $1$. We specify these distribution objects below (note that f_x0 and f_dx defined below are *distribution objects*, which we subsequently feed into a random number generator to draw variables from the relevant distributions). 

In [2]:
# Choice probability parameters (object of interest)
gamma0 = [5. 1 .2]

# Initial mileage distribution (used to simulate data, but not in estimation)
f_x0 = Exponential(60.)

# Distribution of monthly mileage increment (also used to simulate data, but not in estimation)
f_dx = Exponential(1.)

Exponential{Float64}(θ=1.0)

We next define a Julia function computing the predicted probability that individual $i$ in period $t$ chooses to take their car to the mechanic for service. This function will be the core component of our algorithm. To facilitate use in both simulation and estimation, we write the function to take arguments $\gamma$, the parameters governing choice, and $w = (-1, z, z \cdot \mathbb{I}[x \geq 100])$, the vector of observed covariates which affect the choice probability. 

One technical note in computing predicted choice probabilities: employing $\gamma$ and $w$ just defined, we may rewrite the predicted choice probability function defined above as

\begin{align}
    P(w; \gamma) &= \frac{\exp(w'\gamma)}{\exp(0) + \exp(w'\gamma)} \\
    &= \frac{\exp(w'\gamma - \bar{v})}{\exp(-\bar{v}) + \exp(w'\gamma - \bar{v})},
\end{align}

where $\bar{v} = \max \{0, w'\gamma\}$ is the maximum of the average net utility associated with $d=0$ (i.e., $0$) and the average net utility associated with $d=1$ (i.e., $w'\gamma$). The latter transformation is useful to ensure numerical stability: for some values of $w$ and $\gamma$, the product $w'\gamma$ could become very large, so that $\exp(w'\gamma)$ overflows to machine infinity. Normalizing by the maximum of mean utilities prevents such numerical overflow, since every exponent is then at most zero, and ensures that the predicted choice probability $P(w; \gamma)$ is always numerically stable. This is good programming practice when working with logit models, and is illustrated in the code below.

In [3]:
# Compute predicted probability of taking car for service
function predicted_service_prob(gamma, w)
    wg = dot(gamma, w)
    vbar = max(wg, 0.)
    expnormv0 = exp(-vbar)
    expnormv1 = exp(wg - vbar)
    prob1 = expnormv1 / (expnormv0 + expnormv1)
    return(prob1)
end

predicted_service_prob (generic function with 1 method)
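
To see why the normalization matters, the following sketch compares a naive logit formula with the stabilized one at a few index values. The helper names `naive_prob` and `stable_prob` are introduced only for this comparison, and the index values are arbitrary.

```julia
# Naive logit formula: exp(v) overflows to Inf for large v, and Inf / Inf is NaN.
naive_prob(v) = exp(v) / (1 + exp(v))

# Stabilized formula: subtract vbar = max(v, 0) inside every exponential.
function stable_prob(v)
    vbar = max(v, 0.0)
    return exp(v - vbar) / (exp(-vbar) + exp(v - vbar))
end

for v in (-800.0, 0.0, 2.0, 800.0)
    println("v = ", v, "  naive = ", naive_prob(v), "  stable = ", stable_prob(v))
end
```

The two formulas agree for moderate index values, but at a very large index the naive version returns `NaN` while the stabilized version returns a probability of one.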

Finally, we write a function to draw data from the model above. This function takes arguments nI (the number of individuals) and nT (the number of periods per individual). It returns as output a data frame with columns :I, an individual identifier, :T, a time identifier, :P, the true predicted individual choice probability (not observed in actuality), :D, the observed individual decision, :X, the beginning-of-period cumulative mileage, :Z, the beginning-of-period miles since last service, and :W1-:W3, containing the variables $w_{it}$ for each observation. (The notation :X denotes a *symbol* in Julia -- that is, a unique precompiled identifier, in this case a column name.)

Note that we loop over both i and t in simulating data below, computing predicted choice probabilities separately for each observation. This would be a very inefficient construction in Matlab or Python, which tend to slow down dramatically in loops. But the just-in-time compilation built into Julia allows loops to execute with little overhead, greatly simplifying efficient coding of inherently recursive operations such as simulation of sequential choices.

In [4]:
# Draw data from model (assuming we start following each individual one period after last service)
function draw_data(nI, nT)
    
    # Pre-initialize output matrices
    I = zeros(nI*nT)
    T = zeros(nI*nT)
    P = zeros(nI*nT)
    D = zeros(nI*nT)
    X = zeros(nI*nT)
    Z = zeros(nI*nT)
    W = zeros(nI*nT, 3)
    w_it = zeros(3)
    
    # Loop through i by t and simulate data
    it = 1
    for ii=1:nI
        
        # initialize x_i, z_i, w_i for this i, x_i0 drawn from f_x0, z_it drawn one period after last service
        x_it = rand(f_x0)
        z_it = rand(f_dx)
        w_it = [-1. z_it z_it*(x_it >= 100.)]
        
        # Loop through periods for this i and update x_i, z_i
        for tt=1:nT
            
            # Fill identifier variables for observation it
            I[it] = ii
            T[it] = tt
            
            # Fill beginning-of-period state variables for obs it
            X[it] = x_it
            Z[it] = z_it
            W[it, :] = w_it
            
            # Compute true predicted probability of service P_it (unobserved)
            P[it] = predicted_service_prob(gamma0, w_it)
            
            # Determine whether service is actually chosen: equivalent to U[0, 1] < P_it
            D[it] = rand() < P[it]
            
            # Increment next period mileage x: x' = x + dx, dx drawn from f_dx
            dx = rand(f_dx)
            x_it += dx
            
            # Increment next period z: z' = 0 + dx if service, z' = z + dx otherwise
            z_it = (D[it] > 0) ? dx : z_it + dx
            
            # Update w_it for start of next period and increment counter it
            w_it[2] = z_it
            w_it[3] = z_it * (x_it >= 100.)
            it += 1
        end
    end
    
    # Create data frame composed of variables above
    #   We stack the raw data into a matrix and pass the column names alongside it
    #   (the older pattern of DataFrame(matrix) followed by names!(...) is not available
    #   in current DataFrames releases, so we name the columns at construction instead)
    colnames = [:I; :T; :P; :D; :X; :Z; Symbol.(:W, 1:3)]
    data = DataFrame([I T P D X Z W], colnames)
    return(data)
end

# Draw a small test dataset to verify function works
data = draw_data(10, 10)


| Row | I | T | P | D | X | Z | W1 | W2 | W3 |
|-----|---|---|---|---|---|---|----|----|----|
| | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 |
| 1 | 1.0 | 1.0 | 0.0661085 | 0.0 | 26.3925 | 2.35194 | -1.0 | 2.35194 | 0.0 |
| 2 | 1.0 | 2.0 | 0.0962938 | 0.0 | 26.8015 | 2.7609 | -1.0 | 2.7609 | 0.0 |
| 3 | 1.0 | 3.0 | 0.097304 | 1.0 | 26.813 | 2.77245 | -1.0 | 2.77245 | 0.0 |
| 4 | 1.0 | 4.0 | 0.0435267 | 0.0 | 28.7231 | 1.91012 | -1.0 | 1.91012 | 0.0 |
| 5 | 1.0 | 5.0 | 0.483762 | 1.0 | 31.748 | 4.93503 | -1.0 | 4.93503 | 0.0 |
| 6 | 1.0 | 6.0 | 0.0252744 | 0.0 | 33.0957 | 1.34764 | -1.0 | 1.34764 | 0.0 |
| 7 | 1.0 | 7.0 | 0.0300237 | 0.0 | 33.2728 | 1.52471 | -1.0 | 1.52471 | 0.0 |
| 8 | 1.0 | 8.0 | 0.0604146 | 0.0 | 34.0038 | 2.25579 | -1.0 | 2.25579 | 0.0 |
| 9 | 1.0 | 9.0 | 0.0868028 | 0.0 | 34.3947 | 2.64669 | -1.0 | 2.64669 | 0.0 |
| 10 | 1.0 | 10.0 | 0.0879955 | 0.0 | 34.4097 | 2.66164 | -1.0 | 2.66164 | 0.0 |
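
As a quick sanity check on the transition rule, the sketch below re-derives the miles-since-service values in rows 4 and 5 of the printed sample from the rows before them. The helper `next_state` is a hypothetical restatement of the update inside `draw_data`, and the inputs are the rounded values copied from the output above.

```julia
# State transition used in the simulation: service (d = 1) resets z before
# this month's driving dx is added; total mileage x always grows by dx.
function next_state(x, z, d, dx)
    x_next = x + dx
    z_next = d > 0 ? dx : z + dx
    return x_next, z_next
end

# Rows 3 and 4 of the printed sample (rounded values copied from the output).
x3, z3, d3 = 26.813, 2.77245, 1.0
x4, z4 = 28.7231, 1.91012
dx3 = x4 - x3                                  # implied mileage increment
x_next, z_next = next_state(x3, z3, d3, dx3)   # service in row 3, so z resets to dx3

# Rows 4 and 5: no service in row 4, so z accumulates.
d4, x5, z5 = 0.0, 31.748, 4.93503
_, z_next5 = next_state(x4, z4, d4, x5 - x4)

println("row 4 z: ", z_next, " (printed ", z4, ")")
println("row 5 z: ", z_next5, " (printed ", z5, ")")
```

Up to rounding in the printed values, the service visit in row 3 resets $z$ to that month's mileage increment, while the absence of service in row 4 lets $z$ accumulate.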


# Step 2: CMLE Estimation

We first estimate the model by CMLE. Under the assumptions described above, the choice probability function $P(w_{it}; \gamma_0)$ completely specifies the marginal density of the choice $d_{it}$ given contemporaneous covariates $w_{it}$. Furthermore, by hypothesis, individual $i$'s choice in period $t$ does not depend on lagged realizations of $d_{it}$ and $w_{it}$ for individual $i$, although current $w_{it}$ will generally not be independent of lagged $w_{i,t-1}$ for a given individual. In other words, in the language of panel data analysis, our model is *dynamically complete*, which allows us to apply standard MLE asymptotics.

(In this context, dynamic completeness essentially requires that, conditional on current covariates $w_{it}$, current choices $d_{it}$ are independent of past choices $d_{i,t-\tau}$ and past covariates $w_{i,t-\tau}$. Since we have assumed independence of utility shocks $\epsilon_{i0t}, \epsilon_{i1t}$ across periods, this is true in our context. If these errors were not independent, but the marginal choice probability function $P(w_{it}; \gamma_0)$ were otherwise correctly specified, we could consider *partial MLE analysis*, which involves maximizing the same marginal objective function, but requires a general quasi-MLE approach to asymptotic inference, since the conditional information matrix equality no longer holds. See Wooldridge 13.8 for a detailed discussion.) 

By definition, the CMLE estimator maximizes the sum of observation-level log-likelihoods, here taken across both individuals $i$ and periods $t$. The observation-level log likelihood is the log of the probability of observing choice $d_{it}$ given the observed covariates $w_{it}$. Bearing in mind that $d_{it}$ is either zero or one, we may write this observation-level log likelihood concisely as
$$
    \ell_{it}(\gamma) = d_{it} \log(P(w_{it}; \gamma)) + (1-d_{it}) \log(1 - P(w_{it}; \gamma)).
$$
Summing across individuals and periods gives the sample log likelihood: 
$$
    \mathcal{L}_N(\gamma) = \sum_{i=1}^N \sum_{t=1}^T \ell_{it}(\gamma).
$$
By definition, the CMLE estimator $\hat{\gamma}$ maximizes the sample log likelihood $\mathcal{L}_N(\gamma)$:
$$
    \hat{\gamma} = \arg \max_{\gamma} \mathcal{L}_N(\gamma).
$$
To find this maximum, we will need functions for calculating the value, gradient (sum of scores), and Hessian of $\mathcal{L}_N(\gamma)$. We define these functions next.
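
For reference, in this logit specification the derivatives also have a simple closed form, which can serve as a check on any numerical or automatic derivative. Writing the index as $v = w'\gamma$ and using $\partial P / \partial v = P(1-P)$ for the logistic function, differentiating the sample log likelihood term by term gives
$$
\frac{\partial \mathcal{L}_N(\gamma)}{\partial \gamma} = \sum_{i=1}^N \sum_{t=1}^T \big(d_{it} - P(w_{it}; \gamma)\big) w_{it}, \qquad
\frac{\partial^2 \mathcal{L}_N(\gamma)}{\partial \gamma \partial \gamma'} = -\sum_{i=1}^N \sum_{t=1}^T P(w_{it}; \gamma)\big(1 - P(w_{it}; \gamma)\big) w_{it} w_{it}'.
$$
The Hessian is negative semidefinite for every $\gamma$, so the sample log likelihood is concave in $\gamma$.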


## Computing the sample log-likelihood

We first define functions for calculating the observation-level and sample log-likelihoods. These are straightforward applications of the formulas above. In computing the sample log likelihood, however, we apply one special wrinkle: we initialize the running sum from the first observation's log likelihood rather than from a floating-point zero, which lets Julia determine the type of the function output from the type of $\gamma$. This keeps the accumulator's type consistent when we apply automatic differentiation methods to compute the gradient as described below.

In [32]:
# Compute observation (it) level log likelihood: log(P_it) if d_it > 0, log(1-P_it) else
function period_log_likelihood(gamma, d_it, w_it)
    ccp = predicted_service_prob(gamma, w_it)
    if d_it > 0
        return log(ccp)
    else
        return log(1 - ccp)
    end
end

# Compute sample log likelihood: summing over all observations (individuals and periods)
function sample_log_likelihood(gamma, data)
    
    # Retrieve relevant columns of data frame in vector / matrix form
    D = data.D
    W = Matrix(data[:, Symbol.(:W, 1:3)])
    
    # Initialize log likelihood to first value
    #  Note: starting from the first observation gives sumll the same number type as gamma
    #  Initializing sumll=0. would instead start the sum as a Float64 and change its type inside the loop
    #  This doesn't matter for computing the numeric value of the log-likelihood
    #  But it keeps the accumulator type-stable under the automatic differentiation methods employed below
    sumll = period_log_likelihood(gamma, D[1], W[1,:])
    
    # Loop over remaining observations and compute overall log likelihood
    for it=2:length(D)
        sumll += period_log_likelihood(gamma, D[it], W[it, :])
    end
    
    # Return sum log likelihood
    return(sumll)
end


sample_log_likelihood (generic function with 1 method)

We next write a function to compute the value, gradient, and Hessian of the log-likelihood function simultaneously. We use the ForwardDiff package in Julia to compute gradients and Hessians automatically -- a powerful tool based on the fact that all machine calculations ultimately boil down to elementary operations such as addition, subtraction, multiplication, and division (so the chain rule can be applied at the machine operation level). My implementation requires two additional packages, ForwardDiff and DiffResults, whose use is illustrated below.

In [None]:
using ForwardDiff, DiffResults

# Compute sample score and Hessian
#   Here we use the ForwardDiff package, which allows efficient automatic computation of gradients and hessians
function sample_score_hessian(gamma, data)
    
    # We first initialize a HessianResult structure that allows us to compute objective, score, and Hessian in one shot
    #  Here the argument gamma describes the vector with respect to which we aim to take derivatives
    hessres = DiffResults.HessianResult(gamma)
    
    # We now apply the ForwardDiff.hessian! method to compute the gradient and Hessian of the log-likelihood
    #  We first specify the log-likelihood as an anonymous function of gamma only
    #  We then call ForwardDiff.hessian! to fill the results of hessres
    #  Note the !, which specifies that this function will modify one of its arguments
    #  In this case, it will modify the hessres structure, which will ultimately contain the outputs
    func = g -> sample_log_likelihood(g, data)
    ForwardDiff.hessian!(hessres, func, gamma)
    
    # Finally, we retrieve the objective, score (gradient), and Hessian from the hessres structure
    sumll = DiffResults.value(hessres)
    sumscore = DiffResults.gradient(hessres)
    sumhessian = DiffResults.hessian(hessres)
    return(sumll, sumscore, sumhessian)
end
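
The idea behind forward-mode automatic differentiation can be sketched with a toy "dual number" type that carries a value together with its derivative. This is only an illustration of the principle, not how ForwardDiff is implemented internally; the type `ToyDual` and everything defined on it are hypothetical names introduced for this example.

```julia
# Minimal forward-mode dual number: a value together with its derivative.
struct ToyDual
    val::Float64
    der::Float64
end

# Each arithmetic rule propagates the derivative by the chain rule.
Base.:+(a::ToyDual, b::ToyDual) = ToyDual(a.val + b.val, a.der + b.der)
Base.:-(a::ToyDual, b::ToyDual) = ToyDual(a.val - b.val, a.der - b.der)
Base.:*(a::ToyDual, b::ToyDual) = ToyDual(a.val * b.val, a.der * b.val + a.val * b.der)
Base.:/(a::ToyDual, b::ToyDual) = ToyDual(a.val / b.val, (a.der * b.val - a.val * b.der) / b.val^2)
Base.exp(a::ToyDual) = ToyDual(exp(a.val), exp(a.val) * a.der)

# Logit probability written only in terms of the operations defined above.
logit_prob(v::ToyDual) = exp(v) / (ToyDual(1.0, 0.0) + exp(v))

# Seed der = 1 to differentiate with respect to v at v = 0.3.
out = logit_prob(ToyDual(0.3, 1.0))
p = out.val
println("P = ", p, "  dP/dv = ", out.der, "  P(1-P) = ", p * (1 - p))
```

The derivative carried along by the dual number matches the analytic logistic derivative $P(1-P)$, without that formula ever being coded by hand.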


In [41]:
# Test the log-likelihood, score, and Hessian functions
sumll, sumscore, sumhess = sample_score_hessian(gamma0, data)
@time sumll, sumscore, sumhess = sample_score_hessian(gamma0, data);
@show sumll;
@show sumscore;
@show sumhess;

  0.000176 seconds (697 allocations: 56.047 KiB)
sumll = -34.909437920584004
sumscore = [-5.37963 18.8391 2.95215]
sumhess = [-8.12719 28.3654 8.52102; 28.3654 -113.227 -32.6098; 8.52102 -32.6098 -32.6098]
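
One way to gain confidence in derivative code is to compare it against an independent calculation. The sketch below does this on a small hypothetical dataset, comparing the closed-form logit score, the sum of $(d_{it} - P_{it}) w_{it}$, with a central finite-difference gradient. It deliberately avoids ForwardDiff so that it runs on its own; all function names and data values are placeholders introduced for this example.

```julia
using LinearAlgebra

logit_prob(v) = 1 / (1 + exp(-v))

# Sample log likelihood for 0/1 choices D and covariate rows W.
function loglik(gamma, D, W)
    total = 0.0
    for it in eachindex(D)
        p = logit_prob(dot(W[it, :], gamma))
        total += D[it] > 0 ? log(p) : log(1 - p)
    end
    return total
end

# Closed-form logit score: sum of (d_it - P_it) * w_it.
function analytic_score(gamma, D, W)
    score = zeros(length(gamma))
    for it in eachindex(D)
        w_it = W[it, :]
        score .+= (D[it] - logit_prob(dot(w_it, gamma))) .* w_it
    end
    return score
end

# Central finite-difference gradient, used only as an independent check.
function fd_score(gamma, D, W; h = 1e-6)
    grad = zeros(length(gamma))
    for k in eachindex(gamma)
        step = zeros(length(gamma))
        step[k] = h
        grad[k] = (loglik(gamma .+ step, D, W) - loglik(gamma .- step, D, W)) / (2h)
    end
    return grad
end

# Hypothetical toy data, for illustration only.
D = [0.0, 0.0, 1.0, 0.0, 1.0]
W = [-1.0 2.0 0.0; -1.0 3.0 0.0; -1.0 4.5 4.5; -1.0 1.0 1.0; -1.0 5.0 0.0]
gamma = [5.0, 1.0, 0.2]

gap = maximum(abs.(analytic_score(gamma, D, W) .- fd_score(gamma, D, W)))
println("largest discrepancy: ", gap)
```

The same comparison could be run against the `sumscore` output of `sample_score_hessian` on a simulated dataset.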
