# Mixed Logit Estimation Exercise

This problem set originates from Professor Ken Train's course on discrete choice.
Many thanks to him for allowing us to adapt it for use with Larch. It is provided
here with only minor modifications.

<a href="https://colab.research.google.com/github/driftlesslabs/larch/blob/main/exercises/mixed-logit.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

If you run this notebook in Colab, you'll need to install `larch` like this:

In [None]:
!pip install -q larch6

We will estimate mixed logit models in this problem set. The data represent 
consumers' choices among vehicles in stated preference experiments. The data are 
from a study that Professor Train did for Toyota and GM to assist them in their 
analysis of the potential marketability of electric and hybrid vehicles, back 
before hybrids were introduced to the public.

## Vehicle Type Choice Data

In each choice experiment, the respondent was presented with three vehicles, with 
the price and other attributes of each vehicle described. The respondent was asked 
to state which of the three vehicles he/she would buy if these vehicles were 
the only ones available in the market. There are 100 respondents in the dataset
for this exercise (the full dataset had 500 respondents). Each respondent was 
presented with 15 choice experiments, and most respondents answered all 15. The 
attributes of the vehicles were varied over experiments, both for a given 
respondent and over respondents. The attributes are: price, operating cost in 
dollars per month, engine type (gas, electric, or hybrid), range if electric (in 
hundreds of miles between recharging), and the performance level of the vehicle 
(high, medium, or low). The performance level was described in terms of top speed 
and acceleration, and these descriptions did not vary for each level; for example, 
"High" performance was described as having a top speed of 100 mph and 12 seconds 
to reach 60 mph, and this description was the same for all "high" performance 
vehicles. 

In [None]:
import pandas as pd

import larch as lx
from larch import PX

raw_data = pd.read_parquet(lx.example_file("vehicle_choice.parquet"))
data = lx.Dataset.construct.from_idca(raw_data)
data["price_scaled"] = data["price"] / 10000
data["opcost_scaled"] = data["opcost"] / 10
raw_data

## First Things First

It is always best to start with a standard logit model. To save you time, the 
script to produce a simple MNL model is provided below.

  - Do the estimated coefficients have the expected signs?
  - What is the estimated willingness to pay for a $1 per month reduction in
    operating cost? Note that price is in tens of thousands of dollars, and
    operating cost is in tens of dollars.
  - The variable "medhiperf" is 1 if the vehicle has either medium or high
    performance and 0 if the vehicle has low performance. The variable
    "hiperf" is 1 for high performance and 0 for medium or low performance.
    So, a vehicle with high performance has a 1 for both of these variables.
    The estimated "utility" from each performance level is therefore 0 for
    low performance, 0.3841 for medium performance, and 0.3841+0.1099=0.4940
    for high performance. Or, stated incrementally, the estimates imply that
    going from low to medium performance increases "utility" by 0.3841, while
    going from medium to high performance increases utility by 0.1099. These
    estimates imply diminishing marginal utility of performance. 
  - There are three engine types, with alternative specific constants entered
    for two of them (with the gas engine normalized to zero). What do the
    estimated constants imply about consumers' preferences for electric and
    hybrid vehicles relative to gas? 
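The incremental coding of the two performance dummies described in the list above can be checked with a few lines of arithmetic. This is only a sketch: the two coefficient values are the ones quoted in the text, while the function name and the level labels are made up for illustration.

```python
# Coefficients quoted in the text for the two performance dummies.
B_MEDHIPERF = 0.3841  # applies to medium AND high performance
B_HIPERF = 0.1099     # applies to high performance only


def performance_utility(level: str) -> float:
    """Utility contribution of a performance level under the nested dummy coding."""
    medhiperf = 1 if level in ("medium", "high") else 0
    hiperf = 1 if level == "high" else 0
    return B_MEDHIPERF * medhiperf + B_HIPERF * hiperf


# Utility relative to low performance, which is the base level.
utilities = {lvl: performance_utility(lvl) for lvl in ("low", "medium", "high")}
for lvl, u in utilities.items():
    print(f"{lvl:>6}: {u:.4f}")
```

The step from low to medium is larger than the step from medium to high, which is the diminishing marginal utility noted in the text.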

In [None]:
simple = lx.Model(data)
simple.utility_ca = (
    PX("price_scaled")
    + PX("opcost_scaled")
    + PX("max_range")
    + PX("ev")
    + PX("hybrid")
    + PX("hiperf")
    + PX("medhiperf")
)
simple.choice_ca_var = "chosen"
simple.maximize_loglike(stderr=True, options={"ftol": 1e-9})
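The willingness-to-pay question above turns mostly on unit conversion, so a small helper may be useful. This is a minimal sketch, assuming the coefficients are read off the estimation output by hand; the function name and the example coefficient values are hypothetical and are not the estimates from this model.

```python
def wtp_per_dollar_opcost(b_price_scaled: float, b_opcost_scaled: float) -> float:
    """Dollars of purchase price a consumer would give up for a $1 per month
    reduction in operating cost, given coefficients on the scaled variables."""
    b_price_per_dollar = b_price_scaled / 10_000  # price enters in $10,000s
    b_opcost_per_dollar = b_opcost_scaled / 10    # operating cost enters in $10s per month
    return b_opcost_per_dollar / b_price_per_dollar


# Hypothetical coefficients for illustration only, NOT the model's estimates.
example_wtp = wtp_per_dollar_opcost(b_price_scaled=-0.5, b_opcost_scaled=-0.1)
print(f"WTP for a $1/month reduction: ${example_wtp:,.2f}")
```

Because the two scale factors differ by a factor of 1000, the ratio of the raw coefficients must be multiplied by 1000 to obtain dollars of price per dollar of monthly operating cost.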

## Data Scaling

Operating cost is scaled to be in tens of dollars, and price in tens of thousands. 
The optimizer operates most effectively when the diagonal of the Hessian has about the 
same order of magnitude for all parameters, which can usually be accomplished by scaling 
variables such that their coefficients are about the same order of magnitude. To see the  
effect of scaling, remove the scaling of operating cost, such that operating cost enters 
as dollars rather than tens of dollars. How many more iterations does the optimizer 
take to converge when operating cost is in dollars compared to when operating cost 
is in tens of dollars? For standard logit, the difference in run time is immaterial, 
since estimation is so quick in any case. But when running mixed logit and other 
models that require simulation, the difference can be considerable. It is always 
helpful, therefore, to run standard logit models to get the scaling right, since 
checking various scales is quick in standard logit, and then to use that scaling 
when you turn to mixed logit.
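To see why rescaling a variable changes the curvature the optimizer faces, the sketch below computes the diagonal of the negative Hessian for a simple binary logit on synthetic data. It is an illustration under stated assumptions, not the vehicle choice model: the data, the coefficient values, and the function name are all invented, and a binary logit stands in for the multinomial case.

```python
import numpy as np


def neg_hessian_diag(X, beta):
    """Diagonal of the negative Hessian of a binary logit log likelihood."""
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    weights = p * (1.0 - p)
    return (X**2 * weights[:, None]).sum(axis=0)


rng = np.random.default_rng(0)
n = 1_000
price_10k = rng.uniform(1.0, 5.0, n)         # synthetic price, in $10,000s
opcost_dollars = rng.uniform(10.0, 60.0, n)  # synthetic operating cost, in $/month

beta_scaled = np.array([-0.5, -0.2])         # hypothetical coefficients
X_scaled = np.column_stack([price_10k, opcost_dollars / 10])
X_dollars = np.column_stack([price_10k, opcost_dollars])
# The same model expressed with operating cost in dollars: the coefficient
# shrinks by a factor of 10 so that utilities are unchanged.
beta_dollars = beta_scaled / np.array([1.0, 10.0])

ratio = neg_hessian_diag(X_dollars, beta_dollars) / neg_hessian_diag(X_scaled, beta_scaled)
print(ratio)
```

The price entry is unchanged, while the operating cost entry grows by roughly a factor of 100 (the square of the scale factor), which is the kind of imbalance across parameters that the scaling advice is meant to avoid.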

## A First Mixed Logit Model

Now estimate a mixed model with a fixed price coefficient and normal coefficients 
for all other attributes. This is probably the most common mixed logit specification. 
The code to do so is below. It uses `lx.mixtures.Normal` to define normally distributed
parameter mixtures. The first argument names an existing parameter to mix, which 
will be the mean of the distribution, and the second gives a new parameter name 
that will be the (estimated) standard deviation of the mixture. Run this code and 
examine the output.

In [None]:
mixed = simple.copy()
for k in ["opcost_scaled", "max_range", "ev", "hybrid", "hiperf", "medhiperf"]:
    mixed.mixtures.append(lx.mixtures.Normal(k, f"s_{k}"))
mixed.n_draws = 200
mixed.seed = 42
mixed.maximize_loglike(stderr=True, options={"ftol": 1e-9})

- Each random coefficient is distributed $N(B,W^2)$ where B and W are estimated.
  For example, the operating cost coefficient is estimated to have a mean of $-0.2149$
  and standard deviation of $0.4668$, such that the variance is $0.4668^2=0.2179$.
  Note, however, that the estimated $W$ can be negative, as occurs for some of
  the coefficients. The negative sign in these cases is simply ignored, and
  the standard deviation is the estimate without the negative sign. Here's
  the reason: the parameter that is being estimated is not actually the standard
  deviation; rather it is $W$ such that $W^2$ is the variance.  $W$ and $-W$
  give the same variance, and hence are equivalent. Also, since the standard
  deviation is defined as the square root of the variance, they give the same
  standard deviation. The advantage of estimating $W$ instead of estimating a
  standard deviation is that the optimization routine does not need to embody
  constraints to keep the parameter positive. Another way to see the issue is
  that a random variable that is $N(B,W^2)$ is created as $B+W\mu$ where $\mu$
  is standard normal, or equivalently as $B-W\mu$: both result in a term with
  mean $B$ and standard deviation $|W|$. An implication of this parameterization
  is that the starting values of $W$ should not be set at 0, but rather at some
  value slightly away from zero. The reason is this: Since $W$ and $-W$ are
  equivalent, the true log likelihood is symmetric around $W=0$ and hence is
  flat at $W=0$. (The simulated log likelihood is not exactly symmetric due
  to simulation noise.) If $W=0$ were used as the starting values, the gradient
  would be zero and the optimization routine would have no guidance on the
  direction to move. (If you want, you can change the starting value for $W$
  to zero, and rerun the model. You'll see that it has a smaller improvement
  in the first iteration and takes more iterations to converge.)
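The equivalence of $W$ and $-W$ described above can be checked numerically. The sketch below uses the mean and $W$ quoted in the text for the operating cost coefficient; the number of draws and the seed are arbitrary choices made for illustration.

```python
import numpy as np

B, W = -0.2149, 0.4668  # operating cost mean and W quoted in the text

rng = np.random.default_rng(42)
mu = rng.standard_normal(200_000)  # standard normal draws

coef_pos = B + W * mu     # coefficient draws built with +W
coef_neg = B + (-W) * mu  # coefficient draws built with -W

sd_pos, sd_neg = coef_pos.std(), coef_neg.std()
mean_pos, mean_neg = coef_pos.mean(), coef_neg.mean()
print(f"+W: mean={mean_pos:.4f}, sd={sd_pos:.4f}")
print(f"-W: mean={mean_neg:.4f}, sd={sd_neg:.4f}")
```

With the same draws the two standard deviations agree up to floating point error, and both means are close to $B$ up to simulation noise, so the sign of the estimated $W$ carries no information.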

## Considering Panel Data

The mixed logit model above allows the preference parameters to vary across observations.
But what we really want is preference parameters that vary across people.
Remember, each person in this study was given 15 different questions to consider. 
To make preferences vary only by person, we add a `groupid`:

In [None]:
panel = mixed.copy()
panel.groupid = "person_id"
panel.maximize_loglike(stderr=True, options={"ftol": 1e-9})

- Is this "panel" model better than the naive version?  Why do you think that is?

- What is the estimated distribution of willingness to pay for a 1 dollar
  reduction in operating cost? An advantage of having a fixed price coefficient
  is that this distribution can be derived fairly easily. (If, in contrast,
  the price coefficient is random, the willingness to pay is the ratio of two
  random terms, which is more difficult to deal with.)

- What share of the population is estimated to dislike reductions in operating
  cost? To like high performance less than medium performance? Are these results
  reasonable?
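With a fixed price coefficient, both questions above reduce to properties of a normal distribution. The sketch below is one way to work them out, under assumptions that should be stated plainly: the operating cost values are those quoted earlier for the first mixed logit model (substitute the panel model's estimates when answering the questions), the price coefficient is hypothetical, and the helper names are invented.

```python
from statistics import NormalDist


def share_positive(mean: float, w: float) -> float:
    """Share of a N(mean, w**2) population with a coefficient above zero."""
    return 1.0 - NormalDist(mu=mean, sigma=abs(w)).cdf(0.0)


def wtp_distribution(b_price_scaled: float, mean_opcost: float, w_opcost: float) -> NormalDist:
    """Distribution of WTP (dollars of price per $1/month of operating cost)
    when the price coefficient is fixed and the operating cost coefficient
    is normal. The factor 1000 converts from the scaled units."""
    scale = 1000.0 / b_price_scaled
    return NormalDist(mu=mean_opcost * scale, sigma=abs(w_opcost * scale))


# Operating cost mean and W quoted earlier for the first mixed logit model.
share_dislike_savings = share_positive(mean=-0.2149, w=0.4668)
print(f"Share with a positive operating cost coefficient: {share_dislike_savings:.3f}")

# Hypothetical price coefficient, NOT an estimate from any model here.
wtp = wtp_distribution(b_price_scaled=-0.5, mean_opcost=-0.2149, w_opcost=0.4668)
print(f"WTP mean={wtp.mean:.1f}, sd={wtp.stdev:.1f}")
```

A normal distribution always places some mass on both sides of zero, so a nonzero share with the "wrong" sign is a consequence of the distributional assumption as much as of the data.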

## Simulation Noise

Let's explore the effects of simulation noise. The seed is currently set at `42`
(which is, of course, the [correct](https://en.wikipedia.org/wiki/Phrases_from_The_Hitchhiker%27s_Guide_to_the_Galaxy#Answer_to_the_Ultimate_Question_of_Life,_the_Universe,_and_Everything_(42)) 
value). Change the seed to other values to see the effect of different random 
draws on the estimation results. Try three or four seeds, to get a sense of 
how much change there is.  Also try changing the number of random draws. With
more random draws, does the effect of differing random seeds change?
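Before rerunning the full model, it can help to see simulation noise in isolation. The sketch below simulates a single mixed logit choice probability many times with different seeds and draw counts. It is a toy setup, not Larch's simulator: the attribute levels and function name are hypothetical, only one coefficient is random, and plain pseudo-random draws are assumed.

```python
import numpy as np


def simulated_prob(n_draws: int, seed: int) -> float:
    """Simulated probability of choosing alternative 0 out of three when the
    coefficient on a single attribute is N(B, W**2)."""
    rng = np.random.default_rng(seed)
    B, W = -0.2149, 0.4668              # mean and W quoted earlier in the text
    x = np.array([1.0, 2.0, 3.0])       # hypothetical attribute levels
    beta = B + W * rng.standard_normal(n_draws)
    v = beta[:, None] * x[None, :]      # utilities, shape (n_draws, 3)
    v -= v.max(axis=1, keepdims=True)   # guard against overflow in exp
    p = np.exp(v)
    p /= p.sum(axis=1, keepdims=True)   # logit probabilities for each draw
    return float(p[:, 0].mean())        # average over draws


seeds = range(20)
spread = {
    n: float(np.std([simulated_prob(n, s) for s in seeds]))
    for n in (50, 200, 2000)
}
for n, sd in spread.items():
    print(f"{n:>5} draws: std across seeds = {sd:.4f}")
```

The spread across seeds shrinks as the number of draws grows, which is the pattern to look for when repeating the experiment with the estimated model.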

## Other Specifications

Now let's return to specification issues. We have seen that the use of 
normal distributions creates unrealistic results for coefficients that 
should be the same sign for all people. Change the distributions for 
the operating cost, range, and performance coefficients to be `lx.mixtures.LogNormal` 
instead of Normal. Also, this is important: the lognormal distribution 
has support only on the positive side of zero. So, if you want to use 
a lognormal distribution for an undesirable attribute (for which all 
people have negative coefficients), then you need to use 
`lx.mixtures.NegLogNormal` instead.
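When interpreting lognormal results, keep in mind that the mean and standard deviation of the coefficient differ from the parameters of the underlying normal. The sketch below applies the standard lognormal moment formulas and checks them by simulation. It assumes the estimated parameters are the mean `m` and standard deviation `s` of the underlying normal, which should be confirmed against the Larch documentation; the numeric values and the function name are hypothetical.

```python
import numpy as np


def lognormal_moments(m: float, s: float, negative: bool = False):
    """Mean and standard deviation of exp(N(m, s**2)), optionally sign-flipped."""
    mean = np.exp(m + 0.5 * s**2)
    sd = mean * np.sqrt(np.exp(s**2) - 1.0)
    return (-mean if negative else mean), sd


# Hypothetical parameters of the underlying normal, for illustration only.
m, s = -1.5, 0.8
mean_neg, sd_neg = lognormal_moments(m, s, negative=True)

# Cross-check against simulated draws of a negative lognormal coefficient.
rng = np.random.default_rng(0)
draws = -np.exp(m + s * rng.standard_normal(1_000_000))
print(f"analytic : mean={mean_neg:.4f}, sd={sd_neg:.4f}")
print(f"simulated: mean={draws.mean():.4f}, sd={draws.std():.4f}")
```

Every simulated draw is negative, which is the property that makes the negative lognormal suitable for an undesirable attribute such as operating cost.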

- Does this model fit the data better or worse than the model with
  normal distributions, based on the log-likelihood value?

- What are the estimated mean and standard deviation of the willingness
  to pay for operating cost reductions? How do they compare to those
  from the model with normal distributions?

- Now allow the price coefficient to have a lognormal distribution. What
  is the estimated distribution for willingness to pay for operating cost
  reductions?

- Try other specifications and find the model that you think is best. 
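For the question above about a lognormal price coefficient, the ratio of two random terms is tractable in one special case. The derivation below is a sketch under two assumptions that are not stated in the text: the two coefficients are independent, and the estimated parameters $(m, s)$ are the mean and standard deviation of the underlying normals. The subscripts $o$ (operating cost) and $p$ (price) are notation introduced here.

$$
\beta_o = -\exp(m_o + s_o \eta_o), \qquad
\beta_p = -\exp(m_p + s_p \eta_p), \qquad
\eta_o, \eta_p \sim N(0,1) \text{ independent}
$$

$$
\frac{\beta_o}{\beta_p} = \exp\big((m_o - m_p) + s_o \eta_o - s_p \eta_p\big)
\quad\Longrightarrow\quad
\ln\frac{\beta_o}{\beta_p} \sim N\big(m_o - m_p,\; s_o^2 + s_p^2\big)
$$

$$
E\left[\frac{\beta_o}{\beta_p}\right] = \exp\left(m_o - m_p + \frac{s_o^2 + s_p^2}{2}\right)
$$

Under these assumptions the willingness to pay is itself lognormal. With the scaling used in this exercise, the ratio must still be multiplied by 1000 to express it in dollars of price per dollar of monthly operating cost.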