# D200, Problem Set 2: Discrete Choice Models

Due: 19 February 2026. Submit [here](https://classroom.github.com/a/Jraqcm5s) in
groups of up to 2.

Stefan Bucher

This problem set reviews classification, as discussed in the lecture,
through the lens of discrete choice modeling, a classic method in
economics.

The problem set uses the
[choice-learn](https://github.com/artefactory/choice-learn) package; see
[here](https://medium.com/artefact-engineering-and-data-science/modeling-customers-decisions-in-python-with-the-choice-learn-package-37752cb7932e)
for more background.
<!-- alternatives: PyLogit, Biogeme, torch-choice, Statsmodels, scikit-learn -->

# Problem 1: The Conditional Logit Model

Discrete choice models are built on the **Random Utility Maximization
(RUM)** framework. A decision-maker chooses the alternative with the
highest utility from a set of available options. The utility of
alternative $j$ for individual $i$ is:

$$U_{ij} = V_{ij} + \varepsilon_{ij}$$

where $V_{ij}$ is the **systematic (observable) utility** and
$\varepsilon_{ij}$ is a **random error term** capturing unobserved
factors.

The **Conditional Logit** model assumes:

1.  Utility is linear in attributes:
    $V_{ij} = \sum_k \beta_{ik} \cdot x_{jk}$
2.  Errors are i.i.d. Type I Extreme Value (Gumbel) distributed

The probability of individual $i$ choosing alternative $j$ from choice
set $\mathcal{A}$ is then given by

$$P_{ij} = \frac{\exp\left(\sum_k \beta_{ik} \cdot x_{jk}\right)}{\sum_{a \in \mathcal{A}} \exp\left(\sum_k \beta_{ik} \cdot x_{ak}\right)}$$
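The formula above can be checked numerically. The following is a minimal sketch with made-up utilities for three alternatives (the values and the helper name `logit_probabilities` are illustrative, not part of choice-learn): it computes the closed-form probabilities and compares them with simulated choice shares obtained by adding Gumbel errors and picking the utility-maximizing alternative.

```python
import numpy as np


def logit_probabilities(V):
    """Choice probabilities P_j = exp(V_j) / sum_a exp(V_a) for one individual."""
    V = np.asarray(V, dtype=float)
    # Subtracting the maximum avoids overflow and leaves the probabilities unchanged
    expV = np.exp(V - V.max())
    return expV / expV.sum()


# Hypothetical systematic utilities for three alternatives
V = np.array([1.0, 0.5, -0.2])
P = logit_probabilities(V)

# Monte Carlo check of the RUM story: draw Gumbel errors, choose the argmax
rng = np.random.default_rng(0)
eps = rng.gumbel(size=(200_000, V.size))
choices = np.argmax(V + eps, axis=1)
simulated_shares = np.bincount(choices, minlength=V.size) / len(choices)

print("closed form:", P.round(3))
print("simulated:  ", simulated_shares.round(3))
```

The two rows should agree up to simulation noise, which illustrates why the Gumbel assumption delivers the softmax form.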

## The ModeCanada Dataset

We’ll work with the **ModeCanada** dataset, which contains
transportation choices for intercity trips between Montréal and Toronto.
This is a classic dataset in choice modeling research.

**(1a)** Load the ModeCanada dataset and explore its structure:

In [3]:
from choice_learn.datasets import load_modecanada
transport_df = load_modecanada(as_frame=True)
print(f"Dataset shape: {transport_df.shape}")
display(transport_df.head(8))

Dataset shape: (15520, 11)


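|   | case | alt | choice | dist | cost | ivt | ovt | freq | income | urban | noalt |
|---|------|-----|--------|------|------|-----|-----|------|--------|-------|-------|
| 0 | 1 | train | 0 | 83 | 28.25 | 50 | 66 | 4 | 45.0 | 0 | 2 |
| 1 | 1 | car | 1 | 83 | 15.77 | 61 | 0 | 0 | 45.0 | 0 | 2 |
| 2 | 2 | train | 0 | 83 | 28.25 | 50 | 66 | 4 | 25.0 | 0 | 2 |
| 3 | 2 | car | 1 | 83 | 15.77 | 61 | 0 | 0 | 25.0 | 0 | 2 |
| 4 | 3 | train | 0 | 83 | 28.25 | 50 | 66 | 4 | 70.0 | 0 | 2 |
| 5 | 3 | car | 1 | 83 | 15.77 | 61 | 0 | 0 | 70.0 | 0 | 2 |
| 6 | 4 | train | 0 | 83 | 28.25 | 50 | 66 | 4 | 70.0 | 0 | 2 |
| 7 | 4 | car | 1 | 83 | 15.77 | 61 | 0 | 0 | 70.0 | 0 | 2 |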


The data is in **long format**: each row represents one alternative
within a choice situation. Key columns:

-   `case`: identifies each choice situation (one traveler’s decision)
-   `alt`: the transportation mode (train, air, bus, car)
-   `choice`: 1 if this alternative was chosen, 0 otherwise
-   `cost`, `ivt` (in-vehicle time), `ovt` (out-of-vehicle time), `freq`
    (frequency): alternative attributes
-   `income`: traveler characteristic (same across alternatives within a
    case)
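A few sanity checks make the long format concrete. The sketch below uses a small made-up frame with the same column layout (the values are invented, not taken from ModeCanada) and verifies that each case has exactly one chosen alternative and that the traveler characteristic is constant within a case.

```python
import pandas as pd

# Toy long-format frame mimicking the column layout above (values are made up)
toy = pd.DataFrame(
    {
        "case": [1, 1, 2, 2, 2],
        "alt": ["train", "car", "train", "air", "car"],
        "choice": [0, 1, 0, 1, 0],
        "cost": [28.0, 16.0, 30.0, 90.0, 18.0],
        "income": [45.0, 45.0, 70.0, 70.0, 70.0],
    }
)

by_case = toy.groupby("case")

# Number of alternatives offered in each choice situation
n_alternatives = by_case["alt"].nunique()

# Exactly one row per case should be flagged as chosen
n_chosen = by_case["choice"].sum()

# Shared features such as income must not vary within a case
income_is_constant = by_case["income"].nunique().eq(1)

print(n_alternatives.to_dict())
print(n_chosen.eq(1).all(), income_is_constant.all())
```

The same three checks can be run on `transport_df` directly, since it has the same column names.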

Examine a single choice situation by filtering for `case == 1`. How many
alternatives were available? Which was chosen?


In [4]:
display(transport_df[transport_df.case == 1])

alternatives = transport_df[transport_df["case"] == 1]["alt"].nunique()

print("Number of alternatives in case 1:", alternatives)

chosen_alt = transport_df.loc[(transport_df["case"] == 1) & (transport_df["choice"] == 1),"alt"]

print("Alternative chosen:", chosen_alt.tolist())


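|   | case | alt | choice | dist | cost | ivt | ovt | freq | income | urban | noalt |
|---|------|-----|--------|------|------|-----|-----|------|--------|-------|-------|
| 0 | 1 | train | 0 | 83 | 28.25 | 50 | 66 | 4 | 45.0 | 0 | 2 |
| 1 | 1 | car | 1 | 83 | 15.77 | 61 | 0 | 0 | 45.0 | 0 | 2 |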


Number of alternatives in case 1: 2
Alternative chosen: ['car']



**(1b)** The `ChoiceDataset` is choice-learn’s core data structure. It
organizes:

-   **Choices**: which alternative was selected
-   **Items features**: attributes that vary by alternative (cost, time,
    etc.)
-   **Shared features**: attributes that are constant across
    alternatives (income, etc.)
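To see what this organization amounts to, here is a rough sketch that turns a toy long-format frame into the three pieces by hand, plus an availability mask for cases where not every alternative is offered. It is only an illustration in pandas and NumPy, not choice-learn's internal implementation; the alphabetical item ordering and all variable names are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Toy long-format data (made-up values); case 1 has no "air" alternative
toy = pd.DataFrame(
    {
        "case": [1, 1, 2, 2, 2],
        "alt": ["train", "car", "train", "air", "car"],
        "choice": [0, 1, 0, 1, 0],
        "cost": [28.0, 16.0, 30.0, 90.0, 18.0],
        "income": [45.0, 45.0, 70.0, 70.0, 70.0],
    }
)

items = sorted(toy["alt"].unique())   # assumed ordering: alphabetical
cases = sorted(toy["case"].unique())

# Items features: one row per case, one column per item, NaN where unavailable
cost = toy.pivot(index="case", columns="alt", values="cost").reindex(
    index=cases, columns=items
)
available = cost.notna().to_numpy().astype(int)          # (n_choices, n_items)
items_features = cost.fillna(0.0).to_numpy()[:, :, None]  # (n_choices, n_items, 1)

# Shared features: one value per case
shared_features = (
    toy.groupby("case")["income"].first().reindex(cases).to_numpy()[:, None]
)

# Choices: index of the chosen item within `items`
chosen_alt = toy[toy["choice"] == 1].set_index("case")["alt"].reindex(cases)
choices = np.array([items.index(a) for a in chosen_alt])

print(items, choices, available, sep="\n")
```

The item index used here is the kind of integer that later arguments such as `items_indexes` refer to, so it is worth checking which ordering your actual dataset uses.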

Convert the DataFrame to a `ChoiceDataset`:

In [5]:
from choice_learn.data import ChoiceDataset

canada_dataset = ChoiceDataset.from_single_long_df(
    df=transport_df,
    items_id_column="alt",           # identifies each alternative
    choices_id_column="case",         # identifies each choice situation
    choices_column="choice",          # indicates which was chosen
    shared_features_columns=["income"],  # traveler characteristics
    items_features_columns=["cost", "freq", "ovt", "ivt"],  # alternative attributes
    choice_format="one_zero"
)

print(canada_dataset.summary())

%%% Summary of the dataset:
Number of items: 4
Number of choices: 4324
 Shared Features by Choice:
 1 shared features
 with names: (['income'],)


 Items Features by Choice:
4 items features 
 with names: (['cost', 'freq', 'ovt', 'ivt'],)



## Model Specification

**(1c)** The key modeling decision is specifying the utility function.
For ModeCanada, consider:

$$U_{ij} = \beta^{inter}_j + \beta^{cost} \cdot \text{cost}_j + \beta^{freq} \cdot \text{freq}_j + \beta^{ovt} \cdot \text{ovt}_j + \beta^{ivt}_j \cdot \text{ivt}_j + \beta^{income}_j \cdot \text{income}_i$$

**Note the subscripts:**

-   $\beta^{cost}$, $\beta^{freq}$, $\beta^{ovt}$ are **shared**
    coefficients (same effect for all modes)
-   $\beta^{ivt}_j$, $\beta^{income}_j$, $\beta^{inter}_j$ are
    **alternative-specific** (different for each mode)
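The distinction between shared and alternative-specific coefficients is easy to see in array form. The sketch below evaluates the systematic part of the utility above for one traveler; every number is made up for illustration, and the item order is hypothetical.

```python
import numpy as np

# Made-up attribute values for one traveler, one entry per alternative
cost = np.array([30.0, 60.0, 25.0, 15.0])
freq = np.array([4.0, 8.0, 6.0, 0.0])
ovt = np.array([60.0, 70.0, 50.0, 0.0])
ivt = np.array([50.0, 40.0, 90.0, 60.0])
income = 45.0  # traveler characteristic, identical across alternatives

# Shared coefficients: a single scalar applies to every alternative
beta_cost, beta_freq, beta_ovt = -0.01, 0.08, -0.04

# Alternative-specific coefficients: one entry per alternative.
# Intercept and income are normalized to zero for the first alternative.
beta_inter = np.array([0.0, 1.0, 2.0, 3.0])
beta_ivt = np.array([-0.001, -0.010, -0.015, -0.006])
beta_income = np.array([0.0, -0.06, -0.03, -0.04])

V = (
    beta_inter
    + beta_cost * cost
    + beta_freq * freq
    + beta_ovt * ovt
    + beta_ivt * ivt
    + beta_income * income
)

# Free parameters: 3 shared + 3 intercepts + 4 ivt + 3 income
n_params = 3 + 3 + 4 + 3
print(V.round(3), n_params)
```

Counting the free parameters this way gives 13, which is the number of coefficients one would expect a fitted model with this specification to report.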

Why might we want different coefficients for in-vehicle time across
modes? (Think about the experience of traveling by train vs. car
vs. plane.)


ANSWER: The experience of travelling differs across modes. For example, a driver cannot work in a car, whereas a passenger on a plane or train can. On the other hand, a plane or train means sitting next to strangers, while a car is private. Travellers may therefore prefer a car for short journeys (where there is little time to work anyway) and a plane or train for longer ones. This gives mode-specific sensitivities to time: an extra minute on a train costs less in lost productivity than an extra minute behind the wheel. Income sensitivities may also differ across modes, as wealthier travellers may be more willing to pay for faster modes such as a plane.


**(1d)** Implement and fit the Conditional Logit model from (1c) using
choice-learn’s `ConditionalLogit` class. Use the utility specification
above, with `optimizer="lbfgs"` and `get_report=True`.

**Hints:**

-   Use `add_shared_coefficient()` for coefficients that are the same
    across all alternatives, and `add_coefficients()` for
    alternative-specific ones.
-   For alternative-specific constants (intercept, income), you must
    normalize one alternative to zero. Why?


In [6]:
from choice_learn.models import ConditionalLogit

model = ConditionalLogit(optimizer="lbfgs")

# Shared
model.add_shared_coefficient(feature_name="cost", items_indexes=[0, 1, 2, 3])
model.add_shared_coefficient(feature_name="freq", items_indexes=[0, 1, 2, 3])
model.add_shared_coefficient(feature_name="ovt", items_indexes=[0, 1, 2, 3])

# Alternative-specific
model.add_coefficients(feature_name="intercept", items_indexes=[1, 2, 3])
model.add_coefficients(feature_name="ivt", items_indexes=[0, 1, 2, 3])
model.add_coefficients(feature_name="income", items_indexes=[1, 2, 3])

history = model.fit(canada_dataset, get_report=True)

print(model.report)

Using L-BFGS optimizer, setting up .fit() function
    Coefficient Name  Coefficient Estimation  Std. Err    z_value  \
0          beta_cost               -0.007464  0.002941  -2.537750   
1          beta_freq                0.075283  0.004203  17.913319   
2           beta_ovt               -0.040078  0.002365 -16.949118   
3   beta_intercept_0                1.108153  0.271242   4.085471   
4   beta_intercept_1                2.767576  0.176806  15.653175   
5   beta_intercept_2                3.250667  0.246448  13.190096   
6         beta_ivt_0                0.000294  0.004265   0.068937   
7         beta_ivt_1               -0.011630  0.001637  -7.103386   
8         beta_ivt_2               -0.015734  0.000991 -15.879027   
9         beta_ivt_3               -0.006221  0.000682  -9.116169   
10     beta_income_0               -0.064519  0.004949 -13.035753   
11     beta_income_1               -0.025767  0.002804  -9.188432   
1


One alternative-specific constant must be set to zero because only differences in utility matter. Adding the same constant to every alternative's utility cancels out in the choice probabilities, so we could add 1 (or any other number) to every constant without changing the likelihood. There would therefore be infinitely many parameter vectors that fit equally well, and the estimation would have no unique solution. Once one constant is fixed at zero, shifting all the others changes the probability of the normalized alternative, so a unique best solution exists. It also means that it does not matter which alternative is normalized: the remaining coefficients are simply measured relative to it. The same argument applies to the income coefficients, since income is constant across alternatives within a case.
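The invariance argument can be verified in a few lines. This is a small sketch with made-up utilities: adding a constant to every alternative leaves the logit probabilities unchanged, and so does subtracting the first alternative's utility, which is what the normalization does.

```python
import numpy as np


def softmax(V):
    """Logit choice probabilities for a vector of utilities."""
    expV = np.exp(V - np.max(V))
    return expV / expV.sum()


V = np.array([0.3, 1.1, -0.4, 2.0])  # made-up utilities for four alternatives

shifted = V + 5.0      # add the same constant to every alternative
normalized = V - V[0]  # fix the first alternative's utility at zero

print(softmax(V).round(4))
print(softmax(shifted).round(4))
print(softmax(normalized).round(4))
```

All three printed vectors coincide, so the data cannot distinguish between these parameterizations.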


**(1e)** Interpret the estimated coefficients:

1.  What is the sign of $\beta^{cost}$? Does this make economic sense?
2.  Compare the intercepts across modes. Which mode has the highest
    “baseline” utility?
3.  How do the income coefficients vary? What does this tell us about
    mode choice and income?


In [7]:
print(model.report[["Coefficient Name", "Coefficient Estimation", "z_value"]])

    Coefficient Name  Coefficient Estimation    z_value
0          beta_cost               -0.007464  -2.537750
1          beta_freq                0.075283  17.913319
2           beta_ovt               -0.040078 -16.949118
3   beta_intercept_0                1.108153   4.085471
4   beta_intercept_1                2.767576  15.653175
5   beta_intercept_2                3.250667  13.190096
6         beta_ivt_0                0.000294   0.068937
7         beta_ivt_1               -0.011630  -7.103386
8         beta_ivt_2               -0.015734 -15.879027
9         beta_ivt_3               -0.006221  -9.116169
10     beta_income_0               -0.064519 -13.035753
11     beta_income_1               -0.025767  -9.188432
12     beta_income_2               -0.038805 -11.675177


1. The cost coefficient is negative: the more expensive a mode is, the lower its utility. This makes economic sense, as travellers prefer a cheaper mode, all else equal.
2. Car has the highest intercept, followed by bus, then air, then train, whose intercept we normalized to zero. Car therefore has the highest baseline utility: all else being equal, people prefer the car. This could be because it is not restricted by a timetable, or because it is private transport, so the traveller does not have to interact with others.
3. They are all negative. Since they are measured relative to the train, the utility of the train rises relative to the other modes as income increases. This could be because wealthier individuals value the ability to work on the train more, as their opportunity cost of not working is higher.
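When judging which coefficients matter, the z-values in the report can be turned into approximate two-sided p-values. The sketch below assumes that `z_value` is the estimate divided by its standard error (the reported numbers are consistent with this) and uses a standard normal approximation; the helper name is illustrative.

```python
import math


def two_sided_p_value(z):
    """P(|Z| > |z|) for a standard normal Z."""
    return math.erfc(abs(z) / math.sqrt(2.0))


# (estimate, standard error) copied from the report above
reported = {
    "beta_cost": (-0.007464, 0.002941),
    "beta_ivt_0": (0.000294, 0.004265),
}

p_values = {}
for name, (estimate, std_err) in reported.items():
    z = estimate / std_err
    p_values[name] = two_sided_p_value(z)
    print(f"{name}: z = {z:.3f}, p = {p_values[name]:.4f}")
```

Under this approximation the cost coefficient is significant at the 5% level, while `beta_ivt_0` is statistically indistinguishable from zero.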


**(1f)** **Price Elasticity** measures how choice probabilities change
with price. For the logit model:

$$\eta_{jj} = \frac{\partial P_{ij}}{\partial p_j} \cdot \frac{p_j}{P_{ij}} = \beta^{cost} \cdot p_j \cdot (1 - P_{ij})$$

This is the **own-price elasticity**. Compute it for the car alternative
at the mean values.
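Before applying the formula to the fitted model, it can be checked against a numerical derivative. The following sketch uses made-up prices, a made-up cost coefficient, and a hypothetical index `j` for the alternative of interest; it compares the closed-form elasticity with a central finite difference.

```python
import numpy as np


def softmax(V):
    expV = np.exp(V - np.max(V))
    return expV / expV.sum()


beta_cost = -0.01                                 # made-up cost coefficient
prices = np.array([30.0, 60.0, 25.0, 15.0])       # made-up prices
other_utility = np.array([0.0, 0.5, -0.2, 1.0])   # everything except cost
j = 3                                             # alternative of interest


def probabilities(p):
    return softmax(other_utility + beta_cost * p)


P = probabilities(prices)

# Closed-form own-price elasticity: beta_cost * p_j * (1 - P_j)
analytic = beta_cost * prices[j] * (1.0 - P[j])

# Central finite difference of P_j with respect to p_j, scaled by p_j / P_j
h = 1e-4
up, down = prices.copy(), prices.copy()
up[j] += h
down[j] -= h
dP_dp = (probabilities(up)[j] - probabilities(down)[j]) / (2.0 * h)
numeric = dP_dp * prices[j] / P[j]

print(f"analytic: {analytic:.6f}, numeric: {numeric:.6f}")
```

Agreement between the two numbers confirms the derivative $\partial P_{ij} / \partial p_j = \beta^{cost} P_{ij} (1 - P_{ij})$ that underlies the formula.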


In [9]:
# car = index 3
all_probabilities = model.predict_probas(canada_dataset)
mean_probability_car = all_probabilities[:, 3].numpy().mean()

# mean cost of car
mean_cost = transport_df.loc[transport_df['alt'] == 'car', 'cost'].mean()

# cost coefficient
cost_coefficient = model.trainable_weights[0].numpy()[0, 0]

# Calculate own-price elasticity
elasticity = cost_coefficient * mean_cost * (1 - mean_probability_car)
print(f"Own-price elasticity for car: {elasticity}")

Own-price elasticity for car: -0.4074856285836466



# Problem 2: RUMnet — Neural Network Choice Models

The Conditional Logit assumes utility is *linear* in attributes.
**RUMnet** (Aouad & Désir, 2022) relaxes this assumption using neural
networks while maintaining the RUM framework.

**(2a)** For this problem, we’ll use the more complex [**Expedia** hotel
booking dataset](https://www.kaggle.com/c/expedia-personalized-sort).
First download `train.csv` from Kaggle and save it to your Python
environment’s `choice_learn/datasets/data/expedia.csv` (if the path is
wrong, `choice_learn` will tell you the exact location in a
`FileNotFoundError`).

Load the dataset using
`load_expedia(as_frame=False, preprocessing="rumnet")`, keep only the
first 5000 choices for speed, and split 80/20 into training and test
sets. Explore the dataset structure — how many choices, items, and
features does it have? What do the choice set sizes look like?
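One way to set up the 80/20 split is to work with index arrays. This is a minimal sketch: the random shuffle and the seed are choices made for the example (a plain first-80% split would also satisfy the task), and it assumes the loaded dataset can be indexed with integer arrays.

```python
import numpy as np

n_total = 5000               # number of choices kept for speed
n_train = (n_total * 4) // 5  # 80% for training

rng = np.random.default_rng(42)  # arbitrary seed for reproducibility
shuffled = rng.permutation(n_total)
train_idx, test_idx = shuffled[:n_train], shuffled[n_train:]

# Assumption: the dataset object supports indexing such as
#   train_dataset = dataset[train_idx]
#   test_dataset = dataset[test_idx]
print(len(train_idx), len(test_idx))
```

Fixing the seed keeps the split identical across runs, so the test losses in the later parts are comparable between models.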

**(2b)** Write down a sensible model specification for the Conditional
Logit model for the Expedia dataset, for example using the hotel
features: log(price), star rating, review, whether the hotel is a brand,
location desirability scores. You may also want to include hotel fixed
effects. Fit your model and report the cross-entropy loss on the test
data using TensorFlow’s `tf.keras.losses.CategoricalCrossentropy`.
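For intuition about the number being reported, the cross-entropy loss is the average negative log-probability that the model assigns to the alternative actually chosen. The NumPy sketch below uses made-up probabilities and a hypothetical helper `cross_entropy`; with one-hot labels and default settings, the Keras loss should give approximately the same value, up to its internal clipping.

```python
import numpy as np


def cross_entropy(probabilities, chosen, eps=1e-12):
    """Mean negative log-probability of the chosen alternatives."""
    probabilities = np.asarray(probabilities, dtype=float)
    chosen = np.asarray(chosen, dtype=int)
    p_chosen = probabilities[np.arange(len(chosen)), chosen]
    # Clip to avoid log(0) when a chosen alternative gets zero probability
    return float(-np.mean(np.log(np.clip(p_chosen, eps, 1.0))))


# Made-up predicted probabilities for three choices over three items
probs = np.array(
    [
        [0.7, 0.2, 0.1],
        [0.1, 0.1, 0.8],
        [0.3, 0.4, 0.3],
    ]
)
chosen = np.array([0, 2, 0])  # index of the item actually chosen

loss = cross_entropy(probs, chosen)
uniform_loss = cross_entropy(np.full((3, 3), 1.0 / 3.0), chosen)

print(f"model: {loss:.4f}, uniform guess: {uniform_loss:.4f}")
```

The uniform-guess loss, which equals the log of the choice set size, is a useful baseline: a fitted model should come in below it on the test data.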

**(2c)** Display the resulting parameter estimates and interpret them.
What is the sign of the price coefficient? Which features matter most?

**(2d)** Now fit the **RUMnet** model shipped with `choice_learn` to the
Expedia dataset. The dataset has 46 product features and 84 customer
features. Report the cross-entropy loss on the test data and compare it
to the Conditional Logit.

**(2e)** Discuss: What are the tradeoffs between Conditional Logit and
RUMnet?