# Demo of Biogeme using Toyota dataset

For the sake of comparison, this notebook uses Biogeme to perform maximum simulated likelihood estimation (MSLE) on the same data.

This demo uses the dataset that was made available by Kenneth Train at https://eml.berkeley.edu/~train/ec244ps.html

The data represent consumers' choices among vehicles in stated preference experiments. They come from a study that Kenneth Train conducted for Toyota and GM to help them assess the potential marketability of electric and hybrid vehicles, before hybrids were introduced.

We begin by performing the necessary imports:

In [1]:
import os
os.environ["CUDA_VISIBLE_DEVICES"]="0"

import logging
import time
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Fix random seed for reproducibility
np.random.seed(42)

# Load Toyota dataset

About the data:

In each choice experiment, the respondent was presented with three vehicles, with the price and other attributes of each vehicle described. The respondent was asked to state which of the three vehicles he/she would buy if these vehicles were the only ones available in the market. There are 100 respondents in our dataset, which, to reduce estimation time, is a subset of the full dataset of 500 respondents. Each respondent was presented with 15 choice experiments, and most respondents answered all 15. The attributes of the vehicles were varied over experiments, both for a given respondent and over respondents. The attributes are: price, operating cost in dollars per month, engine type (gas, electric, or hybrid), range if electric (in hundreds of miles between recharging), and the performance level of the vehicle (high, medium, or low). The performance level was described in terms of top speed and acceleration, and these descriptions did not vary for each level; for example, "High" performance was described as having a top speed of 100 mph and 12 seconds to reach 60 mph, and this description was the same for all "high" performance vehicles.

A detailed description of the data is provided by Kenneth Train at https://eml.berkeley.edu/~train/ec244ps.html

In [3]:
column_names = ["IndID","ObsID", "Chosen", "Price", "OperCost", "Range", "EV", "Gas", "Hybrid", "HighPerf", "MedHighPerf"]
df = pd.read_csv("data/toyota.txt", delimiter=" ", names=column_names)

df["Price"] = df["Price"]/10000     # scale price to be in tens of thousands of dollars.
df["OperCost"] = df["OperCost"]/10  # scale operating cost to be in tens of dollars.

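# Add the alternative index within each menu (AltID) and the menu index
# within each respondent (MenuID); both are 1-based
altID = []
menuID = []
curr_n = -1  # IndID of the previous row
curr_o = -1  # ObsID of the previous row
curr_a = -1  # running alternative counter within the current menu
curr_t = -1  # running menu counter within the current respondent
for n,o in df[["IndID", "ObsID"]].values:
    if n != curr_n:
        # new respondent: restart the menu counter
        curr_t = 0
    if o != curr_o:
        # new menu: advance the menu counter and restart the alternative counter
        curr_t += 1
        curr_a = 0
    
    curr_a += 1
    curr_n = n
    curr_o = o
    
    altID.append(curr_a)
    menuID.append(curr_t)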
    
df["AltID"] = altID
df["MenuID"] = menuID

df.head()

Unnamed: 0,IndID,ObsID,Chosen,Price,OperCost,Range,EV,Gas,Hybrid,HighPerf,MedHighPerf,AltID,MenuID
0,1,1,0,4.6763,4.743,0.0,0,0,1,0,0,1,1
1,1,1,1,5.7209,2.743,1.3,1,0,0,1,1,2,1
2,1,1,0,8.796,3.241,1.2,1,0,0,0,1,3,1
3,1,2,1,3.3768,0.489,1.3,1,0,0,1,1,1,2
4,1,2,0,9.0336,3.019,0.0,0,0,1,0,1,2,2
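
The row-by-row loop above can also be written with pandas group operations. The following is only a sketch on a small made-up frame (the values in `toy` are hypothetical). It assumes, as the loop does, that the rows of each menu are stored contiguously and that an `ObsID` value is not reused within the same respondent.

```python
import pandas as pd

# Toy long-format frame (hypothetical values): two respondents, three menus
toy = pd.DataFrame({
    "IndID": [1, 1, 1, 1, 1, 1, 2, 2, 2],
    "ObsID": [1, 1, 1, 2, 2, 2, 3, 3, 3],
})

# AltID: 1-based position of each row within its (IndID, ObsID) menu
toy["AltID"] = toy.groupby(["IndID", "ObsID"]).cumcount() + 1

# MenuID: 1-based rank of each menu within its respondent, in order of appearance
toy["MenuID"] = toy.groupby("IndID")["ObsID"].transform(
    lambda s: pd.factorize(s)[0] + 1
)

print(toy)
```

On the real data, the same two statements would be applied to `df` instead of `toy`.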


In [4]:
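# convert to wide format: one row per menu, assuming that every menu
# occupies exactly three consecutive rows (one per alternative)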
data_wide = []
for ix in range(0,len(df),3):
    new_row = df.loc[ix][["IndID","ObsID","MenuID"]].values.tolist()
    new_row += df.loc[ix][["Price","OperCost","Range","EV","Hybrid","HighPerf","MedHighPerf"]].values.tolist()
    new_row += df.loc[ix+1][["Price","OperCost","Range","EV","Hybrid","HighPerf","MedHighPerf"]].values.tolist()
    new_row += df.loc[ix+2][["Price","OperCost","Range","EV","Hybrid","HighPerf","MedHighPerf"]].values.tolist()
    choice = np.argmax([df.loc[ix]["Chosen"], df.loc[ix+1]["Chosen"], df.loc[ix+2]["Chosen"]])
    new_row += [choice]
    #print(new_row)
    data_wide.append(new_row)
    
column_names = ["IndID","ObsID","MenuID",
                "Price1","OperCost1","Range1","EV1","Hybrid1","HighPerf1","MedHighPerf1",
                "Price2","OperCost2","Range2","EV2","Hybrid2","HighPerf2","MedHighPerf2",
                "Price3","OperCost3","Range3","EV3","Hybrid3","HighPerf3","MedHighPerf3",
                "Chosen"]
df_wide = pd.DataFrame(data_wide, columns=column_names)
df_wide['ones'] = np.ones(len(data_wide)).astype(int)
df_wide.head()

Unnamed: 0,IndID,ObsID,MenuID,Price1,OperCost1,Range1,EV1,Hybrid1,HighPerf1,MedHighPerf1,...,MedHighPerf2,Price3,OperCost3,Range3,EV3,Hybrid3,HighPerf3,MedHighPerf3,Chosen,ones
0,1.0,1.0,1.0,4.6763,4.743,0.0,0.0,1.0,0.0,0.0,...,1.0,8.796,3.241,1.2,1.0,0.0,0.0,1.0,1,1
1,1.0,2.0,2.0,3.3768,0.489,1.3,1.0,0.0,1.0,1.0,...,1.0,5.7099,2.716,1.8,1.0,0.0,1.0,1.0,0,1
2,1.0,3.0,3.0,4.5534,1.072,1.2,1.0,0.0,0.0,0.0,...,1.0,3.4031,6.062,0.0,0.0,0.0,0.0,0.0,0,1
3,1.0,4.0,4.0,0.8639,2.216,0.0,0.0,0.0,0.0,1.0,...,0.0,6.9325,2.884,1.6,1.0,0.0,0.0,0.0,1,1
4,1.0,5.0,5.0,5.2145,3.975,0.0,0.0,1.0,0.0,1.0,...,1.0,2.1282,5.272,0.0,0.0,0.0,0.0,1.0,0,1
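
As an alternative to the explicit loop, the long-to-wide reshaping can be sketched with `set_index` and `unstack`. The frame `long_df` below is made-up toy data with a single attribute; on the real data the list of attribute columns would be extended. Unlike the loop, this version keys on `AltID` rather than on row position, so it does not rely on every menu having exactly three consecutive rows.

```python
import pandas as pd

# Toy long-format data (hypothetical values): one respondent, two menus
long_df = pd.DataFrame({
    "IndID":  [1, 1, 1, 1, 1, 1],
    "MenuID": [1, 1, 1, 2, 2, 2],
    "AltID":  [1, 2, 3, 1, 2, 3],
    "Price":  [4.7, 5.7, 8.8, 3.4, 9.0, 5.7],
    "Chosen": [0, 1, 0, 1, 0, 0],
})

# One row per menu, one column per (attribute, alternative) pair
wide = long_df.set_index(["IndID", "MenuID", "AltID"])[["Price"]].unstack("AltID")
wide.columns = [f"{name}{alt}" for name, alt in wide.columns]

# 0-based index of the chosen alternative, matching np.argmax in the loop
chosen = long_df[long_df["Chosen"] == 1].set_index(["IndID", "MenuID"])["AltID"] - 1
wide["Chosen"] = chosen  # aligned on the (IndID, MenuID) index

wide = wide.reset_index()
print(wide)
```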


# Mixed Logit specification

In [4]:
import biogeme.biogeme as bio
from biogeme.expressions import Beta, bioLinearUtility, DefineVariable, Plus, Times, bioDraws, PanelLikelihoodTrajectory, MonteCarlo, log
import biogeme.models as models
import biogeme.database as db
import biogeme.messaging as msg
import biogeme.optimization as opt
import biogeme.results as res

database = db.Database('choiceset', df_wide)

# The data are organized as panel data. The variable IndID identifies each individual.
database.panel("IndID")

globals().update(database.variables)

In [5]:
# Parameters to be estimated
B_PRICE = Beta('B_PRICE', 0, None, None, 0)

B_OperCost = Beta('B_OperCost', 0, None, None, 0)
B_OperCost_S = Beta('B_OperCost_S', 1, None, None, 0)
B_OperCost_RND = B_OperCost + B_OperCost_S * bioDraws('B_OperCost_RND', 'NORMAL_ANTI')

B_Range = Beta('B_Range', 0, None, None, 0)
B_Range_S = Beta('B_Range_S', 1, None, None, 0)
B_Range_RND = B_Range + B_Range_S * bioDraws('B_Range_RND', 'NORMAL_ANTI')

B_EV = Beta('B_EV', 0, None, None, 0)
B_EV_S = Beta('B_EV_S', 1, None, None, 0)
B_EV_RND = B_EV + B_EV_S * bioDraws('B_EV_RND', 'NORMAL_ANTI')

B_Hybrid = Beta('B_Hybrid', 0, None, None, 0)
B_Hybrid_S = Beta('B_Hybrid_S', 1, None, None, 0)
B_Hybrid_RND = B_Hybrid + B_Hybrid_S * bioDraws('B_Hybrid_RND', 'NORMAL_ANTI')

B_HighPerf = Beta('B_HighPerf', 0, None, None, 0)
B_HighPerf_S = Beta('B_HighPerf_S', 1, None, None, 0)
B_HighPerf_RND = B_HighPerf + B_HighPerf_S * bioDraws('B_HighPerf_RND', 'NORMAL_ANTI')

B_MedHighPerf = Beta('B_MedHighPerf', 0, None, None, 0)
B_MedHighPerf_S = Beta('B_MedHighPerf_S', 1, None, None, 0)
B_MedHighPerf_RND = B_MedHighPerf + B_MedHighPerf_S * bioDraws('B_MedHighPerf_RND', 'NORMAL_ANTI')
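
Each random coefficient above has the form mean + scale × draw, where the draw is standard normal and `'NORMAL_ANTI'` requests antithetic normal draws. The NumPy sketch below illustrates that construction outside Biogeme. The values of `mean` and `sd` are hypothetical, and the pairing scheme shown is one common way to build antithetic draws, not necessarily how Biogeme generates them internally.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical values for a mean and a scale (standard deviation) parameter
mean, sd = -0.1, 0.3
n_draws = 1000  # even, so that every draw has an antithetic partner

# Antithetic normal draws: each draw z is paired with its mirror image -z
half = rng.standard_normal(n_draws // 2)
z = np.concatenate([half, -half])

# Random coefficient: one value of the coefficient per draw
beta_rnd = mean + sd * z

print(beta_rnd.mean(), beta_rnd.std())
```

Because the draws cancel in pairs, the sample mean of `beta_rnd` equals `mean` up to floating-point error, which is the variance-reduction idea behind antithetic draws. Since the normal distribution is symmetric, the sign of the scale parameter does not matter: a negative estimate of a `_S` parameter implies the same distribution as its absolute value.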

In [6]:
# Definition of the utility functions
V1 = B_PRICE*Price1 + B_OperCost_RND*OperCost1 + B_Range_RND*Range1 + B_EV_RND*EV1 + B_Hybrid_RND*Hybrid1 + B_HighPerf_RND*HighPerf1 + B_MedHighPerf_RND*MedHighPerf1
V2 = B_PRICE*Price2 + B_OperCost_RND*OperCost2 + B_Range_RND*Range2 + B_EV_RND*EV2 + B_Hybrid_RND*Hybrid2 + B_HighPerf_RND*HighPerf2 + B_MedHighPerf_RND*MedHighPerf2
V3 = B_PRICE*Price3 + B_OperCost_RND*OperCost3 + B_Range_RND*Range3 + B_EV_RND*EV3 + B_Hybrid_RND*Hybrid3 + B_HighPerf_RND*HighPerf3 + B_MedHighPerf_RND*MedHighPerf3

# Associate utility functions with the numbering of alternatives
V = {0: V1, 1: V2, 2: V3}

# Associate the availability conditions with the alternatives
av = {0: ones, 1: ones, 2: ones}

# Mixed Logit model in Biogeme (MSLE)

In [7]:
# Conditional on the random parameters, the likelihood of one observation is
# given by the logit model (called the kernel)
obsprob = models.logit(V, av, Chosen)

# Conditional on the random parameters, the likelihood of all observations for
# one individual (the trajectory) is the product of the likelihoods of
# each observation.
condprobIndiv = PanelLikelihoodTrajectory(obsprob)

# We integrate over the random parameters using Monte-Carlo
logprob = log(MonteCarlo(condprobIndiv))
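
The three steps above (logit kernel, product over each individual's observations, Monte-Carlo average over draws) can be written out in plain NumPy. This is a minimal sketch on random toy data, not Biogeme's implementation; the function `simulated_loglik`, its argument names, and the array layout are all assumptions made for illustration.

```python
import numpy as np

def simulated_loglik(X, y, ind, mean, sd, z):
    """Simulated log-likelihood of a panel mixed logit (illustrative only).

    X    : (n_obs, n_alts, n_attrs) attributes of each alternative
    y    : (n_obs,) 0-based index of the chosen alternative
    ind  : (n_obs,) 0-based respondent index of each observation
    mean : (n_attrs,) means of the coefficients
    sd   : (n_attrs,) scales of the coefficients (0 for a fixed coefficient)
    z    : (n_ind, n_draws, n_attrs) standard normal draws per respondent
    """
    n_ind, n_draws, _ = z.shape
    beta = mean + sd * z                          # (n_ind, n_draws, n_attrs)
    # Utilities for every observation, draw and alternative
    V = np.einsum("oak,odk->oda", X, beta[ind])   # (n_obs, n_draws, n_alts)
    V = V - V.max(axis=2, keepdims=True)          # numerical stability
    expV = np.exp(V)
    P = expV / expV.sum(axis=2, keepdims=True)    # logit kernel
    p_chosen = P[np.arange(len(y)), :, y]         # (n_obs, n_draws)
    # Product over each respondent's observations, computed as a sum of logs
    log_traj = np.zeros((n_ind, n_draws))
    np.add.at(log_traj, ind, np.log(p_chosen))
    # Average over draws, take the log, and sum over respondents
    return np.log(np.exp(log_traj).mean(axis=1)).sum()

# Toy data (hypothetical): 4 respondents, 5 menus each, 3 alternatives, 2 attributes
rng = np.random.default_rng(0)
n_ind, n_menus, n_alts, n_attrs, n_draws = 4, 5, 3, 2, 50
ind = np.repeat(np.arange(n_ind), n_menus)
X = rng.normal(size=(n_ind * n_menus, n_alts, n_attrs))
y = rng.integers(0, n_alts, size=n_ind * n_menus)
z = rng.standard_normal((n_ind, n_draws, n_attrs))

ll = simulated_loglik(X, y, ind, np.array([-0.5, 0.2]), np.array([0.3, 0.1]), z)
print(ll)
```

Note that the draws are indexed by respondent, not by observation: the same coefficient values apply to all menus answered by one person, which is what makes this a panel specification. A fixed coefficient such as `B_PRICE` corresponds to a zero entry in `sd`.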

In [8]:
%%time

# Define level of verbosity
logger = msg.bioMessage()
# logger.setSilent()
# logger.setWarning()
# logger.setGeneral()
logger.setDetailed()
# logger.setDebug()

# Create the Biogeme object
biogeme = bio.BIOGEME(database, logprob, numberOfDraws=1000)
biogeme.modelName = 'fakeData'

# Estimate the parameters.
results = biogeme.estimate()
pandasResults = results.getEstimatedParameters()
print(pandasResults)

[17:45:40] < General >   Remove 3 unused variables from the database as only 24 are used.
[17:45:40] < Detailed >  It is suggested to scale the following variables.
[17:45:40] < Detailed >  Multiply IndID by	0.01 because the largest (abs) value is	100.0
[17:45:40] < Detailed >  To remove this feature, set the parameter suggestScales to False when creating the BIOGEME object.
[17:45:40] < General >   *** Initial values of the parameters are obtained from the file __fakeData.iter
[17:45:40] < Detailed >  Parameter values restored from __fakeData.iter
[17:45:40] < Detailed >  Log likelihood (N = 100):  -1558.752
[17:45:40] < Detailed >  ** Optimization: Newton with trust region for simple bounds
[17:45:42] < General >   Log likelihood (N = 100):  -1558.752 Gradient norm:      6e+02 Hessian norm:       1e+03 
[17:45:43] < Detailed >  Log likelihood (N = 100):  -1396.821
[17:45:45] < General >   Log likelihood (N = 100):  -1396.821 Gradient norm:      3e+02 Hessian norm:       2e+03 
[17:45

In [10]:
print(results)


Results for model fakeData
Output file (HTML):			fakeData~00.html
Nbr of parameters:		13
Sample size:			100
Observations:			1484
Excluded data:			0
Init log likelihood:		-1558.752
Final log likelihood:		-1331.894
Likelihood ratio test (init):		453.7161
Rho square (init):			0.146
Rho bar square (init):			0.137
Akaike Information Criterion:	2689.788
Bayesian Information Criterion:	2723.655
Final gradient norm:		0.003149316
B_EV           : -1.73[0.37 -4.67 3.01e-06][0.407 -4.25 2.11e-05]
B_EV_S         : 0.952[0.326 2.92 0.00351][0.367 2.59 0.0095]
B_HighPerf     : 0.103[0.106 0.971 0.331][0.105 0.977 0.329]
B_HighPerf_S   : 0.439[0.153 2.87 0.00413][0.152 2.9 0.00378]
B_Hybrid       : 0.49[0.162 3.02 0.00253][0.165 2.96 0.00305]
B_Hybrid_S     : 0.83[0.145 5.73 1.03e-08][0.145 5.73 9.76e-09]
B_MedHighPerf  : 0.52[0.115 4.54 5.62e-06][0.113 4.61 4e-06]
B_MedHighPerf_S: 0.583[0.141 4.13 3.62e-05][0.149 3.91 9.13e-05]
B_OperCost     : -0.137[0.0533 -2.57 0.0101][0.053 -2.58 0.00975]
B_Ope
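
The goodness-of-fit figures in the summary above can be checked by hand from the two log-likelihood values, using the standard definitions of the likelihood ratio statistic, rho-square, AIC and BIC. The sketch below copies the inputs from the printed output. Using the sample size (100 individuals) rather than the number of observations in the BIC is an inference from the fact that it reproduces the reported value, not something confirmed from the Biogeme source.

```python
import math

# Values copied from the printed results
ll_init = -1558.752    # initial log likelihood
ll_final = -1331.894   # final log likelihood
k = 13                 # number of estimated parameters
n = 100                # sample size (individuals)

lr = 2 * (ll_final - ll_init)              # likelihood ratio test (init)
rho2 = 1 - ll_final / ll_init              # rho square (init)
rho2_bar = 1 - (ll_final - k) / ll_init    # rho bar square (init)
aic = 2 * k - 2 * ll_final                 # Akaike Information Criterion
bic = k * math.log(n) - 2 * ll_final       # Bayesian Information Criterion

print(round(lr, 3), round(rho2, 3), round(rho2_bar, 3), round(aic, 3), round(bic, 3))
```

Small differences in the last digits relative to the printed summary are expected, because the log-likelihood values used here are already rounded to three decimals.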