<a href="https://colab.research.google.com/github/arteagac/xlogit/blob/master/examples/mixed_logit_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Mixed Logit

This notebook provides a step-by-step guide to estimating mixed logit models with the xlogit package.

## Install `xlogit` package

First, install the `xlogit` package with pip, as shown below:

In [1]:
!pip install git+https://github.com/arteagac/xlogit

Collecting git+https://github.com/arteagac/xlogit
  Cloning https://github.com/arteagac/xlogit to /tmp/pip-req-build-ihyvxmo1
  Running command git clone -q https://github.com/arteagac/xlogit /tmp/pip-req-build-ihyvxmo1
Building wheels for collected packages: xlogit
  Building wheel for xlogit (setup.py) ... done
  Created wheel for xlogit: filename=xlogit-0.0.1-cp36-none-any.whl size=12856 sha256=68d902165a0304cfd255e83c528ed72eb9e31f236f974e2d9bedcab6300ddb5b
  Stored in directory: /tmp/pip-ephem-wheel-cache-eb2t16jm/wheels/64/50/8d/a97e0500aac20b521a2896234d6598045323a7d0daca37648a
Successfully built xlogit
Installing collected packages: xlogit
Successfully installed xlogit-0.0.1


## Electricity Dataset

For the first example, we use the Electricity dataset from the study https://escholarship.org/content/qt1900p96t/qt1900p96t.pdf. This dataset is widely used in examples for R's mlogit package and can be downloaded in long format from https://raw.githubusercontent.com/arteagac/xlogit/master/examples/data/electricity_long.csv. The data come from a stated choice experiment conducted to analyse customers' preferences among four hypothetical electricity suppliers.

### Read data

In [2]:
import pandas as pd
import numpy as np
df = pd.read_csv("https://raw.githubusercontent.com/arteagac/xlogit/master/examples/data/electricity_long.csv")
df

Unnamed: 0,choice,id,alt,pf,cl,loc,wk,tod,seas,chid
0,0,1,1,7,5,0,1,0,0,1
1,0,1,2,9,1,1,0,0,0,1
2,0,1,3,0,0,0,0,0,1,1
3,1,1,4,0,5,0,1,1,0,1
4,0,1,1,7,0,0,1,0,0,2
...,...,...,...,...,...,...,...,...,...,...
17227,0,361,4,0,1,1,0,0,1,4307
17228,1,361,1,9,0,0,1,0,0,4308
17229,0,361,2,7,0,0,0,0,0,4308
17230,0,361,3,0,1,0,1,0,1,4308


The dataset is in panel form, with each individual reporting preferences for up to 12 choice situations. Since not all individuals responded to all 12 situations, the dataset is an unbalanced panel. In total, 361 individuals were interviewed, yielding 4,308 choice situations. See https://cran.r-project.org/web/packages/mlogit/vignettes/e3mxlogit.html for more details on the attributes and the choice analyses.
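
A quick way to confirm that a panel is unbalanced is to count the distinct choice situations per individual. The sketch below uses a small made-up frame that only borrows the `id` and `chid` column names from the Electricity data; the values are illustrative and not taken from the dataset.

```python
import pandas as pd

# Toy long-format frame (hypothetical values): two rows per choice situation.
toy = pd.DataFrame({
    "id":   [1, 1, 1, 1, 2, 2],
    "chid": [1, 1, 2, 2, 3, 3],
})

# Number of distinct choice situations answered by each individual.
situations_per_person = toy.groupby("id")["chid"].nunique()
print(situations_per_person.to_dict())

# The panel is balanced only if every individual has the same count.
is_balanced = situations_per_person.nunique() == 1
print(is_balanced)
```

Applying the same `groupby` to the `df` loaded above should show at most 12 situations per individual, with some individuals below that.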

### Fit the model

The data must be in long format. The user inputs required to fit the model are as follows:

1.   `X`: array with the dataframe columns listed in `varnames`
2.   `y`: dataframe column containing the choice outcome
3.   `varnames`: list of the names of all explanatory variables to include in the model
4.   `isvars`: list of the individual-specific variables in `varnames`
5.   `alts`: dataframe column containing the alternative ids
6.   `randvars`: dictionary mapping each variable with a random coefficient to its mixing distribution. Possible distributions are `'n'` (normal), `'u'` (uniform), `'ln'` (lognormal), `'tn'` (truncated normal), and `'t'` (triangular)
7.   `panels`: dataframe column containing the unique individual ids
8.   `n_draws`: number of random draws for the coefficients (default value is 100)

The `fit` method of the `MixedLogit` class is called to fit the model. The results can be viewed with `model.summary()`.
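
To make the roles of `randvars` and `n_draws` more concrete, the sketch below simulates choice probabilities for a single choice situation with one normally distributed coefficient. It is a plain NumPy illustration of the general simulated-probability idea, not xlogit's internal code; the helper name `simulated_probs`, the attribute values, and the `mean` and `sd` parameters are all assumptions made for the example.

```python
import numpy as np

def simulated_probs(x, mean, sd, n_draws=100, seed=0):
    """Average logit probabilities over draws of a normal coefficient.

    x holds the attribute value of each alternative in one choice situation.
    """
    rng = np.random.default_rng(seed)
    # One coefficient per draw: beta_r = mean + sd * standard normal draw.
    betas = mean + sd * rng.standard_normal(n_draws)
    # Utilities for every draw (rows) and alternative (columns).
    utilities = betas[:, None] * np.asarray(x, dtype=float)[None, :]
    # Subtract the row maximum before exponentiating for numerical stability.
    utilities -= utilities.max(axis=1, keepdims=True)
    exp_u = np.exp(utilities)
    logit_probs = exp_u / exp_u.sum(axis=1, keepdims=True)
    # The simulated probability is the mean over draws.
    return logit_probs.mean(axis=0)

# Hypothetical attribute values for four alternatives.
probs = simulated_probs(x=[7.0, 9.0, 0.0, 0.0], mean=-1.0, sd=0.25, n_draws=600)
print(probs.round(3))
```

More draws reduce the simulation noise in these averages at the cost of more computation, which is the trade-off controlled by `n_draws`.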

In [3]:
varnames = ["pf", "cl", "loc", "wk", "tod", "seas"]
X = df[varnames].values
y = df['choice'].values

from xlogit import MixedLogit
model = MixedLogit()
model.fit(X, y, 
          varnames, 
          alts=df['alt'], 
          randvars={'pf': 'n','cl':'n','loc':'n','wk':'n','tod':'n','seas':'n'}, 
          panels=df.id.values,
          n_draws=600)
model.summary()

Estimation with GPU processing enabled.
Optimization terminated successfully.
         Current function value: 3888.413414
         Iterations: 46
         Function evaluations: 51
         Gradient evaluations: 51
Estimation time= 14.8 seconds
---------------------------------------------------------------------------
Coefficient              Estimate      Std.Err.         z-val         P>|z|
---------------------------------------------------------------------------
pf                     -0.9996286     0.0331488   -30.1557541     9.98e-100 ***
cl                     -0.2355334     0.0220401   -10.6865870      1.97e-22 ***
loc                     2.2307891     0.1164263    19.1605300      5.64e-56 ***
wk                      1.6251657     0.0918755    17.6887855      6.85e-50 ***
tod                    -9.6067367     0.3112721   -30.8628296     2.36e-102 ***
seas                   -9.7892800     0.2913063   -33.6047603     2.81e-112 ***
sd.pf                   0.2357813     0.0181892

The xlogit estimates are similar to those obtained with R's mlogit package (https://cran.r-project.org/web/packages/mlogit/vignettes/e3mxlogit.html). With GPU-enabled estimation, xlogit fit the model in under 15 seconds in the run above, significantly faster than open-source packages such as mlogit and Biogeme. This speed is useful when fitting models to large datasets with many random coefficients.

## Fishing Dataset

The second example uses a revealed-preference dataset on the fishing mode choice of 1,182 individuals. This dataset is also openly available and is used in the mlogit examples. It can be downloaded in long format from https://raw.githubusercontent.com/arteagac/xlogit/master/examples/data/fishing_long.csv. More information on the dataset can be found at http://www2.uaem.mx/r-mirror/web/packages/mlogit/vignettes/mlogit.pdf

### Read data

In [4]:
import pandas as pd
df = pd.read_csv("https://raw.githubusercontent.com/arteagac/xlogit/master/examples/data/fishing_long.csv")
df

Unnamed: 0,id,alt,choice,income,price,catch
0,1,beach,0,7083.33170,157.930,0.0678
1,1,boat,0,7083.33170,157.930,0.2601
2,1,charter,1,7083.33170,182.930,0.5391
3,1,pier,0,7083.33170,157.930,0.0503
4,2,beach,0,1249.99980,15.114,0.1049
...,...,...,...,...,...,...
4723,1181,pier,0,416.66668,36.636,0.4522
4724,1182,beach,0,6250.00130,339.890,0.2537
4725,1182,boat,1,6250.00130,235.436,0.6817
4726,1182,charter,0,6250.00130,260.436,2.3014


Four alternatives are considered in the dataset: beach, boat, charter, and pier. There are two alternative-specific variables, `price` and `catch`, and one individual-specific variable, `income`.
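
The distinction can be checked directly: an individual-specific variable takes a single value across all alternatives of one individual, while an alternative-specific variable varies. A minimal sketch on a toy frame that borrows the Fishing column names (the values are made up and only two alternatives are shown per individual):

```python
import pandas as pd

# Toy long-format data with hypothetical values.
toy = pd.DataFrame({
    "id":     [1, 1, 2, 2],
    "alt":    ["beach", "charter", "beach", "charter"],
    "income": [7000.0, 7000.0, 1250.0, 1250.0],
    "price":  [150.0, 180.0, 15.0, 35.0],
    "catch":  [0.07, 0.54, 0.10, 0.47],
})

# Count distinct values of each variable within every individual.
values_per_id = toy.groupby("id")[["income", "price", "catch"]].nunique()

# A column is individual-specific if it has exactly one value for every id.
individual_specific = [c for c in values_per_id.columns
                       if (values_per_id[c] == 1).all()]
alternative_specific = [c for c in values_per_id.columns
                        if c not in individual_specific]
print(individual_specific, alternative_specific)
```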

### Fit model

In [5]:
varnames = ['price','catch']
X = df[varnames].values
y = df['choice'].values

from xlogit import MixedLogit
model = MixedLogit()
model.fit(X, y, varnames= varnames,
          alts=['beach', 'boat', 'charter', 'pier'],
          randvars = {'price': 'n', 'catch': 'n'})
model.summary()

Estimation with GPU processing enabled.
Optimization terminated successfully.
         Current function value: 1300.795887
         Iterations: 37
         Function evaluations: 64
         Gradient evaluations: 64
Estimation time= 1.1 seconds
---------------------------------------------------------------------------
Coefficient              Estimate      Std.Err.         z-val         P>|z|
---------------------------------------------------------------------------
price                  -0.0273092     0.0022691   -12.0351978      1.61e-30 ***
catch                   1.3279443     0.1752850     7.5759175      5.27e-13 ***
sd.price               -0.0103544     0.0020383    -5.0799644      2.26e-06 ***
sd.catch               -1.5646325     0.3762199    -4.1588241      0.000148 ***
---------------------------------------------------------------------------
Significance:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Log-Likelihood= -1300.796
AIC= 2609.592
BIC= 2629.892
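
The reported information criteria can be reproduced from the log-likelihood with the standard definitions AIC = 2k - 2 LL and BIC = k ln(n) - 2 LL. The sketch assumes k = 4 estimated parameters (the four rows of the table) and n = 1,182 observations (one choice per individual); the sample size is inferred from the dataset description, not from the xlogit documentation.

```python
import math

log_likelihood = -1300.796   # from the summary above
k = 4                        # price, catch, sd.price, sd.catch
n = 1182                     # assumed sample size: one choice per individual

aic = 2 * k - 2 * log_likelihood
bic = k * math.log(n) - 2 * log_likelihood
print(round(aic, 3), round(bic, 3))
```

Under these assumptions both values agree with the summary to three decimals.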


## Car Dataset

The third example uses a stated-preference panel dataset on car choice. Three alternatives are considered, with up to 6 choice situations per individual. This is again an unbalanced panel, since some individuals responded to fewer than 6 situations. The dataset contains 8 explanatory variables: price, operating cost, range, binary indicators for whether the car is electric, gas, or hybrid, and binary indicators for high or medium performance. The dataset can be downloaded in long format from https://raw.githubusercontent.com/arteagac/xlogit/master/examples/data/car100_long.csv.

### Read data

In [6]:
import pandas as pd
import numpy as np

df = pd.read_csv("https://raw.githubusercontent.com/arteagac/xlogit/master/examples/data/car100_long.csv")
# Flip the sign of price and rescale it to units of 10,000
df.price = -1*df.price/10000
# Flip the sign of operating cost
df.operating_cost = -1*df.operating_cost
df

Unnamed: 0,person_id,choice_situation,choice,price,operating_cost,range,electric,gas,hybrid,high_performance,medium_performance
0,1,1,0,-4.6763,-47.43,0.0,0,0,1,0,0
1,1,1,1,-5.7209,-27.43,1.3,1,0,0,1,1
2,1,1,0,-8.7960,-32.41,1.2,1,0,0,0,1
3,1,2,1,-3.3768,-4.89,1.3,1,0,0,1,1
4,1,2,0,-9.0336,-30.19,0.0,0,0,1,0,1
...,...,...,...,...,...,...,...,...,...,...,...
4447,100,1483,0,-2.8036,-14.45,1.6,1,0,0,0,0
4448,100,1483,0,-1.9360,-54.76,0.0,0,1,0,1,1
4449,100,1484,1,-2.4054,-50.57,0.0,0,1,0,0,0
4450,100,1484,0,-5.2795,-21.25,0.0,0,0,1,0,1
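
A likely motivation for the sign flip on `price` (an assumption; the notebook does not state it) is that `price` is given a lognormal distribution when the model is fitted, and a lognormal coefficient is always positive. Negating the attribute lets a positive coefficient represent a negative effect of price on utility. The sketch below illustrates this with NumPy; `mu` and `sigma` are made-up parameters, not estimates.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = -0.8, 0.5            # hypothetical parameters of the underlying normal

# A lognormal coefficient is exp(normal draw), so it is strictly positive.
beta = np.exp(mu + sigma * rng.standard_normal(1000))

price = 4.6763                   # a price in units of 10,000
negated_price = -1 * price       # same transformation as in the cell above

# Positive coefficient times negated attribute: a negative utility contribution.
contribution = beta * negated_price
print(beta.min() > 0, contribution.max() < 0)
```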


### Fit the model

In [7]:
varnames = ['high_performance','medium_performance','price', 'operating_cost',
            'range', 'electric', 'hybrid'] 

X = df[varnames].values
y = df['choice'].values

from xlogit import MixedLogit
model = MixedLogit()
model.fit(X, y, varnames = varnames,
          alts=['car','bus','bike'],  # labels for the three alternatives in each choice situation
          randvars = {'price': 'ln', 'operating_cost': 'n',
                      'range': 'ln', 'electric':'n', 'hybrid': 'n'}, 
          panels=df.person_id.values, #Panel column
          n_draws = 100) 
model.summary()

Estimation with GPU processing enabled.
Optimization terminated successfully.
         Current function value: 1297.937510
         Iterations: 52
         Function evaluations: 63
         Gradient evaluations: 63
Estimation time= 1.3 seconds
---------------------------------------------------------------------------
Coefficient              Estimate      Std.Err.         z-val         P>|z|
---------------------------------------------------------------------------
high_performance        0.1059300     0.0923764     1.1467216         0.411    
medium_performance      0.5660352     0.0961273     5.8883953      2.36e-07 ***
price                  -0.7861318     0.1358040    -5.7887229      3.65e-07 ***
operating_cost          0.0120780     0.0057487     2.1009905        0.0898 .  
range                  -0.5886938     0.3564083    -1.6517401         0.204    
electric               -1.6330363     0.3232833    -5.0514085      8.25e-06 ***
hybrid                  0.6902022     0.1474823 