## Mode choice prediction
This notebook fits a multinomial logit (MNL) model of travel mode choice with PyLogit and checks how well the fitted model predicts the chosen mode in the estimation sample. PyLogit and other software packages (e.g. mlogit in R) use long-format data, which contains one row per available alternative per choice situation. In contrast, wide-format data contains one row per choice situation and is used by other software packages, such as Statsmodels in Python or Python BIOGEME.

The dataset used here is already in long-format, so no conversion is needed before estimation. Other PyLogit example notebooks (such as the "Main PyLogit Example") demonstrate how to take data from wide-format and convert it into long-format.
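
To make the two layouts concrete, the sketch below builds a toy long-format table with two choice situations and two alternatives, then reshapes it to wide-format with pandas. The column names (`obs_id`, `alt`, `cost`, `choice`) and all values are made up for illustration and are not part of the dataset used in this notebook.

```python
import pandas as pd

# Toy long-format data: one row per alternative per choice situation.
toy_long = pd.DataFrame({
    "obs_id": [1, 1, 2, 2],
    "alt": ["car", "bus", "car", "bus"],
    "cost": [10.0, 4.0, 12.0, 5.0],
    "choice": [1, 0, 0, 1],
})

# Wide-format: one row per choice situation, one cost column per alternative.
toy_wide = toy_long.pivot(index="obs_id", columns="alt", values="cost")
toy_wide.columns = [f"cost_{alt}" for alt in toy_wide.columns]

# Store the chosen alternative as a single column (aligned on obs_id).
chosen = toy_long.loc[toy_long["choice"] == 1].set_index("obs_id")["alt"]
toy_wide["chosen"] = chosen

print(toy_long)
print(toy_wide)
```

The long table has four rows (2 situations x 2 alternatives), while the wide table has two rows, one per choice situation.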

The dataset being used in this example is the "Travel Mode Choice" dataset from Greene and Hensher. It is described on the statsmodels <a href="http://statsmodels.sourceforge.net/0.6.0/datasets/generated/modechoice.html">website</a>, and their description is reproduced below in full.

<pre>
    The data, collected as part of a 1987 intercity mode choice study, are a sub-sample of 210 non-business
    trips between Sydney, Canberra and Melbourne in which the traveler chooses a mode from four alternatives
    (plane, car, bus and train). The sample, 840 observations, is choice based with over-sampling of the
    less popular modes (plane, train and bus) and under-sampling of the more popular mode, car. The level of
    service data was derived from highway and transport networks in Sydney, Melbourne, non-metropolitan N.S.W.
    and Victoria, including the Australian Capital Territory.
    
    Number of observations: 840 Observations On 4 Modes for 210 Individuals.
    Number of variables: 8
    Variable name definitions::

        individual = 1 to 210
        mode =
            1 - air
            2 - train
            3 - bus
            4 - car
        choice =
            0 - no
            1 - yes
        ttme = terminal waiting time for plane, train and bus (minutes); 0
               for car.
        invc = in vehicle cost for all stages (dollars).
        invt = travel time (in-vehicle time) for all stages (minutes).
        gc = generalized cost measure:invc+(invt*value of travel time savings)
            (dollars).
        hinc = household income ($1000s).
        psize = traveling group size in mode chosen (number).
        
    
    Source

    Greene, W.H. and D. Hensher (1997) Multinomial logit and discrete choice models in Greene, W. H. (1997)
    LIMDEP version 7.0 user’s manual revised, Plainview, New York econometric software, Inc. Download from
    on-line complements to Greene, W.H. (2011) Econometric Analysis, Prentice Hall, 7th Edition (data table
    F18-2) http://people.stern.nyu.edu/wgreene/Text/Edition7/TableF18-2.csv

</pre>
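
As a rough check on the `gc` definition above, the sketch below backs out the value of travel time savings implied by `gc = invc + invt * (value of travel time savings)`. The four rows are the values for individual 1 shown in the data preview further down. Treating the definition as an exact identity is an assumption made here; the source does not state the value that was used.

```python
# (mode, invc in dollars, invt in minutes, gc in dollars) for individual 1
rows = [
    ("air", 59.0, 100.0, 70.0),
    ("train", 31.0, 372.0, 71.0),
    ("bus", 25.0, 417.0, 70.0),
    ("car", 10.0, 180.0, 30.0),
]

# Rearranging gc = invc + invt * vtts gives vtts = (gc - invc) / invt.
implied_vtts = {mode: (gc - invc) / invt for mode, invc, invt, gc in rows}

for mode, vtts in implied_vtts.items():
    print(f"{mode}: {vtts:.3f} dollars per minute")
```

For these four rows the implied values are close to one another, which is consistent with a single value of time having been applied across modes.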

In [None]:
# To access the Travel Mode Choice data
import statsmodels.datasets

# To specify and estimate the choice model
import pylogit as pl

### Load the needed dataset

In [None]:
# Access the dataset
mode_data = statsmodels.datasets.modechoice.load_pandas()
# Get a pandas dataframe of the mode choice data
long_df = mode_data["data"]
# Look at the dataframe to ensure that it loaded correctly
long_df.head()

Unnamed: 0,individual,mode,choice,ttme,invc,invt,gc,hinc,psize
0,1.0,1.0,0.0,69.0,59.0,100.0,70.0,35.0,1.0
1,1.0,2.0,0.0,34.0,31.0,372.0,71.0,35.0,1.0
2,1.0,3.0,0.0,35.0,25.0,417.0,70.0,35.0,1.0
3,1.0,4.0,1.0,0.0,10.0,180.0,30.0,35.0,1.0
4,2.0,1.0,0.0,64.0,58.0,68.0,68.0,30.0,2.0


In [None]:
long_df.shape, long_df.individual.nunique(), long_df['mode'].nunique(), long_df['choice'].nunique()

((840, 9), 210, 4, 2)

In [None]:
long_df['mode'].value_counts()

4.0    210
3.0    210
2.0    210
1.0    210
Name: mode, dtype: int64
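
The counts above suggest a balanced long-format table (840 rows = 210 individuals x 4 modes). A minimal sketch of a more explicit structural check follows. The helper `check_long_format` is hypothetical (it is not part of PyLogit), and the toy frame only mimics the layout of the dataset; the choice made by individual 2 is invented. The "all alternatives present" test only makes sense when every alternative is available to everyone, as the counts indicate here.

```python
import pandas as pd

# Toy frame mimicking the dataset's layout (values are illustrative).
toy = pd.DataFrame({
    "individual": [1, 1, 1, 1, 2, 2, 2, 2],
    "mode": [1, 2, 3, 4, 1, 2, 3, 4],
    "choice": [0, 0, 0, 1, 0, 1, 0, 0],
})

def check_long_format(df, obs_col, alt_col, choice_col):
    """Return True if each choice situation lists every alternative once
    and has exactly one chosen alternative."""
    no_duplicates = not df.duplicated([obs_col, alt_col]).any()
    n_alts = df[alt_col].nunique()
    all_alts_present = (df.groupby(obs_col)[alt_col].nunique() == n_alts).all()
    one_choice = (df.groupby(obs_col)[choice_col].sum() == 1).all()
    return bool(no_duplicates and all_alts_present and one_choice)

print(check_long_format(toy, "individual", "mode", "choice"))
```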

In [None]:
# Integer alternative IDs (1 = air, 2 = train, 3 = bus, 4 = car), taken
# directly from the 'mode' column so they do not depend on the row order
mode_id = long_df['mode'].astype(int)

In [None]:
long_df['mode_id'] = mode_id

In [None]:
long_df.head()

Unnamed: 0,individual,mode,choice,ttme,invc,invt,gc,hinc,psize,mode_id
0,1.0,1.0,0.0,69.0,59.0,100.0,70.0,35.0,1.0,1
1,1.0,2.0,0.0,34.0,31.0,372.0,71.0,35.0,1.0,2
2,1.0,3.0,0.0,35.0,25.0,417.0,70.0,35.0,1.0,3
3,1.0,4.0,1.0,0.0,10.0,180.0,30.0,35.0,1.0,4
4,2.0,1.0,0.0,64.0,58.0,68.0,68.0,30.0,2.0,1
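
An integer alternative ID can be built either by tiling `(1, 2, 3, 4)` down the table or by casting the existing `mode` column. The sketch below, on a made-up two-individual frame, shows why the two only agree while rows are sorted by individual and then by mode.

```python
import pandas as pd

toy = pd.DataFrame({
    "individual": [1, 1, 1, 1, 2, 2, 2, 2],
    "mode": [1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0, 4.0],
})

# With rows sorted by individual, then mode, tiling matches the 'mode' column.
tiled = pd.Series((1, 2, 3, 4) * (len(toy) // 4), index=toy.index)
tiled_matches_sorted = bool((toy["mode"].astype(int) == tiled).all())

# After reordering the rows, the same tiling no longer lines up ...
shuffled = toy.sort_values(["mode", "individual"]).reset_index(drop=True)
tiled_shuffled = pd.Series((1, 2, 3, 4) * (len(shuffled) // 4))
tiled_matches_shuffled = bool((shuffled["mode"].astype(int) == tiled_shuffled).all())

# ... whereas casting the column is independent of row order.
derived = shuffled["mode"].astype(int)

print(tiled_matches_sorted, tiled_matches_shuffled)
```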


In [None]:
data1 = long_df.copy()

### Set up an MNL model

In [None]:
variable=['ttme','invc','invt']

In [None]:
from collections import OrderedDict
bs_spec = OrderedDict()
bs_name = OrderedDict()

In [None]:
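# Alternative-specific intercepts for train (2), bus (3) and car (4);
# air (1) is the base alternative, so it gets no intercept
bs_spec['intercept'] = [2,3,4]
bs_name['intercept'] = ['intercept:train',
                        'intercept:bus',
                        'intercept:car']

# One generic coefficient per variable, shared by all four alternatives
for col in variable:
    bs_spec[col] = [[1,2,3,4]]
    bs_name[col] = [col]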
    

In [None]:
bs_name

OrderedDict([('intercept',
              ['intercept:train', 'intercept:bus', 'intercept:car']),
             ('ttme', ['ttme']),
             ('invc', ['invc']),
             ('invt', ['invt'])])
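
Read as utility functions, this specification appears to correspond to the equations below. The symbols $\alpha_j$ and $\beta$ are notation introduced here for illustration: $i$ indexes individuals, $j$ indexes the four modes, and air ($j = 1$) is the base alternative with $\alpha_1 = 0$.

```latex
V_{ij} = \alpha_j
       + \beta_{\text{ttme}}\,\text{ttme}_{ij}
       + \beta_{\text{invc}}\,\text{invc}_{ij}
       + \beta_{\text{invt}}\,\text{invt}_{ij},
\qquad \alpha_1 = 0

P_{ij} = \frac{\exp(V_{ij})}{\sum_{k=1}^{4} \exp(V_{ik})}
```

This gives six parameters in total: three intercepts and three generic coefficients.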

In [None]:
model1 = pl.create_choice_model(data=data1,
                                alt_id_col='mode_id',
                                obs_id_col='individual',
                                choice_col='choice',
                                specification=bs_spec,
                                model_type='MNL',
                                names=bs_name)

### Fit model and show results

In [None]:
import numpy as np
model1.fit_mle(np.zeros(6))
model1.get_statsmodels_summary()

Log-likelihood at zero: -291.1218
Initial Log-likelihood: -291.1218
Estimation Time for Point Estimation: 0.09 seconds.
Final log-likelihood: -192.8885




Dep. Variable:,choice,No. Observations:,210.0
Model:,Multinomial Logit Model,Df Residuals:,204.0
Method:,MLE,Df Model:,6.0
Date:,"Fri, 15 Jan 2021",Pseudo R-squ.:,0.337
Time:,19:39:41,Pseudo R-bar-squ.:,0.317
AIC:,397.777,Log-Likelihood:,-192.889
BIC:,417.860,LL-Null:,-291.122
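
The fit statistics in this first summary table can be reproduced from the log-likelihoods with standard formulas, as sketched below. That PyLogit computes them in exactly this way is an assumption; the formulas are the usual McFadden pseudo R-squared, its parameter-adjusted variant, AIC and BIC.

```python
import math

n_obs = 210           # choice situations (individuals)
n_alts = 4            # alternatives per choice situation
n_params = 6          # estimated coefficients
ll_final = -192.8885  # final log-likelihood reported above

# With all coefficients at zero, each alternative has probability 1/4.
ll_null = n_obs * math.log(1 / n_alts)

rho_sq = 1 - ll_final / ll_null
rho_bar_sq = 1 - (ll_final - n_params) / ll_null
aic = 2 * n_params - 2 * ll_final
bic = n_params * math.log(n_obs) - 2 * ll_final

print(round(ll_null, 4), round(rho_sq, 3), round(rho_bar_sq, 3))
print(round(aic, 3), round(bic, 3))
```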

,coef,std err,z,P>|z|,[0.025,0.975]
intercept:train,-0.7867,0.603,-1.305,0.192,-1.968,0.394
intercept:bus,-1.4336,0.681,-2.106,0.035,-2.768,-0.099
intercept:car,-4.7399,0.868,-5.464,0.000,-6.440,-3.040
ttme,-0.0969,0.010,-9.368,0.000,-0.117,-0.077
invc,-0.0139,0.007,-2.092,0.036,-0.027,-0.001
invt,-0.0040,0.001,-4.704,0.000,-0.006,-0.002
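
To see how these coefficients turn into choice probabilities, the sketch below applies the MNL formula by hand to individual 1, using the rounded estimates from the table and the attribute values from the data preview. Because the coefficients are rounded, the result will differ slightly from what `model1.predict` returns.

```python
import numpy as np

# Rounded estimates from the summary table (order: air, train, bus, car).
asc = np.array([0.0, -0.7867, -1.4336, -4.7399])  # air is the base alternative
beta = np.array([-0.0969, -0.0139, -0.0040])      # ttme, invc, invt

# Attributes of individual 1 (rows: air, train, bus, car; columns: ttme, invc, invt).
x = np.array([
    [69.0, 59.0, 100.0],
    [34.0, 31.0, 372.0],
    [35.0, 25.0, 417.0],
    [0.0, 10.0, 180.0],
])

# Systematic utilities, then a softmax (shifted by the maximum for numerical stability).
v = asc + x @ beta
p = np.exp(v - v.max())
p /= p.sum()

modes = ["air", "train", "bus", "car"]
print(dict(zip(modes, p.round(3))))
print("most likely mode:", modes[int(p.argmax())])
```

For this individual the highest probability falls on car, which is also the mode recorded as chosen in the data.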


In [None]:
from sklearn.metrics import classification_report

def model_pred(data, model):
    """Add predicted probabilities and a 0/1 predicted-choice flag to `data`
    (modified in place) and return the actual and predicted alternative IDs."""
    # Predicted probability for each row (one row per alternative per individual)
    data['predicted'] = model.predict(data)
    # Row label of the highest-probability alternative for each individual
    is_chosen = data.groupby(['individual'])['predicted'].idxmax()
    data['predicted_choice'] = 0
    data.loc[is_chosen.values,'predicted_choice'] = 1

    # Both selections keep the row order, so they line up by individual
    # as long as each individual's rows are stored together
    actual = data.loc[data['choice'] ==1,'mode_id']
    pred = data.loc[data['predicted_choice'] ==1,'mode_id']
    return data, actual, pred

res, actual, pred = model_pred(data1,model1)
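
The prediction step relies on picking, within each individual, the row with the highest predicted probability. A minimal sketch of that pattern on a toy frame follows; the probabilities are made up and only three alternatives are used to keep it short.

```python
import pandas as pd

toy = pd.DataFrame({
    "individual": [1, 1, 1, 2, 2, 2],
    "mode_id": [1, 2, 3, 1, 2, 3],
    "predicted": [0.2, 0.5, 0.3, 0.6, 0.1, 0.3],
})

# For each individual, idxmax returns the row label with the largest probability.
best_rows = toy.groupby("individual")["predicted"].idxmax()

# Flag those rows as the predicted choice.
toy["predicted_choice"] = 0
toy.loc[best_rows.values, "predicted_choice"] = 1

predicted_modes = toy.loc[toy["predicted_choice"] == 1, "mode_id"].tolist()
print(predicted_modes)  # [2, 1]
```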

In [None]:
(actual.values == pred.values).sum()/len(actual.values)

0.7380952380952381

In [None]:
print(classification_report(actual, pred))

              precision    recall  f1-score   support

           1       0.71      0.67      0.69        58
           2       0.77      0.78      0.77        63
           3       0.96      0.77      0.85        30
           4       0.66      0.75      0.70        59

    accuracy                           0.74       210
   macro avg       0.77      0.74      0.75       210
weighted avg       0.75      0.74      0.74       210
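
The overall accuracy ties back to the per-class rows: it is the support-weighted average of the recalls. The sketch below recomputes it from the rounded values in the report, so it only approximately reproduces 0.738. It also computes the share of the most frequent actual mode, which is what always predicting that one mode would score on this sample.

```python
# Per-class support and (rounded) recall copied from the report above.
support = {1: 58, 2: 63, 3: 30, 4: 59}
recall = {1: 0.67, 2: 0.78, 3: 0.77, 4: 0.75}

n = sum(support.values())

# Recall times support approximates the number of correct predictions per class.
approx_correct = sum(recall[k] * support[k] for k in support)
approx_accuracy = approx_correct / n

# Unweighted mean of the recalls (the "macro avg" recall).
macro_recall = sum(recall.values()) / len(recall)

# Accuracy of always predicting the most frequent actual mode.
majority_share = max(support.values()) / n

print(round(approx_accuracy, 2), round(macro_recall, 2), round(majority_share, 2))
```

These figures are computed on the same 210 individuals used for estimation, so they describe in-sample fit rather than out-of-sample performance.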

