# MNL choice simulation work (unconstrained)

Sam Maurer, Sep 2018

This notebook contains feature development and testing for ChoiceModels [PR #42](https://github.com/UDST/choicemodels/pull/42), related to [issue #26](https://github.com/UDST/choicemodels/issues/26).

### Set up some test data

In [1]:
import numpy as np
import pandas as pd

In [43]:
def build_data(num_obs, num_alts):
    """
    Build a simulated list of scenarios, alternatives, and probabilities
    
    """
    obs = np.repeat(np.arange(num_obs), num_alts)
    # Alternative IDs are drawn with replacement, so an ID can repeat within an observation
    alts = np.random.randint(0, num_alts*10, size=num_obs*num_alts)

    weights = np.random.rand(num_alts, num_obs)
    probs = weights / weights.sum(axis=0)
    probslist = probs.flatten(order='F')

    data = pd.DataFrame({'oid': obs, 'aid': alts, 'probs': probslist})
    data = data.set_index(['oid','aid']).probs
    return data

data = build_data(5, 3)

In [44]:
data

oid  aid
0    28     0.349450
     23     0.119010
     12     0.531540
1    8      0.506113
     18     0.431919
     3      0.061968
2    24     0.487376
     17     0.139035
     22     0.373589
3    4      0.360640
     22     0.229418
     4      0.409942
4    27     0.128981
     9      0.745570
     1      0.125449
Name: probs, dtype: float64
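
A quick sanity check on data shaped like this is that the probabilities sum to 1 within each observation. The sketch below rebuilds a tiny stand-in Series (the name `data_check` and the weights are made up for illustration) and groups on the `oid` index level.

```python
import numpy as np
import pandas as pd

# Small stand-in for the `data` Series above: 2 observations x 3 alternatives
weights = np.array([[1.0, 2.0, 1.0],
                    [3.0, 1.0, 4.0]])   # one row per observation
probs = weights / weights.sum(axis=1, keepdims=True)

index = pd.MultiIndex.from_arrays(
    [np.repeat([0, 1], 3), [28, 23, 12, 8, 18, 3]], names=['oid', 'aid'])
data_check = pd.Series(probs.ravel(), index=index, name='probs')

# Sum the probabilities within each observation; every total should be 1
totals = data_check.groupby(level='oid').sum()
print(totals)
```

The same `groupby(level='oid').sum()` call should work on the `data` Series built by `build_data()`.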

### Simulate choices

In [4]:
import choicemodels

In [5]:
choices = choicemodels.tools.simulate_choices(data)

In [6]:
choices

oid
0    15
1    29
2    25
3    14
4    28
Name: aid, dtype: int64
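
For intuition, one common way to draw a choice per observation is to compare a uniform random draw against the cumulative probabilities. The sketch below is an illustration only, not the ChoiceModels implementation: `simulate_choices_sketch` is a hypothetical helper, and it assumes rows are grouped by observation with the same number of alternatives in each.

```python
import numpy as np
import pandas as pd

def simulate_choices_sketch(probs, rng=None):
    """
    Hypothetical helper (not the ChoiceModels implementation): draw one
    alternative per observation from a Series of probabilities indexed by
    (observation id, alternative id). Assumes rows are grouped by
    observation and every observation has the same number of alternatives.
    """
    rng = np.random.default_rng() if rng is None else rng
    obs = probs.index.get_level_values(0).unique()
    alts = probs.index.get_level_values(1).values
    num_obs = obs.size
    num_alts = probs.size // num_obs

    # One row per observation, one column per alternative
    cum = probs.values.reshape(num_obs, num_alts).cumsum(axis=1)

    # Scale each draw by the row total in case it differs slightly from 1
    draws = rng.random(num_obs) * cum[:, -1]

    # Position of the first alternative whose cumulative probability exceeds the draw
    positions = (cum > draws[:, None]).argmax(axis=1)

    chosen = alts.reshape(num_obs, num_alts)[np.arange(num_obs), positions]
    return pd.Series(chosen, index=obs, name=probs.index.names[1])

# Usage on a tiny example with made-up probabilities
idx = pd.MultiIndex.from_arrays(
    [[0, 0, 0, 1, 1, 1], [28, 23, 12, 8, 18, 3]], names=['oid', 'aid'])
example = pd.Series([0.35, 0.12, 0.53, 0.51, 0.43, 0.06], index=idx, name='probs')
print(simulate_choices_sketch(example))
```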

### Compare performance to urbansim.urbanchoice

Note that urbansim.urbanchoice combines probability generation and simulation of choices into a single function.

In [7]:
import choicemodels
import patsy
from urbansim.urbanchoice import mnl

In [8]:
def build_combos(num_obs, num_alts):
    """
    Build simulated list of scenarios and alternatives, with characteristics
    but not probabilities.
    
    """
    obs = pd.DataFrame({'oid': np.arange(num_obs), 
                        'obsval': np.random.random(num_obs)}).set_index('oid')
    
    alts = pd.DataFrame({'aid': np.arange(num_alts*10), 
                         'altval': np.random.random(num_alts*10)}).set_index('aid')
    
    mct = choicemodels.tools.MergedChoiceTable(obs, alts, sample_size=num_alts)
    return mct

mct = build_combos(5, 3)
print(mct.to_frame())

           obsval    altval
oid aid                    
0   17   0.150352  0.015013
    27   0.150352  0.086214
    13   0.150352  0.028043
1   11   0.732523  0.022378
    13   0.732523  0.028043
    6    0.732523  0.045925
2   8    0.532549  0.737731
    3    0.532549  0.875759
    21   0.532549  0.673954
3   4    0.756329  0.893739
    25   0.756329  0.171811
    15   0.756329  0.664476
4   1    0.739714  0.322814
    9    0.739714  0.946735
    23   0.739714  0.822562


In [9]:
mct = build_combos(1000000, 10)

In [10]:
model_expression = 'obsval + altval - 1'
coefs = [0.2, 0.8]
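
With this expression (two terms, no intercept), the utility of each alternative is `0.2*obsval + 0.8*altval`, and the standard MNL probability is the exponentiated utility divided by the sum over the observation's alternatives. A minimal NumPy sketch for a single observation follows; the matrix `X` holds made-up values and is not taken from `mct`.

```python
import numpy as np

coefs = np.array([0.2, 0.8])

# Hypothetical design matrix for one observation with three alternatives.
# Columns are obsval and altval, in the order given by the model expression.
X = np.array([[0.15, 0.02],
              [0.15, 0.09],
              [0.15, 0.03]])

utilities = X @ coefs

# Subtract the max before exponentiating for numerical stability
exp_u = np.exp(utilities - utilities.max())
probs = exp_u / exp_u.sum()
print(probs)
```

Because `obsval` is constant within an observation, its term cancels in the ratio, so only `altval` affects these probabilities.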

In [11]:
%%timeit

data = patsy.dmatrix(model_expression, data=mct.to_frame())
choices = mnl.mnl_simulate(data, coefs, numalts=10)  # returns positional indexes, not alternative IDs

626 ms ± 6.58 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [12]:
%%timeit

results = choicemodels.MultinomialLogitResults(model_expression, fitted_parameters=coefs)
probs = results.probabilities(mct)
choices = choicemodels.tools.simulate_choices(probs)

1.41 s ± 416 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


### What's the performance of each piece?

In [13]:
results = choicemodels.MultinomialLogitResults(model_expression, fitted_parameters=coefs)

In [14]:
%%timeit
probs = results.probabilities(mct)

1.02 s ± 266 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [15]:
probs = results.probabilities(mct)

In [16]:
%%timeit
choices = choicemodels.tools.simulate_choices(probs)

450 ms ± 5.69 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [17]:
%load_ext line_profiler

In [18]:
%load_ext memory_profiler

In [26]:
%lprun -f results.probabilities results.probabilities(mct)

About 70% of the execution time in `results.probabilities` goes to getting the design matrix from patsy. None of the NumPy operations stands out as particularly costly.
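
For an expression as simple as `'obsval + altval - 1'` (plain numeric terms, no intercept, no transformations), the design matrix is just the named columns, so it could in principle be built without patsy. The sketch below shows the idea on a small stand-in DataFrame. It is an assumption that this shortcut would help at scale, since it was not timed here, and it does not generalize to expressions with categorical terms or transforms.

```python
import numpy as np
import pandas as pd

# Small stand-in for mct.to_frame(), with made-up values
df = pd.DataFrame({'obsval': [0.15, 0.15, 0.73, 0.73],
                   'altval': [0.02, 0.09, 0.02, 0.05]})

# Column order must match the order of terms in the model expression
X = df[['obsval', 'altval']].to_numpy(dtype=float)

coefs = np.array([0.2, 0.8])
utilities = X @ coefs
print(utilities)
```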

In [20]:
%mprun -f results.probabilities results.probabilities(mct)




RAM usage is reduced by having patsy return a NumPy array rather than a DataFrame. The initial DataFrame uses 1 GB of RAM, while the NumPy operations use virtually nothing.

In [32]:
%lprun -f choicemodels.tools.simulate_choices choicemodels.tools.simulate_choices(probs)

Wow, 60% of the execution time goes to pandas operations: getting the index values and counting the unique observations.

In [22]:
%%time
obs = probs.index.get_level_values(0)
alts = probs.index.get_level_values(1)

CPU times: user 266 ms, sys: 60.1 ms, total: 326 ms
Wall time: 201 ms


In [23]:
%%time
_ = probs.reset_index()

CPU times: user 486 ms, sys: 133 ms, total: 619 ms
Wall time: 323 ms


In [24]:
%%time
probs.index.get_level_values(0).unique().size

CPU times: user 307 ms, sys: 63.2 ms, total: 371 ms
Wall time: 219 ms


1000000

In [25]:
%%time
np.unique(probs.index.get_level_values(0)).size

CPU times: user 506 ms, sys: 67.5 ms, total: 573 ms
Wall time: 299 ms


1000000

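None of these alternatives is faster: `reset_index()` (323 ms) is slower than the two `get_level_values()` calls (201 ms), and `np.unique()` (299 ms) is slower than `Index.unique()` (219 ms).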
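
To repeat this kind of comparison outside a notebook, the standard-library `timeit` module can stand in for the `%%time` magic. The sketch below uses a smaller, made-up Series (`probs_small`) so it runs quickly. Timings will vary by machine and pandas version, so none are shown.

```python
import timeit
import numpy as np
import pandas as pd

num_obs, num_alts = 10000, 10
index = pd.MultiIndex.from_arrays(
    [np.repeat(np.arange(num_obs), num_alts),
     np.tile(np.arange(num_alts), num_obs)], names=['oid', 'aid'])
probs_small = pd.Series(np.full(num_obs * num_alts, 1 / num_alts), index=index)

# Three ways of counting the unique observations
candidates = {
    'Index.unique': lambda: probs_small.index.get_level_values(0).unique().size,
    'np.unique': lambda: np.unique(probs_small.index.get_level_values(0)).size,
    'reset_index': lambda: probs_small.reset_index()['oid'].nunique(),
}

# Best of 5 single runs, in seconds
timings = {name: min(timeit.repeat(fn, number=1, repeat=5))
           for name, fn in candidates.items()}
counts = {name: fn() for name, fn in candidates.items()}

for name in candidates:
    print(f'{name}: {timings[name] * 1000:.2f} ms, {counts[name]} observations')
```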