# Choice simulation work

Sam Maurer, October 2018

This notebook contains benchmarks, feature development, and testing for ChoiceModels PR #TK, related to issue #TK.

In [3]:
import numpy as np
import pandas as pd

In [5]:
import choicemodels
print(choicemodels.__version__)

0.2.dev3


### Benchmark df.apply vs matrix math for chooser-level random draws

There's no `numpy` function to perform simultaneous random draws from K distinct probability distributions, which we often need to do to simulate choices for K choosers.

Fletcher wrote an implementation using matrix math for `urbansim.urbanchoice.mnl`, which I refactored and generalized in `choicemodels.tools`. 

But I realized that in other places, like `urbansim.models.dcm`, we use `df.apply` for similar operations. That approach seems cleaner and easier to maintain, and I'm curious how the performance compares. Maybe the matrix math implementation is only needed for things like GPU acceleration?
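
For reference, here is a minimal sketch of what "matrix math" means in this context: inverse-CDF sampling applied to every row at once. This is not the code in `choicemodels.tools` or `urbansim.urbanchoice.mnl`. The function name `draw_choices_vectorized` and its arguments are hypothetical, and it assumes every chooser has the same number of alternatives, so the probabilities fit in a rectangular array.

```python
import numpy as np

def draw_choices_vectorized(probs_2d, rng=None):
    """Draw one column position per row (hypothetical helper, illustration only).

    probs_2d: array of shape (n_obs, n_alts) holding non-negative weights.
    Rows do not need to sum to 1, but each row needs a positive total.
    """
    rng = np.random.default_rng() if rng is None else rng
    probs_2d = np.asarray(probs_2d, dtype=float)

    # Normalize each row, then build the row-wise cumulative distribution.
    normed = probs_2d / probs_2d.sum(axis=1, keepdims=True)
    cum = normed.cumsum(axis=1)
    cum[:, -1] = 1.0  # guard against floating-point drift in the last column

    # One uniform draw in [0, 1) per chooser, broadcast across the alternatives.
    draws = rng.random((probs_2d.shape[0], 1))

    # The choice is the first column whose cumulative probability exceeds the draw.
    return (draws < cum).argmax(axis=1)

# Degenerate rows make the result deterministic, which is handy for a quick check.
example = draw_choices_vectorized([[0, 1, 0], [1, 0, 0], [0, 0, 1]])
```

A Series indexed by `(oid, aid)`, like the one built later in this notebook, could be reshaped for this sketch with `probs.unstack().to_numpy()`. The returned values are column positions, which would still need to be mapped back to alternative IDs.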

In [1]:
from choicemodels.tools import simulate_choices

In [12]:
def generate_probs(n_obs, n_alts):
    n_obs = int(n_obs)
    n_alts = int(n_alts)
    
    d = {'oid': np.repeat(np.arange(n_obs), n_alts),
         'aid': np.tile(np.arange(n_alts), n_obs),
         'probs': np.random.random(n_obs * n_alts)}

    return pd.DataFrame(d).set_index(['oid','aid']).probs

print(generate_probs(2,3))

oid  aid
0    0      0.654696
     1      0.214542
     2      0.544699
1    0      0.563290
     1      0.024289
     2      0.091644
Name: probs, dtype: float64


In [28]:
probs = generate_probs(1e4, 100)

#### 1. Matrix implementation from choicemodels

In [29]:
%%time
c = simulate_choices(probs)
print(len(c))

10000
CPU times: user 23.8 ms, sys: 7.6 ms, total: 31.4 ms
Wall time: 29.6 ms


#### 2. df.apply

In [30]:
df = pd.DataFrame(probs).reset_index()

In [32]:
%%time
c = df.groupby('oid').apply(lambda x: np.random.choice(x.aid, p=x.probs/x.probs.sum()))
print(len(c))

10000
CPU times: user 2.84 s, sys: 25.6 ms, total: 2.87 s
Wall time: 2.86 s


#### 3. Try keeping indexes to make it faster

In [36]:
def mkchoice(probs):
    return np.random.choice(probs.index.values, p=probs/probs.sum())

In [37]:
%%time
c = probs.groupby(level='oid', sort=False).apply(mkchoice)
print(len(c))

10000
CPU times: user 5.02 s, sys: 67.7 ms, total: 5.09 s
Wall time: 5.06 s


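**`df.apply` is far slower: roughly 100x for the plain `groupby` version (2.86 s vs. 29.6 ms), and about 170x for the version that keeps the indexes (5.06 s), which turned out slower rather than faster. We should use the matrix math implementation everywhere.**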