# Sampling correction for large choice sets

1. Replicate synthetic data from Guevara & Ben-Akiva 2013
2. Do MNL with and without sampling correction
3. Check whether parameter estimates deviate from true values
4. Extend to Mixed Logit

## 1. Generate synthetic data set

- N = 1000 observations
- J = 1000 alternatives for all observations (C_n = C)
- X = single attribute distributed Uniform(-2,1) for the first 500 alternatives and Uniform(-1,2) for the second half
- beta = generic linear taste coefficient, distributed Normal(mu=1.5, sigma=0.8) across the 1000 observations
- systematic utility = beta * X
- epsilon = error term distributed ExtremeValue(0,1)
- random utility = beta * X + epsilon
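
The specification above can also be simulated directly, by drawing the error terms and letting each agent pick the alternative with the highest random utility. The following is only a sketch: the seed, the reduced J, and the `_demo` variable names are assumptions made for illustration and are not part of the original design.

```python
import numpy as np

rng = np.random.RandomState(0)   # fixed seed, chosen arbitrarily for repeatability

N_demo, J_demo = 1000, 100       # smaller J than the full design, to keep it fast

# Attribute x: Uniform(-2, 1) for the first half, Uniform(-1, 2) for the second
x_demo = np.concatenate((rng.uniform(-2, 1, J_demo // 2),
                         rng.uniform(-1, 2, J_demo // 2)))

# Taste coefficients: Normal(mu=1.5, sigma=0.8), one per observation
beta_demo = rng.normal(1.5, 0.8, N_demo)

# Random utility U = beta * x + epsilon, with epsilon ~ ExtremeValue(0, 1)
eps = rng.gumbel(loc=0.0, scale=1.0, size=(N_demo, J_demo))
U = np.outer(beta_demo, x_demo) + eps

# Each agent picks the alternative with the highest random utility
choice_demo = U.argmax(axis=1)
```

Drawing Gumbel errors and taking the argmax is equivalent in distribution to sampling from the logit probabilities, which is the route the cells below take.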

Utility of alternative i for agent n:
$$ U_{in} = V_{in} + \varepsilon_{in} = \beta_n x_{i} + \varepsilon_{in} $$

Probability that agent n will choose alternative i:
$$ L_n(i \mid \beta_n, x_n,C_n) = \frac {e^{V_{in}}} {\sum_{j \in C_n} e^{V_{jn}}} $$
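
As a rough sketch of this formula, the probabilities for all agents can be computed in one vectorized step. The function name `choice_probs` and the row-wise max subtraction (added for numerical stability) are illustrative choices, not something taken from the paper or from the cells below.

```python
import numpy as np

def choice_probs(beta, x):
    """Return an (N, J) matrix of logit choice probabilities.

    beta : array of N taste coefficients
    x    : array of J attribute values (same choice set for everyone)
    """
    v = np.outer(beta, x)                  # systematic utilities V_in
    v = v - v.max(axis=1, keepdims=True)   # shift each row; probabilities unchanged
    expv = np.exp(v)
    return expv / expv.sum(axis=1, keepdims=True)

# Tiny example: 2 agents, 3 alternatives
P_demo = choice_probs(np.array([1.5, 0.0]), np.array([-1.0, 0.0, 1.0]))
print(P_demo.round(3))
```

An agent with beta = 0 is indifferent, so the second row is uniform over the three alternatives.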

In [5]:
import numpy as np
import pandas as pd

In [36]:
# Generate attribute x for each of J alternatives

# Start with J << 1000 to speed up runtimes

J = 100  # alternatives

Xa = 3 * np.random.rand(J // 2) - 2  # uniform distribution over [-2, 1]
Xb = 3 * np.random.rand(J // 2) - 1  # uniform distribution over [-1, 2]

X = np.concatenate((Xa, Xb))

print len(X)
print X[:5]

100
[ 0.60136374  0.43818458  0.67369631 -0.66300144  0.9933214 ]


In [37]:
# Generate taste coefficient beta for each of N agents 

# For regular MNL, I think we need a single value rather than the
# distribution Guevara & Ben-Akiva used for the mixture model

N = 1000  # agents/observations

beta = np.full(N, 1.5)
# beta = 0.8 * np.random.randn(N) + 1.5

print len(beta)
print beta[:5]

1000
[ 1.5  1.5  1.5  1.5  1.5]


In [None]:
print pd.DataFrame(beta).describe()

In [39]:
# Generate probability matrix for N agents choosing among J alternatives

def probs(n):
    '''
    Return array of J choice probabilities for agent n
    '''
    exps = np.exp(beta[n] * X)  # exponentiated systematic utilities
    return exps / np.sum(exps)

P = np.array([probs(n) for n in range(N)])
    
print P.shape

(1000, 100)


In [40]:
# Check that each row sums to 1

print np.sum(P, axis=1)[:10]

[ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.]


In [41]:
# Simulate a choice from J alternatives for each of N agents

C = [np.random.choice(range(J), p=p) for p in P]

print len(C)
print C[:10]

1000
[28, 8, 89, 78, 78, 58, 88, 6, 62, 87]


#### Now we have data:

- N agents/observations with true taste coefficients in array "beta"
- J alternatives with single attributes in array "X"
- N choice outcomes in array "C"
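
Before turning to a library, one quick way to check the simulated data is to maximize the MNL log-likelihood by hand. The sketch below regenerates a comparable data set so that it runs on its own; the seed, the `_chk` names, and the use of `scipy.optimize.minimize_scalar` are assumptions for illustration, not the method used in the rest of the notebook.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.RandomState(1)
N_chk, J_chk, true_beta = 1000, 100, 1.5

x_chk = np.concatenate((rng.uniform(-2, 1, J_chk // 2),
                        rng.uniform(-1, 2, J_chk // 2)))

# Simulate choices from the logit probabilities (identical for all agents here)
p_chk = np.exp(true_beta * x_chk)
p_chk /= p_chk.sum()
c_chk = rng.choice(J_chk, size=N_chk, p=p_chk)

def neg_loglik(b):
    """Negative MNL log-likelihood for a single generic coefficient b."""
    v = b * x_chk
    log_denom = v.max() + np.log(np.exp(v - v.max()).sum())   # log-sum-exp
    return -(v[c_chk].sum() - N_chk * log_denom)

res = minimize_scalar(neg_loglik, bounds=(-5, 5), method='bounded')
beta_hat = res.x
print(round(beta_hat, 3))
```

With a single coefficient the log-likelihood is concave, so a bounded scalar search is enough; the estimate should land close to the true value of 1.5.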

## 2. Estimate beta using PyLogit MNL

In [10]:
import pylogit
from collections import OrderedDict

In [100]:
# Set up an estimation dataset in long format

d = [[n, i, int(C[n]==i), X[i]] for i in range(J) for n in range(N)]

print len(d)

100000


In [101]:
df = pd.DataFrame(d, columns=['obs_id', 'alt_id', 'choice', 'x'])

print df.describe()

              obs_id         alt_id         choice              x
count  100000.000000  100000.000000  100000.000000  100000.000000
mean      499.500000      49.500000       0.010000       0.000190
std       288.676434      28.866214       0.099499       1.110493
min         0.000000       0.000000       0.000000      -1.999087
25%       249.750000      24.750000       0.000000      -0.842462
50%       499.500000      49.500000       0.000000       0.113036
75%       749.250000      74.250000       0.000000       0.923162
max       999.000000      99.000000       1.000000       1.959973


In [44]:
# Set up model spec

spec = OrderedDict([
        ('x', [list(range(J))])  # one generic coefficient shared by all J alternatives
    ])

labels = OrderedDict([
        ('x', ['beta_x'])
    ])

In [45]:
m = pylogit.create_choice_model(data = df, 
                                alt_id_col = 'alt_id', 
                                obs_id_col = 'obs_id', 
                                choice_col = 'choice', 
                                specification = spec, 
                                model_type = "MNL", 
                                names = labels)

m.fit_mle(init_vals = np.array([0]))
print m.get_statsmodels_summary()

Log-likelihood at zero: -4,605.1702
Initial Log-likelihood: -4,605.1702
Estimation Time: 0.14 seconds.
Final log-likelihood: -3,793.5437
                     Multinomial Logit Model Regression Results                    
Dep. Variable:                      choice   No. Observations:                1,000
Model:             Multinomial Logit Model   Df Residuals:                      999
Method:                                MLE   Df Model:                            1
Date:                     Sun, 13 Nov 2016   Pseudo R-squ.:                   0.176
Time:                             15:18:28   Pseudo R-bar-squ.:               0.176
converged:                            True   Log-Likelihood:             -3,793.544
                                             LL-Null:                    -4,605.170
                 coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
beta_x         1.5461        nan 

  self._store_inferential_results(np.sqrt(np.diag(self.cov)),


## Try with UrbanSim MNL instead of PyLogit

Model class: https://github.com/UDST/urbansim/blob/master/urbansim/models/dcm.py

Estimation algorithms: https://github.com/UDST/urbansim/blob/master/urbansim/urbanchoice/mnl.py

In [52]:
from urbansim.models import MNLDiscreteChoiceModel

In [97]:
# Choosers should be a DataFrame of characteristics, with index as identifier

d = [[n, C[n]] for n in range(N)]

choosers = pd.DataFrame(d, columns=['id', 'choice']).set_index('id')

print len(choosers)

1000


In [98]:
# Alternatives should be a DataFrame of characteristics, with index as identifier

d = [[i, X[i]] for i in range(J)]

alts = pd.DataFrame(d, columns=['id', 'x']).set_index('id')

print len(alts)

100


In [84]:
# It seems like this implementation *requires* us to sample the alternatives,
# so here I'm estimating the model with J-1 alts

m = MNLDiscreteChoiceModel(model_expression = 'x',
                           sample_size = J-1)

m.fit(choosers = choosers,
      alternatives = alts,
      current_choice = 'choice')

m.report_fit()

Null Log-liklihood: -4595.120
Log-liklihood at convergence: -3793.079
Log-liklihood Ratio: 0.175

+-----------+-------------+------------+---------+
| Component | Coefficient | Std. Error | T-Score |
+-----------+-------------+------------+---------+
| x         |    1.544    |   0.023    |  68.242 |
+-----------+-------------+------------+---------+
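
For reference, here is a minimal sketch of estimating the same kind of model on a sampled choice set, written directly in NumPy rather than with UrbanSim's internals (which are not reproduced here). It assumes each agent's set contains the chosen alternative plus non-chosen alternatives drawn uniformly at random without replacement. Under that protocol the sampling correction term is the same for every alternative in the set and cancels out of the MNL likelihood (McFadden 1978), so no explicit correction is applied. The seed, the set size K, and all variable names are illustrative.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.RandomState(2)
N_s, J_s, K, true_beta = 1000, 100, 10, 1.5   # K = sampled choice set size

x_s = np.concatenate((rng.uniform(-2, 1, J_s // 2),
                      rng.uniform(-1, 2, J_s // 2)))
p_s = np.exp(true_beta * x_s)
p_s /= p_s.sum()
c_s = rng.choice(J_s, size=N_s, p=p_s)

# For each agent: chosen alternative in column 0, then K-1 distinct
# non-chosen alternatives drawn uniformly at random
sets = np.empty((N_s, K), dtype=int)
for n in range(N_s):
    others = np.delete(np.arange(J_s), c_s[n])
    sets[n, 0] = c_s[n]
    sets[n, 1:] = rng.choice(others, size=K - 1, replace=False)

xs = x_s[sets]   # (N, K) attributes of the sampled alternatives

def neg_loglik_sampled(b):
    """Negative log-likelihood over the sampled choice sets."""
    v = b * xs
    vmax = v.max(axis=1, keepdims=True)
    log_denom = vmax[:, 0] + np.log(np.exp(v - vmax).sum(axis=1))
    return -(v[:, 0] - log_denom).sum()

res_s = minimize_scalar(neg_loglik_sampled, bounds=(-5, 5), method='bounded')
beta_hat_sampled = res_s.x
```

This cancellation holds for MNL with uniform sampling only; other sampling protocols, and the Mixed Logit extension, need the correction handled explicitly.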


In [None]:
# To do 
# - look through PyLogit and LCCM code
# - in many-alternative scenarios, attributes of the alternatives will
#   usually be in a separate data table - what helper functions do we need?
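
One possible helper for the second item is sketched below: it crosses a choosers table with a separate alternatives table to produce the long format used for the PyLogit model above. The function name `to_long_format` and its arguments are hypothetical; neither PyLogit nor UrbanSim is claimed to provide it.

```python
import numpy as np
import pandas as pd

def to_long_format(choosers, alts, choice_col='choice'):
    """Cross choosers with alternatives into one long-format table.

    choosers : DataFrame indexed by observation id, with a column holding
               the id of the chosen alternative
    alts     : DataFrame indexed by alternative id, one column per attribute
    """
    n_obs, n_alts = len(choosers), len(alts)
    obs_ids = np.repeat(choosers.index.values, n_alts)   # 0,0,...,1,1,...
    alt_ids = np.tile(alts.index.values, n_obs)          # 0,1,...,0,1,...

    long_df = pd.DataFrame({'obs_id': obs_ids, 'alt_id': alt_ids})
    chosen = np.repeat(choosers[choice_col].values, n_alts)
    long_df['choice'] = (long_df['alt_id'].values == chosen).astype(int)

    # Attach alternative attributes by alt_id
    return long_df.join(alts, on='alt_id')

# Small example: 3 choosers, 4 alternatives
choosers_demo = pd.DataFrame({'choice': [2, 0, 3]}, index=[0, 1, 2])
alts_demo = pd.DataFrame({'x': [0.1, -0.5, 1.2, 0.7]}, index=[0, 1, 2, 3])
long_demo = to_long_format(choosers_demo, alts_demo)
print(long_demo.head())
```

This builds the full N x J cross product, so for very large choice sets it would need to be combined with sampling of alternatives rather than used as is.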

