## MNL prediction demo

Sam Maurer, August 2017 | Python 3.6

Original version July 2017  
Updated July 2017 to include probabilities  
Updated Aug 2017 to fix int/float problems 

### Summary

This notebook demonstrates how to fit a model using the ChoiceModels interface and then use the UrbanSim MNL functions to generate probabilities and predictions. 

Eventually, a prediction interface will be incorporated into the `MultinomialLogitResults` object, but it's not there yet!

This demo uses the estimation data that's set up in the `Data-prep-02` notebook.

In [1]:
import numpy as np
import pandas as pd

from patsy import dmatrix

from choicemodels import mnl  # could also import from urbansim.urbanchoice
from choicemodels import MultinomialLogit
from choicemodels.tools import MergedChoiceTable


### Load data from disk

In [2]:
tracts = pd.read_csv('../data/tracts_v02.csv').set_index('full_tract_id')
tracts = tracts.loc[(tracts.home_density > 0) | (tracts.work_density > 0) | (tracts.school_density > 0)]

print(tracts.shape[0])
print(tracts.head(3))

1566
                   city  home_density  work_density  school_density
full_tract_id                                                      
6001400100     BERKELEY     13.437961     13.130867       13.511570
6001400200      OAKLAND     11.089638      4.248928        0.894794
6001400300      OAKLAND     28.878399      7.671554        0.000000


In [3]:
trips = pd.read_csv('../data/trips_v02.csv').set_index('place_id')
trips = trips.loc[trips.trip_distance_miles.notnull()]

print(trips.shape[0])
print(trips.head(3))

35786
             full_tract_id  mode  trip_distance_miles
place_id                                             
10319850202     6095251902     5             5.125960
10335860102     6085511915     6           156.370628
10335860103     6085512027     6             1.615535


### Set up estimation table

Each observed trip is a realized choice of a particular destination census tract. We can randomly sample alternative census tracts to build a model of destination choice.

We'll divide the trips into a training set and a testing set, fit an MNL model using the training data, use it to generate predicted choices for the testing data, and compare the predicted to the actual choices.

In [4]:
training_observations = trips.iloc[:1000]
training = MergedChoiceTable(observations = training_observations,
                             alternatives = tracts,
                             chosen_alternatives = training_observations.full_tract_id,
                             sample_size = 100)

testing_observations = trips.iloc[1000:]
testing = MergedChoiceTable(observations = testing_observations,
                            alternatives = tracts,
                            chosen_alternatives = testing_observations.full_tract_id,
                            sample_size = 100)

print(training.to_frame().shape)
print(testing.to_frame().shape)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  alts_sample['join_index'] = np.repeat(choosers.index.values, SAMPLE_SIZE)


(100000, 9)
(3473300, 9)
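
The row counts above are the number of choosers times the 100 alternatives in each choice set (1,000 x 100 and 34,733 x 100). As a rough illustration of what sampling alternatives into a long-format table involves, here is a minimal sketch. The helper `sample_alternatives` is hypothetical and is not how `MergedChoiceTable` is implemented; it only assumes that each choice set holds the chosen alternative plus randomly drawn others.

```python
import numpy as np

def sample_alternatives(chosen, alt_ids, sample_size, seed=0):
    """Hypothetical helper: build long-format (obs, alt, chosen) rows.

    For each observation, keep the chosen alternative and fill the rest of
    the choice set with alternatives drawn at random without replacement.
    """
    rng = np.random.default_rng(seed)
    alt_ids = np.asarray(alt_ids)
    rows = []
    for obs, choice in enumerate(chosen):
        others = alt_ids[alt_ids != choice]
        sampled = rng.choice(others, size=sample_size - 1, replace=False)
        for alt in np.concatenate(([choice], sampled)):
            rows.append((obs, int(alt), int(alt == choice)))
    return rows

# Toy data: 2 observations, 10 alternatives, 4 alternatives per choice set
table = sample_alternatives(chosen=[3, 7], alt_ids=range(10), sample_size=4)
print(len(table))  # 2 observations x 4 alternatives = 8 rows
```

Each observation contributes exactly one row flagged as chosen, which is what the `chosen` column encodes in the real table.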


### Fit a model using the training observations

In [5]:
%%time
model_expression = "home_density + work_density + school_density - 1"

model = MultinomialLogit(data = training.to_frame(), 
                         observation_id_col = training.observation_id_col, 
                         choice_col = training.choice_col,
                         model_expression = model_expression)

results = model.fit()
print(results)

                  CHOICEMODELS ESTIMATION RESULTS                  
Dep. Var.:                chosen   No. Observations:               
Model:         Multinomial Logit   Df Residuals:                   
Method:       Maximum Likelihood   Df Model:                       
Date:                              Pseudo R-squ.:                  
Time:                              Pseudo R-bar-squ.:              
AIC:                               Log-Likelihood:       -4,498.317
BIC:                               LL-Null:              -4,605.170
                    coef   std err         z     P>|z|   Conf. Int.
-------------------------------------------------------------------
home_density      0.0120     0.002     6.393                       
work_density      0.0129     0.001    15.988                       
school_density    0.0078     0.004     2.144                       
CPU times: user 362 ms, sys: 82.8 ms, total: 445 ms
Wall time: 187 ms
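
The pseudo R-squared fields are blank in the printed table, but a value can be worked out by hand from the two log-likelihoods. The sketch below assumes the null model gives every one of the 100 sampled alternatives equal probability, which reproduces the LL-Null figure shown above, and uses McFadden's definition of pseudo R-squared.

```python
import math

n_obs, n_alts = 1000, 100     # training observations, sampled alternatives
ll_fitted = -4498.317         # Log-Likelihood from the table above

# Null model: every alternative in the choice set is equally likely
ll_null = n_obs * math.log(1 / n_alts)

# McFadden's pseudo R-squared
pseudo_r2 = 1 - ll_fitted / ll_null

print(round(ll_null, 3), round(pseudo_r2, 4))
```

The result is roughly 0.023, which is small but not surprising for a three-variable model of destination choice.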



### Predict destination choices for the testing observations

We'll use the UrbanSim MNL functions directly, because prediction hasn't been integrated into the ChoiceModels results classes yet. The relevant function is here: https://github.com/UDST/choicemodels/blob/master/choicemodels/mnl.py#L536

In [6]:
# Pull the coefs out of the results object (the PyLogit syntax would be different)

coefs = results.get_raw_results()['fit_parameters']['Coefficient']
print(coefs)

0    0.011960
1    0.012945
2    0.007753
Name: Coefficient, dtype: float64


In [7]:
# The data columns for prediction need to align with the coefficients; 
# you can do this manually or with patsy, as here

df = testing.to_frame().set_index('full_tract_id')

testing_df = dmatrix(model_expression, data=df, return_type='dataframe')
print(testing_df.shape)
print(testing_df.head(3))

(3473300, 3)
               home_density  work_density  school_density
full_tract_id                                            
6097150607        10.659461      6.868701        7.160030
6097151502        18.674865      1.820991        1.856286
6081609700        27.867920      0.000000        0.000000


In [8]:
# Simulate a destination choice for each testing observation

choices = mnl.mnl_simulate(testing_df, coefs, numalts=100, returnprobs=False)

print(len(choices))
print(choices[:5])

34733
[93 39 89 56 67]
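
To make the simulation step less of a black box, here is a minimal NumPy sketch of the general technique: compute utilities, convert them to probabilities with a softmax within each choice set, then draw one alternative per chooser. The function `simulate_mnl` is a hypothetical stand-in written for this illustration, not the code in `mnl.mnl_simulate`, and the toy data is made up.

```python
import numpy as np

def simulate_mnl(data, coefs, numalts, seed=0):
    """Hypothetical stand-in for the simulation step, not the library code."""
    rng = np.random.default_rng(seed)
    utilities = np.asarray(data, dtype=float) @ np.asarray(coefs, dtype=float)
    utilities = utilities.reshape(-1, numalts)          # one row per chooser
    utilities -= utilities.max(axis=1, keepdims=True)   # numerical stability
    exp_u = np.exp(utilities)
    probs = exp_u / exp_u.sum(axis=1, keepdims=True)    # softmax per chooser

    # Inverse-CDF draw: count how many cumulative probabilities fall below u
    u = rng.random((probs.shape[0], 1))
    choices = (probs.cumsum(axis=1) < u).sum(axis=1)
    choices = np.minimum(choices, numalts - 1)          # guard against rounding
    return probs, choices

# Toy data: 2 choosers x 2 alternatives, 3 explanatory variables
X = np.array([[10., 5., 1.], [20., 0., 0.], [0., 30., 2.], [5., 5., 5.]])
beta = np.array([0.0120, 0.0129, 0.0078])

probs, choices = simulate_mnl(X, beta, numalts=2)
print(probs.sum(axis=1))  # each chooser's probabilities sum to 1
print(choices)            # chosen alternative by position within the choice set
```

As in the real output, the choices come back as positions in the range [0, numalts), not as alternative IDs.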


In [9]:
# Annoyingly, that identifies the choices by position rather than by id;
# here's a function to get the id's

def get_chosen_ids(ids, positions):
    """
    We observe N choice scenarios. In each, one of J alternatives is chosen.
    We have a long (len N * J) list of the available alternatives. We have a 
    list (len N) of which alternatives were chosen, but it identifies them 
    by POSITION and we want their ID.    
    
    Parameters
    ----------
    ids : list or list-like
        List of alternative ID's (len N * J).
        
    positions : list or list-like
        List of chosen alternatives by position (len N), where each entry is
        an int in range [0, J)
    
    Returns
    -------
    chosen_ids : list
        List of chosen alternatives by ID (len N)
    
    """
    N = len(positions)
    J = len(ids) // N
    
    ids_by_obs = np.reshape(ids, (N,J))
    return [ids_by_obs[i][positions[i]] for i in range(N)]
    

print(get_chosen_ids(['a','b','c','d'], [0,1]))

['a', 'd']
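
With about 35,000 observations the list comprehension is fast enough, but the same lookup can also be done with NumPy integer-array indexing. This is an optional sketch; `get_chosen_ids_vectorized` is a name introduced here for illustration.

```python
import numpy as np

def get_chosen_ids_vectorized(ids, positions):
    """Same inputs and outputs as get_chosen_ids, without the Python loop."""
    positions = np.asarray(positions)
    n = len(positions)
    ids_by_obs = np.asarray(ids).reshape(n, -1)   # shape (N, J)
    # Pick column positions[i] from row i, for every row at once
    return ids_by_obs[np.arange(n), positions].tolist()

print(get_chosen_ids_vectorized(['a', 'b', 'c', 'd'], [0, 1]))
```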


In [10]:
# Get tract id's for the simulated choices

predicted_tracts = get_chosen_ids(testing_df.index.tolist(), choices)

print(len(predicted_tracts))
print(predicted_tracts[:5])

34733
[6001400900, 6075045100, 6013355301, 6001406202, 6085504410]


In [11]:
# Get tract id's for observed choices

df = testing.to_frame()
observed_tracts = df.loc[df.chosen == 1, 'full_tract_id'].tolist()

print(len(observed_tracts))
print(observed_tracts[:5])

34733
[6097150607, 6097153200, 6097151402, 6097151402, 6097151204]


### Compare the predicted choices to the observed ones

Multinomial models are kind of tricky to validate. We don't expect the predicted choices to match the observed ones exactly, because there are so many alternatives, but we do expect the characteristics of the predicted choices to be similar to the characteristics of the observed choices. 

Choose your own metric for this depending on what you're trying to evaluate! It's even plausible that the metric could be something not directly in the model, like the distance between the predicted and actual destination choices.

In [12]:
# What portion of predicted destination choices were a perfect match?
# With an uninformative model we would expect 0.01, given that the 
# observed choice is included in the 100 available alternatives.

perfect_match = np.equal(predicted_tracts, observed_tracts)
print(sum(perfect_match)/len(perfect_match))

0.0156047562836


In [13]:
# What's the correlation between employment density of the predicted and 
# observed destinations? With an uninformative model we would expect 0.

density_1 = pd.Series([tracts.loc[t,'work_density'] for t in predicted_tracts])
density_2 = pd.Series([tracts.loc[t,'work_density'] for t in observed_tracts])

print(density_1.corr(density_2))

0.138159532444


### How does UrbanSim generate household location choices?

These three class methods collectively set up the choosers and alternatives according to various parameters like the sample size, prediction filters, "probability mode," and "choice mode" (aggregate or individual):

- `urbansim.models.MNLDiscreteChoiceModel.probabilities()` 
- `urbansim.models.MNLDiscreteChoiceModel.summed_probabilities()` 
- `urbansim.models.MNLDiscreteChoiceModel.predict()` 

https://github.com/UDST/urbansim/blob/master/urbansim/models/dcm.py#L474

Then this lower-level function generates a table of probabilities for each alternative, which is passed back to the `MNLDiscreteChoiceModel` class for further processing:

- `urbansim.urbanchoice.mnl.mnl_simulate()`

https://github.com/UDST/urbansim/blob/master/urbansim/urbanchoice/mnl.py#L121

### Generate probabilities instead of predictions

In [14]:
# Use coefs and testing dataset from above

print(coefs)
print(testing_df.shape)
print(testing_df.head(3))

0    0.011960
1    0.012945
2    0.007753
Name: Coefficient, dtype: float64
(3473300, 3)
               home_density  work_density  school_density
full_tract_id                                            
6097150607        10.659461      6.868701        7.160030
6097151502        18.674865      1.820991        1.856286
6081609700        27.867920      0.000000        0.000000


In [15]:
probs = mnl.mnl_simulate(testing_df, coefs, numalts=100, returnprobs=True)

print(probs.shape)
print(probs[:5,:5])

(34733, 100)
[[ 0.01027528  0.01016694  0.01092577  0.00952525  0.01036695]
 [ 0.01043302  0.0137536   0.00928926  0.00886562  0.01099936]
 [ 0.01185824  0.00912012  0.00813987  0.0082176   0.00935786]
 [ 0.01152868  0.00832404  0.00775938  0.00822781  0.00909811]
 [ 0.00644534  0.00714275  0.00711916  0.00645062  0.0082218 ]]


In [16]:
# Join probabilities to a multi-index of chooser and alternative id's
# Code adapted from UrbanSim: 
#   https://github.com/UDST/urbansim/blob/master/urbansim/models/dcm.py#L549-L556

mi = pd.MultiIndex.from_arrays(
        [testing.to_frame()[testing.observation_id_col], 
         testing.to_frame()[testing.alternative_id_col]],
        names=('chooser_id', 'alternative_id'))

probs_df = pd.Series(probs.flatten(), index=mi)

print(probs_df.head())

chooser_id   alternative_id
11485050104  6097150607        0.010275
             6097151502        0.010167
             6081609700        0.010926
             6001431000        0.009525
             6095252704        0.010367
dtype: float64
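
One thing to keep in mind with this approach is that `flatten()` reads the array row by row, so the probabilities only line up with the index if the table rows are grouped by chooser in the same order. Here is a small sketch with made-up IDs showing the alignment and how to look up values afterwards.

```python
import numpy as np
import pandas as pd

# Toy stand-ins: 2 choosers x 3 alternatives (all ids are made up)
toy_probs = np.array([[0.2, 0.3, 0.5],
                      [0.6, 0.1, 0.3]])

toy_mi = pd.MultiIndex.from_arrays(
        [[101, 101, 101, 102, 102, 102],
         [11, 12, 13, 12, 13, 14]],
        names=('chooser_id', 'alternative_id'))

# Row-major flatten: chooser 101's three values first, then chooser 102's
toy_series = pd.Series(toy_probs.flatten(), index=toy_mi)

print(toy_series.loc[101])        # all probabilities for one chooser
print(toy_series.loc[(102, 14)])  # one chooser-alternative pair
```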


### Sum the probabilities

Calculate the total probability associated with each alternative. This approach is adapted from UrbanSim. 

https://github.com/UDST/urbansim/blob/master/urbansim/models/dcm.py#L562-L597

Conceptually, the fitted model implies a discrete probability distribution for each agent choosing among a set of alternatives. Here we're summing those distributions across agents to get a single function that can serve as a proxy for the aggregate appeal of the alternatives. The sum is no longer a probability distribution itself, because it totals the number of agents rather than 1.

Important note! What we're actually creating here (I think) is a distribution over the alternatives sampled for each chooser. With random sampling, the sum will approximate a distribution over all the alternatives, scaled by the number of choosers. Non-random sampling will alter the interpretation: it's still a measure of aggregate appeal, but conditioned on the sampling procedure.

In [17]:
# Code adapted from UrbanSim - For each chooser, normalize the probabilities so
# they sum to 1 (is this really necessary?). Then sum the probabilities associated
# with each alternative. I'm using the first 500 choosers (50,000 rows) for efficiency.

def normalize(s):
    return s / s.sum()

summed_probs = probs_df[:50000].groupby(level=0).apply(normalize).groupby(level=1).sum()

print(summed_probs.head())

alternative_id
6001400100    0.259805
6001400200    0.268773
6001400300    0.210791
6001400400    0.312045
6001400500    0.414284
dtype: float64
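
Here is a toy version of the normalize-then-sum step that is small enough to check by hand. The IDs and values are made up, and the inputs deliberately do not sum to 1 within each chooser so that the normalization is visible. If the probabilities come straight from a softmax over an unfiltered choice set, they should already sum to 1 and the normalization would change nothing, though that hasn't been checked against the UrbanSim code here.

```python
import pandas as pd

toy_mi = pd.MultiIndex.from_tuples(
    [(1, 'a'), (1, 'b'), (2, 'a'), (2, 'c')],
    names=('chooser_id', 'alternative_id'))
toy_probs = pd.Series([0.2, 0.6, 0.3, 0.3], index=toy_mi)

# Rescale within each chooser so that chooser's probabilities sum to 1
chooser_totals = toy_probs.groupby(level='chooser_id').transform('sum')
normalized = toy_probs / chooser_totals

# Add up each alternative's share across the choosers who sampled it
summed = normalized.groupby(level='alternative_id').sum()

print(summed)
print(summed.sum())  # equals the number of choosers
```

Alternative `a` was sampled by both choosers and ends up with 0.25 + 0.5 = 0.75, while the grand total equals the number of choosers, which is why the summed values read as expected counts and not as probabilities.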
