## MNL prediction demo

Sam Maurer, July 2017

Python 3.6

### Summary

This notebook demonstrates how to fit a model using the ChoiceModels interface and then use the UrbanSim MNL functions to generate predictions. 

Eventually, a prediction interface will be incorporated into the `MultinomialLogitResults` object, but it's not there yet!

This demo uses the estimation data that's set up in the `Data-prep-01` notebook.

In [1]:
import numpy as np
import pandas as pd

from patsy import dmatrix

from choicemodels import mnl  # could also import from urbansim.urbanchoice
from choicemodels import MultinomialLogit
from choicemodels.tools import MergedChoiceTable

In [2]:
# Suppress all warnings, including deprecation warnings
import warnings; warnings.simplefilter('ignore')

### Load data from disk

In [3]:
tracts = pd.read_csv('../data/tracts.csv').set_index('full_tract_id')
tracts = tracts.loc[(tracts.home_density > 0) | (tracts.work_density > 0) | (tracts.school_density > 0)]

print(tracts.shape[0])
print(tracts.head(3))

1566
                   city  home_density  work_density  school_density
full_tract_id                                                      
6.001400e+09   BERKELEY     13.437961     13.130867       13.511570
6.001400e+09    OAKLAND     11.089638      4.248928        0.894794
6.001400e+09    OAKLAND     28.878399      7.671554        0.000000


In [4]:
trips = pd.read_csv('../data/trips.csv').set_index('place_id')
trips = trips.loc[trips.trip_distance_miles.notnull()]

print(trips.shape[0])
print(trips.head(3))

35787
              full_tract_id  mode  trip_distance_miles
place_id                                              
1.031985e+10   6.095252e+09   6.0            13.428271
1.031985e+10   6.095252e+09   5.0             5.125960
1.033586e+10   6.085512e+09   6.0           156.370628


### Set up estimation table

Each observed trip is a realized choice of a particular destination census tract. We can randomly sample alternative census tracts to build a model of destination choice.

We'll divide the trips into a training set and a testing set, fit an MNL model using the training data, use it to generate predicted choices for the testing data, and compare the predicted to the actual choices.
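To make the table structure concrete, here's a minimal pandas sketch of the long-format layout: one row per (observation, alternative) pair, with the chosen alternative flagged. The helper `build_choice_table()` and the toy data are hypothetical and only illustrate the idea; this is not how `MergedChoiceTable` is implemented, and its row ordering and column names may differ.

```python
import numpy as np
import pandas as pd

def build_choice_table(observations, alternatives, chosen_col, sample_size, seed=0):
    """Hypothetical helper: one row per (observation, alternative) pair."""
    rng = np.random.default_rng(seed)
    rows = []
    for obs_id, chosen_id in observations[chosen_col].items():
        # Sample non-chosen alternatives without replacement...
        pool = alternatives.index[alternatives.index != chosen_id].to_numpy()
        sampled = rng.choice(pool, size=sample_size - 1, replace=False)
        # ...and put the chosen alternative first (an arbitrary convention)
        for alt_id in [chosen_id, *sampled]:
            rows.append({'obs_id': obs_id, 'alt_id': alt_id,
                         'chosen': int(alt_id == chosen_id)})
    long_table = pd.DataFrame(rows)
    # Attach the alternatives' characteristics to every row
    return long_table.join(alternatives, on='alt_id')

# Toy data: 5 alternatives, 3 observations, 3 alternatives per choice set
toy_alts = pd.DataFrame({'work_density': [1.0, 2.0, 3.0, 4.0, 5.0]},
                        index=pd.Index([10, 20, 30, 40, 50], name='tract_id'))
toy_obs = pd.DataFrame({'full_tract_id': [20, 50, 10]},
                       index=pd.Index([1, 2, 3], name='obs_id'))

toy_table = build_choice_table(toy_obs, toy_alts, 'full_tract_id', sample_size=3)
print(toy_table.shape)
```

With 3 observations and 3 alternatives each, the toy table has 9 rows, the same logic that gives 1,000 x 100 = 100,000 rows for the training table below.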

In [5]:
training_observations = trips.iloc[:1000]
training = MergedChoiceTable(observations = training_observations,
                             alternatives = tracts,
                             chosen_alternatives = training_observations.full_tract_id,
                             sample_size = 100)

testing_observations = trips.iloc[1000:]
testing = MergedChoiceTable(observations = testing_observations,
                            alternatives = tracts,
                            chosen_alternatives = testing_observations.full_tract_id,
                            sample_size = 100)

print(training.to_frame().shape)
print(testing.to_frame().shape)

(100000, 9)
(3473400, 9)


### Fit a model using the training observations

In [6]:
%%time
model_expression = "home_density + work_density + school_density - 1"

model = MultinomialLogit(data = training.to_frame(), 
                         observation_id_col = training.observation_id_col, 
                         choice_col = training.choice_col,
                         model_expression = model_expression)

results = model.fit()
print(results)

                  CHOICEMODELS ESTIMATION RESULTS                  
Dep. Var.:                chosen   No. Observations:               
Model:         Multinomial Logit   Df Residuals:                   
Method:       Maximum Likelihood   Df Model:                       
Date:                              Pseudo R-squ.:                  
Time:                              Pseudo R-bar-squ.:              
AIC:                               Log-Likelihood:       -4,504.887
BIC:                               LL-Null:              -4,605.170
                    coef   std err         z     P>|z|   Conf. Int.
-------------------------------------------------------------------
home_density      0.0109     0.002     5.848                       
work_density      0.0122     0.001    15.221                       
school_density    0.0071     0.004     1.976                       
CPU times: user 499 ms, sys: 46.8 ms, total: 546 ms
Wall time: 192 ms


### Predict destination choices for the testing observations

We'll call the UrbanSim MNL simulation function directly (here through the `mnl` module imported from ChoiceModels), because prediction hasn't been integrated into the ChoiceModels results classes yet. Source: https://github.com/UDST/choicemodels/blob/master/choicemodels/mnl.py#L536

In [7]:
# Pull the coefs out of the results object (the PyLogit syntax would be different)

coefs = results.get_raw_results()['fit_parameters']['Coefficient']
print(coefs)

0    0.010935
1    0.012232
2    0.007140
Name: Coefficient, dtype: float64


In [8]:
# The data columns for prediction need to align with the coefficients; 
# you can do this manually or with patsy, as here

df = testing.to_frame().set_index('full_tract_id')

testing_df = dmatrix(model_expression, data=df, return_type='dataframe')
print(testing_df.shape)
print(testing_df.head(3))

(3473400, 3)
               home_density  work_density  school_density
full_tract_id                                            
6.097151e+09      10.659461      6.868701        7.160030
6.085512e+09      34.971081      5.483731        2.181334
6.013326e+09      21.491132      0.153325        1.326145


In [9]:
# Simulate a destination choice for each testing observation

choices = mnl.mnl_simulate(testing_df, coefs, numalts=100, returnprobs=False)

print(len(choices))
print(choices[:5])

34734
[90 24 75 80 70]
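For intuition, here's a rough NumPy sketch of what an MNL simulation step does: compute a linear utility for each row, turn the utilities into probabilities within each choice scenario, and draw one alternative per scenario. The function `simulate_mnl_choices()` is hypothetical and written for illustration; it isn't a copy of `mnl.mnl_simulate()`, whose internals may differ.

```python
import numpy as np

def simulate_mnl_choices(X, beta, numalts, seed=0):
    """Hypothetical sketch: draw one alternative per choice scenario."""
    rng = np.random.default_rng(seed)
    # Linear utility for every (observation, alternative) row
    utilities = np.asarray(X) @ np.asarray(beta)
    utilities = utilities.reshape(-1, numalts)          # shape (N, J)
    # Softmax within each scenario; subtract the row max for numerical stability
    exp_u = np.exp(utilities - utilities.max(axis=1, keepdims=True))
    probs = exp_u / exp_u.sum(axis=1, keepdims=True)
    # Inverse-CDF sampling: count how many cumulative probabilities
    # fall below a uniform draw to get the chosen position
    draws = rng.random((probs.shape[0], 1))
    positions = (probs.cumsum(axis=1) < draws).sum(axis=1)
    positions = np.minimum(positions, numalts - 1)      # guard against float rounding
    return positions, probs

# Toy data: 2 scenarios x 3 alternatives, one explanatory variable
toy_X = np.array([[0.0], [0.0], [50.0],    # scenario 0: third alternative dominates
                  [1.0], [1.0], [1.0]])    # scenario 1: all alternatives identical
toy_beta = np.array([1.0])

toy_positions, toy_probs = simulate_mnl_choices(toy_X, toy_beta, numalts=3)
print(toy_probs.round(3))
print(toy_positions)
```

Like `mnl_simulate()`, this returns positions within each choice set rather than alternative IDs, so the rows have to be grouped in blocks of `numalts` per observation.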


In [10]:
# Annoyingly, that identifies the choices by position rather than by id;
# here's a function to get the id's

def get_chosen_ids(ids, positions):
    """
    We observe N choice scenarios. In each, one of J alternatives is chosen.
    We have a long (len N * J) list of the available alternatives. We have a 
    list (len N) of which alternatives were chosen, but it identifies them 
    by POSITION and we want their ID.    
    
    Parameters
    ----------
    ids : list or list-like
        List of alternative ID's (len N * J).
        
    positions : list or list-like
        List of chosen alternatives by position (len N), where each entry is
        an int in range [0, J)
    
    Returns
    -------
    chosen_ids : list
        List of chosen alternatives by ID (len N)
    
    """
    N = len(positions)
    J = len(ids) // N  # integer division: np.reshape needs an int dimension in Python 3
    
    ids_by_obs = np.reshape(ids, (N,J))
    return [ids_by_obs[i][positions[i]] for i in range(N)]
    

print(get_chosen_ids(['a','b','c','d'], [0,1]))

['a', 'd']
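The list comprehension is fine for a demo, but with tens of thousands of observations a vectorized lookup may be quicker. Here's one possible variant using NumPy integer-array indexing; `get_chosen_ids_vectorized()` is a hypothetical name, and it assumes the ids are ordered in blocks of J per observation, just like the function above.

```python
import numpy as np

def get_chosen_ids_vectorized(ids, positions):
    """Hypothetical vectorized variant of get_chosen_ids()."""
    positions = np.asarray(positions)
    N = len(positions)
    # Let NumPy infer J from the total length
    ids_by_obs = np.asarray(ids).reshape(N, -1)      # shape (N, J)
    # Pick column positions[i] from row i, for all rows at once
    return ids_by_obs[np.arange(N), positions].tolist()

result = get_chosen_ids_vectorized(['a', 'b', 'c', 'd'], [0, 1])
print(result)
```

This should give the same answer as `get_chosen_ids()` for the toy input.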


In [11]:
# Get tract id's for the simulated choices

predicted_tracts = get_chosen_ids(testing_df.index.tolist(), choices)

print(len(predicted_tracts))
print(predicted_tracts[:5])

34734
[6085500400.0, 6085512020.0, 6013355115.0, 6085505008.0, 6075016802.0]


In [12]:
# Get tract id's for observed choices

df = testing.to_frame()
observed_tracts = df.loc[df.chosen == 1, 'full_tract_id'].tolist()

print(len(observed_tracts))
print(observed_tracts[:5])

34734
[6097150607.0, 6097150607.0, 6097153200.0, 6097151402.0, 6097151402.0]


### Compare the predicted choices to the observed ones

Multinomial models are kind of tricky to validate. We don't expect the predicted choices to match the observed ones exactly, because there are so many alternatives, but we do expect the characteristics of the predicted choices to be similar to the characteristics of the observed choices.

Choose your own metric for this depending on what you're trying to evaluate! It's even plausible that the metric could be something not directly in the model, like the distance between the predicted and actual destination choices.

In [13]:
# What portion of predicted destination choices were a perfect match?
# With an uninformative model we would expect 0.01, given that the 
# observed choice is included in the 100 available alternatives.

perfect_match = np.equal(predicted_tracts, observed_tracts)
print(sum(perfect_match)/len(perfect_match))

0.0154603558473
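As a sanity check on the 0.01 baseline, here's a small simulation of an uninformative model that picks uniformly among 100 alternatives. It's a sketch under simplifying assumptions: the observed choice is treated as sitting at a random position in each choice set, and the variable names are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs, numalts = 34734, 100

# Assume the observed choice sits at a random position within each choice set
observed_pos = rng.integers(0, numalts, size=n_obs)
# An uninformative model picks uniformly among the alternatives
random_pos = rng.integers(0, numalts, size=n_obs)

baseline_rate = np.mean(observed_pos == random_pos)
# Binomial standard error of the match rate under the uninformative model
baseline_se = np.sqrt((1 / numalts) * (1 - 1 / numalts) / n_obs)

print(baseline_rate)
print(baseline_se)
```

The standard error comes out around 0.0005, which gives a rough sense of how far a match rate needs to be from 0.01 before it looks like more than sampling noise.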


In [14]:
# What's the correlation between employment density of the predicted and 
# observed destinations? With an uninformative model we would expect 0.

density_1 = pd.Series([tracts.loc[t,'work_density'] for t in predicted_tracts])
density_2 = pd.Series([tracts.loc[t,'work_density'] for t in observed_tracts])

print(density_1.corr(density_2))

0.145854426158


### How does UrbanSim generate household location choices?

These three class methods collectively set up the choosers and alternatives according to various parameters like the sample size, prediction filters, "probability mode," and "choice mode" (aggregate or individual):

- `urbansim.models.MNLDiscreteChoiceModel.probabilities()`
- `urbansim.models.MNLDiscreteChoiceModel.summed_probabilities()`
- `urbansim.models.MNLDiscreteChoiceModel.predict()`

https://github.com/UDST/urbansim/blob/master/urbansim/models/dcm.py#L474

Then this lower-level function generates a table of probabilities for each alternative, which is passed back to the `MNLDiscreteChoiceModel` class for further processing:

- `urbansim.urbanchoice.mnl.mnl_simulate()`

https://github.com/UDST/urbansim/blob/master/urbansim/urbanchoice/mnl.py#L121