# Mode choice prediction for non-car owning households in the USA
**Decision-aid methodologies in transportation, EPFL Spring 2021**

Florent Zolliker, Gaelle Abi Younes, Luca Bataillard

## Step 1: Data pre-processing

In this step, we will process and adjust the dataset in order to facilitate our model training. We begin by importing the datasets and relevant libraries.

In [109]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

SEED = 42
np.random.seed(SEED)

In [110]:
train_validate = pd.read_csv("nhts_train_validate.csv", index_col="TRIPID")
test = pd.read_csv("nhts_test.csv", index_col="TRIPID")

We now need to consider the features and their format in order to select the appropriate ones. 

We first notice a group of context columns, that are not relevant to model training:
* `TRIPID`: trip identifier, indexes the dataset but otherwise not relevant for training or grouped sampling.
* `HOUSEID`: household identifier, this is the topmost hiearchical group in the survey. This column should be used to perform grouped samping during cross-validation
* `PERSONID`: person identifier, this is another hiearchical group, but since `HOUSEID` is higher in the hiearchy, using it is not necessary.
* `TDTRPNUM`: trip numbering per person in the survey.

The target label is `TRAVELMODE`. This label is categorical, so we will encode the values using a simple numeric encoding. 

We also notice a `TRPTRANS` column that is very highly correllated with `TRAVELMODE` but does not feature in the `nhts_dictionary.csv` file provided with the project. After inspecting the NHTS documentation online, we suspect that the travel mode column was most likely generated from this column. We will thus discard the `TRPTRANS`column. Furthermore, negative responses in the `TRPTRANS` column resulted in `NaN` values in `TRAVELMODE`, that need to be filtered out.

Let us analyse the remaining columns. Missing values indicates that some values for that column are invalid in the dataset. In the case of categorical, this does not pose a big problem, since 

| Column name | Missing values | Categorical data | One-hot encoding | Scaling | Use feature? | Description | Comments |
| ---         | ---            | ---              | ---              | ---     | ---          | ---         | ---      |
| `STRTTIME`  |  -  |  -  |  -  | yes | yes | start time of trip | |
| `TRPMILES`  | yes |  -  |  -  | yes | yes | length of trip in miles ||
| `LOOP_TRIP` |  -  | yes | yes |  -  | yes | same origin and destination | binary variable, applying `mod 2` 
to this column can replace one-hot encoding
| `TRIPPURP`  |  -  | yes | yes |  -  | yes | trip purpose ||
| `TRAVDAY`   |  -  | yes |  -  | yes | yes | weekday of travel ||
| `HOMEOWN`   | yes | yes | yes |  -  | yes | home ownership ||
| `HHSIZE`    |  -  |  -  |  -  | yes | yes | size of household ||
| `HHFAMINC`  | yes | yes |  -  | yes | yes | household income ||
| `HHSTATE`   |  -  | yes | yes |  -  |  ?  | household state of residency | could use either `HHSTATE` or `CENSUS_D` |
| `WRKCOUNT`  |  -  |  -  |  -  | yes | yes | number of workers in household ||
| `LIF_CYC`   |  -  |  -  |  -  | yes | yes | life cycle classification ||
| `URBAN`     |  -  | yes |  ?  |  ?  | yes | classification of urban area ||
| `URBANSIZE` |  -  | yes |  ?  |  ?  |  ?  | population size of urban area | redundant with `URBAN`, needs reordering or one-hot encoding |
| `CENSUS_D`  |  -  | yes | yes |  -  |  -  | census division (region) of household | could be redundant with `HHSTATE` |
| `HH_RACE`   | yes | yes | yes |  -  |  yes  | race of household | |
| `EDUC`      | yes | yes |  -  | yes |  yes  | educational attainment of household | |
| `WORKER`    | yes | yes | yes |  -  |  yes  | worker status | |
| `WHYTRP90`  | yes | yes | yes |  -  |  yes  | trip purpose with 1990 NPTS design | possible duplicate of `TRIPPURP`|
| `R_AGE_IMP` |  -  |  -  |  -  | yes |  yes  | age | |
| `R_SEX_IMP` |  -  | yes | yes |  -  |  yes  | gender | |
| `OBHUR`     | yes | yes |  -  |  -  |  yes  | urban/rural indicator at origin | |
| `DBHUR`     | yes | yes |  -  |  -  |  yes  | urban/rural indicator at destination | |
| `OBPPOPDN`  | yes | yes |  -  |  -  |   -   | population density at origin | already covered by `OBHUR` |
| `DBPPOPDN`  | yes | yes |  -  |  -  |   -   | population density at destination | already covered by `DBHUR` |














In [123]:
# Target column and topmost hiearchical sampling column
target = "TRAVELMODE"
group = "HOUSEID"

# Context columns (not used here) 
context_cols = [
    "TRIPID", "HOUSEID", "PERSONID", "TDTRPNUM"
]

# Columns used as features that do not need a specific encoding 
no_changes = [
    "STRTTIME", "TRPMILES", "TRAVDAY", "HHSIZE", "HHFAMINC", "WRKCOUNT", 
    "LIF_CYC", "URBAN", "EDUC", "R_AGE_IMP", "DBPPOPDN", "OBPPOPDN"
]

# Categorical columns that need one-hot encoding and the values they can take
one_hot_encodings = {
    "TRIPPURP": ["HBO", "HBSHOP", "HBSOCREC", "HBW", "NHB", -9],
    "HOMEOWN": [1, 2, 97, -8, -7],
    "HHSTATE": ["AK", "AL", "AR", "AZ", "CA", "CO", "CT", "DC", "DE", "FL", "GA", "HI", "IA", "ID", "IL", "IN", 
                   "KS","KY", "LA", "MA", "MD", "ME", "MI", "MN", "MO", "MS", "MT", "NC", "ND", "NE", "NH", "NJ","NM",
                    "NV","NY", "OH", "OK", "OR", "PA", "RI", "SC", "SD", "TN", "TX", "UT", "VA", "VT", "WA", "WI", 
                    "WV", "WY"],
    "CENSUS_D": [c for c in range(1, 10)],
    "HH_RACE": [1, 2, 3, 4, 5, 6, 97, -7, -8], 
    "WORKER": [1, 2, -9, -1],
    "WHYTRP90": [1, 2, 3, 4, 5, 6, 8, 10, 11, 99],
    "R_SEX_IMP": [1, 2],
}

# Columns that need some in-place transformation (such as numerical label encoding) 
hur_encoding = lambda x: ["-9", "R", "T", "C", "S", "U"].index(x)
label_encodings = {
    "LOOP_TRIP": lambda x: x % 2,
    "URBANSIZE": lambda x: (x % 6) + 1,
    "OBHUR": hur_encoding,
    "DBHUR": hur_encoding,
}

In [124]:
def one_hot_encode(df, col_cat_encodings):
    """
    Takes a NHTS pandas dataset and a dictionary mapping categorical columns to their
    possible values. 
    Returns a new dataset with all categorical column one-hot encoding. A column of zeroes
    is created for each value that appears in the list of possible values but not in actual data
    
    TODO: Check out https://www.algosome.com/articles/dummy-variable-trap-regression.html
    """
    
    labels = set([col + ":" + str(cat) for col, cats in col_cat_encodings.items() for cat in cats])
    original_columns = set(df.columns)
    new_columns = original_columns | labels - col_cat_encodings.keys()
    
    df_one_hot = pd.get_dummies(df, columns=col_cat_encodings.keys(), prefix_sep=":")
    df_full = df_one_hot.reindex(columns=new_columns, fill_value=0)
    
    return df_full

def process_dataset(df):
    """
    Takes a pandas dataset in the NHTS survey format, keeps only columns in `no_changes`,
    `one_hot_encodings` and `label_encodings`. One-hot encodes features in `one_hot_encodings`
    and generates numeric labels for columns in `label_encodings`
    Returns a new dataframe containing the transformed features.
    """
    
    columns = [*no_changes, *one_hot_encodings.keys(), *label_encodings.keys()]
    df_no_nans = df.dropna(axis="index")
    
    features = df_no_nans[columns]
    X = one_hot_encode(features, one_hot_encodings)
    for column, encoding in label_encodings.items():
        X[column] = X[column].map(encoding)
        
    return X

In [125]:
X = process_dataset(train_validate)
y = train_validate[target]
groups = train_validate[group]

X_test = process_dataset(test)

In [122]:
X

Unnamed: 0_level_0,WHYTRP90:1,HH_RACE:97,TRIPPURP,HOMEOWN:1,STRTTIME,WHYTRP90:5,HH_RACE:-7,HH_RACE:-8,HHFAMINC,WHYTRP90:3,...,URBAN,CENSUS_D:5,WHYTRP90:11,TRIPPURP:HBSHOP,WORKER:-9,WHYTRP90:99,WHYTRP90:4,WORKER,CENSUS_D:9,TRPMILES
TRIPID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,0,0,0,0,615,0,0,0,5,0,...,1,0,0,1,0,0,0,0,0,2.140
2,0,0,0,0,703,0,0,0,5,0,...,1,0,0,0,0,0,1,0,0,2.426
3,0,0,0,0,735,0,0,0,5,0,...,1,0,0,0,0,0,0,0,0,2.752
4,0,0,0,0,1500,0,0,0,5,0,...,1,0,0,0,0,0,1,0,0,2.752
5,0,0,0,0,1612,0,0,0,5,1,...,1,0,0,0,0,0,0,0,0,1.057
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16137,0,0,0,1,1300,0,0,0,7,1,...,1,0,0,0,0,0,0,0,0,1.876
16138,0,0,0,1,1325,0,0,0,7,1,...,1,0,0,0,0,0,0,0,0,0.239
16139,0,0,0,1,1440,0,0,0,7,1,...,1,0,0,1,0,0,0,0,0,3.683
16140,0,0,0,0,1430,0,0,0,2,1,...,2,0,0,1,0,0,0,0,0,4.914


In [126]:
X_test

Unnamed: 0_level_0,WHYTRP90:1,HH_RACE:97,TRIPPURP,HOMEOWN:1,STRTTIME,WHYTRP90:5,HH_RACE:-7,HH_RACE:-8,HHFAMINC,WHYTRP90:3,...,URBAN,CENSUS_D:5,WHYTRP90:11,TRIPPURP:HBSHOP,WORKER:-9,WHYTRP90:99,WHYTRP90:4,WORKER,CENSUS_D:9,TRPMILES
TRIPID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
16142,1,0,0,0,2100,0,0,0,6,0,...,1,1,0,0,0,0,0,0,0,16.092
16143,1,0,0,0,800,0,0,0,3,0,...,1,0,0,0,0,0,0,0,0,1.641
16144,1,0,0,0,1400,0,0,0,3,0,...,1,0,0,0,0,0,0,0,0,1.641
16145,1,0,0,0,800,0,0,0,3,0,...,1,0,0,0,0,0,0,0,0,1.145
16146,1,0,0,0,1530,0,0,0,3,0,...,1,0,0,0,0,0,0,0,0,1.145
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23176,1,0,0,1,810,0,0,0,10,0,...,1,1,0,0,0,0,0,0,0,1.168
23177,0,0,0,1,1320,0,0,0,10,0,...,1,1,0,0,0,0,0,0,0,0.238
23178,0,0,0,1,1415,0,0,0,10,0,...,1,1,0,0,0,0,0,0,0,0.238
23179,0,0,0,1,1820,0,0,0,10,0,...,1,1,0,0,0,0,1,0,0,0.867
