# Mode choice prediction for non-car owning households in the USA
**Decision-aid methodologies in transportation, EPFL Spring 2021**

Florent Zolliker, Gaelle Abi Younes, Luca Bataillard

## Step 1: Data pre-processing

In this step, we will process and adjust the dataset in order to facilitate our model training. We begin by importing the datasets and relevant libraries.

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

SEED = 42
np.random.seed(SEED)

In [3]:
train_validate = pd.read_csv("nhts_train_validate.csv", index_col="TRIPID")
test = pd.read_csv("nhts_test.csv", index_col="TRIPID")

We now need to consider the features and their format in order to select the appropriate ones. 

We first notice a group of context columns that are not relevant to model training:
* `TRIPID`: trip identifier; it indexes the dataset but is otherwise not relevant for training or grouped sampling.
* `HOUSEID`: household identifier; this is the topmost hierarchical group in the survey. This column should be used to perform grouped sampling during cross-validation.
* `PERSONID`: person identifier; this is another hierarchical group, but since `HOUSEID` is higher in the hierarchy, using it is not necessary.
* `TDTRPNUM`: trip numbering per person in the survey.
* `LOOP_TRIP`: 
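
Grouped sampling by household can be sketched as follows. This is a minimal illustration on toy data, not the procedure used later in the project: the helper name `grouped_kfold_indices` and the toy household identifiers are assumptions made for the example. scikit-learn's `GroupKFold` offers the same guarantee out of the box.

```python
import numpy as np

def grouped_kfold_indices(groups, n_splits=5, seed=42):
    """Yield (train_idx, val_idx) pairs in which no group is split across folds."""
    groups = np.asarray(groups)
    rng = np.random.default_rng(seed)
    # Shuffle the distinct group identifiers, then cut them into n_splits chunks.
    unique_groups = rng.permutation(np.unique(groups))
    for fold_groups in np.array_split(unique_groups, n_splits):
        val_mask = np.isin(groups, fold_groups)
        yield np.flatnonzero(~val_mask), np.flatnonzero(val_mask)

# Toy stand-in for the HOUSEID column: six trips made by three households.
house_ids = np.array([1, 1, 2, 2, 3, 3])
folds = list(grouped_kfold_indices(house_ids, n_splits=3))
for train_idx, val_idx in folds:
    # A household never appears on both sides of a split.
    assert not set(house_ids[train_idx]) & set(house_ids[val_idx])
```

Splitting on households rather than on individual trips avoids evaluating the model on trips made by a household it has already seen during training.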

The target label is `TRAVELMODE`. This label is categorical, so we will encode the values using a simple numeric encoding. 
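
One possible way to obtain such a numeric encoding is through pandas categoricals. The sketch below uses a handful of toy labels; the variable names are illustrative and the resulting codes depend only on the alphabetical order of the labels present.

```python
import pandas as pd

# Toy labels using mode names that appear in the survey data.
modes = pd.Series(["WALK", "BUS", "DRIVE", "WALK", "RAIL"], name="TRAVELMODE")

# Categories are sorted alphabetically and each label is replaced by its position.
categorical = modes.astype("category")
codes = categorical.cat.codes
mode_names = list(categorical.cat.categories)

print(dict(enumerate(mode_names)))  # {0: 'BUS', 1: 'DRIVE', 2: 'RAIL', 3: 'WALK'}
print(codes.tolist())               # [3, 0, 1, 3, 2]
```

Keeping `mode_names` around makes it possible to translate predicted codes back into readable mode names.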

We also notice a `TRPTRANS` column that is very highly correlated with `TRAVELMODE` but does not feature in the `nhts_dictionary.csv` file provided with the project. After inspecting the NHTS documentation online, we suspect that the travel mode column was most likely generated from this column. We will thus discard the `TRPTRANS` column. Furthermore, negative responses in the `TRPTRANS` column resulted in `NaN` values in `TRAVELMODE`, which need to be filtered out.
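
The two clean-up steps described above can be illustrated on a few toy rows. The values are made up for the example, and the `-9` code for a negative response is an assumption.

```python
import numpy as np
import pandas as pd

# Toy rows: a negative TRPTRANS response leaves TRAVELMODE undefined.
trips = pd.DataFrame(
    {"TRPTRANS": [1, 3, -9, 15], "TRAVELMODE": ["WALK", "DRIVE", np.nan, "BUS"]},
    index=pd.Index([1, 2, 3, 4], name="TRIPID"),
)

# Keep only labelled trips, then remove the column the label was derived from.
labelled = trips[trips["TRAVELMODE"].notna()].drop(columns=["TRPTRANS"])
print(labelled)
```

Leaving `TRPTRANS` among the features would let a model read the answer almost directly from its input, which is why it is removed before training.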

Let us analyse the remaining columns. A "yes" under *Missing values* indicates that some values of that column are invalid in the dataset. For categorical columns this does not pose a big problem, since the invalid codes can be kept as categories of their own.

| Column name | Missing values | Categorical data | One-hot encoding | Scaling | Use feature? | Description | Comments |
| ---         | ---            | ---              | ---              | ---     | ---          | ---         | ---      |
| `STRTTIME`  |  -  |  -  |  -  | yes | yes | start time of trip | |
| `TRPMILES`  | yes |  -  |  -  | yes | yes | length of trip in miles ||
| `TRIPPURP`  |  -  | yes | yes |  -  | yes | trip purpose ||
| `TRAVDAY`   |  -  | yes |  -  | yes | yes | weekday of travel ||
| `HOMEOWN`   | yes | yes | yes |  -  | yes | home ownership ||
| `HHSIZE`    |  -  |  -  |  -  | yes | yes | size of household ||
| `HHFAMINC`  | yes | yes |  -  | yes | yes | household income ||
| `HHSTATE`   |  -  | yes | yes |  -  |  ?  | household state of residency | could use either `HHSTATE` or `CENSUS_D` |
| `WRKCOUNT`  |  -  |  -  |  -  | yes | yes | number of workers in household ||
| `LIF_CYC`   |  -  |  -  |  -  | yes | yes | life cycle classification ||
| `URBAN`     |  -  | yes |  ?  |  ?  | yes | classification of urban area ||
| `URBANSIZE` |  -  | yes |  ?  |  ?  |  ?  | population size of urban area | redundant with `URBAN`, needs reordering or one-hot encoding |
| `CENSUS_D`  |  -  | yes | yes |  -  |  -  | census division (region) of household | could be redundant with `HHSTATE` |
| `HH_RACE`   | yes | yes | yes |  -  |  yes  | race of household | |
| `EDUC`      | yes | yes |  -  | yes |  yes  | educational attainment of household | |
| `WORKER`    | yes | yes | yes |  -  |  yes  | worker status | |
| `WHYTRP90`  | yes | yes | yes |  -  |  yes  | trip purpose with 1990 NPTS design | possible duplicate of `TRIPPURP`|
| `R_AGE_IMP` |  -  |  -  |  -  | yes |  yes  | age | |
| `R_SEX_IMP` |  -  | yes | yes |  -  |  yes  | gender | |
| `OBHUR`     | yes | yes |  -  |  -  |  yes  | urban/rural indicator at origin | |
| `DBHUR`     | yes | yes |  -  |  -  |  yes  | urban/rural indicator at destination | |
| `OBPPOPDN`  | yes | yes |  -  |  -  |   -   | population density at origin | already covered by `OBHUR` |
| `DBPPOPDN`  | yes | yes |  -  |  -  |   -   | population density at destination | already covered by `DBHUR` |
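
As a rough illustration of the two transformations named in the table, the sketch below one-hot encodes a categorical column with a fixed list of codes and standardises a numeric one. The toy values are invented, and whether scaling statistics should come from the training split only is a modelling decision left open here.

```python
import pandas as pd

# Toy columns; the HOMEOWN code list includes the negative "invalid response" codes.
toy = pd.DataFrame({"HOMEOWN": [1, 2, -7, 1], "HHSIZE": [1, 4, 2, 1]})
homeown_codes = [1, 2, 97, -8, -7]

# Fixing the categories up front yields the same columns for every dataset,
# even when a code (here 97 and -8) is absent from it.
homeown = toy["HOMEOWN"].astype(pd.CategoricalDtype(homeown_codes))
one_hot = pd.get_dummies(homeown, prefix="HOMEOWN")

# Standardise a numeric column to zero mean and unit variance.
hhsize_scaled = (toy["HHSIZE"] - toy["HHSIZE"].mean()) / toy["HHSIZE"].std(ddof=0)

print(list(one_hot.columns))
```

Declaring the full code list matters because the training and test files may not contain exactly the same set of codes.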

In [28]:
def process_dataset(df):
    """
    Takes a pandas dataset in the NHTS survey format, selects and transforms features.
    Returns (X, y, groups), a tuple containing features, labels and sampling groups respectively.
    """

    target = "TRAVELMODE"
    group = "HOUSEID"

    # Trips without a valid travel mode cannot be used for training or evaluation.
    df_no_nans = df[df[target].notna()]

    y = df_no_nans[target]
    groups = df_no_nans[group]

    # Columns used as they are.
    no_changes = [
        "STRTTIME",
        "TRPMILES",
        "TRAVDAY",
        "HHSIZE",
        "HHFAMINC",
        "WRKCOUNT",
        "LIF_CYC",
        "URBAN",
        "EDUC",
        "R_AGE_IMP",
        "DBPPOPDN",
        "OBPPOPDN",
    ]

    # One-hot encoded columns with their full list of codes, so that every
    # dataset produces the same feature columns.
    one_hot = {
        "TRIPPURP": ["HBO", "HBSHOP", "HBSOCREC", "HBW", "NHB", -9],
        "HOMEOWN": [1, 2, 97, -8, -7],
        "HHSTATE": ["AK", "AL", "AR", "AZ", "CA", "CO", "CT", "DC", "DE", "FL", "GA", "HI", "IA", "ID", "IL", "IN",
                    "KS", "KY", "LA", "MA", "MD", "ME", "MI", "MN", "MO", "MS", "MT", "NC", "ND", "NE", "NH", "NJ", "NM",
                    "NV", "NY", "OH", "OK", "OR", "PA", "RI", "SC", "SD", "TN", "TX", "UT", "VA", "VT", "WA", "WI",
                    "WV", "WY"],
        "CENSUS_D": list(range(1, 10)),
        "HH_RACE": [1, 2, 3, 4, 5, 6, 97, -7, -8],
        "WORKER": [1, 2, -9, -1],
        "WHYTRP90": [1, 2, 3, 4, 5, 6, 8, 10, 11, 99],
        "R_SEX_IMP": [1, 2],
    }

    # Ordinal rank of each urban/rural code (a dict, since a set has no order).
    hur_order = ["-9", "R", "T", "C", "S", "U"]
    hur_encoding = {code: rank for rank, code in enumerate(hur_order)}

    label_encoding = {
        "URBANSIZE": lambda x: (x % 6) + 1,
        "OBHUR": hur_encoding,
        "DBHUR": hur_encoding,
    }

    # Assemble the feature matrix column group by column group.
    X = df_no_nans[no_changes].copy()
    for col, categories in one_hot.items():
        as_category = df_no_nans[col].astype(pd.CategoricalDtype(categories))
        X = X.join(pd.get_dummies(as_category, prefix=col))
    for col, encoding in label_encoding.items():
        X[col] = df_no_nans[col].map(encoding)

    return X, y, groups
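
The two label encodings in `process_dataset` can be checked on a few values. The sketch below assumes that urban-size codes 1 to 5 are increasing population bands and that 6 means "not in an urbanized area"; it also assumes that the urban/rural codes are meant to be ranked in the order in which they are listed. Both assumptions should be verified against the NHTS codebook.

```python
import pandas as pd

# Modulo remapping: code 6 moves to the front, codes 1-5 shift up by one.
urbansize = pd.Series([6, 1, 3, 5])
reordered = urbansize.map(lambda x: (x % 6) + 1)
print(reordered.tolist())  # [1, 2, 4, 6]

# Ordinal encoding of the urban/rural indicator via an explicit rank per code.
hur_order = ["-9", "R", "T", "C", "S", "U"]
hur_rank = {code: rank for rank, code in enumerate(hur_order)}
obhur = pd.Series(["U", "S", "-9", "R"])
obhur_encoded = obhur.map(hur_rank)
print(obhur_encoded.tolist())  # [5, 4, 0, 1]
```

An explicit list is used for the ranking because a Python set does not preserve the order in which its elements are written.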

In [30]:
X,y,groups = process_dataset(train_validate)
groups

TRIPID
1           1
2           1
3           1
4           1
5           1
         ... 
16137    3303
16138    3303
16139    3303
16140    3304
16141    3304
Name: HOUSEID, Length: 16133, dtype: int64

In [8]:
pd.DataFrame(train_validate[["TRPTRANS", "TRAVELMODE"]].groupby(by=["TRPTRANS", "TRAVELMODE"]))[0]

0           (1, WALK)
1          (2, CYCLE)
2          (3, DRIVE)
3      (3, PASSENGER)
4          (4, DRIVE)
5      (4, PASSENGER)
6          (5, DRIVE)
7      (5, PASSENGER)
8          (6, DRIVE)
9      (6, PASSENGER)
10         (7, OTHER)
11     (8, PASSENGER)
12     (9, PASSENGER)
13         (10, RAIL)
14         (11, RAIL)
15         (12, RAIL)
16        (13, OTHER)
17         (14, RAIL)
18          (15, BUS)
19          (16, BUS)
20         (17, TAXI)
21        (18, DRIVE)
22    (18, PASSENGER)
23        (19, OTHER)
24        (20, OTHER)
25        (97, OTHER)
Name: 0, dtype: object

In [39]:
train_validate[train_validate["TRPMILES"] < 0]

TRIPID,HOUSEID,PERSONID,TDTRPNUM,STRTTIME,TRPMILES,TRPTRANS,LOOP_TRIP,TRIPPURP,TRAVDAY,HOMEOWN,...,EDUC,WORKER,WHYTRP90,R_AGE_IMP,R_SEX_IMP,OBHUR,DBHUR,OBPPOPDN,DBPPOPDN,TRAVELMODE
120,24,1,1,700,-9.0,1,1,HBO,3,2,...,3,2,10,69,2,T,T,1500,1500,WALK
1073,228,1,1,1840,-9.0,97,1,HBO,4,1,...,2,2,11,92,2,T,T,1500,1500,OTHER
2645,547,2,1,1000,-9.0,97,1,HBO,5,2,...,3,2,11,64,1,S,S,3000,3000,OTHER
3552,739,1,3,1455,-9.0,1,1,HBO,6,2,...,2,2,10,72,2,U,U,7000,7000,WALK
3655,754,3,3,1620,-9.0,11,1,HBO,6,2,...,-1,-1,11,11,1,U,U,30000,30000,RAIL
3744,772,1,1,850,-9.0,17,1,NHB,4,2,...,5,1,10,32,1,C,C,7000,7000,TAXI
4057,836,1,5,1800,-9.0,1,1,HBO,5,2,...,5,2,10,79,2,T,T,750,750,WALK
4619,946,1,1,830,-9.0,1,1,NHB,3,2,...,3,1,1,50,1,C,C,7000,7000,WALK
4687,959,2,1,1000,-9.0,97,1,HBO,6,2,...,2,2,11,60,2,U,U,30000,30000,OTHER
6833,1383,1,2,1000,-9.0,11,1,NHB,4,2,...,4,2,1,62,1,S,S,1500,1500,RAIL


In [62]:
train_validate["OBHUR"].unique()

array(['U', 'S', 'C', 'R', 'T', '-9'], dtype=object)