# Mode choice prediction for non-car owning households in the USA
**Decision-aid methodologies in transportation, EPFL Spring 2021**

Florent Zolliker, Gaelle Abi Younes, Luca Bataillard

## Step 1: Data pre-processing

In this step, we will process and adjust the dataset in order to facilitate our model training. We begin by importing the datasets and relevant libraries.

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

SEED = 42
np.random.seed(SEED)

In [3]:
train_validate = pd.read_csv("nhts_train_validate.csv", index_col="TRIPID")
test = pd.read_csv("nhts_test.csv", index_col="TRIPID")

We now need to consider the features and their format in order to select the appropriate ones. 

We first notice a group of context columns that are not used as model features:
* `TRIPID`: trip identifier, indexes the dataset but is otherwise not relevant for training or grouped sampling.
* `HOUSEID`: household identifier, the topmost hierarchical group in the survey. This column should be used to perform grouped sampling during cross-validation.
* `PERSONID`: person identifier, another hierarchical group. Since `HOUSEID` is higher in the hierarchy, using `PERSONID` for grouping is not necessary.
* `TDTRPNUM`: trip numbering per person in the survey.
* `LOOP_TRIP`: 
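
The grouped sampling by `HOUSEID` mentioned above could look roughly like the following. This is a minimal sketch on made-up toy data, assuming scikit-learn is available and that `GroupKFold` is an acceptable splitter; the notebook itself does not state which splitter is used, and the names `toy`, `X_toy`, `y_toy` and `groups_toy` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupKFold

# Toy stand-in for the survey: 6 households with 2 trips each (values are made up).
toy = pd.DataFrame({
    "HOUSEID": np.repeat([10, 11, 12, 13, 14, 15], 2),
    "TRPMILES": np.linspace(0.5, 6.0, 12),
    "TRAVELMODE": ["WALK", "BUS"] * 6,
})

X_toy = toy[["TRPMILES"]]
y_toy = toy["TRAVELMODE"]
groups_toy = toy["HOUSEID"]

# GroupKFold keeps all rows that share a group value in the same fold.
cv = GroupKFold(n_splits=3)
folds = list(cv.split(X_toy, y_toy, groups=groups_toy))

for train_idx, val_idx in folds:
    train_houses = set(groups_toy.iloc[train_idx])
    val_houses = set(groups_toy.iloc[val_idx])
    # No household appears on both sides of a split.
    assert train_houses.isdisjoint(val_houses)
```

Splitting on `HOUSEID` rather than on individual trips keeps trips from the same household out of both the training and validation folds, which would otherwise make validation scores look better than they are.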

The target label is `TRAVELMODE`. This label is categorical, so we will encode the values using a simple numeric encoding. 
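
One possible form of this numeric encoding, shown on a few made-up labels, uses the pandas categorical dtype. This is a sketch only; the variable names `modes`, `y_codes` and `code_to_label` are hypothetical and the notebook may encode the labels differently.

```python
import pandas as pd

# Toy labels; in the notebook the real column would be train_validate["TRAVELMODE"].
modes = pd.Series(["WALK", "BUS", "DRIVE", "WALK", "RAIL"], name="TRAVELMODE")

# The categorical dtype assigns one integer code per distinct label,
# with string categories sorted alphabetically.
mode_cat = modes.astype("category")
y_codes = mode_cat.cat.codes

# Keep the mapping so predictions can be translated back to mode names.
code_to_label = dict(enumerate(mode_cat.cat.categories))

print(y_codes.tolist())
print(code_to_label)
```

When the same encoding has to be applied to a second dataset, reusing the category list from the first one keeps the codes consistent between the two.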

We also notice a `TRPTRANS` column that is very highly correlated with `TRAVELMODE` but does not appear in the `nhts_dictionary.csv` file provided with the project. After inspecting the NHTS documentation online, we suspect that the travel mode column was most likely generated from this column. We will thus discard the `TRPTRANS` column. Furthermore, negative responses in the `TRPTRANS` column resulted in `NaN` values in `TRAVELMODE`, and these rows need to be filtered out.
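
The two clean-up steps described here (filtering rows with a missing `TRAVELMODE`, then discarding `TRPTRANS`) can be sketched as follows. The toy frame and the name `cleaned` are illustrative assumptions, not data from the survey.

```python
import numpy as np
import pandas as pd

# Toy rows: a negative TRPTRANS response comes with a missing TRAVELMODE.
toy = pd.DataFrame(
    {
        "TRPTRANS": [1, 3, -9, 15],
        "TRAVELMODE": ["WALK", "DRIVE", np.nan, "BUS"],
    },
    index=pd.Index([101, 102, 103, 104], name="TRIPID"),
)

# Drop unlabelled rows first, then remove the column the label was derived from.
cleaned = toy.dropna(subset=["TRAVELMODE"]).drop(columns=["TRPTRANS"])

print(cleaned)
```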

Let us analyse the remaining columns. A `yes` under "Missing values" indicates that some entries in that column are invalid in the dataset. For categorical columns, this does not pose a big problem, since 

| Column name | Missing values | Categorical data | One-hot encoding | Scaling | Use feature? | Description | Comments |
| ---         | ---            | ---              | ---              | ---     | ---          | ---         | ---      |
| `STRTTIME`  |  -  |  -  |  -  | yes | yes | start time of trip | |
| `TRPMILES`  | yes |  -  |  -  | yes | yes | length of trip in miles ||
| `TRIPPURP`  |  -  | yes | yes |  -  | yes | trip purpose ||
| `TRAVDAY`   |  -  | yes |  -  | yes | yes | weekday of travel ||
| `HOMEOWN`   | yes | yes | yes |  -  | yes | home ownership ||
| `HHSIZE`    |  -  |  -  |  -  | yes | yes | size of household ||
| `HHFAMINC`  | yes | yes |  -  | yes | yes | household income ||
| `HHSTATE`   |  -  | yes | yes |  -  |  ?  | household state of residency | could use either `HHSTATE` or `CENSUS_D` |
| `WRKCOUNT`  |  -  |  -  |  -  | yes | yes | number of workers in household ||
| `LIF_CYC`   |  -  |  -  |  -  | yes | yes | life cycle classification ||
| `URBAN`     |  -  | yes |  ?  |  ?  | yes | classification of urban area ||
| `URBANSIZE` |  -  | yes |  ?  |  ?  |  ?  | population size of urban area | redundant with `URBAN`, needs reordering or one-hot encoding |
| `CENSUS_D`  |  -  | yes | yes |  -  |  -  | census division (region) of household | could be redundant with `HHSTATE` |
| `HH_RACE`   | yes | yes | yes |  -  |  yes  | race of household | |
| `EDUC`      | yes | yes | yes |  -  |  yes  | educational attainment of household | |
| `WORKER`    | yes | yes | yes |  -  |  yes  | worker status | |
| `WHYTRP90`  | yes | yes | yes |  -  |  yes  | trip purpose with 1990 NPTS design | possible duplicate of `TRIPPURP`|
| `R_AGE_IMP` |  -  |  -  |  -  |  yes |  yes  | age | |
| `R_SEX_IMP` |  -  | yes | yes |  -  |  yes  | gender | |
| `OBHUR`     | yes | yes | yes |  -  |  yes  | urban/rural indicator at origin | |
| `DBHUR`     | yes | yes | yes |  -  |  yes  | urban/rural indicator at destination | |
| `OBPPOPDN`  | yes | yes | yes |  -  |   -   | population density at origin | already covered by `OBHUR` |
| `DBPPOPDN`  | yes | yes | yes |  -  |   -   | population density at destination | already covered by `DBHUR` |
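
The table distinguishes columns that are scaled from columns that are one-hot encoded. The sketch below contrasts the two treatments on made-up values, assuming that `HHFAMINC` codes are ordered income brackets and that `HOMEOWN` codes are unordered; both the toy values and the variable names are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "HHFAMINC": [1, 3, 5, 7],  # assumed ordered bracket codes (made-up values)
    "HOMEOWN": [1, 2, 2, 1],   # assumed unordered category codes
})

# Ordered codes: keep a single numeric column and standardise it
# (zero mean, unit variance, using the population standard deviation).
inc = toy["HHFAMINC"].astype(float)
inc_scaled = (inc - inc.mean()) / inc.std(ddof=0)

# Unordered codes: expand into one indicator column per category.
own_onehot = pd.get_dummies(toy["HOMEOWN"], prefix="HOMEOWN")

features = pd.concat([inc_scaled, own_onehot], axis=1)
print(features)
```

Scaling keeps the ordering information in one column, while one-hot encoding avoids implying an order between categories that have none.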

In [4]:
def process_dataset(df):
    """
    Takes a pandas dataset in the NHTS survey format, selects and transforms features.
    Returns (X,y,groups), a tuple containing features, labels and sampling groups respectively.
    """
    
    pass
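
The function above is still a stub. One way it might be filled in is sketched below under strong assumptions: only a handful of columns from the table are used, the hypothetical constants `SCALED_COLS` and `ONEHOT_COLS` stand in for the final feature selection, and the function is named `process_dataset_sketch` to make clear that it is not the authors' implementation.

```python
import numpy as np
import pandas as pd

# Hypothetical column choices read off the table; the final selection may differ.
SCALED_COLS = ["STRTTIME", "TRPMILES", "HHSIZE"]
ONEHOT_COLS = ["TRIPPURP", "HOMEOWN"]


def process_dataset_sketch(df):
    """Return (X, y, groups) from an NHTS-style frame (illustrative only)."""
    # Rows without a travel mode cannot be used as labelled examples.
    df = df.dropna(subset=["TRAVELMODE"])

    numeric = df[SCALED_COLS].astype(float)
    # Replace a zero standard deviation by 1 so constant columns do not yield NaN.
    std = numeric.std(ddof=0).replace(0.0, 1.0)
    scaled = (numeric - numeric.mean()) / std

    onehot = pd.get_dummies(df[ONEHOT_COLS], columns=ONEHOT_COLS)

    # TRPTRANS and the identifier columns are deliberately left out of X.
    X = pd.concat([scaled, onehot], axis=1)
    y = df["TRAVELMODE"].astype("category").cat.codes
    groups = df["HOUSEID"]
    return X, y, groups


# Made-up rows in the survey's column layout.
toy = pd.DataFrame(
    {
        "HOUSEID": [23, 23, 41, 57],
        "STRTTIME": [1000, 1210, 730, 1815],
        "TRPMILES": [10.483, 1.225, 0.4, 3.2],
        "HHSIZE": [2, 2, 1, 4],
        "TRIPPURP": ["NHB", "NHB", "HBW", "HBSHOP"],
        "HOMEOWN": [2, 2, 1, 2],
        "TRPTRANS": [3, 3, 1, -9],
        "TRAVELMODE": ["PASSENGER", "PASSENGER", "WALK", np.nan],
    },
    index=pd.Index([106, 108, 201, 305], name="TRIPID"),
)

X, y, groups = process_dataset_sketch(toy)
print(X.shape, y.tolist(), groups.tolist())
```

Note that this sketch computes scaling statistics and category lists from whichever frame it receives. For a real train/test workflow, those would need to be fitted on the training data and reused on the test data.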

In [5]:
train_validate.dtypes

HOUSEID         int64
PERSONID        int64
TDTRPNUM        int64
STRTTIME        int64
TRPMILES      float64
TRPTRANS        int64
LOOP_TRIP       int64
TRIPPURP       object
TRAVDAY         int64
HOMEOWN         int64
HHSIZE          int64
HHFAMINC        int64
HHSTATE        object
WRKCOUNT        int64
LIF_CYC         int64
URBAN           int64
URBANSIZE       int64
CENSUS_D        int64
HH_RACE         int64
EDUC            int64
WORKER          int64
WHYTRP90        int64
R_AGE_IMP       int64
R_SEX_IMP       int64
OBHUR          object
DBHUR          object
OBPPOPDN        int64
DBPPOPDN        int64
TRAVELMODE     object
dtype: object

In [8]:
pd.DataFrame(train_validate[["TRPTRANS", "TRAVELMODE"]].groupby(by=["TRPTRANS", "TRAVELMODE"]))[0]

0           (1, WALK)
1          (2, CYCLE)
2          (3, DRIVE)
3      (3, PASSENGER)
4          (4, DRIVE)
5      (4, PASSENGER)
6          (5, DRIVE)
7      (5, PASSENGER)
8          (6, DRIVE)
9      (6, PASSENGER)
10         (7, OTHER)
11     (8, PASSENGER)
12     (9, PASSENGER)
13         (10, RAIL)
14         (11, RAIL)
15         (12, RAIL)
16        (13, OTHER)
17         (14, RAIL)
18          (15, BUS)
19          (16, BUS)
20         (17, TAXI)
21        (18, DRIVE)
22    (18, PASSENGER)
23        (19, OTHER)
24        (20, OTHER)
25        (97, OTHER)
Name: 0, dtype: object
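
The output above shows that some `TRPTRANS` codes (for example 3) appear with more than one `TRAVELMODE`. A contingency table makes such cases easier to spot. The sketch below uses a few pairs taken from the output, with repeats added for illustration; the names `pairs`, `table` and `ambiguous_codes` are hypothetical.

```python
import pandas as pd

# A few (TRPTRANS, TRAVELMODE) pairs from the output above, with repeats added.
pairs = pd.DataFrame({
    "TRPTRANS": [1, 2, 3, 3, 3, 15, 16, 17],
    "TRAVELMODE": ["WALK", "CYCLE", "DRIVE", "PASSENGER", "DRIVE", "BUS", "BUS", "TAXI"],
})

# Rows are TRPTRANS codes, columns are travel modes, cells are trip counts.
table = pd.crosstab(pairs["TRPTRANS"], pairs["TRAVELMODE"])

# Number of distinct travel modes each code maps to.
modes_per_code = (table > 0).sum(axis=1)
ambiguous_codes = modes_per_code[modes_per_code > 1].index.tolist()

print(table)
print(ambiguous_codes)
```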

In [43]:
train_validate[train_validate["OBPPOPDN"] <= 0]

| TRIPID | HOUSEID | PERSONID | TDTRPNUM | STRTTIME | TRPMILES | TRPTRANS | LOOP_TRIP | TRIPPURP | TRAVDAY | HOMEOWN | ... | EDUC | WORKER | WHYTRP90 | R_AGE_IMP | R_SEX_IMP | OBHUR | DBHUR | OBPPOPDN | DBPPOPDN | TRAVELMODE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 106 | 23 | 1 | 1 | 1000 | 10.483 | 3 | 2 | NHB | 2 | 2 | ... | 2 | 2 | 3 | 28 | 2 | -9 | S | -9 | 1500 | PASSENGER |
| 108 | 23 | 1 | 3 | 1210 | 1.225 | 3 | 2 | NHB | 2 | 2 | ... | 2 | 2 | 4 | 28 | 2 | -9 | -9 | -9 | -9 | PASSENGER |
| 109 | 23 | 1 | 4 | 1230 | 10.154 | 3 | 2 | NHB | 2 | 2 | ... | 2 | 2 | 3 | 28 | 2 | -9 | -9 | -9 | -9 | PASSENGER |
| 110 | 23 | 1 | 5 | 1500 | 10.0 | 3 | 2 | NHB | 2 | 2 | ... | 2 | 2 | 4 | 28 | 2 | -9 | -9 | -9 | -9 | PASSENGER |
| 111 | 23 | 1 | 6 | 1530 | 1.225 | 3 | 2 | NHB | 2 | 2 | ... | 2 | 2 | 4 | 28 | 2 | -9 | -9 | -9 | -9 | PASSENGER |
| 112 | 23 | 1 | 7 | 1545 | 1.216 | 3 | 2 | NHB | 2 | 2 | ... | 2 | 2 | 8 | 28 | 2 | -9 | -9 | -9 | -9 | PASSENGER |
| 113 | 23 | 1 | 8 | 1640 | 2.857 | 3 | 2 | NHB | 2 | 2 | ... | 2 | 2 | 3 | 28 | 2 | -9 | -9 | -9 | -9 | PASSENGER |
| 114 | 23 | 1 | 9 | 1700 | 1.324 | 3 | 2 | HBSHOP | 2 | 2 | ... | 2 | 2 | 3 | 28 | 2 | -9 | -9 | -9 | -9 | PASSENGER |
| 115 | 23 | 2 | 1 | 1230 | 10.153 | 3 | 2 | HBSHOP | 2 | 2 | ... | 2 | 2 | 3 | 30 | 1 | -9 | -9 | -9 | -9 | PASSENGER |
| 116 | 23 | 2 | 2 | 1500 | 10.0 | 3 | 2 | NHB | 2 | 2 | ... | 2 | 2 | 8 | 30 | 1 | -9 | -9 | -9 | -9 | PASSENGER |


In [44]:
train_validate["OBPPOPDN"].unique()

array([ 7000, 17000,  3000,    50,  1500, 30000,   750,   300,    -9],
      dtype=int64)
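
The `-9` entry among the unique values suggests a sentinel code for invalid responses. To see how widespread such codes are, one could count negative entries per numeric column. The sketch below does this on made-up rows, assuming that any negative value in a numeric column marks an invalid response; that assumption and the names `invalid_counts` and `invalid_share` are not taken from the notebook.

```python
import pandas as pd

# Made-up rows in the style of the output above.
toy = pd.DataFrame({
    "OBPPOPDN": [7000, -9, 300, -9],
    "DBPPOPDN": [1500, -9, 750, 50],
    "TRPMILES": [10.483, 1.225, 10.0, 2.857],
})

# Restrict to numeric columns, then flag negative (assumed invalid) entries.
numeric = toy.select_dtypes(include="number")
is_invalid = numeric < 0

invalid_counts = is_invalid.sum()   # number of flagged rows per column
invalid_share = is_invalid.mean()   # fraction of flagged rows per column

print(invalid_counts)
print(invalid_share)
```

Object-typed columns such as `OBHUR` would need a separate check, since their invalid entries are not numeric.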