## Introduction to choice-learn's data management

In [None]:
import os
import sys
from pathlib import Path

sys.path.append("../")

import numpy as np
import pandas as pd

### ChoiceDataset - Getting Started !

The choice-learn package aims to handle large datasets. One of the main ideas is to limit memory usage as much as possible by avoiding storing the same feature several times.
We define two sources of features, the items and the contexts.

<ins>**Items**</ins> represent a product, an alternative that can be chosen by the customer at some point.

<ins>**Contexts**</ins> represent the different cases of the dataset. One context corresponds to one choice and regroups every factor that might be different from one choice to another.


From these two concepts, we define 5 different types of data, stored separately in order to avoid redundancy in memory:

- **choices:** The main information, indicating which item/alternative has been chosen among all those available.
- **fixed_items_features:** The items features that never change (e.g. size, color, etc...) over the choices/contexts.
- **contexts_features:** All the features that might change from one choice to another and that are **common** to all items (e.g. day of week, customer features, etc...)
- **contexts_items_features:** The features that are a function of both the item and the context (e.g. prices change over contexts and are specific to each sold item, etc...)
- **contexts_items_availabilities:** For each context, whether each item/alternative is proposed to the customer (1.) or not (0.).
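
To get a feel for the memory argument, the sketch below counts how many numbers need to be stored for a long-format table versus the split storage described above. The sizes are hypothetical, choices and availabilities are left out of both counts, and this is plain Python arithmetic rather than choice-learn code.

```python
# Hypothetical sizes, chosen only for illustration.
n_choices = 10_000  # number of contexts (one choice each)
n_items = 4         # number of alternatives
n_fixed_items_features = 2
n_contexts_features = 3
n_contexts_items_features = 4

# Long format: one row per (context, item), every feature repeated on each row.
n_features = n_fixed_items_features + n_contexts_features + n_contexts_items_features
long_format_values = n_choices * n_items * n_features

# Split storage: each kind of feature is stored once, at its own granularity.
split_values = (
    n_items * n_fixed_items_features                   # fixed_items_features
    + n_choices * n_contexts_features                  # contexts_features
    + n_choices * n_items * n_contexts_items_features  # contexts_items_features
)

print(long_format_values, split_values)  # 360000 190008
```

With these made-up sizes the split storage holds roughly half as many values, and the gap widens as more features are shared across items or contexts.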


In order to estimate a model using the choice-learn API, you will first need to wrap your dataset within a ChoiceDataset. The easiest way to do it is to use a pandas DataFrame, let's see how to do it !

We will use the ModeCanada [1] dataset for this example. It is provided with the choice-learn package and can be loaded as follows:

In [None]:
from choice_learn.data import ChoiceDataset
from choice_learn.datasets import load_modecanada

canada_transport_df = load_modecanada(as_frame=True)
canada_transport_df.head()

Unnamed: 0,case,alt,choice,dist,cost,ivt,ovt,freq,income,urban,noalt
1,1,train,0,83,28.25,50,66,4,45.0,0,2
2,1,car,1,83,15.77,61,0,0,45.0,0,2
3,2,train,0,83,28.25,50,66,4,25.0,0,2
4,2,car,1,83,15.77,61,0,0,25.0,0,2
5,3,train,0,83,28.25,50,66,4,70.0,0,2


An extensive description of the dataset can be found [here](https://www.ssc.wisc.edu/~bhansen/econometrics/Koppelman_description.pdf).
An extract indicates:

"The dataset was assembled in 1989 by VIA Rail (the Canadian national rail carrier) to estimate the demand for high-speed rail in the Toronto-Montreal corridor. The main information source was a Passenger Review administered to business travelers augmented by information about each trip. The observations consist of a choice between four modes of transportation (train, air, bus, car) with information about the travel mode and about the passenger. The posted dataset has been balanced to only include cases where all four travel modes are recorded. The file contains 11,116 observations on 2779 individuals.  "

Alright !
If we go back to our dataframe, we can see the following columns:
- case: an ID of the traveler
- alt: the alternative concerned by the row
- choice: 1 if the alternative was chosen, 0 otherwise
- dist: trip distance
- cost: trip cost
- ivt: in-vehicle travel time (minutes)
- ovt: out-of-vehicle travel time (minutes)
- income: household income of the traveler ($)
- urban: 1 if origin or destination is a large city
- noalt: the number of alternatives among which the traveler had to choose
- freq: the frequency of the alternative (0 for car)

Following our specification, we can see that one case corresponds to one customer and thus one choice. In choice-learn terms it corresponds to "one context": a set of available alternatives and their features/specificities resulting in one choice.
Let's regroup our features:

**choices**
Easy ! It is the alternative for which the choice column equals 1.

**contexts_features**
The income, urban and dist features (and also noalt, which is not really a feature) are the same for all the alternatives within a context: they are contexts_features.

**contexts_items_features**
ivt, ovt, cost and freq depend on the alternative and change over the contexts. They are contexts_items_features.

**contexts_items_availabilities**
It is not directly indicated, but it can easily be deduced. Whenever an alternative is not available, it is simply not listed for its case. For example for case=1, our first context, only train and car are given as alternatives, meaning that air and bus could not be chosen/were not available.
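
The deduction described above can be sketched in a few lines of plain Python. The rows are the (case, alt) pairs of the first two cases of the dataframe shown earlier; the variable names are only illustrative and no choice-learn function is involved.

```python
# Long-format rows (case, alt) copied from the first two cases shown above.
rows = [(1, "train"), (1, "car"), (2, "train"), (2, "car")]
items = ["air", "bus", "car", "train"]

# Collect, for each case, the set of alternatives that appear in the data.
listed = {}
for case, alt in rows:
    listed.setdefault(case, set()).add(alt)

# One availability row per case: 1.0 if the alternative is listed, else 0.0.
availabilities = [
    [1.0 if item in listed[case] else 0.0 for item in items]
    for case in sorted(listed)
]
print(availabilities)  # [[0.0, 0.0, 1.0, 1.0], [0.0, 0.0, 1.0, 1.0]]
```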

Okay, but we are missing fixed_items_features... Indeed there isn't really any in this dataset. Let's create one for the example.
We will create is_public, indicating whether an alternative is a public transportation mode (1) or a private one (0).

In [None]:
transport_df = canada_transport_df.copy()
items = ["air", "bus", "car", "train"]

# Vectorized comparison: 0. for the car, 1. for every other alternative
transport_df["is_public"] = (transport_df["alt"] != "car").astype(float)

# Just some typing
transport_df.income = transport_df.income.astype("float32")

# Let's take a look at our new df:
transport_df.head()

Our feature, is_public, is 0 for the car and 1 for all other alternatives, which seems fine! We can now create our ChoiceDataset !\
*Note that you do NOT need each type of feature, here the purpose was to give a complete example.*

In order to create the ChoiceDataset from the DataFrame, we need to specify:
- the columns representing the fixed_items_features
- the columns representing the contexts_features
- the columns representing the contexts_items_features
- the column where the item is identified 
- the column where the context is identified
- the column in which the choice is given

For our Canada Transport example, here is how it should be done:

In [None]:
dataset = ChoiceDataset.from_single_df(df=transport_df,
                                       fixed_items_features_columns=["is_public"],
                                       contexts_features_columns=["income", "urban", "dist"],
                                       contexts_items_features_columns=["cost", "freq", "ovt", "ivt"],
                                       items_id_column="alt",
                                       contexts_id_column="case",
                                       choices_column="choice",
                                       choice_mode="one_zero")

I have added an argument without any warning, the "choice_mode". It only specifies how the choice is encoded in the dataframe. Currently two modes are available:

**one_zero:**
The choice column contains a 0 when the alternative/item is not chosen in the session and a 1 if it is chosen.
This is the case here with Canada Transport.

**item_id:**
The choice column contains the id of the choice during the session. The id corresponds to the values used in the column 'items_id_column'.
In the case of Canada Transport, the dataframe would need to be:

| | case | alt | choice | dist | cost | ivt | ovt | freq | income | urban | noalt |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | train | car | 83 | 28.25 | 50 | 66 | 4 | 45 | 0 | 2 |
| 2 | 1 | car | car | 83 | 15.77 | 61 | 0 | 0 | 45 | 0 | 2 |
| 3 | 2 | train | car | 83 | 28.25 | 50 | 66 | 4 | 25 | 0 | 2 |
| 4 | 2 | car | car | 83 | 15.77 | 61 | 0 | 0 | 25 | 0 | 2 |
| 5 | 3 | train | car | 83 | 28.25 | 50 | 66 | 4 | 70 | 0 | 2 |

In the first 5 examples, the chosen transportation is always the car.
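
As an illustration, here is a minimal sketch of how a one_zero choice column could be rewritten into the item_id encoding. It works on plain dictionaries holding the first four rows shown earlier; it is not part of the choice-learn API.

```python
# First four rows of the dataframe, reduced to the columns that matter here.
rows = [
    {"case": 1, "alt": "train", "choice": 0},
    {"case": 1, "alt": "car", "choice": 1},
    {"case": 2, "alt": "train", "choice": 0},
    {"case": 2, "alt": "car", "choice": 1},
]

# Find the chosen alternative of each case (the row where choice == 1).
chosen = {row["case"]: row["alt"] for row in rows if row["choice"] == 1}

# Rewrite the choice column so that every row of a case carries the chosen id.
item_id_rows = [{**row, "choice": chosen[row["case"]]} for row in rows]

for row in item_id_rows:
    print(row)
```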

The ChoiceDataset is ready !

You now have three possibilities to continue discovering the choice-learn package:
- You can directly go [here]() to the modelling tutorial if you want to understand how a first simple ConditionalMNL would be implemented.
- You can go [here]() if your dataset is organized differently to see all the different ways to instantiate a ChoiceDataset. In particular it helps if your DataFrame is in the wide format or if it is split into several DataFrames.
- Or you can continue this current tutorial to better understand the ChoiceDataset machinery and everything there is to know about it.

Whatever your choice, you can also check [here](#ready-to-use-datasets) the list of open source datasets available directly with the package.

### ChoiceDataset - Inside the machine

Let's see an example of ChoiceDataset instantiation from numpy arrays:

In [None]:
# Let's see an example of ChoiceDataset instantiation

# from choice_learn.data import ChoiceDataset

# Let's consider three items whose features are:
# - size
# - weight
# - price
# - whether it is on promotion or not

fixed_items_features = [
    [1, 2], # item 1 [size, weight]
    [2, 4], # item 2 [size, weight]
    [1.5, 1.5], # item 3 [size, weight]
]

# We have two customers whose features are
# - Budget
# - Age
# Customer 1 bought item 1 at session 1 and item 2 at session 3
# Customer 2 bought item 3 at session 2

choices = [0, 2, 1]
contexts_items_availabilities = [
    [1, 1, 1], # All items available at session 1
    [1, 1, 1], # All items available at session 2
    [0, 1, 1], # Item 1 not available at session 3
]

contexts_features = [
    [100, 20], # session 1, customer 1 [budget, age]
    [200, 40], # session 2, customer 2 [budget, age]
    [80, 20], # session 3, customer 1 [budget, age]
]

contexts_items_features = [
    [
        [100, 0], # Session 1, Item 1 [price, promotion]
        [140, 0], # Session 1, Item 2 [price, promotion]
        [200, 0], # Session 1, Item 3 [price, promotion]
    ],
    [
        [100, 0], # Session 2, Item 1 [price, promotion]
        [120, 1], # Session 2, Item 2 [price, promotion]
        [200, 0], # Session 2, Item 3 [price, promotion]
    ],
    [
        [100, 0], # Session 3, Item 1 [price, promotion], item unavailable: values do not really matter, but they need to exist for the shape's sake
        [120, 1], # Session 3, Item 2 [price, promotion]
        [180, 1], # Session 3, Item 3 [price, promotion]
    ],
]


Note that in items_features and contexts_items_features, the features need to be well ordered:
- The features are ordered the same for all items
- The items are ordered the same for items_features and contexts_items_features, and their index is used in choices:


**items_features** = [[feature_1_item_A, feature_2_item_A, ...], [feature_1_item_B, feature_2_item_B, ...], ...]

**contexts_items_features** = [[[context_1_feature_1_item_A, ...], [context_1_feature_1_item_B, ...]], [[context_2_feature_1_item_A, ...], [context_2_feature_1_item_B, ...]], ...]

**choices** then represents the index of the chosen item: 0 when item_1 is chosen, 1 when item_2 is chosen, etc..., e.g. [0, 0, 2, 1, ...]
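
To make the role of the ordering concrete, the sketch below reuses the toy values of the example above, restricted to the price feature, and shows that each entry of choices is a position on the items axis. It is plain Python for illustration only.

```python
# Same toy values as in the example above, restricted to the price feature.
prices = [
    [100, 140, 200],  # session 1: items 1, 2, 3
    [100, 120, 200],  # session 2
    [100, 120, 180],  # session 3
]
availabilities = [[1, 1, 1], [1, 1, 1], [0, 1, 1]]
choices = [0, 2, 1]

# choices[i] is a position on the items axis, so it can index both arrays.
paid_prices = [prices[i][choice] for i, choice in enumerate(choices)]
chosen_available = [availabilities[i][choice] == 1 for i, choice in enumerate(choices)]

print(paid_prices)       # [100, 200, 120]
print(chosen_available)  # [True, True, True]
```

If the items were ordered differently in one of the arrays, the same index would silently point at the wrong item, which is why the ordering has to be consistent.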

In [None]:
dataset = ChoiceDataset(
    fixed_items_features=fixed_items_features,
    fixed_items_features_names=["size", "weight"], # You can specify the names of the features if you want
    contexts_features=contexts_features,
    contexts_features_names=["budget", "age"], # same, not mandatory
    contexts_items_features=contexts_items_features,
    contexts_items_features_names=["price", "promotion"], # same, not mandatory
    contexts_items_availabilities=contexts_items_availabilities,
    choices=choices,
)

In [None]:
dataset.contexts_items_features[0].shape

(3, 3, 2)

A ChoiceDataset is indexed by session (i.e. by context). You can use [] to subset it.
This is particularly useful for a train/test split:

In [None]:
train_index = [0, 1]
test_index = [2]
train_dataset = dataset[train_index]
test_dataset = dataset[test_index]
print("Train Dataset length:", len(train_dataset), "Test Dataset length:", len(test_dataset))

No features_by_ids given.
Some choices never happen in the dataset: {1}
No features_by_ids given.
Some choices never happen in the dataset: {0, 2}
Train Dataset length: 2 Test Dataset length: 1
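
On a larger dataset the index lists would usually not be written by hand. A possible way to build them with the standard library is sketched below; the test fraction and the seed are arbitrary assumptions, and the resulting lists would be used exactly like train_index and test_index above.

```python
import random

n_contexts = 3         # len(dataset) in the toy example above
test_fraction = 1 / 3  # hypothetical share of contexts kept for testing

indexes = list(range(n_contexts))
random.Random(0).shuffle(indexes)  # fixed seed so the split is reproducible

# Keep at least one context in the test set.
n_test = max(1, round(n_contexts * test_fraction))
test_index = sorted(indexes[:n_test])
train_index = sorted(indexes[n_test:])

print("train:", train_index, "test:", test_index)
```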


If you want to access the features, you can use the .batch indexer with session indexes.
It returns the features in this order:

- items_features (n_items, n_items_features)
- contexts_features (n_choices, n_contexts_features)
- contexts_items_features (n_choices, n_items, n_contexts_items_features)
- contexts_items_availabilities (n_choices, n_items)
- choices (n_choices,)

As a reminder, we have as many contexts as we have choices in the dataset !

| index | feature  | shape  |   
|---|---|---|
| 0 | items_features | (n_items, n_items_features) |
| 1 | contexts_features | (n_choices, n_contexts_features) |
| 2 | contexts_items_features | (n_choices, n_items, n_contexts_items_features) |
| 3 | contexts_items_availabilities | (n_choices, n_items) |
| 4 | choices | (n_choices,) |

In [None]:
contexts_indexes = [0, 1]
print("Items features:", train_dataset.batch[contexts_indexes][0])
print("Contexts features:", train_dataset.batch[contexts_indexes][1])
print("Contexts Items features:", train_dataset.batch[contexts_indexes][2])
print("Contexts Items Availabilities features:", train_dataset.batch[contexts_indexes][3])
print("Contexts Choices:", train_dataset.batch[contexts_indexes][4])

Items features: (array([[1. , 2. ],
       [2. , 4. ],
       [1.5, 1.5]], dtype=float32),)
Contexts features: (array([[100,  20],
       [200,  40]], dtype=int32),)
Contexts Items features: (array([[[100,   0],
        [140,   0],
        [200,   0]],

       [[100,   0],
        [120,   1],
        [200,   0]]], dtype=int32),)
Contexts Items Availabilities features: [[1. 1. 1.]
 [1. 1. 1.]]
Contexts Choices: [0 2]


To simplify the iteration over the dataset you can call the .iter_batch method, with the batch_size argument.

Note that batch_size=-1 returns the whole dataset

In [None]:
# All the features are given for each session, in order to compute utility and NegativeLogLikelihood
for n_batch, batch in enumerate(dataset.iter_batch(batch_size=1)):
    print(n_batch, batch)

0 (array([[1. , 2. ],
       [2. , 4. ],
       [1.5, 1.5]], dtype=float32), array([[100,  20]], dtype=int32), array([[[100,   0],
        [140,   0],
        [200,   0]]], dtype=int32), array([[1., 1., 1.]], dtype=float32), array([0], dtype=int32))
1 (array([[1. , 2. ],
       [2. , 4. ],
       [1.5, 1.5]], dtype=float32), array([[200,  40]], dtype=int32), array([[[100,   0],
        [120,   1],
        [200,   0]]], dtype=int32), array([[1., 1., 1.]], dtype=float32), array([2], dtype=int32))
2 (array([[1. , 2. ],
       [2. , 4. ],
       [1.5, 1.5]], dtype=float32), array([[80, 20]], dtype=int32), array([[[100,   0],
        [120,   1],
        [180,   1]]], dtype=int32), array([[0., 1., 1.]], dtype=float32), array([1], dtype=int32))
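
To picture what batching does to the session indexes, here is a small sketch of an index chunker with the same batch_size=-1 convention. The helper name is hypothetical and this is not how choice-learn implements .iter_batch; it only illustrates the idea.

```python
def iter_index_batches(n_contexts, batch_size):
    """Yield lists of consecutive context indexes (hypothetical helper).

    batch_size=-1 yields a single batch containing every index.
    """
    if n_contexts == 0:
        return
    if batch_size == -1:
        batch_size = n_contexts
    for start in range(0, n_contexts, batch_size):
        yield list(range(start, min(start + batch_size, n_contexts)))


print(list(iter_index_batches(3, 1)))   # [[0], [1], [2]]
print(list(iter_index_batches(3, 2)))   # [[0, 1], [2]]
print(list(iter_index_batches(3, -1)))  # [[0, 1, 2]]
```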


Note that you will need to use a ChoiceDataset to use the models.

**Stacking features**
If you need to keep a clear distinction between different features, you can use stacking in the ChoiceDataset. For example, if we have two kinds of items_features and we do not want them to be within the same np.ndarray, we can proceed as follows:

In [None]:
items_features_2 = [
    [11, 12], # item 1 
    [12, 14], # item 2 
    [11.5, 11.5], # item 3 
]
dataset = ChoiceDataset(
    # Here items_features specified as a tuple of the two features lists
    fixed_items_features=(fixed_items_features, items_features_2),
    contexts_features=contexts_features,
    contexts_items_features=contexts_items_features,
    contexts_items_availabilities=contexts_items_availabilities,
    choices=choices,
)

When indexing or batching your ChoiceDataset, you will now get items_features as a tuple, with elements corresponding to (items_features, items_features_2)

In [None]:
dataset.batch[0]

([array([[1. , 2. ],
         [2. , 4. ],
         [1.5, 1.5]], dtype=float32),
  array([[11. , 12. ],
         [12. , 14. ],
         [11.5, 11.5]], dtype=float32)],
 array([100,  20]),
 array([[100,   0],
        [140,   0],
        [200,   0]], dtype=int32),
 array([1., 1., 1.], dtype=float32),
 0)

This is possible with:
- items_features
- contexts_features
- contexts_items_features

The other types of data (choices and availabilities) should not need any stacking.

In [None]:
dataset = ChoiceDataset(
    fixed_items_features=(fixed_items_features, items_features_2), # Here items_features specified as a tuple of the two features lists
    contexts_features=(contexts_features, contexts_features),
    contexts_items_features=(contexts_items_features, contexts_items_features),
    contexts_items_availabilities=contexts_items_availabilities,
    choices=choices,
)

dataset.batch[0]

## More Advanced use: the FeatureStore & OneHotStore

The FeaturesStore class is here to store values that regularly repeat themselves over a sequence.
Let's take an example where several stores are considered. If we want to model the utility from store features (such as surface, average number of customers, etc...), these features are shared by several choices in our dataset.

In [None]:
# We have three stores, represented by their (surface, average_number_of_customers):
store_features = [[100, 250], [150, 500], [80, 100]]
# Now we consider a sequence of choices that happen in [store_1, store_1, store_2, store_3, store_2, store_1, store_3]
# The usual way to store the features would be:
usual_store_features = [[100, 250], [100, 250], [150, 500], [80, 100], [150, 500], [100, 250], [80, 100]]
# There are a lot of repetitions, which is not very efficient...
# Multiply this by 600 stores represented as one-hot vectors over thousands of sessions and we will have a memory problem !
# Now the FeaturesStore tries to be more efficient:

store = {1: [100, 250], 2: [150, 500], 3: [80, 100]} # We can use a dictionary to store the features of each store
sequence = [1, 1, 2, 3, 2, 1, 3] # We can use a sequence of keys to represent the sequence of stores

from choice_learn.data import FeaturesStore
feat_store = FeaturesStore.from_dict(store, sequence)

In [None]:
# Now we can access the features of the store appearing at index i in the sequence with .batch:

# Let's see the features of the store at index 0
print(feat_store.batch[0])
# Now we can also take a whole batch:
print(feat_store.batch[[0, 1, 5]])
# Ah ! We selected all the indexes where the store is 1, which is why we always have the same features !

[100, 250]
[[100, 250], [100, 250], [100, 250]]
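
The idea behind this kind of store can be sketched with a tiny class that keeps each store's features once and resolves them through the sequence on demand. The class below is a hypothetical stand-in written for illustration; it is not the choice-learn FeaturesStore and makes no claim about its internals.

```python
class TinyFeaturesStore:
    """Hypothetical stand-in illustrating the idea, not the choice-learn class."""

    def __init__(self, values, sequence):
        self.values = values      # one entry per distinct store id
        self.sequence = sequence  # one store id per choice

    def get(self, indexes):
        # Features are looked up on demand instead of being duplicated.
        return [self.values[self.sequence[i]] for i in indexes]


store = {1: [100, 250], 2: [150, 500], 3: [80, 100]}
sequence = [1, 1, 2, 3, 2, 1, 3]
tiny_store = TinyFeaturesStore(store, sequence)

print(tiny_store.get([0, 1, 5]))  # [[100, 250], [100, 250], [100, 250]]
print(tiny_store.get([2, 3]))     # [[150, 500], [80, 100]]
```

Only three feature vectors and seven integer keys are kept in memory, whatever the number of times each store appears in the sequence.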


In order to further optimize RAM usage, you can use the OneHotStore, built specifically for one-hot encoded features. The store only keeps, for each element, the index of its non-zero entry and builds the one-hot vector only when needed.

In [None]:
from choice_learn.data import OneHotStore

In [None]:
store = OneHotStore.from_sequence(["a", "a", "a", "c", "b", "b", "a"])

# When using from_sequence, the store ranks the distinct elements (lower to higher) and uses each rank as index
print("RAM storage of the OneHotStore:", store.store, "with sequence:", store.sequence)
# When indexing with .batch, we can access the one-hot encoding of the element at index i in the sequence
print("One-hot vector at index 0:", store.batch[0])
print("One-hot vector at indexes [0, 1, 3]:")
print(store.batch[[0, 1, 3]])

RAM storage of the OneHotStore: {'a': 0, 'b': 1, 'c': 2} with sequence: ['a' 'a' 'a' 'c' 'b' 'b' 'a']
One-hot vector at index 0: [1. 0. 0.]
One-hot vector at indexes [0, 1, 3]:
[[1. 0. 0.]
 [1. 0. 0.]
 [0. 0. 1.]]
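
The same behaviour can be mimicked in plain Python to see why so little needs to be stored. This is a sketch under the assumption that elements are ranked in sorted order, as in the output above; it does not reproduce the OneHotStore implementation.

```python
sequence = ["a", "a", "a", "c", "b", "b", "a"]

# Map each distinct element to its rank in sorted order.
positions = {value: rank for rank, value in enumerate(sorted(set(sequence)))}


def one_hot(index):
    """Build the one-hot vector of the element at `index` only when asked."""
    vector = [0.0] * len(positions)
    vector[positions[sequence[index]]] = 1.0
    return vector


print(positions)  # {'a': 0, 'b': 1, 'c': 2}
print([one_hot(i) for i in [0, 1, 3]])
```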


- Add: Example from pandas.DataFrame

## Ready-to-use datasets
A few well-known open source datasets are directly integrated in the package and can be downloaded in one line:
- SwissMetro from Bierlaire et al. (2001) [2]
- ModeCanada from Koppelman et al. (1993) [1]

If you feel like another open-source dataset should be included, reach out !

In [None]:
from choice_learn.datasets import load_swissmetro, load_modecanada

canada_choice_dataset = load_modecanada()
swissmetro_choice_dataset = load_swissmetro()

The datasets can also be downloaded as dataframes:

In [None]:
swissmetro_df = load_swissmetro(as_frame=True)
swissmetro_df.head()

Unnamed: 0,GROUP,SURVEY,SP,ID,PURPOSE,FIRST,TICKET,WHO,LUGGAGE,AGE,...,TRAIN_CO,TRAIN_HE,SM_TT,SM_CO,SM_HE,SM_SEATS,CAR_TT,CAR_CO,CHOICE,CAR_HE
0,2.0,0.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,3.0,...,48.0,120.0,63.0,52.0,20.0,0.0,117.0,65.0,2.0,0.0
1,2.0,0.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,3.0,...,48.0,30.0,60.0,49.0,10.0,0.0,117.0,84.0,2.0,0.0
2,2.0,0.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,3.0,...,48.0,60.0,67.0,58.0,30.0,0.0,117.0,52.0,2.0,0.0
3,2.0,0.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,3.0,...,40.0,30.0,63.0,52.0,20.0,0.0,72.0,52.0,2.0,0.0
4,2.0,0.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,3.0,...,36.0,60.0,63.0,42.0,20.0,0.0,90.0,84.0,2.0,0.0


### References
[1] Koppelman et al. (1993), *Application and Interpretation of Nested Logit Models of Intercity Mode Choice*\
[2] Bierlaire, M., Axhausen, K. and Abay, G. (2001), *The Acceptance of Modal Innovation: The Case of SwissMetro*