## Introduction to choice-learn's data management

In [None]:
import os
import sys
from pathlib import Path

sys.path.append("../")

import numpy as np
import pandas as pd

### ChoiceDataset - Getting Started

The choice-learn package aims to handle large datasets. It uses a data structure similar to the one defined in the torch-choice package.

In particular, it defines 5 different types of data and stores them separately in order to avoid redundancy in memory:

- items_features: features of the considered items that never change (e.g. size, color, etc.)
- sessions_features: features that depend on the session (time) of the choice and that are common to all items (e.g. day of week)
- sessions_items_features: features that depend on both the item and the session (e.g. price)
- sessions_items_availabilities: for each session, whether each item is available (1.) or not (0.)
- choices: the chosen item among the available ones
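To make the memory argument concrete, here is a minimal NumPy sketch. It is not choice-learn code: the sizes are made up for illustration, and the arrays only mimic the shapes of the five data types listed above. It compares the number of values stored separately with a "long" table that repeats every feature on every (session, item) row.

```python
import numpy as np

# Illustrative sizes (assumptions, not taken from a real dataset)
n_sessions, n_items = 1000, 4
n_items_feat, n_sessions_feat, n_sessions_items_feat = 3, 2, 5

# The five separate arrays
items_features = np.zeros((n_items, n_items_feat), dtype=np.float32)
sessions_features = np.zeros((n_sessions, n_sessions_feat), dtype=np.float32)
sessions_items_features = np.zeros(
    (n_sessions, n_items, n_sessions_items_feat), dtype=np.float32
)
sessions_items_availabilities = np.ones((n_sessions, n_items), dtype=np.float32)
choices = np.zeros((n_sessions,), dtype=np.int32)

separate_size = sum(
    array.size
    for array in (
        items_features,
        sessions_features,
        sessions_items_features,
        sessions_items_availabilities,
        choices,
    )
)

# "Long" layout: one row per (session, item), all features repeated on each row,
# plus one availability column and one choice column
long_size = n_sessions * n_items * (
    n_items_feat + n_sessions_feat + n_sessions_items_feat + 2
)

print("values stored separately:", separate_size)
print("values stored in long format:", long_size)
```

The gap widens as the number of items_features and sessions_features grows, since those are the values that the long layout duplicates.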


In order to estimate a model using the choice-learn API, you first need to wrap your dataset within a ChoiceDataset. The easiest way to do so is to start from a pandas DataFrame. Let's see how to do it!

In [None]:
from choice_learn.data import ChoiceDataset

# path to dataset file is:
filepath = "choice_learn/datasets/data/ModeCanada.csv.gz"
canada_transport_df = pd.read_csv((Path("..") / filepath), index_col=0)

print("Let's look at the dataframe:")
canada_transport_df.head()

Let's look at the dataframe:


Unnamed: 0,case,alt,choice,dist,cost,ivt,ovt,freq,income,urban,noalt
1,1,train,0,83,28.25,50,66,4,45,0,2
2,1,car,1,83,15.77,61,0,0,45,0,2
3,2,train,0,83,28.25,50,66,4,25,0,2
4,2,car,1,83,15.77,61,0,0,25,0,2
5,3,train,0,83,28.25,50,66,4,70,0,2


This dataframe is the raw version of the Canada Transport Dataset (ModeCanada). We need to follow some specifications in order to use it as a ChoiceDataset.

The dataset does not contain any items_features (i.e. fixed features for the different items). For the sake of the example, we will create a one-hot encoding of the items that will play the role of these items_features.

In [None]:
transport_df = canada_transport_df.copy()
items = ["air", "bus", "car", "train"]

# One-hot encode the alternative: column oh{i} is 1. when row.alt == items[i], else 0.
for i, item in enumerate(items):
    transport_df[f"oh{i}"] = (transport_df["alt"] == item).astype(float)

# Cast income to float32
transport_df.income = transport_df.income.astype("float32")

In order to create the ChoiceDataset from the dataframe, we need to specify:
- the columns representing the items_features
- the columns representing the sessions_features
- the columns representing the sessions_items_features
- the column where the item is identified 
- the column where the session is identified
- the column in which the choice is given

For our Canada Transport example, here is how it should be done:

In [None]:
dataset = ChoiceDataset.from_single_df(df=transport_df,
                                       items_features_columns=["oh0", "oh1", "oh2", "oh3"],
                                       sessions_features_columns=["income"],
                                       sessions_items_features_columns=['cost', 'freq', 'ovt', 'ivt'],
                                       items_id_column="alt",
                                       sessions_id_column="case",
                                       choices_column="choice",
                                       choice_mode="one_zero")

In our case, the items_features are the one-hot values we have created. The column 'income' represents the income of the subject and is shared by all items during a session. It thus falls in the sessions_features category. On the contrary, 'cost', 'freq', 'ovt' and 'ivt' describe each transportation mode (item) and change for each session (i.e. each case of the survey). They are considered sessions_items_features.
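One way to decide which category a column belongs to is to check whether it takes a single value within each session. The sketch below does this with pandas on a toy dataframe copied from the first rows shown above. The helper `is_constant_within_session` is hypothetical and not part of choice-learn.

```python
import pandas as pd

# First rows of the dataframe shown above (subset of columns)
toy_df = pd.DataFrame(
    {
        "case": [1, 1, 2, 2],
        "alt": ["train", "car", "train", "car"],
        "income": [45, 45, 25, 25],
        "cost": [28.25, 15.77, 28.25, 15.77],
    }
)


def is_constant_within_session(df, column, sessions_id_column="case"):
    """Return True if `column` takes a single value in every session."""
    n_unique = df.groupby(sessions_id_column)[column].nunique()
    return bool((n_unique == 1).all())


# A column shared by all items of a session is a sessions_features candidate
print("income:", is_constant_within_session(toy_df, "income"))
# A column that differs between items of a session is a sessions_items_features candidate
print("cost:", is_constant_within_session(toy_df, "cost"))
```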

One argument that needs to be specified is choice_mode. It indicates how the choice is encoded in the dataframe. Currently two modes are available:

**one_zero:**
The choice column contains a 0 when the alternative/item is not chosen in the session and a 1 if it is chosen.
This is the case here with Canada Transport.

**item_id:**
The choice column contains the id of the item chosen during the session. The id corresponds to the values used in the column given as items_id_column.
For Canada Transport, the dataframe would then need to be:

| | case | alt | choice | dist | cost | ivt | ovt | freq | income | urban | noalt |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | train | car | 83 | 28.25 | 50 | 66 | 4 | 45 | 0 | 2 |
| 2 | 1 | car | car | 83 | 15.77 | 61 | 0 | 0 | 45 | 0 | 2 |
| 3 | 2 | train | car | 83 | 28.25 | 50 | 66 | 4 | 25 | 0 | 2 |
| 4 | 2 | car | car | 83 | 15.77 | 61 | 0 | 0 | 25 | 0 | 2 |
| 5 | 3 | train | car | 83 | 28.25 | 50 | 66 | 4 | 70 | 0 | 2 |

In the first 5 examples, the chosen transportation is always the car.
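If your dataframe is encoded in one mode and you want the other, the conversion is a short pandas operation. The sketch below goes from one_zero to item_id on a toy dataframe. It assumes exactly one chosen alternative per session, and the column names simply follow the example above.

```python
import pandas as pd

one_zero_df = pd.DataFrame(
    {
        "case": [1, 1, 2, 2],
        "alt": ["train", "car", "train", "car"],
        "choice": [0, 1, 0, 1],
    }
)

# For each session, the id of the alternative whose choice flag is 1
chosen = one_zero_df.loc[one_zero_df["choice"] == 1].set_index("case")["alt"]

# Write that id on every row of the session
item_id_df = one_zero_df.copy()
item_id_df["choice"] = item_id_df["case"].map(chosen)
print(item_id_df)
```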

In [None]:
dataset.sessions_features[0].shape, dataset.items_features[0].shape, dataset.sessions_items_features[0].shape

### ChoiceDataset - Inside the machine

Let's see an example of ChoiceDataset instantiation from numpy arrays:

In [None]:
# Let's see an example of ChoiceDataset instantiation

# from choice_learn.data import ChoiceDataset

# Let's consider three items described by:
# - size and weight (fixed for each item: items_features)
# - price and whether it is on promotion (vary by session: sessions_items_features)

items_features = [
    [1, 2], # item 1 [size, weight]
    [2, 4], # item 2 [size, weight]
    [1.5, 1.5], # item 3 [size, weight]
]

# We have two customers whose features are
# - Budget
# - Age
# Customer 1 bought item 1 at session 1 and item 2 at session 3
# Customer 2 bought item 3 at session 2

choices = [0, 2, 1]
sessions_items_availabilities = [
    [1, 1, 1], # All items available at session 1
    [1, 1, 1], # All items available at session 2
    [0, 1, 1], # Item 1 not available at session 3
]

sessions_features = [
    [100, 20], # session 1, customer 1 [budget, age]
    [200, 40], # session 2, customer 2 [budget, age]
    [80, 20], # session 3, customer 1 [budget, age]
]

sessions_items_features = [
    [
        [100, 0], # Session 1, Item 1 [price, promotion]
        [140, 0], # Session 1, Item 2 [price, promotion]
        [200, 0], # Session 1, Item 3 [price, promotion]
    ],
    [
        [100, 0], # Session 2, Item 1 [price, promotion]
        [120, 1], # Session 2, Item 2 [price, promotion]
        [200, 0], # Session 2, Item 3 [price, promotion]
    ],
    [
        [100, 0], # Session 3, Item 1 [price, promotion]; item unavailable, values do not matter but must exist to keep the shape
        [120, 1], # Session 3, Item 2 [price, promotion]
        [180, 1], # Session 3, Item 3 [price, promotion]
    ],
]


Note that in items_features and sessions_items_features, the features need to be well ordered:
- The features are ordered the same for all items
- The items are ordered the same for items_features and sessions_items_features, and their index is used in choices:


**items_features** = [[feature_1_item_A, feature_2_item_A, ...], [feature_1_item_B, feature_2_item_B, ...], ...]

**sessions_items_features** = [[[session_1_feature_1_item_A, ...], [session_1_feature_1_item_B, ...]], [[session_2_feature_1_item_A, ...], [session_2_feature_1_item_B, ...]], ...]

**choices** then represents the index of the chosen item: 0 when item_1 is chosen, 1 when item_2 is chosen, etc., e.g. [0, 0, 2, 1, ...]
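Because choices holds item indexes, it can be used directly to index the other arrays. The NumPy sketch below reuses the values of the example to fetch the features of the chosen items and to check that every chosen item was available in its session. It is only an illustration of the indexing convention, not something ChoiceDataset requires you to run.

```python
import numpy as np

items_features = np.array([[1, 2], [2, 4], [1.5, 1.5]])  # (n_items, n_items_features)
sessions_items_availabilities = np.array(
    [[1, 1, 1], [1, 1, 1], [0, 1, 1]]
)  # (n_sessions, n_items)
choices = np.array([0, 2, 1])  # index of the chosen item in each session

# Features of the item chosen in each session
chosen_items_features = items_features[choices]

# Availability of the chosen item in each session: should be 1 everywhere
chosen_available = sessions_items_availabilities[np.arange(len(choices)), choices]
assert (chosen_available == 1).all(), "a chosen item is marked as unavailable"

print(chosen_items_features)
```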

In [None]:
dataset = ChoiceDataset(
    items_features=items_features,
    items_features_names=["size", "weight"], # You can specify the names of the features if you want
    sessions_features=sessions_features,
    sessions_features_names=["budget", "age"], # same, not mandatory
    sessions_items_features=sessions_items_features,
    sessions_items_features_names=["price", "promotion"], # same, not mandatory
    sessions_items_availabilities=sessions_items_availabilities,
    choices=choices,
)

In [None]:
dataset.sessions_items_features[0].shape

(3, 3, 2)

ChoiceDataset is indexed by session. You can use [] to subset it.
It is particularly useful for train/test split:

In [None]:
train_index = [0, 1]
test_index = [2]
train_dataset = dataset[train_index]
test_dataset = dataset[test_index]
print("Train Dataset length:", len(train_dataset), "Test Dataset length:", len(test_dataset))

Some choices never happen in the dataset: {1}
Some choices never happen in the dataset: {0, 2}
Train Dataset length: 2 Test Dataset length: 1
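On a larger dataset you would usually draw the split at random rather than write the indexes by hand. A minimal sketch with NumPy follows. The 70/30 ratio and the seed are arbitrary choices, and n_sessions stands in for len(dataset).

```python
import numpy as np

n_sessions = 3  # stands in for len(dataset) in the example above
rng = np.random.default_rng(seed=0)
shuffled = rng.permutation(n_sessions)

n_train = int(0.7 * n_sessions)  # 70% of the sessions go to the training set
train_index = shuffled[:n_train].tolist()
test_index = shuffled[n_train:].tolist()
print(train_index, test_index)

# The two lists can then be used as in the cell above:
# train_dataset, test_dataset = dataset[train_index], dataset[test_index]
```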


If you want to access the features, you can use the .batch[] indexer with session indexes.
It returns the features in this order:

| index | feature  | shape  |   
|---|---|---|
| 0 | items_features | (n_items, n_items_features) |
| 1 | sessions_features | (n_sessions_indexes, n_sessions_features) |
| 2 | sessions_items_features | (n_sessions_indexes, n_items, n_sessions_items_features) |
| 3 | sessions_items_availabilities | (n_sessions_indexes, n_items) |
| 4 | choices | (n_sessions_indexes,) |

In [None]:
sessions_indexes = [0, 1]
print("Items features:", train_dataset.batch[sessions_indexes][0])
print("Sessions features:", train_dataset.batch[sessions_indexes][1])
print("Sessions Items features:", train_dataset.batch[sessions_indexes][2])
print("Sessions Items Availabilities features:", train_dataset.batch[sessions_indexes][3])
print("Sessions Choices:", train_dataset.batch[sessions_indexes][4])

Items features: (array([[1. , 2. ],
       [2. , 4. ],
       [1.5, 1.5]], dtype=float32),)
Sessions features: (array([[100,  20],
       [200,  40]], dtype=int32),)
Sessions Items features: (array([[[100,   0],
        [140,   0],
        [200,   0]],

       [[100,   0],
        [120,   1],
        [200,   0]]], dtype=int32),)
Sessions Items Availabilities features: [[1. 1. 1.]
 [1. 1. 1.]]
Sessions Choices: [0 2]


To simplify iteration over the dataset, you can call the .iter_batch method with the batch_size argument.

Note that batch_size=-1 returns the whole dataset.

In [None]:
# All the features are given for each session, in order to compute utility and NegativeLogLikelihood
for n_batch, batch in enumerate(dataset.iter_batch(batch_size=1)):
    print(n_batch, batch)

0 (array([[1. , 2. ],
       [2. , 4. ],
       [1.5, 1.5]], dtype=float32), array([[100,  20]], dtype=int32), array([[[100,   0],
        [140,   0],
        [200,   0]]], dtype=int32), array([[1., 1., 1.]], dtype=float32), array([0], dtype=int32))
1 (array([[1. , 2. ],
       [2. , 4. ],
       [1.5, 1.5]], dtype=float32), array([[200,  40]], dtype=int32), array([[[100,   0],
        [120,   1],
        [200,   0]]], dtype=int32), array([[1., 1., 1.]], dtype=float32), array([2], dtype=int32))
2 (array([[1. , 2. ],
       [2. , 4. ],
       [1.5, 1.5]], dtype=float32), array([[80, 20]], dtype=int32), array([[[100,   0],
        [120,   1],
        [180,   1]]], dtype=int32), array([[0., 1., 1.]], dtype=float32), array([1], dtype=int32))
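Conceptually, batching amounts to cutting the session indexes into consecutive chunks and indexing the dataset with each chunk. The sketch below shows that logic on plain indexes. `iter_batch_indexes` is a hypothetical helper written for this illustration and is not how choice-learn implements iter_batch.

```python
import numpy as np


def iter_batch_indexes(n_sessions, batch_size):
    """Yield arrays of session indexes, batch_size at a time (hypothetical helper)."""
    if batch_size == -1:
        # A single batch containing every session
        batch_size = max(n_sessions, 1)
    indexes = np.arange(n_sessions)
    for start in range(0, n_sessions, batch_size):
        yield indexes[start : start + batch_size]


for batch_indexes in iter_batch_indexes(n_sessions=3, batch_size=2):
    print(batch_indexes)
```

The last batch is simply shorter when the number of sessions is not a multiple of batch_size.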


Note that you will need to use a ChoiceDataset to use the models.

**Stacking features**
If you need to keep a clear distinction between different features, you can use stacking in the ChoiceDataset. For example, if we have two kinds of items_features and we do not want them to be within the same np.ndarray, we can proceed as follows:

In [None]:
items_features_2 = [
    [11, 12], # item 1 
    [12, 14], # item 2 
    [11.5, 11.5], # item 3 
]
dataset = ChoiceDataset(
    items_features=(items_features, items_features_2), # Here items_features specified as a tuple of the two features lists
    sessions_features=sessions_features,
    sessions_items_features=sessions_items_features,
    sessions_items_availabilities=sessions_items_availabilities,
    choices=choices,
)

When indexing or batching your ChoiceDataset, you will now get items_features as a tuple whose elements correspond to (items_features, items_features_2).

In [None]:
dataset.batch[0]

((array([[1. , 2. ],
         [2. , 4. ],
         [1.5, 1.5]], dtype=float32),
  array([[11. , 12. ],
         [12. , 14. ],
         [11.5, 11.5]], dtype=float32)),
 array([100,  20], dtype=int32),
 array([[100,   0],
        [140,   0],
        [200,   0]], dtype=int32),
 array([1., 1., 1.], dtype=float32),
 0)

Stacking is possible with:
- items_features
- sessions_features
- sessions_items_features

The other arguments (sessions_items_availabilities and choices) should not need to be stacked.

In [None]:
dataset = ChoiceDataset(
    items_features=(items_features, items_features_2), # Here items_features specified as a tuple of the two features lists
    sessions_features=(sessions_features, sessions_features),
    sessions_items_features=(sessions_items_features, sessions_items_features),
    sessions_items_availabilities=sessions_items_availabilities,
    choices=choices,
)

dataset.batch[0]

## More Advanced use: the FeaturesStore & OneHotStore

The FeaturesStore class is designed to store values that repeat regularly over a sequence.
Let's take an example where several stores are considered. If we want to model the utility from store features (such as surface, average number of customers, etc...), these features are shared by several choices in our dataset.

In [None]:
# We have three stores, represented by their (surface, average_number_of_customers):
store_features = [[100, 250], [150, 500], [80, 100]]
# Now we consider a sequence of choices that happen in [store_1, store_1, store_2, store_3, store_2, store_1, store_3]
# The usual way to store the features would be:
usual_store_features = [[100, 250], [100, 250], [150, 500], [80, 100], [150, 500], [100, 250], [80, 100]]
# There are a lot of repetitions, which is not very efficient...
# Scale this up to 600 stores represented as one-hot vectors over thousands of sessions and we will have a memory problem!
# The FeaturesStore tries to be more efficient:

store = {1: [100, 250], 2: [150, 500], 3: [80, 100]} # We can use a dictionary to store the features of each store
sequence = [1, 1, 2, 3, 2, 1, 3] # We can use a sequence of keys to represent the sequence of stores

from choice_learn.data import FeaturesStore
feat_store = FeaturesStore.from_dict(store, sequence)

In [None]:
# Now we can access the features of the store appearing at index i in the sequence with the .batch[] indexer:

# Let's see the features of the store at index 0
print(feat_store.batch[0])
# Now we can also take a whole batch:
print(feat_store.batch[[0, 1, 5]])
# Ah ! We selected all the indexes where the store is 1, which is why we always have the same features !

[100, 250]
[[100, 250], [100, 250], [100, 250]]
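The idea behind this storage scheme fits in a few lines of plain Python: keep one feature vector per unique key, keep one key per session, and resolve the lookup only when a batch is requested. The class below is a toy illustration named `TinyFeaturesStore` for this purpose. It is not the choice-learn FeaturesStore and ignores everything the real class may do beyond the lookup.

```python
class TinyFeaturesStore:
    """Toy illustration of the idea, not the choice-learn FeaturesStore."""

    def __init__(self, values, sequence):
        self.values = values      # one feature vector per unique key
        self.sequence = sequence  # one key per session

    def get(self, indexes):
        """Return the feature vectors of the sessions at `indexes`."""
        if isinstance(indexes, int):
            return self.values[self.sequence[indexes]]
        return [self.values[self.sequence[i]] for i in indexes]


store = {1: [100, 250], 2: [150, 500], 3: [80, 100]}
sequence = [1, 1, 2, 3, 2, 1, 3]
tiny_store = TinyFeaturesStore(store, sequence)

print(tiny_store.get(0))
print(tiny_store.get([0, 1, 5]))

# Values kept in memory: one vector per store plus one key per session,
# versus one full vector per session when the features are repeated
n_stored = sum(len(vector) for vector in store.values()) + len(sequence)
n_repeated = len(sequence) * 2
print(n_stored, n_repeated)
```

With only two features per store the saving is small. It grows with the number of features per key and with the length of the sequence.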


In order to further optimize RAM usage, you can use the OneHotStore, built specifically for one-hot encoded features. For each element, the store only keeps the index of the 1 and builds the full one-hot vector only when needed.

In [None]:
from choice_learn.data import OneHotStore

In [None]:
store = OneHotStore.from_sequence(["a", "a", "a", "c", "b", "b", "a"])

# When using from_sequence, the store uses the sorted rank (lowest to highest) of each unique element as its index
print("RAM storage of the OneHotStore:", store.store, "with sequence:", store.sequence)
# When indexing with .batch[], we can access the one-hot encoding of the element at index i in the sequence
print("One-hot vector at index 0:", store.batch[0])
print("One-hot vector at indexes [0, 1, 3]:")
print(store.batch[[0, 1, 3]])

RAM storage of the OneHotStore: {'a': 0, 'b': 1, 'c': 2} with sequence: ['a' 'a' 'a' 'c' 'b' 'b' 'a']
One-hot vector at index 0: [1. 0. 0.]
One-hot vector at indexes [0, 1, 3]:
[[1. 0. 0.]
 [1. 0. 0.]
 [0. 0. 1.]]
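The lazy construction can be sketched with NumPy as follows. This is an illustration of the principle under simple assumptions (a sorted mapping of unique values, float32 output) and not the actual OneHotStore code. The function name `one_hot` is made up for the example.

```python
import numpy as np

sequence = np.array(["a", "a", "a", "c", "b", "b", "a"])

# Map each unique element to an index, in sorted order
mapping = {
    value: index for index, value in enumerate(sorted(set(sequence.tolist())))
}


def one_hot(indexes):
    """Build the one-hot rows for the given positions of the sequence, on demand."""
    indexes = np.atleast_1d(indexes)
    out = np.zeros((len(indexes), len(mapping)), dtype=np.float32)
    columns = [mapping[value] for value in sequence[indexes].tolist()]
    out[np.arange(len(indexes)), columns] = 1.0
    return out


print(one_hot([0, 1, 3]))
```

Only the mapping and the sequence stay in memory. The dense (n_indexes, n_unique_values) array exists only for the requested batch.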



### Ready-to-use datasets
A few well-known open-source datasets are directly integrated in the package and can be downloaded in one line:
- SwissMetro from Bierlaire et al. (2001)
- ModeCanada from Koppelman et al. (1993)

If you feel like another open-source dataset should be included, reach out!

In [None]:
from choice_learn.datasets import load_swissmetro, load_modecanada

canada_choice_dataset = load_modecanada()
swissmetro_choice_dataset = load_swissmetro()

The datasets can also be downloaded as dataframes:

In [None]:
canada_df = load_modecanada(as_frame=True)
canada_df.head()

Unnamed: 0,case,alt,choice,dist,cost,ivt,ovt,freq,income,urban,noalt
1,1,train,0,83,28.25,50,66,4,45.0,0,2
2,1,car,1,83,15.77,61,0,0,45.0,0,2
3,2,train,0,83,28.25,50,66,4,25.0,0,2
4,2,car,1,83,15.77,61,0,0,25.0,0,2
5,3,train,0,83,28.25,50,66,4,70.0,0,2
