## The different possible ways to create a ChoiceDataset

Listed below:

- [From a single long format DataFrame](#from-a-single-long-format-dataframe)
- [From a single wide format DataFrame](#from-a-single-wide-format-dataframe)
- [From several DataFrames](#from-several-dataframes)
- [From several np.ndarrays](#from-several-npndarrays)

In [None]:
import os
import sys
from pathlib import Path


sys.path.append("../../")

import numpy as np
import pandas as pd

from choice_learn.data import ChoiceDataset
from choice_learn.data.storage import FeaturesStorage

We will use the ModeCanada dataset for this example. We can load it directly:

In [None]:
from choice_learn.datasets import load_modecanada

canada_df = load_modecanada(as_frame=True, add_is_public=True)
canada_df.head()

Unnamed: 0,case,alt,choice,dist,cost,ivt,ovt,freq,income,urban,noalt,is_public
1,1,train,0,83,28.25,50,66,4,45.0,0,2,1.0
2,1,car,1,83,15.77,61,0,0,45.0,0,2,0.0
3,2,train,0,83,28.25,50,66,4,25.0,0,2,1.0
4,2,car,1,83,15.77,61,0,0,25.0,0,2,0.0
5,3,train,0,83,28.25,50,66,4,70.0,0,2,1.0


The `add_is_public=True` argument adds the is_public column, which indicates whether the considered transport alternative is public (1) or individual (0).

### From a single long format DataFrame

In [None]:
dataset = ChoiceDataset.from_single_long_df(df=canada_df,
                                       fixed_items_features_columns=["is_public"],
                                       contexts_features_columns=["dist", "income", "urban"],
                                       contexts_items_features_columns=["freq", "cost", "ivt", "ovt"],
                                       items_id_column="alt",
                                       contexts_id_column="case",
                                       choices_column="choice",
                                       # the choice column indicates whether the item is chosen (1) or not (0)
                                       choice_format="one_zero",
                                       )
print(dataset.summary())

No features_by_ids given.
%%% Summary of the dataset:
Number of items: 4
Number of choices: 4324
 Fixed Items Features:
 1 items features
 with names: (['is_public'],)


 Contexts features:
 3 context features
 with names: (['dist', 'income', 'urban'],)


 Contexts Items features:
 4 context
                 items features
 with names: (['freq', 'cost', 'ivt', 'ovt'],)



Another format is supported, for when the DataFrame gives the name of the chosen item instead of ones and zeros:

In [None]:
canada_df = load_modecanada(as_frame=True, add_is_public=True, choice_format="items_id")
canada_df.head()

Unnamed: 0,case,alt,choice,dist,cost,ivt,ovt,freq,income,urban,noalt,is_public
1,1,train,car,83,28.25,50,66,4,45.0,0,2,1.0
2,1,car,car,83,15.77,61,0,0,45.0,0,2,0.0
3,2,train,car,83,28.25,50,66,4,25.0,0,2,1.0
4,2,car,car,83,15.77,61,0,0,25.0,0,2,0.0
5,3,train,car,83,28.25,50,66,4,70.0,0,2,1.0


This time, the choice column does not contain ones and zeros: for each context, it names the alternative (item) that has been chosen.
The ChoiceDataset handles this case easily when you specify 'choice_format="items_id"'.

In [None]:
dataset = ChoiceDataset.from_single_long_df(df=canada_df,
                                       fixed_items_features_columns=["is_public"],
                                       contexts_features_columns=["dist", "income", "urban"],
                                       contexts_items_features_columns=["freq", "cost", "ivt", "ovt"],
                                       items_id_column="alt",
                                       contexts_id_column="case",
                                       choices_column="choice",
                                       # the choice column indicates the id of the chosen item
                                       choice_format="items_id",
                                       )
print(dataset.summary())

No features_by_ids given.
%%% Summary of the dataset:
Number of items: 4
Number of choices: 4324
 Fixed Items Features:
 1 items features
 with names: (['is_public'],)


 Contexts features:
 3 context features
 with names: (['dist', 'income', 'urban'],)


 Contexts Items features:
 4 context
                 items features
 with names: (['freq', 'cost', 'ivt', 'ovt'],)
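The relationship between the two choice formats can be illustrated without the library. The following is a minimal sketch on a toy long-format DataFrame (the values are made up, and only the `case`, `alt` and `choice` column names are taken from the example above) that turns a `one_zero` choice column into an `items_id` one using pandas alone:

```python
import pandas as pd

# Toy long-format frame mimicking the columns shown above (illustrative values).
long_df = pd.DataFrame(
    {
        "case": [1, 1, 2, 2],
        "alt": ["train", "car", "train", "car"],
        "choice": [0, 1, 1, 0],  # "one_zero" format: 1 marks the chosen row
    }
)

# For each context, keep the alternative whose "choice" flag is 1.
chosen = long_df.loc[long_df["choice"] == 1].set_index("case")["alt"]

# Write the chosen item's id on every row of its context ("items_id" format).
items_id_df = long_df.assign(choice=long_df["case"].map(chosen))
print(items_id_df)
```

Both frames carry the same information, so either one can be passed to `from_single_long_df` as long as `choice_format` matches the content of the choice column.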



### From a single wide format DataFrame

If your DataFrame is in the wide format you can use the 'from_single_wide_df' method. Here is an example with the SwissMetro dataset.

In [None]:
from choice_learn.datasets import load_swissmetro

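swiss_df = load_swissmetro(as_frame=True)
# Drop rows where CHOICE is 0; the result must be assigned back to take effect
swiss_df = swiss_df.loc[swiss_df.CHOICE != 0].copy()
# Shift choices to start at 0 so that they index into the items_id list
swiss_df["CHOICE"] = swiss_df["CHOICE"] - 1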
swiss_df.head()

In [None]:
dataset = ChoiceDataset.from_single_wide_df(
    df=swiss_df,
    items_id=["TRAIN", "SM", "CAR"],
    fixed_items_suffixes=None,
    contexts_features_columns=["GROUP", "SURVEY", "SP", "PURPOSE", "FIRST", "TICKET", "WHO", "LUGGAGE", "AGE",
                               "MALE", "INCOME", "GA", "ORIGIN", "DEST"],
    contexts_items_features_suffixes=["CO", "TT", "HE", "SEATS"],
    contexts_items_availabilities_suffix="AV", # ["TRAIN_AV", "SM_AV", "CAR_AV"] also works
    choices_column="CHOICE",
    choice_format="item_index",
)
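The comment in the call above suggests that wide-format columns are named by joining an item id and a suffix with an underscore (for example `TRAIN_AV`). As a rough sketch under that assumption, the candidate column names implied by `items_id` and the suffixes can be listed in plain Python. The helper `candidate_columns` is hypothetical and not part of choice_learn, and this sketch does not cover how the library treats combinations that are absent from the DataFrame.

```python
items_id = ["TRAIN", "SM", "CAR"]
contexts_items_features_suffixes = ["CO", "TT", "HE", "SEATS"]
availability_suffix = "AV"


def candidate_columns(items, suffixes):
    """Hypothetical helper: map each item to its ITEM_SUFFIX column names."""
    return {item: [f"{item}_{suffix}" for suffix in suffixes] for item in items}


feature_columns = candidate_columns(items_id, contexts_items_features_suffixes)
availability_columns = [f"{item}_{availability_suffix}" for item in items_id]

print(feature_columns["TRAIN"])
print(availability_columns)

# With choice_format="item_index", a CHOICE value is a position in items_id.
choice_index = 1
print(items_id[choice_index])
```

Comparing such a list with `swiss_df.columns` before building the dataset is a quick way to spot a misspelled suffix.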

### From several DataFrames

Now, let's say that your data is split across several files. This can happen, for example, if you store the different types of features in different SQL tables.
You only need to follow a few restrictions:

In [None]:
fixed_items_features, contexts_features, contexts_items_features, choices = load_modecanada(
    as_frame=True, split_features=True, add_is_public=True
)

fixed_items_features needs to have a column named "item_id" referencing the item. Other columns are free to hold any feature.

In [None]:
fixed_items_features.head()

Unnamed: 0,item_id,is_public
0,car,0
1,train,1
2,bus,1
3,air,1


contexts_features should have a "context_id" column (otherwise the index is used). Other columns are free to hold any feature.

In [None]:
contexts_features.head()

Unnamed: 0,context_id,income,dist,urban
1,1,45.0,83,0
3,2,25.0,83,0
5,3,70.0,83,0
7,4,70.0,83,0
9,5,55.0,83,0


contexts_items_features needs to have an "item_id" column, and a "context_id" column is recommended (otherwise the index is used).\
Of course, "item_id" and "context_id" should match the values in fixed_items_features and contexts_features.

In [None]:
contexts_items_features.head()

Unnamed: 0,context_id,item_id,freq,cost,ivt,ovt
1,1,train,4,28.25,50,66
2,1,car,0,15.77,61,0
3,2,train,4,28.25,50,66
4,2,car,0,15.77,61,0
5,3,train,4,28.25,50,66


choices should have a "context_id" column and a "choice" column. The values in "choice" should match the values of the "item_id" column in fixed_items_features and contexts_items_features.

In [None]:
choices.head()

Unnamed: 0,context_id,choice
2,1,car
4,2,car
6,3,car
8,4,car
10,5,car


In [None]:
# And now you can create the dataset with:
dataset = ChoiceDataset(fixed_items_features=fixed_items_features,
                        contexts_features=contexts_features,
                        contexts_items_features=contexts_items_features,
                        choices=choices)
print(dataset.summary())

No features_by_ids given.
%%% Summary of the dataset:
Number of items: 4
Number of choices: 4324
 Fixed Items Features:
 1 items features
 with names: (['is_public'],)


 Contexts features:
 3 context features
 with names: (Index(['income', 'dist', 'urban'], dtype='object'),)


 Contexts Items features:
 4 context
                 items features
 with names: (Index(['cost', 'freq', 'ivt', 'ovt'], dtype='object'),)
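The restrictions listed above for the four DataFrames can be checked before building the dataset. Below is a minimal sketch with pandas only. The helper name `check_split_frames` and the toy values are assumptions made for illustration and are not part of choice_learn. It assumes that every frame carries explicit "item_id" and "context_id" columns.

```python
import pandas as pd


def check_split_frames(fixed_items_features, contexts_features, contexts_items_features, choices):
    """Hypothetical helper: return a list of problems, empty if none are found."""
    problems = []
    items = set(fixed_items_features["item_id"])
    contexts = set(contexts_features["context_id"])

    # Items described per context must be known items.
    unknown_items = set(contexts_items_features["item_id"]) - items
    if unknown_items:
        problems.append(f"unknown item_id values: {sorted(unknown_items)}")

    # Contexts described per item must be known contexts.
    unknown_contexts = set(contexts_items_features["context_id"]) - contexts
    if unknown_contexts:
        problems.append(f"unknown context_id values: {sorted(unknown_contexts)}")

    # Chosen items must be known items.
    unknown_choices = set(choices["choice"]) - items
    if unknown_choices:
        problems.append(f"choices not found in item_id: {sorted(unknown_choices)}")

    # One context corresponds to one choice.
    if sorted(choices["context_id"].tolist()) != sorted(contexts_features["context_id"].tolist()):
        problems.append("choices and contexts_features do not cover the same contexts")
    return problems


# Toy frames following the column conventions described above.
toy_items = pd.DataFrame({"item_id": ["car", "train"], "is_public": [0, 1]})
toy_contexts = pd.DataFrame({"context_id": [1, 2], "income": [45.0, 25.0]})
toy_contexts_items = pd.DataFrame(
    {
        "context_id": [1, 1, 2, 2],
        "item_id": ["train", "car", "train", "car"],
        "cost": [28.25, 15.77, 28.25, 15.77],
    }
)
toy_choices = pd.DataFrame({"context_id": [1, 2], "choice": ["car", "car"]})

problems = check_split_frames(toy_items, toy_contexts, toy_contexts_items, toy_choices)
print(problems)

# A choice naming an item that is absent from fixed_items_features is reported.
bad_choices = toy_choices.assign(choice=["car", "boat"])
bad_problems = check_split_frames(toy_items, toy_contexts, toy_contexts_items, bad_choices)
print(bad_problems)
```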



### From several np.ndarrays

Finally, another alternative is to specify each type of feature as an np.ndarray. You can optionally give the feature names as well. They are only necessary if you plan to use a model whose specification refers to those feature names.

In [None]:
(fixed_items_features, contexts_features, contexts_items_features,
 contexts_items_availabilities, choices) = load_modecanada(split_features=True, add_is_public=True)

If you use this method, it is your job to make sure that the arrays are well organized.\
First, contexts_features, contexts_items_features, contexts_items_availabilities and choices must be in the same order, and their first dimensions must match.\
Second, fixed_items_features, contexts_items_availabilities and contexts_items_features must have the same number of items, ordered the same way. The items axis is the first dimension of fixed_items_features and the second dimension of contexts_items_features and contexts_items_availabilities.\
Third, choices must indicate the index of the chosen item, following the item order of fixed_items_features and contexts_items_features.\
Finally, you have to specify contexts_items_availabilities, which indicates which items were available (1) or not (0) for each context/choice.

To summarize, the shapes of the arrays must be:
- (n_items, n_fixed_items_features) for fixed_items_features
- (n_choices, n_contexts_features) for contexts_features
- (n_choices, n_items, n_contexts_items_features) for contexts_items_features
- (n_choices, n_items) for contexts_items_availabilities
- (n_choices, ) for choices

*Reminder:* One context corresponds to one choice.
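The shape rules above can be turned into a quick sanity check. The sketch below uses NumPy only, on small random arrays. The helper `check_shapes` is hypothetical (it is not a choice_learn function) and simply asserts the constraints listed above, plus the assumption that a chosen item should be marked as available.

```python
import numpy as np


def check_shapes(fixed_items_features, contexts_features, contexts_items_features,
                 contexts_items_availabilities, choices):
    """Hypothetical helper: assert the shape rules and return (n_choices, n_items)."""
    n_items = fixed_items_features.shape[0]
    n_choices = choices.shape[0]

    assert choices.ndim == 1
    assert contexts_features.shape[0] == n_choices
    assert contexts_items_features.shape[:2] == (n_choices, n_items)
    assert contexts_items_availabilities.shape == (n_choices, n_items)
    # Choices are item indices, so they must lie in [0, n_items).
    assert ((choices >= 0) & (choices < n_items)).all()
    # Assumption: the chosen item must be available in its context.
    assert contexts_items_availabilities[np.arange(n_choices), choices].all()
    return n_choices, n_items


rng = np.random.default_rng(0)
n_choices, n_items = 6, 4

toy_fixed = np.array([[0.0], [1.0], [1.0], [1.0]])       # (n_items, 1)
toy_contexts = rng.normal(size=(n_choices, 3))           # (n_choices, 3)
toy_contexts_items = rng.normal(size=(n_choices, n_items, 4))
toy_availabilities = np.ones((n_choices, n_items))       # every item available
toy_choices = rng.integers(0, n_items, size=n_choices)   # indices of chosen items

shapes = check_shapes(toy_fixed, toy_contexts, toy_contexts_items,
                      toy_availabilities, toy_choices)
print(shapes)
```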

In [None]:
print("For our example here are the arrays shapes:")
print(f"Fixed Items Features shape: {fixed_items_features.shape}, 4 items, 1 feature (is_public)")
print(f"Contexts Features shape: {contexts_features.shape}, 4324 choices, 3 features (income, dist, urban)")
print(f"Contexts Items Features shape: {contexts_items_features.shape}, 4324 choices, 4 items, 4 features (freq, cost, ivt, ovt)")
print(f"Contexts Items Availabilities shape: {contexts_items_availabilities.shape}, 4324 choices, 4 items")
print(f"Choices shape: {choices.shape}, 4324 choices")

For our example here are the arrays shapes:
Fixed Items Features shape: (4, 1), 4 items, 1 feature (is_public)
Contexts Features shape: (4324, 3), 4324 choices, 3 features (income, dist, urban)
Contexts Items Features shape: (4324, 4, 4), 4324 choices, 4 items, 4 features (freq, cost, ivt, ovt)
Contexts Items Availabilities shape: (4324, 4), 4324 choices, 4 items
Choices shape: (4324,), 4324 choices
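When the raw data names the chosen items and lists the alternatives available in each context, the index-based choices and the 0/1 availabilities described above have to be built by hand. Here is a minimal sketch with NumPy. The item order `["car", "train", "bus", "air"]` is an assumption taken from the fixed_items_features table displayed earlier, and the per-context values are invented for illustration.

```python
import numpy as np

# Assumed item order; it must match the rows of fixed_items_features.
items_order = ["car", "train", "bus", "air"]
item_to_index = {name: index for index, name in enumerate(items_order)}

# Illustrative raw data: chosen item and available items for each context.
chosen_names = ["car", "car", "air", "train"]
available_per_context = [
    {"train", "car"},
    {"train", "car"},
    {"train", "car", "air"},
    {"train", "car", "bus"},
]

# choices: index of the chosen item, following items_order.
toy_choices = np.array([item_to_index[name] for name in chosen_names])

# availabilities: 1 if the item is available in the context, 0 otherwise.
toy_availabilities = np.array(
    [[1 if item in available else 0 for item in items_order]
     for available in available_per_context]
)

print(toy_choices)
print(toy_availabilities)
```

Keeping a single `items_order` list as the reference for every array is a simple way to ensure that the items axis is ordered consistently.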


In [None]:
dataset = ChoiceDataset(fixed_items_features=fixed_items_features,
                        contexts_features=contexts_features,
                        contexts_items_features=contexts_items_features,
                        choices=choices,
                        contexts_items_availabilities=contexts_items_availabilities,
                        # We can give the name of the features as follows, with the right order:
                        fixed_items_features_names=["is_public"],
                        contexts_features_names=["income", "dist", "urban"],
                        contexts_items_features_names=["freq", "cost", "ivt", "ovt"],
                        )
print(dataset.summary())

No features_by_ids given.
%%% Summary of the dataset:
Number of items: 4
Number of choices: 4324
 Fixed Items Features:
 1 items features
 with names: (['is_public'],)


 Contexts features:
 3 context features
 with names: (['income', 'dist', 'urban'],)


 Contexts Items features:
 4 context
                 items features
 with names: (['freq', 'cost', 'ivt', 'ovt'],)

