# <span style="font-width:bold; font-size: 3rem; color:#1EB182;"><img src="../../images/icon102.png" width="38px"></img> **Hopsworks Feature Store** </span><span style="font-width:bold; font-size: 3rem; color:#333;">- Part 02: Training Data & Feature views</span>

<span style="font-width:bold; font-size: 1.4rem;"> This notebook explains how to read from a feature group and create a training dataset within the feature store.</span>

## 🗒️ In this notebook you will see how to create a training dataset from the feature groups:
1. **Select the features** you want to train your model on,
2. **Define how the features should be preprocessed,**
3. **Create a dataset split** for training, validation and test data.

![tutorial-flow](../../images/02_training-dataset.png) 

In [1]:
import hopsworks

project = hopsworks.login()

fs = project.get_feature_store() 

Connected. Call `.close()` to terminate connection gracefully.

Logged in to project, explore it here https://c.app.hopsworks.ai:443/p/258
Connected. Call `.close()` to terminate connection gracefully.




## <span style="color:#ff5f27;"> 🔪 Feature Selection </span>

You will start by selecting all the features you want to include for model training/inference.

In [2]:
# Load feature groups.
meteorological_measurements_fg = fs.get_or_create_feature_group(
    name = 'meteorological_measurements_fg',
    version = 1
)

In [3]:
week_days_electricity_prices_fg = fs.get_or_create_feature_group(
    name = 'week_days_electricity_prices_fg',
    version = 1
)

In [4]:
intra_day_electricity_prices_fg = fs.get_or_create_feature_group(
    name = 'intra_day_electricity_prices_fg',
    version = 1
)

## <span style="color:#ff5f27;">💼 Query Preparation</span>

In [None]:
# Select features for training data.
fg_query = intra_day_electricity_prices_fg.select_all()\
                        .join(
                            meteorological_measurements_fg.select_all()
                        )\
                        .join(
                            week_days_electricity_prices_fg.select_all()
                        )

# Uncomment the next line to preview the query results.
# fg_query.show(5)

## <span style="color:#ff5f27;"> 🤖 Transformation Functions </span>

Hopsworks Feature Store provides functionality to attach transformation functions to feature views and comes with built-in transformation functions such as `min_max_scaler`, `standard_scaler`, `robust_scaler` and `label_encoder`.

You will preprocess the data using *standard scaling* on numerical features and *label encoding* on categorical features. To do this, you simply define a mapping between features and transformation functions. The statistics behind transformation functions such as *standard scaling* are fitted only on the training data (and not on the validation/test data), which prevents data leakage.
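
The leakage point can be illustrated with a small plain-Python sketch. It is not Hopsworks code: the helper names, the area codes and the numbers are made up for illustration, and using the population standard deviation is an assumption of this sketch. The scaler statistics are computed from the training rows only and then reused for the test rows.

```python
from statistics import mean, pstdev


def fit_standard_scaler(values):
    """Return (mean, std) computed from the given values only."""
    return mean(values), pstdev(values)


def apply_standard_scaler(values, stats):
    """Scale values with statistics that were fitted elsewhere."""
    mu, sigma = stats
    return [(v - mu) / sigma for v in values]


def fit_label_encoder(categories):
    """Map each distinct category to an integer, in sorted order."""
    return {cat: idx for idx, cat in enumerate(sorted(set(categories)))}


# Hypothetical price values, split into train and test rows.
train_prices = [10.0, 20.0, 30.0]
test_prices = [40.0]

# Fit on the training rows only, then reuse the statistics for the test rows.
stats = fit_standard_scaler(train_prices)
scaled_train = apply_standard_scaler(train_prices, stats)
scaled_test = apply_standard_scaler(test_prices, stats)

# Hypothetical categorical values.
area_codes = fit_label_encoder(["SE1", "SE3", "SE1", "SE2"])
```

Because the test value never contributes to the mean or standard deviation, nothing about the held-out data leaks into the features the model is trained on.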

In [6]:
# Load transformation functions.
standard_scaler = fs.get_transformation_function(name = 'standard_scaler')
label_encoder = fs.get_transformation_function(name = 'label_encoder')

# Map features to transformation functions.
mapping_transformers = {
    "price": standard_scaler,
    "mean_temp_per_day": standard_scaler,
    "mean_wind_speed": standard_scaler,
    "precipitaton_amount": standard_scaler,
    "total_sunshine_time": standard_scaler,
    "mean_cloud_perc": standard_scaler,
    
    "area": label_encoder,
    "precipitaton_type": label_encoder,
    "type_of_day": label_encoder
}

---

## <span style="color:#ff5f27;"> ⚙️ Feature View Creation </span>

**Feature Views** sit between **Feature Groups** and **Training Datasets**. By combining **Feature Groups** we can create a **Feature View**, which stores metadata about our data. From a **Feature View** we can then create **Training Datasets**.

A Feature View lets us define the schema in the form of a query with filters, the model's target feature/label, and additional transformation functions.

To create a Feature View we can use the `FeatureStore.create_feature_view()` method.

We can specify the following parameters:

- `name` - name of the feature view.

- `version` - version of the feature view.

- `labels` - our target variable.

- `transformation_functions` - functions to transform our features.

- `query` - query object with data.

In [9]:
feature_view = fs.create_feature_view(
    name = 'electricity_feature_view',
    version = 1,
    transformation_functions = mapping_transformers,
    query = fg_query
)

Feature view created successfully, explore it at 
https://c.app.hopsworks.ai:443/p/258/fs/198/fv/electricity_feature_view/version/1


In [10]:
feature_view

<hsfs.feature_view.FeatureView at 0x7fbbdae10c70>

The `Feature View` is now saved in Hopsworks, and we can retrieve it using `FeatureStore.get_feature_view()`.

In [11]:
feature_view = fs.get_feature_view(
    name = 'electricity_feature_view',
    version = 1
)

In [12]:
feature_view.version

1

---

## <span style="color:#ff5f27;"> 🏋️ Training Dataset Creation</span>

In Hopsworks, training data is a query whose projection (set of features) is determined by the parent Feature View, with an optional snapshot on disk of the data returned by the query.

**A Training Dataset may contain splits such as:**
* Training set - the subset of training data used to train a model.
* Validation set - the subset of training data used to evaluate hyperparameters when training a model.
* Test set - the holdout subset of training data used to evaluate a model.

To create a training dataset we use the `FeatureView.create_training_data()` method.

Here are some important things to know:

- The training dataset inherits the name of its Feature View.

- The feature store currently supports the following data formats for training datasets: **tfrecord, csv, tsv, parquet, avro, orc**.

- We can choose the format using the **data_format** parameter.

- We can pass **start_time** and **end_time** to restrict the dataset to a specific time range.

- We can create **train, test** splits using `create_train_test_split()`.

- We can create **train, validation, test** splits using `create_train_validation_test_split()`.

- For the split methods, we only need to specify the desired ratio of each split.
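
What "specifying a ratio" means can be sketched in plain Python. This is only an illustration of a ratio-based random split, not how Hopsworks implements it; the function name, its parameters and the seed are choices made for this sketch.

```python
import random


def split_by_ratio(rows, validation_size, test_size, seed=42):
    """Shuffle rows and cut them into train/validation/test by ratio."""
    if validation_size < 0 or test_size < 0 or validation_size + test_size >= 1:
        raise ValueError("split ratios must be non-negative and sum to less than 1")

    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is repeatable

    n_test = round(len(shuffled) * test_size)
    n_validation = round(len(shuffled) * validation_size)

    test = shuffled[:n_test]
    validation = shuffled[n_test:n_test + n_validation]
    train = shuffled[n_test + n_validation:]  # whatever remains is training data
    return train, validation, test


# 100 hypothetical row ids: 70% train, 20% validation, 10% test.
rows = list(range(100))
train, validation, test = split_by_ratio(rows, validation_size=0.2, test_size=0.1)
```

Every row lands in exactly one split, and the training share is simply what is left after the validation and test ratios are taken.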

#### <span style="color:#ff5f27;"> ⛳️ Datasets filtered by event time</span>

In [13]:
# Create a training dataset filtered by event time.
td_jan2021_feb2022_version, td_job = feature_view.create_training_data(
        start_time = "20210101",
        end_time = "20220228",    
        description = 'Electricity price prediction training dataset jan2021/feb2022',
        data_format = "csv",
        coalesce = True,
        write_options = {'wait_for_job': True},
    )

Training dataset job started successfully, you can follow the progress at 
https://c.app.hopsworks.ai/p/258/jobs/named/electricity_feature_view_1_1_create_fv_td_06092022201105/executions




In [14]:
# Create a training dataset filtered by event time.
td_spring2022, td_job = feature_view.create_training_data(
        start_time = "20220301",
        end_time = "20220531",    
        description = 'Electricity price prediction training dataset March/May 2022',
        data_format = "csv",
        coalesce = True,
        write_options = {'wait_for_job': True},
    )

Training dataset job started successfully, you can follow the progress at 
https://c.app.hopsworks.ai/p/258/jobs/named/electricity_feature_view_1_2_create_fv_td_06092022201217/executions




In [15]:
# Create a training dataset filtered by event time.
td_summer2022, td_job = feature_view.create_training_data(
        start_time = "20220601",
        end_time = "20220905",    
        description = 'Electricity price prediction training dataset June/August 2022',
        data_format = "csv",
        coalesce = True,
        write_options = {'wait_for_job': True},
    )

Training dataset job started successfully, you can follow the progress at 
https://c.app.hopsworks.ai/p/258/jobs/named/electricity_feature_view_1_3_create_fv_td_06092022201330/executions
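
The effect of `start_time` and `end_time` can be pictured with a small plain-Python sketch. It is not Hopsworks code: the `date` column name and the rows are hypothetical, and treating both bounds as inclusive is an assumption of this sketch rather than a statement about the feature store's behaviour.

```python
from datetime import datetime

DATE_FORMAT = "%Y%m%d"  # matches strings such as "20210101"


def filter_by_event_time(rows, start_time, end_time):
    """Keep rows whose 'date' lies between start_time and end_time (inclusive here)."""
    start = datetime.strptime(start_time, DATE_FORMAT)
    end = datetime.strptime(end_time, DATE_FORMAT)
    return [row for row in rows if start <= row["date"] <= end]


# Hypothetical rows with an event-time column called "date".
rows = [
    {"date": datetime(2020, 12, 31), "price": 31.0},
    {"date": datetime(2021, 1, 1), "price": 28.5},
    {"date": datetime(2022, 2, 28), "price": 95.2},
    {"date": datetime(2022, 3, 1), "price": 101.7},
]

selected = filter_by_event_time(rows, "20210101", "20220228")
```

Only the two rows dated inside the range are kept, which mirrors how each training dataset above covers its own, non-overlapping period.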




---

### <span style="color:#ff5f27;"> Next Steps</span>

In the next notebook, we will train a model on the Training Dataset we created in this notebook.