# Introduction

This notebook explores the metadata of the datasets we intend to use. The metadata serves as the basis for the data query and supports the data analysis.

# Involved datasets

We intend to analyse some facts about energy in the EU, using the following datasets:


| Dataset_id | Name | URL |
|--|--|--|
| demo_pjan | Population on 1 January by age and sex | https://ec.europa.eu/eurostat/web/products-datasets/-/DEMO_PJAN |
| nrg_bal_s | Simplified Energy Balances | https://ec.europa.eu/eurostat/databrowser/view/nrg_bal_s |

In [1]:
from pysdmx.io import get_datasets
import utils

Before using a dataset, we need to understand it.
We do this by inspecting its data structure (i.e., the columns and the possible values for each of them).

The SDMX standard API allows querying the exact codes that a dataflow uses. Unfortunately, that method is not always available, and Eurostat does not offer it. Therefore, to find out which code combinations this dataset actually uses, we need to retrieve actual data. Our proposal is to request only the last observation of each series, which should provide the maximum number of combinations with the minimum number of periods.
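To make the idea concrete, here is a minimal sketch of the approach, not the actual implementation in `utils`. It builds the same kind of request URL that appears in the output further down (using `lastNObservations=1`) and collects the distinct values per column from an SDMX-CSV payload. The function names `last_observation_url` and `summarise_codes` are hypothetical, and the sample payload is made up for illustration (simplified columns, dummy values) so that the example runs without network access.

```python
import csv
import io
from collections import defaultdict

# Base of the Eurostat SDMX 2.1 data endpoint, as printed in this notebook's output.
BASE_URL = "https://ec.europa.eu/eurostat/api/dissemination/sdmx/2.1/data"


def last_observation_url(dataset_id: str) -> str:
    """Build the SDMX-CSV URL requesting only the last observation of each series."""
    return f"{BASE_URL}/{dataset_id}/all?format=SDMX-CSV&lastNObservations=1"


def summarise_codes(csv_text: str, skip=("TIME_PERIOD", "OBS_VALUE")) -> dict:
    """Collect the distinct values found in each column of an SDMX-CSV payload."""
    seen = defaultdict(set)
    for row in csv.DictReader(io.StringIO(csv_text)):
        for column, value in row.items():
            if column not in skip:
                seen[column].add(value)
    return {column: sorted(values) for column, values in seen.items()}


# Hypothetical, simplified payload: the values are dummies, not real Eurostat data.
sample = """DATAFLOW,freq,unit,age,sex,geo,TIME_PERIOD,OBS_VALUE
ESTAT:DEMO_PJAN(1.0),A,NR,TOTAL,T,AT,2023,1
ESTAT:DEMO_PJAN(1.0),A,NR,TOTAL,F,AT,2023,2
ESTAT:DEMO_PJAN(1.0),A,NR,Y1,T,BE,2023,3
"""

url = last_observation_url("demo_pjan")
summary = summarise_codes(sample)
```

In a real run the payload would be downloaded from `url` instead of being defined inline. The resulting summary only lists codes that occur in the data, which is the point of the approach.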

We have created a helper in `utils` that makes this as simple as possible. We only need to pass the ID of the dataset we want to research, and we get a summary of its contents.

In [2]:
utils.get_summary_dataset('demo_pjan')

https://ec.europa.eu/eurostat/api/dissemination/sdmx/2.1/data/demo_pjan/all?format=SDMX-CSV&lastNObservations=1
https://ec.europa.eu/eurostat/api/dissemination/sdmx/2.1/dataflow/all/demo_pjan/latest?detail=full&references=descendants


{'structure_type': 'Schema',
 'structure_id': 'DEMO_PJAN',
 'data_effective_structure_summary': {'age': {'code': 'age',
   'name': 'Age class',
   'enumeration': {'TOTAL': 'Total',
    'UNK': 'Unknown',
    'Y1': '1 year',
    'Y10': '10 years',
    'Y11': '11 years',
    'Y12': '12 years',
    'Y13': '13 years',
    'Y14': '14 years',
    'Y15': '15 years',
    'Y16': '16 years',
    'Y17': '17 years',
    'Y18': '18 years',
    'Y19': '19 years',
    'Y2': '2 years',
    'Y20': '20 years',
    'Y21': '21 years',
    'Y22': '22 years',
    'Y23': '23 years',
    'Y24': '24 years',
    'Y25': '25 years',
    'Y26': '26 years',
    'Y27': '27 years',
    'Y28': '28 years',
    'Y29': '29 years',
    'Y3': '3 years',
    'Y30': '30 years',
    'Y31': '31 years',
    'Y32': '32 years',
    'Y33': '33 years',
    'Y34': '34 years',
    'Y35': '35 years',
    'Y36': '36 years',
    'Y37': '37 years',
    'Y38': '38 years',
    'Y39': '39 years',
    'Y4': '4 years',
    'Y40': '40 years',
 

With this summary we can pick the codes we are interested in and collect them in a dictionary that we will use later to query the right data.

We leave the countries open, since this is something we want to modify later during the data analysis:

In [4]:
# Fix every dimension except the country, which stays open for the analysis.
demo_pjan_constraints = {
    'age': 'TOTAL',
    'sex': 'T',
    'unit': 'NR',
}
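How the constraints dictionary is consumed later is not shown in this notebook. As one possible reading, the sketch below applies it as a client-side filter over rows that have already been downloaded. The helper `matches` and the sample rows are hypothetical. Any dimension missing from the dictionary, such as the country, stays unconstrained.

```python
def matches(row: dict, constraints: dict) -> bool:
    """Return True if the row satisfies every constraint.

    A constraint value may be a single code or a collection of allowed codes.
    """
    for dimension, allowed in constraints.items():
        allowed_codes = [allowed] if isinstance(allowed, str) else list(allowed)
        if row.get(dimension) not in allowed_codes:
            return False
    return True


constraints = {'age': 'TOTAL', 'sex': 'T', 'unit': 'NR'}

# Hypothetical rows, for illustration only.
rows = [
    {'age': 'TOTAL', 'sex': 'T', 'unit': 'NR', 'geo': 'AT'},
    {'age': 'Y10', 'sex': 'T', 'unit': 'NR', 'geo': 'AT'},
    {'age': 'TOTAL', 'sex': 'F', 'unit': 'NR', 'geo': 'BE'},
    {'age': 'TOTAL', 'sex': 'T', 'unit': 'NR', 'geo': 'BE'},
]

# Keep only the rows that match all constraints; 'geo' is not constrained.
selected = [row for row in rows if matches(row, constraints)]
```

Accepting either a single code or a list per dimension lets the same dictionary shape later carry several countries or several energy balance codes.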

In [None]:
nrg_bal_s = get_eurostat_dataset('nrg_bal_s', start_period='2023')

In [None]:
utils.get_summary_dataset(nrg_bal_s)

In [None]:
{
    'nrg_bal_total_consumption': ['FC_E'],
    'nrg_bal_consumption_basic': ['FC_IND_E', 'FC_TRA_E', 'FC_OTH_E'],
    'nrg_bal_consumption_industry_full': [
        'FC_IND_CON_E', 'FC_IND_CPC_E', 'FC_IND_FBT_E', 'FC_IND_IS_E', 'FC_IND_MAC_E', 'FC_IND_MQ_E',
        'FC_IND_NFM_E', 'FC_IND_NMM_E', 'FC_IND_NSP_E', 'FC_IND_PPP_E', 'FC_IND_TE_E', 'FC_IND_TL_E', 'FC_IND_WP_E',
    ],
    'nrg_bal_consumption_other_full': ['FC_OTH_AF_E', 'FC_OTH_CP_E', 'FC_OTH_FISH_E', 'FC_OTH_HH_E', 'FC_OTH_NSP_E'],
    'nrg_bal_consumption_transport_full': [
        'FC_TRA_DAVI_E', 'FC_TRA_DNAVI_E', 'FC_TRA_NSP_E', 'FC_TRA_PIPE_E', 'FC_TRA_RAIL_E', 'FC_TRA_ROAD_E',
    ],
    'nrg_bal_siec_total': ['TOTAL'],
    'nrg_bal_siec_basic': ['P1000', 'S2000', 'G3000', 'E7000', 'H8000'],
}