#### TransMonEE Indicators - API (Helix and UIS) sources populated in Data Dictionary - LEGACY DATA ETL
This notebook loops over these indicators to extract and transform them.

**Numbers:**
* Helix sources (28 indicators - 1 missing in dataflow)
* UIS sources (129 indicators)
* Legacy Excel file (322 indicators)

#### Imports

In [4]:
from utils import get_API_code_address_etc
from fileUtils import fileDownload
from sdmx import sdmx_struc
from extraction import legacy
from extraction.wrap_api_address import wrap_api_address
from transformation.destination import Destination
from transformation.dataflow import Dataflow
from data_in.legacy_data.prepare_mapping import match_country_name
import os
import re
import pandas as pd
import numpy as np

#### TransMonEE countries list - Country ISO codes
##### Countries list is taken from dataflow TransMonEE in UNICEF Warehouse (requested by Eduard)

In [None]:
# UNICEF’s REST API data endpoint for TransMonEE Dataflow
url_endpoint = 'https://sdmx.data.unicef.org/ws/public/sdmxapi/rest/data/ECARO,TRANSMONEE,1.0/'

In [None]:
# address and parameters for dataflow structure request
api_address = url_endpoint + 'all'
api_params = {'format':'sdmx-json', 'detail':'structureOnly'}
# API dataflow structure request
d_flow_struc = fileDownload.api_request(api_address,api_params)

##### Country ISO codes (2 and 3 letters)

In [None]:
# TransMonEE three-letter country codes are taken from its dataflow
country_codes_3 = sdmx_struc.get_all_country_codes(d_flow_struc.json())

In [None]:
# country codes equivalence from excel file in repo root
country_codes_file = "./data_in/all_countrynames_list.xlsx"
country_codes_df = pd.read_excel(country_codes_file)

In [None]:
# map TMEE country_codes (three-letters/two-letters equivalence)
country_codes_2 = [country_codes_df.CountryIso2[country_codes_df.CountryIso3 == elem].values
                   for elem in country_codes_3.values()]
# country names are repeated in the list, and I want the uniques only
# numpy unique sorts the array, I take an extra step to retrieve the original order
uni_sort, sort_ind = np.unique(np.concatenate(country_codes_2), return_index=True)
country_codes_2 = uni_sort[np.argsort(sort_ind)]
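
The order-preserving de-duplication used above can be checked on a small array. This is a minimal sketch with made-up codes (the `codes` values are hypothetical, not taken from the dataflow); it also shows a pure-Python alternative that relies on dictionaries keeping insertion order.

```python
import numpy as np

# Hypothetical two-letter codes with repeats, in their original order
codes = np.array(["RS", "AL", "RS", "BY", "AL"])

# np.unique sorts the values; return_index gives each value's first position
uniques, first_idx = np.unique(codes, return_index=True)
# sorting by first position restores the original order of appearance
ordered = uniques[np.argsort(first_idx)]

# Pure-Python alternative: dict keys keep insertion order (Python 3.7+)
ordered_alt = list(dict.fromkeys(codes.tolist()))
```

Both variants keep the first occurrence of each code, so either could be used where the order of the dataflow matters.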

In [None]:
# country codes mapping dictionary (two-letter/three-letter)
# keys are cast to plain str: newer NumPy versions repr string scalars as np.str_('..')
country_map = {str(k): v for k, v in zip(country_codes_2, country_codes_3.values())}
# write dictionary to a py file to use it during transformations
path_file = "./transformation/country_map.py"
with open(path_file, 'w') as f:
    f.write('country_map = ' + repr(country_map) + '\n')

##### Country names as defined in Legacy Data
Required to identify rows with data by the Excel Legacy parser.

In [None]:
# list of countries as reported in legacy data
legacy_country_list = ["albania", "armenia", "azerbaijan", "belarus", "bosnia and herzegovina", "bulgaria", "croatia",
                 "czech republic", "estonia", "georgia", "hungary", "kazakhstan", "kyrgyzstan", "latvia", "lithuania",
                 "moldova", "montenegro", "poland", "romania", "russian federation", "serbia", "slovakia", "slovenia",
                 "tajikistan", "the former yugoslav republic of macedonia", "turkmenistan", "ukraine", "uzbekistan"]

In [None]:
# match country names (legacy data) with country names used in TMEE
# build dictionary with country names (legacy data) and country codes in TMEE
legacy_country_codes_3 = {}
for name in legacy_country_list:
    match = match_country_name(name, list(country_codes_3.keys()))
    legacy_country_codes_3[name] = country_codes_3[match]
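
The internals of `match_country_name` are not shown in this notebook. As an illustration of the kind of name matching involved, here is a minimal sketch based on the standard library's `difflib`; the helper `closest_name` and its `cutoff` value are hypothetical and are not the repository's implementation.

```python
from difflib import get_close_matches

def closest_name(name, candidates, cutoff=0.6):
    """Return the candidate most similar to `name`, or None if nothing is close."""
    # compare in lowercase, but return the candidate in its original spelling
    lowered = {c.lower(): c for c in candidates}
    hits = get_close_matches(name.lower(), list(lowered), n=1, cutoff=cutoff)
    return lowered[hits[0]] if hits else None

# Hypothetical candidate names
match = closest_name("kyrgyzstan", ["Kyrgyzstan", "Kazakhstan"])
```

Returning `None` for weak matches makes unmatched legacy names visible instead of silently mapping them to the wrong country code.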

In [None]:
# write legacy_country_codes_3 dictionary to a py file to use during transformations
path_file = "./transformation/country_names_map.py"
with open(path_file, 'w') as f:
    f.write('country_names_map = ' + repr(legacy_country_codes_3) + '\n')

#### Legacy data Extraction
##### Source file

In [None]:
# path to legacy excel file
source_path_nsi = './data_in/legacy_data/'
source_file = 'TM-2019-EN-June.xlsx'
full_path = source_path_nsi + source_file

##### Raw data destination

In [None]:
# raw data destination path
raw_path = './data_out/data_raw/'

##### Parse all legacy indicators from Excel `source_file`
The workbook has one contents sheet and six data sheets.

The loop calls the `parse_legacy` function once per data sheet.

**Dev improvement**: `parse_legacy` could get the number of sheets directly from the Excel file and loop internally.
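
The dev improvement could start from something like the sketch below. It assumes the contents sheet comes first in the workbook; the helper `data_sheet_indices` and the example sheet names are hypothetical.

```python
def data_sheet_indices(sheet_names, n_contents=1):
    """Return positional indices of the data sheets, skipping leading contents sheets."""
    return list(range(n_contents, len(sheet_names)))

# With pandas, the names could be read with: pd.ExcelFile(full_path).sheet_names
example_names = ["Contents", "S1", "S2", "S3", "S4", "S5", "S6"]  # hypothetical names
indices = data_sheet_indices(example_names)
```

The resulting indices could replace the hard-coded `range(1, n_sheets+1)`, so the loop would not need updating if a sheet were added.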

In [None]:
n_sheets = 6
# Initialize legacy dataframe as None type
legacy_df = None
# legacy data filename to write
legacy_file_write = 'legacy_data'

# Skip extraction if legacy data is already parsed and written
flag_parsed = os.path.exists(f"{raw_path}{legacy_file_write}.csv")

if flag_parsed:
    print("Legacy data already parsed and written")
else:
    for i in range(1, n_sheets+1):
        print(f"Parsing Spreadsheet: {i}")
        df = legacy.parse_legacy(full_path,i,legacy_country_list)
        legacy_df = pd.concat([legacy_df,df])
        
    # write legacy raw data (all indicators) to csv file
    legacy_df.to_csv(f"{raw_path}{legacy_file_write}.csv",index=False)

**Warning messages**: Legacy education indicators report school-year seasons instead of calendar years, e.g. 2005/06.

SDMX accepts only a year as the time dimension. *Daniele* suggested adding an attribute to the Data Structure Definition to denote this.

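
One way to handle season labels is to split them into a year plus a note that could feed such an attribute. This is only a sketch: taking the starting year as the time period, the note wording, and the helper name `split_season` are all assumptions, not decisions recorded in this notebook.

```python
import re

# matches labels such as "2005/06"
SEASON_RE = re.compile(r"^(\d{4})/(\d{2})$")

def split_season(period):
    """Return (year, note) for a period label; note is None for plain years."""
    label = period.strip()
    match = SEASON_RE.match(label)
    if match:
        # Assumption: the starting year stands in for the whole season
        return match.group(1), f"Refers to school year {label}"
    return label, None

year, note = split_season("2005/06")
```
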
##### Transformation of legacy data into an SDMX structure
It is performed on the legacy raw data written above and is described in this [**Section**](#Transformation-of-Legacy-Indicators-into-an-SDMX-structure).




#### TransMonEE UIS API Key

In [None]:
uis_key = "9d48382df9ad408ca538352a4186791b"

#### Read and Query Data Dictionary

In [None]:
# path to excel data dictionary in repo
data_dict_file = './data_in/data_dictionary/indicator_dictionary_TM_v3.xlsx'

In [None]:
# get indicators that are extracted by API (code, address and more in pandas dataframe)
api_code_addr_df = get_API_code_address_etc(data_dict_file)

#### Extract and Transform Indicators from dataframe `api_code_addr_df`

##### API Extraction: parameters

In [None]:
# parameters: API request dataflow from Helix
helix_api_params = {'startPeriod':'1950', 'endPeriod':'2050', 'locale':'en'}
# parameters: API request dataflow from UIS
uis_api_params = {**helix_api_params, 'subscription-key':uis_key}

##### API Extraction: headers

In [None]:
# API headers (desired format and compress response)
api_headers = {'Accept':'application/vnd.sdmx.data+csv;version=1.0.0', 'Accept-Encoding':'gzip'}
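
To see how the parameters and headers end up in a request, the sketch below builds (but does not send) one with the standard library. The base address is a placeholder, and `fileDownload.api_request` may assemble its requests differently.

```python
from urllib.parse import urlencode
from urllib.request import Request

params = {'startPeriod': '1950', 'endPeriod': '2050', 'locale': 'en'}
headers = {'Accept': 'application/vnd.sdmx.data+csv;version=1.0.0',
           'Accept-Encoding': 'gzip'}

# placeholder address, not a real endpoint
base = 'https://example.org/rest/data/AGENCY,FLOW,1.0/all'

# parameters become the query string; headers travel separately
request = Request(f"{base}?{urlencode(params)}", headers=headers)
query = request.full_url.split('?', 1)[1]
```

The `Accept` header asks the server for SDMX-CSV, which is why the raw files are later read with `pd.read_csv`.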

##### Transformation: map raw data into dataflow TransMonEE in UNICEF Warehouse

In [None]:
# transformed data destination path
trans_path = './data_out/data_transformed/'
# name of dataflow TransMonEE in UNICEF warehouse
dataflow_out = "ECARO:TRANSMONEE(1.0)"

In [None]:
# TMEE DSD (data structure definition)
dest_dsd = Destination('TMEE')

##### Loop on dataframe `api_code_addr_df`

In [None]:
# actual loop (EXTRACT AND TRANSFORM)
for index, row in api_code_addr_df.iterrows():
    
    # sanity check on strings: strip leading and trailing spaces
    url_endpoint = row['Address'].strip()
    indicator_code = row['Code'].strip()
    indicator_source = row['Data_Source'].strip()
    # get source_key from indicator_source
    pattern = "(.*?):"
    source_key = re.findall(pattern, indicator_source)[0].strip()
    indicator_notes = row['Obs_Footnote']
    
    print(f"Dealing with indicator: {indicator_code}")
        
    # wrap API addresses
    api_address = wrap_api_address(source_key, url_endpoint, indicator_code, country_codes_3, country_codes_df)
    
    # wrap API parameters
    if source_key.lower() == 'helix':
        api_params = helix_api_params
    else:
        api_params = uis_api_params
        
    # Skip extraction if indicator already downloaded
    flag_download = os.path.exists(f"{raw_path}{indicator_code}.csv")
    # This skip would need extra info to be executed for update purposes!
    # File names could include the year of execution?
    if flag_download:
        print(f"Indicator {indicator_code} skipped (already downloaded)")
    else:
        # request indicator raw data
        indicator_raw = fileDownload.api_request(api_address,api_params,api_headers)
        # if requests satisfactory
        if indicator_raw.status_code == 200:
            # write raw data to raw file
            raw_file = f"{raw_path}{indicator_code}.csv"
            with open(raw_file, 'wb') as f:
                f.write(indicator_raw.content)
            print(f"Indicator {indicator_code} successfully downloaded")
            flag_download = True
    
    # Transform raw_data if it hasn't occurred before
    flag_transform = os.path.exists(f"{trans_path}{indicator_code}.csv")
    
    if flag_transform:
        print(f"Transformation for {indicator_code} skipped (already done)")
    elif flag_download:        
        # build dataframe with indicator raw data
        data_raw = pd.read_csv(f"{raw_path}{indicator_code}.csv", dtype=str)

        # retain only codes from csv headers
        raw_columns = data_raw.columns.values
        rename_dict = {k:v.split(':')[0] for k,v in zip(raw_columns,raw_columns)}
        data_raw.rename(columns=rename_dict,inplace=True)

        # get dataflow from data raw anchor [0,0]
        text = data_raw.iloc[0,0]
        pattern = r':(.+?)\('
        dataflow_key = re.findall(pattern, text)[0]

        print(f"Transform indicator: {indicator_code}, from dataflow: {dataflow_key}")

        # instantiate dataflow class with the actual one
        dflow_actual = Dataflow(dataflow_key)
        if dflow_actual.cod_map:
            # map the codes - normalization - works 'inplace'
            dflow_actual.map_codes(data_raw)

        # "metadata" from data dictionary: dataflow constants
        # any of these below won't be used if they are dataflow columns
        # Development NOTE: data dictionary info may be overwritten later
        constants = {
            'UNICEF_INDICATOR': indicator_code,
            'DATA_SOURCE': indicator_source,
            'OBS_FOOTNOTE': indicator_notes
        }

        # map the columns
        data_map = dflow_actual.map_dataframe(data_raw, constants)

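        # save transformed indicator info independently (through pandas)
        data_trans = pd.DataFrame(columns=dest_dsd.get_csv_columns(), dtype=str)
        # DataFrame.append was removed in pandas 2.0; pd.concat replaces it
        # (assumes map_dataframe returns a DataFrame)
        data_trans = pd.concat([data_trans, data_map])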
        # destination Dataflow: corresponding UNICEF Warehouse DSD name
        data_trans['Dataflow'] = dataflow_out
        # save file
        data_trans.to_csv(f"{trans_path}{indicator_code}.csv",index=False)
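
The three string-parsing steps inside the loop (source key, column codes, dataflow key) can be tried in isolation. The sample values below are hypothetical and only mimic the shapes the loop expects: a data source prefixed with its key, SDMX-CSV headers of the form `CODE:Label`, and a dataflow reference in the first cell.

```python
import re

# Hypothetical sample values
indicator_source = "Helix: Multiple Indicator Cluster Survey"
raw_columns = ["DATAFLOW", "REF_AREA:Geographic area", "TIME_PERIOD:Time period"]
anchor = "UNICEF:GLOBAL_DATAFLOW(1.0)"

# text before the first colon identifies the source
source_key = re.findall(r"(.*?):", indicator_source)[0].strip()
# keep only the code part of each header
codes_only = [col.split(":")[0] for col in raw_columns]
# text between the colon and the opening parenthesis is the dataflow id
dataflow_key = re.findall(r":(.+?)\(", anchor)[0]
```

Note that `re.findall(...)[0]` raises an `IndexError` when the pattern does not match, so a data source without a colon would stop the loop.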


#### Transformation of Legacy Indicators into an SDMX structure
For this purpose we need some indicator *metadata* that enables the mappings.

**Dev note**: the data dictionary is not leveraged for legacy data so far. *Metadata* is prepared in a separate CSV file `content_legacy_codes_v2`, located in the `legacy_data` folder.

In [None]:
# Transform raw_data if it hasn't occurred before
flag_transform = os.path.exists(f"{trans_path}{legacy_file_write}.csv")

if flag_transform:
    print("Transformation for legacy data skipped (already done)")
else:
    # dataflow to process is legacy data
    dataflow_key = "LEGACY"
    # instantiate dataflow class with the actual key (LEGACY)
    dflow_actual = Dataflow(dataflow_key)
    
    # build dataframe with legacy raw data
    data_raw = pd.read_csv(f"{raw_path}{legacy_file_write}.csv", dtype=str)
    
    # map the codes - normalization - from legacy dataframe
    dflow_actual.map_codes(data_raw)

    # initialize constants empty (no data from dictionary for legacy)
    constants = {}
    # map the columns
    data_map = dflow_actual.map_dataframe(data_raw, constants)
    
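    # save transformed indicator info independently (through pandas)
    data_trans = pd.DataFrame(columns=dest_dsd.get_csv_columns(), dtype=str)
    # DataFrame.append was removed in pandas 2.0; pd.concat replaces it
    # (assumes map_dataframe returns a DataFrame)
    data_trans = pd.concat([data_trans, data_map])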
    # destination Dataflow: TMEE DSD in UNICEF Warehouse
    data_trans['Dataflow'] = dataflow_out
    # save file
    data_trans.to_csv(f"{trans_path}{legacy_file_write}.csv",index=False)


#### Data to Upload - Build only one CSV with all data transformed
This could also be done with the Linux command `sed` for faster performance.

In [None]:
# output file name (excluded below so a re-run does not concatenate its own output)
etl_out_file = 'TMEE_ETL_out'
# all csv files with data transformed
files_trans = [file for file in os.listdir(trans_path)
               if file.endswith(".csv") and file != f"{etl_out_file}.csv"]
# pandas concat
dest_dsd_df = pd.concat([pd.read_csv(f"{trans_path}{f}", dtype=str) for f in files_trans])

# save file
dest_dsd_df.to_csv(f"{trans_path}{etl_out_file}.csv",index=False)
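
As a lighter alternative to loading everything into pandas, the files could be streamed row by row with the header written once. This is a sketch that assumes all transformed files share the same columns in the same order; the helper name `concat_csv` is hypothetical.

```python
import csv

def concat_csv(in_paths, out_path):
    """Append CSV files row by row, writing the shared header only once."""
    header = None
    with open(out_path, "w", newline="") as out_file:
        writer = csv.writer(out_file)
        for path in in_paths:
            with open(path, newline="") as in_file:
                reader = csv.reader(in_file)
                file_header = next(reader, None)
                if file_header is None:
                    continue  # skip empty files
                if header is None:
                    header = file_header
                    writer.writerow(header)
                elif file_header != header:
                    # assumption violated: columns differ between files
                    raise ValueError(f"Unexpected columns in {path}")
                writer.writerows(reader)
```

The output path should not be one of the input paths, for example by writing the combined file to a different folder than `trans_path`.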
