# Join raw data

This notebook defines and creates the parameters file used to process the raw data. It is meant to be run from top to bottom in a single pass: once the configuration values are filled in, every cell should be executed in order.

The tasks are:
1. Select which files from the raw directory will be used for the OMOP conversion.
2. Try to open the files and select suitable reading parameters for each of them.
3. Identify the date columns in each file and convert them to a date format.
4. Save all files to the same directory in Parquet format for faster downstream processing.

Once the configuration file exists, the following snippet reads the gathered information and applies the transformations to the files.

```python 
from package.datasets import data_dir

import external.bps_to_omop.bps_to_omop.extract as ext

# -- PARAMETERS -------------------------------------------------------
# -- Path to the YAML configuration file
config_file = "./hepapred/preomop/process_raw_data_params.yaml"

# -- MAIN -------------------------------------------------------------
# Apply the changes
yaml_dict = ext.read_yaml_config(config_file)
ext.apply_modifications(data_dir, config_file, verbose=1)
```


The rest of this document describes how to populate the configuration file automatically.

It can also be written by hand. Here is an example of the expected format:

```yaml
input_dir: raw/
input_files:
  input_files:
  - path_to/file1.txt
  - path_to/file2.txt
output_dir: rare/01_parquet/
output_files:
  path_to/file1.txt: out1.parquet
  path_to/file2.txt: out2.parquet
read_options:
  path_to/file1.txt:
    encoding: utf-8
    sep: '|'
  path_to/file2.txt:
    encoding: utf-8
    sep: ';'
date_columns:
  path_to/file1.txt:
  - COD_FEC_FALLECIMIENTO
  - COD_FEC_NACIMIENTO
  path_to/file2.txt:
  - COD_FEC_INI_PATOLOGIA
date_formats:
  path_to/file1.txt:
    COD_FEC_FALLECIMIENTO:
      errors: coerce
      format: ISO8601
    COD_FEC_NACIMIENTO:
      errors: raise
      format: ISO8601
  path_to/file2.txt:
    COD_FEC_INI_PATOLOGIA:
      errors: raise
      format: ISO8601
```
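
To illustrate how these keys fit together, the following minimal sketch reads one file with its `read_options` and then converts its `date_columns` using the matching `date_formats`. It is not the implementation of `ext.apply_modifications()`: the helper `load_with_config()` is hypothetical, the `config` dict stands in for the parsed YAML, the data lives in memory, and an explicit `%Y-%m-%d` format replaces `ISO8601` so the sketch does not depend on the pandas version. Writing the result to Parquet is left out.

```python
import io

import pandas as pd

# In-memory stand-in for the parsed YAML file (same keys as the example above).
config = {
    "read_options": {"path_to/file1.txt": {"sep": "|"}},
    "date_columns": {"path_to/file1.txt": ["COD_FEC_NACIMIENTO"]},
    "date_formats": {
        "path_to/file1.txt": {
            "COD_FEC_NACIMIENTO": {"errors": "raise", "format": "%Y-%m-%d"}
        }
    },
}


def load_with_config(source, rel_path, config):
    """Hypothetical helper: read one file and parse its date columns."""
    # The per-file read options are passed straight to pandas.read_csv().
    df = pd.read_csv(source, **config["read_options"][rel_path])
    for column in config["date_columns"][rel_path]:
        # The per-column options are passed straight to pandas.to_datetime().
        options = config["date_formats"][rel_path][column]
        df[column] = pd.to_datetime(df[column], **options)
    return df


raw = io.StringIO("ID|COD_FEC_NACIMIENTO\n1|1980-05-17\n2|1975-11-02\n")
df = load_with_config(raw, "path_to/file1.txt", config)
print(df.dtypes)
```

Keeping the options keyed by file and by column is what lets each file use its own separator, encoding and date format.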


## Identification of needed files

It is often useful to record here the reasoning behind which files are used.


## Definition of parameters

Now we need to define the parameters that drive the conversion.

First, we need to decide where to save the configuration file (`yaml_path`). It is usually placed next to [process_raw_data.py](../../../hepapred/preomop/process_raw_data.py).

The parameters `input_dir` and `output_dir` are defined relative to the `data_dir` folder set in the `.env` file. This way the general location of the files can remain hidden.
- `input_dir` is a str with the folder that contains the raw data.
- `output_dir` is a str with the folder where the output will be saved.
- `input_files` is a list of all files to be processed, given as filenames or as paths relative to `input_dir`.
- `output_files` is a dict that maps every item in `input_files` to its new name. Make sure each name ends in *.parquet*.
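
To make the relationship between these four parameters concrete, here is a minimal sketch of how the full input and output paths could be derived and checked. The helper `resolve_paths()` and the `/data` root are hypothetical and only serve the illustration; they are not part of `bps_to_omop`.

```python
from pathlib import Path


def resolve_paths(data_dir, input_dir, output_dir, input_files, output_files):
    """Hypothetical helper: map each full input path to its full output path."""
    missing = [name for name in input_files if name not in output_files]
    if missing:
        raise KeyError(f"No output name defined for: {missing}")
    bad = [name for name in output_files.values() if not name.endswith(".parquet")]
    if bad:
        raise ValueError(f"Output names must end in .parquet: {bad}")
    data_dir = Path(data_dir)
    return {
        data_dir / input_dir / name: data_dir / output_dir / output_files[name]
        for name in input_files
    }


pairs = resolve_paths(
    data_dir="/data",  # assumed value; the real one comes from the .env file
    input_dir="raw/",
    output_dir="rare/01_parquet/",
    input_files=["20231112/file1.txt", "20231112/casos/file2.txt"],
    output_files={
        "20231112/file1.txt": "out1.parquet",
        "20231112/casos/file2.txt": "out2.parquet",
    },
)
for source, target in pairs.items():
    print(source, "->", target)
```

Running a check like this before writing the configuration catches a missing mapping or a wrong extension early.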

```python
import sys

# Append the location of the submodule bps_to_omop to PATH
sys.path.append("../external/bps_to_omop/")

import bps_to_omop.extract as ext
from hepapred.datasets import data_dir

# == Define parameters ===========================================
# Parameters file path
yaml_path = "../package/preomop/process_raw_data_params.yaml"

# Path to relevant files
input_dir = "raw/"
# Define saving directory
output_dir = "rare/01_parquet/"

# Input files to be used
input_files = [
    "20231112/file1.txt",
    "20231112/casos/file2.txt",
]
# Remap to new names (make sure they are parquet)
output_files = {
    "20231112/file1.txt": "out1.parquet",
    "20231112/casos/file2.txt": "out2.parquet",
}
```


The following piece of code writes the parameters to the configuration file and builds the full path to the folder with the raw files.

```python
# == Preconfiguration ==========================================
# write params to configuration file
ext.update_yaml_config(yaml_path, "input_dir", input_dir)
ext.update_yaml_config(yaml_path, "input_files", {"input_files": input_files})
ext.update_yaml_config(yaml_path, "output_dir", output_dir)
ext.update_yaml_config(yaml_path, "output_files", output_files)

# Create the full path to the folder with files
raw_data_dir = data_dir / input_dir
```

## Testing of reading parameters

This section deals with reading the files.
- `default_params` is a dictionary with the `pandas.read_csv()` parameters that are known to work for every file.
  - If every file uses the same separator, you can put it here.
- `candidate_params` is a dictionary of lists. Each key is a `pandas.read_csv()` parameter and its value is the list of values to try.
  - This is useful if some files have rows to skip and others do not.
- `funcs_to_check` is a list of functions that, applied to the resulting dataframe, raise an error if something looks wrong.
  - This is here for possible future convenience. So far it has been better to just check the output.
  - Currently checks that:
    - The file is actually read.
    - The resulting dataframe has more than 1 column.

The function `get_reading_params()` will try every possible combination of parameters and keep the first one that works for each file. The result is then saved to the configuration file.
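
The search described above can be sketched as follows. This is an illustration of the idea, not the code of `get_reading_params()`: the helper `first_working_read_params()` and its `make_source` argument are hypothetical, the data is an in-memory byte string, and only the "more than 1 column" check is reproduced.

```python
import io
import itertools

import pandas as pd


def first_working_read_params(make_source, default_params, candidate_params, nrows=100):
    """Hypothetical helper: return the first parameter combination that reads the data."""
    names = list(candidate_params)
    # Cartesian product of all candidate values, in the order they were listed.
    for values in itertools.product(*(candidate_params[name] for name in names)):
        params = {**default_params, **dict(zip(names, values))}
        try:
            df = pd.read_csv(make_source(), nrows=nrows, **params)
        except (ValueError, LookupError):
            # Decoding and parsing errors are ValueError subclasses;
            # an unknown encoding name raises LookupError.
            continue
        # A single column usually means the separator was wrong.
        if df.shape[1] > 1:
            return params
    raise ValueError("No combination of parameters could read the data")


# Sample data encoded as latin9, so decoding it as utf-8 fails.
data = "ID|NOMBRE\n1|José\n2|Ana\n".encode("latin9")
found = first_working_read_params(
    make_source=lambda: io.BytesIO(data),
    default_params={"sep": "|"},
    candidate_params={"encoding": ["utf-8", "latin9"]},
)
print(found)
```

Because the first working combination wins, the order of the candidate values matters. In this sketch `utf-8` is listed first because a single-byte encoding such as `latin9` accepts almost any input and would hide the stricter candidate.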

```python
# == Format file config ========================================
# Default parameters (works for every file)
default_params = {"sep": "|"}

# Possible candidates to try
candidate_params = {
    "encoding": ["latin9", "utf-8"],
}
# Extra functions to check if read data makes sense
funcs_to_check = []

# == Main ======================================================
print("Extraction config:")
# Retrieve appropriate read_options
read_options = ext.get_reading_params(
    raw_data_dir,
    input_files,
    default_params,
    candidate_params,
    funcs_to_check,
    verbose=1,
)
ext.update_yaml_config(yaml_path, "read_options", read_options)
```


## Testing of date columns

This section deals with the columns of the files containing dates.

First, the function `find_matching_keys_on_files()` looks for the columns in each file whose names match a set of search strings. The default strings are *"fecha", "fec", "inicio", "fin" and "f_"* (case insensitive), but more can be added with the `search_words` parameter. See the docstring of `find_matching_keys_on_files()` for more information.
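
As a rough illustration of this kind of matching, the sketch below selects column names with a case-insensitive substring test. The helper `find_date_like_columns()` is hypothetical, and substring matching is an assumption; the actual rule used by `find_matching_keys_on_files()` is described in its docstring.

```python
DEFAULT_SEARCH_WORDS = ("fecha", "fec", "inicio", "fin", "f_")


def find_date_like_columns(columns, search_words=DEFAULT_SEARCH_WORDS):
    """Hypothetical helper: keep column names that contain any search word."""
    lowered_words = [word.lower() for word in search_words]
    return [
        column
        for column in columns
        if any(word in column.lower() for word in lowered_words)
    ]


columns = ["COD_FEC_NACIMIENTO", "ID_PACIENTE", "F_ALTA", "DES_SEXO", "DEFINICION"]
print(find_date_like_columns(columns))
```

Short search words can produce false positives (here `DEFINICION` is selected because it contains "fin"), so the detected columns are worth reviewing before they are used.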

Second, the function `get_date_parser_options()` will try to parse the date columns. To do so, it reads a limited number of rows from each file and tries to convert the date columns. It first tries a strict conversion to datetime. If no combination works, it falls back to coercing the conversion and reports (if verbose >= 1) the number of values that were turned into NaN/null in the process.

- `candidate_params` is a dictionary of lists. Each key is a parameter of the `pandas.to_datetime()` function and its value is the list of values to try.
  - This is useful if different files have different date formats.
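
The strict-then-coerce strategy can be sketched for a single column as shown below. This is not the code of `get_date_parser_options()`: the helper `pick_date_format()` is hypothetical, explicit `strptime` formats are used instead of `ISO8601`, and choosing the coerced format that loses the fewest values is an assumption of this sketch.

```python
import pandas as pd


def pick_date_format(values, formats):
    """Hypothetical helper: return (options, number of values lost) for one column."""
    # First pass: accept the first format that parses every value strictly.
    for fmt in formats:
        try:
            pd.to_datetime(values, format=fmt, errors="raise")
            return {"format": fmt, "errors": "raise"}, 0
        except (ValueError, TypeError):
            continue
    # Second pass: coerce, and keep the format that turns the fewest values into NaT.
    best = None
    for fmt in formats:
        parsed = pd.to_datetime(values, format=fmt, errors="coerce")
        lost = int(parsed.isna().sum() - values.isna().sum())
        if best is None or lost < best[1]:
            best = ({"format": fmt, "errors": "coerce"}, lost)
    return best


formats = ["%d/%m/%Y", "%Y-%m-%d"]
clean = pd.Series(["2020-01-31", "2021-12-01"])
dirty = pd.Series(["2020-01-31", "not a date", "2021-12-01"])

print(pick_date_format(clean, formats))
print(pick_date_format(dirty, formats))
```

The returned options have the same shape as the entries under `date_formats` in the configuration file, and the count of lost values is what a verbose run would report.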


```python
# == Date extraction config ====================================
# Define possible parameters to extract the date
candidate_params = {
    "format": [
        "ISO8601",  # This is a fast read, eq to "%Y%m%d","%Y/%m/%d","%Y-%m-%d",
        "%d/%m/%y",
    ]
}

# == Main =====================================================
print("Date format config:")
# Get the date column names of the files
date_columns = ext.find_matching_keys_on_files(
    raw_data_dir, input_files, read_options, verbose=1
)
# Write the date columns to the configuration file
ext.update_yaml_config(yaml_path, "date_columns", date_columns)
print('\n')

# Test possible date parser options
(date_formats, date_formats_coercions) = ext.get_date_parser_options(
    raw_data_dir,
    input_files,
    date_columns,
    candidate_params,
    read_options,
    verbose=1,
    nrows=5000,
)

# Write the date formats to the configuration file
ext.update_yaml_config(yaml_path, "date_formats", date_formats)
```