# Omopization

Once the files have been preprocessed, we can prepare them for omopization. A few general modifications adapt the data to the OMOP format. The files are still the original files (as produced by the process_rare_files stage), but with renamed or additional columns. The actions performed are:

- Rename the column that identifies the patient to 'person_id', as in OMOP.
- Rename the date columns to the generic names 'start_date' and 'end_date'.
  - Most OMOP tables have a start date and an end date column, prefixed by a string that identifies the table. We create those columns for each file. If only one date is present, it is duplicated as both 'start_date' and 'end_date'.
  - This ensures that later processes do not have to deal with the specific column names of each file.
- Assign a type_concept according to the origin of the information.
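
The three actions above can be sketched on plain Python rows. This is only an illustration under simplifying assumptions: `omopize_rows` is a hypothetical helper, not part of `bps_to_omop`, it handles a single date column, and it copies the patient identifier as-is instead of transforming it as the real script does.

```python
from datetime import date


def omopize_rows(rows, person_col, date_col, type_concept):
    """Apply the three general modifications to a list of row dicts.

    Hypothetical helper for illustration only.
    """
    out = []
    for row in rows:
        new_row = {
            # 1. The patient identifier becomes 'person_id'
            "person_id": row[person_col],
            # 2. A single date is duplicated as start and end date
            "start_date": row[date_col],
            "end_date": row[date_col],
            # 3. Every row of the file gets the same type_concept
            "type_concept": type_concept,
        }
        # Keep the remaining columns untouched
        for key, value in row.items():
            if key not in (person_col, date_col):
                new_row[key] = value
        out.append(new_row)
    return out


rows = [{"NUHSA_ENCRIPTADO": "A1", "FECHA_INICIO": date(2020, 1, 5), "VALOR": 3}]
result = omopize_rows(rows, "NUHSA_ENCRIPTADO", "FECHA_INICIO", 32817)
print(list(result[0]))
```

After the call, every file shares the same leading columns regardless of its original column names, which is the point of this stage.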


## Procedure

The typical script that will launch the omopization is:
 
```python
# %%
import os
import sys

import pyarrow.parquet as parquet

from package.datasets import data_dir

sys.path.append("external/bps_to_omop/")
import bps_to_omop.extract as ext
import bps_to_omop.general as gen
import bps_to_omop.person as per

# %%
# -- Define parameters ------------------------------------------------
params_file = "./package/preomop/omopization_params.yaml"

# -- Load parameters --------------------------------------------------
print("Reading parameters...")

# -- Load yaml file and related info
params_data = ext.read_yaml_params(params_file)
input_dir = data_dir / params_data["input_dir"]
output_dir = data_dir / params_data["output_dir"]
input_files = params_data["input_files"]
person_columns = params_data["person_columns"]
date_columns = params_data["date_columns"]
type_concept_mapping = params_data["type_concept_mapping"]
os.makedirs(output_dir, exist_ok=True)

# %%
# -- transform the tables ---------------------------------------------
print("Transforming tables...")
for f in input_files:
    print(f"- {f}")
    # Read data
    table = parquet.read_table(input_dir / f)
    cols_to_remove = []
    # Drop the __index_level_0__ column (left over from a pandas index), if it exists
    cols_to_remove += ["__index_level_0__"]

    # -- person_id --------------------------------------------------------------------------------
    # Get the person_id
    person_id, person_source_value = per.transform_person_id(table, person_columns[f])
    # Mark the original person column for removal
    cols_to_remove += [person_columns[f]]

    # -- start_date and end_date ------------------------------------------------------------------
    # Ensure they are ordered, i.e. end_date is after start_date
    try:
        start_date, end_date = ext.find_start_end_dates(
            table, date_columns[f], verbose=0
        )
    except (ValueError, TypeError) as inst:
        print(f"Error found! {inst}")
        raise inst
    # Mark the original date columns for removal
    cols_to_remove += date_columns[f]

    # -- type_concept -----------------------------------------------------------------------------
    # Create a column filled with the type_concept code of this file
    type_concept_code = type_concept_mapping[f]
    type_concept = gen.create_uniform_int_array(len(table), type_concept_code)

    # -- Final steps ------------------------------------------------------------------------------
    # Append to old table
    print(f"{f} input and output columns:")
    print(" >", table.column_names)
    table = table.add_column(0, "person_id", person_id)
    table = table.add_column(1, "person_source_value", person_source_value)
    table = table.add_column(1, "start_date", start_date)
    table = table.add_column(2, "end_date", end_date)
    table = table.add_column(3, "type_concept", type_concept)
    # Remove unnecessary columns
    cols_to_keep = [col for col in table.column_names if col not in cols_to_remove]
    table = table.select(cols_to_keep)
    print(" <", table.column_names)

    # Save under the same file name, inside output_dir
    f_save = output_dir / f
    parquet.write_table(table, f_save)

print("Done!\n")
```

This script creates the output folder if it does not exist and writes the transformed tables there. It requires a configuration file in the following format:

```yaml
input_dir: path/to/input_dir/ # Common folder where all input files are located, relative to data_dir
output_dir: path/to/output_dir/ # Common folder where all output files will be saved, relative to data_dir
input_files:
  # Path to each file, relative to input_dir.
  # The keys of the sections below must match these entries exactly.
  - file_1
  - file_2
person_columns:
  # For each file, name of the column that contains the patient id
  file_1: NUHSA_ENCRIPTADO
  file_2: NUHSA_ENCRIPTADO
date_columns:
  # For each file, name of the column or columns that contains the dates
  file_1:
  - FECHA_INICIO
  file_2:
  - COD_FEC_INI_DIAGNOSTICO
type_concept_mapping:
  # For each file, OMOP code that represents the origin of the information
  # See https://github.com/OHDSI/Vocabulary-v5.0/wiki/Vocab.-TYPE_CONCEPT
  file_1: 32817
  file_2: 32840
```
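
Because the script looks up `person_columns[f]`, `date_columns[f]` and `type_concept_mapping[f]` for every entry `f` of `input_files`, a missing or misspelled key only surfaces as a `KeyError` in the middle of the run. A minimal sketch of a check that could be run on the loaded parameters beforehand follows; `check_params` is a hypothetical helper, not part of `bps_to_omop`, and it works on an already loaded dictionary.

```python
def check_params(params):
    """Return a list of problems found in an omopization params dict.

    Hypothetical helper for illustration only.
    """
    problems = []
    per_file_sections = ["person_columns", "date_columns", "type_concept_mapping"]
    for f in params.get("input_files", []):
        # Every per-file section needs an entry for this exact file name
        for section in per_file_sections:
            if f not in params.get(section, {}):
                problems.append(f"{section} has no entry for '{f}'")
        # At most two date columns are supported per file
        dates = params.get("date_columns", {}).get(f)
        if dates is not None and not 1 <= len(dates) <= 2:
            problems.append(f"date_columns for '{f}' must list 1 or 2 columns")
    return problems


params = {
    "input_files": ["file_1", "file_2"],
    "person_columns": {"file_1": "NUHSA_ENCRIPTADO", "file_2": "NUHSA_ENCRIPTADO"},
    "date_columns": {"file_1": ["FECHA_INICIO"], "file_2": ["COD_FEC_INI_DIAGNOSTICO"]},
    "type_concept_mapping": {"file_1": 32817},  # file_2 left out on purpose
}
problems = check_params(params)
print(problems)
```

Here the only reported problem is the missing `type_concept_mapping` entry for `file_2`.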

## Code to automatically generate params

In this section we automatically scan all the files to generate the params file. Note that the values below are all dummy values and must be adapted to your own data for the code to work.

We need to provide:

- `params_file` is the path to the parameters file.
- `input_dir` is a str that defines the folder where the raw data is located.
- `output_dir` is a str that defines the folder where the output will be saved.
- `input_files` is a list with all the files to be processed. They can be filenames or paths relative to `input_dir`.
- `str_checklist` is a list of substrings used to identify date columns. Any column whose name contains one of these strings is considered a date column.
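
As an illustration of how `str_checklist` works, here is a minimal sketch of substring matching on column names. `match_date_columns` is a hypothetical stand-in, not the actual `ext.find_matching_keys`, and it assumes the comparison is case-insensitive (the example column names are upper case while the checklist is lower case).

```python
def match_date_columns(columns, str_checklist):
    """Return the columns whose name contains any substring of the checklist.

    Hypothetical helper; the comparison is assumed to be case-insensitive.
    """
    return [
        col
        for col in columns
        if any(sub.lower() in col.lower() for sub in str_checklist)
    ]


str_checklist = ["fecha", "fec", "inicio", "fin", "f_"]
columns = ["NUHSA_ENCRIPTADO", "FECHA_INICIO", "COD_DIAGNOSTICO", "DEFINICION"]
matches = match_date_columns(columns, str_checklist)
print(matches)
```

Note that `DEFINICION` is matched as well, because it contains `fin`. Substring matching can over-match, so the generated `date_columns` entries should be reviewed before running the omopization.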

Note that `input_dir` and `output_dir` are defined relative to the `data_dir` folder defined in the `.env` file. This way the general location of the files can remain hidden.
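
The sketch below shows how such relative paths combine with `data_dir`, and why they should not start with a slash. The value of `data_dir` is a made-up placeholder, and `PurePosixPath` is used only so that the example behaves the same on every operating system.

```python
from pathlib import PurePosixPath

# Hypothetical location; in the project it comes from the .env file
data_dir = PurePosixPath("/srv/project/data")

# Relative folder, as written in the params file
input_dir = "rare/02_cleaned/"
full_path = data_dir / input_dir / "03_MPA.parquet"
print(full_path)

# An absolute path discards everything to its left when joined,
# so data_dir would be silently ignored
overridden = data_dir / "/path/to/input_dir/"
print(overridden)
```

The first join stays inside `data_dir`, while the second one points outside of it.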

```python
import os
import sys

import pandas as pd

from package.datasets import (
    data_dir,
)  # It is advisable to provide data_dir as a module. load_dotenv can be used too.

sys.path.append("../external/bps_to_omop/")  # Make sure the path is right!
import bps_to_omop.extract as ext

# == Define parameters ======================================================
params_file = "../package/preomop/omopization_params.yaml"
# Start from a clean params file; skip the removal if it does not exist yet
if os.path.exists(params_file):
    os.remove(params_file)

input_dir = "rare/02_cleaned/"
input_files = ["01_sociodemo.parquet", "02_Patologias_BPS.parquet", "03_MPA.parquet"]
output_dir = "done/"

# Substrings used to identify date columns
str_checklist = ["fecha", "fec", "inicio", "fin", "f_"]

# == Work with parameters ===================================================
# Write to file
ext.update_yaml_params(params_file, "input_dir", input_dir)
ext.update_yaml_params(params_file, "output_dir", output_dir)
ext.update_yaml_params(params_file, "input_files", input_files)
```

### person_id fields

We need to define which column holds the patient ID in each file. The code below checks that the name of the first column of each file contains 'NUHSA' (or 'NUSA') and adds that column to the configuration file.

```python
# == Create person_id ======================================================
# -- Get columns with person_id info
person_columns = {}
for f in input_files[:]:
    # Read data
    table_raw = pd.read_parquet(data_dir / input_dir / f).reset_index()
    init_cols = table_raw.columns
    # Check if first column is like NUHSA
    str_to_check = ["NUHSA", "NUSA"]
    person_source_col = table_raw.columns[0]
    if not any(x in person_source_col.upper() for x in str_to_check):
        raise AssertionError(
            f"First column name ({person_source_col}) does not contain 'NUHSA'"
        )
    person_columns[f] = person_source_col

# Add to config file
ext.update_yaml_params(params_file, "person_columns", person_columns)
person_columns
```

### date_fields

Now we need to define which columns hold the date information. The code below identifies the columns whose name looks like a date and adds them to the configuration file. During omopization they are handled as follows:

- If there is one date, it becomes both start_date and end_date.
- If there are two dates, the column with the older dates becomes start_date and the column with the more recent dates becomes end_date.
- If there are inconsistencies, i.e. the same column does not hold the older date in every row, an error is raised. This should be fixed upstream, in process_rare_files.py or similar.

There should not be more than two dates. If there are, this should also be fixed upstream in process_rare_files.py or similar.

```python
# == Identify date columns ========================================================
# Iterate over the files
date_columns = {}
for f in input_files[:]:
    # Look for columns with dates on each file
    table_raw = pd.read_parquet(data_dir / input_dir / f).reset_index()
    date_columns_names = ext.find_matching_keys(table_raw.columns, str_checklist)
    date_columns[f] = date_columns_names

# Add to config file
ext.update_yaml_params(params_file, "date_columns", date_columns)
date_columns
```
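
The rules for one and two date columns described in this section can be sketched as follows. This is not the actual `ext.find_start_end_dates`: `order_date_columns` is a hypothetical helper that works on plain lists of dates and, as a simplifying assumption, ignores missing values.

```python
from datetime import date


def order_date_columns(date_columns):
    """Return (start_dates, end_dates) from a dict of column name -> list of dates.

    Hypothetical helper for illustration only; assumes no missing values.
    """
    names = list(date_columns)
    if len(names) == 1:
        # A single date is used as both start and end
        col = date_columns[names[0]]
        return list(col), list(col)
    if len(names) != 2:
        raise ValueError(f"Expected 1 or 2 date columns, got {len(names)}")
    first, second = date_columns[names[0]], date_columns[names[1]]
    # The column that is never later than the other one is the start date
    if all(a <= b for a, b in zip(first, second)):
        return list(first), list(second)
    if all(a >= b for a, b in zip(first, second)):
        return list(second), list(first)
    # Neither column is consistently the older one: fix the data upstream
    raise ValueError("Date columns are not consistently ordered")


start, end = order_date_columns(
    {
        "F_FIN": [date(2021, 3, 1), date(2021, 6, 1)],
        "F_INICIO": [date(2021, 1, 1), date(2021, 6, 1)],
    }
)
print(start, end)
```

In this example the columns are given in reverse order, so they are swapped: `F_INICIO` becomes the start date and `F_FIN` the end date.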

### type_concept fields

Finally, each record needs an OMOP type concept code that identifies the source of the information. Here we assign a single code to each file, although a file may combine records from different information sources, in which case one code per file is only an approximation.

Once the codes are assigned, we add them to the configuration file.

```python
# Keys must match the entries of input_files exactly
type_concept_mapping = {
    "01_sociodemo.parquet": 32817,  # EHR
    "02_Patologias_BPS.parquet": 32840,  # EHR problem list
    "03_MPA.parquet": 32856,  # Lab
}

ext.update_yaml_params(params_file, "type_concept_mapping", type_concept_mapping)
```