# Introduction to bps_to_omop

This document introduces the rationale behind this package. The workflow has two well-defined stages:
1. Preparation of the data for omopization (Preomop processing).
2. Generation of the OMOP tables.

## 1. Preomop processing

Before generating the OMOP tables, it is best to prepare the original data. The required steps are shown in the following diagram:

```{mermaid}
flowchart LR
	source1[(\nClinical raw data)]
	source2[(\nOMOP extended\nvocabularies)]
	source3[(\nOMOP Initial Data)]

	script1["retrieve_raw_data"]
	script2["process_raw_data"]
	script3["process_rare_data"]
	script4["omopization"]
	script5["retrieve_omop_vocab_tables"]

	direction LR
	source1 --> script1 --> script2 --> script3 --> script4 --> source3
	script4
	script5
	
	source2-->script5-->source3

```

Roughly described, each step performs the following functions:
1. **retrieve_raw_data**: Data is fetched from an external directory, *Clinical raw data*. 
2. **process_raw_data**: The original data is read and saved in parquet format. **No** data transformations are performed; the only goal is to store the data in a format that can be read quickly later on.
  - This simplifies access to files and speeds up future transformations.
3. **process_rare_data**: The parquet data is processed to facilitate the following steps. Data transformations/purges, if any, should be applied here.
  - This process, if needed, is mostly manual. Some general things can be done, like dropping duplicates or invalid rows, but the specifics depend largely on each project and file, so no attempt at generalization has been made.
  - If something fails in the following steps or in the generation of the OMOP tables, the easiest solution is to fix it at this stage.
  - This step also serves as a registry of "weird things" that had to be done to the data in order to prepare it.
  - Some things that should be done in this stage:
    - Join files with the same information.
    - Separate files that contain too much information. OMOP-generation files expect only two dates per file: start of event and end of event. If a file contains more dates than that, it should be split.
4. **omopization**: The processed data are prepared to facilitate their transfer to OMOP format. Here columns are created or renamed for later use.
  - Things that are done here:
    - Generation of the unique person_id across all files to prevent inconsistencies.
    - Standardization of column names to OMOP-like equivalents.
    - Generation of a type_concept_id code to identify the origin of the information. This code will be carried over to every OMOP table.
5. **retrieve_omop_vocab_tables**: The extended vocabulary tables are retrieved from a remote directory. See the [MappingToOMOP](http://gitlab.cbra.com/igutierrez/MappingToOmop) repo.
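
As a rough illustration of steps 2 and 3, the sketch below reads a raw file into parquet without transforming it, then applies the kind of generic cleaning mentioned above (dropping duplicates and invalid rows, splitting a file with several date pairs). It assumes the raw data is CSV and uses pandas; the function names, column names and sample data are hypothetical and are not part of bps_to_omop.

```python
from pathlib import Path

import pandas as pd


def raw_to_parquet(csv_path: Path, out_dir: Path) -> Path:
    """process_raw_data: read a raw CSV as-is and save it as parquet."""
    # dtype=str avoids type guessing, so non-empty values stay as written
    df = pd.read_csv(csv_path, dtype=str)
    out_path = out_dir / f"{csv_path.stem}.parquet"
    df.to_parquet(out_path, index=False)  # needs pyarrow or fastparquet
    return out_path


def clean_rare(df: pd.DataFrame, required: list[str]) -> pd.DataFrame:
    """process_rare_data: drop exact duplicates and rows missing required fields."""
    return df.drop_duplicates().dropna(subset=required).reset_index(drop=True)


def split_by_event(
    df: pd.DataFrame, id_col: str, events: dict[str, tuple[str, str]]
) -> dict[str, pd.DataFrame]:
    """Split a wide table into one table per event, each with only two dates."""
    tables = {}
    for name, (start_col, end_col) in events.items():
        part = df[[id_col, start_col, end_col]].rename(
            columns={start_col: "start_date", end_col: "end_date"}
        )
        # Rows without a start date carry no event of this kind
        tables[name] = part.dropna(subset=["start_date"]).reset_index(drop=True)
    return tables


# Hypothetical file holding two events (hospital stay and surgery) per row
raw = pd.DataFrame(
    {
        "patient": ["A", "A", "B", None],
        "admission": ["2020-01-01", "2020-01-01", "2020-02-01", "2020-03-01"],
        "discharge": ["2020-01-05", "2020-01-05", "2020-02-03", "2020-03-02"],
        "surgery_start": ["2020-01-02", "2020-01-02", None, "2020-03-01"],
        "surgery_end": ["2020-01-02", "2020-01-02", None, "2020-03-01"],
    }
)
clean = clean_rare(raw, required=["patient"])
tables = split_by_event(
    clean,
    "patient",
    {"stay": ("admission", "discharge"), "surgery": ("surgery_start", "surgery_end")},
)
```

Each resulting table has a single start and end date, which is the shape the later steps expect. Any project-specific fix would be added next to `clean_rare`, so this stage keeps acting as the registry of manual changes.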

After this process, all data should be in a bps_to_omop friendly format and the generation of the OMOP tables can begin.
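
The omopization step can be pictured with the minimal sketch below, which assumes pandas DataFrames as input. The column mapping, the helper names and the sample data are hypothetical; `32817` is used only as an example value (the OMOP type concept for EHR data) and should be replaced by whatever code matches the real origin of each file.

```python
import pandas as pd

# Hypothetical mapping from source column names to OMOP-like names
COLUMN_MAP = {
    "patient": "person_source_value",
    "admission": "start_date",
    "discharge": "end_date",
}


def build_person_index(frames: list[pd.DataFrame], source_col: str) -> dict:
    """Assign one integer person_id per distinct source identifier across all files."""
    ids = pd.concat([f[source_col] for f in frames]).dropna().unique()
    # Sorting makes the assignment reproducible between runs
    return {src: i for i, src in enumerate(sorted(ids), start=1)}


def omopize(df: pd.DataFrame, person_index: dict, type_concept_id: int) -> pd.DataFrame:
    """Rename columns and add the person_id and type_concept_id columns."""
    out = df.rename(columns=COLUMN_MAP)  # returns a new DataFrame
    out["person_id"] = out["person_source_value"].map(person_index)
    out["type_concept_id"] = type_concept_id  # carried over to every OMOP table
    return out


# Two hypothetical files that share some patients
visits = pd.DataFrame(
    {"patient": ["B", "A"], "admission": ["2020-02-01", "2020-01-01"],
     "discharge": ["2020-02-03", "2020-01-05"]}
)
labs = pd.DataFrame({"patient": ["A", "C"], "value": [1.2, 3.4]})

person_index = build_person_index([visits, labs], "patient")
visits_omop = omopize(visits, person_index, type_concept_id=32817)
labs_omop = omopize(labs, person_index, type_concept_id=32817)
```

Building the index once from all files is what keeps the same patient on the same `person_id` everywhere, regardless of the order in which files are processed.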


## 2. Generation of the OMOP tables

Once the original files are "omopized", we can start building the OMOP tables themselves. Depending on the original data, not every table needs to be built, so a single general procedure that builds every table would not be optimal. Instead, we provide one script per table, each relying on the bps_to_omop modules. Which tables to build depends almost entirely on the specifics of the original data.

Take into account that the tables CDM_SOURCE, PERSON and OBSERVATION_PERIOD are always mandatory.

It is recommended to study the original data and plan ahead which tables are needed for the current project. As an example, if our main goal is to create the tables PERSON, CONDITION_OCCURRENCE and MEASUREMENT, the process can be summarized in the following diagram:

```{mermaid}
%%{init: {'flowchart': {'curve': 'linear'}} }%%
flowchart LR
	script_cdm_source["genomop_cdm_source"]
	script_person["genomop_person"]
	script_obs["genomop_observation_period"]
	script_visit["genomop_visit_occurrence"]
	script_cond["genomop_condition_occurrence"]
	script_meas["genomop_measurement"]

	folder_cdm_source{{<i>/omop_intermediate/CDM_SOURCE/</i>}}
	folder_person{{<i>/omop_intermediate/PERSON/</i>}}
	folder_obs{{<i>/omop_intermediate/OBSERVATION_PERIOD/</i>}}
	folder_visit{{<i>/omop_intermediate/VISIT_OCCURRENCE/</i>}}
	folder_cond{{<i>/omop_intermediate/CONDITION_OCCURRENCE/</i>}}
	folder_meas{{<i>/omop_intermediate/MEASUREMENT/</i>}}

	subgraph group1 [Independent tables]
		script_cdm_source	
		script_person
		script_obs
		script_visit
	end
	
	script_visit --> script_cond
	script_visit --> script_meas

	script_cdm_source --> folder_cdm_source
	script_person --> folder_person
	script_obs --> folder_obs
	script_visit --> folder_visit
	script_cond --> folder_cond
	script_meas --> folder_meas

	subgraph group2 [Dependent tables]
		script_cond["genomop_condition_occurrence"]
		script_meas["genomop_measurement"]

	end

	subgraph group3 [Final OMOP tables]
		folder_cdm_source
		folder_person 
		folder_obs 
		folder_visit  
		folder_cond
		folder_meas
	end
```

It is important to determine which tables need to be built first. Usually, VISIT_OCCURRENCE is the next step after OBSERVATION_PERIOD, but VISIT_OCCURRENCE can reference entries in the LOCATION table. If that's the case, LOCATION should be built before VISIT_OCCURRENCE.

The best way to determine this is to go to the [CDM definition website](https://ohdsi.github.io/CommonDataModel/cdm54.html), see which tables you need, and then look through the contents of those tables for references to other tables that must be built in advance.

In most cases, a table will only depend on one or two others. Once that is settled, pick one of the earliest tables in the chain, find the corresponding guide in the docs folder and the generation script example in the examples folder, and follow the instructions.
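
One way to work out the build order is to write the dependencies down and sort them topologically. The sketch below uses Python's standard `graphlib` module (Python 3.9+). The dependency map is an assumption that only mirrors the examples in this document (the diagram and the LOCATION case); it must be adapted to each project after checking the CDM definition. `build_order` is a hypothetical helper, not a bps_to_omop function.

```python
from graphlib import TopologicalSorter

# Tables that are always required
MANDATORY = ("CDM_SOURCE", "PERSON", "OBSERVATION_PERIOD")

# Assumed map: table -> tables that must be built first.
# Keep the LOCATION entry only if the project's visits reference locations.
DEPENDS_ON = {
    "VISIT_OCCURRENCE": {"LOCATION"},
    "CONDITION_OCCURRENCE": {"VISIT_OCCURRENCE"},
    "MEASUREMENT": {"VISIT_OCCURRENCE"},
}


def build_order(requested, depends_on=DEPENDS_ON):
    """Return every table to build, with prerequisites before dependents."""
    # Collect mandatory and requested tables plus everything they depend on
    needed = set()
    stack = list(MANDATORY) + list(requested)
    while stack:
        table = stack.pop()
        if table not in needed:
            needed.add(table)
            stack.extend(depends_on.get(table, ()))
    graph = {table: depends_on.get(table, set()) for table in sorted(needed)}
    # static_order() yields each table only after all of its prerequisites
    return list(TopologicalSorter(graph).static_order())


order = build_order(["CONDITION_OCCURRENCE", "MEASUREMENT"])
for step, table in enumerate(order, start=1):
    print(step, table)
```

The relative order of unrelated tables is not significant; only the guarantee that a table appears after the tables it references matters. A circular dependency in the map raises `graphlib.CycleError`, which is a useful early warning that the plan needs revisiting.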