# COHORT and COHORT_DEFINITION

See [cohort](https://ohdsi.github.io/CommonDataModel/cdm54.html#cohort) and [cohort_definition](https://ohdsi.github.io/CommonDataModel/cdm54.html#cohort_definition). 

The COHORT table records the subjects that fulfill the criteria of a cohort, together with the period during which they belong to it. The COHORT_DEFINITION table contains a description of each cohort.

```{mermaid}
erDiagram
    OMOP_COHORT {
        integer cohort_definition_id
        integer subject_id
        date cohort_start_date
        date cohort_end_date
    }
```

```{mermaid}
erDiagram
    OMOP_COHORT_DEFINITION {
        integer cohort_definition_id
        varchar(255) cohort_definition_name
        varchar(MAX) cohort_definition_description
        integer definition_type_concept_id
        varchar(MAX) cohort_definition_syntax
        integer subject_concept_id
        date cohort_initiation_date
    }
```
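As an illustration of what conforming a record to the COHORT_DEFINITION layout above can look like, the following minimal sketch maps each column to a Python type and coerces a row accordingly. The names `COHORT_DEFINITION_SCHEMA`, `to_date` and `conform` are hypothetical and do not come from `genomop_cohort.py`; the actual script may rely on a different mechanism.

```python
from datetime import date


def to_date(value):
    """Accept a date object or an ISO 'YYYY-MM-DD' string."""
    return value if isinstance(value, date) else date.fromisoformat(value)


# Column order and types taken from the diagram above (hypothetical helper).
COHORT_DEFINITION_SCHEMA = {
    "cohort_definition_id": int,
    "cohort_definition_name": str,
    "cohort_definition_description": str,
    "definition_type_concept_id": int,
    "cohort_definition_syntax": str,
    "subject_concept_id": int,
    "cohort_initiation_date": to_date,
}


def conform(row, schema):
    """Return a new row with exactly the schema columns, in schema order.

    Missing columns become None; columns not in the schema are dropped.
    """
    conformed = {}
    for column, cast in schema.items():
        value = row.get(column)
        conformed[column] = None if value is None else cast(value)
    return conformed


raw = {"cohort_definition_id": "1", "cohort_definition_name": "Diagnosed", "extra": 0}
definition_row = conform(raw, COHORT_DEFINITION_SCHEMA)
```

Here `definition_row` contains all seven columns, with the identifier cast to an integer and the unset columns filled with `None`.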

In this project, we follow the original data and predefine three cohorts:
- *Diagnosed*: Patients diagnosed with hepatitis C.
- *Pseudo-control*: Patients that have a negative hepatitis C test.
- *Control*: A random sample from the general population.

The transformation is carried out by the script [genomop_cohort.py](../src/genomop_cohort.py). The script assumes that, for each cohort to be defined, there is a file with sociodemographic information (akin to the PERSON table). If this file does not exist, it can be generated in the `process_rare_data` stage.

This script performs the following steps:
1. Loads the parameters.
2. Creates the output directory.
3. Iterates over the cohorts to be defined. See the `cohorts` parameter.
   1. Loads the file that contains the list of patients in the current cohort. See the `sociodemo_file` parameter.
   2. Creates the COHORT_DEFINITION row for the current cohort using the values defined in the params file.
   3. Iterates over the patients in the file and saves one row of the COHORT table for each patient.
4. Joins the per-cohort results to create the COHORT and COHORT_DEFINITION tables.
5. Adapts the COHORT and COHORT_DEFINITION tables to their respective schemas.
6. Saves the COHORT and COHORT_DEFINITION tables to the defined output folder.
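The core of steps 3 and 4 can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the code of `genomop_cohort.py`: the function `build_tables`, the constant `NON_DEFINITION_KEYS` and the `person_ids_by_cohort` argument (a stand-in for reading the `person_id` column of each `sociodemo_file`) are hypothetical, and the COHORT column names follow the linked CDM 5.4 specification.

```python
from datetime import date

# Keys of a cohort entry that are assumed NOT to be COHORT_DEFINITION columns.
NON_DEFINITION_KEYS = {"sociodemo_file", "cohort_start_date", "cohort_end_date"}


def build_tables(cohorts, person_ids_by_cohort):
    """Build COHORT and COHORT_DEFINITION rows as lists of dicts."""
    cohort_rows = []
    definition_rows = []
    for name, params in cohorts.items():
        # One COHORT_DEFINITION row per cohort, taken from the params entry.
        definition_rows.append(
            {key: value for key, value in params.items() if key not in NON_DEFINITION_KEYS}
        )
        # The same start and end dates are applied to every subject.
        start = date.fromisoformat(params["cohort_start_date"])
        end = date.fromisoformat(params["cohort_end_date"])
        # One COHORT row per patient listed for this cohort.
        for person_id in person_ids_by_cohort[name]:
            cohort_rows.append(
                {
                    "cohort_definition_id": params["cohort_definition_id"],
                    "subject_id": person_id,
                    "cohort_start_date": start,
                    "cohort_end_date": end,
                }
            )
    return cohort_rows, definition_rows


example_cohorts = {
    "diagnosed": {
        "sociodemo_file": "Hepatitis_Diag_Sociodemo.parquet",
        "cohort_definition_id": 1,
        "cohort_definition_name": "Patients with a positive test for hepatitis-C",
        "cohort_start_date": "2017-01-01",
        "cohort_end_date": "2023-12-01",
    }
}
cohort_rows, definition_rows = build_tables(example_cohorts, {"diagnosed": [101, 102]})
```

With the two made-up patient identifiers above, `cohort_rows` holds two rows that share `cohort_definition_id` 1 and `definition_rows` holds a single row for the cohort.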
 

The configuration file is [genomop_cohort_params.yaml](../src/genomop_cohort_params.yaml). It must have the following structure:

```yaml
input_dir: rare/03_omop_initial/
output_dir: rare/04_omop_intermediate/COHORT/
cohorts:
  diagnosed: 
    sociodemo_file: Hepatitis_Diag_Sociodemo.parquet
    cohort_definition_id: 1
    cohort_definition_name: Patients with a positive test for hepatitis-C
    subject_concept_id: 1147314 # OMOP code for person
    definition_type_concept_id: 32882 # OMOP code for Standard algorithm from EHR
    cohort_start_date: "2017-01-01"
    cohort_end_date: "2023-12-01"
  control: 
    sociodemo_file: Hepatitis_Control_Sociodemo.parquet
    cohort_definition_name: Patients with a negative test for hepatitis-C
    subject_concept_id: 1147314 # OMOP code for person
    definition_type_concept_id: 32882 # OMOP code for Standard algorithm from EHR
    cohort_definition_id: 2
    cohort_start_date: "2017-01-01"
    cohort_end_date: "2023-12-01"
  aleatorio: 
    sociodemo_file: Hepatitis_Aleatorio_Sociodemo.parquet
    cohort_definition_name: Random sample from the general population
    subject_concept_id: 1147314 # OMOP code for person
    definition_type_concept_id: 32882 # OMOP code for Standard algorithm from EHR
    cohort_definition_id: 3
    cohort_start_date: "2017-01-01"
    cohort_end_date: "2023-12-01"
```

The parameters are:

- `input_dir`: the path from `data_dir` to the directory where the input data is located.
- `output_dir`: the path from `data_dir` to the directory where the output data will be saved.
- `cohorts`: a dict that defines the cohorts. We use the first one defined as an example:
  - `diagnosed`: is the name of the cohort
    - `sociodemo_file`: file that contains the list of patients that belong to this cohort.
      - In this file there has to be a field named `person_id`.
    - `cohort_definition_id`: Integer that uniquely identifies this cohort.
    - `cohort_definition_name`: A short description of the cohort.
    - `subject_concept_id`: A Concept that represents the domain of the subjects that are members of the cohort (e.g., Person, Provider, Visit).
      - Usually 1147314, the OMOP code for person. Other domains are possible, for example a cohort of hospitals.
    - `definition_type_concept_id`: Type defining what kind of Cohort Definition the record represents and how the syntax may be executed.
      - Usually 32882, the OMOP code for Standard algorithm from EHR.
    - `cohort_start_date`: Date when the subject entered the cohort.
      - Currently the same start date is applied to every subject in the cohort. It could instead be set on a subject-by-subject basis.
    - `cohort_end_date`: Date when the subject left the cohort.
      - Currently the same end date is applied to every subject in the cohort. It could instead be set on a subject-by-subject basis.
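
A parameters file with this structure can be sanity-checked before running the transformation. The sketch below is a hypothetical helper, not part of `genomop_cohort.py`: it assumes that every key shown in the example is required, that `cohort_definition_id` must be unique across cohorts, and that the start date must not be later than the end date.

```python
from datetime import date

# Keys assumed to be required for every cohort entry (based on the example above).
REQUIRED_KEYS = (
    "sociodemo_file",
    "cohort_definition_id",
    "cohort_definition_name",
    "subject_concept_id",
    "definition_type_concept_id",
    "cohort_start_date",
    "cohort_end_date",
)


def validate_cohorts(cohorts):
    """Return a list of problems found; an empty list means no problem was detected."""
    problems = []
    seen_ids = {}
    for name, params in cohorts.items():
        missing = [key for key in REQUIRED_KEYS if key not in params]
        if missing:
            problems.append(f"{name}: missing {', '.join(missing)}")
            continue
        cohort_id = params["cohort_definition_id"]
        if cohort_id in seen_ids:
            problems.append(
                f"{name}: cohort_definition_id {cohort_id} already used by {seen_ids[cohort_id]}"
            )
        seen_ids.setdefault(cohort_id, name)
        # str() also covers YAML loaders that return date objects for unquoted dates.
        start = date.fromisoformat(str(params["cohort_start_date"]))
        end = date.fromisoformat(str(params["cohort_end_date"]))
        if start > end:
            problems.append(f"{name}: cohort_start_date is after cohort_end_date")
    return problems
```

Calling `validate_cohorts` on the `cohorts` dict loaded from the YAML file returns an empty list for the example configuration shown above.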

Important notes:
- Any field in [COHORT_DEFINITION](https://ohdsi.github.io/CommonDataModel/cdm54.html#cohort_definition) can be provided in the parameters file and will be used to populate the table.