# Tutorial

This tutorial takes you through the entire workflow of the [Biology][biology] module.

```python
%load_ext autoreload
%autoreload 2
```

```python
import eds_scikit
import pandas as pd
```

```python3
spark, sc, sql = eds_scikit.improve_performances() # (1)
```

1. See the [welcome page](../../index.md) for an explanation of this line

## 1. Load Data

First, you need to load your data. As detailed in [the dedicated section](../generic/io), eds-scikit expects to work with [Pandas](https://pandas.pydata.org/) or [Koalas](https://koalas.readthedocs.io/en/latest/) DataFrames. We provide several connectors to facilitate data fetching: a [Hive](../generic/io/#loading-from-hive-hivedata) connector, a [Postgres](../generic/io/#loading-from-postgres-postgresdata) connector and a [Pandas](../generic/io/#persistingreading-a-sample-tofrom-disk-pandasdata) connector.

This tutorial uses the [Hive](../generic/io/#loading-from-hive-hivedata) connector.

```python
from eds_scikit.io import HiveData

data = HiveData(
    database_name="cse_XXX",
    tables_to_load=[
        "care_site",
        "concept",
        "concept_relationship",
        "measurement",
        "visit_occurrence",
    ],
)
```

```
Number of unique patients: 100000
```

## 2. Define your concepts-sets

In order to work on the measurements of interest, you can extract a list of concepts-sets by:

- Selecting [default concepts-sets](../../datasets/concepts-sets.md) provided in the library.
- Modifying the codes of a selected default concepts-set.
- Creating a concepts-set from scratch.

This tutorial merges two default concepts-sets into a single one and adds a custom concepts-set.

```python
from eds_scikit.biology import ConceptsSet

protein_blood = ConceptsSet("Protein_Blood_Quantitative")
protein_urine = ConceptsSet("Protein_Urine_Quantitative")
protein = ConceptsSet(
    name="Protein_Quantitative",
    concept_codes=protein_blood.concept_codes + protein_urine.concept_codes,
)

custom_entity = ConceptsSet(
    name="Custom_entity", concept_codes=["G6616", "I2013", "C2102"]
)

concepts_sets = [
    protein,
    custom_entity,
]
```
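
Merging two code lists with `+` keeps any code that appears in both. Whether `ConceptsSet` removes such duplicates itself is not stated here, so the sketch below shows one way to de-duplicate a merged list in plain Python before passing it on. The code lists are hypothetical stand-ins for `protein_blood.concept_codes` and `protein_urine.concept_codes`.

```python
# Hypothetical code lists standing in for the `concept_codes` of two
# concepts-sets; the real lists come from the library.
blood_codes = ["X0001", "X0002", "X0003"]
urine_codes = ["X0004", "X0005", "X0003"]  # "X0003" is repeated on purpose

# Plain concatenation keeps the duplicate.
combined = blood_codes + urine_codes

# dict.fromkeys drops duplicates while preserving first-seen order.
unique_codes = list(dict.fromkeys(combined))

print(unique_codes)
```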

## 3. Create your own configuration (**OPTIONAL**)

If the [default configuration](../../datasets/biology-config.md) file based on the AP-HP's Data Warehouse does not meet your requirements, you can follow this tutorial to create your own configuration file.

As a reminder, a configuration file is a CSV table where each row corresponds to a given standard concept code and a given unit. For each row, it gives a maximum and a minimum threshold used to flag outliers, and a unit conversion coefficient used to normalize units if needed.

### 3.1 Plot statistical summary

The first step is to compute the statistical summary of each concepts-set with the function ``plot_biology_summary(stats_only=True)``. 

```python
from eds_scikit.biology import plot_biology_summary

start_date = "2017-01-01"
end_date = "2022-01-01"

plot_biology_summary(
    data,
    concepts_sets=concepts_sets,
    start_date=start_date,
    end_date=end_date,
    stats_only=True,
)
```

By default, the data will be saved in the `Biology_summary` folder.

Each `ConceptsSet` will have its own folder.
Here, we used `stats_only=True`, so:

- No graphical dashboard will be generated
- Data will not be stratified by care site

Let us display the results for the protein-related `ConceptsSet`:

```python
pd.read_csv("./Biology_summary/Protein_Quantitative/stats_summary.csv")
```

| Unnamed: 0 | LOINC_concept_code | AnaBio_concept_code | LOINC_concept_name | AnaBio_concept_name | unit_source_value | count | mean | std | min | 25% | 50% | 75% | max | MAD | max_threshold | min_threshold |
| :--------- | :----------------- | :------------------ | :----------------- | :------------------ | :---------------- | :---- | :--- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :------------ | :------------ |
| 0 | 2885-2 | A0249 | Prot SerPl-mCnc | Protéines_Sérum_g/L | g/l | 6021 | 77.286 | 8.321 | 24.819 | 65.504 | 61.279 | 85.818 | 104.826 | 8.924 | 103.919 | 23.073 |
| 1 | 2885-2 | A0250 | Prot SerPl-mCnc | Protéines_Sérum_Electrophorèse_g/L | g/l | 1176 | 59.705 | 7.609 | 24.735 | 47.535 | 84.605 | 90.445 | 137.543 | 7.131 | 91.838 | 32.455 |
| 2 | 2885-2 | A7347 | Prot SerPl-mCnc | Protéines_Plasma_g/L | g/l | 12421 | 51.113 | 8.548 | 22.551 | 63.876 | 58.16 | 77.023 | 95.262 | 8.17 | 86.654 | 33.378 |
| 3 | 2885-2 | B9417 | Prot SerPl-mCnc | Protéines_Sérum_Colorimétrie_g/L | g/l | 601 | 56.906 | 12.196 | 32.205 | 55.82 | 56.61 | 69.69 | 79.671 | 7.919 | 121.822 | 31.16 |
| 4 | 2885-2 | C9874 | Prot SerPl-mCnc | Protéines_Sérum_Electrophorèse 2_g/L | g/l | 169 | 54.237 | 6.402 | 54.82 | 51.428 | 76.413 | 74.323 | 84.257 | 8.145 | 124.186 | 34.603 |
| 5 | 2885-2 | D0058 | Prot SerPl-mCnc | Protéines Après dialyse_Sérum/Plasma_g/L | g/l | 51 | 64.92 | 4.699 | 52.023 | 71.595 | 61.444 | 78.434 | 76.351 | 4.502 | 73.379 | 39.551 |
| 6 | 2885-2 | F2624 | Prot SerPl-mCnc | Protéines Pédiatrique_Sérum/Plasma_g/L | g/l | 3 | 58.934 | 11.768 | 45.364 | 40.882 | 54.139 | 59.366 | 84.88 | 11.952 | 77.996 | 5.854 |
| 7 | 2885-2 | F5122 | Prot SerPl-mCnc | Protéines Duplication A7347_Plasma_g/L | g/l | 213 | 80.395 | 6.134 | 40.129 | 69.549 | 66.73 | 85.024 | 110.905 | 8.824 | 113.764 | 38.456 |
| 8 | 2888-6 | A1694 | Protéines [Masse/Volume] Urine - Numérique | Protéines_Urines 24h_g/L | g/l | 193 | 2.343 | 4.262 | 0.063 | 0.089 | 0.257 | 1.62 | 52.679 | 0.162 | 1.275 | 0.0 |
| 9 | 2888-6 | A1695 | Protéines [Masse/Volume] Urine - Numérique | Protéines_Urines_g/L | g/l | 2300 | 0.648 | 1.621 | 0.0 | 0.076 | 0.181 | 0.428 | 35.934 | 0.144 | 0.76 | 0.0 |
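
Some concepts in this summary rest on very few measurements (for instance `F2624`, with a count of 3), so thresholds derived from them may be unreliable. As a rough sketch, the snippet below lists such rows with pandas. It uses a three-row excerpt of the table above embedded as text rather than the real file, and the `MIN_COUNT` cutoff is a hypothetical value chosen for illustration, not a library default.

```python
import io

import pandas as pd

# Excerpt of the summary above, embedded as text so the example is
# self-contained; in practice you would read stats_summary.csv instead.
csv_text = """AnaBio_concept_code,unit_source_value,count,max_threshold,min_threshold
A0249,g/l,6021,103.919,23.073
D0058,g/l,51,73.379,39.551
F2624,g/l,3,77.996,5.854
"""
stats = pd.read_csv(io.StringIO(csv_text))

# Hypothetical cutoff: below this many measurements, review thresholds by hand.
MIN_COUNT = 100

low_count = stats[stats["count"] < MIN_COUNT]
print(low_count[["AnaBio_concept_code", "count"]])
```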


If you prefer, an [HTML table](./Biology_summary/Protein_Quantitative/stats_summary.html) is also generated along with the CSV (same name, but with a `.html` extension).

### 3.2 Create configuration from statistical summary

Then, you can use the function ``create_config_from_stats()`` to pre-fill the configuration file with ``max_threshold`` and ``min_threshold``. The threshold computation is based on the Median Absolute Deviation (MAD) methodology[@madmethodology].

```python
from eds_scikit.biology.utils.config import create_config_from_stats

config_name = "my_custom_config"

create_config_from_stats(
    concepts_sets=concepts_sets,
    config_name=config_name,
)
```
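
To give an intuition for MAD-based thresholds, here is a minimal sketch of the general technique on toy values. It is not the library's implementation: the helper `mad_thresholds`, the multiplier `k = 3` and the absence of any scaling constant are assumptions made for illustration, and the exact rule applied by `create_config_from_stats()` may differ.

```python
from typing import Tuple

import pandas as pd


def mad_thresholds(values: pd.Series, k: float = 3.0) -> Tuple[float, float]:
    """Return (min_threshold, max_threshold) as median -/+ k * MAD.

    Hypothetical helper: `k` is an illustrative choice, not a library default.
    """
    median = values.median()
    # MAD: median of the absolute deviations from the median.
    mad = (values - median).abs().median()
    return median - k * mad, median + k * mad


# Toy measurements with one obviously extreme value.
values = pd.Series([60.0, 62.0, 65.0, 67.0, 70.0, 72.0, 75.0, 140.0])

low, high = mad_thresholds(values)
outliers = values[(values < low) | (values > high)]

print(low, high)          # 53.5 83.5
print(outliers.tolist())  # [140.0]
```

Because both the median and the MAD are rank-based, the single extreme value barely moves the thresholds, which is the main reason to prefer them over mean and standard deviation for this task.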

### 3.3 Edit units manually

The ``transformed_unit`` column is pre-filled with the unit that corresponds to the most measurements. When a ``unit_source_value`` differs from the ``transformed_unit``, it probably means that the concept's unit needs to be normalized.

- To normalize the unit of a concept you need to fill in manually the ``Action`` column with *Transform* and the ``Coefficient`` column with the unit conversion factor.
- If you consider the concept irrelevant, you can fill in the ``Action`` column with *Delete* and it will delete the measurements corresponding to the concept.
- If the ``unit_source_value`` matches the ``transformed_unit`` you can leave the ``Action`` and the ``Coefficient`` columns empty.
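
The three cases above can be illustrated with pandas on toy data. This is only a sketch of the logic, not what the library runs internally: the `concept_code` and `value` column names, the codes and the values are hypothetical, while `Action`, `Coefficient`, `transformed_unit` and `unit_source_value` are the configuration columns described in this section.

```python
import pandas as pd

# Toy measurements (hypothetical codes and values).
measurements = pd.DataFrame(
    {
        "concept_code": ["X1", "X1", "X2", "X3"],
        "unit_source_value": ["g/l", "mg/l", "g/l", "g/l"],
        "value": [70.0, 65000.0, 0.5, 12.0],
    }
)

# Toy configuration: one row per (concept code, unit).
config = pd.DataFrame(
    {
        "concept_code": ["X1", "X1", "X2", "X3"],
        "unit_source_value": ["g/l", "mg/l", "g/l", "g/l"],
        "transformed_unit": ["g/l", "g/l", "g/l", "g/l"],
        "Action": [None, "Transform", None, "Delete"],
        "Coefficient": [float("nan"), 0.001, float("nan"), float("nan")],
    }
)

merged = measurements.merge(
    config, on=["concept_code", "unit_source_value"], how="left"
)

# "Delete": drop every measurement of the concept.
merged = merged[merged["Action"] != "Delete"].copy()

# "Transform": multiply by the coefficient; an empty coefficient means
# the unit already matches, so the value is kept unchanged.
merged["transformed_value"] = merged["value"] * merged["Coefficient"].fillna(1.0)

print(merged[["concept_code", "unit_source_value", "transformed_value"]])
```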

### 3.4 Use your custom configuration

Once you have created your configuration (for instance under the name `config_name="my_custom_config"`), you can provide it to the relevant functions (see below).

You can also check the configuration file directly:

```python
from eds_scikit.resources import registry
config = registry.get("data", "biology_config.my_custom_config")()
```

## 4. Clean the data

Now you can use the ``bioclean()`` function with your custom configuration or the default configuration to:

- [Extract concepts-sets][2-extract-concepts-sets]
- [Normalize units][3-normalize-units]
- [Detect outliers][4-detect-outliers]

It will add a ``bioclean`` table to your ``data``. For more details, have a look at [the dedicated section](cleaning).

```python
from eds_scikit.biology import bioclean

bioclean(
    data,
    concepts_sets=concepts_sets,
    config_name=config_name,
    start_date=start_date,
    end_date=end_date,
)
```

See below the columns created by the ``bioclean()`` function:

| concepts_set               | LOINC_concept_code | LOINC_concept_name | AnaBio_concept_code | AnaBio_concept_name  | transformed_unit | transformed_value | max_threshold | min_threshold | outlier | value_source_value | unit_source_value |
| :------------------------- | :----------------- | :----------------- | :------------------ | :------------------- | :--------------- | :---------------- | :------------ | :------------ | :------ | :----------------- | :---------------- |
| EntityA_Blood_Quantitative | 000-0              | EntityA #Bld       | A0000               | EntityA_Blood        | x10*9/l          | 115               | 190           | 0             | False   | 115 x10*9/l        | x10*9/l           |
| EntityA_Blood_Quantitative | 000-1              | EntityA_Blood_Vol  | A0001               | EntityA_Blood_g/l    | x10*9/l          | 220               | 190           | 0             | True    | 560 g/l            | g/l               |
| EntityB_Blood_Quantitative | 001-0              | EntityB_Blood      | B0000               | EntityB_Blood_artery | mmol             | 0.45              | 8.548         | 0.542         | True    | 0.45 mmol          | mmol              |
| EntityB_Blood_Quantitative | 001-0              | EntityB_Blood      | B0001               | EntityB_Blood_vein   | mmol             | 4.52              | 8.548         | 0.542         | False   | 4.52 mmol          | mmol              |
| EntityB_Blood_Quantitative | 000-1              | EntityB Bld Auto   | B0002               | EntityB_Blood_µg/l   | mmol             | 9.58              | 8.548         | 0.542         | True    | 3587 µg/l          | µg/l              |
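
The relationship between `transformed_value`, the two thresholds and the `outlier` flag in the table above can be reproduced on a toy DataFrame. The sketch below assumes strict comparisons (a value equal to a threshold is not an outlier), which is an assumption and not something this page confirms; it then computes the share of flagged measurements per concepts-set.

```python
import pandas as pd

# Values copied from the example table above.
df = pd.DataFrame(
    {
        "concepts_set": [
            "EntityA_Blood_Quantitative",
            "EntityA_Blood_Quantitative",
            "EntityB_Blood_Quantitative",
            "EntityB_Blood_Quantitative",
            "EntityB_Blood_Quantitative",
        ],
        "transformed_value": [115.0, 220.0, 0.45, 4.52, 9.58],
        "max_threshold": [190.0, 190.0, 8.548, 8.548, 8.548],
        "min_threshold": [0.0, 0.0, 0.542, 0.542, 0.542],
    }
)

# Assumed rule: a value strictly outside [min_threshold, max_threshold]
# is flagged as an outlier.
df["outlier"] = (df["transformed_value"] > df["max_threshold"]) | (
    df["transformed_value"] < df["min_threshold"]
)

# Share of flagged measurements per concepts-set.
outlier_rate = df.groupby("concepts_set")["outlier"].mean()

print(df["outlier"].tolist())  # [False, True, True, False, True]
print(outlier_rate)
```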

## 5. Visualize the statistical summary of clean data

Finally, you can build and save two interactive dashboards and a summary table for each concepts-set. They describe various statistical properties of your clean data.

```python
from eds_scikit.biology import plot_biology_summary

plot_biology_summary(data)
```

Some example outputs:

- [Statistical summary table](../../_static/biology/viz/stats_summary.html)
- [Interactive dashboard describing the volumetric properties](../../_static/biology/viz/interactive_volumetry.html)
- [Interactive dashboard describing the distribution properties](../../_static/biology/viz/interactive_distribution.html)