# Tutorial - Preparing measurement table

This tutorial takes you through the entire workflow of the [Biology][biology] module.

# Summary
1. <a href="#load-data" style="font-size:18px;">Load data</a>
2. <a href="#quick-use" style="font-size:18px;">Quick use : Preparing measurement table</a>
   - a) [Define biology concept sets](#define-biology-concept-set)
   - b) [Prepare measurements](#prepare-measurements)
3. <a href="#detailed-use" style="font-size:18px;">Detailed use : Analysing measurement table</a>
   - a) [Measurements statistics table](#stat-table)
   - b) [Measurements units correction](#units-correction)
   - c) [Plot biology summary](#plot-summary)
4. <a href="#further" style="font-size:18px;">Further : Concept Codes, Concepts Sets and Units</a>
   - a) [Concept codes relationships exploration](#concept-codes-explorer)
   - b) [Concepts Sets](#concepts-sets)
   - c) [Units](#units)


In [None]:
import eds_scikit
import pandas as pd

# 1 - Load data <a id="load-data"></a>

!!! danger "Big volume"
    The measurement table can be large. Do not forget to set a proper Spark configuration.

In [None]:
to_add_conf = [
    ("master", "yarn"),
    ("deploy_mode", "client"),
    ("spark.driver.memory", ...),
    ("spark.executor.memory", ...),
    ("spark.executor.cores", ...),
    ("spark.executor.memoryOverhead", ...),
    ("spark.driver.maxResultSize", ...),
    ...
]

spark, sc, sql = eds_scikit.improve_performances(to_add_conf=to_add_conf)

from eds_scikit.io.hive import HiveData

In [None]:
data = HiveData(
    spark_session=spark,
    database_name="cse_xxxxxxx_xxxxxxx",
    tables_to_load=[
        "care_site",
        "concept",
        "visit_occurrence",
        "measurement",
        "concept_relationship",
    ],
)

# 2 - Quick use : Preparing measurement table <a id="quick-use"></a>

## a) Define biology concept-sets <a id="define-biology-concept-set"></a>

In order to work on the measurements of interest, you can extract a list of concepts-sets by:

- Selecting [default concepts-sets](../../datasets/concepts-sets.md) provided in the library.
- Modifying the codes of a selected default concepts-set.
- Creating a concepts-set from scratch.

__Code selection can be tricky. See <a href="#concept-codes-explorer">Concept codes relationships exploration</a> section for more details on how to select them.__

In [None]:
from eds_scikit.biology import ConceptsSet

# Creating Concept-Set
custom_leukocytes = ConceptsSet("Custom_Leukocytes")

custom_leukocytes.add_concept_codes(
    concept_codes=['A0174', 'H6740', 'C8824'], 
    terminology='GLIMS_ANABIO' 
)
custom_leukocytes.add_concept_codes(
    concept_codes=['6690-2'], 
    terminology='ITM_LOINC'
)

# Importing Concept-Set (see 4.b for details on existing concepts sets)
glucose_blood = ConceptsSet("Glucose_Blood")

In [None]:
concepts_sets = [
    custom_leukocytes, 
    glucose_blood
]

## b) Prepare measurements <a id="prepare-measurements"></a>

Execution is lazy unless `convert_units=True`.

In [None]:
from eds_scikit.biology.utils.prepare_measurement import prepare_measurement_table

In [None]:
measurement_bioclean = prepare_measurement_table(data,
                                                 start_date="2022-01-01", end_date="2022-05-01",
                                                 concept_sets=concepts_sets,
                                                 convert_units=False,
                                                 get_all_terminologies=True
                                                )

__Now you have your measurement table mapped with concept set terminology.__ Next sections are about measurement codes analysis, units and plots.
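
To make "mapped with concept set terminology" concrete, here is a toy, pure-Python sketch of the idea: keep the rows that fall inside a date window and label each one with the concept set its code belongs to. It is not the library's implementation. The row layout, the `tag_measurements` helper, the code `Z9999` and the half-open date window are all assumptions made for illustration.

```python
from datetime import date

# Toy stand-in for a list of concept sets; the codes reuse Custom_Leukocytes.
concept_sets = {"Custom_Leukocytes": {"A0174", "H6740", "C8824"}}

# Hypothetical measurement rows.
measurements = [
    {"code": "A0174", "day": date(2022, 2, 3), "value": 7.1},
    {"code": "A0174", "day": date(2021, 12, 30), "value": 6.4},  # before the window
    {"code": "Z9999", "day": date(2022, 3, 1), "value": 1.0},  # code in no set
]


def tag_measurements(rows, sets, start, end):
    """Keep rows dated in [start, end) and label them with their concept set."""
    tagged = []
    for row in rows:
        if not (start <= row["day"] < end):
            continue
        for name, codes in sets.items():
            if row["code"] in codes:
                tagged.append({**row, "concept_set": name})
    return tagged


tagged = tag_measurements(
    measurements, concept_sets, date(2022, 1, 1), date(2022, 5, 1)
)
```

Only the first toy row survives: the second is outside the window and the third uses a code that belongs to no concept set.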

# 3 - Detailed use : Analysing measurement table <a id="detailed-use"></a>

## a) Measurements statistics table <a id="stat-table"></a>

In [None]:
from eds_scikit.biology import measurement_values_summary

In [17]:
stats_summary = measurement_values_summary(measurement_bioclean, 
                                           category_cols=["concept_set", "GLIMS_ANABIO_concept_code", "GLIMS_LOINC_concept_code"], 
                                           value_column="value_as_number", 
                                           unit_column="unit_source_value")


stats_summary

| concept_set | GLIMS_ANABIO_concept_code | no_units | unit_source_value | range_low_anomaly_count | range_high_anomaly_count | measurement_count | value_as_number_count | value_as_number_mean | value_as_number_std | value_as_number_min | value_as_number_25% | value_as_number_50% | value_as_number_75% | value_as_number_max |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Custom_Leukocytes | A0174 | 148 | x10*9/l | 813 | 1099 | 11857 | 11857 | 21 | 18 | 0 | 25 | 50 | 75 | 100 |
| Custom_Leukocytes | C8824 | 121 | x10*9/l | 1166 | 1196 | 11821 | 11821 | 20 | 20 | 0 | 25 | 50 | 75 | 100 |
| Custom_Leukocytes | C9784 | 83 | x10*9/l | 935 | 902 | 11082 | 11082 | 10 | 16 | 0 | 25 | 50 | 75 | 100 |
| Glucose_Blood | A0141 | 147 | mmol/l | 1179 | 976 | 12811 | 12811 | 27 | 20 | 0 | 25 | 50 | 75 | 100 |
| Glucose_Blood | A7338 | 43 | mmol/l | 819 | 755 | 11312 | 11312 | 27 | 13 | 0 | 25 | 50 | 75 | 100 |
| Glucose_Blood | A8424 | 176 | mmol/l | 916 | 936 | 12020 | 12020 | 14 | 25 | 0 | 25 | 50 | 75 | 100 |
| Glucose_Blood | B9553 | 107 | mmol/l | 794 | 1046 | 13409 | 13409 | 17 | 19 | 0 | 25 | 50 | 75 | 100 |
| Glucose_Blood | C0565 | 50 | g/l | 586 | 1030 | 9720 | 9720 | 18 | 26 | 0 | 25 | 50 | 75 | 100 |
| Glucose_Blood | C7236 | 64 | g/l | 1121 | 882 | 13685 | 13685 | 25 | 12 | 0 | 25 | 50 | 75 | 100 |
| Glucose_Blood | E7312 | 51 | mg/dl | 1266 | 874 | 10971 | 10971 | 15 | 17 | 0 | 25 | 50 | 75 | 100 |
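
As a rough illustration of the kind of per-group descriptive statistics shown in this table, the sketch below groups toy rows by (concept set, code, unit) and computes a few of the same columns with the standard library. It does not reproduce `measurement_values_summary`. The toy rows are hypothetical, and counting missing values in `measurement_count` but not in `value_count` is an assumption.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical toy rows: (concept_set, code, unit, value); None marks a missing value.
rows = [
    ("Glucose_Blood", "A0141", "mmol/l", 5.0),
    ("Glucose_Blood", "A0141", "mmol/l", 7.0),
    ("Glucose_Blood", "A0141", "mmol/l", None),
    ("Glucose_Blood", "C0565", "g/l", 0.9),
]

# Group the values by (concept_set, code, unit).
groups = defaultdict(list)
for concept_set, code, unit, value in rows:
    groups[(concept_set, code, unit)].append(value)

# Compute a few descriptive statistics per group.
summary = {}
for key, values in groups.items():
    present = [v for v in values if v is not None]
    summary[key] = {
        "measurement_count": len(values),
        "value_count": len(present),
        "mean": mean(present),
        "min": min(present),
        "50%": median(present),
        "max": max(present),
    }
```

Keeping the unit in the grouping key matters: the two toy glucose codes report in different units (mmol/l and g/l), so pooling their values before conversion would give meaningless statistics.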


## b) Measurements units correction <a id="units-correction"></a>

In [None]:
glucose_blood.add_conversion("mol", "g", 180)
glucose_blood.add_target_unit("mmol/l")

concepts_sets = [glucose_blood, custom_leukocytes]
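
The factor 180 is the approximate molar mass of glucose in g/mol. The arithmetic behind converting the `g/l` and `mg/dl` rows above to the `mmol/l` target can be sketched in plain Python as follows. The function names are hypothetical and this is not the library's conversion code.

```python
GLUCOSE_MOLAR_MASS_G_PER_MOL = 180  # same factor as in add_conversion("mol", "g", 180)


def g_per_l_to_mmol_per_l(value):
    # g/l -> mol/l (divide by g/mol), then mol/l -> mmol/l (multiply by 1000).
    return value / GLUCOSE_MOLAR_MASS_G_PER_MOL * 1000


def mg_per_dl_to_mmol_per_l(value):
    # mg/dl -> g/l first: 1 mg/dl = 0.01 g/l.
    return g_per_l_to_mmol_per_l(value * 0.01)


one_gram_per_litre = g_per_l_to_mmol_per_l(1.0)
hundred_mg_per_dl = mg_per_dl_to_mmol_per_l(100.0)
```

Both inputs describe the same concentration, so both results are about 5.56 mmol/l.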

In [None]:
measurement_bioclean = prepare_measurement_table(data, 
                                                 start_date="2022-01-01", end_date="2022-05-01",
                                                 concept_sets=concepts_sets,
                                                 convert_units=True, 
                                                 get_all_terminologies=False 
                                                )

In [18]:
stats_summary = measurement_values_summary(measurement_bioclean, 
                                           category_cols=["concept_set", "GLIMS_ANABIO_concept_code"], 
                                           value_column="value_as_number_normalized", #converted
                                           unit_column="unit_source_value_normalized")

stats_summary

| concept_set | GLIMS_ANABIO_concept_code | no_units | unit_source_value_normalized | range_low_anomaly_count | range_high_anomaly_count | measurement_count | value_as_number_normalized_count | value_as_number_normalized_mean | value_as_number_normalized_std | value_as_number_normalized_min | value_as_number_normalized_25% | value_as_number_normalized_50% | value_as_number_normalized_75% | value_as_number_normalized_max | value_as_number_count | value_as_number_mean | value_as_number_std | value_as_number_min | value_as_number_max | value_as_number_25% | value_as_number_50% | value_as_number_75% |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Custom_Leukocytes | A0174 | 215 | 10*6/l | 1259 | 1131 | 16187 | 1377440.0 | 8961.115795 | 9214.091859 | 0.0 | 5600.0 | 7690.0 | 10590.0 | 969720.0 | 16187 | 27 | 17 | 0 | 100 | 25 | 50 | 75 |
| Custom_Leukocytes | C8824 | 41 | 10*6/l | 1092 | 1286 | 12351 | 78.0 | 27487.307692 | 85884.332632 | 1100.0 | 5000.0 | 9700.0 | 20200.0 | 697500.0 | 12351 | 18 | 16 | 0 | 100 | 25 | 50 | 75 |
| Custom_Leukocytes | C9784 | 80 | 10*6/l | 886 | 855 | 13957 | 96962.0 | 9089.040139 | 12083.185385 | 0.0 | 5600.0 | 7600.0 | 10460.0 | 725400.0 | 13957 | 21 | 4 | 0 | 100 | 25 | 50 | 75 |
| Glucose_Blood | A0141 | 103 | mmol/l | 808 | 821 | 13563 | 105958.0 | 6.541606 | 4.859138 | 0.1 | 4.8 | 5.6 | 7.0 | 1007.0 | 13563 | 29 | 8 | 0 | 100 | 25 | 50 | 75 |
| Glucose_Blood | A7338 | 197 | mmol/l | 895 | 1064 | 12503 | 35302.0 | 5.837275 | 2.239133 | 0.3 | 4.72 | 5.2 | 6.03 | 49.68 | 12503 | 20 | 8 | 0 | 100 | 25 | 50 | 75 |
| Glucose_Blood | A8424 | 88 | mmol/l | 801 | 959 | 10270 | 18377.0 | 6.256536 | 3.357366 | 0.1 | 4.7 | 5.4 | 6.8 | 65.5 | 10270 | 17 | 24 | 0 | 100 | 25 | 50 | 75 |
| Glucose_Blood | B9553 | 200 | mmol/l | 999 | 950 | 16199 | 3940.0 | 6.553718 | 3.23465 | 1.8 | 4.8 | 5.5 | 7.2 | 44.8 | 16199 | 19 | 22 | 0 | 100 | 25 | 50 | 75 |
| Glucose_Blood | C0565 | 193 | mmol/l | 991 | 1228 | 10263 | 13431.0 | 6.020934 | 2.496683 | 0.305556 | 4.777778 | 5.277778 | 6.222222 | 50.183333 | 10263 | 30 | 22 | 0 | 100 | 25 | 50 | 75 |
| Glucose_Blood | C7236 | 31 | mmol/l | 1090 | 970 | 13942 | 5246.0 | 7.03372 | 3.799966 | 0.455556 | 5.0 | 5.777778 | 7.611111 | 58.161111 | 13942 | 15 | 21 | 0 | 100 | 25 | 50 | 75 |
| Glucose_Blood | E7312 | 182 | mmol/l | 678 | 1334 | 16282 | 16757.0 | 6.440041 | 3.487719 | 0.6 | 4.7 | 5.405556 | 6.805556 | 53.427778 | 16282 | 19 | 22 | 0 | 100 | 25 | 50 | 75 |


## c) Plot biology summary <a id="plot-summary"></a>

Applying ```plot_biology_summary``` to the computed measurement dataframe, merged with care sites, generates interactive exploration plots such as:

- [Interactive volumetry](../../_static/biology/viz/interactive_volumetry.html)

- [Interactive distribution](../../_static/biology/viz/interactive_distribution.html)

In [None]:
from eds_scikit.biology.viz import plot_biology_summary

In [None]:
measurement_bioclean = measurement_bioclean.merge(data.visit_occurrence[["care_site_id", "visit_occurrence_id"]], on="visit_occurrence_id")
measurement_bioclean = measurement_bioclean.merge(data.care_site[["care_site_id", "care_site_short_name"]], on="care_site_id")
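
The two merges above chain measurement → visit → care site. The toy sketch below walks through the same chain with plain dictionaries, assuming the default inner-join behaviour of `merge`. All identifiers and site names are hypothetical.

```python
# Hypothetical toy rows, keyed like the tables used above.
measurements = [
    {"measurement_id": 1, "visit_occurrence_id": 10},
    {"measurement_id": 2, "visit_occurrence_id": 11},
    {"measurement_id": 3, "visit_occurrence_id": 99},  # no matching visit
]
visit_to_care_site = {10: 100, 11: 101}  # visit_occurrence_id -> care_site_id
care_site_names = {100: "Site A", 101: "Site B"}  # care_site_id -> short name

enriched = []
for row in measurements:
    care_site_id = visit_to_care_site.get(row["visit_occurrence_id"])
    if care_site_id is None or care_site_id not in care_site_names:
        continue  # an inner join drops rows without a match
    enriched.append(
        {
            **row,
            "care_site_id": care_site_id,
            "care_site_short_name": care_site_names[care_site_id],
        }
    )
```

With an inner join, measurements whose visit or care site is missing drop out silently, so comparing row counts before and after the merges is a cheap sanity check.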

In [None]:
plot_biology_summary(measurement_bioclean, value_column="value_as_number_normalized") 

# 4 - Further : Concept Codes, Concepts Sets and Units <a id="further"></a>

## a) Concept codes relationships exploration <a id="concept-codes-explorer"></a>

Relationships between concept codes can be tricky to understand and manipulate. The function ```prepare_biology_relationship_table``` builds a __mapping dataframe between the main AP-HP biology referentials__.

See ```eds_scikit.settings.mapping``` and ```eds_scikit.settings.source_terminologies``` configurations for mapping details.

In [None]:
from eds_scikit.biology.utils.prepare_relationship import prepare_biology_relationship_table

biology_relationship_table = prepare_biology_relationship_table(data)
biology_relationship_table = biology_relationship_table.to_pandas()

Relationship between codes from different referentials.

In [19]:
columns = [col for col in biology_relationship_table.columns if "concept_code" in col]

biology_relationship_table[biology_relationship_table.GLIMS_ANABIO_concept_code.isin(['A0174', 'H6740', 'C8824'])][columns].drop_duplicates()

pd.read_csv("anabio_relation_1", index_col=0)

|  | ANALYSES_LABORATOIRE_concept_code | GLIMS_ANABIO_concept_code | GLIMS_LOINC_concept_code | ITM_ANABIO_concept_code | ITM_LOINC_concept_code |
|---|---|---|---|---|---|
| 0 | 0 | C8824 | 33256-9 | Unknown | Unknown |
| 1 | 1 | A0174 | 6690-2 | A0174 | 6690-2 |
| 2 | 1 | A0174 | 26464-8 | A0174 | 6690-2 |


In [20]:
biology_relationship_table[biology_relationship_table.GLIMS_LOINC_concept_code.isin(['33256-9', '6690-2', '26464-8'])][columns].drop_duplicates()

|  | ANALYSES_LABORATOIRE_concept_code | GLIMS_ANABIO_concept_code | GLIMS_LOINC_concept_code | ITM_ANABIO_concept_code | ITM_LOINC_concept_code |
|---|---|---|---|---|---|
| 0 | 4 | E4358 | 6690-2 | Unknown | Unknown |
| 1 | 2 | C9097 | 26464-8 | Unknown | Unknown |
| 2 | 6 | K3232 | 6690-2 | Unknown | Unknown |
| 3 | 5 | E6953 | 26464-8 | Unknown | Unknown |
| 4 | 1 | C8824 | 33256-9 | Unknown | Unknown |
| 5 | 4 | E4358 | 26464-8 | Unknown | Unknown |
| 6 | 5 | E6953 | 6690-2 | Unknown | Unknown |
| 7 | 7 | K6094 | 6690-2 | Unknown | Unknown |
| 8 | 0 | C9784 | 6690-2 | C9784 | 6690-2 |
| 9 | 0 | C9784 | 26464-8 | C9784 | 6690-2 |
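
One way to see why code selection is tricky is to index the (ANABIO, LOINC) pairs of the output above in both directions. The sketch below does this with the standard library. The pairs are copied from that output, and the variable names are illustrative.

```python
from collections import defaultdict

# (GLIMS_ANABIO_concept_code, GLIMS_LOINC_concept_code) pairs from the output above.
pairs = [
    ("E4358", "6690-2"), ("C9097", "26464-8"), ("K3232", "6690-2"),
    ("E6953", "26464-8"), ("C8824", "33256-9"), ("E4358", "26464-8"),
    ("E6953", "6690-2"), ("K6094", "6690-2"), ("C9784", "6690-2"),
    ("C9784", "26464-8"),
]

# Index the mapping in both directions.
anabio_by_loinc = defaultdict(set)
loinc_by_anabio = defaultdict(set)
for anabio, loinc in pairs:
    anabio_by_loinc[loinc].add(anabio)
    loinc_by_anabio[anabio].add(loinc)

# ANABIO codes that map to more than one LOINC code deserve a closer look.
ambiguous_anabio = sorted(
    code for code, loincs in loinc_by_anabio.items() if len(loincs) > 1
)
```

The mapping is many-to-many in this sample: one LOINC code covers several ANABIO codes, and some ANABIO codes map to two LOINC codes, so selecting by one terminology can pull in more (or fewer) measurements than selecting by the other.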


## b) Concepts-Sets <a id="concepts-sets"></a>

To list all available concepts sets, see `datasets.default_concepts_sets`. More details about their definition and how they are built can be found in this [section](#concepts-sets).


In [22]:
from eds_scikit import datasets
from eds_scikit.biology import ConceptsSet

In [23]:
print(ConceptsSet("Troponin").concept_codes)

{'GLIMS_ANABIO': ['A0283', 'C5560', 'F9934', 'E6954', 'L3534', 'G7716', 'J5184', 'A3832', 'E7249']}
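
The printed value suggests that `concept_codes` maps each terminology to a list of codes. A toy container with that shape can be sketched as below. `ToyConceptsSet` is a hypothetical stand-in, not the `eds_scikit` class, and skipping duplicate codes is an assumption.

```python
class ToyConceptsSet:
    """Hypothetical stand-in, not the eds_scikit ConceptsSet class."""

    def __init__(self, name):
        self.name = name
        self.concept_codes = {}  # terminology -> list of codes

    def add_concept_codes(self, concept_codes, terminology):
        known = self.concept_codes.setdefault(terminology, [])
        for code in concept_codes:
            if code not in known:  # assumption: duplicates are skipped
                known.append(code)


toy = ToyConceptsSet("Custom_Leukocytes")
toy.add_concept_codes(["A0174", "H6740", "C8824"], terminology="GLIMS_ANABIO")
toy.add_concept_codes(["6690-2"], terminology="ITM_LOINC")
```

Keying the codes by terminology lets a single concepts set hold codes from several referentials, as `Custom_Leukocytes` does in section 2.a.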


## c) Units <a id="units"></a>

The `Units` module makes conversion between units easier. It uses the configuration files `datasets.units` and `datasets.elements`.

In [None]:
from eds_scikit import datasets

In [24]:
from eds_scikit.biology.utils.process_units import Units

In [25]:
units = Units()

print("L to ml : ", units.convert_unit("L", "ml"))
print("m/s to m/h : ", units.convert_unit("m/s", "m/h"))
print("g to mol : ", units.convert_unit("g", "mol"))
units.add_conversion("mol", "g", 180)
print("g to mol : ", units.convert_unit("g", "mol"))

L to ml :  1000.0
m/s to m/h :  3600.000000001008
g to mol :  nan
g to mol :  0.005555555555555556
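
The last two lines of the output show that an unknown conversion yields `nan`, and that registering `mol → g` with factor 180 also enables the inverse `g → mol` conversion (1/180). A toy registry with the same observable behaviour can be sketched as follows. `ToyUnits` is hypothetical: it handles neither prefixes nor compound units such as `m/s`, and says nothing about how `Units` is implemented.

```python
import math


class ToyUnits:
    """Hypothetical stand-in, not the eds_scikit Units class."""

    def __init__(self):
        self.factors = {}  # (from_unit, to_unit) -> multiplicative factor

    def add_conversion(self, from_unit, to_unit, factor):
        # Store both directions so the inverse lookup also works.
        self.factors[(from_unit, to_unit)] = factor
        self.factors[(to_unit, from_unit)] = 1 / factor

    def convert_unit(self, from_unit, to_unit):
        if from_unit == to_unit:
            return 1.0
        # Unknown conversions return nan instead of raising.
        return self.factors.get((from_unit, to_unit), math.nan)


toy_units = ToyUnits()
before = toy_units.convert_unit("g", "mol")
toy_units.add_conversion("mol", "g", 180)
after = toy_units.convert_unit("g", "mol")
```

Returning `nan` for an unknown conversion keeps the toy consistent with the output above, where the first `g to mol` call prints `nan` rather than failing.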
