# Diabetes as a risk factor for infarction

## Goal

The goal is to estimate the prevalence of diabetes among patients who had an infarction.

## Method

- The diabetic status of a patient is retrieved using ICD-10 codes.
- A "diabetic period" is built for each patient, starting at the date of the first diabetes-related ICD-10 occurrence.
- The same method is used to retrieve infarction occurrences.
- Each infarction event is tagged using the "diabetic period".

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
DBNAME = "coronaomop_unstable"

## Analysis

### Loading data

Set `DBNAME` to your own database name; it is used below to load the data.

In [3]:
import eds_scikit

spark, sc, sql = eds_scikit.improve_performances()

from eds_scikit.io.hive import HiveData

# Data from Hive
data = HiveData(DBNAME)

    To improve performances when using Spark and Koalas, please call `eds_scikit.improve_performances()`
    This function optimally configures Spark. Use it as:
    `spark, sc, sql = eds_scikit.improve_performances()`
    The functions respectively returns a SparkSession, a SparkContext and an sql method
    

In [4]:
import datetime

from eds_scikit.event import conditions_from_icd10
from eds_scikit.event.diabetes import diabetes_from_icd10
from eds_scikit.period.tagging_functions import tagging

We won't dive into the details of each function here, but you should check the available parameters in the documentation. For instance, see the documentation of the `diabetes_from_icd10` function [here][eds_scikit.event.diabetes.diabetes_from_icd10].

### Extraction of the diabetic status

Let us first retrieve the visits containing a diabetes-related ICD-10 code:

In [5]:
event_diabetes = diabetes_from_icd10(
    condition_occurrence=data.condition_occurrence,
    visit_occurrence=data.visit_occurrence,
    date_min=datetime.datetime(2017,1,1),
    date_max=None,
) 

In [6]:
event_diabetes.groupby(["concept","value"]).size()


concept              value                  
DIABETES_FROM_ICD10  DIABETES_TYPE_II           285544
                     DIABETES_MALNUTRITION         525
                     DIABETES_INSIPIDUS           2612
                     DIABETES_IN_PREGNANCY        6502
                     OTHER_DIABETES_MELLITUS     15124
                     DIABETES_TYPE_I             30056
dtype: int64

The [default configuration][eds_scikit.event.diabetes.DEFAULT_DIABETE_FROM_ICD10_CONFIG] shows that various diabetes types are extracted. Let's keep only the ones we're interested in:

In [51]:
kept_diabetes = [
    'DIABETES_TYPE_I',
    'DIABETES_TYPE_II',
    'OTHER_DIABETES_MELLITUS',
]

In [40]:
event_diabetes = event_diabetes[event_diabetes.value.isin(kept_diabetes)] 

Finally, we can switch from an event-level DataFrame to a person-level DataFrame:

In [41]:
person_diabetes = (
    event_diabetes
    .groupby(['person_id','concept','value'])
    .agg(
        t_start=('t_start','min'),
        t_end=('t_end','max'),
    )
    .reset_index()
)

# The diabetic period runs from the first occurrence up to the current date
person_diabetes.loc[:,'t_end'] = datetime.datetime.now()
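
The aggregation above can be pictured on toy data. The following is a minimal plain-Python sketch, independent of eds-scikit and Koalas: the `diabetic_periods` helper, the record layout and all values are hypothetical, and the fixed `today` argument stands in for `datetime.datetime.now()`.

```python
from datetime import date

# Toy event-level records (made-up values): (person_id, value, t_start).
events = [
    (1, "DIABETES_TYPE_II", date(2018, 3, 1)),
    (1, "DIABETES_TYPE_II", date(2017, 6, 15)),
    (2, "DIABETES_TYPE_I", date(2019, 1, 10)),
]


def diabetic_periods(events, today):
    """Return {(person_id, value): (first_occurrence, today)}."""
    first_seen = {}
    for person_id, value, t_start in events:
        key = (person_id, value)
        # Keep the earliest occurrence for each (person, diabetes type) pair.
        if key not in first_seen or t_start < first_seen[key]:
            first_seen[key] = t_start
    # Every period is closed at the same reference date.
    return {key: (start, today) for key, start in first_seen.items()}


periods = diabetic_periods(events, today=date(2024, 1, 1))
```

Each patient ends up with one period per diabetes type, which is the shape the tagging step expects.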

Let's run the pipeline defined above and extract everything as a Pandas DataFrame:

In [None]:
person_diabetes_pd = person_diabetes.to_pandas()

We can check the extracted concept:

In [75]:
person_diabetes_pd.concept.unique()

array(['DIABETES_FROM_ICD10'], dtype=object)

And the distribution of the corresponding values:

In [77]:
person_diabetes_pd.value.value_counts(normalize=True)

DIABETES_TYPE_II           0.790726
DIABETES_TYPE_I            0.136691
OTHER_DIABETES_MELLITUS    0.072583
Name: value, dtype: float64

### Extraction of infarction occurrences

As mentioned above, please check the [documentation of the `conditions_from_icd10` function][eds_scikit.event.conditions_from_icd10] for more implementation details.

In [42]:
infarction_codes = dict(
    INFARCTION=dict(
        code_list="I21",
        code_type="prefix",
    )
)

event_infarction = conditions_from_icd10(
    condition_occurrence=data.condition_occurrence,
    visit_occurrence=data.visit_occurrence,
    codes=infarction_codes,
    date_from_visit=True,
    additional_filtering=dict(
        condition_status_source_value={"DP", "DAS"}
    ),
    date_min=datetime.datetime(2017,1,1),
    date_max=None,
)

event_infarction_pd = event_infarction.to_pandas()
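
To give an intuition of what `code_type="prefix"` selects, here is a minimal plain-Python sketch. It assumes that a prefix rule keeps every code starting with the given string; this is an illustration, not the library's implementation. The `matches_prefix` helper is hypothetical, as are the two non-matching codes in the sample list.

```python
def matches_prefix(code, prefixes):
    """Return True if `code` starts with any of the given prefixes."""
    return code.startswith(tuple(prefixes))


# Codes written without a dot; "I220" and "E119" are included as non-matching examples.
codes = ["I2120", "I21400", "I2100", "I220", "E119"]

infarction_like = [code for code in codes if matches_prefix(code, ["I21"])]
```

Under this assumption, every subcode of I21 is kept, whatever its length.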

By using the dedicated [`tagging`][eds_scikit.period.tagging_functions.tagging] function, we can tag each infarction occurrence with the diabetic status of the patient **at the moment of the infarction**:

In [92]:
tagged_infarction = tagging(
    event_infarction_pd,
    person_diabetes_pd,
    concept_to_tag="DIABETES_FROM_ICD10",
    algo="intersection",
)
tagged_infarction.drop(columns=["person_id"]).head()

|   | t_start | t_end | concept | value | visit_occurrence_id | condition_status_source_value | DIABETES_TYPE_I | DIABETES_TYPE_II | OTHER_DIABETES_MELLITUS |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2017-01-06 08:40:00 | 2017-01-10 12:18:00 | INFARCTION | I21400 | 217977558 | DP | False | True | False |
| 1 | 2017-05-04 11:13:00 | 2017-05-05 15:11:00 | INFARCTION | I2120 | 251157356 | DP | False | False | False |
| 2 | 2017-05-04 11:13:00 | 2017-05-05 15:11:00 | INFARCTION | I21200 | 251157356 | DP | False | False | False |
| 3 | 2017-05-28 11:33:00 | 2017-06-15 13:33:00 | INFARCTION | I2140 | 259440193 | DP | False | False | False |
| 4 | 2017-07-30 12:57:00 | 2017-08-31 17:07:00 | INFARCTION | I2100 | 277123123 | DP | False | True | False |
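
The idea behind the tagging can be sketched in plain Python. This assumes that `algo="intersection"` flags an event when its `[t_start, t_end]` interval overlaps the patient's diabetic period; check the `tagging` documentation for the exact semantics. The `overlaps` helper and all dates below are hypothetical.

```python
from datetime import datetime


def overlaps(event, period):
    """Return True if two closed intervals (start, end) share at least one instant."""
    return event[0] <= period[1] and period[0] <= event[1]


# Hypothetical patient whose first diabetes-related code dates from March 2017.
diabetic_period = (datetime(2017, 3, 1), datetime(2024, 1, 1))

stay_before = (datetime(2017, 1, 6), datetime(2017, 1, 10))   # ends before the period starts
stay_after = (datetime(2017, 7, 30), datetime(2017, 8, 31))   # falls inside the period

tags = [overlaps(stay, diabetic_period) for stay in (stay_before, stay_after)]
```

With this reading, an infarction that occurred before the first diabetes code is not counted as an infarction in a diabetic patient.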


We can now compute the prevalence of diabetes among:

- patients who had an infarction
- the general population

In [93]:
diabetes_and_infarctus_prevalence = (
    100
    * tagged_infarction[
        tagged_infarction[kept_diabetes].any(axis=1)
    ].person_id.nunique()
    / tagged_infarction.person_id.nunique()
)

In [94]:
diabetes_prevalence = 100 * (
    person_diabetes_pd.person_id.nunique()
    / data.condition_occurrence.person_id.nunique()
)

In [73]:
print(f"Prevalence of diabetes in the general population: {diabetes_prevalence:.2f}%")
print(f"Prevalence of diabetes in infarction: {diabetes_and_infarctus_prevalence:.2f}%")

Prevalence of diabetes in the general population: 5.69%
Prevalence of diabetes in infarction: 24.60%
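
Both figures count distinct patients rather than events. A minimal plain-Python sketch of that computation on made-up identifiers (the `prevalence` helper is hypothetical and ignores the timing of the diabetes diagnosis):

```python
def prevalence(cases, population):
    """Percentage of distinct patients in `population` that also appear in `cases`."""
    population = set(population)
    if not population:
        raise ValueError("population is empty")
    return 100 * len(population & set(cases)) / len(population)


# Hypothetical patient identifiers.
infarction_patients = [1, 2, 2, 3, 4]  # one entry per infarction event
diabetic_patients = [2, 5]

rate = prevalence(diabetic_patients, infarction_patients)
```

Patient 2 has two infarction events but is counted once, in both the numerator and the denominator.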


## Saving cohort to disk

For further analysis, we may want to focus only on diabetic patients.  
As this will be a modest-sized cohort, its analysis may be easier with Pandas.  
We can thus store this cohort on disk and load it later as Pandas DataFrames:

In [18]:
from eds_scikit.io.files import PandasData

In [21]:
folder = './diabetes_data/'

In [None]:
data.persist_tables_to_folder(
    folder = folder,
    person_ids = person_diabetes_pd['person_id'],
)

The cohort is now easily available:

In [None]:
data = PandasData(folder)

As before, you can now use `data.person`, `data.visit_occurrence`, etc.  
**Here**, those tables will be Pandas DataFrames.