In [32]:
import pandas as pd
import numpy as np
pd.set_option('display.max_rows', 120)
pd.set_option('display.max_columns', 120)

EDA on possible groupings among: 1) blood gases, 2) "cytes", 3) bilirubin

In [20]:
df = pd.read_excel("../data/dataset.xlsx")
df.head(5)

Unnamed: 0,Patient ID,Patient age quantile,SARS-Cov-2 exam result,"Patient addmited to regular ward (1=yes, 0=no)","Patient addmited to semi-intensive unit (1=yes, 0=no)","Patient addmited to intensive care unit (1=yes, 0=no)",Hematocrit,Hemoglobin,Platelets,Mean platelet volume,...,Hb saturation (arterial blood gases),pCO2 (arterial blood gas analysis),Base excess (arterial blood gas analysis),pH (arterial blood gas analysis),Total CO2 (arterial blood gas analysis),HCO3 (arterial blood gas analysis),pO2 (arterial blood gas analysis),Arteiral Fio2,Phosphor,ctO2 (arterial blood gas analysis)
0,44477f75e8169d2,13,negative,0,0,0,,,,,...,,,,,,,,,,
1,126e9dd13932f68,17,negative,0,0,0,0.236515,-0.02234,-0.517413,0.010677,...,,,,,,,,,,
2,a46b4402a0e5696,8,negative,0,0,0,,,,,...,,,,,,,,,,
3,f7d619a94f97c45,5,negative,0,0,0,,,,,...,,,,,,,,,,
4,d9e41465789c2b5,15,negative,0,0,0,,,,,...,,,,,,,,,,


For subsequent analysis, create a y variable for the level of care: regular ward, semi-intensive unit, ICU, or discharged (admitted to none of the three).

In [21]:
conditions = [
    (df['Patient addmited to regular ward (1=yes, 0=no)'] == 1),
    (df['Patient addmited to semi-intensive unit (1=yes, 0=no)'] == 1),
    (df['Patient addmited to intensive care unit (1=yes, 0=no)'] == 1),
    (df['Patient addmited to intensive care unit (1=yes, 0=no)'] == 0) & 
     (df['Patient addmited to semi-intensive unit (1=yes, 0=no)'] == 0) &
     (df['Patient addmited to regular ward (1=yes, 0=no)'] == 0)]

# create a list of the values we want to assign for each condition
values = ['regular', 'semi', 'icu', 'discharged']

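# create a new column and use np.select to assign values to it using our lists as arguments;
# a string default keeps the dtype consistent (on recent NumPy the numeric default of 0 can raise a TypeError with string choices)
df['y'] = np.select(conditions, values, default='unknown')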

df['y'].value_counts()

discharged    5474
regular         79
semi            50
icu             41
Name: y, dtype: int64
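
`np.select` assigns the first condition that matches, so a patient flagged in more than one unit would silently be labelled `regular`. A quick sanity check, sketched here on a small made-up frame (`toy`, `admit_cols` and `flags_per_patient` are illustrative names, not from the notebook), is to confirm that no row has more than one admission flag set:

```python
import pandas as pd

# Illustrative toy data; in the notebook this would be `df`.
admit_cols = [
    'Patient addmited to regular ward (1=yes, 0=no)',
    'Patient addmited to semi-intensive unit (1=yes, 0=no)',
    'Patient addmited to intensive care unit (1=yes, 0=no)',
]
toy = pd.DataFrame(
    [[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]],
    columns=admit_cols,
)

# Number of admission flags set per patient (0 = discharged, 1 = admitted somewhere).
flags_per_patient = toy[admit_cols].sum(axis=1)

# True if the levels of care are mutually exclusive.
mutually_exclusive = bool((flags_per_patient <= 1).all())
print(flags_per_patient.value_counts().sort_index())
```

If `mutually_exclusive` came back False on the real data, the order of `conditions` would decide the label and should be chosen deliberately.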

## Blood gases

In [33]:
# get ids and all the blood gas cols
ids = df.loc[:, 'Patient ID':'Patient addmited to intensive care unit (1=yes, 0=no)']
bg_1 = df.loc[:, 'pCO2 (venous blood gas analysis)':'HCO3 (venous blood gas analysis)']
bg_2 = df.loc[:, 'pCO2 (arterial blood gas analysis)':'ctO2 (arterial blood gas analysis)']
bg = pd.concat([ids, df['y'], bg_1, bg_2], axis=1)

# filter to only positive patients
bg = bg[bg['SARS-Cov-2 exam result'] == 'positive']
bg

Unnamed: 0,Patient ID,Patient age quantile,SARS-Cov-2 exam result,"Patient addmited to regular ward (1=yes, 0=no)","Patient addmited to semi-intensive unit (1=yes, 0=no)","Patient addmited to intensive care unit (1=yes, 0=no)",y,pCO2 (venous blood gas analysis),Hb saturation (venous blood gas analysis),Base excess (venous blood gas analysis),pO2 (venous blood gas analysis),Fio2 (venous blood gas analysis),Total CO2 (venous blood gas analysis),pH (venous blood gas analysis),HCO3 (venous blood gas analysis),pCO2 (arterial blood gas analysis),Base excess (arterial blood gas analysis),pH (arterial blood gas analysis),Total CO2 (arterial blood gas analysis),HCO3 (arterial blood gas analysis),pO2 (arterial blood gas analysis),Arteiral Fio2,Phosphor,ctO2 (arterial blood gas analysis)
67,78511c183ae18bc,7,positive,0,0,0,discharged,,,,,,,,,,,,,,,,,
284,d7834ed75f2da44,16,positive,1,0,0,regular,,,,,,,,,,,,,,,,,
513,b16b49f7bd3e692,10,positive,0,0,0,discharged,,,,,,,,,,,,,,,,,
568,4382f5ea05e60c4,2,positive,0,0,0,discharged,-0.090035,0.337027,-0.611396,-0.084646,,-0.479346,-0.436537,-0.512865,,,,,,,,,
676,d3729cd2658ca64,15,positive,0,0,0,discharged,,,,,,,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5632,5c386388ba3c3f0,16,positive,0,0,0,discharged,,,,,,,,,,,,,,,,,
5633,9f8dfe2ae239238,4,positive,0,0,0,discharged,,,,,,,,,,,,,,,,,
5634,db77903261ab6d0,15,positive,0,0,0,discharged,,,,,,,,,,,,,,,,,
5639,ae66feb9e4dc3a0,3,positive,0,0,0,discharged,,,,,,,,,,,,,,,,,


All variables are more than 90% missing. Perhaps we should keep variables that are less than 98% missing instead.

There are two clusters of blood gas tests: arterial blood gas analysis and venous blood gas analysis. Fio2 (venous blood gas analysis), Phosphor and Arteiral Fio2 (as the column is spelled in the dataset) will be excluded for having too many missing values.

In [23]:
bg_missing = (bg.isnull().sum().sort_values(ascending = False)).to_frame(name = 'num')
bg_missing.reset_index(inplace=True)
bg_missing.rename(columns = {'index':'var'}, inplace = True)
bg_missing['prop'] = bg_missing['num']/len(bg)
bg_missing[bg_missing.prop > 0]

Unnamed: 0,var,num,prop
0,Fio2 (venous blood gas analysis),558,1.0
1,Phosphor,557,0.998208
2,Arteiral Fio2,549,0.983871
3,pH (arterial blood gas analysis),545,0.976703
4,pCO2 (arterial blood gas analysis),545,0.976703
5,Base excess (arterial blood gas analysis),545,0.976703
6,ctO2 (arterial blood gas analysis),545,0.976703
7,Total CO2 (arterial blood gas analysis),545,0.976703
8,HCO3 (arterial blood gas analysis),545,0.976703
9,pO2 (arterial blood gas analysis),545,0.976703
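
The same missingness table is built for each group of tests, so it could be wrapped in a small helper. A minimal sketch, assuming a hypothetical function name `missing_summary` and a made-up `toy` frame (the 0.98 cutoff is the threshold floated above, not a settled choice):

```python
import numpy as np
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Count and proportion of missing values per column, most missing first."""
    out = (
        frame.isnull().sum()
        .sort_values(ascending=False)
        .rename_axis('var')
        .reset_index(name='num')
    )
    out['prop'] = out['num'] / len(frame)
    # Keep only columns that have at least one missing value.
    return out[out['prop'] > 0].reset_index(drop=True)

# Made-up data: 'a' is entirely missing, 'b' is 75% missing, 'c' is complete.
toy = pd.DataFrame({
    'a': [np.nan, np.nan, np.nan, np.nan],
    'b': [1.0, np.nan, np.nan, np.nan],
    'c': [1.0, 2.0, 3.0, 4.0],
})
summary = missing_summary(toy)

# Columns with missing values that fall under the cutoff.
max_missing = 0.98
keep = summary.loc[summary['prop'] < max_missing, 'var'].tolist()
print(summary)
```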


Let's see how these values vary by level of care: discharged (sent home), regular ward, semi-intensive unit, ICU.

For many of these tests, mean results differ substantially across levels of care, although with so few non-missing values each group mean rests on a handful of patients. In a few instances, regular-ward and discharged patients have similar results.

In [31]:
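# drop the age and admission flags and the three near-empty tests
bg_tests = bg.drop(columns=['Patient age quantile',
                            'Patient addmited to regular ward (1=yes, 0=no)',
                            'Patient addmited to semi-intensive unit (1=yes, 0=no)',
                            'Patient addmited to intensive care unit (1=yes, 0=no)',
                            'Fio2 (venous blood gas analysis)',
                            'Phosphor',
                            'Arteiral Fio2'])
# numeric_only=True skips the string columns (Patient ID, exam result); pandas >= 2.0 raises on them otherwise
bg_tests.groupby('y').mean(numeric_only=True)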

Unnamed: 0_level_0,pCO2 (venous blood gas analysis),Hb saturation (venous blood gas analysis),Base excess (venous blood gas analysis),pO2 (venous blood gas analysis),Total CO2 (venous blood gas analysis),pH (venous blood gas analysis),HCO3 (venous blood gas analysis),pCO2 (arterial blood gas analysis),Base excess (arterial blood gas analysis),pH (arterial blood gas analysis),Total CO2 (arterial blood gas analysis),HCO3 (arterial blood gas analysis),pO2 (arterial blood gas analysis),ctO2 (arterial blood gas analysis)
y,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
discharged,0.708986,-0.004499,0.025698,-0.138534,0.296007,-0.672211,0.270566,-0.590135,-0.450511,0.354563,-0.986938,-0.971345,0.636332,-0.742191
icu,-0.034691,-0.078813,-0.02891,-0.04423,-0.100626,-0.01756,-0.092881,-0.184417,0.706046,0.396815,0.324521,0.385189,1.175689,0.182693
regular,-0.095569,-0.252106,0.415236,-0.316364,0.239199,0.426555,0.26249,-0.42416,-0.235152,0.250742,-0.599202,-0.568496,0.600244,0.902048
semi,-0.717261,1.010594,0.262333,1.325423,-0.213219,0.911171,-0.157494,-0.258184,0.394972,0.369049,-0.006195,0.031668,-0.354516,-0.105048


We know these tests form two groups: venous blood gas analysis and arterial blood gas analysis. We also know that the tests within each group have the same number of missing values. Double-check that a patient who gets one venous (or one arterial) blood gas lab gets them all. If this is the case:
- one binary variable for whether venous blood gas analysis was run; impute missing values with the mean
- one binary variable for whether arterial blood gas analysis was run; impute missing values with the mean
- one binary variable for whether both venous and arterial blood gas analyses were run
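
The check and the three indicators above could be sketched as follows. This is only an illustration on made-up values: `toy`, `run_together`, `venous_run`, `arterial_run` and `both_run` are hypothetical names, and selecting columns by the substrings 'venous' and 'arterial' is an assumption that would need checking against the real column names.

```python
import numpy as np
import pandas as pd

# Illustrative toy data following the notebook's naming pattern; values are made up.
toy = pd.DataFrame({
    'pH (venous blood gas analysis)':     [0.5, np.nan, -0.5, np.nan],
    'pCO2 (venous blood gas analysis)':   [1.0, np.nan, -1.0, np.nan],
    'pH (arterial blood gas analysis)':   [np.nan, np.nan, 0.2, np.nan],
    'pCO2 (arterial blood gas analysis)': [np.nan, np.nan, 0.4, np.nan],
})

venous = [c for c in toy.columns if 'venous' in c]
arterial = [c for c in toy.columns if 'arterial' in c]

def run_together(frame, cols):
    """True if every row has either all or none of `cols` recorded."""
    n_present = frame[cols].notnull().sum(axis=1)
    return bool(n_present.isin([0, len(cols)]).all())

panels_complete = run_together(toy, venous) and run_together(toy, arterial)

# One indicator per panel, plus one for both panels.
toy['venous_run'] = toy[venous].notnull().any(axis=1).astype(int)
toy['arterial_run'] = toy[arterial].notnull().any(axis=1).astype(int)
toy['both_run'] = toy['venous_run'] * toy['arterial_run']

# Mean-impute the lab values; the indicators must be computed before this step.
labs = venous + arterial
imputed = toy.copy()
imputed[labs] = imputed[labs].fillna(toy[labs].mean())
```

Computing the indicators first matters: once the missing values are filled, there is no way to recover who actually had the panel run.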

## Bilirubin

From webmd:
"A bilirubin test measures the amount of bilirubin in your blood. It’s used to help find the cause of health conditions like jaundice, anemia, and liver disease.

Bilirubin is an orange-yellow pigment that occurs normally when part of your red blood cells break down. Your liver takes the bilirubin from your blood and changes its chemical make-up so that most of it is passed through your poop as bile.

If your bilirubin levels are higher than normal, it’s a sign that either your red blood cells are breaking down at an unusual rate or that your liver isn’t breaking down waste properly and clearing the bilirubin from your blood.

In children and adults, doctors use it to diagnose and monitor liver and bile duct diseases. These include cirrhosis, hepatitis, and gallstones.

It’ll also help determine if you have sickle cell disease or other conditions that cause hemolytic anemia. That’s a disorder where red blood cells are destroyed faster than they’re made."

From our data dictionary:
* 'Total Bilirubin', Normal results for a total bilirubin test are 1.2 milligrams per deciliter (mg/dL) for adults and usually 1 mg/dL for those under 18. Normal results for direct bilirubin are generally 0.3 mg/dL.
* 'Direct Bilirubin', direct bilirubin travels freely through your bloodstream to your liver. Most of this bilirubin passes into the small intestine. A very small amount passes into your kidneys and is excreted in your urine. This bilirubin also gives urine its distinctive yellow color.
* 'Indirect Bilirubin', Indirect bilirubin is the difference between total and direct bilirubin. Common causes of higher indirect bilirubin include: Hemolytic anemia.

In [36]:
# get ids and all the bilirubin cols
ids = df.loc[:, 'Patient ID':'Patient addmited to intensive care unit (1=yes, 0=no)']
bili = df.loc[:, 'Total Bilirubin':'Indirect Bilirubin']
bili = pd.concat([ids, df['y'], bili], axis=1)

# filter to only positive patients
bili = bili[bili['SARS-Cov-2 exam result'] == 'positive']
bili.head(5)

Unnamed: 0,Patient ID,Patient age quantile,SARS-Cov-2 exam result,"Patient addmited to regular ward (1=yes, 0=no)","Patient addmited to semi-intensive unit (1=yes, 0=no)","Patient addmited to intensive care unit (1=yes, 0=no)",y,Total Bilirubin,Direct Bilirubin,Indirect Bilirubin
67,78511c183ae18bc,7,positive,0,0,0,discharged,,,
284,d7834ed75f2da44,16,positive,1,0,0,regular,,,
513,b16b49f7bd3e692,10,positive,0,0,0,discharged,,,
568,4382f5ea05e60c4,2,positive,0,0,0,discharged,1.355535,1.163312,1.198484
676,d3729cd2658ca64,15,positive,0,0,0,discharged,,,


Clearly these tests are run together.

In [38]:
bili_missing = (bili.isnull().sum().sort_values(ascending = False)).to_frame(name = 'num')
bili_missing.reset_index(inplace=True)
bili_missing.rename(columns = {'index':'var'}, inplace = True)
bili_missing['prop'] = bili_missing['num']/len(bili)
bili_missing[bili_missing.prop > 0]

Unnamed: 0,var,num,prop
0,Indirect Bilirubin,518,0.928315
1,Direct Bilirubin,518,0.928315
2,Total Bilirubin,518,0.928315


Variation across the levels of our response variable? ICU patients have the highest mean values across the board. Discharged and regular-ward patients have similar values.

In [42]:
bili[['y','Indirect Bilirubin', 'Direct Bilirubin', 'Total Bilirubin']].groupby('y').mean()

Unnamed: 0_level_0,Indirect Bilirubin,Direct Bilirubin,Total Bilirubin
y,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
discharged,-0.243484,-0.16985,-0.240498
icu,0.09063,0.14261,0.131181
regular,-0.309428,-0.185473,-0.289691
semi,-0.442781,-0.100414,-0.327952


Can have a binary variable for whether bilirubin tests were run, and then impute missing values for each category. Note that total bilirubin is just direct + indirect, so impute those two based on the mean and add the means together?
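
Whether adding the means is valid depends on the scale. The values in this dataset look standardized (row 568 has total 1.356 against direct 1.163 and indirect 1.198), so it is worth checking the identity before relying on it. A minimal sketch on made-up raw values (`raw`, `z` and `is_additive` are illustrative names, and the numbers are not from the dataset): total equals direct + indirect in raw units, but not once each column is z-scored separately.

```python
import numpy as np
import pandas as pd

# Made-up raw values in mg/dL; total is constructed as direct + indirect.
raw = pd.DataFrame({
    'Direct Bilirubin':   [0.1, 0.2, 0.3, 0.6],
    'Indirect Bilirubin': [0.4, 0.5, 0.9, 0.6],
})
raw['Total Bilirubin'] = raw['Direct Bilirubin'] + raw['Indirect Bilirubin']

# Standardize each column separately (z-score).
z = (raw - raw.mean()) / raw.std()

def is_additive(frame):
    """True if Total equals Direct + Indirect in every row."""
    return bool(np.allclose(
        frame['Total Bilirubin'],
        frame['Direct Bilirubin'] + frame['Indirect Bilirubin'],
    ))

raw_additive = is_additive(raw)
z_additive = is_additive(z)
print(raw_additive, z_additive)
```

On standardized columns it is probably simpler to impute each of the three variables with its own mean.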

## "Cytes"


* 'Rods #',
* 'Segmented',
* 'Promyelocytes', A promyelocyte (or progranulocyte) is a granulocyte precursor, developing from the myeloblast and developing into the myelocyte. Promyelocytes measure 12-20 microns in diameter.
* 'Metamyelocytes', A metamyelocyte is a cell undergoing granulopoiesis, derived from a myelocyte, and leading to a band cell. It is characterized by the appearance of a bent nucleus, cytoplasmic granules, and the absence of visible nucleoli
* 'Myelocytes', A myelocyte is a young cell of the granulocytic series, occurring normally in bone marrow (can be found in circulating blood when caused by certain diseases)
* 'Myeloblasts', A unipotent stem cell which differentiates into the effectors of the granulocyte series.

In [46]:
# get ids and all the cytes cols
ids = df.loc[:, 'Patient ID':'Patient addmited to intensive care unit (1=yes, 0=no)']
cytes = df.loc[:, 'Rods #':'Myeloblasts']
cytes = pd.concat([ids, df['y'], cytes], axis=1)

# filter to only positive patients
cytes = cytes[cytes['SARS-Cov-2 exam result'] == 'positive']
cytes.head(5)

Unnamed: 0,Patient ID,Patient age quantile,SARS-Cov-2 exam result,"Patient addmited to regular ward (1=yes, 0=no)","Patient addmited to semi-intensive unit (1=yes, 0=no)","Patient addmited to intensive care unit (1=yes, 0=no)",y,Rods #,Segmented,Promyelocytes,Metamyelocytes,Myelocytes,Myeloblasts
67,78511c183ae18bc,7,positive,0,0,0,discharged,,,,,,
284,d7834ed75f2da44,16,positive,1,0,0,regular,,,,,,
513,b16b49f7bd3e692,10,positive,0,0,0,discharged,,,,,,
568,4382f5ea05e60c4,2,positive,0,0,0,discharged,,,,,,
676,d3729cd2658ca64,15,positive,0,0,0,discharged,,,,,,


Missing values: all six variables are more than 98 percent missing. These will just become a single binary variable for whether or not the cytes tests were run.

In [51]:
cytes_missing = (cytes.isnull().sum().sort_values(ascending = False)).to_frame(name = 'num')
cytes_missing.reset_index(inplace=True)
cytes_missing.rename(columns = {'index':'var'}, inplace = True)
cytes_missing['prop'] = cytes_missing['num']/len(cytes)
cytes_missing[cytes_missing.prop > 0]

Unnamed: 0,var,num,prop
0,Myeloblasts,549,0.983871
1,Myelocytes,549,0.983871
2,Metamyelocytes,549,0.983871
3,Promyelocytes,549,0.983871
4,Segmented,549,0.983871
5,Rods #,549,0.983871


How we'd create the binary variable:
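
One possible sketch, on made-up rows rather than the real data (`toy` and the column name `cytes_run` are hypothetical): flag a patient as 1 if any of the six cytes columns has a recorded value.

```python
import numpy as np
import pandas as pd

cyte_cols = ['Rods #', 'Segmented', 'Promyelocytes',
             'Metamyelocytes', 'Myelocytes', 'Myeloblasts']

# Made-up rows: only the second patient had the panel run.
toy = pd.DataFrame(np.nan, index=range(3), columns=cyte_cols)
toy.loc[1, cyte_cols] = [0.1, -0.2, 0.0, 0.0, 0.0, 0.0]

# 1 if any of the cytes tests has a recorded value, else 0.
toy['cytes_run'] = toy[cyte_cols].notnull().any(axis=1).astype(int)
print(toy['cytes_run'].tolist())
```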