# Missing Data Report

In [23]:
import pandas as pd
import numpy as np
pd.set_option('display.max_rows', 120)

Read in data

In [7]:
df = pd.read_excel("../data/dataset.xlsx")
df.head(5)

Unnamed: 0,Patient ID,Patient age quantile,SARS-Cov-2 exam result,"Patient addmited to regular ward (1=yes, 0=no)","Patient addmited to semi-intensive unit (1=yes, 0=no)","Patient addmited to intensive care unit (1=yes, 0=no)",Hematocrit,Hemoglobin,Platelets,Mean platelet volume,...,Hb saturation (arterial blood gases),pCO2 (arterial blood gas analysis),Base excess (arterial blood gas analysis),pH (arterial blood gas analysis),Total CO2 (arterial blood gas analysis),HCO3 (arterial blood gas analysis),pO2 (arterial blood gas analysis),Arteiral Fio2,Phosphor,ctO2 (arterial blood gas analysis)
0,44477f75e8169d2,13,negative,0,0,0,,,,,...,,,,,,,,,,
1,126e9dd13932f68,17,negative,0,0,0,0.236515,-0.02234,-0.517413,0.010677,...,,,,,,,,,,
2,a46b4402a0e5696,8,negative,0,0,0,,,,,...,,,,,,,,,,
3,f7d619a94f97c45,5,negative,0,0,0,,,,,...,,,,,,,,,,
4,d9e41465789c2b5,15,negative,0,0,0,,,,,...,,,,,,,,,,


Numeric vs categorical

In [11]:
df.dtypes == 'object'

Patient ID                                                True
Patient age quantile                                     False
SARS-Cov-2 exam result                                    True
Patient addmited to regular ward (1=yes, 0=no)           False
Patient addmited to semi-intensive unit (1=yes, 0=no)    False
                                                         ...  
HCO3 (arterial blood gas analysis)                       False
pO2 (arterial blood gas analysis)                        False
Arteiral Fio2                                            False
Phosphor                                                 False
ctO2 (arterial blood gas analysis)                       False
Length: 111, dtype: bool

Number of datapoints

In [12]:
len(df)

5644

Missing values (overall): 88 columns have at least 80% missing values, and 5 columns are missing for every row.

In [27]:
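# Count missing values per column, most-missing first
missing_prop = (df.isnull().sum().sort_values(ascending = False)).to_frame(name = 'num')
missing_prop.reset_index(inplace=True)
missing_prop.rename(columns = {'index':'var'}, inplace = True)
# Convert counts to a proportion of all rows
missing_prop['prop'] = missing_prop['num']/len(df)
# Show columns with at least 80% missing values
missing_prop[missing_prop['prop'] >= 0.8]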

Unnamed: 0,var,num,prop
0,"Prothrombin time (PT), Activity",5644,1.0
1,D-Dimer,5644,1.0
2,Mycoplasma pneumoniae,5644,1.0
3,Urine - Sugar,5644,1.0
4,Partial thromboplastin time (PTT),5644,1.0
5,Fio2 (venous blood gas analysis),5643,0.999823
6,Urine - Nitrite,5643,0.999823
7,Vitamin B12,5641,0.999468
8,Lipase dosage,5636,0.998583
9,Albumin,5631,0.997697


What about COVID-positive patients only?

In [29]:
# Restrict to patients with a positive SARS-Cov-2 result
covid_pos = df[df['SARS-Cov-2 exam result'] == 'positive']
# Same missing-value summary as above (note: this overwrites missing_prop)
missing_prop = (covid_pos.isnull().sum().sort_values(ascending = False)).to_frame(name = 'num')
missing_prop.reset_index(inplace=True)
missing_prop.rename(columns = {'index':'var'}, inplace = True)
missing_prop['prop'] = missing_prop['num']/len(covid_pos)
missing_prop[missing_prop['prop'] >= 0.8]

Unnamed: 0,var,num,prop
0,Mycoplasma pneumoniae,558,1.0
1,Vitamin B12,558,1.0
2,Urine - Sugar,558,1.0
3,Urine - Nitrite,558,1.0
4,Fio2 (venous blood gas analysis),558,1.0
5,Partial thromboplastin time (PTT),558,1.0
6,Albumin,558,1.0
7,D-Dimer,558,1.0
8,"Prothrombin time (PT), Activity",558,1.0
9,Phosphor,557,0.998208


There is a clear pattern above: several different lab results have exactly the same number of missing values.
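One way to check whether equal missing counts reflect tests that are skipped on the same patients is to group columns by their full missingness mask. The sketch below is illustrative only: `missingness_groups` is a hypothetical helper, and the toy frame (with made-up values) stands in for `df`.

```python
import numpy as np
import pandas as pd

def missingness_groups(frame):
    """Group columns that are missing on exactly the same rows."""
    mask = frame.isnull()
    groups = {}
    for col in mask.columns:
        # The row-wise True/False pattern is the grouping key
        key = tuple(mask[col].to_numpy())
        groups.setdefault(key, []).append(col)
    # Keep only patterns shared by two or more columns
    return [cols for cols in groups.values() if len(cols) > 1]

# Toy data (values invented for illustration, not taken from the dataset)
toy = pd.DataFrame({
    "Hematocrit": [0.2, np.nan, np.nan, 1.1],
    "Hemoglobin": [-0.1, np.nan, np.nan, 0.4],
    "Albumin": [np.nan, np.nan, 0.3, np.nan],
})
panels = missingness_groups(toy)
print(panels)
```

On the real data, `missingness_groups(df)` would list candidate test panels. Columns with the same count but different masks would not be grouped, which is the distinction the count table alone cannot show.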

We need to decide whether to impute missing values at all. The fact that a test was **not** run may be more informative than an imputed value, especially if we can identify groups of tests that are commonly run together and if certain tests are associated with positive COVID results. So, if we do impute, we should retain an indicator variable recording that the lab test was not run.
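A minimal sketch of such indicator variables, assuming a hypothetical helper `add_missing_indicators` and a `_was_missing` suffix (neither is part of the original notebook), with a toy frame in place of `df`:

```python
import numpy as np
import pandas as pd

def add_missing_indicators(frame, cols, suffix="_was_missing"):
    """Return a copy with one 0/1 column per lab test marking 'test not run'."""
    out = frame.copy()
    for col in cols:
        out[col + suffix] = out[col].isnull().astype(int)
    return out

# Toy data (values invented for illustration)
toy = pd.DataFrame({
    "Hematocrit": [0.2, np.nan, 1.1],
    "Platelets": [np.nan, np.nan, -0.5],
})
flagged = add_missing_indicators(toy, ["Hematocrit", "Platelets"])
print(flagged)
```

The indicators should be created before any imputation step. Once the gaps are filled, the information about which tests were run is gone.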

Alternatively, we could impute each missing value with the mean for the patient's age group (assuming that, if a test was not run, the nurses expected a "normal" result or did not expect the test to add information given the patient's symptoms). This may be the best we can do, since we have no other demographic information (race, weight, height, gender, etc.), and it is likely more accurate than using the overall mean.
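The age-group imputation could look roughly like the following. This is a sketch under assumptions: `impute_by_group_mean` is a hypothetical helper, the fallback to the overall mean when an age group has no observed values is a choice made here rather than something the notebook specifies, and the toy values are invented.

```python
import numpy as np
import pandas as pd

def impute_by_group_mean(frame, group_col, value_cols):
    """Fill missing values with the group mean, then the overall mean."""
    out = frame.copy()
    for col in value_cols:
        # Mean of observed values within each group, aligned to the rows
        group_mean = out.groupby(group_col)[col].transform("mean")
        # Assumed fallback: overall mean if the whole group is missing
        out[col] = out[col].fillna(group_mean).fillna(out[col].mean())
    return out

# Toy data (values invented for illustration)
toy = pd.DataFrame({
    "Patient age quantile": [13, 13, 17, 17, 8],
    "Hematocrit": [0.2, np.nan, 1.0, 3.0, np.nan],
})
imputed = impute_by_group_mean(toy, "Patient age quantile", ["Hematocrit"])
print(imputed)
```

Here the second row takes the mean of age quantile 13, while the last row falls back to the overall mean because no one in age quantile 8 has an observed value.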

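Which variables should we keep, and which should we drop?
* Remove any variable whose share of missing values is at or above a chosen threshold.
* A 90% threshold removes 66 variables.
* A 99% threshold removes 13.
* At the very least, remove the 9 variables that are 100% missing among COVID-positive patients (5 of them are missing for every patient in the full dataset).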
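The threshold rule in the list above can be sketched as follows. `drop_sparse_columns` is a hypothetical helper, and it uses `>=` to match the filters used elsewhere in this notebook. The toy frame stands in for `df` or `covid_pos`.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(frame, threshold):
    """Drop columns whose proportion of missing values is >= threshold."""
    missing_share = frame.isnull().mean()
    to_drop = missing_share[missing_share >= threshold].index
    return frame.drop(columns=to_drop)

# Toy data: 'a' is fully missing, 'b' is 75% missing, 'c' is complete
toy = pd.DataFrame({
    "a": [np.nan, np.nan, np.nan, np.nan],
    "b": [np.nan, np.nan, np.nan, 1.0],
    "c": [1.0, 2.0, 3.0, 4.0],
})
kept_strict = drop_sparse_columns(toy, 1.0)
kept_loose = drop_sparse_columns(toy, 0.75)
print(list(kept_strict.columns), list(kept_loose.columns))
```

Comparing the column counts before and after for a few thresholds gives the number of variables each threshold removes.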

In [32]:
missing_prop[missing_prop['prop'] >= 0.99]

Unnamed: 0,var,num,prop
0,Mycoplasma pneumoniae,558,1.0
1,Vitamin B12,558,1.0
2,Urine - Sugar,558,1.0
3,Urine - Nitrite,558,1.0
4,Fio2 (venous blood gas analysis),558,1.0
5,Partial thromboplastin time (PTT),558,1.0
6,Albumin,558,1.0
7,D-Dimer,558,1.0
8,"Prothrombin time (PT), Activity",558,1.0
9,Phosphor,557,0.998208
