# Congestive Heart Failure Data: Cleaning & Imputation

Now that we have a data set extracted, we must clean and impute for missingness. Some known tasks that must be completed are bulleted below. Through some basic exploratory data analysis, we may find other manipulations that must be done in this process. The goal is to have a clean and usable data set for our next phase: modeling.
### Data Cleaning


- Remove DOD_SSN & DOD_HOSP
- Determine blood unit measurements that are '%'. Mean, variance and standard deviation metrics cannot be used for these measurements, and these columns must be dropped, as mean, variance, standard deviation of % are not calculated directly from %.
- Calculate Age of patients from DOB vs. Admission Date fields (fields were obscured in database to future dates to secure anonymity and comply with HIPAA).
- Determine if other measurements are invalid and should be dropped.
- Identify portions of other measurements to determine if they should be imputed or dropped due to volume and missingness.

### Imputation


- Impute remaining measurements methodically.
- ...and more(?)

## Data Cleaning
### Load libraries and data. Remove unnecessary indexes, DOD_HOSP & DOD_SSN.

In [102]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

pd.options.display.max_columns = 150

In [20]:
chf_dirty = pd.read_csv('patients_labevents.csv').drop(labels =['Unnamed: 0','ROW_ID','DOD_HOSP','DOD_SSN'], axis = 1)

In [21]:
chf_dirty.head()

Unnamed: 0,SUBJECT_ID,GENDER,DOB,DOD,EXPIRE_FLAG,Anion Gapmin,Base Excessmin,Bicarbonatemin,"Calcium, Totalmin",Calculated Total CO2min,...,RDWvar,Red Blood Cellsvar,SPECIMEN TYPEvar,Sodiumvar,Urea Nitrogenvar,White Blood Cellsvar,pCO2var,pHvar,pO2var,CHF
0,249,F,2075-03-13 00:00:00,,0,8.0,-1.0,19.0,7.8,27.0,...,0.915249,0.126727,,8.221154,133.147541,28.458241,98.423333,0.006568,9958.0,1.0
1,250,F,2164-12-27 00:00:00,2188-11-22 00:00:00,1,8.0,-23.0,13.0,6.8,15.0,...,2.933919,0.136234,,9.871658,9.134581,50.243707,122.549142,0.014248,17682.562724,0.0
2,251,M,2090-03-15 00:00:00,,0,14.0,,19.0,8.9,,...,0.043333,0.019033,,4.333333,1.666667,9.143333,,,,0.0
3,252,M,2078-03-06 00:00:00,,0,6.0,-8.0,14.0,7.0,20.0,...,2.200418,0.221759,,13.57259,61.72093,22.781957,11.339181,0.006636,2751.397661,0.0
4,253,F,2089-11-26 00:00:00,,0,10.0,1.0,24.0,8.3,26.0,...,0.023,0.01532,,4.566667,14.266667,4.943,8.916667,0.000892,22969.0,0.0


### Determine blood unit measurements that are '%'. Remove these columns.

In [22]:
chf_dirty.shape

(46520, 151)

In [23]:
LABEVENTS = pd.read_csv('./MIMIC-III/mimicIII-DB/LABEVENTS.csv')

In [24]:
LABITEMS = pd.read_csv('./MIMIC-III/mimicIII-DB/D_LABITEMS.csv')

In [25]:
LABEVENTS.head()

Unnamed: 0,ROW_ID,SUBJECT_ID,HADM_ID,ITEMID,CHARTTIME,VALUE,VALUENUM,VALUEUOM,FLAG
0,281,3,,50820,2101-10-12 16:07:00,7.39,7.39,units,
1,282,3,,50800,2101-10-12 18:17:00,ART,,,
2,283,3,,50802,2101-10-12 18:17:00,-1,-1.0,mEq/L,
3,284,3,,50804,2101-10-12 18:17:00,22,22.0,mEq/L,
4,285,3,,50808,2101-10-12 18:17:00,0.93,0.93,mmol/L,abnormal


In [26]:
LABITEMS.head()

Unnamed: 0,ROW_ID,ITEMID,LABEL,FLUID,CATEGORY,LOINC_CODE
0,546,51346,Blasts,Cerebrospinal Fluid (CSF),Hematology,26447-3
1,547,51347,Eosinophils,Cerebrospinal Fluid (CSF),Hematology,26451-5
2,548,51348,"Hematocrit, CSF",Cerebrospinal Fluid (CSF),Hematology,30398-2
3,549,51349,Hypersegmented Neutrophils,Cerebrospinal Fluid (CSF),Hematology,26506-6
4,550,51350,Immunophenotyping,Cerebrospinal Fluid (CSF),Hematology,


In [64]:
lab_UOM_group = LABEVENTS.groupby(['ITEMID'])['VALUEUOM']

In [76]:
test = LABEVENTS.groupby(['ITEMID','VALUEUOM']).count().reset_index()

In [82]:
d = dict(zip(test['ITEMID'],test['VALUEUOM']))
units = []
for i in LABITEMS['ITEMID']:
    units.append(d.get(i))
LABITEMS['units'] = units
LABITEMS.sample(10)

Unnamed: 0,ROW_ID,ITEMID,LABEL,FLUID,CATEGORY,LOINC_CODE,units
67,613,51413,CD56,Other Body Fluid,Hematology,57424-4,
664,538,51338,Immunophenotyping,Bone Marrow,Hematology,,
697,698,51498,Specific Gravity,Urine,Hematology,5811-5,
5,551,51351,Lymphs,Cerebrospinal Fluid (CSF),Hematology,26479-6,%
490,364,51164,CD19,Blood,Hematology,8117-4,
516,390,51190,CD59,Blood,Hematology,17177-7,
177,51,50850,"Triglycerides, Ascites",Ascites,Chemistry,14447-7,mg/dL
734,735,51535,CD55,OTHER BODY FLUID,HEMATOLOGY,,
346,220,51020,"Amylase, Joint Fluid",Joint Fluid,Chemistry,14388-3,IU/L
167,41,50840,"Cholesterol, Ascites",Ascites,Chemistry,14441-0,mg/dL


In [85]:
LAB_PERC = LABITEMS[LABITEMS['units'] == '%']

In [86]:
ITEMID_list = [51221,50971,50983,50912,50902,51006,50882,51265,50868,51301,51222,50931,51249,51279,\
51248,51250,51277,50960,50893,50970,50820,50802,50804,50821,50818,51275,51237,51274,50800]

In [87]:
LAB_PERC[LAB_PERC.ITEMID.isin(ITEMID_list)]

Unnamed: 0,ROW_ID,ITEMID,LABEL,FLUID,CATEGORY,LOINC_CODE,units
547,421,51221,Hematocrit,Blood,Hematology,4544-3,%
575,449,51249,MCHC,Blood,Hematology,786-4,%
603,477,51277,RDW,Blood,Hematology,788-0,%


In [97]:
chf_cleanish = chf_dirty.drop(labels = ['Hematocritmean', 'Hematocritstd', 'Hematocritvar', 'MCHCmean',\
                                        'MCHCstd', 'MCHCvar','RDWmean', 'RDWstd', 'RDWvar'], axis = 1)

In [99]:
chf_cleanish.shape

(46520, 142)

### Determine if other measurements are invalid and should be dropped.

In [105]:
chf_cleanish.head(10)

Unnamed: 0,SUBJECT_ID,GENDER,DOB,DOD,EXPIRE_FLAG,Anion Gapmin,Base Excessmin,Bicarbonatemin,"Calcium, Totalmin",Calculated Total CO2min,Chloridemin,Creatininemin,Glucosemin,Hematocritmin,Hemoglobinmin,INR(PT)min,MCHmin,MCHCmin,MCVmin,Magnesiummin,PTmin,PTTmin,Phosphatemin,Platelet Countmin,Potassiummin,RDWmin,Red Blood Cellsmin,SPECIMEN TYPEmin,Sodiummin,Urea Nitrogenmin,White Blood Cellsmin,pCO2min,pHmin,pO2min,Anion Gapmean,Base Excessmean,Bicarbonatemean,"Calcium, Totalmean",Calculated Total CO2mean,Chloridemean,Creatininemean,Glucosemean,Hemoglobinmean,INR(PT)mean,MCHmean,MCVmean,Magnesiummean,PTmean,PTTmean,Phosphatemean,Platelet Countmean,Potassiummean,Red Blood Cellsmean,SPECIMEN TYPEmean,Sodiummean,Urea Nitrogenmean,White Blood Cellsmean,pCO2mean,pHmean,pO2mean,Anion Gapmax,Base Excessmax,Bicarbonatemax,"Calcium, Totalmax",Calculated Total CO2max,Chloridemax,Creatininemax,Glucosemax,Hematocritmax,Hemoglobinmax,INR(PT)max,MCHmax,MCHCmax,MCVmax,Magnesiummax,PTmax,PTTmax,Phosphatemax,Platelet Countmax,Potassiummax,RDWmax,Red Blood Cellsmax,SPECIMEN TYPEmax,Sodiummax,Urea Nitrogenmax,White Blood Cellsmax,pCO2max,pHmax,pO2max,Anion Gapstd,Base Excessstd,Bicarbonatestd,"Calcium, Totalstd",Calculated Total CO2std,Chloridestd,Creatininestd,Glucosestd,Hemoglobinstd,INR(PT)std,MCHstd,MCVstd,Magnesiumstd,PTstd,PTTstd,Phosphatestd,Platelet Countstd,Potassiumstd,Red Blood Cellsstd,SPECIMEN TYPEstd,Sodiumstd,Urea Nitrogenstd,White Blood Cellsstd,pCO2std,pHstd,pO2std,Anion Gapvar,Base Excessvar,Bicarbonatevar,"Calcium, Totalvar",Calculated Total CO2var,Chloridevar,Creatininevar,Glucosevar,Hemoglobinvar,INR(PT)var,MCHvar,MCVvar,Magnesiumvar,PTvar,PTTvar,Phosphatevar,Platelet Countvar,Potassiumvar,Red Blood Cellsvar,SPECIMEN TYPEvar,Sodiumvar,Urea Nitrogenvar,White Blood Cellsvar,pCO2var,pHvar,pO2var,CHF
0,249,F,2075-03-13 00:00:00,,0,8.0,-1.0,19.0,7.8,27.0,95.0,0.7,61.0,22.5,7.0,1.2,26.5,30.3,86.0,1.1,13.7,21.9,2.3,152.0,3.2,14.0,2.45,,134.0,9.0,4.2,36.0,7.23,77.0,13.580645,4.36,27.532258,8.702,32.92,103.390625,1.168254,132.096774,10.326,2.213333,29.306,89.86,1.985,20.696667,59.496491,3.318,216.745098,3.969697,3.522,,140.461538,27.04918,13.718,57.44,7.3548,163.6,20.0,13.0,37.0,9.5,41.0,110.0,1.8,249.0,39.5,13.2,4.8,31.1,35.3,93.0,3.2,45.5,150.0,6.8,336.0,4.9,17.1,4.44,,148.0,53.0,28.6,80.0,7.52,515.0,2.961769,4.101626,4.071966,0.429756,3.81794,3.552853,0.274053,49.638268,1.168255,0.997873,1.051609,1.906059,0.438729,6.676876,42.798807,0.913993,42.063687,0.330021,0.355987,,2.867255,11.538958,5.334627,9.920853,0.081041,99.789779,8.772078,16.823333,16.58091,0.18469,14.576667,12.622768,0.075105,2463.957694,1.36482,0.995751,1.105882,3.633061,0.192483,44.580667,1831.737845,0.835384,1769.353725,0.108914,0.126727,,8.221154,133.147541,28.458241,98.423333,0.006568,9958.0,1.0
1,250,F,2164-12-27 00:00:00,2188-11-22 00:00:00,1,8.0,-23.0,13.0,6.8,15.0,95.0,0.2,48.0,21.6,7.2,1.0,23.2,29.7,76.0,1.4,12.7,34.3,2.4,46.0,3.4,13.6,2.59,,131.0,4.0,6.0,23.0,6.8,15.0,13.176471,1.174699,24.352941,8.316667,27.253012,103.558824,0.458824,124.757576,10.304444,1.503571,28.9,82.022222,1.871429,15.203571,114.578261,4.021739,191.433333,4.385294,3.570667,,136.647059,11.676471,15.324444,42.837349,7.402738,198.03012,21.0,9.0,30.0,13.8,40.0,114.0,0.7,276.0,39.0,12.6,4.7,31.6,38.2,88.0,2.5,27.3,150.0,7.3,927.0,8.8,18.5,4.33,,142.0,21.0,31.6,105.0,7.6,539.0,3.1572,4.992075,3.891579,1.407948,3.792635,5.332526,0.115778,52.664641,1.21897,0.671914,2.314578,2.544949,0.261595,2.667428,50.287473,1.2225,215.636729,0.949045,0.369098,,3.141919,3.022347,7.08828,11.070192,0.119363,132.975798,9.967914,24.920811,15.144385,1.982319,14.384082,28.435829,0.013405,2773.564394,1.485889,0.451468,5.357273,6.476768,0.068432,7.115172,2528.82996,1.494506,46499.19887,0.900686,0.136234,,9.871658,9.134581,50.243707,122.549142,0.014248,17682.562724,0.0
2,251,M,2090-03-15 00:00:00,,0,14.0,,19.0,8.9,,102.0,0.7,92.0,36.7,13.4,1.2,33.3,35.6,93.0,1.9,13.7,23.9,2.5,218.0,3.6,11.3,3.91,,135.0,8.0,9.3,,7.46,,16.0,,21.333333,9.0,,104.0,0.8,99.333333,13.6,1.25,33.866667,94.0,1.9,13.9,25.9,2.65,242.333333,3.9,4.013333,,137.333333,9.5,11.866667,,7.46,,17.0,,23.0,9.1,,107.0,0.9,107.0,39.0,13.9,1.3,34.5,36.6,95.0,1.9,14.1,27.9,2.8,276.0,4.1,11.7,4.17,,139.0,11.0,15.2,,7.46,,1.732051,,2.081666,0.141421,,2.645751,0.08165,7.505553,0.264575,0.070711,0.602771,1.0,0.0,0.282843,2.828427,0.212132,30.105371,0.264575,0.137961,,2.081666,1.290994,3.023795,,,,3.0,,4.333333,0.02,,7.0,0.006667,56.333333,0.07,0.005,0.363333,1.0,0.0,0.08,8.0,0.045,906.333333,0.07,0.019033,,4.333333,1.666667,9.143333,,,,0.0
3,252,M,2078-03-06 00:00:00,,0,6.0,-8.0,14.0,7.0,20.0,99.0,0.8,85.0,23.8,8.0,1.2,29.7,31.5,86.0,1.2,13.2,28.8,1.8,61.0,2.3,13.8,2.5,,132.0,8.0,7.1,33.0,7.12,48.0,9.789474,-3.210526,24.868421,7.739394,22.105263,111.921053,1.229545,110.810811,10.848718,1.47,31.215385,92.512821,1.748571,15.07,33.016,3.222581,104.244444,3.781395,3.477949,,142.47619,23.0,12.225641,37.684211,7.345714,99.210526,20.0,4.0,32.0,8.6,29.0,119.0,2.0,141.0,40.4,13.4,2.4,33.5,35.4,102.0,2.2,19.3,46.2,5.2,180.0,5.6,21.2,4.42,,149.0,41.0,28.5,45.0,7.48,303.0,2.933109,3.119454,3.793102,0.388056,2.664473,4.789524,0.322476,16.888257,1.378762,0.273105,0.865277,3.440255,0.193073,1.357267,3.984102,0.65864,20.79899,0.653291,0.470913,,3.6841,7.856267,4.773045,3.36737,0.08146,52.453767,8.603129,9.730994,14.387624,0.150587,7.099415,22.939545,0.10399,285.213213,1.900985,0.074586,0.748704,11.835358,0.037277,1.842172,15.873067,0.433806,432.59798,0.426788,0.221759,,13.57259,61.72093,22.781957,11.339181,0.006636,2751.397661,0.0
4,253,F,2089-11-26 00:00:00,,0,10.0,1.0,24.0,8.3,26.0,100.0,0.8,82.0,29.9,10.1,1.1,28.3,33.9,83.0,1.9,12.6,24.9,3.1,209.0,4.0,15.3,3.59,,136.0,12.0,8.2,36.0,7.41,94.0,13.0,2.0,26.166667,8.533333,27.25,103.333333,0.95,106.0,10.88,1.133333,28.96,83.4,1.966667,12.833333,25.7,3.6,219.0,4.3,3.758,,138.166667,16.666667,9.84,39.25,7.4375,222.5,16.0,3.0,27.0,8.7,28.0,106.0,1.1,148.0,32.5,11.4,1.2,30.0,36.0,84.0,2.0,13.3,27.2,4.1,231.0,4.5,15.6,3.89,,141.0,22.0,13.5,43.0,7.48,428.0,1.897367,0.816497,1.32916,0.136626,0.957427,2.42212,0.13784,22.458851,0.535724,0.057735,0.650385,0.547723,0.05164,0.404145,1.3,0.419524,9.027735,0.2,0.123774,,2.136976,3.777124,2.223286,2.986079,0.029861,151.55527,3.6,0.666667,1.766667,0.018667,0.916667,5.866667,0.019,504.4,0.287,0.003333,0.423,0.3,0.002667,0.163333,1.69,0.176,81.5,0.04,0.01532,,4.566667,14.266667,4.943,8.916667,0.000892,22969.0,0.0
5,255,M,2109-08-05 00:00:00,,0,10.0,,26.0,7.8,,97.0,0.8,53.0,29.8,10.5,1.0,30.1,34.5,87.0,1.6,12.1,27.7,3.0,171.0,3.7,12.7,3.44,,134.0,12.0,5.1,,,,11.0,,27.75,8.2,,101.25,0.9,115.6,12.72,1.033333,30.86,87.2,1.7,12.333333,28.6,3.25,205.0,4.05,4.124,,136.0,18.6,5.92,,,,12.0,,30.0,9.1,,108.0,1.0,192.0,40.0,15.0,1.1,32.7,37.6,88.0,1.8,12.6,29.3,3.5,253.0,4.4,14.0,4.6,,140.0,22.0,6.6,,,,1.154701,,2.061553,0.60553,,4.99166,0.070711,52.002885,1.974082,0.057735,1.045466,0.447214,0.1,0.251661,0.818535,0.353553,27.68393,0.310913,0.575178,,2.828427,3.847077,0.637966,,,,1.333333,,4.25,0.366667,,24.916667,0.005,2704.3,3.897,0.003333,1.093,0.2,0.01,0.063333,0.67,0.125,766.4,0.096667,0.33083,,8.0,14.8,0.407,,,,0.0
6,256,M,2086-07-31 00:00:00,,0,8.0,-8.0,18.0,7.5,18.0,100.0,1.0,64.0,21.2,7.4,1.0,27.4,30.5,83.0,1.5,12.1,26.7,2.2,124.0,3.5,12.7,2.46,,135.0,13.0,5.3,35.0,7.2,54.0,12.56338,-3.794118,24.52,8.697619,22.117647,108.368421,1.8,100.014925,11.232955,1.97069,29.375862,88.689655,2.044186,17.253448,37.431915,3.023077,231.850575,4.445238,3.827586,,140.987342,29.821429,10.297701,41.0,7.326579,159.352941,20.0,0.0,32.0,9.8,26.0,119.0,3.0,198.0,47.2,15.4,7.3,31.5,36.2,98.0,2.6,33.1,139.0,4.3,521.0,6.2,17.9,5.02,,147.0,67.0,34.5,50.0,7.42,456.0,2.303615,2.185065,3.112138,0.521484,1.950364,4.285921,0.420498,22.089002,1.622806,1.199855,0.858604,2.792475,0.207367,4.425533,17.744039,0.455066,67.88217,0.556539,0.514384,,2.743261,11.758781,4.442631,3.709121,0.044919,83.256043,5.30664,4.77451,9.685405,0.271945,3.803922,18.369123,0.176818,487.924016,2.633499,1.439652,0.737201,7.797915,0.043001,19.585339,314.850916,0.207085,4607.98904,0.309736,0.264591,,7.525479,138.268933,19.736971,13.757576,0.002018,6931.568627,0.0
7,257,F,2031-04-03 00:00:00,2121-07-08 00:00:00,1,11.0,,22.0,8.7,,107.0,0.6,86.0,34.9,11.8,1.1,31.8,33.9,94.0,1.9,12.6,27.1,2.4,218.0,3.2,13.5,3.73,,143.0,13.0,6.8,,,,12.333333,,24.666667,8.7,,109.666667,0.6,114.0,12.766667,1.225,32.0,94.0,1.95,13.5,28.525,3.0,230.333333,3.75,4.006667,,143.0,15.333333,7.633333,,,,15.0,,28.0,8.7,,114.0,0.6,138.0,42.0,14.3,1.6,32.2,34.3,94.0,2.0,15.4,29.9,3.6,242.0,4.2,13.7,4.48,,143.0,18.0,9.2,,,,2.309401,,3.05505,0.0,,3.785939,0.0,26.229754,1.342882,0.25,0.2,0.0,0.070711,1.283225,1.201041,0.848528,12.013881,0.443471,0.411866,,0.0,2.516611,1.357694,,,,5.333333,,9.333333,0.0,,14.333333,0.0,688.0,1.803333,0.0625,0.04,0.0,0.005,1.646667,1.4425,0.72,144.333333,0.196667,0.169633,,0.0,6.333333,1.843333,,,,0.0
8,258,F,2124-09-19 00:00:00,,0,17.0,,23.0,,,108.0,,,52.8,18.2,,38.7,34.4,113.0,,,,,159.0,4.9,16.2,4.69,,143.0,,10.8,,,,17.0,,23.0,,,108.0,,,18.2,,38.7,113.0,,,,,159.0,4.9,4.69,,143.0,,10.8,,,,17.0,,23.0,,,108.0,,,52.8,18.2,,38.7,34.4,113.0,,,,,159.0,4.9,16.2,4.69,,143.0,,10.8,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0
9,260,F,2105-03-23 00:00:00,,0,,,,,,,,,44.3,15.2,,37.0,34.4,107.0,,,,,340.0,,17.2,4.12,,,,9.0,,,,,,,,,,,,15.2,,37.0,107.0,,,,,340.0,,4.12,,,,9.0,,,,,,,,,,,,44.3,15.2,,37.0,34.4,107.0,,,,,340.0,,17.2,4.12,,,,9.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0


## Research Break:
Identify value counts of ethnicity, gender, language, etc. that may inform imputation methods for blood measurements.

Oops! Just realized I forgot to join in these metrics from the admissions table.

Maite is doing research to identify which imputation methods would be best for blood measurements.

In [107]:
ADMISSIONS = pd.read_csv('./MIMIC-III/mimicIII-DB/ADMISSIONS.csv')

In [111]:
ADMISSIONS2 = ADMISSIONS.drop(labels = ['ROW_ID', 'HADM_ID', 'ADMITTIME', 'DISCHTIME', 'DEATHTIME',\
                                        'ADMISSION_TYPE', 'ADMISSION_LOCATION', 'DISCHARGE_LOCATION',\
                                        'HAS_CHARTEVENTS_DATA'], axis = 1)

In [127]:
chf_cleanish.SUBJECT_ID.nunique()

46520

In [135]:
ADMISSIONS2.drop_duplicates(subset = 'SUBJECT_ID', keep = 'first', inplace = True)

In [143]:
chf_patients_cleanish = chf_cleanish.merge(ADMISSIONS2, how='left', on='SUBJECT_ID')

In [146]:
chf_patients_cleanish = chf_patients_cleanish.drop(labels = ['EDREGTIME', 'EDOUTTIME',\
                                                             'DIAGNOSIS', 'HOSPITAL_EXPIRE_FLAG'], axis = 1)

In [147]:
chf_patients_cleanish.head()

Unnamed: 0,SUBJECT_ID,GENDER,DOB,DOD,EXPIRE_FLAG,Anion Gapmin,Base Excessmin,Bicarbonatemin,"Calcium, Totalmin",Calculated Total CO2min,Chloridemin,Creatininemin,Glucosemin,Hematocritmin,Hemoglobinmin,INR(PT)min,MCHmin,MCHCmin,MCVmin,Magnesiummin,PTmin,PTTmin,Phosphatemin,Platelet Countmin,Potassiummin,RDWmin,Red Blood Cellsmin,SPECIMEN TYPEmin,Sodiummin,Urea Nitrogenmin,White Blood Cellsmin,pCO2min,pHmin,pO2min,Anion Gapmean,Base Excessmean,Bicarbonatemean,"Calcium, Totalmean",Calculated Total CO2mean,Chloridemean,Creatininemean,Glucosemean,Hemoglobinmean,INR(PT)mean,MCHmean,MCVmean,Magnesiummean,PTmean,PTTmean,Phosphatemean,Platelet Countmean,Potassiummean,Red Blood Cellsmean,SPECIMEN TYPEmean,Sodiummean,Urea Nitrogenmean,White Blood Cellsmean,pCO2mean,pHmean,pO2mean,Anion Gapmax,Base Excessmax,Bicarbonatemax,"Calcium, Totalmax",Calculated Total CO2max,Chloridemax,Creatininemax,Glucosemax,Hematocritmax,Hemoglobinmax,INR(PT)max,MCHmax,MCHCmax,MCVmax,Magnesiummax,PTmax,PTTmax,Phosphatemax,Platelet Countmax,Potassiummax,RDWmax,Red Blood Cellsmax,SPECIMEN TYPEmax,Sodiummax,Urea Nitrogenmax,White Blood Cellsmax,pCO2max,pHmax,pO2max,Anion Gapstd,Base Excessstd,Bicarbonatestd,"Calcium, Totalstd",Calculated Total CO2std,Chloridestd,Creatininestd,Glucosestd,Hemoglobinstd,INR(PT)std,MCHstd,MCVstd,Magnesiumstd,PTstd,PTTstd,Phosphatestd,Platelet Countstd,Potassiumstd,Red Blood Cellsstd,SPECIMEN TYPEstd,Sodiumstd,Urea Nitrogenstd,White Blood Cellsstd,pCO2std,pHstd,pO2std,Anion Gapvar,Base Excessvar,Bicarbonatevar,"Calcium, Totalvar",Calculated Total CO2var,Chloridevar,Creatininevar,Glucosevar,Hemoglobinvar,INR(PT)var,MCHvar,MCVvar,Magnesiumvar,PTvar,PTTvar,Phosphatevar,Platelet Countvar,Potassiumvar,Red Blood Cellsvar,SPECIMEN TYPEvar,Sodiumvar,Urea Nitrogenvar,White Blood Cellsvar,pCO2var,pHvar,pO2var,CHF,INSURANCE,LANGUAGE,RELIGION,MARITAL_STATUS,ETHNICITY
0,249,F,2075-03-13 00:00:00,,0,8.0,-1.0,19.0,7.8,27.0,95.0,0.7,61.0,22.5,7.0,1.2,26.5,30.3,86.0,1.1,13.7,21.9,2.3,152.0,3.2,14.0,2.45,,134.0,9.0,4.2,36.0,7.23,77.0,13.580645,4.36,27.532258,8.702,32.92,103.390625,1.168254,132.096774,10.326,2.213333,29.306,89.86,1.985,20.696667,59.496491,3.318,216.745098,3.969697,3.522,,140.461538,27.04918,13.718,57.44,7.3548,163.6,20.0,13.0,37.0,9.5,41.0,110.0,1.8,249.0,39.5,13.2,4.8,31.1,35.3,93.0,3.2,45.5,150.0,6.8,336.0,4.9,17.1,4.44,,148.0,53.0,28.6,80.0,7.52,515.0,2.961769,4.101626,4.071966,0.429756,3.81794,3.552853,0.274053,49.638268,1.168255,0.997873,1.051609,1.906059,0.438729,6.676876,42.798807,0.913993,42.063687,0.330021,0.355987,,2.867255,11.538958,5.334627,9.920853,0.081041,99.789779,8.772078,16.823333,16.58091,0.18469,14.576667,12.622768,0.075105,2463.957694,1.36482,0.995751,1.105882,3.633061,0.192483,44.580667,1831.737845,0.835384,1769.353725,0.108914,0.126727,,8.221154,133.147541,28.458241,98.423333,0.006568,9958.0,1.0,Medicare,,CATHOLIC,DIVORCED,WHITE
1,250,F,2164-12-27 00:00:00,2188-11-22 00:00:00,1,8.0,-23.0,13.0,6.8,15.0,95.0,0.2,48.0,21.6,7.2,1.0,23.2,29.7,76.0,1.4,12.7,34.3,2.4,46.0,3.4,13.6,2.59,,131.0,4.0,6.0,23.0,6.8,15.0,13.176471,1.174699,24.352941,8.316667,27.253012,103.558824,0.458824,124.757576,10.304444,1.503571,28.9,82.022222,1.871429,15.203571,114.578261,4.021739,191.433333,4.385294,3.570667,,136.647059,11.676471,15.324444,42.837349,7.402738,198.03012,21.0,9.0,30.0,13.8,40.0,114.0,0.7,276.0,39.0,12.6,4.7,31.6,38.2,88.0,2.5,27.3,150.0,7.3,927.0,8.8,18.5,4.33,,142.0,21.0,31.6,105.0,7.6,539.0,3.1572,4.992075,3.891579,1.407948,3.792635,5.332526,0.115778,52.664641,1.21897,0.671914,2.314578,2.544949,0.261595,2.667428,50.287473,1.2225,215.636729,0.949045,0.369098,,3.141919,3.022347,7.08828,11.070192,0.119363,132.975798,9.967914,24.920811,15.144385,1.982319,14.384082,28.435829,0.013405,2773.564394,1.485889,0.451468,5.357273,6.476768,0.068432,7.115172,2528.82996,1.494506,46499.19887,0.900686,0.136234,,9.871658,9.134581,50.243707,122.549142,0.014248,17682.562724,0.0,Self Pay,HAIT,NOT SPECIFIED,SINGLE,BLACK/AFRICAN AMERICAN
2,251,M,2090-03-15 00:00:00,,0,14.0,,19.0,8.9,,102.0,0.7,92.0,36.7,13.4,1.2,33.3,35.6,93.0,1.9,13.7,23.9,2.5,218.0,3.6,11.3,3.91,,135.0,8.0,9.3,,7.46,,16.0,,21.333333,9.0,,104.0,0.8,99.333333,13.6,1.25,33.866667,94.0,1.9,13.9,25.9,2.65,242.333333,3.9,4.013333,,137.333333,9.5,11.866667,,7.46,,17.0,,23.0,9.1,,107.0,0.9,107.0,39.0,13.9,1.3,34.5,36.6,95.0,1.9,14.1,27.9,2.8,276.0,4.1,11.7,4.17,,139.0,11.0,15.2,,7.46,,1.732051,,2.081666,0.141421,,2.645751,0.08165,7.505553,0.264575,0.070711,0.602771,1.0,0.0,0.282843,2.828427,0.212132,30.105371,0.264575,0.137961,,2.081666,1.290994,3.023795,,,,3.0,,4.333333,0.02,,7.0,0.006667,56.333333,0.07,0.005,0.363333,1.0,0.0,0.08,8.0,0.045,906.333333,0.07,0.019033,,4.333333,1.666667,9.143333,,,,0.0,Private,,OTHER,,UNKNOWN/NOT SPECIFIED
3,252,M,2078-03-06 00:00:00,,0,6.0,-8.0,14.0,7.0,20.0,99.0,0.8,85.0,23.8,8.0,1.2,29.7,31.5,86.0,1.2,13.2,28.8,1.8,61.0,2.3,13.8,2.5,,132.0,8.0,7.1,33.0,7.12,48.0,9.789474,-3.210526,24.868421,7.739394,22.105263,111.921053,1.229545,110.810811,10.848718,1.47,31.215385,92.512821,1.748571,15.07,33.016,3.222581,104.244444,3.781395,3.477949,,142.47619,23.0,12.225641,37.684211,7.345714,99.210526,20.0,4.0,32.0,8.6,29.0,119.0,2.0,141.0,40.4,13.4,2.4,33.5,35.4,102.0,2.2,19.3,46.2,5.2,180.0,5.6,21.2,4.42,,149.0,41.0,28.5,45.0,7.48,303.0,2.933109,3.119454,3.793102,0.388056,2.664473,4.789524,0.322476,16.888257,1.378762,0.273105,0.865277,3.440255,0.193073,1.357267,3.984102,0.65864,20.79899,0.653291,0.470913,,3.6841,7.856267,4.773045,3.36737,0.08146,52.453767,8.603129,9.730994,14.387624,0.150587,7.099415,22.939545,0.10399,285.213213,1.900985,0.074586,0.748704,11.835358,0.037277,1.842172,15.873067,0.433806,432.59798,0.426788,0.221759,,13.57259,61.72093,22.781957,11.339181,0.006636,2751.397661,0.0,Private,,UNOBTAINABLE,SINGLE,WHITE
4,253,F,2089-11-26 00:00:00,,0,10.0,1.0,24.0,8.3,26.0,100.0,0.8,82.0,29.9,10.1,1.1,28.3,33.9,83.0,1.9,12.6,24.9,3.1,209.0,4.0,15.3,3.59,,136.0,12.0,8.2,36.0,7.41,94.0,13.0,2.0,26.166667,8.533333,27.25,103.333333,0.95,106.0,10.88,1.133333,28.96,83.4,1.966667,12.833333,25.7,3.6,219.0,4.3,3.758,,138.166667,16.666667,9.84,39.25,7.4375,222.5,16.0,3.0,27.0,8.7,28.0,106.0,1.1,148.0,32.5,11.4,1.2,30.0,36.0,84.0,2.0,13.3,27.2,4.1,231.0,4.5,15.6,3.89,,141.0,22.0,13.5,43.0,7.48,428.0,1.897367,0.816497,1.32916,0.136626,0.957427,2.42212,0.13784,22.458851,0.535724,0.057735,0.650385,0.547723,0.05164,0.404145,1.3,0.419524,9.027735,0.2,0.123774,,2.136976,3.777124,2.223286,2.986079,0.029861,151.55527,3.6,0.666667,1.766667,0.018667,0.916667,5.866667,0.019,504.4,0.287,0.003333,0.423,0.3,0.002667,0.163333,1.69,0.176,81.5,0.04,0.01532,,4.566667,14.266667,4.943,8.916667,0.000892,22969.0,0.0,Medicare,,CATHOLIC,WIDOWED,WHITE


In [148]:
chf_patients_cleanish.GENDER.value_counts()

M    26121
F    20399
Name: GENDER, dtype: int64

In [149]:
chf_patients_cleanish.INSURANCE.value_counts()

Medicare      20446
Private       19518
Medicaid       4424
Government     1546
Self Pay        586
Name: INSURANCE, dtype: int64

In [150]:
chf_patients_cleanish.LANGUAGE.value_counts()

ENGL    20983
SPAN      799
RUSS      529
PTUN      529
CANT      323
PORT      276
CAPE      201
MAND      134
HAIT      105
ITAL       81
VIET       68
GREE       53
ARAB       28
POLI       24
PERS       24
CAMB       23
HIND       23
AMER       22
KORE       19
ALBA       14
FREN       13
THAI        9
*ARM        8
*BEN        7
ETHI        6
SOMA        6
LAOT        6
*GUJ        5
*URD        4
*CDI        4
        ...  
*TEL        2
TURK        2
*FUL        1
GERM        1
*CRE        1
SERB        1
**SH        1
BENG        1
*TAM        1
*SPA        1
** T        1
*YOR        1
*TOY        1
*RUS        1
*NEP        1
*ARA        1
*MOR        1
*CAN        1
*PER        1
* BE        1
*KHM        1
* FU        1
*FAR        1
*PHI        1
*ROM        1
*PUN        1
*MAN        1
*LIT        1
*BOS        1
*FIL        1
Name: LANGUAGE, Length: 74, dtype: int64

In [151]:
chf_patients_cleanish.RELIGION.value_counts()

CATHOLIC                  15660
NOT SPECIFIED              9550
UNOBTAINABLE               7711
PROTESTANT QUAKER          5118
JEWISH                     3833
OTHER                      2103
EPISCOPALIAN                590
CHRISTIAN SCIENTIST         360
GREEK ORTHODOX              323
BUDDHIST                    195
MUSLIM                      157
JEHOVAH'S WITNESS           104
UNITARIAN-UNIVERSALIST      104
HINDU                       101
ROMANIAN EAST. ORTH          66
7TH DAY ADVENTIST            56
BAPTIST                      25
HEBREW                       15
METHODIST                     6
LUTHERAN                      1
Name: RELIGION, dtype: int64

In [152]:
chf_patients_cleanish.MARITAL_STATUS.value_counts()

MARRIED              18549
SINGLE                9740
WIDOWED               5413
DIVORCED              2337
SEPARATED              378
UNKNOWN (DEFAULT)      312
LIFE PARTNER            11
Name: MARITAL_STATUS, dtype: int64

In [153]:
chf_patients_cleanish.ETHNICITY.value_counts()

WHITE                                                       32074
UNKNOWN/NOT SPECIFIED                                        4234
BLACK/AFRICAN AMERICAN                                       3586
HISPANIC OR LATINO                                           1349
ASIAN                                                        1304
OTHER                                                        1256
UNABLE TO OBTAIN                                              792
PATIENT DECLINED TO ANSWER                                    498
ASIAN - CHINESE                                               223
BLACK/CAPE VERDEAN                                            159
HISPANIC/LATINO - PUERTO RICAN                                147
MULTI RACE ETHNICITY                                          111
WHITE - RUSSIAN                                               106
BLACK/HAITIAN                                                  71
WHITE - OTHER EUROPEAN                                         69
HISPANIC/L

In [154]:
chf_patients_cleanish.to_csv('CHF_Patients.csv')

### (Checkpoint Break)

### Calculate Age of Patients

...from DOB vs. Admission Date fields (fields were obscured in database to future dates to secure anonymity and comply with HIPAA). Comments on calculation method from website (https://mimic.physionet.org/tutorials/intro-to-mimic-iii/): "A patient’s age is given by the difference between their date of birth and the date of their first admission."

In [24]:
import pandas as pd
import numpy as np
pd.options.display.max_columns = 150

In [29]:
chf_patients_cleanish = pd.read_csv('CHF_Patients.csv').drop(labels = 'Unnamed: 0', axis = 1)

In [42]:
ADMISSIONS = pd.read_csv('./MIMIC-III/mimicIII-DB/ADMISSIONS.csv')
ADMISSIONS = ADMISSIONS.iloc[:,1:4].sort_values(by='SUBJECT_ID')

In [43]:
ADMISSIONS.drop_duplicates(subset = 'SUBJECT_ID', keep = 'first', inplace = True)
ADMISSIONS.head()

Unnamed: 0,SUBJECT_ID,HADM_ID,ADMITTIME
211,2,163353,2138-07-17 19:04:00
212,3,145834,2101-10-20 19:08:00
213,4,185777,2191-03-16 00:28:00
214,5,178980,2103-02-02 04:31:00
215,6,107064,2175-05-30 07:15:00


In [44]:
ADMISSIONS = ADMISSIONS.drop(labels = 'HADM_ID', axis = 1)

In [48]:
chf_patients_cleanish = chf_patients_cleanish.merge(ADMISSIONS, how= 'left', on = 'SUBJECT_ID')

In [54]:
print(chf_patients_cleanish.DOB.isnull().sum())
print(chf_patients_cleanish.ADMITTIME.isnull().sum())

0
0


In [70]:
chf_patients_cleanish.ADMITTIME = pd.to_datetime(chf_patients_cleanish['ADMITTIME']).dt.date
chf_patients_cleanish.DOB = pd.to_datetime(chf_patients_cleanish['DOB']).dt.date

In [83]:
x = ((chf_patients_cleanish['ADMITTIME'] - chf_patients_cleanish['DOB'])//365)

In [104]:
chf_patients_cleanish['AGE'] = x.dt.days.astype(int)

#### Known Issue: The website cites two HIPPA-based age manipulations:
- Patients below 15 years of age are children, and therefore default age value to 0.
- Patients above 85 years of age have been given age 300. Direction from the website is that the median value is 91.5 and age for these patients should be imputed as the median value.

In [114]:
print("Patient's Ages below 15")
print(chf_patients_cleanish.loc[chf_patients_cleanish['AGE'] < 15]['AGE'].value_counts())
print('-'*50)
print("Patient's Ages above 100")
print(chf_patients_cleanish.loc[chf_patients_cleanish['AGE'] > 100]['AGE'].value_counts())

Patient's Ages below 15
0     7874
14       1
Name: AGE, dtype: int64
--------------------------------------------------
Patient's Ages above 100
300    1889
301      31
302      18
304      15
303      11
305       9
308       5
307       5
306       5
309       3
Name: AGE, dtype: int64


In [130]:
chf_patients_cleanish.AGE.loc[chf_patients_cleanish.AGE > 100] = 92

Series([], Name: AGE, dtype: int64)

In [134]:
chf_patients_cleanish.AGE.value_counts()

0     7874
92    1991
78     875
77     864
76     831
69     826
72     814
80     814
68     806
79     798
81     795
66     788
64     780
60     779
75     779
67     775
62     774
71     772
65     771
63     765
58     762
83     756
82     755
59     753
74     753
61     746
84     738
73     733
57     727
70     722
      ... 
42     336
40     302
41     297
39     287
38     254
37     248
36     208
34     193
33     193
35     190
21     188
22     180
32     176
25     176
23     172
28     171
26     166
20     166
24     162
29     161
31     159
30     158
27     148
19     143
18     132
17      66
89      33
16      21
15       8
14       1
Name: AGE, Length: 78, dtype: int64

Now let's clean up / bucket Language, Marital Status and Religion fields for simpler categorization

In [137]:
chf_patients_cleanish.LANGUAGE = chf_patients_cleanish.LANGUAGE.fillna('NONE')
chf_patients_cleanish.loc[chf_patients_cleanish.LANGUAGE.isin(['ENGL','NONE']) == False, 'LANGUAGE'] = 'OTHER'
chf_patients_cleanish.MARITAL_STATUS = chf_patients_cleanish.MARITAL_STATUS.fillna('UNKNOWN')
chf_patients_cleanish.loc[chf_patients_cleanish.MARITAL_STATUS == 'UNKNOWN (DEFAULT)','MARITAL_STATUS'] = 'UNKNOWN'
chf_patients_cleanish.loc[chf_patients_cleanish.MARITAL_STATUS.isin(['WIDOWED','DIVORCED','SEPARATED']),'MARITAL_STATUS'] = 'POSTMARRIED'
chf_patients_cleanish.loc[chf_patients_cleanish.MARITAL_STATUS == 'LIFE PARTNER','MARITAL_STATUS'] = 'MARRIED'
chf_patients_cleanish.loc[chf_patients_cleanish.RELIGION.isin(['NOT SPECIFIED', 'UNOBTAINABLE']),'RELIGION'] = 'UNKNOWN'
chf_patients_cleanish.RELIGION = chf_patients_cleanish.RELIGION.fillna('UNKNOWN')
chf_patients_cleanish.loc[chf_patients_cleanish.RELIGION.isin(['UNKNOWN', 'CATHOLIC','PROTESTANT QUAKER','JEWISH']) == False,'RELIGION'] = 'OTHER'

In [141]:
#Check
print("Patient's Cleaned Languages:")
print(chf_patients_cleanish.LANGUAGE.value_counts())
print('-'*50)
print("Patient's Cleaned Marital Status:")
print(chf_patients_cleanish.MARITAL_STATUS.value_counts())
print('-'*50)
print("Patient's Cleaned Religion:")
print(chf_patients_cleanish.RELIGION.value_counts())

Patient's Cleaned Languages:
NONE     22130
ENGL     20983
OTHER     3407
Name: LANGUAGE, dtype: int64
--------------------------------------------------
Patient's Cleaned Marital Status:
MARRIED        18560
UNKNOWN        10092
SINGLE          9740
POSTMARRIED     8128
Name: MARITAL_STATUS, dtype: int64
--------------------------------------------------
Patient's Cleaned Religion:
UNKNOWN              17703
CATHOLIC             15660
PROTESTANT QUAKER     5118
OTHER                 4206
JEWISH                3833
Name: RELIGION, dtype: int64


In [144]:
chf_patients_cleanish.drop(labels = ['DOB', 'ADMITTIME'], axis = 1, inplace = True)

In [146]:
chf_patients_cleanish.to_csv('CHF_Patients2.csv')

### (Checkpoint Break)
### Additional Research Required!
Must determine how to handle imputation methods on NA blood measurements and children (Age < 15) with more research.

## Research Results:
- Variance of blood measurements is likely to not be useful in our modeling phases. Primarily, we have standard deviation of every measurement for every patient, and so an additional measurement of variance for that same measurement/patient combo will not explain any additional variance in the prediction of our model. These features will be removed.
- Children have largely different blood measurements for certain conditions. Due to time constraints, we will remove patients' observations with AGE < 18. In future work, we will seek to build a separate product for predicting heart failure in children.
- There are many research findings in blood measurements; these will be grouped and summarised in the imputation section below.

In [9]:
import pandas as pd
import numpy as np

pd.options.display.max_columns = 150

In [5]:
chf_patients = pd.read_csv('CHF_Patients2.csv').drop(labels = ['Unnamed: 0'], axis = 1)

In [11]:
for i in chf_patients.columns:
    print(i)

SUBJECT_ID
GENDER
DOD
EXPIRE_FLAG
Anion Gapmin
Base Excessmin
Bicarbonatemin
Calcium, Totalmin
Calculated Total CO2min
Chloridemin
Creatininemin
Glucosemin
Hematocritmin
Hemoglobinmin
INR(PT)min
MCHmin
MCHCmin
MCVmin
Magnesiummin
PTmin
PTTmin
Phosphatemin
Platelet Countmin
Potassiummin
RDWmin
Red Blood Cellsmin
SPECIMEN TYPEmin
Sodiummin
Urea Nitrogenmin
White Blood Cellsmin
pCO2min
pHmin
pO2min
Anion Gapmean
Base Excessmean
Bicarbonatemean
Calcium, Totalmean
Calculated Total CO2mean
Chloridemean
Creatininemean
Glucosemean
Hemoglobinmean
INR(PT)mean
MCHmean
MCVmean
Magnesiummean
PTmean
PTTmean
Phosphatemean
Platelet Countmean
Potassiummean
Red Blood Cellsmean
SPECIMEN TYPEmean
Sodiummean
Urea Nitrogenmean
White Blood Cellsmean
pCO2mean
pHmean
pO2mean
Anion Gapmax
Base Excessmax
Bicarbonatemax
Calcium, Totalmax
Calculated Total CO2max
Chloridemax
Creatininemax
Glucosemax
Hematocritmax
Hemoglobinmax
INR(PT)max
MCHmax
MCHCmax
MCVmax
Magnesiummax
PTmax
PTTmax
Phosphatemax
Platelet Countmax

Drop variance measurement columns.

In [12]:
chf_patients = chf_patients.drop(labels = ['Anion Gapvar','Base Excessvar','Bicarbonatevar','Calcium, Totalvar',\
                                           'Calculated Total CO2var','Chloridevar','Creatininevar','Glucosevar',\
                                           'Hemoglobinvar','INR(PT)var','MCHvar','MCVvar','Magnesiumvar','PTvar',\
                                           'PTTvar','Phosphatevar','Platelet Countvar','Potassiumvar',\
                                           'Red Blood Cellsvar','SPECIMEN TYPEvar','Sodiumvar','Urea Nitrogenvar',\
                                           'White Blood Cellsvar','pCO2var','pHvar','pO2var'], axis = 1)

Drop rows for patients' AGE < 18.

In [22]:
children_index = chf_patients.loc[chf_patients['AGE']<18].index
chf_patients.drop(index = children_index, axis = 0, inplace = True)
chf_patients.shape

(38550, 121)

In [23]:
chf_patients.CHF.value_counts()

0.0    28722
1.0     9828
Name: CHF, dtype: int64

### NOTE: New Baseline Accuracy for Adult Congestive Heart Failure Prediction: 25.5%

In [28]:
9828/len(chf_patients.CHF)

0.25494163424124516

### Determine if remaining measurements should be imputed or dropped due to volume of missingness.

In [32]:
#SUM OF 'DOD' is captured in 'EXPIRE_FLAG', drop it.
chf_patients.drop(labels = 'DOD', axis = 1, inplace = True)

In [44]:
chf_patients.isnull().sum()[chf_patients.isnull().sum()>5000]

Base Excessmin               9191
Calculated Total CO2min      9191
SPECIMEN TYPEmin            38550
pCO2min                      9194
pHmin                        8157
pO2min                       9194
Base Excessmean              9191
Calculated Total CO2mean     9191
SPECIMEN TYPEmean           38550
pCO2mean                     9194
pHmean                       8157
pO2mean                      9194
Base Excessmax               9191
Calculated Total CO2max      9191
SPECIMEN TYPEmax            38550
pCO2max                      9194
pHmax                        8157
pO2max                       9194
Base Excessstd              13505
Calculated Total CO2std     13503
SPECIMEN TYPEstd            38550
pCO2std                     13504
pHstd                       12618
pO2std                      13502
dtype: int64

Only SPECIMEN TYPE is missing more than 1/3 of rows (all NA) so this will be the only measurement we remove based on missingness.

In [45]:
chf_patients.drop(labels = ['SPECIMEN TYPEmin', 'SPECIMEN TYPEmean', 'SPECIMEN TYPEmax', 'SPECIMEN TYPEstd'],\
                  axis = 1, inplace = True)
chf_patients.shape

(38550, 116)

At this point, without imputation, Random Forest Classification can be achieved even with NA's. Export for Adults with NA blood measurements for initial RF Classification done here as checkpoint.

In [63]:
chf_patients.to_csv('CHF_Adults_wNAs_2019-09-15.csv')

## Imputation

### Assumptions:
Without any information in the MIMIC-III DB documentation on why certain blood measurements do not appear, we have made the educated assumption that hospitals will test for factors relating to a condition that is known or being tested by a doctor. With this assumption, missing values are NOT missing completely at random (MCAR) because there will likely be a bias between the presence of a blood measurement and any certain diagnosed condition related to those measurements. Additionally, patients in this database have allowed for all measurements related to their stays be recorded at a voluntary basis; so it is unlikely that missing measurements are Missing Not at Random (MNAR). Therefore, we conclude that NA blood measurements are missing at random (MAR).
### Methodology:
As the database may hold blood measurements biased by unknown doctor's opinions and their suspected diagnoses, we will rely on medical research as much as possible to indicate imputation methods for missing data. Another practical note is that these are measurements from an Intensive Care Unit (ICU) at a hospital. Blood measurements in this population are likely to deviate largely from the human global population as hospital patients's presence in an ICU is evidence that something in the patients' health is not normal. Based on this assumption, however, we can assume that if a blood measurement is not taken for an individual, the missing blood measurement is likely due to the fact that the patient is not at risk for conditions relevant to that measurement; and therefore the measurement can be assumed to be "normal" to the global population of adults.

### Imputation methods on blood measurements are based on three buckets:
- Measurements with no determined research as to what classifies "Normal": Mean of patient "min", "mean", "max" and "std" will be used (reluctantly).
     - Blood Measurements: Base Excess & Calculated Total CO2.
- Measurements with determined research to common universal ranges will be imputed as such for healthy individuals.
     - Blood Measurements: pCO2, pH, pO2,Anion Gap, Bicarbonate, Calcium(Total), Chloride, Glucose, Magnesium, Phosphate, Potassium, Sodium, Urea Nitrogen, INR(PT), MCH, MCHC, MCV, Platelet Count, PT, PTT, RDW, White Blood Cells

- Measurements with determined research to gender-specific ranges will be imputed as such for healthy individuals by their respective gender.
     - Blood Measurements: Creatinine, Hematocrit, Hemoglobin, Red Blood Cells

In [48]:
chf_patients.sample(10)

Unnamed: 0,SUBJECT_ID,GENDER,EXPIRE_FLAG,Anion Gapmin,Base Excessmin,Bicarbonatemin,"Calcium, Totalmin",Calculated Total CO2min,Chloridemin,Creatininemin,Glucosemin,Hematocritmin,Hemoglobinmin,INR(PT)min,MCHmin,MCHCmin,MCVmin,Magnesiummin,PTmin,PTTmin,Phosphatemin,Platelet Countmin,Potassiummin,RDWmin,Red Blood Cellsmin,Sodiummin,Urea Nitrogenmin,White Blood Cellsmin,pCO2min,pHmin,pO2min,Anion Gapmean,Base Excessmean,Bicarbonatemean,"Calcium, Totalmean",Calculated Total CO2mean,Chloridemean,Creatininemean,Glucosemean,Hemoglobinmean,INR(PT)mean,MCHmean,MCVmean,Magnesiummean,PTmean,PTTmean,Phosphatemean,Platelet Countmean,Potassiummean,Red Blood Cellsmean,Sodiummean,Urea Nitrogenmean,White Blood Cellsmean,pCO2mean,pHmean,pO2mean,Anion Gapmax,Base Excessmax,Bicarbonatemax,"Calcium, Totalmax",Calculated Total CO2max,Chloridemax,Creatininemax,Glucosemax,Hematocritmax,Hemoglobinmax,INR(PT)max,MCHmax,MCHCmax,MCVmax,Magnesiummax,PTmax,PTTmax,Phosphatemax,Platelet Countmax,Potassiummax,RDWmax,Red Blood Cellsmax,Sodiummax,Urea Nitrogenmax,White Blood Cellsmax,pCO2max,pHmax,pO2max,Anion Gapstd,Base Excessstd,Bicarbonatestd,"Calcium, Totalstd",Calculated Total CO2std,Chloridestd,Creatininestd,Glucosestd,Hemoglobinstd,INR(PT)std,MCHstd,MCVstd,Magnesiumstd,PTstd,PTTstd,Phosphatestd,Platelet Countstd,Potassiumstd,Red Blood Cellsstd,Sodiumstd,Urea Nitrogenstd,White Blood Cellsstd,pCO2std,pHstd,pO2std,CHF,INSURANCE,LANGUAGE,RELIGION,MARITAL_STATUS,ETHNICITY,AGE
5370,3775,F,0,13.0,0.0,26.0,8.9,24.0,97.0,1.3,111.0,25.9,9.1,1.3,28.7,33.6,84.0,1.4,13.8,52.1,3.0,157.0,3.8,13.0,3.04,136.0,20.0,4.5,29.0,7.41,88.0,14.363636,2.5,28.272727,9.2,27.25,99.545455,1.416667,184.545455,11.3,1.3,29.24,85.0,1.945455,13.8,62.7,3.525,181.6,4.292308,3.888,137.916667,26.0,6.2,37.75,7.455,108.75,15.0,6.0,31.0,9.7,32.0,104.0,1.6,350.0,39.3,12.6,1.3,30.0,35.3,86.0,2.5,13.8,73.3,4.1,212.0,5.1,13.7,4.38,142.0,34.0,7.9,45.0,7.51,130.0,0.80904,2.645751,1.793929,0.303315,3.593976,2.0181,0.093744,70.520017,1.632483,0.0,0.610737,1.0,0.284125,0.0,14.990664,0.45,22.400893,0.3662,0.635508,1.975225,4.880387,1.305756,6.601767,0.042032,19.172463,1.0,Medicare,NONE,PROTESTANT QUAKER,POSTMARRIED,WHITE,80
10485,12769,M,0,9.0,-6.0,21.0,7.6,20.0,100.0,0.8,75.0,22.0,8.6,1.0,28.9,33.6,81.0,1.4,12.7,22.1,3.8,91.0,3.6,12.8,2.86,132.0,12.0,4.0,34.0,7.27,82.0,12.388889,0.676471,25.52381,8.46,26.176471,103.571429,0.952174,105.333333,12.135,1.273684,30.42,84.7,1.916,14.078947,34.642857,4.354545,156.15625,4.147826,3.996,136.65,16.73913,7.045,40.941176,7.403714,146.117647,16.0,6.0,31.0,9.2,32.0,114.0,1.1,160.0,43.2,14.9,1.7,31.8,38.0,90.0,2.3,16.2,49.9,5.3,337.0,4.7,14.6,5.15,141.0,24.0,10.3,55.0,7.51,394.0,1.914001,2.878558,2.581067,0.531664,2.455305,3.107594,0.103877,22.36857,1.917036,0.205053,0.799737,2.341839,0.199332,1.152089,8.441894,0.47194,46.226029,0.347549,0.693287,2.641272,2.733918,1.691301,5.943885,0.077842,71.937584,0.0,Private,NONE,CATHOLIC,MARRIED,WHITE,66
26058,32229,F,0,11.0,-2.0,21.0,8.4,21.0,99.0,0.6,117.0,24.0,8.2,0.9,29.0,32.9,83.0,2.2,10.5,22.4,1.7,148.0,3.2,13.6,2.75,138.0,11.0,4.6,27.0,7.3,48.0,13.190476,2.228571,25.952381,8.947368,27.4,107.761905,0.7875,142.619048,10.28,1.177778,29.475,86.5,2.405263,13.133333,31.088889,2.831579,275.05,3.754545,3.4895,143.136364,18.583333,9.7,40.628571,7.428286,188.457143,15.0,8.0,31.0,9.6,35.0,117.0,1.3,195.0,36.1,12.3,1.9,30.0,35.8,90.0,2.8,19.5,46.2,3.8,411.0,4.5,14.8,4.14,150.0,28.0,14.6,60.0,7.57,503.0,1.209093,2.414243,2.889225,0.365708,3.541352,4.83637,0.15411,17.206034,1.318532,0.311359,0.366886,1.90567,0.154466,2.82179,8.847379,0.684797,86.548116,0.321792,0.44493,3.681297,5.10683,2.348572,8.051619,0.052215,84.037227,0.0,Medicaid,ENGL,OTHER,MARRIED,HISPANIC OR LATINO,57
5166,5602,F,1,10.0,,16.0,7.7,,108.0,1.4,89.0,20.5,6.4,1.0,30.3,30.4,91.0,1.7,12.3,22.1,3.0,178.0,4.0,18.7,2.07,140.0,22.0,6.0,,,,14.166667,,19.666667,8.075,,112.0,1.733333,110.5,9.083333,1.2,30.9,95.166667,1.95,13.583333,25.683333,3.375,262.666667,4.083333,2.933333,141.833333,29.333333,9.516667,,,,17.0,,21.0,8.9,,115.0,2.1,131.0,35.9,11.6,1.9,31.4,34.1,100.0,2.7,17.5,36.2,3.8,371.0,4.2,20.0,3.7,144.0,34.0,14.7,,,,2.483277,,1.966384,0.556028,,2.828427,0.294392,19.30544,2.207638,0.34641,0.481664,3.656045,0.5,1.94876,5.410884,0.330404,75.045764,0.075277,0.697386,1.722401,4.802777,3.3054,,,,0.0,Medicare,NONE,CATHOLIC,POSTMARRIED,WHITE,92
30245,54341,M,0,14.0,0.0,23.0,9.1,25.0,106.0,0.8,106.0,39.0,13.7,1.0,29.9,34.9,85.0,1.8,11.8,25.2,3.0,184.0,3.7,16.7,4.58,141.0,12.0,6.0,38.0,7.41,160.0,14.666667,0.0,24.666667,9.2,25.0,106.0,0.833333,115.0,13.933333,1.0,30.4,85.666667,1.85,11.8,25.2,3.4,193.666667,3.8,4.586667,141.333333,13.333333,7.9,38.0,7.41,160.0,16.0,0.0,26.0,9.3,25.0,106.0,0.9,131.0,39.5,14.2,1.0,30.9,36.0,86.0,1.9,11.8,25.2,3.8,206.0,3.9,17.2,4.59,142.0,15.0,10.5,38.0,7.41,160.0,1.154701,,1.527525,0.141421,,0.0,0.057735,13.892444,0.251661,,0.5,0.57735,0.070711,,,0.565685,11.23981,0.1,0.005774,0.57735,1.527525,2.330236,,,,0.0,Medicare,ENGL,PROTESTANT QUAKER,POSTMARRIED,BLACK/AFRICAN AMERICAN,59
31738,59807,F,1,10.0,-4.0,20.0,6.7,21.0,103.0,1.2,93.0,23.8,8.2,1.0,30.5,32.2,87.0,2.1,11.9,25.4,2.4,82.0,3.3,13.8,2.68,136.0,26.0,6.2,30.0,7.35,45.0,13.095238,-2.6,22.952381,8.138462,22.4,108.454545,1.631818,114.666667,9.56087,1.1375,31.461905,90.52381,2.4875,13.5125,43.944444,3.623077,130.0,3.85,3.043333,140.818182,38.952381,8.242857,36.0,7.388,239.6,15.0,0.0,29.0,9.0,24.0,114.0,2.2,164.0,30.0,10.4,1.5,32.1,36.0,96.0,2.7,16.5,150.0,5.0,196.0,4.3,15.8,3.3,148.0,52.0,11.4,40.0,7.5,397.0,1.670472,1.516575,2.290768,0.540892,1.140175,3.418817,0.273228,17.655972,0.552465,0.159799,0.444383,2.088517,0.15864,1.394312,40.1557,0.939108,37.702785,0.30355,0.174366,3.825184,7.144762,1.377524,3.674235,0.063403,153.263825,1.0,Medicare,ENGL,UNKNOWN,SINGLE,WHITE,92
40933,73614,M,1,29.0,-9.0,14.0,8.0,16.0,97.0,3.4,63.0,34.2,10.2,2.5,20.6,28.7,72.0,2.6,26.5,33.5,6.8,392.0,5.0,22.7,4.79,137.0,59.0,19.1,30.0,7.29,47.0,30.25,-5.0,15.5,8.333333,18.5,99.75,3.925,92.75,10.65,2.55,20.9,72.0,2.8,26.65,34.6,8.4,440.5,5.875,5.085,139.5,72.25,19.65,31.0,7.365,68.0,33.0,-1.0,17.0,9.0,21.0,102.0,4.5,135.0,38.7,11.1,2.6,21.2,29.7,72.0,3.2,26.8,35.7,10.5,489.0,7.2,23.4,5.38,141.0,87.0,20.2,32.0,7.44,89.0,1.892969,5.656854,1.290994,0.57735,3.535534,2.217356,0.512348,32.806249,0.636396,0.070711,0.424264,0.0,0.34641,0.212132,1.555635,1.9,68.589358,0.93586,0.417193,1.732051,12.996794,0.777817,1.414214,0.106066,29.698485,0.0,Self Pay,ENGL,CATHOLIC,MARRIED,BLACK/AFRICAN AMERICAN,55
4124,4562,M,1,16.0,-5.0,19.0,,19.0,105.0,1.0,89.0,34.9,12.1,1.2,28.1,34.1,82.0,,12.9,25.3,,188.0,3.8,15.4,4.26,141.0,20.0,4.6,32.0,7.27,77.0,16.666667,-3.0,22.333333,,23.666667,106.666667,1.133333,108.333333,12.8,1.2,28.9,82.333333,,13.05,25.35,,210.333333,4.025,4.433333,141.666667,23.666667,9.466667,46.666667,7.306667,104.0,18.0,0.0,26.0,,29.0,108.0,1.3,133.0,38.9,13.8,1.2,29.5,35.9,83.0,,13.2,25.4,,235.0,4.3,15.8,4.74,143.0,29.0,15.0,59.0,7.37,144.0,1.154701,2.645751,3.511885,,5.033223,1.527525,0.152753,22.47962,0.888819,0.0,0.72111,0.57735,,0.212132,0.070711,,23.586719,0.206155,0.266333,1.154701,4.725816,5.231953,13.650397,0.055076,35.341194,0.0,Medicare,NONE,JEWISH,MARRIED,WHITE,75
28142,28043,M,1,8.0,-9.0,15.0,7.3,18.0,100.0,0.5,31.0,18.9,6.1,1.1,27.4,31.1,82.0,1.5,13.1,21.0,1.4,144.0,2.9,13.6,2.02,134.0,6.0,5.7,26.0,7.27,33.0,12.901235,-4.277778,22.777778,7.884058,20.166667,108.716049,1.159259,128.91358,9.85375,1.256522,29.3825,89.45,1.902778,14.491304,32.052174,2.962687,356.05,4.151807,3.371125,140.222222,23.506173,9.8025,33.055556,7.375556,170.277778,24.0,0.0,31.0,8.9,25.0,118.0,2.3,237.0,42.8,13.9,1.5,31.8,35.3,98.0,2.7,16.6,42.7,4.0,666.0,6.2,17.9,4.99,147.0,53.0,22.6,38.0,7.45,416.0,2.804483,2.865732,3.660601,0.441129,2.176073,4.287291,0.601202,45.010331,1.501303,0.099206,1.224825,4.257235,0.218169,0.937071,5.64212,0.613444,109.252454,0.643421,0.587434,3.189828,10.8928,3.10516,3.572315,0.064189,104.384361,0.0,Medicare,ENGL,CATHOLIC,MARRIED,WHITE,86
38521,99004,M,0,10.0,0.0,21.0,7.7,23.0,98.0,0.8,87.0,32.4,10.7,1.1,29.8,31.1,91.0,1.7,12.6,27.2,2.2,335.0,3.7,14.4,3.48,135.0,9.0,12.3,30.0,7.38,105.0,12.333333,1.0,27.933333,8.609091,24.5,101.533333,0.873333,99.333333,11.8875,1.1,30.59375,93.125,2.06,12.6,27.5,2.718182,416.75,4.046667,3.888125,137.666667,11.533333,16.70625,33.0,7.44,119.0,16.0,2.0,32.0,9.0,26.0,107.0,1.1,117.0,42.4,13.2,1.1,31.5,34.0,98.0,2.5,12.6,27.8,3.4,495.0,4.6,15.0,4.31,139.0,15.0,23.6,36.0,7.48,133.0,1.914854,1.414214,3.283436,0.415823,2.12132,2.325838,0.088372,10.132456,0.684957,0.0,0.499958,1.821172,0.244365,0.0,0.424264,0.384235,42.894444,0.25317,0.255087,1.496026,2.099887,2.742376,4.242641,0.052915,19.79899,0.0,Medicare,ENGL,CATHOLIC,MARRIED,WHITE,73


#### No Determined Research: Impute as patient population average min, mean, max, std per measurement:
- Base Excess
- Calculated Total CO2

Base Excess

In [86]:
print(chf_patients['Base Excessmin'].mean())
print(chf_patients['Base Excessmean'].mean())
print(chf_patients['Base Excessmax'].mean())
print(chf_patients['Base Excessstd'].mean())

-4.2978984297830305
-0.40629088286526194
3.1844408869511907
2.711110441996879


In [90]:
#Impute patient sample avg. values for respetive min, mean, max, std:
chf_patients['Base Excessmin'] = chf_patients['Base Excessmin'].fillna(chf_patients['Base Excessmin'].mean())
chf_patients['Base Excessmean'] = chf_patients['Base Excessmean'].fillna(chf_patients['Base Excessmean'].mean())
chf_patients['Base Excessmax'] = chf_patients['Base Excessmax'].fillna(chf_patients['Base Excessmax'].mean())
chf_patients['Base Excessstd'] = chf_patients['Base Excessstd'].fillna(chf_patients['Base Excessstd'].mean())

In [91]:
print(chf_patients['Calculated Total CO2min'].mean())
print(chf_patients['Calculated Total CO2mean'].mean())
print(chf_patients['Calculated Total CO2max'].mean())
print(chf_patients['Calculated Total CO2std'].mean())

21.65659593310399
25.4948631343657
29.399979563336625
2.8242472866274926


In [93]:
chf_patients['Calculated Total CO2min'] = chf_patients['Calculated Total CO2min'].fillna(chf_patients['Calculated Total CO2min'].mean())
chf_patients['Calculated Total CO2mean'] = chf_patients['Calculated Total CO2mean'].fillna(chf_patients['Calculated Total CO2mean'].mean())
chf_patients['Calculated Total CO2max'] = chf_patients['Calculated Total CO2max'].fillna(chf_patients['Calculated Total CO2max'].mean())
chf_patients['Calculated Total CO2std'] = chf_patients['Calculated Total CO2std'].fillna(chf_patients['Calculated Total CO2std'].mean())

#### Determined research to common universal ranges, imputed values for healthy individuals:
Blood Measurements: pCO2, pH, pO2,Anion Gap, Bicarbonate, Calcium(Total), Chloride, Glucose, Magnesium, Phosphate, Potassium, Sodium, Urea Nitrogen, INR(PT), MCH, MCHC, MCV, Platelet Count, PT, PTT, RDW, White Blood Cells

**pCO2**: research says normal values are 38 to 42 mm Hg.

In [97]:
print(chf_patients['pCO2min'].mean())
print(chf_patients['pCO2mean'].mean())
print(chf_patients['pCO2max'].mean())
print(chf_patients['pCO2std'].mean())

33.9358563837035
41.446819079919194
51.817822591633735
6.274777371381051


In [98]:
#STD calc:
np.abs(42-40)*.683

1.366

In [99]:
chf_patients['pCO2min'] = chf_patients['pCO2min'].fillna(38)
chf_patients['pCO2mean'] = chf_patients['pCO2mean'].fillna(40)
chf_patients['pCO2max'] = chf_patients['pCO2max'].fillna(42)
chf_patients['pCO2std'] = chf_patients['pCO2std'].fillna(1.366)

**pH**: research says normal values are 7.38 to 7.42.

In [101]:
print(chf_patients['pHmin'].mean())
print(chf_patients['pHmean'].mean())
print(chf_patients['pHmax'].mean())
print(chf_patients['pHstd'].mean())

7.291275951699279
7.3819648483644515
7.455116638699912
0.058320829385609306


In [102]:
#STD calc:
np.abs(7.42-7.40)*.683

0.01365999999999971

In [103]:
chf_patients['pHmin'] = chf_patients['pHmin'].fillna(7.38)
chf_patients['pHmean'] = chf_patients['pHmean'].fillna(7.40)
chf_patients['pHmax'] = chf_patients['pHmax'].fillna(7.42)
chf_patients['pHstd'] = chf_patients['pHstd'].fillna(0.01366)

**pO2**: research says normal values are 75 to 100 millimeters of mercury (mm Hg).

In [110]:
print(chf_patients['pO2min'].mean())
print(chf_patients['pO2mean'].mean())
print(chf_patients['pO2max'].mean())
print(chf_patients['pO2std'].mean())

84.55961302629787
157.99299657846782
286.22370213925603
77.38185881704379


In [111]:
#STD calc:
np.abs(100-87.5)*.683

8.537500000000001

In [112]:
chf_patients['pO2min'] = chf_patients['pO2min'].fillna(75)
chf_patients['pO2mean'] = chf_patients['pO2mean'].fillna(87.5)
chf_patients['pO2max'] = chf_patients['pO2max'].fillna(100)
chf_patients['pO2std'] = chf_patients['pO2std'].fillna(8.5375)

**Anion Gap**: research says normal values are 8-16 mEq/L.

In [105]:
print(chf_patients['Anion Gapmin'].mean())
print(chf_patients['Anion Gapmean'].mean())
print(chf_patients['Anion Gapmax'].mean())
print(chf_patients['Anion Gapstd'].mean())

10.235407697325506
13.669952342565788
18.238851924331378
2.423014208750281


In [107]:
#STD calc:
np.abs(16-12)*.683

2.732

In [108]:
chf_patients['Anion Gapmin'] = chf_patients['Anion Gapmin'].fillna(8)
chf_patients['Anion Gapmean'] = chf_patients['Anion Gapmean'].fillna(12)
chf_patients['Anion Gapmax'] = chf_patients['Anion Gapmax'].fillna(16)
chf_patients['Anion Gapstd'] = chf_patients['Anion Gapstd'].fillna(2.732)

**Bicarbonate**: research says normal values are 23 to 30 mEq/L.

In [114]:
print(chf_patients['Bicarbonatemin'].mean())
print(chf_patients['Bicarbonatemean'].mean())
print(chf_patients['Bicarbonatemax'].mean())
print(chf_patients['Bicarbonatestd'].mean())

20.636980797328324
25.380718690686404
29.798510227509915
2.838707670660657


In [115]:
#STD calc:
np.abs(30-26.5)*.683

2.3905000000000003

In [116]:
chf_patients['Bicarbonatemin'] = chf_patients['Bicarbonatemin'].fillna(23)
chf_patients['Bicarbonatemean'] = chf_patients['Bicarbonatemean'].fillna(26.5)
chf_patients['Bicarbonatemax'] = chf_patients['Bicarbonatemax'].fillna(30)
chf_patients['Bicarbonatestd'] = chf_patients['Bicarbonatestd'].fillna(2.39)

**Calcium(Total)**: research says normal values are 8.5 to 10.5 mg/dl.

In [118]:
print(chf_patients['Calcium, Totalmin'].mean())
print(chf_patients['Calcium, Totalmean'].mean())
print(chf_patients['Calcium, Totalmax'].mean())
print(chf_patients['Calcium, Totalstd'].mean())

7.76487757418514
8.498114956057885
9.254410572401545
0.4926040564470654


In [119]:
#STD calc:
np.abs(10.5-9.5)*.683

0.683

In [120]:
chf_patients['Calcium, Totalmin'] = chf_patients['Calcium, Totalmin'].fillna(8.5)
chf_patients['Calcium, Totalmean'] = chf_patients['Calcium, Totalmean'].fillna(9.5)
chf_patients['Calcium, Totalmax'] = chf_patients['Calcium, Totalmax'].fillna(10.5)
chf_patients['Calcium, Totalstd'] = chf_patients['Calcium, Totalstd'].fillna(0.683)

**Chloride**: research says normal values are 96 and 106 milliequivalents of chloride per liter of blood (mEq/L).

In [122]:
print(chf_patients['Chloridemin'].mean())
print(chf_patients['Chloridemean'].mean())
print(chf_patients['Chloridemax'].mean())
print(chf_patients['Chloridestd'].mean())

98.0518667327611
103.93149143983936
109.87792533068955
3.6173983394565696


In [123]:
#STD calc:
np.abs(106-101)*.683

3.415

In [124]:
chf_patients['Chloridemin'] = chf_patients['Chloridemin'].fillna(96)
chf_patients['Chloridemean'] = chf_patients['Chloridemean'].fillna(101)
chf_patients['Chloridemax'] = chf_patients['Chloridemax'].fillna(106)
chf_patients['Chloridestd'] = chf_patients['Chloridestd'].fillna(3.415)

**Glucose**: research says normal values for *fasting* are 72 to 99 mg/dL.

In [126]:
print(chf_patients['Glucosemin'].mean())
print(chf_patients['Glucosemean'].mean())
print(chf_patients['Glucosemax'].mean())
print(chf_patients['Glucosestd'].mean())

86.85826831241356
129.80386061004876
218.92254899402417
36.99331774236841


In [127]:
#STD calc:
np.abs(99-85.5)*.683

9.220500000000001

In [128]:
chf_patients['Glucosemin'] = chf_patients['Glucosemin'].fillna(72)
chf_patients['Glucosemean'] = chf_patients['Glucosemean'].fillna(85.5)
chf_patients['Glucosemax'] = chf_patients['Glucosemax'].fillna(99)
chf_patients['Glucosestd'] = chf_patients['Glucosestd'].fillna(9.22)

**Magnesium**: research says normal values are 1.5 to 2.5 mEq/L.

In [131]:
print(chf_patients['Magnesiummin'].mean())
print(chf_patients['Magnesiummean'].mean())
print(chf_patients['Magnesiummax'].mean())
print(chf_patients['Magnesiumstd'].mean())

1.6783169620520004
2.0229706327955745
2.461762304795518
0.24190291399528693


In [132]:
#STD calc:
np.abs(2.5-2)*.683

0.3415

In [133]:
chf_patients['Magnesiummin'] = chf_patients['Magnesiummin'].fillna(1.5)
chf_patients['Magnesiummean'] = chf_patients['Magnesiummean'].fillna(2.0)
chf_patients['Magnesiummax'] = chf_patients['Magnesiummax'].fillna(2.5)
chf_patients['Magnesiumstd'] = chf_patients['Magnesiumstd'].fillna(0.3415)

**Phosphate**: research says normal values are 2.5 to 4.5 mg/dL.

In [135]:
print(chf_patients['Phosphatemin'].mean())
print(chf_patients['Phosphatemean'].mean())
print(chf_patients['Phosphatemax'].mean())
print(chf_patients['Phosphatestd'].mean())

2.446375472717452
3.4483785499049198
4.7405294435440535
0.7706883915658147


In [136]:
#STD calc:
np.abs(4.5-3.5)*.683

0.683

In [137]:
chf_patients['Phosphatemin'] = chf_patients['Phosphatemin'].fillna(2.5)
chf_patients['Phosphatemean'] = chf_patients['Phosphatemean'].fillna(3.5)
chf_patients['Phosphatemax'] = chf_patients['Phosphatemax'].fillna(4.5)
chf_patients['Phosphatestd'] = chf_patients['Phosphatestd'].fillna(0.683)

**Potassium**: research says normal values are 3.6 to 5.2 millimoles per liter (mmol/L).

In [139]:
print(chf_patients['Potassiummin'].mean())
print(chf_patients['Potassiummean'].mean())
print(chf_patients['Potassiummax'].mean())
print(chf_patients['Potassiumstd'].mean())

3.449188752086823
4.1156374833900955
5.084673414023333
0.4573705732113823


In [140]:
#STD calc:
np.abs(5.2-4.4)*.683

0.5463999999999999

In [141]:
chf_patients['Potassiummin'] = chf_patients['Potassiummin'].fillna(3.6)
chf_patients['Potassiummean'] = chf_patients['Potassiummean'].fillna(4.4)
chf_patients['Potassiummax'] = chf_patients['Potassiummax'].fillna(5.2)
chf_patients['Potassiumstd'] = chf_patients['Potassiumstd'].fillna(0.5464)

**Sodium**: research says normal values are 135 to 145 milliequivalents per liter (mEq/L).

In [142]:
print(chf_patients['Sodiummin'].mean())
print(chf_patients['Sodiummean'].mean())
print(chf_patients['Sodiummax'].mean())
print(chf_patients['Sodiumstd'].mean())

133.9509274477577
138.77123556137042
143.29268738097102
2.792935404776167


In [143]:
#STD calc:
np.abs(145-140)*.683

3.415

In [144]:
chf_patients['Sodiummin'] = chf_patients['Sodiummin'].fillna(135)
chf_patients['Sodiummean'] = chf_patients['Sodiummean'].fillna(140)
chf_patients['Sodiummax'] = chf_patients['Sodiummax'].fillna(145)
chf_patients['Sodiumstd'] = chf_patients['Sodiumstd'].fillna(3.145)

**Urea Nitrogen**: research says normal values are 7 to 20 mg/dL.

In [146]:
print(chf_patients['Urea Nitrogenmin'][chf_patients[chf_patients['CHF' == 1.0]].mean())
print(chf_patients['Urea Nitrogenmean'][chf_patients[chf_patients['CHF' == 1.0]].mean())
print(chf_patients['Urea Nitrogenmax'][chf_patients[chf_patients['CHF' == 1.0]].mean())
print(chf_patients['Urea Nitrogenstd'][chf_patients[chf_patients['CHF' == 1.0]].mean())
print(chf_patients['Urea Nitrogenmin'][chf_patients[chf_patients['CHF' == 0]].mean())
print(chf_patients['Urea Nitrogenmean'][chf_patients[chf_patients['CHF' == 0]].mean())
print(chf_patients['Urea Nitrogenmax'][chf_patients[chf_patients['CHF' == 0]].mean())
print(chf_patients['Urea Nitrogenstd'][chf_patients[chf_patients['CHF' == 0]].mean())

14.405025753082565
24.107028559944872
38.13339576504865
7.286449288054007


In [147]:
#STD calc:
np.abs(20-13.5)*.683

4.439500000000001

In [148]:
chf_patients['Urea Nitrogenmin'] = chf_patients['Urea Nitrogenmin'].fillna(7)
chf_patients['Urea Nitrogenmean'] = chf_patients['Urea Nitrogenmean'].fillna(13.5)
chf_patients['Urea Nitrogenmax'] = chf_patients['Urea Nitrogenmax'].fillna(20)
chf_patients['Urea Nitrogenstd'] = chf_patients['Urea Nitrogenstd'].fillna(4.4395)

**INR(PT)**(aka International Normalized Ratio PT): research says normal values are 0.8 to 1.1.

In [150]:
print(chf_patients['INR(PT)min'].mean())
print(chf_patients['INR(PT)mean'].mean())
print(chf_patients['INR(PT)max'].mean())
print(chf_patients['INR(PT)std'].mean())

1.1213479582043215
1.4160785827986377
2.269326420727161
0.36183425211582404


In [186]:
#STD calc:
np.abs(1.1-0.95)*.683

0.1024500000000001

In [187]:
chf_patients['INR(PT)min'] = chf_patients['INR(PT)min'].fillna(0.8)
chf_patients['INR(PT)mean'] = chf_patients['INR(PT)mean'].fillna(0.95)
chf_patients['INR(PT)max'] = chf_patients['INR(PT)max'].fillna(1.1)
chf_patients['INR(PT)std'] = chf_patients['INR(PT)std'].fillna(0.10245)

**MCH**: research says normal values are 27-33 picograms (pg)/cell.

In [151]:
print(chf_patients['MCHmin'].mean())
print(chf_patients['MCHmean'].mean())
print(chf_patients['MCHmax'].mean())
print(chf_patients['MCHstd'].mean())

29.07751534699817
30.243354183528
31.4242820726249
0.7105475955619334


In [152]:
#STD calc:
np.abs(33-30)*.683

2.0490000000000004

In [153]:
chf_patients['MCHmin'] = chf_patients['MCHmin'].fillna(27)
chf_patients['MCHmean'] = chf_patients['MCHmean'].fillna(30)
chf_patients['MCHmax'] = chf_patients['MCHmax'].fillna(33)
chf_patients['MCHstd'] = chf_patients['MCHstd'].fillna(2.049)

**MCHC**: research says normal values are 31-37 grams per deciliter (g/dL).

In [157]:
print(chf_patients['MCHCmin'].mean())
print(chf_patients['MCHCmax'].mean())

32.37052152425524
35.11465210040297


In [158]:
chf_patients['MCHCmin'] = chf_patients['MCHCmin'].fillna(31.0)
chf_patients['MCHCmax'] = chf_patients['MCHCmax'].fillna(37.0)

**MCV**: research says normal values are 80-96 fL/red cell.

In [159]:
print(chf_patients['MCVmin'].mean())
print(chf_patients['MCVmean'].mean())
print(chf_patients['MCVmax'].mean())
print(chf_patients['MCVstd'].mean())

86.59364270107167
89.66433535131029
93.03965248153158
1.9668659902074497


In [160]:
#STD calc:
np.abs(96-88)*.683

5.464

In [182]:
chf_patients['MCVmin'] = chf_patients['MCVmin'].fillna(80)
chf_patients['MCVmean'] = chf_patients['MCVmean'].fillna(88)
chf_patients['MCVmax'] = chf_patients['MCVmax'].fillna(96)
chf_patients['MCVstd'] = chf_patients['MCVstd'].fillna(5.464)

**Platelet Count**: research says normal values are 150 to 450 platelets per ml of blood.

In [163]:
print(chf_patients['Platelet Countmin'].mean())
print(chf_patients['Platelet Countmean'].mean())
print(chf_patients['Platelet Countmax'].mean())
print(chf_patients['Platelet Countstd'].mean())

160.5640882253492
240.491081489221
358.7105105729966
61.78687844225944


In [164]:
#STD calc:
np.abs(450-300)*.683

102.45

In [165]:
chf_patients['Platelet Countmin'] = chf_patients['Platelet Countmin'].fillna(150)
chf_patients['Platelet Countmean'] = chf_patients['Platelet Countmean'].fillna(300)
chf_patients['Platelet Countmax'] = chf_patients['Platelet Countmax'].fillna(450)
chf_patients['Platelet Countstd'] = chf_patients['Platelet Countstd'].fillna(102.45)

**PT**: research says normal values are 11 to 13.5 seconds.

In [166]:
print(chf_patients['PTmin'].mean())
print(chf_patients['PTmean'].mean())
print(chf_patients['PTmax'].mean())
print(chf_patients['PTstd'].mean())

12.959123550579793
15.141424675612654
20.379170731707408
2.4304795836858832


In [167]:
#STD calc:
np.abs(13.5-12.25)*.683

0.85375

In [168]:
chf_patients['PTmin'] = chf_patients['PTmin'].fillna(11.0)
chf_patients['PTmean'] = chf_patients['PTmean'].fillna(12.25)
chf_patients['PTmax'] = chf_patients['PTmax'].fillna(13.5)
chf_patients['PTstd'] = chf_patients['PTstd'].fillna(0.85375)

**PTT**: research says normal values are 60-70 seconds.

In [169]:
print(chf_patients['PTTmin'].mean())
print(chf_patients['PTTmean'].mean())
print(chf_patients['PTTmax'].mean())
print(chf_patients['PTTstd'].mean())

26.733198385976202
37.005503838822115
60.14162818662772
11.692964025398425


In [170]:
#STD calc:
np.abs(70-65)*.683

3.415

In [171]:
chf_patients['PTTmin'] = chf_patients['PTTmin'].fillna(60)
chf_patients['PTTmean'] = chf_patients['PTTmean'].fillna(65)
chf_patients['PTTmax'] = chf_patients['PTTmax'].fillna(70)
chf_patients['PTTstd'] = chf_patients['PTTstd'].fillna(3.415)

**RDW**: research says normal values are 11.5-14.5%.

In [174]:
print(chf_patients['RDWmin'].mean())
print(chf_patients['RDWmax'].mean())

14.003436345663475
16.02537328963113


In [175]:
chf_patients['RDWmin'] = chf_patients['RDWmin'].fillna(11.5)
chf_patients['RDWmax'] = chf_patients['RDWmax'].fillna(14.5)

**White Blood Cells**: research says normal values are 4.3 and 10.8 cells per L of blood.

In [176]:
print(chf_patients['White Blood Cellsmin'].mean())
print(chf_patients['White Blood Cellsmean'].mean())
print(chf_patients['White Blood Cellsmax'].mean())
print(chf_patients['White Blood Cellsstd'].mean())

6.868241521015433
10.83467479278342
17.357003225135184
3.2022491305361633


In [177]:
#STD calc:
np.abs(10.8-7.55)*.683

2.219750000000001

In [178]:
chf_patients['White Blood Cellsmin'] = chf_patients['White Blood Cellsmin'].fillna(4.3)
chf_patients['White Blood Cellsmean'] = chf_patients['White Blood Cellsmean'].fillna(7.55)
chf_patients['White Blood Cellsmax'] = chf_patients['White Blood Cellsmax'].fillna(10.8)
chf_patients['White Blood Cellsstd'] = chf_patients['White Blood Cellsstd'].fillna(2.21975)

#### Determined research to gender-specific ranges, imputed gender-respective values for healthy individuals:
- Creatinine
- Hematocrit
- Hemoglobin
- Red Blood Cells

**Creatinine**: research says normal values are 0.6 to 1.2 milligrams (mg) per deciliter (dL) for males and 0.5 to 1.1 milligrams per deciliter in females.

In [193]:
#FEMALES
print(chf_patients['Creatininemin'][chf_patients['GENDER']=='F'].mean())
print(chf_patients['Creatininemean'][chf_patients['GENDER']=='F'].mean())
print(chf_patients['Creatininemax'][chf_patients['GENDER']=='F'].mean())
print(chf_patients['Creatininestd'][chf_patients['GENDER']=='F'].mean())

0.7630548224568229
1.1234104869629142
1.7153670825335923
0.2779546572380727


In [194]:
#STD calc:
np.abs(1.1-0.8)*.683

0.20490000000000005

In [195]:
chf_patients['Creatininemin'][chf_patients['GENDER']=='F'] = chf_patients['Creatininemin'][chf_patients['GENDER']=='F'].fillna(0.5)
chf_patients['Creatininemean'][chf_patients['GENDER']=='F'] = chf_patients['Creatininemean'][chf_patients['GENDER']=='F'].fillna(0.8)
chf_patients['Creatininemax'][chf_patients['GENDER']=='F'] = chf_patients['Creatininemax'][chf_patients['GENDER']=='F'].fillna(1.1)
chf_patients['Creatininestd'][chf_patients['GENDER']=='F'] = chf_patients['Creatininestd'][chf_patients['GENDER']=='F'].fillna(0.2049)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  This is separate from the ipykernel package so we can avoid doing imports until
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  after removing the cwd from sys.path.


In [192]:
#MALES
print(chf_patients['Creatininemin'][chf_patients['GENDER']=='M'].mean())
print(chf_patients['Creatininemean'][chf_patients['GENDER']=='M'].mean())
print(chf_patients['Creatininemax'][chf_patients['GENDER']=='M'].mean())
print(chf_patients['Creatininestd'][chf_patients['GENDER']=='M'].mean())

0.9670969223702376
1.3815066586822622
2.0493247588424457
0.32896712418363483


In [196]:
#STD calc:
np.abs(1.2-0.9)*.683

0.20489999999999997

In [197]:
chf_patients['Creatininemin'][chf_patients['GENDER']=='M'] = chf_patients['Creatininemin'][chf_patients['GENDER']=='M'].fillna(0.6)
chf_patients['Creatininemean'][chf_patients['GENDER']=='M'] = chf_patients['Creatininemean'][chf_patients['GENDER']=='M'].fillna(0.9)
chf_patients['Creatininemax'][chf_patients['GENDER']=='M'] = chf_patients['Creatininemax'][chf_patients['GENDER']=='M'].fillna(1.2)
chf_patients['Creatininestd'][chf_patients['GENDER']=='M'] = chf_patients['Creatininestd'][chf_patients['GENDER']=='M'].fillna(0.2049)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  This is separate from the ipykernel package so we can avoid doing imports until
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  after removing the cwd from sys.path.


**Hematocrit**: research says normal values are 45% to 52% for men for males and 37% to 48% in females.

In [200]:
print(chf_patients['Hematocritmin'][chf_patients['GENDER']=='F'].min())
print(chf_patients['Hematocritmax'][chf_patients['GENDER']=='F'].max())

0.0
77.7


In [202]:
chf_patients['Hematocritmin'][chf_patients['GENDER']=='F'] = chf_patients['Hematocritmin'][chf_patients['GENDER']=='F'].fillna(37)
chf_patients['Hematocritmax'][chf_patients['GENDER']=='F'] = chf_patients['Hematocritmax'][chf_patients['GENDER']=='F'].fillna(48)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  


In [201]:
print(chf_patients['Hematocritmin'][chf_patients['GENDER']=='M'].min())
print(chf_patients['Hematocritmax'][chf_patients['GENDER']=='M'].max())

0.0
70.6


In [203]:
chf_patients['Hematocritmin'][chf_patients['GENDER']=='M'] = chf_patients['Hematocritmin'][chf_patients['GENDER']=='M'].fillna(45.0)
chf_patients['Hematocritmax'][chf_patients['GENDER']=='M'] = chf_patients['Hematocritmax'][chf_patients['GENDER']=='M'].fillna(52.0)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  


**Hemoglobin**: research says normal values are 13.5 to 17.5 grams per deciliter for males and 12.0 to 15.5 grams per deciliter for females.

In [205]:
#FEMALES
print(chf_patients['Hemoglobinmin'][chf_patients['GENDER']=='F'].mean())
print(chf_patients['Hemoglobinmean'][chf_patients['GENDER']=='F'].mean())
print(chf_patients['Hemoglobinmax'][chf_patients['GENDER']=='F'].mean())
print(chf_patients['Hemoglobinstd'][chf_patients['GENDER']=='F'].mean())

8.954869273206981
10.683183829889348
12.640777164787767
1.1183817641734388


In [206]:
#STD calc:
np.abs(15.5-13.75)*.683

1.1952500000000001

In [207]:
chf_patients['Hemoglobinmin'][chf_patients['GENDER']=='F'] = chf_patients['Hemoglobinmin'][chf_patients['GENDER']=='F'].fillna(12.0)
chf_patients['Hemoglobinmean'][chf_patients['GENDER']=='F'] = chf_patients['Hemoglobinmean'][chf_patients['GENDER']=='F'].fillna(13.75)
chf_patients['Hemoglobinmax'][chf_patients['GENDER']=='F'] = chf_patients['Hemoglobinmax'][chf_patients['GENDER']=='F'].fillna(15.5)
chf_patients['Hemoglobinstd'][chf_patients['GENDER']=='F'] = chf_patients['Hemoglobinstd'][chf_patients['GENDER']=='F'].fillna(1.19525)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  This is separate from the ipykernel package so we can avoid doing imports until
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  after removing the cwd from sys.path.


In [208]:
#MALES
print(chf_patients['Hemoglobinmin'][chf_patients['GENDER']=='M'].mean())
print(chf_patients['Hemoglobinmean'][chf_patients['GENDER']=='M'].mean())
print(chf_patients['Hemoglobinmax'][chf_patients['GENDER']=='M'].mean())
print(chf_patients['Hemoglobinstd'][chf_patients['GENDER']=='M'].mean())

9.526293063849266
11.268765709782766
13.431685806155363
1.2180338114151776


In [209]:
#STD calc:
np.abs(17.5-15.5)*.683

1.366

In [210]:
chf_patients['Hemoglobinmin'][chf_patients['GENDER']=='M'] = chf_patients['Hemoglobinmin'][chf_patients['GENDER']=='M'].fillna(13.5)
chf_patients['Hemoglobinmean'][chf_patients['GENDER']=='M'] = chf_patients['Hemoglobinmean'][chf_patients['GENDER']=='M'].fillna(15.5)
chf_patients['Hemoglobinmax'][chf_patients['GENDER']=='M'] = chf_patients['Hemoglobinmax'][chf_patients['GENDER']=='M'].fillna(17.5)
chf_patients['Hemoglobinstd'][chf_patients['GENDER']=='M'] = chf_patients['Hemoglobinstd'][chf_patients['GENDER']=='M'].fillna(1.366)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  This is separate from the ipykernel package so we can avoid doing imports until
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  after removing the cwd from sys.path.


**Red Blood Cells**: research says normal values are 4.7 to 6.1 million cells per microliter (mcL) for males and 4.2 to 5.4 million mcL for females.

In [212]:
#FEMALES
print(chf_patients['Red Blood Cellsmin'][chf_patients['GENDER']=='F'].mean())
print(chf_patients['Red Blood Cellsmean'][chf_patients['GENDER']=='F'].mean())
print(chf_patients['Red Blood Cellsmax'][chf_patients['GENDER']=='F'].mean())
print(chf_patients['Red Blood Cellsstd'][chf_patients['GENDER']=='F'].mean())

3.0015231470376684
3.582799796606717
4.243322139601806
0.3777006617551198


In [213]:
#STD calc:
np.abs(5.4-4.8)*.683

0.4098000000000004

In [214]:
chf_patients['Red Blood Cellsmin'][chf_patients['GENDER']=='F'] = chf_patients['Red Blood Cellsmin'][chf_patients['GENDER']=='F'].fillna(4.2)
chf_patients['Red Blood Cellsmean'][chf_patients['GENDER']=='F'] = chf_patients['Red Blood Cellsmean'][chf_patients['GENDER']=='F'].fillna(4.8)
chf_patients['Red Blood Cellsmax'][chf_patients['GENDER']=='F'] = chf_patients['Red Blood Cellsmax'][chf_patients['GENDER']=='F'].fillna(5.4)
chf_patients['Red Blood Cellsstd'][chf_patients['GENDER']=='F'] = chf_patients['Red Blood Cellsstd'][chf_patients['GENDER']=='F'].fillna(0.4098)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  This is separate from the ipykernel package so we can avoid doing imports until
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  after removing the cwd from sys.path.


In [215]:
#MALES
print(chf_patients['Red Blood Cellsmin'][chf_patients['GENDER']=='M'].mean())
print(chf_patients['Red Blood Cellsmean'][chf_patients['GENDER']=='M'].mean())
print(chf_patients['Red Blood Cellsmax'][chf_patients['GENDER']=='M'].mean())
print(chf_patients['Red Blood Cellsstd'][chf_patients['GENDER']=='M'].mean())

3.1331798970966367
3.714765807100078
4.4304566335905555
0.4049895324640694


In [216]:
#STD calc:
np.abs(6.1-5.4)*.683

0.4780999999999995

In [217]:
chf_patients['Red Blood Cellsmin'][chf_patients['GENDER']=='M'] = chf_patients['Red Blood Cellsmin'][chf_patients['GENDER']=='M'].fillna(4.7)
chf_patients['Red Blood Cellsmean'][chf_patients['GENDER']=='M'] = chf_patients['Red Blood Cellsmean'][chf_patients['GENDER']=='M'].fillna(5.4)
chf_patients['Red Blood Cellsmax'][chf_patients['GENDER']=='M'] = chf_patients['Red Blood Cellsmax'][chf_patients['GENDER']=='M'].fillna(6.1)
chf_patients['Red Blood Cellsstd'][chf_patients['GENDER']=='M'] = chf_patients['Red Blood Cellsstd'][chf_patients['GENDER']=='M'].fillna(0.4781)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  This is separate from the ipykernel package so we can avoid doing imports until
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  after removing the cwd from sys.path.


## Final Check:

In [218]:
chf_patients.isnull().sum()[chf_patients.isnull().sum()> 0]

Series([], dtype: int64)

In [222]:
chf_patients.isnull().sum()

SUBJECT_ID                 0
GENDER                     0
EXPIRE_FLAG                0
Anion Gapmin               0
Base Excessmin             0
Bicarbonatemin             0
Calcium, Totalmin          0
Calculated Total CO2min    0
Chloridemin                0
Creatininemin              0
Glucosemin                 0
Hematocritmin              0
Hemoglobinmin              0
INR(PT)min                 0
MCHmin                     0
MCHCmin                    0
MCVmin                     0
Magnesiummin               0
PTmin                      0
PTTmin                     0
Phosphatemin               0
Platelet Countmin          0
Potassiummin               0
RDWmin                     0
Red Blood Cellsmin         0
Sodiummin                  0
Urea Nitrogenmin           0
White Blood Cellsmin       0
pCO2min                    0
pHmin                      0
                          ..
Bicarbonatestd             0
Calcium, Totalstd          0
Calculated Total CO2std    0
Chloridestd   

In [223]:
#Export Imputed
chf_patients.to_csv('CHF_Adults_Imputed_2019-09-16.csv')

## ONTO MODELING FINALLY.