# Data Preprocessing

The pre-training data come from the [Behavioral Risk Factor Surveillance System (BRFSS)](https://www.cdc.gov/brfss), a collection of public health surveys conducted in the US. Combined, the surveys form a single table with 2.03 million rows and 74 columns.

For the downstream task, we use the [stroke prediction dataset](https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset), a healthcare dataset from Kaggle.

About 50% of the stroke dataset's columns overlap with BRFSS columns. However, some of the overlapping columns represent their values differently. For example, age is categorical in BRFSS but continuous in downstream datasets such as Stroke. We therefore preprocess the column representations of the downstream datasets to match the BRFSS columns.
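
The overlap figure can be checked by comparing column names once the stroke columns have been renamed to their BRFSS counterparts. The sketch below is illustrative only: the stroke feature names are hard-coded as they appear after the renaming done later in this notebook, and `brfss_cols` is a small hand-picked subset of BRFSS columns rather than the full table.

```python
# Illustrative subset of BRFSS column names (not read from the real file).
brfss_cols = {"SEX", "_AGEG5YR", "MARITAL", "_BMI5", "SMOKE100", "Diabetes"}

# Stroke feature columns after renaming the shared ones to BRFSS names.
stroke_cols = {
    "SEX", "_AGEG5YR", "MARITAL", "_BMI5", "SMOKE100",
    "hypertension", "heart_disease", "work_type",
    "Residence_type", "avg_glucose_level",
}

shared = brfss_cols & stroke_cols
overlap = len(shared) / len(stroke_cols)
print(sorted(shared), overlap)
```

With the real tables, the same set intersection can be applied to `all_df.columns` and the columns of the processed stroke frame.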

In [20]:
# imports
import os
import random
import warnings
from os.path import join

import numpy as np
import pandas as pd


warnings.simplefilter("ignore")
pd.set_option("display.max_columns", 500)

random.seed(1)

In [21]:
# Constants
DATA_DIR = "/ssd003/projects/aieng/public/ssl_bootcamp_resources/datasets"
SAVE_DIR = "./datasets"

## BRFSS Preprocessing

This dataset records state-specific risk behaviors related to chronic diseases, injuries, and preventable infectious diseases among adults in the United States. We combined the datasets from 2011–2015 and removed missing values in two steps:
1. delete columns in which 10% or more of the values are missing;
2. delete rows that still contain missing values.
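
As a minimal sketch, the two steps can be tried on a toy table. The helper name `drop_sparse` and the sample data are assumptions made for illustration; only the 10% threshold comes from this notebook.

```python
import numpy as np
import pandas as pd


def drop_sparse(df: pd.DataFrame, max_missing: float = 0.1) -> pd.DataFrame:
    """Keep columns with a missing fraction below max_missing, then drop incomplete rows."""
    missing_fraction = df.isnull().mean()  # per-column share of NaN values
    kept = df.loc[:, missing_fraction < max_missing]
    return kept.dropna()


# Hypothetical toy data: "b" is 50% missing, "c" is about 8% missing.
toy = pd.DataFrame(
    {
        "a": range(12),
        "b": [np.nan if i % 2 == 0 else 1.0 for i in range(12)],
        "c": [np.nan] + [2.0] * 11,
    }
)
cleaned = drop_sparse(toy)
print(cleaned.shape)
```

Column `b` is removed in the first step, and the single row where `c` is missing is removed in the second.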

In [22]:
df_15 = pd.read_csv(
    join(DATA_DIR, "brfss/2015.csv"),
)
df_15

Unnamed: 0,_STATE,FMONTH,IDATE,IMONTH,IDAY,IYEAR,DISPCODE,SEQNO,_PSU,CTELENUM,PVTRESD1,COLGHOUS,STATERES,CELLFON3,LADULT,NUMADULT,NUMMEN,NUMWOMEN,CTELNUM1,CELLFON2,CADULT,PVTRESD2,CCLGHOUS,CSTATE,LANDLINE,HHADULT,GENHLTH,PHYSHLTH,MENTHLTH,POORHLTH,HLTHPLN1,PERSDOC2,MEDCOST,CHECKUP1,BPHIGH4,BPMEDS,BLOODCHO,CHOLCHK,TOLDHI2,CVDINFR4,CVDCRHD4,CVDSTRK3,ASTHMA3,ASTHNOW,CHCSCNCR,CHCOCNCR,CHCCOPD1,HAVARTH3,ADDEPEV2,CHCKIDNY,DIABETE3,DIABAGE2,SEX,MARITAL,EDUCA,RENTHOM1,NUMHHOL2,NUMPHON2,CPDEMO1,VETERAN3,EMPLOY1,CHILDREN,INCOME2,INTERNET,WEIGHT2,HEIGHT3,PREGNANT,QLACTLM2,USEEQUIP,BLIND,DECIDE,DIFFWALK,DIFFDRES,DIFFALON,SMOKE100,SMOKDAY2,STOPSMK2,LASTSMK2,USENOW3,ALCDAY5,AVEDRNK2,DRNK3GE5,MAXDRNKS,FRUITJU1,FRUIT1,FVBEANS,FVGREEN,FVORANG,VEGETAB1,EXERANY2,EXRACT11,EXEROFT1,EXERHMM1,EXRACT21,EXEROFT2,EXERHMM2,STRENGTH,LMTJOIN3,ARTHDIS2,ARTHSOCL,JOINPAIN,SEATBELT,FLUSHOT6,FLSHTMY2,IMFVPLAC,PNEUVAC3,HIVTST6,HIVTSTD3,WHRTST10,PDIABTST,PREDIAB1,INSULIN,BLDSUGAR,FEETCHK2,DOCTDIAB,CHKHEMO3,FEETCHK,EYEEXAM,DIABEYE,DIABEDU,PAINACT2,QLMENTL2,QLSTRES2,QLHLTH2,CAREGIV1,CRGVREL1,CRGVLNG1,CRGVHRS1,CRGVPRB1,CRGVPERS,CRGVHOUS,CRGVMST2,CRGVEXPT,VIDFCLT2,VIREDIF3,VIPRFVS2,VINOCRE2,VIEYEXM2,VIINSUR2,VICTRCT4,VIGLUMA2,VIMACDG2,CIMEMLOS,CDHOUSE,CDASSIST,CDHELP,CDSOCIAL,CDDISCUS,WTCHSALT,LONGWTCH,DRADVISE,ASTHMAGE,ASATTACK,ASERVIST,ASDRVIST,ASRCHKUP,ASACTLIM,ASYMPTOM,ASNOSLEP,ASTHMED3,ASINHALR,HAREHAB1,STREHAB1,CVDASPRN,ASPUNSAF,RLIVPAIN,RDUCHART,RDUCSTRK,ARTTODAY,ARTHWGT,ARTHEXER,ARTHEDU,TETANUS,HPVADVC2,HPVADSHT,SHINGLE2,HADMAM,HOWLONG,HADPAP2,LASTPAP2,HPVTEST,HPLSTTST,HADHYST2,PROFEXAM,LENGEXAM,BLDSTOOL,LSTBLDS3,HADSIGM3,HADSGCO1,LASTSIG3,PCPSAAD2,PCPSADI1,PCPSARE1,PSATEST1,PSATIME,PCPSARS1,PCPSADE1,PCDMDECN,SCNTMNY1,SCNTMEL1,SCNTPAID,SCNTWRK1,SCNTLPAD,SCNTLWK1,SXORIENT,TRNSGNDR,RCSGENDR,RCSRLTN2,CASTHDX2,CASTHNO2,EMTSUPRT,LSATISFY,ADPLEASR,ADDOWN,ADSLEEP,ADENERGY,ADEAT1,ADFAIL,ADTHINK,ADMOVE,MISTMNT,ADANXEV,QSTVER,QSTLANG,EXACTOT1,EXACTOT2,MSCODE,_STSTR,_STRWT,_RAWRAKE,_WT2RAKE,_CHISPNC,_CRACE1,
_CPRACE,_CLLCPWT,_DUALUSE,_DUALCOR,_LLCPWT,_RFHLTH,_HCVU651,_RFHYPE5,_CHOLCHK,_RFCHOL,_MICHD,_LTASTH1,_CASTHM1,_ASTHMS1,_DRDXAR1,_PRACE1,_MRACE1,_HISPANC,_RACE,_RACEG21,_RACEGR3,_RACE_G1,_AGEG5YR,_AGE65YR,_AGE80,_AGE_G,HTIN4,HTM4,WTKG3,_BMI5,_BMI5CAT,_RFBMI5,_CHLDCNT,_EDUCAG,_INCOMG,_SMOKER3,_RFSMOK3,DRNKANY5,DROCDY3_,_RFBING5,_DRNKWEK,_RFDRHV5,FTJUDA1_,FRUTDA1_,BEANDAY_,GRENDAY_,ORNGDAY_,VEGEDA1_,_MISFRTN,_MISVEGN,_FRTRESP,_VEGRESP,_FRUTSUM,_VEGESUM,_FRTLT1,_VEGLT1,_FRT16,_VEG23,_FRUITEX,_VEGETEX,_TOTINDA,METVL11_,METVL21_,MAXVO2_,FC60_,ACTIN11_,ACTIN21_,PADUR1_,PADUR2_,PAFREQ1_,PAFREQ2_,_MINAC11,_MINAC21,STRFREQ_,PAMISS1_,PAMIN11_,PAMIN21_,PA1MIN_,PAVIG11_,PAVIG21_,PA1VIGM_,_PACAT1,_PAINDX1,_PA150R2,_PA300R2,_PA30021,_PASTRNG,_PAREC1,_PASTAE1,_LMTACT1,_LMTWRK1,_LMTSCL1,_RFSEAT2,_RFSEAT3,_FLSHOT6,_PNEUMO2,_AIDTST3
0,1.0,1.0,b'01292015',b'01',b'29',b'2015',1200.0,2.015000e+09,2.015000e+09,1.0,1.0,,1.0,2.0,,3.0,1.000000e+00,2.0,,,,,,,,,5.0,15.0,18.0,10.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,2.0,2.0,1.0,1.0,2.0,2.0,1.0,1.0,1.0,2.0,3.0,,2.0,1.0,4.0,1.0,2.0,,1.0,2.0,8.0,88.0,3.0,2.0,280.0,510.0,,1.0,1.0,2.0,2.0,1.0,1.0,1.0,1.0,3.0,,2.0,3.0,888.0,,,,305.0,310.0,320.0,310.0,305.0,101.0,2.0,,,,,,,888.0,1.0,1.0,1.0,6.0,1.0,1.0,112014.0,1.0,1.0,1.0,,,1.0,3.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,b'',,,,,,,,,,,,,,,,,,,,,,,,,10.0,1.0,b'',b'',3.0,11011.0,28.781560,3.0,86.344681,,,,,1.0,0.614125,341.384853,2.0,1.0,2.0,1.0,2.0,2.0,2.0,2.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,9.0,1.0,63.0,5.0,70.0,178.0,12701.0,4018.0,4.0,2.0,1.0,2.0,2.0,3.0,1.0,2.0,5.397605e-79,1.0,5.397605e-79,1.0,1.700000e+01,3.300000e+01,6.700000e+01,3.300000e+01,17.0,100.0,5.397605e-79,5.397605e-79,1.000000e+00,1.000000e+00,50.0,217.0,2.0,1.0,1.0,1.0,5.397605e-79,5.397605e-79,2.0,,,2469.0,423.0,,,,,,,,,5.397605e-79,5.397605e-79,,,,,,,4.0,2.0,3.0,3.0,2.0,2.0,4.0,2.0,1.0,1.0,1.0,1.0,1.0,,,1.0
1,1.0,1.0,b'01202015',b'01',b'20',b'2015',1100.0,2.015000e+09,2.015000e+09,1.0,1.0,,1.0,2.0,,1.0,5.397605e-79,1.0,,,,,,,,,3.0,88.0,88.0,,2.0,1.0,1.0,4.0,3.0,,1.0,4.0,2.0,2.0,2.0,2.0,2.0,,2.0,2.0,2.0,2.0,2.0,2.0,3.0,,2.0,2.0,6.0,1.0,2.0,,2.0,2.0,3.0,88.0,1.0,1.0,165.0,508.0,,1.0,2.0,1.0,1.0,2.0,2.0,2.0,1.0,1.0,2.0,,3.0,888.0,,,,302.0,305.0,302.0,202.0,202.0,304.0,1.0,64.0,212.0,100.0,69.0,212.0,100.0,888.0,,,,,3.0,2.0,,,2.0,2.0,,,2.0,3.0,,,,,,,,,,,,,,2.0,,,,,,,,1.0,,,,,,,,,,1.0,5.0,5.0,,5.0,2.0,2.0,,2.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2.0,,2.0,,,,,,,,,,b'',1.0,2.0,,,2.0,60.0,,,,,,,,,,,,,,,,,,,10.0,1.0,b'',b'',5.0,11011.0,28.781560,1.0,28.781560,,,,,9.0,,108.060903,1.0,2.0,1.0,2.0,1.0,2.0,1.0,1.0,3.0,2.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,7.0,1.0,52.0,4.0,68.0,173.0,7484.0,2509.0,3.0,2.0,1.0,4.0,1.0,1.0,2.0,2.0,5.397605e-79,1.0,5.397605e-79,1.0,7.000000e+00,1.700000e+01,7.000000e+00,2.900000e+01,29.0,13.0,5.397605e-79,5.397605e-79,1.000000e+00,1.000000e+00,24.0,78.0,2.0,2.0,1.0,1.0,5.397605e-79,5.397605e-79,1.0,35.0,5.397605e-79,2876.0,493.0,1.0,5.397605e-79,60.0,60.0,2800.0,2800.0,168.0,5.397605e-79,5.397605e-79,5.397605e-79,168.0,5.397605e-79,168.0,5.397605e-79,5.397605e-79,5.397605e-79,2.0,1.0,1.0,2.0,2.0,2.0,2.0,2.0,3.0,3.0,4.0,2.0,2.0,,,2.0
2,1.0,1.0,b'02012015',b'02',b'01',b'2015',1200.0,2.015000e+09,2.015000e+09,1.0,1.0,,1.0,2.0,,2.0,1.000000e+00,1.0,,,,,,,,,4.0,15.0,88.0,88.0,1.0,2.0,2.0,1.0,3.0,,1.0,1.0,1.0,7.0,2.0,1.0,2.0,,2.0,1.0,2.0,1.0,2.0,2.0,3.0,,2.0,2.0,4.0,1.0,2.0,,1.0,2.0,7.0,88.0,99.0,2.0,158.0,511.0,,2.0,2.0,2.0,2.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1.0,3.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,b'',,,,,,,,,,,,,,,,,,,,,,,,,10.0,1.0,b'',b'',5.0,11011.0,28.781560,2.0,57.563120,,,,,1.0,0.614125,255.264797,2.0,9.0,1.0,1.0,2.0,,1.0,1.0,3.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,11.0,2.0,71.0,6.0,71.0,180.0,7167.0,2204.0,2.0,1.0,1.0,2.0,9.0,9.0,9.0,9.0,9.000000e+02,9.0,9.990000e+04,9.0,,,,,,,2.000000e+00,4.000000e+00,5.397605e-79,5.397605e-79,,,9.0,9.0,1.0,1.0,1.000000e+00,1.000000e+00,9.0,,,2173.0,373.0,,,,,,,,,,9.000000e+00,,,,,,,9.0,9.0,9.0,9.0,9.0,9.0,9.0,9.0,9.0,9.0,9.0,9.0,9.0,9.0,9.0,
3,1.0,1.0,b'01142015',b'01',b'14',b'2015',1100.0,2.015000e+09,2.015000e+09,1.0,1.0,,1.0,2.0,,3.0,1.000000e+00,2.0,,,,,,,,,5.0,30.0,30.0,30.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,2.0,2.0,2.0,,2.0,1.0,2.0,1.0,1.0,2.0,3.0,,2.0,1.0,4.0,1.0,2.0,,1.0,2.0,8.0,1.0,8.0,2.0,180.0,507.0,,1.0,2.0,1.0,1.0,1.0,2.0,1.0,2.0,,,,3.0,888.0,,,,555.0,101.0,555.0,301.0,301.0,201.0,2.0,,,,,,,888.0,1.0,1.0,1.0,8.0,1.0,1.0,777777.0,5.0,1.0,9.0,,,2.0,3.0,,,,,,,,,,,,,,2.0,,,,,,,,2.0,,,,,,,,,,1.0,1.0,1.0,2.0,1.0,1.0,2.0,,2.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2.0,,1.0,2.0,1.0,,,,,,,,b'',4.0,7.0,,,,97.0,,,,,,,,,,,,,,,,,,,10.0,1.0,b'',b'',3.0,11011.0,28.781560,3.0,86.344681,,,,,1.0,0.614125,341.384853,2.0,1.0,2.0,1.0,2.0,2.0,1.0,1.0,3.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,9.0,1.0,63.0,5.0,67.0,170.0,8165.0,2819.0,3.0,2.0,2.0,2.0,5.0,4.0,1.0,2.0,5.397605e-79,1.0,5.397605e-79,1.0,5.397605e-79,1.000000e+02,5.397605e-79,3.000000e+00,3.0,14.0,5.397605e-79,5.397605e-79,1.000000e+00,1.000000e+00,100.0,20.0,1.0,2.0,1.0,1.0,5.397605e-79,5.397605e-79,2.0,,,2469.0,423.0,,,,,,,,,5.397605e-79,5.397605e-79,,,,,,,4.0,2.0,3.0,3.0,2.0,2.0,4.0,2.0,1.0,1.0,1.0,1.0,1.0,,,9.0
4,1.0,1.0,b'01142015',b'01',b'14',b'2015',1100.0,2.015000e+09,2.015000e+09,1.0,1.0,,1.0,2.0,,2.0,1.000000e+00,1.0,,,,,,,,,5.0,20.0,88.0,30.0,1.0,1.0,2.0,1.0,3.0,,1.0,1.0,2.0,2.0,2.0,2.0,2.0,,2.0,2.0,2.0,1.0,2.0,2.0,3.0,,2.0,1.0,5.0,1.0,2.0,,2.0,2.0,8.0,88.0,77.0,1.0,142.0,504.0,,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,,,,3.0,888.0,,,,777.0,102.0,203.0,204.0,310.0,320.0,2.0,,,,,,,888.0,1.0,1.0,1.0,7.0,1.0,2.0,,,1.0,1.0,777777.0,1.0,1.0,3.0,,,,,,,,,,,,,,2.0,,,,,,,,7.0,,,,,,,,,,2.0,,,,,,1.0,777.0,2.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2.0,,1.0,2.0,5.0,,,,,,,,b'',5.0,5.0,,,,45.0,,,,,,,,,,,,,,,,,,,10.0,1.0,b'',b'',3.0,11011.0,28.781560,2.0,57.563120,,,,,9.0,,258.682223,2.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,3.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,9.0,1.0,61.0,5.0,64.0,163.0,6441.0,2437.0,2.0,1.0,1.0,3.0,9.0,4.0,1.0,2.0,5.397605e-79,1.0,5.397605e-79,1.0,,2.000000e+02,4.300000e+01,5.700000e+01,33.0,67.0,1.000000e+00,5.397605e-79,5.397605e-79,1.000000e+00,,200.0,9.0,1.0,1.0,1.0,1.000000e+00,5.397605e-79,2.0,,,2543.0,436.0,,,,,,,,,5.397605e-79,5.397605e-79,,,,,,,4.0,2.0,3.0,3.0,2.0,2.0,4.0,2.0,1.0,1.0,1.0,1.0,1.0,,,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
441451,72.0,11.0,b'12162015',b'12',b'16',b'2015',1100.0,2.015005e+09,2.015005e+09,,,,,,,,,,1.0,1.0,2.0,1.0,,1.0,2.0,4.0,4.0,88.0,88.0,,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,2.0,2.0,2.0,,2.0,2.0,2.0,1.0,2.0,2.0,1.0,55.0,2.0,3.0,2.0,1.0,,,,2.0,7.0,88.0,4.0,2.0,104.0,503.0,,2.0,2.0,2.0,2.0,1.0,2.0,2.0,2.0,,,,3.0,888.0,,,,202.0,555.0,205.0,555.0,201.0,201.0,2.0,,,,,,,888.0,2.0,2.0,3.0,5.0,1.0,2.0,,,2.0,2.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2.0,,,,,,1.0,422.0,1.0,,,,,,,,,,,,,1.0,,2.0,1.0,1.0,,,,,4.0,,,,,,,,,,,,,,,,,,,,,,,,,b'',,,,,,,,,,,,,,,,,,,,,,,,,20.0,2.0,b'',b'',,722019.0,251.209517,1.0,251.209517,9.0,,,,9.0,,531.980410,2.0,9.0,2.0,1.0,2.0,2.0,1.0,1.0,3.0,1.0,6.0,6.0,1.0,8.0,2.0,5.0,3.0,11.0,2.0,72.0,6.0,63.0,160.0,4717.0,1842.0,1.0,1.0,1.0,1.0,2.0,4.0,1.0,2.0,5.397605e-79,1.0,5.397605e-79,1.0,2.900000e+01,5.397605e-79,7.100000e+01,5.397605e-79,14.0,14.0,5.397605e-79,5.397605e-79,1.000000e+00,1.000000e+00,29.0,99.0,2.0,2.0,1.0,1.0,5.397605e-79,5.397605e-79,2.0,,,2136.0,366.0,,,,,,,,,5.397605e-79,5.397605e-79,,,,,,,4.0,2.0,3.0,3.0,2.0,2.0,4.0,2.0,2.0,2.0,3.0,1.0,1.0,2.0,2.0,2.0
441452,72.0,11.0,b'12142015',b'12',b'14',b'2015',1100.0,2.015005e+09,2.015005e+09,,,,,,,,,,1.0,1.0,2.0,1.0,,1.0,2.0,2.0,1.0,88.0,88.0,,1.0,1.0,2.0,1.0,3.0,,1.0,1.0,2.0,2.0,2.0,2.0,2.0,,2.0,2.0,2.0,2.0,2.0,2.0,3.0,,2.0,1.0,5.0,1.0,,,,2.0,1.0,1.0,2.0,1.0,160.0,503.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,,,,3.0,888.0,,,,305.0,101.0,202.0,303.0,201.0,202.0,1.0,64.0,105.0,30.0,88.0,,,888.0,,,,,1.0,2.0,,,2.0,1.0,777777.0,8.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1.0,401.0,2.0,,,,,,,,,,,,,2.0,3.0,,,,,,,,7.0,,,,,,,,,,,,,,,,,,,,,,,,,b'',,,,,,,,,2.0,1.0,2.0,,,,,,,,,,,,,,20.0,2.0,b'',b'',,722019.0,251.209517,1.0,251.209517,1.0,6.0,6.0,326.569972,9.0,,746.416599,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,3.0,2.0,6.0,6.0,1.0,8.0,2.0,5.0,3.0,2.0,1.0,29.0,2.0,63.0,160.0,7257.0,2834.0,3.0,2.0,2.0,3.0,1.0,4.0,1.0,2.0,5.397605e-79,1.0,5.397605e-79,1.0,1.700000e+01,1.000000e+02,2.900000e+01,1.000000e+01,14.0,29.0,5.397605e-79,5.397605e-79,1.000000e+00,1.000000e+00,117.0,82.0,1.0,2.0,1.0,1.0,5.397605e-79,5.397605e-79,1.0,35.0,5.397605e-79,3727.0,639.0,1.0,5.397605e-79,30.0,,5000.0,,150.0,5.397605e-79,5.397605e-79,5.397605e-79,150.0,5.397605e-79,150.0,5.397605e-79,5.397605e-79,5.397605e-79,2.0,1.0,1.0,2.0,2.0,2.0,2.0,2.0,3.0,3.0,4.0,1.0,1.0,,,1.0
441453,72.0,11.0,b'12232015',b'12',b'23',b'2015',1200.0,2.015005e+09,2.015005e+09,,,,,,,,,,1.0,1.0,2.0,1.0,,1.0,2.0,2.0,4.0,88.0,20.0,88.0,1.0,2.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,2.0,2.0,2.0,,2.0,1.0,2.0,2.0,2.0,2.0,3.0,,2.0,1.0,4.0,1.0,,,,2.0,7.0,88.0,5.0,2.0,247.0,505.0,,2.0,2.0,2.0,2.0,2.0,2.0,2.0,1.0,3.0,,7.0,3.0,202.0,2.0,88.0,2.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,b'',,,,,,,,,,,,,,,,,,,,,,,,,20.0,2.0,b'',b'',,722019.0,251.209517,1.0,251.209517,9.0,,,,9.0,,207.663634,2.0,9.0,2.0,1.0,2.0,2.0,1.0,1.0,3.0,2.0,1.0,1.0,1.0,8.0,2.0,5.0,3.0,11.0,2.0,70.0,6.0,65.0,165.0,11204.0,4110.0,4.0,2.0,1.0,2.0,3.0,3.0,1.0,1.0,7.000000e+00,1.0,9.300000e+01,1.0,,,,,,,2.000000e+00,4.000000e+00,5.397605e-79,5.397605e-79,,,9.0,9.0,1.0,1.0,1.000000e+00,1.000000e+00,9.0,,,2210.0,379.0,,,,,,,,,,9.000000e+00,,,,,,,9.0,9.0,9.0,9.0,9.0,9.0,9.0,9.0,3.0,3.0,4.0,9.0,9.0,9.0,9.0,
441454,72.0,11.0,b'12152015',b'12',b'15',b'2015',1100.0,2.015005e+09,2.015005e+09,,,,,,,,,,1.0,1.0,1.0,1.0,,1.0,1.0,3.0,3.0,88.0,88.0,,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,2.0,2.0,2.0,2.0,2.0,,2.0,2.0,2.0,2.0,2.0,2.0,3.0,,1.0,5.0,5.0,1.0,,,,2.0,3.0,88.0,1.0,2.0,166.0,511.0,,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,,,,3.0,888.0,,,,101.0,101.0,101.0,202.0,301.0,301.0,2.0,,,,,,,888.0,,,,,1.0,2.0,,,2.0,2.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2.0,,,,,,1.0,107.0,1.0,,,,,,,,,,,,,1.0,,2.0,1.0,1.0,,,,,4.0,,,,,,,,,,,,,,,,,,,,,,,,,b'',,,,,,,,,,,,,,,,,,,,,,,,,20.0,2.0,b'',b'',,722019.0,251.209517,1.0,251.209517,9.0,,,,2.0,0.323563,515.758894,1.0,1.0,2.0,1.0,1.0,2.0,1.0,1.0,3.0,2.0,6.0,6.0,1.0,8.0,2.0,5.0,3.0,7.0,1.0,52.0,4.0,71.0,180.0,7530.0,2315.0,2.0,1.0,1.0,3.0,1.0,4.0,1.0,2.0,5.397605e-79,1.0,5.397605e-79,1.0,1.000000e+02,1.000000e+02,1.000000e+02,2.900000e+01,3.0,3.0,5.397605e-79,5.397605e-79,1.000000e+00,1.000000e+00,200.0,135.0,1.0,1.0,1.0,1.0,5.397605e-79,5.397605e-79,2.0,,,3140.0,538.0,,,,,,,,,5.397605e-79,5.397605e-79,,,,,,,4.0,2.0,3.0,3.0,2.0,2.0,4.0,2.0,3.0,3.0,4.0,1.0,1.0,,,2.0


In [23]:
df_15 = df_15.loc[:, (df_15.isnull().sum() / df_15.shape[0]) < 0.1]
df_15.shape

(441456, 145)

In [24]:
df_15 = df_15.dropna()
df_15.shape

(340057, 145)

In [25]:
df_15 = df_15.drop("IDATE", axis=1)

In [26]:
data_dir = join(DATA_DIR, "brfss")
df_1114 = [pd.read_csv(f"{data_dir}/20{i}.csv") for i in range(11, 15)]

In [27]:
df_list = [df_15] + df_1114
all_df = pd.concat(df_list, join="inner")
all_df.shape

(2278648, 77)
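
The column count drops from 144 to 77 because `join="inner"` keeps only the columns present in every yearly table. A small sketch with made-up frames (the names `year_a`, `year_b`, `ONLY_A`, and `ONLY_B` are hypothetical) shows the behavior:

```python
import pandas as pd

# Hypothetical yearly tables with partially overlapping columns.
year_a = pd.DataFrame({"SEX": [1, 2], "_BMI5": [2500.0, 3100.0], "ONLY_A": [0, 1]})
year_b = pd.DataFrame({"SEX": [2], "_BMI5": [2200.0], "ONLY_B": [5]})

# join="inner" keeps only the columns shared by all input frames.
combined = pd.concat([year_a, year_b], join="inner")
print(list(combined.columns), combined.shape)
```

Note that `pd.concat` does not reset the row index by default, so row labels repeat across years; passing `ignore_index=True` would give unique labels if they are needed.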

In [28]:
all_df = all_df.loc[:, (all_df.isnull().sum() / all_df.shape[0]) < 0.1].dropna()
all_df.shape

(2041151, 77)

In [29]:
all_df["DIABETE3"] = all_df["DIABETE3"].replace({2: 0, 3: 0, 1: 2, 4: 1})
all_df = all_df[(all_df["DIABETE3"] != 7) & (all_df["DIABETE3"] != 9)]
all_df.groupby(["DIABETE3"]).size()

DIABETE3
0.0    1745038
1.0      34563
2.0     259171
dtype: int64

In [30]:
all_df.loc[:, "DIABETE3"] = all_df["DIABETE3"].replace({1: 0}).replace({2: 1})
all_df.groupby(["DIABETE3"]).size()

DIABETE3
0.0    1779601
1.0     259171
dtype: int64
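
Taken together, the two recoding cells map raw code 1 to the positive class, codes 2, 3, and 4 to the negative class, and discard codes 7 and 9. The following sketch reproduces that net effect in a single step on a made-up series; it is an equivalent formulation for illustration, not the code used to build the saved table.

```python
import pandas as pd

# Hypothetical sample of raw DIABETE3 codes.
raw = pd.Series([1, 2, 3, 4, 7, 9, 1, 3], name="DIABETE3")

# Net effect of the two cells above: 1 -> 1, 2/3/4 -> 0, 7/9 dropped.
binary_map = {1: 1, 2: 0, 3: 0, 4: 0}
diabetes = raw.map(binary_map).dropna().astype(int).rename("Diabetes")
print(diabetes.tolist())
```

Codes missing from `binary_map` become `NaN` under `Series.map`, which is why `dropna()` removes the 7 and 9 entries.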

In [31]:
all_df = all_df.rename(columns={"DIABETE3": "Diabetes"})

In [32]:
print(all_df.dtypes)

_STATE      float64
FMONTH      float64
IMONTH       object
IDAY         object
IYEAR        object
             ...   
_RFBING5    float64
_TOTINDA    float64
_RFSEAT2    float64
_RFSEAT3    float64
_AIDTST3    float64
Length: 77, dtype: object


In [33]:
# Cast float columns that contain only whole numbers to int.
for col in all_df.columns:
    if all_df[col].dtype != float:
        continue
    if all_df[col].apply(float.is_integer).all():
        all_df = all_df.astype({col: "int"})
# Keep these columns as float even if all of their values are whole numbers.
all_df = all_df.astype(dict.fromkeys(["WEIGHT2", "HEIGHT3", "_BMI5", "WTKG3", "_STSTR"], "float"))
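
The same idea can be written without a per-column `astype` call by collecting the whole-number columns first and casting them together. This is a sketch on a hypothetical toy frame; the column names `code` and `weight` are placeholders.

```python
import pandas as pd

# Hypothetical toy frame: "code" holds whole numbers stored as floats, "weight" does not.
toy = pd.DataFrame({"code": [1.0, 2.0, 88.0], "weight": [70.5, 82.0, 65.25]})

# Find float columns whose values are all whole numbers, then cast them in one call.
float_cols = toy.select_dtypes(include="float").columns
whole = [c for c in float_cols if (toy[c] % 1 == 0).all()]
toy = toy.astype({c: "int64" for c in whole})
print(toy.dtypes.to_dict())
```

Casting once avoids rebuilding the full DataFrame for every converted column, which may matter on a table with millions of rows.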

In [34]:
df_int = all_df.loc[:, all_df.dtypes == "int64"]
df_int = df_int.drop(["SEQNO", "_PSU"], axis=1)

In [35]:
df_int.head()

Unnamed: 0,_STATE,FMONTH,DISPCODE,GENHLTH,PHYSHLTH,MENTHLTH,HLTHPLN1,PERSDOC2,MEDCOST,CHECKUP1,CVDINFR4,CVDCRHD4,CVDSTRK3,ASTHMA3,CHCSCNCR,CHCOCNCR,HAVARTH3,ADDEPEV2,CHCKIDNY,Diabetes,SEX,MARITAL,EDUCA,RENTHOM1,VETERAN3,CHILDREN,INCOME2,QLACTLM2,USEEQUIP,SMOKE100,USENOW3,ALCDAY5,EXERANY2,SEATBELT,PNEUVAC3,HIVTST6,QSTVER,QSTLANG,_RFHLTH,_HCVU651,_LTASTH1,_CASTHM1,_ASTHMS1,_DRDXAR1,_AGEG5YR,_AGE65YR,_AGE_G,HTIN4,HTM4,_BMI5CAT,_RFBMI5,_CHLDCNT,_EDUCAG,_INCOMG,_SMOKER3,_RFSMOK3,DRNKANY5,_RFBING5,_TOTINDA,_RFSEAT2,_RFSEAT3,_AIDTST3
0,1,1,1200,5,15,18,1,1,2,1,2,2,2,1,2,2,1,1,2,0,2,1,4,1,2,88,3,1,1,1,3,888,2,1,1,1,10,1,2,1,2,2,1,1,9,1,5,70,178,4,2,1,2,2,3,1,2,1,2,1,1,1
1,1,1,1100,3,88,88,2,1,1,4,2,2,2,2,2,2,2,2,2,0,2,2,6,1,2,88,1,1,2,1,3,888,1,3,2,2,10,1,1,2,1,1,3,2,7,1,4,68,173,3,2,1,4,1,1,2,2,1,1,2,2,2
3,1,1,1100,5,30,30,1,2,1,1,2,2,2,2,2,1,1,1,2,0,2,1,4,1,2,1,8,1,2,2,3,888,2,1,1,9,10,1,2,1,1,1,3,1,9,1,5,67,170,3,2,2,2,5,4,1,2,1,2,1,1,9
5,1,1,1100,2,88,88,1,1,2,1,2,2,2,2,2,2,1,2,2,0,2,3,3,1,2,88,6,1,2,2,3,888,1,1,1,2,10,1,1,9,1,1,3,1,11,2,6,62,157,3,2,1,1,4,4,1,2,1,1,1,1,2
6,1,1,1100,2,88,3,1,1,2,1,2,2,2,2,2,2,2,2,2,0,2,3,5,1,2,88,4,2,2,2,3,203,1,1,1,1,10,1,1,9,1,1,3,2,11,2,6,66,168,2,1,1,3,2,4,1,1,1,1,1,1,1


In [36]:
for col in df_int.columns:
    print(col, len(df_int[col].unique()))

_STATE 53
FMONTH 12
DISPCODE 4
GENHLTH 7
PHYSHLTH 33
MENTHLTH 33
HLTHPLN1 4
PERSDOC2 5
MEDCOST 4
CHECKUP1 7
CVDINFR4 4
CVDCRHD4 4
CVDSTRK3 4
ASTHMA3 4
CHCSCNCR 4
CHCOCNCR 4
HAVARTH3 2
ADDEPEV2 4
CHCKIDNY 4
Diabetes 2
SEX 2
MARITAL 7
EDUCA 7
RENTHOM1 5
VETERAN3 4
CHILDREN 33
INCOME2 10
QLACTLM2 4
USEEQUIP 4
SMOKE100 4
USENOW3 5
ALCDAY5 40
EXERANY2 4
SEATBELT 8
PNEUVAC3 4
HIVTST6 4
QSTVER 8
QSTLANG 4
_RFHLTH 3
_HCVU651 3
_LTASTH1 3
_CASTHM1 3
_ASTHMS1 4
_DRDXAR1 2
_AGEG5YR 14
_AGE65YR 3
_AGE_G 6
HTIN4 59
HTM4 59
_BMI5CAT 4
_RFBMI5 2
_CHLDCNT 7
_EDUCAG 5
_INCOMG 6
_SMOKER3 5
_RFSMOK3 3
DRNKANY5 4
_RFBING5 3
_TOTINDA 3
_RFSEAT2 3
_RFSEAT3 3
_AIDTST3 3


In [37]:
all_df = all_df.drop(["SEQNO", "_PSU"], axis=1)
all_df.shape

(2038772, 75)

In [38]:
os.makedirs(join(SAVE_DIR, "brfss"), exist_ok=True)
all_df.to_csv(
    join(SAVE_DIR, "brfss", "all.csv"),
    sep=",",
    index=False,
)

In [39]:
# Automatically split the columns into categorical and numerical by dtype.
categorical = list(all_df.loc[:, (all_df.dtypes == "object") | (all_df.dtypes == "int64")].columns)
numerical = list(all_df.loc[:, all_df.dtypes == "float64"].columns)
print(categorical, numerical)

['_STATE', 'FMONTH', 'IMONTH', 'IDAY', 'IYEAR', 'DISPCODE', 'GENHLTH', 'PHYSHLTH', 'MENTHLTH', 'HLTHPLN1', 'PERSDOC2', 'MEDCOST', 'CHECKUP1', 'CVDINFR4', 'CVDCRHD4', 'CVDSTRK3', 'ASTHMA3', 'CHCSCNCR', 'CHCOCNCR', 'HAVARTH3', 'ADDEPEV2', 'CHCKIDNY', 'Diabetes', 'SEX', 'MARITAL', 'EDUCA', 'RENTHOM1', 'VETERAN3', 'CHILDREN', 'INCOME2', 'QLACTLM2', 'USEEQUIP', 'SMOKE100', 'USENOW3', 'ALCDAY5', 'EXERANY2', 'SEATBELT', 'PNEUVAC3', 'HIVTST6', 'QSTVER', 'QSTLANG', '_RFHLTH', '_HCVU651', '_LTASTH1', '_CASTHM1', '_ASTHMS1', '_DRDXAR1', '_AGEG5YR', '_AGE65YR', '_AGE_G', 'HTIN4', 'HTM4', '_BMI5CAT', '_RFBMI5', '_CHLDCNT', '_EDUCAG', '_INCOMG', '_SMOKER3', '_RFSMOK3', 'DRNKANY5', '_RFBING5', '_TOTINDA', '_RFSEAT2', '_RFSEAT3', '_AIDTST3'] ['WEIGHT2', 'HEIGHT3', '_STSTR', '_STRWT', '_RAWRAKE', '_WT2RAKE', '_LLCPWT', 'WTKG3', '_BMI5', 'DROCDY3_']


In [40]:
dtype_df = pd.DataFrame(all_df.dtypes).T
dtype_df

Unnamed: 0,_STATE,FMONTH,IMONTH,IDAY,IYEAR,DISPCODE,GENHLTH,PHYSHLTH,MENTHLTH,HLTHPLN1,PERSDOC2,MEDCOST,CHECKUP1,CVDINFR4,CVDCRHD4,CVDSTRK3,ASTHMA3,CHCSCNCR,CHCOCNCR,HAVARTH3,ADDEPEV2,CHCKIDNY,Diabetes,SEX,MARITAL,EDUCA,RENTHOM1,VETERAN3,CHILDREN,INCOME2,WEIGHT2,HEIGHT3,QLACTLM2,USEEQUIP,SMOKE100,USENOW3,ALCDAY5,EXERANY2,SEATBELT,PNEUVAC3,HIVTST6,QSTVER,QSTLANG,_STSTR,_STRWT,_RAWRAKE,_WT2RAKE,_LLCPWT,_RFHLTH,_HCVU651,_LTASTH1,_CASTHM1,_ASTHMS1,_DRDXAR1,_AGEG5YR,_AGE65YR,_AGE_G,HTIN4,HTM4,WTKG3,_BMI5,_BMI5CAT,_RFBMI5,_CHLDCNT,_EDUCAG,_INCOMG,_SMOKER3,_RFSMOK3,DRNKANY5,DROCDY3_,_RFBING5,_TOTINDA,_RFSEAT2,_RFSEAT3,_AIDTST3
0,int64,int64,object,object,object,int64,int64,int64,int64,int64,int64,int64,int64,int64,int64,int64,int64,int64,int64,int64,int64,int64,int64,int64,int64,int64,int64,int64,int64,int64,float64,float64,int64,int64,int64,int64,int64,int64,int64,int64,int64,int64,int64,float64,float64,float64,float64,float64,int64,int64,int64,int64,int64,int64,int64,int64,int64,int64,int64,float64,float64,int64,int64,int64,int64,int64,int64,int64,int64,float64,int64,int64,int64,int64,int64


## Stroke Dataset Preprocessing

The dataset contains clinical events. The task is to predict whether a subject is likely to have a stroke using 10 features: gender, age, hypertension, heart disease, marital status, work type, residence type, glucose level, BMI, and smoking status.

Features shared by the downstream and pretraining datasets are renamed and re-encoded so that their names and types match those in the pretraining dataset.

In [46]:
stroke_df = pd.read_csv(
    join(DATA_DIR, "stroke/healthcare-dataset-stroke-data.csv"),
)

In [47]:
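# Drop the "Other" gender category before encoding so that SEX holds only integers.
stroke_df = stroke_df[stroke_df["gender"] != "Other"].copy()
# gender -> SEX (BRFSS codes: 1 = male, 2 = female)
stroke_df["SEX"] = stroke_df["gender"].replace({"Male": 1, "Female": 2})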

In [48]:
stroke_df = stroke_df.dropna()
low = np.arange(25, 76, 5)
high = np.arange(29, 80, 5)
stroke_df.loc[stroke_df["age"] < 18, ["_AGEG5YR"]] = 14
stroke_df.loc[(stroke_df["age"] >= 18) & (stroke_df["age"] <= 24), ["_AGEG5YR"]] = 1
stroke_df.loc[stroke_df["age"] >= 80, ["_AGEG5YR"]] = 13
for k, i, j in zip(range(2, 13), low, high):
    stroke_df.loc[(i <= stroke_df["age"]) & (stroke_df["age"] <= j), ["_AGEG5YR"]] = k
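
The interval tests above use inclusive integer bounds (for example 18 to 24, then 25 to 29), so a fractional age such as 24.5 would match no interval and stay unassigned. As a hedged alternative, `pd.cut` with left-closed bins covers every age from 18 upward. The sketch below runs on made-up ages and is not the code used to produce the saved file.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, including fractional values between the integer bounds.
ages = pd.Series([0.5, 18, 24, 24.5, 25, 29.5, 79, 80, 95])

# Left-closed bins: [18, 25) -> 1, [25, 30) -> 2, ..., [75, 80) -> 12, [80, inf) -> 13.
edges = [18] + list(range(25, 81, 5)) + [np.inf]
codes = pd.cut(ages, bins=edges, right=False, labels=list(range(1, 14)))

# Ages below 18 fall outside every bin; the notebook assigns them code 14.
ageg5yr = codes.astype(float).fillna(14).astype(int)
print(ageg5yr.tolist())
```

For integer ages the two approaches assign the same codes; they differ only for values that fall strictly between two integer bounds.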

In [49]:
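# ever_married -> MARITAL (BRFSS codes: 1 = married, 5 = never married)
stroke_df["MARITAL"] = stroke_df["ever_married"].replace({"No": 5, "Yes": 1})
# bmi -> _BMI5 (BRFSS stores BMI with two implied decimal places, e.g. 4018 = 40.18)
stroke_df["_BMI5"] = stroke_df["bmi"] * 100
# smoking_status -> SMOKE100 (BRFSS codes: 1 = yes, 2 = no, 7 = don't know)
stroke_df["SMOKE100"] = stroke_df["smoking_status"].replace(
    {"smokes": 1, "formerly smoked": 1, "never smoked": 2, "Unknown": 7}
)
# Drop the identifier and the original columns that were re-encoded above.
df = stroke_df.drop(["id", "gender", "age", "ever_married", "bmi", "smoking_status"], axis=1)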

In [50]:
os.makedirs(join(SAVE_DIR, "stroke"), exist_ok=True)

df.to_csv(
    join(SAVE_DIR, "stroke", "stroke.csv"),
    sep=",",
    index=False,
)