# Data preprocessing

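This notebook prepares the raw survey file (`H_MHAS_c2.dta`) for modeling. We isolate wave 5, keep only respondent-level variables, remove rows whose target variable is missing, and drop columns with a high rate of missing values. The reasoning behind each step is given in its own section.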

## Sections 
[Split waves](#split-waves)  
[Split respondents](#split-respondents)  
[Remove missing values in target variable](#remove-missing-target)  
[Drop columns with a high rate of missing values](#drop-high-rate-missing-values-columns)

In [1]:
import pandas as pd

from src import preprocessing

In [2]:
file_path = 'H_MHAS_c2.dta'
raw_df = pd.read_stata(file_path)

<a id='split-waves'></a>
### Split waves

Our initial approach is to train our model on separate waves. We made this decision because the last three waves took place every two years and there is little data available to help us bridge that gap, so a cross-sectional cut of the data is our best option.

In [3]:
wave_5_df = preprocessing.extract_wave_data(raw_df, "5")

print(f'Wave 5 dataframe has the following shape: {wave_5_df.shape}')

Wave 5 dataframe has the following shape: (26839, 1004)
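The helper `preprocessing.extract_wave_data` lives in `src` and is not shown in this notebook. As a rough illustration of what a wave split can look like, here is a minimal sketch that assumes wave-specific columns are named with a one-letter prefix followed by the wave number (as in `r5hosp1y`). The function name `extract_wave_columns`, the prefix set `r`/`s`/`h`, and the toy column names are assumptions for illustration, not the project's actual implementation.

```python
import re

import pandas as pd


def extract_wave_columns(df: pd.DataFrame, wave: str) -> pd.DataFrame:
    """Keep columns named <prefix><wave><name>, e.g. 'r5hosp1y' for wave "5".

    Hypothetical stand-in for preprocessing.extract_wave_data; the
    r/s/h prefix convention is an assumption.
    """
    # (?!\d) stops wave "1" from also matching a two-digit wave such as "12"
    pattern = re.compile(rf"^[rsh]{re.escape(wave)}(?!\d)")
    wave_columns = [col for col in df.columns if pattern.match(col)]
    return df[wave_columns].copy()


# Toy data: only 'r5hosp1y' is a real column name from this notebook
toy_df = pd.DataFrame(
    {
        "r4hosp1y": [0, 1],
        "r5hosp1y": [1, 0],
        "s5hosp1y": [0, 0],
        "h5hhres": [3, 2],
    }
)
wave_5_toy = extract_wave_columns(toy_df, "5")
print(list(wave_5_toy.columns))  # ['r5hosp1y', 's5hosp1y', 'h5hhres']
```

Only columns are filtered in this sketch, so the number of rows is unchanged.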


<a id='split-respondents'></a>
### Split respondents

Our initial approach is to train the model using only data from the respondents, as we believe it is the most relevant information for training. In addition, our MVP requires interaction with the people who want a hospitalization prediction, so we deem it best to ask them questions about themselves rather than about their spouse or household, since that information might not be available during their interaction with our MVP.

In [4]:
wave_5_respondents_df = preprocessing.extract_respondent_data(wave_5_df)

print(f'Wave 5 respondent-only dataframe has the following shape: {wave_5_respondents_df.shape}')

Wave 5 respondent-only dataframe has the following shape: (26839, 469)
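The row count stays at 26839 while the column count drops from 1004 to 469, which suggests the respondent split filters columns rather than rows. A minimal sketch of such a filter follows, assuming respondent variables are the ones whose names start with `r` (as every column printed in this notebook does). `extract_respondent_columns` is a hypothetical name, and the real `preprocessing.extract_respondent_data` may work differently.

```python
import pandas as pd


def extract_respondent_columns(df: pd.DataFrame, prefix: str = "r") -> pd.DataFrame:
    """Keep only columns whose name starts with the respondent prefix.

    Hypothetical stand-in for preprocessing.extract_respondent_data;
    the 'r' prefix convention is an assumption.
    """
    respondent_columns = [col for col in df.columns if col.startswith(prefix)]
    return df[respondent_columns].copy()


# Toy data: 's5hosp1y' is a made-up non-respondent column
toy_wave_df = pd.DataFrame(
    {
        "r5hosp1y": [1, 0],
        "r5iwstat": ["1.Resp, alive", "1.Resp, alive"],
        "s5hosp1y": [0, 0],
    }
)
toy_respondents = extract_respondent_columns(toy_wave_df)
print(toy_respondents.shape)  # (2, 2): same rows, fewer columns
```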


<a id='remove-missing-target'></a>
### Remove missing values in target variable

We need to make sure that our wave data is appropriate for modeling. This includes removing missing values in our target variable and imputing missing values in other columns. Imputation of categorical variables is not as straightforward as imputation of numerical variables, so we'll have to take several steps to complete this task.

Our first step is to remove all rows containing missing values in our target variable. Why remove them instead of imputing them? Because this is our ground truth: we cannot alter it by estimating missing values from the data. If we imputed our ground truth from other features, we would be incorporating information from those features into the target variable, which could very likely lead us to overfit our model.

Our target variable is 'r5hosp1y', which encodes a 'yes' or 'no' question on whether the respondent has had at least one overnight hospital stay in the last 12 months.

In [None]:
wave_5_respondents_df = preprocessing.remove_missing_values(wave_5_respondents_df, 'r5hosp1y')


# Check that there are no values other than 0 and 1
print(f"Target variable now has values: {wave_5_respondents_df['r5hosp1y'].unique()}")

Target variable now has values: [0, 1]
Categories (2, int64): [0 < 1]
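For readers without access to `src`, dropping rows with a missing target can be sketched with plain pandas. The function name `drop_rows_missing_target` and the toy values are assumptions. The real `preprocessing.remove_missing_values` may also recode the target, which this sketch does not attempt.

```python
import numpy as np
import pandas as pd


def drop_rows_missing_target(df: pd.DataFrame, target: str) -> pd.DataFrame:
    """Drop rows where the target is missing; leave other columns untouched."""
    return df.dropna(subset=[target]).reset_index(drop=True)


# Toy data with made-up values
toy = pd.DataFrame(
    {
        "r5hosp1y": [0.0, np.nan, 1.0, np.nan],
        "r5bmi": [24.1, 30.2, np.nan, 27.5],
    }
)
clean = drop_rows_missing_target(toy, "r5hosp1y")
print(len(toy), "->", len(clean))  # 4 -> 2
print(clean["r5bmi"].isna().sum())  # 1: missing feature values are kept for later imputation
```

Restricting `dropna` to the target column matters here: rows that are only missing feature values are kept, since those values are handled later by imputation.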


Next we need to check for deceased respondents. Variable 'r5iwstat' encodes information on whether the respondent is alive or deceased. The value 1 is assigned to respondents who are alive. The code block below verifies that all remaining respondents are indeed alive.

In [6]:
print(wave_5_respondents_df['r5iwstat'].unique())

['1.Resp, alive']
Categories (6, object): ['0.Inap' < '1.Resp, alive' < '4.NR, alive' < '5.NR, died this wave' < '6.NR, died prev wave' < '9.NR, dk if alive or died']
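Inspecting `unique()` by eye works, but the same check can be made explicit so that it fails loudly if the data changes. The sketch below is one possible way to do it. The helper name `find_non_alive_rows` is hypothetical, and the label string is copied from the output above.

```python
import pandas as pd

ALIVE_LABEL = "1.Resp, alive"  # label as printed in the output above


def find_non_alive_rows(df: pd.DataFrame, status_column: str = "r5iwstat") -> pd.DataFrame:
    """Return the rows whose interview status is anything other than alive."""
    # astype(str) makes the comparison behave the same for object and category dtypes
    status = df[status_column].astype(str)
    return df[status != ALIVE_LABEL]


# Toy data with made-up rows
toy_df = pd.DataFrame(
    {
        "r5iwstat": ["1.Resp, alive", "5.NR, died this wave", "1.Resp, alive"],
        "r5hosp1y": [0, 1, 0],
    }
)
non_alive = find_non_alive_rows(toy_df)
print(len(non_alive))  # 1

# After filtering, the check comes back empty
toy_alive_only = toy_df[toy_df["r5iwstat"] == ALIVE_LABEL]
assert find_non_alive_rows(toy_alive_only).empty
```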


<a id='drop-high-rate-missing-values-columns'></a>

### Drop columns with a high rate of missing values

We have decided to drop columns with a high ratio of missing values. The cell below uses a threshold of 0.1, so any column with more than 10% of its values missing is removed. A high proportion of missing values hints at survey unreliability, and imputing such a column would mean estimating a large share of its entries rather than observing them.

In [7]:
variables_to_drop = preprocessing.missing_value_ratio(wave_5_respondents_df, 0.1)

# Drop the columns with specified missing values ratio
wave_5_respondents_df = wave_5_respondents_df.drop(columns=variables_to_drop)

# Verify columns were dropped. Starting column count is 469
print(f'New column count: {wave_5_respondents_df.shape[1]}')


Variables with a missing value ratio higher than 0.1: ['r5ciqscore6', 'r5rifcaredpmm', 'r5rccarehr', 'r5dresshlp', 'r5wander', 'r5rrcarehr', 'r5rrcaredpm', 'r5bed', 'r5racaany', 'r5pubage', 'r5bmi', 'r5strtsmok', 'r5bedhlp', 'r5rfaany', 'r5rpfcaren', 'r5rifcaredpm', 'r5walkr', 'r5rifcarehr', 'r5ssic', 'r5rfcaredpm', 'r5rfcarehrm', 'r5rfcarehr', 'r5toilt', 'r5rapfcaredpm', 'r5hystere', 'r5raccarehr', 'r5riscarehr', 'r5ciqscore4', 'r5rpfcaredpmm', 'r5ciqscore5', 'r5rccaren', 'r5shophlp', 'r5alone', 'r5moneyhlp', 'r5recstrok', 'r5rscaredpm', 'r5rccarehrm', 'r5rafcare', 'r5rafcarehr', 'r5ripfcarehrm', 'r5ciqscore12', 'r5ciqscore16', 'r5riscaredpmm', 'r5rcany', 'r5stroklmt', 'r5rarcarehr', 'r5rrcaredpmm', 'r5rscaredpmm', 'r5prchmem', 'r5ciqscore2', 'r5rircaren', 'r5rifcarehrm', 'r5rechrtatt', 'r5ciqscore11', 'r5rorgnz', 'r5retyr', 'r5mealhlp', 'r5bede', 'r5papsm', 'r5arthlmt', 'r5cjormscore', 'r5ciqscore3', 'r5rcaany', 'r5rifaany', 'r5rccaredpm', 'r5rapfcaredpmm', 'r5ciqscore8', 'r5riscareh
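The list above depends heavily on the threshold passed to `preprocessing.missing_value_ratio`. A minimal sketch of how such a filter can be computed with pandas, and how the result changes with the threshold, is shown below. The function name `columns_above_missing_ratio` and the toy values are assumptions, not the helper's actual code.

```python
import numpy as np
import pandas as pd


def columns_above_missing_ratio(df: pd.DataFrame, threshold: float) -> list:
    """Return the names of columns whose share of missing values exceeds threshold."""
    # The mean of a boolean mask is the fraction of True values per column
    missing_ratio = df.isnull().mean()
    return missing_ratio[missing_ratio > threshold].index.tolist()


# Toy data with made-up values
toy = pd.DataFrame(
    {
        "r5hosp1y": [0, 1, 0, 1, 0],                       # 0% missing
        "r5bmi": [24.1, np.nan, np.nan, 27.5, 22.0],       # 40% missing
        "r5iothr": [np.nan, 1.0, 2.0, 3.0, 4.0],           # 20% missing
    }
)
for threshold in (0.1, 0.3, 0.7):
    print(threshold, columns_above_missing_ratio(toy, threshold))
# 0.1 ['r5bmi', 'r5iothr']
# 0.3 ['r5bmi']
# 0.7 []
```

A lower threshold removes more columns, so it is worth confirming that the value passed in the cell matches the one intended.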

In [8]:
categorical_columns = wave_5_respondents_df.select_dtypes(include=['object', 'category']).columns
print("Categorical columns:", categorical_columns)


Categorical columns: Index(['r5fdlrc8', 'r5beda', 'r5jog', 'r5ifsret', 'r5flusht', 'r5lideal3',
       'r5lgmusaa', 'r5hip', 'r5diabe', 'r5hibpe',
       ...
       'r5nagi8a', 'r5swell', 'r5iadlfoura', 'r5stoop', 'r5lchnot3',
       'r5rxhrtat', 'r5walk1a', 'r5medsa', 'r5dadliv', 'r5lifta'],
      dtype='object', length=206)


In [9]:
print(wave_5_respondents_df.isnull().sum())


r5iadlfourm       0
r5finea        1331
r5fdlrc8          0
r5uppermobm       0
r5beda          107
               ... 
r5iothr          20
r5walk1a       1369
r5medsa        1401
r5dadliv        265
r5lifta        1482
Length: 289, dtype: int64


In [10]:
print(wave_5_respondents_df.dtypes)


r5iadlfourm     float64
r5finea         float64
r5fdlrc8       category
r5uppermobm     float64
r5beda         category
                 ...   
r5iothr         float32
r5walk1a       category
r5medsa        category
r5dadliv       category
r5lifta        category
Length: 289, dtype: object
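The two printouts above (missing counts and dtypes) are easier to read side by side, especially when planning different imputation strategies for categorical and numerical columns. The sketch below is one way to combine them. The helper name `summarize_missing` and the toy values, including the category labels, are made up for illustration.

```python
import numpy as np
import pandas as pd


def summarize_missing(df: pd.DataFrame) -> pd.DataFrame:
    """One row per column: dtype, missing count, and missing ratio."""
    summary = pd.DataFrame(
        {
            "dtype": df.dtypes.astype(str),
            "missing_count": df.isnull().sum(),
            "missing_ratio": df.isnull().mean(),
        }
    )
    # Columns with the most missing values come first
    return summary.sort_values("missing_count", ascending=False)


# Toy data: column names come from the notebook, values are invented
toy = pd.DataFrame(
    {
        "r5finea": [1.0, np.nan, 0.0, np.nan],
        "r5beda": pd.Categorical(["0.No", None, "1.Yes", "0.No"]),
        "r5iadlfourm": [0.0, 1.0, 0.0, 2.0],
    }
)
summary = summarize_missing(toy)
print(summary)
```

Filtering the result with `summary[summary["dtype"] == "category"]` gives the categorical columns that still need imputation.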
