# Final Project - Hospitalization Prediction for Elderly People

In [1]:
from src import extract_data as data, preprocessing
import pandas as pd


# Data Extraction

* *The Mexican Health and Aging Study* (**MHAS**) is a dataset of household surveys designed to collect information on the health, economic status, and quality of life of older adults.
The survey was conducted over 5 time periods, technically known as **Waves**. 
In addition, there are three study subjects: the respondent (r), the spouse (s), and the household (H). For this study we use the last wave and the respondent (r).

To access the data for this project, you only need to execute the code below. This will download the `H_MHAS_c2.sas7bdat` file into the `dataset` folder:

In [None]:
# Run only once, or if you need to rebuild the original data
df = data.download_dataset()

In [None]:
print('We have',df.shape[0],'subjects')
print('We have',df.shape[1],'features')
print('Head', df.head())

If you have already downloaded the dataset, you only need to execute the code below to load it.



In [2]:
df = data.load_dataset()




### Features
The dataset `H_MHAS_c2.sas7bdat` has 26839 rows and 5241 features.

All features are divided into the following sections.

- SECTION A: DEMOGRAPHICS, IDENTIFIERS, AND WEIGHTS 
- SECTION B: HEALTH 
- SECTION C: HEALTH CARE UTILIZATION AND INSURANCE 
- SECTION D: COGNITION  
- SECTION E: FINANCIAL AND HOUSING WEALTH 
- SECTION F: INCOME
- SECTION G: FAMILY STRUCTURE 
- SECTION H: EMPLOYMENT HISTORY 
- SECTION I: RETIREMENT 
- SECTION J: PENSION 
- SECTION K: PHYSICAL MEASURES
- SECTION L: ASSISTANCE AND CAREGIVING
- SECTION M: STRESS 
- SECTION O: END OF LIFE PLANNING
- SECTION Q: PSYCHOSOCIAL

# Categorical and Numerical Features 

In [None]:
categorical_vars = df.select_dtypes(include=['object']).columns
numerical_vars = df.select_dtypes(include=['float64']).columns

print('Categorical features ' + str(len(categorical_vars)))
print('Numerical features ' + str(len(numerical_vars)) )
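
Note that `include=['float64']` matches only 64-bit floats, so integer or `float32` columns would be counted in neither group. The sketch below illustrates the difference on a made-up frame (the column names are illustrative and are not MHAS variables); `include='number'` selects every numeric dtype.

```python
import pandas as pd

# Toy frame with made-up columns (not real MHAS variables)
toy = pd.DataFrame({
    "label": pd.Series(["a", "b", "c"], dtype=object),  # object dtype
    "score": [0.5, 1.5, 2.5],                           # float64
    "count": [1, 2, 3],                                 # integer dtype
})

categorical_cols = toy.select_dtypes(include=["object"]).columns
float_only_cols = toy.select_dtypes(include=["float64"]).columns
all_numeric_cols = toy.select_dtypes(include="number").columns

print("object columns:  ", list(categorical_cols))
print("float64 columns: ", list(float_only_cols))
print("numeric columns: ", list(all_numeric_cols))
```

Whether this matters for the MHAS file depends on how the SAS columns are typed after loading, which is worth checking with `df.dtypes.value_counts()`.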


# Save the features with their possible values

In [None]:
preprocessing.save_categorical_features_with_values(df, 'features_with_values.txt')
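
The implementation of `save_categorical_features_with_values` is not shown in this notebook. As a rough sketch of the idea, assuming the helper writes one line per object-dtype column followed by its unique values, something like the following would do. The function name `write_unique_values`, the output format, and the toy data are all hypothetical.

```python
import os
import tempfile

import pandas as pd


def write_unique_values(df: pd.DataFrame, path: str) -> int:
    """Write each object-dtype column and its sorted unique values, one column per line.

    Hypothetical stand-in for the project's helper; the real file format may differ.
    """
    categorical_cols = df.select_dtypes(include=["object"]).columns
    with open(path, "w", encoding="utf-8") as fh:
        for col in categorical_cols:
            values = sorted(df[col].dropna().unique().tolist())
            fh.write(f"{col}: {values}\n")
    return len(categorical_cols)


# Toy frame with made-up columns (not real MHAS variables)
toy = pd.DataFrame({
    "group": pd.Series(["b", "a", "b", None], dtype=object),
    "score": [1.0, 2.0, 3.0, 4.0],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "toy_features_with_values.txt")
    n_written = write_unique_values(toy, out_path)
    with open(out_path, encoding="utf-8") as fh:
        written_text = fh.read()

print(n_written)
print(written_text)
```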

# Extract Data

* In this study, we will focus on the Respondent.
* The Householder variable will be removed from the data.



### Split waves

Our initial approach is to train our model on one wave at a time. We made this decision because the last three waves took place every two years and there is little data available to help us bridge that gap, so a cross-sectional cut of the data is our best option.

In [3]:
wave_5_df = preprocessing.extract_wave_data(df, "5")

print(f'Wave 5 dataframe has the following shape: {wave_5_df.shape}')

Wave 5 dataframe has the following shape: (26839, 1004)
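
The internals of `preprocessing.extract_wave_data` are not shown here. As a minimal sketch, assume that wave-specific columns carry the wave number as the second character of their name (as in `r5hosp1y`) and that columns without a digit in that position (as in `radyear`) do not depend on the wave. Under those assumptions a wave filter could look like this. The helper name `select_wave_columns` is hypothetical, and the toy columns other than `r5hosp1y` and `radyear` are illustrative only.

```python
import pandas as pd


def select_wave_columns(df: pd.DataFrame, wave: str) -> pd.DataFrame:
    """Keep columns for one wave plus columns with no wave digit (hypothetical helper)."""
    keep = []
    for col in df.columns:
        second = col[1:2]  # empty string for one-character names
        if second.isdigit():
            # Wave-specific column: keep it only for the requested wave
            if second == wave:
                keep.append(col)
        else:
            # No wave digit in second position: treat as wave-invariant
            keep.append(col)
    return df[keep]


# Toy frame; values are placeholders
toy = pd.DataFrame({
    "r4hosp1y": [0],
    "r5hosp1y": [1],
    "radyear": [2015.0],
    "s5hosp1y": [0],
    "s4hosp1y": [1],
})

wave_5_toy = select_wave_columns(toy, "5")
print(list(wave_5_toy.columns))
```

This single-digit check is enough for five waves; it would need a different rule for wave numbers of 10 or more.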


<a id='split-respondents'></a>
### Split respondents

Our initial approach is to train the model using only data from the respondents, as we believe it is the most relevant information for training. In addition, our MVP requires interaction with the people interested in receiving a hospitalization prediction, so we deem it best to ask them questions about themselves rather than about their spouse or household, since that information might not be available during their interaction with the MVP.

In [4]:
wave_5_respondents_df = preprocessing.extract_respondent_data(wave_5_df)

print(f'Wave 5 respondent-only dataframe has the following shape: {wave_5_respondents_df.shape}')

Wave 5 respondent-only dataframe has the following shape: (26839, 469)
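
How `extract_respondent_data` selects its columns is not shown. One plausible reading, given that the subject is encoded as the first letter of the column name (r, s, or h), is a simple prefix filter. The sketch below assumes exactly that; `select_respondent_columns` is a hypothetical name and the toy columns are illustrative. A real implementation would probably also keep identifier columns, which this sketch ignores.

```python
import pandas as pd


def select_respondent_columns(df: pd.DataFrame, prefix: str = "r") -> pd.DataFrame:
    """Keep only columns whose name starts with the respondent prefix (hypothetical helper)."""
    respondent_cols = [col for col in df.columns if col.startswith(prefix)]
    return df[respondent_cols]


# Toy frame; values are placeholders
toy = pd.DataFrame({
    "r5hosp1y": [1, 0],
    "radyear": [2015.0, 2016.0],
    "s5hosp1y": [0, 0],
    "h5example": [3, 4],  # made-up household column
})

respondent_toy = select_respondent_columns(toy)
print(list(respondent_toy.columns))
```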


In [None]:
preprocessing.save_categorical_features_with_values(wave_5_respondents_df, 'wave_5_features_with_values.txt')

# Identification of the target variable
* The target variable belongs to *Section C: Health Care Utilization and Insurance* and is labeled **Medical Care Utilization: Hospital** (*`rhosp1y`*). For wave 5 the column is named `r5hosp1y`.

* *`rhosp1y`* indicates whether the respondent reports at least one overnight hospital stay in the last 12 months. It is coded as 0 if the respondent had no overnight hospital stays and as 1 if the respondent had at least one.

<a id='remove-missing-target'></a>
### Remove missing values in target variable

Our first step is to remove all rows with a missing value in the target variable. Why remove them instead of imputing them? Because the target is our ground truth, and we cannot alter it by estimating missing values from the data. If we imputed the target from other features, we would be building information from those features into the labels, which would very likely lead the model to overfit.

In [5]:
wave_5_respondents_df = preprocessing.remove_missing_values(wave_5_respondents_df, 'r5hosp1y')

print(f'Shape: {wave_5_respondents_df.shape}')

Shape: (17046, 469)
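
In the run above, this step reduces the data from 26839 to 17046 rows, so 9793 rows had no recorded target. The project's `remove_missing_values` helper is not shown; a minimal equivalent, assuming it simply drops rows where the given column is missing, is pandas' `dropna` with `subset`. The toy values below are made up.

```python
import numpy as np
import pandas as pd

# Toy frame: the target has two missing entries
toy = pd.DataFrame({
    "r5hosp1y": [0.0, 1.0, np.nan, 0.0, np.nan],
    "feature": [10, 20, 30, 40, 50],  # made-up feature
})

# Inspect the target, counting missing values as their own category
print(toy["r5hosp1y"].value_counts(dropna=False))

# Drop only the rows where the target itself is missing
cleaned = toy.dropna(subset=["r5hosp1y"])
print(cleaned.shape)
```

Missing values in feature columns are left untouched by this call, so they can still be imputed later.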


In [None]:
# Save the filtered wave 5 respondent data, not the full dataset
preprocessing.save_categorical_features_with_values(wave_5_respondents_df, 'wave_5_features_with_values.txt')

<a id='drop-high-rate-missing-values-columns'></a>

### Drop columns with a high rate of missing values

We have decided to drop columns with a high missing-value ratio; the cell below uses a threshold of 0.2. A column with a high proportion of missing values hints at survey unreliability, and imputing it would mean estimating a large share of its values from the rest of the data.

In [6]:
variables_to_drop = preprocessing.missing_value_ratio(wave_5_respondents_df, 0.2)

# Drop the columns with specified missing values ratio
wave_5_respondents_df = wave_5_respondents_df.drop(columns=variables_to_drop)

# Verify columns were dropped. Starting column count is 469
print(f'New column count: {wave_5_respondents_df.shape[1]}')

Variables with a missing value ratio higher than 0.2: ['r5rifcare', 'r5mealhlp', 'r5recstrok', 'r5jrsleft', 'r5haluc', 'r5ciqscore15', 'r5toilt', 'r5bath', 'r5rafcarehrm', 'r5rrcaredpm', 'r5riccarehr', 'r5medhlp', 'r5ciqscore8', 'r5rpfcarehr', 'r5ripfcaredpm', 'r5fallinj', 'r5rapfcarehrm', 'r5rechrtatt', 'r5rascarehr', 'r5jlasty', 'r5ciqscore7', 'r5jhours', 'radyear', 'r5rfcaren', 'r5riscarehr', 'r5rifaany', 'r5rfcarehr', 'r5prchmem', 'r5bede', 'r5eathlp', 'r5eat', 'r5rrcaredpmm', 'r5ciqscore16', 'r5raccaredpm', 'r5rrcaren', 'r5lstmnspd', 'r5flstmnspd', 'r5walkre', 'r5rarcaren', 'r5rscare', 'r5raccarehrm', 'r5rpfcaren', 'r5mammog', 'r5rcaany', 'r5jcten', 'r5ciqscore1', 'r5rafcaren', 'r5stroklmt', 'r5rccaredpm', 'r5rafcaredpm', 'r5rrcarehr', 'r5breast', 'r5unemp', 'r5rifcarehr', 'r5racaany', 'r5ciqscore4', 'r5rapfcare', 'r5rascare', 'r5rpfcaredpm', 'r5rscarehrm', 'r5wander', 'r5ricany', 'r5walkr', 'r5jredhr', 'r5strtsmok', 'r5penic', 'r5ciqscore10', 'r5ciqscore11', 'r5rorgnz', 'r5rascar
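
The ratio itself is straightforward to compute: `isna().mean()` gives the fraction of missing entries per column. The sketch below assumes that `missing_value_ratio` returns the names of columns whose fraction is strictly greater than the threshold; the helper name `columns_above_missing_ratio` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd


def columns_above_missing_ratio(df: pd.DataFrame, threshold: float) -> list:
    """Return names of columns whose share of missing values exceeds the threshold."""
    ratios = df.isna().mean()  # fraction of NaN per column
    return ratios[ratios > threshold].index.tolist()


# Toy frame with 10 rows and made-up columns
toy = pd.DataFrame({
    "complete": list(range(10)),                  # 0% missing
    "one_missing": [np.nan] + [1.0] * 9,          # 10% missing
    "half_missing": [np.nan] * 5 + [1.0] * 5,     # 50% missing
})

to_drop = columns_above_missing_ratio(toy, 0.2)
print(to_drop)

reduced = toy.drop(columns=to_drop)
print(list(reduced.columns))
```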

In [7]:
preprocessing.save_categorical_features_with_values(wave_5_respondents_df, 'wave_5_features_out_missing_value.txt')

Categorical features with their unique values have been saved to 'wave_5_features_out_missing_value.txt'


# Data Preprocessing

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In [None]:
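# Separate the features from the target variable identified above
X = wave_5_respondents_df.drop(columns=['r5hosp1y']).values
y = wave_5_respondents_df['r5hosp1y'].values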


In [None]:
print(X)


In [None]:
print(y)

In [None]:
columns_list = list(df.columns)
print(columns_list)

num_columns = len(df.columns)
print(f"The dataset contains {num_columns} columns.")

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


# Basic Information
print("Basic Info:")
print(df.info())
print("\n")

print("Summary Statistics:")
print(df.describe(include='all'))
print("\n")

# Check for missing values
print("Missing Values:")
print(df.isnull().sum())
print("\n")

# Check for duplicate rows
print("Duplicate Rows:")
print(df.duplicated().sum())
print("\n")




In [None]:

target_column = 'r5hosp1y'
if target_column in wave_5_respondents_df.columns:
    plt.figure(figsize=(8, 4))
    sns.countplot(x=target_column, data=wave_5_respondents_df)
    plt.title(f'Target Column Distribution: {target_column}')
    plt.show()
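
A count plot shows class balance visually; the same information as proportions is often easier to quote. The sketch below uses a made-up target series rather than the MHAS data, so the numbers say nothing about the real distribution.

```python
import pandas as pd

# Made-up binary target: three 0s and one 1
toy_target = pd.Series([0, 0, 0, 1], name="r5hosp1y")

# Share of each class instead of raw counts
proportions = toy_target.value_counts(normalize=True)
print(proportions)
```

On the real data, the equivalent call would be `wave_5_respondents_df['r5hosp1y'].value_counts(normalize=True)`.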