# Final Project - Hospitalization Prediction for Elderly People

In [1]:
from src import config, data


# Data Extraction

* *The Mexican Health and Aging Study* (**MHAS**) is a household survey dataset that collects information on the health, economic status, and quality of life of older adults.
The survey was conducted in 5 time periods, known as **waves**.
It covers three study subjects: the respondent (r), the spouse (s), and the household (h). This study uses only the first two.
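The subject letters show up as prefixes in the column names seen later in this notebook (for example `r3satlifez`, `s4satlife_m`, `h1hhidc`, `rahhidnp`). The sketch below picks out respondent and spouse columns by prefix. It assumes the pattern "subject letter, then a wave digit 1-5 or `a` for wave-invariant variables" holds across the whole dataset; the helper name `columns_for_subjects` is hypothetical and not part of the project code.

```python
import re

# Assumed naming pattern: subject prefix (r, s or h) followed by a wave
# digit 1-5, or "a" for variables that do not change across waves.
PREFIX_RE = re.compile(r"^([rsh])([1-5a])")


def columns_for_subjects(columns, subjects=("r", "s")):
    """Return the column names whose prefix matches one of the given subjects."""
    selected = []
    for name in columns:
        match = PREFIX_RE.match(name)
        if match and match.group(1) in subjects:
            selected.append(name)
    return selected


# Column names taken from the output shown in this notebook.
example_columns = ["r3satlifez", "s4satlife_m", "h1hhidc", "rahhidnp", "codent01"]
selected = columns_for_subjects(example_columns)
print(selected)
```

A prefix match can pick up unrelated columns that happen to start with the same letters, so the result should be checked against the codebook before use.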


# Load data

To access the data for this project, run the cell below. It downloads the `H_MHAS_c2.sas7bdat` file into the `dataset` folder:

In [None]:
# Run only once, or if you need to rebuild the original data
df = data.download_dataset()




In [4]:
print('We have',df.shape[0],'subjects')
print('We have',df.shape[1],'features')
print('Head', df.head())

We have 26839 subjects
We have 5241 features
Head    codent01  codent03  ps3  ent2    np  unhhidnp rahhidnp  tipent_01  \
0       1.0       1.0  1.0   1.0  10.0     110.0   b'110'       12.0   
1       2.0       2.0  2.0   2.0  20.0     120.0   b'120'       11.0   
2       1.0       1.0  1.0   1.0  10.0     210.0   b'210'       11.0   
3       2.0       2.0  2.0   2.0  20.0     220.0   b'220'       12.0   
4       1.0       1.0  1.0   1.0  10.0     310.0   b'310'       11.0   

   tipent_03  tipent_12  ...  s4satlife_m  s5satlife_m  r3satlifez  \
0       12.0        1.0  ...          NaN          NaN    1.510731   
1       11.0        1.0  ...          NaN          NaN    1.510731   
2       22.0        1.0  ...          NaN          NaN   -0.397267   
3       11.0        3.0  ...          1.0          NaN         NaN   
4       11.0        1.0  ...          1.0          1.0   -0.397267   

   r4satlifez  r5satlifez  s3satlifez  s4satlifez  s5satlifez  r2cantril  \
0         NaN       

If you have already downloaded the dataset, you only need to run the cell below.



In [3]:
df = data.load_dataset()
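The internals of `data.load_dataset()` are not shown in this notebook. As a rough sketch of what loading a local `.sas7bdat` file can look like, the hypothetical helper below uses `pandas.read_sas` and returns `None` when the file is missing, so the caller can fall back to downloading. The path in the usage comment is an assumption based on the file and folder names mentioned above.

```python
from pathlib import Path

import pandas as pd


def read_sas_if_present(path):
    """Read a SAS7BDAT file into a DataFrame, or return None if it is missing.

    Hypothetical helper: the project's own data.load_dataset() may work differently.
    """
    path = Path(path)
    if not path.exists():
        return None
    # Without an explicit encoding, text columns come back as bytes (b'110').
    return pd.read_sas(path, format="sas7bdat")


# Usage (assumed location of the downloaded file):
# df = read_sas_if_present("dataset/H_MHAS_c2.sas7bdat")
missing = read_sas_if_present("this_file_does_not_exist.sas7bdat")
print(missing)
```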




### Features
The dataset `H_MHAS_c2.sas7bdat` has 26839 rows and 5241 features.

All features are divided into the following sections.

- SECTION A: DEMOGRAPHICS, IDENTIFIERS, AND WEIGHTS 
- SECTION B: HEALTH 
- SECTION C: HEALTH CARE UTILIZATION AND INSURANCE 
- SECTION D: COGNITION  
- SECTION E: FINANCIAL AND HOUSING WEALTH 
- SECTION F: INCOME
- SECTION G: FAMILY STRUCTURE 
- SECTION H: EMPLOYMENT HISTORY 
- SECTION I: RETIREMENT 
- SECTION J: PENSION 
- SECTION K: PHYSICAL MEASURES
- SECTION L: ASSISTANCE AND CAREGIVING
- SECTION M: STRESS 
- SECTION O: END OF LIFE PLANNING
- SECTION Q: PSYCHOSOCIAL

# Categorical and Numerical Features 

In [None]:
categorical_vars = df.select_dtypes(include=['object']).columns
numerical_vars = df.select_dtypes(include=['float64']).columns

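print('We have ' + str(len(categorical_vars)) + ' categorical features in our dataset')
print('We have ' + str(len(numerical_vars)) + ' numerical features in our dataset')
# List the object-typed columns so they can be checked for numeric conversion
print('We need to check if we can convert the following to numerical: ')
print(categorical_vars)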


We have 7 categorical features in our dataset
We have 5234 numerical features in our dataset
We need to check if we can convert the following to numerical: 
Index(['rahhidnp', 'h1hhidc', 'h2hhidc', 'h3hhidc', 'h4hhidc', 'h5hhidc',
       'acthog'],
      dtype='object')
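One way to run that check is to decode the byte strings (values such as `b'110'` appear in the `head()` output above) and try a numeric conversion, keeping the conversion only when no value is lost. This is a minimal sketch on made-up toy data; the helper name `decode_and_convert` and the sample `acthog` values are hypothetical.

```python
import pandas as pd


def decode_and_convert(df, columns):
    """Decode bytes to str, then convert a column to numeric only if lossless."""
    out = df.copy()
    for col in columns:
        decoded = out[col].map(
            lambda v: v.decode("utf-8") if isinstance(v, bytes) else v
        )
        numeric = pd.to_numeric(decoded, errors="coerce")
        # Keep the numeric version only if every non-missing value converted.
        if numeric.notna().sum() == decoded.notna().sum():
            out[col] = numeric
        else:
            out[col] = decoded
    return out


# Toy data: the rahhidnp values mirror the output above, acthog values are invented.
toy = pd.DataFrame({
    "rahhidnp": [b"110", b"120", b"210"],
    "acthog": [b"A1", b"B2", None],
})
converted = decode_and_convert(toy, ["rahhidnp", "acthog"])
print(converted.dtypes)
```

Identifier columns that convert cleanly are still identifiers, so a successful conversion does not by itself make a column a useful numeric feature.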


# Data Preprocessing

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In [None]:
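# Positional split: every column except the last is treated as a feature and
# the last column as the target. Verify that the last column of the dataset
# really is the intended target before relying on this split.
X = df.iloc[:, :-1].values
y = df.iloc[:, -1].values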
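With more than five thousand columns, a positional split is easy to get wrong. A hedged alternative is to split by column name, as sketched below on toy data. The helper `split_features_target` and the column name `hospitalized` are hypothetical; the actual target variable is not named in this notebook.

```python
import pandas as pd


def split_features_target(df, target_column):
    """Split a DataFrame into features X and target y by column name."""
    if target_column not in df.columns:
        raise KeyError(f"{target_column!r} is not a column of the DataFrame")
    X = df.drop(columns=[target_column])
    y = df[target_column]
    return X, y


# Toy data with an invented target column name.
toy = pd.DataFrame({
    "age": [70, 82, 65],
    "hospitalized": [0, 1, 0],
    "r3satlifez": [1.5, -0.4, 0.2],
})
X_toy, y_toy = split_features_target(toy, "hospitalized")
print(list(X_toy.columns))
```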


In [None]:
print(X)


In [None]:
print(y)

In [None]:
columns_list = list(df.columns)
print(columns_list)

num_columns = len(df.columns)
print(f"The dataset contains {num_columns} columns.")

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


# Basic Information
print("Basic Info:")
df.info()  # info() prints its report directly and returns None, so it is not wrapped in print()
print("\n")

print("Summary Statistics:")
print(df.describe(include='all'))
print("\n")

# Check for missing values
print("Missing Values:")
print(df.isnull().sum())
print("\n")

# Check for duplicate rows
print("Duplicate Rows:")
print(df.duplicated().sum())
print("\n")
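Printing `df.isnull().sum()` for thousands of columns gets truncated on screen, so a sorted share of missing values per column can be easier to read. The sketch below shows this on toy data, along with an optional filter that drops sparse columns. The helper names and the 0.5 threshold are assumptions for illustration, not choices made in this project.

```python
import numpy as np
import pandas as pd


def missing_summary(df):
    """Return the share of missing values per column, highest first."""
    return df.isnull().mean().sort_values(ascending=False)


def drop_sparse_columns(df, max_missing=0.5):
    """Keep only columns whose share of missing values is at most max_missing."""
    share = df.isnull().mean()
    keep = share[share <= max_missing].index
    return df[keep]


# Toy data using column names that appear in the output above.
toy = pd.DataFrame({
    "r3satlifez": [1.5, np.nan, -0.4, np.nan],
    "s5satlife_m": [np.nan, np.nan, np.nan, 1.0],
    "np": [10.0, 20.0, 10.0, 20.0],
})
summary = missing_summary(toy)
reduced = drop_sparse_columns(toy, max_missing=0.5)
print(summary)
```

In a multi-wave survey, a column can be mostly missing simply because the question was not asked in that wave, so any threshold should be chosen with the codebook in hand.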




In [None]:

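# Placeholder: set this to the name of the target column before running the cell.
# If the name is not found in df.columns, the plot below is skipped silently.
target_column = 'your_target_column'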
if target_column in df.columns:
    plt.figure(figsize=(8, 4))
    sns.countplot(x=target_column, data=df)
    plt.title(f'Target Column Distribution: {target_column}')
    plt.show()
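Alongside the count plot, the class proportions can be printed as numbers, which helps judge how imbalanced the target is. This is a minimal sketch on an invented toy series; `class_balance` and the name `hospitalized` are hypothetical.

```python
import pandas as pd


def class_balance(series):
    """Return the proportion of each class, counting missing values as their own class."""
    return series.value_counts(normalize=True, dropna=False)


# Toy target values, invented for illustration.
toy_target = pd.Series([0, 0, 0, 1], name="hospitalized")
balance = class_balance(toy_target)
print(balance)
```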