# Final Project - Hospitalization Prediction for Elderly People

In [1]:
from src import extract_data as data, process_data
import pandas as pd


# Data Extraction

* *The Mexican Health and Aging Study* (**MHAS**) is a household survey dataset that collects information on the health, economic status, and quality of life of older adults.
The survey was conducted over five time periods, known as **waves**.
The data cover three study subjects: the respondent (r), the spouse (s), and the household (H). This study uses only the first two.
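
As a rough illustration of how the subject letter and the wave number could be used to pick columns, the sketch below assumes column names follow a `<subject><wave><variable>` pattern such as `r3agey`. That naming pattern, the toy column names, and the helper `columns_for` are assumptions made for this example and are not confirmed by this notebook.

```python
import re
import pandas as pd

# Toy frame with hypothetical column names; the real file has thousands of columns.
toy = pd.DataFrame(
    columns=["r1agey", "r3agey", "s3agey", "h3child", "r5hosp1y", "unhhidnp"]
)

def columns_for(frame, subjects=("r", "s"), waves=(3, 4, 5)):
    """Return column names that start with one of the subjects and waves."""
    subject_alt = "|".join(subjects)
    wave_alt = "|".join(str(w) for w in waves)
    # Subject letter, then wave digit, then a non-digit that starts the variable name.
    pattern = re.compile(rf"^(?:{subject_alt})(?:{wave_alt})\D", re.IGNORECASE)
    return [c for c in frame.columns if pattern.match(c)]

selected = columns_for(toy)
print(selected)
```

With the default arguments, only respondent and spouse columns from waves 3 to 5 are kept; household columns and identifiers are left out.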



To access the data for this project, you only need to run the code below. It downloads the `H_MHAS_c2.sas7bdat` file into the `dataset` folder:

In [None]:
# Run only once, or if you need to rebuild the original data
df = data.download_dataset()

In [None]:
print('We have',df.shape[0],'subjects')
print('We have',df.shape[1],'features')
print('Head', df.head())

If you have already downloaded the dataset, you only need to run the code below.



In [2]:
df = data.load_dataset()
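
The internals of `data.download_dataset()` and `data.load_dataset()` are not shown in this notebook. One common way to get the "download once, then load" behaviour is a small cache check, sketched below. The helper `load_or_build`, the pickle cache file, and `fake_download` are all hypothetical and stand in for whatever the `src` module actually does.

```python
from pathlib import Path
import tempfile

import pandas as pd

def load_or_build(cache_path, build):
    """Load a cached DataFrame if it exists; otherwise build it and cache it."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        return pd.read_pickle(cache_path)
    frame = build()
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    frame.to_pickle(cache_path)
    return frame

calls = []

def fake_download():
    # Stand-in for an expensive download; records each time it runs.
    calls.append(1)
    return pd.DataFrame({"a": [1, 2]})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "dataset" / "cache.pkl"
    first = load_or_build(path, fake_download)   # builds and writes the cache
    second = load_or_build(path, fake_download)  # reads the cache instead
```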




### Features
The dataset `H_MHAS_c2.sas7bdat` has 26839 rows and 5241 features.

All features are divided into the following sections.

- SECTION A: DEMOGRAPHICS, IDENTIFIERS, AND WEIGHTS 
- SECTION B: HEALTH 
- SECTION C: HEALTH CARE UTILIZATION AND INSURANCE 
- SECTION D: COGNITION  
- SECTION E: FINANCIAL AND HOUSING WEALTH 
- SECTION F: INCOME
- SECTION G: FAMILY STRUCTURE 
- SECTION H: EMPLOYMENT HISTORY 
- SECTION I: RETIREMENT 
- SECTION J: PENSION 
- SECTION K: PHYSICAL MEASURES
- SECTION L: ASSISTANCE AND CAREGIVING
- SECTION M: STRESS 
- SECTION O: END OF LIFE PLANNING
- SECTION Q: PSYCHOSOCIAL

# Categorical and Numerical Features 

In [3]:
categorical_vars = df.select_dtypes(include=['object']).columns
numerical_vars = df.select_dtypes(include=['float64']).columns

print('Categorical features ' + str(len(categorical_vars)))
print('Numerical features ' + str(len(numerical_vars)) )


Categorical features 7
Numerical features 5234
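
Note that `select_dtypes(include=['float64'])` only picks up 64-bit floats. The counts above add up to the total number of columns here, but in a dataset that also had integer or boolean columns those would be missed. A minimal sketch on a toy frame (the column names are made up) shows the difference between selecting `float64` and selecting every numeric dtype:

```python
import pandas as pd

toy = pd.DataFrame({
    "age": [70.0, 81.5],   # float64
    "visits": [2, 0],      # int64
    "id": ["a1", "b2"],    # text
})

float_only = toy.select_dtypes(include=["float64"]).columns.tolist()
all_numeric = toy.select_dtypes(include="number").columns.tolist()
non_numeric = toy.select_dtypes(exclude="number").columns.tolist()

print(float_only, all_numeric, non_numeric)
```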


# Saving the Categorical Features with Their Possible Values

In [None]:
# `df` is the DataFrame loaded above with data.load_dataset()

# Select only categorical columns
categorical_cols = df.select_dtypes(include=['object']).columns

# Open the file for writing
with open('features_with_values.txt', 'w') as f:
    for column in categorical_cols:
        # Get the unique values of the column
        unique_values = df[column].unique()
        # Limit the number of values shown for readability
        unique_values_preview = unique_values[:10]  # Show up to 10 values
        # Write the feature name and unique values to the file
        f.write(f"{column}: {list(unique_values_preview)}\n\n")

print("Categorical features with their unique values have been saved to 'features_with_values.txt'")

# Extract Data

* In this study, we focus on the Respondent and the Spouse.
* The Household variables are removed from the data.
* We analyze the last three waves, which are the most relevant to this study: they share common features and also contain new and valuable information.
* To ensure a consistent, normalized dataset, we merge the variables related to the Respondent and the Spouse.
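
The steps in the list above could look roughly like the sketch below. It is not the implementation of `process_data.unified_data`, which is not shown here. It assumes `r`/`s`/`h` prefixes followed by a wave digit, uses made-up column names, and stacks respondent and spouse rows under shared column names.

```python
import re
import pandas as pd

toy = pd.DataFrame({
    "r3agey": [70, 65],
    "s3agey": [68, 66],
    "r3hosp1y": [0, 1],
    "s3hosp1y": [1, 0],
    "h3child": [3, 2],
})

# Drop household-level columns (assumed to start with "h" plus a wave digit).
person_cols = toy.loc[:, ~toy.columns.str.match(r"^h\d")]

parts = []
for subject in ("r", "s"):
    cols = [c for c in person_cols.columns if re.match(rf"^{subject}\d", c)]
    part = (
        person_cols[cols]
        .rename(columns=lambda c: "w" + c[1:])  # r3agey -> w3agey
        .assign(subject=subject)                # remember where the row came from
    )
    parts.append(part)

# Respondent rows first, then spouse rows, under the same column names.
stacked = pd.concat(parts, ignore_index=True)
print(stacked)
```

Stacking doubles the number of rows but halves the number of person-level columns, which is one way to read "merging" respondent and spouse variables.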

In [None]:
normalized_df = process_data.unified_data(df)

In [None]:
categorical_vars_clean = normalized_df.select_dtypes(include=['object']).columns
numerical_vars_clean = normalized_df.select_dtypes(include=['float64']).columns

print('Categorical features ' + str(len(categorical_vars_clean)))
print('Numerical features ' + str(len(numerical_vars_clean)) )

Categorical features 0
Numerical features 563


In [None]:
# Save every column of the normalized dataset with a preview of its values
with open('features_normalized_values.txt', 'w') as f:
    for column in normalized_df.columns:
        # Get the unique values of the column
        unique_values = normalized_df[column].unique()
        # Limit the number of values shown for readability
        unique_values_preview = unique_values[:10]  # Show up to 10 values
        # Write the feature name and unique values to the file
        f.write(f"{column}: {list(unique_values_preview)}\n\n")

# Data Preprocessing

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In [None]:
# Positional split: treats the last column of df as the target and all others as features
X = df.iloc[:, :-1].values
y = df.iloc[:, -1].values
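
A positional split only works if the target happens to be the last column. A minimal alternative, sketched on a toy frame, selects the target by name instead. The column name `hosp1y` is a hypothetical placeholder, not a confirmed target for this project.

```python
import pandas as pd

toy = pd.DataFrame({
    "agey": [70, 65, 80],
    "hosp1y": [0, 1, 0],          # hypothetical target column
    "bmi": [24.1, 30.2, 27.5],
})

target = "hosp1y"
X_named = toy.drop(columns=[target]).to_numpy()  # every column except the target
y_named = toy[target].to_numpy()

print(X_named.shape, y_named)
```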


In [None]:
print(X)


In [None]:
print(y)

In [None]:
columns_list = list(df.columns)
print(columns_list)

num_columns = len(df.columns)
print(f"The dataset contains {num_columns} columns.")

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


# Basic Information
print("Basic Info:")
df.info()  # info() prints its report directly and returns None
print("\n")

print("Summary Statistics:")
print(df.describe(include='all'))
print("\n")

# Check for missing values
print("Missing Values:")
print(df.isnull().sum())
print("\n")

# Check for duplicate rows
print("Duplicate Rows:")
print(df.duplicated().sum())
print("\n")
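
With several thousand columns, the raw output of `df.isnull().sum()` is hard to read. A compact alternative is the share of missing values per column, sorted in descending order. The sketch below uses a toy frame, and the 0.5 cutoff is only an illustrative value.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, np.nan],
    "b": [1.0, 2.0, np.nan, 4.0],
    "c": [1.0, 2.0, 3.0, 4.0],
})

# Fraction of missing values per column, worst columns first.
missing_share = toy.isna().mean().sort_values(ascending=False)

threshold = 0.5  # illustrative cutoff, not a recommendation
mostly_missing = missing_share[missing_share > threshold].index.tolist()

print(missing_share)
print(mostly_missing)
```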




In [None]:

# Placeholder: replace with the name of the target column
target_column = 'your_target_column'
if target_column in df.columns:
    plt.figure(figsize=(8, 4))
    sns.countplot(x=target_column, data=df)
    plt.title(f'Target Column Distribution: {target_column}')
    plt.show()
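
Alongside the count plot, the class proportions give a quick numeric check for imbalance. The sketch below uses a made-up binary target rather than the project's data.

```python
import pandas as pd

# Toy binary target: three negatives, one positive.
toy_target = pd.Series([0, 0, 0, 1], name="target")

# normalize=True returns proportions instead of raw counts.
shares = toy_target.value_counts(normalize=True)
print(shares)
```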