# Project milestone 4
# Detection of housing-health relationship


The aim of this project is to estimate the relationship between housing quality and a person's health status. The project is an observational study based on a survey conducted by the Mexican National Institute of Statistics and Geography (INEGI) in 2017 ([National Household Survey 2017](https://en.www.inegi.org.mx/programas/enh/2017/#Microdata)). The health variables reported for each person will be combined into a single health variable, a score for that person's health status. We aim to use machine learning methods for classification and regression models to predict this health score from the housing variables. Matching will be used to control for possible confounding covariates. The motivation is to identify the most important parameters of housing quality so that we can propose the most cost-effective interventions for improving health. The original paper analyzes the influence of concrete floors on health, while here we investigate other parameters such as the building material and whether the household has a toilet.

# Step 1: Data preprocessing

In [1]:
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf

In this section, we load the translated headers (from Spanish to English) and the datasets. We verified that the number of entries corresponds to the [File Descriptor FD](https://en.www.inegi.org.mx/app/biblioteca/ficha.html?upc=702825101725) provided by INEGI. Note that we use the word *person* instead of *member* for *persona* because we consider it a more faithful translation.

In [2]:
# Load the translated headers: each file has a single column, squeezed to a Series.
# (The `squeeze=True` argument of read_csv was removed in pandas 2.0.)
vivienda_header_trans = pd.read_csv('./data/translated_house.txt', header=None).squeeze('columns')
hogar_header_trans = pd.read_csv('./data/translated_household.txt', header=None).squeeze('columns')
persona_header_trans = pd.read_csv('./data/translated_person.txt', header=None).squeeze('columns')

# Load the datasets with the translated headers
data_housing = pd.read_csv('./data/vivienda.csv', skiprows=1, names=vivienda_header_trans)
data_household = pd.read_csv('./data/hogar.csv', skiprows=1, names=hogar_header_trans)
data_person = pd.read_csv('./data/persona.csv', skiprows=1, names=persona_header_trans)

print('vivienda.csv shape: {}'.format(data_housing.shape))
print('hogar.csv shape: {}'.format(data_household.shape))
print('persona.csv shape: {}'.format(data_person.shape))

vivienda.csv shape: (56680, 110)
hogar.csv shape: (57519, 13)
persona.csv shape: (208140, 58)


The [File Descriptor FD](https://en.www.inegi.org.mx/app/biblioteca/ficha.html?upc=702825101725) tells us that each housing unit can contain more than one household and each household can contain more than one person. We use this information to merge `data_housing` into `data_household` first, and then merge the result into `data_person`.

In [3]:
# Merge housing into household on 'housing_identifier', then into person on both identifiers
data_household_all = data_household.merge(data_housing, on='housing_identifier')
data_person_all = data_person.merge(data_household_all, on=['housing_identifier', 'household_identifier'])

print('data_person_all shape: {}'.format(data_person_all.shape))
data_person_all.sample(10)

data_person_all shape: (208140, 178)


Unnamed: 0,housing_identifier,household_identifier,person_identifier,age,birthday,birth_month,sex,relationship,school_attendance,school_type,...,pay_tv_service_availability,availability_of_own_car,total_households_in_the_dwelling,geographic_location,basic_geostatistical_area,location_size,socioeconomic,sample_design_stratum,primary_sampling_unit,expansion_factor
172709,2760143158,1,4,15,12,3,1,3,1.0,1.0,...,2,2,1,270060000,000-0,4,2,281,6842,366
55616,919911009,1,4,9,18,4,1,3,1.0,1.0,...,2,2,1,90170000,000-0,1,2,80,2542,1327
76385,1261383028,1,6,19,8,12,2,5,2.0,,...,1,2,1,120620000,000-0,4,1,110,3258,582
35314,601368092,1,7,1,25,10,1,4,,,...,1,1,1,60090000,000-0,2,2,54,1663,112
28622,505256058,1,1,51,19,2,1,1,2.0,,...,1,2,1,50350000,000-0,1,4,49,1343,535
177921,2804991099,1,3,27,5,9,1,3,2.0,,...,1,1,1,280320000,000-0,1,4,294,7065,691
121698,1911059342,1,2,38,2,11,1,3,2.0,,...,1,1,1,190060000,000-0,3,3,190,4984,963
191407,3009657027,1,1,58,24,5,1,1,2.0,,...,2,2,1,301350000,000-0,3,2,319,7620,1676
118122,1904100099,1,3,19,7,11,1,3,1.0,1.0,...,2,1,1,190260000,000-0,1,4,196,4793,921
117587,1903285062,1,1,35,9,10,1,1,2.0,,...,2,1,1,190260000,000-0,1,3,195,4760,754
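
The one-to-many structure described above can also be checked at merge time. The following is a minimal sketch on toy data, not the notebook's actual cell: the `floor_material` column and all values are made up for illustration. It uses the `validate` argument of `DataFrame.merge`, which raises an error if the right-hand table has duplicate keys, and confirms that the person table keeps its row count.

```python
import pandas as pd

# Toy stand-ins for the three tables (values and 'floor_material' are hypothetical)
housing = pd.DataFrame({'housing_identifier': [10, 20],
                        'floor_material': [1, 3]})
household = pd.DataFrame({'housing_identifier': [10, 10, 20],
                          'household_identifier': [1, 2, 1]})
person = pd.DataFrame({'housing_identifier': [10, 10, 10, 20],
                       'household_identifier': [1, 1, 2, 1],
                       'person_identifier': [1, 2, 1, 1]})

# Many households map to one housing unit
household_all = household.merge(housing, on='housing_identifier',
                                how='left', validate='many_to_one')

# Many persons map to one household
person_all = person.merge(household_all,
                          on=['housing_identifier', 'household_identifier'],
                          how='left', validate='many_to_one')

# No person row was dropped or duplicated by the merges
assert len(person_all) == len(person)
print(person_all)
```

With `how='left'`, a person whose household is missing would be kept with empty housing columns instead of being silently dropped, which makes such cases easier to spot.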


# Step 2: Exploratory Data Analysis
## Health variables
We chose the health variables listed below. We drop other health-related variables because they were not filled in correctly and most of their answers were empty (e.g. `['walking_apparatus', 'intensity_of_nervousness', 'intensity_of_depression', 'pain_intensity', 'intensity_of_fatigue', 'tired_time']`). The information they carry is still largely captured by the chosen variables.

In [4]:
health_var = ['wear_glasses', 'difficulty_seeing', 'wear_a_hearing_aid','difficulty_hearing',
                'difficulty_hearing_without_noise','difficulty_hearing_with_noise', 'dificulty_to_walk',
                'use_a_walking_device', 'difficulty_walking_100_m', 'difficulty_walking_500_m',
                'difficulty_climbing_12_steps_', 'difficulty_remembering','frequency_of_nervousness',
                'medication_for_nervousness', 'frequency_of_depression', 'antidepressant_medications',
                'frequency_of_pain', 'fatigue_frequency']

## Cleaning and guidelines
Besides cleaning the data (e.g. replacing empty values), we relabel the variables according to the following criteria so that the coding is consistent with health status and better suited to regression.
* 0 - when the variable suggests a healthy person.
* 1 or greater - when the variable suggests a health issue (in increasing order of severity).
* nan - when there is no answer.


In [5]:
# Cleaning: treat blank, '&' and '9' entries as missing values
data_person_all = data_person_all.replace(' ', np.nan)
data_person_all = data_person_all.replace('&', np.nan)
data_person_all = data_person_all.replace('9', np.nan)
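
Note that the replacement above is applied to the whole table, so a literal `'9'` in any string-typed column is also turned into `nan`. If that is not intended, one possible variant is to restrict the replacement to the health columns. This is only a sketch on toy data: the `age` values and the `missing_codes` and `health_cols` names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy data: both columns are stored as strings (values are hypothetical)
toy = pd.DataFrame({
    'wear_glasses': ['1', '2', '9', ' '],
    'age': ['9', '35', '60', '12'],
})

missing_codes = [' ', '&', '9']   # entries treated as "no answer"
health_cols = ['wear_glasses']    # stand-in for the full health_var list

# Replace the missing codes only inside the health columns
cleaned = toy.copy()
cleaned[health_cols] = cleaned[health_cols].replace(missing_codes, np.nan)

print(cleaned)
```

Here the nine-year-old in the `age` column is left untouched, while the `'9'` and blank answers in `wear_glasses` become missing.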

We drop all rows that lack complete information for the health variables. This costs us around 14% of the data, but it avoids making any assumptions about the missing answers.

In [6]:
# Drop nan rows for the health_var (i.e. work only with rows with complete information)
data_person_all = data_person_all.dropna(subset=health_var)
data_person_all.shape

(179072, 178)
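
The share of rows lost follows directly from the two shapes reported above (208140 rows before, 179072 after). The sketch below first shows the same `dropna(subset=...)` behaviour on a toy table with made-up values, then redoes the arithmetic for the real row counts.

```python
import numpy as np
import pandas as pd

# Toy table: a row is kept only if all health columns are filled
toy = pd.DataFrame({
    'wear_glasses': [1, np.nan, 2, 1],
    'difficulty_seeing': [1, 2, np.nan, 3],
    'age': [30, 41, np.nan, 8],   # not a health column, so it does not affect the drop
})
health_cols = ['wear_glasses', 'difficulty_seeing']

complete = toy.dropna(subset=health_cols)
toy_lost = 1 - len(complete) / len(toy)

# Same computation with the row counts printed in this notebook
rows_before, rows_after = 208140, 179072
real_lost = 1 - rows_after / rows_before

print('toy: {:.0%} lost, survey: {:.1%} lost'.format(toy_lost, real_lost))
```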

In [7]:
# Transform from strings to numbers
data_person_all[health_var] = data_person_all[health_var].apply(pd.to_numeric)
data_person_all[health_var].dtypes

wear_glasses                        int64
difficulty_seeing                   int64
wear_a_hearing_aid                  int64
difficulty_hearing                  int64
difficulty_hearing_without_noise    int64
difficulty_hearing_with_noise       int64
dificulty_to_walk                   int64
use_a_walking_device                int64
difficulty_walking_100_m            int64
difficulty_walking_500_m            int64
difficulty_climbing_12_steps_       int64
difficulty_remembering              int64
frequency_of_nervousness            int64
medication_for_nervousness          int64
frequency_of_depression             int64
antidepressant_medications          int64
frequency_of_pain                   int64
fatigue_frequency                   int64
dtype: object
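
All columns come out as `int64` because the rows with missing answers were dropped before the conversion. A small sketch with made-up values illustrates why the order matters: a column that still contains `nan` is converted to `float64` instead.

```python
import numpy as np
import pandas as pd

# Toy string-coded answers with one missing value (values are hypothetical)
raw = pd.DataFrame({
    'wear_glasses': ['1', '2', np.nan, '2'],
    'difficulty_seeing': ['1', '4', '2', '3'],
})

# Converting first: the column holding NaN becomes float64
converted_early = raw.apply(pd.to_numeric)

# Dropping incomplete rows first: every column becomes int64
converted_late = raw.dropna().apply(pd.to_numeric)

print(converted_early.dtypes)
print(converted_late.dtypes)
```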

Some variables are graded in the opposite direction, with the lowest value reflecting a health issue. To flip them, we first verify that each column's maximum matches the value given in the [File Descriptor FD](https://en.www.inegi.org.mx/app/biblioteca/ficha.html?upc=702825101725). After shifting all columns so that they start at 0, we flip a column by multiplying it by -1 and adding the column's maximum.

In [8]:
# Variables whose grading is flipped so that 0 always means healthy
flipped_var = ['wear_glasses', 'wear_a_hearing_aid', 'use_a_walking_device', 'frequency_of_nervousness',
                'medication_for_nervousness', 'frequency_of_depression', 'antidepressant_medications'] 

# Verify that each column has the correct maximum possible value. 
data_person_all[flipped_var].max(axis=0)

# Subtract one from all the columns and flip the grading for the flipped variables
data_person_all[health_var] = data_person_all[health_var] - 1
data_person_all[flipped_var] = data_person_all[flipped_var]*-1 + data_person_all[flipped_var].max(axis=0)

display(data_person_all[health_var].max(axis=0))
display(data_person_all[health_var].min(axis=0))

wear_glasses                        1
difficulty_seeing                   3
wear_a_hearing_aid                  1
difficulty_hearing                  2
difficulty_hearing_without_noise    2
difficulty_hearing_with_noise       3
dificulty_to_walk                   3
use_a_walking_device                1
difficulty_walking_100_m            2
difficulty_walking_500_m            3
difficulty_climbing_12_steps_       3
difficulty_remembering              3
frequency_of_nervousness            4
medication_for_nervousness          1
frequency_of_depression             4
antidepressant_medications          1
frequency_of_pain                   3
fatigue_frequency                   3
dtype: int64

wear_glasses                        0
difficulty_seeing                   0
wear_a_hearing_aid                  0
difficulty_hearing                  0
difficulty_hearing_without_noise    0
difficulty_hearing_with_noise       0
dificulty_to_walk                   0
use_a_walking_device                0
difficulty_walking_100_m            0
difficulty_walking_500_m            0
difficulty_climbing_12_steps_       0
difficulty_remembering              0
frequency_of_nervousness            0
medication_for_nervousness          0
frequency_of_depression             0
antidepressant_medications          0
frequency_of_pain                   0
fatigue_frequency                   0
dtype: int64
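
As a worked illustration of the recoding, take a variable whose original codes run from 1 to k. Subtracting one maps them to 0..k-1, and flipping then maps a shifted value x to (k-1) - x, so the original code 1 ends up as k-1 and the original code k ends up as 0. The sketch below applies this to toy values; the numbers are made up, and it assumes, as the cell above does, that the observed maximum of each column equals the largest possible code.

```python
import pandas as pd

# Toy answers in the original 1-based coding (values are hypothetical)
toy = pd.DataFrame({
    'wear_glasses': [1, 2, 2, 1],              # codes 1..2, flipped
    'frequency_of_nervousness': [1, 5, 3, 2],  # codes 1..5, flipped
    'difficulty_seeing': [1, 4, 2, 1],         # codes 1..4, not flipped
})
to_flip = ['wear_glasses', 'frequency_of_nervousness']

# Shift every column so that it starts at 0
recoded = toy - 1

# Flip: new value = column maximum - shifted value
recoded[to_flip] = recoded[to_flip].max(axis=0) - recoded[to_flip]

print(recoded)
```

After the flip, the original code 1 of `frequency_of_nervousness` becomes 4 and the original code 5 becomes 0, while `difficulty_seeing` is only shifted.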

In [9]:
# See all the ranges
for var in health_var:
    print('{}\t{}'.format(var, data_person_all[var].unique()))

wear_glasses	[1 0]
difficulty_seeing	[0 2 1 3]
wear_a_hearing_aid	[0 1]
difficulty_hearing	[1 0 2]
difficulty_hearing_without_noise	[1 0 2]
difficulty_hearing_with_noise	[1 0 2 3]
dificulty_to_walk	[1 0 2 3]
use_a_walking_device	[1 0]
difficulty_walking_100_m	[0 1 2]
difficulty_walking_500_m	[0 2 3 1]
difficulty_climbing_12_steps_	[2 0 1 3]
difficulty_remembering	[0 2 1 3]
frequency_of_nervousness	[1 4 3 0 2]
medication_for_nervousness	[0 1]
frequency_of_depression	[3 2 0 1 4]
antidepressant_medications	[0 1]
frequency_of_pain	[1 0 3 2]
fatigue_frequency	[1 0 2 3]
