# Project milestone 4
# Detection of housing-health relationship


The aim of this project is to estimate the relationship between housing quality and a person's health status. The project is an observational study based on a survey conducted by the Mexican National Institute of Statistics and Geography (INEGI). The health variables reported for each person will be combined into a single variable, a score for that person's health status. We aim to use machine learning methods for classification, and regression models to predict this health score from the housing variables. Matching will be used to control for possible confounding covariates. The motivation is to identify the most important housing-quality parameters so that we can propose the most cost-effective interventions for improving health. The original paper analyzes the influence of concrete floors on health, whereas here we investigate other parameters, such as the building material and whether the household has a toilet.

# Step 1: Import data

In [1]:
import pandas as pd
import numpy as np

### Import data ###
#data_household = pd.read_csv('data_translated/household.csv')
data_house = pd.read_csv('data_translated/house.csv', low_memory=False)
data_person = pd.read_csv('data_translated/person.csv', low_memory=False)

In [2]:
# attach the house-level variables to every person living in that house
data_person_all = data_person.merge(data_house, on='House_identifier')
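The merge above silently drops any person whose house is missing from the house table, and it would duplicate persons if a house identifier appeared twice. A minimal sketch of how this could be checked, using toy tables rather than the survey files: `Floor_material` and all values are made up for illustration, and the left join is used here only to expose unmatched rows, not because the notebook uses it.

```python
import pandas as pd

# Toy stand-ins for the real tables (hypothetical values)
house = pd.DataFrame({"House_identifier": [1, 2, 3],
                      "Floor_material": [1, 3, 2]})
person = pd.DataFrame({"House_identifier": [1, 1, 2, 4],
                       "Age": [79, 34, 42, 30]})

# validate="many_to_one" raises MergeError if a house occurs more than once
# in `house`; indicator=True adds a `_merge` column with each row's origin
merged = person.merge(house, on="House_identifier", how="left",
                      validate="many_to_one", indicator=True)

# persons whose house was not found in the house table
unmatched = merged[merged["_merge"] == "left_only"]
print(len(merged), len(unmatched))
```

If `unmatched` is empty on the real data, the default inner join loses no persons.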

# Step 2: Exploratory Data Analysis

In [3]:
# our selected health variables, picked by position: columns 35-56 plus column 26
health_var = list(data_person_all.columns[35:57]) + [data_person_all.columns[26]]
data_person_all[health_var].describe()

,Difficulty_seeing,Wear_a_hearing_aid,Difficulty_hearing,Difficulty_hearing_without_noise,Difficulty_hearing_with_noise,Dificulty_to_walk,Use_a_walking_device,Walking_apparatus,Difficulty_walking_100_m,Difficulty_walking_500_m,...,Medication_for_nervousness,Intensity_of_nervousness,Frequency_of_depression,Antidepressant_medications,Intensity_of_depression,Frequency_of_pain,Pain_intensity,Fatigue_frequency,Tired_time,Limiting_physical_or_mental_activity
count,208140,208140,208140,208140,208140,208140,208140,208140.0,208140,208140,...,208140,208140.0,208140,208140,208140.0,208140,208140.0,208140,208140.0,208140.0
unique,6,3,6,6,6,6,3,9.0,6,6,...,4,5.0,7,4,5.0,6,5.0,6,5.0,2.0
top,1,2,1,1,1,1,2,,1,1,...,2,,5,2,,1,,1,,
freq,163575,197136,186322,190264,177606,175302,191991,201868.0,184590,177483,...,178203,117286.0,126698,180272,152071.0,132028,157153.0,129517,154635.0,205250.0


In [4]:
# replace blank-string (' ') entries with NaN
# open question: should the code 9 also be treated as missing?
data_person_all = data_person_all.replace(' ', np.nan)
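The comment above leaves open how to handle the code 9. One possible approach, sketched here on toy data, is to map such codes to NaN only in the columns where the codebook says they mean "not specified". The assumption that 9 is a missing-value code in these two columns is ours and has not been checked against the survey documentation; in other columns 9 may be a legitimate category.

```python
import numpy as np
import pandas as pd

# Toy data: the column names exist in the survey, the values are made up
df = pd.DataFrame({
    "Difficulty_seeing": ["1", "2", "9", " "],
    "Pain_intensity": ["3", " ", "1", "9"],
})

# ASSUMPTION: 9 encodes "not specified" in these columns (check the codebook)
missing_codes = {"Difficulty_seeing": ["9"], "Pain_intensity": ["9"]}

# blanks first, as in the cell above
df = df.replace(" ", np.nan)

# then blank out the per-column missing codes
for col, codes in missing_codes.items():
    df[col] = df[col].mask(df[col].isin(codes))

# convert the remaining string codes to numbers
df = df.apply(pd.to_numeric, errors="coerce")
print(df.isna().sum())
```

Keeping the codes in a per-column dictionary avoids a global `replace(9, np.nan)`, which would also erase valid answers.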

In [7]:
# drop columns in which more than 30% of the values are NaN
thr = int(len(data_person_all) * 0.3)
# number of NaN values per column
null_counts = data_person_all.isnull().sum()
data_person_all = data_person_all.drop(columns=null_counts[null_counts > thr].index)

,House_identifier,Household_identifier,Identifier_of_the_person,Age,Birthday,Birth_month,Sex,Relationship,School_attendance,Level_of_instruction,...,Pay_TV_service_availability,Availability_of_own_car,Total_households_in_the_dwelling,Geographic_location,Cell_phone_availability.1,Internet_availability.1,Pay_TV_service_availability.1,Availability_of_own_car.1,Total_households_in_the_dwelling.1,Geographic_location.1
0,100008010,1,1,79,16,12,1,1,2,3,...,2,1,1,10010000,000-0,1,4,8,1,221
1,100008010,1,2,34,30,7,1,4,1,4,...,2,1,1,10010000,000-0,1,4,8,1,221
2,100008010,1,3,31,8,9,2,8,2,3,...,2,1,1,10010000,000-0,1,4,8,1,221
3,100008010,1,4,3,3,9,2,8,2,0,...,2,1,1,10010000,000-0,1,4,8,1,221
4,100008034,1,1,42,20,5,1,1,2,2,...,1,1,1,10010000,000-0,1,4,8,1,221
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
208135,3260737210,1,4,30,12,12,2,3,2,0,...,1,1,1,320550000,000-0,4,1,337,8103,236
208136,3260737210,1,5,19,21,11,2,5,2,1,...,1,1,1,320550000,000-0,4,1,337,8103,236
208137,3260737210,1,6,2,2,8,1,4,,,...,1,1,1,320550000,000-0,4,1,337,8103,236
208138,3260737211,1,1,67,8,9,2,1,2,3,...,1,1,1,320550000,000-0,4,1,337,8103,236
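The 30% rule used in the drop step can be sanity-checked on a small frame. The sketch below uses made-up data and compares counting NaN values per column with pandas' built-in `dropna(axis=1, thresh=...)`, where `thresh` is the minimum number of non-missing values a column needs in order to be kept.

```python
import numpy as np
import pandas as pd

# Toy frame with 0, 3 and 4 missing values out of 10 rows
df = pd.DataFrame({
    "a": range(10),
    "b": [np.nan] * 3 + list(range(7)),
    "c": [np.nan] * 4 + list(range(6)),
})
thr = int(len(df) * 0.3)

# variant 1: count NaN per column and drop those above the threshold
null_counts = df.isna().sum()
kept_by_counts = df.drop(columns=null_counts[null_counts > thr].index)

# variant 2: keep columns with at least len(df) - thr non-missing values
kept_by_thresh = df.dropna(axis=1, thresh=len(df) - thr)

print(list(kept_by_counts.columns), list(kept_by_thresh.columns))
```

Both variants keep a column sitting exactly at the threshold, because the comparison is strict.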


In [10]:
# keep only the health variables that survived the column drop
health_var = [col for col in health_var if col in data_person_all.columns]
len(health_var)

23

# Step X: Matching