# Project milestone 4
# Detection of housing-health relationship


The aim of this project is to estimate the relationship between housing quality and a person's health status. The project is an observational study based on a survey conducted by the Mexican National Institute of Statistics and Geography. The health variables reported for each person will be combined into a single health score. We aim to use machine learning methods for classification and regression models to predict this health score from the housing variables. Matching will be used to control for possible confounding covariates. The motivation is to identify the most important housing-quality parameters so that we can propose the most cost-effective interventions for improving health. The original paper analyses the influence of concrete floors on health, while here we investigate other parameters, such as the building material and whether the household has a toilet.
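
How the single health score will be built is not fixed yet. One possible construction, shown here only as a rough sketch, is to standardise each health indicator and average the results per person. The toy values below are made up, and the sketch assumes that every indicator is numeric and coded so that a higher value means worse health, which has not been checked against the survey codebook.

```python
import pandas as pd

# Toy ordinal indicators (made-up values; assumed: higher = more difficulty)
toy = pd.DataFrame({'Difficulty_seeing': [1, 2, 4, 1],
                    'Pain_intensity': [1, 3, 3, 1],
                    'Fatigue_frequency': [2, 2, 5, 1]})

# Standardise each indicator so that no single scale dominates the score
z = (toy - toy.mean()) / toy.std(ddof=0)

# One score per person: the mean of the standardised indicators
health_score = z.mean(axis=1)
print(health_score.round(2))
```

An unweighted average treats all indicators as equally important; a weighted or model-based score would be an equally valid alternative.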

# Step 1: Import data

In [157]:
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf

### Import data ###
#data_household = pd.read_csv('data_translated/household.csv')
data_house = pd.read_csv('data_translated/house.csv',low_memory=False)
data_person = pd.read_csv('data_translated/person.csv',low_memory=False)

In [158]:
# Attach the house-level variables to each person via the shared house identifier
data_person_all = data_person.merge(data_house, on='House_identifier')
data_person_all.columns

Index(['House_identifier', 'Household_identifier', 'Identifier_of_the_person',
       'Age', 'Birthday', 'Birth_month', 'Sex', 'Relationship',
       'School_attendance', 'School_type',
       ...
       'Pay_TV_service_availability', 'Availability_of_own_car',
       'Total_households_in_the_dwelling', 'Geographic_location',
       'Basic_geostatistical_area', 'Location_size', 'Socioeconomic',
       'Sample_design_stratum', 'Primary_sampling_unit', 'Expansion_factor'],
      dtype='object', length=167)
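
Each person should belong to exactly one house, so the merge is expected to be many-to-one and to keep every person. A minimal sketch of how this could be checked, using toy frames with made-up values (only `House_identifier` and `Age` are column names from the data above; `Floor_material` is hypothetical):

```python
import pandas as pd

# Toy stand-ins for data_person and data_house (values are made up)
person = pd.DataFrame({'House_identifier': [1, 1, 2, 3],
                       'Age': [34, 5, 61, 27]})
house = pd.DataFrame({'House_identifier': [1, 2],
                      'Floor_material': ['concrete', 'dirt']})

# validate raises a MergeError if a house appears more than once on the right;
# indicator adds a _merge column recording where each row came from
merged = person.merge(house, on='House_identifier', how='left',
                      validate='many_to_one', indicator=True)

# People whose house is missing from the house table
unmatched = merged[merged['_merge'] == 'left_only']
print(len(person), len(merged), len(unmatched))
```

With the default inner join, people without a matching house are dropped silently, so a left join with `indicator=True` makes such cases visible.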

# Step 2: Exploratory Data Analysis

In [159]:
# Selected health variables, chosen by column position in the merged frame
health_var = list(data_person_all.columns[35:57]) + [data_person_all.columns[26]]
data_person_all[health_var].describe()

Unnamed: 0,Difficulty_seeing,Wear_a_hearing_aid,Difficulty_hearing,Difficulty_hearing_without_noise,Difficulty_hearing_with_noise,Dificulty_to_walk,Use_a_walking_device,Walking_apparatus,Difficulty_walking_100_m,Difficulty_walking_500_m,...,Medication_for_nervousness,Intensity_of_nervousness,Frequency_of_depression,Antidepressant_medications,Intensity_of_depression,Frequency_of_pain,Pain_intensity,Fatigue_frequency,Tired_time,Limiting_physical_or_mental_activity
count,208140,208140,208140,208140,208140,208140,208140,208140.0,208140,208140,...,208140,208140.0,208140,208140,208140.0,208140,208140.0,208140,208140.0,208140.0
unique,6,3,6,6,6,6,3,9.0,6,6,...,4,5.0,7,4,5.0,6,5.0,6,5.0,2.0
top,1,2,1,1,1,1,2,,1,1,...,2,,5,2,,1,,1,,
freq,163575,197136,186322,190264,177606,175302,191991,201868.0,184590,177483,...,178203,117286.0,126698,180272,152071.0,132028,157153.0,129517,154635.0,205250.0


In [160]:
# Replace blank (' ') entries with NaN. TODO: decide how to treat the code 9.
data_person_all = data_person_all.replace(' ', np.nan)
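
The exact-match replacement only catches cells that contain a single space. A sketch of a more tolerant variant, on a toy frame with made-up values, uses a regular expression for empty or whitespace-only cells. The handling of the code `'9'` is purely hypothetical here: whether it marks an unknown answer has to be confirmed in the survey documentation.

```python
import numpy as np
import pandas as pd

# Toy frame: survey codes stored as strings, with blanks of varying width
df = pd.DataFrame({'Difficulty_seeing': ['1', ' ', '2', '  '],
                   'Pain_intensity': ['', '9', '3', '1']})

# Match empty or whitespace-only cells of any width, not just a single space
cleaned = df.replace(r'^\s*$', np.nan, regex=True)

# Hypothetical: if the codebook confirmed that '9' means "not specified",
# it could be mapped to NaN as well
UNKNOWN_CODES = ['9']  # assumption, not taken from the survey documentation
cleaned_strict = cleaned.replace(UNKNOWN_CODES, np.nan)

print(cleaned.isna().sum().to_dict())
print(cleaned_strict.isna().sum().to_dict())
```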

In [161]:
# drop columns with more than 30% of NaN values
thr = int(len(data_person_all) * 0.3)
null_counts = data_person_all.isnull().sum()
data_person_all = data_person_all.drop(columns=null_counts[null_counts > thr].index)

In [162]:
# update health variable
health_var = [i for i in health_var if i in list(data_person_all.columns)]
len(health_var)

17

# Step 3: Regression

In [163]:
# TODO: replace with a better imputation strategy; filling with 0 is a temporary placeholder
data_person_all.fillna(0,inplace=True)
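
Filling with 0 adds an extra level to every categorical term in the regression. One simple alternative, sketched here on a toy frame with made-up values, is to fill each column with its most frequent observed value. This is only one of several possible imputation choices, not the strategy the project has settled on.

```python
import numpy as np
import pandas as pd

# Toy categorical columns with missing entries (values are made up)
df = pd.DataFrame({'Sex': ['1', '2', np.nan, '2'],
                   'Literacy': ['1', np.nan, '1', '3']})

# Most frequent value per column (first row of the mode table)
modes = df.mode().iloc[0]

# fillna with a Series fills each column with the value stored under its name
imputed = df.fillna(modes)
print(imputed)
```

Mode imputation keeps the set of categories unchanged, but it inflates the most common level, so it is still worth comparing against simply dropping incomplete rows.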

In [164]:
# Build the right-hand side of the formula: Age plus every person-level
# column from 'Sex' through 'Worked_last_week', each as a categorical term.
# House variables are not included yet.
start = data_person_all.columns.get_loc('Sex')
end = data_person_all.columns.get_loc('Worked_last_week')
person_vars = data_person_all.columns[start:end + 1]
model = '~C(Age)' + ''.join('+C(' + item + ')' for item in person_vars)


In [165]:
# Use the first health variable as the response, cast to boolean
response = health_var[0]
data_person_all[response] = data_person_all[response].astype(bool)
data_person_all.columns

Index(['House_identifier', 'Household_identifier', 'Identifier_of_the_person',
       'Age', 'Birthday', 'Birth_month', 'Sex', 'Relationship',
       'School_attendance', 'Grade_level_of_instruction',
       ...
       'Pay_TV_service_availability', 'Availability_of_own_car',
       'Total_households_in_the_dwelling', 'Geographic_location',
       'Basic_geostatistical_area', 'Location_size', 'Socioeconomic',
       'Sample_design_stratum', 'Primary_sampling_unit', 'Expansion_factor'],
      dtype='object', length=109)

In [166]:
formula=response+model
formula

'Difficulty_seeing~C(Age)+C(Sex)+C(Relationship)+C(School_attendance)+C(Grade_level_of_instruction)+C(Level_of_instruction)+C(Home)+C(Literacy)+C(Marital_status)+C(Worked_last_week)'

In [167]:
# Fit the model by ordinary least squares. OLS has a closed-form solution,
# so the random seed below does not affect the estimated coefficients.
np.random.seed(1950)
mod = smf.ols(formula, data=data_person_all, missing='raise')
res = mod.fit()

ValueError: endog has evaluated to an array with multiple columns that has shape (208140, 2). This occurs when the variable converted to endog is non-numeric (e.g., bool or str).
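
As the error message indicates, the boolean response is expanded into two columns when the formula is evaluated. A possible way around this, sketched on synthetic data (the values are random; only the column names are borrowed from the data above), is to encode the response as a 0/1 integer. The fit is then a linear probability model, and a logistic regression is shown as an alternative for a binary outcome.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for data_person_all (random values, for illustration only)
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({'Sex': rng.choice(['1', '2'], size=n),
                    'Literacy': rng.choice(['1', '2', '3'], size=n),
                    'Difficulty_seeing': rng.choice([True, False], size=n)})

# A 0/1 integer response stays a single column in the design matrices
toy['Difficulty_seeing'] = toy['Difficulty_seeing'].astype(int)

formula = 'Difficulty_seeing ~ C(Sex) + C(Literacy)'
lpm = smf.ols(formula, data=toy).fit()             # linear probability model
logit = smf.logit(formula, data=toy).fit(disp=0)   # logistic alternative

print(lpm.params.index.tolist())
```

Which of the two models is more appropriate depends on how the final health score is defined: a continuous score suits OLS, while a binary indicator is usually better served by the logistic model.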

In [None]:
res.summary()