# First Home Recommender
## AHS Household Transformations

This workbook imports the combined household dataset from the American Housing Survey and readies it for machine learning through a series of transformations.

Steps:
01 - Subset the household dataset for first-time homeowners only.
02 - Remove weight and flag variables from the household dataset.
03 - Remove all variables related to house "experience".
04 - Remove variables whose portion of missing values is above the threshold level.
05 - Impute the missing values for continuous, categorical, and binary variables.
06 - Create a dummy variable dataset from categorical variables.
07 - Bin the housing satisfaction target variable.
08 - Scale dollar value variables from 0 to 1.0.
09 - Log transform income variables.
10 - Combine datasets together into regression and classification model-ready datasets.
11 - Update the AWS database.

NOTE: These steps must be run in numerical order for the final datasets to be created correctly.

In [306]:
import os
import timeit
import pandas as pd
import numpy as np
from functools import reduce
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler  # used in step 08

Instantiate Relevant Variables

In [307]:
threshhold = 0.20
path = os.path.join(os.getcwd(), 'data', 'working')

In [308]:
df = pd.read_csv(os.path.join(path, 'AHS Household Combined.csv'))
varcat = pd.read_csv(os.path.join(os.getcwd(), 'data', 'concordance', 'varclass.csv'))
varrelevant = pd.read_csv(os.path.join(os.getcwd(), 'data', 'concordance', 'varrelevant.csv'))

01 - Subset the dataset to only first-time home buyers

In [459]:
df_fh = df[df['FIRSTHOME']==1].copy()

02 - Remove weight and flag variables from the dataset

In [460]:
vars_less_wgt = [i for i in list(df_fh.columns) if 'WGT' not in i]
vars_less_wgt = [i for i in vars_less_wgt if 'WEIGHT' not in i]
vars_less_wgt_flags = [i for i in vars_less_wgt if not i.startswith('J')]
df_fh2 = df_fh[vars_less_wgt_flags].copy()
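
To illustrate the name-based filtering in step 02, the sketch below applies the same three rules to a toy frame. The `drop_weights_and_flags` helper is hypothetical, and the toy column names other than `CONTROL` are made up for the example.

```python
import pandas as pd

def drop_weights_and_flags(frame):
    """Drop columns whose name contains WGT or WEIGHT, or starts with J."""
    keep = [c for c in frame.columns
            if 'WGT' not in c and 'WEIGHT' not in c and not c.startswith('J')]
    return frame[keep].copy()

# Toy data: two weight columns, one flag column, two substantive columns.
toy = pd.DataFrame({'CONTROL': [1, 2], 'SP1WEIGHT': [0.5, 0.7],
                    'REPWGT1': [10, 12], 'JBEDROOMS': [0, 1], 'BEDROOMS': [3, 2]})
filtered = drop_weights_and_flags(toy)
print(list(filtered.columns))
```

Because the flag rule is a plain prefix match, it would also drop any substantive variable whose name happens to start with `J`, so it may be worth reviewing the list of dropped columns.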

03 - Remove all variables related to house experience

The project's goal is to predict housing satisfaction using features relevant to a housing search. Therefore, any field that captures information on actual housing experience after the point of purchase should not be included in our dataset.

In [461]:
relevant_vars = list(varrelevant[varrelevant.iloc[:,1]].iloc[:,0])
df_fh3 = df_fh2[['CONTROL','YEAR'] + relevant_vars].copy()
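
The selection above treats the first column of the concordance file as the variable name and the second as a boolean relevance flag. A minimal sketch of that pattern, assuming this layout and using made-up column names (`Variable`, `Relevant`) and made-up rows:

```python
import pandas as pd

# Hypothetical concordance table: one row per variable, flagged True if it
# describes something a buyer could know during the housing search.
concordance = pd.DataFrame({'Variable': ['BEDROOMS', 'LEAKI', 'YRBUILT'],
                            'Relevant': [True, False, True]})

# Boolean-mask the rows, then pull out the variable names.
relevant = list(concordance.loc[concordance['Relevant'], 'Variable'])
print(relevant)
```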

04 - Remove variables whose portion of missing values is above the threshold

In [462]:
miss_percent = df_fh3.isin([-9]).sum(axis=0) / df_fh3.count(axis=0)
miss_percent_lt_thresh = miss_percent[miss_percent.iloc[:] < threshhold]
df_fh4 = df_fh3[miss_percent_lt_thresh.index].copy()
df_fh4_cols = list(df_fh4.columns)
df_fh4.replace(-9, np.nan, inplace=True)
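
The missing-value filter can be sketched on toy data as follows. The frame and its column names are invented; `mean` over the boolean mask is used here as shorthand and matches the sum-over-count form whenever a column has no `NaN` values yet.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'A': [1, 2, 3, 4, 5, 6, 7, 8, 9, -9],
                    'B': [-9, -9, -9, -9, 1, 2, 3, 4, 5, 6],
                    'C': list(range(10))})
threshold = 0.20  # same cutoff value as the notebook's `threshhold`

# Share of rows coded -9 in each column.
miss_share = toy.isin([-9]).mean(axis=0)

# Keep columns under the cutoff, then recode the remaining -9 values as NaN.
kept = toy.loc[:, miss_share < threshold].replace(-9, np.nan)
print(miss_share.to_dict())
print(list(kept.columns))
```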

05 - Impute missing values for all estimators

Divide the list of remaining variables into 4 groups: 1) target, 2) continuous, 3) categorical, and 4) binary.

In [463]:
df_varcat = pd.merge(pd.DataFrame(miss_percent_lt_thresh), varcat, how='left', 
                     left_on=pd.DataFrame(miss_percent_lt_thresh).index, right_on=['Variable'])
target_vars = list(df_varcat['Variable'][df_varcat['Grouping'] == 'Target'])
cont_vars = list(df_varcat['Variable'][df_varcat['Grouping'] == 'Continuous'])
cat_vars = list(df_varcat['Variable'][df_varcat['Grouping'] == 'Categorical'])
binary_vars = list(df_varcat['Variable'][df_varcat['Grouping'] == 'Binary'])

In [464]:
exp_vars = ['ELECAMT','GASAMT','OILAMT','OTHERAMT','TRASHAMT','WATERAMT','UTILAMT']
scale_vars = ['EXPSUM','HINCP','FINCP']
inc_vars = ['HINCP','FINCP']

Separate the dataset into target variables and predictor (independent) variables

In [481]:
target = df_fh4[['CONTROL','YEAR','RATINGHS']].copy()
estimators = df_fh4.drop(['RATINGHS','RATINGNH'], axis=1).copy()

Impute missing data values for each type of variable

Continuous Variables

In [466]:
imputer_cont = SimpleImputer(missing_values=np.nan, strategy='median')
imputer_cont.fit(estimators[['CONTROL','YEAR'] + cont_vars])
imputed_cont = imputer_cont.transform(estimators[['CONTROL','YEAR'] + cont_vars])
estimators_cont = pd.DataFrame(imputed_cont, columns=['CONTROL','YEAR'] + cont_vars)

Categorical Variables

In [467]:
imputer_cat = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
imputer_cat.fit(estimators[['CONTROL','YEAR'] + cat_vars])
imputed_cat = imputer_cat.transform(estimators[['CONTROL','YEAR'] + cat_vars])
estimators_cat = pd.DataFrame(imputed_cat, columns=['CONTROL','YEAR'] + cat_vars)

Binary Variables

In [468]:
estimators_binary = estimators[['CONTROL','YEAR'] + binary_vars].copy()
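
As a small illustration of the two imputation strategies used in this step, the sketch below fills a continuous column with its median and a categorical column with its most frequent value. The toy column names and values are invented for the example.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

toy = pd.DataFrame({'INCOME': [10.0, np.nan, 30.0, 100.0],
                    'TENURE': [1.0, 2.0, 2.0, np.nan]})

median_imputer = SimpleImputer(missing_values=np.nan, strategy='median')
mode_imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')

# fit_transform returns a 2-D array; ravel() flattens the single column.
toy_imputed = pd.DataFrame({
    'INCOME': median_imputer.fit_transform(toy[['INCOME']]).ravel(),
    'TENURE': mode_imputer.fit_transform(toy[['TENURE']]).ravel(),
}, index=toy.index)
print(toy_imputed)
```

Passing `index=toy.index` keeps the original row labels, which matters if the imputed frame is later aligned by index rather than merged on key columns.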

06 - Create dummies from categorical variables

In [469]:
estimators_cat_dum = pd.get_dummies(estimators_cat, columns=cat_vars)
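
A minimal sketch of what `pd.get_dummies` does with the `columns` argument, using an invented categorical column: each listed column is replaced by one indicator column per observed value, and the other columns pass through unchanged.

```python
import pandas as pd

toy = pd.DataFrame({'CONTROL': [1, 2, 3], 'BLD': [2, 3, 2]})

# One indicator column per distinct BLD value, named <column>_<value>.
dummies = pd.get_dummies(toy, columns=['BLD'])
print(list(dummies.columns))
```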

07 - Bin Housing Satisfaction Variables 

In [487]:
target['RATINGHS_BIN'] = pd.cut(target['RATINGHS'], bins=[0,7,8,9,10], 
                                labels=['not satisfied','satisfied','very satisfied','extremely satisfied'])
target_bin = pd.DataFrame(target[['CONTROL','YEAR','RATINGHS_BIN']])
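
With its default settings `pd.cut` builds right-inclusive intervals, so the bin edges above map ratings 1 to 7 to the first label and 8, 9, and 10 to one label each. A short sketch with made-up ratings:

```python
import pandas as pd

ratings = pd.Series([1, 7, 8, 9, 10])
labels = ['not satisfied', 'satisfied', 'very satisfied', 'extremely satisfied']

# Intervals are (0, 7], (7, 8], (8, 9], (9, 10].
binned = pd.cut(ratings, bins=[0, 7, 8, 9, 10], labels=labels)
print(binned.tolist())
```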

08 - Scale the dollar value variables from 0 to 1.0

In [488]:
estimators_cont['EXPSUM'] = estimators_cont[exp_vars].sum(axis=1)
# The exp_vars columns stay in estimators_cont here; they are dropped when estimators_cont_noninc is built.

x_array = estimators_cont[scale_vars]
min_max_scaler = MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x_array)
scaled_cont = pd.DataFrame(x_scaled, columns=scale_vars)
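
For reference, min-max scaling maps each column's smallest value to 0 and its largest to 1, with everything else placed linearly in between. A minimal sketch on invented dollar amounts:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

toy = pd.DataFrame({'EXPSUM': [100.0, 300.0, 500.0],
                    'HINCP': [20000.0, 50000.0, 80000.0]})

# Each column is scaled independently: (x - min) / (max - min).
scaler = MinMaxScaler()
scaled = pd.DataFrame(scaler.fit_transform(toy), columns=toy.columns, index=toy.index)
print(scaled)
```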

09 - Log transform both income variables and merge back into the dataframe

In [489]:
# Take the natural log of each (already scaled) income column.
for var in inc_vars:
    scaled_cont['LN_' + var] = np.log(scaled_cont[var])
scaled_cont.drop(inc_vars, axis=1, inplace=True)
estimators_cont_inc = pd.concat([estimators_cont[['CONTROL','YEAR']],pd.DataFrame(scaled_cont)], axis=1)
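
One thing to be aware of: because step 08 rescales income to the range 0 to 1, the smallest income becomes exactly 0, and its natural log is negative infinity, which `dropna` does not remove. The sketch below shows this on made-up values and includes `np.log1p` as one possible alternative; that alternative is an assumption for illustration, not the notebook's method.

```python
import numpy as np
import pandas as pd

scaled_income = pd.Series([0.0, 0.5, 1.0])

# Plain log: the scaled minimum (0.0) maps to -inf.
with np.errstate(divide='ignore'):
    ln_income = np.log(scaled_income)

# Alternative (assumption): log(1 + x) stays finite on [0, 1].
ln1p_income = np.log1p(scaled_income)

print(ln_income.tolist())
print(len(ln_income.dropna()))  # -inf is not treated as missing
```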

In [490]:
estimators_cont_noninc = estimators_cont.drop(['ELECAMT','GASAMT','OILAMT','OTHERAMT','TRASHAMT','WATERAMT',
                                               'UTILAMT','EXPSUM','HINCP','FINCP'], axis=1)

10 - Merge datasets with different variable types back into one dataset

In [491]:
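# Use only the numeric rating as the regression target (RATINGHS_BIN was added to target in step 07).
dfs_reg = [target[['CONTROL','YEAR','RATINGHS']], estimators_cont_inc, estimators_cont_noninc, estimators_cat_dum, estimators_binary]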
dfs_class = [target_bin, estimators_cont_inc, estimators_cont_noninc, estimators_cat, estimators_binary]
df_final_reg = reduce(lambda left, right: pd.merge(left, right, how='inner', on=['CONTROL','YEAR']), dfs_reg).dropna(how='any')
df_final_class = reduce(lambda left, right: pd.merge(left, right, how='inner', on=['CONTROL','YEAR']), dfs_class).dropna(how='any')
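
The `reduce` call folds the list of frames into one by merging them pairwise on the shared keys. A minimal sketch with three invented frames; only rows whose `CONTROL` and `YEAR` appear in every frame survive the inner joins.

```python
from functools import reduce
import pandas as pd

keys = ['CONTROL', 'YEAR']
a = pd.DataFrame({'CONTROL': [1, 2, 3], 'YEAR': [2017, 2017, 2017], 'RATINGHS': [8, 9, 10]})
b = pd.DataFrame({'CONTROL': [1, 2], 'YEAR': [2017, 2017], 'LN_HINCP': [-0.7, 0.0]})
c = pd.DataFrame({'CONTROL': [2, 1], 'YEAR': [2017, 2017], 'GARAGE': [1, 0]})

# Equivalent to merge(merge(a, b), c), each step an inner join on the keys.
merged = reduce(lambda left, right: pd.merge(left, right, how='inner', on=keys), [a, b, c])
print(merged)
```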

Create CSV Files

In [492]:
df_final_reg.to_csv(os.path.join(path, 'AHS Household Reg.csv'))
df_final_class.to_csv(os.path.join(path, 'AHS Household Class.csv'))

11 - Update Database

Send intermediate tables to the database

In [499]:
from sqlalchemy import create_engine
engine = create_engine('postgresql://postgres:Admin123@project.cgxhdwn5zb5t.us-east-1.rds.amazonaws.com:5432/postgres')
df_fh4.to_sql('ahs_household_step_4', engine, if_exists='replace')

In [None]:
from sqlalchemy import create_engine

df_tables = {'ahs_household_step_1':df_fh, 
             'ahs_household_step_4':df_fh4, 
             'ahs_household_class':df_final_class,
             'ahs_household_reg':df_final_reg}
engine = create_engine('postgresql://postgres:Admin123@project.cgxhdwn5zb5t.us-east-1.rds.amazonaws.com:5432/postgres')

# Use a loop variable other than `df` so the source dataset is not overwritten.
for name, table in df_tables.items():
    table.to_sql(name, engine, if_exists='replace')
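
The connection strings above embed the database password directly in the notebook. One common alternative is to read the credentials from environment variables and assemble the URL at run time. The sketch below assumes hypothetical variable names (`AHS_DB_USER`, `AHS_DB_PASSWORD`, `AHS_DB_HOST`, `AHS_DB_PORT`, `AHS_DB_NAME`) and a hypothetical helper `build_db_url`; none of these come from the original project.

```python
import os
from urllib.parse import quote_plus

def build_db_url(env=os.environ):
    """Assemble a PostgreSQL URL from environment variables (names are hypothetical)."""
    user = env.get('AHS_DB_USER', 'postgres')
    password = quote_plus(env['AHS_DB_PASSWORD'])  # escape special characters
    host = env['AHS_DB_HOST']
    port = env.get('AHS_DB_PORT', '5432')
    name = env.get('AHS_DB_NAME', 'postgres')
    return 'postgresql://{}:{}@{}:{}/{}'.format(user, password, host, port, name)

# Example with a stand-in mapping instead of the real environment.
example_url = build_db_url({'AHS_DB_PASSWORD': 'p@ss word', 'AHS_DB_HOST': 'localhost'})
print(example_url)
```

The resulting string can be passed to `create_engine` in place of the literal URL, which keeps the password out of the saved notebook.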