# First Home Recommender
## AHS Household Transformations

This workbook imports the combined household dataset from the American Housing Survey and readys it for machine learning with various transformations.

Steps:
01 - Subset the household dataset for first-time homeowners only.
02 - Remove weight and flag variales from the household dataset.
03 - Remove all variables related to house "experience"
04 - Remove variables whose portion of missing values is above the threshhold level.
05 - Impute the missing values for continuous, categorical, and binary variables.
06 - Create a dummy variable dataset from categorical variables.
07 - Bin the housing satisfaction target variale.
08 - Scale dollar value variales from 0 to 1.0.
09 - Log transform income variables.
10 - Combine datasets together into regression and classification model-ready datasets.
11 - Update the AWS database.

NOTE: Each of these steps needs to be run numerical order for final datasets to be created correctly.

In [1]:
import os
import pandas as pd
import numpy as np
from functools import reduce
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler, StandardScaler

Instantiate Variables

In [2]:
threshhold = 0.20
path = os.path.join(os.getcwd(), 'data', 'working')

In [6]:
df = pd.read_csv(os.path.join(path, 'AHS Household Combined.csv'))
varcat = pd.read_csv(os.path.join(os.getcwd(), 'data', 'concordance', 'varclass.csv'))
varrelevant = pd.read_csv(os.path.join(os.getcwd(), 'data', 'concordance', 'varrelevant.csv'))

FileNotFoundError: [Errno 2] File b'/home/mdmanley/Workspace/Georgetown Data Science Certificate/First-Home-Recommender/data/concordance/varrelevant.csv' does not exist: b'/home/mdmanley/Workspace/Georgetown Data Science Certificate/First-Home-Recommender/data/concordance/varrelevant.csv'

01 - Subset the dataset to only first-time home buyers

In [None]:
df_fh = df[df['FIRSTHOME']==1].copy()

02 - Remove weight and flag variales from the dataset 

In [None]:
vars_less_wgt = [i for i in list(df_fh.columns) if 'WGT' not in i]
vars_less_wgt = [i for i in vars_less_wgt if 'WEIGHT' not in i]
vars_less_wgt_flags = [i for i in vars_less_wgt if not i.startswith('J')]
df_fh2 = df_fh[vars_less_wgt_flags].copy()

03 - Remove all variables related to house experience

The project's goal is to predict housing satisfaction using features relevant to a housing search. Therefore, any field that captures information on actual housing experience after the point of purchase should not be included in our dataset.

In [None]:
relevant_vars = list(varrelevant[varrelevant.iloc[:,1]].iloc[:,0])
df_fh3 = df_fh2[['CONTROL','YEAR'] + relevant_vars].copy()

04 - Remove variables whose portion of missing values is above the threshhold

In [None]:
miss_percent = df_fh3.isin([-9]).sum(axis=0) / df_fh3.count(axis=0)
miss_percent_lt_thresh = miss_percent[miss_percent.iloc[:] < threshhold]
df_fh4 = df_fh3[miss_percent_lt_thresh.index].copy()
df_fh4_cols = list(df_fh2.columns)
df_fh4.replace(-9, np.nan, inplace=True)

05 - Impute missing values for all estimators

Divide the list of remaining variales into 4 groups: 1) target, 2) continuous, 3) categorical, and 4) binary

In [None]:
df_varcat = pd.merge(pd.DataFrame(miss_percent_lt_thresh), varcat, how='left', 
                     left_on=pd.DataFrame(miss_percent_lt_thresh).index, right_on=['Variable'])
target_vars = list(df_varcat['Variable'][df_varcat['Grouping'] == 'Target'])
cont_vars = list(df_varcat['Variable'][df_varcat['Grouping'] == 'Continuous'])
cat_vars = list(df_varcat['Variable'][df_varcat['Grouping'] == 'Categorical'])
binary_vars = list(df_varcat['Variable'][df_varcat['Grouping'] == 'Binary'])

In [None]:
exp_vars = ['ELECAMT','GASAMT','OILAMT','OTHERAMT','TRASHAMT','WATERAMT','UTILAMT']
scale_vars = ['EXPSUM','HINCP','FINCP']
inc_vars = ['HINCP','FINCP']

Seperate dataset into target variables and dependent variables

In [None]:
target = df_fh4[['CONTROL','YEAR','RATINGHS']].copy()
estimators = df_fh4.drop(['RATINGHS','RATINGNH'], axis=1).copy()

Impute missing data values for each type of variable

Continuous Variables

In [None]:
imputer_cont = SimpleImputer(missing_values=np.nan, strategy='median')
imputer_cont.fit(estimators[['CONTROL','YEAR'] + cont_vars])
imputed_cont = imputer_cont.transform(estimators[['CONTROL','YEAR'] + cont_vars])
estimators_cont = pd.DataFrame(imputed_cont, columns=['CONTROL','YEAR'] + cont_vars)

Categorical Variables

In [7]:
imputer_cat = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
imputer_cat.fit(estimators[['CONTROL','YEAR'] + cat_vars])
imputed_cat = imputer_cat.transform(estimators[['CONTROL','YEAR'] + cat_vars])
estimators_cat = pd.DataFrame(imputed_cat, columns=['CONTROL','YEAR'] + cat_vars)

NameError: name 'estimators' is not defined

Binary Variables

In [8]:
imputer_binary = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
imputer_binary.fit(estimators[['CONTROL','YEAR'] + binary_vars])
imputed_binary = imputer_binary.transform(estimators[['CONTROL','YEAR'] + binary_vars])
estimators_binary = pd.DataFrame(imputed_binary, columns=['CONTROL','YEAR'] + binary_vars)

NameError: name 'estimators' is not defined

06 - Create dummies from categorical variables

In [67]:
estimators_cat_dum = pd.get_dummies(estimators_cat, columns=cat_vars)
estimators_binary_dum = pd.get_dummies(estimators_binary, columns=binary_vars)

07 - Bin Housing Satisfaction Variables 

In [68]:
target['RATINGHS_BIN'] = pd.cut(target['RATINGHS'], bins=[0,7,8,9,10], 
                                labels=['not satisfied','satisfied','very satisfied','extremely satisfied'])
target_bin = pd.DataFrame(target[['CONTROL','YEAR','RATINGHS_BIN']])
target.drop('RATINGHS_BIN', axis=1, inplace=True)

08 - Scale the dollar value variales from 0 to 1.0

In [1]:
estimators_cont['EXPSUM'] = estimators_cont[exp_vars].sum(axis=1)
estimators_cont.drop(exp_vars, axis=1)

x_array = estimators_cont['EXPSUM']
standardscaler = StandardScaler()
x_scaled = standardscaler.fit_transform(x_array)
scaled_cont = pd.DataFrame(x_scaled, columns=scale_vars)

NameError: name 'estimators_cont' is not defined

09 - Log transform both income variales and merge back into the dataframe

In [76]:
scaled_cont['LN_HINCP'] = np.where(scaled_cont['HINCP'] > 1, np.log(scaled_cont['HINCP']), 0)
scaled_cont['LN_FINCP'] = np.where(scaled_cont['FINCP'] > 1, np.log(scaled_cont['FINCP']), 0)
#scaled_cont.drop(inc_vars, axis=1, inplace=True)
estimators_cont_inc = pd.concat([estimators_cont[['CONTROL','YEAR']],pd.DataFrame(scaled_cont)], axis=1)

  """Entry point for launching an IPython kernel.
  


In [77]:
scaled_cont

Unnamed: 0,EXPSUM,HINCP,FINCP,LN_HINCP,LN_FINCP
0,0.206244,0.010986,0.010986,0.0,0.0
1,0.216178,0.022608,0.022608,0.0,0.0
2,0.206717,0.023817,0.023299,0.0,0.0
3,0.602649,0.018118,0.018118,0.0,0.0
4,0.197729,0.003440,0.003440,0.0,0.0
5,0.271523,0.018118,0.018118,0.0,0.0
6,0.225166,0.003129,0.003129,0.0,0.0
7,0.093661,0.013111,0.013111,0.0,0.0
8,0.179281,0.006065,0.006065,0.0,0.0
9,0.340114,0.005857,0.005857,0.0,0.0


In [49]:
estimators_cont_noninc = estimators_cont.drop(['ELECAMT','GASAMT','OILAMT','OTHERAMT','TRASHAMT','WATERAMT',
                                               'UTILAMT','EXPSUM','HINCP','FINCP'], axis=1)

10 - Merge datasets with different variable types back into one dataset

In [50]:
dfs_reg = [target, estimators_cont_inc, estimators_cont_noninc, estimators_cat_dum, estimators_binary_dum]
dfs_class = [target_bin, estimators_cont_inc, estimators_cont_noninc, estimators_cat, estimators_binary]
df_final_reg = reduce(lambda left, right: pd.merge(left, right, how='inner', on=['CONTROL','YEAR']), dfs_reg).dropna(how='any')
df_final_class = reduce(lambda left, right: pd.merge(left, right, how='inner', on=['CONTROL','YEAR']), dfs_class).dropna(how='any')

In [20]:
noninc_cols = estimators_cont_noninc.columns.drop
for col in estimators_cont_noninc.columns:
    estimators_cont_noninc[col] = pd.cut(estimators_cont_noninc[col], bins=10, precision=1)

Create CSV Files

In [52]:
df_final_reg.to_csv(os.path.join(path, 'AHS Household Reg.csv'))
df_final_class.to_csv(os.path.join(path, 'AHS Household Class.csv'))

11 - Update Database

Send intermediate tables to the database

In [51]:
from sqlalchemy import create_engine
engine = create_engine('postgresql://postgres:Admin123@project.cgxhdwn5zb5t.us-east-1.rds.amazonaws.com:5432/postgres')
df_final_reg.to_sql('ahs_household_reg', engine, if_exists='replace')

In [None]:
from sqlalchemy import create_engine

df_tables = {'ahs_household_step_1':df_fh, 
             'ahs_household_step_4':df_fh4, 
             'ahs_household_class':df_final_class,
             'ahs_household_reg':df_final_reg}
engine = create_engine('postgresql://postgres:Admin123@project.cgxhdwn5zb5t.us-east-1.rds.amazonaws.com:5432/postgres')

for name, df in df_tables.items():
    df.to_sql('{}'.format(name), engine)