# First Home Recommender
## AHS Household Transformations

This workbook imports the combined household dataset from the American Housing Survey and readies it for machine learning through a series of transformations.

Steps:
01. Subset the household dataset for first-time homeowners only.
02. Remove weight and flag variables from the household dataset.
03. Remove variables whose proportion of missing values is above the threshold level.
04. Impute the missing values for continuous, categorical, and binary variables.

In [2]:
import os
import pandas as pd
import numpy as np
from functools import reduce
from sklearn.impute import SimpleImputer

Instantiate Relevant Variables

In [5]:
threshhold = 0.20
path = os.path.join(os.getcwd(), 'data', 'working')

In [6]:
df = pd.read_csv(os.path.join(path, 'AHS Household Combined.csv'))
varcat = pd.read_csv(os.path.join(os.getcwd(), 'data', 'concordance', 'varclass.csv'))

01 - Subset the dataset to only first-time home buyers

In [7]:
df_fh = df[df['FIRSTHOME']==1].copy()

02 - Remove weight and flag variables from the dataset

In [8]:
vars_less_wgt = [i for i in list(df_fh.columns) if 'WGT' not in i]
vars_less_wgt = [i for i in vars_less_wgt if 'WEIGHT' not in i]
vars_less_wgt_flags = [i for i in vars_less_wgt if not i.startswith('J')]
df_fh2 = df_fh[vars_less_wgt_flags].copy()
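
The name-based filter above can be tried on a toy frame. This is a minimal sketch: every column name below is made up for illustration (`JOBTYPE` in particular is hypothetical), and it only shows which names the three rules keep. Note that the `startswith('J')` rule drops any column beginning with `J`, whether or not it is a flag, so it is worth checking the dropped list against the codebook.

```python
import pandas as pd

# Toy frame; the column names are illustrative, not taken from the AHS file.
toy = pd.DataFrame({
    'CONTROL': [1, 2],
    'WEIGHT': [100.0, 250.0],
    'SP1WGT': [90.0, 240.0],
    'JRENT': [0, 1],
    'RENT': [800, 950],
    'JOBTYPE': [2, 3],   # hypothetical substantive variable that starts with 'J'
})

# Same three rules as the cell above, combined into one pass.
keep = [c for c in toy.columns
        if 'WGT' not in c and 'WEIGHT' not in c and not c.startswith('J')]
toy_kept = toy[keep]

# Columns removed by the filter, for a quick manual review.
dropped = [c for c in toy.columns if c not in keep]
print(keep)
print(dropped)
```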

03 - Remove variables whose proportion of missing values is above the threshold

In [9]:
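# Share of values in each column that carry a missing-value code (-6 or -9)
miss_percent = df_fh2.isin([-6, -9]).sum(axis=0) / df_fh2.count(axis=0)
# Keep only the columns whose missing share is below the threshold
miss_percent_lt_thresh = miss_percent[miss_percent < threshhold]
df_fh2 = df_fh2[miss_percent_lt_thresh.index].copy()
df_fh2_cols = list(df_fh2.columns)
# Recode the missing-value codes as NaN so they can be imputed later
df_fh2 = df_fh2.replace([-6, -9], np.nan)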
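
As a small worked example of the threshold filter, the sketch below applies the same idea to a toy frame. The data and the 0.5 cutoff are assumptions chosen so the arithmetic is easy to follow. It uses `isin(...).mean()`, which divides by the total number of rows; the cell above divides by `count()`, which gives the same result only while the columns contain no NaN yet.

```python
import numpy as np
import pandas as pd

# Toy data: -6 and -9 stand in for the survey's missing-value codes.
toy = pd.DataFrame({
    'A': [1, -6, 3, 4, 5],     # 1 of 5 values coded missing
    'B': [-9, -6, -9, 4, 5],   # 3 of 5 values coded missing
    'C': [1, 2, 3, 4, 5],      # no coded missing values
})

toy_threshold = 0.5  # illustrative cutoff, not the notebook's value

# Fraction of rows per column that carry a missing-value code.
share = toy.isin([-6, -9]).mean()

# Keep columns under the cutoff, then recode the remaining codes as NaN.
kept = toy.loc[:, share < toy_threshold].replace([-6, -9], np.nan)
print(share)
print(kept)
```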

04 - Impute missing values for all estimators

Divide the list of remaining variables into four groups: 1) target, 2) continuous, 3) categorical, and 4) binary

In [12]:
df_varcat = pd.merge(pd.DataFrame(miss_percent_lt_thresh), varcat, how='left', 
                     left_on=pd.DataFrame(miss_percent_lt_thresh).index, right_on=['Variable'])
target_vars = list(df_varcat['Variable'][df_varcat['Grouping'] == 'Target'])
cont_vars = list(df_varcat['Variable'][df_varcat['Grouping'] == 'Continuous'])
cat_vars = list(df_varcat['Variable'][df_varcat['Grouping'] == 'Categorical'])
binary_vars = list(df_varcat['Variable'][df_varcat['Grouping'] == 'Binary'])

Separate the dataset into target variables and estimator (independent) variables

In [13]:
target = df_fh2[['CONTROL','YEAR'] + target_vars].copy()
estimators = df_fh2.drop(['RATINGHS','RATINGNH'], axis=1).copy()

Impute missing data values for each type of variable

Continuous Variables

In [14]:
imputer_cont = SimpleImputer(missing_values=np.nan, strategy='median')
imputer_cont.fit(estimators[['CONTROL','YEAR'] + cont_vars])
imputed_cont = imputer_cont.transform(estimators[['CONTROL','YEAR'] + cont_vars])
estimators_cont = pd.DataFrame(imputed_cont, columns=['CONTROL','YEAR'] + cont_vars)

Categorical Variables

In [15]:
imputer_cat = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
imputer_cat.fit(estimators[['CONTROL','YEAR'] + cat_vars])
imputed_cat = imputer_cat.transform(estimators[['CONTROL','YEAR'] + cat_vars])
estimators_cat = pd.DataFrame(imputed_cat, columns=['CONTROL','YEAR'] + cat_vars)
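
To make the two imputation strategies concrete, here is a minimal sketch on toy data (the column names and values are illustrative assumptions). It fills a continuous column with its median and a categorical column with its most frequent value, using the same `SimpleImputer` settings as the cells above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data: one continuous and one categorical column, each with a gap.
toy = pd.DataFrame({
    'INCOME': [10.0, np.nan, 30.0, 100.0],
    'TENURE': [1.0, 1.0, np.nan, 2.0],
})

median_imp = SimpleImputer(missing_values=np.nan, strategy='median')
mode_imp = SimpleImputer(missing_values=np.nan, strategy='most_frequent')

# fit_transform returns a NumPy array, so the result is rebuilt as a DataFrame.
income = median_imp.fit_transform(toy[['INCOME']])
tenure = mode_imp.fit_transform(toy[['TENURE']])

filled = pd.DataFrame(
    {'INCOME': income.ravel(), 'TENURE': tenure.ravel()},
    index=toy.index,  # keep the original row labels
)
print(filled)
```

The median is less sensitive than the mean to outliers such as the 100.0 above, which is one reason to prefer it for skewed variables.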

Binary Variables

In [16]:
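# Binary variables are not imputed here; rows with missing values are dropped by dropna() in the final merge
estimators_binary = estimators[['CONTROL','YEAR'] + binary_vars].copy()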

Create dummies from categorical variables

In [17]:
estimators_cat_dum = pd.get_dummies(estimators_cat, columns=cat_vars)
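
A short sketch of what `pd.get_dummies` does with the `columns` argument, on toy data with an illustrative column name. One thing to watch: if the imputed categorical values come back as floats, the generated dummy names carry a `.0` suffix.

```python
import pandas as pd

# Toy frame; 'BLD' is used here only as an illustrative categorical column.
toy = pd.DataFrame({'CONTROL': [1, 2, 3], 'BLD': [2.0, 3.0, 2.0]})

# Only the listed column is expanded; CONTROL passes through unchanged.
dummies = pd.get_dummies(toy, columns=['BLD'])
print(list(dummies.columns))
```

Casting the categorical columns to integers before encoding is one way to get cleaner names, if the codes are known to be whole numbers.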

Merge different variables back into a single dataset

In [18]:
dfs = [target, estimators_cont, estimators_cat_dum, estimators_binary]
df_final = reduce(lambda left, right: pd.merge(left, right, how='inner', on=['CONTROL','YEAR']), dfs).dropna(how='any')
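
The `reduce` pattern above chains pairwise merges over a list of frames. The sketch below shows the same pattern on three toy frames (all names and values are illustrative assumptions) and how the trailing `dropna` removes any row that still has a missing value after the merge.

```python
from functools import reduce

import pandas as pd

keys = ['CONTROL', 'YEAR']

# Three toy frames that share the same key columns.
a = pd.DataFrame({'CONTROL': [1, 2, 3], 'YEAR': [2017] * 3, 'Y': [5, 7, 9]})
b = pd.DataFrame({'CONTROL': [1, 2, 3], 'YEAR': [2017] * 3, 'X1': [0.1, 0.2, 0.3]})
c = pd.DataFrame({'CONTROL': [1, 2, 3], 'YEAR': [2017] * 3, 'X2': [1.0, None, 0.0]})

# Merge left to right: (a merged with b), then that result merged with c.
merged = reduce(
    lambda left, right: pd.merge(left, right, how='inner', on=keys),
    [a, b, c],
)

# Drop rows with any remaining missing value (CONTROL 2 has no X2).
final = merged.dropna(how='any')
print(final)
```

Comparing `len(merged)` with `len(final)` gives a quick count of how many rows the final `dropna` removes.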