# Lab 3 Assignment
### Authors: Dan Davieau, Paul Panek, Olga Tanyuk, Nathan Wall

### Business Understanding
Washington, D.C. is the capital of the United States. Washington's population is approaching 700,000 people, and has been growing since 2000 following a half-century of population decline. The city is highly segregated and features a high cost of living. In 2017, the average price of a single family home in the district was $649,000. 

Understanding the features of, and similarities between, the various neighborhoods provides important insights into the housing market within the district. This could be used as a preliminary exploratory analysis for potential homebuilders to identify housing trends among the different DC neighborhoods.

*Depending on the goals we will choose our validation and measures of effectiveness here*


### Data Understanding

For this data we will be reading in and joining two data sets from Kaggle (https://www.kaggle.com/christophercorrea/dc-residential-properties):

*raw_residential_data.csv* : 
The Computer Assisted Mass Appraisal - Residential data contains attribution on housing characteristics for residential properties, and was created as part of the DC Geographic Information System (DC GIS) for the D.C. Office of the Chief Technology Officer (OCTO) and participating D.C. government agencies.

*raw_address_points.csv* :
The raw address points data contains the locations and attributes of address points as of July 2018. This file is part of the Master Address Repository (MAR) for the D.C. Office of the Chief Technology Officer and the DC Department of Consumer and Regulatory Affairs. It contains the standardized addresses in the District of Columbia, which are typically placed on the buildings.

The second dataset will only be used as we interpret results, to better understand how the various clusters of DC homes relate to the various DC neighborhoods and whether any insights can be uncovered.
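As a rough sketch of how the address data could later be attached to clustered homes, the example below joins two toy frames. It assumes that both files share an `SSL` key; the `cluster` and `NEIGHBORHOOD` column names and all values except the two `SSL` strings taken from the output further down are hypothetical and only illustrate the join.

```python
import pandas as pd

# Toy stand-ins for the two files; real column sets are much wider.
homes = pd.DataFrame({
    "SSL": ["0152 0133", "0152 0134", "9999 0001"],  # last value is made up
    "cluster": [0, 1, 1],                            # hypothetical cluster label
})
addresses = pd.DataFrame({
    "SSL": ["0152 0133", "0152 0134"],               # assumed join key
    "NEIGHBORHOOD": ["Neighborhood A", "Neighborhood A"],  # hypothetical column
})

# A left join keeps every home, even when no address point matches.
merged = homes.merge(addresses, on="SSL", how="left", indicator=True)

# Count homes per (neighborhood, cluster), keeping unmatched homes visible.
summary = (
    merged.fillna({"NEIGHBORHOOD": "UNKNOWN"})
          .groupby(["NEIGHBORHOOD", "cluster"])
          .size()
)
print(summary)
```

The `indicator=True` argument adds a `_merge` column, which makes it easy to check how many homes failed to match an address point before interpreting any neighborhood summary.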

In [1]:
import pandas as pd
import numpy as np

df1 = pd.read_csv('Data/raw_residential_data.csv')
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 107154 entries, 0 to 107153
Data columns (total 39 columns):
OBJECTID             107154 non-null int64
SSL                  107154 non-null object
BATHRM               107127 non-null float64
HF_BATHRM            107126 non-null float64
HEAT                 107127 non-null float64
HEAT_D               107127 non-null object
AC                   107127 non-null object
NUM_UNITS            107127 non-null float64
ROOMS                107110 non-null float64
BEDRM                107123 non-null float64
AYB                  107141 non-null float64
YR_RMDL              49446 non-null float64
EYB                  107154 non-null int64
STORIES              107080 non-null float64
SALEDATE             107154 non-null object
PRICE                87866 non-null float64
QUALIFIED            107154 non-null object
SALE_NUM             107154 non-null int64
GBA                  107154 non-null int64
BLDG_NUM             107154 non-null int64
STYLE 

From what we can see, we have data on 107,154 different residential properties. Each is described by up to 39 different features, where applicable.

Several features come in pairs: a numeric code and a text description of that code, labeled with a "_D" suffix. The "_D" descriptions make one-hot encoded output simpler to read, although the cell below drops them from the working data frame and keeps the numeric codes. A dictionary of those code descriptions can be found in the appendix along with a description of each variable.

Below we subset our variables of interest and assess the number of null values and potential outliers that need to be addressed.


In [2]:
# Pair each text-description column ("_D") with its numeric code column
categories = [['CNDTN_D','CNDTN'],['HEAT_D','HEAT'],['STYLE_D','STYLE'],['STRUCT_D','STRUCT'],['GRADE_D','GRADE'],['ROOF_D','ROOF'],['EXTWALL_D','EXTWALL'],['INTWALL_D','INTWALL']]

# The "_D" columns duplicate the numeric codes, so collect them for removal
cat_drop = [desc_col for desc_col, code_col in categories]

# eliminate the redundant description columns
df1.drop(columns=cat_drop, inplace=True)
df1.head()


Unnamed: 0,OBJECTID,SSL,BATHRM,HF_BATHRM,HEAT,AC,NUM_UNITS,ROOMS,BEDRM,AYB,...,GRADE,CNDTN,EXTWALL,ROOF,INTWALL,KITCHENS,FIREPLACES,USECODE,LANDAREA,GIS_LAST_MOD_DTTM
0,1001,0152 0133,4.0,0.0,7.0,Y,2.0,8.0,4.0,1910.0,...,6.0,4.0,14.0,6.0,6.0,2.0,5.0,24,1680,2018-07-22T18:01:43.000Z
1,1002,0152 0134,3.0,1.0,7.0,Y,2.0,11.0,5.0,1898.0,...,6.0,4.0,14.0,2.0,6.0,2.0,4.0,24,1680,2018-07-22T18:01:43.000Z
2,1003,0152 0135,3.0,1.0,13.0,Y,2.0,9.0,5.0,1910.0,...,6.0,5.0,14.0,2.0,6.0,2.0,4.0,24,1680,2018-07-22T18:01:43.000Z
3,1004,0152 0136,3.0,1.0,13.0,Y,2.0,8.0,5.0,1900.0,...,6.0,4.0,14.0,2.0,6.0,2.0,3.0,24,1680,2018-07-22T18:01:43.000Z
4,1005,0152 0138,2.0,1.0,7.0,Y,1.0,11.0,3.0,1913.0,...,6.0,4.0,14.0,13.0,6.0,1.0,0.0,13,2032,2018-07-22T18:01:43.000Z
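A code-to-description dictionary like the one referenced for the appendix could be built from each column pair before the description columns are dropped. The sketch below is one possible approach on a toy frame; `build_code_lookup` is a hypothetical helper, and the descriptions are placeholders rather than real values from the data.

```python
import pandas as pd

# Toy frame mimicking one code/description pair; values are placeholders.
toy = pd.DataFrame({
    "HEAT":   [7.0, 13.0, 7.0, None],
    "HEAT_D": ["type A", "type B", "type A", None],
})

def build_code_lookup(frame, code_col, desc_col):
    """Return {numeric code: description} from a paired code/description column."""
    # Drop rows where either value is missing, then keep one row per pair.
    pairs = frame[[code_col, desc_col]].dropna().drop_duplicates()
    return {int(code): desc for code, desc in zip(pairs[code_col], pairs[desc_col])}

heat_lookup = build_code_lookup(toy, "HEAT", "HEAT_D")
print(heat_lookup)
```

Run against the full data frame before the drop, the same helper could be looped over every pair in `categories` to produce one lookup per categorical feature.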


In [3]:
print(df1.isnull().sum())

OBJECTID                 0
SSL                      0
BATHRM                  27
HF_BATHRM               28
HEAT                    27
AC                      27
NUM_UNITS               27
ROOMS                   44
BEDRM                   31
AYB                     13
YR_RMDL              57708
EYB                      0
STORIES                 74
SALEDATE                 0
PRICE                19288
QUALIFIED                0
SALE_NUM                 0
GBA                      0
BLDG_NUM                 0
STYLE                   27
STRUCT                  27
GRADE                   27
CNDTN                   27
EXTWALL                 27
ROOF                    27
INTWALL                 27
KITCHENS                28
FIREPLACES              28
USECODE                  0
LANDAREA                 0
GIS_LAST_MOD_DTTM        0
dtype: int64


Our data set still has 107,154 observations but now contains only 31 features. However, several features have null values. For each categorical or numeric feature with 100 or fewer missing values, we will simply impute the missing values using the most common class or the median value, respectively. The year remodeled and the price stand out and will need to be treated differently.

For the year remodeled, we will assume the value is missing when the home has never been remodeled. We therefore convert the year to bins and treat the roughly 57k homes with no remodel year as their own class. Clusters with a high proportion of homes in this class may provide insight into homes that are ideal for contractors.

The price is the price of the last sale. Given the volatility of the housing market over time and inflation, we believe that feature could be misleading, so we will leave it out of our analysis.

In addition to the price feature, we will also drop several other columns that we have deemed not useful.
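The cutoff of 100 missing values described above can also be applied programmatically rather than by reading the null counts. The following is a minimal sketch on toy data; `triage_missing` is a hypothetical helper, and the toy frame uses a threshold of 1 only so that both outcomes appear.

```python
import numpy as np
import pandas as pd

# Toy frame; the column names echo the data set, the values do not.
toy = pd.DataFrame({
    "BATHRM": [1.0, np.nan, 2.0, 3.0],
    "YR_RMDL": [np.nan, np.nan, np.nan, 2005.0],
    "GBA": [1200, 1500, 900, 2000],
})

def triage_missing(frame, threshold):
    """Split columns with nulls into simple-impute and special-handling lists."""
    null_counts = frame.isnull().sum()
    simple = null_counts[(null_counts > 0) & (null_counts <= threshold)].index.tolist()
    special = null_counts[null_counts > threshold].index.tolist()
    return simple, special

simple, special = triage_missing(toy, threshold=1)
print(simple, special)
```

With the null counts printed above and `threshold=100`, only `YR_RMDL` and `PRICE` would land in the special-handling list.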

In [4]:
# Bin the remodel year; labels give the approximate number of years since the remodel
bins = [0, 1960, 1970, 1980, 1990, 2000, 2010, 2020]
labels = ['50+','50','40','30','20','10','0']
df1['YR_RMDL_ClASS'] = pd.cut(df1['YR_RMDL'], bins=bins, labels=labels)
# Homes with no remodel year get their own class; cast to object first so 'NONE' is an allowed value
df1['YR_RMDL_ClASS'] = df1['YR_RMDL_ClASS'].astype(object).fillna('NONE')

# eliminate unnecessary variables
df1.drop(['PRICE','QUALIFIED','BLDG_NUM','GRADE','CNDTN','EYB','USECODE','GIS_LAST_MOD_DTTM','YR_RMDL','SALEDATE'], inplace=True, axis=1)

In [5]:
df1.head()

Unnamed: 0,OBJECTID,SSL,BATHRM,HF_BATHRM,HEAT,AC,NUM_UNITS,ROOMS,BEDRM,AYB,...,GBA,STYLE,STRUCT,EXTWALL,ROOF,INTWALL,KITCHENS,FIREPLACES,LANDAREA,YR_RMDL_ClASS
0,1001,0152 0133,4.0,0.0,7.0,Y,2.0,8.0,4.0,1910.0,...,2522,7.0,7.0,14.0,6.0,6.0,2.0,5.0,1680,30
1,1002,0152 0134,3.0,1.0,7.0,Y,2.0,11.0,5.0,1898.0,...,2567,7.0,7.0,14.0,2.0,6.0,2.0,4.0,1680,10
2,1003,0152 0135,3.0,1.0,13.0,Y,2.0,9.0,5.0,1910.0,...,2522,7.0,7.0,14.0,2.0,6.0,2.0,4.0,1680,10
3,1004,0152 0136,3.0,1.0,13.0,Y,2.0,8.0,5.0,1900.0,...,2484,7.0,7.0,14.0,2.0,6.0,2.0,3.0,1680,10
4,1005,0152 0138,2.0,1.0,7.0,Y,1.0,11.0,3.0,1913.0,...,5255,7.0,8.0,14.0,13.0,6.0,1.0,0.0,2032,0
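To make the binning behaviour concrete, the sketch below applies the same bins and labels to a handful of made-up years. It relies on the `pd.cut` default of right-inclusive intervals, so 1960 falls in the '50+' class and 2010 in the '10' class. Casting to object and then calling `fillna('NONE')` is one way to create the separate class for missing years.

```python
import numpy as np
import pandas as pd

bins = [0, 1960, 1970, 1980, 1990, 2000, 2010, 2020]
labels = ['50+', '50', '40', '30', '20', '10', '0']

# Made-up remodel years, including bin edges and a missing value.
years = pd.Series([1955, 1960, 1961, 1985, 2010, 2011, np.nan])

# Intervals are (left, right], e.g. (1980, 1990] maps to '30'.
classes = pd.cut(years, bins=bins, labels=labels).astype(object).fillna('NONE')
print(classes.tolist())
# ['50+', '50+', '50', '30', '10', '0', 'NONE']
```

Note that any year after 2020 would fall outside the last bin and would also end up in the 'NONE' class, so the upper edge may need extending for newer data.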


In [6]:
from sklearn.base import TransformerMixin
class DataFrameImputer(TransformerMixin):
    def __init__(self):
        """Impute missing values.

        Columns of dtype object are imputed with the most frequent value 
        in column.

        Columns of other types are imputed with median of column.

        """
    def fit(self, X, y=None):

        self.fill = pd.Series([X[c].value_counts().index[0]
            if X[c].dtype == np.dtype('O') else X[c].median() for c in X],
            index=X.columns)

        return self

    def transform(self, X, y=None):
        return X.fillna(self.fill)

df = DataFrameImputer().fit_transform(df1)

int_col = ['BATHRM','HF_BATHRM','HEAT','NUM_UNITS','ROOMS','BEDRM','AYB','STORIES','STYLE','STRUCT',
           'EXTWALL','ROOF','INTWALL','KITCHENS','FIREPLACES','LANDAREA']

for i in int_col:
    df[i] = df[i].astype('int64')

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 107154 entries, 0 to 107153
Data columns (total 22 columns):
OBJECTID         107154 non-null int64
SSL              107154 non-null object
BATHRM           107154 non-null int64
HF_BATHRM        107154 non-null int64
HEAT             107154 non-null int64
AC               107154 non-null object
NUM_UNITS        107154 non-null int64
ROOMS            107154 non-null int64
BEDRM            107154 non-null int64
AYB              107154 non-null int64
STORIES          107154 non-null int64
SALE_NUM         107154 non-null int64
GBA              107154 non-null int64
STYLE            107154 non-null int64
STRUCT           107154 non-null int64
EXTWALL          107154 non-null int64
ROOF             107154 non-null int64
INTWALL          107154 non-null int64
KITCHENS         107154 non-null int64
FIREPLACES       107154 non-null int64
LANDAREA         107154 non-null int64
YR_RMDL_ClASS    107154 non-null object
dtypes: int64(19), object(3)
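The choice of the median over the mean for numeric columns matters when a column has extreme values. The toy example below is a simplified sketch of the same fill logic, not the class above; the column names are borrowed from the data set and the values are made up.

```python
import numpy as np
import pandas as pd

# Toy frame with one categorical and one numeric column, each missing a value.
toy = pd.DataFrame({
    "AC": ["Y", "Y", "N", None],
    "ROOMS": [6.0, 8.0, 100.0, np.nan],   # 100 is a deliberate outlier
})

# Median for numeric columns, most frequent value for everything else.
fill = {
    c: toy[c].median() if pd.api.types.is_numeric_dtype(toy[c]) else toy[c].mode().iloc[0]
    for c in toy
}
filled = toy.fillna(fill)
print(filled)
```

Here the missing `ROOMS` value is filled with 8.0, whereas the mean would have produced 38.0 because of the single outlier.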

We now have a cleaned-up set of 22 columns for over 107k homes, with all missing values imputed. Before we begin analysis and clustering of our data, we will explore these variables a little further to understand any transformations that may be required and any outliers that need to be addressed.