# Housing Days On Market Model Metric Analysis

## Information

Housing-related data sources were combined in the project SQLite database. This notebook analyzes the resulting output CSV file.

### Environment Information:

The environment used for coding is as follows:

Oracle VM VirtualBox running Ubuntu (guest) on Windows 10 (host).

Current conda install:

               platform : linux-64
          conda version : 4.2.13
       conda is private : False
      conda-env version : 4.2.13
    conda-build version : 1.20.0
         python version : 2.7.11.final.0
       requests version : 2.9.1
       
Package requirements:

dill : 0.2.4, numpy : 1.11.2, pandas : 0.19.1, matplotlib : 1.5.1, scipy : 0.18.1, seaborn : 0.7.0, scikit-image : 0.12.3, scikit-learn : 0.18.1


## Python Package(s) Used

In [1]:
import numpy as np
import pandas as pd

In [2]:
import matplotlib.pyplot as plt
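from scipy.stats import probplot, skew, skewtest, kurtosis, kurtosistest, ppcc_plot, ppcc_max
import seaborn as sns
# lag_plot is used in the analysis loop below; this is the pandas 0.19.x import path
# (pandas >= 0.20 exposes it as pandas.plotting.lag_plot)
from pandas.tools.plotting import lag_plot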

In [4]:
%matplotlib inline

In [5]:
plt.style.use('seaborn-whitegrid')

## Data and Methods

### Data Fetching

In [None]:
# Import file from disk
df_combination = pd.read_csv('results_output_file.csv')
df_combination = df_combination.drop('Unnamed: 0', axis = 1)
df_combination
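
The `'Unnamed: 0'` column dropped above typically appears when a dataframe was written with `to_csv` and its index was saved as an unnamed first column. As a minimal sketch, assuming that is how `results_output_file.csv` was produced, the index can instead be restored at read time with `index_col=0`. The sample rows below are hypothetical, and the snippet is written in Python 3 syntax with a recent pandas in mind rather than the Python 2.7 environment listed above.

```python
import io

import pandas as pd

# Hypothetical stand-in for results_output_file.csv (made-up rows)
sample = pd.DataFrame({
    "estimator": ["model_a", "model_b"],
    "process_time": [1.5, 0.7],
})

# to_csv writes the index as an unnamed first column by default
buffer = io.StringIO()
sample.to_csv(buffer)
buffer.seek(0)

# index_col=0 reads that first column back as the index,
# so no 'Unnamed: 0' column has to be dropped afterwards
restored = pd.read_csv(buffer, index_col=0)
print(restored.columns.tolist())
```

Either approach gives the same columns; `index_col=0` simply avoids the extra `drop` step.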

## Analysis of Models

In [None]:
# Descriptive statistics of dataframe
df_combination.describe()

In [None]:
# Print the row(s) holding the maximum value of each metric column
for i in df_combination.columns.difference(['process_time','estimator']):
    print i
    print df_combination[df_combination[i] == df_combination[i].max()]
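
The loop above prints one block of output per column. A more compact view, sketched here under the assumption that only the best estimator per metric is of interest, uses `idxmax` to collect the winners into a single table. The column names `r2_score` and `explained_variance` and all values are hypothetical placeholders, not results from this project. Unlike the loop above, `idxmax` reports only the first row when several rows tie for the maximum.

```python
import pandas as pd

# Hypothetical results table; metric names and values are invented for illustration
results = pd.DataFrame({
    "estimator": ["model_a", "model_b", "model_c"],
    "process_time": [1.5, 0.7, 3.2],
    "r2_score": [0.61, 0.58, 0.66],
    "explained_variance": [0.62, 0.64, 0.60],
})

# Same column selection as in the loop above
metric_cols = results.columns.difference(["process_time", "estimator"])

# idxmax returns the row label of the first maximum in each column
best_rows = results[metric_cols].idxmax()

summary = pd.DataFrame({
    "best_estimator": results.loc[best_rows, "estimator"].values,
    "max_value": results[metric_cols].max().values,
}, index=metric_cols)
print(summary)
```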

In [None]:
# Dataframe sorted by processing time
df_combination.sort_values(by='process_time')

In [None]:
# First pass: look at frequency counts per column, test data for normality,
# test data for randomness, and analyze uncertainty of mean, median, and midrange values
df_combination_2 = df_combination.drop('estimator', axis=1)

# Specific parameters not evaluated here due to small dataset size
for i in df_combination_2.columns:
    col = df_combination_2[i]
    print i
    print "Skew: ", round(skew(col), 3)
    #print "Skew test: ", skewtest(col)
    print "Kurtosis: ", round(kurtosis(col), 3)
    #print "Kurtosis test: ", kurtosistest(col)
    #print " "
    # ppcc_max returns the Tukey-Lambda shape parameter that maximizes the PPCC
    print "PPCC_max value: ", round(ppcc_max(col, brack = (-10,10)), 3)
    # Reference: www.itl.nist.gov/div898/handbook/eda/section3/ppccplot.htm
    # A value near 0.14 indicates an approximately normal distribution.
    # Less than 0.14 indicates long-tailed distributions (near -1: Cauchy).
    # Greater than 0.14 indicates short-tailed distributions (Beta or uniform).

    plt.figure(1, figsize = (10,10), dpi = 80)
    # histogram plot
    plt.subplot(321)
    plt.title("Histogram")
    plt.hist(col)

    # box and whiskers plot
    plt.subplot(322)
    plt.title("Box-Whiskers")
    plt.boxplot(col)

    # normal probability plot - test for normality
    plt.subplot(323)
    plt.title("Normal Probability Test")
    probplot(col, plot=plt)

    # run-sequence plot - test for outliers, and shifts in location and variation
    plt.subplot(324)
    plt.title("Run-Sequence")
    plt.scatter(col.index, col)

    # lag plot - test for randomness
    plt.subplot(325)
    #plt.title("Lag")
    lag_plot(col)

    # Bootstrap plot not used here due to small dataset size
    # bootstrap plot - test for uncertainty of mean, median, and midrange
    #bootstrap_plot(col, size = 100, samples = 100, color = 'grey')

    plt.tight_layout()
    plt.show()
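
The plots above are read by eye. As a numeric companion, the sketch below gathers the same shape statistics into one table per column and adds a lag-1 autocorrelation as a rough counterpart to the lag plot. The helper name `diagnostics_table` and the synthetic data are assumptions for illustration only, and the snippet uses Python 3 syntax rather than the Python 2.7 of the notebook.

```python
import numpy as np
import pandas as pd
from scipy.stats import skew, kurtosis, ppcc_max


def diagnostics_table(frame, brack=(-10, 10)):
    """Hypothetical helper: one row of shape statistics per numeric column."""
    rows = {}
    for name in frame.columns:
        series = frame[name].dropna()
        values = series.values
        rows[name] = {
            "skew": skew(values),
            # Fisher definition: a normal distribution gives 0
            "kurtosis": kurtosis(values),
            # Tukey-Lambda shape parameter that maximizes the PPCC
            "tukey_lambda": ppcc_max(values, brack=brack),
            # Values near 0 are consistent with a random sequence
            "lag1_autocorr": series.autocorr(lag=1),
        }
    return pd.DataFrame.from_dict(rows, orient="index")


# Synthetic data, not project results: a normal sample and a linear ramp
rng = np.random.RandomState(0)
demo = pd.DataFrame({
    "normal_like": rng.normal(size=200),
    "ramp": np.arange(200, dtype=float),
})

table = diagnostics_table(demo)
print(table.round(3))
```

The ramp column is deliberately non-random, so its lag-1 autocorrelation is 1 and its kurtosis is close to the uniform value of -1.2. On a dataset as small as the one analyzed here, all of these statistics should be treated as rough indications only.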