# Notebook on potential cleaning

# Goals
Discuss:
- R^2 v. adjusted R^2
- F-statistic (18.04)
- p values
- Jarque-Bera (20.14)
- relevance of all coefficients
- error based metrics (MAE, RMSE) (19.05)

Find and use examples of:
- numeric variables (continuous) (19.06, 19.08)
- numeric variables (discrete) (19.06, 19.08)
- categorical variable (string) (19.06, 19.08)
- categorical variable (number) (19.06, 19.08)

Find and use examples of:
- linear transformations (shifting, scaling) (20.02, 20.03)
- log transformations (20.04, 20.05)
- interactions (20.06, 20.07)
- polynomial regression (20.08, 20.09)

Find and discuss violations (?) of:
- linearity assumption (20.11)
    - need model first, then can test and possibly make changes
- independence assumption (20.12, 20.13)
    - multicollinearity, explored before making model
- normality assumption (20.14)
    - need model first, then can test residuals for normality
- equal variance assumption (20.15)
    - need model first, then can test

In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
import numpy as np

df = pd.read_csv('data/kc_house_data.csv')

# make subsets of the columns
num_cont = ['price', 'sqft_living', 'sqft_lot', 'sqft_above', 'sqft_basement', 'sqft_garage', 'sqft_patio']
num_disc = ['bedrooms', 'bathrooms', 'floors', 'yr_built', 'yr_renovated']
cat = ['waterfront', 'greenbelt', 'nuisance', 'view', 'condition', 'grade', 'heat_source']
ignore = ['id', 'date', 'lat', 'long', 'address']

# create sub-dfs and standardize numeric values
df_cont_std = df[num_cont].copy()
for col in df_cont_std:
    df_cont_std[col] = (df_cont_std[col] - df_cont_std[col].mean()) / df_cont_std[col].std()

df_disc_std = df[num_disc].copy()
for col in df_disc_std:
    df_disc_std[col] = (df_disc_std[col] - df_disc_std[col].mean()) / df_disc_std[col].std()

df_cat = df[cat].copy()

# What are the columns?

## Ignore
- id
- date
- lat
- long
- address

## Numeric continuous:
- **price**
- sqft_living
- sqft_lot
- sqft_above
- sqft_basement
- sqft_garage
- sqft_patio

In [3]:
df_cont_std.skew()

price             6.602907
sqft_living       1.607881
sqft_lot         21.046621
sqft_above        1.553698
sqft_basement     1.111792
sqft_garage       0.666053
sqft_patio        2.345749
dtype: float64

In [4]:
# collect the index of every row with a value more than `threshold` standard deviations above its column mean

threshold = 7.5
outliers = set()
for col in df_cont_std:
    outliers = outliers.union(set(df_cont_std[df_cont_std[col] > threshold].index))
    
len(outliers)

154

In [5]:
df_cont_std.drop(outliers).skew()

price            2.837029
sqft_living      1.205495
sqft_lot         7.182875
sqft_above       1.294544
sqft_basement    0.943686
sqft_garage      0.474270
sqft_patio       1.905713
dtype: float64

In [None]:
df_cont_std.hist(figsize=(15,10), bins="auto");

In [None]:
df_cont_std.drop(outliers).hist(figsize=(15,10), bins="auto");

In [None]:
y = df['price']
X = df_cont_std.drop('price', axis=1)

fig, axes = plt.subplots(nrows=2, ncols=3, figsize=(15,8), sharey=True)

for i, column in enumerate(X.columns):
    # Locate applicable axes
    row = i // 3
    col = i % 3
    ax = axes[row][col]
    
    # Plot feature vs. y and label axes
    ax.scatter(X[column], y, alpha=0.2)
    ax.set_xlabel(column)
    if col == 0:
        ax.set_ylabel('price')

fig.tight_layout()

In [None]:
y = df.drop(outliers)['price']
X = df_cont_std.drop(outliers).drop('price', axis=1)

fig, axes = plt.subplots(nrows=2, ncols=3, figsize=(15,8), sharey=True)

for i, column in enumerate(X.columns):
    # Locate applicable axes
    row = i // 3
    col = i % 3
    ax = axes[row][col]
    
    # Plot feature vs. y and label axes
    ax.scatter(X[column], y, alpha=0.2)
    ax.set_xlabel(column)
    if col == 0:
        ax.set_ylabel('price')

fig.tight_layout()

In [5]:
df['garage'] = df['sqft_garage'] > 0

## Numeric discrete:
- bedrooms
- bathrooms
- floors
- yr_built
- yr_renovated

In [None]:
df_disc_std.skew()

In [None]:
df_disc_std.hist(figsize=(15,10), bins="auto");

In [None]:
df_disc_std.drop(outliers).hist(figsize=(15,10), bins="auto");

In [None]:
y = df['price']
X = df_disc_std

fig, axes = plt.subplots(nrows=2, ncols=3, figsize=(15,8), sharey=True)

for i, column in enumerate(X.columns):
    # Locate applicable axes
    row = i // 3
    col = i % 3
    ax = axes[row][col]
    
    # Plot feature vs. y and label axes
    ax.scatter(X[column], y, alpha=0.2)
    ax.set_xlabel(column)
    if col == 0:
        ax.set_ylabel('price')

fig.tight_layout()

In [None]:
y = df.drop(outliers)['price']
X = df_disc_std.drop(outliers)

fig, axes = plt.subplots(nrows=2, ncols=3, figsize=(15,8), sharey=True)

for i, column in enumerate(X.columns):
    # Locate applicable axes
    row = i // 3
    col = i % 3
    ax = axes[row][col]
    
    # Plot feature vs. y and label axes
    ax.scatter(X[column], y, alpha=0.2)
    ax.set_xlabel(column)
    if col == 0:
        ax.set_ylabel('price')

fig.tight_layout()

In [None]:
df[df['yr_renovated'] > 0]['yr_renovated'].hist();

In [None]:
len(df[df['yr_renovated'] > 0])

In [7]:
df['renovated'] = df['yr_renovated'] > 0

## Categorical string:
- waterfront
- greenbelt
- nuisance
- view
- condition
- grade
- heat_source
- sewer_system

In [None]:
X = df[cat]
# one-hot encode every categorical column
X = pd.get_dummies(X, columns=cat)
# drop one reference category per variable to avoid perfectly collinear dummies
X = X.drop(['waterfront_NO', 'greenbelt_NO', 'nuisance_NO', 'view_NONE', 'condition_Average', 'grade_7 Average',
        'heat_source_Other'], axis=1)

In [None]:
y = df['price']

fig, axes = plt.subplots(nrows=5, ncols=6, figsize=(15,15), sharey=True)

for i, column in enumerate(X.columns):
    # Locate applicable axes
    row = i // 6
    col = i % 6
    ax = axes[row][col]
    
    # Plot feature vs. y and label axes
    ax.scatter(X[column], y, alpha=0.2)
    ax.set_xlabel(column)
    if col == 0:
        ax.set_ylabel('price')

fig.tight_layout()

## Categorical number:
- *none*

# Possible data cleaning measures:

Setting an outlier threshold at 7.5 standard deviations eliminates about half a percent (154) of the records, judged on the NUMERIC CONTINUOUS columns, and reduces skewness in every one of them. Lot size ('sqft_lot') stays heavily skewed even so (21.05 down to 7.18), and price is still at 2.84. There's likely a split between urban and suburban, where all the urban spaces have a lot size under a certain threshold but suburban spaces continue to vary logarithmically? Dropping the outliers also gives richer-looking scatterplots that appear more linear by eye.
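
If lot size really does vary logarithmically, a log transform should pull the skewness down much further than dropping outliers does. A small sketch of that comparison, using made-up lognormal values as a stand-in for `sqft_lot` (an assumption; the real column would need checking for zeros before taking logs):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical stand-in for sqft_lot: lognormal values are strongly right-skewed
lot = pd.Series(rng.lognormal(mean=9.0, sigma=1.0, size=5000), name="sqft_lot")

raw_skew = lot.skew()          # large and positive
log_skew = np.log(lot).skew()  # close to zero if the data are roughly lognormal

print(f"raw skew: {raw_skew:.2f}, log skew: {log_skew:.2f}")
```

Running the same two lines on `df['sqft_lot']` would show whether the guess holds for the real data.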

Skewness is not a problem with the NUMERIC DISCRETE records.

The YEAR RENOVATED column causes an issue because zero just means the home wasn't renovated. Only 1,372 homes had been renovated, fewer than 5% of the records. Perhaps set the zeros to the year the home was built instead?
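
A sketch of the fill-with-year-built idea on a three-row toy frame (the values are invented, and `yr_last_updated` is a hypothetical column name, not one from the dataset):

```python
import pandas as pd

# Toy rows: one renovated home, two never renovated (yr_renovated == 0)
demo = pd.DataFrame({
    "yr_built": [1950, 1987, 2005],
    "yr_renovated": [1995, 0, 0],
})

# Keep yr_renovated where it is a real year, otherwise fall back to yr_built
demo["yr_last_updated"] = demo["yr_renovated"].where(
    demo["yr_renovated"] > 0, demo["yr_built"]
)

print(demo["yr_last_updated"].tolist())  # [1995, 1987, 2005]
```

Writing to a new column keeps the original `yr_renovated` intact, so the boolean `renovated` flag can still be derived from it.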

Possibly RENOVATED and GARAGE (created above as boolean columns) should be new categorical variables.