# Model v1 - dropping features

So far we have been keeping a list of features to drop, based on:
- the p-value of each coefficient in the OLS baseline model, and
- the multicollinearity test that was performed.

We also need to add date and price to that list (price is the target, so it cannot stay in the feature matrix) to ensure that all necessary columns are dropped for our next model.

In [None]:
features_to_drop.extend(['date','price'])

In [None]:
X = df.drop(features_to_drop, axis=1)
y = df.price

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                   random_state=0)

model_v1 = LinearRegression()
model_v1.fit(X_train,y_train)

splitter = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)

baseline_scores = cross_validate(estimator=model_v1, X=X_train,
                                 y=y_train, return_train_score=True, 
                                 cv=splitter)

print('------------------------------------')
print('Cross Validation Scores on X_train')
print('Train score:', baseline_scores['train_score'].mean())
print('Test score:', baseline_scores['test_score'].mean())

In [None]:
X_train = sm.add_constant(X_train)
model_v1_results = sm.OLS(y_train, X_train).fit()
model_v1_results.summary()

In [None]:
model_v1_df = pd.DataFrame(model_v1_results.pvalues.sort_values(ascending=True))

Now we can pull out all features whose coefficient p-value is greater than the threshold (0.05).

In [None]:
high_pvalues = model_v1_df[model_v1_df[0] > 0.05].reset_index()
high_pvalues.columns = ['feature', 'p_value']
high_pvalues
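
The p-value filter above can also be applied directly to the Series of p-values. The following is a minimal sketch on made-up numbers, not the notebook's actual results: the feature names, the p-values, and the helper `high_pvalue_features` are all hypothetical. It also sets the intercept aside, on the assumption that a 'const' entry (added by `sm.add_constant`) should never end up in a list of columns to drop from `df`.

```python
import pandas as pd

# Hypothetical p-values, shaped like `results.pvalues` from a statsmodels OLS fit
pvalues = pd.Series({
    'const': 0.30,
    'sqft_living': 0.0001,
    'floors': 0.42,
    'zip_98001': 0.08,
    'grade': 0.03,
})


def high_pvalue_features(pvalues, threshold=0.05, intercept='const'):
    """Return a feature/p_value table for coefficients above the threshold."""
    # The intercept is not a column of df, so leave it out of the candidates
    candidates = pvalues.drop(labels=intercept, errors='ignore')
    high = candidates[candidates > threshold].sort_values()
    return high.rename_axis('feature').reset_index(name='p_value')


high_pvalues_demo = high_pvalue_features(pvalues)
print(high_pvalues_demo)
```

Without the intercept step, a 'const' p-value above 0.05 would be added to `features_to_drop`, and the next `df.drop(features_to_drop, axis=1)` would raise a `KeyError` because `df` has no such column.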

In [None]:
test_df = X.corr().abs().stack().reset_index().sort_values(0,ascending=False)
test_df['pairs'] = list(zip(test_df.level_0, test_df.level_1))
test_df.set_index(['pairs'], inplace=True)
test_df.drop(['level_0', 'level_1'], axis=1, inplace=True)
test_df.columns = ['mc']
test_df[(test_df.mc > 0.75) & (test_df.mc < 1)]
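
The table above lists every correlated pair twice, once as (a, b) and once as (b, a). As a sketch of one alternative, the helper below keeps only the upper triangle of the correlation matrix so each pair appears once. The function name `correlated_pairs` and the toy data are assumptions made for illustration; the 0.75 threshold is the one used in this notebook.

```python
import numpy as np
import pandas as pd


def correlated_pairs(X, threshold=0.75):
    """Return each pair of columns whose absolute correlation exceeds threshold."""
    corr = X.corr().abs()
    # True strictly above the diagonal: drops self-correlations and mirrored pairs
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack()
    # NaN comparisons are False, so masked cells are filtered out here as well
    return pairs[pairs > threshold].sort_values(ascending=False)


# Toy data: 'b' is close to 2 * 'a', while 'c' is only weakly related to either
X_demo = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [2, 4, 6, 8, 11],
    'c': [5, 1, 4, 2, 3],
})
pairs_demo = correlated_pairs(X_demo)
print(pairs_demo)
```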

In [None]:
features_to_drop.extend(list(high_pvalues.feature))

# Model v2 - second round of dropping features

In [None]:
X = df.drop(features_to_drop, axis=1)
y = df.price

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                   random_state=0)

model_v2 = LinearRegression()
model_v2.fit(X_train,y_train)

splitter = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)

baseline_scores = cross_validate(estimator=model_v2, X=X_train,
                                 y=y_train, return_train_score=True, 
                                 cv=splitter)

print('------------------------------------')
print('Cross Validation Scores on X_train')
print('Train score:', baseline_scores['train_score'].mean())
print('Test score:', baseline_scores['test_score'].mean())
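
The split, fit, and cross-validate steps are repeated unchanged for each model version, so they could be wrapped in a helper. This is only a sketch: the name `fit_and_score` and the synthetic data are hypothetical, and the defaults simply mirror the settings used in the cells above. Note that the 'test_score' returned by `cross_validate` is computed on folds of X_train, so it is a validation score rather than a score on the held-out X_test.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import ShuffleSplit, cross_validate, train_test_split


def fit_and_score(X, y, n_splits=5, test_size=0.2, random_state=0):
    """Fit a linear model and return it with its mean cross-validation R^2 scores."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state)
    model = LinearRegression().fit(X_train, y_train)
    splitter = ShuffleSplit(n_splits=n_splits, test_size=test_size,
                            random_state=random_state)
    scores = cross_validate(model, X_train, y_train, cv=splitter,
                            return_train_score=True)
    return model, {'train': scores['train_score'].mean(),
                   'validation': scores['test_score'].mean()}


# Synthetic data standing in for the housing features (an assumption)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

model_demo, cv_scores = fit_and_score(X_demo, y_demo)
print(cv_scores)
```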

In [None]:
X_train = sm.add_constant(X_train)
model_v2_results = sm.OLS(y_train, X_train).fit()
model_v2_results.summary()

In [None]:
model_v2_df = pd.DataFrame(model_v2_results.pvalues.sort_values(ascending=True))

Now we can pull out all features whose coefficient p-value is greater than the threshold (0.05).

In [None]:
high_pvalues = model_v2_df[model_v2_df[0] > 0.05].reset_index()
high_pvalues.columns = ['feature', 'p_value']
high_pvalues

In [None]:
test_df = X.corr().abs().stack().reset_index().sort_values(0,ascending=False)
test_df['pairs'] = list(zip(test_df.level_0, test_df.level_1))
test_df.set_index(['pairs'], inplace=True)
test_df.drop(['level_0', 'level_1'], axis=1, inplace=True)
test_df.columns = ['mc']
test_df[(test_df.mc > 0.75) & (test_df.mc < 1)]

In [None]:
features_to_drop.extend(['zip_98006'])

# Removing outliers

## by 'sqft_living'

During EDA, boxplots showed that this feature has many outliers. We will identify and remove those homes from the dataset and see whether our model becomes more accurate at predicting price for a typical home.

In [None]:
df_copy = df.copy()

In [None]:
df_copy.sqft_living.describe()

In [None]:
upper_limit = df_copy.sqft_living.mean() + 3*df_copy.sqft_living.std()
upper_limit

If we define an outlier as any value more than 3 standard deviations above the mean, then we should drop all homes with a square footage over 4,834.

In [None]:
df_copy = df_copy[df_copy.sqft_living <= upper_limit]
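
The same rule can be expressed as a small reusable function that also reports the cutoff. This is a sketch under stated assumptions: the helper name `drop_upper_outliers` and the toy values are hypothetical and only illustrate the mean plus 3 standard deviations rule described above.

```python
import pandas as pd


def drop_upper_outliers(frame, column, n_std=3):
    """Keep rows whose value is at most mean + n_std * std for the given column."""
    upper_limit = frame[column].mean() + n_std * frame[column].std()
    kept = frame[frame[column] <= upper_limit]
    return kept, upper_limit


# Toy data: twenty ordinary homes and one very large one
demo = pd.DataFrame({'sqft_living': [1000] * 10 + [1200] * 10 + [20000]})
kept_demo, limit_demo = drop_upper_outliers(demo, 'sqft_living')
print(len(demo) - len(kept_demo), 'row(s) removed; upper limit:', round(limit_demo, 1))
```

Because the mean and standard deviation are themselves pulled upward by extreme values, the cutoff depends on the outliers it is meant to remove; a quantile-based limit would be one alternative to compare against.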

In [None]:
X = df_copy.drop(features_to_drop, axis=1)
y = df_copy.price

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                   random_state=0)

model_v3 = LinearRegression()
model_v3.fit(X_train,y_train)

splitter = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)

baseline_scores = cross_validate(estimator=model_v3, X=X_train,
                                 y=y_train, return_train_score=True, 
                                 cv=splitter)

print('------------------------------------')
print('Cross Validation Scores on X_train')
print('Train score:', baseline_scores['train_score'].mean())
print('Test score:', baseline_scores['test_score'].mean())

# Feature Engineering

Other than 'date' and 'price', these are the features dropped through the first two OLS linear regression models.

In [None]:
sorted(features_to_drop)

## Renovation

Set up new dataframe 'renovation_features' to hold engineered features before joining in with original features.

We are first working on the renovation status, so we will need some features: yr_built, yr_renovated, and sale_year

In [None]:
renovation_features = df[['yr_built', 'yr_renovated', 'sale_year']].copy()

From research on renovations, it seems that a general industry standard is to consider a home renovated if the renovation took place within the last 15 years. We are engineering a new binary feature 'is_renovated': 1 if the home was renovated within 15 years of the sale, and 0 if it was either never renovated or renovated more than 15 years before the sale.

In [None]:
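def renovation_status(row):
    # A yr_renovated of 0 means the home was never renovated
    if row.yr_renovated == 0.0:
        return 0
    # Renovations more than 15 years before the sale do not count
    if (row.sale_year - row.yr_renovated) > 15:
        return 0
    return 1

renovation_features['is_renovated'] = renovation_features.apply(renovation_status, axis=1)
# Keep only the engineered feature
renovation_features.drop(columns=['yr_built', 'yr_renovated', 'sale_year'], inplace=True)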

In [None]:
renovation_features.is_renovated.value_counts()

Our renovation feature is now a binary flag marking homes that were renovated within 15 years of the sale.

> The feature resides in a dataframe 'renovation_features' so that we can concat it into our next iteration of features to be modeled

## Basement

The sqft_basement feature poses some interesting questions. First, we want a column indicating whether a home has a basement at all.

In [None]:
basement_features = df[['sqft_living', 'sqft_basement']].copy()

In [None]:
basement_features

In [None]:
basement_features['has_basement'] = basement_features.sqft_basement.map(lambda x: 1 if x > 0 else 0)

In [None]:
basement_features

In [None]:
def percent_basement(row):
    # Homes without a basement get 0 percent
    if row.has_basement == 0:
        return 0
    # Share of the living area that is basement, as a percentage
    return round(((row.sqft_basement / row.sqft_living) * 100), 2)

In [None]:
basement_features['basement_percent'] = basement_features.apply(percent_basement, axis=1)
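
Both basement features can also be computed column-wise without `apply`. This is a minimal sketch on made-up values, assuming sqft_living is never zero; a home with no basement gets 0 percent automatically because 0 divided by any living area is 0.

```python
import pandas as pd

# Made-up homes: no basement, one third basement, one quarter basement
demo = pd.DataFrame({
    'sqft_living': [2000, 1500, 3000],
    'sqft_basement': [0, 500, 750],
})

# 1 if the home has any basement area, else 0
demo['has_basement'] = (demo['sqft_basement'] > 0).astype(int)

# Percent of living area that is basement, rounded to two decimals
demo['basement_percent'] = (demo['sqft_basement'] / demo['sqft_living'] * 100).round(2)
print(demo)
```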

In [None]:
basement_features.has_basement.value_counts(normalize=True)

In [None]:
basement_features[basement_features.basement_percent > 0].basement_percent.describe()

In [None]:
sns.histplot(data=basement_features[basement_features.basement_percent > 0], x='basement_percent')

In our dataset, 60% of homes have no basement at all.

Of those that have basements, the above histogram shows the binned percent of sqft_living that is made up of sqft_basement.

It looks like we have a case for a feature that captures what percent of the home is basement.

In [None]:
basement_features.drop(columns=['sqft_living', 'sqft_basement', 'has_basement'], inplace=True)

In [None]:
basement_features

## Zipcode

In [None]:
zipcode_features = raw_data[['zipcode']].copy()

There are 70 different zipcode values in our dataset. We want to try to create a new feature that takes zipcodes and matches them up with cities in our area.

In [None]:
# https://zipdatamaps.com/king-wa-county-zipcodes
zip_city = pd.read_csv('data/zip_city.csv')

zipcode_features = pd.merge(left=zipcode_features, right=zip_city, on='zipcode', how='left')

In [None]:
zipcode_features[zipcode_features.zipcode == 98005]

zipcode_features can be merged into the larger dataframe to associate a home with a city instead of just a zipcode.

Let's explore this a bit more.

In [None]:
homes_by_zip = zipcode_features.groupby('city').count()
homes_by_zip.reset_index(inplace=True)
homes_by_zip.columns = ['city', 'count']

In [None]:
homes_by_zip

In [None]:
homes_by_zip.sort_values(by='count', ascending=False)

So instead of 70 different zipcodes for a feature, we can have 24 different city names.
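
As a sanity check on this kind of lookup, the sketch below repeats the left merge on toy frames and looks for zipcodes that found no city. The frames are made up for illustration and are not the contents of zip_city.csv; `validate='many_to_one'` is an optional safeguard that makes pandas raise an error if a zipcode appears more than once in the lookup table.

```python
import pandas as pd

# Toy stand-ins for the homes and the zipcode-to-city lookup (assumed values)
homes = pd.DataFrame({'zipcode': [98005, 98005, 98101, 98052, 99999]})
zip_city_demo = pd.DataFrame({
    'zipcode': [98005, 98052, 98101],
    'city': ['Bellevue', 'Redmond', 'Seattle'],
})

# Left merge keeps every home; validate guards against duplicate lookup rows
merged = homes.merge(zip_city_demo, on='zipcode', how='left',
                     validate='many_to_one')

# Zipcodes with no matching city show up as NaN after a left merge
unmatched = merged.loc[merged['city'].isna(), 'zipcode'].unique()

# Number of homes per city, largest first
homes_per_city = merged['city'].value_counts()

print('Unmatched zipcodes:', list(unmatched))
print(homes_per_city)
```

A left merge never drops homes, so any zipcode missing from the lookup would otherwise pass silently as a home with no city.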