# Housing Data Analysis with Linear Regression

## Overview

For this project I'll use the King County housing dataset to address a housing-related business problem with linear regression.
This notebook follows the structure of the CRISP-DM data science process.

## Business Understanding

A housing development company is working on new model homes. They want to design houses that will sell to middle-class buyers, who are currently facing a shortage of available inventory due to the increase in demand during 2020.

The median national home listing price grew 13.4% year over year, reaching $340,000 in December 2020. The developers would like to identify which features a home should include at that price level.

https://www.realtor.com/research/december-2020-data/

## Data Understanding

This data includes sale prices and house conditions for homes sold between 2014 and 2015 in the Seattle area.

https://info.kingcounty.gov/assessor/esales/Glossary.aspx?type=r

- What are the properties of the variables you intend to use?

The median price in the dataset is $450K, well above the $340K target, so I'll have to restrict the data I use for my model to a defined price range.
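As a rough sketch of what defining that range could look like, the snippet below keeps only sales within a band around the $340K target. The toy DataFrame, the `TARGET_PRICE` and `BAND` names, and the 20% width are all assumptions for illustration, not values chosen in this notebook.

```python
import pandas as pd

# Toy stand-in for the housing data (hypothetical values).
toy = pd.DataFrame({
    "price": [150_000, 280_000, 340_000, 395_000, 450_000, 1_200_000],
    "sqft_living": [800, 1400, 1750, 2000, 2300, 5200],
})

# Assumed band around the target listing price; tune to the business need.
TARGET_PRICE = 340_000
BAND = 0.20  # keep sales within +/-20% of the target

lower = TARGET_PRICE * (1 - BAND)
upper = TARGET_PRICE * (1 + BAND)

# between() is inclusive on both ends by default.
in_range = toy[toy["price"].between(lower, upper)].copy()
print(f"Kept {len(in_range)} of {len(toy)} rows between {lower:,.0f} and {upper:,.0f}")
```

Taking a `.copy()` of the filtered rows avoids chained-assignment surprises if new columns are added to the subset later.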

I'm going to start by importing *all* my libraries and loading the data into a DataFrame.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

In [None]:
data = pd.read_csv('data/kc_house_data.csv')
# data.info()
# data.describe()

## Data Preparation

Previewing the data, I can see there are some null values I'll have to deal with. These come from `yr_renovated`, `view`, and `waterfront`. It also looks like `sqft_basement` is stored as an object, not a number. On closer inspection, many of its rows contain a '?'. I'm going to fill the nulls and examine the data more closely.
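One possible way to run these checks is sketched below on a toy DataFrame (the values are made up). It counts nulls per column, counts the '?' placeholders, and converts `sqft_basement` to a numeric type. Treating '?' as missing via `errors="coerce"` is an assumption here, not a decision already made in this notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in with the same kinds of problems (hypothetical values).
toy = pd.DataFrame({
    "sqft_basement": ["0.0", "400.0", "?", "910.0"],
    "waterfront": [0.0, np.nan, 1.0, np.nan],
})

# Nulls per column.
null_counts = toy.isna().sum()

# Rows where the basement size is the '?' placeholder.
n_placeholder = (toy["sqft_basement"] == "?").sum()

# Convert to numeric; anything unparseable (the '?') becomes NaN.
toy["sqft_basement"] = pd.to_numeric(toy["sqft_basement"], errors="coerce")

print(null_counts)
print("placeholders:", n_placeholder)
print(toy.dtypes)
```

After the conversion the former '?' rows show up as ordinary nulls, so they can be filled or dropped with the same tools as the other columns.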

For Waterfront and View:

 - Only 146 houses are coded as waterfront. I don't want to drop the rows with null values because there are over 2K of them, so I am going to fill the nulls with 0, since that won't affect the distribution much.
 - The view column has 63 null values. This column describes how many times a house has been viewed (not the views from the house), which I don't see as an important feature because there are many reasons a house could or could not have been viewed a given number of times. I am going to fill these nulls with 0, assuming those houses were not available to view.

- Were there variables you dropped or created?
- How did you address missing values or outliers?
- Why are these choices appropriate given the data and the business problem?

In [None]:
# Assign the result back; inplace=True on a selected column may not update `data`
data['waterfront'] = data['waterfront'].fillna(0.0)
data['view'] = data['view'].fillna(0.0)

For year renovated, I am going to assume the nulls represent houses that were not renovated. I'm going to fill the nulls with 0, but also make a binary indicator column that just tells me whether the house was or was not renovated instead of what year. I'll keep both columns for now.

In [None]:
# Assign the result back; inplace=True on a selected column may not update `data`
data['yr_renovated'] = data['yr_renovated'].fillna(0.0)

# new column: 1 where the house was renovated, 0 otherwise
# (vectorized comparison replaces the row-by-row loop and its chained assignment)
data['is_renovated'] = (data['yr_renovated'] != 0.0).astype(int)

### First $&(@# Model

Before going too far down the data preparation rabbit hole, be sure to check your work against a first 'substandard' model!

At this point, you can also consider what a baseline, model-less prediction might look like, and begin evaluating this model compared to that baseline.

In [None]:
# code here for your first 'substandard' model
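A minimal sketch of a model-less baseline next to a first model is shown below. It runs on synthetic data, so the numbers say nothing about the King County results. The single `sqft` feature and the data-generating formula are assumptions for illustration; `DummyRegressor` simply predicts the training mean for every house.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in: price driven by living area plus noise (assumed relationship).
rng = np.random.default_rng(42)
sqft = rng.uniform(800, 3000, size=200)
price = 100_000 + 120 * sqft + rng.normal(0, 20_000, size=200)

X = sqft.reshape(-1, 1)
X_train, X_test, y_train, y_test = train_test_split(
    X, price, test_size=0.25, random_state=42
)

# Baseline: always predict the mean training price.
baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)

# First simple model: one-feature linear regression.
first_model = LinearRegression().fit(X_train, y_train)

baseline_rmse = np.sqrt(mean_squared_error(y_test, baseline.predict(X_test)))
model_rmse = np.sqrt(mean_squared_error(y_test, first_model.predict(X_test)))

print(f"Baseline RMSE:    {baseline_rmse:,.0f}")
print(f"First model RMSE: {model_rmse:,.0f}")
```

Reporting both errors on the same held-out split makes it clear how much of the model's performance comes from the features rather than from simply knowing the average price.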

## Modeling

Describe and justify the process for analyzing or modeling the data.

Questions to consider:

- How did you analyze or model the data?
- How did you iterate on your initial approach to make it better?
- Why are these choices appropriate given the data and the business problem?

In [None]:
# code here to do your second, more refined model

In [None]:
# code here to iteratively improve your models
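One way to structure the iteration is to score each candidate feature set with the same cross-validation setup, as in the sketch below. The data is synthetic, and the feature sets and coefficients are assumptions chosen only to make the example run. The column names mirror ones discussed earlier in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the prepared data (assumed relationships).
rng = np.random.default_rng(0)
n = 300
toy = pd.DataFrame({
    "sqft_living": rng.uniform(800, 3000, n),
    "bedrooms": rng.integers(1, 6, n),
    "is_renovated": rng.integers(0, 2, n),
})
toy["price"] = (
    80_000
    + 110 * toy["sqft_living"]
    + 15_000 * toy["is_renovated"]
    + rng.normal(0, 15_000, n)
)

# Candidate feature sets to compare, from simplest to fullest.
feature_sets = {
    "sqft only": ["sqft_living"],
    "sqft + bedrooms": ["sqft_living", "bedrooms"],
    "all features": ["sqft_living", "bedrooms", "is_renovated"],
}

scores = {}
for name, cols in feature_sets.items():
    # Scaling inside the pipeline keeps each fold's scaler fit on training data only.
    pipe = make_pipeline(StandardScaler(), LinearRegression())
    cv_r2 = cross_val_score(pipe, toy[cols], toy["price"], cv=5, scoring="r2")
    scores[name] = cv_r2.mean()

for name, score in scores.items():
    print(f"{name:>16}: mean CV R^2 = {score:.3f}")
```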

## Evaluation

Evaluate how well your work solves the stated business problem.

Questions to consider:

- How do you interpret the results?
- How well does your model fit your data? How much better is this than your baseline model?
- How well does your model/data fit any modeling assumptions?
- How confident are you that your results would generalize beyond the data you have?
- How confident are you that this model would benefit the business if put into use?
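For the question about modeling assumptions, a few quick numeric checks on the residuals can be sketched as follows. The data is synthetic and the checks are deliberately rough; they are an assumption about how one might start, and residual plots would normally accompany them.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in with constant-variance noise (assumed relationship).
rng = np.random.default_rng(1)
X = rng.uniform(800, 3000, size=(250, 1))
y = 90_000 + 115 * X[:, 0] + rng.normal(0, 18_000, size=250)

model = LinearRegression().fit(X, y)
fitted = model.predict(X)
residuals = y - fitted

# With an intercept, training residuals should average to roughly zero.
resid_mean = residuals.mean()

# Rough constant-variance check: does the residual spread grow with the fitted value?
spread_corr = np.corrcoef(fitted, np.abs(residuals))[0, 1]

# Rough symmetry check: sample skewness of the residuals (0 for a symmetric distribution).
skew = np.mean((residuals - resid_mean) ** 3) / residuals.std() ** 3

print(f"mean residual:          {resid_mean:.4f}")
print(f"corr(fitted, |resid|):  {spread_corr:.3f}")
print(f"residual skewness:      {skew:.3f}")
```

A correlation far from zero between fitted values and absolute residuals would hint at heteroscedasticity, and a large skewness would suggest trying a transformation of the target such as a log.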

Please note - you should be evaluating each model as you move through, and be sure to evaluate your models consistently.
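To keep evaluation consistent across models, the same metrics can be computed by one helper on the same splits. The `evaluate` function below is a hypothetical helper, not part of scikit-learn, and the data is synthetic.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def evaluate(model, X_train, X_test, y_train, y_test):
    """Return RMSE and R^2 for a fitted model on both the train and test splits."""
    results = {}
    for split, X, y in [("train", X_train, y_train), ("test", X_test, y_test)]:
        preds = model.predict(X)
        results[f"{split}_rmse"] = float(np.sqrt(mean_squared_error(y, preds)))
        results[f"{split}_r2"] = float(r2_score(y, preds))
    return results


# Synthetic stand-in data (assumed relationship between size and price).
rng = np.random.default_rng(7)
X = rng.uniform(800, 3000, size=(200, 1))
y = 100_000 + 120 * X[:, 0] + rng.normal(0, 20_000, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=7
)

model = LinearRegression().fit(X_train, y_train)
results = evaluate(model, X_train, X_test, y_train, y_test)
for metric, value in results.items():
    print(f"{metric}: {value:,.3f}")
```

Comparing the train and test values side by side also gives a quick read on overfitting: a large gap between them suggests the model will not generalize well.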

## Conclusions

Provide your conclusions about the work you've done, including any limitations or next steps.

Questions to consider:

- What would you recommend the business do as a result of this work?
- What are some reasons why your analysis might not fully solve the business problem?
- What else could you do in the future to improve this project (future work)?
