# Multiple Linear Regression Modeling for Sale Price Prediction at Elite Properties Group

# 1. Business Understanding


Elite Properties Group needs a reliable model that predicts the sale prices of residential properties from their key attributes. Using historical data on features such as the number of bedrooms and bathrooms, square footage of living space, lot size, number of floors, property condition and year built, the agency aims to give clients well-founded pricing recommendations and help them maximize their return on investment.

The primary objective of this project is to build a predictive model that estimates the sale prices of residential properties listed by Elite Properties Group. With such a model, the agency can offer clients data-driven pricing advice, increasing their confidence in the sales process and supporting sound pricing strategies.

# 2. Data Understanding
To understand the dataset provided by Elite Properties Group, I will examine its features and identify the key ones to use when building models to predict sale prices. The dataset `kc_house_data.csv` contains attributes such as:
* Sale price
* Number of Bedrooms
* Number of Bathrooms
* Year Built
* Date the house was sold
* Whether the house is on a waterfront
* Square footage of living space in the home
* Number of floors (levels) in house
* Quality of view from house

## Relevance of these features
Each of these features plausibly influences the sale price of a residential property. The number of bedrooms and bathrooms indicates the property's size and functionality, both of which affect its value. The square footage of the living space measures how spacious the property is, which affects its desirability and pricing. The number of floors reflects the property's layout and can affect its appeal to potential buyers.

The year built gives the property's age, which can affect its condition, modernity, and potential maintenance costs. Taken together, these features form the basis for a model that estimates sale prices and helps Elite Properties Group give pricing advice to its clients.
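
As a sketch of what combining these features means in a multiple linear regression, the sale price is modelled as a weighted sum of the features plus an error term. The notation below is introduced here for illustration only: $y_i$ is the sale price of property $i$, $x_{ij}$ is its value for feature $j$ (for example bedrooms, bathrooms, living space, floors, year built), $\beta_j$ are the coefficients to be estimated, and $\varepsilon_i$ is the error term. The set of $p$ features actually used in the final model may differ from this list.

$$
y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip} + \varepsilon_i
$$

Under this form, each coefficient $\beta_j$ is read as the expected change in sale price for a one-unit increase in feature $j$ while the other features are held fixed.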


# 3. Data Preparation
We first import the relevant packages for this analysis.

In [6]:
# importing the relevant packages
import numpy as np
import scipy.stats as stats
import statsmodels.api as sm
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
plt.style.use('seaborn-v0_8-whitegrid')

# enabling plotting of visualizations in the notebook
%matplotlib inline

## a. Previewing the dataset
We now load the dataset `kc_house_data.csv`, stored in the `data` folder. It contains the features needed for our modelling.

In [9]:
# loading the data using pandas and storing to variable df
df = pd.read_csv('data/kc_house_data.csv')

# previewing the first 10 rows
df.head(10)

|  | id | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | ... | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7129300520 | 10/13/2014 | 221900.0 | 3 | 1.0 | 1180 | 5650 | 1.0 | NaN | NONE | ... | 7 Average | 1180 | 0.0 | 1955 | 0.0 | 98178 | 47.5112 | -122.257 | 1340 | 5650 |
| 1 | 6414100192 | 12/9/2014 | 538000.0 | 3 | 2.25 | 2570 | 7242 | 2.0 | NO | NONE | ... | 7 Average | 2170 | 400.0 | 1951 | 1991.0 | 98125 | 47.721 | -122.319 | 1690 | 7639 |
| 2 | 5631500400 | 2/25/2015 | 180000.0 | 2 | 1.0 | 770 | 10000 | 1.0 | NO | NONE | ... | 6 Low Average | 770 | 0.0 | 1933 | NaN | 98028 | 47.7379 | -122.233 | 2720 | 8062 |
| 3 | 2487200875 | 12/9/2014 | 604000.0 | 4 | 3.0 | 1960 | 5000 | 1.0 | NO | NONE | ... | 7 Average | 1050 | 910.0 | 1965 | 0.0 | 98136 | 47.5208 | -122.393 | 1360 | 5000 |
| 4 | 1954400510 | 2/18/2015 | 510000.0 | 3 | 2.0 | 1680 | 8080 | 1.0 | NO | NONE | ... | 8 Good | 1680 | 0.0 | 1987 | 0.0 | 98074 | 47.6168 | -122.045 | 1800 | 7503 |
| 5 | 7237550310 | 5/12/2014 | 1230000.0 | 4 | 4.5 | 5420 | 101930 | 1.0 | NO | NONE | ... | 11 Excellent | 3890 | 1530.0 | 2001 | 0.0 | 98053 | 47.6561 | -122.005 | 4760 | 101930 |
| 6 | 1321400060 | 6/27/2014 | 257500.0 | 3 | 2.25 | 1715 | 6819 | 2.0 | NO | NONE | ... | 7 Average | 1715 | ? | 1995 | 0.0 | 98003 | 47.3097 | -122.327 | 2238 | 6819 |
| 7 | 2008000270 | 1/15/2015 | 291850.0 | 3 | 1.5 | 1060 | 9711 | 1.0 | NO | NaN | ... | 7 Average | 1060 | 0.0 | 1963 | 0.0 | 98198 | 47.4095 | -122.315 | 1650 | 9711 |
| 8 | 2414600126 | 4/15/2015 | 229500.0 | 3 | 1.0 | 1780 | 7470 | 1.0 | NO | NONE | ... | 7 Average | 1050 | 730.0 | 1960 | 0.0 | 98146 | 47.5123 | -122.337 | 1780 | 8113 |
| 9 | 3793500160 | 3/12/2015 | 323000.0 | 3 | 2.5 | 1890 | 6560 | 2.0 | NO | NONE | ... | 7 Average | 1890 | 0.0 | 2003 | 0.0 | 98038 | 47.3684 | -122.031 | 2390 | 7570 |
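
The preview already hints at two data-quality issues: empty cells in `waterfront`, `view` and `yr_renovated`, and a `?` in `sqft_basement`. The following is a minimal sketch of how such issues could be surfaced programmatically. It retypes a few columns of the ten preview rows by hand so that it runs on its own; the names `sample`, `as_number` and `placeholders` are illustrative and not part of the notebook.

```python
import io
import pandas as pd

# A few columns from the ten preview rows, retyped by hand for illustration.
sample = pd.read_csv(io.StringIO("""id,price,waterfront,view,sqft_basement,yr_renovated
7129300520,221900.0,,NONE,0.0,0.0
6414100192,538000.0,NO,NONE,400.0,1991.0
5631500400,180000.0,NO,NONE,0.0,
2487200875,604000.0,NO,NONE,910.0,0.0
1954400510,510000.0,NO,NONE,0.0,0.0
7237550310,1230000.0,NO,NONE,1530.0,0.0
1321400060,257500.0,NO,NONE,?,0.0
2008000270,291850.0,NO,,0.0,0.0
2414600126,229500.0,NO,NONE,730.0,0.0
3793500160,323000.0,NO,NONE,0.0,0.0
"""))

# Column types: a numeric-looking column that is not stored as a number
# usually contains a non-numeric placeholder somewhere.
print(sample.dtypes)

# Entries of sqft_basement that are present but cannot be parsed as numbers.
as_number = pd.to_numeric(sample["sqft_basement"], errors="coerce")
placeholders = sample.loc[
    as_number.isna() & sample["sqft_basement"].notna(), "sqft_basement"
]
print(placeholders.tolist())
```

On the full dataset the same two checks could be run on `df` directly; whether other columns contain similar placeholders is not visible from the preview alone.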


## b. Removing features which are not necessary
Some of the columns in this dataset contain information that is not relevant for modelling. Before starting data cleaning, I'll remove these columns so that only the most relevant ones remain.

Columns to be removed are: `date`, `view`, `sqft_above`, `sqft_basement`, `yr_renovated`, `zipcode`, `lat`, `long`, `sqft_living15` and `sqft_lot15`.

In [11]:
# dropping columns and assigning to new variable data
data = df.drop(columns=['date', 'view', 'sqft_above', 'sqft_basement', 'yr_renovated',
                        'zipcode', 'lat', 'long', 'sqft_living15', 'sqft_lot15'])

# previewing the first 10 rows
data.head(10)

|  | id | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | condition | grade | yr_built |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7129300520 | 221900.0 | 3 | 1.0 | 1180 | 5650 | 1.0 | NaN | Average | 7 Average | 1955 |
| 1 | 6414100192 | 538000.0 | 3 | 2.25 | 2570 | 7242 | 2.0 | NO | Average | 7 Average | 1951 |
| 2 | 5631500400 | 180000.0 | 2 | 1.0 | 770 | 10000 | 1.0 | NO | Average | 6 Low Average | 1933 |
| 3 | 2487200875 | 604000.0 | 4 | 3.0 | 1960 | 5000 | 1.0 | NO | Very Good | 7 Average | 1965 |
| 4 | 1954400510 | 510000.0 | 3 | 2.0 | 1680 | 8080 | 1.0 | NO | Average | 8 Good | 1987 |
| 5 | 7237550310 | 1230000.0 | 4 | 4.5 | 5420 | 101930 | 1.0 | NO | Average | 11 Excellent | 2001 |
| 6 | 1321400060 | 257500.0 | 3 | 2.25 | 1715 | 6819 | 2.0 | NO | Average | 7 Average | 1995 |
| 7 | 2008000270 | 291850.0 | 3 | 1.5 | 1060 | 9711 | 1.0 | NO | Average | 7 Average | 1963 |
| 8 | 2414600126 | 229500.0 | 3 | 1.0 | 1780 | 7470 | 1.0 | NO | Average | 7 Average | 1960 |
| 9 | 3793500160 | 323000.0 | 3 | 2.5 | 1890 | 6560 | 2.0 | NO | Average | 7 Average | 2003 |
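
An equivalent way to express this step is to list the columns to keep rather than the columns to drop, which makes the modelling inputs explicit. The sketch below compares the two approaches on an empty stand-in frame that carries only the column labels. It assumes that `condition` is the one column hidden behind the `...` in the first preview, since it appears in `data` but not in the drop list; `raw_stub`, `to_drop` and `to_keep` are illustrative names.

```python
import pandas as pd

# Column labels of the raw frame, taken from the two previews.
# Assumption: `condition` is the column hidden behind "..." in the first preview.
raw_columns = [
    "id", "date", "price", "bedrooms", "bathrooms", "sqft_living", "sqft_lot",
    "floors", "waterfront", "view", "condition", "grade", "sqft_above",
    "sqft_basement", "yr_built", "yr_renovated", "zipcode", "lat", "long",
    "sqft_living15", "sqft_lot15",
]
raw_stub = pd.DataFrame(columns=raw_columns)  # empty stand-in; only labels matter

to_drop = ["date", "view", "sqft_above", "sqft_basement", "yr_renovated",
           "zipcode", "lat", "long", "sqft_living15", "sqft_lot15"]
to_keep = ["id", "price", "bedrooms", "bathrooms", "sqft_living", "sqft_lot",
           "floors", "waterfront", "condition", "grade", "yr_built"]

dropped = raw_stub.drop(columns=to_drop)  # the approach used in the cell above
kept = raw_stub[to_keep]                  # the keep-list alternative

# Both routes should leave the same columns in the same order.
same_columns = list(dropped.columns) == list(kept.columns)
print(same_columns)
```

Both forms raise a `KeyError` if a listed column is missing from the frame, so a typo in either list surfaces immediately.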


## c. Data Cleaning
Some of the columns may contain missing values, duplicates or incorrect formatting. In this stage, we clean each column to deal with any duplicates or missing values present. We start by checking each column for missing values and handling them.
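
A minimal sketch of these checks is shown below. It uses a hand-typed copy of selected columns from the ten preview rows of `data` so that it runs on its own; `preview`, `missing_counts`, `n_duplicate_rows` and `n_repeated_ids` are illustrative names. On the full dataset the same calls could be applied to `data` directly, and the counts will differ from those seen on ten rows.

```python
import io
import pandas as pd

# Selected columns from the ten preview rows of `data`, retyped for illustration.
preview = pd.read_csv(io.StringIO("""id,price,bedrooms,waterfront,condition
7129300520,221900.0,3,,Average
6414100192,538000.0,3,NO,Average
5631500400,180000.0,2,NO,Average
2487200875,604000.0,4,NO,Very Good
1954400510,510000.0,3,NO,Average
7237550310,1230000.0,4,NO,Average
1321400060,257500.0,3,NO,Average
2008000270,291850.0,3,NO,Average
2414600126,229500.0,3,NO,Average
3793500160,323000.0,3,NO,Average
"""))

# Missing values per column, with the most affected columns first.
missing_counts = preview.isna().sum().sort_values(ascending=False)
print(missing_counts)

# Fully duplicated rows, and property ids that appear more than once.
n_duplicate_rows = int(preview.duplicated().sum())
n_repeated_ids = int(preview.duplicated(subset="id").sum())
print(n_duplicate_rows, n_repeated_ids)
```

Checking `id` separately from full-row duplicates is a design choice: a repeated `id` with a different price would not show up as a duplicated row, but it may still need a decision during cleaning.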