##Business Problem
Stakeholder: Real Estate Agency

Business Problem:
The real estate agency wants to strengthen its sales strategy by advising homeowners on how to increase the value of their properties. Specifically, it wants to recommend home renovations that could raise the estimated value of homes in the northwestern county.

Objective:
Develop a multiple linear regression model that estimates how features a renovation can change (such as bedrooms, bathrooms, and living space) relate to the sale price of houses in the northwestern county. The agency can then use the model to give homeowners tailored advice on which renovations are likely to yield the highest return in added property value.

The provided dataset suits this problem: it records house characteristics alongside sale prices, so regression analysis can identify which factors are associated with property values in the region.



In [1]:
# Import libraries
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
import seaborn as sns

In [2]:
df = pd.read_csv('kc_house_data.csv')
df.head()

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,...,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
0,7129300520,10/13/2014,221900.0,3,1.0,1180,5650,1.0,,NONE,...,7 Average,1180,0.0,1955,0.0,98178,47.5112,-122.257,1340,5650
1,6414100192,12/9/2014,538000.0,3,2.25,2570,7242,2.0,NO,NONE,...,7 Average,2170,400.0,1951,1991.0,98125,47.721,-122.319,1690,7639
2,5631500400,2/25/2015,180000.0,2,1.0,770,10000,1.0,NO,NONE,...,6 Low Average,770,0.0,1933,,98028,47.7379,-122.233,2720,8062
3,2487200875,12/9/2014,604000.0,4,3.0,1960,5000,1.0,NO,NONE,...,7 Average,1050,910.0,1965,0.0,98136,47.5208,-122.393,1360,5000
4,1954400510,2/18/2015,510000.0,3,2.0,1680,8080,1.0,NO,NONE,...,8 Good,1680,0.0,1987,0.0,98074,47.6168,-122.045,1800,7503


In [3]:
df.describe()

Unnamed: 0,id,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,sqft_above,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
count,21597.0,21597.0,21597.0,21597.0,21597.0,21597.0,21597.0,21597.0,21597.0,17755.0,21597.0,21597.0,21597.0,21597.0,21597.0
mean,4580474000.0,540296.6,3.3732,2.115826,2080.32185,15099.41,1.494096,1788.596842,1970.999676,83.636778,98077.951845,47.560093,-122.213982,1986.620318,12758.283512
std,2876736000.0,367368.1,0.926299,0.768984,918.106125,41412.64,0.539683,827.759761,29.375234,399.946414,53.513072,0.138552,0.140724,685.230472,27274.44195
min,1000102.0,78000.0,1.0,0.5,370.0,520.0,1.0,370.0,1900.0,0.0,98001.0,47.1559,-122.519,399.0,651.0
25%,2123049000.0,322000.0,3.0,1.75,1430.0,5040.0,1.0,1190.0,1951.0,0.0,98033.0,47.4711,-122.328,1490.0,5100.0
50%,3904930000.0,450000.0,3.0,2.25,1910.0,7618.0,1.5,1560.0,1975.0,0.0,98065.0,47.5718,-122.231,1840.0,7620.0
75%,7308900000.0,645000.0,4.0,2.5,2550.0,10685.0,2.0,2210.0,1997.0,0.0,98118.0,47.678,-122.125,2360.0,10083.0
max,9900000000.0,7700000.0,33.0,8.0,13540.0,1651359.0,3.5,9410.0,2015.0,2015.0,98199.0,47.7776,-121.315,6210.0,871200.0


In [4]:
# numeric_only=True restricts the correlation to numeric columns
# (text columns such as date, waterfront, view, and grade are skipped)
correlation = df.corr(numeric_only=True)['price'].sort_values(ascending=False)
correlation


price            1.000000
sqft_living      0.701917
sqft_above       0.605368
sqft_living15    0.585241
bathrooms        0.525906
bedrooms         0.308787
lat              0.306692
floors           0.256804
yr_renovated     0.129599
sqft_lot         0.089876
sqft_lot15       0.082845
yr_built         0.053953
long             0.022036
id              -0.016772
zipcode         -0.053402
Name: price, dtype: float64

##Choosing the columns to use in the analysis
We keep the five numeric columns most strongly correlated with price: sqft_living, sqft_above, sqft_living15, bathrooms, and bedrooms.
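
The selection above can also be done programmatically. The sketch below is one possible approach, not the notebook's own method: the helper `top_correlated_features` is a hypothetical name, it ranks by absolute correlation (an assumption, so strongly negative predictors would also be kept), and the small `toy` frame is made up so the example runs without the CSV file.

```python
import pandas as pd


def top_correlated_features(data: pd.DataFrame, target: str, k: int) -> list:
    """Return the k numeric columns most correlated (in absolute value) with target."""
    # Correlate numeric columns only, then drop the target's correlation with itself.
    corr = data.corr(numeric_only=True)[target].drop(target)
    return corr.abs().sort_values(ascending=False).head(k).index.tolist()


# Tiny made-up frame for illustration only (not the King County data).
toy = pd.DataFrame({
    "price": [200, 310, 390, 520, 610],
    "sqft_living": [1000, 1500, 2000, 2500, 3000],
    "bedrooms": [2, 3, 3, 4, 4],
    "zipcode": [5, 1, 4, 2, 3],
    "grade": ["a", "b", "c", "d", "e"],  # text column, ignored by corr
})

selected = top_correlated_features(toy, target="price", k=2)
print(selected)

# Pairwise correlations among the selected predictors. Predictors that are
# strongly correlated with each other make individual regression
# coefficients harder to interpret.
print(toy[selected].corr())
```

With the real data, the call would be `top_correlated_features(df, "price", 5)`. Checking the pairwise correlations is worthwhile here because several of the chosen columns measure square footage and may overlap.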



In [5]:
df_subset = df[['sqft_living','sqft_above','sqft_living15','bathrooms','bedrooms']]
df_subset.head()

Unnamed: 0,sqft_living,sqft_above,sqft_living15,bathrooms,bedrooms
0,1180,1180,1340,1.0,3
1,2570,2170,1690,2.25,3
2,770,770,2720,1.0,2
3,1960,1050,1360,3.0,4
4,1680,1680,1800,2.0,3


##Data Preparation

In [6]:
print("Dataset info:")
df_subset.info()  # info() prints its report itself and returns None
print()  # blank line between the two reports
print("Dataset describe:")
print(df_subset.describe())

Dataset info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21597 entries, 0 to 21596
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   sqft_living    21597 non-null  int64  
 1   sqft_above     21597 non-null  int64  
 2   sqft_living15  21597 non-null  int64  
 3   bathrooms      21597 non-null  float64
 4   bedrooms       21597 non-null  int64  
dtypes: float64(1), int64(4)
memory usage: 843.8 KB
 
Dataset describe:
        sqft_living    sqft_above  sqft_living15     bathrooms      bedrooms
count  21597.000000  21597.000000   21597.000000  21597.000000  21597.000000
mean    2080.321850   1788.596842    1986.620318      2.115826      3.373200
std      918.106125    827.759761     685.230472      0.768984      0.926299
min      370.000000    370.000000     399.000000      0.500000      1.000000
25%     1430.000000   1190.000000    1490.000000      1.750000      3.000000
50%     1910.000000   1560.0000

##Modelling