## Final Project Submission

Please fill out:

- Student names:
    - Billy Mwangi
    - Lynne Mutwiri
    - Sharon Kimani
    - Susan Kanyora
    - Kellen Kinya 
    - Derrick Wekesa
- Student pace: full time
- Scheduled project review date/time:
- Instructor name:
- Blog post URL:


# Data understanding


In [None]:
# Importing relevant libraries
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix
import seaborn as sns
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder
import statsmodels.api as sm
from sklearn.metrics import r2_score, mean_squared_error
import folium
from datetime import datetime
pd.options.display.float_format = '{:.2f}'.format

## Loading the Dataset


In [None]:
kc_data= pd.read_csv('data/kc_house_data.csv')
kc_data.head()

In [None]:
#getting basic information about the data
kc_data.info()

The dataset has 21 columns:

- 6 categorical and 15 numerical columns.
- It has a total of 21597 rows. Columns with a non-null count below 21597 contain missing values.


In [None]:
#getting general summary statistics  on the data
kc_data.describe()

## Data Pre-processing


Data pre-processing involves cleaning, transforming or dropping data before it is used, so that the model performs well.


### Identifying and dealing with missing values


In [None]:
def missing_values(data):
    """Return the count and percentage of missing values for each column that has any."""
    # count the missing values per column, sorted from most to least
    miss = data.isnull().sum().sort_values(ascending = False)

    # calculate the percentage of missing values per column
    percentage_miss = (data.isnull().sum() / len(data) * 100).sort_values(ascending = False)

    # store in a dataframe 
    missing = pd.DataFrame({"Missing Values": miss, "Percentage(%)": percentage_miss})

    # drop the columns that have no missing values
    missing.drop(missing[missing["Percentage(%)"] == 0].index, inplace = True)

    return missing


missing_data = missing_values(kc_data)
missing_data

- A commonly used rule of thumb is to consider dropping a column when more than 50% of its values are missing, although the right choice also depends on the column. Here the percentages are low, so we fill the missing values instead.
- The waterfront (11.00%), view (0.29%) and yr_renovated (17.70%) columns are all well below 50% missing, so we can fill them.
- A missing yr_renovated value most likely means the house was never renovated, and a missing view or waterfront value most likely means the house has neither, so the natural fill value is zero or the equivalent "none" category.
- The missing values are a small share of each of the 3 columns, so replacing them with the mode should not skew the data or lead to false conclusions.


In [None]:
def filling_missing_values(data, columns):
    missing = missing_values(data) # store the output of missing_values function
    for col in columns:
        if col in missing.index: # check if column has missing values
            data[col] = data[col].fillna(data[col].mode()[0]) # fill missing values with mode
    return data
filling_missing_values(kc_data, ['waterfront','yr_renovated','view'])

kc_data.isna().sum()

### Duplicates


In [None]:
def check_duplicates(data):
    """
    A simple function to check for duplicates in a given dataset.
    """
    duplicates = data.duplicated().sum()
    return duplicates
check_duplicates(kc_data)

There are no duplicates in the data.

#### Checking duplicated id 

In [None]:
duplicates_id = kc_data[kc_data.duplicated(subset=['id'], keep=False)]
duplicates_id

* The id column is the unique identifier of a house.
* Some ids appear more than once, but with different sale dates and prices, which is why there are no fully duplicated rows. The repeated ids represent houses that were sold more than once.
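
The claim that repeated ids are resales can be checked directly. The sketch below is a minimal illustration on made-up rows (the `sales` frame and its values are assumptions, not taken from `kc_data`): it counts the distinct sale dates and prices per repeated id and flags any id whose repeated rows share a single date.

```python
import pandas as pd

# Toy stand-in for kc_data; the values are invented for illustration.
sales = pd.DataFrame({
    "id":    [101, 101, 202, 303, 303],
    "date":  ["5/1/2014", "3/2/2015", "6/7/2014", "7/8/2014", "7/8/2014"],
    "price": [300000.0, 350000.0, 410000.0, 500000.0, 500000.0],
})

# Keep only the rows whose id occurs more than once.
repeated = sales[sales.duplicated(subset=["id"], keep=False)]

# For each repeated id, count the distinct sale dates and prices.
summary = repeated.groupby("id")[["date", "price"]].nunique()

# An id whose repeated rows share one date looks like a true duplicate
# rather than a resale and deserves a closer look.
suspect_ids = summary.index[summary["date"] == 1].tolist()
print(summary)
print("ids needing a closer look:", suspect_ids)
```

On the real data the same three steps would be applied to `kc_data`; an empty `suspect_ids` list would support the resale explanation.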


### Data inconsistencies


In [None]:
def print_value_counts(df):
    for column in df.columns:
        # Print the column name
        print("Value counts for {} column:".format(column))
        # Print the value counts for the column
        print(df[column].value_counts())
        # Add a separator for clarity
        print("="*30)

print_value_counts(kc_data)

In [None]:
def find_inconsistent_data(df):
    # Identify potential data inconsistencies
    inconsistent_bedrooms = df[df['bedrooms'].isin([10, 11, 33])]
    inconsistent_bathrooms = df[df['bathrooms'].isin([7, 8])]

    # Concatenate the inconsistent data into a single DataFrame and return it
    inconsistent_data = pd.concat([inconsistent_bedrooms, inconsistent_bathrooms])
    return inconsistent_data
inconsistent_data = find_inconsistent_data(kc_data)
inconsistent_data

The sqft_basement column contains a placeholder value, `?`.


In [None]:
def place_holders(data, column):
    # replace the '?' placeholder with 0.0 and assign the result back to the column
    data[column] = data[column].replace('?', 0.0)

place_holders(kc_data, 'sqft_basement')

- When the number of bedrooms is 10 or more, the sqft_living and sqft_lot values are too small to be plausible for those records, so they are most likely data-entry errors. It would be best to drop those records.

- The sqft_basement column has 454 placeholder values. Dropping those rows would mean losing valuable data in the other columns.

- The placeholder was most likely used to show that the house has no basement, so we replace the placeholder values with the mode, i.e. 0.

- The placeholder values make up about 2% of the column, so imputing them should not skew the data.

- The sqft_basement feature is stored as a categorical (object) type instead of a numerical type, so we have to convert it.
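
As an alternative to replacing the placeholder and then casting, `pd.to_numeric` with `errors="coerce"` does both in one pass and also reports how many values could not be parsed. This is a sketch on a made-up series, not the notebook's own method; the sample values are assumptions.

```python
import pandas as pd

# Toy column mimicking sqft_basement stored as strings with a '?' placeholder.
basement = pd.Series(["0.0", "400.0", "?", "910.0", "?"], name="sqft_basement")

# errors="coerce" turns anything that cannot be parsed as a number into NaN.
as_number = pd.to_numeric(basement, errors="coerce")
n_placeholders = int(as_number.isna().sum())

# Fill the former placeholders with 0.0, i.e. "no basement".
cleaned = as_number.fillna(0.0)
print(n_placeholders, cleaned.tolist())
```

Counting the coerced values first makes it easy to confirm the share of placeholders before deciding how to fill them.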


In [None]:
#changing the data type of the column because it contains numerical values
kc_data['sqft_basement']=kc_data['sqft_basement'].astype(float)

## Outliers


In [None]:
def check_outliers(data, columns):
    fig, axes = plt.subplots(nrows=1, ncols=len(columns), figsize=(15,5))
    for i, column in enumerate(columns):
        # Use interquartile range (IQR) to find outliers for the specified column
        q1 = data[column].quantile(0.25)
        q3 = data[column].quantile(0.75)
        iqr = q3 - q1
        print("IQR for {} column: {}".format(column, iqr))

        # Determine the outliers based on the IQR
        outliers = (data[column] < q1 - 1.5 * iqr) | (data[column] > q3 + 1.5 * iqr)
        print("Number of outliers in {} column: {}".format(column, outliers.sum()))

        # Create a box plot to visualize the distribution of the specified column
        sns.boxplot(data=data, x=column, ax=axes[i])
    plt.show()

In [None]:
check_outliers(kc_data, ['price', 'sqft_lot', 'sqft_above','sqft_living15','sqft_lot15'])

In [None]:
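# histplot replaces the deprecated distplot; kde=True overlays the density curve
sns.histplot(kc_data['price'], kde=True, stat='density')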
mean = kc_data['price'].mean()
plt.axvline(x=mean, color='r', linestyle='--')
plt.show()

The data has outliers, but we keep them because they carry valuable information.


## Exploratory Data Analysis


### Univariate EDA


Checking for the distribution of individual columns


In [None]:
kc_data.hist(figsize=(20,20));

Checking for the location of our house sales


In [None]:
# Group the data by zipcode and calculate the mean latitude and longitude
zipcode_data = kc_data.groupby('zipcode').agg({'lat': 'mean', 'long': 'mean'}).reset_index()

# Create a map centered at the mean latitude and longitude of all the zipcodes
m = folium.Map(location=[kc_data['lat'].mean(), kc_data['long'].mean()], zoom_start=10)

# Add markers for each zipcode
for _, row in zipcode_data.iterrows():
    # iterrows returns the zipcode as a float, so convert it to a clean string for the popup
    folium.Marker(location=[row['lat'], row['long']], popup=str(int(row['zipcode']))).add_to(m)

# Display the map
m

### Bivariate EDA


- Bivariate EDA checks the relationship between pairs of variables.
- Here we check the relationship between each feature and the price.


In [None]:
specific_col = 'price'
for col in kc_data.columns:
    if col != specific_col:
        plt.scatter(kc_data[col], kc_data[specific_col])
        plt.xlabel(col)
        plt.ylabel(specific_col)
        plt.show()

From the above visualizations we can see that the following features show the most clearly linear relationship with price:
* sqft_living
* sqft_above
* sqft_living15
* sqft_basement

### What are the peak and low seasons for house sales?

In [None]:
kc_quarter =kc_data.copy()

In [None]:
def plot_quarter_counts(data):
    dates = pd.to_datetime(data['date'], format='%m/%d/%Y')
    dates_column = dates.dt.quarter
    # get the counts for each quarter, ordered from Q1 to Q4
    quarter_counts = dates_column.value_counts().sort_index()
    # plot a bar chart of the quarter counts
    quarter_counts.plot.bar()
    plt.title('Counts of Sales by Quarter')
    plt.xlabel('Quarter')
    plt.ylabel('Count')
    plt.show()

plot_quarter_counts(kc_quarter)

We can see that:

- The highest number of house sales is made in the second quarter of the year (Q2: April 1 - June 30), which falls in the Spring season
- The lowest number of house sales is made in the first quarter of the year (Q1: January 1 - March 31), which falls mostly in the Winter season


### Multivariate Visualizations

We take the features with the most linear relationship with price and visualize them together

In [None]:
plt.scatter(kc_data['sqft_living'], kc_data['price'], label='sqft_living')
plt.scatter(kc_data['sqft_above'], kc_data['price'], label='sqft_above')
plt.xlabel('Square footage')
plt.ylabel('Price')
plt.title('Price vs Square footage')
plt.legend()
plt.show()

The data points of sqft_living and sqft_above lie close together and they show a strong positive linear relationship with the price

In [None]:
plt.scatter(kc_data['sqft_living'], kc_data['price'], label='sqft_living')
plt.scatter(kc_data['sqft_above'], kc_data['price'], label='sqft_above')
plt.scatter(kc_data['sqft_living15'], kc_data['price'], label='sqft_living15')
plt.scatter(kc_data['sqft_basement'], kc_data['price'], label='sqft_basement')
plt.xlabel('Square footage')
plt.ylabel('Price')
plt.title('Price vs Square footage')
plt.legend()
plt.show()

The data points of sqft_living, sqft_above, sqft_living15 and sqft_basement lie close together and they show a strong positive linear relationship with the price

## Feature Engineering

#### Extracting the year from date sold 


In [None]:
#converting date column from object (string) to datetime
kc_data['date'] = pd.to_datetime(kc_data['date'], format='%m/%d/%Y')
#Extract the year and create a new column
kc_data['year'] = kc_data['date'].dt.year
kc_data.drop('date', axis=1, inplace=True)

#### Creating a new column named Age

The date (year) of the sale and the year built can be used to obtain the age of the house

In [None]:
#creating new column age
kc_data['age']= kc_data['year']-kc_data['yr_built']
kc_data['age']

* Since we have obtained the age of the house, we can drop the year and yr_built columns.
* We drop the id of the house since it is not used in the modelling.
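
The age calculation can be sketched on a few made-up rows. The `homes` frame below is an assumption for illustration only; the extra check for negative ages is a suggestion, since a house sold before its recorded build year would otherwise pass silently into the model.

```python
import pandas as pd

# Toy rows; dates and build years are invented for illustration.
homes = pd.DataFrame({
    "date": ["10/13/2014", "2/25/2015", "6/23/2014"],
    "yr_built": [1955, 2015, 2015],
})

# Parse the sale date and subtract the build year from the sale year.
sold = pd.to_datetime(homes["date"], format="%m/%d/%Y")
homes["age"] = sold.dt.year - homes["yr_built"]

# A negative age means the sale is recorded before the build year.
negative_age = homes[homes["age"] < 0]
print(homes["age"].tolist(), len(negative_age))
```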

In [None]:
#dropping the columns year, yr_built, id
kc_data.drop(['year','yr_built', 'id'],axis=1, inplace=True )

In [None]:
kc_data.info()

In [None]:
kc_data.describe()

The data doesn't have missing values, duplicates or placeholder values and all the columns are in their correct datatypes


### Correlations

In [None]:
# numeric_only=True skips the columns that are still stored as text
corr_matrix = kc_data.corr(numeric_only=True)
fig, ax = plt.subplots(figsize=(12,12))   # Set the figure size to 12 inches by 12 inches
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', ax=ax)
plt.title('Correlation Matrix', fontsize=18)
plt.show()

#### Multicollinearity

How does each independent variable relate with the other

In [None]:
# absolute pairwise correlations of the numeric columns, stacked into one long table
df=kc_data.corr(numeric_only=True).abs().stack().reset_index().sort_values(0, ascending=False)

# zip the variable name columns (Which were only named level_0 and level_1 by default) in a new column named "pairs"
df['pairs'] = list(zip(df.level_0, df.level_1))

# set index to pairs
df.set_index(['pairs'], inplace = True)

# drop level columns
df.drop(columns=['level_1', 'level_0'], inplace = True)

# rename correlation column as cc rather than 0
df.columns = ['cc']

# keep the pairs with a correlation above 0.75, excluding the self-correlations of 1
df[(df.cc>.75) & (df.cc <1)]

* The above pairs are the most highly correlated with each other.
* Including all of those variables would introduce multicollinearity into the model, so we will drop some of them.
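
Stacking a full correlation matrix lists every pair twice, once as (A, B) and once as (B, A). One way to list each pair only once is to keep just the entries above the diagonal. The sketch below uses synthetic data; the `toy` frame and the 0.75 cutoff mirror the text but the numbers are generated, not real.

```python
import numpy as np
import pandas as pd

# Synthetic data: sqft_above is built to track sqft_living closely.
rng = np.random.default_rng(0)
n = 200
base = rng.normal(size=n)
toy = pd.DataFrame({
    "sqft_living": base,
    "sqft_above": base * 0.9 + rng.normal(scale=0.1, size=n),
    "floors": rng.normal(size=n),
})

corr = toy.corr().abs()

# Boolean mask that is True strictly above the diagonal, so each pair
# appears once and self-correlations are excluded.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

high = pairs[pairs > 0.75].sort_values(ascending=False)
print(high)
```

Each remaining row names one pair, which makes it easier to decide which member of the pair to drop.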

In [None]:
kc_data.drop(['sqft_above', 'sqft_living15', 'bathrooms'], axis=1, inplace=True)

In [None]:
kc_data.drop(['lat', 'long', 'zipcode'], axis=1, inplace=True)

#### One hot encoding

In [None]:
# flag renovated houses: 1 if yr_renovated is greater than 0, otherwise 0
kc_data['yr_renovated']= kc_data['yr_renovated'].apply(lambda x: 1 if x>0 else 0 )
kc_data['yr_renovated'].value_counts()

In [None]:
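#one hot encoding waterfront, view and condition (dtype=int gives 0/1 columns that statsmodels accepts)
kc_transform = pd.get_dummies(kc_data, columns=["waterfront",'view','condition'], dtype=int)
#dropping one level per feature to serve as the reference category
kc_transform = kc_transform.drop(["condition_Poor",'view_NONE','waterfront_NO'], axis=1)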
kc_transform

The reference categories are NONE for view, NO for waterfront and Poor for condition.
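
What a reference category means in practice can be seen on a small example. In the sketch below (made-up rows, hypothetical `toy` frame) a house in the reference level of every feature ends up with zeros in all of the remaining dummy columns, so each coefficient is read relative to that level.

```python
import pandas as pd

# Toy rows; values are invented for illustration.
toy = pd.DataFrame({
    "waterfront": ["NO", "YES", "NO", "NO"],
    "condition": ["Poor", "Average", "Good", "Average"],
})

# One 0/1 column per level of each feature.
dummies = pd.get_dummies(toy, columns=["waterfront", "condition"], dtype=int)

# Drop one level per feature; that level becomes the reference category.
reference = ["waterfront_NO", "condition_Poor"]
encoded = dummies.drop(columns=reference)

# The first row (NO, Poor) is all zeros after the drop.
print(encoded)
```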

#### Label Encoding

In [None]:
#Convert grade column to numeric using label encoding
label_encoder = LabelEncoder()
kc_transform['grade'] = label_encoder.fit_transform(kc_transform['grade'])
kc_transform['grade'].value_counts()

In [None]:
kc_transform.info()

In [None]:
kc_transform.corr()['price'].sort_values(ascending=False)

* sqft_living has the strongest positive correlation with price
* sqft_basement, bedrooms and view_EXCELLENT have a low positive correlation with price
* grade, age and condition have a weak negative correlation with price
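
The weak negative correlation for grade is worth a second look. `LabelEncoder` numbers classes in sorted order, which for strings is lexicographic, so a label starting with "10" is coded below one starting with "3". The sketch below assumes the grade values are strings that begin with the numeric grade, such as "7 Average"; the exact labels in `kc_data` are not shown here, so the sample values are hypothetical.

```python
import pandas as pd

# Hypothetical grade labels; the real strings in kc_data may differ.
grade = pd.Series(["7 Average", "10 Very Good", "3 Poor", "11 Excellent", "7 Average"])

# Lexicographic order, which is the order string labels are numbered in
# by a label encoder: "10 ..." and "11 ..." sort before "3 ...".
lexicographic = sorted(grade.unique())

# Ordinal alternative: take the leading integer from each label so that
# a higher number always means a higher grade.
grade_num = grade.str.extract(r"^(\d+)", expand=False).astype(int)
print(lexicographic)
print(grade_num.tolist())
```

If the labels do follow this pattern, using the extracted number keeps the natural ordering of the grades.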




## Linear Regression


The first model regresses price on the variable most highly correlated with it
* We will use a significance level (```alpha```) of ```0.05```
* We used forward selection, adding predictors step by step, to determine the best model
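
Adding predictors step by step can be automated as a greedy forward selection. The sketch below is one possible version on synthetic data, not the procedure used in this notebook: it assumes a stopping rule based on a minimum gain in R-squared (`min_gain`, a hypothetical parameter) rather than on p-values, and the helper names are invented.

```python
import numpy as np

def r_squared(X, y):
    """R-squared of an ordinary least squares fit of y on X plus an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

def forward_select(X, y, names, min_gain=0.01):
    """Greedily add the feature that raises R-squared the most."""
    chosen, best = [], 0.0
    remaining = list(range(X.shape[1]))
    while remaining:
        # Score every candidate when added to the features chosen so far.
        scores = {j: r_squared(X[:, chosen + [j]], y) for j in remaining}
        j_best = max(scores, key=scores.get)
        # Stop when the best candidate no longer improves the fit enough.
        if scores[j_best] - best < min_gain:
            break
        chosen.append(j_best)
        remaining.remove(j_best)
        best = scores[j_best]
    return [names[j] for j in chosen], best

# Synthetic data: y depends on the first two columns only.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 3))
y = 3.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(scale=0.5, size=300)

selected, r2 = forward_select(X, y, ["a", "b", "noise"])
print(selected, round(r2, 3))
```

The strongest predictor enters first, which matches the choice of sqft_living as the baseline model.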

We chose RMSE as our error-based metric because:

* RMSE gives more weight to larger errors than smaller errors.
* RMSE is more sensitive to outliers than other metrics such as MAE
* It is commonly used to compare different models and choose the best performing one.
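
The first two points can be seen with a small calculation. The prices below are made up: three predictions are close and one is far off, and the single large error pulls RMSE well above MAE.

```python
import numpy as np

# Invented prices: the last prediction misses by 800,000.
y_true = np.array([300000.0, 450000.0, 500000.0, 2000000.0])
y_pred = np.array([310000.0, 440000.0, 510000.0, 1200000.0])

errors = y_true - y_pred

# RMSE squares the errors before averaging, so large errors dominate.
rmse = np.sqrt(np.mean(errors ** 2))

# MAE averages the absolute errors, so every error counts in proportion.
mae = np.mean(np.abs(errors))

print("RMSE:", rmse)
print("MAE:", mae)
```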

#### Baseline model

In [None]:
#baseline model
X= kc_transform[['sqft_living']]
y=kc_transform['price']
model=sm.OLS(y, sm.add_constant(X))
results=model.fit()
results.summary()

In [None]:
# fit the model 
pred_model = LinearRegression()
pred_model.fit(X, y)

# predict the values of the dependent variable 
y_pred = pred_model.predict(X)

# calculate the RMSE
rmse = np.sqrt(mean_squared_error(y, y_pred))
print('RMSE:', rmse)

#### Interpretation

The baseline model regresses price on sqft_living, since sqft_living has the highest correlation with price. The model is statistically significant, since the F-statistic p-value is less than 0.05, and it explains 49.3% of the total variation in price.

* An increase of 1 square foot in living area is associated with a price increase of approximately $281.

* The RMSE is 261656, meaning the model's predictions are typically off by about $261656.

Let's add more predictors to improve the accuracy of this model.

### Multiple Linear Regression

In [None]:
X= kc_transform[['sqft_living','sqft_basement']]
y=kc_transform['price']
model=sm.OLS(y, sm.add_constant(X))
results=model.fit()
results.summary()

In [None]:
# fit the model 
pred_model1 = LinearRegression()
pred_model1.fit(X, y)

# predict the values of the dependent variable 
y_pred = pred_model1.predict(X)

# calculate the RMSE
rmse = np.sqrt(mean_squared_error(y, y_pred))
print('RMSE:', rmse)

### Interpretation

* The model regresses price on sqft_living and sqft_basement.
* The model is statistically significant since the F-statistic p-value is less than 0.05.
* The model explains 49.3% of the total variation in price, the same as the baseline model, which shows that adding sqft_basement doesn't improve the fit.
* The two coefficients are statistically significant since their t-statistic p-values are less than 0.05.

* The RMSE is $261525, slightly lower than in the baseline model.
* Holding basement area constant, an increase of 1 square foot in living area is associated with an increase of approximately $276.6 in price.
* Holding living area constant, an increase of 1 square foot in basement area is associated with an increase of approximately $20.7 in price.

Let's add more predictors to improve the accuracy of this model.

In [None]:
X= kc_transform[['sqft_living','sqft_basement','bedrooms','view_EXCELLENT']]
y=kc_transform['price']
model=sm.OLS(y, sm.add_constant(X))
results=model.fit()
results.summary()

In [None]:
# fit the model 
pred_model2 = LinearRegression()
pred_model2.fit(X, y)

# predict the values of the dependent variable 
y_pred = pred_model2.predict(X)

# calculate the RMSE
rmse = np.sqrt(mean_squared_error(y, y_pred))
print('RMSE:', rmse)

#### Interpretation

* The model regresses price on sqft_living, sqft_basement, bedrooms and view_EXCELLENT.
* The model is statistically significant since the F-statistic p-value is less than 0.05, and it explains 54% of the total variation in price, an improvement on the previous models.
* The coefficients are statistically significant since their p-values are less than 0.05.

* The RMSE is $249321, lower than in the previous models.
* An increase of 1 square foot in living area is associated with an increase of approximately $296.11 in price.
* An increase of 1 square foot in basement area is associated with an increase of approximately $12.85 in price.
* One additional bedroom is associated with a decrease of approximately $56090 in price.
* A house with an excellent view is priced approximately $552500 higher than a house without an excellent view.

Let's add more predictors to improve the accuracy of this model.

In [None]:
X= kc_transform[['sqft_living','sqft_basement','bedrooms','view_EXCELLENT', 'waterfront_YES','floors']]
y=kc_transform['price']
model=sm.OLS(y, sm.add_constant(X))
results=model.fit()
results.summary()

In [None]:
# fit the model 
pred_model3 = LinearRegression()
pred_model3.fit(X, y)

# predict the values of the dependent variable 
y_pred = pred_model3.predict(X)

# calculate the RMSE
rmse = np.sqrt(mean_squared_error(y, y_pred))
print('RMSE:', rmse)

#### Interpretation

* The model regresses price on sqft_living, sqft_basement, bedrooms, view_EXCELLENT, waterfront_YES and floors.
* The model is statistically significant since the F-statistic p-value is less than 0.05, and it explains 55% of the total variation in price, an improvement on the previous models.
* The coefficients are statistically significant since their t-statistic p-values are less than 0.05.

* The RMSE is $246200, lower than in the previous models.
* An increase of 1 square foot in living area is associated with an increase of approximately $288.76 in price.
* An increase of 1 square foot in basement area is associated with an increase of approximately $23.33 in price.
* One additional bedroom is associated with a decrease of approximately $49460 in price.
* A house with an excellent view is priced approximately $345500 higher than a house without an excellent view.
* A house on a waterfront is priced roughly $540000 higher than a house not on a waterfront.
* One additional floor is associated with an increase of approximately $16970 in price.

Let's add more predictors to improve the accuracy of this model.

In [None]:
X= kc_transform[['sqft_living','sqft_basement','bedrooms','view_EXCELLENT', 
                 'waterfront_YES','floors','grade','age','condition_Fair']]
y=kc_transform['price']
model=sm.OLS(y, sm.add_constant(X))
results=model.fit()
results.summary()

In [None]:
# fit the model 
pred_model4 = LinearRegression()
pred_model4.fit(X, y)

# predict the values of the dependent variable 
y_pred = pred_model4.predict(X)

# calculate the RMSE
rmse = np.sqrt(mean_squared_error(y, y_pred))
print('RMSE:', rmse)

* After adding the variables that have a low correlation with price, the model's R-squared improved to 59.9% and the RMSE fell to 232835.
* So let's try adding all the variables and see how the model performs.
* Before adding sqft_lot and sqft_lot15 we need to log transform them.

### Log Transformation

In [None]:
kc_copy= kc_transform.copy()

kc_copy["log(sqft_lot)"] = np.log(kc_copy["sqft_lot"])

# Visually inspect raw vs. transformed values
kc_copy[["sqft_lot", "log(sqft_lot)"]]

In [None]:
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(7,3))
ax1.hist(kc_copy["sqft_lot"])
ax1.set_xlabel("sqft_lot")
ax2.hist(kc_copy["log(sqft_lot)"], color="orange")
ax2.set_xlabel("log(sqft_lot)");

In [None]:
kc_copy["log(sqft_lot15)"] = np.log(kc_copy["sqft_lot15"])

# Visually inspect raw vs. transformed values
kc_copy[["sqft_lot15", "log(sqft_lot15)"]]

In [None]:
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(7,3))
ax1.hist(kc_copy["sqft_lot15"])
ax1.set_xlabel("sqft_lot15")
ax2.hist(kc_copy["log(sqft_lot15)"], color="orange")
ax2.set_xlabel("log(sqft_lot15)");

In [None]:
kc_copy.drop(['sqft_lot15',"sqft_lot"],axis=1, inplace= True)

#### Before Transformation

In [None]:
X = kc_transform.drop('price', axis=1)
y = kc_transform['price']
X = sm.add_constant(X)
model1 = sm.OLS(y, X)
results1 = model1.fit() 
print(results1.summary())

In [None]:
# fit the model
pred_model5 = LinearRegression()
pred_model5.fit(X, y)

# predict the values of the dependent variable
y_pred = pred_model5.predict(X)

# calculate the RMSE
rmse = np.sqrt(mean_squared_error(y, y_pred))
print('RMSE:', rmse)

#### After Transformation

In [None]:
X = kc_copy.drop('price', axis=1)
y = kc_copy['price']
X = sm.add_constant(X)
model1 = sm.OLS(y, X)
results1 = model1.fit() 
print(results1.summary())

In [None]:
# fit the model 
pred_model6 = LinearRegression()
pred_model6.fit(X, y)

# predict the values of the dependent variable 
y_pred = pred_model6.predict(X)

# calculate the RMSE
rmse = np.sqrt(mean_squared_error(y, y_pred))
print('RMSE:', rmse)

##### Why transform?

* The first model is fitted without the log transformation and the second with it. The second is better, since it explains 61.7% of the total variation in price compared with 61% for the first.
* The RMSE of the second model is also lower, so we will use the second model.
* In the second model some variables are not significant, since their p-values are greater than 0.05. They are levels of a categorical feature, however, and dropping them would fold them into the reference category, which we have already chosen, so we keep them.

This is our final model.

#### Interpretation




* The model regresses price on bedrooms, sqft_living, floors, grade, sqft_basement, yr_renovated, age, waterfront_YES, view_AVERAGE, view_EXCELLENT, view_FAIR, view_GOOD, condition_Average, condition_Fair, condition_Good, condition_Very Good, log(sqft_lot) and log(sqft_lot15).
* The model is statistically significant since the F-statistic p-value is less than 0.05, and it explains 61.7% of the total variation in price, an improvement on the previous models.
* Most of the predictor variables are statistically significant; the exceptions are condition_Fair and condition_Average.

* The RMSE is $227171, lower than in the previous models.
* An increase of 1 square foot in living area is associated with an increase of approximately $308.73 in price.
* An increase of 1 square foot in basement area is associated with a decrease of approximately $36.72 in price.
* One additional bedroom is associated with a decrease of approximately $409200 in price.
* A one-unit increase in the encoded grade is associated with a decrease of approximately $19990 in price.
* Each additional year of age is associated with an increase of approximately $2154.77 in price.
* A renovated house is priced approximately $63110 higher than a house that was never renovated.
* A house on a waterfront is priced approximately $502500 higher than a house not on a waterfront.
* A house with an average view is priced approximately $90700 higher than a house with no view.
* A house with an excellent view is priced approximately $340600 higher than a house with no view.
* A house with a good view is priced approximately $159900 higher than a house with no view.
* A house with a fair view is priced approximately $140200 higher than a house with no view.
* One additional floor is associated with an increase of approximately $31440 in price.
* A house in average condition is priced approximately $80020 higher than a house in poor condition.
* A house in fair condition is priced approximately $41180 higher than a house in poor condition.
* A house in good condition is priced approximately $103100 higher than a house in poor condition.
* A house in very good condition is priced approximately $138300 higher than a house in poor condition.
* A 1% increase in sqft_lot is associated with a decrease of approximately $386.8 in price.
* A 1% increase in sqft_lot15 is associated with a decrease of approximately $135.6 in price.
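
The "per 1%" reading of a log-transformed predictor comes from a short calculation. When price is regressed on log(x), raising x by 1% changes the prediction by the coefficient times ln(1.01), which is close to the coefficient divided by 100. The coefficient below is an assumption: -38680 is chosen only because it is consistent with the $386.8 figure quoted for sqft_lot, and it was not read from the model summary.

```python
import numpy as np

# Assumed coefficient on log(sqft_lot); picked to match the $386.8 figure
# in the text, not taken from the fitted model.
beta_log_lot = -38680.0

# Exact change in predicted price for a 1% increase in sqft_lot.
exact_1pct = beta_log_lot * np.log(1.01)

# Common shortcut: divide the coefficient by 100.
approx_1pct = beta_log_lot / 100

# The shortcut drifts for larger changes, e.g. a 10% increase.
exact_10pct = beta_log_lot * np.log(1.10)

print(exact_1pct, approx_1pct, exact_10pct)
```

The shortcut is accurate for small percentage changes; for larger ones the logarithm should be used directly.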




In [None]:
fig= plt.figure( figsize=(40,40))
sm.graphics.plot_partregress_grid(results1, exog_idx=list(X.columns.values),fig=fig)
plt.show()

The partial regression plots show that the predictor variables have a roughly linear relationship with price, which suggests that they are beneficial to our model

#### Standardizing the model

In [None]:
for col in kc_copy:
    kc_copy[col]=(kc_copy[col]-kc_copy[col].mean())/kc_copy[col].std()
    
X = kc_copy.drop('price', axis=1)
y = kc_copy['price']
X = sm.add_constant(X)
model1 = sm.OLS(y, X)
results1 = model1.fit() 
print(results1.summary())

In [None]:
results1.params.sort_values(ascending=False)

* We can see that sqft_living has the highest influence on the price of the house.
* The variables that have a major influence on the price of the house are: sqft_living, the age of the house, the house being in good condition, the house being on a waterfront, and the house having an excellent view.
* The variables that have the least influence on the price of the house are: grade, number of bedrooms, sqft_lot, sqft_basement and sqft_lot15.
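
One caution when ranking influence: sorting signed coefficients in descending order places large negative coefficients at the bottom, next to the ones near zero. Ranking by absolute value avoids this. The sketch below uses synthetic data with invented effect sizes; the column names are borrowed from the notebook for readability only.

```python
import numpy as np
import pandas as pd

# Synthetic data: age has a sizeable negative effect, bedrooms almost none.
rng = np.random.default_rng(2)
n = 500
X = pd.DataFrame({
    "sqft_living": rng.normal(2000, 900, n),
    "age": rng.normal(40, 25, n),
    "bedrooms": rng.normal(3, 1, n),
})
price = (300 * X["sqft_living"] - 2000 * X["age"] + 500 * X["bedrooms"]
         + rng.normal(0, 50000, n))

# Standardize predictors and target to mean 0 and standard deviation 1.
Z = (X - X.mean()) / X.std()
z_price = (price - price.mean()) / price.std()

# Least squares on standardized data; no intercept is needed because
# every variable has mean zero.
coef, *_ = np.linalg.lstsq(Z.to_numpy(), z_price.to_numpy(), rcond=None)
betas = pd.Series(coef, index=X.columns)

# Rank by the size of the effect, keeping the sign for interpretation.
ranked = betas.reindex(betas.abs().sort_values(ascending=False).index)
print(ranked)
```

Here age ranks second by absolute value even though a signed descending sort would list it last.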

## Conclusion

* The variables that have a major influence on the price of the house are: sqft_living, the age of the house, the house being in good condition, the house being on a waterfront, and the house having an excellent view.
* The variables that have the least influence on the price of the house are: grade, number of bedrooms, sqft_lot, sqft_basement and sqft_lot15.

We can also see that:

- The highest number of house sales is made in the second quarter of the year (Q2: April 1 - June 30), which falls in the Spring season
- The lowest number of house sales is made in the first quarter of the year (Q1: January 1 - March 31), which falls mostly in the Winter season


## Recommendations

* Renovate the house, since this increases its value
* Ensure that the house is in good condition before putting it on the market
* Increase the square footage of the living space
* Put the house up for sale in the peak season, Spring

## Future work

* Reducing noise in the data to improve the accuracy of our model. 
* Investigating additional features, such as the construction and architectural attributes of a house, to see what trends they reveal.