<p style="text-align:center"> 
<a href="https://skills.network" target="_blank"> 
<img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="300" alt="Skills Network Logo"> 
</a>
</p>

<h1 align="center"><font size="7"><strong>Final project</strong></font></h1>
<h2 align="center"><font size="6.8">Homes for sale in King County, USA</font></h2>

<h2>Table of contents</h2>
<div class="alert alert-block alert-info" style="margin-top: 20px"> 
<ul> 
<li><a href="#Instructions">Instructions</a></li> 
<li><a href="#About-the-Dataset">About the Dataset</a></li> 
<li><a href="#Module-1:-Importing-Data-Sets">Part 1: Importing the data </a></li> 
<li><a href="#Module-2:-Data-Wrangling">Part 2: Data Manipulation</a> </li> 
<li><a href="#Module-3:-Exploratory-Data-Analysis">Part 3: Exploratory Data Analysis (EDA)</a></li> 
<li><a href="#Module-4:-Model-Development">Part 4: Model Development</a></li>
<li><a href="#Module-5:-Model-Evaluation-and-Refinement">Part 5: Model Evaluation and Refinement</a></li>
</ul>
</div>

<hr>

## Instructions

We take the role of a data analyst at a real estate investment trust. The trust wants to start investing in residential real estate, so our task is to estimate the market price of a home from a set of its characteristics.

* <strong>Objective</strong>:

``Analyze and predict home prices using attributes or characteristics such as square footage (sqft), number of bedrooms, number of floors, etc.``


<hr>

## About the Dataset

This dataset contains home sales prices in King County, which includes Seattle. It includes homes sold between May 2014 and May 2015. It was extracted from this [source](https://www.kaggle.com/harlfoxem/housesalesprediction?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-wwwcourseraorg-SkillsNetworkCoursesIBMDeveloperSkillsNetworkDA0101ENSkillsNetwork20235326-2022-01-01). It was also slightly modified for the purposes of this exercise.

### Dataset variables

| Variable | Description |
| ------------- | ----------------------------------------------------------------------------------------------------------- |
| id | Unique identifier for a house |
| date | Date the house sold |
| price | Price of the house (this is our goal to predict) |
| bedrooms | Number of bedrooms |
| bathrooms | Number of bathrooms |
| sqft_living | Square footage of the house |
| sqft_lot | Square footage of the lot |
| floors | Total floors (levels) in the house |
| waterfront | Whether the house has a view of a waterfront (0 or 1 in this dataset) |
| view | View rating of the property (ranges from 0 to 4 in this dataset) |
| condition | How good the overall condition is |
| grade | Overall rating given to the housing unit, according to the King County rating system |
| sqft_above | Square footage of the house excluding the basement |
| sqft_basement | Square footage of the basement |
| yr_built | Year built |
| yr_renovated | Year the house was renovated |
| zipcode | Postal code |
| lat | Latitude coordinate |
| long | Longitude coordinate |
| sqft_living15 | Living room area in 2015 (involves some renovations). This may or may not have affected the lot area. |
| sqft_lot15 | Lot area in 2015 (involves some renovations) |

<hr>

## Setup

For this work, we will use the following Python libraries:

*   [`pandas`](https://pandas.pydata.org/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML0187ENSkillsNetwork31430127-2021-01-01) to manage data.
*   [`numpy`](https://numpy.org/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML0187ENSkillsNetwork31430127-2021-01-01) for mathematical operations.
*   [`scikit-learn`](https://scikit-learn.org/stable/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML0187ENSkillsNetwork31430127-2021-01-01) for machine learning-related functions and the machine learning pipeline.
*   [`seaborn`](https://seaborn.pydata.org/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML0187ENSkillsNetwork31430127-2021-01-01) to visualize the data.
*   [`matplotlib`](https://matplotlib.org/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML0187ENSkillsNetwork31430127-2021-01-01) for additional plotting tools.


### **Install and import the libraries**

In [None]:
# We install the libraries using a magic command
%pip install pandas numpy matplotlib seaborn scikit-learn

In [2]:
# We import the libraries
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LinearRegression

import warnings
warnings.filterwarnings('ignore')

# Render plots inline, directly below the cell that produces them
%matplotlib inline

<hr>

## Part 1: Importing datasets

We load the dataset, a <code>.csv</code> file hosted at an external URL, directly into a pandas DataFrame in our Jupyter notebook; a local copy is saved afterwards.

In [3]:
# CSV file address
file_path = 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DA0101EN-SkillsNetwork/labs/FinalModule_Coursera/data/kc_house_data_NaN.csv'

# Load the CSV file into a Pandas dataframe
df = pd.read_csv(file_path)

Before saving the raw dataset locally, we check that the dataframe loaded correctly by calling the <code>head()</code> method, which displays its first 5 rows.

In [4]:
df.head()

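|  | Unnamed: 0 | id | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | ... | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 7129300520 | 20141013T000000 | 221900.0 | 3.0 | 1.0 | 1180 | 5650 | 1.0 | 0 | ... | 7 | 1180 | 0 | 1955 | 0 | 98178 | 47.5112 | -122.257 | 1340 | 5650 |
| 1 | 1 | 6414100192 | 20141209T000000 | 538000.0 | 3.0 | 2.25 | 2570 | 7242 | 2.0 | 0 | ... | 7 | 2170 | 400 | 1951 | 1991 | 98125 | 47.721 | -122.319 | 1690 | 7639 |
| 2 | 2 | 5631500400 | 20150225T000000 | 180000.0 | 2.0 | 1.0 | 770 | 10000 | 1.0 | 0 | ... | 6 | 770 | 0 | 1933 | 0 | 98028 | 47.7379 | -122.233 | 2720 | 8062 |
| 3 | 3 | 2487200875 | 20141209T000000 | 604000.0 | 4.0 | 3.0 | 1960 | 5000 | 1.0 | 0 | ... | 7 | 1050 | 910 | 1965 | 0 | 98136 | 47.5208 | -122.393 | 1360 | 5000 |
| 4 | 4 | 1954400510 | 20150218T000000 | 510000.0 | 3.0 | 2.0 | 1680 | 8080 | 1.0 | 0 | ... | 8 | 1680 | 0 | 1987 | 0 | 98074 | 47.6168 | -122.045 | 1800 | 7503 |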
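
The `Unnamed: 0` column in the preview repeats the row index, which usually means the CSV was written with its index included. The notebook loads that column as-is; as an alternative, here is a minimal sketch of how `index_col=0` makes pandas treat the first column as the index at load time. It assumes a small in-memory sample in place of the real file, and the `sample_csv` text and `sample_df` name are illustrative only.

```python
import io

import pandas as pd

# Hypothetical three-row sample that mimics the layout of the real file:
# the first, unnamed column is a leftover row index.
sample_csv = io.StringIO(
    ",id,price,bedrooms\n"
    "0,7129300520,221900.0,3.0\n"
    "1,6414100192,538000.0,3.0\n"
    "2,5631500400,180000.0,2.0\n"
)

# index_col=0 tells read_csv to use the first column as the index
# instead of loading it as a data column named "Unnamed: 0".
sample_df = pd.read_csv(sample_csv, index_col=0)

print(sample_df.columns.tolist())
```

With this option the redundant column never becomes part of the data, so it does not have to be dropped later.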


In [5]:
# Save the raw (uncleaned) dataset locally so we can compare it with later versions.
df.to_csv('../data/raw/data.csv')
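
Two details of `to_csv` are worth keeping in mind here: it fails if the target directory does not exist, and by default it also writes the row index as an extra first column. The following is a sketch of a more defensive save, not the notebook's own code. It uses a temporary directory and a tiny stand-in dataframe (`sample_df`, `raw_path`, and the values are assumptions made for illustration).

```python
import tempfile
from pathlib import Path

import pandas as pd

# Tiny stand-in for the real dataframe (illustrative values only).
sample_df = pd.DataFrame({"id": [1, 2], "price": [221900.0, 538000.0]})

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical target path, similar in shape to ../data/raw/data.csv
    raw_path = Path(tmp) / "data" / "raw" / "data.csv"

    # Create the folder hierarchy first; to_csv does not create it for us.
    raw_path.parent.mkdir(parents=True, exist_ok=True)

    # index=False keeps the row index out of the file, so reloading it
    # does not produce an extra "Unnamed: 0" column.
    sample_df.to_csv(raw_path, index=False)

    reloaded = pd.read_csv(raw_path)

print(reloaded.columns.tolist())
```

Whether to drop the index depends on how later steps read the file, so treat `index=False` as an option rather than a requirement.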

### Brief inspection of the dataframe

``Dataset Dimensions``:

In [6]:
print(f'''
Dataset dimensions with its original data:
_ Number of rows: {df.shape[0]}
_ Number of columns: {df.shape[1]}
''')


Dataset dimensions with its original data:
_ Number of rows: 21613
_ Number of columns: 22



``We check the data types of each column in the DataFrame``:

In [7]:
print('Data types per column')
df.dtypes

Data types per column


Unnamed: 0         int64
id                 int64
date              object
price            float64
bedrooms         float64
bathrooms        float64
sqft_living        int64
sqft_lot           int64
floors           float64
waterfront         int64
view               int64
condition          int64
grade              int64
sqft_above         int64
sqft_basement      int64
yr_built           int64
yr_renovated       int64
zipcode            int64
lat              float64
long             float64
sqft_living15      int64
sqft_lot15         int64
dtype: object
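
The `date` column is reported as `object`, meaning pandas stored it as plain strings such as `20141013T000000`. If date arithmetic is needed later, those strings can be parsed with `pd.to_datetime`. A minimal sketch, assuming the format string `%Y%m%dT%H%M%S` (inferred from the values shown in the preview) and using a short hand-written series instead of the full column:

```python
import pandas as pd

# A few date strings copied from the preview of the dataset.
dates = pd.Series(["20141013T000000", "20141209T000000", "20150225T000000"])

# Parse the strings into datetime values; the format is an assumption
# based on how the sample values look (YYYYMMDD, a literal "T", HHMMSS).
parsed = pd.to_datetime(dates, format="%Y%m%dT%H%M%S")

# Datetime columns expose the .dt accessor for components such as the year.
print(parsed.dt.year.tolist())
```

On the real dataframe, the same call would be applied to `df['date']`.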

``Statistical summary of the DataFrame``:

The <code>describe()</code> method gives a quick overview of every numeric column: the count of non-null values, the mean, the standard deviation, the minimum and maximum, and the 25%, 50%, and 75% quartiles.

In [8]:
df.describe()

|  | Unnamed: 0 | id | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | ... | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 21613.0 | 21613.0 | 21613.0 | 21600.0 | 21603.0 | 21613.0 | 21613.0 | 21613.0 | 21613.0 | 21613.0 | ... | 21613.0 | 21613.0 | 21613.0 | 21613.0 | 21613.0 | 21613.0 | 21613.0 | 21613.0 | 21613.0 | 21613.0 |
| mean | 10806.0 | 4580302000.0 | 540088.1 | 3.37287 | 2.115736 | 2079.899736 | 15106.97 | 1.494309 | 0.007542 | 0.234303 | ... | 7.656873 | 1788.390691 | 291.509045 | 1971.005136 | 84.402258 | 98077.939805 | 47.560053 | -122.213896 | 1986.552492 | 12768.455652 |
| std | 6239.28002 | 2876566000.0 | 367127.2 | 0.926657 | 0.768996 | 918.440897 | 41420.51 | 0.539989 | 0.086517 | 0.766318 | ... | 1.175459 | 828.090978 | 442.575043 | 29.373411 | 401.67924 | 53.505026 | 0.138564 | 0.140828 | 685.391304 | 27304.179631 |
| min | 0.0 | 1000102.0 | 75000.0 | 1.0 | 0.5 | 290.0 | 520.0 | 1.0 | 0.0 | 0.0 | ... | 1.0 | 290.0 | 0.0 | 1900.0 | 0.0 | 98001.0 | 47.1559 | -122.519 | 399.0 | 651.0 |
| 25% | 5403.0 | 2123049000.0 | 321950.0 | 3.0 | 1.75 | 1427.0 | 5040.0 | 1.0 | 0.0 | 0.0 | ... | 7.0 | 1190.0 | 0.0 | 1951.0 | 0.0 | 98033.0 | 47.471 | -122.328 | 1490.0 | 5100.0 |
| 50% | 10806.0 | 3904930000.0 | 450000.0 | 3.0 | 2.25 | 1910.0 | 7618.0 | 1.5 | 0.0 | 0.0 | ... | 7.0 | 1560.0 | 0.0 | 1975.0 | 0.0 | 98065.0 | 47.5718 | -122.23 | 1840.0 | 7620.0 |
| 75% | 16209.0 | 7308900000.0 | 645000.0 | 4.0 | 2.5 | 2550.0 | 10688.0 | 2.0 | 0.0 | 0.0 | ... | 8.0 | 2210.0 | 560.0 | 1997.0 | 0.0 | 98118.0 | 47.678 | -122.125 | 2360.0 | 10083.0 |
| max | 21612.0 | 9900000000.0 | 7700000.0 | 33.0 | 8.0 | 13540.0 | 1651359.0 | 3.5 | 1.0 | 4.0 | ... | 13.0 | 9410.0 | 4820.0 | 2015.0 | 2015.0 | 98199.0 | 47.7776 | -121.315 | 6210.0 | 871200.0 |
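
In the summary, the `count` for `bedrooms` (21600) and `bathrooms` (21603) is lower than the 21613 rows of the dataframe, which points to 13 and 10 missing values respectively. A direct way to confirm this kind of gap is to count nulls per column. The sketch below does so on a small hand-made dataframe (`sample_df` and its values are illustrative, not taken from the dataset).

```python
import numpy as np
import pandas as pd

# Illustrative dataframe with a few deliberately missing entries.
sample_df = pd.DataFrame({
    "bedrooms": [3.0, np.nan, 2.0, 4.0],
    "bathrooms": [1.0, 2.25, np.nan, np.nan],
    "sqft_living": [1180, 2570, 770, 1960],
})

# isnull() marks each missing cell as True; summing the booleans
# gives the number of missing values in every column.
missing = sample_df.isnull().sum()

print(missing)
```

Applied to the real data, the equivalent call would be `df.isnull().sum()`.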


<hr>