<a href="https://colab.research.google.com/github/icequeenwand/machine-learning/blob/main/supervised-learning/linear-regression/housing_price.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>



# Simple Linear Regression for Predicting House Prices

## Introduction

In this project, we explore the fundamentals of machine learning by building a simple linear regression model that predicts house prices from the area of each house. Linear regression is a widely used technique in machine learning and a good starting point for those new to the subject.

### Problem Statement

The goal of this project is to build a model that predicts the price of a house given its area. Such a model gives home buyers and sellers a rough price estimate from a single, easily accessible feature: the area. By the end of this project, we aim to have a model whose predictions are accurate enough to support the decision-making process of potential buyers and sellers.

### Dataset

To accomplish our goal, we will use a dataset of house prices and their corresponding areas, loaded from `housing.csv`. The dataset is used to train and evaluate our linear regression model. The file also contains other columns (such as `bedrooms` and `furnishingstatus`), which we drop. We keep the following two columns:

- `area`: The area of the house (in square feet).
- `price`: The price of the house (in a given currency).

We will explore, preprocess, and analyze the data to gain insights before training our machine learning model.

### Methodology

Our approach will involve the following key steps:

1. Data Exploration: We will start by loading and exploring the dataset to understand its structure and characteristics. This step will help us gain insights into the data and identify any potential challenges.

2. Data Preprocessing: We will remove duplicate rows, handle missing values, scale the data where needed, and split it into training and testing sets.

3. Linear Regression Model: We will build a simple linear regression model, where `area` is the independent variable (feature) and `price` is the dependent variable (target). The model learns the linear relationship between the two variables and uses it to make predictions.

4. Model Evaluation: We will evaluate the performance of our model using appropriate metrics such as mean squared error, mean absolute error, and R-squared.

5. Visualization: Visualizing the data and model predictions will help us better understand the relationships and assess the model's accuracy.

6. Conclusion: We will conclude the project by summarizing the findings and discussing the model's potential applications and limitations.
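
For reference, the model in step 3 and the metrics in step 4 can be written out explicitly. The notation is introduced here for illustration and is not part of the original text: $x_i$ is the area of house $i$, $y_i$ its price, $\hat{y}_i$ the predicted price, $\bar{y}$ the mean price, $n$ the number of houses being evaluated, and $\beta_0$, $\beta_1$ the intercept and slope that the model learns. The formulas are the standard definitions of these quantities.

$$
\hat{y}_i = \beta_0 + \beta_1 x_i
$$

$$
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2,
\qquad
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert,
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
$$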

By the end of this project, you will have a practical understanding of building a simple linear regression model in a Jupyter Notebook and using it to make predictions. Let's get started!
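
Before working with the real data, here is a minimal sketch of steps 2 to 4 on synthetic data. It is not the notebook's actual pipeline: the data is randomly generated, the 80/20 split and the variable names are assumptions chosen for illustration, and the fit uses NumPy's `np.polyfit` as a stand-in for whichever regression library is used later.

```python
import numpy as np

# Synthetic stand-in data (NOT the housing dataset): price grows roughly
# linearly with area, plus random noise. All constants are made up.
rng = np.random.default_rng(seed=0)
area = rng.uniform(1500, 10000, size=200)
price = 500.0 * area + 1_000_000 + rng.normal(0, 250_000, size=200)

# Step 2: shuffle the row indices and hold out 20% of the rows for testing.
indices = rng.permutation(len(area))
n_test = len(area) // 5
test_idx, train_idx = indices[:n_test], indices[n_test:]

# Step 3: fit price ~ slope * area + intercept by ordinary least squares.
# np.polyfit returns coefficients from highest degree to lowest.
slope, intercept = np.polyfit(area[train_idx], price[train_idx], deg=1)

# Step 4: evaluate on the held-out rows only.
predicted = slope * area[test_idx] + intercept
residuals = price[test_idx] - predicted
mse = np.mean(residuals ** 2)
mae = np.mean(np.abs(residuals))
ss_total = np.sum((price[test_idx] - price[test_idx].mean()) ** 2)
r2 = 1.0 - np.sum(residuals ** 2) / ss_total

print(f"slope={slope:.1f}, intercept={intercept:.0f}")
print(f"MSE={mse:.3e}, MAE={mae:.0f}, R^2={r2:.3f}")
```

Evaluating on rows the model never saw during fitting is what makes the metrics an estimate of how well it generalizes, rather than a measure of how well it memorized the training data.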

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# EDA (Exploratory Data Analysis)

In [8]:
df = pd.read_csv('housing.csv')
df.head(5)

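|   | price | area | bedrooms | bathrooms | stories | mainroad | guestroom | basement | hotwaterheating | airconditioning | parking | prefarea | furnishingstatus |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 13300000 | 7420 | 4 | 2 | 3 | yes | no | no | no | yes | 2 | yes | furnished |
| 1 | 12250000 | 8960 | 4 | 4 | 4 | yes | no | no | no | yes | 3 | no | furnished |
| 2 | 12250000 | 9960 | 3 | 2 | 2 | yes | no | yes | no | no | 2 | yes | semi-furnished |
| 3 | 12215000 | 7500 | 4 | 2 | 2 | yes | no | yes | no | yes | 3 | yes | furnished |
| 4 | 11410000 | 7420 | 4 | 1 | 2 | yes | yes | yes | no | yes | 2 | no | furnished |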


In [13]:
# Keep only the target (price) and the single feature (area); drop every other column.
df = df.drop(columns=['bedrooms', 'bathrooms', 'stories', 'mainroad', 'guestroom', 'basement', 'hotwaterheating', 'airconditioning', 'parking', 'prefarea', 'furnishingstatus'])
df.head(5)

|   | price | area |
|---|---|---|
| 0 | 13300000 | 7420 |
| 1 | 12250000 | 8960 |
| 2 | 12250000 | 9960 |
| 3 | 12215000 | 7500 |
| 4 | 11410000 | 7420 |


In [14]:
df.shape

(545, 2)

In [16]:
duplicated_rows = df[df.duplicated()]
print(f"Duplicated Rows: {duplicated_rows.shape}")

Duplicated Rows: (13, 2)


In [17]:
df.count()

price    545
area     545
dtype: int64

In [19]:
df = df.drop_duplicates()
df.count()

price    532
area     532
dtype: int64
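
The counts above are consistent with each other: 545 rows minus the 13 flagged duplicates leaves 532. The small sketch below shows why, using a toy frame with made-up values (not rows from `housing.csv`). By default, `duplicated()` uses `keep="first"`, so it flags only the second and later occurrences of a row, and `drop_duplicates()` removes exactly those flagged rows.

```python
import pandas as pd

# Toy frame with hypothetical values, for illustration only.
toy = pd.DataFrame({
    "price": [100, 100, 200, 100, 300],
    "area": [10, 10, 20, 10, 30],
})

# With the default keep="first", the first occurrence of a row is not flagged;
# only its later repeats are.
mask = toy.duplicated()
print(mask.tolist())  # [False, True, False, True, False]

# drop_duplicates() keeps the first occurrence and removes the flagged repeats.
deduped = toy.drop_duplicates()
print(len(toy), int(mask.sum()), len(deduped))  # 5 2 3
```

Because the other columns were dropped before this step, two different houses that happen to share the same price and area are also treated as duplicates here.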

In [21]:
print(df.isnull().sum())

price    0
area     0
dtype: int64


In [22]:
df = df.dropna()
df.count()

price    532
area     532
dtype: int64