# Linear Regression Model using Python

## Objective

This project aims to predict an outcome (house price) from house features such as area, number of bedrooms, and age.

The workflow covers:
- Exploring a real-world dataset
- Preparing and splitting data for training and testing
- Building a linear regression model using Scikit-learn’s LinearRegression
- Evaluating the model with MAE, MSE, RMSE, and the R² score
- Visualizing predictions and regression lines
- Publishing the project on GitHub/Portfolio
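
As a quick reference for the evaluation metrics named above, the sketch below computes MAE, MSE, RMSE, and R² by hand with NumPy. The `y_true` and `y_hat` arrays are made-up numbers for illustration only; they are not taken from the housing datasets.

```python
import numpy as np

# Hypothetical actual and predicted values (illustration only)
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_hat = np.array([110.0, 190.0, 320.0, 380.0])

errors = y_true - y_hat

mae = np.mean(np.abs(errors))   # Mean Absolute Error
mse = np.mean(errors ** 2)      # Mean Squared Error
rmse = np.sqrt(mse)             # Root Mean Squared Error

# R² = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"MAE:  {mae:.2f}")   # 15.00
print(f"MSE:  {mse:.2f}")   # 250.00
print(f"RMSE: {rmse:.2f}")  # 15.81
print(f"R²:   {r2:.2f}")    # 0.98
```

MAE and RMSE are in the same units as the target, which usually makes them easier to interpret than MSE.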

In [None]:
# Import relevant libraries
import numpy as np
import pandas as pd

# Visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Modeling libraries
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
# Note: root_mean_squared_error requires scikit-learn 1.4 or newer
from sklearn.metrics import mean_squared_error, r2_score, root_mean_squared_error, mean_absolute_error

import warnings
warnings.filterwarnings("ignore")

## Loading Datasets

In [1]:
# Load the dataset
df1 = pd.read_csv("../PROJECTS/Regression Datasets/wk07-regression/areas.csv")
df1.head()


In [None]:
df2 = pd.read_csv("../PROJECTS/Regression Datasets/wk07-regression/homeprices.csv")
df2.head()

In [None]:
df3 = pd.read_csv("../PROJECTS/Regression Datasets/wk07-regression/homeprices-m.csv")
df3.head()

## Exploratory Data Analysis

In [None]:
df3.shape

In [None]:
# Check for irregularities in the data
df3.info()

- The `bedrooms` column has a missing value.
- `bedrooms` holds whole numbers, so it should use an integer datatype.
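
The two points are usually connected: pandas cannot store `NaN` in a plain integer column, so a single missing entry turns the whole column into floats. A minimal sketch, using a made-up `bedrooms` series rather than the actual dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical bedroom counts with one missing entry (illustration only)
bedrooms = pd.Series([3, 4, np.nan, 3, 5])

print(bedrooms.dtype)           # float64, because NaN is a float
print(bedrooms.isnull().sum())  # 1

# pandas' nullable integer dtype keeps whole numbers and the missing marker
nullable = bedrooms.astype("Int64")
print(nullable.dtype)           # Int64
```

Once the missing value is imputed, a plain `int` cast also works.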

In [None]:
df3.isnull().sum() / df3.shape[0] * 100

In [None]:
# Check summary statistics of the dataset
df3.describe()

## Handling missing value of `bedrooms`

- The missing value was filled with the median (the middle value) of `bedrooms`, on the assumption that the value is missing at random (MAR).
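
To illustrate why the median can be a reasonable fill value for a count such as `bedrooms`, the sketch below compares it with the mean on a made-up series that contains one unusually large value. The numbers are hypothetical and only serve the comparison.

```python
import numpy as np
import pandas as pd

# Hypothetical bedroom counts; 10 is an outlier, NaN is the missing entry
bedrooms = pd.Series([2, 3, 3, 4, 10, np.nan])

median_value = bedrooms.median()  # NaN is skipped by default
mean_value = bedrooms.mean()

print(median_value)  # 3.0
print(mean_value)    # 4.4, pulled upward by the outlier

# Fill with the median, then cast to int now that nothing is missing
filled = bedrooms.fillna(median_value).astype(int)
print(filled.tolist())  # [2, 3, 3, 4, 10, 3]
```

Here the median is a value that actually occurs in the data, whereas the mean (4.4) is not a valid bedroom count.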

In [None]:
# Impute missing value in `bedrooms` column
df3['bedrooms'] = df3['bedrooms'].fillna(df3['bedrooms'].median()).astype(int)
# df3['bedrooms'] = df3['bedrooms'].fillna(df3['bedrooms'].mode()[0]).astype(int)
df3.info()


In [None]:
df3

## Visualizing Data

In [None]:
# Scatter plot for `area` vs `price`
plt.scatter(df3['area'], df3['price'])
plt.xlabel('Area (sq ft)')
plt.ylabel('Price (USD)')
plt.title('Area vs Price')
plt.show();

In [None]:
# Scatter plot for `bedrooms` vs `price`
plt.scatter(df3['bedrooms'], df3['price'])
plt.xlabel('Bedrooms')
plt.ylabel('Price (USD)')
plt.title('Bedrooms vs Price')
plt.show();


In [None]:
# Scatter plot for `age` vs `price`
plt.scatter(df3['age'], df3['price'])
plt.xlabel('Age (years)')
plt.ylabel('Price (USD)')
plt.title('Age vs Price')
plt.show();


In [None]:
# Correlation matrix
corr_matrix = df3.corr()
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, fmt=".2f", cmap='coolwarm', square=True)
plt.title('Housing Correlation Matrix Heatmap')
plt.show();

**Observations**
- The `age` of the house has a negative relationship with the `price`: as the `age` increases, the `price` decreases.
- The `area` and the number of `bedrooms` show a positive relationship with the `price`: as they increase, the `price` of the house also increases.
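
The sign of a correlation coefficient is what encodes these positive and negative relationships. A small sketch with a made-up `toy` DataFrame (not the project data) shows how to read the correlations with `price` directly instead of from the heatmap:

```python
import pandas as pd

# Hypothetical housing rows, chosen so the trends are obvious
toy = pd.DataFrame({
    "area": [1000, 1500, 2000, 2500, 3000],
    "age": [30, 25, 15, 10, 5],
    "price": [200000, 270000, 350000, 420000, 500000],
})

# Pearson correlation of each feature with the target
corr_with_price = toy.corr()["price"].drop("price")
print(corr_with_price)

# Positive sign: feature and price rise together; negative: they move oppositely
print(corr_with_price["area"] > 0)  # True
print(corr_with_price["age"] < 0)   # True
```

Correlation only measures linear association; it does not by itself show that one variable causes the other.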

## Feature Engineering
- Define the `target` and `feature` variables
- Split the dataset into `train` and `test`
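
As a rough illustration of what the split does, the sketch below runs `train_test_split` on a small synthetic array. `X_demo` and `y_demo` are placeholder names, not part of the project data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 10 samples with 2 features each (illustration only)
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Hold out 20% of the rows for testing; random_state makes the split repeatable
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

print(X_tr.shape, X_te.shape)  # (8, 2) (2, 2)
```

Every row lands in exactly one of the two sets, so the test rows stay unseen during training.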

In [None]:
# Define the features by dropping the target column
X = df3.drop(columns=['price'])
# Define the target variable
y = df3['price']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

## Build the Model using Linear Regression

In [None]:
# Instantiate the Linear Regression model
lr = LinearRegression()

# Fit the model on the training data
lr.fit(X_train, y_train)

# Make predictions on the test set
y_pred = lr.predict(X_test)

In [None]:
# Print the coefficients, labeled by feature name
print("Coefficients:")
print(pd.Series(lr.coef_, index=X.columns))

In [None]:

# Print the intercept
print("Intercept:", lr.intercept_)
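
To make the role of the coefficients and the intercept concrete: a fitted linear regression predicts `intercept_ + X @ coef_`. The sketch below checks this on synthetic, noise-free data; the names `X_demo`, `y_demo`, and `model` are illustrative and separate from the `lr` model above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data generated from a known rule: y = 50 + 3*x1 - 2*x2
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(30, 2))
y_demo = 50 + 3 * X_demo[:, 0] - 2 * X_demo[:, 1]

model = LinearRegression().fit(X_demo, y_demo)

# Reproduce the predictions by hand from the learned parameters
manual = X_demo @ model.coef_ + model.intercept_

print(np.allclose(manual, model.predict(X_demo)))  # True
print(np.round(model.coef_, 2))                    # close to [ 3. -2.]
print(round(float(model.intercept_), 2))           # close to 50.0
```

Each coefficient is the change in the predicted value for a one-unit increase in that feature, holding the other features fixed.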

In [None]:

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
rmse = root_mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)

print(f"Mean Squared Error: {mse:.2f}")
print(f"R-squared: {r2:.2f}")
print(f"Root Mean Squared Error: {rmse:.2f}")
print(f"Mean Absolute Error: {mae:.2f}")

In [None]:
# Visualize the predictions vs actual values
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred, alpha=0.7, label='Predictions')
# Dashed diagonal: where points would fall if predictions were perfect
plt.plot([y.min(), y.max()], [y.min(), y.max()], 'r--', lw=2, label='Perfect prediction')
plt.xlabel('Actual Price')
plt.ylabel('Predicted Price')
plt.title('Actual vs Predicted Prices')
plt.legend()
plt.grid()
plt.show();

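## Conclusion
- The model's performance was fair: the predicted values still sit some distance from the actual values. Plain linear regression has no hyperparameters to tune, so future improvements are more likely to come from more data, additional or engineered features, or a regularized model such as Ridge or Lasso, whose regularization strength can be tuned.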
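
One way to put a judgment like this on firmer ground is to compare the model against a naive baseline that always predicts the mean of the training targets. The sketch below does so on synthetic data; the area/price relationship and every number in it are assumptions made for illustration, not results from this project.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic data: price grows linearly with area, plus noise (assumed values)
rng = np.random.default_rng(42)
area = rng.uniform(1000, 4000, size=200)
price = 50_000 + 120 * area + rng.normal(0, 20_000, size=200)
X_demo = area.reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, price, test_size=0.2, random_state=42
)

# Baseline: always predict the mean of the training targets
baseline = DummyRegressor(strategy="mean").fit(X_tr, y_tr)
model = LinearRegression().fit(X_tr, y_tr)

baseline_mae = mean_absolute_error(y_te, baseline.predict(X_te))
model_mae = mean_absolute_error(y_te, model.predict(X_te))

print(f"Baseline MAE: {baseline_mae:,.0f}")
print(f"Model MAE:    {model_mae:,.0f}")
```

If the model's error is not clearly below the baseline's, the features are adding little predictive value.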