# Exploratory Data Analysis on House Price Prediction Dataset

In this notebook, we will perform exploratory data analysis (EDA) on the house price prediction dataset. The goal is to understand the data, identify patterns, and inform feature selection for our linear regression model.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set visualization style (set_theme is the current name; sns.set is an older alias)
sns.set_theme(style='whitegrid')

In [2]:
# Load the dataset
data = pd.read_csv('../data/house_prices.csv')
data.head()

In [3]:
# Check for missing values
missing_values = data.isnull().sum().sort_values(ascending=False)
missing_values[missing_values > 0]

In [4]:
# Visualize the distribution of house prices
plt.figure(figsize=(10, 6))
sns.histplot(data['SalePrice'], bins=30, kde=True)
plt.title('Distribution of House Prices')
plt.xlabel('Sale Price')
plt.ylabel('Count')
plt.show()
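
To put a number on the shape of the histogram above, sample skewness can be computed before and after a log transform. The following is a minimal sketch, not part of the original analysis: the `prices` series is a hypothetical stand-in for `data['SalePrice']`, and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for data['SalePrice']; values are illustrative only.
prices = pd.Series(
    [120000, 135000, 150000, 160000, 175000, 190000, 250000, 410000, 755000],
    name='SalePrice',
)

# Sample skewness: values well above 0 indicate a long right tail.
raw_skew = prices.skew()

# log1p compresses large values, which typically pulls a right-skewed
# distribution closer to symmetric.
log_skew = np.log1p(prices).skew()

print(f'Skewness before log transform: {raw_skew:.2f}')
print(f'Skewness after log transform:  {log_skew:.2f}')
```

With the real dataset, replacing `prices` with `data['SalePrice']` would give the corresponding figures; whether a log transform is worthwhile depends on how large the raw skewness turns out to be.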

In [5]:
# Correlation heatmap
plt.figure(figsize=(12, 8))
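# numeric_only=True skips text columns, which would otherwise raise an error in pandas >= 2.0
correlation_matrix = data.corr(numeric_only=True)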
sns.heatmap(correlation_matrix, annot=True, fmt='.2f', cmap='coolwarm', square=True)
plt.title('Correlation Heatmap')
plt.show()
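
A full heatmap can be hard to read when there are many columns, so it may help to rank features by their correlation with the target alone. The sketch below assumes a small hypothetical frame `toy` in place of `data`; every column name other than `SalePrice` is invented for illustration.

```python
import pandas as pd

# Hypothetical stand-in for `data`; only 'SalePrice' comes from the notebook.
toy = pd.DataFrame({
    'SalePrice': [100, 150, 200, 250, 300],
    'LivingArea': [50, 70, 95, 120, 150],
    'Age': [40, 30, 25, 10, 5],
    'Neighborhood': ['A', 'B', 'A', 'C', 'B'],  # non-numeric, excluded below
})

# Keep numeric columns only, then take each feature's correlation with the target.
target_corr = (
    toy.select_dtypes(include='number')
       .corr()['SalePrice']
       .drop('SalePrice')
)

# Order by absolute value so strong negative correlations rank high too.
order = target_corr.abs().sort_values(ascending=False).index
ranked = target_corr.reindex(order)
print(ranked)
```

Sorting by absolute value is a deliberate choice here: a feature with a strong negative correlation is as informative for a linear model as one with a strong positive correlation.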

## Insights

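- The distribution of house prices is right-skewed: most houses sell toward the lower end of the price range, while a small number of expensive houses form a long right tail. A log transform of `SalePrice` may be worth considering before fitting a linear model.
- Several numeric features are strongly correlated with `SalePrice`, which makes them candidate predictors for our linear regression model.
- Missing values need to be addressed (for example, by imputing them or dropping the affected columns) before training the model.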
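
As one possible way to act on the last point, the sketch below imputes numeric columns with their median and fills text columns with a placeholder category. This is an assumption about a reasonable default, not the strategy chosen in this notebook; the `toy` frame and the column names `LotWidth` and `GarageKind` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `data`; column names and values are illustrative.
toy = pd.DataFrame({
    'LotWidth': [65.0, np.nan, 80.0, 70.0],
    'GarageKind': ['Attached', None, 'Detached', 'Attached'],
    'SalePrice': [200000, 180000, 250000, 210000],
})

# Fraction of missing values per column, useful for deciding what to drop.
missing_share = toy.isnull().mean()

cleaned = toy.copy()
for col in cleaned.columns:
    if not cleaned[col].isnull().any():
        continue  # nothing to fill in this column
    if pd.api.types.is_numeric_dtype(cleaned[col]):
        # The median is less sensitive to outliers than the mean.
        cleaned[col] = cleaned[col].fillna(cleaned[col].median())
    else:
        # Treat missingness in a text column as its own category.
        cleaned[col] = cleaned[col].fillna('Missing')

print(missing_share)
print(cleaned)
```

Columns with a very high missing share may be better dropped than imputed; the right threshold depends on the dataset and is left open here.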