# California Housing Price Prediction
## Exploratory Data Analysis

Goal: Understand the dataset, identify target and features, and spot obvious patterns or issues before modeling.


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.datasets import fetch_california_housing

sns.set_theme(style="whitegrid")

dataset = fetch_california_housing(as_frame=True)
df = dataset.frame
# Detailed dataset description
# print(dataset.DESCR)

df.head()


### Basic structure

In [None]:
# Basic structure: dataset size, column dtypes, and non-null counts
display(df.shape)
df.info()  # prints its report directly and returns None, so no display() wrapper

### Summary statistics

In [None]:
# Summary statistics
df.describe()

### Missing values

In [None]:
# Check for missing values
df.isna().sum()
# df.isnull().sum()

There are no missing values in the dataset.

### Target and features

In [None]:
# Identify features and target
target = "MedHouseVal"
features = df.drop(columns=target).columns

print(f"Target variable:\n- {target}\n")

print("Features:")
for feature in features:
    print("-", feature)

### Correlations

In [None]:
# Correlation matrix
corr = df.corr(numeric_only=True)

plt.figure(figsize=(8, 6))
sns.heatmap(corr, cmap="coolwarm", center=0)
plt.title("Feature Correlation Matrix")
plt.savefig("../reports/corr_mat.png", dpi=150, bbox_inches="tight")
plt.show()
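The heatmap shows every pairwise correlation at once, which can make the target column hard to read. As a minimal sketch, the features can be ranked by the absolute value of their Pearson correlation with the target. The helper name `rank_by_target_correlation` is hypothetical and not part of the original notebook.

```python
import pandas as pd


def rank_by_target_correlation(frame: pd.DataFrame, target: str) -> pd.Series:
    """Correlation of each numeric feature with the target, strongest first.

    Hypothetical helper for illustration; sorting uses absolute values so
    strong negative correlations are not pushed to the bottom.
    """
    corr_with_target = frame.corr(numeric_only=True)[target].drop(target)
    return corr_with_target.sort_values(key=lambda s: s.abs(), ascending=False)


# In this notebook, assuming `df` is the frame loaded above:
# rank_by_target_correlation(df, "MedHouseVal")
```

Note that Pearson correlation only captures linear association, so a feature with a low rank here may still be useful to a nonlinear model.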

### Distributions

In [None]:
# Simple histograms for each feature and target 
df.hist(figsize=(12, 8), bins=30)
plt.tight_layout()
plt.savefig("../reports/histograms.png", dpi=150, bbox_inches="tight")
plt.show()

### Key relationships

In [None]:
# Scatter plot of median income vs median house value
df.plot(kind="scatter", x="MedInc", y="MedHouseVal", alpha=0.3)
plt.title("Median Income vs Median House Value")
plt.savefig("../reports/medinc_vs_price.png", dpi=150, bbox_inches="tight")
plt.show()

In [None]:
# Scatter plot of observation locations (longitude vs latitude)
df.plot(kind="scatter", x="Longitude", y="Latitude", alpha=0.1, figsize=(6, 6))
plt.title("Geographic Distribution of Data")
plt.savefig("../reports/geographic_distribution.png", dpi=150, bbox_inches="tight")
plt.show()

### EDA conclusions

Key observations:
- Median income has the strongest correlation with house value.
- The target distribution is capped at its upper end, which may affect regression performance.
- All features are numeric, which simplifies preprocessing.
- Geographic location clearly matters.
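The capping noted above can be quantified rather than read off the histogram. The sketch below computes the share of observations that sit exactly at the maximum of a series; a large share suggests a cap rather than a natural tail. The helper name `share_at_maximum` is hypothetical and not part of the original notebook.

```python
import pandas as pd


def share_at_maximum(values: pd.Series) -> float:
    """Fraction of observations equal to the series maximum (hypothetical helper)."""
    return float((values == values.max()).mean())


# In this notebook, assuming `df` is the frame loaded above:
# share_at_maximum(df["MedHouseVal"])
```

If the share turns out to be substantial, one option is to compare model performance with and without the capped rows.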

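Implications for modeling:
- Feature scaling will be required for scale-sensitive models such as regularized linear models; tree-based models do not need it.
- Linear models may underfit.
- Tree-based models are promising.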
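These implications can be checked with a quick baseline comparison. The sketch below is one possible setup, not the notebook's own modeling code: it cross-validates a scaled ridge regression against a random forest and reports the mean R² of each. The function name `compare_baselines`, the choice of models, and the hyperparameters are all assumptions made for illustration.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def compare_baselines(X, y, cv=5):
    """Mean cross-validated R^2 for a scaled linear model and a tree ensemble.

    Hypothetical helper: model choices and hyperparameters are illustrative.
    """
    models = {
        # Scaling lives inside the pipeline so it is fit on training folds only.
        "ridge_scaled": make_pipeline(StandardScaler(), Ridge(alpha=1.0)),
        # Tree ensembles are insensitive to feature scale, so no scaler here.
        "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
    }
    return {
        name: float(cross_val_score(model, X, y, cv=cv, scoring="r2").mean())
        for name, model in models.items()
    }


# In this notebook, assuming `df` is the frame loaded above:
# X = df.drop(columns="MedHouseVal")
# y = df["MedHouseVal"]
# compare_baselines(X, y)
```

Keeping the scaler inside the pipeline avoids leaking statistics from validation folds into training.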