# Exploratory Data Analysis (EDA) - California Housing Dataset

**Objective:** This notebook performs an initial exploratory data analysis on the California Housing dataset. The goals are:
1.  Understand the structure, data types, and statistical properties of the dataset.
2.  Analyze the distribution of the target variable, `price`.
3.  Visualize the distributions of individual features.
4.  Investigate relationships and correlations between features, especially with the target variable.
5.  Identify potential data quality issues, outliers, or patterns that will inform feature engineering and model selection.

## Setup and Imports

Import necessary libraries and set plotting styles for consistent and professional visualizations.

In [None]:
import sys
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set a professional plot style
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_context("talk")

## 1. Data Loading

**Reasoning:** Per our project's architecture, data loading logic does not live in the notebook. Instead, we import and use the `CaliforniaHousingLoader` from our `src` directory. This promotes code reuse, testability, and separation of concerns, a key practice for production-level ML systems.

We need to add the `src` directory to our system path to make the modules importable.

In [None]:
# Add the project's root directory to the Python path to allow for `src` imports
# This path is relative to the notebook's location: notebooks/01_data_exploration/
project_root = os.path.abspath(os.path.join('..', '..'))
if project_root not in sys.path:
    sys.path.append(project_root)

from src.data.loaders import CaliforniaHousingLoader

try:
    # Instantiate and load the data
    loader = CaliforniaHousingLoader()
    housing_df = loader.load()
    print("Data loaded successfully!")
except Exception as e:
    print(f"An error occurred during data loading: {e}")
    # Re-raise so the notebook stops here; every later cell depends on `housing_df`
    raise

## 2. Initial Data Inspection

**Reasoning:** Before any deep analysis, we perform a first-pass inspection of the DataFrame. This helps us quickly understand its basic characteristics: the shape, data types, and presence of missing values. It's the quickest way to get a feel for the data we're working with.

In [None]:
# Display the first few rows to get a visual sense of the data
housing_df.head()

In [None]:
# Get a concise summary of the dataframe
# This is crucial for checking data types and missing values.
housing_df.info()

**Initial Observations from `.info()`:**
- The dataset contains 20,640 entries.
- There are 8 features and 1 target variable (`price`).
- All columns are `float64`, which is expected for this dataset.
- **Crucially, there are no missing values.** This simplifies our preprocessing, but in a real-world scenario, we would need a strategy for handling them.
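
The observations above are read off the `.info()` output by eye. As a minimal sketch, the same checks can be made explicit so they are easy to re-run if the data changes. The helper name `summarize_quality` and the small stand-in frame are assumptions for illustration; in the notebook the function would be called on `housing_df`.

```python
import numpy as np
import pandas as pd


def summarize_quality(df: pd.DataFrame) -> pd.DataFrame:
    """Return dtype, missing-value count, and missing-value share per column."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "pct_missing": df.isna().mean() * 100,
    })


# Hypothetical stand-in data with one deliberate NaN (not the real dataset)
demo_df = pd.DataFrame({
    "MedInc": [8.3, 7.2, np.nan, 5.6],
    "HouseAge": [41.0, 21.0, 52.0, 52.0],
    "price": [4.5, 3.6, 3.5, 3.4],
})

quality = summarize_quality(demo_df)
print(quality)
```

On the real data, a table with zeros in `n_missing` would confirm the "no missing values" observation.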

In [None]:
# Generate descriptive statistics
# This gives us a sense of the scale, central tendency, and spread of each feature.
housing_df.describe()

**Observations from `.describe()`:**
- **Varying Scales:** Feature scales vary widely (e.g., `MedInc` is in single digits, while `Population` is in the thousands). **Feature scaling will therefore matter** for distance- and margin-based algorithms (like k-NN and SVMs) and for regularized linear models (like Ridge/Lasso); tree-based models are largely insensitive to it.
- **Capping:** `HouseAge` and the target `price` have a max value (52.0 and 5.00001 respectively) that appears to be a cap. This is an artifact of data collection and could impact model performance. We must investigate this.
- **Potential Outliers/Skew:** The `AveRooms`, `AveBedrms`, `Population`, and `AveOccup` features have a large difference between their 75th percentile and max values, indicating the presence of outliers or a highly skewed distribution.
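
Two of the observations above can be quantified directly. The sketch below assumes a cap shows up as many rows sitting exactly at the column maximum, and uses the ratio of the maximum to the 75th percentile as a rough tail indicator. The helper names `share_at_max` and `tail_ratio` and the stand-in values are hypothetical; in the notebook they would be applied to columns of `housing_df`.

```python
import pandas as pd


def share_at_max(series: pd.Series) -> float:
    """Fraction of rows whose value equals the column maximum."""
    return float((series == series.max()).mean())


def tail_ratio(series: pd.Series) -> float:
    """Ratio of the maximum to the 75th percentile (large values hint at a long tail)."""
    return float(series.max() / series.quantile(0.75))


# Hypothetical stand-in values (not the real dataset)
demo_df = pd.DataFrame({
    "price": [1.2, 2.5, 5.00001, 5.00001, 0.9, 5.00001, 3.1, 1.8],
    "Population": [300, 1200, 800, 35000, 950, 1100, 700, 1500],
})

capped_share = share_at_max(demo_df["price"])
population_tail = tail_ratio(demo_df["Population"])
print(f"Share of rows at max price: {capped_share:.1%}")
print(f"Population max / 75th percentile: {population_tail:.1f}")
```

A noticeable share of rows at the exact maximum points to capping, whereas a single row at the maximum is more consistent with an ordinary outlier.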

## 3. Target Variable Analysis (`price`)

**Reasoning:** In any supervised learning task, understanding the distribution of the target variable is one of the most important steps. Its characteristics (skewness, outliers, range) directly influence model choice, evaluation metrics, and potential transformations.

In [None]:
plt.figure(figsize=(14, 6))

plt.subplot(1, 2, 1)
sns.histplot(housing_df['price'], kde=True, bins=50)
plt.title('Distribution of Median House Value (Price)')
plt.xlabel('Median House Value ($100,000s)')

plt.subplot(1, 2, 2)
sns.boxplot(x=housing_df['price'])
plt.title('Box Plot of Median House Value (Price)')
plt.xlabel('Median House Value ($100,000s)')

plt.tight_layout()
plt.show()

**Observations on Target Variable:**
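- The distribution is **right-skewed**. Linear models in particular tend to fit better when the target, and with it the residuals, is closer to symmetric, so a log transform of `price` is worth testing.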
- There is a clear **capping** at the maximum value of 5. This is a significant data artifact. The model may struggle to predict prices for houses in this top tier. This needs a dedicated strategy in the feature engineering phase.
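
One way to act on the skewness observation is to compare the skewness of the target before and after a log transform. This is a sketch under assumptions: the synthetic, capped `price` series below only stands in for `housing_df['price']`, and `np.log1p` is one reasonable choice of transform, not the only one. Predictions made on the log scale can be mapped back with `np.expm1`.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed target, capped at 5.00001 to mimic the artifact (assumption)
rng = np.random.default_rng(42)
raw_values = rng.lognormal(mean=0.6, sigma=0.55, size=5000)
price = pd.Series(np.minimum(raw_values, 5.00001), name="price")

# log1p compresses the right tail; expm1 is its exact inverse
log_price = np.log1p(price)
recovered = np.expm1(log_price)

raw_skew = price.skew()
log_skew = log_price.skew()
print(f"Skewness before: {raw_skew:.2f}, after log1p: {log_skew:.2f}")
```

Note that a transform changes the shape of the distribution but does not remove the cap: the capped rows still pile up at a single value on the log scale.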

## 4. Feature Distribution and Relationships

**Reasoning:** Now we analyze the features. We'll look at their individual distributions (univariate analysis) to spot skewness or other patterns. Then, we'll examine their relationship with the target variable (bivariate analysis) to identify which features are most predictive.

In [None]:
# Plot histograms for all numerical features to see their distributions
housing_df.hist(bins=50, figsize=(20, 15))
plt.suptitle('Histograms of All Numerical Features', y=0.92)
plt.show()

**Observations on Feature Distributions:**
- **Skewness:** `MedInc`, `AveRooms`, `AveBedrms`, `Population`, and `AveOccup` are all heavily right-skewed. A log transformation might help normalize these distributions, which can be beneficial for linear models.
- **Capping:** `HouseAge` is also clearly capped at 52 years.
- **Geographical Features:** `Latitude` and `Longitude` show multiple peaks, which makes sense as they represent geographical locations with population clusters (e.g., Los Angeles and Bay Area).
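
The visual impression of skewness can be backed by numbers. The sketch below ranks columns by sample skewness with `DataFrame.skew()` and applies `np.log1p` only to those above a threshold. The threshold of 1.0, the helper name `skewed_columns`, and the synthetic stand-in columns are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def skewed_columns(df: pd.DataFrame, threshold: float = 1.0) -> pd.Series:
    """Return skewness of columns whose absolute skewness exceeds the threshold."""
    skew = df.skew(numeric_only=True)
    return skew[skew.abs() > threshold].sort_values(ascending=False)


# Synthetic stand-ins: one heavy-tailed column, one roughly symmetric column
rng = np.random.default_rng(0)
demo_df = pd.DataFrame({
    "Population": rng.lognormal(mean=7.0, sigma=1.0, size=2000),
    "Latitude": rng.uniform(32.5, 42.0, size=2000),
})

flagged = skewed_columns(demo_df)
cols = list(flagged.index)

# Transform only the flagged columns; leave the rest untouched
transformed = demo_df.copy()
transformed[cols] = np.log1p(demo_df[cols])

print("Flagged columns:\n", flagged)
print("Skewness after log1p:\n", transformed[cols].skew())
```

`np.log1p` requires values greater than -1, which holds for the non-negative count and ratio features discussed here but not for `Longitude`.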

### Correlation Analysis

**Reasoning:** A correlation matrix is a fast way to quantify the linear relationships between features. We are most interested in the correlations with our target, `price`. This helps us identify the most promising features and also spot potential multicollinearity (features that are highly correlated with each other).

In [None]:
corr_matrix = housing_df.corr()

# Focus on correlations with the target variable
print("Correlation with Target (price):")
print(corr_matrix['price'].sort_values(ascending=False))

In [None]:
# Visualize the full correlation matrix with a heatmap
plt.figure(figsize=(12, 10))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title('Correlation Matrix of Housing Features')
plt.show()

**Observations on Correlations:**
- **`MedInc` (Median Income)** has the strongest positive correlation with `price` (0.69). This is intuitive and makes it the strongest single linear predictor in the dataset.
- **`AveRooms`** has a weak positive correlation (0.15).
- **`Latitude`** has a slight negative correlation (-0.14), suggesting that houses in the north are slightly cheaper, though this is a very weak signal on its own.
- **Multicollinearity:** There is a high correlation between `AveRooms` and `AveBedrms` (0.85). This might be an issue for interpreting coefficients in linear models, but is less of a concern for tree-based models.
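
Rather than scanning the heatmap by eye, strongly correlated feature pairs can be listed programmatically. The sketch below masks the correlation matrix to its upper triangle so each pair appears once. The helper name `correlated_pairs`, the 0.8 threshold, and the synthetic stand-in data are assumptions.

```python
import numpy as np
import pandas as pd


def correlated_pairs(df: pd.DataFrame, threshold: float = 0.8) -> pd.Series:
    """List feature pairs whose absolute Pearson correlation meets the threshold."""
    corr = df.corr()
    # Upper triangle only (k=1 excludes the diagonal), so each pair is listed once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    pairs = pairs[pairs.abs() >= threshold]
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False)


# Synthetic stand-ins: bedrooms constructed to track rooms closely
rng = np.random.default_rng(1)
rooms = rng.normal(5.0, 1.0, size=500)
demo_df = pd.DataFrame({
    "AveRooms": rooms,
    "AveBedrms": 0.2 * rooms + rng.normal(0.0, 0.05, size=500),
    "MedInc": rng.normal(4.0, 1.5, size=500),
})

pairs = correlated_pairs(demo_df)
print(pairs)
```

In a real run the target column would normally be dropped before calling the helper, since high correlation with `price` is desirable rather than a collinearity problem.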

## 5. Geospatial Analysis

**Reasoning:** Since we have latitude and longitude data, we can create a geographical scatter plot. This is a powerful visualization that can reveal patterns that are impossible to see in tables or histograms. We can check if price is related to location, like proximity to the coast or major cities.

In [None]:
housing_df.plot(kind="scatter", x="Longitude", y="Latitude", alpha=0.4,
                s=housing_df["Population"]/100, label="Population", figsize=(12,9),
                c="price", cmap=plt.get_cmap("jet"), colorbar=True,
                sharex=False)
plt.title("California Housing Prices and Population Density")
plt.legend()
plt.show()

**Observations from Geospatial Plot:**
- The plot clearly resembles a map of California.
- **High-priced areas (red/yellow)** are concentrated along the coast, particularly in the Bay Area (around Longitude -122) and Southern California (around Los Angeles and San Diego).
- Inland areas generally have lower prices (blue/green).
- This visualization confirms that **location is a critical factor** in determining house prices. This suggests that creating location-based features (e.g., distance to coast, clustering of districts) could be highly beneficial.
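
As a sketch of one possible location-based feature, the code below computes great-circle (haversine) distances from each district to two city centers. This is a proxy for urban proximity, not the distance to the coast, which would need coastline geometry. The city coordinates are approximate, and the column names `dist_*_km` and `dist_nearest_city_km` are hypothetical.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0

# Approximate city-center coordinates (latitude, longitude); assumed reference points
CITY_CENTERS = {
    "san_francisco": (37.77, -122.42),
    "los_angeles": (34.05, -118.24),
}


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


def add_city_distances(df: pd.DataFrame) -> pd.DataFrame:
    """Add one distance column per city plus the distance to the nearest city."""
    out = df.copy()
    dist_cols = []
    for name, (lat, lon) in CITY_CENTERS.items():
        col = f"dist_{name}_km"
        out[col] = haversine_km(out["Latitude"], out["Longitude"], lat, lon)
        dist_cols.append(col)
    out["dist_nearest_city_km"] = out[dist_cols].min(axis=1)
    return out


# Stand-in districts: the two city centers and one inland point (approximate)
demo_df = pd.DataFrame({
    "Latitude": [37.77, 34.05, 36.74],
    "Longitude": [-122.42, -118.24, -119.79],
})
with_dist = add_city_distances(demo_df)
print(with_dist.round(1))
```

Whether such features actually help should be checked against a baseline model, since `Latitude` and `Longitude` already carry the same information in raw form.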

## 6. Summary of Findings & Next Steps

**Reasoning:** The final step of EDA is to synthesize all our findings into a concise summary and, most importantly, to define a clear action plan for the next phase of the project: Feature Engineering.

### Key Findings:
1.  **Data Quality:** The data is clean with no missing values, but features have vastly different scales.
2.  **Target Variable:** `price` is right-skewed and **capped** at \$500,000. This capping is a major characteristic that must be addressed.
3.  **Feature Skewness:** Several important features are highly skewed (`MedInc`, `AveRooms`, etc.).
4.  **Key Predictor:** `MedInc` is by far the strongest linear predictor of `price`.
5.  **Location Importance:** Geospatial data confirms that prices are heavily dependent on location, especially proximity to the coast and major urban centers.
6.  **Multicollinearity:** `AveRooms` and `AveBedrms` are highly correlated.

### Proposed Next Steps (for Feature Engineering):
1.  **Transformation:** Apply log transformations to the skewed features (e.g., `MedInc`, `Population`, `AveRooms`, `AveOccup`) to make their distributions more normal, which can help linear models.
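2.  **Feature Scaling:** Standardize or normalize all features to bring them to a common scale. This matters for scale-sensitive algorithms (k-NN, SVMs, regularized linear models, neural networks) but not for tree-based models. The scaler should be fit on the training split only to avoid leakage.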
3.  **Feature Creation:**
    - Create new combination features that might capture more signal, such as `rooms_per_person` or `bedrooms_per_room`.
    - Engineer location-based features. We could use clustering on `Latitude` and `Longitude` to create a `location_category` feature, or calculate `distance_to_coast`.
4.  **Handling Capped Values:** Decide on a strategy for the capped `price` and `HouseAge`. For the target `price`, options include:
    a) Removing these instances before training.
    b) Leaving them as is, but being aware that the model cannot learn the true values above the cap and will tend to underpredict in that tier.
    c) Treating it as a classification problem for that price bracket (more complex).
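
The steps above could be prototyped roughly as follows. This is a sketch under assumptions, not the project's feature engineering code: `rooms_per_person` is taken as `AveRooms / AveOccup`, `bedrooms_per_room` as `AveBedrms / AveRooms`, the cap is assumed to equal the observed maximum of `price`, and option (a) is used for capped rows. The helper names, the `price_is_capped` column, and the stand-in values are hypothetical.

```python
import numpy as np
import pandas as pd


def engineer_features(df: pd.DataFrame, target: str = "price") -> pd.DataFrame:
    """Add ratio features and a flag marking rows at the (assumed) target cap."""
    out = df.copy()
    out["rooms_per_person"] = out["AveRooms"] / out["AveOccup"]
    out["bedrooms_per_room"] = out["AveBedrms"] / out["AveRooms"]
    # Assumption: the cap equals the largest observed target value
    out["price_is_capped"] = out[target] >= out[target].max()
    return out


def standardize(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Z-score the given columns using statistics of `df` itself (training data)."""
    mean = df[columns].mean()
    std = df[columns].std(ddof=0)
    return (df[columns] - mean) / std


# Hypothetical stand-in rows (not the real dataset)
demo_df = pd.DataFrame({
    "AveRooms": [6.0, 5.0, 4.0, 8.0],
    "AveBedrms": [1.2, 1.0, 1.0, 1.6],
    "AveOccup": [3.0, 2.0, 4.0, 2.0],
    "price": [2.0, 5.00001, 1.5, 5.00001],
})

features = engineer_features(demo_df)

# Option (a): drop capped rows before fitting
train_df = features[~features["price_is_capped"]]
ratio_cols = ["rooms_per_person", "bedrooms_per_room"]
scaled = standardize(train_df, ratio_cols)

print(features)
print(scaled)
```

In a real pipeline the mean and standard deviation would be stored from the training split and reused on validation and test data, rather than recomputed per split.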

This structured analysis provides a clear, evidence-based path forward for the `02_feature_engineering` phase.