{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Land Estimator: Exploratory Data Analysis (EDA)\n",
    "This notebook explores the land parcel dataset to understand distributions, relationships, and geospatial patterns for building a land value estimator.\n",
    "\n",
    "## Objectives\n",
    "- Analyze distributions of key variables (e.g., price, area, zoning).\n",
    "- Examine relationships between variables (e.g., price vs. area).\n",
    "- Visualize geospatial patterns (e.g., price heatmaps).\n",
    "- Identify features and preprocessing needs for modeling."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 1. Setup\n",
    "import pandas as pd\n",
    "import geopandas as gpd\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "from pathlib import Path\n",
    "from src.utils import CONFIG, load_csv, load_geojson, setup_directories\n",
    "\n",
    "# Set up directories\n",
    "setup_directories()\n",
    "\n",
    "# Set plot style\n",
    "sns.set_style(\"whitegrid\")\n",
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Load Data\n",
    "Load raw parcel data (CSV) and geospatial data (GeoJSON) using utility functions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load parcel data (CSV)\n",
    "parcel_csv_path = CONFIG['raw_data_dir'] / 'parcels.csv'\n",
    "df = load_csv(parcel_csv_path)\n",
    "\n",
    "# Load geospatial parcel data (GeoJSON)\n",
    "parcel_geojson_path = CONFIG['raw_data_dir'] / 'parcels.geojson'\n",
    "gdf = load_geojson(parcel_geojson_path)\n",
    "\n",
    "# Display first few rows\n",
    "print(\"Parcel CSV Data:\")\n",
    "display(df.head())\n",
    "print(\"\\nGeospatial Data:\")\n",
    "display(gdf.head())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Data Overview\n",
    "Check data types, missing values, and summary statistics."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Data info\n",
    "print(\"CSV Data Info:\")\n",
    "df.info()\n",
    "print(\"\\nGeoJSON Data Info:\")\n",
    "gdf.info()\n",
    "\n",
    "# Summary statistics\n",
    "print(\"\\nSummary Statistics:\")\n",
    "display(df.describe())\n",
    "\n",
    "# Missing values\n",
    "print(\"\\nMissing Values in CSV:\")\n",
    "display(df.isnull().sum())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. Univariate Analysis\n",
    "Explore distributions of key variables like price, area, and zoning."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Calculate price per square meter; zero areas become NaN so the ratio is never infinite\n",
    "df['price_per_sqm'] = df['price'] / df['area'].replace(0, float('nan'))\n",
    "\n",
    "# Price per square meter distribution\n",
    "plt.figure(figsize=(10, 6))\n",
    "sns.histplot(df['price_per_sqm'], bins=50, kde=True, color='#FF6B6B')\n",
    "plt.title('Distribution of Price per Square Meter')\n",
    "plt.xlabel('Price per Square Meter (USD)')\n",
    "plt.ylabel('Count')\n",
    "plt.show()\n",
    "\n",
    "# Zoning categories\n",
    "plt.figure(figsize=(8, 6))\n",
    "df['zoning'].value_counts().plot(kind='bar', color='#4ECDC4')\n",
    "plt.title('Distribution of Zoning Categories')\n",
    "plt.xlabel('Zoning')\n",
    "plt.ylabel('Count')\n",
    "plt.xticks(rotation=45)\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5. Bivariate Analysis\n",
    "Examine relationships between variables (e.g., price vs. area, zoning vs. price)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Price vs. Area\n",
    "plt.figure(figsize=(10, 6))\n",
    "sns.scatterplot(x='area', y='price', hue='zoning', size='price_per_sqm', data=df, palette='viridis')\n",
    "plt.title('Price vs. Area by Zoning')\n",
    "plt.xlabel('Area (sqm)')\n",
    "plt.ylabel('Price (USD)')\n",
    "plt.show()\n",
    "\n",
    "# Boxplot of price per square meter by zoning\n",
    "plt.figure(figsize=(10, 6))\n",
    "sns.boxplot(x='zoning', y='price_per_sqm', data=df, palette='Set2')\n",
    "plt.title('Price per Square Meter by Zoning')\n",
    "plt.xlabel('Zoning')\n",
    "plt.ylabel('Price per Square Meter (USD)')\n",
    "plt.xticks(rotation=45)\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 6. Geospatial Analysis\n",
    "Visualize land prices on a map."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Merge price data into a new GeoDataFrame so re-running this cell does not duplicate columns\n",
    "gdf_prices = gdf.merge(df[['parcel_id', 'price_per_sqm']], on='parcel_id', how='left')\n",
    "\n",
    "# Plot a choropleth of price per square meter; parcels without a price are drawn in grey\n",
    "fig, ax = plt.subplots(figsize=(12, 8))\n",
    "gdf_prices.plot(column='price_per_sqm', cmap='YlOrRd', legend=True, ax=ax, missing_kwds={'color': 'lightgrey'})\n",
    "ax.set_title('Geospatial Heatmap of Price per Square Meter')\n",
    "ax.set_xlabel('Longitude')\n",
    "ax.set_ylabel('Latitude')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 7. Correlation Analysis\n",
    "Check correlations between numerical features."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Correlation matrix\n",
    "corr = df[['price', 'area', 'price_per_sqm']].corr()\n",
    "plt.figure(figsize=(8, 6))\n",
    "sns.heatmap(corr, annot=True, cmap='coolwarm', vmin=-1, vmax=1)\n",
    "plt.title('Correlation Matrix')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 8. Key Insights\n",
    "- **Price Distribution**: Price per square meter is right-skewed, which suggests log-transforming the target before modeling.\n",
    "- **Zoning Impact**: Residential zones tend to have a higher median price per square meter than commercial or industrial zones.\n",
    "- **Geospatial Patterns**: Higher prices cluster in specific areas, likely near amenities or city centers.\n",
    "- **Next Steps**:\n",
    "  - Preprocess data to handle missing values and outliers.\n",
    "  - Engineer features like distance to amenities or encoded zoning categories.\n",
    "  - Test regression models in `model.py` using `price_per_sqm` as the target."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}