# File: notebooks/exploration.ipynb
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Laptop Price Prediction: Exploratory Data Analysis (EDA)\n",
    "\n",
    "This notebook is for prototyping, exploring the data, and testing feature engineering ideas.\n",
    "\n",
    "**Objective:**\n",
    "1.  Load the processed dataset.\n",
    "2.  Analyze feature distributions.\n",
    "3.  Analyze relationships between features and the target variable (`price`).\n",
    "4.  Identify correlations and potential issues."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "from pathlib import Path\n",
    "import sys\n",
    "\n",
    "# Add src to path to import our utility functions\n",
    "sys.path.append('../')\n",
    "from src.utils import (\n",
    "    plot_brand_distribution, \n",
    "    plot_avg_price_by_brand, \n",
    "    plot_correlation_heatmap, \n",
    "    plot_ram_vs_price_scatter\n",
    ")\n",
    "\n",
    "%matplotlib inline\n",
    "sns.set_theme(style=\"whitegrid\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Load Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "DATA_PATH = Path('../data/processed/laptops_cleaned.csv')\n",
    "\n",
    "if not DATA_PATH.exists():\n",
    "    print(f\"Error: {DATA_PATH} not found.\")\n",
    "    print(\"Please run: python -m src.data.preprocess\")\n",
    "else:\n",
    "    df = pd.read_csv(DATA_PATH)\n",
    "    print(f\"Data loaded successfully: {df.shape}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.info()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Target Variable Analysis (Price)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))\n",
    "\n",
    "# Plot original price distribution\n",
    "sns.histplot(df['price'], kde=True, ax=ax1)\n",
    "ax1.set_title('Distribution of Price (Right-Skewed)')\n",
    "\n",
    "# Plot log-transformed price distribution\n",
    "# Use np.log1p for numerical stability (handles 0s)\n",
    "df['log_price'] = np.log1p(df['price'])\n",
    "sns.histplot(df['log_price'], kde=True, ax=ax2)\n",
    "ax2.set_title('Distribution of Log-Transformed Price (More Normal)')\n",
    "\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Observation:** The price is heavily right-skewed. Applying a log-transform makes it much more normally distributed, which is ideal for linear models and can help tree-based models converge better."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Categorical Feature Analysis"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Plot 1: Brand Distribution\n",
    "fig1 = plot_brand_distribution(df)\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Plot 2: Average Price by Brand\n",
    "fig2 = plot_avg_price_by_brand(df)\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Plot 3: Average Price by OS\n",
    "plt.figure(figsize=(8, 5))\n",
    "sns.barplot(x='os_category', y='price', data=df)\n",
    "plt.title('Average Price by Operating System')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. Numeric Feature Analysis"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Plot 4: RAM vs Price Scatter\n",
    "fig3 = plot_ram_vs_price_scatter(df)\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Observation:** Clear positive correlation. More RAM = higher price. The relationship appears somewhat exponential, confirming that log-transforming price is a good idea."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Plot 5: Correlation Heatmap\n",
    "numeric_cols = df.select_dtypes(include=np.number).columns\n",
    "fig4 = plot_correlation_heatmap(df, numeric_cols)\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Observations from Heatmap:**\n",
    "* `price` (and `log_price`) has strong positive correlations with `ram_gb`, `cpu_score`, `ppi`, and `user_rating`.\n",
    "* `weight_kg` and `display_size_in` are highly correlated (larger screens = heavier).\n",
    "* `is_gaming` is strongly correlated with `ram_gb` and `cpu_score`.\n",
    "* `is_ultrabook` is negatively correlated with `weight_kg` and `display_size_in`, which makes sense."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.10.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
