{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# College Insights Dashboard: Exploratory Data Analysis & Visualizations (03_eda_visuals.ipynb)\n",
    "\n",
    "This notebook performs Exploratory Data Analysis (EDA) on the merged student dataset and generates the key visualizations. The goal is to uncover trends and patterns in student performance, attendance, and subject difficulty. We use `matplotlib` and `seaborn` to create the charts that will feed our dashboard and final reports."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Setup and Data Loading\n",
    "\n",
    "We begin by importing the required libraries and loading the pre-cleaned, merged DataFrame with `load_all_data` from `src/load_data.py`, so that the analysis starts from a consistent dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "import os\n",
    "import logging\n",
    "\n",
    "# Set up logging\n",
    "logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')\n",
    "\n",
    "# Add parent directory to path to import from 'src'\n",
    "import sys\n",
    "sys.path.append('..')\n",
    "from src.load_data import load_all_data\n",
    "\n",
    "# Set the seaborn style for better aesthetics\n",
    "sns.set_theme(style=\"whitegrid\")\n",
    "\n",
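    "# Create the chart output directory if it is missing, so the plt.savefig calls below do not fail\n",
    "os.makedirs('../outputs/charts', exist_ok=True)\n",
    "\n",
    "# Load the cleaned and merged DataFrame\n",
    "df = load_all_data()\n",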
    "\n",
    "if df is not None:\n",
    "    logging.info(\"DataFrame loaded successfully. Starting EDA...\")\n",
    "    print(\"\\nDataFrame Head:\\n\")\n",
    "    display(df.head())\n",
    "else:\n",
    "    logging.error(\"Could not load data. Please check the `data` directory and `src/load_data.py`.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Key Visualizations for Insights\n",
    "\n",
    "We'll now create several charts that address key questions about academic performance. Each visualization will be saved to the `outputs/charts/` directory for use in our final dashboard and reports."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Visualization 1: Pass Percentage by Subject\n",
    "\n",
    "A bar chart compares the pass percentage across subjects, sorted from highest to lowest. Subjects with a low pass percentage may be the more challenging ones."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "if df is not None:\n",
    "    # Share of 'Pass' records per subject, expressed as a percentage and sorted high to low\n",
    "    pass_rates = (\n",
    "        df['pass_status'].eq('Pass')\n",
    "        .groupby(df['subject_name'])\n",
    "        .mean()\n",
    "        .mul(100)\n",
    "        .sort_values(ascending=False)\n",
    "    )\n",
    "\n",
    "    plt.figure(figsize=(12, 7))\n",
    "    sns.barplot(x=pass_rates.index, y=pass_rates.values, palette='viridis')\n",
    "    plt.title('Pass Percentage by Subject', fontsize=16, fontweight='bold')\n",
    "    plt.xlabel('Subject', fontsize=12)\n",
    "    plt.ylabel('Pass Percentage (%)', fontsize=12)\n",
    "    plt.xticks(rotation=45, ha='right')\n",
    "    plt.tight_layout()\n",
    "    plt.savefig('../outputs/charts/pass_rate_by_subject.png')\n",
    "    plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Visualization 2: Grade Distribution\n",
    "\n",
    "This pie chart gives a quick, high-level view of the overall pass/fail split (based on `pass_status`), which is a key performance indicator for the entire college."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "if df is not None:\n",
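    "    grade_counts = df['pass_status'].value_counts()\n",
    "    # value_counts() sorts by frequency, so tie each colour to its label: green for 'Pass', red otherwise\n",
    "    colors = ['#4CAF50' if status == 'Pass' else '#F44336' for status in grade_counts.index]\n",
    "\n",
    "    plt.figure(figsize=(8, 8))\n",
    "    plt.pie(grade_counts, labels=grade_counts.index, autopct='%1.1f%%', startangle=90, colors=colors)\n",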
    "    plt.title('Overall Grade Distribution', fontsize=16, fontweight='bold')\n",
    "    plt.tight_layout()\n",
    "    plt.savefig('../outputs/charts/grade_distribution.png')\n",
    "    plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Visualization 3: Attendance vs. Marks (Regression Plot)\n",
    "\n",
    "This scatter plot with a fitted linear regression line lets us check visually whether marks tend to rise with attendance. The result is useful for faculty and students alike."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "if df is not None:\n",
    "    plt.figure(figsize=(10, 6))\n",
    "    sns.regplot(x='attendance', y='marks', data=df, scatter_kws={'alpha':0.6}, line_kws={'color':'red', 'linestyle':'--'})\n",
    "    plt.title('Attendance vs Marks with Regression Line', fontsize=16, fontweight='bold')\n",
    "    plt.xlabel('Attendance Percentage (%)', fontsize=12)\n",
    "    plt.ylabel('Marks', fontsize=12)\n",
    "    plt.tight_layout()\n",
    "    plt.savefig('../outputs/charts/attendance_vs_marks.png')\n",
    "    plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Visualization 4: Correlation Matrix\n",
    "\n",
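    "A heatmap of the correlation matrix quantifies the linear relationship between the numerical features, here `marks` and `attendance`, using the Pearson coefficient that `DataFrame.corr()` computes by default. This is a useful step in preparing for machine learning, as it indicates which features are most strongly associated with the target."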
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "if df is not None:\n",
    "    numerical_df = df[['marks', 'attendance']]\n",
    "    corr_matrix = numerical_df.corr()\n",
    "\n",
    "    plt.figure(figsize=(8, 6))\n",
    "    sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=.5)\n",
    "    plt.title('Correlation Matrix of Marks & Attendance', fontsize=16, fontweight='bold')\n",
    "    plt.tight_layout()\n",
    "    plt.savefig('../outputs/charts/correlation_matrix.png')\n",
    "    plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Conclusion\n",
    "\n",
    "The visualizations generated in this notebook give an overview of the student data: pass percentage by subject, the overall pass/fail split, and the relationship between attendance and marks. We've identified key trends such as subject difficulty and the strong positive relationship between attendance and marks. These insights feed the final dashboard and form a basis for our predictive modeling in the next phase."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}