{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# College Insights Dashboard: Predictive Modeling (04_ml_prediction.ipynb)\n",
    "\n",
    "This notebook builds a **Logistic Regression** model that predicts whether a student will pass or fail a subject from their attendance and marks. Predictions like these can help faculty identify at-risk students early and offer support before final results are in."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Setup and Data Loading\n",
    "\n",
    "First, we import the libraries needed for data manipulation, modeling, and evaluation. We then load the cleaned and merged DataFrame by calling `load_all_data()` from `src/load_data.py`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "import logging\n",
    "import os\n",
    "\n",
    "# Scikit-learn for modeling and evaluation\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.metrics import accuracy_score, classification_report, confusion_matrix\n",
    "\n",
    "# joblib for saving the model\n",
    "import joblib\n",
    "\n",
    "# Visualization\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "\n",
    "logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')\n",
    "\n",
    "# Add parent directory to path to import from 'src'\n",
    "import sys\n",
    "sys.path.append('..')\n",
    "from src.load_data import load_all_data\n",
    "\n",
    "# Load the cleaned and merged DataFrame\n",
    "df = load_all_data()\n",
    "\n",
    "if df is not None:\n",
    "    logging.info(\"DataFrame loaded successfully. Starting model training...\")\n",
    "    print(\"\\nDataFrame Head:\\n\")\n",
    "    display(df.head())\n",
    "else:\n",
    "    logging.error(\"Could not load data. Please check the `data` directory and `src/load_data.py`.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Data Preparation for Modeling\n",
    "\n",
    "To build the model, we define the features (independent variables) and the target (dependent variable). The target, `pass_status`, is encoded numerically: 1 for 'Pass' and 0 for 'Fail'. We then split the data into training (80%) and testing (20%) sets, stratified on the target so both sets keep the same pass ratio, and use the test set to evaluate the model on unseen data."
   ]
  },
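  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "if df is not None:\n",
    "    # Define features (X) and target (y)\n",
    "    features = ['attendance', 'marks']\n",
    "    target = 'pass_status'\n",
    "\n",
    "    # Keep only rows with complete features and a recognized label,\n",
    "    # so the encoded target contains no NaN values\n",
    "    model_df = df.dropna(subset=features + [target])\n",
    "    model_df = model_df[model_df[target].isin(['Pass', 'Fail'])]\n",
    "\n",
    "    X = model_df[features]\n",
    "    y = model_df[target].map({'Pass': 1, 'Fail': 0})  # Encode target: 1 = Pass, 0 = Fail\n",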
    "\n",
    "    # Split the data into training and testing sets (80% train, 20% test)\n",
    "    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)\n",
    "\n",
    "    logging.info(f\"Data split into training ({len(X_train)} samples) and testing ({len(X_test)} samples) sets.\")\n",
    "    logging.info(f\"Pass ratio in training set: {y_train.mean():.2f}\")\n",
    "    logging.info(f\"Pass ratio in testing set: {y_test.mean():.2f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Model Training\n",
    "\n",
    "We will use `LogisticRegression` from `scikit-learn` for its simplicity and interpretability. We train the model on the `X_train` and `y_train` data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "if df is not None:\n",
    "    # Initialize the Logistic Regression model\n",
    "    model = LogisticRegression(random_state=42)\n",
    "\n",
    "    # Train the model\n",
    "    model.fit(X_train, y_train)\n",
    "\n",
    "    logging.info(\"✅ Logistic Regression model trained successfully.\")"
   ]
  },
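  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Because interpretability is the stated reason for choosing logistic regression, it can help to look at the fitted coefficients directly. The cell below is a minimal sketch that reuses `model` and `features` from the cells above; the `coef_table` name is introduced here for illustration. Each coefficient is the change in the log-odds of passing per one-unit increase in that feature, and its exponential is the corresponding odds ratio. Since the features are not standardized, coefficient magnitudes are not directly comparable across features."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "if df is not None:\n",
    "    # One row per feature: coefficient (log-odds per unit) and its odds ratio\n",
    "    coef_table = pd.DataFrame({\n",
    "        'feature': features,\n",
    "        'coefficient': model.coef_[0],\n",
    "        'odds_ratio': np.exp(model.coef_[0]),\n",
    "    })\n",
    "    display(coef_table)\n",
    "    logging.info(f\"Intercept (log-odds when all features are 0): {model.intercept_[0]:.3f}\")"
   ]
  },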
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. Model Evaluation\n",
    "\n",
    "After training, we evaluate the model on the held-out test data. We report overall accuracy, a per-class classification report (precision, recall, and F1-score), and a confusion matrix."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "if df is not None:\n",
    "    # Make predictions on the test set\n",
    "    y_pred = model.predict(X_test)\n",
    "\n",
    "    # Calculate and print evaluation metrics\n",
    "    accuracy = accuracy_score(y_test, y_pred)\n",
    "    logging.info(f\"Model Accuracy: {accuracy:.2f}\")\n",
    "\n",
    "    print(\"\\nClassification Report:\\n\")\n",
    "    print(classification_report(y_test, y_pred, target_names=['Fail', 'Pass']))\n",
    "\n",
    "    # Plot the confusion matrix for a visual representation\n",
    "    cm = confusion_matrix(y_test, y_pred)\n",
    "    plt.figure(figsize=(8, 6))\n",
    "    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['Fail', 'Pass'], yticklabels=['Fail', 'Pass'])\n",
    "    plt.title('Confusion Matrix', fontsize=16, fontweight='bold')\n",
    "    plt.xlabel('Predicted Label', fontsize=12)\n",
    "    plt.ylabel('True Label', fontsize=12)\n",
    "    plt.tight_layout()\n",
    "    plt.show()"
   ]
  },
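  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Accuracy and the classification report depend on the default 0.5 decision threshold. As a threshold-independent complement, the sketch below scores the predicted pass probabilities with ROC AUC. It assumes the test set contains both classes, which the stratified split should ensure; `y_prob` and `auc` are names introduced here for illustration."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.metrics import roc_auc_score\n",
    "\n",
    "if df is not None:\n",
    "    # Probability of the positive class (1 = 'Pass') for each test sample\n",
    "    y_prob = model.predict_proba(X_test)[:, 1]\n",
    "\n",
    "    # Area under the ROC curve: 0.5 is chance level, 1.0 is perfect ranking\n",
    "    auc = roc_auc_score(y_test, y_prob)\n",
    "    logging.info(f\"ROC AUC: {auc:.2f}\")"
   ]
  },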
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5. Model Persistence\n",
    "\n",
    "To use the trained model in the Streamlit dashboard without retraining it every time, we save it to a file. `joblib` is well suited to this because it serializes scikit-learn estimators efficiently."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "if df is not None:\n",
    "    output_dir = '../outputs/'\n",
    "    # Create the output directory if it does not already exist\n",
    "    os.makedirs(output_dir, exist_ok=True)\n",
    "\n",
    "    model_path = os.path.join(output_dir, 'model.pkl')\n",
    "    joblib.dump(model, model_path)\n",
    "    logging.info(f\"💾 Trained model saved to: {model_path}\")"
   ]
  },
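  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a quick check that the saved file is usable, the sketch below reloads it with `joblib.load` and scores one made-up student. The attendance and marks values are hypothetical and assume the same units as the training data. The dashboard code in `streamlit_app/app.py` is not shown in this notebook, so this only illustrates the loading step."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "if df is not None:\n",
    "    # Reload the persisted model from disk\n",
    "    loaded_model = joblib.load(model_path)\n",
    "\n",
    "    # Hypothetical student; column names must match the training features\n",
    "    sample = pd.DataFrame([{'attendance': 75, 'marks': 40}])[features]\n",
    "\n",
    "    # Column 1 of predict_proba is the probability of class 1 ('Pass')\n",
    "    pass_probability = loaded_model.predict_proba(sample)[0, 1]\n",
    "    logging.info(f\"Predicted pass probability for the sample student: {pass_probability:.2f}\")"
   ]
  },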
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 6. Conclusion\n",
    "\n",
    "We trained a logistic regression model to predict student pass/fail status and evaluated it on a held-out test set using accuracy, a classification report, and a confusion matrix. Whether the model is accurate enough for the dashboard should be judged from those results once the notebook has been run. The saved model can then be loaded in `streamlit_app/app.py` for real-time predictions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}