In [None]:
# Create a file with Jupyter Notebook content.
# A raw string keeps the JSON escapes (\n, \") intact so the written file is valid JSON.
jupyter_notebook_content = r"""
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# College Insights Dashboard: Data Cleaning & Preparation (01_data_cleaning.ipynb)\n",
    "\n",
    "This notebook is the first step of the project. It loads the raw data from the CSV files, inspects and cleans it, and produces a single clean DataFrame for later analysis and modeling. Everything downstream depends on this step, so errors here carry through to every insight."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Setup and Library Imports"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import os\n",
    "import logging\n",
    "\n",
    "# Set up logging for informative output\n",
    "logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')\n",
    "\n",
    "# Set the path to the data directory\n",
    "data_path = '../data/'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Load Raw Datasets\n",
    "\n",
    "We'll load each of the four CSV files into separate pandas DataFrames to inspect their individual structures."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "try:\n",
    "    students_df = pd.read_csv(os.path.join(data_path, 'students.csv'))\n",
    "    subjects_df = pd.read_csv(os.path.join(data_path, 'subjects.csv'))\n",
    "    marks_df = pd.read_csv(os.path.join(data_path, 'marks.csv'))\n",
    "    attendance_df = pd.read_csv(os.path.join(data_path, 'attendance.csv'))\n",
    "    logging.info(\"All raw data files loaded successfully.\")\n",
    "except FileNotFoundError as e:\n",
    "    logging.error(f\"Error loading data: {e}. Please ensure the CSV files are in the '{data_path}' directory.\")\n",
    "    # Re-raise so execution stops here if a file is missing\n",
    "    raise"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Initial Data Inspection\n",
    "\n",
    "Calling `info()` on each DataFrame shows its columns, data types, and non-null counts, which reveals potential issues such as missing values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "logging.info(\"\\n--- students_df info ---\")\n",
    "students_df.info()\n",
    "logging.info(\"\\n--- subjects_df info ---\")\n",
    "subjects_df.info()\n",
    "logging.info(\"\\n--- marks_df info ---\")\n",
    "marks_df.info()\n",
    "logging.info(\"\\n--- attendance_df info ---\")\n",
    "attendance_df.info()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. Merging Datasets\n",
    "\n",
    "To perform meaningful analysis, we need to combine all the information into a single DataFrame. We chain `DataFrame.merge()` calls, starting with `marks_df` as the base and using left joins so that every marks record is kept."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Merge marks with student data\n",
    "combined_df = marks_df.merge(students_df, on='student_id', how='left')\n",
    "\n",
    "# Merge the result with subject data\n",
    "combined_df = combined_df.merge(subjects_df, on='subject_id', how='left')\n",
    "\n",
    "# Finally, merge with attendance data\n",
    "combined_df = combined_df.merge(attendance_df, on=['student_id', 'subject_id'], how='left')\n",
    "\n",
    "logging.info(\"Datasets merged successfully.\")\n",
    "combined_df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5. Data Cleaning & Feature Engineering\n",
    "\n",
    "Now that the data is merged, we'll perform final cleaning steps and create new features that will be useful for our analysis and modeling."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 1. Handle Missing Values: Drop any rows that may have missing data after the merge (e.g., if a student or subject ID was not found)\n",
    "logging.info(f\"Shape before dropping NaNs: {combined_df.shape}\")\n",
    "combined_df.dropna(inplace=True)\n",
    "logging.info(f\"Shape after dropping NaNs: {combined_df.shape}\")\n",
    "\n",
    "# 2. Rename columns for clarity (assumes students.csv and subjects.csv both have a 'name' column, which pandas suffixes as 'name_x'/'name_y' on merge)\n",
    "combined_df.rename(columns={\n",
    "    'name_x': 'student_name',\n",
    "    'name_y': 'subject_name',\n",
    "    'attendance_percentage': 'attendance'\n",
    "}, inplace=True)\n",
    "\n",
    "# 3. Feature Engineering: Create a 'pass_status' column (Pass if marks >= 40, Fail otherwise)\n",
    "combined_df['pass_status'] = combined_df['marks'].apply(lambda x: 'Pass' if x >= 40 else 'Fail')\n",
    "\n",
    "logging.info(\"Columns renamed and 'pass_status' feature engineered.\")\n",
    "combined_df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 6. Final Data Inspection and Summary\n",
    "\n",
    "A final check to ensure all our data is in the correct format before we proceed to the next steps."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "combined_df.info()\n",
    "logging.info(\"\\n--- Summary of Final DataFrame ---\")\n",
    "logging.info(f\"Number of records: {len(combined_df)}\")\n",
    "logging.info(f\"Number of unique students: {combined_df['student_id'].nunique()}\")\n",
    "logging.info(f\"Number of unique subjects: {combined_df['subject_id'].nunique()}\")\n",
    "logging.info(\"Final DataFrame ready for analysis.\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
"""

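import os
# Make sure the target directory exists before writing.
os.makedirs('notebooks', exist_ok=True)
with open('notebooks/01_data_cleaning.ipynb', 'w', encoding='utf-8') as f:
    f.write(jupyter_notebook_content)
print('The file "notebooks/01_data_cleaning.ipynb" has been created.')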
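
The notebook JSON above is embedded in a Python string literal, so backslash escapes such as `\n` are interpreted by Python before the text reaches the file unless the literal is raw. The sketch below is one way to check that such a string still parses as notebook JSON before writing it. The helper name `is_valid_notebook_json` and the two tiny sample strings are hypothetical and used only for illustration; the check covers only JSON syntax and the presence of a top-level `cells` list, not the full notebook schema.

```python
import json


def is_valid_notebook_json(text):
    """Return True if text parses as JSON and has a top-level 'cells' list."""
    try:
        nb = json.loads(text)
    except json.JSONDecodeError:
        return False
    return isinstance(nb, dict) and isinstance(nb.get("cells"), list)


# Non-raw literal: Python turns \n into a real newline inside the JSON string,
# which the JSON parser rejects as an invalid control character.
non_raw = """{"cells": [{"cell_type": "markdown", "metadata": {}, "source": ["# Title\n"]}], "nbformat": 4}"""

# Raw literal: the backslash-n reaches the JSON parser unchanged and is a valid escape.
raw = r"""{"cells": [{"cell_type": "markdown", "metadata": {}, "source": ["# Title\n"]}], "nbformat": 4}"""

print(is_valid_notebook_json(non_raw))  # False
print(is_valid_notebook_json(raw))      # True
```

Running the same check on `jupyter_notebook_content` before the `open(...)` call would catch an escaping problem before a broken `.ipynb` file is written.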
