# DU Admission Analyzer - Demo Notebook

This notebook demonstrates how to use the DU Admission Analyzer to extract, clean, analyze, and export Delhi University admission data from PDFs.

## Features
- 📄 **PDF Extraction**: Extract tables from DU admission PDFs
- 🧹 **Data Cleaning**: Handle split rows, merge data, fix column alignment
- 📊 **Analytics**: Generate comprehensive analytics and insights
- 📤 **Excel Export**: Export to formatted Excel with multiple sheets
- 🔄 **Batch Processing**: Process multiple PDFs at once

## Setup and Installation

First, let's install the required packages:

In [None]:
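# Install required packages into the current kernel's environment
# (tabula-py also needs a Java runtime installed on the system)
%pip install pandas tabula-py openpyxl xlsxwriter requests matplotlib seaborn pdfplumber PyPDF2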

## Import and Initialize

In [None]:
import sys
import os
import pandas as pd
import matplotlib.pyplot as plt

# Add src to path
sys.path.append('src')

# Import our modules
from src.pipeline import process_admission_pdf, DUAdmissionPipeline
from src.pdf_extractor import extract_pdf
from src.data_cleaner import clean_data
from src.analytics import generate_analytics_summary
from src.excel_exporter import export_to_excel

print("✅ All modules imported successfully!")

## Quick Start - Process a Single PDF

Let's process the example PDF from the DU website:

In [None]:
# Example PDF URL from DU website
pdf_url = "https://admission.uod.ac.in/userfiles/downloads/25082025_VacantSeats_UG_Spot_Round.pdf"

# Process the PDF through the complete pipeline
print("🚀 Starting PDF processing...")
results = process_admission_pdf(pdf_url, output_dir="outputs")

if results['success']:
    print("\n✅ Processing successful!")
    print(f"📊 Data Shape: {results['data_shape']}")
    print(f"📁 Excel Export: {results['files']['excel']}")
    print(f"📄 CSV Backup: {results['files']['csv']}")
else:
    print(f"❌ Processing failed: {results['error']}")

## View Key Analytics

In [None]:
if results['success']:
    analytics = results['analytics']
    
    print("📊 OVERVIEW:")
    overview = analytics['overview']
    for key, value in overview.items():
        print(f"   {key.replace('_', ' ').title()}: {value}")
    
    print("\n🔍 KEY INSIGHTS:")
    for i, insight in enumerate(analytics['insights'][:5], 1):
        print(f"   {i}. {insight}")
    
    print("\n📈 CATEGORY TOTALS:")
    totals = analytics['totals']
    for category, total in totals.items():
        if category != 'grand_total':
            print(f"   {category}: {total:,} seats")
    print(f"   GRAND TOTAL: {totals['grand_total']:,} seats")

## Step-by-Step Processing (Advanced)

For more control, you can process each step individually:

In [None]:
# Initialize pipeline
pipeline = DUAdmissionPipeline("outputs")

# Step 1: Extract raw data
print("📄 Step 1: Extracting data from PDF...")
raw_data = extract_pdf(pdf_url)
print(f"Extracted {len(raw_data)} rows, {len(raw_data.columns)} columns")
print("\nFirst few rows of raw data:")
print(raw_data.head())
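
The internals of `extract_pdf` are not shown in this notebook. As a rough idea of what table extraction can involve, here is a minimal sketch that uses `pdfplumber` (one of the packages installed above) on a local PDF file. The helper names `rows_to_frame` and `extract_tables_sketch` are hypothetical, and treating the first row as a header that repeats on every page is an assumption, not a description of the project's code.

```python
import pandas as pd


def rows_to_frame(tables):
    """Stack raw tables (lists of rows) into one DataFrame.

    Assumes the first row of the first table is the header and that
    every data row has the same number of cells as that header.
    """
    tables = [t for t in tables if t]  # drop empty tables
    if not tables:
        return pd.DataFrame()
    header = tables[0][0]
    rows = []
    for table in tables:
        for row in table:
            if row == header:  # skip header rows repeated on later pages
                continue
            rows.append(row)
    return pd.DataFrame(rows, columns=header)


def extract_tables_sketch(pdf_path):
    """Collect every table pdfplumber finds in a local PDF file."""
    import pdfplumber  # imported lazily so rows_to_frame works without it

    tables = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            tables.extend(page.extract_tables())
    return rows_to_frame(tables)
```

Unlike `extract_pdf`, this sketch takes a file path rather than a URL, so the PDF would have to be downloaded first.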

In [None]:
# Step 2: Clean the data
print("🧹 Step 2: Cleaning data...")
clean_df = clean_data(raw_data)
print(f"Cleaned data: {len(clean_df)} rows, {len(clean_df.columns)} columns")
print("\nFirst few rows of clean data:")
print(clean_df.head())
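
The feature list mentions that cleaning handles split rows. How `clean_data` does this is not shown here, so the following is only a sketch of one common approach: a row whose key column is empty is treated as the overflow of the row above it, and its text is appended there. The function name `merge_split_rows` and the column names in the usage note are hypothetical.

```python
import pandas as pd


def merge_split_rows(df, key_col, text_col):
    """Fold continuation rows into the row above them.

    Assumption: a row whose `key_col` is empty is the overflow of the
    previous row's `text_col` (a wrapped program name, for example).
    """
    merged = []
    for record in df.to_dict("records"):
        key = record[key_col]
        is_continuation = pd.isna(key) or str(key).strip() == ""
        if is_continuation and merged:
            extra = record[text_col]
            if not pd.isna(extra):
                prev = merged[-1][text_col]
                prev = "" if pd.isna(prev) else str(prev)
                merged[-1][text_col] = f"{prev} {extra}".strip()
        else:
            merged.append(dict(record))
    return pd.DataFrame(merged, columns=df.columns)
```

For example, `merge_split_rows(raw_data, "College", "Program")` would join a program name that wrapped onto a second line, provided the raw table actually has columns with those names.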

In [None]:
# Step 3: Generate analytics
print("📊 Step 3: Generating analytics...")
analytics = generate_analytics_summary(clean_df)

# Display college-wise analysis (top 10)
print("\nTop 10 Colleges by Total Seats:")
college_analysis = analytics['college_wise']
print(college_analysis.head(10)[['Total_Seats', 'Program_Count']])

In [None]:
# Display program-wise analysis (top 10)
print("Top 10 Programs by Total Seats:")
program_analysis = analytics['program_wise']
print(program_analysis.head(10)[['Total_Seats', 'College_Count']])

In [None]:
# Display category-wise analysis
print("Category-wise Analysis:")
category_analysis = analytics['category_wise']
print(category_analysis)
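
The summary tables above come from `src/analytics.py`, which is not shown in this notebook. As an illustration of how a college-wise table with `Total_Seats` and `Program_Count` could be derived, here is a small `groupby` sketch. The function name `college_summary` and the column names `College` and `Program` are assumptions; the category columns are the ones checked in the data quality section.

```python
import pandas as pd

CATEGORY_COLS = ["UR", "OBC", "SC", "ST", "EWS", "SIKH", "PwBD"]


def college_summary(df, college_col="College", program_col="Program"):
    """Total seats and distinct-program count per college, largest first."""
    # Only sum the category columns that are actually present
    present = [c for c in CATEGORY_COLS if c in df.columns]
    work = df.copy()
    work["Total_Seats"] = work[present].sum(axis=1)
    summary = work.groupby(college_col).agg(
        Total_Seats=("Total_Seats", "sum"),
        Program_Count=(program_col, "nunique"),
    )
    return summary.sort_values("Total_Seats", ascending=False)
```

Swapping the roles of the two column arguments gives the program-wise view in the same way.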

## Visualization

In [None]:
# Create visualizations
from src.analytics import AdmissionAnalytics

analytics_obj = AdmissionAnalytics(clean_df)
visualizations = analytics_obj.create_visualizations()

# Display category distribution
plt.figure(figsize=(10, 6))
category_data = analytics['category_wise']
plt.pie(category_data['Total_Seats'], labels=category_data['Category'], autopct='%1.1f%%')
plt.title('Seat Distribution by Category')
plt.show()

In [None]:
# Top colleges bar chart
plt.figure(figsize=(12, 8))
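top_colleges = college_analysis.head(10)
# Truncate long college names so the axis labels stay readable
labels = [name[:40] + '...' if len(name) > 40 else name for name in top_colleges.index]
plt.barh(range(len(top_colleges)), top_colleges['Total_Seats'])
plt.yticks(range(len(top_colleges)), labels)
plt.gca().invert_yaxis()  # put the first row (the largest, if sorted) at the top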
plt.xlabel('Total Seats')
plt.title('Top 10 Colleges by Total Seats')
plt.tight_layout()
plt.show()

## Export Results

In [None]:
# Step 4: Export to Excel
print("📤 Step 4: Exporting to Excel...")
excel_path = export_to_excel(clean_df, "demo_analysis.xlsx", "outputs")
print(f"✅ Exported to: {excel_path}")

# Also save as CSV
os.makedirs("outputs", exist_ok=True)  # make sure the output folder exists
csv_path = "outputs/demo_clean_data.csv"
clean_df.to_csv(csv_path, index=False)
print(f"📄 CSV saved to: {csv_path}")

## Data Quality Check

In [None]:
# Check data quality
print("🔍 DATA QUALITY REPORT:")
print(f"Total rows: {len(clean_df)}")
print(f"Total columns: {len(clean_df.columns)}")
print(f"\nColumns: {list(clean_df.columns)}")

print("\nMissing values:")
missing = clean_df.isnull().sum()
for col, count in missing.items():
    if count > 0:
        print(f"   {col}: {count}")

print("\nData types:")
for col, dtype in clean_df.dtypes.items():
    print(f"   {col}: {dtype}")

# Check numeric columns
numeric_cols = ['UR', 'OBC', 'SC', 'ST', 'EWS', 'SIKH', 'PwBD']
print("\nNumeric column statistics:")
for col in numeric_cols:
    if col in clean_df.columns:
        col_data = clean_df[col]
        print(f"   {col}: min={col_data.min()}, max={col_data.max()}, sum={col_data.sum()}")
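
The min, max, and sum figures above are only meaningful when the category columns are numeric. If the data types report shows `object` for any of them, a coercion step along these lines may help. The function name `coerce_seat_columns` is hypothetical, and mapping unparseable cells to 0 is an assumption that may not suit every PDF.

```python
import pandas as pd


def coerce_seat_columns(df, cols):
    """Return a copy with the given columns converted to integers.

    Cells that cannot be parsed (blanks, dashes, stray text) become 0.
    That is an assumption; keep NaN instead if 0 would be misleading.
    Columns listed in `cols` but missing from `df` are skipped.
    """
    out = df.copy()
    for col in cols:
        if col in out.columns:
            numeric = pd.to_numeric(out[col], errors="coerce")
            out[col] = numeric.fillna(0).astype(int)
    return out
```

It can be called as `coerce_seat_columns(clean_df, numeric_cols)` before rerunning the statistics.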

## Sample Data Display

In [None]:
# Display sample of the clean data
print("📋 SAMPLE OF CLEAN DATA:")
print(clean_df.head(10))

print("\n📊 SUMMARY STATISTICS:")
print(clean_df.describe())

## Next Steps

This notebook demonstrated the complete DU Admission Analyzer pipeline. You can now:

1. **Process different PDFs**: Change the `pdf_url` to process other spot round PDFs
2. **Customize analytics**: Modify the analytics functions in `src/analytics.py`
3. **Batch processing**: Use the batch processing feature for multiple PDFs
4. **FastAPI integration**: The modular design makes it easy to integrate into a web API
5. **Custom visualizations**: Add more charts and graphs in the analytics module
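
The project's own batch processing API is not shown in this notebook. As a stand-in, batch processing (item 3) can be sketched as a loop over URLs that calls the single-PDF function and records failures instead of stopping. `process_batch` and `summarize_batch` are hypothetical helpers; `process_fn` is expected to behave like `process_admission_pdf`.

```python
def process_batch(pdf_urls, process_fn, output_dir="outputs"):
    """Run `process_fn` on each URL and collect the result dicts.

    `process_fn` is assumed to take (url, output_dir=...) and return a
    dict with a 'success' key, as `process_admission_pdf` does above.
    Exceptions are caught so one bad PDF does not stop the batch.
    """
    results = {}
    for url in pdf_urls:
        try:
            results[url] = process_fn(url, output_dir=output_dir)
        except Exception as exc:  # record the failure and keep going
            results[url] = {"success": False, "error": str(exc)}
    return results


def summarize_batch(results):
    """Split URLs into those that succeeded and those that failed."""
    ok = [u for u, r in results.items() if r.get("success")]
    failed = [u for u, r in results.items() if not r.get("success")]
    return ok, failed
```

In this notebook the call would look like `process_batch([pdf_url], process_admission_pdf)`, with more URLs added to the list as needed.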

### For FastAPI Integration:
```python
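import os
import shutil
import tempfile
from fastapi import FastAPI, UploadFile
from src.pipeline import process_admission_pdf

app = FastAPI()

@app.post("/upload")
def upload_pdf(file: UploadFile):
    # Plain `def`: FastAPI runs it in a thread pool, so the blocking pipeline call is fine
    # Save the uploaded file to a temporary PDF on disk
    with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as tmp:
        shutil.copyfileobj(file.file, tmp)
        temp_file_path = tmp.name
    try:
        # Process with the pipeline (assumes it accepts a local path as well as a URL)
        results = process_admission_pdf(temp_file_path, output_dir="outputs")
    finally:
        os.remove(temp_file_path)
    # Return only JSON-friendly fields; the analytics tables may not serialize as-is
    return {k: results.get(k) for k in ("success", "data_shape", "files", "error")}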
```