#1. Problem Definition

Problem: Predicting which students in higher education institutions are at risk of dropping out.

    Objectives:

        Identify at-risk students early.

        Improve retention strategies through targeted interventions.

        Reduce institutional dropout rates.

    Stakeholders:

        University administrators.

        Students and academic advisors.

    Key Performance Indicator (KPI):
    Dropout recall within the first semester (the percentage of students who actually drop out that the model flags as at-risk).

#2. Data Collection & Preprocessing

    Data Sources:

        Student academic records (grades, attendance).

        Socioeconomic and demographic data (financial aid status, parental education).

    Potential Bias:
    Underrepresentation of low-income or rural students may bias the model to favor patterns seen in more affluent or urban populations.

    Preprocessing Steps:

        Handle missing data (e.g., imputation or removal).

        Normalize numerical features (e.g., GPA scaling).

        Encode categorical variables (e.g., major, gender).
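The three preprocessing steps above can be sketched in plain Python. This is a minimal illustration, not a production pipeline: the record fields (`gpa`, `attendance`, `major`) and the sample values are hypothetical, mean imputation and min-max scaling are just one choice for each step, and in practice the imputation and scaling statistics would be computed on the training split only.

```python
from statistics import mean

# Hypothetical student records; None marks a missing value.
records = [
    {"gpa": 3.2, "attendance": 0.90, "major": "biology"},
    {"gpa": None, "attendance": 0.75, "major": "history"},
    {"gpa": 2.4, "attendance": None, "major": "biology"},
]

def impute_mean(rows, key):
    """Replace missing values in `key` with the mean of the observed values."""
    observed = [r[key] for r in rows if r[key] is not None]
    fill = mean(observed)
    return [{**r, key: fill if r[key] is None else r[key]} for r in rows]

def min_max_scale(rows, key):
    """Rescale `key` to the [0, 1] range."""
    values = [r[key] for r in rows]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero for constant columns
    return [{**r, key: (r[key] - lo) / span} for r in rows]

def one_hot(rows, key):
    """Replace a categorical column with one 0/1 column per category."""
    categories = sorted({r[key] for r in rows})
    encoded = []
    for r in rows:
        row = {k: v for k, v in r.items() if k != key}
        for c in categories:
            row[f"{key}_{c}"] = 1 if r[key] == c else 0
        encoded.append(row)
    return encoded

def preprocess(rows):
    """Impute and scale the numeric columns, then one-hot encode the major."""
    for key in ("gpa", "attendance"):
        rows = impute_mean(rows, key)
        rows = min_max_scale(rows, key)
    return one_hot(rows, "major")

clean = preprocess(records)
```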

#3. Model Development

    Model Choice:
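    Random Forest – well suited to tabular data, captures non-linear relationships, and exposes feature importances that help explain its predictions. Missing values generally still need to be imputed first, since native support depends on the implementation and version.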

    Data Splitting:

        70% training

        15% validation

        15% testing
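A minimal sketch of the 70/15/15 split, assuming the records fit in a list and a simple random shuffle is acceptable. The function name `split_dataset` and the stand-in data are illustrative. Because dropouts are usually the minority class, a stratified split may be preferable in practice.

```python
import random

def split_dataset(rows, train=0.70, val=0.15, seed=42):
    """Shuffle rows and split them into train/validation/test partitions."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * val)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],  # whatever remains is the test set
    )

student_ids = list(range(1000))  # stand-in for real student records
train_set, val_set, test_set = split_dataset(student_ids)
```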

    Hyperparameters to Tune:

        Number of trees (n_estimators) – affects performance and speed.

        Maximum tree depth (max_depth) – controls overfitting vs. underfitting.
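Tuning these two hyperparameters is often done with a grid search scored on the validation set. The sketch below shows only the search loop: `toy_validation_score` is a placeholder that stands in for "train a Random Forest with these settings and score it on the validation split", and the grid values are arbitrary examples.

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Evaluate every parameter combination and return the best one."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

def toy_validation_score(params):
    # Placeholder, NOT a real model: the score peaks at 200 trees and depth 8.
    return (1.0
            - abs(params["n_estimators"] - 200) / 1000
            - abs(params["max_depth"] - 8) / 100)

grid = {"n_estimators": [100, 200, 400], "max_depth": [4, 8, 16]}
best_params, best_score = grid_search(grid, toy_validation_score)
```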



#4. Evaluation & Deployment

    Evaluation Metrics:

        Precision – the share of students flagged as at-risk who actually drop out (important to avoid false alarms).

        Recall – the share of actual dropouts that the model flags as at-risk.

    Concept Drift:
    A change in the data distribution over time that affects model performance.
    Monitoring: Track precision and recall on each new cohort once outcomes are known, compare them with the validation baseline, and retrain on updated data when performance degrades.
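One simple way to sketch the monitoring step is to recompute recall on the most recent cohort and flag retraining when it falls too far below a baseline. The baseline value, the `max_drop` threshold, and the sample outcomes below are assumptions chosen for illustration.

```python
def recall(y_true, y_pred):
    """Fraction of actual positives (1) that were predicted positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if (tp + fn) else 0.0

def needs_retraining(baseline_recall, y_true, y_pred, max_drop=0.05):
    """Flag retraining when recall on recent data drops more than `max_drop`."""
    return (baseline_recall - recall(y_true, y_pred)) > max_drop

# Hypothetical outcomes from the most recent semester (1 = dropped out).
recent_true = [1, 1, 1, 1, 0, 0, 0, 0]
recent_pred = [1, 1, 0, 0, 0, 0, 1, 0]
retrain = needs_retraining(0.80, recent_true, recent_pred)
```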

    Technical Challenge:
    Scalability – Handling large student populations across multiple campuses with limited compute resources.

## Part 2: Case Study Application
Problem Scope (5 points)

    Problem: Predict risk of patient readmission within 30 days post-discharge.

    Objectives:

        Reduce hospital readmission rates.

        Identify high-risk patients for early intervention.

    Stakeholders:

        Hospital administrators

        Healthcare providers

        Patients

Data Strategy:

    Data Sources:

        Electronic Health Records (EHRs)

        Patient demographics and prior admission history

        Discharge summaries

        Lab test results and medication logs

    Ethical Concerns:

        Patient privacy – handling identifiable medical data.

        Algorithmic bias – model might underpredict for certain ethnic or age groups.

    Preprocessing Pipeline:

        Data cleaning – remove or impute missing lab values.

        Feature engineering – generate features like “number of prior visits” or “length of stay.”

        Encoding – categorical variables like diagnosis or discharge type encoded via one-hot encoding (label encoding is an option for tree-based models or ordered categories).

        Normalization – e.g., z-score for lab test values.
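The feature engineering and normalization steps above could look roughly like this. The record layout, field names, and the `creatinine` lab value are hypothetical stand-ins for whatever the EHR export actually contains.

```python
from datetime import date
from statistics import mean, pstdev

# Hypothetical admission records; field names are illustrative only.
admissions = [
    {"patient_id": "A", "admitted": date(2023, 1, 2),
     "discharged": date(2023, 1, 6), "creatinine": 1.0},
    {"patient_id": "A", "admitted": date(2023, 3, 10),
     "discharged": date(2023, 3, 12), "creatinine": 1.4},
    {"patient_id": "B", "admitted": date(2023, 2, 1),
     "discharged": date(2023, 2, 9), "creatinine": 0.6},
]

def add_features(rows):
    """Add length of stay (days) and number of prior visits per admission."""
    seen = {}  # patient_id -> admissions counted so far
    out = []
    for r in sorted(rows, key=lambda r: r["admitted"]):
        prior = seen.get(r["patient_id"], 0)
        out.append({
            **r,
            "length_of_stay": (r["discharged"] - r["admitted"]).days,
            "prior_visits": prior,
        })
        seen[r["patient_id"]] = prior + 1
    return out

def z_score(rows, key):
    """Standardize `key` to zero mean and unit (population) variance."""
    values = [r[key] for r in rows]
    mu = mean(values)
    sigma = pstdev(values) or 1.0  # guard against constant columns
    return [{**r, key: (r[key] - mu) / sigma} for r in rows]

features = z_score(add_features(admissions), "creatinine")
```

Rows come back in admission order, so the second visit for patient A is the last row and carries `prior_visits` of 1.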

Model Development:

    Model Choice:
    Gradient Boosted Trees (e.g., XGBoost) – Performs well on structured data, allows feature importance inspection.

    Confusion Matrix Example (Hypothetical):

                          Predicted: No Readmit    Predicted: Readmit
    Actual: No Readmit    850                      150
    Actual: Readmit       100                      400

    Precision: TP / (TP + FP) = 400 / (400 + 150) ≈ 0.727 (72.7%)

    Recall: TP / (TP + FN) = 400 / (400 + 100) = 0.80 (80%)
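The arithmetic above can be checked with a few lines of Python, treating "Readmit" as the positive class. The helper name `precision_recall` is illustrative.

```python
def precision_recall(tp, fp, fn):
    """Compute precision and recall from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall

# Counts from the hypothetical confusion matrix.
tn, fp = 850, 150  # actual "No Readmit" row
fn, tp = 100, 400  # actual "Readmit" row

precision, recall = precision_recall(tp, fp, fn)
print(f"precision={precision:.3f} recall={recall:.3f}")
# prints: precision=0.727 recall=0.800
```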


Deployment

    Integration Steps:

        Expose the model as an API service within the hospital’s digital infrastructure.

        Link the API to the discharge workflow so a prediction runs before the patient is discharged.

        Alert care teams if a patient is high-risk.
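A rough sketch of the logic that might sit behind such an API endpoint: score the patient at discharge time and raise an alert above a threshold. Everything here is an assumption for illustration, including the function names, the 0.5 threshold, and `stub_predict_risk`, which is a placeholder rule rather than a trained model.

```python
def stub_predict_risk(patient):
    # Placeholder scoring rule, NOT a trained model.
    return min(1.0, 0.2 + 0.2 * patient["prior_visits"])

def check_discharge(patient, predict_risk, threshold=0.5):
    """Score a patient before discharge and decide whether to alert the care team."""
    risk = predict_risk(patient)
    return {
        "patient_id": patient["patient_id"],
        "risk": risk,
        "alert": risk >= threshold,
    }

high = check_discharge({"patient_id": "A", "prior_visits": 3}, stub_predict_risk)
low = check_discharge({"patient_id": "B", "prior_visits": 0}, stub_predict_risk)
```

Passing the model in as a callable keeps the discharge logic independent of how the model is served, so the stub can later be swapped for a call to the real prediction service.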

    Regulatory Compliance:

        Ensure data encryption at rest and in transit.

        Enforce access control policies and audit logs.

        Comply with HIPAA (Health Insurance Portability and Accountability Act) for all PHI.

Optimization

    Method to Address Overfitting:
    Cross-validation combined with early stopping on validation loss, which halts training before the model starts fitting noisy patterns.
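Early stopping itself is a small piece of logic: keep the best validation loss seen so far and stop once it has failed to improve for a set number of rounds. The sketch below is library-independent, and the loss values and `patience` setting are hypothetical. Gradient boosting libraries such as XGBoost offer a built-in equivalent (for example, its `early_stopping_rounds` setting).

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the index of the best epoch, stopping once validation loss
    has failed to improve for `patience` consecutive epochs."""
    best_loss = float("inf")
    best_epoch = -1
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break  # no improvement for `patience` rounds
    return best_epoch

# Hypothetical validation losses, one per boosting round.
losses = [0.62, 0.55, 0.51, 0.52, 0.53, 0.50]
best = early_stopping_epoch(losses)
```

With `patience=2`, training stops after round 4 and keeps round 2. A larger patience would have reached the lower loss at round 5, which shows the trade-off the setting controls.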





## Part 3: Critical Thinking

Ethics & Bias

    Impact of Biased Training Data:
    A model trained predominantly on urban patients might miss risk patterns for rural or underserved populations, resulting in poorer care for those groups.

    Mitigation Strategy:
    Ensure diverse and representative data sampling; audit predictions with fairness metrics such as demographic parity, and apply fairness-aware training methods where gaps appear.
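Demographic parity can be checked by comparing how often the model predicts the positive class in each group. The following is a minimal sketch with made-up group labels and predictions. A large gap is a signal to investigate, not proof of unfairness on its own, since groups may differ in true risk.

```python
from collections import defaultdict

def positive_rates(groups, predictions):
    """Share of positive (1) predictions within each demographic group."""
    totals, positives = defaultdict(int), defaultdict(int)
    for g, p in zip(groups, predictions):
        totals[g] += 1
        positives[g] += p
    return {g: positives[g] / totals[g] for g in totals}

def demographic_parity_gap(groups, predictions):
    """Difference between the highest and lowest group positive rates."""
    rates = positive_rates(groups, predictions)
    return max(rates.values()) - min(rates.values())

# Hypothetical predictions (1 = flagged as high-risk) for two groups.
groups = ["urban"] * 4 + ["rural"] * 4
preds = [1, 1, 0, 0, 1, 0, 0, 0]
gap = demographic_parity_gap(groups, preds)
```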

Trade-offs

    Interpretability vs. Accuracy:

        A complex model (like a deep neural network) may offer better accuracy but be a "black box."

        In healthcare, interpretability is often preferred, so doctors can understand and trust predictions.

    Limited Compute Resources:

        May favor lighter models (e.g., Logistic Regression or Decision Trees) over complex ones like ensembles or deep learning.

        Prioritize efficiency and real-time predictions over marginal accuracy gains.

## Part 4: Reflection & Workflow Diagram (10 points)

Reflection

    Most Challenging Part:
    Defining a clean preprocessing pipeline that balances medical nuance and technical simplicity.

    Improvement with More Resources:

        Collaborate with clinicians for better feature engineering.

        Use federated learning or synthetic data generation to overcome privacy hurdles.

Diagram

AI Development Workflow:

[Problem Definition]
        ↓
[Data Collection]
        ↓
[Data Preprocessing]
        ↓
[Feature Engineering]
        ↓
[Model Selection & Training]
        ↓
[Validation & Tuning]
        ↓
[Evaluation]
        ↓
[Deployment]
        ↓
[Monitoring & Updating]



