# Experiment 1 — Academic + Behavioral Features

**Goal:** Predict `final_grade` using only academic and behavioral features.  
**Features used:** `study_hours`, `attendance_percentage`, `study_method`, `math_score`, `science_score`, `english_score`  
**Models:** Logistic Regression, Decision Tree  
**Why these models?**
- Logistic Regression → Simple baseline linear classifier
- Decision Tree → Captures non-linear patterns and is easy to interpret
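
To make the contrast between the two models concrete, here is a minimal sketch on synthetic data. It is not the experiment's dataset: the label is 1 only when a single feature falls outside a central band, a pattern that a single linear boundary cannot separate. The data, variable names, and `max_depth=3` are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, illustrative data only (not the student dataset).
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(600, 2))

# Label is 1 when the first feature lies outside the band [-0.5, 0.5].
y = (np.abs(X[:, 0]) > 0.5).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Linear baseline: one straight decision boundary.
lin_acc = LogisticRegression().fit(X_tr, y_tr).score(X_te, y_te)

# Shallow tree: axis-aligned splits can isolate both edges of the band.
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree_acc = tree.fit(X_tr, y_tr).score(X_te, y_te)

print(f"Logistic Regression accuracy: {lin_acc:.2f}")
print(f"Decision Tree accuracy      : {tree_acc:.2f}")
```

On data like this the tree should score well above the linear baseline. On the real features the gap may be much smaller, which is why we keep both models.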

In [None]:
# --- Imports ---
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, ConfusionMatrixDisplay

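import warnings
# Silence all warnings to keep the notebook output short.
# Note: this also hides scikit-learn convergence warnings.
warnings.filterwarnings('ignore')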

print('All imports loaded successfully!')

## Step 1 — Load Cleaned Data
We load the preprocessed train and test CSVs produced by `data_analysis.ipynb`.

In [None]:
# Load the cleaned datasets
train_df = pd.read_csv('../datasets/train_cleaned.csv')
test_df  = pd.read_csv('../datasets/test_cleaned.csv')

print('Training data shape :', train_df.shape)
print('Test data shape     :', test_df.shape)
print('\nColumns available   :', list(train_df.columns))
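
After loading, it can help to confirm that both splits have the same columns and that no missing values survived preprocessing. The sketch below uses two tiny hand-made frames in place of the real CSVs. The helper name `split_report`, the toy values, and the grade labels are hypothetical.

```python
import pandas as pd


def split_report(train: pd.DataFrame, test: pd.DataFrame) -> dict:
    """Summarize column mismatches and missing values across two splits."""
    return {
        "only_in_train": sorted(set(train.columns) - set(test.columns)),
        "only_in_test": sorted(set(test.columns) - set(train.columns)),
        "train_missing": int(train.isna().sum().sum()),
        "test_missing": int(test.isna().sum().sum()),
    }


# Toy stand-ins for train_df / test_df (values are made up).
toy_train = pd.DataFrame(
    {"study_hours": [2.0, 5.5], "math_score": [61, 88], "final_grade": ["c", "a"]}
)
toy_test = pd.DataFrame({"study_hours": [3.0, None], "math_score": [70, 75]})

report = split_report(toy_train, toy_test)
print(report)
```

With the real data, `split_report(train_df, test_df)` would ideally show two empty lists and zero missing values.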

## Step 2 — Select Academic + Behavioral Features

For Experiment 1 we use **only** these features:
- `study_hours`, `attendance_percentage` (behavioral)
- `math_score`, `science_score`, `english_score` (academic scores)
- `study_method_*` columns (behavioral; the one-hot encoding of `study_method`)
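
For reference, this is roughly how `study_method_*` columns arise from a single categorical column. The sketch assumes the preprocessing used `pd.get_dummies`. We have not confirmed that here, and the toy frame is made up.

```python
import pandas as pd

# Toy frame with one categorical column (values are illustrative).
toy = pd.DataFrame(
    {
        "study_hours": [2.0, 4.5, 1.0],
        "study_method": ["coaching", "group study", "notes"],
    }
)

# Each category becomes its own 0/1 column named "<column>_<category>".
encoded = pd.get_dummies(toy, columns=["study_method"], dtype=int)

print(list(encoded.columns))
print(encoded)
```

Category values are copied into the column names verbatim, so a value such as `group study` produces a column name containing a space. The feature list must match those names exactly.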

In [None]:
# Define the features for Experiment 1
exp1_features = [
    'study_hours',
    'attendance_percentage',
    'math_score',
    'science_score',
    'english_score',
    'study_method_coaching',
    'study_method_group study',
    'study_method_mixed',
    'study_method_notes',
    'study_method_online videos',
    'study_method_textbook'
]

# Separate features (X) and target (y)
X_train = train_df[exp1_features]
X_test  = test_df[exp1_features]

y_train = train_df['final_grade']
y_test  = test_df['final_grade']

print('Experiment 1 feature count :', len(exp1_features))
print('X_train shape              :', X_train.shape)
print('X_test  shape              :', X_test.shape)
print('\nTarget distribution (train):')
print(y_train.value_counts().sort_index())
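
The target distribution also tells us what accuracy a trivial model would reach, which gives a floor for judging the real models. Below is a minimal sketch of a majority-class baseline using scikit-learn's `DummyClassifier`. The grade labels and counts are hypothetical toy values, not the actual distribution.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier

# Hypothetical toy labels standing in for y_train / y_test.
y_train_toy = pd.Series(list("aabbbbcc"))
y_test_toy = pd.Series(list("abbc"))

# Class proportions in the training split.
print(y_train_toy.value_counts(normalize=True).sort_index())

# DummyClassifier ignores the features, so placeholder inputs are enough.
X_dummy_train = np.zeros((len(y_train_toy), 1))
X_dummy_test = np.zeros((len(y_test_toy), 1))

# Always predict the most frequent training class.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_dummy_train, y_train_toy)
baseline_acc = baseline.score(X_dummy_test, y_test_toy)

print(f"Majority-class baseline accuracy: {baseline_acc:.2f}")
```

A model that does not clearly beat this number has learned little beyond the class frequencies.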