# 🎓 From Scores to Seats: The Grad School ML Challenge

Welcome to the starter notebook for this beginner-friendly data science hackathon! In this challenge, your goal is to build a machine learning model that can predict whether a student will be admitted into a graduate program based on their academic profile.

This notebook will walk you through a simple end-to-end pipeline:
- Loading and exploring the data
- Preprocessing
- Training a baseline model
- Making predictions
- Preparing a submission

---

## 📦 Files
- `train.csv`: Training data (features + target)
- `test.csv`: Test data (features only)
- `SampleSubmission.csv`: Format for submitting predictions

---

## 🧠 Target Variable
- `Admitted`: 1 if the student was admitted, 0 otherwise

Let's get started! 🚀


In [None]:
# 📚 Importing Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score

In [None]:
# 📥 Load the Data
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
sample_submission = pd.read_csv("SampleSubmission.csv")

# Peek at the data
train.head()

## 🔍 Exploratory Data Analysis (EDA)
Let’s explore the training data to understand the features and target.

In [None]:
train.info()

### 🏷️ Encoding 'Location'

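`Location` is a categorical column, so it is converted to numeric format with **one-hot encoding** further below, once missing values are handled. One-hot encoding creates a binary column for each unique location, excluding one to avoid redundancy. First, check which locations appear and how often: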


In [None]:
train['Location'].value_counts()
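
To make the idea concrete, here is a minimal sketch of one-hot encoding on a toy frame. The location names and CGPA values are made up for illustration and are not taken from the competition data.

```python
import pandas as pd

# Toy data: location names and values are hypothetical
toy = pd.DataFrame({
    "CGPA": [8.1, 7.4, 9.0, 6.8],
    "Location": ["North", "South", "West", "South"],
})

# One binary column per location. drop_first=True drops the first
# category in sorted order ("North"), which becomes the implicit baseline:
# a row with 0 in every Location_* column is a "North" row.
encoded = pd.get_dummies(toy, columns=["Location"], dtype=int, drop_first=True)

print(encoded.columns.tolist())
print(encoded)
```

Dropping one category loses no information, because the dropped location can always be recovered from the remaining columns.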

In [None]:
train.isnull().sum()

In [None]:
# Fill missing numeric values with the mean of each column
numeric_cols = ['GRE Score', 'TOEFL Score', 'SOP', 'CGPA']
train[numeric_cols] = train[numeric_cols].fillna(train[numeric_cols].mean())

# Double check to confirm missing values are handled
print("Remaining missing values:", train.isnull().sum().sum())

In [None]:
# Fill missing numeric values with the mean of each column
numeric_cols = ['GRE Score', 'TOEFL Score', 'SOP', 'CGPA']
test[numeric_cols] = test[numeric_cols].fillna(test[numeric_cols].mean())

# Double check to confirm missing values are handled
print("Remaining missing values:", test.isnull().sum().sum())
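
The cell above fills the test set with its own column means. A common alternative, shown here only as a sketch, is to reuse the means computed on the training data so that both sets are filled with the same values. The toy frames and their values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the train and test frames (values are hypothetical)
toy_train = pd.DataFrame({"GRE Score": [300.0, 320.0, np.nan, 310.0]})
toy_test = pd.DataFrame({"GRE Score": [315.0, np.nan]})

cols = ["GRE Score"]

# Column means from the training data only; NaNs are skipped by default
train_means = toy_train[cols].mean()

# Fill both frames with the same training statistics
toy_train[cols] = toy_train[cols].fillna(train_means)
toy_test[cols] = toy_test[cols].fillna(train_means)

print(toy_test)
```

Whether this changes the leaderboard score depends on how many values are missing, so treat it as an option to try rather than a required step.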

In [None]:
train.head()

In [None]:
test.head()

In [None]:
# One-hot encode `Location`, dropping the first category to avoid redundancy
train = pd.get_dummies(train, columns=['Location'], dtype=int, drop_first=True)
test = pd.get_dummies(test, columns=['Location'], dtype=int, drop_first=True)
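
Because `train` and `test` are encoded separately, a location that appears in only one of them would leave the two frames with different columns. The sketch below shows one way to guard against that with `reindex`. The toy data is made up, and `feature_cols` is a name introduced here for illustration.

```python
import pandas as pd

# Toy frames: "West" appears in train but not in test (hypothetical values)
toy_train = pd.DataFrame({
    "CGPA": [8.1, 7.4, 9.0],
    "Location": ["North", "South", "West"],
    "Admitted": [1, 0, 1],
})
toy_test = pd.DataFrame({
    "CGPA": [7.9, 8.3],
    "Location": ["North", "South"],
})

toy_train = pd.get_dummies(toy_train, columns=["Location"], dtype=int, drop_first=True)
toy_test = pd.get_dummies(toy_test, columns=["Location"], dtype=int, drop_first=True)

# Use the training feature columns (everything except the target) as the
# reference; columns missing from test are added and filled with 0
feature_cols = [c for c in toy_train.columns if c != "Admitted"]
toy_test = toy_test.reindex(columns=feature_cols, fill_value=0)

print(toy_test.columns.tolist())
```

This assumes the first category in sorted order is present in both frames. If it is not, `drop_first=True` drops a different baseline in each frame, and encoding the combined data in one call is the safer route.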

In [None]:
train.head()

In [None]:
print(train.columns)

In [None]:
# Distribution of target variable
sns.countplot(x='Admitted', data=train)
plt.title("Admission Distribution")
plt.show()

In [None]:
# Plot distributions of key features
num_features = ['GRE Score', 'TOEFL Score', 'CGPA']
for col in num_features:
    sns.histplot(train[col], kde=True)
    plt.title(f'Distribution of {col}')
    plt.show()

In [None]:
# Features and target
X = train.drop(columns=['Admitted', "ID"])
y = train['Admitted']

# Split into train/validation
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

X_train.shape, X_val.shape

In [None]:
# Install lazypredict to compare many baseline classifiers in one call
!pip install lazypredict
from lazypredict.Supervised import LazyClassifier

lazy_clf = LazyClassifier()
# fit() returns two values; the first is a table of scores, one row per model
scores, _ = lazy_clf.fit(X_train, X_val, y_train, y_val)
print(scores)

In [None]:
# Use LightGBM with the DART (Dropouts meet Multiple Additive Regression Trees) boosting type
!pip install lightgbm
from lightgbm import LGBMClassifier
model = LGBMClassifier(boosting_type='dart')

In [None]:
# Fit on the training split, then predict on the validation set
model.fit(X_train, y_train)
y_pred = model.predict(X_val)

# Evaluate
print("Accuracy:", accuracy_score(y_val, y_pred))
print(classification_report(y_val, y_pred))

In [None]:
# 📈 Evaluate Model on Validation Set

from sklearn.metrics import classification_report, accuracy_score, confusion_matrix, ConfusionMatrixDisplay

# Predict on validation data
y_pred = model.predict(X_val)

# Accuracy score
print("Accuracy:", accuracy_score(y_val, y_pred))

# Classification report
print("\nClassification Report:")
print(classification_report(y_val, y_pred))

# Confusion matrix
cm = confusion_matrix(y_val, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=model.classes_)
disp.plot(cmap="Blues")
plt.title("Confusion Matrix")
plt.show()

In [None]:
importances = pd.Series(model.feature_importances_, index=X.columns)
importances.sort_values().plot(kind='barh')
plt.title("Feature Importances")
plt.show()

## 🚀 Predictions on Test Set
Let’s predict on the test set and generate a submission file.

In [None]:
test.isnull().sum()

In [None]:
test_predictions = model.predict(test.drop("ID", axis=1))

# Prepare submission
submission = sample_submission.copy()
submission['Admitted'] = test_predictions

# Save to CSV
submission.to_csv('lgbm2_submission.csv', index=False)
print("Submission file saved!")
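
Before uploading, it can help to confirm that the file matches the sample format. The helper below, `check_submission`, is a hypothetical name introduced for this sketch. It assumes the sample submission has an `ID` column alongside `Admitted`; the toy values are made up.

```python
import pandas as pd

def check_submission(submission, sample):
    """Hypothetical helper: compare a submission against the sample format."""
    if list(submission.columns) != list(sample.columns):
        raise ValueError("Columns do not match the sample submission")
    if len(submission) != len(sample):
        raise ValueError("Row count does not match the sample submission")
    if not submission["Admitted"].isin([0, 1]).all():
        raise ValueError("Predictions must be 0 or 1")
    return True

# Toy stand-ins for sample_submission and submission (assumed column names)
toy_sample = pd.DataFrame({"ID": [101, 102, 103], "Admitted": [0, 0, 0]})
toy_submission = toy_sample.copy()
toy_submission["Admitted"] = [1, 0, 1]

print(check_submission(toy_submission, toy_sample))
```

In the notebook, the equivalent call would be `check_submission(submission, sample_submission)`.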