# Graduate Admission Predictor using [microsoft/FLAML](https://github.com/microsoft/FLAML) library

__Kaggle Data Card__: [mohansacharya/graduate-admissions](https://www.kaggle.com/datasets/mohansacharya/graduate-admissions)

# Packages Used

* `pandas` for loading the data
* `flaml` for model building
* `sklearn.metrics` for model evaluation
* `seaborn` and `matplotlib` for visualizing the results
* `scipy.stats` for standardizing the residuals

In [None]:
# ! pip install flaml[automl] pandas seaborn scikit-learn

In [None]:
import pandas as pd
import sklearn as skl
import sklearn.metrics  # import the submodule explicitly so `skl.metrics` is always available
import seaborn as sns
import flaml
from matplotlib import pyplot as plt
from pathlib import Path
from pprint import pprint
from scipy.stats import zscore

# check versions
print(f"pandas = {pd.__version__}")
print(f"sklearn = {skl.__version__}")
print(f"flaml = {flaml.__version__}")
print(f"seaborn = {sns.__version__}")

# suppress UserWarnings and FutureWarnings
import warnings
warnings.filterwarnings("ignore", category=UserWarning)
warnings.filterwarnings("ignore", category=FutureWarning)

# I/O paths

In [None]:
DATASET = Path("/kaggle/input/graduate-admissions/Admission_Predict.csv")
SAVED_MODEL = Path("/kaggle/working/", "model.pickle")
TRAINING_LOG = Path("/kaggle/working/", "train.log")

# 1. Import Data

In [None]:
df = pd.read_csv(DATASET)
df.info()

In [None]:
df.sample(n=8)

### Conclusions

1. `df.info()` shows that no column has missing values.
2. `Serial No.` is only a row identifier (like a database primary key), so it can be dropped from the dataset.
3. `University Rating` and `Research` could be treated as categorical variables, but this is not needed here since FLAML handles them as they are.
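
The three conclusions above can also be checked programmatically. The sketch below uses a small made-up table rather than the Kaggle file; the values are illustrative assumptions, and the column names are only assumed to mirror the dataset.

```python
import pandas as pd

# Toy stand-in for the admissions data; all values are made up for illustration.
toy = pd.DataFrame({
    "Serial No.": [1, 2, 3, 4],
    "University Rating": [4, 3, 2, 5],
    "Research": [1, 0, 0, 1],
    "CGPA": [9.1, 8.2, 7.9, 9.6],
})

# Conclusion 1: count missing values per column (all zeros means nothing is missing).
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Conclusion 2: an identifier column has one unique value per row.
is_identifier = toy["Serial No."].is_unique
print("Serial No. is unique per row:", is_identifier)

# Conclusion 3: optional explicit cast to a categorical dtype.
categorical = toy.astype({"University Rating": "category", "Research": "category"})
print(categorical.dtypes)
```

The explicit cast is optional in this notebook; it is shown only for cases where a downstream tool needs to be told which columns are categorical.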


In [None]:
# drop Serial Number column
df = df.drop(columns=df.columns[0])

# 2. Separate out data for Model Validation

In [None]:
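def train_test_split(dataframe: pd.DataFrame, target_column: str, test_ratio: float = 0.2):
    """
    Shuffle the data and hold out the first `test_ratio` fraction as a test set.

    Returns the training dataframe (features and target together), the test
    features and the test target (both as NumPy arrays).
    """
    # shuffle the dataset (fixed seed for reproducibility)
    dataframe = dataframe.sample(frac=1, random_state=225)

    # number of rows held out for testing
    split_index = int(len(dataframe) * test_ratio)

    # separate features and target for the test partition
    test_part = dataframe.iloc[:split_index]
    X_test = test_part.drop(columns=target_column).values
    y_test = test_part[target_column].values

    # return the training dataframe and the test arrays
    return dataframe.iloc[split_index:], X_test, y_test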

train_df, X_test, y_test = train_test_split(df, df.columns[-1], test_ratio=0.2)

In [None]:
print("Train Dataframe: ", train_df.shape)
print("X-Test: ", X_test.shape)
print("Y-Test: ", y_test.shape)
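
A similar hold-out split can be produced with scikit-learn's own `train_test_split`. The sketch below is an alternative, not the method used in this notebook; it imports the function under the alias `sk_split` to avoid clashing with the helper defined above, and the toy data and column names are assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split as sk_split

# Toy data with an assumed target column; values are made up for illustration.
toy = pd.DataFrame({
    "GRE Score": [320, 310, 300, 330, 315, 305, 325, 298, 312, 318],
    "CGPA": [9.0, 8.5, 7.8, 9.5, 8.7, 8.0, 9.2, 7.5, 8.4, 8.9],
    "Chance of Admit": [0.85, 0.72, 0.55, 0.94, 0.76, 0.60, 0.88, 0.45, 0.70, 0.80],
})

# Shuffle and hold out 20% of the rows, with a fixed seed for reproducibility.
train_part, test_part = sk_split(toy, test_size=0.2, random_state=225)

# Keep the training rows as a dataframe; split the test rows into arrays.
X_test_toy = test_part.drop(columns="Chance of Admit").values
y_test_toy = test_part["Chance of Admit"].values

print(train_part.shape, X_test_toy.shape, y_test_toy.shape)
```

Keeping the training partition as a single dataframe matches the way `fit` is called later with `dataframe` and `label`.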

# 3. Train the Model

## Training Params

| Parameter | Description |
| --------- | ----------- |
| `task` | Type of learning task, e.g. `regression` or `classification` |
| `metric` | Metric to optimize, e.g. `mse` for regression and `accuracy` or `log_loss` for classification |
| `estimator_list` | List of estimators that will be tuned |
| `eval_method` | `cv` for k-fold cross-validation and `holdout` for a train-test split |
| `n_splits` | Value of `k` in cross-validation. Not used with `holdout` |
| `time_budget` | Maximum time (in seconds) allowed for tuning the models |
| `early_stop` | Whether to stop the search early once it is considered to have converged |
| `log_file_name` | Path to the log file. Set to an empty string (the default) to disable logging |
| `n_jobs` | Number of CPU cores to use. `-1` indicates all available cores |
| `verbose` | Level of verbosity. `0` to disable; higher values print more messages |
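
To make `eval_method="cv"` with `n_splits=3` concrete, the sketch below scores a single fixed model with 3-fold cross-validation in scikit-learn. It only illustrates how one candidate configuration can be evaluated; it is not FLAML's search procedure, and the synthetic data and the choice of `RandomForestRegressor` are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic regression data (an assumption, not the admissions dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 4))
y = X @ np.array([0.5, 0.3, 0.0, 0.1]) + rng.normal(scale=0.05, size=120)

# One candidate configuration, evaluated on 3 folds.
candidate = RandomForestRegressor(n_estimators=50, random_state=0)

# scikit-learn reports negative MSE so that "higher is better"; flip the sign.
neg_mse = cross_val_score(candidate, X, y, cv=3, scoring="neg_mean_squared_error")
fold_mse = -neg_mse

print("MSE per fold:", fold_mse)
print("Mean MSE:", fold_mse.mean())
```

Averaging the per-fold error gives a single validation score for the candidate, which is the kind of number a tuner can compare across configurations.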

In [None]:
# init model
model = flaml.AutoML()

# model settings
settings = {
    "task": "regression",
    "metric": "mse",
    "estimator_list": ["rf", "xgboost", "lgbm", "xgb_limitdepth", "extra_tree"],
    "eval_method": "cv",
    "n_splits": 3,
    "time_budget": 300,
    "early_stop": True,
    "log_file_name": str(TRAINING_LOG),
    "n_jobs": -1,
    "verbose": 0
}

# start training
model.fit(
    dataframe=train_df,
    label=train_df.columns[-1],
    **settings
)

## Training Results

In [None]:
def results(model: flaml.AutoML):
    """
    Returns the details of the best model.
    
    Parameter:
    - `model`: trained model
    
    Returns:
    - `model` (str): Name of the best model
    - `hyperparameters` (dict): Hyperparameters of best model
    - `train_time` (float): Time taken to train the best model
    """
    return {
        "model": model.best_estimator,
        "hyperparameters": model.best_config,
        "train_time": model.best_config_train_time
    }

In [None]:
pprint(results(model))

# 4. Model Evaluation

In [None]:
def evaluation_metrics(model: flaml.AutoML, test_features, test_target):
    # predict once and reuse the predictions for both metrics
    predictions = model.predict(test_features)
    mse = skl.metrics.mean_squared_error(test_target, predictions)
    r2 = skl.metrics.r2_score(test_target, predictions)
    return mse, r2

mse, r2 = evaluation_metrics(model, X_test, y_test)
print("Mean Squared Error: ", round(mse, 3))
print("R2-score: ", round(r2, 4))
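
For reference, both metrics can be computed by hand: MSE is the mean of the squared errors, and R2 is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below checks these formulas against scikit-learn on a few made-up values (the numbers are illustrative assumptions, not results from this model).

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up true values and predictions, for illustration only.
y_true = np.array([0.90, 0.75, 0.60, 0.45])
y_hat = np.array([0.88, 0.70, 0.65, 0.50])

# Mean squared error: average of the squared differences.
mse_manual = np.mean((y_true - y_hat) ** 2)

# R2: 1 - (residual sum of squares / total sum of squares).
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print("MSE:", mse_manual, "sklearn:", mean_squared_error(y_true, y_hat))
print("R2:", r2_manual, "sklearn:", r2_score(y_true, y_hat))
```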

In [None]:
# Feature Importance
feature_score = {feature: score for feature, score in zip(model.feature_names_in_, model.feature_importances_)}
feature_score = pd.DataFrame(list(feature_score.items()), columns=["Feature", "Score"])
feature_score.sort_values(by="Score", inplace=True, ascending=False)
feature_score

# 5. Make predictions

In [None]:
y_pred = model.predict(X_test)

# 6. Visual Evaluation

In [None]:
sns.set_style("whitegrid")
sns.scatterplot(x=y_test, y=y_pred)
sns.lineplot(x=[0,1], y=[0,1], color="red")
plt.xlabel("True Values")
plt.ylabel("Predicted Values")
plt.title("Quality of prediction")
plt.xlim(0.3,1)
plt.ylim(0.3,1)
plt.show()

* The __red line__ represents the ideal scenario in which every prediction equals the true value.
* The model predicts accurately for candidates who truly have a high chance of admission and becomes more error-prone for candidates with lower chances of admission.
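
One way to check such a visual impression numerically is to compare the mean absolute error across bands of the true value. The sketch below uses made-up values and assumed band edges (0.6 and 0.8); it is not computed from this model's predictions.

```python
import numpy as np
import pandas as pd

# Made-up true values and predictions, for illustration only.
y_true = np.array([0.35, 0.42, 0.55, 0.61, 0.72, 0.80, 0.91, 0.95])
y_hat = np.array([0.48, 0.50, 0.60, 0.58, 0.74, 0.79, 0.92, 0.94])

errors = pd.DataFrame({"true": y_true, "abs_error": np.abs(y_true - y_hat)})

# Assign each observation to a band of the true admission chance.
errors["band"] = pd.cut(
    errors["true"], bins=[0.0, 0.6, 0.8, 1.0], labels=["low", "mid", "high"]
)

# Mean absolute error within each band.
mae_by_band = errors.groupby("band", observed=True)["abs_error"].mean()
print(mae_by_band)
```

Applied to the notebook's own `y_test` and `y_pred`, a larger error in the low band than in the high band would support the observation above.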

In [None]:
sns.barplot(x='Feature', y='Score', data=feature_score)
plt.xticks(rotation=45)
plt.xlabel('Feature')
plt.ylabel('Importance Score')
plt.title('Feature Importances')
plt.show()

* The chart above suggests that `CGPA`, `GRE` and `TOEFL` scores contribute most to the model's predictions, with `Research` experience also playing a role, so candidates may want to prioritize these.
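
Importances reported by a tree model depend on how that model scores features, so a second, model-agnostic view can be a useful cross-check. The sketch below uses scikit-learn's `permutation_importance` on synthetic data; the feature names `strong`, `weak` and `noise` are hypothetical, and nothing here is derived from the admissions dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# Synthetic data: the first column drives the target, the third is pure noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)
names = ["strong", "weak", "noise"]  # hypothetical feature names

reg = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Shuffle each column in turn and measure how much the score drops.
result = permutation_importance(reg, X, y, n_repeats=5, random_state=0)

# Rank features from most to least important.
ranking = sorted(zip(names, result.importances_mean), key=lambda p: p[1], reverse=True)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Either kind of importance describes what the model relies on, not what causes admission.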

# Residual Analysis

In [None]:
residuals = y_test - y_pred

In [None]:
sns.kdeplot(residuals, fill=True, alpha=0.5)
plt.xlim(-0.3,0.3)
plt.xlabel("Residuals")
plt.ylabel("Density")
plt.title("Residual Analysis: Density of Residuals")
plt.grid(True)
plt.show()

In [None]:
sns.scatterplot(x=range(len(residuals)), y=zscore(residuals))
plt.xlabel("Observation Index")
plt.ylabel("Z-Transformed Residuals")
plt.title("Residual Analysis: Run plot for Residuals")
plt.grid(True)
plt.show()

In [None]:
sns.set_style("whitegrid")
sns.scatterplot(x=y_pred, y=residuals)
plt.xlabel("Predicted Y")
plt.ylabel("Residuals")
plt.title("Residual Analysis: Residuals vs Predicted Y")
plt.show()
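
The residual plots can be complemented with a few numbers: the mean residual (ideally close to zero) and the observations whose standardized residual is unusually large. The sketch below uses made-up residuals and an assumed cut-off of `|z| > 2`; both are illustrative choices rather than values from this notebook.

```python
import numpy as np
from scipy.stats import zscore

# Made-up residuals, for illustration only; the last one is deliberately large.
resid = np.array([0.02, -0.01, 0.03, -0.04, 0.00, 0.01, -0.02, 0.25])

# Standardize: subtract the mean and divide by the standard deviation.
z = zscore(resid)

# Flag observations beyond the assumed threshold of 2 standard deviations.
outliers = np.flatnonzero(np.abs(z) > 2.0)

print("Mean residual:", resid.mean())
print("Indices of flagged residuals:", outliers)
```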

## Export Model

In [None]:
model.pickle(SAVED_MODEL)
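
A model saved this way can later be read back with Python's standard `pickle` module, assuming the file is an ordinary pickle and the same library versions are installed when loading. The sketch below demonstrates the round trip with a stand-in dictionary and a temporary file, so it runs without a trained model; `stand_in_model` is a hypothetical placeholder.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical placeholder for a trained model object.
stand_in_model = {"estimator": "lgbm", "config": {"n_estimators": 10}}

with tempfile.TemporaryDirectory() as tmp:
    saved_path = Path(tmp, "model.pickle")

    # Save the object to disk.
    with open(saved_path, "wb") as f:
        pickle.dump(stand_in_model, f)

    # Load it back; with a real model, `restored.predict(...)` would be used next.
    with open(saved_path, "rb") as f:
        restored = pickle.load(f)

print("Round trip preserved the object:", restored == stand_in_model)
```

Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.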