
### A04: Siwarat Laoprom 65070501052


### Import libraries

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier

from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

### Load raw dataset

In [22]:
raw_df = pd.read_csv('./MBA.csv')
raw_df.head()

Unnamed: 0,application_id,gender,international,gpa,major,race,gmat,work_exp,work_industry,admission
0,1,Female,False,3.3,Business,Asian,620.0,3.0,Financial Services,Admit
1,2,Male,False,3.28,Humanities,Black,680.0,5.0,Investment Management,
2,3,Female,True,3.3,Business,,710.0,5.0,Technology,Admit
3,4,Male,False,3.47,STEM,Black,690.0,6.0,Technology,
4,5,Male,False,3.35,STEM,Hispanic,590.0,5.0,Consulting,


In [23]:
raw_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6194 entries, 0 to 6193
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   application_id  6194 non-null   int64  
 1   gender          6194 non-null   object 
 2   international   6194 non-null   bool   
 3   gpa             6194 non-null   float64
 4   major           6194 non-null   object 
 5   race            4352 non-null   object 
 6   gmat            6194 non-null   float64
 7   work_exp        6194 non-null   float64
 8   work_industry   6194 non-null   object 
 9   admission       1000 non-null   object 
dtypes: bool(1), float64(3), int64(1), object(5)
memory usage: 441.7+ KB


### Dataset Analysis
The original dataset comes from Kaggle:
- https://www.kaggle.com/datasets/taweilo/mba-admission-dataset

About Dataset
1. Data Source:
Synthetic data generated from the Wharton Class of 2025's statistics.

2. Meta Data:
  - application_id: Unique identifier for each application
  - gender: Applicant's gender (Male, Female)
  - international: International student (TRUE/FALSE)
  - gpa: Grade Point Average of the applicant (on 4.0 scale)
  - major: Undergraduate major (Business, STEM, Humanities)
  - race: Racial background of the applicant (e.g., White, Black, Asian, Hispanic, Other / null: international student)
  - gmat: GMAT score of the applicant (800 points)
  - work_exp: Number of years of work experience (Year)
  - work_industry: Industry of the applicant's previous work experience (e.g., Consulting, Finance, Technology, etc.)
  - admission: Admission status (Admit, Waitlist, Null: Deny)
3. Usage:
  - Exploratory Data Analysis (EDA): Understand the distributions, relationships, and patterns within the data.
  - Classification: Predict the admission status based on other features.
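The metadata above says that a null `race` marks an international student and a null `admission` marks a denied application. A minimal sketch of how those claims could be checked before imputing anything is shown below. The `toy` frame and its values are made up for illustration and only mimic the layout of `MBA.csv`.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the layout of MBA.csv (values are illustrative only)
toy = pd.DataFrame({
    "international": [False, False, True, False],
    "race": ["Asian", "Black", np.nan, "White"],
    "admission": ["Admit", np.nan, "Waitlist", np.nan],
})

# Number of missing values per column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Count admission outcomes, keeping NaN as its own category
admission_counts = toy["admission"].value_counts(dropna=False)
print(admission_counts)

# Check that every row with a missing race is an international applicant
race_missing_only_international = bool(
    toy.loc[toy["race"].isna(), "international"].all()
)
print(race_missing_only_international)
```

Running the same three checks on `raw_df` would show whether the nulls in the real data follow the documented meaning.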

Copy the raw dataset and fill the missing `admission` values with 'Reject', since a null status means the application was denied.

In [24]:
df = raw_df.copy()
df['admission'] = df['admission'].fillna('Reject')
df

Unnamed: 0,application_id,gender,international,gpa,major,race,gmat,work_exp,work_industry,admission
0,1,Female,False,3.30,Business,Asian,620.0,3.0,Financial Services,Admit
1,2,Male,False,3.28,Humanities,Black,680.0,5.0,Investment Management,Reject
2,3,Female,True,3.30,Business,,710.0,5.0,Technology,Admit
3,4,Male,False,3.47,STEM,Black,690.0,6.0,Technology,Reject
4,5,Male,False,3.35,STEM,Hispanic,590.0,5.0,Consulting,Reject
...,...,...,...,...,...,...,...,...,...,...
6189,6190,Male,False,3.49,Business,White,640.0,5.0,Other,Reject
6190,6191,Male,False,3.18,STEM,Black,670.0,4.0,Consulting,Reject
6191,6192,Female,True,3.22,Business,,680.0,5.0,Health Care,Admit
6192,6193,Male,True,3.36,Business,,590.0,5.0,Other,Reject


### Data pre-processing

Drop columns:

    - Drop race and gender because of ethical and bias considerations.
    - Drop application_id because it is an identifier with no predictive value.

Encoding:

    One-hot encode the categorical variables (major, work_industry).

Scaling:

    Min-max scale the continuous features (gpa, gmat, work_exp). Tree-based models do not require scaling, so this step is optional here.

In [25]:
df = df.drop(columns=['race', 'application_id', 'gender'])

# One-hot encode categorical variables
df = pd.get_dummies(df, columns=['major', 'work_industry'], dummy_na=False)

# Normalize continuous features (if needed)
scaler = MinMaxScaler()
continuous_features = ['gpa', 'gmat', 'work_exp']
df[continuous_features] = scaler.fit_transform(df[continuous_features])
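One caveat about the cell above: the scaler is fitted on every row, so the minimum and maximum of rows that later land in the test set influence the training features. For tree-based models this has little practical effect, but a leakage-free variant fits the scaler on the training rows only. The sketch below assumes the split happens before scaling and uses a small made-up `toy` frame in place of the real data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Toy data standing in for the gpa/gmat/work_exp columns (illustrative values)
toy = pd.DataFrame({
    "gpa": [3.30, 3.28, 3.30, 3.47, 3.35, 3.49, 3.18, 3.22],
    "gmat": [620.0, 680.0, 710.0, 690.0, 590.0, 640.0, 670.0, 680.0],
    "work_exp": [3.0, 5.0, 5.0, 6.0, 5.0, 5.0, 4.0, 5.0],
})
continuous_features = ["gpa", "gmat", "work_exp"]

# Split first, then copy so the assignments below do not touch `toy`
train, test = train_test_split(toy, test_size=0.25, random_state=42)
train, test = train.copy(), test.copy()

scaler = MinMaxScaler()
# Learn min/max from the training rows only
train[continuous_features] = scaler.fit_transform(train[continuous_features])
# Reuse the training min/max on the test rows (values may fall outside [0, 1])
test[continuous_features] = scaler.transform(test[continuous_features])

print(train[continuous_features].describe().loc[["min", "max"]])
```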


Encode the target:
    Convert the `admission` column to integer codes with `LabelEncoder`.

In [26]:
le = LabelEncoder()
df['admission'] = le.fit_transform(df['admission'])  # 0 = 'Admit', 1 = 'Reject', 2 = 'Waitlist'

In [27]:
# Print the classes in encoded order (position = integer code)
print(le.classes_)

['Admit' 'Reject' 'Waitlist']
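`LabelEncoder` sorts the labels, so the position of each entry in `classes_` is its integer code. A small sketch of building an explicit label-to-code mapping and decoding predictions back to strings follows. The encoder here is fitted on a hand-written list rather than on the notebook's `df`.

```python
from sklearn.preprocessing import LabelEncoder

# Fit on a hand-written list that contains the same three labels
le = LabelEncoder()
le.fit(["Admit", "Reject", "Waitlist", "Reject"])

# Map each original label to its integer code
label_to_code = {str(label): code for code, label in enumerate(le.classes_)}
print(label_to_code)  # {'Admit': 0, 'Reject': 1, 'Waitlist': 2}

# inverse_transform turns integer codes back into the original strings
decoded = le.inverse_transform([1, 0, 2])
print(list(decoded))
```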


In [28]:
admission_counts = df['admission'].value_counts()
print(admission_counts)

admission
1    5194
0     900
2     100
Name: count, dtype: int64
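The counts above are heavily imbalanced. One common way to compensate is to weight each class inversely to its frequency. The sketch below computes scikit-learn's "balanced" weights from the printed counts. It is an illustration of the weighting formula `n_samples / (n_classes * count)`, not something the notebook itself applies.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Class counts copied from the value_counts() output above
counts = {0: 900, 1: 5194, 2: 100}

# Rebuild a label vector with the same class frequencies
y = np.concatenate([np.full(n, label) for label, n in counts.items()])

classes = np.array(sorted(counts))
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)

# Dictionary form, as accepted by the class_weight parameter of some estimators
class_weight = {int(c): float(w) for c, w in zip(classes, weights)}
print(class_weight)
```

`DecisionTreeClassifier` and `RandomForestClassifier` accept such a dictionary (or the string `"balanced"`) through their `class_weight` parameter. `GradientBoostingClassifier` has no such parameter and would need per-sample weights passed to `fit` via `sample_weight` instead.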


Split the dataset into stratified train and test sets

In [35]:
X = df.drop('admission', axis=1)
y = df['admission']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Check the distribution of the classes to confirm stratification
print(y_train.value_counts(normalize=True))  # Class distribution in training set
print(y_test.value_counts(normalize=True))   # Class distribution in test set

admission
1    0.838547
0    0.145308
2    0.016145
Name: proportion, dtype: float64
admission
1    0.838579
0    0.145278
2    0.016142
Name: proportion, dtype: float64


### Train models

    Decision tree
    Random forest
    Gradient boosting


In [38]:
# Initialize and train Decision Tree
dt_model = DecisionTreeClassifier(max_depth=10)
dt_model.fit(X_train, y_train)

# Initialize and train Random Forest
rf_model = RandomForestClassifier(n_estimators=50)
rf_model.fit(X_train, y_train)

# Initialize and train Gradient Boosting
gb_model = GradientBoostingClassifier(n_estimators=50)
gb_model.fit(X_train, y_train)

### Evaluate the trained models using the `classification_report` function from sklearn

The classification reports show large differences in performance between the three classes (0 = 'Admit', 1 = 'Reject', 2 = 'Waitlist'), which can largely be attributed to class imbalance: 'Reject' makes up about 84% of the samples, while 'Waitlist' accounts for under 2%.
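To make the per-class rows easier to read, `classification_report` can print the original label names instead of the integer codes. The sketch below uses hypothetical `y_true` and `y_pred` lists rather than the notebook's models, and assumes the names are given in the same order as `le.classes_`. It also sets `zero_division=0`, so a class that is never predicted reports a precision of 0 rather than 1.

```python
from sklearn.metrics import classification_report

# Hypothetical true/predicted codes, only to show the labelled output
y_true = [0, 1, 1, 2, 1, 0]
y_pred = [0, 1, 1, 1, 1, 1]
target_names = ["Admit", "Reject", "Waitlist"]  # same order as le.classes_

# Text report with named rows
print(classification_report(
    y_true, y_pred, target_names=target_names, zero_division=0
))

# Dictionary form, convenient for comparing models programmatically
report = classification_report(
    y_true, y_pred, target_names=target_names, zero_division=0, output_dict=True
)
print(report["Waitlist"]["recall"])
```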

In [39]:
# Make predictions on the test set for each model
y_pred_dt = dt_model.predict(X_test)
y_pred_rf = rf_model.predict(X_test)
y_pred_gb = gb_model.predict(X_test)

# Evaluate Decision Tree
print("Decision Tree Classification Report:")
print(classification_report(y_test, y_pred_dt, zero_division=1))

# Evaluate Random Forest
print("\nRandom Forest Classification Report:")
print(classification_report(y_test, y_pred_rf, zero_division=1))

# Evaluate Gradient Boosting
print("\nGradient Boosting Classification Report:")
print(classification_report(y_test, y_pred_gb, zero_division=1))


Decision Tree Classification Report:
              precision    recall  f1-score   support

           0       0.31      0.12      0.17       180
           1       0.85      0.95      0.90      1039
           2       0.00      0.00      0.00        20

    accuracy                           0.82      1239
   macro avg       0.39      0.36      0.36      1239
weighted avg       0.76      0.82      0.78      1239


Random Forest Classification Report:
              precision    recall  f1-score   support

           0       0.37      0.22      0.27       180
           1       0.86      0.94      0.90      1039
           2       0.00      0.00      0.00        20

    accuracy                           0.82      1239
   macro avg       0.41      0.38      0.39      1239
weighted avg       0.77      0.82      0.79      1239


Gradient Boosting Classification Report:
              precision    recall  f1-score   support

           0       0.46      0.03      0.06       180
           1