<a href="https://colab.research.google.com/github/chogerlate/cpe342-machine-learning-course-kmutt/blob/main/cpe342_a04_1052.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### A04: Siwarat Laoprom 65070501052


### Import libraries

In [11]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier

from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

### Load raw dataset

In [2]:
raw_df = pd.read_csv('./MBA.csv')
raw_df.head()

|   | application_id | gender | international | gpa | major | race | gmat | work_exp | work_industry | admission |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Female | False | 3.3 | Business | Asian | 620.0 | 3.0 | Financial Services | Admit |
| 1 | 2 | Male | False | 3.28 | Humanities | Black | 680.0 | 5.0 | Investment Management | NaN |
| 2 | 3 | Female | True | 3.3 | Business | NaN | 710.0 | 5.0 | Technology | Admit |
| 3 | 4 | Male | False | 3.47 | STEM | Black | 690.0 | 6.0 | Technology | NaN |
| 4 | 5 | Male | False | 3.35 | STEM | Hispanic | 590.0 | 5.0 | Consulting | NaN |


In [3]:
raw_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6194 entries, 0 to 6193
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   application_id  6194 non-null   int64  
 1   gender          6194 non-null   object 
 2   international   6194 non-null   bool   
 3   gpa             6194 non-null   float64
 4   major           6194 non-null   object 
 5   race            4352 non-null   object 
 6   gmat            6194 non-null   float64
 7   work_exp        6194 non-null   float64
 8   work_industry   6194 non-null   object 
 9   admission       1000 non-null   object 
dtypes: bool(1), float64(3), int64(1), object(5)
memory usage: 441.7+ KB


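Select the labeled rows from the raw dataset (those with a non-null `admission` value); rows without a label are kept in a separate frame.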

In [44]:
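# Copy the slices so that later column assignments do not raise SettingWithCopyWarning
df = raw_df[raw_df['admission'].notna()].copy()
missing_label_df = raw_df[raw_df['admission'].isna()].copy()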

### Data pre-processing

Drop columns:

    - drop race and gender because of ethical and bias considerations.
    - drop application_id because it is an identifier with no predictive value.

Encoding:

    One-hot encode categorical variables (major, work_industry).

Scaling:

    Min-max normalize the continuous features (gpa, gmat, work_exp) to the [0, 1] range. Tree-based models do not require scaling, so this step is optional here.
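The encoding step listed above can be illustrated on a few toy rows. This is a minimal sketch: the column names mirror `MBA.csv`, but the frame `toy` and its values are made up for illustration.

```python
import pandas as pd

# Toy rows; column names follow MBA.csv, values are illustrative only.
toy = pd.DataFrame({
    "gpa": [3.3, 3.28, 3.47],
    "major": ["Business", "Humanities", "STEM"],
})

# Each category becomes its own indicator column; numeric columns pass through unchanged.
encoded = pd.get_dummies(toy, columns=["major"], dummy_na=False)
print(list(encoded.columns))
```

Each row ends up with exactly one active `major_*` indicator, which is what lets the tree models split on a single category.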

In [45]:
df = df.drop(columns=['race', 'application_id', 'gender'])

# One-hot encode categorical variables
df = pd.get_dummies(df, columns=['major', 'work_industry'], dummy_na=False)

# Normalize continuous features (if needed)
scaler = MinMaxScaler()
continuous_features = ['gpa', 'gmat', 'work_exp']
df[continuous_features] = scaler.fit_transform(df[continuous_features])
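The cell above fits the scaler on every labeled row before the train/test split, so the test rows influence the learned minimum and maximum. A possible alternative, sketched here on synthetic data (the arrays `X_demo`, `X_tr_s`, and `X_te_s` are hypothetical stand-ins, not part of the notebook), is to fit the scaler on the training rows only and reuse it for the test rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)

# Synthetic stand-ins for gpa, gmat, and work_exp (assumed ranges, not the real data).
X_demo = np.column_stack([
    rng.uniform(2.5, 4.0, 200),
    rng.uniform(500, 780, 200),
    rng.integers(1, 10, 200),
])
X_tr_s, X_te_s = train_test_split(X_demo, test_size=0.3, random_state=0)

demo_scaler = MinMaxScaler()
X_tr_scaled = demo_scaler.fit_transform(X_tr_s)  # learn min/max from training rows only
X_te_scaled = demo_scaler.transform(X_te_s)      # apply the training min/max to test rows
```

With this ordering the scaled test values may fall slightly outside [0, 1], which is expected and harmless for tree-based models.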


Encode the target:
    Convert the `admission` column to integer codes with `LabelEncoder`, which assigns codes in sorted (alphabetical) label order.

In [46]:
le = LabelEncoder()
df['admission'] = le.fit_transform(df['admission'])  # 0 for 'Admit', 1 for 'Waitlist'
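Rather than relying on the inline comment, the label-to-code mapping can be read back from the fitted encoder. A small sketch, using a separate encoder `demo_le` fitted on the two label strings that appear in this notebook:

```python
from sklearn.preprocessing import LabelEncoder

demo_le = LabelEncoder()
demo_le.fit(["Admit", "Waitlist", "Admit"])

# classes_ is sorted, and the position of each label is its integer code.
mapping = {str(label): code for code, label in enumerate(demo_le.classes_)}
print(mapping)

# inverse_transform turns codes back into the original strings.
decoded = [str(v) for v in demo_le.inverse_transform([1, 0])]
print(decoded)
```

The same two calls work on the notebook's `le` object once it has been fitted.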

In [47]:
admission_counts = df['admission'].value_counts()
print(admission_counts)

admission
0    900
1    100
Name: count, dtype: int64


Split the dataset into train and test sets

In [48]:
# Split dataset into features (X) and target (y)
X = df.drop(columns=['admission'])
y = df['admission']

# Train-test split (70% train, 30% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
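With only 100 minority rows, a purely random split can shift the class ratio between train and test from run to run. One option, shown here as a sketch on synthetic labels with the same 900/100 imbalance (the names `y_demo`, `X_demo_idx`, and the seed 42 are assumptions for illustration), is to pass `stratify` and `random_state` to `train_test_split`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels reproducing the 900/100 class counts; features are just row indices.
y_demo = np.array([0] * 900 + [1] * 100)
X_demo_idx = np.arange(len(y_demo)).reshape(-1, 1)

X_tr_d, X_te_d, y_tr_d, y_te_d = train_test_split(
    X_demo_idx, y_demo,
    test_size=0.3,
    stratify=y_demo,   # keep the 90/10 ratio in both splits
    random_state=42,   # make the split reproducible
)
print(np.bincount(y_tr_d), np.bincount(y_te_d))
```

Stratifying keeps the minority share at 10% in both splits, so test metrics for the minority class are computed on a predictable number of rows.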

### Train models

    Decision tree
    Random forest
    Gradient boosting


In [49]:
# Initialize and train Decision Tree
dt_model = DecisionTreeClassifier()
dt_model.fit(X_train, y_train)

# Initialize and train Random Forest
rf_model = RandomForestClassifier(n_estimators=50)
rf_model.fit(X_train, y_train)

# Initialize and train Gradient Boosting
gb_model = GradientBoostingClassifier(n_estimators=50)
gb_model.fit(X_train, y_train)
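The models above are trained with default settings, which treat both classes equally despite the 900/100 imbalance. A common mitigation is class weighting. The sketch below uses synthetic data from `make_classification` (not the MBA features), and the names `dt_bal`, `rf_bal`, and `gb_bal` are hypothetical. `GradientBoostingClassifier` has no `class_weight` parameter, so per-row weights are passed to `fit` instead.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils.class_weight import compute_sample_weight

# Synthetic, roughly 90/10 imbalanced data standing in for the admission features.
X_syn, y_syn = make_classification(
    n_samples=500, n_features=8, weights=[0.9, 0.1], random_state=0
)

# Decision tree and random forest accept class_weight directly.
dt_bal = DecisionTreeClassifier(class_weight="balanced", random_state=0)
dt_bal.fit(X_syn, y_syn)
rf_bal = RandomForestClassifier(n_estimators=50, class_weight="balanced", random_state=0)
rf_bal.fit(X_syn, y_syn)

# Gradient boosting: compute balanced per-row weights and pass them to fit.
row_weights = compute_sample_weight(class_weight="balanced", y=y_syn)
gb_bal = GradientBoostingClassifier(n_estimators=50, random_state=0)
gb_bal.fit(X_syn, y_syn, sample_weight=row_weights)
```

Whether weighting improves minority-class recall on the real data would need to be checked against the classification reports; this sketch only shows the mechanics.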

### Evaluate the trained models using the classification_report function from sklearn

The classification reports show a large gap between the two classes: 'Admit' (class 0) is predicted well, while 'Waitlist' (class 1) has low precision and recall. The gap is largely attributable to class imbalance (900 'Admit' vs. 100 'Waitlist' among the labeled rows).
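To put the accuracy figures in context, it helps to compute what a trivial model would score. The sketch below builds a label vector with the test-split supports reported in this notebook (268 'Admit', 32 'Waitlist') and evaluates a hypothetical predictor that always outputs class 0. The variable names are illustrative.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score

# Class supports taken from the classification reports: 268 'Admit' (0), 32 'Waitlist' (1).
y_true = np.array([0] * 268 + [1] * 32)
y_majority = np.zeros_like(y_true)  # always predict 'Admit'

baseline_acc = accuracy_score(y_true, y_majority)                     # 268 / 300
baseline_minority_recall = recall_score(y_true, y_majority, pos_label=1)
baseline_bal_acc = balanced_accuracy_score(y_true, y_majority)        # mean of per-class recall
print(round(baseline_acc, 3), baseline_minority_recall, baseline_bal_acc)
```

The always-'Admit' baseline reaches roughly 0.89 accuracy with zero minority recall, so accuracy alone says little here; per-class recall and the macro average are the more informative numbers.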

In [50]:
# Make predictions on the test set for each model
y_pred_dt = dt_model.predict(X_test)
y_pred_rf = rf_model.predict(X_test)
y_pred_gb = gb_model.predict(X_test)

# Evaluate Decision Tree
print("Decision Tree Classification Report:")
print(classification_report(y_test, y_pred_dt))

# Evaluate Random Forest
print("\nRandom Forest Classification Report:")
print(classification_report(y_test, y_pred_rf))

# Evaluate Gradient Boosting
print("\nGradient Boosting Classification Report:")
print(classification_report(y_test, y_pred_gb))


Decision Tree Classification Report:
              precision    recall  f1-score   support

           0       0.91      0.90      0.91       268
           1       0.24      0.25      0.24        32

    accuracy                           0.83       300
   macro avg       0.57      0.58      0.57       300
weighted avg       0.84      0.83      0.84       300


Random Forest Classification Report:
              precision    recall  f1-score   support

           0       0.89      0.99      0.94       268
           1       0.20      0.03      0.05        32

    accuracy                           0.88       300
   macro avg       0.55      0.51      0.50       300
weighted avg       0.82      0.88      0.84       300


Gradient Boosting Classification Report:
              precision    recall  f1-score   support

           0       0.90      0.99      0.94       268
           1       0.25      0.03      0.06        32

    accuracy                           0.89       300
   macro av