<a href="https://colab.research.google.com/github/denistoo749/Academic-Success-Classification/blob/main/academic_success_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Classification with an Academic Success Dataset
**1. Problem**
- Predict the academic risk of students in higher education: for each student, classify `Target` as Graduate, Dropout, or Enrolled.

**2. Data**
- Files
```
train.csv - the training dataset; Target is the categorical target
test.csv - the test dataset; your objective is to predict the class of Target for each row
sample_submission.csv - a sample submission file in the correct format
```

>https://www.kaggle.com/competitions/playground-series-s4e6/data

**3. Evaluation**
- Submissions are evaluated using the accuracy score.
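
Accuracy is the fraction of rows whose predicted class matches the true class. A minimal sketch with made-up labels (the two lists are illustrative and not taken from the competition data):

```python
# Illustrative labels only; not taken from the competition data.
y_true = ["Graduate", "Dropout", "Enrolled", "Graduate", "Dropout"]
y_pred = ["Graduate", "Dropout", "Graduate", "Graduate", "Enrolled"]

# Accuracy = number of exact matches / number of rows.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)
print(accuracy)
```

Three of the five predictions match here, so the score is 0.6. Because every class counts equally per row, a model can score well while doing poorly on a less frequent class.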


In [2]:
# Mount drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
# # Unzip the file
# !unzip '/content/drive/MyDrive/Academic Success Classification/playground-series-s4e6.zip' -d '/content/drive/MyDrive/Academic Success Classification/data/'

Import necessary tools

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

In [5]:
# Read the train dataset
df = pd.read_csv('/content/drive/MyDrive/Academic Success Classification/data/train.csv')
df.head()

Unnamed: 0,id,Marital status,Application mode,Application order,Course,Daytime/evening attendance,Previous qualification,Previous qualification (grade),Nacionality,Mother's qualification,...,Curricular units 2nd sem (credited),Curricular units 2nd sem (enrolled),Curricular units 2nd sem (evaluations),Curricular units 2nd sem (approved),Curricular units 2nd sem (grade),Curricular units 2nd sem (without evaluations),Unemployment rate,Inflation rate,GDP,Target
0,0,1,1,1,9238,1,1,126.0,1,1,...,0,6,7,6,12.428571,0,11.1,0.6,2.02,Graduate
1,1,1,17,1,9238,1,1,125.0,1,19,...,0,6,9,0,0.0,0,11.1,0.6,2.02,Dropout
2,2,1,17,2,9254,1,1,137.0,1,3,...,0,6,0,0,0.0,0,16.2,0.3,-0.92,Dropout
3,3,1,1,3,9500,1,1,131.0,1,19,...,0,8,11,7,12.82,0,11.1,0.6,2.02,Enrolled
4,4,1,1,2,9500,1,1,132.0,1,19,...,0,7,12,6,12.933333,0,7.6,2.6,0.32,Graduate


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 76518 entries, 0 to 76517
Data columns (total 38 columns):
 #   Column                                          Non-Null Count  Dtype  
---  ------                                          --------------  -----  
 0   id                                              76518 non-null  int64  
 1   Marital status                                  76518 non-null  int64  
 2   Application mode                                76518 non-null  int64  
 3   Application order                               76518 non-null  int64  
 4   Course                                          76518 non-null  int64  
 5   Daytime/evening attendance                      76518 non-null  int64  
 6   Previous qualification                          76518 non-null  int64  
 7   Previous qualification (grade)                  76518 non-null  float64
 8   Nacionality                                     76518 non-null  int64  
 9   Mother's qualification                 

In [7]:
df.isna().sum()

id                                                0
Marital status                                    0
Application mode                                  0
Application order                                 0
Course                                            0
Daytime/evening attendance                        0
Previous qualification                            0
Previous qualification (grade)                    0
Nacionality                                       0
Mother's qualification                            0
Father's qualification                            0
Mother's occupation                               0
Father's occupation                               0
Admission grade                                   0
Displaced                                         0
Educational special needs                         0
Debtor                                            0
Tuition fees up to date                           0
Gender                                            0
Scholarship 

In [8]:
# Create X and y
X = df.drop('Target', axis=1)
y = df['Target']

# Preprocess data

In [9]:
from sklearn.preprocessing import LabelEncoder

# Hold out 20% of the training data as a validation split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 1. Encode the Target Variable
label_encoder = LabelEncoder()
y_train_encoded = label_encoder.fit_transform(y_train)
y_test_encoded = label_encoder.transform(y_test)

# 2. Encode Categorical Features in X
categorical_features = [
    'Marital status', 'Application mode', 'Daytime/evening attendance',
    'Previous qualification', 'Nacionality', 'Mother\'s qualification',
    'Father\'s qualification', 'Mother\'s occupation', 'Father\'s occupation',
    'Displaced', 'Educational special needs', 'Debtor', 'Tuition fees up to date',
    'Gender', 'Scholarship holder', 'International'
]

# Convert categorical features to dummy variables
X_train_encoded = pd.get_dummies(X_train, columns=categorical_features, drop_first=True)
X_test_encoded = pd.get_dummies(X_test, columns=categorical_features, drop_first=True)

# Ensure that the encoded training and test sets have the same columns
X_train_encoded, X_test_encoded = X_train_encoded.align(X_test_encoded, join='left', axis=1, fill_value=0)
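
The effect of `align(..., join='left')` is easiest to see on a toy example. The sketch below uses a hypothetical single column named `mode` and omits `drop_first` for readability; it is meant to illustrate the behaviour, not to reproduce the cell above.

```python
import pandas as pd

# Toy frames; the column name and values are hypothetical.
train = pd.DataFrame({"mode": [1, 17, 39]})
test = pd.DataFrame({"mode": [1, 44]})  # 44 never appears in train

train_enc = pd.get_dummies(train, columns=["mode"])
test_enc = pd.get_dummies(test, columns=["mode"])

# join="left" keeps exactly the training columns: categories missing from
# the test frame are filled with 0, categories unseen in training are dropped.
train_enc, test_enc = train_enc.align(test_enc, join="left", axis=1, fill_value=0)

print(list(test_enc.columns))
```

After alignment the test frame has the same columns as the training frame. The row with the unseen category 44 ends up with zeros in every dummy column, so the model treats it like the reference category.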

In [10]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# 3. Build a pipeline that standardizes the features, then fits logistic regression
log_reg_pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Train and evaluate the logistic regression model on the held-out split
log_reg_pipeline.fit(X_train_encoded, y_train_encoded)
log_reg_score = log_reg_pipeline.score(X_test_encoded, y_test_encoded)
print(f"LogisticRegression score after encoding and scaling: {log_reg_score}")

# 4. Train and Evaluate the Random Forest Classifier
rf_classifier = RandomForestClassifier()

# Train the model
rf_classifier.fit(X_train_encoded, y_train_encoded)
rf_score = rf_classifier.score(X_test_encoded, y_test_encoded)
print(f"RandomForestClassifier score: {rf_score}")

LogisticRegression score after encoding and scaling: 0.8218112911657083
RandomForestClassifier score: 0.8265159435441715
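
The two accuracy scores are close, and a single number does not show which classes each model confuses. One possible follow-up is a confusion matrix with per-class recall. The sketch below uses small made-up label arrays rather than the notebook's predictions.

```python
from sklearn.metrics import confusion_matrix

# Made-up encoded labels (0, 1, 2), for illustration only.
y_true = [0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 2, 0]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])

# Per-class recall: correct predictions on the diagonal over each row total.
recall = cm.diagonal() / cm.sum(axis=1)

print(cm)
print(recall)
```

In the notebook, the same call would presumably take `y_test_encoded` and `rf_classifier.predict(X_test_encoded)` as inputs.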


## Prediction

In [11]:
# Read the test data
test_df = pd.read_csv('/content/drive/MyDrive/Academic Success Classification/data/test.csv')
test_df.head()

Unnamed: 0,id,Marital status,Application mode,Application order,Course,Daytime/evening attendance,Previous qualification,Previous qualification (grade),Nacionality,Mother's qualification,...,Curricular units 1st sem (without evaluations),Curricular units 2nd sem (credited),Curricular units 2nd sem (enrolled),Curricular units 2nd sem (evaluations),Curricular units 2nd sem (approved),Curricular units 2nd sem (grade),Curricular units 2nd sem (without evaluations),Unemployment rate,Inflation rate,GDP
0,76518,1,1,1,9500,1,1,141.0,1,3,...,0,0,8,0,0,0.0,0,13.9,-0.3,0.79
1,76519,1,1,1,9238,1,1,128.0,1,1,...,0,0,6,6,6,13.5,0,11.1,0.6,2.02
2,76520,1,1,1,9238,1,1,118.0,1,1,...,0,0,6,11,5,11.0,0,15.5,2.8,-4.06
3,76521,1,44,1,9147,1,39,130.0,1,1,...,0,3,8,14,5,11.0,0,8.9,1.4,3.51
4,76522,1,39,1,9670,1,1,110.0,1,1,...,0,0,6,9,4,10.666667,2,7.6,2.6,0.32


In [12]:
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51012 entries, 0 to 51011
Data columns (total 37 columns):
 #   Column                                          Non-Null Count  Dtype  
---  ------                                          --------------  -----  
 0   id                                              51012 non-null  int64  
 1   Marital status                                  51012 non-null  int64  
 2   Application mode                                51012 non-null  int64  
 3   Application order                               51012 non-null  int64  
 4   Course                                          51012 non-null  int64  
 5   Daytime/evening attendance                      51012 non-null  int64  
 6   Previous qualification                          51012 non-null  int64  
 7   Previous qualification (grade)                  51012 non-null  float64
 8   Nacionality                                     51012 non-null  int64  
 9   Mother's qualification                 

In [13]:
test_df.isna().sum()

id                                                0
Marital status                                    0
Application mode                                  0
Application order                                 0
Course                                            0
Daytime/evening attendance                        0
Previous qualification                            0
Previous qualification (grade)                    0
Nacionality                                       0
Mother's qualification                            0
Father's qualification                            0
Mother's occupation                               0
Father's occupation                               0
Admission grade                                   0
Displaced                                         0
Educational special needs                         0
Debtor                                            0
Tuition fees up to date                           0
Gender                                            0
Scholarship 

In [14]:
# The test set has no Target column, so only the features are encoded here.
# The label_encoder fitted during preprocessing is left untouched.

# Encode Categorical Features in x_test
categorical_features = [
    'Marital status', 'Application mode', 'Daytime/evening attendance',
    'Previous qualification', 'Nacionality', 'Mother\'s qualification',
    'Father\'s qualification', 'Mother\'s occupation', 'Father\'s occupation',
    'Displaced', 'Educational special needs', 'Debtor', 'Tuition fees up to date',
    'Gender', 'Scholarship holder', 'International'
]

# Convert categorical features to dummy variables
x_test_encoded = pd.get_dummies(test_df, columns=categorical_features, drop_first=True)

In [15]:
# Ensure that the encoded training and test sets have the same columns
X_train_encoded, x_test_encoded = X_train_encoded.align(x_test_encoded, join='left', axis=1, fill_value=0)

In [16]:
y_preds = rf_classifier.predict(x_test_encoded)

In [17]:
y_preds

array([0, 2, 2, ..., 0, 0, 0])

In [18]:
# Reverse the label encoding
def reverse_label_encoding(y_train, y_preds):
    """
    Reverse label encoding for predicted values using a LabelEncoder instance.

    Parameters:
    - y_train: The original y labels used for fitting the LabelEncoder.
    - y_preds: Encoded predictions to be reverse transformed.

    Returns:
    - y_preds_original: Predicted values in their original categorical form.
    """
    # Initialize a LabelEncoder instance
    label_encoder = LabelEncoder()

    # Fit the LabelEncoder with y_train to ensure consistency in reverse transformation
    label_encoder.fit(y_train)

    # Reverse the encoding for y_preds
    y_preds_original = label_encoder.inverse_transform(y_preds)

    return y_preds_original
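
Refitting a fresh encoder works because `LabelEncoder` stores its classes in sorted order, so any encoder fitted on the same set of labels produces the same integer mapping. A small round-trip sketch with illustrative labels:

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative labels; the real ones come from the Target column.
labels = ["Graduate", "Dropout", "Enrolled", "Graduate"]

encoder = LabelEncoder()
codes = encoder.fit_transform(labels)

# Classes are kept in sorted order, which fixes the integer codes.
print(list(encoder.classes_))
print(codes.tolist())

# inverse_transform maps the integer codes back to the original strings.
decoded = encoder.inverse_transform(codes)
print(list(decoded))
```

An encoder that has already been fitted on the training labels can decode predictions directly with `inverse_transform`, which avoids the extra fit.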

In [19]:
y_preds_original = reverse_label_encoding(y_train, y_preds)

In [20]:
y_preds_original.shape

(51012,)

In [21]:
submission = pd.DataFrame({'id': test_df['id'], 'Target': y_preds_original})
submission.to_csv('/content/drive/MyDrive/Academic Success Classification/data/submission.csv', index=False)

In [22]:
data = pd.read_csv('/content/drive/MyDrive/Academic Success Classification/data/submission.csv')
data.head()

Unnamed: 0,id,Target
0,76518,Dropout
1,76519,Graduate
2,76520,Graduate
3,76521,Graduate
4,76522,Enrolled
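
Before uploading, it can help to check that the file has the expected shape. The sketch below assumes the required layout is an `id` column followed by a `Target` column, as in the preview above, and builds a hypothetical three-row frame instead of reading the real file.

```python
import pandas as pd

# Hypothetical miniature submission; real ids and labels come from
# test.csv and the model's predictions.
submission = pd.DataFrame({
    "id": [76518, 76519, 76520],
    "Target": ["Dropout", "Graduate", "Enrolled"],
})
valid_labels = {"Dropout", "Enrolled", "Graduate"}

# Basic format checks: column order, unique ids, and known class labels.
columns_ok = list(submission.columns) == ["id", "Target"]
ids_ok = submission["id"].is_unique
labels_ok = set(submission["Target"]) <= valid_labels

print(columns_ok and ids_ok and labels_ok)
```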


# Hyperparameter Tuning using RandomizedSearchCV

In [25]:
np.random.seed(42)

grid = {'n_estimators': [100, 200, 500],
       'max_depth': [None, 10, 20],
       'min_samples_split': [2, 4],
       'min_samples_leaf': [1, 2]}

# Setup RandomizedSearchCV
rs_rf = RandomizedSearchCV(estimator=rf_classifier,
                           param_distributions=grid,
                           n_iter=10, # number of models to try
                           cv=2,
                           verbose=True,
                           n_jobs=-1)

# Fit the randomized search on the training split
rs_rf.fit(X_train_encoded, y_train_encoded)

Fitting 2 folds for each of 10 candidates, totalling 20 fits


In [26]:
rs_rf.best_params_

{'n_estimators': 500,
 'min_samples_split': 2,
 'min_samples_leaf': 1,
 'max_depth': 20}
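
`best_params_` is the sampled combination with the highest mean cross-validated score, and `best_score_` reports that score. The sketch below runs a much smaller search on synthetic data from `make_classification` (a stand-in, not the competition data) and passes `random_state` to both the estimator and the search, so the sampling does not depend on the global NumPy seed.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic three-class data as a stand-in for the encoded training set.
X, y = make_classification(n_samples=150, n_features=8, n_informative=4,
                           n_classes=3, random_state=42)

# Deliberately tiny search space so the example runs quickly.
param_distributions = {
    "n_estimators": [10, 20],
    "max_depth": [None, 5],
    "min_samples_leaf": [1, 2],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=4,          # number of sampled combinations
    cv=2,
    random_state=42,   # makes the sampled combinations reproducible
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

With `n_jobs=-1`, a globally set seed may not carry over to the worker processes, which is one reason to set `random_state` explicitly.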

In [27]:
rs_y_preds = rs_rf.predict(x_test_encoded)

In [28]:
rs_y_preds

array([0, 2, 2, ..., 0, 0, 0])

In [29]:
rs_y_preds_original = reverse_label_encoding(y_train, rs_y_preds)

In [30]:
submission = pd.DataFrame({'id': test_df['id'], 'Target': rs_y_preds_original})
submission.to_csv('/content/drive/MyDrive/Academic Success Classification/data/submission.csv', index=False)

# Hyperparameter Tuning using GridSearchCV

In [33]:
np.random.seed(42)

grid = {'n_estimators': [100, 200, 500],
          'max_depth': [None],
          'max_features': ['sqrt'],
          'min_samples_split': [6],
          'min_samples_leaf': [1, 2]}

# Set up GridSearchCV
gs_rf = GridSearchCV(estimator=rf_classifier,
                     param_grid=grid,
                     n_jobs=-1,
                     cv=3,
                     verbose=True)

# Fit the grid search on the training split
gs_rf.fit(X_train_encoded, y_train_encoded)

Fitting 3 folds for each of 6 candidates, totalling 18 fits
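
The candidate count in that message is the product of the list lengths in the grid, and the number of fits is that count times the number of folds. A quick check using scikit-learn's `ParameterGrid` with the same grid as above:

```python
from sklearn.model_selection import ParameterGrid

grid = {
    "n_estimators": [100, 200, 500],
    "max_depth": [None],
    "max_features": ["sqrt"],
    "min_samples_split": [6],
    "min_samples_leaf": [1, 2],
}

# GridSearchCV tries every combination: 3 * 1 * 1 * 1 * 2 = 6 candidates.
n_candidates = len(ParameterGrid(grid))
cv = 3
n_fits = n_candidates * cv

print(n_candidates, n_fits)
```

This gives 6 candidates and 18 fits, matching the log line. The count excludes the final refit of the best model on the full training split.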


In [34]:
gs_rf.best_params_

{'max_depth': None,
 'max_features': 'sqrt',
 'min_samples_leaf': 1,
 'min_samples_split': 6,
 'n_estimators': 500}

In [35]:
gs_y_preds = gs_rf.predict(x_test_encoded)

In [36]:
gs_y_preds

array([0, 2, 2, ..., 0, 0, 0])

In [37]:
gs_y_preds_original = reverse_label_encoding(y_train, gs_y_preds)

In [38]:
submission = pd.DataFrame({'id': test_df['id'], 'Target': gs_y_preds_original})
submission.to_csv('/content/drive/MyDrive/Academic Success Classification/data/submission.csv', index=False)