# University Admission Predictor - Multiple Linear Regression with XGBoost in SKLearn

The objective of this project is to build, train, test, and deploy a machine learning model that predicts a student's chance of admission to a particular university from the student's profile.

University admission departments can use this project to identify the top qualifying students.

### Inputs (Features)

    - GRE Scores (out of 340)
    - TOEFL Scores (out of 120)
    - University Rating (out of 5)
    - Statement of Purpose (SOP) Strength (out of 5)
    - Letter of Recommendation (LOR) Strength (out of 5)
    - Undergraduate GPA (out of 10)
    - Research Experience (either 0 or 1)

### Output

    - Chance of admission (ranging from 0 to 1)
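
As an illustration of the input schema listed above, the sketch below checks one student profile against the stated score ranges and then orders the values the way the dataset columns are ordered. `FEATURE_BOUNDS`, `validate_profile`, and `to_feature_row` are hypothetical helpers that are not part of the notebook. The lower bound of 0 for every field is an assumption, since the list only gives upper bounds, and the SOP upper bound of 5 is taken from the maximum in the dataset summary.

```python
# Hypothetical schema: upper bounds follow the feature list above.
# The lower bound of 0 for every field is an assumption, not from the source.
FEATURE_BOUNDS = {
    "GRE_Score": (0, 340),
    "TOEFL_Score": (0, 120),
    "University_Rating": (0, 5),
    "SOP": (0, 5),
    "LOR": (0, 5),
    "CGPA": (0, 10),
    "Research": (0, 1),
}


def validate_profile(profile):
    """Return a list of problems found in one student profile (empty if valid)."""
    problems = []
    for name, (low, high) in FEATURE_BOUNDS.items():
        if name not in profile:
            problems.append(f"missing feature: {name}")
        elif name == "Research" and profile[name] not in (0, 1):
            # Research is a binary flag, so only 0 or 1 is acceptable.
            problems.append("Research must be 0 or 1")
        elif not low <= profile[name] <= high:
            problems.append(f"{name}={profile[name]} is outside [{low}, {high}]")
    return problems


def to_feature_row(profile):
    """Order the values the same way as the dataset's feature columns."""
    return [profile[name] for name in FEATURE_BOUNDS]


example = {
    "GRE_Score": 320, "TOEFL_Score": 110, "University_Rating": 4,
    "SOP": 4.5, "LOR": 4.0, "CGPA": 9.1, "Research": 1,
}
print(validate_profile(example))
print(to_feature_row(example))
```

Running a check like this before prediction keeps out-of-range inputs away from the model, which would otherwise still return a number for them.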

## Setup

Let's start by specifying:

* AWS region.
* The IAM role ARN used to give training and hosting access to your data. See the documentation for how to specify these.
* The S3 bucket that you want to use for training and model data.

In [2]:
%%time

import os
import boto3
import re
import json
import sagemaker
from sagemaker import get_execution_role

region = boto3.Session().region_name

role = get_execution_role()

bucket = sagemaker.Session().default_bucket()

CPU times: user 1.1 s, sys: 160 ms, total: 1.26 s
Wall time: 1.37 s


In [3]:
prefix = "sagemaker/multiregressor-xgboost-byo"
bucket_path = "https://s3-{}.amazonaws.com/{}".format(region, bucket)

print(bucket_path)

https://s3-us-west-2.amazonaws.com/sagemaker-us-west-2-442342299380


## Import the CSV Data

In [4]:
# Import Numpy and check the version
import numpy as np
print(np.__version__)

# Import pandas and check the version
import pandas as pd
print(pd.__version__)

# Read the CSV file 
university_df = pd.read_csv("university_admission.csv")

1.21.6
1.3.5


## Describe the Data

In [5]:
# Check the shape of the dataframe
university_df.shape

(1000, 8)

In [6]:
# Check if any missing values are present in the dataframe
university_df.isnull().sum()

# Drop the rows with missing values
university_df = university_df.dropna()

university_df.dtypes

GRE_Score                int64
TOEFL_Score              int64
University_Rating        int64
SOP                    float64
LOR                    float64
CGPA                   float64
Research                 int64
Chance_of_Admission    float64
dtype: object

In [7]:
# Describe the data
university_df.describe()

|       | GRE_Score | TOEFL_Score | University_Rating | SOP | LOR | CGPA | Research | Chance_of_Admission |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 |
| mean | 316.472 | 107.193 | 3.114 | 3.374 | 3.484 | 8.57644 | 0.56 | 0.72174 |
| std | 11.289494 | 6.079368 | 1.142939 | 0.990507 | 0.924986 | 0.60451 | 0.496635 | 0.14107 |
| min | 290.0 | 92.0 | 1.0 | 1.0 | 1.0 | 6.8 | 0.0 | 0.34 |
| 25% | 308.0 | 103.0 | 2.0 | 2.5 | 3.0 | 8.1275 | 0.0 | 0.63 |
| 50% | 317.0 | 107.0 | 3.0 | 3.5 | 3.5 | 8.56 | 1.0 | 0.72 |
| 75% | 325.0 | 112.0 | 4.0 | 4.0 | 4.0 | 9.04 | 1.0 | 0.82 |
| max | 340.0 | 120.0 | 5.0 | 5.0 | 5.0 | 9.92 | 1.0 | 0.97 |


## Prepare the Data before Model Training

In [8]:
# Print the columns
university_df.columns

Index(['GRE_Score', 'TOEFL_Score', 'University_Rating', 'SOP', 'LOR', 'CGPA',
       'Research', 'Chance_of_Admission'],
      dtype='object')

In [9]:
# Features Dataframe
X = university_df.drop(columns = ['Chance_of_Admission'])

# Target values (the column the model learns to predict)
y = university_df['Chance_of_Admission']

In [10]:
# Print the shapes

print(X.shape)
print(y.shape)

(1000, 7)
(1000,)


In [11]:
# Convert to numpy arrays
X = np.array(X)
y = np.array(y)

# reshaping the array from (1000,) to (1000, 1)
y = y.reshape(-1,1)
y.shape

(1000, 1)

## Split the Training and Test data sets

In [12]:
# Split the data into 25% Testing and 75% Training

# !pip uninstall -y numpy
# !pip uninstall -y setuptools
# %pip install setuptools
# %pip install numpy


from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)

In [13]:
X_train.shape

(750, 7)

In [14]:
X_test.shape

(250, 7)
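
The call to `train_test_split` above does not pass `random_state`, so each run produces a different split and slightly different metrics. The sketch below is a plain-Python illustration of a seeded 75/25 split and is not scikit-learn's implementation. `split_indices` and its `seed` argument are hypothetical names used only for this example.

```python
import math
import random


def split_indices(n_rows, test_size=0.25, seed=42):
    """Shuffle row indices and split them into train and test index lists."""
    indices = list(range(n_rows))
    # A seeded generator makes the shuffle, and therefore the split, repeatable.
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(n_rows * test_size)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(1000)
print(len(train_idx), len(test_idx))  # 750 250
```

In the notebook itself, the same effect can be obtained by passing a fixed integer as the `random_state` argument of `train_test_split`.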

## Train the Model

In [15]:
!pip install -Uq xgboost==0.90

DEPRECATION: pyodbc 4.0.0-unsupported has a non-standard version number. pip 23.3 will enforce this behaviour change. A possible replacement is to upgrade to a newer version of pyodbc or contact the author to suggest that they release a version with a conforming version number. Discussion can be found at https://github.com/pypa/pip/issues/12063

In [16]:
import xgboost as xgb
import sklearn as sk

xgb_version = xgb.__version__

print(f"XGB Version - {xgb_version}")

# from sklearn.preprocessing import LabelEncoder
# le = LabelEncoder()
# y_train_le = le.fit_transform(y_train.ravel())

# # print(y_train)
# print(y_train_le)

# bt = xgb.XGBClassifier(
#     objective ='reg:squarederror', learning_rate = 0.1, max_depth = 30, n_estimators = 100
# )  # Setup xgboost model
# bt.fit(X_train, y_train_le, verbose=False)  # Train it to our data

model = xgb.XGBRegressor(objective ='reg:squarederror', learning_rate = 0.1, max_depth = 30, n_estimators = 100)

model.fit(X_train, y_train)

XGB Version - 0.90


XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, gamma=0,
             importance_type='gain', learning_rate=0.1, max_delta_step=0,
             max_depth=30, min_child_weight=1, missing=None, n_estimators=100,
             n_jobs=1, nthread=None, objective='reg:squarederror',
             random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
             seed=None, silent=None, subsample=1, verbosity=1)

In [17]:
# Save the model
model_file_name = "multi-regressor-xgboost-model"
model.save_model(model_file_name)

In [18]:
# Score the trained model on the testing dataset.
# For a regressor, score() returns the R2 coefficient, not classification accuracy.

result = model.score(X_test, y_test)
print("R2 score : {}".format(result))

# Make predictions on the test data
y_predict = model.predict(X_test)

R2 score : 0.9047137599591243


In [19]:
y_predict

array([0.5792501 , 0.41041452, 0.72943157, 0.85038257, 0.95985985,
       0.87060857, 0.9404441 , 0.71825516, 0.4617769 , 0.7682433 ,
       0.7286445 , 0.66943663, 0.95563185, 0.8604122 , 0.7473627 ,
       0.69834834, 0.72043073, 0.7597363 , 0.92766625, 0.7911525 ,
       0.7101232 , 0.49164867, 0.849548  , 0.7226256 , 0.62029064,
       0.5314263 , 0.78056777, 0.57956976, 0.44022632, 0.9468069 ,
       0.5202545 , 0.57041895, 0.52008164, 0.6791025 , 0.7098753 ,
       0.7689101 , 0.76027274, 0.69834834, 0.5596452 , 0.8916558 ,
       0.7226256 , 0.6061566 , 0.6720531 , 0.6792645 , 0.8202441 ,
       0.961064  , 0.89003384, 0.5896477 , 0.58156335, 0.7106353 ,
       0.7234301 , 0.6397014 , 0.53056556, 0.939922  , 0.8111183 ,
       0.83371365, 0.6490842 , 0.7889292 , 0.4441638 , 0.86004555,
       0.7240722 , 0.68049276, 0.7648382 , 0.64973056, 0.7201986 ,
       0.62063104, 0.9393027 , 0.6306576 , 0.5601826 , 0.6701422 ,
       0.8981861 , 0.73911905, 0.6814442 , 0.8563551 , 0.67035

In [20]:
# Compute regression metrics on the test set

from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

k = X_test.shape[1]  # number of features
n = len(X_test)      # number of test samples

RMSE = float(format(np.sqrt(mean_squared_error(y_test, y_predict)),'.3f'))
MSE = mean_squared_error(y_test, y_predict)
MAE = mean_absolute_error(y_test, y_predict)
r2 = r2_score(y_test, y_predict)
# Adjusted R2 penalizes R2 for the number of features used
adj_r2 = 1-(1-r2)*(n-1)/(n-k-1)

print('RMSE =',RMSE, '\nMSE =',MSE, '\nMAE =',MAE, '\nR2 =', r2, '\nAdjusted R2 =', adj_r2) 

RMSE = 0.043 
MSE = 0.0018255081177536007 
MAE = 0.015137223381996155 
R2 = 0.9047137599591242 
Adjusted R2 = 0.9019575464042228
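
To make the metrics above easier to follow, here is a minimal plain-Python sketch of the same quantities. `regression_metrics` and `adjusted_r2` are hypothetical helper names; the notebook itself uses the scikit-learn functions. The last lines cross-check the printed results using n = 250 test rows and k = 7 features.

```python
from math import sqrt


def adjusted_r2(r2, n, k):
    """Adjust R2 for the number of features k, given n samples."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


def regression_metrics(y_true, y_pred, n_features):
    """Compute MSE, RMSE, MAE, R2, and adjusted R2 for two equal-length lists."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)                  # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total sum of squares
    mse = ss_res / n
    r2 = 1 - ss_res / ss_tot
    return {
        "MSE": mse,
        "RMSE": sqrt(mse),
        "MAE": sum(abs(e) for e in errors) / n,
        "R2": r2,
        "Adjusted R2": adjusted_r2(r2, n, n_features),
    }


# Cross-check against the values printed above.
print(adjusted_r2(0.9047137599591242, 250, 7))   # approximately 0.90196
print(round(sqrt(0.0018255081177536007), 3))     # 0.043
```

Because the adjustment factor (n-1)/(n-k-1) is close to 1 when there are many more test rows than features, adjusted R2 here is only slightly lower than R2.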


In [21]:
!tar czvf model.tar.gz $model_file_name

multi-regressor-xgboost-model
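
The same archive can be built without the shell by using Python's standard `tarfile` module. This is a sketch of an equivalent step, not what the notebook runs: `make_model_archive` is a hypothetical helper, and the demo writes placeholder bytes to a temporary directory instead of a real model file.

```python
import os
import tarfile
import tempfile


def make_model_archive(model_path, archive_path="model.tar.gz"):
    """Create a gzip-compressed tar archive containing a single model file."""
    with tarfile.open(archive_path, "w:gz") as archive:
        # arcname keeps only the file name, so the model sits at the archive root.
        archive.add(model_path, arcname=os.path.basename(model_path))
    return archive_path


# Demo in a temporary directory with a stand-in file (not a real model).
with tempfile.TemporaryDirectory() as workdir:
    model_path = os.path.join(workdir, "multi-regressor-xgboost-model")
    with open(model_path, "wb") as handle:
        handle.write(b"placeholder bytes, not a real model")
    archive_path = make_model_archive(model_path, os.path.join(workdir, "model.tar.gz"))
    with tarfile.open(archive_path, "r:gz") as archive:
        print(archive.getnames())  # ['multi-regressor-xgboost-model']
```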


## Upload the model to S3

In [22]:
key = os.path.join(prefix, model_file_name, "model.tar.gz")

# Upload the archive to S3; the context manager closes the file afterwards
with open("model.tar.gz", "rb") as fObj:
    boto3.Session().resource("s3").Bucket(bucket).Object(key).upload_fileobj(fObj)

print(key)

sagemaker/multiregressor-xgboost-byo/multi-regressor-xgboost-model/model.tar.gz
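
S3 object keys always use forward slashes, while `os.path.join` uses the separator of the operating system it runs on. The sketch below builds the same key with `posixpath.join`, which produces forward slashes on any platform, and also forms the matching `s3://` URI. `build_model_location` is a hypothetical helper, and the bucket name is the one shown in the Setup output.

```python
import posixpath


def build_model_location(bucket, prefix, model_name, archive_name="model.tar.gz"):
    """Return the S3 object key and the matching s3:// URI for a model archive."""
    key = posixpath.join(prefix, model_name, archive_name)
    return key, f"s3://{bucket}/{key}"


key, uri = build_model_location(
    "sagemaker-us-west-2-442342299380",        # bucket name from the Setup output
    "sagemaker/multiregressor-xgboost-byo",    # prefix defined in Setup
    "multi-regressor-xgboost-model",           # model file name used when saving
)
print(key)
print(uri)
```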
