# Project 5 - Predicting the Likelihood of Graduate Admission for Indian Students Using ML Models
Aman Patel

CSCI-B 455

April 25, 2021

# **Introduction**

## Problem Statement
The goal for this project was to create three distinct ML models to predict the likelihood that an Indian student would be admitted to graduate school.

## Data
The dataset used for this project, "Graduate Admission 2", was published by Mohan S Acharya on Kaggle. It contains seven academic attributes for each of 500 students, including GRE score, TOEFL score, and CGPA. The data can be found at https://www.kaggle.com/mohansacharya/graduate-admissions.

## Model Parameters
The first model was a LinearRegression with default settings. This simple model fits the coefficients that minimize the sum of squared errors (SSE) between the targets and the predictions.
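To make the SSE objective concrete, here is a minimal sketch using plain NumPy on synthetic data. The data, the `true_w` and `true_b` values, and the `sse` helper are illustrative assumptions, not part of the project code.

```python
import numpy as np

# Synthetic data for illustration only (not the admissions dataset)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([0.5, -1.0, 2.0])  # assumed "true" weights
true_b = 0.3                         # assumed "true" intercept
y = X @ true_w + true_b + rng.normal(scale=0.1, size=200)

# Append a column of ones so the intercept is fitted with the weights
A = np.column_stack([X, np.ones(len(X))])

# Least-squares solution: the parameters that minimize the SSE
coef, *_ = np.linalg.lstsq(A, y, rcond=None)


def sse(params):
    """Sum of squared errors for a given parameter vector."""
    residuals = y - A @ params
    return float(residuals @ residuals)


best_sse = sse(coef)
# Moving away from the least-squares solution increases the SSE
perturbed_sse = sse(coef + 0.05)
print(coef, best_sse, perturbed_sse)
```

The perturbed parameters give a larger SSE than the fitted ones, which is the property that an ordinary least-squares fit is built around.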

The second model was a DecisionTreeRegressor. It also used mostly default settings, but the maximum depth of the tree was set to 3 to limit overfitting while keeping some model complexity.
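The effect of limiting the depth can be sketched on synthetic data. The noisy sine data below is an assumption chosen for illustration; only `DecisionTreeRegressor` and its `max_depth` parameter come from the project.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy data for illustration only
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=300)

# Depth-limited tree versus a tree grown without a depth limit
shallow = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)
deep = DecisionTreeRegressor(random_state=0).fit(X, y)

# A depth-3 binary tree can have at most 2**3 = 8 leaves
print(shallow.get_depth(), shallow.get_n_leaves())
print(deep.get_depth(), deep.get_n_leaves())

# Training R^2: the unrestricted tree fits the noise as well as the signal
print(shallow.score(X, y), deep.score(X, y))
```

The unrestricted tree scores higher on its own training data because it memorizes the noise, which is the overfitting that a depth limit is meant to reduce.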

The third model was an SGDRegressor with default settings. This model is trained using stochastic gradient descent, which updates the weights from one sample at a time, taking noisy steps toward a minimum of the loss function until convergence is reached.
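The update rule behind stochastic gradient descent can be sketched in a few lines of NumPy. This is a simplified illustration on synthetic data with a constant step size and no regularization. It is not how SGDRegressor is implemented internally: its defaults add an L2 penalty and a decaying learning rate.

```python
import numpy as np

# Synthetic data for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
true_w = np.array([1.5, -0.7])  # assumed "true" weights
y = X @ true_w + rng.normal(scale=0.1, size=500)


def mse(w):
    """Mean squared error of the linear model with weights w."""
    return float(np.mean((X @ w - y) ** 2))


w = np.zeros(2)
learning_rate = 0.01  # assumed constant step size
initial_loss = mse(w)

for epoch in range(20):
    # Visit the samples in a new random order each epoch
    for i in rng.permutation(len(X)):
        # Gradient of the squared error on a single sample
        grad = 2 * (X[i] @ w - y[i]) * X[i]
        # Step against the gradient, i.e. downhill on the loss
        w -= learning_rate * grad

final_loss = mse(w)
print(initial_loss, final_loss, w)
```

Each step moves against the gradient of the loss on one sample, so individual steps are noisy but the weights drift toward the values that minimize the squared error.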

# **Code**

import numpy as np
from sklearn import preprocessing
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeRegressor

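# Load the data file (skipping the header row) and split it into features and targets
data = np.loadtxt('Admission_Predict_Ver1.1.csv', delimiter=',', skiprows=1)
X = data[:, 1:-1]  # all columns except the first and the last (target)
y = data[:, -1]    # last column: the admission target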

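# Standardize all features (zero mean, unit variance) so that no feature dominates training
# Note: the scaler is fitted on all rows, including those later held out for testing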
X = preprocessing.StandardScaler().fit_transform(X)

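# LabelEncoder replaces each distinct target value with an integer index
# (its position among the sorted unique values), so the models below are fitted on these indices.
# Regressors do not require this step and could be fitted on the raw target values instead.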
encoder = preprocessing.LabelEncoder()
y = encoder.fit_transform(y)

# Split training and testing data randomly using 40% test and 60% train
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.4, random_state = 1)

# Initialize the models
model1 = LinearRegression()
model2 = DecisionTreeRegressor(max_depth = 3)
model3 = SGDRegressor()

# Train the models
model1.fit(X_train, y_train)
model2.fit(X_train, y_train)
model3.fit(X_train, y_train)

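# Perform 5-fold cross-validation on the held-out 40% of the data.
# Note: cross_val_score clones and refits each model on every fold,
# so the fits on X_train above are not reused for these scores.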
model1_scores = cross_val_score(model1, X_test, y_test, scoring='r2', cv=5)
model2_scores = cross_val_score(model2, X_test, y_test, scoring='r2', cv=5)
model3_scores = cross_val_score(model3, X_test, y_test, scoring='r2', cv=5)

# Prints the cross-validation scores and the average score
print(model1_scores, np.mean(model1_scores))
print(model2_scores, np.mean(model2_scores))
print(model3_scores, np.mean(model3_scores))


[0.66967059 0.87540069 0.83332021 0.84154644 0.81193494] 0.8063745732971566
[0.46377999 0.79293067 0.71338152 0.77872167 0.70073842] 0.6899104523122412
[0.66643891 0.88152298 0.8346214  0.83888757 0.80987084] 0.8062683401492106


# **Results**

As shown in the printed results, the LinearRegression and SGDRegressor models performed well, with nearly identical R² scores in all five folds (a mean of about 0.806 for both). The DecisionTreeRegressor model had noticeably lower cross-validation scores, with a mean R² of about 0.690.

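Interestingly, the first cross-validation score for each model was noticeably lower than the following four. The cause was not investigated. One likely factor is that cross_val_score with cv=5 uses unshuffled K-fold splitting for regressors, so all three models were scored on the same first fold, and the samples in that fold may simply be harder to predict. Further work, such as repeating the validation with shuffled folds, could confirm this.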
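One way to examine the first-fold pattern is to look at how the folds are built. With an integer `cv`, `cross_val_score` uses `KFold` without shuffling for regressors, so every model sees the same splits. The sketch below assumes the 200-row hold-out set implied by a 40% split of 500 students, and the `random_state` value is an arbitrary choice.

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples = 200  # assumed size of the hold-out set (40% of 500 rows)
indices = np.arange(n_samples)

# Default behaviour: contiguous folds in the original row order
unshuffled = KFold(n_splits=5)
first_test_fold = next(iter(unshuffled.split(indices)))[1]
print(first_test_fold[:5], first_test_fold[-5:])

# Alternative: shuffle rows before forming folds (seed chosen arbitrarily)
shuffled = KFold(n_splits=5, shuffle=True, random_state=0)
first_shuffled_fold = next(iter(shuffled.split(indices)))[1]
print(first_shuffled_fold[:5])
```

Passing a shuffled `KFold` object as the `cv` argument of `cross_val_score` would show whether the low first score follows those particular rows or disappears once the folds are mixed.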

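The models could be improved by tuning their hyperparameters, such as the maximum depth of the decision tree or the learning rate and regularization strength of the SGDRegressor, which can reduce overfitting and decrease training time.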
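A hedged sketch of what such tuning could look like, using `GridSearchCV` to choose the tree depth by cross-validated R². The synthetic data and the candidate depths in `param_grid` are assumptions for illustration, so the selected value says nothing about the admissions dataset.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic data for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=300)

# Candidate depths to compare (an assumed, illustrative grid)
param_grid = {"max_depth": [2, 3, 4, 5, 6]}

search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid,
    scoring="r2",
    cv=5,
)
search.fit(X, y)

# Depth with the best mean cross-validated R^2, and that score
print(search.best_params_, search.best_score_)
```

The same pattern applies to other estimators by swapping the model and the parameter grid.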