 ###  Using Ensemble Methods to Predict Grad School Admission 

I built and compared a logistic regression model, a support vector machine (SVM) model, and a decision tree model on my dataset. I then combined these three models into a single ensemble and compared the results of all four. The dataset concerns the likelihood of someone being admitted to grad school, based on factors such as GRE Score, University Rating, LOR, etc. Since my target variable, AdmitPercentage, is continuous rather than binary, I created a new binary variable, LikelihoodAdmit. I labeled someone as having a high chance of admission if their AdmitPercentage was greater than 0.7. This splits the classes fairly evenly (287 high-chance versus 213 low-chance observations, roughly 57% to 43%). I then used this column as my target variable. For this set of models, I trained on all the feature variables in the dataset.

I also ran a grid search on both my logistic regression and support vector machine pipelines to find the best hyperparameters for each model. After fitting the models on the training set and evaluating them with 10-fold cross-validation, I compare the accuracy of the four models on the test set to see which one performs best on this dataset. This should allow me to pick the model that best predicts whether or not someone has a good chance of being admitted to graduate school.

In [7]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder


In [8]:
admission=pd.read_csv("Admission_Predict.csv").astype("float")
admission

Unnamed: 0,Serial No.,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,AdmitPercentage
0,1.0,337.0,118.0,4.0,4.5,4.5,9.65,1.0,0.92
1,2.0,324.0,107.0,4.0,4.0,4.5,8.87,1.0,0.76
2,3.0,316.0,104.0,3.0,3.0,3.5,8.00,1.0,0.72
3,4.0,322.0,110.0,3.0,3.5,2.5,8.67,1.0,0.80
4,5.0,314.0,103.0,2.0,2.0,3.0,8.21,0.0,0.65
...,...,...,...,...,...,...,...,...,...
495,496.0,332.0,108.0,5.0,4.5,4.0,9.02,1.0,0.87
496,497.0,337.0,117.0,5.0,5.0,5.0,9.87,1.0,0.96
497,498.0,330.0,120.0,5.0,4.5,5.0,9.56,1.0,0.93
498,499.0,312.0,103.0,4.0,4.0,5.0,8.43,0.0,0.73


In [19]:
# Binarize the target: 1 if AdmitPercentage is above 0.7, otherwise 0
admission.loc[admission["AdmitPercentage"]<=0.7, "LikelihoodAdmit"] = 0
admission.loc[admission["AdmitPercentage"]>0.7, "LikelihoodAdmit"] = 1
# Count the observations in each class; the third value recounts class 0 as a cross-check
count_high=admission.loc[:,"LikelihoodAdmit"].eq(1.0).sum()
count_low=admission.loc[:,"LikelihoodAdmit"].eq(0).sum()
low_check=len(admission[admission["LikelihoodAdmit"]==0])
count_high, count_low, low_check

#admission=admission.to_numpy()

(287, 213, 213)

The counts show that the two target classes are reasonably balanced: 287 observations with a high chance of admission and 213 without.
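As a small illustration of the thresholding step, the sketch below applies the 0.7 cutoff to the AdmitPercentage values from the first five rows of the table above, then turns the reported class counts into shares. It uses plain Python lists rather than the notebook's DataFrame, and the variable names are chosen here for illustration only.

```python
# AdmitPercentage values from the first five rows of the table shown above
admit_percentage = [0.92, 0.76, 0.72, 0.80, 0.65]

THRESHOLD = 0.7  # cutoff used in the notebook

# 1 = high chance of admission, 0 = otherwise
likelihood_admit = [1 if p > THRESHOLD else 0 for p in admit_percentage]
print(likelihood_admit)

# Class counts reported in the notebook output
n_high, n_low = 287, 213
total = n_high + n_low

share_high = n_high / total
share_low = n_low / total
print(round(share_high, 3), round(share_low, 3))
```

The shares come out at roughly 57% and 43%, so the split is balanced enough that accuracy remains a meaningful metric, though it is not an exact 50-50 split.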

In [4]:
admission=admission.to_numpy()

In [5]:
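# Columns 0-8 are used as features and column 9 (LikelihoodAdmit) as the target.
# Note: this slice appears to include Serial No. and AdmitPercentage,
# the column the target was derived from.
X=admission[:, 0:9]
y=admission[:, 9]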

le=LabelEncoder()
y=le.fit_transform(y)
y

X_train, X_test, y_train, y_test=train_test_split(X, y, test_size=0.3, random_state=2, stratify=y)

We will now run grid search on the SVM and logistic regression models.

In [6]:
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

pipe_svc=make_pipeline(StandardScaler(), SVC(random_state=1))
param_range=[0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]
param_grid=[{'svc__C': param_range,
           'svc__kernel':['linear']},
           {'svc__C':param_range,
            'svc__gamma':param_range,
            'svc__kernel':['rbf'] }]
gs=GridSearchCV(estimator=pipe_svc, param_grid=param_grid, scoring='accuracy', cv=10, refit=True)
gs=gs.fit(X_train, y_train)
print(gs.best_score_)
print(gs.best_params_)

1.0
{'svc__C': 1000.0, 'svc__kernel': 'linear'}


In [7]:
from sklearn.linear_model import LogisticRegression

pipe_log= make_pipeline(StandardScaler(), LogisticRegression(random_state=1))

param_grid=[{'logisticregression__C':param_range,
              'logisticregression__solver':['lbfgs'],
            'logisticregression__multi_class':['auto']}]

          
gs=GridSearchCV(estimator=pipe_log, param_grid=param_grid, scoring='accuracy', cv=10, refit=True)
gs=gs.fit(X_train, y_train)
print(gs.best_score_)
print(gs.best_params_)


0.9942857142857143
{'logisticregression__C': 1000.0, 'logisticregression__multi_class': 'auto', 'logisticregression__solver': 'lbfgs'}


For both the logistic regression and support vector machine models, I used grid search to find the best hyperparameters. Grid search is a brute-force approach: it evaluates every combination of hyperparameters in the supplied grid and keeps the combination with the best cross-validated score.
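To make the brute-force nature of the search concrete, the sketch below enumerates the SVM grid from the cell above using only the standard library and counts how many model fits the search implies. It mirrors the grid values from the notebook but does not call scikit-learn; the dictionary keys are simplified for illustration.

```python
from itertools import product

param_range = [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]

# Linear kernel: only C is varied
linear_grid = [{"C": c, "kernel": "linear"} for c in param_range]

# RBF kernel: every (C, gamma) pair is tried
rbf_grid = [
    {"C": c, "gamma": g, "kernel": "rbf"}
    for c, g in product(param_range, param_range)
]

candidates = linear_grid + rbf_grid
n_folds = 10  # cv=10 in the notebook
n_fits = len(candidates) * n_folds

print(len(linear_grid), len(rbf_grid), len(candidates), n_fits)
```

With `refit=True`, the search then fits the best candidate once more on the whole training set, so the cost grows quickly as more hyperparameters or values are added to the grid.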

In [8]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score


pipe1 = make_pipeline(StandardScaler(), 
                        SVC(C=1000.0, kernel='linear',random_state=1))

pipe2 =  make_pipeline(DecisionTreeClassifier(max_depth=2,
                                             criterion='entropy',
                                             random_state=0))

pipe3 = make_pipeline(StandardScaler(), LogisticRegression(C=1000.0, random_state=1, solver='lbfgs', multi_class='auto'))

In [9]:
clf_labels = ['SVM', 'Decision tree', 'Logistic Regression']

print('10-fold cross validation:\n')
for clf, label in zip([pipe1, pipe2, pipe3], clf_labels):
    scores = cross_val_score(estimator=clf,
                             X=X_train,
                             y=y_train,
                             cv=10,
                             scoring='accuracy')
    print("Accuracy: " + str(round(scores.mean(), 3)) + 
          " Stdev: " + str(round(scores.std(), 3)) +
          " [" + label + "]")

10-fold cross validation:

Accuracy: 1.0 Stdev: 0.0 [SVM]
Accuracy: 1.0 Stdev: 0.0 [Decision tree]
Accuracy: 0.994 Stdev: 0.011 [Logistic Regression]


In [10]:
from sklearn.ensemble import VotingClassifier

mv_clf = VotingClassifier(estimators=[('svm', pipe1), ('dt', pipe2), ('log', pipe3)])


The ensemble method I chose uses majority class voting over the three prior models. It combines the support vector machine, decision tree, and logistic regression models, and for each observation it predicts the class that most of the three models vote for.
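The voting rule itself is simple enough to sketch by hand. The example below is a minimal pure-Python illustration of hard (majority) voting, which is the default mode of `VotingClassifier`; the helper `majority_vote` and the prediction lists are hypothetical and are not taken from the notebook's models.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most common label among the base models' predictions for one sample."""
    return Counter(predictions).most_common(1)[0][0]


# Hypothetical predictions from the three base models for three samples
svm_pred = [1, 0, 1]
tree_pred = [1, 0, 0]
logreg_pred = [0, 0, 1]

ensemble_pred = [
    majority_vote(votes) for votes in zip(svm_pred, tree_pred, logreg_pred)
]
print(ensemble_pred)
```

With three voters and two classes a tie is impossible, which is one reason an odd number of base models is convenient for binary problems.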

In [11]:
clf_labels += ['Majority voting']
all_clf = [pipe1, pipe2, pipe3, mv_clf]

for clf, label in zip(all_clf, clf_labels):
    scores = cross_val_score(estimator=clf,
                             X=X_train,
                             y=y_train,
                             cv=10,
                             scoring='accuracy')
    print("Accuracy: " + str(round(scores.mean(), 2)) + 
          " Stdev: " + str(round(scores.std(), 3)) +
          " [" + label + "]")

Accuracy: 1.0 Stdev: 0.0 [SVM]
Accuracy: 1.0 Stdev: 0.0 [Decision tree]
Accuracy: 0.99 Stdev: 0.011 [Logistic Regression]
Accuracy: 1.0 Stdev: 0.0 [Majority voting]


The 10-fold cross-validation shows that all four models perform extremely well. The SVM, decision tree, and majority voting ensemble each reach 100% accuracy with a standard deviation of 0. The logistic regression model performs only marginally worse, with an accuracy of 0.99 and a standard deviation of 0.011. In other words, each of the four models fits the training data almost perfectly. Scores this high raise the worry that the models have high variance and are overfitting, so the next step is to evaluate them on the held-out test set to see whether that is the case.
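To give a feel for how small the logistic regression gap is, the sketch below works through the arithmetic of a 10-fold score. It assumes 350 training rows (500 rows minus the 150 test rows) split into ten folds of 35, and a hypothetical error pattern in which two folds each misclassify a single example. That pattern is an assumption chosen for illustration, not something the notebook reports.

```python
from statistics import mean, pstdev

n_train = 350  # 500 rows minus the 150 held out for testing
n_folds = 10
fold_size = n_train // n_folds  # 35 examples per validation fold

# Hypothetical: two folds misclassify one example each, the rest are perfect
errors_per_fold = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
fold_scores = [(fold_size - e) / fold_size for e in errors_per_fold]

cv_mean = mean(fold_scores)
cv_std = pstdev(fold_scores)  # population std, like NumPy's default ddof=0

print(round(cv_mean, 3), round(cv_std, 3))
```

Under this assumption the mean and standard deviation round to 0.994 and 0.011, the same figures printed for logistic regression, so the difference from the other models could amount to as little as two misclassified training examples.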

In [12]:
pipe1.fit(X_train, y_train)

y_pred = pipe1.predict(X_test)
print("SVM Model")
print('Misclassified test set examples:', (y_test != y_pred).sum())
print('Out of a total of:', y_test.shape[0])
print('Accuracy:', pipe1.score(X_test, y_test))

SVM Model
Misclassified test set examples: 0
Out of a total of: 150
Accuracy: 1.0


In [13]:
pipe2.fit(X_train, y_train)

y_pred = pipe2.predict(X_test)
print("Decision Tree Model")
print('Misclassified test set examples:', (y_test != y_pred).sum())
print('Out of a total of:', y_test.shape[0])
print('Accuracy:', pipe2.score(X_test, y_test))

Decision Tree Model
Misclassified test set examples: 0
Out of a total of: 150
Accuracy: 1.0


In [14]:
pipe3.fit(X_train, y_train)

y_pred = pipe3.predict(X_test)
print("Logistic Regression Model")
print('Misclassified test set examples:', (y_test != y_pred).sum())
print('Out of a total of:', y_test.shape[0])
print('Accuracy:', pipe3.score(X_test, y_test))

Logistic Regression Model
Misclassified test set examples: 0
Out of a total of: 150
Accuracy: 1.0


In [15]:
mv_clf.fit(X_train, y_train)

y_pred = mv_clf.predict(X_test)
print("Ensemble Method Model")
print('Misclassified test set examples:', (y_test != y_pred).sum())
print('Out of a total of:', y_test.shape[0])
print('Accuracy:', mv_clf.score(X_test, y_test))

Ensemble Method Model
Misclassified test set examples: 0
Out of a total of: 150
Accuracy: 1.0


All four models classified the 150 examples in the test set perfectly, so for this dataset any of the four would be a reasonable choice. While I was wary that the models had overfit the training data, the high accuracy held up when I evaluated them on the test data. An accuracy this high may seem surprising. I think it is partly a result of how I split the target to begin with: if I changed the cutoff for a high chance of admission to, say, 80% or even 90%, the models might not predict the test set as well. It is also worth noting that the feature matrix X was sliced as columns 0 through 8, which appears to include AdmitPercentage itself, the column the target was derived from; if so, the models can recover the label from that single column, and the perfect scores overstate how well the other factors predict admission. With that caveat, what these models tell us is that when all the variables in the dataset are used, they accurately predict whether someone has a high likelihood of getting into graduate school.
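One way to probe where the perfect scores come from is to ask how well a single-column threshold rule can do. The sketch below is a hand-rolled decision stump in plain Python, applied to the nine rows visible in the table printed earlier. It is only an illustration on a tiny sample, and the helper `best_stump_accuracy` is hypothetical rather than part of the notebook or scikit-learn.

```python
# (CGPA, AdmitPercentage) pairs copied from the nine rows shown in the table above
rows = [
    (9.65, 0.92), (8.87, 0.76), (8.00, 0.72), (8.67, 0.80), (8.21, 0.65),
    (9.02, 0.87), (9.87, 0.96), (9.56, 0.93), (8.43, 0.73),
]

# Same labeling rule as the notebook: 1 if AdmitPercentage > 0.7
labels = [1 if admit > 0.7 else 0 for _, admit in rows]


def best_stump_accuracy(values, labels):
    """Best accuracy of the rule 'predict 1 if value > t' over candidate thresholds t."""
    ordered = sorted(set(values))
    # Thresholds below the minimum, between neighbouring values, and above the maximum
    thresholds = (
        [ordered[0] - 1.0]
        + [(a + b) / 2 for a, b in zip(ordered, ordered[1:])]
        + [ordered[-1] + 1.0]
    )
    best = 0.0
    for t in thresholds:
        correct = sum((v > t) == bool(y) for v, y in zip(values, labels))
        best = max(best, correct / len(labels))
    return best


cgpa_acc = best_stump_accuracy([cgpa for cgpa, _ in rows], labels)
leak_acc = best_stump_accuracy([admit for _, admit in rows], labels)
print(round(cgpa_acc, 3), leak_acc)
```

On these rows a threshold on AdmitPercentage separates the classes perfectly, while the best threshold on CGPA does not. If the notebook's feature slice does include AdmitPercentage, rerunning the models without that column (and without Serial No.) would give a more honest picture of how predictive the remaining factors are.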