# US College Admission Success

## Business Objective:

   1. Build a predictive tool that estimates a student's chances of admission to US tech schools across multiple tiers.
   2. Develop a prototype model that uses student data to produce an admission probability for each student.
   3. Explore historical student data with statistical analysis and machine learning to identify the factors that affect a student's chances of getting into their dream university.


## Motivation:

After watching many of my friends work hard and still miss out on the US university of their choice, I wanted to know whether the seemingly black-box admission process can be broken down with analytical tools. Low GRE and TOEFL scores are a common source of anxiety, so I examine whether they play an important role or whether other factors can quantitatively outweigh standardized test scores. The tool can also help students build a stronger profile.
 

## Workflow:

   1. **Importing Libraries**
      
   2. **Data Pre-Processing**
        - 2.1 Reading Data from the csv file.
        - 2.2 Dummification of categorical Variables

   3. **Data Modeling**
        - 3.1 Statistical Learning
            - Building a statistical Regression Model to find statistically significant factors.
            - 3.1.1 Interpret the p-values (Set 5% as level of significance) 
                 - Analyzing important Factors with significance lower than 0.05.
        - 3.2 Machine Learning
            - Making a Machine Learning model that will finally yield the prediction.
  
## Results:

| **Model** | **RMSE (% of mean, test set)** | **R-square (%)** | **Adjusted R-square (%)** |
|:------------:|:-----------------------------:|:----------------------:|:----------------------:|
| `Linear Regression` | 9.85 | 74.4 | 72.0 |

## Conclusion:

Even a very simple Linear Regression model predicts the chance of admission reasonably well, with an R-square of about 0.74 on the test data. The TOEFL score is not statistically significant at the 5% level, so it adds little to the prediction. GRE score, CGPA, Letter of Recommendation, previous Research and the top University Rating, on the other hand, are significant and play an important role in predicting admissibility.
    


# 1. Importing Libraries

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import metrics
import statsmodels.api as sm


# 2. Data Pre-processing

## 2.1 Reading Data from the csv file

In [2]:
admission = pd.read_csv("Admissions.csv")

In [3]:
admission = admission.drop('Serial No.', axis =1)


In [4]:
admission.head()


Unnamed: 0,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
0,337,118,4,4.5,4.5,9.65,1,0.92
1,324,107,4,4.0,4.5,8.87,1,0.76
2,316,104,3,3.0,3.5,8.0,1,0.72
3,322,110,3,3.5,2.5,8.67,1,0.8
4,314,103,2,2.0,3.0,8.21,0,0.65


In [5]:
admission.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 8 columns):
GRE Score            400 non-null int64
TOEFL Score          400 non-null int64
University Rating    400 non-null int64
SOP                  400 non-null float64
LOR                  400 non-null float64
CGPA                 400 non-null float64
Research             400 non-null int64
Chance of Admit      400 non-null float64
dtypes: float64(4), int64(4)
memory usage: 25.1 KB


In [6]:
admission.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
GRE Score,400.0,316.8075,11.473646,290.0,308.0,317.0,325.0,340.0
TOEFL Score,400.0,107.41,6.069514,92.0,103.0,107.0,112.0,120.0
University Rating,400.0,3.0875,1.143728,1.0,2.0,3.0,4.0,5.0
SOP,400.0,3.4,1.006869,1.0,2.5,3.5,4.0,5.0
LOR,400.0,3.4525,0.898478,1.0,3.0,3.5,4.0,5.0
CGPA,400.0,8.598925,0.596317,6.8,8.17,8.61,9.0625,9.92
Research,400.0,0.5475,0.498362,0.0,0.0,1.0,1.0,1.0
Chance of Admit,400.0,0.72435,0.142609,0.34,0.64,0.73,0.83,0.97


## 2.2 Dummification of categorical columns

In [11]:
# Converting necessary columns into categorical ones

admission['University Rating'] = pd.Categorical(admission['University Rating'])
admission['Research'] = pd.Categorical(admission['Research'])


In [12]:
# List the column names before dummification (note the trailing spaces in 'LOR ' and 'Chance of Admit ')
variables = list(admission.columns)

variables

['GRE Score',
 'TOEFL Score',
 'University Rating',
 'SOP',
 'LOR ',
 'CGPA',
 'Research',
 'Chance of Admit ']

In [19]:
admits = pd.get_dummies(data = admission[['GRE Score',
 'TOEFL Score',
 'University Rating',
 'SOP',
 'LOR ',
 'CGPA',
 'Research']], drop_first = True)

In [20]:
admits.columns

Index(['GRE Score', 'TOEFL Score', 'SOP', 'LOR ', 'CGPA',
       'University Rating_2', 'University Rating_3', 'University Rating_4',
       'University Rating_5', 'Research_1'],
      dtype='object')
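
The effect of `drop_first = True` is easier to see on a tiny frame. The sketch below uses made-up values (the `toy` frame is hypothetical and not part of the admissions data) and only illustrates how `pd.get_dummies` leaves numeric columns alone and drops the first level of each categorical column.

```python
import pandas as pd

# Toy frame with made-up values: one numeric and two categorical columns
toy = pd.DataFrame({
    "GRE Score": [300, 320, 335],
    "University Rating": pd.Categorical([1, 2, 3]),
    "Research": pd.Categorical([0, 1, 1]),
})

# Numeric columns pass through unchanged; each categorical column becomes
# one indicator column per level, minus the first level (the baseline)
encoded = pd.get_dummies(toy, drop_first=True)
print(list(encoded.columns))
```

The dropped level (`University Rating` 1 and `Research` 0 here) becomes the baseline against which the remaining indicator coefficients are read.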

# 3. Data Modeling

In [24]:
x = admits[['GRE Score', 'TOEFL Score', 'SOP', 'LOR ', 'CGPA',
       'University Rating_2', 'University Rating_3', 'University Rating_4',
       'University Rating_5', 'Research_1']]
y = admission[['Chance of Admit ']]

In [26]:
# Split the data into train and test sets in a 70:30 ratio

x_train, x_test, y_train, y_test = train_test_split(x,y, test_size = 0.3, random_state = 10)

In [27]:
# Print the train/test proportions as percentages
print('Train: {:.0%} | Test: {:.0%}'.format(len(y_train) / len(y), len(y_test) / len(y)))

Train: 70% | Test: 30%


## 3.1 Statistical Learning

In [29]:
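# Fit an ordinary least squares model on the training data.
# Note: sm.OLS does not add an intercept on its own, so this model has no constant term
# and the R-squared in the summary is the uncentered one; it is not directly comparable
# to the test R-square from scikit-learn's LinearRegression in section 3.2.
lm = sm.OLS(y_train, x_train).fit()

# Model summary
lm.summary()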

0,1,2,3
Dep. Variable:,Chance of Admit,R-squared:,0.991
Model:,OLS,Adj. R-squared:,0.991
Method:,Least Squares,F-statistic:,3145.0
Date:,"Sun, 04 Nov 2018",Prob (F-statistic):,5.1100000000000005e-273
Time:,15:56:15,Log-Likelihood:,354.08
No. Observations:,280,AIC:,-688.2
Df Residuals:,270,BIC:,-651.8
Df Model:,10,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
GRE Score,-0.0026,0.001,-5.066,0.000,-0.004,-0.002
TOEFL Score,0.0019,0.001,1.356,0.176,-0.001,0.005
SOP,0.0007,0.007,0.094,0.926,-0.014,0.015
LOR,0.0219,0.007,3.093,0.002,0.008,0.036
CGPA,0.1419,0.016,8.904,0.000,0.111,0.173
University Rating_2,-0.0043,0.019,-0.225,0.822,-0.041,0.033
University Rating_3,0.0204,0.021,0.996,0.320,-0.020,0.061
University Rating_4,0.0469,0.025,1.898,0.059,-0.002,0.096
University Rating_5,0.0793,0.028,2.842,0.005,0.024,0.134

0,1,2,3
Omnibus:,41.064,Durbin-Watson:,1.792
Prob(Omnibus):,0.0,Jarque-Bera (JB):,62.542
Skew:,-0.884,Prob(JB):,2.63e-14
Kurtosis:,4.496,Cond. No.,3500.0


### 3.1.1 Interpret the p-values (Set 5% as level of significance)

In [39]:
# Return the variables whose p-value is at or below the significance level
def get_significant_vars(lm, alpha=0.05):
    pvals = lm.pvalues
    return list(pvals[pvals <= alpha].index)

significant_vars = get_significant_vars(lm)
significant_vars

['GRE Score', 'LOR ', 'CGPA', 'University Rating_5', 'Research_1']

## 3.2 Machine Learning

In [30]:
from sklearn.linear_model import LinearRegression


In [31]:
model1 = LinearRegression()


In [32]:
m2 = model1.fit(x_train,y_train)


In [33]:
# R-square on the test data
Rsquare = m2.score(x_test, y_test)

# RMSE on the test data, expressed as a percentage of the mean observed target
avgYtest = y_test.mean()
yTestPred = m2.predict(x_test)
mse = metrics.mean_squared_error(y_test, yTestPred)
test_rmse = np.sqrt(mse)
test_rmse_percent = test_rmse / avgYtest * 100

# Adjusted R-square on the test data
noData = len(y_test)   # number of test observations
p = x_test.shape[1]    # number of predictors
tempRsquare = 1 - ((1 - Rsquare) * (noData - 1)) / (noData - p - 1)

In [34]:
print("RMSE Percentage Error for Test Data :", list(test_rmse_percent))
print("R-square for Test Data:", m2.score(x_test,y_test))
print("Adjusted R-square for Test Data:", tempRsquare)

RMSE Percentage Error for Test Data : [9.848498944659614]
R-square for Test Data: 0.7437398170329537
Adjusted R-square for Test Data: 0.7202297085038669
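
As a quick check of the adjusted R-square arithmetic, the sketch below plugs the printed test R-square into the same formula, with 120 test rows (30% of 400) and 10 predictors. The helper name `adjusted_r2` is illustrative and not part of the notebook.

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R-square: penalises R-square for the number of predictors."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# Values taken from the output above: test R-square, 120 test rows, 10 predictors
value = adjusted_r2(0.7437398170329537, n_samples=120, n_features=10)
print(round(value, 4))  # 0.7202
```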


In [35]:
admits.columns


Index(['GRE Score', 'TOEFL Score', 'SOP', 'LOR ', 'CGPA',
       'University Rating_2', 'University Rating_3', 'University Rating_4',
       'University Rating_5', 'Research_1'],
      dtype='object')

#### Predicting the chance of admit for the given attributes :

| **GRE Score** | **TOEFL Score** | **University Rating** | **SOP** | **LOR** | **CGPA** | **Research** |
|:-------------:|:-------------:|:-------------:|:-------------:|:-------------:|:-------------:|:-------------:|
| 295 | 98 | 4 | 4.5 | 4.5| 7.56 | 1 |
| 333 | 112 | 4 | 4.5 | 4.5 | 8.87 | 1 |

In [36]:
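# Note: lm was fit without a formula, so lm.predict() uses these values by position, not by name.
# The order must match x_train.columns, whose dummies are 'University Rating_2' ... 'University Rating_5'.
newAdmission = {'GRE_Score': [295, 333], 'TOEFL_Score': [98, 112], 'SOP': [4.5, 4.5], 'LOR ': [4.5, 4.5],
                'CGPA': [7.56, 8.87], 'University_Rating_1': [0, 0], 'University_Rating_2': [0, 0],
                'University_Rating_3': [0, 0], 'University_Rating_4': [1, 1], 'Research_1': [1, 1]}
newAdmission = pd.DataFrame(data=newAdmission)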

In [37]:
# Prediction of 'Chance of Admit'
lm.predict(newAdmission)

0    0.716673
1    0.829938
dtype: float64
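
Hand-writing the dummy columns for new applicants is error-prone, because the names and order have to match the training columns exactly. One possible alternative, sketched below under the assumption that the training columns are the ones listed by `admits.columns`, is to start from the raw attributes and let pandas build and align the columns. The names `train_columns`, `raw` and `new_x` are illustrative, and this sketch does not reproduce the predictions printed above.

```python
import pandas as pd

# Column order used for training, copied from admits.columns
train_columns = ['GRE Score', 'TOEFL Score', 'SOP', 'LOR ', 'CGPA',
                 'University Rating_2', 'University Rating_3', 'University Rating_4',
                 'University Rating_5', 'Research_1']

# Raw attributes of the two applicants from the table above
raw = pd.DataFrame({
    'GRE Score': [295, 333], 'TOEFL Score': [98, 112],
    'University Rating': [4, 4], 'SOP': [4.5, 4.5], 'LOR ': [4.5, 4.5],
    'CGPA': [7.56, 8.87], 'Research': [1, 1],
})

# Declare the full category sets so that levels absent from the new rows
# still produce a dummy column
raw['University Rating'] = pd.Categorical(raw['University Rating'], categories=[1, 2, 3, 4, 5])
raw['Research'] = pd.Categorical(raw['Research'], categories=[0, 1])

# Encode as in training, then force the same column names and order
new_x = pd.get_dummies(raw, drop_first=True).reindex(columns=train_columns)
print(new_x.astype(float))
```

A frame built this way can be passed to either fitted model, since its columns line up with `x_train` both by name and by position.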