# About Graduate Admission Dataset

The dataset contains the parameters commonly considered when evaluating applications to Masters programs. These factors are:

1. **GRE Scores:** Scored out of 340, GRE scores are a standardized measure of the applicant's verbal and quantitative aptitude.

2. **TOEFL Scores:** Measured on a scale of 0 to 120, the TOEFL scores assess the English language proficiency of international applicants, ensuring their ability to cope with the linguistic demands of the program.

3. **University Rating:** Rated on a scale of 0 to 5, this parameter provides insights into the reputation and quality of the candidate's previous educational institutions.

4. **Statement of Purpose and Letter of Recommendation Strength:** Each rated separately on a scale of 0 to 5 (the `SOP` and `LOR` columns), these criteria gauge the strength of the applicant's statement of purpose and letters of recommendation, which reflect their motivation, aspirations, and support from mentors.

5. **Undergraduate GPA:** Rated on a scale of 0 to 10, the undergraduate GPA offers a quantitative measure of the applicant's academic performance during their previous studies.

6. **Research Experience:** A binary factor, where 1 indicates that the applicant has research experience and 0 indicates that they do not.

7. **Chance of Admit:** Ranging from 0 to 1, this is the target variable: the applicant's chance of admission into the Masters program, which the models below predict from the factors above.

By incorporating these diverse parameters, the dataset provides a comprehensive and holistic view of the candidates, enabling admission committees to make informed decisions while selecting individuals who demonstrate the potential to excel in their chosen Masters Program.
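The scales listed above can double as a quick sanity check once the data is loaded. The following is a minimal sketch, assuming the column names match the cleaned DataFrame used later in this notebook. The `EXPECTED_RANGES` mapping and the `out_of_range_counts` helper are hypothetical names, and the small `sample` frame is made-up data for illustration only.

```python
import pandas as pd

# Hypothetical mapping of column name -> (lowest, highest) value from the list above.
EXPECTED_RANGES = {
    "GRE Score": (0, 340),
    "TOEFL Score": (0, 120),
    "University Rating": (0, 5),
    "SOP": (0, 5),
    "LOR": (0, 5),
    "CGPA": (0, 10),
    "Research": (0, 1),
    "Chance of Admit": (0, 1),
}


def out_of_range_counts(df, ranges=EXPECTED_RANGES):
    """Count, per column, the values falling outside the documented range."""
    counts = {}
    for column, (low, high) in ranges.items():
        if column in df.columns:
            # `between` is inclusive on both ends by default.
            counts[column] = int((~df[column].between(low, high)).sum())
    return pd.Series(counts, dtype="int64")


# Made-up rows for illustration only; the second row has an impossible GRE score.
sample = pd.DataFrame({
    "GRE Score": [320, 350],
    "CGPA": [8.5, 9.1],
    "Research": [1, 0],
})
print(out_of_range_counts(sample))
```

A non-zero count for any column would suggest a data-entry problem worth inspecting before modelling.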


In [None]:
# Import all the necessary libraries
import warnings

import numpy as np
import pandas as pd
import seaborn as sns
import statsmodels.api as sm
import matplotlib.pyplot as plt
from scipy import stats
from sklearn.svm import SVR
from xgboost import XGBRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LinearRegression, Lasso, Ridge, BayesianRidge, ElasticNet
from sklearn.ensemble import AdaBoostRegressor, GradientBoostingRegressor, RandomForestRegressor
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Suppress library warnings to keep the notebook output readable
warnings.filterwarnings("ignore")

In [None]:
# Read the dataset
dataframe= pd.read_csv("/kaggle/input/graduate-admissions/Admission_Predict.csv")
dataframe.head(10)

# Exploratory Data Analysis

In [None]:
# check the info of data
dataframe.info()

In [None]:
# Describe the statistics of data
dataframe.describe()

In [None]:
# shape of data
dataframe.shape

In [None]:
# check the duplicate values in the dataset
dataframe.duplicated().sum()

In [None]:
# Check whether there are any null values in the dataset
dataframe.isnull().sum()

In [None]:
# check the correlation matrix 
correlation_matrix= dataframe.corr()
correlation_matrix

In [None]:
# Plot the correlation matrix
plt.figure(figsize=(8,5))
sns.heatmap(correlation_matrix, annot=True, cmap="Blues", fmt='.2f')
plt.show()

In [None]:
# Let's drop the serial No. from the dataset
dataframe.drop("Serial No.", axis=1, inplace=True)

In [None]:
# check the shape of the data
dataframe.shape

In [None]:
# Rename the columns at positions 4 and 7 to "LOR" and "Chance of Admit"
columns = dataframe.columns.to_list()
columns[4] = "LOR"
columns[7] = "Chance of Admit"
dataframe.columns = columns
dataframe.columns.to_list()

# Univariate Analysis

In [None]:
# Density plot for GRE
sns.histplot(dataframe["GRE Score"], kde=True)
plt.title("GRE Score", fontsize=14)
plt.xlabel("GRE Score")
plt.ylabel("Count")
plt.show()

In [None]:
# Density plot for TOEFL
sns.histplot(dataframe["TOEFL Score"], kde=True)
plt.title("TOEFL Score", fontsize=14)
plt.xlabel("TOEFL Score")
plt.ylabel("Count")
plt.show()

In [None]:
# Density plot for CGPA
sns.histplot(dataframe["CGPA"], kde=True)
plt.title("CGPA", fontsize=14)
plt.xlabel("CGPA")
plt.ylabel("Count")
plt.show()

In [None]:
# Frequency plot for University Rating
dataframe["University Rating"].value_counts().plot(kind="bar", figsize=(6,4), rot=0)
plt.title("University Rating", fontsize=14)
plt.xlabel("University Rating")
plt.ylabel("Count")
plt.show()

In [None]:
# Frequency plot for SOP
dataframe["SOP"].value_counts().plot(kind="bar", figsize=(6,4), rot=0)
plt.title("SOP", fontsize=14)
plt.xlabel("SOP")
plt.ylabel("Count")
plt.show()

In [None]:
# Frequency plot for research
dataframe["Research"].value_counts().plot(kind="bar", figsize=(6,4), rot=0)
plt.title("Research", fontsize=14)
plt.xlabel("Research")
plt.ylabel("Count")
plt.show()

In [None]:
# Frequency plot for LOR
dataframe["LOR"].value_counts().plot(kind="bar", figsize=(6,4), rot=0)
plt.title("LOR", fontsize=14)
plt.xlabel("LOR")
plt.ylabel("Count")
plt.show()

# Bivariate Analysis
Bivariate analysis using violin plots to see the impact of SOP, LOR, Research and University Rating on Chance of Admit.

In [None]:
# Violin plots of Chance of Admit against each discrete feature
features = ["SOP", "LOR", "Research", "University Rating"]
colors = ["c", "r", "y", "b"]

fig, axes = plt.subplots(2, 2, figsize=(12, 12))
for ax, feature, color in zip(axes.ravel(), features, colors):
    sns.violinplot(data=dataframe, x=feature, y="Chance of Admit", color=color, ax=ax)
    ax.set_title(f"Chance of Admit vs {feature}", fontsize=16)

plt.show()

# Hexbin Plots of CGPA, GRE Score and TOEFL Score against Chance of Admit

In [None]:
# Hexbin plot of CGPA against Chance of Admit
# (jointplot creates its own figure, so the size is set with `height`)
sns.jointplot(x=dataframe["CGPA"], y=dataframe["Chance of Admit"], kind="hex", color="r", height=5)

In [None]:
# Hexbin plot of GRE Score against Chance of Admit
sns.jointplot(x=dataframe["GRE Score"], y=dataframe["Chance of Admit"], kind="hex", color="b", height=5)

In [None]:
# Hexbin plot of TOEFL Score against Chance of Admit
sns.jointplot(x=dataframe["TOEFL Score"], y=dataframe["Chance of Admit"], kind="hex", color="g", height=5)

# Let's detect the outliers in the dataset
To detect outliers, we find the 95th percentile of each column and compare it with the maximum value of that column.

In [None]:
# Find the 95th percentile of each column
for column in ["GRE Score", "TOEFL Score", "CGPA", "University Rating", "SOP", "LOR"]:
    percentile_95 = np.quantile(dataframe[column], 0.95)
    print(f"{column} at 95th Percentile: {percentile_95}")

From the values above, it is clear that there are no outliers at the upper end of the dataset: the 95th percentile of each column is close to the column's maximum value, as reported when we described the dataset.
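The percentile check above only looks at the upper tail of each column. As a complementary sketch (not the method used in this notebook), the common 1.5 x IQR rule flags unusual values on both sides. The `iqr_outlier_mask` helper and the sample scores are hypothetical and serve only as an illustration.

```python
import numpy as np


def iqr_outlier_mask(values, k=1.5):
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.quantile(values, [0.25, 0.75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Made-up scores for illustration; 200 lies far below the rest.
scores = np.array([310, 315, 318, 320, 322, 325, 330, 200])
mask = iqr_outlier_mask(scores)
print(scores[mask])
```

Applied to a real column, for example `iqr_outlier_mask(dataframe["GRE Score"])`, the mask would show whether any rows sit unusually low as well as unusually high.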

# Q-Q Plot

In [None]:
plt.figure(figsize=(10,6))
stats.probplot(dataframe["Chance of Admit"], plot= plt, dist="norm")
plt.title('Q-Q plot for Chance of Admit')
plt.show()

# Check the Multicollinearity Using VIF

In [None]:
# Standardise the data
std = StandardScaler()
x = dataframe.drop('Chance of Admit', axis=1)
y = dataframe['Chance of Admit']
cols = x.columns
x[cols] = std.fit_transform(x[cols])
x.shape

# Calculate the Variance Inflation Factor for all columns

In [None]:
# Compute the VIF of every standardised feature
VIF = pd.DataFrame()
VIF['features'] = x.columns
VIF['VIF'] = [variance_inflation_factor(x.values, i) for i in range(x.shape[1])]
print(VIF)

Here we can see that the VIF of the CGPA column is 5.2. However, the correlation matrix shows that CGPA has a strong individual relationship with Chance of Admit, even though it is also highly correlated with the other features. So, we decided not to drop this column.
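To make the VIF figure easier to interpret, the sketch below computes it from its definition, VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing feature j on the remaining features. This is an illustration on synthetic data, not the notebook's computation: the `vif_by_hand` helper is a hypothetical name, and it adds an intercept to each auxiliary regression.

```python
import numpy as np


def vif_by_hand(X):
    """Compute VIF_j = 1 / (1 - R_j^2) by regressing each column on the others."""
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    vifs = []
    for j in range(n_cols):
        target = X[:, j]
        # Remaining columns plus an intercept term.
        others = np.column_stack([np.ones(n_rows), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        residuals = target - others @ coef
        r_squared = 1.0 - residuals.var() / target.var()
        vifs.append(1.0 / (1.0 - r_squared))
    return np.array(vifs)


# Synthetic data for illustration: the second column is built to track the first.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = a + 0.3 * rng.normal(size=500)   # strongly correlated with a
c = rng.normal(size=500)             # independent of both
demo_vifs = vif_by_hand(np.column_stack([a, b, c]))
print(demo_vifs)
```

The two correlated columns should receive a much larger VIF than the independent one, which stays close to 1.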

# OLS (Ordinary Least Squares) Regression
OLS is the simplest linear regression model and serves as the base model for linear regression. It is an estimator that chooses the slope and intercept so as to minimize the sum of the squared differences between the observed and predicted values of the dependent variable. That’s why it’s named ordinary least squares.
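As a small worked illustration of that definition, the sketch below fits a one-feature OLS model by solving the normal equations (XᵀX)β = Xᵀy with NumPy. The data is synthetic, and the names `x_demo`, `y_demo` and `X_demo` are hypothetical and unrelated to the admission dataset.

```python
import numpy as np

# Synthetic data for illustration: y = 2 + 3*x plus a little noise.
rng = np.random.default_rng(42)
x_demo = rng.uniform(0, 10, size=200)
y_demo = 2.0 + 3.0 * x_demo + rng.normal(scale=0.5, size=200)

# Design matrix with a column of ones for the intercept.
X_demo = np.column_stack([np.ones_like(x_demo), x_demo])

# Solve the normal equations (X^T X) beta = X^T y for beta.
beta = np.linalg.solve(X_demo.T @ X_demo, X_demo.T @ y_demo)
intercept, slope = beta

# The quantity OLS minimizes: the sum of squared residuals.
ssr = np.sum((y_demo - X_demo @ beta) ** 2)
print(f"intercept={intercept:.3f}, slope={slope:.3f}, SSR={ssr:.3f}")
```

The recovered intercept and slope should land near the values used to generate the data. `sm.OLS` solves the same minimization and additionally reports standard errors and p-values.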

In [None]:
x_sm = sm.add_constant(x)
sm_model = sm.OLS(y,x_sm).fit()
print(sm_model.summary())

# Split the data into train and test split

In [None]:
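# random_state fixes the split so the results are reproducible (42 is an arbitrary choice)
x_train, x_test, y_train, y_test= train_test_split(dataframe.drop("Chance of Admit", axis=1),
                                                   dataframe["Chance of Admit"],
                                                   test_size=0.2,
                                                   random_state=42)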

x_train.shape, x_test.shape, y_train.shape, y_test.shape

# Data Preprocessing using StandardScaler

In [None]:
# Fit the scaler on the training data only, then apply the same scaling to the test data
x_train = std.fit_transform(x_train)
x_test = std.transform(x_test)

# Modelling

In [None]:
result = {}

# Candidate regressors, all with default hyperparameters except KNN
models = {
    'DecisionTree': DecisionTreeRegressor(),
    'Linear Regression': LinearRegression(),
    'RandomForest': RandomForestRegressor(),
    'KNeighbors': KNeighborsRegressor(n_neighbors=2),
    'SVM': SVR(),
    'AdaBoost': AdaBoostRegressor(),
    'GradientBoosting': GradientBoostingRegressor(),
    'Xgboost': XGBRegressor(),
    'Lasso': Lasso(),
    'Ridge': Ridge(),
    'BayesianRidge': BayesianRidge(),
    'ElasticNet': ElasticNet(),
}

# Fit each model on the training set and record its RMSE on the test set
for name, model in models.items():
    model.fit(x_train, y_train)
    predictions = model.predict(x_test)
    rmse = np.sqrt(mean_squared_error(y_test, predictions))
    # Store the score in a one-element list so the dict converts to a DataFrame
    result[name] = [rmse]

print(result)

# Results

In [None]:
result=pd.DataFrame(result)
result=result.T
result

In [None]:
col=result.columns.to_list()
col[0]="Root Mean Squared Error"
result.columns=col
result

# Visualising the Result using Bar Plot

In [None]:
result.plot(kind="bar", figsize=(10, 7)).legend(bbox_to_anchor=(1.0, 1.0));

# Conclusion
1. In this notebook we try to predict the Chance of Admit from various input features. We perform univariate and bivariate analysis to see the impact of the different input variables on Chance of Admit.
2. We check for outliers by comparing the 95th percentile of each column with its maximum value, and find no outliers in the dataset.
3. We also check for multicollinearity using the Variance Inflation Factor (VIF).
4. We also summarize an OLS regression model and inspect its parameters. We find that the coefficients of University Rating and SOP are below zero. A negative sign alone does not make a feature unimportant, so this should be read alongside the coefficient magnitudes and p-values in the summary before treating these columns as less important for predicting the Chance of Admit.
5. In the end we find that Ridge Regression performs very well among all the regression models, as measured by RMSE on the test set.