### INTRODUCTION

#### Predicting the chance of admission, mainly through logistic regression
#### The admit class takes two values, 0 and 1
#### Preprocessing steps include data cleaning, standardization, etc.
#### All variables in this dataset are numerical
#### Other models were used to compare accuracy

### SIDE NOTE
#### You can leave your question about any unclear part in the comment section
#### Any correction will be highly welcomed

### LOADING THE DATAFRAME

In [11]:
#Importing the necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import confusion_matrix
from sklearn import svm
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
import os

In [15]:
df = pd.read_csv(r'C:\dataset_git\Admission_Predict_Ver1.1.csv')

df.head(3)


### DEALING WITH MISSING VALUES

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 9 columns):
Serial No.           500 non-null int64
GRE Score            500 non-null int64
TOEFL Score          500 non-null int64
University Rating    500 non-null int64
SOP                  500 non-null float64
LOR                  500 non-null float64
CGPA                 500 non-null float64
Research             500 non-null int64
Chance of Admit      500 non-null float64
dtypes: float64(4), int64(5)
memory usage: 35.3 KB


#### This dataset is clean: each of the 9 columns has 500 non-null entries, so there are no missing values

### DUMMY INDICATOR
#### Converting our target variable into a dummy indicator: a Chance of Admit greater than 0.5 becomes 1, anything else becomes 0

In [4]:
df['admit'] =  np.where(df['Chance of Admit '] > 0.5,1,0)
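
#### The column name 'Chance of Admit ' ends with a space, which is easy to mistype
#### A minimal sketch of stripping header whitespace once after loading; the toy frame and its values are illustrative assumptions, not rows from the dataset

```python
import numpy as np
import pandas as pd

# Toy frame that mimics the trailing-space headers (values are made up for illustration)
toy = pd.DataFrame({'LOR ': [4.5, 3.0], 'Chance of Admit ': [0.92, 0.45]})

# Strip surrounding whitespace from every column name
toy.columns = toy.columns.str.strip()

# The target can now be built without remembering the trailing space
toy['admit'] = np.where(toy['Chance of Admit'] > 0.5, 1, 0)
print(toy.columns.tolist())
```

#### If this were applied to df, every later cell would have to use the stripped names ('LOR', 'Chance of Admit') as well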

In [5]:
df.head()

,Serial No.,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit,admit
0,1,337,118,4,4.5,4.5,9.65,1,0.92,1
1,2,324,107,4,4.0,4.5,8.87,1,0.76,1
2,3,316,104,3,3.0,3.5,8.0,1,0.72,1
3,4,322,110,3,3.5,2.5,8.67,1,0.8,1
4,5,314,103,2,2.0,3.0,8.21,0,0.65,1


In [6]:
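#Dropping 'Chance of Admit ' (admit was derived from it, so keeping it would leak the target) and 'Serial No.' (a row identifier)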
df.drop(['Chance of Admit ', 'Serial No.'], axis = 1, inplace = True)

In [7]:
df.head(3)

,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,admit
0,337,118,4,4.5,4.5,9.65,1,1
1,324,107,4,4.0,4.5,8.87,1,1
2,316,104,3,3.0,3.5,8.0,1,1


In [8]:
df.describe()

,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,admit
count,500.0,500.0,500.0,500.0,500.0,500.0,500.0,500.0
mean,316.472,107.192,3.114,3.374,3.484,8.57644,0.56,0.922
std,11.295148,6.081868,1.143512,0.991004,0.92545,0.604813,0.496884,0.26844
min,290.0,92.0,1.0,1.0,1.0,6.8,0.0,0.0
25%,308.0,103.0,2.0,2.5,3.0,8.1275,0.0,1.0
50%,317.0,107.0,3.0,3.5,3.5,8.56,1.0,1.0
75%,325.0,112.0,4.0,4.0,4.0,9.04,1.0,1.0
max,340.0,120.0,5.0,5.0,5.0,9.92,1.0,1.0
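
#### The mean of admit is 0.922, so about 92% of the rows belong to class 1 and the classes are heavily imbalanced
#### A minimal sketch of checking the class balance and the majority-class baseline; the Series is rebuilt from the counts implied by describe() (0.922 x 500 = 461 ones), not read from the file

```python
import pandas as pd

# Rebuild the class counts implied by describe(): 461 ones and 39 zeros out of 500 rows
admit = pd.Series([1] * 461 + [0] * 39, name='admit')

# Share of each class
balance = admit.value_counts(normalize=True)
print(balance)

# Accuracy of a "model" that always predicts the majority class
majority_baseline = balance.max()
print('Majority-class baseline accuracy:', majority_baseline)
```

#### Any accuracy score on this target is best read against that baseline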


### CHECKING OLS ASSUMPTIONS

#### The assumptions below come from linear regression (OLS):
#### 1. No endogeneity
#### 2. Normality and homoscedasticity
#### 3. No autocorrelation
#### 4. No multicollinearity: making sure our independent variables are not strongly related (correlated) with each other

#### Logistic regression does not require normality or homoscedasticity of the errors, but multicollinearity still makes its coefficients unstable, so that is the assumption we check here

In [9]:
df.columns.values

array(['GRE Score', 'TOEFL Score', 'University Rating', 'SOP', 'LOR ',
       'CGPA', 'Research', 'admit'], dtype=object)

In [10]:
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant


# The target column (in this case 'admit') should not be included in variables
# Categorical variables already turned into dummy indicators may or may not be added, if any
variables = df[['GRE Score', 'TOEFL Score', 'University Rating', 'SOP', 'LOR ',
       'CGPA']]
X = add_constant(variables)
vif = pd.DataFrame()
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range (X.shape[1]) ]
vif['features'] = X.columns
vif

#Using 10 as the cut-off VIF value, i.e. any independent variable with a VIF of 10 or above would have to be dropped
#From the results, all independent variables are below 10 (the const row is not a feature and is ignored)



,VIF,features
0,1277.356032,const
1,4.099486,GRE Score
2,3.895301,TOEFL Score
3,2.613004,University Rating
4,2.834057,SOP
5,2.027346,LOR
6,4.775198,CGPA
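
#### For intuition, the VIF of a feature is 1 / (1 - R^2), where R^2 comes from regressing that feature on all the other features
#### A minimal sketch of that formula with NumPy; vif_manual is a hypothetical helper and the data is synthetic, so the numbers are not the ones in the table above

```python
import numpy as np

def vif_manual(features):
    """Return the VIF of each column: 1 / (1 - R^2) from regressing it on the other columns."""
    features = np.asarray(features, dtype=float)
    n_rows, n_cols = features.shape
    vifs = []
    for j in range(n_cols):
        target = features[:, j]
        others = np.delete(features, j, axis=1)
        design = np.column_stack([np.ones(n_rows), others])  # intercept + other features
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        r_squared = 1.0 - residuals.var() / target.var()
        vifs.append(1.0 / (1.0 - r_squared))
    return np.array(vifs)

# Synthetic example: c is almost a copy of a, b is independent of both
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + 0.1 * rng.normal(size=200)
demo_vifs = vif_manual(np.column_stack([a, b, c]))
print(demo_vifs)
```

#### The two nearly identical columns get a large VIF while the independent one stays close to 1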


### STANDARDIZATION

#### Standardizing puts our independent variables on a common scale (mean 0, standard deviation 1); this matters for regularized logistic regression and for distance-based models such as KNN and SVM

In [11]:
#Declaring our target variable as y
#Declaring our independent variables as x
y = df['admit']
x = df.drop(['admit'], axis = 1)

In [12]:
scaler = StandardScaler() #Selecting the standardscaler

scaler.fit(x)#fitting our independent variables

StandardScaler(copy=True, with_mean=True, with_std=True)

In [13]:
scaled_x = scaler.transform(x)#scaling
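
#### The scaler above is fitted on all rows before the train/test split, so statistics from the test rows influence the scaling
#### A minimal sketch of one alternative, assuming synthetic data from make_classification in place of df: a pipeline fits the scaler on the training rows only

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the admission data (500 rows, 7 features)
X_demo, y_demo = make_classification(n_samples=500, n_features=7, random_state=0)

# Split first, so the test rows stay unseen during preprocessing
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=49)

# fit() learns the scaling from X_tr only; score() reuses it on X_te
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_tr, y_tr)
test_accuracy = pipe.score(X_te, y_te)
print(test_accuracy)
```

#### With 500 rows the effect on the scores is probably small, but the pipeline keeps the evaluation honest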

### LOGISTIC REGRESSION

In [14]:
#Splitting our data into train and test dataframe
x_train, x_test, y_train, y_test = train_test_split(scaled_x,y , test_size = 0.2, random_state = 49)

In [15]:
reg = LogisticRegression()#Selecting our model
reg.fit(x_train,y_train)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

In [16]:
y_new = reg.predict(x_test) #Predicting with our already trained model using x_test

In [17]:
#Getting the accuracy of our model (accuracy_score expects the true labels first)
acc = metrics.accuracy_score(y_test, y_new)
acc

0.94

In [18]:
#The intercept for our regression
reg.intercept_

array([3.77379732])

In [19]:
#Coefficient for all our variables
reg.coef_

array([[ 0.1932854 ,  0.58482231, -0.0991337 , -0.17392944,  0.60550104,
         1.1850579 ,  0.0598847 ]])
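
#### The intercept and coefficients define the predicted probability: p = 1 / (1 + exp(-(intercept + x . coef)))
#### A minimal sketch that checks this against predict_proba, assuming a model fitted on synthetic data rather than on our reg

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary problem with 7 features
X_demo, y_demo = make_classification(n_samples=200, n_features=7, random_state=0)
model = LogisticRegression().fit(X_demo, y_demo)

# Log-odds: intercept plus the weighted sum of the features
log_odds = model.intercept_ + X_demo @ model.coef_.ravel()

# The sigmoid turns log-odds into the probability of class 1
manual_proba = 1.0 / (1.0 + np.exp(-log_odds))
sklearn_proba = model.predict_proba(X_demo)[:, 1]
print(np.allclose(manual_proba, sklearn_proba))
```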

### CONFUSION MATRIX

In [20]:
cm = confusion_matrix(y_new, y_test)
cm

array([[ 1,  0],
       [ 6, 93]], dtype=int64)

In [21]:
# Format for easier understanding
# confusion_matrix(y_new, y_test) received the predictions first, so rows are predictions and columns are actual labels
cm_df = pd.DataFrame(cm)
cm_df.columns = ['Actual 0','Actual 1']
cm_df = cm_df.rename(index={0: 'Predicted 0',1:'Predicted 1'})
cm_df

,Actual 0,Actual 1
Predicted 0,1,0
Predicted 1,6,93


#### Our model predicted '0' once, correctly, and never predicted '0' incorrectly
#### It predicted '1' correctly 93 times and incorrectly 6 times, so 6 of the 7 actual '0' cases in the test set were classified as '1'
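
#### scikit-learn's confusion_matrix takes the true labels first and the predictions second
#### A minimal sketch with that argument order, assuming labels rebuilt from the counts reported above (1, 0, 6 and 93) instead of the real y_test and y_new

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Rebuild labels from the reported counts:
# 1 actual-0 predicted 0, 6 actual-0 predicted 1, 93 actual-1 predicted 1
y_true = np.array([0] * 1 + [0] * 6 + [1] * 93)
y_pred = np.array([0] * 1 + [1] * 6 + [1] * 93)

# With (y_true, y_pred), rows are actual classes and columns are predicted classes
cm_demo = confusion_matrix(y_true, y_pred)
print(cm_demo)

# How many of the actual 0s were found, and how reliable a predicted 1 is
recall_class0 = recall_score(y_true, y_pred, pos_label=0)
precision_class1 = precision_score(y_true, y_pred, pos_label=1)
print(recall_class0, precision_class1)
```

#### Recall for class 0 is 1 out of 7, which the overall accuracy hides because class 0 is rare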


### OTHER MODELS

In [22]:
dt = DecisionTreeClassifier()
dt.fit(x_train,y_train)

dnew = dt.predict(x_test)

acc2 = metrics.accuracy_score(y_test, dnew)
acc2

0.91

In [23]:
sv = svm.SVC() #select the algorithm
sv.fit(x_train,y_train) # we train the algorithm with the training data and the training output
y_pred = sv.predict(x_test) #now we pass the testing data to the trained algorithm
acc_svm = metrics.accuracy_score(y_test, y_pred)
print('The accuracy of the SVM is:', acc_svm)

The accuracy of the SVM is: 0.92




In [24]:
knc = KNeighborsClassifier(n_neighbors=3) #this examines 3 neighbours for putting the new data into a class
knc.fit(x_train,y_train)
y_pred = knc.predict(x_test)
acc_knn = metrics.accuracy_score(y_test, y_pred)
print('The accuracy of the KNN is', acc_knn)

The accuracy of the KNN is 0.93


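#### After comparing with the other models, logistic regression gave the highest accuracy on this test split (~94%, against 91% to 93% for the others); with only 100 test rows, these gaps amount to one to three predictions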
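
#### A single train/test split can rank models differently from one random_state to the next
#### A minimal sketch of comparing the same four models with stratified 5-fold cross-validation, assuming synthetic imbalanced data in place of df, so the scores are not results for the admission dataset

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in: 500 rows, 7 features, roughly 90% of rows in class 1
X_demo, y_demo = make_classification(
    n_samples=500, n_features=7, weights=[0.1, 0.9], random_state=0
)

models = {
    'Logistic regression': LogisticRegression(),
    'Decision tree': DecisionTreeClassifier(random_state=0),
    'SVM': SVC(),
    'KNN': KNeighborsClassifier(n_neighbors=3),
}

# Stratified folds keep the class ratio similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=49)

cv_scores = {}
for name, model in models.items():
    # Scaling sits inside the pipeline, so it is refitted on each training fold
    pipe = make_pipeline(StandardScaler(), model)
    cv_scores[name] = cross_val_score(pipe, X_demo, y_demo, cv=cv, scoring='accuracy')
    print(f'{name}: {cv_scores[name].mean():.3f} +/- {cv_scores[name].std():.3f}')
```

#### The spread across folds shows whether a gap of one or two points between models is meaningful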

### CONCLUSION
#### Let's make a table and interpret what the weights (coefficients) and the odds ratios mean

In [25]:
df1 = pd.DataFrame(data = x.columns.values, columns = ['Features'])

df1['weight'] = np.transpose(reg.coef_)
df1['odds'] = np.exp(np.transpose(reg.coef_))
df1

,Features,weight,odds
0,GRE Score,0.193285,1.213229
1,TOEFL Score,0.584822,1.794672
2,University Rating,-0.099134,0.905622
3,SOP,-0.173929,0.840356
4,LOR,0.605501,1.83217
5,CGPA,1.185058,3.270876
6,Research,0.059885,1.061714


#### Remember that we standardized all independent variables, so each odds value refers to a one standard deviation change in a variable rather than a one unit change
#### Using LOR as an example: a one standard deviation increase in LOR multiplies the odds of admit = 1 by about 1.83, holding the other variables constant
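
#### A minimal sketch of the arithmetic behind the LOR row, using the weight from the table and the LOR standard deviation from df.describe()
#### The per-point figure is an approximation: describe() reports the sample standard deviation, while StandardScaler divides by the population one

```python
import numpy as np

lor_weight = 0.605501  # standardized LOR coefficient, taken from the table above
lor_std = 0.92545      # LOR standard deviation from df.describe() (approximation)

# Odds ratio for a one standard deviation increase in LOR
odds_ratio_per_sd = np.exp(lor_weight)

# Approximate odds ratio for a one point increase on the original LOR scale
odds_ratio_per_point = np.exp(lor_weight / lor_std)

print('Per standard deviation:', round(odds_ratio_per_sd, 3))
print('Per LOR point (approx.):', round(odds_ratio_per_point, 3))
```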





#### If you find this notebook useful don't forget to upvote. #Happycoding
