# EDA for Admission Prediction Dataset

## Content

- Introduction (About Dataset - Source, What it contains, How it will be useful.)
- Importing Libraries / Datasets


## 1. Introduction

### Source: https://www.kaggle.com/datasets/mohansacharya/graduate-admissions

### Context
- This dataset was created to predict graduate admissions from an Indian perspective.

### Attribute Information

The dataset contains several parameters that are considered important when applying to Master's programs:

1. GRE Score (out of 340)
2. TOEFL Score (out of 120)
3. University Rating (out of 5)
4. Statement of Purpose (SOP) and Letter of Recommendation (LOR) strength (each out of 5)
5. Undergraduate GPA (CGPA, out of 10)
6. Research Experience (either 0 or 1)
7. Chance of Admit (ranging from 0 to 1; target variable)

## 2. Importing Libraries / Datasets

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go

In [2]:
## import datasets

df = pd.read_csv('Dataset/Admission_Predict_Ver1.1.csv')

In [3]:
df.head()

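| | Serial No. | GRE Score | TOEFL Score | University Rating | SOP | LOR | CGPA | Research | Chance of Admit |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 337 | 118 | 4 | 4.5 | 4.5 | 9.65 | 1 | 0.92 |
| 1 | 2 | 324 | 107 | 4 | 4.0 | 4.5 | 8.87 | 1 | 0.76 |
| 2 | 3 | 316 | 104 | 3 | 3.0 | 3.5 | 8.0 | 1 | 0.72 |
| 3 | 4 | 322 | 110 | 3 | 3.5 | 2.5 | 8.67 | 1 | 0.8 |
| 4 | 5 | 314 | 103 | 2 | 2.0 | 3.0 | 8.21 | 0 | 0.65 |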


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Serial No.         500 non-null    int64  
 1   GRE Score          500 non-null    int64  
 2   TOEFL Score        500 non-null    int64  
 3   University Rating  500 non-null    int64  
 4   SOP                500 non-null    float64
 5   LOR                500 non-null    float64
 6   CGPA               500 non-null    float64
 7   Research           500 non-null    int64  
 8   Chance of Admit    500 non-null    float64
dtypes: float64(4), int64(5)
memory usage: 35.3 KB


**Observations:**
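- `df.info()` reports **9 columns** in total: the `Serial No.` identifier, 7 predictor features, and the target `Chance of Admit`.
- All columns are numeric (5 `int64`, 4 `float64`), and none has missing values (500 non-null entries each).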
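
The two checks behind these observations (no missing values, all columns numeric) can also be run programmatically instead of reading them off `df.info()`. The following is a minimal sketch, assuming a small stand-in frame built from the first rows shown by `df.head()`; the name `sample` is illustrative and not part of the notebook.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Stand-in data: a few columns from the first five rows of df.head().
# In the notebook, the same calls would be applied to `df` itself.
sample = pd.DataFrame({
    "GRE Score": [337, 324, 316, 322, 314],
    "TOEFL Score": [118, 107, 104, 110, 103],
    "CGPA": [9.65, 8.87, 8.0, 8.67, 8.21],
    "Research": [1, 1, 1, 1, 0],
    "Chance of Admit": [0.92, 0.76, 0.72, 0.80, 0.65],
})

# Count missing values per column.
missing_per_column = sample.isna().sum()

# Confirm that every column has a numeric dtype.
all_numeric = all(is_numeric_dtype(dtype) for dtype in sample.dtypes)

print(missing_per_column)
print("All columns numeric:", all_numeric)
```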

In [5]:
# Check the shape of dataset

df.shape

(500, 9)

In [6]:
# check the size of dataset

df.size

4500

**Observations:**
- The dataset contains 500 rows and 9 columns, so `df.size` returns 500 × 9 = 4500 values.

In [7]:
# Check the statistical information about features
df.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Serial No. | 500.0 | 250.5 | 144.481833 | 1.0 | 125.75 | 250.5 | 375.25 | 500.0 |
| GRE Score | 500.0 | 316.472 | 11.295148 | 290.0 | 308.0 | 317.0 | 325.0 | 340.0 |
| TOEFL Score | 500.0 | 107.192 | 6.081868 | 92.0 | 103.0 | 107.0 | 112.0 | 120.0 |
| University Rating | 500.0 | 3.114 | 1.143512 | 1.0 | 2.0 | 3.0 | 4.0 | 5.0 |
| SOP | 500.0 | 3.374 | 0.991004 | 1.0 | 2.5 | 3.5 | 4.0 | 5.0 |
| LOR | 500.0 | 3.484 | 0.92545 | 1.0 | 3.0 | 3.5 | 4.0 | 5.0 |
| CGPA | 500.0 | 8.57644 | 0.604813 | 6.8 | 8.1275 | 8.56 | 9.04 | 9.92 |
| Research | 500.0 | 0.56 | 0.496884 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| Chance of Admit | 500.0 | 0.72174 | 0.14114 | 0.34 | 0.63 | 0.72 | 0.82 | 0.97 |


**Observations:**

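- All features lie within their documented ranges (for example, GRE Score from 290 to 340, TOEFL Score from 92 to 120, and CGPA from 6.8 to 9.92).
- For the continuous features the mean and median are close (GRE Score 316.5 vs 317, CGPA 8.58 vs 8.56, Chance of Admit 0.722 vs 0.72), which suggests little skew. This is only a rough check, so let's explore the distributions further with plots.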
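
Skewness can also be measured directly rather than inferred from the summary table. The sketch below uses `DataFrame.skew()`, which returns the sample skewness per column (values near 0 indicate a roughly symmetric distribution). It assumes a small stand-in frame taken from the first rows of `df.head()`; the name `sample` is illustrative, and with only five rows the numbers are not meaningful beyond showing the call.

```python
import pandas as pd

# Stand-in data from the first five rows of df.head();
# in the notebook, call df.skew(numeric_only=True) on the full dataset.
sample = pd.DataFrame({
    "GRE Score": [337, 324, 316, 322, 314],
    "TOEFL Score": [118, 107, 104, 110, 103],
    "CGPA": [9.65, 8.87, 8.0, 8.67, 8.21],
    "Chance of Admit": [0.92, 0.76, 0.72, 0.80, 0.65],
})

# Sample skewness per column: > 0 means a longer right tail, < 0 a longer left tail.
skewness = sample.skew(numeric_only=True)

# Rough cross-check: difference between mean and median per column.
mean_minus_median = sample.mean() - sample.median()

print(skewness)
print(mean_minus_median)
```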

In [12]:
# Remove the trailing space from the target column name
df.rename(columns={'Chance of Admit ': 'Chance of Admit'}, inplace=True)

In [13]:
df.columns

Index(['Serial No.', 'GRE Score', 'TOEFL Score', 'University Rating', 'SOP',
       'LOR ', 'CGPA', 'Research', 'Chance of Admit'],
      dtype='object')
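
Note that the output above still lists `'LOR '` with a trailing space, because the rename only targeted the target column. One way to clean every column name at once is to strip whitespace from all of them. The following is a minimal sketch, assuming the raw column names are as printed above; the names `raw` and `cleaned` are illustrative.

```python
import pandas as pd

# Empty frame with the raw column names, including the two trailing spaces.
raw = pd.DataFrame(columns=[
    "Serial No.", "GRE Score", "TOEFL Score", "University Rating",
    "SOP", "LOR ", "CGPA", "Research", "Chance of Admit ",
])

# Apply str.strip to every column name; rename returns a new frame.
cleaned = raw.rename(columns=str.strip)

print(list(cleaned.columns))
```

In the notebook, the equivalent call would be `df = df.rename(columns=str.strip)`, after which `df['LOR']` can be used without the trailing space.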

In [14]:
df['Chance of Admit'].describe()

count    500.00000
mean       0.72174
std        0.14114
min        0.34000
25%        0.63000
50%        0.72000
75%        0.82000
max        0.97000
Name: Chance of Admit, dtype: float64

In [22]:
# Violin plot - Distribution of Chance of Admit
fig = px.violin(df, y="Chance of Admit", box=True, # draw box plot inside the violin
                points='all',# can be 'outliers', or False
                title='Distribution of Chance of Admit'
               )
fig.show()

**Observations:**
- The violin plot shows that most applicants are concentrated between roughly 0.6 and 0.8 chance of admit (the interquartile range is 0.63 to 0.82, with a median of 0.72), so most applicants in the data have a fairly high chance of admission.
- Next, let's analyze which features increase the chance of admission.
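
The share of applicants in a given range can be computed directly rather than estimated from the plot. The sketch below is a hedged illustration: `share_in_range` is a hypothetical helper, not part of the notebook, and the data are only the five target values shown by `df.head()`.

```python
import pandas as pd

def share_in_range(values: pd.Series, low: float, high: float) -> float:
    """Return the fraction of values with low <= value <= high."""
    return float(values.between(low, high).mean())

# Stand-in data: the target values from the first five rows of df.head().
chance = pd.Series([0.92, 0.76, 0.72, 0.80, 0.65], name="Chance of Admit")

share = share_in_range(chance, 0.6, 0.8)
print(f"Share of applicants between 0.6 and 0.8: {share:.0%}")
```

On the full dataset, the same helper would be called as `share_in_range(df['Chance of Admit'], 0.6, 0.8)`.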

In [23]:
# Histogram - Distribution GRE Score 
fig = px.histogram(df, x='GRE Score',
                  title='Distribution of GRE Score', marginal='box')
fig.show()

**Observations:**
- **About GRE Score:** The Graduate Record Examination, or GRE, is an important step in the graduate school or business school application process. The GRE is a multiple-choice, computer-based, standardized exam that is often required for admission to graduate programs and graduate business programs (MBA) globally.

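- The GRE scores appear approximately normally distributed; the mean (316.5) and median (317) nearly coincide.
- Half of the applicants scored 317 or higher out of 340 (quartiles: 308, 317, 325). Let's check how the GRE score affects the chance of admission.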

In [40]:
fig = px.scatter(df, x="GRE Score", y="Chance of Admit", 
                 trendline="ols",trendline_color_override = '#000000',
                 height=450,title='GRE Score Vs Chance of Admit')
fig.show()
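
The `trendline="ols"` option draws an ordinary least squares line (Plotly Express relies on `statsmodels` for this fit). To put numbers on the relationship, the slope, intercept, and Pearson correlation can be computed separately. The following is a minimal sketch using `numpy.polyfit` on the five rows shown by `df.head()`; the name `sample` is illustrative, and the values from five rows will differ from those of the full dataset.

```python
import numpy as np
import pandas as pd

# Stand-in data from the first five rows of df.head().
sample = pd.DataFrame({
    "GRE Score": [337, 324, 316, 322, 314],
    "Chance of Admit": [0.92, 0.76, 0.72, 0.80, 0.65],
})

x = sample["GRE Score"]
y = sample["Chance of Admit"]

# Degree-1 least squares fit: y ≈ slope * x + intercept.
slope, intercept = np.polyfit(x, y, deg=1)

# Pearson correlation; for a simple linear fit, R² equals r squared.
r = x.corr(y)
r_squared = r ** 2

print(f"slope={slope:.4f}, intercept={intercept:.3f}, r={r:.3f}, R^2={r_squared:.3f}")
```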