# Admission Prediction with PySpark ML

This project demonstrates how to use PySpark to predict graduate admissions. The dataset contains several parameters that are considered important when applying to Master's programs.

The main objective is to build a machine learning model with PySpark that predicts the chance of admission from features such as GRE Score, TOEFL Score, University Rating, SOP (statement of purpose), LOR (letter of recommendation), CGPA, and Research.


### Introduction
In this notebook, we walk through the process of building a machine learning model with PySpark to predict the chance of admission. PySpark is the Python API for Apache Spark, a distributed computing framework that provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.


### Install libraries & Run a SparkSession


In [1]:
#install pyspark
!pip install pyspark

Collecting pyspark
  Downloading pyspark-3.5.3.tar.gz (317.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 317.3/317.3 MB 4.0 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.5.3-py2.py3-none-any.whl size=317840625 sha256=69a418f99818aab7c71f3f4c3f863dc9f06b7036cc3e715814587382d9dd571a
  Stored in directory: /root/.cache/pip/wheels/1b/3a/92/28b93e2fbfdbb07509ca4d6f50c5e407f48dce4ddbda69a4ab
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.5.3


In [2]:
#create a sparksession
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('admission_pred').getOrCreate()


### Data Preparation
We start by loading the dataset and inspecting its contents, schema, and summary statistics. The sections that follow drop an unneeded column, check for missing values, and split the data into training and testing sets.


In [3]:
#clone the dataset
! git clone https://github.com/education454/admission_dataset

Cloning into 'admission_dataset'...
remote: Enumerating objects: 3, done.
remote: Counting objects: 100% (3/3), done.
remote: Compressing objects: 100% (3/3), done.
remote: Total 3 (delta 0), reused 0 (delta 0), pack-reused 0 (from 0)
Receiving objects: 100% (3/3), 5.60 KiB | 5.60 MiB/s, done.


In [4]:
#check the presence of dataset
! ls admission_dataset

Admission_Predict_Ver1.1.csv


In [5]:
#create a spark dataframe
df = spark.read.csv('/content/admission_dataset/Admission_Predict_Ver1.1.csv', header=True, inferSchema=True)

In [6]:
#display dataframe
df.show()

+---------+---------+-----------+-----------------+---+---+----+--------+---------------+
|Serial No|GRE Score|TOEFL Score|University Rating|SOP|LOR|CGPA|Research|Chance of Admit|
+---------+---------+-----------+-----------------+---+---+----+--------+---------------+
|        1|      337|        118|                4|4.5|4.5|9.65|       1|           0.92|
|        2|      324|        107|                4|4.0|4.5|8.87|       1|           0.76|
|        3|      316|        104|                3|3.0|3.5| 8.0|       1|           0.72|
|        4|      322|        110|                3|3.5|2.5|8.67|       1|            0.8|
|        5|      314|        103|                2|2.0|3.0|8.21|       0|           0.65|
|        6|      330|        115|                5|4.5|3.0|9.34|       1|            0.9|
|        7|      321|        109|                3|3.0|4.0| 8.2|       1|           0.75|
|        8|      308|        101|                2|3.0|4.0| 7.9|       0|           0.68|
|        9

In [7]:
#get the number of rows & columns
df.count(), len(df.columns)

(500, 9)

In [8]:
#print schema
df.printSchema()

root
 |-- Serial No: integer (nullable = true)
 |-- GRE Score: integer (nullable = true)
 |-- TOEFL Score: integer (nullable = true)
 |-- University Rating: integer (nullable = true)
 |-- SOP: double (nullable = true)
 |-- LOR: double (nullable = true)
 |-- CGPA: double (nullable = true)
 |-- Research: integer (nullable = true)
 |-- Chance of Admit: double (nullable = true)



In [None]:
#get the summary statistics
df.describe().show()

+-------+-----------------+------------------+-----------------+-----------------+------------------+------------------+------------------+------------------+-------------------+
|summary|        Serial No|         GRE Score|      TOEFL Score|University Rating|               SOP|               LOR|              CGPA|          Research|    Chance of Admit|
+-------+-----------------+------------------+-----------------+-----------------+------------------+------------------+------------------+------------------+-------------------+
|  count|              500|               500|              500|              500|               500|               500|               500|               500|                500|
|   mean|            250.5|           316.472|          107.192|            3.114|             3.374|             3.484| 8.576440000000003|              0.56| 0.7217399999999996|
| stddev|144.4818327679989|11.295148372354712|6.081867659564538|1.143511800759815|0.9910036207566072|0.92
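
As a quick sanity check on the summary above, the Serial No statistics can be reproduced with Python's standard library. This is a sketch under an assumption: that Serial No simply runs from 1 to 500, which the count and mean shown are consistent with but which the truncated output does not confirm. It also suggests that the stddev row reports the sample standard deviation (dividing by n - 1) rather than the population one.

```python
import statistics

# Assumption: Serial No is the sequence 1..500
# (consistent with count = 500 and mean = 250.5 in the summary).
serial_no = list(range(1, 501))

mean = statistics.mean(serial_no)
sample_std = statistics.stdev(serial_no)       # divides by n - 1
population_std = statistics.pstdev(serial_no)  # divides by n

print(mean)            # 250.5
print(sample_std)      # compare with the stddev row for Serial No
print(population_std)  # slightly smaller than the sample value
```

Under this assumption, the sample value agrees with the stddev reported for Serial No, while the population value does not.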

### Data Cleaning

In [9]:
#drop the unnecessary column
df = df.drop('Serial No')

In [10]:
#display the dataframe
df.show()

+---------+-----------+-----------------+---+---+----+--------+---------------+
|GRE Score|TOEFL Score|University Rating|SOP|LOR|CGPA|Research|Chance of Admit|
+---------+-----------+-----------------+---+---+----+--------+---------------+
|      337|        118|                4|4.5|4.5|9.65|       1|           0.92|
|      324|        107|                4|4.0|4.5|8.87|       1|           0.76|
|      316|        104|                3|3.0|3.5| 8.0|       1|           0.72|
|      322|        110|                3|3.5|2.5|8.67|       1|            0.8|
|      314|        103|                2|2.0|3.0|8.21|       0|           0.65|
|      330|        115|                5|4.5|3.0|9.34|       1|            0.9|
|      321|        109|                3|3.0|4.0| 8.2|       1|           0.75|
|      308|        101|                2|3.0|4.0| 7.9|       0|           0.68|
|      302|        102|                1|2.0|1.5| 8.0|       0|            0.5|
|      323|        108|                3

In [11]:
#check for null values
for col in df.columns:
  print(col, df.filter(df[col].isNull()).count())

GRE Score 0
TOEFL Score 0
University Rating 0
SOP 0
LOR 0
CGPA 0
Research 0
Chance of Admit 0



### Correlation Analysis & Feature Selection
We compute the correlation between each column and the target variable, Chance of Admit, to see which features are most strongly related to it.


In [12]:
# correlation analysis
for col in df.columns:
  print('Correlation between Chance of admit and', col, 'is', df.stat.corr('Chance of Admit', col))

Correlation between Chance of admit and GRE Score is 0.8103506354632598
Correlation between Chance of admit and TOEFL Score is 0.7922276143050823
Correlation between Chance of admit and University Rating is 0.6901323687886892
Correlation between Chance of admit and SOP is 0.6841365241316723
Correlation between Chance of admit and LOR is 0.6453645135280112
Correlation between Chance of admit and CGPA is 0.882412574904574
Correlation between Chance of admit and Research is 0.5458710294711379
Correlation between Chance of admit and Chance of Admit is 1.0
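
The values above are Pearson correlation coefficients, the method that `df.stat.corr` uses by default. As a minimal sketch of what that number measures, the function below computes it in plain Python for CGPA against Chance of Admit, using only the first eight rows shown in the dataframe display. The helper name `pearson` is hypothetical, and because it sees only eight rows, its result will differ from the full-dataset value printed above.

```python
import math

def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# First eight rows of the dataframe display (CGPA and Chance of Admit).
cgpa = [9.65, 8.87, 8.0, 8.67, 8.21, 9.34, 8.2, 7.9]
chance = [0.92, 0.76, 0.72, 0.8, 0.65, 0.9, 0.75, 0.68]

r = pearson(cgpa, chance)
print(r)  # a value between -1 and 1; positive here
```

A column correlated with itself gives 1.0, which is why the last line of the output above reports exactly that for Chance of Admit.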


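We keep only the three columns most strongly correlated with the target as features for our regression task: CGPA (about 0.88), GRE Score (about 0.81), and TOEFL Score (about 0.79).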

In [13]:
# feature selection
from pyspark.ml.feature import VectorAssembler
assembler = VectorAssembler(inputCols=['GRE Score', 'TOEFL Score', 'CGPA'], outputCol='features')

In [14]:
#display dataframe
output_data = assembler.transform(df)
output_data.show()

+---------+-----------+-----------------+---+---+----+--------+---------------+------------------+
|GRE Score|TOEFL Score|University Rating|SOP|LOR|CGPA|Research|Chance of Admit|          features|
+---------+-----------+-----------------+---+---+----+--------+---------------+------------------+
|      337|        118|                4|4.5|4.5|9.65|       1|           0.92|[337.0,118.0,9.65]|
|      324|        107|                4|4.0|4.5|8.87|       1|           0.76|[324.0,107.0,8.87]|
|      316|        104|                3|3.0|3.5| 8.0|       1|           0.72| [316.0,104.0,8.0]|
|      322|        110|                3|3.5|2.5|8.67|       1|            0.8|[322.0,110.0,8.67]|
|      314|        103|                2|2.0|3.0|8.21|       0|           0.65|[314.0,103.0,8.21]|
|      330|        115|                5|4.5|3.0|9.34|       1|            0.9|[330.0,115.0,9.34]|
|      321|        109|                3|3.0|4.0| 8.2|       1|           0.75| [321.0,109.0,8.2]|
|      308
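
Conceptually, the assembler gathers the selected input columns of each row into a single numeric vector. The plain-Python sketch below mimics that for the first row of the table. It is an illustration only, not how Spark implements VectorAssembler, and the `assemble` helper is a hypothetical name.

```python
def assemble(row, input_cols):
    """Collect the chosen columns of one row into a list of floats."""
    return [float(row[name]) for name in input_cols]

# First row of the dataframe, written as a plain dictionary.
row = {
    'GRE Score': 337, 'TOEFL Score': 118, 'University Rating': 4,
    'SOP': 4.5, 'LOR': 4.5, 'CGPA': 9.65, 'Research': 1,
    'Chance of Admit': 0.92,
}

features = assemble(row, ['GRE Score', 'TOEFL Score', 'CGPA'])
print(features)  # [337.0, 118.0, 9.65]
```

The integer columns become floating-point values, which matches the features column shown above.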

### Model Building
We will use MLlib, Spark's scalable machine learning library, through its DataFrame-based pyspark.ml API to build the model.

Build the Linear Regression Model

In [15]:
#import LinearRegression and create final data
from pyspark.ml.regression import LinearRegression
final_data = output_data.select('features', 'Chance of Admit')

In [16]:
#print schema of final data
final_data.printSchema()

root
 |-- features: vector (nullable = true)
 |-- Chance of Admit: double (nullable = true)



In [17]:
#split the dataset into training and testing set
train_data, test_data = final_data.randomSplit([0.7, 0.3])
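
The split above is random, so the sizes are only approximately 70% and 30%, and every run produces different rows in each set unless a seed is fixed. The sketch below illustrates the idea in plain Python: each row is assigned independently by a uniform draw against the normalized weights. It is a conceptual stand-in, not Spark's actual implementation, and the `random_split` helper is a hypothetical name.

```python
import random

def random_split(rows, weights, seed=None):
    """Assign each row to one split using an independent uniform draw."""
    rng = random.Random(seed)
    total = sum(weights)
    # Cumulative upper bounds on [0, 1), e.g. [0.7, 1.0] for weights [0.7, 0.3].
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    splits = [[] for _ in weights]
    for row in rows:
        u = rng.random()
        for i, upper in enumerate(bounds):
            # The last split catches any floating-point remainder.
            if u < upper or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits

rows = list(range(500))
train, test = random_split(rows, [0.7, 0.3], seed=42)
print(len(train), len(test))  # roughly 350 and 150, not exactly
```

In PySpark, `randomSplit` also accepts an optional `seed` argument, for example `final_data.randomSplit([0.7, 0.3], seed=42)`, which helps make the results that follow repeatable between runs.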

In [18]:
#build & train the model
lr = LinearRegression(featuresCol='features', labelCol='Chance of Admit')
model = lr.fit(train_data)

In [19]:
#get coefficients & intercept
print('Coefficients: ', model.coefficients)
print('Intercept: ', model.intercept)

Coefficients:  [0.002100229229409879,0.003223113159368671,0.1473400727910022]
Intercept:  -1.553503271386935


In [20]:
#get summary of the model
summary = model.summary

In [21]:
#print the rmse & r2 score on the training data
print('RMSE: ', summary.rootMeanSquaredError)
print('R2: ', summary.r2)

RMSE:  0.06333440360082253
R2:  0.7944998843718222


### Model Evaluation
After building the model, we evaluate it on the held-out test set using the R2 score. Comparing this with the R2 on the training data indicates how well the model generalizes to rows it has not seen.


In [22]:
#transform on the test data
predictions = model.transform(test_data)

In [23]:
#display the predictions
predictions.show()

+------------------+---------------+-------------------+
|          features|Chance of Admit|         prediction|
+------------------+---------------+-------------------+
| [295.0,99.0,7.57]|           0.37| 0.5005169050943643|
|[295.0,101.0,7.86]|           0.69| 0.5496917525224922|
| [296.0,99.0,7.28]|           0.47| 0.4598885132143835|
| [297.0,96.0,7.43]|           0.34|0.47442041388433753|
| [297.0,99.0,7.81]|           0.54| 0.5400789810230247|
| [298.0,99.0,7.46]|           0.53| 0.4906101847755837|
|  [298.0,99.0,7.6]|           0.46|  0.511237794966324|
| [299.0,96.0,7.86]|           0.54| 0.5419771036432885|
|[299.0,100.0,7.42]|           0.42|  0.490039924252722|
|[299.0,100.0,7.88]|           0.51| 0.5578163577365831|
|[299.0,100.0,7.88]|           0.68| 0.5578163577365831|
| [299.0,106.0,8.4]|           0.64| 0.6537718745441163|
| [300.0,95.0,8.22]|           0.62| 0.5938966459180908|
| [300.0,99.0,8.01]|           0.58| 0.5758476832694546|
|[300.0,100.0,8.26]|           
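
Each value in the prediction column is the intercept plus the weighted sum of the three features. As a worked check, the sketch below applies the coefficients and intercept printed earlier to the first row of the predictions table. The `predict` helper is a hypothetical name used only for this illustration.

```python
# Coefficients and intercept as printed by the trained model above
# (order: GRE Score, TOEFL Score, CGPA).
coefficients = [0.002100229229409879, 0.003223113159368671, 0.1473400727910022]
intercept = -1.553503271386935

def predict(features):
    """Linear regression prediction: intercept + sum of coefficient * feature."""
    return intercept + sum(c * x for c, x in zip(coefficients, features))

first_row = [295.0, 99.0, 7.57]
prediction = predict(first_row)
print(prediction)  # approximately 0.5005, matching the first prediction shown
```

The result agrees with the first entry of the prediction column up to floating-point rounding.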

In [24]:
#evaluate the model
from pyspark.ml.evaluation import RegressionEvaluator
evaluator = RegressionEvaluator(labelCol='Chance of Admit', predictionCol='prediction', metricName='r2')
print('R2 Score on the test data: ', evaluator.evaluate(predictions))

R2 Score on the test data:  0.8275684330553728
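
For reference, R2 is one minus the ratio of the residual sum of squares to the total sum of squares, and RMSE is the square root of the mean squared error. The sketch below computes both in plain Python on the first five rows of the predictions table. The helper names are hypothetical, and since only five rows are used, the numbers will not match the full test-set score above.

```python
import math

def r2_score(labels, predictions):
    """R2 = 1 - SS_res / SS_tot."""
    mean_label = sum(labels) / len(labels)
    ss_res = sum((y - p) ** 2 for y, p in zip(labels, predictions))
    ss_tot = sum((y - mean_label) ** 2 for y in labels)
    return 1 - ss_res / ss_tot

def rmse(labels, predictions):
    """Root mean squared error."""
    mse = sum((y - p) ** 2 for y, p in zip(labels, predictions)) / len(labels)
    return math.sqrt(mse)

# First five rows of the predictions table (label, prediction).
labels = [0.37, 0.69, 0.47, 0.34, 0.54]
preds = [0.5005169050943643, 0.5496917525224922, 0.4598885132143835,
         0.47442041388433753, 0.5400789810230247]

print(r2_score(labels, preds))
print(rmse(labels, preds))
```

A perfect model would give an R2 of 1 and an RMSE of 0, while always predicting the mean label gives an R2 of 0.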
