
© 2022 Zaka AI, Inc. All Rights Reserved

# Graduate Admissions
In this workshop, we will try to predict the chance of admission (ranging from 0 to 1) for students applying to a Master's program. The prediction is based on several parameters, including:
  - GRE Scores (out of 340)
  - TOEFL Scores (out of 120)
  - University Rating (out of 5)
  - Statement of Purpose (out of 5)
  - Letter of Recommendation Strength (out of 5)
  - Undergraduate GPA (out of 10)
  - Research Experience (either 'yes' or 'no')

## 1. Import necessary Python modules (libraries)
We will need the following libraries:
 - NumPy: scientific computing, e.g., linear algebra with vectors and matrices.
 - Pandas: high-performance, easy-to-use data reading, manipulation, and analysis.
 - Matplotlib & seaborn: plotting and visualization.
 - scikit-learn: data mining and machine learning models.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## 2. Download the data
The dataset can be found on Kaggle. For simplicity, it has been uploaded to GitHub, and the link is provided in the cell below. Let's download our data from GitHub.

In [None]:
!git clone https://github.com/zaka-ai/graduates-admission-workshop.git

## 3. Read & visualize data
The data is now stored on disk in a CSV (Comma Separated Values) file. To load it into our code, we use the **pandas** module, more specifically its **read_csv** function.

In [None]:
dataset_path = ______

# read dataset from disk
data = pd.read_csv(___)  

# print first 5 rows
data.____  
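
As a rough illustration of how `read_csv` and `head` behave, here is a minimal sketch that reads a tiny in-memory CSV instead of the real file. The rows and the reduced set of columns are invented for the example; the actual path and contents come from the cloned repository.

```python
import io
import pandas as pd

# Hypothetical stand-in for the CSV file on disk (made-up rows, reduced columns).
csv_text = """GRE Score,CGPA,Research,Chance of Admit
337,9.65,yes,0.92
324,8.87,yes,0.76
316,8.00,no,0.72
"""

# pd.read_csv accepts a file path or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))

# head() returns the first 5 rows by default (here, all 3 rows).
print(sample.head())
```

With a real file, the only change would be passing the path string to `pd.read_csv`.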

## 4. Exploratory Data Analysis
Let's dig deeper and understand our data.

**Task x:** How many rows and columns are in our dataset?

In [None]:
n_rows = data.____
n_columns = data._____

print('There are {} rows and {} columns.'.format(n_rows,n_columns))
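
One common way to get both numbers is the `shape` attribute, which is a `(rows, columns)` tuple. The sketch below uses a small made-up DataFrame named `toy`; it is not necessarily the answer the blanks above expect.

```python
import pandas as pd

# Made-up data: 3 rows, 2 columns.
toy = pd.DataFrame({"GRE Score": [337, 324, 316], "CGPA": [9.65, 8.87, 8.00]})

# shape is a tuple (number of rows, number of columns).
n_rows, n_columns = toy.shape

print('There are {} rows and {} columns.'.format(n_rows, n_columns))
```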

Using the function **info()**, we can check:
 - data types (int, float, or object (e.g., string))
 - missing values
 - memory usage
 - number of rows and columns

In [None]:
data.info()

### Some statistics
Using the function **describe()**, we can check the mean, standard deviation, maximum, and minimum of each numeric feature (column).

In [None]:
data.describe()

### Data Visualization

#### GRE Score:
**Task x:** Plot a histogram showing the frequency of GRE scores.

In [None]:
data["GRE Score"]._______
plt.title("GRE Scores")
plt.xlabel("GRE Score")
plt.ylabel("Frequency")
plt.show()
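
For reference, one way to draw such a histogram is the pandas `Series.plot(kind="hist")` method. The sketch below runs on randomly generated scores (an assumption made only so the example is self-contained), and the choice of 10 bins is arbitrary.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic GRE-like scores between 290 and 340 (not the workshop data).
rng = np.random.default_rng(0)
scores = pd.Series(rng.integers(290, 341, size=200), name="GRE Score")

# kind="hist" bins the values and draws one bar per bin.
ax = scores.plot(kind="hist", bins=10, figsize=(8, 5))
ax.set_title("GRE Scores")
ax.set_xlabel("GRE Score")
ax.set_ylabel("Frequency")
plt.show()
```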

Most scores are concentrated between 310 and 330. Scoring above this range would help a candidate stand out.

#### CGPA Scores vs. University Ratings:
**Task x:** What is the relation between the rating of the university and CGPA? Let's plot a **scatter** plot to find out.

In [None]:
plt.______
plt.title("CGPA Scores for University Ratings")
plt.xlabel("University Rating")
plt.ylabel("CGPA")
plt.show()
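
A scatter plot of two columns can be sketched with `plt.scatter`, which takes the x values first and the y values second. The numbers below are invented for illustration and only mimic the column names used in this section.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up ratings and CGPA values (not the workshop data).
toy = pd.DataFrame({
    "University Rating": [1, 2, 3, 4, 5, 3, 4],
    "CGPA": [7.4, 8.0, 8.4, 8.9, 9.5, 8.2, 9.1],
})

# First argument goes on the x-axis, second on the y-axis.
points = plt.scatter(toy["University Rating"], toy["CGPA"])
plt.title("CGPA Scores for University Ratings")
plt.xlabel("University Rating")
plt.ylabel("CGPA")
plt.show()
```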

**Conclusion >>** As the university rating increases, the CGPA tends to increase.

#### Correlation between GRE and CGPA Scores:
**Task x:** Let's plot a **scatter** plot to find out how GRE and CGPA scores are correlated.

In [None]:
data._______
plt.title("CGPA for GRE Scores")
plt.xlabel("GRE Score")
plt.ylabel("CGPA")
plt.show()

**Conclusion >>** Candidates with high GRE scores usually have a high CGPA score.
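
A visual impression like this can be backed by a number. As a small sketch on invented values, `Series.corr` computes the Pearson correlation coefficient between two columns; values close to 1 indicate a strong positive linear relationship.

```python
import pandas as pd

# Made-up scores in which CGPA rises with GRE (not the workshop data).
toy = pd.DataFrame({
    "GRE Score": [300, 310, 320, 330, 340],
    "CGPA": [7.5, 8.1, 8.4, 9.0, 9.6],
})

# Pearson correlation is the default method of Series.corr.
r = toy["GRE Score"].corr(toy["CGPA"])
print(round(r, 3))
```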

#### Does University Rating increase the chance of admission?

In [None]:
condition = data["Chance of Admit"] >= 0.75
s = data[condition]["University Rating"].value_counts()
plt.title("University Ratings of Candidates with at least a 75% acceptance chance")
s.plot(kind='bar',figsize=(10, 6))
plt.xlabel("University Rating")
plt.ylabel("Candidates")
plt.show()

**Conclusion >>** Candidates who graduated from highly rated universities are more likely to be accepted.

#### Correlation between CGPA, GRE, and SOP

In [None]:
plt.figure(figsize=(12,6))
plt.scatter(x = data['GRE Score'],
            y = data['CGPA'],
            s = data['SOP']*55,
            c = data['SOP'],
            alpha=0.4,
            edgecolors='w')

plt.xlabel('GRE Score')
plt.ylabel('CGPA')
plt.title('Correlation between CGPA, GRE, and SOP')
plt.colorbar()
plt.show()

**Conclusion >>** Candidates with high GRE and CGPA scores tend to write better Statements of Purpose!

#### Correlation between all columns
**Task x:** Let's find out which parameters affect admissions the most by finding correlations.

In [None]:
fig,ax = plt.subplots(figsize=(10, 10))
# annot=True prints the correlation values inside the heatmap
# fmt='.2f' rounds them to two decimal places
sns.heatmap(data.corr(), annot=True, fmt='.2f', ax=ax)
plt.show()

The 3 most important features for admission to the Master's program: CGPA, GRE Score, and TOEFL Score

The 3 least important features for admission to the Master's program: Research, LOR, and SOP
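
Instead of reading the ranking off the heatmap, the correlations with the target column can also be sorted directly. This is a minimal sketch on invented numeric data with only three features; the values and the resulting order are illustrative, not the workshop's results.

```python
import pandas as pd

# Made-up numeric data (not the workshop dataset).
toy = pd.DataFrame({
    "GRE Score": [300, 310, 320, 330, 340],
    "SOP": [3.0, 2.5, 4.0, 3.5, 4.5],
    "CGPA": [7.5, 8.1, 8.4, 9.0, 9.6],
    "Chance of Admit": [0.45, 0.60, 0.68, 0.82, 0.95],
})

# Take the target's column of the correlation matrix, drop its
# self-correlation, and sort from strongest to weakest.
ranking = (
    toy.corr()["Chance of Admit"]
    .drop("Chance of Admit")
    .sort_values(ascending=False)
)
print(ranking)
```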

## 5. Preprocessing
"Garbage in, garbage out". Data should be preprocessed and cleaned to get rid of noisy data. Preprocessing includes:
 - dealing with missing data
   - remove whole rows (if they are not a lot)
   - infer (e.g., date of birth & age)
   - fill with mean, median, or even 0
 - convert categorical (non-numerical) data into numerical
 - normalization: standardize data ranges for all features (e.g., between 0 and 1)
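
Two of the missing-data options listed above, removing rows and filling with the mean, can be sketched as follows. The small DataFrame is made up for the example.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing value per column.
toy = pd.DataFrame({
    "GRE Score": [320.0, np.nan, 310.0],
    "CGPA": [8.5, 9.0, np.nan],
})

# Option 1: remove every row that contains at least one missing value.
dropped = toy.dropna()

# Option 2: fill each missing value with the mean of its own column.
filled = toy.fillna(toy.mean())

print(dropped)
print(filled)
```

Dropping is simplest but loses data (here only one of three rows survives), which is why it is usually reserved for cases where few rows are affected.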

**Task x:** drop rows with missing values

In [None]:
# print how many missing values are in each column (feature)
data.isnull().sum()

In [None]:
# drop rows with missing values
data = data.dropna()
data.isnull().sum()

In [None]:
data.info()

**Task x:** Convert the Research column to numerical values: 1 if yes, 0 if no

In [None]:
research = {'no':0, 'yes':1}
data['Research'] = _________
data.head()
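
One common way to apply a dictionary like `research` to a column is `Series.map`. The sketch below shows it on a made-up column; it is one possible approach, not necessarily the intended answer for the blank above.

```python
import pandas as pd

# Made-up column of 'yes'/'no' strings.
toy = pd.DataFrame({"Research": ["yes", "no", "yes"]})

research = {'no': 0, 'yes': 1}

# map() looks each value up in the dictionary;
# values that are not keys of the dictionary become NaN.
toy["Research"] = toy["Research"].map(research)
print(toy)
```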

**Task x:** Normalize each column by dividing it by its maximum

In [None]:
# get the max of each column
df_max = data.___
df_max

In [None]:
# divide each column by its maximum value
data = data.____
data.describe()
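
Max normalization can be sketched in two steps: `DataFrame.max()` returns one maximum per column, and dividing the DataFrame by that Series aligns on column names, so each column is divided by its own maximum. The data below is invented for the example.

```python
import pandas as pd

# Made-up numeric data (not the workshop dataset).
toy = pd.DataFrame({
    "GRE Score": [300, 320, 340],
    "CGPA": [7.0, 8.5, 10.0],
})

# One maximum per column, returned as a Series indexed by column name.
col_max = toy.max()

# Division aligns on column names; toy.div(col_max) is equivalent.
scaled = toy / col_max
print(scaled)
```

After this step every column has a maximum of exactly 1, but the minimum is not necessarily 0.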

**Task x:** Split the data into training (80%) & testing (20%) sets

In [None]:
from sklearn.model_selection import train_test_split

# store all columns (except the first & last one) as inputs in X
X = data.iloc[:,1:-1].values

# store the last column as the output (label) in y
y = data.iloc[:,-1].values  

x_train, x_test, y_train, y_test = _________

print(x_train.shape, y_train.shape)
print(x_test.shape, y_test.shape)
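
An 80/20 split with `train_test_split` can be sketched as below on dummy arrays. The `random_state=42` argument is an assumption added here only to make the shuffle reproducible; the workshop may or may not fix a seed.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Dummy inputs: 10 samples with 5 features each, and 10 labels.
X_toy = np.arange(50).reshape(10, 5)
y_toy = np.arange(10)

# test_size=0.2 keeps 20% of the samples for testing.
xt_train, xt_test, yt_train, yt_test = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

print(xt_train.shape, yt_train.shape)
print(xt_test.shape, yt_test.shape)
```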

## 6. Training & Evaluation
For evaluation, we will use R^2 (the coefficient of determination):
 - the higher the R^2, the better.
 - the best R^2 is 1.
 - R^2 can be negative.
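
To see where these properties come from, recall that R^2 = 1 - SS_res / SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the sum of squared deviations of the true values from their mean. The hand-written helper `r_squared` below is a sketch of that formula for illustration (in the workshop we use scikit-learn's `r2_score`), and the numbers are made up.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot (illustrative helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # squared prediction errors
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # spread around the mean
    return 1.0 - ss_res / ss_tot

y_true = [0.5, 0.6, 0.7, 0.8]

perfect = r_squared(y_true, y_true)              # predictions match exactly
mean_only = r_squared(y_true, [0.65] * 4)        # always predict the mean
bad = r_squared(y_true, [0.9, 0.2, 0.9, 0.2])    # worse than predicting the mean

print(perfect, mean_only, bad)
```

A perfect model gives 1, always predicting the mean gives approximately 0, and anything worse than the mean gives a negative value.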

#### Train on training data only
**Task x:** Train the model using the training data

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# define our regression model
model = ______

# train our model
model.______  
print('Model trained!')

r2_train = r2_score(y_train, model.predict(x_train))
print('R^2 on training data is', r2_train)
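
The usual scikit-learn pattern is to create the estimator and then call `fit` with the inputs and labels. The sketch below does this on synthetic data generated from known weights (`true_w` is an assumption of the example), so we can check that the learned coefficients land close to them.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic data: y is a linear function of 3 features plus small noise.
rng = np.random.default_rng(0)
X_toy = rng.random((100, 3))
true_w = np.array([0.5, 0.3, 0.1])
y_toy = X_toy @ true_w + 0.05 + rng.normal(0, 0.01, size=100)

# Create the model, then fit it to the inputs and labels.
toy_model = LinearRegression()
toy_model.fit(X_toy, y_toy)

toy_r2 = r2_score(y_toy, toy_model.predict(X_toy))
print('Learned weights:', toy_model.coef_)
print('R^2 on the synthetic data is', toy_r2)
```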

#### Evaluate model performance on unseen (test) data

In [None]:
r2_test = r2_score(y_test, model.predict(x_test))
print('R^2 on test data is', r2_test)

R^2 on the training data is close to R^2 on the test (unseen) data >> the model is generalizing & performing well :)

#### Feature importance (weights)

In [None]:
columns_names = data.columns[1:-1].values
features_importance = model.coef_
plt.barh(columns_names, features_importance)
plt.title('Feature Importance')
plt.xlabel('importance')
plt.ylabel('feature')
plt.show()

This is consistent with our previous conclusion, which states that the 3 most important features for admission to the Master's program are **CGPA**, **GRE Score**, and **TOEFL Score**.