<a href="https://colab.research.google.com/github/aka-gera/Regression/blob/main/jamboree_linear_regression_dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **PREDICTING IVY LEAGUE ADMISSION**

---

We will conduct regression analysis on this dataset to predict the likelihood of Ivy League university admission based on the provided features.

---

The dataset is sourced from: [Jamboree Linear Regression Dataset](https://www.kaggle.com/datasets/ranitsarkar01/jamboree-linear-regression-dataset)

---

The algorithms reach an R² score of up to 57%, meaning that approximately 57% of the variance in the admission likelihood is explained by the features retained in the analysis.
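
As a reminder of what this score measures, R² compares the model's squared error with that of a constant predictor that always outputs the mean of the target. The sketch below computes it with NumPy; the numbers are illustrative only and are not taken from the dataset.

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # variation left unexplained
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation of the target
    return 1.0 - ss_res / ss_tot

# Illustrative values only (not from the Jamboree dataset)
y_true = np.array([0.92, 0.76, 0.72, 0.80, 0.65])
y_pred = np.array([0.88, 0.78, 0.70, 0.77, 0.70])
score = r2(y_true, y_pred)
print(f"R2 = {score:.3f}")
```

A perfect model scores 1, and a model that always predicts the mean scores 0.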

---

The most influential features, according to the best-performing machine learning model, are:

1. Whether the applicant has research experience
2. Letter of Recommendation (LOR) score provided by the applicant's recommenders

---


## Dataset Description


Here is a brief description of the dataset.



| Column             | Description                                                                                     |
|--------------------|-------------------------------------------------------------------------------------------------|
| Serial No.         | Unique identifier for each entry in the dataset.                                                |
| GRE Score          | The GRE (Graduate Record Examinations) score of the applicant.                                   |
| TOEFL Score        | The TOEFL (Test of English as a Foreign Language) score of the applicant.                         |
| University Rating  | Rating of the university where the applicant completed their undergraduate education (on a scale from 1 to 5). |
| SOP                | Statement of Purpose (SOP) score provided by the applicant (on a scale from 1 to 5).             |
| LOR                | Letter of Recommendation (LOR) score provided by the applicant's recommenders (on a scale from 1 to 5). |
| CGPA               | Cumulative Grade Point Average (CGPA) of the applicant during their undergraduate studies.       |
| Research           | Indicates whether the applicant has research experience (1 for yes, 0 for no).                   |
| Chance of Admit    | Probability of admission for the applicant, as predicted by the model or determined by other means.|



# Preset Parameters

In [47]:
data_dir = 'ranitsarkar01/jamboree-linear-regression-dataset'  # Kaggle dataset identifier (owner/dataset)

view_hist_feat = [5,7,0, 1,-1]  # Features selected for histogram visualization

target_switcher = -1  # Index of the column to use as target; it is swapped into the last position (-1 keeps the current last column)

feat = [0]  # Features to drop

data_nan_drop = 'mode'  # Fill NaN values:
                   #   Choose 'mode' to fill NaN values with the mode of the feature
                   #   Choose 'mean' to fill NaN values with the mean of the feature
                   #   Choose 'drop' to drop rows containing NaN values

balanced_dataset = False  # Whether to balance the dataset or not

confidence_interval_limit = [-3, 3]  # Define the limits of the confidence interval [-m, m] and eliminate the outliers

correlation_percentage_threshold = 0.7  # Set the correlation threshold between features for removal

pre_proc = 'X'  # Data preprocessing:
                #   Choose 'XY' to standardize both 'X' and 'Y',
                #   Choose 'X' to standardize only 'X',
                #   Choose 'Y' to standardize only 'Y',

target_values_label = True  # True if target values are float or integers

####### Neural Network Parameters #######
activation = 'relu'
epoch = 10
num_nodes = [2, 4]
dropout_prob = [0.05, 0.1]
lr = [0.01, 0.1]
batch_size = [2, 4]


# Import Dataset

In [2]:
! pip install kaggle



In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
!pwd
%cd /content

/content
/content


In [5]:
! mkdir -p ~/.kaggle

In [6]:
! cp /content/drive/MyDrive/Kaggle_API/kaggle.json ~/.kaggle

In [7]:
! chmod 600 ~/.kaggle/kaggle.json

In [8]:
! kaggle datasets download {data_dir}

Dataset URL: https://www.kaggle.com/datasets/ranitsarkar01/jamboree-linear-regression-dataset
License(s): Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)
Downloading jamboree-linear-regression-dataset.zip to /content
  0% 0.00/5.34k [00:00<?, ?B/s]
100% 5.34k/5.34k [00:00<00:00, 13.7MB/s]


In [9]:
import os
file_names = os.listdir()
zip_file =   [file for file in file_names if file.endswith('.zip')]
zip_file

['jamboree-linear-regression-dataset.zip']

In [10]:
import zipfile

# Open the zip file
with zipfile.ZipFile(zip_file[-1], 'r') as zip_ref:
    zip_ref.extractall()
    unzipped_file_names = zip_ref.namelist()
unzipped_file_names

['jamboree_dataset.csv']

# Import the helper classes

In [11]:
!pwd
%cd /content/drive/MyDrive/ML2023/data-analysis

/content
/content/drive/MyDrive/ML2023/data-analysis


In [12]:
!pip install aka-mlearning==0.0.1
from aka_MLearning import aka_ML_analysis,aka_regression

Collecting aka-mlearning==0.0.1
  Downloading aka_mlearning-0.0.1.tar.gz (4.8 kB)
  Preparing metadata (setup.py) ... done
Collecting catboost (from aka-mlearning==0.0.1)
  Downloading catboost-1.2.5-cp310-cp310-manylinux2014_x86_64.whl (98.2 MB)
Building wheels for collected packages: aka-mlearning
  Building wheel for aka-mlearning (setup.py) ... done
  Created wheel for aka-mlearning: filename=aka_mlearning-0.0.1-py3-none-any.whl size=5222 sha256=5c5d2753dc75fe076b61e10a7901d4ae3a15d97a3102e50339f9f3157597f246
  Stored in directory: /root/.cache/pip/wheels/a9/32/37/dc5b42ab80d79613dd21357f887d4b9b1d5c93a64ccb4372ab
Successfully built aka-mlearning
Installing collected packages: catboost, aka-mlearning
Successfully installed aka-mlearning-0.0.1 catboost-1.2.5


In [13]:
!pip install aka-data-prep==0.1.2
from aka_data_prep import aka_encoding,aka_df_prepare,aka_plot_prep,aka_cleaned_data,aka_plot_shap,aka_plot_ML
aka_plot_ = aka_plot_prep()

Collecting aka-data-prep==0.1.2
  Downloading aka-data-prep-0.1.2.tar.gz (7.7 kB)
  Preparing metadata (setup.py) ... done
Collecting shap (from aka-data-prep==0.1.2)
  Downloading shap-0.45.1-cp310-cp310-manylinux_2_12_x86_64.manylinux2010_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (540 kB)
Collecting slicer==0.0.8 (from shap->aka-data-prep==0.1.2)
  Downloading slicer-0.0.8-py3-none-any.whl (15 kB)
Building wheels for collected packages: aka-data-prep
  Building wheel for aka-data-prep (setup.py) ... done
  Created wheel for aka-data-prep: filename=aka_data_prep-0.1.2-py3-none-any.whl size=7988 sha256=c9538d0dbf026187a85d73311298ea5eca3fe262a017633f4ea3b113ffc78301
  Stored in directory: /root/.cache/pip/wheels/f2/de/d6/05cbd71695a5fc82a0e740c266bf8e77f99f7219d78f1a954d
Successfully built aka-data-prep
Installing collected packag

In [14]:
from aka_data_analysis.aka_nn import aka_nn

In [15]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


# Dataset Information

In [45]:
df = aka_df_prepare().df_get(f'/content/{unzipped_file_names[0]}')
df.head()

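|   | Serial No. | GRE Score | TOEFL Score | University Rating | SOP | LOR | CGPA | Research | Chance of Admit |
|---|------------|-----------|-------------|-------------------|-----|-----|------|----------|-----------------|
| 0 | 1 | 337 | 118 | 4 | 4.5 | 4.5 | 9.65 | 1 | 0.92 |
| 1 | 2 | 324 | 107 | 4 | 4.0 | 4.5 | 8.87 | 1 | 0.76 |
| 2 | 3 | 316 | 104 | 3 | 3.0 | 3.5 | 8.0 | 1 | 0.72 |
| 3 | 4 | 322 | 110 | 3 | 3.5 | 2.5 | 8.67 | 1 | 0.8 |
| 4 | 5 | 314 | 103 | 2 | 2.0 | 3.0 | 8.21 | 0 | 0.65 |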


In [46]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Serial No.         500 non-null    int64  
 1   GRE Score          500 non-null    int64  
 2   TOEFL Score        500 non-null    int64  
 3   University Rating  500 non-null    int64  
 4   SOP                500 non-null    float64
 5   LOR                500 non-null    float64
 6   CGPA               500 non-null    float64
 7   Research           500 non-null    int64  
 8   Chance of Admit    500 non-null    float64
dtypes: float64(4), int64(5)
memory usage: 35.3 KB


In [18]:
df.describe()

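|       | Serial No. | GRE Score | TOEFL Score | University Rating | SOP | LOR | CGPA | Research | Chance of Admit |
|-------|------------|-----------|-------------|-------------------|-----|-----|------|----------|-----------------|
| count | 500.0 | 500.0 | 500.0 | 500.0 | 500.0 | 500.0 | 500.0 | 500.0 | 500.0 |
| mean  | 250.5 | 316.472 | 107.192 | 3.114 | 3.374 | 3.484 | 8.57644 | 0.56 | 0.72174 |
| std   | 144.481833 | 11.295148 | 6.081868 | 1.143512 | 0.991004 | 0.92545 | 0.604813 | 0.496884 | 0.14114 |
| min   | 1.0 | 290.0 | 92.0 | 1.0 | 1.0 | 1.0 | 6.8 | 0.0 | 0.34 |
| 25%   | 125.75 | 308.0 | 103.0 | 2.0 | 2.5 | 3.0 | 8.1275 | 0.0 | 0.63 |
| 50%   | 250.5 | 317.0 | 107.0 | 3.0 | 3.5 | 3.5 | 8.56 | 1.0 | 0.72 |
| 75%   | 375.25 | 325.0 | 112.0 | 4.0 | 4.0 | 4.0 | 9.04 | 1.0 | 0.82 |
| max   | 500.0 | 340.0 | 120.0 | 5.0 | 5.0 | 5.0 | 9.92 | 1.0 | 0.97 |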


In [48]:
# view_hist_feat = [0,1,2,-2,-1]
fig = aka_plot_.Plot_histogram_Features(df,view_hist_feat )
if fig is not None:
    fig.show()

In [49]:
aka_plot_.plot_box(df,view_hist_feat)


# Null Values

In [22]:
df_null = df[df.columns[df.isnull().sum()>0]].isnull().astype(float)
aka_plot_.plot_heatmap(df_null)

Empty list is provided


<Figure size 300x200 with 0 Axes>

<Figure size 300x200 with 0 Axes>

In [25]:
aka_df_prepare().missing_data_processing(df, data_nan_drop=data_nan_drop)
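
The three `data_nan_drop` options map naturally onto standard pandas operations. The following is a minimal sketch of what each option might do; `fill_missing` and the toy frame are hypothetical and are not the `aka_df_prepare` implementation.

```python
import numpy as np
import pandas as pd

def fill_missing(df, strategy="mode"):
    """Hypothetical helper mirroring the 'mode' / 'mean' / 'drop' options."""
    if strategy == "drop":
        return df.dropna()                            # discard rows with any NaN
    if strategy == "mean":
        return df.fillna(df.mean(numeric_only=True))  # per-column mean
    if strategy == "mode":
        # mode() can return several rows on ties; take the first one
        return df.fillna(df.mode().iloc[0])
    raise ValueError(f"unknown strategy: {strategy!r}")

# Toy data with one missing value per column (illustrative only)
toy = pd.DataFrame({"LOR": [3.0, np.nan, 3.0, 4.5],
                    "Research": [1, 0, np.nan, 1]})
filled_mode = fill_missing(toy, "mode")
filled_mean = fill_missing(toy, "mean")
dropped = fill_missing(toy, "drop")
```

Since this dataset has no missing values, any of the three options leaves it unchanged.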

# Clean Dataset

## Drop Duplicate data

In [26]:
df = df.drop_duplicates()  # drop_duplicates returns a new DataFrame, so keep the result
df

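|     | Serial No. | GRE Score | TOEFL Score | University Rating | SOP | LOR | CGPA | Research | Chance of Admit |
|-----|------------|-----------|-------------|-------------------|-----|-----|------|----------|-----------------|
| 0   | 1 | 337 | 118 | 4 | 4.5 | 4.5 | 9.65 | 1 | 0.92 |
| 1   | 2 | 324 | 107 | 4 | 4.0 | 4.5 | 8.87 | 1 | 0.76 |
| 2   | 3 | 316 | 104 | 3 | 3.0 | 3.5 | 8.00 | 1 | 0.72 |
| 3   | 4 | 322 | 110 | 3 | 3.5 | 2.5 | 8.67 | 1 | 0.80 |
| 4   | 5 | 314 | 103 | 2 | 2.0 | 3.0 | 8.21 | 0 | 0.65 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 495 | 496 | 332 | 108 | 5 | 4.5 | 4.0 | 9.02 | 1 | 0.87 |
| 496 | 497 | 337 | 117 | 5 | 5.0 | 5.0 | 9.87 | 1 | 0.96 |
| 497 | 498 | 330 | 120 | 5 | 4.5 | 5.0 | 9.56 | 1 | 0.93 |
| 498 | 499 | 312 | 103 | 4 | 4.0 | 5.0 | 8.43 | 0 | 0.73 |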


## Swap the target and the last feature

In [27]:
# target_switcher = -1
df = aka_df_prepare().swap_features(df,target_switcher)
df.head()

Invalid feature indices or feat_a is equal to feat_b.


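|   | Serial No. | GRE Score | TOEFL Score | University Rating | SOP | LOR | CGPA | Research | Chance of Admit |
|---|------------|-----------|-------------|-------------------|-----|-----|------|----------|-----------------|
| 0 | 1 | 337 | 118 | 4 | 4.5 | 4.5 | 9.65 | 1 | 0.92 |
| 1 | 2 | 324 | 107 | 4 | 4.0 | 4.5 | 8.87 | 1 | 0.76 |
| 2 | 3 | 316 | 104 | 3 | 3.0 | 3.5 | 8.0 | 1 | 0.72 |
| 3 | 4 | 322 | 110 | 3 | 3.5 | 2.5 | 8.67 | 1 | 0.8 |
| 4 | 5 | 314 | 103 | 2 | 2.0 | 3.0 | 8.21 | 0 | 0.65 |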
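
The message printed above is expected here: with `target_switcher = -1` the requested target is already the last column, so there is nothing to swap. A minimal sketch of such a swap in plain pandas follows; `swap_columns` is a hypothetical helper, not the library's `swap_features`.

```python
import pandas as pd

def swap_columns(df, idx_a, idx_b=-1):
    """Hypothetical helper: return a copy of df with two column positions swapped."""
    n = df.shape[1]
    a, b = idx_a % n, idx_b % n      # normalise negative indices
    if a == b:
        return df.copy()             # same position: nothing to swap
    cols = list(df.columns)
    cols[a], cols[b] = cols[b], cols[a]
    return df[cols]

toy = pd.DataFrame({"GRE Score": [337, 324],
                    "Research": [1, 1],
                    "Chance of Admit": [0.92, 0.76]})
same = swap_columns(toy, -1)   # target already last: column order unchanged
moved = swap_columns(toy, 0)   # first column trades places with the last one
```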


## Drop feature(s)


In [28]:
# feat =  []
df = aka_df_prepare().drop_feature(df,feat)

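|   | GRE Score | TOEFL Score | University Rating | SOP | LOR | CGPA | Research | Chance of Admit |
|---|-----------|-------------|-------------------|-----|-----|------|----------|-----------------|
| 0 | 337 | 118 | 4 | 4.5 | 4.5 | 9.65 | 1 | 0.92 |
| 1 | 324 | 107 | 4 | 4.0 | 4.5 | 8.87 | 1 | 0.76 |
| 2 | 316 | 104 | 3 | 3.0 | 3.5 | 8.0 | 1 | 0.72 |
| 3 | 322 | 110 | 3 | 3.5 | 2.5 | 8.67 | 1 | 0.8 |
| 4 | 314 | 103 | 2 | 2.0 | 3.0 | 8.21 | 0 | 0.65 |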


## Transforming Categorical Variables into Numerical Representations Using Encoding

In [29]:
# data_nan = 'drop'                    # Choose 'drop' to drop rows containing NaN values
df_encod = aka_encoding(df)
df = df_encod.label_encoding()
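
Every column in this dataset is already numeric, so label encoding should leave it unchanged. For reference, the sketch below shows one common way to label-encode text columns with pandas; `label_encode` and the toy frame are hypothetical and may differ from what `aka_encoding` does internally.

```python
import pandas as pd

def label_encode(df):
    """Hypothetical helper: map each non-numeric column to integer codes."""
    out = df.copy()
    for col in out.columns:
        if not pd.api.types.is_numeric_dtype(out[col]):
            # Categories are sorted, so codes follow alphabetical order
            out[col] = out[col].astype("category").cat.codes
    return out

# Toy frame with one text column (illustrative only)
toy = pd.DataFrame({"Research": ["yes", "no", "yes"],
                    "CGPA": [9.65, 8.87, 8.00]})
encoded = label_encode(toy)
```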

## Correlation Matrix

In [30]:
aka_plot_.Plot_Correlation_Matrix(df)

In [31]:

# confidence_interval_limit = [-3, 3]        # Limits [-m, m] of the confidence interval; rows outside are treated as outliers and removed

# correlation_percentage_threshold = 0.7     # Correlation threshold above which features are removed

df_filtered,corr_tmp = aka_cleaned_data().filter_drop_corr_df(df,confidence_interval_limit,correlation_percentage_threshold)

print(f'We dropped {df.shape[0]-df_filtered.shape[0]} outliers and removed {df.shape[1]-df_filtered.shape[1]} feature(s)')
print(f'The filtered dataset\'s shape is {df_filtered.shape} ')

We dropped 0 outliers and removed 5 feature(s)
The filtered dataset's shape is (500, 3) 
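
The internals of `filter_drop_corr_df` are not shown in this notebook, so the following is only one plausible reading of the two parameters: rows are kept when every feature z-score lies inside the given limits, and a feature is dropped when its absolute correlation with another feature exceeds the threshold. The sketch drops every feature involved in such a pair, which is an assumption; a common alternative keeps one column from each correlated pair. The function name and the synthetic data are hypothetical.

```python
import numpy as np
import pandas as pd

def filter_rows_and_features(df, z_limits=(-3, 3), corr_threshold=0.7):
    """Hypothetical sketch: z-score row filter, then correlation-based feature drop.

    The last column is treated as the target and is always kept.
    """
    features = df.columns[:-1]

    # 1) Keep rows whose feature z-scores all lie inside [z_limits[0], z_limits[1]]
    z = (df[features] - df[features].mean()) / df[features].std(ddof=0)
    inside = ((z >= z_limits[0]) & (z <= z_limits[1])).all(axis=1)
    kept_rows = df[inside]

    # 2) Drop every feature that is highly correlated with another feature
    corr = kept_rows[features].corr().abs().to_numpy(copy=True)
    np.fill_diagonal(corr, 0.0)                  # ignore self-correlation
    too_correlated = features[(corr > corr_threshold).any(axis=0)]
    return kept_rows.drop(columns=too_correlated), list(too_correlated)

# Synthetic data: TOEFL is built to track GRE closely, Research is independent
rng = np.random.default_rng(0)
gre = rng.normal(316, 11, size=200)
toy = pd.DataFrame({
    "GRE Score": gre,
    "TOEFL Score": 0.5 * gre + rng.normal(0, 1, size=200),
    "Research": rng.integers(0, 2, size=200),
    "Chance of Admit": rng.uniform(0.3, 1.0, size=200),
})
filtered, dropped = filter_rows_and_features(toy)
print(dropped, filtered.shape)
```

Under this rule, strongly inter-correlated predictors are removed together, which can discard a lot of predictive signal.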


## Graph the features that are highly correlated


In [32]:
aka_plot_.Plot_scatter(df,list(corr_tmp)).show()

### Visualize the distribution of the filtered dataset

In [33]:
aka_plot_.Plot_box_2_Features(df,df_filtered,corr_tmp=range(df_filtered.shape[1]))

# Search for the most effective ML algorithm to learn the dataset

In [34]:
# pre_proc = 'X'                                # Choose between 'XY' to standardize both 'X' and 'Y',
#                                               #                'X' to standardize only 'X',
#                                               #                'Y' to standardize only 'Y',

X_train, X_test, y_train, y_test = aka_cleaned_data().train_test_cleaned_data(df_filtered,pre_proc)
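
With `pre_proc = 'X'` only the features are standardized. A minimal NumPy sketch of a split followed by standardization is given below. The 80/20 ratio, the seed, and the function name are assumptions, since the behavior of `train_test_cleaned_data` is not shown here.

```python
import numpy as np

def split_and_standardize(X, y, test_size=0.2, seed=0):
    """Hypothetical sketch: shuffle, split, then standardize X with train statistics."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_test = int(round(test_size * len(X)))
    test_idx, train_idx = order[:n_test], order[n_test:]

    X_train, X_test = X[train_idx], X[test_idx]
    mean, std = X_train.mean(axis=0), X_train.std(axis=0)
    std[std == 0] = 1.0                      # guard against constant columns
    # The test set reuses the training mean/std to avoid information leakage
    return (X_train - mean) / std, (X_test - mean) / std, y[train_idx], y[test_idx]

# Synthetic stand-ins for an LOR-like and a Research-like feature
X = np.column_stack([np.linspace(1, 5, 50), np.tile([0.0, 1.0], 25)])
y = 0.3 + 0.1 * X[:, 0] + 0.05 * X[:, 1]
X_train, X_test, y_train, y_test = split_and_standardize(X, y)
```

The target is left on its original scale, so predictions can be read directly as chances of admission.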

In [35]:
clf, df_metric_algorithms, clf_algorithms = aka_regression().train_and_find_best_regressor(X_train, y_train, X_test, y_test)
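
Because `clf` is later queried for `best_estimator_`, it is probably a scikit-learn search object such as `GridSearchCV`; that is an inference, not something the notebook states. The sketch below shows how such a search over several regressors could look. The candidate models, their grids, and the function name are all assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

def find_best_regressor(X_train, y_train, X_test, y_test):
    """Hypothetical sketch: grid-search a few regressors, keep the best test R2."""
    candidates = {
        "LinearRegression": (LinearRegression(), {"fit_intercept": [True, False]}),
        "Ridge": (Ridge(), {"alpha": [0.1, 1.0, 10.0]}),
        "DecisionTree": (DecisionTreeRegressor(random_state=0), {"max_depth": [2, 4]}),
    }
    scores, searches = {}, {}
    for name, (estimator, grid) in candidates.items():
        search = GridSearchCV(estimator, grid, scoring="r2", cv=5)
        search.fit(X_train, y_train)                   # cross-validated tuning
        searches[name] = search
        scores[name] = search.score(X_test, y_test)    # R2 on held-out data
    best = max(scores, key=scores.get)
    return searches[best], scores

# Synthetic linear data (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 0.7 + 0.1 * X[:, 0] + 0.05 * X[:, 1] + rng.normal(0, 0.02, size=200)
clf, scores = find_best_regressor(X[:160], y[:160], X[160:], y[160:])
print(scores)
```

Ranking models on the test split, as done here, is convenient for a comparison table but makes the reported score optimistic; a separate validation split would give a cleaner estimate.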

In [36]:
fig = aka_plot_.plot_heatmap(df_metric_algorithms)
fig.update_layout(
    xaxis_title='ML algorithm',
    yaxis_title='Metric',
    title='Metric Report',
    font=dict(size=20)
)

In [38]:
clf

In [37]:
y_pred = clf.predict(X_test)

In [39]:
params = np.append(clf.best_estimator_.intercept_, clf.best_estimator_.coef_)
y_pred = clf.best_estimator_.predict(X_train)
feat_name = df_filtered.columns[:-1]

aka_plot_ML().plot_regression_summary(X_train, y_train, y_pred, params, feat_name)
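
The `params` vector above stacks the intercept in front of the coefficients, so a prediction is `params[0] + X @ params[1:]`. The sketch below reproduces that layout with an ordinary least-squares fit in NumPy on synthetic data; the numbers are illustrative and are not the fitted values of this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                    # two standardized features
y = 0.72 + 0.06 * X[:, 0] + 0.04 * X[:, 1]       # noiseless linear target

# Prepend a column of ones so the first fitted parameter is the intercept
design = np.column_stack([np.ones(len(X)), X])
params, *_ = np.linalg.lstsq(design, y, rcond=None)

intercept, coef = params[0], params[1:]
y_hat = intercept + X @ coef                     # equivalent to design @ params
print(params)
```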


## Plot Important Features by Weight



In [None]:
# aka_plot.plot_important_features(clf.best_estimator_,df_filtered)

In [41]:
aka_plot_shap(clf.best_estimator_, X_train, feat_name).plot_summary_shap().show()
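
Since the best estimator exposes `intercept_` and `coef_`, it appears to be a linear model. For a linear model with features assumed independent, the SHAP value of feature j for one sample is simply `coef[j] * (x[j] - mean(x[j]))`. The sketch below computes these values with NumPy on synthetic data; the parameters and the function name are illustrative assumptions.

```python
import numpy as np

def linear_shap_values(coef, X, background):
    """SHAP values of a linear model, assuming independent features."""
    return (X - background.mean(axis=0)) * coef

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
intercept, coef = 0.72, np.array([0.06, 0.04])   # illustrative parameters

phi = linear_shap_values(coef, X, background=X)
base_value = intercept + X.mean(axis=0) @ coef   # average prediction
prediction = intercept + X @ coef

# Global importance as summarized in a SHAP plot: mean absolute SHAP value
importance = np.abs(phi).mean(axis=0)
print(importance)
```

Each row of `phi` sums to the difference between that sample's prediction and the average prediction, which is the additivity property SHAP summary plots rely on.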

# Neural Net

In [42]:

myNN = aka_nn(X_train, X_test, y_train, y_test, activation)
model, scre = myNN.DNN(epoch, num_nodes, dropout_prob, lr, batch_size)
y_pred = myNN.predict(model)

2 nodes, dropout 0.05, lr 0.01, batch size 2
2 nodes, dropout 0.05, lr 0.01, batch size 4
2 nodes, dropout 0.05, lr 0.1, batch size 2
2 nodes, dropout 0.05, lr 0.1, batch size 4
2 nodes, dropout 0.1, lr 0.01, batch size 2
2 nodes, dropout 0.1, lr 0.01, batch size 4
2 nodes, dropout 0.1, lr 0.1, batch size 2
2 nodes, dropout 0.1, lr 0.1, batch size 4
4 nodes, dropout 0.05, lr 0.01, batch size 2
4 nodes, dropout 0.05, lr 0.01, batch size 4
4 nodes, dropout 0.05, lr 0.1, batch size 2
4 nodes, dropout 0.05, lr 0.1, batch size 4
4 nodes, dropout 0.1, lr 0.01, batch size 2
4 nodes, dropout 0.1, lr 0.01, batch size 4
4 nodes, dropout 0.1, lr 0.1, batch size 2
4 nodes, dropout 0.1, lr 0.1, batch size 4
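
The 16 lines above correspond to the Cartesian product of the four hyperparameter lists from the preset parameters. A minimal sketch of that enumeration with `itertools.product` follows; how `aka_nn.DNN` iterates internally is not shown, so the loop order is inferred from the log.

```python
from itertools import product

num_nodes = [2, 4]
dropout_prob = [0.05, 0.1]
lr = [0.01, 0.1]
batch_size = [2, 4]

# Cartesian product: 2 * 2 * 2 * 2 = 16 configurations, last list varies fastest
grid = list(product(num_nodes, dropout_prob, lr, batch_size))
labels = [
    f"{n} nodes, dropout {p}, lr {rate}, batch size {b}"
    for n, p, rate, b in grid
]
for label in labels:
    print(label)
```

The grid grows multiplicatively, so adding a third value to each list would already require 81 training runs.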


In [43]:
from sklearn.metrics import r2_score
y_pred = model.predict(X_test)
print("R-squared score (DNN):", r2_score(y_test, y_pred))

R-squared score (DNN): 0.5139939540601548


In [44]:
aka_plot_shap(clf.best_estimator_, X_train, feat_name).plot_summary_shap().show()