<a href="https://colab.research.google.com/github/dandoush/ML-for-Academic-Counselling/blob/main/Guidance_Counselling_System.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Simple Tutorial: AI and ML-Based Academic Guidance and Counselling System.**

The system analyzes student profiles, grades, and interests to recommend specializations or universities based on expected admission success rates.

By the end of this tutorial, you will:

*   Understand how to preprocess student data.
*   Train a machine learning model to predict admission success.
*   Build a recommendation system for universities or specializations.
*   Visualize recommendations.


*   Dataset used: Admission Prediction Dataset (available on Kaggle or UCI): https://www.kaggle.com/datasets/mukeshmanral/graduates-admission-prediction

**Inspiration**

The results can help students shortlist universities that match their profiles. The predicted output gives students a fair idea of their chances at a particular university.




**1. Dataset and Problem Overview**

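For simplicity, we will use a dataset that includes the following:

*   Student Profile: GRE Score, TOEFL Score, GPA.
*   Grades: Undergraduate GPA, relevant coursework grades.
*   Interests: Specialization fields like Data Science, AI, Robotics.
*   Target: Admission success probability.

The Kaggle dataset used below covers the profile, CGPA, and target columns (GRE Score, TOEFL Score, University Rating, SOP, LOR, CGPA, Research, Chance of Admit). It does not contain coursework grades or interest fields.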

The main columns are:

*   GRE Score (Graduate Record Examination): a standardized test that many graduate programs require as part of the application. It is administered by ETS rather than by the university, and the combined score used here has a maximum of 340.
*   SOP (Statement of Purpose): a motivation letter written by the applicant. The dataset stores a numeric rating for it.
*   LOR (Letter of Recommendation): letters written by professors or employers. The dataset also stores a numeric rating for these.
*   CGPA (Cumulative Grade Point Average): the undergraduate grade average, on a 10-point scale in this dataset.
*   University Rating: a numerical value that rates universities by overall reputation, quality of education, or competitiveness.

Together with the TOEFL score (out of 120) and research work (if any), these metrics help universities assess your academic readiness, English proficiency, research interests, and potential for success in their program.

**2. Workflow**

**Step 1:** Import Libraries and Dataset

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
import matplotlib.pyplot as plt
import seaborn as sns



**Step 2**: Load and Explore the Dataset


In [None]:
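# Load the dataset (in Colab: upload the CSV file from your machine)
import io
from google.colab import files
uploaded = files.upload()

# files.upload() returns a dict of {file name: file content as bytes};
# read the first uploaded file directly from memory
file_name = next(iter(uploaded))
df = pd.read_csv(io.BytesIO(uploaded[file_name]))
# Outside Colab, read the file from disk instead:
# df = pd.read_csv("admission_data.csv")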


Saving admission_data.csv.csv to admission_data.csv (2).csv


In [None]:

# Inspect the data
print("Data file head")
print(df.head())
print("Data file info")
df.info()  # info() prints its summary itself and returns None
print("")
print("*****Print column names*****")
print("")
print(df.columns.tolist())

# Remove spaces at the beginning or end of the column names
df.columns = df.columns.str.strip()
print("Let us print the target col")
print(df['Chance of Admit'])

print("")
print("Checking missing values in the data")

# Check for missing values
print(df.isnull().sum())
print("Done")

Data file head
   GRE Score  TOEFL Score  University Rating  SOP  LOR  CGPA  Research  \
0        337          118                  4  4.5  4.5  9.65         1   
1        324          107                  4  4.0  4.5  8.87         1   
2        316          104                  3  3.0  3.5  8.00         1   
3        322          110                  3  3.5  2.5  8.67         1   
4        314          103                  2  2.0  3.0  8.21         0   

   Chance of Admit  
0             0.92  
1             0.76  
2             0.72  
3             0.80  
4             0.65  
Data file info
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   GRE Score          500 non-null    int64  
 1   TOEFL Score        500 non-null    int64  
 2   University Rating  500 non-null    int64  
 3   SOP                500 non-null    float64
 4   LO

**Step 3: Data Preprocessing**

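*   Feature Engineering: Normalize the GRE, TOEFL, and CGPA scores to a 0-1 scale by dividing each by its maximum possible value (340, 120, and 10). All columns in this dataset are already numeric, so no categorical encoding is needed here.
*   Target Variable: Create a binary label, Admitted, that is 1 when Chance of Admit is above 0.7 and 0 otherwise.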

In [None]:
# Normalize GRE, TOEFL, and GPA
df['GRE_Score'] = df['GRE Score'] / 340
df['TOEFL_Score'] = df['TOEFL Score'] / 120
df['GPA'] = df['CGPA'] / 10

# Define target: Admission success
df['Admitted'] = (df['Chance of Admit'] > 0.7).astype(int)
print("Print column names")
print(df.columns.tolist())
print("Drop redundant columns")
df = df.drop(['GRE Score', 'TOEFL Score', 'CGPA', 'Chance of Admit'], axis=1)
print(df.head())

Print column names
['GRE Score', 'TOEFL Score', 'University Rating', 'SOP', 'LOR', 'CGPA', 'Research', 'Chance of Admit', 'GRE_Score', 'TOEFL_Score', 'GPA', 'Admitted']
Drop redundant columns
   University Rating  SOP  LOR  Research  GRE_Score  TOEFL_Score    GPA  \
0                  4  4.5  4.5         1   0.991176     0.983333  0.965   
1                  4  4.0  4.5         1   0.952941     0.891667  0.887   
2                  3  3.0  3.5         1   0.929412     0.866667  0.800   
3                  3  3.5  2.5         1   0.947059     0.916667  0.867   
4                  2  2.0  3.0         0   0.923529     0.858333  0.821   

   Admitted  
0         1  
1         1  
2         1  
3         1  
4         0  
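The 0.7 cut-off decides how many students count as admitted, so it is worth checking how the classes split before training. The sketch below is only an illustration: it uses the five Chance of Admit values from the data preview, and `label_admitted` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

# The five "Chance of Admit" values shown in the data preview
chance = pd.Series([0.92, 0.76, 0.72, 0.80, 0.65], name="Chance of Admit")

def label_admitted(chance_of_admit, threshold=0.7):
    """Return 1 where the chance is strictly above the threshold, else 0."""
    return (chance_of_admit > threshold).astype(int)

for threshold in (0.7, 0.8):
    labels = label_admitted(chance, threshold)
    # value_counts(normalize=True) gives the share of each class
    print(threshold, labels.value_counts(normalize=True).to_dict())
```

On the full data frame, the same check would be `df['Admitted'].value_counts(normalize=True)`. A strongly unbalanced split would make plain accuracy harder to interpret.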


**Step 4:** Train a Classification Model

Split the data into training and testing sets and train a classifier.

In scikit-learn, train_test_split splits your dataset into training and test sets. Its main arguments are:

*   X: The features (input variables) in your dataset.
*   y: The target (output variable/labels) corresponding to each row in X.
*   test_size: The fraction of rows held out for testing (0.2 here, so 100 of the 500 rows).
*   random_state: An integer seed (such as 1, 5, or 42) that ensures reproducibility. Using the same seed (42 here) gives the same random split each time the code runs.
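To see what random_state does in practice, here is a minimal sketch on made-up toy arrays (`X_toy` and `y_toy` are assumptions for illustration, not the admission data). It also shows the optional `stratify` argument, which the notebook does not use but which keeps the class proportions equal in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 20 rows, 2 features, balanced binary labels (made up for illustration)
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0, 1] * 10)

# Same seed -> identical split on every run
a_train, a_test, _, _ = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)
b_train, b_test, _, _ = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)
print(np.array_equal(a_test, b_test))

# stratify=y_toy keeps the share of each class the same in train and test
_, _, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)
print(y_tr.mean(), y_te.mean())
```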

In [None]:
# Split data
X = df.drop('Admitted', axis=1)
y = df['Admitted']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Random Forest Classifier
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

# Test model
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

# Let us try a made-up student profile (same column order as X)
new_data_point = [3, 4.5, 4.5, 0, 0.95, 0.98, 0.96]
y_p1 = model.predict(pd.DataFrame([new_data_point], columns=X.columns))


Accuracy: 0.84
              precision    recall  f1-score   support

           0       0.84      0.84      0.84        49
           1       0.84      0.84      0.84        51

    accuracy                           0.84       100
   macro avg       0.84      0.84      0.84       100
weighted avg       0.84      0.84      0.84       100
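A trained random forest also reports how much each feature contributed to its splits through the `feature_importances_` attribute. The sketch below is an assumption-laden illustration: it trains on synthetic data (`X_demo`, `y_demo`) in which the label depends only on GPA, so the expected ranking is known in advance. It does not reproduce the notebook's model.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 200
# Synthetic stand-in for some preprocessed features (not the Kaggle data)
X_demo = pd.DataFrame({
    "GPA": rng.uniform(0.6, 1.0, n),
    "GRE_Score": rng.uniform(0.8, 1.0, n),
    "Research": rng.integers(0, 2, n),
})
# The label depends only on GPA, so GPA should come out as most important
y_demo = (X_demo["GPA"] > 0.8).astype(int)

forest = RandomForestClassifier(random_state=42).fit(X_demo, y_demo)
importances = pd.Series(forest.feature_importances_, index=X_demo.columns)
print(importances.sort_values(ascending=False))
```

With the notebook's own model, the equivalent call would be `pd.Series(model.feature_importances_, index=X.columns)`.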



We can pass any student profile to the model and get back 1 (admitted) or 0 (not admitted).
After normalization and removal of the redundant columns, the feature columns of the data frame df are, in order:
[University Rating, SOP, LOR, Research, GRE_Score, TOEFL_Score, GPA]

In [None]:
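# Let us try a made-up student profile (values in the column order listed above)
new_data_point = [4, 4.5, 4.5, 0, 0.75, 0.78, 0.96]
# Wrap the values in a DataFrame so the column names match the training data
y_p1 = model.predict(pd.DataFrame([new_data_point], columns=X.columns))
print(y_p1)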

[0]
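`predict` returns only a hard 0 or 1. If an estimated success rate is more useful, scikit-learn classifiers such as RandomForestClassifier also offer `predict_proba`. The following is a sketch on synthetic data (`X_demo`, `y_demo`, and the labelling rule are assumptions for illustration), not the notebook's trained model.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
n = 300
# Synthetic stand-in for two of the normalized features (not the Kaggle data)
X_demo = pd.DataFrame({
    "GRE_Score": rng.uniform(0.8, 1.0, n),
    "GPA": rng.uniform(0.6, 1.0, n),
})
# Made-up rule: class 1 when the average of the two scores is above 0.85
y_demo = ((X_demo["GRE_Score"] + X_demo["GPA"]) / 2 > 0.85).astype(int)

clf = RandomForestClassifier(random_state=42).fit(X_demo, y_demo)

student = pd.DataFrame([[0.95, 0.96]], columns=X_demo.columns)
# The columns of predict_proba follow clf.classes_, here [0, 1]
proba_admitted = clf.predict_proba(student)[0, 1]
print(f"Estimated probability of class 1: {proba_admitted:.2f}")
```

For a random forest, this value is the average of the per-tree class probabilities, so it is a rough score rather than a calibrated probability.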




For model selection, we can use any classifier suited to this problem, for example:

from sklearn.linear_model import LogisticRegression

model = LogisticRegression(random_state=42)

model.fit(X_train, y_train)

 or

from sklearn.ensemble import GradientBoostingClassifier

model = GradientBoostingClassifier(random_state=42)

model.fit(X_train, y_train)
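One way to choose among these classifiers is to compare their cross-validated accuracy. The sketch below assumes a synthetic 7-feature problem from `make_classification` as a stand-in for the admission data, and `max_iter=1000` is an added setting for LogisticRegression, so the scores say nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic 7-feature binary problem standing in for the admission data
X_demo, y_demo = make_classification(n_samples=300, n_features=7, random_state=42)

candidates = {
    "LogisticRegression": LogisticRegression(max_iter=1000, random_state=42),
    "RandomForest": RandomForestClassifier(random_state=42),
    "GradientBoosting": GradientBoostingClassifier(random_state=42),
}

scores = {}
for name, estimator in candidates.items():
    # Mean accuracy over 5 cross-validation folds
    scores[name] = cross_val_score(
        estimator, X_demo, y_demo, cv=5, scoring="accuracy"
    ).mean()
    print(f"{name}: {scores[name]:.3f}")
```

On the notebook's data, `X_demo` and `y_demo` would be replaced by `X` and `y`.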


**Step 5: University Recommendation System**

Build a system that recommends universities based on the student's profile.
Here we simulate university admission criteria with mock minimum GRE and GPA requirements (the values are made up for illustration) and keep the universities whose requirements the student meets.

In [None]:
# Mock university data
universities = pd.DataFrame({
    "University": ["MIT", "Stanford", "Sorbonne", "UDST", "Berkeley", "QatarUniv"],
    "Required GRE": [320, 335, 325, 325, 310, 320],
    "Required GPA": [9.0, 9.5, 9.8, 9.2, 8.5, 8.95]
})

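# Recommend based on profile
def recommend_universities(student_profile):
    """Return the universities whose minimum GRE and GPA the student meets."""
    # The profile holds normalized scores, so rescale them to the raw GRE (340) and CGPA (10) scales
    recommendations = universities[
        (universities["Required GRE"] <= student_profile["GRE_Score"] * 340) &
        (universities["Required GPA"] <= student_profile["GPA"] * 10)
    ]
    return recommendations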

# Example student profile
student = {"GRE_Score": 0.95, "GPA": 0.95}
print(recommend_universities(student))


  University  Required GRE  Required GPA
0        MIT           320          9.00
4   Berkeley           310          8.50
5  QatarUniv           320          8.95
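The filter above returns the eligible universities in their original order. One possible way to rank them, sketched here under stated assumptions, is by how far the student sits above each threshold. The `Margin` score and the `rank_universities` helper are hypothetical, and the function takes raw scores (GRE out of 340, CGPA out of 10) rather than normalized ones.

```python
import pandas as pd

# Same mock university data as in the cell above
universities = pd.DataFrame({
    "University": ["MIT", "Stanford", "Sorbonne", "UDST", "Berkeley", "QatarUniv"],
    "Required GRE": [320, 335, 325, 325, 310, 320],
    "Required GPA": [9.0, 9.5, 9.8, 9.2, 8.5, 8.95],
})

def rank_universities(gre, cgpa):
    """Keep universities whose thresholds are met, sorted by a hypothetical margin."""
    eligible = universities[
        (universities["Required GRE"] <= gre) & (universities["Required GPA"] <= cgpa)
    ].copy()
    # Margin: distance above each threshold, measured on the normalized scales
    eligible["Margin"] = (
        (gre - eligible["Required GRE"]) / 340 + (cgpa - eligible["Required GPA"]) / 10
    )
    # Smallest margin first: the most demanding university still within reach
    return eligible.sort_values("Margin").reset_index(drop=True)

ranked = rank_universities(gre=323, cgpa=9.5)
print(ranked)
```

Sorting by ascending margin puts the most selective reachable university first. Sorting in descending order would instead put the safest choices first.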


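**Improvement to the Dataset**

The dataset can be enriched with behavior and preference columns so that recommendations also reflect a student's interests. The values and the recommendation rule in the following cell are made up for illustration.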

In [None]:
import pandas as pd

# Example dataset structure
data = {
    'Student_ID': [1, 2, 3, 4, 5],
    'GRE Score': [330, 320, 300, 310, 315],
    'TOEFL Score': [115, 110, 100, 105, 107],
    'CGPA': [9.0, 8.5, 7.5, 8.0, 8.2],
    'Behavior': ['Quiet', 'Energetic', 'Social', 'Quiet', 'Isolated'],
    'Preferred Subjects': ['Music', 'Math', 'Science', 'Music', 'Languages'],
    'Target University': ['Paris Sorbonne', 'California', 'Cairo', 'Milan', 'Mexico']
}

df = pd.DataFrame(data)

# Enrich the data with a simple rule-based recommendation (illustrative only)
df['Recommendation'] = df.apply(
    lambda row: "Medicine or Music" if row['Behavior'] == 'Quiet' else "Languages or Politics",
    axis=1
)

# Save to file for GPT upload
df.to_csv('Enhanced_Admission_Data.csv', index=False)

print(df)


   Student_ID  GRE Score  TOEFL Score  CGPA   Behavior Preferred Subjects  \
0           1        330          115   9.0      Quiet              Music   
1           2        320          110   8.5  Energetic               Math   
2           3        300          100   7.5     Social            Science   
3           4        310          105   8.0      Quiet              Music   
4           5        315          107   8.2   Isolated          Languages   

  Target University         Recommendation  
0    Paris Sorbonne      Medicine or Music  
1        California  Languages or Politics  
2             Cairo  Languages or Politics  
3             Milan      Medicine or Music  
4            Mexico  Languages or Politics  
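Text columns such as Behavior and Preferred Subjects cannot be fed to a scikit-learn classifier directly. A common option is one-hot encoding with `pd.get_dummies`, sketched below on a few columns of the mock data above (the `students` frame is a reduced copy made for this example).

```python
import pandas as pd

# A reduced copy of the mock enriched dataset
students = pd.DataFrame({
    "CGPA": [9.0, 8.5, 7.5, 8.0, 8.2],
    "Behavior": ["Quiet", "Energetic", "Social", "Quiet", "Isolated"],
    "Preferred Subjects": ["Music", "Math", "Science", "Music", "Languages"],
})

# One-hot encode the text columns; numeric columns pass through unchanged
encoded = pd.get_dummies(
    students, columns=["Behavior", "Preferred Subjects"], dtype=int
)
print(encoded.columns.tolist())
print(encoded.head())
```

Each category becomes its own 0/1 column (for example `Behavior_Quiet`), so the encoded frame can be used as model input alongside the numeric scores.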
