<a href="https://colab.research.google.com/github/al-9k/Kaggle-Test/blob/main/InternPlacementModel.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### **This notebook was adapted from local .py files. To run it as local files instead, add "from functions import *" to main.py and model.py.**

### **functions.py -- Data PreProcessing Functions**

In [8]:
from sklearn.preprocessing import LabelEncoder

def label_encoder(df, columns):
    """
    Encode the given columns of a DataFrame as integer codes using LabelEncoder.

    Parameters:
    - df (DataFrame): The DataFrame containing the columns to be encoded.
    - columns (list): A list of column names to be encoded.

    Returns:
    - DataFrame: The same DataFrame, modified in place, with the listed columns encoded.

    model.py uses this to prepare the Kaggle data for training, and main.py uses it on the prediction inputs.
    """
    for column in columns:
        # Initialize LabelEncoder for the current column
        le = LabelEncoder()
        # Fit and transform the column
        df[column] = le.fit_transform(df[column])
    return df
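
As a quick illustration of what `label_encoder` does, here is a minimal sketch on a small, made-up DataFrame (the rows below are hypothetical and not taken from the Kaggle dataset). The function is repeated so the snippet runs on its own. `LabelEncoder` assigns integer codes in sorted order of the distinct values, so `Female` becomes 0 and `Male` becomes 1.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

def label_encoder(df, columns):
    """Same logic as the function above: label-encode each listed column in place."""
    for column in columns:
        le = LabelEncoder()
        df[column] = le.fit_transform(df[column])
    return df

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Stream": ["Mechanical", "Civil", "Mechanical"],
    "CGPA": [8, 6, 7],
})

# Pass a copy so the original frame keeps its string values
encoded = label_encoder(toy.copy(), ["Gender", "Stream"])
print(encoded)
```

Because the function writes into the DataFrame it receives, passing `toy.copy()` is one way to keep the unencoded data around for inspection.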

### **model.py -- Model Training & Evaluation**

In [9]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
import joblib

"""
Import the Kaggle dataset, remove duplicate rows, train a model on it, and print a classification report (precision, recall, F1-score) for the held-out test set.
"""

# URL of the CSV file containing the Kaggle dataset (hosted on Alhasan's GitHub; the likely irrelevant columns Hostel and HistoryOfBacklogs were removed)
csv_url = "https://raw.githubusercontent.com/al-9k/Kaggle-Test/main/collegePlace.csv"

# Read the dataset from the CSV file into a DataFrame
df = pd.read_csv(csv_url)

# Remove duplicate rows from the DataFrame, keeping the first occurrence -- cleaning data
df.drop_duplicates(keep='first', inplace=True)

# Reset the index of the DataFrame after dropping duplicates
df = df.reset_index(drop=True)

# Label-encode every column with label_encoder from functions.py (df.columns also includes the numeric columns and the target, not only the categorical ones)
df = label_encoder(df, df.columns)

# Separate features (X) and target variable (y). Target y: PlacedOrNot. Features X: age, gender, stream, number of internships, CGPA
X = df.drop('PlacedOrNot', axis=1)
y = df['PlacedOrNot']

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize and train a Random Forest Classifier
rf_classifier = RandomForestClassifier(random_state=42)
rf_classifier.fit(X_train, y_train)

# Save the trained model to a file named 'model.pkl'
joblib.dump(rf_classifier, 'model.pkl')

# Make predictions on the test set
y_pred = rf_classifier.predict(X_test)

# Print classification report containing precision, recall, f1-score, and support
print(classification_report(y_test, y_pred))


"""
    - Precision: of the instances predicted as a class, the fraction that actually belong to it.
    - Recall: of the instances that actually belong to a class, the fraction the model identified.
    - F1-score: the harmonic mean of precision and recall.
    - Support: the number of true instances of each class in the test set.
"""

              precision    recall  f1-score   support

           0       0.57      0.64      0.60        91
           1       0.75      0.70      0.73       144

    accuracy                           0.68       235
   macro avg       0.66      0.67      0.67       235
weighted avg       0.68      0.68      0.68       235
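
To make the four metrics in the report concrete, the sketch below computes them by hand for the positive class on a small set of made-up labels and compares the result with scikit-learn's `precision_recall_fscore_support`. The labels are hypothetical and unrelated to the test set above.

```python
from sklearn.metrics import precision_recall_fscore_support

# Hypothetical labels, chosen only to keep the arithmetic easy to follow
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0, 1, 1]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
support = tp + fn            # number of actual positives

# Same quantities from scikit-learn, restricted to class 1
sk_p, sk_r, sk_f1, sk_support = precision_recall_fscore_support(
    y_true, y_pred, labels=[1]
)

print(f"by hand: precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} support={support}")
print(f"sklearn: precision={sk_p[0]:.3f} recall={sk_r[0]:.3f} f1={sk_f1[0]:.3f} support={sk_support[0]}")
```

In the report above, the `macro avg` row is the unweighted mean of the per-class values, while `weighted avg` weights each class by its support.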



### **main.py -- Model's Prediction and Result Display**

In [7]:
import pandas as pd
import joblib
import numpy as np

"""
Load the saved model and supply student features (age, gender, stream, number of internships, and CGPA) so that the model can predict whether each student
will be placed in a job, along with the probability it assigns to that prediction.
"""

# Load the trained model from the file 'model.pkl'
model = joblib.load('model.pkl')

# Input the features for which the model will predict placement. This example uses the features of 10 students.
data = {
    'Age': [22, 25, 24, 28, 25, 21, 29, 25, 27, 23],
    'Gender': ['Male', 'Male', 'Female', 'Male', 'Female', 'Female', 'Male', 'Female', 'Female', 'Male'],
    'Stream': ['Electronics And Communication', 'Mechanical', 'Mechanical', 'Computer Science', 'Computer Science',
               'Electronics And Communication', 'Mechanical', 'Electronics And Communication', 'Electronics And Communication', 'Mechanical'],
    'Internships': [2, 4, 2, 3, 5, 1, 5, 0, 5, 3],
    'CGPA': [8, 5, 6, 7, 9, 9, 7, 5, 9, 8]
}

# Create a DataFrame from the data
df = pd.DataFrame(data)

# Encode all columns with label_encoder from functions.py. Note: this fits new encoders on these rows only, so the integer codes may differ from those used in training
df = label_encoder(df, df.columns)

# Make predictions using the loaded model
prediction = model.predict(df)
# Calculate certainty (the predicted probability of the predicted class, as a percentage) for each prediction
certainty = np.max(model.predict_proba(df), axis=1) * 100

# Iterate over each prediction and print the result along with its certainty
for pred, cert in zip(prediction, certainty):
    # Translate prediction label to human-readable form
    if pred == 1:
        pred_text = 'Placed'
    else:
        pred_text = 'Not Placed'
    # Print prediction result and certainty percentage
    print(f"Prediction: {pred_text} \t | \t Certainty: {cert:.1f}%")
    print('----------------------------------------------------')


Prediction: Placed 	 | 	 Certainty: 99.0%
----------------------------------------------------
Prediction: Placed 	 | 	 Certainty: 56.0%
----------------------------------------------------
Prediction: Placed 	 | 	 Certainty: 74.3%
----------------------------------------------------
Prediction: Placed 	 | 	 Certainty: 88.7%
----------------------------------------------------
Prediction: Placed 	 | 	 Certainty: 100.0%
----------------------------------------------------
Prediction: Placed 	 | 	 Certainty: 100.0%
----------------------------------------------------
Prediction: Placed 	 | 	 Certainty: 74.7%
----------------------------------------------------
Prediction: Not Placed 	 | 	 Certainty: 98.0%
----------------------------------------------------
Prediction: Placed 	 | 	 Certainty: 100.0%
----------------------------------------------------
Prediction: Placed 	 | 	 Certainty: 96.0%
----------------------------------------------------
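
The main.py cell fits fresh `LabelEncoder` objects on the prediction inputs, so a value such as `Mechanical` can receive a different integer code than it did during training. One possible way to keep the codes consistent is to fit the encoders once on the training data and reuse them at prediction time. The sketch below illustrates that idea on made-up data; `fit_encoders` and `apply_encoders` are hypothetical helper names and are not part of the original files.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

def fit_encoders(df, columns):
    """Fit one LabelEncoder per column and return them keyed by column name."""
    return {col: LabelEncoder().fit(df[col]) for col in columns}

def apply_encoders(df, encoders):
    """Return a copy of df with each encoded column mapped through its fitted encoder."""
    out = df.copy()
    for col, le in encoders.items():
        out[col] = le.transform(out[col])
    return out

# Hypothetical training data, used only for illustration
train = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male"],
    "Stream": ["Civil", "Mechanical", "Computer Science", "Mechanical"],
    "CGPA": [7, 8, 6, 9],
})
encoders = fit_encoders(train, ["Gender", "Stream"])
train_encoded = apply_encoders(train, encoders)

# A new row at prediction time, containing only one of the streams
new = pd.DataFrame({"Gender": ["Male"], "Stream": ["Mechanical"], "CGPA": [8]})
new_encoded = apply_encoders(new, encoders)

# For comparison: an encoder fit on `new` alone assigns a different code
refit_code = LabelEncoder().fit_transform(new["Stream"])[0]

print("code from training encoders:", new_encoded["Stream"].iloc[0])
print("code from a fresh encoder:  ", refit_code)
```

Under this approach the `encoders` dictionary could be saved next to the model with `joblib.dump` (for example to a hypothetical `encoders.pkl`) and loaded in main.py. Note that `LabelEncoder.transform` raises a `ValueError` for values it did not see during fitting, so unseen categories would need separate handling.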
