## Placement Prediction

### 1. Data Preparation: Cleaning, Preprocessing & Exploratory Analysis

In [1]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

#### Dataset is sourced from https://www.kaggle.com/datasets/ruchikakumbhar/placement-prediction-dataset/data?select=placementdata.csv

In [2]:
data = "https://raw.githubusercontent.com/bankymondial/Placement-Prediction/refs/heads/main/placementdata.csv"

In [3]:
df = pd.read_csv(data)

In [4]:
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 12 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   StudentID                  10000 non-null  int64  
 1   CGPA                       10000 non-null  float64
 2   Internships                10000 non-null  int64  
 3   Projects                   10000 non-null  int64  
 4   Workshops/Certifications   10000 non-null  int64  
 5   AptitudeTestScore          10000 non-null  int64  
 6   SoftSkillsRating           10000 non-null  float64
 7   ExtracurricularActivities  10000 non-null  object 
 8   PlacementTraining          10000 non-null  object 
 9   SSC_Marks                  10000 non-null  int64  
 10  HSC_Marks                  10000 non-null  int64  
 11  PlacementStatus            10000 non-null  object 
dtypes: float64(2), int64(7), object(3)
memory usage: 937.6+ KB
None


In [5]:
df.head()

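|   | StudentID | CGPA | Internships | Projects | Workshops/Certifications | AptitudeTestScore | SoftSkillsRating | ExtracurricularActivities | PlacementTraining | SSC_Marks | HSC_Marks | PlacementStatus |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 7.5 | 1 | 1 | 1 | 65 | 4.4 | No | No | 61 | 79 | NotPlaced |
| 1 | 2 | 8.9 | 0 | 3 | 2 | 90 | 4.0 | Yes | Yes | 78 | 82 | Placed |
| 2 | 3 | 7.3 | 1 | 2 | 2 | 82 | 4.8 | Yes | No | 79 | 80 | NotPlaced |
| 3 | 4 | 7.5 | 1 | 1 | 2 | 85 | 4.4 | Yes | Yes | 81 | 80 | Placed |
| 4 | 5 | 8.3 | 1 | 2 | 2 | 86 | 4.5 | Yes | Yes | 74 | 88 | Placed |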


In [6]:
print(df.describe())

         StudentID          CGPA   Internships      Projects  \
count  10000.00000  10000.000000  10000.000000  10000.000000   
mean    5000.50000      7.698010      1.049200      2.026600   
std     2886.89568      0.640131      0.665901      0.867968   
min        1.00000      6.500000      0.000000      0.000000   
25%     2500.75000      7.400000      1.000000      1.000000   
50%     5000.50000      7.700000      1.000000      2.000000   
75%     7500.25000      8.200000      1.000000      3.000000   
max    10000.00000      9.100000      2.000000      3.000000   

       Workshops/Certifications  AptitudeTestScore  SoftSkillsRating  \
count              10000.000000       10000.000000      10000.000000   
mean                   1.013200          79.449900          4.323960   
std                    0.904272           8.159997          0.411622   
min                    0.000000          60.000000          3.000000   
25%                    0.000000          73.000000          4.0

#### This code normalizes the column names: it converts them to lowercase, replaces spaces with underscores, and removes `/`.

In [7]:
df.columns = df.columns.str.lower().str.replace(' ', '_').str.replace('/', '')
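
As a rough illustration of what the chained `.str` calls do, the same three steps can be applied to the original column names in plain Python. The list copies the names from the `df.info()` output above; the helper name `clean_name` is hypothetical and only serves this sketch.

```python
# Column names as listed by df.info() above.
original_columns = [
    "StudentID", "CGPA", "Internships", "Projects",
    "Workshops/Certifications", "AptitudeTestScore", "SoftSkillsRating",
    "ExtracurricularActivities", "PlacementTraining",
    "SSC_Marks", "HSC_Marks", "PlacementStatus",
]


def clean_name(name: str) -> str:
    """Lowercase, turn spaces into underscores, and drop '/' (hypothetical helper)."""
    return name.lower().replace(" ", "_").replace("/", "")


cleaned_columns = [clean_name(c) for c in original_columns]
print(cleaned_columns[4])  # workshopscertifications
```

None of the original names contains a space, so the underscore step is a no-op for this dataset; only the lowercase and `/` steps change anything.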

In [8]:
df.tail()

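|   | studentid | cgpa | internships | projects | workshopscertifications | aptitudetestscore | softskillsrating | extracurricularactivities | placementtraining | ssc_marks | hsc_marks | placementstatus |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 9995 | 9996 | 7.5 | 1 | 1 | 2 | 72 | 3.9 | Yes | No | 85 | 66 | NotPlaced |
| 9996 | 9997 | 7.4 | 0 | 1 | 0 | 90 | 4.8 | No | No | 84 | 67 | Placed |
| 9997 | 9998 | 8.4 | 1 | 3 | 0 | 70 | 4.8 | Yes | Yes | 79 | 81 | Placed |
| 9998 | 9999 | 8.9 | 0 | 3 | 2 | 87 | 4.8 | Yes | Yes | 71 | 85 | Placed |
| 9999 | 10000 | 8.4 | 0 | 1 | 1 | 66 | 3.8 | No | No | 62 | 66 | NotPlaced |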


#### Data Preprocessing
- Check for missing data
- Encoding categorical features
- Defining features (X) and target (y)

In [9]:
df.isna().sum()

studentid                    0
cgpa                         0
internships                  0
projects                     0
workshopscertifications      0
aptitudetestscore            0
softskillsrating             0
extracurricularactivities    0
placementtraining            0
ssc_marks                    0
hsc_marks                    0
placementstatus              0
dtype: int64

In [10]:
df.dtypes

studentid                      int64
cgpa                         float64
internships                    int64
projects                       int64
workshopscertifications        int64
aptitudetestscore              int64
softskillsrating             float64
extracurricularactivities     object
placementtraining             object
ssc_marks                      int64
hsc_marks                      int64
placementstatus               object
dtype: object

In [11]:
print(df['extracurricularactivities'].unique())
print(df['placementtraining'].unique())
print(df['placementstatus'].unique())

['No' 'Yes']
['No' 'Yes']
['NotPlaced' 'Placed']


In [12]:
df['extracurricularactivities'] = df['extracurricularactivities'].str.lower()
df['placementtraining'] = df['placementtraining'].str.lower()
df['placementstatus'] = df['placementstatus'].str.lower()

In [13]:
df['extracurricularactivities'] = df['extracurricularactivities'].map({'no': 0, 'yes': 1})
df['placementtraining'] = df['placementtraining'].map({'no': 0, 'yes': 1})
df['placementstatus'] = df['placementstatus'].map({'notplaced': 0, 'placed': 1})
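
One thing to keep in mind with dictionary-based encoding is that `Series.map` returns `NaN` for any value that is missing from the dictionary, so an unexpected label would silently turn into a missing value. The `unique()` output above shows only the expected labels here, so nothing is lost in this notebook. The plain-Python sketch below, on made-up labels, shows one way such a guard could look; `encode_labels` is a hypothetical helper.

```python
def encode_labels(values, mapping):
    """Normalize case and whitespace, then map; unknown labels become None (hypothetical helper)."""
    return [mapping.get(v.strip().lower()) for v in values]


status_mapping = {"notplaced": 0, "placed": 1}

# Made-up raw labels, including one that the mapping does not cover.
raw_labels = ["NotPlaced", "Placed", "placed ", "Unknown"]

encoded = encode_labels(raw_labels, status_mapping)
unmapped = [raw for raw, code in zip(raw_labels, encoded) if code is None]

print(encoded)   # [0, 1, 1, None]
print(unmapped)  # ['Unknown']
```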

In [14]:
print(df.dtypes)

studentid                      int64
cgpa                         float64
internships                    int64
projects                       int64
workshopscertifications        int64
aptitudetestscore              int64
softskillsrating             float64
extracurricularactivities      int64
placementtraining              int64
ssc_marks                      int64
hsc_marks                      int64
placementstatus                int64
dtype: object


### 2.  Train-Validation-Test Split (60-20-20)
- Define X (features) and y (target variable)
- Split the data into train (60%) and temp (40%) for validation and test
- Split the temp data into validation (50% of 40%) and test (50% of 40%)

In [15]:
X = df.drop(columns=['studentid', 'placementstatus'])
y = df['placementstatus']

In [16]:
from sklearn.model_selection import train_test_split

X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

print("Training set shape:", X_train.shape)
print("Validation set shape:", X_val.shape)
print("Test set shape:", X_test.shape)

Training set shape: (6000, 10)
Validation set shape: (2000, 10)
Test set shape: (2000, 10)
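
The two-step split can be checked with simple arithmetic. The sketch below only mirrors the proportions; it does not reproduce scikit-learn's internal rounding or shuffling.

```python
n_rows = 10_000  # rows reported by df.info()

# First split: 40% is held out as the temporary set, the rest is training data.
n_temp = round(n_rows * 0.4)
n_train = n_rows - n_temp

# Second split: half of the temporary set becomes the test set, the other half validation.
n_test = round(n_temp * 0.5)
n_val = n_temp - n_test

print(n_train, n_val, n_test)  # 6000 2000 2000
```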


In [17]:
X.head()

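|   | cgpa | internships | projects | workshopscertifications | aptitudetestscore | softskillsrating | extracurricularactivities | placementtraining | ssc_marks | hsc_marks |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 7.5 | 1 | 1 | 1 | 65 | 4.4 | 0 | 0 | 61 | 79 |
| 1 | 8.9 | 0 | 3 | 2 | 90 | 4.0 | 1 | 1 | 78 | 82 |
| 2 | 7.3 | 1 | 2 | 2 | 82 | 4.8 | 1 | 0 | 79 | 80 |
| 3 | 7.5 | 1 | 1 | 2 | 85 | 4.4 | 1 | 1 | 81 | 80 |
| 4 | 8.3 | 1 | 2 | 2 | 86 | 4.5 | 1 | 1 | 74 | 88 |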


### 3. Normalizing Numerical Features
- Standardization (zero mean, unit variance) puts features measured on different scales (such as CGPA, Aptitude Test Score, Soft Skills Rating, SSC Marks, and HSC Marks) on a comparable footing, so that no feature dominates the model simply because of its units. This involves:
  - Selecting the numerical columns, excluding the binary-encoded ones
  - Initializing the scaler
  - Fitting on the training data only, then transforming the train, validation, and test sets

In [18]:
from sklearn.preprocessing import StandardScaler

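numerical_cols = X.select_dtypes(include=['float64', 'int64']).columns
# Leave the binary 0/1 features unscaled (the target is not part of X, so it needs no exclusion).
numerical_cols = numerical_cols.difference(['extracurricularactivities', 'placementtraining'])

scaler = StandardScaler()

# Work on copies so the column assignments below do not trigger SettingWithCopyWarning.
X_train, X_val, X_test = X_train.copy(), X_val.copy(), X_test.copy()

# Fit on the training split only, then reuse the same statistics for validation and test.
X_train[numerical_cols] = scaler.fit_transform(X_train[numerical_cols])
X_val[numerical_cols] = scaler.transform(X_val[numerical_cols])
X_test[numerical_cols] = scaler.transform(X_test[numerical_cols])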
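
To make the "fit on train, transform everything" idea concrete, here is a minimal sketch of the same arithmetic in plain Python. It uses the population standard deviation, which is what `StandardScaler` computes; the scores are made up for illustration and `standardize` is a hypothetical helper.

```python
from statistics import fmean, pstdev

# Made-up aptitude scores, for illustration only.
train_scores = [60.0, 70.0, 80.0, 90.0]
val_scores = [65.0, 95.0]

# "Fit": learn the mean and population standard deviation from the training data only.
mu = fmean(train_scores)
sigma = pstdev(train_scores)


def standardize(values):
    """Apply the training-set statistics to any split (hypothetical helper)."""
    return [(v - mu) / sigma for v in values]


train_scaled = standardize(train_scores)
val_scaled = standardize(val_scores)

print(round(fmean(train_scaled), 6), round(pstdev(train_scaled), 6))
print([round(v, 3) for v in val_scaled])
```

Only the training split ends up with exactly zero mean and unit variance. The validation and test splits are shifted and scaled by the same constants, which keeps information about them from leaking into the model.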

### 4. Training Logistic Regression Model (Baseline Model)
- Initialize and train the Logistic Regression Model
- Predict on Validation set

In [19]:
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression(random_state=42)
log_reg.fit(X_train, y_train)

y_val_pred = log_reg.predict(X_val)

#### Evaluating Model Performance
- Calculate metrics on validation set
- Print evaluation metrics

In [20]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

accuracy = accuracy_score(y_val, y_val_pred)
precision = precision_score(y_val, y_val_pred)
recall = recall_score(y_val, y_val_pred)
f1 = f1_score(y_val, y_val_pred)

print(f"Validation Accuracy: {accuracy:.4f}")
print(f"Validation Precision: {precision:.4f}")
print(f"Validation Recall: {recall:.4f}")
print(f"Validation F1 Score: {f1:.4f}")

Validation Accuracy: 0.7975
Validation Precision: 0.7529
Validation Recall: 0.7682
Validation F1 Score: 0.7605


### 5. Feature Importance Analysis
- Get feature importance from model coefficients
- Sort by absolute value of coefficients
- Display most important features

In [21]:
feature_importance = pd.DataFrame({
    'Feature': X_train.columns,
    'Coefficient': log_reg.coef_[0]
})

feature_importance['Abs_Coefficient'] = feature_importance['Coefficient'].abs()
feature_importance = feature_importance.sort_values(by='Abs_Coefficient', ascending=False)

print(feature_importance[['Feature', 'Coefficient']])

                     Feature  Coefficient
7          placementtraining     0.955773
6  extracurricularactivities     0.675994
4          aptitudetestscore     0.554143
9                  hsc_marks     0.343753
5           softskillsrating     0.287111
8                  ssc_marks     0.285684
0                       cgpa     0.254303
2                   projects     0.191061
3    workshopscertifications     0.112340
1                internships     0.017232
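
One way to read these coefficients is to convert them into odds ratios with `exp(coefficient)`. The sketch below does this for a few of the values in the table above, using only the standard library.

```python
import math

# Coefficients copied from the table above.
coefficients = {
    "placementtraining": 0.955773,
    "extracurricularactivities": 0.675994,
    "aptitudetestscore": 0.554143,
    "internships": 0.017232,
}

# exp(coefficient) is the multiplicative change in the odds of placement
# for a one-unit increase in the feature, holding the other features fixed.
odds_ratios = {name: math.exp(coef) for name, coef in coefficients.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: {ratio:.2f}")
```

"One unit" means different things for different features. For the binary features it is the step from 0 to 1, so placement training multiplies the odds by about 2.6. For the standardized features it is one standard deviation. The two groups are therefore not strictly on the same scale, and the ranking is best treated as a rough guide.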


### 6. Evaluation on Test set
- Predict on test set
- Compute test metrics
- Print final test metrics

In [22]:
y_test_pred = log_reg.predict(X_test)

test_accuracy = accuracy_score(y_test, y_test_pred)
test_precision = precision_score(y_test, y_test_pred)
test_recall = recall_score(y_test, y_test_pred)
test_f1 = f1_score(y_test, y_test_pred)

print(f"Test Accuracy: {test_accuracy:.4f}")
print(f"Test Precision: {test_precision:.4f}")
print(f"Test Recall: {test_recall:.4f}")
print(f"Test F1 Score: {test_f1:.4f}")

Test Accuracy: 0.7855
Test Precision: 0.7426
Test Recall: 0.7452
Test F1 Score: 0.7439


#### Model Performance Analysis

- Validation vs. Test Performance: The test accuracy (0.7855) is only about 1.2 percentage points below the validation accuracy (0.7975). A gap this small suggests the model generalizes well, with no strong sign of overfitting.
- Precision & Recall Tradeoff: On the validation set, precision (0.7529) is slightly lower than recall (0.7682), meaning false positives slightly outnumber false negatives: the model labels not-placed students as placed a little more often than it misses placed students.
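
Precision and recall share the same numerator (true positives) and differ only in whether false positives or false negatives appear in the denominator. The counts below are not printed by the notebook. They are inferred here as the integers consistent with the reported validation accuracy, precision, and recall on 2,000 rows, so treat them as an assumption.

```python
# Inferred confusion-matrix counts for the validation set (assumption, see above).
tp, fp, fn, tn = 643, 211, 194, 952

precision = tp / (tp + fp)   # share of predicted "placed" students who were placed
recall = tp / (tp + fn)      # share of actually placed students the model found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f} accuracy={accuracy:.4f}")
# precision=0.7529 recall=0.7682 f1=0.7605 accuracy=0.7975

# Precision is below recall exactly when false positives outnumber false negatives.
print(fp > fn)  # True
```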

### 7. Finding the Optimal Threshold
- The default 0.5 cut-off is replaced by the threshold that maximizes the F1 score, in order to balance precision and recall for the "Placed" class:
  - Get the predicted probabilities for the positive class
  - Compute precision, recall, and thresholds
  - Calculate the F1 score for each threshold
  - Find the threshold with the highest F1 score
  - Apply the optimal threshold to make predictions
  - Evaluate performance with the new threshold

In [23]:
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import precision_score, recall_score, f1_score

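y_prob = log_reg.predict_proba(X_test)[:, 1]

precision, recall, thresholds = precision_recall_curve(y_test, y_prob)

# precision and recall have one more entry than thresholds (the final precision=1, recall=0 point),
# so drop that point before indexing thresholds; np.maximum guards against 0/0.
f1_scores = 2 * (precision[:-1] * recall[:-1]) / np.maximum(precision[:-1] + recall[:-1], 1e-12)

optimal_threshold = thresholds[np.argmax(f1_scores)]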
print(f"Optimal Threshold: {optimal_threshold}")

y_pred_optimal = (y_prob >= optimal_threshold).astype(int)

precision_optimal = precision_score(y_test, y_pred_optimal)
recall_optimal = recall_score(y_test, y_pred_optimal)
f1_optimal = f1_score(y_test, y_pred_optimal)

print(f"Test Precision: {precision_optimal}")
print(f"Test Recall: {recall_optimal}")
print(f"Test F1 Score: {f1_optimal}")

Optimal Threshold: 0.4152416296778436
Test Precision: 0.7213822894168467
Test Recall: 0.7990430622009569
Test F1 Score: 0.7582292849035187


#### Baseline Model: At a threshold of about 0.4152, test recall rises from 0.7452 to 0.7990 while precision falls from 0.7426 to 0.7214, lifting the test F1 score from 0.7439 to about 0.7582. Because the threshold is selected on the test set itself, this F1 score is likely somewhat optimistic; selecting the threshold on the validation set would give a cleaner estimate.

### 8. Building a Simplified Model from the Most Important Features
- Select the most important features based on the feature importance analysis
- Update the features to include only the selected ones
- Train the logistic regression model with selected features
- Evaluate the model on validation data


In [24]:
important_features = [
    'placementtraining', 'extracurricularactivities', 'aptitudetestscore', 
    'hsc_marks', 'softskillsrating'
]

X_train_important = X_train[important_features]
X_val_important = X_val[important_features]
X_test_important = X_test[important_features]

log_reg_important = LogisticRegression(random_state=42)
log_reg_important.fit(X_train_important, y_train)

y_val_pred_important = log_reg_important.predict(X_val_important)

#### Evaluation metrics on the validation set

In [25]:
precision_important = precision_score(y_val, y_val_pred_important)
recall_important = recall_score(y_val, y_val_pred_important)
f1_important = f1_score(y_val, y_val_pred_important)

print("Validation Precision (Important Features):", precision_important)
print("Validation Recall (Important Features):", recall_important)
print("Validation F1 Score (Important Features):", f1_important)

Validation Precision (Important Features): 0.7426210153482881
Validation Recall (Important Features): 0.7514934289127837
Validation F1 Score (Important Features): 0.7470308788598575


#### Evaluation on the test set

In [26]:
y_test_pred_important = log_reg_important.predict(X_test_important)

precision_test_important = precision_score(y_test, y_test_pred_important)
recall_test_important = recall_score(y_test, y_test_pred_important)
f1_test_important = f1_score(y_test, y_test_pred_important)

print("Test Precision (Important Features):", precision_test_important)
print("Test Recall (Important Features):", recall_test_important)
print("Test F1 Score (Important Features):", f1_test_important)

Test Precision (Important Features): 0.7394451145958987
Test Recall (Important Features): 0.7332535885167464
Test F1 Score (Important Features): 0.7363363363363363


### 8b. Adjusting the threshold and Recomputing Metrics
- This method should help optimize the model's classification performance for the placement prediction task, since the goal is to have a better balance between both classes (placed and not placed).
  - Get predicted probabilities for the positive class (placement)
  - Calculate precision, recall, and thresholds
  - Calculate F1 scores for each threshold
  - Find the threshold that maximizes the F1 score
  - Reclassify the test set using the optimal threshold
  - Evaluate the model with the optimal threshold

In [27]:
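from sklearn.metrics import precision_recall_curve

# Note: this scores the baseline model (`log_reg`, all features), so the output repeats section 7.
# To score the reduced model, use log_reg_important.predict_proba(X_test_important)[:, 1] instead.
y_prob = log_reg.predict_proba(X_test)[:, 1]

precision, recall, thresholds = precision_recall_curve(y_test, y_prob)

# thresholds has one fewer entry than precision/recall, so drop the final point before the argmax.
f1_scores = 2 * (precision[:-1] * recall[:-1]) / np.maximum(precision[:-1] + recall[:-1], 1e-12)

optimal_threshold = thresholds[f1_scores.argmax()]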

y_test_pred_optimal = (y_prob >= optimal_threshold).astype(int)

test_precision_optimal = precision_score(y_test, y_test_pred_optimal)
test_recall_optimal = recall_score(y_test, y_test_pred_optimal)
test_f1_optimal = f1_score(y_test, y_test_pred_optimal)

print(f"Optimal Threshold: {optimal_threshold}")
print(f"Test Precision (Optimal Threshold): {test_precision_optimal}")
print(f"Test Recall (Optimal Threshold): {test_recall_optimal}")
print(f"Test F1 Score (Optimal Threshold): {test_f1_optimal}")

Optimal Threshold: 0.4152416296778436
Test Precision (Optimal Threshold): 0.7213822894168467
Test Recall (Optimal Threshold): 0.7990430622009569
Test F1 Score (Optimal Threshold): 0.7582292849035187


#### The optimal threshold and the resulting metrics are identical to those in section 7 because the cell above calls `log_reg.predict_proba(X_test)`: it re-evaluates the baseline model on all features rather than `log_reg_important` on `X_test_important`. These numbers therefore say nothing about the reduced model. The comparison available so far is at the default 0.5 threshold, where the important-features model reaches a test F1 score of 0.7363 against 0.7439 for the baseline.

### 9. Further Tests: Analyzing the Impact of Feature Selection
- Comparing Training Time between baseline model and important features model
  - Measure training time for baseline model
  - Measure training time for important features model
- Comparing Cross Validation Scores between baseline model and important features model
  - Cross-validation for baseline model
  - Cross-validation for important features model

In [28]:
import time

start_time = time.time()
log_reg.fit(X_train, y_train)
baseline_training_time = time.time() - start_time

start_time = time.time()
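# Fit the reduced model's own estimator; refitting `log_reg` here would overwrite the baseline model.
log_reg_important.fit(X_train_important, y_train)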
important_features_training_time = time.time() - start_time

print(f"Baseline Model Training Time: {baseline_training_time:.4f} seconds")
print(f"Important Features Model Training Time: {important_features_training_time:.4f} seconds")

Baseline Model Training Time: 0.0079 seconds
Important Features Model Training Time: 0.0050 seconds
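
Single measurements of a few milliseconds are easily dominated by noise. A slightly more robust comparison repeats each fit several times and takes the median. The sketch below shows the idea with a stand-in workload; `median_runtime` is a hypothetical helper, and the `repeats=20` default is an arbitrary choice.

```python
import time
from statistics import median


def median_runtime(fn, repeats=20):
    """Median wall-clock time of fn() over several runs (hypothetical helper)."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return median(durations)


# Stand-in workload; in the notebook this could be something like
# lambda: LogisticRegression(random_state=42).fit(X_train, y_train)
elapsed = median_runtime(lambda: sum(i * i for i in range(10_000)))
print(f"median runtime: {elapsed:.6f} seconds")
```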


In [29]:
from sklearn.model_selection import cross_val_score

baseline_cv_score = cross_val_score(log_reg, X_train, y_train, cv=5, scoring='f1')

important_features_cv_score = cross_val_score(log_reg, X_train_important, y_train, cv=5, scoring='f1')

print(f"Baseline Model Cross-validation F1 Scores: {baseline_cv_score}")
print(f"Important Features Model Cross-validation F1 Scores: {important_features_cv_score}")

Baseline Model Cross-validation F1 Scores: [0.77475728 0.74441205 0.77091633 0.79919679 0.76      ]
Important Features Model Cross-validation F1 Scores: [0.76246334 0.73724735 0.772      0.77822581 0.74801587]
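
The fold scores are easier to compare once summarized. The sketch below copies the values from the output above and computes the mean and sample standard deviation for each model.

```python
from statistics import fmean, stdev

# Fold scores copied from the cross-validation output above.
baseline_f1 = [0.77475728, 0.74441205, 0.77091633, 0.79919679, 0.76]
important_f1 = [0.76246334, 0.73724735, 0.772, 0.77822581, 0.74801587]

for name, scores in [("baseline", baseline_f1), ("important features", important_f1)]:
    print(f"{name}: mean={fmean(scores):.4f}, std={stdev(scores):.4f}")
```

The means come out at roughly 0.770 and 0.760. The difference of about 0.01 is smaller than the fold-to-fold spread of either model, so the gap should be read with some caution.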


### Conclusion 1
- The model with important features trained faster in this run (0.0050 s vs. 0.0079 s), but both figures are single measurements of a few milliseconds, so the difference is within timing noise at this dataset size. Fewer features may still help scalability as the dataset grows, which can matter in deployment or real-time systems.
- The baseline model has a slightly higher mean cross-validation F1 score than the important features model (about 0.770 vs. 0.760), meaning it balances precision and recall slightly better overall.
- The drop in F1 score for the important features model is not drastic, indicating that the feature selection process still yields a model that performs fairly well.

### 10. Training Other Models for Comparison with the Important Features Model Before Making a Final Choice

#### 10a. Train Random Forest Model
- Initialize and train the Random Forest model
- Predict probabilities on the validation set
- Apply the default 0.5 threshold
- Compute evaluation metrics

In [30]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score, f1_score

rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

y_val_prob_rf = rf_model.predict_proba(X_val)[:, 1]

y_val_pred_rf = (y_val_prob_rf >= 0.5).astype(int)

rf_precision = precision_score(y_val, y_val_pred_rf)
rf_recall = recall_score(y_val, y_val_pred_rf)
rf_f1 = f1_score(y_val, y_val_pred_rf)

print(f"Random Forest - Validation Precision: {rf_precision:.4f}")
print(f"Random Forest - Validation Recall: {rf_recall:.4f}")
print(f"Random Forest - Validation F1 Score: {rf_f1:.4f}")

Random Forest - Validation Precision: 0.7399
Random Forest - Validation Recall: 0.7037
Random Forest - Validation F1 Score: 0.7214


#### 10b. Train XGBoost Model
- Initialize and train the XGBoost model
- Predict probabilities on the validation set
- Apply the default 0.5 threshold
- Compute evaluation metrics

In [31]:
!pip install xgboost
from xgboost import XGBClassifier

xgb_model = XGBClassifier(use_label_encoder=False, eval_metric="logloss", random_state=42)
xgb_model.fit(X_train, y_train)

y_val_prob_xgb = xgb_model.predict_proba(X_val)[:, 1]

y_val_pred_xgb = (y_val_prob_xgb >= 0.5).astype(int)

xgb_precision = precision_score(y_val, y_val_pred_xgb)
xgb_recall = recall_score(y_val, y_val_pred_xgb)
xgb_f1 = f1_score(y_val, y_val_pred_xgb)

print(f"XGBoost - Validation Precision: {xgb_precision:.4f}")
print(f"XGBoost - Validation Recall: {xgb_recall:.4f}")
print(f"XGBoost - Validation F1 Score: {xgb_f1:.4f}")

XGBoost - Validation Precision: 0.7332
XGBoost - Validation Recall: 0.7025
XGBoost - Validation F1 Score: 0.7175


Parameters: { "use_label_encoder" } are not used.



### Conclusion 2: Both Random Forest and XGBoost are performing slightly worse than the logistic regression model with important features in terms of validation F1 score:

Logistic Regression (Important Features): F1 = 0.7470  
Random Forest: F1 = 0.7214  
XGBoost: F1 = 0.7175  
This suggests that the logistic regression model with important features is not only simpler but also more effective for this dataset.

#### 10c. Evaluating both models on the test set

In [32]:
from sklearn.metrics import precision_score, recall_score, f1_score

# Predictions for Random Forest
y_test_pred_rf = rf_model.predict(X_test)
rf_precision = precision_score(y_test, y_test_pred_rf)
rf_recall = recall_score(y_test, y_test_pred_rf)
rf_f1 = f1_score(y_test, y_test_pred_rf)

# Predictions for XGBoost
y_test_pred_xgb = xgb_model.predict(X_test)
xgb_precision = precision_score(y_test, y_test_pred_xgb)
xgb_recall = recall_score(y_test, y_test_pred_xgb)
xgb_f1 = f1_score(y_test, y_test_pred_xgb)

# Display results
print(f"Random Forest - Test Precision: {rf_precision:.4f}")
print(f"Random Forest - Test Recall: {rf_recall:.4f}")
print(f"Random Forest - Test F1 Score: {rf_f1:.4f}")
print("-" * 40)
print(f"XGBoost - Test Precision: {xgb_precision:.4f}")
print(f"XGBoost - Test Recall: {xgb_recall:.4f}")
print(f"XGBoost - Test F1 Score: {xgb_f1:.4f}")

Random Forest - Test Precision: 0.7519
Random Forest - Test Recall: 0.7069
Random Forest - Test F1 Score: 0.7287
----------------------------------------
XGBoost - Test Precision: 0.7357
XGBoost - Test Recall: 0.7057
XGBoost - Test F1 Score: 0.7204


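### Conclusion 3: As expected, both Random Forest and XGBoost have lower test F1 scores than the Logistic Regression models at the default 0.5 threshold.
- Logistic Regression (Important Features): 0.7363
- Logistic Regression (Baseline, all features): 0.7439
- Random Forest: 0.7287
- XGBoost: 0.7204
- The 0.7582 reported in section 7 comes from the baseline model with its threshold (0.4152) tuned on the test set, so it is not directly comparable with the 0.5-threshold scores above.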

### Takeaways:
- Random Forest and XGBoost did not outperform the Logistic Regression model.
- Logistic Regression (Important Features) remains a reasonable choice: it is simpler, uses half the features, and its test F1 score (0.7363) is within about 0.01 of the full-feature baseline (0.7439) while staying ahead of both tree-based models.
- If interpretability matters, logistic regression is the clear winner since it's easy to explain, while tree-based models like Random Forest and XGBoost are more complex.