State your analytical question for your MLS: Our goal is to predict PlacementStatus (whether a student is "Placed" or "NotPlaced") based on:

CGPA
has_done_extracurricular (0 = No, 1 = Yes)
has_done_placementtraining (0 = No, 1 = Yes)

In [4]:
import pandas as pd
import numpy as np
import seaborn as sns

import joblib
from sklearn.model_selection import train_test_split

In [5]:
# Load the uploaded dataset
file_path = "/Users/adamkhay/Desktop/data engineering/cleaned_data.csv"
df = pd.read_csv(file_path)

# Display basic info
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 12 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   StudentID                   10000 non-null  int64  
 1   CGPA                        10000 non-null  float64
 2   Internships                 10000 non-null  int64  
 3   Projects                    10000 non-null  int64  
 4   Workshops_Certifications    10000 non-null  int64  
 5   AptitudeTestScore           10000 non-null  int64  
 6   SoftSkillsRating            10000 non-null  float64
 7   SSC_Marks                   10000 non-null  int64  
 8   HSC_Marks                   10000 non-null  int64  
 9   PlacementStatus             10000 non-null  object 
 10  has_done_extracurricular    10000 non-null  int64  
 11  has_done_placementtraining  10000 non-null  int64  
dtypes: float64(2), int64(9), object(1)
memory usage: 937.6+ KB


In [6]:
# Display first few rows
df.head()

   StudentID  CGPA  Internships  Projects  Workshops_Certifications  AptitudeTestScore  SoftSkillsRating  SSC_Marks  HSC_Marks  PlacementStatus  has_done_extracurricular  has_done_placementtraining
0          1   7.5            1         1                         1                 65               4.4         61         79        NotPlaced                         0                           0
1          2   8.9            0         3                         2                 90               4.0         78         82           Placed                         1                           1
2          3   7.3            1         2                         2                 82               4.8         79         80        NotPlaced                         1                           0
3          4   7.5            1         1                         2                 85               4.4         81         80           Placed                         1                           1
4          5   8.3            1         2                         2                 86               4.5         74         88           Placed                         1                           1
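Because the target is a two-class label and the later splits are stratified on it, it can help to look at the class balance before modelling. The sketch below is only an illustration: `demo` is a hypothetical five-row stand-in, not the notebook's data; in the notebook the same calls would be made on `df`.

```python
import pandas as pd

# Hypothetical stand-in for df (assumption: only the target column matters here).
demo = pd.DataFrame(
    {"PlacementStatus": ["Placed", "NotPlaced", "NotPlaced", "Placed", "NotPlaced"]}
)

# Absolute counts per class.
counts = demo["PlacementStatus"].value_counts()

# Relative frequencies per class (they sum to 1).
shares = demo["PlacementStatus"].value_counts(normalize=True)

print(counts)
print(shares.round(2))
```

If one class clearly dominates, accuracy alone can look good even for a model that always predicts that class.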


3-Way splitting of the data

The features I use in my predictive model are "has_done_extracurricular", "has_done_placementtraining", and "SoftSkillsRating", because this combination gave me the highest accuracy score after iterating through steps 2-4. Note that SoftSkillsRating replaces CGPA, which was listed in the analytical question above.
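One way to compare candidate feature lists without touching the test set is to score each of them on the validation split. The sketch below is a hedged illustration, not the procedure actually used in this notebook: the data is synthetic, and the names `demo`, `candidates`, and `val_scores` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: NOT the notebook's dataset).
rng = np.random.default_rng(0)
n = 600
demo = pd.DataFrame({
    "CGPA": rng.uniform(6.5, 9.1, n).round(1),
    "SoftSkillsRating": rng.uniform(3.0, 4.8, n).round(1),
    "has_done_extracurricular": rng.integers(0, 2, n),
    "has_done_placementtraining": rng.integers(0, 2, n),
})
# Made-up label rule, purely so the example has some signal to learn.
placed_prob = 0.2 + 0.5 * demo["has_done_placementtraining"]
demo["PlacementStatus"] = np.where(rng.random(n) < placed_prob, "Placed", "NotPlaced")

# Two candidate feature lists to compare.
candidates = {
    "with_CGPA": ["CGPA", "has_done_extracurricular", "has_done_placementtraining"],
    "with_SoftSkillsRating": ["has_done_extracurricular", "has_done_placementtraining", "SoftSkillsRating"],
}

# 70 / 15 / 15 split, stratified on the target.
train_df, temp_df = train_test_split(
    demo, test_size=0.3, random_state=42, stratify=demo["PlacementStatus"])
test_df, val_df = train_test_split(
    temp_df, test_size=0.5, random_state=42, stratify=temp_df["PlacementStatus"])

# Fit one model per candidate and score it on the validation split only.
val_scores = {}
for name, cols in candidates.items():
    clf = RandomForestClassifier(n_estimators=100, random_state=42)
    clf.fit(train_df[cols], train_df["PlacementStatus"])
    val_scores[name] = accuracy_score(val_df["PlacementStatus"], clf.predict(val_df[cols]))

best = max(val_scores, key=val_scores.get)
print(val_scores)
print("best on validation:", best)
```

Keeping the test split out of this loop means the final test accuracy is measured on data that played no part in choosing the features.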

In [7]:
feature_cols = ["has_done_extracurricular", "has_done_placementtraining", "SoftSkillsRating"]
prediction_col = "PlacementStatus"


In [8]:
# Build the feature matrix X and the target vector y

X = df[feature_cols]
y = df[prediction_col]

In [9]:
# Train-validation-test split, first split: 70% for training, 30% held out
X_train, X_temp, y_train, y_temp = train_test_split(X, y, random_state=42, test_size=0.3, stratify=y)

In [10]:
X_temp.info()

<class 'pandas.core.frame.DataFrame'>
Index: 3000 entries, 4258 to 7089
Data columns (total 3 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   has_done_extracurricular    3000 non-null   int64  
 1   has_done_placementtraining  3000 non-null   int64  
 2   SoftSkillsRating            3000 non-null   float64
dtypes: float64(1), int64(2)
memory usage: 93.8 KB


In [11]:
X_temp.head()

      has_done_extracurricular  has_done_placementtraining  SoftSkillsRating
4258                         0                           0               3.9
1687                         1                           0               4.8
6130                         1                           1               4.8
5761                         0                           1               4.1
9587                         1                           0               4.6


In [12]:
y_temp.info()

<class 'pandas.core.series.Series'>
Index: 3000 entries, 4258 to 7089
Series name: PlacementStatus
Non-Null Count  Dtype 
--------------  ----- 
3000 non-null   object
dtypes: object(1)
memory usage: 46.9+ KB


In [13]:
y_temp.head()

4258    NotPlaced
1687    NotPlaced
6130    NotPlaced
5761    NotPlaced
9587       Placed
Name: PlacementStatus, dtype: object

In [14]:
# Train-validation-test split, second split: halve the held-out 30% into test and validation
X_test, X_val, y_test, y_val = train_test_split(X_temp, y_temp, random_state=42, test_size=0.5, stratify=y_temp)

In [15]:
print(X_test.shape)
X_test.head()

(1500, 3)


      has_done_extracurricular  has_done_placementtraining  SoftSkillsRating
3891                         0                           1               3.8
7136                         1                           1               4.4
1823                         0                           0               4.0
1560                         1                           1               3.9
104                          0                           0               4.1


In [16]:
print(X_val.shape)
X_val.head()

(1500, 3)


      has_done_extracurricular  has_done_placementtraining  SoftSkillsRating
9232                         0                           1               3.8
4178                         1                           1               4.8
6582                         1                           1               4.6
4405                         1                           1               4.8
388                          0                           0               4.4


In [17]:
print(y_test.shape)
y_test.head()

(1500,)


3891    NotPlaced
7136    NotPlaced
1823    NotPlaced
1560    NotPlaced
104     NotPlaced
Name: PlacementStatus, dtype: object

In [18]:
print(y_val.shape)
y_val.head()

(1500,)


9232    NotPlaced
4178       Placed
6582       Placed
4405    NotPlaced
388        Placed
Name: PlacementStatus, dtype: object
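The two-step split above can be wrapped in a small helper and checked for size and class balance. This is a minimal sketch under stated assumptions: `three_way_split` is a hypothetical helper name, and the data is synthetic rather than the notebook's dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def three_way_split(X, y, holdout_size=0.3, random_state=42):
    """Return train / validation / test parts, stratified on y.

    holdout_size is the share taken out of training; it is then halved
    into validation and test, mirroring the two calls in the notebook.
    """
    X_train, X_temp, y_train, y_temp = train_test_split(
        X, y, test_size=holdout_size, random_state=random_state, stratify=y)
    X_test, X_val, y_test, y_val = train_test_split(
        X_temp, y_temp, test_size=0.5, random_state=random_state, stratify=y_temp)
    return X_train, X_val, X_test, y_train, y_val, y_test


# Synthetic stand-in data (assumption: NOT the notebook's dataset).
rng = np.random.default_rng(0)
n = 1000
X_demo = pd.DataFrame({
    "has_done_extracurricular": rng.integers(0, 2, n),
    "has_done_placementtraining": rng.integers(0, 2, n),
    "SoftSkillsRating": rng.uniform(3.0, 4.8, n).round(1),
})
y_demo = pd.Series(
    np.where(rng.random(n) < 0.4, "Placed", "NotPlaced"), name="PlacementStatus")

X_tr, X_va, X_te, y_tr, y_va, y_te = three_way_split(X_demo, y_demo)

# Report the size and the share of "Placed" in each part.
overall_share = (y_demo == "Placed").mean()
for name, part in [("train", y_tr), ("validation", y_va), ("test", y_te)]:
    print(name, len(part), round((part == "Placed").mean(), 3))
```

With `stratify` set, the share of "Placed" in each part should stay close to the share in the full data, and no row index should appear in more than one part.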

Training: building the model (random forest classifier)

In [20]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

In [21]:
# Train a Random Forest classifier using training set
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

In [22]:
# Predict labels for the held-out test set
pred_X_test = model.predict(X_test)

In [23]:
# Share of test-set predictions that match the true labels
accuracy_score(y_test, pred_X_test)

0.78
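A single accuracy figure is easier to interpret next to a majority-class baseline and a per-class breakdown. The sketch below shows one way to produce both with scikit-learn. It runs on synthetic stand-in data, so its numbers say nothing about the 0.78 reported above; `X_demo`, `y_demo`, and `baseline` are hypothetical names.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: NOT the notebook's dataset).
rng = np.random.default_rng(0)
n = 600
X_demo = pd.DataFrame({
    "has_done_extracurricular": rng.integers(0, 2, n),
    "has_done_placementtraining": rng.integers(0, 2, n),
    "SoftSkillsRating": rng.uniform(3.0, 4.8, n).round(1),
})
# Made-up label rule, purely so the example has some signal to learn.
placed_prob = 0.2 + 0.5 * X_demo["has_done_placementtraining"]
y_demo = pd.Series(
    np.where(rng.random(n) < placed_prob, "Placed", "NotPlaced"), name="PlacementStatus")

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.15, random_state=42, stratify=y_demo)

clf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)
# Baseline that always predicts the most frequent training class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)

test_pred = clf.predict(X_te)
labels = ["NotPlaced", "Placed"]

model_acc = accuracy_score(y_te, test_pred)
baseline_acc = accuracy_score(y_te, baseline.predict(X_te))
# Rows are true classes, columns are predicted classes, in the order of `labels`.
cm = confusion_matrix(y_te, test_pred, labels=labels)

print("model accuracy:   ", round(model_acc, 3))
print("baseline accuracy:", round(baseline_acc, 3))
print(cm)
print(classification_report(y_te, test_pred, labels=labels, zero_division=0))
```

If the model's accuracy is only slightly above the baseline, the chosen features add little beyond the class frequencies.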

Saving the optimal model

In [24]:
# Save the model using joblib

joblib.dump(model, "digits_model.joblib")

['digits_model.joblib']
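To reuse the saved model later, it can be loaded back with `joblib.load` and given new rows that have the same feature columns in the same order. The sketch below is a self-contained round trip on synthetic data. The file name `demo_model.joblib` and the two example students are hypothetical; in the notebook the path would be the one passed to `joblib.dump` above.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

feature_cols = ["has_done_extracurricular", "has_done_placementtraining", "SoftSkillsRating"]

# Synthetic stand-in training data (assumption: NOT the notebook's dataset).
rng = np.random.default_rng(0)
n = 200
X_demo = pd.DataFrame({
    "has_done_extracurricular": rng.integers(0, 2, n),
    "has_done_placementtraining": rng.integers(0, 2, n),
    "SoftSkillsRating": rng.uniform(3.0, 4.8, n).round(1),
})[feature_cols]
y_demo = np.where(rng.random(n) < 0.4, "Placed", "NotPlaced")

clf = RandomForestClassifier(n_estimators=10, random_state=42).fit(X_demo, y_demo)

# Save to a temporary directory and load the model back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.joblib")  # hypothetical file name
    joblib.dump(clf, path)
    restored = joblib.load(path)

# Two made-up students, using the same columns the model was trained on.
new_students = pd.DataFrame([[1, 1, 4.5], [0, 0, 3.2]], columns=feature_cols)
restored_pred = restored.predict(new_students)
print(restored_pred)
```

The restored model should give the same predictions as the original one; loading generally works best with the same scikit-learn version that was used for saving.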