[Link to Kaggle Dataset](https://www.kaggle.com/datasets/benroshan/factors-affecting-campus-placement)

In [29]:
import pandas as pd

In [30]:
data = pd.read_csv('Datasets/Placement_Data_Full_Class.csv')
data.head()

Unnamed: 0,sl_no,gender,ssc_p,ssc_b,hsc_p,hsc_b,hsc_s,degree_p,degree_t,workex,etest_p,specialisation,mba_p,status,salary
0,1,M,67.0,Others,91.0,Others,Commerce,58.0,Sci&Tech,No,55.0,Mkt&HR,58.8,Placed,270000.0
1,2,M,79.33,Central,78.33,Others,Science,77.48,Sci&Tech,Yes,86.5,Mkt&Fin,66.28,Placed,200000.0
2,3,M,65.0,Central,68.0,Central,Arts,64.0,Comm&Mgmt,No,75.0,Mkt&Fin,57.8,Placed,250000.0
3,4,M,56.0,Central,52.0,Central,Science,52.0,Sci&Tech,No,66.0,Mkt&HR,59.43,Not Placed,
4,5,M,85.8,Central,73.6,Central,Commerce,73.3,Comm&Mgmt,No,96.8,Mkt&Fin,55.5,Placed,425000.0


In [31]:
data.describe()

Unnamed: 0,sl_no,ssc_p,hsc_p,degree_p,etest_p,mba_p,salary
count,215.0,215.0,215.0,215.0,215.0,215.0,148.0
mean,108.0,67.303395,66.333163,66.370186,72.100558,62.278186,288655.405405
std,62.209324,10.827205,10.897509,7.358743,13.275956,5.833385,93457.45242
min,1.0,40.89,37.0,50.0,50.0,51.21,200000.0
25%,54.5,60.6,60.9,61.0,60.0,57.945,240000.0
50%,108.0,67.0,65.0,66.0,71.0,62.0,265000.0
75%,161.5,75.7,73.0,72.0,83.5,66.255,300000.0
max,215.0,89.4,97.7,91.0,98.0,77.89,940000.0


In [32]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 215 entries, 0 to 214
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   sl_no           215 non-null    int64  
 1   gender          215 non-null    object 
 2   ssc_p           215 non-null    float64
 3   ssc_b           215 non-null    object 
 4   hsc_p           215 non-null    float64
 5   hsc_b           215 non-null    object 
 6   hsc_s           215 non-null    object 
 7   degree_p        215 non-null    float64
 8   degree_t        215 non-null    object 
 9   workex          215 non-null    object 
 10  etest_p         215 non-null    float64
 11  specialisation  215 non-null    object 
 12  mba_p           215 non-null    float64
 13  status          215 non-null    object 
 14  salary          148 non-null    float64
dtypes: float64(6), int64(1), object(8)
memory usage: 25.3+ KB


## Problem in Hand

The aim is to predict the salary offered to a candidate from the available qualifications or, failing that, at least to predict whether the candidate will be placed. Ultimately, the goal is to identify the key factors that decide whether a candidate is selected.

## Data Analysis

The column descriptions provided with the dataset are as follows:

`sl_no` : Serial Number 

`gender` : Gender- Male='M', Female='F' 

`ssc_p` : Secondary Education percentage- 10th Grade

`ssc_b` : Board of Education- Central/Others

`hsc_p` : Higher Secondary Education percentage- 12th Grade

`hsc_b` : Board of Education- Central/Others

`hsc_s` : Specialization in Higher Secondary Education

`degree_p` : Degree Percentage

`degree_t` : Under Graduation(Degree type)- Field of degree education

`workex` : Work Experience

`etest_p` : Employability test percentage (conducted by college)

`specialisation` : Post Graduation(MBA)- Specialization

`mba_p` : MBA percentage

`status` : Status of placement- Placed/Not Placed

`salary` : Salary offered by corporate to candidates


On an initial look-through, a few difficulties present themselves. The candidates come from different boards, which makes comparing their percentages on the same scale unfair and most probably inaccurate, so we will look at options that handle this appropriately. However, the board columns offer only two values, `Central` and `Others`, with no division among the non-central boards, so we have no choice but to treat all `Others` entries as equivalent.
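One possible way to put percentages from different boards on a comparable footing is to standardise each score within its own board group. The snippet below is only a sketch of that idea, not the approach the notebook commits to: the rows are made-up toy values, and the `ssc_p_rel` column name is hypothetical.

```python
import pandas as pd

# Toy rows reusing the column names from the dataset; the values are made up.
toy = pd.DataFrame({
    "ssc_b": ["Central", "Central", "Others", "Others"],
    "ssc_p": [60.0, 80.0, 70.0, 90.0],
})

# Centre each percentage on the mean of its own board group and
# scale by that group's (sample) standard deviation.
grouped = toy.groupby("ssc_b")["ssc_p"]
toy["ssc_p_rel"] = (toy["ssc_p"] - grouped.transform("mean")) / grouped.transform("std")

print(toy)
```

With only two board labels, this can correct for a Central-versus-Others offset at most; it cannot separate the individual boards hidden inside `Others`.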

An opportunity also presents itself: replacing all `NaN` values in the salary column with 0. That makes sense in literal terms, but it would heavily skew the results towards lower salaries, because the difference in qualifications between a placed and a non-placed candidate may be far smaller than the gap between 0 and a real salary would suggest.
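The size of this skew can be seen on the five salaries shown by `data.head()` above, where one candidate was not placed. This is a small illustration using only those five values, not the full dataset.

```python
import numpy as np
import pandas as pd

# Salaries of the first five rows shown by data.head(); the fourth row was not placed.
salary = pd.Series([270000.0, 200000.0, 250000.0, np.nan, 425000.0])

# pandas skips NaN by default, so this is the mean over placed candidates only.
mean_placed_only = salary.mean()

# Filling NaN with 0 counts the unplaced candidate as a zero salary.
mean_zero_filled = salary.fillna(0).mean()

print(mean_placed_only)  # 286250.0
print(mean_zero_filled)  # 229000.0
```

A single unplaced candidate out of five already pulls the mean down by more than 50,000.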

Hence, as an initial outlook, we have two options. The first is to use two distinct models, one to classify candidates as placed or not placed and one to predict the salary, and to present the salary only if the candidate is classified as placed (we will call this Plan A). Alternatively, we can use a single regression model to predict a salary and classify the candidate as placed only if the prediction is above a certain threshold (Plan B).

Plan B might seem simpler, but it can never present a salary lower than the threshold, even though such a salary might be possible. Plan A therefore seems the more sensible choice for now: it is more flexible, since it lets Status (Placed/Not Placed) and Salary depend on different factors.
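A minimal sketch of how Plan A could be wired together is given below. It assumes a linear SVM as the classifier and ordinary linear regression for the salary; both model choices, the single made-up feature, and the helper name `predict_plan_a` are illustrative assumptions rather than the final design.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVC

# Made-up toy data: one feature (a score), placement status, and salary.
X = np.array([[50.0], [55.0], [60.0], [70.0], [80.0], [90.0]])
placed = np.array([0, 0, 0, 1, 1, 1])
salary = np.array([np.nan, np.nan, np.nan, 220000.0, 260000.0, 300000.0])

# Stage 1: the classifier is trained on every candidate.
clf = SVC(kernel="linear").fit(X, placed)

# Stage 2: the regressor is trained only on candidates who were placed,
# so no artificial zero salaries enter the fit.
mask = placed == 1
reg = LinearRegression().fit(X[mask], salary[mask])


def predict_plan_a(features):
    """Return (status, salary); salary is NaN wherever status is 0."""
    status = clf.predict(features)
    pay = np.where(status == 1, reg.predict(features), np.nan)
    return status, pay


status, pay = predict_plan_a(np.array([[52.0], [85.0]]))
print(status, pay)
```

Because the two stages are fitted separately, each can later be given its own feature set, which is the flexibility argued for above.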

In [33]:
#Visualisation to be inserted

## Encoding the Data

In [34]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_transformer
import numpy as np

In [35]:
# .copy() keeps the raw data intact; plain assignment would only alias the same frame
copydata = data.copy()

In [36]:
# Encode the two binary columns as 0/1 integers
data['status'] = data['status'].map({'Placed': 1, 'Not Placed': 0})
data['workex'] = data['workex'].map({'Yes': 1, 'No': 0})

In [37]:
categorical = ['gender', 'ssc_b', 'hsc_b', 'hsc_s', 'degree_t', 'specialisation']
# One-hot encode the categorical columns; pass the remaining columns through unchanged
transformer = make_column_transformer((OneHotEncoder(), categorical), remainder='passthrough')
data = pd.DataFrame(transformer.fit_transform(data), columns=transformer.get_feature_names_out())

In [38]:
data.head()

Unnamed: 0,onehotencoder__gender_F,onehotencoder__gender_M,onehotencoder__ssc_b_Central,onehotencoder__ssc_b_Others,onehotencoder__hsc_b_Central,onehotencoder__hsc_b_Others,onehotencoder__hsc_s_Arts,onehotencoder__hsc_s_Commerce,onehotencoder__hsc_s_Science,onehotencoder__degree_t_Comm&Mgmt,...,onehotencoder__specialisation_Mkt&HR,remainder__sl_no,remainder__ssc_p,remainder__hsc_p,remainder__degree_p,remainder__workex,remainder__etest_p,remainder__mba_p,remainder__status,remainder__salary
0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,...,1.0,1.0,67.0,91.0,58.0,0.0,55.0,58.8,1.0,270000.0
1,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,...,0.0,2.0,79.33,78.33,77.48,1.0,86.5,66.28,1.0,200000.0
2,0.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,...,0.0,3.0,65.0,68.0,64.0,0.0,75.0,57.8,1.0,250000.0
3,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,...,1.0,4.0,56.0,52.0,52.0,0.0,66.0,59.43,0.0,
4,0.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,...,0.0,5.0,85.8,73.6,73.3,0.0,96.8,55.5,1.0,425000.0


The classification of Placed or Not Placed can be done using SVM.
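A rough sketch of such a classifier follows. Only the column names are taken from the encoded frame above; the values are randomly generated, and the RBF kernel, the scaling step, and the 75/25 split are assumptions made for illustration. Note that `remainder__salary` has to be dropped from the features, since it is only present for placed candidates and would leak the label.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the encoded frame; only the column names come from
# the notebook, the values are randomly generated for illustration.
rng = np.random.default_rng(0)
n = 60
encoded = pd.DataFrame({
    "remainder__sl_no": np.arange(1, n + 1, dtype=float),
    "remainder__ssc_p": rng.uniform(40, 90, n),
    "remainder__mba_p": rng.uniform(50, 78, n),
})
encoded["remainder__status"] = (encoded["remainder__ssc_p"] > 60).astype(float)
encoded["remainder__salary"] = np.where(
    encoded["remainder__status"] == 1, 250000.0, np.nan
)

# Drop the target, the serial number, and the salary from the features.
excluded = ["remainder__status", "remainder__salary", "remainder__sl_no"]
X = encoded.drop(columns=excluded)
y = encoded["remainder__status"].astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# SVMs are sensitive to feature scale, so standardise before fitting.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print(f"held-out accuracy: {accuracy:.2f}")
```

On the real data, `encoded` would be replaced by the `data` frame produced by the column transformer.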

###### References used: [1](https://datagy.io/sklearn-one-hot-encode/)