<font size="+2" color="white"> Campus Recruitment analysis</font>

<img src="https://www.processmaker.com/wp-content/uploads/2020/03/20689.jpg" title="Campus"/>

Importing Python libraries

In [2]:
import numpy as np
import pandas as pd
import seaborn as sns

Getting the data

In [3]:
data = pd.read_csv("data/Placement_Data_Full_Class.csv")
data.head(5)

Unnamed: 0,sl_no,gender,ssc_p,ssc_b,hsc_p,hsc_b,hsc_s,degree_p,degree_t,workex,etest_p,specialisation,mba_p,status,salary
0,1,M,67.0,Others,91.0,Others,Commerce,58.0,Sci&Tech,No,55.0,Mkt&HR,58.8,Placed,270000.0
1,2,M,79.33,Central,78.33,Others,Science,77.48,Sci&Tech,Yes,86.5,Mkt&Fin,66.28,Placed,200000.0
2,3,M,65.0,Central,68.0,Central,Arts,64.0,Comm&Mgmt,No,75.0,Mkt&Fin,57.8,Placed,250000.0
3,4,M,56.0,Central,52.0,Central,Science,52.0,Sci&Tech,No,66.0,Mkt&HR,59.43,Not Placed,
4,5,M,85.8,Central,73.6,Central,Commerce,73.3,Comm&Mgmt,No,96.8,Mkt&Fin,55.5,Placed,425000.0


## Description of dataset

### Factors influencing Employability

The dataset used in this project consists of campus placement data for students.

It includes secondary and higher secondary school percentages and specialisation, the degree percentage and type, the MBA specialisation and percentage, work experience, and the salary offered to placed students.

The dataset is provided by Jain University Bangalore and can be found <a href="https://www.kaggle.com/datasets/benroshan/factors-affecting-campus-placement" title="Campus Recruitment">here</a>.

The data covers 215 students and 15 variables.

Description of the variables:

- **sl_no**: Serial number
- **gender**: Gender (Male = 'M', Female = 'F')
- **ssc_p**: Secondary education percentage (10th grade)
- **ssc_b**: Board of education (Central/Others)
- **hsc_p**: Higher secondary education percentage (12th grade)
- **hsc_b**: Board of education (Central/Others)
- **hsc_s**: Specialisation in higher secondary education
- **degree_p**: Degree percentage
- **degree_t**: Undergraduate degree type (field of degree education)
- **workex**: Work experience
- **etest_p**: Employability test percentage (conducted by the college)
- **specialisation**: Postgraduate (MBA) specialisation
- **mba_p**: MBA percentage
- **status**: Placement status (Placed/Not Placed)
- **salary**: Annual salary offered to candidates by the employer (in Indian rupees)
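
The variables listed above fall into two groups, numeric and categorical. As a minimal sketch, assuming a stand-in frame called `sample` (a hypothetical name) rebuilt from the five rows printed by `data.head(5)` rather than the full dataset, the split could be checked like this:

```python
import numpy as np
import pandas as pd

# Stand-in for the notebook's `data`: only the five rows shown by data.head(5)
sample = pd.DataFrame({
    "sl_no": [1, 2, 3, 4, 5],
    "gender": ["M", "M", "M", "M", "M"],
    "ssc_p": [67.0, 79.33, 65.0, 56.0, 85.8],
    "ssc_b": ["Others", "Central", "Central", "Central", "Central"],
    "hsc_p": [91.0, 78.33, 68.0, 52.0, 73.6],
    "hsc_b": ["Others", "Others", "Central", "Central", "Central"],
    "hsc_s": ["Commerce", "Science", "Arts", "Science", "Commerce"],
    "degree_p": [58.0, 77.48, 64.0, 52.0, 73.3],
    "degree_t": ["Sci&Tech", "Sci&Tech", "Comm&Mgmt", "Sci&Tech", "Comm&Mgmt"],
    "workex": ["No", "Yes", "No", "No", "No"],
    "etest_p": [55.0, 86.5, 75.0, 66.0, 96.8],
    "specialisation": ["Mkt&HR", "Mkt&Fin", "Mkt&Fin", "Mkt&HR", "Mkt&Fin"],
    "mba_p": [58.8, 66.28, 57.8, 59.43, 55.5],
    "status": ["Placed", "Placed", "Placed", "Not Placed", "Placed"],
    "salary": [270000.0, 200000.0, 250000.0, np.nan, 425000.0],
})

# Columns holding numbers (integers and floats)
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
# Every other column is treated as categorical here
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()

print(len(sample.columns), "columns in total")
print("numeric:", numeric_cols)
print("categorical:", categorical_cols)
```

Run against the notebook's `data`, the same two `select_dtypes` calls should agree with the dtype summary reported by `data.info()`.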

In [8]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 215 entries, 0 to 214
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   sl_no           215 non-null    int64  
 1   gender          215 non-null    object 
 2   ssc_p           215 non-null    float64
 3   ssc_b           215 non-null    object 
 4   hsc_p           215 non-null    float64
 5   hsc_b           215 non-null    object 
 6   hsc_s           215 non-null    object 
 7   degree_p        215 non-null    float64
 8   degree_t        215 non-null    object 
 9   workex          215 non-null    object 
 10  etest_p         215 non-null    float64
 11  specialisation  215 non-null    object 
 12  mba_p           215 non-null    float64
 13  status          215 non-null    object 
 14  salary          148 non-null    float64
dtypes: float64(6), int64(1), object(8)
memory usage: 25.3+ KB


## Purpose of the Analysis

The purpose of the analysis is to find out how a student's "status" ("Placed"/"Not Placed") and "salary" relate to the study path followed, the grades obtained along the way, and previous work experience.

## Data Cleaning
The values in "salary" are very large because they are expressed in Indian rupees, a unit of relatively low value.
Let's transform the "salary" variable into "salary_k" by dividing the current values by 1000.

In [4]:
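# express the salary in thousands of rupees, rounded to 2 decimals
data["salary"] = (data["salary"] / 1000).round(2)
# rename the column to reflect the new unit
data.rename(columns={"salary": "salary_k"}, inplace=True)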

Let's check the missing values for each column

In [7]:
data.isna().sum() # flag every NaN value in the dataframe and count them per column
# 67 NaN values in the "salary_k" column

sl_no              0
gender             0
ssc_p              0
ssc_b              0
hsc_p              0
hsc_b              0
hsc_s              0
degree_p           0
degree_t           0
workex             0
etest_p            0
specialisation     0
mba_p              0
status             0
salary_k          67
dtype: int64

In [9]:
# it seems that all NaN values are related to "Not Placed" values in the "status" column...
data.loc[data["salary_k"].isna(), ["status", "salary_k"]]

Unnamed: 0,status,salary_k
3,Not Placed,
5,Not Placed,
6,Not Placed,
9,Not Placed,
12,Not Placed,
...,...,...
198,Not Placed,
201,Not Placed,
206,Not Placed,
208,Not Placed,


In [13]:
# let's check whether the NaN values are all due to the "Not Placed" status with a simple test:
# if the number of "Not Placed" rows is 67, the same as the number of NaN values
# in the "salary_k" column, then the hypothesis is consistent with the data
data.loc[data["status"]=="Not Placed", "status"].count()

67
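
Equal counts are a necessary condition but not a sufficient one: 67 missing salaries and 67 "Not Placed" rows could in principle belong to different students. A stricter check compares the two conditions row by row. The sketch below assumes a stand-in frame called `sample` (a hypothetical name) built from the five rows shown by `data.head(5)`, with salaries already converted to thousands; it is not the full dataset.

```python
import numpy as np
import pandas as pd

# Stand-in for the notebook's `data`: the five rows shown by data.head(5),
# reduced to the two columns involved in the check
sample = pd.DataFrame({
    "status": ["Placed", "Placed", "Placed", "Not Placed", "Placed"],
    "salary_k": [270.0, 200.0, 250.0, np.nan, 425.0],
})

# Boolean mask: rows where the salary is missing
missing_salary = sample["salary_k"].isna()
# Boolean mask: rows where the student was not placed
not_placed = sample["status"] == "Not Placed"

# True only if the two masks agree on every single row
masks_match = bool((missing_salary == not_placed).all())
print(masks_match)
```

On the notebook's `data`, the same comparison would use `data["salary_k"].isna()` and `data["status"] == "Not Placed"`. A `True` result would mean that every missing salary belongs to a student who was not placed, and the other way around.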