# Campus Placement Data – Initial Exploration

## Objective
The goal of this notebook is to **understand the structure, quality, and limitations** of the campus placement dataset *before* performing any cleaning or analysis.

This step is critical in real-world data science projects because:
- Incorrect assumptions at this stage can invalidate later insights
- Many columns may look useful but offer no analytical value
- Data inconsistencies must be identified early

No transformations are performed in this notebook.


In [14]:
import pandas as pd
import numpy as np

pd.set_option("display.max_columns", None)

df = pd.read_csv('C:\\Users\\Urja Srivastava\\OneDrive\\UrjaProjects\\Data2-CampusPlacementEDA\\data\\placements.csv')

df.head()

,Email,Name,Gender,10th board,10th marks,12th board,12th marks,Stream,Cgpa,Internships(Y/N),Training(Y/N),Backlog in 5th sem,Innovative Project(Y/N),Communication level,Technical Course(Y/N),Placement(Y/N)?
0,payal_roy79@gmail.com,Payal Roy,Female,State Board,96.7,CBSE,70.2,Mechanical Engineering,7.37,No,Yes,No,No,3,Yes,Not Placed
1,shreyoshi_dey13@gmail.com,Shreyoshi Dey,Female,WBBSE,96.2,WBCHSE,90.6,Electronics and Communication Engineering,9.35,No,No,No,Yes,4,No,Not Placed
2,rohan_nandi12@gmail.com,Rohan Nandi,Male,State Board,97.5,CBSE,69.6,Information Technology,7.84,No,Yes,No,Yes,3,Yes,Placed
3,smita_agarwal90@gmail.com,Smita Agarwal,Female,CBSE,96.9,Other state Board,77.6,Computer Science in AIML,7.87,Yes,No,Yes,Yes,2,Yes,Not Placed
4,samaira_singhania95@gmail.com,Samaira Singhania,Female,ICSE,99.1,CBSE,62.8,Computer Science and Engineering,9.26,Yes,Yes,No,Yes,1,Yes,Not Placed


## Dataset Size and Shape

Understanding dataset size helps assess:
- Whether results may suffer from small-sample bias
- Whether advanced modeling is justified (not required here)

In [15]:
df.shape


(401, 16)

## Column Types and Schema

We inspect:
- Data types inferred by pandas
- Presence of unexpected object types
- Columns requiring encoding or normalization


In [16]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 401 entries, 0 to 400
Data columns (total 16 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Email                    401 non-null    object 
 1   Name                     401 non-null    object 
 2   Gender                   401 non-null    object 
 3   10th board               401 non-null    object 
 4   10th marks               401 non-null    float64
 5   12th board               401 non-null    object 
 6   12th marks               401 non-null    float64
 7   Stream                   401 non-null    object 
 8   Cgpa                     401 non-null    float64
 9   Internships(Y/N)         401 non-null    object 
 10  Training(Y/N)            401 non-null    object 
 11  Backlog in 5th sem       401 non-null    object 
 12  Innovative Project(Y/N)  401 non-null    object 
 13  Communication level      401 non-null    int64  
 14  Technical Course(Y/N)    4

## Column-Level Audit

| Column | Description | Expected Type | Analytical Value | Action |
|------|------------|---------------|------------------|--------|
| Email | Student identifier | Identifier | None | Drop |
| Name | Student identifier | Identifier | None | Drop |
| Gender | Student gender | Categorical | Medium | Encode |
| 10th board | Board name | Categorical | Low | Optional |
| 10th marks | Percentage | Numeric | Medium | Keep |
| 12th board | Board name | Categorical | Low | Optional |
| 12th marks | Percentage | Numeric | Medium | Keep |
| Stream | Branch | Categorical | High | Keep |
| Cgpa | Academic performance | Numeric | High | Normalize |
| Internships(Y/N) | Practical exposure | Binary | High | Encode |
| Training(Y/N) | Additional training | Binary | Medium | Encode |
| Backlog in 5th sem | Academic risk | Binary (Yes/No) | High | Convert |
| Innovative Project(Y/N) | Project exposure | Binary | High | Encode |
| Communication level | Soft skill score | Ordinal | High | Keep |
| Technical Course(Y/N) | Extra coursework | Binary | Medium | Encode |
| Placement(Y/N)? | Target variable | Binary | Critical | Encode |
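
As a rough cross-check of the "Expected Type" column, the inferred dtype, distinct count, and missing count can be gathered per column in one read-only pass. The sketch below assumes a hypothetical helper, `summarize_columns`, that is not part of the original notebook. It runs on a toy frame built from a few values visible in the `df.head()` output; in the notebook it would be called on `df` instead.

```python
import pandas as pd


def summarize_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: inferred dtype, distinct count, missing count."""
    return pd.DataFrame(
        {
            "dtype": frame.dtypes.astype(str),
            "n_unique": frame.nunique(),
            "n_missing": frame.isna().sum(),
        }
    )


# Toy frame: values copied from the first three rows of df.head() above.
sample = pd.DataFrame(
    {
        "Gender": ["Female", "Female", "Male"],
        "Cgpa": [7.37, 9.35, 7.84],
        "Communication level": [3, 4, 3],
    }
)

summary = summarize_columns(sample)
print(summary)
```

Nothing here modifies the input frame, so the helper fits the "no transformations" rule of this notebook.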


## Unique Values & Missing Data

This step helps identify:
- Inconsistent category naming (Y/Yes/YES)
- Unexpected numeric ranges
- Missing values that require imputation or removal


In [17]:
df.nunique()


Email                      401
Name                       389
Gender                       2
10th board                   4
10th marks                 232
12th board                  11
12th marks                 241
Stream                      16
Cgpa                       204
Internships(Y/N)             2
Training(Y/N)                2
Backlog in 5th sem           2
Innovative Project(Y/N)      3
Communication level          5
Technical Course(Y/N)        3
Placement(Y/N)?              2
dtype: int64
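
Distinct counts flag a problem but do not show which labels cause it. A minimal sketch of a follow-up check, assuming a hypothetical helper `label_variants` and an arbitrary cutoff of 5 distinct values, lists the raw labels of every low-cardinality text column. The toy data is illustrative only: the lowercase `"yes"` is invented, since the actual third label in `Innovative Project(Y/N)` is not shown in this notebook.

```python
import pandas as pd


def label_variants(frame: pd.DataFrame, max_unique: int = 5) -> dict:
    """Map each low-cardinality non-numeric column to its raw label counts."""
    report = {}
    for col in frame.columns:
        series = frame[col]
        # Numeric columns are skipped; only text-like labels are of interest.
        if pd.api.types.is_numeric_dtype(series):
            continue
        if series.nunique() <= max_unique:
            report[col] = series.value_counts(dropna=False).to_dict()
    return report


# Hypothetical toy data; the "yes" variant is invented for illustration.
sample = pd.DataFrame(
    {
        "Innovative Project(Y/N)": ["No", "Yes", "Yes", "yes"],
        "Cgpa": [7.37, 9.35, 7.84, 7.87],
    }
)

variants = label_variants(sample)
print(variants)
```

Run against `df`, this would print the exact labels behind the counts of 3 reported for `Innovative Project(Y/N)` and `Technical Course(Y/N)`.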

In [18]:
df.isna().sum()


Email                      0
Name                       0
Gender                     0
10th board                 0
10th marks                 0
12th board                 0
12th marks                 0
Stream                     0
Cgpa                       0
Internships(Y/N)           0
Training(Y/N)              0
Backlog in 5th sem         0
Innovative Project(Y/N)    0
Communication level        0
Technical Course(Y/N)      0
Placement(Y/N)?            0
dtype: int64
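
A zero count from `isna()` only rules out true nulls. Blank strings or text placeholders are not counted as missing. The sketch below is one way to look for them, assuming a hypothetical helper `count_hidden_missing` and a placeholder list (`PLACEHOLDERS`) that is a guess, not something taken from the dataset. The toy values for `12th board` mix two boards seen in `df.head()` with two invented placeholders.

```python
import pandas as pd

# Assumed placeholder spellings; adjust to whatever the real data contains.
PLACEHOLDERS = {"", "na", "n/a", "none", "null", "-", "?"}


def count_hidden_missing(frame: pd.DataFrame) -> pd.Series:
    """Count placeholder-like entries in each non-numeric column."""
    counts = {}
    for col in frame.columns:
        series = frame[col]
        if pd.api.types.is_numeric_dtype(series):
            continue
        # Compare on a stripped, lowercased copy; the input frame is untouched.
        normalized = series.dropna().astype(str).str.strip().str.lower()
        counts[col] = int(normalized.isin(PLACEHOLDERS).sum())
    return pd.Series(counts, dtype="int64")


# Toy data: " " and "N/A" are invented to show what the check catches.
sample = pd.DataFrame(
    {
        "12th board": ["CBSE", " ", "N/A", "WBCHSE"],
        "Stream": ["Mechanical Engineering", "Information Technology"] * 2,
        "Cgpa": [7.37, 9.35, 7.84, 7.87],
    }
)

hidden = count_hidden_missing(sample)
print(hidden)
```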

## Initial Observations & Risks

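1. Binary columns contain unexpected labels: `Innovative Project(Y/N)` and `Technical Course(Y/N)` each report 3 unique values where 2 are expected, which points to inconsistent labels (Y/N vs. Yes/No) or a stray third category.
2. CGPA scale is ambiguous (10-point vs 100-point). The first five rows fall between 7.37 and 9.35, but the full range has not been checked.
3. Communication level has 5 distinct values; its bounds and direction (whether higher means better) need validation.
4. Board names may have low analytical value: `12th board` has 11 distinct values and `10th board` has 4.
5. Email and Name must be removed to avoid leakage or bias. Email is unique per row (401 values), while Name has 389 distinct values, so some names repeat.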

These issues will be handled explicitly in the data cleaning notebook.
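
The scale questions raised above can be checked without changing any data. The following is a sketch only: the bounds in `EXPECTED_RANGES` are assumptions (percentages in 0 to 100, a 10-point CGPA, a 1 to 5 communication score) and the helper `out_of_range_counts` is hypothetical. The toy frame uses the five rows shown in `df.head()`.

```python
import pandas as pd

# Assumed valid ranges; none of these bounds is confirmed by the dataset.
EXPECTED_RANGES = {
    "10th marks": (0, 100),
    "12th marks": (0, 100),
    "Cgpa": (0, 10),
    "Communication level": (1, 5),
}


def out_of_range_counts(frame: pd.DataFrame, ranges: dict = EXPECTED_RANGES) -> dict:
    """Count values outside the assumed inclusive range for each known column."""
    counts = {}
    for col, (low, high) in ranges.items():
        if col in frame.columns:
            # between() is inclusive on both ends; NaN counts as out of range.
            counts[col] = int((~frame[col].between(low, high)).sum())
    return counts


# Values copied from the df.head() output above.
sample = pd.DataFrame(
    {
        "10th marks": [96.7, 96.2, 97.5, 96.9, 99.1],
        "Cgpa": [7.37, 9.35, 7.84, 7.87, 9.26],
        "Communication level": [3, 4, 3, 2, 1],
    }
)

violations = out_of_range_counts(sample)
print(violations)
```

A non-zero count for `Cgpa` on the full dataset would suggest that some rows use a 100-point scale.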
