# Data Cleaning & Feature Preparation

## Objective
This notebook cleans and standardizes the student placement dataset (`../data/placements.csv`) so it is suitable for analysis.

Key principles followed:
- No unnecessary feature engineering
- Decisions are explainable and documented
- Transformations reflect real-world assumptions


In [1]:
import pandas as pd
import numpy as np
import re

# Load the raw placement data
df = pd.read_csv("../data/placements.csv")


In [2]:
df.head()

,Email,Name,Gender,10th board,10th marks,12th board,12th marks,Stream,Cgpa,Internships(Y/N),Training(Y/N),Backlog in 5th sem,Innovative Project(Y/N),Communication level,Technical Course(Y/N),Placement(Y/N)?
0,payal_roy79@gmail.com,Payal Roy,Female,State Board,96.7,CBSE,70.2,Mechanical Engineering,7.37,No,Yes,No,No,3,Yes,Not Placed
1,shreyoshi_dey13@gmail.com,Shreyoshi Dey,Female,WBBSE,96.2,WBCHSE,90.6,Electronics and Communication Engineering,9.35,No,No,No,Yes,4,No,Not Placed
2,rohan_nandi12@gmail.com,Rohan Nandi,Male,State Board,97.5,CBSE,69.6,Information Technology,7.84,No,Yes,No,Yes,3,Yes,Placed
3,smita_agarwal90@gmail.com,Smita Agarwal,Female,CBSE,96.9,Other state Board,77.6,Computer Science in AIML,7.87,Yes,No,Yes,Yes,2,Yes,Not Placed
4,samaira_singhania95@gmail.com,Samaira Singhania,Female,ICSE,99.1,CBSE,62.8,Computer Science and Engineering,9.26,Yes,Yes,No,Yes,1,Yes,Not Placed


## Column Name Normalization

Column names in real-world datasets often contain:
- Spaces
- Special characters
- Inconsistent capitalization

Normalizing column names improves:
- Code readability
- Reusability across notebooks
- Compatibility with plotting and analysis libraries

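This step ensures all column names are lowercase, underscore-separated, and free of `?`, `(`, `)`, and `/`.
Names that start with a digit, such as `10th_board`, are still not valid Python identifiers, so they need bracket access (`df["10th_board"]`) rather than attribute access.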


In [3]:
# Function to normalize column names
def normalize_column_names(columns):
    normalized = []
    for col in columns:
        col = col.strip().lower()
        col = re.sub(r"[?()]", "", col)      # drop ?, ( and ), e.g. "internships(y/n)" -> "internshipsy/n"
        col = col.replace("/", "_")
        col = col.replace(" ", "_")
        col = re.sub(r"_+", "_", col)        # remove multiple underscores
        normalized.append(col)
    return normalized

# Apply normalization
df.columns = normalize_column_names(df.columns)

# Preview normalized columns
df.columns


Index(['email', 'name', 'gender', '10th_board', '10th_marks', '12th_board',
       '12th_marks', 'stream', 'cgpa', 'internshipsy_n', 'trainingy_n',
       'backlog_in_5th_sem', 'innovative_projecty_n', 'communication_level',
       'technical_coursey_n', 'placementy_n'],
      dtype='object')
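
As a quick sanity check, the same rules can be applied to a few of the raw header names to confirm the output shown above and to verify that no two columns collapse to the same name. This is a minimal sketch: `normalize_name` is a hypothetical single-name helper that mirrors `normalize_column_names`, and the sample list is a subset of the headers shown in `df.head()`.

```python
import re

def normalize_name(col):
    """Hypothetical single-name version of the rules in normalize_column_names."""
    col = col.strip().lower()
    col = re.sub(r"[?()]", "", col)   # drop ?, ( and )
    col = col.replace("/", "_").replace(" ", "_")
    return re.sub(r"_+", "_", col)    # collapse repeated underscores

raw_names = ["10th board", "Internships(Y/N)", "Backlog in 5th sem", "Placement(Y/N)?"]
normalized = [normalize_name(name) for name in raw_names]

for raw, new in zip(raw_names, normalized):
    print(f"{raw!r} -> {new!r}")

# Normalization must not map two different headers to the same name
assert len(set(normalized)) == len(normalized)
```

Because the parentheses are removed rather than replaced, `Internships(Y/N)` becomes `internshipsy_n`. Replacing `(` with `_` would give `internships_y_n`, at the cost of renaming every downstream reference.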

## Dropping Non-Analytical Identifiers

Email and Name uniquely identify students but provide **no predictive or analytical value**.
Keeping them may introduce bias or privacy concerns.


In [4]:
df.drop(columns=["email", "name"], inplace=True)
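
Re-running the cell above after the columns are gone raises a `KeyError`. One possible way to make the step repeatable is the `errors="ignore"` option of `DataFrame.drop`, sketched here on an invented two-row frame rather than the notebook's data.

```python
import pandas as pd

# Invented two-row frame with the identifier columns present
sample = pd.DataFrame({
    "email": ["a@example.com", "b@example.com"],
    "name": ["A", "B"],
    "cgpa": [7.5, 8.25],
})

id_columns = ["email", "name"]

# errors="ignore" skips columns that are already gone, so the cell can be re-run
sample = sample.drop(columns=id_columns, errors="ignore")
sample = sample.drop(columns=id_columns, errors="ignore")  # second run is a no-op
print(sample.columns.tolist())   # ['cgpa']
```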


In [5]:
df.head()

,gender,10th_board,10th_marks,12th_board,12th_marks,stream,cgpa,internshipsy_n,trainingy_n,backlog_in_5th_sem,innovative_projecty_n,communication_level,technical_coursey_n,placementy_n
0,Female,State Board,96.7,CBSE,70.2,Mechanical Engineering,7.37,No,Yes,No,No,3,Yes,Not Placed
1,Female,WBBSE,96.2,WBCHSE,90.6,Electronics and Communication Engineering,9.35,No,No,No,Yes,4,No,Not Placed
2,Male,State Board,97.5,CBSE,69.6,Information Technology,7.84,No,Yes,No,Yes,3,Yes,Placed
3,Female,CBSE,96.9,Other state Board,77.6,Computer Science in AIML,7.87,Yes,No,Yes,Yes,2,Yes,Not Placed
4,Female,ICSE,99.1,CBSE,62.8,Computer Science and Engineering,9.26,Yes,Yes,No,Yes,1,Yes,Not Placed


## Standardizing Binary Columns

Binary columns are encoded as 1 (Yes, Placed) and 0 (No, Not Placed) to:
- Enable numerical analysis
- Avoid categorical inconsistencies
- Support aggregation and correlation analysis


In [6]:
binary_columns = [
    "internshipsy_n",
    "trainingy_n",
    "innovative_projecty_n",
    "technical_coursey_n",
    "placementy_n",
    "backlog_in_5th_sem"
]

yes_no_map = {
    "y": 1,
    "n": 0,
    "yes": 1,
    "no": 0,
    "placed": 1,
    "not placed": 0,
}

for col in binary_columns:
    df[col] = (
        df[col]
        .astype(str)        # cast to str so .str methods work; missing values still end up as NaN after .map
        .str.strip()        # remove spaces
        .str.lower()        # normalize case
        .map(yes_no_map)
    )


In [7]:
df.head()

,gender,10th_board,10th_marks,12th_board,12th_marks,stream,cgpa,internshipsy_n,trainingy_n,backlog_in_5th_sem,innovative_projecty_n,communication_level,technical_coursey_n,placementy_n
0,Female,State Board,96.7,CBSE,70.2,Mechanical Engineering,7.37,0,1,0,0,3,1.0,0
1,Female,WBBSE,96.2,WBCHSE,90.6,Electronics and Communication Engineering,9.35,0,0,0,1,4,0.0,0
2,Male,State Board,97.5,CBSE,69.6,Information Technology,7.84,0,1,0,1,3,1.0,1
3,Female,CBSE,96.9,Other state Board,77.6,Computer Science in AIML,7.87,1,0,1,1,2,1.0,0
4,Female,ICSE,99.1,CBSE,62.8,Computer Science and Engineering,9.26,1,1,0,1,1,1.0,0
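
Because `.map` silently turns any value that is absent from the mapping into NaN, it can help to list such values before overwriting a column. The following is a minimal sketch on a small made-up Series, not the notebook's data; `find_unmapped` is a hypothetical helper and the sample values are invented for illustration.

```python
import pandas as pd

yes_no_map = {"y": 1, "n": 0, "yes": 1, "no": 0, "placed": 1, "not placed": 0}

def find_unmapped(series, mapping):
    """Return the distinct cleaned values that the mapping would turn into NaN."""
    # Drop genuinely missing entries first; they are counted separately below
    cleaned = series.dropna().astype(str).str.strip().str.lower()
    return sorted(set(cleaned) - set(mapping))

# Invented sample: one stray spelling and one missing entry
sample = pd.Series(["Yes", " no ", "YES", "Maybe", None])

unmapped = find_unmapped(sample, yes_no_map)
n_missing = int(sample.isna().sum())
print("Unmapped values:", unmapped)
print("Missing entries:", n_missing)
```

Running a check like this on each entry of `binary_columns` before the loop would show whether a later NaN comes from a blank cell or from an unexpected spelling.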


## Handling Backlog Information

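The backlog column may contain numeric values (the number of backlogs).
For placement risk analysis, the **presence of a backlog** matters more than the count.

Hence:
- 0 → No backlog
- greater than 0 → Has backlog

In this notebook the column was already mapped from Yes/No to 1/0 in the binary-encoding step above, so the rule leaves those values unchanged.
A raw numeric count would not match `yes_no_map` and would become NaN in that step, which this rule then turns into 0. To keep counts, exclude the column from `binary_columns` before applying the rule.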


In [8]:
# Flag any positive value as "has backlog"; NaN compares False and therefore becomes 0
df["backlog_in_5th_sem"] = (df["backlog_in_5th_sem"] > 0).astype(int)
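
The thresholding rule can be sketched on a small made-up Series to show how it treats counts and missing values. The values below are invented for illustration. A missing value compares as False and is therefore reported as "no backlog", which may or may not be the intended behaviour; the second variant is one possible way to keep it missing.

```python
import numpy as np
import pandas as pd

# Invented backlog counts, including one missing entry
backlogs = pd.Series([0, 2, 1, np.nan, 0])

# Same rule as above, vectorized: 1 if the value is greater than 0, else 0
has_backlog = (backlogs > 0).astype(int)
print(has_backlog.tolist())   # [0, 1, 1, 0, 0]

# Optional: keep missing values missing instead of turning them into 0
has_backlog_nullable = (backlogs > 0).astype("Int64").where(backlogs.notna())
print(has_backlog_nullable.tolist())
```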


## CGPA Normalization

Some institutions report CGPA on a 10-point scale, others on a 100-point scale.

To ensure comparability:
- CGPA ≤ 10 is assumed to be on a 10-point scale and is converted to a percentage by multiplying by 10
- CGPA > 10 is assumed to already be a percentage and is left unchanged


In [9]:
df["cgpa_normalized"] = df["cgpa"].apply(
    lambda x: x * 10 if x <= 10 else x
)
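
As a worked example of the rule, a 10-point CGPA of 7.5 becomes 75.0, while a value of 85.0 is treated as an existing percentage and kept. The sketch below applies the same logic with `Series.where` on invented values; it is an alternative vectorized formulation, not the notebook's own code.

```python
import pandas as pd

# Invented CGPA values: three on a 10-point scale, one already a percentage
cgpa = pd.Series([7.5, 8.25, 10.0, 85.0])

# Keep values above 10 as they are; multiply everything else by 10
cgpa_normalized = cgpa.where(cgpa > 10, cgpa * 10)
print(cgpa_normalized.tolist())   # [75.0, 82.5, 100.0, 85.0]
```

The threshold of 10 is an assumption of this notebook: a percentage of 10 or below would be misread as a CGPA, which is worth keeping in mind when reusing the rule on other data.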


## Final Dataset Validation

We verify:
- No unintended missing values
- Correct encoding
- Reasonable numeric ranges
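
The checks listed above can also be written as explicit tests that report each violated expectation. This is a minimal sketch on an invented three-row frame that reuses three of the cleaned column names; `validate` is a hypothetical helper, and the 0 to 100 range for `cgpa_normalized` is an assumption based on the percentage conversion.

```python
import pandas as pd

def validate(frame, binary_cols, percent_cols):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    # Check 1: missing values in any column
    for col in frame.columns:
        missing = int(frame[col].isna().sum())
        if missing:
            problems.append(f"{col}: {missing} missing value(s)")
    # Check 2: binary columns may only contain 0 and 1
    for col in binary_cols:
        extra = set(frame[col].dropna().tolist()) - {0, 1}
        if extra:
            problems.append(f"{col}: unexpected values {sorted(extra)}")
    # Check 3: percentage columns must lie between 0 and 100
    for col in percent_cols:
        if not frame[col].dropna().between(0, 100).all():
            problems.append(f"{col}: values outside 0-100")
    return problems

# Invented sample with one deliberate gap in technical_coursey_n
sample = pd.DataFrame({
    "technical_coursey_n": [1.0, None, 0.0],
    "placementy_n": [0, 1, 1],
    "cgpa_normalized": [73.7, 93.5, 78.4],
})
problems = validate(sample, ["technical_coursey_n", "placementy_n"], ["cgpa_normalized"])
for problem in problems:
    print(problem)
```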


In [10]:
df.describe(include="all")


,gender,10th_board,10th_marks,12th_board,12th_marks,stream,cgpa,internshipsy_n,trainingy_n,backlog_in_5th_sem,innovative_projecty_n,communication_level,technical_coursey_n,placementy_n,cgpa_normalized
count,401,401,401.0,401,401.0,401,401.0,401.0,401.0,401.0,401.0,401.0,400.0,401.0,401.0
unique,2,4,,11,,16,,,,,,,,,
top,Female,CBSE,,CBSE,,Computer Science and Engineering,,,,,,,,,
freq,207,112,,116,,81,,,,,,,,,
mean,,,84.512718,,78.798828,,8.467855,0.571072,0.538653,0.25187,0.738155,2.922693,0.77,0.496259,82.658603
std,,,9.779359,,10.418821,,4.154455,0.495541,0.499126,0.434629,0.440188,1.378952,0.42136,0.500611,7.74967
min,,,32.0,,45.0,,5.5,0.0,0.0,0.0,0.0,1.0,0.0,0.0,55.0
25%,,,77.6,,70.0,,7.62,0.0,0.0,0.0,0.0,2.0,1.0,0.0,76.2
50%,,,85.2,,80.5,,8.27,1.0,1.0,0.0,1.0,3.0,1.0,0.0,82.7
75%,,,92.0,,87.3,,8.94,1.0,1.0,1.0,1.0,4.0,1.0,1.0,89.4


In [11]:
df.isna().sum()


gender                   0
10th_board               0
10th_marks               0
12th_board               0
12th_marks               0
stream                   0
cgpa                     0
internshipsy_n           0
trainingy_n              0
backlog_in_5th_sem       0
innovative_projecty_n    0
communication_level      0
technical_coursey_n      1
placementy_n             0
cgpa_normalized          0
dtype: int64

One missing value remains in `technical_coursey_n`, which matches the count of 400 in the summary above. The encoding step maps any entry that is absent from `yes_no_map`, including a blank cell, to NaN, so this row should be inspected in the raw file before deciding whether to drop or impute it.

## Saving the Cleaned Dataset

After completing all cleaning and standardization steps, the dataset is saved as a cleaned CSV file.

This allows:
- Reuse across analysis notebooks
- Separation of cleaning and analysis logic
- Reproducibility of results

All subsequent notebooks will load this cleaned dataset.


In [12]:
# Save cleaned dataset for downstream analysis
output_path = "../data/placement_cleaned.csv"

df.to_csv(output_path, index=False)

print(f"Cleaned dataset saved to {output_path}")


Cleaned dataset saved to ../data/placement_cleaned.csv
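
To confirm that the saved file reproduces the cleaned frame, the CSV can be read back and compared. The sketch below does this with an in-memory buffer and an invented two-column frame so that it runs without touching `../data/`; in the notebook, the same comparison would use `output_path` and `df`.

```python
import io

import pandas as pd

# Invented stand-in for the cleaned frame
cleaned = pd.DataFrame({
    "placementy_n": [0, 1, 1],
    "cgpa_normalized": [73.7, 93.5, 78.4],
})

# Write without the index, as in the cell above, then read the text back
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# Raises AssertionError if columns, dtypes, or values differ
pd.testing.assert_frame_equal(cleaned, reloaded)
print("Round trip OK:", reloaded.shape)
```

CSV does not store dtypes, so downstream notebooks receive whatever types `read_csv` infers; a 0/1 column that contains a missing value, for example, is read back as float.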
