
Experiment – Handling Missing Values and Data Imputation

VTU26768 – Data Science & Visualization Lab

Objective

To identify, analyze, and handle missing values in a real-world dataset using appropriate imputation techniques for numerical and categorical variables.

Dataset Used

Toyota.csv

Contains information about used Toyota cars including price, age, mileage, fuel type, horsepower, and other attributes.

**Libraries Used**

In [13]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import os

**Step 1: Loading the Dataset**

In [14]:
df1 = pd.read_csv('/content/Toyota.csv', index_col=0, na_values=['??', '????'])

Syntax Explanation

read_csv(): Loads the CSV file into a DataFrame.

index_col=0: Sets the first column as index.

na_values: Treats the strings ?? and ???? as missing values (NaN).
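
To illustrate how na_values behaves, the sketch below reads a tiny inline CSV with the same options. The rows are made up for illustration and are not taken from Toyota.csv; the names sample_csv and sample are hypothetical.

```python
import io
import pandas as pd

# Made-up sample rows for illustration only; not taken from Toyota.csv.
sample_csv = io.StringIO(
    ",Price,KM,HP\n"
    "0,13500,46986,90\n"
    "1,13750,??,90\n"
    "2,13950,41711,????\n"
)

# Same options as the real load: first column as index, '??' and '????' read as NaN.
sample = pd.read_csv(sample_csv, index_col=0, na_values=['??', '????'])

print(sample.isnull().sum())   # one missing value each in KM and HP
print(sample.dtypes)           # KM and HP are float64 because they hold NaN
```

Without na_values, the placeholder strings would be kept as text and the affected columns would be read as object columns, so numeric operations such as mean() would not work on them directly.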

**Step 2: Creating Working Copies**

In [15]:
df = df1.copy()
cars_data = df1.copy()

Explanation

copy() creates independent DataFrame copies.

df is used for manual imputation.

cars_data is used for generalized imputation.
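
A minimal sketch of why independent copies matter, using a made-up one-column frame (the names original and working are hypothetical): changes made to the copy do not reach the source DataFrame.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration.
original = pd.DataFrame({'Age': [23.0, np.nan, 30.0]})

# copy() makes a deep copy by default, so the two frames share no data.
working = original.copy()
working['Age'] = working['Age'].fillna(0)

print(original['Age'].isnull().sum())  # the original still holds its NaN
print(working['Age'].isnull().sum())   # the copy has been filled
```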

**Step 3: Identifying Missing Values**

In [16]:
df.isnull().sum()

Price,0
Age,100
KM,15
FuelType,100
HP,6
MetColor,150
Automatic,0
CC,0
Doors,0
Weight,0


Explanation

isnull() returns a Boolean DataFrame that is True wherever a value is missing.

sum() adds up the True values, giving the number of missing values per column.

Observation

Missing values were found in five columns:

Age (100)

KM (15)

FuelType (100)

HP (6)

MetColor (150)
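
The counting step can be sketched on a small made-up frame. The percentage column is an addition for illustration and is not part of the original experiment; toy, missing_count, missing_pct, and summary are hypothetical names.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration.
toy = pd.DataFrame({
    'Age': [23.0, np.nan, 30.0, np.nan],
    'FuelType': ['Diesel', None, 'Petrol', 'Petrol'],
    'Price': [13500, 13750, 13950, 14950],
})

missing_count = toy.isnull().sum()        # True counts as 1 when summed
missing_pct = toy.isnull().mean() * 100   # share of missing rows per column

# Keep only the columns that actually contain missing values.
summary = pd.DataFrame({'missing': missing_count, 'percent': missing_pct})
print(summary[summary['missing'] > 0])
```

Looking at the percentage alongside the raw count can help judge how much of a column would be replaced by imputed values.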

**Step 4: Displaying Rows with Missing Values**

In [17]:
df[df.isnull().any(axis=1)]

Unnamed: 0,Price,Age,KM,FuelType,HP,MetColor,Automatic,CC,Doors,Weight
2,13950,24.0,41711.0,Diesel,90.0,,0,2000,3,1165
6,16900,27.0,,Diesel,,,0,2000,3,1245
7,18600,30.0,75889.0,,90.0,1.0,0,2000,3,1245
9,12950,23.0,71138.0,Diesel,,,0,1900,3,1105
15,22000,28.0,18739.0,Petrol,,0.0,0,1800,3,1185
...,...,...,...,...,...,...,...,...,...,...
1428,8450,72.0,,Petrol,86.0,,0,1300,3,1015
1431,7500,,20544.0,Petrol,86.0,1.0,0,1300,3,1025
1432,10845,72.0,,Petrol,86.0,0.0,0,1300,3,1015
1433,8500,,17016.0,Petrol,86.0,0.0,0,1300,3,1015


Explanation

any(axis=1) returns True for each row that has a missing value in at least one column.

Using that Boolean Series as a mask filters the DataFrame down to the affected rows.
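
The row filter can be sketched as follows on made-up data (toy, row_has_missing, incomplete_rows, and complete_rows are hypothetical names). Negating the mask with ~ gives the complete rows instead.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration.
toy = pd.DataFrame({
    'Age': [23.0, np.nan, 30.0],
    'KM': [46986.0, 72937.0, np.nan],
    'Price': [13500, 13750, 13950],
})

# One Boolean per row: True if any column in that row is missing.
row_has_missing = toy.isnull().any(axis=1)

incomplete_rows = toy[row_has_missing]    # rows with at least one NaN
complete_rows = toy[~row_has_missing]     # rows with no NaN

print(incomplete_rows)
print(len(complete_rows))
```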

**Step 5: Descriptive Statistics Before Imputation**

In [18]:
df.describe(include='all')

Unnamed: 0,Price,Age,KM,FuelType,HP,MetColor,Automatic,CC,Doors,Weight
count,1436.0,1336.0,1421.0,1336,1430.0,1286.0,1436.0,1436.0,1436.0,1436.0
unique,,,,3,,,,,7.0,
top,,,,Petrol,,,,,5.0,
freq,,,,1177,,,,,673.0,
mean,10730.824513,55.672156,68647.239972,,101.478322,0.674961,0.05571,1566.827994,,1072.45961
std,3626.964585,18.589804,37333.023589,,14.768255,0.468572,0.229441,187.182436,,52.64112
min,4350.0,1.0,1.0,,69.0,0.0,0.0,1300.0,,1000.0
25%,8450.0,43.0,43210.0,,90.0,0.0,0.0,1400.0,,1040.0
50%,9900.0,60.0,63634.0,,110.0,1.0,0.0,1600.0,,1070.0
75%,11950.0,70.0,87000.0,,110.0,1.0,0.0,1600.0,,1085.0


Explanation

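Provides a statistical summary of both numerical and categorical columns. The count row reports non-null values only, so Age shows 1336 of 1436 rows, which matches the 100 missing values found in Step 3.

Helps in understanding the data distribution before cleaning. Doors is summarized with unique, top, and freq rather than mean and std, which indicates that it is stored as a non-numeric column.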
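
As a small illustration of reading the count row, the sketch below recovers the number of missing values from describe() on made-up data (toy, summary, non_null, and missing are hypothetical names).

```python
import numpy as np
import pandas as pd

# Made-up data for illustration.
toy = pd.DataFrame({
    'Age': [23.0, np.nan, 30.0, 32.0],
    'FuelType': ['Diesel', 'Petrol', None, 'Petrol'],
})

summary = toy.describe(include='all')

# 'count' holds the number of non-null values in each column.
non_null = summary.loc['count']
missing = len(toy) - non_null

print(missing)
print(summary.loc['top', 'FuelType'])  # most frequent category
```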

**Step 6: Handling Missing Values in Numerical Columns**

In [19]:
num_cols = ['Age', 'KM', 'HP']
for col in num_cols:
    # Assign the result back instead of calling fillna(..., inplace=True) on df[col]
    df[col] = df[col].fillna(df[col].mean())

Syntax Explanation

fillna() replaces missing values.

mean() is used as an imputation strategy for numerical data.

Assigning the result back to df[col] updates the DataFrame. The chained form df[col].fillna(..., inplace=True) triggers a warning in recent pandas versions and will stop working in pandas 3.0, because the intermediate object df[col] always behaves as a copy there.
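
A minimal sketch of mean imputation on made-up values, written without chained inplace calls. It shows two equivalent forms; toy, filled_a, and filled_b are hypothetical names.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration.
toy = pd.DataFrame({
    'Age': [20.0, np.nan, 40.0],
    'KM': [10000.0, 30000.0, np.nan],
})
num_cols = ['Age', 'KM']

# Option 1: fill each column with its own mean and assign the result back.
filled_a = toy.copy()
for col in num_cols:
    filled_a[col] = filled_a[col].fillna(filled_a[col].mean())

# Option 2: pass a Series of column means; fillna matches it by column name.
filled_b = toy.copy()
filled_b[num_cols] = filled_b[num_cols].fillna(filled_b[num_cols].mean())

print(filled_a)
print(filled_a.equals(filled_b))
```

Both forms write to the DataFrame itself rather than to a temporary column object, which is what the pandas warning about chained inplace calls asks for.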

**Step 7: Handling Missing Values in Categorical Columns**

*FuelType Imputation*

In [20]:
df['FuelType'] = df['FuelType'].fillna(df['FuelType'].mode()[0])


*MetColor Imputation*

In [21]:
df['MetColor'] = df['MetColor'].fillna(df['MetColor'].mode()[0])


Explanation

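mode() returns the most frequent value(s) as a Series; [0] selects the first of them.

Suitable for categorical data imputation. MetColor is stored as 0/1 but represents a binary category, so the mode is used here rather than the mean.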
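
Mode imputation can be sketched on made-up data as follows (toy and most_frequent are hypothetical names). The loop applies the same pattern to a text column and to a 0/1 flag column.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration.
toy = pd.DataFrame({
    'FuelType': ['Petrol', 'Diesel', None, 'Petrol', None],
    'MetColor': [1.0, np.nan, 1.0, 0.0, 1.0],
})

for col in ['FuelType', 'MetColor']:
    # mode() ignores NaN and returns a Series; take its first entry.
    most_frequent = toy[col].mode()[0]
    toy[col] = toy[col].fillna(most_frequent)

print(toy)
```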

**Step 8: Verifying Missing Values After Manual Imputation**

In [22]:
df.isnull().sum()

Price,0
Age,0
KM,0
FuelType,0
HP,0
MetColor,0
Automatic,0
CC,0
Doors,0
Weight,0


Result

All missing values were successfully handled.
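
The verification can also be reduced to a single number. The sketch below uses a hypothetical helper, count_missing, on made-up data; it is an illustration and not part of the original notebook.

```python
import numpy as np
import pandas as pd

def count_missing(frame):
    """Return the total number of missing cells in a DataFrame."""
    # First sum() counts per column, second sum() adds the columns together.
    return int(frame.isnull().sum().sum())

# Made-up data for illustration.
toy = pd.DataFrame({'Age': [23.0, np.nan], 'FuelType': ['Petrol', None]})
before = count_missing(toy)

toy['Age'] = toy['Age'].fillna(toy['Age'].mean())
toy['FuelType'] = toy['FuelType'].fillna(toy['FuelType'].mode()[0])
after = count_missing(toy)

print(before, after)
```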

**Step 9: Generalized Imputation (Single-Step Approach)**

In [23]:
cars_data = cars_data.apply(
    lambda x: x.fillna(x.mean()) if x.dtype == 'float'
    else x.fillna(x.value_counts().index[0])
)

Syntax Explanation

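apply() applies a function column-wise.

Columns with a float dtype → filled with the mean. This includes MetColor, which is stored as float (0.0/1.0), so unlike in Step 7 its missing values are filled with the column mean (about 0.67) rather than the mode.

All other columns → filled with the most frequent value, value_counts().index[0], which is the mode.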
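
Because the dtype test alone treats every float column as numerical, a 0/1 flag column stored as float would receive the mean. One possible variant, sketched on made-up data, names the categorical columns explicitly. The list categorical_cols and the function impute_column are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_float_dtype

# Made-up data for illustration.
toy = pd.DataFrame({
    'Age': [20.0, np.nan, 40.0, 60.0],
    'MetColor': [1.0, 1.0, np.nan, 0.0],
    'FuelType': ['Petrol', None, 'Petrol', 'Diesel'],
})

# Assumption: columns to treat as categorical regardless of their dtype.
categorical_cols = ['MetColor', 'FuelType']

def impute_column(col):
    """Fill float columns with the mean, listed categorical columns with the mode."""
    if col.name not in categorical_cols and is_float_dtype(col):
        return col.fillna(col.mean())
    return col.fillna(col.mode()[0])

imputed = toy.apply(impute_column)
print(imputed)
```

With this variant the flag column keeps only the values 0.0 and 1.0 after imputation, which matches the manual approach.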

**Step 10: Final Verification**

In [24]:
cars_data.isnull().sum()

Price,0
Age,0
KM,0
FuelType,0
HP,0
MetColor,0
Automatic,0
CC,0
Doors,0
Weight,0


Result

No missing values remain in the dataset.

Conclusion

Missing values were successfully identified and handled.

Numerical attributes were imputed using mean.

In the manual approach, categorical attributes (FuelType and MetColor) were imputed using the mode.

Both manual and generalized imputation techniques were implemented.