In [1]:
# Import SimpleImputer to handle missing values
# Import pandas to create and display the dataset
from sklearn.impute import SimpleImputer
import pandas as pd

### 📊 Sample Data with Missing Values
We'll create a DataFrame with NaN values to demonstrate how imputation works.

In [2]:
# Create a dataset with some missing values
# (pandas stores None as NaN in numeric columns)
data = {
    'Age': [25, 30, None, 45, 50],
    'Salary': [50000, None, 60000, 80000, None]
}

df = pd.DataFrame(data)
df

    Age   Salary
0  25.0  50000.0
1  30.0      NaN
2   NaN  60000.0
3  45.0  80000.0
4  50.0      NaN


### ⚙️ Apply SimpleImputer
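We'll use the "mean" strategy, which replaces each missing value with the mean of the observed values in the same column.
Other options include:
- "median": the column median (numeric data only, like "mean")
- "most_frequent": the most common value in the column (also works for string or categorical data)
- "constant": a fixed value supplied through the `fill_value` parameter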
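
To make the list above concrete, here is a minimal sketch that runs the other three strategies on the same sample data. The `strategies` and `filled` names and the choice of `fill_value=0` are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Same sample data as in this notebook
df = pd.DataFrame({
    'Age': [25, 30, np.nan, 45, 50],
    'Salary': [50000, np.nan, 60000, 80000, np.nan],
})

# One imputer per alternative strategy (fill_value=0 is an arbitrary choice)
strategies = {
    'median': SimpleImputer(strategy='median'),
    'most_frequent': SimpleImputer(strategy='most_frequent'),
    'constant': SimpleImputer(strategy='constant', fill_value=0),
}

# Apply each imputer and keep the result as a DataFrame
filled = {
    name: pd.DataFrame(imp.fit_transform(df), columns=df.columns)
    for name, imp in strategies.items()
}

for name, result in filled.items():
    print(f"\nStrategy: {name}")
    print(result)
```

Comparing the printed tables side by side is a quick way to see how much the choice of strategy changes the filled values on a small dataset.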

In [3]:
# Initialize the SimpleImputer with mean strategy
imputer = SimpleImputer(strategy='mean')

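# Learn each column's mean (fit), then fill the missing entries (transform).
# The result is a NumPy array, so the column names are dropped.
imputed_array = imputer.fit_transform(df)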
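
It can help to check what the imputer actually learned during fitting. The sketch below reads the fitted `statistics_` attribute of `SimpleImputer`; the `learned` dictionary is just an illustrative name for pairing each column with its fill value.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Same sample data as in this notebook
df = pd.DataFrame({
    'Age': [25, 30, np.nan, 45, 50],
    'Salary': [50000, np.nan, 60000, 80000, np.nan],
})

imputer = SimpleImputer(strategy='mean')
imputer.fit(df)

# statistics_ holds one fill value per column, in column order
learned = dict(zip(df.columns, imputer.statistics_))
print(learned)
```

For this data the learned values should match the fills that appear in the imputed table: 37.5 for Age and roughly 63333.33 for Salary.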

In [4]:
# Convert the imputed NumPy array back to a DataFrame
# Reuse the original column names and row index
imputed_df = pd.DataFrame(imputed_array, columns=df.columns, index=df.index)
imputed_df

    Age        Salary
0  25.0  50000.000000
1  30.0  63333.333333
2  37.5  60000.000000
3  45.0  80000.000000
4  50.0  63333.333333


In [5]:
print("Original Data (with NaNs):")
print(df)

print("\nImputed Data (filled using mean):")
print(imputed_df)

Original Data (with NaNs):
    Age   Salary
0  25.0  50000.0
1  30.0      NaN
2   NaN  60000.0
3  45.0  80000.0
4  50.0      NaN

Imputed Data (filled using mean):
    Age        Salary
0  25.0  50000.000000
1  30.0  63333.333333
2  37.5  60000.000000
3  45.0  80000.000000
4  50.0  63333.333333


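💡 Handle missing values **before training your models**.
Many scikit-learn estimators raise an error when their input contains NaN, and code that does accept NaN may treat it in ways you did not intend.
Choose the imputation strategy based on the data distribution and domain knowledge. For example, the median is less sensitive to outliers than the mean.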
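
The sketch below illustrates the point with a simple estimator: `LinearRegression` rejects input that contains NaN, while the same model inside a pipeline with `SimpleImputer` trains without error. The target values in `y` are made up purely for illustration, and `nan_rejected` is a hypothetical flag name.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Age column from this notebook; y is a hypothetical target for illustration
X = pd.DataFrame({'Age': [25, 30, np.nan, 45, 50]})
y = [1.0, 2.0, 3.0, 4.0, 5.0]

# Fitting directly on data with NaN raises a ValueError
nan_rejected = False
try:
    LinearRegression().fit(X, y)
except ValueError as err:
    nan_rejected = True
    print("Fit failed:", err)

# Imputing inside a pipeline fills the NaN before the model sees the data
model = make_pipeline(SimpleImputer(strategy='mean'), LinearRegression())
model.fit(X, y)
predictions = model.predict(X)
print(predictions)
```

Putting the imputer in a pipeline also means the fill values are learned only from the data passed to `fit`, so the same values are reused when the model later predicts on new rows.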