### **📘 Lesson 1: Advanced Missing Data Handling**

**🎯 Objective**

- Handle missing data with advanced techniques
- Use interpolation (linear, polynomial, time-based)
- Use forward/backward fill with limits
- Use conditional multi-column imputation
- Clean mixed-type columns (numbers + text)
- Use professional method chaining cleaning pipelines
- Detect & fix “silent dirty data” that breaks dashboards
- Replace outliers with NaN + impute

**🧱 1️⃣ Load Dataset & Initial Check**

In [72]:
# Import libraries

import pandas as pd
import numpy as np

In [73]:
# Define the path to the CSV file (a local file path, not a URL)
# r"" tells Python it's a raw string (so backslashes \ don't need to be escaped)
csv_path = r"C:\Users\dhira\Desktop\python-mastery\pandas\02_Transformation\datasets\raw\advanced_missing_people.csv"

# Read the CSV file into a pandas DataFrame
# A DataFrame is like a table in Python where we can store and manipulate data
df_mp = pd.read_csv(csv_path)

# Print information about the DataFrame
# df.info() shows:
# - Number of rows and columns
# - Column names
# - Data types of each column
# - Number of non-null (non-missing) values per column
# Note: info() prints its report directly and returns None,
# so call it on its own line instead of wrapping it in print()
print("Data information:")
df_mp.info()


Data information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   EmpID       10 non-null     int64  
 1   Name        10 non-null     object 
 2   Age         8 non-null      float64
 3   City        8 non-null      object 
 4   Department  8 non-null      object 
 5   JoinDate    9 non-null      object 
 6   Salary      7 non-null      float64
 7   Rating      8 non-null      object 
dtypes: float64(2), int64(1), object(5)
memory usage: 772.0+ bytes


In [74]:
# Check for missing values in the DataFrame
# df_mp.isna() returns a DataFrame of the same shape with True for missing values and False for non-missing
# .sum() adds up True values for each column (True is treated as 1, False as 0)
# The result is a count of missing values per column
df_mp.isna().sum()

EmpID         0
Name          0
Age           2
City          2
Department    2
JoinDate      1
Salary        3
Rating        2
dtype: int64
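
Raw counts are easier to judge next to the share of rows they represent. The sketch below uses a small made-up frame (`toy` is hypothetical, not the lesson's CSV) and relies on the fact that `isna().mean()` gives the fraction of missing values per column:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({
    "Age": [36.0, np.nan, 12.0, 10.0],
    "City": ["Mumbai", "Delhi", None, "Pune"],
})

# Count and percentage of missing values, side by side
missing_summary = pd.DataFrame({
    "missing_count": toy.isna().sum(),
    "missing_pct": (toy.isna().mean() * 100).round(1),
})
print(missing_summary)
```

The same two expressions should work unchanged on `df_mp`.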

**🧩 2️⃣ Fix Bad Special Characters (Critical Step)**
- Convert `?`, `" "` and empty strings → NaN:

In [75]:
# Print the original DataFrame
# This shows the full data as it was read from the CSV
print("Original data:\n", df_mp)

# Replace certain placeholder values with actual NaN (missing) values
# Often in messy datasets, missing values may appear as:
# '?', an empty string '', or a single space ' '
# (this list matches exact values only; a cell with two or more spaces is NOT caught)
# np.nan is the proper missing value recognized by pandas
# Note: replace() returns a new DataFrame by default, so assign the result back
df_mp = df_mp.replace(['?', '', ' '], np.nan)

# Optional: print the DataFrame after replacement to verify changes
print("\nData after replacing placeholders with NaN:\n", df_mp)

Original data:
    EmpID    Name   Age    City Department    JoinDate    Salary Rating
0    101  Dhiraj  36.0  Mumbai       Data  2019-05-01  150000.0    4.5
1    102   Pooja   NaN   Delhi    Finance  2018-03-12  120000.0    NaN
2    103   Aarav  12.0     NaN       Tech  2020-07-22       NaN    3.8
3    104  Ananya  10.0    Pune        NaN  2021-08-01   95000.0    4.2
4    105   Vijay  28.0  Mumbai       Tech  2017-11-15  110000.0      ?
5    106   Laxmi   NaN   Delhi    Finance         NaN  105000.0    3.5
6    107   Rohan  29.0     NaN       Tech  2022-01-12   85000.0    NaN
7    108   Meera  27.0   Delhi         HR  2021-04-18       NaN    4.1
8    109     Sam  31.0  Mumbai        NaN  2019-10-05  125000.0    4.9
9    110   Kiran  33.0    Pune    Finance  2020-12-30       NaN      ?

Data after replacing placeholders with NaN:
    EmpID    Name   Age    City Department    JoinDate    Salary Rating
0    101  Dhiraj  36.0  Mumbai       Data  2019-05-01  150000.0    4.5
1    102   Pooj
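
An exact-match list only catches the placeholders it names. As a hedged alternative, the sketch below uses a regular expression so that cells of any whitespace length, or a `?` padded with spaces, also become NaN. The frame `toy` and the name `placeholder_pattern` are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with several flavours of "empty"
toy = pd.DataFrame({
    "Rating": ["4.5", "?", "   ", "", "3.8"],
    "City": ["Mumbai", " ", "Pune", " ? ", "Delhi"],
})

# Matches a cell that is empty, whitespace-only, or a lone '?'
# with optional surrounding whitespace
placeholder_pattern = r"^\s*\??\s*$"

cleaned = toy.replace(placeholder_pattern, np.nan, regex=True)
print(cleaned.isna().sum())
```

The pattern is anchored with `^` and `$`, so real values that merely contain a space or a question mark are left alone.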

**🧩 3️⃣ Convert Data Types Correctly**
- Salary & Rating should be numeric:

In [76]:
# Print original data information
# df_mp.info() shows:
# - Number of rows and columns
# - Column names
# - Data types of each column
# - Number of non-null (non-missing) values
print('Original data types:')
df_mp.info()  # info() prints directly and returns None, so no print() wrapper

# Convert 'Salary' column to numeric values
# pd.to_numeric() attempts to convert the column to numbers
# errors='coerce' will replace any value that cannot be converted (like text or '?') with NaN
df_mp['Salary'] = pd.to_numeric(df_mp['Salary'], errors='coerce')

# Convert 'Rating' column to numeric values with the same method
df_mp['Rating'] = pd.to_numeric(df_mp['Rating'], errors='coerce')

# Print data info again to see changes
# 'Rating' should change from object to float64
# ('Salary' was already float64, so its conversion is a safety check)
# Any non-convertible values are replaced with NaN
print('\nData types after conversion:\n')
df_mp.info()


Original data types:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   EmpID       10 non-null     int64  
 1   Name        10 non-null     object 
 2   Age         8 non-null      float64
 3   City        8 non-null      object 
 4   Department  8 non-null      object 
 5   JoinDate    9 non-null      object 
 6   Salary      7 non-null      float64
 7   Rating      6 non-null      object 
dtypes: float64(2), int64(1), object(5)
memory usage: 772.0+ bytes

Data types after conversion:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   EmpID       10 non-null     int64  
 1   Name        10 non-null     object 
 2   Age         8 non-null      float64
 3   City        8 non-null      object 
 4   Department  8
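
Because `errors='coerce'` turns bad values into NaN without any message, it can help to check what was lost. This is a minimal sketch on a made-up Series (`raw` is hypothetical). It compares the column before and after conversion to list the values that failed to parse:

```python
import pandas as pd

# Hypothetical ratings: two valid numbers, two unparseable strings, one true missing value
raw = pd.Series(["4.5", "3.8", "four", "4,2", None], name="Rating")

converted = pd.to_numeric(raw, errors="coerce")

# Present before conversion, NaN afterwards: these were silently coerced
lost = raw[converted.isna() & raw.notna()]
print(lost.tolist())  # ['four', '4,2']
```

Reviewing `lost` before imputing shows whether the NaNs come from data that was really missing or from formatting problems that could be repaired instead.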

**Date conversion**

In [77]:
# Convert the 'JoinDate' column to datetime format
# pd.to_datetime() tries to convert each value in the column to a datetime object
# errors='coerce' will replace any value that cannot be converted (like text, empty strings, or wrong format) with NaT (Not a Time)
df_mp['JoinDate'] = pd.to_datetime(df_mp['JoinDate'], errors='coerce')

# Print DataFrame info again to see the updated data types
# Now 'JoinDate' column should have type datetime64[ns]
# Any invalid or missing dates are marked as NaT (missing datetime)
df_mp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   EmpID       10 non-null     int64         
 1   Name        10 non-null     object        
 2   Age         8 non-null      float64       
 3   City        8 non-null      object        
 4   Department  8 non-null      object        
 5   JoinDate    9 non-null      datetime64[ns]
 6   Salary      7 non-null      float64       
 7   Rating      6 non-null      float64       
dtypes: datetime64[ns](1), float64(3), int64(1), object(3)
memory usage: 772.0+ bytes
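
The same coercion behaviour can be seen in isolation on a small made-up Series (`raw_dates` is hypothetical). Passing an explicit `format` is an assumption that every valid date uses the `YYYY-MM-DD` layout shown in the data above:

```python
import pandas as pd

# Hypothetical join dates: two valid, one invalid string, one missing
raw_dates = pd.Series(["2019-05-01", "2018-03-12", "not a date", None])

# Unparseable or missing entries become NaT instead of raising an error
parsed = pd.to_datetime(raw_dates, format="%Y-%m-%d", errors="coerce")

print(parsed.isna().sum())   # 2
print(parsed.dt.year.iloc[0])
```

Once the column is a datetime type, the `.dt` accessor (for example `.dt.year`) becomes available, which is not possible on an object column of strings.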


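**🧩 4️⃣ Advanced: Forward/Backward Fill with Limits**
- This is useful when data is sequential (e.g., reports, logs), so a neighbouring row is a reasonable stand-in for a missing one. The `limit` argument caps how many consecutive gaps are filled.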
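
A minimal sketch of the idea, using a made-up Series (`readings` is hypothetical) with a three-value gap. With `limit=1`, each fill direction repairs at most one consecutive NaN and leaves the rest of the gap for another method:

```python
import numpy as np
import pandas as pd

# Hypothetical sequential readings with a long gap and a trailing gap
readings = pd.Series([10.0, np.nan, np.nan, np.nan, 50.0, np.nan])

# Forward fill: carry the last valid value forward, at most 1 step
forward = readings.ffill(limit=1)
# -> 10, 10, NaN, NaN, 50, 50

# Backward fill: pull the next valid value backward, at most 1 step
backward = readings.bfill(limit=1)
# -> 10, NaN, NaN, 50, 50, NaN

print(pd.DataFrame({"raw": readings, "ffill_1": forward, "bfill_1": backward}))
```

Without a limit, `ffill()` would copy `10.0` across the whole gap, which can hide how much data was actually missing.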