In [1]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## Identifying Missing Data

Pandas represents missing numeric values with the special placeholder NaN (Not a Number), and it also treats Python's None as missing. Detecting where these missing values occur is the first step before deciding how to handle them.

Pandas provides two primary functions to check for missing data:

- `isnull()`: Returns True for every missing value (NaN) and False otherwise.
- `notnull()`: Returns True for non-missing values and False for missing ones.
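
As a small sketch of how the two checks relate, the snippet below applies both to a standalone Series (the name `sales` is illustrative, not part of the notebook's DataFrame). It also uses `isna()`, which pandas provides as an alias of `isnull()`.

```python
import pandas as pd

# Illustrative Series with one missing entry
sales = pd.Series([200, None, 400, 300], name="SalesAmount")

missing_mask = sales.isnull()    # True where the value is missing
present_mask = sales.notnull()   # True where the value is present

print(missing_mask.tolist())
print("Non-missing count:", present_mask.sum())

# isna() is an alias of isnull(), so both produce the same mask
print(sales.isna().equals(missing_mask))
```

The two masks are exact complements, so either one can be used for filtering with `sales[present_mask]` or `sales[~missing_mask]`.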


In [2]:
# Sample DataFrame with missing values
data = {
    "ProductID": [101, 102, 103, 104],
    "SalesAmount": [200, None, 400, 300],
    "Region": ["North", "South", None, "East"]
}

In [3]:
# Build a DataFrame from the sample data
df = pd.DataFrame(data)

In [4]:
# Boolean mask: True marks a missing cell
print(df.isnull())

   ProductID  SalesAmount  Region
0      False        False   False
1      False         True   False
2      False        False    True
3      False        False   False


In [5]:
# Summarize missing values by column
print(df.isnull().sum())

ProductID      0
SalesAmount    1
Region         1
dtype: int64
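
Counts are easier to judge as a share of the column. One possible way to get that, sketched here on a rebuilt copy of the sample data (the name `df_example` is only for this snippet), is to take the mean of the boolean mask, since True counts as 1.

```python
import pandas as pd

# Same sample data as above, rebuilt so this snippet runs on its own
df_example = pd.DataFrame({
    "ProductID": [101, 102, 103, 104],
    "SalesAmount": [200, None, 400, 300],
    "Region": ["North", "South", None, "East"],
})

# Mean of a boolean mask is the fraction of True values per column
missing_pct = df_example.isnull().mean() * 100
print(missing_pct)
```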


## Removing Missing Data
In some cases, it may be necessary to remove rows or columns with missing values, especially when the proportion of missing data is small. Pandas provides the dropna() function for this purpose.

In [6]:
# This removes any row that contains at least one missing value.
df_cleaned_rows = df.dropna()
print(df_cleaned_rows)

   ProductID  SalesAmount Region
0        101        200.0  North
3        104        300.0   East
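
Dropping every incomplete row can be stricter than needed. If only some columns matter for the analysis, `dropna()` accepts a `subset` argument. The sketch below assumes that only SalesAmount must be present, and it uses a rebuilt copy of the sample data named `df_example`.

```python
import pandas as pd

# Same sample data as above, rebuilt so this snippet runs on its own
df_example = pd.DataFrame({
    "ProductID": [101, 102, 103, 104],
    "SalesAmount": [200, None, 400, 300],
    "Region": ["North", "South", None, "East"],
})

# Drop a row only when SalesAmount is missing; a missing Region is tolerated
rows_with_sales = df_example.dropna(subset=["SalesAmount"])
print(rows_with_sales)
```

Here row 2 survives even though its Region is missing, because Region is not listed in `subset`.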


## Removing Columns with Missing Values
If a column has too many missing values, it may be better to drop the entire column.

In [7]:
# Remove columns with any missing values
df_cleaned_columns = df.dropna(axis=1)
print(df_cleaned_columns)

   ProductID
0        101
1        102
2        103
3        104
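
`dropna(axis=1)` removes a column as soon as it has a single missing value. To drop only columns that are mostly empty, one option is to filter on the share of missing values. This is a sketch under assumptions: the `Discount` column and the 50% cutoff (`max_missing_share`) are hypothetical and not part of the original data.

```python
import pandas as pd

# Sample data plus a hypothetical, mostly empty "Discount" column
df_example = pd.DataFrame({
    "ProductID": [101, 102, 103, 104],
    "SalesAmount": [200, None, 400, 300],
    "Discount": [None, None, None, 5.0],
})

max_missing_share = 0.5  # hypothetical cutoff; tune it for your data

# Fraction of missing values in each column
missing_share = df_example.isnull().mean()

# Keep only the columns at or below the cutoff
df_reduced = df_example.loc[:, missing_share <= max_missing_share]
print(df_reduced)
```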


In [8]:
# Keep rows with at least 2 non-NaN values (every row here qualifies, so none are dropped)
df_thresh = df.dropna(thresh=2)
print(df_thresh)

   ProductID  SalesAmount Region
0        101        200.0  North
1        102          NaN  South
2        103        400.0   None
3        104        300.0   East
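
With `thresh=2` nothing is removed, because each row has at most one missing value out of three. Raising the threshold shows the effect more clearly. The sketch below rebuilds the sample data as `df_example` and requires all three values to be present.

```python
import pandas as pd

# Same sample data as above, rebuilt so this snippet runs on its own
df_example = pd.DataFrame({
    "ProductID": [101, 102, 103, 104],
    "SalesAmount": [200, None, 400, 300],
    "Region": ["North", "South", None, "East"],
})

# Require at least 3 non-missing values per row
df_thresh_3 = df_example.dropna(thresh=3)
print(df_thresh_3)
```

Because the frame has exactly three columns, `thresh=3` behaves like the default `dropna()` here and keeps only rows 0 and 3.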


## Imputing Missing Data
Imputation refers to replacing missing values with meaningful substitutes, such as the mean, median, or a fixed value. This technique is useful when dropping rows or columns would result in the loss of too much data.

### Using fillna() to Impute Data

The `fillna()` function replaces missing values with a value you supply, such as a constant or a statistic computed from the column.

In [9]:
# Fill the missing values with the column mean
mean_sales = df['SalesAmount'].mean()
df_filled = df['SalesAmount'].fillna(mean_sales)

In [10]:
print(df_filled)

0    200.0
1    300.0
2    400.0
3    300.0
Name: SalesAmount, dtype: float64
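
The median is often preferred over the mean when a column contains outliers. The sketch below uses made-up values (not the notebook's data) with one unusually large sale to show how the two choices differ.

```python
import pandas as pd

# Hypothetical sales figures with one outlier (3000.0)
sales = pd.Series([200.0, None, 400.0, 3000.0], name="SalesAmount")

filled_with_mean = sales.fillna(sales.mean())      # mean is pulled up by the outlier
filled_with_median = sales.fillna(sales.median())  # median is less sensitive to it

print(filled_with_mean.iloc[1], filled_with_median.iloc[1])
```

The mean of the three known values is 1200.0, while the median is 400.0, so the imputed value depends heavily on which statistic is used.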


## Filling Missing Data with a Fixed Value
You can replace missing values with a fixed value (e.g., 0 or 'Unknown').


In [11]:
df['Region'] = df['Region'].fillna('Unknown')
print(df)

   ProductID  SalesAmount   Region
0        101        200.0    North
1        102          NaN    South
2        103        400.0  Unknown
3        104        300.0     East


## Using Different Fill Values for Different Columns
You can pass a dictionary to fillna() to use different values for different columns. Each key must match a column name exactly, including case; keys that match no column are silently ignored.

In [12]:
df_new_field = df.fillna({'SalesAmount': df['SalesAmount'].mean(),
                          'Region': 'Unknown'
                         })

In [13]:
print(df_new_field)

   ProductID  SalesAmount   Region
0        101        200.0    North
1        102        300.0    South
2        103        400.0  Unknown
3        104        300.0     East


## Forward and Backward Fill
Forward and backward filling are especially useful for time series data, where missing values can be replaced with the last known value (forward fill) or the next known value (backward fill).

Forward Fill (`ffill`)

Forward fill propagates the last valid observation forward to the next missing value.

Backward Fill (`bfill`)

Backward fill propagates the next valid observation backward to the previous missing value.

In [14]:
# ffill() replaces fillna(method='ffill'), which is deprecated in recent pandas releases
df_forward_fill = df.ffill()
print(df_forward_fill)

   ProductID  SalesAmount   Region
0        101        200.0    North
1        102        200.0    South
2        103        400.0  Unknown
3        104        300.0     East


In [15]:
# bfill() replaces fillna(method='bfill') in the same way
df_backward_fill = df.bfill()
print(df_backward_fill)

   ProductID  SalesAmount   Region
0        101        200.0    North
1        102        400.0    South
2        103        400.0  Unknown
3        104        300.0     East
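
For time series, it can be risky to carry a value forward across a long gap. As a sketch, the example below builds a small daily series with made-up values and caps the fill at one step using the `limit` argument of `ffill()`.

```python
import pandas as pd

# Hypothetical daily sales with a two-day gap and a trailing gap
dates = pd.date_range("2024-01-01", periods=5, freq="D")
daily_sales = pd.Series([200.0, None, None, 400.0, None], index=dates)

# Carry the last known value forward, but by at most one consecutive step
filled_limited = daily_sales.ffill(limit=1)
print(filled_limited)
```

The second day of the two-day gap stays missing, which makes longer gaps visible instead of silently filling them.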


## Advanced Imputation Techniques: Interpolation

Interpolation is a method of estimating missing values by using the known values in the dataset. Pandas provides the `interpolate()` method, which can be useful for numerical data, especially in time series.

In [16]:
# Linear interpolation (the default) on the numeric column
df[['SalesAmount']].interpolate()

   SalesAmount
0        200.0
1        300.0
2        400.0
3        300.0


In this example, the missing value in SalesAmount at index 1 is estimated as 300.0, the midpoint of its neighbors 200.0 and 400.0, because the default linear method treats rows as equally spaced. This method works well when the data follows a smooth trend between known points.
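
When observations are unevenly spaced in time, equal spacing may not be a good assumption. The sketch below uses made-up dates and values to compare the default linear method with `method="time"`, which weights the estimate by the actual time gaps and needs a DatetimeIndex.

```python
import pandas as pd

# Hypothetical readings: the gap after the missing point is three times longer
dates = pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-05"])
uneven_sales = pd.Series([100.0, None, 500.0], index=dates)

by_position = uneven_sales.interpolate()           # ignores the index spacing
by_time = uneven_sales.interpolate(method="time")  # uses the dates in the index

print(by_position.iloc[1], by_time.iloc[1])
```

The positional estimate is the midpoint, 300.0. The time-based estimate is 200.0, because January 2 is one quarter of the way from January 1 to January 5.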

## Different Methods for Interpolation
Pandas supports various interpolation methods, such as linear (the default), polynomial, and spline. The polynomial and spline methods require an `order` argument and rely on SciPy being installed. Choose the method that best fits your data.


In [17]:
# Polynomial interpolation (degree 2)
df_interpolated_poly = df['SalesAmount'].interpolate(method='polynomial', order=2)
print(df_interpolated_poly)

0    200.000000
1    366.666667
2    400.000000
3    300.000000
Name: SalesAmount, dtype: float64
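
To see where 366.666667 comes from, the value can be checked by hand. The sketch below does not call pandas; it fits a degree-2 polynomial through the three known points with NumPy and evaluates it at the missing position. It assumes the row positions 0, 2, and 3 serve as the x-coordinates.

```python
import numpy as np

# Known points: row position -> SalesAmount (row 1 is the missing one)
x_known = np.array([0.0, 2.0, 3.0])
y_known = np.array([200.0, 400.0, 300.0])

# Three points determine a unique quadratic, so the fit passes through all of them
coeffs = np.polyfit(x_known, y_known, deg=2)

# Evaluate the quadratic at the missing position
estimate_at_1 = np.polyval(coeffs, 1.0)
print(round(estimate_at_1, 6))
```

The result is approximately 366.666667, which matches the interpolated value above. Note that it lies above the linear estimate of 300.0, since the quadratic curves upward toward the peak at 400.0.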
