<a id="table-of-contents"></a>

![Status: Complete](https://img.shields.io/badge/status-complete-brightgreen)
![Python](https://img.shields.io/badge/python-3.10-blue)
![Coverage](https://img.shields.io/badge/coverage-95%25-yellowgreen)
![License](https://img.shields.io/badge/license-MIT-green)

# 📖 Table of Contents

___



`Imputation`
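- mean, median, or mode of the observed values
- time series: smoothing, or carrying the last observation forward / the next observation backward
- regression imputation: predict the missing field from the other fields (for example with linear regression)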

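`Alternatives to imputation`
- drop the rows or columns that contain missing values
- use methods that tolerate missing values natively (for example, some tree-based models)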
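
The simpler imputation strategies listed above (mean, median, mode, and carry-forward/backward) can be sketched with pandas alone. This is a minimal illustration, not part of the notebook's pipeline: the `toy` frame and its column names are hypothetical, and linear interpolation is used here as one possible stand-in for "smoothing".

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the column names are illustrative only.
toy = pd.DataFrame({
    "price": [1.0, np.nan, 3.0, np.nan, 5.0],
    "country": ["UK", "UK", None, "FR", "UK"],
})

# Mean / median imputation for a numeric column.
price_mean = toy["price"].fillna(toy["price"].mean())
price_median = toy["price"].fillna(toy["price"].median())

# Mode imputation for a categorical column.
# mode() can return several values on ties, so take the first one.
country_mode = toy["country"].fillna(toy["country"].mode().iloc[0])

# Time series: carry the last observation forward, or the next one backward.
price_ffill = toy["price"].ffill()
price_bfill = toy["price"].bfill()

# Linear interpolation between neighbouring observations
# (assumed here as a simple smoothing-style fill).
price_interp = toy["price"].interpolate(method="linear")
```

Mean and median fills leave the column's centre unchanged but shrink its variance, so they tend to suit columns with only a small share of missing values.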

<a id="data-setup"></a>
# 🗂️ Data Setup

#### 📚 Import Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

#### 📄 Generate Dataset

In [2]:
import pandas as pd
import numpy as np

# Set a seed for reproducibility
np.random.seed(42)

# Number of rows in the dummy dataset
n_rows = 100

# Define possible values for categorical columns
stock_codes = ['85123A', '71053', '84406B', '84406G', '84406E']
descriptions = [
    'WHITE HANGING HEART T-LIGHT HOLDER',
    'WHITE METAL LANTERN',
    'CREAM CUPID HEARTS COAT HANGER',
    'PARTY BUNTING',
    'RED WOOLLY HOTTIE'
]
countries = ['United Kingdom', 'France', 'Germany', 'Spain', 'Netherlands']
customer_ids = [np.nan, 12346, 12347, 12348, 12349]

# Create dummy retail dataset and assign it to df
df = pd.DataFrame({
    'InvoiceNo': np.random.randint(10000, 20000, n_rows),
    'StockCode': np.random.choice(stock_codes, n_rows),
    'Description': np.random.choice(descriptions, n_rows),
    'Quantity': np.random.randint(1, 10, n_rows),
    'InvoiceDate': pd.date_range('2020-01-01', periods=n_rows, freq='D'),
    'UnitPrice': np.round(np.random.uniform(0.5, 20.0, n_rows), 2),
    'CustomerID': np.random.choice(customer_ids, n_rows),
    'Country': np.random.choice(countries, n_rows)
})

# Introduce additional random missingness into numeric columns (around 10% missing)
missing_mask = np.random.rand(n_rows) < 0.1  # roughly 10% missing
df.loc[missing_mask, 'Quantity'] = np.nan
missing_mask = np.random.rand(n_rows) < 0.1
df.loc[missing_mask, 'UnitPrice'] = np.nan

# Display a preview of the dummy dataset
print("Dummy Retail Dataset Preview:")
display(df.head())

# Optionally, show dataset info for a quick summary
print("\nDataset Info:")
df.info()


Dummy Retail Dataset Preview:


| | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
|---|---|---|---|---|---|---|---|---|
| 0 | 17270 | 84406G | WHITE METAL LANTERN | 5.0 | 2020-01-01 | NaN | 12349.0 | France |
| 1 | 10860 | 85123A | CREAM CUPID HEARTS COAT HANGER | 8.0 | 2020-01-02 | 9.56 | NaN | United Kingdom |
| 2 | 15390 | 84406G | WHITE HANGING HEART T-LIGHT HOLDER | 1.0 | 2020-01-03 | 13.17 | 12348.0 | Germany |
| 3 | 15191 | 71053 | WHITE HANGING HEART T-LIGHT HOLDER | 5.0 | 2020-01-04 | 1.44 | NaN | Netherlands |
| 4 | 15734 | 85123A | PARTY BUNTING | 3.0 | 2020-01-05 | 19.01 | NaN | France |



Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   InvoiceNo    100 non-null    int64         
 1   StockCode    100 non-null    object        
 2   Description  100 non-null    object        
 3   Quantity     87 non-null     float64       
 4   InvoiceDate  100 non-null    datetime64[ns]
 5   UnitPrice    94 non-null     float64       
 6   CustomerID   79 non-null     float64       
 7   Country      100 non-null    object        
dtypes: datetime64[ns](1), float64(3), int64(1), object(3)
memory usage: 6.4+ KB


# Missing Value Detection

#### Column-wise


In [None]:
def detect_missing_by_column(df, visualize=False, plot_type='heatmap'):
    """
    Generate a table of missing values by column.
    
    Parameters:
        df (pd.DataFrame): Input data.
        visualize (bool): If True, displays a plot; otherwise, just returns the table.
        plot_type (str): Type of visualization ('heatmap' or 'bar').
    
    Returns:
        pd.DataFrame: A DataFrame with missing counts and percentages.
    """
    missing_counts = df.isnull().sum()
    missing_percent = 100 * missing_counts / len(df)
    summary = pd.DataFrame({
        'Missing Count': missing_counts,
        'Missing Percentage': missing_percent
    })
    
    if visualize:
        plt.figure(figsize=(10, 4))
        if plot_type == 'heatmap':
            sns.heatmap(df.isnull(), cbar=False, yticklabels=False)
            plt.title("Missing Values Heatmap")
        elif plot_type == 'bar':
            summary['Missing Count'].plot(kind='bar', color='skyblue')
            plt.title("Missing Values by Column")
            plt.ylabel("Count")
        plt.show()
        
    return summary

# Example usage:
col_summary = detect_missing_by_column(df, visualize=True, plot_type='heatmap')
print(col_summary)


#### Row-wise

In [3]:
def visualize_missing_by_row(df, visualize=True, plot='bar'):
    """
    Visualize missing values on a per-row basis.
    
    Parameters:
        df (pd.DataFrame): Input data.
        visualize (bool): If True, displays a bar chart of missing values per row.
        plot (str): Currently supports 'bar' for visualization.
    
    Returns:
        pd.DataFrame: DataFrame with an added 'Missing Count' column.
    """
    df_copy = df.copy()
    df_copy['Missing Count'] = df_copy.isnull().sum(axis=1)
    
    if visualize and plot == 'bar':
        plt.figure(figsize=(10, 4))
        df_copy['Missing Count'].plot(kind='bar', color='salmon')
        plt.title("Missing Values per Row")
        plt.xlabel("Row Index")
        plt.ylabel("Missing Count")
        plt.show()
        
    return df_copy[['Missing Count']]

# Example usage:
row_missing = visualize_missing_by_row(df, visualize=True, plot='bar')
print(row_missing.head())

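#### Drop Missing Rows or Columns

In [None]:
# Drop rows that contain at least one missing value
df_rows_dropped = df.dropna()

# Drop columns that contain at least one missing value
# (thresh=1 would only drop columns that are entirely missing)
df_cols_dropped = df.dropna(axis=1)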
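
The `dropna` options are easy to mix up, so a small comparison may help. The sketch below uses a hypothetical `toy` frame (not the retail dataset) to show how `how`, `thresh`, and `subset` change what gets dropped. `thresh` is the minimum number of non-missing values a row or column needs in order to be kept.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: 'a' has one gap, 'b' is entirely missing, 'c' is complete.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [np.nan, np.nan, np.nan, np.nan],
    "c": [1.0, 2.0, 3.0, 4.0],
})

# Rows with any missing value are dropped; here every row is lost because 'b' is all NaN.
rows_complete = toy.dropna()

# Columns with any missing value are dropped; only 'c' survives.
cols_complete = toy.dropna(axis=1)

# Only columns that are entirely missing are dropped; 'a' and 'c' survive.
cols_not_empty = toy.dropna(axis=1, how="all")

# Keep columns with at least 3 non-missing values; 'a' (3) and 'c' (4) survive.
cols_thresh = toy.dropna(axis=1, thresh=3)

# Only consider column 'a' when deciding which rows to drop.
rows_subset = toy.dropna(subset=["a"])
```

Restricting the check with `subset` is often the gentlest option, since it keeps rows whose gaps sit in columns the analysis does not use.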

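#### KNN Imputer

In [1]:
from sklearn.impute import KNNImputer  # requires scikit-learn to be installed

# Assumed feature matrix: the numeric columns that contain missing values
X = df[['Quantity', 'UnitPrice']]
knn_imputer = KNNImputer(n_neighbors=3)
X_imputed = knn_imputer.fit_transform(X)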
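
To make the idea behind KNN imputation concrete without depending on scikit-learn, here is a rough NumPy-only sketch. The function `knn_impute` is hypothetical and written for illustration; it is not scikit-learn's implementation. It assumes a NaN-aware Euclidean distance (squared differences over the features both rows observe, rescaled by the share of features used) and fills each gap with the mean of that feature among the nearest rows that observe it.

```python
import numpy as np

def knn_impute(X, n_neighbors=2):
    """Illustrative KNN imputation: fill each NaN with the mean of the
    same feature in the nearest rows that have that feature observed."""
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    n_rows, n_cols = X.shape
    for i in range(n_rows):
        for j in np.flatnonzero(np.isnan(X[i])):
            candidates = []
            for k in range(n_rows):
                # A donor row must be a different row and must observe feature j.
                if k == i or np.isnan(X[k, j]):
                    continue
                shared = ~np.isnan(X[i]) & ~np.isnan(X[k])
                if not shared.any():
                    continue  # no common features, so no distance can be computed
                sq_diff = np.sum((X[i, shared] - X[k, shared]) ** 2)
                # Rescale so rows with few shared features are not favoured.
                dist = np.sqrt(n_cols / shared.sum() * sq_diff)
                candidates.append((dist, X[k, j]))
            if candidates:
                candidates.sort(key=lambda pair: pair[0])
                nearest = [value for _, value in candidates[:n_neighbors]]
                filled[i, j] = np.mean(nearest)
            # If no donor exists, the value is left as NaN in this sketch.
    return filled

# Hypothetical example: row 1 is missing its second feature.
example = np.array([
    [1.0, 2.0],
    [1.1, np.nan],
    [5.0, 6.0],
    [0.9, 2.2],
])
example_filled = knn_impute(example, n_neighbors=2)
```

In this example the two rows closest to row 1 are rows 0 and 3, so the gap is filled with the mean of 2.0 and 2.2. Because the method is distance-based, features on very different scales (such as `Quantity` and `UnitPrice`) usually benefit from scaling first.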

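#### Iterative Imputer

In [2]:
# IterativeImputer is experimental; this import must come first to enable it
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer  # requires scikit-learn to be installed

# Assumed feature matrix: the numeric columns that contain missing values
X = df[['Quantity', 'UnitPrice']]
imputer = IterativeImputer(max_iter=10, random_state=0)
X_imputed = imputer.fit_transform(X)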
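
The regression idea listed under `Imputation` (predict a field from other fields) is the building block that iterative imputation applies to each incomplete column in turn. A single-column version can be sketched with NumPy and pandas. The `toy` frame, its column names, and the choice of a straight-line fit via `np.polyfit` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'y' depends linearly on 'x' and has two gaps.
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [2.0, 4.0, np.nan, 8.0, np.nan],
})

observed = toy["y"].notna()

# Fit y = slope * x + intercept using only the rows where y is observed.
slope, intercept = np.polyfit(
    toy.loc[observed, "x"].to_numpy(),
    toy.loc[observed, "y"].to_numpy(),
    deg=1,
)

# Predict y only for the rows where it is missing; observed values stay untouched.
toy_imputed = toy.copy()
toy_imputed.loc[~observed, "y"] = slope * toy.loc[~observed, "x"] + intercept
```

Because the observed points here lie exactly on `y = 2x`, the gaps at `x = 3` and `x = 5` are filled with values close to 6 and 10. On real data the predictions sit on the fitted line, which understates the natural scatter of the imputed column.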