<a href="https://colab.research.google.com/github/amrahmani/Marketing/blob/main/AIMarketing_Ch1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Load the Dataset**

**Identify Missing Values**

**Impute Missing Values**

**Remove Rows or Columns with High Missing Percentage**
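
Before running these steps on the full dataset, it may help to see them on a frame small enough to check by hand. The sketch below is only an illustration: the `toy` frame, its values, and the `Notes` column are made up, and the 50% cutoff is applied with boolean masks rather than `drop`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: values are invented, and "Notes" is not a column of the real dataset.
toy = pd.DataFrame({
    "UnitPrice": [2.0, np.nan, 4.0, 6.0],
    "CustomerID": [17850.0, np.nan, np.nan, 13047.0],
    "Description": ["A", np.nan, "C", "D"],
    "Notes": [np.nan, np.nan, np.nan, "gift"],
})

# Identify: count missing values per column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Impute: fill the numeric column with its mean and assign the result back.
toy["UnitPrice"] = toy["UnitPrice"].fillna(toy["UnitPrice"].mean())

# Remove columns whose share of missing values exceeds 50%.
column_share = toy.isnull().mean()
toy = toy.loc[:, column_share <= 0.5]

# Remove rows whose share of missing values exceeds 50%.
row_share = toy.isnull().mean(axis=1)
toy = toy.loc[row_share <= 0.5]

print(toy)
```

In this toy frame, `Notes` is three-quarters empty and is dropped, after which the second row is missing two of its three remaining values and is dropped as well.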

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset
df = pd.read_csv('./data/OnlineRetail.csv', encoding='ISO-8859-1')

print("Original DataFrame Shape:", df.shape)

# 1- Identify Missing Values
print("\n--- 1. Initial Missing Values ---\n")
print(df.isnull().sum())
initial_missing_count = df.isnull().sum().sum()
print(f"\nTotal initial missing values: {initial_missing_count}")

# 2- Impute Missing Values with Mean
print("\n--- 2. Imputing Missing Values (UnitPrice) ---\n")
print("Rows with missing UnitPrice before imputation:")
print(df[df['UnitPrice'].isnull()])

mean_unit_price = df['UnitPrice'].mean()
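# Assign the result back: fillna(inplace=True) on a selected column is chained assignment and stops working in pandas 3.0
df['UnitPrice'] = df['UnitPrice'].fillna(mean_unit_price)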

print("\nRows with missing UnitPrice after imputation:")
print(df[df['UnitPrice'].isnull()])
print("\nMissing values after UnitPrice imputation:")
print(df.isnull().sum())

# 3- Remove rows or columns with a high percentage of missing values (> 50%)
print("\n--- 3. Removing Rows/Columns with High Missing Percentage (> 50%) ---\n")

# Check column-wise missing percentage
column_missing_percentage = (df.isnull().sum() / len(df)) * 100
columns_to_drop = column_missing_percentage[column_missing_percentage > 50].index
print("Columns with > 50% missing values:", columns_to_drop)

if not columns_to_drop.empty:
    print("\nDataFrame info before dropping columns:")
    df.info()
    df.drop(columns=columns_to_drop, inplace=True)
    print("\nDataFrame info after dropping columns:")
    df.info()
else:
    print("\nNo columns found with more than 50% missing values.")

# Check row-wise missing percentage
row_missing_percentage = (df.isnull().sum(axis=1) / df.shape[1]) * 100
rows_to_drop = row_missing_percentage[row_missing_percentage > 50].index
print("\nNumber of rows with > 50% missing values:", len(rows_to_drop))

if not rows_to_drop.empty:
    print("\nFirst 5 rows with > 50% missing values before dropping:")
    print(df.loc[rows_to_drop.head()])
    df.drop(index=rows_to_drop, inplace=True)
    print("\nDataFrame shape after dropping rows:", df.shape)
    print("\nFirst 5 rows with > 50% missing values after dropping:")
    # Check if any still exist (should be none)
    remaining_high_missing_rows = df[(df.isnull().sum(axis=1) / df.shape[1]) * 100 > 50]
    if not remaining_high_missing_rows.empty:
        print(remaining_high_missing_rows.head())
    else:
        print("No rows with > 50% missing values remain.")
else:
    print("\nNo rows found with more than 50% missing values.")

print("\nMissing values after handling high percentage missing rows/columns:")
print(df.isnull().sum())

Original DataFrame Shape: (541909, 8)

--- 1. Initial Missing Values ---

InvoiceNo           0
StockCode           0
Description      1454
Quantity            0
InvoiceDate         0
UnitPrice           0
CustomerID     135080
Country             0
dtype: int64

Total initial missing values: 136534

--- 2. Imputing Missing Values (UnitPrice) ---

Rows with missing UnitPrice before imputation:
Empty DataFrame
Columns: [InvoiceNo, StockCode, Description, Quantity, InvoiceDate, UnitPrice, CustomerID, Country]
Index: []

Rows with missing UnitPrice after imputation:
Empty DataFrame
Columns: [InvoiceNo, StockCode, Description, Quantity, InvoiceDate, UnitPrice, CustomerID, Country]
Index: []

Missing values after UnitPrice imputation:
InvoiceNo           0
StockCode           0
Description      1454
Quantity            0
InvoiceDate         0
UnitPrice           0
CustomerID     135080
Country             0
dtype: int64

--- 3. Removing Rows/Columns with High Missing Percentage (> 50%) ---


Columns with > 50% missing values: Index([], dtype='object')

No columns found with more than 50% missing values.

Number of rows with > 50% missing values: 0

No rows found with more than 50% missing values.

Missing values after handling high percentage missing rows/columns:
InvoiceNo           0
StockCode           0
Description      1454
Quantity            0
InvoiceDate         0
UnitPrice           0
CustomerID     135080
Country             0
dtype: int64
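
The output above shows that nothing was dropped in step 3. A quick arithmetic check, using only the counts printed above, suggests why: neither column with missing values comes close to the 50% cutoff. The helper name `missing_share` is an assumption made for this sketch.

```python
# Counts copied from the recorded output above.
total_rows = 541909
total_columns = 8
missing_counts = {"Description": 1454, "CustomerID": 135080}


def missing_share(count, total):
    """Return the percentage of entries that are missing."""
    return 100.0 * count / total


# Column-wise share of missing values.
shares = {col: missing_share(n, total_rows) for col, n in missing_counts.items()}
for col, pct in shares.items():
    print(f"{col}: {pct:.2f}% missing")

above_threshold = [col for col, pct in shares.items() if pct > 50]
print("Columns above the 50% threshold:", above_threshold)

# Row-wise upper bound: only two of the eight columns contain missing values,
# so no single row can be missing more than 2 of its 8 fields.
max_row_share = missing_share(len(missing_counts), total_columns)
print(f"Largest possible missing share in a row: {max_row_share:.1f}%")
```

CustomerID is missing in roughly a quarter of the rows and Description in well under 1%, so the column filter finds nothing to remove, and the row-wise bound of 25% means the row filter cannot either.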
