# Polars Missing Data and Duplicate Handling - Comprehensive Guide

This notebook covers handling missing values (nulls) and duplicate data in Polars.

## What You'll Learn:
- Comprehensive null/missing value handling strategies
- Forward fill, backward fill, and interpolation
- Null-safe operations and propagation
- Identifying and analyzing duplicates
- Removing duplicates with different strategies
- Missing data patterns and how to summarize them
- Best practices for data cleaning

In [None]:
import polars as pl
import numpy as np
from datetime import datetime, date, timedelta

print(f"Polars version: {pl.__version__}")

---
# Part 1: Understanding Null Values

In [None]:
# Create DataFrame with null values
df_nulls = pl.DataFrame({
    'id': [1, 2, 3, 4, 5, 6, 7, 8],
    'name': ['Alice', 'Bob', None, 'Diana', 'Eve', None, 'Frank', 'Grace'],
    'age': [25, None, 35, None, 28, 42, None, 31],
    'score': [85.5, 92.3, None, 88.9, None, 95.2, 78.1, None],
    'date': [
        date(2024, 1, 1), date(2024, 1, 2), None, 
        date(2024, 1, 4), date(2024, 1, 5), None,
        date(2024, 1, 7), date(2024, 1, 8)
    ]
})

print("DataFrame with null values:")
print(df_nulls)

In [None]:
# Check for null values
print("Null count per column:")
print(df_nulls.null_count())

print("\nNull percentage per column:")
null_pct = (df_nulls.null_count() / len(df_nulls) * 100)
print(null_pct)

In [None]:
# Flag rows that contain at least one null
df_with_null_flag = df_nulls.with_columns(
    pl.any_horizontal(pl.all().is_null()).alias('has_null')
)
print("Rows with any null value:")
print(df_with_null_flag.filter(pl.col('has_null')))
print(f"\nTotal rows with nulls: {df_with_null_flag['has_null'].sum()}")

---
# Part 2: Filling Null Values

## 2.1 Fill with Literal Values

In [None]:
# Fill nulls with specific values
df_filled_literal = df_nulls.with_columns([
    pl.col('name').fill_null('Unknown').alias('name_filled'),
    pl.col('age').fill_null(0).alias('age_filled'),
    pl.col('score').fill_null(0.0).alias('score_filled')
])

print("Fill nulls with literals:")
print(df_filled_literal.select(['name', 'name_filled', 'age', 'age_filled', 'score', 'score_filled']))

## 2.2 Fill with Statistical Values

In [None]:
# Fill with mean, median, min, max
df_filled_stats = df_nulls.with_columns([
    pl.col('age').fill_null(pl.col('age').mean()).alias('age_mean'),
    pl.col('age').fill_null(pl.col('age').median()).alias('age_median'),
    pl.col('score').fill_null(pl.col('score').mean()).alias('score_mean')
])

print("Fill nulls with statistics:")
print(df_filled_stats.select([
    'id', 'age', 'age_mean', 'age_median', 'score', 'score_mean'
]))

## 2.3 Forward Fill and Backward Fill

In [None]:
# Forward fill (propagate last valid value forward)
df_forward_fill = df_nulls.with_columns([
    pl.col('name').fill_null(strategy='forward').alias('name_ffill'),
    pl.col('age').fill_null(strategy='forward').alias('age_ffill'),
    pl.col('score').fill_null(strategy='forward').alias('score_ffill')
])

print("Forward fill:")
print(df_forward_fill.select(['id', 'name', 'name_ffill', 'age', 'age_ffill']))

In [None]:
# Backward fill (propagate next valid value backward)
df_backward_fill = df_nulls.with_columns([
    pl.col('name').fill_null(strategy='backward').alias('name_bfill'),
    pl.col('age').fill_null(strategy='backward').alias('age_bfill'),
    pl.col('score').fill_null(strategy='backward').alias('score_bfill')
])

print("Backward fill:")
print(df_backward_fill.select(['id', 'name', 'name_bfill', 'age', 'age_bfill']))

## 2.4 Interpolation (for numeric/temporal data)

In [None]:
# Linear interpolation for numeric columns
df_interpolated = df_nulls.with_columns([
    pl.col('age').interpolate().alias('age_interpolated'),
    pl.col('score').interpolate().alias('score_interpolated')
])

print("Linear interpolation:")
print(df_interpolated.select(['id', 'age', 'age_interpolated', 'score', 'score_interpolated']))
print("\n💡 Interpolation estimates values between known points; leading and trailing nulls stay null")

## 2.5 Conditional Fill (Coalesce)

In [None]:
# Coalesce: Use first non-null value from multiple columns
df_backup = pl.DataFrame({
    'id': [1, 2, 3, 4],
    'primary_email': ['alice@example.com', None, 'charlie@example.com', None],
    'secondary_email': [None, 'bob_backup@example.com', None, 'diana_backup@example.com'],
    'default_email': ['default@example.com'] * 4
})

df_coalesced = df_backup.with_columns(
    pl.coalesce(['primary_email', 'secondary_email', 'default_email']).alias('email')
)

print("Coalesce (first non-null):")
print(df_coalesced)

---
# Part 3: Dropping Null Values

In [None]:
# Drop rows with ANY null value
df_dropped_any = df_nulls.drop_nulls()

print("Drop rows with any null:")
print(df_dropped_any)
print(f"\nRows: {len(df_nulls)} -> {len(df_dropped_any)}")

In [None]:
# Drop rows with null in specific columns
df_dropped_subset = df_nulls.drop_nulls(subset=['name', 'age'])

print("Drop rows with null in 'name' or 'age':")
print(df_dropped_subset)
print(f"\nRows: {len(df_nulls)} -> {len(df_dropped_subset)}")

---
# Part 4: Null-Safe Operations

## 4.1 Null Propagation in Arithmetic

In [None]:
# Nulls propagate through arithmetic operations
df_arithmetic = pl.DataFrame({
    'a': [1, 2, None, 4, 5],
    'b': [10, None, 30, 40, None]
})

df_arithmetic_result = df_arithmetic.with_columns([
    (pl.col('a') + pl.col('b')).alias('sum'),
    (pl.col('a') * pl.col('b')).alias('product'),
    (pl.col('a') / pl.col('b')).alias('division')
])

print("Null propagation in arithmetic:")
print(df_arithmetic_result)
print("\n💡 Arithmetic with a null operand yields null")

## 4.2 Nulls in Aggregations

In [None]:
# Aggregations typically ignore nulls
df_agg = pl.DataFrame({
    'group': ['A', 'A', 'A', 'B', 'B', 'B'],
    'value': [10, None, 30, 40, 50, None]
})

result = df_agg.group_by('group').agg([
    pl.col('value').sum().alias('sum'),
    pl.col('value').mean().alias('mean'),
    pl.col('value').count().alias('count'),
    pl.col('value').null_count().alias('null_count')
])

print("Aggregations ignore nulls:")
print(result)
print("\n💡 sum, mean, and count skip nulls; use pl.len() to count all rows, nulls included")

## 4.3 Nulls in Comparisons

In [None]:
# Comparing against null with ==, >, etc. yields null (not True or False)
df_compare = pl.DataFrame({
    'value': [1, 2, None, 4, None]
})

df_compare_result = df_compare.with_columns([
    (pl.col('value') > 2).alias('greater_than_2'),
    (pl.col('value') == None).alias('equals_none'),  # Always null!
    pl.col('value').is_null().alias('is_null'),  # Correct way
    pl.col('value').is_not_null().alias('is_not_null')
])

print("Null comparisons:")
print(df_compare_result)
print("\n⚠️ Use .is_null() instead of == None")

## 4.4 Nulls in Joins

In [None]:
# Nulls don't match in joins (by default)
df_left = pl.DataFrame({
    'key': [1, 2, None, 4],
    'value_left': ['A', 'B', 'C', 'D']
})

df_right = pl.DataFrame({
    'key': [1, None, 3, 4],
    'value_right': ['W', 'X', 'Y', 'Z']
})

result = df_left.join(df_right, on='key', how='inner')

print("Join with null keys:")
print(result)
print("\n💡 Null keys don't match (row with key=None excluded)")

---
# Part 5: Missing Data Patterns

In [None]:
# Create realistic dataset with missing patterns
np.random.seed(42)
n = 100

df_pattern = pl.DataFrame({
    'id': range(n),
    # .tolist() converts NumPy scalars to plain Python values before Polars sees them
    'age': [x if np.random.random() > 0.1 else None for x in np.random.randint(18, 80, n).tolist()],
    'income': [x if np.random.random() > 0.15 else None for x in np.random.randint(20000, 150000, n).tolist()],
    'education': [x if np.random.random() > 0.05 else None 
                  for x in np.random.choice(['High School', 'Bachelor', 'Master', 'PhD'], n).tolist()]
})

print("Missing data summary:")
print(df_pattern.null_count())
print("\nMissing percentages:")
# DataFrame has no .round(); round inside an expression instead
print(df_pattern.null_count().select((pl.all() / df_pattern.height * 100).round(2)))

In [None]:
# Analyze missing patterns
missing_pattern = df_pattern.select([
    pl.col('age').is_null().alias('age_missing'),
    pl.col('income').is_null().alias('income_missing'),
    pl.col('education').is_null().alias('education_missing')
])

# Count combinations of missing values
pattern_counts = missing_pattern.group_by(['age_missing', 'income_missing', 'education_missing']).agg(
    pl.len().alias('count')  # pl.count() is deprecated in favor of pl.len()
).sort('count', descending=True)

print("Missing value patterns:")
print(pattern_counts)

---
# Part 6: Duplicate Data Handling

## 6.1 Identifying Duplicates

In [None]:
# Create DataFrame with duplicates
df_dupes = pl.DataFrame({
    'id': [1, 2, 3, 2, 4, 3, 5, 1],
    'name': ['Alice', 'Bob', 'Charlie', 'Bob', 'Diana', 'Charlie', 'Eve', 'Alice'],
    'value': [100, 200, 300, 200, 400, 300, 500, 150]
})

print("DataFrame with duplicates:")
print(df_dupes)

In [None]:
# is_duplicated(): Mark duplicate rows (True if duplicate exists)
df_with_dupes = df_dupes.with_columns([
    pl.col('id').is_duplicated().alias('id_is_dup'),
    pl.col('name').is_duplicated().alias('name_is_dup')
])

print("Mark duplicates:")
print(df_with_dupes)
print("\n💡 is_duplicated() marks ALL occurrences (including first)")

In [None]:
# is_unique(): Mark unique values (opposite of is_duplicated)
df_with_unique = df_dupes.with_columns([
    pl.col('id').is_unique().alias('id_is_unique'),
    pl.col('name').is_unique().alias('name_is_unique')
])

print("Mark unique values:")
print(df_with_unique)

In [None]:
# Find duplicate rows based on subset of columns
df_subset_dupes = df_dupes.with_columns(
    pl.struct(['id', 'name']).is_duplicated().alias('row_is_duplicate')
)

print("Duplicates based on 'id' AND 'name':")
print(df_subset_dupes)
print("\nDuplicate rows:")
print(df_subset_dupes.filter(pl.col('row_is_duplicate')))

## 6.2 Counting Duplicates

In [None]:
# Count duplicates per group
duplicate_counts = df_dupes.group_by(['id', 'name']).agg(
    pl.len().alias('count')  # pl.count() is deprecated in favor of pl.len()
).filter(
    pl.col('count') > 1
).sort('count', descending=True)

print("Duplicate groups:")
print(duplicate_counts)

In [None]:
# Count unique values
print("Unique value counts:")
print(f"Unique ids: {df_dupes['id'].n_unique()}")
print(f"Unique names: {df_dupes['name'].n_unique()}")
print(f"Total rows: {len(df_dupes)}")

## 6.3 Removing Duplicates

In [None]:
# unique(): Keep only unique rows (all columns)
df_unique_all = df_dupes.unique()

print("Unique (all columns):")
print(df_unique_all)
print(f"\nRows: {len(df_dupes)} -> {len(df_unique_all)}")

In [None]:
# unique() on specific columns
df_unique_subset = df_dupes.unique(subset=['id', 'name'])

print("Unique based on 'id' and 'name':")
print(df_unique_subset)
print(f"\nRows: {len(df_dupes)} -> {len(df_unique_subset)}")

In [None]:
# unique() with keep parameter
# keep='any' (default), 'first', 'last', 'none'
# Pass maintain_order=True if the order of the output rows matters

print("Keep first occurrence:")
df_keep_first = df_dupes.unique(subset=['id'], keep='first')
print(df_keep_first)

print("\nKeep last occurrence:")
df_keep_last = df_dupes.unique(subset=['id'], keep='last')
print(df_keep_last)

print("\nKeep none (remove all duplicates):")
df_keep_none = df_dupes.unique(subset=['id'], keep='none')
print(df_keep_none)

## 6.4 Advanced Duplicate Handling

In [None]:
# Keep duplicate with highest value
# Example: Keep the record with highest 'value' for each 'id'

df_best = (
    df_dupes
    .sort('value', descending=True)  # Sort by value (highest first)
    .unique(subset=['id'], keep='first')  # Keep first (= highest value)
    .sort('id')  # Resort by id for readability
)

print("Keep duplicate with highest value:")
print(df_best)

In [None]:
# Mark first/last occurrence
df_marked = df_dupes.with_columns([
    (pl.col('id').cum_count().over('id') == 1).alias('is_first'),
    (pl.col('id').cum_count().over('id') == pl.col('id').count().over('id')).alias('is_last')
])

print("Mark first and last occurrences:")
print(df_marked)

---
# Part 7: Real-World Data Cleaning Pipeline

In [None]:
# Realistic messy dataset
np.random.seed(42)
n = 1000

df_messy = pl.DataFrame({
    'customer_id': list(range(1, 901)) + np.random.choice(range(1, 901), 100).tolist(),  # Duplicates (plain Python ints)
    'name': [f'Customer_{i}' if np.random.random() > 0.05 else None for i in range(n)],
    'age': [np.random.randint(18, 80) if np.random.random() > 0.1 else None for _ in range(n)],
    'email': [f'user{i}@example.com' if np.random.random() > 0.08 else None for i in range(n)],
    'purchase_amount': [round(np.random.uniform(10, 1000), 2) if np.random.random() > 0.12 else None for _ in range(n)],
    'signup_date': [
        date(2024, 1, 1) + timedelta(days=int(np.random.randint(0, 365))) 
        if np.random.random() > 0.05 else None 
        for _ in range(n)
    ]
})

print("Messy dataset statistics:")
print(f"Total rows: {len(df_messy)}")
print(f"\nNull counts:")
print(df_messy.null_count())
print(f"\nRows sharing a customer_id with another row: {df_messy['customer_id'].is_duplicated().sum()}")

In [None]:
# Data cleaning pipeline
df_cleaned = (
    df_messy
    # 1. Remove complete duplicates (all columns identical)
    .unique()
    
    # 2. For duplicate customer_ids, keep the one with most recent signup
    .sort('signup_date', descending=True, nulls_last=True)
    .unique(subset=['customer_id'], keep='first')
    
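    # 3. Fill missing values
    .with_columns([
        pl.col('name').fill_null('Unknown'),
        # The median may be fractional, so 'age' can become a float column
        pl.col('age').fill_null(pl.col('age').median()),
        pl.col('email').fill_null(pl.concat_str([pl.lit('unknown_'), pl.col('customer_id'), pl.lit('@example.com')])),
        pl.col('purchase_amount').fill_null(0.0),
        # Rows are still sorted by signup_date (descending, nulls last), so this
        # copies the earliest known date into the missing rows as a placeholder
        pl.col('signup_date').fill_null(strategy='forward')
    ])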
    
    # 4. Remove rows still missing critical data
    .drop_nulls(subset=['customer_id'])
    
    # 5. Sort for readability
    .sort('customer_id')
)

print("Cleaned dataset:")
print(f"Rows: {len(df_messy)} -> {len(df_cleaned)}")
print(f"\nNull counts after cleaning:")
print(df_cleaned.null_count())
print(f"\nRows sharing a customer_id with another row: {df_cleaned['customer_id'].is_duplicated().sum()}")
print("\n✅ Data cleaning complete!")

---
# Summary

## Missing Data Strategies:

| Method | Use Case | Pros | Cons |
|--------|----------|------|------|
| **drop_nulls()** | Critical columns only | Simple, no assumptions | Loses data |
| **fill_null(value)** | Default values known | Fast, deterministic | May bias results |
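| **fill_null(strategy='mean')** | Numeric, roughly symmetric data | Preserves the column mean | Reduces variance |
| **fill_null(strategy='forward')** | Time series | Carries the last observation forward | Propagates stale or erroneous values |
| **fill_null(strategy='backward')** | Time series with leading gaps | Fills gaps at the start of a series | Uses future values (leakage risk in forecasting) |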
| **interpolate()** | Numeric time series | Smooth estimates | Assumes linearity |
| **coalesce()** | Multiple fallback sources | Flexible | Complex logic |

## Duplicate Handling:

| Method | Effect | Notes |
|--------|--------|-------|
| **unique()** | Remove duplicate rows | `keep`: 'any' (default), 'first', 'last', 'none' |
| **is_duplicated()** | Mark duplicates | Marks ALL occurrences, including the first |
| **is_unique()** | Mark values that occur exactly once | Opposite of is_duplicated |

## Best Practices:

1. ✅ **Investigate** missing patterns before filling
2. ✅ **Document** your cleaning decisions
3. ✅ **Validate** data quality after cleaning
4. ✅ **Use** forward fill for time series
5. ✅ **Prefer** dropping over imputing for critical columns
6. ✅ **Keep** the record with the best data quality when deduplicating
7. ❌ **Don't** use `== None` (use `.is_null()` instead)
8. ❌ **Don't** blindly fill all nulls with mean

## Common Patterns:
```python
# Fill nulls
df.with_columns(pl.col('col').fill_null(strategy='forward'))

# Drop nulls in critical columns
df.drop_nulls(subset=['id', 'date'])

# Remove duplicates, keep first
df.unique(subset=['id'], keep='first')

# Find duplicates
df.filter(pl.col('id').is_duplicated())
```

---
# Practice Exercises

In [None]:
# Exercise 1: Fill nulls with different strategies
df_ex1 = pl.DataFrame({
    'date': [date(2024, 1, i) for i in range(1, 11)],
    'temperature': [20, None, 22, None, None, 25, 26, None, 28, 29]
})

# TODO: Fill temperature nulls using:
# a) Forward fill
# b) Interpolation
# c) Mean
# Compare results


In [None]:
# Exercise 2: Find and remove duplicates
df_ex2 = pl.DataFrame({
    'user_id': [1, 2, 3, 2, 4, 3, 5],
    'score': [100, 200, 300, 250, 400, 300, 500],
    'timestamp': ['2024-01-01', '2024-01-02', '2024-01-03', 
                  '2024-01-04', '2024-01-05', '2024-01-06', '2024-01-07']
})

# TODO: Keep the duplicate with the highest score for each user_id


In [None]:
# Exercise 3: Build a data cleaning pipeline
# TODO: Create a messy dataset and clean it following best practices
