# 1. Data Cleaning & Normalization

This notebook implements the first step of our Frequent Pattern Mining project with Concept Lattice foundations.

## Objectives
- Load transaction data from JSON files
- Clean and normalize transactions
- Apply synonym normalization using a mapping file
- Remove rare/meaningless items
- Save cleaned transaction data for further processing

## Theoretical Background

**Data Cleaning** is a critical first step in any data mining process. For frequent pattern mining, we need to ensure:
1. Consistent representation of items (letter case, whitespace, etc.)
2. Removal of duplicates within transactions
3. Standardization of synonyms and variants
4. Identification and handling of rare items
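
The four concerns above can be illustrated on a handful of toy transactions. This is a minimal sketch: the `SYNONYMS` map, the `MIN_COUNT` cutoff, and the sample data are assumptions made for illustration and are not part of the project data.

```python
from collections import Counter

# Hypothetical synonym map and raw data, for illustration only.
SYNONYMS = {"whole milk": "milk", "loaf": "bread"}
raw_transactions = [
    [" Milk", "milk ", "Loaf"],
    ["Whole Milk", "bread", "Saffron"],
    ["MILK", "Bread"],
]


def normalize(item: str) -> str:
    """Lowercase, trim whitespace, then map synonyms to a canonical name."""
    cleaned = item.strip().lower()
    return SYNONYMS.get(cleaned, cleaned)


# Concerns 1-3: consistent representation, synonym mapping, per-transaction dedup.
normalized = [sorted({normalize(item) for item in t}) for t in raw_transactions]

# Concern 4: count how many transactions contain each item and flag rare ones.
MIN_COUNT = 2  # assumed cutoff for this toy example
counts = Counter(item for t in normalized for item in t)
rare = {item for item, c in counts.items() if c < MIN_COUNT}

print(normalized)
print(rare)
```

Here every spelling of milk collapses to one item, and `saffron` is flagged as rare because it occurs in a single transaction.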

## Import Required Libraries

First, let's import all necessary libraries:

In [45]:
import os
import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from typing import Dict, List, Set, Tuple, Any

# Set plotting style
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('viridis')

# Ensure directories exist
os.makedirs('output', exist_ok=True)
os.makedirs('figures', exist_ok=True)

## Define Helper Functions

Let's define functions to handle the data cleaning tasks:

In [46]:
def load_json_file(file_path: str) -> Any:
    """Load data from a JSON file.

    Args:
        file_path: Path to the JSON file

    Returns:
        Loaded data from the JSON file
    """
    with open(file_path, 'r', encoding='utf-8') as f:
        return json.load(f)


def save_json_file(data: Any, file_path: str) -> None:
    """Save data to a JSON file.

    Args:
        data: Data to save
        file_path: Path to the JSON file
    """
    with open(file_path, 'w', encoding='utf-8') as f:
        json.dump(data, f, indent=2)


def clean_and_normalize_transactions(
    transactions: List[Dict[str, Any]],
    normalization_map: Dict[str, str]
) -> List[Dict[str, Any]]:
    """Clean and normalize transaction data.

    Args:
        transactions: List of transaction dictionaries with 'transaction_id' and 'items' keys
        normalization_map: Dictionary mapping raw item names to normalized names

    Returns:
        List of cleaned and normalized transactions
    """
    cleaned_transactions = []
    seen_ids = set()

    for transaction in transactions:
        transaction_id = transaction.get('transaction_id')

        # Validate transaction_id
        if transaction_id is None:
            continue
        if transaction_id in seen_ids:
            print(f"Warning: Duplicate transaction ID {transaction_id} found. Skipping.")
            continue
        seen_ids.add(transaction_id)

        # Clean and normalize items
        items = transaction.get('items', [])
        normalized_items = set()  # Use set to automatically deduplicate

        for item in items:
            if not item:  # Skip empty items
                continue

            # Lowercase and strip
            clean_item = item.lower().strip()

            # Apply normalization map
            normalized_item = normalization_map.get(item, normalization_map.get(clean_item, clean_item))

            if normalized_item:  # Add non-empty items
                normalized_items.add(normalized_item)

        if normalized_items:  # Only add transactions with at least one item
            cleaned_transactions.append({
                'transaction_id': transaction_id,
                'items': sorted(list(normalized_items))  # Convert back to sorted list
            })

    return cleaned_transactions


def filter_rare_items(
    transactions: List[Dict[str, Any]],
    threshold: float = 0.05
) -> Tuple[List[Dict[str, Any]], Set[str]]:
    """Remove items that appear less frequently than the threshold.

    Args:
        transactions: List of transaction dictionaries
        threshold: Minimum frequency threshold (default: 0.05 or 5%)

    Returns:
        Tuple of (filtered transactions, set of removed items)
    """
    # Count item frequencies
    item_counts = {}
    for transaction in transactions:
        for item in transaction['items']:
            item_counts[item] = item_counts.get(item, 0) + 1

    # Identify rare items
    total_transactions = len(transactions)
    rare_items = {
        item for item, count in item_counts.items()
        if count / total_transactions < threshold
    }

    # Remove rare items from transactions
    filtered_transactions = []
    for transaction in transactions:
        filtered_items = [item for item in transaction['items'] if item not in rare_items]
        if filtered_items:  # Only include transactions with at least one item left
            filtered_transactions.append({
                'transaction_id': transaction['transaction_id'],
                'items': filtered_items
            })

    return filtered_transactions, rare_items


def transactions_to_lists(transactions: List[Dict[str, Any]]) -> List[List[str]]:
    """Convert transaction dictionaries to lists of items.

    Args:
        transactions: List of transaction dictionaries

    Returns:
        List of item lists
    """
    return [transaction['items'] for transaction in transactions]


def get_unique_items(transactions: List[Dict[str, Any]]) -> Set[str]:
    """Get set of all unique items across transactions.

    Args:
        transactions: List of transaction dictionaries

    Returns:
        Set of unique items
    """
    unique_items = set()
    for transaction in transactions:
        unique_items.update(transaction['items'])
    return unique_items
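
The lookup in `clean_and_normalize_transactions` tries the raw item first, then its lowercased and stripped form, and finally falls back to the cleaned form itself. The sketch below isolates that order; `resolve_item` and the entries in `example_map` are hypothetical names introduced only for illustration.

```python
from typing import Dict


def resolve_item(item: str, normalization_map: Dict[str, str]) -> str:
    """Mirror the three-stage lookup used by the helper above."""
    clean_item = item.lower().strip()
    return normalization_map.get(item, normalization_map.get(clean_item, clean_item))


# Hypothetical map for illustration.
example_map = {"Pop": "soda", "soft drink": "soda"}

print(resolve_item("Pop", example_map))            # raw key matches
print(resolve_item("  Soft Drink ", example_map))  # cleaned key matches
print(resolve_item(" Kale ", example_map))         # no match, cleaned form is kept
print(resolve_item("POP", example_map))            # neither "POP" nor "pop" is a key
```

The last call shows a gap worth knowing about: a variant is only mapped if its exact spelling or its lowercased form is a key, so keeping lowercase keys in the map is the safer convention.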

## Load Raw Data

Now, let's load the raw transaction data and the normalization mapping:

In [47]:
# Load transactions from a JSON file
import os

# Get absolute paths - fix path for notebook location
current_dir = os.getcwd()
# Check if we're in notebooks folder, if so go up one level
if current_dir.endswith('notebooks'):
    project_dir = os.path.dirname(current_dir)
else:
    project_dir = current_dir

transaction_file = os.path.join(project_dir, 'data', 'transactions.json')
print(f"Looking for file at: {transaction_file}")

try:
    with open(transaction_file, 'r', encoding='utf-8') as f:
        transactions = json.load(f)
    print(f"Loaded {len(transactions)} transactions from your real data!")
except FileNotFoundError:
    print(f"Transaction file not found at {transaction_file}.")
    print("Creating sample transactions...")

    # Create sample transactions
    sample_transactions = [
        {"transaction_id": 1, "items": ["Apple", "Milk", "Bread"]},
        {"transaction_id": 2, "items": ["Rice", "Oil"]},
        {"transaction_id": 3, "items": ["milk", "Eggs", "Cheese"]},
        {"transaction_id": 4, "items": ["Bread", "Butter", "Milk"]},
        {"transaction_id": 5, "items": ["apple", "Banana", "Orange"]}
    ]

    # Save sample transactions
    os.makedirs(os.path.dirname(transaction_file), exist_ok=True)
    with open(transaction_file, 'w', encoding='utf-8') as f:
        json.dump(sample_transactions, f, indent=2)

    transactions = sample_transactions
    print(f"Created and saved {len(transactions)} sample transactions")

# Display a few transactions
print(f"\nYour Real Transaction Data (showing first 5 of {len(transactions)}):")
for i, transaction in enumerate(transactions[:5]):
    print(f"{transaction['transaction_id']}: {transaction['items']}")

Looking for file at: e:\project-x\data\transactions.json
Loaded 20 transactions from your real data!

Your Real Transaction Data (showing first 5 of 20):
1: ['Apple', 'milk', 'bread']
2: ['rice', 'Oil', 'beans']
3: ['milk', 'eggs', 'cheese', 'yogurt']
4: ['bread', 'butter', 'milk']
5: ['apple', 'banana', 'orange', 'grapes']
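
The working-directory check used above recurs in several cells. One way to factor it out is a small `pathlib` helper; `resolve_project_dir` is a hypothetical name, and this sketch compares the final path component exactly rather than using `endswith`, so a folder such as `my_notebooks` is not treated as the notebooks folder.

```python
from pathlib import Path


def resolve_project_dir(cwd: Path) -> Path:
    """Return the parent when cwd is the 'notebooks' folder, otherwise cwd itself."""
    return cwd.parent if cwd.name == "notebooks" else cwd


project_dir = resolve_project_dir(Path.cwd())
transaction_file = project_dir / "data" / "transactions.json"
print(transaction_file)
```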


## Clean and Normalize Transactions

Apply the cleaning and normalization process to the raw transactions:

In [48]:
# Load normalization map
import json
import os

# Get the correct path for normalization file
current_dir = os.getcwd()
# Check if we're in notebooks folder, if so go up one level
if current_dir.endswith('notebooks'):
    project_dir = os.path.dirname(current_dir)
else:
    project_dir = current_dir

normalization_path = os.path.join(project_dir, 'data', 'normalization.json')

try:
    print(f"Looking for normalization file at: {normalization_path}")
    with open(normalization_path, 'r', encoding='utf-8') as f:
        normalization_map = json.load(f)
    print(f"Loaded normalization map with {len(normalization_map)} entries")
except FileNotFoundError:
    print(f"Normalization file not found at {normalization_path}")
    # Create a basic normalization map if file not found
    normalization_map = {
        "apple": "apple",
        "Apple": "apple",
        "milk": "milk",
        "Milk": "milk",
        "bread": "bread",
        "Bread": "bread",
        "rice": "rice",
        "Rice": "rice",
        "oil": "oil",
        "Oil": "oil"
    }
    print(f"Created default normalization map with {len(normalization_map)} entries")

Looking for normalization file at: e:\project-x\data\normalization.json
Loaded normalization map with 33 entries


In [49]:
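# Redefine clean_and_normalize_transactions for this run.
# Note: this simpler version replaces the helper defined earlier. It does not
# strip whitespace, skip empty items, or drop duplicate transaction IDs.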
def clean_and_normalize_transactions(transactions, normalization_map):
    """
    Clean and normalize transaction data.

    Args:
        transactions: List of transaction dictionaries with 'items' field
        normalization_map: Dictionary mapping item variants to normalized names

    Returns:
        List of cleaned transactions
    """
    cleaned_transactions = []

    for transaction in transactions:
        cleaned_items = []

        for item in transaction["items"]:
            # Apply normalization if available
            normalized_item = normalization_map.get(item, item.lower())
            cleaned_items.append(normalized_item)

        # Remove duplicates (if an item appears multiple times in a transaction)
        cleaned_items = list(set(cleaned_items))

        # Add to cleaned transactions
        cleaned_transactions.append({
            "transaction_id": transaction.get("transaction_id", len(cleaned_transactions) + 1),
            "items": cleaned_items
        })

    return cleaned_transactions

In [50]:
# Clean and normalize transactions
cleaned_transactions = clean_and_normalize_transactions(transactions, normalization_map)

print(f"After cleaning: {len(cleaned_transactions)} transactions")

# Show a sample of the cleaned data
print("\nCleaned Transaction Sample:")
for t in cleaned_transactions[:3]:
    print(t)

# Count unique items after cleaning
unique_items = get_unique_items(cleaned_transactions)
print(f"\nNumber of unique items after cleaning: {len(unique_items)}")
print("Unique items:", sorted(list(unique_items)))

After cleaning: 20 transactions

Cleaned Transaction Sample:
{'transaction_id': 1, 'items': ['milk', 'bread', 'apple']}
{'transaction_id': 2, 'items': ['beans', 'oil', 'rice']}
{'transaction_id': 3, 'items': ['yogurt', 'milk', 'cheese', 'eggs']}

Number of unique items after cleaning: 25
Unique items: ['apple', 'bacon', 'banana', 'beans', 'berries', 'bread', 'butter', 'cereal', 'cheese', 'chicken', 'coffee', 'crackers', 'eggs', 'grapes', 'jam', 'milk', 'oil', 'orange', 'rice', 'sugar', 'tea', 'tomato', 'vegetables', 'wine', 'yogurt']
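
The item order in the cleaned sample above comes from `list(set(...))`, and Python does not guarantee a stable iteration order for a set of strings across runs. If reproducible output matters, one option is to sort after deduplicating. This is a sketch of that alternative, not what the cell above does, and `dedupe_sorted` is a hypothetical helper name.

```python
from typing import Iterable, List


def dedupe_sorted(items: Iterable[str]) -> List[str]:
    """Remove duplicates and return the items in alphabetical order."""
    return sorted(set(items))


print(dedupe_sorted(["milk", "bread", "apple", "milk"]))
```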


## Analyze Item Frequencies

Let's analyze the frequency of items in the transactions to identify rare items:

In [51]:
# Calculate item frequencies
item_counts = {}
for transaction in cleaned_transactions:
    for item in transaction["items"]:
        if item not in item_counts:
            item_counts[item] = 0
        item_counts[item] += 1

# Create DataFrame for easier analysis
import pandas as pd
items = list(item_counts.keys())
counts = list(item_counts.values())
frequencies = [count / len(cleaned_transactions) for count in counts]

item_freq_df = pd.DataFrame({
    'item': items,
    'count': counts,
    'frequency': frequencies
})

# Sort by frequency
item_freq_df = item_freq_df.sort_values('count', ascending=False).reset_index(drop=True)

print("Item Frequencies:")
print(item_freq_df)

# Skip visualization due to matplotlib compatibility issue
print("\nItem frequencies calculated. Skipping visualization due to matplotlib compatibility issue.")

# Flag rare items for inspection only: items that appear in fewer than 20% of transactions
rare_item_threshold = 0.2
rare_items = item_freq_df[item_freq_df['frequency'] < rare_item_threshold]['item'].tolist()
common_items = item_freq_df[item_freq_df['frequency'] >= rare_item_threshold]['item'].tolist()

print(f"\nRare items (frequency < {rare_item_threshold}):")
print(rare_items)
print(f"\nCommon items (frequency >= {rare_item_threshold}):")
print(common_items)

Item Frequencies:
          item  count  frequency
0         milk     10       0.50
1        bread      6       0.30
2       banana      5       0.25
3       cheese      4       0.20
4         rice      4       0.20
5       yogurt      3       0.15
6         eggs      3       0.15
7       butter      3       0.15
8        apple      3       0.15
9        beans      2       0.10
10         oil      2       0.10
11      orange      2       0.10
12  vegetables      2       0.10
13      coffee      2       0.10
14     chicken      2       0.10
15      cereal      2       0.10
16       bacon      2       0.10
17         tea      1       0.05
18      grapes      1       0.05
19      tomato      1       0.05
20     berries      1       0.05
21       sugar      1       0.05
22        wine      1       0.05
23    crackers      1       0.05
24         jam      1       0.05

Item frequencies calculated. Skipping visualization due to matplotlib compatibility issue.

Rare items (frequency < 0.2):
[

## Filter Rare Items

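Now, let's filter out rare items. The threshold is read from `config.json` and defaults to 5% when the file is missing or invalid. The comparison is strict (frequency < threshold), so an item that appears in exactly 5% of the transactions (1 of 20 here) is kept: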

In [52]:
# Load configuration (if available)
config_path = 'config.json'

try:
    config = load_json_file(config_path)
    rare_item_threshold = config.get('rare_item_threshold', 0.05)
except (FileNotFoundError, json.JSONDecodeError):
    print("Config file not found or invalid. Using default rare_item_threshold of 0.05.")
    rare_item_threshold = 0.05

print(f"Using rare item threshold: {rare_item_threshold} ({rare_item_threshold*100}%)")

# Filter rare items
filtered_transactions, rare_items = filter_rare_items(
    cleaned_transactions,
    threshold=rare_item_threshold
)

print(f"After filtering rare items: {len(filtered_transactions)} transactions")
print(f"Removed {len(rare_items)} rare items: {sorted(list(rare_items))}")

# Count unique items after filtering
unique_items_after_filtering = get_unique_items(filtered_transactions)
print(f"Number of unique items after filtering: {len(unique_items_after_filtering)}")
print("Remaining items:", sorted(list(unique_items_after_filtering)))

Config file not found or invalid. Using default rare_item_threshold of 0.05.
Using rare item threshold: 0.05 (5.0%)
After filtering rare items: 20 transactions
Removed 0 rare items: []
Number of unique items after filtering: 25
Remaining items: ['apple', 'bacon', 'banana', 'beans', 'berries', 'bread', 'butter', 'cereal', 'cheese', 'chicken', 'coffee', 'crackers', 'eggs', 'grapes', 'jam', 'milk', 'oil', 'orange', 'rice', 'sugar', 'tea', 'tomato', 'vegetables', 'wine', 'yogurt']
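
To see why no items were removed, the strict comparison can be replayed on the counts from the frequency table above. This is a small sketch; `rare_items_at` is a hypothetical helper that applies the same `count / total < threshold` test as `filter_rare_items`.

```python
# Item counts copied from the frequency table above (20 transactions in total).
item_counts = {
    "milk": 10, "bread": 6, "banana": 5, "cheese": 4, "rice": 4,
    "yogurt": 3, "eggs": 3, "butter": 3, "apple": 3,
    "beans": 2, "oil": 2, "orange": 2, "vegetables": 2,
    "coffee": 2, "chicken": 2, "cereal": 2, "bacon": 2,
    "tea": 1, "grapes": 1, "tomato": 1, "berries": 1,
    "sugar": 1, "wine": 1, "crackers": 1, "jam": 1,
}
total_transactions = 20


def rare_items_at(threshold: float) -> set:
    """Items whose relative frequency is strictly below the threshold."""
    return {
        item for item, count in item_counts.items()
        if count / total_transactions < threshold
    }


for threshold in (0.05, 0.10, 0.20):
    print(threshold, len(rare_items_at(threshold)))
```

At 0.05 the set is empty, because the least frequent items sit exactly at 1/20 = 0.05. Raising the threshold to 0.10 would remove the 8 items that occur once, and 0.20 would remove 20 of the 25 items.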


## Save Cleaned Data

Save the cleaned data for use in subsequent steps of the pipeline:

In [53]:
# Save cleaned transactions as JSON
# Get the correct output path
current_dir = os.getcwd()
if current_dir.endswith('notebooks'):
    project_dir = os.path.dirname(current_dir)
else:
    project_dir = current_dir

output_path = os.path.join(project_dir, 'output', 'cleaned_transactions.json')
save_json_file(filtered_transactions, output_path)
print(f"Saved cleaned transactions to {output_path}")

# Save rare items list
rare_items_path = os.path.join(project_dir, 'output', 'rare_items.json')
save_json_file(list(rare_items), rare_items_path)
print(f"Saved rare items list to {rare_items_path}")

# Save as list format for next steps
transaction_lists = transactions_to_lists(filtered_transactions)
lists_path = os.path.join(project_dir, 'output', 'transaction_lists.json')
save_json_file(transaction_lists, lists_path)
print(f"Saved transaction lists to {lists_path}")

# Show a sample of the final output
print(f"\nFinal Transaction Lists Sample (from your {len(transaction_lists)} real transactions):")
for t in transaction_lists[:5]:
    print(t)

Saved cleaned transactions to e:\project-x\output\cleaned_transactions.json
Saved rare items list to e:\project-x\output\rare_items.json
Saved transaction lists to e:\project-x\output\transaction_lists.json

Final Transaction Lists Sample (from your 20 real transactions):
['milk', 'bread', 'apple']
['beans', 'oil', 'rice']
['yogurt', 'milk', 'cheese', 'eggs']
['bread', 'milk', 'butter']
['grapes', 'orange', 'apple', 'banana']


## Summary Statistics

Let's calculate some summary statistics for our cleaned dataset:

In [54]:
# Compute summary statistics, then export a long-format CSV
import os
import statistics

import pandas as pd

# Calculate statistics
transaction_lengths = [len(transaction) for transaction in transaction_lists]
avg_items_per_transaction = sum(transaction_lengths) / len(transaction_lengths)
# statistics.median averages the two middle values when the count is even
median_items_per_transaction = statistics.median(transaction_lengths)
max_items_per_transaction = max(transaction_lengths)
min_items_per_transaction = min(transaction_lengths)

print(f"Number of transactions: {len(transaction_lists)}")
print(f"Number of unique items: {len(set(item for sublist in transaction_lists for item in sublist))}")
print(f"Average items per transaction: {avg_items_per_transaction:.2f}")
print(f"Median items per transaction: {median_items_per_transaction:.2f}")
print(f"Maximum items in a transaction: {max_items_per_transaction}")
print(f"Minimum items in a transaction: {min_items_per_transaction}")

# Skip visualization due to matplotlib compatibility issue
print("\nSkipping visualization due to matplotlib compatibility issue.")

# Create one row per (transaction, item) pair, keeping the original transaction IDs
transaction_dicts = []
for transaction in filtered_transactions:
    for item in transaction['items']:
        transaction_dicts.append({
            'transaction_id': transaction['transaction_id'],
            'item': item
        })

# Convert to DataFrame
transactions_df = pd.DataFrame(transaction_dicts)

# Save to CSV
csv_path = os.path.join('../output', 'cleaned_transactions.csv')
transactions_df.to_csv(csv_path, index=False)
print(f"\nSaved cleaned transactions to CSV file: {csv_path}")

# Display the head of the DataFrame
print("\nTransactions DataFrame (head):")
print(transactions_df.head(10))

Number of transactions: 20
Number of unique items: 25
Average items per transaction: 3.25
Median items per transaction: 3.00
Maximum items in a transaction: 4
Minimum items in a transaction: 3

Skipping visualization due to matplotlib compatibility issue.

Saved cleaned transactions to CSV file: ../output\cleaned_transactions.csv

Transactions DataFrame (head):
   transaction_id    item
0               1    milk
1               1   bread
2               1   apple
3               2   beans
4               2     oil
5               2    rice
6               3  yogurt
7               3    milk
8               3  cheese
9               3    eggs
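
As a cross-check on the statistics above, the standard library's `statistics` module computes the same quantities. For an even number of values, `statistics.median` averages the two middle values, whereas indexing the sorted list at `len // 2` returns the upper one. The lengths below are made up for illustration.

```python
import statistics

lengths = [3, 3, 4, 5]  # hypothetical transaction lengths

upper_middle = sorted(lengths)[len(lengths) // 2]  # index-based shortcut
true_median = statistics.median(lengths)           # mean of the two middle values

print(upper_middle)
print(true_median)
print(statistics.mean(lengths))
```

The two approaches agree whenever the two middle values are equal, which is the case for the dataset in this notebook.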


## Conclusion

In this notebook, we successfully:

1. Loaded raw transaction data from JSON files
2. Cleaned and normalized the data (lowercasing and per-transaction deduplication)
3. Applied synonym normalization using a mapping file
4. Analyzed item frequencies and applied the rare-item filter (at the default 5% threshold, no items were removed)
5. Saved the cleaned data for further processing

The cleaned dataset is now ready for the next steps in the frequent pattern mining pipeline:
- Transaction encoding
- Frequent itemset mining with Apriori and FP-Growth
- Closed itemset computation and concept lattice construction
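
As a preview of the transaction encoding step, one common representation is a boolean matrix with one row per transaction and one column per item. The following is a minimal pure-Python sketch under that assumption; the later notebooks may encode the data differently, and `encode` is a hypothetical helper name.

```python
from typing import Dict, List

# Transactions taken from the final sample output above.
transaction_lists = [
    ["milk", "bread", "apple"],
    ["beans", "oil", "rice"],
    ["yogurt", "milk", "cheese", "eggs"],
]

# Fixed, sorted vocabulary so that the column order is reproducible.
vocabulary: List[str] = sorted({item for t in transaction_lists for item in t})
column_of: Dict[str, int] = {item: i for i, item in enumerate(vocabulary)}


def encode(transaction: List[str]) -> List[bool]:
    """Return a boolean row with True at the column of each item present."""
    row = [False] * len(vocabulary)
    for item in transaction:
        row[column_of[item]] = True
    return row


encoded = [encode(t) for t in transaction_lists]
print(vocabulary)
for row in encoded:
    print([int(value) for value in row])
```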

The data cleaning process is crucial as it directly impacts the quality and interpretability of the patterns we will discover later. By standardizing item representations and removing rare items, we improve the meaningfulness of the frequent patterns while reducing computational complexity.