# Learn2Clean Example: ANLI R1 Dataset

This notebook demonstrates how to apply Learn2Clean to the ANLI R1 (Adversarial Natural Language Inference Round 1) dataset for text classification.

In [3]:
# Install Learn2Clean with compatible versions - avoiding dependency conflicts
import os
import sys

print("Setting up Learn2Clean with compatible package versions...")

if os.path.exists('../python-package'):
    %cd ../python-package
    
    # First uninstall any existing Learn2Clean
    !pip uninstall -y learn2clean
    
    # Install Learn2Clean without dependencies to avoid conflicts
    !pip install -e . --no-deps
    
    # Now install compatible versions of the dependencies we need
    print("Installing compatible dependencies...")
    
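    # Install string matching libraries with fallback.
    # A failing `!pip` shell command does not raise a Python exception,
    # so pip is run through subprocess to make the fallback branch reachable.
    import subprocess
    try:
        subprocess.check_call([sys.executable, '-m', 'pip', 'install', 'python-Levenshtein', 'fuzzywuzzy'])
        print("✓ Installed alternative string matching libraries")
    except subprocess.CalledProcessError:
        print("⚠ Warning: Advanced string matching not available, using basic alternatives")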
    
    # Install basic ML dependencies we already have
    print("✓ Using existing numpy, pandas, scikit-learn, scipy, matplotlib")
    
    # Skip problematic dependencies (fancyimpute, py_stringmatching, py_stringsimjoin)
    print("⚠ Skipping fancyimpute, py_stringmatching, and py_stringsimjoin due to version conflicts")
    
    %cd ../examples
    
    print("\n✓ Learn2Clean installed with core functionality!")
    print("Note: Some advanced features (fancy imputation, string similarity) may be limited")
    
else:
    print("Learn2Clean python-package directory not found. Please check the path.")

Setting up Learn2Clean with compatible package versions...
/storage/nammt/autogluon/Learn2Clean/python-package
Obtaining file:///storage/nammt/autogluon/Learn2Clean/python-package
Installing collected packages: learn2clean
  Running setup.py develop for learn2clean
Successfully installed learn2clean
You should consider upgrading via the '/storage/nammt/autogluon/learn2clean_env/bin/python3.7 -m pip install --upgrade pip' command.
Installing compatible dependencies...
Collecting python-Levenshtein
  Downloading python_Levenshtein-0.23.0-py3-none-any.whl (9.4 kB)
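
Because the setup cell skips some optional packages, it can help to confirm which of them are importable before relying on the corresponding features. The following is a minimal sketch using only the standard library; `check_optional_dependencies` is a hypothetical helper (not part of Learn2Clean), and the import names in the list are assumed to match the packages named in the setup cell.

```python
import importlib.util

# Hypothetical helper (not part of Learn2Clean): report which optional
# top-level modules can be imported in the current environment.
def check_optional_dependencies(module_names):
    """Return {module_name: True/False} depending on importability."""
    return {name: importlib.util.find_spec(name) is not None
            for name in module_names}

# Import names below are assumptions based on the packages in the setup cell.
optional = ["fancyimpute", "py_stringmatching", "py_stringsimjoin",
            "Levenshtein", "fuzzywuzzy"]
availability = check_optional_dependencies(optional)
for name, available in availability.items():
    print(f"{name}: {'available' if available else 'missing'}")
```

`find_spec` only looks the module up; it does not import it, so the check is cheap and has no side effects.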

## 1) Dataset Loading and Preparation

In [15]:
# Load required libraries
import pandas as pd
import numpy as np
from datasets import load_dataset
import os

def load_anli_r1_dataset():
    """Load and prepare ANLI R1 dataset for text classification"""
    print("Loading ANLI R1 dataset...")
    
    try:
        dataset = load_dataset("facebook/anli")
        
        def prepare_anli_data(split_data):
            data = []
            for item in split_data:
                # Combine premise and hypothesis for NLI
                text_features = f"[PREMISE] {item['premise']} [HYPOTHESIS] {item['hypothesis']}"
                
                data.append({
                    'text': text_features,
                    'premise': item['premise'],
                    'hypothesis': item['hypothesis'],
                    'label': item['label']
                })
            return pd.DataFrame(data)
        
        train_df = prepare_anli_data(dataset['train_r1'])
        val_df = prepare_anli_data(dataset['dev_r1'])
        test_df = prepare_anli_data(dataset['test_r1'])

        print(f"ANLI R1 loaded: Train={len(train_df)}, Val={len(val_df)}, Test={len(test_df)}")
        return train_df, val_df, test_df
        
    except Exception as e:
        print(f"Error loading ANLI: {e}")
        return None, None, None

# Load the dataset
train_df, val_df, test_df = load_anli_r1_dataset()

Loading ANLI R1 dataset...
Downloading and preparing dataset None/plain_text to /home/automl/.cache/huggingface/datasets/facebook___parquet/plain_text-bc7c6c9c4d1b458b/0.0.0/14a00e99c0d15a23649d0db8944380ac81082d4b021f398733dd84f3a6c569a7...


Downloading data files: 100%|██████████| 3/3 [00:00<00:00, 1115.61it/s]
Extracting data files: 100%|██████████| 3/3 [00:00<00:00, 239.68it/s]
                                                                     

Error loading ANLI: {'train_r1', 'dev_r1', 'test_r2', 'train_r2', 'dev_r2', 'test_r1', 'train_r3', 'dev_r3', 'test_r3'}


In [14]:
# Display basic information about the dataset
if train_df is not None:
    print("Dataset shape:")
    print(f"Train: {train_df.shape}")
    print(f"Validation: {val_df.shape}")
    print(f"Test: {test_df.shape}")
    
    print("\nFirst few rows:")
    display(train_df.head())
    
    print("\nLabel distribution:")
    print(train_df['label'].value_counts())
    
    print("\nColumn info:")
    train_df.info()  # info() prints its report directly and returns None
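
The row-by-row loop in `prepare_anli_data` can also be written with vectorized pandas string operations once a split is available as a DataFrame (for instance via the `to_pandas()` method of a Hugging Face `Dataset`). The sketch below shows the same `[PREMISE] ... [HYPOTHESIS] ...` concatenation on toy rows; `build_nli_frame` and the example sentences are illustrative and not taken from ANLI.

```python
import pandas as pd

def build_nli_frame(df):
    """Add the combined 'text' column without a Python-level row loop."""
    out = df[["premise", "hypothesis", "label"]].copy()
    combined = "[PREMISE] " + out["premise"] + " [HYPOTHESIS] " + out["hypothesis"]
    out.insert(0, "text", combined)  # keep 'text' as the first column
    return out

# Toy rows standing in for a real split (not actual ANLI examples).
toy = pd.DataFrame({
    "premise": ["A man is cooking.", "It is raining."],
    "hypothesis": ["Someone prepares food.", "The sky is clear."],
    "label": [0, 2],
})
nli = build_nli_frame(toy)
print(nli.columns.tolist())
print(nli.loc[0, "text"])
```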

## 2) Prepare Data for Learn2Clean

Learn2Clean works with CSV files, so we need to save our data and create a reader function.

In [3]:
# Create datasets directory if it doesn't exist
os.makedirs('../datasets/anli_r1', exist_ok=True)

# Save datasets as CSV files - KEEP TRAIN AND VALIDATION SEPARATE!
if train_df is not None:
    # Save train, validation, and test separately to avoid data leakage
    train_df.to_csv('../datasets/anli_r1/anli_r1_train.csv', index=False, encoding='utf-8')
    val_df.to_csv('../datasets/anli_r1/anli_r1_val.csv', index=False, encoding='utf-8')
    test_df.to_csv('../datasets/anli_r1/anli_r1_test.csv', index=False, encoding='utf-8')
    
    print("Datasets saved successfully!")
    print(f"Train size: {len(train_df)}")
    print(f"Validation size: {len(val_df)}")
    print(f"Test size: {len(test_df)}")
    print("\nIMPORTANT: Train/val/test kept separate to avoid data leakage for AutoGluon!")

Datasets saved successfully!
Train size: 16946
Validation size: 1000
Test size: 1000

IMPORTANT: Train/val/test kept separate to avoid data leakage for AutoGluon!


In [4]:
# Define dataset reader function for Learn2Clean
def read_dataset(name):
    """Load datasets for Learn2Clean processing"""
    import pandas as pd
    if name == "anli_r1":
        df = pd.read_csv('../datasets/anli_r1/anli_r1_train.csv', sep=',', encoding='utf-8')
    elif name == "anli_r1_val":
        df = pd.read_csv('../datasets/anli_r1/anli_r1_val.csv', sep=',', encoding='utf-8')
    elif name == "anli_r1_test":
        df = pd.read_csv('../datasets/anli_r1/anli_r1_test.csv', sep=',', encoding='utf-8')
    else: 
        raise ValueError('Invalid dataset name')               
    return df

# Test the reader function
test_load = read_dataset("anli_r1")
print(f"Loaded train dataset shape: {test_load.shape}")
print(f"Columns: {test_load.columns.tolist()}")

# Verify all splits
print("\nDataset split sizes:")
print(f"Train: {len(read_dataset('anli_r1'))}")
print(f"Validation: {len(read_dataset('anli_r1_val'))}")
print(f"Test: {len(read_dataset('anli_r1_test'))}")

Loaded train dataset shape: (16946, 4)
Columns: ['text', 'premise', 'hypothesis', 'label']

Dataset split sizes:
Train: 16946
Validation: 1000
Test: 1000
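
Keeping the splits in separate files only prevents leakage if the splits do not share examples. A quick sanity check, sketched here on toy frames, is to intersect the `text` values of two splits; `find_overlap` is a hypothetical helper, and with the real data one could pass `read_dataset('anli_r1')` and `read_dataset('anli_r1_val')` instead.

```python
import pandas as pd

def find_overlap(left, right, key="text"):
    """Return the set of `key` values that occur in both frames."""
    return set(left[key]).intersection(right[key])

# Toy frames standing in for the train and validation CSVs.
train_toy = pd.DataFrame({"text": ["a", "b", "c"], "label": [0, 1, 2]})
val_toy = pd.DataFrame({"text": ["c", "d"], "label": [2, 0]})

shared = find_overlap(train_toy, val_toy)
print(f"{len(shared)} shared example(s): {sorted(shared)}")
```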


## 3) Data Profiling with Learn2Clean

In [7]:
# Add Learn2Clean to Python path
import sys
import os
sys.path.append(os.path.abspath('../python-package'))

import learn2clean.loading.reader as rd 
import learn2clean.normalization.normalizer as nl 
import pandas as pd

# Execute profiling function for ANLI R1 dataset
rd.profile_summary(read_dataset('anli_r1'), plot=False)

ImportError: cannot import name 'loading' from 'Learn2Clean' (/storage/nammt/autogluon/Learn2Clean/python-package/learn2clean/__init__.py)

In [None]:
# Check the target variable
anli_data = read_dataset('anli_r1')
print("Target variable (label) distribution:")
print(anli_data['label'].value_counts())
print("\nTarget variable head:")
print(anli_data['label'].head())
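
If the Learn2Clean profiling import is unavailable, a rough per-column summary can be produced with pandas alone. This is a minimal sketch and not a reimplementation of `rd.profile_summary`; `basic_profile` is a hypothetical helper that reports only dtype, missing count, and number of distinct values.

```python
import pandas as pd

def basic_profile(df):
    """Per-column dtype, missing count, and number of distinct non-null values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isna().sum(),
        "unique": df.nunique(),
    })

# Toy frame; with the real data one could pass read_dataset('anli_r1').
toy = pd.DataFrame({"text": ["x", "y", None], "label": [0, 1, 1]})
profile = basic_profile(toy)
print(profile)
```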

## 4) Learn2Clean Data Processing

Now we'll use Learn2Clean's Reader class to process the ANLI R1 dataset.

In [None]:
# Create Learn2Clean reader with encoding for text classification
d_enc = rd.Reader(sep=',', verbose=True, encoding=True) 

# Process ANLI R1 dataset - ONLY TRAIN DATA for Learn2Clean optimization
# This avoids data leakage by not using validation data in preprocessing decisions
anli_r1_files = ["../datasets/anli_r1/anli_r1_train.csv"]
anli_r1_encoded = d_enc.train_test_split(anli_r1_files, 'label')

print("\nProcessed dataset structure (TRAIN ONLY):")
print(f"Train shape: {anli_r1_encoded['train'].shape}")
print(f"Target shape: {anli_r1_encoded['target'].shape}")
print(f"Target name: {anli_r1_encoded['target'].name}")
print("\nNote: Only training data used for Learn2Clean to avoid data leakage!")

## 5) Manual Data Cleaning Pipeline

Let's create a manual preprocessing pipeline for text classification.

In [None]:
# Add Learn2Clean to Python path (if not already done)
import sys
import os
pkg_path = os.path.abspath('../python-package')
if pkg_path not in sys.path:  # compare the same absolute path that gets appended
    sys.path.append(pkg_path)

# Import Learn2Clean modules for manual pipeline
import learn2clean.loading.reader as rd 
import learn2clean.normalization.normalizer as nl 
import learn2clean.feature_selection.feature_selector as fs
import learn2clean.duplicate_detection.duplicate_detector as dd
import learn2clean.outlier_detection.outlier_detector as od
import learn2clean.imputation.imputer as imp
import learn2clean.classification.classifier as cl

# Create a copy of the dataset for manual processing
manual_dataset = anli_r1_encoded.copy()

print("Starting manual preprocessing pipeline...")

# Step 1: Handle missing values
# Strategy codes follow upstream Learn2Clean, which expects upper-case names such as 'MEDIAN'
print("\n1. Imputation - Replace missing values")
imputer = imp.Imputer(dataset=manual_dataset, strategy='MEDIAN', verbose=True)
manual_dataset = imputer.transform()

# Step 2: Duplicate detection ('ED' removes exact duplicates in upstream Learn2Clean)
print("\n2. Duplicate Detection")
dup_detector = dd.Duplicate_detector(dataset=manual_dataset, strategy='ED', verbose=True)
manual_dataset = dup_detector.transform()

# Step 3: Feature selection for text data
print("\n3. Feature Selection")
feat_selector = fs.Feature_selector(dataset=manual_dataset, strategy='WR', exclude='label', verbose=True)
manual_dataset = feat_selector.transform()

print("\nManual preprocessing completed!")
print(f"Final train shape: {manual_dataset['train'].shape}")
print(f"Final test shape: {manual_dataset['test'].shape}")
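
To make the first two steps concrete, the sketch below performs median imputation and exact-duplicate removal with plain pandas on a toy frame. It illustrates the operations only; it is not how Learn2Clean implements them, and the `length` column is a made-up numeric feature.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value and one exact duplicate row after imputation.
toy = pd.DataFrame({
    "length": [10.0, np.nan, 30.0, 10.0],
    "label": [0, 1, 2, 0],
})

# Step 1: median imputation on numeric columns.
medians = toy.median(numeric_only=True)
imputed = toy.fillna(medians)

# Step 2: drop exact duplicate rows, keeping the first occurrence.
deduped = imputed.drop_duplicates().reset_index(drop=True)
print(deduped)
```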

## 6) Classification with Manual Pipeline

In [None]:
# Test classification with manually cleaned data
print("Testing classification with manually cleaned data...")

# Try different classifiers
classifiers = ['CART', 'NB', 'LDA']

for clf_name in classifiers:
    try:
        print(f"\nTesting {clf_name} classifier:")
        # Upstream Learn2Clean's Classifier takes the `target` and `strategy` keywords
        classifier = cl.Classifier(dataset=manual_dataset, target='label', strategy=clf_name, verbose=True)
        result = classifier.transform()
        print(f"{clf_name} classification completed successfully")
    except Exception as e:
        print(f"Error with {clf_name}: {e}")
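
For readers who want to see the same three-way comparison outside Learn2Clean, here is a hedged scikit-learn sketch. It assumes the abbreviations map to a decision tree (CART), Gaussian naive Bayes (NB), and linear discriminant analysis (LDA), and it uses synthetic three-class data rather than the encoded ANLI features, so the scores say nothing about ANLI itself.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class data standing in for an encoded feature matrix.
X, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)

# Assumed mapping from the abbreviations above to scikit-learn estimators.
models = {
    "CART": DecisionTreeClassifier(random_state=0),
    "NB": GaussianNB(),
    "LDA": LinearDiscriminantAnalysis(),
}

scores = {}
for name, model in models.items():
    # 5-fold cross-validated accuracy, averaged over folds.
    scores[name] = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    print(f"{name}: mean accuracy = {scores[name]:.3f}")
```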

## 7) Automated Learn2Clean Pipeline

Now let's use Learn2Clean's Q-learning approach to automatically find the best preprocessing pipeline.

In [None]:
# Add Learn2Clean to Python path (if not already done)
import sys
import os
pkg_path = os.path.abspath('../python-package')
if pkg_path not in sys.path:  # compare the same absolute path that gets appended
    sys.path.append(pkg_path)

import learn2clean.qlearning.qlearner as ql

# Create a fresh copy of the dataset for Learn2Clean
l2c_dataset = anli_r1_encoded.copy()

print("Starting Learn2Clean automated pipeline...")
print("This may take several minutes to find the optimal preprocessing sequence.")

# Learn2Clean for CART classification
l2c_classification = ql.Qlearner(
    dataset=l2c_dataset,
    goal='CART', 
    target_goal='label',
    threshold=0.6, 
    target_prepare=None, 
    file_name='anli_r1_example', 
    verbose=False
)

# Run Learn2Clean optimization
l2c_classification.learn2clean()
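
The idea of using Q-learning to pick a preprocessing sequence can be illustrated with a toy tabular example. The sketch below is not Learn2Clean's implementation: the state space (the last step applied), the action names, and the reward table are all invented for illustration. Choosing `classify` ends an episode and pays a hypothetical quality score that depends on the step applied just before it.

```python
import random

ACTIONS = ["impute", "dedup", "select", "classify"]
STATES = ["start", "impute", "dedup", "select"]
# Hypothetical quality score obtained when classifying right after each state.
TERMINAL_REWARD = {"start": 0.50, "impute": 0.60, "dedup": 0.55, "select": 0.70}

def q_learning(episodes=5000, alpha=0.5, gamma=0.9, epsilon=0.5, seed=0):
    """Tabular Q-learning with an epsilon-greedy behaviour policy."""
    rng = random.Random(seed)
    q = {s: {a: 0.0 for a in ACTIONS} for s in STATES}
    for _episode in range(episodes):
        state = "start"
        for _step in range(10):  # cap the episode length
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)              # explore
            else:
                action = max(q[state], key=q[state].get)  # exploit
            if action == "classify":
                reward, next_state = TERMINAL_REWARD[state], None
            else:
                reward, next_state = 0.0, action
            best_next = 0.0 if next_state is None else max(q[next_state].values())
            # Standard Q-learning update toward reward + discounted best next value.
            q[state][action] += alpha * (reward + gamma * best_next - q[state][action])
            if next_state is None:
                break
            state = next_state
    return q

def greedy_pipeline(q, max_steps=10):
    """Follow the highest-valued action from 'start' until 'classify'."""
    state, steps = "start", []
    for _ in range(max_steps):
        action = max(q[state], key=q[state].get)
        steps.append(action)
        if action == "classify":
            break
        state = action
    return steps

q = q_learning()
print(greedy_pipeline(q))
```

With this reward table, the discount factor makes each extra step cost 10% of the final score, so the learned policy trades pipeline length against the quality gain of each step.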

## 8) Random Baseline Comparison

In [None]:
# Compare with random preprocessing pipeline
random_dataset = anli_r1_encoded.copy()

print("Running random preprocessing pipeline for comparison...")

# Random preprocessing pipeline for CART classification
random_pipeline = ql.Qlearner(
    dataset=random_dataset,
    goal='CART',
    target_goal='label',
    target_prepare=None, 
    verbose=False
)

try:
    random_pipeline.random_cleaning('anli_r1_random_example')
    print("Random pipeline completed successfully")
except Exception as e:
    print(f"Random pipeline error: {e}")
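
When the same encoded dataset feeds several experiments, it is worth remembering that `dict.copy()` is shallow: the new dict still points at the same DataFrame objects. Whether Learn2Clean modifies those frames in place is not established here, so the sketch below only demonstrates the Python behaviour on a toy dict and shows `copy.deepcopy` as a defensive alternative.

```python
import copy
import pandas as pd

# Toy stand-in for an encoded dataset dict.
encoded = {
    "train": pd.DataFrame({"x": [1.0, None, 3.0]}),
    "target": pd.Series([0, 1, 2], name="label"),
}

shallow = encoded.copy()        # new dict, same DataFrame objects
deep = copy.deepcopy(encoded)   # new dict, independent DataFrames

# An in-place edit made through the shallow copy is visible in the original.
shallow["train"].loc[1, "x"] = 2.0

print(encoded["train"]["x"].tolist())
print(shallow["train"] is encoded["train"])
print(deep["train"] is encoded["train"])
```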

## 9) Results Analysis

The results of Learn2Clean and random cleaning are stored in the 'save' directory as text files.

In [None]:
# Check if results files exist and display them
import os

results_files = [
    'save/anli_r1_example_results.txt',
    'save/anli_r1_random_example_results_file.txt'
]

for file_path in results_files:
    if os.path.exists(file_path):
        print(f"\n=== Results from {file_path} ===")
        with open(file_path, 'r') as f:
            content = f.read()
            print(content[-500:])  # Show last 500 characters
    else:
        print(f"Results file not found: {file_path}")
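
The tail-reading logic above can be factored into a small reusable function. This is a sketch with a hypothetical helper name, `tail_text`, demonstrated on a temporary file rather than on an actual Learn2Clean results file.

```python
import tempfile
from pathlib import Path

def tail_text(path, n_chars=500, encoding="utf-8"):
    """Return the last `n_chars` characters of a text file, or None if it is missing."""
    path = Path(path)
    if not path.is_file():
        return None
    return path.read_text(encoding=encoding)[-n_chars:]

# Demonstration on a temporary file (contents are made up).
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo_results.txt"
    demo.write_text("line 1\nline 2\nfinal score line\n", encoding="utf-8")
    tail = tail_text(demo, n_chars=17)
    missing = tail_text(Path(tmp) / "missing.txt")

print(repr(tail))
print(missing)
```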

## 10) Applying Learned Preprocessing to Validation Data

After Learn2Clean finds the optimal preprocessing pipeline on training data, we need to apply the same transformations to validation data for AutoGluon.

In [None]:
# Load validation data separately
val_data = read_dataset('anli_r1_val')
test_data = read_dataset('anli_r1_test')

print(f"Validation data shape: {val_data.shape}")
print(f"Test data shape: {test_data.shape}")

# TODO: Apply the optimal preprocessing pipeline found by Learn2Clean
# to validation and test data using the same transformations
# (same imputation values, same normalization parameters, etc.)

print("\nFor AutoGluon training:")
print("1. Use the Learn2Clean optimized training data")
print("2. Apply the SAME preprocessing pipeline to validation data")
print("3. Keep test data completely separate until final evaluation")
print("4. This ensures no data leakage and valid model evaluation")
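
The TODO above amounts to a fit-on-train, apply-everywhere pattern: statistics such as imputation medians and scaling parameters are learned once from the training split and then reused unchanged on validation and test data. The sketch below shows that pattern with pandas on toy frames; `fit_preprocessing`, `apply_preprocessing`, and the `length` column are hypothetical and do not reflect the pipeline Learn2Clean would actually select.

```python
import numpy as np
import pandas as pd

def fit_preprocessing(train, columns):
    """Learn imputation medians and scaling statistics from the training split only."""
    medians = train[columns].median()
    filled = train[columns].fillna(medians)
    std = filled.std(ddof=0).replace(0, 1.0)  # guard against division by zero
    return {"columns": columns, "median": medians, "mean": filled.mean(), "std": std}

def apply_preprocessing(df, params):
    """Apply previously learned parameters to any split without refitting."""
    out = df.copy()
    cols = params["columns"]
    out[cols] = (out[cols].fillna(params["median"]) - params["mean"]) / params["std"]
    return out

# Toy splits with a made-up numeric feature.
train_toy = pd.DataFrame({"length": [10.0, 20.0, np.nan, 30.0]})
val_toy = pd.DataFrame({"length": [np.nan, 40.0]})

params = fit_preprocessing(train_toy, ["length"])
val_ready = apply_preprocessing(val_toy, params)
print(params["median"]["length"])
print(val_ready)
```

The missing validation value is filled with the training median rather than a statistic computed from the validation split, which is what keeps the evaluation free of leakage.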

## Summary

This notebook demonstrated how to apply Learn2Clean to the ANLI R1 dataset for natural language inference classification **while avoiding data leakage**. The key steps were:

1. **Data Loading**: Loaded the ANLI R1 dataset and prepared it for text classification
2. **Data Separation**: Kept train/validation/test splits separate to avoid data leakage
3. **Profiling**: Used Learn2Clean's profiling capabilities to understand the data
4. **Manual Pipeline**: Created a manual preprocessing pipeline with imputation, duplicate detection, and feature selection
5. **Automated Pipeline**: Used Learn2Clean's Q-learning approach on TRAINING DATA ONLY
6. **Comparison**: Compared Learn2Clean results with random preprocessing baselines
7. **Validation Preparation**: Loaded the validation and test splits separately; applying the learned preprocessing to them is left as a TODO

**Critical for AutoGluon**: The preprocessing pipeline learned on training data must be applied to validation data using the same parameters (same imputation values, normalization statistics, etc.) to ensure valid evaluation and avoid data leakage.