## **Feature:** Data Standardization

**Names:** Gia Bao Ngo

### **What it does**
Standardizes and preprocesses a DataFrame for machine learning. Covers normalization (min-max, max-abs, unit vector), scaling (z-score, robust, quantile), categorical encoding (one-hot, label, binary), rare-category grouping, and dummy variable creation with an optional dropped reference level to avoid perfect multicollinearity.

In [3]:
# Load dotenv
import os
from dotenv import load_dotenv
load_dotenv()

# Get API Key
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY")
if not OPENAI_API_KEY:
    print("OpenAI API Key not found")

# Import libraries
from pathlib import Path
import pandas as pd
import numpy as np
# Additional imports for standardization
import math
import re
import datetime
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler, RobustScaler, MinMaxScaler, LabelEncoder, OneHotEncoder
import warnings
warnings.filterwarnings('ignore')

# Langchain imports
from langchain_openai import ChatOpenAI  
from langchain_core.messages import HumanMessage, SystemMessage

### **Helper Functions**
- `normalize_columns(df, columns=None, method='minmax')` - Normalization: min-max (0 to 1), max-abs (-1 to 1), or unit-vector scaling
- `scale_columns(df, columns=None, method='standard')` - Standard (z-score), robust (median/IQR), or quantile scaling
- `encode_categorical(df, columns=None, method='onehot')` - Categorical encoding (onehot, label, binary)
- `handle_rare_categories(df, column, threshold=0.01, replace_with='Other')` - Rare category handling
- `create_dummy_variables(df, columns=None, drop_first=True)` - Dummy variable creation with multicollinearity handling
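
As a rough illustration of what the first two helpers compute, the sketch below applies the min-max and z-score formulas by hand to a toy Series. The values are made up for the example; the helpers themselves delegate to scikit-learn scalers rather than using this code.

```python
import pandas as pd

# Toy values chosen only for illustration
s = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0])

# Min-max normalization: (x - min) / (max - min), giving values in [0, 1]
minmax = (s - s.min()) / (s.max() - s.min())

# Z-score scaling: (x - mean) / std, using the population std (ddof=0),
# which is the convention scikit-learn's StandardScaler follows
zscore = (s - s.mean()) / s.std(ddof=0)

print(minmax.tolist())
print(zscore.round(4).tolist())
```

After min-max normalization the smallest value maps to 0 and the largest to 1; after z-score scaling the column has mean 0 and population standard deviation 1.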

In [4]:
def normalize_columns(df, columns=None, method='minmax'):
    """
    Normalize numeric columns using min-max, max-abs, or unit-vector scaling.
    
    Parameters:
    - df: pandas DataFrame
    - columns: list of column names to normalize (None = all numeric columns)
    - method: 'minmax' (0-1), 'maxabs' (-1 to 1), or 'unit' (unit vector scaling)
    
    Returns:
    - DataFrame with normalized columns
    """
    result_df = df.copy()
    normalized_cols = []
    
    if columns is None:
        columns = result_df.select_dtypes(include=[np.number]).columns.tolist()
    
    # Validate columns exist and are numeric
    valid_columns = []
    for col in columns:
        if col not in result_df.columns:
            print(f"Warning: Column '{col}' not found in DataFrame")
            continue
        if not pd.api.types.is_numeric_dtype(result_df[col]):
            print(f"Warning: Column '{col}' is not numeric, skipping")
            continue
        valid_columns.append(col)
    
    if not valid_columns:
        print("No valid numeric columns found for normalization")
        return result_df
    
    try:
        if method == 'minmax':
            scaler = MinMaxScaler()
        elif method == 'maxabs':
            from sklearn.preprocessing import MaxAbsScaler
            scaler = MaxAbsScaler()
        elif method == 'unit':
            # sklearn's Normalizer rescales rows (samples) to unit norm, so applying it
            # to a single-column array would map every non-zero value to +/-1.
            # Divide each column by its own L2 norm instead (NaNs are preserved).
            for col in valid_columns:
                l2_norm = np.sqrt((result_df[col].dropna().astype(float) ** 2).sum())
                if l2_norm == 0:
                    print(f"Warning: Column '{col}' has zero norm, skipping")
                    continue
                result_df[col] = result_df[col] / l2_norm
                normalized_cols.append({
                    'column': col,
                    'method': 'unit',
                    'min_val': result_df[col].min(),
                    'max_val': result_df[col].max()
                })
            
            print(f"=== NORMALIZATION RESULTS ({method.upper()}) ===")
            print(f"Columns normalized: {len(normalized_cols)}")
            for norm in normalized_cols:
                print(f"  {norm['column']}: range [{norm['min_val']:.4f}, {norm['max_val']:.4f}]")
            return result_df
        else:
            print(f"Error: Unknown normalization method '{method}'. Use 'minmax', 'maxabs', or 'unit'")
            return result_df
        
        # Apply scaling for minmax and maxabs methods
        for col in valid_columns:
            original_values = result_df[col].copy()
            # Handle missing values
            non_null_mask = result_df[col].notna()
            
            if non_null_mask.sum() == 0:
                print(f"Warning: Column '{col}' has no non-null values, skipping")
                continue
                
            # Cast to float so scaled values are not written into an integer column
            result_df[col] = result_df[col].astype(float)
            
            # Fit and transform only non-null values
            values_to_scale = result_df.loc[non_null_mask, col].values.reshape(-1, 1)
            scaled_values = scaler.fit_transform(values_to_scale).flatten()
            
            # Update the column
            result_df.loc[non_null_mask, col] = scaled_values
            
            normalized_cols.append({
                'column': col,
                'method': method,
                'original_min': original_values.min(),
                'original_max': original_values.max(),
                'new_min': result_df[col].min(),
                'new_max': result_df[col].max()
            })
        
        print(f"=== NORMALIZATION RESULTS ({method.upper()}) ===")
        print(f"Columns normalized: {len(normalized_cols)}")
        
        if normalized_cols:
            print("\nNormalization details:")
            for norm in normalized_cols:
                print(f"  {norm['column']}: [{norm['original_min']:.2f}, {norm['original_max']:.2f}] → [{norm['new_min']:.4f}, {norm['new_max']:.4f}]")
        
        return result_df
        
    except Exception as e:
        print(f"Error during normalization: {e}")
        return df
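
A note on the `'unit'` option: scikit-learn's `Normalizer` rescales each row (sample) to unit norm, not each column. The sketch below, using NumPy only and made-up values, contrasts dividing a whole column by its L2 norm with row-wise normalization of a single-feature array, where every non-zero value collapses to its sign.

```python
import numpy as np

col = np.array([3.0, -4.0, 12.0])  # toy column values, for illustration only

# Treat the whole column as one vector and divide by its L2 norm
unit_col = col / np.linalg.norm(col)

# Row-wise normalization of a single-feature array: each row is a
# one-element vector, so dividing by its norm leaves only the sign
row_wise = col / np.abs(col)

print(unit_col)   # components of a vector whose norm is 1
print(row_wise)   # only the signs survive
```

If the intent is "scale the column to unit length", the first form is the one that preserves the relative magnitudes of the values.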

In [5]:
def scale_columns(df, columns=None, method='standard'):
    """
    Scale numeric columns using z-score (standard), robust, or quantile scaling.
    
    Parameters:
    - df: pandas DataFrame
    - columns: list of column names to scale (None = all numeric columns)
    - method: 'standard' (z-score), 'robust' (median/IQR), or 'quantile' (maps to a normal distribution)
    
    Returns:
    - DataFrame with scaled columns
    """
    result_df = df.copy()
    scaled_cols = []
    
    if columns is None:
        columns = result_df.select_dtypes(include=[np.number]).columns.tolist()
    
    # Validate columns exist and are numeric
    valid_columns = []
    for col in columns:
        if col not in result_df.columns:
            print(f"Warning: Column '{col}' not found in DataFrame")
            continue
        if not pd.api.types.is_numeric_dtype(result_df[col]):
            print(f"Warning: Column '{col}' is not numeric, skipping")
            continue
        valid_columns.append(col)
    
    if not valid_columns:
        print("No valid numeric columns found for scaling")
        return result_df
    
    try:
        # Choose scaler based on method
        if method == 'standard':
            scaler = StandardScaler()
        elif method == 'robust':
            scaler = RobustScaler()
        elif method == 'quantile':
            from sklearn.preprocessing import QuantileTransformer
            scaler = QuantileTransformer(output_distribution='normal')
        else:
            print(f"Error: Unknown scaling method '{method}'. Use 'standard', 'robust', or 'quantile'")
            return result_df
        
        # Apply scaling to each column
        for col in valid_columns:
            original_values = result_df[col].copy()
            # Handle missing values
            non_null_mask = result_df[col].notna()
            
            if non_null_mask.sum() == 0:
                print(f"Warning: Column '{col}' has no non-null values, skipping")
                continue
            
            # Get original statistics
            original_mean = original_values.mean()
            original_std = original_values.std()
            original_median = original_values.median()
            
            # Cast to float so scaled values are not written into an integer column
            result_df[col] = result_df[col].astype(float)
            
            # Fit and transform only non-null values
            values_to_scale = result_df.loc[non_null_mask, col].values.reshape(-1, 1)
            scaled_values = scaler.fit_transform(values_to_scale).flatten()
            
            # Update the column
            result_df.loc[non_null_mask, col] = scaled_values
            
            # Calculate new statistics
            new_mean = result_df[col].mean()
            new_std = result_df[col].std()
            new_median = result_df[col].median()
            
            scaled_cols.append({
                'column': col,
                'method': method,
                'original_mean': original_mean,
                'original_std': original_std,
                'original_median': original_median,
                'new_mean': new_mean,
                'new_std': new_std,
                'new_median': new_median
            })
        
        print(f"=== SCALING RESULTS ({method.upper()}) ===")
        print(f"Columns scaled: {len(scaled_cols)}")
        
        if scaled_cols:
            print("\nScaling details:")
            for scale in scaled_cols:
                if method == 'standard':
                    print(f"  {scale['column']}: mean {scale['original_mean']:.2f}→{scale['new_mean']:.4f}, "
                          f"std {scale['original_std']:.2f}→{scale['new_std']:.4f}")
                elif method == 'robust':
                    print(f"  {scale['column']}: median {scale['original_median']:.2f}→{scale['new_median']:.4f}, "
                          f"std {scale['original_std']:.2f}→{scale['new_std']:.4f}")
                else:  # quantile
                    print(f"  {scale['column']}: transformed to normal distribution, "
                          f"std {scale['original_std']:.2f}→{scale['new_std']:.4f}")
        
        return result_df
        
    except Exception as e:
        print(f"Error during scaling: {e}")
        return df
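
To show why `'robust'` is described as outlier-resistant, here is a minimal sketch of the median/IQR formula computed by hand in pandas next to a z-score. The data are a toy example with one deliberate outlier; this is not the helper's implementation, which uses `RobustScaler`.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])  # toy data with one outlier

median = s.median()
iqr = s.quantile(0.75) - s.quantile(0.25)   # interquartile range

# Robust scaling: centre on the median, divide by the IQR
robust = (s - median) / iqr

# Z-score for comparison: the outlier inflates both the mean and the std
zscore = (s - s.mean()) / s.std(ddof=0)

print(robust.tolist())
print(zscore.round(3).tolist())
```

With robust scaling the four typical values stay evenly spread around zero, whereas the z-score squeezes them together because the outlier dominates the mean and standard deviation.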

In [6]:
def encode_categorical(df, columns=None, method='onehot'):
    """
    Encode categorical columns using different methods.
    
    Parameters:
    - df: pandas DataFrame
    - columns: list of column names to encode (None = all object/category columns)
    - method: 'onehot', 'label', or 'binary'
    
    Returns:
    - DataFrame with encoded columns
    """
    result_df = df.copy()
    encoded_info = []
    
    if columns is None:
        columns = result_df.select_dtypes(include=['object', 'category']).columns.tolist()
    
    # Validate columns exist and are categorical
    valid_columns = []
    for col in columns:
        if col not in result_df.columns:
            print(f"Warning: Column '{col}' not found in DataFrame")
            continue
        if not (result_df[col].dtype == 'object' or result_df[col].dtype.name == 'category'):
            print(f"Warning: Column '{col}' is not categorical, skipping")
            continue
        valid_columns.append(col)
    
    if not valid_columns:
        print("No valid categorical columns found for encoding")
        return result_df
    
    try:
        if method == 'onehot':
            # One-hot encoding
            for col in valid_columns:
                # Get unique values (excluding NaN)
                unique_vals = result_df[col].dropna().unique()
                n_categories = len(unique_vals)
                
                if n_categories > 50:
                    print(f"Warning: Column '{col}' has {n_categories} categories (>50). Consider using label encoding or handling rare categories first.")
                    continue
                
                # Create one-hot encoded columns
                dummies = pd.get_dummies(result_df[col], prefix=col, dummy_na=False)
                
                # Drop original column and add dummy columns
                result_df = result_df.drop(columns=[col])
                result_df = pd.concat([result_df, dummies], axis=1)
                
                encoded_info.append({
                    'original_column': col,
                    'method': 'onehot',
                    'n_categories': n_categories,
                    'new_columns': list(dummies.columns),
                    'n_new_columns': len(dummies.columns)
                })
        
        elif method == 'label':
            # Label encoding
            for col in valid_columns:
                encoder = LabelEncoder()
                non_null_mask = result_df[col].notna()
                
                if non_null_mask.sum() == 0:
                    print(f"Warning: Column '{col}' has no non-null values, skipping")
                    continue
                
                # Fit and transform non-null values
                encoded_values = encoder.fit_transform(result_df.loc[non_null_mask, col])
                
                # Create new column and preserve nulls
                result_df[f"{col}_encoded"] = np.nan
                result_df.loc[non_null_mask, f"{col}_encoded"] = encoded_values
                
                # Convert to appropriate integer type
                max_label = encoded_values.max()
                if max_label <= 127:
                    result_df[f"{col}_encoded"] = result_df[f"{col}_encoded"].astype('Int8')
                elif max_label <= 32767:
                    result_df[f"{col}_encoded"] = result_df[f"{col}_encoded"].astype('Int16')
                else:
                    result_df[f"{col}_encoded"] = result_df[f"{col}_encoded"].astype('Int32')
                
                # Drop original column
                result_df = result_df.drop(columns=[col])
                
                encoded_info.append({
                    'original_column': col,
                    'method': 'label',
                    'n_categories': len(encoder.classes_),
                    'new_columns': [f"{col}_encoded"],
                    'label_mapping': dict(zip(encoder.classes_, range(len(encoder.classes_))))
                })
        
        elif method == 'binary':
            # Binary encoding (for high-cardinality categorical variables)
            for col in valid_columns:
                # Get unique values
                unique_vals = result_df[col].dropna().unique()
                n_categories = len(unique_vals)
                
                # Calculate number of binary columns needed (math is imported at module level)
                n_binary_cols = math.ceil(math.log2(n_categories)) if n_categories > 1 else 1
                
                # Create label encoding first
                encoder = LabelEncoder()
                non_null_mask = result_df[col].notna()
                
                if non_null_mask.sum() == 0:
                    print(f"Warning: Column '{col}' has no non-null values, skipping")
                    continue
                
                encoded_values = encoder.fit_transform(result_df.loc[non_null_mask, col])
                
                # Convert to binary representation
                binary_cols = []
                for i in range(n_binary_cols):
                    col_name = f"{col}_binary_{i}"
                    binary_cols.append(col_name)
                    result_df[col_name] = 0
                    
                    # Set binary digits
                    binary_values = (encoded_values >> i) & 1
                    result_df.loc[non_null_mask, col_name] = binary_values
                
                # Drop original column
                result_df = result_df.drop(columns=[col])
                
                encoded_info.append({
                    'original_column': col,
                    'method': 'binary',
                    'n_categories': n_categories,
                    'new_columns': binary_cols,
                    'n_binary_columns': n_binary_cols
                })
        
        else:
            print(f"Error: Unknown encoding method '{method}'. Use 'onehot', 'label', or 'binary'")
            return result_df
        
        print(f"=== CATEGORICAL ENCODING RESULTS ({method.upper()}) ===")
        print(f"Columns processed: {len(valid_columns)}")
        print(f"Columns encoded successfully: {len(encoded_info)}")
        
        if encoded_info:
            print("\nEncoding details:")
            for enc in encoded_info:
                if method == 'onehot':
                    print(f"  {enc['original_column']}: {enc['n_categories']} categories → {enc['n_new_columns']} dummy columns")
                elif method == 'label':
                    print(f"  {enc['original_column']}: {enc['n_categories']} categories → 1 label-encoded column")
                elif method == 'binary':
                    print(f"  {enc['original_column']}: {enc['n_categories']} categories → {enc['n_binary_columns']} binary columns")
        
        return result_df
        
    except Exception as e:
        print(f"Error during categorical encoding: {e}")
        return df
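
The `'binary'` branch is the least familiar of the three encodings, so a small standalone sketch may help. It assigns integer codes in sorted order (the order `LabelEncoder` uses for its classes) and then splits each code into bits, least-significant bit first. The category names are invented for the example.

```python
import math

categories = ["red", "green", "blue", "yellow", "purple"]  # toy labels

# Integer codes in sorted order, mirroring how LabelEncoder orders classes
codes = {cat: i for i, cat in enumerate(sorted(categories))}

# Number of bits needed to represent every code
n_bits = math.ceil(math.log2(len(codes))) if len(codes) > 1 else 1

# Bit i of each code becomes column i (least-significant bit first)
bits = {cat: [(code >> i) & 1 for i in range(n_bits)] for cat, code in codes.items()}

for cat, row in bits.items():
    print(cat, codes[cat], row)
```

Five categories fit in three binary columns instead of five one-hot columns, which is why this encoding is attractive for high-cardinality features.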

In [7]:
def handle_rare_categories(df, column, threshold=0.01, replace_with='Other'):
    """
    Handle rare categories by grouping them into a single category.
    
    Parameters:
    - df: pandas DataFrame
    - column: column name to process
    - threshold: frequency threshold (0.01 = 1% of total values)
    - replace_with: replacement value for rare categories
    
    Returns:
    - DataFrame with rare categories handled
    """
    if column not in df.columns:
        print(f"Error: Column '{column}' not found in DataFrame")
        return df
    
    result_df = df.copy()
    
    # Calculate value counts and frequencies
    value_counts = result_df[column].value_counts(dropna=False)
    total_count = len(result_df)
    frequencies = value_counts / total_count
    
    # Identify rare categories
    rare_categories = frequencies[frequencies < threshold].index.tolist()
    common_categories = frequencies[frequencies >= threshold].index.tolist()
    
    if len(rare_categories) == 0:
        print(f"No rare categories found in column '{column}' (threshold: {threshold:.1%})")
        return result_df
    
    # Replace rare categories
    result_df[column] = result_df[column].replace(rare_categories, replace_with)
    
    # Calculate statistics
    rare_count = sum(value_counts[cat] for cat in rare_categories)
    rare_percent = (rare_count / total_count) * 100
    
    print(f"=== RARE CATEGORY HANDLING RESULTS ===")
    print(f"Column: {column}")
    print(f"Threshold: {threshold:.1%}")
    print(f"Original unique categories: {len(value_counts)}")
    print(f"Rare categories found: {len(rare_categories)}")
    print(f"Rare categories affected {rare_count} rows ({rare_percent:.1f}% of data)")
    print(f"Categories after grouping: {result_df[column].nunique()}")
    
    if len(rare_categories) <= 10:  # Show rare categories if not too many
        print(f"\nRare categories replaced: {rare_categories}")
    else:
        print(f"\nFirst 10 rare categories: {rare_categories[:10]}... (+{len(rare_categories)-10} more)")
    
    # Show final category distribution
    final_counts = result_df[column].value_counts().head(10)
    print(f"\nTop categories after processing:")
    for cat, count in final_counts.items():
        pct = (count / total_count) * 100
        print(f"  {cat}: {count} ({pct:.1f}%)")
    
    return result_df
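
The core of rare-category grouping fits in a few lines. The sketch below is a simplified stand-in for the helper above, using a toy column and a 20% threshold chosen only so the example is easy to follow (the helper's default is 1%).

```python
import pandas as pd

s = pd.Series(["a"] * 6 + ["b"] * 3 + ["c"])  # toy column: 'c' is 10% of rows

freq = s.value_counts(normalize=True)          # relative frequency per category
rare = freq[freq < 0.2].index                  # categories under the 20% threshold

# Keep common values, replace rare ones with a single label
grouped = s.where(~s.isin(rare), "Other")
print(grouped.value_counts().to_dict())
```

Grouping before one-hot encoding keeps the number of generated columns down and avoids dummy columns that are almost entirely zero.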

In [8]:
def create_dummy_variables(df, columns=None, drop_first=True):
    """
    Create dummy variables with multicollinearity handling.
    
    Parameters:
    - df: pandas DataFrame
    - columns: list of column names to process (None = all object/category columns)
    - drop_first: drop first dummy to avoid multicollinearity
    
    Returns:
    - DataFrame with dummy variables created
    """
    result_df = df.copy()
    dummy_info = []
    
    if columns is None:
        columns = result_df.select_dtypes(include=['object', 'category']).columns.tolist()
    
    # Validate columns
    valid_columns = []
    for col in columns:
        if col not in result_df.columns:
            print(f"Warning: Column '{col}' not found in DataFrame")
            continue
        if not (result_df[col].dtype == 'object' or result_df[col].dtype.name == 'category'):
            print(f"Warning: Column '{col}' is not categorical, skipping")
            continue
        valid_columns.append(col)
    
    if not valid_columns:
        print("No valid categorical columns found for dummy variable creation")
        return result_df
    
    try:
        for col in valid_columns:
            # Get unique values info
            unique_vals = result_df[col].dropna().unique()
            n_categories = len(unique_vals)
            
            if n_categories == 1:
                print(f"Warning: Column '{col}' has only 1 unique value, skipping")
                continue
            
            if n_categories > 100:
                print(f"Warning: Column '{col}' has {n_categories} categories (>100). Consider handling rare categories first.")
                continue
            
            # Create dummy variables
            dummies = pd.get_dummies(result_df[col], prefix=col, dummy_na=False, drop_first=drop_first)
            
            # Handle multicollinearity check
            if drop_first and n_categories > 1:
                n_dummies = n_categories - 1
                # get_dummies drops the first category in sorted order, not the first one seen
                dropped_category = sorted(unique_vals)[0]
            else:
                n_dummies = n_categories
                dropped_category = None
            
            # Add dummies to result DataFrame
            for dummy_col in dummies.columns:
                result_df[dummy_col] = dummies[dummy_col]
            
            # Remove original column
            result_df = result_df.drop(columns=[col])
            
            dummy_info.append({
                'original_column': col,
                'n_original_categories': n_categories,
                'n_dummy_variables': n_dummies,
                'dummy_columns': list(dummies.columns),
                'dropped_first': drop_first,
                'dropped_category': dropped_category
            })
        
        print(f"=== DUMMY VARIABLE CREATION RESULTS ===")
        print(f"Columns processed: {len(valid_columns)}")
        print(f"Columns successfully converted: {len(dummy_info)}")
        print(f"Drop first category: {drop_first}")
        
        # Count only the columns that were actually converted (skipped columns are left in place)
        total_original_cols = len(dummy_info)
        total_dummy_cols = sum([info['n_dummy_variables'] for info in dummy_info])
        net_change = total_dummy_cols - total_original_cols
        
        print(f"\nColumn count impact:")
        print(f"  Original categorical columns: {total_original_cols}")
        print(f"  New dummy columns: {total_dummy_cols}")
        print(f"  Net change: {net_change:+d} columns")
        
        if dummy_info:
            print(f"\nDummy variable details:")
            for info in dummy_info:
                dropped_info = f" (dropped: {info['dropped_category']})" if info['dropped_category'] is not None else ""
                print(f"  {info['original_column']}: {info['n_original_categories']} categories → {info['n_dummy_variables']} dummies{dropped_info}")
        
        return result_df
        
    except Exception as e:
        print(f"Error during dummy variable creation: {e}")
        return df
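
One detail of `drop_first=True` is easy to get wrong: for a plain string column, `pd.get_dummies` orders the categories by sorting them, so the dropped level is the first in sorted order rather than the first value that appears in the data. A tiny check with made-up values:

```python
import pandas as pd

s = pd.Series(["b", "a", "c", "a"])   # 'b' appears first, 'a' sorts first

dummies = pd.get_dummies(s, prefix="x", drop_first=True)

print(list(dummies.columns))                   # remaining dummy columns
print(s.unique()[0], sorted(s.unique())[0])    # first seen vs. first sorted
```

The dropped level becomes the reference category, so any report of which category was dropped should be based on the sorted order.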

In [9]:
helper_docs = """ Helper functions available:
- normalize_columns(df, columns=None, method='minmax'): Normalize numeric columns. Methods: 'minmax' (0-1), 'maxabs' (-1 to 1), 'unit' (unit vector). Returns DataFrame with normalized columns.
- scale_columns(df, columns=None, method='standard'): Scale numeric columns using z-score, robust, or quantile scaling. Methods: 'standard' (z-score), 'robust' (median/IQR), 'quantile' (maps to a normal distribution). Returns DataFrame with scaled columns.
- encode_categorical(df, columns=None, method='onehot'): Encode categorical columns. Methods: 'onehot' (dummy variables), 'label' (integer encoding), 'binary' (binary representation). Returns DataFrame with encoded columns.
- handle_rare_categories(df, column, threshold=0.01, replace_with='Other'): Group rare categories below threshold into single category. Returns DataFrame with rare categories handled.
- create_dummy_variables(df, columns=None, drop_first=True): Create dummy variables with multicollinearity handling. Drop_first prevents perfect collinearity. Returns DataFrame with dummy columns.

Examples:
- "Normalize numeric columns" -> df = normalize_columns(df)
- "Scale features for machine learning" -> df = scale_columns(df, method='standard')
- "One-hot encode categorical variables" -> df = encode_categorical(df, method='onehot')
- "Handle rare categories in city column" -> df = handle_rare_categories(df, 'city', threshold=0.05)
- "Create dummy variables" -> df = create_dummy_variables(df, drop_first=True)
- "Normalize with min-max scaling" -> df = normalize_columns(df, method='minmax')
- "Use robust scaling for outlier-resistant standardization" -> df = scale_columns(df, method='robust')
- "Label encode all categorical columns" -> df = encode_categorical(df, method='label')
"""

# **MAIN FEATURE FUNCTION**

In [10]:
def standardization(df, user_query):
    """
    Main function that gets called by the main router.
    MUST take (df, user_query) and return df
    """
    
    # Create message chain
    messages = []
    messages.append(SystemMessage(content=helper_docs))
    messages.append(SystemMessage(content=f"""
    You are a data cleaning agent focused on data standardization and preprocessing for machine learning.
    
    Dataset info: Shape: {df.shape}, Sample: {df.head(3).to_string()}

    Libraries available:
    - pd (pandas), np (numpy)
    - math, re, datetime
    - sklearn.preprocessing (StandardScaler, RobustScaler, MinMaxScaler, LabelEncoder, OneHotEncoder)
    - All helper functions listed above
    
    Rules:
    - Return only executable Python code, no explanations, no markdown blocks
    - Use helper functions when appropriate for standardization tasks
    - ASSUME "df" IS ALREADY DEFINED
    - For normalization/scaling, use helper functions that modify DataFrame
    - For categorical encoding, use appropriate helper functions based on query
    - ALWAYS assign the result back to df when modifying: df = normalize_columns(df)
    - In order to generate a response/message to the user use print statements
    print("message")
    - Write a detailed print message to summarise actions taken and reasons
    
    Common query patterns:
    - "Normalize" or "normalize columns" -> df = normalize_columns(df)
    - "Scale" or "standardize" -> df = scale_columns(df, method='standard')
    - "Min-max scaling" -> df = normalize_columns(df, method='minmax')
    - "Robust scaling" -> df = scale_columns(df, method='robust')
    - "One-hot encode" -> df = encode_categorical(df, method='onehot')
    - "Label encode" -> df = encode_categorical(df, method='label')
    - "Handle rare categories" -> df = handle_rare_categories(df, 'column_name', threshold=0.05)
    - "Create dummy variables" -> df = create_dummy_variables(df)
    - "Prepare for machine learning" -> Apply scaling + encoding as appropriate
    """))
    messages.append(HumanMessage(content=f"User request: {user_query}"))
    
    # Call LLM with message chain
    llm = ChatOpenAI(temperature=0, model_name="gpt-4o-mini")
    response = llm.invoke(messages)
    generated_code = response.content.strip()
    
    # Execute code
    original_df = df.copy()
    try:
        # Build a single namespace: module globals (math, re, datetime, ...) plus our variables.
        # Passing one dict to exec avoids name-lookup failures inside comprehensions,
        # lambdas, and functions defined by the generated code.
        namespace = dict(globals())
        namespace.update({
            'df': df.copy(),
            'original_df': original_df,
            'pd': pd,
            'np': np,
            'normalize_columns': normalize_columns,
            'scale_columns': scale_columns,
            'encode_categorical': encode_categorical,
            'handle_rare_categories': handle_rare_categories,
            'create_dummy_variables': create_dummy_variables,
            'StandardScaler': StandardScaler,
            'RobustScaler': RobustScaler,
            'MinMaxScaler': MinMaxScaler,
            'LabelEncoder': LabelEncoder,
            'OneHotEncoder': OneHotEncoder,
            'print': print
        })
        
        exec(generated_code, namespace)
        return namespace['df']
    except Exception as e:
        print(f"Error: {e}")
        print(f"Generated code:\n{generated_code}")
        return original_df
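
The prompt asks the model to return bare code, but chat models sometimes wrap their answer in a Markdown fence anyway, which would make `exec` fail with a syntax error. The sketch below shows one possible guard. `strip_code_fences` is a hypothetical helper, not part of the notebook above, and it only handles the simple case of a single fence around the whole reply.

```python
import re

FENCE = "`" * 3  # three backticks, built this way to keep the example readable

# Optional language tag after the opening fence, then the code, then the closing fence
_FENCED = re.compile(rf"^{FENCE}[\w+-]*[ \t]*\n(.*?)\n?{FENCE}\s*$", re.DOTALL)

def strip_code_fences(text: str) -> str:
    """Return the code inside a single Markdown fence, or the text unchanged."""
    stripped = text.strip()
    match = _FENCED.match(stripped)
    return match.group(1) if match else stripped

# Example: a fenced reply and a bare reply both yield executable code
fenced_reply = FENCE + "python\ndf = scale_columns(df)\n" + FENCE
print(strip_code_fences(fenced_reply))
print(strip_code_fences("df = normalize_columns(df)"))
```

Such a guard would be applied to `response.content` before the code is executed; whether it is needed depends on how reliably the chosen model follows the "no markdown blocks" rule.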

# **Testing**

In [None]:
# # Create sample data with various standardization opportunities
# test_data = {
#     'numeric_1': range(1, 101),  # Can be normalized/scaled
#     'numeric_2': [x*10 + 50 for x in range(1, 101)],  # Different scale
#     'numeric_3': [x**2 for x in range(1, 101)],  # Non-linear distribution
#     'category_small': ['A', 'B', 'C'] * 33 + ['A'],  # Low cardinality - good for one-hot
#     'category_medium': [f'Cat_{i%10}' for i in range(100)],  # Medium cardinality
#     'category_with_rare': ['Common1'] * 40 + ['Common2'] * 30 + ['Common3'] * 20 + 
#                          [f'Rare_{i}' for i in range(10)],  # Has rare categories
#     'binary_like': ['Yes', 'No'] * 50,  # Binary categorical
# }

# test_df = pd.DataFrame(test_data)
# print("Test DataFrame created:")
# print(f"Shape: {test_df.shape}")
# print("\\nData types:")
# print(test_df.dtypes)
# print("\\nFirst few rows:")
# print(test_df.head())
# print("\\nCategory distributions:")
# for col in ['category_small', 'category_medium', 'category_with_rare', 'binary_like']:
#     print(f"\\n{col}: {test_df[col].nunique()} unique values")
#     print(test_df[col].value_counts().head())

Test DataFrame created:
Shape: (100, 7)
\nData types:
numeric_1              int64
numeric_2              int64
numeric_3              int64
category_small        object
category_medium       object
category_with_rare    object
binary_like           object
dtype: object
\nFirst few rows:
   numeric_1  numeric_2  numeric_3 category_small category_medium  \
0          1         60          1              A           Cat_0   
1          2         70          4              B           Cat_1   
2          3         80          9              C           Cat_2   
3          4         90         16              A           Cat_3   
4          5        100         25              B           Cat_4   

  category_with_rare binary_like  
0            Common1         Yes  
1            Common1          No  
2            Common1         Yes  
3            Common1          No  
4            Common1         Yes  
\nCategory distributions:
\ncategory_small: 3 unique values
category_small
A    34
B  

In [None]:
# query = "Scale features for machine learning"
# result = standardization(test_df, query)

=== SCALING RESULTS (STANDARD) ===
Columns scaled: 3

Scaling details:
  numeric_1: mean 50.50→0.0000, std 29.01→1.0050
  numeric_2: mean 555.00→0.0000, std 290.11→1.0050
  numeric_3: mean 3383.50→-0.0000, std 3024.36→1.0050
Features have been standardized using z-score scaling to prepare for machine learning. This ensures that each feature contributes equally to the model by having a mean of 0 and a standard deviation of 1.
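
The reported standard deviation of 1.0050 rather than exactly 1 is expected, not an error. `StandardScaler` divides by the population standard deviation (`ddof=0`), while pandas' `.std()` defaults to the sample standard deviation (`ddof=1`), so the two differ by a factor of sqrt(n / (n - 1)). The sketch below reproduces this by hand for the 100 values of `numeric_1`, without scikit-learn.

```python
import numpy as np
import pandas as pd

s = pd.Series(range(1, 101), dtype=float)   # same values as numeric_1 in the test data

# Scale with the population standard deviation (ddof=0), as StandardScaler does
scaled = (s - s.mean()) / s.std(ddof=0)

print(round(scaled.std(ddof=0), 4))  # population std of the scaled column
print(round(scaled.std(), 4))        # pandas default (ddof=1), as reported by the helper
print(round(np.sqrt(100 / 99), 4))   # the ratio sqrt(n / (n - 1)) for n = 100
```

For large datasets the factor approaches 1, so the discrepancy only shows up on small samples like this test frame.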


In [None]:
# print("=== TESTING STANDARDIZATION FEATURE ===")
# print("\\n1. Testing standard scaling:")
# result.info()
# print("\\n2. Testing one-hot encoding:")
# result2 = standardization(test_df.copy(), "One-hot encode categorical variables")
# print(f"\\nShape after one-hot encoding: {result2.shape}")
# print("\\n3. Testing rare category handling:")
# result3 = standardization(test_df.copy(), "Handle rare categories in category_with_rare column")
# print("\\n4. Testing normalization:")
# result4 = standardization(test_df.copy(), "Normalize numeric columns with min-max scaling")

=== TESTING STANDARDIZATION FEATURE ===
\n1. Testing standard scaling:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 7 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   numeric_1           100 non-null    float64
 1   numeric_2           100 non-null    float64
 2   numeric_3           100 non-null    float64
 3   category_small      100 non-null    object 
 4   category_medium     100 non-null    object 
 5   category_with_rare  100 non-null    object 
 6   binary_like         100 non-null    object 
dtypes: float64(3), object(4)
memory usage: 5.6+ KB
\n2. Testing one-hot encoding:
=== CATEGORICAL ENCODING RESULTS (ONEHOT) ===
Columns processed: 4
Columns encoded successfully: 4

Encoding details:
  category_small: 3 categories → 3 dummy columns
  category_medium: 10 categories → 10 dummy columns
  category_with_rare: 13 categories → 13 dummy columns
  binary_like: 2 categories 