# Metadata

**L1 Taxonomy** - Data Analysis and Engineering

**L2 Taxonomy** - Pandas for DataFrames

**Subtopic** - Handling missing data by imputation or dropping in DataFrames

**Use Case** - Create a Python module that uses Pandas to read a CSV file into a DataFrame. Identify columns with missing data and apply imputation techniques to fill the missing values. The imputation method could be mean, median, mode, or a constant value. In addition, provide an option to drop the rows or columns with missing data based on a threshold of missing values set by the user.

**Programming Language** - Python

**Target Model** - GPT-4o

# Setup

```requirements.txt
pandas
```


# Prompt
I need to create a Python module that uses Pandas to read a CSV file into a DataFrame. My module must first identify all columns that contain missing data. Then, apply different imputation techniques per column. Log every imputation and drop step in detail.

**Input Format**

- A string file path to a CSV file.
- A dictionary called imputation_rules where keys are column names and values are imputation strategies ("mean", "median", "mode").
- A float threshold representing the maximum allowable fraction of missing values in any column or row.

**Output Format**

- A cleaned pandas DataFrame.
- A list of strings as a log detailing each imputation or drop action in order.
- DataFrame saved to CSV with name cleaned_<original_filename>.csv

**Examples**

```
Input:
filepath = "data.csv"
imputation_rules = {
    "age": "mean",
    "income": "median",
    "gender": "mode"
}
threshold = 0.25

Output:
(DataFrame object with missing values handled)
[
    "Column age: filled 10 missing values with mean=36.4",
    "Column income: filled 5 missing values with median=70500.0",
    "Column gender: filled 3 missing values with mode='F'",
    "Dropped column 'device_id' due to >90% uniqueness",
    "Dropped row 42 due to missing data > threshold=0.25",
    "Saved cleaned DataFrame to 'cleaned_data.csv'"
]

```
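The last log entry in the example implies that the output name is the original file name with a `cleaned_` prefix. A minimal sketch of that mapping follows; the helper name `cleaned_filename` is hypothetical, and writing to the current working directory is an assumption rather than something the prompt states.

```python
import os


def cleaned_filename(filepath: str) -> str:
    """Return the output name for a cleaned copy of `filepath` (hypothetical helper)."""
    # Keep only the file name, so any directory part of the input path is discarded.
    base_filename = os.path.basename(filepath)
    return f"cleaned_{base_filename}"


print(cleaned_filename("data.csv"))          # cleaned_data.csv
print(cleaned_filename("exports/data.csv"))  # cleaned_data.csv
```

Because the result has a different name from the input, the original CSV is never overwritten.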





# Requirements

**Explicit and Implicit Points**

- Use pandas to read and process CSV data.
- Apply imputation on columns with missing data.
- Support only the "mean", "median", and "mode" strategies.
- Drop rows or columns after imputation if their NaN ratio exceeds threshold.
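The threshold-based drop in the last point can be sketched as below. This is an illustration only: `drop_sparse` is a hypothetical helper, and checking columns before rows (with row ratios computed on the surviving columns) is an assumed ordering.

```python
import pandas as pd


def drop_sparse(df: pd.DataFrame, threshold: float) -> pd.DataFrame:
    """Drop columns, then rows, whose NaN fraction exceeds `threshold` (hypothetical helper)."""
    # Fraction of missing values per column.
    too_sparse_cols = df.isnull().mean() > threshold
    df = df.loc[:, ~too_sparse_cols]
    # Fraction of missing values per row, computed on the remaining columns.
    too_sparse_rows = df.isnull().mean(axis=1) > threshold
    return df.loc[~too_sparse_rows]


sample = pd.DataFrame({
    "a": [1.0, None, None, None],
    "b": [1.0, 2.0, None, 4.0],
    "c": [1.0, 2.0, None, 4.0],
})
result = drop_sparse(sample, threshold=0.5)
print(list(result.columns))  # ['b', 'c']
print(list(result.index))    # [0, 1, 3]
```

Column `a` is 75% missing and is removed first; row 2 is then fully missing across the remaining columns and is removed as well.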


**Solution Expectations**

- Ensure type-safe imputation: no mean/median on strings, no mode on high-cardinality columns.
- All high-cardinality columns should be dropped.
- After all operations, save the final data to a CSV file named cleaned_<original_filename>.csv
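One way to enforce the "no mean/median on strings" expectation is a dtype check before imputing. The sketch below uses `pd.api.types.is_numeric_dtype`; the helper name `numeric_strategy_allowed` is hypothetical, and the cardinality check for "mode" is deliberately left out of this example.

```python
import pandas as pd


def numeric_strategy_allowed(series: pd.Series, strategy: str) -> bool:
    """Return False when a numeric-only strategy targets non-numeric data (hypothetical helper)."""
    if strategy in ("mean", "median"):
        # Mean and median are only meaningful for numeric columns.
        return pd.api.types.is_numeric_dtype(series)
    # Other strategies are not restricted by dtype in this sketch.
    return True


ages = pd.Series([30.0, None, 41.0])
names = pd.Series(["ann", "bob", None, "ann"])
print(numeric_strategy_allowed(ages, "mean"))     # True
print(numeric_strategy_allowed(names, "median"))  # False
print(numeric_strategy_allowed(names, "mode"))    # True
```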



**Function Signatures**

```python
def clean_and_impute(filepath: str, imputation_rules: dict, threshold: float) -> tuple[pd.DataFrame, list[str]]:
    pass
```

**Edge Case Behavior**

- If the file is empty or contains only headers, return an empty DataFrame and an empty log.
- If a column is entirely NaN, drop it and log it.
- If a rule points to a non-existent column, skip it and log a warning.
- If no missing values are found, return and log that nothing was changed.
- If the threshold is out of bounds, return and log that the threshold was invalid.
- If a high-cardinality column is dropped, log it.
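For the first edge case, note that pandas handles the two inputs differently: a header-only file parses into an empty DataFrame, while a file with no content at all makes `pd.read_csv` raise `pd.errors.EmptyDataError`. The sketch below shows one possible way to treat both as "empty"; `read_or_empty` is a hypothetical helper, and `io.StringIO` stands in for real files.

```python
import io

import pandas as pd


def read_or_empty(source) -> pd.DataFrame:
    """Read a CSV, returning an empty DataFrame for input with no content (hypothetical helper)."""
    try:
        return pd.read_csv(source)
    except pd.errors.EmptyDataError:
        # Raised when there is nothing to parse, not even a header row.
        return pd.DataFrame()


header_only = read_or_empty(io.StringIO("a,b\n"))
no_content = read_or_empty(io.StringIO(""))
print(header_only.empty, list(header_only.columns))  # True ['a', 'b']
print(no_content.empty, list(no_content.columns))    # True []
```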

**Constraints**

- Use only pandas as external library.
- A column is considered high-cardinality if more than 90% of its values are unique.
- Threshold is a float between 0 and 1 inclusive.
- Do not modify the original CSV file.
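The high-cardinality rule can be expressed as a ratio of distinct values to rows. In the sketch below, `is_high_cardinality` and `HIGH_CARDINALITY_RATIO` are hypothetical names, and dividing by the total row count (missing values included) is an assumption about how "90% of values" should be read.

```python
import pandas as pd

HIGH_CARDINALITY_RATIO = 0.9  # cutoff taken from the constraint above


def is_high_cardinality(series: pd.Series) -> bool:
    """Return True if more than 90% of the rows hold distinct values (hypothetical helper)."""
    if len(series) == 0:
        return False
    # nunique() ignores NaN by default, so missing values do not count as a category.
    return series.nunique() / len(series) > HIGH_CARDINALITY_RATIO


device_id = pd.Series(["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10"])
gender = pd.Series(["F", "M", "F", None, "F", "M", "M", "F", "F", "M"])
print(is_high_cardinality(device_id))  # True
print(is_high_cardinality(gender))     # False
```

With this reading, very small frames are flagged easily: a three-row column with three distinct values has a ratio of 1.0 and would be dropped.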


In [None]:
# code

"""Module for cleaning and imputing missing data in a CSV file."""

import os
import pandas as pd


def clean_and_impute(
    filepath: str,
    imputation_rules: dict,
    threshold: float
) -> tuple[pd.DataFrame, list[str]]:
    """
    Clean and impute missing values in the given CSV file.

    Args:
        filepath (str): Path to the CSV file.
        imputation_rules (dict): Dictionary mapping column names to
            imputation strategies ("mean", "median", or "mode").
        threshold (float): Max allowed missing ratio for rows/columns.

    Returns:
        tuple[pd.DataFrame, list[str]]: Cleaned DataFrame and a log list.
    """
    log = []

    if threshold < 0 or threshold > 1:
        log.append(
            "Invalid threshold value: must be between 0 and 1 inclusive."
        )
        return pd.DataFrame(), log

    try:
        df = pd.read_csv(filepath)
    except Exception as e:
        log.append(f"Failed to read CSV: {e}")
        return pd.DataFrame(), log

    if df.empty:
        log.append("CSV is empty or contains only headers.")
        return df, log

    changes_made = False

    for col, strategy in imputation_rules.items():
        if col not in df.columns:
            log.append(
                f"Skipped non-existent column '{col}' for imputation"
            )
            continue

        if df[col].isnull().sum() == 0:
            continue

        if strategy not in ["mean", "median", "mode"]:
            log.append(
                f"Invalid imputation strategy '{strategy}' for column '{col}'"
            )
            continue

        if df[col].isnull().all():
            continue

        if (
            not pd.api.types.is_numeric_dtype(df[col])
            and strategy in ["mean", "median"]
        ):
            log.append(
                f"Cannot apply '{strategy}' imputation on "
                f"non-numeric column '{col}'"
            )
            continue

        filled_count = df[col].isnull().sum()

        if strategy == "mean":
            mean_value = df[col].mean()
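            # Assign the result back; in-place fillna on a column selection
            # is deprecated and may not update the DataFrame.
            df[col] = df[col].fillna(mean_value)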
            log.append(
                f"Column {col}: filled {filled_count} missing values "
                f"with mean={mean_value}"
            )
            changes_made = True

        elif strategy == "median":
            median_value = df[col].median()
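            # Assign the result back; in-place fillna on a column selection
            # is deprecated and may not update the DataFrame.
            df[col] = df[col].fillna(median_value)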
            log.append(
                f"Column {col}: filled {filled_count} missing values "
                f"with median={median_value}"
            )
            changes_made = True

        elif strategy == "mode":
            if df[col].nunique() / len(df) > 0.9:
                log.append(
                    f"Cannot apply mode imputation on high-cardinality "
                    f"column '{col}'"
                )
                continue

            if not df[col].mode().empty:
                mode_value = df[col].mode().iloc[0]
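                # Assign the result back; in-place fillna on a column
                # selection is deprecated and may not update the DataFrame.
                df[col] = df[col].fillna(mode_value)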
                log.append(
                    f"Column {col}: filled {filled_count} missing values "
                    f"with mode='{mode_value}'"
                )
                changes_made = True

    cols_to_drop = []
    for col in df.columns:
        if df[col].nunique() / len(df) > 0.9:
            cols_to_drop.append(col)
            log.append(f"Dropped column '{col}' due to >90% uniqueness")
            changes_made = True

    df.drop(columns=cols_to_drop, inplace=True)

    entirely_nan_cols = df.columns[df.isnull().all()]
    for col in entirely_nan_cols:
        df.drop(columns=[col], inplace=True)
        log.append(f"Dropped column '{col}' due to being entirely NaN")
        changes_made = True

    for col in df.columns[df.isnull().mean() > threshold]:
        df.drop(columns=[col], inplace=True)
        log.append(
            f"Dropped column '{col}' due to missing data > threshold="
            f"{threshold}"
        )
        changes_made = True

    rows_to_drop = df.index[df.isnull().mean(axis=1) > threshold]
    for index in rows_to_drop:
        df.drop(index=index, inplace=True)
        log.append(
            f"Dropped row {index} due to missing data > threshold="
            f"{threshold}"
        )
        changes_made = True

    if not changes_made and not any(df.isnull().sum() > 0):
        log.append("No missing values found, nothing was changed")

    base_filename = os.path.basename(filepath)
    new_filename = f"cleaned_{base_filename}"
    df.to_csv(new_filename, index=False)
    log.append(f"Saved cleaned DataFrame to '{new_filename}'")

    return df, log


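if __name__ == "__main__":
    # Example run; guarded so that importing this module has no side effects.
    imputation_rules = {
        "age": "mean",
        "income": "median",
        "gender": "mode",
        "city": "mode",
        "empty_col": "mode"
    }

    clean_and_impute('test.csv', imputation_rules, 0.3)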


In [None]:
# tests

"""
Unit tests for the clean_and_impute function.
"""
import os
import tempfile
import unittest

import pandas as pd

from main import clean_and_impute


class TestCleanAndImpute(unittest.TestCase):
    """
    Test suite for the clean_and_impute function.
    """

    def setUp(self) -> None:
        """
        Create a temporary directory and switch to it before each test.
        """
        self.tempdir = tempfile.TemporaryDirectory()
        self.original_cwd = os.getcwd()
        os.chdir(self.tempdir.name)

    def tearDown(self) -> None:
        """
        Restore the original directory, clean temporary dir after each test.
        """
        os.chdir(self.original_cwd)
        self.tempdir.cleanup()

    def write_csv(self, filename: str, df: pd.DataFrame) -> None:
        """
        Write the given DataFrame to a CSV file without an index.

        Args:
            filename: The name of the CSV file to write.
            df: The DataFrame to save.
        """
        df.to_csv(filename, index=False)

    def test_threshold_out_of_bounds(self) -> None:
        """
        Threshold outside [0, 1] must return an empty DataFrame & log an error.
        """
        df, log = clean_and_impute('dummy.csv', {}, -0.1)
        self.assertTrue(df.empty)
        self.assertIn('Invalid threshold value', log[0])

        df, log = clean_and_impute('dummy.csv', {}, 1.5)
        self.assertTrue(df.empty)
        self.assertIn('Invalid threshold value', log[0])

    def test_file_not_found(self) -> None:
        """
        A missing file should return an empty DataFrame and log a read failure.
        """
        df, log = clean_and_impute('nofile.csv', {'col': 'mean'}, 0.5)
        self.assertTrue(df.empty)
        self.assertIn('Failed to read CSV', log[0])

    def test_empty_csv(self) -> None:
        """
        CSV with only headers should return empty DataFrame & log the condition.
        """
        df_input = pd.DataFrame(columns=['a', 'b'])
        self.write_csv('empty.csv', df_input)
        df, log = clean_and_impute('empty.csv', {}, 0.5)
        self.assertTrue(df.empty)
        self.assertIn('CSV is empty or contains only headers', log[0])

    def test_skip_nonexistent_column_rule(self) -> None:
        """
        Rules that reference columns absent from the CSV must be skipped with
        a warning log.
        """
        df_input = pd.DataFrame({'x': [1, None, 3]})
        self.write_csv('data.csv', df_input)
        df, log = clean_and_impute('data.csv', {'y': 'mean'}, 0.5)
        self.assertIn("Skipped non-existent column 'y'", log[0])

    def test_mean_and_median_imputation(self) -> None:
        """
        Numeric columns should be correctly imputed using mean and median
        strategies.
        """
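        # Repeated values keep both columns at or below 90% uniqueness, so
        # they are not dropped as high-cardinality after imputation.
        df_input = pd.DataFrame({
            'a': [1, None, 3, 1, 3],
            'b': [None, 2, 4, 2, 2]
        })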
        self.write_csv('in.csv', df_input)
        df, log = clean_and_impute('in.csv', {'a': 'mean', 'b': 'median'}, 1.0)
        self.assertAlmostEqual(df.loc[1, 'a'], 2)
        self.assertAlmostEqual(df.loc[0, 'b'], 2)
        self.assertIn('Column a: filled 1 missing values with mean=', log[0])
        self.assertIn('Column b: filled 1 missing values with median=', log[1])

    def test_mode_imputation_and_high_cardinality(self) -> None:
        """
        Mode imputation should fill missing values, and high-cardinality
        columns should be dropped.
        """
        df_input = pd.DataFrame({
            'c': ['x', None, 'x', 'y', 'z',
                  'w', 'v', 'u', 't', 's'],
            # Every 'id' value is unique, so this column exceeds the 90% cutoff.
            'id': list(range(10))
        })
        self.write_csv('mode.csv', df_input)
        df, log = clean_and_impute('mode.csv', {'c': 'mode'}, 1.0)
        self.assertEqual(df.loc[1, 'c'], 'x')
        self.assertTrue(any('Dropped column' in entry for entry in log))

    def test_drop_entirely_nan_column(self) -> None:
        """
        Columns that are entirely NaN should be dropped and logged.
        """
        df_input = pd.DataFrame({
            'd': [None, None, None],
            'e': [1, 2, 3]
        })
        self.write_csv('nan.csv', df_input)
        df, log = clean_and_impute('nan.csv', {'d': 'mode', 'e': 'mean'}, 1.0)
        self.assertNotIn('d', df.columns)
        self.assertIn("Dropped column 'd' due to being entirely NaN", log)

    def test_drop_columns_over_threshold(self) -> None:
        """
        Columns with missing-value fraction above the threshold should be dropped.
        """
        df_input = pd.DataFrame({
            'f': [1, None, None, 4],
            'g': [1, 2, 3, 4]
        })
        self.write_csv('thr.csv', df_input)
        # No rule for 'f': imputing it first would leave no NaN to trigger
        # the threshold-based drop.
        df, log = clean_and_impute('thr.csv', {'g': 'mean'}, 0.25)
        self.assertNotIn('f', df.columns)
        self.assertIn(
            "Dropped column 'f' due to missing data > threshold=0.25",
            log
        )

    def test_drop_rows_over_threshold(self) -> None:
        """
        Rows with missing-value fraction above the threshold should be dropped.
        """
        df_input = pd.DataFrame({
            'h': [1, None, 1],
            'i': ['a', None, 'a']
        })
        self.write_csv('rows.csv', df_input)
        # No rules are passed, so row 1 is still fully missing at the drop step.
        df, log = clean_and_impute('rows.csv', {}, 0.5)
        self.assertNotIn(1, df.index)
        self.assertIn(
            'Dropped row 1 due to missing data > threshold=0.5',
            log
        )

    def test_no_missing_values(self) -> None:
        """
        When no missing values exist, log that nothing was changed.
        """
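        # Repeated values keep both columns below the high-cardinality cutoff,
        # so nothing is dropped and the "nothing was changed" entry is logged.
        df_input = pd.DataFrame({
            'j': [1, 1, 2],
            'k': ['a', 'a', 'b']
        })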
        self.write_csv('nomiss.csv', df_input)
        df, log = clean_and_impute('nomiss.csv', {}, 0.5)
        self.assertIn(
            'No missing values found, nothing was changed',
            log
        )

    def test_output_file_created(self) -> None:
        """
        Ensure the cleaned CSV file is saved and the final log entry is correct.
        """
        df_input = pd.DataFrame({
            'l': [1, None],
            'm': ['x', None]
        })
        self.write_csv('out.csv', df_input)
        _, log = clean_and_impute('out.csv', {'l': 'mean', 'm': 'mode'}, 1.0)
        self.assertTrue(os.path.exists('cleaned_out.csv'))
        self.assertIn(
            "Saved cleaned DataFrame to 'cleaned_out.csv'",
            log[-1]
        )


if __name__ == '__main__':
    unittest.main(argv=[''])


# Model Breaking Proof

#### Model Breaking Task URL: <https://rlhf-v3.turing.com/prompt/a930830d-b9cf-45c1-be4d-c24736d990f5?origin=https%3A%2F%2Flabeling-ms.turing.com&redirect_url=https%3A%2F%2Flabeling-ms.turing.com%2Fconversations%2F74355%2Fview>

#### Model code:

```python
import pandas as pd
# <Issue>: <Using numpy, which is not allowed>
import numpy as np
import os

def clean_and_impute(filepath: str, imputation_rules: dict, threshold: float) -> tuple[pd.DataFrame, list[str]]:
    log = []
    
    # Validate threshold
    if threshold < 0 or threshold > 1:
        log.append("Invalid threshold value: must be between 0 and 1 inclusive.")
        return pd.DataFrame(), log
    
    # Read CSV into DataFrame
    try:
        df = pd.read_csv(filepath)
    except Exception as e:
        log.append(f"Failed to read CSV: {e}")
        return pd.DataFrame(), log
    # <Issue>: <Edge case not handled for "If no missing values are found, return and log that nothing was changed">

    # Edge case: Empty file or only headers
    if df.empty:
        log.append("CSV is empty or contains only headers.")
        return df, log
    
    # Identify high-cardinality columns to drop
    cols_to_drop = []
    for col in df.columns:
        if df[col].nunique() / len(df) > 0.9:
            cols_to_drop.append(col)
            log.append(f"Dropped column '{col}' due to >90% uniqueness")
    
    # Drop high-cardinality columns
    df.drop(columns=cols_to_drop, inplace=True)

    # Handle imputation
    for col, strategy in imputation_rules.items():
        if col not in df.columns:
            log.append(f"Skipped non-existent column '{col}' for imputation")
            continue

            # <Issue>: <Missing check and handling for fully NaN columns>

        if df[col].isnull().sum() == 0:
            continue  # No missing data in this column

        if strategy not in ["mean", "median", "mode"]:
            log.append(f"Invalid imputation strategy '{strategy}' for column '{col}'")
            continue

        if df[col].dtype not in [np.float64, np.int64] and strategy in ["mean", "median"]:
            log.append(f"Cannot apply '{strategy}' imputation on non-numeric column '{col}'")
            continue
        
        # <Issue>: <filled_count is calculated after fillna(),which is not correct, it should be before to log correctly>
        
        # Impute based on strategy
        if strategy == "mean":
            mean_value = df[col].mean()
            filled_count = df[col].isnull().sum()
            df[col].fillna(mean_value, inplace=True)
            log.append(f"Column {col}: filled {filled_count} missing values with mean={mean_value}")
        elif strategy == "median":
            median_value = df[col].median()
            filled_count = df[col].isnull().sum()
            df[col].fillna(median_value, inplace=True)
            log.append(f"Column {col}: filled {filled_count} missing values with median={median_value}")
        elif strategy == "mode":
            mode_value = df[col].mode().iloc[0]
            filled_count = df[col].isnull().sum()
            df[col].fillna(mode_value, inplace=True)
            log.append(f"Column {col}: filled {filled_count} missing values with mode='{mode_value}'")

    # <Issue>: <Missing log if no changes were made>

    # Drop any columns or rows with remaining high NaN ratios
    for col in df.columns[df.isnull().mean() > threshold]:
        df.drop(columns=[col], inplace=True)
        log.append(f"Dropped column '{col}' due to missing data > threshold={threshold}")

    for index in df.index[df.isnull().mean(axis=1) > threshold]:
        df.drop(index=index, inplace=True)
        log.append(f"Dropped row {index} due to missing data > threshold={threshold}")

    # Save cleaned DataFrame to a new CSV
    base_filename = os.path.basename(filepath)
    new_filename = f"cleaned_{base_filename}"
    df.to_csv(new_filename, index=False)
    log.append(f"Saved cleaned DataFrame to '{new_filename}'")

    return df, log

```
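The issue flagged for fully NaN columns matters because `Series.mode()` drops NaN by default and therefore returns an empty Series for such a column, so an unguarded `.iloc[0]` raises `IndexError`. The sketch below reproduces that behaviour on a standalone Series; the guarded lookup at the end is one possible fix, not necessarily the one the reference solution uses.

```python
import pandas as pd

all_missing = pd.Series([None, None, None], dtype="float64")

# mode() drops NaN by default, so there is no value left to report.
modes = all_missing.mode()
print(modes.empty)  # True

raised = False
try:
    modes.iloc[0]
except IndexError:
    # Positional access on an empty Series is out of bounds.
    raised = True
print(raised)  # True

# A guarded lookup avoids the crash; None is a stand-in for "skip this column".
mode_value = modes.iloc[0] if not modes.empty else None
print(mode_value)  # None
```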