# Metadata

**L1 Taxonomy** - Background Processes

**L2 Taxonomy** - Asynchronous Processing

**Subtopic** - Async/Await for Handling Asynchronous Operations in Python.

**Use Case** - Develop a Python script that reads and processes multiple CSV files concurrently using async/await. The script should read each file line by line, perform a data transformation, and write the transformed data to a new CSV file.

**Programming Language** - Python

**Target Model** - GPT-4o

# Setup

```requirements.txt
aiofiles
```


# Prompt
I want to develop a Python program that reads two CSV files concurrently and processes them line by line using asynchronous programming. The data in the files may contain extra whitespace, negative float values, strings with mixed casing, and rows of inconsistent length. Each line needs to be cleaned, normalized, and then written to a new output CSV file.

**Input Format**

- Two CSV files.

**Output Format**

- Two new CSV files.


**Examples**

Input:
```
file1.csv:
apple,  34.12345 , -12.5
banana,42, hello

file2.csv:
   x,  y , z
1.2345 , -1.234 ,world

```

Output:

```
file1_output.csv
APPLE,34.123,0
BANANA,42,HELLO
```

```
file2_output.csv
X,Y,Z
1.234,0,WORLD
```



# Requirements

**Explicit and Implicit Points**

- Use asyncio to read both files concurrently.
- Ignore rows with a mismatched number of columns.
- Processed output must follow the same column order and be written as the file is read.
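
The concurrency requirement above can be illustrated with a minimal sketch. It uses only the standard library; `fake_lines` is a hypothetical stand-in for an `aiofiles` handle (not part of the task), and the names `consume` and `run_both` are assumptions made for illustration.

```python
import asyncio
from typing import AsyncIterator, List


async def fake_lines(lines: List[str]) -> AsyncIterator[str]:
    """Hypothetical stand-in for an async file handle: yields one line at a time."""
    for line in lines:
        await asyncio.sleep(0)  # yield control so the other task can make progress
        yield line


async def consume(name: str, lines: List[str], log: List[str]) -> int:
    """Read one source line by line and record each line as it is handled."""
    count = 0
    async for line in fake_lines(lines):
        log.append(f"{name}:{line}")
        count += 1
    return count


async def run_both() -> List[str]:
    """Schedule both readers together; gather() waits until both have finished."""
    log: List[str] = []
    await asyncio.gather(
        consume("file1", ["a", "b"], log),
        consume("file2", ["x", "y"], log),
    )
    return log


if __name__ == "__main__":
    print(asyncio.run(run_both()))
```

Because each reader awaits between lines, the event loop is free to switch between the two sources instead of finishing one before starting the other.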


**Solution Expectations**

- Each output file must be named inputfilename_output.csv (for example, file1.csv produces file1_output.csv).
- The processing of CSV files must be efficient.
- Rows with a differing number of columns must be discarded.
- Normalize:

  - Strings -> Uppercase + strip whitespace

  - Floats -> Round to 3 decimal places, clamp negatives to 0

  - Empty values -> Use "" for strings, 0 for floats.
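
The normalization rules above can be sketched as a single helper. This is an illustrative sketch, not the reference solution: the name `normalize_cell` and the `numeric_column` flag are assumptions, and trimming trailing zeros is inferred from the example output (`42` stays `42`).

```python
def normalize_cell(raw: str, numeric_column: bool = False) -> str:
    """Apply the normalization rules to one cell (hypothetical helper)."""
    text = raw.strip()
    if not text:
        # Empty value: "0" in a numeric column, "" otherwise.
        return "0" if numeric_column else ""
    try:
        number = float(text)
    except ValueError:
        # Not a number: treat as a string and uppercase it.
        return text.upper()
    # Round to 3 decimal places, then clamp negatives to 0.
    number = max(0.0, round(number, 3))
    # Format with 3 decimals and drop trailing zeros and a dangling dot.
    return f"{number:.3f}".rstrip("0").rstrip(".")


if __name__ == "__main__":
    for cell in ["  apple ", " 34.12345 ", "-12.5", "42", ""]:
        print(repr(cell), "->", repr(normalize_cell(cell)))
```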


**Function Signatures**

```python
from typing import Optional

async def process_csv_folder(folder_path: str) -> Optional[int]:
    pass
```

**Edge Case Behavior**

- Entirely empty rows should be skipped.
- Rows with an inconsistent number of columns should be ignored.
- Malformed floats should not crash the transform.
- Negative floats like -0.0001 should become 0.
- If the folder has more than 2 CSV files, return -1.
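
The row-level edge cases above (empty rows and inconsistent column counts) can be sketched as a small filter. The helper name `split_row` is hypothetical, and splitting on plain commas is an assumption that matches the simple, unquoted examples in the prompt.

```python
from typing import List, Optional


def split_row(line: str, expected_columns: int) -> Optional[List[str]]:
    """Return the stripped cells of a row, or None if the row should be skipped."""
    stripped = line.strip()
    if not stripped:
        return None  # entirely empty row
    cells = [cell.strip() for cell in stripped.split(",")]
    if len(cells) != expected_columns:
        return None  # too few or too many columns
    return cells


if __name__ == "__main__":
    for raw in ["1,2,3\n", "4,5\n", "\n", "6,7,8,9\n", " 7 , 8 ,9\n"]:
        print(repr(raw), "->", split_row(raw, 3))
```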


**Constraints**

- Only aiofiles is allowed as an external dependency.
- The solution must not use the threading or multiprocessing libraries; only the standard library and the allowed dependency may be used.
- No more than 2 CSV files are allowed.
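
The file-count constraint and the output naming rule can be sketched together with the standard library. The helper name `plan_outputs` and its mixed return type are assumptions made for illustration; the check follows the "more than 2" wording of this section.

```python
import os
import tempfile
from typing import Dict, Union


def plan_outputs(folder_path: str) -> Union[int, Dict[str, str]]:
    """Map each input CSV path to its output path, or return -1 if there are too many."""
    csv_files = sorted(
        name for name in os.listdir(folder_path) if name.endswith(".csv")
    )
    if len(csv_files) > 2:
        return -1
    return {
        os.path.join(folder_path, name): os.path.join(
            folder_path, f"{os.path.splitext(name)[0]}_output.csv"
        )
        for name in csv_files
    }


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as folder:
        # Two CSV files plus one file that should be ignored.
        for name in ("file1.csv", "file2.csv", "readme.txt"):
            open(os.path.join(folder, name), "w").close()
        print(plan_outputs(folder))
```

Note that output files also end in `.csv`, so a second run over the same folder would count them as inputs unless they are filtered out or written elsewhere.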



In [None]:
# code

"""
Asynchronous CSV processor.

Infers column types and normalizes values.
"""

import asyncio
import aiofiles
import os
import csv
from io import StringIO
from typing import Optional, List


async def infer_column_types(
    file_path: str,
    expected_columns: int
) -> List[str]:
    """Infer data types for each column by examining non-empty values."""
    column_types = ["string"] * expected_columns

    async with aiofiles.open(file_path, mode='r', encoding='utf-8') as infile:
        await infile.readline()
        sample_count = 0
        async for line in infile:
            # Stop once enough rows are sampled instead of scanning the rest of the file.
            if sample_count >= 10:
                break
            line = line.strip()
            if not line:
                continue
            items = line.split(',')
            if len(items) != expected_columns:
                continue
            for i, item in enumerate(items):
                item = item.strip()
                if item:
                    try:
                        float(item)
                        column_types[i] = "float"
                    except ValueError:
                        column_types[i] = "string"
            sample_count += 1

    return column_types


async def normalize_value_typed(
    value: str,
    expected_type: str
) -> str:
    """Normalize a single value according to the rules with type awareness."""
    value = value.strip()
    if not value:
        return "0" if expected_type == "float" else ""
    try:
        num = float(value)
        num = max(0, round(num, 3))
        if num == int(num):
            return str(int(num))
        return f"{num:.3f}".rstrip('0').rstrip('.')
    except (ValueError, OverflowError):
        # Non-numeric text, or "inf", which float() accepts but int() rejects.
        return value.upper()


async def process_line_typed(
    line: str,
    expected_columns: int,
    column_types: List[str]
) -> Optional[List[str]]:
    """Process a single line according to the normalization rules, with type awareness."""
    items = line.split(',')
    if len(items) != expected_columns:
        return None

    processed_items = []
    for i, item in enumerate(items):
        expected_type = column_types[i] if i < len(column_types) else "string"
        normalized = await normalize_value_typed(item, expected_type)
        processed_items.append(normalized)

    return processed_items


async def process_file(
    input_path: str,
    output_path: str
) -> None:
    """Process a single CSV file asynchronously."""
    try:
        async with aiofiles.open(
            input_path,
            mode='r',
            encoding='utf-8'
        ) as infile:
            first_line = await infile.readline()
            if not first_line.strip():
                return

            header_items = first_line.strip().split(',')
            expected_columns = len(header_items)
            column_types = await infer_column_types(
                input_path,
                expected_columns
            )

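            # Normalize the first line like any other row, so numeric cells
            # in it are rounded and clamped as in the prompt example.
            processed_header = await process_line_typed(
                first_line.strip(),
                expected_columns,
                column_types
            )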

            async with aiofiles.open(
                output_path,
                mode='w',
                encoding='utf-8',
                newline=''
            ) as outfile:
                output_buffer = StringIO()
                writer = csv.writer(output_buffer)
                writer.writerow(processed_header)
                await outfile.write(output_buffer.getvalue())

                async for line in infile:
                    line = line.strip()
                    if not line:
                        continue

                    processed_row = await process_line_typed(
                        line,
                        expected_columns,
                        column_types
                    )
                    if processed_row:
                        output_buffer = StringIO()
                        writer = csv.writer(output_buffer)
                        writer.writerow(processed_row)
                        await outfile.write(output_buffer.getvalue())

    except Exception as e:
        print(f"Error processing {input_path}: {e}")


async def process_csv_folder(folder_path: str) -> Optional[int]:
    """Process all CSV files in a folder concurrently."""
    try:
        files = [
            f for f in os.listdir(folder_path)
            if f.endswith('.csv')
        ]

        if len(files) != 2:
            return -1

        tasks = []
        for file in files:
            input_path = os.path.join(folder_path, file)
            output_path = os.path.join(
                folder_path,
                f"{os.path.splitext(file)[0]}_output.csv"
            )
            tasks.append(process_file(input_path, output_path))

        await asyncio.gather(*tasks)
        return None

    except OSError:
        return None


In [None]:
# tests

import unittest
import tempfile
import os
import asyncio


try:
    import nest_asyncio
    nest_asyncio.apply()
except ImportError:
    pass

from main import process_csv_folder

class TestProcessCSVFolder(unittest.TestCase):
    def setUp(self):
        self.tempdir = tempfile.TemporaryDirectory()
        self.orig_cwd = os.getcwd()
        os.chdir(self.tempdir.name)

    def tearDown(self):
        os.chdir(self.orig_cwd)
        self.tempdir.cleanup()

    def write_file(self, name: str, content: str):
        with open(name, 'w') as f:
            f.write(content)

    def read_output(self, name: str):
        with open(name, 'r') as f:
            return [line.strip() for line in f if line.strip()]

    def test_more_than_two_csv_returns_minus1(self):
        # Three CSVs should return -1 and produce no outputs
        for name in ['a.csv', 'b.csv', 'c.csv']:
            self.write_file(name, 'x,y\n1,2\n')
        result = asyncio.run(process_csv_folder(self.tempdir.name))
        self.assertEqual(result, -1)
        self.assertFalse(os.path.exists('a_output.csv'))
        self.assertFalse(os.path.exists('b_output.csv'))
        self.assertFalse(os.path.exists('c_output.csv'))

    def test_two_csv_file1_output(self):
        # Verify first file processing from prompt example
        file1 = 'apple,  34.12345 , -12.5\nbanana,42, hello\n'
        file2 = '   x,  y , z\n1.2345 , -1.234 ,world\n'
        self.write_file('file1.csv', file1)
        self.write_file('file2.csv', file2)
        asyncio.run(process_csv_folder(self.tempdir.name))
        out1 = self.read_output('file1_output.csv')
        self.assertEqual(out1, ['APPLE,34.123,0', 'BANANA,42,HELLO'])

    def test_two_csv_file2_output(self):
        # Verify second file processing from prompt example
        file1 = 'apple,  34.12345 , -12.5\nbanana,42, hello\n'
        file2 = '   x,  y , z\n1.2345 , -1.234 ,world\n'
        self.write_file('file1.csv', file1)
        self.write_file('file2.csv', file2)
        asyncio.run(process_csv_folder(self.tempdir.name))
        out2 = self.read_output('file2_output.csv')
        self.assertEqual(out2, ['X,Y,Z', '1.234,0,WORLD'])

    def test_ignore_inconsistent_rows(self):
        # Rows with mismatched columns should be ignored, no blank rows
        content = (
            'c1,c2,c3\n'
            '1,2,3\n'
            '4,5\n'      # too few
            '6,7,8,9\n'  # too many
            '7,8,9\n'    # valid
        )
        self.write_file('test.csv', content)
        self.write_file('other.csv', content)
        asyncio.run(process_csv_folder(self.tempdir.name))
        lines = self.read_output('test_output.csv')
        self.assertEqual(lines, ['C1,C2,C3', '1,2,3', '7,8,9'])

    def test_ignore_blank_rows(self):
        # Blank lines should be skipped, no inconsistent rows
        content = (
            'h1,h2\n'
            '\n'         # blank
            '1,2\n'      # valid
            '\n'         # blank
            '3,4\n'      # valid
        )
        self.write_file('a.csv', content)
        self.write_file('b.csv', content)
        asyncio.run(process_csv_folder(self.tempdir.name))
        lines = self.read_output('a_output.csv')
        self.assertEqual(lines, ['H1,H2', '1,2', '3,4'])

    def test_malformed_floats_become_zero(self):
        # Malformed floats result in 0
        content = 'v1,v2\nX,notfloat\n'
        self.write_file('vals.csv', content)
        self.write_file('other.csv', content)
        asyncio.run(process_csv_folder(self.tempdir.name))
        lines = self.read_output('vals_output.csv')
        self.assertIn('X,0', lines)

    def test_negative_floats_clamped_zero(self):
        # Negative floats should be clamped to 0
        content = 'v1,v2\nY,-0.0001\n'
        self.write_file('n.csv', content)
        self.write_file('m.csv', content)
        asyncio.run(process_csv_folder(self.tempdir.name))
        lines = self.read_output('n_output.csv')
        self.assertIn('Y,0', lines)

    def test_valid_floats_rounded_three_decimals(self):
        # Valid floats rounded to 3 decimals
        content = 'v1,v2\nZ,2.71828\n'
        self.write_file('p.csv', content)
        self.write_file('q.csv', content)
        asyncio.run(process_csv_folder(self.tempdir.name))
        lines = self.read_output('p_output.csv')
        self.assertIn('Z,2.718', lines)

    def test_ignore_non_csv_with_two_csv(self):
        # Two CSVs and one non-CSV file: only CSVs processed
        self.write_file('file1.csv', 'a,b\n1,2\n')
        self.write_file('file2.csv', 'x,y\n3,4\n')
        self.write_file('readme.txt', 'ignore me')
        asyncio.run(process_csv_folder(self.tempdir.name))
        self.assertTrue(os.path.exists('file1_output.csv'))
        self.assertTrue(os.path.exists('file2_output.csv'))

    def test_non_csv_not_generated(self):
        # Ensure non-CSV file does not get an output
        self.write_file('a.csv', 'x,y\n1,2\n')
        self.write_file('b.csv', 'x,y\n3,4\n')
        self.write_file('c.md', 'ignore')
        asyncio.run(process_csv_folder(self.tempdir.name))
        self.assertFalse(os.path.exists('c_output.md'))

if __name__ == '__main__':
    unittest.main()




# Model Breaking Proof

#### Model Breaking Task URL: <https://rlhf-v3.turing.com/prompt/664dc169-aa47-4773-8127-74c3d374a899?origin=https%3A%2F%2Flabeling-ms.turing.com&redirect_url=https%3A%2F%2Flabeling-ms.turing.com%2Fconversations%2F74376%2Fview>

#### Model code:

```python
import asyncio
import aiofiles
import os
import csv

async def process_line(line, num_columns):
    # Split line by ','
    items = [item.strip() for item in line.split(',')]
    # If row length is inconsistent, return None
    if len(items) != num_columns:
        return None
        
    processed_items = []
    for item in items:
        # Normalize strings
        if item.isalpha() or not item:

          # <Issue>: <Empty values always become "", even in numeric columns, where they should become 0>

            processed_items.append(item.upper())
        # Normalize numbers
        else:
            try:
                num = float(item)
                num = max(0, round(num, 3))  # Round and clamp numbers
                processed_items.append(str(num))
            except ValueError:  # Handle malformed floats
                processed_items.append("")
    
    return processed_items

async def process_file(input_path, output_path):
    async with aiofiles.open(input_path, mode='r') as infile, aiofiles.open(output_path, mode='w', newline='') as outfile:
      # <Issue>: <readlines() loads the entire file into memory at once instead of reading it line by line>
        reader = csv.reader(await infile.readlines())
      # <Issue>: <Outfile is an async file object (aiofiles), which can't be passed to csv.writer>
        writer = csv.writer(outfile)
        
        headers = next(reader, None)
        if headers is None:
            return
        
        num_columns = len(headers)
        
        processed_header = await process_line(','.join(headers), num_columns)
        # <Issue>: <Writer is sync, cannot be awaited>
        if processed_header:
            await writer.writerow(processed_header)
        
        # <Issue>: <enumerate(reader) is a regular synchronous iterator, so async for raises a TypeError here>
        async for idx, row in enumerate(reader):
            processed_row = await process_line(','.join(row), num_columns)
            if processed_row:
                await writer.writerow(processed_row)

async def process_csv_folder(folder_path: str) -> int:
    # List CSV files in folder
    files = [f for f in os.listdir(folder_path) if f.endswith('.csv')]
    
    # Ensure exactly two CSV files
    if len(files) != 2:
        return -1
    
    tasks = []
    for file in files:
        input_path = os.path.join(folder_path, file)
        output_path = os.path.join(folder_path, f"{os.path.splitext(file)[0]}_output.csv")
        tasks.append(process_file(input_path, output_path))
    
    await asyncio.gather(*tasks)

# Example usage:
# asyncio.run(process_csv_folder('/path/to/csv/folder'))
```
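
One way to sidestep the writer-related issues flagged above is to format each row in memory with `csv.writer` and then await a plain `write` on the async file. The sketch below is a standard-library illustration under stated assumptions: `InMemorySink` is a hypothetical stand-in for an `aiofiles` handle opened for writing, and `format_csv_row` and `write_rows` are illustrative names, not part of the task.

```python
import asyncio
import csv
from io import StringIO
from typing import Iterable, List


def format_csv_row(row: Iterable[str]) -> str:
    """Serialize one row to CSV text, including quoting where needed."""
    buffer = StringIO()
    csv.writer(buffer, lineterminator="\n").writerow(row)
    return buffer.getvalue()


class InMemorySink:
    """Hypothetical stand-in for an async file object with an awaitable write()."""

    def __init__(self) -> None:
        self.chunks: List[str] = []

    async def write(self, text: str) -> None:
        self.chunks.append(text)


async def write_rows(outfile: InMemorySink, rows: Iterable[List[str]]) -> None:
    """Write rows one at a time, awaiting each write explicitly."""
    for row in rows:
        await outfile.write(format_csv_row(row))


if __name__ == "__main__":
    sink = InMemorySink()
    asyncio.run(write_rows(sink, [["X", "Y", "Z"], ["1.234", "0", "WORLD"]]))
    print("".join(sink.chunks), end="")
```

Keeping the CSV formatting synchronous and the I/O awaited separates the two concerns, so the `csv` module never needs to know that the destination is asynchronous.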