# Section 2 — Importing and Exporting Data

**Goal:** Learn how to import and export real-world data in Pandas efficiently. We'll explore common file types like CSV, Excel, JSON, and SQL. We'll also learn how to handle missing values and encoding issues during import, and how to serialize clean data back for downstream use.

We continue using the *retail sales dataset* from Section 1, but this time we simulate disk I/O with in-memory buffers so the code stays self-contained and reproducible.

## Subtopics
1. Reading and writing CSV, Excel, and JSON files.
2. Managing file encoding, delimiters, and missing data on import.
3. Inspecting imported data quickly (`info`, `head`, `nunique`).
4. Exporting clean data with appropriate compression and format.
5. Handling large files with partial loading (via `chunksize`).

In [None]:
import pandas as pd
import numpy as np
from io import StringIO

# Create mock CSV data as a string (simulates file I/O for notebook use).
# Inside a triple-quoted string, real line breaks separate the rows,
# so no explicit \n or extra quote characters are needed.
csv_data = StringIO('''order_id,customer_id,product,quantity,price,order_date
1001,201,T-shirt,2,15.50,2023-01-15
1002,202,Mug,1,9.99,2023-01-22
1003,203,Notebook,3,,2023-01-29
1004,202,Cap,1,12.00,2023-02-05
1005,201,T-shirt,4,16.50,2023-02-12''')

### Reading data from CSV

`pd.read_csv()` is the most common import function. It can infer dtypes, parse dates, and handle missing values automatically. We'll use `na_values` and `parse_dates` to illustrate robust reading.

In [None]:
# Read the CSV into a DataFrame
df = pd.read_csv(csv_data, na_values=['', 'NA', 'None'], parse_dates=['order_date'])
df.info()
df.head()
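Beyond `info()` and `head()`, a quick count of missing and distinct values per column often reveals import problems early. The following is a minimal sketch that rebuilds the same mock CSV so it runs on its own; the names `orders`, `missing_per_column`, and `distinct_per_column` are illustrative and not part of the original notebook.

```python
from io import StringIO

import pandas as pd

# Same mock data as above, rebuilt here so the cell is self-contained
raw = StringIO(
    "order_id,customer_id,product,quantity,price,order_date\n"
    "1001,201,T-shirt,2,15.50,2023-01-15\n"
    "1002,202,Mug,1,9.99,2023-01-22\n"
    "1003,203,Notebook,3,,2023-01-29\n"
    "1004,202,Cap,1,12.00,2023-02-05\n"
    "1005,201,T-shirt,4,16.50,2023-02-12\n"
)
orders = pd.read_csv(raw, parse_dates=["order_date"])

missing_per_column = orders.isna().sum()  # number of NaN values in each column
distinct_per_column = orders.nunique()    # distinct non-null values per column

print(orders.dtypes)
print(missing_per_column)
print(distinct_per_column)
```

Here the single missing `price` shows up in `missing_per_column`, and `nunique()` confirms that there are fewer distinct customers than orders.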

### Exporting cleaned data to disk or buffer

You can export to CSV, Excel, or JSON using `to_csv()`, `to_excel()`, and `to_json()`. For large files, write to a path ending in `.gz` or `.zip`: with the default `compression='infer'`, pandas picks the codec from the file extension. In practice, pass `index=False` unless the index carries information you need, otherwise the output gains an extra unnamed column.

In [None]:
# Clean data (fill missing price and compute total)
df['price'] = df['price'].fillna(df['price'].median())
df['total'] = df['price'] * df['quantity']

# Export to CSV string buffer for demonstration
csv_buffer = StringIO()
df.to_csv(csv_buffer, index=False)
print(csv_buffer.getvalue().splitlines()[:6])
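To make the effect of `index=False` and extension-based compression concrete, here is a small sketch that writes to a temporary directory. The `report` frame and the file name `report.csv.gz` are hypothetical stand-ins, not part of the dataset above.

```python
import os
import tempfile

import pandas as pd

# Hypothetical two-row report used only for this illustration
report = pd.DataFrame({"product": ["Mug", "Cap"], "total_qty": [1, 1]})

with_index = report.to_csv()                # default: index written as an unnamed first column
without_index = report.to_csv(index=False)  # data columns only

print(with_index.splitlines()[0])     # ,product,total_qty
print(without_index.splitlines()[0])  # product,total_qty

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "report.csv.gz")
    report.to_csv(path, index=False)  # compression='infer' selects gzip from the .gz suffix
    with open(path, "rb") as fh:
        magic = fh.read(2)            # gzip files start with the bytes 0x1f 0x8b
    reloaded = pd.read_csv(path)      # decompression is inferred the same way

print(magic == b"\x1f\x8b", reloaded.shape)
```

Reading the file back with `read_csv()` needs no extra arguments because the same extension-based inference applies on import.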

### Reading JSON and Excel Files

Pandas can natively parse JSON objects and Excel sheets. Here we demonstrate JSON since Excel I/O requires `openpyxl` installed. JSON import is handy for APIs or nested datasets.

In [None]:
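# Convert a small subset to JSON and read it back.
# date_format='iso' writes readable timestamps instead of epoch milliseconds;
# convert_dates names the column explicitly, because read_json only
# auto-converts a few date-like column names and 'order_date' is not one of them.
json_str = df.head(3).to_json(orient='records', date_format='iso')
print(json_str)
df_json = pd.read_json(StringIO(json_str), convert_dates=['order_date'])
df_json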
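For the nested payloads that APIs typically return, `pd.json_normalize()` flattens inner objects into columns. The sketch below uses a made-up payload; the `api_payload` structure and its `customer` and `city` fields are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical API response: each order embeds a nested customer object
api_payload = [
    {"order_id": 1001, "customer": {"id": 201, "city": "Lyon"}, "total": 31.0},
    {"order_id": 1002, "customer": {"id": 202, "city": "Turin"}, "total": 9.99},
]

# Nested keys become columns joined by the separator, e.g. customer_id, customer_city
flat = pd.json_normalize(api_payload, sep="_")

print(sorted(flat.columns))
print(flat)
```

Choosing `sep="_"` keeps the resulting column names valid as Python attribute names, which fits the column-name normalization recommended later in this section.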

### Loading data in chunks

When datasets are very large, use the `chunksize` parameter to read portions incrementally. Each chunk is a smaller DataFrame that can be processed before moving to the next, avoiding memory exhaustion.

In [None]:
# Example of chunked reading (simulated by re-reading the CSV exported above)
chunk_iter = pd.read_csv(StringIO(csv_buffer.getvalue()), chunksize=2)

totals = []
for chunk in chunk_iter:
    totals.append(chunk['total'].sum())
print('Partial totals per chunk:', totals, '| Grand total:', sum(totals))
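Summing a single column per chunk is the simplest case. A grouped aggregate can be built the same way by reducing each chunk first and combining the small partial results afterwards. This is a sketch under the assumption that the per-chunk summaries fit in memory; `csv_text`, `partials`, and `qty_per_product` are illustrative names.

```python
from io import StringIO

import pandas as pd

# Small stand-in for a file that would be too large to load at once
csv_text = (
    "product,quantity\n"
    "T-shirt,2\n"
    "Mug,1\n"
    "Notebook,3\n"
    "Cap,1\n"
    "T-shirt,4\n"
)

partials = []
for chunk in pd.read_csv(StringIO(csv_text), chunksize=2):
    # Reduce each chunk to one row per product before keeping it
    partials.append(chunk.groupby("product")["quantity"].sum())

# Products that appear in several chunks are merged by summing over the index
qty_per_product = pd.concat(partials).groupby(level=0).sum()
print(qty_per_product)
```

This two-step pattern works for sums and counts; statistics such as the median cannot be combined from per-chunk results this way.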

## Real-World Problem 1 — Cleaning a CSV with Mixed Encodings and Missing Values

**Scenario:** You receive a sales CSV where some rows use commas while others use semicolons and contain mixed encodings. Demonstrate how to robustly import such data and normalize column names and dtypes.

In [None]:
# Simulate a messy CSV using multiple delimiters (normally you'd detect this via sniffing)
messy_csv = 'order_id;customer_id;price\n1001;200;10.5\n1002,201,12.5\n1003;202;11.0'

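# Detect the dominant delimiter (simple heuristic: count occurrences)
delimiter = ';' if messy_csv.count(';') > messy_csv.count(',') else ','
stray = ',' if delimiter == ';' else ';'

# Rewrite stray delimiters to the dominant one, then parse with it.
# Caution: this is only safe when no field contains the stray character
# (for example decimal commas). encoding_errors is omitted because it only
# applies when pandas decodes bytes or a file path, not an in-memory str.
df_messy = pd.read_csv(StringIO(messy_csv.replace(stray, delimiter)), sep=delimiter)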

# Normalize column names
df_messy.columns = df_messy.columns.str.strip().str.lower()
df_messy.info()
df_messy.head()
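The cell above works on a string that is already decoded. When the input arrives as raw bytes, the encoding has to be settled first. The following sketch tries UTF-8 and falls back to Latin-1, then sniffs the delimiter with the standard-library `csv.Sniffer`. The helper `decode_with_fallback`, the candidate encodings, and the sample bytes are assumptions for illustration.

```python
import csv
from io import StringIO

import pandas as pd

# Hypothetical raw file content, encoded as Latin-1 ("Jos\u00e9" is José)
raw_bytes = "order_id;customer;price\n1001;Jos\u00e9;10.5\n1002;Ana;12.5\n".encode("latin-1")


def decode_with_fallback(data, encodings=("utf-8", "latin-1")):
    """Return (text, encoding) for the first candidate encoding that decodes cleanly."""
    for enc in encodings:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")


text, used_encoding = decode_with_fallback(raw_bytes)

# Restrict the sniffer to the delimiters we expect to see
dialect = csv.Sniffer().sniff(text, delimiters=";,")
df_sniffed = pd.read_csv(StringIO(text), sep=dialect.delimiter)

print(used_encoding, repr(dialect.delimiter))
print(df_sniffed)
```

Latin-1 decodes any byte sequence, so it belongs last in the list as a catch-all. pandas can also sniff the delimiter itself when called with `sep=None, engine='python'`.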

## Real-World Problem 2 — Export a Clean Report for Downstream Analysis

**Task:** Starting from our cleaned retail dataset (`df`), export a product-level summary as both JSON and compressed CSV for downstream ML training. Demonstrate format control and compression.

In [None]:
# Aggregate per product
summary = df.groupby('product', as_index=False).agg(
    total_qty=('quantity', 'sum'),
    total_revenue=('total', 'sum')
)

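# Export to multiple formats
import gzip
from io import BytesIO

json_report = summary.to_json(orient='records', indent=2)
csv_gz_buffer = BytesIO()  # compressed output is bytes, so use a binary buffer (pandas >= 1.2)
summary.to_csv(csv_gz_buffer, index=False, compression='gzip')

print('JSON Report Example:\n', json_report[:200], '...')
csv_text = gzip.decompress(csv_gz_buffer.getvalue()).decode('utf-8')
print('\nCSV Content Preview:\n', '\n'.join(csv_text.splitlines()[:5]))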

## Under the Hood — How Pandas Handles I/O Efficiently

- `read_csv()` uses a C tokenizer by default (`engine='c'`) and reads input in buffered blocks, so files with millions of rows parse quickly as long as the result fits in memory.
- With `parse_dates`, pandas converts the named columns to `datetime64` in a vectorized pass after the raw values have been read.
- `to_csv()` writes rows to the target in batches (tunable via its `chunksize` argument); gzip, bz2, zip, and xz compression go through Python’s standard-library modules.
- With Arrow/Parquet backends (covered in later sections), some reads can avoid copying data, which reduces the memory footprint.

## Best Practices / Common Pitfalls
- Specify `dtype` and `parse_dates` for columns whose types matter; otherwise dates stay as plain strings and identifiers may be silently converted to numbers.
- Avoid reading massive files into memory at once — use `chunksize`.
- Don’t forget `index=False` when writing CSVs unless index carries meaning.
- Normalize column names early (lowercase, underscores) for consistency.
- For production pipelines, prefer Parquet/Feather over CSV for speed and schema preservation.
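The first practice in the list can be sketched as follows. The example compares inferred and explicit types on a tiny CSV; the zero-padded `customer_id` values are invented here to show what inference can lose.

```python
from io import StringIO

import pandas as pd

# Illustrative data: customer IDs carry leading zeros that matter
csv_text = (
    "order_id,customer_id,price,order_date\n"
    "1001,0201,15.50,2023-01-15\n"
    "1002,0202,9.99,2023-01-22\n"
)

# Inference alone: IDs become integers and dates are not parsed
inferred = pd.read_csv(StringIO(csv_text))

# Explicit schema: IDs stay text, dates become datetime64
explicit = pd.read_csv(
    StringIO(csv_text),
    dtype={"order_id": "int64", "customer_id": str, "price": "float64"},
    parse_dates=["order_date"],
)

print(inferred.dtypes)
print(explicit.dtypes)
print(inferred["customer_id"].tolist(), explicit["customer_id"].tolist())
```

Declaring the schema at import time keeps the fix in one place instead of scattering `astype()` calls through later cells.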

## Challenge Exercise (no solution here)

You’re given a CSV with columns: `customer_id`, `purchase_date`, `amount_spent`, and some missing entries. Write code to:
1. Read the CSV robustly (detect delimiter, parse dates).
2. Fill missing `amount_spent` with the median per customer.
3. Export two versions:
   - CSV compressed with gzip.
   - JSON formatted with 2-space indentation.
4. Verify that the exported JSON file can be reloaded with identical totals per customer.

_Hint:_ Compare original vs. re-imported data using `groupby('customer_id').sum()`.

# --- End of Section 2 — Continue to Section 3 ---