# Optional Add-on Module 3: Integration of Pandas with Other Libraries

In this module, we explore **how Pandas integrates with other data ecosystems** — combining flexibility with performance. We'll cover:

- Seamless exchange with **NumPy** arrays
- Database operations using **SQLAlchemy**
- High-speed interchange with **PyArrow** and **Polars**
- Hybrid workflows and conversion performance insights

## 1. Pandas ↔ NumPy Interoperability

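Because Pandas stores most of its data in NumPy arrays, converting between DataFrames and NumPy arrays is cheap. When every column shares one dtype, `to_numpy()` can often return a view of the existing data; with mixed dtypes (as in the example below, which has one integer and one float column), Pandas must copy the data into a single array of a common dtype.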

In [ ]:
import pandas as pd
import numpy as np

# Create a sample DataFrame
df = pd.DataFrame({
    'x': np.arange(10),
    'y': np.random.randn(10)
})

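# Convert to NumPy (the int and float columns are upcast to a common float64 dtype)
arr = df.to_numpy()
print(arr[:5])

# Convert back to DataFrame ('x' comes back as float64, so restore its integer dtype)
df_back = pd.DataFrame(arr, columns=['x', 'y']).astype({'x': 'int64'})
df_back.head()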

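### Under the Hood:
- `DataFrame.to_numpy()` can return a **view** rather than a copy, but only when all columns share a single NumPy dtype; pass `copy=True` when you need an independent array.
- A DataFrame with mixed dtypes is copied and upcast to a common dtype (here `int64` and `float64` become `float64`), so the result does not share memory with the original.
- `df.values` (legacy) and `.to_numpy()` return the same data by default; `.to_numpy()` additionally accepts `dtype`, `copy`, and `na_value` arguments, which make the output dtype explicit.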
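The view-versus-copy behaviour can be checked directly with `np.shares_memory`. The sketch below is illustrative only: the frame names `same_dtype` and `mixed` are made up for this example, and whether the homogeneous case is a view depends on the Pandas version and internal layout, so the code prints that answer instead of assuming it.

```python
import numpy as np
import pandas as pd

# Homogeneous frame: both columns are float64
same_dtype = pd.DataFrame({'a': np.zeros(4), 'b': np.ones(4)})
# Mixed frame: one int64 column and one float64 column
mixed = pd.DataFrame({'x': np.arange(4, dtype='int64'), 'y': np.ones(4)})

same_arr = same_dtype.to_numpy()
mixed_arr = mixed.to_numpy()

# May be True or False depending on the pandas version and block layout
print('homogeneous shares memory:',
      np.shares_memory(same_arr, same_dtype['a'].to_numpy()))

# The mixed case is upcast to one dtype, which forces a copy
print('mixed result dtype:', mixed_arr.dtype)
print('mixed shares memory:',
      np.shares_memory(mixed_arr, mixed['y'].to_numpy()))

# copy=True returns an independent array, so writes do not reach the frame
independent = same_dtype.to_numpy(copy=True)
independent[0, 0] = 99.0
print('original value after write:', same_dtype.loc[0, 'a'])
```

If downstream code mutates the array, requesting `copy=True` up front avoids having to reason about which case applies.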

## 2. Pandas ↔ SQL Integration (Using SQLAlchemy)

Pandas can read and write directly to SQL databases, enabling hybrid pipelines between relational storage and in-memory analysis.

In [ ]:
from sqlalchemy import create_engine

# Create in-memory SQLite database
engine = create_engine('sqlite://', echo=False)

# Write DataFrame to SQL
df.to_sql('metrics', con=engine, index=False, if_exists='replace')

# Query back into Pandas
result_df = pd.read_sql('SELECT * FROM metrics WHERE x < 5', con=engine)
print(result_df)
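When the filter value comes from user input or another variable, a bound parameter is usually safer than formatting it into the SQL string. The following is a minimal sketch using SQLAlchemy's `text()` together with the `params` argument of `pd.read_sql`; the frame name `metrics` and the parameter name `limit` are chosen for illustration.

```python
import numpy as np
import pandas as pd
from sqlalchemy import create_engine, text

# In-memory SQLite database, as in the cell above
engine = create_engine('sqlite://')
metrics = pd.DataFrame({'x': np.arange(10), 'y': np.linspace(0.0, 1.0, 10)})
metrics.to_sql('metrics', con=engine, index=False, if_exists='replace')

# :limit is a bound parameter; its value travels separately from the SQL text
query = text('SELECT x, y FROM metrics WHERE x < :limit ORDER BY x')
small = pd.read_sql(query, con=engine, params={'limit': 5})
print(small)
```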

### Real-World Problem 1: Data Warehouse Integration

**Scenario:** You collect transactional data into PostgreSQL but need in-memory analytics.

```python
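# Connection details are placeholders; a separate name avoids shadowing the SQLite engine
pg_engine = create_engine('postgresql+psycopg2://user:password@localhost:5432/salesdb')
query = "SELECT * FROM orders WHERE date >= CURRENT_DATE - INTERVAL '30 days'"
orders = pd.read_sql(query, pg_engine)
top_regions = orders.groupby('region')['revenue'].sum().sort_values(ascending=False).head()
print(top_regions)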
```

This approach combines SQL’s filtering power with Pandas’ analytical agility.
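The same split of work (filter in SQL, aggregate in Pandas) can be tried without a PostgreSQL server. The sketch below uses in-memory SQLite as a stand-in; the table contents, the `order_date` column name, and the fixed cutoff date are all invented for illustration and are not part of the scenario above.

```python
import pandas as pd
from sqlalchemy import create_engine, text

# SQLite stand-in for the warehouse (hypothetical data)
engine_demo = create_engine('sqlite://')
orders_demo = pd.DataFrame({
    'region': ['north', 'south', 'north', 'east', 'south'],
    'revenue': [100.0, 250.0, 50.0, 75.0, 300.0],
    'order_date': ['2024-01-05', '2024-02-10', '2024-02-15',
                   '2024-02-20', '2023-12-01'],
})
orders_demo.to_sql('orders', con=engine_demo, index=False, if_exists='replace')

# Step 1: the database filters rows, so only recent orders are transferred
recent = pd.read_sql(
    text('SELECT region, revenue FROM orders WHERE order_date >= :cutoff'),
    con=engine_demo,
    params={'cutoff': '2024-02-01'},
)

# Step 2: Pandas aggregates the smaller result in memory
revenue_by_region = (
    recent.groupby('region')['revenue'].sum().sort_values(ascending=False)
)
print(revenue_by_region)
```

ISO-formatted date strings compare correctly as text, which is why the string comparison works in this SQLite sketch; a real warehouse would normally use a proper date column.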

## 3. Pandas ↔ PyArrow Interchange

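**PyArrow** provides an efficient columnar in-memory format for exchanging data between Pandas, Spark, and other big data systems. Conversions can avoid copying in some cases (for example, numeric columns without nulls), but other column types are copied during conversion.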

In [ ]:
import pyarrow as pa

# Convert Pandas DataFrame to Arrow Table
table = pa.Table.from_pandas(df)
print(table.schema)

# Convert back to Pandas
df_arrow = table.to_pandas()
df_arrow.head()

### Real-World Problem 2: Arrow-based Data Exchange

**Scenario:** You want to share preprocessed datasets with a Spark or DuckDB workflow.

```python
import pyarrow.parquet as pq

# Save to Parquet format (the 'data' directory must already exist)
pq.write_table(table, 'data/output.parquet')

# Read back efficiently
table_loaded = pq.read_table('data/output.parquet')
df_loaded = table_loaded.to_pandas()
```

✅ Arrow enables fast binary exchange, ideal for multi-language pipelines.

## 4. Pandas ↔ Polars Integration

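**Polars** is a Rust-based DataFrame library built on the Apache Arrow memory format and designed for fast, multi-threaded execution. Converting to and from Pandas takes a single call (`pl.from_pandas()` and `.to_pandas()`); both require `pyarrow` to be installed.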

In [ ]:
!pip install -q polars
import polars as pl

# Convert Pandas → Polars
pl_df = pl.from_pandas(df)

# Perform fast lazy computations
lazy_df = pl_df.lazy().filter(pl.col('x') > 3).select(['x', 'y'])
result = lazy_df.collect()
print(result)

### Under the Hood:
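- **Polars** stores data in Arrow-format buffers, so conversion to and from Arrow is cheap and often zero-copy; conversion to and from Pandas may still copy, depending on the dtypes involved.
- Its lazy engine builds a query plan and optimizes it before execution, for example by pushing filters and column selections down toward the data source.
- This makes it well suited to large ETL pipelines and hybrid workflows with Pandas.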

## 5. Hybrid Workflow Example: From SQL → Pandas → Polars → Arrow

An end-to-end example combining all integration points.

In [ ]:
import pyarrow.parquet as pq

# Example hybrid pipeline: SQL -> Pandas -> Polars -> Arrow -> Parquet
df_sql = pd.read_sql('SELECT * FROM metrics', con=engine)
pl_df = pl.from_pandas(df_sql)
arrow_tbl = pl_df.to_arrow()  # direct Polars -> Arrow, no detour back through Pandas
pq.write_table(arrow_tbl, 'metrics_final.parquet')

print('Pipeline completed and saved as Parquet file.')

## Best Practices / Pitfalls

✅ **Best Practices:**
- Use Arrow for cross-language or distributed data interchange.
- Keep schema consistent when switching between engines.
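- Prefer a SQLAlchemy engine over raw DBAPI connections: Pandas is tested primarily against SQLAlchemy connectables (plus `sqlite3`) and emits a warning for most other connection objects.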
- Benchmark conversions when chaining multiple backends.

⚠️ **Pitfalls:**
- Converting between formats repeatedly can increase memory overhead.
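- Some dtypes (such as columns holding arbitrary or mixed Python objects) aren't natively supported by Arrow/Polars.
- A Polars lazy query does nothing until `.collect()` is called, so collect explicitly before passing results to Pandas or Arrow. (`.compute()` is the Dask equivalent, not a Polars method.)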

## Challenge Exercise

**Task:** Build a hybrid ETL pipeline that:
1. Reads data from a SQL database into Pandas.
2. Converts it to Polars for transformations.
3. Saves the final result as a Parquet file via PyArrow.

_Bonus_: Benchmark the total runtime compared to a pure Pandas workflow.
