# Polars Data Types and Schema Management - Comprehensive Guide

This notebook covers Polars' type system, from basic types to advanced schema management.

## What You'll Learn:
- Complete overview of Polars data types
- Categorical vs Enum types (memory optimization)
- Decimal types for financial data
- Schema definition and validation
- Type casting and coercion
- Schema evolution and compatibility
- Performance implications of data types
- Best practices for type selection

In [None]:
import polars as pl
import numpy as np
from datetime import datetime, date, time, timedelta
from decimal import Decimal

print(f"Polars version: {pl.__version__}")

---
# Part 1: Polars Data Types Overview

Polars has a rich type system designed for performance and correctness.

## 1.1 Numeric Types

In [None]:
# Integer types (signed and unsigned)
df_integers = pl.DataFrame({
    'int8': pl.Series([1, 2, 3], dtype=pl.Int8),      # -128 to 127
    'int16': pl.Series([1, 2, 3], dtype=pl.Int16),    # -32,768 to 32,767
    'int32': pl.Series([1, 2, 3], dtype=pl.Int32),    # -2.1B to 2.1B
    'int64': pl.Series([1, 2, 3], dtype=pl.Int64),    # -9.2E18 to 9.2E18
    'uint8': pl.Series([1, 2, 3], dtype=pl.UInt8),    # 0 to 255
    'uint16': pl.Series([1, 2, 3], dtype=pl.UInt16),  # 0 to 65,535
    'uint32': pl.Series([1, 2, 3], dtype=pl.UInt32),  # 0 to 4.3B
    'uint64': pl.Series([1, 2, 3], dtype=pl.UInt64),  # 0 to 1.8E19
})

print("Integer types:")
print(df_integers.schema)
print(f"\nMemory usage: {df_integers.estimated_size('mb'):.6f} MB")

In [None]:
# Floating point types
df_floats = pl.DataFrame({
    'float32': pl.Series([1.5, 2.5, 3.5], dtype=pl.Float32),  # 32-bit float
    'float64': pl.Series([1.5, 2.5, 3.5], dtype=pl.Float64),  # 64-bit float (double)
})

print("Float types:")
print(df_floats.schema)
print(df_floats)

In [None]:
# Choosing the right size (memory efficiency)
# Example: Age column
ages_int64 = pl.Series('age_int64', [25, 30, 35], dtype=pl.Int64)
ages_uint8 = pl.Series('age_uint8', [25, 30, 35], dtype=pl.UInt8)

print(f"Int64 memory: {ages_int64.estimated_size('b')} bytes")
print(f"UInt8 memory: {ages_uint8.estimated_size('b')} bytes")
print(f"Memory saved: {(1 - ages_uint8.estimated_size('b') / ages_int64.estimated_size('b')) * 100:.1f}%")
print("\n💡 Use the smallest type that fits your data range!")

## 1.2 Temporal Types

In [None]:
# Date and time types
df_temporal = pl.DataFrame({
    'date': pl.Series([date(2024, 1, 1), date(2024, 1, 2)], dtype=pl.Date),
    'datetime': pl.Series([datetime(2024, 1, 1, 10, 30), datetime(2024, 1, 2, 14, 45)], dtype=pl.Datetime),
    'time': pl.Series([time(10, 30), time(14, 45)], dtype=pl.Time),
    'duration': pl.Series([timedelta(days=1), timedelta(hours=2)], dtype=pl.Duration),
})

print("Temporal types:")
print(df_temporal)
print(f"\nSchema: {df_temporal.schema}")

In [None]:
# Datetime with timezone
df_tz = pl.DataFrame({
    'dt_utc': pl.datetime_range(
        datetime(2024, 1, 1),
        datetime(2024, 1, 3),
        interval='1d',
        time_zone='UTC',
        eager=True
    )
})

print("Datetime with timezone:")
print(df_tz)
print(f"Type: {df_tz['dt_utc'].dtype}")

## 1.3 String and Boolean Types

In [None]:
# String and Boolean
df_basic = pl.DataFrame({
    'text': pl.Series(['hello', 'world'], dtype=pl.Utf8),  # UTF-8 encoded strings
    'flag': pl.Series([True, False], dtype=pl.Boolean),
})

print("String and Boolean:")
print(df_basic)
print(f"Schema: {df_basic.schema}")

---
# Part 2: Categorical and Enum Types

These types are crucial for memory optimization and performance with repetitive string data.

## 2.1 Categorical Type

In [None]:
# Create data with repetitive strings
countries = ['USA', 'Canada', 'Mexico'] * 1000

# As String (Utf8)
df_string = pl.DataFrame({
    'country': countries
})

# As Categorical
df_categorical = pl.DataFrame({
    'country': pl.Series(countries, dtype=pl.Categorical)
})

print("Memory comparison:")
print(f"String (Utf8):     {df_string.estimated_size('kb'):.2f} KB")
print(f"Categorical:       {df_categorical.estimated_size('kb'):.2f} KB")
print(f"Memory saved:      {(1 - df_categorical.estimated_size('kb') / df_string.estimated_size('kb')) * 100:.1f}%")
print("\n💡 Categorical stores strings once and uses integer indices!")

In [None]:
# Converting to Categorical
df = pl.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'category': ['A', 'B', 'A', 'C', 'B']
})

# Convert to categorical
df_cat = df.with_columns(
    pl.col('category').cast(pl.Categorical)
)

print("Original:")
print(df.schema)

print("\nAfter casting to Categorical:")
print(df_cat.schema)
print(df_cat)

In [None]:
# Categorical operations
result = df_cat.group_by('category').agg([
    pl.len().alias('count')  # pl.count() is deprecated in favor of pl.len()
]).sort('category')

print("GroupBy on Categorical (fast):")
print(result)

# Get categories
print(f"\nUnique categories: {df_cat['category'].unique().to_list()}")

## 2.2 Enum Type (Preferred)

In [None]:
# Enum: When categories are known in advance
# Enum is generally faster and safer than Categorical because the category set is fixed

# Define enum with fixed categories
status_enum = pl.Enum(['pending', 'processing', 'completed', 'failed'])

df_enum = pl.DataFrame({
    'order_id': [1, 2, 3, 4, 5],
    'status': pl.Series(['pending', 'completed', 'processing', 'completed', 'failed'], dtype=status_enum)
})

print("DataFrame with Enum:")
print(df_enum)
print(f"\nSchema: {df_enum.schema}")

In [None]:
# Enum validation (rejects invalid values)
try:
    invalid_df = pl.DataFrame({
        'status': pl.Series(['pending', 'invalid_status'], dtype=status_enum)
    })
except Exception as e:
    print(f"❌ Error (expected): {type(e).__name__}")
    print("Enum enforces valid categories!")

In [None]:
# Enum vs Categorical: When to use which?
comparison = pl.DataFrame({
    'Aspect': [
        'Categories known upfront?',
        'Performance',
        'Memory efficiency',
        'Type safety',
        'Multiple DataFrames',
        'Use case'
    ],
    'Enum': [
        '✅ Yes (required)',
        '⭐⭐⭐⭐⭐ Fastest',
        '⭐⭐⭐⭐⭐ Best',
        '✅ Strict validation',
        '✅ Same encoding across DFs',
        'Status codes, fixed categories'
    ],
    'Categorical': [
        '❌ No (dynamic)',
        '⭐⭐⭐⭐ Fast',
        '⭐⭐⭐⭐ Good',
        '❌ No validation',
        '⚠️ Different encodings',
        'Unknown categories, user input'
    ]
})

print("Enum vs Categorical:")
print(comparison)
print("\n🏆 Prefer Enum whenever possible!")

---
# Part 3: Decimal Type

For exact decimal arithmetic (financial calculations, currencies).

In [None]:
# Decimal type for precise financial calculations
# Decimal(precision, scale) where precision=total digits, scale=decimal places

df_decimal = pl.DataFrame({
    'product': ['A', 'B', 'C'],
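    # Build from decimal.Decimal values: float literals such as 19.99 are not exact
    'price': pl.Series([Decimal('19.99'), Decimal('29.95'), Decimal('9.99')], dtype=pl.Decimal(precision=10, scale=2)),
    'tax_rate': pl.Series([Decimal('0.08')] * 3, dtype=pl.Decimal(precision=5, scale=4)),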
})

print("Decimal types:")
print(df_decimal)
print(f"\nSchema: {df_decimal.schema}")

In [None]:
# Decimal arithmetic (exact calculations)
df_calculated = df_decimal.with_columns([
    (pl.col('price') * pl.col('tax_rate')).alias('tax_amount'),
    (pl.col('price') * (1 + pl.col('tax_rate'))).alias('total_price')
])

print("Decimal calculations:")
print(df_calculated)
print("\n💡 Decimals avoid floating point precision errors!")

In [None]:
# Float vs Decimal precision
# Classic floating point problem
float_sum = 0.1 + 0.2
print(f"Float: 0.1 + 0.2 = {float_sum}")
print(f"Expected: 0.3")
print(f"Accurate? {float_sum == 0.3}")

# With Decimal
df_precision = pl.DataFrame({
    'a': pl.Series([Decimal('0.1')], dtype=pl.Decimal(precision=10, scale=2)),
    'b': pl.Series([Decimal('0.2')], dtype=pl.Decimal(precision=10, scale=2))
}).with_columns(
    (pl.col('a') + pl.col('b')).alias('sum')
)

print("\nWith Decimal:")
print(df_precision)
print("✅ Exact arithmetic!")
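
The role of scale can also be seen with Python's standard-library `decimal` module, independent of Polars. The sketch below multiplies a price by a tax rate and rounds the result back to cents; the half-up rounding mode is an assumption chosen for illustration, so substitute whatever rule your accounting requirements specify.

```python
from decimal import Decimal, ROUND_HALF_UP

price = Decimal('19.99')     # scale 2
tax_rate = Decimal('0.08')   # scale 2

# Multiplying two scale-2 values gives an exact scale-4 result.
raw_tax = price * tax_rate

# Round back to cents explicitly instead of relying on float rounding.
tax = raw_tax.quantize(Decimal('0.01'), rounding=ROUND_HALF_UP)
total = price + tax

print(raw_tax, tax, total)
```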

---
# Part 4: Schema Definition and Validation

## 4.1 Explicit Schema Definition

In [None]:
# Define schema explicitly for better control
schema = {
    'user_id': pl.UInt32,
    'username': pl.Utf8,
    'age': pl.UInt8,
    'is_active': pl.Boolean,
    'signup_date': pl.Date,
    'balance': pl.Decimal(precision=15, scale=2),
    'status': pl.Enum(['active', 'inactive', 'pending'])
}

# Create DataFrame with schema
df_schema = pl.DataFrame(
    {
        'user_id': [1, 2, 3],
        'username': ['alice', 'bob', 'charlie'],
        'age': [25, 30, 35],
        'is_active': [True, True, False],
        'signup_date': [date(2024, 1, 1), date(2024, 1, 2), date(2024, 1, 3)],
        'balance': [Decimal('1000.50'), Decimal('2500.75'), Decimal('500.00')],  # exact values for the Decimal column
        'status': ['active', 'active', 'inactive']
    },
    schema=schema
)

print("DataFrame with explicit schema:")
print(df_schema)
print(f"\nSchema: {df_schema.schema}")

In [None]:
# Reading CSV with explicit schema (faster + type safety)
import tempfile
import os

# Create sample CSV
temp_dir = tempfile.mkdtemp()
csv_path = os.path.join(temp_dir, 'users.csv')

with open(csv_path, 'w') as f:
    f.write('user_id,username,age,is_active,signup_date,balance\n')
    f.write('1,alice,25,true,2024-01-01,1000.50\n')
    f.write('2,bob,30,true,2024-01-02,2500.75\n')
    f.write('3,charlie,35,false,2024-01-03,500.00\n')

# Read with schema
csv_schema = {
    'user_id': pl.UInt32,
    'username': pl.Utf8,
    'age': pl.UInt8,
    'is_active': pl.Boolean,
    'signup_date': pl.Date,
    'balance': pl.Float64  # Float64 for simplicity; use pl.Decimal where exact arithmetic matters
}

df_from_csv = pl.read_csv(csv_path, schema=csv_schema)

print("Read CSV with schema:")
print(df_from_csv)
print(f"\nSchema: {df_from_csv.schema}")
print("\n💡 Explicit schema = faster parsing + type safety")

## 4.2 Schema Validation

In [None]:
# Validate DataFrame against expected schema
def validate_schema(df: pl.DataFrame, expected_schema: dict) -> bool:
    """Validate DataFrame schema matches expected schema."""
    actual_schema = df.schema
    
    # Check all expected columns exist
    for col_name, expected_type in expected_schema.items():
        if col_name not in actual_schema:
            print(f"❌ Missing column: {col_name}")
            return False
        
        actual_type = actual_schema[col_name]
        if actual_type != expected_type:
            print(f"❌ Type mismatch for '{col_name}': expected {expected_type}, got {actual_type}")
            return False
    
    print("✅ Schema validation passed!")
    return True

# Test validation
expected_schema = {
    'user_id': pl.UInt32,
    'username': pl.Utf8,
    'age': pl.UInt8
}

test_df = pl.DataFrame({
    'user_id': pl.Series([1, 2], dtype=pl.UInt32),
    'username': ['alice', 'bob'],
    'age': pl.Series([25, 30], dtype=pl.UInt8)
})

validate_schema(test_df, expected_schema)

In [None]:
# Schema enforcement on write/read
# Write with schema preserved (Parquet is best)
parquet_path = os.path.join(temp_dir, 'data.parquet')
df_schema.write_parquet(parquet_path)

# Read back - schema is preserved!
df_read = pl.read_parquet(parquet_path)

print("Original schema:")
print(df_schema.schema)

print("\nRead schema (from Parquet):")
print(df_read.schema)

print("\n✅ Parquet preserves exact schema (including Enum, Decimal)!")

## 4.3 Type Casting and Coercion

In [None]:
# Safe type casting
df_cast = pl.DataFrame({
    'int_col': [1, 2, 3, 4, 5],
    'str_num': ['10', '20', '30', '40', '50'],
    'float_col': [1.1, 2.2, 3.3, 4.4, 5.5]
})

df_casted = df_cast.select([
    pl.col('int_col').cast(pl.Float64).alias('int_to_float'),
    pl.col('str_num').cast(pl.Int64).alias('str_to_int'),
    pl.col('float_col').cast(pl.Int64).alias('float_to_int'),  # Truncates
    pl.col('int_col').cast(pl.Utf8).alias('int_to_str')
])

print("Type casting:")
print(df_casted)
print(f"\nSchema: {df_casted.schema}")

In [None]:
# Handling cast failures
df_invalid = pl.DataFrame({
    'values': ['1', '2', 'invalid', '4']
})

# Strict cast (fails on invalid)
try:
    df_invalid.select(pl.col('values').cast(pl.Int64, strict=True))
except Exception as e:
    print(f"❌ Strict cast failed: {type(e).__name__}")

# Non-strict cast (invalid -> null)
df_with_nulls = df_invalid.select(
    pl.col('values').cast(pl.Int64, strict=False).alias('values_int')
)

print("\nNon-strict cast (invalid -> null):")
print(df_with_nulls)

In [None]:
# Downcasting for memory efficiency
df_large = pl.DataFrame({
    'id': range(1000),
    'value': range(1000)
})

print(f"Original (Int64): {df_large.estimated_size('kb'):.2f} KB")

# Downcast to smaller types
df_optimized = df_large.select([
    pl.col('id').cast(pl.UInt16).alias('id'),
    pl.col('value').cast(pl.UInt16).alias('value')
])

print(f"Optimized (UInt16): {df_optimized.estimated_size('kb'):.2f} KB")
print(f"Memory saved: {(1 - df_optimized.estimated_size('kb') / df_large.estimated_size('kb')) * 100:.1f}%")

---
# Part 5: Schema Evolution

In [None]:
# Schema evolution scenario
# Version 1: Original schema
df_v1 = pl.DataFrame({
    'id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 30, 35]
})

# Version 2: Added column
df_v2 = pl.DataFrame({
    'id': [4, 5, 6],
    'name': ['Diana', 'Eve', 'Frank'],
    'age': [28, 32, 40],
    'email': ['diana@example.com', 'eve@example.com', 'frank@example.com']  # New!
})

# Version 3: Changed type
df_v3 = pl.DataFrame({
    'id': [7, 8, 9],
    'name': ['Grace', 'Henry', 'Iris'],
    'age': [26.5, 31.5, 38.5],  # Float instead of int!
})

print("V1 schema:", df_v1.schema)
print("V2 schema:", df_v2.schema)
print("V3 schema:", df_v3.schema)

In [None]:
# Combining DataFrames with different schemas
# how='diagonal' fills missing columns with nulls but requires matching dtypes;
# how='diagonal_relaxed' additionally casts columns to a common supertype.

combined = pl.concat(
    [df_v1, df_v2, df_v3],
    how='diagonal_relaxed'  # 'age' is Int64 in v1/v2 but Float64 in v3
)

print("Combined with schema evolution:")
print(combined)
print(f"\nFinal schema: {combined.schema}")
print("Note: Missing 'email' is null, 'age' cast to Float64")

---
# Part 6: Binary and Object Types

## 6.1 Binary Type

In [None]:
# Binary data (bytes)
df_binary = pl.DataFrame({
    'id': [1, 2, 3],
    'data': [b'hello', b'world', b'polars']
})

print("Binary data:")
print(df_binary)
print(f"Schema: {df_binary.schema}")

In [None]:
# Binary operations
# Note: str.encode() / bin.decode() handle 'hex' and 'base64', not character
# encodings, so string <-> binary conversion is done with cast().

# Convert string to binary (keeps the UTF-8 bytes)
df_encode = pl.DataFrame({
    'text': ['hello', 'world']
}).with_columns(
    pl.col('text').cast(pl.Binary).alias('binary')
)

print("Encode to binary:")
print(df_encode)

# Convert binary back to string (fails if the bytes are not valid UTF-8)
df_decode = df_encode.with_columns(
    pl.col('binary').cast(pl.Utf8).alias('decoded')
)

print("\nDecode from binary:")
print(df_decode)

## 6.2 Object Type (Use Sparingly)

In [None]:
# Object type for arbitrary Python objects (SLOW - avoid if possible)
class CustomObject:
    def __init__(self, value):
        self.value = value
    def __repr__(self):
        return f"CustomObject({self.value})"

df_object = pl.DataFrame({
    'id': [1, 2, 3],
    'obj': [CustomObject(10), CustomObject(20), CustomObject(30)]
})

print("Object type (slow - avoid):")
print(df_object)
print(f"Schema: {df_object.schema}")
print("\n⚠️ Object types are slow - use native types when possible!")

---
# Part 7: Type Selection Best Practices

In [None]:
# Best practices guide
best_practices = pl.DataFrame({
    'Data Type': [
        'Age (0-120)',
        'IDs (positive, < 4B)',
        'Prices/Money',
        'Status codes (fixed)',
        'Categories (unknown)',
        'Text/Names',
        'Dates',
        'Timestamps',
        'True/False flags',
        'Percentages (0-1)'
    ],
    'Recommended Type': [
        'UInt8',
        'UInt32',
        'Decimal(15, 2)',
        'Enum',
        'Categorical',
        'Utf8',
        'Date',
        'Datetime',
        'Boolean',
        'Float32'
    ],
    'Why': [
        'Smallest type that fits range',
        'Efficient, no negatives needed',
        'Exact arithmetic, no float errors',
        'Fastest, type-safe, consistent',
        'Memory efficient, flexible',
        'Standard string type',
        'Date-only operations',
        'Full precision timestamps',
        'Bit-packed storage (1 bit per value)',
        'Single precision is usually sufficient'
    ]
})

print("Type Selection Best Practices:")
print(best_practices)

In [None]:
# Real-world example with optimized types
df_optimized_schema = pl.DataFrame(
    {
        'customer_id': [1, 2, 3, 4, 5],
        'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
        'age': [25, 30, 35, 28, 42],
        'is_premium': [True, False, True, False, True],
        'status': ['active', 'active', 'inactive', 'pending', 'active'],
        'balance': [Decimal('1234.56'), Decimal('5678.90'), Decimal('910.11'), Decimal('1213.14'), Decimal('1516.17')],
        'signup_date': [date(2024, 1, 1), date(2024, 1, 2), date(2024, 1, 3), 
                       date(2024, 1, 4), date(2024, 1, 5)]
    },
    schema={
        'customer_id': pl.UInt32,
        'name': pl.Utf8,
        'age': pl.UInt8,
        'is_premium': pl.Boolean,
        'status': pl.Enum(['active', 'inactive', 'pending']),
        'balance': pl.Decimal(precision=10, scale=2),
        'signup_date': pl.Date
    }
)

print("Optimized customer DataFrame:")
print(df_optimized_schema)
print(f"\nSchema: {df_optimized_schema.schema}")
print(f"Memory: {df_optimized_schema.estimated_size('kb'):.4f} KB")

---
# Summary

## Key Takeaways:

### **Type Categories:**
1. **Numeric**: Int8-Int64, UInt8-UInt64, Float32, Float64, Decimal
2. **Temporal**: Date, Datetime, Time, Duration
3. **Text**: Utf8, Categorical, Enum
4. **Other**: Boolean, Binary, Object

### **Memory Optimization:**
- ✅ Use smallest numeric type that fits your range
- ✅ Use UInt for positive-only values
- ✅ Use Enum/Categorical for repetitive strings
- ✅ Use Decimal for financial data
- ❌ Avoid Object type (very slow)

### **Enum vs Categorical:**
| Feature | Enum | Categorical |
|---------|------|-------------|
| Performance | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| Type Safety | ✅ Strict | ❌ None |
| Use Case | Known categories | Dynamic categories |
| **Recommendation** | **Prefer Enum** | Only if dynamic |

### **Schema Best Practices:**
1. Define schemas explicitly for production code
2. Use Parquet to preserve exact schemas
3. Validate schemas before processing
4. Handle schema evolution with `how='diagonal'` (or `how='diagonal_relaxed'` when column types differ)
5. Use strict=False for graceful cast failures

### **Common Patterns:**
```python
# Define schema
schema = {
    'id': pl.UInt32,
    'status': pl.Enum(['active', 'inactive']),
    'price': pl.Decimal(10, 2)
}

# Create with schema
df = pl.DataFrame(data, schema=schema)

# Read with schema
df = pl.read_csv('data.csv', schema=schema)

# Cast safely
df = df.with_columns(pl.col('col').cast(pl.Int32, strict=False))
```

---
# Practice Exercises

In [None]:
# Exercise 1: Optimize this DataFrame's memory usage
df_exercise = pl.DataFrame({
    'id': [1, 2, 3, 4, 5],  # All positive, < 1000
    'age': [25, 30, 35, 40, 45],  # 0-120 range
    'score': [85.5, 92.3, 78.1, 88.9, 95.2],  # Percentages
})

# TODO: Cast columns to smallest appropriate types


In [None]:
# Exercise 2: Convert repetitive strings to Enum
df_status = pl.DataFrame({
    'order_id': range(1000),
    'status': ['pending', 'shipped', 'delivered'] * 333 + ['pending']
})

# TODO: Convert status to Enum, measure memory savings


In [None]:
# Exercise 3: Define and validate a schema for user data
# TODO: Create schema with: user_id (UInt32), email (Utf8), 
#       is_verified (Boolean), created_at (Datetime)


In [None]:
# Exercise 4: Handle schema evolution
# TODO: Combine 3 DataFrames with different schemas using diagonal concat


In [None]:
# Exercise 5: Use Decimal for financial calculations
# TODO: Calculate total price with tax using Decimal to avoid float errors


In [None]:
# Cleanup
import shutil
shutil.rmtree(temp_dir)
print(f"Cleaned up: {temp_dir}")