# Session 1: Introduction to Polars

Welcome to the Advanced Tech Track! In this course, we'll learn **Polars**, a modern DataFrame library that offers significant performance advantages over Pandas.

## Learning Objectives

By the end of this session, you will be able to:
1. Understand what Polars is and why it's useful
2. Create DataFrames and Series in Polars
3. Read and write various file formats
4. Perform basic data inspection
5. Select and transform columns using the expression API

## Prerequisites

This course assumes you're familiar with:
- Python fundamentals
- Pandas basics (DataFrames, Series, filtering, groupby)

## 1. What is Polars?

**Polars** is a DataFrame library written in Rust with Python bindings. It's designed for:

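- **Speed**: Often many times faster than Pandas on large datasets (10-100x is commonly cited, though the gain depends on the workload)
- **Memory efficiency**: Columnar memory layout based on Apache Arrow, plus optional lazy evaluation that can avoid materializing intermediate results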
- **Modern API**: Consistent, expressive syntax based on expressions
- **Parallel execution**: Automatic parallelization of operations

### Why learn Polars?

| Aspect | Pandas | Polars |
|--------|--------|--------|
| Written in | C/Cython | Rust |
| Execution model | Eager only | Eager + Lazy |
| Parallelization | Manual | Automatic |
| Index | Row index | No index |
| Missing values | NaN + None | null |
| String handling | object dtype | Native strings |

In [None]:
# Import Polars
import polars as pl

# Check version
print(f"Polars version: {pl.__version__}")

## 2. Creating DataFrames and Series

Let's start by creating DataFrames - the core data structure in Polars.

### 2.1 Creating a DataFrame from a dictionary

In [None]:
# Create a DataFrame from a dictionary
df = pl.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "Diana"],
    "age": [25, 30, 35, 28],
    "city": ["New York", "Paris", "London", "Tokyo"]
})

df

### Pandas Comparison

```python
# Pandas
import pandas as pd
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "Diana"],
    "age": [25, 30, 35, 28],
    "city": ["New York", "Paris", "London", "Tokyo"]
})
```

The syntax is nearly identical! The main difference is `pl.DataFrame` vs `pd.DataFrame`.

### 2.2 Creating a Series

In [None]:
# Create a Series
s = pl.Series("temperatures", [22.5, 25.0, 18.3, 30.1, 27.8])
print(s)
print(f"\nData type: {s.dtype}")

In [None]:
# Series with a non-numeric data type: parse strings into dates
dates = pl.Series("dates", ["2024-01-01", "2024-01-02", "2024-01-03"]).str.to_date()
print(dates)
print(f"\nData type: {dates.dtype}")

## 3. Reading and Writing Files

Polars supports many file formats. Let's explore the most common ones.

### 3.1 Reading CSV Files

In [None]:
# Read a CSV file
employees = pl.read_csv("data/employees.csv")
employees.head()

### Pandas Comparison

| Operation | Pandas | Polars |
|-----------|--------|--------|
| Read CSV | `pd.read_csv()` | `pl.read_csv()` |
| Read JSON | `pd.read_json()` | `pl.read_json()` |
| Read Parquet | `pd.read_parquet()` | `pl.read_parquet()` |
| Read Excel | `pd.read_excel()` | `pl.read_excel()` |

### 3.2 Writing Files

In [None]:
# Write to CSV
employees.head(10).write_csv("data/employees_sample.csv")

# Write to Parquet (efficient columnar format)
employees.write_parquet("data/employees.parquet")

print("Files written successfully!")

In [None]:
# Read back the Parquet file
employees_parquet = pl.read_parquet("data/employees.parquet")
employees_parquet.head()

### Why use Parquet?

- **Compression**: Typically much smaller files than the equivalent CSV
- **Speed**: Usually faster to read and write
- **Type preservation**: Data types are stored in the file, so nothing has to be re-inferred on read
- **Column-oriented**: Efficient for analytical queries

## 4. Basic Data Inspection

Let's explore how to inspect our data in Polars.

In [None]:
# Shape: (rows, columns)
print(f"Shape: {employees.shape}")
print(f"Rows: {employees.height}")
print(f"Columns: {employees.width}")

In [None]:
# Column names
print("Columns:", employees.columns)

In [None]:
# Data types
print("Data types:")
print(employees.dtypes)

In [None]:
# Schema (column name -> data type mapping)
employees.schema

In [None]:
# First n rows
employees.head(5)

In [None]:
# Last n rows
employees.tail(5)

In [None]:
# Statistical summary
employees.describe()

### Pandas Comparison: Inspection Methods

| Operation | Pandas | Polars |
|-----------|--------|--------|
| Shape | `df.shape` | `df.shape` |
| Columns | `df.columns` | `df.columns` |
| Data types | `df.dtypes` | `df.dtypes` |
| First rows | `df.head()` | `df.head()` |
| Last rows | `df.tail()` | `df.tail()` |
| Summary | `df.describe()` | `df.describe()` |
| Info | `df.info()` | `df.schema` (closest match; see also `df.glimpse()`) |

## 5. Column Selection with `select()` and `pl.col()`

This is where Polars starts to differ from Pandas. Polars uses an **expression API** for selecting and transforming columns.

### 5.1 Basic Column Selection

In [None]:
# Select a single column by name (returns DataFrame)
employees.select("first_name")

In [None]:
# Select multiple columns
employees.select("first_name", "last_name", "department")

In [None]:
# Using pl.col() - the expression way
employees.select(pl.col("first_name"), pl.col("salary"))

### 5.2 Selecting with Patterns

In [None]:
# Select all columns
employees.select(pl.all())

In [None]:
# Select columns whose names match a regular expression
employees.select(pl.col("^.*_name$"))  # Columns ending with '_name'

In [None]:
# Select columns by data type
employees.select(pl.col(pl.Int64))  # Only Int64 columns (other integer widths are not matched)

In [None]:
# Select all string columns
employees.select(pl.col(pl.String))

### Pandas Comparison: Column Selection

| Operation | Pandas | Polars |
|-----------|--------|--------|
| Single column | `df["col"]` or `df.col` (Series) | `df.select("col")` (DataFrame) or `df["col"]` (Series) |
| Multiple columns | `df[["col1", "col2"]]` | `df.select("col1", "col2")` |
| All columns | `df` | `df.select(pl.all())` |
| By dtype | `df.select_dtypes(include=['int64'])` | `df.select(pl.col(pl.Int64))` |

## 6. Introduction to the Expression API

The **expression API** is the heart of Polars. An expression describes a transformation without executing it; it only runs when you pass it to a context such as `select()` or `with_columns()`.

### 6.1 Basic Expressions

In [None]:
# pl.col() creates an expression that references a column
salary_expr = pl.col("salary")
print(f"Expression: {salary_expr}")

In [None]:
# Expressions can be combined with arithmetic operations
employees.select(
    pl.col("first_name"),
    pl.col("salary") * 1.1  # 10% raise; the result keeps the name "salary",
                            # so it cannot sit next to pl.col("salary") without an alias
)

In [None]:
# Use .alias() to rename the result
employees.select(
    pl.col("first_name"),
    pl.col("salary"),
    (pl.col("salary") * 1.1).alias("salary_with_raise")
)

### 6.2 Aggregation Expressions

In [None]:
# Compute aggregations
employees.select(
    pl.col("salary").mean().alias("avg_salary"),
    pl.col("salary").min().alias("min_salary"),
    pl.col("salary").max().alias("max_salary"),
    pl.col("salary").std().alias("std_salary")
)

In [None]:
# Count unique values
employees.select(
    pl.col("department").n_unique().alias("unique_departments"),
    pl.col("position").n_unique().alias("unique_positions")
)

### 6.3 String Expressions

In [None]:
# String operations via .str namespace
employees.select(
    pl.col("first_name"),
    pl.col("first_name").str.to_uppercase().alias("name_upper"),
    pl.col("first_name").str.len_chars().alias("name_length")
)

In [None]:
# Combine first and last name
employees.select(
    pl.col("first_name"),
    pl.col("last_name"),
    (pl.col("first_name") + " " + pl.col("last_name")).alias("full_name")
)

### Pandas Comparison: Transformations

| Operation | Pandas | Polars |
|-----------|--------|--------|
| Multiply column | `df["col"] * 1.1` | `pl.col("col") * 1.1` |
| Rename result | `(df["col"] * 1.1).rename("new")` | `(pl.col("col") * 1.1).alias("new")` |
| Mean | `df["col"].mean()` | `pl.col("col").mean()` |
| Uppercase | `df["col"].str.upper()` | `pl.col("col").str.to_uppercase()` |

## 7. Creating New Columns with `with_columns()`

To add new columns to an existing DataFrame, use `with_columns()`.

In [None]:
# Add new columns
employees_enhanced = employees.with_columns(
    # Annual bonus (10% of salary)
    (pl.col("salary") * 0.10).alias("bonus"),
    
    # Full name
    (pl.col("first_name") + " " + pl.col("last_name")).alias("full_name"),
    
    # Uppercase department
    pl.col("department").str.to_uppercase().alias("department_upper")
)

employees_enhanced.head()

### Pandas Comparison

```python
# Pandas way (column assignment modifies df in place)
df["bonus"] = df["salary"] * 0.10
df["full_name"] = df["first_name"] + " " + df["last_name"]

# Or with .assign() (returns new DataFrame)
df = df.assign(
    bonus=df["salary"] * 0.10,
    full_name=df["first_name"] + " " + df["last_name"]
)
```

Polars' `with_columns()` always returns a new DataFrame, promoting immutability.

## 8. Key Differences from Pandas

### No Index
Polars doesn't have a row index. This simplifies many operations and avoids index alignment issues.

### Expressions vs Direct Operations
Polars encourages using expressions (`pl.col()`) rather than direct column access.

### Immutability
Polars operations return new DataFrames rather than modifying in place.

### Strict Typing
Polars is stricter about data types, which helps catch errors early.

## Summary: Pandas to Polars Cheat Sheet

| Operation | Pandas | Polars |
|-----------|--------|--------|
| Import | `import pandas as pd` | `import polars as pl` |
| Create DataFrame | `pd.DataFrame({...})` | `pl.DataFrame({...})` |
| Read CSV | `pd.read_csv("file.csv")` | `pl.read_csv("file.csv")` |
| Write CSV | `df.to_csv("file.csv")` | `df.write_csv("file.csv")` |
| Select columns | `df[["col1", "col2"]]` | `df.select("col1", "col2")` |
| Add column | `df["new"] = expr` | `df.with_columns(expr.alias("new"))` |
| Column reference | `df["col"]` | `pl.col("col")` |
| Rename | `df.rename(columns={...})` | `df.rename({...})` |
| Shape | `df.shape` | `df.shape` |
| Data types | `df.dtypes` | `df.dtypes` or `df.schema` |

## Practice Exercises

Try these exercises using the `employees` DataFrame:

1. Select only the `position` and `salary` columns
2. Create a new column `monthly_salary` that divides `salary` by 12
3. Create a column `email_domain` that extracts the domain (the part after the `@`, for example "company.com") from the email addresses
4. Calculate the average, min, and max salary in a single `select()` statement

In [None]:
# Exercise 1: Select position and salary


In [None]:
# Exercise 2: Create monthly_salary column


In [None]:
# Exercise 3: Extract email domain


In [None]:
# Exercise 4: Calculate salary statistics


## Next Session Preview

In Session 2, we'll dive deeper into:
- Filtering rows with `filter()`
- Conditional logic with `when().then().otherwise()`
- Groupby operations and aggregations
- Joining DataFrames
- Handling missing data