# Session 1: Introduction to Polars

Welcome to the Advanced Tech Track! In this course, we'll learn **Polars**, a modern DataFrame library that offers significant performance advantages over Pandas.

## Learning Objectives

By the end of this session, you will be able to:
1. Understand what Polars is and why it's useful
2. Create DataFrames and Series in Polars
3. Read and write various file formats
4. Perform basic data inspection
5. Select and transform columns using the expression API

## Prerequisites

This course assumes you're familiar with:
- Python fundamentals
- Pandas basics (DataFrames, Series, filtering, groupby)

## 1. What is Polars?

**Polars** is a DataFrame library written in Rust with Python bindings. It's designed for:

- **Speed**: Often 10-100x faster than Pandas on large datasets, depending on the workload and hardware
- **Memory efficiency**: Columnar, Apache Arrow-based memory layout, plus optional lazy evaluation
- **Modern API**: Consistent, expressive syntax based on expressions
- **Parallel execution**: Automatic parallelization of operations

### Why learn Polars?

| Aspect | Pandas | Polars |
|--------|--------|--------|
| Written in | C/Cython | Rust |
| Execution model | Eager only | Eager + Lazy |
| Parallelization | Manual | Automatic |
| Index | Row index | No index |
| Missing values | NaN + None | null |
| String handling | object dtype | Native strings |

In [None]:
# Import Polars
import polars as pl

# Check version
print(f"Polars version: {pl.__version__}")

## 2. Creating DataFrames and Series

Let's start by creating DataFrames - the core data structure in Polars.

### 2.1 Creating a DataFrame from a dictionary

In [None]:
# Create a DataFrame from a dictionary
df = pl.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "Diana"],
    "age": [25, 30, 35, 28],
    "city": ["New York", "Paris", "London", "Tokyo"]
})

df

### Pandas Comparison

```python
# Pandas
import pandas as pd
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "Diana"],
    "age": [25, 30, 35, 28],
    "city": ["New York", "Paris", "London", "Tokyo"]
})
```

The syntax is nearly identical! The main difference is `pl.DataFrame` vs `pd.DataFrame`.

### 2.2 Creating a Series

In [None]:
# Create a Series
s = pl.Series("temperatures", [22.5, 25.0, 18.3, 30.1, 27.8])
print(s)
print(f"\nData type: {s.dtype}")

In [None]:
# Series with different data types
dates = pl.Series("dates", ["2024-01-01", "2024-01-02", "2024-01-03"]).str.to_date()
print(dates)
print(f"\nData type: {dates.dtype}")

### Pandas Comparison: Series

| Operation | Pandas | Polars |
|-----------|--------|--------|
| Create Series | `pd.Series([1, 2, 3])` | `pl.Series([1, 2, 3])` |
| Named Series | `pd.Series([1, 2, 3], name="values")` | `pl.Series("values", [1, 2, 3])` |
| Access dtype | `s.dtype` | `s.dtype` |
| To datetime | `pd.to_datetime(s)` | `s.str.to_date()` |

**Note**: In Polars, the series name comes *first* in the constructor: `pl.Series("name", [values])`, while in Pandas it's a keyword argument: `pd.Series([values], name="name")`.

## 3. Reading and Writing Files

Polars supports many file formats. Let's explore the most common ones.

### 3.1 Reading CSV Files

In [None]:
# Read a CSV file
employees = pl.read_csv("data/employees.csv")
employees.head()

### Pandas Comparison

| Operation | Pandas | Polars |
|-----------|--------|--------|
| Read CSV | `pd.read_csv()` | `pl.read_csv()` |
| Read JSON | `pd.read_json()` | `pl.read_json()` |
| Read Parquet | `pd.read_parquet()` | `pl.read_parquet()` |
| Read Excel | `pd.read_excel()` | `pl.read_excel()` |

### 3.2 Writing Files

In [None]:
# Write to CSV
employees.head(10).write_csv("data/employees_sample.csv")

# Write to Parquet (efficient columnar format)
employees.write_parquet("data/employees.parquet")

print("Files written successfully!")

In [None]:
# Read back the Parquet file
employees_parquet = pl.read_parquet("data/employees.parquet")
employees_parquet.head()

### 3.3 Understanding Parquet Format

**Parquet** is a columnar storage format designed for efficient data storage and retrieval. Unlike CSV (which stores data row by row), Parquet stores data column by column.

#### CSV vs Parquet: How They Store Data

```
CSV (Row-oriented):          Parquet (Column-oriented):
┌────┬─────┬──────┐          ┌────────────────────┐
│ id │ name│ sal  │          │ id: 1, 2, 3        │
├────┼─────┼──────┤          │ name: A, B, C      │
│ 1  │ A   │ 50k  │          │ sal: 50k, 60k, 70k │
│ 2  │ B   │ 60k  │          └────────────────────┘
│ 3  │ C   │ 70k  │
└────┴─────┴──────┘
```

#### Why Parquet is Better for Analytics

| Feature | CSV | Parquet |
|---------|-----|---------|
| **Storage** | Plain text | Binary, compressed |
| **File Size** | Large | 2-10x smaller |
| **Data Types** | Lost (everything is text) | Preserved |
| **Read Speed** | Must parse text | Direct binary read |
| **Column Selection** | Must read entire file | Reads only needed columns |
| **Schema** | None (inferred) | Embedded in file |

#### When to Use Each Format

- **CSV**: Human-readable, data exchange, small datasets, compatibility
- **Parquet**: Analytics, large datasets, repeated reads, data pipelines

### 3.4 Performance Comparison: Pandas vs Polars, CSV vs Parquet

Let's compare the read performance of different combinations.

In [None]:
import pandas as pd
import time
import os

# First, let's check file sizes
csv_path = "data/employees.csv"
parquet_path = "data/employees.parquet"

csv_size = os.path.getsize(csv_path)
parquet_size = os.path.getsize(parquet_path)

print("=== File Size Comparison ===")
print(f"CSV file:     {csv_size:,} bytes")
print(f"Parquet file: {parquet_size:,} bytes")
print(f"Parquet is {csv_size/parquet_size:.1f}x smaller")

In [None]:
# Benchmark helper
def benchmark_read(read_func, path, n_runs=10):
    """Run a read operation n_runs times and return the average time in milliseconds."""
    times = []
    for _ in range(n_runs):
        start = time.perf_counter()  # monotonic, high-resolution timer
        read_func(path)  # result is discarded; only the elapsed time matters
        times.append(time.perf_counter() - start)
    return sum(times) / len(times) * 1000  # seconds -> milliseconds

# Run benchmarks
print("=== Read Performance Comparison (10 runs each) ===\n")

# Pandas CSV
pandas_csv_time = benchmark_read(pd.read_csv, csv_path)
print(f"Pandas  + CSV:     {pandas_csv_time:.2f} ms")

# Pandas Parquet
pandas_parquet_time = benchmark_read(pd.read_parquet, parquet_path)
print(f"Pandas  + Parquet: {pandas_parquet_time:.2f} ms")

# Polars CSV
polars_csv_time = benchmark_read(pl.read_csv, csv_path)
print(f"Polars  + CSV:     {polars_csv_time:.2f} ms")

# Polars Parquet
polars_parquet_time = benchmark_read(pl.read_parquet, parquet_path)
print(f"Polars  + Parquet: {polars_parquet_time:.2f} ms")

print(f"\n=== Speedup Summary ===")
print(f"Polars CSV vs Pandas CSV:         {pandas_csv_time/polars_csv_time:.1f}x faster")
print(f"Polars Parquet vs Pandas Parquet: {pandas_parquet_time/polars_parquet_time:.1f}x faster")
print(f"Polars Parquet vs Pandas CSV:     {pandas_csv_time/polars_parquet_time:.1f}x faster")

### Key Takeaways: File Formats and Performance

1. **Parquet files are usually smaller** thanks to binary encoding and compression
2. **Parquet reads are usually faster** because:
   - No text parsing required
   - Data types don't need inference
   - Can read only needed columns (projection pushdown)
3. **Polars is typically faster than Pandas** for both CSV and Parquet formats; run the benchmark above to confirm on your machine
4. **Best combination**: Polars + Parquet for maximum performance

**Note on dataset size**: With small files (like our 100-row example), you may see some overhead from Parquet's binary format. The performance benefits become dramatic with larger datasets (10K+ rows). In Session 3, we'll benchmark with 100K rows where the differences are substantial.

**Recommendation**: When working with data you'll read multiple times, convert CSV to Parquet:

```python
# One-time conversion
pl.read_csv("data.csv").write_parquet("data.parquet")

# Then always read from Parquet
df = pl.read_parquet("data.parquet")
```

## 4. Basic Data Inspection

Let's explore how to inspect our data in Polars.

In [None]:
# Shape: (rows, columns)
print(f"Shape: {employees.shape}")
print(f"Rows: {employees.height}")
print(f"Columns: {employees.width}")

In [None]:
# Column names
print("Columns:", employees.columns)

In [None]:
# Data types
print("Data types:")
print(employees.dtypes)

In [None]:
# Schema (column name -> data type mapping)
employees.schema

In [None]:
# First n rows
employees.head(5)

In [None]:
# Last n rows
employees.tail(5)

In [None]:
# Statistical summary
employees.describe()

### Pandas Comparison: Inspection Methods

| Operation | Pandas | Polars |
|-----------|--------|--------|
| Shape | `df.shape` | `df.shape` |
| Columns | `df.columns` | `df.columns` |
| Data types | `df.dtypes` | `df.dtypes` |
| First rows | `df.head()` | `df.head()` |
| Last rows | `df.tail()` | `df.tail()` |
| Summary | `df.describe()` | `df.describe()` |
| Info | `df.info()` | `df.schema` |

## 5. Column Selection with `select()` and `pl.col()`

This is where Polars starts to differ from Pandas. Polars uses an **expression API** for selecting and transforming columns.

### 5.1 Basic Column Selection

In [None]:
# Select a single column by name (returns DataFrame)
employees.select("first_name")

In [None]:
# Select multiple columns
employees.select("first_name", "last_name", "department")

In [None]:
# Using pl.col() - the expression way
employees.select(pl.col("first_name"), pl.col("salary"))

### 5.2 Selecting with Patterns

In [None]:
# Select all columns
employees.select(pl.all())

In [None]:
# Select columns whose names match a regex
employees.select(pl.col("^.*_name$"))  # Columns ending with '_name'

In [None]:
# Select columns by data type
employees.select(pl.col(pl.Int64))  # Only Int64 columns (other integer widths are not matched)

### Pandas Comparison: Pattern-Based Selection

| Operation | Pandas | Polars |
|-----------|--------|--------|
| All columns | `df` or `df.loc[:, :]` | `df.select(pl.all())` |
| Regex pattern | `df.filter(regex=r"^.*_name$")` | `df.select(pl.col("^.*_name$"))` |
| By dtype | `df.select_dtypes(include=['int64'])` | `df.select(pl.col(pl.Int64))` |
| String columns | `df.select_dtypes(include=['object'])` | `df.select(pl.col(pl.String))` |
| Exclude columns | `df.drop(columns=["col"])` | `df.select(pl.all().exclude("col"))` |

**Note**: Polars' `pl.col()` accepts a regex pattern directly when the string starts with `^` and ends with `$`, making pattern-based selection more concise than Pandas' `filter()` method.

In [None]:
# Select all string columns
employees.select(pl.col(pl.String))

### Pandas Comparison: Column Selection

| Operation | Pandas | Polars |
|-----------|--------|--------|
| Single column | `df["col"]` or `df.col` | `df.select("col")` |
| Multiple columns | `df[["col1", "col2"]]` | `df.select("col1", "col2")` |
| All columns | `df` | `df.select(pl.all())` |
| By dtype | `df.select_dtypes(include=['int64'])` | `df.select(pl.col(pl.Int64))` |

## 6. Introduction to the Expression API

The **expression API** is the heart of Polars and what makes it fundamentally different from Pandas.

### What is an Expression?

An **expression** in Polars is a *description of a computation*, not the result itself. Think of it like a recipe:
- The recipe (expression) describes what to do
- The cooking (execution) happens when the expression is used in a context such as `.select()` or `.with_columns()`, or, for lazy queries, when you call `.collect()`

This separation allows Polars to:
1. **Optimize** the computation before running it
2. **Parallelize** operations automatically
3. **Reuse** expressions across different contexts

### Pandas vs Polars: Mental Model

| Aspect | Pandas | Polars |
|--------|--------|--------|
| Column access | Direct: `df["col"]` returns data | Expression: `pl.col("col")` returns a recipe |
| When computed | Immediately | When collected/selected |
| Optimization | None (eager) | Query optimization possible |
| Reusability | Limited | Expressions can be stored and reused |

### 6.1 Basic Expressions

In [None]:
# Expressions can be stored in variables and reused
salary_with_raise = (pl.col("salary") * 1.1).alias("salary_raised")
salary_with_bonus = (pl.col("salary") * 1.2).alias("salary_with_bonus")

# Use the stored expressions
employees.select(
    pl.col("first_name"),
    pl.col("salary"),
    salary_with_raise,
    salary_with_bonus
)

In [None]:
# Expressions can chain multiple operations
employees.select(
    pl.col("first_name"),
    pl.col("salary"),
    (pl.col("salary") / 12).round(2).alias("monthly_salary")  # Chain division and rounding
)

In [None]:
# The key difference: pl.col() vs df["col"]

# In Pandas, this immediately accesses the data:
# pandas_salary = df["salary"]  # Returns actual data (Series)

# In Polars, this creates an expression (a recipe):
polars_salary_expr = pl.col("salary")  # Returns an expression object
print(f"This is an expression object: {type(polars_salary_expr)}")
print(f"Expression: {polars_salary_expr}")

# The expression is only evaluated when used in a context like select()
result = employees.select(polars_salary_expr)
print(f"\nAfter select(), we get actual data: {type(result)}")

### Pandas Comparison: Column Access vs Expressions

```python
# PANDAS - Direct access, immediate execution
salary_data = df["salary"]           # Returns Series with actual values
doubled = df["salary"] * 2           # Computed immediately
df["doubled"] = df["salary"] * 2     # Adds column immediately

# POLARS - Expression-based, deferred execution  
salary_expr = pl.col("salary")              # Returns expression (recipe)
doubled_expr = pl.col("salary") * 2         # Still just an expression
df.with_columns((pl.col("salary") * 2).alias("doubled"))  # Executed here
```

**Key insight**: In Polars, you build up expressions that describe what you want, then execute them all at once. This allows Polars to optimize the entire operation.

### 6.2 Arithmetic Expressions

Polars supports all standard arithmetic operations on expressions.

In [None]:
# Arithmetic operations on expressions
employees.select(
    pl.col("first_name"),
    pl.col("salary"),
    (pl.col("salary") + 5000).alias("salary_plus_5k"),       # Addition
    (pl.col("salary") - 10000).alias("salary_minus_10k"),    # Subtraction
    (pl.col("salary") * 2).alias("salary_doubled"),          # Multiplication
    (pl.col("salary") / 1000).alias("salary_in_thousands"),  # Division
    (pl.col("salary") // 10000).alias("salary_10k_brackets"), # Floor division
    (pl.col("salary") % 10000).alias("salary_mod_10k"),      # Modulo
)

### Pandas Comparison: Arithmetic Operations

| Operation | Pandas | Polars |
|-----------|--------|--------|
| Addition | `df["col"] + 5000` | `pl.col("col") + 5000` |
| Subtraction | `df["col"] - 1000` | `pl.col("col") - 1000` |
| Multiplication | `df["col"] * 2` | `pl.col("col") * 2` |
| Division | `df["col"] / 1000` | `pl.col("col") / 1000` |
| Floor division | `df["col"] // 10` | `pl.col("col") // 10` |
| Modulo | `df["col"] % 10` | `pl.col("col") % 10` |
| Round | `df["col"].round(2)` | `pl.col("col").round(2)` |
| Absolute | `df["col"].abs()` | `pl.col("col").abs()` |

The operators are identical! The difference is that Pandas operates on data directly, while Polars builds an expression.

### 6.3 Aggregation Expressions

In [None]:
# Compute aggregations
employees.select(
    pl.col("salary").mean().alias("avg_salary"),
    pl.col("salary").min().alias("min_salary"),
    pl.col("salary").max().alias("max_salary"),
    pl.col("salary").std().alias("std_salary")
)

In [None]:
# Count unique values
employees.select(
    pl.col("department").n_unique().alias("unique_departments"),
    pl.col("position").n_unique().alias("unique_positions")
)

### Pandas Comparison: Aggregation Functions

| Operation | Pandas | Polars |
|-----------|--------|--------|
| Mean | `df["col"].mean()` | `pl.col("col").mean()` |
| Sum | `df["col"].sum()` | `pl.col("col").sum()` |
| Min | `df["col"].min()` | `pl.col("col").min()` |
| Max | `df["col"].max()` | `pl.col("col").max()` |
| Std | `df["col"].std()` | `pl.col("col").std()` |
| Count | `df["col"].count()` | `pl.col("col").count()` |
| Unique count | `df["col"].nunique()` | `pl.col("col").n_unique()` |
| First | `df["col"].iloc[0]` | `pl.col("col").first()` |
| Last | `df["col"].iloc[-1]` | `pl.col("col").last()` |

**Key difference**: In Pandas, these return a scalar value. In Polars, they return expressions that can be combined with other expressions in a single `select()` call.

### 6.4 String Expressions

Both Pandas and Polars provide a `.str` accessor (also called a "namespace") for string operations. This is one area where the two libraries are conceptually similar, but with important differences.

#### What is the `.str` Namespace?

The `.str` namespace is a collection of string methods that can be applied to text data. Instead of writing `upper(column)`, you write `column.str.to_uppercase()`. This keeps all string operations organized under one umbrella.

#### Similarities Between Pandas and Polars `.str`

| Aspect | Both Libraries |
|--------|----------------|
| Access pattern | Use `.str.method_name()` syntax |
| Chaining | Methods can be chained: `.str.lower().str.strip()` in Pandas, `.str.to_lowercase().str.strip_chars()` in Polars |
| Vectorized | Operations apply to all values at once (no loops needed) |
| Null handling | Missing values propagate: a missing input gives a missing output instead of an error |

#### Key Differences

| Aspect | Pandas `.str` | Polars `.str` |
|--------|---------------|---------------|
| **Applied to** | Series directly: `df["col"].str.upper()` | Expressions: `pl.col("col").str.to_uppercase()` |
| **Execution** | Immediate | Deferred (part of expression) |
| **Method names** | Shorter: `upper()`, `lower()`, `len()` | More explicit: `to_uppercase()`, `to_lowercase()`, `len_chars()` |
| **Return type** | Series (data) | Expression (recipe) |
| **After split** | Index with `.str[0]` | Use `.list.first()` or `.list.get(0)` |

#### Why Polars Uses Different Names

Polars chose more explicit method names to avoid ambiguity:

- `len()` vs `len_chars()`: In Polars, `len_chars()` counts characters, while `len_bytes()` counts bytes. This matters for Unicode text (e.g., "café" has 4 characters but 5 bytes in UTF-8).
- `upper()` vs `to_uppercase()`: The `to_` prefix makes it clear a transformation is happening.
- `startswith()` vs `starts_with()`: Polars uses snake_case consistently.

In [None]:
# String operations via .str namespace
employees.select(
    pl.col("first_name"),
    pl.col("first_name").str.to_uppercase().alias("name_upper"),
    pl.col("first_name").str.len_chars().alias("name_length")
)

In [None]:
# Combine first and last name
employees.select(
    pl.col("first_name"),
    pl.col("last_name"),
    (pl.col("first_name") + " " + pl.col("last_name")).alias("full_name")
)

### Pandas Comparison: String Operations

| Operation | Pandas | Polars |
|-----------|--------|--------|
| Uppercase | `df["col"].str.upper()` | `pl.col("col").str.to_uppercase()` |
| Lowercase | `df["col"].str.lower()` | `pl.col("col").str.to_lowercase()` |
| Length | `df["col"].str.len()` | `pl.col("col").str.len_chars()` |
| Contains | `df["col"].str.contains("x")` | `pl.col("col").str.contains("x")` |
| Replace (all matches) | `df["col"].str.replace("a", "b")` | `pl.col("col").str.replace_all("a", "b")` |
| Split | `df["col"].str.split(",")` | `pl.col("col").str.split(",")` |
| Strip whitespace | `df["col"].str.strip()` | `pl.col("col").str.strip_chars()` |
| Starts with | `df["col"].str.startswith("x")` | `pl.col("col").str.starts_with("x")` |
| Ends with | `df["col"].str.endswith("x")` | `pl.col("col").str.ends_with("x")` |
| Concatenate | `df["a"] + " " + df["b"]` | `pl.col("a") + " " + pl.col("b")` |

**Note**: Polars uses `to_uppercase()`/`to_lowercase()` instead of `upper()`/`lower()`, and `len_chars()` instead of `len()` (to distinguish from byte length).

### 6.5 Boolean and Comparison Expressions

Boolean expressions are essential for filtering (covered in Session 2) and conditional logic.

In [None]:
# Comparison expressions return boolean values
employees.select(
    pl.col("first_name"),
    pl.col("salary"),
    (pl.col("salary") > 100000).alias("high_earner"),
    (pl.col("salary") >= 80000).alias("above_80k"),
    (pl.col("department") == "Engineering").alias("is_engineering"),
    (pl.col("department") != "Sales").alias("not_sales"),
)

In [None]:
# Combining boolean expressions with AND (&) and OR (|)
employees.select(
    pl.col("first_name"),
    pl.col("department"),
    pl.col("salary"),
    # AND: both conditions must be true
    ((pl.col("department") == "Engineering") & (pl.col("salary") > 100000)).alias("eng_high_earner"),
    # OR: either condition can be true
    ((pl.col("department") == "Sales") | (pl.col("department") == "Marketing")).alias("sales_or_marketing"),
    # NOT: negate a condition
    (~(pl.col("is_active"))).alias("is_inactive"),
)

### Pandas Comparison: Boolean and Comparison Operations

| Operation | Pandas | Polars |
|-----------|--------|--------|
| Equal | `df["col"] == value` | `pl.col("col") == value` |
| Not equal | `df["col"] != value` | `pl.col("col") != value` |
| Greater than | `df["col"] > value` | `pl.col("col") > value` |
| Less than | `df["col"] < value` | `pl.col("col") < value` |
| Greater or equal | `df["col"] >= value` | `pl.col("col") >= value` |
| Less or equal | `df["col"] <= value` | `pl.col("col") <= value` |
| AND | `(cond1) & (cond2)` | `(cond1) & (cond2)` |
| OR | `(cond1) \| (cond2)` | `(cond1) \| (cond2)` |
| NOT | `~condition` | `~condition` |
| Is null | `df["col"].isna()` | `pl.col("col").is_null()` |
| Is not null | `df["col"].notna()` | `pl.col("col").is_not_null()` |
| Is in list | `df["col"].isin([...])` | `pl.col("col").is_in([...])` |
| Between | `df["col"].between(a, b)` | `pl.col("col").is_between(a, b)` |

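**Important**: Always wrap each condition in parentheses when combining with `&` and `|`. In Python, these operators bind more tightly than comparison operators such as `==` and `>`, so an unparenthesized condition is grouped differently from what you intended.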

## 7. Creating New Columns with `with_columns()`

To add new columns to an existing DataFrame, use `with_columns()`.

In [None]:
# Add new columns
employees_enhanced = employees.with_columns(
    # Annual bonus (10% of salary)
    (pl.col("salary") * 0.10).alias("bonus"),
    
    # Full name
    (pl.col("first_name") + " " + pl.col("last_name")).alias("full_name"),
    
    # Uppercase department
    pl.col("department").str.to_uppercase().alias("department_upper")
)

employees_enhanced.head()

### Pandas Comparison

```python
# Pandas way (modifies in place or requires copy)
df["bonus"] = df["salary"] * 0.10
df["full_name"] = df["first_name"] + " " + df["last_name"]

# Or with .assign() (returns new DataFrame)
df = df.assign(
    bonus=df["salary"] * 0.10,
    full_name=df["first_name"] + " " + df["last_name"]
)
```

Polars' `with_columns()` always returns a new DataFrame, promoting immutability.

## 8. Key Differences from Pandas

### No Index
Polars doesn't have a row index. This simplifies many operations and avoids index alignment issues.

### Expressions vs Direct Operations
Polars encourages using expressions (`pl.col()`) rather than direct column access.

### Immutability
Polars operations return new DataFrames rather than modifying in place.

### Strict Typing
Polars is stricter about data types, which helps catch errors early.

## Summary: Pandas to Polars Cheat Sheet

| Operation | Pandas | Polars |
|-----------|--------|--------|
| Import | `import pandas as pd` | `import polars as pl` |
| Create DataFrame | `pd.DataFrame({...})` | `pl.DataFrame({...})` |
| Read CSV | `pd.read_csv("file.csv")` | `pl.read_csv("file.csv")` |
| Write CSV | `df.to_csv("file.csv", index=False)` | `df.write_csv("file.csv")` |
| Select columns | `df[["col1", "col2"]]` | `df.select("col1", "col2")` |
| Add column | `df["new"] = expr` | `df.with_columns(expr.alias("new"))` |
| Column reference | `df["col"]` | `pl.col("col")` |
| Rename | `df.rename(columns={...})` | `df.rename({...})` |
| Shape | `df.shape` | `df.shape` |
| Data types | `df.dtypes` | `df.dtypes` or `df.schema` |

## Practice Exercises

Try these exercises using the `employees` DataFrame:

1. Select only the `position` and `salary` columns
2. Create a new column `monthly_salary` that divides `salary` by 12
3. Create a column `email_domain` that extracts "company.com" from the email addresses
4. Calculate the average, min, and max salary in a single `select()` statement

In [None]:
# Exercise 1: Select position and salary


In [None]:
# Exercise 2: Create monthly_salary column


In [None]:
# Exercise 3: Extract email domain


In [None]:
# Exercise 4: Calculate salary statistics


## Next Session Preview

In Session 2, we'll dive deeper into:
- Filtering rows with `filter()`
- Conditional logic with `when().then().otherwise()`
- Group-by operations and aggregations with `group_by()`
- Joining DataFrames
- Handling missing data