# Introduction to Polars

[Polars](https://github.com/pola-rs/polars) is a fast DataFrame library for data wrangling and analytics. Like pandas, it lets you load, transform, join, group, and aggregate tabular data, but it is built on a different performance model:

- **Engine:** written in Rust (pandas is mostly Python with NumPy/C extensions).
- **Execution model:** supports lazy evaluation (build a query plan, then execute efficiently) and query optimization; pandas is mostly eager (does work immediately).
- **Parallelism:** Polars is designed to use multiple CPU cores by default for many operations; pandas is often single-threaded for typical DataFrame ops.
- **Memory model:** Polars uses Apache Arrow columnar memory under the hood, which is great for speed and interoperability.


### Pandas vs Polars

| Dimension                   | **pandas**                                                         | **Polars**                                                |
| --------------------------- | ------------------------------------------------------------------ | --------------------------------------------------------- |
| **Primary language**        | Python (with C/NumPy under the hood)                               | Rust (Python bindings)                                    |
| **Execution model**         | Eager (operations run immediately)                                 | Eager **and** Lazy (query planning + optimization)        |
| **Performance**             | Good for small-to-medium data; slower on large groupbys/joins      | Often much faster, especially on large datasets           |
| **Parallelism**             | Limited; many ops are single-threaded                              | Built-in multi-threading by default                       |
| **Memory format**           | NumPy-based, columnar (Arrow-backed dtypes optional)               | Apache Arrow, columnar                                    |
| **Memory efficiency**       | Can be higher overhead, especially with object columns             | Generally more memory-efficient                           |
| **Data types**              | Flexible; `object` dtype common                                    | Strict, explicit dtypes                                   |
| **Missing values**          | Uses `NaN`, `None`, nullable dtypes added later                    | Native null support via Arrow                             |
| **Index**                   | Central concept (powerful but sometimes confusing)                 | No index (explicit columns only)                          |
| **API style**               | Imperative, step-by-step                                           | Expression-based, declarative                             |
| **Lazy evaluation**         | ❌ No                                                              | ✅ Yes                                                    |
| **Query optimization**      | ❌ No                                                              | ✅ Yes                                                    |
| **Streaming / out-of-core** | Limited                                                            | Supported (especially with lazy mode)                     |
| **String performance**      | Often slower (`object` strings)                                    | Very fast (Arrow strings)                                 |
| **Time series support**     | Very mature                                                        | Solid and improving                                       |
| **Ecosystem support**       | Massive (default for ML, stats, viz)                               | Growing, but smaller                                      |
| **Learning curve**          | Low (widely taught)                                                | Moderate (different mental model)                         |
| **Interoperability**        | Native to most Python data tools                                   | Easy conversion to/from pandas                            |
| **Typical use cases**       | Data exploration, ML prep, teaching, quick analysis                | ETL pipelines, large data, performance-critical workflows |
| **Maturity**                | Very mature                                                        | Newer but rapidly evolving                                |


## Using Polars


### Installing Polars

To install Polars, run pip in the terminal or command prompt of the environment your Jupyter kernel uses:

```
pip install polars
```


In [19]:
import polars as pl
import datetime as dt

df_polars = pl.DataFrame(
    {
        "developer": [
            "Alice Chen",
            "Brian Patel",
            "Carlos Gomez",
            "Diana Nguyen",
        ],
        "hire_date": [
            dt.date(2019, 6, 1),
            dt.date(2020, 9, 15),
            dt.date(2018, 3, 22),
            dt.date(2021, 1, 10),
        ],
        "weekly_commits": [45, 30, 60, 25],
        "hours_worked": [40, 38, 45, 35],
    }
)

df_polars

developer,hire_date,weekly_commits,hours_worked
str,date,i64,i64
"""Alice Chen""",2019-06-01,45,40
"""Brian Patel""",2020-09-15,30,38
"""Carlos Gomez""",2018-03-22,60,45
"""Diana Nguyen""",2021-01-10,25,35


Check the type of the `df_polars` object to confirm it's a Polars DataFrame:


In [20]:
type(df_polars)

polars.dataframe.frame.DataFrame

### Comparing Polars and Pandas DataFrames


In [21]:
import pandas as pd
import datetime as dt

df_pandas = pd.DataFrame(
    {
        "developer": [
            "Alice Chen",
            "Brian Patel",
            "Carlos Gomez",
            "Diana Nguyen",
        ],
        "hire_date": [
            "2019-06-01",
            "2020-09-15",
            "2018-03-22",
            "2021-01-10",
        ],
        "weekly_commits": [45, 30, 60, 25],
        "hours_worked": [40, 38, 45, 35],
    }
)

df_pandas

,developer,hire_date,weekly_commits,hours_worked
0,Alice Chen,2019-06-01,45,40
1,Brian Patel,2020-09-15,30,38
2,Carlos Gomez,2018-03-22,60,45
3,Diana Nguyen,2021-01-10,25,35


In [22]:
type(df_pandas)

pandas.core.frame.DataFrame

The Polars DataFrame is of type `polars.dataframe.frame.DataFrame`, while the pandas DataFrame is of type `pandas.core.frame.DataFrame`. When displayed, the two look quite similar, apart from the row index that pandas adds and the dtype row that Polars prints under the column names.
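
One difference the printed tables hide: the pandas frame above was built with `hire_date` as plain strings, while the Polars frame used `datetime.date` objects. A minimal sketch of converting the pandas column to real dates, using a shortened copy of the data:

```python
import pandas as pd

# Shortened copy of the pandas frame; hire_date starts out as strings.
df_pandas = pd.DataFrame({"hire_date": ["2019-06-01", "2020-09-15"]})

# pandas does not parse date strings automatically, so convert explicitly.
df_pandas["hire_date"] = pd.to_datetime(df_pandas["hire_date"])

print(df_pandas.dtypes)
print(df_pandas["hire_date"].dt.year)
```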

Where Polars differs most is its expression API: a transformation is described with composable expressions built from `pl.col(...)`, which Polars then evaluates.

For example, the following `select` computes each developer's hire year and a productivity score, defined here as weekly commits divided by hours worked:


In [23]:
result = df_polars.select(
    pl.col("developer"),
    pl.col("hire_date").dt.year().alias("hire_year"),
    (pl.col("weekly_commits") / pl.col("hours_worked")).alias("productivity_score"),
)
result

developer,hire_year,productivity_score
str,i32,f64
"""Alice Chen""",2019,1.125
"""Brian Patel""",2020,0.789474
"""Carlos Gomez""",2018,1.333333
"""Diana Nguyen""",2021,0.714286


To keep the existing columns and append the new ones, use `with_columns` instead of `select`:


In [24]:
result = df_polars.with_columns(
    # Existing columns are kept automatically; only the new ones are listed.
    pl.col("hire_date").dt.year().alias("hire_year"),
    (pl.col("weekly_commits") / pl.col("hours_worked")).alias("productivity_score"),
)
result

developer,hire_date,weekly_commits,hours_worked,hire_year,productivity_score
str,date,i64,i64,i32,f64
"""Alice Chen""",2019-06-01,45,40,2019,1.125
"""Brian Patel""",2020-09-15,30,38,2020,0.789474
"""Carlos Gomez""",2018-03-22,60,45,2018,1.333333
"""Diana Nguyen""",2021-01-10,25,35,2021,0.714286


### Filtering

Filtering uses expressions as well. For example, to keep only the developers whose productivity score is greater than 1:


In [25]:
result = df_polars.filter((pl.col("weekly_commits") / pl.col("hours_worked")) > 1)
result

developer,hire_date,weekly_commits,hours_worked
str,date,i64,i64
"""Alice Chen""",2019-06-01,45,40
"""Carlos Gomez""",2018-03-22,60,45


You can also pass several predicates as separate arguments. Polars combines them with a logical AND, which avoids chaining `&` and wrapping each condition in parentheses as pandas requires. An OR condition still needs the `|` operator.


In [26]:
result = df_polars.filter(
    pl.col("hire_date").is_between(dt.date(2018, 1, 1), dt.date(2019, 12, 31)),
    pl.col("hours_worked") >= 40,
)
result

developer,hire_date,weekly_commits,hours_worked
str,date,i64,i64
"""Alice Chen""",2019-06-01,45,40
"""Carlos Gomez""",2018-03-22,60,45
