# Intro to Polars

Polars is a _lightning_ fast DataFrame library written in Rust that uses Apache Arrow for its columnar and table-like containers.

## Apache Arrow

From the [Apache Arrow documentation](https://arrow.apache.org/overview/):

> Apache Arrow defines a language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware like CPUs and GPUs. The Arrow memory format also supports zero-copy reads for lightning-fast data access without serialization overhead.

Polars stores its data in this Arrow columnar memory format, which is well suited to analytical operations that scan and aggregate whole columns.

## Philosophy

From the [Polars documentation](https://pola-rs.github.io/polars-book/user-guide/):

The goal of Polars is to provide a lightning fast DataFrame library that:
* Utilizes all available cores on your machine.
* Optimizes queries to reduce unneeded work/memory allocations.
* Handles datasets much larger than your available RAM.
* Has an API that is consistent and predictable.
* Has a strict schema (data-types should be known before running the query).

Polars is written in Rust, which gives it C/C++ performance and allows it to fully control performance-critical parts in a query engine.

As such Polars goes to great lengths to:
* Reduce redundant copies.
* Traverse memory cache efficiently.
* Minimize contention in parallelism.
* Process data in chunks.
* Reuse memory allocations.

## Installing and using Polars

Polars is available on [PyPI](https://pypi.org/project/polars/) and can be installed with the following command:

```bash
pip install polars
```

Then import it in your project:

```python
import polars as pl
```


When this notebook was run, the install command above fetched Polars 0.18.13.

## Polars vs Pandas

The purpose of this repo is to learn Polars through operations already familiar from Pandas, comparing the syntax of both libraries and, in some cases, SQL.

The main difference is that **Polars can natively parallelize the processing of operations on multiple cores**, whereas Pandas runs most operations on a single core and relies on separate tools (like Dask) to spread work across cores.

Another difference is that **Polars includes a lazy API**: operations are not executed until the result is actually needed. With big datasets, deferring execution this way saves time and resources, because no work is done before the result is requested.

From now on, whenever there is a comparison between Pandas and Polars, I will use the following header:

> ### pd vs pl
