<a href="https://colab.research.google.com/github/ad17171717/YouTube-Tutorials/blob/main/Data%20Science%20with%20Python/Data_Science_with_Python!_An_Introduction_to_Polars.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
#check if polars is installed
#if running in Windows, in PowerShell run: "pip freeze | Select-String polars"
!pip freeze | grep polars

polars==0.17.3


In [None]:
#install polars/pandas if needed
!pip install polars
!pip install pandas



In [None]:
import polars as pl
import pandas as pd

#modules for plotting
import plotly.express as px
import seaborn as sns

# **Polars**

**Polars is a DataFrame library for manipulating structured data. The core is written in Rust, but the library is also available in Python and NodeJS. Polars is built upon the safe Arrow2 implementation of the Apache Arrow specification.**

**Key features of Polars include:**


    
- **Out of Core: Polars supports out-of-core data transformation with its streaming API, allowing you to process your results without requiring all your data to be in memory at the same time.**

- **Parallel: Polars fully utilises the power of your machine by dividing the workload among the available CPU cores without any additional configuration.**

- **Vectorized Query Engine: Polars uses Apache Arrow, a columnar data format, to process your queries in a vectorized manner. It uses SIMD to optimize CPU usage.**

<sup>Source: [Polars Documentation](https://www.pola.rs/)</sup>

## **Downloading the Data Set**

In [None]:
#download the data from NASA's website
#the URL is quoted so the shell does not treat the "?" as a wildcard
!curl "https://data.nasa.gov/api/views/gh4g-9sfh/rows.csv?accessType=DOWNLOAD" > meteors.csv

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 3859k    0 3859k    0     0   816k      0 --:--:--  0:00:04 --:--:--  918k


# **Comparing Polars to pandas**

## **timeit Information**

| **Symbol**  | **Name**  | **Conversion Rate**  |
|---|---|---|
| ns  | nanosecond  | 1 second = 1,000,000,000 nanoseconds  |
| µs  | microsecond  | 1 second = 1,000,000 microseconds  |
| ms  | millisecond  | 1 second = 1,000 milliseconds  |
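
**The `%%timeit` results in this notebook use the units in the table above. As a minimal sketch, the same kind of measurement can be taken with the standard-library `timeit` module. The helper `format_seconds` and the timed statement are illustrative assumptions, and this sketch reports the best repeat rather than the mean ± std. dev. that `%%timeit` prints.**

```python
import timeit

# Seconds per unit, matching the symbols in the table above.
UNITS = [("ns", 1e-9), ("µs", 1e-6), ("ms", 1e-3), ("s", 1.0)]

def format_seconds(seconds: float) -> str:
    """Format a duration with the largest unit that keeps the value >= 1."""
    chosen_symbol, chosen_scale = UNITS[0]
    for symbol, scale in UNITS:
        if seconds >= scale:
            chosen_symbol, chosen_scale = symbol, scale
    return f"{seconds / chosen_scale:.3g} {chosen_symbol}"

# Time a small statement: 5 repeats of 1,000 loops each, keep the best repeat.
loops = 1_000
best_total = min(timeit.repeat("sum(range(100))", number=loops, repeat=5))
per_loop = best_total / loops
print(f"best of 5: {format_seconds(per_loop)} per loop")
```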

## **Reading in a Data Set**

In [None]:
polars_df = pl.read_csv('meteors.csv')
polars_df.head()

name,id,nametype,recclass,mass (g),fall,year,reclat,reclong,GeoLocation
str,i64,str,str,f64,str,i64,f64,f64,str
"""Aachen""",1,"""Valid""","""L5""",21.0,"""Fell""",1880,50.775,6.08333,"""(50.775, 6.083…"
"""Aarhus""",2,"""Valid""","""H6""",720.0,"""Fell""",1951,56.18333,10.23333,"""(56.18333, 10.…"
"""Abee""",6,"""Valid""","""EH4""",107000.0,"""Fell""",1952,54.21667,-113.0,"""(54.21667, -11…"
"""Acapulco""",10,"""Valid""","""Acapulcoite""",1914.0,"""Fell""",1976,16.88333,-99.9,"""(16.88333, -99…"
"""Achiras""",370,"""Valid""","""L6""",780.0,"""Fell""",1902,-33.16667,-64.95,"""(-33.16667, -6…"


In [None]:
#create pandas DataFrame for comparison
pandas_df = pd.read_csv('meteors.csv')

In [None]:
%%timeit
polars_df = pl.read_csv('meteors.csv')

43.8 ms ± 4.91 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [None]:
%%timeit
pandas_df = pd.read_csv('meteors.csv')

114 ms ± 50.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


## **Selecting Columns within a DataFrame**

### **Single Column**

In [None]:
polars_df['id']

id
i64
1
2
6
10
370
379
390
392
398
417


In [None]:
%%timeit
polars_df['id']

1.6 µs ± 400 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)


In [None]:
%%timeit
pandas_df['id']

2.38 µs ± 135 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


### **Multiple Columns**

In [None]:
polars_df[['id','mass (g)']]

id,mass (g)
i64,f64
1,21.0
2,720.0
6,107000.0
10,1914.0
370,780.0
379,4239.0
390,910.0
392,30000.0
398,1620.0
417,1440.0


In [None]:
%%timeit
polars_df[['id','mass (g)']]

4.18 µs ± 586 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


In [None]:
%%timeit
pandas_df[['id','mass (g)']]

883 µs ± 133 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


## **Sorting the Dataset**

In [None]:
polars_df.sort(by='year',descending=True)

name,id,nametype,recclass,mass (g),fall,year,reclat,reclong,GeoLocation
str,i64,str,str,f64,str,i64,f64,f64,str
"""Northwest Afri…",57150,"""Valid""","""CK6""",55.0,"""Found""",2101,0.0,0.0,"""(0.0, 0.0)"""
"""Chelyabinsk""",57165,"""Valid""","""LL5""",100000.0,"""Fell""",2013,54.81667,61.11667,"""(54.81667, 61.…"
"""Northwest Afri…",57166,"""Valid""","""Martian (sherg…",30.0,"""Found""",2013,0.0,0.0,"""(0.0, 0.0)"""
"""Northwest Afri…",57258,"""Valid""","""Angrite""",46.2,"""Found""",2013,0.0,0.0,"""(0.0, 0.0)"""
"""Northwest Afri…",57268,"""Valid""","""Achondrite-ung…",45.8,"""Found""",2013,0.0,0.0,"""(0.0, 0.0)"""
"""Northwest Afri…",57420,"""Valid""","""H4""",916.0,"""Found""",2013,0.0,0.0,"""(0.0, 0.0)"""
"""Northwest Afri…",57421,"""Valid""","""LL6""",517.0,"""Found""",2013,0.0,0.0,"""(0.0, 0.0)"""
"""Northwest Afri…",57422,"""Valid""","""LL6""",246.0,"""Found""",2013,0.0,0.0,"""(0.0, 0.0)"""
"""Northwest Afri…",57423,"""Valid""","""H4""",459.0,"""Found""",2013,0.0,0.0,"""(0.0, 0.0)"""
"""Northwest Afri…",57425,"""Valid""","""L5""",611.0,"""Found""",2013,0.0,0.0,"""(0.0, 0.0)"""


In [None]:
%%timeit
polars_df.sort(by='year',descending=True)

6.6 ms ± 127 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [None]:
%%timeit
pandas_df.sort_values(by='year',ascending=False)

7.65 ms ± 1.62 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)


## **Running Calculations**

### **Calculating the Average of a Column**

In [None]:
print(f'The average mass for the meteors in the dataset is {polars_df["mass (g)"].mean():,.2f}')

The average mass for the meteors in the dataset is 13,278.08


In [None]:
%%timeit
polars_df["mass (g)"].mean()

22.5 µs ± 645 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)


In [None]:
%%timeit
pandas_df["mass (g)"].mean()

196 µs ± 7.34 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


### **Calculating the Median of a Column**

In [None]:
print(f'The median mass for the meteors in the dataset is {polars_df["mass (g)"].median():,.2f}')

The median mass for the meteors in the dataset is 32.60


In [None]:
%%timeit
polars_df["mass (g)"].median()

3.29 ms ± 124 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [None]:
%%timeit
pandas_df["mass (g)"].median()

893 µs ± 107 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


## **Plotting Data within a Polars DataFrame**

**Polars 0.17.3, the version used in this notebook, does not provide a built-in `plot` method for DataFrames, so we will use the plotly module to graph data from our Polars DataFrame.**

### **Scatterplot**

In [None]:
fig = px.scatter(polars_df, x='year', y='mass (g)',color='fall')
fig.show()

## **Lazy Evaluation with Polars**

**Polars supports two modes of operation: lazy and eager. In the eager API the query is executed immediately, while in the lazy API the query is only evaluated when it is explicitly collected with `collect()`.**

**The lazy API is most useful when a query starts directly from a file, because the query optimizer may reduce the amount of data read from that file. This can lower the load on memory and CPU, allowing larger data sets to be processed and speeding up the read time.**

<sup>Source: [Lazy API](https://pola-rs.github.io/polars/user-guide/lazy/using/) from Polars Documentation site</sup>

In [None]:
#build a lazy query; nothing is read from the file until collect() is called
q1 = (pl.scan_csv('meteors.csv')
    .filter(pl.col('year') > 2000))

#run the query and materialize the result as a DataFrame
lazy_df = q1.collect()

In [None]:
lazy_df['year'].unique()

year
i64
2001
2002
2003
2004
2005
2006
2007
2008
2009
2010


# **References and Additional Learning**

## **Articles**

- **[Python Polars: A Lightning-Fast DataFrame Library](https://realpython.com/polars-python/) by Harrison Hoffman from Real Python**

## **Documentation**

- **[Polars Documentation](https://www.pola.rs/)**

- **[Lazy API](https://pola-rs.github.io/polars/user-guide/lazy/using/) from Polars Documentation site**

# **Connect**
- **Feel free to connect with Adrian on [YouTube](https://www.youtube.com/channel/UCPuDxI3xb_ryUUMfkm0jsRA), [LinkedIn](https://www.linkedin.com/in/adrian-dolinay-frm-96a289106/), [X](https://twitter.com/DolinayG), [GitHub](https://github.com/ad17171717), [Medium](https://adriandolinay.medium.com/) and [Odysee](https://odysee.com/@adriandolinay:0). Happy coding!**