# Introduction to Polars
Several Python libraries can be used to analyse large datasets. The most popular is Pandas (https://pandas.pydata.org/), which is widely used in data science.

A modern alternative to Pandas is Polars (https://pola.rs/), which is written in Rust, runs operations in parallel and is generally considered faster and more memory-efficient than Pandas. If you are not already invested in Pandas, there is no reason not to start with Polars as your data manipulation tool.

#### Import Polars

In [1]:
import polars as pl

### DataFrames
Polars works with DataFrames. A DataFrame can be compared to a spreadsheet with a number of columns each containing the same number of rows. Each column contains data of a given datatype (e.g. string, date, float).

#### Create a dataframe
A dataframe can be created in different ways, e.g. from a Python dictionary, from a file, from a URL, etc.

In [15]:
# create dataframe from dictionary
grades = pl.DataFrame(
    {
        'subject': ['maths', 'physics', 'biology', 'chemistry'],
        'grade': [4.5, 5.5, 5.5, 5],
        'teacher': ['Bernoulli', 'Einstein', 'Darwin', 'Laue']
    }
)

grades

subject,grade,teacher
str,f64,str
"""maths""",4.5,"""Bernoulli"""
"""physics""",5.5,"""Einstein"""
"""biology""",5.5,"""Darwin"""
"""chemistry""",5.0,"""Laue"""


In [31]:
# create dataframe from csv file
path = 'data/pokemon.csv'
pokemons = pl.read_csv(path, infer_schema_length=None)

pokemons # displays first 5 and last 5 rows; shape provides information about the number of rows and columns

abilities,against_bug,against_dark,against_dragon,against_electric,against_fairy,against_fight,against_fire,against_flying,against_ghost,against_grass,against_ground,against_ice,against_normal,against_poison,against_psychic,against_rock,against_steel,against_water,attack,base_egg_steps,base_happiness,base_total,capture_rate,classfication,defense,experience_growth,height_m,hp,japanese_name,name,percentage_male,pokedex_number,sp_attack,sp_defense,speed,type1,type2,weight_kg,generation,is_legendary
str,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,i64,i64,i64,i64,str,str,i64,i64,f64,i64,str,str,f64,i64,i64,i64,i64,str,str,f64,i64,i64
"""['Overgrow', 'Chlorophyll']""",1.0,1.0,1.0,0.5,0.5,0.5,2.0,2.0,1.0,0.25,1.0,2.0,1.0,1.0,2.0,1.0,1.0,0.5,49,5120,70,318,"""45""","""Seed Pokémon""",49,1059860,0.7,45,"""Fushigidaneフシギダネ""","""Bulbasaur""",88.1,1,65,65,45,"""grass""","""poison""",6.9,1,0
"""['Overgrow', 'Chlorophyll']""",1.0,1.0,1.0,0.5,0.5,0.5,2.0,2.0,1.0,0.25,1.0,2.0,1.0,1.0,2.0,1.0,1.0,0.5,62,5120,70,405,"""45""","""Seed Pokémon""",63,1059860,1.0,60,"""Fushigisouフシギソウ""","""Ivysaur""",88.1,2,80,80,60,"""grass""","""poison""",13.0,1,0
"""['Overgrow', 'Chlorophyll']""",1.0,1.0,1.0,0.5,0.5,0.5,2.0,2.0,1.0,0.25,1.0,2.0,1.0,1.0,2.0,1.0,1.0,0.5,100,5120,70,625,"""45""","""Seed Pokémon""",123,1059860,2.0,80,"""Fushigibanaフシギバナ""","""Venusaur""",88.1,3,122,120,80,"""grass""","""poison""",100.0,1,0
"""['Blaze', 'Solar Power']""",0.5,1.0,1.0,1.0,0.5,1.0,0.5,1.0,1.0,0.5,2.0,0.5,1.0,1.0,1.0,2.0,0.5,2.0,52,5120,70,309,"""45""","""Lizard Pokémon""",43,1059860,0.6,39,"""Hitokageヒトカゲ""","""Charmander""",88.1,4,60,50,65,"""fire""",,8.5,1,0
"""['Blaze', 'Solar Power']""",0.5,1.0,1.0,1.0,0.5,1.0,0.5,1.0,1.0,0.5,2.0,0.5,1.0,1.0,1.0,2.0,0.5,2.0,64,5120,70,405,"""45""","""Flame Pokémon""",58,1059860,1.1,58,"""Lizardoリザード""","""Charmeleon""",88.1,5,80,65,80,"""fire""",,19.0,1,0
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
"""['Beast Boost']""",0.25,1.0,0.5,2.0,0.5,1.0,2.0,0.5,1.0,0.25,0.0,1.0,0.5,0.0,0.5,1.0,0.5,1.0,101,30720,0,570,"""25""","""Launch Pokémon""",103,1250000,9.2,97,"""Tekkaguyaテッカグヤ""","""Celesteela""",,797,107,101,61,"""steel""","""flying""",999.9,7,1
"""['Beast Boost']""",1.0,1.0,0.5,0.5,0.5,2.0,4.0,1.0,1.0,0.25,1.0,1.0,0.5,0.0,0.5,0.5,0.5,0.5,181,30720,0,570,"""255""","""Drawn Sword Pokémon""",131,1250000,0.3,59,"""Kamiturugiカミツルギ""","""Kartana""",,798,59,31,109,"""grass""","""steel""",0.1,7,1
"""['Beast Boost']""",2.0,0.5,2.0,0.5,4.0,2.0,0.5,1.0,0.5,0.5,1.0,2.0,1.0,1.0,0.0,1.0,1.0,0.5,101,30720,0,570,"""15""","""Junkivore Pokémon""",53,1250000,5.5,223,"""Akuzikingアクジキング""","""Guzzlord""",,799,97,53,43,"""dark""","""dragon""",888.0,7,1
"""['Prism Armor']""",2.0,2.0,1.0,1.0,1.0,0.5,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,0.5,1.0,1.0,1.0,107,30720,0,600,"""3""","""Prism Pokémon""",101,1250000,2.4,97,"""Necrozmaネクロズマ""","""Necrozma""",,800,127,89,79,"""psychic""",,230.0,7,1


#### Selecting columns
Specific columns can be selected by column name using the *select* method. Multiple columns can be selected using a list.

In [32]:
grades.select('teacher') # select column 'teacher'

teacher
str
"""Bernoulli"""
"""Einstein"""
"""Darwin"""
"""Laue"""


In [33]:
pokemons.select(['name', 'classfication']) # select columns 'name' and 'classfication'

name,classfication
str,str
"""Bulbasaur""","""Seed Pokémon"""
"""Ivysaur""","""Seed Pokémon"""
"""Venusaur""","""Seed Pokémon"""
"""Charmander""","""Lizard Pokémon"""
"""Charmeleon""","""Flame Pokémon"""
…,…
"""Celesteela""","""Launch Pokémon"""
"""Kartana""","""Drawn Sword Pokémon"""
"""Guzzlord""","""Junkivore Pokémon"""
"""Necrozma""","""Prism Pokémon"""


#### Adding new columns
Columns can be added using the *with_columns* method. A new column can be computed from existing columns, and the *alias* method gives it a name.

In [34]:
# Add a column with the sum of 'attack' and 'defense'
(pokemons
    .select(['attack', 'defense'])
    .with_columns((pl.col('attack') + pl.col('defense')).alias('att+def'))
)

attack,defense,att+def
i64,i64,i64
49,49,98
62,63,125
100,123,223
52,43,95
64,58,122
…,…,…
101,103,204
181,131,312
101,53,154
107,101,208


#### Sorting
The dataframe can be sorted based on the values in a column.

In [46]:
grades.sort('teacher') # sort by teacher's name

subject,grade,teacher
str,f64,str
"""maths""",4.5,"""Bernoulli"""
"""biology""",5.5,"""Darwin"""
"""physics""",5.5,"""Einstein"""
"""chemistry""",5.0,"""Laue"""


In [47]:
grades.sort('subject', descending=True) # sort by subject name (descending)

subject,grade,teacher
str,f64,str
"""physics""",5.5,"""Einstein"""
"""maths""",4.5,"""Bernoulli"""
"""chemistry""",5.0,"""Laue"""
"""biology""",5.5,"""Darwin"""


#### Filtering rows
Specific rows can be selected based on a condition using the *filter* method. 

In [35]:
grades.filter(pl.col('subject') != 'physics') # select all rows where the subject is not physics

subject,grade,teacher
str,f64,str
"""maths""",4.5,"""Bernoulli"""
"""biology""",5.5,"""Darwin"""
"""chemistry""",5.0,"""Laue"""


In [36]:
grades.filter(pl.col('grade') > 5) # select all rows with grades greater than 5

subject,grade,teacher
str,f64,str
"""physics""",5.5,"""Einstein"""
"""biology""",5.5,"""Darwin"""


In [37]:
# select all rows where 'attack' is greater than 'defense'
(pokemons
    .select(['name', 'attack', 'defense'])
    .filter(pl.col('attack') > pl.col('defense'))
)

name,attack,defense
str,i64,i64
"""Charmander""",52,43
"""Charmeleon""",64,58
"""Charizard""",104,78
"""Weedle""",35,30
"""Beedrill""",150,40
…,…,…
"""Pheromosa""",137,37
"""Xurkitree""",89,71
"""Kartana""",181,131
"""Guzzlord""",101,53


#### Operations on columns
Many operations reduce the values in a column to a single value, e.g. sum, mean, etc.

In [38]:
grades.select(pl.mean('grade')) # calculate the average grade

grade
f64
5.125


In [40]:
# calculate the average of 'attack' and the sum of 'defense'
pokemons.select([pl.mean('attack'), pl.sum('defense')])

attack,defense
f64,i64
77.857678,58480


#### Grouping and aggregating
Data can often be grouped based on the values in one column. Aggregation then performs a calculation for each group.

In [2]:
# dataframe containing grades for four subjects in the order in which they were received
gradebook = pl.DataFrame(
    {
        'subject': ['P', 'M', 'C', 'B', 'M', 'C', 'P', 'B', 'M', 'P', 'M', 'C', 'B'],
        'grade': [5.5, 4.5, 4.5, 5, 5, 4.5, 3.5, 6, 5, 4, 4.5, 5, 5.5]
    }
)

# calculate the average per subject
mean_grades = (gradebook
 .group_by('subject')
 .agg(pl.col('grade').mean())
)

mean_grades

subject,grade
str,f64
"""P""",4.333333
"""B""",5.5
"""M""",4.75
"""C""",4.666667


In [52]:
# group by 'type1', then calculate mean attack and defense per group, sort by mean attack
(pokemons
    .select(['name', 'type1', 'attack', 'defense'])
    .group_by('type1')
    .agg(pl.col('attack').mean(), pl.col('defense').mean())
    .sort('attack')
)

type1,attack,defense
str,f64,f64
"""fairy""",62.111111,68.166667
"""psychic""",65.566038,69.264151
"""flying""",66.666667,65.0
"""bug""",70.125,70.847222
"""electric""",70.820513,61.820513
…,…,…
"""rock""",90.666667,96.266667
"""steel""",93.083333,120.208333
"""ground""",94.8125,83.90625
"""fighting""",99.178571,66.392857


#### Plotting data
Data can be visualised using most plotting libraries. There are some shortcuts built into Polars which are based on Altair (https://altair-viz.github.io/gallery/index.html).

In [6]:
# diagram for the maths grades
(gradebook
    .filter(pl.col('subject') == 'M')
    .with_row_index('test_no', offset=1)
    .plot.line(
        x='test_no',
        y='grade'
    )
)

In [99]:
# For a nicer version we use the more advanced formatting options of the underlying library (Altair)
import altair as alt

M_grades = (gradebook
    .filter(pl.col('subject') == 'M')
    .with_row_index('test_no', offset=1)
)

# restrict the y-axis to the range 4 to 6 and make the chart zoomable
alt.Chart(M_grades).mark_line().encode(
    alt.Y('grade:Q').scale(domain=(4, 6), clamp=True),
    x='test_no'
).interactive()

In [105]:
# bar chart showing how often each 'type1' value occurs
(pokemons
 .select('type1')
 .group_by('type1')
 .agg(pl.len().alias('frequency'))
 .plot.bar(
     x='type1',
     y='frequency'
 )
)