# Introduction to Polars
There are several Python libraries that can be used to analyse large datasets. The most popular one is Pandas (https://pandas.pydata.org/), which is used for countless projects in data science.

A modern alternative to Pandas is Polars (https://pola.rs/), which is written in Rust and is generally considered faster and more memory-efficient than Pandas. If you are not already invested in Pandas, Polars is a reasonable first choice for data manipulation.

#### Import Polars

In [None]:
import polars as pl

### DataFrames
Polars works with DataFrames. A DataFrame can be compared to a spreadsheet with a number of columns each containing the same number of rows. Each column contains data of a given datatype (e.g. string, date, float).

#### Create a dataframe
A dataframe can be created in different ways, e.g. from a Python dictionary, from a file or from a URL.

In [None]:
# create dataframe from dictionary
grades = pl.DataFrame(
    {
        'subject': ['maths', 'physics', 'biology', 'chemistry'],
        'grade': [4.5, 5.5, 5.5, 5],
        'teacher': ['Bernoulli', 'Einstein', 'Darwin', 'Laue']
    }
)

grades

In [None]:
# create dataframe from csv file
path = 'data/pokemon.csv'
pokemons = pl.read_csv(path, infer_schema_length=None)

pokemons # displays first 5 and last 5 rows; shape provides information about the number of rows and columns

#### Selecting Columns
Specific columns can be selected by column name using the *select* method. Multiple columns can be selected using a list.

In [None]:
grades.select('teacher') # select column 'teacher'

In [None]:
pokemons.select(['name', 'classfication']) # select columns 'name' and 'classfication'

#### Adding new columns
Columns can be added using the *with_columns* method. A new column can be computed from existing columns, and the *alias* method gives it a name.

In [None]:
# Add a column with the sum of 'attack' and 'defense'
(pokemons
    .select(['attack', 'defense'])
    .with_columns((pl.col('attack') + pl.col('defense')).alias('att+def'))
)

#### Sorting
The dataframe can be sorted based on the values in a column.

In [None]:
grades.sort('teacher') # sort by teacher's name

In [None]:
grades.sort('subject', descending=True) # sort by subject name (descending)

#### Filtering rows
Specific rows can be selected based on a condition using the *filter* method. 

In [None]:
grades.filter(pl.col('subject') != 'physics') # select all rows where the subject is not physics

In [None]:
grades.filter(pl.col('grade') > 5) # select all rows with grades greater than 5

In [None]:
# select all rows where 'attack' is greater than 'defense'
(pokemons
    .select(['name', 'attack', 'defense'])
    .filter(pl.col('attack') > pl.col('defense'))
)

#### Operations on columns
There are many operations acting on the values in a column, e.g. sum, mean, etc.

In [None]:
grades.select(pl.mean('grade')) # calculate the average grade

In [None]:
# calculate the average of 'attack' and the sum of 'defense'
pokemons.select([pl.mean('attack'), pl.sum('defense')])

#### Grouping and aggregating
Data can often be grouped based on the values in one column. Aggregation then performs a calculation for each group.

In [None]:
# dataframe contains grades for four subjects in the order in which they were received
gradebook = pl.DataFrame(
    {
        'subject': ['P', 'M', 'C', 'B', 'M', 'C', 'P', 'B', 'M', 'P', 'M', 'C', 'B'],
        'grade': [5.5, 4.5, 4.5, 5, 5, 4.5, 3.5, 6, 5, 4, 4.5, 5, 5.5]
    }
)

# calculate the average per subject
mean_grades = (gradebook
 .group_by('subject')
 .agg(pl.col('grade').mean())
)

mean_grades

In [None]:
# group by 'type1', then calculate mean attack and defense per group, sort by mean attack
(pokemons
    .select(['name', 'type1', 'attack', 'defense'])
    .group_by('type1')
    .agg(pl.col('attack').mean(), pl.col('defense').mean())
    .sort('attack')
)

#### Plotting data
Data can be visualised using most plotting libraries. There are some shortcuts built into Polars which are based on Altair (https://altair-viz.github.io/gallery/index.html).

In [None]:
# diagram for the maths grades
(gradebook
    .filter(pl.col('subject') == 'M')
    .with_row_index('test_no', offset=1)
    .plot.line(
        x='test_no',
        y='grade'
    )
)

In [None]:
# For a nicer version we use the more advanced formatting options of the underlying library (Altair)
import altair as alt

M_grades = (gradebook
    .filter(pl.col('subject') == 'M')
    .with_row_index('test_no', offset=1)
)

alt.Chart(M_grades).mark_line().encode(
    alt.Y('grade:Q').scale(domain=(4, 6), clamp=True),
    x='test_no'
).interactive()

In [None]:
# bar chart showing how often each 'type1' value occurs
(pokemons
 .select('type1')
 .group_by('type1')
 .agg(pl.len().alias('frequency'))
 .plot.bar(
     x='type1',
     y='frequency'
 )
)