In [2]:
import pandas as pd
import polars as pl

In [2]:
!pip show polars

Name: polars
Version: 0.20.5
Summary: Blazingly fast DataFrame library
Home-page: 
Author: 
Author-email: Ritchie Vink <ritchie46@gmail.com>
License: 
Location: /Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages
Requires: 
Required-by: 


### Basic functionality

We can import data in much the same way as in pandas. Here the file is tab-separated and missing values are coded as `NA`.

In [3]:
ac_pheno = pl.read_csv("ac_pheno.txt",separator="\t",null_values="NA")

In [4]:
ac_pheno

PIT,Length,Weight,Tank,Sex,Site
i64,i64,i64,i64,str,i64
919540,465,1514,1,"""U""",1
918025,455,1250,1,"""U""",1
917803,405,937,1,"""U""",1
918763,505,2667,4,"""M""",2
917365,500,2204,4,"""U""",2
916380,520,2336,4,"""U""",2
9186524,535,3065,4,"""U""",2
915778,490,1774,4,"""U""",2
916993,435,1426,3,"""U""",1
916238,475,1545,3,"""U""",1


In [5]:
ac_pheno.shape

(2862, 6)

In [6]:
ac_pheno.schema

OrderedDict([('PIT', Int64),
             ('Length', Int64),
             ('Weight', Int64),
             ('Tank', Int64),
             ('Sex', String),
             ('Site', Int64)])

Conveniently, `polars` largely uses the same naming conventions as `pandas`.

In [7]:
ac_pheno.describe()

describe,PIT,Length,Weight,Tank,Sex,Site
str,f64,f64,f64,f64,str,f64
"""count""",2862.0,2838.0,2845.0,2852.0,"""2852""",2852.0
"""null_count""",0.0,24.0,17.0,10.0,"""10""",10.0
"""mean""",952706.21768,463.836152,1648.48225,2.934432,,1.461781
"""std""",534319.113939,39.908756,529.221099,1.155825,,0.498625
"""min""",915581.0,305.0,306.0,1.0,"""M""",1.0
"""25%""",916798.0,440.0,1286.0,2.0,,1.0
"""50%""",918002.0,465.0,1600.0,3.0,,1.0
"""75%""",919318.0,490.0,1950.0,4.0,,2.0
"""max""",9202514.0,590.0,3942.0,4.0,"""U""",2.0


Next we look at how to apply different kinds of expressions to `polars` dataframes.

Let's start by selecting one or more columns.

In [8]:
ac_pheno.select('Site')

Site
i64
1
1
1
2
2
2
2
2
1
1


In [9]:
ac_pheno.select(["Weight","Sex"])

Weight,Sex
i64,str
1514,"""U"""
1250,"""U"""
937,"""U"""
2667,"""M"""
2204,"""U"""
2336,"""U"""
3065,"""U"""
1774,"""U"""
1426,"""U"""
1545,"""U"""


A more flexible alternative is to pass a `pl.col` expression to `select`:

In [10]:
ac_pheno.select(pl.col(["Sex","Length"]))

Sex,Length
str,i64
"""U""",465
"""U""",455
"""U""",405
"""M""",505
"""U""",500
"""U""",520
"""U""",535
"""U""",490
"""U""",435
"""U""",475


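The expression form is more flexible because it allows further manipulation, such as sorting. Note that `sort()` inside `select` sorts each column independently and places nulls first, which is why the first rows below are all null.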

In [11]:
ac_pheno.select(pl.col(["Sex","Length"]).sort())

Sex,Length
str,i64
,
,
,
,
,
,
,
,
,
,


Similarly, we can extract the rows that meet a given threshold.

In [12]:
ac_pheno.filter(pl.col("Weight") > 1600)

PIT,Length,Weight,Tank,Sex,Site
i64,i64,i64,i64,str,i64
918763,505,2667,4,"""M""",2
917365,500,2204,4,"""U""",2
916380,520,2336,4,"""U""",2
9186524,535,3065,4,"""U""",2
915778,490,1774,4,"""U""",2
919647,525,2424,3,"""M""",1
917332,495,2118,3,"""U""",1
918936,490,2317,3,"""M""",1
917426,505,1889,3,"""M""",1
9168104,485,1840,3,"""U""",1


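We can chain multiple operations as follows. Note that `min()` returns the minimum of each column separately, so the result is not necessarily a row that exists in the data.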

In [13]:
ac_pheno.filter(pl.col("Weight") > 1600).min()

PIT,Length,Weight,Tank,Sex,Site
i64,i64,i64,i64,str,i64
915582,415,1601,1,"""M""",1


Polars lets us gain valuable insights into our data through aggregations. These go hand in hand with the `group_by` function. Rows where the grouping column is null form a group of their own.

In [14]:
ac_pheno.group_by("Site").agg([
    pl.mean("Weight").alias("Mean_weight_per_site"),
    pl.var("Weight").alias("Weight_variance_per_site"),
    pl.len().alias("Number_of_records_per_site")
])

Site,Mean_weight_per_site,Weight_variance_per_site,Number_of_records_per_site
i64,f64,f64,u32
,,,10
1.0,1445.198953,150555.098571,1535
2.0,1884.334093,326924.6193,1317


### Joins 

In [15]:
ac_pedigree = pl.read_csv("ac_ped.txt",separator="\t",null_values="NA")

In [16]:
ac_pedigree

Id,Sire,Dam,Year_Class,Selected_gen
i64,str,str,i64,i64
478665,"""0""","""0""",2013,7
478620,"""0""","""0""",2013,7
478601,"""02F49B""","""01FD38""",2013,7
478656,"""02F49B""","""01FD38""",2013,7
478671,"""02F49B""","""01FD38""",2013,7
478651,"""02F49B""","""01FD38""",2013,7
478660,"""0""","""0""",2013,7
478667,"""02F49B""","""01FD38""",2013,7
478649,"""02F49B""","""01FD38""",2013,7
478661,"""02F49B""","""01FD38""",2013,7


The syntax for performing joins in polars is pretty straightforward, arguably even more so than in pandas.

Let's start with a left join.

In [17]:
ac_pedigree.join(ac_pheno, left_on="Id", right_on="PIT",how="left")

Id,Sire,Dam,Year_Class,Selected_gen,Length,Weight,Tank,Sex,Site
i64,str,str,i64,i64,i64,i64,i64,str,i64
478665,"""0""","""0""",2013,7,,,,,
478620,"""0""","""0""",2013,7,,,,,
478601,"""02F49B""","""01FD38""",2013,7,,,,,
478656,"""02F49B""","""01FD38""",2013,7,,,,,
478671,"""02F49B""","""01FD38""",2013,7,,,,,
478651,"""02F49B""","""01FD38""",2013,7,,,,,
478660,"""0""","""0""",2013,7,,,,,
478667,"""02F49B""","""01FD38""",2013,7,,,,,
478649,"""02F49B""","""01FD38""",2013,7,,,,,
478661,"""02F49B""","""01FD38""",2013,7,,,,,


Similarly, if we want an inner join:

In [18]:
ac_pedigree.join(ac_pheno, left_on="Id", right_on="PIT",how="inner")

Id,Sire,Dam,Year_Class,Selected_gen,Length,Weight,Tank,Sex,Site
i64,str,str,i64,i64,i64,i64,i64,str,i64
916577,"""597579""","""479801""",2017,8,455,1556,3,"""U""",1
915812,"""597579""","""479801""",2017,8,430,1339,3,"""M""",1
915812,"""597579""","""479801""",2017,8,430,1339,3,"""M""",1
916294,"""597579""","""479801""",2017,8,505,2038,2,"""M""",1
916246,"""597579""","""479801""",2017,8,415,1213,1,"""U""",1
916009,"""597579""","""479801""",2017,8,455,1435,2,"""U""",1
916104,"""597579""","""479801""",2017,8,455,1499,1,"""U""",1
916518,"""597579""","""479801""",2017,8,470,1661,2,"""M""",1
916274,"""597579""","""479801""",2017,8,485,1858,1,"""M""",1
916506,"""597579""","""479801""",2017,8,490,2113,3,"""M""",1


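Compared to pandas, semi- and anti-joins are more straightforward in polars. A semi-join keeps the rows of the left frame that have a match in the right frame, and an anti-join keeps those that do not; neither adds columns from the right frame.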

In [19]:
ac_pedigree.join(ac_pheno, left_on="Id", right_on="PIT",how="semi")

Id,Sire,Dam,Year_Class,Selected_gen
i64,str,str,i64,i64
916577,"""597579""","""479801""",2017,8
915812,"""597579""","""479801""",2017,8
916294,"""597579""","""479801""",2017,8
916246,"""597579""","""479801""",2017,8
916009,"""597579""","""479801""",2017,8
916104,"""597579""","""479801""",2017,8
916518,"""597579""","""479801""",2017,8
916274,"""597579""","""479801""",2017,8
916506,"""597579""","""479801""",2017,8
916042,"""597579""","""479801""",2017,8


In [20]:
ac_pedigree.join(ac_pheno, left_on="Id", right_on="PIT",how="anti")

Id,Sire,Dam,Year_Class,Selected_gen
i64,str,str,i64,i64
478665,"""0""","""0""",2013,7
478620,"""0""","""0""",2013,7
478601,"""02F49B""","""01FD38""",2013,7
478656,"""02F49B""","""01FD38""",2013,7
478671,"""02F49B""","""01FD38""",2013,7
478651,"""02F49B""","""01FD38""",2013,7
478660,"""0""","""0""",2013,7
478667,"""02F49B""","""01FD38""",2013,7
478649,"""02F49B""","""01FD38""",2013,7
478661,"""02F49B""","""01FD38""",2013,7


### Pivots

In [36]:
ac_tank_long = ac_pheno.pivot(index="Site", columns=["Tank"],
                              values="Weight", aggregate_function="mean")

In [37]:
ac_tank_long

Site,1,4,3,null,2
i64,f64,f64,f64,f64,f64
1.0,1412.042596,,1446.13936,,1476.640873
2.0,,1884.334093,,,
,,,,,


### Melts

In [45]:
ac_tank_long.melt(id_vars=["Site"],
                  value_vars=["1", "2", "3", "4"],
                  variable_name="Tank",
                  value_name="Mean_weight")

Site,Tank,Mean_weight
i64,str,f64
1.0,"""1""",1412.042596
2.0,"""1""",
,"""1""",
1.0,"""2""",1476.640873
2.0,"""2""",
,"""2""",
1.0,"""3""",1446.13936
2.0,"""3""",
,"""3""",
1.0,"""4""",


### Lazy Polars

In `polars`, operations can also be specified through the lazy API, in which case the code runs only when a result is requested. This lets `polars` optimize query execution and memory usage, which is key when analyzing datasets that don't fit in memory. The central object here is the `LazyFrame`.

In [21]:
ac_pheno_lazy = pl.LazyFrame(ac_pheno)

In [22]:
ac_pheno_lazy

In [23]:
lazy_expression = (
    ac_pheno_lazy
    .with_columns(pl.col("Weight").sqrt().alias("Weight_sqrt"))
    .filter(pl.col("Length") > 550)
)

In [24]:
lazy_expression

In [25]:
print(lazy_expression.explain())

 WITH_COLUMNS:
 [col("Weight").sqrt().alias("Weight_sqrt")]
  DF ["PIT", "Length", "Weight", "Tank"]; PROJECT */6 COLUMNS; SELECTION: "[(col(\"Length\")) > (550)]"


In [26]:
lazy_expression.collect()

PIT,Length,Weight,Tank,Sex,Site,Weight_sqrt
i64,i64,i64,i64,str,i64,f64
918831,560,1477,2,"""U""",1,38.431758
917763,580,3460,4,"""U""",2,58.821765
916343,560,3232,4,"""M""",2,56.850682
917532,555,3080,4,"""U""",2,55.497748
915904,590,3561,4,"""M""",2,59.674115
920482,560,3134,4,"""U""",2,55.98214
917075,560,2873,4,"""M""",2,53.600373
919961,555,3568,4,"""M""",2,59.732738
917388,580,1728,4,"""U""",2,41.569219
920263,585,3675,4,"""M""",2,60.621778


Polars can help us handle files too large to fit in memory. For such files we use functions like `scan_csv`.

In [27]:
ac_pheno_large = pl.scan_csv("ac_pheno.txt",separator="\t",null_values="NA")

Unlike `read_csv`, `scan_csv` does not read the entire file into memory. As you might have guessed, it returns a `LazyFrame`.

In [28]:
ac_pheno_large

In [29]:
ac_pheno_large.schema

OrderedDict([('PIT', Int64),
             ('Length', Int64),
             ('Weight', Int64),
             ('Tank', Int64),
             ('Sex', String),
             ('Site', Int64)])

In [30]:
ac_pheno_large.filter(pl.col("Weight") > 1000)

As you can see above, the query is not actually executed. To run it we need to call `collect`, as already shown.

In [31]:
ac_pheno_large.filter(pl.col("Weight") > 1000).collect()

PIT,Length,Weight,Tank,Sex,Site
i64,i64,i64,i64,str,i64
919540,465,1514,1,"""U""",1
918025,455,1250,1,"""U""",1
918763,505,2667,4,"""M""",2
917365,500,2204,4,"""U""",2
916380,520,2336,4,"""U""",2
9186524,535,3065,4,"""U""",2
915778,490,1774,4,"""U""",2
916993,435,1426,3,"""U""",1
916238,475,1545,3,"""U""",1
916890,475,1408,3,"""U""",1


### Exercises