# Getting Started With Polars

***Polars*** is a DataFrame library written entirely in Rust. It has several features worth mentioning:

1. Polars does not use an index for the dataframe
2. Polars represents data internally as *Apache Arrow arrays*, which are efficient in terms of load time, memory usage, and computation (Pandas 2.0 can also do this)
3. As Polars is written in Rust, it can run many operations in parallel
4. Polars supports *lazy evaluation* (similar to PySpark)

In [4]:
import polars as pl

## Creating Polars DataFrame

In [3]:
df = pl.DataFrame(
    {
         'Model': ['iPhone X','iPhone XS','iPhone 12',
                   'iPhone 13','Samsung S11','Samsung S12',
                   'Mi A1','Mi A2'],
         'Sales': [80,170,130,205,400,30,14,8],     
         'Company': ['Apple','Apple','Apple','Apple',
                     'Samsung','Samsung','Xiao Mi','Xiao Mi'],
     }
)

df

Model,Sales,Company
str,i64,str
"""iPhone X""",80,"""Apple"""
"""iPhone XS""",170,"""Apple"""
"""iPhone 12""",130,"""Apple"""
"""iPhone 13""",205,"""Apple"""
"""Samsung S11""",400,"""Samsung"""
"""Samsung S12""",30,"""Samsung"""
"""Mi A1""",14,"""Xiao Mi"""
"""Mi A2""",8,"""Xiao Mi"""


Polars expects column header names to be strings. Passing integer keys raises an error:

In [5]:
df2 = pl.DataFrame(
     {
        0 : [1,2,3],
        1 : [80,170,130],
     }
)

ValueError: Series name must be a string.

In [6]:
df2 = pl.DataFrame(
     {
        '0' : [1,2,3],
        '1' : [80,170,130],
     }
)

Displaying datatypes

In [10]:
df.dtypes

[Utf8, Int64, Utf8]

Displaying column names

In [11]:
df.columns

['Model', 'Sales', 'Company']

Getting the content of the dataframe as a list of tuples

In [12]:
df.rows()

[('iPhone X', 80, 'Apple'),
 ('iPhone XS', 170, 'Apple'),
 ('iPhone 12', 130, 'Apple'),
 ('iPhone 13', 205, 'Apple'),
 ('Samsung S11', 400, 'Samsung'),
 ('Samsung S12', 30, 'Samsung'),
 ('Mi A1', 14, 'Xiao Mi'),
 ('Mi A2', 8, 'Xiao Mi')]

## Selecting Column(s)

In [13]:
df.select(
    'Model'
)

Model
str
"""iPhone X"""
"""iPhone XS"""
"""iPhone 12"""
"""iPhone 13"""
"""Samsung S11"""
"""Samsung S12"""
"""Mi A1"""
"""Mi A2"""


To select multiple columns, pass in a list of column names:

In [15]:
df.select(
    ['Model', 'Sales']
).head()

Model,Sales
str,i64
"""iPhone X""",80
"""iPhone XS""",170
"""iPhone 12""",130
"""iPhone 13""",205
"""Samsung S11""",400


Retrieving all columns with a specific datatype

In [20]:
df.select(
    pl.col(pl.Utf8)
)

Model,Company
str,str
"""iPhone X""","""Apple"""
"""iPhone XS""","""Apple"""
"""iPhone 12""","""Apple"""
"""iPhone 13""","""Apple"""
"""Samsung S11""","""Samsung"""
"""Samsung S12""","""Samsung"""
"""Mi A1""","""Xiao Mi"""
"""Mi A2""","""Xiao Mi"""


The `pl.col(pl.Utf8)` call is known as an **expression** in Polars. Expressions are very powerful. For example, you can *chain* expressions together:

In [21]:
df.select(
    pl.col(['Model','Sales']).sort_by('Sales')    
)

Model,Sales
str,i64
"""Mi A2""",8
"""Mi A1""",14
"""Samsung S12""",30
"""iPhone X""",80
"""iPhone 12""",130
"""iPhone XS""",170
"""iPhone 13""",205
"""Samsung S11""",400


To combine an expression with other columns, enclose them in a list:

In [22]:
df.select(
    [pl.col(pl.Int64),'Company']
)

Sales,Company
i64,str
80,"""Apple"""
170,"""Apple"""
130,"""Apple"""
205,"""Apple"""
400,"""Samsung"""
30,"""Samsung"""
14,"""Xiao Mi"""
8,"""Xiao Mi"""


## Selecting Row(s)

To select a single row in a dataframe, pass the row number to the `row()` method:

In [24]:
df.row(0)   # get the first row

('iPhone X', 80, 'Apple')

If you need to get multiple rows, there are two ways:

1. Use the square bracket indexing method (not recommended)

In [28]:
df[:2]  # first 2 rows

Model,Sales,Company
str,i64,str
"""iPhone X""",80,"""Apple"""
"""iPhone XS""",170,"""Apple"""


In [26]:
df[[1,3]]  # second and fourth rows

Model,Sales,Company
str,i64,str
"""iPhone XS""",170,"""Apple"""
"""iPhone 13""",205,"""Apple"""


2. Use the `filter()` method with a condition

In [29]:
df.filter(
    pl.col('Company') == 'Apple'
)

Model,Sales,Company
str,i64,str
"""iPhone X""",80,"""Apple"""
"""iPhone XS""",170,"""Apple"""
"""iPhone 12""",130,"""Apple"""
"""iPhone 13""",205,"""Apple"""


In [32]:
# specify multiple conditions

df.filter(
    (pl.col('Company') == 'Apple') |
    (pl.col('Company') == 'Samsung')
)

Model,Sales,Company
str,i64,str
"""iPhone X""",80,"""Apple"""
"""iPhone XS""",170,"""Apple"""
"""iPhone 12""",130,"""Apple"""
"""iPhone 13""",205,"""Apple"""
"""Samsung S11""",400,"""Samsung"""
"""Samsung S12""",30,"""Samsung"""


## Selecting Rows and Columns

Selecting rows and columns at the same time can be done by chaining the `filter()` and `select()` methods:

In [33]:
df.filter(
    pl.col('Company') == 'Apple'
).select('Model')

Model
str
"""iPhone X"""
"""iPhone XS"""
"""iPhone 12"""
"""iPhone 13"""


In [34]:
df.filter(
    pl.col('Company') == 'Apple'
).select(['Model','Sales'])

Model,Sales
str,i64
"""iPhone X""",80
"""iPhone XS""",170
"""iPhone 12""",130
"""iPhone 13""",205
