## What is a Polars `DataFrame`?
In this lecture we have a high-level look at a Polars `DataFrame` and learn:
- how to access important metadata
- how to compare schemas
- how Polars stores data with Apache Arrow
- what happens when we modify a `DataFrame`

In [1]:
import polars as pl
import numpy as np

In [2]:
csv_file = "../data/titanic.csv"

In [3]:
df = pl.read_csv(csv_file)
df.head(3)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
1,0,3,"""Braund, Mr. Owen Harris""","""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S"""
2,1,1,"""Cumings, Mrs. John Bradley (Fl…","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C"""
3,1,3,"""Heikkinen, Miss. Laina""","""female""",26.0,0,0,"""STON/O2. 3101282""",7.925,,"""S"""


A Polars `DataFrame`:
- is a tabular dataset stored in an Arrow Table (see below)
- has a height and a width
- has unique string column names
- has a data type for each column
- has methods for transforming the data stored in the Arrow Table

We can get the height (number of rows) and width (number of columns) as attributes

In [4]:
df.width

12

In [5]:
df.height

891

In [6]:
df.shape

(891, 12)

## Data type schema

Every column in a `DataFrame` has a data type called a `dtype`.

We can get a `pl.Schema` that maps column names to dtypes with the `.schema` attribute

In [8]:
df.schema

Schema([('PassengerId', Int64),
        ('Survived', Int64),
        ('Pclass', Int64),
        ('Name', String),
        ('Sex', String),
        ('Age', Float64),
        ('SibSp', Int64),
        ('Parch', Int64),
        ('Ticket', String),
        ('Fare', Float64),
        ('Cabin', String),
        ('Embarked', String)])

The `schema` has a Polars type `pl.Schema`. We can also create a `pl.Schema` manually

In [10]:
pl.Schema(
    [
        ("a", pl.Int64), 
        ("b", pl.Float64)
    ]
)

Schema([('a', Int64), ('b', Float64)])

When testing data pipelines a common task is to compare the output `schema` with an expected schema. A quick comparison with `==` tells us whether the two match, but for effective debugging it is more useful to report what the differences are.

Below we define a function that does this comparison and reports on the differences. Note that there is nothing Polars-specific about this code; it uses standard Python methods throughout

In [12]:
def compare_polars_schema(
        df_schema:pl.Schema, 
        target_schema: pl.Schema
):
    """
    Compare two pl.Schema and report on any differences
    Args:
        df_schema (pl.Schema): The schema of our DataFrame
        target_schema (pl.Schema): The target schema of our DataFrame that we are comparing to
    
    Returns:
        Dict containing comparison details, with keys indicating the type of difference
    """
    # Check if they are the same
    if df_schema == target_schema:
        return {"match": True}
    
    # Otherwise do a detailed comparison
    comparison_result = {
        "match": False,
        "differences": {}
    }
    
    # Check keys
    df_keys = set(df_schema.keys())
    target_keys = set(target_schema.keys())
    
    # Check for missing or extra keys
    missing_in_target_schema = df_keys - target_keys
    missing_in_df_schema = target_keys - df_keys
    
    if missing_in_target_schema:
        comparison_result["differences"]["keys_missing_in_target"] = list(missing_in_target_schema)
    
    if missing_in_df_schema:
        comparison_result["differences"]["keys_missing_in_df"] = list(missing_in_df_schema)
    
    # Check common keys for dtype differences
    common_keys = df_keys.intersection(target_keys)
    
    dtype_differences = {}
    for key in common_keys:
        if df_schema[key] != target_schema[key]:
            dtype_differences[key] = {
                "df_type": str(df_schema[key]),
                "target_type": str(target_schema[key])
            }
    
    if dtype_differences:
        comparison_result["differences"]["dtype_mismatches"] = dtype_differences
    
    return comparison_result

We now do an example to see what the output looks like

In [13]:
# Example usage
def schema_comparison_example():
    # Create a sample DataFrame
    df = pl.DataFrame(
        {
            "col1":[0,1],
            "col2":[0.0,1.0],
            "col3":["0","1"],
        }
    )
    df_schema = df.schema
    # Create a target with a mismatched schema compared to df
    target_schema = pl.Schema([
        ("col1", pl.Int64),
        ("col2", pl.Float32),
        ("col4", pl.Date),
    ])
    
    # Compare the schema
    comparison = compare_polars_schema(df_schema=df_schema, target_schema=target_schema)
    print(comparison)

And then we run the example

In [14]:
schema_comparison_example()

{'match': False, 'differences': {'keys_missing_in_target': ['col3'], 'keys_missing_in_df': ['col4'], 'dtype_mismatches': {'col2': {'df_type': 'Float64', 'target_type': 'Float32'}}}}


In an actual testing suite we would of course raise an `Exception` if the schema didn't match the target rather than just printing the output.

As well as `schema` there is also a `dtypes` attribute (as in Pandas). However, this gives a `list` of dtypes with no column names

In [15]:
df.dtypes

[Int64,
 Int64,
 Int64,
 String,
 String,
 Float64,
 Int64,
 Int64,
 String,
 Float64,
 String,
 String]

A `Series` also has a data type attribute

In [16]:
df['Name'].dtype

String

### Supertypes
The dtypes fall into groups:
- integers e.g. pl.Int8, pl.Int16 etc
- floats pl.Float32, pl.Float64
- string pl.String
- boolean pl.Boolean
- datetime pl.Datetime, pl.Date etc

Polars also has a concept of supertypes. Supertypes occur where we are trying to do an operation involving columns that have different types. If the dtypes of these columns have a supertype all columns are cast to that type to do the operation. 

Supertypes are defined on a given pair of dtypes rather than being universal. Here are some simple examples:
- pl.Int8 & pl.Int16 -> pl.Int16
- pl.Float32 & pl.Float64 -> pl.Float64

There are also rules in place for other combinations e.g.:
- pl.Int64 & pl.Boolean -> pl.Int64
- pl.Int32 & pl.Float32 -> pl.Float64 (following a convention set by Numpy)
- any dtype & pl.String -> pl.String (any column can be cast to string)

We see an example of a supertype in the exercises.

## Apache Arrow

A classic Pandas `DataFrame` stores its data in Numpy arrays. In Polars the data is stored in an Arrow Table. 

> I refer to *classic* Pandas meaning basically pre-version 2.0 of Pandas that was the dominant `DataFrame` for more than a decade. These days the different versions of Pandas differ so much that it becomes challenging to make comparisons of what you can do in each, especially for someone like me who has barely used Pandas in recent years.

We can see this Arrow Table by calling `to_arrow` - this is a cheap operation as it is just viewing the underlying data

In [17]:
df.to_arrow()

pyarrow.Table
PassengerId: int64
Survived: int64
Pclass: int64
Name: large_string
Sex: large_string
Age: double
SibSp: int64
Parch: int64
Ticket: large_string
Fare: double
Cabin: large_string
Embarked: large_string
----
PassengerId: [[1,2,3,4,5,...,54,55,56,57,58],[59,60,61,62,63,...,116,117,118,119,120],...,[830,831,832,833,834,...,884,885,886,887,888],[889,890,891]]
Survived: [[0,1,1,1,0,...,1,0,1,1,0],[1,0,0,1,0,...,0,0,0,0,0],...,[1,1,1,0,0,...,0,0,0,0,1],[0,1,0]]
Pclass: [[3,1,3,1,3,...,2,1,1,2,3],[2,3,3,1,1,...,3,3,2,1,3],...,[1,3,2,3,3,...,2,3,3,2,1],[3,1,3]]
Name: [["Braund, Mr. Owen Harris","Cumings, Mrs. John Bradley (Florence Briggs Thayer)","Heikkinen, Miss. Laina","Futrelle, Mrs. Jacques Heath (Lily May Peel)","Allen, Mr. William Henry",...,"Faunthorpe, Mrs. Lizzie (Elizabeth Anne Wilkinson)","Ostby, Mr. Engelhart Cornelius","Woolner, Mr. Hugh","Rugg, Miss. Emily","Novel, Mr. Mansouer"],["West, Miss. Constance Mirium","Goodwin, Master. William Frederick","Sirayanian, Mr. O

An Arrow Table is a collection of Arrow Arrays - these are one-dimensional vectors that are the fundamental data store. We can see the Arrow Array for a column by calling `to_arrow` on a `Series`

In [18]:
df["Age"].to_arrow()

<pyarrow.lib.DoubleArray object at 0x107a13dc0>
[
  22,
  38,
  26,
  35,
  35,
  null,
  54,
  2,
  27,
  14,
  ...
  33,
  22,
  28,
  25,
  39,
  27,
  19,
  null,
  26,
  32
]

In [19]:
np.sqrt(df["Age"]) # NumPy functions can be applied directly to a Polars Series

Age
f64
4.690416
6.164414
5.09902
5.91608
5.91608
…
5.196152
4.358899
null
5.09902


### What is Apache Arrow?
Apache Arrow is an open source cross-language project to store tabular data in-memory. Apache Arrow is both:
- a specification for how data should be represented in memory
- a set of libraries in different languages that implement that specification


### Why does `Polars` use `Apache Arrow`?
The Apache Arrow project developed when it became clear that Numpy arrays - designed for scientific computing - are not the optimal data store for tabular data.

Arrow allows for:
- a standardised way of representing data across packages and languages
- sharing data without copying between processes (known as "zero-copy")
- faster vectorised calculations
- working with larger-than-memory data in chunks
- consistent representation of missing data
- built-in support for string data
- built-in support for nested data

Overall, Polars can process data more quickly and with less memory usage because of Arrow.

### What are the downsides of `Apache Arrow`?
The design of Arrow is optimised for operations on one-dimensional columns, whereas the design of Numpy is optimised for operations on multi-dimensional arrays. This tradeoff means some kinds of operations will be slower with Arrow data compared to Numpy:
- transposing a `DataFrame`
- doing matrix multiplication/linear algebra on a `DataFrame`

For this kind of use case - where calculations require accessing data by row and column - it may be faster to convert to a Numpy array.

### So what is the relationship between a Polars `DataFrame` and Arrow data?
A Polars `DataFrame` holds references to an Arrow Table which holds references to Arrow Arrays. We can think of a Polars `DataFrame` being a lightweight object that points to the lightweight Arrow Table which points to the heavyweight Arrow Arrays (heavyweight because they hold the actual data). 

This detached structure means we can make changes to the cheap `DataFrame` wrapper and copy none (or a minimal amount) of the data in the Arrow Arrays. 

We now do some examples of how we can do quick operations because they don't change the data. For this we create a large `DataFrame` with random values (note how we can populate a `DataFrame` directly from a numpy array)

In [20]:
df_shape = (1_000_000,100)
df_polars = pl.DataFrame(
    np.random.standard_normal(df_shape)
)
df_polars.head(3)

column_0,column_1,column_2,column_3,column_4,column_5,column_6,column_7,column_8,column_9,column_10,column_11,column_12,column_13,column_14,column_15,column_16,column_17,column_18,column_19,column_20,column_21,column_22,column_23,column_24,column_25,column_26,column_27,column_28,column_29,column_30,column_31,column_32,column_33,column_34,column_35,column_36,…,column_63,column_64,column_65,column_66,column_67,column_68,column_69,column_70,column_71,column_72,column_73,column_74,column_75,column_76,column_77,column_78,column_79,column_80,column_81,column_82,column_83,column_84,column_85,column_86,column_87,column_88,column_89,column_90,column_91,column_92,column_93,column_94,column_95,column_96,column_97,column_98,column_99
f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,…,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64
0.401969,0.549111,-0.60348,-0.565217,-0.017172,-1.800017,-0.435738,1.822096,-1.725269,-0.015223,-1.401739,-0.352787,-1.179195,1.103873,-0.542957,-0.489364,1.767422,-0.376739,0.202385,-0.225297,0.833543,-1.006501,0.138366,0.314155,0.153404,-0.126023,-1.216006,-1.077871,0.547019,-1.329078,-0.26293,0.605772,-0.875689,-0.814478,0.617151,1.238756,-0.9687,…,-0.976554,-0.688245,-1.708425,-1.250049,0.185938,0.355974,-0.213328,-1.514451,-0.990595,-1.467074,0.576029,0.530613,-0.325089,1.522109,1.348162,-0.934342,-0.141484,-1.145944,0.619144,-1.218595,1.345548,-0.062254,-0.25316,0.265145,-0.084599,0.725055,0.0299,-1.406469,0.768375,-0.845368,0.578795,0.403301,0.431864,1.353614,-0.148553,2.666534,0.060067
-0.569646,-0.028009,1.401577,-0.581498,-0.532146,0.805015,1.022268,0.278073,0.94441,0.422146,-0.371931,1.657003,0.38758,0.515825,1.014199,0.501339,-0.718018,1.077795,0.760942,-0.248247,-0.841206,0.737351,1.893553,-0.740845,0.062942,0.252624,-0.563097,-0.545996,0.731725,-0.813116,0.050579,-1.142736,0.174429,1.395617,0.135473,0.157613,-0.432439,…,-1.416626,0.745029,-0.855919,2.25391,-0.019712,-0.343971,-0.388375,0.80826,0.39645,0.192193,0.006363,-0.013049,0.101021,0.408987,-0.53343,-0.508581,-1.276594,-1.548005,-1.982354,0.103259,0.242304,-1.041436,1.093893,-0.307963,0.239851,-0.511633,1.821618,1.048238,0.622409,-0.180233,-0.032442,2.093378,1.065154,0.779948,2.159051,-0.021476,-0.165806
0.124984,-0.545322,-0.593737,0.168139,-0.542188,-0.113282,-1.563304,0.552518,0.393507,0.191684,0.034226,0.381065,0.318101,0.274702,-0.22882,1.37846,1.112151,-0.322584,0.447412,0.245791,0.065148,-1.177527,-0.076449,-1.012626,-0.350088,0.885792,-1.002062,-0.522666,-1.036793,-1.075158,1.254257,1.186244,0.227165,0.0298,-0.415397,0.037935,0.894448,…,-0.221115,1.806728,0.195727,-0.234613,-0.474907,-0.213013,-2.609018,1.571085,-1.884189,1.655369,-0.031662,-0.102235,-0.1138,1.122104,-0.661939,0.752962,0.191886,0.085603,-0.532099,0.815944,0.739223,-0.1934,0.660126,-0.717946,0.973634,-0.624099,-1.949118,0.67249,1.017536,-0.224681,1.373867,-1.099503,0.722346,0.124709,-0.711741,0.350382,-0.182928


And we confirm the `DataFrame` is the correct shape

In [21]:
df_polars.shape

(1000000, 100)

### Dropping a column
We see how long it takes to drop a column from a Polars `DataFrame`. 

> We use the IPython `%%timeit` cell magic to time performance in a cell. By default `%%timeit` runs the target code many times to get statistics of how long it takes. The default number of iterations tends to be more than necessary. We can control the number of iterations with the `-n` and `-r` arguments. The total number of iterations is then n*r. Here we do 1*3 = 3 iterations

In [22]:
%%timeit -n1 -r3
df_polars.drop("column_0")

The slowest run took 16.28 times longer than the fastest. This could mean that an intermediate result is being cached.
385 μs ± 393 μs per loop (mean ± std. dev. of 3 runs, 1 loop each)


> You may get a warning about some runs being much faster than others. Generally it's best to just run a few times until you get a run with consistent timings so the warning disappears. 

Polars does this `drop` very fast (and much faster than Pandas). This is because Polars just creates a new `DataFrame` object (a cheap operation) that points to all the Arrow Arrays except `column_0`. Polars basically just loops through the list of column names for this operation!

### Renaming a column
We have a similar fast performance whenever we change some part of a `DataFrame` that does not affect the actual data in the columns. For example, if we rename a column...

In [23]:
%%timeit -n1 -r3
df_polars.rename({"column_0":"a"})

The slowest run took 19.56 times longer than the fastest. This could mean that an intermediate result is being cached.
280 μs ± 339 μs per loop (mean ± std. dev. of 3 runs, 1 loop each)


Polars again does this very fast because it just updates the column name and checks the column names are still unique.

### Cloning a `DataFrame`
The same applies if we create a new `DataFrame` by cloning

In [24]:
%%timeit -n1 -r3
df_polars.clone()

The slowest run took 14.55 times longer than the fastest. This could mean that an intermediate result is being cached.
9.15 μs ± 9.83 μs per loop (mean ± std. dev. of 3 runs, 1 loop each)


In this case Polars has created a new `DataFrame` object that points at the same Arrow Table.

### Updating a cloned `DataFrame`

Although the new and old `DataFrames` initially point at the same Arrow Table we do not need to worry about changes to one affecting the other.

If we make changes to a value in one of the `DataFrames` - say the new `DataFrame` - then the new `DataFrame` will:
- copy the data in **the column that has changed** to a new Arrow Array
- create a new Arrow Table that points to the updated Arrow Array along with the unchanged Arrow Arrays

So now we have:
- two `DataFrames` that point to:
- two Arrow Tables that point to:
- the same Arrow Arrays for the unchanged columns and different Arrow Arrays for the changed column

In this way we create a new `DataFrame` but **only ever have to copy data in columns that change**. We see how changes to the new `DataFrame` do not affect the old `DataFrame` in this example where we change the first value in the first row

In [None]:
df_polars2 = df_polars.clone()
df_polars2[0,0] = 1000
df_polars2[0,0]

In the original `DataFrame` we still have the original value

In [None]:
df_polars[0,0]

## Exercises
In the exercises you will develop your understanding of:
- getting the dtypes of a `DataFrame`
- getting the dtypes of a `Series`

### Exercise 1 

What are the dtypes of this `DataFrame`?

In [None]:
df = pl.DataFrame(
    {
        'a':[0,1,2],
        'b':[0,1,2.0]
    },
    strict=False
)
# df<blank>

Note the `strict=False` argument here: this tells Polars that if the types in one of the columns are not homogeneous then it should use the supertype

Create an expected schema where `a` is `pl.Int64` and `b` is `pl.Int64`

In [None]:
# target_schema = 

Compare the actual and expected schemas to find any differences

Correct the schema and check the comparison again

## Solutions

### Solution to Exercise 1
What are the dtypes of this `DataFrame`?

In [None]:
df = pl.DataFrame(
    {'a':[0,1,2],'b':[0,1,2.0]},
    strict=False
)
df.schema

Create an expected schema where `a` is `pl.Int64` and `b` is `pl.Int64`

In [None]:
target_schema = pl.Schema([("a", pl.Int64), ("b", pl.Int64)])

Compare the actual and expected schemas to find any differences

In [None]:
compare_polars_schema(
    df_schema=df.schema,
    target_schema=target_schema
)

Correct the schema and check the comparison again

In [None]:
target_schema = pl.Schema([("a", pl.Int64), ("b", pl.Float64)])

In [None]:
compare_polars_schema(
    df_schema=df.schema,
    target_schema=target_schema
)