# Introduction to Data Types
By the end of this section you will be able to:
- get the data type schema of a `DataFrame`
- get the data type of a `Series`
- explain the relationship between Polars and Apache Arrow


We look at the different data types in more detail in the Section on Data types and missing values.

In [None]:
import polars as pl

In [None]:
csvFile = "../data/titanic.csv"

In [None]:
df = pl.read_csv(csvFile)
df.head(3)

## Data type schema

Every column in a `DataFrame` has a data type called a `dtype`.

You can get a `dict`-like mapping of column names to dtypes with the `.schema` attribute.

In [None]:
df.schema

There is also a `dtypes` attribute (as in pandas). However, this gives a `list` of dtypes with no column names.

In [None]:
df.dtypes

A `Series` also has a data type attribute, `dtype`.

In [None]:
df['Name'].dtype

## Apache Arrow

A pandas `DataFrame` stores its data in underlying NumPy arrays. In Polars the data is stored in an Arrow Table.

We can see this Arrow Table by calling `to_arrow` - this is a cheap operation as it just accesses the underlying data.

In [None]:
df.to_arrow()

### What is Apache Arrow?
Apache Arrow is an open source cross-language project to store tabular data in-memory. Apache Arrow is both:
- a specification for how data should be represented in memory
- a set of libraries in different languages that implement that specification

Polars originally used the Rust library [Arrow2](https://docs.rs/arrow2/latest/arrow2/) as its implementation of the Arrow specification. More recent versions of Polars use their own implementation derived from Arrow2.

### Why does Polars use Apache Arrow?
Arrow allows for:
- sharing data without copying ("zero-copy")
- faster vectorised calculations
- working with larger-than-memory data in chunks
- consistent representation of missing data

Overall, Polars can process data more quickly and with less memory usage because of Arrow.

### How do we use Arrow in practice?
In practice **we rarely need to deal with Arrow directly** - Polars handles that for us.

The main time we might call `to_arrow` is when we want to pass the data to another library that supports Arrow. This can allow us to pass data between libraries without copying.

### So what is a Polars `DataFrame`?
One important consequence of using Arrow is that a Polars `DataFrame` doesn't hold data directly. Instead a Polars `DataFrame` holds references to an Arrow table.

As a result, when we add a new column using `with_columns` (we will see this in detail later) we create a new `DataFrame`.

In [None]:
(
    df
    .with_columns(
        pl.lit(0).alias("zeroes")
    )
)

However, creating a new `DataFrame` is a **cheap** operation as we are not copying the existing data to the new `DataFrame` - we are just copying **references** to the existing data along with the reference to the new column.

## Exercises
In the exercises you will develop your understanding of:
- getting the dtypes of a `DataFrame`
- getting the dtype of a `Series`

### Exercise 1 

What are the dtypes of this `DataFrame`?

In [None]:
df = pl.DataFrame({'a':[0,1,2],
                   'b':[0,1,2.0]
                  })
df<blank>

### Exercise 2
Create a `Series` by selecting the `a` column of `df`

In [None]:
df = pl.DataFrame({'a':[0,1,2],'b':[0,1,2.0]})
# df<blank>

What is the dtype of `a`?
What is the dtype of `b`?

## Solutions

### Solution to Exercise 1
What are the dtypes of this `DataFrame`?

In [None]:
df = pl.DataFrame({'a':[0,1,2],'b':[0,1,2.0]})
df.schema

### Solution to Exercise 2
Create a `Series` by selecting the `a` column of `df`

In [None]:
df = pl.DataFrame({'a':[0,1,2],'b':[0,1,2.0]})
s = df["a"]

In [None]:
s

`s` has a 64-bit integer dtype (`Int64`).

In [None]:
s2 = df["b"]
s2

`s2` has a 64-bit floating point dtype (`Float64`).