# Parquet files 1: Single Parquet files
By the end of this lecture you will be able to:
- read from a Parquet file
- use query optimisation to read a subset of columns
- get the schema of a Parquet file
- write a Parquet file with compression


## What is a Parquet file?
A Parquet file is a *binary* file where data is stored column by column rather than row by row. In a Parquet file:
- each column has a name and a dtype
- each column can be compressed separately, with dictionary encoding applied automatically

The Apache Parquet and Apache Arrow projects evolved together as columnar formats where Apache Parquet is the format for the data on disk and Apache Arrow is the format for the data in memory.

Compared to a CSV file, a Parquet file:
- is typically faster to read and write
- takes less space on disk, especially once compression is applied
- allows Polars to select which columns to read without parsing the full dataset

In [None]:
from pathlib import Path

import polars as pl

## Creating a Titanic Parquet file
We begin by creating a Parquet file from the Titanic CSV file

In [None]:
csvFile = "../data/titanic.csv"

We create a `titanic` directory inside the `data_files/parquet` sub-directory of the `io` directory

In [None]:
parquetFilePath = Path("data_files/parquet/titanic")
# exist_ok=True makes this a no-op if the directory already exists
parquetFilePath.mkdir(parents=True, exist_ok=True)

Now we set the path that we will write the Parquet file to

In [None]:
parquetFile = "data_files/parquet/titanic/titanic.parquet"

We read the CSV and write to the Parquet path

In [None]:
pl.read_csv(csvFile).write_parquet(parquetFile)

## Reading a Parquet file
We read the Parquet file to a `DataFrame`

In [None]:
df = pl.read_parquet(parquetFile)
df.head(3)

As a Parquet file stores the schema as metadata we can get the schema of a Parquet file without having to read any data.

In Polars we can use the `read_parquet_schema` function for this

In [None]:
pl.read_parquet_schema(parquetFile)

We see that the dtypes are preserved in a Parquet file (unlike a CSV file, where all data is stored as text and the dtypes must be inferred again on reading)

We can select a subset of columns to read from a Parquet file with the `columns` argument

In [None]:
(
    pl.read_parquet(
        parquetFile,
        columns=["Pclass","Name"]
    )
    .head(3)
)

When we work in lazy mode in Polars the query optimiser works out which columns the query needs and reads only those columns from the file

In [None]:
print(
    pl.scan_parquet(parquetFile)
    .select(["Pclass","Name"])
    .explain()
)

We can also specify a smaller number of rows that we want to read with `n_rows`

In [None]:
(
    pl.read_parquet(
        parquetFile,
        n_rows=2
    )
)

If we are running out of memory when reading a Parquet file we can specify `low_memory = True`. This can help to reduce peak memory usage at the expense of a longer load time

In [None]:
(
    pl.read_parquet(
        parquetFile,
        low_memory=True
    )
    .head(2)
)

Polars reads the Parquet file in multiple threads into different chunks of memory. By default Polars then combines all the chunks into a single chunk in parallel. With the `low_memory=True` argument Polars reduces peak memory usage by not doing this recombination in parallel.

Using `low_memory = True` will not help if the final `DataFrame` does not fit in memory. In this case using `streaming` in lazy mode is usually the better option

## Writing a Parquet file
When we write a Parquet file we can specify compression options. I recommend using `zstd` in most cases for a good balance of compressed file size on disk and read time into memory. The `lz4` option is an alternative when faster reading and writing is preferred.

In [None]:
df.write_parquet(parquetFile,compression="zstd")

We can also adjust the degree of compression with `compression_level`. The range of values depends on the compression scheme chosen - see the docstrings for details.

## Exercises
In the exercises you will develop your understanding of:
- reading and writing Parquet files
- categorical dtypes in Parquet files
- reading the schema of Parquet files
- reading a subset of columns from a Parquet file

### Exercise one
We will write a new Parquet file for the exercises to this path

In [None]:
exerciseParquetFile = "data_files/parquet/titanic/titanic_exercise.parquet"

Before we write to this file, read the Parquet file created at the start of the notebook to a `DataFrame`.

Convert the `Sex` column to `pl.Categorical`

In [None]:
df = (
    pl.read_parquet(parquetFile)
    .with_columns(<blank>)
)
df.head(3)

Write the `DataFrame` with a categorical column to `exerciseParquetFile`

Read the schema of `exerciseParquetFile` to confirm whether Parquet can preserve categorical encodings

Create a lazy query that only reads these columns
```python
["Survived","Pclass","Age","Sex"]
```

In [None]:
(
    <blank>
)

## Solutions

### Solution to exercise one
We will write a new Parquet file for the exercises to this path

In [None]:
exerciseParquetFile = "data_files/parquet/titanic/titanic_exercise.parquet"

Before we write to this file, read the Parquet file created at the start of the notebook to a `DataFrame`.

Convert the `Sex` column to `pl.Categorical`

In [None]:
df = (
    pl.read_parquet(parquetFile)
    .with_columns(pl.col("Sex").cast(pl.Categorical))
)
df.head(3)

Write the `DataFrame` with a categorical column to `exerciseParquetFile`

In [None]:
df.write_parquet(exerciseParquetFile)

Read the schema of `exerciseParquetFile` to confirm whether Parquet can preserve categorical encodings

In [None]:
pl.read_parquet_schema(exerciseParquetFile)

Create a lazy query that only reads these columns
```python
["Survived","Pclass","Age","Sex"]
```

In [None]:
(
    pl.scan_parquet(exerciseParquetFile)
    .select(["Survived","Pclass","Age","Sex"])
)