# CSV files 2: multiple files
By the end of this lecture you will be able to:
- read multiple CSV files with a glob pattern
- read multiple CSV files from a list
- read multiple CSV files in lazy mode
- automate CSV discovery in sub-directories

We import Python's built-in `pathlib` module to work with multiple file paths and create sub-directories

In [None]:
from pathlib import Path

import polars as pl

pl.Config.set_tbl_rows(6)

We need a dataset with multiple CSV files that share the same schema for this notebook.

We create multiple CSV files from the Titanic dataset in a new directory.

We begin by reading in the full CSV

In [None]:
csvFile = "../data/titanic.csv"

In [None]:
df = pl.read_csv(csvFile)

We create a new sub-directory in this directory.

We use the `mkdir` method of a `Path` object to create this new sub-directory

In [None]:
# Path to the new directory
csvDirectory = Path("data_files/csv/multiple_csv")
# Create the new directory if it doesn't already exist
csvDirectory.mkdir(parents=True,exist_ok=True)
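
To see what the two arguments to `mkdir` do, here is a minimal sketch that works in a throwaway temporary directory rather than in the course's `data_files` folder. The `nested` path and the variable names are illustrative only.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical nested path inside a throwaway directory
    nested = Path(tmp) / "data_files" / "csv" / "multiple_csv"

    # parents=True creates the missing intermediate directories
    nested.mkdir(parents=True, exist_ok=True)

    # exist_ok=True makes a repeated call a no-op
    nested.mkdir(parents=True, exist_ok=True)
    created = nested.is_dir()

    # Without exist_ok=True a repeated call raises FileExistsError
    try:
        nested.mkdir()
        raised = False
    except FileExistsError:
        raised = True

print(created, raised)
```

This is why the cell above can be re-run safely even when the directory already exists.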

We split the `DataFrame` and write the two new files to the sub-directory

In [None]:
df[:700].write_csv(csvDirectory / "train.csv")
df[700:].write_csv(csvDirectory / "test.csv")

## Reading CSVs in eager mode
### Reading multiple files with wildcard patterns

We can read multiple CSV files with the same schema using wildcard patterns

In [None]:
(
    pl.read_csv(csvDirectory / "*.csv")
    .head(2)
)

The files are read in alphabetical order where `test` comes before `train`.

If you prefer to work with raw file paths, the cell above is equivalent to

In [None]:
(
    pl.read_csv("data_files/csv/multiple_csv/*.csv")
    .head(2)
)

#### What happens when we use the wildcard pattern `*`?
When we use the wildcard pattern `*`, Polars calls `scan_csv` and runs a lazy query to combine all the matching files. This is an automatic version of the lazy mode we see below!

However, unlike the manual version we see below, we do not have access to the query optimiser. This means that if we follow `read_csv` with, for example, a `filter` method, each file is still read in full and only then is the `filter` applied

In [None]:
(
    pl.read_csv("data_files/csv/multiple_csv/*.csv")
    .filter(pl.col("Pclass") == 1)
    .head(2)
)

### Reading from a list of file paths

If we have a list of file paths we can also read each file ourselves and combine the resulting `DataFrames` with `pl.concat`.

#### Making a file path generator

In this example we call `glob` on the `csvDirectory` `Path` object to make `filePathsGenerator`.

The `filePathsGenerator` object is a Python `generator`. We can loop through a generator like a list, and each step produces the next element. If you are not familiar with generators [check out the excellent Whirlwind Tour of Python](https://jakevdp.github.io/WhirlwindTourOfPython/12-generators.html) for an introduction

In [None]:
filePathsGenerator = csvDirectory.glob("*csv")
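
Note that the pattern `*csv` has no dot, so it matches any name that ends in `csv`, not only names with a `.csv` extension. The sketch below illustrates the difference with `fnmatch` from the standard library, which applies similar shell-style wildcard rules to a single name. The file names are made up for illustration.

```python
from fnmatch import fnmatch

# Hypothetical file names in a directory
names = ["train.csv", "test.csv", "notes_csv", "train.csv.bak"]

# "*csv" matches anything that ends in "csv"
no_dot = [name for name in names if fnmatch(name, "*csv")]

# "*.csv" requires the name to end in ".csv"
with_dot = [name for name in names if fnmatch(name, "*.csv")]

print(no_dot)
print(with_dot)
```

In our directory both patterns find the same two files, so the choice only matters when other files sit alongside the CSVs.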

#### Iterating through the generator
We iterate through the files from the generator. 

If we want to re-run this cell we need to re-run the cell above first to reset `filePathsGenerator` to the start of the file paths

In [None]:
(
    pl.concat(
        [pl.read_csv(csvPath) for csvPath in filePathsGenerator]
    )
    .head(3)
)
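
The need to reset the generator can be shown without Polars. This is a minimal sketch using temporary files with made-up contents: a second pass over the same generator yields nothing, whereas a list built from the generator can be reused.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    # Create two small placeholder CSV files
    for name in ("train.csv", "test.csv"):
        (tmp_dir / name).write_text("a,b\n1,2\n")

    # A generator can only be consumed once
    paths_generator = tmp_dir.glob("*.csv")
    first_pass = sorted(p.name for p in paths_generator)
    second_pass = sorted(p.name for p in paths_generator)

    # Materialising the paths in a list allows repeated iteration
    paths_list = sorted(tmp_dir.glob("*.csv"))
    reuse_1 = [p.name for p in paths_list]
    reuse_2 = [p.name for p in paths_list]

print(first_pass, second_pass)
print(reuse_1 == reuse_2)
```

If you expect to loop over the paths more than once, wrapping the `glob` call in `list` or `sorted` avoids having to re-run the earlier cell.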

## Scanning CSVs in lazy mode

### Scanning multiple files with a wildcard
We can scan multiple CSV files in a directory in lazy mode using a wildcard

In [None]:
print(
    pl.scan_csv(csvDirectory / "*.csv")
    .filter(pl.col("Age") > 50)
    .describe_optimized_plan()
)

The plan shows us that Polars:
- carries out the normal `CSV SCAN` on each file including the `SELECTION` on `Age` to create a chunk in memory for each file
- connects the chunks internally to a single object in `UNION`
- does a `RECHUNK` to combine the chunks from each file into a single chunk for the `DataFrame`

> The `RECHUNK` is optional. In some queries it may be faster to set `rechunk = False` in `pl.scan_csv`

We evaluate this plan on all the CSVs with `collect`

In [None]:
(
    pl.scan_csv(csvDirectory / "*.csv")
    .filter(pl.col("Age") > 50)
    .collect()
)

### Scanning from a list of file paths in lazy mode
We can also create a list of scanned CSV files in lazy mode.

We re-define the `filePathsGenerator` in this cell so that we can iterate through it again

In [None]:
filePathsGenerator = csvDirectory.glob("*csv")
queriesList = [
    pl.scan_csv(csvPath) for csvPath in filePathsGenerator
]
queriesList

The `queriesList` is a `list` of `LazyFrames`.

Polars can evaluate a `list` of `LazyFrames` with `pl.collect_all`.  The output is a `list` of `DataFrames`

To return the output as a single `DataFrame` we call:
- `pl.collect_all` on the `list` to return a `list` of `DataFrames`
- `pl.concat` to combine the `list` of `DataFrames` into a single `DataFrame`

In [None]:
pl.concat(
    pl.collect_all(queriesList)
)

When you call `pl.collect_all` Polars runs `collect` on each `LazyFrame` in the list in parallel.

For large datasets we can use streaming with `streaming = True` in `pl.collect_all`. In this case Polars reads any large CSVs in batches.

## Discovering file paths
In some cases we want an easy way to find all the CSVs in sub-directories.

We can use PyArrow in this case. While using PyArrow isn't necessary in this simple example, it is handy with more complicated directory structures

In [None]:
import pyarrow.dataset as ds

dataset = ds.dataset(
    csvDirectory,
    format="csv"
)

We list the files that PyArrow has found

In [None]:
dataset.files

We can then read these files in eager mode by:
- letting PyArrow turn them into an Arrow table and
- creating a Polars `DataFrame` from the Arrow table with zero-copy

In [None]:
(
    pl.from_arrow(
        dataset.to_table()
    )
    .head(3)
)

With PyArrow we can do manual optimisations such as limiting the columns or applying a row filter in the arguments of `to_table`

In [None]:
(
    pl.from_arrow(
        dataset.to_table(
            columns=["Pclass", "Age"],
            filter=ds.field("Age") > 70,
        )
    )
    .head(3)
)

See the PyArrow docs for more info on the `dataset` object: https://arrow.apache.org/docs/python/dataset.html
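
If you only need the discovery step and not the PyArrow reader, one possible alternative is `Path.rglob` from the standard library, which searches sub-directories recursively. The sketch below builds a small temporary directory tree with placeholder files; the directory and file names are assumptions chosen to mirror a date-partitioned layout.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "daily_nyc"
    # Build a hypothetical date-partitioned layout with placeholder CSVs
    for day in ("2022-01-01", "2022-01-02"):
        day_dir = root / day
        day_dir.mkdir(parents=True)
        (day_dir / "daily.csv").write_text("a\n1\n")

    # rglob walks every sub-directory below root
    found = sorted(
        p.relative_to(root).as_posix() for p in root.rglob("*.csv")
    )

print(found)
```

The resulting paths could then be passed to `pl.read_csv` or `pl.scan_csv` in a list comprehension, as in the sections above.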

## So which approach should you use?
Each of these approaches will work, but these are my opinions for general cases:
- If you want to read all files into memory with no query optimisations use `pl.read_csv`
- Use a wildcard if you can specify the files using a wildcard
- Use a list if you want more control over which files you read
- Use PyArrow if you have a more complicated directory structure

## Exercises
In the exercises you will develop your understanding of:
- reading multiple CSV files in eager mode
- reading multiple CSV files in lazy mode
- reading CSVs with PyArrow

### Exercise 1
The NYC taxi dataset CSV has 1000 rows containing records from different days.

### Set-up
We transform this CSV into a set of partitioned CSVs in sub-directories. 

We first set the path to the full CSV

In [None]:
nycCsvFile = "../data/nyc_trip_data_1k.csv"

We now:
- read the CSV
- add a column that records the date from the `pickup` datetime
- partition the `DataFrame` into a dictionary that maps dates to the `DataFrame` for that date

In [None]:
dailyDfDict = (
    pl.read_csv(nycCsvFile,parse_dates=True)
    .with_columns(
        pl.col("pickup").dt.truncate("1d").dt.strftime("%Y-%m-%d").alias("pickup_day")
    )
    .partition_by("pickup_day",as_dict=True)
)


The keys of the `dailyDfDict` are the string dates for each day

In [None]:
dailyDfDict.keys()

The value for each key is a `DataFrame` for that date

In [None]:
dailyDfDict['2022-01-01'].head(3)

We now create a partitioned directory called `daily_nyc` for the data.

The name of each sub-directory is a date.

The content of each sub-directory is the CSV for that date

In [None]:
# Path to the new directory
nycCsvDirectory = Path("data_files/csv/daily_nyc")

# Create the new directory if it doesn't already exist
nycCsvDirectory.mkdir(parents=True,exist_ok=True)

# Loop through each date
for day, df in dailyDfDict.items():
    # Create a Path object for that date
    dailyDirectory = (nycCsvDirectory / day)
    # Create the sub-directory for that date
    dailyDirectory.mkdir(parents=True,exist_ok=True)
    # Write a CSV called daily.csv
    df.write_csv(dailyDirectory / "daily.csv")


We list the contents of `daily_nyc` to see the sub-directories for each date

In [None]:
ls data_files/csv/daily_nyc/

We list the contents of one sub-directory to show the CSV

In [None]:
ls data_files/csv/daily_nyc/2022-01-01/

### Now on to the exercise!

Read all the CSV files in eager mode using a path with wildcards for the final directory name

In [None]:
(
    pl.read_csv(
        "data_files/csv/daily_nyc<blank>
    )
)

Read the CSV files in eager mode using:
- a `glob` and a `generator`
- a concatenation of the list of `DataFrames`

In [None]:
nycFilePathsGenerator = nycCsvDirectory<blank>

Read all the CSV files in lazy mode using a path with wildcards for the final directory name

Read all the CSVs in lazy mode **between 2022-01-01 and 2022-01-09** inclusive

- Scan the required `DataFrames` by iterating through the generator
- Call `collect_all` to evaluate all the `LazyFrames`
- `concat` all the `DataFrames`

If you want a hint about filtering the dates expand the cell below

In [None]:
# Hint: in an `if` statement, convert the `csvPath` to a string with `csvPath.as_posix()`
# and check if "2022-01-0" is in the string

### Exercise 2
Create a PyArrow `dataset` object with all the CSVs

List all the CSV files in the dataset

Read all the files into a Polars `DataFrame`

## Solutions

### Solution to exercise 1

Read all the CSV files in eager mode using a path with wildcards for the final directory name

In [None]:
pl.read_csv("data_files/csv/daily_nyc/*/daily.csv")

Read the CSV files in eager mode using:
- a `glob` and a `generator`
- a concatenation of the list of `DataFrames`

In [None]:
filePathsGenerator = nycCsvDirectory.glob("*/*.csv")
(
    pl.concat(
        [pl.read_csv(csvPath) for csvPath in filePathsGenerator]
    )
).shape

Read all the CSV files in lazy mode using a path with wildcards for the final directory name

In [None]:
(
    pl.scan_csv("data_files/csv/daily_nyc/*/daily.csv")
    .collect()
)

Read all the CSVs in lazy mode **between 2022-01-01 and 2022-01-09** inclusive

- Scan the required `DataFrames` by iterating through the generator
- Call `collect_all` to evaluate all the `LazyFrames`
- `concat` all the `DataFrames`

In [None]:
# Hint: in an `if` statement, convert the `csvPath` to a string with `csvPath.as_posix()`
# and check if "2022-01-0" is in the string

In [None]:
nycFilePathsGenerator = nycCsvDirectory.glob("*/daily.csv")
(
    pl.concat(
        pl.collect_all(
            [pl.scan_csv(csvPath) for csvPath in nycFilePathsGenerator if "2022-01-0" in csvPath.as_posix()]
        )
    )
).shape
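
The substring check works here because every day from 01 to 09 shares the prefix `2022-01-0`. For an arbitrary date range, one option is to parse the directory name as a date and compare it with the bounds. The sketch below shows only the path-filtering step, with no files read; the helper `in_date_range` and the example paths are hypothetical.

```python
from datetime import date
from pathlib import PurePosixPath

# Hypothetical paths following the daily_nyc/<date>/daily.csv layout
paths = [
    PurePosixPath(f"data_files/csv/daily_nyc/2022-01-{day:02d}/daily.csv")
    for day in (1, 9, 10, 31)
]


def in_date_range(csv_path, start, end):
    """Return True if the parent directory name is a date in [start, end]."""
    day = date.fromisoformat(csv_path.parent.name)
    return start <= day <= end


selected = [
    p for p in paths
    if in_date_range(p, date(2022, 1, 1), date(2022, 1, 9))
]
selected_days = [p.parent.name for p in selected]
print(selected_days)
```

The same condition could replace the `if "2022-01-0" in csvPath.as_posix()` clause in the list comprehension above.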

### Solution to exercise 2
Create a PyArrow `dataset` object with all the CSVs

In [None]:
dataset = ds.dataset(nycCsvDirectory,format="csv")

List all the CSV files in the dataset

In [None]:
dataset.files

Read all the files into a Polars `DataFrame`

In [None]:
(
    pl.from_arrow(
        dataset.to_table()
    )
    .head(3)
)