# Working with multiple files

In [1]:
from pathlib import Path

import polars as pl

pl.Config.set_tbl_rows(6)

polars.config.Config

In [2]:
csv_file = "data/titanic.csv"

In [3]:
df = pl.read_csv(csv_file)

In [4]:
# Path to the new directory
csv_directory = Path("data/csv/multiple_csv")

csv_directory.mkdir(parents=True,exist_ok=True)

Split the `DataFrame` into two files: the first 700 rows go to `train.csv` and the remaining rows go to `test.csv`.

In [5]:
df[:700].write_csv(csv_directory / "train.csv")
df[700:].write_csv(csv_directory / "test.csv")

## Eager mode

### Reading multiple files with wildcard patterns

Read multiple CSV files with the same schema using a wildcard `*` pattern. The files are read in alphabetical order.

In [6]:
pl.read_csv(
    csv_directory / "*.csv"
).head(2)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
701,1,1,"""Astor, Mrs. John Jacob (Madele…","""female""",18.0,1,0,"""PC 17757""",227.525,"""C62 C64""","""C"""
702,1,1,"""Silverthorne, Mr. Spencer Vict…","""male""",35.0,0,0,"""PC 17475""",26.2875,"""E24""","""S"""


#### What happens when using the wildcard pattern `*`?

1. Makes a list of the files that match the pattern
2. Calls `scan_csv` on each file to make a list of `LazyFrames`
3. Concatenates the `LazyFrames` vertically
4. Calls `collect` to return a `DataFrame`

In other words, `read_csv` with `*` runs the lazy-mode workflow for you automatically.

### What happens if there is a potential optimization?

In [7]:
pl.read_csv(
    csv_directory / "*.csv"
).filter(
    pl.col("Pclass") == 1
).head(2)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
701,1,1,"""Astor, Mrs. John Jacob (Madele…","""female""",18.0,1,0,"""PC 17757""",227.525,"""C62 C64""","""C"""
702,1,1,"""Silverthorne, Mr. Spencer Vict…","""male""",35.0,0,0,"""PC 17475""",26.2875,"""E24""","""S"""


In eager mode, Polars reads all the CSV files into memory, concatenates them, and only then applies the filter and returns the result.

### Reading from a list of file paths

In [8]:
file_path_list = [csv_directory / "train.csv",csv_directory / "test.csv"]

pl.concat(
    [pl.read_csv(csv_path) for csv_path in file_path_list]
).head(3)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
1,0,3,"""Braund, Mr. Owen Harris""","""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S"""
2,1,1,"""Cumings, Mrs. John Bradley (Fl…","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C"""
3,1,3,"""Heikkinen, Miss. Laina""","""female""",26.0,0,0,"""STON/O2. 3101282""",7.925,,"""S"""


## Scanning CSVs in lazy mode

### Scanning multiple files with a wildcard

In [9]:
print(
    pl.scan_csv(
        csv_directory / "*.csv"
    ).filter(
        pl.col("Age") > 50
    ).explain()
)

Csv SCAN [data/csv/multiple_csv/test.csv, data/csv/multiple_csv/train.csv]
PROJECT */12 COLUMNS
SELECTION: [(col("Age")) > (50.0)]
ESTIMATED ROWS: 891


In [10]:
pl.scan_csv(
        csv_directory / "*.csv"
    ).filter(
        pl.col("Age") > 50
    ).collect()

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
715,0,2,"""Greenberg, Mr. Samuel""","""male""",52.0,0,0,"""250647""",13.0,,"""S"""
746,0,1,"""Crosby, Capt. Edward Gifford""","""male""",70.0,1,1,"""WE/P 5735""",71.0,"""B22""","""S"""
766,1,1,"""Hogeboom, Mrs. John C (Anna An…","""female""",51.0,1,0,"""13502""",77.9583,"""D11""","""S"""
…,…,…,…,…,…,…,…,…,…,…,…
685,0,2,"""Brown, Mr. Thomas William Solo…","""male""",60.0,1,1,"""29750""",39.0,,"""S"""
695,0,1,"""Weir, Col. John""","""male""",60.0,0,0,"""113800""",26.55,,"""S"""
696,0,2,"""Chapman, Mr. Charles Henry""","""male""",52.0,0,0,"""248731""",13.5,,"""S"""


## Handling variations in column names

By default, `pl.scan_csv` can't concatenate CSVs that have different column names.

In [11]:
df1 = pl.DataFrame({"int_column": [0, 1, 2]})

df2 = pl.DataFrame({"Int_Column": [3, 4]})

mismatched_column_names_path = Path('data/csv/mismatched_column_names/')

# Create the directory (and any missing parents) if it doesn't already exist
mismatched_column_names_path.mkdir(parents=True, exist_ok=True)

df1.write_csv(mismatched_column_names_path / "df1.csv")
df2.write_csv(mismatched_column_names_path / "df2.csv")

If we try to call `pl.scan_csv` with a `*`, we get a `ComputeError`.

In [12]:
pl.scan_csv(mismatched_column_names_path / 'df*.csv').collect()

ComputeError: schema names differ: got int_column, expected Int_Column

This error occurred with the following context stack:
	[1] 'csv scan'
	[2] 'sink'


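We can use the `with_column_names` parameter to fix this. It takes a function that receives the list of column names and returns the new names. Here we convert every name to lower case.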

In [13]:
pl.scan_csv(
    mismatched_column_names_path / 'df*.csv',
    with_column_names=lambda cols: [col.lower() for col in cols]
).collect()

int_column
i64
0
1
2
3
4


### Scanning from a list of file paths in lazy mode

In [14]:
files_list = [
    'data/csv/multiple_csv/train.csv',
    'data/csv/multiple_csv/test.csv'
]

queries_list = [
    pl.scan_csv(csv_path) for csv_path in files_list
]

queries_list

[<LazyFrame at 0x20C30D1B6A0>, <LazyFrame at 0x20C30E28B90>]

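`queries_list` is a `list` of `LazyFrames`.

Polars can evaluate a `list` of `LazyFrames` with `pl.collect_all`, which returns a `list` of `DataFrames`. Alternatively, as in the next cell, `pl.concat` combines the `LazyFrames` into a single query that is evaluated with one call to `collect`.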

In [15]:
pl.concat(
    queries_list
).collect().head(3)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
1,0,3,"""Braund, Mr. Owen Harris""","""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S"""
2,1,1,"""Cumings, Mrs. John Bradley (Fl…","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C"""
3,1,3,"""Heikkinen, Miss. Laina""","""female""",26.0,0,0,"""STON/O2. 3101282""",7.925,,"""S"""


## Discovering file paths

In some cases we want an easy way to find all the CSVs in sub-directories.

We can use `PyArrow` for this.

In [16]:
import pyarrow.dataset as ds

dataset = ds.dataset(
    csv_directory,
    format="csv"
)

In [17]:
dataset.files

['data/csv/multiple_csv/test.csv', 'data/csv/multiple_csv/train.csv']

In [18]:
pl.from_arrow(
    dataset.to_table()
).head(3)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
701,1,1,"""Astor, Mrs. John Jacob (Madele…","""female""",18.0,1,0,"""PC 17757""",227.525,"""C62 C64""","""C"""
702,1,1,"""Silverthorne, Mr. Spencer Vict…","""male""",35.0,0,0,"""PC 17475""",26.2875,"""E24""","""S"""
703,0,3,"""Barbara, Miss. Saiide""","""female""",18.0,0,1,"""2691""",14.4542,"""""","""C"""


In [19]:
pl.from_arrow(
    dataset.to_table(
        columns=["Pclass", "Age"],
        filter=ds.field("Age") > 70
    )
).head(3)

Pclass,Age
i64,f64
3,74.0
1,71.0
3,70.5


Use `PyArrow` when there is a more complicated directory structure.

## Exercises

### Exercise 1
The NYC taxi dataset CSV has 1000 rows containing records from different days.

### Set-up

In [20]:
nyccsv_file = "data/nyc_trip_data_1k.csv"

- read the CSV
- add a column that records the date from the `pickup` datetime
- partition the `DataFrame` into a dictionary that maps dates to the `DataFrame` for that date

In [21]:
dailyDfDict = (
    pl.read_csv(nyccsv_file, try_parse_dates=True)
    .with_columns(
        pl.col("pickup").dt.truncate("1d").dt.strftime("%Y-%m-%d").alias("pickup_day")
    )
    .partition_by(by=["pickup_day"], as_dict=True)
)


The keys of `dailyDfDict` are one-element tuples, each holding the date string for one day.

In [22]:
dailyDfDict.keys()

dict_keys([('2022-01-01',), ('2022-01-02',), ('2022-01-03',), ('2022-01-04',), ('2022-01-05',), ('2022-01-06',), ('2022-01-07',), ('2022-01-08',), ('2022-01-09',), ('2022-01-10',), ('2022-01-11',), ('2022-01-12',), ('2022-01-13',), ('2022-01-14',), ('2022-02-01',)])

The value for each key is the `DataFrame` for that date.

In [23]:
dailyDfDict[('2022-01-01',)].head(3)

VendorID,pickup,dropoff,passenger_count,trip_distance,fare_amount,tip_amount,pickup_day
str,datetime[μs],datetime[μs],f64,f64,f64,f64,str
"""id1""",2022-01-01 00:04:14,2022-01-01 00:26:12,1.0,10.83,31.0,0.0,"""2022-01-01"""
"""id2""",2022-01-01 00:32:17,2022-01-01 00:49:23,1.0,3.97,14.5,3.66,"""2022-01-01"""
"""id8""",2022-01-01 00:40:58,2022-01-01 01:00:59,4.0,8.44,25.5,0.0,"""2022-01-01"""


We now create a partitioned directory called `daily_nyc` for the data.

The name of each sub-directory is a date.

The content of each sub-directory is the CSV for that date

In [25]:
# Path to the new directory
nyccsv_directory = Path("data/csv/daily_nyc")

# Create the new directory if it doesn't already exist
nyccsv_directory.mkdir(parents=True,exist_ok=True)

# Loop through each date
for (day,), df in dailyDfDict.items():
    # Create a Path object for that date
    dailyDirectory = (nyccsv_directory / day)
    # Create the sub-directory for that date
    dailyDirectory.mkdir(parents=True,exist_ok=True)
    # Write a CSV called daily.csv
    df.write_csv(dailyDirectory / "daily.csv")


### Now on to the exercise!

Read all the CSV files in eager mode using a path with wildcards for the final directory name

In [None]:
pl.read_csv("data/csv/daily_nyc/*/daily.csv")

VendorID,pickup,dropoff,passenger_count,trip_distance,fare_amount,tip_amount,pickup_day
str,str,str,f64,f64,f64,f64,str
"""id1""","""2022-01-01T00:04:14.000000""","""2022-01-01T00:26:12.000000""",1.0,10.83,31.0,0.0,"""2022-01-01"""
"""id2""","""2022-01-01T00:32:17.000000""","""2022-01-01T00:49:23.000000""",1.0,3.97,14.5,3.66,"""2022-01-01"""
"""id8""","""2022-01-01T00:40:58.000000""","""2022-01-01T01:00:59.000000""",4.0,8.44,25.5,0.0,"""2022-01-01"""
…,…,…,…,…,…,…,…
"""id2""","""2022-01-14T18:34:11.000000""","""2022-01-14T18:39:18.000000""",3.0,0.92,5.5,2.45,"""2022-01-14"""
"""id0""","""2022-01-14T18:49:08.000000""","""2022-01-14T18:54:08.000000""",0.0,0.8,5.0,2.3,"""2022-01-14"""
"""id5""","""2022-02-01T03:00:05.000000""","""2022-02-01T03:15:08.000000""",3.0,2.62,12.0,0.0,"""2022-02-01"""


Read the CSV files in eager mode using:
- a `glob` and a `generator`
- a concatenation of the list of `DataFrames`

In [31]:
file_paths_generator = nyccsv_directory.glob("*/*.csv")

# Iterate over the generator of daily CSV paths, not the earlier Titanic file list
pl.concat(
    [pl.read_csv(csv_path) for csv_path in file_paths_generator]
).shape

(1000, 8)

Read all the CSV files in lazy mode using a path with wildcards for the final directory name

In [33]:
pl.scan_csv("data/csv/daily_nyc/*/daily.csv").collect()

VendorID,pickup,dropoff,passenger_count,trip_distance,fare_amount,tip_amount,pickup_day
str,str,str,f64,f64,f64,f64,str
"""id1""","""2022-01-01T00:04:14.000000""","""2022-01-01T00:26:12.000000""",1.0,10.83,31.0,0.0,"""2022-01-01"""
"""id2""","""2022-01-01T00:32:17.000000""","""2022-01-01T00:49:23.000000""",1.0,3.97,14.5,3.66,"""2022-01-01"""
"""id8""","""2022-01-01T00:40:58.000000""","""2022-01-01T01:00:59.000000""",4.0,8.44,25.5,0.0,"""2022-01-01"""
…,…,…,…,…,…,…,…
"""id2""","""2022-01-14T18:34:11.000000""","""2022-01-14T18:39:18.000000""",3.0,0.92,5.5,2.45,"""2022-01-14"""
"""id0""","""2022-01-14T18:49:08.000000""","""2022-01-14T18:54:08.000000""",0.0,0.8,5.0,2.3,"""2022-01-14"""
"""id5""","""2022-02-01T03:00:05.000000""","""2022-02-01T03:15:08.000000""",3.0,2.62,12.0,0.0,"""2022-02-01"""


Read all the CSVs in lazy mode **between 2022-01-01 and 2022-01-09** inclusive

- Scan the required files into `LazyFrames` by iterating through the generator
- Call `collect_all` to evaluate all the `LazyFrames`
- `concat` all the `DataFrames`

If you want a hint about filtering the dates expand the cell below

In [35]:
nycfile_paths_generator = nyccsv_directory.glob("*/daily.csv")

pl.concat(
    pl.collect_all(
        [pl.scan_csv(csv_path) for csv_path in nycfile_paths_generator if "2022-01-0" in csv_path.as_posix()]
    )
).shape

(652, 8)
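The substring test works here because every date in the range shares the prefix `2022-01-0`. For an arbitrary range, one alternative is to parse each sub-directory name as a date and compare it with the bounds. The helper `paths_in_date_range` below is a hypothetical name introduced for illustration, and the directory layout is a made-up stand-in for `data/csv/daily_nyc`.

```python
import tempfile
from datetime import date
from pathlib import Path


def paths_in_date_range(root: Path, start: date, end: date) -> list[Path]:
    """Return daily.csv paths whose parent directory name is a date in [start, end]."""
    selected = []
    for csv_path in sorted(root.glob("*/daily.csv")):
        # The sub-directory name is assumed to be an ISO date such as 2022-01-05
        day = date.fromisoformat(csv_path.parent.name)
        if start <= day <= end:
            selected.append(csv_path)
    return selected


# Hypothetical directory layout with one CSV per dated sub-directory
root = Path(tempfile.mkdtemp())
for day_name in ["2022-01-01", "2022-01-09", "2022-01-10", "2022-02-01"]:
    (root / day_name).mkdir()
    (root / day_name / "daily.csv").write_text("VendorID\nid1\n")

selected_paths = paths_in_date_range(root, date(2022, 1, 1), date(2022, 1, 9))
print([path.parent.name for path in selected_paths])  # ['2022-01-01', '2022-01-09']
```

The returned paths can then be passed to `pl.scan_csv` one by one, exactly as in the list comprehension above.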

### Exercise 2
Create a PyArrow `dataset` object with all the CSVs

In [36]:
dataset = ds.dataset(nyccsv_directory, format="csv")

List all the CSV files in the dataset

In [37]:
dataset.files

['data/csv/daily_nyc/2022-01-01/daily.csv',
 'data/csv/daily_nyc/2022-01-02/daily.csv',
 'data/csv/daily_nyc/2022-01-03/daily.csv',
 'data/csv/daily_nyc/2022-01-04/daily.csv',
 'data/csv/daily_nyc/2022-01-05/daily.csv',
 'data/csv/daily_nyc/2022-01-06/daily.csv',
 'data/csv/daily_nyc/2022-01-07/daily.csv',
 'data/csv/daily_nyc/2022-01-08/daily.csv',
 'data/csv/daily_nyc/2022-01-09/daily.csv',
 'data/csv/daily_nyc/2022-01-10/daily.csv',
 'data/csv/daily_nyc/2022-01-11/daily.csv',
 'data/csv/daily_nyc/2022-01-12/daily.csv',
 'data/csv/daily_nyc/2022-01-13/daily.csv',
 'data/csv/daily_nyc/2022-01-14/daily.csv',
 'data/csv/daily_nyc/2022-02-01/daily.csv']

Read all the files into a Polars `DataFrame`

In [38]:
pl.from_arrow(
    dataset.to_table()
).head(3)

VendorID,pickup,dropoff,passenger_count,trip_distance,fare_amount,tip_amount,pickup_day
str,datetime[ns],datetime[ns],f64,f64,f64,f64,date
"""id1""",2022-01-01 00:04:14,2022-01-01 00:26:12,1.0,10.83,31.0,0.0,2022-01-01
"""id2""",2022-01-01 00:32:17,2022-01-01 00:49:23,1.0,3.97,14.5,3.66,2022-01-01
"""id8""",2022-01-01 00:40:58,2022-01-01 01:00:59,4.0,8.44,25.5,0.0,2022-01-01
