# Excel files
By the end of this lecture you will be able to:
- read an Excel worksheet into a `DataFrame`
- read multiple Excel worksheets into a `dict`
- pass arguments to the XML parser
- pass arguments to the CSV parser

In [None]:
import polars as pl
import pandas as pd

## Creating an Excel file
We create a simple Excel file with one worksheet from a CSV file. Polars does not support writing to Excel files so we write the Excel file with Pandas.

Pandas needs the openpyxl package to write Excel files, so we must install it first.

In [None]:
## Uncomment and run this cell to install openpyxl
# %pip install openpyxl

We set the path to our CSV file and to the Excel file that we will create in the current directory.

In [None]:
csvFile = "../data/titanic.csv"
excelFile = "titanic.xlsx"

We read the CSV with Polars, convert the `DataFrame` to Pandas and write it to `excelFile`.

In [None]:
df = pl.read_csv(csvFile)
dfPandas = df.to_pandas()
dfPandas.to_excel(excelFile)

In the simplest case we read the first sheet in the Excel file with `pl.read_excel`.

In [None]:
df = pl.read_excel(excelFile)
df.head(2)

## Specifying the worksheet
We specify the worksheet with integer id numbers or names.

### Specifying with id numbers
We specify the sheet by number with the `sheet_id` argument.
- By default `sheet_id = 1` and Polars reads the first worksheet
- If we set `sheet_id = 0` Polars returns all sheets as a `dict` that maps each sheet name (a string) to a `DataFrame`

In [None]:
excelDict = pl.read_excel(excelFile, sheet_id=0)
excelDict.keys()

### Specifying with sheet name
By default no `sheet_name` is set, so the `sheet_id = 1` argument decides which sheet is read. We can instead specify the sheet by name with the `sheet_name` argument.

In [None]:
(
    pl.read_excel(
        excelFile,
        sheet_name="Sheet1"
    )
    .head(2)
)

## How does Polars read Excel files?

Unlike its other I/O options, Polars does not have a Rust parser or Rust dependency for reading Excel files. Instead Polars uses [the xlsx2csv library](https://github.com/dilshod/xlsx2csv), which is a Python package.
For this reason the option to read Excel files is only available from the Python API for Polars.

When we call `pl.read_excel`:
- Polars calls xlsx2csv with the path to the Excel file
- xlsx2csv parses the XML and converts it to a CSV in-memory
- Polars parses the CSV with `pl.read_csv`
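
The three steps above can be sketched with the standard library alone. This is an illustration of the idea and not the code that xlsx2csv or Polars actually runs: the worksheet XML is hand-written with hypothetical data, the helper names `build_workbook` and `sheet_to_csv` are made up for this example, and `csv.DictReader` stands in for `pl.read_csv`.

```python
import csv
import io
import xml.etree.ElementTree as ET
import zipfile

# Namespace used by the worksheet XML inside an .xlsx file
NS = {"m": "http://schemas.openxmlformats.org/spreadsheetml/2006/main"}

# Hand-written minimal worksheet (hypothetical data, not a complete workbook)
SHEET_XML = """<worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
  <sheetData>
    <row r="1">
      <c r="A1" t="inlineStr"><is><t>Name</t></is></c>
      <c r="B1" t="inlineStr"><is><t>Age</t></is></c>
    </row>
    <row r="2">
      <c r="A2" t="inlineStr"><is><t>Alice</t></is></c>
      <c r="B2"><v>29</v></c>
    </row>
  </sheetData>
</worksheet>
"""


def build_workbook() -> bytes:
    """An .xlsx file is a zip archive of XML files; build a tiny one in memory."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("xl/worksheets/sheet1.xml", SHEET_XML)
    return buffer.getvalue()


def sheet_to_csv(workbook: bytes) -> str:
    """Parse the worksheet XML and write its cells to an in-memory CSV string."""
    with zipfile.ZipFile(io.BytesIO(workbook)) as archive:
        root = ET.fromstring(archive.read("xl/worksheets/sheet1.xml"))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in root.iterfind("m:sheetData/m:row", NS):
        values = []
        for cell in row.iterfind("m:c", NS):
            # Inline strings are stored in <is><t>, numeric values in <v>
            text = cell.findtext("m:is/m:t", default=None, namespaces=NS)
            if text is None:
                text = cell.findtext("m:v", default="", namespaces=NS)
            values.append(text)
        writer.writerow(values)
    return out.getvalue()


# Step 1 and 2: open the file and convert the XML to CSV in memory
csv_text = sheet_to_csv(build_workbook())
# Step 3: parse the CSV (csv.DictReader is a stand-in for pl.read_csv)
rows = list(csv.DictReader(io.StringIO(csv_text)))
print(csv_text)
print(rows)
```

Note that every value arrives in the CSV as text, which is why the CSV parsing stage is responsible for inferring dtypes.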


## Controlling how the Excel file is parsed
The parsing process has two stages as set out above:
- `xlsx2csv` parsing the XML to create an in-memory CSV
- `pl.read_csv` parsing the CSV to create a `DataFrame`

Each of these stages accepts arguments to control the parsing.

### Parsing the XML
We can pass arguments to xlsx2csv to control how it parses the XML. This includes:
- specifying the date format with the `dateformat` option, for example `%Y/%m/%d`
- specifying the format for floats with the `floatformat` option, for example `%.15f`
- skipping empty lines with the `skip_empty_lines` option

See https://github.com/dilshod/xlsx2csv for the full set of options.
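
To get a feel for the two format strings in the list above, the sketch below applies them with plain Python. It assumes that xlsx2csv uses standard `strftime` codes for dates and printf-style formatting for floats; the date and float values are made up for illustration.

```python
from datetime import date

# Format strings taken from the list above
date_format = "%Y/%m/%d"
float_format = "%.15f"

# Hypothetical cell values
formatted_date = date(1912, 4, 15).strftime(date_format)
formatted_float = float_format % 7.25

print(formatted_date)   # 1912/04/15
print(formatted_float)  # 7.250000000000000
```

These are the strings that would then appear in the in-memory CSV for the second parsing stage to interpret.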

We pass these options as a `dict` to the `xlsx2csv_options` argument of `pl.read_excel`.

In [None]:
(
    pl.read_excel(
        excelFile,
        xlsx2csv_options =
            {
                "skip_empty_lines": True,
            }
    )
    .head(2)
)

### Parsing the CSV
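Once xlsx2csv has created the CSV, Polars parses it with `pl.read_csv`. We can pass any arguments that `pl.read_csv` accepts as a `dict` to the `read_csv_options` argument.

In this example we use `new_columns` to rename the first column, which holds the Pandas index that `to_excel` wrote to the file.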

In [None]:
(
    pl.read_excel(
        excelFile,
        read_csv_options =
            {
                "new_columns":["Id"]
            }
    )
    .head(2)
)

Reading Excel files happens in eager mode only. We cannot do a lazy scan of an Excel file.

Parsing the XML in Excel files is slow. Consider converting your data to CSV, Parquet or Arrow formats if possible.

## Exercises
In the exercises you will develop your understanding of:
- passing arguments to `pl.read_excel`

### Exercise 1
We:
- create a Polars `DataFrame` from the NYC taxi extract
- convert it to a Pandas `DataFrame`
- write the Pandas `DataFrame` to an Excel file

In [None]:
nycCSVFile = "../data/nyc_trip_data_1k.csv"
(
    pl.read_csv(nycCSVFile)
    .to_pandas()
    .to_excel("nyc.xlsx")
)

Create a `DataFrame` from the `nyc.xlsx` file with the date columns automatically parsed as datetime dtypes.

## Solutions

### Solution to Exercise 1
Create a `DataFrame` from the `nyc.xlsx` file with the date columns automatically parsed as datetime dtypes.

In [None]:
pl.read_excel("nyc.xlsx", read_csv_options={"parse_dates": True})