# Conversion to & from Pandas and Numpy
By the end of this section you will be able to:
- convert between a `DataFrame` and Numpy
- convert a `Series` to Numpy with zero-copy
- convert between a Polars `DataFrame` and Pandas, including with zero-copy

Key functionality in this notebook requires that your Pandas version is 1.5+, Polars is 0.16.4+ and PyArrow is 11+.

Use `pl.show_versions()` to check your installation.

In [None]:
import polars as pl
import numpy as np
import pandas as pd

In [None]:
pl.show_versions()

In [None]:
csv_file = "../data/titanic.csv"

In [None]:
df = pl.read_csv(csv_file)
df.head(3)

## Convert a `DataFrame` to Numpy

To convert a `DataFrame` to Numpy use the `to_numpy` method. This clones (copies) the data.

In [None]:
arr = df.to_numpy()
arr

This conversion turns each row into a Numpy `ndarray` and vertically stacks these row-arrays.

As the `DataFrame` has a mix of types the Numpy array has an `object` dtype.

If the columns have uniform numeric dtype then the Numpy array has the corresponding dtype.

In this example we use `select` to choose the 64-bit floating point columns only for conversion to Numpy. We cover this in more detail in the Section on Selecting columns and transforming dataframes.

In [None]:
(
    df
    .select(
        pl.col(pl.Float64)
    )
    .to_numpy()
    .dtype
)

Typically it is better to convert to Numpy at the last moment, as data processing in Polars is often faster and more memory efficient.

## Convert Numpy to a `DataFrame`

We can create a Polars `DataFrame` from a Numpy array

In [None]:
rand_array = np.random.standard_normal((5,3))
pl.DataFrame(rand_array)

We can optionally pass a list of column names to `pl.DataFrame` with the `schema` argument if we want to specify these.

## Convert a `Series` to Numpy
Converting a `Series` to Numpy has more options than converting an entire `DataFrame`.

To do a simple conversion where the data is cloned use `to_numpy` on the `Series`

In [None]:
df['Age'].head().to_numpy()

### Convert a `Series` to Numpy with zero-copy
In some cases we can convert a `Series` to Numpy without copying ("zero-copy"). 

Zero-copy is possible if the `Series` has a numeric dtype and contains no `null` values.

In [None]:
arr = df['Survived'].head().to_numpy(zero_copy_only=True)
arr

With zero-copy conversion the Numpy array is read-only so you cannot change the values in the Numpy array.

So uncommenting the assignment in the following cell raises a `ValueError`.

In [None]:
arr = df['Survived'].head().to_numpy(zero_copy_only=True)
# arr[0] = 100  # This would raise an exception because the array is read-only

## Convert a `DataFrame` to Pandas

### Convert to a Numpy-backed Pandas DataFrame
Pandas has historically used Numpy arrays to represent its data in memory.

To convert a Polars `DataFrame` to a Numpy-backed Pandas `DataFrame` use the `to_pandas` method. This clones the data, as calling `to_numpy` on a `DataFrame` did above.

This conversion requires that you have `PyArrow` installed with `pip` or `conda`.

In [None]:
df.to_pandas().head(2)

You can check the type of each object:

In [None]:
type(df), type(df.to_pandas())

### Convert to a PyArrow-backed Pandas `DataFrame`
Since Pandas release 1.5.0 and Polars release 0.16.4 you can have a Pandas `DataFrame` backed by an Arrow Table. You can create a Pandas `DataFrame` that references the same Arrow Table as your Polars `DataFrame`.

In [None]:
df.to_pandas(use_pyarrow_extension_array=True).head(2)

The advantage of using the pyarrow extension array is that creating the Pandas `DataFrame` is very cheap as it does not require copying data. 

If there is a function you want from Pandas you can do a quick transformation to Pandas, apply the function and revert back to Polars. This works in eager mode only of course.

This PyArrow conversion is a new feature in both libraries so there may be bugs with trickier features such as categorical or nested columns.

Note that when you use the PyArrow extension approach the dtypes of the columns are PyArrow dtypes

In [None]:
# Without PyArrow dtypes
df.to_pandas(use_pyarrow_extension_array=False).dtypes

In [None]:
# With PyArrow dtypes
df.to_pandas(use_pyarrow_extension_array=True).dtypes

### Calling `pd.DataFrame` on a Polars `DataFrame`?
**Warning** - at present you can call `pd.DataFrame` on a Polars `DataFrame` but the result is:
- transposed and
- has lost the column names

In [None]:
pd.DataFrame(df).head()

Hopefully this conversion will be easier when both libraries have adopted the [dataframe interchange protocol](https://data-apis.org/dataframe-protocol/latest/index.html).

### Conversion from Pandas to Polars
You can convert from Pandas to Polars by calling `pl.DataFrame` on the Pandas `DataFrame`

In [None]:
pl.DataFrame(df.to_pandas()).head(3)

Or by calling `pl.from_pandas` on the Pandas `DataFrame`

In [None]:
pl.from_pandas(df.to_pandas()).head(3)

## Convert a `Series` to Pandas
You can convert a `Series` to Pandas with a call that clones the data

In [None]:
df['Age'].to_pandas().head()

Or you can again use the PyArrow extension type in Pandas for a zero-copy operation

In [None]:
df['Age'].to_pandas(use_pyarrow_extension_array=True).head()

## Exercises

No exercises for this section!