# Conversion to & from Numpy and Pandas
By the end of this lecture you will be able to:
- convert between Polars and Numpy
- convert between Polars and Pandas

Key functionality in this notebook requires Pandas 2.0 or later (automated testing is carried out with the latest version of Pandas on PyPI).

Use `pl.show_versions()` to check your installation.

In [1]:
import polars as pl
import numpy as np
import pandas as pd

In [4]:
pl.show_versions()

--------Version info---------
Polars:              1.7.1
Index type:          UInt32
Platform:            macOS-15.0-arm64-arm-64bit
Python:              3.12.5 | packaged by Anaconda, Inc. | (main, Sep 12 2024, 13:22:57) [Clang 14.0.6 ]

----Optional dependencies----
adbc_driver_manager  <not installed>
altair               5.0.1
cloudpickle          <not installed>
connectorx           <not installed>
deltalake            <not installed>
fastexcel            <not installed>
fsspec               <not installed>
gevent               <not installed>
great_tables         <not installed>
matplotlib           3.9.2
nest_asyncio         1.6.0
numpy                1.26.4
openpyxl             <not installed>
pandas               2.2.2
pyarrow              16.1.0
pydantic             <not installed>
pyiceberg            <not installed>
sqlalchemy           <not installed>
torch                <not installed>
xlsx2csv             <not installed>
xlsxwriter           <not installed>


In [2]:
csv_file = "../data/titanic.csv"

In [3]:
df = pl.read_csv(csv_file)
df.head(3)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
1,0,3,"""Braund, Mr. Owen Harris""","""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S"""
2,1,1,"""Cumings, Mrs. John Bradley (Fl…","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C"""
3,1,3,"""Heikkinen, Miss. Laina""","""female""",26.0,0,0,"""STON/O2. 3101282""",7.925,,"""S"""


## Convert a `DataFrame` to Numpy

To convert a `DataFrame` to Numpy use the `to_numpy` method. This clones (copies) the data.

In [5]:
arr = df.to_numpy()
arr

array([[1, 0, 3, ..., 7.25, None, 'S'],
       [2, 1, 1, ..., 71.2833, 'C85', 'C'],
       [3, 1, 3, ..., 7.925, None, 'S'],
       ...,
       [889, 0, 3, ..., 23.45, None, 'S'],
       [890, 1, 1, ..., 30.0, 'C148', 'C'],
       [891, 0, 3, ..., 7.75, None, 'Q']], dtype=object)

This conversion turns each row into a Numpy `ndarray` and vertically stacks these row-arrays.

As the `DataFrame` has a mix of types the Numpy array has an `object` dtype.

If the columns have uniform numeric dtype then the Numpy array has the corresponding dtype.

In this example we use `select` to choose the 64-bit floating point columns only for conversion to Numpy...

> We cover `select` in more detail in the Section on Selecting columns and transforming dataframes.

In [6]:
floats_array = (
    df
    .select(
        pl.col(pl.Float64)
    )
    .to_numpy()
)
floats_array

array([[22.    ,  7.25  ],
       [38.    , 71.2833],
       [26.    ,  7.925 ],
       ...,
       [    nan, 23.45  ],
       [26.    , 30.    ],
       [32.    ,  7.75  ]])

... and we get a float Numpy array

In [None]:
floats_array.dtype

The Polars sequence dtypes `pl.List` and `pl.Array` are common ways to store sequences that might be passed to Numpy. We learn more about these in Section 4 of the course.

## Convert Numpy to a `DataFrame`

We can create a Polars `DataFrame` from a Numpy array

In [7]:
rand_array = np.random.standard_normal((5,3))
(
    pl.DataFrame(
        rand_array
    )
)

column_0,column_1,column_2
f64,f64,f64
1.519486,-1.474066,-0.396043
-1.687464,0.54541,0.550352
0.16938,-0.778125,0.807696
0.774408,1.217639,-0.84679
0.743746,0.306673,-0.536983


We can optionally pass a list of column names to `pl.DataFrame` if we want to specify these.

If we have a **1D** Numpy array we can create a Polars `Series` or `DataFrame` with zero-copy. We start by creating a 1D array of all ones

In [None]:
arr = np.ones(10)
arr.shape

We can then create a `Series` or `DataFrame` from a Numpy array with *zero-copy* of the data

In [None]:
# zero copy series conversion
pl.Series("a", arr)

# zero copy DataFrame conversion
pl.DataFrame(
    {
       "a": arr,
    }
)

Zero-copy means that the data - the array of ones - is only stored in one place in memory. Both Numpy and Polars are looking at this same place to get the data for the original Numpy array or the Polars `Series` or `DataFrame`. If (or when) one of these libraries transforms the data then it creates its own copy with the transformed data.

## Convert a `Series` to Numpy
Converting a `Series` to Numpy has more options than converting an entire `DataFrame`.

To do a simple conversion where the data is cloned use `to_numpy` on the `Series`

In [10]:
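tmp = (
    df['Age']
    .head(10)
    .to_numpy()
)
tmp[tmp>20]

array([22., 38., 26., 35., 35., 54., 27.])

The `Age` column has a float dtype and so does the Numpy output. Note that a `null` value in the `Series` (the 6th row here) becomes `NaN` in Numpy. The comparison `tmp>20` evaluates to `False` for `NaN`, so that value is filtered out of the output above.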

Also be aware that an integer column containing `null` values is cast to float in Numpy. We show this here where we `cast` the float `Age` column to integer and then convert to Numpy

In [None]:
(
    df['Age']
    .cast(pl.Int32)
    .head(10)
    .to_numpy()
)

The output is again a float array, with `NaN` in place of the `null` value.

### Convert a `Series` to Numpy with zero-copy
In some cases we can convert a `Series` to Numpy without copying ("zero-copy"). 

Zero-copy requires, among other conditions, that the `Series` has no `null` values, as is the case for the `Survived` column. `NaN` values are ordinary floats and do not prevent it. If we want to ensure that conversion to Numpy happens with zero-copy - and raise an `Exception` if a copy is needed - we use the `allow_copy` argument

In [None]:
arr = (
    df['Survived']
    .head()
    .to_numpy(allow_copy=False)
)
arr

If we try this zero-copy approach on a slice of the `Age` column that contains a `null` value, we get an `Exception`

In [None]:
arr = (
    df['Age']
    .head(10)
    .to_numpy(allow_copy=False)
)
arr

With zero-copy conversion the Numpy array is read-only so we cannot change the values in the Numpy array.

In the following example we get an `Exception` when we try to change the values after a zero-copy operation on the `Survived` column

In [None]:
arr = (
    df['Survived']
    .head()
    .to_numpy(allow_copy=False)
)
arr[0] = 100

## Convert a `DataFrame` to Pandas

### Convert to a Numpy-backed Pandas DataFrame
Pandas has historically used Numpy arrays to represent its data in memory.

To convert a `DataFrame` to a Numpy-backed Pandas `DataFrame` use the `to_pandas` method. This clones the data, similar to calling `to_numpy` on a `DataFrame` above.

> This conversion to Pandas requires that you have `PyArrow` installed with `pip` or `conda`.

In [14]:
tmp = (
    df
    .to_pandas()
    .head(2)
)
tmp.iloc[0,0]

1

In [18]:
(
    pl.scan_csv(csv_file)
    .select('Age')
    .collect()
    .to_pandas()
).iloc[0,0]

22.0

### Convert to a PyArrow-backed Pandas `DataFrame`
Since Pandas release 1.5.0 and Polars release 0.16.4 you can have a Pandas `DataFrame` backed by an Arrow Table. You can create a Pandas `DataFrame` that references the same Arrow Table as your Polars `DataFrame`

In [12]:
(
    df
    .to_pandas(use_pyarrow_extension_array=True)
    .head(2)
)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C


The advantage of using the pyarrow extension array is that creating the Pandas `DataFrame` is very cheap as it does not require copying data. 

If there is a function you want from Pandas you can do a quick conversion to Pandas, apply the function and convert back to Polars. This works in eager mode only, of course.

This PyArrow conversion is a new feature in both libraries so there may be incompatibilities with trickier features such as categorical or nested columns.

Note that when you do **not** use the PyArrow extension approach the dtypes of the columns in Pandas are the standard Pandas dtypes. When you do use the PyArrow extension approach the dtypes of the columns in Pandas are PyArrow dtypes

In [None]:
# Without PyArrow dtypes
df.to_pandas(use_pyarrow_extension_array=False).dtypes

In [None]:
# With PyArrow dtypes
df.to_pandas(use_pyarrow_extension_array=True).dtypes

### Calling `pd.DataFrame` on a Polars `DataFrame`
You can call `pd.DataFrame` on a Polars `DataFrame`

In [None]:
df_pandas = (
    pd.DataFrame(df)
    .head()
)

In [None]:
df_pandas



### Conversion from Pandas to Polars
You can convert from Pandas to Polars by calling `pl.DataFrame` on the Pandas `DataFrame`

In [None]:
(
    pl.DataFrame(
        df.to_pandas().set_index("PassengerId")
    )
    .head(3)
)

Note, however, that the `index` column is lost when converting to Polars.

You can also call `pl.from_pandas` on a Pandas `DataFrame`

In [None]:
(
    pl.from_pandas(
        df.to_pandas()
    ).head(3)
)

Both approaches are equivalent.

## Convert a `Series` to Pandas
You can convert a `Series` to Pandas with `to_pandas`, which clones the data

In [None]:
(
    df['Age']
    .to_pandas()
    .head()
)

Or you can again use the PyArrow extension type in Pandas for a zero-copy conversion

In [None]:
(
    df['Age']
    .to_pandas(use_pyarrow_extension_array=True)
    .head()
)

## Exercises

No exercises for this lecture!