# Conversion to & from Numpy and Pandas

In [2]:
import polars as pl
import numpy as np
import pandas as pd

In [3]:
csv_file = "data/titanic.csv"

In [4]:
df = pl.read_csv(csv_file)
df.head(3)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
1,0,3,"""Braund, Mr. Owen Harris""","""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S"""
2,1,1,"""Cumings, Mrs. John Bradley (Fl…","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C"""
3,1,3,"""Heikkinen, Miss. Laina""","""female""",26.0,0,0,"""STON/O2. 3101282""",7.925,,"""S"""


## Convert a `DataFrame` to Numpy

To convert a `DataFrame` to Numpy use the `to_numpy` method.

In [5]:
arr = df.to_numpy()
arr

array([[1, 0, 3, ..., 7.25, None, 'S'],
       [2, 1, 1, ..., 71.2833, 'C85', 'C'],
       [3, 1, 3, ..., 7.925, None, 'S'],
       ...,
       [889, 0, 3, ..., 23.45, None, 'S'],
       [890, 1, 1, ..., 30.0, 'C148', 'C'],
       [891, 0, 3, ..., 7.75, None, 'Q']], shape=(891, 12), dtype=object)

As the `DataFrame` has a mix of types the Numpy array has an `object` dtype.

If the columns have uniform numeric dtype then the Numpy array has the corresponding dtype.

In [6]:
floats_array = df.select(pl.col(pl.Float64)).to_numpy()  # The column dtype is Float64

floats_array

array([[22.    ,  7.25  ],
       [38.    , 71.2833],
       [26.    ,  7.925 ],
       ...,
       [    nan, 23.45  ],
       [26.    , 30.    ],
       [32.    ,  7.75  ]], shape=(891, 2))

In [7]:
floats_array.dtype

dtype('float64')

The Polars sequence dtypes `pl.List` and `pl.Array` are common ways to store sequences that might be passed to Numpy.

## Convert Numpy to a `DataFrame`

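To create a Polars `DataFrame` from a Numpy array, pass the array to `pl.DataFrame`. Without a schema the columns are named `column_0`, `column_1` and so on.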

In [None]:
rand_array = np.random.standard_normal((5, 3))

pl.DataFrame(
    rand_array,
)

column_0,column_1,column_2
f64,f64,f64
-1.095254,-0.547323,-0.737591
1.8798,1.415416,-0.79006
0.091525,1.675003,-0.13731
-1.328819,-0.808994,-0.127288
0.144079,-0.807012,-0.774749


If we have a **1D** Numpy array we can create a Polars `Series` or `DataFrame` with zero-copy.

In [9]:
arr = np.ones(10)
arr.shape

(10,)

We first create a `Series` from this array and then a `DataFrame`.

In [10]:
# zero copy series conversion
pl.Series("a", arr)

a
f64
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0


In [None]:
# zero copy DataFrame conversion
pl.DataFrame(
    {
        "a": arr,
    }
)

a
f64
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0


Zero-copy means that the data (the array of ones) is stored in only one place in memory.
Numpy and Polars both read from that same location, whether the data is accessed through the original Numpy array or through the Polars `Series` or `DataFrame`.
If one of these libraries transforms the data, it creates its own copy holding the transformed data.

## Convert a `Series` to Numpy

For a simple conversion call `to_numpy` on the `Series`. Here the data is cloned because the column contains a `null` value.

In [12]:
df["Age"].head().to_numpy()

array([22., 38., 26., 35., 35., nan, 54.,  2., 27., 14.])

Note that the `null` value becomes a `NaN` in Numpy.

Also be aware that an integer column containing `null` values is cast to float in Numpy.

In [13]:
df['Age'].cast(pl.Int32).head().to_numpy()

array([22., 38., 26., 35., 35., nan, 54.,  2., 27., 14.])

### Convert a `Series` to Numpy with zero-copy

Zero-copy conversion is only possible if the `Series` contains no `null` values.

To ensure that conversion to Numpy happens with zero-copy, and to raise an `Exception` if a copy would be needed, set the `allow_copy` argument to `False`.

In [None]:
arr = df["Survived"].head().to_numpy(allow_copy=False)
arr

array([0, 1, 1, 1, 0, 0, 0, 0, 1, 1])

If we try this zero-copy approach with a `null` value we get an `Exception`

In [None]:
arr = df["Age"].head().to_numpy(allow_copy=False)
arr

RuntimeError: copy not allowed: cannot convert to a NumPy array without copying data

With zero-copy conversion the Numpy array is read-only so we cannot change the values in the Numpy array.

In [None]:
arr = df["Survived"].head().to_numpy(allow_copy=False)
arr[0] = 100

ValueError: assignment destination is read-only

## Convert a `DataFrame` to Pandas

### Convert to a Numpy-backed Pandas DataFrame

To convert a `DataFrame` to a Pandas `DataFrame` backed by Numpy arrays, use the `to_pandas` method.

This clones the data, similar to calling `to_numpy` on a `DataFrame` above.

> This conversion to Pandas requires that you have `PyArrow`.

In [None]:
df.to_pandas().head(2)

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C


### Convert to a PyArrow-backed Pandas `DataFrame`

We can create a Pandas `DataFrame` that references the same Arrow data as the Polars `DataFrame` by setting `use_pyarrow_extension_array=True`.

In [None]:
df.to_pandas(use_pyarrow_extension_array=True).head(2)

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C


The advantage of using the pyarrow extension array is that creating the Pandas `DataFrame` is very cheap as it does not require copying data. 

If there is a function you want from Pandas, you can do a quick conversion to Pandas, apply the function and convert back to Polars. This only works in eager mode.

This PyArrow conversion is a new feature in both libraries, so there may be incompatibility with trickier features such as categorical or nested columns.

Note that when you do **not** use the PyArrow extension approach the dtypes of the columns in Pandas are the standard Pandas dtypes.

In [19]:
# Without PyArrow dtypes
df.to_pandas(use_pyarrow_extension_array=False).dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

In [20]:
# With PyArrow dtypes
df.to_pandas(use_pyarrow_extension_array=True).dtypes

PassengerId           int64[pyarrow]
Survived              int64[pyarrow]
Pclass                int64[pyarrow]
Name           large_string[pyarrow]
Sex            large_string[pyarrow]
Age                  double[pyarrow]
SibSp                 int64[pyarrow]
Parch                 int64[pyarrow]
Ticket         large_string[pyarrow]
Fare                 double[pyarrow]
Cabin          large_string[pyarrow]
Embarked       large_string[pyarrow]
dtype: object

### Calling `pd.DataFrame` on a Polars `DataFrame`
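You can call `pd.DataFrame` on a Polars `DataFrame`. Note that the column names are not carried over: the Pandas columns are labelled with integer positions instead.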

In [None]:
df_pandas = pd.DataFrame(df).head()

In [22]:
df_pandas

,0,1,2,3,4,5,6,7,8,9,10,11
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S




### Conversion from Pandas to Polars
You can convert from Pandas to Polars by calling `pl.DataFrame` on the Pandas `DataFrame`

In [None]:
pl.DataFrame(df.to_pandas().set_index("PassengerId")).head(3)

Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i64,i64,str,str,f64,i64,i64,str,f64,str,str
0,3,"""Braund, Mr. Owen Harris""","""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S"""
1,1,"""Cumings, Mrs. John Bradley (Fl…","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C"""
1,3,"""Heikkinen, Miss. Laina""","""female""",26.0,0,0,"""STON/O2. 3101282""",7.925,,"""S"""


Note, however, that the Pandas index (`PassengerId` here) is lost when converting to Polars.

You can also call `pl.from_pandas` on a Pandas `DataFrame`

In [None]:
pl.from_pandas(df.to_pandas()).head(3)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
1,0,3,"""Braund, Mr. Owen Harris""","""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S"""
2,1,1,"""Cumings, Mrs. John Bradley (Fl…","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C"""
3,1,3,"""Heikkinen, Miss. Laina""","""female""",26.0,0,0,"""STON/O2. 3101282""",7.925,,"""S"""


With default arguments both approaches give the same result.

## Convert a `Series` to Pandas
You can convert a `Series` to Pandas with `to_pandas`, which clones the data.

In [None]:
df["Age"].to_pandas().head()

0    22.0
1    38.0
2    26.0
3    35.0
4    35.0
Name: Age, dtype: float64

Or you can again use the PyArrow extension type in Pandas for a zero-copy conversion

In [None]:
df["Age"].to_pandas(use_pyarrow_extension_array=True).head()

0    22.0
1    38.0
2    26.0
3    35.0
4    35.0
Name: Age, dtype: double[pyarrow]