# An Introduction to Polars for Pandas Users
In this notebook, we'll cover **Polars**, a newer tabular dataframe library. Polars is gaining traction for its speed, which comes in large part from being built on top of Rust. It is an alternative to the industry favorite **Pandas**, and some data scientists are now switching to Polars as their "go to" dataframe library. Throughout this notebook, we'll directly compare and contrast Pandas and Polars using the [Titanic dataset](https://www.kaggle.com/c/titanic).

To demonstrate the speed of Polars versus Pandas, we will display the execution time of each cell below. While we could use the Jupyter magic command `%%time`, it would be tedious to write in every cell. Instead, we'll use a Jupyter extension that does this cleanly. To use the extension, you will need to run the following commands:

```
pip install jupyter_contrib_nbextensions
jupyter contrib nbextension install --user
jupyter nbextension enable execute_time/ExecuteTime
```

After installation, you can toggle the execution times on in the Jupyter interface by going to "Cell > Execution Timings > Toggle visibility (all)". For context, I am running this notebook on a 2021 MacBook Pro with an M1 Pro chip.
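
If the extension is not available in your environment, a small helper built on Python's standard `time.perf_counter` can report timings instead. This is only a sketch and not part of the original notebook setup; the helper name `timed` is hypothetical.

```python
import time

def timed(func, *args, **kwargs):
    """Run func(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Example: time a simple computation
total, seconds = timed(sum, range(1_000_000))
print(f"sum = {total}, took {seconds:.4f} s")
```

The same wrapper could be applied to calls such as `pd.read_csv` or `pl.read_csv` to compare them side by side.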

## Installation

Installing Polars is as simple as installing any other Python library. Although Polars is built on top of Rust, you do not need to install Rust before installing Polars. To install Polars with `pip`, run the following command:

```
pip install polars
```

Additionally, if you do not already have it installed, you will need to separately install PyArrow, which Polars requires for some specific functions. For example, converting a Pandas dataframe into a Polars dataframe with Polars' `from_pandas()` function requires PyArrow. To install PyArrow, run the following command:

```
pip install pyarrow
```

## Getting Started
Now that we've installed Polars, let's run some basic functions that I like to run every time I work with a new dataset. To keep things straightforward, we'll name the Titanic dataframe loaded with Pandas `df_pandas` and the one loaded with Polars `df_polars`.

In [1]:
# Importing the Python libraries we'll be using throughout this notebook
import pandas as pd
import polars as pl
from category_encoders.one_hot import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, accuracy_score, confusion_matrix

### Loading Data from a CSV File

In [None]:
# Setting the filepath for the Titanic dataset
TITANIC_FILEPATH = '../data/titanic/train.csv'

In [None]:
# Importing the Titanic training dataset with Pandas
df_pandas = pd.read_csv(TITANIC_FILEPATH)

In [None]:
# Importing the Titanic training dataset with Polars
df_polars = pl.read_csv(TITANIC_FILEPATH)

### Viewing the First Rows of Each DataFrame

In [None]:
# Viewing the first few rows of the Pandas DataFrame
df_pandas.head()

In [None]:
# Viewing the first few rows of the Polars dataframe
df_polars.head()

### Viewing Information about the DataFrame

In [None]:
# Viewing the general contents of the Pandas DataFrame
df_pandas.info()

In [None]:
# Viewing stats about the Pandas DataFrame
df_pandas.describe()

In [None]:
# Viewing information about the Polars dataframe
df_polars.describe()

### Displaying Value Counts of a Specific Feature

In [None]:
# Viewing the values associated to the "Embarked" column in the Pandas DataFrame
df_pandas['Embarked'].value_counts()

In [None]:
# Viewing the values associated to the "Embarked" column in the Polars DataFrame
df_polars['Embarked'].value_counts()

## Data Wrangling
Now that we've loaded our data and run some quickstart functions, let's work through some basic data wrangling techniques to see how the syntax and performance compare between Polars and Pandas.

### Getting a Slice of the DataFrame

In [None]:
# Getting a slice of the Pandas DataFrame using index values
df_pandas[15:30]

In [None]:
# Getting a slice of the Polars DataFrame using index values
df_polars[15:30]

### Filtering the DataFrame by Feature Values

In [None]:
# Extracting teenagers from the Pandas DataFrame
df_pandas[df_pandas['Age'].between(13, 19)]

In [None]:
# Extracting teenagers from the Polars DataFrame
df_polars.filter(df_polars['Age'].is_between(13, 19))

### Filling Null Values

In [None]:
# Filling "Embarked" nulls in the Pandas DataFrame
df_pandas['Embarked'] = df_pandas['Embarked'].fillna('S')

In [None]:
# Filling "Embarked" nulls in the Polars DataFrame
df_polars = df_polars.with_columns(df_polars['Embarked'].fill_null('S'))

### Grouping Data by Feature Names

In [None]:
# Grouping data by ticket class and gender to view counts in the Pandas DataFrame
df_pandas.groupby(by = ['Pclass', 'Sex']).count()

In [None]:
# Grouping data by ticket class and gender to view counts in the Polars DataFrame
# (recent Polars releases spell this group_by(...).len(); older ones used groupby(...).count())
df_polars.group_by(['Pclass', 'Sex']).len()
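
The two outputs here are not shaped alike: Pandas' `count()` reports the non-null count for every remaining column, while the Polars call returns a single count per group. If a single count per group is what you want from Pandas too, `size()` is one option. The toy data below is hypothetical.

```python
import pandas as pd

# Hypothetical toy data with a missing Age value
df = pd.DataFrame({
    'Pclass': [1, 1, 3, 3, 3],
    'Sex': ['male', 'female', 'male', 'male', 'female'],
    'Age': [40.0, None, 22.0, 30.0, 18.0],
})

per_column = df.groupby(['Pclass', 'Sex']).count()  # non-null count for each remaining column
per_group = df.groupby(['Pclass', 'Sex']).size()    # one row count per group

print(per_column)
print(per_group)
```

Note that the group with the missing age shows a count of 0 under `count()` but a size of 1 under `size()`.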

## Feature Engineering
Now that we have performed some basic data wrangling, I want to do some simple feature engineering so that we can feed this dataset into a machine learning algorithm. I did the same thing with the Titanic dataset a while back [as part of this notebook](https://github.com/dkhundley/titanic-byoc/blob/main/notebooks/feature-engineering.ipynb), so we'll see whether we can emulate the same steps with Polars.

In [None]:
# Reloading each DataFrame from scratch
df_pandas = pd.read_csv(TITANIC_FILEPATH)
df_polars = pl.read_csv(TITANIC_FILEPATH)

### Dropping Unnecessary Features

In [2]:
# Dropping unnecessary features from the Pandas DataFrame
df_pandas.drop(columns = ['PassengerId', 'Name', 'Ticket', 'Cabin'], inplace = True)

In [None]:
# Dropping unnecessary features from the Polars DataFrame
df_polars = df_polars.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'])

In [None]:
# Separating the supporting features (X) from the target feature (y) for Pandas
X_pandas = df_pandas.drop(columns = ['Survived'])
y_pandas = df_pandas[['Survived']]

In [None]:
# Separating the supporting features (X) from the target feature (y) for Polars
X_polars = df_polars.drop(['Survived'])
y_polars = df_polars[['Survived']]

### Engineering the "Sex" (Gender) Column

In [None]:
# Instantiating One Hot Encoder objects for each respective DataFrame
sex_ohe_encoder_pandas = OneHotEncoder(use_cat_names = True, handle_unknown = 'ignore')
sex_ohe_encoder_polars = OneHotEncoder(use_cat_names = True, handle_unknown = 'ignore')

In [None]:
# Performing a one hot encoding on the "Sex" column for the Pandas DataFrame
sex_dummies_pandas = sex_ohe_encoder_pandas.fit_transform(X_pandas['Sex'])

In [None]:
# Performing a one hot encoding on the "Sex" column for the Polars DataFrame
sex_dummies_polars = sex_ohe_encoder_polars.fit_transform(X_polars['Sex'].to_pandas())

In [None]:
# Concatenating the gender dummies back to the original Pandas DataFrame
X_pandas = pd.concat([X_pandas, sex_dummies_pandas], axis = 1)

In [None]:
# Converting the Polars dummies from a Pandas DataFrame to a Polars DataFrame
sex_dummies_polars = pl.from_pandas(sex_dummies_polars)

# Concatenating the gender dummies back to the original Polars DataFrame
X_polars = pl.concat([X_polars, sex_dummies_polars], how = 'horizontal')

In [None]:
# Dropping the original "Sex" column for each DataFrame
X_pandas.drop(columns = ['Sex'], inplace = True)
X_polars = X_polars.drop(['Sex'])

### Engineering the "Embarked" Column

In [None]:
# Instantiating One Hot Encoder objects for each respective dataframe
embarked_ohe_encoder_pandas = OneHotEncoder(use_cat_names = True, handle_unknown = 'ignore')
embarked_ohe_encoder_polars = OneHotEncoder(use_cat_names = True, handle_unknown = 'ignore')

In [None]:
# Performing a one hot encoding on the "Embarked" column for the Pandas dataframe
embarked_dummies_pandas = embarked_ohe_encoder_pandas.fit_transform(X_pandas['Embarked'])

In [None]:
# Performing a one hot encoding on the "Embarked" column for the Polars dataframe
embarked_dummies_polars = embarked_ohe_encoder_polars.fit_transform(X_polars['Embarked'].to_pandas())

In [None]:
# Concatenating the "embarked" dummies back to the original Pandas dataframe
X_pandas = pd.concat([X_pandas, embarked_dummies_pandas], axis = 1)

In [None]:
# Converting the Polars dummies from a Pandas dataframe to a Polars dataframe
embarked_dummies_polars = pl.from_pandas(embarked_dummies_polars)

# Concatenating the "embarked" dummies back to the original Polars dataframe
X_polars = pl.concat([X_polars, embarked_dummies_polars], how = 'horizontal')

In [None]:
# Dropping the original "Embarked" column for each dataframe
X_pandas.drop(columns = ['Embarked'], inplace = True)
X_polars = X_polars.drop(['Embarked'])

### Engineering the "Age" Column

In [None]:
# Extracting the median age of the "Age" column using each respective DataFrame
median_age_pandas = X_pandas['Age'].median()
median_age_polars = X_polars['Age'].median()

In [None]:
# Filling null values with the median age for each respective DataFrame
X_pandas['Age'] = X_pandas['Age'].fillna(median_age_pandas)
X_polars = X_polars.with_columns(X_polars['Age'].fill_null(median_age_polars))

In [None]:
# Establishing our bins values and names
bin_labels = ['child', 'teen', 'young_adult', 'adult', 'elder']
bin_values = [-1, 12, 19, 30, 60, 100]

In [None]:
# Applying "Age" binning for the Pandas DataFrame
age_bins_pandas = pd.DataFrame(pd.cut(X_pandas['Age'], bins = bin_values, labels = bin_labels))
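
To make the bin edges concrete: with `pd.cut`, each bin includes its right edge by default, so an age of exactly 12 falls in `child` and an age of exactly 19 falls in `teen`. A small sketch with hypothetical ages:

```python
import pandas as pd

bin_labels = ['child', 'teen', 'young_adult', 'adult', 'elder']
bin_values = [-1, 12, 19, 30, 60, 100]

# Hypothetical ages, including values that sit exactly on a bin edge
ages = pd.Series([0.5, 12, 13, 19, 45, 80])

# Bins are right-inclusive by default: (-1, 12], (12, 19], (19, 30], ...
binned = pd.cut(ages, bins=bin_values, labels=bin_labels)
print(binned.tolist())
```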

Note: I really tried to get Polars' implementation of the `cut()` function to behave like the Pandas implementation, but... it was confusing. It does appear to work somewhat, but it re-ordered the whole set of data from least to greatest, meaning that I can't simply concatenate it back to the original Polars dataframe. According to [Polars' documentation about the `cut()` function](https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.cut.html), this function is still in an "experimental state" as of February 24, 2023. I demonstrated what I'm talking about in the cell below, but I can't proceed forward like this. I'm going to have to use the Pandas values here for my Polars dataframe.

In [None]:
# Applying "Age" binning for the Polars DataFrame
age_bins_polars = pl.cut(X_polars['Age'], bins = bin_values)
age_bins_polars.head()

In [None]:
# Converting the Pandas age bins to Polars for use in the Polars DataFrame
age_bins_polars = pl.from_pandas(age_bins_pandas)

In [None]:
# Instantiating One Hot Encoder objects for each respective DataFrame
age_ohe_encoder_pandas = OneHotEncoder(use_cat_names = True, handle_unknown = 'ignore')
age_ohe_encoder_polars = OneHotEncoder(use_cat_names = True, handle_unknown = 'ignore')

In [None]:
# Performing a one hot encoding on the age bins for the Pandas DataFrame
age_dummies_pandas = age_ohe_encoder_pandas.fit_transform(age_bins_pandas)

In [None]:
# Performing a one hot encoding on the age bins for the Polars DataFrame
age_dummies_polars = age_ohe_encoder_polars.fit_transform(age_bins_polars.to_pandas())

In [None]:
# Concatenating the age bin dummies back to the original Pandas DataFrame
X_pandas = pd.concat([X_pandas, age_dummies_pandas], axis = 1)

In [None]:
# Converting the Polars dummies from a Pandas dataframe to a Polars DataFrame
age_dummies_polars = pl.from_pandas(age_dummies_polars)

# Concatenating the age bin dummies back to the original Polars DataFrame
X_polars = pl.concat([X_polars, age_dummies_polars], how = 'horizontal')

In [None]:
# Dropping the original "Age" column for each DataFrame
X_pandas.drop(columns = ['Age'], inplace = True)
X_polars = X_polars.drop(['Age'])

In [None]:
# Viewing the first few rows of the final, feature engineered Pandas DataFrame
X_pandas.head()

In [None]:
# Viewing the first few rows of the final, feature engineered Polars DataFrame
X_polars.head()

## Predictive Modeling with Machine Learning

### Performing a Train-Test Split

In [None]:
# Performing a train-validation split on the Pandas data
X_train_pandas, X_val_pandas, y_train_pandas, y_val_pandas = train_test_split(X_pandas, y_pandas, test_size = 0.2, random_state = 42)

In [None]:
# Performing a train-validation split on the Polars data
X_train_polars, X_val_polars, y_train_polars, y_val_polars = train_test_split(X_polars, y_polars, test_size = 0.2, random_state = 42)

### Performing Model Training

In [None]:
# Instantiating a Random Forest Classifier object for each respective DataFrame
rfc_model_pandas = RandomForestClassifier(n_estimators = 50,
                                          max_depth = 20,
                                          min_samples_split = 10,
                                          min_samples_leaf = 2)

rfc_model_polars = RandomForestClassifier(n_estimators = 50,
                                          max_depth = 20,
                                          min_samples_split = 10,
                                          min_samples_leaf = 2)

In [None]:
# Fitting the Pandas DataFrame to the Random Forest Classifier algorithm
rfc_model_pandas.fit(X_train_pandas, y_train_pandas.values.ravel())

In [None]:
# Fitting the Polars DataFrame to the Random Forest Classifier algorithm
rfc_model_polars.fit(X_train_polars, y_train_polars.to_numpy().ravel())