# An Introduction to Polars for Pandas Users
In this notebook, we'll cover **Polars**, a relatively new tabular dataframe library. Polars is gaining traction for its speed, which it owes in part to being built on top of Rust. It is an alternative to the industry favorite **Pandas**, and a number of data scientists are now switching to Polars as their "go to" dataframe library. Throughout this notebook, we'll directly compare and contrast Pandas and Polars using the [Titanic dataset](https://www.kaggle.com/c/titanic).

To demonstrate the speed of Polars versus Pandas, we will display the execution time of each cell below. While we could use the Jupyter magic command `%%time`, writing it in every cell would be tedious. Instead, we'll use a Jupyter extension that does this cleanly. To use the extension, run the following commands:

```
pip install jupyter_contrib_nbextensions
jupyter contrib nbextension install --user
jupyter nbextension enable execute_time/ExecuteTime
```

After doing the proper installation, you can toggle on the execution times in the Jupyter interface by going to "Cell > Execution Timings > Toggle visibility (all)". For context, I am running this notebook on a standard 2021 MacBook Pro with an M1 Pro chip.
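If the extension is not available in your environment, a small context manager built on the standard library's `time.perf_counter` can give similar per-block timings. This is only a sketch: the `timed` helper and the `timings` dictionary are names I made up for illustration, and they are not part of Jupyter or Polars.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results=None):
    """Print (and optionally record) the wall-clock time of the enclosed block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if results is not None:
            results[label] = elapsed
        print(f"{label}: {elapsed * 1000:.2f} ms")


# Hypothetical usage: time an arbitrary block of work
timings = {}
with timed("sum of squares", timings):
    total = sum(i * i for i in range(100_000))
```

Wrapping the Pandas and Polars versions of the same operation in two `timed` blocks gives a rough side-by-side comparison without touching every cell.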

## Installation

Installing Polars is as simple as installing any other Python library. Despite being built on top of Rust, Polars does not require you to install Rust beforehand. To install Polars with `pip`, run the following command:

```
pip install polars
```

Additionally, if you do not already have it installed, you will need to install PyArrow separately, which Polars requires for certain functions. For example, converting a Pandas dataframe into a Polars dataframe with Polars' `from_pandas()` function requires PyArrow. To install PyArrow, run the following command:

```
pip install pyarrow
```

## Getting Started
Now that we've installed Polars, let's go ahead and get started running some basic functions that I like to run every time I work with a new dataset. To keep things straightforward, we're going to name our Titanic dataframe loaded with Pandas as `df_pandas` and our Titanic dataframe loaded with Polars as `df_polars`.

In [1]:
# Importing the Python libraries we'll be using throughout this notebook
import pandas as pd
import polars as pl
from category_encoders.one_hot import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

### Loading Data from a CSV File

In [2]:
# Setting the filepath for the Titanic dataset
TITANIC_FILEPATH = '../data/titanic/train.csv'

In [3]:
# Importing the Titanic training dataset with Pandas
df_pandas = pd.read_csv(TITANIC_FILEPATH)

In [4]:
# Importing the Titanic training dataset with Polars
df_polars = pl.read_csv(TITANIC_FILEPATH)

### Viewing the First Rows of Each DataFrame

In [5]:
# Viewing the first few rows of the Pandas DataFrame
df_pandas.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [6]:
# Viewing the first few rows of the Polars dataframe
df_polars.head()

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
1,0,3,"""Braund, Mr. Ow...","""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S"""
2,1,1,"""Cumings, Mrs. ...","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C"""
3,1,3,"""Heikkinen, Mis...","""female""",26.0,0,0,"""STON/O2. 31012...",7.925,,"""S"""
4,1,1,"""Futrelle, Mrs....","""female""",35.0,1,0,"""113803""",53.1,"""C123""","""S"""
5,0,3,"""Allen, Mr. Wil...","""male""",35.0,0,0,"""373450""",8.05,,"""S"""


### Viewing Information about the DataFrame

In [7]:
# Viewing the general contents of the Pandas DataFrame
df_pandas.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [8]:
# Viewing stats about the Pandas DataFrame
df_pandas.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


In [9]:
# Viewing information about the Polars dataframe
df_polars.describe()

describe,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
str,f64,f64,f64,str,str,f64,f64,f64,str,f64,str,str
"""count""",891.0,891.0,891.0,"""891""","""891""",891.0,891.0,891.0,"""891""",891.0,"""891""","""891"""
"""null_count""",0.0,0.0,0.0,"""0""","""0""",177.0,0.0,0.0,"""0""",0.0,"""687""","""2"""
"""mean""",446.0,0.383838,2.308642,,,29.699118,0.523008,0.381594,,32.204208,,
"""std""",257.353842,0.486592,0.836071,,,14.526497,1.102743,0.806057,,49.693429,,
"""min""",1.0,0.0,1.0,"""Abbing, Mr. An...","""female""",0.42,0.0,0.0,"""110152""",0.0,"""A10""","""C"""
"""max""",891.0,1.0,3.0,"""van Melkebeke,...","""male""",80.0,8.0,6.0,"""WE/P 5735""",512.3292,"""T""","""S"""
"""median""",446.0,0.0,3.0,,,28.0,0.0,0.0,,14.4542,,


### Displaying Value Counts of a Specific Feature

In [10]:
# Viewing the values associated to the "Embarked" column in the Pandas DataFrame
df_pandas['Embarked'].value_counts()

S    644
C    168
Q     77
Name: Embarked, dtype: int64

In [11]:
# Viewing the values associated to the "Embarked" column in the Polars DataFrame
df_polars['Embarked'].value_counts()

Embarked,counts
str,u32
"""S""",644
,2
"""Q""",77
"""C""",168


## Data Wrangling
Now that we've loaded our data and run some quickstart functions, let's go ahead and try some basic data wrangling techniques to see how the syntax and performance compare between Polars and Pandas.

### Getting a Slice of the DataFrame

In [12]:
# Getting a slice of the Pandas DataFrame using index values
df_pandas[15:30]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
15,16,1,2,"Hewlett, Mrs. (Mary D Kingcome)",female,55.0,0,0,248706,16.0,,S
16,17,0,3,"Rice, Master. Eugene",male,2.0,4,1,382652,29.125,,Q
17,18,1,2,"Williams, Mr. Charles Eugene",male,,0,0,244373,13.0,,S
18,19,0,3,"Vander Planke, Mrs. Julius (Emelia Maria Vande...",female,31.0,1,0,345763,18.0,,S
19,20,1,3,"Masselmani, Mrs. Fatima",female,,0,0,2649,7.225,,C
20,21,0,2,"Fynney, Mr. Joseph J",male,35.0,0,0,239865,26.0,,S
21,22,1,2,"Beesley, Mr. Lawrence",male,34.0,0,0,248698,13.0,D56,S
22,23,1,3,"McGowan, Miss. Anna ""Annie""",female,15.0,0,0,330923,8.0292,,Q
23,24,1,1,"Sloper, Mr. William Thompson",male,28.0,0,0,113788,35.5,A6,S
24,25,0,3,"Palsson, Miss. Torborg Danira",female,8.0,3,1,349909,21.075,,S


In [13]:
# Getting a slice of the Polars DataFrame using index values
df_polars[15:30]

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
16,1,2,"""Hewlett, Mrs. ...","""female""",55.0,0,0,"""248706""",16.0,,"""S"""
17,0,3,"""Rice, Master. ...","""male""",2.0,4,1,"""382652""",29.125,,"""Q"""
18,1,2,"""Williams, Mr. ...","""male""",,0,0,"""244373""",13.0,,"""S"""
19,0,3,"""Vander Planke,...","""female""",31.0,1,0,"""345763""",18.0,,"""S"""
20,1,3,"""Masselmani, Mr...","""female""",,0,0,"""2649""",7.225,,"""C"""
21,0,2,"""Fynney, Mr. Jo...","""male""",35.0,0,0,"""239865""",26.0,,"""S"""
22,1,2,"""Beesley, Mr. L...","""male""",34.0,0,0,"""248698""",13.0,"""D56""","""S"""
23,1,3,"""McGowan, Miss....","""female""",15.0,0,0,"""330923""",8.0292,,"""Q"""
24,1,1,"""Sloper, Mr. Wi...","""male""",28.0,0,0,"""113788""",35.5,"""A6""","""S"""
25,0,3,"""Palsson, Miss....","""female""",8.0,3,1,"""349909""",21.075,,"""S"""


### Filtering the DataFrame by Feature Values

In [14]:
# Extracting teenagers from the Pandas DataFrame
df_pandas[df_pandas['Age'].between(13, 19)]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C
14,15,0,3,"Vestrom, Miss. Hulda Amanda Adolfina",female,14.0,0,0,350406,7.8542,,S
22,23,1,3,"McGowan, Miss. Anna ""Annie""",female,15.0,0,0,330923,8.0292,,Q
27,28,0,1,"Fortune, Mr. Charles Alexander",male,19.0,3,2,19950,263.0000,C23 C25 C27,S
38,39,0,3,"Vander Planke, Miss. Augusta Maria",female,18.0,2,0,345764,18.0000,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
853,854,1,1,"Lines, Miss. Mary Conover",female,16.0,0,1,PC 17592,39.4000,D28,S
855,856,1,3,"Aks, Mrs. Sam (Leah Rosen)",female,18.0,0,1,392091,9.3500,,S
875,876,1,3,"Najib, Miss. Adele Kiamie ""Jane""",female,15.0,0,0,2667,7.2250,,C
877,878,0,3,"Petroff, Mr. Nedelio",male,19.0,0,0,349212,7.8958,,S


In [15]:
# Extracting teenagers from the Polars DataFrame
df_polars.filter(df_polars['Age'].is_between(13, 19))

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
10,1,2,"""Nasser, Mrs. N...","""female""",14.0,1,0,"""237736""",30.0708,,"""C"""
15,0,3,"""Vestrom, Miss....","""female""",14.0,0,0,"""350406""",7.8542,,"""S"""
23,1,3,"""McGowan, Miss....","""female""",15.0,0,0,"""330923""",8.0292,,"""Q"""
28,0,1,"""Fortune, Mr. C...","""male""",19.0,3,2,"""19950""",263.0,"""C23 C25 C27""","""S"""
39,0,3,"""Vander Planke,...","""female""",18.0,2,0,"""345764""",18.0,,"""S"""
40,1,3,"""Nicola-Yarred,...","""female""",14.0,1,0,"""2651""",11.2417,,"""C"""
45,1,3,"""Devaney, Miss....","""female""",19.0,0,0,"""330958""",7.8792,,"""Q"""
50,0,3,"""Arnold-Franchi...","""female""",18.0,1,0,"""349237""",17.8,,"""S"""
68,0,3,"""Crease, Mr. Er...","""male""",19.0,0,0,"""S.P. 3464""",8.1583,,"""S"""
69,1,3,"""Andersson, Mis...","""female""",17.0,4,2,"""3101281""",7.925,,"""S"""


### Filling Null Values

In [16]:
# Filling "Embarked" nulls in the Pandas DataFrame
df_pandas['Embarked'] = df_pandas['Embarked'].fillna('S')

In [17]:
# Filling "Embarked" nulls in the Polars DataFrame
df_polars = df_polars.with_columns(df_polars['Embarked'].fill_null('S'))

### Grouping Data by Feature Names

In [18]:
# Grouping data by ticket class and gender to view counts in the Pandas DataFrame
df_pandas.groupby(by = ['Pclass', 'Sex']).count()

Unnamed: 0_level_0,Unnamed: 1_level_0,PassengerId,Survived,Name,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
Pclass,Sex,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
1,female,94,94,94,85,94,94,94,94,81,94
1,male,122,122,122,101,122,122,122,122,95,122
2,female,76,76,76,74,76,76,76,76,10,76
2,male,108,108,108,99,108,108,108,108,6,108
3,female,144,144,144,102,144,144,144,144,6,144
3,male,347,347,347,253,347,347,347,347,6,347


In [19]:
# Grouping data by ticket class and gender to view counts in the Polars DataFrame
df_polars.groupby(by = ['Pclass', 'Sex']).count()

Pclass,Sex,count
i64,str,u32
1,"""male""",122
2,"""male""",108
3,"""male""",347
2,"""female""",76
3,"""female""",144
1,"""female""",94
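The two outputs above are not quite the same thing. The Pandas `count()` reports the number of non-null values in every remaining column (which is why `Age` and `Cabin` show smaller numbers), while the Polars result is a single row count per group. If a single count per group is all we need from Pandas, `size()` is the closer match. A small sketch on made-up rows:

```python
import pandas as pd

# Made-up passengers with some missing ages
df = pd.DataFrame(
    {
        "Pclass": [1, 1, 3, 3, 3],
        "Sex": ["female", "male", "male", "male", "female"],
        "Age": [38.0, None, 22.0, None, 26.0],
    }
)

# count(): non-null values per column within each group
non_null_counts = df.groupby(["Pclass", "Sex"]).count()

# size(): number of rows in each group, regardless of nulls
group_sizes = df.groupby(["Pclass", "Sex"]).size()

print(non_null_counts)
print(group_sizes)
```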


## Feature Engineering
Now that we have performed some basic data wrangling functions, I want to perform some simple feature engineering so that we can feed this dataset into a machine learning algorithm. I did this same thing with the Titanic dataset a while back [as part of this notebook](https://github.com/dkhundley/titanic-byoc/blob/main/notebooks/feature-engineering.ipynb), so we're going to see if we can basically emulate the same things with Polars.

In [20]:
# Reloading each DataFrame from scratch
df_pandas = pd.read_csv(TITANIC_FILEPATH)
df_polars = pl.read_csv(TITANIC_FILEPATH)

### Dropping Unnecessary Features

In [21]:
# Dropping unnecessary features from the Pandas DataFrame
df_pandas.drop(columns = ['PassengerId', 'Name', 'Ticket', 'Cabin'], inplace = True)

In [22]:
# Dropping unnecessary features from the Polars DataFrame
df_polars = df_polars.drop(columns = ['PassengerId', 'Name', 'Ticket', 'Cabin'])

In [23]:
# Separating the supporting features (X) from the target feature (y) for Pandas
X_pandas = df_pandas.drop(columns = ['Survived'])
y_pandas = df_pandas[['Survived']]

In [24]:
# Separating the supporting features (X) from the target feature (y) for Polars
X_polars = df_polars.drop(columns = ['Survived'])
y_polars = df_polars[['Survived']]

### Engineering the "Sex" (Gender) Column

In [25]:
# Instantiating One Hot Encoder objects for each respective DataFrame
sex_ohe_encoder_pandas = OneHotEncoder(use_cat_names = True, handle_unknown = 'ignore')
sex_ohe_encoder_polars = OneHotEncoder(use_cat_names = True, handle_unknown = 'ignore')

In [26]:
# Performing a one hot encoding on the "Sex" column for the Pandas DataFrame
sex_dummies_pandas = sex_ohe_encoder_pandas.fit_transform(X_pandas['Sex'])

In [27]:
# Performing a one hot encoding on the "Sex" column for the Polars DataFrame
sex_dummies_polars = sex_ohe_encoder_polars.fit_transform(X_polars['Sex'].to_pandas())

In [28]:
# Concatenating the gender dummies back to the original Pandas DataFrame
X_pandas = pd.concat([X_pandas, sex_dummies_pandas], axis = 1)

In [29]:
# Converting the Polars dummies from a Pandas DataFrame to a Polars DataFrame
sex_dummies_polars = pl.from_pandas(sex_dummies_polars)

# Concatenating the gender dummies back to the original Polars DataFrame
X_polars = pl.concat([X_polars, sex_dummies_polars], how = 'horizontal')

In [30]:
# Dropping the original "Sex" column for each DataFrame
X_pandas.drop(columns = ['Sex'], inplace = True)
X_polars = X_polars.drop(columns = ['Sex'])

### Engineering the "Embarked" Column

In [31]:
# Instantiating One Hot Encoder objects for each respective dataframe
embarked_ohe_encoder_pandas = OneHotEncoder(use_cat_names = True, handle_unknown = 'ignore')
embarked_ohe_encoder_polars = OneHotEncoder(use_cat_names = True, handle_unknown = 'ignore')

In [32]:
# Performing a one hot encoding on the "Embarked" column for the Pandas dataframe
embarked_dummies_pandas = embarked_ohe_encoder_pandas.fit_transform(X_pandas['Embarked'])

In [33]:
# Performing a one hot encoding on the "Embarked" column for the Polars dataframe
embarked_dummies_polars = embarked_ohe_encoder_polars.fit_transform(X_polars['Embarked'].to_pandas())

In [34]:
# Concatenating the "embarked" dummies back to the original Pandas dataframe
X_pandas = pd.concat([X_pandas, embarked_dummies_pandas], axis = 1)

In [35]:
# Converting the Polars dummies from a Pandas dataframe to a Polars dataframe
embarked_dummies_polars = pl.from_pandas(embarked_dummies_polars)

# Concatenating the "Embarked" dummies back to the original Polars dataframe
X_polars = pl.concat([X_polars, embarked_dummies_polars], how = 'horizontal')

In [36]:
# Dropping the original "Embarked" column for each dataframe
X_pandas.drop(columns = ['Embarked'], inplace = True)
X_polars = X_polars.drop(columns = ['Embarked'])

### Engineering the "Age" Column

In [37]:
# Extracting the median age of the "Age" column using each respective DataFrame
median_age_pandas = X_pandas['Age'].median()
median_age_polars = X_polars['Age'].median()

In [38]:
# Filling null values with the median age for each respective DataFrame
X_pandas.fillna(median_age_pandas, inplace = True)
X_polars = X_polars.with_columns(X_polars['Age'].fill_null(median_age_polars))

In [39]:
# Establishing our bins values and names
bin_labels = ['child', 'teen', 'young_adult', 'adult', 'elder']
bin_values = [-1, 12, 19, 30, 60, 100]

In [40]:
# Applying "Age" binning for the Pandas DataFrame
age_bins_pandas = pd.DataFrame(pd.cut(X_pandas['Age'], bins = bin_values, labels = bin_labels))

Note: I really tried to get Polars' implementation of the `cut()` function to behave like the Pandas implementation, but... it was confusing. It does appear to work somewhat, but it re-ordered the whole set of data from least to greatest, meaning that I can't simply concatenate it back to the original Polars dataframe. According to [Polars' documentation about the `cut()` function](https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.cut.html), this function is still in an "experimental state" as of February 24, 2023. I demonstrated what I'm talking about in the cell below, but I can't proceed forward like this. I'm going to have to use the Pandas values here for my Polars dataframe.

In [41]:
# Applying "Age" binning for the Polars DataFrame
age_bins_polars = pl.cut(X_polars['Age'], bins = bin_values)
age_bins_polars.head()


Age,break_point,category
f64,f64,cat
0.42,12.0,"""(-1.0, 12.0]"""
0.67,12.0,"""(-1.0, 12.0]"""
0.75,12.0,"""(-1.0, 12.0]"""
0.75,12.0,"""(-1.0, 12.0]"""
0.83,12.0,"""(-1.0, 12.0]"""


In [42]:
# Converting the Pandas age bins to Polars for use in the Polars DataFrame
age_bins_polars = pl.from_pandas(age_bins_pandas)

In [43]:
# Instantiating One Hot Encoder objects for each respective DataFrame
age_ohe_encoder_pandas = OneHotEncoder(use_cat_names = True, handle_unknown = 'ignore')
age_ohe_encoder_polars = OneHotEncoder(use_cat_names = True, handle_unknown = 'ignore')

In [44]:
# Performing a one hot encoding on the age bins for the Pandas DataFrame
age_dummies_pandas = age_ohe_encoder_pandas.fit_transform(age_bins_pandas)

In [45]:
# Performing a one hot encoding on the age bins for the Polars DataFrame
age_dummies_polars = age_ohe_encoder_polars.fit_transform(age_bins_polars.to_pandas())

In [46]:
# Concatenating the age bin dummies back to the original Pandas DataFrame
X_pandas = pd.concat([X_pandas, age_dummies_pandas], axis = 1)

In [47]:
# Converting the Polars dummies from a Pandas dataframe to a Polars DataFrame
age_dummies_polars = pl.from_pandas(age_dummies_polars)

# Concatenating the age bin dummies back to the original Polars DataFrame
X_polars = pl.concat([X_polars, age_dummies_polars], how = 'horizontal')

In [48]:
# Dropping the original "Age" column for each DataFrame
X_pandas.drop(columns = ['Age'], inplace = True)
X_polars = X_polars.drop(columns = ['Age'])

In [49]:
# Viewing the first few rows of the final, feature engineered Pandas DataFrame
X_pandas.head()

Unnamed: 0,Pclass,SibSp,Parch,Fare,Sex_male,Sex_female,Embarked_S,Embarked_C,Embarked_Q,Embarked_nan,Age_child,Age_teen,Age_young_adult,Age_adult,Age_elder
0,3,1,0,7.25,1,0,1,0,0,0,0,0,1,0,0
1,1,1,0,71.2833,0,1,0,1,0,0,0,0,0,1,0
2,3,0,0,7.925,0,1,1,0,0,0,0,0,1,0,0
3,1,1,0,53.1,0,1,1,0,0,0,0,0,0,1,0
4,3,0,0,8.05,1,0,1,0,0,0,0,0,0,1,0


In [50]:
# Viewing the first few rows of the final, feature engineered Polars DataFrame
X_polars.head()

Pclass,SibSp,Parch,Fare,Sex_male,Sex_female,Embarked_S,Embarked_C,Embarked_Q,Embarked_nan,Age_young_adult,Age_adult,Age_child,Age_teen,Age_elder
i64,i64,i64,f64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64
3,1,0,7.25,1,0,1,0,0,0,1,0,0,0,0
1,1,0,71.2833,0,1,0,1,0,0,0,1,0,0,0
3,0,0,7.925,0,1,1,0,0,0,1,0,0,0,0
1,1,0,53.1,0,1,1,0,0,0,0,1,0,0,0
3,0,0,8.05,1,0,1,0,0,0,0,1,0,0,0


## Predictive Modeling with Machine Learning

### Performing a Train-Test Split

In [51]:
# Performing a train-validation split on the Pandas data
X_train_pandas, X_val_pandas, y_train_pandas, y_val_pandas = train_test_split(X_pandas, y_pandas, test_size = 0.2, random_state = 42)

In [52]:
# Performing a train-validation split on the Polars data
X_train_polars, X_val_polars, y_train_polars, y_val_polars = train_test_split(X_polars, y_polars, test_size = 0.2, random_state = 42)

### Performing Model Training

In [53]:
# Instantiating a Random Forest Classifier object for each respective DataFrame
rfc_model_pandas = RandomForestClassifier(n_estimators = 50,
                                          max_depth = 20,
                                          min_samples_split = 10,
                                          min_samples_leaf = 2)

rfc_model_polars = RandomForestClassifier(n_estimators = 50,
                                          max_depth = 20,
                                          min_samples_split = 10,
                                          min_samples_leaf = 2)

In [54]:
# Fitting the Pandas DataFrame to the Random Forest Classifier algorithm
rfc_model_pandas.fit(X_train_pandas, y_train_pandas.values.ravel())

In [55]:
# Fitting the Polars DataFrame to the Random Forest Classifier algorithm
rfc_model_polars.fit(X_train_polars, y_train_polars)

ValueError: Found input variables with inconsistent numbers of samples: [15, 1]