# Data Preprocessing (Polars Version)

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import polars as pl

Next, we need to load our data. Since it is stored as a CSV (comma-separated values) file, we use the `read_csv` function provided by **polars**, which takes the file path as an argument.

In [2]:
df = pl.read_csv('../../datasets/sample.csv')
df.head(5)

Country,Age,Salary,Purchased
str,i64,f64,str
"""France""",44,72000.0,"""No"""
"""Spain""",27,48000.0,"""Yes"""
"""Germany""",30,54000.0,"""No"""
"""Spain""",38,61000.0,"""No"""
"""Germany""",40,,"""Yes"""


In [3]:
df.describe()

statistic,Country,Age,Salary,Purchased
str,str,f64,f64,str
"""count""","""10""",9.0,9.0,"""10"""
"""null_count""","""0""",1.0,1.0,"""0"""
"""mean""",,38.777778,63777.777778,
"""std""",,7.693793,12265.579662,
"""min""","""France""",27.0,48000.0,"""No"""
"""25%""",,35.0,54000.0,
"""50%""",,38.0,61000.0,
"""75%""",,44.0,72000.0,
"""max""","""Spain""",50.0,83000.0,"""Yes"""


### Handling Missing Data

Now we must take care of missing data. Many real-world datasets have gaps, and it is important to handle them deliberately. One intuition is to remove any observation that has a missing value. This is often a poor choice because, as datasets get larger, many observations will have at least one missing field, and dropping them all discards a lot of usable data.

One reasonable solution, and the one used here, is to replace each missing value with the **mean of its column**. This keeps the rest of the information in the row and leaves the column mean unchanged, although it does slightly reduce the column's variance.

We can do this easily in `polars` by selecting the columns we want to impute ('Age' and 'Salary') and using the `fill_null` method, setting the fill value to the `mean` of that column. We use `with_columns` to apply this transformation.

In [4]:
# Fill missing numeric values with the mean of their respective columns
df = df.with_columns([
    pl.col('Age').fill_null(pl.col('Age').mean()),
    pl.col('Salary').fill_null(pl.col('Salary').mean())
])

# Display the full data to see the filled values (10 rows in original data)
print(df.head(10))

shape: (10, 4)
┌─────────┬───────────┬──────────────┬───────────┐
│ Country ┆ Age       ┆ Salary       ┆ Purchased │
│ ---     ┆ ---       ┆ ---          ┆ ---       │
│ str     ┆ f64       ┆ f64          ┆ str       │
╞═════════╪═══════════╪══════════════╪═══════════╡
│ France  ┆ 44.0      ┆ 72000.0      ┆ No        │
│ Spain   ┆ 27.0      ┆ 48000.0      ┆ Yes       │
│ Germany ┆ 30.0      ┆ 54000.0      ┆ No        │
│ Spain   ┆ 38.0      ┆ 61000.0      ┆ No        │
│ Germany ┆ 40.0      ┆ 63777.777778 ┆ Yes       │
│ France  ┆ 35.0      ┆ 58000.0      ┆ Yes       │
│ Spain   ┆ 38.777778 ┆ 52000.0      ┆ No        │
│ France  ┆ 48.0      ┆ 79000.0      ┆ Yes       │
│ Germany ┆ 50.0      ┆ 83000.0      ┆ No        │
│ France  ┆ 37.0      ┆ 67000.0      ┆ Yes       │
└─────────┴───────────┴──────────────┴───────────┘
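
To make the arithmetic behind the fill concrete, here is a minimal sketch in plain Python (not polars) that imputes the mean into a list with a missing entry. The helper name `fill_with_mean` is hypothetical; the salary values are the ones shown in the table above.

```python
from statistics import fmean

def fill_with_mean(values):
    """Replace None entries with the mean of the non-missing entries."""
    known = [v for v in values if v is not None]
    mean = fmean(known)
    return [mean if v is None else v for v in values]

# Salary column from the sample data; the fifth entry is missing
salary = [72000, 48000, 54000, 61000, None, 58000, 52000, 79000, 83000, 67000]
filled = fill_with_mean(salary)
print(round(filled[4], 6))  # 63777.777778
```

The nine known salaries sum to 574000, so the imputed value is 574000 / 9, which agrees with the filled row in the polars output.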


### Encoding Categorical Variables

Now we must encode any categorical variables we might have. When building machine learning models, we are training a mathematical model on data, and strings (like 'Country' and 'Purchased' in this data set) have no meaning as mathematical objects. That is, until we encode them.

**1. Feature Variables (X):** For the 'Country' column, we will use one-hot encoding. `polars` provides a simple method `to_dummies` which will create new binary (0 or 1) columns for each country ('Country_France', 'Country_Spain', etc.).

**2. Target Variable (y):** For the 'Purchased' column, we need to map 'Yes' and 'No' to numeric values. We will use a `when-then-otherwise` expression to map 'Yes' to 1 and 'No' to 0.

In [5]:
# Encode categorical feature 'Country' using one-hot encoding
df = df.to_dummies(columns=['Country'])

# Encode the target variable 'Purchased'
# Map 'Yes' to 1 and 'No' to 0
df = df.with_columns(
    pl.when(pl.col('Purchased') == 'Yes')
      .then(1)
      .otherwise(0)
      .alias('Purchased') # Overwrite the original column
      .cast(pl.Int32)     # Cast to integer
)

print(df.head(10))

shape: (10, 6)
┌────────────────┬─────────────────┬───────────────┬───────────┬──────────────┬───────────┐
│ Country_France ┆ Country_Germany ┆ Country_Spain ┆ Age       ┆ Salary       ┆ Purchased │
│ ---            ┆ ---             ┆ ---           ┆ ---       ┆ ---          ┆ ---       │
│ u8             ┆ u8              ┆ u8            ┆ f64       ┆ f64          ┆ i32       │
╞════════════════╪═════════════════╪═══════════════╪═══════════╪══════════════╪═══════════╡
│ 1              ┆ 0               ┆ 0             ┆ 44.0      ┆ 72000.0      ┆ 0         │
│ 0              ┆ 0               ┆ 1             ┆ 27.0      ┆ 48000.0      ┆ 1         │
│ 0              ┆ 1               ┆ 0             ┆ 30.0      ┆ 54000.0      ┆ 0         │
│ 0              ┆ 0               ┆ 1             ┆ 38.0      ┆ 61000.0      ┆ 0         │
│ 0              ┆ 1               ┆ 0             ┆ 40.0      ┆ 63777.777778 ┆ 1         │
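│ 1              ┆ 0               ┆ 0             ┆ 35.0      ┆ 58000.0      ┆ 1         │
│ 0              ┆ 0               ┆ 1             ┆ 38.777778 ┆ 52000.0      ┆ 0         │
│ 1              ┆ 0               ┆ 0             ┆ 48.0      ┆ 79000.0      ┆ 1         │
│ 0              ┆ 1               ┆ 0             ┆ 50.0      ┆ 83000.0      ┆ 0         │
│ 1              ┆ 0               ┆ 0             ┆ 37.0      ┆ 67000.0      ┆ 1         │
└────────────────┴─────────────────┴───────────────┴───────────┴──────────────┴───────────┘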
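
As a rough illustration of what these two encodings produce, the sketch below builds the indicator columns and the 0/1 target by hand in plain Python. It is not how `polars` implements `to_dummies`; the function name `one_hot` and its `prefix` argument are assumptions made for the example.

```python
def one_hot(values, prefix):
    """Return a dict of 0/1 indicator columns, one per distinct value."""
    categories = sorted(set(values))
    return {
        f"{prefix}_{cat}": [1 if v == cat else 0 for v in values]
        for cat in categories
    }

# First five rows of the 'Country' column
country = ["France", "Spain", "Germany", "Spain", "Germany"]
encoded = one_hot(country, "Country")
for name, column in encoded.items():
    print(name, column)
# Country_France [1, 0, 0, 0, 0]
# Country_Germany [0, 0, 1, 0, 1]
# Country_Spain [0, 1, 0, 1, 0]

# Target mapping: 'Yes' -> 1, anything else -> 0
purchased = ["No", "Yes", "No", "No", "Yes"]
target = [1 if p == "Yes" else 0 for p in purchased]
print(target)  # [0, 1, 0, 0, 1]
```

Each row has exactly one 1 across the indicator columns, which is what makes the encoding "one-hot".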

### Splitting Features (X) and Target (y)

Now that all our data is numeric and preprocessed, we can separate our features (X) from our target variable (y).

* **X (features):** All columns *except* 'Purchased'.
* **y (target):** The 'Purchased' column.

We will convert these to **NumPy arrays**, a format that `sklearn`'s `train_test_split` function and its estimators handle reliably.

In [6]:
# Select all columns *except* 'Purchased' for X
X = df.select(pl.all().exclude('Purchased')).to_numpy()

# Select only the 'Purchased' column for y
y = df.select('Purchased').to_numpy().ravel() # .ravel() for 1D array

print("X = ", X[:5]) # Print first 5 rows
print("y = ", y[:5]) # Print first 5 values

X =  [[1.00000000e+00 0.00000000e+00 0.00000000e+00 4.40000000e+01
  7.20000000e+04]
 [0.00000000e+00 0.00000000e+00 1.00000000e+00 2.70000000e+01
  4.80000000e+04]
 [0.00000000e+00 1.00000000e+00 0.00000000e+00 3.00000000e+01
  5.40000000e+04]
 [0.00000000e+00 0.00000000e+00 1.00000000e+00 3.80000000e+01
  6.10000000e+04]
 [0.00000000e+00 1.00000000e+00 0.00000000e+00 4.00000000e+01
  6.37777778e+04]]
y =  [0 1 0 0 1]
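
The `.ravel()` call matters because converting a one-column selection yields a two-dimensional array with a single column, while scikit-learn estimators generally expect a one-dimensional target. A small NumPy-only sketch of the shape change (the array is typed in by hand for illustration):

```python
import numpy as np

# Shape (5, 1): what a one-column table looks like as an array
column = np.array([[0], [1], [0], [0], [1]])
# Shape (5,): the flattened, one-dimensional version
flat = column.ravel()

print(column.shape)  # (5, 1)
print(flat.shape)    # (5,)
print(flat)          # [0 1 0 0 1]
```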


### Splitting into Training and Test Sets

The penultimate step in data preprocessing is to split our data into training and testing sets, that is, two disjoint sets of observations. The 'train' set is what the model learns from. Once the model has been fitted on the train data, we use the 'test' set to see how well it performs on observations it has not seen.

It is absolutely critical that there is no overlap between these sets. To understand why, consider this scenario. A teacher always gives out a practice exam before the actual exam and wants to know how performance on the practice exam relates to performance on the real exam (the practice exam is our 'train' set and the real exam is our 'test' set). If questions from the practice exam reappear on the real exam, then scores on the real exam will overstate how much the students actually learned.

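A common starting point is an 80-20 train-test split. `sklearn` provides an easy way to perform the split, shown below. We use a 70-30 split (`test_size = 0.3`) to match the original notebook, fix `random_state` so the split is reproducible, and pass `stratify=y` so that both sets keep roughly the same proportion of 'Yes' and 'No' labels.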

In [7]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0, stratify=y)
print("X train = ", X_train, "\n")
print("X test = ", X_test, "\n")

X train =  [[0.00000000e+00 0.00000000e+00 1.00000000e+00 2.70000000e+01
  4.80000000e+04]
 [0.00000000e+00 1.00000000e+00 0.00000000e+00 4.00000000e+01
  6.37777778e+04]
 [1.00000000e+00 0.00000000e+00 0.00000000e+00 3.70000000e+01
  6.70000000e+04]
 [0.00000000e+00 0.00000000e+00 1.00000000e+00 3.80000000e+01
  6.10000000e+04]
 [0.00000000e+00 1.00000000e+00 0.00000000e+00 3.00000000e+01
  5.40000000e+04]
 [1.00000000e+00 0.00000000e+00 0.00000000e+00 4.80000000e+01
  7.90000000e+04]
 [0.00000000e+00 1.00000000e+00 0.00000000e+00 5.00000000e+01
  8.30000000e+04]] 

X test =  [[1.00000000e+00 0.00000000e+00 0.00000000e+00 4.40000000e+01
  7.20000000e+04]
 [0.00000000e+00 0.00000000e+00 1.00000000e+00 3.87777778e+01
  5.20000000e+04]
 [1.00000000e+00 0.00000000e+00 0.00000000e+00 3.50000000e+01
  5.80000000e+04]] 
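
The disjointness requirement can be checked directly. The sketch below is a simplified, plain-Python stand-in for a random split: it does not stratify and is not scikit-learn's implementation, and the function name `split_indices` is hypothetical.

```python
import random

def split_indices(n_samples, test_size, seed=0):
    """Shuffle row indices and cut them into disjoint train/test lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Simplified rounding rule, chosen for this sketch only
    n_test = max(1, round(n_samples * test_size))
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(n_samples=10, test_size=0.3)
print(len(train_idx), len(test_idx))        # 7 3
print(set(train_idx).isdisjoint(test_idx))  # True
```

Because every index lands in exactly one of the two lists, no observation can be used both for fitting and for evaluation.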



### Feature Scaling

A final step for many machine learning models is feature scaling. The mathematics behind it is not covered in detail here, but the core idea is simple: if features are on very different scales, models that rely on distances or on the magnitude of coefficients can end up weighting the large-scale feature far more heavily than the others.

Take our Age and Salary features as an example. Age ranges from 27 to 50, while Salary ranges from 48000 to 83000. If we built a model on the raw values, the Salary term would dominate the Age term simply because its values are about a thousand times larger. The remedy is to rescale all features to roughly the same range.
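
One way to see the effect is with Euclidean distance, which distance-based models rely on. In the sketch below (plain Python, with made-up people for illustration), a 20-year age gap barely registers next to a salary gap of 1000.

```python
import math

# (age, salary) pairs; the values are illustrative, not taken from the dataset
person_a = (30, 50000)
person_b = (50, 50000)  # 20 years older, same salary
person_c = (30, 51000)  # same age, salary differs by 1000

print(math.dist(person_a, person_b))  # 20.0
print(math.dist(person_a, person_c))  # 1000.0
```

On the raw scale, person C appears fifty times further from A than person B does, purely because of the units in which Salary is measured.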

We will use `StandardScaler` from `sklearn`, which rescales each feature to have a mean of 0 and a standard deviation of 1 on the data it is fitted to.

**Important:** We `fit_transform` on the **training data** (to learn the mean and std) but only `transform` the **test data** (to apply the *same* scaling as the training data). This prevents data leakage.

In [8]:
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
print("X train = ", X_train)

X train =  [[-0.63245553 -0.8660254   1.58113883 -1.47345809 -1.46772255]
 [-0.63245553  1.15470054 -0.63245553  0.18190841 -0.11436799]
 [ 1.58113883 -0.8660254  -0.63245553 -0.20009925  0.16202132]
 [-0.63245553 -0.8660254   1.58113883 -0.07276336 -0.35263464]
 [-0.63245553  1.15470054 -0.63245553 -1.09145044 -0.95306659]
 [ 1.58113883 -0.8660254  -0.63245553  1.20059548  1.19133324]
 [-0.63245553  1.15470054 -0.63245553  1.45526725  1.53443721]]
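
For reference, standardization computes z = (x - mean) / std, with both statistics taken from the training data only. The sketch below reproduces this by hand for the Age column using Python's `statistics` module and the population standard deviation, which agrees with the scaled values printed above. The helper `standardize` is a hypothetical name, and the ages are read off the train and test arrays shown earlier.

```python
from statistics import fmean, pstdev

def standardize(values, mean, std):
    """Apply z = (x - mean) / std using statistics supplied by the caller."""
    return [(v - mean) / std for v in values]

age_train = [27, 40, 37, 38, 30, 48, 50]
age_test = [44, 38.777778, 35]

# Learn the statistics from the training rows only
mean, std = fmean(age_train), pstdev(age_train)

age_train_scaled = standardize(age_train, mean, std)
print(round(age_train_scaled[0], 4))  # -1.4735

# The test rows reuse the training mean and std; nothing is re-fit
age_test_scaled = standardize(age_test, mean, std)
```

Computing the mean and standard deviation from the training rows alone is what keeps information about the test set out of the fitted scaler.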


We have now seen a general overview of how to preprocess data for machine learning models. Carrying out these steps as needed before training helps ensure that the resulting model is accurate and that its evaluation can be trusted.