In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

# 1. Types of variables (Page 15 - Section 1.2.2)

You have learned about continuous (numerical), discrete (numerical), nominal (categorical), and ordinal (categorical) variable types. Let's see an example dataframe with each type:

In [2]:
data = {
    'Height_cm': [175.0, 160.5, 182.3, 158.4, 170.2], # Continuous
    'Number_of_Children': [2, 1, 0, 3, 1], # Discrete
    'Favorite_Color': ['Red', 'Blue', 'Green', 'Blue', 'Green'], # Nominal
    'Satisfaction': ['High', 'Low', 'Medium', 'Medium', 'High'] # Ordinal
}

# Create DataFrame
df = pd.DataFrame(data)
df.head()

Unnamed: 0,Height_cm,Number_of_Children,Favorite_Color,Satisfaction
0,175.0,2,Red,High
1,160.5,1,Blue,Low
2,182.3,0,Green,Medium
3,158.4,3,Blue,Medium
4,170.2,1,Green,High


Imagine you would like to use these as features for an ML model. ML models typically expect numerical data. Therefore, while numerical variables can be used as is, categorical variables might require encoding. Let's see how you can encode nominal and ordinal variables in scikit-learn.

## 1.1 Encoding nominal categorical variables

The most common way to encode nominal categorical variables is one-hot encoding. It represents each category as a unique binary vector in which all elements are 0 except the one corresponding to the category, which is set to 1. This approach is commonly used in machine learning to allow algorithms to process categorical data that would otherwise be difficult to interpret. In our example, scikit-learn sorts the categories alphabetically, so "Blue," "Green," and "Red" are represented as [1, 0, 0], [0, 1, 0], and [0, 0, 1], respectively.

In [3]:
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(sparse_output=False)

In [4]:
# The fit operation determines the encoding for each unique category in the column.
# It sets up the encoder but does not transform the actual data yet.
ohe.fit(df[['Favorite_Color']])

In [5]:
feature_names = ohe.categories_
feature_names

[array(['Blue', 'Green', 'Red'], dtype=object)]

In [6]:
# The transform operation converts the original categorical data
# into one-hot encoded binary vectors.
encoded_features = ohe.transform(df[['Favorite_Color']])
encoded_features

array([[0., 0., 1.],
       [1., 0., 0.],
       [0., 1., 0.],
       [1., 0., 0.],
       [0., 1., 0.]])

In [7]:
# The resulting features
# categories_ is a list with one array per encoded column, so take the first (and only) entry
encoded_df = pd.DataFrame(encoded_features, columns=feature_names[0]).astype('int')
encoded_df

Unnamed: 0,Blue,Green,Red
0,0,0,1
1,1,0,0
2,0,1,0
3,1,0,0
4,0,1,0


Please read the [OneHotEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder) documentation for more information.

## 1.2 Encoding ordinal categorical variables

Ordinal categorical variables have a meaningful order or ranking among categories. To encode these variables, one-hot encoding can be used. However, one-hot encoding loses the inherent order among the categories. Ordinal encoding, on the other hand, preserves this order by assigning integer values to each category based on their rank. Here's how you can do it:

In [17]:
from sklearn.preprocessing import OrdinalEncoder
oe = OrdinalEncoder()
oe.fit(df[['Satisfaction']])

In [21]:
encoded_features = oe.transform(df[['Satisfaction']])

In [25]:
encoded_df = pd.DataFrame(encoded_features, columns=['Satisfaction_Encoded']).astype('int')
combined_df = pd.concat([df[['Satisfaction']], encoded_df], axis=1)
combined_df

Unnamed: 0,Satisfaction,Satisfaction_Encoded
0,High,0
1,Low,1
2,Medium,2
3,Medium,2
4,High,0


Ordinal encoding did assign an integer to each category, but the order is not what we expected. When no order is specified, scikit-learn's OrdinalEncoder sorts the unique values, which for strings means alphabetical order (High, Low, Medium). To ensure the correct order is preserved, we can explicitly define the category order when creating the OrdinalEncoder object:

In [28]:
oe = OrdinalEncoder(categories=[['Low', 'Medium', 'High']])
oe.fit(df[['Satisfaction']])

In [29]:
encoded_features = oe.transform(df[['Satisfaction']])
encoded_df = pd.DataFrame(encoded_features, columns=['Satisfaction_Encoded']).astype('int')
combined_df = pd.concat([df[['Satisfaction']], encoded_df], axis=1)
combined_df

Unnamed: 0,Satisfaction,Satisfaction_Encoded
0,High,2
1,Low,0
2,Medium,1
3,Medium,1
4,High,2


Looks better! Please read the [OrdinalEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html) documentation for more information.
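If you only need the integer codes and not a scikit-learn transformer, an ordered pandas categorical is an alternative. This is a minimal sketch on the same five satisfaction values; the variable names are illustrative.

```python
import pandas as pd

satisfaction = pd.Series(['High', 'Low', 'Medium', 'Medium', 'High'], name='Satisfaction')

# Declare the order explicitly, from lowest to highest
ordered = pd.Categorical(satisfaction, categories=['Low', 'Medium', 'High'], ordered=True)

# .codes gives the position of each value in the declared category order
codes = pd.Series(ordered.codes, name='Satisfaction_Encoded')
print(pd.concat([satisfaction, codes], axis=1))

# Ordered categoricals also support comparisons against a category
is_above_low = pd.Series(ordered) > 'Low'
print(is_above_low.tolist())
```

The codes match the output of the explicitly ordered OrdinalEncoder above: Low is 0, Medium is 1, and High is 2.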

IMPORTANT

The way you encode your data also depends on the algorithm you would like to use. For example, CatBoost supports categorical features natively.

https://catboost.ai/

# 2. The Iris Dataset (Page 20 - Exercise 1.9)

Sir Ronald Aylmer Fisher was an English statistician, evolutionary biologist, and geneticist who worked on a data set that contained sepal length and width, and petal length and width from three species of iris flowers (setosa, versicolor and virginica). There were 50 flowers from each species in the data set.

This dataset is available in scikit-learn:

In [34]:
from sklearn.datasets import load_iris
data = load_iris(as_frame=True)
iris_df = data.frame
iris_df.head()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),target
0,5.1,3.5,1.4,0.2,0
1,4.9,3.0,1.4,0.2,0
2,4.7,3.2,1.3,0.2,0
3,4.6,3.1,1.5,0.2,0
4,5.0,3.6,1.4,0.2,0


Use this dataframe to answer the following:

(a) How many cases were included in the data?

(b) How many numerical variables are included in the data? Indicate what they are, and if they are continuous or discrete.

(c) How many categorical variables are included in the data, and what are they? List the corresponding levels (categories).

In [35]:
# (a)
num_rows = iris_df.shape[0]
print(num_rows)

150


In [36]:
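# (b)
# Note: select_dtypes also returns 'target', because the species is stored as an integer code.
# The four measurements (in cm) are the continuous numerical variables; 'target' is categorical.
numerical_values = iris_df.select_dtypes(include = 'number').columns
print(numerical_values)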

Index(['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)',
       'petal width (cm)', 'target'],
      dtype='object')


In [38]:
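# (c)
# 'target' is the only categorical variable, but it is stored as integers,
# so select_dtypes(include='category') returns an empty index.
# The levels 0, 1, 2 correspond to data.target_names: setosa, versicolor, virginica.
categorical_values = iris_df.select_dtypes(include = 'category').columns
print(categorical_values)
print(iris_df['target'].unique())

Index([], dtype='object')
[0 1 2]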
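Because the species is stored as an integer code, pandas treats it as numerical. One way to make the variable type explicit is to convert the codes into a pandas categorical with the species names as levels. This is a sketch; the `species` column name is an assumption made for this example.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
iris_named = iris.frame.copy()

# Map the integer codes 0, 1, 2 to the species names stored in target_names
iris_named['species'] = pd.Categorical.from_codes(
    iris_named['target'], categories=iris.target_names
)

# Now the dtype-based selection finds the categorical variable
categorical_columns = iris_named.select_dtypes(include='category').columns
levels = list(iris_named['species'].cat.categories)

print(list(categorical_columns))
print(levels)
print(iris_named['species'].value_counts())
```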


# 3. Sampling principles (Page 22 - Section 1.3)

In machine learning, we sample data in two distinct scenarios:

- Dataset collection
- Train/Val/Test split

**Dataset collection**

When collecting data, our goal is to obtain a representative subset of the target population to ensure that the model is trained on data that accurately reflects the real-world scenario. For instance, in developing a medical image classification model, you may need data from various hospitals, multiple doctors, and different imaging systems. The sampling strategy must align with the population of interest, which is closely tied to the specific business case.

1. If the model is intended to work exclusively with Siemens X-Ray machines, the sampling can focus on data from these specific machines.
2. If the model is designed to work with any X-ray machine, the sampling must encompass a broader range of machines from different manufacturers.

The second case involves a larger and more diverse population than the first, requiring a more comprehensive sampling approach to ensure the model's generalizability across different equipment.

Take a look at this paper: [ObjectNet](https://objectnet.dev), which shows how ImageNet, the most famous image classification benchmark, is in fact a biased sample 🤯 It also introduces a new dataset with controlled variables like object orientation, background, and viewpoint, revealing that models trained on ImageNet often fail to generalize well in varied real-world conditions. This emphasizes how critical sampling strategies are.

Why is it important to consider controlled variables like orientation, background, and viewpoint in evaluating model performance? What do these findings suggest about the limitations of benchmarks like ImageNet?

YOUR ANSWER HERE

Controlled variables like orientation, background, and viewpoint are essential in evaluating model performance because they reflect real-world diversity. Without these, models may overfit to specific conditions in the training set and fail to generalize to new situations.

Benchmarks like ImageNet have limitations because they often feature biased samples with ideal conditions. As shown by ObjectNet, models trained on ImageNet struggle in varied real-world conditions, suggesting the need for more diverse and realistic datasets to ensure robust model performance.

**Train/Val/Test split**

After collecting data, we end up with a sample that needs to be further divided into training, validation, and test sets (and potentially more subsets). This division is also a form of sampling, but it's done within the already collected sample, rather than directly from the population. The same principle applies: each subset should be representative of the entire population to ensure the model's validity. Additionally, it's crucial to ensure that these subsets are independent of each other to prevent information leakage, which could lead to biased or overly optimistic performance estimates.
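A three-way split can be sketched with two consecutive calls to `train_test_split`. The 60/20/20 proportions and the toy arrays below are assumptions chosen for illustration, not a recommendation.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 samples identified by their row number, with alternating labels
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.arange(100) % 2

# First split off 40% of the data, then divide that part equally into validation and test
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=0
)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=0
)

print(len(X_train), len(X_val), len(X_test))

# The subsets share no samples
train_ids, val_ids, test_ids = set(X_train.ravel()), set(X_val.ravel()), set(X_test.ravel())
print(train_ids & val_ids, train_ids & test_ids, val_ids & test_ids)
```

Having no rows in common is necessary but not sufficient for independence: rows that are related to each other, such as repeated measurements of the same subject, can still leak information between subsets.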

Andrew Ng, a renowned AI expert, co-authored a 2017 paper (CheXNet, Rajpurkar et al.) on a model for classifying chest X-rays using the ChestX-ray14 dataset. Here is the section where the authors describe the dataset:

"We use the ChestX-ray14 dataset released by Wang et al. (2017) which contains 112,120 frontal-view X-ray images of 30,805 unique patients. Wang et al. (2017) annotate each image with up to 14 different thoracic pathology labels using automatic extraction methods on radiology reports. We label images that have pneumonia as one of the annotated pathologies as positive examples and label all other images as negative examples for the pneumonia detection task. **We randomly split the entire dataset into 80% training, and 20% validation.**"

Paper: https://arxiv.org/pdf/1711.05225v1

What potential issues could arise from using a random split for training and validation in this context, particularly considering the number of patients and the number of images?

YOUR ANSWER HERE

There are more images than patients, meaning some patients have multiple images. If we randomly split the data into training and validation sets, images from the same patient could end up in both sets. This would cause information leakage, where information from the training set is unintentionally shared with the validation set, leading to an overestimation of the model's performance.
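One common way to avoid this kind of leakage is to split by patient rather than by image. The sketch below uses scikit-learn's `GroupShuffleSplit` on synthetic data; the patient ids, labels, and sizes are made up for illustration and are not taken from ChestX-ray14.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit, train_test_split

rng = np.random.default_rng(0)
n_images = 1000
# Hypothetical data: 200 patients, about 5 images each
patient_ids = rng.integers(0, 200, size=n_images)
labels = rng.integers(0, 2, size=n_images)
images = np.arange(n_images).reshape(-1, 1)  # stand-in for the image data

# Plain random split: the same patient can appear on both sides
rand_train, rand_val = train_test_split(np.arange(n_images), test_size=0.2, random_state=0)
shared_random = set(patient_ids[rand_train]) & set(patient_ids[rand_val])

# Group-aware split: all images of a patient go to the same side
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, val_idx = next(splitter.split(images, labels, groups=patient_ids))
shared_grouped = set(patient_ids[train_idx]) & set(patient_ids[val_idx])

print(f"Patients in both sets (random split): {len(shared_random)}")
print(f"Patients in both sets (grouped split): {len(shared_grouped)}")
```

Note that `test_size` here refers to the share of groups (patients), so the share of images in the validation set is only approximately 20%.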

## 3.1 Simple random sampling

Let's see how we can do sampling in pandas and numpy. In pandas, getting a sample is simple:

In [39]:
# run this cell a few times
iris_df.sample(n=10)

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),target
94,5.6,2.7,4.2,1.3,1
130,7.4,2.8,6.1,1.9,2
33,5.5,4.2,1.4,0.2,0
93,5.0,2.3,3.3,1.0,1
12,4.8,3.0,1.4,0.1,0
126,6.2,2.8,4.8,1.8,2
59,5.2,2.7,3.9,1.4,1
111,6.4,2.7,5.3,1.9,2
141,6.9,3.1,5.1,2.3,2
54,6.5,2.8,4.6,1.5,1


`numpy.random.choice()` can only sample from a 1D array, so we sample the row indices instead:

In [58]:
random_indices = np.random.choice(iris_df.index, size=10)
random_indices

array([126,  70,  96,  23,  50,  89,  14, 126,  16,  87], dtype=int64)

However, this method samples with replacement by default, meaning the same index can be drawn more than once (126 appears twice above).

In sampling, "with replacement" means that each selected element is returned to the population before the next selection, allowing it to be chosen more than once. Conversely, "without replacement" means that once an element is selected, it is removed from the population and cannot be chosen again.

Setting `replace=False` solves this issue.

In [64]:
random_indices = np.random.choice(iris_df.index, size=10, replace=False)
random_indices

array([ 69,  32,  15, 113,  77,  28,  13,  70,  22,   3], dtype=int64)

In [73]:
iris_df.loc[random_indices]

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),target
69,5.6,2.5,3.9,1.1,1
32,5.2,4.1,1.5,0.1,0
15,5.7,4.4,1.5,0.4,0
113,5.7,2.5,5.0,2.0,2
77,6.7,3.0,5.0,1.7,1
28,5.2,3.4,1.4,0.2,0
13,4.3,3.0,1.1,0.1,0
70,5.9,3.2,4.8,1.8,1
22,4.6,3.6,1.0,0.2,0
3,4.6,3.1,1.5,0.2,0


In [82]:
# we can sample multiple times
random_indices = np.random.choice(iris_df.index, size=(5, 10), replace=False)
random_indices

array([[146, 137,  28, 115,  98,  93,  20,  68,  81,   1],
       [ 12,  73,   8,  41,  84,  42,  11,  21,  99,  58],
       [ 38,  67,  10,  23,  25,  75, 114,  45, 132, 100],
       [  4,  44, 105, 149,  53,  97,  51, 109,  60, 119],
       [139, 123,  85,   2,  17, 135,  77,  78, 112, 138]], dtype=int64)

However, `replace=False` applies to the whole array, not just to each row: ids are unique within a row and also across rows. This means the total number of ids drawn (rows × columns) cannot exceed the population size:

In [86]:
# since 20*10 = 200 is larger than the dataset size (150) we will get an error
random_indices = np.random.choice(iris_df.index, size=(20, 10), replace=False)
random_indices

ValueError: Cannot take a larger sample than population when 'replace=False'

In [115]:
# we can solve this by replace=True
# but now again some rows can have some ids more than once
random_indices = np.random.choice(iris_df.index, size=(20, 10), replace=True)
random_indices

array([[ 82,  88,  33,  65,  19,  83,  69, 117, 133,  27],
       [  9, 139,  13, 118,  70, 101,   9, 118, 123,  90],
       [ 62, 104,  66,  33,   6,  85,  20, 125, 103,  68],
       [ 35,  29, 104,  96,  17,  15,  43,  24, 108,  51],
       [112, 138,   0,  23,  87, 126,  63,   2, 123,  21],
       [ 72, 135, 110,  20,  48,  65,   3,  37,  16, 123],
       [ 12,  82,  78, 130, 111, 121,  63,  66,   2,  81],
       [ 22, 139,   7, 147,  73,  85,   6, 136, 107, 130],
       [ 78, 129, 143,  81,  98, 133,   9,  45,  63,  45],
       [  7,  30,  18,  84,  55,  55,  34,  67,  44,  90],
       [135,  66,  92, 131,   3,  79,  20,  56,  99, 147],
       [ 87, 116, 128,  52,  70,  86,  72,  17, 149,  48],
       [ 97,  79,  67, 109, 140,   2,  97,  31, 142, 142],
       [116,  26, 143,  87,   4,  27,  59,  89, 114,   1],
       [ 52, 113,  44,  36,  88,  48,  24, 109, 111,  59],
       [ 94,  50,  70, 111, 138, 146,  65,  72,  60,  97],
       [ 76, 142, 113,  56, 124,  55,  55, 114, 136, 106

In [119]:
# which we can check as follows:

def has_duplicates(row):
    return len(row) != len(np.unique(row))

rows_with_duplicates = np.array([row for row in random_indices if has_duplicates(row)])
rows_with_duplicates

array([[  9, 139,  13, 118,  70, 101,   9, 118, 123,  90],
       [ 78, 129, 143,  81,  98, 133,   9,  45,  63,  45],
       [  7,  30,  18,  84,  55,  55,  34,  67,  44,  90],
       [ 97,  79,  67, 109, 140,   2,  97,  31, 142, 142],
       [ 76, 142, 113,  56, 124,  55,  55, 114, 136, 106],
       [ 76,  47,  85, 130,  72,  84,  35,  24, 126,  72]], dtype=int64)

In [123]:
# We can solve this with a for loop
# Inside every row we have unique ids
# But from row to row ids can repeat
random_indices = np.array([np.random.choice(iris_df.index, size=10, replace=False) for _ in range(20)])
random_indices

array([[ 95,  20,  13,  85,  72,  46,  52,  99, 141, 149],
       [104,  75,  74, 122, 110,  77,  55,  70, 126,  54],
       [ 67,  58,  85, 102,  65, 136,  61, 133, 132,  55],
       [ 38, 134,  85,  76,  86,  25,  88,  73, 147,  48],
       [109, 128, 147,  50, 149,   5,  61, 124, 138,  84],
       [140,  24,   5, 109, 139,  31, 104, 105,  41,  92],
       [ 46,  95,  99, 108,  74,  42, 126, 131,  40, 119],
       [ 46,  89, 109, 137, 100,  10, 108,  28,  95,  96],
       [138, 118,  26,  10, 110,  37, 104,   5,   2, 149],
       [132,  91, 140, 128,  49,  82,  16, 109, 146,  71],
       [146, 112,   7,  75, 134, 127,  54,  25,  15,  90],
       [ 43,  83,  23, 137,   0,  87,  80,  37,  94,   4],
       [ 84,  67, 104,  47, 132, 113, 103, 149, 100, 139],
       [106, 140, 133,  33, 112,  65,  75,  10,  59, 118],
       [137,   6, 130, 143,  37,  58,  51,   7, 118, 135],
       [110,   2,  39,  14, 126, 101, 128,  32, 125, 130],
       [ 40,  58, 113,  64,  62,  41,  17,  71,  13,   8

In [124]:
# no duplicates in any row:
rows_with_duplicates = np.array([row for row in random_indices if has_duplicates(row)])
rows_with_duplicates

array([], dtype=float64)
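The same result can be obtained without a Python loop. The sketch below is one possible vectorized approach, not the only one: it draws a matrix of random keys and sorts each row, so every row becomes an independent random permutation of the positions 0 to 149, and the first 10 entries of a row form a sample without replacement. It assumes the dataframe has the default integer index, as `iris_df` does.

```python
import numpy as np

rng = np.random.default_rng(42)
population_size, n_repeats, sample_size = 150, 20, 10

# One row of random keys per repetition; argsort turns each row into a random permutation
random_keys = rng.random((n_repeats, population_size))
samples = np.argsort(random_keys, axis=1)[:, :sample_size]

# Ids are unique within each row, but may repeat from row to row
rows_are_unique = all(len(np.unique(row)) == sample_size for row in samples)
print(samples.shape, rows_are_unique)
```

This sorts the full population for every repetition, so for a very large population and a small sample the loop over `choice(..., replace=False)` may be the more economical option.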

## 3.2 Stratified sampling

scikit-learn lets you do stratified sampling when splitting a dataset. Let's see why this is important, using randomly generated data.

Imagine a binary classification problem with 50 samples, each having 5 features. The positive class has a prevalence of 10%, meaning that 10% of the samples are labeled as positive (class 1), while the remaining 90% are labeled as negative (class 0). The dataset is generated with random features and class labels according to the specified proportions.

In [157]:
n_samples = 50
n_features = 5
positive_class_proportion = 0.1
negative_class_proportion = 1 - positive_class_proportion

X = np.random.randn(n_samples, n_features)
y = np.random.choice([0, 1], size=n_samples, p=[negative_class_proportion, positive_class_proportion])

In [158]:
# Note: The actual proportion of the positive class may not be exactly 0.1 due to the randomness in sampling,
# and with only 50 samples it can deviate noticeably. To check, we can calculate the proportion of positive samples in 'y'.
p_y = np.sum(y) / n_samples
print(p_y)

0.22


In [162]:
# If you prefer a method that guarantees the exact proportion, build the labels directly.
# Which approach to choose depends on the simulation you would like to make.
number_of_ones = int(round(positive_class_proportion * n_samples))
number_of_zeros = n_samples - number_of_ones
y = np.concatenate((np.ones(number_of_ones), np.zeros(number_of_zeros)))
p_y = np.sum(y) / n_samples
print(p_y)

0.1


In [163]:
from sklearn.model_selection import train_test_split

In [173]:
# RUN THIS CELL MULTIPLE TIMES
# let's first split with simple random sampling
X_train, X_test, y_train, y_test = train_test_split(X, y)

p_train = np.sum(y_train)/len(y_train)
p_test = np.sum(y_test)/len(y_test)

print(f"Positive class proportion in training set: {p_train:.2f}")
print(f"Positive class proportion in test set: {p_test:.2f}")
print(f"Positive class proportion before split: {p_y:.2f}")

Positive class proportion in training set: 0.11
Positive class proportion in test set: 0.08
Positive class proportion before split: 0.10


When you run the cell above multiple times, you may notice that the proportions of the positive class in the training and test sets can vary significantly across different splits. This variability occurs because the simple random sampling method does not control for class proportions.

We can use stratification and solve this problem:

In [177]:
# RUN THIS CELL MULTIPLE TIMES
# You will see that the proportions are always close to the positive_class_proportion
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)

p_train = np.sum(y_train)/len(y_train)
p_test = np.sum(y_test)/len(y_test)

print(f"Positive class proportion in training set: {p_train:.2f}")
print(f"Positive class proportion in test set: {p_test:.2f}")
print(f"Positive class proportion before split: {p_y:.2f}")

Positive class proportion in training set: 0.11
Positive class proportion in test set: 0.08
Positive class proportion before split: 0.10
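Instead of rerunning the cells by hand, the difference between the two approaches can be measured over many splits. This is a small simulation sketch with the same setup (50 samples, 5 positives); the helper name `positive_share_in_test` and the number of repetitions are assumptions made for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_samples, n_positive = 50, 5
X_sim = np.random.default_rng(0).standard_normal((n_samples, 5))
y_sim = np.concatenate((np.ones(n_positive), np.zeros(n_samples - n_positive)))

def positive_share_in_test(stratified, n_splits=200):
    """Positive class proportion in the test set for n_splits different seeds."""
    shares = []
    for seed in range(n_splits):
        _, _, _, y_te = train_test_split(
            X_sim, y_sim,
            stratify=y_sim if stratified else None,
            random_state=seed,
        )
        shares.append(y_te.mean())
    return np.array(shares)

random_shares = positive_share_in_test(stratified=False)
stratified_shares = positive_share_in_test(stratified=True)

print(f"Random split:     min={random_shares.min():.2f}, max={random_shares.max():.2f}, std={random_shares.std():.3f}")
print(f"Stratified split: min={stratified_shares.min():.2f}, max={stratified_shares.max():.2f}, std={stratified_shares.std():.3f}")
```

With so few positive samples, a simple random split can produce test sets with no positives at all, while the stratified split keeps the proportion stable from run to run.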
