Splitting a dataset into training, validation, and test sets is standard practice in machine learning. It lets us estimate how well a model generalizes to unseen data and detect overfitting, which a score computed on the training data alone would hide.

The training set is used to train the model, the validation set is used to tune the hyperparameters of the model, and the test set is used to evaluate the final performance of the model on unseen data.
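
As a rough sketch of these three roles, the snippet below fits several candidate models on the training set, picks a hyperparameter value using the validation set, and scores the chosen model once on the test set. The k-nearest-neighbors classifier, the candidate values of `n_neighbors`, and the split sizes are assumptions chosen only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Hold out a test set first, then carve a validation set out of the remainder.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0
)

# Tune one hyperparameter (n_neighbors) by comparing validation accuracy.
# The candidate values are illustrative assumptions.
best_k, best_val_acc = None, -1.0
for k in (1, 3, 5, 7):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    val_acc = model.score(X_val, y_val)
    if val_acc > best_val_acc:
        best_k, best_val_acc = k, val_acc

# Refit with the chosen value and evaluate on the test set exactly once.
final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
test_acc = final_model.score(X_test, y_test)
print(f"best k: {best_k}, validation accuracy: {best_val_acc:.3f}, "
      f"test accuracy: {test_acc:.3f}")
```

The test set plays no part in choosing `best_k`, so the final score is not biased by the tuning step.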

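To split the dataset into training, validation, and test sets, we can use the train_test_split function from the scikit-learn library. Each call divides the data into two parts, so we call it twice: once to hold out the test set, and once more to split the remainder into training and validation sets. Here is an example: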

In [11]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the Iris dataset
iris = load_iris()

# Hold out 20% of the data (30 of 150 samples) as the test set
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)

# Split the remaining 120 samples again: 20% of them (24 samples,
# i.e. 16% of the full dataset) become the validation set
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

# Print the shapes of the resulting datasets
print(f"Training data shape: {X_train.shape}")
print(f"Validation data shape: {X_val.shape}")
print(f"Testing data shape: {X_test.shape}")


Training data shape: (96, 4)
Validation data shape: (24, 4)
Testing data shape: (30, 4)
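
These shapes follow from the arithmetic of the two calls: the second test_size applies only to the 120 samples left after the first split, so the validation set is 0.2 × 0.8 = 16% of the full dataset and the overall split is 64/16/20. To target fractions of the full dataset instead, the second test_size can be rescaled. The sketch below assumes a hypothetical helper named `three_way_split` (it is not part of scikit-learn) and, as a further assumption, passes `stratify` so that each subset keeps the original class proportions.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split


def three_way_split(X, y, val_fraction=0.2, test_fraction=0.2, random_state=42):
    """Hypothetical helper: fractions are relative to the FULL dataset."""
    if not 0 < val_fraction + test_fraction < 1:
        raise ValueError("val_fraction + test_fraction must be between 0 and 1")

    # First call: hold out the test set.
    X_rest, X_test, y_rest, y_test = train_test_split(
        X, y, test_size=test_fraction, random_state=random_state, stratify=y
    )

    # Second call sees only (1 - test_fraction) of the data, so rescale.
    relative_val = val_fraction / (1.0 - test_fraction)
    X_train, X_val, y_train, y_val = train_test_split(
        X_rest, y_rest, test_size=relative_val,
        random_state=random_state, stratify=y_rest
    )
    return X_train, X_val, X_test, y_train, y_val, y_test


iris = load_iris()
X_train, X_val, X_test, y_train, y_val, y_test = three_way_split(
    iris.data, iris.target, val_fraction=0.2, test_fraction=0.2
)
print(f"Training data shape: {X_train.shape}")
print(f"Validation data shape: {X_val.shape}")
print(f"Testing data shape: {X_test.shape}")
```

With both fractions set to 0.2, the rescaled value is 0.2 / 0.8 = 0.25, which yields a 60/20/20 split of the 150 Iris samples.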
