In [1]:
from sklearn import datasets
import numpy as np 
from sklearn.model_selection import train_test_split


In [2]:
# load the iris dataset
iris = datasets.load_iris()

# split the dataset into features and labels 
X = iris.data
y = iris.target

# print(X, y)

print(X.shape)  # (150, 4): 150 samples, 4 features each
print(y.shape)  # (150,): one class label per sample


(150, 4)
(150,)


In [3]:
# split the data into training (80%) and test (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(120, 4)
(30, 4)
(120,)
(30,)


# Scikit-learn's Inbuilt Datasets

**Scikit-learn (sklearn) comes with many inbuilt datasets!** This is a fantastic feature for learning, experimenting, and quickly prototyping machine learning models. These datasets are readily available through the `sklearn.datasets` module.

They generally fall into a few helpful categories:

* **Toy Datasets:** These are small, simple datasets perfect for educational purposes, demonstrating how algorithms work, and quick tests.
    * Examples: `load_iris`, `load_diabetes`, `load_digits`, `load_wine`, `load_breast_cancer`, `load_linnerud`.
* **Real-world Datasets:** These are larger datasets that get downloaded the first time you access them.
    * Examples: `fetch_20newsgroups`, `fetch_california_housing`, `fetch_olivetti_faces`.
    * **Note:** These require an internet connection the very first time you use them, as they aren't bundled directly with the scikit-learn installation.
* **Generated Datasets:** These are functions that create synthetic datasets with specific properties. They're super useful for testing algorithms under controlled conditions, especially when you want to see how a model performs with different data characteristics.
    * Examples: `make_classification`, `make_regression`, `make_blobs`.
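
As a rough illustration of the categories above, the sketch below loads one bundled toy dataset and generates two synthetic ones. The sample counts, feature counts, and `random_state` values are arbitrary choices made for this example, not recommendations.

```python
from sklearn.datasets import load_wine, make_blobs, make_classification

# Toy dataset: bundled with scikit-learn, no download needed
wine = load_wine()
print("Wine features shape:", wine.data.shape)

# Generated dataset: three Gaussian clusters in two dimensions
# (n_samples, centers and random_state are illustrative choices)
X_blobs, y_blobs = make_blobs(n_samples=200, centers=3, n_features=2, random_state=0)
print("Blobs shape:", X_blobs.shape)

# Generated dataset: a synthetic binary classification problem
X_clf, y_clf = make_classification(
    n_samples=100, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
print("Classification shape:", X_clf.shape)
```

The `fetch_*` loaders are left out here because they need a network connection on first use.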

---

## What is the Iris Dataset?

The **Iris dataset** is arguably the most famous and widely used "toy" dataset in machine learning, particularly for classification problems. It's often referred to as the "Hello World" of machine learning because it's simple enough to understand quickly but complex enough to showcase various classification techniques.

Here's what makes it so significant:

* **Origin:** It was introduced by the renowned British statistician and biologist Ronald Fisher in 1936.
* **Purpose:** It serves as a classic benchmark for demonstrating and testing classification algorithms due to its straightforward nature and well-defined classes.
* **Structure:**
    * **Instances (samples):** It contains 150 total data points.
    * **Features (attributes):** Each instance has 4 numerical features, all measured in centimeters:
        1.  Sepal length
        2.  Sepal width
        3.  Petal length
        4.  Petal width
    * **Classes (targets):** The data represents 3 distinct species of Iris flowers, with 50 instances for each species:
        1.  Iris Setosa
        2.  Iris Versicolor
        3.  Iris Virginica
* **Characteristics:**
    * One class (`Iris Setosa`) is **linearly separable** from the other two, meaning you can draw a straight line (or plane) to separate it from the rest.
    * The other two classes (`Iris Versicolor` and `Iris Virginica`) are **not perfectly linearly separable** from each other. This slight overlap adds a touch of challenge and makes the dataset interesting for evaluating more sophisticated classification algorithms beyond basic linear models.
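
The structure described above can be checked directly. This is a minimal sketch: it counts the samples per class and uses petal length alone as one possible way to show that Setosa separates cleanly from the other two species. Choosing petal length is an assumption made for this example; other features or combinations may work as well.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

# Number of samples in each of the three classes
class_counts = np.bincount(y)
print("Samples per class:", class_counts)

# Petal length is the third feature column (index 2)
petal_length = X[:, 2]
setosa_max = petal_length[y == 0].max()   # class 0 is setosa
others_min = petal_length[y != 0].min()   # versicolor and virginica

# If the largest setosa petal is shorter than the smallest petal of the
# other species, a single threshold on this feature separates setosa.
setosa_separable = setosa_max < others_min
print("Setosa separable on petal length:", setosa_separable)
```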

---

## How to Load the Iris Dataset in Scikit-learn

Loading the Iris dataset is straightforward. Here's how you do it and how to inspect its structure:

```python
from sklearn.datasets import load_iris

# Load the Iris dataset
iris = load_iris()

# The 'data' attribute holds the features (X)
X = iris.data

# The 'target' attribute holds the labels (y)
y = iris.target

# You can also access descriptive information about the dataset
feature_names = iris.feature_names # Names of the features
target_names = iris.target_names   # Names of the target classes

print("Features (X) shape:", X.shape)
print("Target (y) shape:", y.shape)
print("Feature names:", feature_names)
print("Target names:", target_names)

print("\nFirst 5 samples of features (X):\n", X[:5])
print("First 5 samples of target labels (y):\n", y[:5])
```

# Understanding `train_test_split` in Scikit-learn

The `train_test_split` function is a fundamental utility in machine learning workflows, used to divide your dataset into training and testing subsets.

---

## Why We Split Data

The primary goal of splitting your data is to accurately evaluate your machine learning model's performance.

* **Training Set:** This portion of the data is used to **train your machine learning model**. The model learns patterns and relationships from these examples.
* **Testing Set:** This independent portion of the data is used to **evaluate the performance of your trained model on unseen data**. This is crucial for assessing how well your model **generalizes** to new, real-world examples and helps you **detect overfitting**, which shows up as a large gap between training and test performance.
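
To make the two roles concrete, here is a small sketch that fits a model on the training set and scores it on both sets. The choice of `DecisionTreeClassifier` and the `random_state` values are assumptions made for illustration; any scikit-learn classifier could be substituted.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit on the training set only; the test set stays unseen during training
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

# Accuracy on data the model has seen vs. data it has not seen
train_accuracy = model.score(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print(f"Train accuracy: {train_accuracy:.3f}")
print(f"Test accuracy:  {test_accuracy:.3f}")
```

A training score that is much higher than the test score is the usual sign that a model has memorized its training data rather than learned a general pattern.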

The function takes your input **features `X`** (independent variables) and your **target variable `y`** (dependent variable/labels) and splits them consistently, ensuring that corresponding rows in `X` and `y` stay together.
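
One way to convince yourself that rows stay paired is to split an index array alongside `X` and `y`. The sketch below relies on `train_test_split` accepting any number of arrays of equal length; the `indices` array is a helper introduced only for this check.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
indices = np.arange(len(X))  # helper array: original row number of each sample

# All three arrays are shuffled and split with the same permutation
X_train, X_test, y_train, y_test, idx_train, idx_test = train_test_split(
    X, y, indices, test_size=0.2, random_state=0
)

# Each training row and label must match the original row at the same index
rows_aligned = (
    np.array_equal(X[idx_train], X_train)
    and np.array_equal(y[idx_train], y_train)
)
# No sample should appear in both splits
no_overlap = len(np.intersect1d(idx_train, idx_test)) == 0

print("Rows aligned:", rows_aligned)
print("No overlap between train and test:", no_overlap)
```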

---

## Understanding the Parameters

Let's break down the key parameters of the `train_test_split` function:

* **`X`**: Your **feature matrix** (e.g., `iris.data`). This contains the input features for your model.
* **`y`**: Your **target vector** (e.g., `iris.target`). This contains the labels or values your model is trying to predict.
* **`test_size`**:
    * If a `float` (e.g., `0.2` or `0.3`), it represents the **proportion** of the dataset to allocate to the test split. So, `0.2` means 20% of the data goes into the test set.
    * If an `int`, it represents the **absolute number** of samples to include in the test split.
    * **Default:** If neither `test_size` nor `train_size` is specified, `test_size` defaults to `0.25` (25%).
* **`train_size`**:
    * If a `float`, it represents the **proportion** of the dataset to allocate to the training split.
    * If an `int`, it represents the **absolute number** of samples to include in the training split.
    * **Default:** `None`. In that case it is set to the complement of the test size, so with the default `test_size` of `0.25` the training set gets the remaining 75%.
* **`random_state`**:
    * An `int`, a `numpy.random.RandomState` instance, or `None`. This parameter controls **reproducibility**. Setting it to a fixed integer (e.g., `42`) makes the shuffle, and therefore the split, identical on every run. If left as `None` (the default), NumPy's global random state is used, so the split usually differs between executions.
* **`shuffle`**:
    * `bool`, default=`True`. Determines whether the data is **shuffled** before being split. Shuffling is almost always recommended to ensure both the training and test sets are representative of the overall dataset and don't carry any inherent ordering biases.
* **`stratify`**:
    * `array-like` or `None`, default=`None`. If provided, this parameter should be your `y` array. It ensures that the **proportions of class labels** in both the training and testing sets are roughly the same as those in the original full dataset. This is especially useful for **imbalanced datasets** to prevent a situation where one class is over-represented in one split and under-represented in the other.
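
The parameters above can be tried out on the Iris data. This sketch shows `stratify`, `random_state`, and an integer `test_size`; the specific values (`0.2`, `30`, `42`) are illustrative choices.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# stratify=y keeps the class proportions of y in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
test_counts = np.bincount(y_te)
print("Test samples per class (stratified):", test_counts)

# Same random_state, same arguments: the split is reproduced exactly
_, _, _, y_te_again = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
same_split = np.array_equal(y_te, y_te_again)
print("Split reproduced:", same_split)

# An integer test_size is an absolute number of samples
_, X_te_int, _, _ = train_test_split(X, y, test_size=30, random_state=42)
print("Test shape with test_size=30:", X_te_int.shape)
```

Because Iris has 50 samples per class, a stratified 20% test set holds the same number of samples from each species.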

---

## Why You Typically Don't Use Both `test_size` and `train_size` Simultaneously

The reason you usually specify only one of `test_size` or `train_size` is that **they are complementary**. If you define one, the other is automatically determined.

For example, with a dataset of 100 samples:
* If you set `test_size = 0.2`, 20 samples go to the test set, and by implication, 80 samples (`100 - 20`) form the training set. The `train_size` is implicitly `0.8`.
* Similarly, if you set `train_size = 0.8`, 80 samples go to the training set, and 20 samples (`100 - 80`) form the test set. The `test_size` is implicitly `0.2`.

Specifying both is allowed, but the values must be compatible. If the two proportions sum to less than 1.0 (for example, `test_size=0.3` and `train_size=0.6`), scikit-learn does not raise an error: it draws the requested number of samples for each split and leaves the remaining 10% out of both. A `ValueError` is raised only when the request cannot be satisfied, that is, when the float proportions sum to more than 1.0 or the integer counts sum to more than the total number of samples.

The function is designed for convenience: just specify the split size you care about most, and it handles the rest. Most users think in terms of how much data they want to hold out for testing, which is why `test_size` is more commonly used.
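
How the two parameters interact is easy to check empirically. The sketch below assumes a recent scikit-learn release and uses a made-up array of 100 samples. In recent releases, proportions that sum to less than 1.0 leave the remainder unused, and a sum above 1.0 raises a `ValueError`. The proportions `0.5`, `0.25`, and `0.75` are illustrative choices.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 samples with 2 features each
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100)

# Proportions sum to 0.75, so a quarter of the samples go to neither split
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.5, test_size=0.25, random_state=0
)
n_unused = len(X_demo) - len(X_tr) - len(X_te)
print("train:", len(X_tr), "test:", len(X_te), "unused:", n_unused)

# Proportions sum to 1.25, which cannot be satisfied
try:
    train_test_split(X_demo, y_demo, train_size=0.5, test_size=0.75, random_state=0)
    raised = False
except ValueError as err:
    raised = True
    print("ValueError:", err)
```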

---

## Example Usage

```python
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

# Load the Iris dataset (a common toy dataset)
iris = load_iris()
X, y = iris.data, iris.target

print(f"Original dataset shape: {X.shape}")

# 1. Common usage: Specifying test_size (20% for testing)
X_train_1, X_test_1, y_train_1, y_test_1 = train_test_split(X, y, test_size=0.2, random_state=42)
print("\n--- Using test_size = 0.2 ---")
print(f"X_train_1 shape: {X_train_1.shape}") # (120, 4)
print(f"X_test_1 shape: {X_test_1.shape}")   # (30, 4)

# 2. Alternatively: Specifying train_size (80% for training)
X_train_2, X_test_2, y_train_2, y_test_2 = train_test_split(X, y, train_size=0.8, random_state=42)
print("\n--- Using train_size = 0.8 ---")
print(f"X_train_2 shape: {X_train_2.shape}") # (120, 4)
print(f"X_test_2 shape: {X_test_2.shape}")   # (30, 4)
# Notice that the shapes are identical to the first example because the splits are complementary.

# 3. Using both (here they sum to 1.0, so every sample is used)
X_train_3, X_test_3, y_train_3, y_test_3 = train_test_split(X, y, test_size=0.2, train_size=0.8, random_state=42)
print("\n--- Using both (consistent) ---")
print(f"X_train_3 shape: {X_train_3.shape}") # (120, 4)
print(f"X_test_3 shape: {X_test_3.shape}")   # (30, 4)

# 4. Using both with proportions that sum to more than 1.0 (raises a ValueError)
try:
    X_train_4, X_test_4, y_train_4, y_test_4 = train_test_split(X, y, test_size=0.3, train_size=0.8, random_state=42)
except ValueError as e:
    print("\n--- Error: test_size + train_size exceeds 1.0 ---")
    print(e)
```