### What is a Train-Test Split? 🤔
When you're working with a machine learning model, you want to teach the model with some data and then test how well it learned. A train-test split divides your dataset into two parts for exactly that purpose:

- **Training Set** 🏋️‍♂️: This is the data you use to train (or teach) your model. It's like the practice your model gets before the big game!
- **Testing Set** 🧪: This is the data you use to see how well your model performs. It's like a quiz to check if your model learned well!

### Why Split the Data? ✂️
If you train and test on the same data, your model might do well just because it remembers the answers (overfitting). By splitting the data, you can see if the model can handle new, unseen data.
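To make this concrete, here is a minimal sketch that is not part of the original notebook. It assumes a `DecisionTreeClassifier` as a stand-in for "your model" (any classifier would do) and uses the Iris data from the rest of this tutorial. An unrestricted tree can memorize its training rows, so the training score tends to look flattering, while the score on held-out rows is the more honest number.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Hold out 20% of the rows; the model never sees them during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

# Assumption: a decision tree stands in for "your model".
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

train_acc = model.score(X_train, y_train)  # accuracy on rows the model trained on
test_acc = model.score(X_test, y_test)     # accuracy on rows the model has never seen

print(f"Accuracy on training data: {train_acc:.2f}")
print(f"Accuracy on testing data:  {test_acc:.2f}")
```

If you only looked at the first number, you could not tell a model that learned the pattern from one that memorized the answers.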

### How to Split the Data? 🤓
We’ll use a **toy dataset** (a small and simple dataset) to demonstrate. Here's how you can do it in Python:

In [2]:
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

# Load toy dataset (Iris dataset 🌸)
iris = load_iris()
X = iris.data  # Features 📊
y = iris.target  # Labels 🎯

# First Split 🔪
X_train1, X_test1, y_train1, y_test1 = train_test_split(X, y, test_size=0.2, random_state=1)
print(f"First Split - Training set size: {len(X_train1)}, Testing set size: {len(X_test1)}")

First Split - Training set size: 120, Testing set size: 30


In [4]:
X_train1

array([[6.1, 3. , 4.6, 1.4],
       [7.7, 3. , 6.1, 2.3],
       [5.6, 2.5, 3.9, 1.1],
       [6.4, 2.8, 5.6, 2.1],
       [5.8, 2.8, 5.1, 2.4],
       [5.3, 3.7, 1.5, 0.2],
       [5.5, 2.3, 4. , 1.3],
       [5.2, 3.4, 1.4, 0.2],
       [6.5, 2.8, 4.6, 1.5],
       [6.7, 2.5, 5.8, 1.8],
       [6.8, 3. , 5.5, 2.1],
       [5.1, 3.5, 1.4, 0.3],
       [6. , 2.2, 5. , 1.5],
       [6.3, 2.9, 5.6, 1.8],
       [6.6, 2.9, 4.6, 1.3],
       [7.7, 2.6, 6.9, 2.3],
       [5.7, 3.8, 1.7, 0.3],
       [5. , 3.6, 1.4, 0.2],
       [4.8, 3. , 1.4, 0.3],
       [5.2, 2.7, 3.9, 1.4],
       [5.1, 3.4, 1.5, 0.2],
       [5.5, 3.5, 1.3, 0.2],
       [7.7, 3.8, 6.7, 2.2],
       [6.9, 3.1, 5.4, 2.1],
       [7.3, 2.9, 6.3, 1.8],
       [6.4, 2.8, 5.6, 2.2],
       [6.2, 2.8, 4.8, 1.8],
       [6. , 3.4, 4.5, 1.6],
       [7.7, 2.8, 6.7, 2. ],
       [5.7, 3. , 4.2, 1.2],
       [4.8, 3.4, 1.6, 0.2],
       [5.7, 2.5, 5. , 2. ],
       [6.3, 2.7, 4.9, 1.8],
       [4.8, 3. , 1.4, 0.1],
       [4.7, 3

In [5]:
y_train1

array([1, 2, 1, 2, 2, 0, 1, 0, 1, 2, 2, 0, 2, 2, 1, 2, 0, 0, 0, 1, 0, 0,
       2, 2, 2, 2, 2, 1, 2, 1, 0, 2, 2, 0, 0, 2, 0, 2, 2, 1, 1, 2, 2, 0,
       1, 1, 2, 1, 2, 1, 0, 0, 0, 2, 0, 1, 2, 2, 0, 0, 1, 0, 2, 1, 2, 2,
       1, 2, 2, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 2, 2, 2, 0, 0, 1, 0, 2, 0,
       2, 2, 0, 2, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 2, 0,
       0, 2, 1, 2, 1, 2, 2, 1, 2, 0])

In [6]:
# Second Split ✂️
X_train2, X_test2, y_train2, y_test2 = train_test_split(X, y, test_size=0.2, random_state=42)
print(f"Second Split - Training set size: {len(X_train2)}, Testing set size: {len(X_test2)}")

Second Split - Training set size: 120, Testing set size: 30


In [7]:
X_train2

array([[4.6, 3.6, 1. , 0.2],
       [5.7, 4.4, 1.5, 0.4],
       [6.7, 3.1, 4.4, 1.4],
       [4.8, 3.4, 1.6, 0.2],
       [4.4, 3.2, 1.3, 0.2],
       [6.3, 2.5, 5. , 1.9],
       [6.4, 3.2, 4.5, 1.5],
       [5.2, 3.5, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2],
       [5.2, 4.1, 1.5, 0.1],
       [5.8, 2.7, 5.1, 1.9],
       [6. , 3.4, 4.5, 1.6],
       [6.7, 3.1, 4.7, 1.5],
       [5.4, 3.9, 1.3, 0.4],
       [5.4, 3.7, 1.5, 0.2],
       [5.5, 2.4, 3.7, 1. ],
       [6.3, 2.8, 5.1, 1.5],
       [6.4, 3.1, 5.5, 1.8],
       [6.6, 3. , 4.4, 1.4],
       [7.2, 3.6, 6.1, 2.5],
       [5.7, 2.9, 4.2, 1.3],
       [7.6, 3. , 6.6, 2.1],
       [5.6, 3. , 4.5, 1.5],
       [5.1, 3.5, 1.4, 0.2],
       [7.7, 2.8, 6.7, 2. ],
       [5.8, 2.7, 4.1, 1. ],
       [5.2, 3.4, 1.4, 0.2],
       [5. , 3.5, 1.3, 0.3],
       [5.1, 3.8, 1.9, 0.4],
       [5. , 2. , 3.5, 1. ],
       [6.3, 2.7, 4.9, 1.8],
       [4.8, 3.4, 1.9, 0.2],
       [5. , 3. , 1.6, 0.2],
       [5.1, 3.3, 1.7, 0.5],
       [5.6, 2

In [8]:
y_train2

array([0, 0, 1, 0, 0, 2, 1, 0, 0, 0, 2, 1, 1, 0, 0, 1, 2, 2, 1, 2, 1, 2,
       1, 0, 2, 1, 0, 0, 0, 1, 2, 0, 0, 0, 1, 0, 1, 2, 0, 1, 2, 0, 2, 2,
       1, 1, 2, 1, 0, 1, 2, 0, 0, 1, 1, 0, 2, 0, 0, 1, 1, 2, 1, 2, 2, 1,
       0, 0, 2, 2, 0, 0, 0, 1, 2, 0, 2, 2, 0, 1, 1, 2, 1, 2, 0, 2, 1, 2,
       1, 1, 1, 0, 1, 1, 0, 1, 2, 2, 0, 1, 2, 2, 0, 2, 0, 1, 2, 2, 1, 2,
       1, 1, 2, 2, 0, 1, 2, 0, 1, 2])

In [None]:
# Third Split ✨
X_train3, X_test3, y_train3, y_test3 = train_test_split(X, y, test_size=0.2, random_state=99)
print(f"Third Split - Training set size: {len(X_train3)}, Testing set size: {len(X_test3)}")

### Explanation of the Code 🧑‍🏫
1. **Load Iris Dataset** 🌸: The Iris dataset is a famous toy dataset that is popular with beginners. It contains 150 flowers, each with four features (like petal length) and one label (which of three species it is).
  
2. **Splitting the Data** 🔪✂️✨:
    - We use the `train_test_split` function from `sklearn` to split the data.
    - The `test_size=0.2` means that 20% of the data will be used for testing, and 80% for training.
    - `random_state` is the seed for the random shuffle that happens before the split. Using the same value gives the same split every time you run the code; changing it gives a different split.
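As a small illustration of the `test_size` parameter described above, the sketch below passes it once as a fraction and once as a whole number. The variable names with `_a` and `_b` suffixes are only for this example. With the 150-row Iris dataset, both calls should reserve 30 rows for testing.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)  # 150 samples

# A float test_size is a fraction of the dataset: 0.2 * 150 = 30 test rows.
X_train_a, X_test_a, y_train_a, y_test_a = train_test_split(
    X, y, test_size=0.2, random_state=1
)

# An int test_size is an absolute number of test rows.
X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(
    X, y, test_size=30, random_state=1
)

print(len(X_train_a), len(X_test_a))  # 120 30
print(len(X_train_b), len(X_test_b))  # 120 30
```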

### Results 📊
- After each split, you'll see the size of the training and testing sets. This helps you understand how your data is being divided.

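By performing multiple splits with different `random_state` values, you can see how the data division changes: the sizes stay at 120 and 30, but different rows land in each set. If you also train and score a model on each split, you can check that a good result isn't just luck from one particular split!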
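Printing the sizes alone cannot reveal a lucky split, because the sizes never change. One way to check is to train and score a model on each split. The sketch below is an illustration, not part of the original notebook: it assumes a `KNeighborsClassifier` with 5 neighbors as the model and reuses the three `random_state` values from the cells above.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

scores = {}
for seed in (1, 42, 99):  # the three random_state values used above
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    # Assumption: k-nearest neighbors stands in for "your model".
    model = KNeighborsClassifier(n_neighbors=5)
    model.fit(X_train, y_train)
    scores[seed] = model.score(X_test, y_test)  # accuracy on the held-out 30 rows

for seed, acc in scores.items():
    print(f"random_state={seed}: test accuracy = {acc:.3f}")
```

If the scores are close to each other, the result probably does not depend on one particular split. If they vary a lot, a single split is giving you a noisy picture.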

### Final Thoughts 💡
Splitting your data correctly is super important in building a model that performs well in the real world. The test set gives you an honest estimate of how well your model generalizes to new data, as long as you never train on it!

### What is `random_state`? 🎲
`random_state` is like setting a seed for randomness. Imagine you're planting a garden. If you want to get the same type of flowers in the same spots every time you plant, you'd use the same bag of seeds. The `random_state` is like choosing that specific bag of seeds.

In the context of splitting data:

- **Without `random_state`**: Every time you split the data, you might get a different result (like planting flowers randomly every time).
- **With `random_state`**: The split will be the same every time you run the code (like using the same bag of seeds to plant the garden in the same way).

### Why use `random_state`? 🤔
1. **Reproducibility** 🧩: If you share your code with someone else or run it again later, using the same `random_state` ensures that the data is split in the exact same way. This consistency is important for comparing results and debugging.
  
2. **Control** 🎛️: Sometimes, you want to explore how different splits affect your model. By changing the `random_state`, you can get different splits and see how robust your model is.

### Example with and without `random_state` 🖥️

In [15]:
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

# Load Iris dataset 🌸
iris = load_iris()
X = iris.data
y = iris.target

# Split without random_state 🤷‍♂️
X_train1, X_test1, y_train1, y_test1 = train_test_split(X, y, test_size=0.2)
X_train2, X_test2, y_train2, y_test2 = train_test_split(X, y, test_size=0.2)

print("Without random_state:")
print(f"First Split - Training set size: {len(X_train1)}, Testing set size: {len(X_test1)}")
print(f"Second Split - Training set size: {len(X_train2)}, Testing set size: {len(X_test2)}")

Without random_state:
First Split - Training set size: 120, Testing set size: 30
Second Split - Training set size: 120, Testing set size: 30


In [16]:
X_train1

array([[6. , 2.7, 5.1, 1.6],
       [5.2, 3.5, 1.5, 0.2],
       [6.6, 3. , 4.4, 1.4],
       [6.9, 3.1, 5.1, 2.3],
       [5.8, 2.7, 5.1, 1.9],
       [4.4, 3. , 1.3, 0.2],
       [6.5, 3. , 5.8, 2.2],
       [6.2, 3.4, 5.4, 2.3],
       [5.5, 3.5, 1.3, 0.2],
       [7.2, 3.2, 6. , 1.8],
       [5.1, 2.5, 3. , 1.1],
       [5.6, 3. , 4.1, 1.3],
       [6.1, 2.9, 4.7, 1.4],
       [5.5, 2.3, 4. , 1.3],
       [5.8, 2.7, 4.1, 1. ],
       [7.1, 3. , 5.9, 2.1],
       [5.8, 2.7, 5.1, 1.9],
       [5.1, 3.3, 1.7, 0.5],
       [7.7, 2.6, 6.9, 2.3],
       [4.4, 3.2, 1.3, 0.2],
       [5.3, 3.7, 1.5, 0.2],
       [5.4, 3.7, 1.5, 0.2],
       [6.2, 2.8, 4.8, 1.8],
       [4.3, 3. , 1.1, 0.1],
       [4.6, 3.6, 1. , 0.2],
       [5. , 2. , 3.5, 1. ],
       [4.9, 3.1, 1.5, 0.1],
       [6.1, 2.8, 4. , 1.3],
       [5.5, 4.2, 1.4, 0.2],
       [6. , 3.4, 4.5, 1.6],
       [5.1, 3.8, 1.5, 0.3],
       [5.7, 2.8, 4.1, 1.3],
       [6. , 2.2, 4. , 1. ],
       [7.2, 3.6, 6.1, 2.5],
       [6.5, 3

In [17]:
y_train1

array([1, 0, 1, 2, 2, 0, 2, 2, 0, 2, 1, 1, 1, 1, 1, 2, 2, 0, 2, 0, 0, 0,
       2, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 2, 2, 1, 2, 2, 1, 1, 2, 0, 0, 2,
       2, 0, 2, 1, 2, 0, 1, 1, 1, 2, 1, 0, 1, 1, 2, 0, 0, 1, 0, 0, 1, 1,
       2, 1, 0, 2, 2, 0, 0, 0, 0, 1, 1, 2, 1, 0, 0, 2, 2, 1, 0, 0, 0, 2,
       2, 0, 1, 0, 1, 2, 1, 1, 2, 0, 0, 0, 0, 0, 0, 0, 1, 2, 1, 0, 1, 1,
       1, 1, 1, 2, 2, 1, 0, 2, 2, 0])

In [18]:
X_train2

array([[4.7, 3.2, 1.3, 0.2],
       [5. , 3.5, 1.6, 0.6],
       [4.9, 2.4, 3.3, 1. ],
       [5.5, 2.5, 4. , 1.3],
       [6. , 3. , 4.8, 1.8],
       [6.7, 3.1, 4.4, 1.4],
       [5.1, 3.8, 1.9, 0.4],
       [5. , 3.6, 1.4, 0.2],
       [7.7, 3.8, 6.7, 2.2],
       [5. , 2.3, 3.3, 1. ],
       [5. , 3.2, 1.2, 0.2],
       [5.6, 3. , 4.1, 1.3],
       [4.9, 3.1, 1.5, 0.2],
       [6.5, 3. , 5.8, 2.2],
       [4.7, 3.2, 1.6, 0.2],
       [5.5, 4.2, 1.4, 0.2],
       [5. , 3.3, 1.4, 0.2],
       [7.2, 3.2, 6. , 1.8],
       [6.4, 3.2, 5.3, 2.3],
       [6.6, 2.9, 4.6, 1.3],
       [6. , 2.2, 4. , 1. ],
       [6.5, 2.8, 4.6, 1.5],
       [7.7, 3. , 6.1, 2.3],
       [6.4, 2.8, 5.6, 2.1],
       [5.8, 2.6, 4. , 1.2],
       [6.7, 3.1, 5.6, 2.4],
       [6.4, 2.8, 5.6, 2.2],
       [5.9, 3.2, 4.8, 1.8],
       [5.8, 2.7, 4.1, 1. ],
       [4.4, 3. , 1.3, 0.2],
       [5.4, 3.9, 1.3, 0.4],
       [4.8, 3.4, 1.9, 0.2],
       [5.8, 2.8, 5.1, 2.4],
       [5.1, 3.3, 1.7, 0.5],
       [4.8, 3

In [19]:
y_train2

array([0, 0, 1, 1, 2, 1, 0, 0, 2, 1, 0, 1, 0, 2, 0, 0, 0, 2, 2, 1, 1, 1,
       2, 2, 1, 2, 2, 1, 1, 0, 0, 0, 2, 0, 0, 1, 1, 2, 2, 2, 2, 0, 2, 2,
       2, 0, 1, 0, 1, 2, 0, 1, 1, 2, 2, 0, 0, 0, 2, 0, 1, 1, 1, 1, 0, 0,
       0, 2, 2, 1, 0, 1, 0, 1, 0, 1, 0, 2, 1, 0, 2, 0, 1, 0, 1, 2, 2, 0,
       2, 1, 2, 0, 0, 1, 0, 2, 2, 2, 2, 1, 0, 1, 2, 1, 0, 1, 1, 1, 1, 1,
       2, 0, 2, 2, 2, 1, 0, 1, 0, 2])

In [20]:
# Split with random_state 🎲
X_train1, X_test1, y_train1, y_test1 = train_test_split(X, y, test_size=0.2, random_state=42)
X_train2, X_test2, y_train2, y_test2 = train_test_split(X, y, test_size=0.2, random_state=42)

print("\nWith random_state:")
print(f"First Split - Training set size: {len(X_train1)}, Testing set size: {len(X_test1)}")
print(f"Second Split - Training set size: {len(X_train2)}, Testing set size: {len(X_test2)}")


With random_state:
First Split - Training set size: 120, Testing set size: 30
Second Split - Training set size: 120, Testing set size: 30


In [21]:
X_train1

array([[4.6, 3.6, 1. , 0.2],
       [5.7, 4.4, 1.5, 0.4],
       [6.7, 3.1, 4.4, 1.4],
       [4.8, 3.4, 1.6, 0.2],
       [4.4, 3.2, 1.3, 0.2],
       [6.3, 2.5, 5. , 1.9],
       [6.4, 3.2, 4.5, 1.5],
       [5.2, 3.5, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2],
       [5.2, 4.1, 1.5, 0.1],
       [5.8, 2.7, 5.1, 1.9],
       [6. , 3.4, 4.5, 1.6],
       [6.7, 3.1, 4.7, 1.5],
       [5.4, 3.9, 1.3, 0.4],
       [5.4, 3.7, 1.5, 0.2],
       [5.5, 2.4, 3.7, 1. ],
       [6.3, 2.8, 5.1, 1.5],
       [6.4, 3.1, 5.5, 1.8],
       [6.6, 3. , 4.4, 1.4],
       [7.2, 3.6, 6.1, 2.5],
       [5.7, 2.9, 4.2, 1.3],
       [7.6, 3. , 6.6, 2.1],
       [5.6, 3. , 4.5, 1.5],
       [5.1, 3.5, 1.4, 0.2],
       [7.7, 2.8, 6.7, 2. ],
       [5.8, 2.7, 4.1, 1. ],
       [5.2, 3.4, 1.4, 0.2],
       [5. , 3.5, 1.3, 0.3],
       [5.1, 3.8, 1.9, 0.4],
       [5. , 2. , 3.5, 1. ],
       [6.3, 2.7, 4.9, 1.8],
       [4.8, 3.4, 1.9, 0.2],
       [5. , 3. , 1.6, 0.2],
       [5.1, 3.3, 1.7, 0.5],
       [5.6, 2

In [23]:
y_train1

array([0, 0, 1, 0, 0, 2, 1, 0, 0, 0, 2, 1, 1, 0, 0, 1, 2, 2, 1, 2, 1, 2,
       1, 0, 2, 1, 0, 0, 0, 1, 2, 0, 0, 0, 1, 0, 1, 2, 0, 1, 2, 0, 2, 2,
       1, 1, 2, 1, 0, 1, 2, 0, 0, 1, 1, 0, 2, 0, 0, 1, 1, 2, 1, 2, 2, 1,
       0, 0, 2, 2, 0, 0, 0, 1, 2, 0, 2, 2, 0, 1, 1, 2, 1, 2, 0, 2, 1, 2,
       1, 1, 1, 0, 1, 1, 0, 1, 2, 2, 0, 1, 2, 2, 0, 2, 0, 1, 2, 2, 1, 2,
       1, 1, 2, 2, 0, 1, 2, 0, 1, 2])

In [22]:
X_train2

array([[4.6, 3.6, 1. , 0.2],
       [5.7, 4.4, 1.5, 0.4],
       [6.7, 3.1, 4.4, 1.4],
       [4.8, 3.4, 1.6, 0.2],
       [4.4, 3.2, 1.3, 0.2],
       [6.3, 2.5, 5. , 1.9],
       [6.4, 3.2, 4.5, 1.5],
       [5.2, 3.5, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2],
       [5.2, 4.1, 1.5, 0.1],
       [5.8, 2.7, 5.1, 1.9],
       [6. , 3.4, 4.5, 1.6],
       [6.7, 3.1, 4.7, 1.5],
       [5.4, 3.9, 1.3, 0.4],
       [5.4, 3.7, 1.5, 0.2],
       [5.5, 2.4, 3.7, 1. ],
       [6.3, 2.8, 5.1, 1.5],
       [6.4, 3.1, 5.5, 1.8],
       [6.6, 3. , 4.4, 1.4],
       [7.2, 3.6, 6.1, 2.5],
       [5.7, 2.9, 4.2, 1.3],
       [7.6, 3. , 6.6, 2.1],
       [5.6, 3. , 4.5, 1.5],
       [5.1, 3.5, 1.4, 0.2],
       [7.7, 2.8, 6.7, 2. ],
       [5.8, 2.7, 4.1, 1. ],
       [5.2, 3.4, 1.4, 0.2],
       [5. , 3.5, 1.3, 0.3],
       [5.1, 3.8, 1.9, 0.4],
       [5. , 2. , 3.5, 1. ],
       [6.3, 2.7, 4.9, 1.8],
       [4.8, 3.4, 1.9, 0.2],
       [5. , 3. , 1.6, 0.2],
       [5.1, 3.3, 1.7, 0.5],
       [5.6, 2

In [24]:
y_train2

array([0, 0, 1, 0, 0, 2, 1, 0, 0, 0, 2, 1, 1, 0, 0, 1, 2, 2, 1, 2, 1, 2,
       1, 0, 2, 1, 0, 0, 0, 1, 2, 0, 0, 0, 1, 0, 1, 2, 0, 1, 2, 0, 2, 2,
       1, 1, 2, 1, 0, 1, 2, 0, 0, 1, 1, 0, 2, 0, 0, 1, 1, 2, 1, 2, 2, 1,
       0, 0, 2, 2, 0, 0, 0, 1, 2, 0, 2, 2, 0, 1, 1, 2, 1, 2, 0, 2, 1, 2,
       1, 1, 1, 0, 1, 1, 0, 1, 2, 2, 0, 1, 2, 2, 0, 2, 0, 1, 2, 2, 1, 2,
       1, 1, 2, 2, 0, 1, 2, 0, 1, 2])

### What You'll Notice 👀
- **Without `random_state`**: The sizes stay the same (120 and 30), but the specific data points in the training and testing sets will almost certainly be different each time you run the code.
- **With `random_state`**: The data points in the training and testing sets will be identical every time you run the code.
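Comparing long arrays by eye is error-prone, so here is a short sketch that lets NumPy do the comparison. It is an addition to the notebook; the names `same_seed_equal` and `different_seed_equal` are made up for this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Same seed twice: the training sets should match row for row.
X_train_a, _, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)
X_train_b, _, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)
same_seed_equal = np.array_equal(X_train_a, X_train_b)

# Different seeds: same sizes, but the rows land in a different arrangement.
X_train_c, _, _, _ = train_test_split(X, y, test_size=0.2, random_state=1)
different_seed_equal = np.array_equal(X_train_a, X_train_c)

print(f"Same random_state gives identical training sets: {same_seed_equal}")
print(f"Different random_state gives identical training sets: {different_seed_equal}")
```

The first line should print `True` and the second `False`, which matches the arrays shown in the cells above.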

### Summary ✨
Using `random_state` is like setting a bookmark in a book: you'll always start reading from the same spot. It gives you control and consistency, which are crucial in machine learning for testing and comparing results.