# Data Shuffling for Non-Technical Students

## Overview
Data shuffling is the process of randomly reordering the rows (examples) in a dataset. It helps keep a model from learning unintended patterns that come only from the order of the data. In machine learning, shuffling data before training is a common step toward a fair and unbiased training process.

## Why is Data Shuffling Important?
- **Prevents Learning Based on Order**: When a dataset is ordered (e.g., by time or another variable), the model may pick up on patterns that are tied to that order. Shuffling the data helps prevent this.
- **Improves Model Generalization**: Shuffling mixes the examples, so each portion of the data the model sees during training is more representative of the whole dataset.
- **Reduces the Risk of Overfitting**: If the data is ordered, the model might memorize the sequence rather than learn the underlying patterns.
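
The points above can be illustrated with a small sketch. The dataset below is hypothetical (the `Score` and `Result` values are made up for illustration), and `random_state=0` is an arbitrary seed chosen only so that the shuffle is repeatable. In the sorted data, the first six rows all carry the same label; after shuffling, the first six rows will usually contain a mix of labels.

```python
import pandas as pd

# Hypothetical toy dataset, sorted by label: six "fail" rows, then four "pass" rows
data = pd.DataFrame({
    'Score': [40, 45, 50, 52, 55, 58, 70, 80, 85, 90],
    'Result': ['fail'] * 6 + ['pass'] * 4,
})

# Without shuffling: the first six rows all have the same label
first_rows_unshuffled = data.head(6)

# With shuffling: random_state=0 is an arbitrary seed (an assumption for repeatability)
shuffled = data.sample(frac=1, random_state=0).reset_index(drop=True)
first_rows_shuffled = shuffled.head(6)

# Count how many rows of each label appear in the first six rows
print(first_rows_unshuffled['Result'].value_counts())
print(first_rows_shuffled['Result'].value_counts())
```

A model that only saw the first six unshuffled rows would never encounter a "pass" example, which is the kind of order-based bias shuffling is meant to reduce.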

## Example of Data Shuffling
Imagine we have a list of students' exam scores ordered by their age:
```text
Age:   10, 12, 13, 14, 15
Score: 80, 85, 70, 90, 88
```

We can shuffle these rows with pandas:

```python
import pandas as pd

# Create a sample dataframe
data = pd.DataFrame({
    'Age': [10, 12, 13, 14, 15],
    'Score': [80, 85, 70, 90, 88]
})

# Shuffle the dataframe
shuffled_data = data.sample(frac=1).reset_index(drop=True)

# Show shuffled data
shuffled_data
```

The shuffle happens in one line, which chains two pandas methods:

1. `.sample(frac=1)`
   - `sample()` randomly samples rows from the dataframe.
   - `frac` is the fraction of the original rows to return. `frac=1` returns 100% of the rows in random order, so the entire dataset is shuffled without any loss of data.
   - With `frac=0.5`, it would randomly sample 50% of the rows instead.
2. `.reset_index(drop=True)`
   - `reset_index()` resets the index of the dataframe after shuffling. The index is simply the set of row labels (e.g., 0, 1, 2, 3, ...). After shuffling, each row keeps its original label, so the index appears out of order.
   - `drop=True` tells pandas to discard the old index rather than adding it as a new column in the dataframe.
   - With `drop=False` (the default), pandas would add the old index as a column in the dataframe.

One possible result (the order changes each time the code runs):

```text
   Age  Score
0   12     85
1   14     90
2   13     70
3   10     80
4   15     88
```
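
The effect of the `frac` and `drop` arguments described above can be checked directly. The following is a minimal sketch using the same example data; `random_state=42` is an arbitrary seed added here (it is not part of the original example) so that repeated runs give the same rows.

```python
import pandas as pd

data = pd.DataFrame({
    'Age': [10, 12, 13, 14, 15],
    'Score': [80, 85, 70, 90, 88]
})

# frac=1 keeps every row; only the order changes
full_shuffle = data.sample(frac=1, random_state=42)

# frac=0.5 keeps only about half of the rows
half_sample = data.sample(frac=0.5, random_state=42)

# drop=True discards the old index entirely
with_drop = full_shuffle.reset_index(drop=True)

# drop=False (the default) keeps the old index as a column named 'index'
without_drop = full_shuffle.reset_index(drop=False)

print(len(full_shuffle), len(half_sample))
print(list(with_drop.columns))     # ['Age', 'Score']
print(list(without_drop.columns))  # ['index', 'Age', 'Score']
```

The extra `index` column in `without_drop` records where each row sat in the original dataframe, which can be useful if the original order needs to be recovered later.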
