## How to Split a Dataset (Structured and Unstructured) for ML Training

If you have a single dataset (e.g., all images are in one folder) and you want to split it into train and validation sets automatically, here's how you can do it in Python.

This approach suits an unstructured dataset that should be divided randomly by a fixed percentage (for example, 80% for training and 20% for validation).

### 1. Steps to Split an Unstructured Dataset into train/ and val/ Folders

1. **Prepare the directory structure:** create a `train/` and a `val/` folder inside the parent folder.
2. **Randomly split the images:** randomly assign each image to either the `train/` or the `val/` folder.

#### 🧑‍💻 Python Code Example to Split Dataset:
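The following is a minimal sketch of such a split, using only the standard library. The names `source_directory` and `val_size` follow the description in this section; the function name `split_dataset`, the `output_directory` and `seed` parameters, and the `IMAGE_EXTENSIONS` set are assumptions made for illustration.

```python
import random
import shutil
from pathlib import Path

# File extensions treated as images; an assumption, extend as needed.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp", ".gif"}


def split_dataset(source_directory, output_directory, val_size=0.2, seed=42):
    """Move images from a flat folder into train/ and val/ subfolders.

    Returns the number of images placed in (train, val).
    """
    source = Path(source_directory)
    output = Path(output_directory)

    # Collect image files only; sorting first makes the shuffle reproducible.
    images = sorted(
        p for p in source.iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_EXTENSIONS
    )
    random.Random(seed).shuffle(images)

    # The first val_size fraction of the shuffled list goes to validation.
    n_val = int(len(images) * val_size)
    splits = {"val": images[:n_val], "train": images[n_val:]}

    for split_name, files in splits.items():
        split_dir = output / split_name
        split_dir.mkdir(parents=True, exist_ok=True)
        for path in files:
            # Files are moved, not copied, so the source folder is emptied.
            shutil.move(str(path), str(split_dir / path.name))

    return len(splits["train"]), n_val
```

A call such as `split_dataset("dataset/images", "dataset", val_size=0.2)` would then move the files into `dataset/train/` and `dataset/val/`. Because the files are moved, it may be worth trying the function on a copy of the data first.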

#### 🔍 How It Works:
**Source Directory**: `source_directory` should contain all the images you want to split. These images are assumed to be unclassified (not in any subfolders).

**Train/Validation Directories**: The script creates the `train/` and `val/` directories and moves the images into them.

**Random Splitting**: The images are shuffled randomly, and a percentage of them (20% by default) is moved to the `val/` directory. The rest go into the `train/` directory.

Directory Structure After Split:
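Assuming the parent folder is called `dataset/` (the folder and file names here are hypothetical), the layout after the split would look roughly like this:

```text
dataset/
├── train/
│   ├── img_001.jpg
│   └── ...
└── val/
    ├── img_002.jpg
    └── ...
```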

#### 📊 Customizing the Split:
**Validation Size**: You can change the val_size argument to any percentage. For example, val_size=0.2 means 20% of the images go to the validation set, and the rest (80%) go to the training set.

**Class Folders**: The script doesn't create subfolders for class-based labels. If you want to split into class-based folders (e.g., cat/, dog/), you'll need to modify it to create these subdirectories, for example from class names embedded in the filenames.

### 2. Steps to Split a Class-Based Folder Structure
If your dataset has subfolders with images already classified, you can adjust the above script slightly:
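As a rough sketch of a class-aware split, the function below treats every subfolder of the source directory as one class and splits each class separately, so both sets keep roughly the same class balance. The function name `split_class_folders` and the `output_directory` and `seed` parameters are assumptions chosen for illustration.

```python
import random
import shutil
from pathlib import Path


def split_class_folders(source_directory, output_directory, val_size=0.2, seed=42):
    """Split each class subfolder of source_directory into train/ and val/.

    Returns a dict mapping class name -> (n_train, n_val).
    """
    source = Path(source_directory)
    output = Path(output_directory)
    rng = random.Random(seed)
    counts = {}

    # Every subdirectory of the source is treated as one class.
    for class_dir in sorted(p for p in source.iterdir() if p.is_dir()):
        images = sorted(p for p in class_dir.iterdir() if p.is_file())
        rng.shuffle(images)

        # Split per class so each class keeps roughly the same ratio.
        n_val = int(len(images) * val_size)
        splits = {"val": images[:n_val], "train": images[n_val:]}

        for split_name, files in splits.items():
            target_dir = output / split_name / class_dir.name
            target_dir.mkdir(parents=True, exist_ok=True)
            for path in files:
                shutil.move(str(path), str(target_dir / path.name))

        counts[class_dir.name] = (len(splits["train"]), n_val)

    return counts
```

For example, `split_class_folders("dataset/images", "dataset")` would produce `dataset/train/cat`, `dataset/val/cat`, and so on for each class folder found under `dataset/images`.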

#### What Happens Here:
If you have a structure like dataset/images/cat, dataset/images/dog, etc., the script creates corresponding train/cat, val/cat, etc., subdirectories.

It moves images within each class to the appropriate folders.

## How to Load a Custom Image Set Sorted into Class-Named Folders
If your custom image dataset is organized into folders by class name, both PyTorch and TensorFlow/Keras have built-in tools that handle it directly.

#### 📁 Folder Structure Example
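A layout of the kind both loaders expect might look like the following. The class names `cat` and `dog` come from this section; the file names are hypothetical.

```text
dataset/
├── train/
│   ├── cat/
│   │   ├── cat001.jpg
│   │   └── ...
│   └── dog/
│       ├── dog001.jpg
│       └── ...
└── val/
    ├── cat/
    └── dog/
```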

In this structure:

**Folder names** = **class labels** (cat, dog, etc.)

**Images are stored inside class folders**

#### ✅ PyTorch: Use ImageFolder
🔧 **Code Example**

#### ✅ TensorFlow/Keras: Use image_dataset_from_directory
🔧 **Code Example**
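The equivalent in TensorFlow/Keras can be sketched with `tf.keras.utils.image_dataset_from_directory`, which infers integer labels from the subfolder names by default. The function name `make_datasets`, the default `data_dir`, and the sizes below are assumptions for illustration.

```python
import tensorflow as tf


def make_datasets(data_dir="dataset", image_size=(224, 224), batch_size=32):
    """Build train/val tf.data.Datasets from data_dir/train and data_dir/val."""
    train_ds = tf.keras.utils.image_dataset_from_directory(
        f"{data_dir}/train",
        image_size=image_size,   # images are resized to this (height, width)
        batch_size=batch_size,
        shuffle=True,
        seed=42,
    )
    val_ds = tf.keras.utils.image_dataset_from_directory(
        f"{data_dir}/val",
        image_size=image_size,
        batch_size=batch_size,
        shuffle=False,           # keep validation order stable
    )
    return train_ds, val_ds
```

Each element of the returned datasets is a batch of images shaped `(batch, height, width, 3)` with a matching batch of integer labels.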

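You can also check the class names through the returned dataset's `class_names` attribute, which lists the folder names in label order.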

#### ⚠️ Tips
- Make sure all files are valid images in a supported format (.jpg, .png, etc.)
- Folder names should ideally not contain special characters or spaces
- Ensure the same class folders exist in both train/ and val/


## Load Dataset using Torchvision ImageFolder
PyTorch uses torchvision.datasets.ImageFolder to automatically associate folder names with class labels using a simple, intuitive mechanism:

#### 🔍 How ImageFolder Works:
When you initialize `ImageFolder` with a root directory such as `'dataset/train'`, here's what happens internally:

**Scans Subfolders** under 'dataset/train':

For example: cat/, dog/, car/

**Sorts Folder Names Alphabetically:**

Let's say your folders are ['cat', 'car', 'dog'], then:

- 'car' → label 0

- 'cat' → label 1

- 'dog' → label 2

**Maps Images to Labels:**

- Every image in dataset/train/cat/ will be given label 1 (from 'cat')

- Images in dataset/train/dog/ will be labeled 2, and so on

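You can view the mapping through the dataset's `class_to_idx` attribute, a dictionary from class name to integer label.

Each item returned by `dataset[i]` is an `(image, label)` tuple: the loaded image (with any transform applied) and its integer class index.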

#### ✅ Summary:

| Folder Name | Label |
|-------------|-------|
| car/ | 0 |
| cat/ | 1 |
| dog/ | 2 |

The folder name becomes the class, and the index assigned is based on alphabetical order, unless you customize it.

## How to Load a Custom Image Dataset with Filenames as Classification Labels
If your image filenames carry the labels (e.g., cat.jpg, dog.jpg, car.png) and the images are not organized into folders, you can still load and use them for training. Below are ways to do this in PyTorch and TensorFlow/Keras.

#### 🔍 Assumptions

Each file's **name (or part of it)** is the **label** (e.g., dog.jpg → label: "dog").

#### 🧠 PyTorch Custom Dataset
You'll need to write a custom Dataset class to parse labels from filenames.
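One possible shape for such a class is sketched below. The class name `FilenameLabelDataset` and the label-parsing rule are assumptions: the label is taken to be the filename stem with any trailing digits, underscores, and hyphens removed (so `dog.jpg` and `dog_12.jpg` both map to `"dog"`). A different naming scheme would need a different `_label_from_path`.

```python
from pathlib import Path

from PIL import Image
from torch.utils.data import Dataset


class FilenameLabelDataset(Dataset):
    """Image dataset whose labels are parsed from the filenames."""

    EXTENSIONS = {".jpg", ".jpeg", ".png"}

    def __init__(self, image_dir, transform=None):
        self.paths = sorted(
            p for p in Path(image_dir).iterdir()
            if p.suffix.lower() in self.EXTENSIONS
        )
        self.transform = transform
        names = [self._label_from_path(p) for p in self.paths]
        # Map label strings to integer indices in alphabetical order.
        self.classes = sorted(set(names))
        self.class_to_idx = {name: i for i, name in enumerate(self.classes)}
        self.labels = [self.class_to_idx[name] for name in names]

    @staticmethod
    def _label_from_path(path):
        # Assumed rule: "dog_12.jpg" -> "dog", "Cat.png" -> "cat".
        return path.stem.rstrip("0123456789_-").lower()

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, index):
        with Image.open(self.paths[index]) as img:
            image = img.convert("RGB")
        if self.transform is not None:
            image = self.transform(image)
        return image, self.labels[index]
```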

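#### 🧪 Usage:
Wrap the dataset in a `torch.utils.data.DataLoader` to iterate over shuffled batches of `(image, label)` pairs during training.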

#### 🌟 TensorFlow/Keras (Using Filenames as Labels)
TensorFlow doesn't handle this directly, so you must map filenames to labels yourself.

#### 📁 Step 1: Create a list of filenames and labels
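A small standard-library sketch of this step follows. The function name `list_images_and_labels` and the parsing rule are assumptions: the label is taken to be the filename stem with trailing digits, underscores, and hyphens stripped, and label indices are assigned in alphabetical order.

```python
from pathlib import Path


def list_images_and_labels(image_dir, extensions=(".jpg", ".jpeg", ".png")):
    """Return (image_paths, labels, class_names) for a flat image folder."""
    paths = sorted(
        p for p in Path(image_dir).iterdir() if p.suffix.lower() in extensions
    )
    # Assumed rule: "dog_12.jpg" -> "dog", "Cat.png" -> "cat".
    names = [p.stem.rstrip("0123456789_-").lower() for p in paths]

    # Integer labels follow the alphabetical order of the class names.
    class_names = sorted(set(names))
    class_to_idx = {name: i for i, name in enumerate(class_names)}

    image_paths = [str(p) for p in paths]
    labels = [class_to_idx[name] for name in names]
    return image_paths, labels, class_names
```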

#### 📦 Step 2: Create TensorFlow Dataset
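Given a list of image paths and a parallel list of integer labels, such as those produced in Step 1, a dataset can be assembled roughly as follows. The function name `make_dataset`, the image size, and the scaling of pixel values to the range 0 to 1 are assumptions for illustration.

```python
import tensorflow as tf


def make_dataset(image_paths, labels, image_size=(224, 224),
                 batch_size=32, shuffle=True):
    """Build a batched tf.data.Dataset of (image, label) pairs."""

    def load_image(path, label):
        raw = tf.io.read_file(path)
        # expand_animations=False keeps the result a 3-D tensor,
        # which tf.image.resize needs inside a dataset map.
        image = tf.io.decode_image(raw, channels=3, expand_animations=False)
        image = tf.image.resize(image, image_size)
        return image / 255.0, label

    ds = tf.data.Dataset.from_tensor_slices((image_paths, labels))
    if shuffle:
        # Shuffle paths before decoding, which is cheap.
        ds = ds.shuffle(buffer_size=len(image_paths))
    ds = ds.map(load_image, num_parallel_calls=tf.data.AUTOTUNE)
    return ds.batch(batch_size).prefetch(tf.data.AUTOTUNE)
```

Shuffling happens on the path strings before any image is decoded, so a buffer covering the whole dataset stays inexpensive even for large image sets.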

#### ✅ Summary

| Framework | Method |
|-----------|--------|
| PyTorch | Create a custom Dataset that reads labels from filenames |
| TensorFlow | Create a tf.data.Dataset manually from image paths and filename-based labels |

