# Data loading and processing in PyTorch

## Table of contents

1. [Understanding data loading and processing in PyTorch](#understanding-data-loading-and-processing-in-pytorch)
2. [Setting up the environment](#setting-up-the-environment)
3. [Working with datasets and DataLoader](#working-with-datasets-and-dataloader)
4. [Data transformations and augmentations](#data-transformations-and-augmentations)
5. [Handling different data formats](#handling-different-data-formats)
6. [Preprocessing pipelines](#preprocessing-pipelines)
7. [Advanced data loading techniques](#advanced-data-loading-techniques)
8. [Practical examples and use cases](#practical-examples-and-use-cases)
9. [Further exercises](#further-exercises)

## Understanding data loading and processing in PyTorch

### **Why data loading and processing matter**

The performance of a machine learning model heavily relies on the quality and format of the data it receives. Properly loading and preprocessing data ensures that the model can effectively learn from the input data, leading to better generalization and accuracy. Efficient data handling also reduces bottlenecks during training, particularly when working with large datasets or complex models.

### **Key concepts**

- **Datasets**: PyTorch provides the `torch.utils.data.Dataset` class as an abstract class for handling datasets. Custom datasets can be created by subclassing `Dataset` and overriding two methods: `__len__()` to return the size of the dataset and `__getitem__()` to retrieve a data sample. PyTorch also offers built-in datasets like MNIST, CIFAR-10, and more, which can be easily loaded using `torchvision.datasets`.
- **DataLoader**: The `torch.utils.data.DataLoader` class loads data in batches, shuffles it, and can load samples in parallel using worker processes. It is highly customizable, with control over the batch size, shuffling, and the number of worker processes (`num_workers`).
- **Transforms**: Data transformations are essential for normalizing, augmenting, and converting data into the appropriate format for model training. PyTorch’s `torchvision.transforms` module provides a wide range of predefined transformations that can be chained together using `transforms.Compose`. Custom transformations can also be created to fit specific needs.

### **Data loading workflow in PyTorch**

The typical data loading workflow in PyTorch involves the following steps:

- **Defining the dataset**: The first step is to define the dataset, either by instantiating a built-in one or by subclassing `Dataset` to specify how individual samples are accessed and returned.
- **Applying transforms**: Once the dataset is defined, transformations are applied to the data to ensure it is in the correct format for model training. This might include normalization, resizing, cropping, or more advanced augmentations like random rotations or color jitter.
- **Creating DataLoader**: With the dataset and transformations in place, the DataLoader is created to handle the batching, shuffling, and parallel loading of data. This is where most of the heavy lifting in terms of data management happens.
- **Iterating through data**: Finally, the DataLoader is used in the training loop to iterate through the dataset in batches, feeding data to the model for training or validation.
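
The four steps above can be sketched with a toy in-memory dataset. `ToyDataset`, `scale_to_unit`, and the tensor sizes are assumptions made for illustration, not part of any PyTorch API.

```python
import torch
from torch.utils.data import Dataset, DataLoader


def scale_to_unit(x):
    # Hypothetical transform: map values from [0, 255] to [0, 1]
    return x / 255.0


class ToyDataset(Dataset):
    # Step 1: define how individual samples are accessed
    def __init__(self, data, labels, transform=None):
        self.data = data
        self.labels = labels
        self.transform = transform

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        sample = self.data[idx]
        if self.transform is not None:
            sample = self.transform(sample)  # Step 2: apply transforms per sample
        return sample, self.labels[idx]


torch.manual_seed(0)
data = torch.randint(0, 256, (100, 8)).float()  # 100 samples, 8 features each
labels = torch.randint(0, 2, (100,))

dataset = ToyDataset(data, labels, transform=scale_to_unit)
loader = DataLoader(dataset, batch_size=16, shuffle=True)  # Step 3: batching and shuffling

batch_shapes = []
for features, targets in loader:  # Step 4: iterate, as a training loop would
    batch_shapes.append(tuple(features.shape))
```

With 100 samples and a batch size of 16, the loop yields six full batches and one final batch of four samples.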

### **Handling large datasets**

For large datasets that cannot fit into memory, data can be loaded lazily, so that only a portion of it is in memory at a time. In practice this means writing a custom dataset whose `__getitem__()` reads a sample from disk only when it is requested, and managing the batch size and the number of worker processes carefully. Techniques such as data streaming, where data is continuously fed from disk to memory, can also be employed.
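
One way to stream data is `torch.utils.data.IterableDataset`, which yields samples one at a time instead of indexing into them. The sketch below is a minimal illustration; `LineStreamDataset` and the comma-separated file layout are assumptions made for this example.

```python
import os
import tempfile

import torch
from torch.utils.data import IterableDataset, DataLoader


class LineStreamDataset(IterableDataset):
    # Hypothetical dataset: yields one parsed row at a time from a text file
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path, "r", encoding="utf-8") as f:
            for line in f:  # Only the current line is held in memory
                values = [float(v) for v in line.split(",")]
                yield torch.tensor(values, dtype=torch.float32)


# Write a small stand-in file; a real file could be far larger than RAM
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "rows.csv")
with open(path, "w", encoding="utf-8") as f:
    for i in range(10):
        f.write(f"{i},{i * 2},{i * 3}\n")

stream_loader = DataLoader(LineStreamDataset(path), batch_size=4)
batch_sizes = [batch.shape[0] for batch in stream_loader]
```

Note that an iterable-style dataset cannot be shuffled by the DataLoader, and with `num_workers > 0` each worker would iterate over the whole stream unless the dataset splits the work itself.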

### **Optimization techniques**

Optimizing data loading and processing can have a significant impact on training speed and model performance. Some key techniques include:

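- **Using multiple workers**: Increasing the number of worker processes (`num_workers`) in the DataLoader can speed up data loading by parallelizing the process.
- **Prefetching data**: Preloading the next batch while the model is training on the current batch can reduce the waiting time between batches. With `num_workers > 0`, the DataLoader does this in the background, and `prefetch_factor` controls how many batches each worker loads in advance.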
- **Data augmentation**: Real-time data augmentation during training can increase the diversity of the dataset without the need to store augmented images on disk.
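
The last point, augmenting data on the fly, can be sketched without any image library: the random transformation is applied inside `__getitem__()`, so each access may return a different variant and nothing extra is stored on disk. `RandomFlipDataset` is a hypothetical class written for this illustration.

```python
import torch
from torch.utils.data import Dataset


class RandomFlipDataset(Dataset):
    # Hypothetical dataset: flips each image horizontally with probability p on access
    def __init__(self, images, p=0.5):
        self.images = images  # Shape: (N, C, H, W)
        self.p = p

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        image = self.images[idx]
        if torch.rand(1).item() < self.p:
            image = torch.flip(image, dims=[-1])  # Flip along the width axis
        return image


torch.manual_seed(0)
images = torch.arange(2 * 1 * 2 * 3, dtype=torch.float32).reshape(2, 1, 2, 3)
always_flipped = RandomFlipDataset(images, p=1.0)
never_flipped = RandomFlipDataset(images, p=0.0)
```

In practice `torchvision.transforms` provides ready-made random transformations; the point here is only that the randomness runs at access time.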

### **Common pitfalls and best practices**

- **Shuffling data**: Shuffle the training data each epoch so that batches are not correlated with the order in which the data is stored, which can bias training and hurt generalization. Validation and test data do not need to be shuffled.
- **Normalizing data**: Proper normalization ensures that the data is on a similar scale, which is crucial for stable and efficient model training.
- **Managing data formats**: Ensure that the data is in the correct format (e.g., tensors) before feeding it to the model. PyTorch expects data in the form of tensors, with specific shapes depending on the model architecture.
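
As a sketch of the normalization point, per-channel statistics can be computed from the training data itself rather than assumed. The random tensor below is a stand-in for real images, and its shape is an assumption for illustration.

```python
import torch

torch.manual_seed(0)
# Stand-in for a training set: 64 RGB images of size 16x16 with values in [0, 1]
train_images = torch.rand(64, 3, 16, 16)

# Per-channel statistics over the batch, height, and width dimensions
mean = train_images.mean(dim=(0, 2, 3))
std = train_images.std(dim=(0, 2, 3))

# Reshape to (1, C, 1, 1) so the statistics broadcast over every image
normalized = (train_images - mean.view(1, -1, 1, 1)) / std.view(1, -1, 1, 1)
```

The same values could be passed to `transforms.Normalize` as `mean.tolist()` and `std.tolist()`. Statistics should come from the training split only and then be reused for validation and test data.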

## Setting up the environment

##### **Q1: How do you install the necessary libraries for data loading and processing in PyTorch?**

In [1]:
# !pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
# !pip install numpy matplotlib scikit-learn pandas

##### **Q2: How do you import the required modules for data handling in PyTorch?**

In [3]:
import torch
from torch.utils.data import Dataset, DataLoader
import torchvision.transforms as transforms
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import os

## Working with datasets and DataLoader

##### **Q3: How do you load built-in datasets using `torchvision`?**

In [4]:
# Using the MNIST dataset as an example:
import torchvision.datasets as datasets
import torchvision.transforms as transforms

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,))  # Normalize to range [-1, 1]
])

train_dataset = datasets.MNIST(root='../00-src', train=True, download=True, transform=transform)
test_dataset = datasets.MNIST(root='../00-src', train=False, download=True, transform=transform)

Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
Failed to download (trying next):
HTTP Error 403: Forbidden

Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-images-idx3-ubyte.gz
Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-images-idx3-ubyte.gz to ../00-src\MNIST\raw\train-images-idx3-ubyte.gz


100%|██████████| 9912422/9912422 [00:06<00:00, 1561489.88it/s]


Extracting ../00-src\MNIST\raw\train-images-idx3-ubyte.gz to ../00-src\MNIST\raw

Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
Failed to download (trying next):
HTTP Error 403: Forbidden

Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-labels-idx1-ubyte.gz
Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-labels-idx1-ubyte.gz to ../00-src\MNIST\raw\train-labels-idx1-ubyte.gz


100%|██████████| 28881/28881 [00:00<00:00, 116348.42it/s]


Extracting ../00-src\MNIST\raw\train-labels-idx1-ubyte.gz to ../00-src\MNIST\raw

Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
Failed to download (trying next):
HTTP Error 403: Forbidden

Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-images-idx3-ubyte.gz
Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-images-idx3-ubyte.gz to ../00-src\MNIST\raw\t10k-images-idx3-ubyte.gz


100%|██████████| 1648877/1648877 [00:00<00:00, 2508841.99it/s]


Extracting ../00-src\MNIST\raw\t10k-images-idx3-ubyte.gz to ../00-src\MNIST\raw

Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
Failed to download (trying next):
HTTP Error 403: Forbidden

Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-labels-idx1-ubyte.gz
Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-labels-idx1-ubyte.gz to ../00-src\MNIST\raw\t10k-labels-idx1-ubyte.gz


100%|██████████| 4542/4542 [00:00<00:00, 4570664.29it/s]

Extracting ../00-src\MNIST\raw\t10k-labels-idx1-ubyte.gz to ../00-src\MNIST\raw






##### **Q4: How do you explore the properties of a dataset, such as size and classes, in PyTorch?**

In [5]:
print(f'Train dataset size: {len(train_dataset)}')
print(f'Test dataset size: {len(test_dataset)}')  # Dataset size

Train dataset size: 60000
Test dataset size: 10000


In [6]:
print(f'Classes: {train_dataset.classes}')  # Only applicable for datasets that have 'classes' attribute

Classes: ['0 - zero', '1 - one', '2 - two', '3 - three', '4 - four', '5 - five', '6 - six', '7 - seven', '8 - eight', '9 - nine']


In [7]:
sample_image, sample_label = train_dataset[0]
print(f'Sample image shape: {sample_image.shape}')  # Shape of a single sample
print(f'Sample label: {sample_label}')

Sample image shape: torch.Size([1, 28, 28])
Sample label: 5


##### **Q5: How do you create a custom dataset class in PyTorch?**

In [8]:
# Subclass torch.utils.data.Dataset and implement the __len__ and __getitem__ methods:
class CustomDataset(Dataset):
    def __init__(self, data, labels, transform=None):  # Initializes the dataset object with data, labels, and any optional transformations
        self.data = data
        self.labels = labels
        self.transform = transform

    def __len__(self):  # Returns the total number of samples
        return len(self.data)

    def __getitem__(self, idx):  # Retrieves the sample and label at the given index
        sample = self.data[idx]
        label = self.labels[idx]

        if self.transform:
            sample = self.transform(sample)

        return sample, label

##### **Q6: How do you implement the `__len__` and `__getitem__` methods for a custom dataset?**

In [9]:
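# See CustomDataset in Q5 above: __len__ returns the number of samples,
# and __getitem__ returns the (sample, label) pair at the given index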
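
To make the role of the two methods concrete, here is a minimal sketch showing which Python operations call them. `SquaresDataset` is a hypothetical class used only for illustration.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class SquaresDataset(Dataset):
    # Hypothetical dataset: sample i is the pair (i, i squared)
    def __init__(self, n):
        self.n = n

    def __len__(self):
        return self.n  # Called by len(dataset)

    def __getitem__(self, idx):
        # Called by dataset[idx]
        return torch.tensor(float(idx)), torch.tensor(float(idx ** 2))


squares = SquaresDataset(5)
size = len(squares)  # Dispatches to __len__
x, y = squares[3]    # Dispatches to __getitem__(3)

# DataLoader calls __getitem__ for each index in a batch and stacks the results
xs, ys = next(iter(DataLoader(squares, batch_size=5)))
```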

##### **Q7: How do you use the DataLoader to batch data in PyTorch?**

In [12]:
from itertools import islice

train_loader = DataLoader(dataset=train_dataset, batch_size=64, shuffle=True)  # Create a DataLoader for the MNIST training set

# With shuffle=True the batches are already random, so take the first five
# instead of materializing every batch in memory with list(train_loader)
for images, labels in islice(train_loader, 5):
    print(f'Batch size: {images.size(0)}')

Batch size: 64
Batch size: 64
Batch size: 64
Batch size: 64
Batch size: 64
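
The batches printed above all contain 64 samples, but 60,000 is not a multiple of 64, so the final batch of an epoch holds only 32. The `drop_last` parameter controls whether that incomplete batch is kept. A small sketch with a toy `TensorDataset` (the sizes are arbitrary):

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

toy = TensorDataset(torch.arange(10).float())  # 10 samples

keep_last = DataLoader(toy, batch_size=4)                  # Default: drop_last=False
drop_last = DataLoader(toy, batch_size=4, drop_last=True)  # Discard the incomplete batch

sizes_keep = [batch[0].size(0) for batch in keep_last]
sizes_drop = [batch[0].size(0) for batch in drop_last]
num_batches = (len(keep_last), len(drop_last))
```

Here `sizes_keep` ends with a batch of two samples, while `sizes_drop` contains only full batches.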


##### **Q8: How do you shuffle data using DataLoader in PyTorch?**

In [13]:
train_loader = DataLoader(dataset=train_dataset, batch_size=64, shuffle=True)  # set the shuffle parameter to True
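
Shuffling draws a new random order every epoch. When a run needs to be reproducible, a seeded `torch.Generator` can be passed to the DataLoader. The sketch below uses a toy dataset, and `epoch_order` is a hypothetical helper written for this illustration.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

toy = TensorDataset(torch.arange(8))


def epoch_order(seed):
    # A seeded generator fixes the permutation drawn by the sampler
    g = torch.Generator()
    g.manual_seed(seed)
    loader = DataLoader(toy, batch_size=4, shuffle=True, generator=g)
    return [batch[0].tolist() for batch in loader]


order_a = epoch_order(seed=0)
order_b = epoch_order(seed=0)
```

Both calls produce the same batch order, and every sample still appears exactly once per epoch.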

##### **Q9: How do you load data in parallel using multiple workers with DataLoader?**

In [14]:
train_loader = DataLoader(dataset=train_dataset, batch_size=64, shuffle=True, num_workers=4)  # set the num_workers parameter in DataLoader

## Data transformations and augmentations

##### **Q10: How do you apply basic data transformations, such as normalization, in PyTorch?**

In [15]:
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5], std=[0.5])  # MNIST is grayscale, so normalize its single channel to the range [-1, 1]
])

dataset = datasets.MNIST(root='../00-src', train=True, download=True, transform=transform)  # Apply the transform when loading the dataset

##### **Q11: How do you resize and crop images using PyTorch transformations?**

In [16]:
transform = transforms.Compose([
    transforms.Resize((128, 128)),  # Resize to 128x128
    transforms.CenterCrop(112),     # Crop the center 112x112
    transforms.ToTensor()
])

dataset = datasets.MNIST(root='../00-src', train=True, download=True, transform=transform)

##### **Q12: How do you compose multiple transformations using `transforms.Compose` in PyTorch?**

In [17]:
transform = transforms.Compose([
    transforms.Resize((128, 128)),
    transforms.RandomHorizontalFlip(),  # Randomly flip the image horizontally
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5], std=[0.5])  # Normalize for grayscale images
])

dataset = datasets.MNIST(root='../00-src', train=True, download=True, transform=transform)

##### **Q13: What are some common data augmentation techniques?**

In [18]:
transform = transforms.Compose([
    transforms.RandomRotation(degrees=15),  # Rotate the image by up to 15 degrees
    transforms.RandomHorizontalFlip(p=0.5),  # Randomly flip the image horizontally with a 50% probability
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.1),  # Randomly change brightness, contrast, saturation, and hue
    transforms.ToTensor()
])

dataset = datasets.MNIST(root='../00-src', train=True, download=True, transform=transform)

In [19]:
dataset

Dataset MNIST
    Number of datapoints: 60000
    Root location: ../00-src
    Split: Train
    StandardTransform
Transform: Compose(
               RandomRotation(degrees=[-15.0, 15.0], interpolation=nearest, expand=False, fill=0)
               RandomHorizontalFlip(p=0.5)
               ColorJitter(brightness=(0.8, 1.2), contrast=(0.8, 1.2), saturation=(0.8, 1.2), hue=(-0.1, 0.1))
               ToTensor()
           )

## Handling different data formats

##### **Q14: How do you load image data from files and directories in PyTorch?**

In [21]:
import urllib.request
import tarfile

root_dir = '../00-src'  # Define the path to the root directory

if not os.path.exists(root_dir):
    os.makedirs(root_dir)  # Create the directory if it doesn't exist

url = "https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz"  # URL of a sample dataset (e.g., a small dataset of flowers)
tgz_path = os.path.join(root_dir, "flower_photos.tgz")

urllib.request.urlretrieve(url, tgz_path)  # Download the dataset

with tarfile.open(tgz_path, 'r:gz') as tar_ref:
    tar_ref.extractall(root_dir)  # Extract the dataset

data_dir = os.path.join(root_dir, "flower_photos")  # Define the directory containing the images

print(f"Dataset extracted to {data_dir}")



Dataset extracted to ../00-src\flower_photos


In [22]:
transform_imgs = transforms.Compose([
    transforms.Resize((128, 128)),  # Resize images to 128x128
    transforms.ToTensor(),  # Convert images to tensors
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5])  # Normalize images
])

dataset_imgs = datasets.ImageFolder(root=data_dir, transform=transform_imgs)  # Load the image data from the directory

dataloader_imgs = DataLoader(dataset_imgs, batch_size=32, shuffle=True)  # Create a DataLoader to batch and shuffle the data

##### **Q15: How do you load and preprocess CSV or tabular data using `pandas` and convert it to tensors?**

In [23]:
csv_path = os.path.join(root_dir, "sample_data.csv")  # Define the path to save the CSV file

df = pd.DataFrame({
    'feature1': [1.0, 2.0, 3.0, 4.0, 5.0],
    'feature2': [10.0, 20.0, 30.0, 40.0, 50.0],
    'label': [0, 1, 0, 1, 0]
})  # Create a sample CSV file
df.to_csv(csv_path, index=False)

In [24]:
df = pd.read_csv(csv_path)  # Load the CSV file using pandas

features = torch.tensor(df[['feature1', 'feature2']].values, dtype=torch.float32)
labels = torch.tensor(df['label'].values, dtype=torch.long)  # Convert features and labels to tensors

print(features)
print(labels)

tensor([[ 1., 10.],
        [ 2., 20.],
        [ 3., 30.],
        [ 4., 40.],
        [ 5., 50.]])
tensor([0, 1, 0, 1, 0])
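
Once features and labels are tensors, `torch.utils.data.TensorDataset` can pair them row by row so that a DataLoader can batch them. A minimal sketch that recreates the same toy values so it runs on its own:

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Same toy values as the sample CSV above
features = torch.tensor(
    [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0], [5.0, 50.0]],
    dtype=torch.float32,
)
labels = torch.tensor([0, 1, 0, 1, 0], dtype=torch.long)

tabular_dataset = TensorDataset(features, labels)  # Pairs row i of features with label i
tabular_loader = DataLoader(tabular_dataset, batch_size=2, shuffle=False)

batches = [(x.shape, y.tolist()) for x, y in tabular_loader]
```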


##### **Q16: How do you load and preprocess text data in PyTorch, including tokenization and embedding creation?**

In [25]:
text_path = os.path.join(root_dir, "sample_text.txt")

url = "https://www.gutenberg.org/files/11/11-0.txt"  # Download a sample text file: Alice's Adventures in Wonderland
urllib.request.urlretrieve(url, text_path)

('../00-src\\sample_text.txt', <http.client.HTTPMessage at 0x1bc56d6d370>)

In [26]:
# !pip install nltk

In [30]:
import nltk
from collections import Counter

nltk.download('punkt')  # This will download the punkt tokenizer
nltk.download('punkt_tab')  # Add this line to download 'punkt_tab' if needed

with open(text_path, 'r', encoding='utf-8') as f:  # Load the text data with the correct encoding
    text_data = f.readlines()

tokenized_data = [nltk.word_tokenize(line.lower()) for line in text_data]  # Tokenize the text data using nltk

counter = Counter([word for line in tokenized_data for word in line])
vocab = {word: idx for idx, (word, _) in enumerate(counter.items(), start=1)}  # Build a vocabulary
vocab['<unk>'] = 0  # Add an unknown token

text_as_tensor = [torch.tensor([vocab.get(word, 0) for word in line], dtype=torch.long) for line in tokenized_data]  # Convert tokens to tensor indices

print(text_as_tensor[0])  # Print tensor for the first line of text

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\fellm\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\fellm\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt_tab.zip.


tensor([ 1,  2,  2,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12])
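
The cell above stops at token indices. To turn indices into embeddings, one option is `torch.nn.Embedding`, a trainable lookup table with one vector per vocabulary entry. The sketch below uses a tiny hand-written vocabulary, and the embedding size of 8 is an arbitrary assumption.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy vocabulary in the same style as above: index 0 is reserved for '<unk>'
vocab = {"<unk>": 0, "alice": 1, "was": 2, "beginning": 3}
tokens = ["alice", "was", "very", "tired"]  # 'very' and 'tired' are out of vocabulary
indices = torch.tensor([vocab.get(t, 0) for t in tokens], dtype=torch.long)

# Embedding table with one trainable 8-dimensional vector per vocabulary entry
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=8)
embedded = embedding(indices)  # Shape: (sequence length, embedding_dim)
```

Both out-of-vocabulary tokens map to index 0 and therefore share the same vector. The vectors start out random and are learned together with the rest of the model.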


##### **Q17: What strategies can you use to handle missing data when loading and preprocessing datasets?**

In [31]:
torch.manual_seed(42)  # Set random seed for reproducibility

num_samples = 10000
num_features = 10

data = torch.randn(num_samples, num_features)  # Generate random data

mask = torch.rand(num_samples, num_features) < 0.1
data[mask] = float('nan')

print(f"Number of missing values: {torch.isnan(data).sum().item()}")

Number of missing values: 9836


In [32]:
# Dropping rows with missing values:
rows_with_nan = torch.isnan(data).any(dim=1)

cleaned_data = data[~rows_with_nan]  # Drop rows with any NaNs

print(f"Original dataset size: {data.size()}")
print(f"Cleaned dataset size: {cleaned_data.size()}")

Original dataset size: torch.Size([10000, 10])
Cleaned dataset size: torch.Size([3601, 10])


In [33]:
# Fill NaNs with zero:
data_filled_zero = torch.nan_to_num(data, nan=0.0)

print(f"Number of missing values after filling with zero: {torch.isnan(data_filled_zero).sum().item()}")

Number of missing values after filling with zero: 0


In [34]:
# Fill NaNs with the column mean:
column_means = torch.nanmean(data, dim=0)

column_means_expanded = column_means.unsqueeze(0).expand_as(data)  # Expand the column means to match the data shape

data_filled_mean = torch.where(torch.isnan(data), column_means_expanded, data)  # Filling in the mean

print(f"Number of missing values after filling with column mean: {torch.isnan(data_filled_mean).sum().item()}")

Number of missing values after filling with column mean: 0


In [38]:
# Interpolating missing data using pandas:
data_df = pd.DataFrame(data.numpy())

data_interpolated = data_df.interpolate(method='linear', limit_direction='both', axis=0)  # Interpolate missing values with both forward and backward fill for edge cases

data_interpolated_tensor = torch.tensor(data_interpolated.values)

print(f"Number of missing values after interpolation: {torch.isnan(data_interpolated_tensor).sum().item()}")

Number of missing values after interpolation: 0


In [37]:
# Imputation with sklearn:
from sklearn.impute import SimpleImputer

data_np = data.numpy()

imputer = SimpleImputer(strategy='mean')  # Initialize the SimpleImputer to fill missing values with the mean of each column

data_imputed_np = imputer.fit_transform(data_np)  # Fit the imputer on the data and transform it to fill in the missing values

data_imputed_tensor = torch.tensor(data_imputed_np)  # Convert the imputed NumPy array back to a PyTorch tensor

print(f"Number of missing values after imputation: {torch.isnan(data_imputed_tensor).sum().item()}")  # Print the number of missing values after imputation

print(data_imputed_tensor[:5])  # Print the first 5 rows of the imputed data

Number of missing values after imputation: 0
tensor([[ 1.9269e+00,  1.4873e+00,  9.0072e-01, -2.1055e+00, -4.9533e-04,
         -1.2345e+00, -4.3067e-02, -1.6047e+00, -7.5214e-01,  1.6487e+00],
        [-3.9248e-01, -1.4036e+00, -7.2788e-01, -5.5943e-01, -7.6884e-01,
          7.6245e-01,  1.6423e+00, -1.5960e-01, -4.9740e-01,  4.3959e-01],
        [-7.5813e-01,  1.0783e+00,  8.0080e-01,  1.6806e+00,  1.2791e+00,
          1.2964e+00,  6.1047e-01,  1.3347e+00, -2.3162e-01,  4.1759e-02],
        [-2.5158e-01,  8.5986e-01, -1.3847e+00, -8.7124e-01, -2.2337e-01,
          1.7174e+00,  3.1888e-01, -4.2452e-01,  3.0572e-01, -7.7459e-01],
        [-1.5576e+00,  9.9564e-01, -8.7979e-01,  1.3268e-02, -1.2742e+00,
          2.1228e+00,  1.0641e-02, -4.8791e-01, -9.1382e-01,  1.4074e-02]])


## Preprocessing pipelines

##### **Q18: How do you build a preprocessing pipeline that integrates transformations and augmentations in PyTorch?**

In [40]:
# Define the transformation pipeline:
transform_pipeline = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),  # Randomly flip the image horizontally
    transforms.RandomRotation(degrees=15),  # Randomly rotate the image by up to 15 degrees
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.1),  # Randomly adjust brightness, contrast, etc.
    transforms.Resize((128, 128)),  # Resize the image to 128x128
    transforms.ToTensor(),  # Convert the image to a PyTorch tensor
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5])  # Normalize the image
])

In [41]:
# With images in a directory structure compatible with ImageFolder:
from torchvision.datasets import ImageFolder

dataset = ImageFolder(root='../00-src/flower_photos', transform=transform_pipeline)

dataloader = DataLoader(dataset, batch_size=32, shuffle=True)

In [42]:
# Integrate with model training
# for images, labels in dataloader:
    # Forward pass: Compute predicted y by passing x to the model.
    # outputs = model(images)
    # Compute loss, gradients, and update model parameters here

##### **Q19: How do you manage data flow from raw input to a model-ready format in PyTorch?**

In [43]:
raw_data = ImageFolder(root='../00-src/flower_photos')  # Load raw data

transform_pipeline = transforms.Compose([
    transforms.Resize((128, 128)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5])
])

preprocessed_data = ImageFolder(root='../00-src/flower_photos', transform=transform_pipeline)  # Apply transformations

dataloader = DataLoader(preprocessed_data, batch_size=32, shuffle=True, num_workers=4)

In [44]:
# Pass through the model:
for batch in dataloader:
    images, labels = batch
    # Pass the batch to the model
    # outputs = model(images)

##### **Q20: How do you create and use custom collate functions in PyTorch to handle variable-length inputs?**

In [46]:

from torch.nn.utils.rnn import pad_sequence

def collate_fn(batch):  # Define a collate function
    sequences, labels = zip(*batch)  # Unzip the batch into separate sequences and labels
    padded_sequences = pad_sequence(sequences, batch_first=True, padding_value=0)  # Pad sequences to the same length
    labels = torch.tensor(labels)  # Stack labels into a tensor
    return padded_sequences, labels

In [47]:
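seq_dataset = [(seq, 0) for seq in text_as_tensor if len(seq) > 0]  # Variable-length token sequences from Q16, each paired with a placeholder label of 0
dataloader = DataLoader(seq_dataset, batch_size=32, collate_fn=collate_fn, shuffle=True)  # Use the collate function with DataLoader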

In [48]:
# Feed batches into the model:
for batch in dataloader:
    sequences, labels = batch
    # Pass the batch to the model
    # outputs = model(sequences)
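
Padding hides the original sequence lengths, which some models need (for example, to ignore padded positions). A variant of the collate function can return them as well. The sketch below is self-contained, and `toy_sequences` and `collate_with_lengths` are hypothetical names used for illustration.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

# Hypothetical dataset: (sequence, label) pairs with sequences of different lengths
toy_sequences = [
    (torch.tensor([1, 2, 3]), 0),
    (torch.tensor([4, 5]), 1),
    (torch.tensor([6]), 0),
    (torch.tensor([7, 8, 9, 10]), 1),
]


def collate_with_lengths(batch):
    sequences, labels = zip(*batch)
    lengths = torch.tensor([len(seq) for seq in sequences])  # Lengths before padding
    padded = pad_sequence(list(sequences), batch_first=True, padding_value=0)
    return padded, lengths, torch.tensor(labels)


seq_loader = DataLoader(toy_sequences, batch_size=4, collate_fn=collate_with_lengths)
padded, lengths, labels = next(iter(seq_loader))
```

Every sequence in the batch is padded to the length of the longest one, and `lengths` records how many leading entries of each row are real data.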

##### **Q21: How do you manage different data structures in a preprocessing pipeline?**

In [53]:
text_path = '../00-src/sample_text.txt'
with open(text_path, 'r', encoding='utf-8') as f:
    text_data = f.readlines()

# Simple text processing example:
nltk.download('punkt')
def text_transform(text):
    tokens = nltk.word_tokenize(text.lower())
    return torch.tensor([len(tokens)])  # Simple feature: number of tokens

text_data_processed = [text_transform(text) for text in text_data]

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\fellm\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [54]:
from sklearn.preprocessing import StandardScaler

csv_path = '../00-src/sample_data.csv'
tabular_df = pd.read_csv(csv_path)  # Load the tabular data

scaler = StandardScaler()
tabular_data_processed = scaler.fit_transform(tabular_df[['feature1', 'feature2']])  # Normalize the tabular data

tabular_data_processed = torch.tensor(tabular_data_processed, dtype=torch.float32)

In [55]:
image_data_processed = torch.randn(100, 3, 128, 128)  # Create dummy image data: 100 RGB images, 128x128 size

In [59]:
# Combine pipelines:
from itertools import cycle, islice

class MultiModalDataset(torch.utils.data.Dataset):
    def __init__(self, image_data, tabular_data, text_data):
        self.image_data = image_data
        self.tabular_data = tabular_data
        self.text_data = text_data
        self.max_length = len(image_data)  # The image data defines the dataset length

        # Repeat shorter modalities until they cover every image.
        # islice stops after max_length items; list(cycle(...)) on its own
        # never terminates and exhausts memory
        if len(tabular_data) < self.max_length:
            self.tabular_data = list(islice(cycle(tabular_data), self.max_length))
        if len(text_data) < self.max_length:
            self.text_data = list(islice(cycle(text_data), self.max_length))

    def __getitem__(self, idx):
        image = self.image_data[idx]
        tabular = self.tabular_data[idx]
        text = self.text_data[idx]

        return image, tabular, text

    def __len__(self):
        return self.max_length


In [65]:
# Use DataLoader with the combined dataset:
dataset = MultiModalDataset(image_data_processed, tabular_data_processed, text_data_processed)
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)

for batch in dataloader:
    images, tabular_data, text_data = batch
    # print(images.shape, tabular_data.shape, text_data.shape)
    # Pass to a model

## Advanced data loading techniques

##### **Q22: What strategies can you use to work with large datasets that do not fit in memory in PyTorch?**

In [67]:
dataset_dir = os.path.abspath('../00-src/large_dataset')
os.makedirs(dataset_dir, exist_ok=True)

num_samples = 50000  # Number of samples in the dataset
image_shape = (3, 128, 128)  # Simulating 128x128 RGB images
num_tabular_features = 10  # Number of features in tabular data
text_length = 50  # Length of the text sequences (number of tokens)

# Creating the dataset (it'll take a while!):
for i in range(num_samples):
    image = torch.randn(image_shape)
    torch.save(image, os.path.join(dataset_dir, f'image_{i}.pt'))  # Simulate image data

    tabular = torch.randn(num_tabular_features)
    torch.save(tabular, os.path.join(dataset_dir, f'tabular_{i}.pt'))  # Simulate tabular data

    text = torch.randint(0, 10000, (text_length,))  # Random tokens between 0 and 9999
    torch.save(text, os.path.join(dataset_dir, f'text_{i}.pt'))  # Simulate text data

print(f"Created {num_samples} samples in '{dataset_dir}'")

Created 50000 samples in 'g:\My Drive\Pro\Portfólios\git\pyTorchBasis\00-basic-examples\00-src\large_dataset'


In [71]:
# Use DataLoader with the LargeDataset:
class LargeDataset(Dataset):
    def __init__(self, dataset_dir):
        self.dataset_dir = dataset_dir
        self.num_samples = len([name for name in os.listdir(dataset_dir) if name.startswith('image_')])

    def __len__(self):
        return self.num_samples

    def __getitem__(self, idx):
        try:  # Load data lazily from disk
            image = torch.load(os.path.join(self.dataset_dir, f'image_{idx}.pt'))
            tabular = torch.load(os.path.join(self.dataset_dir, f'tabular_{idx}.pt'))
            text = torch.load(os.path.join(self.dataset_dir, f'text_{idx}.pt'))
        except Exception as e:
            print(f"Error loading data at index {idx}: {e}")
            raise
        return image, tabular, text

dataset = LargeDataset('../00-src/large_dataset')

dataloader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=0)  # Try with no multiprocessing first

for batch in dataloader:  # If it works fine without multiprocessing, gradually increase num_workers
    images, tabular_data, text_data = batch
    # print(images.shape, tabular_data.shape, text_data.shape)
    # Add model training code here



##### **Q23: How do you implement lazy loading to load data as needed in PyTorch?**

In [72]:
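# See LargeDataset in Q22 above: __getitem__ reads each sample from disk
# only when it is requested, so the full dataset is never held in memory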
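
Another way to load lazily, when the data lives in one large binary file instead of one file per sample, is a memory-mapped array. The sketch below uses `numpy.memmap`; the class name `MemmapDataset` and the float32 row layout are assumptions made for illustration.

```python
import os
import tempfile

import numpy as np
import torch
from torch.utils.data import Dataset


class MemmapDataset(Dataset):
    # Hypothetical dataset backed by a raw binary file of float32 rows
    def __init__(self, path, num_rows, num_cols):
        # np.memmap maps the file into virtual memory without reading it up front
        self.array = np.memmap(path, dtype=np.float32, mode="r", shape=(num_rows, num_cols))

    def __len__(self):
        return self.array.shape[0]

    def __getitem__(self, idx):
        # Copy the requested row into a regular array before wrapping it as a tensor
        return torch.from_numpy(np.array(self.array[idx]))


# Write a small stand-in file; a real file could be far larger than RAM
path = os.path.join(tempfile.mkdtemp(), "rows.bin")
np.arange(12, dtype=np.float32).reshape(4, 3).tofile(path)

memmap_dataset = MemmapDataset(path, num_rows=4, num_cols=3)
row = memmap_dataset[2]
```

Only the rows that are actually indexed are read from disk, so the approach scales to files that do not fit in memory.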

##### **Q24: How can you speed up data loading by caching preprocessed data in PyTorch?**

In [74]:
class CachedLargeDataset(Dataset):  # Caches each preprocessed sample on disk the first time it is requested
    def __init__(self, dataset_dir, cache_dir):
        self.dataset_dir = dataset_dir
        self.cache_dir = cache_dir
        os.makedirs(cache_dir, exist_ok=True)
        self.num_samples = len([name for name in os.listdir(dataset_dir) if name.startswith('image_')])

    def preprocess_and_cache(self, idx):
        cache_path = os.path.join(self.cache_dir, f'cache_{idx}.pt')
        if os.path.exists(cache_path):
            return torch.load(cache_path)  # Cache hit: return the stored result

        # Cache miss: load the raw data and preprocess it
        image = torch.load(os.path.join(self.dataset_dir, f'image_{idx}.pt'))
        tabular = torch.load(os.path.join(self.dataset_dir, f'tabular_{idx}.pt'))
        text = torch.load(os.path.join(self.dataset_dir, f'text_{idx}.pt'))

        image = (image - image.mean()) / image.std()  # Example of a simple preprocessing step
        preprocessed_data = (image, tabular, text)
        torch.save(preprocessed_data, cache_path)  # Store the result for later epochs
        return preprocessed_data

    def __len__(self):
        return self.num_samples

    def __getitem__(self, idx):
        return self.preprocess_and_cache(idx)

os.makedirs('../00-src/cache_dir', exist_ok=True)

cached_dataset = CachedLargeDataset('../00-src/large_dataset', '../00-src/cache_dir')
cached_dataloader = DataLoader(cached_dataset, batch_size=32, shuffle=True, num_workers=0)

for batch in cached_dataloader:
    images, tabular_data, text_data = batch
    # Process the batch
    # print(images.shape, tabular_data.shape, text_data.shape)
    # Add model training code here



In [75]:
import shutil

dataset_dir = os.path.abspath('../00-src/large_dataset')
cache_dir = os.path.abspath('../00-src/cache_dir')

if os.path.exists(dataset_dir):
    shutil.rmtree(dataset_dir)
    print(f"Deleted '{dataset_dir}'")

if os.path.exists(cache_dir):
    shutil.rmtree(cache_dir)
    print(f"Deleted '{cache_dir}'")

Deleted 'g:\My Drive\Pro\Portfólios\git\pyTorchBasis\00-basic-examples\00-src\large_dataset'
Deleted 'g:\My Drive\Pro\Portfólios\git\pyTorchBasis\00-basic-examples\00-src\cache_dir'


## Practical examples and use cases

##### **Q25: How do you prepare image data for classification tasks using CNNs in PyTorch?**

In [76]:
transform_pipeline = transforms.Compose([
    transforms.Resize((128, 128)),  # Resize images to 128x128
    transforms.RandomHorizontalFlip(),  # Randomly flip images horizontally
    transforms.ToTensor(),  # Convert images to PyTorch tensors
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5])  # Normalize images
])

dataset_dir = '../00-src/flower_photos'
dataset = ImageFolder(root=dataset_dir, transform=transform_pipeline)

dataloader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=4)

In [77]:
# Use the DataLoader in training
# for images, labels in dataloader:
    # Pass the images to your CNN model
    # outputs = model(images)
    # Compute loss, backpropagate, and update model parameters

##### **Q26: How do you preprocess text data for NLP tasks in PyTorch?**

In [80]:
text_path = '../00-src/sample_text.txt'
with open(text_path, 'r', encoding='utf-8') as f:
    text_data = f.readlines()

tokenizer = nltk.word_tokenize  # Tokenize the text data using nltk
tokenized_data = [tokenizer(line.lower()) for line in text_data]

In [81]:
counter = Counter([word for line in tokenized_data for word in line])
vocab = {word: idx for idx, (word, _) in enumerate(counter.items(), start=1)}
vocab['<unk>'] = 0  # Add an unknown token

text_as_tensor = [torch.tensor([vocab.get(word, 0) for word in line], dtype=torch.long) for line in tokenized_data]

In [82]:
class TextDataset(Dataset):  # Create a custom (text) dataset
    def __init__(self, text_tensor):
        self.text_tensor = text_tensor

    def __len__(self):
        return len(self.text_tensor)

    def __getitem__(self, idx):
        return self.text_tensor[idx]

text_dataset = TextDataset(text_as_tensor)
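# Lines differ in length, so pad each batch before stacking (padding_value=0 reuses the '<unk>' index here)
text_dataloader = DataLoader(text_dataset, batch_size=4, shuffle=True, collate_fn=lambda batch: pad_sequence(batch, batch_first=True, padding_value=0))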

##### **Q27: How do you work with multi-modal data, combining image and text data, in PyTorch?**

In [83]:
image_dataset = dataset

In [84]:
class MultiModalDataset(Dataset):  # Create a multi-modal dataset class
    def __init__(self, image_dataset, text_tensors):
        self.image_dataset = image_dataset
        self.text_tensors = text_tensors

    def __len__(self):  # Use the shorter of the two sources so every index is valid for both
        return min(len(self.image_dataset), len(self.text_tensors))

    def __getitem__(self, idx):
        image, label = self.image_dataset[idx]  # Get the image and its label from the image dataset

        text_tensor = self.text_tensors[idx]  # Pair it with the text line at the same position (an arbitrary pairing, for illustration)

        return image, text_tensor, label

In [85]:
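def multi_modal_collate(batch):  # Pad the variable-length text tensors so each batch can be stacked
    images, texts, labels = zip(*batch)
    return torch.stack(images), pad_sequence(texts, batch_first=True, padding_value=0), torch.tensor(labels)

multi_modal_dataset = MultiModalDataset(image_dataset=image_dataset, text_tensors=text_as_tensor)  # Create the multi-modal dataset

multi_modal_dataloader = DataLoader(multi_modal_dataset, batch_size=32, shuffle=True, collate_fn=multi_modal_collate, num_workers=0)  # num_workers=0 avoids pickling notebook-defined classes for worker processes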

In [86]:
# Example usage of the multi-modal DataLoader (it takes a long while):
# for images, texts, labels in multi_modal_dataloader:
    # Here you can process images and texts together in your model
    # print(f'Images shape: {images.shape}')
    # print(f'Texts shape: {texts.shape}')
    # print(f'Labels shape: {labels.shape}')
    # Add model training code here

In [2]:
import os
import shutil

src_folder = os.path.abspath('../00-src')

if os.path.exists(src_folder):
    for filename in os.listdir(src_folder):  # Iterate over all files and directories within the folder
        file_path = os.path.join(src_folder, filename)
        try:
            if os.path.isfile(file_path) or os.path.islink(file_path):
                os.unlink(file_path)  # Remove the file or symbolic link
            elif os.path.isdir(file_path):
                shutil.rmtree(file_path)  # Remove the directory and all its contents
        except Exception as e:
            print(f'Failed to delete {file_path}. Reason: {e}')
    print(f"Cleared all contents of '{src_folder}'")
else:
    print(f"The directory '{src_folder}' does not exist")

Cleared all contents of 'g:\My Drive\Pro\Portfólios\git\pyTorchBasis\00-basic-examples\00-src'
