#### Dataset and DataLoader

Dataset and DataLoader are PyTorch's core data abstractions. They decouple how you define your data from how you efficiently iterate over it in training loops.

##### Dataset class  
It's a blueprint that defines how data is loaded and returned.  

- `__init__()` -> defines how the data should be loaded  
- `__len__()` -> returns the total number of samples  
- `__getitem__(index)` -> returns the sample at the given index

##### DataLoader class
It wraps a Dataset and handles batching, shuffling, and parallelization:  
- At the start of each epoch, it shuffles the indices (if `shuffle=True`).
- For each index, the sample is fetched from the Dataset object using `__getitem__`.
- The samples are collected and combined into a batch using `collate_fn()`.
- The batch is returned to the training loop.
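
The steps above can be sketched by hand. This is only an illustration of the idea, not how DataLoader is actually implemented (the real class also deals with samplers, worker processes, and more). The names `ToyDataset` and `manual_batches` and the toy tensors are made up for this sketch; `default_collate` is assumed to be importable from `torch.utils.data`, which holds for recent PyTorch versions.

```python
import torch
from torch.utils.data import Dataset, default_collate


class ToyDataset(Dataset):
    """Hypothetical dataset: 10 samples with 3 features each."""

    def __init__(self):
        self.features = torch.arange(30, dtype=torch.float32).reshape(10, 3)
        self.labels = torch.arange(10, dtype=torch.float32)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return self.features[idx], self.labels[idx]


def manual_batches(dataset, batch_size, shuffle=True):
    """Yield batches roughly the way the list above describes."""
    n = len(dataset)
    # 1. at the start of the epoch, shuffle the indices (if requested)
    indices = torch.randperm(n).tolist() if shuffle else list(range(n))
    for start in range(0, n, batch_size):
        batch_indices = indices[start:start + batch_size]
        # 2. fetch each sample through __getitem__
        samples = [dataset[i] for i in batch_indices]
        # 3. combine the list of (features, label) pairs into batched tensors
        yield default_collate(samples)


toy = ToyDataset()
batches = list(manual_batches(toy, batch_size=4))
for batch_features, batch_labels in batches:
    # 4. each batch is what the training loop would receive
    print(batch_features.shape, batch_labels.shape)
```

With 10 samples and a batch size of 4, the last batch holds only the 2 remaining samples, which is also what DataLoader does unless `drop_last=True` is set.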

In [15]:
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset
from sklearn.datasets import make_classification
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix

In [2]:
X, y = make_classification(n_samples=10, n_features=20, n_informative=10, n_redundant=5, random_state=42)

X = torch.from_numpy(X).float()
y = torch.from_numpy(y).float()

In [3]:
class customDataset(Dataset):
    def __init__(self, features, labels):
        self.features = features
        self.labels = labels

    def __len__(self):
        return len(self.labels)
    
    def __getitem__(self, idx):
        return self.features[idx], self.labels[idx]

In [4]:
dataset = customDataset(X, y)

In [6]:
dataset[5]

(tensor([ 2.2870,  2.1296, -0.7116,  0.5766,  1.6450, -2.0425, -2.3476,  1.3669,
          0.1319,  1.1865, -2.1554,  1.0296,  0.9065,  2.2981, -0.2490,  4.4895,
         -4.4665,  2.2131, -1.5152,  3.9459]),
 tensor(0.))

In [7]:
dataloader = DataLoader(dataset, batch_size=2, shuffle=True)

for batch_features, batch_labels in dataloader:
    print(batch_features)
    print(batch_labels)
    print("-" * 50)

tensor([[-2.7416,  0.2445, -4.4866,  0.6142, -0.1123, -0.7998, -0.3118,  0.2140,
          0.7653,  3.0692, -1.0656, -3.0723,  0.8255, -3.4992, -0.2210, -3.9071,
         -2.8510, -4.5253, -0.6997,  0.0410],
        [-1.2693,  3.7263, -0.1237,  0.0353, -1.1979,  0.8271,  3.1815, -0.5536,
          1.9978,  1.2473,  4.0500, -1.3506, -1.5733, -2.7079,  1.9647,  4.6150,
         -1.1077, -0.0878, -1.0352, -0.0098]])
tensor([1., 1.])
--------------------------------------------------
tensor([[-0.4762,  1.4008, -4.4011, -0.6466, -0.7564, -2.6964,  0.9252,  0.2035,
         -1.4872, -3.0416,  1.3323,  3.0823,  0.9999,  3.1608, -1.4223,  3.3862,
         -2.2152,  0.0906, -1.6064, -0.2761],
        [-3.5718, -0.0941,  0.1049,  1.4799,  0.8816,  0.0315, -3.0798,  1.6871,
         -0.6928, -1.6571,  2.6018,  1.2493,  1.6031, -3.5134, -0.0080, -1.6477,
         -3.7222,  0.8220, -1.0815, -3.5989]])
tensor([1., 1.])
--------------------------------------------------
tensor([[-1.5700,  1.1844,  2.

##### notes

##### Collate Function  
By default, the DataLoader combines samples into a batch with a simple built-in collation mechanism. A custom collate function lets us change how samples are processed and batched.  
E.g., say we have a text dataset converted into tokens. The challenge is that the sentences have different lengths, so they cannot be stacked directly. In this case we supply a collate function that pads the sequences in each batch to a common length.
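
A minimal sketch of such a padding collate function is shown here. The `TokenDataset` class, the `pad_collate` name, the token ids, and the choice of 0 as the padding id are all assumptions made for illustration; `pad_sequence` is the PyTorch utility from `torch.nn.utils.rnn`.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader, Dataset


class TokenDataset(Dataset):
    """Hypothetical dataset of already-tokenized sentences of varying length."""

    def __init__(self, sequences, labels):
        self.sequences = sequences
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return torch.tensor(self.sequences[idx]), self.labels[idx]


def pad_collate(batch):
    """Pad every sequence in the batch to the length of the longest one."""
    sequences, labels = zip(*batch)
    lengths = torch.tensor([len(seq) for seq in sequences])
    # padding_value=0 assumes token id 0 is reserved for padding
    padded = pad_sequence(list(sequences), batch_first=True, padding_value=0)
    return padded, lengths, torch.tensor(labels)


token_ids = [[5, 8, 2], [7, 1], [4, 9, 3, 6, 2], [11]]  # made-up token ids
token_dataset = TokenDataset(token_ids, labels=[1, 0, 1, 0])
token_loader = DataLoader(token_dataset, batch_size=2, shuffle=False, collate_fn=pad_collate)

for padded, lengths, labels in token_loader:
    print(padded.shape, lengths.tolist(), labels.tolist())
```

Returning the original lengths alongside the padded tensor is a design choice: it lets a model mask out or ignore the padded positions later.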

##### Number Of Workers  
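DataLoader supports multi-process data loading through the `num_workers` parameter. With the default `num_workers=0`, data is loaded in the main process; a positive value starts that many worker subprocesses.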
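
As a rough sketch, setting `num_workers` might look like the following. The `build_loader` helper and the random toy data are made up for this example, and the value 2 is arbitrary; a good setting depends on the machine and on how expensive `__getitem__` is.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def build_loader(num_workers):
    # Toy data standing in for a real dataset (values are arbitrary).
    features = torch.randn(64, 20)
    labels = torch.randint(0, 2, (64,)).float()
    dataset = TensorDataset(features, labels)
    # num_workers=0 (the default) loads data in the main process;
    # num_workers>0 starts that many worker subprocesses.
    return DataLoader(dataset, batch_size=16, shuffle=True, num_workers=num_workers)


if __name__ == "__main__":
    # The guard matters on platforms that start workers with "spawn"
    # (e.g. Windows, macOS), where the script is re-imported in each worker.
    loader = build_loader(num_workers=2)
    for batch_features, batch_labels in loader:
        print(batch_features.shape, batch_labels.shape)
```

For small in-memory tensors like these, extra workers usually add overhead rather than speed; they tend to pay off when each sample requires real work such as reading and decoding files.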

##### Implementing Dataset and DataLoader on a cancer dataset

In [9]:
df = pd.read_csv('https://raw.githubusercontent.com/gscdit/Breast-Cancer-Detection/refs/heads/master/data.csv')
df = df.drop(["id","Unnamed: 32"],axis=1)
df['diagnosis'] = df['diagnosis'].map({'M': 1, 'B': 0})
df.head()

Unnamed: 0,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,1,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,1,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,1,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,1,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


In [10]:
# Split the dataframe into train and test sets
X = df.drop("diagnosis", axis=1)
y = df["diagnosis"]

x_train, x_test, y_train, y_test = train_test_split(X.values, y, test_size=0.2, random_state=42)

# Fit the scaler on the training split only, so test-set statistics do not leak into training
ss = StandardScaler()
x_train = ss.fit_transform(x_train)
x_test = ss.transform(x_test)

x_train_tensor = torch.from_numpy(x_train).float()
y_train_tensor = torch.from_numpy(y_train.values).float()
x_test_tensor = torch.from_numpy(x_test).float()
y_test_tensor = torch.from_numpy(y_test.values).float()

In [11]:
class customDataset(Dataset):
    def __init__(self, features, labels):
        self.features = features
        self.labels = labels

    def __len__(self):
        return len(self.labels)
    
    def __getitem__(self, idx):
        return self.features[idx], self.labels[idx]

train_dataset = customDataset(x_train_tensor, y_train_tensor)
test_dataset = customDataset(x_test_tensor, y_test_tensor)

train_dataloader = DataLoader(train_dataset, batch_size=32, shuffle=True)
test_dataloader = DataLoader(test_dataset, batch_size=32, shuffle=False)

In [12]:
class myModel(nn.Module):
    def __init__(self, num_features):
        super().__init__()
        # one hidden layer of 5 units, sigmoid output for binary classification
        self.network = nn.Sequential(
            nn.Linear(num_features, 5),
            nn.ReLU(),
            nn.Linear(5, 1),
            nn.Sigmoid()
        )

    def forward(self, x):
        out = self.network(x)
        return out
     

In [13]:
model = myModel(x_train_tensor.shape[1])

epochs = 25
loss_fn = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

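for epoch in range(epochs):

    for batch_features, batch_labels in train_dataloader:
        # forward pass
        y_pred = model(batch_features)

        # loss calculation (labels reshaped to (batch_size, 1) to match y_pred)
        loss = loss_fn(y_pred, batch_labels.reshape(-1, 1))

        # clear gradients left over from the previous step
        optimizer.zero_grad()

        # backward pass
        loss.backward()

        # parameter update (optimizer.step() already runs without gradient tracking)
        optimizer.step()

    # note: this is the loss of the last batch in the epoch, not the epoch average
    print(f"Loss after epoch {epoch+1}: {loss.item()}")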



Loss after epoch 1: 0.582358717918396
Loss after epoch 2: 0.46402010321617126
Loss after epoch 3: 0.650102972984314
Loss after epoch 4: 0.33554166555404663
Loss after epoch 5: 0.33701494336128235
Loss after epoch 6: 0.2150178849697113
Loss after epoch 7: 0.3694963753223419
Loss after epoch 8: 0.23097142577171326
Loss after epoch 9: 0.22115285694599152
Loss after epoch 10: 0.24354352056980133
Loss after epoch 11: 0.21515612304210663
Loss after epoch 12: 0.16019245982170105
Loss after epoch 13: 0.23656785488128662
Loss after epoch 14: 0.11940811574459076
Loss after epoch 15: 0.04832916334271431
Loss after epoch 16: 0.4020020365715027
Loss after epoch 17: 0.09544921666383743
Loss after epoch 18: 0.08118089288473129
Loss after epoch 19: 0.2238500416278839
Loss after epoch 20: 0.06868790090084076
Loss after epoch 21: 0.05220610648393631
Loss after epoch 22: 0.0772932916879654
Loss after epoch 23: 0.12692147493362427
Loss after epoch 24: 0.01513208169490099
Loss after epoch 25: 0.03364253416

In [17]:
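model.eval()  # switch to evaluation mode (good practice; no effect on this particular model)
with torch.no_grad():
    y_pred_test = model(x_test_tensor)
    # threshold the probabilities at 0.5 and flatten to shape (n_samples,)
    y_pred_test = (y_pred_test > 0.5).float().squeeze(1)


cm = confusion_matrix(y_test_tensor.numpy(), y_pred_test.numpy())
accuracy = accuracy_score(y_test_tensor.numpy(), y_pred_test.numpy())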

print("Confusion Matrix:")
print(cm)
print("Accuracy:", accuracy)    

Confusion Matrix:
[[71  0]
 [ 1 42]]
Accuracy: 0.9912280701754386
