## In this notebook we'll create custom datasets and use them with PyTorch dataloaders, to see the very basics of how dataloaders work

### let's first try to create a dataset for a csv file

### create csv

In [1]:
import numpy as np
import pandas as pd

In [16]:
rows = 1000
col1 = np.arange(0,rows)
col2 = np.random.randn(rows)
col3 = np.random.choice(['cat','dog','orange','apple','coffee','tea'], rows)

In [17]:
col1.shape

(1000,)

In [18]:
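# reshape each column to (rows, 1) so they can be concatenated side by side
col1 = col1.reshape(rows, 1)
col2 = col2.reshape(rows, 1)
col3 = col3.reshape(rows, 1)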

In [19]:
col1.shape

(1000, 1)

In [20]:
concat = np.concatenate((col1, col2, col3), axis=1)

In [22]:
df = pd.DataFrame(concat, columns=['col1', 'col2', 'col3'])

In [23]:
df.head()

| | col1 | col2 | col3 |
|---|---|---|---|
| 0 | 0 | 0.7650861070094758 | tea |
| 1 | 1 | 1.212008972252098 | coffee |
| 2 | 2 | -1.0494082831216895 | tea |
| 3 | 3 | 0.1984456063702433 | dog |
| 4 | 4 | 0.7155366442448334 | coffee |
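
A side note on the construction above: `np.concatenate` has to settle on a single dtype for its result, so mixing numeric and string columns most likely stores everything as strings (the full-precision `col2` values in the `df.head()` output hint at this). A minimal alternative sketch, assuming the same three columns, is to build the DataFrame from a dict so each column keeps its own dtype. The name `df_typed` is made up for this example.

```python
import numpy as np
import pandas as pd

rows = 1000

# one 1-D array per column; no reshaping or concatenation needed
df_typed = pd.DataFrame({
    "col1": np.arange(rows),
    "col2": np.random.randn(rows),
    "col3": np.random.choice(["cat", "dog", "orange", "apple", "coffee", "tea"], rows),
})

print(df_typed.dtypes)  # integer, float and string/object columns
print(df_typed.shape)
```

The CSV round trip that follows re-infers the column types when the file is read back, so the rest of the notebook behaves the same either way.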


In [24]:
df.to_csv('samplefile.csv', sep='\t', index=False)

In [26]:
data = pd.read_csv('samplefile.csv', delimiter='\t')
data.head()

| | col1 | col2 | col3 |
|---|---|---|---|
| 0 | 0 | 0.765086 | tea |
| 1 | 1 | 1.212009 | coffee |
| 2 | 2 | -1.049408 | tea |
| 3 | 3 | 0.198446 | dog |
| 4 | 4 | 0.715537 | coffee |


### create a dataset class

In [141]:
import torch
from torch.utils.data import Dataset, DataLoader

In [90]:
class CsvDataset(Dataset):
    def __init__(self, filename):
        self.data = pd.read_csv(filename, delimiter='\t')
        
    def __getitem__(self, index):
        return self.data.iloc[index]
    
    def __len__(self):
        return self.data.shape[0]

In [91]:
csv_dataset = CsvDataset('samplefile.csv')

In [92]:
type(csv_dataset)

__main__.CsvDataset

In [93]:
dataloader = DataLoader(csv_dataset, 5)

In [94]:
type(dataloader)

torch.utils.data.dataloader.DataLoader

In [95]:
next(iter(dataloader))

TypeError: default_collate: batch must contain tensors, numpy arrays, numbers, dicts or lists; found <class 'pandas.core.series.Series'>
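
When the collate step complains like this, it can help to index the dataset directly and look at what a single sample is. The sketch below uses a small in-memory stand-in (`ToyCsvDataset` and its three rows are made up for illustration, not the notebook's file).

```python
import pandas as pd
from torch.utils.data import Dataset


class ToyCsvDataset(Dataset):
    """Hypothetical stand-in for CsvDataset, backed by an in-memory frame."""

    def __init__(self):
        self.data = pd.DataFrame({
            "col1": [0, 1, 2],
            "col2": [0.77, 1.21, -1.05],
            "col3": ["tea", "coffee", "tea"],
        })

    def __getitem__(self, index):
        return self.data.iloc[index]

    def __len__(self):
        return self.data.shape[0]


toy_dataset = ToyCsvDataset()
toy_sample = toy_dataset[0]   # the same kind of lookup the dataloader does per index
print(type(toy_sample))       # a pandas Series, which the default collate rejects
print(len(toy_dataset))
```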

#### let's modify the class to return a numpy array instead of a pandas series and try again 

In [120]:
class CsvDataset(Dataset):
    def __init__(self, filename):
        self.data = pd.read_csv(filename, delimiter='\t')
        
    def __getitem__(self, index):
        return self.data.iloc[index].to_numpy()
    
    def __len__(self):
        return self.data.shape[0]

In [121]:
csv_dataset = CsvDataset('samplefile.csv')
dataloader = DataLoader(csv_dataset, 5)

In [122]:
next(iter(dataloader))

TypeError: default_collate: batch must contain tensors, numpy arrays, numbers, dicts or lists; found object

hmm.. let's test a bit. the row mixes numbers and a string, so the numpy array comes back with dtype `object`, which the dataloader can't collate. maybe we can convert the array to float?

In [123]:
data.iloc[0].to_numpy(dtype=float)

ValueError: could not convert string to float: 'tea'
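
To check that guess, a quick look at the dtypes helps. This sketch uses a two-row toy frame (made-up values) rather than `samplefile.csv`: the full row comes back as an `object` array, while the numeric columns on their own convert to float without trouble.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"col1": [0, 1], "col2": [0.5, -1.2], "col3": ["tea", "dog"]})

# whole row: numbers and a string together
full_row = toy.iloc[0].to_numpy()
print(full_row.dtype)

# only the first two (numeric) columns of the same row
numeric_part = toy.iloc[0, :2].to_numpy(dtype=float)
print(numeric_part.dtype, numeric_part)
```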

okay. let's just modify the class to return a list

#### modify the class to return a list instead of a numpy array

In [125]:
class CsvDataset(Dataset):
    def __init__(self, filename):
        self.data = pd.read_csv(filename, delimiter='\t')
        
    def __getitem__(self, index):
        return self.data.iloc[index].tolist()
    
    def __len__(self):
        return self.data.shape[0]

In [126]:
csv_dataset = CsvDataset('samplefile.csv')
dataloader = DataLoader(csv_dataset, 5)

In [129]:
sample = next(iter(dataloader))
sample

[tensor([0, 1, 2, 3, 4]),
 tensor([ 0.7651,  1.2120, -1.0494,  0.1984,  0.7155], dtype=torch.float64),
 ('tea', 'coffee', 'tea', 'dog', 'coffee')]

In [130]:
len(sample)

3

In [131]:
sample[0]

tensor([0, 1, 2, 3, 4])

In [132]:
sample[1]

tensor([ 0.7651,  1.2120, -1.0494,  0.1984,  0.7155], dtype=torch.float64)

In [133]:
sample[2]

('tea', 'coffee', 'tea', 'dog', 'coffee')

**inference:** huh. i expected the dataloader to return a list with 5 sublists, each sublist containing a row. But PyTorch returned a list with 3 elements, each corresponding to a column. the first 2 were automatically converted to tensors and the last one (the strings) came back as a tuple. <br><br>
Looks like the dataloader returns one tensor per feature. Let's test this further
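
If a list of rows really is what you want, `DataLoader` accepts a `collate_fn` argument that replaces the default batching step. A minimal sketch, assuming a toy dataset with made-up rows (`ToyRows` and `keep_rows` are hypothetical names, not part of the notebook):

```python
from torch.utils.data import Dataset, DataLoader


class ToyRows(Dataset):
    """Hypothetical dataset whose samples are plain Python lists."""

    def __init__(self):
        self.rows = [
            [0, 0.5, "tea"],
            [1, 1.2, "coffee"],
            [2, -1.0, "dog"],
            [3, 0.2, "tea"],
        ]

    def __getitem__(self, index):
        return self.rows[index]

    def __len__(self):
        return len(self.rows)


def keep_rows(batch):
    # batch is the list of samples for this step; return it untouched
    return list(batch)


row_loader = DataLoader(ToyRows(), batch_size=2, collate_fn=keep_rows)
first_batch = next(iter(row_loader))
print(first_batch)  # two rows, each still a list
```

The trade-off is that nothing gets converted to tensors, so any conversion has to happen in `collate_fn` or in the dataset itself.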

let's use a batch size of 1 and see whether the dataloader returns a list with one element (the row), or still separates the columns into different parts

In [136]:
csv_dataset = CsvDataset('samplefile.csv')
dataloader = DataLoader(csv_dataset, 1)

In [137]:
next(iter(dataloader))

[tensor([0]), tensor([0.7651], dtype=torch.float64), ('tea',)]

yup, it still returns a list with 3 elements (2 tensors and a tuple), one per column, not a list with a single element for the row.

we'll modify the class so that only the numeric columns are returned, and convert each row to a tensor with a single dtype before returning it. then we'll see whether we still get one tensor/tuple per feature, or one row per sample

In [143]:
class CsvDataset(Dataset):
    def __init__(self, filename):
        self.data = pd.read_csv(filename, delimiter='\t')
        
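    def __getitem__(self, index):
        # keep only the two numeric columns; the string column can't go into a tensor
        ret = self.data.iloc[index, :2].to_numpy(dtype=np.float32)
        ret = torch.tensor(ret, dtype=torch.float32)
        return ret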
    
    def __len__(self):
        return self.data.shape[0]

In [144]:
csv_dataset = CsvDataset('samplefile.csv')
dataloader = DataLoader(csv_dataset, 5)

In [145]:
sample = next(iter(dataloader))

In [146]:
sample

tensor([[ 0.0000,  0.7651],
        [ 1.0000,  1.2120],
        [ 2.0000, -1.0494],
        [ 3.0000,  0.1984],
        [ 4.0000,  0.7155]])

In [147]:
len(sample)

5

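**inference:** look at that! now it returns one tensor with 5 rows, each corresponding to a row of the csv. Each sample is now a single tensor, so the dataloader stacks the samples into one batch tensor. It wasn't doing that before because:
* `__getitem__` returned a Python list, which the default collate function batches position by position, giving one output per column
* the last column contained words, which cannot be converted to a tensor, so that column came back as a tuple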
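
The per-column versus per-row difference can be reproduced without a dataloader by calling the default collate function directly. This is a sketch on made-up two-sample batches; it assumes `default_collate` can be imported from `torch.utils.data.dataloader`, as in the PyTorch versions I have seen.

```python
import torch
from torch.utils.data.dataloader import default_collate

# each sample is a Python list: collated position by position (one output per column)
list_batch = [[0, 0.5, "tea"], [1, 1.5, "dog"]]
per_column = default_collate(list_batch)
print(per_column)

# each sample is a tensor: stacked along a new first (batch) dimension
tensor_batch = [torch.tensor([0.0, 0.5]), torch.tensor([1.0, 1.5])]
stacked = default_collate(tensor_batch)
print(stacked.shape)
```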

### dataloader with shuffle

In [242]:
dataloader = DataLoader(csv_dataset, 5, shuffle=True)

In [243]:
next(iter(dataloader))

tensor([[ 6.8300e+02, -2.3785e-01],
        [ 5.5100e+02, -8.9860e-01],
        [ 5.6900e+02,  1.5202e+00],
        [ 5.9800e+02, -1.6570e+00],
        [ 2.9000e+02,  2.8594e-01]])

In [244]:
next(iter(dataloader))

tensor([[ 2.7000e+02, -3.9455e-01],
        [ 9.2000e+01,  8.0611e-01],
        [ 2.5300e+02,  6.9471e-01],
        [ 3.6400e+02,  1.5529e-01],
        [ 3.2000e+02,  1.3331e+00]])

**inference:** works as expected. the data is shuffled and a subset is returned. Note that the first column looks odd because it is now a float printed in exponential form, but the values are still just the integer row ids. Also, each `iter(dataloader)` call starts a new pass with a fresh shuffle
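
If the shuffle needs to be repeatable (say, for debugging), `DataLoader` takes a `generator` argument that controls the random order. A hedged sketch on a made-up 20-element dataset; `first_batch` is a hypothetical helper, and the seed value is arbitrary.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

toy = TensorDataset(torch.arange(20))


def first_batch(seed):
    # a fresh, explicitly seeded generator drives the shuffle order
    g = torch.Generator()
    g.manual_seed(seed)
    loader = DataLoader(toy, batch_size=5, shuffle=True, generator=g)
    (batch,) = next(iter(loader))  # TensorDataset samples are 1-tuples
    return batch


a = first_batch(0)
b = first_batch(0)
print(a, b)
print(torch.equal(a, b))  # same seed, same first batch
```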

now let's try to iterate over the whole dataset

In [250]:
output = []
for i in dataloader:
    output.append(i)
print(f'length of the output: {len(output)}')
print(f'last record from the output:\n {output[-1]}')

length of the output: 200
last record from the output:
 tensor([[ 4.4100e+02,  7.3563e-01],
        [ 2.6000e+02,  2.6065e-01],
        [ 8.9000e+01, -2.5596e+00],
        [ 6.4000e+01, -1.5236e-01],
        [ 9.3400e+02, -4.0251e-01]])


it works as expected. the dataloader iterates through the whole dataset (200 batches x 5 rows = 1000 rows). each iteration returns 5 randomly chosen rows as one tensor, and every row shows up exactly once per pass
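
The batch count divides evenly here because 1000 is a multiple of 5. As a sketch of the other case, assuming a made-up dataset of 1003 elements: `len(loader)` gives the number of batches, the last batch holds the leftover rows, and `drop_last=True` discards that partial batch.

```python
import math

import torch
from torch.utils.data import DataLoader, TensorDataset

n_rows, batch_size = 1003, 5   # made-up size that is not a multiple of 5
toy = TensorDataset(torch.arange(n_rows))

loader = DataLoader(toy, batch_size=batch_size, shuffle=True)
print(len(loader), math.ceil(n_rows / batch_size))

# collect the size of every batch in one full pass
sizes = [len(batch) for (batch,) in loader]
print(sizes[-1])               # size of the final, partial batch

loader_full_only = DataLoader(toy, batch_size=batch_size, shuffle=True, drop_last=True)
print(len(loader_full_only))
```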