<a href="https://colab.research.google.com/github/avinregmi/PyTorch-Lessons/blob/master/PyTorch%20Text%20Data%20Loader.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [0]:
import numpy as np
import torch
from torch.utils.data import Dataset


In [50]:
! wget https://raw.githubusercontent.com/avinregmi/deep-learning-v2-pytorch/master/recurrent-neural-networks/char-rnn/data/anna.txt
! ls

--2020-01-23 16:48:57--  https://raw.githubusercontent.com/avinregmi/deep-learning-v2-pytorch/master/recurrent-neural-networks/char-rnn/data/anna.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2025486 (1.9M) [text/plain]
Saving to: ‘anna.txt’


2020-01-23 16:48:58 (44.7 MB/s) - ‘anna.txt’ saved [2025486/2025486]

anna.txt  sample_data



### To make a class iterable, implement ```__getitem__``` or ```__iter__```; add ```__len__``` to report its size


Python offers two protocols for making an object iterable: the iteration protocol (the ```__iter__()``` method) and the sequence protocol (the ```__getitem__()``` method). A class that implements either one is iterable.

We will add ```__len__()```, which lets us count the items in our container, and ```__getitem__()```, which makes our class iterable through the sequence protocol.

An iterable class can be used in a for loop, but not directly with ```next()```. To get values
with ```next()```, the class must also be an iterator.
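The difference can be seen in a minimal sketch. ```WordList``` and its contents are hypothetical names chosen for illustration; the class only implements the sequence protocol.

```python
class WordList:
    """Hypothetical container that is iterable through the sequence protocol."""

    def __init__(self, words):
        self.words = list(words)

    def __len__(self):
        return len(self.words)

    def __getitem__(self, index):
        # The IndexError raised past the end is what tells a for loop to stop.
        return self.words[index]


container = WordList(["happy", "families", "are", "all", "alike"])

# A for loop works: Python asks for index 0, 1, 2, ... until IndexError.
collected = [word for word in container]

# next() on the container itself fails, because it has no __next__ method.
try:
    next(container)
    next_worked = True
except TypeError:
    next_worked = False

# iter() wraps the container in an iterator, which next() accepts.
first_word = next(iter(container))
```

Calling ```iter()``` first is usually enough when only an occasional ```next()``` is needed.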



---


### To make a class an iterator:

1.   Include ```__getitem__()``` or ```__iter__()``` to make your class iterable.
2.  Include ```__next__()```, which returns the next item of the container and raises ```StopIteration``` when no items remain, making the class an iterator.
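The two steps above can be sketched as follows. ```WordIterator``` is a hypothetical class, and having ```__iter__()``` return ```self``` is the usual convention for iterators rather than something this notebook prescribes.

```python
class WordIterator:
    """Hypothetical iterator over a list of words."""

    def __init__(self, words):
        self.words = list(words)
        self.position = 0

    def __iter__(self):
        # An iterator conventionally returns itself here.
        return self

    def __next__(self):
        # Signal the end of the data with StopIteration.
        if self.position >= len(self.words):
            raise StopIteration
        word = self.words[self.position]
        self.position += 1
        return word


it = WordIterator(["every", "unhappy", "family"])

first = next(it)      # next() works directly on the object
rest = list(it)       # the remaining items, consumed by iteration
exhausted = list(it)  # empty: the iterator keeps its position and is used up
```

Note that an iterator written this way can only be traversed once; create a new instance to start over.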


In [0]:
from torch.utils.data import Dataset

#Dataset API
class CustomDataset(Dataset):
    # A PyTorch map-style dataset that holds the words of a text file.
    def __init__(self, filename):
        '''
        Takes as input the name of a plain-text file.
        Reads the whole file into memory and stores its space-separated pieces in self.text.
        '''

        # Read the whole file into a single string
        with open(filename) as f:
            lines = f.read()

        # Split on single spaces; the pieces may still contain newlines
        self.text = lines.split(" ")

    def preprocess(self, text):

        # Lowercase, trim surrounding whitespace, and delete newlines.
        # Deleting newlines fuses words that were separated only by a line break.
        text_pp = text.lower().strip().replace('\n','')

        return text_pp
    
    def __len__(self):
        return len(self.text)
   
    def __getitem__(self, index):
        '''
        Returns the preprocessed word stored at the specified index.
        '''
        return self.preprocess(self.text[index])

In [66]:
from torch.utils.data import DataLoader

dataset = CustomDataset('anna.txt')

#Wrap it around a dataloader
dataloader = DataLoader(dataset, batch_size = 64, num_workers = 5)
for text in dataloader:
  print(len(text))
  print(text)
  break

64
['chapter', '1happy', 'families', 'are', 'all', 'alike;', 'every', 'unhappy', 'family', 'is', 'unhappy', 'in', 'its', 'ownway.everything', 'was', 'in', 'confusion', 'in', 'the', "oblonskys'", 'house.', 'the', 'wife', 'haddiscovered', 'that', 'the', 'husband', 'was', 'carrying', 'on', 'an', 'intrigue', 'with', 'a', 'frenchgirl,', 'who', 'had', 'been', 'a', 'governess', 'in', 'their', 'family,', 'and', 'she', 'had', 'announced', 'toher', 'husband', 'that', 'she', 'could', 'not', 'go', 'on', 'living', 'in', 'the', 'same', 'house', 'with', 'him.this', 'position', 'of']
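Entries such as ```'1happy'``` and ```'ownway.everything'``` in the output come from splitting on single spaces and then deleting newlines. A small sketch of the effect, and of one possible alternative, is shown here. The ```sample``` string is a hypothetical snippet modeled on the output above, not the exact contents of the file.

```python
# Hypothetical snippet with line breaks between some words.
sample = "Chapter 1\n\n\nHappy families are all\nalike; every unhappy family"

# Approach used above: split on single spaces, then delete newlines in each piece.
space_split = [
    piece.lower().strip().replace("\n", "") for piece in sample.split(" ")
]

# Alternative: split() with no argument splits on any run of whitespace,
# so a line break separates words just like a space does.
whitespace_split = [piece.lower() for piece in sample.split()]
```

Whether fused tokens matter depends on the task; for word-level models the whitespace split is probably closer to what is intended.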


### Using IterableDataset to save memory

In [0]:
from torch.utils.data import IterableDataset

class CustomIterableDataset(IterableDataset):

    def __init__(self, filename):

      #Store the filename in object's memory
      self.filename = filename


    def preprocess(self, text):
      # Apply some preprocessing
      text_pp = text.lower().strip().replace("\n","").split(" ")

      return text_pp

    def mapper(self, line):
      '''
      Applied to every line of the file by map() in __iter__.
      map takes a function and an iterator and returns another iterator
      whose elements are the output of the function
      when applied to the elements of the original iterator.
      '''
      text = self.preprocess(line)

      return text


    def __iter__(self):
      # Open the file inside a generator so it is closed once iteration finishes
      with open(self.filename) as file_itr:
        # map is lazy: each line is read and preprocessed only when requested
        yield from map(self.mapper, file_itr)
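The memory saving comes from the fact that both a file object and ```map``` are lazy: a line is read and processed only when the consumer asks for it. The following sketch illustrates this with plain Python, without PyTorch. The temporary file, the ```processed``` list, and the standalone ```mapper``` function are hypothetical stand-ins for the corpus and the method above.

```python
import os
import tempfile

# Write a small throwaway file (stand-in for a large corpus).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("First line\nSecond line\nThird line\n")
    path = tmp.name

processed = []  # records which lines have actually been processed


def mapper(line):
    processed.append(line)
    return line.lower().strip().split(" ")


with open(path) as handle:
    mapped = map(mapper, handle)  # nothing has been processed yet
    before = len(processed)
    first = next(mapped)          # processes only the first line
    after = len(processed)

os.remove(path)
```

Only one line has been processed after the first ```next()``` call, which is why the whole file never needs to sit in memory at once.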

In [75]:
dataset = CustomIterableDataset('anna.txt')
dataloader = DataLoader(dataset, batch_size = 64)

for text in dataloader:
    print(len(text)) # prints 1, not 64: the batch holds one tuple with the first word of 64 lines
    print(text)
    break
   

1
[('chapter', '', '', 'happy', 'way.', '', 'everything', 'discovered', 'girl,', 'her', 'this', 'husband', 'household,', 'felt', 'stray', 'with', 'the', 'been', 'the', 'friend', 'had', 'the', '', 'three', 'oblonsky--stiva,', 'his', "wife's", 'over', 'would', 'the', 'sat', '', '"yes,', 'was', 'darmstadt,', 'america.', 'sang,', 'and', 'women,', '', 'stepan', '"yes,', 'delightful,', 'in', 'beside', 'the', 'present', 'morocco.', 'stretched', 'dressing-gown', 'remembered', 'study,', '', '"ah,', 'happened.', 'present', 'worst', '', '"yes,', 'thing', 'to', 'oh,', 'painful', '')]
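The length of 1 is consistent with a default collate step that transposes the batch with ```zip```, which stops at the shortest sample (here, an empty line that became ```['']```). This is an assumption about the PyTorch release used for this run; other releases may instead raise an error when the samples in a batch differ in length. The sketch below mimics the transposition in plain Python. The three-line ```batch``` and the ```keep_lines``` function are hypothetical.

```python
# Three preprocessed "lines" of different lengths (hypothetical data).
batch = [
    ["chapter", "1"],
    [""],
    ["happy", "families", "are"],
]

# zip(*batch) transposes the batch and stops at the shortest sample,
# so only the first word of every line survives.
transposed = list(zip(*batch))


def keep_lines(batch):
    """Hypothetical collate function: return the samples unchanged as a list."""
    return list(batch)


collated = keep_lines(batch)
```

Passing such a function through the ```collate_fn``` argument, for example ```DataLoader(dataset, batch_size = 64, collate_fn = keep_lines)```, should yield batches of 64 word lists, one per line, instead of a single tuple of first words.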
