# Natural language processing: project - Dataset Demo

In [1]:
import os

In [2]:
CD_KEY = "--IN_ROOT"

In [3]:
if CD_KEY not in os.environ:
    os.environ[CD_KEY] = "false"

In [4]:
# Move up one directory unless an earlier run already did so.
if os.environ.get(CD_KEY, "false") in ("", "false"):
    %cd ..
else:
    print(os.getcwd())
    
os.environ[CD_KEY] = "true"

/mnt/data/projekti/faks/OPJe/project


## Importing modules

In [5]:
from prado.datasets import ProcessedDataset
from prado.datasets import BasicPradoTransform, BasicPradoAugmentation

from src.modelling.datasets import ImdbDataset

## Demonstration

### ImdbDataset

This class serves as the base IMDB dataset. It loads the data from a TSV file and exposes it without further modification, so it is generally not useful as direct model input.
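
To make the idea concrete, here is a minimal sketch of a dataset that reads `(text, label)` rows from TSV data. The class name `TsvTextDataset`, the column order (text first, integer label second), and the absence of a header row are assumptions made for illustration; the actual `ImdbDataset` may be implemented differently.

```python
import csv
import io


class TsvTextDataset:
    """Hypothetical stand-in for ImdbDataset: rows of (text, label) read from TSV."""

    def __init__(self, lines, delimiter="\t"):
        # `lines` is any iterable of strings, e.g. an open file handle.
        # Assumption: no header row, text in column 0, integer label in column 1.
        reader = csv.reader(lines, delimiter=delimiter, quoting=csv.QUOTE_NONE)
        self.rows = [(text, int(label)) for text, label in reader]

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, index):
        # Supports both integer indices and slices such as dataset[:3].
        return self.rows[index]


# In-memory sample data instead of a real file.
sample = io.StringIO("a dull film\t0\na lovely film\t1\n")
dataset = TsvTextDataset(sample)

for text, label in dataset[:2]:
    print(text, "->", "negative" if label == 0 else "positive")
```

Keeping the rows as plain tuples is what allows the `for text, label in ...[:3]` loops used throughout this notebook.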

In [6]:
imdb_dataset = ImdbDataset(
    path="data/processed/ready-to-use/imdb/train.tsv",
    delimiter="\t",
)

for text, label in imdb_dataset[:3]:
    label_string = "negative" if label == 0 else "positive"
    
    print(f"{text}\n\nLabeled as {label_string}\n\n")

 This film reminds me of how college students used to protest against the Vietnam War. As if, upon hearing some kids were doing without cheeseburgers in Cow Dung Collehe, the President was going to immediately change all US foreign policy.  The worst thing is that, while dangerous, the concept of a policy based on if the USSR and US went to war it could mean the end of the world, WORKED. The US and USSR NEVER WENT TO WAR.  Had we only conventional weapons, the notion of yet another war, a "win-able" war, in Europe and Asia was not unthinkable.  Not that I think they should get rid of this movie. It should be seen by film students as a splendid example of "How NOT to make a film."  It should be 0 stars or maybe black holes... 

Labeled as negative


 I love Memoirs of a Geisha so I read the book twice; it is one of the best book I've read last year. I was looking forward to the movie and was afraid that reading the book would ruin the viewing pleasure of the movie. I wasn't expecting th

### ProcessedDataset

**ProcessedDataset** is a **Dataset** subclass that wraps an existing dataset and produces a modified one by applying transformations to its columns.
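
The wrapping idea can be sketched roughly as follows. `MappedDataset` is a hypothetical name, and applying the transformations eagerly at construction time is an assumption; this is not the `prado` implementation, only an illustration of a map from column index to transformation function.

```python
class MappedDataset:
    """Hypothetical stand-in for ProcessedDataset (not the prado implementation)."""

    def __init__(self, original_dataset, transformation_map):
        self.rows = []
        for row in original_dataset:
            # Apply the transformation registered for each column index;
            # columns without an entry are passed through unchanged.
            self.rows.append(tuple(
                transformation_map.get(i, lambda value: value)(value)
                for i, value in enumerate(row)
            ))

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, index):
        return self.rows[index]


raw = [("A Dull Film", 0), ("A Lovely Film", 1)]

# Transform only column 0 (the text); the label column stays as it is.
processed = MappedDataset(raw, {0: lambda text: text.lower().split()})
print(processed[0])
```

Because the result is itself indexable and iterable, one such dataset can serve as the `original_dataset` of another, which is how transformations can be chained.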

#### Transformations

We define many smaller transformations in `prado.datasets.transforms`, but we export two important ones:

- `BasicPradoTransform`
- `BasicPradoAugmentation`

We won't go into much detail here about what they do, but keep in mind that both can be used to preprocess the dataset for training.
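
For orientation only, the token lists printed later in this notebook suggest that the transform lowercases the text, splits it into word tokens, drops most punctuation, and keeps periods as separate tokens. The function below is a rough approximation inferred from that output alone; `rough_tokenize` is a hypothetical name, and the real `BasicPradoTransform` may behave differently (its treatment of digits, for example, is a guess).

```python
import re


def rough_tokenize(text):
    # Lowercase, then keep runs of letters/digits and standalone periods.
    # Everything else (commas, quotes, hyphens, whitespace) acts as a separator.
    return re.findall(r"[a-z0-9]+|\.", text.lower())


print(rough_tokenize('A "win-able" war, in Europe.'))
# -> ['a', 'win', 'able', 'war', 'in', 'europe', '.']
```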

#### BasicPradoTransform on ProcessedDataset

Let's apply the **BasicPradoTransform** on the text column (index **0**) of the `imdb_dataset`.

In [7]:
basic_prado_transform = BasicPradoTransform()

In [8]:
basic_preprocessed_dataset = ProcessedDataset(
    original_dataset=imdb_dataset,
    transformation_map={
        0: basic_prado_transform
    },
    verbosity=1,
)

Transforming dataset: 100%|██████████| 25000/25000 [01:07<00:00, 370.88it/s]


In [9]:
for text, label in basic_preprocessed_dataset[:3]:
    label_string = "negative" if label == 0 else "positive"
    
    print(f"{text}\n\nLabeled as {label_string}\n\n")

['this', 'film', 'reminds', 'me', 'of', 'how', 'college', 'students', 'used', 'to', 'protest', 'against', 'the', 'vietnam', 'war', '.', 'as', 'if', 'upon', 'hearing', 'some', 'kids', 'were', 'doing', 'without', 'cheeseburgers', 'in', 'cow', 'dung', 'collehe', 'the', 'president', 'was', 'going', 'to', 'immediately', 'change', 'all', 'us', 'foreign', 'policy', '.', 'the', 'worst', 'thing', 'is', 'that', 'while', 'dangerous', 'the', 'concept', 'of', 'a', 'policy', 'based', 'on', 'if', 'the', 'ussr', 'and', 'us', 'went', 'to', 'war', 'it', 'could', 'mean', 'the', 'end', 'of', 'the', 'world', 'worked', '.', 'the', 'us', 'and', 'ussr', 'never', 'went', 'to', 'war', '.', 'had', 'we', 'only', 'conventional', 'weapons', 'the', 'notion', 'of', 'yet', 'another', 'war', 'a', 'win', 'able', 'war', 'in', 'europe', 'and', 'asia', 'was', 'not', 'unthinkable', '.', 'not', 'that', 'i', 'think', 'they', 'should', 'get', 'rid', 'of', 'this', 'movie', '.', 'it', 'should', 'be', 'seen', 'by', 'film', 'stude

#### BasicPradoAugmentation on ProcessedDataset

We can now augment our basic preprocessed dataset using the **BasicPradoAugmentation** class. We'll give each sub-augmentation (insertion, deletion, and swap) a **10%** chance of being applied.

In [10]:
basic_prado_augmentation = BasicPradoAugmentation(
    insertion_probability=0.1,
    deletion_probability=0.1,
    swap_probability=0.1,
)
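
As a rough illustration of what these three probabilities could control, the sketch below applies character-level insertion, deletion, and adjacent-character swap to a single token. `augment_token` is a hypothetical function, and the order and exact rules of the three steps are assumptions; the real `BasicPradoAugmentation` may work differently.

```python
import random
import string


def augment_token(token, insertion_probability, deletion_probability,
                  swap_probability, rng):
    """Hypothetical character-level noise; not the prado implementation."""
    chars = list(token)

    # Insertion: add one random lowercase letter at a random position.
    if rng.random() < insertion_probability:
        position = rng.randrange(len(chars) + 1)
        chars.insert(position, rng.choice(string.ascii_lowercase))

    # Deletion: drop one random character, but never empty the token.
    if len(chars) > 1 and rng.random() < deletion_probability:
        del chars[rng.randrange(len(chars))]

    # Swap: exchange two adjacent characters.
    if len(chars) > 1 and rng.random() < swap_probability:
        i = rng.randrange(len(chars) - 1)
        chars[i], chars[i + 1] = chars[i + 1], chars[i]

    return "".join(chars)


# A seeded generator keeps the noise reproducible between runs.
rng = random.Random(0)
tokens = ["this", "film", "reminds", "me", "of", "vietnam"]
print([augment_token(t, 0.1, 0.1, 0.1, rng) for t in tokens])
```

Passing the random generator in explicitly is a design choice of this sketch: it makes the noise reproducible in tests, while training code can pass an unseeded generator.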

However, our **BasicPradoAugmentation** is defined for a single token, so we need a function that applies it elementwise to a whole token list.

In [11]:
def elementwise_augmentation(x):
    # Build a new list so the token lists of the source dataset are not mutated in place.
    return [basic_prado_augmentation(token) for token in x]

Now we can use `elementwise_augmentation` as a transformation function for our new **ProcessedDataset** instance.

In [12]:
basic_augmented_dataset = ProcessedDataset(
    original_dataset=basic_preprocessed_dataset,
    transformation_map={
        0: elementwise_augmentation
    },
    verbosity=1,
)

Transforming dataset: 100%|██████████| 25000/25000 [01:44<00:00, 238.95it/s]


In [13]:
for text, label in basic_augmented_dataset[:3]:
    label_string = "negative" if label == 0 else "positive"
    
    print(f"{text}\n\nLabeled as {label_string}\n\n")

['this', 'film', 'reminds', 'me', 'of', 'how', 'college', 'students', 'used', 'to', 'protest', 'against', 'the', 'ivetnam', 'war', '.', 'as', 'if', 'upon', 'hearing', 'some', 'kxid', 'were', 'doing', 'without', 'cheeseburgers', 'in', 'cow', 'dung', 'collehe', 'thje', 'president', 'was', 'going', 'tyo', 'immediately', 'change', 'all', 'us', 'foreign', 'policy', '.', 'the', 'worst', 'thing', 'is', 'that', 'while', 'dangerous', 'the', 'concept', 'f', 'a', 'policy', 'absed', 'on', 'ikf', 'the', 'ussr', 'nad', 'su', 'went', 'to', 'war', 'it', 'could', 'mean', 'the', 'ed', 'of', 'the', 'world', 'worked', 'j.', 'the', 'uls', 'agnd', 'ussr', 'never', 'went', 'to', 'war', '.', 'had', 'we', 'only', 'convntional', 'weapons', 'th', 'notion', 'of', 'yet', 'another', 'war', 'a', 'win', 'able', 'awr', 'in', 'eeurope', 'avnd', 'asi', 'was', 'not', 'unthinkable', '.', 'not', 'that', 'i', 'think', 'they', 'should', 'get', 'rid', 'tf', 'htis', 'movie', '.', 'it', 'should', 'be', 'seen', 'by', 'iflm', 'st

Note that during training, the augmentation probabilities will be significantly lower than the 10% used here. Lower probabilities also make the augmentation run faster.