# Natural language processing: project - Dataset Demo

In [1]:
import os

In [2]:
CD_KEY = "--IN_ROOT"

In [3]:
if CD_KEY not in os.environ:
    os.environ[CD_KEY] = "false"

In [4]:
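# Change to the project root only once; on re-runs just report the current directory.
# os.environ values are always strings, so an empty or "false" value means "not moved yet".
if os.environ.get(CD_KEY, "false") in ("", "false"):
    %cd ..
else:
    print(os.getcwd())

os.environ[CD_KEY] = "true"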

## Importing modules

In [5]:
from src.modelling.datasets import (
    IMDBDataset,
    ProcessedIMDBDataset,
    MinHashIMDBDataset,
)

## Demonstration

### IMDBDataset

This class serves as the base IMDB dataset. It loads the data from the TSV file and exposes it without further modification, which is generally not useful as model input.
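To make the idea concrete, here is a minimal sketch of a TSV-backed dataset that returns `(text, label)` pairs and supports slicing. It is not the project's `IMDBDataset`: the class name `TsvTextDataset`, the column order (text first, integer label second) and the absence of a header row are all assumptions made for illustration.

```python
import csv
import io


class TsvTextDataset:
    """Hypothetical stand-in that holds (text, label) pairs read from a TSV stream."""

    def __init__(self, handle):
        # QUOTE_NONE keeps quote characters inside reviews untouched.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        # Assumed column order: text first, integer label second, no header row.
        self.samples = [(row[0], int(row[1])) for row in reader if row]

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, index):
        # Lists accept both integers and slices, so dataset[:3] works too.
        return self.samples[index]


# An in-memory file stands in for the real train.tsv.
demo_tsv = io.StringIO("A fine film .\t1\nA dull film .\t0\n")
demo_dataset = TsvTextDataset(demo_tsv)

for text, label in demo_dataset[:2]:
    print(f"{text} -> {'negative' if label == 0 else 'positive'}")
```

Delegating `__getitem__` to a plain list is the simplest way to get slice support of the kind used in the cells below.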

In [6]:
imdb_dataset = IMDBDataset(
    path="data/processed/ready-to-use/imdb/train.tsv",
    verbosity=1,
)

Reading dataset (data/processed/ready-to-use/imdb/train.tsv): 25000it [00:00, 43718.45it/s]


In [7]:
for text, label in imdb_dataset[:3]:
    print(f"{text}\n\nLabeled as {'negative' if label == 0 else 'positive'}\n\n")

This film reminds me of how college students used to protest against the Vietnam War . As if upon hearing some kids were doing without cheeseburgers in Cow Dung Collehe the President was going to immediately change all US foreign policy . The worst thing is that while dangerous the concept of a policy based on if the USSR and US went to war it could mean the end of the world WORKED . The US and USSR NEVER WENT TO WAR . Had we only conventional weapons the notion of yet another war a win able war in Europe and Asia was not unthinkable . Not that I think they should get rid of this movie . It should be seen by film students as a splendid example of How NOT to make a film . It should be 0 stars or maybe black holes ...

Labeled as negative


I love Memoirs of a Geisha so I read the book twice it is one of the best book I 've read last year . I was looking forward to the movie and was afraid that reading the book would ruin the viewing pleasure of the movie . I was n't expecting the movie 

### ProcessedIMDBDataset

This class is a bit more useful: it wraps an existing IMDBDataset object and applies a preprocessing function to each text. By default, the text is tokenized on whitespace, which should yield the tokenization we want before hashing.
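The default behaviour described here can be approximated in a few lines. The sketch below assumes that whitespace tokenization amounts to `str.split()` and that the preprocessing function is applied to the text only, leaving the label untouched; the names `whitespace_tokenize` and `preprocess` are hypothetical and not part of the project.

```python
def whitespace_tokenize(text):
    # With no arguments, str.split() splits on any run of whitespace
    # and drops leading and trailing whitespace.
    return text.split()


def preprocess(samples, preprocessing_function=whitespace_tokenize):
    # Apply the function to each text and keep the label as is.
    return [(preprocessing_function(text), label) for text, label in samples]


processed = preprocess([("This film reminds me of how college students", 0)])
print(processed[0][0])
```

Because the source texts already separate punctuation with spaces, a plain whitespace split also isolates tokens such as `.`.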

In [8]:
processed_imdb_dataset = ProcessedIMDBDataset(
    imdb_dataset=imdb_dataset,
    verbosity=1,
)

Preprocessing IMDB dataset: 100%|██████████| 25000/25000 [00:02<00:00, 10073.47it/s]


In [9]:
for text, label in processed_imdb_dataset[:3]:
    print(f"{text}\n\nLabeled as {'negative' if label == 0 else 'positive'}\n\n")

['This', 'film', 'reminds', 'me', 'of', 'how', 'college', 'students', 'used', 'to', 'protest', 'against', 'the', 'Vietnam', 'War', '.', 'As', 'if', 'upon', 'hearing', 'some', 'kids', 'were', 'doing', 'without', 'cheeseburgers', 'in', 'Cow', 'Dung', 'Collehe', 'the', 'President', 'was', 'going', 'to', 'immediately', 'change', 'all', 'US', 'foreign', 'policy', '.', 'The', 'worst', 'thing', 'is', 'that', 'while', 'dangerous', 'the', 'concept', 'of', 'a', 'policy', 'based', 'on', 'if', 'the', 'USSR', 'and', 'US', 'went', 'to', 'war', 'it', 'could', 'mean', 'the', 'end', 'of', 'the', 'world', 'WORKED', '.', 'The', 'US', 'and', 'USSR', 'NEVER', 'WENT', 'TO', 'WAR', '.', 'Had', 'we', 'only', 'conventional', 'weapons', 'the', 'notion', 'of', 'yet', 'another', 'war', 'a', 'win', 'able', 'war', 'in', 'Europe', 'and', 'Asia', 'was', 'not', 'unthinkable', '.', 'Not', 'that', 'I', 'think', 'they', 'should', 'get', 'rid', 'of', 'this', 'movie', '.', 'It', 'should', 'be', 'seen', 'by', 'film', 'stude

### MinHashIMDBDataset

This will likely be the primary dataset class we use. It builds on **ProcessedIMDBDataset**, applying MinHash to each token and returning a bitarray instead of the token string.
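As a rough illustration of what hashing a single token can look like, the sketch below computes a MinHash signature over the token's character trigrams using only the standard library. Everything in it is an assumption rather than the project's implementation: the trigram shingling, the seeded BLAKE2b hash standing in for a permutation, the helper names, and the use of a `'0'`/`'1'` string in place of a real `bitarray`.

```python
import hashlib


def shingles(token, n=3):
    """Character n-grams of a token; tokens of length <= n yield themselves."""
    if len(token) <= n:
        return {token}
    return {token[i:i + n] for i in range(len(token) - n + 1)}


def hash32(text, seed):
    """Seeded 32-bit hash; each seed plays the role of one permutation."""
    digest = hashlib.blake2b(
        text.encode("utf-8"),
        digest_size=4,
        salt=seed.to_bytes(4, "big"),
    ).digest()
    return int.from_bytes(digest, "big")


def minhash_bits(token, n_permutations=2):
    """Concatenate the minimum 32-bit hash per permutation as a bit string."""
    parts = []
    for seed in range(n_permutations):
        minimum = min(hash32(gram, seed) for gram in shingles(token))
        parts.append(format(minimum, "032b"))  # zero-padded to 32 bits
    return "".join(parts)


signature = minhash_bits("cheeseburgers", n_permutations=2)
print(signature)
print(len(signature))  # 64
```

Tokens that share many trigrams tend to agree on more of their per-permutation minima, which is the property that makes this kind of signature useful as a fixed-size token representation.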

We give it an additional parameter, `n_permutations`, which sets how many hash function permutations the MinHash algorithm uses. In layman's terms, the bigger the `n_permutations` parameter, the bigger the output. Each additional permutation increases the output size by **4 bytes**, or in other words **32 bits**. In PRADO terms, each permutation increases the **B** parameter by **16**, since our hashing function provides **2B** bits. With `n_permutations=2`, for example, each token becomes a 64-bit bitarray, which corresponds to **B = 32**.
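The size relationship can be written down as plain arithmetic. The helper names below are hypothetical, and the sketch reads the **2B** output size as a bit count, which is the reading under which 32 bits per permutation corresponds to an increase of 16 in **B**.

```python
BITS_PER_PERMUTATION = 32  # 4 bytes per permutation, as stated above


def signature_bits(n_permutations):
    # Total length of one token's hash output, in bits.
    return BITS_PER_PERMUTATION * n_permutations


def prado_b(n_permutations):
    # The output holds 2B bits, so B is half the bit count.
    return signature_bits(n_permutations) // 2


for n in (1, 2, 4):
    print(f"n_permutations={n}: {signature_bits(n)} bits, B={prado_b(n)}")
```

For `n_permutations=2` this gives 64 bits per token, which matches the length of the bitarrays printed further down.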

In [10]:
minhash_imdb_dataset = MinHashIMDBDataset(
    imdb_dataset=imdb_dataset,
    n_permutations=2,
    verbosity=1,
)

Preprocessing IMDB dataset: 100%|██████████| 25000/25000 [00:02<00:00, 9113.51it/s] 
Hashing IMDB dataset: 100%|██████████| 25000/25000 [03:20<00:00, 124.39it/s]


In [11]:
for text, label in minhash_imdb_dataset[:1]:
    for token in text:
        print(token)

    print(f"\nLabeled as {'negative' if label == 0 else 'positive'}\n\n")

bitarray('0110000010000100100010100001110111100101101011001001111110100000')
bitarray('0111000110100110111110101101111110010111111010011010011100111111')
bitarray('0111111001011101100011101101101111100000110111010001000100111111')
bitarray('0111011010100001110110101110000000001100000011000011011101000000')
bitarray('1111100101111100000100111101110100011110000110100001101010111100')
bitarray('0110100111101111010101010000100110100001101110000111001000101100')
bitarray('1000000110110001101111000101111111001101001000001100011010000010')
bitarray('1001000111011101001100000001001000110000110010100110100010100100')
bitarray('1001000010101010110011011001100001001110101000010110010111100000')
bitarray('1110111001100111011000100100100111100101100010000010101110001100')
bitarray('0001000110100100000010010101001001000111000101101011110000000111')
bitarray('1101111101100001101110100001010011101111011100110110111110100001')
bitarray('0001110111011100111000010010010010000110111000101100000000100000')