# R8 and R52 text datasets
nltk only ships the whole Reuters corpus ([docs](https://www.nltk.org/book/ch02.html)), so I built the R8 and R52 subsets by hand, following the explanation [here](https://ana.cachopo.org/datasets-for-single-label-text-categorization).
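
As a rough illustration of the filtering idea, here is a minimal sketch that keeps only documents with exactly one class, optionally restricted to the most frequent classes. It is a simplification and not necessarily what `datasets.reuters_text` does: the helper name `single_label_subset` and the `top_k` parameter are made up for this example, class frequency is counted over all documents, and the linked explanation additionally requires each class to keep at least one train and one test document.

```python
from collections import Counter


def single_label_subset(doc_categories, top_k=None):
    """Map each single-class document to its class (hypothetical helper).

    doc_categories: dict of document id -> list of category names, e.g.
        {fid: reuters.categories(fid) for fid in reuters.fileids()}
    top_k: if given, keep only the top_k most frequent classes.
    """
    # Count how often each class occurs over all documents.
    counts = Counter(c for cats in doc_categories.values() for c in cats)
    if top_k is None:
        allowed = set(counts)
    else:
        allowed = {c for c, _ in counts.most_common(top_k)}

    # Drop documents with zero or several classes, then apply the class filter.
    return {
        doc_id: cats[0]
        for doc_id, cats in doc_categories.items()
        if len(cats) == 1 and cats[0] in allowed
    }


toy = {
    "a": ["earn"],
    "b": ["earn", "acq"],  # multi-label, dropped
    "c": ["acq"],
    "d": [],               # unlabeled, dropped
    "e": ["rare"],         # dropped when top_k=2
}
print(single_label_subset(toy, top_k=2))
```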

The number of docs in the datasets **does not match** the numbers in the Text GCN paper: I got slightly more docs. This might be because the nltk version contains some "fixed" docs which incorrectly had multiple or no classes (see the explanation linked above). I manually checked some of the classes whose counts differ and they seemed fine (all docs had a single class). At least the classes match!

Note that the train sizes printed below exclude the 10% validation split, so the number to compare with the paper's train size is train + val (5501 for R8 and 6560 for R52).

Also note that stop-words are already removed (unlike in Text GCN).

Since our goal is not to reproduce the paper, I just accepted these differences.

Making a graph dataset out of this is still a TODO.

In [1]:
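# Legacy torchtext API: lives in torchtext.data up to 0.8 and in
# torchtext.legacy.data for 0.9-0.11; it was removed in 0.12.
from torchtext.data import BucketIterator, Field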
from torchtext.vocab import GloVe

from datasets.reuters_text import R8, R52

[nltk_data] Downloading package reuters to /home/mat/nltk_data...
[nltk_data]   Package reuters is already up-to-date!
[nltk_data] Downloading package punkt to /home/mat/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [2]:
ID = Field(sequential=False, include_lengths=False)
TEXT = Field(sequential=True, lower=True, include_lengths=True, batch_first=True)
LABEL = Field(sequential=False, include_lengths=False)



In [3]:
r52_train, r52_test, r52_val = R52.splits(ID, TEXT, LABEL, val_size=0.1)
r8_train, r8_test, r8_val = R8.splits(ID, TEXT, LABEL, val_size=0.1)



In [4]:
print('R8')
print('train size:', len(r8_train), ' instead of 5485')
print('test size:', len(r8_test), ' instead of 2189')
print('val size:', len(r8_val))

print('R52')
print('train size:', len(r52_train), ' instead of 6532')
print('test size:', len(r52_test), ' instead of 2568')
print('val size:', len(r52_val))

R8
train size: 4951  instead of 5485
test size: 2190  instead of 2189
val size: 550
R52
train size: 5904  instead of 6532
test size: 2570  instead of 2568
val size: 656


In [5]:
ID.build_vocab(r52_train)
TEXT.build_vocab(r52_train, vectors=GloVe(name='840B', dim=300, max_vectors=10000))
LABEL.build_vocab(r52_train)

In [6]:
r52_train_iter, r52_test_iter, r52_val_iter = BucketIterator.splits(
    (r52_train, r52_test, r52_val), 
    batch_size=4,
    sort=False
)



In [7]:
# Import the raw nltk corpus to cross-check that doc IDs and labels line up
from nltk.corpus import reuters

for x in r52_train_iter:
    print(x)
    print(x.id)
    print(x.label)
    print(reuters.categories(ID.vocab.itos[x.id[0]]))
    print(LABEL.vocab.itos[x.label[0]])
    print('These two labels should be the same: {} == {}'.format(
        reuters.categories(ID.vocab.itos[x.id[0]])[0],
        LABEL.vocab.itos[x.label[0]]))

    print(x.text) # Padding is 1
    break

for x in r52_test_iter:
    print(x)
    print(x.id)
    print(x.label)
    print(x.text) # Padding is 1
    break
    
for x in r52_val_iter:
    print(x)
    print(x.id)
    print(x.label)
    print(x.text) # Padding is 1
    break


[torchtext.data.batch.Batch of size 4]
	[.id]:[torch.LongTensor of size 4]
	[.text]:('[torch.LongTensor of size 4x489]', '[torch.LongTensor of size 4]')
	[.label]:[torch.LongTensor of size 4]
tensor([1666, 2736, 4494, 1235])
tensor([ 3,  2, 25,  2])
['crude']
crude
These two labels should be the same: crude == crude
(tensor([[2396,  380, 4430,  ...,    1,    1,    1],
        [ 949,   20,   21,  ..., 1286,  928,    2],
        [ 235,   99,  793,  ...,    1,    1,    1],
        [8752, 5676,   10,  ...,    1,    1,    1]]), tensor([111, 489,  40,  22]))

[torchtext.data.batch.Batch of size 4]
	[.id]:[torch.LongTensor of size 4]
	[.text]:('[torch.LongTensor of size 4x899]', '[torch.LongTensor of size 4]')
	[.label]:[torch.LongTensor of size 4]
tensor([0, 0, 0, 0])
tensor([4, 4, 4, 4])
(tensor([[1989, 1201, 3241,  ...,    4, 1065,    2],
        [ 130,  629,  891,  ...,    1,    1,    1],
        [ 239,  187, 1475,  ...,    1,    1,    1],
        [ 130,  629,  891,  ...,    1,    1,    
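
Because `TEXT` was created with `include_lengths=True`, `x.text` is a `(tokens, lengths)` tuple, and the padding index is 1 as noted in the cell above. A minimal sketch of turning the lengths into a boolean padding mask follows. The helper name `padding_mask` and the toy tensors are made up for illustration; only standard `torch` calls are used.

```python
import torch

PAD_IDX = 1  # padding index noted in the cell above ("Padding is 1")


def padding_mask(lengths, max_len):
    """Return a (batch, max_len) bool mask that is True at real tokens."""
    positions = torch.arange(max_len).unsqueeze(0)  # shape (1, max_len)
    return positions < lengths.unsqueeze(1)         # shape (batch, max_len)


# Toy batch shaped like x.text: padded token ids plus true lengths.
tokens = torch.tensor([[5, 7, 9, 1, 1],
                       [4, 6, 1, 1, 1]])
lengths = torch.tensor([3, 2])

mask = padding_mask(lengths, tokens.size(1))
# The mask agrees with the padding positions, since index 1 is reserved for padding.
assert torch.equal(mask, tokens != PAD_IDX)
```

Such a mask is what attention layers or masked mean-pooling typically need so that padded positions do not contribute to the result.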

