# R8 and R52 text datasets
nltk only ships the full Reuters corpus ([docs](https://www.nltk.org/book/ch02.html)), so I built the R8 and R52 subsets by hand, following the explanation [here](https://ana.cachopo.org/datasets-for-single-label-text-categorization).
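
The filtering described on the linked page can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in `datasets.reuters_text` (which is not shown here): the function name `single_label_subset`, its parameters, and the toy data are hypothetical. The assumed rule is to keep documents with exactly one topic, then keep only the classes that still have at least one training and one test document.

```python
def single_label_subset(doc_categories, candidate_classes=None):
    """Return (train, test) dicts of {fileid: label}.

    doc_categories: {fileid: [category, ...]}; fileids are assumed to start
        with 'training/' or 'test/', as in nltk's reuters corpus.
    candidate_classes: optional set of classes to restrict to
        (hypothetical parameter, e.g. the most frequent classes for R8).
    """
    train, test = {}, {}
    for fileid, cats in doc_categories.items():
        # Keep only documents with exactly one topic.
        if len(cats) != 1:
            continue
        label = cats[0]
        if candidate_classes is not None and label not in candidate_classes:
            continue
        target = train if fileid.startswith("training/") else test
        target[fileid] = label

    # Keep only classes that still appear in both splits.
    kept = set(train.values()) & set(test.values())
    train = {f: c for f, c in train.items() if c in kept}
    test = {f: c for f, c in test.items() if c in kept}
    return train, test


# Toy data, made up for illustration (not real Reuters fileids).
toy = {
    "training/1": ["earn"],
    "training/2": ["grain", "wheat"],  # multi-label: dropped
    "training/3": ["acq"],
    "training/4": ["ship"],            # 'ship' has no test document: dropped
    "test/5": ["earn"],
    "test/6": ["acq"],
    "test/7": [],                      # unlabeled: dropped
}
toy_train, toy_test = single_label_subset(toy)
print(sorted(toy_train.items()))
print(sorted(toy_test.items()))
```

With nltk, the input mapping could presumably be built as `{f: reuters.categories(f) for f in reuters.fileids()}`.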

The number of documents in these datasets **does not match** the numbers reported in the Text GCN paper: I got slightly more documents. This might be because the nltk version contains some "fixed" documents that originally had multiple or no classes (see the explanation linked above). I manually checked some of the classes whose counts differ and they looked fine (every document had a single class). At least the classes themselves match.
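
A per-class check like the manual one described above could be sketched as below. The helper `class_count_diff` and the toy numbers are hypothetical; real reference counts would have to be taken from the linked page or the paper and are not reproduced here.

```python
from collections import Counter


def class_count_diff(labels, reference_counts):
    """Return {class: observed - reference} for classes whose counts differ."""
    counts = Counter(labels)
    classes = set(counts) | set(reference_counts)
    return {
        c: counts.get(c, 0) - reference_counts.get(c, 0)
        for c in sorted(classes)
        if counts.get(c, 0) != reference_counts.get(c, 0)
    }


# Made-up labels and reference counts, for illustration only.
toy_labels = ["earn"] * 5 + ["acq"] * 3 + ["ship"]
toy_reference = {"earn": 5, "acq": 2, "ship": 1}
diff = class_count_diff(toy_labels, toy_reference)
print(diff)  # {'acq': 1}
```

For the datasets in this notebook, the labels could probably be collected with something like `[ex.label for ex in r8_train]`, assuming each example exposes a `label` attribute.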

Also note that stop-words are already removed (unlike in Text GCN).

Since our goal is not to reproduce the paper, I accepted these differences.

Turning this into a graph dataset is still a TODO.

In [1]:
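# Note: Field and BucketIterator live in torchtext.data only in torchtext < 0.9;
# they moved to torchtext.legacy.data in 0.9 and were removed in 0.12.
from torchtext.data import BucketIterator, Field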
from torchtext.vocab import GloVe

from datasets.reuters_text import R8, R52

[nltk_data] Downloading package reuters to /home/matyi/nltk_data...
[nltk_data]   Package reuters is already up-to-date!
[nltk_data] Downloading package punkt to /home/matyi/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [2]:
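# Document id and class label are single (non-sequential) values
ID = Field(sequential=False, include_lengths=False)
# Text is tokenized and lowercased; batches come as (padded ids, lengths), batch dimension first
TEXT = Field(sequential=True, lower=True, include_lengths=True, batch_first=True)
LABEL = Field(sequential=False, include_lengths=False)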



In [3]:
r52_train, r52_test = R52.splits(ID, TEXT, LABEL)
r8_train,  r8_test  = R8.splits(ID, TEXT, LABEL)



In [4]:
print('R8')
print('train size:', len(r8_train), ' instead of 5485')
print('test size:', len(r8_test), ' instead of 2189')

print('R52')
print('train size:', len(r52_train), ' instead of 6532')
print('test size:', len(r52_test), ' instead of 2568')

R8
train size: 5501  instead of 5485
test size: 2190  instead of 2189
R52
train size: 6560  instead of 6532
test size: 2570  instead of 2568


In [5]:
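# Build all vocabularies from the R52 training split only
ID.build_vocab(r52_train)
# Attach pretrained GloVe 840B 300-d embeddings, loading only the first 10,000 vectors
TEXT.build_vocab(r52_train, vectors=GloVe(name='840B', dim=300, max_vectors=10000))
LABEL.build_vocab(r52_train)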

In [6]:
r52_train_iter, r52_test_iter = BucketIterator.splits((r52_train, r52_test), batch_size=4)



In [7]:
# Import the raw nltk corpus only to cross-check labels against the original fileids
from nltk.corpus import reuters

for x in r52_train_iter:
    print(x)
    print(x.id)
    print(x.label)
    print('These two labels should be the same: {} == {}'.format(
        reuters.categories(ID.vocab.itos[x.id[0]])[0],
        LABEL.vocab.itos[x.label[0]]))

    print(x.text)  # tuple of (padded token ids, lengths); the padding index is 1
    break


[torchtext.data.batch.Batch of size 4]
	[.id]:[torch.LongTensor of size 4]
	[.text]:('[torch.LongTensor of size 4x115]', '[torch.LongTensor of size 4]')
	[.label]:[torch.LongTensor of size 4]
tensor([5630, 5824,  981, 4739])
tensor([ 2, 21,  2,  1])
These two labels should be the same: acq == acq
(tensor([[ 3007, 19260,    20,    22,    19, 13461,    26,     6,   226,   571,
           267,     6,  7674,  3007,   256,    46,     9,    29, 10104,    58,
           282,   215,     6,   226,    29, 14739,  4388,  1444,   195,     7,
          1206,     6, 11814,     3,  1960,    14,   259,  7674,    58,     3,
            16,   814,   324,     2,  1200,     5,     4,   362,   504,    32,
           467,     6,   427,    35,     4,  1737,   129,     3,    17,     9,
             2,     1,     1,     1,     1,     1,     1,     1,     1,     1,
             1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
             1,     1,     1,     1,     1,     1,     1,     1,    

