### Reuters corpus topic classification

This project is about topic classification on the Reuters corpus. It is a multi-label classification task: each document can have more than one topic associated with it.
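
To make the multi-label setting concrete, here is a minimal sketch of how a decision per topic can be made. It assumes the model outputs one real-valued score per topic; the function names, the 0.5 threshold and the example scores are illustrative assumptions, not part of the project code.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function mapping a real-valued score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def predict_topics(logits, threshold=0.5):
    """Turn one score per topic into a 0/1 decision per topic.

    Every topic is decided independently of the others, so any number
    of topics (including none) can be switched on for one document.
    """
    return [1 if sigmoid(z) >= threshold else 0 for z in logits]


# Hypothetical scores for four topics of a single document.
example_logits = [2.3, -1.1, 0.4, -3.0]
example_prediction = predict_topics(example_logits)
print(example_prediction)
```

This differs from single-label classification, where a softmax over all topics would force exactly one topic per document.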

#### Data

Test data will be extracted from XML documents: the input is taken from `<headline></headline>` and `<text></text>`, and the target classes from `<codes class = 'bip:topics:1.0'><code code = "topic_i"></code></codes>`.
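
The extraction step could look roughly like the sketch below, which uses only the standard library. The sample document is hypothetical and heavily shortened: the `<newsitem>` root, the `<p>` paragraphs, the `<metadata>` wrapper and the extra `bip:countries:1.0` block are assumptions about the file layout, and `extract_document` is an illustrative name.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened document; the real files may be laid out differently.
SAMPLE_XML = """<newsitem>
  <headline>Example headline</headline>
  <text>
    <p>First paragraph of the story.</p>
    <p>Second paragraph of the story.</p>
  </text>
  <metadata>
    <codes class="bip:countries:1.0"><code code="XX"/></codes>
    <codes class="bip:topics:1.0"><code code="M142"/><code code="MCAT"/></codes>
  </metadata>
</newsitem>"""


def extract_document(xml_string):
    """Return (input_text, topic_codes) for one XML document."""
    root = ET.fromstring(xml_string)

    headline = (root.findtext("headline") or "").strip()

    # Collect all text below <text> and normalise the whitespace.
    text_node = root.find("text")
    body = " ".join(" ".join(text_node.itertext()).split()) if text_node is not None else ""

    # Only the <codes> block with the topic class holds the targets.
    topic_codes = [
        code.get("code")
        for code in root.findall(".//codes[@class='bip:topics:1.0']/code")
    ]
    return (headline + " " + body).strip(), topic_codes


sample_text, sample_topics = extract_document(SAMPLE_XML)
print(sample_text)
print(sample_topics)
```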

input: 'document text string, each row a document'
target: ['topic_1', '...', 'topic_n'] = [0, ...., 1, 0]
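
The target format above is a multi-hot vector with one position per topic code. A minimal sketch of the encoding, using the first five codes from the topic list as a stand-in for the full list of 126 (the function names are illustrative):

```python
def encode_topics(doc_topics, all_codes):
    """Build a multi-hot vector: 1 where the document has the topic, else 0.

    Raises KeyError if a document carries a code missing from all_codes.
    """
    position = {code: i for i, code in enumerate(all_codes)}
    vector = [0] * len(all_codes)
    for code in doc_topics:
        vector[position[code]] = 1
    return vector


def decode_topics(vector, all_codes):
    """Inverse of encode_topics: list the codes whose position is set."""
    return [code for code, flag in zip(all_codes, vector) if flag == 1]


# Stand-in for the full code list; the real one has 126 entries.
example_codes = ["1POL", "2ECO", "3SPO", "4GEN", "6INS"]
example_target = encode_topics(["2ECO", "6INS"], example_codes)
print(example_target)
print(decode_topics(example_target, example_codes))
```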


There are 126 topics, listed in `topic_codes.txt`.


In [66]:
import pandas as pd
import numpy as np


In [14]:
topics = pd.read_csv('mock-data/topic_codes.txt', delimiter='\t')
topics

Unnamed: 0,CODE,DESCRIPTION
0,1POL,CURRENT NEWS - POLITICS
1,2ECO,CURRENT NEWS - ECONOMICS
2,3SPO,CURRENT NEWS - SPORT
3,4GEN,CURRENT NEWS - GENERAL
4,6INS,CURRENT NEWS - INSURANCE
...,...,...
121,M142,METALS TRADING
122,M143,ENERGY MARKETS
123,MCAT,MARKETS
124,MEUR,EURO CURRENCY


In [15]:
codes = topics['CODE']
codes

0       1POL
1       2ECO
2       3SPO
3       4GEN
4       6INS
       ...  
121     M142
122     M143
123     MCAT
124     MEUR
125    PRB13
Name: CODE, Length: 126, dtype: object

I will first manually create a small mock dataset to start model development.

In [56]:
mini_texts = pd.read_csv('mock-data/text_snippets.csv', delimiter=';')
mini_texts

Unnamed: 0,text,target
0,REUTER EC REPORT LONG-TERM DIARY FOR APR 7 - D...,"[1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
1,Brascade Resources Inc Q1 shr falls,"[0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
2,U.S. corporate bonds - new issue pricings,"[0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, ..."
3,CBES Bancorp Inc Q3 March 31 net rises,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
4,H&R Block chief financial officer resigns,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
5,General Kinetics consents to delisting,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."


In [59]:
#empty_topic_vec = [0] * 126
#mock_documents = list(mini_texts.text.values)
#mock_labels = list(mini_texts.target.values)

### Loading a mini training set

The actual data is now available in a CSV file.

I will use it to build mini training, validation and test sets for model development, each with 10 000 rows.

Because the `target` and `codes` columns are stored as strings rather than lists, I will also convert them back to lists.

In [68]:
trunc_large_data = pd.read_csv('reuters-csv/inputs_trunc.csv', delimiter=';')

In [98]:
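import ast

# parse the stringified lists; literal_eval only accepts Python literals, unlike eval
trunc_large_data['target'] = trunc_large_data['target'].apply(ast.literal_eval)
trunc_large_data['codes'] = trunc_large_data['codes'].apply(ast.literal_eval)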
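
As a side note on the string-to-list conversion, here is a small standalone sketch using `ast.literal_eval` from the standard library. It parses Python literals only and refuses anything executable. The helper name `parse_list_cell` and the example cell values are assumptions for illustration.

```python
import ast


def parse_list_cell(cell):
    """Parse a CSV cell such as "[0, 1, 0]" back into a Python list."""
    value = ast.literal_eval(cell)
    if not isinstance(value, list):
        raise ValueError(f"expected a list literal, got {type(value).__name__}")
    return value


print(parse_list_cell("[0, 1, 0]"))
print(parse_list_cell("['M142', 'MCAT']"))

# Anything that is not a plain literal is rejected instead of being executed.
try:
    parse_list_cell("__import__('os').getcwd()")
except ValueError:
    print("rejected non-literal input")
```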

In [99]:
trunc_large_data_shuf = trunc_large_data.sample(frac=1, random_state=42) # shuffle all data rows
mini_train = trunc_large_data_shuf[:10000]
mini_dev = trunc_large_data_shuf[10000:20000]
mini_test = trunc_large_data_shuf[20000:30000]
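
One thing to keep in mind with plain slicing is that it does not complain if the data has fewer rows than the slices ask for; the later splits simply come out short. The sketch below shows the same shuffle-then-slice idea on plain row indices with an explicit size check. It is an illustration only: `split_indices` is a hypothetical helper and the small sizes are chosen so it runs without the data file.

```python
import random


def split_indices(n_rows, split_size, n_splits=3, seed=42):
    """Shuffle row indices once, then cut them into equally sized, disjoint splits."""
    if n_splits * split_size > n_rows:
        raise ValueError("not enough rows for the requested splits")
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the splits reproducible
    return [
        indices[i * split_size:(i + 1) * split_size]
        for i in range(n_splits)
    ]


train_idx, dev_idx, test_idx = split_indices(n_rows=100, split_size=10)
print(len(train_idx), len(dev_idx), len(test_idx))
```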

### Trying out transformers and BERT

Next I will try out the approach presented in a blog post: [Transformers for Multi-Label Classification made simple.](https://towardsdatascience.com/transformers-for-multilabel-classification-71a1a0daf5e1)

I will also use some code from the home exercises of the Deep Learning course.

In [77]:
import torch
import torch.nn as nn
from transformers import *

device = 'cuda' if torch.cuda.is_available() else 'cpu'

In [106]:
# mini sets to lists

train_mini_documents = list(mini_train.text.values)
train_mini_labels = list(mini_train.target.values)
dev_mini_documents = list(mini_dev.text.values)
dev_mini_labels = list(mini_dev.target.values)

In [86]:
doc_max_length = 256 # using the truncated mini train set sequence length
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True) 

# train data encodings

train_encodings = tokenizer.batch_encode_plus(train_mini_documents, max_length=doc_max_length, padding='max_length', truncation=True) # pad and truncate every document to doc_max_length
train_input_ids = train_encodings['input_ids'] # tokenized and encoded sentences
train_token_type_ids = train_encodings['token_type_ids'] # token type ids
train_attention_masks = train_encodings['attention_mask'] # attention masks


In [84]:
# validation data encodings

dev_encodings = tokenizer.batch_encode_plus(dev_mini_documents, max_length=doc_max_length, padding='max_length', truncation=True) # pad and truncate every document to doc_max_length
dev_input_ids = dev_encodings['input_ids'] # tokenized and encoded sentences
dev_token_type_ids = dev_encodings['token_type_ids'] # token type ids
dev_attention_masks = dev_encodings['attention_mask'] # attention masks
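
To illustrate what the fixed-length encoding and the attention mask mean, here is a toy sketch in plain Python. It is not the tokenizer's implementation: it ignores special-token handling (the real tokenizer keeps the closing `[SEP]` when truncating), the token ids are made up, and `pad_id=0` is an assumption about the padding id.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Bring one token-id sequence to exactly max_length.

    Returns (input_ids, attention_mask); the mask is 1 for real tokens
    and 0 for padding, so the model can ignore the padded positions.
    """
    kept = token_ids[:max_length]          # cut off anything beyond max_length
    n_pad = max_length - len(kept)         # how many pad tokens are needed
    input_ids = kept + [pad_id] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad
    return input_ids, attention_mask


# Hypothetical token ids for a short document.
short_doc = [101, 7592, 2088, 102]
print(pad_or_truncate(short_doc, max_length=6))
print(pad_or_truncate(short_doc, max_length=3))
```

With `max_length=256`, every document therefore yields 256 input ids and 256 mask values, regardless of its original length.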

In [107]:
# change to tensors

mini_train_inputs = torch.tensor(train_input_ids)
mini_train_labels = torch.tensor(train_mini_labels)
mini_train_masks = torch.tensor(train_attention_masks)
mini_train_token_types = torch.tensor(train_token_type_ids)

mini_dev_inputs = torch.tensor(dev_input_ids)
mini_dev_labels = torch.tensor(dev_mini_labels)
mini_dev_masks = torch.tensor(dev_attention_masks)
mini_dev_token_types = torch.tensor(dev_token_type_ids)

In [110]:
print(mini_train_inputs.shape)
print(mini_train_labels.shape)
print(mini_train_masks.shape)
print(mini_train_token_types.shape)

torch.Size([10000, 256])
torch.Size([10000, 126])
torch.Size([10000, 256])
torch.Size([10000, 256])
