# Project (2): Building a Dataset for a BERT News Classifier

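## Project Goal
- Goal: build a dataset, NewsDataset, that BertForSequenceClassification can consume
- news_clustering_train.tsv contains 1,800 news articles, 300 for each of the six classes
- news_clustering_test.tsv contains 600 news articles, 100 for each of the six classes
- The six classes: 體育 (sports), 財經 (finance), 科技 (technology), 旅遊 (travel), 農業 (agriculture), 遊戲 (gaming)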

## Implementation Hints
- STEP1: Extract the titles and classes from news_clustering_train.tsv and news_clustering_test.tsv
- STEP2: Subclass torch.utils.data.Dataset to implement NewsDataset; this requires the BERT tokenizer (see the official documentation for BertForSequenceClassification)
- STEP3: Samples from NewsDataset differ in length, so implement a collate_fn that zero-pads each batch to a common sequence length
- STEP4: Use torch.utils.data.DataLoader to create train_loader and valid_loader
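
The zero padding in STEP3 can be illustrated on toy data. The following is a minimal sketch, assuming two already-tokenized samples whose token ids are made up for illustration (they are not taken from the real vocabulary). It shows `pad_sequence` padding both the ids and the attention mask to the longest sequence in the batch, so that padded positions carry a mask value of 0.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Two hypothetical tokenized titles of different lengths (ids are made up)
input_ids = [torch.tensor([101, 7, 8, 9, 102]), torch.tensor([101, 5, 102])]
# Every real token gets an attention-mask value of 1
attention_mask = [torch.ones_like(ids) for ids in input_ids]

# Pad each list to the longest sequence; padded positions are filled with 0
padded_ids = pad_sequence(input_ids, batch_first=True, padding_value=0)
padded_mask = pad_sequence(attention_mask, batch_first=True, padding_value=0)

print(padded_ids.shape)  # torch.Size([2, 5])
print(padded_mask[1])    # tensor([1, 1, 1, 0, 0])
```

Padding the mask with 0 is what lets the model ignore the padded positions during attention.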

## Key Takeaways: What You Will Learn from This Project
- How to read and process NLP data into a dataset suitable for BertForSequenceClassification
- How BERT's sequence classification task works
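
To make the sequence classification task concrete, the sketch below runs one forward pass through a deliberately tiny, randomly initialized `BertForSequenceClassification`. The configuration sizes and the token ids are assumptions chosen so that nothing has to be downloaded; the project itself would load pretrained weights instead. The point is only the interface: the model takes `input_ids`, `token_type_ids`, `attention_mask` and optional `labels`, and returns a loss plus one logit per class.

```python
import torch
from transformers import BertConfig, BertForSequenceClassification

# Hypothetical tiny configuration (not the real bert-base-chinese sizes)
config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=1,
    num_attention_heads=2,
    intermediate_size=64,
    num_labels=6,  # one output per news class
)
model = BertForSequenceClassification(config)
model.eval()

# A hand-made batch of 2 sequences padded to length 5 (ids are made up)
input_ids = torch.tensor([[1, 7, 8, 9, 2], [1, 5, 2, 0, 0]])
token_type_ids = torch.zeros_like(input_ids)
attention_mask = torch.tensor([[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]])
labels = torch.tensor([2, 4])

with torch.no_grad():
    outputs = model(
        input_ids=input_ids,
        token_type_ids=token_type_ids,
        attention_mask=attention_mask,
        labels=labels,
    )

# When labels are passed, the first two outputs are the loss and the logits
loss, logits = outputs[0], outputs[1]
predicted = logits.argmax(dim=-1)  # predicted class index per sample
print(logits.shape)  # torch.Size([2, 6])
```

Because the weights are random, the predictions here are meaningless; only the shapes and the call signature carry over to the real model.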

In [1]:
import pandas as pd

import torch
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pad_sequence
from transformers import BertTokenizer

In [2]:
df_train = pd.read_csv('news_clustering_train.tsv', sep='\t')
df_test = pd.read_csv('news_clustering_test.tsv', sep='\t')

In [3]:
# Map each news index to its title and to its class name
train_titles = dict(zip(df_train['index'], df_train['title']))
train_classes = dict(zip(df_train['index'], df_train['class']))

valid_titles = dict(zip(df_test['index'], df_test['title']))
valid_classes = dict(zip(df_test['index'], df_test['class']))

In [4]:
ALL_NEWS_CLASSES = ['體育', '財經', '科技', '旅遊', '農業', '遊戲']
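
The position of each name in this list serves as its integer label. As a minimal sketch, the lookup can be made explicit with two tables; `CLASS_TO_ID` and `ID_TO_CLASS` are hypothetical helper names, not part of the original notebook. The second table is handy for turning a predicted index back into a class name.

```python
ALL_NEWS_CLASSES = ['體育', '財經', '科技', '旅遊', '農業', '遊戲']

# Hypothetical helper tables built from the list order
CLASS_TO_ID = {name: i for i, name in enumerate(ALL_NEWS_CLASSES)}
ID_TO_CLASS = dict(enumerate(ALL_NEWS_CLASSES))

print(CLASS_TO_ID['科技'])  # 2
print(ID_TO_CLASS[5])       # 遊戲
```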

In [5]:
MODEL_NAME = 'bert-base-chinese'

In [6]:
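# Build the dataset
class NewsDataset(Dataset):
    """Wraps news titles and class names as samples for BertForSequenceClassification."""

    def __init__(self, tokenizer, titles, classes):
        # titles and classes are dicts keyed by the same news index
        self.tokenizer = tokenizer
        self.indexes = []
        self.texts = []
        self.labels = []
        for index in titles:
            self.indexes.append(index)
            self.texts.append(titles[index])
            self.labels.append(classes[index])

    def __getitem__(self, idx):
        text = self.texts[idx]
        # Yields input_ids, token_type_ids and attention_mask, each of shape (1, seq_len)
        encoding = self.tokenizer(text, return_tensors='pt')
        # Encode the class name as its position in ALL_NEWS_CLASSES
        label = torch.tensor(ALL_NEWS_CLASSES.index(self.labels[idx]))

        return encoding, label

    def __len__(self):
        return len(self.indexes)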

def create_mini_batch(samples):
    input_ids = []
    token_type_ids = []
    attention_mask = []
    labels = []
    for s in samples:
        input_ids.append(s[0]['input_ids'].squeeze(0))
        token_type_ids.append(s[0]['token_type_ids'].squeeze(0))
        attention_mask.append(s[0]['attention_mask'].squeeze(0))
        labels.append(s[1])

    # Zero-pad every tensor in the batch to a common sequence length
    input_ids = pad_sequence(input_ids, batch_first=True, padding_value=0)
    token_type_ids = pad_sequence(token_type_ids, batch_first=True, padding_value=0)
    attention_mask = pad_sequence(attention_mask, batch_first=True, padding_value=0)

    labels = torch.stack(labels)

    return input_ids, token_type_ids, attention_mask, labels

In [7]:
batch_size = 32

tokenizer = BertTokenizer.from_pretrained(MODEL_NAME)

train_dataset = NewsDataset(tokenizer, train_titles, train_classes)
valid_dataset = NewsDataset(tokenizer, valid_titles, valid_classes)

train_loader = DataLoader(
    dataset=train_dataset,
    batch_size=batch_size,
    collate_fn=create_mini_batch,
    shuffle=True
)
valid_loader = DataLoader(
    dataset=valid_dataset,
    batch_size=batch_size,
    collate_fn=create_mini_batch
)
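
A quick way to sanity-check loaders like these is to iterate over them and inspect the batch shapes. The sketch below is an assumption-laden stand-in rather than the notebook's own pipeline: `ToyDataset` and `toy_collate` are hypothetical, and the token ids are made up, so it runs without downloading a tokenizer. It shows that each batch is padded only to the longest sequence it contains, and that the final batch can be smaller than `batch_size`.

```python
import torch
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pad_sequence


class ToyDataset(Dataset):
    """Hypothetical stand-in for NewsDataset: (token ids, label) pairs with made-up ids."""

    def __init__(self):
        self.samples = [
            ([101, 11, 12, 102], 0),
            ([101, 13, 102], 1),
            ([101, 14, 15, 16, 17, 102], 2),
            ([101, 18, 19, 102], 3),
            ([101, 20, 102], 4),
        ]

    def __getitem__(self, idx):
        ids, label = self.samples[idx]
        return torch.tensor(ids), torch.tensor(label)

    def __len__(self):
        return len(self.samples)


def toy_collate(samples):
    # Pad ids to the longest sequence in this batch; the mask marks real tokens with 1
    ids = [s[0] for s in samples]
    input_ids = pad_sequence(ids, batch_first=True, padding_value=0)
    attention_mask = pad_sequence(
        [torch.ones_like(x) for x in ids], batch_first=True, padding_value=0
    )
    labels = torch.stack([s[1] for s in samples])
    return input_ids, attention_mask, labels


toy_loader = DataLoader(ToyDataset(), batch_size=2, collate_fn=toy_collate)
shapes = [tuple(input_ids.shape) for input_ids, _, _ in toy_loader]
print(shapes)  # [(2, 4), (2, 6), (1, 3)]
```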



