# Toutiao News Classification: NeZha with a Classification Head, Focal Loss, and FGM/PGD

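## Adversarial Training Methods
### Fast Gradient Method (FGM)
For each input x:
1. Run the forward pass on x to compute the loss, then backpropagate to obtain the gradients.
2. Compute the perturbation r from the gradient of the embedding matrix and add it to the current embedding, which is equivalent to feeding x + r.
3. Run the forward pass on x + r to compute the loss, then backpropagate; the resulting gradients accumulate onto those from step 1.
4. Restore the embedding to its value in step 1.
5. Update the parameters using the accumulated gradients from step 3.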
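
The perturbation in step 2 is commonly taken as r = eps * g / ||g||_2, where g is the gradient with respect to the embedding. The following is a minimal pure-Python sketch of the attack/restore cycle on a toy embedding. `ToyFGM` and the list-based embedding are illustrative stand-ins; they are not the `FGM` class from `extra_fgm`, whose source is not shown in this notebook.

```python
import math


class ToyFGM:
    """Illustrative FGM on a flat list of embedding values (hypothetical helper)."""

    def __init__(self, embedding):
        self.embedding = embedding  # mutated in place, like model weights
        self.backup = None

    def attack(self, grad, epsilon=0.1):
        # Save the clean embedding so it can be restored after the adversarial pass.
        self.backup = list(self.embedding)
        norm = math.sqrt(sum(g * g for g in grad))
        if norm == 0 or math.isnan(norm):
            return  # no usable gradient direction; leave the embedding untouched
        for i, g in enumerate(grad):
            # r = epsilon * g / ||g||_2, added to the current embedding
            self.embedding[i] += epsilon * g / norm

    def restore(self):
        # Put the clean embedding back before the optimizer step.
        self.embedding[:] = self.backup
        self.backup = None


embedding = [0.5, -1.0, 2.0]
grad = [3.0, 0.0, 4.0]          # toy gradient with L2 norm 5
fgm = ToyFGM(embedding)
fgm.attack(grad, epsilon=0.1)
perturbed = list(embedding)     # approximately [0.56, -1.0, 2.08]
fgm.restore()                   # embedding is back to [0.5, -1.0, 2.0]
```

In real training the adversarial forward and backward pass happens between `attack` and `restore`, so that the extra gradients are computed at x + r while the optimizer step is applied to the clean embedding.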

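### Projected Gradient Descent (PGD)
FGM computes the adversarial perturbation in a single step, so the perturbation it finds is not necessarily optimal. PGD improves on this by iterating for K steps, taking a small step each time and gradually approaching a better perturbation.

For each input x:
1. Run the forward pass on x to compute the loss, then backpropagate to obtain the gradients.
2. For each step t of the K steps (steps 2 to 4 repeat per step): compute r from the gradient of the embedding matrix and add it to the current embedding, which is equivalent to feeding x + r.
3. If t is not the last step, zero the gradients, then run the forward and backward passes on the x + r from step 2 to obtain new gradients.
4. If t is the last step, restore the gradients from step 1, then run the forward and backward passes on the final x + r so that its gradients accumulate onto those from step 1.
5. Restore the embedding to its value in step 1.
6. Update the parameters using the gradients from step 4.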
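
The "projected" part of PGD means that after every small step of size alpha, the accumulated perturbation is pulled back into a ball of radius eps around the original embedding. Below is a minimal pure-Python sketch of that inner loop. The function names, the list-based embedding, and the toy loss are assumptions made for illustration; the `PGD` class in `extra_pgd` is not shown in this notebook and may differ in detail.

```python
import math


def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))


def project(original, current, epsilon):
    """Pull `current` back into the L2 ball of radius epsilon around `original`."""
    r = [c - o for c, o in zip(current, original)]
    norm = l2_norm(r)
    if norm > epsilon:
        r = [epsilon * x / norm for x in r]
    return [o + x for o, x in zip(original, r)]


def pgd_perturb(embedding, grad_fn, epsilon=0.1, alpha=0.3, steps=3):
    """Return the adversarial embedding after `steps` projected gradient steps."""
    original = list(embedding)
    current = list(embedding)
    for _ in range(steps):
        grad = grad_fn(current)  # gradient of the loss at the current point
        norm = l2_norm(grad)
        if norm == 0:
            break
        # Small step of size alpha along the normalized gradient.
        current = [c + alpha * g / norm for c, g in zip(current, grad)]
        # Projection keeps the total perturbation within epsilon.
        current = project(original, current, epsilon)
    return current


# Toy loss 0.5 * ||e||^2, whose gradient at e is e itself (illustrative assumption).
start = [3.0, 4.0]
adv = pgd_perturb(start, grad_fn=lambda e: list(e), epsilon=0.1, alpha=0.3, steps=3)
distance = l2_norm([a - s for a, s in zip(adv, start)])  # stays within epsilon
```

With `alpha` larger than `epsilon`, as in the configuration used here (0.3 and 0.1), every step lands outside the ball and is projected back onto its surface.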

## Configuration

In [1]:
import torch 
import torch.nn as nn

config = {
    'train_file_path': '../../../data/toutiao_news_cls/train.csv',
    'test_file_path': '../../../data/toutiao_news_cls/test.csv',
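    'train_val_ratio': 0.1,  # hold out 10% of the training data for validation
    'head': 'cnn',
    'model_path': '../../../pt/NeZha_model',
    'batch_size': 16,      # batch size
    'num_epochs': 1,       # number of training epochs
    
    'warmup_ratio': 0.1,   # fraction of total steps used for learning-rate warmup (added with the Focal Loss optimization)
    
    'eps': 0.1,            # perturbation size (epsilon) for adversarial training
    'alpha': 0.3,          # step size used by PGD
    'adv': 'pgd',          # adversarial training method: 'fgm' or 'pgd'
    
    'learning_rate': 2e-5, # learning rate
    'logging_step': 300,   # evaluate and log every 300 batches
    'seed': 2022           # random seed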
}

config['device'] = 'cuda' if torch.cuda.is_available() else 'cpu' # use the GPU when available, otherwise the CPU

import random
import numpy as np

def seed_everything(seed):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    return seed

seed_everything(config['seed'])

2022

## Data Preprocessing and DataLoader

In [2]:
import pandas as pd
from tqdm import tqdm
from collections import defaultdict
from transformers import BertTokenizer
from torch.utils.data import DataLoader

In [3]:
# BERT tokenizer
bertTokenizer = BertTokenizer.from_pretrained(config['model_path'])
# Wrap the tokenizer so each sentence yields input_ids, token_type_ids, and attention_mask
def tokenizer(sent):
    inputs = bertTokenizer.encode_plus(sent, add_special_tokens=True, return_token_type_ids=True, return_attention_mask=True)
    
    return inputs


In [4]:
def read_data(config, mode='train'):
    
    data_df = pd.read_csv(config[f'{mode}_file_path'], sep=',')
    LABEL, SENTENCE = 'label', 'sentence'
    data_df['bert_encode'] = data_df[SENTENCE].apply(tokenizer)
    # Sentences differ in length, so the rows are ragged. dtype=object keeps one list per row;
    # recent NumPy versions raise an error on ragged input without it.
    data_df['input_ids'] = data_df['bert_encode'].apply(lambda s: s['input_ids'])
    input_ids = np.array([[int(id_) for id_ in v] for v in data_df['input_ids'].values], dtype=object)
    data_df['token_type_ids'] = data_df['bert_encode'].apply(lambda s: s['token_type_ids'])
    token_type_ids = np.array([[int(id_) for id_ in v] for v in data_df['token_type_ids'].values], dtype=object)
    data_df['attention_mask'] = data_df['bert_encode'].apply(lambda s: s['attention_mask'])
    attention_mask = np.array([[int(id_) for id_ in v] for v in data_df['attention_mask'].values], dtype=object)

    if mode == 'train':
        labels = data_df[LABEL].values
        
        X_train, y_train = defaultdict(list), []
        X_val, y_val = defaultdict(list), []
        num_val = int(config['train_val_ratio'] * len(data_df))
        
        # shuffle ids
        ids = np.random.choice(range(len(data_df)), size=len(data_df), replace=False)
        train_ids = ids[num_val:]
        val_ids = ids[:num_val]
        
        # get input_ids
        X_train['input_ids'], y_train = input_ids[train_ids], labels[train_ids]
        X_val['input_ids'], y_val = input_ids[val_ids], labels[val_ids]
         # get token_type_ids
        X_train['token_type_ids'] = token_type_ids[train_ids]
        X_val['token_type_ids'] = token_type_ids[val_ids]
        # get attention_mask
        X_train['attention_mask'] = attention_mask[train_ids]
        X_val['attention_mask'] = attention_mask[val_ids]
     
        # label 
        label2id = {label: i for i, label in enumerate(np.unique(y_train))}
        id2label = {i: label for label, i in label2id.items()}
        y_train = torch.tensor([label2id[y] for y in y_train], dtype=torch.long)
        y_val = torch.tensor([label2id[y] for y in y_val], dtype=torch.long)

        return X_train, y_train, X_val, y_val, label2id, id2label

    else:
        X_test = defaultdict(list)
        X_test['input_ids'] = input_ids
        X_test['token_type_ids'] = token_type_ids
        X_test['attention_mask'] = attention_mask
        y_test = torch.zeros(len(data_df), dtype=torch.long)
        
        return X_test, y_test
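
The split and the label mapping inside `read_data` can be illustrated on plain Python lists. This is a small sketch under stated assumptions: `split_indices` and `build_label_maps` are hypothetical helper names, and `random.Random(seed).shuffle` stands in for the `np.random.choice(..., replace=False)` shuffle used above.

```python
import random


def split_indices(n, val_ratio, seed=2022):
    """Shuffle indices 0..n-1 and hold out the first val_ratio share for validation."""
    ids = list(range(n))
    random.Random(seed).shuffle(ids)
    num_val = int(val_ratio * n)
    return ids[num_val:], ids[:num_val]


def build_label_maps(labels):
    """Map the sorted unique raw labels to consecutive ids, and back."""
    label2id = {label: i for i, label in enumerate(sorted(set(labels)))}
    id2label = {i: label for label, i in label2id.items()}
    return label2id, id2label


train_ids, val_ids = split_indices(n=10, val_ratio=0.1)
label2id, id2label = build_label_maps([108, 104, 106, 104, 108])
# label2id -> {104: 0, 106: 1, 108: 2}
# id2label -> {0: 104, 1: 106, 2: 108}
```

The raw labels (such as 108 or 104) are not contiguous, which is why they are remapped to 0..C-1 before being passed to the loss, and why `id2label` is kept for converting predictions back.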

In [5]:
# X_train, y_train, X_val, y_val, label2id, id2label = read_data(config, mode='train')

In [6]:
# X_test, y_test = read_data(config, mode='test')

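#### `Dataset` wraps the data; any subclass of `Dataset` must implement:
+ `__len__`: returns the number of examples in the dataset
+ `__getitem__`: returns the example at a given index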

In [7]:
from torch.utils.data import Dataset
class TNEWSDataset(Dataset):
    def __init__(self, X, y):
        self.x = X
        self.y = y

    def __getitem__(self, idx):
        return {
            'input_ids' : self.x['input_ids'][idx],
            'label' : self.y[idx],
            'token_type_ids': self.x['token_type_ids'][idx],
            'attention_mask': self.x['attention_mask'][idx]
        }
    
    def __len__(self):
        return self.y.size(0)

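#### Parallel data loading with DataLoader
+ `DataLoader` provides an iterable over the dataset. It fetches one example at a time from `TNEWSDataset` and collects them into a list `examples` of length `batch_size`. Loading runs in parallel worker processes only when `num_workers > 0`; with `num_workers=0`, as used in this notebook, it runs in the main process.
+ `examples` has the form `[dict1, dict2, ...]`.
+ `collate_fn()` merges the entries of `examples` into tensors.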

In [8]:
def collate_fn(examples):
    input_ids_lst = []
    labels = []
    # ------ differs from the TextCNN version ------
    token_type_ids_lst = []
    attention_mask_lst = []
    # ------ differs from the TextCNN version ------

    for example in examples:
        input_ids_lst.append(example['input_ids'])
        labels.append(example['label'])
        # ------ differs from the TextCNN version ------
        token_type_ids_lst.append(example['token_type_ids'])
        attention_mask_lst.append(example['attention_mask'])
        # ------ differs from the TextCNN version ------
        
    # find the longest sequence in input_ids_lst; every sequence is padded to this length
    max_length = max(len(input_ids) for input_ids in input_ids_lst)
    # allocate a zero-filled tensor to hold the padded batch
    input_ids_tensor = torch.zeros((len(labels), max_length), dtype=torch.long)
    # ------ differs from the TextCNN version ------
    token_type_ids_tensor = torch.zeros_like(input_ids_tensor)
    attention_mask_tensor = torch.zeros_like(input_ids_tensor)
    # ------ differs from the TextCNN version ------
    
    for i, input_ids in enumerate(input_ids_lst):
        seq_len = len(input_ids)
        input_ids_tensor[i, :seq_len] = torch.tensor(input_ids, dtype=torch.long)
        # ------ differs from the TextCNN version ------
        token_type_ids_tensor[i, :seq_len] = torch.tensor(token_type_ids_lst[i], dtype=torch.long)
        attention_mask_tensor[i, :seq_len] = torch.tensor(attention_mask_lst[i], dtype=torch.long)
        # ------ differs from the TextCNN version ------
        
    return {
        'input_ids': input_ids_tensor,
        'labels': torch.tensor(labels, dtype=torch.long),
        # ------ differs from the TextCNN version ------
        'token_type_ids': token_type_ids_tensor,
        'attention_mask': attention_mask_tensor
        # ------ differs from the TextCNN version ------
    }
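
To see what the padding in `collate_fn` produces, the same logic can be run on plain lists. This is a simplified sketch that leaves out torch and `token_type_ids`; `pad_batch` is a hypothetical helper name, and the token ids are toy values.

```python
def pad_batch(examples, pad_id=0):
    """Pad variable-length input_ids to the longest sequence and build attention masks."""
    max_length = max(len(ex['input_ids']) for ex in examples)
    input_ids, attention_mask = [], []
    for ex in examples:
        seq = list(ex['input_ids'])
        n_pad = max_length - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {'input_ids': input_ids, 'attention_mask': attention_mask}


examples = [
    {'input_ids': [101, 517, 1353, 102]},
    {'input_ids': [101, 4767, 102]},
]
batch = pad_batch(examples)
# batch['input_ids']      -> [[101, 517, 1353, 102], [101, 4767, 102, 0]]
# batch['attention_mask'] -> [[1, 1, 1, 1], [1, 1, 1, 0]]
```

Padding per batch rather than to a global maximum keeps short batches small, which is the reason for computing `max_length` inside `collate_fn`.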

In [9]:
from torch.utils.data import DataLoader

def build_dataloader(config):
    X_train, y_train, X_val, y_val, label2id, id2label = read_data(config, mode='train')
    X_test, y_test = read_data(config, mode='test')
    
    train_dataset = TNEWSDataset(X_train, y_train)
    val_dataset = TNEWSDataset(X_val, y_val)
    test_dataset = TNEWSDataset(X_test, y_test)
    
    train_dataloader = DataLoader(dataset=train_dataset, batch_size=config['batch_size'], num_workers=0, shuffle=True, collate_fn=collate_fn)
    val_dataloader = DataLoader(dataset=val_dataset, batch_size=config['batch_size'], num_workers=0, shuffle=False, collate_fn=collate_fn)
    test_dataloader = DataLoader(dataset=test_dataset, batch_size=config['batch_size'], num_workers=0, shuffle=False, collate_fn=collate_fn)

    return train_dataloader, val_dataloader, test_dataloader, id2label

In [10]:
train_dataloader, val_dataloader, test_dataloader, id2label = build_dataloader(config)



In [11]:
for batch in train_dataloader:
    print(len(batch['input_ids']), len(batch['labels']), len(batch['token_type_ids']), len(batch['attention_mask']))
    print(batch)
    break

16 16 16 16
{'input_ids': tensor([[ 101,  517, 1353, 2607, 6121, 1220,  518, 2972, 1139, 3859, 1930, 4276,
         1391, 7883, 3952, 2767, 1070, 4374,  751, 7464, 8024,  872, 2582,  720,
         4692, 8043,  102,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0],
        [ 101, 1920, 6825,  671, 3175, 7415, 1730, 3221, 2582,  720, 1355, 2245,
         6629, 3341, 4638, 8043, 3300,  749, 6237, 1355, 2245, 1380, 4638,  720,
         8043,  102,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0],
        [ 101, 4767, 7340, 1344, 5709, 3492, 3420, 4905, 2458, 6792, 5636, 2168,
         6662,  102,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0],
        [ 101, 2809,  689, 5790, 2360, 4692, 6814, 3341, 8024,  872,  812, 3297,
         1068, 2552, 4638, 1762, 6821,  749,  102,    0,    0,    0,    0,    0,
            0,    0,    0,    0,   

## Training and Validation

In [12]:
# NeZha + head part2
from NeZha import *
from extra_loss import *

class NeZhaForTNEWS(NeZhaPreTrainedModel):
    # classifier -- head
    def __init__(self, config, model_path, classifier):
        super(NeZhaForTNEWS, self).__init__(config)

        self.bert = NeZhaModel.from_pretrained(model_path, config=config)
        self.classifier = classifier  # head
        self.config = config  # added for the Focal Loss optimization (provides num_labels)
    
    def forward(self,
                input_ids: torch.Tensor=None,
                token_type_ids: torch.Tensor=None,
                attention_mask: torch.Tensor=None,
                labels: torch.Tensor=None):

        outputs = self.bert(input_ids=input_ids,
                            attention_mask=attention_mask, 
                            token_type_ids=token_type_ids)
        
        hidden_states = outputs[2]
        
        logits = self.classifier(hidden_states, input_ids)
        
        outputs =(logits, )
        # labels are provided for the training and validation sets
        if labels is not None:
            # Focal Loss replaces the cross-entropy loss
            # loss_fct = nn.CrossEntropyLoss()
            loss_fct = FocalLoss(num_classes=self.config.num_labels)
            loss = loss_fct(logits, labels.view(-1))
            outputs =(loss, ) + outputs
        
        return outputs
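
The `FocalLoss` class comes from `extra_loss`, which is not shown in this notebook. As a sketch of the standard formulation, assuming no class-weighting term, the loss for one example is FL = -(1 - p_t)^gamma * log(p_t), where p_t is the predicted probability of the true class. The pure-Python version below is illustrative only; the function names and the default `gamma=2.0` are assumptions.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    return -math.log(softmax(logits)[target])


def focal_loss(logits, target, gamma=2.0):
    """Focal loss for one example: -(1 - p_t)^gamma * log(p_t)."""
    p_t = softmax(logits)[target]
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


easy = focal_loss([4.0, 0.0, 0.0], target=0)   # confident and correct: heavily down-weighted
hard = focal_loss([0.0, 4.0, 0.0], target=0)   # confident and wrong: close to cross-entropy
```

With `gamma=0` the modulating factor equals 1 and the expression reduces to ordinary cross-entropy; larger values of `gamma` shift the training signal toward examples the model currently gets wrong.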

In [13]:
import torch.nn.functional as F
import torch.nn as nn
class ConvClassifier(nn.Module):
    '''
    CNN + global max pool
    '''
    def __init__(self, config):
        super().__init__()
        self.conv = nn.Conv1d(in_channels=config.hidden_size, out_channels=config.hidden_size, kernel_size=3)
        self.global_max_pool = nn.AdaptiveMaxPool1d(1)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.fc = nn.Linear(config.hidden_size, config.num_labels)
    
    def forward(self, hidden_states, input_ids):
        hidden_states = self.dropout(hidden_states[-1])  # use only the last layer's hidden states
        # hidden_states shape (bs, seq_len, hidden_size) -> (bs, hidden_size, seq_len) 
        hidden_states = hidden_states.permute(0, 2, 1)
        out = F.relu(self.conv(hidden_states))
        
        # out (bs, hidden_size_out, seq_len_out)
        # out (bs, hidden_size, 1)
        # out (bs, hidden_size)
        out = self.global_max_pool(out).squeeze(dim=2)
        out = self.fc(out)
        return out
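
The shape comments in `ConvClassifier.forward` follow from the convolution having `kernel_size=3` and no padding, so `seq_len_out = seq_len - 3 + 1`. A single-channel pure-Python sketch of convolution, ReLU, and global max pooling makes the arithmetic concrete; the helper names and toy values are illustrative assumptions, and the real layer operates over `hidden_size` channels.

```python
def conv1d_valid(sequence, kernel):
    """1-D convolution (cross-correlation) with stride 1 and no padding, single channel."""
    k = len(kernel)
    return [
        sum(sequence[i + j] * kernel[j] for j in range(k))
        for i in range(len(sequence) - k + 1)
    ]


def relu(values):
    return [max(0.0, v) for v in values]


sequence = [1.0, -2.0, 3.0, 0.5, -1.0]   # seq_len = 5
kernel = [1.0, 0.0, -1.0]                # kernel_size = 3
features = relu(conv1d_valid(sequence, kernel))  # length 5 - 3 + 1 = 3
pooled = max(features)                   # global max pool collapses the sequence axis
```

Because the pooled value no longer depends on sequence length, the final linear layer can take a fixed-size input regardless of how long each padded batch is.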

In [14]:
def build_model(model_path, config, head):
    heads = {
        'cnn':ConvClassifier
    }
    assert head in heads, "@_@:head must have been implemented!"
    print(f'>>>You are using {head} head ...')
    model = NeZhaForTNEWS(config, model_path, heads[head](config))
    return model

In [15]:
from sklearn.metrics import f1_score, accuracy_score, precision_score, recall_score

def evaluation(config, model, val_dataloader):
    model.eval()
    preds = []
    labels = []
    val_loss = 0.
    val_iterator = tqdm(val_dataloader, desc='Evaluation...', total=len(val_dataloader))
    with torch.no_grad():
        for batch in val_iterator:
            labels.append(batch['labels'])
            batch = {item:value.to(config['device']) for item, value in batch.items()}
            
            # val output (loss, out)
            loss, logits = model(**batch)[:2]
            val_loss += loss.item()
            
            preds.append(logits.argmax(dim=-1).detach().cpu())
            
    avg_val_loss = val_loss/len(val_dataloader)
    labels = torch.cat(labels, dim=0).numpy()
    preds = torch.cat(preds, dim=0).numpy()
    
    precision = precision_score(labels, preds, average='macro')
    recall = recall_score(labels, preds, average='macro')
    f1 =f1_score(labels, preds, average='macro')
    accuracy = accuracy_score(labels, preds)
    
    return avg_val_loss, f1, precision, recall, accuracy
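
Macro averaging, as requested with `average='macro'` above, computes the metric per class and then takes the unweighted mean, so rare classes count as much as frequent ones. The following pure-Python sketch shows the computation for F1; `macro_f1` is a hypothetical helper written for illustration, not a replacement for `sklearn.metrics.f1_score`.

```python
def macro_f1(labels, preds):
    """Unweighted mean of per-class F1 over every class seen in labels or preds."""
    classes = sorted(set(labels) | set(preds))
    f1_scores = []
    for c in classes:
        tp = sum(1 for y, p in zip(labels, preds) if y == c and p == c)
        fp = sum(1 for y, p in zip(labels, preds) if y != c and p == c)
        fn = sum(1 for y, p in zip(labels, preds) if y == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1_scores.append(2 * precision * recall / (precision + recall))
        else:
            f1_scores.append(0.0)
    return sum(f1_scores) / len(f1_scores)


labels = [0, 0, 1, 1, 2, 2]
preds = [0, 1, 1, 1, 2, 0]
score = macro_f1(labels, preds)
# per-class F1: class 0 -> 0.5, class 1 -> 0.8, class 2 -> 2/3
```

Note that macro F1 is the mean of the per-class F1 values, which generally differs from the F1 computed from macro precision and macro recall.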

In [16]:
# NeZha model + Head train
import os  # needed for os.path.join when saving the model
from transformers import BertConfig, BertForSequenceClassification
from transformers import AdamW
from tqdm import trange

from extra_optim import *
from extra_fgm import *
from extra_pgd import *

def train(config, train_dataloader, val_dataloader, model):

    optimizer_grouped_parameters = model.parameters()
    optimizer = AdamW(optimizer_grouped_parameters, lr=config['learning_rate'])
    optimizer = Lookahead(optimizer, 5, 1)
    total_steps = config['num_epochs'] * len(train_dataloader)
    lr_scheduler = WarmupLinearSchedule(optimizer, 
                                        warmup_steps=int(config['warmup_ratio'] * total_steps),
                                        t_total=total_steps)
    
    model.to(config['device'])
    
    # --- adversarial training setup
    if config['adv'] == 'fgm':
        fgm = FGM(model)
    else:
        pgd = PGD(model)
        K = 3  # number of PGD steps
    # --- adversarial training setup
    
    epoches_iterator = trange(config['num_epochs'])
    global_steps = 0
    train_loss = 0.
    logging_loss = 0.
    
    best_f1 = 0.
    best_precision = 0.
    best_recall = 0.
    best_accuracy = 0.
    
    for epoch in epoches_iterator:
        train_iterator = tqdm(train_dataloader, desc='Training', total=len(train_dataloader))
        model.train()
        for batch in train_iterator:
            batch = {item:value.to(config['device']) for item, value in batch.items()}
            
            # train output (loss, out)
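            loss = model(**batch)[0]  # forward pass on x
            
            model.zero_grad()  # clear the parameter gradients
            loss.backward()  # backpropagate to obtain the gradients
            
            # --- adversarial training
            if config['adv'] == 'fgm':
                # add the perturbation r to the embedding
                fgm.attack(epsilon=config['eps'])
                # forward pass on x + r
                loss_adv = model(**batch)[0]
                # backpropagate; the gradients accumulate onto those from the clean pass
                loss_adv.backward()
                # restore the embedding to its value before the attack
                fgm.restore()
            else:
                pgd.backup_grad()  # save the gradients from the clean pass
                for t in range(K):
                    # perturb the embedding for step t
                    pgd.attack(epsilon=config['eps'], alpha=config['alpha'], is_first_attack=(t == 0))
                    if t != K - 1:
                        model.zero_grad()  # intermediate steps only need the gradient at x + r
                    else:
                        pgd.restore_grad()  # last step: bring back the clean-pass gradients
                    loss_adv = model(**batch)[0]
                    loss_adv.backward()
                pgd.restore()  # restore the embedding to its value before the attack
            # --- adversarial training
            
            optimizer.step()  # update the parameters
            lr_scheduler.step()
            
            train_loss += loss.item()  # accumulate the loss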
            global_steps += 1
            
            if global_steps % config['logging_step'] == 0:
                print_train_loss = (train_loss - logging_loss) / config['logging_step']
                logging_loss = train_loss
                avg_val_loss, f1, precision, recall, accuracy = evaluation(config, model, val_dataloader)
                
                if best_f1 < f1:
                    best_f1 = f1
                    best_precision = precision
                    best_recall = recall
                    best_accuracy = accuracy
                    print_log = f'''>>> training loss: {print_train_loss: .4f}, valid loss: {avg_val_loss: .4f}\n
                            valid f1 score: {f1: .4f}, valid precision score: {precision: .4f},
                            valid recall score: {recall: .4f}, valid accuracy score: {accuracy: .4f}'''
                    print(print_log)
                    model.save_pretrained(os.path.join('../../../pt_tmp/cls/nezha_head_fl_fgm_pgd', config['adv']))
                    
                model.train()
                
    return best_f1, best_precision, best_recall, best_accuracy
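
The scheduler is built from `warmup_ratio` and the total number of steps. `WarmupLinearSchedule` comes from `extra_optim`, which is not shown here; a common definition of a schedule with that name multiplies the base learning rate by a factor that rises linearly from 0 to 1 during warmup and then decays linearly to 0. The sketch below assumes that definition, and `warmup_linear_factor` is a hypothetical name.

```python
def warmup_linear_factor(step, warmup_steps, t_total):
    """Learning-rate multiplier: linear warmup to 1, then linear decay to 0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (t_total - step) / max(1.0, t_total - warmup_steps))


total_steps = 1000                       # num_epochs * len(train_dataloader) in train()
warmup_steps = int(0.1 * total_steps)    # warmup_ratio = 0.1 gives 100 warmup steps
base_lr = 2e-5

factors = [warmup_linear_factor(s, warmup_steps, total_steps) for s in (0, 50, 100, 550, 1000)]
# factors -> [0.0, 0.5, 1.0, 0.5, 0.0]
learning_rates = [base_lr * f for f in factors]
```

Starting from a near-zero learning rate is a common precaution when fine-tuning a pretrained encoder together with a freshly initialized head, since the early gradients from the head can be large.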

In [None]:
# First run: start from the pretrained NeZha checkpoint
bert_config = NeZhaConfig.from_pretrained(config['model_path'])
bert_config.output_hidden_states = True
bert_config.num_labels = len(id2label)
model = build_model(config['model_path'], bert_config, config['head'])
f1, precision, recall, accuracy = train(config, train_dataloader, val_dataloader, model)
print_log = f'''valid f1 score: {f1: .4f}, valid precision score: {precision: .4f},
                valid recall score: {recall: .4f}, valid accuracy score: {accuracy: .4f}'''
print(print_log)

# Continued training: resume from the previously saved checkpoint
# bert_config = BertConfig.from_pretrained('../../../pt_tmp/cls/nezha_head_fl_fgm_pgd')
# bert_config.output_hidden_states = True
# bert_config.num_labels = len(id2label)
# model = build_model('../../../pt_tmp/cls/nezha_head_fl_fgm_pgd', bert_config, config['head'])
# f1, precision, recall, accuracy = train(config, train_dataloader, val_dataloader, model)
# print_log = f'''valid f1 score: {f1: .4f}, valid precision score: {precision: .4f},
#                 valid recall score: {recall: .4f}, valid accuracy score: {accuracy: .4f}'''
# print(print_log)

>>>You are using cnn head ...


Some weights of the model checkpoint at ../../../pt/NeZha_model were not used when initializing NeZhaModel: ['cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing NeZhaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing NeZhaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of NeZhaModel were not initialized from the model checkpoint at ../../../pt/NeZha_model and are newly initialized: ['bert.encoder.lay

## Prediction and Saving Results

In [18]:
def predict(config, id2label, model, test_dataloader):
    test_iterator = tqdm(test_dataloader, desc='Testing', total=len(test_dataloader))
    model.eval()
    test_preds = []
    
    with torch.no_grad():
        for batch in test_iterator:
            batch = {item: value.to(config['device']) for item, value in batch.items()}

            logits = model(**batch)[1]
            test_preds.append(logits.argmax(dim=-1).detach().cpu())
            
    test_preds = torch.cat(test_preds, dim=0).numpy()
    test_preds = [id2label[id_] for id_ in test_preds]
        
    test_df = pd.read_csv(config['test_file_path'], sep=',')
    # test_df.insert(1, column=['label_pred'], value=test_preds)
    test_df['label_pred'] = test_preds
    # test_df.drop(columns=['sentence'], inplace=True)
    test_df.to_csv('submission.csv', index=False, encoding='utf8')

In [19]:
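# Note: `best_model` is not defined in the cells shown here; it is assumed to be the best checkpoint saved by train(), loaded beforehand.
predict(config, id2label, best_model, test_dataloader)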

Testing: 100%|████████████████████████████████████████████| 625/625 [10:44<00:00,  1.03s/it]


In [20]:
test_df = pd.read_csv(config['test_file_path'], sep=',')

In [21]:
train_df = pd.read_csv(config['train_file_path'], sep=',')

In [22]:
train_df.head(10)

| Unnamed: 0 | id | label | label_desc | sentence |
| --- | --- | --- | --- | --- |
| 0 | 0 | 108 | news_edu | 上课时学生手机响个不停，老师一怒之下把手机摔了，家长拿发票让老师赔，大家怎么看待这种事？ |
| 1 | 1 | 104 | news_finance | 商赢环球股份有限公司关于延期回复上海证券交易所对公司2017年年度报告的事后审核问询函的公告 |
| 2 | 2 | 106 | news_house | 通过中介公司买了二手房，首付都付了，现在卖家不想卖了。怎么处理？ |
| 3 | 3 | 112 | news_travel | 2018年去俄罗斯看世界杯得花多少钱？ |
| 4 | 4 | 109 | news_tech | 剃须刀的个性革新，雷明登天猫定制版新品首发 |
| 5 | 5 | 103 | news_sports | 再次证明了“无敌是多么寂寞”——逆天的中国乒乓球队！ |
| 6 | 6 | 109 | news_tech | 三农盾SACC-全球首个推出：互联网+区块链+农产品的电商平台 |
| 7 | 7 | 116 | news_game | 重做or新英雄？其实重做对暴雪来说同样重要 |
| 8 | 8 | 103 | news_sports | 如何在商业活动中不受人欺骗？ |
| 9 | 9 | 101 | news_culture | 87版红楼梦最温柔的四个丫鬟，娶谁都是一生的福气 |


In [23]:
train_df['label'].unique()

array([108, 104, 106, 112, 109, 103, 116, 101, 107, 100, 102, 110, 115,
       113, 114])