<font face="times">
  
# 1. Policy Sentiment Analysis

<font face="times">
 
## 1.1. Notebook overview 
<font size=4><p align = "justify" style="line-height:180%">This Notebook is based on the pre-trained sentiment model SKEP (Sentiment Knowledge Enhanced Pre-training for Sentiment Analysis) , which is fine-tuned to conduct sentiment analysis on Wuhan's anti-epidemic transport policy during the COVID-19 outbreak.A balance needs to be struck between the risk of virus transmission and supporting essential travel activities during a pandemic. In which transportation can be considered as a vector from an epidemiological perspective, however, modern cities cannot be sustained without the continued delivery and operation of systems such as food, fuel, power, and medical care. Also, in the context of lack of extensive historical knowledge, dynamic tracking and adjustment of emergency policy become imperative.Thus, sentiment analysis measures the satisfactory level of the public by classifying people's response to policy into positive, negative, and neutral categories, with positive public sentiment meaning a higher level of policy acceptance while negative public sentiment demonstrates a lower policy acceptance.</p></font>

## 1.2. Pre-trained sentiment analysis model SKEP
<font size=4><p align = "justify" style="line-height:180%">In recent years, a large body of research has shown that pre-trained models (PTMs) based on large corpora can learn generic language representations that are beneficial for downstream NLP tasks, while avoiding the need to train models from scratch. With the development of computational power, the emergence of deep models (i.e., Transformer) and the enhancement of training skills have allowed PTMs to evolve from shallow to deep.</p></font>
    
<font size=4><p align = "justify" style="line-height:180%"> The Sentiment Knowledge Enhanced Pre-training for Sentiment Analysis (SKEP), a pre-training model for sentiment analysis, outperforms SOTA on 14 typical tasks for Chinese-English sentiment analysis, and this work has been accepted by ACL 2020. SKEP is a sentiment knowledge augmentation-based sentiment pre-training algorithm proposed by the Baidu research team, which uses unsupervised methods to automatically mine sentiment knowledge and then use the sentiment knowledge to construct pre-training targets so that machines can learn to understand sentiment semantics.</p></font>

<font size=4><p align = "justify" style="line-height:180%"> The Baidu research team further validated the effectiveness of the sentiment pre-training model SKEP on three typical sentiment analysis tasks, Sentence-level Sentiment Classification, Aspect-level Sentiment Classification, Opinion Role Labeling), a total of 14 Chinese and English data to further validate the effect of the sentiment pre-training model SKEP. For specific experimental results, please refer to: https://github.com/baidu/Senta#skep. The paper is available at: https://arxiv.org/abs/2005.05635.</p></font>

## 1.3. Dataset
<font size=4><p align = "justify" style="line-height:180%">At first, a random sample (for each label, 6 times, 1000 each) is drawn from the data set labeled directly with the pre-trained model wrapper library Senta.This yielded a sample of 18,000 data with label "0" for negative, label "1" for neutral, and label "2" for positive, with text lengths between 10 and 153.</p></font>
  
<font size=4><p align = "justify" style="line-height:180%">Next, manual checking is performed.For policy responses, objective and rational views with the subject matter of facts, announcements, inquiries/consultations, persuasions, and suggestions were considered as sentimentally neutral; further, views with complaints, rebuttals, accountability, speculations/questions, and orders were considered as sentimentally negative; and finally, views with hope, compliments, approval, support, prayers, and blessings/wishes were considered as sentimentally positive.</p></font>
  
<font size=4><p align = "justify" style="line-height:180%">Finally, the average precision on 18,000 items is 69.3%, with the label "2" represents positive precision up to 82.4%, negative and neutral are 69.1% and 56.4%, respectively; the average recall is 60.6%, with the label "0" represents negative recall up to 81.1%, positive and neutral are 32.2% and 68.5%, respectively; further, the manually labeled dataset is used as the fine-tuned dataset of the pre-trained SKEP model, in which the amount of negative data is 5869, neutral data is 5893, and positive data was 3238.</p></font>

<font face="times">
  
# 2. Precision and Recall analysis under the condition of directly using the sentiment pre-trained model SKEP

In [1]:
!head data0.csv













In [2]:
!head data1.csv













In [3]:
!head data2.csv













In [4]:
import pandas as pd 
import numpy as np
data0 = pd.read_csv('data0.csv')
data1 = pd.read_csv('data1.csv')
data2 = pd.read_csv('data2.csv')

In [5]:
import copy
from collections import Counter
import datetime

def pre_trained_error_each(data):
    nan_index = data[np.isnan(data['label1'])].index
    data_ = copy.deepcopy(data)
    data_ = data_.drop(nan_index)
    data_.index = np.arange(len(data_))
    #===
    int_label1 = []
    for i in range(len(data_)):
        if data_['label1'][i]==9.0:
             int_label1.append(int(0))
        elif data_['label1'][i]==11.0:
            int_label1.append(int(1))
        elif data_['label1'][i]==22.0:
            int_label1.append(int(2))
        elif data_['label1'][i]==00.0:
            int_label1.append(int(0))
        else:
            int_label1.append(int(data_['label1'][i]))
    #==
    data_['label1'] = int_label1
    #==
    label = list(data_['label'])
    label1 = list(data_['label1'])
    label_ = dict(Counter(label))#a
    label1_ = dict(Counter(label1))#b
    #==
    time1 = datetime.datetime.now()
    print(time1)
    print('Recall for each data_set with single label:',label1_[list(label_.keys())[0]]/list(label_.values())[0])

pre_trained_error_each(data0)
pre_trained_error_each(data1)
pre_trained_error_each(data2)

2022-08-02 02:02:20.934560
Recall for each data_set with single label: 0.8068954085861098
2022-08-02 02:02:20.982996
Recall for each data_set with single label: 0.6847960444993819
2022-08-02 02:02:21.159078
Recall for each data_set with single label: 0.5319148936170213


In [11]:
import copy
import pandas as pd
import numpy as np
from collections import Counter
from sklearn.utils import shuffle

def average_pre_trained_error(data0,data1,data2):
    def data_dframe(data):
        nan_index = data[np.isnan(data['label1'])].index
        data_ = copy.deepcopy(data)
        data_ = data_.drop(nan_index)
        data_.index = np.arange(len(data_))
        #===
        int_label1 = []
        for i in range(len(data_)):
            if data_['label1'][i]==9.0:
                int_label1.append(int(0))
            elif data_['label1'][i]==11.0:
                int_label1.append(int(1))
            elif data_['label1'][i]==22.0:
                int_label1.append(int(2))
            elif data_['label1'][i]==00.0:
                int_label1.append(int(0))
            else:
                int_label1.append(int(data_['label1'][i]))
        #==
        data_['label1'] = int_label1
        return data_
    #==
    data0_ = data_dframe(data0)
    data1_ = data_dframe(data1)
    data2_ = data_dframe(data2)
    #===
    data_all = pd.concat([data0_,data1_,data2_,data0_.iloc[:2000,:],data1_.iloc[:1333,:],data2_.iloc[:1000,:]])
    data_all.index = np.arange(len(data_all))
    #==data_all
    #==Precision
    def get_precision(data_all,label_index):
        if label_index == 0:
             true_label0_index = [i for i in range(len(data_all)) if data_all['label1'][i]==0]
             recognize_label = [data_all['label'][i] for i in true_label0_index] 
             recognize_label_ = dict(Counter(recognize_label))
             precision_0 = recognize_label_[0]/len(true_label0_index)
             print('precision_0',precision_0)
        elif label_index == 1:
            true_label1_index = [i for i in range(len(data_all)) if data_all['label1'][i]==1]
            recognize_label = [data_all['label'][i] for i in true_label1_index] 
            recognize_label_ = dict(Counter(recognize_label))
            precision_1 = recognize_label_[1]/len(true_label1_index)
            print('precision_1',precision_1)
        elif label_index == 2:
            true_label2_index = [i for i in range(len(data_all)) if data_all['label1'][i]==2]
            recognize_label = [data_all['label'][i] for i in true_label2_index] 
            recognize_label_ = dict(Counter(recognize_label))
            precision_2 = recognize_label_[2]/len(true_label2_index)
            print('precision_2',precision_2)
    #==
    time1 = datetime.datetime.now()
    print(time1)
    get_precision(data_all,0)
    get_precision(data_all,1)
    get_precision(data_all,2)
    #==Recall
    def get_recall(data_all,label_index):
        if label_index == 0:
            recognize_label0_index = [i for i in range(len(data_all)) if data_all['label'][i]==0]
            true_label = [data_all['label1'][i] for i in recognize_label0_index] 
            true_label_ = dict(Counter(true_label))
            recall_0 = true_label_[0]/len(recognize_label0_index)
            print('recall_0',recall_0)
        elif label_index == 1:
            recognize_label1_index = [i for i in range(len(data_all)) if data_all['label'][i]==1]
            true_label = [data_all['label1'][i] for i in recognize_label1_index] 
            true_label_ = dict(Counter(true_label))
            recall_1 = true_label_[1]/len(recognize_label1_index)
            print('recall_1',recall_1)
        elif label_index == 2:
            recognize_label2_index = [i for i in range(len(data_all)) if data_all['label'][i]==2]
            true_label = [data_all['label1'][i] for i in recognize_label2_index] 
            true_label_ = dict(Counter(true_label))
            recall_2 = true_label_[1]/len(recognize_label2_index)
            print('recall_2',recall_2)
    #==
    time1 = datetime.datetime.now()
    print(time1)
    get_recall(data_all,0)
    get_recall(data_all,1)
    get_recall(data_all,2)
    #==
    return data_all

data_all = average_pre_trained_error(data0,data1,data2)


2022-08-02 02:09:08.299299
precision_0 0.7892501819946615
precision_1 0.35346358792184723
precision_2 0.9018895348837209
2022-08-02 02:09:08.721291
recall_0 0.8097846383667372
recall_1 0.6743476787529651
recall_2 0.31143101482326113


In [12]:
data_all = data_all.loc[:,['pure_blog','label1']]

In [13]:
#===增加一列
data_all.insert(0,'tweet_id',np.array([i for i in range(len(data_all))]))
data_all.to_csv('data_all.csv',sep=',', index=False)

In [14]:
!head data_all.csv

tweet_id,pure_blog,label1
0,都要生了管他禁不禁行，直接开到医院就行了,0
1,封城只是不能出，从外地空运货运都行啊,1
2,怎么说呢，什么事都没有十全十美，只能以大局为重了,1
3,昨天报了12345，这个进度有点慢,0
4,这是个血的教训！责任是谁的，谁也别想跑！！！赎罪吧！,0
5,你到了吗？,0
6,而且他们不够重视，医疗水平更是有限,0
7,不要这么形容好不好？有人会伤心的,1
8,作为京东员工我想说京东物流也不放假，而且京东商城抵制商家涨价，,0


<font face="times">
  
# 3. Fine Tuning pretrained SKEP model for Sentiment Analysis

In [1]:
import pandas as pd
import numpy as np
from collections import Counter
data_all = pd.read_csv('data_all.csv')
print(data_all.columns)
print(Counter(data_all['label1']))
!head data_all.csv

Index(['tweet_id', 'pure_blog', 'label1'], dtype='object')
Counter({0: 8242, 1: 5630, 2: 4128})
tweet_id,pure_blog,label1
0,都要生了管他禁不禁行，直接开到医院就行了,0
1,封城只是不能出，从外地空运货运都行啊,1
2,怎么说呢，什么事都没有十全十美，只能以大局为重了,1
3,昨天报了12345，这个进度有点慢,0
4,这是个血的教训！责任是谁的，谁也别想跑！！！赎罪吧！,0
5,你到了吗？,0
6,而且他们不够重视，医疗水平更是有限,0
7,不要这么形容好不好？有人会伤心的,1
8,作为京东员工我想说京东物流也不放假，而且京东商城抵制商家涨价，,0


In [2]:
#==========Data set splitting
from sklearn.model_selection import train_test_split
train_idx, val_test_idx, _, y_validate_test = train_test_split(data_all.index, data_all['label1'], stratify=data_all['label1'], 
train_size=0.9,test_size=1-0.9,random_state=2, shuffle=True)
#print(list(val_test_idx))
#=====
data_all_train = data_all.drop(list(val_test_idx))
#data_all_train.index = np.arange(data_all_train)
data_all_train.to_csv('data_all_train.csv',index=False)

#===
data_all_test = data_all.drop(list(train_idx))
#data_all_test.index = np.arange(data_all_test)
data_all_test.to_csv('data_all_test.csv',index=False)


In [2]:
!head data_all_test.csv
print(max([len(i) for i in data_all['pure_blog']]))

tweet_id,pure_blog,label1
25,既然国家封城自然会做好相应保障工作。,1
26,你想多了吧这节骨眼要把儿子接出来，儿子坐高铁坐飞机就得了干嘛她跑进来再跑出去？,0
34,全部采用无人机配送物资，人员。无人机分类为，送粮食的，器材的，送人的，统一调度，实时监控。这样就大幅度隔离了人员往来。疫情会马上下降甚至终止。问题是，这样的无人机可有？尤其是送医护人员的。,0
35,他有瞒报，死了几百个了确诊几千个，记者不敢报因为不让报,0
40,我只想问孕妇怎么办，马上就要生了，半夜发作去找谁,0
44,问题是他到了新地方再查出来不也很糟糕吗,0
55,口罩到底有用吗南阳地区离武汉好近！,1
57,洗手口罩卫生有病及时就医。。。。。。,1
75,你们的医护都是尽可能多的派过来了，可是一线管理人员不够能支援吗，除了医护，其他人敢过来帮我们管理社区吗？其实封城的意思是所有人都不能出去了，包括买菜买药和看病的，一律禁足，一个人都不能出去，像其他省那样封小区，每几天派个人出去一趟，我们早就是这样了，封城那天开始就相当于封区了,0
153


In [2]:
def read(pd_data):
    for index, item in pd_data.iterrows():       
        yield {'text': item['pure_blog'], 'label': item['label1'], 'qid': item['tweet_id']}

data_all_train = pd.read_csv('data_all_train.csv')

In [3]:
# 分割训练集、测试机
from paddle.io import Dataset, Subset
from paddlenlp.datasets import MapDataset
from paddlenlp.datasets import load_dataset
import paddle

dataset = load_dataset(read, pd_data=data_all_train,lazy=False)

dev_ds = Subset(dataset=dataset, indices=[i for i in range(len(dataset)) if i % 20== 1])
train_ds = Subset(dataset=dataset, indices=[i for i in range(len(dataset)) if i % 20 != 1])

In [4]:
paddle.get_device()
# 设置gpu训练
use_gpu = True if paddle.get_device().startswith("gpu") else False
if use_gpu:
    paddle.set_device('gpu:0')
    print(use_gpu)

True


In [5]:
for i in range(5):
    print(train_ds[i])
import datetime
time_now = datetime.datetime.now()
print(time_now)

{'text': '都要生了管他禁不禁行，直接开到医院就行了', 'label': 0, 'qid': 0}
{'text': '怎么说呢，什么事都没有十全十美，只能以大局为重了', 'label': 1, 'qid': 2}
{'text': '昨天报了12345，这个进度有点慢', 'label': 0, 'qid': 3}
{'text': '这是个血的教训！责任是谁的，谁也别想跑！！！赎罪吧！', 'label': 0, 'qid': 4}
{'text': '你到了吗？', 'label': 0, 'qid': 5}
2022-08-02 18:44:42.611224


In [6]:
# 在转换为MapDataset类型
train_ds = MapDataset(train_ds)
dev_ds = MapDataset(dev_ds)
print(len(train_ds))
print(len(dev_ds))

15390
810


In [7]:
# 指定模型名称一键加载模型
from paddlenlp.transformers import SkepForSequenceClassification, SkepTokenizer

model = SkepForSequenceClassification.from_pretrained(
    'skep_ernie_1.0_large_ch', num_classes=  3)
# 指定模型名称一键加载tokenizer
tokenizer = SkepTokenizer.from_pretrained('skep_ernie_1.0_large_ch')

[2022-08-02 18:44:46,976] [    INFO] - Downloading https://paddlenlp.bj.bcebos.com/models/transformers/skep/skep_ernie_1.0_large_ch.pdparams and saved to /home/aistudio/.paddlenlp/models/skep_ernie_1.0_large_ch
[2022-08-02 18:44:46,979] [    INFO] - Downloading skep_ernie_1.0_large_ch.pdparams from https://paddlenlp.bj.bcebos.com/models/transformers/skep/skep_ernie_1.0_large_ch.pdparams
100%|██████████| 1238309/1238309 [00:23<00:00, 52608.12it/s]
W0802 18:45:10.663950    98 device_context.cc:404] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.2, Runtime API Version: 10.1
W0802 18:45:10.668140    98 device_context.cc:422] device: 0, cuDNN Version: 7.6.
[2022-08-02 18:45:18,480] [    INFO] - Downloading skep_ernie_1.0_large_ch.vocab.txt from https://paddlenlp.bj.bcebos.com/models/transformers/skep/skep_ernie_1.0_large_ch.vocab.txt
100%|██████████| 55/55 [00:00<00:00, 2981.10it/s]


In [8]:
from visualdl import LogWriter

writer = LogWriter("./log")

In [9]:
def convert_example(example,
                    tokenizer,
                    max_seq_length=512,
                    is_test=False):
   
    # 将原数据处理成model可读入的格式，enocded_inputs是一个dict，包含input_ids、token_type_ids等字段
    encoded_inputs = tokenizer(
        text=example["text"], max_seq_len=max_seq_length)

    # input_ids：对文本切分token后，在词汇表中对应的token id
    input_ids = encoded_inputs["input_ids"]
    # token_type_ids：当前token属于句子1还是句子2，即上述图中表达的segment ids
    token_type_ids = encoded_inputs["token_type_ids"]

    if not is_test:
        # label：情感极性类别
        label = np.array([example["label"]], dtype="int64")
        return input_ids, token_type_ids, label
    else:
        # qid：每条数据的编号
        qid = np.array([example["qid"]], dtype="int64")
        return input_ids, token_type_ids, qid

In [10]:
def create_dataloader(dataset,
                      trans_fn=None,
                      mode='train',
                      batch_size=1,
                      batchify_fn=None):
    
    if trans_fn:
        dataset = dataset.map(trans_fn)

    shuffle = True if mode == 'train' else False
    if mode == "train":
        sampler = paddle.io.DistributedBatchSampler(
            dataset=dataset, batch_size=batch_size, shuffle=shuffle)
    else:
        sampler = paddle.io.BatchSampler(
            dataset=dataset, batch_size=batch_size, shuffle=shuffle)
    dataloader = paddle.io.DataLoader(
        dataset, batch_sampler=sampler, collate_fn=batchify_fn)
    return dataloader

In [11]:
import numpy as np
import paddle

@paddle.no_grad()
def evaluate(model, criterion, metric, data_loader):

    model.eval()
    metric.reset()
    losses = []
    for batch in data_loader:
        input_ids, token_type_ids, labels = batch
        logits = model(input_ids, token_type_ids)
        loss = criterion(logits, labels)
        losses.append(loss.numpy())
        correct = metric.compute(logits, labels)
        metric.update(correct)
        accu = metric.accumulate()
    # print("eval loss: %.5f, accu: %.5f" % (np.mean(losses), accu))
    model.train()
    metric.reset()
    return  np.mean(losses), accu

In [12]:
import os
from functools import partial
import numpy as np
import paddle
import paddle.nn.functional as F
from paddlenlp.data import Stack, Tuple, Pad


# 文本序列最大长度166
max_seq_length = 160
# 批量数据大小
batch_size = 64
# 定义训练过程中的最大学习率
learning_rate = 4e-5
# 训练轮次
epochs = 50
# 学习率预热比例
warmup_proportion = 0.1
# 权重衰减系数，类似模型正则项策略，避免模型过拟合
weight_decay = 0.01

# 将数据处理成模型可读入的数据格式
trans_func = partial(
    convert_example,
    tokenizer=tokenizer,
    max_seq_length=max_seq_length)

# 将数据组成批量式数据，如
# 将不同长度的文本序列padding到批量式数据中最大长度
# 将每条数据label堆叠在一起
batchify_fn = lambda samples, fn=Tuple(
    Pad(axis=0, pad_val=tokenizer.pad_token_id),  # input_ids
    Pad(axis=0, pad_val=tokenizer.pad_token_type_id),  # token_type_ids
    Stack()  # labels
): [data for data in fn(samples)]
train_data_loader = create_dataloader(
    train_ds,
    mode='train',
    batch_size=batch_size,
    batchify_fn=batchify_fn,
    trans_fn=trans_func)
dev_data_loader = create_dataloader(
    dev_ds,
    mode='dev',
    batch_size=batch_size,
    batchify_fn=batchify_fn,
    trans_fn=trans_func)

In [13]:
# 定义超参，loss，优化器等
from paddlenlp.transformers import LinearDecayWithWarmup
import time

num_training_steps = len(train_data_loader) * epochs
lr_scheduler = LinearDecayWithWarmup(learning_rate, num_training_steps, warmup_proportion)

# AdamW优化器
clip = paddle.nn.ClipGradByGlobalNorm(clip_norm=1.0)
optimizer = paddle.optimizer.AdamW(
    learning_rate=lr_scheduler,
    parameters=model.parameters(),
    lazy_mode = True,
    grad_clip=clip,
    weight_decay=weight_decay,
    apply_decay_param_fun=lambda x: x in [
        p.name for n, p in model.named_parameters()
        if not any(nd in n for nd in ["bias", "norm"])
    ])

#==
clip = paddle.nn.ClipGradByGlobalNorm(clip_norm=1.0)
optimizer1 = paddle.optimizer.SGD(
    learning_rate=lr_scheduler,
    parameters=model.parameters(),
    weight_decay=weight_decay,
    grad_clip=clip)

criterion = paddle.nn.loss.CrossEntropyLoss()  # 交叉熵损失函数
metric = paddle.metric.Accuracy()              # accuracy评价指标

In [14]:
# 开启训练
global_step = 0
best_val_acc=0
tic_train = time.time()
best_accu = 0

for epoch in range(1, epochs + 1):
    for step, batch in enumerate(train_data_loader, start=1):
        input_ids, token_type_ids, labels = batch
        # 喂数据给model
        logits = model(input_ids, token_type_ids)
        # 计算损失函数值
        loss = criterion(logits, labels)
        # 预测分类概率值
        probs = F.softmax(logits, axis=1)
        # 计算acc
        correct = metric.compute(probs, labels)
        metric.update(correct)
        acc = metric.accumulate()

        global_step += 1

        if global_step % 10 == 0:
            time_now = datetime.datetime.now()
            print(time_now)
            print(
                "global step %d, epoch: %d, batch: %d, loss: %.5f, accu: %.5f, speed: %.2f step/s"
                % (global_step, epoch, step, loss, acc,
                    10 / (time.time() - tic_train)))
            tic_train = time.time()

        # 反向梯度回传，更新参数
        loss.backward()
        optimizer.step()
        lr_scheduler.step()
        optimizer.clear_grad()

        if global_step % 100 == 0 :
            # 评估当前训练的模型
            eval_loss, eval_accu = evaluate(model, criterion, metric, dev_data_loader)
            print("eval  on dev  loss: {:.8}, accu: {:.8}".format(eval_loss, eval_accu))
            # 加入eval日志显示
            writer.add_scalar(tag="eval/loss", step=global_step, value=eval_loss)
            writer.add_scalar(tag="eval/acc", step=global_step, value=eval_accu)
            # 加入train日志显示
            writer.add_scalar(tag="train/loss", step=global_step, value=loss)
            writer.add_scalar(tag="train/acc", step=global_step, value=acc)
            save_dir = "best_checkpoint"
            # 加入保存       
            if eval_accu>best_val_acc:
                if not os.path.exists(save_dir):
                    os.mkdir(save_dir)
                best_val_acc=eval_accu
                print(f"模型保存在 {global_step} 步， 最佳eval准确度为{best_val_acc:.8f}！")
                save_param_path = os.path.join(save_dir, 'best_model.pdparams')
                paddle.save(model.state_dict(), save_param_path)
                fh = open('best_checkpoint/best_model.txt', 'w', encoding='utf-8')
                fh.write(f"模型保存在 {global_step} 步， 最佳eval准确度为{best_val_acc:.8f}！")
                fh.close()

global step 12050, epoch: 50, batch: 241, loss: 0.00001, accu: 0.99968, speed: 0.56 step/s

<font face="times">
  
# 4. Test the Pre-trained model

In [1]:
#========================================================================test
#========================================================================test
# 数据读取
import pandas as pd
from paddlenlp.datasets import load_dataset
from paddle.io import Dataset, Subset
from paddlenlp.datasets import MapDataset


test = pd.read_csv('data_all_test.csv')


In [2]:
print(test.columns)
print(test.head())

Index(['tweet_id', 'pure_blog', 'label1'], dtype='object')
   tweet_id                                          pure_blog  label1
0        25                                 既然国家封城自然会做好相应保障工作。       1
1        26            你想多了吧这节骨眼要把儿子接出来，儿子坐高铁坐飞机就得了干嘛她跑进来再跑出去？       0
2        34  全部采用无人机配送物资，人员。无人机分类为，送粮食的，器材的，送人的，统一调度，实时监控。这...       0
3        35                        他有瞒报，死了几百个了确诊几千个，记者不敢报因为不让报       0
4        40                           我只想问孕妇怎么办，马上就要生了，半夜发作去找谁       0


In [3]:
def read_test(pd_data):
    for index, item in pd_data.iterrows():       
        yield {'text': item['pure_blog'], 'label': item['label1'], 'qid': item['tweet_id']}

test_ds =  load_dataset(read_test, pd_data=test,lazy=False)
# 在转换为MapDataset类型
test_ds = MapDataset(test_ds)
print(len(test_ds))

1800


In [4]:
def convert_example(example,
                    tokenizer,
                    max_seq_length=512,
                    is_test=False):
   
    # 将原数据处理成model可读入的格式，enocded_inputs是一个dict，包含input_ids、token_type_ids等字段
    encoded_inputs = tokenizer(
        text=example["text"], max_seq_len=max_seq_length)

    # input_ids：对文本切分token后，在词汇表中对应的token id
    input_ids = encoded_inputs["input_ids"]
    # token_type_ids：当前token属于句子1还是句子2，即上述图中表达的segment ids
    token_type_ids = encoded_inputs["token_type_ids"]

    if not is_test:
        # label：情感极性类别
        label = np.array([example["label"]], dtype="int64")
        return input_ids, token_type_ids, label
    else:
        # qid：每条数据的编号
        qid = np.array([example["qid"]], dtype="int64")
        return input_ids, token_type_ids, qid

In [5]:
def create_dataloader(dataset,
                      trans_fn=None,
                      mode='train',
                      batch_size=1,
                      batchify_fn=None):
    
    if trans_fn:
        dataset = dataset.map(trans_fn)

    shuffle = True if mode == 'train' else False
    if mode == "train":
        sampler = paddle.io.DistributedBatchSampler(
            dataset=dataset, batch_size=batch_size, shuffle=shuffle)
    else:
        sampler = paddle.io.BatchSampler(
            dataset=dataset, batch_size=batch_size, shuffle=shuffle)
    dataloader = paddle.io.DataLoader(
        dataset, batch_sampler=sampler, collate_fn=batchify_fn)
    return dataloader

In [6]:
# 指定模型名称一键加载模型
from paddlenlp.transformers import SkepForSequenceClassification, SkepTokenizer

model = SkepForSequenceClassification.from_pretrained(
    'skep_ernie_1.0_large_ch', num_classes=  3)
# 指定模型名称一键加载tokenizer
tokenizer = SkepTokenizer.from_pretrained('skep_ernie_1.0_large_ch')

[2022-08-03 14:28:12,079] [    INFO] - Downloading https://paddlenlp.bj.bcebos.com/models/transformers/skep/skep_ernie_1.0_large_ch.pdparams and saved to /home/aistudio/.paddlenlp/models/skep_ernie_1.0_large_ch
[2022-08-03 14:28:12,083] [    INFO] - Downloading skep_ernie_1.0_large_ch.pdparams from https://paddlenlp.bj.bcebos.com/models/transformers/skep/skep_ernie_1.0_large_ch.pdparams
100%|██████████| 1238309/1238309 [00:38<00:00, 31768.72it/s]
W0803 14:28:51.184937   145 device_context.cc:404] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.2, Runtime API Version: 10.1
W0803 14:28:51.189386   145 device_context.cc:422] device: 0, cuDNN Version: 7.6.
[2022-08-03 14:28:59,631] [    INFO] - Downloading skep_ernie_1.0_large_ch.vocab.txt from https://paddlenlp.bj.bcebos.com/models/transformers/skep/skep_ernie_1.0_large_ch.vocab.txt
100%|██████████| 55/55 [00:00<00:00, 3404.22it/s]


In [7]:
from functools import partial
import numpy as np
import paddle
import paddle.nn.functional as F
from paddlenlp.data import Stack, Tuple, Pad
batch_size=16
max_seq_length=256
# 处理测试集数据
trans_func = partial(
    convert_example,
    tokenizer=tokenizer,
    max_seq_length=max_seq_length,
    is_test=True)
batchify_fn = lambda samples, fn=Tuple(
    Pad(axis=0, pad_val=tokenizer.pad_token_id),  # input
    Pad(axis=0, pad_val=tokenizer.pad_token_type_id),  # segment
    Stack() # qid
): [data for data in fn(samples)]
test_data_loader = create_dataloader(
    test_ds,
    mode='test',
    batch_size=batch_size,
    batchify_fn=batchify_fn,
    trans_fn=trans_func)


In [8]:
# 加载模型
import os

# 根据实际运行情况，更换加载的参数路径
params_path = 'best_checkpoint/best_model.pdparams'
if params_path and os.path.isfile(params_path):
    # 加载模型参数
    state_dict = paddle.load(params_path)
    model.set_dict(state_dict)
    print("Loaded parameters from %s" % params_path)

Loaded parameters from best_checkpoint/best_model.pdparams


In [9]:
results = []
# 切换model模型为评估模式，关闭dropout等随机因素
model.eval()
for batch in test_data_loader:
    input_ids, token_type_ids, qids = batch
    # 喂数据给模型
    logits = model(input_ids, token_type_ids)
    # 预测分类
    probs = F.softmax(logits, axis=-1)
    idx = paddle.argmax(probs, axis=1).numpy()
    idx = idx.tolist()
    qids = qids.numpy().tolist()
    results.extend(zip(qids, idx))

In [12]:
results[2]

([4], 2)

In [10]:
from collections import Counter
true_label = list(test['label1'])
predict_label = [i[1] for i in results]

#==
def get_precision(true_label,predict_label,label_index):
        if label_index == 0:
             true_label0_index = [i for i in range(len(true_label)) if true_label[i]==0]
             recognize_label = [predict_label[i] for i in true_label0_index] 
             recognize_label_ = dict(Counter(recognize_label))
             precision_0 = recognize_label_[0]/len(true_label0_index)
             print('precision_0:',precision_0)
             return precision_0
        elif label_index == 1:
            true_label1_index = [i for i in range(len(true_label)) if true_label[i]==1]
            recognize_label = [predict_label[i] for i in true_label1_index] 
            recognize_label_ = dict(Counter(recognize_label))
            precision_1 = recognize_label_[1]/len(true_label1_index)
            print('precision_1:',precision_1)
            return precision_1
        elif label_index == 2:
            true_label2_index = [i for i in range(len(true_label)) if true_label[i]==2]
            recognize_label = [predict_label[i] for i in true_label2_index] 
            recognize_label_ = dict(Counter(recognize_label))
            precision_2 = recognize_label_[2]/len(true_label2_index)
            print('precision_2:',precision_2)
            return precision_2
#===
precision_0 = get_precision(true_label,predict_label,0)
precision_1 = get_precision(true_label,predict_label,1)
precision_2 = get_precision(true_label,predict_label,2)
average_precision = (precision_0+precision_1+precision_2)/3
print('average_precision:',average_precision)

precision_0: 0.9138349514563107
precision_1: 0.8312611012433393
precision_2: 0.8861985472154964
average_precision: 0.8770981999717155


<font face="times">
  
# 5. Prediction of other remaining data using fine-tuned model

In [12]:
#=================================================================================
# 数据读取
import pandas as pd
import numpy as np
from paddlenlp.datasets import load_dataset
from paddle.io import Dataset, Subset
from paddlenlp.datasets import MapDataset
def convert_example(example,
                    tokenizer,
                    max_seq_length=512,
                    is_test=False):
   
    # 将原数据处理成model可读入的格式，enocded_inputs是一个dict，包含input_ids、token_type_ids等字段
    encoded_inputs = tokenizer(
        text=example["text"], max_seq_len=max_seq_length)

    # input_ids：对文本切分token后，在词汇表中对应的token id
    input_ids = encoded_inputs["input_ids"]
    # token_type_ids：当前token属于句子1还是句子2，即上述图中表达的segment ids
    token_type_ids = encoded_inputs["token_type_ids"]

    if not is_test:
        # label：情感极性类别
        label = np.array([example["label"]], dtype="int64")
        return input_ids, token_type_ids, label
    else:
        # qid：每条数据的编号
        qid = np.array([example["qid"]], dtype="int64")
        return input_ids, token_type_ids, qid
#======================================================================
def create_dataloader(dataset,
                      trans_fn=None,
                      mode='train',
                      batch_size=1,
                      batchify_fn=None):
    
    if trans_fn:
        dataset = dataset.map(trans_fn)

    shuffle = True if mode == 'train' else False
    if mode == "train":
        sampler = paddle.io.DistributedBatchSampler(
            dataset=dataset, batch_size=batch_size, shuffle=shuffle)
    else:
        sampler = paddle.io.BatchSampler(
            dataset=dataset, batch_size=batch_size, shuffle=shuffle)
    dataloader = paddle.io.DataLoader(
        dataset, batch_sampler=sampler, collate_fn=batchify_fn)
    return dataloader
#======================================================================
# 指定模型名称一键加载模型
from paddlenlp.transformers import SkepForSequenceClassification, SkepTokenizer

model = SkepForSequenceClassification.from_pretrained(
    'skep_ernie_1.0_large_ch', num_classes=  3)
# 指定模型名称一键加载tokenizer
tokenizer = SkepTokenizer.from_pretrained('skep_ernie_1.0_large_ch')

#======================================================================
# 加载模型
import os
from functools import partial
import numpy as np
import paddle
import paddle.nn.functional as F
from paddlenlp.data import Stack, Tuple, Pad

# 根据实际运行情况，更换加载的参数路径
params_path = 'best_checkpoint/best_model.pdparams'
if params_path and os.path.isfile(params_path):
    # 加载模型参数
    state_dict = paddle.load(params_path)
    model.set_dict(state_dict)
    print("Loaded parameters from %s" % params_path)
#======================================================================

def get_label_tocsv(policy_name):
    import numpy as np
    print(policy_name)
    test = pd.read_csv('comment_{}.csv'.format(policy_name))
    #===增加一列
    test.insert(0,'tweet_id',np.array([i for i in range(len(test))]))
    #===
    def read_test(pd_data):
        for index, item in pd_data.iterrows():       
            yield {'text': item['pure_blog'], 'qid': item['tweet_id']}

    test_ds =  load_dataset(read_test, pd_data=test,lazy=False)
    # 在转换为MapDataset类型
    test_ds = MapDataset(test_ds)
    #======================================================================
    from functools import partial
    import numpy as np
    import paddle
    import paddle.nn.functional as F
    from paddlenlp.data import Stack, Tuple, Pad
    batch_size=16
    max_seq_length=256
    # 处理测试集数据
    trans_func = partial(
    convert_example,
    tokenizer=tokenizer,
    max_seq_length=max_seq_length,
    is_test=True)
    batchify_fn = lambda samples, fn=Tuple(
    Pad(axis=0, pad_val=tokenizer.pad_token_id),  # input
    Pad(axis=0, pad_val=tokenizer.pad_token_type_id),  # segment
    Stack() # qid
    ): [data for data in fn(samples)]
    test_data_loader = create_dataloader(
    test_ds,
    mode='test',
    batch_size=batch_size,
    batchify_fn=batchify_fn,
    trans_fn=trans_func)

    #============================================
    results = []
    # 切换model模型为评估模式，关闭dropout等随机因素
    model.eval()
    for batch in test_data_loader:
        input_ids, token_type_ids, qids = batch
        # 喂数据给模型
        logits = model(input_ids, token_type_ids)
        # 预测分类
        probs = F.softmax(logits, axis=-1)
        idx = paddle.argmax(probs, axis=1).numpy()
        idx = idx.tolist()
        qids = qids.numpy().tolist()
        results.extend(zip(qids, idx))
    #============================================
    import numpy as np
    predict_label = [i[1] for i in results]
    test.insert(0,'predict_label',np.array([i[1] for i in results]))
    test.to_csv('comment_{}_label.csv'.format(policy_name), index=False,encoding='utf-8-sig')


[2022-08-03 16:03:13,529] [    INFO] - Already cached /home/aistudio/.paddlenlp/models/skep_ernie_1.0_large_ch/skep_ernie_1.0_large_ch.pdparams
[2022-08-03 16:03:17,410] [    INFO] - Found /home/aistudio/.paddlenlp/models/skep_ernie_1.0_large_ch/skep_ernie_1.0_large_ch.vocab.txt


Loaded parameters from best_checkpoint/best_model.pdparams


In [13]:
#======================================================================
get_label_tocsv('policy1')
get_label_tocsv('policy2')
get_label_tocsv('policy3')
get_label_tocsv('policy4')
get_label_tocsv('policy5')
get_label_tocsv('policy6')
get_label_tocsv('policy7')
get_label_tocsv('policy9')
get_label_tocsv('policy10')
get_label_tocsv('policy11')


policy1
policy2
policy3
policy4
policy5
policy6
policy7
policy9
policy10
policy11


In [14]:
!head comment_policy1_label.csv

﻿predict_label,tweet_id,policy_id,comment_time,pure_blog
1,0,policy1,2020-01-23 17:03:00,通报湖北省地级市通报湖北省地级市通报湖北省地级市通报湖北省地级市通报湖北省地级市通报湖北省地级市通报湖北省地级市通报湖北省地级市通报湖北省地级市通报湖北省地级市通报湖北省地级市通报湖北省地级市通报湖北省地级市通报湖北省地级市通报湖北省地级市通报湖北省地级市通报湖北省地级市
2,1,policy1,2020-01-23 17:04:00,孝感孝感孝感
1,2,policy1,2020-01-23 17:04:00,宜昌宜昌宜昌宜昌宜昌
1,3,policy1,2020-01-23 17:04:00,荆门！孝感！黄石！
0,4,policy1,2020-01-23 17:04:00,十堰！这些地方为什么不动态监控加强防护
1,5,policy1,2020-01-23 17:05:00,孝感孝感孝感孝感孝感孝感孝感孝感孝感孝感孝感孝感孝感孝感孝感孝感孝感孝感孝感孝感孝感孝感孝感孝感孝感孝感孝感孝感
0,6,policy1,2020-01-23 17:04:00,通报湖北，亲人在十堰，我们真的太需要多一些了解了！！！为什么还不能通报各市详细情况！！！！！！
0,7,policy1,2020-01-23 17:07:00,孝感孝感！孝汉城铁一天那么多趟！一例都没有通报我打死都不信！除了吃湖北三分之一经济的武汉难道其他地级市不配有名字？
2,8,policy1,2020-01-23 17:05:00,孝感孝感孝感孝感！！！！
