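You want to generate word-vector representations from a BERT model and use them, together with the labels, to train a classification model. If the large-batch processing code provided earlier did not include the label information in the generated dataset, that may be the cause of the problem. You can adapt the large-batch code using the strategy from the small-batch example, so that every generated .npy file contains both the features and the labels.

Here is an example of how to modify your large-batch data generation and loading strategy so that it includes the label information: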

In [2]:
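# 1. Modify the code that saves the BERT features and labels
# Save the corresponding labels together with the BERT vectors, so that every batch loaded for training comes with its labels.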
import pandas as pd
import numpy as np
from transformers import BertTokenizer, TFBertModel
from sklearn.preprocessing import LabelEncoder
import os

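# Batch size and uniform vector dimensions
BATCH_SIZE = 8
UNIFORM_LENGTH = 512  # every sequence is padded or truncated to this many tokens
FEATURE_DIM = 768     # hidden size of the BERT base model
batch_size = BATCH_SIZE  # lowercase alias used by the cells below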

# Initialize the BERT tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert_model = TFBertModel.from_pretrained('bert-base-uncased')

# Load the data
df = pd.read_csv("data_syn_with_label.csv", encoding='utf-8')
texts = df['text'].astype(str).tolist()
labels = df['label'].tolist()  # assumes the label column is named 'label'

# Encode the labels as integers
label_encoder = LabelEncoder()
encoded_labels = label_encoder.fit_transform(labels)

# Create the output directory
vector_dir = 'bert_vectors'
os.makedirs(vector_dir, exist_ok=True)

def batch_encode_and_save(texts, labels, batch_size):
    """Encode the texts with BERT in batches and save each batch together with its labels."""
    for i in range(0, len(texts), batch_size):
        batch_texts = texts[i:i+batch_size]
        batch_labels = np.asarray(labels[i:i+batch_size])
        encoded = tokenizer(batch_texts, padding='max_length', truncation=True, max_length=UNIFORM_LENGTH, return_tensors="tf")
        outputs = bert_model(encoded['input_ids'], attention_mask=encoded['attention_mask'])
        vectors = outputs.last_hidden_state.numpy()[:, :UNIFORM_LENGTH, :]  # token-level vectors

        # Save the vectors and labels together (np.save pickles the dict)
        batch_data = {'features': vectors, 'labels': batch_labels}
        file_name = os.path.join(vector_dir, f'batch_{i//batch_size:04d}.npy')
        np.save(file_name, batch_data)
        print(f"Batch {i//batch_size} saved.")

batch_encode_and_save(texts, encoded_labels, batch_size)


Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertModel: ['cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias']
- This IS expected if you are initializing TFBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFBertModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions w

Batch 0 saved.
Batch 1 saved.
Batch 2 saved.
...
Batch 62 saved.
[output truncated]
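
A note on the storage format used above: calling np.save on a dict stores it as a pickled 0-d object array, which is why reading it back needs allow_pickle=True followed by .item(). The sketch below shows that round trip on tiny made-up arrays, plus a possible alternative with np.savez that keeps the features and labels as plain named arrays without pickling. The file names and array sizes here are assumptions for illustration only.

```python
import os
import tempfile

import numpy as np

# Tiny stand-in batch (assumption: 2 samples, 4 tokens, 3 features instead of 8 x 512 x 768)
features = np.arange(24, dtype=np.float32).reshape(2, 4, 3)
labels = np.array([0, 1], dtype=np.int32)

tmp_dir = tempfile.mkdtemp()

# Variant 1: save a dict, as in the cell above. NumPy pickles it into a 0-d object array.
dict_path = os.path.join(tmp_dir, "batch_0000.npy")
np.save(dict_path, {"features": features, "labels": labels})
loaded = np.load(dict_path, allow_pickle=True).item()  # .item() unwraps the dict

# Variant 2: save named arrays with np.savez; no pickling is involved.
npz_path = os.path.join(tmp_dir, "batch_0000.npz")
np.savez(npz_path, features=features, labels=labels)
with np.load(npz_path) as archive:
    npz_features = archive["features"]
    npz_labels = archive["labels"]

print(loaded["features"].shape, loaded["labels"].shape)
print(npz_features.shape, npz_labels.shape)
```

Both variants return the same arrays; the second one can be loaded with the default allow_pickle=False.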

In [3]:
import tensorflow as tf
# 2. Modify the data loader to read both features and labels
# The loader reads the features and labels from each .npy file and yields them together.
def data_generator(file_paths, batch_size):
    for file_path in file_paths:
        print("Loading file:", file_path)  # debug output
        # allow_pickle=True is needed because each file stores a pickled dict
        batch_data = np.load(file_path, allow_pickle=True).item()
        features = batch_data['features']
        labels = np.asarray(batch_data['labels'], dtype=np.int32)  # match the declared tf.int32
        # Split the file contents into chunks of batch_size
        for i in range(0, len(features), batch_size):
            print("Loaded data shape:", features.shape, labels.shape)  # debug output
            yield features[i:i+batch_size], labels[i:i+batch_size]


def load_dataset(file_paths, batch_size):
    # The fixed shapes below require every file to hold exactly batch_size samples
    dataset = tf.data.Dataset.from_generator(
        lambda: data_generator(file_paths, batch_size),
        output_signature=(
            tf.TensorSpec(shape=(batch_size, UNIFORM_LENGTH, FEATURE_DIM), dtype=tf.float32),
            tf.TensorSpec(shape=(batch_size,), dtype=tf.int32),
        ),
    )
    return dataset.prefetch(tf.data.AUTOTUNE)
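
Because the dataset declares a fixed batch dimension, a file holding fewer than batch_size rows would not match the declared shape. One possible way to guard against that inside the generator, instead of removing short files by hand, is to skip incomplete chunks. The sketch below uses in-memory arrays rather than .npy files, and `full_batches` is a hypothetical helper that is not part of the original code.

```python
import numpy as np


def full_batches(features, labels, batch_size):
    """Yield (features, labels) chunks, skipping a trailing chunk smaller than batch_size."""
    for start in range(0, len(features), batch_size):
        chunk_x = features[start:start + batch_size]
        chunk_y = labels[start:start + batch_size]
        if len(chunk_x) < batch_size:
            continue  # incomplete chunk: it would not match a fixed batch dimension
        yield chunk_x, chunk_y


# Stand-in data (assumption): 10 samples with small feature dimensions
features = np.zeros((10, 4, 3), dtype=np.float32)
labels = np.arange(10, dtype=np.int32)

batches = list(full_batches(features, labels, batch_size=4))
print(len(batches), [x.shape for x, _ in batches])
```

With 10 samples and a batch size of 4, the last 2 samples are dropped and only the two full chunks are yielded.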



In [4]:
# Split the dataset
from sklearn.model_selection import train_test_split
files = [os.path.join(vector_dir, file) for file in sorted(os.listdir(vector_dir)) if file.endswith('.npy')]
# Drop the last file if it holds fewer than BATCH_SIZE samples
sample_data = np.load(files[-1], allow_pickle=True).item()
if sample_data['features'].shape[0] < BATCH_SIZE:
    files = files[:-1]

# Proportions of the training, validation, and test sets
train_size = 0.7
val_size = 0.15
test_size = 0.15  # Note: train_size + val_size + test_size should be 1

# Split the file list into training, validation, and test files
train_files, test_files = train_test_split(files, test_size=test_size, random_state=42)
train_files, val_files = train_test_split(train_files, test_size=val_size / (train_size + val_size), random_state=42)

# train_files, val_files, and test_files now hold the three file lists
print(f"Train files: {len(train_files)}")
print(f"Validation files: {len(val_files)}")
print(f"Test files: {len(test_files)}")

# Create the datasets
train_dataset = load_dataset(train_files, batch_size)
val_dataset = load_dataset(val_files, batch_size)
test_dataset = load_dataset(test_files, batch_size)

print("Training dataset:", train_dataset)

Train files: 695
Validation files: 150
Test files: 150
Training dataset: <_PrefetchDataset element_spec=(TensorSpec(shape=(8, 512, 768), dtype=tf.float32, name=None), TensorSpec(shape=(8,), dtype=tf.int32, name=None))>
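
The two-step split can be checked by hand. The sketch below assumes that a fractional test size is rounded up to a whole number of files and that the remaining files go to the other side, which is consistent with the counts printed above. `split_counts` is a hypothetical helper used only for this check.

```python
import math


def split_counts(n_files, test_fraction):
    """Return (n_remaining, n_held_out), assuming the held-out count is rounded up."""
    n_held_out = math.ceil(n_files * test_fraction)
    return n_files - n_held_out, n_held_out


train_size, val_size, test_size = 0.7, 0.15, 0.15
total_files = 695 + 150 + 150  # sum of the counts printed above

# First split: hold out the test files
remaining, n_test = split_counts(total_files, test_size)
# Second split: val_size is rescaled because it now applies to the remaining 85%
n_train, n_val = split_counts(remaining, val_size / (train_size + val_size))

print(n_train, n_val, n_test)
```

The rescaling in the second split is what keeps the validation set at about 15% of all files rather than 15% of the remaining 85%.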


In [5]:
for features, labels in train_dataset.take(1):
    print("Features shape:", features.shape)
    print("Labels shape:", labels.shape)

Loading file: bert_vectors\batch_0640.npy
Loaded data shape: (8, 512, 768) (8,)
Loading file: bert_vectors\batch_0479.npy
Features shape: (8, 512, 768)
Labels shape: (8,)
Loaded data shape: (8, 512, 768) (8,)
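
The shapes printed above also determine how much storage the saved token-level features need. A rough back-of-the-envelope calculation follows, assuming float32 values (4 bytes each), one batch per file, and the 995 full files implied by the split counts; labels and pickle overhead are ignored.

```python
BATCH_SIZE = 8
UNIFORM_LENGTH = 512
FEATURE_DIM = 768
BYTES_PER_FLOAT32 = 4

bytes_per_batch = BATCH_SIZE * UNIFORM_LENGTH * FEATURE_DIM * BYTES_PER_FLOAT32
n_files = 995  # assumption: 695 + 150 + 150 files from the split
total_bytes = bytes_per_batch * n_files

print(f"per batch: {bytes_per_batch / 2**20:.1f} MiB")
print(f"all files: {total_bytes / 2**30:.1f} GiB")
```

Under these assumptions each file is about 12 MiB, so the full feature set takes on the order of 12 GiB on disk.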


In [6]:
# A single CNN model
import tensorflow as tf
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Conv1D, MaxPooling1D, concatenate, Dense, Dropout, Flatten
def create_cnn_model(input_shape, num_classes):
    # Input layer
    input_layer = Input(shape=input_shape, name='input_layer')

    # Convolution and pooling branches with kernel sizes 3, 4, and 5
    conv_3 = Conv1D(filters=128, kernel_size=3, activation='relu', padding='same', name='conv_3x1')(input_layer)
    pool_3 = MaxPooling1D(pool_size=2, padding='same', name='maxpool_3')(conv_3)

    conv_4 = Conv1D(filters=128, kernel_size=4, activation='relu', padding='same', name='conv_4x1')(input_layer)
    pool_4 = MaxPooling1D(pool_size=2, padding='same', name='maxpool_4')(conv_4)

    conv_5 = Conv1D(filters=128, kernel_size=5, activation='relu', padding='same', name='conv_5x1')(input_layer)
    pool_5 = MaxPooling1D(pool_size=2, padding='same', name='maxpool_5')(conv_5)

    # Concatenate the branch outputs along the channel axis
    concatenated = concatenate([pool_3, pool_4, pool_5], axis=-1)

    # Flatten, then apply a fully connected layer
    flatten = Flatten()(concatenated)
    dense = Dense(128, activation='relu', name='dense_layer')(flatten)
    # Dropout layer
    dropout = Dropout(0.5, name='dropout')(dense)
    # Output layer
    output_layer = Dense(num_classes, activation='softmax', name='output_layer')(dropout)
    # Build the model
    model = Model(inputs=input_layer, outputs=output_layer)
    return model


# Input dimensions of the model
input_shape = (UNIFORM_LENGTH, FEATURE_DIM)  # adjust to match your data
num_classes = 2  # binary classification

# Create the model
model = create_cnn_model(input_shape, num_classes)

# Compile the model
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Print the model summary
model.summary()
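
To see where most of this model's weights sit, the layer sizes can be worked out by hand. The sketch below assumes the standard Conv1D and Dense parameter counts (weights plus one bias per output unit) and 'same' padding, so that pooling with size 2 halves the 512-step sequence to 256. The helper names are hypothetical.

```python
import math

UNIFORM_LENGTH, FEATURE_DIM = 512, 768
FILTERS, DENSE_UNITS, NUM_CLASSES = 128, 128, 2


def conv1d_params(kernel_size, in_channels, filters):
    """Weights plus biases of one Conv1D layer."""
    return kernel_size * in_channels * filters + filters


def dense_params(in_units, out_units):
    """Weights plus biases of one Dense layer."""
    return in_units * out_units + out_units


conv_total = sum(conv1d_params(k, FEATURE_DIM, FILTERS) for k in (3, 4, 5))
pooled_length = math.ceil(UNIFORM_LENGTH / 2)   # MaxPooling1D(pool_size=2, padding='same')
flat_units = pooled_length * FILTERS * 3        # three branches concatenated on the channel axis
dense_total = dense_params(flat_units, DENSE_UNITS)
output_total = dense_params(DENSE_UNITS, NUM_CLASSES)

total = conv_total + dense_total + output_total
print(flat_units, conv_total, dense_total, output_total, total)
```

Under these assumptions the Dense layer that follows Flatten accounts for roughly 12.6 million of the 13.8 million parameters.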


In [7]:
# Train the model created above
# Note: this assumes train_dataset is a TensorFlow dataset that yields (features, labels) pairs
model.fit(train_dataset, epochs=10, validation_data=val_dataset)

Epoch 1/10
Loading file: bert_vectors\batch_0640.npy
Loaded data shape: (8, 512, 768) (8,)
Loading file: bert_vectors\batch_0479.npy
Loaded data shape: (8, 512, 768) (8,)
      1/Unknown 1s 1s/step - accuracy: 0.7500 - loss: 0.3864
Loading file: bert_vectors\batch_0291.npy
Loaded data shape: (8, 512, 768) (8,)
      2/Unknown 1s 81ms/step - accuracy: 0.7812 - loss: 6.9369
Loading file: bert_vectors\batch_0433.npy
Loaded data shape: (8, 512, 768) (8,)
      3/Unknown 1s 80ms/step - accuracy: 0.8125 - loss: 7.6218
Loading file: bert_vectors\batch_0523.npy
Loaded data shape: (8, 512, 768) (8,)
      4/Unknown 1s 82ms/step - accuracy: 0.8359 - loss: 7.4023
Loading file: bert_vectors\batch_0159.npy
Loaded data shape: (8, 512, 768) (8,)
      5/Unknown 1s 81ms/step - accuracy: 0.8438 - loss: 7.6811
Loading file: bert_vectors\batch_0578.npy
Loaded data shape: (8, 512, 768) (8,)
      6/Unknown 2s 81ms/step - accuracy: 0.8351 - loss: 7.7000
Loading fil

  self.gen.throw(value)


Loaded data shape: (8, 512, 768) (8,)
Loading file: bert_vectors\batch_0295.npy
Loaded data shape: (8, 512, 768) (8,)
Loading file: bert_vectors\batch_0899.npy
Loaded data shape: (8, 512, 768) (8,)
Loading file: bert_vectors\batch_0921.npy
Loaded data shape: (8, 512, 768) (8,)
Loading file: bert_vectors\batch_0189.npy
Loaded data shape: (8, 512, 768) (8,)
Loading file: bert_vectors\batch_0989.npy
Loaded data shape: (8, 512, 768) (8,)
Loading file: bert_vectors\batch_0480.npy
Loaded data shape: (8, 512, 768) (8,)
Loading file: bert_vectors\batch_0593.npy
Loaded data shape: (8, 512, 768) (8,)
Loading file: bert_vectors\batch_0879.npy
Loaded data shape: (8, 512, 768) (8,)
Loading file: bert_vectors\batch_0942.npy
Loaded data shape: (8, 512, 768) (8,)
Loading file: bert_vectors\batch_0458.npy
Loaded data shape: (8, 512, 768) (8,)
Loading file: bert_vectors\batch_0356.npy
Loaded data shape: (8, 512, 768) (8,)
Loading file: bert_vectors\batch_0297.npy
Loaded data shape: (8, 512, 768) (8,)
Lo

<keras.src.callbacks.history.History at 0x19886e50800>

In [8]:
# Evaluate the model
# Use the validation set (the val_dataset created earlier) to measure the model's loss and accuracy on validation data.
val_loss, val_accuracy = model.evaluate(val_dataset)
print(f"Validation Loss: {val_loss}")
print(f"Validation Accuracy: {val_accuracy}")

Loading file: bert_vectors\batch_0962.npy
Loaded data shape: (8, 512, 768) (8,)
Loading file: bert_vectors\batch_0173.npy
Loaded data shape: (8, 512, 768) (8,)
      1/Unknown 0s 116ms/step - accuracy: 1.0000 - loss: 0.0013
Loading file: bert_vectors\batch_0709.npy
Loaded data shape: (8, 512, 768) (8,)
Loading file: bert_vectors\batch_0295.npy
Loaded data shape: (8, 512, 768) (8,)
Loading file: bert_vectors\batch_0899.npy
Loaded data shape: (8, 512, 768) (8,)
      4/Unknown 0s 18ms/step - accuracy: 1.0000 - loss: 0.0096
Loading file: bert_vectors\batch_0921.npy
Loaded data shape: (8, 512, 768) (8,)
Loading file: bert_vectors\batch_0189.npy
Loaded data shape: (8, 512, 768) (8,)
Loading file: bert_vectors\batch_0989.npy
Loaded data shape: (8, 512, 768) (8,)
      7/Unknown 0s 18ms/step - accuracy: 1.0000 - loss: 0.0114
Loading file: bert_vectors\batch_0480.npy
Loaded data shape: (8, 512, 768) (8,)
Loading file: bert_vectors\batch_0593.npy
Loaded data shape: (8, 512

In [9]:
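# Test the model
# Use the test set (test_dataset) to check how well the model generalizes. This is an important step for judging performance on data the model has not seen.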
test_loss, test_accuracy = model.evaluate(test_dataset)
print(f"Test Loss: {test_loss}")
print(f"Test Accuracy: {test_accuracy}")


Loading file: bert_vectors\batch_0920.npy
Loaded data shape: (8, 512, 768) (8,)
Loading file: bert_vectors\batch_0525.npy
Loaded data shape: (8, 512, 768) (8,)
      1/Unknown 0s 59ms/step - accuracy: 1.0000 - loss: 0.0097
Loading file: bert_vectors\batch_0567.npy
Loaded data shape: (8, 512, 768) (8,)
Loading file: bert_vectors\batch_0657.npy
Loaded data shape: (8, 512, 768) (8,)
Loading file: bert_vectors\batch_0633.npy
Loaded data shape: (8, 512, 768) (8,)
Loading file: bert_vectors\batch_0429.npy
      4/Unknown 0s 18ms/step - accuracy: 0.9297 - loss: 0.2465
Loaded data shape: (8, 512, 768) (8,)
Loading file: bert_vectors\batch_0857.npy
Loaded data shape: (8, 512, 768) (8,)
Loading file: bert_vectors\batch_0712.npy
Loaded data shape: (8, 512, 768) (8,)
Loading file: bert_vectors\batch_0174.npy
      7/Unknown 0s 19ms/step - accuracy: 0.9300 - loss: 0.2284
Loaded data shape: (8, 512, 768) (8,)
Loading file: bert_vectors\batch_0604.npy
Loaded data shape: (8, 512, 

In [None]:
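# Model prediction
# Predicting on a few new instances helps show how the model behaves in practice.
# Supply your own data here: an array of BERT features with shape (n, UNIFORM_LENGTH, FEATURE_DIM).
# Assumption for illustration only: reuse one batch of test features as stand-in "new" data.
new_data, _ = next(iter(test_dataset))

# Predict on the new data
predictions = model.predict(new_data)
predicted_classes = np.argmax(predictions, axis=1)
print("Predictions:", predicted_classes)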
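
model.predict returns one row of class probabilities per sample, and np.argmax(..., axis=1) picks the most likely class index in each row. A minimal sketch with made-up probabilities follows; the numbers and the class names are assumptions for illustration, not output from the model.

```python
import numpy as np

# Hypothetical softmax outputs for 3 samples and 2 classes
predictions = np.array([
    [0.9, 0.1],
    [0.2, 0.8],
    [0.6, 0.4],
])

predicted_classes = np.argmax(predictions, axis=1)  # index of the largest value in each row
confidence = predictions.max(axis=1)                # probability of the chosen class

# Hypothetical mapping from the encoded class index back to the original label
class_names = np.array(["negative", "positive"])
predicted_labels = class_names[predicted_classes]

print(predicted_classes, confidence, predicted_labels)
```

In the notebook itself, label_encoder.inverse_transform(predicted_classes) would play the role of the hypothetical class_names lookup.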


In [10]:
# Save the model (recent Keras versions treat .h5 as a legacy format and recommend .keras)
model.save('trained_cnn_model.h5')
print("Model saved successfully.")

# Load the model
loaded_model = tf.keras.models.load_model('trained_cnn_model.h5')




Model saved successfully.
