**BERT is designed** to interpret each word in light of its surrounding context. Punctuation and special characters are part of that context and can give BERT's attention mechanism useful cues. We therefore use the `full_text` column without lemmatization, stemming, or removal of stop words, punctuation, or special characters.
- The code loads the dataset and lowercases the `full_text` column. (The `bert-base-uncased` tokenizer also lowercases its input by default, so this step is redundant but harmless.)
- It then tokenizes each text with the `bert-base-uncased` tokenizer, truncating or padding it to a fixed length of 512 tokens.
- It extracts the 768-dimensional embedding of the `[CLS]` token for each text. The `[CLS]` embedding, rather than the full sequence of per-token embeddings, serves as a fixed-size summary of the text for classification-style tasks such as essay scoring.
- Finally, it collects these embeddings into a new DataFrame and exports it as `bert_features.csv`.
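
To make the difference between the `[CLS]` embedding and the full per-token output concrete, here is a minimal NumPy sketch. The small array stands in for `outputs.last_hidden_state` (its values and its 2×4×3 shape are made up for illustration; the real output is batch × 512 × 768). Masked mean pooling is shown only as a point of comparison; it is an assumption of this sketch and is not what the notebook uses.

```python
import numpy as np

# Stand-in for `outputs.last_hidden_state`: (batch, seq_len, hidden) = (2, 4, 3).
# Illustrative values only; bert-base-uncased would give (batch, 512, 768).
last_hidden_state = np.arange(24, dtype=float).reshape(2, 4, 3)

# 1 marks real tokens, 0 marks padding (same role as the tokenizer's attention_mask).
attention_mask = np.array([[1, 1, 1, 0],
                           [1, 1, 0, 0]])

# [CLS] pooling: keep only the vector at position 0 of every sequence.
cls_embedding = last_hidden_state[:, 0, :]

# Masked mean pooling (alternative, not used in this notebook):
# average the token vectors over non-padding positions only.
mask = attention_mask[:, :, None]
mean_embedding = (last_hidden_state * mask).sum(axis=1) / mask.sum(axis=1)

print(cls_embedding.shape, mean_embedding.shape)  # (2, 3) (2, 3)
```

Either way, each text is reduced to a single fixed-size vector, which is what allows the embeddings to be stored as one row per essay.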

**Loss of Information**:
- Texts longer than 512 tokens (all essays with score=6 and the majority with score=5) are cut off at this limit. Their embeddings may therefore give an incomplete picture of the essay's quality and coherence, both of which are crucial for essay scoring.
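
One common way to reduce this loss is to split a long token sequence into overlapping windows and embed each window separately. The sketch below is a hedged illustration of that idea, not something this notebook does: the helper name `chunk_token_ids` and the default `stride` of 128 are hypothetical choices, and the window size of 510 assumes two of BERT's 512 positions are reserved for `[CLS]` and `[SEP]`.

```python
def chunk_token_ids(token_ids, max_tokens=510, stride=128):
    """Split token ids into overlapping windows of at most `max_tokens`.

    510 leaves room for [CLS] and [SEP] within BERT's 512-position limit.
    `stride` is the number of tokens shared by consecutive windows.
    (Hypothetical helper for illustration; not part of the notebook.)
    """
    if not 0 <= stride < max_tokens:
        raise ValueError("stride must satisfy 0 <= stride < max_tokens")
    if len(token_ids) <= max_tokens:
        return [list(token_ids)]
    step = max_tokens - stride
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(list(token_ids[start:start + max_tokens]))
        # Stop once a window has reached the end of the sequence.
        if start + max_tokens >= len(token_ids):
            break
    return chunks


# A 1000-token essay becomes three windows that together cover every token.
chunks = chunk_token_ids(list(range(1000)))
print([len(c) for c in chunks])  # [510, 510, 236]
```

Each window could then be embedded on its own and the per-window `[CLS]` vectors averaged into one 768-dimensional vector. Whether that improves scoring on this dataset is untested here.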


In [1]:
import pandas as pd
import numpy as np
from transformers import BertTokenizer, BertModel
import torch
from tqdm import tqdm

# Load the dataset
df = pd.read_csv('transformed_data_v1.csv')

# Use the full_text column
text_column = 'full_text'

# Convert text to lowercase
df[text_column] = df[text_column].str.lower()

# Choose the BERT model (uncased for this example)
model_name = 'bert-base-uncased'
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertModel.from_pretrained(model_name)

# Function to get the BERT [CLS] embedding for a single text
def get_bert_embedding(text, tokenizer, model):
    # Tokenize, truncating or padding to exactly 512 tokens
    inputs = tokenizer(text, return_tensors='pt', truncation=True, padding='max_length', max_length=512)
    # Inference only, so skip gradient tracking
    with torch.no_grad():
        outputs = model(**inputs)
    # last_hidden_state has shape (1, 512, 768); position 0 is the [CLS] token.
    # Drop the batch dimension to get a 768-dimensional vector.
    cls_embedding = outputs.last_hidden_state[:, 0, :].squeeze(0).numpy()
    return cls_embedding

# Apply the function to get BERT embeddings for the dataset
embeddings = []

for text in tqdm(df[text_column].tolist(), desc="Getting BERT embeddings"):
    embedding = get_bert_embedding(text, tokenizer, model)
    embeddings.append(embedding)

# Convert embeddings to DataFrame
bert_embeddings_df = pd.DataFrame(embeddings, columns=[f'bert_feature_{i}' for i in range(embeddings[0].shape[0])])

# Save the BERT embeddings DataFrame
bert_embeddings_df.to_csv('bert_features.csv', index=False)

# Check the result
print(bert_embeddings_df.head())


Getting BERT embeddings: 100%|██████████| 13843/13843 [7:27:05<00:00,  1.94s/it]       


   bert_feature_0  bert_feature_1  bert_feature_2  bert_feature_3  \
0       -0.068888       -0.556835        0.466894        0.602734   
1       -0.490990       -0.266561        0.205436        0.357459   
2       -0.090818       -0.227917       -0.149506        0.005978   
3       -0.349526       -0.106863       -0.125005        0.271403   
4        0.113317       -0.388882       -0.021579        0.849989   

   bert_feature_4  bert_feature_5  bert_feature_6  bert_feature_7  \
0        0.041977       -0.487987       -0.031749        1.373013   
1       -0.648571       -0.154488        0.004012        0.070033   
2       -0.407537       -0.576724        0.743775        1.054340   
3       -0.090610       -0.555049        0.115859        0.360005   
4        0.051454       -0.316866        0.492008        0.901263   

   bert_feature_8  bert_feature_9  ...  bert_feature_758  bert_feature_759  \
0       -0.359459       -0.257405  ...         -0.093191         -0.262483   
1        0.133