### Extracting BERT embeddings from texts for readability assessment
##### Used here for listening texts
##### Original code: https://github.com/imperialite/BERT-Embeddings-For-ARA
##### Cite: Imperial (2021). BERT Embeddings for Automatic Readability Assessment

In [7]:
from transformers import AutoTokenizer, AutoModel
import torch
import pickle
import numpy
import pandas as pd

In [2]:
df = pd.read_csv('bc_cam_txt_df.csv')
df

Unnamed: 0,filename,text
0,A1Movers_1_1,"Look, Grandpa. My friend's family are in the g..."
1,A1Movers_1_2,"Come quickly, children. The train's waiting to..."
2,A1Movers_1_3,"Hello, Mrs Castle. Hello Sally, Oh I'm tired. ..."
3,A1Movers_1_4,"Dad, come and watch this DVD with me. What's i..."
4,A1Movers_1_5,Can you colour this mountain picture now? Yes!...
...,...,...
723,C2Prof_16-20,"Today, we're talking to marine biologists Gina..."
724,C2Prof_21-30,I knew I'd be short of money if I didn't work ...
725,C2Prof_3-4,"Last year, Tim Fitzgerald exhibited photograph..."
726,C2Prof_5-6,One of my own thoughts about this piece is the...


Collect the texts from the `text` column of the dataframe, one text per row. The original code expects a .txt corpus with one text per line instead.

In [3]:
# titles = []
contents = []

for i in df['text']:
  parsed_text = i.strip()
  print(parsed_text)
  contents.append(parsed_text)

Look, Grandpa. My friend's family are in the garden. What's your friend's name? It's Sally. Can you see her? She's got glasses. Is she opening a present? That's right. It's her birthday today. Is that boy your friend's brother? Which boy? He's sitting on the mat. Oh, yes. And he's playing with a toy truck. That's right. That boy's name's Ben. He's Sally's cousin. I know that man. Look at his hat. You mean the man with the sandwiches? Yes. He's called Paul. He's got lots. Yes. People get hungry at parties. And is that your friend's mum? The woman who's cleaning the table? Yes. That's right. Her name's Mary. That table's very dirty. Yes. That's because it's always outside. Look at that woman! Where? She's putting something in the tree. Oh, that's Aunt Jane. She's putting some lamps there for this evening. What a nice party!
Come quickly, children. The train's waiting to take us to the zoo. Great, Mrs White. It's exciting going to the zoo. Yes. And I love going by train. Me too. Is the zo

Mean pooling: average the token embeddings of each text, using the attention mask so that padding tokens are excluded from the average.

In [4]:
def mean_pooling(model_output, attention_mask):
    # First element of model_output holds all token embeddings: (batch, seq_len, hidden)
    token_embeddings = model_output[0]
    # Expand the mask to the embedding shape so padding positions are zeroed out
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    sum_embeddings = torch.sum(token_embeddings * input_mask_expanded, 1)
    # Number of real tokens per text; the clamp avoids division by zero
    sum_mask = torch.clamp(input_mask_expanded.sum(1), min=1e-9)
    return sum_embeddings / sum_mask
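
To see why the mask matters, here is a minimal sketch that restates the same arithmetic in NumPy on a toy input. The function name `masked_mean` and the toy values are my own illustration, not part of the original code. The last token position is padding with deliberately large values, so a plain average is badly skewed while the masked average is not.

```python
import numpy as np

def masked_mean(token_embeddings, attention_mask):
    """NumPy restatement of mean pooling, for illustration only."""
    # (batch, seq_len) -> (batch, seq_len, 1) so the mask broadcasts over the hidden dimension
    mask = attention_mask[..., None].astype(float)
    # Sum only the embeddings of real tokens
    summed = (token_embeddings * mask).sum(axis=1)
    # Count real tokens per text; the lower bound avoids division by zero
    counts = np.clip(mask.sum(axis=1), 1e-9, None)
    return summed / counts

# One text, three token positions, hidden size 2; the last position is padding.
toy_embeddings = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
toy_mask = np.array([[1, 1, 0]])

pooled = masked_mean(toy_embeddings, toy_mask)  # average over the two real tokens
naive = toy_embeddings.mean(axis=1)             # average that wrongly includes padding

print(pooled)
print(naive)
```

The masked result is the average of the first two token vectors only, which is what `mean_pooling` computes per text.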

Load the BERT tokenizer and model from the Hugging Face model hub.

In [5]:
# FOR FILIPINO
# tokenizer = AutoTokenizer.from_pretrained("jcblaise/bert-tagalog-base-cased")
# model = AutoModel.from_pretrained("jcblaise/bert-tagalog-base-cased")

# FOR ENGLISH
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModel.from_pretrained("bert-base-cased")

Preprocess and compute embeddings

In [6]:
# Tokenize the texts; inputs longer than 512 tokens (BERT's maximum) are truncated
encoded_input = tokenizer(contents, padding=True, truncation=True, max_length=512, return_tensors='pt')

#Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

#Perform pooling. In this case, mean pooling
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
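
The cell above sends all texts through the model in a single forward pass, which may need a lot of memory for a corpus of this size. One possible workaround is to process the texts in batches. The sketch below only shows the batching step in plain Python. The helper name `iter_batches`, the dummy texts and the batch size of 32 are assumptions for illustration.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements each."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Stand-in for the 727 texts in `contents`; the batch size of 32 is an arbitrary choice.
dummy_texts = [f"text {i}" for i in range(727)]

batch_sizes = [len(batch) for batch in iter_batches(dummy_texts, 32)]
print(len(batch_sizes), batch_sizes[-1])  # 23 batches, the last one holding 23 texts
```

Each batch could then go through the tokenizer, the model and `mean_pooling` exactly as in the cell above, and the per-batch results could be joined with `torch.cat(..., dim=0)` to give the same `sentence_embeddings` tensor.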

Convert the embeddings to a NumPy array and show the embedding of the first text (index 0).

In [8]:
sentence_embeddings_np = sentence_embeddings.numpy()
sentence_embeddings_np[0]

array([ 5.30104265e-02,  2.93767691e-01, -9.75917205e-02,  3.46060365e-01,
        3.83216858e-01,  2.27772668e-01,  1.67975530e-01, -1.40662506e-01,
       -5.17273843e-02, -1.62387446e-01,  1.59120247e-01,  3.03999752e-01,
       -1.37991741e-01,  4.20357943e-01, -3.58055383e-01, -2.15227336e-01,
        6.18731752e-02,  9.45001654e-03, -1.35933742e-01,  4.19850759e-02,
       -1.19844303e-01, -3.74062881e-02, -1.29873902e-01,  6.10786118e-02,
        2.41056815e-01, -3.14644694e-01,  2.91067034e-01,  8.63365293e-01,
        4.07750070e-01,  3.63554239e-01, -9.53591838e-02,  9.39316601e-02,
       -1.39173329e-01, -6.32574409e-02, -8.67822468e-02, -1.07803419e-01,
        1.86185390e-01,  4.81984466e-01, -7.19272112e-03,  2.95656174e-01,
        2.97115833e-01,  1.50157452e-01,  3.54122855e-02,  2.81291723e-01,
       -1.47666082e-01, -1.75567135e-01, -9.86326635e-02, -1.26234025e-01,
        7.23193213e-02,  8.00520554e-02, -4.93141450e-02,  1.48435548e-01,
        4.87572163e-01, -

Save in CSV format. The embeddings can then be added to CSV files of linguistic features for readability assessment.

In [9]:
# numpy.savetxt('bc_cam_bert_embeddings.csv', sentence_embeddings_np, delimiter=',')

I decided to save the embeddings as a pandas DataFrame instead, so I could add column names and the file names.

In [14]:
# One column per embedding dimension (768 for bert-base-cased)
column_names = [f'bert_{i+1}' for i in range(sentence_embeddings_np.shape[1])]

df2 = pd.DataFrame(sentence_embeddings_np, columns=column_names)
df2

Unnamed: 0,bert_1,bert_2,bert_3,bert_4,bert_5,bert_6,bert_7,bert_8,bert_9,bert_10,...,bert_759,bert_760,bert_761,bert_762,bert_763,bert_764,bert_765,bert_766,bert_767,bert_768
0,0.053010,0.293768,-0.097592,0.346060,0.383217,0.227773,0.167976,-0.140663,-0.051727,-0.162387,...,-0.001186,0.040110,-0.104577,-0.307515,0.023724,-0.186511,0.129780,0.211728,0.133709,-0.000617
1,0.414724,0.265030,-0.254253,0.262461,0.390834,0.005777,0.160487,-0.017125,-0.163743,0.054893,...,0.118676,-0.006873,-0.131482,-0.299764,0.054476,-0.014990,0.220584,0.394695,-0.108314,0.111879
2,0.311979,0.260253,-0.287606,0.144175,0.477552,0.144640,0.136279,-0.092433,-0.102352,0.125829,...,0.235627,-0.110906,-0.043273,-0.289649,0.075836,0.105506,0.341360,0.331849,-0.089021,0.036760
3,0.168283,0.283034,-0.174079,0.160189,0.326133,0.182786,0.190903,0.019549,-0.031205,0.065830,...,0.183421,-0.019079,-0.116316,-0.387733,0.029947,-0.093259,0.178463,0.332096,0.003985,0.090877
4,0.113609,0.013535,-0.062040,0.289461,0.302949,0.093426,0.060676,0.092458,-0.120968,-0.000810,...,0.092306,0.057574,-0.120216,-0.294299,-0.204686,-0.083530,0.180239,0.424102,-0.012113,0.091106
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
723,0.401525,0.031742,-0.295002,-0.014102,0.473157,0.115989,-0.092625,0.104994,0.035862,0.131671,...,0.275154,0.069934,-0.223100,-0.089625,0.094113,0.173621,0.136563,0.221396,-0.035504,0.075902
724,0.330500,0.121378,-0.341339,-0.041109,0.434514,0.115061,-0.054262,0.167414,-0.039859,0.021126,...,0.165344,0.060851,-0.299681,-0.128314,-0.010718,0.302664,0.241780,0.271533,-0.116985,0.036222
725,0.201988,0.069796,-0.060673,0.190671,0.463188,0.168991,-0.028868,-0.002389,-0.154003,0.036007,...,0.002933,-0.029062,-0.339931,-0.423854,0.046176,0.072496,0.173474,0.287201,-0.003583,-0.003119
726,0.288296,0.146299,-0.158673,0.102414,0.435514,0.109758,-0.014951,0.098363,0.190461,0.053542,...,-0.237834,-0.090920,-0.452645,-0.381974,0.172661,0.119560,0.190162,0.227809,-0.041571,-0.037134


In [15]:
df2.insert(0, 'filename', df['filename'])
df2

Unnamed: 0,filename,bert_1,bert_2,bert_3,bert_4,bert_5,bert_6,bert_7,bert_8,bert_9,...,bert_759,bert_760,bert_761,bert_762,bert_763,bert_764,bert_765,bert_766,bert_767,bert_768
0,A1Movers_1_1,0.053010,0.293768,-0.097592,0.346060,0.383217,0.227773,0.167976,-0.140663,-0.051727,...,-0.001186,0.040110,-0.104577,-0.307515,0.023724,-0.186511,0.129780,0.211728,0.133709,-0.000617
1,A1Movers_1_2,0.414724,0.265030,-0.254253,0.262461,0.390834,0.005777,0.160487,-0.017125,-0.163743,...,0.118676,-0.006873,-0.131482,-0.299764,0.054476,-0.014990,0.220584,0.394695,-0.108314,0.111879
2,A1Movers_1_3,0.311979,0.260253,-0.287606,0.144175,0.477552,0.144640,0.136279,-0.092433,-0.102352,...,0.235627,-0.110906,-0.043273,-0.289649,0.075836,0.105506,0.341360,0.331849,-0.089021,0.036760
3,A1Movers_1_4,0.168283,0.283034,-0.174079,0.160189,0.326133,0.182786,0.190903,0.019549,-0.031205,...,0.183421,-0.019079,-0.116316,-0.387733,0.029947,-0.093259,0.178463,0.332096,0.003985,0.090877
4,A1Movers_1_5,0.113609,0.013535,-0.062040,0.289461,0.302949,0.093426,0.060676,0.092458,-0.120968,...,0.092306,0.057574,-0.120216,-0.294299,-0.204686,-0.083530,0.180239,0.424102,-0.012113,0.091106
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
723,C2Prof_16-20,0.401525,0.031742,-0.295002,-0.014102,0.473157,0.115989,-0.092625,0.104994,0.035862,...,0.275154,0.069934,-0.223100,-0.089625,0.094113,0.173621,0.136563,0.221396,-0.035504,0.075902
724,C2Prof_21-30,0.330500,0.121378,-0.341339,-0.041109,0.434514,0.115061,-0.054262,0.167414,-0.039859,...,0.165344,0.060851,-0.299681,-0.128314,-0.010718,0.302664,0.241780,0.271533,-0.116985,0.036222
725,C2Prof_3-4,0.201988,0.069796,-0.060673,0.190671,0.463188,0.168991,-0.028868,-0.002389,-0.154003,...,0.002933,-0.029062,-0.339931,-0.423854,0.046176,0.072496,0.173474,0.287201,-0.003583,-0.003119
726,C2Prof_5-6,0.288296,0.146299,-0.158673,0.102414,0.435514,0.109758,-0.014951,0.098363,0.190461,...,-0.237834,-0.090920,-0.452645,-0.381974,0.172661,0.119560,0.190162,0.227809,-0.041571,-0.037134


In [17]:
df2.to_csv('bc_cam_bert_embeddings.csv', index=False, encoding='utf-8')
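
Since the saved file carries a `filename` column, the embeddings can be joined to a table of linguistic features on that key. The following is a minimal sketch of one way to do that with `DataFrame.merge`. The `features` table, its `word_count` column and its values are hypothetical and made up for illustration, and only two of the 768 embedding columns are shown.

```python
import pandas as pd

# Tiny stand-in for the embeddings table (two of the 768 columns shown).
embeddings = pd.DataFrame({
    "filename": ["A1Movers_1_1", "A1Movers_1_2"],
    "bert_1": [0.053010, 0.414724],
    "bert_2": [0.293768, 0.265030],
})

# Hypothetical linguistic features; column name and values are made up for illustration.
features = pd.DataFrame({
    "filename": ["A1Movers_1_2", "A1Movers_1_1"],
    "word_count": [150, 160],
})

# Join on filename; validate raises an error if a filename is duplicated on either side.
combined = features.merge(embeddings, on="filename", how="inner", validate="one_to_one")
print(combined.columns.tolist())
```

Merging on `filename` rather than relying on row order keeps the rows aligned even if the two files are sorted differently.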