# Data Challenge: Node Classification for Greek websites
## Part: Text Embeddings

<br />
<div style="text-align: left"> <b> Date: </b> June 2024 </div>

---

> Didimiotou-Kaoukaki Konstantina, ID: p3352206 <br />
> Kortsinoglou Eirini, ID: p3352212 <br />
> Fountas Dimitrios, ID: p3352228 <br />

> MSc in Data Science (PT) <br />
> Department of Informatics <br />
> Athens University of Economics and Business <br />

In [17]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import torch
from transformers import AutoTokenizer, AutoModel
import pickle 
import os
from tqdm import tqdm
import zipfile
import re
from io import BytesIO
# !pip install -U sentence-transformers
from sentence_transformers import SentenceTransformer


for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/preprocessed-data-final/test_text_data.pkl
/kaggle/input/preprocessed-data-final/train_text_data.pkl


# **GREEK BERT - WORD EMBEDDINGS**

We first use the Greek BERT model from https://huggingface.co/nlpaueb/bert-base-greek-uncased-v1 to produce embeddings.

In [4]:
tokenizer = AutoTokenizer.from_pretrained("nlpaueb/bert-base-greek-uncased-v1")
model = AutoModel.from_pretrained("nlpaueb/bert-base-greek-uncased-v1")



The function processes the train and test texts in batches (for memory efficiency). 
It uses the pre-trained Greek BERT model to generate an embedding for each token in each text, and then represents each text by the average of its token embeddings. 

The maximum sequence length is set to 512 tokens. Texts longer than 512 tokens are truncated to this length, and shorter texts are padded to the length of the longest text in their batch (at most 512 tokens). 
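To make the padding and truncation behaviour concrete, here is a minimal pure-Python sketch on made-up token IDs. With `padding=True` and `truncation=True`, as in the call below, the tokenizer truncates to `max_length` and pads to the longest sequence in the batch. The helper `pad_and_truncate` and the pad ID `0` are illustrative assumptions, not part of the `transformers` API, and the sketch ignores special tokens such as CLS and SEP.

```python
def pad_and_truncate(batch, max_length, pad_id=0):
    """Truncate each sequence to max_length, then pad to the longest one left."""
    truncated = [seq[:max_length] for seq in batch]
    longest = max(len(seq) for seq in truncated)
    # Pad token IDs on the right up to the longest sequence in the batch
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in truncated]
    # 1 marks a real token, 0 marks padding
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in truncated]
    return input_ids, attention_mask


ids, mask = pad_and_truncate([[5, 6, 7], [8], [1, 2, 3, 4, 5, 6]], max_length=4)
print(ids)   # [[5, 6, 7, 0], [8, 0, 0, 0], [1, 2, 3, 4]]
print(mask)  # [[1, 1, 1, 0], [1, 0, 0, 0], [1, 1, 1, 1]]
```

The attention mask is what lets the model tell real tokens from padding.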

In [5]:
def get_embeddings(text_list, tokenizer, max_length, batch_size):
    """Return one mean-pooled Greek BERT embedding per text in text_list.

    Relies on the global `model` loaded above.
    """
    embeddings_list = []

    # Process in batches
    for i in tqdm(range(0, len(text_list), batch_size), desc="Processing Batches"):
        batch_texts = text_list[i:i + batch_size]
        encoding = tokenizer.batch_encode_plus(
            batch_texts,               # List of input texts
            padding=True,              # Pad to the longest sequence in the batch
            truncation=True,           # Truncate to max_length if necessary
            max_length=max_length,
            return_tensors='pt',       # Return PyTorch tensors
            add_special_tokens=True    # Add special tokens CLS and SEP
        )

        input_ids = encoding['input_ids']            # Token IDs
        attention_mask = encoding['attention_mask']  # 1 for real tokens, 0 for padding

        with torch.no_grad():
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
            # Average over all token positions (padded positions included)
            embeddings = outputs.last_hidden_state.mean(dim=1)
            embeddings_list.append(embeddings)

    # Concatenate all batch embeddings
    embeddings = torch.cat(embeddings_list, dim=0)
    return embeddings
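Note that `outputs.last_hidden_state.mean(dim=1)` averages over every position in the padded batch, including padded ones. A common alternative, which is not what the function above does, is to weight the average by the attention mask so that only real tokens contribute. The sketch below illustrates the difference on toy tensors; the helper name `masked_mean_pool` and the input values are assumptions made for illustration.

```python
import torch


def masked_mean_pool(last_hidden_state, attention_mask):
    """Average token vectors over real tokens only, ignoring padded positions."""
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)  # (batch, seq, 1)
    summed = (last_hidden_state * mask).sum(dim=1)                    # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1.0)                           # (batch, 1)
    return summed / counts


# Toy input: 2 sequences, 3 positions, hidden size 2.
# The second sequence has one real token followed by two padded positions.
hidden = torch.tensor([[[1.0, 1.0], [3.0, 3.0], [5.0, 5.0]],
                       [[2.0, 4.0], [9.0, 9.0], [9.0, 9.0]]])
mask = torch.tensor([[1, 1, 1], [1, 0, 0]])

pooled = masked_mean_pool(hidden, mask)  # padding excluded
plain = hidden.mean(dim=1)               # padding included
print(pooled)
print(plain)
```

For the first sequence, which has no padding, both results agree. For the second, the plain mean is pulled towards the vectors at the padded positions, while the masked mean returns the single real token vector.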

In [7]:
# Load preprocessed data from the pickle file
with open('/kaggle/input/preprocessed-data-final/train_text_data.pkl', 'rb') as f:
    train_text_data = pickle.load(f)
    
with open('/kaggle/input/preprocessed-data-final/test_text_data.pkl', 'rb') as f:
    test_text_data = pickle.load(f)

In [8]:
print(len(train_text_data))
print(len(test_text_data))

1812
605


In [11]:
embeddings_train = get_embeddings(train_text_data, tokenizer, 512, 500)
embeddings_test = get_embeddings(test_text_data, tokenizer, 512, 500)

Processing Batches: 100%|██████████| 4/4 [37:35<00:00, 563.91s/it]
Processing Batches: 100%|██████████| 2/2 [12:36<00:00, 378.08s/it]
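The batch counts shown in the progress bars (4 and 2) follow from ceiling division of the dataset sizes by the batch size of 500. A small sketch of that arithmetic, where `num_batches` and `last_batch_size` are hypothetical helper names used only for illustration:

```python
import math


def num_batches(n_items, batch_size):
    """Number of batches needed to cover n_items (the last one may be smaller)."""
    return math.ceil(n_items / batch_size)


def last_batch_size(n_items, batch_size):
    """Size of the final batch."""
    remainder = n_items % batch_size
    return remainder if remainder else batch_size


print(num_batches(1812, 500), last_batch_size(1812, 500))  # 4 312
print(num_batches(605, 500), last_batch_size(605, 500))    # 2 105
```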


Save the embeddings as pickle files, to be used in the modelling process.

In [12]:
# Bert embeddings
with open('train_embeddings.pkl', 'wb') as f:
    pickle.dump(embeddings_train, f)
    
with open('test_embeddings.pkl', 'wb') as f:
    pickle.dump(embeddings_test, f)
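In the modelling step, the saved files can be read back with `pickle.load`. Below is a minimal round-trip sketch that uses a small dummy list in place of the real embeddings and writes to a temporary directory; the helper names `save_pickle` and `load_pickle` and the file name are assumptions made for illustration.

```python
import os
import pickle
import tempfile


def save_pickle(obj, path):
    """Serialize obj to path with pickle."""
    with open(path, 'wb') as f:
        pickle.dump(obj, f)


def load_pickle(path):
    """Load a pickled object; only use this on files you trust."""
    with open(path, 'rb') as f:
        return pickle.load(f)


dummy = [[0.1, 0.2], [0.3, 0.4]]  # stand-in for an embeddings object
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'example_embeddings.pkl')
    save_pickle(dummy, path)
    restored = load_pickle(path)

print(restored == dummy)  # True
```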

# Sentence Transformers (based on Greek Bert) - Sentence Embeddings 

Next, we also create embeddings using another BERT-based pre-trained model, trained on Greek media texts, which can be found at https://huggingface.co/dimitriz/st-greek-media-bert-base-uncased  
In this case, we use the model as a Sentence Transformer, which produces one embedding for each text in the train and test datasets. 

In [15]:
emb_model = SentenceTransformer('dimitriz/greek-media-bert-base-uncased')




In [16]:
embed_train = emb_model.encode(train_text_data, show_progress_bar=True,
                              batch_size=128)
embed_test = emb_model.encode(test_text_data, show_progress_bar=True,
                              batch_size=128)


In [18]:
with open('train_embed_sent.pkl', 'wb') as f:
    pickle.dump(embed_train, f)
    
with open('test_embed_sent.pkl', 'wb') as f:
    pickle.dump(embed_test, f)