<h1>Overview</h1>

In this notebook, we examine the dataset, which is a collection of scientific publications. Each entry consists of an ID, a title, and the abstract of the publication. The abstracts are fed to our model to obtain their vector representations (embeddings).

This notebook is going to do the following:
1. <b>Data cleaning</b>: Remove nulls and non-English entries.
2. <b>Encoding of abstracts</b>: Obtain the vector representation (encoding) of each abstract.
3. <b>Keyword extraction</b>: Extract the most relevant keywords from each abstract.
4. <b>Output</b>: Export the results to be used in the next stage.

<b style="color:green;">Don't worry, I will provide explanations for each step along the way.</b>


<h1>Import libraries</h1>


In [1]:
import pandas as pd
import numpy as np
import torch
from langdetect import detect, DetectorFactory
from tqdm import tqdm

pd.set_option('display.max_columns', None)
tqdm.pandas() # Initialize tqdm for pandas (for the progress bar)
DetectorFactory.seed = 0 # To have consistent results from langdetect

<h1>Import data</h1>


In [2]:
# Import the dataset
df_raw = pd.read_json(r'input/abstracts.json', encoding="utf8")
df_raw.head(5)

Unnamed: 0,id,title,abstractText
0,000212c0-370f-df46-a901-1a8962645f6f,Measuring perceived depth in natural images an...,The perception of depth in images and video se...
1,0009bed2-bf63-1f43-977c-b7823fa0029d,Quasi-continuous High Pressure Preservation – ...,
2,00101882-6f36-a74b-bc93-231cd4d09a68,Refractive Microprisms with Improved Surface Q...,
3,00130705-0760-d24e-81db-7bb32b537bf9,Gemeinsam Fotografien dokumentieren – Kollabor...,
4,0020df72-aae3-ae41-b007-ce47fbaba7ca,"Das ""Grundgesetz"" für die Leistungsbeschreibun...",


<h1>Clean data</h1>

<h3 style="color:blue;">Examine the data</h3>

In [3]:
# Let's take a look at what we have
df_raw.info()

<class 'pandas.core.frame.DataFrame'>
Index: 7548 entries, 0 to 7547
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   id            7548 non-null   object
 1   title         7548 non-null   object
 2   abstractText  1500 non-null   object
dtypes: object(3)
memory usage: 235.9+ KB


The output above shows three columns (id, title, and abstractText); we are only interested in abstractText. The abstract is a good input for the model because it summarizes the semantic content of the publication.

Only 1500 of the 7548 entries have a non-null abstractText, so we exclude the entries with null abstracts in the next step.
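
The size of that gap can be checked directly. The snippet below is a minimal sketch on a toy DataFrame (the three rows are made up for illustration and are not taken from the dataset); on the real data the same calls would be applied to `df_raw`, where 7548 - 1500 = 6048 abstracts are missing.

```python
import pandas as pd

# Toy stand-in for df_raw; the values are illustrative only
toy = pd.DataFrame({
    "id": ["a", "b", "c"],
    "title": ["T1", "T2", "T3"],
    "abstractText": ["Some abstract", None, None],
})

# Number of missing values per column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Share of rows that have an abstract
share_with_abstract = toy["abstractText"].notna().mean()
print(f"{share_with_abstract:.1%} of rows have an abstract")
```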

<h3 style="color:blue;">Remove NA</h3>

In [4]:
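# Exclude entries that don't have an abstract text.
# .copy() makes df_cleaned an independent DataFrame, so adding columns later does not trigger a SettingWithCopyWarning
df_cleaned = df_raw[~df_raw['abstractText'].isna()].copy()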

# For testing purposes, uncomment the next line to limit the number of entries
# df_cleaned = df_cleaned.head(10)

df_cleaned.head(3)

Unnamed: 0,id,title,abstractText
0,000212c0-370f-df46-a901-1a8962645f6f,Measuring perceived depth in natural images an...,The perception of depth in images and video se...
6,00260961-8319-c84a-b84f-a83f557af45f,VNREAL: Virtual Network Resource Embedding Alg...,Network virtualization is recognized as an ena...
8,00650278-d57c-874e-8fd3-355e26a35f21,Komplementärmedizin in der Onkologie – Evidenz...,Tumorpatienten sind im Verlauf ihrer Erkrankun...


<h3 style="color:blue;">Remove non-English abstracts</h3>

You may have noticed that some abstracts are written in German. We'll exclude all non-English entries, since the models we're using are trained mainly on English text. If you want to include other languages, a multilingual model is the better choice. To detect the language of each abstract, we'll use the langdetect package.

In [6]:
# Define a function that returns the language of a given text
def detect_language(text):
    try:
        return detect(text)
    except Exception:  # langdetect raises when the text has no usable features (e.g. empty text)
        return 'unknown'

# Apply the function to create a new column 'language'
df_cleaned.loc[:, 'language'] = df_cleaned['abstractText'].progress_apply(detect_language)
df_cleaned.head(5)

100%|██████████| 1500/1500 [00:14<00:00, 105.78it/s]


Unnamed: 0,id,title,abstractText,language
0,000212c0-370f-df46-a901-1a8962645f6f,Measuring perceived depth in natural images an...,The perception of depth in images and video se...,en
6,00260961-8319-c84a-b84f-a83f557af45f,VNREAL: Virtual Network Resource Embedding Alg...,Network virtualization is recognized as an ena...,en
8,00650278-d57c-874e-8fd3-355e26a35f21,Komplementärmedizin in der Onkologie – Evidenz...,Tumorpatienten sind im Verlauf ihrer Erkrankun...,de
10,006db203-8627-2f40-979e-5cd17aee5785,Kurzer Prozeß mit Ebenheit - Ebenheitsmessung ...,Evenness measurement on polished support plate...,en
15,00963793-ef84-b84c-9dd9-37f43546fdf9,"Konkurrierende Partner, kooperierende Wettbewe...",Innovationsnetzwerke sind primär durch Koopera...,de


In [7]:
# Let's look at the number of abstracts per language
df_cleaned[['language', 'id']].groupby(['language']).count().sort_values(by='id', ascending=False)

language,id
en,1330
de,167
es,1
fr,1
sw,1


In [8]:
# Filter rows where the language is English
df_cleaned_en = df_cleaned[df_cleaned['language'] == 'en']

# Drop the 'language' column since we don't need it anymore
df_cleaned_en = df_cleaned_en.drop(columns=['language'])

In [9]:
# After manually checking the data, I found that some abstracts were labeled as English but
# they contained the translated version in German as well. These cases can be handled in some way by extracting
# only the English part, but let's keep things simple and just remove those entries.

ids_to_remove = [
    '0da4e154-6bbb-5a44-90be-18a8f032d716',
    '37f017ad-6be1-e84e-81e0-664282061d63',
    '47e9ddcd-3c80-5448-860a-55ac1623b72b',
    '9be88a39-0572-8047-b3e4-2725d791876b',
    'd6044191-6573-1848-ae7b-f215c922f201',
    'b6e0b22c-c85a-224a-a1c6-39a7daebc058',
    '006db203-8627-2f40-979e-5cd17aee5785'
]

df_cleaned_en = df_cleaned_en[~df_cleaned_en['id'].isin(ids_to_remove)]
print('We have', df_cleaned_en.shape[0], 'abstracts after excluding non-English entries')

We have 1324 abstracts after excluding non-English entries
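
The manual check described in the cell above could be narrowed down with a simple heuristic. The sketch below is not how `ids_to_remove` was built; it is one possible way to flag candidates for manual review. The word list `GERMAN_HINTS`, the function `german_word_ratio`, and the example sentences are all hypothetical, and the approach will produce false positives (for example, "die" is also an English word).

```python
import re

# Small, hand-picked set of German function words (an assumption, not exhaustive)
GERMAN_HINTS = {"der", "die", "das", "und", "ist", "nicht", "mit", "für", "eine", "wird", "sind", "auf"}

def german_word_ratio(text):
    """Return the fraction of tokens in `text` that are common German function words."""
    tokens = re.findall(r"[a-zäöüß]+", text.lower())
    if not tokens:
        return 0.0
    return sum(token in GERMAN_HINTS for token in tokens) / len(tokens)

# Illustrative inputs: an English-only text and the same text followed by a German sentence
english_only = "Network virtualization is recognized as an enabling technology."
mixed = english_only + " Die Virtualisierung ist eine Technologie für das Netz und wird genutzt."

print(german_word_ratio(english_only))
print(german_word_ratio(mixed))
```

Abstracts whose ratio exceeds some threshold could then be inspected by hand before deciding whether to drop them.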


<h1>Encoding of abstracts using TinyLlama-1.1B-Chat</h1>

Now we use TinyLlama to encode our abstracts into vector representations. Make sure the transformers library is installed. The encodings are stored and used later for clustering.

Note that TinyLlama is a decoder-only (Llama-style) language model, not an encoder-decoder. We do not use it to generate text here; we only read the hidden states of its last layer, which give a contextualized vector for each input token.

<h3 style="color:blue;">Import the model and tokenizer from the transformers library</h3>

In [10]:
from transformers import AutoTokenizer, AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")



<h3 style="color:blue;">Encode abstracts</h3>

In [14]:
# Define a function that encodes a text into a single embedding vector
def encode_text(text):
    # Tokenize the text (inputs longer than the tokenizer's maximum length are truncated)
    encoded_input = tokenizer(text, return_tensors='pt', truncation=True, padding=True)

    # Forward pass through the model; no_grad() skips gradient tracking, which is not needed for inference
    with torch.no_grad():
        model_output = model(**encoded_input, output_hidden_states=True)

    # Hidden states of the last layer for the single input sequence, shape (num_tokens, hidden_size)
    last_hidden_state = model_output.hidden_states[-1][0]

    # The model outputs one contextual embedding per token, so we average them
    # to get a single vector that represents the overall meaning of the whole text
    text_embedding = torch.mean(last_hidden_state, dim=0).detach().numpy().tolist()

    return text_embedding

# Apply the function to the 'abstractText' column and create a new 'encoding' column
df_cleaned_en['encoding'] = df_cleaned_en['abstractText'].progress_apply(encode_text)
df_cleaned_en.head(5)

100%|██████████| 10/10 [01:50<00:00, 11.09s/it]


Unnamed: 0,id,title,abstractText,encoding
0,000212c0-370f-df46-a901-1a8962645f6f,Measuring perceived depth in natural images an...,The perception of depth in images and video se...,"[-0.08983136713504791, 0.9594758152961731, -0...."
6,00260961-8319-c84a-b84f-a83f557af45f,VNREAL: Virtual Network Resource Embedding Alg...,Network virtualization is recognized as an ena...,"[0.19866445660591125, -0.5716489553451538, -0...."
32,0139e0f7-90c3-714e-8172-8be67ef9a66e,The heat flux and temperature distribution of ...,The thermal diffusion of nanostructured W fuzz...,"[0.31982874870300293, 1.3281383514404297, 0.71..."
35,013fa062-7f01-0647-bde9-baf4e13844bc,Characterization of Additively Produced RF-Str...,"Usually, the printed circuit board industry ha...","[0.00956014171242714, 0.9851690530776978, 0.64..."
37,0146526d-6171-6244-85c4-94752d2cfde3,Stokes parameters in the unfolding of an optic...,Following our earlier work (F. Flossmann et al...,"[0.3496668040752411, 0.44515469670295715, 0.35..."
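
In `encode_text`, each abstract is encoded on its own, so the sequence contains no padding tokens and a plain mean over the token axis is fine. If the abstracts were encoded in batches instead, padded positions would need to be left out of the average. The snippet below is a minimal NumPy sketch of that idea; `mean_pool` and the toy arrays are hypothetical and not part of the notebook.

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, ignoring positions where attention_mask is 0."""
    mask = attention_mask[:, None].astype(token_embeddings.dtype)  # shape (seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=0)
    count = max(mask.sum(), 1.0)  # avoid division by zero for an all-padding row
    return summed / count

# 3 token positions with 2-dimensional embeddings; the last position is padding
tokens = np.array([[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]])
mask = np.array([1, 1, 0])

pooled = mean_pool(tokens, mask)
print(pooled)  # [2. 3.]
```

Without the mask, the padded position would pull the average towards `[100, 100]`.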


<h1>Keyword extraction using KeyBERT</h1>

Now we use KeyBERT to extract the most relevant keywords from each abstract. Why is this step important?
Keywords are what let us interpret the clusters later on. Since we are working in an unsupervised setting without predefined labels, the most frequent keywords within a cluster indicate its main topic.
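
To make the idea concrete, here is a minimal sketch of how the most frequent keywords per cluster could be counted once cluster labels exist. The cluster labels, the keyword lists, and the helper `top_keywords_per_cluster` are hypothetical; the actual clustering happens in the next notebook.

```python
from collections import Counter

# Hypothetical cluster labels and keyword lists; illustrative values only
records = [
    {"cluster": 0, "keywords": ["depth", "perception", "binocular"]},
    {"cluster": 0, "keywords": ["depth", "stereo", "blur"]},
    {"cluster": 1, "keywords": ["network", "virtualization"]},
    {"cluster": 1, "keywords": ["network", "virtual", "embedding"]},
]

def top_keywords_per_cluster(rows, n=2):
    """Count keywords per cluster and return the n most frequent ones for each."""
    counters = {}
    for row in rows:
        counters.setdefault(row["cluster"], Counter()).update(row["keywords"])
    return {
        cluster: [keyword for keyword, _ in counter.most_common(n)]
        for cluster, counter in counters.items()
    }

top = top_keywords_per_cluster(records, n=1)
print(top)  # {0: ['depth'], 1: ['network']}
```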

<h3 style="color:blue;">Import the model and create an instance</h3>

In [None]:
from keybert import KeyBERT
kw_model = KeyBERT()

<h3 style="color:blue;">Extract keywords from the abstracts</h3>

In [None]:
# Define a function to extract keywords from the abstract text
def extract_keywords_from_text(text):
    keywords = kw_model.extract_keywords(text)
    return [keyword[0] for keyword in keywords]

# Apply the function to the 'abstractText' column and create a new 'keywords' column
df_cleaned_en['keywords'] = df_cleaned_en['abstractText'].progress_apply(extract_keywords_from_text)
df_cleaned_en.head(5)

100%|██████████████████████████████████████████████████████████████████████████████| 1324/1324 [05:00<00:00,  4.41it/s]


Unnamed: 0,id,title,abstractText,encoding,keywords
0,000212c0-370f-df46-a901-1a8962645f6f,Measuring perceived depth in natural images an...,The perception of depth in images and video se...,"[-0.3123984634876251, 0.2054155468940735, 0.27...","[binocular, depth, perception, perspective, blur]"
6,00260961-8319-c84a-b84f-a83f557af45f,VNREAL: Virtual Network Resource Embedding Alg...,Network virtualization is recognized as an ena...,"[-0.47408998012542725, 0.15451061725616455, 0....","[virtualization, vnreal, vne, virtual, network]"
32,0139e0f7-90c3-714e-8172-8be67ef9a66e,The heat flux and temperature distribution of ...,The thermal diffusion of nanostructured W fuzz...,"[-0.31249314546585083, -0.014937079511582851, ...","[irradiations, fusion, fuzz, thermal, nanostru..."
35,013fa062-7f01-0647-bde9-baf4e13844bc,Characterization of Additively Produced RF-Str...,"Usually, the printed circuit board industry ha...","[-0.2447933405637741, 0.0641787126660347, 0.36...","[inkjet, impedance, rf, ghz, substrates]"
37,0146526d-6171-6244-85c4-94752d2cfde3,Stokes parameters in the unfolding of an optic...,Following our earlier work (F. Flossmann et al...,"[-0.26581481099128723, -0.08620522916316986, 0...","[polarization, birefringent, vortices, stokes,..."


<h1>Export the final processed data</h1>

In [None]:
# Convert to JSON Lines (orient='records' with lines=True writes one JSON record per line)
json_data = df_cleaned_en.to_json(orient='records', lines=True)

# Write JSON data to a file
with open('output/abstracts_processed.json', 'w', encoding='utf-8') as json_file:
    json_file.write(json_data)
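
Because the export uses `lines=True`, the file holds one JSON record per line and has to be read back with the same flag. The snippet below is a small round-trip sketch on a toy DataFrame written to a temporary directory; the column values and the file name `toy_processed.json` are made up for illustration.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for df_cleaned_en; the values are illustrative only
toy = pd.DataFrame({
    "id": ["a", "b"],
    "encoding": [[0.1, 0.2], [0.3, 0.4]],
    "keywords": [["depth", "blur"], ["network"]],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_processed.json")
    # Write one JSON record per line
    toy.to_json(path, orient="records", lines=True)
    # Read it back; lines=True must match the way the file was written
    restored = pd.read_json(path, lines=True)

print(restored["keywords"].tolist())
```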

We've processed the abstracts, encoded them, and extracted keywords. Now, let's head to the next notebook for the clustering stage.