# ACIT5900: Master Thesis
### *PDF Extraction along with Metadata*

>-------------------------------------------
> *Spring 2025*

>--------------------------------------------

<a id="top"></a>
1. [**PDF Text Extraction**](#pdf)<br>
2. [**Metadata Extraction**](#metadata)<br>
3. [**Merge Data Frames**](#merge)<br>

In [1]:
# import modules needed
import os
import sys
import pandas as pd
import bibtexparser

In [2]:
# widen column display so long text fields are readable
pd.set_option('display.max_colwidth', 200)

In [3]:
# ensure src/ is in the Python path
BASE_DIR = os.path.abspath(os.path.join(os.getcwd(), ".."))
sys.path.append(os.path.join(BASE_DIR, "src"))
print(sys.path)

['/opt/homebrew/Cellar/python@3.13/3.13.1/Frameworks/Python.framework/Versions/3.13/lib/python313.zip', '/opt/homebrew/Cellar/python@3.13/3.13.1/Frameworks/Python.framework/Versions/3.13/lib/python3.13', '/opt/homebrew/Cellar/python@3.13/3.13.1/Frameworks/Python.framework/Versions/3.13/lib/python3.13/lib-dynload', '', '/Users/celinelangeland/projects/master-thesis/venv/lib/python3.13/site-packages', '/Users/celinelangeland/projects/master-thesis/src']


## 1) PDF Text Extraction <a id="pdf"></a>

This section extracts text from every PDF in the `data/documents` directory using the `extract_text_from_pdf()` function from `retrieval.py`. For each file, the text of all pages is joined into a single string, and the document name is taken from the filename without its extension. The results are stored in a DataFrame (`df_text`) in which each row holds the full text of one document (`content`) and its document name (`file`). This structure is suitable for downstream tasks such as chunking, embedding, and retrieval.
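
The per-document step described above can be sketched in plain Python. This is only an illustration: `combine_pages` is a hypothetical helper, and it assumes the page texts are already available as a list of strings, whereas the notebook reads them from the `content` column returned by `extract_text_from_pdf()`. Unlike the notebook's `str.cat`, the sketch also skips empty pages.

```python
import os


def combine_pages(pages, pdf_filename):
    """Join per-page text into one string and pair it with the file stem."""
    # Skip empty pages so the result has no doubled separators.
    content = " ".join(text for text in pages if text)
    # "doc16.pdf" -> "doc16"
    doc_name = os.path.splitext(pdf_filename)[0]
    return {"content": content, "file": doc_name}


record = combine_pages(["First page.", "", "Second page."], "doc16.pdf")
print(record)
```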

[⬆️ Back to Top](#top)

In [4]:
# import function from file
from retrieval import extract_text_from_pdf

# define path to documents
documents_dir = os.path.join(BASE_DIR, "data", "documents")

# list all PDF files in the folder
pdf_files = [f for f in os.listdir(documents_dir) if f.endswith('.pdf')]
pdf_files

['doc16.pdf',
 'doc17.pdf',
 'doc15.pdf',
 'doc29.pdf',
 'doc28.pdf',
 'doc14.pdf',
 'doc10.pdf',
 'doc11.pdf',
 'doc13.pdf',
 'doc12.pdf',
 'doc9.pdf',
 'doc8.pdf',
 'doc5.pdf',
 'doc4.pdf',
 'doc6.pdf',
 'doc7.pdf',
 'doc3.pdf',
 'doc2.pdf',
 'doc1.pdf',
 'doc23.pdf',
 'doc37.pdf',
 'doc36.pdf',
 'doc22.pdf',
 'doc34.pdf',
 'doc20.pdf',
 'doc21.pdf',
 'doc35.pdf',
 'doc19.pdf',
 'doc31.pdf',
 'doc25.pdf',
 'doc24.pdf',
 'doc30.pdf',
 'doc18.pdf',
 'doc26.pdf',
 'doc32.pdf',
 'doc33.pdf',
 'doc27.pdf']
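
`os.listdir` returns entries in arbitrary order, which is why the listing above is not in numeric order. If a stable, human-readable order is wanted, a natural sort key is one option. The sketch below is an optional addition, and `natural_key` is a hypothetical helper that is not part of the notebook or `retrieval.py`.

```python
import re


def natural_key(name):
    """Split a name into text and integer parts so 'doc2' sorts before 'doc10'."""
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", name)]


sample = ["doc16.pdf", "doc2.pdf", "doc10.pdf", "doc1.pdf"]
ordered = sorted(sample, key=natural_key)
print(ordered)
```

Applied to the notebook, this would amount to `pdf_files = sorted(pdf_files, key=natural_key)`; the order of rows in `df_text` would then follow the document numbers.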

In [5]:
# list to store the text for each document
texts = []

# loop through all PDF files and extract text
for pdf_file in pdf_files:
    pdf_path = os.path.join(documents_dir, pdf_file)
    df_pages = extract_text_from_pdf(pdf_path)

    # combine the text content from all pages into one string
    combined_content = df_pages['content'].str.cat(sep=' ')

    # extract only doc name (filename without extension)
    doc_name = os.path.splitext(pdf_file)[0]

    # create a one-row df for the document
    df_doc = pd.DataFrame([[combined_content, doc_name]], columns=["content", "file"])

    # append df to list
    texts.append(df_doc)

# concatenate all individual dfs into one combined df
df_text = pd.concat(texts, ignore_index=True)
df_text.head()

Unnamed: 0,content,file
0,"Contrastive autoencoder for anomaly detection in multivariate\ntime series\nHao Zhou a, Ke Yu a,⇑, Xuan Zhang b, Guanlin Wu a, Anis Yazidi c,d,e\na Beijing University of Posts and Telecommunicatio...",doc16
1,Computers in Biology and Medicine 127 (2020) 104094\nAvailable online 27 October 2020\n0010-4825/© 2020 Elsevier Ltd. All rights reserved.\nDetection of abnormality in wireless capsule endoscopy i...,doc17
2,J. Vis. Commun. Image R. 74 (2021) 103008\nAvailable online 21 December 2020\n1047-3203/© 2020 Elsevier Inc. All rights reserved.\nContents lists available at ScienceDirect\nJ. Vis. Commun. Image ...,doc15
3,Pattern Recognition 122 (2022) 108339 \nContents lists available at ScienceDirect \nPattern Recognition \njournal homepage: www.elsevier.com/locate/patcog \nEstimating Tukey depth using incrementa...,doc29
4,"Advanced Passive Operating System Fingerprinting\nUsing Machine Learning and Deep Learning\nDesta Haileselassie Hagos∗, Martin Løland†, Anis Yazidi‡, Øivind Kure §, Paal E. Engelstad ¶\n∗§¶Univers...",doc28


## 2) Metadata Extraction <a id="metadata"></a>

This section processes all BibTeX files found in the `data/metadata` directory. Each file is parsed using the `load_bibtex()` function, which extracts bibliographic entries and converts them into a DataFrame. All metadata entries are combined into a single DataFrame `df_bibtex`, with an added `file` column to indicate the source of each entry. The final DataFrame is filtered to keep only the most relevant metadata fields such as `title`, `author`, `year`, and `doi`.

[⬆️ Back to Top](#top)


In [6]:
# define the path to metadata
metadata_dir = os.path.join(BASE_DIR, "data", "metadata")

# list all BibTeX files in the folder
bib_files = [f for f in os.listdir(metadata_dir) if f.endswith('.bib')]
bib_files

['doc22.bib',
 'doc36.bib',
 'doc37.bib',
 'doc23.bib',
 'doc35.bib',
 'doc21.bib',
 'doc20.bib',
 'doc34.bib',
 'doc18.bib',
 'doc30.bib',
 'doc24.bib',
 'doc25.bib',
 'doc31.bib',
 'doc19.bib',
 'doc27.bib',
 'doc33.bib',
 'doc32.bib',
 'doc26.bib',
 'doc4.bib',
 'doc5.bib',
 'doc7.bib',
 'doc6.bib',
 'doc2.bib',
 'doc3.bib',
 'doc1.bib',
 'doc8.bib',
 'doc9.bib',
 'doc17.bib',
 'doc16.bib',
 'doc14.bib',
 'doc28.bib',
 'doc29.bib',
 'doc15.bib',
 'doc11.bib',
 'doc10.bib',
 'doc12.bib',
 'doc13.bib']
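
Because the later merge joins on the document name, it can help to confirm that every PDF has a matching BibTeX file before going further. The following is a minimal sketch under that assumption; `unmatched_stems` is a hypothetical helper, and the file names in the example are made up.

```python
import os


def unmatched_stems(pdf_names, bib_names):
    """Return (PDFs without metadata, metadata without PDFs) as sorted stems."""
    pdf_stems = {os.path.splitext(name)[0] for name in pdf_names}
    bib_stems = {os.path.splitext(name)[0] for name in bib_names}
    return sorted(pdf_stems - bib_stems), sorted(bib_stems - pdf_stems)


missing_bib, missing_pdf = unmatched_stems(
    ["doc1.pdf", "doc2.pdf", "doc3.pdf"],
    ["doc1.bib", "doc3.bib", "doc4.bib"],
)
print(missing_bib)  # PDFs that would get empty metadata in a left merge
print(missing_pdf)  # BibTeX files that a left merge would leave out
```

In the notebook this could be called as `unmatched_stems(pdf_files, bib_files)`; two empty lists would mean the two folders line up.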

In [None]:
def load_bibtex(bibtex_file_path):
    """
    Parse a single BibTeX file and return its entries as a DataFrame.
    """
    # read as UTF-8 explicitly (assumes the .bib files are UTF-8) so that
    # non-ASCII author names do not depend on the platform default encoding
    with open(bibtex_file_path, 'r', encoding='utf-8') as bibtex_file:
        # parse the BibTeX file
        bib_database = bibtexparser.load(bibtex_file)

    # bib_database.entries is a list of dictionaries, one per entry
    bib_entries = bib_database.entries

    # convert the list of dictionaries into a df
    df_bibtex = pd.DataFrame(bib_entries)

    return df_bibtex

In [8]:
# list to store dfs
bib_dataframes = []

# loop through all BibTeX files and load them into dfs
for bib_file in bib_files:
    bib_file_path = os.path.join(metadata_dir, bib_file)
    df_bibtex = load_bibtex(bib_file_path)
    
    # extract only doc name
    doc_name = os.path.splitext(bib_file)[0]
    
    # add column to identify the source BibTeX file
    df_bibtex['file'] = doc_name
    
    # append the df to the list
    bib_dataframes.append(df_bibtex)

# concatenate all the dfs into one combined df
df_bibtex = pd.concat(bib_dataframes, ignore_index=True)
df_bibtex.head()


Unnamed: 0,doi,keywords,pages,number,volume,year,title,journal,author,ENTRYTYPE,...,booktitle,article-number,note,isbn,address,editor,bibsource,biburl,timestamp,eprinttype
0,10.1109/TIM.2022.3216366,Clustering algorithms;Vibrations;Mechanical systems;Computer science;Kernel;Indexes;Feature extraction;Density peaks clustering (DPC);mixed data (MD);S-distance;symmetric favored c-nearest neighbo...,1-16,,71.0,2022,A New Adaptive Mixture Distance-Based Improved Density Peaks Clustering for Gearbox Fault Diagnosis,IEEE Transactions on Instrumentation and Measurement,"Sharma, Krishna Kumar and Seal, Ayan and Yazidi, Anis and Krejcar, Ondrej",article,...,,,,,,,,,,
1,10.1103/PhysRevA.110.063120,,063120,,110.0,2024,Nonrelativistic Dirac equation: An application to photoionization of highly charged hydrogenlike ions,Phys. Rev. A,"Lindblom, Tor Kjellsson and Br\ae{}ck, Simen and Selst\o{}, S\o{}lve",article,...,,,,,,,,,,
2,10.1103/PhysRevA.106.042213,,042213,,106.0,2022,Absorbers as detectors for unbound quantum systems,Phys. Rev. A,"Selst\o{}, S\o{}lve",article,...,,,,,,,,,,
3,https://doi.org/10.1016/j.ejor.2021.05.030,"Decision support systems, Group decision making, Fuzzy preference relations, Rank Centrality, Markov chains",1030-1041,3.0,297.0,2022,A new decision making model based on Rank Centrality for GDM with fuzzy preference relations,European Journal of Operational Research,Anis Yazidi and Magdalena Ivanovska and Fabio M. Zennaro and Pedro G. Lind and Enrique Herrera Viedma,article,...,,,,,,,,,,
4,10.1101/2025.01.21.633559,,,,,2025,Influence of neural network bursts on functional development,bioRxiv,"Ramstad, Ola Huse and Sandvig, Axel and Nichele, Stefano and Sandvig, Ioanna",article,...,,,,,,,,,,


In [9]:
# select only relevant columns
df_bibtex = df_bibtex[['file', 'title', 'author', 'year', 'number', 'volume', 'journal', 'ENTRYTYPE', 'doi']]
df_bibtex.head()

Unnamed: 0,file,title,author,year,number,volume,journal,ENTRYTYPE,doi
0,doc22,A New Adaptive Mixture Distance-Based Improved Density Peaks Clustering for Gearbox Fault Diagnosis,"Sharma, Krishna Kumar and Seal, Ayan and Yazidi, Anis and Krejcar, Ondrej",2022,,71.0,IEEE Transactions on Instrumentation and Measurement,article,10.1109/TIM.2022.3216366
1,doc36,Nonrelativistic Dirac equation: An application to photoionization of highly charged hydrogenlike ions,"Lindblom, Tor Kjellsson and Br\ae{}ck, Simen and Selst\o{}, S\o{}lve",2024,,110.0,Phys. Rev. A,article,10.1103/PhysRevA.110.063120
2,doc37,Absorbers as detectors for unbound quantum systems,"Selst\o{}, S\o{}lve",2022,,106.0,Phys. Rev. A,article,10.1103/PhysRevA.106.042213
3,doc23,A new decision making model based on Rank Centrality for GDM with fuzzy preference relations,Anis Yazidi and Magdalena Ivanovska and Fabio M. Zennaro and Pedro G. Lind and Enrique Herrera Viedma,2022,3.0,297.0,European Journal of Operational Research,article,https://doi.org/10.1016/j.ejor.2021.05.030
4,doc35,Influence of neural network bursts on functional development,"Ramstad, Ola Huse and Sandvig, Axel and Nichele, Stefano and Sandvig, Ioanna",2025,,,bioRxiv,article,10.1101/2025.01.21.633559
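
The `doi` column above mixes two forms: bare identifiers such as `10.1109/TIM.2022.3216366` and full resolver URLs such as `https://doi.org/10.1016/j.ejor.2021.05.030`. If a uniform format is wanted, one possible cleanup is sketched below. `normalize_doi` is a hypothetical helper, not part of the notebook, and the set of prefixes it strips is an assumption.

```python
import re

# Matches an optional resolver prefix such as "https://doi.org/" or "doi:".
_DOI_PREFIX = re.compile(r"^(?:https?://(?:dx\.)?doi\.org/|doi:\s*)", re.IGNORECASE)


def normalize_doi(value):
    """Return the bare DOI, or None when the value is missing or empty."""
    if not isinstance(value, str):
        return None  # covers NaN/None from entries without a DOI
    doi = _DOI_PREFIX.sub("", value.strip())
    return doi or None


print(normalize_doi("https://doi.org/10.1016/j.ejor.2021.05.030"))
print(normalize_doi("10.1109/TIM.2022.3216366"))
```

On the DataFrame this could be applied with `df_bibtex['doi'].map(normalize_doi)`.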


## 3) Merge Data Frames <a id="merge"></a>

This section merges the extracted PDF content (`df_text`) with the corresponding metadata (`df_bibtex`) using the `file` column as the key. The resulting DataFrame `df_combined` contains both the full document text and its bibliographic metadata. After merging, the columns are reordered and renamed for clarity, and the final dataset is saved as a CSV file (`df_combined.csv`) for later use.
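
A left merge keeps every document even when no metadata matches, so unmatched rows are easy to overlook. The sketch below shows one way to surface them using the `indicator` and `validate` parameters of `pd.merge`. The two small frames are made-up stand-ins for `df_text` and `df_bibtex`, and `validate="one_to_one"` assumes each BibTeX file contributes exactly one entry; it raises an error otherwise.

```python
import pandas as pd

# Toy stand-ins for df_text and df_bibtex; the values are made up.
left = pd.DataFrame({"file": ["doc1", "doc2"], "content": ["text one", "text two"]})
right = pd.DataFrame({"file": ["doc1"], "title": ["Title one"]})

# indicator=True adds a `_merge` column; validate raises MergeError if `file`
# is duplicated on either side.
checked = pd.merge(left, right, on="file", how="left",
                   indicator=True, validate="one_to_one")

# Rows marked "left_only" are documents with no matching metadata.
unmatched = checked.loc[checked["_merge"] == "left_only", "file"].tolist()
print(unmatched)
```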

[⬆️ Back to Top](#top)

In [10]:
# left-merge df_text and df_bibtex on the 'file' column
df_combined = pd.merge(df_text, df_bibtex, on='file', how='left')
df_combined.head()

Unnamed: 0,content,file,title,author,year,number,volume,journal,ENTRYTYPE,doi
0,"Contrastive autoencoder for anomaly detection in multivariate\ntime series\nHao Zhou a, Ke Yu a,⇑, Xuan Zhang b, Guanlin Wu a, Anis Yazidi c,d,e\na Beijing University of Posts and Telecommunicatio...",doc16,Contrastive autoencoder for anomaly detection in multivariate time series,Hao Zhou and Ke Yu and Xuan Zhang and Guanlin Wu and Anis Yazidi,2022,,610.0,Information Sciences,article,https://doi.org/10.1016/j.ins.2022.07.179
1,Computers in Biology and Medicine 127 (2020) 104094\nAvailable online 27 October 2020\n0010-4825/© 2020 Elsevier Ltd. All rights reserved.\nDetection of abnormality in wireless capsule endoscopy i...,doc17,Detection of abnormality in wireless capsule endoscopy images using fractal features,Samir Jain and Ayan Seal and Aparajita Ojha and Ondrej Krejcar and Jan Bureš and Ilja Tachecí and Anis Yazidi,2020,,127.0,Computers in Biology and Medicine,article,https://doi.org/10.1016/j.compbiomed.2020.104094
2,J. Vis. Commun. Image R. 74 (2021) 103008\nAvailable online 21 December 2020\n1047-3203/© 2020 Elsevier Inc. All rights reserved.\nContents lists available at ScienceDirect\nJ. Vis. Commun. Image ...,doc15,Single image dehazing using a new color channel,Geet Sahu and Ayan Seal and Ondrej Krejcar and Anis Yazidi,2021,,74.0,Journal of Visual Communication and Image Representation,article,https://doi.org/10.1016/j.jvcir.2020.103008
3,Pattern Recognition 122 (2022) 108339 \nContents lists available at ScienceDirect \nPattern Recognition \njournal homepage: www.elsevier.com/locate/patcog \nEstimating Tukey depth using incrementa...,doc29,Estimating Tukey depth using incremental quantile estimators,Hugo L. Hammer and Anis Yazidi and Håvard Rue,2022,,122.0,Pattern Recognition,article,https://doi.org/10.1016/j.patcog.2021.108339
4,"Advanced Passive Operating System Fingerprinting\nUsing Machine Learning and Deep Learning\nDesta Haileselassie Hagos∗, Martin Løland†, Anis Yazidi‡, Øivind Kure §, Paal E. Engelstad ¶\n∗§¶Univers...",doc28,Advanced Passive Operating System Fingerprinting Using Machine Learning and Deep Learning,"Hagos, Desta Haileselassie and Løland, Martin and Yazidi, Anis and Kure, Øivind and Engelstad, Paal E.",2020,,,,inproceedings,10.1109/ICCCN49398.2020.9209694


In [11]:
# rearrange column order
df_combined = df_combined[['title', 'author', 'year', 'number', 'volume', 'journal', 'ENTRYTYPE', 'content', 'doi', 'file']]
df_combined.head()

Unnamed: 0,title,author,year,number,volume,journal,ENTRYTYPE,content,doi,file
0,Contrastive autoencoder for anomaly detection in multivariate time series,Hao Zhou and Ke Yu and Xuan Zhang and Guanlin Wu and Anis Yazidi,2022,,610.0,Information Sciences,article,"Contrastive autoencoder for anomaly detection in multivariate\ntime series\nHao Zhou a, Ke Yu a,⇑, Xuan Zhang b, Guanlin Wu a, Anis Yazidi c,d,e\na Beijing University of Posts and Telecommunicatio...",https://doi.org/10.1016/j.ins.2022.07.179,doc16
1,Detection of abnormality in wireless capsule endoscopy images using fractal features,Samir Jain and Ayan Seal and Aparajita Ojha and Ondrej Krejcar and Jan Bureš and Ilja Tachecí and Anis Yazidi,2020,,127.0,Computers in Biology and Medicine,article,Computers in Biology and Medicine 127 (2020) 104094\nAvailable online 27 October 2020\n0010-4825/© 2020 Elsevier Ltd. All rights reserved.\nDetection of abnormality in wireless capsule endoscopy i...,https://doi.org/10.1016/j.compbiomed.2020.104094,doc17
2,Single image dehazing using a new color channel,Geet Sahu and Ayan Seal and Ondrej Krejcar and Anis Yazidi,2021,,74.0,Journal of Visual Communication and Image Representation,article,J. Vis. Commun. Image R. 74 (2021) 103008\nAvailable online 21 December 2020\n1047-3203/© 2020 Elsevier Inc. All rights reserved.\nContents lists available at ScienceDirect\nJ. Vis. Commun. Image ...,https://doi.org/10.1016/j.jvcir.2020.103008,doc15
3,Estimating Tukey depth using incremental quantile estimators,Hugo L. Hammer and Anis Yazidi and Håvard Rue,2022,,122.0,Pattern Recognition,article,Pattern Recognition 122 (2022) 108339 \nContents lists available at ScienceDirect \nPattern Recognition \njournal homepage: www.elsevier.com/locate/patcog \nEstimating Tukey depth using incrementa...,https://doi.org/10.1016/j.patcog.2021.108339,doc29
4,Advanced Passive Operating System Fingerprinting Using Machine Learning and Deep Learning,"Hagos, Desta Haileselassie and Løland, Martin and Yazidi, Anis and Kure, Øivind and Engelstad, Paal E.",2020,,,,inproceedings,"Advanced Passive Operating System Fingerprinting\nUsing Machine Learning and Deep Learning\nDesta Haileselassie Hagos∗, Martin Løland†, Anis Yazidi‡, Øivind Kure §, Paal E. Engelstad ¶\n∗§¶Univers...",10.1109/ICCCN49398.2020.9209694,doc28


In [12]:
# rename columns
df_combined = df_combined.rename(columns={'ENTRYTYPE':'type', 'author':'authors', 'year':'year_published'})
df_combined.head()

Unnamed: 0,title,authors,year_published,number,volume,journal,type,content,doi,file
0,Contrastive autoencoder for anomaly detection in multivariate time series,Hao Zhou and Ke Yu and Xuan Zhang and Guanlin Wu and Anis Yazidi,2022,,610.0,Information Sciences,article,"Contrastive autoencoder for anomaly detection in multivariate\ntime series\nHao Zhou a, Ke Yu a,⇑, Xuan Zhang b, Guanlin Wu a, Anis Yazidi c,d,e\na Beijing University of Posts and Telecommunicatio...",https://doi.org/10.1016/j.ins.2022.07.179,doc16
1,Detection of abnormality in wireless capsule endoscopy images using fractal features,Samir Jain and Ayan Seal and Aparajita Ojha and Ondrej Krejcar and Jan Bureš and Ilja Tachecí and Anis Yazidi,2020,,127.0,Computers in Biology and Medicine,article,Computers in Biology and Medicine 127 (2020) 104094\nAvailable online 27 October 2020\n0010-4825/© 2020 Elsevier Ltd. All rights reserved.\nDetection of abnormality in wireless capsule endoscopy i...,https://doi.org/10.1016/j.compbiomed.2020.104094,doc17
2,Single image dehazing using a new color channel,Geet Sahu and Ayan Seal and Ondrej Krejcar and Anis Yazidi,2021,,74.0,Journal of Visual Communication and Image Representation,article,J. Vis. Commun. Image R. 74 (2021) 103008\nAvailable online 21 December 2020\n1047-3203/© 2020 Elsevier Inc. All rights reserved.\nContents lists available at ScienceDirect\nJ. Vis. Commun. Image ...,https://doi.org/10.1016/j.jvcir.2020.103008,doc15
3,Estimating Tukey depth using incremental quantile estimators,Hugo L. Hammer and Anis Yazidi and Håvard Rue,2022,,122.0,Pattern Recognition,article,Pattern Recognition 122 (2022) 108339 \nContents lists available at ScienceDirect \nPattern Recognition \njournal homepage: www.elsevier.com/locate/patcog \nEstimating Tukey depth using incrementa...,https://doi.org/10.1016/j.patcog.2021.108339,doc29
4,Advanced Passive Operating System Fingerprinting Using Machine Learning and Deep Learning,"Hagos, Desta Haileselassie and Løland, Martin and Yazidi, Anis and Kure, Øivind and Engelstad, Paal E.",2020,,,,inproceedings,"Advanced Passive Operating System Fingerprinting\nUsing Machine Learning and Deep Learning\nDesta Haileselassie Hagos∗, Martin Løland†, Anis Yazidi‡, Øivind Kure §, Paal E. Engelstad ¶\n∗§¶Univers...",10.1109/ICCCN49398.2020.9209694,doc28


In [13]:
# save df as csv file
df_combined.to_csv('df_combined.csv', index=False)
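
Because the `content` column contains newline characters, the saved file holds quoted fields that span several physical lines, so it should be read back with a CSV parser rather than line by line. The sketch below illustrates this with the standard-library `csv` module and an in-memory buffer; the row is made up, and it is only meant to show the quoting behaviour, which pandas' `to_csv` also applies by default.

```python
import csv
import io

rows = [{"title": "Example title", "content": "Line one\nLine two, with a comma"}]

# Write to an in-memory buffer; csv quotes fields containing newlines or commas.
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=["title", "content"])
writer.writeheader()
writer.writerows(rows)

# Reading with a CSV parser restores the multi-line field as one value.
buffer.seek(0)
restored = list(csv.DictReader(buffer))
print(restored[0]["content"] == rows[0]["content"])

# Counting raw lines overstates the number of records (header + 1 row here).
physical_lines = len(buffer.getvalue().splitlines())
print(physical_lines)
```

For the notebook's output, `pd.read_csv('df_combined.csv')` handles these quoted multi-line fields in the same way.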