# Lab 1 Big Data Management

## Processing DBLP dataset

## Installing necessary packages

Note: the `XMLToCSV.py` script must be available in the `dblp-to-csv` folder inside the working directory.

In [1]:
#!pip install lxml

## Importing necessary packages

In [1]:
import csv
from lxml import etree
import os
import pandas as pd
import numpy as np
from collections import Counter
import lxml.etree as ET
import copy

## Setting working directory

In [2]:
# Current working directory: Lab_1
cwd = os.getcwd()

# Define paths relative to cwd
script = os.path.join(cwd, "dblp-to-csv", "XMLToCSV.py")
xml_file = os.path.join(cwd, "dblp.xml")
dtd_file = os.path.join(cwd, "dblp.dtd")
output_file = os.path.join(cwd, "dblp.csv")

## Using python script to parse the XML

Note: this takes roughly 9 to 10 minutes to run.

In [7]:
!python "{script}" "{xml_file}" "{dtd_file}" "{output_file}"

Start!
Reading elements from DTD file...
Finding unique attributes for all elements...
Opening output files...
Parsing XML and writing to CSV files...
Done after 526.742296 seconds


## Deleting non-necessary files

In [8]:
# List of files you want to keep
files_to_keep = {"dblp_article.csv", "dblp_inproceedings.csv"}

# List all files in the current directory
for filename in os.listdir():
    # Delete only CSV files not in the keep list
    if filename.endswith(".csv") and filename not in files_to_keep:
        os.remove(filename)
        print(f"Deleted: {filename}")

print("Cleanup complete!")

Deleted: dblp_book.csv
Deleted: dblp_data.csv
Deleted: dblp_incollection.csv
Deleted: dblp_mastersthesis.csv
Deleted: dblp_phdthesis.csv
Deleted: dblp_proceedings.csv
Deleted: dblp_www.csv
Cleanup complete!
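
A slightly more defensive variant of this cleanup is sketched below. It takes the directory explicitly and supports a dry run, so the list of files can be inspected before anything is deleted. The function name `remove_unwanted_csvs` and the `dry_run` flag are hypothetical and not part of the lab code; the demonstration runs in a temporary directory with made-up empty files.

```python
from pathlib import Path
import tempfile

def remove_unwanted_csvs(directory, keep, dry_run=True):
    """Return the names of CSV files not in `keep`; delete them unless dry_run is True."""
    removed = []
    for path in sorted(Path(directory).glob("*.csv")):
        if path.name in keep:
            continue
        if not dry_run:
            path.unlink()
        removed.append(path.name)
    return removed

# Demonstration in a temporary directory, so no real files are touched
with tempfile.TemporaryDirectory() as tmp:
    for name in ["dblp_article.csv", "dblp_book.csv", "dblp_www.csv", "notes.txt"]:
        (Path(tmp) / name).write_text("", encoding="utf-8")

    keep = {"dblp_article.csv", "dblp_inproceedings.csv"}
    planned = remove_unwanted_csvs(tmp, keep, dry_run=True)    # only lists the files
    deleted = remove_unwanted_csvs(tmp, keep, dry_run=False)   # actually deletes them
    remaining = sorted(p.name for p in Path(tmp).iterdir())

print("Would delete:", planned)
print("Remaining:", remaining)
```

Non-CSV files such as `notes.txt` are left alone, matching the `.endswith(".csv")` check in the cell above.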


## Cleaning and sampling the CSV

We keep only the first 2,000 valid rows of each CSV (about 4,000 papers in total) because of limited computational power. The cleaning step strips stray double quotes (`"`) from every field, skips empty rows (some rows are empty in the input CSVs because of the XMLToCSV.py processing), and skips rows whose title, author, or author-orcid is empty.

Additionally, it removes entries that are not real papers but short introductions to them: the text "Preface." is stripped from titles (rows whose title becomes empty are dropped), and rows whose title equals "Editorial." or starts with "Editorial:" are dropped.

In [3]:
# Function to clean and trim CSV files
def clean_and_trim_csv_to_df(input_file, max_rows=2000):
    cleaned_rows = []
    with open(input_file, 'r', encoding='utf-8') as infile:
        reader = csv.reader(infile, delimiter=';')
        header = next(reader)
        clean_header = [field.replace('"', '').strip() for field in header]

        # Find indexes for required columns
        try:
            title_idx = clean_header.index('title')
            author_idx = clean_header.index('author')
            author_orcid_idx = clean_header.index('author-orcid')
        except ValueError as e:
            raise ValueError(f"Missing required column: {e}")

        for row in reader:
            if not any(field.strip() for field in row):
                continue

            clean_row = [field.replace('"', '').strip() for field in row]

            if (title_idx >= len(clean_row) or clean_row[title_idx] == '' or
                author_idx >= len(clean_row) or clean_row[author_idx] == '' or
                author_orcid_idx >= len(clean_row) or clean_row[author_orcid_idx] == ''):
                continue

            # Clean title
            clean_row[title_idx] = clean_row[title_idx].replace('Preface.', '').strip()

            # Filter out unwanted titles
            title = clean_row[title_idx]
            if title == "Editorial.":
                continue
            if title.startswith("Editorial:"):
                continue
            if title.strip() == "":
                continue  # Extra safeguard for empty titles

            cleaned_rows.append(clean_row)
            if len(cleaned_rows) >= max_rows:
                break

    # Remove completely empty columns
    transposed = list(zip(*cleaned_rows))
    non_empty_column_indexes = [i for i, col in enumerate(transposed) if any(field.strip() for field in col)]

    final_header = [clean_header[i] for i in non_empty_column_indexes]
    final_rows = [[row[i] for i in non_empty_column_indexes] for row in cleaned_rows]

    df = pd.DataFrame(final_rows, columns=final_header)
    print(f"✅ Cleaned {len(final_rows)} rows from {input_file} into a DataFrame.")
    return df

# Clean and load into DataFrames
articles_df = clean_and_trim_csv_to_df('dblp_article.csv')
inproceedings_df = clean_and_trim_csv_to_df('dblp_inproceedings.csv')


✅ Cleaned 2000 rows from dblp_article.csv into a DataFrame.
✅ Cleaned 2000 rows from dblp_inproceedings.csv into a DataFrame.
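
To make the quote-stripping and empty-row steps concrete, here is a minimal sketch on a small in-memory sample. The sample rows are invented for illustration and only mimic the semicolon-delimited layout; the logic mirrors the two list comprehensions used in `clean_and_trim_csv_to_df`.

```python
import csv
import io

# Hypothetical sample: one stray quote, one empty row, one padded field
sample = (
    "title;author;year\n"
    'A title with "stray quote.;Alice|Bob;2001\n'
    ";;\n"
    "  Plain title. ;Carol;2002\n"
)

reader = csv.reader(io.StringIO(sample), delimiter=";")
header = [field.replace('"', "").strip() for field in next(reader)]

rows = []
for row in reader:
    # Skip rows in which every field is empty or whitespace
    if not any(field.strip() for field in row):
        continue
    # Drop stray double quotes and surrounding whitespace
    rows.append([field.replace('"', "").strip() for field in row])

print(header)
print(rows)
```

Only the two non-empty rows survive, and the stray quote and padding are gone from their titles.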


# Generating fake data

## Fake data for inproceedings (type, venue) consistent across papers

This code generates consistent fake venue and type data for inproceedings. By consistent we mean the following: an inproceedings "event" is uniquely identified by the (booktitle, year) pair, so if two papers were published at the same event, they share exactly the same venue and type.

In [4]:
import random


# Step 1: Define a list of cities to use as fake venues
venues = [
    'Barcelona', 'New York', 'Berlin', 'Tokyo', 'Paris',
    'London', 'San Francisco', 'Lisbon', 'Amsterdam', 'Singapore'
]

# Step 2: Generate a mapping for each (booktitle, year) pair
group_keys = inproceedings_df[['booktitle', 'year']].drop_duplicates().copy()
group_keys['type'] = group_keys.apply(lambda _: random.choice(['conference', 'workshop']), axis=1)
group_keys['venue'] = group_keys.apply(lambda _: random.choice(venues), axis=1)

# Step 3: Merge this mapping back into the main DataFrame
inproceedings_df_fake = inproceedings_df.merge(group_keys, on=['booktitle', 'year'], how='left')
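
The consistency property can be checked on a toy example. The sketch below uses invented papers and a fixed seed (an assumption made only so the sketch is repeatable; the cell above is unseeded) and verifies that every (booktitle, year) group ends up with exactly one type and one venue.

```python
import random
import pandas as pd

random.seed(0)  # assumption: seeded only to make this sketch repeatable

# Invented papers: P1 and P2 belong to the same event (ICDE, 2019)
papers = pd.DataFrame({
    "title": ["P1", "P2", "P3", "P4"],
    "booktitle": ["ICDE", "ICDE", "ICDE", "VLDB"],
    "year": ["2019", "2019", "2020", "2019"],
})

# One row per event, with randomly drawn attributes
events = papers[["booktitle", "year"]].drop_duplicates().copy()
events["type"] = [random.choice(["conference", "workshop"]) for _ in range(len(events))]
events["venue"] = [random.choice(["Barcelona", "Berlin", "Tokyo"]) for _ in range(len(events))]

# Attach the event attributes to every paper of that event
merged = papers.merge(events, on=["booktitle", "year"], how="left")

# Each event must map to exactly one type and one venue
per_event = merged.groupby(["booktitle", "year"])[["type", "venue"]].nunique()
consistent = bool((per_event == 1).all().all())
print(consistent)
```

Because the random draw happens once per event rather than once per paper, the left merge cannot produce two papers of the same event with different attributes.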


# Creating Abstract, Keywords and Citations between papers

Here we generate a random abstract and a random set of keywords for each paper. Later, in Cypher, we will create the citation links between these papers directly in the graph, which is much faster than checking the relations one by one and building a citations CSV.

In [5]:
# Creating Abstracts with faker
from faker import Faker

# Initialize Faker
fake = Faker()
Faker.seed(42)   # Optional: makes the generated text reproducible
random.seed(42)  # Also seed `random`, which picks the number of paragraphs

# Function to generate a fake abstract made of 3 to `max_paragraphs` random paragraphs
def generate_fake_abstract(max_paragraphs=5):
    return ' '.join(fake.paragraphs(nb=random.randint(3, max_paragraphs)))

# Apply the abstract generation
inproceedings_df_fake['abstract'] = inproceedings_df_fake.apply(lambda _: generate_fake_abstract(), axis=1)
articles_df['abstract'] = articles_df.apply(lambda _: generate_fake_abstract(), axis=1)

In [6]:
# Creating Keywords
import random
import json

# Keyword pool for generating random keywords
keyword_pool = [
    "graph processing", "property graph", "data quality", "data mining",
    "machine learning", "distributed systems", "query optimization",
    "graph databases", "big data", "semantic web", "information retrieval",
    "knowledge graphs", "scalability", "neural networks", "clustering",
    "text mining", "deep learning", "data integration", "cloud computing",
    "edge computing", "stream processing", "natural language processing",
    "transformer models", "recommender systems", "multi-modal learning",
    "fairness in AI", "bias detection", "explainable AI", "federated learning",
    "representation learning", "graph neural networks", "zero-shot learning",
    "active learning", "anomaly detection", "semantic similarity",
    "entity resolution", "ontology alignment", "blockchain applications",
    "privacy-preserving ML", "data augmentation", "knowledge distillation",
    "meta-learning", "reinforcement learning", "autonomous systems",
    "computational social science", "medical informatics", "cybersecurity analytics",
    "data provenance", "information diffusion", "social network analysis"
]

def assign_keywords(pool, min_k=5, max_k=10):
    return random.sample(pool, k=random.randint(min_k, max_k))

# Add keywords column to each dataframe
articles_df['keywords'] = articles_df.apply(lambda _: assign_keywords(keyword_pool), axis=1)
inproceedings_df_fake['keywords'] = inproceedings_df_fake.apply(lambda _: assign_keywords(keyword_pool), axis=1)

# Convert keyword lists to JSON strings
articles_df['keywords'] = articles_df['keywords'].apply(json.dumps)
inproceedings_df_fake['keywords'] = inproceedings_df_fake['keywords'].apply(json.dumps)
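
Storing the keyword list as a JSON string keeps it in a single CSV field; whatever reads the CSV later has to decode it again. A minimal round-trip sketch, using a shortened pool and a local seeded generator (both assumptions made for illustration):

```python
import json
import random

rng = random.Random(42)  # local generator, so the global random state is untouched
pool = ["graph databases", "big data", "clustering",
        "data mining", "scalability", "semantic web"]

# Draw between 2 and 4 distinct keywords, as assign_keywords does with 5 to 10
keywords = rng.sample(pool, k=rng.randint(2, 4))

encoded = json.dumps(keywords)   # a single string, safe to store in one CSV field
decoded = json.loads(encoded)    # back to a Python list

print(encoded)
```

`random.sample` draws without replacement, so a paper never receives the same keyword twice.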

## Creating Reviews

First we create a "reviewer pool": authors who have published at least 2 papers in the conferences/journals of our sample. Then we assign 3 different reviewers from this pool to each paper, under the constraint that a reviewer cannot be an author of that paper. Note that we keep only inproceedings of type conference, as workshops don't have reviewers according to the lab statement.

In [7]:
# Step 1: Filter only journal articles + conference-type inproceedings
inproceedings_conferences = inproceedings_df_fake[inproceedings_df_fake['type'] == 'conference']
all_papers_df = pd.concat([articles_df, inproceedings_conferences], ignore_index=True)

# Step 2: Explode authors into individual rows
author_paper_df = all_papers_df[['title', 'author']].copy()
author_paper_df['author'] = author_paper_df['author'].str.split('|')
author_paper_df = author_paper_df.explode('author').dropna()
author_paper_df['author'] = author_paper_df['author'].str.strip()

# Step 3: Build reviewer pool = authors with >= 2 papers
author_counts = author_paper_df['author'].value_counts()
reviewer_pool = author_counts[author_counts >= 2].index.tolist()

# Step 4: Map each paper to its authors
paper_authors = author_paper_df.groupby('title')['author'].apply(set).to_dict()

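# Step 5: Assign reviewers (3 different ones, not in author list)
review_edges = []
reviewer_pool_set = set(reviewer_pool)  # build once instead of once per paper

# Iterate over unique titles so a duplicated title is not assigned reviewers twice
for title in all_papers_df['title'].drop_duplicates():
    paper_auths = paper_authors.get(title, set())
    eligible_reviewers = list(reviewer_pool_set - paper_auths)

    # Papers with fewer than 3 eligible reviewers are skipped
    if len(eligible_reviewers) >= 3:
        reviewers = random.sample(eligible_reviewers, 3)
        for reviewer in reviewers:
            review_edges.append({'reviewer': reviewer, 'paper': title})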

# Step 6: Create DataFrame and save to CSV
review_edges_df = pd.DataFrame(review_edges)
review_edges_df.to_csv('review_edges.csv', sep=';', index=False)

print("✅ Generated reviewer assignments for journal + conference papers only. Saved to review_edges.csv.")


✅ Generated reviewer assignments for journal + conference papers only. Saved to review_edges.csv.
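
The assignment rule can be sanity-checked on a toy example. The sketch below uses invented papers and authors and a local seeded generator (assumptions for illustration only). It rebuilds the pool of authors with at least 2 papers, assigns 3 reviewers per paper, and leaves out any paper with fewer than 3 eligible reviewers, as the loop above does.

```python
import random
from collections import Counter

rng = random.Random(7)  # local seeded generator, for repeatability of the sketch

# Invented data: paper title -> set of authors
paper_authors = {
    "Paper A": {"Ann", "Bob"},
    "Paper B": {"Ann", "Cid"},
    "Paper C": {"Bob", "Dan", "Eve"},
    "Paper D": {"Cid", "Eve"},
    "Paper E": {"Dan", "Fay"},
}

# Reviewer pool: authors with at least 2 papers (Fay has only one)
counts = Counter(a for authors in paper_authors.values() for a in authors)
reviewer_pool = {a for a, n in counts.items() if n >= 2}

assignments = {}
for title, authors in paper_authors.items():
    eligible = sorted(reviewer_pool - authors)  # sorted for a stable order
    if len(eligible) >= 3:
        assignments[title] = rng.sample(eligible, 3)

# Checks: 3 distinct reviewers per paper, none of them an author of it
for title, reviewers in assignments.items():
    assert len(set(reviewers)) == 3
    assert set(reviewers).isdisjoint(paper_authors[title])

print(sorted(assignments))
```

"Paper C" gets no reviewers here because only two pool members are not among its authors. In the real data, such papers are likewise absent from `review_edges.csv`.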


# Exporting final csv

Checking for duplicate titles before exporting

In [8]:
# Check duplicates in articles_df
duplicated_articles = articles_df[articles_df['title'].duplicated(keep=False)]

# Check duplicates in inproceedings_df_fake
duplicated_inproceedings = inproceedings_df_fake[inproceedings_df_fake['title'].duplicated(keep=False)]

# Report
print(f"🔎 Duplicate titles in articles_df: {duplicated_articles.shape[0]} rows")
print(f"🔎 Duplicate titles in inproceedings_df_fake: {duplicated_inproceedings.shape[0]} rows")

# If needed, print the titles
if not duplicated_articles.empty:
    print("Duplicated titles in articles_df:")
    print(duplicated_articles['title'].unique())

if not duplicated_inproceedings.empty:
    print("Duplicated titles in inproceedings_df_fake:")
    print(duplicated_inproceedings['title'].unique())


🔎 Duplicate titles in articles_df: 2 rows
🔎 Duplicate titles in inproceedings_df_fake: 4 rows
Duplicated titles in articles_df:
['CIGRE Austria Next Generation Network.']
Duplicated titles in inproceedings_df_fake:
['International conference on computational science, ICCS 2010 data-driven pill monitoring.'
 'Workshop on tools for program development and analysis in computational science.']


Inspecting these 3 special cases shows that each is the same paper listed for a different venue, so we keep only the first occurrence.

In [9]:
# Drop duplicate titles, keeping the first occurrence
articles_df = articles_df.drop_duplicates(subset='title', keep='first')
inproceedings_df_fake = inproceedings_df_fake.drop_duplicates(subset='title', keep='first')

# Then export safely
articles_df.to_csv('dblp_article_clean.csv', sep=';', index=False)
inproceedings_df_fake.to_csv('dblp_inproceedings_clean.csv', sep=';', index=False)

print("✅ Exported clean CSVs with no duplicate titles.")

✅ Exported clean CSVs with no duplicate titles.


In [10]:
# Check for duplicates in articles_df
has_duplicates_articles = articles_df['title'].duplicated().any()

# Check for duplicates in inproceedings_df_fake
has_duplicates_inproceedings = inproceedings_df_fake['title'].duplicated().any()

# Report results
if has_duplicates_articles:
    print("❌ Duplicates still exist in articles_df!")
else:
    print("✅ No duplicates in articles_df.")

if has_duplicates_inproceedings:
    print("❌ Duplicates still exist in inproceedings_df_fake!")
else:
    print("✅ No duplicates in inproceedings_df_fake.")

✅ No duplicates in articles_df.
✅ No duplicates in inproceedings_df_fake.
