# Systematic Literature Review Metadata Extraction Demo

Welcome to the **Systematic Literature Review Metadata Curation** tool! This notebook demonstrates how to use our automated pipeline to extract and clean metadata from academic sources.

## What This Tool Does

This project automates the metadata curation process for systematic literature reviews by:
- **Extracting missing metadata** from 8 major academic databases (IEEE, ACM, ScienceDirect, etc.)
- **Cleaning and standardizing** author names, titles, abstracts, and other fields
- **Validating quality** to ensure data consistency
- **Supporting 16 datasets** with 32,614+ research articles total

## Key Statistics
- **99% article recovery rate** from academic databases
- **97% automation success rate** for metadata extraction
- **8 academic sources** supported (IEEE, ACM, ScienceDirect, Springer, Scopus, Web of Science, arXiv, PubMed)

---

## Getting Started

Let's start by setting up the environment and loading the Demo dataset.

In [1]:
# Import required libraries
import sys
import os
import pandas as pd
import numpy as np
from pathlib import Path
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display, HTML, Markdown

# Configure display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 100)
plt.style.use('default')
sns.set_palette('husl')

print("Libraries imported successfully!")
print(f"Current working directory: {os.getcwd()}")

Libraries imported successfully!
Current working directory: c:\Users\guill\OneDrive - Universite de Montreal\Projet Curation des métadonnées


In [2]:
# Display current path configuration
print("System Path Configuration")
print("=" * 40)

try:
    from Scripts.core.os_path import display_current_paths
    display_current_paths()
except ImportError as e:
    print(f"Error importing path configuration: {e}")
    print("Please ensure Scripts/core/os_path.py exists and is accessible.")
except Exception as e:
    print(f"Error displaying paths: {e}")
    print("Check that all paths in os_path.py are correctly configured.")

System Path Configuration
Windows
Current Path Configuration
Operating System: Windows

MAIN_PATH:
   Path: C:/Users/guill/OneDrive - Universite de Montreal/Projet Curation des métadonnées
   Status: EXISTS
   Purpose: Project root directory (source code and datasets)
   Use: Contains Scripts/, Datasets/, Database/ folders
   Required: Yes
   User Configurable: Yes

EXTRACTED_PATH:
   Path: C:/Users/guill/OneDrive - Universite de Montreal/Projet Curation des métadonnées/Database
   Status: EXISTS
   Purpose: Cache storage for extracted HTML/BibTeX files
   Use: High-capacity storage for web scraping cache
   Required: No
   User Configurable: Yes

DOWNLOAD_PATH:
   Path: C:/Users/guill/Downloads
   Status: EXISTS
   Purpose: Browser downloads directory
   Use: Temporary downloads during web scraping
   Required: No
   User Configurable: Yes

FIREFOX_PROFILE_PATH:
   Path: C:/Users/guill/AppData/Roaming/Mozilla/Firefox/Profiles/4am1ne92.default-release-1609958750563
   Status: EXISTS
  

## Demo Dataset Overview

The **Demo dataset** focuses on **Digital Twin Cyber-Physical Systems Testing**. Let's explore what this dataset contains and how our pipeline processes it.

In [4]:
# Load the source Demo dataset
try:
    demo_data = pd.read_excel('Datasets/Demo/Demo-source.xlsx')
    print(f"Source dataset loaded: {demo_data.shape[0]} articles × {demo_data.shape[1]} fields")
except FileNotFoundError:
    print("No dataset files found. Please ensure the Demo dataset exists.")
    demo_data = pd.DataFrame()  # Empty dataframe as fallback

if not demo_data.empty:
    print(f"\nDataset columns: {list(demo_data.columns)}")
    display(demo_data.head(3))

Source dataset loaded: 5 articles × 23 fields

Dataset columns: ['Title', 'year', 'issue', 'author', 'doi', 'title', 'url', 'city', 'pmid', 'pages', 'abstract', 'issn', 'month', 'isbn', 'publisher', 'keywords', 'journal', 'volume', 'Duplicate', 'Title + Abstract', 'Full Text', 'Snowballing', 'Total']


Unnamed: 0,Title,year,issue,author,doi,title,url,city,pmid,pages,abstract,issn,month,isbn,publisher,keywords,journal,volume,Duplicate,Title + Abstract,Full Text,Snowballing,Total
0,Fiber orientation measurement from mesoscale CT scans of prepreg platelet molded composites,2018,,Benjamin R. Denos and Drew E. Sommer and Anthony J. Favaloro and R. Byron Pipes and William B. A...,10.1016/j.compositesa.2018.08.024,Fiber orientation measurement from mesoscale CT scans of prepreg platelet molded composites,,,,241-249,REMOVED TO COMPLY WITH COPYRIGHT,1359835X,11.0,,Elsevier Ltd,REMOVED TO COMPLY WITH COPYRIGHT,Composites Part A: Applied Science and Manufacturing,114,Accepted,Rejected - Model based verification,,,1.0
1,Dynamic fault injection into digital twins of safety-critical systems,2021,,Thomas Markwirth and Roland Jancke and Christoph Sohrmann,10.23919/DATE51398.2021.9474066,Dynamic fault injection into digital twins of safety-critical systems,,,,446-450,REMOVED TO COMPLY WITH COPYRIGHT,15301591,2.0,9783982000000.0,Institute of Electrical and Electronics Engineers Inc.,REMOVED TO COMPLY WITH COPYRIGHT,Proceedings -Design Automation and Test in Europe DATE,2021-02-01 00:00:00,Accepted,Accepted,Rejected - DT not used for testing CPS,,
2,Towards security-aware virtual environments for digital twins,2018,,,10.1145/3198458.3198464,Towards security-aware virtual environments for digital twins,,,,61-72,REMOVED TO COMPLY WITH COPYRIGHT,,5.0,9781450000000.0,Association for Computing Machinery Inc,REMOVED TO COMPLY WITH COPYRIGHT,CPSS 2018 - Proceedings of the 4th ACM Workshop on Cyber-Physical System Security Co-located wit...,,Accepted,Accepted,Accepted,,


In [5]:
from Scripts.specialized.Demo import Demo
# Initialize the Demo dataset
demo = Demo()

print("Demo Dataset Information:")
print("Topic: Digital Twin Cyber-Physical Systems Testing")

# Display inclusion and exclusion criteria from the Demo class
print("\nInclusion & Exclusion Criteria:")

# Import the criteria from the Demo module
from Scripts.specialized.Demo import CRITERIA_DESCRIPTIONS

criteria_data = []
for criteria_id, description in CRITERIA_DESCRIPTIONS.items():
    if criteria_id.startswith('IC'):
        criteria_data.append({"Type": "Inclusion", "ID": criteria_id, "Description": description})
    elif criteria_id.startswith('EC'):
        criteria_data.append({"Type": "Exclusion", "ID": criteria_id, "Description": description})
    elif criteria_id.startswith('QC'):
        criteria_data.append({"Type": "Quality", "ID": criteria_id, "Description": description})

criteria_df = pd.DataFrame(criteria_data)
display(criteria_df)

Loaded 5 articles from Demo source (including duplicates)
After duplicate removal: 5 articles
After title/abstract screening: 4 articles
After full-text screening: 1 articles
Demo dataset initialized with 5 articles
Demo Dataset Information:
Topic: Digital Twin Cyber-Physical Systems Testing

Inclusion & Exclusion Criteria:


Unnamed: 0,Type,ID,Description
0,Inclusion,IC1,At least one testing technique is described
1,Inclusion,IC2,The system under test must be a cyber–physical system
2,Inclusion,IC3,Testing is performed using a digital twin
3,Exclusion,EC1,The digital twin described does not use a live data coupling
4,Exclusion,EC2,The study describes future use of a digital twin
5,Exclusion,EC3,Non-english study
6,Exclusion,EC4,Not published in a journal or conference proceedings
7,Quality,QC1,Are the research questions of the examined study answered?
8,Quality,QC2,Is the study reproducible?


In [6]:
# Load the processed Demo dataset
try:
    demo_data = pd.read_excel('Datasets/Demo/Demo_pre-extract.xlsx')
    print(f"Processed dataset loaded: {demo_data.shape[0]} articles × {demo_data.shape[1]} fields")
except FileNotFoundError:
    print("No dataset files found. Please ensure the Demo dataset exists.")
    demo_data = pd.DataFrame()  # Empty dataframe as fallback

if not demo_data.empty:
    print(f"\nDataset columns: {list(demo_data.columns)}")
    display(demo_data.head(3))

Processed dataset loaded: 5 articles × 24 fields

Dataset columns: ['Unnamed: 0', 'key', 'project', 'title', 'abstract', 'keywords', 'authors', 'venue', 'doi', 'references', 'pages', 'bibtex', 'screened_decision', 'final_decision', 'mode', 'inclusion_criteria', 'exclusion_criteria', 'reviewer_count', 'source', 'year', 'meta_title', 'link', 'publisher', 'metadata_missing']


Unnamed: 0.1,Unnamed: 0,key,project,title,abstract,keywords,authors,venue,doi,references,pages,bibtex,screened_decision,final_decision,mode,inclusion_criteria,exclusion_criteria,reviewer_count,source,year,meta_title,link,publisher,metadata_missing
0,0,,Demo,Fiber orientation measurement from mesoscale CT scans of prepreg platelet molded composites,,,Benjamin R. Denos and Drew E. Sommer and Anthony J. Favaloro and R. Byron Pipes and William B. A...,Composites Part A: Applied Science and Manufacturing,https://doi.org/10.1016/j.compositesa.2018.08.024,,241-249,,Excluded,Excluded,new_screen,IC1: At least one testing technique is described,,2,ScienceDirect,2018,,,Elsevier Ltd,
1,1,,Demo,Dynamic fault injection into digital twins of safety-critical systems,,,Thomas Markwirth and Roland Jancke and Christoph Sohrmann,Proceedings -Design Automation and Test in Europe DATE,https://doi.org/10.23919/DATE51398.2021.9474066,,446-450,,Included,Excluded,new_screen,IC2: The system under test must be a cyber–physical system,,2,IEEE,2021,,,Institute of Electrical and Electronics Engineers Inc.,
2,2,,Demo,Towards security-aware virtual environments for digital twins,,,,CPSS 2018 - Proceedings of the 4th ACM Workshop on Cyber-Physical System Security Co-located wit...,https://doi.org/10.1145/3198458.3198464,,61-72,,Included,Included,new_screen,,,2,ACM,2018,,,Association for Computing Machinery Inc,


In [7]:
# Display basic dataset statistics
if not demo_data.empty:
    print("Dataset Overview:")
    print(f"Total articles: {len(demo_data)}")
    
    # Show data completeness
    completeness = demo_data.count() / len(demo_data) * 100
    print("\nData Completeness by Field:")
    for col in demo_data.columns:
        if col in ['title', 'abstract', 'authors', 'venue', 'doi']:
            print(f"- {col.title()}: {completeness[col]:.1f}% complete")
    
    # Display sample data
    print("\nSample Articles (first 3 rows):")
    display_cols = [col for col in ['title', 'authors', 'venue', 'year'] if col in demo_data.columns]
    if display_cols:
        display(demo_data[display_cols].head(3))
else:
    print("No data to display")

Dataset Overview:
Total articles: 5

Data Completeness by Field:
- Title: 100.0% complete
- Abstract: 0.0% complete
- Authors: 80.0% complete
- Venue: 80.0% complete
- Doi: 100.0% complete

Sample Articles (first 3 rows):


Unnamed: 0,title,authors,venue,year
0,Fiber orientation measurement from mesoscale CT scans of prepreg platelet molded composites,Benjamin R. Denos and Drew E. Sommer and Anthony J. Favaloro and R. Byron Pipes and William B. A...,Composites Part A: Applied Science and Manufacturing,2018
1,Dynamic fault injection into digital twins of safety-critical systems,Thomas Markwirth and Roland Jancke and Christoph Sohrmann,Proceedings -Design Automation and Test in Europe DATE,2021
2,Towards security-aware virtual environments for digital twins,,CPSS 2018 - Proceedings of the 4th ACM Workshop on Cyber-Physical System Security Co-located wit...,2018


## Processing Without Web Extraction

When running the pipeline **without web extraction** (using the `--no-extraction` flag), the system performs the following operations:

### Data Processing Steps:
1. **Dataset Loading**: Loads the source systematic review data from Excel/TSV files
2. **Schema Standardization**: Converts source columns to standardized metadata schema
3. **Duplicate Handling**: Identifies and processes duplicate article titles
4. **Decision Processing**: Maps screening and final decisions with exclusion criteria
5. **Data Cleaning**: Removes special characters, standardizes encoding, and validates data
6. **Quality Validation**: Checks for missing fields and data consistency
7. **Export**: Saves processed dataset as UTF-8 TSV file
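
To make steps 2 and 4 more concrete, here is a minimal sketch of what schema standardization and decision mapping could look like in pandas. It is not the project's actual implementation: `COLUMN_MAP`, `to_decision`, and `standardize` are hypothetical names, and the rules (renaming `author`/`journal`, prefixing DOIs with `https://doi.org/`, treating any cell that does not start with "Accepted" as excluded) are assumptions inferred from the source and processed tables displayed earlier.

```python
import pandas as pd

# Assumed mapping from source column names to the standardized schema,
# inferred from the column lists displayed earlier in this notebook.
COLUMN_MAP = {"author": "authors", "journal": "venue"}


def to_decision(cell):
    """Return 'Included' for cells starting with 'Accepted', else 'Excluded'."""
    if isinstance(cell, str) and cell.strip().lower().startswith("accepted"):
        return "Included"
    return "Excluded"


def standardize(source):
    """Rename columns, expand DOIs to links, and map screening decisions."""
    out = source.rename(columns=COLUMN_MAP)
    # Prefix bare DOIs with the resolver URL; missing DOIs stay missing.
    out["doi"] = out["doi"].map(
        lambda d: f"https://doi.org/{d}" if isinstance(d, str) else d
    )
    out["screened_decision"] = out["Title + Abstract"].map(to_decision)
    out["final_decision"] = out["Full Text"].map(to_decision)
    keep = ["title", "authors", "venue", "doi", "screened_decision", "final_decision"]
    return out[keep]


# Toy rows modelled on two Demo articles (venue names shortened for the example).
source = pd.DataFrame(
    {
        "title": [
            "Dynamic fault injection into digital twins of safety-critical systems",
            "Towards security-aware virtual environments for digital twins",
        ],
        "author": ["Thomas Markwirth and Roland Jancke and Christoph Sohrmann", None],
        "journal": ["DATE", "CPSS 2018"],
        "doi": ["10.23919/DATE51398.2021.9474066", "10.1145/3198458.3198464"],
        "Title + Abstract": ["Accepted", "Accepted"],
        "Full Text": ["Rejected - DT not used for testing CPS", "Accepted"],
    }
)

standardized = standardize(source)
print(standardized[["doi", "screened_decision", "final_decision"]])
```

In this sketch an empty screening cell is treated as "Excluded", which is a simplifying assumption. The real pipeline also records the matching inclusion or exclusion criterion, which is omitted here.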

### What is NOT performed:
- No web scraping or HTTP requests
- No HTML content downloading
- No metadata extraction from external sources
- No browser automation (Firefox/Selenium)
- No cache file creation

### Use Cases:
- **Initial data exploration** and validation
- **Testing dataset processing** without network dependencies
- **Offline development** and debugging
- **Data format conversion** from source to standardized schema
- **Quick processing** of datasets with complete metadata

### Expected Output:
The processed dataset will contain all original metadata but may have missing fields (abstract, keywords, references, bibtex) that would normally be filled by web extraction. These fields will remain empty (NaN) until web extraction is performed.

### Performance:
Processing without web extraction is significantly faster (seconds vs. minutes) and requires no internet connectivity or browser setup.

## Running the Metadata Extraction Pipeline

Now let's see how to run the automated metadata extraction process. This will:
1. **Identify missing metadata** in the dataset
2. **Search academic databases** for missing information
3. **Extract and clean** the metadata
4. **Validate and standardize** the results
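
As a rough illustration of the "extract and clean" step, the sketch below shows one way text scraped from an HTML page could be tidied before it is stored. The function name `clean_text` and the specific rules (strip tags, decode HTML entities, normalize Unicode to NFC, collapse whitespace) are assumptions made for this example. The project's own cleaning rules may differ.

```python
import html
import re
import unicodedata


def clean_text(raw):
    """Tidy a scraped text fragment (illustrative rules, not the project's)."""
    # Drop simple inline tags such as <i>...</i> left over from the page markup.
    text = re.sub(r"<[^>]+>", " ", raw)
    # Decode entities such as &amp; or &nbsp; into real characters.
    text = html.unescape(text)
    # Use one canonical Unicode form so identical strings compare equal.
    text = unicodedata.normalize("NFC", text)
    # Collapse runs of whitespace (including non-breaking spaces) to one space.
    text = re.sub(r"\s+", " ", text)
    return text.strip()


print(clean_text("Testing &amp; validation of <b>digital twins</b>\n"))
```

Legitimate non-ASCII characters, such as the en dash in "cyber–physical", are kept as-is by this sketch.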

Metadata Extraction Pipeline Workflow:

1. Load source dataset and identify missing metadata fields
2. Generate search queries for articles with incomplete data
3. Search academic databases (IEEE, ACM, ScienceDirect, etc.)
4. Download and cache HTML content from found articles
5. Parse HTML using source-specific extractors
6. Clean and standardize extracted metadata
7. Validate data quality and title matching
8. Export final standardized dataset

The entire process is automated and typically achieves a 97% success rate!
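
Step 7 mentions title matching. One simple way to check that a downloaded page really belongs to the article being searched for is to compare normalized titles with a similarity ratio. The sketch below uses the standard library's `difflib.SequenceMatcher`. The helper names and the 0.9 threshold are assumptions for illustration, not values taken from the pipeline.

```python
import re
import unicodedata
from difflib import SequenceMatcher


def normalize_title(title):
    """Lowercase, strip accents and punctuation, and collapse whitespace."""
    text = unicodedata.normalize("NFKD", title)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9]+", " ", text.lower())
    return text.strip()


def titles_match(expected, found, threshold=0.9):
    """Return True when the two titles are similar enough after normalization."""
    ratio = SequenceMatcher(
        None, normalize_title(expected), normalize_title(found)
    ).ratio()
    return ratio >= threshold


expected = "Dynamic fault injection into digital twins of safety-critical systems"
print(titles_match(expected, "Dynamic Fault Injection into Digital Twins of Safety-Critical Systems."))
print(titles_match(expected, "Towards security-aware virtual environments for digital twins"))
```

A threshold below 1.0 tolerates small differences such as a trailing period or a publisher's subtitle. Where to set it is a trade-off between missed matches and false positives.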

## Processing Execution Without Web Extraction

When running the pipeline **without web extraction**, the system performs streamlined data processing operations that focus on standardizing and cleaning existing metadata without accessing external sources.

### Expected Results:
The output dataset will contain all available source metadata in standardized format, but fields like abstract, keywords, references, and bibtex may remain empty (NaN) since they are typically extracted from external sources. These gaps can be filled later by running with web extraction enabled.

This mode lets you compile the dataset from whatever has been extracted so far, even if the extraction is not yet complete.

In [8]:
# LIVE EXECUTION: Running actual processing for Demo dataset WITHOUT web extraction
print("Starting Demo dataset processing WITHOUT web extraction...")
print("="*60)

import os
from datetime import datetime

# Record start time
start_time = datetime.now()
print(f"Started at: {start_time.strftime('%Y-%m-%d %H:%M:%S')}")
print()

try:
    # Report the working directory; the relative paths below assume it is the project root
    project_dir = os.getcwd()
    print(f"Working directory: {project_dir}")
    
    # Verify main.py exists
    main_script = "Scripts/main.py"
    if not os.path.exists(main_script):
        raise FileNotFoundError(f"Main script not found: {main_script}")
    
    print(f"Found main script: {main_script}")
    from Scripts.main import main
    print()
    
    # Execute the main script with Demo argument - NO WEB EXTRACTION
    print("Executing: python Scripts/main.py Demo --no-extraction")
    print("This will process data WITHOUT web extraction...")
    print("Processing standardization, cleaning, and export only...")
    print()
    
    # Run the command WITHOUT web extraction
    main(['Demo'], do_extraction=False, do_filter=False)
    
    # Calculate execution time
    end_time = datetime.now()
    duration = end_time - start_time
    
    print()
    print("="*60)
    print("Execution completed!")
    print(f"Total time: {duration}")
    
    print("Processing completed successfully!")
    
    # Check if output files were created
    output_file = "Datasets/Demo/Demo.tsv"
    if os.path.exists(output_file):
        print(f"Output file created: {output_file}")
    else:
        print(f"Output file not found: {output_file}")
        
except Exception as e:
    print(f"Error during execution: {e}")
    import traceback
    traceback.print_exc()

print("\nDemo processing (without web extraction) complete!")

Starting Demo dataset processing WITHOUT web extraction...
Started at: 2025-09-18 22:46:00

Working directory: c:\Users\guill\OneDrive - Universite de Montreal\Projet Curation des métadonnées
Found main script: Scripts/main.py

Executing: python Scripts/main.py Demo --no-extraction
This will process data WITHOUT web extraction...
Processing standardization, cleaning, and export only...

Loaded 5 articles from Demo source (including duplicates)
After duplicate removal: 5 articles
After title/abstract screening: 4 articles
After full-text screening: 1 articles
Demo dataset initialized with 5 articles
5 articles require metadata extraction
ℹ️ Web scraping disabled - using cached files only
📁 Loading cached extraction files...
Found 0 HTML files and 0 BibTeX files in cache

🔄 Starting metadata extraction for 5 articles...
📊 Run configuration: 999 (Complete)

--- Processing article 0 ---
📖 Title: Fiber orientation measurement from mesoscale CT scans of prepreg platelet molded...
🔗 Attemptin

In [9]:
# Read and show summary of results (without web extraction)
import pandas as pd
try:
    result_df = pd.read_csv("Datasets/Demo/Demo.tsv", sep='\t')
    print(f"Final dataset contains {len(result_df)} articles")
    
    # Show metadata completion after processing (without web extraction)
    completion_after = {
        'abstract': result_df['abstract'].notna().sum(),
        'authors': result_df['authors'].notna().sum(),
        'keywords': result_df['keywords'].notna().sum(),
        'bibtex': result_df['bibtex'].notna().sum()
    }
    
    print("\nMetadata completion after processing (without web extraction):")
    for field, count in completion_after.items():
        percentage = (count / len(result_df)) * 100
        print(f"- {field.title()}: {count}/{len(result_df)} complete ({percentage:.1f}%)")
    
    print("\nNote: Fields like abstract, keywords, and bibtex are typically empty")
    print("without web extraction since they come from external sources.")
        
except Exception as e:
    print(f"Could not analyze results: {e}")

Final dataset contains 5 articles

Metadata completion after processing (without web extraction):
- Abstract: 0/5 complete (0.0%)
- Authors: 4/5 complete (80.0%)
- Keywords: 0/5 complete (0.0%)
- Bibtex: 0/5 complete (0.0%)

Note: Fields like abstract, keywords, and bibtex are typically empty
without web extraction since they come from external sources.


## LIVE EXECUTION: Running Web Extraction for Demo Dataset

**This will perform actual web extraction with the following steps:**
1. Search academic databases (IEEE, ACM, ScienceDirect, etc.)
2. Download HTML content from article pages
3. Extract missing metadata (abstracts, keywords, references)
4. Save results to the Demo dataset

**Requirements:**
- Internet connectivity
- Academic database access (institutional access recommended)
- Estimated time: 2-5 minutes for 5 articles

The `do_filter` parameter restricts web extraction to the articles that are still missing, which saves processing time on articles that have already been extracted.
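
To give an idea of what such a filter might do, here is a hedged sketch that selects only the rows where at least one extraction-filled field is still empty. `needs_extraction` and `EXTRACTED_FIELDS` are hypothetical names, and the choice of fields is an assumption based on the fields this notebook describes as coming from web extraction. How `do_filter` is actually implemented inside `Scripts/main.py` is not shown here.

```python
import pandas as pd

# Fields this notebook describes as being filled by web extraction (assumed list).
EXTRACTED_FIELDS = ["abstract", "keywords", "references", "bibtex"]


def needs_extraction(df, fields=EXTRACTED_FIELDS):
    """Boolean mask: True where any extraction-filled field is still missing."""
    present = [f for f in fields if f in df.columns]
    return df[present].isna().any(axis=1)


# Toy data: only the first article is fully populated.
articles = pd.DataFrame(
    {
        "title": ["A", "B", "C"],
        "abstract": ["text", None, "text"],
        "keywords": ["k1; k2", None, "k3"],
        "references": ["r", None, "r"],
        "bibtex": ["@article{a}", None, None],
    }
)

pending = articles[needs_extraction(articles)]
print(f"{len(pending)} of {len(articles)} articles still need extraction")
```

Some fields may never be available for every article, so a real filter would probably use a narrower field list or track which articles were already attempted.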

In [10]:
# LIVE EXECUTION: Running actual web extraction for Demo dataset
print("Starting LIVE web extraction for Demo dataset...")
print("="*60)

import os
from datetime import datetime

# Record start time
start_time = datetime.now()
print(f"Started at: {start_time.strftime('%Y-%m-%d %H:%M:%S')}")
print()

try:
    # Report the working directory; the relative paths below assume it is the project root
    project_dir = os.getcwd()
    print(f"Working directory: {project_dir}")
    
    # Verify main.py exists
    main_script = "Scripts/main.py"
    if not os.path.exists(main_script):
        raise FileNotFoundError(f"Main script not found: {main_script}")
    
    print(f"Found main script: {main_script}")
    from Scripts.main import main
    print()
    
    # Execute the main script with Demo argument
    print("Executing: python Scripts/main.py Demo")
    print("This will perform actual web extraction...")
    print("Please wait, this may take several minutes...")
    print()
    
    # Run the command WITH web extraction
    main(['Demo'], do_extraction=True, do_filter=False)
    
    # Calculate execution time
    end_time = datetime.now()
    duration = end_time - start_time
    
    print()
    print("="*60)
    print("Execution completed!")
    print(f"Total time: {duration}")
    
    print("Web extraction completed successfully!")
    
    # Check if output files were created
    output_file = "Datasets/Demo/Demo.tsv"
    if os.path.exists(output_file):
        print(f"Output file created: {output_file}")
    else:
        print(f"Output file not found: {output_file}")
        
except Exception as e:
    print(f"Error during execution: {e}")
    import traceback
    traceback.print_exc()

print("\nWeb extraction demonstration complete!")

Starting LIVE web extraction for Demo dataset...
Started at: 2025-09-18 22:46:01

Working directory: c:\Users\guill\OneDrive - Universite de Montreal\Projet Curation des métadonnées
Found main script: Scripts/main.py

Executing: python Scripts/main.py Demo
This will perform actual web extraction...
Please wait, this may take several minutes...

Loaded 5 articles from Demo source (including duplicates)
After duplicate removal: 5 articles
After title/abstract screening: 4 articles
After full-text screening: 1 articles
Demo dataset initialized with 5 articles
5 articles require metadata extraction
✅ Web scraper initialized for missing metadata extraction
📁 Loading cached extraction files...
Found 0 HTML files and 0 BibTeX files in cache

🔄 Starting metadata extraction for 5 articles...
📊 Run configuration: 999 (Complete)

--- Processing article 0 ---
📖 Title: Fiber orientation measurement from mesoscale CT scans of prepreg platelet molded...
🔗 Attempting extraction with existing link
form

KeyboardInterrupt: 

In [None]:
# Read and show summary of results (with web extraction)
import pandas as pd
try:
    result_df = pd.read_csv("Datasets/Demo/Demo.tsv", sep='\t')
    print(f"Final dataset contains {len(result_df)} articles")
    
    # Show metadata completion after web extraction
    completion_after = {
        'abstract': result_df['abstract'].notna().sum(),
        'keywords': result_df['keywords'].notna().sum(),
        'references': result_df['references'].notna().sum(),
        'bibtex': result_df['bibtex'].notna().sum()
    }
    
    print("\nMetadata completion after web extraction:")
    for field, count in completion_after.items():
        percentage = (count / len(result_df)) * 100
        print(f"- {field.title()}: {count}/{len(result_df)} complete ({percentage:.1f}%)")
    
    # Show sample of extracted content if available
    if result_df['abstract'].notna().any():
        print("\nSample extracted abstract:")
        sample_abstract = result_df[result_df['abstract'].notna()]['abstract'].iloc[0]
        print(f"'{sample_abstract[:150]}...'")
        
except Exception as e:
    print(f"Could not analyze results: {e}")

Final dataset contains 5 articles

Metadata completion after web extraction:
- Abstract: 5/5 complete (100.0%)
- Keywords: 4/5 complete (80.0%)
- References: 1/5 complete (20.0%)
- Bibtex: 5/5 complete (100.0%)

Sample extracted abstract:
'X-ray computed tomography (CT) analysis is used to measure the heterogeneous fiber orientation fields in a 20 cm3 composite bracket made from prepreg ...'
