# Scopus Data Structure

## Root: abstracts-retrieval-response

### 1. item
- **`ait:process-info`** (Processing metadata)
  - ait:status - Processing state
  - ait:date-delivered - Delivery timestamp
  - ait:date-sort - Sorting date

- **`bibrecord`** (Bibliographic record)
  - **`head`** (Header information)
    - **`author-group[]`** (Array of author-affiliation groupings)
      - affiliation - Organization details
      - author[] - Array of authors
    - citation-title - Chapter title
    - abstracts - Abstract content (null in this case)
    - correspondence - Contact author info
    - **`citation-info`**
      - citation-type - Document type
      - citation-language - Language
    - **`source`** (Publication source)
      - sourcetitle-abbrev - Abbreviated source title
      - website - Publisher URL
      - volisspag - Volume/issue/pages
      - publicationyear
      - isbn[] - Array of ISBNs
      - publisher
      - sourcetitle - Full book title
      - publicationdate
    - **`enhancement`**
      - classificationgroup
        - classifications[] - Array of subject codes

  - **`item-info`** (Item metadata)
    - copyright
    - dbcollection[] - Database sources
    - history - Creation date
    - **`itemidlist`** (Identifiers)
      - itemid[] - Array of various ID types
      - ce:doi - Digital Object Identifier

  - **`tail`**
    - **`bibliography`** (References section)
      - reference[] (Array of 76 references)
        - ref-fulltext - Full citation text
        - ref-info - Structured reference data
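
Reaching a deeply nested field such as the reference list means walking several levels, any of which may be absent or null in a given record. The following is a minimal sketch of a path-walking helper, not a definitive implementation. The helper name `dig` and the trimmed `sample` record are hypothetical and exist only for illustration.

```python
from typing import Any, Iterable


def dig(obj: Any, path: Iterable[str], default: Any = None) -> Any:
    """Follow `path` through nested dicts; return `default` on the first miss."""
    current = obj
    for key in path:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


# Hypothetical, heavily trimmed record used only for illustration.
sample = {
    "abstracts-retrieval-response": {
        "item": {
            "bibrecord": {
                "head": {"citation-title": "Example chapter", "abstracts": None},
                "tail": {
                    "bibliography": {
                        "reference": [
                            {"ref-fulltext": "First reference."},
                            {"ref-fulltext": "Second reference."},
                        ]
                    }
                },
            }
        }
    }
}

root = sample["abstracts-retrieval-response"]
references = dig(root, ["item", "bibrecord", "tail", "bibliography", "reference"], [])
publisher = dig(root, ["item", "bibrecord", "head", "source", "publisher"], "n/a")
print(len(references), publisher)
```

Note that a key that is present but null (like `abstracts` here) comes back as `None` rather than the default, so callers may still need a type check on the result.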

### 2. affiliation[] (Array)
- Detailed affiliation objects with:
  - @id - Affiliation ID
  - affiliation-city
  - affilname
  - affiliation-country

### 3. coredata (Core metadata)
- srctype - Source type
- eid - Electronic ID
- prism:coverDate - Cover date
- prism:aggregationType - Aggregation type
- dc:creator - Author information
- **`link[]`** - Related URLs
  - @rel="self" - Self link
  - @rel="scopus" - Scopus link
  - @rel="scopus-citedby" - Cited-by link
- prism:isbn[] - ISBN array
- prism:publicationName - Publication name
- source-id - Source ID
- citedby-count - Citation count
- subtype - Document subtype
- prism:pageRange - Page range
- dc:title - Title
- openaccess - Open access status
- prism:doi - DOI
- dc:identifier - Identifier
- dc:publisher - Publisher
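
The `link[]` entries are easier to use when keyed by their `@rel` value. The sketch below assumes each entry pairs `@rel` with an `@href` URL, as the extraction code further down also assumes. The function name, the `eid` value, and the `example.org` URLs are placeholders, not real Scopus data.

```python
def links_by_rel(coredata):
    """Map each link's @rel to its @href, tolerating a lone dict or a missing key."""
    links = coredata.get("link") or []
    if isinstance(links, dict):
        links = [links]
    return {
        link.get("@rel", ""): link.get("@href", "")
        for link in links
        if isinstance(link, dict)
    }


# Hypothetical coredata fragment with placeholder URLs.
coredata = {
    "eid": "2-s2.0-00000000000",
    "link": [
        {"@rel": "self", "@href": "https://example.org/api/abstract/0"},
        {"@rel": "scopus", "@href": "https://example.org/record/0"},
        {"@rel": "scopus-citedby", "@href": "https://example.org/citedby/0"},
    ],
}

links = links_by_rel(coredata)
print(links["scopus-citedby"])
```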

### 4. idxterms
- Index terms (null in this case)

### 5. language
- Document language

### 6. authkeywords
- Author keywords (null in this case)

### 7. subject-areas
- **`subject-area[]`** (Array)
  - Discipline classifications
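
As a rough illustration, the subject areas can be reduced to (code, label) pairs. The sketch assumes each `subject-area` entry stores its label under `$` and its code under `@code`, which is how the extraction code below reads them. The function name and sample values are hypothetical.

```python
def subject_pairs(root):
    """Return (code, label) tuples from the subject-areas block, if present."""
    block = root.get("subject-areas") or {}
    areas = block.get("subject-area") or []
    if isinstance(areas, dict):
        areas = [areas]
    return [
        (area.get("@code", ""), area.get("$", ""))
        for area in areas
        if isinstance(area, dict)
    ]


# Hypothetical fragment; codes and labels are placeholders.
root = {
    "subject-areas": {
        "subject-area": [
            {"@code": "0001", "$": "Example Discipline A"},
            {"@code": "0002", "$": "Example Discipline B"},
        ]
    }
}

for code, label in subject_pairs(root):
    print(code, label)
```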

### 8. authors
- **`author[]`** (Array)
  - Author objects with affiliation links

## Key Structural Features:

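- **Dual Author Representations**: Authors appear twice: grouped by affiliation under `item.bibrecord.head.author-group[]` and as a flat list under `authors.author[]`
- **Multiple Identifier Systems**: `eid`, `dc:identifier`, `prism:doi`/`ce:doi`, and the typed entries in `itemid[]` all identify the same record for cross-referencing
- **Hierarchical Affiliations**: The top-level `affiliation[]` carries ID, name, city, and country; `author-group[].affiliation` carries the organization details
- **Comprehensive References**: 76 references in this record, each with its full citation text (`ref-fulltext`) and structured data (`ref-info`)
- **Rich Metadata**: `ait:process-info` records the processing status, delivery date, and sort date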
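
One use of the flat author list is to pair each author with the names in the top-level `affiliation[]`. The sketch below assumes that every flat author entry carries an `affiliation` value whose `@id` matches an `@id` in `affiliation[]`. That linkage is inferred from the outline above and is not confirmed here. The function names and sample data are hypothetical.

```python
def as_list(value):
    """Wrap a lone value in a list; treat None as empty."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def author_affiliations(root):
    """Return (surname, affiliation name) pairs by joining on the affiliation @id."""
    names = {
        affil.get("@id"): affil.get("affilname", "")
        for affil in as_list(root.get("affiliation"))
        if isinstance(affil, dict)
    }
    pairs = []
    for author in as_list((root.get("authors") or {}).get("author")):
        if not isinstance(author, dict):
            continue
        for link in as_list(author.get("affiliation")):
            if isinstance(link, dict):
                # Unknown IDs fall back to an empty name instead of raising.
                pairs.append((author.get("ce:surname", ""), names.get(link.get("@id"), "")))
    return pairs


# Hypothetical record: the second author has two affiliation links.
root = {
    "affiliation": [
        {"@id": "100", "affilname": "Example University"},
        {"@id": "200", "affilname": "Sample Institute"},
    ],
    "authors": {
        "author": [
            {"@seq": "1", "ce:surname": "Doe", "affiliation": {"@id": "100"}},
            {"@seq": "2", "ce:surname": "Roe", "affiliation": [{"@id": "100"}, {"@id": "200"}]},
        ]
    },
}

for surname, affiliation in author_affiliations(root):
    print(surname, "->", affiliation)
```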

In [None]:
import json
import pandas as pd
import os
from pprint import pprint


def ensure_list(obj):
    """Return obj as a list: lists pass through, None becomes [], anything else is wrapped."""
    if isinstance(obj, list):
        return obj
    if obj is None:
        return []
    return [obj]


# Load and analyze the complete structure with proper error handling
def extract_all_to_dataframes(data):
    # Check if the expected structure exists
    if 'abstracts-retrieval-response' not in data:
        print("Error: Expected 'abstracts-retrieval-response' not found in JSON structure")
        print("Available keys:", list(data.keys()))
        return {}
    
    root = data['abstracts-retrieval-response']
    
    def safe_get(obj, keys, default=None):
        """Safely navigate nested dictionaries"""
        if not isinstance(obj, dict):
            return default
        current = obj
        for key in keys:
            if isinstance(current, dict) and key in current:
                current = current[key]
            else:
                return default
        return current

    # Initialize all DataFrames
    dataframes = {}
    
    # 1. CORE METADATA DataFrame
    coredata = root.get('coredata', {})
    
    # Extract ISBNs
    isbns = ensure_list(coredata.get('prism:isbn', []))
    isbn_electronic = ''
    isbn_print = ''
    for isbn in isbns:
        if isinstance(isbn, dict) and '$' in isbn:
            if not isbn_electronic:
                isbn_electronic = isbn['$']
            else:
                isbn_print = isbn['$']
    
    # Extract links
    links = ensure_list(coredata.get('link', []))
    link_info = []
    for link in links:
        if isinstance(link, dict):
            link_info.append(f"{link.get('@rel', '')}: {link.get('@href', '')}")
    
    core_metadata_data = {
        'title': [coredata.get('dc:title', '')],
        'description':[coredata.get('dc:description', '')] ,
        'document_type': [coredata.get('subtypeDescription', '')],
        'publication_date': [coredata.get('prism:coverDate', '')],
        'doi': [coredata.get('prism:doi', '')],
        'pages': [f"{coredata.get('prism:startingPage', '')}-{coredata.get('prism:endingPage', '')}"],
        'publication_name': [coredata.get('prism:publicationName', '')],
        'publisher': [coredata.get('dc:publisher', '')],
        'cited_by_count': [coredata.get('citedby-count', '')],
        'scopus_id': [coredata.get('dc:identifier', '')],
        'eid': [coredata.get('eid', '')],
        'open_access': [coredata.get('openaccessFlag', '')],
        'isbn_electronic': [isbn_electronic],
        'isbn_print': [isbn_print],
        'links': ['; '.join(link_info)]
    }
    dataframes['core_metadata'] = pd.DataFrame(core_metadata_data)

    # 2. AUTHORS INFORMATION DataFrame
    authors_data = []
    authors_dict = root.get('authors', {})
    authors = ensure_list(authors_dict.get('author', [])) if authors_dict else []
    
    for author in authors:
        if isinstance(author, dict):
            affils = ensure_list(author.get('affiliation', []))
            authors_data.append({
                'author_id': author.get('@auid', ''),
                'sequence': author.get('@seq', ''),
                'given_name': author.get('ce:given-name', ''),
                'surname': author.get('ce:surname', ''),
                'full_name': f"{author.get('ce:given-name', '')} {author.get('ce:surname', '')}",
                'degrees': author.get('ce:degrees', ''),
                'affiliation_count': len(affils)
            })
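    # Name the columns explicitly so an empty author list still yields them
    dataframes['authors'] = pd.DataFrame(authors_data, columns=[
        'author_id', 'sequence', 'given_name', 'surname',
        'full_name', 'degrees', 'affiliation_count'
    ])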

    # 3. AFFILIATIONS DataFrame
    affiliations_data = []
    affiliations = ensure_list(root.get('affiliation', []))
    for affil in affiliations:
        if isinstance(affil, dict):
            affiliations_data.append({
                'affiliation_id': affil.get('@id', ''),
                'name': affil.get('affilname', ''),
                'city': affil.get('affiliation-city', ''),
                'country': affil.get('affiliation-country', '')
            })
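    # Name the columns explicitly so an empty affiliation list still yields them
    dataframes['affiliations'] = pd.DataFrame(
        affiliations_data, columns=['affiliation_id', 'name', 'city', 'country']
    )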

    # 4. SUBJECT AREAS DataFrame
    subject_areas_data = []
    subject_areas = ensure_list(safe_get(root, ['subject-areas', 'subject-area'], []))
    for area in subject_areas:
        if isinstance(area, dict):
            subject_areas_data.append({
                'subject_area': area.get('$', ''),
                'subject_code': area.get('@code', '')
            })
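    # Name the columns explicitly so a record without subject areas still yields them
    dataframes['subject_areas'] = pd.DataFrame(
        subject_areas_data, columns=['subject_area', 'subject_code']
    )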

    # 5. PROCESSING INFORMATION DataFrame
    item = root.get('item', {})
    process_info = item.get('ait:process-info', {})
    status_info = process_info.get('ait:status', {})
    date_delivered = process_info.get('ait:date-delivered', {})
    
    processing_data = {
        'status': [status_info.get('@state', '')],
        'delivery_year': [date_delivered.get('@year', '')],
        'delivery_month': [date_delivered.get('@month', '')],
        'delivery_day': [date_delivered.get('@day', '')],
        'delivery_date': [f"{date_delivered.get('@year', '')}-{date_delivered.get('@month', '')}-{date_delivered.get('@day', '')}"]
    }
    dataframes['processing_info'] = pd.DataFrame(processing_data)

    # 6. BIBRECORD - Detailed bibliographic data
    bibrecord = item.get('bibrecord', {})
    head = bibrecord.get('head', {})
    
    # Author Groups DataFrame
    author_groups_data = []
    author_groups = ensure_list(head.get('author-group', []))
    for i, group in enumerate(author_groups):
        if isinstance(group, dict):
            affiliation = group.get('affiliation', {})
            organizations = ensure_list(affiliation.get('organization', []))
            org_names = [org.get('$', '') for org in organizations if isinstance(org, dict)]
            
            author_groups_data.append({
                'group_number': i + 1,
                'country': affiliation.get('@country', ''),
                'city': affiliation.get('city', ''),
                'organizations': '; '.join(org_names)
            })
    dataframes['author_groups'] = pd.DataFrame(author_groups_data)

    # Citation Information
    citation_info_obj = head.get('citation-info', {})

    citation_language = ''
    citation_type = ''

    if isinstance(citation_info_obj, dict):
        citation_language = citation_info_obj.get('citation-language', {})
        if isinstance(citation_language, dict):
            citation_language = citation_language.get('@language', '')
        else:
            citation_language = ''
        
        citation_type = citation_info_obj.get('citation-type', {})
        if isinstance(citation_type, dict):
            citation_type = citation_type.get('@code', '')
        else:
            citation_type = ''
    elif isinstance(citation_info_obj, list) and len(citation_info_obj) > 0:
        first_cit = citation_info_obj[0]
        if isinstance(first_cit, dict):
            citation_language = first_cit.get('citation-language', {})
            if isinstance(citation_language, dict):
                citation_language = citation_language.get('@language', '')
            else:
                citation_language = ''
            citation_type = first_cit.get('citation-type', {})
            if isinstance(citation_type, dict):
                citation_type = citation_type.get('@code', '')
            else:
                citation_type = ''
        else:
            citation_language = ''
            citation_type = ''
    else:
        citation_language = ''
        citation_type = ''

    citation_data = {
        'citation_language': [citation_language],
        'citation_type': [citation_type]
    }

    dataframes['citation_info'] = pd.DataFrame(citation_data)


    # Source Information
    source = head.get('source', {})

    publication_year = source.get('publicationyear', {})
    publication_date = source.get('publicationdate', {})

    # Handle website safely
    website_obj = source.get('website', {})
    if isinstance(website_obj, dict):
        website = website_obj.get('ce:e-address', {})
    elif isinstance(website_obj, list) and len(website_obj) > 0:
        # If it's a list, take the first element
        website = website_obj[0].get('ce:e-address', {}) if isinstance(website_obj[0], dict) else {}
    else:
        website = {}

    # Handle publisher info safely
    publisher_info = source.get('publisher', {})
    if not isinstance(publisher_info, dict):
        publisher_info = {}
    
    # The publication date is unpacked further down, once its type has been checked
    
    # Extract ISSNs with separate fields
    issns = ensure_list(source.get('issn', []))
    issn_electronic = ''
    issn_print = ''
    issn_types = []
    issn_values = []
    
    for issn in issns:
        if isinstance(issn, dict):
            issn_type = issn.get('@type', '')
            issn_value = issn.get('$', '')
            issn_types.append(issn_type)
            issn_values.append(issn_value)
            
            if issn_type == 'electronic':
                issn_electronic = issn_value
            elif issn_type == 'print':
                issn_print = issn_value
    # Safely handle publication date info
    if isinstance(publication_date, dict):
        date_text = publication_date.get('date-text', '')
        if isinstance(date_text, dict):
            pub_date = date_text.get('$', '')
        else:
            pub_date = str(date_text)
        
        pub_year_full = publication_date.get('year', '')
        pub_month = publication_date.get('month', '')
        pub_day = publication_date.get('day', '')
    else:
        # fallback if publication_date is a string
        pub_date = str(publication_date)
        pub_year_full = ''
        pub_month = ''
        pub_day = ''

    # Safely handle website
    if isinstance(website, dict):
        website_url = website.get('$', '')
    else:
        website_url = str(website)

    # Safely handle translated title
    translated_title_obj = source.get('translated-sourcetitle', '')
    if isinstance(translated_title_obj, dict):
        translated_title = translated_title_obj.get('$', '')
    else:
        translated_title = str(translated_title_obj)
    source_data = {
        'book_title': [source.get('sourcetitle', '')],
        'abbreviated_title': [source.get('sourcetitle-abbrev', '')],
        'publication_year': [publication_year.get('@first', '')],
        'publication_date': [pub_date],
        'publication_year_full': [pub_year_full],
        'publication_month': [pub_month],
        'publication_day': [pub_day],
        'website': [website_url],
        'publisher_name': [publisher_info.get('publishername', '')],
        'source_country': [source.get('@country', '')],
        'source_type': [source.get('@type', '')],
        'source_id': [source.get('@srcid', '')],
        'first_page': [source.get('volisspag', {}).get('pagerange', {}).get('@first', '')],
        'last_page': [source.get('volisspag', {}).get('pagerange', {}).get('@last', '')],
        'page_range': [f"{source.get('volisspag', {}).get('pagerange', {}).get('@first', '')}-{source.get('volisspag', {}).get('pagerange', {}).get('@last', '')}"],
        'issn_electronic': [issn_electronic],
        'issn_print': [issn_print],
        'issn_combined': ['; '.join([f"{issn_type}: {issn_value}" for issn_type, issn_value in zip(issn_types, issn_values)])],
        'issue_title': [source.get('issuetitle', '')],
        'translated_title': [translated_title]
    }
    dataframes['source_info'] = pd.DataFrame(source_data)

    # Correspondence Information
    correspondence_data = {}

    correspondence_obj = head.get('correspondence', {})

    if isinstance(correspondence_obj, dict):
        person = correspondence_obj.get('person', {})
        corr_affiliation = correspondence_obj.get('affiliation', {})
    elif isinstance(correspondence_obj, list) and len(correspondence_obj) > 0:
        # If list, take the first element
        first_corr = correspondence_obj[0]
        person = first_corr.get('person', {}) if isinstance(first_corr, dict) else {}
        corr_affiliation = first_corr.get('affiliation', {}) if isinstance(first_corr, dict) else {}
    else:
        person = {}
        corr_affiliation = {}

    corr_organizations = ensure_list(corr_affiliation.get('organization', []))
    corr_org_names = [org.get('$', '') for org in corr_organizations if isinstance(org, dict)]

    correspondence_data = {
        'correspondence_author': [f"{person.get('ce:given-name', '')} {person.get('ce:surname', '')}"],
        'correspondence_organizations': ['; '.join(corr_org_names)]
    }

    dataframes['correspondence'] = pd.DataFrame(correspondence_data)

    # 7. ITEM INFO (Identifiers and history)
    item_info = bibrecord.get('item-info', {})
    history = item_info.get('history', {}).get('date-created', {})
    
    # Item IDs
    item_ids = ensure_list(item_info.get('itemidlist', {}).get('itemid', []))
    itemid_data = []
    for itemid in item_ids:
        if isinstance(itemid, dict):
            itemid_data.append({
                'id_type': itemid.get('@idtype', ''),
                'id_value': itemid.get('$', '')
            })
    dataframes['item_ids'] = pd.DataFrame(itemid_data)
    
    # Database Collections
    db_collections = ensure_list(item_info.get('dbcollection', []))
    dbcollection_data = []
    for db in db_collections:
        if isinstance(db, dict):
            dbcollection_data.append({
                'database_name': db.get('$', '')
            })
    dataframes['database_collections'] = pd.DataFrame(dbcollection_data)

    item_info_data = {
        'copyright': [item_info.get('copyright', {}).get('$', '')],
        'doi': [item_info.get('itemidlist', {}).get('ce:doi', '')],
        'creation_year': [history.get('@year', '')],
        'creation_month': [history.get('@month', '')],
        'creation_day': [history.get('@day', '')],
        'creation_date': [f"{history.get('@year', '')}-{history.get('@month', '')}-{history.get('@day', '')}"]
    }
    dataframes['item_info'] = pd.DataFrame(item_info_data)

    # 8. REFERENCES DataFrame
    tail = bibrecord.get('tail') if isinstance(bibrecord.get('tail'), dict) else {}
    bibliography = tail.get('bibliography') if isinstance(tail.get('bibliography'), dict) else {}
    references = ensure_list(bibliography.get('reference', []))
    
    references_data = []
    for i, ref in enumerate(references):
        if isinstance(ref, dict):
            ref_info = ref.get('ref-info', {})
            
            # Extract authors with separate fields
            authors_data = []
            authors_info = ref_info.get('ref-authors', {})
            authors_list = ensure_list(authors_info.get('author', []))
            for author in authors_list:
                if isinstance(author, dict):
                    authors_data.append({
                        'sequence': author.get('@seq', ''),
                        'surname': author.get('ce:surname', ''),
                        'initials': author.get('ce:initials', ''),
                        'indexed_name': author.get('ce:indexed-name', '')
                    })
            
            # Extract publication year
            pub_year = ref_info.get('ref-publicationyear', {}).get('@first', '')
            
            # Extract title
            ref_title = ref_info.get('ref-title', {}).get('ref-titletext', '')
            
            # Extract volume and page info
            volisspag = ref_info.get('ref-volisspag', {})
            volume = volisspag.get('voliss', {}).get('@volume', '')
            page_range = volisspag.get('pagerange', {})
            first_page = page_range.get('@first', '')
            last_page = page_range.get('@last', '')
            
            # Extract source title
            source_title = ref_info.get('ref-sourcetitle', '')
            
            # Extract item IDs with separate fields
            item_ids_data = []
            itemidlist = ref_info.get('refd-itemidlist', {})
            itemid_items = ensure_list(itemidlist.get('itemid', []))
            for itemid in itemid_items:
                if isinstance(itemid, dict):
                    item_ids_data.append({
                        'id_type': itemid.get('@idtype', ''),
                        'id_value': itemid.get('$', '')
                    })
            
            # Prepare author strings
            author_names = [f"{auth['surname']} {auth['initials']}" for auth in authors_data]
            author_surnames = [auth['surname'] for auth in authors_data]
            author_initials = [auth['initials'] for auth in authors_data]
            
            # Prepare item ID strings
            item_id_types = [item['id_type'] for item in item_ids_data]
            item_id_values = [item['id_value'] for item in item_ids_data]
            
            references_data.append({
                'reference_number': i + 1,
                'reference_id': ref.get('@id', ''),
                # Coerce to str first: the field can be missing or null
                'ref_fulltext': str(ref.get('ref-fulltext') or '')[:500],
                
                # Author information - separated
                'authors_combined': '; '.join(author_names),
                'author_surnames': '; '.join(author_surnames),
                'author_initials': '; '.join(author_initials),
                'author_count': len(authors_data),
                'authors_full_data': authors_data,  # Keep full author data as list of dicts
                
                # Title and publication info
                'title': ref_title,
                'publication_year': pub_year,
                'volume': volume,
                'first_page': first_page,
                'last_page': last_page,
                'page_range': f"{first_page}-{last_page}" if first_page and last_page else '',
                'source_title': source_title,
                
                # Item IDs - separated
                'item_ids_combined': '; '.join([f"{item['id_type']}: {item['id_value']}" for item in item_ids_data]),
                'item_id_types': '; '.join(item_id_types),
                'item_id_values': '; '.join(item_id_values),
                'item_ids_full_data': item_ids_data,  # Keep full item ID data as list of dicts
                
                'ref_info': ref_info  # Keep the full ref-info as backup
            })
    dataframes['references'] = pd.DataFrame(references_data)
    
    references_summary_data = {
        'total_references': [bibliography.get('@refcount', '')]
    }
    dataframes['references_summary'] = pd.DataFrame(references_summary_data)

    # 9. CLASSIFICATION ENHANCEMENT DataFrame
    enhancement = head.get('enhancement', {})
    classification_group = enhancement.get('classificationgroup', {})
    classifications = ensure_list(classification_group.get('classifications', []))
    
    classification_data = []
    for classification in classifications:
        if isinstance(classification, dict):
            classification_data.append({
                'classification_type': classification.get('@type', ''),
                'classification_code': classification.get('classification', '')
            })
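    # Name the columns explicitly so a record without classifications still yields them
    dataframes['classifications'] = pd.DataFrame(
        classification_data, columns=['classification_type', 'classification_code']
    )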

    # 10. OTHER INFORMATION DataFrame
    language = ''
    lang_obj = root.get('language')
    if isinstance(lang_obj, dict):
        language = lang_obj.get('@xml:lang', '')


    # Re-read the head block for the abstract lookup
    head = root.get("item", {}).get("bibrecord", {}).get("head", {})

    # Extract abstract
    abstracts = head.get("abstracts", "Not available")

    # Handle case when abstracts is a list of dicts
    if isinstance(abstracts, list):
        # Usually Elsevier puts the text under the "$" key; skip non-dict entries
        abstracts_text = " ".join(a.get("$", "") for a in abstracts if isinstance(a, dict))
    elif isinstance(abstracts, str):
        abstracts_text = abstracts
    else:
        abstracts_text = "Not available"
    dc_identifier = root.get('coredata', {}).get('dc:identifier', 'Not available')
    prism_publicationName = root.get('coredata', {}).get('prism:publicationName', 'Not available')

    
    other_data = {
        'language': [language],
        "abstracts": [abstracts_text],
        'author_keywords': [str(root.get('authkeywords', 'Not available'))],
        'index_terms': [str(root.get('idxterms', 'Not available'))],
        'dc:identifier': [dc_identifier],
        'prism:publicationName' : [prism_publicationName]
    }
    dataframes['other_info'] = pd.DataFrame(other_data)

    return dataframes

In [None]:
folder_path = r"d:\projectData\ScopusData2018-2023\2023"

# Collect every regular file in the folder (the Scopus dumps carry no .json extension)
json_files = [
    os.path.join(folder_path, f)
    for f in os.listdir(folder_path)
    if os.path.isfile(os.path.join(folder_path, f))
]

print(f"Loaded {len(json_files)} JSON files")

all_results = []   # will store every file's extracted dataframes

for idx, file_path in enumerate(json_files, start=1):
    print(f"\n===== Processing file {idx}/{len(json_files)}: {file_path}")

    try:
        with open(file_path, 'r', encoding='utf-8') as f:
            data = json.load(f)
    except Exception as e:
        print(f"Error reading {file_path}: {e}")
        continue

    # Extract DataFrames
    dataframes = extract_all_to_dataframes(data)

    if not dataframes:
        print(f"Skipped {file_path} (structure mismatch)")
        continue

    # Store result (include filename for tracking)
    all_results.append({
        "file": os.path.basename(file_path),
        "dataframes": dataframes
    })

print("\n=====================================")
print(f"Finished processing {len(all_results)} valid files out of {len(json_files)}")


Loaded 2890 JSON files

===== Processing file 1/2890: d:\projectData\ScopusData2018-2023\2023\202300000

===== Processing file 2/2890: d:\projectData\ScopusData2018-2023\2023\202300001

===== Processing file 3/2890: d:\projectData\ScopusData2018-2023\2023\202300002

===== Processing file 4/2890: d:\projectData\ScopusData2018-2023\2023\202300003

===== Processing file 5/2890: d:\projectData\ScopusData2018-2023\2023\202300004

===== Processing file 6/2890: d:\projectData\ScopusData2018-2023\2023\202300005

===== Processing file 7/2890: d:\projectData\ScopusData2018-2023\2023\202300006

===== Processing file 8/2890: d:\projectData\ScopusData2018-2023\2023\202300007

===== Processing file 9/2890: d:\projectData\ScopusData2018-2023\2023\202300008

===== Processing file 10/2890: d:\projectData\ScopusData2018-2023\2023\202300009

===== Processing file 11/2890: d:\projectData\ScopusData2018-2023\2023\202300010

===== Processing file 12/2890: d:\projectData\ScopusData2018-2023\2023\202300011


In [29]:
all_results = []

for file_path in json_files:
    with open(file_path, 'r', encoding='utf-8') as f:
        data = json.load(f)

    dataframes = extract_all_to_dataframes(data)
    if not dataframes:
        continue

    # Build a NEW output dictionary each iteration
    output = {
        "title": dataframes['core_metadata']["title"].iloc[0],
        "abstract": dataframes['other_info']["abstracts"].iloc[0],
        # Take the stored value; nunique() on a one-row column always returns 1
        "author_keywords": dataframes['other_info']["author_keywords"].iloc[0],
        "author_counted": int(dataframes['authors']["sequence"].nunique()),
        "subject_area": dataframes["subject_areas"]["subject_area"].tolist(),
        "subject_code": dataframes["subject_areas"]["subject_code"].tolist(),
        "classification_type": dataframes["classifications"]["classification_type"].tolist(),
        "classification_code": dataframes["classifications"]["classification_code"].tolist(),
        "cited_by_count": dataframes["core_metadata"]["cited_by_count"].iloc[0],
        "description": dataframes["core_metadata"]["description"].iloc[0],
        "document_type": dataframes["core_metadata"]["document_type"].iloc[0],
        "publication_year": dataframes['source_info']["publication_year"].iloc[0],
        "publisher": dataframes['source_info']['publisher_name'].iloc[0],
        "issn": dataframes['source_info']["issn_print"].iloc[0],
        "affiliation_countries": dataframes['affiliations']['country'].unique().tolist(),
        "institution_name": dataframes['affiliations']['name'].tolist(),
        "citation_code": dataframes['citation_info']['citation_type'].iloc[0],
        "dc_identifier": dataframes['other_info']["dc:identifier"].iloc[0],
        "prism_publicationName" :dataframes['other_info']["prism:publicationName"].iloc[0],
        "references": []
    }

    ref_df = dataframes['references']
    for _, row in ref_df.iterrows():
        output["references"].append({
            "ref_fulltext": row["ref_fulltext"],
            "author_surnames": row["author_surnames"],
            "author_initials": row["author_initials"],
            "author_count": int(row["author_count"]),
            "title": row["title"],
            "publication_year": row["publication_year"],
            "volume": row["volume"],
            "first_page": row["first_page"],
            "last_page": row["last_page"],
            "page_range": row["page_range"],
            "source_title": row["source_title"],
            "item_id_types": row["item_id_types"],
            "item_id_values": row["item_id_values"]
        })

    # Append a NEW copy each time
    all_results.append(output)


In [None]:
import json

with open("year2019.json", "w", encoding="utf-8") as f:
    json.dump(all_results, f, indent=4, ensure_ascii=False)

