# EU Fact Force - Exploration - Docling

In [1]:
import json
import pandas as pd
from pathlib import Path
from IPython.display import display, Markdown, HTML
from docling.document_converter import DocumentConverter
from hierarchical.postprocessor import ResultPostprocessor

from docling_experiment import run_mini_benchmark_docling, run_mini_benchmark_pypdf2



## Overview

### Introduction

Docling is a library that parses a document and extracts its content in a structured form.
- Link: https://www.docling.ai/
- Documentation: https://docling-project.github.io/docling/

### Simple example

In [2]:
# File path
filename = "40359_2023_Article_1210.pdf"
doc_path = Path("docs/")
file_path = Path(filename)

In [3]:
# Define Docling converter
converter = DocumentConverter()

In [4]:
# Convert file
result = converter.convert(doc_path / file_path)

[INFO] 2026-02-20 16:00:46,364 [RapidOCR] base.py:22: Using engine_name: torch
[INFO] 2026-02-20 16:00:46,395 [RapidOCR] device_config.py:50: Using CPU device
[INFO] 2026-02-20 16:00:46,441 [RapidOCR] download_file.py:60: File exists and is valid: C:\Users\de_ol\git\_git_d4g\14_EUFactForce\.venv\Lib\site-packages\rapidocr\models\ch_PP-OCRv4_det_infer.pth
[INFO] 2026-02-20 16:00:46,443 [RapidOCR] main.py:50: Using C:\Users\de_ol\git\_git_d4g\14_EUFactForce\.venv\Lib\site-packages\rapidocr\models\ch_PP-OCRv4_det_infer.pth
[INFO] 2026-02-20 16:00:46,691 [RapidOCR] base.py:22: Using engine_name: torch
[INFO] 2026-02-20 16:00:46,693 [RapidOCR] device_config.py:50: Using CPU device
[INFO] 2026-02-20 16:00:46,696 [RapidOCR] download_file.py:60: File exists and is valid: C:\Users\de_ol\git\_git_d4g\14_EUFactForce\.venv\Lib\site-packages\rapidocr\models\ch_ptocr_mobile_v2.0_cls_infer.pth
[INFO] 2026-02-20 16:00:46,697 [RapidOCR

In [5]:
# Export conversion results
doc_text = result.document.export_to_text()
print(f"> Document as text: {doc_text[:100]}...")

Parameter `strict_text` has been deprecated and will be ignored.


> Document as text: ## EDITORIAL

## Mental Health, Discourse and Stigma

Olga Zayts-Spence 1* , David Edmonds 1 and Zoe...


### JSON export and structure

In [6]:
# Get json export of the previously parsed document
doc_json = result.document.export_to_dict()
print(f"> JSON keys: {doc_json.keys()}")

> JSON keys: dict_keys(['schema_name', 'version', 'name', 'origin', 'furniture', 'body', 'groups', 'texts', 'pictures', 'tables', 'key_value_items', 'form_items', 'pages'])


In [7]:
# General info about the doc - dict
for key in ["schema_name", "version", "name", "origin"]:
    print(f"> {key}: {doc_json[key]}")

> schema_name: DoclingDocument
> version: 1.9.0
> name: 40359_2023_Article_1210
> origin: {'mimetype': 'application/pdf', 'binary_hash': 4990694571264383334, 'filename': '40359_2023_Article_1210.pdf'}


In [8]:
# Pages data - size, page number - dict
for key in doc_json["pages"]:
    print(f"> {key}: {doc_json['pages'][key]}")

> 1: {'size': {'width': 595.2760009765625, 'height': 793.7009887695312}, 'page_no': 1}
> 2: {'size': {'width': 595.2760009765625, 'height': 793.7009887695312}, 'page_no': 2}
> 3: {'size': {'width': 595.2760009765625, 'height': 793.7009887695312}, 'page_no': 3}
> 4: {'size': {'width': 595.2760009765625, 'height': 793.7009887695312}, 'page_no': 4}
> 5: {'size': {'width': 595.2760009765625, 'height': 793.7009887695312}, 'page_no': 5}


In [9]:
# Body - root node of the tree of the main document structure - dict
for key in doc_json["body"]:
    print(f"> {key}: {json.dumps(doc_json['body'][key])[:200]}...")

> self_ref: "#/body"...
> children: [{"$ref": "#/texts/0"}, {"$ref": "#/texts/1"}, {"$ref": "#/texts/2"}, {"$ref": "#/texts/3"}, {"$ref": "#/texts/4"}, {"$ref": "#/texts/5"}, {"$ref": "#/texts/6"}, {"$ref": "#/texts/7"}, {"$ref": "#/tex...
> content_layer: "body"...
> name: "_root_"...
> label: "unspecified"...


In [10]:
# Groups - set of items that don't represent content, but act as containers for other content items - List[dict]
for i, group in enumerate(doc_json["groups"]):
    print(f"> {i}: {json.dumps(group)[:200]}...")

> 0: {"self_ref": "#/groups/0", "parent": {"$ref": "#/body"}, "children": [{"$ref": "#/texts/24"}, {"$ref": "#/texts/25"}, {"$ref": "#/texts/26"}], "content_layer": "body", "name": "list", "label": "list"}...
> 1: {"self_ref": "#/groups/1", "parent": {"$ref": "#/body"}, "children": [{"$ref": "#/texts/30"}, {"$ref": "#/texts/31"}, {"$ref": "#/texts/32"}], "content_layer": "body", "name": "list", "label": "list"}...
> 2: {"self_ref": "#/groups/2", "parent": {"$ref": "#/body"}, "children": [{"$ref": "#/texts/68"}], "content_layer": "body", "name": "group", "label": "key_value_area"}...
> 3: {"self_ref": "#/groups/3", "parent": {"$ref": "#/body"}, "children": [], "content_layer": "body", "name": "group", "label": "key_value_area"}...
> 4: {"self_ref": "#/groups/4", "parent": {"$ref": "#/body"}, "children": [{"$ref": "#/texts/70"}, {"$ref": "#/texts/71"}, {"$ref": "#/texts/72"}, {"$ref": "#/texts/73"}, {"$ref": "#/texts/74"}, {"$ref": "...
> 5: {"self_ref": "#/groups/5", "parent": {"$ref": 

In [11]:
# Furniture - everything outside of the body of the document (footer, header...)
for key in doc_json["furniture"]:
    print(f"> {key}: {doc_json['furniture'][key]}")

> self_ref: #/furniture
> children: []
> content_layer: furniture
> name: _root_
> label: unspecified


In [12]:
# Content of the doc - extracted elements ('texts', 'pictures', 'tables', 'key_value_items', 'form_items')
for key in ["texts", "pictures", "tables", "key_value_items", "form_items"]:
    print(f"> {key}: {json.dumps(doc_json[key])[:200]}...")

> texts: [{"self_ref": "#/texts/0", "parent": {"$ref": "#/body"}, "children": [], "content_layer": "furniture", "label": "page_header", "prov": [{"page_no": 1, "bbox": {"l": 56.693, "t": 758.0929887695312, "r"...
> pictures: [{"self_ref": "#/pictures/0", "parent": {"$ref": "#/body"}, "children": [], "content_layer": "body", "label": "picture", "prov": [{"page_no": 1, "bbox": {"l": 56.216514587402344, "t": 121.424560546875...
> tables: []...
> key_value_items: []...
> form_items: []...


In [13]:
# Example of the content of the first text element extracted:
ref = "#/texts/0"
element = [x for x in doc_json["texts"] if x["self_ref"] == ref][0]
for key in element:
    print(f"> {key}: {element[key]}")

> self_ref: #/texts/0
> parent: {'$ref': '#/body'}
> children: []
> content_layer: furniture
> label: page_header
> prov: [{'page_no': 1, 'bbox': {'l': 56.693, 't': 758.0929887695312, 'r': 235.499, 'b': 741.2849887695312, 'coord_origin': 'BOTTOMLEFT'}, 'charspan': [0, 91]}]
> orig: Zayts-Spence et al. BMC Psychology (2023) 11:180 https://doi.org/10.1186/s40359-023-01210-6
> text: Zayts-Spence et al. BMC Psychology (2023) 11:180 https://doi.org/10.1186/s40359-023-01210-6


For each element, we can see its `parent` (here `#/body`, i.e. the root) and its `children` (empty here because the element is at the start of the document; in reality, it is an unidentified header).
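
The `parent` and `children` fields hold references of the form `{"$ref": "#/texts/8"}`. As a minimal sketch, such a reference can be followed by walking the path segments of the exported dict. The helper name `resolve_ref` and the toy document are illustrative assumptions, not part of Docling; the sketch also assumes the number in the reference matches the position in the list, as it does in the output shown in this notebook.

```python
def resolve_ref(doc: dict, ref: str):
    """Follow a reference such as '#/texts/8' inside the exported dict (hypothetical helper)."""
    if not ref.startswith("#/"):
        raise ValueError(f"Unsupported reference: {ref!r}")
    node = doc
    for part in ref[2:].split("/"):
        # List containers ('texts', 'groups', ...) are indexed by position,
        # dict containers ('body', 'furniture') by key.
        node = node[int(part)] if isinstance(node, list) else node[part]
    return node


# Toy document mimicking the exported layout (not real Docling output)
toy_doc = {
    "body": {"self_ref": "#/body", "children": [{"$ref": "#/texts/0"}, {"$ref": "#/texts/1"}]},
    "texts": [
        {"self_ref": "#/texts/0", "parent": {"$ref": "#/body"}, "children": [], "text": "Introduction"},
        {"self_ref": "#/texts/1", "parent": {"$ref": "#/body"}, "children": [], "text": "First paragraph."},
    ],
}

element = resolve_ref(toy_doc, "#/texts/1")
parent = resolve_ref(toy_doc, element["parent"]["$ref"])
print(element["text"], "->", parent["self_ref"])
```

Compared with scanning `doc_json["texts"]` for a matching `self_ref`, this avoids a linear search, at the cost of relying on the index assumption stated above.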

In [14]:
# Example for "Introduction" section
ref = "#/texts/8"
element = [x for x in doc_json["texts"] if x["self_ref"] == ref][0]
for key in element:
    print(f"> {key}: {element[key]}")

> self_ref: #/texts/8
> parent: {'$ref': '#/body'}
> children: []
> content_layer: body
> label: section_header
> prov: [{'page_no': 1, 'bbox': {'l': 56.693, 't': 392.6429887695312, 'r': 114.753, 'b': 383.94598876953125, 'coord_origin': 'BOTTOMLEFT'}, 'charspan': [0, 12]}]
> orig: Introduction
> text: Introduction
> level: 1


Here, the `parent` is `#/body` (correct), but `children` is empty. We would expect it to contain at least the text of the introduction (`#/texts/9`), but that text is not identified as a child of "Introduction". It is a separate element that follows "Introduction" at the same level.

In [15]:
# Example for the first paragraph of the "Introduction" section
ref = "#/texts/9"
element = [x for x in doc_json["texts"] if x["self_ref"] == ref][0]
for key in element:
    print(f"> {key}: {element[key]}")

> self_ref: #/texts/9
> parent: {'$ref': '#/body'}
> children: []
> content_layer: body
> label: text
> prov: [{'page_no': 1, 'bbox': {'l': 56.693, 't': 380.1589887695312, 'r': 292.861, 'b': 214.07698876953123, 'coord_origin': 'BOTTOMLEFT'}, 'charspan': [0, 785]}]
> orig: This  special  collection  brings  together  the  three  broad themes of mental health, discourse and stigma as they are examined  through  sociolinguistic  lenses.  We  first  present what we mean by mental health, discourse and stigma and  discuss  the  interrelationships  between  these  concepts. We then offer a brief overview of existing sociolinguistic research on mental health and stigma and identify continuing areas of under-research that we hope this special collection will contribute to. Finally, we ask the questions of 'why?' and 'so what?' in relation to sociolinguistic research on mental health and stigma and outline some ways in which this growing area of research could meaningfully  contribute  to  broa

In [16]:
# Search for elements with children
texts_with_child = [x for x in doc_json["texts"] if len(x["children"]) > 0]
print(f" Number of texts with child: {len(texts_with_child)}")

 Number of texts with child: 0


So no children are retrieved here...
After some research, this turns out to be a known issue for PDFs: https://github.com/docling-project/docling/issues/2774.

Possible solution: post-processing (?)
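
To illustrate what such post-processing has to achieve, here is a minimal sketch that nests a flat, reading-order list of items under their section headers using a stack. It is not the algorithm used by `ResultPostprocessor`: the function `nest_by_level` and the simplified item dicts (`ref`, `label`, `level`) are assumptions made for this example, and the sketch presumes that header levels are already known, which is precisely the hard part on real PDFs.

```python
def nest_by_level(items: list[dict]) -> dict[str, list[str]]:
    """Map each parent ref to its children, given items in reading order.

    Headers carry a 'level' (1 = top level); any other item is attached
    to the most recently opened header.
    """
    children = {"#/body": []}
    stack = []  # currently open headers as (level, ref), outermost first
    for item in items:
        if item["label"] == "section_header":
            # A new header closes every open header at the same or a deeper level
            while stack and stack[-1][0] >= item["level"]:
                stack.pop()
            parent = stack[-1][1] if stack else "#/body"
            stack.append((item["level"], item["ref"]))
        else:
            parent = stack[-1][1] if stack else "#/body"
        children.setdefault(parent, []).append(item["ref"])
        children.setdefault(item["ref"], [])
    return children


# Toy input: a title, a section header, and one paragraph (not real Docling output)
toy_items = [
    {"ref": "#/texts/0", "label": "section_header", "level": 1},
    {"ref": "#/texts/1", "label": "section_header", "level": 2},
    {"ref": "#/texts/2", "label": "text"},
]
tree = nest_by_level(toy_items)
print(tree)
```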

In [17]:
# Post-process the conversion result, then export the updated document
ResultPostprocessor(result).process()
doc_json_processed = result.document.export_to_dict()
print(f"> JSON keys: {doc_json_processed.keys()}")



> JSON keys: dict_keys(['schema_name', 'version', 'name', 'origin', 'furniture', 'body', 'groups', 'texts', 'pictures', 'tables', 'key_value_items', 'form_items', 'pages'])


In [18]:
# Example for "Introduction" section
ref = "#/texts/8"
element = [x for x in doc_json_processed["texts"] if x["self_ref"] == ref][0]
for key in element:
    print(f"> {key}: {element[key]}")

> self_ref: #/texts/8
> parent: {'$ref': '#/texts/3'}
> children: [{'$ref': '#/texts/9'}, {'$ref': '#/texts/10'}, {'$ref': '#/texts/11'}, {'$ref': '#/texts/12'}, {'$ref': '#/pictures/0'}, {'$ref': '#/texts/13'}, {'$ref': '#/texts/14'}, {'$ref': '#/texts/38'}, {'$ref': '#/texts/43'}]
> content_layer: body
> label: section_header
> prov: [{'page_no': 1, 'bbox': {'l': 56.693, 't': 392.6429887695312, 'r': 114.753, 'b': 383.94598876953125, 'coord_origin': 'BOTTOMLEFT'}, 'charspan': [0, 12]}]
> orig: Introduction
> text: Introduction
> level: 2


After processing, `#/texts/9` appears, along with other elements, among the children (and `#/texts/8` is the parent of `#/texts/9`).

In [19]:
# Example for the first paragraph of the "Introduction" section
ref = "#/texts/9"
element = [x for x in doc_json_processed["texts"] if x["self_ref"] == ref][0]
for key in element:
    print(f"> {key}: {element[key]}")

> self_ref: #/texts/9
> parent: {'$ref': '#/texts/8'}
> children: []
> content_layer: body
> label: text
> prov: [{'page_no': 1, 'bbox': {'l': 56.693, 't': 380.1589887695312, 'r': 292.861, 'b': 214.07698876953123, 'coord_origin': 'BOTTOMLEFT'}, 'charspan': [0, 785]}]
> orig: This  special  collection  brings  together  the  three  broad themes of mental health, discourse and stigma as they are examined  through  sociolinguistic  lenses.  We  first  present what we mean by mental health, discourse and stigma and  discuss  the  interrelationships  between  these  concepts. We then offer a brief overview of existing sociolinguistic research on mental health and stigma and identify continuing areas of under-research that we hope this special collection will contribute to. Finally, we ask the questions of 'why?' and 'so what?' in relation to sociolinguistic research on mental health and stigma and outline some ways in which this growing area of research could meaningfully  contribute  to  b

After processing, `#/texts/3` appears as the parent of "Introduction". It is actually the title of the article. So far, so good.

In [20]:
ref = "#/texts/3"
element = [x for x in doc_json_processed["texts"] if x["self_ref"] == ref][0]
for key in element:
    print(f"> {key}: {element[key]}")

> self_ref: #/texts/3
> parent: {'$ref': '#/body'}
> children: [{'$ref': '#/texts/4'}, {'$ref': '#/texts/5'}, {'$ref': '#/texts/8'}, {'$ref': '#/texts/69'}]
> content_layer: body
> label: section_header
> prov: [{'page_no': 1, 'bbox': {'l': 56.693, 't': 666.3869887695312, 'r': 405.761, 'b': 645.9629887695312, 'coord_origin': 'BOTTOMLEFT'}, 'charspan': [0, 35]}]
> orig: Mental Health, Discourse and Stigma
> text: Mental Health, Discourse and Stigma
> level: 1


### Conclusion
- Docling can parse a PDF and retrieve structured content
- By default, on PDFs, the hierarchy levels (title, section, subsection, etc.) are not identified
- After post-processing, they can be recovered
- We would need a way to measure the quality of section identification for each element
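
One possible starting point for that last item is to compute, for each element, the chain of section headers above it, so that it can be compared with a hand-labelled reference. The sketch below walks the `parent` references upwards. The helper `section_path` and the toy document are assumptions for illustration; the toy data only mirrors the title, "Introduction" and paragraph nesting observed above.

```python
def section_path(doc: dict, ref: str) -> list[str]:
    """Return the titles of the section headers enclosing `ref`, outermost first."""
    by_ref = {
        item["self_ref"]: item
        for key in ("texts", "groups", "pictures", "tables")
        for item in doc.get(key, [])
    }
    path = []
    parent_ref = by_ref[ref]["parent"]["$ref"]
    while parent_ref in by_ref:  # stops at '#/body', which is not in the lookup
        parent = by_ref[parent_ref]
        if parent.get("label") == "section_header":
            path.append(parent["text"])
        parent_ref = parent["parent"]["$ref"]
    return path[::-1]


# Toy document mirroring the post-processed structure (not real Docling output)
toy_doc = {
    "texts": [
        {"self_ref": "#/texts/0", "parent": {"$ref": "#/body"},
         "label": "section_header", "text": "Mental Health, Discourse and Stigma"},
        {"self_ref": "#/texts/1", "parent": {"$ref": "#/texts/0"},
         "label": "section_header", "text": "Introduction"},
        {"self_ref": "#/texts/2", "parent": {"$ref": "#/texts/1"},
         "label": "text", "text": "This special collection brings together..."},
    ],
}

paragraph_path = section_path(toy_doc, "#/texts/2")
print(" > ".join(paragraph_path))
```

A quality score could then be, for example, the share of elements whose computed path matches the expected one, but the choice of metric is left open here.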

## Mini-benchmark

Description of the mini-benchmark:
- Checks the PDF documents in `docs/`
- Converts each document to a Docling document
- Records the total conversion time, number of pages, time per page, and total size of the resulting text
- Exports each converted document as `html`, `md` and `json`, into the folders of the same name under `results/`
- Exports the benchmark results to `results/`
- Repeats the same process with another, simpler PDF text-extraction library: `PyPDF2` (_note: `PyPDF2` only extracts raw text per page, i.e. no sections and no OCR_)
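
The timing part of these steps can be sketched as follows. This is not the code of `run_mini_benchmark_docling` or `run_mini_benchmark_pypdf2`, which is not shown in this notebook. The `benchmark` function, the `convert` callable (assumed to return the extracted text and a page count) and the stand-in `fake_convert` are hypothetical; only the metric names are taken from the results table.

```python
import time
from pathlib import Path
from typing import Callable


def benchmark(paths: list[Path], convert: Callable[[Path], tuple[str, int]]) -> dict[str, dict]:
    """Time `convert` on each file and collect per-document metrics."""
    results = {}
    for path in paths:
        start = time.perf_counter()
        text, pages = convert(path)
        elapsed = time.perf_counter() - start
        results[path.name] = {
            "total_time": elapsed,
            "total_pages": pages,
            "time_per_page": elapsed / pages if pages else float("nan"),
            "total_chars": len(text),
        }
    return results


# Stand-in converter so the sketch runs without any PDF library installed
def fake_convert(path: Path) -> tuple[str, int]:
    return "lorem ipsum " * 10, 2


results = benchmark([Path("docs/example.pdf")], fake_convert)
print(results)
```

Passing the converter as a callable keeps the timing loop identical for both libraries, so the two sets of measurements remain comparable.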

In [None]:
# Run mini-benchmark - Docling
run_mini_benchmark_docling()

In [None]:
# Run mini-benchmark - PyPDF2
run_mini_benchmark_pypdf2()

In [23]:
# Load results
with open("results/mini_benchmark_results_docling.json", "r") as f:
    mini_benchmark_results_docling = json.load(f)
mini_benchmark_results_docling = pd.DataFrame(mini_benchmark_results_docling).transpose()
with open("results/mini_benchmark_results_pypdf2.json", "r") as f:
    mini_benchmark_results_pypdf2 = json.load(f)
mini_benchmark_results_pypdf2 = pd.DataFrame(mini_benchmark_results_pypdf2).transpose()

In [24]:
mini_benchmark_results = mini_benchmark_results_docling.join(mini_benchmark_results_pypdf2, lsuffix='_docling', rsuffix='_pypdf2')
mini_benchmark_results

| | total_time_docling | total_pages_docling | time_per_page_docling | total_chars_docling | total_time_pypdf2 | total_pages_pypdf2 | time_per_page_pypdf2 | total_chars_pypdf2 |
|---|---|---|---|---|---|---|---|---|
| 1-s2.0-S2352250X23001574-main.pdf | 7.702517 | 6.0 | 1.283753 | 33786.0 | 0.001942 | 6.0 | 0.000324 | 34792.0 |
| 40359_2023_Article_1210.pdf | 5.38652 | 5.0 | 1.077304 | 24823.0 | 0.001056 | 5.0 | 0.000211 | 24822.0 |
| jhab032.pdf | 18.713449 | 20.0 | 0.935672 | 61295.0 | 0.000759 | 20.0 | 3.8e-05 | 62405.0 |


In [25]:
# Select a given doc and show exported results
doc_name = "40359_2023_Article_1210"

In [26]:
# Check json export - PyPDF2
with open(f"results/json/{doc_name}_pypdf2.json", "r", encoding="utf-8") as f:
    doc_json = json.load(f)
print(f"> Doc: {doc_name} - JSON data loaded (pypdf2): {json.dumps(doc_json)[:100]}...")

> Doc: 40359_2023_Article_1210 - JSON data loaded (pypdf2): {"num_pages": 5, "pages": [{"page": 1, "text": "EDITORIAL Open Access\u00a9 The Author(s) 2023. Open...


In [27]:
# Check json export - Docling
with open(f"results/json/{doc_name}_docling.json", "r", encoding="utf-8") as f:
    doc_json = json.load(f)
print(f"> Doc: {doc_name} - JSON data loaded (docling): {json.dumps(doc_json)[:100]}...")

> Doc: 40359_2023_Article_1210 - JSON data loaded (docling): {"schema_name": "DoclingDocument", "version": "1.9.0", "name": "40359_2023_Article_1210", "origin": ...


In [28]:
# Check md export
with open(f"results/md/{doc_name}_docling.md", "r", encoding="utf-8") as f:
    doc_md = f.read()
display(Markdown(doc_md))

EDITORIAL

## Mental Health, Discourse and Stigma

Olga Zayts-Spence 1* , David Edmonds 1 and Zoe Fortune 1

### Abstract

In this editorial to the special collection 'Mental Health, Discourse and Stigma' we outline the concepts of mental, health, discourse and stigma as they are examined through sociolinguistic lenses. We examine the sociolinguistic approach to mental health and stigma and discuss the different theoretical frameworks and methodological approaches that have been applied in such contexts. Sociolinguistics views mental health and stigma as discursively constructed and constituted, i.e. they are both manifest, negotiated, reinforced or contested in the language that people use. We highlight existing gaps in sociolinguistic research and outline how it could enrich research in psychology and psychiatry and contribute to professional practice. Specifically, sociolinguistics provides well-established methodological tools to research the 'voices' of people with a history of mental ill health, their family, carers and mental health professionals in both online and off-line contexts. This is vital to develop targeted interventions and to contribute to de-stigmatization of mental health. To conclude, we highlight the importance of transdisciplinary research that brings together expertise in psychology, psychiatry and sociolinguistics.

Keywords Mental health, Discourse, Stigma, Sociolinguistics

### Introduction

This  special  collection  brings  together  the  three  broad themes of mental health, discourse and stigma as they are examined  through  sociolinguistic  lenses.  We  first  present what we mean by mental health, discourse and stigma and  discuss  the  interrelationships  between  these  concepts. We then offer a brief overview of existing sociolinguistic research on mental health and stigma and identify continuing areas of under-research that we hope this special collection will contribute to. Finally, we ask the questions of 'why?' and 'so what?' in relation to sociolinguistic research on mental health and stigma and outline some ways in which this growing area of research could meaningfully  contribute  to  broader  professional  practice  in psychology and psychiatry.

*Correspondence:

Olga Zayts-Spence zayts@hku.hk

1 School of English, The University of Hong Kong, Pokfulam, Hong Kong

<!-- image -->

© The Author(s) 2023. Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

#### Defining mental health, mental health stigma and discourse

A  range  of  theoretical  frameworks  and  methodological approaches have been used to investigate mental health and stigma across different  disciplines,  including  sociolinguistics.  Historically,  the  study  of  mental  health  has been  dominated  by  psychology  and  psychiatry  which has  led  to  a  'psychiatrization'  [1]  of  mental  health  and illness,  approaching  the  matter  from  a  clinical  perspective. In this special collection, we adopt the World Health Organization's (WHO) encompassing definition of mental  health  as  'a  state  of  mental  well-being  that  enables people to cope with the stresses of life, realize their abilities,  learn  well  and  work  well,  and  contribute  to  their community'  [2]. The WHO's  description  of mental health acknowledges that 'mental health is broader than the lack of a mental disorder' , and encompasses mental disorders,  psychosocial  disabilities,  and  'other  mental states associated with significant distress, impairment in functioning, or risk of self-harm' [2]. This definition also

Open Access

<!-- image -->

emphasizes  the  close  interrelationship  between  mental health and social aspects of life.

Closely intertwined with the concept of mental health is  the  concept  of  stigma  that  has  also  been  widely researched  in  psychology,  psychiatry,  as  well  as  sociology. In his seminal essay, sociologist Erving Goffman [3] defines stigma as 'an attribute that is deeply discrediting' . Stigmatized individuals are often perceived as being different and 'lesser' than 'the normals' [3]. Contemporary definitions  of  mental  health  stigma  largely  follow  Goffman's  work,  highlighting  its  discrediting  attributes  and the negative attitudes attached to it [4, 5]. These attitudes also  extend  to  people  in  the  immediate  surroundings, such as family [6], and even mental health professionals [7,  8].  This  is  known  as  associative  or  courtesy  stigma, i.e.  being  stigmatized  because  of  a  relationship  to  an individual  experiencing  a  mental  health  problem  [9]. Long describes  associative  stigma  experienced  by  mental  health  professionals  who  are  stigmatized  because  of being  'attached  to  their  patients  and  […]  positioned  as a  less  prestigious  branch  of  a  broader  medical  profession' [7]. Shipman and Zayts discuss an extreme case of such associative stigma where a psychiatrist practising in Hong Kong is stigmatized by his own family for working with mentally ill people and for fear of 'transmission' of a mental illness [8].

Psychological  research  on  mental  health  stigma  has burgeoned since Goffman's study. Focusing on the 'micro-level social interactions' , this research has examined  the  causes  of  stigma,  its  cognitive  dimensions,  the consequences and the coping responses [10, 11].  Fewer sociological  studies  existed  [10],  but  following  a  muchcited  publication  by  Link  and  Phelan  [12],  sociological research  has  proliferated.  Link  and  Phelan  re-defined stigma,  surmising  it  to  four  processes:  labelling  differences,  stereotyping  differences,  separating  the  stigmatized from 'us' , and discriminating against the stigmatized [12]. By highlighting the processes of discrimination and separation of the stigmatized from 'us' , Link and Phelan essentially  expanded  the  definition  of  stigma  to  the macro-level  social  processes,  such  as  social  inequality, discrimination based on one's mental health status, and discriminating  societal  ideologies.  These  macro-social forms  of  stigma  are  known  as  structural  stigma  [11]. Workplace  settings  are  one  example  when  employers are hesitant to hire or promote people with a history of mental  health  problems,  although  these  discriminatory attitudes  are  typically  subtle  and  indirect.  Sociological research  has  largely  focused  on  how  different  types  of stigma contribute to inequalities and impact social relationships between different groups [10].

Sociolinguistics  studies  the  interrelationship  between language  and  society  [13]  of  which  discourse  is  a  central sociolinguistic concept. It is equally as nebulous and multifaceted as the concepts of mental health and stigma. Discourse may refer to:

- -A stretch of language above a clause or a sentence level [14].
-  Language used by speakers to covey and negotiate certain meanings and achieve particular purposes [15].
-  Language used to represent various social practices, social actors and ideologies [16, 17].

Relating  these  different  interpretations  of  discourse  to mental health contexts, in its micro-analytic understanding  discourse  may  refer,  for  example,  to  negative  stereotypical  attributes,  such  as  'mad' ,  'crazy' ,  and  'insane' . Culturally, there exist striking differences in how mental health  disorders  are  described,  although  negative  connotations  typically  prevail.  For  example,  the  term 精神 分裂症 in  Chinese (jīng-shén-fēn liè zhèn ɡ ,  schizophrenia)  has  a  literal  translation  of  'the  split-mind  disease' . This term is heavily stigmatizing. Substantial efforts have been made by mental health professionals in Hong Kong to  introduce  less  stigmatizing  terms,  such  as 思覺失調 (sī jué shī tiáo, psychosis) that translates as 'thought and perceptual dysregulation' [18].

In relation to discourse as 'language-in-use' , the following example from an interview with a psychiatrist illustrates  how  the  diagnosis  of  a  'schizophrenic  patient'  is used to account for the tragic event of mass killing that has  marked,  in  this  psychiatrist's  words,  'a  milestone' in  the  development  of  psychiatric  services  in  Hong Kong.  It  is  an  objectivized,  'clinical'  account  of  a  mental  health disorder and its impact on one's behaviour. It also highlights a wider societal impact in response to the recounted incident.

Example 1 Psy - psychiatrist; I - interviewer .

6.  Psy: Yes. The most well known in Hong Kong is this 1982 tragedy […] Schizophrenic patient, young man killed his mother and sister at home [.] and then went down with two knives and entered a kindergarten and killed four more.
11.  I: Children?
12.  Psy: Four kids and wounded forty something. […] This incident created, it may be regarded as a milestone for the development of the service, because after this incident all the society turned attention to mental health, mental patients.

The last approach to discourse foregrounds how mental health is constructed through social practices. Mental health  and  mental  health  stigma  are  'socially  and  discursively  constituted'  [19]  with  a  bidirectional  relationship  between  discourse  and  social  practices.  Crudely speaking, discourse is the 'mirror' through which social practices  and  ideologies  become  evident.  For  example, different linguistic choices, such as the use of derogative,

direct or figurative language to talk about mental health, reflect the dominant societal practice and ideologies. In the reverse, ways of communicating about mental health impact  social  practices  and  ideologies.  One  example could  be  media  portrayals  of  mental  health  as  both  a reflection  of  prevalent  societal  ideologies  and  ways  to impact  them.  These  different  conceptualizations  of  discourse have been employed in sociolinguistic studies of mental health and mental health stigma.

#### Sociolinguistic research on mental health and mental health stigma

In this special collection, we use the term sociolinguistics broadly  to  include  linguistic  approaches  and  methodologies as diverse as corpus linguistics, different types of discourse  analysis  (e.g.  thematic  and  critical),  conversation analysis, narrative inquiry, to name just a few. While these approaches conceptualise discourse differently, most sociolinguistic studies on mental health and mental health stigma take a social constructivist view, viewing language as a means of constructing social reality. Sociolinguistic  studies  include  quantitative,  qualitative  and mixed methods studies. They examine diverse discourse data,  from  interactions  in  clinical  contexts,  to  online interactions between members of mental health support groups,  to  large  media  corpora.  Each  of  these  different types  of  data  provides  insights  into  different  aspects  of mental health and has both strengths and limitations.

Arguably, one of the most potent sociolinguistic approaches  to  mental  health  research  to  date  has  been corpus  linguistics  [19-23].  Corpus  linguistics  refers  to methods  that  use  computerized  tools  (e.g.  Wordsmith, Sketch Engine) to analyse large collections of data (corpora). While the corpus size could be substantial, the use of  tools  allows  consistent  and  fairly  easy  identification of  patterns  in  the  data  [1].  Another  common  sociolinguistic approach is critical discourse analysis (CDA) [24, 25]. For example, Price uses corpus linguistics and CDA to interrogate news reports on mental health in the UK from  1984  to  2014  [26].  Substantial  corpus  data  delve into media's portrayals of mental health, and how mental health stigma is created and perpetuated by media.

Notably,  sociolinguistic  research  often  includes  the 'voices' of under-represented, vulnerable or underresearched demographic groups. For example, Galinsky's and colleagues' research focuses on discourses surrounding male depression and suicide [27-29]. Societal ideologies around men as strong and powerful often stop men from  seeking  help  and  opening  up  about  their  mental health  struggles.  Sociolinguistic  research  has  much  to contribute to elucidating these dominant ideologies and support  organizations targeting men's mental  health (e.g.  Mind  UK  or  Manup).  It  could  also  contribute  to understanding  groups  'associated'  with  people  with  a history  of  mental  ill  health.  For  example,  Ziółkowskaa and  Galasiński,  examine  the  narratives  of  children  of fathers who died by suicide and how they deal with both bereavement  of  their  deceased  parent  and  the  stigma attached to death by suicide [29].

These  are  just  a  few  examples  of  previous  and  ongoing sociolinguistic research,  and  in  this special  collection  we  welcome  contributions  that  apply  different theoretical  frameworks  and  methodological  approaches in sociolinguistics.

#### The 'why?' and the 'so what?' of sociolinguistic research on mental health and mental health stigma

This brief overview of sociolinguistic research points to some of the possible applications of research to professional  practice.  Sociolinguistic  research  focuses  on  discourses of mental health. These discourses are powerful, they are the means to talk about mental health, the locale where  mental  health  issues  are  manifest,  the  means  to seek and offer help, and the ways to offer education and develop interventions. They are also the means to challenge and contest negative ideologies. De-stigmatisation of  mental  health  can  be  achieved  through  structural changes  (e.g.  offering  equal  employment  opportunities) but  most,  if  not  all,  social  activities  and  practices  are mediated through language. Therefore, sociolinguistic  research  continues  to  make  a  strong  contribution  to mental health de-stigmatization, research and practice.

There  is  an  increasing  emphasis  in  psychology  and psychiatry on participatory research, including with vulnerable  demographic  groups  [30].  As  this  editorial emphasizes, a strength of current sociolinguistic research is  that  it  investigates  the  'voices'  of  different  groups  of people affected by mental ill health. Established sociolinguistic approaches (e.g. narrative inquiry, rhetorical discourse analysis) provide tools to examine different types of  accounts  for  the  social  actions  that  people  perform, why  and  when  people  give  accounts,  and  the  language that they use when they do it. Investigating these data is important to develop targeted interventions for different groups of people affected by mental ill health.

Sociolinguistic research also 'weaves together' the micro-interactions  with  other  contexts,  the  meso  (e.g. institutional) and the macro (societal), bringing personal and the social aspects of mental health together to provide a more holistic picture.

Our brief overview has identified research gaps. There remains  a  paucity  of  empirical  sociolinguistic  research that uses real-life interactional data in face-to-face communicative  encounters  in  mental  health  contexts, for  example,  in  counselling  or  psychotherapy  encounters. This may be partly due to ethical considerations of access and the use of sensitive data (for exceptions, see, for  example,  Lavie  and  Nakash)  [31].  Examining  these

types of data allows exploring in detail how social actions are  accomplished in  situ ,  that  is,  in  real  time  during  an interaction, for example, how diagnosis or possible treatment negotiations are accomplished. There is also limited research on inter-professional communication in mental health contexts which could examine the linguistic repertoires of professional practices, professional ethos, and how diagnoses or treatment recommendations are negotiated inter-professionally, among other issues. Research cited  in  this  editorial  mostly  comes  from  Anglophone contexts. Research from 'global peripheries' [32] remains scarce. While there are a few exceptions, more research from other geographical contexts is called for [33].

To conclude, there are ample opportunities for transdisciplinary  research  that  brings  together  expertise  in psychology,  psychiatry  and  sociolinguistics.  While  discursive  psychology,  for  example,  has  long  been  concerned with investigating issues pertaining to psychology through  language,  sociolinguistics  offers  more  versatile and nuanced ways of doing it by offering a range of different  approaches  and  methodologies.  In  this  special collection  we  welcome  contributions  that  demonstrate the  value  of  sociolinguistic  research  and  how  it  could enhance existing research and practice in psychology and psychiatry. We invite contributions that draw on diverse empirical  data  from  different  clinical  and  non-clinical contexts and that focus on different mental health conditions.  We  also  welcome  authors  working  in  diverse sociocultural  contexts  whose  work  could  advance  our understanding  of  the  cultural  aspects  present  in  discourses of mental health and stigma. It is through such trans-disciplinary effort that we can challenge the existing social practices and ideologies of mental health and ultimately  contribute  to  addressing  some  long-standing societal  issues  of  discrimination  and  stigmatization  of people with mental health issues.

Acknowledgements

Not applicable.

Author contributions

Olga Zayts-Spence wrote the paper, David Edmonds and Zoe Fortune reviewed the paper.

Funding

The writing of this paper was fully supported by the Collaborative Research Funding (CRF) of the Research Grants Council (RGC) of Hong Kong (project C7086-21G).

Availability of Data and Materials

Not applicable.

Declarations

Ethical approval and consent to participate

Not applicable.

Consent to publish

Not applicable.

Conflict of interest

The authors declare no conflict of interest.

Received: 2 May 2023 / Accepted: 17 May 2023

### References

1. Harvey K. Disclosures of depression using corpus linguistics methods to examine young people's online health concerns. Int J Corpus Linguistics. 2012;17:349-79.
2. World Health Organization. Mental health: Strengthening Our Response. World Health Organization. 2022. https://www.who.int/news-room/ fact-sheets/detail/mental-health-strengthening-our-response.
3. Goffman E. Stigma: notes on the management of Spoiled Identity. First touchstone edition. New York: Simon &amp; Schuster Inc; 1986.
4. Thornicroft G, Rose D, Kassam A, Sartorius N. Stigma: ignorance, prejudice or discrimination? Br J Psychiatry. 2007;190:192-3.
5. Yanos PT, Written-Off. Mental Health Stigma and the loss of human potential. New York: Cambridge University Press;: Cambridge; 2018.
6. Ng S, Reidy H, Wong PW, Zayts-Spence O. The relationship between personal and interpersonal mental health experiences and stigma-related outcomes in Hong Kong. BJPsych Open. 2023.
7. Long V. Destigmatising Mental Illness? Professional Politics and Public Education in Britain, 1870-1970. Manchester: Manchester University Press; 2014.
8. Shipman H, Zayts-Spence O. In: Zayts-Spence O, Bridges S, Language, editors. The 'mad consultant dealing with mad people': a discursive historical approach to tensions regarding mental health stigma in Hong Kong. Health and Culture: Problematizing the Centers and Peripheries of Healthcare Communication Research. Routledge; 2023.
9. Mehta SI, Farina A, Associative, Stigma. Perceptions of the difficulties of College-Aged children of stigmatized fathers. J Soc Clin Psychol. 1988;7:192-202.
10. Clair M. Stigma. In: Ryan JM, editor. Core concepts in sociology. Wiley; 2018. pp. 318-21.
11. Hatzenbuehler ML, Link BG. Introduction to the special issue on structural stigma and health. Soc Sci Med. 2014;103:1-6.
12. Link BG, Phelan JC. Conceptualizing Stigma. Ann Rev Sociol. 2001;27:363-85.
13. Holmes J. An introduction to Sociolinguistics. 4th ed. Routledge; 2013.
14. Stubbs M. Discourse analysis: the sociolinguistic analysis of natural language. Chicago: University Of Chicago Press; 1983.
15. Thomas JA. Meaning in Interaction an introduction to Pragmatics. Oxford; New York: Routledge; 1995.
16. Fairclough N. Discourse and Social Change. Cambridge: Polity Press; 1992.
17. van Dijk TA. Discourse and communication: New Approaches to the analysis of Mass Media discourse and communication. De Gruyter; 1985.
18. Chiu CP-Y, Lam MM-L, ., Chan SK-W, ., Chung DW-S, ., Hung S-F, Tang JY-M, et al. Naming psychosis: the Hong Kong experience. Early Interv Psychiat. 2010;4:270-4.
19. Hunt D, Brookes G. Corpus, discourse and Mental Health. Bloomsbury Publishing; 2020.
20. Harvey K. Investigating Adolescent Health Communication A Corpus Linguistics Approach. Bloomsbury Publishing; 2014.
21. Harvey K, Brown B. Health Communication and Psychological Distress: exploring the Language of Self-harm. Can Mod Lang Rev. 2012;68:316-40.
22. Jaworska S, Kinloch K. Using multiple data sets. In: Taylor C, Marchi A, editors. Corpus approaches to discourse: a critical review. London: Routledge; 2018.
23. McDonald D, Woodward-Kron R. Member roles and identities in online support groups: perspectives from corpus and systemic functional linguistics. Discourse &amp; Communication. 2016;10:157-75.
24. Berring LL, Pedersen L, Buus N. Discourses of aggression in forensic mental health: a critical discourse analysis of mental health nursing staff records. Nurs Inq. 2015;22:296-305.
25. Jørgensen K, Praestegaard J, Holen M. The conditions of possibilities for recovery: a critical discourse analysis in a danish psychiatric context. J Clin Nurs. 2020;29:3012-24.
26. Price H. The Language of Mental Illness Corpus Linguistics and the construction of Mental Illness in the press. Cambridge University Press; 2022.
27. Galasinski D. Men's discourses of Depression. Basingstoke: Palgrave Macmillan; 2014.

28. Galasinski D. Discourses of men's suicide notes. Bloomsbury Publishing; 2017.
29. Ziółkowska J, Galasiński D. Discursive construction of fatherly suicide. Crit Discourse Stud. 2016;14:150-66.
30. Levac L, Ronis S, Cowper-Smith Y, Vaccarino O. A scoping review: the utility of participatory research approaches in psychology. J Community Psychol. 2019;47:1865-92.
31. Lavie-Ajayi M, Nakash O. If she had helped me to solve the problem at my workplace, she would have cured me': a critical discourse analysis of a mental health intake. Qualitative Social Work: Research and Practice. 2016;16:60-77.
32. Coifman KG, Flynn JJ, Pinto LA. When context matters: negative emotions predict psychological health and adjustment. Motivation and Emotion. 2016;40:602-24.
33. Ladegaard HJ. Talking about trauma in migrant worker returnee narratives: Mental Health Issues. In: Watson B, Krieger J, editors. Expanding Horizons in Health Communication an asian perspective. Springer; 2020. pp. 3-27.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Olga Zayts-Spence, PhD, is the Director of the Research and Impact Initiative for Communication in Healthcare (HKU RIICH) at the University of Hong Kong. She has a dual background in sociolinguistics and public health. Her research focuses on healthcare interactions in diverse setting, including mental health. David Edmonds is a postdoctoral fellow at HKU RIICH. His background is in sociolinguistics and psychology. His research interests are in mental health of vulnerable demographic groups. Zoe Fortune is a clinical psychologist and an Adjunct Assistant Professor, and she leads a mental health research cluster at HKU RIICH.

In [29]:
# Check HTML export. Displaying markdown in a VS Code notebook triggers a bug (the notebook background color changes): open the HTML doc in a browser for a better rendering.
# doc_name = "40359_2023_Article_1210"
# with open(f"results/html/{doc_name}.html", "r", encoding="utf-8") as f:
#     doc_html = f.read()
# display(HTML(doc_html))
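
As the comment above suggests, the HTML export is easier to read in a browser. A minimal sketch of that step using only the standard library; `html_export_uri` and `open_html_export` are hypothetical helpers, and the `results/html` folder is taken from the commented-out cell above:

```python
import webbrowser
from pathlib import Path


def html_export_uri(doc_name, html_dir="results/html"):
    """Build a file:// URI for the HTML export of a given document."""
    return (Path(html_dir) / f"{doc_name}.html").resolve().as_uri()


def open_html_export(doc_name, html_dir="results/html"):
    """Open the HTML export in the default browser (not called here)."""
    return webbrowser.open(html_export_uri(doc_name, html_dir))


uri = html_export_uri("40359_2023_Article_1210")
print(uri)
```

Calling `open_html_export("40359_2023_Article_1210")` from the notebook should then open the file in the default browser, provided the export exists at that location.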

## Chunk extraction

Here we build a first version of a chunker to look in detail at what the parsing retrieves. The goal is to process all the text elements and build a list of dicts, each dict representing one text object identified by Docling, with the following keys:
- `text`: the content of the text element
- `page`: the page number
- `sections`: a list of sections, i.e. the text of every `texts` element found higher up in the hierarchy tree

_Note: only the `texts` elements are processed here, but the same logic could be applied to the other elements (`pictures`, `tables`, ...)._

In [30]:
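class DoclingChunker:
    """Flatten the `texts` of a Docling JSON export into chunks with page and section context."""

    def __init__(self, doc_json):
        self.doc_json = doc_json

    def explode_ref(self, ref):
        # Split a JSON reference such as '#/texts/8' into ('texts', 8)
        parts = ref.lstrip("#/").split("/")
        parent_type = parts[0]
        parent_index = int(parts[1])
        return parent_type, parent_index

    def get_page(self, data):
        # Page number of the first provenance entry of the element
        return data['prov'][0]['page_no']

    def get_data(self, ref):
        # Resolve a reference to the element it points to
        # (`item_type` rather than `type`, to avoid shadowing the built-in)
        item_type, index = self.explode_ref(ref)
        return self.doc_json[item_type][index]

    def extract_text_metadata(self, text_data):
        content = text_data['text']
        page = self.get_page(text_data)
        sections = []
        parent = text_data['parent']['$ref']
        # Walk up the tree until the root ('#/body'), keeping the text of
        # every ancestor that is itself a `texts` element (outermost first)
        while 'body' not in parent:
            parent_type, _ = self.explode_ref(parent)
            parent_data = self.get_data(parent)
            if parent_type == 'texts':
                sections = [parent_data['text']] + sections
            parent = parent_data['parent']['$ref']
        return {
            'text': content,
            'page': page,
            'sections': sections
        }

    def process_texts(self):
        # One chunk per element of the `texts` list
        return [self.extract_text_metadata(text) for text in self.doc_json['texts']]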
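
The note above mentions that the same logic could be applied to `tables` and `pictures`. A minimal sketch of that idea, assuming those elements carry the same `parent` / `$ref` and `prov` fields as `texts` (an assumption that has not been checked against a real export here). `section_path` and the `toy_doc` dict are hypothetical names and data introduced for illustration only:

```python
def section_path(doc_json, item):
    """Return the text of every `texts` ancestor of `item`, outermost first."""
    sections = []
    ref = item["parent"]["$ref"]
    # Stop at the root node, whose reference ('#/body') carries no index
    while ref != "#/body":
        item_type, index = ref[2:].split("/")
        parent = doc_json[item_type][int(index)]
        if item_type == "texts":
            sections.insert(0, parent["text"])
        ref = parent["parent"]["$ref"]
    return sections


# Toy document (hypothetical data): a title, a section header and one table
toy_doc = {
    "texts": [
        {"text": "Title", "parent": {"$ref": "#/body"}},
        {"text": "Results", "parent": {"$ref": "#/texts/0"}},
    ],
    "tables": [
        {"parent": {"$ref": "#/texts/1"}, "prov": [{"page_no": 3}]},
    ],
}

# Same chunk shape as for texts, minus the `text` key
table_chunks = [
    {"page": t["prov"][0]["page_no"], "sections": section_path(toy_doc, t)}
    for t in toy_doc["tables"]
]
print(table_chunks)
```

Keeping the ancestor walk in a standalone function means the same code can serve every element list, whatever content is attached to each chunk.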

In [31]:
# Select a given doc and show exported results
doc_name = "40359_2023_Article_1210"

# Check json export - Docling
with open(f"results/json/{doc_name}_docling.json", "r", encoding="utf-8") as f:
    doc_json = json.load(f)

In [32]:
# Instantiate chunker
chunker = DoclingChunker(doc_json=doc_json)

In [33]:
# Get text data by reference
text_data = chunker.get_data('#/texts/9')
text_data

{'self_ref': '#/texts/9',
 'parent': {'$ref': '#/texts/8'},
 'children': [],
 'content_layer': 'body',
 'label': 'text',
 'prov': [{'page_no': 1,
   'bbox': {'l': 56.693,
    't': 380.1589887695312,
    'r': 292.861,
    'b': 214.07698876953123,
    'coord_origin': 'BOTTOMLEFT'},
   'charspan': [0, 785]}],
 'orig': "This  special  collection  brings  together  the  three  broad themes of mental health, discourse and stigma as they are examined  through  sociolinguistic  lenses.  We  first  present what we mean by mental health, discourse and stigma and  discuss  the  interrelationships  between  these  concepts. We then offer a brief overview of existing sociolinguistic research on mental health and stigma and identify continuing areas of under-research that we hope this special collection will contribute to. Finally, we ask the questions of 'why?' and 'so what?' in relation to sociolinguistic research on mental health and stigma and outline some ways in which this growing area of rese

In [34]:
# Process metadata for a given text element
text_metadata = chunker.extract_text_metadata(text_data)
text_metadata

{'text': "This  special  collection  brings  together  the  three  broad themes of mental health, discourse and stigma as they are examined  through  sociolinguistic  lenses.  We  first  present what we mean by mental health, discourse and stigma and  discuss  the  interrelationships  between  these  concepts. We then offer a brief overview of existing sociolinguistic research on mental health and stigma and identify continuing areas of under-research that we hope this special collection will contribute to. Finally, we ask the questions of 'why?' and 'so what?' in relation to sociolinguistic research on mental health and stigma and outline some ways in which this growing area of research could meaningfully  contribute  to  broader  professional  practice  in psychology and psychiatry.",
 'page': 1,
 'sections': ['Mental Health, Discourse and Stigma', 'Introduction']}

In [35]:
# Process chunks and display first 5 of the document
chunks = chunker.process_texts()
chunks[:5]

[{'text': 'Zayts-Spence et al. BMC Psychology (2023) 11:180 https://doi.org/10.1186/s40359-023-01210-6',
  'page': 1,
  'sections': []},
 {'text': 'BMC Psychology', 'page': 1, 'sections': []},
 {'text': 'EDITORIAL', 'page': 1, 'sections': []},
 {'text': 'Mental Health, Discourse and Stigma', 'page': 1, 'sections': []},
 {'text': 'Olga Zayts-Spence 1* , David Edmonds 1 and Zoe Fortune 1',
  'page': 1,
  'sections': ['Mental Health, Discourse and Stigma']}]

In [36]:
# Check all sections identified
{' - '.join(x['sections']) for x in chunks}

{'',
 'Mental Health, Discourse and Stigma',
 'Mental Health, Discourse and Stigma - Abstract',
 'Mental Health, Discourse and Stigma - Introduction',
 'Mental Health, Discourse and Stigma - Introduction - Defining mental health, mental health stigma and discourse',
 'Mental Health, Discourse and Stigma - Introduction - Sociolinguistic research on mental health and mental health stigma',
 "Mental Health, Discourse and Stigma - Introduction - The 'why?' and the 'so what?' of sociolinguistic research on mental health and mental health stigma",
 'Mental Health, Discourse and Stigma - References'}
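
The set above only lists the distinct section paths. To see how many chunks fall under each path, a count can help. A minimal sketch, in which the hypothetical `sample_chunks` list stands in for `chunks` and reuses a few entries from the output displayed earlier:

```python
from collections import Counter

# Stand-in for `chunks` (a few entries copied from the output shown earlier)
sample_chunks = [
    {"text": "BMC Psychology", "page": 1, "sections": []},
    {"text": "EDITORIAL", "page": 1, "sections": []},
    {"text": "Mental Health, Discourse and Stigma", "page": 1, "sections": []},
    {"text": "Olga Zayts-Spence 1* , David Edmonds 1 and Zoe Fortune 1",
     "page": 1,
     "sections": ["Mental Health, Discourse and Stigma"]},
]

# Join each section list into a readable path; label empty paths explicitly
counts = Counter(
    " - ".join(c["sections"]) or "(no section)" for c in sample_chunks
)
for path, n in counts.most_common():
    print(f"{n:>3}  {path}")
```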

__Conclusion__
- Setting aside the chunks with an empty section list, the identified sections follow a consistent logic.
- Note that "Introduction" is not detected at the right level: it is interpreted as a parent of the other section titles...
- Some chunks are not very interesting taken individually, but they correspond to document metadata. A filtering / grouping heuristic could be applied to these first-level elements, if this turns out to be a general problem rather than one specific to this document.
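
The filtering / grouping idea could be prototyped as follows. This is only a sketch of one possible heuristic, not a validated rule: chunks with an empty `sections` list are treated as document-level metadata and merged into a single record, while all other chunks are kept as content. `split_metadata` and `sample_chunks` are hypothetical names; the sample entries are copied from the output displayed earlier.

```python
def split_metadata(chunks):
    """Separate top-level chunks (empty `sections`) from content chunks."""
    metadata = [c for c in chunks if not c["sections"]]
    content = [c for c in chunks if c["sections"]]
    # Merge all top-level chunks into one document-level record
    grouped = {
        "text": "\n".join(c["text"] for c in metadata),
        "pages": sorted({c["page"] for c in metadata}),
        "sections": [],
    }
    return grouped, content


sample_chunks = [
    {"text": "BMC Psychology", "page": 1, "sections": []},
    {"text": "EDITORIAL", "page": 1, "sections": []},
    {"text": "Mental Health, Discourse and Stigma", "page": 1, "sections": []},
    {"text": "Olga Zayts-Spence 1* , David Edmonds 1 and Zoe Fortune 1",
     "page": 1,
     "sections": ["Mental Health, Discourse and Stigma"]},
]

doc_metadata, content_chunks = split_metadata(sample_chunks)
print(doc_metadata["text"])
print(len(content_chunks))
```

With this rule the document title itself lands in the metadata record, since it has no parent section. Whether that is desirable would need to be checked on more documents.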