85% working list item detection and notation #65

Open: wants to merge 6 commits into base: master
137 changes: 137 additions & 0 deletions marker/cleaners/lists.py
@@ -0,0 +1,137 @@
from marker.bbox import merge_boxes
from marker.schema import Line, Span, Block, Page
from typing import List
import re
from copy import deepcopy
import math


def merge_list_blocks(blocks: List[Page]):

    current_lines = []
    current_bbox = None
    debug = False

    ## the loops below assume the structure is
    ## pages (the `blocks` argument) > blocks > lines > spans

    for page in blocks:
        new_page_blocks = []
        pnum = page.pnum

        current_list_item_span = None

        for block in page.blocks:

            ## pass through the data that isn't list-items
            if block.most_common_block_type() != "List-item":

                ## handle the case of starting a new non-list-item block with a dangling list item present:
                ## close it out and clear the reference
                if current_list_item_span is not None:
                    current_lines.append(Line(spans=[current_list_item_span], bbox=current_list_item_span.bbox))
                    current_list_item_span = None

                if len(current_lines) > 0:
                    new_block = Block(
                        lines=deepcopy(current_lines),
                        pnum=pnum,
                        bbox=current_bbox
                    )
                    new_page_blocks.append(new_block)
                    current_lines = []
                    current_bbox = None

                new_page_blocks.append(block)

                if debug:
                    for line in block.lines:
                        for span in line.spans:
                            if span.text.strip():
                                print(f"[ypos: {str(span.bbox[1])[:3]},{str(span.bbox[3])[:3]}], Text: {span.text}")

                continue

            ## begin working on data that are list items
            ##current_lines.extend(block.lines)

            ## accumulate one bounding box covering every list block merged so far
            if current_bbox is None:
                current_bbox = block.bbox
            else:
                current_bbox = merge_boxes(current_bbox, block.bbox)

            ## note: if the extend() call above were enabled, current_lines would share these
            ## Line objects, so the edits below would also update current_lines
            for line in block.lines:
                for span in line.spans:
                    trimmed_text = span.text.strip()
                    xpos = math.floor(span.bbox[0])
                    ypos = math.floor(span.bbox[1])
                    ypos2 = math.floor(span.bbox[3])
                    ## hard-coded for the test PDF: left margin of 48 and 18 per indent level
                    indent = math.floor((xpos - 48) / 18)

                    if is_list_item_indicator(trimmed_text):
                        if current_list_item_span is not None:
                            ## we're starting a new list item and already have one in progress,
                            ## so write the finished one to the output, then make a new one.
                            ## Probably this means adding a new block with new lines holding this span,
                            ## but I don't know where to put it or how to get it to interleave with
                            ## the rest of the items on the page
                            if debug:
                                print(current_list_item_span.text)

                            ## appending the lines here using various bboxes creates mangled output:
                            ## the items do not appear on new lines, even though they are new lines and
                            ## their bbox is the same as the start of the list item.
                            ## Printing the output looks perfect, but something after this step
                            ## reorders the items on the page
                            current_lines.append(Line(spans=[current_list_item_span], bbox=current_list_item_span.bbox))

                        ind = "\t" * indent
                        text = f" {ind}{span.text.strip()}"
                        if debug:
                            text = f" {ind}[ypos: {ypos},{ypos2}]{span.text.strip()}"

                        current_list_item_span = Span(
                            bbox=span.bbox,
                            span_id=span.span_id,
                            font="List-item",
                            color=span.color,
                            block_type="List-item",
                            text=text
                        )
                        span.text = ""  # preferably delete this, but I don't know what that will do to the schema
                    else:
                        if current_list_item_span is not None:
                            # Append text to the current list item span
                            current_list_item_span.text += " " + span.text.strip()
                            current_list_item_span.bbox = merge_boxes(current_list_item_span.bbox, span.bbox)
                            span.text = ""  # preferably delete this, but I don't know what that will do to the schema
                            # preferably delete empty lines too, but I don't know what that will do to the schema
                            # note: the current list item spans blocks, and probably needs to span pages,
                            # so this whole loop nesting might need another layer

        ## flush a list item still pending at the end of the page; without this the last
        ## item is dropped when a page ends on a list block, because the reference is
        ## reset at the top of the next page
        if current_list_item_span is not None:
            current_lines.append(Line(spans=[current_list_item_span], bbox=current_list_item_span.bbox))
            current_list_item_span = None

        if len(current_lines) > 0:
            new_block = Block(
                lines=deepcopy(current_lines),
                pnum=pnum,
                bbox=current_bbox
            )

            new_page_blocks.append(new_block)
            current_lines = []
            current_bbox = None

        page.blocks = new_page_blocks
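
The indent level in `merge_list_blocks` comes from `math.floor((xpos - 48) / 18)`, which hard-codes a left margin of 48 and a step of 18. Those values presumably fit the test PDF rather than documents in general. A minimal sketch of the same arithmetic as a standalone helper follows. The name `indent_level`, its keyword parameters, and the clamp to zero are assumptions for illustration and are not part of this PR.

```python
import math


def indent_level(xpos: float, left_margin: float = 48, step: float = 18) -> int:
    """Hypothetical helper: map a span's left x-coordinate to a nesting depth.

    The defaults 48 and 18 are the constants used in merge_list_blocks.
    The clamp to zero is an assumption added here, not something the PR does.
    """
    level = math.floor((math.floor(xpos) - left_margin) / step)
    return max(level, 0)


# Illustrative x positions (made up, not measured from the test PDF).
for x in (40.0, 48.5, 66.2, 84.9):
    depth = indent_level(x)
    print(x, depth, repr("\t" * depth))
```

Without the clamp, a span left of the margin yields a negative level. That is harmless in the PR's code because multiplying a string by a negative integer gives an empty string in Python, but an explicit clamp makes the intent visible.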

def create_new_lists(blocks: List[Page]):
    ## placeholder: not implemented yet
    return None


def is_list_item_indicator(text):
    # Regular expression to match list item indicators (e.g., bullets, alphabetic, numeric, roman numerals)
    pattern = r'^(\s*[\u2022\u25E6\u25AA\u25AB\u25CF]|(?:[ivxlcdm]+\.)|(?:[a-zA-Z]\.)|(?:\d+\.)|\(\d+\))\s*'
    return re.match(pattern, text.strip()) is not None
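
A quick way to see what `is_list_item_indicator` accepts is to run its pattern over a few sample strings. The sketch below copies the pattern verbatim from this PR. The sample strings are invented for illustration, and the grouping into matches and non-matches reflects how the regular expression reads, not output from the marker pipeline.

```python
import re

# Pattern copied verbatim from is_list_item_indicator in this PR.
PATTERN = r'^(\s*[\u2022\u25E6\u25AA\u25AB\u25CF]|(?:[ivxlcdm]+\.)|(?:[a-zA-Z]\.)|(?:\d+\.)|\(\d+\))\s*'


def is_list_item_indicator(text: str) -> bool:
    return re.match(PATTERN, text.strip()) is not None


# Sample strings are made up for this sketch.
samples = [
    "5. Thinks people are watching",   # numeric marker
    "A. Communications Unit",          # single upper-case letter
    "a. An Escalated Call",            # single lower-case letter
    "iv. fourth item",                 # lower-case roman numeral
    "(1) parenthesised number",        # number in parentheses
    "\u2022 bullet item",              # bullet character
    "e.g. an ordinary sentence",       # starts with "e." so it also matches
    "3.14 is pi",                      # starts with "3." so it also matches
    "IV. upper-case roman numeral",    # not covered by the pattern
    "(a) parenthesised letter",        # not covered by the pattern
    "Plain sentence",                  # no marker
]

for s in samples:
    print(f"{is_list_item_indicator(s)!s:5}  {s}")
```

The two false positives suggest that requiring whitespace after the marker might tighten detection, though that change is untested against the PDFs in this PR.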
6 changes: 6 additions & 0 deletions marker/convert.py
@@ -1,6 +1,7 @@
import fitz as pymupdf

from marker.cleaners.table import merge_table_blocks, create_new_tables
from marker.cleaners.lists import merge_list_blocks, create_new_lists
from marker.debug.data import dump_bbox_debug_data
from marker.extract_text import get_text_blocks
from marker.cleaners.headers import filter_header_footer, filter_common_titles
@@ -137,6 +138,11 @@ def convert_single_pdf(
    table_count = create_new_tables(blocks)
    out_meta["block_stats"]["table"] = table_count

    # Fix List blocks
    merge_list_blocks(blocks)
    #list_count = create_new_lists(blocks)
    #out_meta["block_stats"]["lists"] = list_count

    for page in blocks:
        for block in page.blocks:
            block.filter_spans(bad_span_ids)
3 changes: 3 additions & 0 deletions marker/debug/data.py
@@ -71,6 +71,9 @@ def dump_bbox_debug_data(doc, blocks: List[Page]):
        page_data["image"] = b64_image
        debug_data.append(page_data)

    # Create the directories if they don't exist
    os.makedirs(os.path.dirname(debug_file), exist_ok=True)

    with open(debug_file, "w+") as f:
        json.dump(debug_data, f)
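
One edge case in the `os.makedirs` call above: `os.path.dirname` returns an empty string when `debug_file` is a bare filename, and `os.makedirs("")` raises `FileNotFoundError`. A minimal sketch of a guarded version follows. The function name `write_debug_json` is a hypothetical stand-in for the tail of `dump_bbox_debug_data`, and the temporary path is only there so the example runs without touching the project tree.

```python
import json
import os
import tempfile


def write_debug_json(debug_file: str, debug_data: list) -> None:
    """Hypothetical stand-in for the last lines of dump_bbox_debug_data."""
    parent = os.path.dirname(debug_file)
    # Only create the parent directory when the path actually has one;
    # a bare filename has an empty dirname, which os.makedirs rejects.
    if parent:
        os.makedirs(parent, exist_ok=True)
    with open(debug_file, "w") as f:
        json.dump(debug_data, f)


# Usage: write into a nested directory that does not exist yet.
with tempfile.TemporaryDirectory() as tmp:
    nested = os.path.join(tmp, "debug", "run1", "bbox.json")
    write_debug_json(nested, [{"page": 0}])
    print(os.path.exists(nested))
```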

Binary file added test/611-mentally-ill-persons-12-21-2020.pdf
Binary file not shown.
30 changes: 30 additions & 0 deletions test/test.md
@@ -0,0 +1,30 @@

## Procedure 611 - Mentally Ill Persons

5. Thinks people are watching or talking to him;
6. Exhibits an extreme degree of panic or fright;
7. Behaves in a way dangerous to himself or others (i.e., hostile, suicidal, makes threats towards others, etc.);
8. Poor personal hygiene or appearance; or
9. Demonstrates an unusual thought process or verbal expressions or is catatonic.
C. Upon recognition of a mental health crisis situation the officer's responsibilities include:
1. Maintaining a high degree of caution in dealing with the potentially unpredictable nature of persons with mental illness;
2. Protecting the general public from the actions of the persons with mental illness;
3. Protecting the persons with mental illness from his/her own actions; and
4. Providing the most effective remedy available at the time to resolve the crisis situation.

## .05 *Crisis Intervention Team (Cit) Officers*

A. A Crisis Intervention Team (CIT) officer is defined as any officer on the Department who has successfully completed the 40 hours Crisis Intervention Team training.
B. CIT Officers are assigned to regular patrol duties and when available respond to situations involving persons who are experiencing a mental health crisis.
C. The CIT Officer at the scene of a call involving a mental health crisis situation has the responsibility for handling the situation unless otherwise directed by a supervisor. The CIT Officer should ask for additional support, if necessary.
D. CIT Officers may only take the same courses of action as other patrol officers when handling a mental health crisis. The courses of action are listed in Section .08 of this procedure.

## .06 *Initial Response*

A. Communications Unit - Dispatchers responsibilities include:
1. Attempt to determine if a service call is a mental health crisis;
2. Determine if weapons or any violent acts have been committed which may create an Escalated Mental Health Crisis Call.
a. An Escalated Mental Health Crisis Call is a two-pronged approach where weapons are involved, or violence has occurred or is occurring, and corroborating factors exist that establish a mental health nexus.
b. If the call meets the listed criteria for an Escalated Mental Health Crisis Call, a supervisor will be assigned and dispatched to the scene.
3. Identify mental health crisis calls by using appropriate code; (Escalated Mental Health Crisis Call, Mental Health in Progress, Mental Health Disturbance, Mental Health Routine);
4. Assign and dispatch a CIT Officer when available, along with a cover officer, to mental health crisis situations;
Binary file added test/test.pdf
Binary file not shown.
Binary file added test/test2.pdf
Binary file not shown.
24 changes: 24 additions & 0 deletions test/test_meta.json
@@ -0,0 +1,24 @@
{
  "language": "English",
  "filetype": "pdf",
  "toc": [],
  "pages": 1,
  "ocr_stats": {
    "ocr_pages": 0,
    "ocr_failed": 0,
    "ocr_success": 0
  },
  "block_stats": {
    "header_footer": 4,
    "code": 0,
    "table": 0,
    "equations": {
      "successful_ocr": 0,
      "unsuccessful_ocr": 0,
      "equations": 0
    }
  },
  "postprocess_stats": {
    "edit": {}
  }
}
53 changes: 53 additions & 0 deletions test/testa.md
@@ -0,0 +1,53 @@

## Procedure 611 - Mentally Ill Persons

5. Thinks people are watching or talking to him;

6. Exhibits an extreme degree of panic or fright;


7. Behaves in a way dangerous to himself or others (i.e., hostile, suicidal, makes threats towards others, etc.);

8. Poor personal hygiene or appearance; or

9. Demonstrates an unusual thought process or verbal expressions or is catatonic.

C.
Upon recognition of a mental health crisis situation the officer's responsibilities include:

1. Maintaining a high degree of caution in dealing with the potentially unpredictable nature of persons with mental
illness;

2. Protecting the general public from the actions of the persons with mental illness;

3. Protecting the persons with mental illness from his/her own actions; and

4. Providing the most effective remedy available at the time to resolve the crisis situation.

## .05 *Crisis Intervention Team (Cit) Officers*

A. A Crisis Intervention Team (CIT) officer is defined as any officer on the Department who has successfully completed
the 40 hours Crisis Intervention Team training. B. CIT Officers are assigned to regular patrol duties and when available respond to situations involving persons who are
experiencing a mental health crisis. C. The CIT Officer at the scene of a call involving a mental health crisis situation has the responsibility for handling
the situation unless otherwise directed by a supervisor. The CIT Officer should ask for additional support, if necessary. D. CIT Officers may only take the same courses of action as other patrol officers when handling a mental health crisis.
The courses of action are listed in Section .08 of this procedure.


## .06 *Initial Response*

A. Communications Unit - Dispatchers responsibilities include:

1. Attempt to determine if a service call is a mental health crisis; 2. Determine if weapons or any violent acts have been committed which may create an Escalated Mental Health
Crisis Call.

a. An Escalated Mental Health Crisis Call is a two-pronged approach where weapons are involved, or violence
has occurred or is occurring, and corroborating factors exist that establish a mental health nexus.

b. If the call meets the listed criteria for an Escalated Mental Health Crisis Call, a supervisor will be assigned
and dispatched to the scene.

3. Identify mental health crisis calls by using appropriate code; (Escalated Mental Health Crisis Call, Mental Health
in Progress, Mental Health Disturbance, Mental Health Routine);

4. Assign and dispatch a CIT Officer when available, along with a cover officer, to mental health crisis situations;