# chore: bump version to 0.18.0 #439
Coverage Report: 278 files skipped due to complete coverage. Coverage success: total of 97.98% is above 97.98% 🎉
## Changelog

### Added

- Support for multiple loggers (`tensorboard`, `wandb`, `comet_ml`, `aim`, `mlflow`, `clearml`, `dvclive`, `csv`, `json`, `rich`) in `edsnlp.train` via the `logger` parameter. Default is [`json` and `rich`] for backward compatibility.
- Gradient accumulation via sub-batches: for instance, `batch_size = 10000 tokens` and `sub_batch_size = 5 splits` accumulate batches of 2000 tokens.
- New `pyarrow_write_kwargs` parameter to pass extra arguments to `pyarrow.dataset.write_dataset`.
- New `end_value` parameter to configure whether the learning rate should decay to zero or to another value.
- New `eds.explode` pipe that splits one document into multiple documents, one per span yielded by its `span_getter` parameter, each new document containing exactly that single span.
- New "Training a span classifier" tutorial, and reorganized deep-learning docs.
- `ScheduledOptimizer` now warns when a parameter selector does not match any parameter.
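The gradient-accumulation arithmetic in the entry above can be sketched in plain Python. This is a hypothetical helper for illustration only, not `edsnlp.train`'s actual internals: a batch budget of 10000 tokens split into 5 sub-batches yields accumulation steps of 2000 tokens each.

```python
def split_token_budget(token_budget: int, num_splits: int) -> list:
    """Divide a batch's token budget into `num_splits` sub-batch budgets.

    The sub-batches sum back to the full budget; gradients would be
    accumulated across them before a single optimizer step.
    """
    base, remainder = divmod(token_budget, num_splits)
    # Spread any remainder over the first sub-batches (sizes differ by <= 1).
    return [base + (1 if i < remainder else 0) for i in range(num_splits)]

# batch_size = 10000 tokens, sub_batch_size = 5 splits -> five 2000-token steps
print(split_token_budget(10_000, 5))  # [2000, 2000, 2000, 2000, 2000]
```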
### Fixed

- `use_sections` in `eds.history` should now correctly handle cases where other sections follow history sections.
- Fixed the `words[-10:10]` syntax in the trainable span classifier's `context_getter` parameter.
- `post_init` was applied after the instantiation of the optimizer: if the model discovered new labels, and therefore changed its parameter tensors to reflect that, these new tensors were not taken into account by the optimizer, which could lead to subpar performance. Now `post_init` is applied before the optimizer is instantiated, so that the optimizer can correctly handle the new tensors.
- Fixed `write_parquet` and support for `polars` in `pyproject.toml`: all implemented readers and writers are now correctly registered as entry points.
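The `post_init`-ordering bug above can be illustrated with a pure-Python analogy (hypothetical `TinyOptimizer`/`TinyModel` classes, not EDS-NLP's or PyTorch's actual code): an optimizer keeps references to the parameter objects it was given at construction, so any tensor replaced afterwards is silently left untrained.

```python
class TinyOptimizer:
    """Minimal stand-in for an optimizer: it remembers the exact parameter
    objects it was given at construction time."""
    def __init__(self, params):
        self.params = list(params)

    def tracked(self, param):
        # Identity check, like an optimizer holding references to tensors.
        return any(p is param for p in self.params)


class TinyModel:
    def __init__(self):
        self.weight = [0.0, 0.0]  # stand-in for a parameter tensor

    def post_init(self, num_labels):
        # Discovering new labels replaces the parameter tensor entirely.
        self.weight = [0.0] * num_labels


# Buggy order (pre-fix): optimizer built first, then post_init swaps tensors.
model = TinyModel()
opt = TinyOptimizer([model.weight])
model.post_init(num_labels=3)
print(opt.tracked(model.weight))  # False: the new tensor is never updated

# Fixed order: post_init first, so the optimizer sees the final tensors.
model = TinyModel()
model.post_init(num_labels=3)
opt = TinyOptimizer([model.weight])
print(opt.tracked(model.weight))  # True
```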
### Changed

- Section cues in `eds.history` are now section titles, and not the full section.
- 💥 Validation metrics are now found under the root field `validation` in the training logs (e.g. `metrics['validation']['ner']['micro']['f']`).
- It is now recommended to define optimizer groups of `ScheduledOptimizer` as a list of dicts of optim hyper-parameters, each containing a `selector` regex key, rather than as a single dict with selectors as keys and dicts of optim hyper-parameters as values. This allows for more flexibility in defining the optimizer groups and is more consistent with the rest of the EDS-NLP API. It also makes it easier to reference group values from other places in config files, since their path no longer contains a complex regex string. See the updated training tutorials for more details.

---

- If this PR is a bug fix, the bug is documented in the test suite.
- Changes were documented in the changelog (pending section).
- If necessary, changes were made to the documentation (e.g. a new pipeline).
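To illustrate the recommended `ScheduledOptimizer` group format from the Changed section, here is a minimal sketch of how per-group `selector` regexes could map onto parameter names. The parameter names and the `assign_groups` helper are hypothetical, for illustration only; this is not EDS-NLP's actual matching code.

```python
import re

# Hypothetical parameter names, in the spirit of PyTorch module paths.
param_names = ["embedding.weight", "classifier.weight", "classifier.bias"]

# Recommended style: a list of groups, each carrying its own `selector` regex
# alongside the optim hyper-parameters, instead of a dict keyed by regexes.
groups = [
    {"selector": "embedding", "lr": 1e-5},
    {"selector": "classifier", "lr": 1e-3},
]

def assign_groups(names, groups):
    """Assign each parameter name the hyper-parameters of the first group
    whose `selector` regex matches it."""
    assignment = {}
    for name in names:
        for group in groups:
            if re.search(group["selector"], name):
                assignment[name] = group["lr"]
                break
    return assignment

# embedding parameters get lr=1e-5, classifier parameters get lr=1e-3
print(assign_groups(param_names, groups))
```

Because each group is a plain list element, a config file can reference e.g. the second group by its index rather than by a regex-valued key.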