I'm a Machine Learning Engineer, working on cool projects at the intersection of NLP and CV. I finished my PhD in 2011, worked as a Research Scientist at the Research Institute for AI (Romanian Academy) for 7 years, then switched to applied ML as an ML engineer at Sustainalytics (2017-2019) and now at Adobe.
I'm active in open source, especially on Romanian NLP. Throughout the years I've published, taught and coded, all while having fun. I like to build stuff.
Showcase on HuggingFace:
- Named Entity Recognition playground
- Write with transformers, but for Romanian :)
- Romanian Text Corpus (joint project with Mihai Ilie)
- Word Sense Disambiguation Corpus & Models for Romanian (large scale, long running project)
- NLI Corpus for Romanian
- Sentence segmentation for Romanian (because current Romanian tools fail miserably for anything but clean text)
May
Appeared on live TV discussing AI (#1, #2)
Apr
Participated in the WE Smart Diaspora conference in Timisoara, Romania, presenting "The Impact of Large Language Models"
Nov
Released the first T5-base and large pretrained checkpoints for Romanian, trained on TPUv4s from TRC. Invaluable help from Mihai Ilie and Per Egil.
Nov
Organized the first edition of the LiRo NLP Hackathon at Politehnica University of Bucharest, with over 80 participants. Thanks to Viorica Patraucean, Traian Rebedea and all the wonderful volunteers.
Jan
Released roner, a pip-installable custom NER based on RONECv2, providing SOTA results on Romanian.
Aug
Trained and released a monolingual GPT-Neo 780M model, trained on a TPUv3-32 with the help of Mihai Ilie.
Nov
Led the development of the first Named Entity Recognition dataset for Romanian. Currently at version 2.0, it holds 12,330 sentences with over 0.5M tokens, annotated with 15 classes, for a total of 80,283 distinctly annotated entities. Invaluable help from termene.ro.
Aug
Led the development of the first ML leaderboard, named LiRo Benchmark, together with Viorica Patraucean and other amazing RomaniaAI volunteers.
Jun
Proposed and led the development of the Romanian Semantic Textual Similarity dataset. It's a 1:1 high-quality human translation of the English STS dataset.
Apr
Trained and released the first monolingual Romanian BERT model, which became the most used BERT model in Romania, with thousands of monthly downloads.
- RoWordNet pip package providing quick access to the Romanian WordNet. After all these years it's still the only Python plug-and-play package for Romanian - seems to be working well :)
- Developed NLP-Cube with Tiberiu Boros (lead). It started as an entry in the 2018 CoNLL competition and evolved into a multilingual toolkit providing Tokenization, Sentence Segmentation, Lemmatization, POS and DEP parsing, trained on the Universal Dependencies dataset.
Google Scholar profile, h-index: 9
- LiRo: Benchmark and leaderboard for Romanian language tasks, SD Dumitrescu et al., 2021
- The birth of Romanian BERT, SD Dumitrescu, AM Avram, S Pyysalo, 2020
- Introducing RONEC - the Romanian Named Entity Corpus, SD Dumitrescu, AM Avram, 2019
- NLP-Cube: End-to-End Raw Text Processing With Neural Networks, T Boroș, SD Dumitrescu, R Burtica, 2018