[WIP] additionnal lib/cleanup for French language to improve quality of inputs #635

CapitainFlam · 2022-09-13T11:57:04Z

Better French input for better French output?

Hi!
⚠️ First commits on GitHub, first pull request ever in a repo, so please neither shoot nor shout at me 😅 ⚠️

This commit grew out of several discussions, which I suggest you read to understand the context:

Rejet des abbréviations commonvoice-fr#21 (this issue should be partially solved by this pull request)
setup basic FR preprocessing CorporaCreator#87 (this is the commit that inspired me and that I took some ideas from)

and it is somehow connected to :

TL;DR:
I'm trying to improve input quality for the French Sentence Collector, in order to
A/ avoid garbage in, garbage out, and
B/ avoid removing at the very end (in CorporaCreator) garbage that went through collecting, recording and review, only to be dropped and left out of the final release used for training batches. Better to remove it as early as the Sentence Collector.

What have you done? 😱

To do so, I duplicated the EN file as an FR file in the Sentence Collector, and modified it based on earlier work by Nicolas Panel for CorporaCreator (sadly never committed).

As suggested in the discussions (links above... did you follow them?!), I am opening this as a WIP pull request so that everyone can comment, throw tomatoes and/or add commits to it.

Footnote: it can be hard (...it IS hard!!!) to understand regex (REGular EXpressions). Do not hesitate to use https://regex101.com/ to understand and test them.


drzraf · 2022-09-13T14:31:22Z

server/lib/cleanup/languages/fr.js

+	  .replace(/(^|\s)Ie(r)? s.(\s|\.|,|\?|!|$)/g, ' premier siècle ')
+	  .replace(/(^|\s)II(e|è)(me)? s.(\s|\.|,|\?|!|$)/g, ' deuxième siècle ')
+	  .replace(/(^|\s)III(e|è)(me)? s.(\s|\.|,|\?|!|$)/g, ' troisième siècle ')
+	  .replace(/(^|\s)IV(e|è)(me)? s.(\s|\.|,|\?|!|$)/g, ' quatrième siècle ')
+	  .replace(/(^|\s)V(e|è)(me)? s.(\s|\.|,|\?|!|$)/g, ' cinquième siècle ')
+	  .replace(/(^|\s)VI(e|è)(me)? s.(\s|\.|,|\?|!|$)/g, ' sixième siècle ')
+	  .replace(/(^|\s)VII(e|è)(me)? s.(\s|\.|,|\?|!|$)/g, ' septième siècle ')
+	  .replace(/(^|\s)VIII(e|è)(me)? s.(\s|\.|,|\?|!|$)/g, ' huitième siècle ')
+	  .replace(/(^|\s)(VIIII|IX)(e|è)(me)? s.(\s|\.|,|\?|!|$)/g, ' neuvième siècle ')
+	  .replace(/(^|\s)X(e|è)(me)? s.(\s|\.|,|\?|!|$)/g, ' dixième siècle ')
+	  .replace(/(^|\s)XI(e|è)(me)? s.(\s|\.|,|\?|!|$)/g, ' onzième siècle ')
+	  .replace(/(^|\s)XII(e|è)(me)? s.(\s|\.|,|\?|!|$)/g, ' douzième siècle ')
+	  .replace(/(^|\s)XIII(e|è)(me)? s.(\s|\.|,|\?|!|$)/g, ' treizième siècle ')
+	  .replace(/(^|\s)(XIIII|XIV)(e|è)(me)? s.(\s|\.|,|\?|!|$)/g, ' quatorzième siècle ')
+	  .replace(/(^|\s)XV(e|è)(me)? s.(\s|\.|,|\?|!|$)/g, ' quinzième siècle ')
+	  .replace(/(^|\s)XVI(e|è)(me)? s.(\s|\.|,|\?|!|$)/g, ' seizième siècle ')
+	  .replace(/(^|\s)XVII(e|è)(me)? s.(\s|\.|,|\?|!|$)/g, ' dix-septième siècle ')
+	  .replace(/(^|\s)XVIII(e|è)(me)? s.(\s|\.|,|\?|!|$)/g, ' dix-huitième siècle ')
+	  .replace(/(^|\s)(XIX|XVIIII)(e|è)(me)? s.(\s|\.|,|\?|!|$)/g, ' dix-neuvième siècle ')
+	  .replace(/(^|\s)XX(e|è)(me)? s.(\s|\.|,|\?|!|$)/g, ' vingtième siècle ')
+	  .replace(/(^|\s)XXI(e|è)(me)? s.(\s|\.|,|\?|!|$)/g, ' vingt-et-unième siècle ')
+	  .replace(/(^|\s)XXII(e|è)(me)? s.(\s|\.|,|\?|!|$)/g, ' vingt-deuxième siècle ')


You could use a dedicated function like
https://stackoverflow.com/a/9083076 in combination with passing a callback to replace() (https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/replace)

You could drop the test for the s. suffix.

It would be reusable for the non-ordinal case you handled below

Well,

First,

I have my regex to work with:
-- kudos to gregory-r for this Stack Overflow post, licence CC BY-SA

function deromanize(roman) {
  var r = 0;
  // Regular expression to check that the input is a valid Roman numeral.
  if (!/^M*(?:D?C{0,3}|C[MD])(?:L?X{0,3}|X[CL])(?:V?I{0,3}|I[XV])$/.test(roman))
    throw new Error('Invalid Roman Numeral.');
  roman.replace(/[MDLV]|C[MD]?|X[CL]?|I[XV]?/g, function(i) {
    r += {M:1000, CM:900, D:500, CD:400, C:100, XC:90, L:50, XL:40, X:10, IX:9, V:5, IV:4, I:1}[i];
  });
  return r;
}

So I'll put it here so I don't forget it, because:

Second,

I don't have a clue how to implement a function in JS! (I know Pascal and VB ✌️, but never learned JS 😭... I have to jump down the rabbit hole 🕳️ 🐇 (it is this way))

Edit: My feeling about that? "Why JS? Because!" xkcd explained it well.

...For anyone following either the white rabbit or the breadcrumbs, future me included, this link
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/replace#specifying_a_function_as_the_replacement
could be a good place to start.

/me : jumps in the 🕳️
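A minimal sketch of the callback approach suggested above: one regex captures the Roman numeral, a function passed to replace() converts it, and a lookup table supplies the French ordinal. The names ORDINALS_FR and spellCenturies and the exact suffix pattern are assumptions made for illustration; they are not part of this pull request.

```javascript
// Hypothetical lookup table: French ordinals for centuries 1 to 22.
const ORDINALS_FR = [
  'premier', 'deuxième', 'troisième', 'quatrième', 'cinquième', 'sixième',
  'septième', 'huitième', 'neuvième', 'dixième', 'onzième', 'douzième',
  'treizième', 'quatorzième', 'quinzième', 'seizième', 'dix-septième',
  'dix-huitième', 'dix-neuvième', 'vingtième', 'vingt-et-unième',
  'vingt-deuxième',
];

// Sum the value of each Roman token (same tokenizer as the function quoted
// above, without its validity check).
function deromanize(roman) {
  const values = {
    M: 1000, CM: 900, D: 500, CD: 400, C: 100, XC: 90,
    L: 50, XL: 40, X: 10, IX: 9, V: 5, IV: 4, I: 1,
  };
  let total = 0;
  for (const token of roman.match(/[MDLV]|C[MD]?|X[CL]?|I[XV]?/g) || []) {
    total += values[token];
  }
  return total;
}

// Replace "<roman><suffix> s." with "<ordinal> siècle".
// Group 1 keeps the leading boundary; the lookahead leaves whatever follows
// the abbreviation in place. The dot of "s." is treated as part of the
// abbreviation, so a sentence-final "s." loses its period (assumption).
function spellCenturies(sentence) {
  return sentence.replace(
    /(^|\s)([IVX]+)[eè](?:r|me)? s\.(?=\s|[,?!]|$)/g,
    (match, before, roman) => {
      const ordinal = ORDINALS_FR[deromanize(roman) - 1];
      // Numerals outside the table are left untouched.
      return ordinal ? `${before}${ordinal} siècle` : match;
    }
  );
}
```

With this shape, the twenty-two near-identical rules collapse into one pattern, and the same deromanize helper could serve the non-ordinal case.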

Before you jump into that rabbit hole, you might want to check out my other comment regarding the Roman numerals, as I don't think we need this at all. I'll also have a look at your other issue and add the documentation to the README for better understanding.

Indeed, I don't understand why, when collecting sentences, we can NOT "correct and clean" before "validation".

no problem, I'll wait for the issue #636 😸

drzraf · 2022-09-13T14:38:50Z

server/lib/cleanup/languages/fr.js

+      //first, second, etc.
+	  .replace(/(^|\s)1er?s?(\s|\.|,|\?|!|$)/g, ' premier ')
+	  .replace(/(^|\s)1(e|è)res?(\s|\.|,|\?|!|$)/g, ' première ')
+	  .replace(/(^|\s)2(e|è)?me?s?(\s|\.|,|\?|!|$)/g, ' deuxième ')
+	  .replace(/(^|\s)2n?ds?(\s|\.|,|\?|!|$)/g, ' second ')
+	  .replace(/(^|\s)2n?des?(\s|\.|,|\?|!|$)/g, ' seconde ')
+	  .replace(/(^|\s)3i?(e|è)me?s?(\s|\.|,|\?|!|$)/g, ' troisième ')
+	  .replace(/(^|\s)4i?(e|è)me?s?(\s|\.|,|\?|!|$)/g, ' quatrième ')
+	  .replace(/(^|\s)5i?(e|è)me?s?(\s|\.|,|\?|!|$)/g, ' cinquième ')
+	  .replace(/(^|\s)6i?(e|è)me?s?(\s|\.|,|\?|!|$)/g, ' sixième ')
+	  .replace(/(^|\s)7i?(e|è)me?s?(\s|\.|,|\?|!|$)/g, ' septième ')
+	  .replace(/(^|\s)8i?(e|è)me?s?(\s|\.|,|\?|!|$)/g, ' huitième ')
+	  .replace(/(^|\s)9i?(e|è)me?s?(\s|\.|,|\?|!|$)/g, ' neuvième ')
+	  .replace(/(^|\s)10i?(e|è)me?s?(\s|\.|,|\?|!|$)/g, ' dixième ')


It may be worth using a library like this one or this one (these do not actually handle ordinals, but they would be useful for plain numbers).

That may help other languages too.

I actually don't know how to import an external library; I think it's a bit out of my league right now.

drzraf · 2022-09-13T14:40:37Z

server/lib/cleanup/languages/fr.js

+	  .replace(/(^|\s)ANPE(\s|\.|,|\?|!|$)/g, ' Agence Nationale Pour l\'Emploi ') 
+      .replace(/(^|\s)APL(\s|\.|,|\?|!|$)/g, ' Aide personnalisée au logement ')
+      .replace(/(^|\s)CDI(\s|\.|,|\?|!|$)/g, ' Contrat à Durée Indéterminée ')
+      .replace(/(^|\s)CICE(\s|\.|,|\?|!|$)/g, ' Crédit d\'impôt pour la compétitivité et l\'emploi ')
+      .replace(/(^|\s)DRH(\s|\.|,|\?|!|$)/g, ' Direction des Ressources Humaines ')
+      .replace(/(^|\s)EDF(\s|\.|,|\?|!|$)/g, ' Electricité de France ')
+      .replace(/(^|\s)FN(\s|\.|,|\?|!|$)/g, ' Front National ')
+      .replace(/(^|\s)HLM(\s|\.|,|\?|!|$)/g, ' Habitation à Loyer Modéré ')
+      .replace(/(^|\s)IGN(\s|\.|,|\?|!|$)/g, ' Institut Géographique National ')
+      .replace(/(^|\s)INPI(\s|\.|,|\?|!|$)/g, ' Institut National de la Propriété Intellectuelle ')
+      .replace(/(^|\s)ISF(\s|\.|,|\?|!|$)/g, ' Impôt sur la fortune ')
+      .replace(/(^|\s)IUT(\s|\.|,|\?|!|$)/g, ' Institut Universitaire de Technologie ')
+      .replace(/(^|\s)LREM(\s|\.|,|\?|!|$)/g, ' La République En Marche ')
+      .replace(/(^|\s)NUPES(\s|\.|,|\?|!|$)/g, ' Nupès ')
+      .replace(/(^|\s)PHP(\s|\.|,|\?|!|$)/g, ' Protocole Hypertexte Protocolaire ')
+      .replace(/(^|\s)PMA(\s|\.|,|\?|!|$)/g, ' Procréation médicalement assistée ')
+      .replace(/(^|\s)PME(\s|\.|,|\?|!|$)/g, ' Petite et Moyenne Entreprise ')
+      .replace(/(^|\s)RN(\s|\.|,|\?|!|$)/g, ' Rassemblement National ')
+      .replace(/(^|\s)RSA(\s|\.|,|\?|!|$)/g, ' Revenu de Solidarité Active ')
+      .replace(/(^|\s)RSI(\s|\.|,|\?|!|$)/g, ' Régime Social des Indépendants ')
+      .replace(/(^|\s)RTE(\s|\.|,|\?|!|$)/g, ' Réseau de Transport d\'Électricité ')
+      .replace(/(^|\s)SNCF(\s|\.|,|\?|!|$)/g, ' Société Nationale des Chemins de Fer ')
+      .replace(/(^|\s)TGV(\s|\.|,|\?|!|$)/g, ' Train à Grande Vitesse ')
+      .replace(/(^|\s)TVA(\s|\.|,|\?|!|$)/g, ' Taxe sur la Valeur Ajoutée ')
+      .replace(/(^|\s)UDI(\s|\.|,|\?|!|$)/g, ' Union des Démocrates Indépendants ')
+      .replace(/(^|\s)UMP(\s|\.|,|\?|!|$)/g, ' Union pour un Mouvement Populaire ')
+      .replace(/(^|\s)USA(\s|\.|,|\?|!|$)/g, ' Etats Unis d\'Amérique ')


If the last character is a ? or a !, I don't think it should be removed. It conveys intonation and isn't part of the acronym.

Does commit aef4284 address the problem?
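One way to keep the punctuation, sketched under assumptions: capture the leading boundary and check the trailing one with a lookahead, so nothing after the acronym is consumed. ACRONYMS_FR and expandAcronyms are hypothetical names, the table is only a subset of the list in the diff, and this is not necessarily what commit aef4284 does.

```javascript
// Hypothetical subset of the acronym table from the diff above.
const ACRONYMS_FR = new Map([
  ['SNCF', 'Société Nationale des Chemins de Fer'],
  ['TGV', 'Train à Grande Vitesse'],
  ['TVA', 'Taxe sur la Valeur Ajoutée'],
]);

function expandAcronyms(sentence) {
  // Group 1 keeps the leading boundary (start of string or whitespace).
  // The lookahead checks the trailing boundary without consuming it,
  // so "?", "!", "," and "." stay exactly where they were.
  return sentence.replace(
    /(^|\s)([A-Z]{2,})(?=\s|[.,?!]|$)/g,
    (match, before, acronym) =>
      ACRONYMS_FR.has(acronym) ? before + ACRONYMS_FR.get(acronym) : match
  );
}
```

A single table-driven rule also avoids adding a new .replace() line for every acronym; unknown acronyms are returned unchanged.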

drzraf · 2022-09-13T14:41:48Z

server/lib/cleanup/languages/fr.js

+      .replace(/(^|\s)USA(\s|\.|,|\?|!|$)/g, ' Etats Unis d\'Amérique ')
+
+      //replace fraction 1/2 => '1 sur 2'
+      .replace(/(^| )(\d+)(\s)?(\/)(\s)?(\d+)(\s|\.|,|\?|!|$)/g, '$2 sur $6')


Not sure about this one. This could very well be a date: "Il est né le 25/12" should not be replaced this way, and this rule takes priority over the following regexp (date format mm/yy).

Oh shoot! You're right!
To solve it, we should test the numbers: if they fall in a date range, [1..31]/[1..12] (dd/mm), [1..12]/[00..99] or [1..12]/[1700..2030] (mm/yy(yy)?), treat it as a date; otherwise it's a fraction?!
Wow! That escalated quickly!
Sometimes only context can tell: "mettre 1/4 de litre de lait" would become "mettre un avril de litre de lait"... But that is still more readable than "mettre de litre de lait".
On the other hand, if we handle dates poorly (checking only mm/yyyy and not mm/yy), "il est né en 12/88." becomes "il est né en douze sur quatre-vingt-huit." instead of "il est né en décembre quatre-vingt-huit."... Again, that is more readable than "il est né en."

IMHO, the [1..31]/[1..12] (dd/mm) and [1..12]/[1700..2030] (mm/yyyy) ranges should work fine. The rest will go through 'fraction replacement', which will always be better than 'full removal'.

Let's first wait for issue #636 to be resolved, 'coz I don't want to work for nothing :-/
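The range test proposed above could look roughly like this. It is only a sketch: classifySlashToken and replaceFractions are hypothetical names, the ranges are the ones from the IMHO paragraph, and anything classified as a date is simply left alone here.

```javascript
// Classify an "a/b" token using the ranges proposed above.
function classifySlashToken(left, right) {
  const a = Number(left);
  const b = Number(right);
  if (a >= 1 && a <= 31 && b >= 1 && b <= 12) return 'dd/mm';
  if (a >= 1 && a <= 12 && b >= 1700 && b <= 2030) return 'mm/yyyy';
  return 'fraction';
}

// Rewrite only the tokens classified as fractions; date-like tokens are
// returned unchanged so that a later rule can deal with them.
function replaceFractions(sentence) {
  return sentence.replace(
    /(^|\s)(\d+)\s?\/\s?(\d+)(?=\s|[.,?!]|$)/g,
    (match, before, left, right) =>
      classifySlashToken(left, right) === 'fraction'
        ? `${before}${left} sur ${right}`
        : match
  );
}
```

As noted in the comment above, a token such as 1/4 still lands in the dd/mm range, so this heuristic cannot replace context.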

drzraf · 2022-09-13T14:44:06Z

server/lib/cleanup/languages/fr.js

+	  .replace(/(^|\s)\d{1,2}\/\d{1,2}\/(\d{2}[^\d]|\d{4})(\s|$)/g, ' ') //date format dd/mm/yy or dd/mm/yyyy
+	  .replace(/(^|\s)\d{1,2}\/(\d{2}[^\d]|\d{4})(\s|$)/g, ' ') //date format mm/yy or mm/yyyy


Given that replacing a date leaves a blank in the middle of the sentence, I'm in favour of either:

omitting these sentences altogether, or

making a deep replacement (spelling out day, month and year), which may or may not be practicable.
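To give an idea of what the second option involves, here is a rough sketch that spells out dd/mm tokens only. DAYS_FR, MONTHS_FR and spellDayMonth are hypothetical names; years are not handled, day/month combinations are not validated (31/02 would pass), and three-part dates are left untouched.

```javascript
// Hypothetical lookup tables: day of month ("premier" for the 1st) and month names.
const DAYS_FR = [
  'premier', 'deux', 'trois', 'quatre', 'cinq', 'six', 'sept', 'huit', 'neuf',
  'dix', 'onze', 'douze', 'treize', 'quatorze', 'quinze', 'seize', 'dix-sept',
  'dix-huit', 'dix-neuf', 'vingt', 'vingt et un', 'vingt-deux', 'vingt-trois',
  'vingt-quatre', 'vingt-cinq', 'vingt-six', 'vingt-sept', 'vingt-huit',
  'vingt-neuf', 'trente', 'trente et un',
];
const MONTHS_FR = [
  'janvier', 'février', 'mars', 'avril', 'mai', 'juin', 'juillet', 'août',
  'septembre', 'octobre', 'novembre', 'décembre',
];

function spellDayMonth(sentence) {
  return sentence.replace(
    /(^|\s)(\d{1,2})\/(\d{1,2})(?=\s|[.,?!]|$)/g,
    (match, before, day, month) => {
      const dayWord = DAYS_FR[Number(day) - 1];
      const monthWord = MONTHS_FR[Number(month) - 1];
      // Out-of-range values (day 0, day 40, month 13...) are left as they are.
      return dayWord && monthWord ? `${before}${dayWord} ${monthWord}` : match;
    }
  );
}
```

Covering yy and yyyy as well would need a general number-to-words routine, which is where a library would probably pay off.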

drzraf · 2022-09-13T14:46:57Z

server/lib/validation/languages/fr.js

+  error: 'Sentence should not contain numbers - les phrases ne doivent pas contenir de nombres',
+}, {
+  regex: /[<>+*#@%^[\]()/]/,
+  error: 'Sentence should not contain symbols - les phrases ne doivent pas contenir de symboles \(\*, \#, \(, etc\)',


I suggested substituting "–" (long hyphen) with the common ("-") counterpart and testing against it.

commit 18fdf37

drzraf · 2022-09-13T14:49:23Z

server/lib/validation/languages/fr.js

+  // as users wouldn't know how to pronounce the uppercase letters.
+  regex: /[A-Z]{2,}|[A-Z]+\.*[A-Z]+/,
+  error: 'Sentence should not contain abbreviations - Les phrases ne doivent pas contenir des abréviations ou sigles',
+}];


Aren't we missing something about the French apostrophe (´)? I don't know whether the model would benefit from this. I'm pretty sure text processing would normalize it (before TTS, for example), so I believe we should do it too.

commit 1a0cff5

BTW, source files like https://github.com/common-voice/commonvoice-fr/blob/master/CommonVoice-Data/data/debats-assemblee-nationale/20130718093000000.txt use the French apostrophe?!
(It only shows in the "ui-monospace" code font, not in the usual "Segoe UI" font.)

Original, line 2: J’appelle maintenant, dans le texte de la commission, les articles du projet de loi.
If I change it myself, line 2: J'appelle maintenant, dans le texte de la commission, les articles du projet de loi.

so... commit 8d7d973
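For reference, the kind of normalization discussed in this thread and in the long-hyphen thread above can be sketched in a few lines. normalizeTypography is a hypothetical name, and the character list only covers the three characters mentioned in this conversation; the actual commits may differ.

```javascript
function normalizeTypography(sentence) {
  return sentence
    // U+2019 (right single quotation mark) and U+00B4 (acute accent used
    // as an apostrophe) both become the plain ASCII apostrophe.
    .replace(/[\u2019\u00B4]/g, "'")
    // U+2013 (en dash, the "long hyphen") becomes a plain hyphen.
    .replace(/\u2013/g, '-');
}
```

Running this before validation means that the validation patterns only need to know about the ASCII forms.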

drzraf · 2022-09-13T15:01:43Z

server/lib/cleanup/languages/fr.js

+      .replace(/(^|\s)TVA(\s|\.|,|\?|!|$)/g, ' Taxe sur la Valeur Ajoutée ')
+      .replace(/(^|\s)UDI(\s|\.|,|\?|!|$)/g, ' Union des Démocrates Indépendants ')
+      .replace(/(^|\s)UMP(\s|\.|,|\?|!|$)/g, ' Union pour un Mouvement Populaire ')
+      .replace(/(^|\s)USA(\s|\.|,|\?|!|$)/g, ' Etats Unis d\'Amérique ')


Side note: acronyms are a strange case: they are actually spelled out as distinct letters (except for NUPES).

We don't want "STT" to process R.S.A. as "erre et ça", but we currently have nothing to train it that way.

Should we expect a specific dictionary in the future, or just not strip some of them in the first place (but remove the dots)?

MichaelKohler

⚠️ First commits on GitHub, first pull request ever in a repo, so please neither shoot nor shout at me 😅 ⚠️

No worries, you're doing fine :) Thanks for your contributions!

I have a few comments, but generally I think it's a good thing to update these files.

MichaelKohler · 2022-09-13T22:16:42Z

server/lib/cleanup/languages/fr.js

+	  .replace(/(^|\s)(XIX|XVIIII)(\s|\.|,|\?|!|$)/g, ' dix-neuf ')
+	  .replace(/(^|\s)XX(\s|\.|,|\?|!|$)/g, ' vingt ')
+ 	  .replace(/(^|\s)XXI(\s|\.|,|\?|!|$)/g, ' vingt-et-un ')
+ 	  .replace(/(^|\s)XXII(\s|\.|,|\?|!|$)/g, ' vingt-deux ')


I think we can remove all of the roman numerals, as they are not allowed from the beginning (see the abbreviation pattern in the validation file).
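A quick way to see what the abbreviation pattern does with Roman numerals, as a sketch: the regex below is copied from the validation diff quoted earlier, while isRejectedAsAbbreviation is a hypothetical helper. It only exercises this one pattern; other rules in the validation file are not considered.

```javascript
// Abbreviation pattern copied from the validation diff above.
const ABBREVIATION_PATTERN = /[A-Z]{2,}|[A-Z]+\.*[A-Z]+/;

// True when the sentence would be rejected by this pattern alone.
function isRejectedAsAbbreviation(sentence) {
  return ABBREVIATION_PATTERN.test(sentence);
}

// Multi-letter numerals and acronyms contain two or more capitals in a row
// (optionally separated by dots), so they match.
console.log(isRejectedAsAbbreviation('Louis XIV régna longtemps'));
console.log(isRejectedAsAbbreviation('Le R.S.A. est versé chaque mois'));

// A single-letter numeral such as "V" has no second capital next to it,
// so this pattern on its own does not match it.
console.log(isRejectedAsAbbreviation('Au Ve siècle'));
```

So numerals of two letters or more never reach the cleanup step, while one-letter numerals such as V or X are not caught by this particular rule.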

MichaelKohler · 2022-09-13T22:17:18Z

server/lib/cleanup/languages/fr.js

+	  .replace(/(^|\s)7i?(e|è)me?s?(\s|\.|,|\?|!|$)/g, ' septième ')
+	  .replace(/(^|\s)8i?(e|è)me?s?(\s|\.|,|\?|!|$)/g, ' huitième ')
+	  .replace(/(^|\s)9i?(e|è)me?s?(\s|\.|,|\?|!|$)/g, ' neuvième ')
+	  .replace(/(^|\s)10i?(e|è)me?s?(\s|\.|,|\?|!|$)/g, ' dixième ')


These can be removed, as numbers are not allowed anyway in the Sentence Collector.

MichaelKohler · 2022-09-13T22:17:39Z

server/lib/cleanup/languages/fr.js

+      .replace(/(^|\s)TVA(\s|\.|,|\?|!|$)/g, ' Taxe sur la Valeur Ajoutée ')
+      .replace(/(^|\s)UDI(\s|\.|,|\?|!|$)/g, ' Union des Démocrates Indépendants ')
+      .replace(/(^|\s)UMP(\s|\.|,|\?|!|$)/g, ' Union pour un Mouvement Populaire ')
+      .replace(/(^|\s)USA(\s|\.|,|\?|!|$)/g, ' Etats Unis d\'Amérique ')


These can be removed, as abbreviations are not allowed in the validation file.

Well, I think I misunderstood how this works. Let's resolve issue #636 first 😸

MichaelKohler · 2022-09-13T22:18:15Z

server/lib/cleanup/languages/fr.js

+      .replace(/(^|\s)USA(\s|\.|,|\?|!|$)/g, ' Etats Unis d\'Amérique ')
+
+      //replace fraction 1/2 => '1 sur 2'
+      .replace(/(^| )(\d+)(\s)?(\/)(\s)?(\d+)(\s|\.|,|\?|!|$)/g, '$2 sur $6')


Numbers are not allowed by the validation file.

Shouldn't we keep it for "future-proof rules" and/or "bulk load rules if you don't have yours"?
Let's discuss it here: https://discourse.mozilla.org/t/sentence-collector-cleanup-before-export-vs-cleanup-on-upload/105411/15

MichaelKohler · 2022-09-13T22:18:24Z

server/lib/cleanup/languages/fr.js

+	  //dates, digits and numbers fr-FR cleanup
+	  //todo : CONVERT TO TEXT instead of removing it
+	  .replace(/(^|\s)\d{1,2}\/\d{1,2}\/(\d{2}[^\d]|\d{4})(\s|$)/g, ' ') //date format dd/mm/yy or dd/mm/yyyy
+	  .replace(/(^|\s)\d{1,2}\/(\d{2}[^\d]|\d{4})(\s|$)/g, ' ') //date format mm/yy or mm/yyyy


Numbers are not allowed by the validation file.

server/lib/validation/languages/fr.js

MichaelKohler · 2022-09-13T22:39:23Z

Also note https://discourse.mozilla.org/t/sentence-collector-cleanup-before-export-vs-cleanup-on-upload/105411, though that probably doesn't matter too much for this here :)


drzraf · 2023-03-15T22:18:34Z

Any hope to get this in, in one shape or another?

MichaelKohler · 2023-05-10T21:07:57Z

The Sentence Collector has now been integrated into the Common Voice platform. Therefore I'm archiving this project here. The validation files now live here and I'm sure it would still benefit from the validation rules being added there: https://github.com/common-voice/common-voice/tree/main/server/src/core/sentences/validation. Unfortunately moving this PR over there is way harder than manually recreating it. Would you mind creating a new PR for this? Sorry for the troubles.

CapitainFlam and others added 2 commits September 6, 2022 22:54

adding FR rules in languages and index.js

8128068

adding many FR rules, transcoding acronyms, and removing numbers

db15ee1

...some of my first commits, and my first steps in JS. reviews are welcome. (...please neither shoot or shout at me ;-D ) Co-Authored-By: Nicolas Panel <2500584+nicolaspanel@users.noreply.github.com>

CapitainFlam changed the title ~~[WIP]~~ [WIP] additionnal lib/cleanup for French language to improve quality of inputs Sep 13, 2022

CapitainFlam added 3 commits September 13, 2022 13:59

Update fr.js

9e0d82f

Update fr.js

58430bb

roman + century numerals, roman numerals, full convertion of all ACRONYMES

Update fr.js

13cbb55

eleventh, ... first, second,... 1/2 => 1 sur 2

drzraf reviewed Sep 13, 2022

MichaelKohler suggested changes Sep 13, 2022

MichaelKohler reviewed Sep 13, 2022

Update server/lib/cleanup/languages/fr.js

2abf67c

of course :) Co-authored-by: Michael Kohler <me@michaelkohler.info>

CapitainFlam mentioned this pull request Sep 14, 2022

Improving the introduction README file with very simple TLDR #636

Closed

CapitainFlam added 6 commits September 15, 2022 12:34

Update fr.js

12fcb1e

Notepad++ > Edit > transform TABS in SPACE. Yeahhhhhh

Update fr.js

4a2ea4e

as requested, only keeping FR and removing EN error description... But adding EN comment, for ^FR readers sake.

Update fr.js

aef4284

We do not remove PUNCTUATION (?!, etc.) after finding them, we just put them back as we found them.

Update fr.js

18fdf37

remplacing – (long hyphen) by - (short hyphen)

Update fr.js

1a0cff5

Normalize ´ (french apostroph) into ' (usual apostroph)

Update fr.js

8d7d973

adding an other apostroph found in original source document as https://github.com/common-voice/commonvoice-fr/blob/master/CommonVoice-Data/data/debats-assemblee-nationale/20130718093000000.txt

MichaelKohler closed this May 10, 2023
