Request : Apply Lemma / stemming in FeaturizeText options #5281

ErwanL08 · 2020-07-03T10:15:14Z

Hi,
First, thank you for all the work done. I know that FeaturizeText applies NLP preprocessing such as stop-word removal for a specific language: [screenshot omitted]

But is there a way to apply lemmatization / stemming in this function?

antoniovs1029 · 2020-07-06T19:32:27Z

Hi, @ErwanL08. Unfortunately, there's no option for doing lemmatization or stemming in ML.NET, so I will mark this issue as a feature request so that we can take it into account when planning future features.

In the meantime, there are a couple of options you can explore:

1. Apply lemmatization/stemming before creating the input DataView. I notice in your screenshot that you're using LoadFromEnumerable<>() to get your data into a DataView. If possible, you can try to lemmatize/stem the strings in your input "Utterance" string field before creating the DataView. I'm not able to recommend any C# library for this, but a quick Google search points to some NLP-related NuGet packages which may have this functionality. I've also found some open-source implementations of basic English stemming in C#, which you might be able to add to your project without installing any NuGet package.
2. Apply lemmatization/stemming inside a CustomMappingTransformer. A CustomMappingTransformer lets the user define a method that will be used to apply transformations to every row of the input; this function will be applied in a streaming fashion. You can create a function that does lemmatization/stemming (either using your own implementation or another library), and use it inside a CustomMappingTransformer. See more about this transformer in the docs.
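
To make the two workarounds above concrete, here is a minimal sketch of a helper that either of them could call. `ToyStemmer`, `Stem`, and `StemText` are hypothetical names invented for this illustration, and the suffix stripping is deliberately naive: it is not the Porter algorithm and not a lemmatizer, only a stand-in for whatever library or implementation you choose. The `Input`/`Output` types mentioned in the closing comment are also assumptions.

```csharp
using System;
using System.Linq;

// Hypothetical helper for illustration only: a naive suffix stripper,
// NOT a real stemming algorithm such as Porter.
public static class ToyStemmer
{
    // Suffixes are tried in this order; the first match wins.
    private static readonly string[] Suffixes = { "ing", "es", "s" };

    public static string Stem(string token)
    {
        var word = token.ToLowerInvariant();
        foreach (var suffix in Suffixes)
        {
            // Keep at least three characters so short words such as "is" survive.
            if (word.Length - suffix.Length >= 3 &&
                word.EndsWith(suffix, StringComparison.Ordinal))
            {
                return word.Substring(0, word.Length - suffix.Length);
            }
        }
        return word;
    }

    // Splits on whitespace, stems every token, and rejoins with single spaces.
    public static string StemText(string text) =>
        string.Join(" ", text
            .Split(new[] { ' ', '\t', '\n', '\r' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(Stem));
}

// Option 1 (before the DataView exists): call ToyStemmer.StemText on each
// "Utterance" string, then pass the rewritten objects to LoadFromEnumerable.
//
// Option 2 (inside the pipeline), assuming hypothetical Input/Output classes
// with string properties Utterance and Stemmed:
//   mlContext.Transforms.CustomMapping<Input, Output>(
//       (src, dst) => dst.Stemmed = ToyStemmer.StemText(src.Utterance),
//       contractName: null)
```

Option 1 keeps the pipeline unchanged but means the same preprocessing has to be repeated by hand at prediction time; option 2 keeps the step inside the pipeline so it runs on every row automatically.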

AniaBerthelot · 2020-11-26T12:47:39Z

This feature is very important, and I'm impatient to see it inside the awesome ML.NET. Also, NLP is essential today; I hope serious attention will be given to it.

justinormont · 2020-12-05T01:00:54Z

I agree, there should be a direct lemmatizer/stemmer.

The default in the FeaturizeText transform uses unigrams (one word) + bigrams (two words) + tricharactergrams (three letter ngram).

The default tricharactergrams gives a good part of the gains of a full stemmer.

For example, it will extract the same tricharactergram r|u|n from runner/running/runs. This allows the model to learn the common concept of "run" from all of these, and with the ngrams it maintains the original unstemmed words, allowing the model to also learn running (unigram) and i|n|g (tricharactergram).
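
The overlap described above can be checked with a few lines of plain C#. This is only an illustration of the idea: `CharGrams`, `Of`, and `Shared` are hypothetical names, and the sketch works on isolated lowercase words, whereas ML.NET's own character n-gram extraction runs over the featurized text and may differ in its details.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for illustration; not part of ML.NET.
public static class CharGrams
{
    // Returns the distinct character n-grams of a single word (n = 3 by default).
    public static HashSet<string> Of(string word, int n = 3)
    {
        var grams = new HashSet<string>();
        for (int i = 0; i + n <= word.Length; i++)
        {
            grams.Add(word.Substring(i, n));
        }
        return grams;
    }

    // Returns the n-grams that occur in every one of the given words.
    public static HashSet<string> Shared(IEnumerable<string> words, int n = 3)
    {
        var sets = words.Select(w => Of(w, n)).ToList();
        if (sets.Count == 0)
        {
            return new HashSet<string>();
        }

        var shared = new HashSet<string>(sets[0]);
        foreach (var set in sets.Skip(1))
        {
            shared.IntersectWith(set);
        }
        return shared;
    }
}
```

Calling `CharGrams.Shared(new[] { "runner", "running", "runs" })` yields the single trigram `run`, while `CharGrams.Of("running")` still contains `ing`, which matches the point that the shared stem and the suffix are both available to the model as features.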

The word embedding transform can also help. The fastTextWikipedia300D model in particular has a large vocabulary and already has word vectors for runner/running/runs, which will sit in similar positions in the embedding space.

All this said, the world is moving towards transformer networks like BERT.
There's an external BERT implementation for ML.NET -- https://github.com/GerjanVlot/BERT-ML.NET by @GerjanVlot.

ErwanL08 · 2021-01-19T13:40:39Z

I totally agree with @AniaBerthelot: if ML.NET had an up-to-date .NET stemmer / lemmatizer, the framework would be so awesome 👍

WhitWaldo · 2022-09-12T23:00:35Z

I would also like to see lemmatization support built into ML.NET.

michaelgsharp · 2022-10-10T19:53:57Z

@luisquintanilla for prioritization.

AlbelTec · 2022-12-04T13:42:27Z

Hi, @luisquintanilla. Is there any chance of getting this feature in the near future? Currently I rely on spaCy (Python) for text preprocessing, but for my current C# project I really need to stick with ML.NET to avoid depending on libraries like python.net.
So, for now, my project is on hold until I figure out the best solution. Any insights?

luisquintanilla · 2022-12-05T14:43:43Z

Hi @AlbelTec

Thanks for your question. Our current NLP solutions are focused on deep learning, Text Classification and Sentence Similarity being a few examples. As a result, there are no immediate plans to work on lemmatization / stemming at this time.

That being said, would you be willing to share your use case and scenario? As we get more feedback on the topic we can think about where this fits in our future roadmap.

In the meantime, I would take a look at Antonio's comment above as a potential workaround.

AlbelTec · 2022-12-11T16:57:58Z

Hi @luisquintanilla,
I followed Antonio's comment and I'm getting fair results with lemmatization. Basically, I imported the NuGet package (LemmaSharp) into my project and created a little function that returns the lemmatized text.

// Lemmatization (https://github.com/hc-ro/LemmaGenerator-std)
// Needs System.IO and System.Linq in addition to the LemmaSharp package.
private string Lemmatization(string text, string language)
{
    // Tokenize is a helper defined elsewhere in the project (not shown here).
    var tokens = Tokenize(text);

    //string dataLemmaFile = <path to lemmas file> + "\\full7z-multext-" + language + ".lem";
    string dataLemmaFile = <path to lemmas file> + "\\full7z-mlteast-" + language + ".lem";

    // Dispose the file stream once lemmatization is done.
    using (var stream = File.OpenRead(dataLemmaFile))
    {
        var lemmatizer = new Lemmatizer(stream);

        // Join with single spaces so the result carries no trailing space.
        return string.Join(" ", tokens.Select(token => lemmatizer.Lemmatize(token)));
    }
}

maryamariyan · 2023-12-01T17:27:29Z

I have been experimenting on this as well.

It seems like this could be a good addition to ML.NET, since there are a lot of upvotes for this feature and it would be convenient for .NET developers to have lemmatization built in.

frank-dong-ms-zz added the P3 Doc bugs, questions, minor issues, etc. label Jul 6, 2020

antoniovs1029 added enhancement New feature or request P2 Priority of the issue for triage purpose: Needs to be fixed at some point. and removed P3 Doc bugs, questions, minor issues, etc. labels Jul 6, 2020

antoniovs1029 self-assigned this Jul 6, 2020

antoniovs1029 removed their assignment Jul 6, 2020

michaelgsharp added this to the ML.NET Future milestone Oct 10, 2022
