Request : Apply Lemma / stemming in FeaturizeText options #5281

Open
ErwanL08 opened this issue Jul 3, 2020 · 10 comments
Labels
enhancement New feature or request P2 Priority of the issue for triage purpose: Needs to be fixed at some point.

Comments

@ErwanL08

ErwanL08 commented Jul 3, 2020

Hi
First, thank you for all the work done. I know that FeaturizeText applies NLP preprocessing such as stop-word removal for a specific language:
[screenshot]

But is there a way to apply lemmatization / stemming in this function?

@frank-dong-ms-zz frank-dong-ms-zz added the P3 Doc bugs, questions, minor issues, etc. label Jul 6, 2020
@antoniovs1029 antoniovs1029 added enhancement New feature or request P2 Priority of the issue for triage purpose: Needs to be fixed at some point. and removed P3 Doc bugs, questions, minor issues, etc. labels Jul 6, 2020
@antoniovs1029 antoniovs1029 self-assigned this Jul 6, 2020
@antoniovs1029
Member

antoniovs1029 commented Jul 6, 2020

Hi, @ErwanL08 . Unfortunately, there's no option for doing lemmatization or stemming in ML.NET, so I will mark this issue as a feature request so that we can take it into account when planning future features.

In the meantime, there are a couple of options you can explore:

  1. Apply lemmatization/stemming before creating the input DataView. I notice in your screenshot that you're using LoadFromEnumerable<>() to get your data into a DataView. If possible, you can try to lemmatize/stem the strings in your input "Utterance" string field before creating the DataView. I'm not able to recommend any C# library for this, but a quick Google search points to some NLP-related NuGet packages which may have this functionality... I've also found some open-source implementations of basic English stemming in C#, which you might be able to add to your project without installing any NuGet package.

  2. Apply lemmatization/stemming inside a CustomMappingTransformer. A CustomMappingTransformer lets the user define a method that will be used to apply transformations to every row of the input; this function will be applied in a streaming fashion. You can create a function that does lemmatization/stemming (either using your own implementation or another library), and use it inside a CustomMappingTransformer. See more about this transformer in the docs.
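
A minimal sketch of option 2, assuming the Microsoft.ML NuGet package is referenced. The class names (UtteranceInput, UtteranceStemmed, StemmingPipeline), the "StemmedUtterance" column, and the NaiveStem suffix stripper are hypothetical placeholders invented for illustration; NaiveStem is not a real stemmer and should be swapped for a proper implementation or library.

```csharp
using System;
using Microsoft.ML;

// Hypothetical input row; "Utterance" matches the field name mentioned above.
public class UtteranceInput
{
    public string Utterance { get; set; }
}

// Hypothetical output row holding the stemmed text.
public class UtteranceStemmed
{
    public string StemmedUtterance { get; set; }
}

public static class StemmingPipeline
{
    // Placeholder only: strips a few English suffixes. NOT a real stemmer.
    public static string NaiveStem(string word)
    {
        foreach (var suffix in new[] { "ing", "ers", "er", "s" })
        {
            if (word.Length > suffix.Length + 2 &&
                word.EndsWith(suffix, StringComparison.Ordinal))
            {
                return word.Substring(0, word.Length - suffix.Length);
            }
        }
        return word;
    }

    // Lower-cases the text, splits on spaces, and stems each word.
    public static string StemText(string text)
    {
        var words = (text ?? string.Empty)
            .ToLowerInvariant()
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        return string.Join(" ", Array.ConvertAll(words, NaiveStem));
    }

    // Custom mapping (applied row by row) followed by FeaturizeText.
    public static IEstimator<ITransformer> Build(MLContext mlContext)
    {
        Action<UtteranceInput, UtteranceStemmed> mapping =
            (input, output) => output.StemmedUtterance = StemText(input.Utterance);

        // contractName: null means the fitted model cannot be saved;
        // supply a contract name if the model needs to be persisted.
        return mlContext.Transforms
            .CustomMapping(mapping, contractName: null)
            .Append(mlContext.Transforms.Text.FeaturizeText("Features", "StemmedUtterance"));
    }
}
```

The estimator returned by Build can then be fitted on a DataView created with LoadFromEnumerable over UtteranceInput rows. The same StemText function could also be called on the strings before creating the DataView, which is option 1.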

@antoniovs1029 antoniovs1029 removed their assignment Jul 6, 2020
@AniaBerthelot

This feature is very important, and I'm impatient to see it inside the awesome ML.NET. NLP is essential today, so I hope it will be given serious attention.

@justinormont
Contributor

I agree, there should be a direct lemmatizer/stemmer.

The default in the FeaturizeText transform uses unigrams (one word) + bigrams (two words) + tricharactergrams (three letter ngram).

The default tricharactergrams give a good part of the gains of a full stemmer.

For example, it will extract the same tricharactergram r|u|n from runner/running/runs. This allows the model to learn the common concept of "run" from all of these, and with the ngrams it maintains the original unstemmed words, allowing the model to also learn running (unigram) and i|n|g (tricharactergram).
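
To make the runner/running/runs example concrete, here is a small standalone sketch that extracts three-character n-grams from each word and intersects them. It is a simplified illustration and not ML.NET's internal implementation; the CharTrigrams class and its methods are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CharTrigrams
{
    // Returns the set of all contiguous 3-character substrings of a word.
    public static HashSet<string> Extract(string word)
    {
        var grams = new HashSet<string>();
        for (int i = 0; i + 3 <= word.Length; i++)
        {
            grams.Add(word.Substring(i, 3));
        }
        return grams;
    }

    // Returns the trigrams that appear in every one of the given words.
    public static HashSet<string> Shared(params string[] words)
    {
        var shared = Extract(words[0]);
        foreach (var word in words.Skip(1))
        {
            shared.IntersectWith(Extract(word));
        }
        return shared;
    }

    public static void Main()
    {
        var shared = Shared("runner", "running", "runs");
        Console.WriteLine(string.Join(", ", shared)); // prints "run"
    }
}
```

The only trigram common to all three words is "run", which is the signal a stemmer would otherwise provide, while trigrams such as "ing" remain available as separate features.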

The word embedding transform can also help. The fastTextWikipedia300D model in particular has a large vocabulary and already has word vectors for runner/running/runs, which will sit in similar positions in the embedding space.

All this said, the world is moving towards transformer networks like BERT.
There's an external BERT implementation for ML.NET -- https://github.com/GerjanVlot/BERT-ML.NET by @GerjanVlot.

@ErwanL08
Author

I totally agree with @AniaBerthelot. If ML.NET can have an up-to-date .NET version of a stemmer / lemmatizer, the framework will be so awesome 👍

@WhitWaldo

I would also like to see lemmatization support built into ML.NET.

@michaelgsharp michaelgsharp added this to the ML.NET Future milestone Oct 10, 2022
@michaelgsharp
Member

@luisquintanilla for prioritization.

@AlbelTec

AlbelTec commented Dec 4, 2022

Hi, @luisquintanilla. Is there any chance of getting this feature in the near future? For text data preprocessing I currently rely on spaCy (Python), and for my current C# project I really need to stick with ML.NET to avoid dependencies on libraries like python.net.
So, for now my project is on hold until I figure out the best solution. Any insight?

@luisquintanilla
Contributor

Hi @AlbelTec

Thanks for your question. Our current NLP solutions are focused on deep learning, Text Classification and Sentence Similarity being a few examples. As a result, there are no immediate plans to work on lemmatization / stemming at this time.

That being said, would you be willing to share your use case and scenario? As we get more feedback on the topic we can think about where this fits in our future roadmap.

In the meantime, I would take a look at Antonio's comment above as a potential workaround.

@AlbelTec

Hi @luisquintanilla,
I followed Antonio's comment and I'm getting fair results regarding lemmatization. Basically, I imported the NuGet package (LemmaSharp) into my project and created a little function that returns the lemmatized text.

// Lemmatization (https://github.com/hc-ro/LemmaGenerator-std)

        // Tokenize(...) is a helper from the same project (not shown here).
        private string Lemmatization(string text, string language)
        {
            var tokens = Tokenize(text);
            var sb = new System.Text.StringBuilder();

            //string dataLemmaFile = <path to lemmas file> + "\\full7z-multext-" + language + ".lem"; 
            string dataLemmaFile = <path to lemmas file>+ "\\full7z-mlteast-" + language + ".lem";

            // Dispose the file stream once the lemmatizer has been loaded from it.
            using (var stream = File.OpenRead(dataLemmaFile))
            {
                var lemmatizer = new Lemmatizer(stream);
                foreach (var token in tokens)
                {
                    // Append each lemma followed by a space separator.
                    sb.Append(lemmatizer.Lemmatize(token));
                    sb.Append(" ");
                }
            }
            return sb.ToString();
        }

@maryamariyan
Member

I have been experimenting on this as well.

Seems like this could be a good addition to ML.NET, since there are a lot of upvotes for this feature and it would be convenient for .NET developers to use a built-in feature for lemmatization.
