UnitaryHACK: Replace unknown words in diagrams with `UNK` token #84

le-big-mac · 2023-05-04T14:38:11Z

Task description

One of the most common challenges in NLP is the handling of unknown, or out-of-vocabulary (OOV), words. The term refers to words that may appear during evaluation and testing but were not present in the model's training data. A common technique for handling unknown words is to introduce a special token UNK. In the simplest case, this works as follows:

1. Replace every rare word in the training data (e.g. every word that occurs fewer than a specified threshold number of times, for example 3) with a special token UNK.
2. During training, learn a representation for UNK as if it were any other token.
3. During evaluation, when you encounter an unknown word, use the representation of UNK instead.
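The three steps above can be sketched at the sentence level in plain Python. This is only an illustrative sketch: the function names `build_vocab` and `replace_unknown`, the whitespace tokenisation, and the toy sentences are assumptions made for the example and are not part of lambeq.

```python
from collections import Counter

UNK = "UNK"

def build_vocab(train_sentences, min_count=3):
    """Return the set of words occurring at least `min_count` times.

    Hypothetical helper; tokenisation is plain whitespace splitting.
    """
    counts = Counter(word for sent in train_sentences for word in sent.split())
    return {word for word, n in counts.items() if n >= min_count}

def replace_unknown(sentence, vocab):
    """Replace every word that is not in `vocab` with the UNK token."""
    return " ".join(w if w in vocab else UNK for w in sentence.split())

# Toy training data (assumed for illustration).
train = ["the cat sleeps", "the dog sleeps", "the cat runs", "the bird sleeps"]

# Step 1: rare training words (fewer than 3 occurrences) become UNK.
vocab = build_vocab(train, min_count=3)
train_unk = [replace_unknown(s, vocab) for s in train]

# Step 3: at evaluation time, words outside the vocabulary become UNK.
test_unk = replace_unknown("the fish sleeps", vocab)
```

Step 2 (learning a representation for UNK) needs no special code here, since after replacement UNK is an ordinary token in the training data.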

However, in the syntax-based models of lambeq (such as DisCoCat) this method would not work, for two reasons:

1. The sentences are used as input to a parser, which has no way to recognise the special token UNK and assign the proper part of speech to it in each case.
2. For the produced diagram to be valid, each part of speech (in lambeq, defined by the pregroup type of each word) needs a different UNK token.

This task is about adding a feature to lambeq that handles unknown words. For the reasons explained above, in lambeq the unknown words need to be replaced after the diagrams have been generated. You will have to learn how to construct DisCoPy diagrams in lambeq and manipulate them with functors. The overall goal of this task is to develop a DisCoPy functor (much like the way a lambeq RewriteRule is implemented) that takes a list of unknown words to be replaced with UNK and that, when passed a diagram, replaces every box containing an unknown word with an UNK box of the same pregroup type.
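To make the intended behaviour concrete, here is a minimal sketch that deliberately avoids the real DisCoPy and lambeq APIs. A diagram is modelled as a plain list of word boxes, and `WordBox` and `replace_unknown_boxes` are hypothetical names invented for this example. In an actual solution the same per-box mapping would be supplied to a DisCoPy functor, and non-word boxes such as cups would be left untouched.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class WordBox:
    """Hypothetical stand-in for a word box: a name plus a pregroup type."""
    name: str
    pregroup_type: str

def replace_unknown_boxes(diagram, unknown_words, unk_token="UNK"):
    """Return a copy of `diagram` in which every box whose name is in
    `unknown_words` is replaced by an UNK box of the same pregroup type."""
    unknown = set(unknown_words)
    return [
        WordBox(unk_token, box.pregroup_type) if box.name in unknown else box
        for box in diagram
    ]

# Toy "diagram" for a transitive sentence (types written as plain strings).
diagram = [
    WordBox("Alice", "n"),
    WordBox("groks", "n.r @ s @ n.l"),
    WordBox("Bob", "n"),
]

rewritten = replace_unknown_boxes(diagram, ["groks", "Bob"])
```

Because the replacement keeps the pregroup type, the noun UNK and the transitive-verb UNK remain distinct boxes, and the wiring of the rest of the diagram stays valid.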

Notes

- lambeq also contains compositional schemes that do not support syntax, such as the CupsReader and StairsReader. For these cases, the simple algorithm described above would suffice, and it could be applied directly at the sentence level. For this task, however, we are interested in handling unknown words for the syntax-based models of lambeq, such as DisCoCat and TreeReader.
- In lambeq's pipeline, the replacement should take place after the generation of the string diagrams and before the application of any rewrite rule or ansatz.
- If you are not familiar with functors, a less preferred way to implement this task is to write a function that processes a list of diagrams in a simple imperative way.

Resources

Some useful resources for this task can be found below:


WingCode · 2023-05-26T10:50:37Z

@le-big-mac I would like to take a stab at this issue. Could you assign it to me?

le-big-mac · 2023-05-26T11:17:13Z

Hi @WingCode, we're very glad you've taken an interest in this issue! The way the unitaryHack bounty tracking works means that we'll assign this issue to you if you are the first one to open a PR that solves it. You can work on it without being assigned and open a PR on this repo, after which we'll assign the issue to you and close it if it solves the problem!

dimkart · 2023-05-26T11:21:26Z

@WingCode Note that more than one user can work on the same issue, in which case the maintainers decide which solution is the best (or they, that is we, can even split the bounty).

mithunpaul08 · 2023-05-26T19:54:29Z

@dimkart Why not use a trained FFNN/MLP that learns a mapping from new/unknown words to their FastText equivalents, as @nikhilkhatri suggested in his Master's thesis? I am using it, and it's brilliant.

dimkart · 2023-05-27T07:49:10Z

@mithunpaul08

> @dimkart Why not use a trained FFNN/MLP that learns a mapping from new/unknown words to their FastText equivalents, as @nikhilkhatri suggested in his Master's thesis? I am using it, and it's brilliant.

You are right, that is much preferable, but it would be too much for this hackathon. We didn't want to add any tasks that involve real experiments.

ACE07-Sev · 2023-05-28T14:56:26Z

Greetings,

I have code prepared for exactly that, but it is a function I defined, not an instance of a RewriteRule class. The reason is to let the user apply it to the diagrams with respect to the dataset they are using. From reading the source code, my understanding (just what I understand for now, not saying it's impossible, hehe) is that the other rewrite rules, especially the determiner and punctuation ones, are possible because the words they apply to are defined beforehand, whereas the UNK words change with each dataset.

Proof of work: [images omitted: original diagram, then rewritten diagram]

There are two functions: one applies UNK rewriting to the training data, replacing words that occur fewer than some threshold number of times; the other applies it to a test sentence, which requires checking against the entire vocabulary of words the model has seen before.

Shall I make my PR in the form of a Jupyter notebook demonstrating the approach?

ACE07-Sev · 2023-05-28T15:48:05Z

I finished my Jupyter notebook (with irrelevant details such as other tokenizers and other ansatzes removed). The link is below; based on feedback, I'll make a PR if requested.

My understanding of the problem :
"The overall goal of this task is to develop a DisCoPy functor that takes a list of unknown words to be replaced with UNK, and that, when passed a diagram, replaces all the boxes containing an unknown word with an UNK box corresponding to the same pregroup type."

So in my function, I define a DisCoPy functor, given the unknown words and the desired mode (low-occurrence words or never-seen-before words), and then apply the functor to diagrams to rewrite them. I think this should be OK; my only hesitation at the moment is that it does not follow the same pattern as the other rewrite rules, which I'll work on now.

https://github.com/ACE07-Sev/Quantum-Natural-Language-Processing-with-Lambeq/blob/main/QNLP-UNK.ipynb

ACE07-Sev · 2023-05-28T16:15:07Z

By the way @mithunpaul08, I'd love to help you with implementing that for Lambeq.

dimkart · 2023-05-28T16:30:03Z

@ACE07-Sev Hi, unfortunately we cannot review code that is not part of a PR in this repository. So if you want to participate, you will have to open a proper PR here. Note, though, that we are not asking for a notebook, but for a functor, rewrite rule, or method that is available from lambeq's public interface. If in the end more than one PR is open for the same issue, we will select the solution we consider the best (or split the bounty, as mentioned above).

ACE07-Sev · 2023-05-28T18:02:08Z

@dimkart dear, I have made the PR. I wrote two versions of the code: one is the version I opened the PR for; the other follows the same structure and pattern as the other classes, much as you would write it, but I couldn't test it to see if it works, so I opened the PR for the one I was able to test.

I don't like my current PR, precisely because it's not a class. I am certain the idea is correct, but there is a syntax error somewhere that I can't find, hehe. I'll try to see if I can fix that as well.

ACE07-Sev · 2023-05-30T06:49:45Z

@dimkart dear, I have made the PR with the class format. I added it to the Rewriter class as one of the _available_rules; to use it, we simply pass the words and apply it to the diagram.

dimkart · 2023-06-19T10:34:44Z

This is now completed. Thank you all for your work!

le-big-mac added the unitaryHACK Tasks for unitaryHACK 2023 label May 4, 2023

Thommy257 removed the unitaryHACK Tasks for unitaryHACK 2023 label May 5, 2023

Thommy257 changed the title ~~unitaryHACK - Replace unknown words in diagrams with UNK token~~ Replace unknown words in diagrams with UNK token May 5, 2023

dimkart added the enhancement New feature or request label May 25, 2023

Thommy257 added the unitaryHACK Tasks for unitaryHACK 2023 label May 26, 2023

Thommy257 changed the title ~~Replace unknown words in diagrams with UNK token~~ UnitaryHACK: Replace unknown words in diagrams with UNK token May 26, 2023

WingCode mentioned this issue May 27, 2023

[UnitaryHack] Intial commit unknown words rewrite rule #94

Closed

This was referenced May 29, 2023

Unitary Fund 2023 : UNK Tokenizer #95

Closed

UNK tokenizer class #96

Closed

nikhilkhatri mentioned this issue Jun 13, 2023

[UnitaryHack] Make ansatz class a training hyperparameter #101

Closed

WingCode mentioned this issue Jun 13, 2023

[UnitaryHack] HandleUnknownWords modules #105

Closed

le-big-mac assigned WingCode and ACE07-Sev Jun 14, 2023

dimkart closed this as completed Jun 19, 2023
