UnitaryHACK: Replace unknown words in diagrams with UNK token #84

Closed
le-big-mac opened this issue May 4, 2023 · 12 comments
Labels: enhancement (New feature or request), unitaryHACK (Tasks for unitaryHACK 2023)

Comments

@le-big-mac
Collaborator

le-big-mac commented May 4, 2023

Task description

One of the most common challenges in NLP is the handling of unknown words, or out-of-vocabulary (OOV) words. The term refers to words that appear during evaluation and testing but were not present in the model's training data. A common technique for handling unknown words is to introduce a special token UNK. In the simplest case, this works as follows:

  1. Replace every rare word in the training data (e.g. every word that occurs fewer times than a specified threshold, for example 3) with a special token UNK.
  2. During training, learn a representation for UNK as if it were any other token.
  3. During evaluation, when you meet an unknown word, use the representation of UNK instead.
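
The three steps above can be sketched at the sentence level in plain Python. This is only a minimal illustration, assuming sentences are already tokenised into lists of words; the helper names `build_vocabulary` and `replace_unknown` are hypothetical and not part of lambeq.

```python
from collections import Counter

UNK = "UNK"

def build_vocabulary(sentences, min_count=3):
    """Return the set of words that occur at least `min_count` times."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word for word, n in counts.items() if n >= min_count}

def replace_unknown(sentence, vocabulary):
    """Replace every word outside `vocabulary` with the UNK token."""
    return [word if word in vocabulary else UNK for word in sentence]

# Toy training data (assumed, for illustration only).
train = [
    ["alice", "likes", "bob"],
    ["alice", "likes", "carol"],
    ["alice", "likes", "dave"],
]

# Step 1: rare training words become UNK.
vocabulary = build_vocabulary(train, min_count=3)
train_unk = [replace_unknown(s, vocabulary) for s in train]

# Step 3: at evaluation time, unseen words are mapped to UNK as well.
test_unk = replace_unknown(["erin", "likes", "alice"], vocabulary)
```

Step 2 (learning a representation for UNK) needs no special handling here, because after the replacement UNK is just another token in the training data.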

However, in the syntax-based models of lambeq (such as DisCoCat) this method would not work, for two reasons:

  • The sentences are used as input to a parser, which has no way to recognise the special token UNK and assign to it the proper part-of-speech in each case.
  • In order for the produced diagram to be valid, each part of speech (in lambeq, defined by the pregroup type of each word) needs a different UNK token.

This task is about adding a feature to lambeq that handles unknown words. For the reasons explained above, in lambeq the unknown words need to be replaced after the diagrams have been generated. You will have to learn how to construct DisCoPy diagrams in lambeq and manipulate them with functors. The overall goal of this task is to develop a DisCoPy functor (implemented much like a lambeq RewriteRule) that takes a list of unknown words to be replaced with UNK and, when passed a diagram, replaces every box containing an unknown word with an UNK box of the same pregroup type.
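
To make the intended behaviour concrete, here is a rough sketch that models a diagram as a plain list of word boxes. It deliberately does not use the lambeq or DisCoPy API: `WordBox`, `UnknownWordReplacer`, and the string encoding of pregroup types are all stand-ins invented for illustration, and a real diagram also contains cups and wires that such a functor would leave untouched.

```python
from dataclasses import dataclass, replace

UNK = "UNK"

@dataclass(frozen=True)
class WordBox:
    """Hypothetical stand-in for a word box: a name plus its pregroup type."""
    name: str
    pregroup_type: str

class UnknownWordReplacer:
    """Maps boxes to boxes, in the spirit of a functor acting on arrows."""

    def __init__(self, unknown_words):
        self.unknown_words = frozenset(unknown_words)

    def map_box(self, box):
        # Only the name changes; the pregroup type is preserved, so the
        # rewritten diagram still type-checks.
        if box.name in self.unknown_words:
            return replace(box, name=UNK)
        return box

    def __call__(self, diagram):
        return [self.map_box(box) for box in diagram]

# Toy "diagram" (assumed): a noun, a transitive verb, and another noun.
diagram = [
    WordBox("Alice", "n"),
    WordBox("adores", "n.r @ s @ n.l"),
    WordBox("quokkas", "n"),
]

replacer = UnknownWordReplacer(["adores", "quokkas"])
rewritten = replacer(diagram)
```

Because each UNK box keeps the type of the word it replaces, the noun UNK and the transitive-verb UNK remain distinct boxes, which matches the requirement that each part of speech gets its own UNK token.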

Notes

  • lambeq also contains compositional schemes that do not support syntax, such as the CupsReader and StairsReader. For these cases, the simple algorithm proposed above would suffice, and it could be applied directly at the sentence level. For this task, however, we are interested in handling unknown words for the syntax-based models of lambeq, such as DisCoCat and TreeReader.
  • In lambeq's pipeline, the replacement should take place after the generation of the string diagrams and before the application of any rewrite rule or ansatz.
  • In case you are not familiar with functors, a less preferred way to implement this task is to create a function that processes a passed list of diagrams in a simple imperative way.

Resources

Some useful resources for this task can be found below:

@le-big-mac le-big-mac added the unitaryHACK Tasks for unitaryHACK 2023 label May 4, 2023
@Thommy257 Thommy257 removed the unitaryHACK Tasks for unitaryHACK 2023 label May 5, 2023
@Thommy257 Thommy257 changed the title unitaryHACK - Replace unknown words in diagrams with UNK token Replace unknown words in diagrams with UNK token May 5, 2023
@dimkart dimkart added the enhancement New feature or request label May 25, 2023
@Thommy257 Thommy257 added the unitaryHACK Tasks for unitaryHACK 2023 label May 26, 2023
@Thommy257 Thommy257 changed the title Replace unknown words in diagrams with UNK token UnitaryHACK: Replace unknown words in diagrams with UNK token May 26, 2023
@WingCode
Contributor

@le-big-mac I would like to take a stab at this issue. Could you assign it to me?

@le-big-mac
Collaborator Author

le-big-mac commented May 26, 2023

Hi @WingCode, we're very glad you've taken an interest in this issue! The way the unitaryHack bounty tracking works means that we'll assign this issue to you if you are the first one to open a PR that solves it. You can work on it without being assigned and open a PR on this repo, after which we'll assign the issue to you and close it if it solves the problem!

@dimkart
Contributor

dimkart commented May 26, 2023

@WingCode Note that more than one user can work on the same issue, in which case the maintainers decide which solution is the best (or they --we-- can even split the bounty).

@mithunpaul08

@dimkart Why not use the technique of a trained FFNN/MLP that learns a mapping from new/unknown words to their FastText equivalents, which @nikhilkhatri suggested in his Master's thesis? I am using it, and it's brilliant.

@dimkart
Contributor

dimkart commented May 27, 2023

@mithunpaul08

@dimkart Why not use the technique of a trained FFNN/MLP that learns a mapping from new/unknown words to their FastText equivalents, which @nikhilkhatri suggested in his Master's thesis? I am using it, and it's brilliant.

You are right, that would be much preferable, but it would be too much for this hackathon. We didn't want to add any tasks that involve real experiments.

@ACE07-Sev
Contributor

ACE07-Sev commented May 28, 2023

Greetings,

I have code prepared for exactly that, but it is a function I defined, not an instance of a RewriteRule class. The reason is to let users apply it to diagrams with respect to the dataset they are using. I read the source code, and as far as I understand for now (not saying it's impossible HEHEHE), the other rewrite rules, especially the determiner and punctuation ones, are possible because the words they apply to are defined BEFOREHAND, whereas the UNK words will change for each dataset.

Proof of work :
[image omitted]

To

[image omitted]

There are two functions: one applies UNK rewriting to the training data, replacing words that occur fewer times than some threshold; the other applies it to a test sentence and has to check the entire vocabulary of words the model has seen before.
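
As a rough illustration of the two selection modes described here (not the code from the notebook or the PR), the words to replace could be computed as follows. The function names and toy data are hypothetical.

```python
from collections import Counter

def rare_training_words(train_sentences, threshold=3):
    """Training mode: words seen fewer than `threshold` times."""
    counts = Counter(w for sentence in train_sentences for w in sentence)
    return {w for w, n in counts.items() if n < threshold}

def unseen_test_words(test_sentence, known_vocabulary):
    """Test mode: words the model never learned a representation for."""
    return {w for w in test_sentence if w not in known_vocabulary}

# Toy data (assumed, for illustration only).
train = [
    ["alice", "likes", "bob"],
    ["alice", "likes", "carol"],
    ["alice", "likes", "dave"],
]

rare = rare_training_words(train, threshold=3)

# The vocabulary the model actually keeps is everything that was not rare.
known = {w for sentence in train for w in sentence} - rare
unseen = unseen_test_words(["erin", "likes", "bob"], known)
```

Either set can then be handed to whatever performs the replacement on the diagrams. Note that in test mode a word such as "bob", which was rare in training and therefore already replaced, also counts as unknown.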

Shall I make my PR in the form of a jupyter notebook providing the approach?

@ACE07-Sev
Contributor

ACE07-Sev commented May 28, 2023

I finished my Jupyter notebook (removed irrelevant details like other tokenizers and other ansatzes). Here is the link for it; based on feedback, I'll make a PR if requested.

My understanding of the problem :
"The overall goal of this task is to develop a DisCoPy functor that takes a list of unknown words to be replaced with UNK, and that, when passed a diagram, replaces all the boxes containing an unknown word with an UNK box corresponding to the same pregroup type."

So in my function, I define a DisCoPy Functor, given the unknown words and the mode wanted (low-occurrence words or never-seen-before words), and then apply the functor to diagrams to rewrite them. I think this should be OK; my only hesitation at the moment is that it does not follow the same pattern as the other rewrite rules, which I'll work on now.

https://github.com/ACE07-Sev/Quantum-Natural-Language-Processing-with-Lambeq/blob/main/QNLP-UNK.ipynb

@ACE07-Sev
Contributor

By the way @mithunpaul08, I'd love to help you with implementing that for Lambeq.

@dimkart
Contributor

dimkart commented May 28, 2023

@ACE07-Sev Hi, unfortunately we cannot review code that is not part of a PR in this repository. So if you want to participate, you will have to open a proper PR here. Note though that we are not asking for a notebook, but for a functor, rewrite rule, or method that is available from lambeq's public interface. If in the end there is more than one PR open for the same issue, we will select the solution we consider the best (or split the bounty, as mentioned above).

@ACE07-Sev
Contributor

@dimkart dear, I have made the PR. I wrote two versions of the code: one is the one I made a PR for; the other is basically like something you would write (same structure and style as the other classes), but I couldn't really test it to see if it works, so I made the PR for the one that I was able to test.

I don't like my current PR exactly because it's not a class. I am certain the idea is correct, but there is some syntax error somewhere that I can't find hehe. I'll try to see if I can fix that as well.

This was referenced May 29, 2023
@ACE07-Sev
Contributor

@dimkart dear, I have made the PR with the class format. I added it to the Rewriter class as one of the _available_rules; to use it, we simply pass the words and apply it to the diagram.

@dimkart
Contributor

dimkart commented Jun 19, 2023

This is now completed. Thank you all for your work!

@dimkart dimkart closed this as completed Jun 19, 2023