Will lemmatization negatively affect FlauBERT word embeddings? #38

Closed

mcriggs opened this issue Dec 22, 2021 · 9 comments

@mcriggs

mcriggs commented Dec 22, 2021

Hello!

I am using FlauBERT to generate word embeddings as part of a study on word sense disambiguation (WSD).

The FlauBERT tokenizer does not recognize a significant number of words in my corpus and, as a result, segments them into subword pieces. For example, it does not recognize the archaic orthography of some verbs, and it does not recognize the plural forms of a number of other words. Some of the words that get segmented in this way are significant to my research.

I understand that I could further train FlauBERT on a corpus of eighteenth-century French in order to create a new model and a new tokenizer specifically for eighteenth-century French. However, compiling a corpus of eighteenth-century French that is large and heterogeneous enough to be useful would be challenging (and perhaps not even possible).

As an alternative to training a new model, I thought I might lemmatize my corpus before running it through FlauBERT. Stanford's Stanza NLP package (stanfordnlp.github.io/stanza/) recognizes the archaic orthography of the verbs in my corpus and turns them into their infinitive forms, which FlauBERT recognizes. Similarly, Stanza changes the plural forms of other words into singular forms, which FlauBERT also recognizes. Thus, if I were to lemmatize my corpus with Stanza, the FlauBERT tokenizer would be able to recognize substantially more of the words in it.
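To give a sense of what I mean, here is roughly the kind of Stanza pipeline I have in mind (just a sketch; the example sentence is only an illustration):

import stanza

# Lemmatize French text with Stanza before feeding it to FlauBERT.
stanza.download("fr")  # only needed once
nlp = stanza.Pipeline("fr", processors="tokenize,mwt,pos,lemma")

doc = nlp("Il aimait les perles baroques.")
lemmas = [word.lemma for sentence in doc.sentences for word in sentence.words]
print(" ".join(lemmas))  # something like "il aimer le perle baroque ."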

Would lemmatizing my corpus in this way adversely affect my FlauBERT results, and a WSD analysis in particular? More generally, does lemmatization have a negative effect on BERT results, and on WSD analyses especially?

Given that FlauBERT was not trained on lemmatized text, I imagine that lemmatizing the corpus would indeed negatively affect the results of the analysis. As an alternative to training FlauBERT on a corpus of eighteenth-century French (which may not be possible), could I instead train it on a corpus of lemmatized French and then use this new model for a WSD analysis on my corpus of lemmatized eighteenth-century French? Would that work?

I'm not sure if this is the right place for these sorts of questions!

Thank you in advance for your time.

@formiel
Contributor

formiel commented Dec 29, 2021

Hi @mcriggs. Sorry for the late reply!

@loic-vial @TCMVince Could you please help with this issue? Unfortunately, I'm not familiar with WSD :(

@VSegonne
Collaborator

Hey,
Sorry for the late reply, it's a busy time of the year!

I haven't really faced that problem before so I can only provide answers based on my humble intuition.

  1. Just a simple clarification: can you explain why segmentation would be a problem in your setup?
    Segmentation is part of the BERT tokenizer, which relies on BPE encoding. As a result, many words will be segmented, not only the unknown ones, which in itself is not a problem (see the short sketch after this list).

  2. I honestly don't know to what extent lemmatizing the corpus will do the trick, given the difference in processing between the training and test data. Indeed, lemmatizing the 18th-century text might just add another layer of genre difference!
    Nevertheless, given the potential semantic relatedness between the original and lemmatized words, it may not be that bad and may be worth a try, at least for WSD (this is just an intuition; I can't really assert anything at this point).
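For instance, you can see the segmentation behaviour with a quick check (a rough sketch, assuming you use the Hugging Face FlaubertTokenizer):

# Rough sketch: inspect how the FlauBERT BPE tokenizer segments a sentence.
# Frequent words usually stay whole; rarer forms are split into subword units.
from transformers import FlaubertTokenizer

tokenizer = FlaubertTokenizer.from_pretrained("flaubert/flaubert_large_cased")
print(tokenizer.tokenize("Il aimait les perles baroques."))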

Hope it helps, keep us posted !

@mcriggs
Author

mcriggs commented Dec 29, 2021

@TCMVince and @formiel Thank you both for your willingness to help! I really appreciate it!

@TCMVince, segmentation is a problem for the project I'm working on because the plural form of the word my research focuses on -- 'baroque' -- gets broken up into two parts.

Let me give you a specific example of what the FlauBERT tokenizer does. Take the sentences "Il aimait la perle baroque." and "Il aimait les perles baroques." The FlauBERT tokenizer (with the 'flaubert/flaubert_large_cased' model) treats the first of these sentences as:

['il', 'aimait', 'la', 'perle', 'baroque', '.']
[0, 41, 9138, 17, 18882, 17043, 16, 1]

and the second as:

['il', 'aimait', 'les', 'perles', 'bar', 'oques', '.']
[0, 41, 9138, 22, 11211, 2214, 40208, 16, 1]

From these results, I understand that the tokenizer recognizes the singular form 'baroque', but doesn't recognize the plural form 'baroques' and for this reason splits it into 'bar' and 'oques'.

I don't know how to generate word embeddings for the segmented form of 'baroques' (i.e. 'bar', 'oques') because it has two word indices (i.e. 2214 and 40208).

Here's the code I use for generating word embeddings of the singular form of 'baroque' in my corpus:

import torch
from transformers import FlaubertModel, FlaubertTokenizer

# I load the large cased checkpoint mentioned above.
tokenizer = FlaubertTokenizer.from_pretrained('flaubert/flaubert_large_cased')
model = FlaubertModel.from_pretrained('flaubert/flaubert_large_cased')

keyword_embeddings = []

for text in texts:  # texts: sentences from my corpus containing 'baroque'
    encoded_text = tokenizer.encode(text)
    keyword_index = encoded_text.index(17043)  # 17043 is the index of 'baroque' in FlauBERT
    token_ids = torch.tensor([encoded_text])
    with torch.no_grad():
        last_layer = model(token_ids)[0]  # hidden states of the last layer
    keyword_embedding = last_layer[:, keyword_index, :]  # vector at the position of 'baroque'
    keyword_embeddings.append(keyword_embedding)

keyword_all_embeddings = torch.cat(keyword_embeddings, dim=0)

Given that "baroques" has two word indices (2214 and 40208), I don't know how I can generate word embeddings for it using this code. Does that make sense?

I thought lemmatization might help me resolve this problem. If I were to lemmatize the two example sentences, they would both be turned into the same sentence: "Il aimer le perle baroque." The tokenizer would treat this sentence as follows:

['il', 'aimer', 'le', 'perle', 'baroque', '.']
[0, 41, 4568, 20, 18882, 17043, 16, 1]

Thus, by lemmatizing the sentences prior to using FlauBERT, I could resolve the problem of the plural form of 'baroques' being segmented by the tokenizer. I understand that lemmatizing erases information about gender, tense, etc., but that information is not of primary interest to me.

Moreover, if I were to lemmatize the text prior to generating word embeddings in FlauBERT, I could easily visualize the word embeddings of both 'baroque' and 'baroques' in the same t-SNE scatterplot.
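For the visualization I am thinking of something like this (just a sketch, assuming scikit-learn and matplotlib, and reusing keyword_all_embeddings from the code above):

# Sketch: project the collected embeddings to 2D with t-SNE and plot them.
# t-SNE needs more samples than its perplexity, so adjust perplexity for small sets.
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

coords = TSNE(n_components=2, random_state=0).fit_transform(keyword_all_embeddings.numpy())
plt.scatter(coords[:, 0], coords[:, 1])
plt.title("Contextual embeddings of 'baroque'")
plt.show()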

Does that make any sense? Does lemmatization seem like a reasonable option here?

By the way, if I were to further train FlauBERT on a corpus of eighteenth-century French, how big should the corpus be?

Perhaps I might try to further train FlauBERT on an eighteenth-century French corpus and on a lemmatized version of that same corpus and see what difference it makes to word embeddings...

Thank you again for your help!

@VSegonne
Collaborator

As for the segmentation, one usual way to recover an embedding for a split word is to sum or average the contextual vectors of its BPE tokens.

Using your example:
['il', 'aimait', 'les', 'perles', 'bar', 'oques', '.']
[0, 41, 9138, 22, 11211, 2214, 40208, 16, 1]
this means summing/averaging the representations of 2214 and 40208. One way to do this is to keep track of each word's BPE span when using the tokenizer.
You may want to take a look at encode() and forward() at : https://github.com/getalp/Flaubert/blob/master/flue/wsd/verbs/modules/wsd_encoder.py#L33

I would start with this first.
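To make the span idea a bit more concrete, something along these lines could work (just a rough sketch with the Hugging Face FlaubertTokenizer/FlaubertModel; the per-word bookkeeping is one possible way to do it, not the exact code from the link above):

import torch
from transformers import FlaubertModel, FlaubertTokenizer

tokenizer = FlaubertTokenizer.from_pretrained("flaubert/flaubert_large_cased")
model = FlaubertModel.from_pretrained("flaubert/flaubert_large_cased")

# Tokenize word by word so we know which BPE positions belong to which word.
sentence = "Il aimait les perles baroques ."
pieces, spans = [], []
for word in sentence.split():
    word_pieces = tokenizer.tokenize(word)
    start = len(pieces) + 1  # +1 accounts for the <s> token added below
    spans.append((start, start + len(word_pieces)))
    pieces += word_pieces

token_ids = torch.tensor([tokenizer.convert_tokens_to_ids(
    [tokenizer.bos_token] + pieces + [tokenizer.sep_token])])
with torch.no_grad():
    last_layer = model(token_ids)[0]

# One vector per word: average the subword vectors over each word's span,
# e.g. 'baroques' -> mean of the vectors for 'bar' and 'oques'.
word_vectors = [last_layer[0, start:end].mean(dim=0) for start, end in spans]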

Then, depending on the time you can spend on each experiment, you can of course try to lemmatize your test data (I assume it isn't that big). I would be interested in the results, by the way.

Finally, as for the size of the training corpus, I honestly don't know. @formiel, you may have an idea?

Hope it helps!

@mcriggs
Author

mcriggs commented Dec 30, 2021

@TCMVince That does help indeed! Thank you!

Thank you as well for pointing me to that other code. Unfortunately, I couldn't really make sense of it. (My coding skills are pretty basic...) But I was able to tweak what I already had and it seems to work. I can just do:

keyword_embeddings = []

for text in texts:
    encoded_text = tokenizer.encode(text)
    keyword_index_1 = encoded_text.index(2214)   # position of 'bar'
    keyword_index_2 = encoded_text.index(40208)  # position of 'oques'
    token_ids = torch.tensor([encoded_text])
    with torch.no_grad():
        last_layer = model(token_ids)[0]
    keyword_embedding_1 = last_layer[:, keyword_index_1, :]
    keyword_embedding_2 = last_layer[:, keyword_index_2, :]
    # Average the two subword vectors to get one embedding for 'baroques'.
    keyword_embedding = (keyword_embedding_1 + keyword_embedding_2) / 2
    keyword_embeddings.append(keyword_embedding)

keyword_all_embeddings = torch.cat(keyword_embeddings, dim=0)

That seems to work, so I'll go with it. I'll also compare the results with a lemmatized version of my texts and will let you know how it turns out. :)

Thanks again!

@mcriggs
Author

mcriggs commented Dec 30, 2021

@formiel I'm very interested in further training FlauBERT on a corpus of eighteenth-century French. If you have any sense of how big a corpus I would need, I would really appreciate it!
Thank you!

@VSegonne
Collaborator

I'm glad I could help ;)

@formiel
Contributor

formiel commented Dec 31, 2021

Thanks a lot @TCMVince for your help!

@mcriggs Some previous work (https://arxiv.org/abs/1908.04812) has shown the benefit of continued pre-training on in-domain data using 0.5M examples. Maybe you could try a corpus of a similar size, or you could try out different amounts of data to investigate the impact of continued pre-training on domain-specific data (I would say the more data the better). You may also want to check out these works: https://arxiv.org/abs/2010.02559 and https://arxiv.org/abs/1904.02232, which propose different approaches to adapting pre-trained models to a specific domain.
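If it helps, continued pre-training with masked language modelling can be run with the Hugging Face Trainer along these lines (just a rough sketch, not the exact FlauBERT training recipe; the file name and all hyperparameters below are placeholders that you would need to tune):

from transformers import (FlaubertTokenizer, FlaubertWithLMHeadModel,
                          DataCollatorForLanguageModeling, LineByLineTextDataset,
                          Trainer, TrainingArguments)

tokenizer = FlaubertTokenizer.from_pretrained("flaubert/flaubert_large_cased")
model = FlaubertWithLMHeadModel.from_pretrained("flaubert/flaubert_large_cased")

# One sentence per line in the (placeholder) domain corpus file.
dataset = LineByLineTextDataset(tokenizer=tokenizer,
                                file_path="corpus_18th_century_french.txt",
                                block_size=256)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                           mlm=True, mlm_probability=0.15)
args = TrainingArguments(output_dir="flaubert-18c",
                         num_train_epochs=3,
                         per_device_train_batch_size=8,
                         save_steps=10_000)
Trainer(model=model, args=args, data_collator=collator,
        train_dataset=dataset).train()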

I hope that helps!

@mcriggs
Author

mcriggs commented Dec 31, 2021

@formiel Thank you!! This is all very helpful. :)

@mcriggs mcriggs closed this as completed Feb 4, 2022