Adding additional custom features? #91
Hey @echan00, thanks for using FARM. Can you also supply some more insights into the use case and maybe a link to a relevant paper?
Hi @Timoeller
In the example above, knowing that Apple Computers is bold is useful information for predicting the NER tag. For my particular use case, I want to be able to predict word alignment given the source text and target translation: I want to be able to point to a keyword in the source text and predict the same keyword in the target text. (Yes, the separator is just the boundary between the source and target texts.)
Thanks for the explanations. Yes, this seems like a super cool use case - but no, it is not implemented in FARM yet. The modularity of language model and prediction head should make it rather simple to implement, though. I guess the changes would be twofold: adding the custom features on a token level during data processing, and concatenating them to the language model output inside the AdaptiveModel's forward() before the prediction heads.
If you are going forward with this we would help you out as much as possible, because this seems like a really interesting feature.
Thanks @Timoeller. Do you think just concatenating the extra information as-is will be good enough? Do I need to worry about the dimensionality of the extra information relative to the huge BERT embedding model?
That is a very stimulating question. If you concatenate the information to the BERT output embeddings, the prediction head should take care of selecting and weighting the relevant parts. Since the prediction is on a token level, you would need to concat this extra information on a token level, too. I think a more elegant, but also much more difficult, way would be to have "type information" already present at the input to the language model. I would envision this as additional embeddings, similar to how positional embeddings get integrated into BERT. You would have to pretrain a whole language model from scratch for that to work, though...
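For illustration, a minimal sketch of that "type information at the input" idea: an extra embedding table whose output is summed with the usual token/position/segment embeddings, analogous to how positional embeddings enter BERT. The `FeatureEmbedding` name and shapes are assumptions, not FARM code, and as noted above this would require pretraining the language model from scratch.

```python
import torch
import torch.nn as nn

class FeatureEmbedding(nn.Module):
    """Hypothetical extra embedding for per-token type features (e.g. bold)."""

    def __init__(self, num_feature_classes: int, hidden_size: int = 768):
        super().__init__()
        self.embed = nn.Embedding(num_feature_classes, hidden_size)

    def forward(self, input_embeddings: torch.Tensor, feature_ids: torch.Tensor) -> torch.Tensor:
        # input_embeddings: [batch_size, max_seq_len, hidden_size]
        # feature_ids:      [batch_size, max_seq_len], e.g. 0 = plain, 1 = bold
        return input_embeddings + self.embed(feature_ids)

# Usage: added to the input embeddings before the transformer layers.
emb = FeatureEmbedding(num_feature_classes=2)
out = emb(torch.randn(4, 128, 768), torch.randint(0, 2, (4, 128)))  # [4, 128, 768]
```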
@Timoeller I'm new to this, so making changes to your library may end up requiring you to answer quite a few questions. I hope you don't mind.
Is this necessary if I expect to use the BERT tokenizer? Or am I missing something?
Would you mind providing a bit more detail on how to go about this? How do you suggest I concatenate the extra information? I'm trying to figure out how to access the extra information in the AdaptiveModel forward() function you referenced - I presume from the data_silo? Other questions:
Exactly - BERT will split some words into subwords (e.g. 'exponents' into 'ex' and '##ponents'). These subwords need to be aligned with the custom features you add. It should be rather simple; you can reuse the code for creating NER labels (see the sketch after this message). To the other questions:
The data for this will be inside the kwargs, and I will try to explain how to get it there. Hope that helps. Good luck, and please keep me updated on your progress or ask if something breaks. This is a really exciting use case!
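A hedged sketch of the subword alignment described above, mirroring how NER labels get repeated across subwords. `tokens` and `word_features` are illustrative inputs, not FARM's actual processor variables:

```python
def align_features_to_subwords(tokens, word_features):
    """Repeat each word-level feature for every subword of that word.

    tokens: BERT wordpieces, assumed to start with a word-initial token.
    word_features: one feature value per original word.
    """
    aligned = []
    word_idx = -1
    for token in tokens:
        if not token.startswith("##"):  # a new word starts here
            word_idx += 1
        aligned.append(word_features[word_idx])
    return aligned

print(align_features_to_subwords(["ex", "##ponents", "are", "fun"], [1, 0, 0]))
# -> [1, 1, 0, 0]
```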
Thanks for your help @Timoeller. I have made all the changes in my fork: https://github.com/echan00/FARM. I am now running into an error during model training:
My guess is I am not concatenating the extra data correctly:
Would absolutely appreciate it if you could point me in the right direction. EDIT: providing a sample of the data from the log for reference:
Hey, looking good already. I had a quick glance at the code, and it seems it is doing the right things at the right places, though I cannot tell without debugging whether custom_data is in the right format. Concerning your error: you are right, your code works on the Python level, but at this stage we are in the torch tensor world, so we have to use torch's cat function: https://pytorch.org/docs/stable/torch.html#torch.cat
One thing I noticed: your custom_feature_dim is 192. I guess you have 192 different classes for this feature, and you would need to one-hot encode those classes into a vector of dimensionality 192. But since your custom_data is just a single integer, the custom_feature_dim should be 1. Looking forward to your results!
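A minimal sketch of the torch.cat fix, with shapes assumed from the discussion (batch_size 4, max_seq_len 128, custom_feature_dim 1); variable names are illustrative, not FARM's exact code:

```python
import torch

batch_size, max_seq_len = 4, 128
sequence_output = torch.randn(batch_size, max_seq_len, 768)               # BERT output embeddings
custom_data = torch.randint(0, 2, (batch_size, max_seq_len, 1)).float()   # one feature value per token

# Concatenate along the embedding dimension; Python list concatenation
# does not apply to batched tensors, torch.cat does.
combined = torch.cat([sequence_output, custom_data], dim=2)
print(combined.shape)  # torch.Size([4, 128, 769]) -> the prediction head must expect 769 inputs
```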
Okay, I updated my concat_additional function as follows:
Due to
I am now running into another error and am not sure what it is about:
Mhh, going forward() but not in the exact right direction :) So list1 and list2 should have shapes [batch_size, max_seq_len, 768] and [batch_size, max_seq_len, 1] respectively. I really do not understand how custom_feature_dim could be 0 - that would mean you do not have custom features...
Okay, I'll be looking into this some more :) Thanks for all the pointers.
Do you mean concatenating dimension 768 of list1 (the last slice) with list2 (the one and only slice, since its dimension is 1)? Or something else?
Ah, sorry, I think we are having a terminology issue here. So yes, you concat the 768 dimensions coming from the BERT embeddings with the 1 coming from your custom feature. If the custom feature has 10 different classes (bold, italics, capitalized, different font, ...) and you one-hot encode this into 10 dimensions, you would add these 10 to the 768.
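A sketch of that one-hot step, assuming 10 feature classes and the same illustrative shapes as above:

```python
import torch
import torch.nn.functional as F

class_ids = torch.randint(0, 10, (4, 128))               # [batch_size, max_seq_len], one class id per token
one_hot = F.one_hot(class_ids, num_classes=10).float()   # [batch_size, max_seq_len, 10]

sequence_output = torch.randn(4, 128, 768)
combined = torch.cat([sequence_output, one_hot], dim=2)  # [4, 128, 778]
```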
Something doesn't sound right. I just ran the provided ner.py from the examples folder, and sequence_output (list1) is torch.Size([5, 128, 768]), i.e. 3 dimensions. I imagine this is [number of features, max_seq_len, embedding_size] and not [batch_size, max_seq_len, embedding_size] as you described above. In my version, sequence_output (list1) is torch.Size([3, 128, 768]), also 3 dimensions, which is in line with the 3 being the number of features. UPDATE:
Just wrapped up one epoch of training, results don't look good:
Below is one example of the input features. Everything looks correct.
This is already quite a success - the network is getting the data and training something. It could be that this single number is too high compared to the values inside the embedding. Try setting all custom_data to 0. If that trains a good system, scale these values to something small. The next step would be to one-hot encode. Ah, and could you update your fork please? Then I can have a detailed look in the debugger to see what happens.
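The two debugging steps as a sketch (names, shapes, and the scaling scheme are illustrative assumptions):

```python
import torch

custom_data = torch.randint(0, 192, (4, 128, 1)).float()  # raw integer feature ids

zeroed = torch.zeros_like(custom_data)  # step 1: train with the feature switched off entirely
scaled = custom_data / 191.0            # step 2: map ids into [0, 1], comparable to embedding values
```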
Okay, I just updated my fork. Here is the data that you should unzip besides ner-run.py:
Okay, I'll try setting custom_data to 0. I'll also separately try setting custom_data to be the same as the output (ner_labels_ids).
Isn't custom_data already one-hot encoded? It's 0s and 1s only. EDIT:
Same results when custom_data values are all 0s - accuracy is in the single digits. It seems like custom_data is either a) not affecting the model during training, or b) the model is not quite picking up what it is being trained to do.
That is true - if the additional data is only binary, the current implementation is effectively one-hot encoded and correct.
This is good and bad news. The good news is that we can still figure out whether custom data helps the classification. The bad news is that there is probably a larger bug, either in how you convert the data and pass it to the model, or in how you evaluate the predictions. Can you check that both are working correctly? Then we can continue implementing the custom data. Hope this goes forward - it seems we are really close : )
I presume the input features don't change anymore before being passed into the model for training? I've inspected them and they look good to me. I could post a sample and explain it, to verify everything is appropriate?
Below is one sample of the data: the tokenized text (plus offsets and start_of_word), the additional features (custom_data), and the prediction (ner_label). The data length is 192.
These are the model input features for training (the data is length 194 after adding [CLS] and [SEP], and then padded to length 512):
EDIT: looks like I found some discrepancies - let me double-check first.
Ok, I will wait before digging deeper. Another thing: you do not have to set max_seq_len to 512 if the data is always shorter. That is actually super bad, because it wastes a lot of compute. But NER inside a multilingual model works; we have benchmarked it against our German BERT, as you can see in this blog post: https://deepset.ai/german-bert Good luck! Let's chase this bug down!
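For reference, a hedged sketch of right-sizing max_seq_len when building the processor; exact import paths and arguments are assumptions and may differ between FARM versions:

```python
from farm.data_handler.processor import NERProcessor
from farm.modeling.tokenization import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
# Samples here are at most ~194 tokens after adding [CLS]/[SEP], so 256 is
# plenty and much cheaper than padding everything to 512.
processor = NERProcessor(tokenizer=tokenizer, max_seq_len=256, data_dir="data/")
```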
I tried training again after improving the data processing but the results are still completely inaccurate.
I will try it as you suggest. Instead of "0" and "1", I will use "N" and "Y". I am also making "[CLS]" and "[SEP]" distinct predictions of their own. A sample is below:
Model results are still off. @Timoeller, if you could take a look that would be great. I've updated my fork as well. Below are the evaluation results from model training:
It seems like [SEP] is being predicted appropriately. I imagine this means there is no bug in the evaluation, and the issue is that the custom_data information is not registering?
Sorry @echan00, I haven't had time to look into this.
Thanks for checking in. At this point I've double-checked the data enough times that the chance of data inaccuracies is very low. The best performance I've gotten so far is worse than random. I'm not sure what else I can do at this point.
Did you try NER on the same data in another framework, and did it produce the same bad results? Unfortunately I do not have the time to look deeply into your code + data, and prefer supporting you with bugs/problems inside FARM. If there is nothing more to try, we should close this issue for now and hope for better days and data : )
Is it possible to add additional custom features in addition to using pre-trained language models for the downstream tasks?
For example: