feat: Hugging face entity extraction #777
@suryatejreddy Any feedback on this PR?
if isinstance(row_output, list):
row_output = {k: [dic[k] for dic in row_output] for k in row_output[0]}
result_list.append(row_output)
if outputs != [[]]:
I think this can be made cleaner by returning an empty DataFrame when the NER model returns an empty list.
Returning an empty DataFrame doesn't work; it raises an "unknown columns" error.
Yeah but in the sample case, we do have entities so I believe that shouldn't happen?
Yes, in the sample case we do have entities, and the unknown-columns issue has not occurred there.
eva/udfs/abstract/hf_abstract_udf.py
Outdated
row_output = {k: [dic[k] for dic in row_output] for k in row_output[0]}
result_list.append(row_output)
elif self.pipeline_args["task"] == "ner":
result_list.append({"entity": "", "score": 0, 'index': 0, 'end': 0, 'word': 0, 'start': 0})
Why are we hardcoding this here? The above code should take care of it automatically right? Is there a case when it doesn't work?
@affan00733 Why is there an unknown column coming up with only the NER task? What happens with some other task like text classification? Why does this code not work:
row_output = {k: [dic[k] for dic in row_output] for k in row_output[0]}
We should avoid specializing the code for a particular task as much as possible
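As a rough illustration of the question above, here is a minimal sketch of what the quoted comprehension does to a list of dicts, and how it behaves when that list is empty. The sample entity dicts and the helper name `transpose_rows` are hypothetical and only stand in for NER-style output; this does not claim to reproduce the exact failure seen in the PR.

```python
def transpose_rows(row_output):
    """Turn a list of dicts into a dict of lists, keyed by the first dict's keys."""
    return {k: [dic[k] for dic in row_output] for k in row_output[0]}


# Hypothetical NER-style output for one input row with two detected entities.
entities = [
    {"entity": "B-PER", "word": "Ada", "start": 0, "end": 3},
    {"entity": "B-LOC", "word": "London", "start": 13, "end": 19},
]
columns = transpose_rows(entities)

# With no entities there is no first element to read the keys from,
# so the comprehension cannot produce any column names at all.
try:
    transpose_rows([])
    empty_error = None
except IndexError as exc:
    empty_error = exc
```

The point of the sketch is that the column names come entirely from the first result, so an empty result carries no schema. Any task whose pipeline can return an empty list per row would presumably hit the same gap, not only NER.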
For text classification the same code works properly; none of these issues have been encountered.
@affan00733 left some comments. cc @jarulraj
@suryatejreddy I added some comments, fixed all the reported issues, and added the additional test case.
@affan00733 could you try running again with the new sample text case that you added which actually has entities?
@suryatejreddy I have tried the new sample case; it works and returns some entities, as you can see in the image below.
@jarulraj @suryatejreddy After carefully inspecting the issue, I found that the unknown-columns error occurs only when no entities are detected. Given that scenario, I think the hardcoded block will be needed for NER.
@affan00733 Is it possible to generally handle scenario where entities have been detected -- instead of specializing it to just this task?
@jarulraj I tried but couldn't find a way. When NER fails to detect any entities, an assert statement in the project function in batch.py checks the unknown and verified columns, as seen in the image. Here, verified is the set of columns detected by the NER model, so it is empty when NER finds no entities, and unknown is the set of columns from the UDF frames that have been detected. I don't have much understanding of batch.py, so @suryatejreddy could you please guide me on what possible solutions there are for this scenario?
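To make the kind of check described above concrete, here is a small sketch of a projection that asserts every requested column exists. It is not the actual code in batch.py; the function name `check_columns`, its parameters, and the way verified and unknown are computed are all assumptions made for illustration.

```python
def check_columns(requested, frame_columns):
    """Hypothetical projection check: every requested column must exist in the frame."""
    # Columns that were requested and are present in the frame.
    verified = [c for c in requested if c in frame_columns]
    # Columns that were requested but are missing from the frame.
    unknown = [c for c in requested if c not in frame_columns]
    assert not unknown, f"Unknown columns: {unknown}"
    return verified


# A frame built from real entities exposes the expected columns.
ok = check_columns(["entity", "score"], ["entity", "score", "word"])

# A frame built from an empty result has no columns, so every
# requested column ends up in the unknown list and the assert fires.
try:
    check_columns(["entity", "score"], [])
    failure = None
except AssertionError as exc:
    failure = str(exc)
```

Under these assumptions, an empty frame with no column names fails the check, while an empty frame that still declares its columns would pass it.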
Fixed it by changing the code in
Thanks professor :)
@gaurav274 @suryatejreddy Is this behavior what we expect when the UDF does not return any inputs?
Adds support for named entity recognition (NER) models to the Hugging Face pipeline.