feat: Hugging face entity extraction #777

Merged 17 commits on May 31, 2023

affan00733
Member

Added support for named entity recognition (NER) models as add-on functionality in the Hugging Face pipeline.

@affan00733 affan00733 self-assigned this May 25, 2023
@jarulraj
Member

@suryatejreddy Any feedback on this PR?

if isinstance(row_output, list):
    row_output = {k: [dic[k] for dic in row_output] for k in row_output[0]}
result_list.append(row_output)
if outputs != [[]]:
Collaborator

I think this can be made cleaner by returning an empty DataFrame when the NER model returns an empty list.

Member Author

Returning an empty DataFrame doesn't work; it raises an "unknown columns found" error.

Collaborator

Yeah, but in the sample case we do have entities, so I believe that shouldn't happen?

Member Author

Yes, in the sample case we have entities, and the unknown-columns issue does not occur there.

row_output = {k: [dic[k] for dic in row_output] for k in row_output[0]}
result_list.append(row_output)
elif self.pipeline_args["task"] == "ner":
    result_list.append({"entity": "", "score": 0, 'index': 0, 'end': 0, 'word': 0, 'start': 0})
Collaborator

Why are we hardcoding this here? The above code should take care of it automatically, right? Is there a case where it doesn't work?

Member Author

It is not handled automatically; it raises an "unknown column found" error, as shown in the image below.

image

Member

@affan00733 Why is there an unknown column coming up with only the NER task? What happens with some other task like text classification? Why does this code not work:

row_output = {k: [dic[k] for dic in row_output] for k in row_output[0]}

We should avoid specializing the code for a particular task as much as possible.
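
For readers following the thread, here is a minimal sketch of what the quoted comprehension does and where it stops working. The sample rows and the helper name `rows_to_columns` are made up for illustration and are not taken from the PR; the point is only that the keys are read from `row_output[0]`, so an empty prediction list has nothing to read them from.

```python
from typing import Any, Dict, List


def rows_to_columns(row_output: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Turn a list of per-entity dicts into one dict of column lists."""
    # Same expression as in the review thread; the keys come from the first dict.
    return {k: [dic[k] for dic in row_output] for k in row_output[0]}


# Hypothetical NER-style predictions for a single input text.
sample = [
    {"entity": "B-PER", "score": 0.99, "word": "Ada"},
    {"entity": "B-LOC", "score": 0.97, "word": "London"},
]
columns = rows_to_columns(sample)

# With no predictions there is no first element to take the keys from,
# so indexing row_output[0] raises IndexError.
try:
    rows_to_columns([])
    empty_failed = False
except IndexError:
    empty_failed = True
```

This may be why the comprehension alone cannot cover the no-entity case: there is no place in an empty list from which the column names could be recovered.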

Member Author

For text classification, the same code works properly; none of these issues occur.

Collaborator

@suryatejreddy suryatejreddy left a comment

@affan00733 left some comments. cc @jarulraj

@affan00733
Member Author

@affan00733 left some comments. cc @jarulraj

@suryatejreddy I have replied to the comments, fixed all the reported issues, and added the additional test case.
The only remaining part is the NER case, where we add the hardcoded condition because otherwise an "unknown column found" error is raised. I have added the related images in the review threads where this was reported.

@suryatejreddy
Collaborator

@affan00733 could you try running again with the new sample text case that you added which actually has entities?

@affan00733
Member Author

@affan00733 could you try running again with the new sample text case that you added which actually has entities?

@suryatejreddy I have tried the new sample case; it works and returns some entities, as you can see in the image below.

image

@affan00733
Member Author

@jarulraj @suryatejreddy After carefully inspecting the issue, I found that the unknown-columns error occurs only when no entities are detected. Given that scenario, I think the hardcoded block will be needed for NER.
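
To make the finding above concrete, here is a rough sketch of how the guarded formatter from the diff behaves in both situations. It uses plain dicts instead of a DataFrame so it stays self-contained; `format_outputs`, `column_names`, and the sample data are illustrative names and values, not code from this PR.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def format_outputs(outputs: List[Any]) -> List[Row]:
    """Collect one row per input; collect nothing when no entity was detected."""
    result_list: List[Row] = []
    if outputs != [[]]:  # the guard from the diff under review
        for row_output in outputs:
            if isinstance(row_output, list):
                row_output = {k: [dic[k] for dic in row_output] for k in row_output[0]}
            result_list.append(row_output)
    return result_list


def column_names(rows: List[Row]) -> List[str]:
    """Column names that a table built from these rows would have."""
    names: List[str] = []
    for row in rows:
        for key in row:
            if key not in names:
                names.append(key)
    return names


# Hypothetical model outputs: one input with two entities, one input with none.
with_entities = [[
    {"entity": "B-PER", "score": 0.99, "word": "Ada"},
    {"entity": "B-LOC", "score": 0.97, "word": "London"},
]]
without_entities = [[]]

found_columns = column_names(format_outputs(with_entities))
empty_columns = column_names(format_outputs(without_entities))
```

In this sketch the no-entity result has zero rows and therefore zero column names, which is consistent with a later step complaining about columns it expected but cannot find.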

@jarulraj
Member

@affan00733 Is it possible to handle the scenario where no entities have been detected in a general way, instead of specializing it to just this task?

@affan00733
Member Author

@affan00733 Is it possible to handle the scenario where no entities have been detected in a general way, instead of specializing it to just this task?

@jarulraj I tried but couldn't find a way. When NER fails to detect any entities, an assert statement in the project function in batch.py, which checks the unknown and verified columns, fails, as seen in the image. Here, verified holds the columns detected by the NER model, so it is empty when NER finds no entities, and unknown holds the columns from the UDF frames that were detected. Since I don't have much understanding of batch.py, @suryatejreddy could you please guide me on what possible solutions there are for this scenario?
image
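
The check described in this comment can be pictured roughly as follows. This is only a sketch under assumptions: the real `project` function in batch.py is not shown in this conversation, so the signature, the variable names, and the error message below are hypothetical.

```python
from typing import Iterable, List


def project(frame_columns: Iterable[str], requested: Iterable[str]) -> List[str]:
    """Keep the requested columns, refusing any that the frame does not have."""
    available = list(frame_columns)
    wanted = list(requested)
    verified = [c for c in wanted if c in available]
    unknown = [c for c in wanted if c not in available]
    # Hypothetical stand-in for the assert mentioned in the comment above.
    assert len(unknown) == 0, f"Unknown columns: {unknown}"
    return verified


ner_columns = ["entity", "score", "word"]

# Entities detected: the frame carries the expected columns.
kept = project(ner_columns, ner_columns)

# No entities detected: the frame has no columns, so every requested one is unknown.
try:
    project([], ner_columns)
    assertion_raised = False
except AssertionError:
    assertion_raised = True
```

Under these assumptions, any task whose output frame can come back without columns would hit the same assert, which fits the request to handle the case generally instead of only for NER.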

@jarulraj
Member

Fixed it by changing the code in function_expression. That is the right place to fix this. cc @affan00733

@affan00733
Member Author

affan00733 commented May 29, 2023

Fixed it by changing the code in function_expression. That is the right place to fix this. cc @affan00733

Thanks professor :)

@jarulraj
Member

jarulraj commented May 29, 2023

@gaurav274 @suryatejreddy Is this the behavior we expect when the UDF does not return any outputs?

@jarulraj
Member

jarulraj commented May 29, 2023

Query:

SELECT data, {udf_name}(data) FROM MyPDFs;

Output (when entities are detected):
image-1

Query 2:

SELECT data, {udf_name}(data) FROM MyPDFs WHERE page = 3 AND paragraph >= 1 AND paragraph <= 3;

Output (when no entities are detected):
image-2

@jarulraj jarulraj merged commit ccef6da into master May 31, 2023
@jarulraj jarulraj deleted the hugging-face-entity-extraction branch May 31, 2023 22:55

3 participants