Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(#1134): Allow extending the weak label matrix with embeddings #1487

Merged
merged 12 commits into from May 17, 2022

Conversation

dcfidalgo
Copy link
Contributor

Closes #1134 and #1309

@dcfidalgo dcfidalgo changed the base branch from main to master May 12, 2022 08:56
@dcfidalgo dcfidalgo self-assigned this May 12, 2022
@dcfidalgo dcfidalgo added this to In progress in Release via automation May 12, 2022
@dcfidalgo dcfidalgo marked this pull request as ready for review May 13, 2022 00:08
@dcfidalgo
Copy link
Contributor Author

dcfidalgo commented May 13, 2022

Ok, I will merge this tour de France when the tests pass.

A quick conclusion of this journey:

  • We figured out why Ruan had problems in the beginning with the ag news dataset. Since the dataset is big, he computed the embeddings in colab, but in the order of the original dataset from the HF Hub. In the tutorial we upload the records without an id, so the order in which the weak label object retrieves the records is random. So there was a mismatch between the records and the embeddings, that's why he basically obtained a random classifier. I will open an issue with a proposal to mitigate this issue in the future.
  • Optimizing the threshold is important, so it is not necessarily a cheap approach to improving performance. We could do some more experimentation to find reliable and cheap optimization methods.
  • In general, when one wants to extend the weak label matrix, one should aim for rules with very high precision and not necessarily a high recall, since this is what the extension of the weak label matrix is for. So one has to keep this in mind right from the start.
  • Also, the right embeddings are obviously important as well.

CC @dvsrepo @frascuchon

@frascuchon
Copy link
Member

Well, it looks like we're finally getting this feature...

As improvement points:

  1. The embeddings should be part of the records in the weak label matrix, instead of being passed as an argument in the extension. This can remove potential mismatch embedding errors.
for record in wl.records():
   record.embedding = .....

So, the extend_matrix method will accept only the thresholds argument.

  1. It would be nice to provide some default method or helper to optimize the thresholds that are stopped for the extension, given the importance of this parameter.

But it should be better to tackle them (if apply) in a different PR

@frascuchon frascuchon force-pushed the feat/extend_weak_label_matrix branch from 5aae674 to a054acc Compare May 17, 2022 08:51
@codecov
Copy link

codecov bot commented May 17, 2022

Codecov Report

Merging #1487 (b23f95c) into master (e8f5613) will increase coverage by 0.00%.
The diff coverage is 96.42%.

@@           Coverage Diff           @@
##           master    #1487   +/-   ##
=======================================
  Coverage   95.52%   95.52%           
=======================================
  Files         126      126           
  Lines        6232     6278   +46     
=======================================
+ Hits         5953     5997   +44     
- Misses        279      281    +2     
Flag Coverage Δ
pytest 95.52% <96.42%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...rubrix/labeling/text_classification/weak_labels.py 99.34% <96.42%> (-0.66%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f6ff8dc...b23f95c. Read the comment docs.

@dcfidalgo
Copy link
Contributor Author

I just talked about his comments with @frascuchon

The embeddings should be part of the records in the weak label matrix, instead of being passed as an argument in the extension. This can remove potential mismatch embedding errors.

We think it would be very helpful if one could materialize/load a WeakLabels object. Then you can easily share your dataset + weak labels, load it on a machine with GPU to add embeddings, etc.

It would be nice to provide some default method or helper to optimize the thresholds that are stopped for the extension, given the importance of this parameter.

I think more tests with different datasets and rules would be necessary to find a reliable and practical (cheap) method ... would be a good project for a student :)

@dcfidalgo dcfidalgo force-pushed the feat/extend_weak_label_matrix branch from a054acc to e48d5d8 Compare May 17, 2022 16:03
@dcfidalgo dcfidalgo merged commit 20ee14a into master May 17, 2022
Release automation moved this from In progress to Waiting Release May 17, 2022
@dcfidalgo dcfidalgo deleted the feat/extend_weak_label_matrix branch May 17, 2022 17:12
@frascuchon frascuchon moved this from Waiting Release to Ready to Release QA in Release Jun 6, 2022
@frascuchon frascuchon removed this from Ready to Release QA in Release Jun 6, 2022
frascuchon pushed a commit that referenced this pull request Jun 6, 2022
…1487)

* feat: add extend method + test

* docs: add tutorial

* improve extend_matrix docstring

* fix: fix syntax error on notebook

* fix: fix syntax error on notebook

* fix: remove mix parameter on extend_matrix

* fix: remove mix parameter on extend_matrix + type hints

* fix: remove mix parameter on extend_matrix + type hints

* chore: add faiss-cpu to dev dependencies

* docs: final version of the tutorial

* docs: small fix

* test: add more tests

Co-authored-by: ruanchaves <ruanchaves93@gmail.com>
(cherry picked from commit 20ee14a)
frascuchon pushed a commit that referenced this pull request Jun 7, 2022
…1487)

* feat: add extend method + test

* docs: add tutorial

* improve extend_matrix docstring

* fix: fix syntax error on notebook

* fix: fix syntax error on notebook

* fix: remove mix parameter on extend_matrix

* fix: remove mix parameter on extend_matrix + type hints

* fix: remove mix parameter on extend_matrix + type hints

* chore: add faiss-cpu to dev dependencies

* docs: final version of the tutorial

* docs: small fix

* test: add more tests

Co-authored-by: ruanchaves <ruanchaves93@gmail.com>
(cherry picked from commit 20ee14a)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants