Notus end2end example for preference and instruction generation #145

Merged: 31 commits into main from docs/notus_end2end on Jan 11, 2024

Conversation

ignacioct
Contributor

@ignacioct ignacioct commented Dec 7, 2023

Closes #143

First draft of the tutorial. A few things to keep in mind:

  • The conclusions are missing; I'll write them once the workflow is settled, in case something changes.
  • I'm unsure about the fine-tuning process: using the openai framework fine-tunes a gpt-3.5 model, but maybe we want to fine-tune the Notus model instead. In that case, how should I approach it with the Argilla Trainer?

Looking forward to your review! This is my first time going through the distilabel process on my own, so there are probably a lot of little things to tweak :)

@ignacioct ignacioct self-assigned this Dec 7, 2023
@dosubot dosubot bot added the size:XS label Dec 7, 2023
@ignacioct ignacioct marked this pull request as draft December 7, 2023 20:26
@davidberenstein1957
Member

davidberenstein1957 commented Jan 3, 2024

@ignacioct, it looks good already, but I've got some comments about how the different aspects are introduced and about the structure.

"Setting up an inference endpoint with Notus" also includes another section about "Defining a custom generation task for a distilabel pipeline".
I think it is really useful to showcase the custom task but we might need some context about that it is possible and we just use it for testing the pipeline etc.
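For context, such a custom task might look roughly like the sketch below, assuming distilabel's 0.x-era task API; the class name, system prompt, and QA template are illustrative, not the tutorial's actual code, and imports may differ by version.

```python
# Illustrative sketch of a custom distilabel generation task (0.x-era API).
# The class name, prompts, and output key are assumptions, not the tutorial's code.
from distilabel.tasks import Prompt, TextGenerationTask


class QuestionAnsweringTask(TextGenerationTask):
    def generate_prompt(self, input: str) -> Prompt:
        # Wrap each input in a QA-style prompt for the model.
        return Prompt(
            system_prompt="You are an assistant answering questions about the EU AI Act.",
            formatted_prompt=f"Question: {input}\nAnswer:",
        )

    def parse_output(self, output: str) -> dict:
        # Map the raw completion onto the expected output column.
        return {"generations": output.strip()}
```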

Similarly "Download the AI Act PDF document" also cover "Creating a RAG pipeline using Deepset"

Also, for the separate parts we might add some additional introduction about what is happening and why, for example for the Deepset index.

I would maybe make the introduction and overview a bit catchier and focus them on the AI Act, etc. Something like:
"Use Notus on inference endpoints to create a legal preference dataset based on RAG instructions from the EU AI Act"

You can also emphasize a bit why this is important to do and what the end user might gain from this approach over other ones.

Maybe some of the printed output is a bit long (about the batches etc.)

Also, at the end, I feel I'm missing some wrap-up about the OpenAI fine-tune and why we would need it/how we can use it. Maybe fine-tuning can be skipped and we can redirect people to some of our other tutorials: https://docs.argilla.io/en/latest/tutorials_and_integrations/tutorials/tutorials.html

@ignacioct ignacioct marked this pull request as ready for review January 3, 2024 12:21
@ignacioct
Contributor Author

@davidberenstein1957 I've implemented your suggestions; just a few doubts:

  • If the outputs are too long or don't add value, should I just erase them? Or how else should I deal with them?
  • When you suggest adding additional context for the Deepset index, which part exactly do you mean?
  • I'm still unsure whether I should delete the fine-tuning part. Right now I'm tempted to delete it, point to another tutorial, and focus on writing a conclusion that emphasizes why this is important and valuable.

@davidberenstein1957
Member

davidberenstein1957 commented Jan 4, 2024

  • I think erasing them or printing part of them is fine, e.g. for a list, show the index; for a text, show the first x characters (see the sketch after this list).
  • Why are we using Deepset, and what are the steps we go through? I believe we just use it to parse and process the text, but it is good to communicate that clearly :)
  • We can leave it for now and wait for @dvsrepo. I vote for deleting it and redirecting readers to other content that focuses on fine-tuning through DPO, preferably using an open-source model.
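A trivial sketch of that cropping pattern; the variable names and data are made up:

```python
# Hypothetical example of trimming verbose notebook output: show only the
# first few items of a list and the first characters of each long string.
generations = ["a very long generated answer ..."] * 64  # stand-in data

for i, text in enumerate(generations[:3]):
    print(f"[{i}] {text[:80]}...")
print(f"... plus {len(generations) - 3} more generations")
```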

@davidberenstein1957
Member

davidberenstein1957 commented Jan 4, 2024

@ignacioct, looking much better already :)
For a nice, problem-oriented title style, see for example: https://docs.giskard.ai/en/latest/tutorials/llm_tutorials/index.html
Some follow-ups:

  • I would also rename the title to the one I proposed earlier.
  • I would mention that the RAG part is not based on active queries/semantic search but rather a more brute-force approach.
  • Maybe a step-by-step (1, 2, 3) overview in the introduction would be nice.

@ignacioct
Contributor Author

> I would also rename the title to the one I proposed earlier.

Maybe something a little bit shorter, like "Use Notus on inference endpoints to create a legal preference dataset"?

@davidberenstein1957
Member

@ignacioct, great work. It looks very complete.

3 minor remarks:

  • Llama2QuestionAnsweringTask: maybe rename it to QuestionAnsweringTask, which is more generic.
  • Can you rename the file to something more explicit, aligned with the other examples?
  • Could some of the output in the cells be cropped a bit? I think just showing the keys or text[:x] would suffice in most cases.

@ignacioct ignacioct merged commit 7b22080 into main Jan 11, 2024
4 checks passed
@ignacioct ignacioct deleted the docs/notus_end2end branch January 11, 2024 08:18
Merging this pull request closed #143: [DOCS] tutorial on using Notus for preference and instruction generation