Data synthesis #3
#12 mentions some features this should have
duplicates can be handled with a hashing-based lookup, or maybe even something like a Bloom filter if we didn't want much space overhead. Any strong hash function would work for duplicate detection, and then it's just a question of how much space overhead we're OK with. @abhigya-sodani proposed a nice way to approach "too similar responses": compute a pairwise distance between responses, e.g. Euclidean (L2) distance or cosine/inner-product similarity. This does run into a quadratic blowup in operations as a function of the number of examples, though. One approach to resolve that would be to maintain an approximate nearest neighbors (ANN) data structure and only compare against the closest few. I'm very unfamiliar with the current state of the art for ANN tools, let alone which ones have a quality Python interface. @mmirman, any strong opinions there? |
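The exact-duplicate half of this is straightforward; a minimal sketch of the hashing-based lookup (function name is made up for illustration):

```python
import hashlib

def dedupe_exact(responses):
    """Drop exact duplicate responses using a strong-hash lookup (sketch)."""
    seen = set()
    unique = []
    for text in responses:
        # Any strong hash works for duplicate detection; sha256 keeps the
        # space overhead at 32 bytes per stored digest.
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique

print(dedupe_exact(["a", "b", "a"]))  # → ['a', 'b']
```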
This is already an approximate measure of dataset diversity so why not just roll with that and try to make it a probabilistic approximate measure of dataset diversity? Like take a subsample of random pairs and compute diversity of that set. |
Or a few sub-cliques. I'm sure there's some statistically optimal way of doing this that doesn't matter since we aren't publishing |
bloom filter is overkill until we have users complaining |
you mean the average L2 norm of the pairs of distances? e.g. average over i=0..n, j>i of dist(i,j) |
ok cool, i'm comfortable with the subsampling then doing the all-pairs average. (not where I was gonna go, but a good place nonetheless). |
wrt embedding vector computation |
though that's not an ideal end state |
average is probably the wrong statistic, but something like it |
like, avg # of pairs sampled before a pair of distance below X? |
oh, like "min and 10th percentile distance" among subsampled pairs? |
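A statistic like that could look something like the following (hypothetical helper, again assuming pre-embedded vectors; the 10th-percentile choice is just the number floated above):

```python
import random

def pair_distance_stats(vectors, n_pairs=500, percentile=0.10, seed=0):
    """Min and low-percentile L2 distance among randomly subsampled pairs,
    as a cheap near-duplicate / diversity signal (sketch)."""
    rng = random.Random(seed)
    dists = []
    for _ in range(n_pairs):
        i, j = rng.sample(range(len(vectors)), 2)
        dists.append(
            sum((a - b) ** 2 for a, b in zip(vectors[i], vectors[j])) ** 0.5
        )
    dists.sort()
    # Min flags the single closest sampled pair; the percentile is a more
    # robust signal of how much near-duplication the dataset carries.
    return dists[0], dists[int(percentile * (len(dists) - 1))]
```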
recap: exact duplicate detection, then feature vector comparisons |
items from #24 are
|
#29 is also important |
632538f added basic duplicate detection, but there's a LOT more work to be done there; the next step is evaluating spaCy and OpenAI/LLM semantic vector embeddings, plus parameter tuning around that |
still needs K-shot example support |
We should use the larger LLM to synthesize data for training the small LLM in the optimizing API