Data synthesis #3
#12 mentions some features this should have
duplicates can be handled with a hashing-based lookup, or maybe even something like a Bloom filter if we didn't want much space overhead. Any strong hash function would work for duplicate detection, and then it's just a question of how much space overhead we're OK with. @abhigya-sodani proposed a nice way to approach "too similar responses": compute a pairwise distance between responses, e.g. Euclidean (L2) distance or cosine/inner-product similarity. This does run into a quadratic blowup in operations as a function of the number of examples, though. One approach to resolve that would be to maintain an approximate nearest neighbors (ANN) data structure and only compare against the closest few. I'm very unfamiliar with the current state of the art for ANN tools, let alone which ones have a quality Python interface. @mmirman, any strong opinions there? |
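The exact-duplicate half of this is straightforward; a minimal sketch of the hashing-based lookup (function name is made up for illustration):

```python
import hashlib

def dedupe_exact(responses):
    """Drop exact duplicate responses using a strong-hash lookup (sketch)."""
    seen = set()
    unique = []
    for text in responses:
        # Any strong hash works for duplicate detection; sha256 keeps the
        # space overhead at 32 bytes per stored digest.
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique

print(dedupe_exact(["a", "b", "a"]))  # → ['a', 'b']
```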
This is already an approximate measure of dataset diversity so why not just roll with that and try to make it a probabilistic approximate measure of dataset diversity? Like take a subsample of random pairs and compute diversity of that set. |
Or a few sub-cliques. I'm sure there's some statistically optimal way of doing this that doesn't matter since we aren't publishing |
bloom filter is overkill until we have users complaining |
you mean the average L2 norm of the pairs of distances? e.g. average over i=0..n, j>i of dist(i,j) |
ok cool, i'm comfortable with the subsampling then doing the all-pairs average. (not where I was gonna go, but a good place nonetheless). |
wrt embedding vector computation |
though that's not an ideal end state |
average is probably the wrong statistic, but something like it |
like, avg # of pairs sampled before a pair of distance below X? |
oh, like "min and 10th percentile distance" among subsampled pairs? |
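A statistic like that could look something like the following (hypothetical helper, again assuming pre-embedded vectors; the 10th-percentile choice is just the number floated above):

```python
import random

def pair_distance_stats(vectors, n_pairs=500, percentile=0.10, seed=0):
    """Min and low-percentile L2 distance among randomly subsampled pairs,
    as a cheap near-duplicate / diversity signal (sketch)."""
    rng = random.Random(seed)
    dists = []
    for _ in range(n_pairs):
        i, j = rng.sample(range(len(vectors)), 2)
        dists.append(
            sum((a - b) ** 2 for a, b in zip(vectors[i], vectors[j])) ** 0.5
        )
    dists.sort()
    # Min flags the single closest sampled pair; the percentile is a more
    # robust signal of how much near-duplication the dataset carries.
    return dists[0], dists[int(percentile * (len(dists) - 1))]
```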
recap: exact duplicate detection, then feature vector comparisons |
items from #24 are
|
#29 is also important |
632538f added basic duplicate detection, but there's a LOT more work to be done there; the next step is evaluating spaCy and OpenAI/LLM semantic vector embeddings, plus parameter tuning around that |
still needs K-shot example support |
We should use the larger LLM to synthesize data for training the small LLM in the optimizing API