Data synthesis #3

Closed
mmirman opened this issue Jun 9, 2023 · 16 comments
Assignees: mmirman
Labels: feat/enhancement (New feature or request), good first issue (Good for newcomers)

Comments

@mmirman
Contributor

mmirman commented Jun 9, 2023

We should use the larger LLM to synthesize data for training the small LLM in the optimizing API.

@mmirman added the $200, feat/enhancement (New feature or request), and good first issue (Good for newcomers) labels on Jun 13, 2023
@mmirman self-assigned this on Jun 14, 2023
@abhigya-sodani
Collaborator

#10

@cartazio
Contributor

#12 mentions some features this should have:

  1. duplicate / too-similar response detection

Exact duplicates can be handled with a hashing-based lookup, or maybe even something like a Bloom filter if we wanted to keep memory use low. Any strong hash function would work for duplicate detection; then it's just a question of how much space overhead we're OK with.
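A minimal sketch of that hash-based exact-duplicate check (the class and helper names are hypothetical, not the project's API):

```python
import hashlib


def normalize(text: str) -> str:
    # Cheap normalization so trivial whitespace/case differences still count as duplicates.
    return " ".join(text.lower().split())


class ExactDedup:
    """Keeps SHA-256 digests of responses seen so far and flags exact repeats."""

    def __init__(self) -> None:
        self._seen: set[bytes] = set()

    def is_duplicate(self, response: str) -> bool:
        digest = hashlib.sha256(normalize(response).encode("utf-8")).digest()
        if digest in self._seen:
            return True
        self._seen.add(digest)
        return False
```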

@abhigya-sodani proposed a nice way to approach "too similar" responses: namely an embedding-distance measure (Euclidean/L2 or cosine similarity) over pairs of responses. This does run into a quadratic blow-up in comparisons as a function of the number of examples. One approach to resolve that would be to maintain an approximate-nearest-neighbors data structure and only compare against the closest few. I'm very unfamiliar with the current state of the art for ANN tools, let alone which ones have a quality Python interface; @mmirman, any strong opinions there?
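One way the "too similar" check could look as a brute-force cosine-similarity pass (embeddings assumed precomputed; an ANN index could later replace the inner loop once the dataset grows):

```python
import numpy as np


def is_too_similar(candidate: np.ndarray, kept: list[np.ndarray], threshold: float = 0.95) -> bool:
    """Return True if the candidate embedding is within `threshold` cosine similarity of any kept embedding."""
    c = candidate / np.linalg.norm(candidate)
    for k in kept:
        sim = float(np.dot(c, k / np.linalg.norm(k)))
        if sim >= threshold:
            return True
    return False
```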

@mmirman
Contributor Author

mmirman commented Jun 27, 2023

This is already an approximate measure of dataset diversity, so why not just roll with that and make it a probabilistic approximate measure of dataset diversity? E.g., take a random subsample of pairs and compute the diversity of that set.
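A sketch of that probabilistic estimate, averaging pairwise distance over randomly sampled pairs (sample size is a placeholder):

```python
import random

import numpy as np


def estimate_diversity(embeddings: list[np.ndarray], n_pairs: int = 500, seed: int = 0) -> float:
    """Mean pairwise L2 distance over a random subsample of pairs; higher means more diverse."""
    rng = random.Random(seed)
    n = len(embeddings)
    if n < 2:
        return 0.0
    dists = []
    for _ in range(n_pairs):
        i, j = rng.sample(range(n), 2)
        dists.append(float(np.linalg.norm(embeddings[i] - embeddings[j])))
    return float(np.mean(dists))
```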

@mmirman
Contributor Author

mmirman commented Jun 27, 2023

Or a few sub-cliques. I'm sure there's some statistically optimal way of doing this, but it doesn't matter since we aren't publishing.

@mmirman
Contributor Author

mmirman commented Jun 27, 2023

A Bloom filter is overkill until we have users complaining.

@cartazio
Contributor

> This is already an approximate measure of dataset diversity, so why not just roll with that and make it a probabilistic approximate measure of dataset diversity? E.g., take a random subsample of pairs and compute the diversity of that set.

You mean the average L2 distance over the pairs? e.g. `avg_{0 ≤ i < j ≤ n} dist(i, j)`

OK, cool. I'm comfortable with subsampling and then taking the all-pairs average (not where I was going to go, but a good place nonetheless).

@cartazio
Contributor

Regarding embedding-vector computation: text-embedding-ada-002 would be the model invoked in the preliminary version, I suppose, at least for the OpenAI-flavored approach, right?
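For the preliminary OpenAI-flavored path, the embedding call could look roughly like this (sketched against the pre-1.0 `openai` Python SDK that was current at the time; the helper name is hypothetical):

```python
import openai  # pip install openai (pre-1.0 SDK assumed here)


def embed(texts: list[str]) -> list[list[float]]:
    """Fetch embedding vectors for a batch of texts via text-embedding-ada-002."""
    response = openai.Embedding.create(model="text-embedding-ada-002", input=texts)
    return [item["embedding"] for item in response["data"]]
```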

@cartazio
Contributor

Though that's not an ideal end state.

@mmirman
Contributor Author

mmirman commented Jun 27, 2023 via email

@mmirman
Contributor Author

mmirman commented Jun 27, 2023 via email

@cartazio
Contributor

Oh, like "min and 10th percentile distance" among subsampled pairs?
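For instance, summarizing the subsampled pair distances by their minimum and 10th percentile instead of (or alongside) the mean; a small sketch:

```python
import numpy as np


def diversity_summary(pair_distances: list[float]) -> dict[str, float]:
    # A low min / low 10th-percentile distance flags clusters of near-duplicate responses
    # even when the mean pairwise distance still looks healthy.
    d = np.asarray(pair_distances)
    return {"min": float(d.min()), "p10": float(np.percentile(d, 10)), "mean": float(d.mean())}
```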

@cartazio
Contributor

Recap: exact duplicate detection first, then feature-vector comparisons.

@cartazio mentioned this issue on Jun 29, 2023
@cartazio
Contributor

Items from #24 are:

  • Add all the parameters
  • Prompt variation, i.e. support alternative prompts, possibly with different defaults for each supported model
  • Track the number of actual responses vs. requested, and other possible statistics on quality
  • Dedup handling:
    - [ ] First, exact repeat duplicate detection
    - [ ] Then semantic vector comparisons for duplicate detection (this is possibly going to be a bit more fiddly)
    - [ ] Consider adding parameters for the diverse ways you might want to define exact matches or vector comparisons (see the sketch after this list)
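A hypothetical shape for those dedup parameters (illustrative only, not the project's actual config):

```python
from dataclasses import dataclass


@dataclass
class DedupConfig:
    # Exact-match handling: whether to normalize whitespace/case before hashing.
    normalize_whitespace: bool = True
    case_insensitive: bool = True
    # Semantic handling: embedding backend and the similarity cutoff above which
    # a new response is treated as a duplicate of an existing one.
    embedding_backend: str = "openai"  # or "spacy"
    similarity_threshold: float = 0.95
```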

@cartazio
Contributor

#29 is also important

@mmirman added this to the ICML milestone on Jun 30, 2023
@cartazio
Contributor

cartazio commented Jul 5, 2023

632538f added basic duplicate detection, but there's a LOT more work to be done there; the next step is evaluating spaCy and OpenAI/LLM semantic vector embeddings and the parameter tuning around that.
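A rough sketch of the spaCy side of that comparison (assumes the `en_core_web_md` model is installed; the harness around it is hypothetical):

```python
import spacy

# The medium English model ships with static word vectors, so Doc.similarity gives a
# cheap, local cosine-similarity baseline to compare against the API embeddings.
nlp = spacy.load("en_core_web_md")


def spacy_similarity(a: str, b: str) -> float:
    return float(nlp(a).similarity(nlp(b)))
```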

@cartazio
Contributor

cartazio commented Jul 6, 2023

Still needs K-shot example support.
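K-shot support could amount to prepending K worked examples to the synthesis prompt; a minimal, purely hypothetical sketch:

```python
def build_k_shot_prompt(instruction: str, examples: list[tuple[str, str]], query: str, k: int = 3) -> str:
    """Prepend up to k (input, output) example pairs to the prompt before the new query."""
    parts = [instruction.strip(), ""]
    for inp, out in examples[:k]:
        parts += [f"Input: {inp}", f"Output: {out}", ""]
    parts += [f"Input: {query}", "Output:"]
    return "\n".join(parts)
```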
