Add instruction commit data cleaning code #1
Conversation
Add the dataset cleaning code.
Muennighoff left a comment:
Very neat script, amazing work!
Some initial comments - I will play around with it a bit more!
Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com>
@Muennighoff I have fixed the other problems!
dataset/instrution_commit_clean.py (Outdated)
```python
if example["old_change_range"] >= 200:
    if random.random() > LONG_SAMPLING:
        return False
```
This step removes around 25% of examples in my experiments on a 10K sample, things such as the following:
['Implement code-tests for the following checks:\n\n* com.google.fonts/check/081\n* com.google.fonts/check/083\n* com.google.fonts/check/084\n', 'refactor\n', 'weather: need to return None if date_period is out of YahooWeather data range\n', 'Eliminate redundant options for cylindrical/disc flames\n', 'var name typo: n_gramming\n', 'Replace `optparse` with `argparse`\n', 'tgrep: string literals to unicode\n', 'indent with four spaces\n', "use desired type instead of 'Any'\n", 'Clean up whitespace.\n']
Upping the range to 500 would reduce it to only 50%. Maybe worth running an ablation?
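One way to run that ablation is to expose the cutoff as a parameter instead of hard-coding 200. This is a minimal sketch, not the PR's actual code: the function name `keep_example`, the `rng` parameter, and the `LONG_SAMPLING` value of 0.1 are all illustrative assumptions.

```python
import random

LONG_SAMPLING = 0.1  # hypothetical keep-probability for long diffs


def keep_example(example, change_range_threshold=200, rng=random):
    """Down-sample examples whose diff spans many lines.

    `change_range_threshold` is the cutoff discussed above (200 vs. 500);
    exposing it as a parameter makes the ablation easy to run. `rng` is
    injectable so the sampling can be made deterministic in tests.
    """
    if example["old_change_range"] >= change_range_threshold:
        # Keep only a random LONG_SAMPLING fraction of long-range diffs.
        if rng.random() > LONG_SAMPLING:
            return False
    return True
```

Running the same corpus through `keep_example` with several thresholds then gives the removal rate per setting directly.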
Yes, it should be an option. I will explore it later!
…er its upload. - Add soft/hard filtering strategies to filter the commit dataset. - Filter out content that is too long to fit into the context window.
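The length filter mentioned in the commit message could look something like the sketch below. This is an assumption about its shape, not the PR's implementation: the function name `within_context_window` and the `max_chars` budget are hypothetical, and a real filter would likely count tokens rather than characters.

```python
def within_context_window(example, max_chars=8192):
    """Hard filter: drop commits whose old or new file contents are too
    long to fit into the model's context window.

    `max_chars` is an illustrative character budget; a production filter
    would measure length with the model's tokenizer instead.
    """
    return (len(example["old_contents"]) <= max_chars
            and len(example["new_contents"]) <= max_chars)
```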
```python
def get_line_diff_range(example):
    old_file_start = 0
    old_file_end = 0

    new_file_start = 0
    new_file_end = 0

    n_inserts = 0
    n_deletes = 0

    for group in SequenceMatcher(None, example["old_contents"].splitlines(),
                                 example["new_contents"].splitlines()).get_grouped_opcodes():
        group = [g for g in group if g[0] != "equal"]

        for element in group:
            if element[0] == "insert":
                n_inserts += element[4] - element[3]
            if element[0] == "delete":
                n_deletes += element[2] - element[1]
            if element[0] == "replace":
                n_deletes += element[2] - element[1]
                n_inserts += element[4] - element[3]

        first, last = group[0], group[-1]
        file1_range = (first[1], last[2])
        file2_range = (first[3], last[4])

        old_file_start = min(file1_range[0], old_file_start)
        old_file_end = max(file1_range[1], old_file_end)

        new_file_start = min(file2_range[0], new_file_start)
        new_file_end = max(file2_range[1], new_file_end)
```
I think there's a bug here where old_file_start and new_file_start will always be 0 rather than the actual start of the changes, since they are initialized to 0 and the min of (x, 0) where x >= 0 is always 0.
Rather, I think they should not be initialized to 0 and instead just be set to the first file1_range[0] that occurs, no?
Indeed, looking through some examples, I cannot find any that do not start with imports / licenses, i.e. the top of a file.
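The suggested fix could be sketched as below: initialize the start/end offsets to None so the first observed change sets them, instead of min(x, 0) always collapsing to 0. The name `get_line_diff_range_fixed` is hypothetical, and this trimmed version only tracks the ranges, not the insert/delete counts from the original function.

```python
from difflib import SequenceMatcher


def get_line_diff_range_fixed(example):
    """Return (old_start, old_end, new_start, new_end) line ranges
    covering all changed regions between old and new file contents."""
    old_file_start = old_file_end = None
    new_file_start = new_file_end = None

    matcher = SequenceMatcher(None,
                              example["old_contents"].splitlines(),
                              example["new_contents"].splitlines())
    for group in matcher.get_grouped_opcodes():
        group = [g for g in group if g[0] != "equal"]
        if not group:
            continue
        first, last = group[0], group[-1]
        file1_range = (first[1], last[2])
        file2_range = (first[3], last[4])

        # On the first change, take the range as-is; afterwards widen it.
        old_file_start = (file1_range[0] if old_file_start is None
                          else min(file1_range[0], old_file_start))
        old_file_end = (file1_range[1] if old_file_end is None
                        else max(file1_range[1], old_file_end))
        new_file_start = (file2_range[0] if new_file_start is None
                          else min(file2_range[0], new_file_start))
        new_file_end = (file2_range[1] if new_file_end is None
                        else max(file2_range[1], new_file_end))

    return old_file_start, old_file_end, new_file_start, new_file_end
```

For a file whose only change is on line index 2, this returns a start of 2 where the original code would have reported 0.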