Skip to content

Conversation

@SivilTaram
Copy link
Collaborator

No description provided.

Copy link
Collaborator

@Muennighoff Muennighoff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very neat script, amazing work!

Some initial comments - I will play a bit more with it!

Qian and others added 4 commits March 29, 2023 21:53
Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com>
Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com>
@SivilTaram
Copy link
Collaborator Author

@Muennighoff Have fixed other problems!

Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com>
Comment on lines 188 to 190
if example["old_change_range"] >= 200:
if random.random() > LONG_SAMPLING:
return False
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This step removes around 25% in my experiments on a sample of 10K, things such as below:

['Implement code-tests for the following checks:\n\n* com.google.fonts/check/081\n* com.google.fonts/check/083\n* com.google.fonts/check/084\n', 'refactor\n', 'weather: need to return None if date_period is out of YahooWeather data range\n', 'Eliminate redundant options for cylindrical/disc flames\n', 'var name typo: n_gramming\n', 'Replace `optparse` with `argparse`\n', 'tgrep: string literals to unicode\n', 'indent with four spaces\n', "use desired type instead of 'Any'\n", 'Clean up whitespace.\n']

Upping the range to 500, would reduce it to only 50%. Maybe worth running an ablation?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it should be an option. I will explore it later!

@Muennighoff Muennighoff mentioned this pull request Mar 30, 2023
SivilTaram added 4 commits March 31, 2023 11:40
…er its upload.

- Add soft/hard filtering strategies to filter out commit dataset.
- Filter content with much long length since they cannot be fit into the context window.
Comment on lines +89 to +120
def get_line_diff_range(example):
old_file_start = 0
old_file_end = 0

new_file_start = 0
new_file_end = 0

n_inserts = 0
n_deletes = 0

for group in SequenceMatcher(None, example["old_contents"].splitlines(),
example["new_contents"].splitlines()).get_grouped_opcodes():
group = [g for g in group if g[0] != "equal"]

for element in group:
if element[0] == "insert":
n_inserts += element[4] - element[3]
if element[0] == "delete":
n_deletes += element[2] - element[1]
if element[0] == "replace":
n_deletes += element[2] - element[1]
n_inserts += element[4] - element[3]

first, last = group[0], group[-1]
file1_range = (first[1], last[2])
file2_range = (first[3], last[4])

old_file_start = min(file1_range[0], old_file_start)
old_file_end = max(file1_range[1], old_file_end)

new_file_start = min(file2_range[0], new_file_start)
new_file_end = max(file2_range[1], new_file_end)
Copy link
Collaborator

@Muennighoff Muennighoff Apr 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there's a bug here where old_file_start & new_file_start will always be 0 rather than the actual start of the new changes, since they are initialized to 0 and the min of (x, 0) where x>=0 is always 0.

Rather, I think they should not be initialized to 0 and instead just set to the first file1_range[0] that occurs, no?

Indeed looking through some examples, I cannot find any that do not start with imports / licenses, i.e. the top of a file

@SivilTaram SivilTaram closed this Aug 15, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants