Add instruction commit data cleaning code #1
Conversation
Add the dataset cleaning code.
Muennighoff left a comment:
Very neat script, amazing work!
Some initial comments - I will play around with it a bit more!
Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com>
@Muennighoff I have fixed the other problems!
dataset/instrution_commit_clean.py (Outdated)
```python
if example["old_change_range"] >= 200:
    if random.random() > LONG_SAMPLING:
        return False
```
This step removes around 25% of examples in my experiments on a 10K sample, things such as the following:
['Implement code-tests for the following checks:\n\n* com.google.fonts/check/081\n* com.google.fonts/check/083\n* com.google.fonts/check/084\n', 'refactor\n', 'weather: need to return None if date_period is out of YahooWeather data range\n', 'Eliminate redundant options for cylindrical/disc flames\n', 'var name typo: n_gramming\n', 'Replace `optparse` with `argparse`\n', 'tgrep: string literals to unicode\n', 'indent with four spaces\n', "use desired type instead of 'Any'\n", 'Clean up whitespace.\n']
Upping the range to 500 would reduce it to only 50%. Maybe worth running an ablation?
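One way to run that ablation is to expose the cutoff as a parameter instead of hard-coding 200. This is a minimal sketch, not the PR's actual code: the function name `keep_example`, the `rng` parameter, and the `LONG_SAMPLING` value of 0.1 are all illustrative assumptions.

```python
import random

LONG_SAMPLING = 0.1  # hypothetical keep-probability for long diffs


def keep_example(example, change_range_threshold=200, rng=random):
    """Down-sample examples whose diff spans many lines.

    `change_range_threshold` is the cutoff discussed above (200 vs. 500);
    exposing it as a parameter makes the ablation easy to run. `rng` is
    injectable so the sampling can be made deterministic in tests.
    """
    if example["old_change_range"] >= change_range_threshold:
        # Keep only a random LONG_SAMPLING fraction of long-range diffs.
        if rng.random() > LONG_SAMPLING:
            return False
    return True
```

Running the same corpus through `keep_example` with several thresholds then gives the removal rate per setting directly.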
Yes, it should be an option. I will explore it later!
…er its upload. - Add soft/hard filtering strategies to filter the commit dataset. - Filter out content that is too long to fit into the context window.
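The length filter mentioned in the commit message could look something like the sketch below. This is an assumption about its shape, not the PR's implementation: the function name `within_context_window` and the `max_chars` budget are hypothetical, and a real filter would likely count tokens rather than characters.

```python
def within_context_window(example, max_chars=8192):
    """Hard filter: drop commits whose old or new file contents are too
    long to fit into the model's context window.

    `max_chars` is an illustrative character budget; a production filter
    would measure length with the model's tokenizer instead.
    """
    return (len(example["old_contents"]) <= max_chars
            and len(example["new_contents"]) <= max_chars)
```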
```python
def get_line_diff_range(example):
    old_file_start = 0
    old_file_end = 0

    new_file_start = 0
    new_file_end = 0

    n_inserts = 0
    n_deletes = 0

    for group in SequenceMatcher(None, example["old_contents"].splitlines(),
                                 example["new_contents"].splitlines()).get_grouped_opcodes():
        group = [g for g in group if g[0] != "equal"]

        for element in group:
            if element[0] == "insert":
                n_inserts += element[4] - element[3]
            if element[0] == "delete":
                n_deletes += element[2] - element[1]
            if element[0] == "replace":
                n_deletes += element[2] - element[1]
                n_inserts += element[4] - element[3]

        first, last = group[0], group[-1]
        file1_range = (first[1], last[2])
        file2_range = (first[3], last[4])

        old_file_start = min(file1_range[0], old_file_start)
        old_file_end = max(file1_range[1], old_file_end)

        new_file_start = min(file2_range[0], new_file_start)
        new_file_end = max(file2_range[1], new_file_end)
```
I think there's a bug here where old_file_start and new_file_start will always be 0 rather than the actual start of the changes, since they are initialized to 0 and the min of (x, 0) where x >= 0 is always 0.
Rather, I think they should not be initialized to 0 and instead just be set to the first file1_range[0] that occurs, no?
Indeed, looking through some examples, I cannot find any that do not start with imports / licenses, i.e. the top of a file.
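The suggested fix could be sketched as below: initialize the start/end offsets to None so the first observed change sets them, instead of min(x, 0) always collapsing to 0. The name `get_line_diff_range_fixed` is hypothetical, and this trimmed version only tracks the ranges, not the insert/delete counts from the original function.

```python
from difflib import SequenceMatcher


def get_line_diff_range_fixed(example):
    """Return (old_start, old_end, new_start, new_end) line ranges
    covering all changed regions between old and new file contents."""
    old_file_start = old_file_end = None
    new_file_start = new_file_end = None

    matcher = SequenceMatcher(None,
                              example["old_contents"].splitlines(),
                              example["new_contents"].splitlines())
    for group in matcher.get_grouped_opcodes():
        group = [g for g in group if g[0] != "equal"]
        if not group:
            continue
        first, last = group[0], group[-1]
        file1_range = (first[1], last[2])
        file2_range = (first[3], last[4])

        # On the first change, take the range as-is; afterwards widen it.
        old_file_start = (file1_range[0] if old_file_start is None
                          else min(file1_range[0], old_file_start))
        old_file_end = (file1_range[1] if old_file_end is None
                        else max(file1_range[1], old_file_end))
        new_file_start = (file2_range[0] if new_file_start is None
                          else min(file2_range[0], new_file_start))
        new_file_end = (file2_range[1] if new_file_end is None
                        else max(file2_range[1], new_file_end))

    return old_file_start, old_file_end, new_file_start, new_file_end
```

For a file whose only change is on line index 2, this returns a start of 2 where the original code would have reported 0.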