Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Initial Migration of Fuzz Tests & Integration Scripts From the OSS-Fuzz Project Repo #1901

Merged

Conversation

DaveLak
Copy link
Contributor

@DaveLak DaveLak commented Apr 12, 2024

As discussed in #1889, this PR introduces the changes necessary to migrate the OSS-Fuzz fuzz tests and configuration scripts from the OSS-Fuzz repository into GitPython's repo.

Summary of Changes

Most of the changes proposed here are the same as described in the discussion thread linked above. Beyond that, there are a few minor differences and potentially non-obvious details that I want to call out here:

Directory Structure

Initially, I proposed a single, flat "fuzzing/" directory at the top level, but when I moved the files, I felt that a few sub-directories would help with organization.

The fuzzing/READEME.md file introduced in this PR has a section describing them and some additional documentation.

Updates to Fuzz Targets

The fuzz test files in this PR include the changes that were originally proposed in google/oss-fuzz#11763.

Seed Data

Part of efficiency improvements I proposed in google/oss-fuzz#11763 included the addition of dictionary files (described in the new README) and seed corpus zip files.

I stored these in a repo that I maintain: https://github.com/DaveLak/oss-fuzz-inputs/tree/main/gitpython, because OSS-Fuzz has understandable concerns about seed data bloating the size of their repo.

In this PR:

Next Steps

@Byron & @EliahKagan, please:

  1. Let me know what you think of the changes here.
  2. Take a look at the related PR I have open in the OSS-Fuzz repo: [gitpython]: Move Fuzz Tests & Configuration Upstream google/oss-fuzz#11803 and let me know if theres anything I should change. A comment from each of you acknowledging and approving the changes there is required before it can be merged.
  3. If either of you could assist me with validating that the fuzzing/ directory doesn't get picked up by the publishing process, I'd appreciate it! I did what I could to test it locally but I'm not familiar enough with the release process to be confident.

Thanks! I'll address feedback any feedback as quickly as possible.

Migrates the OSS-Fuzz tests and setup scripts from the OSS-Fuzz
repository to GitPython's repo as discussed here:
gitpython-developers#1887 (comment)

These files include the changes that were originally proposed in:
google/oss-fuzz#11763

Additional changes include:
- A first pass at documenting the contents of the fuzzing set up in a
  dedicated README.md
- Adding the dictionary files to this repo for improved visibility. Seed
  corpra zips are still located in an external repo pending further
 discussion regarding where those should live in the long term.
Adds additional documentation links and fixes some typos.
- Updates the fuzzing documentation to include steps for working with
  locally modified versions of the gitpython repository.
- Updates the build.sh script to make the fuzz target search path more
  specific, reducing the risk of local OSS-Fuzz builds picking up
  files located outside of where we expect them (for example, in a .venv
  directory.)
- add artifacts produced by local OSS-Fuzz runs to gitignore
- Fix typos in the documentation on dictionaries
- Link to the fuzzing directory in the main README where it is
  referenced.
@DaveLak DaveLak marked this pull request as ready for review April 12, 2024 05:00
DaveLak added a commit to DaveLak/oss-fuzz that referenced this pull request Apr 12, 2024
Updates the gitpython project files to enable migrating and maintaining
fuzz targets and build scripts upstream.

Related PR in the upstream repo: gitpython-developers/GitPython#1901
Copy link
Member

@Byron Byron left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks so much for this wonderful PR! It's very well thought out and a delight. Particularly the fuzzing/README.md is a great introduction to how it all works. >[!TIP] is also new to me, and something I hope to be using more often in future.

About Fuzzing-Fixtures

Overall, the extra-files are just about 2MB in total, and I wonder how stable they will be. If stable, I could imagine including them in this repo.
If there is any concerns about this, gitpython-developers could fork the repo and make you an admin on it.

About archives for publishing on PyPy

I created a new archive from the branch of this PR with python -m build --sdist --wheel and didn't see any trace of fuzzing in the produced archive.

Screenshot 2024-04-13 at 08 05 16

It seems good.

Regarding the OSS fuzz PR

I didn't review it for correctness, but approved it in case it helps getting it merge.

Regarding my review

I just glanced at the shell scripts, and basically ignored the google-provided fuzz targets.

All that's left for me is to figure out where to put the ~2MB of corpus files.

And, thanks again for picking this up :)!

.gitignore Show resolved Hide resolved
@DaveLak
Copy link
Contributor Author

DaveLak commented Apr 14, 2024

Thanks, I appreciate the kind words 🙂 And yes, >[!TIP] is a handy feature for sure. There are a few others as well.

About Fuzzing-Fixtures

Overall, the extra-files are just about 2MB in total, and I wonder how stable they will be. If stable, I could imagine including them in this repo. If there is any concerns about this, gitpython-developers could fork the repo and make you an admin on it.

I think a dedicated repo under the gitpython-developers org would be the best option. No need to fork mine though, a new one with whatever license you prefer would be fine. I have no preference as far my own permissions though.

Ideally, a seed corpus would be added for each new fuzz target. Obviously that means more files but it doesn't necessarily mean "10 fuzz targets = 10mb". The size & number of interesting inputs will change over time and depending on the complexity/number of code paths in the functionality under test, and it's hard to know what that equates to in terms of per target corpus size at this point. The two zips in my repo right now actually demonstrate that pretty nicely: both of them basically top out their respective coverage quickly but one zip is ~450kb while the other is ~1.5mb.

I like how Bitcoin Core handles in https://github.com/bitcoin-core/qa-assets. From what I've seen, that seems like the cleanest and most flexible.

About archives for publishing on PyPy

I created a new archive from the branch of this PR with python -m build --sdist --wheel and didn't see any trace of fuzzing in the produced archive.

Regarding the OSS fuzz PR

I didn't review it for correctness, but approved it in case it helps getting it merge.

Regarding my review

I just glanced at the shell scripts, and basically ignored the google-provided fuzz targets.

Awesome thanks!

@Byron
Copy link
Member

Byron commented Apr 14, 2024

I think a dedicated repo under the gitpython-developers org would be the best option. No need to fork mine though, a new one with whatever license you prefer would be fine. I have no preference as far my own permissions though.

Great! I have created a fork of your repository under this organization and added you as maintainer. This should give you enough permissions, but if not, you could be admin as well.

Screenshot 2024-04-14 at 08 35 46

You see that I named it after the same repository of the bitcoin organization.

Once you adjusted the repository URL to the fork under this organisation's control, the PR would be ready for merge from my side.

Thanks again for all your work here!

@DaveLak
Copy link
Contributor Author

DaveLak commented Apr 15, 2024

Thanks @Byron! I've updated the URL to unblock this PR and I'll follow up with a PR to cleanup the new qa-assets repo following the pattern of the Bitcoin one.

Copy link
Contributor

@EliahKagan EliahKagan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At least one and possibly two changes will be needed to comply with the terms of the Apache License. That license imposes requirements for distributing the licensed work, including, in subsection 4(a):

You must give any other recipients of the Work or Derivative Works a copy of this License

This is to say that merely preserving the license headers (even with other acknowledgement) is not sufficient, and the license text itself needs to be included somewhere in the GitPython repository.

It seems to me that the most hassle-free way to achieve this is to copy the license file from the repository the Apache 2.0 licensed Python scripts originally came from (so long as there is nothing unusual about it) into the fuzzing/ directory, and name it LICENSE-APACHE.

The reason to put in the fuzzing/ directory rather than in the fuzzing/fuzz-targets/ subdirectory that contains those scripts, is that new files not covered by that license may end up in there later. So putting it there wouldn't really make it more specific. There is also a benefit to having it in a directory that also contains a README.md that covers, among other code, the code it pertains to--especially if the details about what it covers are also moved from the main readme into that one, as I suggest in a review comment below. But it could go in the fuzzing/fuzz-targets/ subdirectory if you prefer.

It shouldn't go at the top level of the repository, because that could lead to confusion, and also because (at least without further changes) it would end up in built sdist and wheel packages. Currently nothing from the fuzzing/ directory is included in such packages; I have verified this (as has Byron).

The slightly trickier part is that the Apache License also requires that changes be noted. Per 4(b):

You must cause any modified files to carry prominent notices stating that You changed the files

I believe that does apply here. The subtle aspect of this is that this is often not followed, and I think is not expected to be followed, when proposing changes for inclusion in an existing project that the files came from. For example, pull requests on a project that uses the Apache License often do not, and I think are not expected to, add any self-attribution, if the contributors decide they don't need to be attributed individually.

So I think anything that has formerly been published under this license in the OSS-Fuzz repository or other such places, if it included your modifications there, should not require that anything be added to put it here.

But it looks like your changes to the Apache licensed files were not ever included there. If this is where they are first published, then I think a note has to be added. If you want it to be a copyright notice naming you, I think that is fine. If not, then it could just say "This file has been modified by" followed by your name, or "GitPython contributors," or "contributors to GitPython," or something like that.

Please definitely do say if you think I am mistaken about any of that, or even if you are not sure but are worried I may be mistaken.

Besides the above, I think there are no other blockers for merging this. I've posted a number of comments below, some recommending other changes.

  • A small number pertain to how the licensing situation is described. Unlike the above, none are important enough to delay merging this. However, those are changes I might not feel comfortable making separately without checking with you first. I recommend at least reading those, and they may lead to changes in this PR. But I emphasize that they do not need to stand in the way of this being merged.
  • Most others are technical, and I believe at least one identifies a minor bug, but none are anywhere near serious enough that they need to delay merging, and you can even ignore them completely if you like. For these, if you happen to find any that recommend changes that you would prefer not be made, then you may want to state that, since unlike changes to how the licensing situation is documented, I would not otherwise feel a need to exercise any special caution when making such changes later. (You can also make changes in this PR based on them, but I am not asking you to do that and I don't want to delay this unnecessarily or push you into doing any further work on this that you may not wish to do at this time.)

README.md Outdated Show resolved Hide resolved
fuzzing/fuzz-targets/fuzz_config.py Show resolved Hide resolved
fuzzing/fuzz-targets/fuzz_tree.py Show resolved Hide resolved
fuzzing/oss-fuzz-scripts/build.sh Show resolved Hide resolved
shutil.rmtree(git_dir)

os.mkdir(git_dir)
with open(head_file, "w") as f:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless there is a specific reason to do otherwise, I recommend specifying an encoding explicitly, for example, encoding="utf-8". Otherwise it is unclear if the reason it is omitted is just because it is expected not to make a practical difference across different platforms, or because a difference is actually desired. (I am guessing the reason is the former, because I don't think the fuzz tests run on Windows, which is where these issues usually come into play.) If omitting it is intentional, then it might be worthwhile to include a comment to state the reason.

python3 -m pip install .

# Directory to look in for dictionaries, options files, and seed corpa:
SEED_DATA_DIR="$SRC/seed_data"
Copy link
Contributor

@EliahKagan EliahKagan Apr 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think SRC comes from outside this script and is set in such a way that it is very unlikely to start with -. I have refrained from mentioning places where -- should be added before arguments that have $SRC as a prefix if the value of SRC can begin, even accidentally, with -. Most commands accept --, but not all, and I have no objection to omitting it on the occasion it is truly known not to be needed. I'll assume such comments may not be wanted, unless you request them. The same goes for container-environment-boostrap.sh.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think SRC comes from outside this script and is set in such a way that it is very unlikely to start with -.

You are correct. I explained the OSS-Fuzz provided env vars a bit more in this comment: #1901 (comment). These variables are certainly a case where -- is truly known not to be needed, because if the omission of it results in these scripts breaking then most (more likely, all) of the other OSS-Fuzz projects would break as well.


That said, I wasn't particularly happy with having the "$SRC/seed_data" directory specified in two separate scripts, where one implicitly depends on the other creating it. If nothing else, that feels typo prone.

Initially I considered having a third script that just defined the shared variable and sourceing it in each, but that felt like overkill.

Ultimately, I think the best thing to do would be to remove the need for that directory all together. When I get around to the changes I described in a reply to you here: #1903 (comment), I should be able to do all of the dictionary & corpus zip creation inside container-environment-bootstrap.sh (i.e., during the Docker image build step) and put everything in $OUT/ directly from there. That would allow the removal of the "$SRC/seed_data" directory completely, and also make the fuzz harness build step more efficient because it wouldn't need to produce the fixture artifacts every time.

I'm open to alternative suggestions though 🙂

I'll leave this comment unresolved for now for my own reference

fuzzing/oss-fuzz-scripts/build.sh Show resolved Hide resolved
fuzzing/oss-fuzz-scripts/build.sh Show resolved Hide resolved
@DaveLak
Copy link
Contributor Author

DaveLak commented Apr 15, 2024

Thanks for the thorough review @EliahKagan! It's very helpful, and I'm particularly appreciative of the thoughtful explanations and suggestions you shared.

I'll push some changes shortly to address your feedback, prioritizing the updates related to the license, so we can get this merged and unblock the downstream OSS-Fuzz PR.

As for anything else that you explicitly noted as non-blocking: I'll either address them in a follow-up, or reply to in the
relevant comments to discuss further. I do think they're all valuable suggestions and worth addressing ASAP.

DaveLak added a commit to DaveLak/GitPython that referenced this pull request Apr 16, 2024
Addresses feedback and encorperates suggestions from PR gitpython-developers#1901 to ensure
that the Apache License requirements are met for the two files that they
apply to, and the documentation pertaining to licensing of the files in
this repository is clear and concise.
Addresses feedback and encorperates suggestions from PR gitpython-developers#1901 to ensure
that the Apache License requirements are met for the two files that they
apply to, and the documentation pertaining to licensing of the files in
this repository is clear and concise.

The contects of LICENSE-APACHE were coppied from the LICENSE file of the
OSS-Fuzz repository that the two fuzz harnesses came from as of commit:
https://github.com/google/oss-fuzz/blob/c2c0632831767ff05c568e7b552cef2801d739ff/LICENSE
@DaveLak DaveLak force-pushed the oss-fuzz-test-harness-upstreaming branch from 321534c to 945a767 Compare April 16, 2024 06:22
@DaveLak
Copy link
Contributor Author

DaveLak commented Apr 16, 2024

@EliahKagan, commit 945a767 addresses the blocking concerns regarding the licensing and documentation of Apache licensed files. Please see the diff of that commit for specifics.

Sorry about the force-push; I wanted to make sure the OSS-Fuzz commit at the time I copied the Apache license file from was captured in the commit message.1

Regarding:

It seems to me that the most hassle-free way to achieve this is to copy the license file from the repository the Apache 2.0 licensed Python scripts originally came from (so long as there is nothing unusual about it) into the fuzzing/ directory, and name it LICENSE-APACHE.

I did exactly that. I also checked the diff of the contents of the file included here against https://www.apache.org/licenses/LICENSE-2.0.txt just to be sure; it looks good.

Diff Results
--- fuzzing/LICENSE-APACHE
+++ fuzzing/LICENSE-APACHE-from-website-src
@@ -1,3 +1,4 @@
+
                              Apache License
                        Version 2.0, January 2004
                     http://www.apache.org/licenses/
@@ -178,7 +179,7 @@
APPENDIX: How to apply the Apache License to your work.

       To apply the Apache License to your work, attach the following
-      boilerplate notice, with the fields enclosed by brackets "{}"
+      boilerplate notice, with the fields enclosed by brackets "[]"
       replaced with your own identifying information. (Don't include
       the brackets!)  The text should be enclosed in the appropriate
       comment syntax for the file format. We also recommend that a
@@ -186,7 +187,7 @@
same "printed page" as the copyright notice for easier
identification within third-party archives.

-   Copyright {yyyy} {name of copyright owner}
+   Copyright [yyyy] [name of copyright owner]

    Licensed under the Apache License, Version 2.0 (the "License");
    you may not use this file except in compliance with the License.
    @@ -198,4 +199,4 @@
    distributed under the License is distributed on an "AS IS" BASIS,
    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    See the License for the specific language governing permissions and
-   limitations under the License.
+   limitations under the License.
    \ No newline at end of file

CC @Byron

Footnotes

  1. ...but I forgot it the first time in 321534c ...and I see I have a few typos in 945a767 🤦

@DaveLak DaveLak requested a review from EliahKagan April 16, 2024 06:34
Copy link
Member

@Byron Byron left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for sorting this out!

It looks good to me, but I will wait for @EliahKagan to double-check.

Copy link
Contributor

@EliahKagan EliahKagan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, thanks!

@DaveLak
Copy link
Contributor Author

DaveLak commented Apr 16, 2024

FYI, I'm hesitant to push more commits here, including via the suggestion comments, because I don't want to keep making the review approvals stale.

I'll reply to the remaining as necessary of course, but I'm going to start a new branch for them unless someone suggests otherwise. Lmk if either of you would prefer them on this PR directly.

@EliahKagan
Copy link
Contributor

EliahKagan commented Apr 16, 2024

FYI, I'm hesitant to push more commits here, including via the suggestion comments, because I don't want to keep making the review approvals stale.

I'll reply to the remaining as necessary of course, but I'm going to start a new branch for them unless someone suggests otherwise. Lmk if either of you would prefer them on this PR directly.

I agree. I think this PR is in good shape to be merged, and further refinements can be made in one or more subsequent PRs.

This is the case even for future adjustments, if any, that would affect how the licensing situation is described. Everything that was blocking has been fully resolved.

DaveLak added a commit to DaveLak/GitPython that referenced this pull request Apr 16, 2024
Prefer executing these files using the OSS-Fuzz or `python` command
methods outlined in the `fuzzing/README`.

Based on feedback and discussion on:
gitpython-developers#1901
DaveLak added a commit to DaveLak/GitPython that referenced this pull request Apr 16, 2024
This script is meant to be sourced by the OSS-Fuzz file of the same
name, rather than executed directly. The shebang may lead to the
incorrect assumption that the script is meant for direct execution.
Replacing it with this directive instructs ShellCheck to treat
the script as a Bash script, regardless of how it is executed.

Based @EliahKagan's suggestion and feedback on:
gitpython-developers#1901
DaveLak added a commit to DaveLak/GitPython that referenced this pull request Apr 16, 2024
This script is executed directly, not sourced as is the case with
`build.sh`, so it should have an executable bit set to avoid ambiguity.

Based @EliahKagan's suggestion and feedback on:
gitpython-developers#1901
DaveLak added a commit to DaveLak/GitPython that referenced this pull request Apr 16, 2024
- Make the link text for the OSS-Fuzz test status URL more descriptive
- Fix capitalization of GitPython repository name

Based @EliahKagan's suggestion and feedback on:
gitpython-developers#1901
DaveLak added a commit to DaveLak/GitPython that referenced this pull request Apr 16, 2024
Replaces the null character delimiter `-d $'\0'` with the simpler
empty string `-d ''` in the fuzzing harness build loop.

This changes leverages the Bash `read` builtin behavior to avoid
unnecessary complexity and improving script readability.

Based @EliahKagan's suggestion and feedback on:
gitpython-developers#1901
DaveLak added a commit to DaveLak/GitPython that referenced this pull request Apr 16, 2024
DaveLak added a commit to DaveLak/GitPython that referenced this pull request Apr 16, 2024
A misspelling in the https://github.com/gitpython-developers/qa-assets
repository is still present here. It will need to be fixed in that
repository first.

"corpora" is a difficult word to spell consistently I guess. This made
for a good opportunity to improve the phrasing of two other comments at
at least.

Based @EliahKagan's suggestion and feedback on:
gitpython-developers#1901
@DaveLak DaveLak mentioned this pull request Apr 16, 2024
11 tasks
@DaveLak
Copy link
Contributor Author

DaveLak commented Apr 16, 2024

FWIW @EliahKagan I tackled most of the simpler changes and put up a Draft PR: #1903

Copy link
Member

@Byron Byron left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies, I read the last two comments too late.

However, what follows is still something I consider feasible, nothing is going to be stale. Instead, I prefer to have all the existing comments in one place and resolve conversations.

Never mind :D - see #1903 .


Thanks a lot for keeping working on the refinements. I thought I'd wait until the suggestions are committed (they can be batched), and a few more things are cleaned up.

Regarding the hardcoded /tmp path that then expects a /tmp/.git, I agree that this under no circumstances should be running in an unisolated environment. Since the documentation clearly states that docker is needed, and if it's unlikely to impossible to run outside (or to get to that line), then I think it's OK to leave it.
It seems to be common practice in the OSS-fuzz project as well, and portability is not their concern.

Thanks for bearing with us, this PR is already fantastic, and so is working with you on this!

fuzzing/fuzz-targets/fuzz_config.py Show resolved Hide resolved
fuzzing/fuzz-targets/fuzz_tree.py Show resolved Hide resolved
@Byron Byron merged commit e3fb1f2 into gitpython-developers:main Apr 17, 2024
26 checks passed
@DaveLak
Copy link
Contributor Author

DaveLak commented Apr 17, 2024

Thanks both! I updated the branch for #1903 and marked it ready for review.

I agree the state of fuzz_tree.py is not ideal and appreciate the feedback you each left about it. I've given it some thought as well but I hadn't considered some of the points raised.

I still plan to share in a reply on the unresolved thread here, I just haven't had a chance to yet. I should be able to in a few hours.

@EliahKagan
Copy link
Contributor

Regarding the hardcoded /tmp path that then expects a /tmp/.git, I agree that this under no circumstances should be running in an unisolated environment. Since the documentation clearly states that docker is needed, and if it's unlikely to impossible to run outside (or to get to that line), then I think it's OK to leave it.

To be clear, Docker is not needed, so this can be the real, unisolated /tmp, and the documentation presents that as a reasonable thing to do under some circumstances. It was with awareness of this that I recommended this be merged even before that is fixed.

However, I realize I omitted a piece of my reasoning: I think it's easier to fix this after both this PR and google/oss-fuzz#11803 are merged (the latter is not merged yet but looks like it will be merged soon), because that will make it easier to test the fix both outside and inside Docker.

(The Docker case is, I believe, more important, since that's how the tests are automatically run on the OSS-Fuzz infrastructure, and running individual tests outside of Docker is, if I understand correctly, mostly of interest when working on the fuzz tests themselves.)

For the same reason, I think it is justified for #1903 not to include a fix for this, since the changes in #1903 otherwise could be merged at any time and do not need to wait on google/oss-fuzz#11803.

@DaveLak
Copy link
Contributor Author

DaveLak commented Apr 18, 2024

Ok, sorry for the delay. Below are my thoughts regarding fuzz_tree.py, but also, a little extra context on the OSS-Fuzz execution environment because it may be helpful for thinking about #1901 (comment).

About the OSS-Fuzz Environment

You both touched on it, but for the sake of shared understanding, the scripts and tests look like they do in part because of the OSS-Fuzz container environment implementation details.

The key points to understand are:

  • There are two stages of an OSS-Fuzz run that influence why the shell scripts and fuzz tests look like they do: a build stage and a fuzzer execution stage.
  • Each stage has its own distinct container environment.
  • Each container environment's filesystem is mostly read-only.
  • The containers are Linux based (Ubuntu 20.04 right now I believe.)

Build Stage Execution Environment

As per this page of the docs:

build.sh script environment
When your build.sh script is executed, the following locations are available within the image:

Location Env Variable Description
/out/ $OUT Directory to store build artifacts (fuzz targets, dictionaries, options files, seed corpus archives).
/src/ $SRC Directory to checkout source files.
/work/ $WORK Directory to store intermediate files.

Everything else is read-only.

Fuzzer Execution Stage Environment

As per this page of the docs:

File system
Everything except /tmp is read-only, including the directory that your fuzz target executable lives in.
/dev is also unavailable.

Regarding the Fuzz Tests

When I initially started on this, my primary objective was really only about fixing the build. I did try to clean them up a bit in the process and apply some best practices, but those changes were mostly incidental and I intentionally avoided any significant refactors.

After we decided to bring them here, my focus shifted to lowering the barrier to entry for contributors because of the nature of the project. In particular, I tried to distill the useful information from various place that I haven't seen documented in one place yet. One such tip is the direct execution section, but in hindsight I overlooked the environment assumptions in fuzz_tree.py.

On fuzz_tree.py Specifically

This test is less than ideal for several reasons.

  1. Writing to the /tmp dir

In addition to what has been mentioned, modifying global state and I/O operations (e.g., writing to the temp dir) are problematic for fuzzing in general.

Further, @EliahKagan's remark in #1901 (comment) about "implicating other things in /tmp" made me realize this may already be causing issues inside the OSS-Fuzz execution environment. A non-obvious implementation detail specific to Python projects in OSS-Fuzz that the compile_python_fuzzer function in the build.sh script uses Pyinstaller to bundle the fuzz harness and all related dependencies into a single file archive that gets transferred to the fuzzer execution stage, where it is unpacked into a subdirectory of /tmp for execution. Which, I believe means, the test execution is actually using the test executable as a fixture 🥴

I have yet to confirm this, but I wouldn't be surprised if that is the cause of the issue I described as a possible " circular dependency " in google/oss-fuzz#11763

  1. It's very slow

With an average execs per second of ~4 after the seed corpus loads on a short local run, it is far, far, below what is hoped for. In contrast, I'm seeing about 1432 exec/s on fuzz_config.py in a similar local run. fuzz_config.py creates a config fake in-memory but never writes it to disk.

I haven't dug in yet, but I think this is partly related to point 1 above and partly related to some of the implementation details of the legacy code in GitPython.

This is less of a concern/priority than the /tmp dir issue though because even if it's slow, it's still at least testing the API.

Thoughts on Next Steps

I'm fully in agreement that the /tmp issue and documentation related to it must be fixed before other updates are considered. I think it can be broken down into smaller problems in the following prioritized order:

  1. Address adverse potential that the local execution documentation exposes unaware contributors to.

Possible quick fix options:

  • Remove that section all together for now (probably quickest and safest?)
  • Add a Warning or Caution callout to that section about fuzz_tree.py (this feels not as great IMHO)
  1. Address the /tmp writing implementation issues in fuzz_tree.py
  • TemporaryDirectory as suggested by @EliahKagan looks like a good solution but I would need to test it in OSS-Fuzz (which I can certainly do regardless of the downstream PR's status.)
  • I've also considered proposing we add a Dockerfile to the fuzzing/ dir that would provide a "batteries included" way for folks to use the direct execution method quickly and easily inside a container environment prepared specifically to support it. Even before this discussion, I thought that would be the ideal way to document it because users wouldn't need to worry about any of the particulars beyond having Docker installed, which they'd already need for the alternative anyway.

I think the most robust solution would be some combination of both points above, but I don't know enough to comment on the potential issues that we may face goinf the TemporaryDirectory route inside OSS-Fuzz. The Dockerfile solution on the other hand is something I've already been toying with locally.

I also looked at options like MemoryFS from https://github.com/PyFilesystem/pyfilesystem2 because an in-memory solution would be ideal, but I think the git executable invocation may cause issues with that approach.

What do you both think?

DavidKorczynski pushed a commit to google/oss-fuzz that referenced this pull request Apr 18, 2024
Updates the gitpython project files to enable migrating and maintaining
fuzz targets and build scripts upstream.

Related PR in the upstream repo:
gitpython-developers/GitPython#1901

`project.yaml` updates:

- @Byron, the maintainer of GitPython, is added as the primary contact.
- @EliahKagan and myself are added to the `auto_ccs` list as discussed
with @Byron here:
gitpython-developers/GitPython#1889 (comment)
- @DavidKorczynski I removed what I believe is your email from the
`vendor_ccs` because it looked like you were included as the default
when no other contacts were listed. If this was a mistake on my part and
you want to remain listed as a CC, please let me know and I'll correct
it.

Thanks!
@Byron
Copy link
Member

Byron commented Apr 18, 2024

Thanks so much for the follow-up - I feel well-informed now!

As OSS-fuzz strongly relies on docker, maybe it would be easiest to 'go with the flow' and provide a dockerfile as part of the fuzzing directory. That way, the docs can refer exclusively to that and thanks to docker, everything, including dependencies, would work just as if they would run on the OSS-fuzz infrastructure.
Doing so would allow me to run these as well, as generally I am no fan of venv and all - docker is fine though and I have experience with it.

As second step, one could see if improving the tempdir usage would fix certain issues that you have been seeing, including to make using OSS fuzz safe for running it directly. gitoxide can do that without issues, by the way, and maybe being able to do that helps with the local development workflow.

With that said, I also think that… I'd really want you to do more with gitoxide 😁, and even though Rust OSS fuzzing is currently broken across the ecosystem, maybe there are some parsers that still don't have enough fuzzerage (fuzzing coverage). Maybe it would be more fun, too, as 'the system' won't be in your way anymore. But I digress 😅.

DaveLak added a commit to DaveLak/GitPython that referenced this pull request Apr 20, 2024
Adds a Dockerfile to enable easily executing the fuzz targets directly
inside a container environment instead of directly on a host machine.
This addresses concerns raised in PR gitpython-developers#1901 related to how `fuzz_tree.py`
writes to the real `/tmp` directory of the file system it is executed on
as part of setting up its own test fixtures, but also makes for an
easier to use development workflow.

See this related comment on PR gitpython-developers#1901 for additional context:
gitpython-developers#1901 (comment)
@DaveLak
Copy link
Contributor Author

DaveLak commented Apr 20, 2024

As OSS-fuzz strongly relies on docker, maybe it would be easiest to 'go with the flow' and provide a dockerfile as part of the fuzzing directory. That way, the docs can refer exclusively to that and thanks to docker, everything, including dependencies, would work just as if they would run on the OSS-fuzz infrastructure.
Doing so would allow me to run these as well, as generally I am no fan of venv and all - docker is fine though and I have experience with it.

I've added a simple Dockerfile and I instructions for using it and a warning about not using Docker in #1904. Let me know what you think!

As second step, one could see if improving the tempdir usage would fix certain issues that you have been seeing, including to make using OSS fuzz safe for running it directly.

I'm planning to take a deeper at this. Even without the /tmp concerns, there is something that is blocking Fuzz Introspector from working properly, which would be nice to resolve.

With that said, I also think that… I'd really want you to do more with gitoxide 😁

I'm definitely interested in taking you up on that! I've been interested in learning more about Rust for a while and I've found working on fizzers a great way to get familiar with new projects and languages, so you can likely expect to see me pop up in that project sooner or later 😁


In other news, the downstream OSS-Fuzz PR was merged and the Gitpython builds are succeeding!

https://introspector.oss-fuzz.com/project-profile?project=gitpython

The Coverage build is still reporting failures, but that is expected until ClusterFuzz generates enough corpora to run the analysis which I expect will happen in the next day or two.

Exciting to see!

@DaveLak DaveLak deleted the oss-fuzz-test-harness-upstreaming branch April 22, 2024 20:16
DaveLak added a commit to DaveLak/GitPython that referenced this pull request Apr 22, 2024
These files are already BSD-3-Clause even without the headers, but
adding these comments and the `LICENSE-BSD` symlink to the root level
`LICENSE` file are helpful to reinforce that there are only two
particular files in the `fuzzing/` that are not under BSD-3-Clause.

See:
gitpython-developers#1901 (comment)
DaveLak added a commit to DaveLak/GitPython that referenced this pull request Apr 22, 2024
While discussing adding similar license comments to the shell scripts
introduced in PR gitpython-developers#1901, it was noticed that the shell scripts in the
repository root directory did not have such comments and suggested that
we could add them when the scripts in the `fuzzing/` directory were
updated, so this commit does just that.

See:
gitpython-developers#1901 (comment)
DaveLak added a commit to DaveLak/GitPython that referenced this pull request Apr 29, 2024
As discussed in the initial fuzzing integration PR[^1], `fuzz_tree.py`'s
implementation was not ideal in terms of coverage and its reading/writing to
hard-coded paths inside `/tmp` was problematic as (among other concerns), it
causes intermittent crashes on ClusterFuzz[^2] when multiple workers execute
the test at the same time on the same machine.

The changes here replace `fuzz_tree.py` completely with a completely new
`fuzz_repo.py` fuzz target which:

- Uses `tempfile.TemporaryDirectory()` to safely manage tmpdir creation and
  tear down, including during multi-worker execution runs.
- Retains the same feature coverage as `fuzz_tree.py`, but it also adds
  considerably more from much smaller data inputs and with less memory consumed
  (and it doesn't even have a seed corpus or target specific dictionary yet.)
- Can likely be improved further in the future by exercising additional features
  of `Repo` to the harness.

Because `fuzz_tree.py` was removed and `fuzz_repo.py` was not derived from it,
the Apache License call outs in the docs were also updated as they only apply to
the singe `fuzz_config.py` file now.

[^1]: gitpython-developers#1901 (comment)
[^2]: https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=68355
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants