A proof-of-concept code search for GitHub that de-duplicates results


This is a demo of GitHub code search that de-duplicates results from identical files. It compares SHA1 hashes and file contents to mark files as identical, and combines their results into one.
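As a rough illustration of that grouping step (a sketch only, not the code in this repository), the snippet below keys each result on both its hash and its contents. The `(repo, path, content)` tuple shape and the function names are assumptions made for the example. It computes the SHA-1 the way Git does for a blob, which should correspond to the `sha` value the code search API reports for a file.

```python
import hashlib
from collections import defaultdict


def git_blob_sha1(content: bytes) -> str:
    """Return the SHA-1 that Git assigns to a blob with this content."""
    header = b"blob %d\0" % len(content)
    return hashlib.sha1(header + content).hexdigest()


def group_identical(results):
    """Group search results whose files are byte-for-byte identical.

    `results` is an iterable of (repo, path, content) tuples; this shape
    is an assumption made for the sketch.
    """
    groups = defaultdict(list)
    for repo, path, content in results:
        # Key on the hash *and* the content, so that a hash collision
        # cannot merge two different files.
        groups[(git_blob_sha1(content), content)].append((repo, path))
    return list(groups.values())
```

Each returned group lists every location of one distinct file, so the first entry can be shown as the result and the rest counted as duplicates.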

It's a proof-of-concept, rather than a hardened search tool.

Dear GitHub, please steal these ideas for the first-party code search 😉


Clone this repository, create a new virtualenv with Python 3.6 or later, and install the dependencies.

$ git clone
$ cd github-code-search
$ virtualenv env
$ source env/bin/activate
$ pip3 install -r requirements.txt

You also need a personal access token for GitHub, which you can get from the GitHub developer settings. This token needs the public_repo scope.


Run the script, passing your query and API token:

$ python "lang:python requests.get" --api_token=abc123

This will load the search results, render them as an HTML file, and open the file in your web browser.
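The render-and-open step can be sketched with the standard library alone. This is a minimal illustration, not the script's actual rendering code: the `(snippet, locations)` input shape, the function names, and the markup are all assumptions.

```python
import html
import tempfile
import webbrowser
from pathlib import Path


def render_results(groups):
    """Render grouped results as a minimal HTML page.

    `groups` is a list of (snippet, locations) pairs, a hypothetical shape.
    """
    items = []
    for snippet, locations in groups:
        first, extra = locations[0], len(locations) - 1
        suffix = f" (+{extra} duplicates)" if extra else ""
        # Escape everything that came from the search results.
        items.append(
            f"<li><p>{html.escape(first)}{suffix}</p>"
            f"<pre>{html.escape(snippet)}</pre></li>"
        )
    return "<!DOCTYPE html>\n<ul>\n" + "\n".join(items) + "\n</ul>\n"


def open_in_browser(page):
    """Write the page to a temporary file and open it in the default browser."""
    with tempfile.NamedTemporaryFile(
        "w", suffix=".html", delete=False, encoding="utf-8"
    ) as f:
        f.write(page)
    webbrowser.open(Path(f.name).as_uri())
    return f.name
```

Keeping rendering separate from opening the browser means the HTML can be checked without launching anything.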

Next steps

This is a proof-of-concept I wrote in a single train journey, not a hardened application. I'm not planning to work on it any further, but I did have some ideas on what you could do next:

  • Should duplicate results be weighted higher in the search? If the same file appears in 100 repos, should that increase the search ranking? I'm not sure -- it would be interesting to experiment.

  • Detect nearly-duplicate files. If two files are the same, except for some lines that are unrelated to the search, treat those two files as the same. This requires more sophisticated diffing logic.

  • Highlight search terms inside the code snippets. If you use regular GitHub code search, it highlights the search terms within the code snippet. I don't do that yet, but there's (probably) enough information in the API to do that.

  • Make it faster! Right now it's pretty slow -- it has to fetch the contents of every unique file that appears in the search results. Parallelising the HTTP requests or doing something fancy with GraphQL to reduce the number of requests would make it faster.

  • Pagination. Right now it only uses the first page of results, even though the API is paginated. It'd be nice to expose later results in some way.

  • Give more visibility into the duplicate results. Any duplicate results are just hidden behind "+N duplicates". Completely hiding them is probably a mistake – there might be circumstances in which you want to see the duplicates (although I don't think that's the common use case). It would be good if there was some way to see them if you really wanted.
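For the nearly-duplicate idea above, one possible starting point is a line-level diff with `difflib`: two files count as the same if none of the lines that differ mention a search term. This is only a sketch under that assumption. The function names are hypothetical, and real search matching is more involved than a substring test.

```python
import difflib


def differing_lines(a: str, b: str):
    """Return the lines that appear in only one of the two texts."""
    a_lines, b_lines = a.splitlines(), b.splitlines()
    matcher = difflib.SequenceMatcher(None, a_lines, b_lines, autojunk=False)
    changed = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changed.extend(a_lines[i1:i2])
            changed.extend(b_lines[j1:j2])
    return changed


def nearly_duplicate(a: str, b: str, terms):
    """True if every differing line is unrelated to the search terms."""
    return not any(
        term in line for line in differing_lines(a, b) for term in terms
    )
```

A stricter version might also cap how many lines are allowed to differ, so that two mostly unrelated files are never merged.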


