This repository has been archived by the owner on Oct 12, 2020. It is now read-only.

Use GitHub API v4, add options and improve trust algorithm #8

Merged
merged 11 commits into from
Jun 29, 2019
49 changes: 40 additions & 9 deletions README.md
@@ -3,10 +3,14 @@
<p align="center">
<img width="300" src="img/logo.png"/>
</p>

<p align="center">
<a href="#license">
<img src="https://img.shields.io/badge/license-MIT-blue.svg?style=flat" />
</a>
<a href="https://hub.docker.com/r/ullaakut/astronomer/">
<img src="https://img.shields.io/docker/pulls/ullaakut/astronomer.svg?style=flat" />
</a>
<a href="https://goreportcard.com/report/github.com/ullaakut/astronomer">
<img src="https://goreportcard.com/badge/github.com/ullaakut/astronomer" />
</a>
@@ -20,11 +24,15 @@ The goal of this tool is to **detect illegitimate GitHub stars from bot accounts

## Trust algorithm

Trust is computed based on four different factors:
Trust is computed based on many different factors:

* The average amount of account lifetime contributions among stargazers
* The average amount of lifetime contributions among stargazers
* The average amount of private contributions
* The average amount of public created issues
* The average amount of public authored commits
* The average amount of public opened pull requests
* The average amount of public code reviews
* The average weighted contribution score (weighted by making older contributions more trustworthy)
* The 65th, 85th and 95th percentile of the weighted contribution score: can be useful to detect a mix of real and fake users that would amount to a normal average
* The average account age, older is more trustworthy
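The point about percentiles in the list above can be illustrated with a small sketch. It is not the project's actual implementation: the `average` and `percentile` helpers, the nearest-rank method, and the sample scores are all assumptions made for this example. It shows how a few very active accounts can lift the average while the 65th percentile stays at zero.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// average returns the arithmetic mean of scores, or 0 for an empty slice.
func average(scores []float64) float64 {
	if len(scores) == 0 {
		return 0
	}
	sum := 0.0
	for _, s := range scores {
		sum += s
	}
	return sum / float64(len(scores))
}

// percentile returns the p-th percentile (0 < p <= 100) of scores using the
// nearest-rank method. The method is an assumption made for this sketch.
func percentile(scores []float64, p float64) float64 {
	if len(scores) == 0 {
		return 0
	}
	// Sort a copy so the caller's slice is left untouched.
	sorted := append([]float64(nil), scores...)
	sort.Float64s(sorted)
	rank := int(math.Ceil(p / 100 * float64(len(sorted))))
	if rank < 1 {
		rank = 1
	}
	return sorted[rank-1]
}

func main() {
	// Hypothetical weighted contribution scores: mostly empty accounts
	// plus two very active ones.
	scores := []float64{0, 0, 0, 0, 0, 0, 0, 1, 900, 1100}
	fmt.Printf("average: %.1f\n", average(scores))
	for _, p := range []float64{65, 85, 95} {
		fmt.Printf("p%.0f: %.0f\n", p, percentile(scores, p))
	}
}
```

With data shaped like this, the average alone looks healthy, while the low percentiles reveal that most stargazers have no contributions at all.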

### Upcoming improvements
@@ -34,26 +42,43 @@ I am planning on soon also computing every 5th percentile (`5`, `10`, `15` and s
## Examples

<p align="left">
<img width="65%" src="img/cameradar.png">
<img width="65%" src="img/traefik.png">
</p>
<p align="right">
<img width="65%" src="img/suspicious_repo.png">
</p>
<p align="left">
<img width="65%" src="img/flaeg.png">
<img width="65%" src="img/envoy.png">
</p>

## How to use it

### Docker image

In order to use Astronomer, you'll need a GitHub token with `repo` read rights. You can generate one [in your GitHub Settings > Developer settings > Personal Access Tokens](https://github.com/settings/tokens). Make sure to keep this token secret. You will also need to have docker installed.

Run `docker pull ullaakut/astronomer`.

Then, use the astronomer docker image like such: `docker run -t -e GITHUB_TOKEN=$GITHUB_TOKEN -v "/path/to/your/cache/folder:/data/" ullaakut/astronomer myusername/myrepository`
Then, run the Astronomer Docker image like so: `docker run -t -e GITHUB_TOKEN=$GITHUB_TOKEN -v "/path/to/your/cache/folder:/data/" ullaakut/astronomer repositoryOwner/repositoryName -d`

* The `-t` flag allows you to get a colored output. You can remove it from the command line if you don't care about this.
* The `-e GITHUB_TOKEN=<your_token>` option is mandatory. The GitHub API won't authorize any requests without it.
* The `-v "/path/to/your/cache/folder:/data/"` option can be used to cache the responses from the GitHub API on your machine. This means that the next time you run a scan, Astronomer will simply update its cache with the new stargazers since your last scan, and compute the trust levels again. It is highly recommended to use cache if you plan on scanning popular repositories (more than 1000 stars) more than once.
* The `-d` flag enables more detailed trust factor computation.

### Binary

You can also install the go binary by [enabling go modules](https://github.com/golang/go/wiki/Modules#how-to-use-modules) and running `go install github.com/ullaakut/astronomer`. Make sure that your `go` version is at least `1.11.x`.

You can verify your `go` version by running `go version`.

The `astronomer` binary will then be available in `$GOPATH/bin/astronomer`.

## Arguments and options

* It is required to specify a repository in the form `repositoryOwner/repositoryName`. This argument's position does not matter.
* **`-d, --debug`**: Show more detailed trust factors, such as percentiles (default: false)
* **`-c, --cachedir` (string)**: Set the directory in which to store cache data (default: `./data`)

## Questions & Answers

@@ -67,11 +92,11 @@ Repositories with high amounts of stars, especially when they arrive in bursts,

> _Why is `Astronomer` so slow? It's been scanning a project for hours._

Astronomer needs to make a lot of queries to the GitHub API in order to fetch all of the user data. It typically needs to do one request per page of stargazers (that's one query per 30 users), and then two requests per user (one to get their user profile and one for their GitHub contributions). This means that for a repository with 25000 stars, Astronomer would need to make 830 requests to list all users and then 50000 requests to get all user data. The other issue is that the GitHub API is rate limited to 5000 requests per hour, so this particular scan would end up taking at least 10 hours. I plan on contacting GitHub to try to get a token with more flexible rate limiting, since I believe this project is beneficial to their business, but I'm not confident this request will be accepted.
Astronomer needs to make a lot of queries to the GitHub API in order to fetch all of the user data. It typically makes one request per page of stargazers per year of contributions (as of 2019, that's 11 requests per 30 users). The GitHub API is rate limited to 5000 requests per hour, so scanning a repository with 25000 stars, for example, requires about 9000 requests and takes at least two hours (about 6 hours on my machine and network). I plan on contacting GitHub to try to get a token with more flexible rate limiting, since I believe this project is beneficial to their business, but I'm not confident this request will be accepted.
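The arithmetic behind that estimate can be sketched as follows. The function names `estimateRequests` and `minimumHours` are made up for this example and do not exist in the project; the numbers (30 users per page, 11 requests per page, 5000 requests per hour) come from the answer above.

```go
package main

import (
	"fmt"
	"math"
)

// estimateRequests returns the number of API requests needed to scan a
// repository, given its star count, the number of stargazers per page and
// the number of requests issued per page.
func estimateRequests(stars, pageSize, requestsPerPage int) int {
	pages := (stars + pageSize - 1) / pageSize // round up to whole pages
	return pages * requestsPerPage
}

// minimumHours returns the lower bound on scan duration imposed by an
// hourly rate limit, ignoring network latency and processing time.
func minimumHours(requests, limitPerHour int) float64 {
	return float64(requests) / float64(limitPerHour)
}

func main() {
	requests := estimateRequests(25000, 30, 11)
	hours := minimumHours(requests, 5000)
	fmt.Printf("%d requests, at least %.2f hours (%d rate-limit windows)\n",
		requests, hours, int(math.Ceil(hours)))
}
```

This is only a lower bound: it assumes every request succeeds on the first try and that the rate limit is the only bottleneck.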

> _How can I contribute to this project?_

If you have a strong math background, knowledge in statistics and analytics, or in general believe you could make the trust algorithm smarter, please contact me, or at least feel free to open a feature request describing what algorithm you think would work better.
If you have a strong math background, knowledge in statistics and analytics, or in general believe you could make the trust algorithm smarter, please contact me, or at least feel free to open a feature request describing what algorithm you think would work better. A feature that I would be especially interested in is computing the curve of percentile values for each trust factor and comparing it to a reference curve, in order to detect inconsistencies.

If you are a software engineer or a web developer (or both), you could also participate in helping to build the next version of Astronomer: an API and a web application to let people scan whatever repositories they want for fake stars, and see previously generated reports through a web interface. It would make it easy for everyone to check whether or not a repository's stargazers are legit.

@@ -81,9 +106,15 @@ Also, if you have data to backup a claim that you have a better value for the go

Ideally, this should be a GitHub feature. The issue is that it's almost impossible to differentiate between a bot account and the account of someone who just created a GH account to star a repository and show their support, which could lead to angry customers for GitHub if they chose to ban potentially illegitimate accounts. It's also very easy for people who make bot accounts to make them seem legit by creating private repositories with daily contributions, although this can also be detected to some extent if it becomes a trend.

> _What's the strange hardcoded skip in the `query.go` file?_

Unfortunately, there's an issue in the GitHub API: [this user](https://github.com/jstrachan) has so many contributions that every API request that would include his contributions consistently times out. Since he starred `containous/traefik`, I had to hardcode a skip to allow the scan to continue. The GH API's only method of pagination is the `cursor` returned by the user node, so I had to fetch his cursor value manually and hardcode it. Writing logic to handle this case generically would be possible, but I'm not sure it's a priority right now. I've sent a support request to GitHub, and I'll remove this skip once they fix it.

## Thanks

Thanks to the authors of [spencerkimball/stargazers](https://github.com/spencerkimball/stargazers) and the [GitHub contributions API](https://github.com/Didericis/github-contributions-api) 🙏
Thanks to the authors of [spencerkimball/stargazers](https://github.com/spencerkimball/stargazers) who greatly inspired the early design of this project 🙏

The original Go gopher was designed by [Renee French](http://reneefrench.blogspot.com).

## License

41 changes: 19 additions & 22 deletions cache.go
@@ -1,7 +1,6 @@
package main

import (
"bufio"
"bytes"
"fmt"
"io/ioutil"
@@ -18,8 +17,8 @@ import (
// supplied request's URL. If found, the file contains a cached copy
// of the HTTP response. The contents are read into an http.Response
// object and returned.
func getCache(ctx *Context, req *http.Request) (*http.Response, error) {
filename := cacheEntryFilename(ctx, req.URL.String())
func getCache(ctx context, req *http.Request, pagination string) (*http.Response, error) {
filename := cacheEntryFilename(ctx, req.URL.String()+pagination)
pathToCreate := path.Dir(filename)

if err := os.MkdirAll(pathToCreate, os.ModeDir|0755); err != nil {
@@ -43,51 +42,49 @@ func readCachedResponse(filename string, req *http.Request) (*http.Response, err
return nil, err
}

return http.ReadResponse(bufio.NewReader(bytes.NewBuffer(body)), req)
return &http.Response{
Body: ioutil.NopCloser(bytes.NewReader(body)),
}, nil
}

// putCache puts the supplied http.Response into the cache.
func putCache(ctx *Context, req *http.Request, resp *http.Response) error {
defer resp.Body.Close()

filename := cacheEntryFilename(ctx, req.URL.String())
func putCache(ctx context, req *http.Request, pagination string, body []byte) error {
filename := cacheEntryFilename(ctx, req.URL.String()+pagination)
f, err := os.Create(filename)
if err != nil {
return err
return fmt.Errorf("unable to create cache file: %v", err)
}
defer f.Close()

if err := resp.Write(f); err != nil {
f.Close()
return err
_, err = f.Write(body)
if err != nil {
return fmt.Errorf("unable to write response in cache file: %v", err)
}

f.Close()

readResp, err := readCachedResponse(filename, req)
_, err = readCachedResponse(filename, req)
if err != nil {
return err
}

resp.Body = readResp.Body
return nil
}

// cacheEntryFilename creates a filename-safe name in a subdirectory
// of the configured cache dir, with any access token stripped out.
func cacheEntryFilename(ctx *Context, url string) string {
newURL := strings.Replace(url, fmt.Sprintf("access_token=%s", ctx.Token), "", 1)
return filepath.Join(ctx.CacheDir, ctx.Repo, sanitize.BaseName(newURL))
func cacheEntryFilename(ctx context, url string) string {
newURL := strings.Replace(url, fmt.Sprintf("access_token=%s", ctx.githubToken), "", 1)
return filepath.Join(ctx.cacheDirectoryPath, ctx.repoOwner, ctx.repoName, sanitize.BaseName(newURL))
}

// clearEntry clears a specified cache entry.
func clearEntry(ctx *Context, url string) error {
func clearEntry(ctx context, url string) error {
filename := cacheEntryFilename(ctx, url)
return os.Remove(filename)
}

// Clear clears all cache entries for the repository specified in the
// fetch context.
func Clear(ctx *Context) error {
filename := filepath.Join(ctx.CacheDir, ctx.Repo)
func Clear(ctx context) error {
filename := filepath.Join(ctx.cacheDirectoryPath, ctx.repoOwner, ctx.repoName)
return os.RemoveAll(filename)
}
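For reviewers who want to try the new caching scheme in isolation, here is a minimal, standalone sketch of the idea in this diff: raw response bodies are stored on disk under a name derived from the request URL plus a pagination string. It is not the code from `cache.go`. The helpers `cacheFilename`, `putBody` and `getBody` are hypothetical, and the character replacement stands in for `sanitize.BaseName` so the example only needs the standard library.

```go
package main

import (
	"fmt"
	"io/ioutil"
	"os"
	"path/filepath"
	"strings"
)

// cacheFilename builds a filename-safe cache path from the request URL and
// a pagination string. The replacement rule is an assumption standing in
// for sanitize.BaseName.
func cacheFilename(dir, owner, name, url, pagination string) string {
	safe := strings.Map(func(r rune) rune {
		switch {
		case r >= 'a' && r <= 'z', r >= 'A' && r <= 'Z',
			r >= '0' && r <= '9', r == '-', r == '.', r == '_':
			return r
		default:
			return '-'
		}
	}, url+pagination)
	return filepath.Join(dir, owner, name, safe)
}

// putBody writes a raw response body to the cache, creating parent
// directories as needed.
func putBody(filename string, body []byte) error {
	if err := os.MkdirAll(filepath.Dir(filename), 0755); err != nil {
		return fmt.Errorf("unable to create cache directory: %v", err)
	}
	if err := ioutil.WriteFile(filename, body, 0644); err != nil {
		return fmt.Errorf("unable to write response in cache file: %v", err)
	}
	return nil
}

// getBody reads a cached response body back from disk.
func getBody(filename string) ([]byte, error) {
	return ioutil.ReadFile(filename)
}

func main() {
	dir, err := ioutil.TempDir("", "astronomer-cache")
	if err != nil {
		fmt.Println(err)
		return
	}
	defer os.RemoveAll(dir)

	// Hypothetical URL and cursor values, used only for this demonstration.
	filename := cacheFilename(dir, "repositoryOwner", "repositoryName",
		"https://api.github.com/graphql", "cursor-1")
	if err := putBody(filename, []byte(`{"data":{}}`)); err != nil {
		fmt.Println(err)
		return
	}
	body, err := getBody(filename)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("%s\n", body)
}
```

Including the pagination string in the key matters when every request goes to the same URL: without it, each page would overwrite the previous page's cache entry.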