Documenting the algorithm and providing justification evidence #45
Hi @changkun ! Thanks for your tests and your very interesting suggestions :) As stated in the README, my algorithm only attempts to estimate authenticity based on the users' contributions. The repositories that you scanned are mostly starred by casual GitHub users, which makes the results quite low :/ For example, since your repository is a tutorial, it tends to be starred by CS students who are learning C++, so their average contributions will tend to be lower than those of people who star technical libraries. I determined the ratios for what counts as "trustworthy" based on a sample of GitHub open source repositories (the comparison step is sketched in code after this comment):
LiveSplit/LiveSplit
bettercap/bettercap
bxcodec/faker
cenkalti/backoff
containous/traefik
d5/tengo
derailed/k9s
dgageot/demoit
ehazlett/interlock
envoyproxy/envoy
fatih/color
francoispqt/gojay
gcla/termshark
golang/proposal
grafana/loki
guptarohit/asciigraph
hashicorp/raft
iafan/goplayspace
ikruglov/slapper
imdario/mergo
jlevesy/sind
julienschmidt/httprouter
kataras/iris
knqyf263/trivy
kubernetes/kops
labstack/echo
ldez/prm
lukechampine/uint128
michenriksen/gitrob
montanaflynn/stats
moul/assh
mvdan/gofumpt
nektos/act
notnil/chess
olivere/elastic
operator996/yaocl
rancher/k3s
rs/zerolog
sirupsen/logrus
spf13/cobra
spf13/viper
thoas/stats
totoval/totoval
tsenart/vegeta
ullaakut/astronomer
ullaakut/cameradar
ullaakut/gorsair
ullaakut/nmap
ullaakut/rtspallthethings
valyala/fasthttp
vbauerster/mpb
vektra/mockery
zhangpeihao/gortmp

Unfortunately, this list is biased towards open source Go projects and libraries, which are most often used in technical projects, so the average stargazer of those projects is probably not representative of the average GitHub stargazer. That would explain the results you found.
I would love to discuss this more in depth with you, since you seem to have great ideas and suggestions, and better knowledge than me in this matter :)
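For reference, here is roughly what I mean by comparing average contributions against reference ratios. This is a minimal sketch, not astronomer's actual implementation: the contribution counts, the sample names, and the `referenceAverage` cutoff are all made up for illustration, and fetching the counts from GitHub is left out entirely.

```go
package main

import "fmt"

// averageContributions returns the mean yearly contribution count
// across a repository's stargazers. Fetching the counts from GitHub
// is out of scope here; this only illustrates the comparison step.
func averageContributions(counts []int) float64 {
	if len(counts) == 0 {
		return 0
	}
	total := 0
	for _, c := range counts {
		total += c
	}
	return float64(total) / float64(len(counts))
}

func main() {
	// Hypothetical samples: a technical library starred by working
	// engineers versus a tutorial starred mostly by students.
	samples := map[string][]int{
		"library":  {1200, 340, 890, 2100, 40, 760},
		"tutorial": {12, 0, 35, 8, 3, 60},
	}

	// referenceAverage is a made-up cutoff standing in for the ratios
	// derived from the sample repositories listed above.
	const referenceAverage = 250.0

	for name, counts := range samples {
		avg := averageContributions(counts)
		fmt.Printf("%-8s avg=%.1f trustworthy=%v\n", name, avg, avg >= referenceAverage)
	}
}
```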
Hi @Ullaakut. Sorry for my late response, and thank you for opening the discussion regarding the algorithm.
In conclusion, malicious stargazer detection is a challenging problem: you must sample enough data from GitHub and analyze 1) which factors are essential, 2) which potential factors you haven't considered yet, and so on.
Hey again @changkun ! Thanks for the details 🤔 I'm thinking of ways to get a better idea of the differences between legitimate and malicious stargazers, but the more I look into it, the more challenging the problem becomes. It turns out a lot of bought stars may come from hacked accounts, or from legitimate GitHub users who are simply paid to star projects on request (see https://gimhub.com/), and those are pretty much impossible to tell apart from other users. What my algorithm is good at detecting is basically any repository that used one of the old Python scripts from about 3 years ago that automatically created accounts and starred a GitHub project, resulting in the first X accounts of the repository having absolutely 0 contributions and nothing but that one starred project (that signature is sketched in code below). Those scripts no longer work, though, since GitHub tightened account creation security by requiring email validation and a security challenge (using either vision or hearing).
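To make that signature concrete, here is a minimal sketch of the check, assuming stargazers are already ordered oldest-first with their contribution counts attached. The `Stargazer` type, the sample logins, and the `suspicionThreshold` cutoff are all hypothetical, not astronomer's actual code.

```go
package main

import "fmt"

// Stargazer pairs a login with its total contribution count, ordered
// by when the account starred the repository (earliest first).
type Stargazer struct {
	Login         string
	Contributions int
}

// leadingZeroRun counts how many of the earliest stargazers have zero
// contributions. A long unbroken run is the signature left by the old
// account-creation scripts described above.
func leadingZeroRun(stars []Stargazer) int {
	run := 0
	for _, s := range stars {
		if s.Contributions != 0 {
			break
		}
		run++
	}
	return run
}

func main() {
	stars := []Stargazer{
		{"bot-001", 0}, {"bot-002", 0}, {"bot-003", 0},
		{"realdev", 431}, {"another", 87},
	}

	// suspicionThreshold is an illustrative cutoff, not astronomer's.
	const suspicionThreshold = 3

	if run := leadingZeroRun(stars); run >= suspicionThreshold {
		fmt.Printf("suspicious: first %d stargazers have 0 contributions\n", run)
	}
}
```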
Thank you for this very interesting project. Here I share a few of the tests I ran while using it.
I initially tested my personal project, which has about 3.9k stars; the results didn't seem so good.
Then I picked another project from the GitHub trending page:
OK, then let's test TensorFlow.
Issues with the Algorithm
This repo proposes a judgment algorithm without a prior study of the algorithm's ratios. As a user of your algorithm, I particularly expect the following supporting evidence for why the algorithm is accurate:
Show a theoretical analysis of the influence of each of the defined factors, and provide a regression analysis and the statistical stability of the algorithm.
Benchmark various projects, illustrating how your algorithm matches the theoretical analysis for the top 10 most valuable open source projects, like golang/go, torvalds/linux, etc. (see the benchmark sketch after this list).
May I ask how you reached this conclusion? How large are your test samples? What are they? etc.
Establish a user study. An important way of evaluating usability issues is to hold a user study. Typically, a single score lacks expressiveness across many different aspects, and it is not easy to say whether the stars of a repo are seriously faked or unworthy. Make a quantitative analysis of, for example, how other users feel about the score provided by the algorithm: does the score match their mental expectations? Why? How could we help? Those are questions that should be seriously considered.
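To illustrate the benchmark point above, a minimal harness sketch that aggregates scores over a hand-labeled repository set. The `result` type, the repository names, and all scores here are hypothetical placeholders; a real benchmark would obtain each score by actually running the algorithm.

```go
package main

import (
	"fmt"
	"math"
)

// result pairs a repository with the trust score a scorer assigned to
// it and a hand-made label of whether its stars are believed organic.
type result struct {
	repo  string
	score float64
	legit bool
}

// meanStd returns the mean and standard deviation of scores.
func meanStd(scores []float64) (mean, std float64) {
	if len(scores) == 0 {
		return 0, 0
	}
	for _, s := range scores {
		mean += s
	}
	mean /= float64(len(scores))
	for _, s := range scores {
		std += (s - mean) * (s - mean)
	}
	return mean, math.Sqrt(std / float64(len(scores)))
}

func main() {
	// Hypothetical labeled sample; a real benchmark would compute each
	// score by running the algorithm against the repository.
	results := []result{
		{"golang/go", 0.92, true},
		{"torvalds/linux", 0.89, true},
		{"some/bought-stars", 0.31, false},
	}

	var legit, faked []float64
	for _, r := range results {
		if r.legit {
			legit = append(legit, r.score)
		} else {
			faked = append(faked, r.score)
		}
	}

	lm, ls := meanStd(legit)
	fm, fs := meanStd(faked)
	fmt.Printf("legit: mean=%.2f std=%.2f\n", lm, ls)
	fmt.Printf("faked: mean=%.2f std=%.2f\n", fm, fs)
}
```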