The algorithm to count the stars of a repository is incorrect. #914
Comments
Reason
The reason for this problem is that the original algorithm counts stars only from 'WatchEvent's triggered after 1/1/2015, because the data source, GH Archive, did not start recording the Event API until 1/1/2015.
So the stars given before 1/1/2015 are not counted by the original algorithm.
Evaluation
If we take an example:
SELECT *
FROM github_log.events WHERE repo_id=76067
[Out]
id type action actor_id actor_login \
0 3150100458 ForkEvent added 9071941 shivani02
1 3263259481 IssueCommentEvent created 246181 fdo
2 3263257459 PullRequestEvent opened 246181 fdo
3 3263259051 PullRequestEvent closed 246181 fdo
4 3298659514 WatchEvent started 3060073 zhangli344236745
5 3309655998 WatchEvent started 5877145 4148
6 3367470360 WatchEvent started 6853487 ashdude1120
7 2729434104 WatchEvent started 11944951 patrickjohn931
8 2861893330 WatchEvent started 660911 jh86
9 3104435709 ForkEvent added 3079568 sam-tsai
10 3143294267 WatchEvent started 3719921 Lh4cKg
repo_id repo_name org_id org_login created_at ... \
0 76067 django/django-old 27804 django 2015-09-15 21:12:43 ...
1 76067 django/django-old 27804 django 2015-10-22 02:55:12 ...
2 76067 django/django-old 27804 django 2015-10-22 02:54:00 ...
3 76067 django/django-old 27804 django 2015-10-22 02:54:54 ...
4 76067 django/django-old 27804 django 2015-11-02 15:29:18 ...
5 76067 django/django-old 27804 django 2015-11-05 02:37:23 ...
6 76067 django/django-old 27804 django 2015-11-22 06:42:14 ...
7 76067 django/django-old 27804 django 2015-04-16 08:32:08 ...
8 76067 django/django-old 27804 django 2015-06-03 23:22:15 ...
9 76067 django/django-old 27804 django 2015-08-31 15:49:46 ...
10 76067 django/django-old 27804 django 2015-09-14 05:22:39 ...
...
[11 rows x 132 columns]
we can see that the repository 'django/django-old' has only 11 events, yet it has 2732 stars. |
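To put a number on that gap, here is a minimal Python sketch that counts the 'WatchEvent' rows among the 11 events listed above and compares the result with the 2732 stars the repository shows. The `events` list and the `star_coverage` helper are illustrative names, not part of open-digger.

```python
# Event types of the 11 rows returned for repo_id=76067 (copied from the output above).
events = [
    "ForkEvent", "IssueCommentEvent", "PullRequestEvent", "PullRequestEvent",
    "WatchEvent", "WatchEvent", "WatchEvent", "WatchEvent", "WatchEvent",
    "ForkEvent", "WatchEvent",
]


def star_coverage(event_types, actual_stars):
    """Return (stars seen in the log, fraction of the actual stars the log covers)."""
    counted = sum(1 for t in event_types if t == "WatchEvent")
    return counted, counted / actual_stars


counted, coverage = star_coverage(events, actual_stars=2732)
print(counted, f"{coverage:.2%}")
```

Only a handful of the stars are visible in the event log, so any count based on these rows alone will be far below the number shown on the repository page.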
I think there are 3 reasons:
Example 1:
SELECT *
FROM github_log.events WHERE repo_id=1022930
[Out]
id type action actor_id actor_login \
0 2508242562 ForkEvent added 5823644 dud3
1 9554613329 ForkEvent added 9651925 vijayvani
2 9575447500 ForkEvent added 50352335 VijayEluri
3 16734794487 ForkEvent added 23434323 atrocitytheme
4 3993792457 ForkEvent added 1516696 DevFactory
5 4013909134 ForkEvent added 18620623 SobolSigizmund
6 4020777350 ForkEvent added 19342648 InsightsDev
7 19547339541 ForkEvent added 6499936 charygao
8 3311082181 WatchEvent started 5877145 4148
9 20285826134 ForkEvent added 28696476 bellmit
10 9482380461 WatchEvent started 29672525 FrankieLee1997
11 5239581163 PullRequestEvent closed 413005 robhoes
12 19164581776 WatchEvent started 89962858 masterofalluniverse
repo_id repo_name org_id org_login \
0 1022930 CloudStack-extras/CloudStack-archive 1006719 CloudStack-extras
1 1022930 CloudStack-extras/CloudStack-archive 1006719 CloudStack-extras
2 1022930 CloudStack-extras/CloudStack-archive 1006719 CloudStack-extras
3 1022930 CloudStack-extras/CloudStack-archive 1006719 CloudStack-extras
4 1022930 CloudStack-extras/CloudStack-archive 1006719 CloudStack-extras
5 1022930 CloudStack-extras/CloudStack-archive 1006719 CloudStack-extras
6 1022930 CloudStack-extras/CloudStack-archive 1006719 CloudStack-extras
7 1022930 CloudStack-extras/CloudStack-archive 1006719 CloudStack-extras
8 1022930 CloudStack-extras/CloudStack-archive 1006719 CloudStack-extras
9 1022930 CloudStack-extras/CloudStack-archive 1006719 CloudStack-extras
10 1022930 CloudStack-extras/CloudStack-archive 1006719 CloudStack-extras
11 1022930 CloudStack-extras/CloudStack-archive 1006719 CloudStack-extras
12 1022930 CloudStack-extras/CloudStack-archive 1006719 CloudStack-extras
created_at ... commit_comment_id commit_comment_author_id \
0 2015-01-13 00:14:16 ... 0 0
1 2019-05-02 22:44:41 ... 0 0
2 2019-05-07 06:26:27 ... 0 0
3 2021-06-10 22:46:37 ... 0 0
4 2016-05-10 09:44:19 ... 0 0
5 2016-05-13 23:49:08 ... 0 0
6 2016-05-16 19:38:25 ... 0 0
7 2022-01-02 04:44:50 ... 0 0
8 2015-11-05 12:24:09 ... 0 0
9 2022-02-16 10:16:39 ... 0 0
10 2019-04-21 14:27:01 ... 0 0
11 2017-01-31 10:17:57 ... 0 0
12 2021-12-03 07:11:23 ... 0 0
...
Example 2:
SELECT *
FROM github_log.events WHERE repo_id=52308441
[Out]
...
id type action actor_id actor_login \
...
1 11352830554 CommitCommentEvent added 11790366 flexsurfer
...
repo_id repo_name org_id org_login created_at \
...
1 52308441 status-im/status-react 11767950 status-im 2018-10-03 12:38:31
...
...
while
SELECT *
FROM github_log.events WHERE repo_name='status-im/status-mobile'
[Out]
id type action actor_id actor_login \
0 22919621662 IssueCommentEvent created 40699771 status-im-auto
1 22919688898 IssueCommentEvent created 40699771 status-im-auto
...
repo_id repo_name org_id org_login \
0 52308441 status-im/status-mobile 11767950 status-im
1 52308441 status-im/status-mobile 11767950 status-im
...
created_at ... commit_comment_id commit_comment_author_id \
0 2022-07-17 13:17:29 ... 0 0
1 2022-07-17 13:27:58 ... 0 0
...
...
The name has been changed.
Example 3:
Of which the PRs and issues are:
They seem normal. I think maybe some errors have simply occurred? |
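The rename in Example 2 suggests grouping events by `repo_id` rather than by `repo_name`. A minimal Python sketch of that idea follows, using two rows from Example 2; the `group_by_repo_id` helper is a hypothetical name introduced here for illustration.

```python
from collections import defaultdict

# (repo_id, repo_name, created_at) for two rows shown in Example 2.
rows = [
    (52308441, "status-im/status-react", "2018-10-03 12:38:31"),
    (52308441, "status-im/status-mobile", "2022-07-17 13:17:29"),
]


def group_by_repo_id(rows):
    """Count events per repo_id and keep the most recently seen name for each id."""
    counts = defaultdict(int)
    latest = {}  # repo_id -> (created_at, repo_name)
    for repo_id, repo_name, created_at in rows:
        counts[repo_id] += 1
        # Timestamps in 'YYYY-MM-DD HH:MM:SS' form sort correctly as strings.
        if repo_id not in latest or created_at > latest[repo_id][0]:
            latest[repo_id] = (created_at, repo_name)
    return {rid: (latest[rid][1], counts[rid]) for rid in counts}


result = group_by_repo_id(rows)
print(result)
```

Keyed this way, events recorded under the old and the new name end up in the same bucket.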
After all the analysis above, I have the following suggestions.
Suggestions
|
In clickhouse-demo, this notebook seems to reproduce the functions in github-explorer to show that we can do all the analysis with open-digger. In addition, I have had the same question before, and here are some of my thoughts. I'll write them down for discussion. Firstly, I think the use of For For
This SQL counts the number of stars gained per month (Frank told me it's complicated, but I forget where). For |
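As a rough illustration of per-month star counting (a Python sketch, not the SQL referred to above), the timestamps of the six 'WatchEvent' rows of django/django-old shown earlier can be bucketed by month. The `stars_per_month` helper is a hypothetical name.

```python
from collections import Counter

# created_at values of the six 'WatchEvent' rows of django/django-old shown earlier.
watch_times = [
    "2015-11-02 15:29:18", "2015-11-05 02:37:23", "2015-11-22 06:42:14",
    "2015-04-16 08:32:08", "2015-06-03 23:22:15", "2015-09-14 05:22:39",
]


def stars_per_month(timestamps):
    """Count timestamps per 'YYYY-MM' bucket, ordered by month."""
    return dict(sorted(Counter(ts[:7] for ts in timestamps).items()))


monthly = stars_per_month(watch_times)
print(monthly)
```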
Hi, @xgdyp. Thanks for the reply.
I see... Such as replacing the original notice with a warning:
That, I think, will help a lot.
Yep, I myself use this API to do real-time counting for a small number of repositories, too.
This point certainly convinced me; I had not paid attention to this. Thank you for this explanation. |
Thanks for this example query. It works well. Based on the query above, I've come up with a research idea. But it is unrelated to the main question of this issue, so I will close this issue and add it to my own research issues. Thanks again. |
For this part, should I raise a PR, or is it not worth it? |
@yoyo-wu98 Thanks for the great work you've done on the star count.
|
Hi, @frank-zsy. Thanks a lot for the explanation. It helps a lot :) |
Question Description
In Clickhouse Demo, we have the algorithm below (we will call it the 'Original Algorithm' for short) to count the stars of a repository:
But I found that if we use another algorithm (we will call it the 'New Algorithm' for short), shown below, we get a different result that does not include removed or duplicated stars:
This result may not be correct, but I think it is closer to the correct answer.
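As an illustration only (these are not the actual queries of either algorithm), one plausible way to drop duplicated stars is to count distinct users with a 'WatchEvent' instead of counting every 'WatchEvent' row. The sample data and both helper names below are hypothetical, and the sketch covers duplicates only, not removed stars.

```python
# Hypothetical (actor_id, event_type) pairs for one repository; not real data.
sample_events = [
    (1, "WatchEvent"),
    (2, "WatchEvent"),
    (1, "WatchEvent"),  # the same user starring the repository a second time
    (3, "ForkEvent"),
]


def count_all_watch_events(events):
    """Count every 'WatchEvent' row, including repeats by the same user."""
    return sum(1 for _, event_type in events if event_type == "WatchEvent")


def count_distinct_stargazers(events):
    """Count each user at most once, however many 'WatchEvent' rows they have."""
    return len({actor for actor, event_type in events if event_type == "WatchEvent"})


print(count_all_watch_events(sample_events), count_distinct_stargazers(sample_events))
```

Under this reading the deduplicated count can never exceed the row count, which is worth keeping in mind when comparing the two results.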
Evaluation
You mentioned in the article that
But I think the difference can be very large.
If we use the two algorithms to calculate the stars before 7/9/2022, we can get these two results:
We can see that the ranking order changes a lot.
I used a query to figure out how different the two algorithms are:
Summary
Anyway, I think we should use the New Algorithm to calculate the stars.
Appendix
The difference rate can be calculated by the query below:
This leads to another problem: why is the difference rate (the difference between the original algorithm's count and the new algorithm's count, divided by the original algorithm's count) much larger than 1?
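The difference rate described above can be sketched in Python as follows. Taking the absolute value of the difference is an assumption, since the text does not say which count is subtracted from which, and the numbers in the usage lines are made up.

```python
def difference_rate(original_count, new_count):
    """Return |original - new| / original, following the definition in the text."""
    if original_count == 0:
        raise ValueError("original_count must be non-zero")
    return abs(original_count - new_count) / original_count


# Hypothetical counts, for illustration only.
print(difference_rate(100, 90))
print(difference_rate(100, 350))
```

With this definition the rate can exceed 1 only when the new count is more than twice the original count, so a rate much larger than 1 means the new algorithm reports far more stars than the original one, not fewer.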