Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The algorithm to count the stars of a repository is incorrect. #914

Closed
yoyo-wu98 opened this issue Jul 18, 2022 · 10 comments
Closed

The algorithm to count the stars of a repository is incorrect. #914

yoyo-wu98 opened this issue Jul 18, 2022 · 10 comments
Labels
kind/question Category issues related to questions or problems

Comments

@yoyo-wu98
Copy link
Contributor

yoyo-wu98 commented Jul 18, 2022

Question Description

In Clickhouse Demo, we have the below algorithm (we will call it 'Original Algorithm' for short) to count the stars of a repository:

SELECT COUNT() AS total_count
FROM github_log.events
WHERE type = 'WatchEvent'

But I found that if we use another algorithm (we will call it 'New Algorithm' for short) as below we can get a different result which does not contain the removed or duplicated stars:

SELECT repo_name, MAX(repo_stargazers_count) AS stars
FROM github_log.events 
GROUP BY repo_name

This result may not be correct, but I think it is more close to the correct answer.

Evaluation

You mentioned in the article that

There is no events when someone removes the star. This means that star removal is invisible, but it should be rare in any case, so we can calculate the approximate number of stars from these events.

But I think the difference can be very large.

If we use the two algorithms to calculate the stars before 7/9/2022, we can get these two results:

  1. Original Algorithm:
    SELECT repo_name, COUNT() AS stars
    FROM github_log.events WHERE type = 'WatchEvent'
    GROUP BY repo_name
    ORDER BY stars DESC
    limit 10
    
    [Out]
                                 repo_name   stars
    0                       996icu/996.ICU  270420
    1                            vuejs/vue  227619
    2                 sindresorhus/awesome  227345
    3            FreeCodeCamp/FreeCodeCamp  224461
    4      kamranahmedse/developer-roadmap  217409
    5                       facebook/react  214575
    6  jwasham/coding-interview-university  210002
    7     donnemartin/system-design-primer  196206
    8                tensorflow/tensorflow  193690
    9            freeCodeCamp/freeCodeCamp  179371
  2. New Algorithm:
    SELECT repo_name, max(repo_stargazers_count) as repo_stargazers_count
    FROM github_log.events 
    group by repo_name
    order by repo_stargazers_count DESC
    limit 10
    
    [Out]
                                    repo_name  repo_stargazers_count
    0               freeCodeCamp/freeCodeCamp                 349508
    1                          996icu/996.ICU                 262748
    2  EbookFoundation/free-programming-books                 241623
    3     jwasham/coding-interview-university                 225568
    4               FreeCodeCamp/freeCodeCamp                 219298
    5               FreeCodeCamp/FreeCodeCamp                 218740
    6                    sindresorhus/awesome                 210146
    7         kamranahmedse/developer-roadmap                 201459
    8                 public-apis/public-apis                 201445
    9                               vuejs/vue                 197883

We can see that there are a lot of changes in order.

I use a query to figure out how different between the two algorithm:

SELECT repo_name, abs(original.stars - new.stars) as diff from 
(SELECT repo_name, COUNT() AS stars
FROM github_log.events WHERE type = 'WatchEvent'
GROUP BY repo_name) as original 
join 
(SELECT repo_name, max(repo_stargazers_count) as stars
FROM github_log.events 
group by repo_name) as new
on original.repo_name = new.repo_name
order by diff DESC
limit 20

[Out]
                                     repo_name    diff
0                    FreeCodeCamp/freeCodeCamp  218733
1                    freeCodeCamp/freeCodeCamp  170139
2             codecrafters-io/build-your-own-x  140738
3                              ohmyzsh/ohmyzsh   93090
4                              mui/material-ui   74611
5       EbookFoundation/free-programming-books   70384
6                      animate-css/animate.css   64199
7                             microsoft/vscode   63546
8                          aplus-framework/app   57586
9                                   Microsoft/   55390
10  practical-tutorials/project-based-learning   54963
11                              997icu/996.ICU   53962
12                         puppeteer/puppeteer   53256
13                           coder/code-server   49729
14                     public-apis/public-apis   49372
15                              redis-io/redis   43706
16                        microsoft/TypeScript   43673
17                              apache/echarts   43375
18                              vercel/next.js   43298
19                      remix-run/react-router   43253

Summary

Anyway, I think we should use the New Algorithm to calculate the stars.


Appendix

The difference rate can be calculated by the query below:

SELECT repo_name, diff, diff / original_stars as diff_rate FROM
(SELECT repo_name, abs(original.stars - new.stars) as diff, original.stars as original_stars from 
(SELECT repo_name, COUNT() AS stars
FROM github_log.events WHERE type = 'WatchEvent'
GROUP BY repo_name) as original 
join 
(SELECT repo_name, max(repo_stargazers_count) as stars
FROM github_log.events 
group by repo_name) as new
on original.repo_name = new.repo_name)
ORDER BY diff_rate DESC
limit 20

[Out]
                          repo_name   diff     diff_rate
0                            rails/  39882  39882.000000
1                        Microsoft/  55390  27695.000000
2                             zeit/  25233  25233.000000
3                          videojs/  21461  21461.000000
4           floating-ui/popper-core  18398  18398.000000
5                           google/  26354  13177.000000
6                         dawnlabs/  10341  10341.000000
7                           apache/  19924   6641.333333
8                           apereo/   4479   4479.000000
9           status-im/status-mobile   3480   3480.000000
10        not-kennethreitz/requests  39446   2465.375000
11                     hanami/lotus   2374   2374.000000
12                 yusuke/Twitter4J   2241   2241.000000
13     jezdez/django-webpack-loader   2178   2178.000000
14              EatBreatheCode/pace  14958   2136.857143
15  APD-ORG/awesome-public-datasets  16513   2064.125000
16            slityourthroat/badass   2048   2048.000000
17       Microsoft/botframework-sdk   5636   1878.666667
18                   flipt-io/flipt   1849   1849.000000
19                       libra/diem  15455   1717.222222

So we come into another problem: why is the difference rate (difference between original algorithm between new algorithm / original algorithm counts) much larger than 1.

@yoyo-wu98 yoyo-wu98 changed the title The algorithm counting the stars of a repository is incorrect. The algorithm counting the stars of a repository is incorrect. #kind/question Jul 18, 2022
@open-digger-bot open-digger-bot bot added the kind/question Category issues related to questions or problems label Jul 18, 2022
@yoyo-wu98 yoyo-wu98 changed the title The algorithm counting the stars of a repository is incorrect. #kind/question The algorithm counting the stars of a repository is incorrect. Jul 18, 2022
@yoyo-wu98
Copy link
Contributor Author

yoyo-wu98 commented Jul 18, 2022

Reason

The reason of this problem is that when we use the original algorithm, the stars are counted by the 'WatchEvent' triggered only after 1/1/2015 because the data source GH Archive did not start recording Event API until 1/1/2015.

  • Activity archives for dates between 2/12/2011-12/31/2014 was recorded from the (now deprecated) Timeline API.
  • Activity archives for dates starting 1/1/2015 is recorded from the Events API.

(reference: http://www.gharchive.org)

So, the stars before 1/1/2015 is not counted by the original algorithm.

Evaluation

If we take an example:

SELECT *
FROM github_log.events WHERE repo_id=76067

[Out]
            id               type   action  actor_id       actor_login  \
0   3150100458          ForkEvent    added   9071941         shivani02   
1   3263259481  IssueCommentEvent  created    246181               fdo   
2   3263257459   PullRequestEvent   opened    246181               fdo   
3   3263259051   PullRequestEvent   closed    246181               fdo   
4   3298659514         WatchEvent  started   3060073  zhangli344236745   
5   3309655998         WatchEvent  started   5877145              4148   
6   3367470360         WatchEvent  started   6853487       ashdude1120   
7   2729434104         WatchEvent  started  11944951    patrickjohn931   
8   2861893330         WatchEvent  started    660911              jh86   
9   3104435709          ForkEvent    added   3079568          sam-tsai   
10  3143294267         WatchEvent  started   3719921            Lh4cKg   

    repo_id          repo_name  org_id org_login          created_at  ...  \
0     76067  django/django-old   27804    django 2015-09-15 21:12:43  ...   
1     76067  django/django-old   27804    django 2015-10-22 02:55:12  ...   
2     76067  django/django-old   27804    django 2015-10-22 02:54:00  ...   
3     76067  django/django-old   27804    django 2015-10-22 02:54:54  ...   
4     76067  django/django-old   27804    django 2015-11-02 15:29:18  ...   
5     76067  django/django-old   27804    django 2015-11-05 02:37:23  ...   
6     76067  django/django-old   27804    django 2015-11-22 06:42:14  ...   
7     76067  django/django-old   27804    django 2015-04-16 08:32:08  ...   
8     76067  django/django-old   27804    django 2015-06-03 23:22:15  ...   
9     76067  django/django-old   27804    django 2015-08-31 15:49:46  ...   
10    76067  django/django-old   27804    django 2015-09-14 05:22:39  ...   

...
[11 rows x 132 columns]

we can find out that: The repository 'django/django-old' just has 11 events but it has 2732 stars.

@yoyo-wu98
Copy link
Contributor Author

yoyo-wu98 commented Jul 18, 2022

So we come into another problem: why is the difference rate (difference between original algorithm between new algorithm / original algorithm counts) much larger than 1.

I think there are 3 reasons:

  1. most of stars are triggered before 1/1/2015;
  2. some repositories' name have been changed;
  3. some errors inccurred.

Example 1:

SELECT *
FROM github_log.events WHERE repo_id=1022930

[Out]
             id              type   action  actor_id          actor_login  \
0    2508242562         ForkEvent    added   5823644                 dud3   
1    9554613329         ForkEvent    added   9651925            vijayvani   
2    9575447500         ForkEvent    added  50352335           VijayEluri   
3   16734794487         ForkEvent    added  23434323        atrocitytheme   
4    3993792457         ForkEvent    added   1516696           DevFactory   
5    4013909134         ForkEvent    added  18620623       SobolSigizmund   
6    4020777350         ForkEvent    added  19342648          InsightsDev   
7   19547339541         ForkEvent    added   6499936             charygao   
8    3311082181        WatchEvent  started   5877145                 4148   
9   20285826134         ForkEvent    added  28696476              bellmit   
10   9482380461        WatchEvent  started  29672525       FrankieLee1997   
11   5239581163  PullRequestEvent   closed    413005              robhoes   
12  19164581776        WatchEvent  started  89962858  masterofalluniverse   

    repo_id                             repo_name   org_id          org_login  \
0   1022930  CloudStack-extras/CloudStack-archive  1006719  CloudStack-extras   
1   1022930  CloudStack-extras/CloudStack-archive  1006719  CloudStack-extras   
2   1022930  CloudStack-extras/CloudStack-archive  1006719  CloudStack-extras   
3   1022930  CloudStack-extras/CloudStack-archive  1006719  CloudStack-extras   
4   1022930  CloudStack-extras/CloudStack-archive  1006719  CloudStack-extras   
5   1022930  CloudStack-extras/CloudStack-archive  1006719  CloudStack-extras   
6   1022930  CloudStack-extras/CloudStack-archive  1006719  CloudStack-extras   
7   1022930  CloudStack-extras/CloudStack-archive  1006719  CloudStack-extras   
8   1022930  CloudStack-extras/CloudStack-archive  1006719  CloudStack-extras   
9   1022930  CloudStack-extras/CloudStack-archive  1006719  CloudStack-extras   
10  1022930  CloudStack-extras/CloudStack-archive  1006719  CloudStack-extras   
11  1022930  CloudStack-extras/CloudStack-archive  1006719  CloudStack-extras   
12  1022930  CloudStack-extras/CloudStack-archive  1006719  CloudStack-extras   

            created_at  ...  commit_comment_id  commit_comment_author_id  \
0  2015-01-13 00:14:16  ...                  0                         0   
1  2019-05-02 22:44:41  ...                  0                         0   
2  2019-05-07 06:26:27  ...                  0                         0   
3  2021-06-10 22:46:37  ...                  0                         0   
4  2016-05-10 09:44:19  ...                  0                         0   
5  2016-05-13 23:49:08  ...                  0                         0   
6  2016-05-16 19:38:25  ...                  0                         0   
7  2022-01-02 04:44:50  ...                  0                         0   
8  2015-11-05 12:24:09  ...                  0                         0   
9  2022-02-16 10:16:39  ...                  0                         0   
10 2019-04-21 14:27:01  ...                  0                         0   
11 2017-01-31 10:17:57  ...                  0                         0   
12 2021-12-03 07:11:23  ...                  0                         0   

...

Example 2:

SELECT *
FROM github_log.events WHERE repo_id=52308441

[Out]
...
             id                type action  actor_id        actor_login  \
...
1   11352830554  CommitCommentEvent  added  11790366         flexsurfer   
...

     repo_id               repo_name    org_id  org_login          created_at  \
...
1   52308441  status-im/status-react  11767950  status-im 2018-10-03 12:38:31   
...

...

while

SELECT *
FROM github_log.events WHERE repo_name='status-im/status-mobile'

[Out]
             id               type   action  actor_id     actor_login  \
0   22919621662  IssueCommentEvent  created  40699771  status-im-auto   
1   22919688898  IssueCommentEvent  created  40699771  status-im-auto   
...

     repo_id                repo_name    org_id  org_login  \
0   52308441  status-im/status-mobile  11767950  status-im   
1   52308441  status-im/status-mobile  11767950  status-im   
...

            created_at  ...  commit_comment_id  commit_comment_author_id  \
0  2022-07-17 13:17:29  ...                  0                         0   
1  2022-07-17 13:27:58  ...                  0                         0   
...

...

The name has been changed.

Example 3:

SELECT *
FROM github_log.events WHERE repo_name='rails/'

[Out]
           id               type   action  actor_id     actor_login  repo_id  \
0  7810602824   PullRequestEvent   opened   6443532     bogdanvlviv     8514   
1  7816141721   PullRequestEvent   closed     94129  georgeclaghorn     8514   
2  7578883365  IssueCommentEvent  created     94129  georgeclaghorn     8514   
3  8016021876         WatchEvent  started  34185978       keshandrr     8514   

  repo_name  org_id org_login          created_at  ...  commit_comment_id  \
0    rails/    4223     rails 2018-06-12 07:27:07  ...                  0   
1    rails/    4223     rails 2018-06-13 02:42:06  ...                  0   
2    rails/    4223     rails 2018-04-24 14:31:59  ...                  0   
3    rails/    4223     rails 2018-07-25 12:03:59  ...                  0  

...

Of which the PRs and issues are:

  1. Fix test TestCaseTest#test_deprecated_body_stream rails/rails#33125
  2. Don't purge ActiveStorage blob if record was rolled back to still reference it rails/rails#33069
  3. Use webpack4 in ActiveStorage rails/rails#32391

They seem normal.

I think may be it is that just some errors have inccurred?

@yoyo-wu98
Copy link
Contributor Author

yoyo-wu98 commented Jul 18, 2022

After all analysis above, I have the following suggestions.

Suggestions

  1. Use repo_id instead of repo_name as the indexes of analysis;
  2. Use the New Algorithm to count the stars of a repoistory:
    SELECT repo_id, max(repo_stargazers_count) as repo_stargazers_count
    FROM github_log.events WHERE type = 'PullRequestEvent'
    group by repo_id

@yoyo-wu98 yoyo-wu98 changed the title The algorithm counting the stars of a repository is incorrect. The algorithm to count the stars of a repository is incorrect. Jul 19, 2022
@xgdyp
Copy link
Contributor

xgdyp commented Jul 19, 2022

In clickhouse-demo, this notebook seems to reproduce the functions in github-explorer to show we can do all analysis by open-digger.

In addition, I have had the same question before, and here are some of my thoughts. I'll write it down for discussion.

Firstly, I think the use of star can be divided into count and aggregate
For count, we may calculate the count of a project. i.e. we can calculate how many stars now in open-digger.
For aggregate usage, we can calculate some metrics like "How many stars have increased this week?"

For count usage,I think we may need an accurate number. I usually get this number by Github API. (You need to add a Accept params in post headers).This will return a list of all the star records of a repo.like:
image
It returns the star time too so we can get the star history. Actually ,many projects like starcharts use this method too.

For aggregate usage, we need to do a lot of calculations, so we can't use GitHub API because of rate limit.
In this case, we can use clickhouse to do more aggregation operations. We can get the approximate number (for higher approximation, we need to limit each account to star only once). like:

SELECT tmp.repo_id ,
anyHeavy( tmp.repo_name) as repo_name ,
COUNT( *) as star_num

FROM 
(select repo_id,repo_name,actor_id,actor_login,type, max(created_at)as created_at from year2021 GROUP BY repo_id,repo_name,actor_id,actor_login,type) tmp

WHERE tmp.type = 'WatchEvent' and toMonth( tmp.created_at) = 12
GROUP BY repo_id
ORDER BY star_num DESC
LIMIT 10

This SQL count the number of stars increased per month(Frank told me it's complicated but I forget where).

For repo_stargazers_count, it seemsrepo_stargazers_count field only be triggered by PullRequestEvent and PullRequestReviewCommentEvent type. So if we need to aggregate stars over a time interval, this will be inaccurate too because PR-related events may be less frequent in all events.

@yoyo-wu98
Copy link
Contributor Author

yoyo-wu98 commented Jul 19, 2022

Hi, @xgdyp . Thanks for reply.

In clickhouse-demo, this notebook seems to reproduce the functions in github-explorer to show we can do all analysis by open-digger.

I see...
But I still suggest that we should add some notice in the demo part in case some developers may be misled.

Such as replacing the original notice to a warning:

Original Notice:

There is no events when someone removes the star. This means that star removal is invisible, but it should be rare in any case, so we can calculate the approximate number of stars from these events.

to the following warning:

Pay attention, the star counting here is incorrect because of the following reasons:

  1. The stars before 1/1/2015 are not counted by 'WatchEvent' because the data source GH Archive did not start recording Event API until 1/1/2015.
  2. Some stars are removed or re-stared several times so it may lead to missing stars or duplicated stars.
  3. repo_name may have been changed across the data.

So, our suggestions are:

  1. Use Github API to fetch the real-time star count.
  2. Use repo_id instead of repo_name as the indexes of analysis.

That, I think, will helps a lot.

For count usage,I think we may need an accurate number. I usually get this number by Github API. (You need to add a Accept params in post headers).This will return a list of all the star records of a repo.like:

Yep, I myself use this API to do the real-time counting of a small number of repoistories, too.

For repo_stargazers_count, it seemsrepo_stargazers_count field only be triggered by PullRequestEvent and PullRequestReviewCommentEvent type. So if we need to aggregate stars over a time interval, this will be inaccurate too because PR-related events may be less frequent in all events.

This point certainly convinced me, I did not pay attention to this. Thank you for this explanation.

@yoyo-wu98
Copy link
Contributor Author

yoyo-wu98 commented Jul 19, 2022

For aggregate usage, we need to do a lot of calculations, so we can't use GitHub API because of rate limit. In this case, we can use clickhouse to do more aggregation operations. We can get the approximate number (for higher approximation, we need to limit each account to star only once). like:

SELECT tmp.repo_id ,
anyHeavy( tmp.repo_name) as repo_name ,
COUNT( *) as star_num

FROM 
(select repo_id,repo_name,actor_id,actor_login,type, max(created_at)as created_at from year2021 GROUP BY repo_id,repo_name,actor_id,actor_login,type) tmp

WHERE tmp.type = 'WatchEvent' and toMonth( tmp.created_at) = 12
GROUP BY repo_id
ORDER BY star_num DESC
LIMIT 10

Thanks for this example query. It works well.

According to the query above, I've come up with a research idea. But it is irrelated to the main question of this issue. So I will close this issue and add it into my own research issues. Thanks again.

@yoyo-wu98 yoyo-wu98 reopened this Jul 19, 2022
@yoyo-wu98
Copy link
Contributor Author

yoyo-wu98 commented Jul 19, 2022

But I still suggest that we should add some notice in the demo part in case some developers may be misled.

Such as replacing the original notice to a warning:

Original Notice:

There is no events when someone removes the star. This means that star removal is invisible, but it should be rare in any case, so we can calculate the approximate number of stars from these events.

to the following warning:

Pay attention, the star counting here is incorrect because of the following reasons:

  1. The stars before 1/1/2015 are not counted by 'WatchEvent' because the data source GH Archive did not start recording Event API until 1/1/2015.
  2. Some stars are removed or re-stared several times so it may lead to missing stars or duplicated stars.
  3. repo_name may have been changed across the data.

So, our suggestions are:

  1. Use Github API to fetch the real-time star count.
  2. Use repo_id instead of repo_name as the indexes of analysis.

For this part, should I raise a PR or just not worth it?

@xgdyp
Copy link
Contributor

xgdyp commented Jul 19, 2022

@gymgym1212 @frank-zsy

@frank-zsy
Copy link
Contributor

frank-zsy commented Jul 19, 2022

@yoyo-wu98 Thanks for the great work you've done about the star count.

  • The demo notebook is a reproduce work of github-explorer, actually I did my best to keep consistent with the original work. And since the table schema in the demo does not contain repo_id column, it can only use repo_name to aggregate the result which will lead to an incorrect result like freeCodeCamp. Please feel free to raise a PR and add a notice in the notebook about the problem.
  • And @xgdyp gives a comprehensive explanation about the event data, like use actor_id distinction to remove duplicate star events and repo_stargazers_count only exists in PullRequest related event type. So we can use API (v3 or GraphQL) to get the real time star count.
  • And for the metrics implementation in OpenDigger, actually we all use repo_id, org_id or actor_id to identify an instance and use argMax(created_at, repo_name) to get the recent repo name rather than use the name directly. I think you can also refer this in the notice of demo notebook too.

@yoyo-wu98
Copy link
Contributor Author

Hi, @frank-zsy . Thanks a lot for the explanation. Helps a lot:)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/question Category issues related to questions or problems
Projects
None yet
Development

No branches or pull requests

3 participants