Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support to count commits #80

Merged
merged 5 commits into from Feb 18, 2019
Merged

Add support to count commits #80

merged 5 commits into from Feb 18, 2019

Conversation

@budden
Copy link
Contributor

budden commented Feb 12, 2019

This one is for issue #26 (implement COUNT). So far not all test suites are updated, otherwise it seem to work. Please review when you have time, meanwhile I'll update missing tests in a day or two.

@cloudson

This comment has been minimized.

Copy link
Owner

cloudson commented Feb 13, 2019

That's a great contribution @budden , another one to check. Thanks a lot.

@budden

This comment has been minimized.

Copy link
Contributor Author

budden commented Feb 13, 2019

Now I seem to have touched all test suites.

@cloudson

This comment has been minimized.

Copy link
Owner

cloudson commented Feb 14, 2019

Hey @budden I'm testing your change and the way how we present the information is really good but there is a problem that I mentioned in #26.
For now, gitql has a default limit result and if I type select count(*) from commits it shows the default limit (10 rows).
My worried here ist that I didn't find a solution beyond a full table scan respecting the way how gitql works (accessing directly git objects), but count the commits in projects like kubernetes would be slow.

How do you see that issue?

@budden

This comment has been minimized.

Copy link
Contributor Author

budden commented Feb 14, 2019

Hi @cloudson , I've read the discussion in the issue and decided to start with simplest yet useful thing.
Haha, kubernetes has 74+k commits :) SO says there is command line to get count "faster", like git rev-list --all --count, see https://stackoverflow.com/questions/677436/how-do-i-get-the-git-commit-count .
If we go this way there are more questions to explore then. For instance, what to do with select count(*) from commits where .... Counting all commits (without where) is just one of use cases, not necessary the most important one. Sorry I have no good PC at hand to check all those things, will do that tomorrow my timezone.
What goes to limit, I don't think it is a bad thing, because count(*) ... limit N would show the right number if N is good enough. This behavior can be changed so that count ignores limit, but current one is regular. That's up to you, I can change it. OTOH I would change default limit to something more useful, like 1000.

@budden

This comment has been minimized.

Copy link
Contributor Author

budden commented Feb 14, 2019

My wrong. In Postgresql and MySQL, limit must not affect the count, because limit applies to the result set, which is always of length 1. The only possible exception is limit 0. So I'll change this one tomorrow.

@cloudson

This comment has been minimized.

Copy link
Owner

cloudson commented Feb 15, 2019

No rush man, we can discuss more before you implement it.

@budden

This comment has been minimized.

Copy link
Contributor Author

budden commented Feb 15, 2019

It takes 3-4 seconds on my hardware to obtain 'count(*) from commits limit 100000' for kubernetes. Do we really need some optimization here?

@budden

This comment has been minimized.

Copy link
Contributor Author

budden commented Feb 15, 2019

select count(*) from commits limit 100500 now runs 4 seconds against git rev-list --count HEAD, which takes 0.8 seconds only. More interesting, if we run git rev-list --all --count, then results are different, where --all --count returns more commits. I don't know what does it mean, but in worst case it means that commits virtual table does not cover all actual commits.

@cloudson

This comment has been minimized.

Copy link
Owner

cloudson commented Feb 15, 2019

wow 😍 , for me it's quite good. Can you check it with some more complex where query using date interval and email as criteria?

@cloudson cloudson changed the title Issue 26 by budden Add support to count commits Feb 15, 2019
@budden

This comment has been minimized.

Copy link
Contributor Author

budden commented Feb 15, 2019

Ok, will do after making limit to not affect count.

@budden

This comment has been minimized.

Copy link
Contributor Author

budden commented Feb 16, 2019

Fixed count vs limit. The results of measuring follow. I'm running debian stretch amd64 on a virtual box, which runs on a ASUS ROG G771JW laptop with an SSD drive under windows 10. Virtual memory is disabled on Windows 10.

/y/kubernetes$ time git rev-list --count HEAD
74763
real	0m2,981s

$ time git ql --type json "select count(*) from commits"
[{"count":"74763"}]
real	0m5,551s

$ time git ql --type json "select count(*) from commits where author_email like '%.com' order by date asc limit 3"
[{"count":"71952"}]
real	0m6,463s

$ time git ql "select count(*) from commits where author_email like '%.com' and date >= '2017-01-01' and date <= '2017-12-31'"
+-------+
| count |
+-------+
| 17617 |
+-------+
real	0m7,880s


@cloudson

This comment has been minimized.

Copy link
Owner

cloudson commented Feb 16, 2019

I tested locally and it seems amazing. I just found a little problem that I would like to discuss to you.

git ql --type json "select count(*) from commits where author_email like '%.com' order by date asc limit 3" 

Should return 3 instead of 71952. I mean, when we set a limit, count should not be greater than it, isn't?
What do you think about it?

@budden

This comment has been minimized.

Copy link
Contributor Author

budden commented Feb 16, 2019

Should return 3 instead of 71952.

Hi! I'm hesitating it should. As I mentioned before, postgresql and mysql apply limit after count, so "limit" sees exactly one record containing the count value, and so limit is actually meaningless after count (limit 0 is not allowed in gitql).
For instance,

https://dba.stackexchange.com/questions/117284/count-gives-more-than-1-with-limit-1
https://stackoverflow.com/questions/17020842/mysql-count-with-limit

So if we're following "the principle of least surprise", gitql should do the same, that is, return 71952 regardless of LIMIT value.

@cloudson

This comment has been minimized.

Copy link
Owner

cloudson commented Feb 18, 2019

You are totally right @budden . I'm merging it and appreciate all your work. Thanks a lot.

@cloudson cloudson merged commit 4767869 into cloudson:develop Feb 18, 2019
2 checks passed
2 checks passed
Codacy/PR Quality Review Up to standards. A positive pull request.
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
2 participants
You can’t perform that action at this time.