Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Programmatic access to yearly stats? #662

Closed
eliben opened this issue Feb 4, 2023 · 8 comments
Closed

Programmatic access to yearly stats? #662

eliben opened this issue Feb 4, 2023 · 8 comments

Comments

@eliben
Copy link

eliben commented Feb 4, 2023

I'm looking to answer the question: "what are the most viewed pages on my site in the past year?" programmatically (i.e. in some format a program/script can easily parse)

There are several approaches I explored:

  • Using the API like /api/v0/stats/hits -- this seems to provide information for the last 7 days, and it's unclear how to change the date range?
  • Querying the DB directly -- it's not clear which tables to look at -- would hits, hit_counts or hit_stats be the right table to work with?
  • Using the CSV export -- would this be simpler then querying the DB directly?
  • Using the cmdline dashboard command -- seems to be limited to a certain date range too
  • Set the main HTML dashboard to the right date parameters and scrape that HTML -- this will likely work fine, except needing to parse HTML instead of something more convenient like CSV or JSON
  • View data as a textual table #354 mentions a textual table view, but I couldn't find it - was it removed?

Any ideas/suggestions much appreciated :-)

@arp242
Copy link
Owner

arp242 commented Feb 9, 2023

I think something like this is what you want?

/api/v0/stats/hits?start=2022-01-01&end=2022-12-31&daily=1

The API reference has an overview of all parameters: https://www.goatcounter.com/help/api#api-reference-195

e.g. the RapiDoc version: https://www.goatcounter.com/api2.html#get-/api/v0/stats/hits

@eliben
Copy link
Author

eliben commented Feb 9, 2023

Ah, thanks! I read the help/api doc in detail, but missed the links to RapiDoc that explains parameters like start and daily.

@eliben eliben closed this as completed Feb 9, 2023
@eliben eliben reopened this Dec 2, 2023
@eliben
Copy link
Author

eliben commented Dec 2, 2023

@arp242 I have a follow-up question.

Can I somehow obtain more than 100 pages from the API? Per the RapiDoc reference (https://www.goatcounter.com/api2.html#get-/api/v0/stats/hits) the max is 100.

I realize these outputs can be rather large, but I actually don't need the daily/hourly counts at all - just getting the total number of views per page in the given time period would be great.

@arp242
Copy link
Owner

arp242 commented Dec 6, 2023

What kind of limit would you like?

The grouping is comparatively expensive, that's why there's a limit. It'll probably be okay for smaller sites but for larger sites getting 2,000 paths in one go is probably going to be very slow.

daily=true makes it a bit faster. Should maybe change that to group=hour, group=day, group=none – without grouping it'll be a lot faster. Took a quick look and it's not too hard, but does require a tad of refactoring.

@eliben
Copy link
Author

eliben commented Dec 6, 2023

Generally, I'd like no limit :-) In fact, I have just such a test case - my blog's stats DB is around 500 MiB now, and I have over 2.2k paths hit in the past 11 months.

I wrote a direct SQL query against the DB that obtains all posts + their total counts, and it takes 1.2 sec on my machine, or 5 sec on a sluggish VM I have hosting the blog.

The big performance issue - IMHO - is the fine-grained aggregation of hourly and daily hits. When you count totals you don't really need any of them, all you need is the post path and how many total hits it had in a period. I do realize that the way the DB is currently structured, this implies a pretty much linear scan for the entire DB to collect these stats.

I may be missing something, but isn't the API meant for the user's own use? So maybe it's ok to be a bit slow if the query is known to be large. Perhaps you're worried about the hosted solution you provide - that large API queries will overload it? That's a reasonable concern.

@arp242
Copy link
Owner

arp242 commented Dec 6, 2023

Oh right, it's for self-hosted? Can just add an option or something for that – if it's slow then that's your own problem.

So maybe it's ok to be a bit slow if the query is known to be large.

Yes and no, because the database spending tons of time on your slow API requests also affects other users and their performance. On your own self-hosted version that's probably not an issue, but for goatcounter.com it can be.

arp242 added a commit that referenced this issue Dec 8, 2023
@arp242
Copy link
Owner

arp242 commented Dec 8, 2023

I added -api-max flag; I don't think a group= option probably isn't needed(?) Maybe I'll add that later – I'll keep this issue open for now as a reminder.

I'll also have look at adding a TOML config file or something – doing all this stuff with CLI flags is becoming increasingly more unmanageable.

@eliben
Copy link
Author

eliben commented Dec 9, 2023

Thanks!

@eliben eliben closed this as completed Dec 9, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants