Move simple growth and usage stats to this project #15
Comments
@jlegind's scripts at https://jlegind.github.io/ produce cumulative monthly totals for the year (year to date). https://github.com/jlegind/GBIF_monthly_statistics

In other words, December will always have the largest numbers for that year and January will always have the smallest. The current month will always be larger than the previous months of the same year.

Postgres and the registry are used in one script; I am not sure how to handle the downloads numbers in a Hive/R way... Hive and the prod.occurrence table are used in another. These "records by country from start" numbers will be the easiest to port to the Hive/R approach.

All of these are cumulative (year-to-date) numbers, which is what the communications team wants/uses. I have already partially done some of this in R.
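To make the year-to-date behaviour concrete, here is a minimal Python sketch of how per-month counts turn into cumulative totals that reset each January. The function name `year_to_date_totals` and the figures are made up for illustration; this is not taken from the GBIF_monthly_statistics scripts.

```python
from collections import defaultdict
from itertools import accumulate


def year_to_date_totals(monthly_counts):
    """Map {(year, month): count} to {(year, month): cumulative count within that year}."""
    by_year = defaultdict(dict)
    for (year, month), count in monthly_counts.items():
        by_year[year][month] = count

    ytd = {}
    for year, months in by_year.items():
        ordered = sorted(months)
        # Running sum over the months of this year only, so January starts from zero.
        running = accumulate(months[m] for m in ordered)
        for month, total in zip(ordered, running):
            ytd[(year, month)] = total
    return ytd


# Hypothetical monthly counts, for illustration only.
monthly = {(2019, 11): 5, (2019, 12): 7, (2020, 1): 3, (2020, 2): 4}
ytd = year_to_date_totals(monthly)
```

As long as the monthly counts are non-negative, the December figure is the largest of its year and the January figure the smallest, which matches the description above.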
Perhaps naively, I would anticipate the R script connecting to the registry database to calculate those numbers - it'd be a port of whatever the Python script is currently doing.
Ok sounds good. This is what I am already doing for my current R port of Jan's stats.
Oh, great! The intention then is simply to port that into this project, so it gets run automatically in the GBIF analytics run without individuals having to run things in their own repositories. Seems like we just need to verify the scripts are correct and move them in here then?
I only ever did the downloads (registry) part (not the Hive occurrence counts) of Jan's stats in R. Right now I am thinking that the best approach would be some SQL that @jlegind might write for us, which gets us as close as possible, followed by some R cleanup.
My plan is to simplify and test the part of the stats that deals with total numbers per publishing country and the in-year growth.
I made this pull request adding support for the downloads part but not the registry part of Jan's monthly stats.
I have added my Python3 code for the increase in occurrence records by country using the Hive prod_h occurrence table.
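As a rough illustration of what "increase in occurrence records by country" means, the sketch below compares two per-country snapshots. It makes no attempt to reproduce the Hive query in the Python3 code mentioned above; `in_year_growth`, the country codes and the counts are all hypothetical.

```python
def in_year_growth(start_of_year, current):
    """Return {country: increase in record count since the start of the year}."""
    growth = {}
    # A country missing from either snapshot is treated as having zero records there.
    for country in sorted(set(start_of_year) | set(current)):
        before = start_of_year.get(country, 0)
        after = current.get(country, 0)
        growth[country] = after - before
    return growth


# Hypothetical snapshots, for illustration only.
january = {"DK": 100, "SE": 50}
now = {"DK": 130, "SE": 50, "NO": 20}
growth = in_year_growth(january, now)
```

Treating a missing country as zero is an assumption; a real report might prefer to flag countries that appear or disappear during the year.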
(For the moment, only regarding the growth statistics, not the usage (download) ones, i.e. Jan's Hive script.) What's the requirement for the final analytics output here?
We don't generate any HTML in this project, it was moved into the website years ago. Do you want me to make some sort of HTML reports, or should that be done in the portal under the existing analytics? (Or neither?) In any case, I think we should link to these CSVs from the portal so anyone needing these figures can manipulate them as they desire, perhaps by linking to the CSV for each chart.
Download statistics: https://analytics-files.gbif-uat.org/download/ I couldn't make sense of the figures returned by R (e.g. the cumulative user count for monthly downloads for February 2020 is greater than the distinct user count for January + February), so I have rewritten them as four separate SQL queries.
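The inconsistency noted above follows from a simple bound: the number of distinct users year to date can never exceed the sum of the monthly distinct counts, because a user active in several months is counted once in the former and once per month in the latter. A minimal Python sketch of that check follows. It is not the four SQL queries mentioned above, and the function names and user names are invented.

```python
def cumulative_distinct_users(users_by_month):
    """Return [(month, distinct users that month, distinct users year to date)]."""
    seen = set()
    rows = []
    for month in sorted(users_by_month):
        users = set(users_by_month[month])
        seen |= users  # set union, so repeat users are not double counted
        rows.append((month, len(users), len(seen)))
    return rows


def is_consistent(rows):
    """Check that each year-to-date count is at most the running sum of monthly counts."""
    running_sum = 0
    for _, monthly_distinct, ytd_distinct in rows:
        running_sum += monthly_distinct
        if ytd_distinct > running_sum:
            return False
    return True


# Hypothetical download users, for illustration only.
downloads = {1: ["alice", "bob"], 2: ["bob", "carol"]}
rows = cumulative_distinct_users(downloads)
```

A check like `is_consistent` could be run against any generated figures as a sanity test before they are published.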
For some time @jlegind has privately maintained Python scripts capturing basic statistics, published at https://jlegind.github.io/.
These scripts should be moved into this project and executed on the same schedule as the analytics run, with the output stored in https://analytics-files.gbif.org, noting that the directory structure for that is to be changed.
I suggest that in exploring this, we consider the feasibility of porting the scripts to the Hive/R approach that this project uses, to keep the implementation simple. If that is not practical, then we should introduce the Python dependencies.