Move simple growth and usage stats to this project #15

Closed
timrobertson100 opened this issue Nov 10, 2020 · 10 comments

@timrobertson100
Member

For some time @jlegind has privately maintained python scripts capturing basic statistics on https://jlegind.github.io/

These scripts should be moved into this project and executed on the same schedule as the analytics run, with the output stored in https://analytics-files.gbif.org, noting that the directory structure for that is to be changed.

I suggest that in exploring this, we consider the feasibility of porting this to the Hive/R approach that this project uses to keep the implementation simple. If that is not practical, then we should introduce the python dependencies.

@jhnwllr

jhnwllr commented Nov 10, 2020

@jlegind's scripts at https://jlegind.github.io/ produce cumulative monthly totals for the year (year to date). The source is at https://github.com/jlegind/GBIF_monthly_statistics

In other words, December will always have the largest numbers for a given year and January the smallest. The figure for the current month will always be at least as large as the figures for the earlier months of the same year.
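As a minimal sketch of what "cumulative year to date" means here, the following Python turns per-month counts into running totals. The function name `year_to_date` and the sample numbers are made up for illustration and are not taken from Jan's scripts.

```python
from itertools import accumulate


def year_to_date(monthly_counts):
    """Turn per-month counts for one year into cumulative year-to-date totals."""
    return list(accumulate(monthly_counts))


# Made-up monthly counts for January through April of a single year.
monthly = [120, 80, 0, 200]
ytd = year_to_date(monthly)
print(ytd)  # [120, 200, 200, 400]
```

Note that a month with no activity leaves the total unchanged, so the series is non-decreasing rather than strictly increasing.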

postgres and registry are used here:
Total Download Stats 2019-06-01
Gbif downloads 2019 up-until May

I am not sure how to handle the downloads numbers in a hive/R way...

hive and the prod.occurrence table here:
GBIF records by country total until 2019-05-31
GBIF records by country from start of 2019 until May

These "records by country from start" numbers will be the easiest to port to the hive/R approach.

All of these are cumulative (for the year-to-date) numbers, which is what the communications team wants/uses.

I have partially done some of this in R already:
https://github.com/jhnwllr/gbifdownloadstats

@timrobertson100
Member Author

> I am not sure how to handle the downloads numbers in a hive/R way...

Perhaps naively, I would anticipate the R script connecting to the registry database to calculate those numbers - it'd be a port of whatever the python script is currently doing.

@jhnwllr

jhnwllr commented Nov 10, 2020

Ok sounds good. This is what I am already doing for my current R port of Jan's stats.
https://github.com/jhnwllr/gbifdownloadstats/blob/master/R/db_con.R

@timrobertson100
Member Author

Oh, great!

The intention then is simply to port that into this project, so it gets run automatically in the GBIF analytics run without individuals having to run things in their own repositories. Seems like we just need to verify the scripts are correct and move them in here then?

@jhnwllr

jhnwllr commented Nov 10, 2020

I only ever did the downloads (registry) part (not the hive occ counts) of Jan's stats in R. Right now I am thinking that the best approach would be some SQL that @jlegind might write for us, which gets us as close as possible, followed by some R clean-up.

@jlegind
Contributor

jlegind commented Nov 12, 2020

My plan is to simplify and test the part of the stats that deals with total numbers per publishing country and with in-year growth, which will be the current numbers by publishing country minus the snapshot_20200101 numbers by country. That ought to return growth and perhaps a few cases of negative growth (due to datasets being deleted or curated).
I'm doing this on CentOS 7, which should be close to the GBIF backend environment.
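The subtraction described above could look roughly like this in Python. This is only a sketch: the function name `in_year_growth`, the dictionaries, and all counts are hypothetical, and treating a country that is missing from one side as zero is an assumption, not something the thread specifies.

```python
def in_year_growth(current, baseline):
    """Current count minus baseline count, per publishing country.

    Assumption: a country absent from one side counts as 0 there.
    """
    countries = sorted(set(current) | set(baseline))
    return {c: current.get(c, 0) - baseline.get(c, 0) for c in countries}


# Made-up counts standing in for the start-of-year snapshot and today's table.
baseline = {"DK": 1000, "SE": 500, "NO": 300}
current = {"DK": 1250, "SE": 450, "FI": 40}

growth = in_year_growth(current, baseline)
# Countries whose record count went down, e.g. after datasets were deleted.
shrunk = [country for country, delta in growth.items() if delta < 0]

print(growth)  # {'DK': 250, 'FI': 40, 'NO': -300, 'SE': -50}
print(shrunk)  # ['NO', 'SE']
```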

@jhnwllr

jhnwllr commented Nov 17, 2020

I made this pull request adding support for the downloads part but not the registry part of Jan's monthly stats.

#17

@jlegind
Contributor

jlegind commented Dec 2, 2020

I have added my Python 3 code for the increase in occurrence records by country, using the Hive prod_h occurrence table.
#17

@MattBlissett
Member

(For the moment, only regarding the growth statistics, not the usage (download) ones, i.e. Jan's Hive script.)

What's the requirement for the final analytics output here?

  1. I've added a straightforward count of occurrences to each country's statistics, e.g. http://analytics-files.gbif-uat.org/country/AD/about/csv/occ.csv . Subtraction gives the data growth between any snapshots.

  2. The same, by publishing country: http://analytics-files.gbif-uat.org/country/AD/publishedBy/csv/occ.csv

  3. And all data, which comes "free" when the above is added to the existing analytics script structure: http://analytics-files.gbif-uat.org/global/csv/occ.csv

  4. I've also added a split by country, and another by publishing country, to the global statistics. This will make global analysis easier, avoiding the need to retrieve CSVs for every country: http://analytics-files.gbif-uat.org/global/csv/occ_country.csv http://analytics-files.gbif-uat.org/global/csv/occ_publisherCountry.csv
    This last one can directly produce https://jlegind.github.io/GBIFrecordsbycountrytotaluntil2018-12-3134dbc28e-74a6-42b7-a03f-107e417ac72b/ , and by subtracting the figure from an earlier snapshot, https://jlegind.github.io/GBIFrecordsbycountryfromstartof2018untilDecemberb28ddd86-449f-42d3-aa24-9d643eebdd04/
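To illustrate "subtraction gives the data growth between any snapshots", here is a hedged Python sketch that reads a CSV and subtracts one snapshot from another. The column names `snapshot`, `publisherCountry` and `occurrenceCount`, and the sample rows, are assumptions made for this example; the real layout of occ_publisherCountry.csv should be checked before relying on it.

```python
import csv
import io

# Hypothetical CSV content; the real file's columns may differ.
SAMPLE = """snapshot,publisherCountry,occurrenceCount
2018-01-01,AD,100
2018-01-01,DK,5000
2019-01-01,AD,130
2019-01-01,DK,4800
"""


def counts_at(rows, snapshot):
    """Map publishing country to occurrence count for one snapshot."""
    return {
        row["publisherCountry"]: int(row["occurrenceCount"])
        for row in rows
        if row["snapshot"] == snapshot
    }


def growth_between(csv_text, earlier, later):
    """Per-country difference in counts between two snapshots."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    before = counts_at(rows, earlier)
    after = counts_at(rows, later)
    countries = sorted(set(before) | set(after))
    return {c: after.get(c, 0) - before.get(c, 0) for c in countries}


snapshot_growth = growth_between(SAMPLE, "2018-01-01", "2019-01-01")
print(snapshot_growth)  # {'AD': 30, 'DK': -200}
```

Because every snapshot sits in the same file, any pair of dates can be compared without fetching a CSV per country.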

We don't generate any HTML in this project; that was moved into the website years ago. Do you want me to make some sort of HTML reports, or should that be done in the portal under the existing analytics? (Or neither?)

In any case, I think we should link to these CSVs from the portal so anyone needing these figures can manipulate them as they desire, perhaps by linking to the CSV for each chart.

MattBlissett added a commit that referenced this issue Jan 22, 2021
@MattBlissett
Member

Download statistics: https://analytics-files.gbif-uat.org/download/

I couldn't make sense of the figures returned by R (e.g. the cumulative user count for monthly downloads for February 2020 is greater than the distinct user count for January + February), so I have rewritten them as four separate SQL queries.
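The inconsistency mentioned above can be phrased as an invariant: the cumulative number of distinct users up to a month can never exceed the sum of the per-month distinct user counts, because a user active in several months is counted once cumulatively but once per month in the sum. A small Python sketch with made-up user identifiers (the function name `cumulative_distinct_users` is hypothetical and this is not the SQL that was actually written):

```python
def cumulative_distinct_users(users_by_month):
    """Return (month, distinct_in_month, cumulative_distinct, sum_of_monthly) rows."""
    seen = set()
    running_sum = 0
    report = []
    for month, users in users_by_month:
        distinct = set(users)
        seen |= distinct              # users seen so far this year
        running_sum += len(distinct)  # naive sum of the monthly distinct counts
        report.append((month, len(distinct), len(seen), running_sum))
    return report


# Made-up download log: user "b" and "c" are active in both months.
log = [
    ("2020-01", ["a", "b", "c", "a"]),
    ("2020-02", ["b", "c", "d"]),
]

report = cumulative_distinct_users(log)
for row in report:
    print(row)
# ('2020-01', 3, 3, 3)
# ('2020-02', 3, 4, 6)

# Sanity check: the cumulative count never exceeds the sum of monthly counts.
consistent = all(cumulative <= total for _, _, cumulative, total in report)
```

A report in which the cumulative figure is larger than that sum, as in the February 2020 case, fails this check.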

This issue was closed.