Move simple growth and usage stats to this project #15

timrobertson100 · 2020-11-10T10:19:38Z

For some time, @jlegind has privately maintained Python scripts that capture basic statistics (https://jlegind.github.io/).

These scripts should be moved into this project and run on the same schedule as the analytics run, with the output stored at https://analytics-files.gbif.org (note that the directory structure there is due to change).

I suggest that, in exploring this, we consider whether it is feasible to port the scripts to the Hive/R approach this project uses, to keep the implementation simple. If that is not practical, we should introduce the Python dependencies.

jhnwllr · 2020-11-10T12:59:42Z

@jlegind's scripts at https://jlegind.github.io/ produce cumulative monthly totals for the year (year to date). Source: https://github.com/jlegind/GBIF_monthly_statistics

In other words, December will always have the largest numbers for a given year and January the smallest, and the figure for the current month will always be larger than those for the previous months of that year.
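As a minimal sketch of what "cumulative monthly totals for the year" means, the snippet below turns per-month counts into year-to-date running totals. The function name and the sample numbers are made up for illustration; they are not taken from Jan's scripts.

```python
from itertools import accumulate


def year_to_date(monthly_counts):
    """Return running totals for a list of per-month counts (January first)."""
    return list(accumulate(monthly_counts))


# Hypothetical per-month counts for January to May (illustrative values only).
monthly = [120, 95, 130, 80, 110]
ytd = year_to_date(monthly)
print(ytd)  # [120, 215, 345, 425, 535]
```

As long as no month has a negative count, each year-to-date value is at least as large as the one before it, which is why the latest month always carries the largest number for the year.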

Postgres and the registry are used here:
Total Download Stats 2019-06-01
Gbif downloads 2019 up-until May

I am not sure how to handle the downloads numbers in a hive/R way...

Hive and the prod.occurrence table are used here:
GBIF records by country total until 2019-05-31
GBIF records by country from start of 2019 until May

These "records by country from start" numbers will be the easiest to port to the Hive/R approach.

All of these are cumulative (for the year-to-date) numbers, which is what the communications team wants/uses.

I have already partially done some of this in R:
https://github.com/jhnwllr/gbifdownloadstats

timrobertson100 · 2020-11-10T13:12:33Z

> I am not sure how to handle the downloads numbers in a hive/R way...

Perhaps naively, I would anticipate the R script connecting to the registry database to calculate those numbers - it'd be a port of whatever the python script is currently doing.

jhnwllr · 2020-11-10T13:15:45Z

OK, sounds good. This is what I am already doing in my current R port of Jan's stats:
https://github.com/jhnwllr/gbifdownloadstats/blob/master/R/db_con.R

timrobertson100 · 2020-11-10T13:27:13Z

Oh, great!

The intention then is simply to port that into this project, so it gets run automatically in the GBIF analytics run without individuals having to run things in their own repositories. Seems like we just need to verify the scripts are correct and move them in here then?

jhnwllr · 2020-11-10T13:57:35Z

I only ever did the downloads (registry) part of Jan's stats in R, not the Hive occurrence counts. Right now I am thinking that the best approach would be some SQL that @jlegind might write for us, which gets us as close as possible, followed by some R clean-up.

jlegind · 2020-11-12T13:20:26Z

My plan is to simplify and test the part of the stats that deals with total numbers per publishing country and with in-year growth, which will be the current numbers by publishing country minus the snapshot_20200101 numbers by country. That ought to return growth, and perhaps a few cases of negative growth (due to datasets being deleted or curated).
I'm doing this on CentOS 7, which should be close to the GBIF backend environment.
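The subtraction described above could look roughly like the sketch below. It is not the code from the scripts: the function name, the country codes and all counts are hypothetical, and treating a country missing from one side as zero is an assumption.

```python
def growth_by_country(current, baseline):
    """Subtract baseline counts from current counts, per country code.

    Assumption: a country missing from one side is treated as a count of 0.
    """
    countries = sorted(set(current) | set(baseline))
    return {c: current.get(c, 0) - baseline.get(c, 0) for c in countries}


# Hypothetical counts; real values would come from the Hive occurrence
# table and from the snapshot taken at the start of the year.
snapshot_20200101 = {"DK": 1000, "SE": 2500, "NO": 400}
current = {"DK": 1300, "SE": 2450, "FI": 50}

growth = growth_by_country(current, snapshot_20200101)
print(growth)  # {'DK': 300, 'FI': 50, 'NO': -400, 'SE': -50}

# Countries with negative growth, e.g. after datasets were deleted or curated.
negative = [country for country, delta in growth.items() if delta < 0]
print(negative)  # ['NO', 'SE']
```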

jhnwllr · 2020-11-17T09:29:15Z

I made this pull request adding support for the downloads part but not the registry part of Jan's monthly stats.

#17

jlegind · 2020-12-02T12:53:52Z

I have added my Python 3 code for the increase in occurrence records by country, using the Hive prod_h occurrence table.
#17


MattBlissett · 2021-01-07T18:51:39Z

(For the moment, only regarding the growth statistics, not the usage (download) ones, i.e. Jan's Hive script.)

What's the requirement for the final analytics output here?

I've added a straightforward count of occurrences to each country's statistics, e.g. http://analytics-files.gbif-uat.org/country/AD/about/csv/occ.csv . Subtraction gives the data growth between any snapshots.
The same, by publishing country: http://analytics-files.gbif-uat.org/country/AD/publishedBy/csv/occ.csv
And all data, which comes "free" when the above is added to the existing analytics script structure: http://analytics-files.gbif-uat.org/global/csv/occ.csv
I've also added a split by country, and another by publishing country, to the global statistics. This will make global analysis easier, avoiding the need to retrieve CSVs for every country: http://analytics-files.gbif-uat.org/global/csv/occ_country.csv http://analytics-files.gbif-uat.org/global/csv/occ_publisherCountry.csv
This last one can directly produce https://jlegind.github.io/GBIFrecordsbycountrytotaluntil2018-12-3134dbc28e-74a6-42b7-a03f-107e417ac72b/ , and by subtracting the figure from an earlier snapshot, https://jlegind.github.io/GBIFrecordsbycountryfromstartof2018untilDecemberb28ddd86-449f-42d3-aa24-9d643eebdd04/
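For anyone working from the per-country CSVs, the "subtraction between any snapshots" step might be sketched as follows. The column names `snapshot` and `occurrenceCount`, the function name and the sample rows are assumptions for illustration; check the header of the actual files before relying on them.

```python
import csv
import io


def growth_between(csv_text, start, end):
    """Return the count at snapshot `end` minus the count at snapshot `start`.

    Assumes (hypothetically) one row per snapshot with the columns
    `snapshot` and `occurrenceCount`.
    """
    rows = csv.DictReader(io.StringIO(csv_text))
    counts = {row["snapshot"]: int(row["occurrenceCount"]) for row in rows}
    return counts[end] - counts[start]


# Made-up sample data standing in for a downloaded occ.csv.
SAMPLE = """snapshot,occurrenceCount
2019-01-01,1000
2020-01-01,1500
2021-01-01,1450
"""

print(growth_between(SAMPLE, "2019-01-01", "2020-01-01"))  # 500
print(growth_between(SAMPLE, "2020-01-01", "2021-01-01"))  # -50
```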

We don't generate any HTML in this project; that was moved into the website years ago. Do you want me to make some sort of HTML reports, or should that be done in the portal under the existing analytics? (Or neither?)

In any case, I think we should link to these CSVs from the portal so anyone needing these figures can manipulate them as they desire, perhaps by linking to the CSV for each chart.


MattBlissett · 2021-01-22T13:05:38Z

Download statistics: https://analytics-files.gbif-uat.org/download/

I couldn't make sense of the figures returned by R (e.g. the cumulative user count for monthly downloads for February 2020 is greater than the distinct user count for January + February), so I have rewritten them as four separate SQL queries.
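The inconsistency mentioned above can be expressed as a simple sanity check: the number of distinct users seen up to a given month can never exceed the sum of the per-month distinct user counts up to that month. The sketch below uses made-up download records and hypothetical function names; it is not one of the four SQL queries.

```python
def distinct_users_by_month(downloads):
    """Map each month to the set of users who downloaded in that month."""
    by_month = {}
    for month, user in downloads:
        by_month.setdefault(month, set()).add(user)
    return by_month


def cumulative_distinct_users(by_month):
    """Map each month to the number of distinct users seen up to that month."""
    seen = set()
    result = {}
    for month in sorted(by_month):
        seen |= by_month[month]
        result[month] = len(seen)
    return result


# Hypothetical (month, user) download records.
downloads = [
    ("2020-01", "alice"),
    ("2020-01", "bob"),
    ("2020-02", "bob"),
    ("2020-02", "carol"),
    ("2020-02", "carol"),
]

by_month = distinct_users_by_month(downloads)
cumulative = cumulative_distinct_users(by_month)
print(cumulative)  # {'2020-01': 2, '2020-02': 3}

# Sanity check: cumulative distinct users never exceed the running sum
# of the monthly distinct user counts.
running_sum = 0
for month in sorted(by_month):
    running_sum += len(by_month[month])
    assert cumulative[month] <= running_sum
```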

timrobertson100 assigned jlegind Nov 10, 2020

timrobertson100 mentioned this issue Nov 10, 2020

Add regional rollups to the analytics #16

Closed

2 tasks

jhnwllr self-assigned this Nov 10, 2020

jhnwllr mentioned this issue Nov 17, 2020

Adding support for monthly download stats #17

Closed

MattBlissett added a commit that referenced this issue Jan 7, 2021

#15: Add figures via Hive for country/publisherCountry occurrence/species growth.

0d404d0

MattBlissett added a commit that referenced this issue Jan 7, 2021

#15: Add figures via Hive for country/publisherCountry occurrence/species growth.

3943708

MattBlissett added a commit that referenced this issue Jan 22, 2021

Add monthly/annual download statistics.

2cc90ce

Remaining part of #15.

MattBlissett closed this as completed in 5bf1f37 Jan 28, 2021

This issue was closed.
