Log of disappeared modules (github) #17

Open
AlexDaniel opened this issue Apr 9, 2020 · 3 comments

@AlexDaniel

Sometimes people decide to remove GitHub repos or to delete their modules. That is fine in itself, but it's a pain for everyone when something depends on the deleted code. In the past I restored some modules by re-creating them in https://github.com/raku-community-modules from the git repos that zef stores locally, but I got lucky: I happened to have installed those modules in the past. Now that crai provides tarballs this is less of an issue, but it would still be great to know when a module is deleted so that we can react more quickly. This should also make release management just a little bit less painful.

I think it'd be nice to have a simple log with timestamps and names of modules that no longer have accessible git repos.

chloekek mentioned this issue Apr 9, 2020

chloekek (Owner) commented Apr 9, 2020

I am drafting the new data model; this is the part of it that should be relevant to this issue:

$!dbh.do(q:to/SQL/);
    -- A run is one invocation of the cron job that indexes archives.
    -- Each run is identified by the time at which it started.
    -- Runs are useful for tracking the state of the ecosystem over time.
    CREATE TABLE runs (
        [when] TEXT NOT NULL,
        PRIMARY KEY ([when])
    )
    SQL
$!dbh.do(q:to/SQL/);
    -- For each run, the set of archives that were found to exist.
    CREATE TABLE encounters (
        run_when    TEXT NOT NULL,
        archive_url TEXT NOT NULL,
        PRIMARY KEY (run_when, archive_url)
    )
    SQL

Every time the cron job runs (every hour), it will write down which archives it found on CPAN and GitHub.
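To make the recording step concrete, here is a small sketch of what "writing down which archives it found" could look like against that schema. This is not crai's actual code (crai is written in Raku); it is a Python/sqlite3 illustration, and the example URL is made up. The `runs` and `encounters` table and column names come from the schema above.

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE runs (
    [when] TEXT NOT NULL,
    PRIMARY KEY ([when])
);
CREATE TABLE encounters (
    run_when    TEXT NOT NULL,
    archive_url TEXT NOT NULL,
    PRIMARY KEY (run_when, archive_url)
);
""")

def record_run(conn, found_urls, when=None):
    """Insert one run, identified by its start time, and the set of
    archive URLs that the crawl encountered during that run."""
    when = when or datetime.now(timezone.utc).isoformat()
    conn.execute("INSERT INTO runs ([when]) VALUES (?)", (when,))
    conn.executemany(
        "INSERT INTO encounters (run_when, archive_url) VALUES (?, ?)",
        [(when, url) for url in found_urls],
    )
    return when

# One hourly run that found a single (hypothetical) archive.
record_run(conn, ["https://example.org/Foo-1.0.tar.gz"],
           when="2020-04-09T00:00:00Z")
```

Because `(run_when, archive_url)` is the primary key, recording the same archive twice within one run fails loudly instead of silently duplicating rows.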

Then we can use SQL to query the difference between any two runs, with a query like the one below. If you want, we can even make it send you an email, post to IRC, or open a GitHub issue.

SELECT
    archives.meta_name
FROM
    encounters
    INNER JOIN archives
        ON archives.url = encounters.archive_url
WHERE
    encounters.run_when = ?1

EXCEPT

SELECT
    archives.meta_name
FROM
    encounters
    INNER JOIN archives
        ON archives.url = encounters.archive_url
WHERE
    encounters.run_when = ?2

chloekek (Owner) commented Apr 16, 2020

You can now see how many archives it found on each run: https://crai.foldr.nl/runs.

This can easily be extended to display which distributions were found, and to compare that against other runs.

chloekek (Owner) commented

We will need to ignore runs that are outliers in terms of the number of archives encountered. It is more likely that CPAN or GitHub was down, or that the cron job crashed, than that hundreds of packages were suddenly deleted.

I don’t know much about statistics so I will have to learn that first, which is fun!
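One simple heuristic along those lines (not anything crai has decided on, just an assumption for illustration): compare each run's archive count against the median of recent runs, and ignore runs that fall far below it. The median is robust to the very outliers being filtered, unlike the mean. The function name and the 0.5 tolerance are made up for this sketch.

```python
from statistics import median

def is_outlier_run(count, recent_counts, tolerance=0.5):
    """Flag a run whose archive count drops far below the recent median.

    A crude heuristic: if a run saw fewer than `tolerance` times the
    median count of recent runs, assume CPAN or GitHub was unreachable
    (or the crawl crashed) rather than a mass deletion, and skip the run
    when computing the disappeared-modules diff.
    """
    if not recent_counts:
        return False  # nothing to compare against yet
    return count < tolerance * median(recent_counts)

# A run that found 120 archives against a recent median of ~2000 is
# almost certainly a partial crawl, not 1880 simultaneous deletions.
is_outlier_run(120, [1990, 2003, 1998, 2001])   # True
is_outlier_run(1995, [1990, 2003, 1998, 2001])  # False
```

A diff between two non-adjacent healthy runs (skipping the flagged ones) would then avoid reporting hundreds of false "disappearances".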
