Log of disappeared modules (github) #17

Open
AlexDaniel opened this issue Apr 9, 2020 · 3 comments

@AlexDaniel

Sometimes people decide to remove GitHub repos or to delete their modules. That is fine in itself, but it's a pain for everyone when something depends on the deleted code. In the past I restored some modules by re-creating them in https://github.com/raku-community-modules from the git repos that zef stores locally, but I got lucky: I happened to have installed those modules in the past. Now that crai provides tarballs this is less of an issue, but it would still be great to know when a module is deleted so that we can react more quickly. This should also make release management just a little bit less painful.

I think it'd be nice to have a simple log with timestamps and names of modules that no longer have accessible git repos.

chloekek mentioned this issue Apr 9, 2020

chloekek (Owner) commented Apr 9, 2020

I am drafting the new data model; this is the part of it that should be relevant to this issue:

$!dbh.do(q:to/SQL/);
    -- A run is one invocation of the cron job that indexes archives.
    -- Each run is identified by the time at which it started.
    -- Runs are useful for tracking the state of the ecosystem over time.
    CREATE TABLE runs (
        [when] TEXT NOT NULL,
        PRIMARY KEY ([when])
    )
    SQL
$!dbh.do(q:to/SQL/);
    -- For each run, the set of archives that were found to exist.
    CREATE TABLE encounters (
        run_when    TEXT NOT NULL,
        archive_url TEXT NOT NULL,
        PRIMARY KEY (run_when, archive_url)
    )
    SQL

Every time the cron job runs (every hour), it will write down which archives it found on CPAN and GitHub.
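To make the recording step concrete, here is a small sketch of what "writing down which archives it found" could look like against that schema. This is not crai's actual code (crai is written in Raku); it is a Python/sqlite3 illustration, and the example URL is made up. The `runs` and `encounters` table and column names come from the schema above.

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE runs (
    [when] TEXT NOT NULL,
    PRIMARY KEY ([when])
);
CREATE TABLE encounters (
    run_when    TEXT NOT NULL,
    archive_url TEXT NOT NULL,
    PRIMARY KEY (run_when, archive_url)
);
""")

def record_run(conn, found_urls, when=None):
    """Insert one run, identified by its start time, and the set of
    archive URLs that the crawl encountered during that run."""
    when = when or datetime.now(timezone.utc).isoformat()
    conn.execute("INSERT INTO runs ([when]) VALUES (?)", (when,))
    conn.executemany(
        "INSERT INTO encounters (run_when, archive_url) VALUES (?, ?)",
        [(when, url) for url in found_urls],
    )
    return when

# One hourly run that found a single (hypothetical) archive.
record_run(conn, ["https://example.org/Foo-1.0.tar.gz"],
           when="2020-04-09T00:00:00Z")
```

Because `(run_when, archive_url)` is the primary key, recording the same archive twice within one run fails loudly instead of silently duplicating rows.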

Then we can use SQL to query the difference between any two runs, with a query like the one below. If you want, we can even make it send you an email, post to IRC, or open a GitHub issue.

SELECT
    archives.meta_name
FROM
    encounters
    INNER JOIN archives
        ON archives.url = encounters.archive_url
WHERE
    encounters.run_when = ?1

EXCEPT

SELECT
    archives.meta_name
FROM
    encounters
    INNER JOIN archives
        ON archives.url = encounters.archive_url
WHERE
    encounters.run_when = ?2

chloekek (Owner) commented Apr 16, 2020

You can now see how many archives it found on each run: https://crai.foldr.nl/runs.

This can easily be extended to display which distributions were found, and to compare that against other runs.

chloekek (Owner) commented

We will need to ignore runs that are outliers in terms of the number of archives encountered. It is more likely that CPAN or GitHub was down, or that the cron job crashed, than that hundreds of packages were suddenly deleted.

I don’t know much about statistics so I will have to learn that first, which is fun!
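One simple heuristic along those lines (not anything crai has decided on, just an assumption for illustration): compare each run's archive count against the median of recent runs, and ignore runs that fall far below it. The median is robust to the very outliers being filtered, unlike the mean. The function name and the 0.5 tolerance are made up for this sketch.

```python
from statistics import median

def is_outlier_run(count, recent_counts, tolerance=0.5):
    """Flag a run whose archive count drops far below the recent median.

    A crude heuristic: if a run saw fewer than `tolerance` times the
    median count of recent runs, assume CPAN or GitHub was unreachable
    (or the crawl crashed) rather than a mass deletion, and skip the run
    when computing the disappeared-modules diff.
    """
    if not recent_counts:
        return False  # nothing to compare against yet
    return count < tolerance * median(recent_counts)

# A run that found 120 archives against a recent median of ~2000 is
# almost certainly a partial crawl, not 1880 simultaneous deletions.
is_outlier_run(120, [1990, 2003, 1998, 2001])   # True
is_outlier_run(1995, [1990, 2003, 1998, 2001])  # False
```

A diff between two non-adjacent healthy runs (skipping the flagged ones) would then avoid reporting hundreds of false "disappearances".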
