From one feed of uploads to three ... #198

Open
shadowcat-mst opened this issue Jan 10, 2016 · 12 comments

Comments

@shadowcat-mst

When we did the PSIXDISTS trial uploads, a lot of things that follow the PAUSE upload feed to spot new perl5 distributions became somewhat confused (both software and liveware).

I think for the moment, the sensible approach to deal with this is to have three feeds:

  • The current feed, modified to exclude ^Perl6/
  • A new 6uploads feed, which contains only ^Perl6/ uploads
  • An alluploads feed, which contains both.
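
A minimal sketch of the three-way split described above, in Python. It assumes each upload is identified by a path relative to the author's directory, so that the `^Perl6/` pattern from the proposal applies directly; the function name, feed keys, and sample paths are hypothetical and not part of PAUSE.

```python
import re

# Assumption: paths are relative to the author's directory, so a Perl 6
# upload starts with "Perl6/" (the ^Perl6/ pattern from the proposal).
PERL6_PREFIX = re.compile(r"^Perl6/")


def split_uploads(paths):
    """Return the three proposed feeds as a dict of lists."""
    feeds = {"uploads": [], "6uploads": [], "alluploads": []}
    for path in paths:
        # Every upload goes into the combined feed.
        feeds["alluploads"].append(path)
        # Then route it to exactly one of the two filtered feeds.
        key = "6uploads" if PERL6_PREFIX.search(path) else "uploads"
        feeds[key].append(path)
    return feeds


# Hypothetical sample paths, for illustration only.
sample = ["Foo-Bar-1.0.tar.gz", "Perl6/Baz-0.1.tar.gz"]
feeds = split_uploads(sample)
```

Existing consumers would keep reading the `uploads` list unchanged, while a Perl 6 bot would read only `6uploads`.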

Naming I care not about, just about being able to e.g. point a GumbyNET at a 6uploads feed for #perl6, and being able to not confuse people expecting only perl5 stuff.

I suspect we're going to find that this isn't exactly a PAUSE issue in its entirety, but I still can't think of a better place to put the ticket.

@shadowcat-mst
Author

Brain dumping -

9:37 @kentnl https://metacpan.org/feed/recent http://search.cpan.org/recent
http://search.cpan.org/uploads.rdf # related?

Where does GumbyNET get its data from?

If we have multiple consumers of the pause indexing emails or the daemon tail, can we play with things so that people's regexps do what we want them to?

@shadowcat-mst
Author

20:07 < ranguard> mst:
https://github.com/CPAN-API/cpan-api/blob/master/lib/MetaCPAN/Script/Watcher.pm#L269 - It watches the RECENT-*.json files for changes to the CPAN directory
every 15 seconds

Well, that's one target. I guess 'RECENT-1h', 'RECENT6-1h' and 'RECENTALL-1h' ?

@andk
Owner

andk commented Jan 11, 2016

On Sun, 10 Jan 2016 12:10:14 -0800, shadowcat-mst notifications@github.com said:

20:07 < ranguard> mst:
https://github.com/CPAN-API/cpan-api/blob/master/lib/MetaCPAN/Script/Watcher.pm#L269 -
It watches the RECENT-*.json files for changes to the CPAN directory
every 15 seconds

Well, that's one target. I guess 'RECENT-1h', 'RECENT6-1h' and
'RECENTALL-1h' ?

Now I understand what you meant with three feeds. So I said "wfm" too
early.

I don't think I can be persuaded to change File::Rsync::Mirror::Recent.
That's the system that owns the RECENT* files. I'd strongly prefer to
have these unchanged. Any split of these files can happen downstream, no?

andreas

@ranguard
Contributor

Would one option be to keep the RECENT* (all) but add:

  • p5-RECENT...
  • p6-RECENT...

The disadvantage of this is that most downstream clients (assuming most currently just want p5) then have to be updated, but at least it should be as simple as looking at a different file path, rather than having to parse the content differently.
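
As a rough illustration of what a p5/p6 splitter could look like, wherever it ends up running, here is a Python sketch. It assumes a RECENT-style JSON document with a top-level `recent` list whose entries each carry a `path` key, and that Perl 6 uploads sit in a `Perl6/` directory below the author directory; both are assumptions for this example, and the sample data is made up.

```python
import json
import re

# Assumption: Perl 6 uploads live in a Perl6/ directory component.
PERL6_DIR = re.compile(r"(^|/)Perl6/")


def split_recent(recent_json):
    """Split the entries of a RECENT-style document into (p5, p6) lists."""
    doc = json.loads(recent_json)
    p5, p6 = [], []
    for entry in doc.get("recent", []):
        # Assumption: every entry has a "path" key.
        target = p6 if PERL6_DIR.search(entry["path"]) else p5
        target.append(entry)
    return p5, p6


# Hypothetical input, not real PAUSE data.
sample = json.dumps({"recent": [
    {"path": "id/A/AU/AUTHOR/Foo-1.0.tar.gz", "type": "new"},
    {"path": "id/A/AU/AUTHOR/Perl6/Bar-0.1.tar.gz", "type": "new"},
]})
p5, p6 = split_recent(sample)
```

The two lists could then be written out under whatever file names are agreed on, leaving the original file untouched.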

@andk
Owner

andk commented Jan 13, 2016

On Tue, 12 Jan 2016 12:50:20 -0800, Leo Lapworth notifications@github.com said:

Would one option be to keep the RECENT* (all) but add:

  • p5-RECENT...
  • p6-RECENT...

The disadvantage of this is that most downstream clients (assuming most
currently just want p5) then have to be updated, but at least it
should be as simple as looking at a different file path, rather than
having to parse the content differently.

That's a piece of software that can run anywhere besides pause. But the
way you describe it, everybody pays a price: additional files in the
system that eat resources in the form of space, download time and
complexity. So I'd say such a splitter should run elsewhere. I'm open
to better suggestions.

One option might be to teach File::Rsync::Mirror::Recent to do filtering
or splitting. And/or to read the indexes from different places than the
one that offers the files for download.

andreas

@rjbs
Collaborator

rjbs commented Jan 14, 2016

I also hadn't realized that the suggestion was to alter the rrr files. I think that's pretty much a non-starter, unfortunately. It isn't crazy to suggest that PAUSE itself produce RSS files, and that would be easy… but it does mean that downstream things would have to be updated.

@shadowcat-mst
Author

I hadn't realised when I first looked at this that the files used for rrr are the same ones people are using to find newly uploaded dists (as opposed to files in general). If it turns out only to be metacpan, that can surely be changed, but if it's other things as well, we're into nasty trade-off land.

I think we need to check how search.cpan.org, the various utility bots, cpanmetadb, and cpantesters handle this. I shall start poking people.

(Edited to add: IRC pings left for preaction wrt cpantesters, BinGOs wrt GumbyNET*, and miyagawa wrt cpanmetadb; mail sent to the search.cpan.org contact address in the hopes Graham will take pity on me for that question ;)

@miyagawa
Contributor

The current version of cpanmetadb doesn't use rrr files. Instead it just fetches the whole 02packages file and replaces everything in one big transaction. I also clone & fetch the PAUSE-batch git repo to retrieve the history.

Previously I was looking at the RRR files when it was running on GAE in Python, but not anymore.
https://github.com/miyagawa/cpanmetadb/blob/master/main.py#L135
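
For readers unfamiliar with the "fetch everything and replace it in one transaction" approach, a minimal Python sketch follows. It is not cpanmetadb's code: the SQLite storage, the table layout, and the function names are assumptions made for this example, and the index text is a made-up sample in the usual header, blank line, then `package version path` layout.

```python
import sqlite3


def parse_packages(text):
    """Yield (package, version, path) rows from 02packages-style text.

    Assumption: a header block, one blank line, then whitespace-separated
    "package version path" lines.
    """
    _, _, body = text.partition("\n\n")
    for line in body.splitlines():
        fields = line.split()
        if len(fields) == 3:
            yield tuple(fields)


def replace_index(conn, text):
    """Replace the whole packages table with the rows parsed from text."""
    rows = list(parse_packages(text))
    # "with conn" commits on success and rolls back on an exception,
    # so the table is never left half-replaced.
    with conn:
        conn.execute("DELETE FROM packages")
        conn.executemany("INSERT INTO packages VALUES (?, ?, ?)", rows)
    return len(rows)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE packages (package TEXT, version TEXT, path TEXT)")

# Hypothetical sample text, not real index data.
sample = (
    "File: 02packages.details.txt\n"
    "\n"
    "Foo::Bar 1.0 A/AU/AUTHOR/Foo-Bar-1.0.tar.gz\n"
)
count = replace_index(conn, sample)
```

Because the whole index is reloaded each time, a consumer built this way does not depend on the RECENT files at all.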

@shadowcat-mst
Author

@miyagawa Thanks! That rules cpanmetadb out of us needing to worry about it. Molto Bene.

@shadowcat-mst
Author

So, the Gumbys are apparently NNTP + regex based -

17:13 <@mst> BinGOs: what are the Gumbys using? please either answer here and 
             I'll ticket it or answer on GH above
17:22 <@BinGOs> mst: it tails the nntp.perl.org newsgroup 'perl.cpan.uploads' 
                for upload emails.
17:22 <@mst> ooooo.
17:23 <@mst> how does it tell one's an upload?
17:24 <@BinGOs> $subject =~ m!^CPAN Upload: (.+\.tar\.gz|\.tgz|\.zip)$!i

so I guess we can eliminate this particular vector by having the Perl6/ directory generate a subject line of 'Perl6 Upload: ...' - would anybody see any particular issue with that? (notably @andk @rjbs)
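
To make the effect of that subject-line change concrete, here is a small Python sketch. The pattern is a direct translation of the Perl regex quoted in the IRC log above; the helper name and the two sample subjects are hypothetical, and the exact subject format PAUSE emits is assumed rather than confirmed here.

```python
import re

# Python translation of the Perl pattern quoted in the IRC log;
# re.IGNORECASE corresponds to the trailing /i flag.
UPLOAD_SUBJECT = re.compile(
    r"^CPAN Upload: (.+\.tar\.gz|\.tgz|\.zip)$", re.IGNORECASE
)


def is_announced_upload(subject):
    """Return True if a consumer using this pattern would pick it up."""
    return UPLOAD_SUBJECT.search(subject) is not None


# Hypothetical subjects, for illustration only.
perl5_subject = "CPAN Upload: A/AU/AUTHOR/Foo-1.0.tar.gz"
perl6_subject = "Perl6 Upload: A/AU/AUTHOR/Perl6/Bar-0.1.tar.gz"

perl5_seen = is_announced_upload(perl5_subject)
perl6_seen = is_announced_upload(perl6_subject)
```

Since the pattern is anchored on the `CPAN Upload: ` prefix, a consumer like this would skip the `Perl6 Upload: ` subjects without any change on its side.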

@rjbs
Collaborator

rjbs commented Jan 17, 2016

It at least doesn't seem crazy. I don't think I have any strong opinion beyond that.

@andk
Owner

andk commented Jan 17, 2016

On Sat, 16 Jan 2016 09:27:18 -0800, shadowcat-mst notifications@github.com said:

So, the Gumbys are apparently NNTP + regex based -

17:13 <@mst> BinGOs: what are the Gumbys using? please either answer here and
I'll ticket it or answer on GH above
17:22 <@BinGOs> mst: it tails the nntp.perl.org newsgroup 'perl.cpan.uploads'
for upload emails.
17:22 <@mst> ooooo.
17:23 <@mst> how does it tell one's an upload?
17:24 <@BinGOs> $subject =~ m!^CPAN Upload: (.+\.tar\.gz|\.tgz|\.zip)$!i

so I guess we can eliminate this particular vector by having the
Perl6/ directory generate a subject line of 'Perl6 Upload: ...' -
would anybody see any particular issue with that? (notably @andk
@rjbs)

I see no nasty effects of such a change. The subject line gets written
here:

https://github.com/andk/pause/blob/master/bin/paused#L561

andreas


5 participants