Crawl the wiki.php.net page instead of using externals.io rss feed. #56

Open
morloderex opened this issue Aug 11, 2023 · 9 comments
Labels
feature Issues and PRs that introduce a new feature

Comments

@morloderex

morloderex commented Aug 11, 2023

The externals.io RSS feed is not ideal for synchronizing RFCs, since the link to the RFC in question is not always in the email announcing the vote, and the title doesn't always start with [VOTE] either.

I propose that we instead crawl the wiki.php.net/rfc page once a day in a queued job. That way we could also crawl the entire RFC content, so that a user reading the RFC never needs to go to the actual php.net page to read it.

This also opens up the possibility of creating RFCs that are still in the discussion phase, so the community could start voting on them long before the actual internals voting starts.

It should be rather easy, since the php.net wiki page doesn't rely heavily on JavaScript.
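A minimal sketch of what the crawl could extract from the index page, written in Python purely for illustration (the project itself is PHP). It assumes the page groups its RFC links under section headings and that the links start with `/rfc/`; the sample markup, the slugs, and the names `RfcIndexParser` and `parse_rfc_index` are all hypothetical.

```python
from html.parser import HTMLParser


class RfcIndexParser(HTMLParser):
    """Collects links under /rfc/ and groups them by the nearest preceding heading."""

    HEADINGS = {"h1", "h2", "h3"}

    def __init__(self):
        super().__init__()
        self.sections = {}        # heading text -> list of RFC paths
        self._in_heading = False
        self._heading_text = []
        self._current = None      # heading the parser is currently below

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self._in_heading = True
            self._heading_text = []
        elif tag == "a" and self._current is not None and not self._in_heading:
            href = dict(attrs).get("href") or ""
            # Assumption: RFC pages live under /rfc/; everything else is ignored.
            if href.startswith("/rfc/"):
                self.sections[self._current].append(href)

    def handle_data(self, data):
        if self._in_heading:
            self._heading_text.append(data)

    def handle_endtag(self, tag):
        if tag in self.HEADINGS and self._in_heading:
            self._in_heading = False
            self._current = "".join(self._heading_text).strip()
            self.sections.setdefault(self._current, [])


def parse_rfc_index(html):
    parser = RfcIndexParser()
    parser.feed(html)
    parser.close()
    return parser.sections


# Hypothetical markup standing in for the downloaded index page.
SAMPLE = """
<h2>Under Discussion</h2>
<ul><li><a href="/rfc/example_one">Example one</a></li>
<li><a href="/rfc/example_two">Example two</a></li></ul>
<h2>In Draft</h2>
<ul><li><a href="/rfc/example_three">Example three</a></li>
<li><a href="https://example.org/">external link</a></li></ul>
"""

sections = parse_rfc_index(SAMPLE)
print(sections["Under Discussion"])
```

Grouping by heading keeps the section an RFC is listed under available to the sync job, so it can decide per section what to import.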

Thoughts, @brendt?

@brendt
Owner

brendt commented Aug 11, 2023

Yeah that makes sense! Do you want to give it a shot?

@pronskiy
Collaborator

I've got some crawling already implemented in the scope of my research project, so I can add the basis here.

@morloderex
Author

@brendt Alright, I put together a bit of a POC for this.

It uses spatie/crawler and some DOM manipulation to extract the RFC text and put it into the description field.

We are importing stuff now, but the code is not pretty at all, and the php site's code highlighter is a bit tricky to work around.

But at least it is actually saving the data.

For now I have only done it for the RFCs under discussion, though; it's just a POC.

There's still a lot of cleanup to do on it but at least something is syncing.

@brendt check out https://github.com/morloderex/rfc-vote/tree/sync-rfcs to try it out.

brendt added the feature label Aug 14, 2023
@morloderex
Author

Okay, so just to give you an update on this ticket.

Last week I did a bit of R&D to find the best way to do this.

I stumbled upon https://github.com/ramsey/php-rfcs

That inspired me to figure out the ins and outs of how it works.

It appears that we can get everything in rst format (I plan on converting it to markdown), which is great news.

The only catch is that with this approach we cannot really get the final vote results, as the wiki returns some weird output for them when fetching the raw source of an RFC.

But I plan to work around this by simply making another request and parsing the XHTML body instead.
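A rough sketch of the two requests per RFC, in Python for illustration only. It assumes the wiki is a DokuWiki instance exposing the standard `do=export_raw` and `do=export_xhtml` export modes; the helper names `export_url` and `sync_urls` and the slug are hypothetical, and no request is actually sent here.

```python
from urllib.parse import quote, urlencode

WIKI_BASE = "https://wiki.php.net/rfc/"

# Assumption: the two DokuWiki export modes used for syncing.
EXPORT_MODES = ("export_raw", "export_xhtml")


def export_url(slug, mode):
    """Build the export URL for one RFC page in the given mode."""
    if mode not in EXPORT_MODES:
        raise ValueError(f"unsupported export mode: {mode}")
    return WIKI_BASE + quote(slug) + "?" + urlencode({"do": mode})


def sync_urls(slug):
    """One URL for the RFC text, a second one for the rendered vote results."""
    return {
        "text": export_url(slug, "export_raw"),
        "votes": export_url(slug, "export_xhtml"),
    }


print(sync_urls("example_rfc"))
```

The second request costs an extra round trip per RFC, but it is only needed for RFCs that have a vote section.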

The question now becomes: how should we structure the RFCs? By status, or by the targeted PHP version? And if an RFC is already implemented in a released version, should we even list it?

@brendt
Owner

brendt commented Aug 19, 2023

Hi @morloderex I'm sorry, somehow I missed your previous comment, super nice that you're working on this!

I think it'll be important to have control over what RFCs we sync, and which ones we don't. The way I envision it:

  • We sync all RFCs, but save them in a separate table (PendingRfc or something alike)
  • Admin users can view a list of all pending RFCs (newest first), and have a button to convert them into a real RFC that'll be published on the site.
  • Admins can also remove (soft delete) pending RFCs, so that they don't end up in the list anymore
  • When an RFC is done, we should have the option to reimport information (e.g. the vote results).
  • What's important is that we don't tightly couple PHP's RFC cycle to ours. It's entirely possible that we publish an RFC while it hasn't gone to internal voting yet, and also that we keep it published even after the internals vote has closed.
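The workflow in the list above could be sketched roughly as follows, in Python for illustration only. The field names, the helper functions, and the choice to hide already published entries from the admin list are assumptions, not something decided in this thread.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class PendingRfc:
    slug: str
    title: str
    synced_at: datetime
    published_at: Optional[datetime] = None  # set when an admin converts it into a real RFC
    deleted_at: Optional[datetime] = None    # soft delete: the row is kept, only hidden
    vote_results: Optional[dict] = None


def _now():
    return datetime.now(timezone.utc)


def admin_list(rfcs):
    """Pending RFCs an admin still has to decide on, newest first."""
    visible = [r for r in rfcs if r.deleted_at is None and r.published_at is None]
    return sorted(visible, key=lambda r: r.synced_at, reverse=True)


def publish(rfc):
    # Publishing is an admin decision and does not look at the internals status.
    rfc.published_at = _now()


def soft_delete(rfc):
    rfc.deleted_at = _now()


def reimport_votes(rfc, results):
    # Can be called at any time, including after the RFC was published.
    rfc.vote_results = dict(results)
```

Keeping `published_at` separate from anything the sync job writes is what keeps the two cycles decoupled: the sync only ever touches the imported data.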

I realise this is quite a big issue. You definitely don't have to do all of this if you don't want to ;) If you're already able to PR some kind of "synchroniser", that would already be a huge step forward :)

@morloderex
Author

morloderex commented Aug 21, 2023

@brendt

I get that you want to be in control of which RFCs we show on our site; that makes sense. But let's imagine we pull in an RFC during the internals discussion phase and have already published it.

If the RFC text then changes during the internals discussion phase, we still want to keep our copy in sync with the original text.

So we still need the RFCs to be somewhat coupled to the internals list, so that we can sync text changes from RFC revisions regardless of our internal state.
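One way to sketch that, in Python for illustration only: compare a hash of the upstream text with the stored one and update the text without touching the publication state. The record layout and the names `content_hash` and `sync_text` are hypothetical.

```python
import hashlib


def content_hash(text):
    """Hash of the text, ignoring trailing whitespace so cosmetic edits don't count."""
    normalized = "\n".join(line.rstrip() for line in text.strip().splitlines())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def sync_text(record, upstream_text):
    """Update the stored text if upstream changed; return True when it did.

    Only the text fields are written, so whether the RFC is published on our
    side is left exactly as it was.
    """
    new_hash = content_hash(upstream_text)
    if record.get("content_hash") == new_hash:
        return False
    record["text"] = upstream_text
    record["content_hash"] = new_hash
    return True


record = {"slug": "example_rfc", "published": True}
print(sync_text(record, "First revision of the RFC text.\n"))
```

Storing the hash next to the text also makes the daily job cheap to rerun, since unchanged RFCs result in no write.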

I will start work on some kind of "synchroniser", taking your feedback into account.

@pronskiy
Collaborator

Folks, please make sure you don't publish Draft RFCs https://phpopendocs.com/internals/rfc_etiquette#dont-publicise-other-peoples-draft-work

@morloderex
Author

@pronskiy I am pulling everything into another table, and then someone with administrative rights can publish the RFC to our website after the internals announcement.

I think that's the way forward, since we simply have no good way of knowing when a draft RFC is ready to be discussed.

@brendt
Owner

brendt commented Aug 26, 2023

I think we do? As soon as it's moved to the "under discussion" section?
