Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP import script bbcode fixes in Fluxbb importer + some fixes to all importers #18953

Closed
wants to merge 16 commits into from

Conversation

harry-wood
Copy link
Contributor

@harry-wood harry-wood commented Nov 9, 2022

Not really work in progress any more, but there's quite a lot here so I think I should break this into smaller more manageable PRs for review. I have already done so with the very first step: #18893 If we can get that merged, then I will carve off more of this big beast into separate PRs.

We're looking to run a big FluxBB import (https://forum.openstreetmap.org/ -> https://community.openstreetmap.org/) and with the code in this branch we're hopefully ready to go for that purpose (although we're doing large scale testing still, so some further changes may arise)

At the moment there's currently a cheeky commit here to do a Gemfile path swap, to include some changes I've been making to the ruby-bbcode-to-md gem. If I can get those merged then, no need for the Gemfile path swap.

@harry-wood harry-wood force-pushed the fluxbb-bbcode-fixes branch 5 times, most recently from 0a37560 to dd5dbeb Compare November 9, 2022 12:57
@@ -257,7 +257,7 @@ if ENV["IMPORT"] == "1"

# NOTE: in import mode the version of sqlite can matter a lot, so we stick it to a specific one
gem 'sqlite3', '~> 1.3', '>= 1.3.13'
gem 'ruby-bbcode-to-md', git: 'https://github.com/nlalonde/ruby-bbcode-to-md'
gem 'ruby-bbcode-to-md', git: 'https://github.com/harry-wood/ruby-bbcode-to-md', branch: 'fluxbb-tests-and-fixes'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please create a PR with your changes at https://github.com/nlalonde/ruby-bbcode-to-md?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed I have done so (see description above). So I'm hoping this Gemfile change would eventually not be necessary, although... I don't know if nlalonde is intending to merge any PRs there and maintain that fork of the gem. D'you know what the status of that is?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

He's on our team, so I'm sure he will merge the PR when it's ready for review. Just mention me here when your PRs are ready for review and I'll get everything sorted.

@harry-wood harry-wood force-pushed the fluxbb-bbcode-fixes branch 3 times, most recently from 0309425 to 5db38b2 Compare November 10, 2022 10:53
@discoursebot
Copy link

This pull request has been mentioned on Discourse Meta. There might be relevant details there:

https://meta.discourse.org/t/migration-from-fluxbb-while-preserving-incoming-links/172351/5

@harry-wood harry-wood force-pushed the fluxbb-bbcode-fixes branch 5 times, most recently from 2f6585c to cdfab31 Compare November 14, 2022 00:21
@discoursebot
Copy link

This pull request has been mentioned on Discourse Meta. There might be relevant details there:

https://meta.discourse.org/t/migration-from-fluxbb-while-preserving-incoming-links/172351/8

Add some tests of ImportScripts::FluxBB. Lots more could be covered here.

This is a runnable script which ordinarily would run immediately when we require the file. For test set-up we need to avoid this, while still being able to load the class. We do this by disabling the `perform` method.

We also include a rudimentary mock object for faking queries to the soure FluxBB MySQL database. For more extensive tests a real MySQL database might be better, or something like "NullDB" gem.
Add post content examples with a variety of FluxBB-flavoured bbcode. This is covering the full set of examples given on the FluxBB help pagei e.g. https://bbs.archlinux.org/help.php#bbcode and https://forum.openstreetmap.org/help.php

The tests are grouped into
* Examples which converts correctly to markdown (or other discourse supported bbcode/html)
* Examples which are best left unconverted, because discourse supports the bbcode and that is the best way to format it.
* Examples of FluxBB color bbcode which cannot be converted to any syntax supported by discourse (at least not without plugin). It's debatable what the most desirable treatment of these examples is, but the test describes current behaviour which is to strip away the bbcodes to leave plain text.
...and finally (perhaps most importantly)
* A disabled set of examples which currently do not work! This is FluxBB-flavoured bbcode which should be converted to markdown (or other discourse supported bbcode/html), but which is either ignored or incorrectly converted. Expected output is included, although in some cases there are multiple ways we might do it e.g. lists could be converted to markdown or html.

Note: While we could have done a similar bunch of testing within the ruby-bbcode-to-md gem, it is this import script which claims to be handling specifically FluxBB flavoured bbcode. It does so (or attempts to do so) by running some regexp replacements in the `process_fluxbb_post` method as preproccessing before handing over to the generic bbcode_to_md gem. These tests cover both of those steps taken together.
FluxBB supports different smiley character combos to discourse.

Fix these in the pre-processing step with string substitutions, since they're not bbcode related.

Note: FluxBB only supports smileys with proceeding whitespace, and our substitutions should only apply to these ones. This prevents us having problems with, for example, messing up urls such as https://awkward.com?x=D (we don't want a :D smiley on the end there!)

Note: Discourse can have other smileys installed, for example a fresh install does not know about `:lol:`, hence we need to convert it, but meta.discourse.org does know about it.
Markdown doesn't have a way of doing underlined text other than <u>the u html tag</u>. Curiously discourse does not allow this, but does allow [u]the u bbcode[/u]. Meanwhile the ruby-bbcode-to-md gem decides to remove the [u] tag, rather than converting it to the html form (which seems like a bug).

From the point of view of importing bbcode into discourse, the correct treatment of [u] would be to leave it unaltered, so here we add it to the list of disabled tags when calling the gem.

Note: Although I'm noticing this [u]underlines[/u] issue in relation to FluxBB, I am applying this fix across all bbcode importers.
Add an overridable method `bbcode_tag_additions_and_overrides`. This allows our import scripts to pass in configuration to the ruby-bbcode-to-md gem for adding or modifying the way bbcode parsing happens. Defaults can be seen in the gem repo.

This offers a more elegant and powerful and potentially less error prone way to influence the bbcode parsing, to use instead of or alongside the approach of doing pre-processing regexp substitutions.
But add a new failing edge case [h]Heading\non two lines[/h]
Add [del],[ins],and [em] suppor to the FluxBB importer.

These are FluxBB-flavoured bbcode, or at least they're not mentioned in the bbcode.org reference.

They're also not directly supported in markdown, but in discourse we can use <del> and <ins>. There's no way of representing "em" semantically in discourse, but of course it looks the same as **bold**.
FluxBB supports an [img=alt text] parameter, which is different from the bbcode.org standard implemented by the gem https://www.bbcode.org/posting-images-with-bbcode.php so we need to override the gem's default [img] config.
Add support for [list] bbcode and all its sub tags and varients. Lists get pretty complicated when we look at sub-list and paragraphs within list items, and how to represent such things in markdown. Generally this involves '*' prefixed lines (or not, when there's paragraphs) all with exact right amount of indentation.

Although it would be more elegant to reach this markdown representation (It's nice readble syntax, if a little fragile), for now it seems easier and safer to use html tags <ul> <ol> & <li>, which discourse allows, and which can be translated directly from their bbcode equivalents, with no need for indentation logic.

This translation is best acheived with regex, because regex is good at such things, but also because the awkward [*] char, and the alternate [link] vs [link=1] representations don't feed into the bbcode gem very well.

Alphabetic ordered lists (FluxBB's [list=a] syntax) are problematic, because discourse doesn't have any way to do it https://meta.discourse.org/t/ordered-lists-with-non-number-markers/127745/6 . HTML also doesn't really have this, except via `list-style: alpha` css. Here we output `<ol class="alpha">`, not because it will work, but because it might offer a useful way to make it work with further customisation, and it retains the information.
Add [email] bbcode to our list of syntax which we don't need to alter.
This brings in some fixes to our last two failing tests as well as some new edge cases:

The gem now supports the standard [s] bbcode for ~~Strike-through~~.

The gem now supports a new `:newlines => :to_br` tag config option which we activate here to deal with newlines in headings.

The gem now avoids introducing new paragraphs mid-tag, which will break markdown output. It swaps in a <br> instead. Ugly solution but it works.
Add support for converting the various internal linking bbcodes which FluxBB has: [post], [topic], [forum] and [user], into conventional links.

For now, these are converted to relative links which *would* work on the source FluxBB site, e.g. post links become `[/viewtopic.php?pid=(old post id)](link text)`. Futher transformation tricks and redirects will need to be applied to all relative links to make them a work in the destination discourse.

Arguably the semantically correct way to link across from one post to another within discourse would be using a quote, however it may be important to preserve the look of a straightforward link, to avoid potential formatting issues (and it's an easier conversion).
Add the FLUXBB_RELATIVE_LINKS_BASE optional config which, if set, will be used to rewrite all relative links to absolute.

This is applied as a pre-processing step to all [url] tags, prior to the bbcode_to_md conversion, and then separately it is applied to the output of [post]/[topic]/[forum]/[user] bbcode conversions (which otherwise end up as more relative links)
Create permalinks to redirect FluxBB-style URLs of the forms
* profile.php?id={user id}
* viewtopic.php?pid={post id}
* viewtopic.php?id={topid id}
* viewforum.php?id={forum id}

In the case of FluxBB linking a specific post within a topic page, FluxBB provides example URLs with a fragment added, so the full URL is of the form `viewtopic.php?pid=<post id>#p<post id>`, causing the browser to jump down to the specified post. When processing a redirect, the browser may re-attach this fragment with the old post id, but it won't have any effect, and the discourse logic should currectly jump down and highlight the post id specified in the Permalink object.
We used html style list items (<li>) instead of trying to convert to markdown style, avoiding troublesome indentation challenges, but it turns out discourse has an awkward quirk. We cannot give other markdown within a single line list item, so for example:

<ul>
<li>list item **with bold**</li>
</ul>

...will not work. Bold syntax gets ignored. But if we put in an extra pair of newlines:

<ul>
<li>

list item **with bold**</li>
</ul>

...this does work. Presumably causing discourse to treat it as a new paragraph, although luckily it doesn't introduce any extra vertical space in the output, so it's a good workaround we can apply to all list items.
(After bringing in updates to the gem) we now convert [code] to ```a markdown style code block```. Various other changes in the gem, but this was the only test affected.
@nattsw
Copy link
Contributor

nattsw commented Apr 3, 2023

Hey there, we're tackling older PRs and will close PRs with no follow up. Please open a new one when you're ready.

@nattsw nattsw closed this Apr 3, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
4 participants