The Great Terminal Rewrite #409

@Lemmmy Lemmmy commented Apr 18, 2020

The Great Terminal Rewrite

This is a series of PRs which aims to improve terminal objects all round, with particular focus on:

  • Memory usage
  • Networking
  • Rendering

For the majority of CC's lifetime, the current terminal implementation has worked fine. But recently, especially with the improvements brought by Cobalt, people have been pushing CC further and further to its absolute limits. We've seen this a lot on SwitchCraft, with a huge number of computers doing a lot of work at once. The Juroku cinema especially (rendering 480p video at 20FPS on 16x9 monitors) has prompted some urgent improvements to the system - players frequently time out because of the sheer volume of monitor data, which is serialised inefficiently as NBT. This series of PRs aims to fix all of this.

The Plan

Terminal internals rewrite

Terminals currently use TextBuffers to store their data. This isn't an awful data structure, because it makes it very convenient to access lines, but as terminals get bigger and more of them exist, this can become a 'death by a thousand paper cuts' problem, particularly with the class instance overhead.

This class will be removed entirely, and terminals will store just three 1-dimensional byte arrays: one for text (chars), one for the background colour (0-15), and one for the text colour (0-15). This structure has some great performance benefits in that it can be trivially copied to and from (using the native System.arraycopy), and this data can be passed directly to OpenGL in the form of a Uniform Buffer Object or Buffer Texture (more on this later). term.blit and term.write now reduce to bulk array copies with no per-character loops involved (besides converting colour strings to byte arrays).
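To make the flat layout concrete, here is a minimal sketch of what such a terminal buffer could look like. The class name `FlatTerminal`, its fields and the clipping behaviour are assumptions for illustration only, not the code in this PR; the only real API used is `System.arraycopy`.

```java
import java.util.Arrays;

/** Hypothetical flat terminal storage: cell (x, y) lives at index y * width + x. */
final class FlatTerminal {
    final int width, height;
    final byte[] text, textColour, backgroundColour;

    FlatTerminal(int width, int height) {
        this.width = width;
        this.height = height;
        text = new byte[width * height];
        textColour = new byte[width * height];       // colour indices 0-15
        backgroundColour = new byte[width * height]; // colour indices 0-15
        Arrays.fill(text, (byte) ' ');
    }

    /**
     * Copy a run of cells into row y starting at column x, clipped to the row.
     * Assumes chars, fg and bg all have the same length.
     */
    void blit(int x, int y, byte[] chars, byte[] fg, byte[] bg) {
        if (y < 0 || y >= height) return;
        int srcStart = Math.max(0, -x); // skip cells that fall left of the terminal
        int dstStart = Math.max(0, x);
        int length = Math.min(chars.length - srcStart, width - dstStart);
        if (length <= 0) return;

        int offset = y * width + dstStart;
        System.arraycopy(chars, srcStart, text, offset, length);
        System.arraycopy(fg, srcStart, textColour, offset, length);
        System.arraycopy(bg, srcStart, backgroundColour, offset, length);
    }
}
```

The bulk copies replace any per-character loop, and a run that overhangs the right edge is clipped rather than wrapped onto the next row.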

Networking rewrite

Terminal serialisation is currently terrifying. If you've ever seen the printed pages NBT, you probably know what to expect. The entire terminal state (for monitors and computers) is serialised into NBT and sent to all clients in a structure that looks something like this:

bool colour: If the terminal supports colour
int  width: Terminal width in characters
int  height: Terminal height in characters

compound terminal:
  int   term_cursorX: The current cursor X position in characters
  int   term_cursorY: The current cursor Y position in characters
  bool  term_cursorBlink: Whether or not the cursor is blinking
  int   term_textColour: The current text colour (in bitmask form, so black is 32768)
  int   term_bgColour: The current background colour
  int[] term_palette: An integer array with palette colours 0-15 as RGB8 integers

  for each line in the terminal's height:
    str term_text_n: A blit string for this entire line's text
    str term_textColour_n: A blit string for this entire line's text colour (i.e. a hexadecimal string)
    str term_bgColour_n: A blit string for this entire line's background colour

This entire payload is recalculated every time anything on a terminal changes in a tick (I call this 'marked dirty'). As the terminal gets bigger and/or updates more frequently, the overhead of the nested blit tags is not negligible (and gzip can only do so much). The key takeaway here, though, is that the entire terminal state is re-sent to every player anytime anything changes (even just the cursor position!).

The new design is ultimately going to look something like this:

  • Avoid NBT (still under consideration).
  • Split the terminal into 8x8 character "chunks".
  • Only send the chunks that are marked dirty in a tick, instead of the entire terminal.
  • Serialise each chunk as two fixed-size byte arrays, prefixed by its chunk index:
    • The text is one 64-byte array (one byte per character).
    • The text and background colours are packed into one byte per character (16 colours need 4 bits each, so both fit in a single byte), giving a second 64-byte array.
    • A short (two bytes) at the start of each chunk denotes which chunk index it is.
  • Payloads are calculated once per tick, cached, and re-sent to all players, instead of individually calculated for each.
  • The entire terminal will only be sent to a player if their client requests it, e.g. they have just loaded the chunk (this implementation detail still needs to be figured out).
  • term.scroll will be a separate packet or flag in the payload, so as to not require re-sending the entire state (as scrolling will mark every chunk dirty otherwise).
  • Actions such as term.setCursorPos and term.setCursorBlink will send a minimal payload, without invalidating any chunks unless necessary.
  • If a term is cleared, chunks will not be marked as dirty, but instead a separate term.clear payload will be sent.
  • term.setPaletteColour will also be a separate packet/payload.
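As a rough sketch of the chunk serialisation described in the list above: a two-byte chunk index, 64 text bytes, then 64 packed colour bytes. The class `ChunkPayload`, the nibble order (text colour in the high nibble) and the big-endian index are all assumptions here, since the wire format has not been decided.

```java
import java.nio.ByteBuffer;

/** Hypothetical encoder for one 8x8 chunk; the exact layout is an assumption. */
final class ChunkPayload {
    static final int CELLS = 8 * 8;

    /** Pack two 4-bit colour indices into one byte (text colour in the high nibble). */
    static byte packColours(int textColour, int backgroundColour) {
        return (byte) (((textColour & 0xF) << 4) | (backgroundColour & 0xF));
    }

    static int textColour(byte packed) {
        return (packed >> 4) & 0xF;
    }

    static int backgroundColour(byte packed) {
        return packed & 0xF;
    }

    /** Serialise a chunk as: short index, 64 text bytes, 64 packed colour bytes. */
    static byte[] encode(int chunkIndex, byte[] text, byte[] fg, byte[] bg) {
        ByteBuffer out = ByteBuffer.allocate(2 + CELLS + CELLS);
        out.putShort((short) chunkIndex); // ByteBuffer is big-endian by default
        out.put(text, 0, CELLS);
        for (int i = 0; i < CELLS; i++) {
            out.put(packColours(fg[i], bg[i]));
        }
        return out.array();
    }
}
```

Under these assumptions every chunk costs a fixed 130 bytes, regardless of content.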

I believe that 8x8 chunks are a fair compromise in signal to noise ratio - there are only one or two additional bytes (the chunk index) per 64 characters in a terminal (thus, only 21 or 42 extra bytes in a regular 51x19 computer terminal, which splits into 7x3 = 21 chunks).

Implementation details

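With this payload design, it will be easy to perform the calculations necessary to reproduce the terminal. For example, a 51x19 terminal would be split into a 7x3 grid of 21 chunks, where the right-most column of chunks is only 3 characters wide and the bottom row is only 3 characters tall.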

All you need is the terminal's width and height, and you can calculate the position and dimensions of any given chunk from there:

local width, height = 51, 19
local chunksW, chunksH = math.ceil(width / 8), math.ceil(height / 8)

-- Return the absolute x/y coordinates (0-based) and the width/height of a chunk by its 0-based index.
local function getChunk(id)
  local chunkX = id % chunksW
  local chunkY = math.floor(id / chunksW)

  local x = chunkX * 8
  local y = chunkY * 8

  -- Chunks in the last column/row are truncated when the terminal size is not a multiple of 8.
  local chunkWidth = chunkX == (math.floor(width / 8)) and (width % 8) or 8
  local chunkHeight = chunkY == (math.floor(height / 8)) and (height % 8) or 8

  return x, y, chunkWidth, chunkHeight
end
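
For comparison, the same arithmetic might look like this on the Java side, together with the inverse mapping from a cell to its chunk index. `ChunkLayout` and its method names are hypothetical and only sketch the calculation.

```java
/** Hypothetical helper mirroring the Lua chunk arithmetic above (all values 0-based). */
final class ChunkLayout {
    static final int CHUNK_SIZE = 8;

    /** Number of chunk columns needed to cover the given terminal width. */
    static int chunksAcross(int width) {
        return (width + CHUNK_SIZE - 1) / CHUNK_SIZE;
    }

    /** Index of the chunk that contains cell (x, y). */
    static int chunkIndexOf(int width, int x, int y) {
        return (y / CHUNK_SIZE) * chunksAcross(width) + (x / CHUNK_SIZE);
    }

    /** Returns {x, y, chunkWidth, chunkHeight} for the chunk with the given index. */
    static int[] bounds(int width, int height, int id) {
        int chunkX = id % chunksAcross(width);
        int chunkY = id / chunksAcross(width);
        int x = chunkX * CHUNK_SIZE;
        int y = chunkY * CHUNK_SIZE;
        // Edge chunks are clipped to whatever remains of the terminal.
        return new int[] {
            x, y, Math.min(CHUNK_SIZE, width - x), Math.min(CHUNK_SIZE, height - y)
        };
    }
}
```

For a 51x19 terminal, chunk 20 (the last one) starts at (48, 16) and is 3x3 characters.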

The payload for a single chunk will look something like this: a short chunk index, followed by the 64-byte text array and the 64-byte packed colour array.

The implementation for other packets, such as term.scroll, is yet to be decided.

The goals of these changes are:

  • Reduce payload size.
  • Don't send any character or colour data if we don't have to.
  • Only send regions of the terminal that actually change, instead of everything at once.
  • Optimise common operations such as scrolling and clearing to be smaller dedicated packets/payloads.
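
One possible way to track which regions changed in a tick is a bit set with one bit per chunk. This is only a sketch: the `DirtyChunks` class is hypothetical, it assumes writes have already been clipped to the terminal, and it uses nothing beyond `java.util.BitSet`.

```java
import java.util.BitSet;

/** Hypothetical per-terminal dirty tracker with one bit per 8x8 chunk. */
final class DirtyChunks {
    private static final int CHUNK_SIZE = 8;
    private final int chunksAcross;
    private final BitSet dirty;

    DirtyChunks(int width, int height) {
        chunksAcross = (width + CHUNK_SIZE - 1) / CHUNK_SIZE;
        int chunksDown = (height + CHUNK_SIZE - 1) / CHUNK_SIZE;
        dirty = new BitSet(chunksAcross * chunksDown);
    }

    /** Mark every chunk touched by a horizontal run of cells starting at (x, y). */
    void markRun(int x, int y, int length) {
        if (length <= 0) return;
        int row = (y / CHUNK_SIZE) * chunksAcross;
        int first = x / CHUNK_SIZE;
        int last = (x + length - 1) / CHUNK_SIZE;
        dirty.set(row + first, row + last + 1); // upper bound is exclusive
    }

    /** Return the dirty chunk indices in ascending order and reset for the next tick. */
    int[] drain() {
        int[] ids = dirty.stream().toArray();
        dirty.clear();
        return ids;
    }
}
```

A write of 4 characters at (6, 9) on a 51x19 terminal straddles two chunks, so only chunks 7 and 8 would need to be sent that tick.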

Rendering rewrite (separate PR)

The final big change planned is a complete rewrite of the terminal renderer. The current renderer uses a messy mix of Tessellator quads and raw display lists. This is fine for GUI terminals and map-like held terminals, where there is only ever one on the screen at a time, and they aren't that large. However, monitors can get quite large, and there can be many of them visible at once, and updating as fast as the server will allow them. As such, the main focus for the rendering change (and most of the changes in the terminal rewrite in general) is monitors.

Hardware survey

After discussing a few ideas with Lignum, we decided to perform a hardware survey amongst the SwitchCraft userbase. This has provided us with a good sample size for understanding the hardware support of 1.12 players. As of writing, we have received 62 responses. Within those responses, 56 (90.3%) of users support OpenGL 3.1 or greater:

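| OpenGL Version | Users | %    |
|----------------|-------|------|
| 2.1            | 3     | 4.8  |
| 3.0            | 3     | 4.8  |
| 3.3            | 1     | 1.6  |
| 4.0            | 5     | 8.1  |
| 4.3            | 3     | 4.8  |
| 4.4            | 1     | 1.6  |
| 4.5            | 7     | 11.3 |
| 4.6            | 39    | 62.9 |
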
Full statistics available here.

Thus, 56 users (90.3%) support Texture Buffer Objects and Uniform Buffer Objects directly, as both features are core in OpenGL 3.1. However, they are also available through their respective ARB and EXT extensions:

| Feature                   | Users | %    |
|---------------------------|-------|------|
| OpenGL 3.1                | 56    | 90.3 |
| ARB_texture_buffer_object | 47    | 75.8 |
| EXT_texture_buffer_object | 41    | 66.1 |
| ARB_uniform_buffer_object | 59    | 95.2 |

Full statistics available here.

It doesn't seem that there are any users who have TBOs provided to them solely through an extension. Everybody who has access to TBOs has access to OpenGL 3.1. On the other hand, there are 3 users who only have access to UBOs via the ARB extension. As such, it's not impossible that there is a user who only has access to TBOs via the ARB or EXT extensions.

The new renderer idea

So, why do we care about those features? The new plan for terminal rendering is kind of wacky, but very possible. We're hoping to move the entire terminal renderer into a single shader. This shader will take in the following inputs:

  • The terminal font texture.
  • The width and height of the terminal (in characters).
  • The palette of the terminal.
  • A Buffer Texture containing the entire Terminal state:
    • Each terminal character is a single pixel in the texture.
    • The red channel represents the characters (0-255).
    • The green channel represents the text colour (0-15).
    • The blue channel represents the background colour (0-15).
  • When TBOs are not available, a Uniform Buffer Object will be used instead, with the same data as a TBO.
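
To illustrate the buffer layout described above, here is a sketch of packing the three terminal arrays into three bytes per cell (red = character, green = text colour, blue = background colour). `TerminalTexels` is a hypothetical name, and uploading the buffer to a TBO or UBO is deliberately left out, since that depends on the OpenGL bindings in use.

```java
import java.nio.ByteBuffer;

/** Hypothetical packer producing the per-cell RGB data a terminal shader would read. */
final class TerminalTexels {
    /** Three bytes per cell: R = character, G = text colour, B = background colour. */
    static ByteBuffer pack(byte[] text, byte[] textColour, byte[] backgroundColour) {
        // A direct buffer, as is typically required when handing data to native OpenGL bindings.
        ByteBuffer buffer = ByteBuffer.allocateDirect(text.length * 3);
        for (int i = 0; i < text.length; i++) {
            buffer.put(text[i]).put(textColour[i]).put(backgroundColour[i]);
        }
        buffer.flip(); // make the written bytes readable from position 0
        return buffer;
    }
}
```

With this layout a 51x19 terminal needs 2907 bytes of buffer data.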

This will unify all of the terminal renderers into a single class that handles all of the rendering, and all of the work will be done by the GPU. As such, we won't have thousands of quads on the screen just for each monitor in the world. The buffer textures can be cached (similar to frame buffers), and discarded when the chunk is unloaded. We should also be able to use some Buffer Object Streaming techniques to update the TBOs/UBOs efficiently.

Fallback renderer

Within the hardware survey's sample, there are 6 users who don't have access to TBOs, and 3 who don't have access to UBOs. The 3 without UBOs are the OpenGL 2.1 users; the 6 without TBOs are those 3 plus the 3 OpenGL 3.0 users. Despite being released in 2006, OpenGL 2.1 is still relatively common, particularly with users who are using open source graphics drivers, or Intel integrated graphics. Mojang raised the minimum OpenGL version to 2.1 in 1.8, so we don't have to worry about providing a fallback renderer for anything less than this. Regardless, UBOs and TBOs are still unavailable in a small number of cards, so we need to support something.

We decided that we still plan to scrap the current renderer, and as a fallback renderer, use VBOs. The Tessellator actually uses VBOs internally, regardless of the 'Use VBOs' option being turned off. As such, the current font renderer also uses VBOs, with the exception of monitors, where it's a half-VBO half-display list Frankenstein's monster. That said, there is a much more efficient way for these to be used, so we're going to be scrapping most of the existing code.

As such, our plan for rendering support looks something like this:

if version >= 3.1:
  // use GL3.1's native TBOs
else if GL_ARB_uniform_buffer_object:
  // use UBOs from ARB extension
else:
  // use GL2.1's native VBOs

There doesn't seem to be much point in supporting the extensions for TBOs at current usage, but if there seems to be appropriate demand, then it will look something like this:

if version >= 3.1:
  // use GL3.1's native TBOs
else if GL_ARB_texture_buffer_object:
  // use TBOs from ARB extension
else if GL_EXT_texture_buffer_object:
  // use TBOs from EXT extension
else if GL_ARB_uniform_buffer_object:
  // use UBOs from ARB extension
else:
  // use GL2.1's native VBOs

With an appropriate level of abstraction, all of this would be easy to manage, and all the terminal renderers would use a common class. It would be a matter of swapping out the function calls and changing a few arguments to support the TBO extensions.
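
As a sketch of how that selection could be expressed in Java, following the longer fallback chain above: the `RendererSelection` class and `Backend` enum are hypothetical, and the capability flags are passed in as plain values rather than queried from any real OpenGL binding.

```java
/** Hypothetical backend selection mirroring the fallback chain above. */
final class RendererSelection {
    enum Backend { TBO_CORE, TBO_ARB, TBO_EXT, UBO_ARB, VBO }

    /**
     * Pick the most capable backend. The caller is assumed to have already queried
     * the OpenGL version and the presence of each extension.
     */
    static Backend choose(int major, int minor, boolean arbTbo, boolean extTbo, boolean arbUbo) {
        boolean atLeast31 = major > 3 || (major == 3 && minor >= 1);
        if (atLeast31) return Backend.TBO_CORE; // TBOs are core in OpenGL 3.1
        if (arbTbo) return Backend.TBO_ARB;
        if (extTbo) return Backend.TBO_EXT;
        if (arbUbo) return Backend.UBO_ARB;
        return Backend.VBO;                     // OpenGL 2.1 fallback
    }
}
```

Dropping the two TBO extension branches gives the shorter chain from the first listing.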

The Roadmap

This PR is still very much a work in progress, and is currently not ready for review. It is mainly here as a marker for progress.

  • Remove TextBuffer and replace it with 1-dimensional byte arrays. (95% complete)
  • Adjust all APIs and existing terminal rendering code to use the new byte arrays.
  • Write tests for the new terminal stuff.
  • Rewrite the terminal network packet to use the new byte arrays.
  • Implement the 'small' payloads for:
    • term.clear
    • term.scroll
    • term.setCursorPos
    • term.setCursorBlink
    • term.setPaletteColour
    • Terminal resizing
  • Implement chunking/diffing.
  • Rewrite the renderers and implement TBOs/UBOs (separate PR).

This PR will definitely need to be rebased later on.

@SquidDev SquidDev added area-Core area-Minecraft enhancement labels Apr 18, 2020

@Lemmmy Lemmmy commented Apr 18, 2020

We did a little more research on the OpenGL version results from the survey, and figured out that all of the 2.1 users are macOS users. This is because Minecraft uses the compatibility profile, which the macOS driver caps at OpenGL 2.1.

So, we are basically implementing the VBO renderer for the macOS users (and whoever else comes along with a 2.1 card, though that's very unlikely).

SquidDev added a commit that referenced this issue Apr 21, 2020
This is a backport of 1.15's terminal rendering code with some further
improvements. This duplicates a fair bit of code, and is much more
efficient.

I expect the work done in #409 will supersede this, but that's unlikely
to make its way into the next release so it's worth getting this in for
now.

 - Refactor a lot of common terminal code into
   `FixedWidthFontRenderer`. This shouldn't change any behaviour, but
   makes a lot of our terminal renderers (printed pages, terminals,
   monitors) a lot cleaner.

 - Terminal rendering is done using a single mode/vertex format. Rather
   than drawing an untextured quad for the background colours, we use an
   entirely white piece of the terminal font. This allows us to batch
   draws together more elegantly.

 - Some minor optimisations:
   - Skip rendering `"\0"` and `" "` characters. These characters occur
     pretty often, especially on blank monitors and, as the font is empty
     here, it is safe to skip them.
   - Batch together adjacent background cells of the same colour. Again,
     most terminals will have large runs of the same colour, so this is a
     worthwhile optimisation.

   These optimisations do mean that terminal performance is no longer
   consistent as "noisy" terminals will have worse performance. This is
   annoying, but still worthwhile.

 - Switch monitor rendering over to use VBOs.

   We also add a config option to switch between rendering backends. By
   default we'll choose the best one compatible with your GPU, but there
   is a config option to switch between VBOs (reasonable performance) and
   display lists (bad).

When benchmarking 30 full-sized monitors rendering a static image, this
improves my FPS[^1] from 7 to 95. This is obviously an extreme case -
monitor updates are still slow, and so more frequently updating screens
will still be less than stellar.

[^1]: My graphics card is an Intel HD Graphics 520. Obviously numbers
      will vary.
@Lemmmy Lemmmy mentioned this pull request Apr 30, 2020
SquidDev added a commit that referenced this issue May 3, 2020
 - Write to a PacketBuffer instead of generating an NBT tag. This is
   then converted to an NBT byte array when we send across the network.
 - Pack background/foreground colours into a single byte.

This derives from some work I did back in 2017, and some of the changes
made/planned in #409. However, this patch does not change how terminals
are represented, it simply makes the transfer more compact.

This makes the patch incredibly small (100 lines!), but also limited in
what improvements it can make compared with #409. We send 26626 bytes
for a full-sized monitor. While a 2x improvement over the previous 58558
bytes, there's a lot of room for improvement.
SquidDev added a commit that referenced this issue May 3, 2020
This uses the system described in #409 (or at least, how I understand
it), to render monitors in a more efficient manner.

Each monitor is backed by a texture buffer object (TBO) which contains
a relatively compact encoding of the terminal state. This is then
rendered using a shader, which consumes the TBO and uses it to index
into main font texture.

My OpenGL skills are pretty much nonexistent, so the implementation of
this is no doubt terrible. However, the performance so far is
outstanding compared with the current VBO renderer, as it transmits
significantly less data to the GPU.
@SquidDev SquidDev mentioned this pull request May 3, 2020
SquidDev added a commit that referenced this issue May 5, 2020
This uses the system described in #409, to render monitors in a more
efficient manner.

Each monitor is backed by a texture buffer object (TBO) which contains
a relatively compact encoding of the terminal state. This is then
rendered using a shader, which consumes the TBO and uses it to index
into main font texture.

As we're transmitting significantly less data to the GPU (only 3 bytes
per character), this effectively reduces any update lag to 0. FPS appears
to be up by a small fraction (10-15fps on my machine, to ~110), possibly
as we're now only drawing a single quad (though doing much more work in
the shader).

On my laptop, with its Intel integrated graphics card, I'm able to draw
120 full-sized monitors (with an effective resolution of 3972 x 2330) at
a consistent 60fps. Updates still cause a slight spike, but we always
remain above 30fps - a significant improvement over VBOs, where updates
would go off the chart.

Many thanks to @Lignum and @Lemmmy for devising this scheme, and helping
test and review it! 

@SquidDev SquidDev commented May 16, 2020

Given that most of the rendering changes have been merged, it's probably worth beginning to look into what changes can be made to our network code now.

I guess I'm thinking the following steps:

  1. Compress monitor data. This was the original CC behaviour before I rewrote the packet system. Might be worth making the compression level a configuration option, or at least fiddling about (for instance, disable on a single-player world) in order to reduce CPU cost.

  2. Send monitor data as a separate packet, rather than as part of the TE data. This is going to be somewhat tricky, as we'll need to roll at least some player-tracking code ourselves.

  3. Bandwidth limits: I think before we do any further optimisations, I do want a system to prevent sending stupid amounts of data every tick.

    Effectively we want a limit of how many (monitor-related) bytes are sent to a player every tick. If we've exceeded our budget for that player, then monitors are added to a queue and deferred until later. We'll probably want to make this a priority queue of some sort, so closer monitors are updated earlier.

    It might be worth having a buffer, which is replenished by $bandwidth each tick - a little like Plethora's energy system. This way we can allow massive updates in one go, assuming they only happen rarely (i.e. when loading chunks).
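
A rough sketch of the budget idea in step 3: one bucket per player, refilled every tick and capped, with pending monitor updates sent nearest-first. Everything here (`MonitorBandwidth`, `Update`, the parameter names) is hypothetical, and it assumes the cap is at least as large as the biggest single update, otherwise that update would block the queue forever.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical per-player byte budget for monitor updates. */
final class MonitorBandwidth {
    static final class Update {
        final String monitor;
        final int bytes;
        final double distance;

        Update(String monitor, int bytes, double distance) {
            this.monitor = monitor;
            this.bytes = bytes;
            this.distance = distance;
        }
    }

    private final long bytesPerTick;
    private final long maxBudget;
    private long budget;
    // Closest monitors are sent first.
    private final PriorityQueue<Update> pending =
        new PriorityQueue<>(Comparator.comparingDouble((Update u) -> u.distance));

    MonitorBandwidth(long bytesPerTick, long maxBudget) {
        this.bytesPerTick = bytesPerTick;
        this.maxBudget = maxBudget;
        this.budget = maxBudget; // start full so an initial burst (e.g. chunk load) is allowed
    }

    void enqueue(Update update) {
        pending.add(update);
    }

    /** Call once per tick: refill the budget, then return the updates that fit in it. */
    List<Update> tick() {
        budget = Math.min(maxBudget, budget + bytesPerTick);
        List<Update> toSend = new ArrayList<>();
        while (!pending.isEmpty() && pending.peek().bytes <= budget) {
            Update next = pending.poll();
            budget -= next.bytes;
            toSend.add(next);
        }
        return toSend;
    }
}
```

With a 30000-byte cap and 1000 bytes per tick, two 26626-byte monitor updates queued together would go out 24 ticks apart, nearest monitor first.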

@SquidDev SquidDev mentioned this pull request May 18, 2020

@SquidDev SquidDev commented Jun 25, 2020

Having talked with @Lemmmy, I'm going to close this for now.

I really want to add incremental updates in the future, quite possibly using this design. However, we've made several pretty major optimisations to the network code (reducing traffic by at least 50%, often 75-80%), so this is less of a priority.

@SquidDev SquidDev closed this Jun 25, 2020