From cdb91363e2a6bcde0ff51f506de2865faa593f8f Mon Sep 17 00:00:00 2001
From: George Ho <19851673+eigenfoo@users.noreply.github.com>
Date: Tue, 14 Sep 2021 21:47:30 -0400
Subject: [PATCH] Revert "Publish websocket blog post (#93)"

This reverts commit 85d59aa6ce1a9a4c0bca25dcdf2d69170ee492a8.
---
.../2021-06-13-tornado-websockets.md | 87 +++++--------------
1 file changed, 23 insertions(+), 64 deletions(-)
rename _posts/2021-09-27-tornado-websockets.md => _drafts/2021-06-13-tornado-websockets.md (57%)
diff --git a/_posts/2021-09-27-tornado-websockets.md b/_drafts/2021-06-13-tornado-websockets.md
similarity index 57%
rename from _posts/2021-09-27-tornado-websockets.md
rename to _drafts/2021-06-13-tornado-websockets.md
index 11cfe05071d5..3266c9dea03b 100644
--- a/_posts/2021-09-27-tornado-websockets.md
+++ b/_drafts/2021-06-13-tornado-websockets.md
@@ -4,16 +4,20 @@ excerpt: "WebSockets with the Tornado web framework is a simple, robust way to
handle streaming data. I walk through a minimal example and discuss why these
tools are good for the job."
tags:
- - python
- streaming
- tornado
- websocket
header:
overlay_image: /assets/images/cool-backgrounds/cool-background8.png
caption: 'Photo credit: [coolbackgrounds.io](https://coolbackgrounds.io/)'
-last_modified_at: 2021-09-27
+last_modified_at: 2021-06-13
+search: false
---
+{% if page.noindex == true %}
+
+{% endif %}
+
A lot of data science and machine learning practice assumes a static dataset,
maybe with some MLOps tooling for rerunning a model pipeline with the freshest
version of the dataset.
@@ -31,20 +35,20 @@ requests with REST endpoints). Of course, Tornado has pretty good support for
WebSockets as well.
In this blog post I'll give a minimal example of using Tornado and WebSockets
-to handle streaming data. The toy example I have is one app (`server.py`)
-writing samples from a Bernoulli distribution to a WebSocket, and another app (`client.py`)
+to handle streaming data. The toy example I have is one app (`transmitter.py`)
+writing samples from a Bernoulli distribution to a WebSocket, and another app (`receiver.py`)
listening to the WebSocket and keeping track of the posterior distribution for
a [Beta-Binomial conjugate model](https://eigenfoo.xyz/bayesian-bandits/).
After walking through the code, I'll discuss these tools, and why they're good
choices for working with streaming data.
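
To make the statistics concrete: with a Beta prior over the Bernoulli success
probability, each incoming 0/1 sample updates the posterior by simple counting.
A minimal sketch of that bookkeeping (the names `alpha`, `beta`, and
`update_posterior` are illustrative, not taken from the post's code):

```python
# Beta-Binomial conjugacy: a Beta(alpha, beta) prior plus a Bernoulli sample x
# gives a Beta(alpha + x, beta + 1 - x) posterior.
alpha, beta = 1.0, 1.0  # Beta(1, 1) prior, i.e. uniform over the success probability


def update_posterior(sample: int) -> None:
    """Fold one Bernoulli sample (0 or 1) into the posterior hyperparameters."""
    global alpha, beta
    alpha += sample      # successes observed so far (plus the prior pseudo-count)
    beta += 1 - sample   # failures observed so far (plus the prior pseudo-count)


for x in [1, 0, 1, 1]:
    update_posterior(x)

print(f"Posterior: Beta({alpha:g}, {beta:g})")  # Posterior: Beta(4, 2)
```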
-For another tutorial on this same topic, you can check out [`proft`'s blog
+For another good tutorial on this same topic, you can check out [`proft`'s blog
post](https://en.proft.me/2014/05/16/realtime-web-application-tornado-and-websocket/).
-## Server
+## Transmitter
-- When `WebSocketServer` is registered to a REST endpoint (in `main`), it keeps
- track of any processes that are listening to that endpoint, and pushes
+- When `WebSocketHandler` is registered to a REST endpoint (on line 44), it
+ keeps track of any processes that are listening to that endpoint, and pushes
messages to them when `send_message` is called.
* Note that `clients` is a class variable, so `send_message` is a class
method.
@@ -56,20 +60,12 @@ post](https://en.proft.me/2014/05/16/realtime-web-application-tornado-and-websoc
case. For example, you could watch a file for any modifications using
[`watchdog`](https://pythonhosted.org/watchdog/), and dump the changes into
the WebSocket.
-- The [`websocket_ping_interval` and `websocket_ping_timeout` arguments to
- `tornado.Application`](https://www.tornadoweb.org/en/stable/web.html?highlight=websocket_ping#tornado.web.Application.settings)
- configure periodic pings of WebSocket connections, keeping connections alive
- and allowing dropped connections to be detected and closed.
-- It's also worth noting that there's a
- [`tornado.websocket.WebSocketHandler.websocket_max_message_size`](https://www.tornadoweb.org/en/stable/websocket.html?highlight=websocket_max_message_size#tornado.websocket.WebSocketHandler)
- attribute. While this is set to a generous 10 MiB, it's important that the
- WebSocket messages don't exceed this limit!
-
+
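
Putting those bullets together, the transmitter might look roughly like the
sketch below. This is a reconstruction rather than the post's `transmitter.py`
verbatim: the `/websocket/` endpoint, port 8888, the Bernoulli parameter of
0.7, the JSON payload shape, and the `open`/`on_close` bookkeeping are my own
placeholders, while `clients`, `send_message`, and the `websocket_ping_*`
settings come straight from the bullets.

```python
import json
import random

import tornado.ioloop
import tornado.web
import tornado.websocket


class WebSocketHandler(tornado.websocket.WebSocketHandler):  # name matches the bullets above
    clients = set()  # class variable: every process currently listening

    def open(self):
        WebSocketHandler.clients.add(self)

    def on_close(self):
        WebSocketHandler.clients.discard(self)

    @classmethod
    def send_message(cls, message: str):
        # Push the message to everyone listening on this endpoint.
        for client in cls.clients:
            try:
                client.write_message(message)
            except tornado.websocket.WebSocketClosedError:
                pass  # dropped between ticks; on_close will clean it up


def sample_bernoulli():
    # Stand-in data source: one Bernoulli(0.7) draw per tick. A `watchdog`
    # file observer could call send_message here instead.
    WebSocketHandler.send_message(json.dumps({"sample": int(random.random() < 0.7)}))


def main():
    app = tornado.web.Application(
        [(r"/websocket/", WebSocketHandler)],
        websocket_ping_interval=10,  # periodic pings keep connections alive...
        websocket_ping_timeout=30,   # ...and let dropped ones be detected and closed
    )
    app.listen(8888)

    tornado.ioloop.PeriodicCallback(sample_bernoulli, 100).start()  # every 100 ms
    tornado.ioloop.IOLoop.current().start()


if __name__ == "__main__":
    main()
```

The data source here is just a `PeriodicCallback`; anything that can call
`send_message` on a schedule or on an event would slot in the same way.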
-## Client
+## Receiver
-- `WebSocketClient` is a class that:
+- `WebSocketReceiver` is a class that:
1. Can be `start`ed and `stop`ped to connect/disconnect to the WebSocket and
start/stop listening to it in a separate thread
2. Can process every message (`on_message`) it hears from the WebSocket: in
@@ -78,39 +74,17 @@ post](https://en.proft.me/2014/05/16/realtime-web-application-tornado-and-websoc
but this processing could theoretically be anything. For example, you
could do some further processing of the message and then dump that into a
separate WebSocket for other apps (or even users!) to subscribe to.
-- To connect to the WebSocket, we need to use a WebSocket library: thankfully
- Tornado has built-in WebSocket functionality (`tornado.websocket`), but
- we're also free to use other libraries such as the creatively named
- [`websockets`](https://github.com/aaugustin/websockets) or
+- To connect to the WebSocket, we need to use a WebSocket client, such as the
+ creatively named
[`websocket-client`](https://github.com/websocket-client/websocket-client).
-- Note that we run `on_message` on the same thread as we run
- `connect_and_read`. This isn't a problem so long as `on_message` is fast
- enough, but a potentially wiser choice would be to offload `connect_and_read`
- to a separate thread by instantiating a
- [`concurrent.futures.ThreadPoolExecutor`](https://docs.python.org/3/library/concurrent.futures.html#concurrent.futures.ThreadPoolExecutor)
- and calling
- [`tornado.ioloop.IOLoop.run_in_executor`](https://www.tornadoweb.org/en/stable/ioloop.html#tornado.ioloop.IOLoop.run_in_executor),
- so as not to block the thread where the `on_message` processing happens.
-- The `io_loop` instantiated in `main` (as well as in `server.py`) is
- important: it's how Tornado schedules tasks (a.k.a. _callbacks_) for delayed
+- Note that we run `read` in a separate thread, so as not to block the main
+ thread (where the `on_message` processing happens).
+- The `io_loop` instantiated on line 50 (as well as in `transmitter.py`) is
+ important: it's how Tornado schedules tasks (a.k.a. _callbacks_) for delayed
(a.k.a. _asynchronous_) execution. To add a callback, we simply call
`io_loop.add_callback()`.
-- The [`ping_interval` and `ping_timeout` arguments to
- `websocket_connect`](https://www.tornadoweb.org/en/stable/websocket.html?highlight=ping_#tornado.websocket.websocket_connect)
- configure periodic pings of the WebSocket connection, keeping connections
- alive and allowing dropped connections to be detected and closed.
-- The `callback=self.maybe_retry_connection` is [run on a future
- `WebSocketClientConnection`](https://github.com/tornadoweb/tornado/blob/1db5b45918da8303d2c6958ee03dbbd5dc2709e9/tornado/websocket.py#L1654-L1655).
- Here, we simply get the `future.result()` (i.e. the WebSocket client
- connection itself) — I don't actually do anything with the `self.connection`,
- but you could if you wanted. In the event of an exception while doing that,
- we assume there's a problem with the WebSocket connection and retry
- `connect_and_read` after 3 seconds. This all has the effect of recovering
- gracefully if the WebSocket is dropped or `server.py` experiences a brief
- outage for whatever reason (both of which are probably inevitable for
- long-running apps using WebSockets).
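
Assembled in one place, the receiver described above might look roughly like
this. The class and method names (`WebSocketReceiver`, `start`, `stop`,
`connect_and_read`, `on_message`, `maybe_retry_connection`) and the 3-second
retry follow the bullets; the URL, the JSON payload shape, and the
Beta-Binomial bookkeeping are placeholders of mine.

```python
import json

import tornado.ioloop
import tornado.websocket


class WebSocketReceiver:
    def __init__(self, io_loop, url="ws://localhost:8888/websocket/"):
        self.io_loop = io_loop
        self.url = url
        self.connection = None
        self.alpha, self.beta = 1.0, 1.0  # Beta(1, 1) prior

    def start(self):
        self.connect_and_read()

    def stop(self):
        if self.connection is not None:
            self.connection.close()
        self.io_loop.stop()

    def connect_and_read(self):
        tornado.websocket.websocket_connect(
            url=self.url,
            callback=self.maybe_retry_connection,  # runs on the future connection
            on_message_callback=self.on_message,
            ping_interval=10,  # periodic pings keep the connection alive...
            ping_timeout=30,   # ...and let a dropped connection be detected
        )

    def maybe_retry_connection(self, future) -> None:
        try:
            self.connection = future.result()
        except Exception:
            # The transmitter is down or the WebSocket dropped: retry in 3 seconds.
            self.io_loop.call_later(3, self.connect_and_read)

    def on_message(self, message):
        if message is None:
            # Tornado signals a closed connection with a None message: reconnect.
            self.connect_and_read()
            return
        sample = json.loads(message)["sample"]
        self.alpha += sample
        self.beta += 1 - sample


def main():
    io_loop = tornado.ioloop.IOLoop.current()
    receiver = WebSocketReceiver(io_loop)
    io_loop.add_callback(receiver.start)  # schedule the first connection attempt
    io_loop.start()


if __name__ == "__main__":
    main()
```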
-
-
+
+
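
The `ThreadPoolExecutor`/`run_in_executor` idea mentioned above can be
sketched as follows, applied here to the per-message work rather than to
`connect_and_read` itself: anything too slow for the IOLoop thread gets handed
to the pool so the loop can keep pulling messages off the WebSocket.
`expensive_processing` is a stand-in of mine, not something from the post.

```python
import time
from concurrent.futures import ThreadPoolExecutor

import tornado.ioloop

executor = ThreadPoolExecutor(max_workers=2)


def expensive_processing(message: str) -> None:
    # Stand-in for any per-message work too slow to run on the IOLoop thread.
    time.sleep(1)


def on_message(message: str) -> None:
    if message is None:
        return
    io_loop = tornado.ioloop.IOLoop.current()
    # Hand the slow work to the thread pool so the IOLoop thread stays free
    # to keep reading from the WebSocket.
    io_loop.run_in_executor(executor, expensive_processing, message)
```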
## Why Tornado?
@@ -153,21 +127,6 @@ SSE)](https://www.smashingmagazine.com/2018/02/sse-websockets-data-flow-http2/):
it seems to be a cleaner protocol for unidirectional data flow, which is really
all that we need.
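
For a sense of what that alternative looks like, server-sent events need
nothing beyond a long-lived HTTP response with the `text/event-stream` content
type, which a plain Tornado `RequestHandler` can produce. A minimal sketch,
not from the post (the endpoint, port, payload, and 100 ms cadence are
placeholders):

```python
import asyncio
import json
import random

import tornado.ioloop
import tornado.iostream
import tornado.web


class EventStreamHandler(tornado.web.RequestHandler):
    async def get(self):
        self.set_header("Content-Type", "text/event-stream")
        self.set_header("Cache-Control", "no-cache")
        try:
            while True:
                sample = int(random.random() < 0.7)
                # One SSE event: a "data:" line terminated by a blank line.
                self.write(f"data: {json.dumps({'sample': sample})}\n\n")
                await self.flush()
                await asyncio.sleep(0.1)
        except tornado.iostream.StreamClosedError:
            pass  # the client went away


def main():
    tornado.web.Application([(r"/events/", EventStreamHandler)]).listen(8889)
    tornado.ioloop.IOLoop.current().start()


if __name__ == "__main__":
    main()
```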
-Additionally, [Armin
-Ronacher](https://lucumr.pocoo.org/2012/9/24/websockets-101/) has a much
-starker view of WebSockets, seeing no value in using WebSockets over TCP/IP
-sockets for this application:
-
-> Websockets make you sad. [...] Websockets are complex, way more complex than I
-> anticipated. I can understand that they work that way but I definitely don't
-> see a value in using websockets instead of regular TCP connections if all you
-> want is to exchange data between different endpoints and neither is a browser.
-
-My thought after reading these criticisms is that perhaps WebSockets aren't the
-ideal technology for handling streaming data (from a maintainability or
-architectural point of view), but that doesn't mean that they aren't good
-scalable technologies when they do work.
-
---
[^1]: There is [technically a difference](https://sqlstream.com/real-time-vs-streaming-a-short-explanation/) between "real-time" and "streaming": "real-time" refers to data that comes in as it is created, whereas "streaming" refers to a system that processes data continuously. You stream your TV show from Netflix, but since the show was created long before you watched it, you aren't viewing it in real-time.