TODO

  • tweak of batcher yield
  • pack.Payload reuse memory, json.NewEncoder(os.Stdout)
  • metrics isolation by cluster
  • participant starts slowly (about 30s between "starting" and "started" in the log below)
    • [06/06/17 15:06:11 CST] [TRAC] ( engine.go:281) engine starting...
    • [06/06/17 15:06:11 CST] [TRAC] ( engine.go:343) [10.9.1.1:9877] participant starting...
    • [06/06/17 15:06:41 CST] [INFO] ( engine.go:349) [10.9.1.1:9877] participant started
  • cluster
    • monitor resources cost and rebalance
    • support multiple projects
  • resource group
  • FIXME access denied leads to orphan resource
  • myslave should have no checkpoint, placed in Input
  • enhance Decision.Equals to avoid thundering herd
  • myslave server_id must be unique across the cluster
  • add Operator for Filter
    • count, filter, regex, sort, split, rename
  • RowsEvent avro
  • increase binlog replication recv buffer size
  • alert mysql binlog lags
  • dbc participants -i // show internal buffers
  • model.RowsEvent add dbus timestamp
  • HY000 auto heal
  • multiversion config in zk
  • controller
    • while a participant is electing, shutdown takes a long time (blocked by CreateLiveNode)
    • 2 phase rebalance: close participants then notify new resources
    • what if RPC fails
    • leader.onBecomingLeader is parallel: should be sequential
    • hot reload causes a cluster-wide herd: participants change too much
    • when the leader makes a decision, it persists to zk before RPC, for leader failover
    • owner of resource
    • leader RPC has epoch info
    • if Ack fails (zk crash), resort to local disk (load on startup)
    • engine shutdown, controller still sends RPC
    • test cases
      • sharded resources
      • split brain
      • zk dies or kill -9, use cache to continue work
      • kill -9 participant/leader, and reschedule
      • cluster chaos monkey
  • kafka producer qos
  • batcher only retries after full batch ack, add timer?
  • KafkaConsumer might not be able to Stop
  • kguard integration
  • router finding matcher is slow
  • hot reload on config file changed
  • each Input has its own recycle chan, so one blocked Input will not block others
  • when Input stops, Output might still need its OnAck
  • KafkaInput plugin
  • use scheme to distinguish type of DSN
  • plugins Run has no way of panic
  • (replication.go:117) [zabbix] invalid table id 2968, no correspond table map event
  • make canal, high cpu usage
    • because CAS backoff 1us, cpu busy
  • ugly design of Input/Output ack mechanism
    • we might learn from storm bolt ack
  • some goroutine leakage
  • telemetry mysql.binlog.lag/tps tag name should be input name
  • pipeline
    • 1 input, multiple output
    • filter to dispatch dbs of a single binlog to different output
  • kill Packet.input field
  • visualized flow throughput like nifi
    • dump uses dag pkg
    • pipeline
  • router metrics
  • dbusd api server
  • logging
  • share zkzone instance
  • presence and standby mode
  • graceful shutdown
  • master must drain before leave cluster
  • KafkaOutput metrics
    • binlog tps
    • kafka tps
    • lag
  • hub is shared, what if a plugin blocks others
    • currently, I have no idea how to solve this issue
  • Batcher padding
  • shutdown kafka
  • zk checkpoint vs kafka checkpoint
  • kafka follower stops replication
  • can a mysql instance with multiple databases have multiple Log/Position?
  • kafka sync produce in batch
  • DDL binlog
    • drop table y;
  • trace async producer Successes channel and mark as processed
  • metrics
  • telemetry and alert
  • what if replication conn broken
  • position will be stored in zk
  • play with binlog_row_image
  • project feature for multi-tenant
  • bug fix
    • kill dbusd, dbusd-slave did not leave cluster
    • next log position leads to failure after resume
    • KafkaOutput only supports 1-partition topics for MysqlbinlogInput
    • table id issue
    • what if invalid position
    • router stat wrong Total:142,535,625 0.00B speed:22,671/s 0.00B/s max: 0.00B/0.00B
    • ffjson marshalled bytes have a newline before the ending bracket
  • test cases
    • restart mysql master
    • mysql kill process
    • race detection
    • tc drop network packets and high latency
    • mysql binlog zk session expire
    • reset binlog pos, and check kafka did not recv dup events
    • MysqlbinlogInput max_event_length
    • min.insync.replicas=2, shutdown 1 kafka broker then start
  • GTID
    • place config to central zk znode and watch changes
  • Known issues
  • Roadmap
    • pubsub audit reporter
    • universal kafka listener and outputer
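
The controller item "leader RPC has epoch info" can be illustrated with a minimal sketch. Everything named here (`Decision`, `Participant`, `Apply`) is hypothetical and is not taken from the dbus code base. It assumes the epoch increases on every leader election, so that a participant can reject an RPC from a leader that has already been replaced:

```go
package main

import (
	"fmt"
	"sync"
)

// Decision is a hypothetical rebalance message sent by the leader.
type Decision struct {
	Epoch     int64    // leader epoch, assumed to increase on every election
	Resources []string // resources assigned to this participant
}

// Participant remembers the highest leader epoch it has accepted.
type Participant struct {
	mu        sync.Mutex
	epoch     int64
	resources []string
}

// Apply accepts d unless it comes from an older leader epoch.
// It reports whether the decision was applied.
func (p *Participant) Apply(d Decision) bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	if d.Epoch < p.epoch {
		return false // stale leader: ignore the RPC
	}
	p.epoch = d.Epoch
	p.resources = append([]string(nil), d.Resources...) // defensive copy
	return true
}

func main() {
	p := &Participant{}
	fmt.Println(p.Apply(Decision{Epoch: 2, Resources: []string{"db1"}})) // accepted
	fmt.Println(p.Apply(Decision{Epoch: 1, Resources: []string{"db2"}})) // rejected as stale
}
```

A check like this only guards the participant side. The leader would still need to persist the decision before sending it, as the list above notes.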

Issues

  • a big DELETE statement might kill dbusd
    • It might exceed the max event size (1MB); mysql seems to auto-chunk a big event into smaller events
    • It might allocate a very large block of memory in the RowsEvent struct
    • mysql packet max payload len = (1<<24) - 1 bytes
  • OSC (online schema change) tools make 'ALTER' very complex, so dbusd is not able to clear its table columns cache
    • use SQL comment to solve it
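
To make the packet limit above concrete, here is a small sketch that counts how many MySQL protocol packets a payload of a given size needs. It relies on the protocol rule that a payload of (1<<24) - 1 bytes or more is split, and that the last packet must be shorter than the maximum (possibly empty). The names `maxPayloadLen` and `packetCount` are illustrative only:

```go
package main

import "fmt"

// maxPayloadLen is the largest payload one MySQL packet can carry:
// the length field in the packet header is 3 bytes wide.
const maxPayloadLen = 1<<24 - 1

// packetCount returns how many packets are needed for payloadLen bytes.
// A payload that is an exact multiple of maxPayloadLen is followed by
// one extra empty packet, hence the unconditional +1.
func packetCount(payloadLen int) int {
	return payloadLen/maxPayloadLen + 1
}

func main() {
	for _, n := range []int{1 << 20, maxPayloadLen, 64 << 20} {
		fmt.Printf("%d bytes -> %d packet(s)\n", n, packetCount(n))
	}
}
```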

Memo

  • mysqlbinlog input peak with mock output

    • 140k events per second
    • 30k row events per second
    • 260Mb network bandwidth
    • KafkaOutput 35K msg per second
    • it takes 2h25m to reach zero lag for a platform with 2d of lag
  • dryrun MockInput -> MockOutput

    • 2.1M packet/s