This repository has been archived by the owner on Mar 1, 2021. It is now read-only.

postgres database

We use PostgreSQL to store the raw scraped data from the various data sources.
The schemas closely follow the structure of the scraped data.

Table of Contents

The Instagram database is the more sophisticated of the two and is running in production.

insta_schema

Remarks

  • internal_picture_url points to the downloaded picture on S3

Twitter

This database is not in production yet. At the moment it only stores a dump of the tweaked scraped data.

twitter_schema

Debezium

The Debezium connector generates a change stream from all change events in Postgres (read, create, update, delete) and writes them to one Kafka topic per table, named "postgres.public.<table_name>".
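As a rough illustration of the naming scheme and the event types, the following Python sketch builds the topic name for a table and maps Debezium's one-letter `op` codes to the four operations listed above. The helper names `topic_for` and `describe_op` are hypothetical and not part of this repository; the `postgres` prefix and `public` schema are taken from the topic pattern above.

```python
# Debezium marks each change event with a one-letter "op" code.
OP_NAMES = {
    "r": "read",    # row read during the initial snapshot
    "c": "create",
    "u": "update",
    "d": "delete",
}


def topic_for(table_name, prefix="postgres", schema="public"):
    """Return the Kafka topic name for a table (hypothetical helper)."""
    return f"{prefix}.{schema}.{table_name}"


def describe_op(op):
    """Translate an op code into a readable name (hypothetical helper)."""
    return OP_NAMES.get(op, f"unknown ({op})")


if __name__ == "__main__":
    print(topic_for("users"))
    print(describe_op("c"))
```

Keeping the prefix and schema as parameters means the same helper would still work if the connector were configured with a different server name.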

To read from this stream you can:

  • install kafkacat
  • inspect the topic list in Kafka:
    $ kafkacat -L -b my-kafka | grep 'topic "postgres'
  • consume a topic:
    $ kafkacat -b my-kafka -t <topic_name>

The messages are quite verbose because each one includes its own schema description. The most interesting part is value.payload:

$ kafkacat -b my-kafka -t postgres.public.users | jq '.value | fromjson | .payload'
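
The same extraction can be sketched in Python. This assumes, as the jq filter above suggests, that each input line is a JSON object whose `value` field holds the Debezium message as a JSON string, and that the message is an envelope with `schema` and `payload` keys. The function name `extract_payload` and the sample row are made up for illustration.

```python
import json


def extract_payload(line):
    """Mirror `.value | fromjson | .payload` for one JSON line."""
    outer = json.loads(line)
    message = json.loads(outer["value"])  # value is itself a JSON string
    return message["payload"]


if __name__ == "__main__":
    # Made-up sample event; real messages carry a much larger schema block.
    debezium_message = {
        "schema": {"type": "struct"},
        "payload": {
            "before": None,
            "after": {"id": 1, "name": "alice"},
            "op": "c",
        },
    }
    line = json.dumps({"value": json.dumps(debezium_message)})
    payload = extract_payload(line)
    print(payload["op"], payload["after"])
```

For a create event, `before` is empty and `after` holds the new row, so `after` is usually the field worth inspecting first.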