Inquisio is a system that extracts online news and serves the processed information. It provides up-to-date information with low latency. Users can get news information such as title, category, tags, and contents, as well as daily summaries (category count, tags count). Users can also get processed streaming data from the provided Pub/Sub topics.
Check out these amazing libraries and services that I used to develop this project:
- BeautifulSoup with lxml to parse HTML pages.
- Requests to retrieve web pages.
- SQLAlchemy with psycopg2 to interact with PostgreSQL database using Python.
- Kafka as the Pub/Sub service to enable stream data processing.
- Kafka-Python to interact with Kafka using Python.
- FastAPI to build a fully documented API.
Inquisio's services are separated into 4 subprojects:
- Stream Producer (Streamer)
- Stream Archiver (Archiver)
- Stream Transmitter (Transmitter)
- Data API (API)
Notes:
- Each subproject can be set up by editing its settings.py.
- To run this project, please set up PostgreSQL and Kafka first.
Below is the diagram of Inquisio services:
In short:
- Streamer will scrape information from online news websites.
- Scraped data will be processed and published to Kafka topics.
- For archive data, data will be consumed by Archiver, dumped to a JSON file, and processed into the SRC and DW tables.
- API will communicate with archived data on user request.
- For transmit data, data will be consumed by Transmitter, processed, and published to other Kafka topics.
- Users can subscribe to the Transmitter topic for real-time data.
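The scrape-then-publish step can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names `build_record` and `publish`, the record fields, and the `published_at` timestamp field are assumptions made for the example. The producer is passed in, so any object with a kafka-python style `send(topic, value=...)` method works.

```python
import json


def build_record(title, category, tags, contents, published_at):
    """Normalize one scraped article into a JSON-serializable dict.

    The field names here are hypothetical, chosen to match the news
    attributes listed in the introduction.
    """
    return {
        "title": title.strip(),
        "category": category.strip().lower(),
        # De-duplicate and normalize tags, dropping empty ones.
        "tags": sorted({t.strip().lower() for t in tags if t.strip()}),
        "contents": contents.strip(),
        "published_at": published_at,
    }


def publish(producer, topic, record):
    """Send one record to a topic as UTF-8 encoded JSON.

    `producer` is anything exposing send(topic, value=...), for example
    a kafka-python KafkaProducer.
    """
    payload = json.dumps(record, ensure_ascii=False).encode("utf-8")
    return producer.send(topic, value=payload)
```

With kafka-python, the producer could be created as `KafkaProducer(bootstrap_servers="localhost:9092")` and flushed with `producer.flush()` after publishing; the broker address and topic name depend on your own settings.py.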
Streamer is a cluster of web-scraping or web-crawling services with data publishers. Streamer extracts news information by website, category (or channel), and specific date. Currently, Streamer is capable of extracting information from:
Archiver is a service that batches stream data and loads it into the database. Archiver consists of two components:
- Consumer: subscribes to topics and batches stream data into daily data.
- Loader: loads daily data into the database as source data (SRC) and updates summary data (DW).
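The two components above can be illustrated with a small sketch. It assumes each message is a JSON object with an ISO 8601 `published_at` field, a `category`, and an optional `tags` list; these field names and the helper names `batch_by_day` and `summarize` are hypothetical. Database loading is left out, so only the batching and the summary counts are shown.

```python
import json
from collections import Counter, defaultdict


def batch_by_day(messages):
    """Group raw JSON messages (str or bytes) into {"YYYY-MM-DD": [record, ...]}."""
    batches = defaultdict(list)
    for raw in messages:
        record = json.loads(raw)
        # Assumes an ISO 8601 timestamp, so the first 10 characters are the date.
        day = record["published_at"][:10]
        batches[day].append(record)
    return dict(batches)


def summarize(records):
    """Count categories and tags across one day's records."""
    categories = Counter(r["category"] for r in records)
    tags = Counter(tag for r in records for tag in r.get("tags", []))
    return {"category_count": dict(categories), "tags_count": dict(tags)}
```

In a real consumer, `messages` could be the values read from a Kafka topic, and the output of `summarize` would correspond to the daily summary (category count, tags count) kept in the DW table.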
Transmitter is a service that processes stream data and publishes it to other Pub/Sub topics. Transmitter cleans, transforms, and adds metadata to stream data. Users can subscribe to the Transmitter topic to get real-time data.
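As a rough illustration of the clean, transform, and add-metadata step, the sketch below processes a single record. The record fields, the `meta` block, and the function name `transform` are assumptions for the example; the actual Transmitter may apply different rules.

```python
import re
from datetime import datetime, timezone


def transform(record, source="streamer"):
    """Return a cleaned copy of a record with a metadata block added."""
    # Clean: collapse runs of whitespace in the article body.
    contents = re.sub(r"\s+", " ", record.get("contents", "")).strip()
    return {
        # Transform: normalize casing and drop empty tags.
        "title": record.get("title", "").strip(),
        "category": record.get("category", "").strip().lower(),
        "tags": [t.strip().lower() for t in record.get("tags", []) if t.strip()],
        "contents": contents,
        # Metadata: hypothetical fields describing the processing step.
        "meta": {
            "source": source,
            "word_count": len(contents.split()),
            "processed_at": datetime.now(timezone.utc).isoformat(),
        },
    }
```

The returned dict can then be serialized and published to the outgoing topic in the same way the incoming data was published.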
API is used to get source or summary data. API retrieves data from the archive, so users do not have to wait for the scrapers to finish scraping.
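Because the API reads from already archived data, a request reduces to a lookup with optional filters. The sketch below shows that idea on an in-memory list; the function name `query_archive`, the filter parameters, and the record fields are hypothetical, and a real deployment would query PostgreSQL from inside a FastAPI route handler instead.

```python
def query_archive(records, category=None, date=None):
    """Return archived records matching the optional category and date filters.

    `date` is a "YYYY-MM-DD" string matched against the start of each
    record's `published_at` field (assumed to be ISO 8601).
    """
    result = []
    for record in records:
        if category is not None and record.get("category") != category:
            continue
        if date is not None and not record.get("published_at", "").startswith(date):
            continue
        result.append(record)
    return result
```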