Inquisio

Overview

Inquisio is a system that extracts online news and serves the processed information. It delivers up-to-date information with low latency. Users can retrieve news data, such as title, category, tags, and content, as well as daily summaries (category counts and tag counts). Users can also consume processed streaming data from the provided Pub/Sub topics.

Check out these amazing libraries and services that I used to develop this project:

BeautifulSoup with lxml to parse HTML pages.
Requests to retrieve web pages.
SQLAlchemy with psycopg2 to interact with the PostgreSQL database from Python.
Kafka as the Pub/Sub service to enable stream data processing.
Kafka-Python to interact with Kafka from Python.
FastAPI to build a fully documented API.

Workflow

Inquisio's services are separated into 4 subprojects:

Stream Producer (Streamer)
Stream Archiver (Archiver)
Stream Transmitter (Transmitter)
Data API (API)

Notes:

Each subproject can be configured by editing its settings.py.
To run this project, please set up PostgreSQL and Kafka first.

Below is the diagram of Inquisio services:

In short:

Streamer scrapes information from online news websites.
Scraped data is processed and published to Kafka topics.
For archiving, the data is consumed by Archiver, dumped to JSON files, and processed into the SRC and DW tables.
API serves the archived data in response to user requests.
For transmission, the data is consumed by Transmitter, processed, and published to other Kafka topics.
Users can subscribe to the Transmitter topics for real-time data.
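The publish and consume steps above both need a message format. The following is a minimal sketch, assuming records travel through Kafka as UTF-8 JSON; the function names and the sample record are hypothetical and are not taken from the project code.

```python
import json
from typing import Any, Dict


def serialize(record: Dict[str, Any]) -> bytes:
    """Encode a scraped news record as UTF-8 JSON bytes for publishing."""
    return json.dumps(record, ensure_ascii=False).encode("utf-8")


def deserialize(payload: bytes) -> Dict[str, Any]:
    """Decode a consumed message payload back into a dictionary."""
    return json.loads(payload.decode("utf-8"))


# Hypothetical record: the field names follow the Overview, the values are made up.
record = {
    "title": "Example headline",
    "category": "economy",
    "tags": ["inflation", "market"],
    "content": "Example body text.",
}

payload = serialize(record)      # what a producer would send
restored = deserialize(payload)  # what a consumer would receive
```

With Kafka-Python, functions like these could be passed as value_serializer to KafkaProducer and as value_deserializer to KafkaConsumer, so that the rest of the code only handles dictionaries.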

Stream Producer

Streamer is a cluster of web-scraping (or web-crawling) services combined with data publishers. Streamer extracts news information from several websites, filtered by category (or channel) and a specific date. Currently, Streamer is capable of extracting information from:

Find out more! >>>

Stream Archiver

Archiver is a service that batches stream data and loads it into the database. Archiver consists of two components:

Consumer: subscribes to topics and batches stream data into daily data.
Loader: loads daily data into the database as source data (SRC) and updates the summary data (DW).
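The two components can be illustrated with a small in-memory sketch: grouping messages into daily batches, then deriving the category and tag counts mentioned in the Overview. The published_at field and both function names are assumptions made for illustration; the real Archiver writes JSON files and database tables instead of returning dictionaries.

```python
from collections import Counter, defaultdict
from typing import Any, Dict, Iterable, List


def batch_by_day(messages: Iterable[Dict[str, Any]]) -> Dict[str, List[Dict[str, Any]]]:
    """Group stream messages by the date part of their ISO 8601 timestamp."""
    batches: Dict[str, List[Dict[str, Any]]] = defaultdict(list)
    for message in messages:
        day = message["published_at"][:10]  # "YYYY-MM-DD" (assumed field)
        batches[day].append(message)
    return dict(batches)


def summarize(batch: List[Dict[str, Any]]) -> Dict[str, Dict[str, int]]:
    """Count categories and tags for one daily batch."""
    category_count = Counter(m["category"] for m in batch)
    tag_count = Counter(tag for m in batch for tag in m["tags"])
    return {"category_count": dict(category_count), "tag_count": dict(tag_count)}


# Made-up messages for illustration only.
messages = [
    {"published_at": "2021-01-01T08:00:00", "category": "economy", "tags": ["market", "inflation"]},
    {"published_at": "2021-01-01T09:30:00", "category": "sport", "tags": ["football"]},
    {"published_at": "2021-01-02T07:15:00", "category": "economy", "tags": ["market"]},
]

batches = batch_by_day(messages)
summary = summarize(batches["2021-01-01"])
```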

Find out more! >>>

Stream Transmitter

Transmitter is a service that processes stream data and publishes it to other Pub/Sub topics. Transmitter cleans, transforms, and adds metadata to the stream data. Users can subscribe to the Transmitter topics to get real-time data.
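As a rough illustration of the clean, transform, and add-metadata steps, the sketch below normalizes one record and attaches a metadata block. The specific cleaning rules, the metadata keys, and the transform function itself are assumptions; the actual Transmitter may apply different rules.

```python
from datetime import datetime, timezone
from typing import Any, Dict


def transform(record: Dict[str, Any], source: str) -> Dict[str, Any]:
    """Clean a raw record and attach metadata (assumed rules, for illustration)."""
    cleaned = {
        # Collapse repeated whitespace in the title.
        "title": " ".join(record["title"].split()),
        "category": record["category"].strip().lower(),
        # Normalize tags, drop empty ones, and remove duplicates.
        "tags": sorted({tag.strip().lower() for tag in record["tags"] if tag.strip()}),
        "content": record["content"].strip(),
    }
    cleaned["metadata"] = {
        "source": source,
        "processed_at": datetime.now(timezone.utc).isoformat(),
    }
    return cleaned


# Made-up raw record for illustration only.
raw = {
    "title": "  Example   headline ",
    "category": " Economy",
    "tags": ["Market", "market ", ""],
    "content": " Example body text. \n",
}
result = transform(raw, source="example-news-site")
```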

Find out more! >>>

Data API

API is used to get source or summary data. API retrieves data from the archive, so users do not have to wait for the scrapers to finish scraping.
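The idea of answering requests from already archived data can be sketched as a plain lookup. The example below uses a hypothetical in-memory list in place of the SRC table, and a hypothetical query_news function with assumed filter parameters; the real API is built with FastAPI and reads from PostgreSQL.

```python
from typing import Any, Dict, List, Optional

# Hypothetical stand-in for archived source data.
ARCHIVE: List[Dict[str, Any]] = [
    {"date": "2021-01-01", "category": "economy", "title": "Example headline A"},
    {"date": "2021-01-01", "category": "sport", "title": "Example headline B"},
    {"date": "2021-01-02", "category": "economy", "title": "Example headline C"},
]


def query_news(
    archive: List[Dict[str, Any]],
    category: Optional[str] = None,
    date: Optional[str] = None,
) -> List[Dict[str, Any]]:
    """Return archived rows matching the optional filters, without scraping."""
    results = []
    for row in archive:
        if category is not None and row["category"] != category:
            continue
        if date is not None and row["date"] != date:
            continue
        results.append(row)
    return results


economy_news = query_news(ARCHIVE, category="economy")
first_day_news = query_news(ARCHIVE, date="2021-01-01")
```

Because the lookup only touches stored rows, the response time does not depend on how long a scraping run takes.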

Find out more! >>>
