Standalone version running:
Website deployed:
Visit it deployed here.
Make sure you have the following tools installed:
- Go >= 1.22.2;
- Node >= 20.0.0;
- Make >= 3.81;
- Docker >= 25.0.3;
- Docker Compose >= v2.24.6;
To run the standalone version locally, follow the steps below:
Using Makefile:
make run
Directly using go:
PORT=8080 go run --race cmd/standalone/*.go
Then open http://localhost:8080 in your browser.
If you want to run the enrichment in your local environment and see the website, follow these steps:
- turn on MongoDB using docker-compose;
make db_up
- run enricher;
make run_enricher
- run the website;
make run_site
Then you can access the site at http://localhost:3000.
You can inspect the Makefile for more details and also to check which env vars are being injected into the process.
This project was created to demonstrate the use of Server-Sent Events (SSE) in a Go application. When users open the web client, they find a "Search" button that, when clicked, starts a data pipeline on the Go server. The pipeline scrapes various websites to gather information about castles, including their names and the country and city where they are located. The server sends these details back to the client in real time via SSE, so users see the results as they are processed.
On rethinking the project, I turned it into an open data project that collects and consolidates data about European castles and makes it available.
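As a rough illustration of the SSE flow described above, here is a minimal sketch of a Go handler that streams castles as events. The `castle` type, the `/search` path, the `castle` event name and the hard-coded results are assumptions made for the example; they are not taken from the project's code.

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

// castle is a hypothetical payload; the real project defines its own model.
type castle struct {
	Name    string `json:"name"`
	City    string `json:"city"`
	Country string `json:"country"`
}

// formatEvent renders one SSE frame: an "event:" line, a "data:" line
// and the blank line that terminates the event.
func formatEvent(event string, c castle) (string, error) {
	data, err := json.Marshal(c)
	if err != nil {
		return "", err
	}
	return fmt.Sprintf("event: %s\ndata: %s\n\n", event, data), nil
}

// searchHandler streams each result to the client as soon as it is available.
func searchHandler(w http.ResponseWriter, r *http.Request) {
	flusher, ok := w.(http.Flusher)
	if !ok {
		http.Error(w, "streaming unsupported", http.StatusInternalServerError)
		return
	}
	w.Header().Set("Content-Type", "text/event-stream")
	w.Header().Set("Cache-Control", "no-cache")

	// Hard-coded sample results stand in for the scraping pipeline.
	results := []castle{
		{Name: "Castelo de Guimarães", City: "Guimarães", Country: "Portugal"},
		{Name: "Trim Castle", City: "Trim", Country: "Ireland"},
	}
	for _, c := range results {
		// Stop early if the client went away.
		select {
		case <-r.Context().Done():
			return
		default:
		}
		frame, err := formatEvent("castle", c)
		if err != nil {
			log.Printf("encoding castle: %v", err)
			return
		}
		fmt.Fprint(w, frame)
		flusher.Flush() // push the event to the client immediately
	}
}

func main() {
	http.HandleFunc("/search", searchHandler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

On the browser side, an `EventSource` pointed at the same path would receive one `castle` event per frame.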
The enrichment process has 3 main stages.
In the first stage, it iterates through known sources of castles (web pages) and scrapes the received HTML to collect links to the pages containing the data we want to extract. For now, the websites listed in the sources table of this document are used as data sources.
Each source has an Enricher implementation that knows how to interact with that website. Every enricher must satisfy the Enricher interface, which you can check here. This first step is implemented by the CollectCastlesToEnrich method, and one concrete implementation is the one for Portugal.
All the enrichment sources are defined in the enricher entrypoint, in the map below:
enrichers := map[enricher.Source]enricher.Enricher{
	enricher.CastelosDePortugal: enricher.NewCastelosDePortugalEnricher(httpClient, htmlfetcher.Fetch),
	enricher.EDBIDAT:            enricher.NewEbidatEnricher(httpClient, htmlfetcher.Fetch),
	enricher.HeritageIreland:    enricher.NewHeritageIreland(httpClient, htmlfetcher.Fetch),
	enricher.MedievalBritain:    enricher.NewMedievalBritainEnricher(httpClient, htmlfetcher.Fetch),
}
These enrichers run concurrently, controlled by the executor package. As the links to castle pages are collected, they are passed to the next pipeline stage through a channel.
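The fan-in idea behind this stage can be sketched as follows. This is not the project's executor package: the `linkCollector` interface is a simplified, hypothetical stand-in for the real Enricher interface (whose actual signatures may differ), and `staticCollector` and the example.com links exist only for the demo.

```go
package main

import (
	"context"
	"fmt"
	"sort"
	"sync"
)

// linkCollector is a hypothetical, simplified view of the first-stage
// behaviour of an enricher.
type linkCollector interface {
	CollectCastlesToEnrich(ctx context.Context) ([]string, error)
}

// staticCollector returns a fixed list of links, for illustration only.
type staticCollector []string

func (s staticCollector) CollectCastlesToEnrich(ctx context.Context) ([]string, error) {
	return s, nil
}

// collectLinks runs every collector in its own goroutine and merges the
// links into a single channel, closed once all collectors have finished.
func collectLinks(ctx context.Context, collectors []linkCollector) <-chan string {
	out := make(chan string)
	var wg sync.WaitGroup
	for _, c := range collectors {
		wg.Add(1)
		go func(c linkCollector) {
			defer wg.Done()
			links, err := c.CollectCastlesToEnrich(ctx)
			if err != nil {
				return // a real implementation would report the error
			}
			for _, l := range links {
				select {
				case out <- l:
				case <-ctx.Done():
					return
				}
			}
		}(c)
	}
	go func() {
		wg.Wait()
		close(out)
	}()
	return out
}

// runDemo collects links from two fake sources and sorts them, because the
// arrival order of concurrent collectors is not deterministic.
func runDemo() []string {
	collectors := []linkCollector{
		staticCollector{"https://example.com/pt/1", "https://example.com/pt/2"},
		staticCollector{"https://example.com/ie/1"},
	}
	var links []string
	for l := range collectLinks(context.Background(), collectors) {
		links = append(links, l)
	}
	sort.Strings(links)
	return links
}

func main() {
	for _, l := range runDemo() {
		fmt.Println(l)
	}
}
```

Closing the output channel from a separate goroutine, after the `WaitGroup` is done, lets the next stage simply `range` over the channel.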
The second stage visits the received links and scrapes the HTML to collect the information we need. One concrete example is the implementation of the EnrichCastle method for the Heritage Ireland data source.
Once the castle struct is filled with the collected data, it is sent to the third stage through a channel.
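A second stage of this shape could look roughly like the sketch below. The `castle` struct and the `enrichFunc` type are hypothetical simplifications standing in for the project's model and its EnrichCastle implementations, and skipping failed links is an assumption of the example.

```go
package main

import (
	"errors"
	"fmt"
)

// castle is a hypothetical, reduced version of the project's castle model.
type castle struct {
	Name string
	Link string
}

// enrichFunc is a hypothetical stand-in for an EnrichCastle implementation:
// it visits one link and returns the castle scraped from it.
type enrichFunc func(link string) (castle, error)

// enrichStage reads links, enriches each one and forwards the result.
// The output channel is closed when the input channel is drained.
func enrichStage(links <-chan string, enrich enrichFunc) <-chan castle {
	out := make(chan castle)
	go func() {
		defer close(out)
		for l := range links {
			c, err := enrich(l)
			if err != nil {
				continue // assumption: failed links are skipped
			}
			out <- c
		}
	}()
	return out
}

// runDemo feeds three links (one invalid) through the stage with a fake enricher.
func runDemo() []castle {
	links := make(chan string)
	go func() {
		defer close(links)
		for _, l := range []string{"https://example.com/castle/a", "", "https://example.com/castle/b"} {
			links <- l
		}
	}()
	fake := func(link string) (castle, error) {
		if link == "" {
			return castle{}, errors.New("empty link")
		}
		return castle{Name: "castle at " + link, Link: link}, nil
	}
	var got []castle
	for c := range enrichStage(links, fake) {
		got = append(got, c)
	}
	return got
}

func main() {
	for _, c := range runDemo() {
		fmt.Printf("%+v\n", c)
	}
}
```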
The third stage receives the enriched castle. If the program detects that data is already saved for that castle, a reconciliation is performed based on the presence of key mandatory fields, such as name, city and district.
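The exact reconciliation rules live in the project's code. As one possible reading, the sketch below prefers whichever record has all key fields and fills its gaps from the other one. The `castle` struct, the `reconcile` function and the merge policy itself are assumptions made for illustration.

```go
package main

import "fmt"

// castle is a hypothetical, reduced version of the project's castle model.
type castle struct {
	Name     string
	City     string
	District string
	Country  string
}

// hasKeyFields reports whether the mandatory fields are all present.
func hasKeyFields(c castle) bool {
	return c.Name != "" && c.City != "" && c.District != ""
}

// firstNonEmpty returns a unless it is empty, in which case it returns b.
func firstNonEmpty(a, b string) string {
	if a != "" {
		return a
	}
	return b
}

// reconcile merges a saved record with a newly enriched one.
// Assumed policy: the record with complete key fields is the base,
// and its empty fields are filled from the other record.
func reconcile(saved, incoming castle) castle {
	base, other := saved, incoming
	if !hasKeyFields(saved) && hasKeyFields(incoming) {
		base, other = incoming, saved
	}
	return castle{
		Name:     firstNonEmpty(base.Name, other.Name),
		City:     firstNonEmpty(base.City, other.City),
		District: firstNonEmpty(base.District, other.District),
		Country:  firstNonEmpty(base.Country, other.Country),
	}
}

func main() {
	saved := castle{Name: "Trim Castle", Country: "Ireland"}
	incoming := castle{Name: "Trim Castle", City: "Trim", District: "Meath"}
	fmt.Printf("%+v\n", reconcile(saved, incoming))
}
```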
The reconciled castles are then accumulated in a buffer of configurable size. Once the buffer is full, they are saved to MongoDB in a single bulk operation, which avoids many individual writes against the DB.
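The buffering described above can be sketched like this. The `bulkSaver` callback is a hypothetical stand-in for the MongoDB bulk write, so the example runs without a database. Flushing the partially filled buffer when the input ends is an assumption of this sketch.

```go
package main

import "fmt"

// castle is a hypothetical, reduced version of the project's castle model.
type castle struct {
	Name string
}

// bulkSaver is a hypothetical stand-in for a MongoDB bulk write.
type bulkSaver func(batch []castle) error

// saveInBatches accumulates castles and calls save once the buffer
// holds size items, then once more for any remainder.
func saveInBatches(in <-chan castle, size int, save bulkSaver) error {
	if size < 1 {
		size = 1
	}
	buf := make([]castle, 0, size)
	for c := range in {
		buf = append(buf, c)
		if len(buf) == size {
			if err := save(buf); err != nil {
				return err
			}
			// Allocate a fresh buffer so the saved batch is never overwritten.
			buf = make([]castle, 0, size)
		}
	}
	if len(buf) > 0 {
		return save(buf)
	}
	return nil
}

// runDemo sends 5 castles through a buffer of size 2 and records batch sizes.
func runDemo() []int {
	in := make(chan castle)
	go func() {
		defer close(in)
		for i := 0; i < 5; i++ {
			in <- castle{Name: fmt.Sprintf("castle-%d", i)}
		}
	}()
	var sizes []int
	_ = saveInBatches(in, 2, func(batch []castle) error {
		sizes = append(sizes, len(batch))
		return nil
	})
	return sizes
}

func main() {
	fmt.Println(runDemo()) // prints [2 2 1]
}
```

With five castles and a buffer of two, the save callback runs three times instead of five, which is the point of batching writes.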
Country | Source web site |
---|---|
United Kingdom | https://medievalbritain.com/category/type/medieval-castles/ |
Portugal | https://www.castelosdeportugal.pt/ |
Ireland | https://heritageireland.ie/visit/castles/ |
Slovakia | https://www.ebidat.de/cgi-bin/ebidat.pl?a=a&te53=7; |
Denmark | https://www.ebidat.de/cgi-bin/ebidat.pl?a=a&te53=2; |
Find it here;