Wikipedia on Swarm #1

Open
costgallo opened this issue May 24, 2022 · 9 comments

costgallo commented May 24, 2022

Motivation:

  • E.T. wants to access Wikipedia, but the site is no longer accessible on the web.
  • They want to find a way to download the entire Wikipedia and access it offline, but this download requires a lot of traffic (500 GB of wiki mirror) and is difficult to update - there surely must be an easier, faster and more efficient way to access Wikipedia.
  • Fortunately, the Swarm network hosts a mirror of Wikipedia, so anyone connected to it can access and search through the complete set of information.
  • Wikipedia on Swarm is updated periodically (at minimum on a monthly basis) and is always accessible through the same link and the same means.

Goals:

  • Maintain a mirror of English Wikipedia on Swarm, complying with all necessary licences. It should be able to handle non-Latin alphabet characters (so that editions in languages like Spanish, Russian, Czech, Arabic, Farsi, Korean, Japanese and Chinese can be uploaded as well).
  • Create a reusable solution that provides broader utility - components can be reused to upload large collections of small files like other ZIM archives (e.g. Project Gutenberg) or OpenStreetMap data.
  • Create or modify a web interface and/or an app to allow searching and reading of Swarm hosted content.
  • Anyone with a devops background should be able to run the solution. Nonetheless, we expect the winner(s) of the bounty to run and maintain the tools (and they may qualify for a Fellowship in return). The solution needs to be open source and well documented.

Note:
Hosting of a database of this scale on Swarm has not been efficiently automated yet.
As of today, the Bee client can reliably upload and retrieve small files. For larger datasets, an efficient mechanism for upload should be implemented.
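
For context, a single small-file upload to a Bee node could look roughly like the sketch below. It assumes a local Bee node with its HTTP API on the default port 1633 and an already purchased postage batch; `BEE_API` and `build_file_upload` are hypothetical names introduced for illustration, and the endpoint and header should be checked against the Bee version you run.

```python
from typing import Dict, Tuple
from urllib.parse import quote

# Assumption: a local Bee node exposing its HTTP API on the default port.
BEE_API = "http://localhost:1633"


def build_file_upload(
    name: str, data: bytes, batch_id: str, content_type: str = "text/html"
) -> Tuple[str, Dict[str, str], bytes]:
    """Return (url, headers, body) for uploading one file via POST /bzz.

    Nothing is sent here; the caller decides how to perform the request.
    """
    url = f"{BEE_API}/bzz?name={quote(name)}"
    headers = {
        "Content-Type": content_type,
        # The postage batch that pays for storing the uploaded chunks.
        "Swarm-Postage-Batch-Id": batch_id,
    }
    return url, headers, data
```

The returned triple can be passed to any HTTP client, for example `urllib.request.Request(url, data=body, headers=headers, method="POST")`. On success the node is expected to answer with a JSON body containing a `reference` field, which is the Swarm hash of the file.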

Technical requirements:

  • Create a pipeline built from a set of independent components that observes Wikipedia dumps and uploads them to the Swarm network.
  • The design of the interfaces as well as the actual modularisation between these components is up to you. Below is a suggested pipeline. The only component we would like to always keep separate is the Uploader.
      +---------+    +------------+    +-----------+    +----------+    +----------+
      | Trigger | -> | Downloader | -> | Extractor | -> | Enhancer | -> | Uploader |
      +---------+    +------------+    +-----------+    +----------+    +----------+
  • Trigger: triggers the upload
    • It either watches the repository with ZIM archives or, through some other means, triggers the build periodically as new versions of ZIM are released.
    • A good place to download these archives from is https://dumps.wikimedia.org/other/kiwix/zim/wikipedia/
  • Downloader: downloads a specific ZIM archive
    • Any optimisations such as checksum validation or some form of streaming into the next step in the pipeline would be appreciated but not required.
  • Extractor: extracts the archive
    • Here you can get inspired by the great submissions to the We Are Millions hackathon.
  • Enhancer (optional but recommended)
    • A content-aware step which enriches the files with additional information such as the checksum of the uploaded ZIM file, the date it was released, the ENS name, etc. This step should, however, be deterministic and replicable.
    • It can also add UX features like a search mechanism, and any such addition will be considered for the bounty.
  • Uploader: uploads the archive content to the Swarm network.
    • A reliable mechanism for uploads of large datasets should be created.
    • One approach is to upload the files individually and then create a custom manifest where you link all the files together. You can get inspired by EdgarBarrantes/swarm-zim-uploader, but feel free to experiment with new approaches as well.
    • The Uploader should handle any errors or problems that may occur and continue uploading.
    • The mechanism should support streaming of the content to upload (and should start uploading right away) as well as just mounting a resource and uploading that.
    • The output of this step is a Swarm hash (can of course output additional data).
    • Handling of postage stamps, etc. can be done through gateway-proxy or built into the solution.
    • This component needs to be content-agnostic and able to upload any collection of small files and nested directories.
  • Feeds and ENS (bonus)
    • The resulting Swarm hash could and most likely should be stored in a feed.
    • This feed is separately stamped and re-uploaded to ensure it does not disappear from the Swarm network.
    • The hash should be updated each time in an ENS record or this ENS record should store the feed. The design is up to you and will be considered when awarding the bounty.
  • The solution should consist of at least two independent Docker containers (one of them being the Uploader) that can be chained one after the other with clear interfaces between them. Splitting it into smaller services that each do one thing would be appreciated.
  • In addition, any improvements such as a performant decentralised search mechanism using Swarm or Swarm and external decentralised services would be welcomed.
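
The Uploader requirements above (upload files individually, keep going when errors occur, start as soon as content is available, and emit data from which a manifest can be built) could be sketched roughly as follows. This is an illustration under stated assumptions, not the bounty's reference design: `UploadResult`, `upload_collection` and the injected `upload_one` callable are hypothetical names, and the actual call to a Bee node or gateway-proxy is left to the caller.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List, Tuple


@dataclass
class UploadResult:
    # Maps each relative file path to the reference returned for it.
    manifest: Dict[str, str] = field(default_factory=dict)
    # Paths that still failed after all retry attempts.
    failed: List[str] = field(default_factory=list)


def upload_collection(
    files: Iterable[Tuple[str, bytes]],
    upload_one: Callable[[str, bytes], str],
    max_attempts: int = 3,
    backoff_seconds: float = 0.0,
) -> UploadResult:
    """Upload files one by one, retrying failures without aborting the run.

    `files` may be a lazy iterator, so uploading starts as soon as the first
    file is produced by the previous pipeline step. `upload_one` is a
    hypothetical callable that stores one file and returns its reference.
    """
    result = UploadResult()
    for path, data in files:
        for attempt in range(1, max_attempts + 1):
            try:
                result.manifest[path] = upload_one(path, data)
                break
            except Exception:
                if attempt == max_attempts:
                    # Record the failure and continue with the next file.
                    result.failed.append(path)
                else:
                    # Simple linear backoff before the next attempt.
                    time.sleep(backoff_seconds * attempt)
    return result
```

The `manifest` mapping is what a later step would turn into a Swarm manifest linking all files together, while `failed` gives a list to re-queue or report. Injecting `upload_one` keeps the component agnostic about both the content and the transport.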

Assessment Criteria
Internal:

  • Can the solution upload the full English Wikipedia?
  • Is there an uploaded version of English Wikipedia and other languages?
  • Can we run the product on an AWS Linux machine using the documentation provided?
  • What is the final product’s user experience? Does it have a search mechanism? How does it perform? We’d love the bounty to invite any innovation.
  • Does it implement any optimisations? E.g. uploading only what changed, advanced stamp management, restamping chunks, reuploading missing chunks, streaming…
  • How much time/resources does it consume?
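
One of the optimisations mentioned above, uploading only what changed, could be approached by comparing content hashes between two extracted snapshots. The sketch below is one possible way to do this, not a prescribed method; `digest_files` and `plan_incremental_upload` are hypothetical helper names, and SHA-256 is an arbitrary choice of digest.

```python
import hashlib
from typing import Dict, Iterable, Set, Tuple


def digest_files(files: Iterable[Tuple[str, bytes]]) -> Dict[str, str]:
    """Map each path to the SHA-256 hex digest of its content."""
    return {path: hashlib.sha256(data).hexdigest() for path, data in files}


def plan_incremental_upload(
    previous: Dict[str, str], current: Dict[str, str]
) -> Tuple[Set[str], Set[str]]:
    """Compare two snapshots and return (to_upload, to_remove).

    `to_upload` holds paths that are new or whose content changed;
    `to_remove` holds paths that no longer exist in the current snapshot.
    """
    to_upload = {p for p, h in current.items() if previous.get(p) != h}
    to_remove = set(previous) - set(current)
    return to_upload, to_remove
```

Only the paths in `to_upload` need to be sent again; unchanged files can keep the references recorded for the previous snapshot.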

External:

  • Meeting the listed requirements.
  • Swarm hash pointing to a full English Wikipedia uploaded to Swarm through your solution.
  • Quality of implementation: code, documentation, technical excellence.
  • Quality of user experience (with regard to uploading and, more importantly, with regard to using the uploaded Wikipedia).
  • How innovative is your solution, i.e. does it have additional features such as a search mechanism or a mechanism ensuring that the content stays available (on-demand reuploading or global pinning)?
  • Technical complexity and optimisations.

Code of Conduct
Let’s build exciting things.

Projects need to support privacy, data interoperability and data sovereignty where applicable. For more details, get familiar with Fair Data Society's principles.

Prize challenge:
This bounty has a total value of 50k DAI, to be disbursed in BZZ (at the price of BZZ on the day of the payout, as determined by Swarm Association). The prize can be awarded to a single project or split among several, as the judging committee deems most appropriate.
The winner of the first prize will receive at least 20k DAI.
The remaining 30k DAI will be distributed to the winner of the first prize or to other winners, up to 5th place, as the judging committee deems most appropriate. If, according to the judging committee, no project meets all the criteria mentioned above, no prizes will be awarded and the deadline might be extended.

The most promising projects may also be contacted for a Fellowship.

Submission requirements
A final delivery will include:

  • Open source code licence:
    • any licence for the project shall include terms substantially similar to those of version 1.9 of the Open Source Definition, promulgated by the Open Source Initiative at https://opensource.org/osd-annotated.
  • Documentation: how to run, how it works
  • Recorded video demonstration of the working solution
  • A working link to a public GitHub or GitLab repository containing: the code, presentations, demo, documentation, and licence information
  • The submitted solution must be a working product ready to be deployed in production
  • The solution uses the Swarm network as the underlying storage

The judging panel will attempt to retrieve several random pages to ensure that English Wikipedia has been uploaded to Swarm mainnet as required for this bounty. Projects that do not pass this test will be discarded.
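
Participants may want to run a similar spot check themselves before submitting. The sketch below picks a random sample of page paths and builds retrieval URLs for them; it assumes pages are served under `/bzz/<root reference>/<path>` by a Bee node or gateway, and `sample_page_urls` as well as the default `gateway` address are hypothetical values introduced for illustration.

```python
import random
from typing import List, Optional, Sequence
from urllib.parse import quote


def sample_page_urls(
    root_reference: str,
    paths: Sequence[str],
    count: int,
    gateway: str = "http://localhost:1633",  # assumed local Bee node
    seed: Optional[int] = None,
) -> List[str]:
    """Pick up to `count` distinct paths at random and build their URLs."""
    rng = random.Random(seed)
    chosen = rng.sample(list(paths), min(count, len(paths)))
    # Percent-encode each path so non-Latin titles form valid URLs.
    return [f"{gateway}/bzz/{root_reference}/{quote(p)}" for p in chosen]
```

Fetching each returned URL and checking for a successful response gives a quick indication of whether the upload is retrievable, although it does not prove that every page is present.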

Eligibility
Employees, contractors, or officers of Swarm Association and their affiliates are not eligible to participate in the bounty.

Participants can register as a team or as individuals. Participants can either join other teams or work alone. We believe in collaboration and encourage participants to work together.

Timeline
The deadline for submitting your project is August 31st, 2022 at 16:00 CET.

Winners will be announced within 30 days of the deadline.

Important Links
Swarm Gateway
White Paper
Bee Documentation
Swarm Discord
FDS GitHub
FDS Discord

No Liability
The participant acknowledges and agrees that, to the fullest extent permitted by law, he/she will not hold Swarm liable for any and all damages or injury whatsoever caused by or related to his/her participation in the bounty under any cause or action whatsoever of any kind, including, without limitation, actions for breach of warranty, breach of contract or tort (including negligence) and that Swarm shall not be liable for any indirect, incidental, special, exemplary or consequential damages, including for loss of profits, goodwill or data, in any way whatsoever arising out of the participation in the bounty.

Governing Law and Jurisdiction
These terms, as well as all matters arising out of or in relation to them, shall be governed by the laws of Switzerland, to the exclusion of the rules on conflicts of laws.

Any claim or dispute regarding these terms or in relation to them shall be subject to the exclusive jurisdiction of the Courts of Neuchâtel, Switzerland, subject to an appeal at the Swiss Federal Court.

@costgallo costgallo self-assigned this May 25, 2022

gitcoinbot commented May 29, 2022

Issue Status: 1. Open 2. Started 3. Submitted 4. Done


Work has been started.

These users each claimed they can complete the work by 1 month ago.
Please review their action plans below:

1) vporton has started work.

The task seems manageable. Maybe I'll rewrite Bee in Rust.
2) igar1991 has started work.

I have experience working with Swarm. Everything will be done!
3) jusonalien has started work.

I'm familiar with swarm ecosystem, have a good time!
4) hieple7985 has started work.

We have a great Team to handle it very well
5) karthik-co has started work.

Cool project motivation/goals!
(a) I will research with my team and play with it initially and then
(b) hack out a good solution that fits/solves the needs
6) lyledavids has started work.

Create and Maintain a mirror of wikipedia on Swarm,
7) mfw78 has started work.

As the amount of data to be wrangled is large, I will implement some Ethereum Swarm primitives in Rust and use these libraries for data processing.

Learn more on the Gitcoin Issue Details page.


gitcoinbot commented Jul 2, 2022

Issue Status: 1. Open 2. Started 3. Submitted 4. Done


Work for 96089.3681 BZZ (50000.00 USD @ $0.44/BZZ) has been submitted by:

  1. @shepherliu
  2. @igar1991
  3. @hieple7985
  4. @karthik-co
  5. @mfw78

@FairDataSociety-github please take a look at the submitted work:


@hieple7985

@costgallo is there any way I could contact you with weekly work-in-progress updates, sir?

@costgallo
Collaborator Author

@costgallo is there any way I could contact you with weekly work-in-progress updates, sir?

Hello @hieple7985, there's no need for a weekly update. If you still want to get in touch, you can use Wikiprize@ethswarm.org; if you have any questions, please go to the FDS Discord.
Good luck!

@hieple7985

I already sent an email to Wikiprize@ethswarm.org a couple of weeks ago.

I will now reach out on Discord with several questions for clarification.

@hieple7985

The invitation seems to be invalid. Which link will work, and which channel should I use?

https://discord.gg/j8hNqbZ4

@costgallo
Collaborator Author

Can you please try this link: https://discord.gg/rckvz2VC ?
Also, I didn't see the email. If you can send it again, I'll check it :)

@hieple7985

Which email address should I send it to, sir?

@costgallo
Collaborator Author

Wikiprize@ethswarm.org
