
Some sort of data store will be needed #41

Open
soliveira-vouga opened this issue Jan 22, 2019 · 4 comments

@soliveira-vouga
Contributor

commented Jan 22, 2019

At the moment, the subscriber count in the channels.yml file is hardcoded. It'd make more sense to have it updated dynamically via a script that runs periodically... if the subscriber number is meant to be used for anything (e.g. sorting channels by number of subscribers).

We can use a data store (e.g. flat file, SQLite, etc.) that can update certain data dynamically. This will also need to be considered for #22.

To add this we'd need a backend (it could be written in Node.js) that interacts with the data store and also provides a simple API to the frontend, for example serving channel data, playlist data, etc. #22 could also use the API to post submissions in a pending-review state.
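As an illustration of the "sort channels by subscribers" use case such a data store would unlock, here is a minimal sketch. The channel field names and sample values are assumptions for illustration, not taken from the actual channels.yml:

```javascript
// Illustrative sketch only: channel shape and field names are assumed,
// not taken from the repo's channels.yml.
const channels = [
  { name: 'channel-a', subscribers: 1200 },
  { name: 'channel-b', subscribers: 5400 },
  { name: 'channel-c', subscribers: 300 },
];

// Sort channels by subscriber count, descending, without mutating the input.
function sortBySubscribers(list) {
  return [...list].sort((a, b) => b.subscribers - a.subscribers);
}

console.log(sortBySubscribers(channels).map((ch) => ch.name));
// → [ 'channel-b', 'channel-a', 'channel-c' ]
```

A backend endpoint could simply serve the result of this sort, rather than the frontend re-sorting on every page load.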

@dirkkelly

Member

commented Jan 23, 2019

@soliveira-vouga Thanks for this. We agree this needs to be updated dynamically; however, there should be no reason to implement any data store other than the yml files inside the data folder.

Additional lambda services can be set up to commit directly to master, which would trigger a rebuild of the website, or even to open a pull request so that a human can confirm before merging.

Channel subscriber data, for example, could be updated daily, whereas watch/like counts on videos might be refreshed every few hours. This would mean no additional load on the user's browser and no delays in retrieving data; it also means we could sort natively on the data.
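The merge step such a scheduled job would run could look like the sketch below. This is a hypothetical illustration (the field names and the shape of the fetched counts are assumptions): take the stored channel records plus a map of freshly fetched subscriber counts, and return updated records ready to be written back to the yml files and committed.

```javascript
// Hypothetical sketch of a daily update job's merge step. Field names are
// assumed; channels not present in latestCounts are left untouched.
function applySubscriberCounts(channels, latestCounts) {
  return channels.map((ch) =>
    latestCounts[ch.name] !== undefined
      ? { ...ch, subscribers: latestCounts[ch.name] }
      : ch
  );
}
```

Keeping this step pure (no I/O) makes it easy to test independently of whichever lambda/cron wrapper actually fetches the counts and commits the result.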

I'm not sure how much you've worked with static site generators; they definitely take a bit to wrap your head around. The more conversations we can have about these topics the better. I had a good chat with @Murodese last night to go over this exact issue; right now he's looking into making an executable that can add/edit channels/videos, which could then be leveraged by the updating service.

@soliveira-vouga

Contributor Author

commented Jan 23, 2019

I've used Jekyll in the past. While it's great for simple blogs or static websites, it's limited to that: it cannot handle dynamic forms, and something as simple as a contact form requires a third-party service. I also wouldn't use it if one of the requirements is that data is constantly updated via a script running on a cron job; at that point we're adding more overhead than simply using a database (e.g. a flat-file DB). If you're going to use Lambda, why not use S3 as your data storage? If you're going to use Lambda, you'll also need to store git credentials somewhere for it to have permission to push to master. The same goes for whatever technology you decide to use for the user submission form. You will now need to handle git credentials safely, which is additional overhead.

Also, using git as the data store creates additional problems when the public submits content via a web form. How will you handle their commits? Will they go to a separate branch? How will you keep that branch in sync with master? What if there is a conflict between the branches? How will you select which submissions are to be included and which are not? You can use cherry-pick, but even that method carries the history of the commit you're picking. Again, this is all continuous overhead that a simple data store would resolve.

You'll already need a backend to handle form data (e.g. taking the form data and committing it to git), so why not use that backend to store the data in a proper DB?
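To make the pending-review idea concrete, here is a hypothetical sketch of what such a backend's submission store could look like. It is in-memory for brevity and every name in it is illustrative; a real backend would persist to SQLite or a flat file:

```javascript
// Hypothetical sketch of a pending-review submission store, as an
// alternative to committing form data straight to git. In-memory only;
// a real backend would persist to SQLite or a flat file.
function createSubmissionStore() {
  const submissions = [];
  return {
    // Record a new submission; it starts out pending review.
    submit(data) {
      const entry = { id: submissions.length + 1, status: 'pending', data };
      submissions.push(entry);
      return entry.id;
    },
    // A moderator approves a submission by id.
    approve(id) {
      const entry = submissions.find((s) => s.id === id);
      if (entry) entry.status = 'approved';
      return entry;
    },
    // Everything still awaiting review.
    pending() {
      return submissions.filter((s) => s.status === 'pending');
    },
  };
}
```

With this shape, moderation becomes a status flip in the store rather than a branch merge, sidestepping the sync/conflict questions above.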

@quietlemon


commented Mar 5, 2019

Personally, I have mixed feelings about this (and I've experimented with both approaches). The security-related issues @soliveira-vouga brought up are very real, but I think the main problem with git-as-DB is that there is no built-in ACL and the past can be rewritten (that is not true of all version control systems, just git). So giving a bot access to a repository can have disastrous consequences if the bot gets compromised.

In the end you're trying to synchronise your static website (breadtube) with external services. So you're trying to build a coherent state, but static websites are mostly "stateless" information-sharing tools, where you need to rely on external tools (whether manual curating, your forge, or some other application) to populate your data folders (i.e. handle the state). But most software forges were not built with this purpose in mind, so you end up either giving full access to your repo to some external service (ouch) or keeping the content on your build machine (with proper backups) and not using git for this.

Approach #1: 3rd-party service managing your repo

That's what Netlify does with its "CMS" and "Forms" features. Basically, they run a service managing identity and permissions, and let a bot of theirs open/merge PRs (using PR titles as a means of keeping track) when you ask it to through their web interface.

Pros:

  • all your content is stored in git and can be coherently replicated without having to reimport everything all over again

Cons:

  • give full access to a bot run by a for-profit entity (when is it going to be hacked/subpoenaed? :D)
  • need to implement code for every supported forge (i.e. GitHub, SourceHut, GitLab, Gitea...) unless you use Netlify services (they already implemented GitHub and GitLab)
  • lose track of your actually-useful PRs in the sea of submissions awaiting approval

So the only way you're not reinventing the wheel (a super complex wheel) going in this direction is by using Netlify, who are as far as I know the only people who have such a usable solution. But do we trust them, their financial interests, and their thousands of lines of ugly NodeJS code controlling our git repository as the single source of truth?

Approach #2: your data folder is curated locally

You can add some parts of your data folder to .gitignore so they don't get synced to the forge. In this case, you want to handle backups properly so you can get back up and running quickly if you need to migrate or restore (so you don't have to fetch info about thousands of videos again). It also allows you to focus on developing the breadtube core in the repo (separation of concerns).

Even if your data is stored locally, you need a way for remote users to suggest content, so you really want some form of web API taking care of authentication, permissions management (ACL), moderation... Then you can expose a web "app" (HTML forms + CSS, no JS needed) and a CLI client for your API to propose content.

Pros:

  • your web API for submitting content cannot compromise your templates or rewrite history, only insert/remove data (risk mitigation)
  • the code to audit is notably smaller, as you don't need to deal with different forges (the "remove some piles of shit from the shitstack" approach to security)
  • can be self-hosted anywhere without relying on evil corporations!

Cons:

  • the centralized source of truth lives on your build server and therefore cannot be inspected/reused, and probably provides fewer history-keeping and content-integrity mechanisms than a version control system (such as a git repo)
  • you really, really need to keep backups, because the forge isn't gonna do it for you

Note: I did not mention any external database in regard to this API because we already have a flat-file database. But of course it could be swapped for SQLite or whatever suits better, and then an export script could generate the data files accordingly. Personally, I don't really see the point, but it's feasible and fits within this second approach.
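The export script mentioned above could be as small as the sketch below. This is hypothetical (field names are assumptions), and a real implementation would use a YAML library rather than emitting the trivial subset by hand:

```javascript
// Hypothetical sketch of the export step: turn channel records from whatever
// store is chosen back into a YAML data file for the static site generator.
// A real implementation would use a YAML library; this emits only the
// trivial flat subset by hand, and the field names are assumptions.
function channelsToYaml(channels) {
  return channels
    .map((ch) => `- name: ${ch.name}\n  subscribers: ${ch.subscribers}`)
    .join('\n');
}
```

Since the site generator only ever reads the generated files, swapping the backing store later would only mean rewriting this one export function.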

Approach #3: taking the best of both worlds?

Maybe an intermediate approach could be to use a local source of truth, but handle it as a separate version-controlled repository (such as git). This way you get integrity and history from version control while not giving your bot/API access to the whole website repository. Plus, it's notably easier to have your scripts maintain the repo, as you won't have to deal with merges and incorrect states and whatnot (the original source of truth is on your build server and git is only there to keep history/backups).
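A hypothetical sketch of what the snapshot job for this approach could look like (paths and the function are illustrative, nothing here exists in the repo). The git commands are returned rather than executed, so the policy stays easy to inspect; a real job would run them with child_process:

```javascript
// Hypothetical sketch of approach #3: snapshot the local data directory into
// its own git repository for history/backups. Commands are returned rather
// than executed; a real job would run them via child_process.
function snapshotCommands(dataDir, message) {
  return [
    ['git', '-C', dataDir, 'add', '--all'],
    ['git', '-C', dataDir, 'commit', '-m', message],
  ];
}
```

Because the data repo is append-only from the build server's point of view, the bot never needs force-push rights and history rewriting stays off the table.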

I haven't really experimented with this third approach so far. Do you have any critiques/feedback? Is it worth trying out?

EDIT: Added some formatting. Also, sorry for the long comment. I hope it brings interesting conversation and solutions to our problems :)

@dirkkelly

Member

commented Mar 18, 2019

Another really helpful comment that I completely missed. I’m going to spend this week curating issues and scoping out what work we have now.

At this point the original issue here has been fixed: there is now a script to pull all the subscriber counts, which I run daily before submitting a pull request.

I think that using the pull request model for bots will allow us to avoid issues of compromised services overwriting history (they could only submit to a branch).

The same ultimate problem of having a for-profit business in our pipeline remains; something like GitLab could solve this too.

In #43 (comment), the ability of a static site to be distributed on IPFS etc. is the sort of huge benefit I see for the cost of having systems which manage content addition/maintenance.

I think continuing the discussion on this is going to be really important, though I’m really happy to have a script to help us maintain the data store as it currently is.
