Questions #10

Closed
RobStallion opened this issue Dec 11, 2018 · 6 comments
Labels: awaiting-review, question (Further information is requested), starter


RobStallion commented Dec 11, 2018

@nelsonic just opening this issue as a place to ask some questions I have at the moment. Can split each question into their own issue if needed or add them to an FAQ section in the readme if you feel that would be helpful.

Questions

1.

In issue #1 you mention shortening the URL from:

location-app.com/venues/123e4567-e89b-12d3-a456-426655440000

to

location-app.com/sw1x

How would we ensure that this URL is always unique if we had millions of URLs, as YouTube does, for example? Four characters does not seem long enough for that to be possible.

2.

In issue #1 you mention:

I suggest that by using a hash of the content as the Primary Key, Ecto (or PostgreSQL) would "reject" the insert request as a "duplicate" and we would not waste space in the database/table with dupe data.

My understanding of the above is that we would take all the content from the form submission and hash it generating a string (which would be used as the primary key). The same content would generate the same string so we could easily tell if it existed or not (I think this is similar to how hash tables work under the hood (thanks for suggesting that book btw 😉)). If my understanding of this is correct then my questions are...

How would we link the change to the original? Would we still be using an alog approach where we have an :entry_id to keep track of this?

Would the long-term plan be to just track the changes, and link to the original file? (I believe this is similar to how both git and IPFS work)


nelsonic commented Dec 11, 2018

Hi @RobStallion, thanks for posting these good questions as an issue. 🎉

In future, consider:
- opening separate issues for each question you have, for clarity 💭
- making the question the title of the issue, for SEO benefits 🔍
- making the life of the "next dev" easier when they have similar questions ... 😉

Answers

1. Uniqueness?

cid will be universally unique.

That needs to be clear to everyone from the first line of the readme.
(if it's not, then it's "my bad" and I need to "fix" it ...)

We will be using the SHA-256 hash, for which no collision has been found to date. Many cryptocurrencies, Bitcoin among them, rely on 256-bit hashes. Enough "smart people" have done the homework on this for us not to worry about it.

The "math" is covered in: https://crypto.stackexchange.com/questions/39641/what-are-the-odds-of-collisions-for-a-hash-function-with-256-bit-output

This is the best video on 256 bit hash collision probability:
https://youtu.be/S9JGmA5_unY

This video does not cover the "Birthday Paradox"; for that, see: nelsonic/nelsonic.github.io#576
But again, for the purposes of this answer and indeed any project we are likely to work on in our lifetime,
when dealing with 256 bit hashes, the chance of a "birthday attack" creating a collision is "ignorable".
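
As a rough illustration of why the birthday bound is "ignorable" at 256 bits, here is a minimal sketch using the standard approximation p ≈ 1 - exp(-n(n-1) / 2^(b+1)) for n uniformly random b-bit hashes. The function name `collision_probability` and the sample sizes are assumptions made for this example; nothing here is part of cid itself.

```python
import math


def collision_probability(n_items: int, bits: int = 256) -> float:
    """Approximate chance that at least two of n_items random hashes collide.

    Birthday approximation: p ~= 1 - exp(-n(n-1) / 2^(bits+1)).
    Assumes hash outputs are uniformly distributed over 2^bits values.
    """
    exponent = -n_items * (n_items - 1) / 2 ** (bits + 1)
    # expm1 keeps precision when the probability is extremely small.
    return -math.expm1(exponent)


if __name__ == "__main__":
    # A trillion records hashed with a 256-bit function.
    print(collision_probability(10**12))
    # For contrast: the same number of records with only 64 bits of hash.
    print(collision_probability(10**12, bits=64))
```

With 256 bits the probability only approaches even odds once the number of records is on the order of 2^128, which is far beyond anything a single application will store.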

cid means we have the <option> to store content on IPFS

We need to make it clear that using a cid as the unique identifier for a record
means we can optionally store content on IPFS for redundancy/decentralisation.
But for the purposes of building our Apps in 2019, we are NOT going to even try to build "D-Apps". Unfortunately, there is no way of maintaining privacy for private/personal content on IPFS without pre-storage encryption, which in turn implies storing the encryption/decryption keys somewhere centrally.
In other words, something will need to be stored centrally, so we might as well store the data centrally
to reduce query time and request latency in any App(s) we build.

Note: the reason I haven't previously "proclaimed" in dwyl/technology-stack#67
that "all our apps will be distributed by default from now on",
is because the Application building "story" is incomplete on IPFS/IPLD.
There is no way of deleting old data that people no longer want to exist: ipfs-inactive/faq#9
This means that if someone says something hurtful or untrue, they cannot "retract" it
to stop it perpetuating on the network ... so decentralisation and content replication can be harmful!
One of the original principles of IPFS was the "permanent web".
I'm fairly certain that most users will not like the idea of "losing control" over their data,
and indeed this is incompatible with EU law: https://en.wikipedia.org/wiki/Right_to_be_forgotten
So we are going to use cid as a means of ensuring uniqueness in our DB records,
and we will use the concept of prev for versioning. See: answers 1 and 2 below.
But we are not going to store textual data on IPFS for the foreseeable future,
until "Filecoin" is fully operational and we have a guarantee that our data will not disappear.
We can still use cid 100% independently of IPFS and when the ecosystem "matures",
we can offer users of our application(s) (Time, Tudo, ALT, etc) the option to "back up" their data to IPFS!
For now, ignore the existence of IPFS and focus purely on using cid to replace entry_id in Alog.

Uniqueness in a Phoenix-based Web Application

In a given web app, there will be a PostgreSQL database that will store the data.
Each item of content will have a cid.

Imagine that we are building a "home rental" website, "restful-bed-and-healthy-breakfast.com",
which has the short domain bnb.com. Here is an example (simplified) "homes" table:

| inserted | cid (PK)¹ | address | slug | prev |
| --- | --- | --- | --- | --- |
| 1541609554 | hdyk80sgPeAX | Wayne Manor, 1007 Mountain Drive, Gotham | hdy | null |
| 1541618643 | HvTlGsEX88Nc | Wayne Manor, 1007 Mountain Drive, Gotham | waynemanor | hdyk80sgPeAX |
| 1541628987 | pN7hWNuqJ6J | Wayne Manor, 1007 Mountain Drive, Gotham, USA | waynemanor | HvTlGsEX88Nc |

The first row is the "creation" of the entry for "Wayne Manor".
At this point the URL would be: bnb.com/hdy corresponding to the first 3 letters of the cid.
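
One way to picture how such a short URL could be chosen is sketched below. It assumes the short code is the shortest prefix of the cid (at least 3 characters) that has not already been issued, checked against a central store; the thread does not spell out the exact rule, and `short_code` and `taken` are hypothetical names used only for this example.

```python
def short_code(cid: str, taken: set, min_len: int = 3) -> str:
    """Return the shortest prefix of cid (>= min_len chars) not already issued.

    `taken` stands in for a lookup against the central database of
    short codes that are already in use.
    """
    for length in range(min_len, len(cid) + 1):
        candidate = cid[:length]
        if candidate not in taken:
            return candidate
    raise ValueError("every prefix of this cid is already taken")


if __name__ == "__main__":
    issued = set()
    code = short_code("hdyk80sgPeAX", issued)
    issued.add(code)
    print(code)  # hdy
    # A later cid that happens to share the same first three characters
    # simply gets one more character.
    print(short_code("hdyQ7mZ2LpRt", issued))  # hdyQ
```

The prefix only needs to be unique within this one database, which is why a few characters can be enough here even though the full cid is much longer.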

The second row is when the listing owner updates the slug to be a more friendly waynemanor
so the URL is more human memorable and SEO friendly: bnb.com/waynemanor
The URL may be longer but it's more memorable and thus people may prefer it.

Notice how the value of prev refers to the cid of the previous version of the record?
That's how we do versioning in a cid-based web app. (see below)

As this data will be stored "centrally" in a PostgreSQL database, the DB can be responsible for ensuring that the slug field is not duplicated. Before inserting any record that has a slug, we will need to run a "SELECT" query to confirm that the user inserting the data has the access rights to update the row with that slug. We will clarify those "access control" details later; for now, let's stick with the simplified version.

In the third row, we added "USA" to the address, which changed the content and thus created a new cid. The prev refers to the previous version of the record (before "USA" was added). The slug has not changed, so the URL is still the same: bnb.com/waynemanor
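
The idea that changed content produces a new cid, while identical content is rejected as a duplicate by the primary key, can be sketched as follows. This is only an illustration under stated assumptions: Python's built-in sqlite3 stands in for PostgreSQL/Ecto, `make_cid` is a hypothetical helper that returns a plain hex SHA-256 digest (not the real cid encoding), and hashing the address, slug and prev fields together is a guess at what counts as "content".

```python
import hashlib
import json
import sqlite3


def make_cid(content: dict) -> str:
    """Hypothetical stand-in for a cid: hex SHA-256 of canonical JSON."""
    canonical = json.dumps(content, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def insert_home(db, address, slug=None, prev=None):
    """Insert one version of a listing; return (cid, was_inserted)."""
    cid = make_cid({"address": address, "slug": slug, "prev": prev})
    try:
        with db:  # commits on success, rolls back on error
            db.execute(
                "INSERT INTO homes (cid, address, slug, prev) VALUES (?, ?, ?, ?)",
                (cid, address, slug, prev),
            )
        return cid, True
    except sqlite3.IntegrityError:
        # The primary key already exists, so this exact content is stored.
        return cid, False


db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE homes (cid TEXT PRIMARY KEY, address TEXT NOT NULL, slug TEXT, prev TEXT)"
)

address = "Wayne Manor, 1007 Mountain Drive, Gotham"
first_cid, first_ok = insert_home(db, address)
dup_cid, dup_ok = insert_home(db, address)  # identical content
new_cid, new_ok = insert_home(db, address + ", USA", slug="waynemanor", prev=first_cid)
print(first_ok, dup_ok, new_ok)  # True False True
```

Including prev in the hashed content is a design choice made for this sketch: it means reverting a record to earlier content still yields a new cid rather than colliding with the original row.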

2. Updating Content

The updated version of the content would be linked to the previous version using a prev field, the way it happens in IPFS, Ethereum and Bitcoin (so it will be familiar to people):
prev: previous_cid
Address example:

| inserted | cid (PK)¹ | name | address | prev |
| --- | --- | --- | --- | --- |
| 1541609554 | gVSTedHFGBetxy | Bruce Wayne | 1007 Mountain Drive, Gotham | null |
| 1541618643 | smnELuCmEaX42 | Bruce Wayne | Rua Goncalo Afonso, Vila Madalena, Sao Paulo, 05436-100, Brazil | gVSTedHFGBetxy |

When a row does not have a prev value, we know it is the first time that content has been inserted into the database. When a prev value is defined in a row, we know this is a new version of previously inserted content and we can "traverse the tree" to see all previous versions.
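
Following prev back to the first version could look something like the sketch below. It uses the two truncated cid values from the address example and an in-memory dict in place of a database query; `history` and `rows` are hypothetical names chosen for this illustration.

```python
from typing import Dict, List, Optional


def history(rows: Dict[str, dict], cid: str) -> List[str]:
    """Return the cids from the given version back to the original, newest first."""
    chain: List[str] = []
    seen = set()
    current: Optional[str] = cid
    while current is not None:
        if current in seen:
            raise ValueError("cycle detected in prev links")
        if current not in rows:
            raise KeyError("unknown cid: " + current)
        seen.add(current)
        chain.append(current)
        current = rows[current]["prev"]  # None marks the first version
    return chain


# Rows keyed by (truncated) cid, mirroring the address example.
rows = {
    "gVSTedHFGBetxy": {"address": "1007 Mountain Drive, Gotham", "prev": None},
    "smnELuCmEaX42": {
        "address": "Rua Goncalo Afonso, Vila Madalena, Sao Paulo, 05436-100, Brazil",
        "prev": "gVSTedHFGBetxy",
    },
}

print(history(rows, "smnELuCmEaX42"))  # ['smnELuCmEaX42', 'gVSTedHFGBetxy']
```

Because each row points at exactly one previous version, the walk here is over a simple chain; the cycle check is defensive and should never trigger if cids really are content hashes.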

1: all cid values truncated for brevity.

@RobStallion please let me know if this answers your questions. 🤔
If not, please help identify the remaining confusion. thanks. 👍

@nelsonic added the question (Further information is requested) and starter labels on Dec 11, 2018
@RobStallion

@nelsonic Those are amazing thank you. Super super helpful. 👍

@nelsonic

@RobStallion do you want to convert these questions & answers into "FAQ.md" and create a PR? 😉

@RobStallion

@nelsonic Will do 👍

@RobStallion

The following lines, added to the readme in #16, answer my first question...

The reason we can abbreviate the URL to just gV is because our SHORT URL service has a centralised Database/store. If we wanted to run a decentralised content addressing system, we would simply link to the full cid: dwyl.co/gVSTedHFGBetxyYib9mBQsjtZj4dJjQe

@RobStallion

Closing as @nelsonic has answered my questions and they have been added to the readme.
