Import legacy git proposals into tstore. #1425

Open
lukebp opened this issue Jun 5, 2021 · 4 comments
Assignees
Labels
enhancement The issue enhances an existing feature.

Comments

@lukebp
Member

lukebp commented Jun 5, 2021

The proposals that were submitted to the git backend are currently hosted on proposals-archive.decred.org, while all new proposals are hosted on proposals.decred.org. This fragmentation makes older proposals harder to find.

The legacy proposals are currently hard coded into the gui and displayed in the list views, but this approach is suboptimal since the proposals are not included in politeia functionality such as searching for proposals by user ID. A better approach would be to import the git backend proposals directly into the tstore backend.

There are two main issues with importing the git backend proposals into the tstore backend.

  • It breaks the git backend timestamps. The git backend timestamps the git commit hash. You can obtain the timestamp data and the hash of a proposal file fairly easily, but there is no easy way of proving the file hash is included in the timestamp. In order to keep the timestamps coherent, you must take the git repo as a single entity that can't be pulled apart.
  • A git backend proposal is very different from a tstore backend proposal. The proposal markdown file is the same, but all of the metadata that accompanies the proposal is different due to the changes in the plugin architecture. You can't import the legacy metadata files into the new backend, which means providing the original hash and timestamp will be pointless since the original hash won't match what is imported into the backend.

Solution

Import the legacy git backend proposals using the format required by the tstore backend while also keeping the proposals-archive site up. There would be an additional LegacyToken field in the proposal metadata. When set, the gui will indicate that the proposal is a legacy proposal and you must go to the [proposals-archive link] if you want to see the proposal in its original form with valid timestamps.

This would solve the UX issues of legacy proposals not showing up on the proposals.decred.org site while also sidestepping the timestamp and incompatible format issues.

Implementation

  • Add a LegacyToken field to the proposal metadata.
type ProposalMetadata struct {
	Name string `json:"name"` // Proposal name

	// LegacyToken will only be populated if the proposal is a legacy
	// proposal that was submitted to the git backend.
	LegacyToken string `json:"legacytoken,omitempty"`
}
  • Write a tool that formats the git backend proposals into the tstore format and submits them to politeiad. Have all of the legacy proposals hardcoded into the tool.
  • Use the tool to import the legacy git backend proposals into the tstore backend.
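
As a rough illustration of how the proposed field would serialize, the sketch below encodes one normal and one legacy proposal. The struct is copied from the snippet above; the `encode` helper and the example values are hypothetical. Because of `omitempty`, the JSON for normal proposals is unchanged by the new field.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ProposalMetadata mirrors the struct proposed in this issue.
type ProposalMetadata struct {
	Name string `json:"name"` // Proposal name

	// LegacyToken will only be populated if the proposal is a legacy
	// proposal that was submitted to the git backend.
	LegacyToken string `json:"legacytoken,omitempty"`
}

// encode is a hypothetical helper that returns the JSON encoding of the
// metadata as a string.
func encode(pm ProposalMetadata) (string, error) {
	b, err := json.Marshal(pm)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	// Example values only; these are not real proposals or tokens.
	normal, err := encode(ProposalMetadata{Name: "A new proposal"})
	if err != nil {
		panic(err)
	}
	legacy, err := encode(ProposalMetadata{
		Name:        "An imported proposal",
		LegacyToken: "abc123",
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(normal) // the legacytoken key is omitted
	fmt.Println(legacy) // the legacytoken key is present
}
```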
@lukebp lukebp changed the title from "Import git backend proposals into the tstore backend." to "Import legacy git proposals into tstore." Jun 5, 2021
@lukebp lukebp added this to the v1.1.0 milestone Jun 6, 2021
@lukebp lukebp removed this from the v1.1.0 milestone Jul 21, 2021
@lukebp lukebp added the 91cfcc8 label Jul 21, 2021
@lukebp
Member Author

lukebp commented Jul 21, 2021

politeia changes

  • Add LegacyToken to ProposalMetadata.
  • Prevent LegacyToken from being filled in on normal proposal submissions.
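
The second bullet could be sketched roughly as follows. The function name `verifyNewSubmission` and the error value are hypothetical and are not taken from the politeia code base; the only point is that a normal submission with a non-empty LegacyToken gets rejected.

```go
package main

import (
	"errors"
	"fmt"
)

// ProposalMetadata mirrors the struct proposed in this issue.
type ProposalMetadata struct {
	Name        string `json:"name"`
	LegacyToken string `json:"legacytoken,omitempty"`
}

// errLegacyTokenSet is a hypothetical error for this sketch.
var errLegacyTokenSet = errors.New("legacy token cannot be set on new submissions")

// verifyNewSubmission is a hypothetical check that would run on normal
// proposal submissions. Only the import tool would be allowed to set the
// legacy token.
func verifyNewSubmission(pm ProposalMetadata) error {
	if pm.LegacyToken != "" {
		return errLegacyTokenSet
	}
	return nil
}

func main() {
	fmt.Println(verifyNewSubmission(ProposalMetadata{Name: "ok"}))
	fmt.Println(verifyNewSubmission(ProposalMetadata{
		Name:        "bad",
		LegacyToken: "abc123", // example value only
	}))
}
```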

politeia legacy import tool

Here are some additional details on what is likely the easiest way to accomplish this task.

  • Rather than hardcoding all of the proposal data, I think it would be easier to walk the mainnet git repo, then parse and convert the data. The politeiad/cmd/politeiaimport tool already has some of the code required to do this.
  • The tool should not use the politeiad or backend API. It should initialize a tstore instance directly and use the tstore API.
  • The legacy metadata streams will need to be converted to the data structures used by the current plugins (usermd, comments, ticketvote). These may not be a simple 1-to-1 conversion since the metadata structure might have changed in the tstore upgrade. We can deal with these issues as they arise.
  • The legacy comment and vote journals will need to be walked and parsed. The code to replay the journals can be found in the gitbe implementations that existed in the politeia repo prior to the v1.0.0 release.
  • The recordmetadata.json and the proposal files should be a simple 1-to-1 conversion. I don't think the record metadata structure changed in the tstore update.

This tool is only going to need to be used once. Once the legacy data has been migrated, the tool can be deleted from the politeia repo. Since this is a one-off tool, you can do things the quick and dirty way. The code should still be clean and readable, but things like hardcoding certain values or writing local structs to decode data into are fine.

@lukebp lukebp added the enhancement The issue enhances an existing feature. label Jul 28, 2021
@lukebp
Member Author

lukebp commented Oct 18, 2021

As expected, this has turned out to be quite a difficult and complex task.

One of the main issues that we're encountering is the fact that the record token will not be the same. The tlog backend derives the record token from the tlog tree ID. This tree ID is a random int64 that is set by tlog on tree creation. We do not have the ability to set custom tree IDs, which means that legacy proposals will be assigned new tokens when they're imported into the tlog backend.

This is problematic because the token is part of the message that clients sign when submitting data like comments and votes. This leaves us with a decision to make. We can either:

  1. Keep the token fields in the data set to the legacy tokens so that the signatures remain coherent, but at the cost of breaking various parts of the backend. The backend assumes that the record token and the tree ID reference the same underlying bits, just encoded differently (int64 for tree IDs, hex for record tokens). Using the legacy token in the token field of the data breaks this assumption, which will cause various parts of the politeia and politeiagui code to break. The scope of this problem is somewhat limited though since we only need to worry about code that retrieves data, not code that writes.

  2. Update the token fields of legacy data to match the token derived from the tlog tree ID in order to not break the backend code, but at the cost of breaking all of the client signatures. This is problematic since signature validation is a standard part of both the backend and client side code when retrieving data. If we went this route we would need to insert the legacy proposals into tlog, compile a list of the tlog tokens that correspond to legacy proposals, hardcode the list into both politeia and politeiagui, then update the code to skip signature validation for any tokens in this list.
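
The assumption described in option 1, that the token and the tree ID are the same bits in two encodings, can be sketched as a round trip. This is an illustration and not the backend's actual code: the function names and the little endian byte order are assumptions.

```go
package main

import (
	"encoding/binary"
	"encoding/hex"
	"fmt"
)

// tokenFromTreeID hex encodes the 8 bytes of a tree ID. The little endian
// byte order is an assumption of this sketch.
func tokenFromTreeID(treeID int64) string {
	b := make([]byte, 8)
	binary.LittleEndian.PutUint64(b, uint64(treeID))
	return hex.EncodeToString(b)
}

// treeIDFromToken reverses tokenFromTreeID. It fails for any token that
// does not decode to exactly 8 bytes.
func treeIDFromToken(token string) (int64, error) {
	b, err := hex.DecodeString(token)
	if err != nil {
		return 0, err
	}
	if len(b) != 8 {
		return 0, fmt.Errorf("token is %d bytes, want 8", len(b))
	}
	return int64(binary.LittleEndian.Uint64(b)), nil
}

func main() {
	// Example tree ID only; real tree IDs are random int64 values.
	token := tokenFromTreeID(1)
	fmt.Println(token)

	treeID, err := treeIDFromToken(token)
	fmt.Println(treeID, err)
}
```

Code that reads a token out of a comment or vote and converts it back to a tree ID this way would look up the wrong tree, or fail outright, when the field holds a legacy token. Those are the read paths that would need fixing under option 1.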

There are also instances where the data format, and thus the message being signed, changed between the git backend and tlog backend. In these cases, even if you use the legacy token the client signature will still be invalid because of the data format changes. The StartVote structure is one such instance.

We decided to go with option 1. There will be various bugs and edge cases that will need to be fixed, but since this is only for reads and not writes, the impact of such bugs will be minimal and can be fixed as they are found. If we went with option 2, hardcoding in everything required to skip the signature validation checks for these legacy proposals would be just as much of a headache, if not more. Unfortunately, we will still need to hardcode in the signature validation skips for the small number of invalid signatures that will still be present due to data format changes, like with the StartVote. There's not really much we can do to get around that for now.

@lukebp
Member Author

lukebp commented Oct 19, 2021

Another large challenge with this is the cast vote timestamps.

The git backend did not include the timestamp of when a vote was cast due to privacy concerns. The tlog backend does, since the timestamps are included in the tlog tree anyway and adding the timestamp to the cast vote struct makes it much easier for dcrdata to build their vote graphs.

In order to get the cast vote timestamp for the legacy votes, we'll need to pull the timestamp of the git commit from when the vote was added. These commits occurred every hour. This is how dcrdata built their vote graphs for the legacy proposal votes, so the code already exists to do this, but porting it over to this import tool and making sure it still works is another big pain point.
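
The idea above can be sketched without any git plumbing: given the commits that touched a ballot journal and the votes each one added, every vote gets the timestamp of the earliest commit that contains it. The `commit` type and `voteTimestamps` function are hypothetical simplifications and are not the dcrdata code; extracting this data from the repo is the hard part and is not shown.

```go
package main

import (
	"fmt"
	"sort"
)

// commit is a simplified, hypothetical view of one git commit: its Unix
// timestamp and the tickets of the votes found in it.
type commit struct {
	Timestamp int64
	Tickets   []string
}

// voteTimestamps assigns each ticket the timestamp of the earliest commit
// in which its vote appears.
func voteTimestamps(commits []commit) map[string]int64 {
	// Sort a copy so the caller's slice is left untouched.
	sorted := make([]commit, len(commits))
	copy(sorted, commits)
	sort.Slice(sorted, func(i, j int) bool {
		return sorted[i].Timestamp < sorted[j].Timestamp
	})

	ts := make(map[string]int64)
	for _, c := range sorted {
		for _, t := range c.Tickets {
			if _, ok := ts[t]; !ok {
				ts[t] = c.Timestamp
			}
		}
	}
	return ts
}

func main() {
	// Example data only; commits are one hour apart.
	commits := []commit{
		{Timestamp: 7200, Tickets: []string{"ticketB", "ticketC"}},
		{Timestamp: 3600, Tickets: []string{"ticketA"}},
	}
	for ticket, ts := range voteTimestamps(commits) {
		fmt.Println(ticket, ts)
	}
}
```

Since commits occurred every hour, timestamps recovered this way are only accurate to within roughly that interval.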

@thi4go
Member

thi4go commented Nov 1, 2021

For documentation's sake, we found a bug in the votes cache of the legacy www API, which makes the vote count returned by the API differ from the count obtained directly from the ballot journal. This will no longer be a concern once we complete the legacy import to tstore and deprecate the legacy www API.

@amass01 amass01 assigned amass01 and unassigned thi4go Jan 26, 2022
@lukebp lukebp removed the 91cfcc8 label Mar 15, 2022

3 participants