Better XML format [RFC] #279

Closed
diegogangl opened this issue Apr 11, 2020 · 21 comments · Fixed by #522
Labels
enhancement · maintainability (automated tests suite, tooling, refactoring, or anything that makes it easier for developers) · priority:critical · RFC ("Request for Comments" brainstorming tickets for things we are unsure about)

Comments

@diegogangl
Contributor

diegogangl commented Apr 11, 2020

We can (and should) improve the file format of the local backend. A more strict structure would help us move everything into the treemodel faster, and avoid many bugs.

Current format

For reference this is what a task currently looks like:

	<task id="1@1" status="Active" tags="" uuid="d1312d77-d42d-4403-b080-ca065dde166f">
		<title>Learn How To Use Subtasks</title>
		<addeddate>2020-04-10T20:48:11</addeddate>
		<modified>2020-04-10T20:37:02</modified>
		<subtask>6000caf7-6197-4d77-a50e-8bd8804c5694</subtask>
		<subtask>7ab5f54c-dc3c-4cae-a968-1361d7269eb6</subtask>
		<subtask>3749c090-0d84-48b5-ad85-941657c3cff9</subtask>
		<content>A &amp;quot;Subtask&amp;quot; is something that you need to do first before being able to accomplish your task. In GTG, the purpose of subtasks is to cut down a task in smaller subtasks that are easier to achieve and to track down.

To insert a subtask in the task description (this window, for instance), begin a line with &amp;quot;-&amp;quot;, then write the subtask title and press Enter.

Try inserting one subtask below. Type &amp;quot;- This is my first subtask!&amp;quot;, for instance, and press Enter:

→   One subtask
→   Another subtask
→   Yet another one



Alternatively, you can also use the &amp;quot;Insert Subtask&amp;quot; button.

Note that subtasks obey to some rules: first, a subtask's due date can never happen after its parent's due date and, second, when you mark a parent task as done, its subtasks will also be marked as done.

And if you are not happy with your current tasks/subtasks organization, you can always change it by drag-and-dropping tasks on each other in the tasks list.
</content>
		<task-remote-ids/>
	</task>

Here's another example with the "alternative IDs":

	<task id="0@1" status="Active" tags="" uuid="1990638b-4f95-4645-af3a-df1b27b35c10">
		<title>Getting Started With GTG</title>
		<addeddate>2020-04-10T20:48:21</addeddate>
		<modified>2020-04-10T20:48:11</modified>
		<subtask>1@1</subtask>
		<subtask>2@1</subtask>
		<subtask>3@1</subtask>
		<subtask>4@1</subtask>
		<subtask>5@1</subtask>
		<subtask>6@1</subtask>
		<subtask>7@1</subtask>
		<subtask>8@1</subtask>
		<content>
If you are interested in knowing more about GTG's other features, you will find more information here:
&lt;subtask&gt;2@1&lt;/subtask&gt;
&lt;subtask&gt;3@1&lt;/subtask&gt;
&lt;subtask&gt;4@1&lt;/subtask&gt;
&lt;subtask&gt;5@1&lt;/subtask&gt;
&lt;subtask&gt;6@1&lt;/subtask&gt;
&lt;subtask&gt;7@1&lt;/subtask&gt;
&lt;subtask&gt;8@1&lt;/subtask&gt;

The GTG team.</content>
		<task-remote-ids/>
	</task>

There are several problems here:

  • Tasks have two IDs and it's not very clear when one or the other is used
  • There are two ways to refer to a subtask, using one of the large IDs or using a textual representation
  • The content of the content tag is escaped xml, like a matryoshka doll
  • Tags are stored by name in a stringified list
  • Subtasks are referenced by ID twice: once in their own tag, and then again in the content
  • The file itself has no metadata about itself (like a version number or something)
  • Tags are stored in a separate file

There's a projects.xml file which has some metadata and connects the tags XML with the tasks file. Apparently the previous team had envisioned something like a projects system, where tasks were contained in a project, each project having its own backend and associated tags file.

Looks like this was never completed though 🤔

Proposed

  • Put everything in one file. It can be called gtg_data.xml to avoid clashing with the old files.
  • Use a single ID
  • Use UUID4 for IDs
  • Always use ISO 8601 for dates (except for fuzzy dates)
  • Include a "header" section with metadata about the file
  • Remove task-remote-ids. It's not being used at all, and some guy left a comment in the code saying he doesn't think we need them!

Tags

  • All tags are stored as long as they meet one of these conditions:
    1. They have at least one task
    2. They have some kind of customization (color/icon/etc)
  • Tasks refer to tags by their UUID
  • Tags in the content aren't marked (the text editor can find them easily)

This is what that task would look like:

<gtgData appVersion="0.5" xmlVersion="2">
	<taglist>
		<tag id="7171ff82-119a-4933-8277-a8ef5ce6a3e2" color="E9B96E" name="GTG"/>
		<tag id="140f74ea-b2f1-4b0f-b72b-0e85f471bb98" color="cdd3854e56d8" icon="emblem-shared-symbolic.symbolic" name="life"/>
		<tag id="94669f60-2f8e-4b16-b87f-c1d46ade4536" color="c96a52131cd2" name="errands"/>
		<tag id="46890bc2-c924-4146-8279-472099abc0b1" color="c96a52131cd2" name="other_errands"/>
		<tag id="aeb6e795-cb65-4d89-bf80-c7ea524fcfa7" color="c96a52131cd2" name="home_renovation"/>
	</taglist>

	<tasklist>
		<task id="2fdcd50f-0106-48b2-9f16-db2f8dbbf044" status="Active">
			<title>Learn How To Use Subtasks</title>
			<tags>
				<tag>7171ff82-119a-4933-8277-a8ef5ce6a3e2</tag>
				<tag>46890bc2-c924-4146-8279-472099abc0b1</tag>
				<tag>94669f60-2f8e-4b16-b87f-c1d46ade4536</tag>
			</tags>
			<dates>
				<addedDate>2020-04-10T20:48:11</addedDate>
				<modifyDate>2020-04-10T20:37:02</modifyDate>
				<startDate>2020-05-10T00:00:00</startDate>
			</dates>

			<content>
                            <p>@GTG, @errands, @home_renovation 
                             A &amp;quot;Subtask&amp;quot; is something that you need to do first before being able to accomplish your task. In GTG, the purpose of subtasks is to cut down a task in smaller subtasks that are easier to achieve and to track down.

			To insert a subtask in the task description (this window, for instance), begin a line with &amp;quot;-&amp;quot;, then write the subtask title and press Enter.

			Try inserting one subtask below. Type &amp;quot;- This is my first subtask!&amp;quot;, for instance, and press Enter:</p>

			<sub>bf33b248-ab96-4b99-9e40-8b60c1d7fe2e</sub>
			<sub>a957c32a-6293-46f7-a305-1caccdfbe34c</sub>
			<sub>98b683e0-1efa-4d8d-b3f9-8bcf954942d6</sub>

			<p>Alternatively, you can also use the &amp;quot;Insert Subtask&amp;quot; button.

			Note that subtasks obey to some rules: first, a subtask's due date can never happen after its parent's due date and, second, when you mark a parent task as done, its subtasks will also be marked as done.

			And if you are not happy with your current tasks/subtasks organization, you can always change it by drag-and-dropping tasks on each other in the tasks list.</p>
			</content>
		</task>

		<task id="bf33b248-ab96-4b99-9e40-8b60c1d7fe2e" status="Done">
			<title>One subtask</title>
			<content><p>This is some test subtask with a @tag </p></content>
		</task>

		<task id="a957c32a-6293-46f7-a305-1caccdfbe34c" status="Active">
			<title>Another subtask</title>
			<dates>
				<addedDate>2020-04-10T20:48:11</addedDate>
				<fuzzyDueDate>someday</fuzzyDueDate>
			</dates>
			<content />
		</task>

		<!-- You get the point... -->
	</tasklist>
</gtgData>
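
A minimal sketch of how a reader could resolve the tag references in the proposed layout. It uses the standard library's xml.etree.ElementTree (lxml.etree exposes a largely compatible API), a trimmed-down copy of the sample above, and a hypothetical helper name `load_tasks`; it is an illustration, not GTG's actual loader.

```python
import xml.etree.ElementTree as ET

# Trimmed-down version of the proposed format (illustrative only).
SAMPLE = """\
<gtgData appVersion="0.5" xmlVersion="2">
  <taglist>
    <tag id="7171ff82-119a-4933-8277-a8ef5ce6a3e2" color="E9B96E" name="GTG"/>
    <tag id="94669f60-2f8e-4b16-b87f-c1d46ade4536" color="c96a52131cd2" name="errands"/>
  </taglist>
  <tasklist>
    <task id="2fdcd50f-0106-48b2-9f16-db2f8dbbf044" status="Active">
      <title>Learn How To Use Subtasks</title>
      <tags>
        <tag>7171ff82-119a-4933-8277-a8ef5ce6a3e2</tag>
        <tag>94669f60-2f8e-4b16-b87f-c1d46ade4536</tag>
      </tags>
    </task>
  </tasklist>
</gtgData>
"""


def load_tasks(xml_text):
    """Hypothetical helper: return (xmlVersion, {task id: task info})."""
    root = ET.fromstring(xml_text)
    # Map tag UUID -> tag name from the <taglist> section.
    tag_names = {t.get("id"): t.get("name") for t in root.find("taglist")}
    tasks = {}
    for task in root.find("tasklist").iter("task"):
        refs = [t.text.strip() for t in task.findall("tags/tag")]
        tasks[task.get("id")] = {
            "title": task.findtext("title"),
            "status": task.get("status"),
            "tags": [tag_names[ref] for ref in refs],  # UUID -> name
        }
    return root.get("xmlVersion"), tasks


version, tasks = load_tasks(SAMPLE)
```

Because tasks only store tag UUIDs, renaming a tag means touching a single `<tag>` entry in `<taglist>`.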

Versioning

We should always keep support for n-1 versions. This could go into a versioning module. Since the filenames differ, we can try to read gtg_data.xml first; if it's not there, we can try to detect projects.xml and go into the versioning code.
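
The lookup order described here could be sketched roughly as follows. The function name `locate_data`, the returned action strings and the `CURRENT_XML_VERSION` constant are assumptions made up for this illustration; only the two filenames and the n-1 rule come from the proposal.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path

CURRENT_XML_VERSION = 2  # assumed value, matching xmlVersion="2" in the sample


def locate_data(data_dir):
    """Hypothetical helper: return (path, action) for a data directory."""
    data_dir = Path(data_dir)
    new_file = data_dir / "gtg_data.xml"
    if new_file.exists():
        found = int(ET.parse(new_file).getroot().get("xmlVersion", "0"))
        if found == CURRENT_XML_VERSION:
            return new_file, "load"
        if found == CURRENT_XML_VERSION - 1:
            return new_file, "upgrade"  # the n-1 version stays supported
        return new_file, "unsupported"
    old_file = data_dir / "projects.xml"
    if old_file.exists():
        return old_file, "convert-legacy"  # hand over to the versioning code
    return new_file, "create"


# Small demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    empty_action = locate_data(tmp)[1]
    Path(tmp, "projects.xml").write_text("<config/>")
    legacy_action = locate_data(tmp)[1]
    Path(tmp, "gtg_data.xml").write_text('<gtgData appVersion="0.5" xmlVersion="2"/>')
    current_action = locate_data(tmp)[1]
```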


Feedback much appreciated!

@nekohayo
Member

I don't know if this is up-to-date or relevant, but just in case you hadn't seen it, I found this in the wiki today: https://wiki.gnome.org/Apps/GTG/DataModel

@nekohayo
Member

So, my gut feeling (and I'm probably not the best person to comment on file format design; I believe @broussea, @ploum and @izidormatusov would be much more qualified than me to comment) is that your observations generally make sense but I have some reservations:

  • I doubt just using integers as task IDs is a good idea. The hash approach seemed like the only/best way to avoid conflicts; but then there shouldn't be integer task IDs at all in competition with that. Besides, at some point integer numbers become kinda meaningless even to the human eye I suspect; there's a reason why Git completely dismissed the widespread notion of integer revision numbers (in other VCSes) in favor of hashes, for example...
  • There ought to be a file format version numbering scheme if it's going to change at all in the future, to handle upgrades etc. Handling a change to the file format sounds like a risky ordeal, so unless it's bringing a ton of benefits and it's covered by a ton of tests I'd be a bit wary of it.
  • I'm not sure having the list of tags in the same file as the main big tasks file is a good idea, particularly if we write to the tags often, as this would mean traversing the whole tree of tasks to do so (though maybe we already do?)...
  • I suspect the projects.xml was probably made to allow for the notion of projects as data "profiles", like issue #215 ("multiple profiles/sessions"); personally I'm more interested in automatic projects generated from parent tasks, as in issue #245 ("Automatic projects (aka micro-projects) listing in the sidebar (or as a view mode)"), but leaving the projects file there (maybe it ought to be renamed profiles) doesn't really hurt and leaves room for implementation...

Anyhow, those are just my uneducated guesses.

@leio
Member

leio commented Apr 25, 2020

I think UUID instead of ID makes sense here; it would avoid any practical clash possibilities, making things like merging two different GTG XML files together much more straightforward. It can also act as the primary key in any potential hypothetical DB-backed storage backends (instead of ID). Internally it can be handled as a proper uuid.UUID (128-bit number) too, not inefficient strings.

I'd be careful about merging everything into a single big file, if they aren't yet. Megabyte sized XML file for "hardcore" users, especially if "completed" tasks are kept in there as well, doesn't sound like something that'd be very trouble-free either, or performant for a simple edit. Maybe a big file that gets logically split once big, but with everything in the background merged together seamlessly, but on edit only saving the files where the element that changed is? Then again, I'm not really sure of XML (lxml) performance here.

Not knowing much about the context and bigger picture, those were my initial thoughts.

@diegogangl
Contributor Author

There ought to be a file format version numbering scheme if it's going to change at all in the future, to handle upgrades etc. Handling a change to the file format sounds like a risky ordeal, so unless it's bringing a ton of benefits and it's covered by a ton of tests I'd be a bit wary of it.

Indeed, this is covered in the header of the proposed file. It would store both GTG's version and the version of the XML format.

<gtg-data app-version="0.5" xml-version="2">

Benefits (off the top of my head):

  • Make the data representation truly hierarchical, so when we read it we can generate the tree more easily (and faster).
  • Avoid the matryoshka effect, which is the root cause of the "node invalid" bug and probably others
  • The new format has a version in the header that we can use to do upgrades and changes
  • The new content is a little closer to what the textview tags look like, so it should also be easier and faster to load a task

I'm not sure having the list of tags in the same file as the main big tasks file is a good idea, particularly if we write to the tags often, as this would mean traversing the whole tree of tasks to do so (though maybe we already do?)...

That's interesting, though the tags are tightly related to the tasks. Most of the time if you are writing a tag, you are also doing something to a task. We would have to check this TBH.

I suspect the projects.xml was probably made to allow for the notion of projects as data "profiles" like issue #215 ; personally I'm more interested by automatic projects generated from parent tasks in issue #245 but leaving the projects file there (maybe it ought to be renamed profiles) doesn't really hurt and leaves room for implementation...

Having everything in one file makes profiles even easier: just load a different file. Unless you want to share tags across profiles, though I don't know how useful that would be. We could even have a command line parameter to pass GTG a path to any XML file, and be able to load arbitrary profiles.

I'd be careful about merging everything into a single big file, if they aren't yet. Megabyte sized XML file for "hardcore" users, especially if "completed" tasks are kept in there as well, doesn't sound like something that'd be very trouble-free either, or performant for a simple edit. Maybe a big file that gets logically split once big, but with everything in the background merged together seamlessly, but on edit only saving the files where the element that changed is? Then again, I'm not really sure of XML (lxml) performance here.

I have 367 active tasks, with another 236 done. I often paste a lot of text and urls into tasks (some are straight up notes lol). My numbers are:


▶ ls -l
.rw-r--r--@ 283k januz 25 Apr 18:43 gtg_tasks.xml
.rw-r--r--@  235 januz 25 Apr 18:43 projects.xml
.rw-r--r--@ 2.2k januz 25 Apr 18:43 tags.xml

So we are looking at 285kb in total. I would need about 4 times more tasks to reach 1MB. @nekohayo you are the ultimate GTG warrior, how big are your files?

As for lxml, the website has benchmarks:
"[..] a 3.4MB XML file containing the Old Testament [...]"

lxml.etree.parse done in 0.016 seconds
lxml.etree.XMLParser.feed(): 25317 nodes read in 0.022 seconds

Parsing times shouldn't be a problem unless you are actually more busy than god, though building the Treemodel could take a while. But that would be the same with the current format.

You do have a point with old closed tasks. Maybe we can detect if auto-purge is disabled and move closed tasks to a separate file. Though maybe the end result would be the same. We need to load everything, so whether it's in one file or two it's going to take a while.

I doubt just using integers as task IDs is a good idea. The hash approach seemed like the only/best way to avoid conflicts; but then there shouldn't be integer task IDs at all in competition with that. Besides, at some point integer numbers become kinda meaningless even to the human eye I suspect; there's a reason why Git completely dismissed the widespread notion of integer revision numbers (in other VCSes) in favor of hashes, for example...

I think UUID instead of ID makes sense here; it would avoid any practical clash possibilities, making things like merging two different GTG XML files together much more straightforward. It can also act as the primary key in any potential hypothetical DB-backed storage backends (instead of ID). Internally it can be handled as a proper uuid.UUID (128-bit number) too, not inefficient strings.

Good points guys, I hadn't thought about conflicts and storing it as a uuid. I'll update the proposal

@nekohayo
Member

For what it's worth, the biggest filesize I've had for my tasks xml file has been 930 kB (recently I stopped pruning closed/done tasks for about 6 months for some particular reason), though it would be infinitely bigger if I hadn't used the task reaper plugin (now part of core) for all these years. That said, if lxml is as fast as it sounds, the performance problem will be negligible.

I agree that having the closed tasks be a separate XML probably doesn't change much. Though, now that I think of it, it probably could allow some mega-optimization hack when the "closed tasks remover" feature is called (simpler search domain), in theory... but it might not be needed, as that kind of optimization might be dwarfed by the performance gains of lxml.

Again... I don't think I'm the right person to have an opinion on the "proper" way to structure data within the XML format ;)

@leio
Member

leio commented Apr 26, 2020

I don't worry about the performance of it too much, as long as you don't end up with 5+ MB files. I'd think more about having to write a 1MB-10MB file every time one little change is made, and having those writes queued up constantly during active GTG use.
However with UUIDs, we can think about splitting it a bit again (e.g. subtasks hierarchy always in same file, but completely disconnected tasks in other files, etc) and such probably in the future as well, if it turns out beneficial.

@izidormatusov
Contributor

Put everything in one file

GTG had big plans to support many backends. You could theoretically keep your @home tasks in a separate XML file in a Dropbox folder while @work would stay on the disk only. Tags were always stored separately. The projects file specified extra backends (e.g. configuration for RememberTheMilk, fetching information from bug trackers, etc.)

Use a single ID

AFAIK only the template start tags are using this. These style IDs were used in the past and got replaced by UUIDs. They have not been removed completely.

Use simple ints for IDs: 1,2,3,4 (we should handle ID recycling though)

go for uuid

Always use ISO 8601 for dates

There are "fuzzy" dates: now, soon, someday. People are using these quite a bit

Include a "header" section with metadata about the file

Sounds good.

Nest subtasks inside their parent tasks. No more matryoshkas!

There is a reason for the nested XML. GTG supports basic formatting like <b>, <i>, links. You can include the link to subtasks in your text. The idea is:

 My important tasks
  - <subtask>1</subtask>
  - <subtask>2</subtask>

 Nice to do:
  - <subtask>3</subtask>
  - <subtask>4</subtask>

That said, the situation can be improved a lot. You can have XML nested inside of another XML instead of storing a serialized version. There are many bugs where the subtasks are not deserialized properly and tasks end up with garbage like &lt;subtask&gt;2@1&lt;/subtask&gt; plus a new tag.

GTG supports tasks represented in Directed Acyclic Graphs (aka there can be two parents per task). To be honest, this adds a lot of complexity and is not very well supported in UI. If you put subtasks under the main task, you remove this ability.

Subtasks can have their parents changed, which would mean more complexity in the code that serialises the tasks.

Remove task-remote-ids. It's not being used at all, and some guy left a comment in the code saying he doesn't think we need them!

Sounds good.

Use tags inside to separate text and GtkTextTags (tags/subtasks) inside content

+1 for having proper XML inside of content.

Refer to tags by ID
This might make it easier though if there is any kind of synchronization to an external service, it would become less understandable (<tag id="2"> vs @foobar). However referring to tags by their id would simplify renaming tags.

@diegogangl
Contributor Author

GTG had big plans to support many backends. You could theoretically keep your @home tasks in a separate XML file in a Dropbox folder while @work would stay on the disk only. Tags were always stored separately. The projects file specified extra backends (e.g. configuration for RememberTheMilk, fetching information from bug trackers, etc.)

Sounds like some of these things would be better handled at a backend level 🤔
For instance keeping their configuration should be part of gtg config. We could also support more
than one xml backend at a time too. If tags and tasks have UUIDs the contents of both files could be merged as @leio mentioned.

There are "fuzzy" dates: now, soon, someday. People are using these quite a bit

True, I forgot to mention that

There is a reason for the nested XML. GTG supports basic formatting like <b>, <i>, links. You can include the link to subtasks in your text. That said, the situation can be improved a lot. You can have XML nested inside of another XML instead of storing a serialized version. There are many bugs where the subtasks are not deserialized properly and tasks end up with garbage like &lt;subtask&gt;2@1&lt;/subtask&gt; plus a new tag.

Right, this is what I wanted to do with the content tag. Separate the text from tags to make it easier to parse. Though it seems like mixing text and tags is legal and supported by lxml. Still on the fence on that, since it would make the file simpler but processing more complicated.
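
To make the trade-off concrete, here is a small sketch of reading mixed content (text interleaved with elements). In the ElementTree model, which lxml shares, text before the first child lives in `.text` and text following a child lives in that child's `.tail`. The `flatten` helper and the placeholder IDs `id-1`/`id-2` are made up for the example, and the standard library is used instead of lxml.

```python
import xml.etree.ElementTree as ET

# Text mixed with <sub> references and basic formatting (illustrative sample).
CONTENT = (
    "<content>My important tasks: <sub>id-1</sub> and <sub>id-2</sub>."
    " Some <b>bold</b> text.</content>"
)


def flatten(elem):
    """Hypothetical helper: yield (kind, value) pairs in document order."""
    if elem.text:
        yield ("text", elem.text)  # text before the first child
    for child in elem:
        if child.tag == "sub":
            yield ("subtask", child.text)
        else:
            # Formatting tags; nested children are ignored for brevity.
            yield (child.tag, child.text or "")
        if child.tail:
            yield ("text", child.tail)  # text following this child


parts = list(flatten(ET.fromstring(CONTENT)))
```

The file stays simple, but the reader has to track `.text` and `.tail` to keep the pieces in order, which is the extra processing cost mentioned above.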

GTG supports tasks represented in Directed Acyclic Graphs (aka there can be two parents per task). To be honest, this adds a lot of complexity and is not very well supported in UI. If you put subtasks under the main task, you remove this ability.

We talked about this with @ploum recently. I'm leaning towards removing this from GTG. There's only a couple of UI functions for this that don't currently work, and very few use cases for the amount of complexity. For the use case he mentioned (tasks being blocked by more than one task), I think it would be easier to have some kind of internal linking between tasks. Then you can write something like Blocked by <link tid=[UUID]>this task</link> and have a fancy link. A plugin could pick that up and disable the "done" button until the other task is done.

Thanks for weighing in!

@johnnybubonic

johnnybubonic commented Jul 16, 2020

@diegogangl asked me for input because I'm an XML nerd and I have some suggestions of my own in #431 (sidenote, a schema would let you validate with xmllint as well as any XSLT/XML 1.0 parser that supports validation, not just LXML). Here goes!

id/uuid

I think UUID instead of ID makes sense here; it would avoid any practical clash possibilities, making things like merging two different GTG XML files together much more straightforward. It can also act as the primary key in any potential hypothetical DB-backed storage backends (instead of ID). Internally it can be handled as a proper uuid.UUID (128-bit number) too, not inefficient strings.
(@leio)

go for uuid
(@izidormatusov)

I'd also strongly, strongly recommend using UUID4 (uuid.uuid4() to generate, or uuid.UUID(hex = 'uuid-string-here', version = 4) to nativize it); I'm in agreement. If users want something human-readable (i.e. a title for the task), an additional "name" attribute for tasks could be added with a collapsed whitespace string type.

It may be worth considering to use UUID5 instead, with the namespace being the "project" name if that's a feature that GTG decides going forward is something to keep (i.e. uuid.uuid5(proj_name, task_content) to generate). But it's not necessary; UUID4 should be fine (and requires less heavy parsing). SHA1's pretty much broken anyways, so while extremely unlikely, I'd be concerned about collisions with UUID5. But it does have the benefit of operating on the content itself. Of course, if the content text changes, well... there goes your UUID.
I guess all that to say "yeah, use UUID4".
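
A short sketch of generating and checking UUID4 values with the standard uuid module. One caveat worth knowing: passing `version=4` to `uuid.UUID()` does not validate anything, it overwrites the version and variant bits of whatever was passed in, so a validity check should parse without `version` and then inspect `.version`. The helper names `new_task_id` and `parse_uuid4` are made up for this example.

```python
import uuid


def new_task_id():
    """Hypothetical helper: a fresh random ID as it would appear in the XML."""
    return str(uuid.uuid4())


def parse_uuid4(text):
    """Hypothetical helper: parse text and insist it really is a version 4 UUID."""
    value = uuid.UUID(text)  # raises ValueError on malformed input
    if value.version != 4:
        raise ValueError(f"not a UUID4: {text}")
    return value


# An ID taken from the thread parses fine.
sample = parse_uuid4("d1312d77-d42d-4403-b080-ca065dde166f")

# version=4 forces the bits instead of checking them.
forced = uuid.UUID("00000000-0000-0000-0000-000000000000", version=4)
```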

Fully agreed on one and only one id attribute as well - to be clear, get rid of uuid attr and make id a UUID4.

The only difficulty is we can't specify the id attribute as an xs:ID in the schema. xs:ID would make things really easy, because validation AUTOMATICALLY fails if more than one element, anywhere in the document, shares the same attribute with the same value. We can get around this with an xs:unique constraint, but it's limited in scope.

Worth noting that I happen by luck to already have a validation for use in schemas for UUID4, so that'd work fine.
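
For context on the xs:ID limitation: xs:ID values must be valid NCNames, which cannot begin with a digit, while a UUID may. If the schema cannot enforce document-wide uniqueness, a loader-side check is one possible fallback. This is a sketch using the standard library; `duplicate_ids` and the short sample IDs are made up for the example.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# A tag and a task accidentally sharing one id (illustrative sample).
SAMPLE = """\
<gtgData>
  <taglist>
    <tag id="aaa">GTG</tag>
    <tag id="bbb">life</tag>
  </taglist>
  <tasklist>
    <task id="aaa" status="Active"/>
    <task id="ccc" status="Done"/>
  </tasklist>
</gtgData>
"""


def duplicate_ids(root):
    """Hypothetical helper: ids used by more than one element, any tag name."""
    counts = Counter(
        el.get("id") for el in root.iter() if el.get("id") is not None
    )
    return sorted(value for value, n in counts.items() if n > 1)


clashes = duplicate_ids(ET.fromstring(SAMPLE))
```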

Unified or split XML files?

Why not both? Now that you're switching to LXML (#401), you can use XInclude right from LXML. Let the engine reassemble the files for you when you parse. Split them into however many you want. It's a single function call, dirt easy.
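
A rough sketch of the split-file idea. With lxml the single call would be `tree.xinclude()`; to keep this example dependency-free it uses the standard library's `xml.etree.ElementInclude` with a small custom loader instead. The file names `tags.xml` and `tasks.xml`, the placeholder IDs and the `load_split` helper are assumptions for the illustration.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path
from xml.etree import ElementInclude

# Main file that only stitches the other files together.
MAIN = """\
<gtgData xmlns:xi="http://www.w3.org/2001/XInclude" appVersion="0.5" xmlVersion="2">
  <xi:include href="tags.xml"/>
  <xi:include href="tasks.xml"/>
</gtgData>
"""


def load_split(main_path):
    """Hypothetical helper: parse main_path and expand its xi:include elements."""
    main_path = Path(main_path)

    def loader(href, parse, encoding=None):
        # Resolve hrefs relative to the main file's directory.
        target = main_path.parent / href
        if parse == "xml":
            return ET.parse(target).getroot()
        return target.read_text(encoding=encoding or "utf-8")

    root = ET.parse(main_path).getroot()
    ElementInclude.include(root, loader=loader)
    return root


with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "gtg_data.xml").write_text(MAIN)
    Path(tmp, "tags.xml").write_text('<taglist><tag id="t1" name="GTG"/></taglist>')
    Path(tmp, "tasks.xml").write_text('<tasklist><task id="k1" status="Active"/></tasklist>')
    merged = load_split(Path(tmp, "gtg_data.xml"))
    child_tags = [child.tag for child in merged]
```

After expansion the tree looks like the single-file layout, so the rest of the loading code would not need to know about the split.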

I'd avoid hyphens in tags, though. Instead of gtg-data, use gtgData or just gtg or data.

ISO 8601 for dates/times

The XML itself should always include static dates, ideally in UTC/"Zulu" time, for data portability and validation purposes. I think it'd be okay (if there's a reasonable way to parse it) to let the user define a specific date/time in relative terms, but then convert it into a static date/time (and, thereafter, display as a static date/time). ISO 8601 is a good choice, since it has a native XML schema definition that can accept multiple formats. If you plan on implementing "expected durations of time", there's a type for that too (or it can be specified right in the timestamp). I do already have a type for accepting either an ISO 8601 or UNIX Epoch, though, too.

My recommendation is to always use the 2020-07-16T02:08:00Z variant (with a separate expectedLength attribute of type xs:duration, if that feature is added). This is quite easily nativized to a Python datetime.datetime object via a .strptime() call with format %Y-%m-%dT%H:%M:%SZ, or written to an XML value via e.g. datetime.datetime.utcnow().strftime('%Y-%m-%dT%H:%M:%SZ'). Likewise, an xs:duration turned out to be much easier to parse into a datetime.timedelta object than I thought: if all values are given (i.e. 5 minutes = P0Y0M0DT0H5M0S instead of just PT5M), it still validates perfectly well against an xs:duration type, and parsing is simple as long as you get the regex right and decide on static values for certain time units instead of relative ones.
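
A sketch of both conversions in Python. The timestamp format string comes from the comment above; the helper names, the timezone-aware handling, and the fixed values of 365 days per year and 30 days per month are assumptions made for the illustration (the "static values" decision has to be made somewhere).

```python
import re
from datetime import datetime, timedelta, timezone

STAMP = "%Y-%m-%dT%H:%M:%SZ"


def to_xml(dt):
    """Format an aware datetime as a UTC ("Zulu") timestamp."""
    return dt.astimezone(timezone.utc).strftime(STAMP)


def from_xml(text):
    """Parse a Zulu timestamp into an aware UTC datetime."""
    return datetime.strptime(text, STAMP).replace(tzinfo=timezone.utc)


# Only the fully spelled-out form is accepted, e.g. P0Y0M0DT0H5M0S.
_DURATION = re.compile(
    r"^P(?P<y>\d+)Y(?P<mo>\d+)M(?P<d>\d+)DT(?P<h>\d+)H(?P<mi>\d+)M(?P<s>\d+)S$"
)


def parse_full_duration(text):
    """Hypothetical helper: full-form xs:duration -> timedelta."""
    match = _DURATION.match(text)
    if match is None:
        raise ValueError(f"not a fully specified duration: {text}")
    part = {key: int(value) for key, value in match.groupdict().items()}
    # Assumed static values: 1 year = 365 days, 1 month = 30 days.
    days = part["y"] * 365 + part["mo"] * 30 + part["d"]
    return timedelta(
        days=days, hours=part["h"], minutes=part["mi"], seconds=part["s"]
    )


stamp = from_xml("2020-07-16T02:08:00Z")
five_minutes = parse_full_duration("P0Y0M0DT0H5M0S")
```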

Version info in root element attributes

YES. A big yes. You're going to have to probably break backwards compat for this first release of new data since previous versions won't even have the data version attribute, but that should be fine because SO much of this is going to change that it wouldn't be worth keeping code around to parse previous versions. A converter should exist, but you probably don't want to keep conditionals around for that old code in the core.

That said, again - I'd avoid hyphens in tags.

Instead of...    Use instead...
app-version      appVersion (or appVer, etc.)
xml-version      xmlVer (or dataVer, formatVer/fmtVer, etc.)

Nest subtasks inside their parent tasks

You could, but you lose some uniqueness checking. Instead, I'd recommend keeping subtasks as actual tasks and making a <subtasks> container element inside a <task> element, and then containing each subtask's <task> id attr within them. That's typically how it'd be done in XML, and lets you have subtasks assigned/referenced to multiple parent tasks down the line as well.

i.e.

<!-- ... -->
<task ...>
  <!-- ... -->
  <subtasks>
    <sub>SUB-ID-ATTR-HERE</sub>
    <sub>ANOTHER-HERE</sub>
  </subtasks>
</task>
<!-- ... -->
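
A sketch of how a loader could turn these flat `<sub>` references into parent/child links, including a subtask referenced by two parents. The element names follow the snippet above; the placeholder IDs and the `build_links` helper are made up for the example.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Flat task list; child-2 is referenced by two parents (illustrative sample).
SAMPLE = """\
<tasklist>
  <task id="parent-a">
    <subtasks><sub>child-1</sub><sub>child-2</sub></subtasks>
  </task>
  <task id="parent-b">
    <subtasks><sub>child-2</sub></subtasks>
  </task>
  <task id="child-1"/>
  <task id="child-2"/>
</tasklist>
"""


def build_links(tasklist):
    """Hypothetical helper: return (children by task id, parents by task id)."""
    children = {}
    parents = defaultdict(list)
    for task in tasklist.findall("task"):
        subs = [sub.text.strip() for sub in task.findall("subtasks/sub")]
        children[task.get("id")] = subs
        for sub_id in subs:
            parents[sub_id].append(task.get("id"))
    return children, dict(parents)


children, parents = build_links(ET.fromstring(SAMPLE))
```

Since every task stays a direct child of the task list, reparenting a subtask only edits `<sub>` entries rather than moving whole elements around.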

Remove task-remote-ids

I still have no idea what they're used for. Or were intended to be used for. They COULD be useful in tandem with #432 (in that "this task should be local-only/don't merge on server") or the like, but we should instead make that a task attribute of some sort and with a more clear name (localOnly with type xs:boolean).

Personal/Additional suggestions

That was all I saw in @diegogangl's original proposal that I had to comment on. I have some of my own suggestions in #431 (and ones I thought of just now) that I'd like to offer, though.

Strict XML naming conventions

I mentioned the tag names above, but here's some more.

  • addeddate should be addedDate per strict XML naming conventions
  • donedate should be doneDate per strict XML naming conventions
  • modified should match the above, i.e. modifyDate

(I'm probably missing some. You get the picture.)

Tags

They absolutely should be individual elements, not a list attribute. @diegogangl's proposed format is fine, but here's an alternate one.

<!-- ... -->
	<taglist>
		<tag id="7171ff82-119a-4933-8277-a8ef5ce6a3e2" color="e9b96e">GTG</tag>
		<tag id="140f74ea-b2f1-4b0f-b72b-0e85f471bb98" color="cdd3854e56d8" icon="emblem-shared-symbolic.symbolic">life</tag>
		<tag id="94669f60-2f8e-4b16-b87f-c1d46ade4536" color="c96a52131cd2">errands</tag>
		<tag id="46890bc2-c924-4146-8279-472099abc0b1" color="c96a52131cd2">other_errands</tag>
		<tag id="aeb6e795-cb65-4d89-bf80-c7ea524fcfa7" color="c96a52131cd2">home_renovation</tag>
	</taglist>
<!-- ... -->
<tasklist>
		<task id="2fdcd50f-0106-48b2-9f16-db2f8dbbf044" status="Active">
			<!-- ... -->
			<tags>
				<tag>7171ff82-119a-4933-8277-a8ef5ce6a3e2</tag>
				<tag>46890bc2-c924-4146-8279-472099abc0b1</tag>
				<tag>94669f60-2f8e-4b16-b87f-c1d46ade4536</tag>
			</tags>
                <!-- ... -->
<!-- ... -->

Both his model and mine can be validated fine. But take note of some changes I suggest specifically:

  • Tag names don't really need to start with @ outside of the user interaction part. They're already their own separate objects, they already have their own containers and separate definitions.
  • The color attribute? Doesn't need a #; there's a native XML datatype for hex values: hexBinary (which will not accept a #).

Booleans

All boolean values are currently using True/False in the XML. XML has a native datatype for that as well, boolean. It accepts either true/1 or false/0. I have a function that nativizes these pretty well.
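
A sketch of what such a conversion function might look like. The lexical values accepted by xs:boolean are exactly true, false, 1 and 0; the function names and the separate legacy path for the current True/False spelling are assumptions for this example.

```python
def parse_xs_boolean(text):
    """Convert an xs:boolean lexical value to a Python bool."""
    value = text.strip()
    if value in ("true", "1"):
        return True
    if value in ("false", "0"):
        return False
    raise ValueError(f"not an xs:boolean: {text!r}")


def parse_legacy_boolean(text):
    """Hypothetical converter path: also accept the old True/False spelling."""
    value = text.strip()
    if value in ("True", "False"):
        return value == "True"
    return parse_xs_boolean(value)


def format_xs_boolean(flag):
    """Write a Python bool in the canonical lowercase form."""
    return "true" if flag else "false"
```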

Constraints

Do any elements or attributes have length limits (besides the standardized formats, like UUID)? Minimum length or value? Maximum? What elements can there be multiple of, which ones are optional, which ones require one and only one present, etc.?

Namespace definition

You'll also need to declare namespaces and their schemas at the beginning. I provide an example of this in #431.

End

Sorry, I know it's long and I'm some external interloper or whatever, but I really like that GTG is being revived and I figure if now's my chance to make some suggestions for cleaner XML, I should probably hop on it. :)

Thanks for reading, and thank you for putting so much work into this!

@diegogangl
Contributor Author

@johnnybubonic whoa, thank you so much for all the feedback! This helps a lot \o/

id/uuid

I guess all that to say "yeah, use UUID4".

UUID4 looks fine. Tasks can be deleted, and old closed tasks get autopurged by default, so I wouldn't worry much about collisions.

The only difficulty is we can't specify the id attribute as an xs:ID in the schema. xs:ID would make things really easy, because validation AUTOMATICALLY fails if more than one element, anywhere in the document, shares the same attribute with the same value. We can get around this with an xs:unique constraint, but it's limited in scope.

This sounds useful, but I think tags and tasks should have "different sets" of IDs. We load the entire file into memory and then query those data structures, so there's no chance of collision between tags and tasks.

Unified or split XML files?

Didn't know about XInclude, that looks really useful but TBH there's just no good reason to split the files other than file size.

ISO 8601 for dates/times

The problem here is the fuzzy dates. We don't just have tomorrow, we also have: now, someday and soon. None of these match easily with an actual date. The Date class does match them to absolute dates but it's kind of hacky and would be really hard to parse back into something fuzzy. Maybe we can have separate fuzzyStartDate kind of elements? So inside the dates tag, we can optionally have actual dates with the proper format, or these tags with a limited set of allowed strings.
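
One way the "exact or fuzzy" idea could be read back, as a sketch. The fuzzy words (now, soon, someday) and the fuzzyStartDate/fuzzyDueDate naming come from the thread; the `read_date` helper and the rule that an exact and a fuzzy value must not coexist are assumptions for this illustration.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

FUZZY_WORDS = {"now", "soon", "someday"}

# <dates> block taken from the proposed format.
SAMPLE = """\
<dates>
  <addedDate>2020-04-10T20:48:11</addedDate>
  <fuzzyDueDate>someday</fuzzyDueDate>
</dates>
"""


def read_date(dates, name):
    """Hypothetical helper: return a datetime, a fuzzy word, or None.

    name is the exact element name, e.g. "dueDate"; the fuzzy variant is
    looked up as "fuzzyDueDate".
    """
    exact = dates.findtext(name)
    fuzzy = dates.findtext("fuzzy" + name[0].upper() + name[1:])
    if exact and fuzzy:
        raise ValueError(f"both exact and fuzzy {name} present")
    if exact:
        return datetime.strptime(exact.strip(), "%Y-%m-%dT%H:%M:%S")
    if fuzzy:
        word = fuzzy.strip()
        if word not in FUZZY_WORDS:
            raise ValueError(f"unknown fuzzy date: {word}")
        return word
    return None


dates = ET.fromstring(SAMPLE)
due = read_date(dates, "dueDate")
added = read_date(dates, "addedDate")
start = read_date(dates, "startDate")
```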

Version info in root element attributes

Thanks for the tip. Yeah, my idea is to have a separate module to host all the versioning code.
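A rough sketch of what such a versioning module could look like, assuming the root element carries an xmlVersion attribute as in the proposal. CURRENT_XML_VERSION, UPGRADERS and upgrade_v1_to_v2 are made-up names, and the upgrade step is a placeholder.

```python
import xml.etree.ElementTree as ET

CURRENT_XML_VERSION = 2  # assumption: the new format is version 2


def upgrade_v1_to_v2(root):
    """Placeholder upgrade step; a real one would restructure the tree."""
    root.set("xmlVersion", "2")
    return root


# One upgrade function per source version, applied in sequence.
UPGRADERS = {1: upgrade_v1_to_v2}


def load(xml_text):
    """Parse a document and upgrade it step by step to the current version."""
    root = ET.fromstring(xml_text)
    # Assumption: files without the attribute are treated as version 1.
    version = int(root.get("xmlVersion", "1"))
    while version < CURRENT_XML_VERSION:
        root = UPGRADERS[version](root)
        version = int(root.get("xmlVersion"))
    return root


old = load('<gtgData appVersion="0.4" xmlVersion="1"/>')
print(old.get("xmlVersion"))
```

Keeping each step as its own function means a future version 3 only needs one new entry in the table.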

Nest subtasks inside their parent tasks

You could, but you lose some uniqueness checking

What do you mean by uniqueness checking?

and lets you have subtasks assigned/referenced to multiple parent tasks down the line as well.

We probably won't support multiple parents. It's a source of headaches both for the backend code and the UI, and the use cases are better served by just supporting internal linking between tasks.

Remove task-remote-ids

There was some code to read them but it was already commented out with a suggestion to remove them when I got here :)

Tags

I like your proposal better, there's no reason to keep ID as an attribute there.

Personal/Additional suggestions

These all sound great! The Pythonista in me hates not using hyphens, but if that's the standard way 👍

About constraints:

  • Tasks definitely need a title, content and dates (one of each) and a status attribute
  • added and modified are also always on, and there's only one of each inside dates
  • A task can only have one tags and one subtasks
  • gtg-data always has the same attributes
  • tasklist and taglist should always be there, even if they are empty

That's all I can think of 🤔 , everything else is optional.
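The constraints above could be checked in plain Python along these lines. This is only a sketch: it uses the element names as written in the list (added, modified), ignores namespaces, and task_problems / document_problems are hypothetical helpers. A schema would normally express the same rules declaratively.

```python
import xml.etree.ElementTree as ET


def exactly_one(parent, name):
    return len(parent.findall(name)) == 1


def at_most_one(parent, name):
    return len(parent.findall(name)) <= 1


def task_problems(task):
    """Collect constraint violations for a single <task> element."""
    problems = []
    if task.get("status") is None:
        problems.append("missing status attribute")
    for name in ("title", "content", "dates"):
        if not exactly_one(task, name):
            problems.append(f"expected exactly one <{name}>")
    dates = task.find("dates")
    if dates is not None:
        for name in ("added", "modified"):
            if not exactly_one(dates, name):
                problems.append(f"expected exactly one <{name}> in <dates>")
    for name in ("tags", "subtasks"):
        if not at_most_one(task, name):
            problems.append(f"more than one <{name}>")
    return problems


def document_problems(root):
    """Collect violations for the whole document."""
    problems = []
    for name in ("taglist", "tasklist"):
        if not exactly_one(root, name):
            problems.append(f"expected exactly one <{name}>")
    for task in root.findall("tasklist/task"):
        problems += [f"task {task.get('id')}: {p}" for p in task_problems(task)]
    return problems


GOOD = """<gtgData><taglist/><tasklist>
<task id="a" status="Active"><title>t</title>
<dates><added>2020-04-10T20:48:11</added><modified>2020-04-10T20:37:02</modified></dates>
<content/></task></tasklist></gtgData>"""
BAD = '<gtgData><tasklist><task id="b"><title>t</title><content/></task></tasklist></gtgData>'

good_report = document_problems(ET.fromstring(GOOD))
bad_report = document_problems(ET.fromstring(BAD))
print(good_report)
print(bad_report)
```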

@johnnybubonic

@johnnybubonic whoa, thank you so much for all the feedback! This helps a lot \o/

My pleasure!

This sounds useful, but I think tags and tasks should have "different sets" of IDs. We load the entire file into memory and then query those data structures, so there's no chance of collision between tags and tasks.

Yep, but an xs:ID would ensure that a tag and task wouldn't have the same id attribute for instance. I THINK; I'd have to check to see if it's for all elements or all elements of the same tag.

Didn't know about XInclude, that looks really useful but TBH there's just no good reason to split the files other than file size.

Yep, agreed, but it does let it be more modular. Granted, with modularity can come complexity, so YMMV.

The problem here is the fuzzy dates. We don't just have tomorrow, we also have: now, someday and soon. None of these match easily with an actual date. The Date class does match them to absolute dates but it's kind of hacky and would be really hard to parse back into something fuzzy.

Is it required to display them in a fuzzy manner, or just parse them as input and write to the data storage as a fixed time? I'd think the latter would probably be the way to go. (humanize WOULD let you display it as fuzzy pretty well, FWIW. It's best to store the dates in a format easily understood by the machine since, realistically, humans shouldn't be looking at the raw XML files.)

What do you mean by uniqueness checking?

Task IDs, tasks content, anything. Since subtasks can have subtasks of their own, you start messing with recursion. While it's possible to support recursion in a schema from what I recall, it does lead to some potentially messy parsing. For those reasons I'd recommend treating subtasks as references to actual tasks rather than containing the entire subtask.

The Pythonista in me hates not using hyphens, ...

Yep. I'd hate to see/use camelCase in my actual code too (I tend to opt for underscores), but code and data are different! Hyphens can mess up some XML libraries. W3C occasionally uses hyphens for data in their examples but even then, only sometimes - they're pretty inconsistent about it (for instance, all of the standard type definitions are in camelCase, e.g. xs:normalizedString).

Thanks for the details about constraints! That helps a lot.

@diegogangl
Contributor Author

diegogangl commented Jul 17, 2020

Is it required to display them in a fuzzy manner, or just parse them as input and write to the data storage as a fixed time? I'd think the latter would probably be the way to go. (humanize WOULD let you display it as fuzzy pretty well, FWIW. It's best to store the dates in a format easily understood by the machine since, realistically, humans shouldn't be looking at the raw XML files.)

Yes, it's required to store them. I should mention that "tomorrow" or "friday" aren't fuzzy dates. Those are converted to actual dates after the user selects/types them. Fuzzy dates are "someday", "soon" and "now". None of these can be converted to dates and we need to store them as fuzzy.

Task IDs, tasks content, anything. Since subtasks can have subtasks of their own, you start messing with recursion. While it's possible to support recursion in a schema from what I recall, it does lead to some potentially messy parsing. For those reasons I'd recommend treating subtasks as references to actual tasks rather than containing the entire subtask.

Shame, I thought XML was all about nesting.

By the way, what do you think about mixing text and tags in content (html style) vs using tags to keep everything separated?

<content>
This is some text 
<subtask>6000caf7-6197-4d77-a50e-8bd8804c5694</subtask>

Some more text, maybe <strong>bold too?</strong>
</content>

vs

<content>
<p>This is some text</p>
<subtask>6000caf7-6197-4d77-a50e-8bd8804c5694</subtask>

<p>Some more text, maybe</p> <strong>bold too?</strong>
</content>

Seems like it's valid somewhat, and lxml supports it. But I don't know if it's supported in schemas or could cause other kinds of trouble further along.

@johnnybubonic

johnnybubonic commented Jul 19, 2020

Yes, it's required to store them. I should mention that "tomorrow" or "friday" aren't fuzzy dates. Those are converted to actual dates after the user selects/types them. Fuzzy dates are "someday", "soon" and "now". None of these can be converted to dates and we need to store them as fuzzy.

Hrm, I see... I'd store them as a different element name, then. That way it can validate a fixed time string OR validate against a list of known-good "fuzzy" values. I could always xs:union them together in the schema to one datatype, but then that can create more complicated parsing code. Something like dueDate and dueFuzzy would suffice.
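For illustration, reading such a pair of elements could look roughly like this. It assumes the dueDate / dueFuzzy names suggested above and the three fuzzy values mentioned in this thread; read_due is a hypothetical helper.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# The fuzzy values named in this thread; anything else is rejected.
FUZZY_VALUES = {"now", "soon", "someday"}


def read_due(dates_el):
    """Return a datetime, a fuzzy string, or None for a <dates> element."""
    fixed = dates_el.findtext("dueDate")
    fuzzy = dates_el.findtext("dueFuzzy")
    if fixed is not None and fuzzy is not None:
        raise ValueError("a task cannot have both a fixed and a fuzzy due date")
    if fixed is not None:
        return datetime.fromisoformat(fixed)
    if fuzzy is not None:
        if fuzzy not in FUZZY_VALUES:
            raise ValueError(f"unknown fuzzy date: {fuzzy!r}")
        return fuzzy
    return None


fixed_due = read_due(ET.fromstring("<dates><dueDate>2020-05-10T00:00:00</dueDate></dates>"))
fuzzy_due = read_due(ET.fromstring("<dates><dueFuzzy>someday</dueFuzzy></dates>"))
no_due = read_due(ET.fromstring("<dates/>"))
print(fixed_due, fuzzy_due, no_due)
```

The parsing code stays simple because each element has exactly one possible type, which is the upside compared to an xs:union.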

Though I'd imagine "now" wouldn't be a fuzzy since it'd just be a datetime.datetime.now() call on parsing, and then storing the result of that, yes?

Shame, I thought XML was all about nesting.

It absolutely is, yep! But this goes a bit beyond nesting; it's recursion. And while XML Schema can (again, if I recall) support validating recursive elements (<task> types in this case), that complicates parsing if e.g. subtasks can have subtasks, etc. I'd say it's easier to instantiate, say, a Task class or a subclass SubTask (whose super would be Task, and really the only difference would be a "parent" attribute). You could determine which it should be at time of parsing by checking whether there's an isSubtask= boolean attribute on the <task> element itself, and if so, associate from there. I did this without any real prior knowledge of the file structure, but I associate Task objects with their parents (if any), via a list of subtask objects. A flat object makes that methodology a lot easier. Recursion is a lot more expensive, in terms of cycles, than association.
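A small sketch of the flat-file-plus-association approach, assuming tasks reference their children through subtasks/sub elements as in the proposal. Here a single Task class with a parent field stands in for the Task/SubTask pair, and load_tasks is a hypothetical helper.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass(eq=False)
class Task:
    id: str
    title: str
    children: list = field(default_factory=list)
    # repr=False avoids an endless parent <-> children loop when printing.
    parent: "Task | None" = field(default=None, repr=False)


def load_tasks(root):
    """Build Task objects from a flat task list, then wire up parent/child links."""
    tasks, links = {}, []
    for el in root.findall("tasklist/task"):
        task = Task(el.get("id"), el.findtext("title", default=""))
        tasks[task.id] = task
        links += [(task.id, sub.text) for sub in el.findall("subtasks/sub")]
    # Second pass: every task exists now, so references can be resolved by lookup.
    for parent_id, child_id in links:
        parent, child = tasks[parent_id], tasks[child_id]
        parent.children.append(child)
        child.parent = parent
    return tasks


SAMPLE = """<gtgData><tasklist>
<task id="2fdcd50f"><title>Learn How To Use Subtasks</title>
  <subtasks><sub>bf33b248</sub><sub>a957c32a</sub></subtasks></task>
<task id="bf33b248"><title>One subtask</title></task>
<task id="a957c32a"><title>Another subtask</title></task>
</tasklist></gtgData>"""

tasks = load_tasks(ET.fromstring(SAMPLE))
print([child.title for child in tasks["2fdcd50f"].children])
```

Two passes over a flat list are enough no matter how deep the hierarchy goes, which is the parsing simplification argued for above. The shortened ids are just for readability.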

By the way, what do you think about mixing text and tags in content (html style) vs using tags to keep everything separated?

A Schema could validate mixed content like that just fine, but from the parsing end via LXML... I'd recommend against it. It's not without some gotchas: unless you want that example to render in the GUI to the user as:


This is some text
<subtask>6000caf7-6197-4d77-a50e-8bd8804c5694</subtask>

Some more text, maybe bold too?


you'd have to do some stripping of child elements while retaining the text component of them, which is not entirely reliable, even with the amazingness that LXML is.

I'd recommend keeping them in separate elements and even perhaps displaying them to users differently, since they're their own thing. Entering them in the task is fine, but the input parser should store them separately and then they should be displayed separately in the GUI once processed in, IMHO.
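To show the kind of gotcha meant here, this sketch parses the mixed-content example with the standard library's ElementTree, which shares the text/tail model with LXML. The text of a mixed element ends up scattered across .text and the .tail of each child.

```python
import xml.etree.ElementTree as ET

content = ET.fromstring(
    "<content>This is some text "
    "<subtask>6000caf7-6197-4d77-a50e-8bd8804c5694</subtask>"
    " Some more text, maybe <strong>bold too?</strong></content>"
)

# .text only holds the text before the first child element.
leading = content.text
# Text after a child lives on that child's .tail, not on the parent.
after_subtask = content.find("subtask").tail
# itertext() walks everything in document order, child text included.
flattened = "".join(content.itertext())

print(repr(leading))
print(repr(after_subtask))
print(repr(flattened))
```

Reading content.text alone silently drops everything after the first child, and flattening with itertext() pulls the subtask UUID into the visible text, so either way extra handling is needed.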

@diegogangl
Contributor Author

Though I'd imagine "now" wouldn't be a fuzzy since it'd just be a datetime.datetime.now() call on parsing, and then storing the result of that, yes?

Nope, this is how it's stored right now:

<duedate>now</duedate>

Setting a task to now gives it a "higher priority" when sorting by due dates, so it needs to be stored.
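As a sketch of why the fuzzy value has to survive storage: a sort key can rank "now" ahead of every fixed date only if it is still "now" when loaded. The relative order of soon and someday below is an assumption for illustration, not GTG's actual rule, and due_sort_key is a hypothetical helper.

```python
from datetime import datetime

# Assumed ordering: "now" first, then fixed dates, then "soon", then "someday".
FUZZY_RANK = {"now": 0, "soon": 2, "someday": 3}
FIXED_RANK = 1


def due_sort_key(due):
    """Sort key for a due value that is either a datetime or a fuzzy string."""
    if isinstance(due, datetime):
        return (FIXED_RANK, due)
    # Fuzzy values get a dummy datetime so all keys compare cleanly.
    return (FUZZY_RANK[due], datetime.min)


dues = ["someday", datetime(2020, 5, 10), "now", datetime(2020, 4, 1), "soon"]
ordered = sorted(dues, key=due_sort_key)
print(ordered)
```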

A Schema could validate mixed content like that just fine, but from the parsing end via LXML... I'd recommend against it. [...]

Ah thanks, I figured there might be problems (some of which we already have).

I've updated the proposal with all your suggestions

@johnnybubonic

I've updated the proposal with all your suggestions

Thanks! Will update #431 later today or tomorrow with the changes to match current proposal here! Might as well keep them in tandem.

@johnnybubonic

johnnybubonic commented Jul 20, 2020

Slightly modified version of example follows.

Namely, <content> elements need to be encapsulated in a CDATA container so validators/parsers don't choke on the HTML inside and try to parse them as child elements of the XML. (See LXML's CDATA overview and docs for the object class.)

This means that using things like <sub> inside <content> will not work unless you then do further additional processing.

Instead of using CDATA containers, you could base64 encode/decode the <content> text. But you'd still have to post-process that as well, so might as well use a CDATA and use a more simple tagging you can just regex out into groups.
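For comparison, here is a minimal sketch of how a parser sees CDATA-wrapped versus entity-escaped content. It uses the standard library's ElementTree rather than LXML, and the HTML string is made up for the example.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

html = "<p>Some <strong>bold</strong> text & more</p>"

# Variant 1: raw HTML wrapped in a CDATA section.
as_cdata = "<content><![CDATA[" + html + "]]></content>"
# Variant 2: the same HTML with <, > and & escaped as entities.
as_escaped = "<content>" + escape(html) + "</content>"

cdata_el = ET.fromstring(as_cdata)
escaped_el = ET.fromstring(as_escaped)

# Neither variant produces child elements; both give back the original string.
print(len(cdata_el), len(escaped_el))
print(cdata_el.text == html, escaped_el.text == html)
```

Either way the parser hands back one opaque string, so inline markers such as subtasks still need a second parsing step on that string.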

See the inline comments below.

<?xml version="1.0" encoding="UTF-8"?>
<gtgData xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xmlns="https://wiki.gnome.org/Apps/GTG"
	  appVersion="0.5"
	  xmlVersion="2"
	  xsi:schemaLocation="https://wiki.gnome.org/Apps/GTG http://SOMEDOMAIN.TLD/SOME/PATH/TO/data.xsd">
	<taglist>
		<tag id="7171ff82-119a-4933-8277-a8ef5ce6a3e2" color="E9B96E" name="GTG"/>
		<tag id="140f74ea-b2f1-4b0f-b72b-0e85f471bb98"
		     color="cdd3854e56d8"
		     icon="emblem-shared-symbolic.symbolic"
		     name="life"/>
		<tag id="94669f60-2f8e-4b16-b87f-c1d46ade4536" color="c96a52131cd2" name="errands"/>
		<tag id="46890bc2-c924-4146-8279-472099abc0b1" color="c96a52131cd2" name="other_errands"/>
		<tag id="aeb6e795-cb65-4d89-bf80-c7ea524fcfa7" color="c96a52131cd2" name="home_renovation"/>
	</taglist>
	<tasklist>
		<task id="2fdcd50f-0106-48b2-9f16-db2f8dbbf044" status="Active">
			<title>Learn How To Use Subtasks</title>
			<tags>
				<tag>7171ff82-119a-4933-8277-a8ef5ce6a3e2</tag>
				<tag>46890bc2-c924-4146-8279-472099abc0b1</tag>
				<tag>94669f60-2f8e-4b16-b87f-c1d46ade4536</tag>
			</tags>
			<dates>
				<addedDate>2020-04-10T20:48:11</addedDate>
				<modifyDate>2020-04-10T20:37:02</modifyDate>
				<startDate>2020-05-10T00:00:00</startDate>
			</dates>
			<!-- With the content element in a CDATA, you won't be able to detect subs automatically
                          if they use XML/HTML-like tagging.
			     Perhaps a different notation format inside content? e.g. "{! This is a subtask !}"  -->
			<subtasks>
				<sub>bf33b248-ab96-4b99-9e40-8b60c1d7fe2e</sub>
				<sub>a957c32a-6293-46f7-a305-1caccdfbe34c</sub>
			</subtasks>
			<content><![CDATA[<p>@GTG, @errands, @home_renovation 
A &quot;Subtask&quot; is something that you need to do first before being able to accomplish your task. In GTG, the purpose of subtasks is to cut down a task in smaller subtasks that are easier to achieve and to track down.

To insert a subtask in the task description (this window, for instance), begin a line with &quot;-&quot;, then write the subtask title and press Enter.

Try inserting one subtask below. Type &quot;{! This is my first subtask! !}&quot;, for instance, and press Enter:</p>

<p>Alternatively, you can also use the &quot;Insert Subtask&quot; button.

Note that subtasks obey to some rules: first, a subtask's due date can never happen after its parent's due date and, second, when you mark a parent task as done, its subtasks will also be marked as done.

And if you are not happy with your current tasks/subtasks organization, you can always change it by drag-and-dropping tasks on each other in the tasks list.</p>]]></content>
		</task>
		<task id="bf33b248-ab96-4b99-9e40-8b60c1d7fe2e" status="Done">
			<title>One subtask</title>
			<!-- The following does not have a matching tag in taglist? -->
			<content><![CDATA[<p>This is some test subtask with a @tag </p>]]></content>
		</task>
		<task id="a957c32a-6293-46f7-a305-1caccdfbe34c" status="Active">
			<title>Another subtask</title>
			<dates>
				<addedDate>2020-04-10T20:48:11</addedDate>
				<fuzzyDueDate>someday</fuzzyDueDate>
			</dates>
			<content/>
		</task>
		<!-- ... -->
	</tasklist>
</gtgData>

The above validates against what I just pushed to #437 #438. After that I have a few uniqueness constraints to add here and there but it should more or less match current. I'd definitely recommend reviewing the bit about CDATA-ing <content> though.

(EDIT: I did a dumb so removed the parent attributes)

@diegogangl
Contributor Author

Is that because of the p tag? We can just use a different tag name, so it's not misinterpreted as HTML.
Why have both the subtasks tag and the parent ID? While parsing I could check if the subs tag has any elements, fetch those tasks and add them as children. What would the parent attr be used for?

Base64 is a no-go, since we want to keep it human friendly

@johnnybubonic

johnnybubonic commented Jul 20, 2020

Is that because of the p tag? we can just use a different tag name, so it's not misinterpreted as html.

Any SGML-subset (XML, HTML, ...) syntax will trigger a validator error unless it's expected per the schema and the parent is a mixed-type, or it's in a CDATA. It's not the name of the tag so much as it being enclosed by < and > unfortunately.

Why have both the subtasks tag and the parent ID? While parsing I could check if the subs tag has any elements, fetch those tasks and add them as children. What would the parent attr be used for?

That is... a good question. I forgot <subtasks> was suggested. I think I added it as a way of reverse-validating subtasks have valid parents, but if the association is one-way (initiated by a subtasks container in the parent), it's completely unnecessary. Scratch from record, will modify accordingly in a few. :)

You'll want to find some way around how you handle subtasks inline in content still, though, if you want subtasks to be handled. I just figured it'd be easier to CDATA it so it could just render the contents as HTML easily. Which means for non-HTML entities that should be converted to, say, hyperlinks (like your content-inline <sub> items) to work, they'd have to use something that isn't SGML and easily just regexed out.

In the proposed tagging syntax for inside CDATA'd content ({! new subtask here !}), subtasks should be able to be regexed out: do a re.findall on the content text (CDATA-handled, per LXML) with the pattern {!\s*(.+?)\s*!}. POC below:

(EDIT: better POC; it'll actually demonstrate the substitution.)

#!/usr/bin/env python3

import re

s = """This is example task text.


There's more text here.

...But suddenly, a wild {! new subtask !} appears! And {! another one !}!

And one with {! an exclamation point! !} And one {!without spaces!}! And even one {! with {} inside because why? !}

It starts with a { and ends with a }. But we only want the subtask text."""


r = re.findall(r'{!\s*(.+?)\s*!}', s)

print('ORIGINAL:')
print(s)
print()
print('FOUND:')
print(r)

for idx, subtask in enumerate(r):
    # Pretend that the list index is the new subtask's ID (a UUID4).
    # Also, I don't know how GTK renders/uses the link anchors. This should be enough to demonstrate though.
    # Raw string so that \s reaches the regex engine instead of being treated as a string escape.
    task_ptrn = r'{{!\s*{0}\s*!}}'.format(re.escape(subtask))
    task_link = '<a href="{0}">'.format(idx)
    task_html = '{0}{1}</a>'.format(task_link, subtask)
    s = re.sub(task_ptrn, task_html, s)

print('\nThis should now print the original string with links.\n')
print(s)

As shown if you run that, you can find subtasks defined in CDATA-stripped content (so it could still be rendered as HTML straight through, which might be nice from a GUI end). It'd also let users do their own formatting with HTML (I'd recommend implementing rendering limits, though. Probably don't need a <script> inside a task description. ;) ) The only difficulty there is after creating a new task and adding it to the current task's <subtasks>, you'd need to replace that with a link to the task in the original <content>'s CDATA. Which could probably be easily enough done with a re.sub and a .format on the pattern itself. The above POC now does that as well.
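A single-pass variant of the same idea, as a sketch: re.sub with a callback replaces each marker where it is found, so two subtasks with identical titles still get separate ids. The href format and link_subtasks are assumptions for illustration only, and the callback escapes the title before putting it into HTML.

```python
import re
import uuid
from html import escape

# Same marker syntax as the POC, with the braces escaped explicitly.
SUBTASK = re.compile(r"\{!\s*(.+?)\s*!\}")


def link_subtasks(text, new_id=lambda: str(uuid.uuid4())):
    """Replace each {! title !} marker with a link and return (text, {id: title})."""
    created = {}

    def replace(match):
        title = match.group(1)
        task_id = new_id()
        created[task_id] = title
        # Assumption: the href simply carries the new task's id.
        return '<a href="{0}">{1}</a>'.format(task_id, escape(title))

    return SUBTASK.sub(replace, text), created


linked, new_tasks = link_subtasks("A wild {! new subtask !} appears! And {!another one!}!")
print(linked)
print(new_tasks)
```

The returned dict is what would then be turned into new task elements and entries in the parent's subtasks container.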

Base64 is a no-go, since we want to keep it human friendly

Yeah. It feels like a dirty hack and doesn't really fix the parseable-formatted-content problem anyways.

johnnybubonic added a commit to johnnybubonic/gtg that referenced this issue Jul 20, 2020
johnnybubonic added a commit to johnnybubonic/gtg that referenced this issue Jul 20, 2020
johnnybubonic added a commit to johnnybubonic/gtg that referenced this issue Jul 21, 2020
currently matches getting-things-gnome#279. mostly (still in discussion re: CDATA vs. escaping in <content>).

all uniqueness and associations applied, i think, as well.
@nekohayo
Member

Just a random drive-by comment: recently with GTG 0.4's UI opening up some possibilities, I have found myself (as a user) sometimes wishing for the ability to parent more than one task to a child, I found the "single parent, many children"-only model to be a bit restrictive... so if somehow multi-parents could work, I'd love to see it happen. I just have no idea currently how that would be represented/managed in the UI, however.

@johnnybubonic

johnnybubonic commented Jul 21, 2020

@leio brought up some interesting questions in IRC re: CDATA/escaping:

05:55:28 < leio> r00t^2: I'd assume CDATA needs some escaping too, right? If the task text contained ]]>
05:55:54 < leio> somewhat relatedly I have a backlog item for myself to file an issue about brackets for existing code
05:56:18 < leio> If a tag is made to contain a < character, it doesn't appear as such in the tag list in sidebar; unsure how it gets saved

So in order:

  1. Yeah, if we used a CDATA container and the user input contained ]]>, it'd be the same problem as <, >, etc. chars used in user input (this extends to things like task names too, by the way, not just content). Note that .encode('ascii', 'xmlcharrefreplace') does not help here: it only turns non-ASCII characters into numeric character references and leaves <, > and & untouched. LXML should escape markup characters automagically when you assign to an element's .text (or an attribute value, etc.), in a way that retains unicode. CDATA contents, on the other hand, aren't (and shouldn't be) escaped. To avoid the problem there, you can regex the string before putting it in a CDATA container: re.sub(r']]>', r'', text) would strip the sequence from user input (but it'd be a lossy operation; that particular sequence of characters would be gone forever). A lossless alternative is to split the text at that sequence into two adjacent CDATA sections. Because CDATA is treated as raw data, it has no escaping - just the termination token. See this for more detail.
    1. Thankfully, the chances of someone entering this in as a task title or content are MUCH lower than using <, >, etc. Square brackets do not need to be escaped as long as they aren't followed by a second square bracket and a greater-than character.
  2. Brackets for any user input shouldn't be an issue with CDATA, if you go that route. Provided the exact sequence is not ]]>, anything goes.
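A sketch of a lossless way to handle the ]]> case: split the text at that sequence so the terminator never appears inside a single CDATA section. to_cdata is a hypothetical helper that builds the markup by hand, and the standard library's ElementTree is used only to check the round trip.

```python
import xml.etree.ElementTree as ET


def to_cdata(text):
    """Wrap text in CDATA, splitting any ']]>' across two adjacent sections."""
    # "]]>" becomes "]]" + end of section + start of a new section + ">".
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


user_text = "array[index[0]]> 5 and <b>markup</b> & more"
xml_text = "<content>" + to_cdata(user_text) + "</content>"

# The parser joins adjacent CDATA sections back into one text node.
round_tripped = ET.fromstring(xml_text).text
print(xml_text)
print(round_tripped == user_text)
```

Unlike stripping the sequence with re.sub, nothing the user typed is lost.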

@diegogangl
Contributor Author

At this point the file format change is basically done with only minor bugs left, so closing this.

@nekohayo nekohayo added RFC "Request for Comments" brainstorming tickets for things we are unsure about maintainability Automated tests suite, tooling, refactoring, or anything that makes it easier for developers labels Feb 27, 2024