extend export command to show tombstone + change output format to CSV #610
Conversation
In weed/command/export.go:

	if version == storage.Version1 {
		size = n.Size
	}
	fmt.Printf("\"%s\",\"%s\",%d,%t,%s,%s,%s,%t\n",
On second thought, tab-separated values will be easier to parse than dealing with escapes.
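For comparison, Go's standard library already handles the escaping that makes a hand-rolled `fmt.Printf` CSV line fragile. A minimal sketch, assuming placeholder field values (not the actual export columns):

```go
package main

import (
	"bytes"
	"encoding/csv"
	"fmt"
)

// csvRecord renders one record with RFC 4180 quoting, so quotes,
// commas and newlines embedded in file names survive round-tripping.
func csvRecord(fields ...string) string {
	var buf bytes.Buffer
	w := csv.NewWriter(&buf)
	w.Write(fields)
	w.Flush()
	return buf.String()
}

func main() {
	// A file name containing a quote and a comma is escaped, not mangled:
	fmt.Print(csvRecord("3,01637037d6", `weird "name", with comma.txt`, "2534"))
	// prints: "3,01637037d6","weird ""name"", with comma.txt",2534
}
```

`encoding/csv` only quotes fields that need it (delimiter, quote, newline), so plain numeric columns stay unquoted.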
Tab is a valid character in Unix filenames, for example.
|
Line-by-line JSON encoding is also an option; Elasticsearch uses this in some places for batch processing.
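In Go, `json.NewEncoder` produces exactly this line-separated shape: each `Encode` call writes one compact JSON value followed by a newline. A sketch with a hypothetical `record` type (the real needle fields differ):

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// record is a hypothetical export row; the actual needle fields differ.
type record struct {
	Key     string `json:"key"`
	Name    string `json:"name"`
	Size    uint32 `json:"size"`
	Deleted bool   `json:"deleted"`
}

// encodeLines emits one compact JSON object per line (NDJSON-style).
func encodeLines(recs []record) string {
	var buf bytes.Buffer
	enc := json.NewEncoder(&buf) // Encode appends '\n' after each value
	for _, r := range recs {
		enc.Encode(r)
	}
	return buf.String()
}

func main() {
	fmt.Print(encodeLines([]record{
		{Key: "3,01637037d6", Name: "a.txt", Size: 2534},
		{Key: "3,02e2b1a49c", Name: "b.txt", Size: 12, Deleted: true},
	}))
}
```

Because the encoder escapes strings itself, a tab or quote in a file name is no longer a parsing hazard.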
|
Maybe not over-engineer this. Tab should be fine. I do not think people will try to access files via HTTP with a tab in the URL. |
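For what it's worth, a TSV writer does not have to choose between simplicity and robustness: `csv.Writer` with a tab delimiter quotes any field that happens to contain a tab, so the filename concern raised above would not shift columns. A minimal sketch:

```go
package main

import (
	"bytes"
	"encoding/csv"
	"fmt"
)

// tsvRecord renders one tab-separated record; fields containing a tab
// (legal in Unix file names) are quoted rather than corrupting columns.
func tsvRecord(fields ...string) string {
	var buf bytes.Buffer
	w := csv.NewWriter(&buf)
	w.Comma = '\t' // switch the delimiter from comma to tab
	w.Write(fields)
	w.Flush()
	return buf.String()
}

func main() {
	fmt.Print(tsvRecord("3,01637037d6", "name\twith tab.txt", "2534"))
}
```

Only the field containing the delimiter gets quoted; everything else stays as plain tab-separated text.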
I thought about that for a while. To be honest, my very first approach was a bit naive.
I don't consider it over-engineering. Especially when this tool is used for repair purposes, it has to be robust. Not quoting properly can even produce security leaks, as known from SQL injection. It's about what people *can* do, not what they *normally* do.
I consider this tool an intermediate step towards a proper repair mechanism. Looking a little into the future, I'd also offer this export functionality as a REST endpoint that streams volume records, which can then be aggregated and compared by a repair coordinator. From my point of view it would be consistent to have the same output format for both the API endpoint and the CLI tool.
With that in mind, the question is: what format is best suited for that?
Regular JSON:
- Not well suited for streaming, so I wouldn't consider it an option.
Line-separated JSON per record:
- Can be streamed well; Elasticsearch uses it for batch processing.
- JSON is the heart of the web, so well suited for APIs.
- Seaweed speaks JSON in all other endpoints.
- Downside: field names are transferred for every record.
TSV/CSV:
- Not typically used in APIs.
- Smaller in size than JSON.
With respect to the repair process, do you have any thoughts or opinions about all that? Of course we can talk about nits like tabs vs. commas, but in the end we have to be on the same page about the long-term goal. If we have that, it's easier to talk about the small steps. If you prefer, we can also continue this discussion in a different place (issue or forum). |
Supporting multiple formats is another way. I was thinking to just add one
way, and then add other formats if really necessary.
|
I was also thinking about one. Just wanted to list the pros and cons I had in mind :)
I personally have a slight preference for one JSON object per record, but if you have a strong preference for TSV I am OK with it.
|
JSON is good; make sure everything is on one line for easier parsing.
|
Any update on this? |
To be honest: we're thinking about moving from Seaweed to Ceph, as I don't see a strong community here. So probably not from our side. |
We're just getting started with seaweed. We already have both gluster and ceph clusters running, but for small files this is the best we've found for our use case. |
Merging #610 and adding a "-limit" option
merged via 3edfe1d |