extend export command to show tombstone + change output format to CSV #610
Conversation
In weed/command/export.go:

	if version == storage.Version1 {
		size = n.Size
	}
	fmt.Printf("\"%s\",\"%s\",%d,%t,%s,%s,%s,%t\n",
On second thought, tab-separated values will be easier to parse than dealing with escapes.
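For comparison, Go's standard library already handles the escaping that makes a hand-rolled `fmt.Printf` CSV line fragile. A minimal sketch, assuming placeholder field values (not the actual export columns):

```go
package main

import (
	"bytes"
	"encoding/csv"
	"fmt"
)

// csvRecord renders one record with RFC 4180 quoting, so quotes,
// commas and newlines embedded in file names survive round-tripping.
func csvRecord(fields ...string) string {
	var buf bytes.Buffer
	w := csv.NewWriter(&buf)
	w.Write(fields)
	w.Flush()
	return buf.String()
}

func main() {
	// A file name containing a quote and a comma is escaped, not mangled:
	fmt.Print(csvRecord("3,01637037d6", `weird "name", with comma.txt`, "2534"))
	// prints: "3,01637037d6","weird ""name"", with comma.txt",2534
}
```

`encoding/csv` only quotes fields that need it (delimiter, quote, newline), so plain numeric columns stay unquoted.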
Tab is a valid character in Unix filenames, for example.
|
Line-by-line JSON encoding is also an option; Elasticsearch uses this in some places for batch processing.
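In Go, `json.NewEncoder` produces exactly this line-separated shape: each `Encode` call writes one compact JSON value followed by a newline. A sketch with a hypothetical `record` type (the real needle fields differ):

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// record is a hypothetical export row; the actual needle fields differ.
type record struct {
	Key     string `json:"key"`
	Name    string `json:"name"`
	Size    uint32 `json:"size"`
	Deleted bool   `json:"deleted"`
}

// encodeLines emits one compact JSON object per line (NDJSON-style).
func encodeLines(recs []record) string {
	var buf bytes.Buffer
	enc := json.NewEncoder(&buf) // Encode appends '\n' after each value
	for _, r := range recs {
		enc.Encode(r)
	}
	return buf.String()
}

func main() {
	fmt.Print(encodeLines([]record{
		{Key: "3,01637037d6", Name: "a.txt", Size: 2534},
		{Key: "3,02e2b1a49c", Name: "b.txt", Size: 12, Deleted: true},
	}))
}
```

Because the encoder escapes strings itself, a tab or quote in a file name is no longer a parsing hazard.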
|
Maybe not over-engineer this. Tab should be fine. I do not think people will try to access files via HTTP with a tab in the URL. |
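For what it's worth, a TSV writer does not have to choose between simplicity and robustness: `csv.Writer` with a tab delimiter quotes any field that happens to contain a tab, so the filename concern raised above would not shift columns. A minimal sketch:

```go
package main

import (
	"bytes"
	"encoding/csv"
	"fmt"
)

// tsvRecord renders one tab-separated record; fields containing a tab
// (legal in Unix file names) are quoted rather than corrupting columns.
func tsvRecord(fields ...string) string {
	var buf bytes.Buffer
	w := csv.NewWriter(&buf)
	w.Comma = '\t' // switch the delimiter from comma to tab
	w.Write(fields)
	w.Flush()
	return buf.String()
}

func main() {
	fmt.Print(tsvRecord("3,01637037d6", "name\twith tab.txt", "2534"))
}
```

Only the field containing the delimiter gets quoted; everything else stays as plain tab-separated text.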
I thought about that for a while. To be honest, my very first approach was a bit naive.
I don't consider it over-engineering. Especially when this tool is used for repair purposes, it has to be robust. Not quoting properly can even produce security leaks, as known from SQL injection. It's about what people *can* do, not what they *normally* do.
I consider this tool an intermediate step towards a proper repair mechanism. Looking a little into the future, I'd also offer this export functionality as a REST endpoint that streams volume records, which can then be aggregated and compared by a repair coordinator. From my point of view it would be consistent to have the same output format for both the API endpoint and the CLI tool.
With that in mind, the question is: what format is best suited for that?
Regular JSON:
- Not well suited for streaming, so I wouldn't consider it an option.
Line-separated JSON per record:
- Can be streamed well; Elasticsearch uses it for batch processing.
- JSON is the heart of the web, so well suited for APIs.
- Seaweed speaks JSON in all other endpoints.
- Downside: field names are transferred for every record.
TSV/CSV:
- Not typically used in APIs.
- Smaller in size than JSON.
With respect to the repair process, do you have any thoughts or opinions about all that? Of course we can talk about nits like tabs vs. commas, but in the end we have to be on the same page about the long-term goal. If we have that, it's easier to talk about the small steps. If you prefer, we can also continue this discussion in a different place (issue or forum). |
Supporting multiple formats is another way. I was thinking to just add one
way, and then add other formats if really necessary.
|
I was also thinking about one. Just wanted to list the pros and cons I had in mind :)
I personally have a slight preference for one JSON object per record, but if you have a strong preference for TSV I am OK with it.
|
JSON is good; make sure everything is on one line for easier parsing.
|
Any update on this? |
To be honest: we're thinking about moving from Seaweed to Ceph, as I don't see a strong community here. So probably not from our side. |
We're just getting started with seaweed. We already have both gluster and ceph clusters running, but for small files this is the best we've found for our use case. |
Merging #610 and adding a "-limit" option
merged via 3edfe1d |