CSV: Use placeholders when header and data length differs #2555

Merged on Nov 23, 2021 (5 commits)

yuferpegom
Contributor

@yuferpegom yuferpegom commented Jan 14, 2021

This PR adds a couple of functions that let the user keep unmatched fields (filled in with placeholders) when a CSV being transformed to a Map has more headers than data, or vice versa.

Examples:

  1. When there is more data than headers
eins,zwei
11,12,13

maps to

Map("eins" -> "11", "zwei" -> "12", "Missing header" -> "13")

"Missing header" is the default placeholder; the user can pass a custom value instead.

  2. When there are more headers than data
eins,zwei,drei,vier,fünt
11,12,13

maps to

Map("eins" -> "11", "zwei" -> "12", "drei" -> "13", "vier" -> "", "fünt" -> "")

This would be helpful, especially in the second case, when I want to keep the headers even when I don't have values associated with them.
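
The two cases above can be sketched with plain Scala collections. This is a minimal illustration, not the code in this PR: `PlaceholderZip`, `combineAll`, and the parameter names are hypothetical, and it works on `String` lists rather than on a stream of `ByteString`s.

```scala
object PlaceholderZip {
  // Hypothetical helper (not the PR's implementation): pads the shorter of
  // `headers` and `row` with a placeholder so that no field is dropped.
  def combineAll(
      headers: List[String],
      row: List[String],
      headerPlaceholder: String = "Missing header",
      valuePlaceholder: String = ""
  ): Map[String, String] =
    // zipAll extends the shorter side: headerPlaceholder fills missing
    // headers, valuePlaceholder fills missing values.
    headers.zipAll(row, headerPlaceholder, valuePlaceholder).toMap
}
```

With a single extra column this reproduces both examples; a plain `zip` would instead truncate to the shorter list and drop the unmatched fields.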

… headers using place holders when the length of each other differs

- Adds test

- Fixes the process function
- Undoes some changes that shouldn't have been done

- Renames combiner function
- Fixes javadoc

Fixes java doc
@yuferpegom yuferpegom changed the title from "CSV: Use placeholders when header and data length is differet" to "CSV: Use placeholders when header and data length differs" Jan 14, 2021
@ennru
Member

ennru commented Jan 15, 2021

Thank you for this suggestion.

What should happen if there is more than one extra data column?

markarasev pushed a commit to markarasev/alpakka that referenced this pull request Jan 17, 2021
@yuferpegom
Contributor Author

It should just add another header. If the user has set a custom placeholder, that one is used; otherwise the default one is used:

  1. Default placeholder:
eins,zwei
11,12,13,14

maps to

Map("eins" -> "11", "zwei" -> "12", "Missing header" -> "13", "Missing header" -> "14")

  2. Custom placeholder:
eins,zwei
11,12,13,14

maps to

Map("eins" -> "11", "zwei" -> "12", "custom" -> "13", "custom" -> "14")

I think this makes it easier to understand what happened to the data (which is more helpful from the developer's point of view).

I also think the more valuable use case for this change is when there are more headers than data, since the user may want to keep the data even if they forgot to add a couple of commas in their input CSV.

@ennru
Member

ennru commented Jan 20, 2021

The Map won't be able to hold multiple values with the same key.
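
A small sketch of the problem, using only the Scala standard library (the `DuplicateKeys` object and its value names are illustrative, not part of the PR):

```scala
object DuplicateKeys {
  // Two extra data columns, both padded with the same placeholder key.
  val pairs: List[(String, String)] =
    List("eins" -> "11", "zwei" -> "12", "Missing header" -> "13", "Missing header" -> "14")

  // Converting to a Map keeps only the last value seen for a repeated key,
  // so the value "13" is silently lost.
  val asMap: Map[String, String] = pairs.toMap
}
```

Here `DuplicateKeys.asMap` has three entries, and `"Missing header"` maps to `"14"`.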

@yuferpegom
Contributor Author

You're right. So, I think it can be worked around by appending some character to the placeholder, such as a number. Something like:

  1. Default placeholder:
eins,zwei
11,12,13,14

maps to

Map("eins" -> "11", "zwei" -> "12", "Missing header" -> "13", "Missing header_1" -> "14")

What do you think? Any idea is also welcome, thanks!

@seglo
Member

seglo commented Jan 26, 2021

You're right. So, I think it can be worked around by appending some character to the placeholder, such as a number.

Sounds reasonable to me. A 0-based index appended to the default missing key. Using your example:

CsvToMap.toMapCombineAll(
  headerDefault = "MissingHeader"
)

would return

Map("eins" -> "11", "zwei" -> "12", "MissingHeader0" -> "13", "MissingHeader1" -> "14")

I would also suggest supporting a default value for missing values.

CsvToMap.toMapCombineAll(
  valueDefault = "(missing)"
)

would return

Map("eins" -> "11", "zwei" -> "12", "drei" -> "13", "vier" -> "(missing)", "fünt" -> "(missing)")

You'll need to support a javadsl as well.
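
The suggestion above (a 0-based index appended to the missing-header key, plus a default for missing values) could look roughly like this on plain Scala lists. This is a sketch under assumptions: `IndexedPlaceholders` and `combineAll` are hypothetical names, `headerDefault` and `valueDefault` are borrowed from the snippets in this comment, and none of it is the Alpakka implementation.

```scala
object IndexedPlaceholders {
  // Hypothetical helper illustrating the suggestion; not Alpakka's code.
  def combineAll(
      headers: List[String],
      row: List[String],
      headerDefault: String = "MissingHeader",
      valueDefault: String = ""
  ): Map[String, String] = {
    // One distinct key per extra data column: MissingHeader0, MissingHeader1, ...
    val extraCount = math.max(0, row.length - headers.length)
    val allHeaders = headers ++ List.tabulate(extraCount)(i => s"$headerDefault$i")
    // Pad the row with valueDefault when there are more headers than values.
    allHeaders.zip(row.padTo(allHeaders.length, valueDefault)).toMap
  }
}
```

Because every generated key is distinct, no extra column is lost when the pairs are collected into a `Map`. A collision remains possible if the CSV itself has a header named like `MissingHeader0`, which this sketch does not guard against.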

@seglo
Member

seglo commented Mar 10, 2021

@yuferpegom If you can follow up on this PR soon, it can be included in the Alpakka 3.0.0-M1 release.

@yuferpegom
Contributor Author

Oh I will, thanks

@lightbend-cla-validator

Hi @yupegom,

Thank you for your contribution! We really value the time you've taken to put this together.

Before we proceed with reviewing this pull request, please sign the Lightbend Contributors License Agreement:

https://www.lightbend.com/contribute/cla

@yuferpegom yuferpegom force-pushed the zip-all-when-header-data-length-differs branch from 02abc24 to dbd9495 Compare March 11, 2021 18:52
Member

@seglo seglo left a comment

I think this is almost there. Just a few small things.


// #header-line
val future =
// format: off
Member

I don't see a reason why exceptions need to be made for all this formatting. Can you elaborate?

Contributor Author

This allows keeping the special indentation to represent the columns and rows being passed as params.

BTW, I'm just following what was already done before in the spec.

Comment on lines 56 to 59
* A flow translating incoming [[scala.List]] of [[akka.util.ByteString]] to a map of String and
* ByteString using the stream's first element's values as keys. If the header values are shorter
* than the data (or vice-versa) placeholder elements are used to extend the shorter collection to
* the length of the longer.
Member

Maybe a copy/paste error? The types don't match. For this API they should use Java types and Javadoc conventions to link to those types (see other docs in this class).

Comment on lines 76 to 79
* A flow translating incoming [[scala.List]] of [[akka.util.ByteString]] to a map of String keys
* and values using the stream's first element's values as keys. If the header values are shorter
* than the data (or vice-versa) placeholder elements are used to extend the shorter collection to
* the length of the longer.
Member

Javadocs

* than the data (or vice-versa) placeholder elements are used to extend the shorter collection to
* the length of the longer.
*
* @param charset the charset to decode [[akka.util.ByteString]] to [[scala.Predef.String]],
Member

Javadocs

@seglo seglo added this to the 3.0.1 milestone May 12, 2021
@ennru ennru removed this from the 3.0.1 milestone May 29, 2021
@yuferpegom
Copy link
Contributor Author

@seglo I have addressed your last comments. Please take a look when you have a chance, and thank you!

Member

@ennru ennru left a comment

LGTM.

@ennru ennru merged commit 2147c88 into akka:master Nov 23, 2021
@ennru ennru added this to the 3.0.4 milestone Nov 23, 2021
@yuferpegom yuferpegom deleted the zip-all-when-header-data-length-differs branch April 19, 2022 17:51
5 participants