Skip to content

Conversation

feiyax
Copy link

@feiyax feiyax commented Feb 3, 2021

P4 allows non-unicode characters in changelist description body,
so git-p4 needs to be character encoding aware when reading p4 cl.

This change adds 2 config options: one specifies encoding,
the other specifies erro handling upon unrecognized character.
Those configs apply when it reads p4 description text, mostly
from commands "p4 describe" and "p4 changes".


I have an open question in mind: what might be the best default config to use?

Currently the python's bytes.decode() is called with default utf-8 and strict error handling, so git-p4 pukes on non-unicode characters. I encountered it when git p4 sync attempts to ingest a certain CL.

It seems to make sense to default to replace so that it gets rid of non-unicode chars while trying to retain information. However, i am uncertain on if we have use cases where it relies on the stop-on-non-unicode behavior. (Hypothetically say an automation that's expected to return error on non-unicode char in order to stop them from propagating further?)


I tested it with git p4 sync to a P4 CL that somehow has non-unicode control character in description. With git-p4.cldescencodingerrhandling=ignore, it proceeded without error.

cc: Luke Diamand luke@diamand.org
cc: Andrew Oakley andrew@adoakley.name

P4 allows non-unicode characters in changelist description body,
so git-p4 needs to be character encoding aware when reading p4 cl

This change adds 2 config options, one specifies encoding,
the other specifies erro handling upon unrecognized character.
Those configs  apply when it reads p4 description text, mostly
from commands "p4 describe" and "p4 changes".

Signed-off-by: Feiynag Xue <fxue@roku.com>
@feiyax feiyax changed the title git-p4: handle non-unicode characters in p4 cl git-p4: handle non-unicode characters in p4 changelist description Feb 3, 2021
@feiyax
Copy link
Author

feiyax commented Feb 3, 2021

/submit

@gitgitgadget
Copy link

gitgitgadget bot commented Feb 3, 2021

Submitted as pull.864.git.1612371600332.gitgitgadget@gmail.com

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git pr-864/feiyeung/description-text-encoding-handling-v1

To fetch this version to local tag pr-864/feiyeung/description-text-encoding-handling-v1:

git fetch --no-tags https://github.com/gitgitgadget/git tag pr-864/feiyeung/description-text-encoding-handling-v1

@gitgitgadget
Copy link

gitgitgadget bot commented Feb 3, 2021

On the Git mailing list, Junio C Hamano wrote (reply to this):

"Feiyang via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Feiynag Xue <fxue@roku.com>
>
> P4 allows non-unicode characters in changelist description body,
> so git-p4 needs to be character encoding aware when reading p4 cl
>
> This change adds 2 config options, one specifies encoding,
> the other specifies erro handling upon unrecognized character.
> Those configs  apply when it reads p4 description text, mostly
> from commands "p4 describe" and "p4 changes".
>
> Signed-off-by: Feiynag Xue <fxue@roku.com>
> ---

Adding a few people who had meaningful (read: needs some Perforce
knowledge) changes to this part of the codebase to Cc: to ask for
their reviews.

>     git-p4: handle non-unicode characters in p4 changelist description
>     
>     P4 allows non-unicode characters in changelist description body, so
>     git-p4 needs to be character encoding aware when reading p4 cl.
>     
>     This change adds 2 config options: one specifies encoding, the other
>     specifies erro handling upon unrecognized character. Those configs apply
>     when it reads p4 description text, mostly from commands "p4 describe"
>     and "p4 changes".
>     
>     ------------------------------------------------------------------------
>     
>     I have an open question in mind: what might be the best default config
>     to use?
>     
>     Currently the python's bytes.decode() is called with default utf-8 and
>     strict error handling, so git-p4 pukes on non-unicode characters. I
>     encountered it when git p4 sync attempts to ingest a certain CL.
>     
>     It seems to make sense to default to replace so that it gets rid of
>     non-unicode chars while trying to retain information. However, i am
>     uncertain on if we have use cases where it relies on the
>     stop-on-non-unicode behavior. (Hypothetically say an automation that's
>     expected to return error on non-unicode char in order to stop them from
>     propagating further?)
>     
>     ------------------------------------------------------------------------
>     
>     I tested it with git p4 sync to a P4 CL that somehow has non-unicode
>     control character in description. With
>     git-p4.cldescencodingerrhandling=ignore, it proceeded without error.
>
> Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-864%2Ffeiyeung%2Fdescription-text-encoding-handling-v1
> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-864/feiyeung/description-text-encoding-handling-v1
> Pull-Request: https://github.com/gitgitgadget/git/pull/864
>
>  Documentation/git-p4.txt | 13 +++++++++++++
>  git-p4.py                | 12 +++++++++++-
>  2 files changed, 24 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/git-p4.txt b/Documentation/git-p4.txt
> index f89e68b424c..01a0e0b1067 100644
> --- a/Documentation/git-p4.txt
> +++ b/Documentation/git-p4.txt
> @@ -638,6 +638,19 @@ git-p4.pathEncoding::
>  	to transcode the paths to UTF-8. As an example, Perforce on Windows
>  	often uses "cp1252" to encode path names.
>  
> +git-p4.clDescEncoding::
> +	Perforce allows non-unicode characters in changelist description. 
> +	Use this config to tell git-p4 what encoding Perforce had used for 
> +	description text. This encoding is used to transcode the text to
> +	UTF-8. Defaults to "utf_8".

Would it still work if you replaced "utf_8" here with "UTF-8"?  If
we can use "UTF-8", this description (and the code that does so)
would read much less awkward, I would think.

> +git-p4.clDescNonUnicodeHandling::
> +	Perforce allows non-unicode characters in changelist description. 
> +	Use this config to tell git-p4 what to do when it does not recognize 
> +	the character encoding in description body. Defaults to "strict" for 
> +	stopping upon encounter. "ignore" for skipping unrecognized
> +	characters; "replace" for attempting to convert into UTF-8. 
> +
>  git-p4.largeFileSystem::
>  	Specify the system that is used for large (binary) files. Please note
>  	that large file systems do not support the 'git p4 submit' command.
> diff --git a/git-p4.py b/git-p4.py
> index 09c9e93ac40..abbeb9156bd 100755
> --- a/git-p4.py
> +++ b/git-p4.py
> @@ -206,6 +206,13 @@ def decode_path(path):
>                  print('Path with non-ASCII characters detected. Used {} to decode: {}'.format(encoding, path))
>          return path
>  
> +def decode_changlist_description(text):
> +    """Decode bytes or bytearray using configured changelist description encoding options
> +    """
> +    encoding = gitConfig('git-p4.clDescEncoding') or 'utf_8'
> +    err_handling = gitConfig('git-p4.clDescEncodingErrHandling') or 'strict'
> +    return text.decode(encoding, err_handling)
> +
>  def run_git_hook(cmd, param=[]):
>      """Execute a hook if the hook exists."""
>      if verbose:
> @@ -771,7 +778,10 @@ def p4CmdList(cmd, stdin=None, stdin_mode='w+b', cb=None, skip_info=False,
>                  for key, value in entry.items():
>                      key = key.decode()
>                      if isinstance(value, bytes) and not (key in ('data', 'path', 'clientFile') or key.startswith('depotFile')):
> -                        value = value.decode()
> +                        if key == 'desc':
> +                            value = decode_changlist_description(value)
> +                        else:
> +                            value = value.decode()
>                      decoded_entry[key] = value
>                  # Parse out data if it's an error response
>                  if decoded_entry.get('code') == 'error' and 'data' in decoded_entry:
>
> base-commit: e6362826a0409539642a5738db61827e5978e2e4

@gitgitgadget
Copy link

gitgitgadget bot commented Feb 4, 2021

On the Git mailing list, Luke Diamand wrote (reply to this):

We've started getting this quite a lot as we switched to a new P4
server and I suspect that the i18N options are incorrect. So I think
this would be welcome.

A test case would be useful, as debugging these decoding problems is a
bit of a nightmare.

Luke

On Wed, 3 Feb 2021 at 22:42, Feiyang Xue <me@feiyangxue.com> wrote:
>
>
>
> On Feb 3, 2021, at 3:44 PM, Junio C Hamano <gitster@pobox.com> wrote:
>
> "Feiyang via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> From: Feiynag Xue <fxue@roku.com>
>
> P4 allows non-unicode characters in changelist description body,
> so git-p4 needs to be character encoding aware when reading p4 cl
>
> This change adds 2 config options, one specifies encoding,
> the other specifies erro handling upon unrecognized character.
> Those configs  apply when it reads p4 description text, mostly
> from commands "p4 describe" and "p4 changes".
>
> Signed-off-by: Feiynag Xue <fxue@roku.com>
> ---
>
>
> Adding a few people who had meaningful (read: needs some Perforce
> knowledge) changes to this part of the codebase to Cc: to ask for
> their reviews.
>
>
> Adding Yang Zhao <yang.zhao@skyboxlabs.com>, who had made character
> encodings related changes for paths.
>
> Adding Scott Lamb <slamb@slamb.org>, who had made changes to this
> “p4CmdList()” method.
>
>
>
>    git-p4: handle non-unicode characters in p4 changelist description
>
>    P4 allows non-unicode characters in changelist description body, so
>    git-p4 needs to be character encoding aware when reading p4 cl.
>
>    This change adds 2 config options: one specifies encoding, the other
>    specifies erro handling upon unrecognized character. Those configs apply
>    when it reads p4 description text, mostly from commands "p4 describe"
>    and "p4 changes".
>
>    ------------------------------------------------------------------------
>
>    I have an open question in mind: what might be the best default config
>    to use?
>
>    Currently the python's bytes.decode() is called with default utf-8 and
>    strict error handling, so git-p4 pukes on non-unicode characters. I
>    encountered it when git p4 sync attempts to ingest a certain CL.
>
>    It seems to make sense to default to replace so that it gets rid of
>    non-unicode chars while trying to retain information. However, i am
>    uncertain on if we have use cases where it relies on the
>    stop-on-non-unicode behavior. (Hypothetically say an automation that's
>    expected to return error on non-unicode char in order to stop them from
>    propagating further?)
>
>    ------------------------------------------------------------------------
>
>    I tested it with git p4 sync to a P4 CL that somehow has non-unicode
>    control character in description. With
>    git-p4.cldescencodingerrhandling=ignore, it proceeded without error.
>
> Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-864%2Ffeiyeung%2Fdescription-text-encoding-handling-v1
> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-864/feiyeung/description-text-encoding-handling-v1
> Pull-Request: https://github.com/gitgitgadget/git/pull/864
>
> Documentation/git-p4.txt | 13 +++++++++++++
> git-p4.py                | 12 +++++++++++-
> 2 files changed, 24 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/git-p4.txt b/Documentation/git-p4.txt
> index f89e68b424c..01a0e0b1067 100644
> --- a/Documentation/git-p4.txt
> +++ b/Documentation/git-p4.txt
> @@ -638,6 +638,19 @@ git-p4.pathEncoding::
> to transcode the paths to UTF-8. As an example, Perforce on Windows
> often uses "cp1252" to encode path names.
>
> +git-p4.clDescEncoding::
> + Perforce allows non-unicode characters in changelist description.
> + Use this config to tell git-p4 what encoding Perforce had used for
> + description text. This encoding is used to transcode the text to
> + UTF-8. Defaults to "utf_8".
>
>
> Would it still work if you replaced "utf_8" here with "UTF-8"?  If
> we can use "UTF-8", this description (and the code that does so)
> would read much less awkward, I would think.
>
>
> I doubt “UTF-8” would work; I do believe the lower case “utf-8” would.
>
> Looking at Python3 documentation on encodings, UTF-8 is specified as “utf_8”.
> It allows aliases of using dash to replace underscore, as pointed out by the
> samge page: https://docs.python.org/3/library/codecs.html#standard-encodings
> > Notice that spelling alternatives that only differ in case or use a hyphen
> > instead of an underscore are also valid aliases; therefore, e.g. 'utf-8’
> > is a valid alias for the 'utf_8' codec.
>
> I used underscore one “utf_8” for consistency reason: this file already has
> uses of “utf_8”.
>
>
>
> +git-p4.clDescNonUnicodeHandling::
> + Perforce allows non-unicode characters in changelist description.
> + Use this config to tell git-p4 what to do when it does not recognize
> + the character encoding in description body. Defaults to "strict" for
> + stopping upon encounter. "ignore" for skipping unrecognized
> + characters; "replace" for attempting to convert into UTF-8.
> +
> git-p4.largeFileSystem::
> Specify the system that is used for large (binary) files. Please note
> that large file systems do not support the 'git p4 submit' command.
> diff --git a/git-p4.py b/git-p4.py
> index 09c9e93ac40..abbeb9156bd 100755
> --- a/git-p4.py
> +++ b/git-p4.py
> @@ -206,6 +206,13 @@ def decode_path(path):
>                 print('Path with non-ASCII characters detected. Used {} to decode: {}'.format(encoding, path))
>         return path
>
> +def decode_changlist_description(text):
> +    """Decode bytes or bytearray using configured changelist description encoding options
> +    """
> +    encoding = gitConfig('git-p4.clDescEncoding') or 'utf_8'
> +    err_handling = gitConfig('git-p4.clDescEncodingErrHandling') or 'strict'
> +    return text.decode(encoding, err_handling)
> +
> def run_git_hook(cmd, param=[]):
>     """Execute a hook if the hook exists."""
>     if verbose:
> @@ -771,7 +778,10 @@ def p4CmdList(cmd, stdin=None, stdin_mode='w+b', cb=None, skip_info=False,
>                 for key, value in entry.items():
>                     key = key.decode()
>                     if isinstance(value, bytes) and not (key in ('data', 'path', 'clientFile') or key.startswith('depotFile')):
> -                        value = value.decode()
> +                        if key == 'desc':
> +                            value = decode_changlist_description(value)
> +                        else:
> +                            value = value.decode()
>                     decoded_entry[key] = value
>                 # Parse out data if it's an error response
>                 if decoded_entry.get('code') == 'error' and 'data' in decoded_entry:
>
> base-commit: e6362826a0409539642a5738db61827e5978e2e4
>
>

@gitgitgadget
Copy link

gitgitgadget bot commented Feb 4, 2021

User Luke Diamand <luke@diamand.org> has been added to the cc: list.

@gitgitgadget
Copy link

gitgitgadget bot commented Feb 4, 2021

On the Git mailing list, Andrew Oakley wrote (reply to this):

On Wed, 03 Feb 2021 16:59:59 +0000
"Feiyang via GitGitGadget" <gitgitgadget@gmail.com> wrote:
> From: Feiynag Xue <fxue@roku.com>
> 
> P4 allows non-unicode characters in changelist description body,
> so git-p4 needs to be character encoding aware when reading p4 cl
> 
> This change adds 2 config options, one specifies encoding,
> the other specifies erro handling upon unrecognized character.
> Those configs  apply when it reads p4 description text, mostly
> from commands "p4 describe" and "p4 changes".

...

>     It seems to make sense to default to replace so that it gets rid
> of non-unicode chars while trying to retain information. However, i am
>     uncertain on if we have use cases where it relies on the
>     stop-on-non-unicode behavior. (Hypothetically say an automation
> that's expected to return error on non-unicode char in order to stop
> them from propagating further?)

I suspect these options will be insufficient for real repositories.

There are two ways a perforce server is configured:
- unicode mode where the metadata is valid UTF-8, and you can request
  conversion to different character sets
- not in unicode mode where the metadata can be pretty much any random
  bytes (but not '\0'), and the encoding is not stored anywhere

There isn't any way to recover the encoding information from perforce,
and it's likely that a server that's not in unicode mode will end up
with both UTF-8 commits, and commits that contain other things (which
we have no way to work out what they are).

Until recently git-p4 was written in python2 and it just moved the
bytes from perforce into git without trying to interpret them in any
way.  This has the advantage that the git repository will accurately
reflect what was in perforce, even if it's complete garbage.

The other useful option I can think of would be to attempt to decode
the data as UTF-8, but fall back to some other encoding if the data
isn't valid (probably Windows-1252, but a config option would make
sense here).

@gitgitgadget
Copy link

gitgitgadget bot commented Feb 4, 2021

User Andrew Oakley <andrew@adoakley.name> has been added to the cc: list.

description text. This encoding is used to transcode the text to
UTF-8. Defaults to "utf_8".

git-p4.clDescNonUnicodeHandling::

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Configuration name of the document not match with the Py file. Should be clDescEncodingErrHandling here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@joelliang would you terribly mind reviewing on the Git mailing list instead? See the "reply to this" instructions and find the mail that you want to reply to here.

@feiyax feiyax closed this Apr 11, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants