Skip to content

Conversation

kontaxis
Copy link

@kontaxis kontaxis commented Mar 19, 2021

Changes since v1:

  • Turned off the feature by default.
  • Removed duplicate code.
  • Added note about Gitweb consumers receiving redacted logs.

Changes since v2:

  • The feature can be set on a per-project basis. ('override' => 1)

Changes since v3:

  • Renamed feature to "email-privacy" and improved documentation.
  • Removed UI elements for git-format-patch since it won't be redacted.
  • Simplified calls to the address redaction logic.
  • Mail::Address is now used to reduce false-positive redactions.

Changes since v4:

  • Rephrased the commit comment.
  • hide_mailaddrs_if_private is slighly more compact.

Changes since v5:

  • A simple <local@domain> filter is used instead of Mail::Address to identify addresses.

Signed-off-by: Georgios Kontaxis geko1702+commits@99rst.org
cc: Ævar Arnfjörð Bjarmason avarab@gmail.com
cc: "brian m. carlson" sandals@crustytoothpaste.net
cc: Eric Wong e@80x24.org
cc: "Georgios Kontaxis" geko1702+commits@99rst.org

@gitgitgadget
Copy link

gitgitgadget bot commented Mar 19, 2021

Welcome to GitGitGadget

Hi @kontaxis, and welcome to GitGitGadget, the GitHub App to send patch series to the Git mailing list from GitHub Pull Requests.

Please make sure that your Pull Request has a good description, as it will be used as cover letter.

Also, it is a good idea to review the commit messages one last time, as the Git project expects them in a quite specific form:

  • the lines should not exceed 76 columns,
  • the first line should be like a header and typically start with a prefix like "tests:" or "revisions:" to state which subsystem the change is about, and
  • the commit messages' body should be describing the "why?" of the change.
  • Finally, the commit messages should end in a Signed-off-by: line matching the commits' author.

It is in general a good idea to await the automated test ("Checks") in this Pull Request before contributing the patches, e.g. to avoid trivial issues such as unportable code.

Contributing the patches

Before you can contribute the patches, your GitHub username needs to be added to the list of permitted users. Any already-permitted user can do that, by adding a comment to your PR of the form /allow. A good way to find other contributors is to locate recent pull requests where someone has been /allowed:

Both the person who commented /allow and the PR author are able to /allow you.

An alternative is the channel #git-devel on the FreeNode IRC network:

<newcontributor> I've just created my first PR, could someone please /allow me? https://github.com/gitgitgadget/git/pull/12345
<veteran> newcontributor: it is done
<newcontributor> thanks!

Once on the list of permitted usernames, you can contribute the patches to the Git mailing list by adding a PR comment /submit.

If you want to see what email(s) would be sent for a /submit request, add a PR comment /preview to have the email(s) sent to you. You must have a public GitHub email address for this.

After you submit, GitGitGadget will respond with another comment that contains the link to the cover letter mail in the Git mailing list archive. Please make sure to monitor the discussion in that thread and to address comments and suggestions (while the comments and suggestions will be mirrored into the PR by GitGitGadget, you will still want to reply via mail).

If you do not want to subscribe to the Git mailing list just to be able to respond to a mail, you can download the mbox from the Git mailing list archive (click the (raw) link), then import it into your mail program. If you use GMail, you can do this via:

curl -g --user "<EMailAddress>:<Password>" \
    --url "imaps://imap.gmail.com/INBOX" -T /path/to/raw.txt

To iterate on your change, i.e. send a revised patch or patch series, you will first want to (force-)push to the same branch. You probably also want to modify your Pull Request description (or title). It is a good idea to summarize the revision by adding something like this to the cover letter (read: by editing the first comment on the PR, i.e. the PR description):

Changes since v1:
- Fixed a typo in the commit message (found by ...)
- Added a code comment to ... as suggested by ...
...

To send a new iteration, just add another PR comment with the contents: /submit.

Need help?

New contributors who want advice are encouraged to join git-mentoring@googlegroups.com, where volunteers who regularly contribute to Git are willing to answer newbie questions, give advice, or otherwise provide mentoring to interested contributors. You must join in order to post or view messages, but anyone can join.

You may also be able to find help in real time in the developer IRC channel, #git-devel on Freenode. Remember that IRC does not support offline messaging, so if you send someone a private message and log out, they cannot respond to you. The scrollback of #git-devel is archived, though.

@gitgitgadget
Copy link

gitgitgadget bot commented Mar 19, 2021

There is an issue in commit c2fb4db:
Prefixed commit message must be in lower case: gitweb: Redacted e-mail addresses feature.

@kontaxis kontaxis force-pushed the kontaxis/email_privacy branch from c2fb4db to 5da5273 Compare March 19, 2021 22:48
@dscho
Copy link
Member

dscho commented Mar 20, 2021

/allow

@gitgitgadget
Copy link

gitgitgadget bot commented Mar 20, 2021

User kontaxis is now allowed to use GitGitGadget.

WARNING: kontaxis has no public email address set on GitHub

@kontaxis kontaxis force-pushed the kontaxis/email_privacy branch 4 times, most recently from 036c582 to 6fe6ebd Compare March 20, 2021 21:45
@kontaxis
Copy link
Author

/submit

@gitgitgadget
Copy link

gitgitgadget bot commented Mar 20, 2021

Submitted as pull.910.git.1616283780358.gitgitgadget@gmail.com

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git pr-910/kontaxis/kontaxis/email_privacy-v1

To fetch this version to local tag pr-910/kontaxis/kontaxis/email_privacy-v1:

git fetch --no-tags https://github.com/gitgitgadget/git tag pr-910/kontaxis/kontaxis/email_privacy-v1

WARNING: kontaxis has no public email address set on GitHub

@gitgitgadget
Copy link

gitgitgadget bot commented Mar 21, 2021

On the Git mailing list, Ævar Arnfjörð Bjarmason wrote (reply to this):


On Sun, Mar 21 2021, Georgios Kontaxis via GitGitGadget wrote:

> From: Georgios Kontaxis <geko1702+commits@99rst.org>
>
> Gitweb extracts content from the Git log and makes it accessible
> over HTTP. As a result, e-mail addresses found in commits are
> exposed to web crawlers. This may result in unsolicited messages.
> This is a feature for redacting e-mail addresses from the generated
> HTML content.
>
> This feature does not prevent someone from downloading the
> unredacted commit log and extracting information from it.
> It aims to hinder the low-effort bulk collection of e-mail
> addresses by web crawlers.

So web crawlers that aren't going to obey robots.txt?

I'm not opposed to this feature, but a glance at gitweb's documentation
seems to show that we don't discuss how to set robots.txt up for it at
all.

Perhaps having that in the docs or otherwise in the default setup would
get us most of the win of this feature?

> Signed-off-by: Georgios Kontaxis <geko1702+commits@99rst.org>
> ---

Odd:

>     gitweb: Redacted e-mail addresses feature.
>     
>     Gitweb extracts content from the Git log and makes it accessible over
>     HTTP. As a result, e-mail addresses found in commits are exposed to web
>     crawlers. This may result in unsolicited messages. This is a feature for
>     redacting e-mail addresses from the generated HTML content.
>     
>     This feature does not prevent someone from downloading the unredacted
>     commit log and extracting information from it. It aims to hinder the
>     low-effort bulk collection of e-mail addresses by web crawlers.
>     
>     Signed-off-by: Georgios Kontaxis geko1702+commits@99rst.org

To have this duplication of the patch here below "---", some GGG feature
gone awry?

> Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-910%2Fkontaxis%2Fkontaxis%2Femail_privacy-v1
> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-910/kontaxis/kontaxis/email_privacy-v1
> Pull-Request: https://github.com/gitgitgadget/git/pull/910
>
>  Documentation/gitweb.conf.txt | 12 ++++++++++++
>  gitweb/gitweb.perl            | 36 ++++++++++++++++++++++++++++++++---
>  2 files changed, 45 insertions(+), 3 deletions(-)
>
> diff --git a/Documentation/gitweb.conf.txt b/Documentation/gitweb.conf.txt
> index 7963a79ba98b..10653d8670a8 100644
> --- a/Documentation/gitweb.conf.txt
> +++ b/Documentation/gitweb.conf.txt
> @@ -896,6 +896,18 @@ same as of the snippet above:
>  It is an error to specify a ref that does not pass "git check-ref-format"
>  scrutiny. Duplicated values are filtered.
>  
> +email_privacy::
> +    Redact e-mail addresses from the generated HTML, etc. content.
> +    This hides e-mail addresses found in the commit log from web crawlers.
> +    Enabled by default.
> ++
> +It is highly recommended to keep this feature enabled unless web crawlers
> +are hindered in some other way. You can disable this feature as shown below:
> ++
> +---------------------------------------------------------------------------
> +$feature{'email_privacy'}{'default'} = [0];
> +---------------------------------------------------------------------------

I think there's plenty of gitweb users that are going to be relying on
the current behavior, so doesn't it make more sense for this to be
opt-in rather than opt-out?

>  
>  EXAMPLES
>  --------
> diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
> index 0959a782eccb..9d21c2583e18 100755
> --- a/gitweb/gitweb.perl
> +++ b/gitweb/gitweb.perl
> @@ -569,6 +569,15 @@ sub evaluate_uri {
>  		'sub' => \&feature_extra_branch_refs,
>  		'override' => 0,
>  		'default' => []},
> +
> +    # Redact e-mail addresses.
> +
> +    # To disable system wide have in $GITWEB_CONFIG
> +    # $feature{'email_privacy'}{'default'} = [0];
> +	'email_privacy' => {
> +		'sub' => sub { feature_bool('email_privacy', @_) },
> +		'override' => 0,
> +		'default' => [1]},
>  );
> [...]
>  sub gitweb_get_feature {
> @@ -3471,6 +3480,10 @@ sub parse_tag {
>  			if ($tag{'author'} =~ m/^([^<]+) <([^>]*)>/) {
>  				$tag{'author_name'}  = $1;
>  				$tag{'author_email'} = $2;
> +				if (gitweb_check_feature('email_privacy')) {
> +					$tag{'author_email'} = "private";
> +					$tag{'author'} =~ s/<([^>]+)>/<private>/;
> +				}
>  			} else {
>  				$tag{'author_name'} = $tag{'author'};
>  			}
> @@ -3519,6 +3532,10 @@ sub parse_commit_text {
>  			if ($co{'author'} =~ m/^([^<]+) <([^>]*)>/) {
>  				$co{'author_name'}  = $1;
>  				$co{'author_email'} = $2;
> +				if (gitweb_check_feature('email_privacy')) {
> +					$co{'author_email'} = "private";
> +					$co{'author'} =~ s/<([^>]+)>/<private>/;
> +				}
>  			} else {
>  				$co{'author_name'} = $co{'author'};
>  			}
> @@ -3529,6 +3546,10 @@ sub parse_commit_text {
>  			if ($co{'committer'} =~ m/^([^<]+) <([^>]*)>/) {
>  				$co{'committer_name'}  = $1;
>  				$co{'committer_email'} = $2;
> +				if (gitweb_check_feature('email_privacy')) {
> +					$co{'committer_email'} = "private";
> +					$co{'committer'} =~ s/<([^>]+)>/<private>/;
> +				}
>  			} else {
>  				$co{'committer_name'} = $co{'committer'};
>  			}
> @@ -3568,9 +3589,13 @@ sub parse_commit_text {
>  	if (! defined $co{'title'} || $co{'title'} eq "") {
>  		$co{'title'} = $co{'title_short'} = '(no commit message)';
>  	}
> -	# remove added spaces
> +	# remove added spaces, redact e-mail addresses if applicable.
>  	foreach my $line (@commit_lines) {
>  		$line =~ s/^    //;
> +		if (gitweb_check_feature('email_privacy') &&
> +			$line =~ m/^([^<]+) <([^>]*)>/) {
> +			$line =~ s/<([^>]+)>/<private>/;
> +		}
>  	}
>  	$co{'comment'} = \@commit_lines;

All of these hunks (and the below) should use some new function that
does this feature check + sanitizing instead of copy/pasting mostly the
same code N times. e.g.:
    
    sub maybe_hide_email {
        my $email = shift;
        return $email unless gitweb_check_feature('email_privacy');
        return hide_email($email);
    }

then:

    $tag{author_email} = maybe_hide_email($2);

Also it looks like this isn't a new issue, but does this need to
implement its own E-Mail parser? We ship with Mail::Address for
git-send-email, can gitweb (and the elided hide_email() function above)
use that too?


> @@ -8060,8 +8085,13 @@ sub git_commitdiff {
>  		close $fd
>  			or print "Reading git-diff-tree failed\n";
>  	} elsif ($format eq 'patch') {
> -		local $/ = undef;
> -		print <$fd>;
> +		while (my $line = <$fd>) {
> +			if (gitweb_check_feature('email_privacy') &&
> +				$line =~ m/^([^<]+) <([^>]*)>/) {
> +				$line =~ s/<([^>]+)>/<private>/;
> +			}
> +			print $line;
> +		}
>  		close $fd
>  			or print "Reading git-format-patch failed\n";

Is that "patch" output meant for "git am"? Won't this severely break
that use-case if so?

@gitgitgadget
Copy link

gitgitgadget bot commented Mar 21, 2021

User Ævar Arnfjörð Bjarmason <avarab@gmail.com> has been added to the cc: list.

@gitgitgadget
Copy link

gitgitgadget bot commented Mar 21, 2021

On the Git mailing list, "brian m. carlson" wrote (reply to this):


--WZgUSzCUfrfbcvXF
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On 2021-03-21 at 00:42:58, =C3=86var Arnfj=C3=B6r=C3=B0 Bjarmason wrote:
>=20
> On Sun, Mar 21 2021, Georgios Kontaxis via GitGitGadget wrote:
>=20
> > From: Georgios Kontaxis <geko1702+commits@99rst.org>
> >
> > Gitweb extracts content from the Git log and makes it accessible
> > over HTTP. As a result, e-mail addresses found in commits are
> > exposed to web crawlers. This may result in unsolicited messages.
> > This is a feature for redacting e-mail addresses from the generated
> > HTML content.
> >
> > This feature does not prevent someone from downloading the
> > unredacted commit log and extracting information from it.
> > It aims to hinder the low-effort bulk collection of e-mail
> > addresses by web crawlers.
>=20
> So web crawlers that aren't going to obey robots.txt?
>=20
> I'm not opposed to this feature, but a glance at gitweb's documentation
> seems to show that we don't discuss how to set robots.txt up for it at
> all.
>=20
> Perhaps having that in the docs or otherwise in the default setup would
> get us most of the win of this feature?

I'm going to guess that the two features are orthogonal.  robots.txt is
great for communicating to well-meaning actors what you do and don't
want crawled.  For example, one might ask a web crawler not to crawl
individual commits because that creates excessive load on the server.

This option is about preventing email harvesting, usually for the
purposes of sending spam.  Spam is email abuse and all reasonable people
know it's unacceptable, so by definition the people doing this are bad
actors and are not likely to honor the robots.txt.  As someone who runs
his own mail server, that is certainly my experience.

So I am in favor of this feature.  I think it mirrors what many other
tools do in this space and having it as an option is valuable.

> > diff --git a/Documentation/gitweb.conf.txt b/Documentation/gitweb.conf.=
txt
> > index 7963a79ba98b..10653d8670a8 100644
> > --- a/Documentation/gitweb.conf.txt
> > +++ b/Documentation/gitweb.conf.txt
> > @@ -896,6 +896,18 @@ same as of the snippet above:
> >  It is an error to specify a ref that does not pass "git check-ref-form=
at"
> >  scrutiny. Duplicated values are filtered.
> > =20
> > +email_privacy::
> > +    Redact e-mail addresses from the generated HTML, etc. content.
> > +    This hides e-mail addresses found in the commit log from web crawl=
ers.
> > +    Enabled by default.
> > ++
> > +It is highly recommended to keep this feature enabled unless web crawl=
ers
> > +are hindered in some other way. You can disable this feature as shown =
below:
> > ++
> > +----------------------------------------------------------------------=
-----
> > +$feature{'email_privacy'}{'default'} =3D [0];
> > +----------------------------------------------------------------------=
-----
>=20
> I think there's plenty of gitweb users that are going to be relying on
> the current behavior, so doesn't it make more sense for this to be
> opt-in rather than opt-out?

I agree this make sense as an opt-in feature.  While many people will
want to enable it, users who are performing an upgrade won't necessarily
want the behavior to change right away.
--=20
brian m. carlson (he/him or they/them)
Houston, Texas, US

--WZgUSzCUfrfbcvXF
Content-Type: application/pgp-signature; name="signature.asc"

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.2.27 (GNU/Linux)

iHUEABYKAB0WIQQILOaKnbxl+4PRw5F8DEliiIeigQUCYFahDQAKCRB8DEliiIei
gZ4IAP0XH/a44c+BhQNqitdUnHVMnlqIqm8lVOfgdZOL7u9ImgEAzKQBbeesHszd
FZ4sAMTkRAI+VmIIEQ8evgXphRd2UQA=
=8ZUo
-----END PGP SIGNATURE-----

--WZgUSzCUfrfbcvXF--

@gitgitgadget
Copy link

gitgitgadget bot commented Mar 21, 2021

User "brian m. carlson" <sandals@crustytoothpaste.net> has been added to the cc: list.

@kontaxis kontaxis force-pushed the kontaxis/email_privacy branch from 6fe6ebd to 74af11c Compare March 21, 2021 02:46
@kontaxis
Copy link
Author

/submit

@gitgitgadget
Copy link

gitgitgadget bot commented Mar 21, 2021

Submitted as pull.910.v2.git.1616297564158.gitgitgadget@gmail.com

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git pr-910/kontaxis/kontaxis/email_privacy-v2

To fetch this version to local tag pr-910/kontaxis/kontaxis/email_privacy-v2:

git fetch --no-tags https://github.com/gitgitgadget/git tag pr-910/kontaxis/kontaxis/email_privacy-v2

WARNING: kontaxis has no public email address set on GitHub

@gitgitgadget
Copy link

gitgitgadget bot commented Mar 21, 2021

On the Git mailing list, "Georgios Kontaxis" wrote (reply to this):

>
> On Sun, Mar 21 2021, Georgios Kontaxis via GitGitGadget wrote:
>
>> From: Georgios Kontaxis <geko1702+commits@99rst.org>
>>
>> Gitweb extracts content from the Git log and makes it accessible
>> over HTTP. As a result, e-mail addresses found in commits are
>> exposed to web crawlers. This may result in unsolicited messages.
>> This is a feature for redacting e-mail addresses from the generated
>> HTML content.
>>
>> This feature does not prevent someone from downloading the
>> unredacted commit log and extracting information from it.
>> It aims to hinder the low-effort bulk collection of e-mail
>> addresses by web crawlers.
>
> So web crawlers that aren't going to obey robots.txt?
>
> I'm not opposed to this feature, but a glance at gitweb's documentation
> seems to show that we don't discuss how to set robots.txt up for it at
> all.
>
> Perhaps having that in the docs or otherwise in the default setup would
> get us most of the win of this feature?
>
File robots.txt is basically asking nicely and we should work on that.
At the same time crawlers that look for addresses to send SPAM to
will probably ignore it so this change is meant for them.

>> Signed-off-by: Georgios Kontaxis <geko1702+commits@99rst.org>
>> ---
>
> Odd:
>
>>     gitweb: Redacted e-mail addresses feature.
>>
>>     Gitweb extracts content from the Git log and makes it accessible
>> over
>>     HTTP. As a result, e-mail addresses found in commits are exposed to
>> web
>>     crawlers. This may result in unsolicited messages. This is a feature
>> for
>>     redacting e-mail addresses from the generated HTML content.
>>
>>     This feature does not prevent someone from downloading the
>> unredacted
>>     commit log and extracting information from it. It aims to hinder the
>>     low-effort bulk collection of e-mail addresses by web crawlers.
>>
>>     Signed-off-by: Georgios Kontaxis geko1702+commits@99rst.org
>
> To have this duplication of the patch here below "---", some GGG feature
> gone awry?
>
>> Published-As:
>> https://github.com/gitgitgadget/git/releases/tag/pr-910%2Fkontaxis%2Fkontaxis%2Femail_privacy-v1
>> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git
>> pr-910/kontaxis/kontaxis/email_privacy-v1
>> Pull-Request: https://github.com/gitgitgadget/git/pull/910
>>
>>  Documentation/gitweb.conf.txt | 12 ++++++++++++
>>  gitweb/gitweb.perl            | 36 ++++++++++++++++++++++++++++++++---
>>  2 files changed, 45 insertions(+), 3 deletions(-)
>>
>> diff --git a/Documentation/gitweb.conf.txt
>> b/Documentation/gitweb.conf.txt
>> index 7963a79ba98b..10653d8670a8 100644
>> --- a/Documentation/gitweb.conf.txt
>> +++ b/Documentation/gitweb.conf.txt
>> @@ -896,6 +896,18 @@ same as of the snippet above:
>>  It is an error to specify a ref that does not pass "git
>> check-ref-format"
>>  scrutiny. Duplicated values are filtered.
>>
>> +email_privacy::
>> +    Redact e-mail addresses from the generated HTML, etc. content.
>> +    This hides e-mail addresses found in the commit log from web
>> crawlers.
>> +    Enabled by default.
>> ++
>> +It is highly recommended to keep this feature enabled unless web
>> crawlers
>> +are hindered in some other way. You can disable this feature as shown
>> below:
>> ++
>> +---------------------------------------------------------------------------
>> +$feature{'email_privacy'}{'default'} = [0];
>> +---------------------------------------------------------------------------
>
> I think there's plenty of gitweb users that are going to be relying on
> the current behavior, so doesn't it make more sense for this to be
> opt-in rather than opt-out?
>
My concern is that Gitweb operators may not understand the need
for this feature or maybe won't be aware the feature exists.

Nevertheless, I've changed the feature to be off by default.
Perhaps we can revisit this decision in the future? :)

>>
>>  EXAMPLES
>>  --------
>> diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
>> index 0959a782eccb..9d21c2583e18 100755
>> --- a/gitweb/gitweb.perl
>> +++ b/gitweb/gitweb.perl
>> @@ -569,6 +569,15 @@ sub evaluate_uri {
>>  		'sub' => \&feature_extra_branch_refs,
>>  		'override' => 0,
>>  		'default' => []},
>> +
>> +    # Redact e-mail addresses.
>> +
>> +    # To disable system wide have in $GITWEB_CONFIG
>> +    # $feature{'email_privacy'}{'default'} = [0];
>> +	'email_privacy' => {
>> +		'sub' => sub { feature_bool('email_privacy', @_) },
>> +		'override' => 0,
>> +		'default' => [1]},
>>  );
>> [...]
>>  sub gitweb_get_feature {
>> @@ -3471,6 +3480,10 @@ sub parse_tag {
>>  			if ($tag{'author'} =~ m/^([^<]+) <([^>]*)>/) {
>>  				$tag{'author_name'}  = $1;
>>  				$tag{'author_email'} = $2;
>> +				if (gitweb_check_feature('email_privacy')) {
>> +					$tag{'author_email'} = "private";
>> +					$tag{'author'} =~ s/<([^>]+)>/<private>/;
>> +				}
>>  			} else {
>>  				$tag{'author_name'} = $tag{'author'};
>>  			}
>> @@ -3519,6 +3532,10 @@ sub parse_commit_text {
>>  			if ($co{'author'} =~ m/^([^<]+) <([^>]*)>/) {
>>  				$co{'author_name'}  = $1;
>>  				$co{'author_email'} = $2;
>> +				if (gitweb_check_feature('email_privacy')) {
>> +					$co{'author_email'} = "private";
>> +					$co{'author'} =~ s/<([^>]+)>/<private>/;
>> +				}
>>  			} else {
>>  				$co{'author_name'} = $co{'author'};
>>  			}
>> @@ -3529,6 +3546,10 @@ sub parse_commit_text {
>>  			if ($co{'committer'} =~ m/^([^<]+) <([^>]*)>/) {
>>  				$co{'committer_name'}  = $1;
>>  				$co{'committer_email'} = $2;
>> +				if (gitweb_check_feature('email_privacy')) {
>> +					$co{'committer_email'} = "private";
>> +					$co{'committer'} =~ s/<([^>]+)>/<private>/;
>> +				}
>>  			} else {
>>  				$co{'committer_name'} = $co{'committer'};
>>  			}
>> @@ -3568,9 +3589,13 @@ sub parse_commit_text {
>>  	if (! defined $co{'title'} || $co{'title'} eq "") {
>>  		$co{'title'} = $co{'title_short'} = '(no commit message)';
>>  	}
>> -	# remove added spaces
>> +	# remove added spaces, redact e-mail addresses if applicable.
>>  	foreach my $line (@commit_lines) {
>>  		$line =~ s/^    //;
>> +		if (gitweb_check_feature('email_privacy') &&
>> +			$line =~ m/^([^<]+) <([^>]*)>/) {
>> +			$line =~ s/<([^>]+)>/<private>/;
>> +		}
>>  	}
>>  	$co{'comment'} = \@commit_lines;
>
> All of these hunks (and the below) should use some new function that
> does this feature check + sanitizing instead of copy/pasting mostly the
> same code N times. e.g.:
>
>     sub maybe_hide_email {
>         my $email = shift;
>         return $email unless gitweb_check_feature('email_privacy');
>         return hide_email($email);
>     }
>
> then:
>
>     $tag{author_email} = maybe_hide_email($2);
>
> Also it looks like this isn't a new issue, but does this need to
> implement its own E-Mail parser? We ship with Mail::Address for
> git-send-email, can gitweb (and the elided hide_email() function above)
> use that too?
>
Thanks for the suggestion.
I've removed the duplicate code.

The places where there are e-mail addresses today are pretty specific.
(Pretty much the author, committer fields and the Signed-off-by lines)

In theory someone could write a comment with a bunch of addresses in it
but that would be unstructured text and I think Mail::Address is not good
with that.

I can definitely keep working on this topic but maybe in subsequent PRs?
Assuming we're not exposing any addresses right now or redacting things
we shouldn't.

>
>> @@ -8060,8 +8085,13 @@ sub git_commitdiff {
>>  		close $fd
>>  			or print "Reading git-diff-tree failed\n";
>>  	} elsif ($format eq 'patch') {
>> -		local $/ = undef;
>> -		print <$fd>;
>> +		while (my $line = <$fd>) {
>> +			if (gitweb_check_feature('email_privacy') &&
>> +				$line =~ m/^([^<]+) <([^>]*)>/) {
>> +				$line =~ s/<([^>]+)>/<private>/;
>> +			}
>> +			print $line;
>> +		}
>>  		close $fd
>>  			or print "Reading git-format-patch failed\n";
>
> Is that "patch" output meant for "git am"? Won't this severely break
> that use-case if so?
>
Not sure who may be consuming that output.
I've added a note in the documentation for gitweb.conf.

If a web crawler can get to the information by following URLs
then I think we should redact it.

Hopefully by documenting this as a potential issue Gitweb operators
can create a workaround specific to their use case.
Possibly implementing access control and leaving this feature off.

@gitgitgadget
Copy link

gitgitgadget bot commented Mar 21, 2021

User "Georgios Kontaxis" <geko1702+commits@99rst.org> has been added to the cc: list.

@gitgitgadget
Copy link

gitgitgadget bot commented Mar 21, 2021

On the Git mailing list, Junio C Hamano wrote (reply to this):

"Georgios Kontaxis via GitGitGadget" <gitgitgadget@gmail.com>
writes:

> +    # To disable system wide have in $GITWEB_CONFIG
> +    # $feature{'email_privacy'}{'default'} = [0];
> +	'email_privacy' => {
> +		'sub' => sub { feature_bool('email_privacy', @_) },
> +		'override' => 0,
> +		'default' => [1]},
>  );

I do not see why this should default to true.

It would break existing installations, who have been perfectly happy
with the convenience of supplying a ready access to potential new
contributors who/which addresses to contact plausible mentors in the
projects they are interested in.

And more importantly, I do not see why it should be made impossible
to override per repository/project in a multi-tenant installation.
Some projects may be more "privacy" sensitive than others.  Those
who want to use tighter setting should be able to enable it even
when the side-wide default is set to false, no?

Thanks.

@gitgitgadget
Copy link

gitgitgadget bot commented Mar 21, 2021

On the Git mailing list, Junio C Hamano wrote (reply to this):

Junio C Hamano <gitster@pobox.com> writes:

> And more importantly, I do not see why it should be made impossible
> to override per repository/project in a multi-tenant installation.
> Some projects may be more "privacy" sensitive than others.  Those
> who want to use tighter setting should be able to enable it even
> when the side-wide default is set to false, no?

To answer an inevitable and natural follow-up question preemptively,
the primary reason why we have the override => 0 mechanism is so
that site administrators can disable certain expensive features
(like blame, snapshot, etc.) no matter what each project hosted by
them wish.

And hiding contributor identity would not be a choice that is based
on how expensive the feature is to run.

@kontaxis kontaxis force-pushed the kontaxis/email_privacy branch from 74af11c to 930cdef Compare March 21, 2021 06:51
@gitgitgadget
Copy link

gitgitgadget bot commented Mar 21, 2021

On the Git mailing list, "Georgios Kontaxis" wrote (reply to this):

> "Georgios Kontaxis via GitGitGadget" <gitgitgadget@gmail.com>
> writes:
>
>> +    # To disable system wide have in $GITWEB_CONFIG
>> +    # $feature{'email_privacy'}{'default'} = [0];
>> +	'email_privacy' => {
>> +		'sub' => sub { feature_bool('email_privacy', @_) },
>> +		'override' => 0,
>> +		'default' => [1]},
>>  );
>
> I do not see why this should default to true.
>
I've changed the default to "false". V2 should reflect the change.

> It would break existing installations, who have been perfectly happy
> with the convenience of supplying a ready access to potential new
> contributors who/which addresses to contact plausible mentors in the
> projects they are interested in.
>
> And more importantly, I do not see why it should be made impossible
> to override per repository/project in a multi-tenant installation.
> Some projects may be more "privacy" sensitive than others.  Those
> who want to use tighter setting should be able to enable it even
> when the side-wide default is set to false, no?
>
> Thanks.
>
I was actually thinking about the other way around;
preventing projects from disabling this feature.

Sounds like the "override" flag is for other types
of use cases though. I'll change it to "true".

Thanks for the feedback.

@gitgitgadget
Copy link

gitgitgadget bot commented Mar 21, 2021

On the Git mailing list, Junio C Hamano wrote (reply to this):

"Georgios Kontaxis" <geko1702+commits@99rst.org> writes:

>> And more importantly, I do not see why it should be made impossible
>> to override per repository/project in a multi-tenant installation.
>> Some projects may be more "privacy" sensitive than others.  Those
>> who want to use tighter setting should be able to enable it even
>> when the side-wide default is set to false, no?
>>
>> Thanks.
>>
> I was actually thinking about the other way around;
> preventing projects from disabling this feature.

Yes, it cuts both ways, and override should be allowed for most
cases (unless absolutely necessary for healthy operation of the
system) for a simple reason that anybody who sets site-wide default
is not in a better position than those who set per-repository or
per-project setting to judge what is good for them.

Thanks.

@kontaxis
Copy link
Author

/submit

@gitgitgadget
Copy link

gitgitgadget bot commented Mar 21, 2021

Submitted as pull.910.v3.git.1616347731514.gitgitgadget@gmail.com

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git pr-910/kontaxis/kontaxis/email_privacy-v3

To fetch this version to local tag pr-910/kontaxis/kontaxis/email_privacy-v3:

git fetch --no-tags https://github.com/gitgitgadget/git tag pr-910/kontaxis/kontaxis/email_privacy-v3

WARNING: kontaxis has no public email address set on GitHub

@gitgitgadget
Copy link

gitgitgadget bot commented Mar 21, 2021

On the Git mailing list, Ævar Arnfjörð Bjarmason wrote (reply to this):


On Sun, Mar 21 2021, Georgios Kontaxis via GitGitGadget wrote:

> From: Georgios Kontaxis <geko1702+commits@99rst.org>
>
> Gitweb extracts content from the Git log and makes it accessible
> over HTTP. As a result, e-mail addresses found in commits are
> exposed to web crawlers and they may not respect robots.txt.
> This may result in unsolicited messages.
> This is a feature for redacting e-mail addresses
> from the generated HTML, etc. content.
>
> This feature does not prevent someone from downloading the
> unredacted commit log, e.g., by cloning the repository, and
> extracting information from it.
> It aims to hinder the low-effort bulk collection of e-mail
> addresses by web crawlers.
>
> Changes since v1:
> - Turned off the feature by default.
> - Removed duplicate code.
> - Added note about Gitweb consumers receiving redacted logs.
>
> Changes since v2:
> - The feature can be set on a per-project basis. ('override' => 1)
>
> Signed-off-by: Georgios Kontaxis <geko1702+commits@99rst.org>
> ---
>     gitweb: Redacted e-mail addresses feature.
>     
>     Gitweb extracts content from the Git log and makes it accessible over
>     HTTP. As a result, e-mail addresses found in commits are exposed to web
>     crawlers. This may result in unsolicited messages. This is a feature for
>     redacting e-mail addresses from the generated HTML content.
>     
>     This feature does not prevent someone from downloading the unredacted
>     commit log and extracting information from it. It aims to hinder the
>     low-effort bulk collection of e-mail addresses by web crawlers.
>     
>     Signed-off-by: Georgios Kontaxis geko1702+commits@99rst.org
>
> Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-910%2Fkontaxis%2Fkontaxis%2Femail_privacy-v3
> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-910/kontaxis/kontaxis/email_privacy-v3
> Pull-Request: https://github.com/gitgitgadget/git/pull/910
>
> Range-diff vs v2:
>
>  1:  74af11ca8bf2 ! 1:  930cdefe7ee0 gitweb: redacted e-mail addresses feature.
>      @@ Commit message
>           - Removed duplicate code.
>           - Added note about Gitweb consumers receiving redacted logs.
>       
>      +    Changes since v2:
>      +    - The feature can be set on a per-project basis. ('override' => 1)
>      +
>           Signed-off-by: Georgios Kontaxis <geko1702+commits@99rst.org>
>       
>        ## Documentation/gitweb.conf.txt ##
>      @@ gitweb/gitweb.perl: sub evaluate_uri {
>       +	# $feature{'email_privacy'}{'default'} = [1];
>       +	'email_privacy' => {
>       +		'sub' => sub { feature_bool('email_privacy', @_) },
>      -+		'override' => 0,
>      ++		'override' => 1,
>       +		'default' => [0]},
>        );
>        
>
>
>  Documentation/gitweb.conf.txt | 16 +++++++++++++
>  gitweb/gitweb.perl            | 42 ++++++++++++++++++++++++++++++++---
>  2 files changed, 55 insertions(+), 3 deletions(-)
>
> diff --git a/Documentation/gitweb.conf.txt b/Documentation/gitweb.conf.txt
> index 7963a79ba98b..b7af3240177d 100644
> --- a/Documentation/gitweb.conf.txt
> +++ b/Documentation/gitweb.conf.txt
> @@ -896,6 +896,22 @@ same as of the snippet above:
>  It is an error to specify a ref that does not pass "git check-ref-format"
>  scrutiny. Duplicated values are filtered.
>  
> +email_privacy::
> +    Redact e-mail addresses from the generated HTML, etc. content.
> +    This hides e-mail addresses found in the commit log from web crawlers.
> +    Disabled by default.
> ++
> +It is highly recommended to enable this feature unless web crawlers are
> +hindered in some other way. Note that crawlers intent on harvesting e-mail
> +addresses may disregard robots.txt. You can enable this feature like so:
> ++
> +---------------------------------------------------------------------------
> +$feature{'email_privacy'}{'default'} = [1];
> +---------------------------------------------------------------------------
> ++
> +Note that if Gitweb is not the final step in a workflow then subsequent
> +steps may misbehave because of the redacted information they receive.
> +
>  
>  EXAMPLES
>  --------
> diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
> index 0959a782eccb..174cc566d97d 100755
> --- a/gitweb/gitweb.perl
> +++ b/gitweb/gitweb.perl
> @@ -569,6 +569,15 @@ sub evaluate_uri {
>  		'sub' => \&feature_extra_branch_refs,
>  		'override' => 0,
>  		'default' => []},
> +
> +	# Redact e-mail addresses.
> +
> +	# To enable system wide have in $GITWEB_CONFIG
> +	# $feature{'email_privacy'}{'default'} = [1];
> +	'email_privacy' => {
> +		'sub' => sub { feature_bool('email_privacy', @_) },
> +		'override' => 1,
> +		'default' => [0]},
>  );
>  
>  sub gitweb_get_feature {
> @@ -3449,6 +3458,19 @@ sub parse_date {
>  	return %date;
>  }
>  
> [snip]

So in the v1 feedback I suggested:

BEGIN QUOTE
    sub maybe_hide_email {
        my $email = shift;
        return $email unless gitweb_check_feature('email_privacy');
        return hide_email($email);
    }

then:

    $tag{author_email} = maybe_hide_email($2);
END QUOTE

But:

>  sub parse_tag {
>  	my $tag_id = shift;
>  	my %tag;
> @@ -3471,6 +3493,10 @@ sub parse_tag {
>  			if ($tag{'author'} =~ m/^([^<]+) <([^>]*)>/) {
>  				$tag{'author_name'}  = $1;
>  				$tag{'author_email'} = $2;
> +				if (gitweb_check_feature('email_privacy')) {
> +					$tag{'author_email'} = "private";
> +					$tag{'author'} = hide_mailaddr($tag{'author'});
> +				}

This code seems quite awkward, we've already done the regex match, but
this code:

> [snip]
> +sub hide_mailaddr_if_private {
> +	my $line = shift;
> +	return $line unless (gitweb_check_feature('email_privacy') &&
> +						$line =~ m/^([^<]+) <([^>]*)>/);
> +	return hide_mailaddr($line)
> +}
> +
> +sub hide_mailaddr {
> +	my $mailaddr = shift;
> +	$mailaddr =~ s/<([^>]*)>/<private>/;
> +	return $mailaddr;
> +}

Is going to do it again incrementally, and then just act on a
search-replacement if we've got the feature enabled.

It seems much simpler to just turn your:

> +				if (gitweb_check_feature('email_privacy')) {
> +					$tag{'author_email'} = "private";
> +					$tag{'author'} = hide_mailaddr($tag{'author'});
> +				}

Into:

    $tag{'author'} = maybe_hide_mailaddr($tag{author}, \$tag{author_email});

Using:

    sub maybe_hide_email {
        my ($email, $ref) = shift;
        return $email unless gitweb_check_feature('email_privacy');
        $$ref = "private" if $ref;
        return hide_email($email);
    }

Which also works for the case where you don't have a "private" hash key
to assign to. But maybe it overcomplicates things...

>  			} else {
>  				$tag{'author_name'} = $tag{'author'};
>  			}
> @@ -3519,6 +3545,10 @@ sub parse_commit_text {
>  			if ($co{'author'} =~ m/^([^<]+) <([^>]*)>/) {
>  				$co{'author_name'}  = $1;
>  				$co{'author_email'} = $2;
> +				if (gitweb_check_feature('email_privacy')) {
> +					$co{'author_email'} = "private";
> +					$co{'author'} = hide_mailaddr($co{'author'});
> +				}
>  			} else {
>  				$co{'author_name'} = $co{'author'};
>  			}
> @@ -3529,6 +3559,10 @@ sub parse_commit_text {
>  			if ($co{'committer'} =~ m/^([^<]+) <([^>]*)>/) {
>  				$co{'committer_name'}  = $1;
>  				$co{'committer_email'} = $2;
> +				if (gitweb_check_feature('email_privacy')) {
> +					$co{'committer_email'} = "private";
> +					$co{'committer'} = hide_mailaddr($co{'committer'});
> +				}
> [...]
>  			} else {
>  				$co{'committer_name'} = $co{'committer'};
>  			}
> @@ -3568,9 +3602,10 @@ sub parse_commit_text {
>  	if (! defined $co{'title'} || $co{'title'} eq "") {
>  		$co{'title'} = $co{'title_short'} = '(no commit message)';
>  	}
> -	# remove added spaces
> +	# remove added spaces, redact e-mail addresses if applicable.
>  	foreach my $line (@commit_lines) {
>  		$line =~ s/^    //;
> +		$line = hide_mailaddr_if_private($line);
>  	}
>  	$co{'comment'} = \@commit_lines;
>  
> @@ -8060,8 +8095,9 @@ sub git_commitdiff {
>  		close $fd
>  			or print "Reading git-diff-tree failed\n";
>  	} elsif ($format eq 'patch') {
> -		local $/ = undef;
> -		print <$fd>;
> +		while (my $line = <$fd>) {
> +			print hide_mailaddr_if_private($line);
> +		}

Urm, have you tested this? How does a while loop over a <$fd> make sense
when $/ is undef, the readline() operator will always return just one
record, so having a while loop doesn't make sense.

I'm not sure of the input here, but given that if you're expecting to
replace all e-mail addresses on all lines with this function that's not
how it'll work, the s/// doesn't have a /g, so it'll stop at the first
replacement.

>  		close $fd
>  			or print "Reading git-format-patch failed\n";
>  	}
>
> base-commit: a5828ae6b52137b913b978e16cd2334482eb4c1f

@gitgitgadget
Copy link

gitgitgadget bot commented Mar 21, 2021

On the Git mailing list, Junio C Hamano wrote (reply to this):

"Georgios Kontaxis via GitGitGadget" <gitgitgadget@gmail.com>
writes:

> From: Georgios Kontaxis <geko1702+commits@99rst.org>
>
> Gitweb extracts content from the Git log and makes it accessible
> over HTTP. As a result, e-mail addresses found in commits are
> exposed to web crawlers and they may not respect robots.txt.
> This may result in unsolicited messages.
> This is a feature for redacting e-mail addresses
> from the generated HTML, etc. content.
>
> This feature does not prevent someone from downloading the
> unredacted commit log, e.g., by cloning the repository, and
> extracting information from it.
> It aims to hinder the low-effort bulk collection of e-mail
> addresses by web crawlers.

Lines from here ...

>
> Changes since v1:
> - Turned off the feature by default.
> - Removed duplicate code.
> - Added note about Gitweb consumers receiving redacted logs.
>
> Changes since v2:
> - The feature can be set on a per-project basis. ('override' => 1)

... to here are to help reviewers on the mailing list while
iterations of the same change is being reviewed and polished.

But it is useless noise for those who only read "git log".  They
simply do not have access to earlier iterations.

Such "changes between iterations" comments needs to be written after
the three-dash lines.

> Signed-off-by: Georgios Kontaxis <geko1702+commits@99rst.org>
> ---
>     gitweb: Redacted e-mail addresses feature.
>     
>     Gitweb extracts content from the Git log and makes it accessible over
>     HTTP. As a result, e-mail addresses found in commits are exposed to web
>     crawlers. This may result in unsolicited messages. This is a feature for
>     redacting e-mail addresses from the generated HTML content.
>     
>     This feature does not prevent someone from downloading the unredacted
>     commit log and extracting information from it. It aims to hinder the
>     low-effort bulk collection of e-mail addresses by web crawlers.
>     
>     Signed-off-by: Georgios Kontaxis geko1702+commits@99rst.org
>
> Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-910%2Fkontaxis%2Fkontaxis%2Femail_privacy-v3
> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-910/kontaxis/kontaxis/email_privacy-v3
> Pull-Request: https://github.com/gitgitgadget/git/pull/910
>
> Range-diff vs v2:
>
>  1:  74af11ca8bf2 ! 1:  930cdefe7ee0 gitweb: redacted e-mail addresses feature.
>      @@ Commit message
>           - Removed duplicate code.
>           - Added note about Gitweb consumers receiving redacted logs.
>       
>      +    Changes since v2:
>      +    - The feature can be set on a per-project basis. ('override' => 1)
>      +
>           Signed-off-by: Georgios Kontaxis <geko1702+commits@99rst.org>
>       
>        ## Documentation/gitweb.conf.txt ##
>      @@ gitweb/gitweb.perl: sub evaluate_uri {
>       +	# $feature{'email_privacy'}{'default'} = [1];
>       +	'email_privacy' => {
>       +		'sub' => sub { feature_bool('email_privacy', @_) },
>      -+		'override' => 0,
>      ++		'override' => 1,
>       +		'default' => [0]},
>        );
>        
>
>
>  Documentation/gitweb.conf.txt | 16 +++++++++++++
>  gitweb/gitweb.perl            | 42 ++++++++++++++++++++++++++++++++---
>  2 files changed, 55 insertions(+), 3 deletions(-)
>
> diff --git a/Documentation/gitweb.conf.txt b/Documentation/gitweb.conf.txt
> index 7963a79ba98b..b7af3240177d 100644
> --- a/Documentation/gitweb.conf.txt
> +++ b/Documentation/gitweb.conf.txt
> @@ -896,6 +896,22 @@ same as of the snippet above:
>  It is an error to specify a ref that does not pass "git check-ref-format"
>  scrutiny. Duplicated values are filtered.
>  
> +email_privacy::
> +    Redact e-mail addresses from the generated HTML, etc. content.
> +    This hides e-mail addresses found in the commit log from web crawlers.
> +    Disabled by default.
> ++
> +It is highly recommended to enable this feature unless web crawlers are
> +hindered in some other way. Note that crawlers intent on harvesting e-mail
> +addresses may disregard robots.txt.

Up to this line is more-or-less OK, but with a few comments:

 - "This hides ... from web crawlers"?  Doesn't this hide it
   indiscriminately, whether the requester is an interactive
   end-user or a crawler?

 - The name of the configured feature, 'email_privacy'.  There is
   another existing feature with underscore in its name
   (remote_heads), but I think it is an odd-ball mistake we do not
   want to imitate.  Instead, I think the "extra branch refs" may be
   a good example to follow.  The feature name (written in Perl
   code) is "extra-branch-refs" (downcased words with inter-word
   dashes) and the corresponding configuration (I am not saying we
   should add one to support this feature, if we do not have one
   already) is "gitweb.extraBranchRefs" (camelCased words).

 - Do not exaggerate with words like "highly", but trust the
   intelligence of your readers to make the right decision when they
   understand the reason why this feature exists in the first place.

 - I do not think this entry should be added to the end of feature
   list.  How about listing it just after 'avatar', which is also
   about how author/committer identity is shown?

> +---------------------------------------------------------------------------
> +$feature{'email_privacy'}{'default'} = [1];
> +---------------------------------------------------------------------------

I do not think this should be here.  None of the boolean features
listed early in "Features in `%feature`" section like blame,
snapshot, grep, pickaxe, ... features tell readers how to enable a
specific feature like that.

Unless the syntax to configure a feature is one-off oddball that is
different from all other features, we shouldn't clutter the
description with an example like this.

At the beginning of the section "Configuring Gitweb Features" there
is a general description and that should be sufficient (and if it is
not sufficient, it is OK to add an example to that section).

> +Note that if Gitweb is not the final step in a workflow then subsequent
> +steps may misbehave because of the redacted information they receive.

In other words, this will break crawlers that expect email addresses
are there and want to use it for some good purpose.  But that is an
natural consequence of the "feature", and should be described as
such when we said "Redact e-mail addresses", not as a "by the way"
mention in a footnote.

	... after reading more, readers realize that the damage is
	far worse in the current incarnation of the patch, but let's
	read on ...

> @@ -3449,6 +3458,19 @@ sub parse_date {
>  	return %date;
>  }
>  
> +sub hide_mailaddr_if_private {
> +	my $line = shift;
> +	return $line unless (gitweb_check_feature('email_privacy') &&
> +						$line =~ m/^([^<]+) <([^>]*)>/);

I find that the second line is way too deeply indented.  Wouldn't

> +	return $line unless (gitweb_check_feature('email_privacy') &&
> +				$line =~ m/^([^<]+) <([^>]*)>/);

be enough?

> +	return hide_mailaddr($line)
> +}
> +
> +sub hide_mailaddr {
> +	my $mailaddr = shift;
> +	$mailaddr =~ s/<([^>]*)>/<private>/;

s/private/redacted/ perhaps?

> @@ -3568,9 +3602,10 @@ sub parse_commit_text {
>  	if (! defined $co{'title'} || $co{'title'} eq "") {
>  		$co{'title'} = $co{'title_short'} = '(no commit message)';
>  	}
> -	# remove added spaces
> +	# remove added spaces, redact e-mail addresses if applicable.
>  	foreach my $line (@commit_lines) {
>  		$line =~ s/^    //;
> +		$line = hide_mailaddr_if_private($line);

A commit log that ends with

	Thank you, A <a@example.com> and B <b@example.com>, for discovering
	this bug and quickly proposing a solution.

would recact only the first one but not the other, I suspect.

Much more problematic is that I see too many hits from:

    $ git log v2.30.0..v2.31.0 | grep ' <[^>@]*>'

That is, "find a line in the commit log message that has a SP, open <bra,
run of characters that are not ket> or at@sign, closed with ket>".  These
are clearly not e-mail addresses.

This "feature" butchers a commit title:

    mktag doc: say <hash> not <sha1>

to

    mktag doc: say <private> not <sha1>

doesn't it?

> @@ -8060,8 +8095,9 @@ sub git_commitdiff {
>  		close $fd
>  			or print "Reading git-diff-tree failed\n";
>  	} elsif ($format eq 'patch') {
> -		local $/ = undef;
> -		print <$fd>;
> +		while (my $line = <$fd>) {
> +			print hide_mailaddr_if_private($line);
> +		}

And this is even worse.

 $ git log -p --no-merges v2.30.0..v2.31.0 | grep ' <[^>@]*>' | wc -l

gives me ~4700 hits.  Because the patch text as well as the log
message is munged, the "feature" makes the output utterly unreliable
as an input to "git am" by munging too much.  Interesting examples
are like these:

        -	for (i = 0; i < pairs->nr; ++i) {
                'git <command> [<revision>...] -- [<file>...]'
        +	for (i = split; i < geometry->pack_nr; i++) {

Now, I am *NOT* saying that you should tighten the e-mail address
syntax detection and keep munging the patch text.  The lines that
begin with minus and SP in a patch must match the preimage text,
so munging out existing e-mail addresses from them will make the
patch fail to find the part that needs to be modified.  And
replacing an e-mail address on a line that begins with plus would
redact it from the source---if they wrote an address, they must have
meant it to be available to those who consume the source, so I doubt
the wisdom of munging the patch part at all.

I may be sympathetic to the cause of the patch, but, I do not agree
with its execution in this iteration of the patch.

Thanks.

@gitgitgadget
Copy link

gitgitgadget bot commented Mar 31, 2021

This patch series was integrated into seen via git@5de382a.

@gitgitgadget
Copy link

gitgitgadget bot commented Apr 2, 2021

This patch series was integrated into seen via git@04eda8c.

@gitgitgadget
Copy link

gitgitgadget bot commented Apr 2, 2021

This patch series was integrated into seen via git@29646b5.

@gitgitgadget
Copy link

gitgitgadget bot commented Apr 4, 2021

This patch series was integrated into seen via git@fc06f33.

@gitgitgadget
Copy link

gitgitgadget bot commented Apr 6, 2021

On the Git mailing list, Junio C Hamano wrote (reply to this):

Junio C Hamano <gitster@pobox.com> writes:

> "Georgios Kontaxis via GitGitGadget" <gitgitgadget@gmail.com>
> writes:
>
>> From: Georgios Kontaxis <geko1702+commits@99rst.org>
>>
>> Gitweb extracts content from the Git log and makes it accessible
>> over HTTP. As a result, e-mail addresses found in commits are
>> exposed to web crawlers and they may not respect robots.txt.
>> This can result in unsolicited messages.
>> ...
> Will queue as-is.  Input from those who are more adept at Perl
> and/or interested in helping polish gitweb are still welcome, but at
> my level of interest on the topic, this version looks as good as it
> gets ;-)

Comments by anybody who is more adept at Perl and/or more well
versed in gitweb and e-mail privacy issues?

Otherwise I am inclined to merge it down to 'next' as-is in a couple
of days.

Thanks.

@gitgitgadget
Copy link

gitgitgadget bot commented Apr 6, 2021

This patch series was integrated into seen via git@a60fc85.

@gitgitgadget
Copy link

gitgitgadget bot commented Apr 6, 2021

There was a status update about the branch gk/gitweb-redacted-email on the Git mailing list:

"gitweb" learned "e-mail privacy" feature to redact strings that
look like e-mail addresses on various pages.

Waiting for reviews.
cf. <xmqq5z19k9wu.fsf@gitster.g>

@gitgitgadget
Copy link

gitgitgadget bot commented Apr 7, 2021

This patch series was integrated into seen via git@4d16bbb.

@gitgitgadget
Copy link

gitgitgadget bot commented Apr 8, 2021

This patch series was integrated into seen via git@c6ee6e2.

@gitgitgadget
Copy link

gitgitgadget bot commented Apr 8, 2021

On the Git mailing list, Eric Wong wrote (reply to this):

Georgios Kontaxis <geko1702+commits@99rst.org> wrote:
> > Georgios Kontaxis via GitGitGadget <gitgitgadget@gmail.com> wrote:
> >> Introduce an 'email-privacy' feature which redacts e-mail addresses
> >> from the generated HTML content
> >
> Eric Wong wrote:
> > A general reply to the topic: have you considered munging
> > addresses in a way that is still human readable, but obviously
> > obfuscated?
> >
> > On some other project, I settled on HTML "&#8226;" as a replacement
> > for '.' for admins who enable that option.  The $USER@$NO_DOT
> > remains as-is for easy identification+recognition of hosts.
> >
> Thanks for the suggestion.
> 
> People have been trying to hinder address harvesting for a while now.
> Replacing '@' with "at", the dot with "dot", adding spaces, etc.
> was pretty common at some point. May still be.
> I would expect crawlers to have caught up and this includes
> all sorts of character encodings and unicode look-alike substitutions.

I figure the crawlers hit a combinatorial explosion and
give up since they'd be wasting time with false-positives.

> > I also considered Unicode homographs which can look identical
> > to replacement characters, too; but rejected that idea since
> > it would cause grief for legitimate users who would not notice
> > the homograph when pasting into their mail client.

As a data point, none of the homograph@ candidates I posted here
on Mar 29 have attracted any attempts on my mail server.

@gitgitgadget
Copy link

gitgitgadget bot commented Apr 8, 2021

This patch series was integrated into seen via git@dac300c.

@gitgitgadget
Copy link

gitgitgadget bot commented Apr 8, 2021

On the Git mailing list, Junio C Hamano wrote (reply to this):

Eric Wong <e@80x24.org> writes:

> Georgios Kontaxis <geko1702+commits@99rst.org> wrote:
>> > Georgios Kontaxis via GitGitGadget <gitgitgadget@gmail.com> wrote:
>> >> Introduce an 'email-privacy' feature which redacts e-mail addresses
>> >> from the generated HTML content
>> >
>> Eric Wong wrote:
>> > A general reply to the topic: have you considered munging
>> > addresses in a way that is still human readable, but obviously
>> > obfuscated?
>> ...
>> > I also considered Unicode homographs which can look identical
>> > to replacement characters, too; but rejected that idea since
>> > it would cause grief for legitimate users who would not notice
>> > the homograph when pasting into their mail client.
>
> As a data point, none of the homograph@ candidates I posted here
> on Mar 29 have attracted any attempts on my mail server.

That is an interesting observation.  All homograph@ non-addresses,
if a human corrected the funnies in their spelling, would have hit
whoever handles @80x24.org mailboxes.

I take it to mean that as a future direction, replacing <redacted>
with the obfuscated-but-readable-by-humans homographs is a likely
improvement that would help human users while still inconveniencing
the crawlers.  It may however need some provision to prevent casual
end-users from cutting-and-pasting these homographs, as you said in
your original mention of the homograph approach.

But other than that, does the patch look reasonable?

Thanks.

@gitgitgadget
Copy link

gitgitgadget bot commented Apr 8, 2021

On the Git mailing list, Eric Wong wrote (reply to this):

Junio C Hamano <gitster@pobox.com> wrote:
> Eric Wong <e@80x24.org> writes:
> > As a data point, none of the homograph@ candidates I posted here
> > on Mar 29 have attracted any attempts on my mail server.
> 
> That is an interesting observation.  All homograph@ non-addresses,
> if a human corrected the funnies in their spelling, would have hit
> whoever handles @80x24.org mailboxes.
> 
> I take it to mean that as a future direction, replacing <redacted>
> with the obfuscated-but-readable-by-humans homographs is a likely
> improvement that would help human users while still inconveniencing
> the crawlers.  It may however need some provision to prevent casual
> end-users from cutting-and-pasting these homographs, as you said in
> your original mention of the homograph approach.

Yes, exactly.

> But other than that, does the patch look reasonable?

I only took a cursory glance at it, but v6 seemed fine.

@gitgitgadget
Copy link

gitgitgadget bot commented Apr 8, 2021

This patch series was integrated into seen via git@2fe6eb5.

@gitgitgadget
Copy link

gitgitgadget bot commented Apr 8, 2021

There was a status update about the branch gk/gitweb-redacted-email on the Git mailing list:

"gitweb" learned "e-mail privacy" feature to redact strings that
look like e-mail addresses on various pages.

Waiting for reviews.
cf. <xmqq5z19k9wu.fsf@gitster.g>

@gitgitgadget
Copy link

gitgitgadget bot commented Apr 8, 2021

On the Git mailing list, Ævar Arnfjörð Bjarmason wrote (reply to this):


On Mon, Mar 29 2021, Georgios Kontaxis via GitGitGadget wrote:

> [...]
> +email-privacy::
> +	Redact e-mail addresses from the generated HTML, etc. content.
> +	This obscures e-mail addresses retrieved from the author/committer
> +	and comment sections of the Git log.
> +	It is meant to hinder web crawlers that harvest and abuse addresses.
> +	Such crawlers may not respect robots.txt.
> +	Note that users and user tools also see the addresses as redacted.
> +	If Gitweb is not the final step in a workflow then subsequent steps
> +	may misbehave because of the redacted information they receive.
> +	Disabled by default.
> +
>  highlight::
>  	Server-side syntax highlight support in "blob" view.  It requires
>  	`$highlight_bin` program to be available (see the description of
> diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
> index 0959a782eccb..01c6faf88006 100755
> --- a/gitweb/gitweb.perl
> +++ b/gitweb/gitweb.perl
> @@ -569,6 +569,15 @@ sub evaluate_uri {
>  		'sub' => \&feature_extra_branch_refs,
>  		'override' => 0,
>  		'default' => []},
> +
> +	# Redact e-mail addresses.
> +
> +	# To enable system wide have in $GITWEB_CONFIG
> +	# $feature{'email-privacy'}{'default'} = [1];
> +	'email-privacy' => {
> +		'sub' => sub { feature_bool('email-privacy', @_) },
> +		'override' => 1,
> +		'default' => [0]},
>  );
>  
>  sub gitweb_get_feature {
> @@ -3449,6 +3458,13 @@ sub parse_date {
>  	return %date;
>  }
>  
> +sub hide_mailaddrs_if_private {
> +	my $line = shift;
> +	return $line unless gitweb_check_feature('email-privacy');
> +	$line =~ s/<[^@>]+@[^>]+>/<redacted>/ig;

The /i here is redundant, since you have nothing that'll case-fold on
the LHS of the s///, doesn't harm anything either. Just a small note
since it's new in v6...

> +	return $line;
> +}
> +
>  sub parse_tag {
>  	my $tag_id = shift;
>  	my %tag;
> @@ -3465,7 +3481,7 @@ sub parse_tag {
>  		} elsif ($line =~ m/^tag (.+)$/) {
>  			$tag{'name'} = $1;
>  		} elsif ($line =~ m/^tagger (.*) ([0-9]+) (.*)$/) {
> -			$tag{'author'} = $1;
> +			$tag{'author'} = hide_mailaddrs_if_private($1);
>  			$tag{'author_epoch'} = $2;
>  			$tag{'author_tz'} = $3;
>  			if ($tag{'author'} =~ m/^([^<]+) <([^>]*)>/) {
> @@ -3513,7 +3529,7 @@ sub parse_commit_text {
>  		} elsif ((!defined $withparents) && ($line =~ m/^parent ($oid_regex)$/)) {
>  			push @parents, $1;
>  		} elsif ($line =~ m/^author (.*) ([0-9]+) (.*)$/) {
> -			$co{'author'} = to_utf8($1);
> +			$co{'author'} = hide_mailaddrs_if_private(to_utf8($1));
>  			$co{'author_epoch'} = $2;
>  			$co{'author_tz'} = $3;
>  			if ($co{'author'} =~ m/^([^<]+) <([^>]*)>/) {
> @@ -3523,7 +3539,7 @@ sub parse_commit_text {
>  				$co{'author_name'} = $co{'author'};
>  			}
>  		} elsif ($line =~ m/^committer (.*) ([0-9]+) (.*)$/) {
> -			$co{'committer'} = to_utf8($1);
> +			$co{'committer'} = hide_mailaddrs_if_private(to_utf8($1));
>  			$co{'committer_epoch'} = $2;
>  			$co{'committer_tz'} = $3;
>  			if ($co{'committer'} =~ m/^([^<]+) <([^>]*)>/) {
> @@ -3568,9 +3584,10 @@ sub parse_commit_text {
>  	if (! defined $co{'title'} || $co{'title'} eq "") {
>  		$co{'title'} = $co{'title_short'} = '(no commit message)';
>  	}
> -	# remove added spaces
> +	# remove added spaces, redact e-mail addresses if applicable.
>  	foreach my $line (@commit_lines) {
>  		$line =~ s/^    //;
> +		$line = hide_mailaddrs_if_private($line);
>  	}
>  	$co{'comment'} = \@commit_lines;
>  
> @@ -7489,7 +7506,8 @@ sub git_log_generic {
>  			         -accesskey => "n", -title => "Alt-n"}, "next");
>  	}
>  	my $patch_max = gitweb_get_feature('patches');
> -	if ($patch_max && !defined $file_name) {
> +	if ($patch_max && !defined $file_name &&
> +		!gitweb_check_feature('email-privacy')) {
>  		if ($patch_max < 0 || @commitlist <= $patch_max) {
>  			$paging_nav .= " &sdot; " .
>  				$cgi->a({-href => href(action=>"patches", -replay=>1)},
> @@ -7550,7 +7568,8 @@ sub git_commit {
>  			} @$parents ) .
>  			')';
>  	}
> -	if (gitweb_check_feature('patches') && @$parents <= 1) {
> +	if (gitweb_check_feature('patches') && @$parents <= 1 &&
> +		!gitweb_check_feature('email-privacy')) {
>  		$formats_nav .= " | " .
>  			$cgi->a({-href => href(action=>"patch", -replay=>1)},
>  				"patch");
> @@ -7863,7 +7882,8 @@ sub git_commitdiff {
>  		$formats_nav =
>  			$cgi->a({-href => href(action=>"commitdiff_plain", -replay=>1)},
>  			        "raw");
> -		if ($patch_max && @{$co{'parents'}} <= 1) {
> +		if ($patch_max && @{$co{'parents'}} <= 1 &&
> +			!gitweb_check_feature('email-privacy')) {
>  			$formats_nav .= " | " .
>  				$cgi->a({-href => href(action=>"patch", -replay=>1)},
>  					"patch");

I didn't run this, and hadn't kept up for a few rounds. I'm happy to see
the pos/while etc. looping gone, this LGTM.

@gitgitgadget
Copy link

gitgitgadget bot commented Apr 8, 2021

On the Git mailing list, Ævar Arnfjörð Bjarmason wrote (reply to this):


On Thu, Apr 08 2021, Eric Wong wrote:

> Junio C Hamano <gitster@pobox.com> wrote:
>> Eric Wong <e@80x24.org> writes:
>> > As a data point, none of the homograph@ candidates I posted here
>> > on Mar 29 have attracted any attempts on my mail server.
>> 
>> That is an interesting observation.  All homograph@ non-addresses,
>> if a human corrected the funnies in their spelling, would have hit
>> whoever handles @80x24.org mailboxes.
>> 
>> I take it to mean that as a future direction, replacing <redacted>
>> with the obfuscated-but-readable-by-humans homographs is a likely
>> improvement that would help human users while still inconveniencing
>> the crawlers.  It may however need some provision to prevent casual
>> end-users from cutting-and-pasting these homographs, as you said in
>> your original mention of the homograph approach.
>
> Yes, exactly.
>
>> But other than that, does the patch look reasonable?
>
> I only took a cursory glance at it, but v6 seemed fine.

Ditto, I left a small nit comment about a needless /i in a regex, but I
don't think that needs a re-roll.

@gitgitgadget
Copy link

gitgitgadget bot commented Apr 8, 2021

On the Git mailing list, Junio C Hamano wrote (reply to this):

Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:

>> +sub hide_mailaddrs_if_private {
>> +	my $line = shift;
>> +	return $line unless gitweb_check_feature('email-privacy');
>> +	$line =~ s/<[^@>]+@[^>]+>/<redacted>/ig;
>
> The /i here is redundant, since you have nothing that'll case-fold on
> the LHS of the s///, doesn't harm anything either. Just a small note
> since it's new in v6...

True.  If it were left original version that was suggested during
the review, e.g.

	s/<[^@]+@[-.a-z0-9]+>/<red@cted>/ig;

the /i/ would have been needed, but without the "after the @-sign
should be a run of DNS-valid characters", I agree that there is no
need.

Thanks.  Will locally tweak.

@gitgitgadget
Copy link

gitgitgadget bot commented Apr 8, 2021

On the Git mailing list, Junio C Hamano wrote (reply to this):

Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:

> On Thu, Apr 08 2021, Eric Wong wrote:
>
>> Junio C Hamano <gitster@pobox.com> wrote:
>>> Eric Wong <e@80x24.org> writes:
>>> > As a data point, none of the homograph@ candidates I posted here
>>> > on Mar 29 have attracted any attempts on my mail server.
>>> 
>>> That is an interesting observation.  All homograph@ non-addresses,
>>> if a human corrected the funnies in their spelling, would have hit
>>> whoever handles @80x24.org mailboxes.
>>> 
>>> I take it to mean that as a future direction, replacing <redacted>
>>> with the obfuscated-but-readable-by-humans homographs is a likely
>>> improvement that would help human users while still inconveniencing
>>> the crawlers.  It may however need some provision to prevent casual
>>> end-users from cutting-and-pasting these homographs, as you said in
>>> your original mention of the homograph approach.
>>
>> Yes, exactly.
>>
>>> But other than that, does the patch look reasonable?
>>
>> I only took a cursory glance at it, but v6 seemed fine.
>
> Ditto, I left a small nit comment about a needless /i in a regex, but I
> don't think that needs a re-roll.

Thanks, both.

Will tweak the /i out, and re-queue with acked-by from you two.

@arwababes2008

This comment has been minimized.

@gitgitgadget
Copy link

gitgitgadget bot commented Apr 9, 2021

This patch series was integrated into seen via git@2d5e496.

@gitgitgadget
Copy link

gitgitgadget bot commented Apr 9, 2021

This patch series was integrated into next via git@8a19c3c.

@gitgitgadget gitgitgadget bot added the next label Apr 9, 2021
@gitgitgadget
Copy link

gitgitgadget bot commented Apr 13, 2021

There was a status update in the "Cooking" section about the branch gk/gitweb-redacted-email on the Git mailing list:

"gitweb" learned "e-mail privacy" feature to redact strings that
look like e-mail addresses on various pages.

Waiting for reviews.
cf. <xmqq5z19k9wu.fsf@gitster.g>

@gitgitgadget
Copy link

gitgitgadget bot commented Apr 14, 2021

This patch series was integrated into seen via git@a9414b8.

@gitgitgadget
Copy link

gitgitgadget bot commented Apr 14, 2021

This patch series was integrated into next via git@a9414b8.

@gitgitgadget
Copy link

gitgitgadget bot commented Apr 14, 2021

This patch series was integrated into master via git@a9414b8.

@gitgitgadget gitgitgadget bot added the master label Apr 14, 2021
@gitgitgadget gitgitgadget bot closed this Apr 14, 2021
@gitgitgadget
Copy link

gitgitgadget bot commented Apr 14, 2021

Closed via a9414b8.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants