Skip to content

HTTPS clone URL

Subversion checkout URL

You can clone with
or
.
Download ZIP
Browse files

another mass update

* tests
* anchor mailbox at the end with ;|,|$ otherwise we could match less
* jump over comments and quoted strings in recover from error code
* documentation
  • Loading branch information...
commit 4341cba0fe0fb1207f914bf83f26e9bec1142bb4 1 parent 2763bb5
@ruz ruz authored
View
119 README
@@ -2,46 +2,85 @@ NAME
Email::Address::List - RFC close address list parsing
DESCRIPTION
- Parser for From, To, Cc, Bcc, Reply-To, Sender and previouse prefixed
- with Resent- prefix (eg Resent-From).
+ Parser for From, To, Cc, Bcc, Reply-To, Sender and previous prefixed
+ with Resent- (eg Resent-From) headers.
REASONING
- Email::Address is good at parsing addresses out of text and this module
- uses regexpes from the module to parse addresses.
-
- mailbox = name-addr / addr-spec
- name-addr = [display-name] angle-addr
- angle-addr = [CFWS] "<" addr-spec ">" [CFWS] / obs-angle-addr
- display-name = phrase
-
- from = "From:" mailbox-list CRLF
- sender = "Sender:" mailbox CRLF
- reply-to = "Reply-To:" address-list CRLF
-
- to = "To:" address-list CRLF
- cc = "Cc:" address-list CRLF
- bcc = "Bcc:" [address-list / CFWS] CRLF
-
- resent-from = "Resent-From:" mailbox-list CRLF
- resent-sender = "Resent-Sender:" mailbox CRLF
- resent-to = "Resent-To:" address-list CRLF
- resent-cc = "Resent-Cc:" address-list CRLF
- resent-bcc = "Resent-Bcc:" [address-list / CFWS] CRLF
-
- obs-from = "From" *WSP ":" mailbox-list CRLF
- obs-sender = "Sender" *WSP ":" mailbox CRLF
- obs-reply-to = "Reply-To" *WSP ":" address-list CRLF
-
- obs-to = "To" *WSP ":" address-list CRLF
- obs-cc = "Cc" *WSP ":" address-list CRLF
- obs-bcc = "Bcc" *WSP ":" (address-list / (*([CFWS] ",") [CFWS])) CRLF
-
- obs-resent-from = "Resent-From" *WSP ":" mailbox-list CRLF
- obs-resent-send = "Resent-Sender" *WSP ":" mailbox CRLF
- obs-resent-date = "Resent-Date" *WSP ":" date-time CRLF
- obs-resent-to = "Resent-To" *WSP ":" address-list CRLF
- obs-resent-cc = "Resent-Cc" *WSP ":" address-list CRLF
- obs-resent-bcc = "Resent-Bcc" *WSP ":" (address-list / (*([CFWS] ",") [CFWS])) CRLF
- obs-resent-mid = "Resent-Message-ID" *WSP ":" msg-id CRLF
- obs-resent-rply = "Resent-Reply-To" *WSP ":" address-list CRLF
+ Email::Address is good at parsing addresses out of any text even
+ mentioned headers and this module is derived work from Email::Address.
+
+ However, mentioned headers are structured and contain lists of
+ addresses. Most of the time you want to parse it from start to end
+ keeping every bit even if it's a invalid input.
+
+METHODS
+ parse
+ A class method that takes a header value (w/o name and :) and a set of
+ named options. See below.
+
+ Returns list of hashes. Each hash at least has 'type' key that describes
+ the entry. Types:
+
+ mailbox
+ A mailbox entry with Email::Address object under value key.
+
+ If mailbox has obsolete parts then 'obsolete' is true.
+
+ If address (not display-name/phrase or comments, but
+ local-part@domain) contains not ASCII chars then 'not_ascii' is set
+ to true. According to RFC 5322 not ASCII chars are not allowed
+ within mailbox. However, there are no big problems if those are used
+ and actually RFC 6532 extends a few rules from 5322 with
+ UTF8-non-ascii. Either use the feature or just skip such addresses
+ with skip_not_ascii option.
+
+ group start
+ Some headers with mailboxes may contain groupped addresses. This
+ element is returned for position where group starts. Under value key
+ you find name of the group. NOTE that value is not post processed at
+ the moment, so it may contain spaces, comments, quoted strings and
+ other noise. Author willing to take patches and warns that this will
+ be changed at some point without additional notifications, so if you
+ need groups info then you better send a patch :)
+
+ Groups can not be nested, but one field may have multiple groups or
+ mix of addresses that are in a group and not in any.
+
+ See skip_groups option.
+
+ group end
+ Returned when a group ends.
+
+ comment
+ Obsolete syntax allows to use standalone comments between mailboxes
+ that can not be addressed to any mailbox. In such situations a
+ comment returned as an entry of this type. Comment itself is under
+ value.
+
+ unknown
+ Returned if parser met something that shouldn't be there. Parser
+ tries to recover by jumping over to next comma (or semicolon if
+ inside group) that is out quoted string or comment, so "foo, bar,
+ baz" string results in three unknown entries. Jumping over comments
+ and quoted strings means that parser is very sensitive to unbalanced
+ quotes and parens, but it's on purpose.
+
+ It can be controlled which elements are skipped, for example:
+
+ Email::Address::List->parse($line, skip_unknown => 1, ...);
+
+ skip_comments
+ Skips comments between mailboxes. Comments inside and next to a
+ mailbox are not skipped, but returned as part of mailbox entry.
+
+ skip_not_ascii
+ Skips mailboxes where address part has not ASCII characters.
+
+ skip_groups
+ Skips group starts and end elements, however emails within groups
+ are still returned.
+
+ skip_unknown
+ Skip anything that is not recognizable. It still tries to recover as
+ described earlier.
View
168 lib/Email/Address/List.pm
@@ -20,6 +20,100 @@ L<Email::Address> is good at parsing addresses out of any text
even mentioned headers and this module is derived work
from Email::Address.
+However, mentioned headers are structured and contain lists
+of addresses. Most of the time you want to parse it from start
+to end keeping every bit even if it's a invalid input.
+
+=head1 METHODS
+
+=head2 parse
+
+A class method that takes a header value (w/o name and :) and
+a set of named options. See below.
+
+Returns list of hashes. Each hash at least has 'type' key that
+describes the entry. Types:
+
+=over 4
+
+=item mailbox
+
+A mailbox entry with L<Email::Address> object under value key.
+
+If mailbox has obsolete parts then 'obsolete' is true.
+
+If address (not display-name/phrase or comments, but
+local-part@domain) contains not ASCII chars then 'not_ascii' is
+set to true. According to RFC 5322 not ASCII chars are not
+allowed within mailbox. However, there are no big problems if
+those are used and actually RFC 6532 extends a few rules
+from 5322 with UTF8-non-ascii. Either use the feature or just
+skip such addresses with skip_not_ascii option.
+
+=item group start
+
+Some headers with mailboxes may contain groupped addresses. This
+element is returned for position where group starts. Under value
+key you find name of the group. B<NOTE> that value is not post
+processed at the moment, so it may contain spaces, comments,
+quoted strings and other noise. Author willing to take patches
+and warns that this will be changed at some point without additional
+notifications, so if you need groups info then you better send a
+patch :)
+
+Groups can not be nested, but one field may have multiple groups or
+mix of addresses that are in a group and not in any.
+
+See skip_groups option.
+
+=item group end
+
+Returned when a group ends.
+
+=item comment
+
+Obsolete syntax allows to use standalone comments between mailboxes
+that can not be addressed to any mailbox. In such situations a comment
+returned as an entry of this type. Comment itself is under value.
+
+=item unknown
+
+Returned if parser met something that shouldn't be there. Parser
+tries to recover by jumping over to next comma (or semicolon if inside
+group) that is out quoted string or comment, so "foo, bar, baz" string
+results in three unknown entries. Jumping over comments and quoted strings
+means that parser is very sensitive to unbalanced quotes and parens,
+but it's on purpose.
+
+=back
+
+It can be controlled which elements are skipped, for example:
+
+ Email::Address::List->parse($line, skip_unknown => 1, ...);
+
+=over 4
+
+=item skip_comments
+
+Skips comments between mailboxes. Comments inside and next to a mailbox
+are not skipped, but returned as part of mailbox entry.
+
+=item skip_not_ascii
+
+Skips mailboxes where address part has not ASCII characters.
+
+=item skip_groups
+
+Skips group starts and end elements, however emails within groups are
+still returned.
+
+=item skip_unknown
+
+Skip anything that is not recognizable. It still tries to recover as
+described earlier.
+
+=back
+
=cut
# mailbox = name-addr / addr-spec
@@ -81,7 +175,7 @@ for (1 .. $COMMENT_NEST_LEVEL) {
$RE{'cfws'} = qr/$RE{'comment'}|\s+/;
$RE{'qcontent'} = qr/$RE{'qtext'}|$RE{'quoted_pair'}/;
-$RE{'quoted_string'} = qr/$RE{'cfws'}*"$RE{'qcontent'}+"$RE{'cfws'}*/;
+$RE{'quoted-string'} = qr/$RE{'cfws'}*"$RE{'qcontent'}+"$RE{'cfws'}*/;
$RE{'atom'} = qr/$RE{'cfws'}*$RE{'atext'}+$RE{'cfws'}*/;
@@ -91,12 +185,11 @@ $RE{'display-name'} = $RE{'phrase'};
$RE{'dot_atom_text'} = qr/$RE{'atext'}+(?:\.$RE{'atext'}+)*/;
$RE{'dot_atom'} = qr/$RE{'cfws'}*$RE{'dot_atom_text'}$RE{'cfws'}*/;
-$RE{'local-part'} = qr/$RE{'dot_atom'}|$RE{'quoted_string'}/;
+$RE{'local-part'} = qr/$RE{'dot_atom'}|$RE{'quoted-string'}/;
$RE{'dcontent'} = qr/$RE{'dtext'}|$RE{'quoted_pair'}/;
$RE{'domain_literal'} = qr/$RE{'cfws'}*\[(?:\s*$RE{'dcontent'})*\s*\]$RE{'cfws'}*/;
$RE{'domain'} = qr/$RE{'dot_atom'}|$RE{'domain_literal'}/;
-$RE{'obs-domain'} = qr/$RE{'atom'}(?:\.$RE{'atom'})*|$RE{'domain_literal'}/;
$RE{'addr-spec'} = qr/$RE{'local-part'}\@$RE{'domain'}/;
$RE{'angle-addr'} = qr/$RE{'cfws'}* < $RE{'addr-spec'} > $RE{'cfws'}*/x;
@@ -145,7 +238,11 @@ sub parse {
# in obs- case we have number of optional comments/spaces/
# address-list = (address *("," address)) / obs-addr-list
# obs-addr-list = *([CFWS] ",") address *("," [address / CFWS]))
- $line =~ s/^(?:$RE{'cfws'}?,)+//o;
+ if ( $line =~ s/^(?:($RE{'cfws'})?,)//o ) {
+ push @res, {type => 'comment', value => $1 }
+ if $1 && !$args{'skip_comments'} && $1 =~ /($RE{'comment'})/;
+ next;
+ }
$line =~ s/^\s+//o;
# now it's only comma separated address where address is:
@@ -155,10 +252,13 @@ sub parse {
# group = display-name ":" [group-list] ";" [CFWS]
# group-list = mailbox-list / CFWS / obs-group-list
# obs-group-list = 1*([CFWS] ",") [CFWS])
- if ( !$in_group && $line =~ s/^$RE{'display-name'}://o ) {
+ if ( !$in_group && $line =~ s/^($RE{'display-name'})://o ) {
+ push @res, {type => 'group start', value => $1 }
+ unless $args{'skip_groups'};
$in_group = 1; next;
}
if ( $in_group && $line =~ s/^;// ) {
+ push @res, {type => 'group end'} unless $args{'skip_groups'};
$in_group = 0; next;
}
@@ -168,33 +268,43 @@ sub parse {
# so address-list is now comma separated list of mailboxes:
# address-list = (mailbox *("," mailbox))
- if ( $line =~ s/^($CRE{'mailbox'})//o ) {
- my ($original, $phrase, $user, $host, @comments) = $self->_process_mailbox(
- $1,$2,$3,$4,$5,$6,$7
- );
- push @res, Email::Address->new(
- $phrase, "$user\@$host", join(' ', grep defined, @comments),
- $original,
- );
- next;
- }
- elsif ( $line =~ s/^($CRE{'obs-mailbox'})//o ) {
+ my $obsolete = 0;
+ if ( $line =~ s/^($CRE{'mailbox'})($RE{cfws}*)(?=,|;|$)//o
+ || ($line =~ s/^($CRE{'obs-mailbox'})($RE{cfws}*)(?=,|;|$)//o and $obsolete = 1)
+ ) {
my ($original, $phrase, $user, $host, @comments) = $self->_process_mailbox(
- $1,$2,$3,$4,$5,$6,$7
- );
- push @res, Email::Address->new(
- $phrase, "$user\@$host", join(' ', grep defined, @comments),
- $original,
+ $1,$2,$3,$4,$5,$6,$7,$8,$9
);
+ my $not_ascii = "$user\@$host" =~ /\P{ASCII}/? 1 : 0;
+ next if $not_ascii && $args{skip_not_ascii};
+
+ push @res, {
+ type => 'mailbox',
+ value => Email::Address->new(
+ $phrase, "$user\@$host", join(' ', @comments),
+ $original,
+ ),
+ obsolete => $obsolete,
+ not_ascii => $not_ascii,
+ };
next;
}
# if we got here then something unknown on our way
# try to recorver
- if ( $line =~ s/^(.+?)\s*(?=(;)|,|$)//o ) {
- push @res, { type => 'unknown', value => $1 };
- if ($2) { $in_group = 1 }
+ if ($in_group) {
+ if ( $line =~ s/^([^;,"\)]*(?:(?:$RE{'quoted-string'}|$RE{'comment'})[^;,"\)]*)*)(?=;|,)//o ) {
+ push @res, { type => 'unknown', value => $1 } unless $args{'skip_unknown'};
+ next;
+ }
+ } else {
+ if ( $line =~ s/^([^,"\)]*(?:(?:$RE{'quoted-string'}|$RE{'comment'})[^,"\)]*)*)(?=,)//o ) {
+ push @res, { type => 'unknown', value => $1 } unless $args{'skip_unknown'};
+ next;
+ }
}
+ push @res, { type => 'unknown', value => $line } unless $args{'skip_unknown'};
+ last;
}
return @res;
}
@@ -217,8 +327,8 @@ sub _process_mailbox {
my @comments;
foreach ( grep defined, splice @rest ) {
- s{ ($RE{'quoted_string'}) | ($RE{comment}) }
- { $1? $1 : do { push @comments, $2; $comments[-1]=~/^\s|\s$/? ' ' : '' } }xgoe;
+ s{ ($RE{'quoted-string'}) | ($RE{comment}) }
+ { $1? $1 : do { push @comments, $2; $comments[-1] =~ /^\s|\s$/? ' ' : '' } }xgoe;
s/^\s+//; s/\s+$//;
next unless length;
@@ -227,13 +337,13 @@ sub _process_mailbox {
my ($host, $user, $phrase) = reverse @rest;
# deal with spaces out of quoted strings
- s{ ($RE{'quoted_string'}) | \s+ }{ $1? $1 : ' ' }xgoe
+ s{ ($RE{'quoted-string'}) | \s+ }{ $1? $1 : ' ' }xgoe
foreach grep defined, $phrase;
- s{ ($RE{'quoted_string'}) | \s+ }{ $1? $1 : '' }xgoe
+ s{ ($RE{'quoted-string'}) | \s+ }{ $1? $1 : '' }xgoe
foreach $user, $host;
# dequote
- s{ ($RE{'quoted_string'}) }{ $dequote->($1) }xgoe
+ s{ ($RE{'quoted-string'}) }{ $dequote->($1) }xgoe
foreach grep defined, $phrase, $user;
$user = $quote->($user) unless $user =~ /^$RE{'dot_atom'}$/;
View
3  t/basics.t
@@ -2,10 +2,9 @@ use strict; use warnings;
use Test::More tests => 3;
use_ok 'Email::Address::List';
-use Scalar::Util qw(blessed);
{
my @addresses = Email::Address::List->parse(q{ruz@bestpractical.com});
is scalar @addresses, 1;
- is $addresses[0]->format, q{ruz@bestpractical.com};
+ is $addresses[0]{'value'}->format, q{ruz@bestpractical.com};
}
View
1  t/generate.pl
@@ -1,7 +1,6 @@
use strict; use warnings; use autodie; use lib 'lib/';
use Email::Address::List;
-my $file = 't/data/RFC5233.single.valid.txt';
foreach my $file (qw(t/data/RFC5233.single.valid.txt t/data/RFC5233.single.obs.txt)) {
process_file($file);
}
View
28 t/invalid.t
@@ -0,0 +1,28 @@
+use strict; use warnings;
+use Test::More tests => 4;
+use_ok 'Email::Address::List';
+
+run_test('root', [{type => 'unknown', value => 'root'}]);
+run_test(
+ 'boo@boo, root, foo@foo',
+ [
+ {type => 'mailbox', value => 'boo@boo', obsolete => 0, not_ascii => 0},
+ {type => 'unknown', value => 'root'},
+ {type => 'mailbox', value => 'foo@foo', obsolete => 0, not_ascii => 0},
+ ],
+);
+run_test(
+ '"Doe, John" foo@foo, root',
+ [
+ {type => 'unknown', value => '"Doe, John" foo@foo' },
+ {type => 'unknown', value => 'root'},
+ ],
+);
+
+sub run_test {
+ my $line = shift;
+ my @list = Email::Address::List->parse($line);
+ $_->{'value'} .= '' foreach grep defined $_->{'value'}, @list;
+ is_deeply( \@list, shift );
+}
+
View
60 t/random.combinations.t
@@ -0,0 +1,60 @@
+use strict;
+use warnings;
+
+use Test::More;
+use JSON ();
+
+use_ok('Email::Address::List');
+
+my @data;
+foreach my $file (qw(t/data/RFC5233.single.valid.json t/data/RFC5233.single.obs.json)) {
+ my $obsolete = $file =~ /\bobs\b/? 1 : 0;
+
+ open my $fh, '<', $file;
+ push @data, @{ JSON->new->decode( do { local $/; <$fh> } ) };
+ close $fh;
+}
+
+diag "srand is ". (my $seed = int rand( 2**16-1 ));
+srand($seed);
+
+for (1..100) {
+ my @list;
+ push @list, $data[ rand @data ] for 1..3;
+
+ my $line = join ', ', map $_->{'mailbox'}, @list;
+ note $line;
+
+ my @res = Email::Address::List->parse( $line );
+ is scalar @res, scalar @list;
+
+ for (my $i = 0; $i < @list; $i++) {
+ my $test = $list[$i];
+ my $v = $res[$i]{'value'};
+ is $v->phrase, $test->{'display-name'}, 'correct value';
+ is $v->address, $test->{'address'}, 'correct value';
+ is $v->comment, join( ' ', @{$test->{'comments'}} ), 'correct value';
+ }
+}
+
+for (1..100) {
+ my @list;
+ push @list, $data[ rand @data ] for 1..3;
+
+ my $line = join ",\n ,", '', (map $_->{'mailbox'}, @list), '';
+ note $line;
+
+ my @res = Email::Address::List->parse( $line );
+ is scalar @res, scalar @list;
+
+ for (my $i = 0; $i < @list; $i++) {
+ my $test = $list[$i];
+ my $v = $res[$i]{'value'};
+ is $v->phrase, $test->{'display-name'}, 'correct value';
+ is $v->address, $test->{'address'}, 'correct value';
+ is $v->comment, join( ' ', @{$test->{'comments'}} ), 'correct value';
+ }
+}
+
+done_testing;
+
View
28 t/single.suit.t
@@ -0,0 +1,28 @@
+use strict;
+use warnings;
+
+use Test::More;
+use JSON ();
+
+use_ok('Email::Address::List');
+
+foreach my $file (qw(t/data/RFC5233.single.valid.json t/data/RFC5233.single.obs.json)) {
+ my $obsolete = $file =~ /\bobs\b/? 1 : 0;
+
+ open my $fh, '<', $file;
+ my $tests = JSON->new->decode( do { local $/; <$fh> } );
+ close $fh;
+
+ foreach my $test ( @$tests ) {
+ note $test->{'description'};
+ my @list = Email::Address::List->parse( $test->{'mailbox'} );
+ is scalar @list, 1, "one entry in result set" or do { use Data::Dumper; diag Dumper \@list };
+ is $list[0]{'type'}, 'mailbox', 'one mailbox';
+ my $v = $list[0]{'value'};
+ is $v->phrase, $test->{'display-name'}, 'correct value';
+ is $v->address, $test->{'address'}, 'correct value';
+ is $v->comment, join( ' ', @{$test->{'comments'}} ), 'correct value';
+ }
+}
+
+done_testing();
Please sign in to comment.
Something went wrong with that request. Please try again.