Add Import format reader #129

Closed
nichtich opened this issue Apr 20, 2022 · 2 comments

As a continuation of #128, add PICA::Parser::Import, based on https://wiki-cbs.oclc.org/wiki/images/Software_for_Data_Import.pdf.

nichtich commented:

Current workaround:

#!/usr/bin/env perl
use v5.14.1;
use Encode;

# Convert PICA+ export format to normalized PICA+ with valid UTF-8

# Input format:
# - each record starts with an empty line followed by a line containing only \x1D
# - each field is on one line, starting with \x1E

while (<>) {
    chomp;
    next if $_ eq '';    # ignore empty lines (a plain "unless $_" would also skip a line "0")
    if ( $_ eq "\x1D" ) {
        say "" if $. > 2;    # start of next record
    }
    else {
        if ( $_ =~ /^\x1E[012][0-9][0-9][A-Z@]/ ) {
            my $field = substr $_, 1;

            # invalid UTF-8 => U+FFFD (Unicode REPLACEMENT CHARACTER)
            my $bytes = encode( 'UTF-8', decode( 'UTF-8', $field ) );
            if ( $field ne $bytes ) {
                warn "$.: invalid UTF-8\n";
            }

            print $bytes, "\x1E";
        }
        elsif ( $. > 2 ) {    # non-empty line that is not a valid field
            warn "$.: unexpected line '$_'\n";    # report on STDERR, keep STDOUT clean
        }
    }
}

# newline after last record
say "";
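
The input layout described in the script's comments can also be illustrated by grouping field lines into in-memory records. This is only a minimal sketch using core Perl: the helper `read_import_records` is hypothetical (it is not part of PICA::Data), the sample data is made up, and `\x1F` as subfield indicator is an assumption carried over from normalized PICA+.

```perl
use v5.14.1;

# Hypothetical helper, not part of PICA::Data: group field lines by record.
sub read_import_records {
    my ($fh) = @_;
    my @records;
    while ( my $line = <$fh> ) {
        chomp $line;
        if ( $line eq "\x1D" ) {
            push @records, [];    # a \x1D line starts a new record
        }
        elsif ( @records && $line =~ /^\x1E(.+)$/ ) {
            push @{ $records[-1] }, $1;    # field without the leading \x1E
        }
        # empty and unrecognized lines are skipped in this sketch
    }
    return \@records;
}

# Made-up sample data: two records, \x1F assumed as subfield indicator
my $input = join "\n",
    "", "\x1D", "\x1E003\@ \x{1F}012345", "\x1E021A \x{1F}aExample",
    "", "\x1D", "\x1E003\@ \x{1F}067890", "";

open my $fh, '<', \$input or die "cannot open in-memory handle: $!";
my $records = read_import_records($fh);

for my $record (@$records) {
    # print \x1F as "$" so that the fields are readable on a terminal
    say join ' | ', map { s/\x1F/\$/gr } @$record;
}
```

Unlike the workaround above, this sketch does not check the field tag or the UTF-8 validity of each line; it only shows how the `\x1D` and `\x1E` markers delimit records and fields.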

nichtich commented Aug 9, 2023

Implemented for release 2.10 (not released yet).

nichtich closed this as completed Aug 9, 2023
nichtich added a commit that referenced this issue Aug 9, 2023
Changelog diff is:

diff --git a/Changes b/Changes
index 9d70e1b..c0520cc 100644
--- a/Changes
+++ b/Changes
@@ -1,6 +1,8 @@
 Revision history for PICA::Data
 
 {{$NEXT}}
+
+2.10 2023-08-09T14:01:25Z
     - Add PICA Import format parser (#129)
     - Add parser counter (method: count)