Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use Sha instead of md5 #6

Closed
goetzk opened this issue Nov 26, 2018 · 9 comments
Closed

Use Sha instead of md5 #6

goetzk opened this issue Nov 26, 2018 · 9 comments

Comments

@goetzk
Copy link

@goetzk goetzk commented Nov 26, 2018

Hi,
As a related point to #5 ; I would like to propose the spec supports Sha256 instead of md5 as its default checksum.

As a future feature it should probably support md5, sha1 and sha256 (archivematica's three options) but I don't see it being a pressing issue from the start as sha256 and md5 are those supported by eprints OOTB.

From eprints commit be42eb7f4dd9ebc184ff8f7fb0282d2f8e778f21

=item $md5 = $stored->generate_md5

Calculates and returns the MD5 for this file.

=cut

sub generate_md5
{
        my( $self ) = @_;

        my $md5 = Digest::MD5->new;

        $self->get_file(sub {
                $md5->add( $_[0] )
        });

        return $md5->hexdigest;
}

sub update_md5
{
        my( $self ) = @_;

        my $md5 = $self->generate_md5;

        $self->set_value( "hash", $md5 );
        $self->set_value( "hash_type", "MD5" );
}

=item $digest = $file->generate_sha( [ ALGORITHM ] )

Generate a SHA for this file, see L<Digest::SHA> for a list of supported algorithms. Defaults to "256" (SHA-256).

Returns the hex-encoded digest.

=cut

sub generate_sha
{
        my( $self, $alg ) = @_;

        $alg ||= "256";

        my $sha = Digest::SHA->new( $alg );

        $self->get_file(sub {
                $sha->add( $_[0] )
        });

        return $sha->hexdigest;
}

sub update_sha
{
        my( $self, $alg ) = @_;

        $alg ||= "256";

        my $digest = $self->generate_sha( $alg );

        $self->set_value( "hash", $digest );
        $self->set_value( "hash_type", "SHA-$alg" );
}
@tw4l
Copy link
Collaborator

@tw4l tw4l commented Nov 26, 2018

I'm definitely okay with preferring sha256, especially if it's already being used by eprints.

Am I understanding correctly that a stock installation of eprints will generate both md5 and sha256 hashes for all uploaded files?

@tw4l
Copy link
Collaborator

@tw4l tw4l commented Nov 26, 2018

I would add that we should aim to support multiple hash algorithms from the start. In my experience, this should not greatly complicate development.

@photomedia - thoughts?

@tw4l
Copy link
Collaborator

@tw4l tw4l commented Dec 17, 2018

I took a look at a MySQL dump from a development instance of EPrints here at Concordia, and it looks like all uploaded files and EPrints-generated derivatives have an MD5 digest stored in the documents table in the MySQL database. This particular instance of EPrints was not generating any SHA-* hashes, only MD5s.

I tried to look in the EPrints source code where the generate_md5 and/or generate_sha functions are called, but only found the function definitions as pointed out by @goetzk above, I'm not terribly familiar with EPrints and don't know Perl, so it's very possible I'm missing something here. Is it possible to determine which function is being called by default when new files are uploaded?

@geo-mac
Copy link

@geo-mac geo-mac commented Dec 19, 2018

Our EPrints instance generates MD5 only too. This seems to be the default for generating technical metadata about files in EPrints because -- although sha256 is supported by EPrints -- MD5 is specified in a number of configuration files, inc.

perl_lib/EPrints/System.pm
perl_lib/EPrints/Apache/Storage.pm
perl_lib/EPrints/DataObj/File.pm

There are other .pm and .pl files within EPrints that seem to reference the MD5 digest but I can't interpret why! But certainly the aforementioned would need to be modified and tested. It would be interesting to know (perhaps via the Eprints Tech listserve...?) whether anyone has modified their configuration to support sha256...

@photomedia
Copy link
Collaborator

@photomedia photomedia commented Dec 19, 2018

Since EPrints uses MD5 by default, the majority of repository instances will have MD5 hashes in their databases. Therefore, I suggest that we keep our spec to MD5 at this point. Supporting sha256 is a nice feature that would be useful for new repository instances once EPrints starts adding these to the database, but for now, the majority content in repositories likely only has MD5 hashes, so let's support these first.

@eprintsug eprintsug deleted a comment from geo-mac Dec 19, 2018
@photomedia
Copy link
Collaborator

@photomedia photomedia commented Dec 19, 2018

@geo-mac I edited your comment to include the clarification from the comment/correction.
I agree, let's reach out on the tech list to ask if anyone has modified their config to support sha256.
Overall, I agree that it would be nice to include sha256 as an option for the export, but I would place this as a feature to implement later on if it is unlikely to be useful for anyone currently running eprints.

@goetzk
Copy link
Author

@goetzk goetzk commented Jan 24, 2019

For anyone watching this discussion in the future:
@photomedia contacted the tech list in early January - http://mailman.ecs.soton.ac.uk/pipermail/eprints-tech/2019-January/007648.html .

The only reply on list was mine, observing the same hashes (md5) and behaviour (inconsistency in their usage).
http://mailman.ecs.soton.ac.uk/pipermail/eprints-tech/2019-January/007650.html

@goetzk
Copy link
Author

@goetzk goetzk commented Jan 24, 2019

Based on the above discussion(s) I think we should remain with md5 as the specified format. It may be appropriate to add a note "Future revisions of this specification will add optional Sha support" or similar.

@photomedia
Copy link
Collaborator

@photomedia photomedia commented Jan 30, 2019

OK, so it is decided, we are going with MD5. Thanks @goetzk for the mysql table in your response on the list, it looks like we have inconsistent hashes for files in EPrints, so the issue of missing hashes remains open, #5 , but I'm closing this one.

@photomedia photomedia closed this Jan 30, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
4 participants
You can’t perform that action at this time.