Problem: Cannot store AIP with large files #981

Closed
5 tasks
jorikvankemenade opened this issue Nov 8, 2019 · 6 comments
@jorikvankemenade

jorikvankemenade commented Nov 8, 2019

Expected behaviour
When creating a transfer with large single files, e.g. videos, a verified AIP should be stored successfully. This of course assumes that the file system supports files of that size.

Current behaviour
If an AIP contains a file that is bigger than 2.15 GB, the "Store AIP" step fails. The dashboard shows the following error:

500 Server Error: INTERNAL SERVER ERROR for url: http://archivematica-storage-dev.cern.ch/api/v2/file/
Traceback (most recent call last):
  File "/usr/lib/archivematica/MCPClient/job.py", line 111, in JobContext
    yield
  File "/usr/lib/archivematica/MCPClient/clientScripts/store_aip.py", line 356, in call
    args.sip_type,
  File "/usr/lib/archivematica/MCPClient/clientScripts/store_aip.py", line 241, in store_aip
    related_package_uuid,
  File "/usr/lib/archivematica/MCPClient/clientScripts/store_aip.py", line 77, in _create_file
    agents=get_agents_from_db(uuid),
  File "/usr/lib/archivematica/archivematicaCommon/storageService.py", line 347, in create_file
    response.raise_for_status()
  File "/usr/share/archivematica/virtualenvs/archivematica-mcp-client/lib/python2.7/site-packages/requests/models.py", line 940, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
HTTPError: 500 Server Error: INTERNAL SERVER ERROR for url: http://archivematica-storage-dev.cern.ch/api/v2/file/

And the storage service logs the following error:

ERROR     2019-11-08 02:09:30  django.request.tastypie:resources:_handle_500:301:  Internal Server Error: /api/v2/file/
Traceback (most recent call last):
  File "/usr/share/archivematica/virtualenvs/archivematica-storage-service/lib/python2.7/site-packages/tastypie/resources.py", line 220, in wrapper
    response = callback(request, *args, **kwargs)
  File "/usr/share/archivematica/virtualenvs/archivematica-storage-service/lib/python2.7/site-packages/tastypie/resources.py", line 451, in dispatch_list
    return self.dispatch('list', request, **kwargs)
  File "/usr/share/archivematica/virtualenvs/archivematica-storage-service/lib/python2.7/site-packages/tastypie/resources.py", line 483, in dispatch
    response = method(request, **kwargs)
  File "/usr/share/archivematica/virtualenvs/archivematica-storage-service/lib/python2.7/site-packages/tastypie/resources.py", line 1380, in post_list
    updated_bundle = self.obj_create(bundle, **self.remove_api_resource_names(kwargs))
  File "/usr/lib/archivematica/storage-service/locations/api/resources.py", line 1046, in obj_create
    bundle = super(PackageResource, self).obj_create(bundle, **kwargs)
  File "/usr/share/archivematica/virtualenvs/archivematica-storage-service/lib/python2.7/site-packages/tastypie/resources.py", line 2164, in obj_create
    return self.save(bundle)
  File "/usr/share/archivematica/virtualenvs/archivematica-storage-service/lib/python2.7/site-packages/tastypie/resources.py", line 2319, in save
    bundle.obj.save()
  File "/usr/share/archivematica/virtualenvs/archivematica-storage-service/lib/python2.7/site-packages/django/db/models/base.py", line 734, in save
    force_update=force_update, update_fields=update_fields)
  File "/usr/share/archivematica/virtualenvs/archivematica-storage-service/lib/python2.7/site-packages/django/db/models/base.py", line 762, in save_base
    updated = self._save_table(raw, cls, force_insert, force_update, using, update_fields)
  File "/usr/share/archivematica/virtualenvs/archivematica-storage-service/lib/python2.7/site-packages/django/db/models/base.py", line 846, in _save_table
    result = self._do_insert(cls._base_manager, using, fields, update_pk, raw)
  File "/usr/share/archivematica/virtualenvs/archivematica-storage-service/lib/python2.7/site-packages/django/db/models/base.py", line 885, in _do_insert
    using=using, raw=raw)
  File "/usr/share/archivematica/virtualenvs/archivematica-storage-service/lib/python2.7/site-packages/django/db/models/manager.py", line 127, in manager_method
    return getattr(self.get_queryset(), name)(*args, **kwargs)
  File "/usr/share/archivematica/virtualenvs/archivematica-storage-service/lib/python2.7/site-packages/django/db/models/query.py", line 920, in _insert
    return query.get_compiler(using=using).execute_sql(return_id)
  File "/usr/share/archivematica/virtualenvs/archivematica-storage-service/lib/python2.7/site-packages/django/db/models/sql/compiler.py", line 974, in execute_sql
    cursor.execute(sql, params)
  File "/usr/share/archivematica/virtualenvs/archivematica-storage-service/lib/python2.7/site-packages/django/db/backends/utils.py", line 64, in execute
    return self.cursor.execute(sql, params)
  File "/usr/share/archivematica/virtualenvs/archivematica-storage-service/lib/python2.7/site-packages/django/db/utils.py", line 98, in __exit__
    six.reraise(dj_exc_type, dj_exc_value, traceback)
  File "/usr/share/archivematica/virtualenvs/archivematica-storage-service/lib/python2.7/site-packages/django/db/backends/utils.py", line 64, in execute
    return self.cursor.execute(sql, params)
  File "/usr/share/archivematica/virtualenvs/archivematica-storage-service/lib/python2.7/site-packages/django/db/backends/mysql/base.py", line 124, in execute
    return self.cursor.execute(query, args)
  File "/usr/share/archivematica/virtualenvs/archivematica-storage-service/lib/python2.7/site-packages/MySQLdb/cursors.py", line 206, in execute
    res = self._query(query)
  File "/usr/share/archivematica/virtualenvs/archivematica-storage-service/lib/python2.7/site-packages/MySQLdb/cursors.py", line 312, in _query
    db.query(q)
  File "/usr/share/archivematica/virtualenvs/archivematica-storage-service/lib/python2.7/site-packages/MySQLdb/connections.py", line 224, in query
    _mysql.connection.query(self, query)
DataError: (1264, "Out of range value for column 'size' at row 1")

So there is a problem with the size field in the files endpoint of the storage service. The files endpoint is served by the Package model. Looking at the table we get:

[root@archivematica-storage-dev ~]# mysql -D "SS" -e "SHOW COLUMNS FROM locations_package"
+----------------------------+--------------+------+-----+---------+----------------+
| Field                      | Type         | Null | Key | Default | Extra          |
+----------------------------+--------------+------+-----+---------+----------------+
| id                         | int(11)      | NO   | PRI | NULL    | auto_increment |
| uuid                       | varchar(36)  | NO   | UNI | NULL    |                |
| current_path               | longtext     | NO   |     | NULL    |                |
| pointer_file_path          | longtext     | YES  |     | NULL    |                |
| size                       | int(11)      | NO   |     | NULL    |                |
| package_type               | varchar(8)   | NO   |     | NULL    |                |
| status                     | varchar(8)   | NO   |     | NULL    |                |
| current_location_id        | varchar(36)  | NO   | MUL | NULL    |                |
| origin_pipeline_id         | varchar(36)  | YES  | MUL | NULL    |                |
| pointer_file_location_id   | varchar(36)  | YES  | MUL | NULL    |                |
| misc_attributes            | longtext     | YES  |     | NULL    |                |
| description                | varchar(256) | YES  |     | NULL    |                |
| encryption_key_fingerprint | varchar(512) | YES  |     | NULL    |                |
| replicated_package_id      | varchar(36)  | YES  | MUL | NULL    |                |
+----------------------------+--------------+------+-----+---------+----------------+

So the size field is an int(11), i.e. a signed 32-bit integer. This limits the maximum file size to 2,147,483,647 bytes, or about 2.15 GB. Since I think it is perfectly reasonable to have bigger files, I think this limit should be increased. This means changing the type of the size field in the Package model in the storage service. If we change the field from IntegerField to BigIntegerField, we increase the maximum file size to the order of exabytes. I would think that is a pretty safe limit :). I'll start working on creating and testing a PR this afternoon.
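
For readers who want to check the numbers, here is a minimal sketch of the arithmetic behind the 32-bit and 64-bit limits. The helper name `fits_in_column` is hypothetical and is not part of the Storage Service code; the only assumption is that the column is a signed integer, as the MySQL error suggests.

```python
# Signed integer limits for a 32-bit column (MySQL INT) and a 64-bit
# column (MySQL BIGINT, which is what Django's BigIntegerField maps to).
INT32_MAX = 2**31 - 1
INT64_MAX = 2**63 - 1


def fits_in_column(size_bytes, column_max):
    """Hypothetical helper: True if a size in bytes fits in the column."""
    return 0 <= size_bytes <= column_max


if __name__ == "__main__":
    five_gb = 5 * 10**9  # illustrative file size, not taken from the logs
    print(INT32_MAX)                           # 2147483647, about 2.15 GB
    print(fits_in_column(five_gb, INT32_MAX))  # False
    print(fits_in_column(five_gb, INT64_MAX))  # True
    print(INT64_MAX / 10**18)                  # about 9.22 (exabytes)
```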

Steps to reproduce

  1. Find two video files, one smaller than 2.15 GB and one bigger than 2.15 GB.
  2. Start a separate standard transfer for each file.
  3. Do not normalize or compress the files (this saves you a lot of time).
  4. Observe that the larger file fails during the "Store AIP" phase.
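
If no suitable video files are at hand, test files on either side of the limit could be generated instead. The following is only a sketch: `make_test_file` is a hypothetical helper, the files contain nothing but zero bytes (so format identification will differ from a real video), and it assumes the file system supports sparse files so that creation is fast.

```python
import os
import tempfile

INT32_MAX = 2**31 - 1  # largest value a signed 32-bit column can hold


def make_test_file(path, size_bytes):
    """Create a file of exactly size_bytes by extending it with truncate().

    On most Linux file systems this yields a sparse file, so it is fast and
    uses little real disk space; that is an assumption about the file system.
    """
    with open(path, "wb") as handle:
        handle.truncate(size_bytes)
    return os.path.getsize(path)


if __name__ == "__main__":
    target_dir = tempfile.mkdtemp()
    # One file just below the limit and one just above it.
    small = make_test_file(os.path.join(target_dir, "small.bin"), INT32_MAX - 1024)
    large = make_test_file(os.path.join(target_dir, "large.bin"), INT32_MAX + 1024)
    print(target_dir, small, large)
```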

Your environment (version of Archivematica, operating system, other relevant details)
CentOS 7, Archivematica 1.10, Storage Service 0.15, MySQL 5.7.26 for both the Storage Service and Archivematica.


For Artefactual use:

Before you close this issue, you must check off the following:

  • All pull requests related to this issue are properly linked
  • All pull requests related to this issue have been merged
  • A testing plan for this issue has been implemented and passed (testing plan information should be included in the issue body or comments)
  • Documentation regarding this issue has been written and merged
  • Details about this issue have been added to the release notes (if applicable)
@ross-spencer
Contributor

Hi @jorikvankemenade this is a good spot. That field looks like it definitely needs to be updated. However, I am seeing a different behavior using percona in the docker-compose setup.

If I create a 4GB file:

dd if=/dev/urandom of=file.txt bs=1048576 count=4000

And store it uncompressed, in the storage service I see:

mysql> select uuid, size from locations_package;
+--------------------------------------+------------+
| uuid                                 | size       |
+--------------------------------------+------------+
| 7fa9ac66-3c91-4abe-9183-5a58e1a5453e | 2147483647 |
| 5dba4d94-8aa5-4d92-b530-bbd069b1aa76 | 2147483647 |
+--------------------------------------+------------+
2 rows in set (0.00 sec)

In the storage service I have two AIPs:

4.0G	./7fa9
4.0G	./5dba

So you can see, the field has maxed out, but it hasn't caused the storage service to fall over.


My reason for looking is that storing this size of AIP is not unusual in a lot of deployments, so it's odd that it's happening here.

I can see you're using MySQL in a CentOS deployment; currently we only have documented support for MySQL in Docker.

In my CentOS deployment, I can see the column is likely fixed to 32 bits:

sqlite> select uuid, size from locations_package;
uuid                                  size      
------------------------------------  ----------
7a3e3d5e-1ac8-4d16-b8b4-4cadfb4bb577  4194444916
sqlite> .schema locations_package
CREATE TABLE "locations_package" ("id" integer NOT NULL PRIMARY KEY AUTOINCREMENT,
"uuid" varchar(36) NOT NULL UNIQUE, 
"current_path" text NOT NULL, 
"pointer_file_path" text NULL, 
"size" integer NOT NULL, 
"package_type" varchar(8) NOT NULL, 
"status" varchar(8) NOT NULL, 
"current_location_id" varchar(36) NOT NULL REFERENCES "locations_location" ("uuid"), 
"origin_pipeline_id" varchar(36) NULL REFERENCES "locations_pipeline" ("uuid"), 
"pointer_file_location_id" varchar(36) NULL REFERENCES "locations_location" ("uuid"), 
"misc_attributes" text NULL, "description" varchar(256) NULL, 
"encryption_key_fingerprint" varchar(512) NULL, 
"replicated_package_id" varchar(36) NULL REFERENCES "locations_package" ("uuid"));

If you can think of anything else that might be impacting this, let us know. Also let us know if updating that field helps; it will be useful for other readers of this ticket looking at MySQL in other deployments of Archivematica.

@jorikvankemenade
Author

jorikvankemenade commented Nov 8, 2019

In my CentOS deployment, I can see the column is likely fixed to 32 bits:

I see that you posted the schema for SQLite. This schema shows the field as an integer. According to the SQLite documentation:

  • INTEGER. The value is a signed integer, stored in 1, 2, 3, 4, 6, or 8 bytes depending on the magnitude of the value.

So I guess it is "a lucky find" in the sense that I shot myself in the foot by using MySQL instead of SQLite for the storage service.
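
A quick way to see the SQLite behaviour described in the documentation is sketched below with Python's built-in `sqlite3` module. The table is a reduced, hypothetical version of `locations_package` with only two columns; the size value is the one from the SQLite output earlier in this thread.

```python
import sqlite3

# In-memory database with a reduced, hypothetical version of the table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE locations_package ("
    '"uuid" varchar(36) NOT NULL UNIQUE, "size" integer NOT NULL)'
)

# A size larger than 2**31 - 1, as seen in the SQLite-backed deployment.
size = 4194444916
conn.execute(
    "INSERT INTO locations_package (uuid, size) VALUES (?, ?)",
    ("7a3e3d5e-1ac8-4d16-b8b4-4cadfb4bb577", size),
)

stored = conn.execute("SELECT size FROM locations_package").fetchone()[0]
print(stored == size)  # True: SQLite stores the value without truncation
conn.close()
```

This is consistent with the AIP being stored without errors on SQLite, while the same insert fails on a MySQL int(11) column in strict mode.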

If you can think of anything else that might be impacting this, let us know. Also let us know if updating that field helps; it will be useful for other readers of this ticket looking at MySQL in other deployments of Archivematica.

I am currently testing the ingest of big files in my setup. I'll update this ticket when I know more.

@jorikvankemenade
Author

@ross-spencer I have successfully imported a 5.1 GB video file into my MySQL-backed storage service. It requires not one, but two very small pull requests, so I hope it won't take you guys too much time to check them. Let me know what you think!

@ross-spencer
Contributor

Awesome, we'll make a note here for @sromkey about considering those next week for inclusion in 1.11 and then get on that. Thanks @jorikvankemenade!

sromkey added this to the 1.11.0 milestone Nov 15, 2019
sromkey added the "Status: ready" label and removed the "triage-release-1.11" label Nov 15, 2019
@mamedin

mamedin commented Nov 15, 2019

I could reproduce the issue on CentOS, with the SS using Percona 5.7 and AM 1.10.1 deployed with Ansible. I tested with a 4.3 GB transfer (a Linux ISO file).

When using the default sql_mode I got the same behavior as @jorikvankemenade.

MySQL 5.7 uses strict SQL mode by default, which is an important change from MySQL 5.6.

I tried again with MySQL strict mode disabled, using sql_mode=IGNORE_SPACE,NO_ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_ZERO,NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION, and I could store the AIP. However, in the dashboard's Archival Storage tab and in the Storage Service's Packages tab, the AIP size is shown as 2 GB.
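
The difference between the two modes can be illustrated with a small simulation. This is not MySQL code, only a sketch of the documented behaviour for a signed INT column: in strict mode an out-of-range value is rejected with an error, while in non-strict mode it is clipped to the nearest end of the range. The names `store_int_column` and `OutOfRangeError` are hypothetical.

```python
INT32_MAX = 2**31 - 1
INT32_MIN = -(2**31)


class OutOfRangeError(ValueError):
    """Stand-in for MySQL error 1264 (out of range value)."""


def store_int_column(value, strict):
    """Model what a signed INT column does with an out-of-range value.

    strict=True  -> reject the value, like the DataError in the issue.
    strict=False -> clip to the nearest limit (MySQL also emits a warning).
    """
    if INT32_MIN <= value <= INT32_MAX:
        return value
    if strict:
        raise OutOfRangeError("Out of range value for column 'size'")
    return max(INT32_MIN, min(value, INT32_MAX))


if __name__ == "__main__":
    size = 43 * 10**8  # roughly 4.3 GB, an illustrative value
    print(store_int_column(size, strict=False))  # 2147483647
    try:
        store_int_column(size, strict=True)
    except OutOfRangeError as error:
        print("rejected:", error)
```

The clipped value of 2147483647 bytes matches the 2 GB size shown in the dashboard when strict mode is off.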

Finally, I tested the database change that @jorikvankemenade proposes, and then I could store the AIP with the default MySQL 5.7 strict mode, and the Archival Storage tab shows the correct AIP size. This is the database change (where SS is the Storage Service database name):

mysql -e "ALTER TABLE locations_package MODIFY size BIGINT;" SS

ross-spencer added the "Status: review" and "Status: in progress" labels and removed the "Status: ready" and "Status: review" labels Nov 29, 2019
ross-spencer added the "Status: review" label and removed the "Status: in progress" label Jan 7, 2020
@sallain
Member

sallain commented Mar 23, 2020

I've tested with a transfer containing three large files - 2.3 GB, 8.3 GB, and 11.3 GB - and the AIP stored fine. I believe that the MySQL settings are standard for a fresh install on the test site I'm using. The Archival Storage tab shows the correct AIP size.

The test server is running on CentOS.

sallain closed this as completed Mar 23, 2020
sallain removed the "Status: review" label Mar 23, 2020

5 participants