IOError: [Errno 36] File name too long #292

rfelten · 2017-02-17T12:30:59Z

Hi,
I'm using DumpGenerator 0.3.0-alpha on 4.4.0-59-generic #80-Ubuntu x86_64 and ran into a problem dumping http://www.kochwiki.org/w/api.php to an eCryptfs-encrypted file system.

Stacktrace:

$ python dumpgenerator.py --api=http://www.kochwiki.org/w/api.php --xml --curonly --path=/home/rf/Projects/kochen/KochWiki --resume 
Checking API... http://www.kochwiki.org/w/api.php
API is OK: http://www.kochwiki.org/w/api.php
Checking index.php... http://www.kochwiki.org/w/index.php
index.php is OK
#########################################################################
# Welcome to DumpGenerator 0.3.0-alpha by WikiTeam (GPL v3)                   #
# More info at: https://github.com/WikiTeam/wikiteam                    #
#########################################################################

#########################################################################
# Copyright (C) 2011-2017 WikiTeam developers                           #

# This program is free software: you can redistribute it and/or modify  #
# it under the terms of the GNU General Public License as published by  #
# the Free Software Foundation, either version 3 of the License, or     #
# (at your option) any later version.                                   #
#                                                                       #
# This program is distributed in the hope that it will be useful,       #
# but WITHOUT ANY WARRANTY; without even the implied warranty of        #
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the         #
# GNU General Public License for more details.                          #
#                                                                       #
# You should have received a copy of the GNU General Public License     #
# along with this program.  If not, see <http://www.gnu.org/licenses/>. #
#########################################################################

Analysing http://www.kochwiki.org/w/api.php
Loading config file...
Resuming previous dump process...
Title list was completed in the previous session
XML dump was completed in the previous session
Image list was completed in the previous session
365 images were found in the directory from a previous session
Retrieving images from "Assortiment de différentes préparation à bases de légumes et féculents, bien sur servit avec de l'injara.JPG"
Filename is too long, truncating. Now it is: Assortiment de différentes préparation à bases de légumes et féculents, bien sur servit avec de l'inf1f192008cca2209820a6db246f5e3b1.JPG
Traceback (most recent call last):
  File "dumpgenerator.py", line 2093, in <module>
    main()
  File "dumpgenerator.py", line 2083, in main
    resumePreviousDump(config=config, other=other)
  File "dumpgenerator.py", line 1808, in resumePreviousDump
    session=other['session'])
  File "dumpgenerator.py", line 1126, in generateImageDump
    f = open('%s/%s.desc' % (imagepath, filename2), 'w')
IOError: [Errno 36] File name too long: u"/home/rf/Projects/kochen/KochWiki/images/Assortiment de diff\xe9rentes pr\xe9paration \xe0 bases de l\xe9gumes et f\xe9culents, bien sur servit avec de l'inf1f192008cca2209820a6db246f5e3b1.JPG.desc"

The eCryptfs file system supports filenames up to ~140 chars (source).

dumpgenerator.py contains code to trim file names that are too long, and that code failed here. Therefore I consider this a bug ;)


nemobis · 2017-02-17T21:26:25Z

I think the dumps should be produced in the same way whatever the filesystem, otherwise we'll end up with multiple incompatible dump formats. Ideally we would not store the filenames in the local filesystem and would be able to keep the original wiki's filesystem metadata, but that isn't the case currently. I'm fine with adding a note in the README that your filesystem isn't supported (or better, how long a filename we assume is possible).

rfelten · 2017-02-21T11:47:47Z

I don't think that a note in the README saying that eCryptfs is not supported is a "solution". It is not "my" file system; it is the Ubuntu default if you encrypt your home directory. Therefore a lot of users are affected.

Keeping the original wiki's filesystem structure (= filenames) sounds like a good idea to me, since IMHO a dump (and the dumper software) should copy the source without changing it -> ideal solution.

Measured against the ideal solution, the current filename handling is a dirty hack and also buggy (sorry to say that). Let me elaborate on this claim using the current code:

generateImageDump() truncates the filename (so bye bye ideal solution). Based on the comment # truncate filename if length > 100 (100 + 32 (md5) = 132 < 143 (crash limit). Later .desc is added to filename, so better 100 as max) I guess the intention is to meet the eCryptfs limit (143 chars max).
If 100 is the max, truncateFilename() is doing it wrong, or at least in a very misleading way. It keeps the first 100 chars, then adds a 32-char md5. So the result is at least 132 chars, not the value of the configuration variable, which is what the user might expect.
But the real bug is hidden in Python's unicode handling:

Let's have a look at this innocent-looking French filename: Assortiment de différentes préparation à bases de légumes et féculents, bien sur servit avec de l'injara.JPG. Why should this innocent real-world example break the code?

>>> fn = "Assortiment de différentes préparation à bases de légumes et féculents, bien sur servit avec de l'injara.JPG"
>>> len(fn)
108

108 > 100, so it will be truncated. In the worst case the .desc suffix is added too, so the result is:

>>> fn = u"Assortiment de diff\xe9rentes pr\xe9paration \xe0 bases de l\xe9gumes et f\xe9culents, bien sur servit avec de l'inf1f192008cca2209820a6db246f5e3b1.JPG.desc"
>>> fn
"Assortiment de différentes préparation à bases de légumes et féculents, bien sur servit avec de l'inf1f192008cca2209820a6db246f5e3b1.JPG.desc"
>>> len(fn)
141

Should be safe, since it is below the crash limit (143). Or not?

>>> with open(fn, 'w') as f:
...     f.write("BOOOM")
... 
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
OSError: [Errno 36] File name too long: "Assortiment de différentes préparation à bases de légumes et féculents, bien sur servit avec de l'inf1f192008cca2209820a6db246f5e3b1.JPG.desc"

Simply put: to store a non-ASCII Unicode character on the filesystem, you need more than one byte. The same is true when you store it in RAM, of course. So the "real length" of fn is:

>>> len(fn.encode("utf-8"))
146

(The additional 5 bytes come from the ééàéé, which take two bytes each in UTF-8.)

So the real bug is IMHO in line 1103, where the encoded (byte) length should be taken into account instead of the character count. Maybe in other locations too.
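To make the idea concrete, here is a minimal sketch of a byte-aware truncation. It is not the code in dumpgenerator.py: the function name truncate_filename, the constants, and the choice to hash the full UTF-8 name are assumptions for illustration. The 143 limit and the .desc suffix are taken from the comment quoted above.

```python
import hashlib
import os

# Assumed values: 143 is the "crash limit" quoted in the code comment,
# and ".desc" is the suffix appended to the image filename later on.
MAX_NAME_BYTES = 143
DESC_SUFFIX = ".desc"


def truncate_filename(filename, max_bytes=MAX_NAME_BYTES):
    """Shorten filename so that filename + '.desc' fits in max_bytes of UTF-8.

    Hypothetical helper: the length is measured in encoded bytes,
    not in Unicode characters.
    """
    budget = max_bytes - len(DESC_SUFFIX)
    if len(filename.encode("utf-8")) <= budget:
        return filename  # already short enough, keep it unchanged

    stem, ext = os.path.splitext(filename)  # ext is "" if there is no '.'
    digest = hashlib.md5(filename.encode("utf-8")).hexdigest()  # 32 ASCII chars
    room = budget - len(digest) - len(ext.encode("utf-8"))
    if room < 0:
        raise ValueError("max_bytes is too small for digest and extension")

    # Cut on a byte boundary, then drop a trailing partial multi-byte character.
    head = stem.encode("utf-8")[:room].decode("utf-8", errors="ignore")
    return head + digest + ext
```

With this approach the limit means what it says: the final name, including .desc, never exceeds max_bytes on disk, regardless of how many accented characters the title contains.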

I can prepare a PR, but I'm not sure how to test this in an appropriate manner to avoid bugs like this in the future.

nemobis · 2017-02-21T11:53:13Z

As long as we write files to disk, there is no perfect solution other than downloading a wiki's files only from a host which uses the same filesystem as the wiki's server. The only alternative I can think of is to require tar and append downloaded files straight from memory to the tar file without ever using the local filesystem (maybe even 7z allows to do so, but I'm not sure).

Robert Felten, 21/02/2017 12:47:

> But the /real bug/ is a hidden in the unicode handling of Python:

On this I can certainly agree. Thanks for the diagnosis. I think a way to test the bug is simply to download images from a wiki which has such filenames, interrupt the download at some point and then resume it. The resume usually fails, probably for the reason you described.
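For reference, the tar idea could look roughly like the sketch below. It relies only on Python's standard tarfile module; the helper name append_to_tar and the images/ member prefix are made up for illustration, and nothing like this exists in dumpgenerator.py as far as this thread shows. The PAX format is chosen on the assumption that long, non-ASCII member names need to survive unchanged.

```python
import io
import tarfile
import time


def append_to_tar(tar, name, data):
    """Add `data` (bytes) to an open TarFile under `name`, with no temp file.

    Hypothetical helper: the member name lives only inside the archive,
    so the local filesystem's filename length limit does not apply to it.
    """
    info = tarfile.TarInfo(name=name)
    info.size = len(data)          # addfile() reads exactly this many bytes
    info.mtime = int(time.time())
    tar.addfile(info, io.BytesIO(data))


if __name__ == "__main__":
    long_name = "images/" + "é" * 200 + ".JPG"   # far beyond 143 bytes
    archive = io.BytesIO()                        # stands in for a real .tar file
    with tarfile.open(fileobj=archive, mode="w", format=tarfile.PAX_FORMAT) as tar:
        append_to_tar(tar, long_name, b"fake image bytes")

    archive.seek(0)
    with tarfile.open(fileobj=archive) as tar:
        print(tar.getnames() == [long_name])
```

Only the archive itself needs a name the local filesystem accepts. For an uncompressed archive on disk, opening it with mode "a" would allow a resumed session to keep appending.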


rfelten · 2017-02-22T16:59:18Z

I've created a PR, see #293. I hope I've fixed all the bugs and didn't break anything.

I've also created a new testcase file for things that can be tested offline, since I don't want to download several gigabytes every time I change something. Unfortunately the current codebase is not very test-friendly; for instance, I see no way to get the other dict (which contains parts of the configuration) from dumpgenerator.py :(

There was also another bug: if the filename parameter contained no '.', the filename was doubled. Also fixed.
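One plausible way such doubling can arise, shown as a small illustration rather than a quote of the old code: if the extension is taken with filename.split('.')[-1], a name without any '.' yields the whole name as its "extension". The two helper names below are hypothetical; os.path.splitext is the standard-library call that returns an empty extension in that case.

```python
import os


def ext_via_split(filename):
    # Assumed pattern: with no '.', split() returns the whole name,
    # so the "extension" is the filename itself.
    return "." + filename.split(".")[-1]


def ext_via_splitext(filename):
    # os.path.splitext returns "" when there is no extension.
    return os.path.splitext(filename)[1]


name = "README"
print(name[:3] + ext_via_split(name))     # REA.README
print(name[:3] + ext_via_splitext(name))  # REA
```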

rfelten added a commit to rfelten/wikiteam that referenced this issue Feb 22, 2017

WikiTeam#292: added testcase to reproduce issue

38e6974

rfelten added a commit to rfelten/wikiteam that referenced this issue Feb 22, 2017

WikiTeam#292: added more testcases

b64513e

rfelten added a commit to rfelten/wikiteam that referenced this issue Feb 22, 2017

WikiTeam#292: refactoring: truncate filename logic at one place

10f1337

rfelten added a commit to rfelten/wikiteam that referenced this issue Feb 22, 2017

WikiTeam#292: fixed WikiTeam#292

2f826ef

rfelten added a commit to rfelten/wikiteam that referenced this issue Feb 22, 2017

WikiTeam#292: make parameter type explicit, removed debug print

f92bf8b

rfelten added a commit to rfelten/wikiteam that referenced this issue Feb 22, 2017

WikiTeam#292: added another testcase

6c2cbbb

rfelten added a commit to rfelten/wikiteam that referenced this issue Feb 22, 2017

changed filenamelimit to 140 after bug fix filename truncation. see W…

d638712

…ikiTeam#292

rfelten mentioned this issue Feb 22, 2017

Fix for #292 and changed filenamelimit #293

Open

nemobis added bug cross platform labels Feb 10, 2020

yzqzss mentioned this issue Jan 20, 2023

Deprecate truncateFilename() and increase filename length limit. mediawiki-client-tools/mediawiki-dump-generator#104

Merged
