FTP list fails with large number of file #57

Open
horkko opened this issue Aug 12, 2016 · 26 comments
horkko commented Aug 12, 2016

Hi,

I'm facing a problem with a bank that downloads a lot of files.
I'm trying to get files from Genbank WGS (ftp://ftp.ncbi.nlm.nih.gov/genbank/wgs).
This directory contains around 84,000 files. When I run biomaj, I always get this error:

[ftp.py:list:277] Could not get errcode:(56, 'FTP response reading failed')

It seems to mean that the FTP response takes longer than expected when retrieving the list of files.
I've tried setting some options (such as FTP_RESPONSE_TIMEOUT), without success.
So my question is: do you have any idea how to avoid this problem?
The problem is similar in Firefox: listing the wgs directory ends with a blank page.
However, with ncftp the dir command succeeds, but we have to wait around a minute to get the file list.

Thanks

Emmanuel

osallou commented Aug 12, 2016

Hmm, I have not run into this issue. I would also have looked for a timeout problem, but if that does not solve it, I don't know.
Maybe it is a different timeout: not a response timeout but a connect timeout or something like that. I will have a look next week.

@osallou osallou added the bug label Aug 12, 2016

osallou commented Aug 12, 2016

Could you try with curl directly, using the option --trace trace.txt? I saw the same issue reported online, but for sftp servers, not ftp.

horkko commented Aug 12, 2016

Yes me too. But the NCBI site is not sftp :(
Here is the command:
curl --trace-ascii trace.txt --use-ascii ftp://ftp.ncbi.nlm.nih.gov/genbank/wgs/

Here is the trace.txt output:

== Info: About to connect() to ftp.ncbi.nlm.nih.gov port 21 (#0)
== Info:   Trying 130.14.250.13... == Info: connected
== Info: Connected to ftp.ncbi.nlm.nih.gov (130.14.250.13) port 21 (#0)
<= Recv header, 6 bytes (0x6)
0000: 220-
<= Recv header, 18 bytes (0x12)
0000:  Warning Notice!
<= Recv header, 3 bytes (0x3)
0000:  
<= Recv header, 77 bytes (0x4d)
0000:  You are accessing a U.S. Government information system which in
0040: cludes this
<= Recv header, 66 bytes (0x42)
0000:  computer, network, and all attached devices. This system is for
<= Recv header, 80 bytes (0x50)
0000:  Government-authorized use only. Unauthorized use of this system
0040:  may result in
<= Recv header, 77 bytes (0x4d)
0000:  disciplinary action and civil and criminal penalties. System us
0040: ers have no
<= Recv header, 80 bytes (0x50)
0000:  expectation of privacy regarding any communications or data pro
0040: cessed by this
<= Recv header, 72 bytes (0x48)
0000:  system. At any time, the government may monitor, record, or sei
0040: ze any
<= Recv header, 73 bytes (0x49)
0000:  communication or data transiting or stored on this information 
0040: system.
<= Recv header, 6 bytes (0x6)
0000:  ---
<= Recv header, 90 bytes (0x5a)
0000:  Welcome to the NCBI ftp server! The anonymous access URL is ftp
0040: ://ftp.ncbi.nlm.nih.gov/
<= Recv header, 3 bytes (0x3)
0000:  
<= Recv header, 102 bytes (0x66)
0000:  Public data may be downloaded by logging in as "anonymous" usin
0040: g your E-mail address as a password.
<= Recv header, 3 bytes (0x3)
0000:  
<= Recv header, 85 bytes (0x55)
0000:  Please see ftp://ftp.ncbi.nlm.nih.gov/README.ftp for hints on l
0040: arge file transfers
<= Recv header, 23 bytes (0x17)
0000: 220 FTP Server ready.
=> Send header, 16 bytes (0x10)
0000: USER anonymous
<= Recv header, 75 bytes (0x4b)
0000: 331 Anonymous login ok, send your complete email address as your
0040:  password
=> Send header, 22 bytes (0x16)
0000: PASS ftp@example.com
<= Recv header, 50 bytes (0x32)
0000: 230 Anonymous access granted, restrictions apply
=> Send header, 5 bytes (0x5)
0000: PWD
<= Recv header, 34 bytes (0x22)
0000: 257 "/" is the current directory
== Info: Entry path is '/'
=> Send header, 13 bytes (0xd)
0000: CWD genbank
<= Recv header, 28 bytes (0x1c)
0000: 250 CWD command successful
=> Send header, 9 bytes (0x9)
0000: CWD wgs
<= Recv header, 28 bytes (0x1c)
0000: 250 CWD command successful
=> Send header, 6 bytes (0x6)
0000: EPSV
== Info: Connect data stream passively
<= Recv header, 48 bytes (0x30)
0000: 229 Entering Extended Passive Mode (|||50241|)
== Info:   Trying 130.14.250.13... == Info: connected
== Info: Connecting to 130.14.250.13 (130.14.250.13) port 50241
=> Send header, 8 bytes (0x8)
0000: TYPE A
<= Recv header, 19 bytes (0x13)
0000: 200 Type set to A
=> Send header, 6 bytes (0x6)
0000: LIST
<= Recv header, 54 bytes (0x36)
0000: 150 Opening ASCII mode data connection for file list
== Info: Maxdownload = -1
<= Recv data, 0 bytes (0x0)
== Info: Remembering we are in dir "genbank/wgs/"
== Info: FTP response reading failed
== Info: Connection #0 to host ftp.ncbi.nlm.nih.gov left intact
=> Send header, 6 bytes (0x6)
0000: QUIT
== Info: FTP response reading failed
== Info: Closing connection #0

osallou commented Aug 12, 2016

I think it expects to start receiving something within X seconds, and cancels if the timeout is reached.

horkko commented Aug 12, 2016

Yes, that's what I suspected, but I could not find any documentation on this for pycurl.
There is an FTP_RESPONSE_TIMEOUT option in the curl library, but apparently not in pycurl.
I tried setting it in ftp.list with self.crl.setopt(pycurl.FTP_RESPONSE_TIMEOUT, 300). No warning from biomaj, but the listing still did not succeed.

By the way, I've tried with pdb, which downloads more than 100,000 files. The listing does not fail!!
So I've found a small difference between the two output logs:

Genbank

> CWD wgs
< 250 CWD command successful
> EPSV
* Connect data stream passively
< 229 Entering Extended Passive Mode (|||50205|)
*   Trying 130.14.250.12... * connected
* Connecting to 130.14.250.12 (130.14.250.12) port 50205
> TYPE A
< 200 Type set to A
> LIST
< 150 Opening ASCII mode data connection for file list

PDB

< 250K. Current directory is /pub/pdb/derived_data
> PASV
* Connect data stream passively
< 227 Entering Passive Mode (165,230,17,202,197,113)
*   Trying 165.230.17.202... * connected
* Connecting to 165.230.17.202 (165.230.17.202) port 50545
> LIST
< 150 Accepted data connection

The only difference I see is the mode: PASV for PDB and EPSV for Genbank. Could it be a clue?

EDIT: I've tried disabling EPSV mode for ftp with self.crl.setopt(pycurl.FTP_USE_EPSV, 0), but it has no effect :(
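
For readers comparing the two logs above, here is a minimal sketch of how the 227 (PASV) and 229 (EPSV) replies encode the data port. The function names are hypothetical, and the EPSV parser assumes the usual `|` delimiter seen in the Genbank log. Both replies only tell the client where to open the data connection, which is consistent with the mode not explaining the failure.

```python
import re


def parse_pasv(reply):
    """Return (host, port) from a 227 reply.

    The six numbers are the four IPv4 octets followed by the high and
    low bytes of the port.
    """
    match = re.search(r"\((\d+),(\d+),(\d+),(\d+),(\d+),(\d+)\)", reply)
    if match is None:
        raise ValueError("not a PASV reply: %r" % reply)
    numbers = [int(n) for n in match.groups()]
    host = ".".join(str(n) for n in numbers[:4])
    port = numbers[4] * 256 + numbers[5]
    return host, port


def parse_epsv(reply):
    """Return the port from a 229 reply (assumes '|' as the delimiter)."""
    match = re.search(r"\(\|\|\|(\d+)\|\)", reply)
    if match is None:
        raise ValueError("not an EPSV reply: %r" % reply)
    return int(match.group(1))


# Replies copied from the two logs in this comment.
print(parse_pasv("227 Entering Passive Mode (165,230,17,202,197,113)"))
print(parse_epsv("229 Entering Extended Passive Mode (|||50205|)"))
```

The PASV port works out to 197 * 256 + 113 = 50545, which matches the "port 50545" line in the PDB log.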

osallou commented Aug 12, 2016

Passive vs. active should not be the issue. It usually only causes problems when going through firewalls.
Pycurl provides the same options as libcurl.
The issue is the time it takes to get the start of the list. For PDB it is almost immediate.
It seems their server has trouble returning the listing (why does it take so long?).

osallou commented Aug 12, 2016

Did you try setting CURLOPT_TIMEOUT, just like for the download step (and setting the parameter in the config)?

horkko commented Aug 12, 2016

CURLOPT_TIMEOUT is already set in ftp.list

        self.crl.setopt(pycurl.CONNECTTIMEOUT, 300)
        # Download should not take more than 5 minutes
        self.crl.setopt(pycurl.TIMEOUT, self.timeout)
        self.crl.setopt(pycurl.NOSIGNAL, 1)

which refers to workflow.py

        timeout_download = self.bank.config.get('timeout.download')
        if timeout_download is not None and timeout_download:
            downloader.timeout = int(timeout_download)

Even if I increase this value, it has no effect :(
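
For reference, a small sketch that gathers the curl options tried so far in this thread in one place. The helper names `build_listing_options` and `apply_options` are hypothetical and are not part of biomaj; the option names are the pycurl constants quoted in the comments above. It only shows how the settings could be applied together and does not claim that any combination fixes the listing.

```python
def build_listing_options(timeout=300, connect_timeout=300):
    """Collect the options discussed in the thread, keyed by pycurl constant name."""
    return {
        "CONNECTTIMEOUT": connect_timeout,  # time allowed to establish the connection
        "TIMEOUT": timeout,                 # time allowed for the whole transfer
        "NOSIGNAL": 1,                      # same value as in the ftp.list excerpt
        "FTP_RESPONSE_TIMEOUT": timeout,    # time allowed for each server reply
        "FTP_USE_EPSV": 0,                  # ask for PASV instead of EPSV
    }


def apply_options(handle, curl_module, options):
    """Call handle.setopt() for each option, looking constants up by name."""
    for name, value in options.items():
        handle.setopt(getattr(curl_module, name), value)


# Assumed usage with pycurl installed:
#   import pycurl
#   crl = pycurl.Curl()
#   apply_options(crl, pycurl, build_listing_options(timeout=600))
```

Passing the module as an argument keeps the sketch importable without pycurl.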

horkko commented Aug 16, 2016

Hi,

This may be a clue for fixing the problem. Using the curl option CURLOPT_DIRLISTONLY partially solves it.

...
< 200 Type set to A
> NLST
< 150 Opening ASCII mode data connection for file list
* Maxdownload = -1
* Remembering we are in dir "genbank/wgs/"
< 226 Transfer complete
* Connection #0 to host ftp.ncbi.nlm.nih.gov left intact
...

At least the directory listing is available. However, we fail later in the workflow, because this curl option only lists the directory content; it does not retrieve metadata such as permissions, date, size, etc.
So for a bank with a release.file set it should not be a problem, but for a bank that bases its release number on the date of the last updated file, we end up with this error:

2016-08-16 14:47:55,917 ERROR [root][MainThread] [workflow.py:start:135] Workflow:downloadException:'year'

which comes from building the release based on the last updated files :(
Is this new option a good start for solving the problem?
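
To illustrate the difference between the two replies, here is a rough sketch that parses one LIST line and one NLST line. It assumes a Unix-style `ls -l` layout for LIST, which is common but not guaranteed by the FTP protocol. The sample line and the field names are invented for illustration and are not biomaj's own. The NLST entry has no date fields at all, which fits a date-based release failing on a missing 'year'.

```python
import re

# Assumed Unix-style layout: permissions, links, owner, group, size, date, name.
LIST_LINE = re.compile(
    r"^(?P<permissions>[-dl][rwxsStT-]{9})\s+"
    r"(?P<links>\d+)\s+(?P<owner>\S+)\s+(?P<group>\S+)\s+"
    r"(?P<size>\d+)\s+"
    r"(?P<month>[A-Z][a-z]{2})\s+(?P<day>\d{1,2})\s+"
    r"(?P<year_or_time>\d{4}|\d{1,2}:\d{2})\s+"
    r"(?P<name>.+)$"
)


def parse_list_line(line):
    """Parse one LIST line into a dict, or return None if it does not match."""
    match = LIST_LINE.match(line.rstrip("\r\n"))
    if match is None:
        return None
    entry = match.groupdict()
    entry["size"] = int(entry["size"])
    entry["day"] = int(entry["day"])
    return entry


def parse_nlst_line(line):
    """An NLST line carries the file name only."""
    return {"name": line.strip()}


# Hypothetical sample line, not taken from the NCBI server.
sample = "-r--r--r--   1 ftp      anonymous     1234 Aug 12  2016 wgs.AAAA.1.gbff.gz"
print(parse_list_line(sample))
print(parse_nlst_line("wgs.AAAA.1.gbff.gz\r\n"))
```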

osallou commented Aug 16, 2016

we need all metadata, so it is not good :-(

horkko commented Aug 16, 2016

Yes, I know, unless we can combine such a bank (with a huge file list) with a release file number.

osallou commented Aug 16, 2016

This is a workaround for a specific bank, and it is not even certain that it will work 100%.

horkko commented Aug 16, 2016

yeah you're right :(

osallou commented Aug 16, 2016

could you share the bank ini file?

horkko commented Aug 16, 2016

Here is the info for Genbank WGS:

protocol=ftp
server=ftp.ncbi.nlm.nih.gov
remote.dir=/genbank/wgs/
remote.files=^wgs\.\w{4}[\.\d]*\.g(np|bff)\.gz$
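
As a quick check of what the remote.files pattern above selects, here is a small sketch that applies it to a few file names. The names are made up for illustration and are not taken from the server; `select_files` is a hypothetical helper.

```python
import re

# Pattern copied from the remote.files setting above.
REMOTE_FILES = re.compile(r"^wgs\.\w{4}[\.\d]*\.g(np|bff)\.gz$")


def select_files(names):
    """Keep only the names that match the remote.files pattern."""
    return [name for name in names if REMOTE_FILES.match(name)]


# Hypothetical directory content.
names = [
    "wgs.AAAA.1.gbff.gz",
    "wgs.AAAA.gnp.gz",
    "wgs.AAAA.1.fsa_nt.gz",
    "README",
]
print(select_files(names))
```

Since the pattern only needs file names, this filtering step on its own would work with an NLST reply; it is the release date computation that needs the LIST metadata.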

osallou commented Aug 16, 2016

I am trying the TCP_KEEPALIVE option, which needs pycurl/curl version >= 7.25.0.
I reached the default timeout (5 minutes); I will try a higher value to see if I can get something.

horkko commented Aug 16, 2016

OK. For info, I updated my pycurl from 7.19 to 7.43 today, but it did not change anything compared to the original problem.
Let me know about this new option.

osallou commented Aug 16, 2016

Not better, but the error (56, 'response reading failed') occurs anywhere from 1 minute onwards (it once occurred at 5 minutes). It varies, so it depends on the remote server.

horkko commented Aug 16, 2016

For info, using ncftp from the command line does not report error 56, and the first output of the directory listing appears after about a minute.

osallou commented Aug 16, 2016

does ncftp report all metadata ?

horkko commented Aug 16, 2016

I don't think so; I don't remember, actually.

osallou commented Aug 16, 2016

maybe it acts like CURLOPT_DIRLISTONLY

horkko commented Aug 16, 2016

probably :(

horkko commented Aug 19, 2016

Hi Olivier,

Back to the problem. We've found its source. It is not related to pycurl or even libcurl itself. The directory listing works well when we only ask for the names of the files in the remote directory (ftp command NLST instead of LIST).
As soon as we ask for the related metadata (time, size, permissions), the time the server takes to build the list exceeds a certain threshold, after which the remote ftp server closes the connection with a FIN-ACK on the command channel as well as on the data channel.
That's why we get the error (56, 'FTP response reading failed').
Hope that helps.

Emmanuel
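
One possible way to recover modification dates without LIST, sketched under assumptions: fetch the names with NLST, then ask for each file's timestamp with the MDTM command. This assumes the server supports MDTM, which was not checked in this thread, and it costs one round trip per file, so it would be slow for 84,000 entries. The helper names are hypothetical; `ftp` is expected to behave like an `ftplib.FTP` object.

```python
from datetime import datetime


def parse_mdtm(reply):
    """Convert an MDTM reply such as '213 20160816144755' to a datetime."""
    code, _, stamp = reply.partition(" ")
    if code != "213":
        raise ValueError("unexpected MDTM reply: %r" % reply)
    # Ignore optional fractional seconds after the 14-digit timestamp.
    return datetime.strptime(stamp.strip()[:14], "%Y%m%d%H%M%S")


def list_with_dates(ftp, limit=None):
    """Return {name: modification time} using NLST followed by MDTM per file."""
    names = ftp.nlst()
    if limit is not None:
        names = names[:limit]
    return {name: parse_mdtm(ftp.sendcmd("MDTM " + name)) for name in names}


# Assumed usage against a real server (not run here):
#   from ftplib import FTP
#   ftp = FTP("ftp.ncbi.nlm.nih.gov", timeout=300)
#   ftp.login()
#   ftp.cwd("/genbank/wgs")
#   dates = list_with_dates(ftp, limit=10)
```

This does not recover permissions, and size would need a further command per file, so it is only a partial substitute for the LIST metadata.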

osallou commented Aug 19, 2016

Nice analysis. Maybe you should contact the upstream ftp maintainers to raise the issue and get it solved.
Biomaj needs the metadata and, beyond that, this is an issue for any user with a browser.

horkko commented Aug 19, 2016

Thanks :)
I have already contacted NCBI support about this; I am waiting for their reply.
And yes, you are right: the directory listing is not possible with a web browser :(

@osallou osallou added question and removed bug labels Aug 19, 2016