Parsing problems? #34

Closed
jcharaoui opened this issue Jul 19, 2019 · 7 comments
Labels
bug Something isn't working

Comments

@jcharaoui
Collaborator

Since upgrading to 1.1.6 on Debian Buster I'm seeing problems which I think are related to index parsing. Now, with several pages, some directories are detected as files and vice-versa. See http://security-cdn.debian.org/debian-security/ for an example.

Also, for some directories, HTTPDirFS hangs completely in the background and it's impossible to unmount the directory.

The same listings tested with version 1.0.1 work fine.

@jcharaoui
Collaborator Author

jcharaoui commented Jul 19, 2019

Here is the log from parsing the debian-security URL:

$ ./httpdirfs -f http://security-cdn.debian.org/debian-security/ ~/foo
libcurl SSL engine: OpenSSL/1.1.1c
--------------------------------------------
 LinkTable 0x561430b5c240 for http://security-cdn.debian.org/debian-security/
--------------------------------------------
0 H 0  http://security-cdn.debian.org/debian-security/
1 F 183 README.security http://security-cdn.debian.org/debian-security/README.security
2 F 2214 dists http://security-cdn.debian.org/debian-security/dists
3 F 930 indices http://security-cdn.debian.org/debian-security/indices
4 F 413583 ls-lR.gz http://security-cdn.debian.org/debian-security/ls-lR.gz
5 F 1438 pool http://security-cdn.debian.org/debian-security/pool
6 F 1078 project http://security-cdn.debian.org/debian-security/project
7 F 1663 zzz-dists http://security-cdn.debian.org/debian-security/zzz-dists
--------------------------------------------
LinkTable_print(): Invalid link count: 0, http://security-cdn.debian.org/debian-security/.
--------------------------------------------

@jcharaoui jcharaoui added the bug Something isn't working label Jul 19, 2019
@fangfufu
Owner

I am afraid this might have been a bug in v1.0.1. The idea is to open a link and check whether the response headers include Content-Length. If they do, the link is a file. If not, it is a directory. In http://security-cdn.debian.org/debian-security/, every response has a Content-Length header, so everything is treated as a file. This is the intended behaviour.

$ curl -I http://security-cdn.debian.org/debian-security/dists/
HTTP/1.1 200 OK
Server: Apache
X-Content-Type-Options: nosniff
X-Frame-Options: sameorigin
Referrer-Policy: no-referrer
X-Xss-Protection: 1
Cache-Control: max-age=120
Expires: Sat, 20 Jul 2019 14:37:44 GMT
X-Clacks-Overhead: GNU Terry Pratchett
Content-Type: text/html;charset=UTF-8
Via: 1.1 varnish
Content-Length: 2214
Accept-Ranges: bytes
Date: Sat, 20 Jul 2019 14:37:35 GMT
Via: 1.1 varnish
Age: 110
Connection: keep-alive
X-Served-By: cache-fra19142-FRA, cache-lcy19264-LCY
X-Cache: MISS, HIT
X-Cache-Hits: 0, 1
X-Timer: S1563633455.053133,VS0,VE0
Vary: Accept-Encoding

Another option is to detect the content type and classify everything with Content-Type: text/html;charset=UTF-8 as a directory, but I think that might be a bad idea: what if you have a directory with web pages? I suppose we could provide a command line option to turn it off.
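To make the Content-Length heuristic described above concrete, here is a minimal sketch that works on a raw header block. It is not the HTTPDirFS implementation; `has_content_length()` and `looks_like_file()` are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <strings.h>            /* strncasecmp (POSIX) */

/* Return true if a raw header block contains a Content-Length field.
 * Header field names are case-insensitive, so the comparison ignores case.
 * Only the start of each line is examined. */
bool has_content_length(const char *headers)
{
    static const char field[] = "content-length:";
    const size_t field_len = sizeof(field) - 1;

    assert(headers != NULL);
    for (const char *line = headers; *line != '\0'; ) {
        if (strncasecmp(line, field, field_len) == 0)
            return true;
        const char *eol = strchr(line, '\n');
        if (eol == NULL)
            break;              /* last line, no terminator */
        line = eol + 1;
    }
    return false;
}

/* Hypothetical classifier mirroring the heuristic in this comment:
 * Content-Length present means file, absent means directory. */
bool looks_like_file(const char *headers)
{
    return has_content_length(headers);
}
```

Given the headers returned for dists/ above, `looks_like_file()` returns true, which matches the `2 F 2214 dists` entry in the log: an index page served with a Content-Length is indistinguishable from a regular file under this rule.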

@fangfufu fangfufu added notabug and removed bug Something isn't working labels Jul 20, 2019
@jcharaoui
Collaborator Author

Really? Where does HTTPDirFS do this? I can't find any reference to this header in the code.

What about /src/link.c#L48-L65 ?

@fangfufu
Owner

fangfufu commented Jul 20, 2019

Yes, really. It is here:
/src/link.c#L162-L186

Perhaps this is not the smartest idea. Should I base the detection purely on the URL itself, rather than content length?

Basically we have two ways of setting the link type, in two separate functions. linkname_type() is called when a LinkTable is first created. Link_set_stat() is called when the links in the link table are checked.
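The following sketch illustrates how two passes that each set the type can disagree. It is an assumption-laden illustration, not the code in link.c: `Entry`, `entry_init()` and `entry_recheck()` are hypothetical stand-ins, and the rule used in the first pass (a name-based guess) is assumed rather than taken from the source.

```c
#include <assert.h>
#include <string.h>

typedef enum { KIND_FILE, KIND_DIR } Kind;

typedef struct {
    const char *name;
    Kind kind;
} Entry;

/* Pass 1 (table creation): guess the kind from the name alone.
 * Assumed rule: a trailing '/' means directory. */
void entry_init(Entry *e, const char *name)
{
    assert(e != NULL && name != NULL);
    size_t len = strlen(name);
    e->name = name;
    e->kind = (len > 0 && name[len - 1] == '/') ? KIND_DIR : KIND_FILE;
}

/* Pass 2 (link check): overwrite the kind from the reported length.
 * A negative content_length stands for "no Content-Length header". */
void entry_recheck(Entry *e, long content_length)
{
    assert(e != NULL);
    e->kind = (content_length < 0) ? KIND_DIR : KIND_FILE;
}
```

With these definitions, `entry_init(&e, "dists/")` yields `KIND_DIR`, but a later `entry_recheck(&e, 2214)` flips it to `KIND_FILE`, because the second pass discards what the first one decided.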

@jcharaoui
Collaborator Author

Well, I don't know if it's smart or not, but it's definitely unexpected. I think that by default we should detect whether a link is a directory by checking for a trailing slash on the HREF URL, since that seems to be the way most HTTP indexes work. What do you think?
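A minimal sketch of the trailing-slash check proposed here, assuming the href is available as a plain C string. `href_is_directory()` is a hypothetical helper, not an existing HTTPDirFS function, and ignoring the query string and fragment is an extra assumption on my part.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Return true if the path part of an href ends with '/'.
 * Anything from the first '?' or '#' onwards is ignored. */
bool href_is_directory(const char *href)
{
    assert(href != NULL);
    size_t len = strcspn(href, "?#");   /* length of the path part */
    return len > 0 && href[len - 1] == '/';
}
```

For example, `href_is_directory("dists/")` is true and `href_is_directory("ls-lR.gz")` is false, with no request needed to decide either case.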

@fangfufu
Owner

Yup, okay. I agree. Trailing slash definitely indicates a directory. I think I still need to keep the code to check whether a directory link is valid.

I will patch it up in a few hours' time :) - I need to go.

@fangfufu fangfufu added bug Something isn't working and removed notabug labels Jul 20, 2019
@fangfufu
Owner

Fixed in 78d8167
