UNIX man-page plugin (fathead) #36

Closed
wants to merge 33 commits into
from

3 participants
Contributor

flaming-toast commented Aug 14, 2012

Hi there! :D

I've been working on the unix man-page idea:
https://duckduckhack.uservoice.com/forums/5168-plugins/suggestions/2678996-manual-pages-return-unix-like-man-info

This'll be my initial code check-in. There are probably still a few bugs here and there, but for the most part, I believe my code does what it's supposed to do. The output file follows the programming data file format.

I'd also LOVE to have some feedback and advice. In particular, I'd like some ideas for fetching data faster.

I CAN fetch the data properly, and the code can parse it properly but fetching takes a LONG time because the number of available man pages is MASSIVE.
I haven't found any compressed archives of all the man pages (the ones that do exist omit Section 1!), and I am unable to grab compressed HTML files from my data source because of their robots.txt file (they do openly allow DuckDuckBot to crawl the site though, so maybe you guys will have better luck testing this than I do).

Anyway, try the plugin out and let me know your thoughts! Thank you!!

Contributor

rpicard commented Aug 14, 2012

It'll be great to finally have man pages! Do you have an estimate of how long fetching and parsing takes, i.e. seconds, minutes, hours, or days?

I'm pretty busy today and tomorrow (traveling / moving) so I'll probably get back to you with feedback later this week.

flaming-toast added some commits Aug 5, 2012

made a new directory with README
Signed-off-by: Robert Picard <mail@robert.io>
first working version
Signed-off-by: Robert Picard <mail@robert.io>
yay! second stable version, grabs name and synopsis correctly
Signed-off-by: Robert Picard <mail@robert.io>
removed redundant lines of code. code cleanup.
Signed-off-by: Robert Picard <mail@robert.io>
fixed @synopsis array bug
Signed-off-by: Robert Picard <mail@robert.io>
used for loop for regexes
Signed-off-by: Robert Picard <mail@robert.io>
added use strict; use warnings
Signed-off-by: Robert Picard <mail@robert.io>
The output.txt format was wrong. Fixed it to follow the guidelines
Signed-off-by: Robert Picard <mail@robert.io>
updated README with dependencies
Signed-off-by: Robert Picard <mail@robert.io>
Contributor

flaming-toast commented Aug 14, 2012

Thank you for checking in with me!

To fetch around 1,300 man pages from my data source, it took about an hour because I used wget's --wait option (delay 2 seconds every HTTP request), to ease server load (and also to not get blocked for sending too many requests at a time). I was blocked once for sending too many requests too quickly, and also because of their robots.txt.
Running the parser on these ~1,300 html man pages took about 1.5 seconds, and the resulting output format was correct.

They do welcome DDG to crawl their site though, so maybe you will have better luck fetching than I did. Fetching compressed files is also an option; however, that would mean we can't use wget's recursive downloading option.

There are a whopping ~8,100 man pages in Section 1 alone (user commands).
Sections 2-7 are system calls, library functions, games, etc., and Section 8 is administrative commands.
I believe users would be most interested in Sections 1 and 8 only, so limiting fetching to those two would also reduce fetching time.

Again, thanks for looking into this. :-)

Contributor

rpicard commented Aug 20, 2012

Okay, it looks like I was blocked too. The problem is that even though they allow DuckDuckBot to crawl the site, that is something unrelated to what we're doing here. I'm going to look into this a little more, but I don't know that this will end up being a good source for us.

Contributor

flaming-toast commented Aug 20, 2012

I was thinking the same thing (considering another source).

But...It was pretty hard for me to find a site that would let me fetch an immense number of static files.

A lot of man page sites run CGI scripts of some sort to process the query (so I cannot wget/fetch any files from those sites), and the sources that provide a compressed archive of the man pages omit Section 1.

I'll see what I can do.

EDIT: I have found some other sources that look really promising. (http://linuxcommand.org/ and the man-pages from tldp.org) These sources are NOT as comprehensive (1,000+ section 1 man pages vs ~150) and complete as linux.die.net, but if things don't work out with that site, I may rewrite the plugin to parse data from these websites instead.

EDIT #2: I have decided to rewrite the plugin using less problematic sources. I'll add more commits to this pull request as soon as I'm done.

Contributor

flaming-toast commented Aug 21, 2012

OK! I've switched to a different source of data and fetching is much less problematic now :-) Give it a go!

Output.txt should follow the programming data file format.

As a side note,
./fetch.sh takes about 55 minutes to fetch 2,905 HTML man pages. Source doesn't support compression : (
./parse.sh takes 6 seconds to parse them all and output the tab delimited file.

Contributor

rpicard commented Aug 24, 2012

I'm glad to hear you found another source. Before I try it out though, could you switch to the general data file format? If you want to include a synopsis, just wrap it like this: <pre><code>synopsis goes here</code></pre>.

Contributor

flaming-toast commented Aug 25, 2012

Sure. I changed the file format. I'm not sure if it's correct, lemme know if I need to change anything.

flaming-toast added some commits Aug 21, 2012

switched data source to linuxcommand.org
Signed-off-by: Robert Picard <mail@robert.io>
code cleanup
Signed-off-by: Robert Picard <mail@robert.io>
changed output to general data format
Signed-off-by: Robert Picard <mail@robert.io>
forgot a space between name and synopsis.
Signed-off-by: Robert Picard <mail@robert.io>
unix_man/parse.pl
+ }
+ }
+ my $url="http://linuxcommand.org/man_pages/$page".$section.'.html';
+ print "$page\tA\t\t\t\t\t\t\t\t\t\t$description <pre>@synopsis</pre>\t$url\n";
@rpicard

rpicard Aug 29, 2012

Contributor

Won't using @synopsis in a scalar context like that just return the length of the array?

@flaming-toast

flaming-toast Aug 29, 2012

Contributor

Do you mean using quotes in print()? Using quotes does interpolate the array, so print receives one item: the string produced by interpolating @synopsis (basically the contents of @synopsis separated by spaces). Otherwise, print takes its arguments in list context, doesn't it? (http://szabgab.com/scalar-and-list-context-in-perl.html). I have tested it and it does produce the correct output.

@rpicard

rpicard Aug 29, 2012

Contributor

I was thinking that @synopsis would return the length of the array so it would end up looking like:

<pre>7</pre>

...but if it's producing the right output, you can ignore me. :)
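
For reference, the difference between the two cases can be sketched as follows. This is a minimal illustration, not the code from parse.pl; the contents of @synopsis below are made up.

```perl
use strict;
use warnings;

# Hypothetical sample data; the real @synopsis is filled by parse.pl.
my @synopsis = ('ls', '[OPTION]...', '[FILE]...');

# Inside double quotes the array is interpolated: its elements are joined
# with the list separator (a single space by default).
my $interpolated = "<pre>@synopsis</pre>";

# String concatenation forces scalar context, which yields the element count.
my $concatenated = '<pre>' . @synopsis . '</pre>';

print "$interpolated\n";   # <pre>ls [OPTION]... [FILE]...</pre>
print "$concatenated\n";   # <pre>3</pre>
```

So the length only shows up when the array is used outside the quotes in a scalar context, which is not what the print line in the diff does.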

Contributor

rpicard commented Aug 29, 2012

Sorry it's been taking a while to get back to you. I just moved back on campus and started school so I've been getting settled in here. I'm running the fetch script now so I'll let you know if I run into any problems.

Contributor

flaming-toast commented Aug 29, 2012

No worries! I'm just starting school as well. So we're in the same boat. : )

And apologies in advance for any mistakes in the script. This is my first Perl program so I'm not used to all the quirks, different contexts, and such in Perl. Feel free to correct or point out any mistakes in my code. I've tested the script, though, and it appears to work as intended.

Contributor

rpicard commented Aug 29, 2012

If it's working for you that's a good sign!

I'm getting this error with fetch.sh though:

Malformed UTF-8 character (fatal) at ./parse.pl line 11, <MANPAGE> line 24.

Do you have any thoughts?

Contributor

flaming-toast commented Aug 29, 2012

That is a strange error...fetch.sh shouldn't have anything to do with parse.pl at all....How are you invoking fetch.sh?

Anyway, I removed the use encoding pragma in parse.pl since I realized I was using it incorrectly (we need stdout to be in unicode, not the actual code itself). Try running it again?

EDIT: (Now that I think of it...the cause of the error could be that the parser received a unicode character from one of the HTML pages and couldn't process it properly. )

removed use encoding pragma
Signed-off-by: Robert Picard <mail@robert.io>
Contributor

rpicard commented Aug 29, 2012

@flaming-toast It looks like that squelched the error. I have 2,905 lines in output.txt. Is that the correct number of man pages?

I'll let you know if I run into any problems moving forward.

Contributor

flaming-toast commented Aug 29, 2012

Yes, that's correct! (that's all the man pages the data source provided me, at least).

Contributor

rpicard commented Aug 29, 2012

Great!

unix_man: Put the synopsis at the top and wrap it in code blocks as well as pre

These changes are intended to keep the formatting consistent with other current plugins.
Contributor

rpicard commented Sep 2, 2012

Alright, we're off to a good start. I made a few small formatting changes in this commit.

  1. Some text in the "Synopsis" section for man perl is being captured by parse.pl and included in the synopsis block, which I think should just be the relevant commands.

See: https://ddh4.duckduckgo.com/?q=man+perl

  2. Also, there are still some unicode characters that are causing problems. Each error below is followed by the commands whose lines contain the problematic characters:

ERROR:  invalid byte sequence for encoding "UTF8": 0xe294e2
dprofpp and h2ph

ERROR:  invalid byte sequence for encoding "UTF8": 0xc2c2
ksh

ERROR:  invalid byte sequence for encoding "UTF8": 0xe280e2
pnmtopng, pppoe, and vorbiscomment

I deleted those lines for now, but I'd like to fix the problem so they can either be fixed, or not included in output.txt to begin with.

If you make some more commits, make sure that you include any commits in the unix_man branch so you have all of my latest changes.

Contributor

rpicard commented Sep 5, 2012

@flaming-toast I just realized that I didn't tag you in that last post. I'm not sure if it will still notify you of the reply if I don't tag you.

Contributor

flaming-toast commented Sep 5, 2012

No worries, I still got a notification : ) I'm looking into the problematic unicode characters and the extra lines in the synopsis. Should be able to come up with a fix by the weekend.

Contributor

rpicard commented Sep 5, 2012

@flaming-toast Okay, great. Let me know if you have any questions.

Contributor

flaming-toast commented Sep 8, 2012

Hi rpicard! I THINK I fixed the invalid byte sequence issue...however the fix itself was to include the use encoding pragma, which had caused the Malformed UTF-8 character (fatal) at ./parse.pl line 11, <MANPAGE> line 24. error you had earlier. Please test it and tell me if you run into any errors.

Here's a summary of the changes I made.

  1. The synopsis section
    Bottom line is, the synopses vary a LOT. Some are paragraphs long, some don't describe options, etc. To narrow things down, I've decided to capture the synopsis section only if it follows the typical 'command [options] [arguments] etc' pattern, and to ignore any long walls of text.
  2. Bash built-ins
    I had also forgotten to take into account bash built-ins, which don't REALLY have man pages.
    For example, take a look at these "man pages" for alias, fg, and bg, some well-known bash built-ins:
    http://linuxcommand.org/man_pages/bg1.html
    http://linuxcommand.org/man_pages/fg1.html
    http://linuxcommand.org/man_pages/alias1.html
    To deal with the built-ins, I had to parse them separately, since their "man page" follows a different structure than the typical man page. The parser just grabs each built-in's "synopsis" from the page. In general though, this doesn't make a difference to the output file.

Also, commands that have an empty description and synopsis (meaning the parser didn't grab anything) aren't included in the output file (this is generally because the man-page doesn't have a synopsis section, or it's not a typical command, etc).

Apologies for the long comment! Thanks for working with me on this!

Contributor

rpicard commented Sep 13, 2012

@flaming-toast Could you rebase those changes so that they are based on the latest changes in unix_man? I can't apply the patches otherwise.

Contributor

flaming-toast commented Sep 13, 2012

Sorry about that. :\ Never rebased before so I ran into some problems...but git log shows your commits from the unix_man branch now so hopefully it's okay now.

Contributor

rpicard commented Sep 24, 2012

For some reason I was having difficulty getting git in the right place here, but I finally got it working (in the flaming-master branch). I'm getting this error again:

Malformed UTF-8 character (fatal) at ./parse.pl line 11, <MANPAGE> line 24
Contributor

flaming-toast commented Sep 26, 2012

It's really difficult to debug if I don't get the same error messages on my own machine....Could you maybe describe your testing environment?
In any case, this is possibly an issue with using 'UTF-8' vs 'utf8', the former being more strict. I'm also trying to use Encode to convert the strings to utf8 before handling them....Try and see if you get the same error message?

+#!/usr/bin/env perl
+use strict;
+use warnings;
+use encoding 'utf8';
@xfix

xfix Sep 27, 2012

Member

The encoding pragma is for encodings other than UTF-*. For UTF-8, use the utf8 pragma. But you don't need it anyway: your code doesn't contain UTF-8 characters.

@flaming-toast

flaming-toast Sep 27, 2012

Contributor

At first I didn't have it, but I added it after running into an encoding bug @rpicard was running into. I was thinking that the script was reading in utf8 characters from the HTML pages and needed to be handled as utf8 characters -- I wasn't sure, so I added it to see if it would change anything...

unix_man/parse.pl
+ s/<\/[a-zA-Z]*>//g;
+ s/[\s\t]*$//g;
+ s/^[\s\t]*//g;
+ }
@xfix

xfix Sep 27, 2012

Member

First, your code makes modifications directly on the @_ variables. This is a very bad idea, because the elements of @_ are aliases for the caller's variables, so the caller's data gets changed. Make a copy using my $line = shift or something.

| can be used for alternations and \s includes \t, so your code could look like $line =~ s/<\/?\w*>|^\s*|\s*$//g. In result, I would write that code like:

sub strip {
    my $line = shift;
    $line =~ s/<\/?\w*>|^\s*|\s*$//g;
    return $line;
}
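
To illustrate why the copy matters, here is a small sketch. The strip_in_place sub is a hypothetical stand-in for the in-place style (it is not the code from the pull request), and the sample line is made up.

```perl
use strict;
use warnings;

# Hypothetical in-place variant: $_[0] is an alias for the caller's variable,
# so the substitution rewrites the caller's string.
sub strip_in_place {
    $_[0] =~ s/<\/?\w*>|^\s*|\s*$//g;
    return $_[0];
}

# Copy-first version, as suggested in the review comment.
sub strip {
    my $line = shift;
    $line =~ s/<\/?\w*>|^\s*|\s*$//g;
    return $line;
}

my $raw       = "  <b>ls</b> - list directory contents  \n";
my $clean     = strip($raw);               # works on a copy
my $untouched = ($raw =~ /<b>/) ? 1 : 0;   # 1: $raw still has its tags

strip_in_place($raw);                      # $raw itself is rewritten

print "$clean\n";
print "$raw\n";
```

Both calls return the same cleaned text; the difference is that only strip() leaves the caller's $raw intact.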
@flaming-toast

flaming-toast Sep 27, 2012

Contributor

This was exactly what I was looking for (alternations), but my web searching had failed me. Thanks!

unix_man/parse.pl
+ return $_[0];
+}
+
+my @builtins = ( 'alias', 'bg', 'bind','break', 'builtin', 'cd', 'command', 'compgen', 'complete', 'continue', 'declare','dirs', 'disown', 'enable', 'eval', 'exec', 'exit', 'export', 'fc', 'getopts', 'hash', 'help', 'history', 'jobs', 'let', 'local', 'local', 'logout', 'popd', 'pushd', 'read', 'readonly', 'return', 'set', 'shift', 'shopt', 'source', 'suspend', 'times', 'trap', 'type', 'typeset', 'ulimit', 'umask', 'unalias', 'unset', 'wait', 'fg' );
@xfix

xfix Sep 27, 2012

Member

The qw() construct could be used there (it builds a list without the need for quotes and commas).

It looks like my @builtins = qw( alias bg bind break builtin cd command ... );

Contributor

rpicard commented Sep 27, 2012

@flaming-toast I'm going to try this out on my local machine to see if I still run into that problem (that commit didn't fix it on my server). I'm running fetch.sh now.

unix_man/parse.pl
+# for each HTML manpage in download/
+foreach my $page (@cmdlist)
+{
+ open MANPAGE, "download/$page" or die "Cannot open file: $page";
@xfix

xfix Sep 27, 2012

Member

First of all, you're using an old deprecated form of open - you should use a lexical variable instead - open my $manpage. Second, you can specify the encoding in open so you won't have to use decode() later - open my $manpage, '<:encoding(UTF-8)', "download/$page" - this will automatically decode UTF-8 when reading from the filehandle.

@flaming-toast

flaming-toast Sep 27, 2012

Contributor

Nice, that makes things a lot easier. Didn't know that form of open was deprecated, the book I had been reading was using that form. But thanks for the heads up.

@xfix

xfix Sep 27, 2012

Member

If your book was recommending that syntax then it's an old book and I really wouldn't use it to learn Perl. I would use a good tutorial like http://perl-begin.org/tutorials/modern-perl/ or http://perl-begin.org/tutorials/perl-for-newbies/.

Well, actually that form wasn't deprecated (it was error on my part), but it isn't recommended as it's global.
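
A minimal, self-contained sketch of the three-argument open with a lexical filehandle and a decoding layer might look like this. The temporary file stands in for download/$page, and its contents are made up for the example.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write a small UTF-8 encoded sample file (a stand-in for download/$page).
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out;                                    # write raw bytes
print {$out} "caf\xc3\xa9 - sample man page\n";  # \xc3\xa9 is U+00E9 in UTF-8
close $out or die "Cannot close $path: $!";

# Lexical filehandle, three-argument open, decoding layer given in the mode.
open my $manpage, '<:encoding(UTF-8)', $path
    or die "Cannot open file: $path: $!";
my $line = <$manpage>;
close $manpage;
chomp $line;

# The two bytes \xc3\xa9 were decoded into one character, so the first
# word is 4 characters long rather than 5 bytes.
my ($name) = split / /, $line;
print length($name), "\n";   # 4
```

Because the filehandle is a lexical variable, it is scoped like any other my variable instead of being a package-wide name such as MANPAGE.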

unix_man/parse.pl
+ print "$url\n";
+ close (MANPAGE);
+}
+exit 0;
@xfix

xfix Sep 27, 2012

Member

That exit isn't needed.

unix_man/parse.pl
+ $nextline = <MANPAGE>;
+ }
+ while (!($nextline =~ m/<h2>/i) && !($nextline =~ m/^[\s\t]+$/) && !($nextline =~ m/^$/)) {
+ $description .= &strip($nextline);
@xfix

xfix Sep 27, 2012

Member

This is Perl 4 syntax. Today you would use strip($nextline).

unix_man/parse.pl
+ while ($nextline =~ /^[\s\t]*$/ || $nextline =~ /^$/) { # skip the blank lines.
+ $nextline = <MANPAGE>;
+ }
+ while (!($nextline =~ m/<h2>/i) && !($nextline =~ m/^[\s\t]+$/) && !($nextline =~ m/^$/)) {
@xfix

xfix Sep 27, 2012

Member

Instead of !($blah =~ /regexp/) you can write $blah !~ /regexp/.

Contributor

flaming-toast commented Sep 27, 2012

@glitchmr Hi GlitchMr, thanks for all your suggestions. I'm not a Perl programmer and this is my first go at writing a Perl script, so you're definitely going to find weird things in my code (especially because I'm following books and doing web searches to help me write Perl; it could be that some of the sources I used were dated). I'll take a look at your suggestions.

Contributor

flaming-toast commented Sep 27, 2012

@rpicard I've fixed the code following the suggestions made by @glitchmr; it should still behave the same way.
Also a heads up, I'm also planning to remove some perl-specific man pages captured by the parser because they are not formatted like regular manpages (for example, perlfork). But let's focus on the encoding problem for now.

Contributor

rpicard commented Sep 28, 2012

@flaming-toast It works locally, but it still doesn't work on my server. I'm not sure what's up with that, but I'll just upload output.txt to the server to try it out.

Contributor

rpicard commented Sep 28, 2012

Okay, the latest version is on http://ddh4.duckduckgo.com.

https://ddh4.duckduckgo.com/?q=man+perl

There are some weird characters in that one that are showing up as question marks for me, but outside of that it's looking pretty good. Any thoughts?

Contributor

flaming-toast commented Sep 29, 2012

@rpicard
Those strange characters should be gone now.

They appear in the first place because the webpages that we're parsing contain malformed unicode characters (example: http://linuxcommand.org/man_pages/vorbiscomment1.html). I don't believe there's much I can do to fix that.
The new parse.sh, however, removes these problematic characters. Please try it out and let me know if the problem persists. (also try out parse.sh on the test server and see if you still get that error.....)
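
One way to drop malformed byte sequences in Perl is sketched below. This is an assumption about how such a cleanup could be done with the core Encode module, not necessarily what parse.sh does; the sample bytes are made up, modeled on the truncated 0xe280 sequence reported earlier.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Sample bytes containing a truncated multi-byte sequence:
# \xe2\x80 starts a three-byte character but the third byte is missing.
my $bytes = "pppoe\xe2\x80 ok";

# With the default CHECK value, decode() substitutes U+FFFD (the Unicode
# replacement character) for malformed input instead of dying.
my $text = decode('UTF-8', $bytes);

# Remove the substitution characters, leaving only the well-formed text.
$text =~ s/\x{FFFD}//g;

print "$text\n";
```

The trade-off is that the damaged character is lost rather than repaired, which matches the point above that the source pages themselves are malformed.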

Contributor

rpicard commented Sep 29, 2012

That definitely fixed this problem, but it still won't work on the server.

Here's the updated version: https://ddh4.duckduckgo.com/?q=man+perl

Contributor

flaming-toast commented Sep 29, 2012

It looks great! I'm going to fix a few more small things and deal with differently formatted man pages. Wonder why the test server is still having problems..I'll look into that too. : (

Contributor

rpicard commented Sep 30, 2012

@flaming-toast Let me know when you're ready for me to redeploy it.

Contributor

rpicard commented Oct 1, 2012

@flaming-toast: @yegg suggested that we move the title to the top, then have the first sentence from the description, then have the arguments synopsis.

For example, man zip would look something like this:

zip, zipcloak, zipnote, zipsplit - package and compress (archive) files
zip is a compression and file packaging utility for Unix, VMS, MSDOS, OS/2, Windows NT, Minix, Atari and Macintosh, Amiga and Acorn RISC OS.

zip [-aABcdDeEfFghjklLmoqrRSTuvVwXyz!@$] [-b path] [-n suffixes]
 [-t mmddyyyy] [-tt mmddyyyy] [ zipfile [ file1 file2 ...]] [-xi list]
Contributor

flaming-toast commented Oct 2, 2012

@rpicard
I like the idea, since a lot of man pages have useful first-sentences in the description, but I think this could potentially introduce more problems.

For instance, some man pages don't have descriptions (which I suppose we could leave out), and some description sections basically repeat the name section (not verbatim, but stating the same thing).

Take passwd's man page, for instance:
passwd -- update a user's authentication tokens.
description (first sentence) -- passwd is used to update a user's authentication tokens.
Another example is ruby(1)'s man page.

Some descriptions also start with something out of context, and some require more than the first sentence to complete an idea, for example uptime(1):

description (first sentence) -- uptime gives a one-line display of the following information.

(which is an incomplete thought). Dealing with all these special cases makes parsing more troublesome and complex: when should the parser stop capturing input, and what should it do with incomplete sentences and off-topic first sentences?

So for now, because of these potential issues, I think displaying the name (which includes a brief description anyway) and the synopsis is sufficient. Feel free to disagree with me though, and I'm open to more ideas/suggestions!

Contributor

rpicard commented Oct 2, 2012

@flaming-toast I figured some issues like that might get in the way. I agree that sticking with the title and synopsis is probably best. I do still think that we should put the title on top though.

Contributor

flaming-toast commented Oct 2, 2012

@rpicard oh yeah, forgot about that other suggestion. I'll switch back to title first and then synopsis. : )

Contributor

rpicard commented Oct 3, 2012

@flaming-toast Great, thanks!

Contributor

flaming-toast commented Oct 3, 2012

I switched the synopsis and name, can you check if the format's correct?

Contributor

rpicard commented Oct 7, 2012

@flaming-toast I updated the ZCI at https://ddh4.duckduckgo.com/?q=man+perl. I've passed it along internally, so I'll get back to you if there's any other feedback or if we're ready to deploy it.

Contributor

flaming-toast commented Oct 7, 2012

Sure thing. If you can, please let them know that I'm also still in the process of getting some bugs fixed.


Contributor

rpicard commented Oct 7, 2012

@flaming-toast Oh, okay. What are the bugs?

Contributor

flaming-toast commented Oct 7, 2012

@rpicard Nothing too serious haha.

I'm just working on having the parser ignore man pages that are really unnecessary to have in the output.txt file (e.g., all the perldeltas, there's like 10 of those) or don't follow the typical/common format of a man page, since it's giving the parser some trouble. (e.g. perlmodstyle, perlfunc, and several others).
Currently trying to think of a workaround for those, or ignore them completely.

Contributor

rpicard commented Oct 7, 2012

@flaming-toast Okay, sounds good. Let me know when you're ready.

Contributor

flaming-toast commented Oct 10, 2012

@rpicard I've pushed an update, could you re-stage? : ) Apologies for the slowness, midterm season is brutal!
The parser now ignores unwanted man-pages (which I just listed in a hash). The reason for removing these man-pages is that they are either outdated, extraneous, or they don't follow the typical name-synopsis format that we'd like. Feel free to pass this on internally, perhaps we could all help test and catch any remaining bugs.
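
A skip list kept in a hash can be sketched like this. The page names come from the examples mentioned earlier in the thread, but the variable names and the exact contents of the real hash are assumptions.

```perl
use strict;
use warnings;

# Hypothetical skip list; the real one in parse.pl may contain other names.
my %skip = map { $_ => 1 } qw(perldelta perlmodstyle perlfunc);

# Made-up list of fetched pages for the example.
my @pages = qw(ls perlfunc nc perldelta zip);

# Keep only the pages that are not in the skip list.
my @kept = grep { !$skip{$_} } @pages;

print "@kept\n";   # ls nc zip
```

A hash gives a constant-time membership check per page, which is why it suits this better than scanning an array for every man page.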

Contributor

rpicard commented Oct 13, 2012

@flaming-toast No problem! I had three exams this past week, so I know the feeling. I've deployed the changes to http://ddh4.duckduckgo.com.

I'll go ahead and pass it along to see if anyone catches anything else.

Contributor

rpicard commented Oct 13, 2012

I just noticed a slight bug in man python. It looks like the word "languages" is broken into two lines with a hyphen in the original source, and that break carried over. That seems like the kind of thing that might appear in several different places, so would you mind taking a look at it? Thanks!

Contributor

flaming-toast commented Oct 13, 2012

Ooh...thanks for catching that. added a fix for trailing hyphens at the end of the line...
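
For illustration, joining lines that end in a hyphen could look roughly like the sketch below. The join_lines sub and the sample text are hypothetical, and the actual fix in parse.pl may differ.

```perl
use strict;
use warnings;

# Join the lines of a description into one string. A line ending in a
# hyphen is assumed to be a word broken across lines, so the hyphen is
# removed and the next line is appended without a space.
sub join_lines {
    my @lines = @_;
    my $out = '';
    for my $line (@lines) {
        (my $copy = $line) =~ s/^\s+|\s+$//g;   # trim a copy, not the argument
        next if $copy eq '';
        if ($out =~ s/-$//) {
            $out .= $copy;          # continue the broken word
        }
        elsif ($out eq '') {
            $out = $copy;           # first non-empty line
        }
        else {
            $out .= " $copy";       # ordinary line break becomes a space
        }
    }
    return $out;
}

my $joined = join_lines(
    "an interpreted, object-oriented programming lan-\n",
    "  guage\n",
);
print "$joined\n";   # an interpreted, object-oriented programming language
```

One caveat of this approach: a genuinely hyphenated compound that happens to break at the end of a line would lose its hyphen, while hyphens inside a line (as in "object-oriented") are kept.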

Contributor

rpicard commented Oct 21, 2012

Awesome! Thanks.

@yegg and I both agree that since we're not going with a description, we should switch the order back (snippet on top, with the text underneath). That way it'll be consistent with other results, like perl split.

Once we have that changed, I think this will be just about ready! 👍

Contributor

flaming-toast commented Oct 22, 2012

Sweet! :D I switched the synopsis and description back. Could you re-stage the latest pushes?

Thanks a lot!

Contributor

rpicard commented Oct 22, 2012

Thanks! I just staged the latest changes. I think it's ready to go live now, but I'm going to pass it along for one final check before I deploy it to the main site.

Contributor

flaming-toast commented Oct 23, 2012

Great! I'm excited. :D Let me know if you catch any more bugs. Thanks!! : )

Contributor

rpicard commented Oct 23, 2012

🚀 We have liftoff! You can see it live here.

Do you have a Twitter account that we can use for attribution?

Contributor

flaming-toast commented Oct 23, 2012

Wow this is good news!!! Thank you so much for working with me on this plugin for the past 3 months! My twitter handle is @action_potato :)

Contributor

rpicard commented Oct 23, 2012

Thanks to you too for submitting it in the first place, and going back and forth with me all of this time to get it where it is now. I've merged the code into master, so I'll close this pull request now.

Just as a side note, I was already using this earlier today when I read this article to try and figure out what nc was: https://duckduckgo.com/?q=man+nc

@rpicard rpicard closed this Oct 23, 2012
