Wraps long URLs #38

Closed
stefanor opened this issue Oct 18, 2014 · 13 comments

@stefanor
Contributor

Forwarding aaronsw/html2text#7, so it doesn't get forgotten:
Forwarding http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=616090:

Long URLs are wrapped, which they probably shouldn't be.

Example:

<html>
<head><title>Test</title></head>
<body>
<p>And <a href="http://bugs.debian.org/cgi-bin/pkgreport.cgi?tag=multiarch;users=debian-dpkg@lists.debian.org">here</a> is a long link I had at hand.</p>
</body>
</html>

Results in:

And [here][1] is a long link I had at hand.

   [1]: http://bugs.debian.org/cgi-bin/pkgreport.cgi?tag=multiarch;users
=debian-dpkg@lists.debian.org
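
For anyone reproducing this without html2text installed: the same kind of split can be shown with Python's standard `textwrap` module. This is only a sketch of the mechanism, assuming a wrap width of 78; the break points it produces may differ from the html2text output above.

```python
import textwrap

url = ("http://bugs.debian.org/cgi-bin/pkgreport.cgi"
       "?tag=multiarch;users=debian-dpkg@lists.debian.org")
reference = "   [1]: " + url

# Default settings allow breaking inside long words and at hyphens,
# so a URL longer than the width cannot stay on one line.
lines = textwrap.wrap(reference, width=78)

for line in lines:
    print(repr(line))
```

Every printed line fits the width, but none of them contains the complete URL.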
Alir3z4 added the bug label Oct 19, 2014
@Alir3z4
Owner

Alir3z4 commented Oct 19, 2014

I've labeled this as a bug.

@stefanor Thanks for following up on this. Please feel free to open or forward issues and bugs from
https://github.com/aaronsw/html2text/.

@jacobsvante

It took me a while to find out that html2text was indirectly causing the %0A (a URL-encoded newline) in my links. I've temporarily disabled body_width wrapping in my code to prevent it.

from html2text import HTML2Text

parser = HTML2Text()
parser.body_width = 0  # 0 disables wrapping, so URLs are never split
parser.handle(value)
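
To illustrate where the %0A comes from: once a wrapped URL is picked up together with its line break, percent-encoding turns the newline into `%0A`. A minimal sketch using only the standard library; the `safe` character set is an assumption made for the illustration.

```python
from urllib.parse import quote

wrapped = ("http://bugs.debian.org/cgi-bin/pkgreport.cgi?tag=multiarch;users\n"
           "=debian-dpkg@lists.debian.org")

# Keep the URL delimiters as-is; other unsafe characters, including "\n", are escaped.
encoded = quote(wrapped, safe=":/?=;@")
print(encoded)
```

The same call leaves the unwrapped URL unchanged, so the `%0A` comes purely from the inserted line break.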

@Alir3z4
Owner

Alir3z4 commented Apr 16, 2015

Hey @jmagnusson @stefanor
I see these two hacks trying to fix the issue:

Do you think we can apply the same approach in html2text without explicitly setting body_width=0?

@theSage21
Collaborator

@stefanor does this fix help?

@stefanor
Contributor Author

stefanor commented Jul 4, 2015

Yeah, combined with --reference-links that seems to do the right thing.

@theSage21
Collaborator

@stefanor @Alir3z4 Consider closed?

@Alir3z4
Owner

Alir3z4 commented Jul 6, 2015

I'm going to close this then, thanks for your awesome collaboration on this ;)

Alir3z4 closed this as completed Jul 6, 2015
@nguyenl95

nguyenl95 commented Apr 5, 2018

This issue still happens to me when the link contains special characters like "-".

Is there any way to rebuild this package with BODY_WIDTH = 0 (config.py)?
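
The hyphen observation matches how Python's standard `textwrap` behaves: by default it may break at hyphens inside a word, so turning off `break_long_words` alone does not keep a hyphenated URL together. A sketch with a hypothetical URL and width, not a statement about html2text's internals:

```python
import textwrap

# Hypothetical URL and sentence, chosen only for illustration.
url = "https://example.org/some-long-path/with-many-hyphenated-segments/page.html"
text = "See " + url + " for details."

# Long words are kept whole, but hyphens still count as break points.
hyphen_breaks = textwrap.wrap(text, width=40, break_long_words=False)

# Disabling both leaves the URL on one (overlong) line.
intact = textwrap.wrap(text, width=40, break_long_words=False,
                       break_on_hyphens=False)

print(hyphen_breaks)
print(intact)
```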

@Alir3z4
Owner

Alir3z4 commented Apr 5, 2018

@nguyenl95 have you considered --reference-links?

@nguyenl95

nguyenl95 commented Apr 6, 2018

@Alir3z4 I intend to use html2text as a library rather than from the command line.
Btw, I just read your /tests and they are really useful.

I use this lib for my crawler (this use case seems popular), and I think body_width=0, or protect_links=True and skip_internal_links=False, should be the default. Baseurl is a really good option that should be documented for readers, btw.

import html2text

def html2md(raw):
    h = html2text.HTML2Text()
    h.body_width = 0  # disable line wrapping
    h.baseurl = "https://example.org"  # this option is hidden (undocumented)
    return h.handle(raw)
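
As a rough illustration of what a base URL is for: relative links found while crawling only become usable once they are resolved against the page's address. The sketch below uses the standard library's `urljoin` and assumes that `baseurl` serves the same purpose; it does not call html2text.

```python
from urllib.parse import urljoin

base = "https://example.org"

# Relative links are resolved against the base; absolute ones are left alone.
print(urljoin(base, "/docs/page.html"))
print(urljoin(base, "img/logo.png"))
print(urljoin(base, "https://other.example/x"))
```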

@Alir3z4
Owner

Alir3z4 commented Apr 6, 2018

@nguyenl95 Thanks for mentioning.
I didn't notice you were referring to using the lib itself and not the CLI.

I'd love to see a pull request updating the documentation so others can see and use it.
You would be modifying:

Let me know if I can help you with anything else.

@nguyenl95

nguyenl95 commented Apr 7, 2018

@Alir3z4 Actually, there is one feature I've been thinking of.

It is a limit on the output length: my forum platform doesn't allow my crawler to post content over 32000 characters.
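
One way a caller could work around such a limit today, outside html2text, is to split the converted Markdown into posts that each fit. A minimal sketch: `split_markdown` is a hypothetical helper, not part of html2text, and the default of 32000 just mirrors the number above.

```python
def split_markdown(text: str, limit: int = 32000) -> list[str]:
    """Split text into chunks of at most `limit` characters,
    preferring to break at blank lines between paragraphs."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    chunks: list[str] = []
    current = ""
    for para in text.split("\n\n"):
        # A single paragraph longer than the limit is cut at fixed offsets.
        while len(para) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(para[:limit])
            para = para[limit:]
        candidate = para if not current else current + "\n\n" + para
        if len(candidate) <= limit:
            current = candidate
        else:
            # The paragraph does not fit: close the current chunk first.
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks


sample = "a" * 10 + "\n\n" + "b" * 10 + "\n\n" + "c" * 25
parts = split_markdown(sample, limit=12)
print([len(p) for p in parts])
```

Cutting an oversized paragraph at a fixed offset can still land inside a link, so a real implementation would probably want to look for a safer break point.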

@Alir3z4
Owner

Alir3z4 commented Apr 8, 2018

@nguyenl95 Great, feel free to make a feature request or, even better, a pull request. I would love to know more about it.
