Erratic <p> insertion in Macrumors article #30

Closed
akavlie opened this Issue · 3 comments

2 participants

@akavlie

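Macrumors uses <br> tags to separate their paragraphs. Readability attempts to insert <p> tags into the article, but the results are not what you'd expect: inline elements such as <a>, <i>, and <em> are left outside the new paragraphs, so sentences get split mid-way.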

For example, this article: http://www.macrumors.com/2012/10/12/apples-ipad-mini-media-event-reportedly-scheduled-for-october-23/

results in this output for Document(content).summary():

<html><body><div><div class="content">
                        <a href="http://allthingsd.com/20121012/apple-likely-to-unveil-ipad-mini-at-october-23-event/">
<i>AllThingsD</i> reports</a><p> that Apple appears to be planning to hold a media event on Tuesday, October 23 to introduce 
the "iPad mini", Apple's smaller tablet device said to be carrying a display measuring 7.85 inches diagonally.</p>
<p class="quote">As AllThingsD reported in August, Apple will hold a special event this month at which it will showcase a new, 
smaller iPad. People familiar with Apple’s plans tell us that the company will unveil the so-called “iPad mini” on October 23 at an 
invitation-only event.<br/><br/>
That’s a Tuesday, not a Wednesday, so this is a bit of a break with recent tradition. It also happens to be just three days prior to 
the street date for Microsoft’s new Surface tablet.</p><center>
<img src="http://cdn.macrumors.com/article-new/2012/09/ipadmini_small.jpg"/><br/><i>Physical mockup of rumored iPad mini design</i></center><p>
The location of the event is unconfirmed, but the report suggests that it is likely to be held at the company's Town Hall 
auditorium at its corporate headquarters in Cupertino, California.</p><i>AllThingsD</i><p> has an excellent track record 
regarding Apple media event rumors, giving this claim a high probability of proving true.  Given past history, Apple would be 
expected to send out invitations early next week if the event is to be held on October 23.</p><b>Update</b><p>: </p><em>The 
Loop</em><p>'s Jim Dalrymple weighs in, </p>
<a href="http://www.loopinsight.com/2012/10/12/apples-rumored-oct-23-ipad-mini-event/">confirming the date</a><p> with a "Yep."
                                        </p><p/>
                    </div>
                    </div></body></html>
@buriy was assigned
@buriy
Owner

You can subclass the Document class and put your own code into the _parse or transform_misused_divs_into_paragraphs methods.
If you think there is a generic approach, could you please suggest a fix? I will apply it.
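
One possible direction for a generic approach, as a rough sketch only: normalize runs of <br> tags into <p> elements before the markup is scored. The snippet below uses just the standard library and is not part of readability. The helper name br_runs_to_paragraphs is hypothetical, and it assumes the fragment is the inner HTML of a single container whose paragraphs are separated only by two or more consecutive <br> tags, with no nested block-level elements.

```python
import re

# Two or more consecutive <br> tags (<br>, <br/>, <BR />, ...), with optional
# whitespace after each one, are treated as a paragraph boundary.
# A single <br> is left alone as an ordinary line break.
_BR_RUN = re.compile(r"(?:<br\s*/?>\s*){2,}", re.IGNORECASE)


def br_runs_to_paragraphs(fragment: str) -> str:
    """Wrap the pieces of `fragment` separated by <br> runs in <p> elements.

    Hypothetical helper: assumes `fragment` contains inline markup only,
    so inline elements such as <a> or <i> stay inside their paragraph.
    """
    chunks = [chunk.strip() for chunk in _BR_RUN.split(fragment)]
    # Drop empty pieces produced by leading or trailing <br> runs.
    return "\n".join(f"<p>{chunk}</p>" for chunk in chunks if chunk)


if __name__ == "__main__":
    sample = (
        '<a href="#"><i>AllThingsD</i> reports</a> that Apple is planning an event.'
        "<br/><br/>\nThe location of the event is unconfirmed."
    )
    print(br_runs_to_paragraphs(sample))
```

The converted fragment would still have to be put back into the page before calling Document(content).summary(), for example from an overridden _parse. Whether that actually improves the output for the Macrumors article has not been checked here.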

@buriy closed this
@akavlie

I looked at "transform_misused_divs_into_paragraphs" a bit, but couldn't easily discern what that method was doing that caused the odd <p> tag placement. In any event, I revised my approach a bit, so the <p> tags don't matter to me any more (I'm mining content, not representing it).

Failing a better <p> transformation, I think it would be helpful to have a flag that switches off that feature, so the content is returned as is.

@buriy
Owner