MacRumors uses <br> tags to separate paragraphs. Readability attempts to insert <p> tags into the article, but the results are not what you'd expect: the <p> tags end up in odd places.
For example, this article: http://www.macrumors.com/2012/10/12/apples-ipad-mini-media-event-reportedly-scheduled-for-october-23/
results in this output for Document(content).summary():
<i>AllThingsD</i> reports</a><p> that Apple appears to be planning to hold a media event on Tuesday, October 23 to introduce
the "iPad mini", Apple's smaller tablet device said to be carrying a display measuring 7.85 inches diagonally.</p>
<p class="quote">As AllThingsD reported in August, Apple will hold a special event this month at which it will showcase a new,
smaller iPad. People familiar with Apple’s plans tell us that the company will unveil the so-called “iPad mini” on October 23 at an
That’s a Tuesday, not a Wednesday, so this is a bit of a break with recent tradition. It also happens to be just three days prior to
the street date for Microsoft’s new Surface tablet.</p><center>
<img src="http://cdn.macrumors.com/article-new/2012/09/ipadmini_small.jpg"/><br/><i>Physical mockup of rumored iPad mini design</i></center><p>
The location of the event is unconfirmed, but the report suggests that it is likely to be held at the company's Town Hall
auditorium at its corporate headquarters in Cupertino, California.</p><i>AllThingsD</i><p> has an excellent track record
regarding Apple media event rumors, giving this claim a high probability of proving true. Given past history, Apple would be
expected to send out invitations early next week if the event is to be held on October 23.</p><b>Update</b><p>: </p><em>The
Loop</em><p>'s Jim Dalrymple weighs in, </p>
<a href="http://www.loopinsight.com/2012/10/12/apples-rumored-oct-23-ipad-mini-event/">confirming the date</a><p> with a "Yep."
You can subclass the Document class and put your own code in the _parse or transform_misused_divs_into_paragraphs methods.
If you think there is a generic approach, please suggest a fix and I will apply it.
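One possible generic approach, sketched here only as a starting point, is to turn runs of <br> tags into real paragraphs before the markup reaches Document. The helper name br_runs_to_paragraphs is hypothetical, and the sketch assumes that a paragraph break is written as two or more consecutive <br> tags and that the input is a text fragment with no block-level elements of its own. A regex is a crude tool for HTML, so a real fix would more likely work on the parsed tree.

```python
import re

# Two or more consecutive <br>, <br/> or <br /> tags, with optional
# whitespace between them. Assumption: such a run marks a paragraph break.
_BR_RUN = re.compile(r"(?:<br\s*/?>\s*){2,}", re.IGNORECASE)


def br_runs_to_paragraphs(fragment: str) -> str:
    """Wrap the pieces of text separated by <br> runs in <p> elements."""
    parts = [part.strip() for part in _BR_RUN.split(fragment)]
    # Empty pieces (leading or trailing runs) are dropped.
    return "".join(f"<p>{part}</p>" for part in parts if part)


print(br_runs_to_paragraphs("One<br><br>Two<br/>\n<br />Three"))
# <p>One</p><p>Two</p><p>Three</p>
```

A single <br> is left alone, so ordinary line breaks inside a paragraph survive.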
I looked at "transform_misused_divs_into_paragraphs" a bit, but couldn't easily discern what that method was doing to cause the odd tag placement. In any event, I revised my approach a bit, so the tags don't matter to me any more (I'm mining content, not representing it). Failing a better transformation, I think it would be helpful to have a flag that switches off that feature, so the content is returned as is.
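Until such a flag exists, the subclassing route mentioned earlier could approximate it. The sketch below is an assumption-laden illustration: skip_div_transform is a hypothetical attribute, not part of the library, and StandInDocument only stands in for readability's Document so the example runs on its own. Whether skipping this one method is enough to remove the stray <p> tags has not been verified.

```python
class SkipDivTransformMixin:
    """Mixin that optionally skips the div-to-paragraph rewrite."""

    # Hypothetical flag for illustration; the library does not define it.
    skip_div_transform = True

    def transform_misused_divs_into_paragraphs(self):
        if self.skip_div_transform:
            return  # leave the parsed tree untouched
        super().transform_misused_divs_into_paragraphs()


class StandInDocument:
    """Stand-in for readability's Document so the sketch is self-contained."""

    def __init__(self):
        self.transformed = False

    def transform_misused_divs_into_paragraphs(self):
        self.transformed = True


class PlainDocument(SkipDivTransformMixin, StandInDocument):
    pass


doc = PlainDocument()
doc.transform_misused_divs_into_paragraphs()
print(doc.transformed)  # False: the rewrite was skipped
```

With the real library, the mixin would be combined with Document in place of StandInDocument, and setting skip_div_transform to False would restore the default behaviour.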