Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Lyrics: Review my patch please! #1953

Closed
wisp3rwind opened this issue Apr 24, 2016 · 12 comments
Closed

Lyrics: Review my patch please! #1953

wisp3rwind opened this issue Apr 24, 2016 · 12 comments
Labels
bug bugs that are confirmed and actionable discussion

Comments

@wisp3rwind
Copy link
Member

Yesterday I did this

271f7c8...607f41b

because my imports were breaking (i.e. with a traceback) halfway-through. I don't know the lyric plugin's code, though. So it might be good if someone who does had a look at it.
Maybe these checks should be at top-level and would let the plugin issue a warning already on initialization if any of the activated sources need beautifulsoup4 or langdetect?

@wisp3rwind wisp3rwind added bug bugs that are confirmed and actionable discussion labels Apr 24, 2016
@sampsyo
Copy link
Member

sampsyo commented Apr 24, 2016

Thanks for looking into this!

I'm a little confused about the need for 607f41b where the langdetect check was moved outside of the if lyrics: condition. This seems fine, but I don't quite see what it solves. @Kraymer, perhaps you'd like to take a look, since this is related to #1939?

And we apparently added an undocumented dependency on BeautifulSoup for the LyricsWiki backend in #1912. It would be awesome to avoid needing BS for such a simple scraping task (as we did before). @jackwilsdon, any thoughts on whether scrape_lyrics_from_html is really necessary here? If it is, we at least need to document the new dependency, and we should probably disable the source by default.

@wisp3rwind
Copy link
Member Author

The idea behind the second commit was:

  • The initial change to the beautifulsoup import didn't fix the breakage. I catched the ImportError, but that would only have postponed it to a NameError or somesuch.
  • Do not print the langdetect warning for every song, but only on the first. I clearly failed at actually doing that. Which probably means that I should not push anything to master at that time of day.

@Kraymer
Copy link
Collaborator

Kraymer commented Apr 24, 2016

Do not print the langdetect warning for every song, but only on the first.

So, should that be done in LyricsPlugin::__init__ ?
check that langdetect can be imported if bing_client_secret != None. If ImportError occurs, print warning and set bing_client_secretto None.

@sampsyo
Copy link
Member

sampsyo commented Apr 24, 2016

Aha! Late-night coding has gotten me before too. 😇

Doing the check in __init__ and disabling the option makes sense to me.

@jackwilsdon
Copy link
Member

jackwilsdon commented Apr 25, 2016

@jackwilsdon, any thoughts on whether scrape_lyrics_from_html is really necessary here? If it is, we at least need to document the new dependency, and we should probably disable the source by default.

I believe it's needed due to to the way LyricsWiki formats their lyrics. I'll take a look at refactoring scrape_lyrics_from_html to use Python's HTMLParser.

Here's what the unescaped lyrics look like (with added indentation):

<div class='lyricbox'>
  I guess I should've known by the way you parked your car sideways<br />
  That it wouldn't last<br />
  See, you're the kinda person that believes in makin' out once<br />
  Love 'em and leave 'em fast<br />
  I guess I must be dumb 'cause you had a pocket full of horses<br />
  Trojan and some of them used<br />
  But it was Saturday night, I guess that makes it all right<br />
  And you say - "What have I got to lose?"<br />
  <br />

  ...

  <br />
  Little red Corvette, oh
  <!--
    <p>
      NewPP limit report
      Preprocessor node count: 857/300000
      Post‐expand include size: 6188/2097152 bytes
      Template argument size: 1451/2097152 bytes
      Expensive parser function count: 1/100
      ExtLoops count: 2/100
    </p>
  -->

  <div class='lyricsbreak'></div>
</div>

@sampsyo
Copy link
Member

sampsyo commented Apr 25, 2016

If it's just a matter of converting <br>s to newlines, removing the comments, and stripping that <div>, then perhaps the BREAK_RE, COMMENT_RE, and DIV_RE regular expressions we already have will suffice.

@jackwilsdon
Copy link
Member

Aha, I didn't know we already had some expressions for those. I'll take a look.

@jackwilsdon
Copy link
Member

Just opened #1956 with a fix, it seems we can use _scrape_strip_cruft instead and it works fine!

@sampsyo
Copy link
Member

sampsyo commented Apr 26, 2016

Let's call this good as of #1956.

@sampsyo sampsyo closed this as completed Apr 26, 2016
@Kraymer Kraymer reopened this Apr 26, 2016
@Kraymer
Copy link
Collaborator

Kraymer commented Apr 26, 2016

reopening, the failing langdetect import is still ongoing

@jackwilsdon
Copy link
Member

What exactly is the issue, @Kraymer? Does the langdetect warning code need moving to the constructor so that it does not print for every song?

@Kraymer
Copy link
Collaborator

Kraymer commented Apr 27, 2016

@jackwilsdon yep that's it #1963

@Kraymer Kraymer closed this as completed Apr 27, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug bugs that are confirmed and actionable discussion
Projects
None yet
Development

No branches or pull requests

4 participants