Undefined method `text` for NilClass (Broken scrapers) #72

fabianrbz · 2018-03-20T17:48:19Z

While running the example code - with other apps happens too-:

# Download/parse the app.
app = MarketBot::Play::App.new('com.facebook.katana')
app.update

I get the following error quite frequently -Added the verbose output from typhoeus-:

From what I've seen, it's because of the html that's returned. Sometimes it matches the one this gem expects, but sometimes it doesn't.

Expected html:

And the one that generates the error:

Is anyone else getting this? I guess we would need to add another parser and select one depending on the html. I couldn't find a way to get the same response consistently though.
If anyone else is getting this, I might find some time to implement it

sunboshan · 2018-03-20T18:01:04Z

This is a issue with Google Play store. If you do curl couple of times

$ curl -v https://play.google.com/store/apps/details?id=com.facebook.katana

You will get two htmls for the same page. One is the old format, with all divs has fixed property, like itemprop="contentRating", so that we can parse it easily.

Another html is using new format, with all divs has random class name, like <span class="htlgb">. In this case it's very hard to parse meaningful content.

The first time I saw this is in end of Feb, and I'm seeing this lot more often now. I believe they are doing an update on their server gradually. Some server returning the old format, some returning the new format.

fabianrbz · 2018-03-20T18:04:27Z

@sunboshan exactly, that's what I'm seeing. What do you think about having another parser for the new format? I guess they are moving towards that one and soon the current implementation will be obsolete.

sunboshan · 2018-03-20T18:08:41Z

I think the purpose for Google is pretty obvious, which is they don't want others to parse the content of their play store. I don't have a good idea on how to parse div/span with random class name. Yes soon the current implementation will be obsolete when Google full roll out their random html page for play store.

chadrem · 2018-03-20T18:18:36Z

@lightalloy I believed mentioned this problem to me too. So definitely sounds like it’s going to be a major problem. Hopefully it’s possible to have a second parser to resolve this. I don’t have time to implement a new parser so hopefully the community can work on it. I’m happy to help coordinate as I’m sure this will be useful to a lot of people.

fabianrbz · 2018-03-20T18:21:16Z

@sunboshan do you think it has random classes? I'm getting the same ones, but yeah, it could be that they are planning to make them random.

sunboshan · 2018-03-20T18:34:08Z

As far as I can see, the metadata part is using <span class="htlgb">. Here's a simple parse on <span class="htlgb"> on fb app(in Erlang terms)

[
  {"span", [{"class", "htlgb"}], ["March 19, 2018"]},
  {"span", [{"class", "htlgb"}], ["Varies with device"]},
  {"span", [{"class", "htlgb"}], ["1,000,000,000+"]},
  {"span", [{"class", "htlgb"}], ["Varies with device"]},
  {"span", [{"class", "htlgb"}], []},
  {"span", [{"class", "htlgb"}],
...

So the parser needs to pretty much assume we are looking for span on class htlgb, then the first one is Update time, the second one is binary size, the third one is download number, etc. Who knows if Google will change the order, or change the class name, it will break again.

lightalloy · 2018-03-30T12:41:17Z

It seems like today Google changed the apps' page design (and HTML) fully, though yesterday it was mostly ok. The new parser needs to be written.
I know that @chadrem doesn't have time for that. I think I'll do the task, but I first need to discuss the issue with my colleagues.

chadrem · 2018-03-30T14:49:55Z

@lightalloy I hope your colleagues agree to it! PRs welcome!

chadrem · 2018-04-02T14:30:21Z

I've updated all the test data by running ./bin/update_test_data and pushed the commit to master. This results in most of the tests now failing. If anyone works on new scrapers then please make sure these tests pass.

chadrem · 2018-04-02T14:45:41Z

@refaelos submitted PR #74 that may fix the scrapers, but decreases code coverage. Is anyone else working on fixing the scrapers? If so, it would be helpful to coordinate here. I'd like to see any PR that fixes this problem maintain 100% test coverage. Lets communicate any fixes with scrapers in this issue so that we can all work together to keep market bot working.

lightalloy · 2018-04-03T07:56:53Z

As for the pull-request and the feature overall, I think that the code related to the old HTML should be removed, cause, as I see, Google doesn't serve the old version anymore.
Also, the tests in the PR remain the same as they were, so they still rely on the old HTML version which is in spec/market_bot/play/data which is incorrect. The downloaded files should be updated (replaced) and the tests should be reworked accordingly (if needed).
I will look more precisely and decide if it's possible to rework the PR, or I'll just make another one.

lightalloy · 2018-04-03T08:01:46Z

@chadrem I see you've updated the test data in the master branch 👍

Fix the tests according to the new html chadrem#72

chadrem · 2018-04-03T14:10:00Z

@lightalloy thanks!!! I agree the code is a good start and needs further refactoring and has some missing attributes. Do you feel it is good enough to merge into master, release it, and use at your company? I'm looking for people who can put some real world use on this PR and then help improve it over time.

I also think we should setup rubocop as I've noticed the syntax is looking a bit different than what I originally wrote.

If anyone is for or against merging this then please let me know.

refaelos · 2018-04-03T14:17:27Z

@lightalloy @chadrem the code was written VERY quickly so should be carefully tested.
We ran it on a few hundreds of package names and added a lot of 'ifs' to handle different google play pages.
And yes ... some attributes were not taken b/c we don't need them.

chadrem · 2018-04-03T14:22:08Z

@refaelos Thank you for the fast response. A few questions:

Are all of those 'ifs' already in the PR? or do you have new ones?
Are you planning to improve the code over time?
It's ok that some attribute are missing. If anyone needs them they can send a PR to add them back.

@refaelos @lightalloy I've also setup Rubocop for the project. Once we get this PR merged I will have it automatically fix the code and then I will manually fix anything it warns about. You will then want to use this new code for any future PRs to keep the syntax consistent.

lightalloy · 2018-04-03T14:23:02Z

@chadrem I've also thought about adding rubocop to the project. Thanks for adding it.
As for the production, I'm not 100% sure yet, I plan to test it more tomorrow.
But obviously, I'm not against merging, cause the current version doesn't work.

lightalloy · 2018-04-03T14:25:20Z

@chadrem I've removed the code for the old html in my pr. Maybe that was too early.

chadrem · 2018-04-03T14:35:01Z

I'm going to merge #75 from @lightalloy (that contains fixes from @refaelos) since it is better to have something work even if it has bugs. I am also going to run rubocop on it to clean up all the syntax.

Is everyone willing to then work off this new master branch to fix and refactor any remaining problems?

refaelos · 2018-04-03T14:38:53Z

@chadrem
Are all of those 'ifs' already in the PR? or do you have new ones?
Yes. I believe that when I push new commits it goes automatically into the PR so yes.

Are you planning to improve the code over time?
Yes. Whenever we see issues we'll fix and PR.

It's ok that some attribute are missing. If anyone needs them they can send a PR to add them back.
👍

Fix the tests according to the new html #72

refaelos · 2018-04-03T14:41:31Z

@chadrem I don't think you should remove the old html parsing. Sometimes google keeps replying with the older page format and things will work as usual. I suggest we wait a couple of months before eliminating older code.

chadrem · 2018-04-03T15:03:18Z

I've merged the PR, but I don't plan on releasing a new gem until I receive feedback that the code is working. Please send me feedback when you think it's ready for a release. Even if it's buggy, it's better than nothing.
Rubocop has been run on all the code. I've disabled many exceptions. Please run it before submitted new PRs with the command bundle exec rubocop to see the errors. Run bundle exec rubocop -a to try to automatically fix errors.
If anyone has additional PRs then please send them. Please make sure you base your PR off the new master branch that contains a lot of changes from rubocop.
The old scrapers are removed, but feel free to add them back if necessary. They are removed in commit 1824aea

chadrem · 2018-04-03T15:05:17Z

@refaelos sorry I missed your recommendation. I didn't understand that Google was still using the old page format on some sites. Do you have examples where they are using it? I've received comments from other people that it isn't being used anymore so I thought it was ok to remove.

refaelos · 2018-04-03T16:44:01Z

@chadrem it rarely happens but they do. I suggest leaving the old parsing method just as a fallback.

cschroed · 2018-04-12T18:01:22Z

@chadrem How can I help with this? We use this gem and was waiting to see if things got resolved. Looks like some extra hands would help.

Where can I help/start?

chadrem · 2018-04-12T18:25:40Z

@cschroed The latest code is in the master branch. Pretty much start hacking on it, submitting PRs, and communicate with other developers here. The gem is community supported at this point since my time and need for it is limited. I’m looking for proper maintainers(s) to give commit access to once they demonstrate interest and ability. Please ask questions here. A few others here have already done significant work on fixing things. Once the community feels the code is solid I can easily release a new version.

pdesantis · 2018-04-12T19:34:09Z

FYI - I've been using the latest code in master for the past few days, and it's been running very well. Just wanted to give thanks to everyone involved for your efforts!

bstegmaier · 2018-04-16T12:17:26Z

There were still some issues in extracting the screenshot urls. Fixed them in PR#77.

chadrem · 2018-04-16T14:03:58Z

I’ve merged the screenshot fix from @bstegmaier. Thanks for the fix!

chadrem · 2018-04-16T14:07:56Z

@pdesantis thanks for the feedback. Is everyone else using the master branch successfully? What features are still broken? Thought on releasing master as a new version?

bstegmaier · 2018-04-16T16:08:44Z

@chadrem Actually there is still a problem with screenshots and the cover image: The current implementation relies on alt texts for finding them. However, alt texts are localized (e.g. it's "Covergestaltung" in German instead of "Cover art"). So this approach does not work for all non-default languages.
edit: Maybe someone else has a good idea of how to fix this issue?

chadrem · 2018-04-17T04:48:42Z

Sounds like maybe a lockup table is the easiest solution? Users can then add additional languages if they need them. Feel free to send a pull request if you have time to sort this out.

bstegmaier · 2018-04-27T09:54:03Z

The parser was broken once again (at least for me) since google changed ratingCount to reviewCount. See here: bstegmaier@a697e00

refaelos · 2018-04-27T11:59:40Z

@bstegmaier this is how I would do it: soomla/market_bot@1e2fa8a

pdesantis · 2018-05-02T23:03:51Z

Thanks for the suggestions @bstegmaier and @refaelos. I just got a chance to try this out. Neither solution is parsing the rating count for me. From what I can tell, the HTML doesn't have ratingCount nor ratingCount

refaelos · 2018-05-03T20:54:45Z

@pdesantis you mean reviewCount ?
If you have a better solution I'd love to see it.

pdesantis · 2018-05-04T04:10:45Z

Yes, sorry, I meant that I'm not seeing ratingCount nor reviewCount in the HTML.

I'm seeing this for number of ratings:
<span class="AYi5wd TBRnV"><span class="" aria-label="54,394,288 ratings">54,394,288</span>

And this for average rating.
<div class="BHMmbe" aria-label="Rated 4.0 stars out of five stars">4.0</div>

Perhaps I can add a check for those CSS classes as a fallback

pdesantis · 2018-05-04T06:30:08Z

I'm seeing promising results so far with this pdesantis@fcc876b

Will post an update after running it for a while longer

refaelos · 2018-05-04T15:37:28Z

@pdesantis you're right, they probably removed ratingCount and reviewCount.
I don't think relying on their auto generated class names is smart. Those will change from time to time for sure.
I think the best way to go would be to take the element that contains aria-label with the word ratings inside it. Then you can take its text and go to parent to get the other element for rating.

pdesantis · 2018-05-05T21:29:32Z

@refaelos agreed 100%. I'm going to keep running my brittle solution for the time being, but may attempt the more robust solution in a couple of weeks.

refaelos · 2018-05-06T09:05:15Z

@pdesantis 👍 I'd love to see it when it's done

chadrem mentioned this issue Apr 1, 2018

Not working since updated playstore layout #73

Closed

chadrem changed the title ~~Undefined method text for NilClass~~ Undefined method text for NilClass (Broken scrapers) Apr 2, 2018

chadrem mentioned this issue Apr 2, 2018

Extracting Google Play info when classes are obfuscated #74

Closed

lightalloy added a commit to lightalloy/market_bot that referenced this issue Apr 3, 2018

Remove the code for the old-html parsing

020188c

Fix the tests according to the new html chadrem#72

chadrem pushed a commit that referenced this issue Apr 3, 2018

Remove the code for the old-html parsing

1824aea

Fix the tests according to the new html #72

chadrem mentioned this issue Apr 5, 2018

Fix for screenshot_urls and cover_image_url #76

Merged

chadrem added the help wanted label Apr 5, 2018

chadrem added the bug label Apr 5, 2018

kp1 mentioned this issue May 16, 2018

plugins/playstore.rbが動作しない tdiary/tdiary-contrib#199

Open

Undefined method text for NilClass (Broken scrapers) #72

Undefined method text for NilClass (Broken scrapers) #72

Comments

fabianrbz commented Mar 20, 2018

sunboshan commented Mar 20, 2018

fabianrbz commented Mar 20, 2018

sunboshan commented Mar 20, 2018

chadrem commented Mar 20, 2018

fabianrbz commented Mar 20, 2018

sunboshan commented Mar 20, 2018

lightalloy commented Mar 30, 2018

chadrem commented Mar 30, 2018

chadrem commented Apr 2, 2018

chadrem commented Apr 2, 2018

lightalloy commented Apr 3, 2018

lightalloy commented Apr 3, 2018

chadrem commented Apr 3, 2018

refaelos commented Apr 3, 2018

chadrem commented Apr 3, 2018

lightalloy commented Apr 3, 2018

lightalloy commented Apr 3, 2018

chadrem commented Apr 3, 2018

refaelos commented Apr 3, 2018

refaelos commented Apr 3, 2018

chadrem commented Apr 3, 2018

chadrem commented Apr 3, 2018

refaelos commented Apr 3, 2018

cschroed commented Apr 12, 2018

chadrem commented Apr 12, 2018

pdesantis commented Apr 12, 2018

bstegmaier commented Apr 16, 2018

chadrem commented Apr 16, 2018

chadrem commented Apr 16, 2018

bstegmaier commented Apr 16, 2018 • edited

chadrem commented Apr 17, 2018

bstegmaier commented Apr 27, 2018

refaelos commented Apr 27, 2018

pdesantis commented May 2, 2018

refaelos commented May 3, 2018

pdesantis commented May 4, 2018

pdesantis commented May 4, 2018

refaelos commented May 4, 2018

pdesantis commented May 5, 2018

refaelos commented May 6, 2018

Undefined method `text` for NilClass (Broken scrapers) #72

Undefined method `text` for NilClass (Broken scrapers) #72

bstegmaier commented Apr 16, 2018 •

edited