Regular expression adds false-matches to the number of documents returned in query #2

Open
GoogleCodeExporter opened this issue Mar 30, 2015 · 0 comments


What steps will reproduce the problem?
1. Run metagoofil.py as expected
2. Observe that the number of documents found is always 5 more than what was specified

What is the expected output? What do you see instead?

The expected number of matches equals the number of results requested. Instead, 
the reported count is always 5 more than requested.  This is due to cruft at the 
bottom of the Google results page that is being matched by the compiled regular 
expression for "<a href=".

What version of the product are you using? On what operating system?
metagoofil-read-only from SVN dated May 16, 2011 (revision 2)

Please provide any additional information below.

The erroneous matches show up in metagoofil.py's output, which comes from 
googlesearch.py's invocation of parser.py's fileurls call.  Note the output 
below, where "Searching 100 results..." is followed by "Results: 105 files 
found".
-------
[-] Searching for doc files, with a limit of 10
        Searching 100 results...
Results: 105 files found
Starting to download 10 of them..

From inspection, the 5 extra matches are:
 '/'
 '/intl/en/ads/'
 '/services/'
 '/intl/en/privacy.html'
 '/intl/en/about.html' 

and are being matched from URLs in the footer of the Google search page.
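
To illustrate the over-matching, here is a minimal sketch rather than the attached patch or the actual code in parser.py. It assumes the href pattern is a bare `<a href="..."` capture; `SAMPLE_HTML`, `HREF_RE`, and `file_urls` are hypothetical names, and the sample page is invented, reusing only the five footer paths listed above. One possible filter is to keep only absolute links that end in the requested extension:

```python
import re

# Hypothetical stand-in for a results page: one real hit plus the five
# footer links reported in this issue.
SAMPLE_HTML = '''
<a href="http://example.com/files/report.doc">Report</a>
<a href="/">Home</a>
<a href="/intl/en/ads/">Advertising</a>
<a href="/services/">Services</a>
<a href="/intl/en/privacy.html">Privacy</a>
<a href="/intl/en/about.html">About</a>
'''

# Assumed shape of the pattern: capture whatever follows <a href=" up to the
# closing quote. The real expression in parser.py may differ.
HREF_RE = re.compile(r'<a href="([^"]*)"')


def file_urls(html, extension):
    """Return absolute links ending in .<extension>, dropping site-relative ones."""
    suffix = "." + extension.lower()
    urls = []
    for url in HREF_RE.findall(html):
        lowered = url.lower()
        # Footer links such as '/services/' are site-relative, so skip them.
        if not lowered.startswith(("http://", "https://")):
            continue
        if lowered.endswith(suffix):
            urls.append(url)
    return urls


all_links = HREF_RE.findall(SAMPLE_HTML)
doc_links = file_urls(SAMPLE_HTML, "doc")
print(len(all_links), "raw matches;", len(doc_links), "after filtering")
```

On this sample, the unfiltered pattern counts the footer links alongside the real result, while the filtered list keeps only the document URL. Whether the real results page wraps hit URLs in a redirect would change the filter, so treat the prefix check as an assumption.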

Patch attached to address the additional (spurious) pattern matches.


Original issue reported on code.google.com by lnxp...@gmail.com on 16 Jun 2011 at 3:00
