Unknown URL type and sites that hang scraper #7

Closed
luluhoc opened this issue Jan 27, 2019 · 10 comments


luluhoc commented Jan 27, 2019

Hello,
I'm getting this error, which stops the program from extracting emails:

unknown url type: 'robert@broofa.com'
Press enter to continue

Some sites also cause the scraper to hang. I'm not sure whether this can be fixed, but here are some examples that may help you track it down:

Searching in https://whyy.streamguys1.com/whyy-mp3

Searching in http://www.investor.reuters.com/business/BusCompanyOverview.aspx?ticker=SCI&symbol=SCI&target=%2fbusiness%2fbuscompany%2fbuscompfake%2fbuscompoverview

Searching in http://www.accuweather.com/en/us/jersey-city-nj/07306/weather-forecast/2735_pc

groupon.com


I'm also getting this error:
`[Errno 104] Connection reset by peer`
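
The "unknown url type" message is the `ValueError` that `urllib.request` raises when it is handed a string with no scheme, such as an email address that was collected as a link. A minimal sketch of a pre-filter follows, assuming the scraper keeps its candidate links as plain strings; the helper name `is_fetchable` is hypothetical and is not part of the project.

```python
from urllib.parse import urlsplit


def is_fetchable(link):
    """Return True only for absolute http(s) URLs that urlopen can request."""
    parts = urlsplit(link.strip())
    # 'robert@broofa.com' and 'groupon.com' have no scheme and no host,
    # so they are rejected before they ever reach urlopen.
    return parts.scheme in ("http", "https") and bool(parts.netloc)


candidates = [
    "robert@broofa.com",
    "groupon.com",
    "https://whyy.streamguys1.com/whyy-mp3",
]
to_visit = [link for link in candidates if is_fetchable(link)]
```

Filtering this way only avoids the `ValueError`; it does nothing about connection resets or hanging sites, which need separate handling.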
@luluhoc luluhoc changed the title Unknown URL type Unknown URL type and sites that are freezing scraper Jan 27, 2019

luluhoc commented Jan 28, 2019

@luluhoc luluhoc changed the title Unknown URL type and sites that are freezing scraper Unknown URL type and sites that hangs scraper Jan 28, 2019
@luluhoc luluhoc changed the title Unknown URL type and sites that hangs scraper Unknown URL type and sites that hang scraper Jan 28, 2019
DiegoCaraballo added a commit that referenced this issue Jan 30, 2019
DiegoCaraballo (Owner) commented

@luluhoc
Hi, I added a check for broken URLs in options 1, 2 and 3.
The operation is now a bit slower, but I will try to fix that when I switch the code to objects.
Regards
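
For readers following along, a check for broken URLs could look roughly like the sketch below. This is not the code from the referenced commit; the function name `fetch_or_none` and the injectable `opener` parameter are assumptions made for illustration. It catches the two errors reported above so that one bad link does not stop the whole run.

```python
import urllib.error
import urllib.request


def fetch_or_none(url, opener=urllib.request.urlopen, timeout=10):
    """Return the response body as bytes, or None if the URL cannot be fetched."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.read()
    except ValueError as exc:
        # Raised for strings such as 'robert@broofa.com' ("unknown url type").
        print(f"Skipping invalid URL {url!r}: {exc}")
    except (urllib.error.URLError, OSError) as exc:
        # URLError covers DNS and HTTP failures; OSError covers
        # ConnectionResetError ([Errno 104]) and socket timeouts.
        print(f"Skipping unreachable URL {url!r}: {exc}")
    return None
```

Returning `None` lets the caller move on to the next link with a simple `if body is None: continue`.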


luluhoc commented Jan 30, 2019

Thanks, keep up the great work

DiegoCaraballo (Owner) commented

Hi, I added a check for broken URLs in options 1, 2 and 3.


luluhoc commented Jan 30, 2019

The problem persists.
I'm searching for "funeral home new jersey" and requesting 500 results.

I have run the updated Python script twice, and both times it hangs on the same URL.

Searching in /notices/Alejandro-Hernandez
Searching in /notices/Patrick-Montella
Searching in /notices/Frank-Petrecca
Searching in /notices/Anthony-Ferlazzo
Searching in /notices/Alejandro-Hernandez
Searching in /notices/Patrick-Montella
Searching in /notices/Frank-Petrecca
Searching in /notices/Anthony-Ferlazzo
Searching in /notices/Warren-Vernon
Searching in /notices/Victoria-Rooney
Searching in /notices/Warren-Vernon
Searching in /notices/Victoria-Rooney
Searching in javascript:navigateTo('/mailinglist')
Searching in javascript:navigateTo('/listings')
Searching in /send-flowers
Searching in /mailinglist
Searching in /listings
Searching in /send-flowers
Searching in /our-facilities
Searching in /concierge-services
Searching in http://www.nfda.org/
Searching in http://www.nutleychamber.com/
86 - chamber@nutleychamber.com
87 - info@tempotherapy.com
Searching in https://web.njsfda.org/public/professional-home/about-njsfda/related-entities/njfds.aspx
Searching in https://web.njsfda.org/public/preplanning/preplanning-a-funeral/check-trust-balances-and-choices-tax-statements.aspx
Searching in https://www.facebook.com/pages/Biondi-Funeral-Home/154470051254851
Searching in http://www.accuweather.com/en/us/nutley-nj/07110/weather-forecast/2709_pc
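
The log above shows relative paths such as `/notices/Warren-Vernon`, `javascript:` pseudo-links, and the same links visited more than once. A minimal sketch of normalizing links before visiting them follows, assuming the scraper knows the URL of the page each link came from. The name `normalize_links` and the `http://example.com/` page URL are hypothetical.

```python
from urllib.parse import urljoin, urlsplit


def normalize_links(page_url, hrefs):
    """Resolve hrefs against page_url, drop non-http(s) links, remove repeats."""
    seen = set()
    result = []
    for href in hrefs:
        # urljoin turns '/notices/x' into an absolute URL and leaves
        # 'javascript:...' untouched, so the scheme test below rejects it.
        absolute = urljoin(page_url, href.strip())
        if urlsplit(absolute).scheme not in ("http", "https"):
            continue
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


links = normalize_links(
    "http://example.com/",  # hypothetical page the links were found on
    [
        "/notices/Warren-Vernon",
        "/notices/Warren-Vernon",
        "javascript:navigateTo('/listings')",
        "http://www.nfda.org/",
    ],
)
```

Using a set for `seen` keeps the duplicate check cheap even when a search returns hundreds of pages.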

DiegoCaraballo (Owner) commented

Hello @luluhoc, I'll look into it.


luluhoc commented Jan 30, 2019

Thanks

DiegoCaraballo (Owner) commented

Hello @luluhoc,
On lines 837 and 876, add "timeout=10" to the call and test again:
f = urllib.request.urlopen(req, timeout=10)

I will upload the changes later, together with other fixes.

Regards
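
One caveat with this approach: the `timeout` argument to `urlopen` limits each blocking socket operation, not the whole download. An endpoint that keeps sending data (the `whyy-mp3` URL looks like an audio stream, though that is a guess) would never trigger it, and `read()` would not return. A minimal sketch of a fetch bounded by both size and total time follows; the name `fetch_bounded` and the default limits are assumptions chosen for illustration.

```python
import time
import urllib.request


def fetch_bounded(url, opener=urllib.request.urlopen, timeout=10,
                  max_bytes=1_000_000, max_seconds=20, chunk_size=8192):
    """Read at most max_bytes, and stop starting new reads after max_seconds."""
    deadline = time.monotonic() + max_seconds
    chunks = []
    total = 0
    with opener(url, timeout=timeout) as response:
        # Each read is still subject to the per-operation socket timeout;
        # the loop adds the overall limits that timeout alone does not give.
        while total < max_bytes and time.monotonic() < deadline:
            chunk = response.read(min(chunk_size, max_bytes - total))
            if not chunk:
                break  # end of the response body
            chunks.append(chunk)
            total += len(chunk)
    return b"".join(chunks)
```

The deadline is only checked between reads, so the worst case is somewhat longer than `max_seconds`. For an email scraper, a size cap of about a megabyte is usually enough to cover an HTML page while cutting off media streams quickly.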


luluhoc commented Jan 30, 2019

OK, I'll change it and try it out.


luluhoc commented Feb 2, 2019

It doesn't work
