There seem to be complaints related to the user agent scraping permission issue #1002

Closed
sutgeorge opened this issue Apr 10, 2024 · 3 comments


@sutgeorge

Hello,

Quite a few people have opened issues similar to this one. I solved my problem with the user-agent trick: the site refused to serve my scraper, and article.html came back as essentially an empty string.

The solution is to pass a Config object to the Article constructor, with browser_user_agent set to a browser-like string such as Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:108.0) Gecko/20100101 Firefox/108.0. I'm wondering whether this detail should be added to the main README.md file. I'm convinced it would be helpful and save other people a lot of time.
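For readers who want to confirm what header is actually being sent before involving newspaper, the same user-agent trick can be sketched with just the standard library. This is a minimal illustration (example.com is a placeholder URL, and no request is sent here):

```python
import urllib.request

# A browser-like user-agent string, as suggested above.
ua = ("Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:108.0) "
      "Gecko/20100101 Firefox/108.0")

# Build a request carrying the custom User-Agent header.
# urllib stores header names in capitalized form ("User-agent").
req = urllib.request.Request("https://example.com/",
                             headers={"User-Agent": ua})

print(req.get_header("User-agent"))
```

Calling urllib.request.urlopen(req) would then send the request with that header, which is exactly what newspaper does internally when browser_user_agent is set on its Config.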

Thank you.

@rajitkhanna

Hi @sutgeorge, could you share your code?

@sutgeorge (Author) commented Jun 13, 2024

Sure @rajitkhanna, this is a snippet of the Jupyter Notebook that I used:

import newspaper
import tqdm                      # used later in the notebook
from newspaper import Article, Config
from bs4 import BeautifulSoup    # used later in the notebook

config = Config()
config.browser_user_agent = "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:108.0) Gecko/20100101 Firefox/108.0"

section = 'politica'   # placeholder; defined elsewhere in the notebook
page_number = 1        # placeholder; defined elsewhere in the notebook
url = 'https://www.capital.ro/{}/page/{}'.format(section, page_number)
page = Article(url, language='ro', config=config)
page.download()

...

@sutgeorge (Author)

Obviously, you can replace the URL with anything you'd like (I wanted to scrape the page containing a list of articles from a news publication).
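Since the notebook imports BeautifulSoup, the downloaded listing page (page.html) was presumably parsed with it afterwards to collect article links. A sketch of that step on a literal HTML fragment, since the real markup of capital.ro is not shown in this thread (the div.post selector is a hypothetical example):

```python
from bs4 import BeautifulSoup

# Stand-in for page.html; the actual capital.ro markup may differ.
html = (
    '<div class="post"><a href="https://www.capital.ro/a1">A1</a></div>'
    '<div class="post"><a href="https://www.capital.ro/a2">A2</a></div>'
)

soup = BeautifulSoup(html, "html.parser")

# Collect the href of every link inside a post container.
links = [a["href"] for a in soup.select("div.post a[href]")]
print(links)
```

Each collected URL could then be fed to its own Article(url, config=config) for download and parsing.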
