There seem to be complaints related to the user agent scraping permission issue #1002

Closed
sutgeorge opened this issue Apr 10, 2024 · 3 comments


@sutgeorge

Hello,

Quite a few people have opened issues similar to this one. I solved my problem with the user-agent trick: the site refused to serve my scraper, and article.html came back as essentially an empty string.

The solution is to pass a Config object to the Article constructor, with browser_user_agent set to a browser-like string such as Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:108.0) Gecko/20100101 Firefox/108.0. I'm wondering whether this detail should be added to the main README.md file. I'm convinced it would be helpful and save other people a lot of time.
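For readers who want to confirm what header is actually being sent before involving newspaper, the same user-agent trick can be sketched with just the standard library. This is a minimal illustration (example.com is a placeholder URL, and no request is sent here):

```python
import urllib.request

# A browser-like user-agent string, as suggested above.
ua = ("Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:108.0) "
      "Gecko/20100101 Firefox/108.0")

# Build a request carrying the custom User-Agent header.
# urllib stores header names in capitalized form ("User-agent").
req = urllib.request.Request("https://example.com/",
                             headers={"User-Agent": ua})

print(req.get_header("User-agent"))
```

Calling urllib.request.urlopen(req) would then send the request with that header, which is exactly what newspaper does internally when browser_user_agent is set on its Config.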

Thank you.

@rajitkhanna

Hi @sutgeorge, could you share your code?

@sutgeorge (Author) commented Jun 13, 2024

Sure @rajitkhanna, this is a snippet of the Jupyter Notebook that I used:

import newspaper
import tqdm                      # used later in the notebook
from newspaper import Article, Config
from bs4 import BeautifulSoup    # used later in the notebook

config = Config()
config.browser_user_agent = "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:108.0) Gecko/20100101 Firefox/108.0"

section = 'politica'   # placeholder; defined elsewhere in the notebook
page_number = 1        # placeholder; defined elsewhere in the notebook
url = 'https://www.capital.ro/{}/page/{}'.format(section, page_number)
page = Article(url, language='ro', config=config)
page.download()

...

@sutgeorge (Author)

Obviously, you can replace the URL with anything you'd like (I wanted to scrape the page containing a list of articles from a news publication).
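Since the notebook imports BeautifulSoup, the downloaded listing page (page.html) was presumably parsed with it afterwards to collect article links. A sketch of that step on a literal HTML fragment, since the real markup of capital.ro is not shown in this thread (the div.post selector is a hypothetical example):

```python
from bs4 import BeautifulSoup

# Stand-in for page.html; the actual capital.ro markup may differ.
html = (
    '<div class="post"><a href="https://www.capital.ro/a1">A1</a></div>'
    '<div class="post"><a href="https://www.capital.ro/a2">A2</a></div>'
)

soup = BeautifulSoup(html, "html.parser")

# Collect the href of every link inside a post container.
links = [a["href"] for a in soup.select("div.post a[href]")]
print(links)
```

Each collected URL could then be fed to its own Article(url, config=config) for download and parsing.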
