Add crawler to get texts from websites #775

Merged: 26 commits into deepset-ai:master, Feb 18, 2021

Conversation

DIVYA-19 (Contributor) commented Jan 26, 2021

Hello,

I tried to implement the feature discussed in #770:

  • handles both a single URL and a list of URLs
  • has an option to specify whether to extract_sub_links or not
  • stores the data as text files in the specified output directory
  • metadata is stored in each file in the format {"url": ..., "base_url": ..., "text": ...}
  • a webdriver (Chrome or Firefox) needs to be specified
  • external links and in-page navigation links are excluded

All required functions are added to preprocessor/utils.py.
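
A rough usage sketch of the proposed helper; the parameter names urls, output_dir, extract_sub_links, and driver are assumptions based on the list above, not the exact signature in this PR:

from preprocessor.utils import fetch_data_from_url

docs = fetch_data_from_url(
    urls=["https://haystack.deepset.ai/docs/latest/"],  # single URL or list of URLs
    output_dir="crawled_files",     # one text file per crawled page is written here
    extract_sub_links=True,         # also follow links within the same domain
    driver="chrome",                # webdriver to use: "chrome" or "firefox"
)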

tholor (Member) commented Jan 27, 2021

Hey @DIVYA-19 ,

Thanks for working on this! I really think this is a valuable addition to Haystack.

I am currently digging into your proposed implementation and a few early questions / comments came up:

  • Why use Selenium? I am not really familiar with web scraping/crawling, but it seems like scrapy is also a popular, powerful alternative. From a quick look it seems that scrapy has fewer dependencies (e.g. no browser driver) and is faster (https://medium.com/analytics-vidhya/scrapy-vs-selenium-vs-beautiful-soup-for-web-scraping-24008b6c87b8). Any particular reason why you chose Selenium?
  • The code is currently scattered across various methods in utils.py. It might be cleaner to wrap everything in a Crawler class (potentially even in its own module). I am happy to propose a more concrete design once we settle on the framework.
  • From a quick test, it seems that we are currently only crawling "level 1 sublinks", i.e. only links that are directly available from the passed URL. I think for many cases it would make sense to allow deeper levels to really find all pages on a website.
  • Currently, we include all URLs that share the same base domain. In some use cases it might make sense to restrict it further to certain paths. For example: I want to crawl all Haystack docs. If I pass url=["https://haystack.deepset.ai/docs/latest/get_startedmd"], all pages with the domain "haystack.deepset.ai" are crawled. However, I am only interested in those with https://haystack.deepset.ai/docs/latest/*. It would be cool to allow a param that supports wildcards / regex for filtering those "relevant" URLs, something like the sketch below.
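
For illustration only (filter_urls is a hypothetical parameter name, not part of this PR):

import re

def filter_sub_links(sub_links, filter_urls):
    # keep only sub-links that match at least one of the given regex patterns
    patterns = [re.compile(p) for p in filter_urls]
    return [url for url in sub_links if any(p.search(url) for p in patterns)]

filter_sub_links(
    ["https://haystack.deepset.ai/docs/latest/crawler",
     "https://haystack.deepset.ai/blog/some-post"],
    filter_urls=[r"https://haystack\.deepset\.ai/docs/latest/.*"],
)
# -> ['https://haystack.deepset.ai/docs/latest/crawler']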

DIVYA-19 (Contributor, Author) commented Jan 28, 2021

Any particular reason why you chose Selenium?

No, not exactly, but it's very easy to use and I'm familiar with it. I had the same thought about the driver path.
scrapy seems cool. I'll start looking into it.

The code is currently scattered across various methods in utils.py.

Yeah! Creating a module would be nice. I'll do that.

I think for many cases it would make sense to allow deeper levels to really find all pages on a website

It would be a good feature, but I have a question: when should we stop looking for URLs?

In some use cases it might make sense to restrict it further to certain paths

Sounds good, we can do that.
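
One common answer to the "when to stop" question above is a crawl-depth limit. A minimal sketch (get_page_links is a hypothetical stand-in for the existing link extraction):

def crawl(start_urls, get_page_links, max_depth=1):
    # stop once a URL was already visited or the maximum depth is reached
    visited = set()
    frontier = [(url, 0) for url in start_urls]
    while frontier:
        url, depth = frontier.pop()
        if url in visited:
            continue
        visited.add(url)
        if depth < max_depth:
            for link in get_page_links(url):
                frontier.append((link, depth + 1))
    return visited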

DIVYA-19 (Contributor, Author)

Hi @tholor,
I dug into scrapy. It has nice features (like extracting links with a few settings), but it can only extract the static page source. To get dynamically rendered content we would need to use scrapy-splash or Selenium.

tholor (Member) commented Jan 29, 2021

Ok, I see. From a quick look, scrapy-splash doesn't look better than Selenium in terms of dependencies, as it requires running an external Docker container. I'd say let's go forward with Selenium and try to make the configuration + installation of the webdriver as simple as possible for users.
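
For reference, a hedged sketch of what a "simple as possible" webdriver setup could look like (headless Chrome; this assumes chromedriver is installed and on PATH, and is not the configuration used in this PR):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")   # no visible browser window
options.add_argument("--no-sandbox")
driver = webdriver.Chrome(options=options)
driver.get("https://haystack.deepset.ai/docs/latest/")
text = driver.find_element_by_tag_name("body").text  # same extraction as in crawler.py
driver.quit()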

tholor (Member) commented Feb 4, 2021

@DIVYA-19 I saw that you already refactored some parts of the PR. Let me know if I should do another review round :)

DIVYA-19 (Contributor, Author) commented Feb 4, 2021

Yeah @tholor! It would be nice if you could review and give some suggestions. I couldn't update you on the changes in the PR since it was failing some checks.
Updated things:

  • created a separate module
  • the function both stores files and returns a list of dictionaries
  • included a regex parameter so that data is retrieved from matching URLs only;
    it still retrieves only first-level sub-URLs, I haven't updated that feature yet

tholor changed the title from "add fetch_data_from_url to extract data and store as files" to "Add crawler to get texts from websites" on Feb 11, 2021
tholor (Member) commented Feb 11, 2021

@DIVYA-19 I just did one review round and pushed a few changes that seemed helpful from my perspective:

  • added a new module, connector, where we can put crawler.py and later extend to other external input streams (e.g. Confluence, OneDrive, GDrive, CRM systems ...)
  • moved everything into a Crawler class (we need a class if we want to include crawlers as nodes in the upcoming indexing pipelines, see Add support for indexing pipelines #816)
  • renamed a few params and added docstrings

Hope this makes sense from your perspective, @DIVYA-19?

@tanaysoni Please review and let me know your thoughts on the integration into pipelines. In particular, I was wondering whether we want to put most of the params in the class init or in run(); I guess this has quite some implications for the YAML representation.
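
A hedged sketch of the init-vs-run() split in question: configuration that should end up in the pipeline YAML goes into __init__, per-call inputs go into run(). The parameter names follow this thread, but the exact signatures are assumptions, not the merged implementation.

class Crawler:
    def __init__(self, output_dir: str, crawler_depth: int = 1):
        # "static" configuration, serializable into a YAML pipeline definition
        self.output_dir = output_dir
        self.crawler_depth = crawler_depth

    def run(self, urls, filter_urls=None):
        # per-call inputs; returning dicts lets an indexing pipeline consume them
        return self.crawl(urls=urls, filter_urls=filter_urls)

    def crawl(self, urls, filter_urls=None):
        raise NotImplementedError  # the actual scraping lives here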

DIVYA-19 (Contributor, Author)

Yes @tholor, that totally makes sense to me too.

tholor requested review from tanaysoni and removed request for tanaysoni, February 15, 2021 14:41
lalitpagaria (Contributor)

@tholor is there any plan to include this in the next release?

I am trying this with Confluence and will share the results.

tholor requested review from oryx1729 and removed request for tanaysoni, February 18, 2021 08:29
tholor (Member) commented Feb 18, 2021

@lalitpagaria yep, it will be merged soon. Just waiting for @oryx1729's review.

tholor merged commit 6c3ec54 into deepset-ai:master on Feb 18, 2021
ierezell commented Apr 8, 2021

Hi @tholor, @DIVYA-19

I just came here after a nice call with @PiffPaffM, who pointed me to this PR.

It's a really nice feature that I also wanted and had implemented on my own.

I have some questions / ideas about it (from my experimentation) that could be included later.

Keeping only the text (l. 133 & 134 in crawler.py):

el = self.driver.find_element_by_tag_name('body')
text = el.text

This is nice, but when I tried it on my datasets, I lacked context.

Indeed, once the raw text is extracted we lose all the information a human finds useful. Is it in a paragraph? In a sidebar? Which context was it in?

We could first split the text as it is laid out on the website (H3 headings, paragraphs, etc.) to generate more relevant / structured chunks / documents.

Also, including the title, parent page, or any other relevant information helped a lot in my case. Some of these are reliably present on every website (like the title).

Talking about vaccines on a flu page or on a COVID page is not the same, and if I only look at the text, a query like "When will the vaccine for COVID be ready?" can return "On 04/05/1981" (from the flu page!).
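
A rough sketch of this kind of structure-aware extraction (using BeautifulSoup purely for illustration; the crawler itself uses Selenium): split the body at headings and keep the page title and current heading as metadata for each chunk.

from bs4 import BeautifulSoup

def extract_structured_chunks(html, url):
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title.get_text(strip=True) if soup.title else ""
    chunks, heading, buffer = [], "", []
    for el in soup.find_all(["h1", "h2", "h3", "p"]):
        if el.name == "p":
            buffer.append(el.get_text(strip=True))
            continue
        if buffer:  # a new heading starts a new chunk: flush the previous one
            chunks.append({"text": "\n".join(buffer),
                           "meta": {"title": title, "heading": heading, "url": url}})
        heading, buffer = el.get_text(strip=True), []
    if buffer:
        chunks.append({"text": "\n".join(buffer),
                       "meta": {"title": title, "heading": heading, "url": url}})
    return chunks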

Using selenium vs scrapy

In my opinion (but it's just mine) scrapy is more powerful.

Indeed, for dynamically generated content you need Selenium (or an equivalent), but scrapy is a really nice framework with an already-built pipeline for writing data (collect -> filter -> format -> write) and a lot of magic already done for us (URL filtering, multithreading).
It is also faster.

Also, we could make the collect component swappable so users can plug in their own, while the filter / format steps would stay Haystack-universal.
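
A tiny sketch of that separation (illustrative names only): the collect step is user-replaceable, the cleaning / formatting around it stays the same.

from typing import Dict, List, Protocol

class Collector(Protocol):
    def collect(self, urls: List[str]) -> List[Dict]: ...

def build_documents(collector: Collector, urls: List[str]) -> List[Dict]:
    raw = collector.collect(urls)                    # user-supplied collect step
    cleaned = [r for r in raw if r.get("text")]      # shared filter step
    return [{"text": r["text"], "meta": {"url": r.get("url")}} for r in cleaned]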

Again, it's just my opinion, but I hope all the struggles and time I spent on this can help you guys develop a better product!
Thanks again for all those great improvements!

I attach the code for a scrapy crawler (which uses rotating proxies, fake user agents, and the whole pipeline), just in case.

Code for a parser with Scrapy (to adapt)
import json
import os
import re
from typing import Generator, Union
import click
import requests
import scrapy
from fp.fp import FreeProxy
from itemadapter import ItemAdapter
from scrapy.crawler import CrawlerProcess
from scrapy.exceptions import DropItem
from scrapy.item import Field, Item

from utils import get_datas_path, path_join

class Article(Item):
    content = Field()
    title = Field()
    url = Field()
    shortstory = Field()
    description = Field()


class CleanerPipeline:  # This could be implemented universally (and privately) by Haystack
    def open_spider(self, spider):
        self.data_content = []
        if os.path.exists(spider.json_file_path):
            with open(spider.json_file_path, 'r') as file:
                for line in file:
                    self.data_content.append(json.loads(line)['content'])

    def process_item(self, item: Article, spider):
        adapter = ItemAdapter(item)

        if not adapter.get("title") or not adapter.get("content"):
            raise DropItem("No title or content")

        if not adapter["title"] or not adapter["content"]:
            raise DropItem("Title or content empty")

        if adapter['content'].startswith("\n VIDEO"):
            raise DropItem("Video content")

        item['content'] = re.sub(r'\n+', '\n', adapter["content"]
                                 ).replace(u'\xa0', u' ')

        if item['content'] in self.data_content:
            raise DropItem("Already in written datas")

        return item


class WriterPipeline:  # This could be implemented universally (and privately) by Haystack

    def open_spider(self, spider):
        self.json_file = open(spider.json_file_path, 'a')
        self.text_file = open(spider.text_file_path, 'a')

    def close_spider(self, spider):
        self.text_file.close()
        self.json_file.close()

    def process_item(self, item: Article, spider):
        json_line = json.dumps(
            ItemAdapter(item).asdict(), ensure_ascii=False
        ) + "\n"

        self.json_file.write(json_line)

        for line in item['content'].split('\n'):

            if len(line.split()) > 8:
                self.text_file.write(line+'\n')
            # else:
            #     print("Content too small: not writing\n", line)

        return item


class NarcitySpider(scrapy.Spider):
    name = "narcity"

    base_url = 'https://www.narcity.com'

    count_not_articles = 0

    def __init__(self, language: str):
        super().__init__(name="Narcity")
        self.language = 'fr-ca' if language == 'fr' else 'en-us'
        self.json_file_path = path_join(get_datas_path(), language,
                                        "journals", "narcity.json")

        self.text_file_path = self.json_file_path.replace('.json', '.txt')

    def start_requests(self) -> Generator[scrapy.Request, None, None]:
        for pagination_nb in range(10000000):
            response = requests.get(
                url=f'{self.base_url}/{self.language}/nouvelles.json?page={pagination_nb}'
            )

            datas = response.json()

            if not datas['articles']:
                self.count_not_articles += 1
                if self.count_not_articles > 10:
                    print("End of pagination : exiting")
                    break
            else:
                for article in datas['articles']:
                    yield scrapy.Request(url=f"{self.base_url}{article['path']}")

    def parse(self, response: scrapy.http.Response, **kwargs):  # This part could be implemented per use case by the user
        title: str = response.xpath("//title/text()").get()

        shortstory: str = response.xpath(
            "//meta[@name='description']/@content"
        ).get()

        description: str = response.xpath(
            "//meta[@property='og:description']/@content"
        ).get()

        content = "".join(
            response.xpath(
                "//div[@class='body']//child::text()"
            ).getall()
        )

        self.log(f'Done : {title}')

        item = Article()
        item["content"] = content
        item["shortstory"] = shortstory
        item["title"] = title.replace(' - Narcity', '.')
        item["description"] = description
        item["url"] = response.url  
        yield item  # the item format would force users to be Haystack compliant
 

@click.command()  # Click is just for nice and easy command lines
@click.option(
    '--rotate',
    is_flag=True,
    help="To use a rotating proxy and rotating agents headers (slower)"
)
@click.option(
    '--lang',
    help="Which language to parse Narcity for",
    type=click.Choice(['fr', 'en'], case_sensitive=True)
)
def narcity(rotate: bool, lang: str):

    crawl_settings: dict[str, Union[bool, str, list[str], dict[str, Union[int, None]]]] = {
        "ITEM_PIPELINES": {
            "scrapers.narcity_parser.CleanerPipeline": 100,
            "scrapers.narcity_parser.WriterPipeline": 200
        },
        "LOG_LEVEL": "INFO",
        "AUTOTHROTTLE_ENABLED": True          # Slow down number of requests if the server choke
    }

    if rotate:
        rotating_settings = {
            "FAKEUSERAGENT_PROVIDERS": [
                # this is the first provider we'll try
                'scrapy_fake_useragent.providers.FakeUserAgentProvider',
                # if FakeUserAgentProvider fails, we'll use faker to generate a user-agent string for us
                'scrapy_fake_useragent.providers.FakerProvider',
                # fall back to USER_AGENT value
                'scrapy_fake_useragent.providers.FixedUserAgentProvider',
            ],
            "DOWNLOADER_MIDDLEWARES": {
                'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
                'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
                'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware': 400,
                'scrapy_fake_useragent.middleware.RetryUserAgentMiddleware': 401,
                'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
                'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
            },
            "ROTATING_PROXY_LIST": FreeProxy().get_proxy_list(),
            "USER_AGENT": 'Mozilla/5.0 (X11; Datanyze; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36',
        }

        crawl_settings.update(rotating_settings)

    process = CrawlerProcess(settings=crawl_settings)
    process.crawl(NarcitySpider, language=lang)
    process.start()  # the script will block here until the crawling is finished


if __name__ == '__main__':
    narcity()

tholor (Member) commented Apr 10, 2021

Hey @ierezell ,
Thanks a lot for your detailed thoughts on this. Very much appreciated! The current version of the crawler was just a first shot. We can definitely improve it :)

I think it would be very helpful to extract more of the structure of a website and use it for better chunking and additional metadata.
Especially the title is super important for retrieval.

Re scrapy vs selenium: I see pros and cons for both. Dynamic content was the killer argument for selenium in the current implementation, but as this is not needed in all use cases, scrapy might be a better choice for those. I like your suggestion of having multiple "collectors" that use similar parts further down the pipeline. We could have selenium and scrapy side by side and leave the choice to the end user.

Would you be interested in creating a PR for this? I think your code is already in pretty good shape. We'd just need the integration into Haystack and to iron out some inconsistencies with the Selenium crawler...

lalitpagaria (Contributor)

My apologies for jumping into the discussion, I just want to add the following suggestion:
I see that the newspaper lib extracts this information via HTML tags.
We could copy their parse() function.
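
For illustration, here is roughly what the newspaper3k library already extracts out of the box (title, text, authors, publish date); mirroring its parse() logic would give the crawler similar metadata:

from newspaper import Article

article = Article("https://haystack.deepset.ai/docs/latest/")  # any article URL
article.download()
article.parse()
print(article.title)
print(article.text[:200])
print(article.authors, article.publish_date)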

ierezell

Hi @tholor, with pleasure!
I was writing this as hindsight / improvement ideas (I know this is a beta), and I'm really glad that you're taking a step in this direction.

The danger with extracting data is also trying to become a Google (checking videos, images, relations and all the search-engine stuff, which is useful but should stay a bit out of focus for now). That said, yes, some textual information can be essential and should be extracted.

For scrapy and Selenium: scrapy can be standalone and we can plug Selenium on top of it (scrapy deals with it automagically, something like getDynamicContent=True which requires Selenium); the other way around (Selenium first) doesn't seem possible.
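
A hedged sketch of that "Selenium on top of scrapy" idea: a custom downloader middleware that renders with a real browser only when a request asks for it (the render_js flag and the middleware itself are illustrative, not an existing API):

from scrapy.http import HtmlResponse
from selenium import webdriver

class SeleniumRenderMiddleware:
    def __init__(self):
        self.driver = webdriver.Chrome()  # assumes chromedriver is on PATH

    def process_request(self, request, spider):
        if not request.meta.get("render_js"):
            return None  # let scrapy's default downloader handle static pages
        self.driver.get(request.url)
        return HtmlResponse(url=request.url,
                            body=self.driver.page_source,
                            encoding="utf-8",
                            request=request)

It would be enabled via DOWNLOADER_MIDDLEWARES and triggered per request with scrapy.Request(url, meta={"render_js": True}).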

For collectors, yes: because every website is different and can have subtleties, we could have a basic default parser, and the user could then replace only the parser with their own (which is easy to do, while the rest, like cleaner, writer, etc., stays Haystack-made).

I would be really interested, but unfortunately I don't think I will have time for this (I was supposed to do another one for deepset as well and never found the time...). However, I will answer any questions / help if someone implements it.

@lalitpagaria, thanks, I didn't know this library, and indeed it doesn't seem to be evolving anymore (last commit 10 months ago), so we could either use it or implement something similar. It's really nice inspiration for constructing powerful parsers.

Hope you have a great day
