Add crawler to get texts from websites #775

Merged: 26 commits into deepset-ai:master, Feb 18, 2021

Conversation

DIVYA-19 (Contributor) commented Jan 26, 2021

Hello,

I tried to implement the feature discussed in #770:

  • handles both a single URL and a list of URLs
  • has an option to specify whether to extract_sub_links or not
  • stores the data as text files in the specified output directory
  • metadata is stored in each file in the format {"url": ..., "base_url": ..., "text": ...}
  • a webdriver (Chrome or Firefox) needs to be specified
  • external links and in-page navigation links are excluded

All required functions are added to preprocessor/utils.py.
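
A rough usage sketch of the proposed helper; the parameter names urls, output_dir, extract_sub_links, and driver are assumptions based on the list above, not the exact signature in this PR:

from preprocessor.utils import fetch_data_from_url

docs = fetch_data_from_url(
    urls=["https://haystack.deepset.ai/docs/latest/"],  # single URL or list of URLs
    output_dir="crawled_files",     # one text file per crawled page is written here
    extract_sub_links=True,         # also follow links within the same domain
    driver="chrome",                # webdriver to use: "chrome" or "firefox"
)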

tholor (Member) commented Jan 27, 2021

Hey @DIVYA-19 ,

Thanks for working on this! I really think this is a valuable addition to Haystack.

I am currently digging into your proposed implementation and a few early questions / comments came up:

  • Why use Selenium? I am not really familiar with web scraping/crawling, but it seems like scrapy is also a popular, powerful alternative. From a quick look it seems that scrapy has fewer dependencies (e.g. no browser driver) and is faster (https://medium.com/analytics-vidhya/scrapy-vs-selenium-vs-beautiful-soup-for-web-scraping-24008b6c87b8). Any particular reason why you chose Selenium?
  • The code is currently scattered across various methods in utils.py. It might be cleaner to wrap everything in a Crawler class (potentially even in its own module). I am happy to propose a more concrete design once we settle on the framework.
  • From a quick test, it seems that we are currently only crawling "level 1 sublinks", i.e. only links that are directly available from the passed URL. I think for many cases it would make sense to allow deeper levels to really find all pages on a website.
  • Currently, we include all URLs that share the same base domain. In some use cases it might make sense to restrict it further to certain paths. For example: I want to crawl all Haystack docs. If I pass url=["https://haystack.deepset.ai/docs/latest/get_startedmd"], all pages with the domain "haystack.deepset.ai" are crawled. However, I am only interested in those with https://haystack.deepset.ai/docs/latest/*. It would be cool to allow a param that supports wildcards / regex for filtering those "relevant" URLs, something like the sketch below.
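
For illustration only (filter_urls is a hypothetical parameter name, not part of this PR):

import re

def filter_sub_links(sub_links, filter_urls):
    # keep only sub-links that match at least one of the given regex patterns
    patterns = [re.compile(p) for p in filter_urls]
    return [url for url in sub_links if any(p.search(url) for p in patterns)]

filter_sub_links(
    ["https://haystack.deepset.ai/docs/latest/crawler",
     "https://haystack.deepset.ai/blog/some-post"],
    filter_urls=[r"https://haystack\.deepset\.ai/docs/latest/.*"],
)
# -> ['https://haystack.deepset.ai/docs/latest/crawler']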

DIVYA-19 (Contributor, Author) commented Jan 28, 2021

Any particular reason why you chose Selenium?

No, not exactly, but it's very easy to use and I'm familiar with it. I had the same thought about the driver path.
scrapy seems cool. I'll start looking into it.

The code is currently scattered across various methods in utils.py.

Yeah! Creating a module would be nice. I'll do that.

I think for many cases it would make sense to allow deeper levels to really find all pages on a website

It would be a good feature, but I have a question: when should we stop looking for URLs?

In some use cases it might make sense to restrict it further to certain paths

Sounds good, we can do that.
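
One common answer to the "when to stop" question above is a crawl-depth limit. A minimal sketch (get_page_links is a hypothetical stand-in for the existing link extraction):

def crawl(start_urls, get_page_links, max_depth=1):
    # stop once a URL was already visited or the maximum depth is reached
    visited = set()
    frontier = [(url, 0) for url in start_urls]
    while frontier:
        url, depth = frontier.pop()
        if url in visited:
            continue
        visited.add(url)
        if depth < max_depth:
            for link in get_page_links(url):
                frontier.append((link, depth + 1))
    return visited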

DIVYA-19 (Contributor, Author)

Hi @tholor,
I dug into scrapy. It has nice features (like extracting links with a few settings), but it can only extract the static page source. To get dynamically rendered content we would need to use scrapy-splash or Selenium.

tholor (Member) commented Jan 29, 2021

Ok, I see. From a quick look, scrapy-splash doesn't look better than Selenium in terms of dependencies, as it requires running an external Docker container. I'd say let's go forward with Selenium and try to make the configuration + installation of the webdriver as simple as possible for users.
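
For reference, a hedged sketch of what a "simple as possible" webdriver setup could look like (headless Chrome; this assumes chromedriver is installed and on PATH, and is not the configuration used in this PR):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")   # no visible browser window
options.add_argument("--no-sandbox")
driver = webdriver.Chrome(options=options)
driver.get("https://haystack.deepset.ai/docs/latest/")
text = driver.find_element_by_tag_name("body").text  # same extraction as in crawler.py
driver.quit()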

tholor (Member) commented Feb 4, 2021

@DIVYA-19 I saw that you already refactored some parts of the PR. Let me know if I should do another review round :)

DIVYA-19 (Contributor, Author) commented Feb 4, 2021

Yeah @tholor! It would be nice if you could review and give some suggestions. I couldn't update you on the changes in the PR since it was failing some checks.
Updated things:

  • created a separate module
  • the function both stores files and returns a list of dictionaries
  • included a regex parameter so that data is retrieved from matching URLs only;
    it still retrieves only first-level sub-URLs, I haven't updated that feature yet

tholor changed the title from "add fetch_data_from_url to extract data and store as files" to "Add crawler to get texts from websites" on Feb 11, 2021
tholor (Member) commented Feb 11, 2021

@DIVYA-19 I just did one review round and pushed a few changes that seemed helpful from my perspective:

  • added a new module, connector, where we can put crawler.py and later extend to other external input streams (e.g. Confluence, OneDrive, GDrive, CRM systems ...)
  • moved everything into a Crawler class (we need a class if we want to include crawlers as nodes in the upcoming indexing pipelines, see Add support for indexing pipelines #816)
  • renamed a few params and added docstrings

Hope this makes sense from your perspective, @DIVYA-19?

@tanaysoni Please review and let me know your thoughts on the integration into pipelines. In particular, I was wondering whether we want to put most of the params in the class init or in run(); I guess this has quite some implications for the YAML representation.
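
A hedged sketch of the init-vs-run() split in question: configuration that should end up in the pipeline YAML goes into __init__, per-call inputs go into run(). The parameter names follow this thread, but the exact signatures are assumptions, not the merged implementation.

class Crawler:
    def __init__(self, output_dir: str, crawler_depth: int = 1):
        # "static" configuration, serializable into a YAML pipeline definition
        self.output_dir = output_dir
        self.crawler_depth = crawler_depth

    def run(self, urls, filter_urls=None):
        # per-call inputs; returning dicts lets an indexing pipeline consume them
        return self.crawl(urls=urls, filter_urls=filter_urls)

    def crawl(self, urls, filter_urls=None):
        raise NotImplementedError  # the actual scraping lives here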

DIVYA-19 (Contributor, Author)

Yes @tholor, that totally makes sense to me too.

tholor requested review from tanaysoni and removed request for tanaysoni, February 15, 2021 14:41
lalitpagaria (Contributor)

@tholor is there any plan to include this in the next release?

I am trying this with Confluence and will share the results.

tholor requested review from oryx1729 and removed request for tanaysoni, February 18, 2021 08:29
tholor (Member) commented Feb 18, 2021

@lalitpagaria yep, it will be merged soon. Just waiting for @oryx1729's review.

tholor merged commit 6c3ec54 into deepset-ai:master on Feb 18, 2021
ierezell commented Apr 8, 2021

Hi @tholor, @DIVYA-19

I just came here after a nice call with @PiffPaffM, who pointed me to this PR.

It's a really nice feature that I also wanted and had implemented on my own.

I have some questions / ideas about it (from my experimentation) that could be included later.

Keeping only the text (l. 133 & 134 in crawler.py):

el = self.driver.find_element_by_tag_name('body')
text = el.text

This is nice, but when I tried it on my datasets, I lacked context.

Indeed, once the raw text is extracted we lose all the information a human finds useful. Is it in a paragraph? In a sidebar? Which context was it in?

We could first split the text as it is laid out on the website (H3 headings, paragraphs, etc.) to generate more relevant / structured chunks / documents.

Also, including the title, parent page, or any other relevant information helped a lot in my case. Some of these are reliably present on every website (like the title).

Talking about vaccines on a flu page or on a COVID page is not the same, and if I only look at the text, a query like "When will the vaccine for COVID be ready?" can return "On 04/05/1981" (from the flu page!).
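
A rough sketch of this kind of structure-aware extraction (using BeautifulSoup purely for illustration; the crawler itself uses Selenium): split the body at headings and keep the page title and current heading as metadata for each chunk.

from bs4 import BeautifulSoup

def extract_structured_chunks(html, url):
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title.get_text(strip=True) if soup.title else ""
    chunks, heading, buffer = [], "", []
    for el in soup.find_all(["h1", "h2", "h3", "p"]):
        if el.name == "p":
            buffer.append(el.get_text(strip=True))
            continue
        if buffer:  # a new heading starts a new chunk: flush the previous one
            chunks.append({"text": "\n".join(buffer),
                           "meta": {"title": title, "heading": heading, "url": url}})
        heading, buffer = el.get_text(strip=True), []
    if buffer:
        chunks.append({"text": "\n".join(buffer),
                       "meta": {"title": title, "heading": heading, "url": url}})
    return chunks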

Using selenium vs scrapy

In my opinion (but it's just mine) scrapy is more powerful.

Indeed, for dynamically generated content you need Selenium (or an equivalent), but scrapy is a really nice framework with an already-built pipeline for writing data (collect -> filter -> format -> write) and a lot of magic already done for us (URL filtering, multithreading).
It is also faster.

Also, we could make the collect component swappable so users can plug in their own, while the filter / format steps would stay Haystack-universal.
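
A tiny sketch of that separation (illustrative names only): the collect step is user-replaceable, the cleaning / formatting around it stays the same.

from typing import Dict, List, Protocol

class Collector(Protocol):
    def collect(self, urls: List[str]) -> List[Dict]: ...

def build_documents(collector: Collector, urls: List[str]) -> List[Dict]:
    raw = collector.collect(urls)                    # user-supplied collect step
    cleaned = [r for r in raw if r.get("text")]      # shared filter step
    return [{"text": r["text"], "meta": {"url": r.get("url")}} for r in cleaned]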

Again, it's just my opinion, but I hope all the struggles and time I spent on this can help you guys develop a better product!
Thanks again for all those great improvements!

I attach the code for a scrapy crawler (which uses rotating proxies, fake user agents, and the whole pipeline), just in case.

Code for a parser with Scrapy (to adapt)
import json
import os
import re
from typing import Generator, Union
import click
import requests
import scrapy
from fp.fp import FreeProxy
from itemadapter import ItemAdapter
from scrapy.crawler import CrawlerProcess
from scrapy.exceptions import DropItem
from scrapy.item import Field, Item

from utils import get_datas_path, path_join

class Article(Item):
    content = Field()
    title = Field()
    url = Field()
    shortstory = Field()
    description = Field()


class CleanerPipeline:  # This could be implemented universally (and privately) by Haystack
    def open_spider(self, spider):
        self.data_content = []
        if os.path.exists(spider.json_file_path):
            with open(spider.json_file_path, 'r') as file:
                for line in file:
                    self.data_content.append(json.loads(line)['content'])

    def process_item(self, item: Article, spider):
        adapter = ItemAdapter(item)

        if not adapter.get("title") or not adapter.get("content"):
            raise DropItem("No title or content")

        if not adapter["title"] or not adapter["content"]:
            raise DropItem("Title or content empty")

        if adapter['content'].startswith("\n VIDEO"):
            raise DropItem("Video content")

        item['content'] = re.sub(r'\n+', '\n', adapter["content"]
                                 ).replace(u'\xa0', u' ')

        if item['content'] in self.data_content:
            raise DropItem("Already in written datas")

        return item


class WriterPipeline:  # This could be implemented universally (and privately) by Haystack

    def open_spider(self, spider):
        self.json_file = open(spider.json_file_path, 'a')
        self.text_file = open(spider.text_file_path, 'a')

    def close_spider(self, spider):
        self.text_file.close()
        self.json_file.close()

    def process_item(self, item: Article, spider):
        json_line = json.dumps(
            ItemAdapter(item).asdict(), ensure_ascii=False
        ) + "\n"

        self.json_file.write(json_line)

        for line in item['content'].split('\n'):

            if len(line.split()) > 8:
                self.text_file.write(line+'\n')
            # else:
            #     print("Content too small: not writing\n", line)

        return item


class NarcitySpider(scrapy.Spider):
    name = "narcity"

    base_url = 'https://www.narcity.com'

    count_not_articles = 0

    def __init__(self, language: str):
        super().__init__(name="Narcity")
        self.language = 'fr-ca' if language == 'fr' else 'en-us'
        self.json_file_path = path_join(get_datas_path(), language,
                                        "journals", "narcity.json")

        self.text_file_path = self.json_file_path.replace('.json', '.txt')

    def start_requests(self) -> Generator[scrapy.Request, None, None]:
        for pagination_nb in range(10000000):
            response = requests.get(
                url=f'{self.base_url}/{self.language}/nouvelles.json?page={pagination_nb}'
            )

            datas = response.json()

            if not datas['articles']:
                self.count_not_articles += 1
                if self.count_not_articles > 10:
                    print("End of pagination : exiting")
                    break
            else:
                for article in datas['articles']:
                    yield scrapy.Request(url=f"{self.base_url}{article['path']}")

    def parse(self, response: scrapy.http.Response, **kwargs):  # This part could be implemented per use case by the user
        title: str = response.xpath("//title/text()").get()

        shortstory: str = response.xpath(
            "//meta[@name='description']/@content"
        ).get()

        description: str = response.xpath(
            "//meta[@property='og:description']/@content"
        ).get()

        content = "".join(
            response.xpath(
                "//div[@class='body']//child::text()"
            ).getall()
        )

        self.log(f'Done : {title}')

        item = Article()
        item["content"] = content
        item["shortstory"] = shortstory
        item["title"] = title.replace(' - Narcity', '.')
        item["description"] = description
        item["url"] = response.url  
        yield item  # the item format would force users to be Haystack compliant
 

@click.command()  # Click is just for nice and easy command lines
@click.option(
    '--rotate',
    is_flag=True,
    help="To use a rotating proxy and rotating agents headers (slower)"
)
@click.option(
    '--lang',
    help="Which language to parse Narcity for",
    type=click.Choice(['fr', 'en'], case_sensitive=True)
)
def narcity(rotate: bool, lang: str):

    crawl_settings: dict[str, Union[bool, str, list[str], dict[str, Union[int, None]]]] = {
        "ITEM_PIPELINES": {
            "scrapers.narcity_parser.CleanerPipeline": 100,
            "scrapers.narcity_parser.WriterPipeline": 200
        },
        "LOG_LEVEL": "INFO",
        "AUTOTHROTTLE_ENABLED": True          # Slow down number of requests if the server choke
    }

    if rotate:
        rotating_settings = {
            "FAKEUSERAGENT_PROVIDERS": [
                # this is the first provider we'll try
                'scrapy_fake_useragent.providers.FakeUserAgentProvider',
                # if FakeUserAgentProvider fails, we'll use faker to generate a user-agent string for us
                'scrapy_fake_useragent.providers.FakerProvider',
                # fall back to USER_AGENT value
                'scrapy_fake_useragent.providers.FixedUserAgentProvider',
            ],
            "DOWNLOADER_MIDDLEWARES": {
                'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
                'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
                'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware': 400,
                'scrapy_fake_useragent.middleware.RetryUserAgentMiddleware': 401,
                'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
                'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
            },
            "ROTATING_PROXY_LIST": FreeProxy().get_proxy_list(),
            "USER_AGENT": 'Mozilla/5.0 (X11; Datanyze; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36',
        }

        crawl_settings.update(rotating_settings)

    process = CrawlerProcess(settings=crawl_settings)
    process.crawl(NarcitySpider, language=lang)
    process.start()  # the script will block here until the crawling is finished


if __name__ == '__main__':
    narcity()

tholor (Member) commented Apr 10, 2021

Hey @ierezell ,
Thanks a lot for your detailed thoughts on this. Very much appreciated! The current version of the crawler was just a first shot. We can definitely improve it :)

I think it would be very helpful to extract more of the structure of a website and use it for better chunking and additional metadata.
Especially the title is super important for retrieval.

Re scrapy vs selenium: I see pros and cons for both. Dynamic content was the killer argument for selenium in the current implementation, but as this is not needed in all use cases, scrapy might be a better choice for those. I like your suggestion of having multiple "collectors" that use similar parts further down the pipeline. We could have selenium and scrapy side by side and leave the choice to the end user.

Would you be interested in creating a PR for this? I think your code is already in pretty good shape. We'd just need the integration into Haystack and to iron out some inconsistencies with the Selenium crawler...

lalitpagaria (Contributor)

My apologies for jumping into the discussion, I just want to add the following suggestion:
I see that the newspaper lib extracts this information via HTML tags.
We could copy their parse() function.
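
For illustration, here is roughly what the newspaper3k library already extracts out of the box (title, text, authors, publish date); mirroring its parse() logic would give the crawler similar metadata:

from newspaper import Article

article = Article("https://haystack.deepset.ai/docs/latest/")  # any article URL
article.download()
article.parse()
print(article.title)
print(article.text[:200])
print(article.authors, article.publish_date)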

ierezell

Hi @tholor, with pleasure!
I was writing this as hindsight / improvement ideas (I know this is a beta), and I'm really glad that you're taking a step in this direction.

The danger with extracting data is also trying to become a Google (checking videos, images, relations and all the search-engine stuff, which is useful but should stay a bit out of focus for now). That said, yes, some textual information can be essential and should be extracted.

For scrapy and Selenium: scrapy can be standalone and we can plug Selenium on top of it (scrapy deals with it automagically, something like getDynamicContent=True which requires Selenium); the other way around (Selenium first) doesn't seem possible.
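
A hedged sketch of that "Selenium on top of scrapy" idea: a custom downloader middleware that renders with a real browser only when a request asks for it (the render_js flag and the middleware itself are illustrative, not an existing API):

from scrapy.http import HtmlResponse
from selenium import webdriver

class SeleniumRenderMiddleware:
    def __init__(self):
        self.driver = webdriver.Chrome()  # assumes chromedriver is on PATH

    def process_request(self, request, spider):
        if not request.meta.get("render_js"):
            return None  # let scrapy's default downloader handle static pages
        self.driver.get(request.url)
        return HtmlResponse(url=request.url,
                            body=self.driver.page_source,
                            encoding="utf-8",
                            request=request)

It would be enabled via DOWNLOADER_MIDDLEWARES and triggered per request with scrapy.Request(url, meta={"render_js": True}).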

For collectors, yes: because every website is different and can have subtleties, we could have a basic default parser, and the user could then replace only the parser with their own (which is easy to do, while the rest, like cleaner, writer, etc., stays Haystack-made).

I would be really interested, but unfortunately I don't think I will have time for this (I was supposed to do another one for deepset as well and never found the time...). However, I will answer any questions / help if someone implements it.

@lalitpagaria, thanks, I didn't know this library, and indeed it doesn't seem to be evolving anymore (last commit 10 months ago), so we could either use it or implement something similar. It's really nice inspiration for constructing powerful parsers.

Hope you have a great day
