# Creating custom crawlers with `advertools` - with one line of code

You often run repeated crawls with the same settings and options, depending on the type of crawl that you want to run.

This guide shows how to use Python's `partial` function from the `functools` module to set those options once and reuse them, with a single line of code.

TL; DR

Let's make an example:

In [1]:
from functools import partial
import advertools as adv

Let's say you want to create a crawler that runs in spider mode (following links), and stops after a set number of pages. This is what you typically do the first time, when you are just exploring a new website.

Let's call it `exploratory_crawler`.

The `partial` function allows us to take an existing function and set new default values for its parameters. In `adv.crawl`, `follow_links` defaults to `False`, and we want it to default to `True`.

Here's how we do it:

In [2]:
exploratory_crawler = partial(adv.crawl, follow_links=True)

That's it!
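To see what `partial` does in isolation, here is a minimal sketch that uses a hypothetical stand-in function (`fake_crawl`, not part of `advertools`) with a similar signature, so it runs without any network access. It also shows that a default set through `partial` can still be overridden when calling the new function.

```python
from functools import partial


# Hypothetical stand-in for adv.crawl: it only reports the arguments it received.
def fake_crawl(url_list, output_file, follow_links=False):
    return {
        'url_list': url_list,
        'output_file': output_file,
        'follow_links': follow_links,
    }


# Same function, but follow_links now defaults to True.
spider = partial(fake_crawl, follow_links=True)

default_call = fake_crawl('https://example.com', 'out.jl')
spider_call = spider('https://example.com', 'out.jl')
# A keyword passed at call time wins over the one stored in the partial.
override_call = spider('https://example.com', 'out.jl', follow_links=False)

print(default_call['follow_links'], spider_call['follow_links'], override_call['follow_links'])
```

The original function is untouched; `spider` is just a thin wrapper that fills in `follow_links=True` unless told otherwise.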

Now we have a new function `exploratory_crawler` (which is essentially `adv.crawl` with new defaults) that we can use normally:

In [5]:
exploratory_crawler(url_list='https://adver.tools', output_file='test_crawl.jl') # this will run with follow_links=True

In [6]:
import pandas as pd
crawldf = pd.read_json('test_crawl.jl', lines=True)
crawldf

Unnamed: 0,url,title,meta_desc,viewport,h1,h4,body_text,size,download_timeout,download_slot,...,request_headers_Accept,request_headers_Accept-Language,request_headers_User-Agent,request_headers_Accept-Encoding,redirect_times,redirect_ttl,redirect_urls,redirect_reasons,request_headers_Referer,h2
0,https://adver.tools,advertools,Get productive and get insights for your digit...,"width=device-width, initial-scale=1.0, maximum...",advertools: online marketing productivity and ...,@@@@@@@@@@@@@@@@@@,\n \n \n \...,5270,180,adver.tools,...,"text/html,application/xhtml+xml,application/xm...",en,advertools/0.14.2,"gzip, deflate, br",,,,,,
1,https://adver.tools/audience-manager/,"Audience Manager - Compare, Mix, & Match Audie...",,"width=device-width, initial-scale=1.0, maximum...","Compare, Merge, and Analyze Audience Lists",,\n \n \n \...,8897,180,adver.tools,...,"text/html,application/xhtml+xml,application/xm...",en,advertools/0.14.2,"gzip, deflate, br",1.0,19.0,https://adver.tools/audience-manager,308.0,https://adver.tools,
2,https://adver.tools/link-analysis/,Analyze internal and external links – advertools,,"width=device-width, initial-scale=1.0, maximum...",Internal Link Analysis Tool,,\n \n \n \...,10642,180,adver.tools,...,"text/html,application/xhtml+xml,application/xm...",en,advertools/0.14.2,"gzip, deflate, br",1.0,19.0,https://adver.tools/link-analysis,308.0,https://adver.tools,What is internal link analysis?@@How to analyz...
3,https://adver.tools/user-agent-parser/,User-agent parser - advertools,,"width=device-width, initial-scale=1.0, maximum...",Bulk User-agent Parser,,\n \n \n \...,8845,180,adver.tools,...,"text/html,application/xhtml+xml,application/xm...",en,advertools/0.14.2,"gzip, deflate, br",1.0,19.0,https://adver.tools/user-agent-parser,308.0,https://adver.tools,
4,https://adver.tools/xml-sitemaps/,Analyze XML Sitemaps – advertools,Download XML sitemaps using normal sitemap URL...,"width=device-width, initial-scale=1.0, maximum...","Download, parse, and analyze XML sitemaps",,\n \n \n \...,14996,180,adver.tools,...,"text/html,application/xhtml+xml,application/xm...",en,advertools/0.14.2,"gzip, deflate, br",1.0,19.0,https://adver.tools/xml-sitemaps,308.0,https://adver.tools,
5,https://adver.tools/content-similarity/,Content Similarity - advertools,,"width=device-width, initial-scale=1.0, maximum...",Content Similarity,,\n \n \n \...,8854,180,adver.tools,...,"text/html,application/xhtml+xml,application/xm...",en,advertools/0.14.2,"gzip, deflate, br",1.0,19.0,https://adver.tools/content-similarity,308.0,https://adver.tools,
6,https://adver.tools/urlytics/,Split and Analyze URL Structure – advertools,Analyze URL structure. This tool splits a list...,"width=device-width, initial-scale=1.0, maximum...",Analyze URL Structure,,\n \n \n \...,8978,180,adver.tools,...,"text/html,application/xhtml+xml,application/xm...",en,advertools/0.14.2,"gzip, deflate, br",1.0,19.0,https://adver.tools/urlytics,308.0,https://adver.tools,
7,https://adver.tools/entity-extraction/,Entity Extraction powered by OpenAI's ChatGPT ...,,"width=device-width, initial-scale=1.0, maximum...",Entity Extraction powered by OpenAI's ChatGPT,Example input:@@Example output,\n \n \n \...,11140,180,adver.tools,...,"text/html,application/xhtml+xml,application/xm...",en,advertools/0.14.2,"gzip, deflate, br",1.0,19.0,https://adver.tools/entity-extraction,308.0,https://adver.tools,What is entity extraction?@@The value of entit...
8,https://adver.tools/reverse-dns-lookup/,Bulk reverse DNS lookup - advertools,,"width=device-width, initial-scale=1.0, maximum...",Bulk Reverse DNS Lookup,,\n \n \n \...,8864,180,adver.tools,...,"text/html,application/xhtml+xml,application/xm...",en,advertools/0.14.2,"gzip, deflate, br",1.0,19.0,https://adver.tools/reverse-dns-lookup,308.0,https://adver.tools,
9,https://adver.tools/seo-crawler/,Website and SEO Crawler - advertools,,"width=device-width, initial-scale=1.0, maximum...",SEO Crawler,,\n \n \n \...,8768,180,adver.tools,...,"text/html,application/xhtml+xml,application/xm...",en,advertools/0.14.2,"gzip, deflate, br",1.0,19.0,https://adver.tools/seo-crawler,308.0,https://adver.tools,


Done!

Let's also set a default number of pages, after which we want the crawler to stop, because we are just exploring and don't want to wait for 50k pages to be crawled.

We use the same approach with an additional default option (feel free to change the value for `CLOSESPIDER_PAGECOUNT` to anything else):

In [14]:
exploratory_crawler = partial(
    adv.crawl,
    follow_links=True,
    custom_settings={
        'CLOSESPIDER_PAGECOUNT': 2000
    })

Now `exploratory_crawler` follows links and stops after crawling about 2,000 pages (requests that are already in progress when the limit is reached may still finish).

## More customization

Let's say we want to keep the defaults as they are, but we want to add a new option.

If we pass a new `custom_settings` value, it replaces the whole dictionary, including the defaults we set earlier, which we don't want. We simply want to update them.
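A minimal sketch of that pitfall, again with a hypothetical stand-in function (`fake_crawl`) rather than `adv.crawl`, so it can run anywhere. The stand-in simply returns the `custom_settings` it received.

```python
from functools import partial


# Hypothetical stand-in for adv.crawl: returns the settings it was called with.
def fake_crawl(url_list, output_file, follow_links=False, custom_settings=None):
    return custom_settings


crawler = partial(
    fake_crawl,
    follow_links=True,
    custom_settings={'CLOSESPIDER_PAGECOUNT': 2000},
)

# Passing custom_settings at call time replaces the stored dictionary entirely.
replaced = crawler(
    'https://example.com',
    'out.jl',
    custom_settings={'LOG_FILE': 'crawl.log'},
)
print(replaced)  # the page-count limit is no longer present
```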

We can explore the `partial` object by checking three attributes:

In [16]:
exploratory_crawler.func # this is the original function we are overriding

<function advertools.spider.crawl(url_list, output_file, follow_links=False, allowed_domains=None, exclude_url_params=None, include_url_params=None, exclude_url_regex=None, include_url_regex=None, css_selectors=None, xpath_selectors=None, custom_settings=None)>

In [17]:
exploratory_crawler.args # we didn't use positional arguments, so it's empty

()

In [18]:
exploratory_crawler.keywords # we used some keyword args, which we want to update

{'follow_links': True, 'custom_settings': {'CLOSESPIDER_PAGECOUNT': 2000}}

Let's say we want to write the crawl logs to a certain log file:

In [22]:
exploratory_crawler.keywords['custom_settings'].update({'LOG_FILE': 'path/to/your/logfile.log'})

Let's see if it was updated:

In [23]:
exploratory_crawler

functools.partial(<function crawl at 0x1200fcc20>, follow_links=True, custom_settings={'CLOSESPIDER_PAGECOUNT': 2000, 'LOG_FILE': 'path/to/your/logfile.log'})

There you go.
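The in-place update works, but it modifies the dictionary stored in the existing crawler. If you would rather leave the original untouched and get a new crawler back, one option is a small helper like the one sketched here. The helper name `with_settings` and the stand-in `fake_crawl` are hypothetical, used only for illustration.

```python
from functools import partial


# Hypothetical stand-in for adv.crawl, so the example runs offline.
def fake_crawl(url_list, output_file, follow_links=False, custom_settings=None):
    return custom_settings


def with_settings(crawler, **extra_settings):
    """Return a new partial with extra_settings merged into custom_settings."""
    current = crawler.keywords.get('custom_settings') or {}
    merged = {**current, **extra_settings}  # a new dict; the old one is not modified
    new_keywords = {**crawler.keywords, 'custom_settings': merged}
    return partial(crawler.func, *crawler.args, **new_keywords)


base = partial(
    fake_crawl,
    follow_links=True,
    custom_settings={'CLOSESPIDER_PAGECOUNT': 2000},
)
logged = with_settings(base, LOG_FILE='crawl.log')

print(base.keywords['custom_settings'])
print(logged.keywords['custom_settings'])
```

With this approach `base` keeps its original settings, and `logged` carries both the page limit and the log file.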

Here are several custom crawlers that you can create with a single statement each:

## Exploratory crawler: as explained above

In [None]:
exploratory_crawler = partial(
    adv.crawl,
    follow_links=True,
    custom_settings={
        'CLOSESPIDER_PAGECOUNT': 2000
    })

## Rude crawler
* Spider mode: on
* Does not respect robots.txt rules

In [None]:
rude_crawler = partial(
    adv.crawl,
    follow_links=True,
    custom_settings={
        'ROBOTSTXT_OBEY': False
    })

## Polite crawler
* Respects robots.txt rules (it's the default anyway)
* Autothrottling enabled (changes crawling speed dynamically)
* Targets an average concurrency of 1, so roughly one request at a time
* Uses a download delay of five seconds as the minimum wait between requests to the same site

In [None]:
polite_crawler = partial(
    adv.crawl,
    follow_links=True,
    custom_settings={
        'AUTOTHROTTLE_ENABLED': True,
        'AUTOTHROTTLE_TARGET_CONCURRENCY': 1,
        'DOWNLOAD_DELAY': 5
    })

## Greenlight crawler
* Runs with up to 48 concurrent requests per domain
* Assumes you have the authority to do so, otherwise you'll probably get blocked
* Doesn't obey robots.txt rules, so you can see what you end up crawling

In [None]:
# I wanted to call it DDOS crawler, but it sounded evil
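greenlight_crawler = partial(
    adv.crawl,
    follow_links=True,
    custom_settings={
        'ROBOTSTXT_OBEY': False,
        # Scrapy's global CONCURRENT_REQUESTS (default 16) caps total
        # concurrency, so raise it as well to actually reach 48 per domain
        'CONCURRENT_REQUESTS': 48,
        'CONCURRENT_REQUESTS_PER_DOMAIN': 48,
    })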

## My User-agent crawler
* Spider mode on
* Runs with a custom user agent (for cases where the default crawler user agent would be blocked and you have permission to crawl)

In [None]:
my_ua_crawler = partial(
    adv.crawl,
    follow_links=True,
    custom_settings={
        'USER_AGENT': 'MY CUSTOM USER-AGENT'
    })

## Shallow crawler
* Crawls, follows links, and goes no deeper than 2 links away from the start URLs (feel free to change 2 to any other number)

In [None]:
shallow_crawler = partial(
    adv.crawl,
    follow_links=True,
    custom_settings={
        'DEPTH_LIMIT': 2
    })

## No-params crawler
* Spider mode: on
* Follow links, but only links that **don't** have URL query parameters

In [None]:
no_params_crawler = partial(
    adv.crawl,
    follow_links=True,
    exclude_url_params=True
)

## Incremental crawler
* Reruns the same crawl every month/week/day/hour
* Doesn't crawl pages that have already been crawled
* Stops after X pages have been crawled on each run
* Saves crawl logs to a file so you can check them later (a good practice for every crawl)

In [None]:
incremental_crawler = partial(
    adv.crawl,
    follow_links=True,
    custom_settings={
        'JOBDIR': 'path/to/your/jobdir',  # <-- change this for every crawling job (website)
        'CLOSESPIDER_PAGECOUNT': 500,
        'LOG_FILE': 'path/to/your/log_file.log'
    })
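Because `JOBDIR` has to change for every website, one possible pattern is a small factory that builds the paths from a site name. This is only a sketch: `make_incremental_crawler`, its parameters, and the `crawls/<site>/` layout are assumptions made for illustration. The crawl function is passed in as an argument (you would pass `adv.crawl` in real use); a hypothetical stand-in is used here so the example runs offline.

```python
from functools import partial
from pathlib import Path


def make_incremental_crawler(crawl_func, site_name, base_dir='crawls', page_limit=500):
    """Build an incremental crawler whose job directory and log file are per site.

    Note: this only builds the paths, it does not create any directories.
    """
    site_dir = Path(base_dir) / site_name
    return partial(
        crawl_func,
        follow_links=True,
        custom_settings={
            'JOBDIR': str(site_dir / 'jobdir'),
            'CLOSESPIDER_PAGECOUNT': page_limit,
            'LOG_FILE': str(site_dir / 'crawl.log'),
        })


# Hypothetical stand-in for adv.crawl: returns the settings it was called with.
def fake_crawl(url_list, output_file, follow_links=False, custom_settings=None):
    return custom_settings


site_a_crawler = make_incremental_crawler(fake_crawl, 'site_a')
site_b_crawler = make_incremental_crawler(fake_crawl, 'site_b', page_limit=100)

print(site_a_crawler.keywords['custom_settings'])
print(site_b_crawler.keywords['custom_settings'])
```

Each website gets its own job directory and log file, so reruns for one site do not interfere with another.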