Minimal Rust version: 1.36. Nightly Rust version: from March 30, 2020.

robots_txt

robots_txt is a lightweight robots.txt parser and generator written in Rust.

Nothing extra.

Unstable

The implementation is a work in progress.

Installation

robots_txt is available on crates.io and can be included in your Cargo-enabled project like this:

Cargo.toml:

[dependencies]
robots_txt = "0.7"
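
The rendering example further below also derives the host from a Url, which comes from the separate url crate; if you follow that example, the dependency section would look roughly like this (the url version is an assumption, not taken from the crate's manifest):

[dependencies]
robots_txt = "0.7"
# needed only for the Url-based host example below; version is illustrative
url = "2"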

Parsing & matching paths against rules

use robots_txt::Robots;
// `SimpleMatcher` is used below; it is assumed to live in the crate's matcher module
// (the exact path may differ between versions).
use robots_txt::matcher::SimpleMatcher;

static ROBOTS: &'static str = r#"

# robots.txt for http://www.site.com
User-Agent: *
Disallow: /cyberworld/map/ # this is an infinite virtual URL space
# Cybermapper knows where to go
User-Agent: cybermapper
Disallow:

"#;

fn main() {
    let robots = Robots::from_str(ROBOTS);

    // An unknown bot falls under the `*` section: only /cyberworld/map/ is disallowed.
    let matcher = SimpleMatcher::new(&robots.choose_section("NoName Bot").rules);
    assert!(matcher.check_path("/some/page"));
    assert!(matcher.check_path("/cyberworld/welcome.html"));
    assert!(!matcher.check_path("/cyberworld/map/object.html"));

    // cybermapper has its own section with an empty Disallow, so every path is allowed.
    let matcher = SimpleMatcher::new(&robots.choose_section("Mozilla/5.0; CyberMapper v. 3.14").rules);
    assert!(matcher.check_path("/some/page"));
    assert!(matcher.check_path("/cyberworld/welcome.html"));
    assert!(matcher.check_path("/cyberworld/map/object.html"));
}
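
The two calls above compose naturally into a small gatekeeper. A minimal sketch, assuming the same SimpleMatcher import as above; the is_allowed helper is ours, not part of the crate:

use robots_txt::Robots;
use robots_txt::matcher::SimpleMatcher; // path assumed, see note above

// Hypothetical helper: does `robots_body` allow `user_agent` to fetch `path`?
fn is_allowed(robots_body: &str, user_agent: &str, path: &str) -> bool {
    let robots = Robots::from_str(robots_body);
    let matcher = SimpleMatcher::new(&robots.choose_section(user_agent).rules);
    matcher.check_path(path)
}

fn main() {
    let body = "User-agent: *\nDisallow: /private/\n";
    assert!(is_allowed(body, "NoName Bot", "/index.html"));
    assert!(!is_allowed(body, "NoName Bot", "/private/secret.html"));
}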

Building & rendering

main.rs:

extern crate robots_txt;
extern crate url;

use robots_txt::Robots;
// `Url` below comes from the separate `url` crate.
use url::Url;

fn main() {
    let robots1 = Robots::builder()
        .start_section("cybermapper")
            .disallow("")
            .end_section()
        .start_section("*")
            .disallow("/cyberworld/map/")
            .end_section()
        .build();

    let conf_base_url: Url = "https://example.com/".parse().expect("parse domain");
    let robots2 = Robots::builder()
        .host(conf_base_url.domain().expect("domain"))
        .start_section("*")
            .disallow("/private")
            .disallow("")
            .crawl_delay(4.5)
            .request_rate(9, 20)
            .sitemap("http://example.com/sitemap.xml".parse().unwrap())
            .end_section()
        .build();
        
    println!("# robots.txt for http://cyber.example.com/\n\n{}", robots1);
    println!("# robots.txt for http://example.com/\n\n{}", robots2);
}

As a result we get:

# robots.txt for http://cyber.example.com/

User-agent: cybermapper
Disallow:

User-agent: *
Disallow: /cyberworld/map/


# robots.txt for http://example.com/

User-agent: *
Disallow: /private
Disallow:
Crawl-delay: 4.5
Request-rate: 9/20
Sitemap: http://example.com/sitemap.xml

Host: example.com
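
Since the built Robots renders through Display (that is what the println! calls above rely on), the generated text can also be written straight to a file. A minimal sketch; the output path is just an example:

extern crate robots_txt;

use robots_txt::Robots;
use std::fs;

fn main() -> std::io::Result<()> {
    let robots = Robots::builder()
        .start_section("*")
            .disallow("/private")
            .end_section()
        .build();

    // Render via Display and write the result where the web server can serve it.
    fs::write("robots.txt", format!("{}", robots))?;
    Ok(())
}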

Alternatives

License

Licensed under either of

Apache License, Version 2.0 (LICENSE-APACHE)
MIT license (LICENSE-MIT)

at your option.

Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.
