Url Detector

The URL detector is a library created by the LinkedIn Security Team to detect and extract URLs from a long piece of text. This repository is a fork of the URL-Detector repository on GitHub.com.

It can detect URLs in forms such as:

HTML 5 Scheme - //www.linkedin.com
Usernames - user:pass@linkedin.com
Email - fred@linkedin.com
IPv4 Address - 192.168.1.1/hello.html
IPv4 Octets - 0x00.0x00.0x00.0x00
IPv4 Decimal - http://123123123123/
IPv6 Address - ftp://[::]/hello
IPv4-mapped IPv6 Address - http://[fe30:4:3:0:192.3.2.1]/
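
To make the less familiar IPv4 forms in the list concrete, here is a minimal sketch of how a hex-octet host or a single decimal number maps to the usual dotted notation. It is only an illustration of the arithmetic, not the library's detection code: the class `Ipv4FormsSketch`, both method names, and the sample value 3232235777 are assumptions chosen for this example.

```java
// Hypothetical helper class for illustration; not part of the URL detector library.
class Ipv4FormsSketch {

    // Converts a 32-bit value (the "IPv4 Decimal" form) to dotted-decimal notation.
    static String toDottedQuad(long value) {
        if (value < 0 || value > 0xFFFFFFFFL) {
            throw new IllegalArgumentException("not a 32-bit value: " + value);
        }
        return ((value >> 24) & 0xFF) + "." + ((value >> 16) & 0xFF) + "."
            + ((value >> 8) & 0xFF) + "." + (value & 0xFF);
    }

    // Normalizes a four-part host whose octets may be decimal or 0x-prefixed hex.
    // Integer.decode also reads a leading 0 as octal, so "010" becomes 8.
    static String normalizeOctets(String host) {
        String[] octets = host.split("\\.");
        if (octets.length != 4) {
            throw new IllegalArgumentException("expected four octets: " + host);
        }
        long value = 0;
        for (String octet : octets) {
            int n = Integer.decode(octet);
            if (n < 0 || n > 255) {
                throw new IllegalArgumentException("octet out of range: " + octet);
            }
            value = (value << 8) | n; // shift earlier octets up, append this one
        }
        return toDottedQuad(value);
    }

    public static void main(String[] args) {
        System.out.println(normalizeOctets("0x00.0x00.0x00.0x00")); // 0.0.0.0
        System.out.println(normalizeOctets("0xC0.0xA8.0x01.0x01")); // 192.168.1.1
        System.out.println(toDottedQuad(3232235777L));              // 192.168.1.1
    }
}
```

All three forms name an ordinary IPv4 address, which is why a detector aimed at browser behavior has to recognize them as hosts.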

Note: Keep in mind that for security purposes, it's better to overdetect URLs and check more of them against blacklists than to miss a URL that was submitted. As such, some things that we detect might not be URLs but merely look like URLs. Also, instead of complying with RFC 3986 (http://www.ietf.org/rfc/rfc3986.txt), we try to detect based on browser behavior, optimizing detection for URLs that are visitable through the address bar of Chrome, Firefox, Internet Explorer, and Safari.

It can also identify the parts of each detected URL. For example, the URL http://user@linkedin.com:39000/hello?boo=ff#frag is split into:

Scheme - "http"
Username - "user"
Password - null
Host - "linkedin.com"
Port - 39000
Path - "/hello"
Query - "?boo=ff"
Fragment - "#frag"
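
For comparison, the same breakdown can be sketched with the JDK's own java.net.URI class. This is not the library's Url class, and unlike the detector it only accepts strings that are already well-formed URIs; the class name `UriPartsSketch` and the method `parts` are assumptions made for this example. Note that java.net.URI returns the query and fragment without the leading "?" and "#" shown in the list above.

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper class for illustration; not part of the URL detector library.
class UriPartsSketch {

    // Splits a well-formed URL into named parts using the JDK's java.net.URI.
    static Map<String, String> parts(String text) {
        URI uri = URI.create(text); // throws IllegalArgumentException on bad syntax
        Map<String, String> parts = new LinkedHashMap<>();
        parts.put("Scheme", uri.getScheme());
        parts.put("UserInfo", uri.getUserInfo());         // "user" or "user:pass", null if absent
        parts.put("Host", uri.getHost());
        parts.put("Port", String.valueOf(uri.getPort())); // -1 when no port is given
        parts.put("Path", uri.getPath());
        parts.put("Query", uri.getQuery());               // no leading '?'
        parts.put("Fragment", uri.getFragment());         // no leading '#'
        return parts;
    }

    public static void main(String[] args) {
        // Prints one "Name - value" line per part, in insertion order.
        parts("http://user@linkedin.com:39000/hello?boo=ff#frag")
            .forEach((name, value) -> System.out.println(name + " - " + value));
    }
}
```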

How to Use:

Using the URL detector library is simple: construct a UrlDetector with your input text and a UrlDetectorOptions value, then call detect() to get a list of the URLs that were found.

For example, the following code will find the URL linkedin.com:

    UrlDetector parser = new UrlDetector("hello this is a url Linkedin.com", UrlDetectorOptions.Default);
    List<Url> found = parser.detect();

    for(Url url : found) {
        System.out.println("Scheme: " + url.getScheme());
        System.out.println("Host: " + url.getHost());
        System.out.println("Path: " + url.getPath());
    }

Quote Matching and HTML

Depending on your input string, you may want to handle certain characters in a special way. For example, if you are parsing HTML, you probably want characters such as quotes and angle brackets to be treated as URL boundaries. Suppose your input looks like:

<a href="http://linkedin.com/abc">linkedin.com</a>

You probably want the surrounding quotes and brackets kept out of the detected URLs. For that reason, UrlDetectorOptions lets you change the sensitivity level of detection based on your expected input type. This way you can detect linkedin.com instead of linkedin.com</a>.

In code this looks like:

    UrlDetector parser = new UrlDetector("<a href=\"linkedin.com/abc\">linkedin.com</a>", UrlDetectorOptions.HTML);
    List<Url> found = parser.detect();

About:

This library was written by the security team at LinkedIn when other options did not exist. Some of the primary authors are:

Vlad Shlosberg (vshlosbe@linkedin.com)
Tzu-Han Jan (tjan@linkedin.com)
Yulia Astakhova (jastakho@linkedin.com)

Third Party Dependencies

TestNG

http://testng.org/
License: Apache 2.0

Apache CommonsLang3: org.apache.commons:commons-lang3:3.1

http://commons.apache.org/proper/commons-lang/
License: Apache 2.0

Other active forks

It seems I'm not the only one who forked the URL-Detector repo. Here are some other active forks:

pgalbraith/URL-Detector

cosmincloud/URL-Detector

Folders and files

Latest commit

History

Repository files navigation

Url Detector

How to Use:

Quote Matching and HTML

About:

Third Party Dependencies

TestNG

Apache CommonsLang3: org.apache.commons:commons-lang3:3.1

Other active forks

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages