arbitraryrw/url-parsing

url-parsing

The purpose of this project is to explore an approach to handling relative URLs safely for redirects and forwards. Web security vulnerabilities that originate from unvalidated redirects and forwards are often remediated by restricting URLs, usually through an allow-list of known good absolute URLs. See OWASP Validating URLs or Google Open Redirect for examples of this. Unfortunately, not all applications can adopt an allow-listing approach because the absolute URL may not be known ahead of time. This causes friction, as a one-size-fits-all approach does not always work.

Introduction

URL parsing is difficult. A URL comprises many individual components, and how they interact with one another can be confusing; authority delegation is one example. Orange Tsai presented A New Era of SSRF at Black Hat USA 2017, highlighting some of the problems that can arise.

TLDR: Recommended Approach

Like any untrusted user input, relative URLs should be normalised, sanitised, and then validated, in that order. Normalisation and sanitisation should be done through established URL parsing libraries that follow the WHATWG standard, such as the Node URL package. The output of these operations should then be validated against a strict pattern that allows only the required characters. Dangerous characters such as @ and #, and repeated / characters, should not be on the allow-list.
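
A minimal sketch of that order of operations, assuming Node.js and its built-in WHATWG URL class. The placeholder origin https://app.invalid, the function name validateRelativeUrl, and the SAFE_PATH pattern are illustrative assumptions rather than part of this project; the pattern would need adjusting to the paths an application actually serves.

```javascript
const { URL } = require('url');

// Hypothetical placeholder origin, used only to resolve relative input.
const BASE_ORIGIN = 'https://app.invalid';

// Assumed allow-list: single slashes separating segments made of
// letters, digits, '.', '_' and '-'. No '@', '#', '\', '%' or '//'.
const SAFE_PATH = /^(?:\/|(?:\/[A-Za-z0-9._-]+)+\/?)$/;

function validateRelativeUrl(input) {
  if (typeof input !== 'string') return null;

  let parsed;
  try {
    // Normalise: the parser resolves dot segments and treats '\' as '/'.
    parsed = new URL(input, BASE_ORIGIN);
  } catch (err) {
    return null;
  }

  // Sanitise: reject anything that left the placeholder origin or
  // carries credentials. The query string and fragment are dropped.
  if (parsed.origin !== BASE_ORIGIN) return null;
  if (parsed.username !== '' || parsed.password !== '') return null;

  // Validate: only paths matching the strict pattern are accepted.
  return SAFE_PATH.test(parsed.pathname) ? parsed.pathname : null;
}
```

The function returns the normalised path on success and null otherwise, so a caller can fall back to a fixed default location instead of redirecting to rejected input.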

Where possible, handle absolute URLs instead to avoid introducing unnecessary complexity. OWASP Validating URLs is a great resource on such solutions.

Background

The syntax and semantics of a URI are intentionally broad to create an extensible means for identifying resources. This introduces ambiguity, as there are inconsistencies between URL parsers and the RFC2396 / RFC3986 specifications. WHATWG defined a contemporary standard based on these specifications. The following diagram shows how a URL string breaks down into the properties of a URL object.

┌─────────────────────────────────────────────────────────────────────────────────────────────┐
│                                            href                                             │
├──────────┬──┬─────────────────────┬─────────────────────┬───────────────────────────┬───────┤
│ protocol │  │        auth         │        host         │           path            │ hash  │
│          │  │                     ├──────────────┬──────┼──────────┬────────────────┤       │
│          │  │                     │   hostname   │ port │ pathname │     search     │       │
│          │  │                     │              │      │          ├─┬──────────────┤       │
│          │  │                     │              │      │          │ │    query     │       │
"  https:   //    user   :   pass   @ sub.host.com : 8080   /p/a/t/h  ?  query=string   #hash "
│          │  │          │          │   hostname   │ port │          │                │       │
│          │  │          │          ├──────────────┴──────┤          │                │       │
│ protocol │  │ username │ password │        host         │          │                │       │
├──────────┴──┼──────────┴──────────┼─────────────────────┤          │                │       │
│   origin    │                     │       origin        │ pathname │     search     │ hash  │
├─────────────┴─────────────────────┴─────────────────────┴──────────┴────────────────┴───────┤
│                                            href                                             │
└─────────────────────────────────────────────────────────────────────────────────────────────┘

At a code level, a URL can be parsed and accessed through a convenient object, as seen below:

const { URL } = require('url');

// Parse an absolute URL string into a WHATWG URL object.
const url = 'https://user:pass@sub.host.com:8080/p/a/t/h?query=string#hash';
const newUrl = new URL(url);
console.log(newUrl);

// Output:
URL {
  href: 'https://user:pass@sub.host.com:8080/p/a/t/h?query=string#hash',
  origin: 'https://sub.host.com:8080',
  protocol: 'https:',
  username: 'user',
  password: 'pass',
  host: 'sub.host.com:8080',
  hostname: 'sub.host.com',
  port: '8080',
  pathname: '/p/a/t/h',
  search: '?query=string',
  searchParams: URLSearchParams { 'query' => 'string' },
  hash: '#hash'
}

Authority

RFC3986 - Authority

3.2.  Authority

   Many URI schemes include a hierarchical element for a naming
   authority so that governance of the name space defined by the
   remainder of the URI is delegated to that authority (which may, in
   turn, delegate it further).  The generic syntax provides a common
   means for distinguishing an authority based on a registered name or
   server address, along with optional port and user information.

   The authority component is preceded by a double slash ("//") and is
   terminated by the next slash ("/"), question mark ("?"), or number
   sign ("#") character, or by the end of the URI.

      authority   = [ userinfo "@" ] host [ ":" port ]

   URI producers and normalizers should omit the ":" delimiter that
   separates host from port if the port component is empty.  Some
   schemes do not allow the userinfo and/or port subcomponents.

   If a URI contains an authority component, then the path component
   must either be empty or begin with a slash ("/") character.  Non-
   validating parsers (those that merely separate a URI reference into
   its major components) will often ignore the subcomponent structure of
   authority, treating it as an opaque string from the double-slash to
   the first terminating delimiter, until such time as the URI is
   dereferenced.
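
To see how a WHATWG parser splits the authority into the subcomponents named in the grammar above, the following sketch (assuming Node.js; the helper name describeAuthority is hypothetical) returns the userinfo, host, and port of an absolute URL. The port comes back empty when the URL uses the default port for its scheme.

```javascript
const { URL } = require('url');

// Hypothetical helper: returns the authority subcomponents of an absolute URL.
function describeAuthority(input) {
  const parsed = new URL(input);
  return {
    // userinfo is "username" or "username:password", empty when absent.
    userinfo: parsed.password
      ? `${parsed.username}:${parsed.password}`
      : parsed.username,
    host: parsed.hostname,
    // Empty string when the port is omitted or is the scheme default.
    port: parsed.port,
  };
}

console.log(describeAuthority('https://user:pass@sub.host.com:8080/p/a/t/h'));
console.log(describeAuthority('https://sub.host.com:443/'));
```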

Relative URLs

relative-part = "//" authority path-abempty
              / path-absolute
              / path-noscheme
              / path-empty
A relative reference that begins with two slash characters is termed
a network-path reference; such references are rarely used.  A
relative reference that begins with a single slash character is
termed an absolute-path reference.  A relative reference that does
not begin with a slash character is termed a relative-path reference.
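
The three kinds of relative reference can be told apart by their leading characters. The sketch below assumes Node.js and input that is already known to have no scheme; the function names classifyRelativeReference and resolveReference, and the base URL, are illustrative. It also shows why the network-path form matters for redirects: resolving it against a base replaces the host.

```javascript
const { URL } = require('url');

// Classify a scheme-less reference using the RFC3986 terminology.
function classifyRelativeReference(ref) {
  if (ref.startsWith('//')) return 'network-path';
  if (ref.startsWith('/')) return 'absolute-path';
  return 'relative-path';
}

// Resolve a reference against a base URL and return the absolute form.
function resolveReference(ref, base) {
  return new URL(ref, base).href;
}

// Hypothetical base, standing in for the page issuing the redirect.
const base = 'https://example.com/a/b';
for (const ref of ['//nikola.dev/x', '/x', 'x']) {
  console.log(classifyRelativeReference(ref), '->', resolveReference(ref, base));
}
```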

Canonicalization

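Following the WHATWG URL standard, percent-encoded bytes in a query parameter are percent-decoded when the parameter is read, so the encoded and unencoded forms of a character produce the same value.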

An example of this can be seen below:

node app.js
Server running at http://127.0.0.1:3000/

URL Requested
Raw url: /?nextUrl=/nikola.dev
Parsed nextUrl parameter: /nikola.dev

URL Requested
Raw url: /?nextUrl=%2Fnikola.dev
Parsed nextUrl parameter: /nikola.dev
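
The source of app.js is not reproduced here, so the following is only a sketch of how the parsed value above could be obtained, assuming Node.js and the WHATWG URL class. The helper name readNextUrl is hypothetical; the base address is the local server address from the output above.

```javascript
const { URL } = require('url');

// The raw request URL is relative, so a base is needed to parse it.
const LOCAL_BASE = 'http://127.0.0.1:3000';

// Hypothetical helper: returns the decoded nextUrl parameter, or null
// when the parameter is absent.
function readNextUrl(rawUrl) {
  return new URL(rawUrl, LOCAL_BASE).searchParams.get('nextUrl');
}

console.log(readNextUrl('/?nextUrl=/nikola.dev'));
console.log(readNextUrl('/?nextUrl=%2Fnikola.dev'));
```

Both calls return the same string, which is why any validation has to run on the decoded value rather than on the raw request URL.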

Dangerous Characters

Modern browsers automatically convert backslashes (\) into forward slashes (/), even though this is contrary to RFC3986 - URI Generic Syntax. In addition, the @ character can be used to define a target host, redirecting the victim to a new domain. RFC3986 describes this type of attack under Semantic Attacks.

The dangerous characters and encoded versions can be seen below:

127.0.0.1:3000?nextUrl=//nikola.dev
127.0.0.1:3000?nextUrl=/%2Fnikola.dev
127.0.0.1:3000?nextUrl=%2F%2Fnikola.dev
127.0.0.1:3000?nextUrl=\\nikola.dev
127.0.0.1:3000?nextUrl=\%5Cnikola.dev
127.0.0.1:3000?nextUrl=%5C%5Cnikola.dev

Interestingly, the \ and / characters (and their URL-encoded equivalents) can repeat and are interchangeable. The following is a valid payload:

http://127.0.0.1:3000/?nextUrl=/%5C/%5C/\%2F\/\%2F\/\%2F\/nikola.dev
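
A WHATWG parser mirrors this browser behaviour, which makes it possible to check where a decoded nextUrl value would actually lead. The sketch below assumes Node.js; the helper name resolvedHostname is hypothetical, and the sample values are decoded forms of payloads like those listed above.

```javascript
const { URL } = require('url');

const LOCAL_BASE = 'http://127.0.0.1:3000';

// Hypothetical helper: the hostname a decoded nextUrl value resolves to
// when interpreted relative to the local server.
function resolvedHostname(nextUrl) {
  return new URL(nextUrl, LOCAL_BASE).hostname;
}

const samples = [
  '/nikola.dev',          // single slash: stays on the local host
  '//nikola.dev',         // network-path reference
  '\\\\nikola.dev',       // two backslashes
  '/\\/\\/\\nikola.dev',  // mixed, repeated slashes and backslashes
];

for (const sample of samples) {
  console.log(JSON.stringify(sample), '->', resolvedHostname(sample));
}
```

Only the first sample resolves to 127.0.0.1; the other three resolve to nikola.dev.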

Depending on the underlying logic, attackers can also use this to bypass filters. For example, a check that nextUrl must contain example.com can be bypassed:

127.0.0.1:3000?nextUrl=//example.com%40nikola.dev
127.0.0.1:3000?nextUrl=//example.com@nikola.dev
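
To illustrate the difference, the sketch below (assuming Node.js; both function names are hypothetical) contrasts a substring filter with a check on the parsed hostname. In the payload above, example.com is parsed as the username and nikola.dev as the host.

```javascript
const { URL } = require('url');

const LOCAL_BASE = 'http://127.0.0.1:3000';

// Naive filter: passes as long as the expected name appears anywhere.
function naiveCheck(nextUrl) {
  return nextUrl.includes('example.com');
}

// Stricter check: parse first, then compare the resolved hostname and
// reject any userinfo.
function hostnameCheck(nextUrl) {
  try {
    const parsed = new URL(nextUrl, LOCAL_BASE);
    return (
      parsed.hostname === 'example.com' &&
      parsed.username === '' &&
      parsed.password === ''
    );
  } catch (err) {
    return false;
  }
}

const payload = '//example.com@nikola.dev';
console.log(naiveCheck(payload), hostnameCheck(payload));
```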

Basic Usage

Run the application locally using the following:

node app.js

Use Cases to Demonstrate

  1. Simple redirect
  2. Redirect with parameters, for example ?params=blah
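
Both use cases could be handled by one request handler along the following lines. This is a sketch under assumptions, not the project's app.js: the names redirectTarget and handleRequest, the two allow-list patterns, and the fallback location / are all illustrative.

```javascript
const { URL } = require('url');

const LOCAL_BASE = 'http://127.0.0.1:3000';

// Assumed allow-lists for the path and for an optional query string.
const SAFE_PATH = /^(?:\/|(?:\/[A-Za-z0-9._-]+)+\/?)$/;
const SAFE_SEARCH = /^(?:\?[A-Za-z0-9._=&-]+)?$/;

// Returns a same-origin path (plus query string) or null if rejected.
function redirectTarget(nextUrl) {
  let parsed;
  try {
    parsed = new URL(nextUrl, LOCAL_BASE);
  } catch (err) {
    return null;
  }
  if (parsed.origin !== LOCAL_BASE) return null;
  if (parsed.username !== '' || parsed.password !== '') return null;
  if (!SAFE_PATH.test(parsed.pathname)) return null;
  if (!SAFE_SEARCH.test(parsed.search)) return null;
  return parsed.pathname + parsed.search;
}

// Request handler: redirects to the validated nextUrl, or to '/'.
function handleRequest(req, res) {
  let target = null;
  try {
    const requested = new URL(req.url, LOCAL_BASE).searchParams.get('nextUrl');
    if (requested !== null) target = redirectTarget(requested);
  } catch (err) {
    target = null;
  }
  res.writeHead(302, { Location: target === null ? '/' : target });
  res.end();
}
```

The handler has the signature Node's http.createServer expects, so it could be wired up with http.createServer(handleRequest).listen(3000).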
