# Input Validation

This section provides clear, actionable best known methods (BKMs) for several aspects of input validation.

### The Purpose of Input Validation

- Ensures that only properly formed data enters the workflow.
- Prevents malformed data from persisting.

> Input validation should happen as early as possible in the data flow.

Input validation ensures that only properly formed data enters the workflow of an information system. It prevents malformed data from persisting in the database and causing downstream components to malfunction.

Input validation should happen as early as possible in the data flow, preferably as soon as the data is received from the external party.

Data from all potentially untrusted sources should be subject to input validation. This includes not only Internet-facing web clients but also backend feeds over extranets from suppliers, partners, vendors or regulators, any of which may itself be compromised and start sending malformed data.

### Whitelisting vs. Blacklisting

- White list validation defines what IS authorized and rejects everything else.
- It is a **common mistake** to use black list validation, which defines what is NOT authorized.
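
A minimal sketch of the difference, assuming a hypothetical username rule (3 to 16 lowercase letters, digits or underscores) and a hypothetical set of forbidden characters; neither comes from the text above:

```python
import re

# Hypothetical white list: the complete description of an acceptable username.
ALLOWED_USERNAME = re.compile(r"[a-z0-9_]{3,16}")

# Hypothetical black list: known-bad characters. Such a list is never complete.
FORBIDDEN_CHARS = set("<>'\";")


def is_valid_whitelist(username):
    """Accept only input that matches the allowed pattern in full."""
    return ALLOWED_USERNAME.fullmatch(username) is not None


def is_valid_blacklist(username):
    """Accept any input that contains no forbidden character."""
    return not any(ch in FORBIDDEN_CHARS for ch in username)


candidate = "admin`rm -rf`"
print(is_valid_blacklist(candidate))  # True: backticks were not on the black list
print(is_valid_whitelist(candidate))  # False: anything outside the pattern is rejected
```

The black list accepts the input because nobody anticipated backticks, while the white list rejects it without having to know why it is dangerous.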

## Implementing Input Validation

### Exceptions

- Make use of built-in exception classes when it makes sense: e.g. ValueError, IndexError, NotImplementedError ...

In [115]:
def connect_to_next_port(self, minimum):
    """Connects to the next available port.

    Args:
      minimum: A port value greater or equal to 1024.

    Returns:
      The new minimum port.

    Raises:
      ConnectionError: If no available port is found.
    """
    if minimum < 1024:
      # Note that this raising of ValueError is not mentioned in the doc
      # string's "Raises:" section because it is not appropriate to
      # guarantee this specific behavioral reaction to API misuse.
      raise ValueError(f'Min. port must be at least 1024, not {minimum}.')
    port = self._find_next_open_port(minimum)
    if not port:
      raise ConnectionError(
          f'Could not connect to service on port {minimum} or higher.')
    assert port >= minimum, (
        f'Unexpected port {port} when minimum was {minimum}.')
    return port
# credits https://google.github.io/styleguide/pyguide.html
 

- Never use catch-all `except:` statements, or catch `Exception` or `StandardError` (Python 2 only), unless you are
    - re-raising the exception, or
    - suppressing and recording the exception (e.g. to protect a thread or command-line app from crashing).

Python is very tolerant in this regard: a bare `except:` catches everything, including misspelled names, `sys.exit()` calls, Ctrl+C interrupts, unittest failures and all kinds of other exceptions that you simply don’t want to catch.
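
The following sketch illustrates the point with a simulated Ctrl+C. The function names are hypothetical and chosen only for this demonstration:

```python
def run_catch_all(action):
    """Anti-pattern: a bare except swallows every exception."""
    try:
        action()
    except:  # deliberately bad, for demonstration only
        return "swallowed"
    return "ok"


def run_specific(action):
    """Catches only the error this call site knows how to handle."""
    try:
        action()
    except ValueError:
        return "handled ValueError"
    return "ok"


def simulate_ctrl_c():
    # Stand-in for the user pressing Ctrl+C while the action runs.
    raise KeyboardInterrupt


print(run_catch_all(simulate_ctrl_c))  # swallowed

try:
    run_specific(simulate_ctrl_c)
except KeyboardInterrupt:
    print("interrupt propagated to the caller")
```

With the bare `except:` the interrupt is silently lost; with the specific handler it reaches the caller, which is usually what the user expects.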

In [116]:
import subprocess

# Return the width of the terminal, or None if it couldn't be
# determined (e.g. because we're not being run interactively).
def term_width(out):
  if not out.isatty():
    return None
  try:
    p = subprocess.Popen(["stty", "size"],
                         stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    (out, err) = p.communicate()
    if p.returncode != 0 or err:
      return None 
    return int(out.split()[1])
  except (IndexError, OSError, ValueError):
    return None
# credits gtest-parallel

- Minimize the amount of code in a try/except block.

- Use the finally clause (cleanup, closing a file, ...).

In [None]:
def peek(file_fd, num_bytes):
    """Peek num_bytes bytes from the file_fd file object without advancing it."""
    pos = file_fd.tell()
    try:
        return file_fd.read(num_bytes)
    finally:
        # Runs whether read() succeeds or raises, so the position is always restored.
        file_fd.seek(pos)
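
To illustrate the advice about keeping `try` blocks small, here is a minimal sketch that guards a single dictionary lookup and moves the rest into an `else` clause. The function name and the default port are assumptions made for this example:

```python
DEFAULT_PORT = 8080  # hypothetical default, for illustration only


def read_port(settings):
    """Return the configured port as an int, or the default if none is set."""
    try:
        raw = settings["port"]  # the only statement expected to raise KeyError
    except KeyError:
        return DEFAULT_PORT
    else:
        # Runs only when the lookup succeeded. A ValueError raised here is
        # not mistaken for a missing key; it propagates to the caller.
        return int(raw)


print(read_port({}))                # 8080
print(read_port({"port": "9000"}))  # 9000
```

If `int(raw)` were inside the `try` block together with a broader handler, a malformed value could be silently replaced by the default instead of being reported.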

### What is Wrong with assert()?

- The `assert` statement and `__debug__` blocks are **debug code**.
- Debug code is removed when Python runs with `python -O` or `PYTHONOPTIMIZE=1`.

Never use `assert` to validate inputs. `assert` is for testing internal logic correctness, not for enforcing correct usage or signalling that some unexpected event occurred.

Production Python apps often run with optimization enabled, which turns every `assert` statement into a no-op.

In [2]:
from pathlib import Path

# Resolve once so the containment check compares absolute paths.
LOGS_DIR = Path('./logs').resolve()

def get_log(name):
    """Resolve a log name to a path for download."""
    path = (LOGS_DIR / name).resolve()
    # BAD: this check disappears under `python -O`.
    assert LOGS_DIR in path.parents
    return path

# assert will catch the malicious path in debug mode
get_log('../../../../../etc/passwd')

AssertionError: 

In [4]:
# Now try calling the same code in optimized mode

!python -O ../src/assert.py

C:\etc\passwd
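
One possible fix, sketched under the assumption that raising `ValueError` is an acceptable way to report a rejected name: replace the `assert` with an explicit check that survives `python -O`. The name `get_log_checked` is hypothetical.

```python
from pathlib import Path

# Resolve once so the containment check compares absolute paths.
LOGS_DIR = Path("./logs").resolve()


def get_log_checked(name):
    """Resolve a log name to a path inside LOGS_DIR, or raise ValueError."""
    path = (LOGS_DIR / name).resolve()
    # An explicit check is regular code and is never optimized away.
    if LOGS_DIR not in path.parents:
        raise ValueError(f"Log name escapes the logs directory: {name!r}")
    return path


try:
    get_log_checked("../../../../../etc/passwd")
except ValueError as err:
    print(f"Rejected: {err}")
```

The same call is rejected both in debug mode and in optimized mode.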


### Type conversion

Use strict exception handling around type conversion (`int()`, `float()`):

In [5]:
while True:
  try:
    a = int(input("Please enter your age: "))
    break
  except ValueError:
    print("Oops!  That was no valid number.  Try again...")

Please enter your age: 12


Minimum and maximum value range check for numerical parameters and dates, minimum and maximum length check for strings:


In [6]:
def process_order(age):
  if not 18 <= age <= 120:
    print("You must be 18 years old")
    return
  print("Order processed")

process_order(4)

You must be 18 years old
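
The same idea applies to string lengths and dates. A minimal sketch, assuming hypothetical limits (a 200-character comment and a birth date no earlier than 1900-01-01) that are not taken from the text above:

```python
from datetime import date

MAX_COMMENT_LEN = 200              # hypothetical limit
EARLIEST_BIRTH = date(1900, 1, 1)  # hypothetical lower bound


def validate_comment(text):
    """Return text if its length is within bounds, else raise ValueError."""
    if not 1 <= len(text) <= MAX_COMMENT_LEN:
        raise ValueError(f"Comment length must be 1..{MAX_COMMENT_LEN}")
    return text


def validate_birth_date(value, today):
    """Return value if it lies between EARLIEST_BIRTH and today (inclusive)."""
    if not EARLIEST_BIRTH <= value <= today:
        raise ValueError(f"Birth date out of range: {value}")
    return value


print(validate_comment("Looks good"))
print(validate_birth_date(date(1990, 5, 17), today=date(2024, 1, 1)))
```

Passing `today` as a parameter keeps the check deterministic and easy to test.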


### Paths

Normalize paths and resolve symlinks:

In [15]:
from pathlib import Path

try:
  log_path = Path('logs/../logs/passwd').expanduser().resolve()
except (FileNotFoundError, RuntimeError):
  print("Path is invalid")
else:
  # Only print the path when it was resolved successfully.
  print(str(log_path))

C:\work\projects\python-security-tutorial\doc\logs\passwd


Verify root folder:

In [39]:
from pathlib import Path

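def is_in_home(path):
    """Checks if path is the user HOME folder or is located under it"""
    home = Path.home().resolve()
    resolved = path.expanduser().resolve(strict=True)
    # Compare path components, not string prefixes: a sibling folder such as
    # '/home/alice2' must not pass as being inside '/home/alice'.
    return resolved == home or home in resolved.parents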
        

is_in_home(Path.home() / '..')

False

### Regexp

For any other structured data, use regular expressions that cover the whole input string (`^...$`, or `re.fullmatch()`) and avoid "any character" wildcards (such as `.` or `\S`).


In [20]:
import re

def is_hex(value):
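    # fullmatch() requires the whole string to match; `$` alone would also
    # accept a trailing newline.
    return bool(re.fullmatch(r"[0-9A-Fa-f]+", value))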

is_hex('baadf00d')

True
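
One detail worth knowing when anchoring a pattern: in Python, `$` also matches just before a newline at the end of the string, so a pattern written as `^...$` accepts input with a trailing newline. The sketch below compares that form with `re.fullmatch()`; the function names are hypothetical.

```python
import re

HEX_ANCHORED = re.compile(r"^[0-9A-Fa-f]+$")
HEX_PATTERN = re.compile(r"[0-9A-Fa-f]+")


def is_hex_anchored(value):
    """Uses ^...$, which tolerates one trailing newline."""
    return bool(HEX_ANCHORED.match(value))


def is_hex_strict(value):
    """Uses fullmatch(), which requires every character to match."""
    return bool(HEX_PATTERN.fullmatch(value))


print(is_hex_anchored("baadf00d\n"))  # True
print(is_hex_strict("baadf00d\n"))    # False
print(is_hex_strict("baadf00d"))      # True
```

Whether a trailing newline matters depends on where the value ends up, but rejecting it keeps the check equal to what the pattern appears to say.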

## JSON Schema

### JSON Schema

JSON Schema is a vocabulary that allows you to annotate and validate JSON documents.

It also applies to YAML and other structured data.

In [29]:
from jsonschema import validate

# A sample schema, like what we'd get from json.load()
schema = {
    "type" : "object",
    "properties" : {
        "price" : {"type" : "number"},
        "name" : {"type" : "string"},
    },
}  

# If no exception is raised by validate(), the instance is valid.
validate(instance={"name" : "Eggs", "price" : 34.99}, schema=schema)

In [31]:
from jsonschema import ValidationError
try:
  validate(instance={"name" : "Eggs", "price" : "Invalid"}, schema=schema)
except ValidationError as err:
  print(f'Data is invalid: {err}')

Data is invalid: 'Invalid' is not of type 'number'

Failed validating 'type' in schema['properties']['price']:
    {'type': 'number'}

On instance['price']:
    'Invalid'
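
Note that `properties` only constrains keys that are present. The sketch below, which assumes the Draft 7 validator class from the `jsonschema` package, shows how `required` and `additionalProperties` tighten the sample schema; the variable names are chosen for this example only.

```python
from jsonschema import Draft7Validator

loose_schema = {
    "type": "object",
    "properties": {
        "price": {"type": "number"},
        "name": {"type": "string"},
    },
}

# Same schema, but both keys are mandatory and unknown keys are rejected.
strict_schema = {
    **loose_schema,
    "required": ["name", "price"],
    "additionalProperties": False,
}

loose = Draft7Validator(loose_schema)
strict = Draft7Validator(strict_schema)

print(loose.is_valid({}))                                # True
print(strict.is_valid({}))                               # False
print(strict.is_valid({"name": "Eggs", "price": 34.99})) # True
print(strict.is_valid({"name": "Eggs", "price": 34.99,
                       "discount": "100%"}))             # False
```

An empty object passes the loose schema, which is rarely what an input validator intends.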


### Common Mistakes: Arrays vs Tuple validation

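Two modes of array validation:
1. List validation: `items` is a single schema and each item must match it.
2. Tuple validation: `items` is an array of schemas and each position may have a different schema (JSON Schema Draft 2020-12 uses `prefixItems` for this form).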

In [41]:
from jsonschema import validate
# This schema has a mistake.
# With "items" given as an array, only the first array element is validated
# and the number of elements is not limited.
schema = {
    "type": "array",
    "items": [{"$ref": "#/definitions/good"}],
    
    "definitions": {
        "good": {
            "type" : "object",
            "properties" : {
                "price" : {"type" : "number"},
                "name" : {"type" : "string"},
            }
        }
    }
}

# Validation passes, which is unexpected
validate(instance=[{"name" : "Orange", "price" : 1},
                   {"name" : "Eggs", "price" : "2 rubbles"}], schema=schema)

In [42]:
# Let's fix the schema and try again
schema["items"] = {"$ref": "#/definitions/good"}
validate(instance=[{"name" : "Orange", "price" : 1},
                   {"name" : "Eggs", "price" : "2 rubbles"}], schema=schema)

ValidationError: '2 rubbles' is not of type 'number'

Failed validating 'type' in schema['items']['properties']['price']:
    {'type': 'number'}

On instance[1]['price']:
    '2 rubbles'
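
When tuple validation is actually intended, the number of elements should be bounded as well. A minimal sketch, assuming Draft 7 semantics (array-form `items` with `additionalItems` and `minItems`) and a hypothetical `[name, price]` pair format:

```python
from jsonschema import Draft7Validator

# Hypothetical format: exactly two positions, a name followed by a price.
pair_schema = {
    "type": "array",
    "items": [{"type": "string"}, {"type": "number"}],
    "additionalItems": False,  # reject anything after the second position
    "minItems": 2,             # reject arrays that omit a position
}

validator = Draft7Validator(pair_schema)

print(validator.is_valid(["Eggs", 34.99]))           # True
print(validator.is_valid(["Eggs"]))                  # False
print(validator.is_valid(["Eggs", 34.99, "extra"]))  # False
print(validator.is_valid(["Eggs", "2 rubbles"]))     # False
```

Pinning the validator class avoids depending on which draft the library selects when the schema has no `$schema` keyword.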

### Useful Links

- https://cheatsheetseries.owasp.org/cheatsheets/Input_Validation_Cheat_Sheet.html - OWASP Input Validation Cheat Sheet
- https://google.github.io/styleguide/pyguide.html - Google Python style guide