# Sanitizing Text

## Definition

Text sanitization is the process of cleaning input text so that it is safe, well-formed, and suitable for its intended use. The goals are to:
- prevent or mitigate *`potential security risks`* caused by malicious or improperly formatted input,
- prevent *`data corruption`*,
- improve *`consistency`*, and
- handle *`whitespace`* and *`control characters`*.


## Example 1

In [1]:
import re 

html_input = '<script>alert("Hello!");</script><p>This is some <b>HTML</b>.</p>'
# Strip anything that looks like a tag; the text inside <script> is NOT removed
sanitized_text = re.sub(r'<.*?>', '', html_input)
sanitized_text

'alert("Hello!");This is some HTML.'
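
The output above still contains the body of the `<script>` element, because the regular expression removes only the tags themselves. As a rough sketch of one alternative, the standard-library `html.parser.HTMLParser` can collect text nodes while skipping script and style content. The class name `TextExtractor` and the helper `strip_tags` are hypothetical names chosen for illustration, and this is not a complete sanitizer for untrusted HTML.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element
        if not self._skip_depth:
            self.parts.append(data)


def strip_tags(markup):
    """Return the visible text of `markup` (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)


html_input = '<script>alert("Hello!");</script><p>This is some <b>HTML</b>.</p>'
print(strip_tags(html_input))
```

Unlike the regular expression, this drops the script body as well as the tags. It only extracts text; it does not make arbitrary markup safe to render.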

## Example 2

In [2]:
# Manual sanitization can be error-prone. It is often better to use dedicated libraries.
import html

In [3]:
# Example input (potentially from user input)
user_input = '<script>alert("Hello!");</script>\nThis is a test.'

In [4]:
cleaned_text = html.escape(user_input, quote=True)
cleaned_text

'&lt;script&gt;alert(&quot;Hello!&quot;);&lt;/script&gt;\nThis is a test.'

In [5]:
cleaned_text = cleaned_text.replace('\n', '<br>')
cleaned_text

'&lt;script&gt;alert(&quot;Hello!&quot;);&lt;/script&gt;<br>This is a test.'

- Escaping converts the special characters in *user_input* to their HTML entity equivalents, so any tags or script elements in the input are displayed as plain text instead of being rendered or executed.

- Line breaks (newline characters) are then converted to HTML *`<br>`* tags so that they are preserved when the text is displayed in HTML.
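
The two steps above can be wrapped in a single helper. This is a minimal sketch: the name `to_display_html` is hypothetical, and the handling of `\r\n` and `\r` line endings is an added assumption that the original example does not cover.

```python
import html


def to_display_html(text):
    """Escape text for HTML display and turn line breaks into <br> tags."""
    # Escape first, so the <br> tags inserted below are not escaped themselves
    escaped = html.escape(text, quote=True)
    # Assumption: treat Windows (\r\n) and bare \r line endings like \n
    normalized = escaped.replace('\r\n', '\n').replace('\r', '\n')
    return normalized.replace('\n', '<br>')


user_input = '<script>alert("Hello!");</script>\nThis is a test.'
print(to_display_html(user_input))
```

The order matters: if the newlines were replaced before escaping, the inserted `<br>` tags would be escaped and shown as literal text.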

## Example of Some Tricky Sanitization Problems

In [6]:
# A tricky string
s = 'p\xfdt\u0125\xf6\xf1\x0cis\tawesome\r\n'
s

'pýtĥöñ\x0cis\tawesome\r\n'

### (a) Remapping whitespace

In [7]:
remap = {
    ord('\t'): ' ',
    ord('\f'): ' ',
    ord('\r'): None     # Deleted
}

In [8]:
a = s.translate(remap)
a

'pýtĥöñ is awesome\n'
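
The `remap` table covers only the three characters it lists. As a sketch of a more general approach, the function below classifies each character instead of enumerating them. The name `clean_controls` is hypothetical, and its policy is an assumption that mirrors the table above: keep `\n`, drop `\r`, turn other whitespace into a space, and drop the remaining control characters.

```python
import unicodedata


def clean_controls(text):
    """Normalize whitespace and remove control characters (hypothetical helper)."""
    cleaned = []
    for ch in text:
        if ch == '\n':
            cleaned.append(ch)       # keep line feeds as they are
        elif ch == '\r':
            continue                 # drop carriage returns, as in the remap table
        elif ch.isspace():
            cleaned.append(' ')      # tabs, form feeds, other whitespace -> space
        elif unicodedata.category(ch) == 'Cc':
            continue                 # drop remaining control characters
        else:
            cleaned.append(ch)
    return ''.join(cleaned)


s = 'p\xfdt\u0125\xf6\xf1\x0cis\tawesome\r\n'
print(repr(clean_controls(s)))
```

For the string `s` this gives the same result as `s.translate(remap)`. For large inputs, `str.translate` with a prebuilt table is likely faster than a character-by-character loop.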

### (b) Remove all combining characters/marks

In [9]:
import unicodedata
import sys 

In [10]:
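# Map every combining character to None so that translate() deletes it.
# range() excludes its end point, so add 1 to include sys.maxunicode itself.
cmb_chrs = dict.fromkeys(c for c in range(sys.maxunicode + 1)
                         if unicodedata.combining(chr(c)))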

In [11]:
# Decompose accented letters into base letter + combining mark, then drop the marks
b = unicodedata.normalize('NFD', a)
c = b.translate(cmb_chrs)
c

'python is awesome\n'
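
Building a table over every code point takes a noticeable amount of time up front. When only a few strings need cleaning, the same result can be obtained by filtering the decomposed string directly. This is a sketch, and `strip_accents` is a hypothetical helper name.

```python
import unicodedata


def strip_accents(text):
    """Remove combining marks after NFD decomposition (hypothetical helper)."""
    decomposed = unicodedata.normalize('NFD', text)
    # unicodedata.combining() returns 0 for characters that are not combining marks
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))


print(repr(strip_accents('p\xfdt\u0125\xf6\xf1 is awesome\n')))
```

The prebuilt translation table is the better choice when many strings are processed, since its cost is paid only once.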

### (c) Accent removal using ASCII encoding and decoding

In [12]:
# b is the NFD-normalized string from (b); 'ignore' drops every non-ASCII character
d = b.encode('ascii', 'ignore').decode('ascii')
d

'python is awesome\n'
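
One caveat of the encode/decode approach is that it discards every non-ASCII character, not only accents. Letters with no canonical decomposition, such as `ß`, vanish silently. The sketch below illustrates this. The helper name `to_ascii` and its `errors` parameter are hypothetical, and passing `'replace'` is shown only as one way to make the loss visible.

```python
import unicodedata


def to_ascii(text, errors='ignore'):
    """Decompose with NFD, then keep only ASCII characters (hypothetical helper)."""
    decomposed = unicodedata.normalize('NFD', text)
    return decomposed.encode('ascii', errors).decode('ascii')


print(to_ascii('p\xfdt\u0125\xf6\xf1'))              # accents are removed
print(to_ascii('Stra\xdfe'))                         # the letter ß is dropped entirely
print(to_ascii('Stra\xdfe', errors='replace'))       # ß becomes '?' instead
```

Note that with `errors='replace'` each combining mark is also replaced by `?`, so that setting is useful mainly for spotting characters that would otherwise be lost.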