doover/cf_text_util


cf_text_util

"Simple" ColdFusion text cleaner/encoder that selectively allows some HTML elements.

This tool is English ASCII only.

Requirements

This was written for ColdFusion 2018/2021. It has not been tested for earlier versions or for compatibility with Lucee.

Problem

When a user enters data, we need to "clean" it to prevent problems ranging from XSS attacks to simply broken page formatting.

encodeForHTML() and canonicalize() both do a great job of cleaning user-entered text, but sometimes you want to allow some formatting options for the end user.

Path to a "solution"

CleanText initially came from a need to let users preserve line breaks in their entered text. It was expanded to allow bold and italics and, later, lists.

It was eventually expanded to support most of the basic editing options provided by the bare-bones CKEditor install. CKEditor includes a downloadable component that will do the cleaning on the client side, but we wanted to minimize the download and processing load on the client and to allow the cleaning to be done on the server side.

Other options

We experimented with Markdown, but it proved too confusing for our end users to enter and was a problem for our reporting engine (the engine understood HTML, but did not have a Markdown interpreter option).

Eventually we will probably implement a full AntiSamy suite, but for now this simplified tool does the work just fine.

Components

The text_util.cfm package has two main components:

stripWord(required string text)

MS Word loves replacing basic text with fancy Unicode characters. stripWord replaces many of these special characters with their basic text equivalents and then removes all remaining non-ASCII characters from the string.

stripWord() is used mostly by the cleanText() function, but it is provided as a separate call in case it is needed for other uses (we have been known to call encodeForHTML( stripWord( TEXT_FIELD ) ) ).

stripWord() specifically replaces the following codes:

          Unicode 8220 - #chr(8220)# - left double quote with "
          Unicode 8221 - #chr(8221)# - right double quote with "
          Unicode 8216 - #chr(8216)# - left single quote with '
          Unicode 8217 - #chr(8217)# - right single quote with '
          Unicode 8211 - #chr(8211)# - en dash with -
          Unicode 8212 - #chr(8212)# - em dash with -
          Unicode 8226 - #chr(8226)# - bullet with *
          Unicode 8230 - #chr(8230)# - ellipsis with ...
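
The replacement table above could be implemented roughly as follows. This is a minimal sketch, not the actual source of stripWord(): the function name stripWordSketch and the array-of-structs layout are made up for illustration, and the real function may handle more characters or do the work in a different order.

```cfml
<cfscript>
// Hypothetical sketch only; the real stripWord() in text_util.cfm may differ.
function stripWordSketch(required string text) {
    var result = arguments.text;

    // Word "smart" characters (decimal code points) and their plain replacements.
    var replacements = [
        { code: 8220, plain: '"' },   // left double quote
        { code: 8221, plain: '"' },   // right double quote
        { code: 8216, plain: "'" },   // left single quote
        { code: 8217, plain: "'" },   // right single quote
        { code: 8211, plain: "-" },   // en dash
        { code: 8212, plain: "-" },   // em dash
        { code: 8226, plain: "*" },   // bullet
        { code: 8230, plain: "..." }  // ellipsis
    ];

    for (var item in replacements) {
        result = replace(result, chr(item.code), item.plain, "all");
    }

    // Remove whatever is still outside the 7-bit ASCII range.
    return reReplace(result, "[^\x00-\x7F]", "", "all");
}
</cfscript>
```

The known characters are replaced before the final regular expression runs; otherwise the quotes and dashes would simply be deleted instead of converted.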

cleanText(required string text, numeric maxLength = 0, boolean links_ok = false)

Cleans and formats a string for display on the page.

cleanText first runs stripWord() to remove MS Word characters.

It then optionally trims the string to maxLength characters. This is a blind trim and it will cut off text.

It then runs the CF function encodeForHTML(string), which replaces HTML and other special characters with their escaped values.

After the encodeForHTML() call, the string will contain only screen-ready clean text and escaped special characters.

At this point, we go back through the string and turn some of the escaped HTML tags back into real HTML to give the user some formatting options.

It replaces escaped p, strong, em, u, s, sup, sub, blockquote, ol, ul, and li tags with their HTML equivalents. It does minimal checks to ensure that the tags are balanced, but it is not perfect.

If links_ok is set, replace escaped links.

If the string was trimmed earlier, append an ellipsis (…) to the end.
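
The steps above can be sketched as an "escape everything, then restore an allow-list" pipeline. The code below is an illustration under assumptions, not the actual cleanText() source: the name cleanTextSketch is hypothetical, the stripWord() call, the balance checks, and the links_ok handling are left out to keep it self-contained, and appending &hellip; as the ellipsis is a guess.

```cfml
<cfscript>
// Hypothetical sketch of the pipeline; not the real cleanText().
function cleanTextSketch(required string text, numeric maxLength = 0) {
    var result = arguments.text;
    var wasTrimmed = false;
    var allowedTags = ["p", "strong", "em", "u", "s", "sup", "sub",
                       "blockquote", "ol", "ul", "li"];

    // Optional blind trim: this can cut through the middle of a word or tag.
    if (arguments.maxLength > 0 && len(result) > arguments.maxLength) {
        result = left(result, arguments.maxLength);
        wasTrimmed = true;
    }

    // Escape everything, including any HTML the user typed.
    result = encodeForHTML(result);

    // Restore only the allow-listed tags. The escaped form is computed with
    // encodeForHTML() itself, so the sketch does not depend on the exact
    // entity style the encoder emits. Tags with attributes never match.
    for (var tag in allowedTags) {
        result = replaceNoCase(result, encodeForHTML("<#tag#>"), "<#tag#>", "all");
        result = replaceNoCase(result, encodeForHTML("</#tag#>"), "</#tag#>", "all");
    }

    // Mark trimmed text (entity choice is an assumption).
    if (wasTrimmed) {
        result &= "&hellip;";
    }
    return result;
}
</cfscript>
```

Because only exact, attribute-free tags are restored, something like a script tag or an onclick attribute stays escaped and is displayed as text rather than interpreted by the browser.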

Escaped Links

CleanText will work for two common types of links:

  • <a href="URL">text</a> - the link must start with the string a href= (that's what the search keys on).

  • http://bareURL - a bare URL (starting with http or https) will be converted to <a href="http://bareURL">bareURL</a>

Usage

Just do a <cfinclude template="text_util.cfm" /> in your page and then call cleanText on any value you want cleaned. Use it in place of encodeForHTML() or canonicalize().

   <cfinclude template="./text_util.cfm" />
   <cfoutput>
       #cleanText(VARIABLE_WITH_POTENTIALLY_BAD_TEXT)#
   </cfoutput>
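
The optional arguments from the cleanText() signature can be passed positionally. In this sketch the field names form.comment and form.title and the 200-character limit are placeholders chosen for illustration:

```cfml
<cfinclude template="./text_util.cfm" />
<cfoutput>
    <!--- Trim to 200 characters and allow links --->
    #cleanText(form.comment, 200, true)#

    <!--- No formatting allowed: strip Word characters, then escape everything --->
    #encodeForHTML(stripWord(form.title))#
</cfoutput>
```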

History

cleanText() (or a close variant of it) has been in use on our intranet site for over 10 years with no reported problems. It has also been used on a closed-access, public-facing Internet page.

Next Steps?

Well, I would love to be able to detect web and email addresses and make them clickable, but that is proving to be more trouble than I care for at this time.
