Bundle the old RegexSwap's regexes into the distribution #1767

Open
kaspersorensen opened this issue Nov 21, 2017 · 7 comments

Comments

@kaspersorensen
Member

Now that RegexSwap is no longer available, should we just put all those regexes into the application itself?

I've gone ahead and queried the regexes just to be able to preserve them for future use:

<?xml version="1.0" encoding="UTF-8"?>
<regexes>
	<regex>
		<name>Danish postal code</name>
		<expression>^(DK(-| )?)?[0-9]{4}$</expression>
		<description>This regex allows three different formats of Danish
			postal codes. Exemplified with the 2200 postal code:

			* DK-2200
			* DK2200
			* 2200
		</description>
		<positiveVotes>1</positiveVotes>
		<negativeVotes>0</negativeVotes>
		<categories>denmark,geographic,postal address</categories>
		<detailsUrl>https://datacleaner.org/ws/regex/Danish%20postal%20code
		</detailsUrl>
		<author>datacleaner.org</author>
		<timestamp>0</timestamp>
	</regex>
	<regex>
		<name>Danish phone number</name>
		<expression>^(\(?\Q+45\E\)?( )?)?[0-9]{8}$</expression>
		<description>This regex is useful for validating Danish (8 digit)
			telephone numbers that may or may not be prefixed by +45 or (+45) and
			an optional space.</description>
		<positiveVotes>1</positiveVotes>
		<negativeVotes>0</negativeVotes>
		<categories>denmark,geographic,phone</categories>
		<detailsUrl>https://datacleaner.org/ws/regex/Danish%20phone%20number
		</detailsUrl>
		<author>datacleaner.org</author>
		<timestamp>0</timestamp>
	</regex>
	<regex>
		<name>Website URL</name>
		<expression>^https?://[a-z0-9_-][\.[a-z0-9_-]]*\.(com|edu|org|net|int|info|eu|biz|mil|gov|aero|travel|pro|name|museum|coop|asia|[a-z][a-z])+(:[0-9]+)?[/[a-zA-Z0-9\._#-]]*/?$
		</expression>
		<description>Validates a HTTP or HTTPS url for a website. It works
			well for most uses, but there are a few corner-cases that it cannot
			handle.

			Known issues:
			* URL "GET" parameters are not supported.
			* Top-level domains are not literally validated.
		</description>
		<positiveVotes>0</positiveVotes>
		<negativeVotes>0</negativeVotes>
		<categories>internet</categories>
		<detailsUrl>https://datacleaner.org/ws/regex/Website%20URL
		</detailsUrl>
		<author>datacleaner.org</author>
		<timestamp>0</timestamp>
	</regex>
	<regex>
		<name>Typical username</name>
		<expression>[a-zA-Z0-9_]{3,16}</expression>
		<description>A regex for a typical username format. Here are the
			username requirements:

			* 3 to 16 characters
			* alphanumeric characters are accepted
			* underscore (_) is also accepted
		</description>
		<positiveVotes>0</positiveVotes>
		<negativeVotes>0</negativeVotes>
		<categories>internet,identity</categories>
		<detailsUrl>https://datacleaner.org/ws/regex/Typical%20username
		</detailsUrl>
		<author>datacleaner.org</author>
		<timestamp>0</timestamp>
	</regex>
	<regex>
		<name>Credit Card (JCB)</name>
		<expression>^(?:2131|1800|35\d{3})\d{11}$</expression>
		<description>JCB Credit Card number</description>
		<positiveVotes>0</positiveVotes>
		<negativeVotes>0</negativeVotes>
		<categories>creditcards,numbers</categories>
		<detailsUrl>https://datacleaner.org/ws/regex/Credit%20Card%20(JCB)
		</detailsUrl>
		<author>datacleaner.org</author>
		<timestamp>0</timestamp>
	</regex>
	<regex>
		<name>Credit Card (Diners Club)</name>
		<expression>^3(?:0[0-5]|[68][0-9])[0-9]{11}$</expression>
		<description>Diners Club Credit Card number.</description>
		<positiveVotes>1</positiveVotes>
		<negativeVotes>0</negativeVotes>
		<categories>creditcards</categories>
		<detailsUrl>https://datacleaner.org/ws/regex/Credit%20Card%20(Diners%20Club)
		</detailsUrl>
		<author>datacleaner.org</author>
		<timestamp>0</timestamp>
	</regex>
	<regex>
		<name>24 Hour Time (hh:mm)</name>
		<expression>^([0-1]?[0-9]|2[0-4]):([0-5][0-9])(:[0-5][0-9])?$
		</expression>
		<description>Validates a time field with 24 (0-23) hours.
		</description>
		<positiveVotes>1</positiveVotes>
		<negativeVotes>0</negativeVotes>
		<categories>geographic,time</categories>
		<detailsUrl>https://datacleaner.org/ws/regex/24%20Hour%20Time%20(hh:mm)
		</detailsUrl>
		<author>datacleaner.org</author>
		<timestamp>0</timestamp>
	</regex>
	<regex>
		<name>ISO date (yyyy-mm-dd)</name>
		<expression>^((((19|20)(([02468][048])|([13579][26]))-02-29))|((20[0-9][0-9])|(19[0-9][0-9]))-((((0[1-9])|(1[0-2]))-((0[1-9])|(1[[0-9]])|(2[0-8])))|((((0[13578])|(1[02]))-31)|(((0[1,3-9])|(1[0-2]))-(29|30)))))$
		</expression>
		<description>Defines a datemask composed of a 4-digit year, a 2-digit
			month and a 2-digit date.

			Leading zeros are required in single-digit months or dates.
		</description>
		<positiveVotes>1</positiveVotes>
		<negativeVotes>0</negativeVotes>
		<categories>time</categories>
		<detailsUrl>https://datacleaner.org/ws/regex/ISO%20date%20(yyyy-mm-dd)
		</detailsUrl>
		<author>datacleaner.org</author>
		<timestamp>0</timestamp>
	</regex>
	<regex>
		<name>Credit Card (Discover)</name>
		<expression>^6(?:011|5[0-9]{2})[0-9]{12}$</expression>
		<description>Discover Credit Card numbers.</description>
		<positiveVotes>0</positiveVotes>
		<negativeVotes>0</negativeVotes>
		<categories>creditcards</categories>
		<detailsUrl>https://datacleaner.org/ws/regex/Credit%20Card%20(Discover)
		</detailsUrl>
		<author>datacleaner.org</author>
		<timestamp>0</timestamp>
	</regex>
	<regex>
		<name>German postal code</name>
		<expression>^[0-9]{5}$</expression>
		<description>Quite simple - it's five digits.</description>
		<positiveVotes>0</positiveVotes>
		<negativeVotes>0</negativeVotes>
		<categories>geographic,germany</categories>
		<detailsUrl>https://datacleaner.org/ws/regex/German%20postal%20code
		</detailsUrl>
		<author>datacleaner.org</author>
		<timestamp>0</timestamp>
	</regex>
	<regex>
		<name>Integer or rounded decimal</name>
		<expression>^[-+]?[1-9][[0-9]]*\.?[0]*$</expression>
		<description>This regex matches all positive or negative numbers that
			have no decimals or only zeros in the decimal places. Examples:

			* 1
			* -1
			* 12.0
			* -512.000
		</description>
		<positiveVotes>1</positiveVotes>
		<negativeVotes>0</negativeVotes>
		<categories>numbers</categories>
		<detailsUrl>https://datacleaner.org/ws/regex/Integer%20or%20rounded%20decimal
		</detailsUrl>
		<author>datacleaner.org</author>
		<timestamp>0</timestamp>
	</regex>
	<regex>
		<name>UK Phone Number</name>
		<expression>^(\+44[[:space:]]?7[[:digit:]]{3}|\(?07[[:digit:]]{3}\)?)[[:space:]]?[[:digit:]]{3}[[:space:]]?[[:digit:]]{3}$
		</expression>
		<description>Phone numbers for United Kingdom.</description>
		<positiveVotes>0</positiveVotes>
		<negativeVotes>0</negativeVotes>
		<categories>geographic,phone,united kingdom</categories>
		<detailsUrl>https://datacleaner.org/ws/regex/UK%20Phone%20Number
		</detailsUrl>
		<author>datacleaner.org</author>
		<timestamp>0</timestamp>
	</regex>
	<regex>
		<name>FR Postal Code</name>
		<expression>^(0[1-9]|[1-9][0-9])[0-9]{3}$</expression>
		<description>Postal code for France / French cities.</description>
		<positiveVotes>0</positiveVotes>
		<negativeVotes>0</negativeVotes>
		<categories>france,geographic,postal address</categories>
		<detailsUrl>https://datacleaner.org/ws/regex/FR%20Postal%20Code
		</detailsUrl>
		<author>datacleaner.org</author>
		<timestamp>0</timestamp>
	</regex>
	<regex>
		<name>BE Postal Code</name>
		<expression>^(F-[0-9]{4,5}|B-[0-9]{4})$</expression>
		<description>Postal codes for Belgium / Belgian cities</description>
		<positiveVotes>0</positiveVotes>
		<negativeVotes>0</negativeVotes>
		<categories>belgium,geographic,postal address</categories>
		<detailsUrl>https://datacleaner.org/ws/regex/BE%20Postal%20Code
		</detailsUrl>
		<author>datacleaner.org</author>
		<timestamp>0</timestamp>
	</regex>
	<regex>
		<name>Credit Card (Visa)</name>
		<expression>^4[0-9]{12}(?:[0-9]{3})?$</expression>
		<description>Visa Credit Card number.</description>
		<positiveVotes>0</positiveVotes>
		<negativeVotes>0</negativeVotes>
		<categories>creditcards</categories>
		<detailsUrl>https://datacleaner.org/ws/regex/Credit%20Card%20(Visa)
		</detailsUrl>
		<author>datacleaner.org</author>
		<timestamp>0</timestamp>
	</regex>
	<regex>
		<name>Credit Card (any)</name>
		<expression>^(?:4[0-9]{12}(?:[0-9]{3})?|5[1-5][0-9]{14}|6(?:011|5[0-9][0-9])[0-9]{12}|3[47][0-9]{13}|3(?:0[0-5]|[68][0-9])[0-9]{11}|(?:2131|1800|35\d{3})\d{11})$
		</expression>
		<description>Any Credit Card number.</description>
		<positiveVotes>0</positiveVotes>
		<negativeVotes>0</negativeVotes>
		<categories>creditcards</categories>
		<detailsUrl>https://datacleaner.org/ws/regex/Credit%20Card%20(any)
		</detailsUrl>
		<author>datacleaner.org</author>
		<timestamp>0</timestamp>
	</regex>
	<regex>
		<name>Indian vehicle reg. number</name>
		<expression>^([A-Z|a-z]{2}\s{1}\d{2}\s{1}[A-Z|a-z]{1,2}\s{1}\d{1,4})?([A-Z|a-z]{3}\s{1}\d{1,4})?$
		</expression>
		<description>It validates Indian Vehicle Registration Number.
		</description>
		<positiveVotes>0</positiveVotes>
		<negativeVotes>0</negativeVotes>
		<categories>india,vehicles</categories>
		<detailsUrl>https://datacleaner.org/ws/regex/Indian%20vehicle%20reg.%20number
		</detailsUrl>
		<author>datacleaner.org</author>
		<timestamp>0</timestamp>
	</regex>
	<regex>
		<name>linux/unix network devices</name>
		<expression>^eth[0-9]+$</expression>
		<description>Very simple expression for validating ethX devices, such
			as eth0 or eth1.</description>
		<positiveVotes>0</positiveVotes>
		<negativeVotes>0</negativeVotes>
		<categories>computers</categories>
		<detailsUrl>https://datacleaner.org/ws/regex/linux%2Funix%20network%20devices
		</detailsUrl>
		<author>datacleaner.org</author>
		<timestamp>0</timestamp>
	</regex>
	<regex>
		<name>MAC address</name>
		<expression>^([0-9a-fA-F][0-9a-fA-F]:){5}([0-9a-fA-F][0-9a-fA-F])$
		</expression>
		<description>Matches a MAC address of a network device.</description>
		<positiveVotes>0</positiveVotes>
		<negativeVotes>0</negativeVotes>
		<categories>computers,internet</categories>
		<detailsUrl>https://datacleaner.org/ws/regex/MAC%20address
		</detailsUrl>
		<author>datacleaner.org</author>
		<timestamp>0</timestamp>
	</regex>
	<regex>
		<name>IPv4 address</name>
		<expression>\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b</expression>
		<description>Validates an IPv4 address</description>
		<positiveVotes>2</positiveVotes>
		<negativeVotes>0</negativeVotes>
		<categories>computers,internet,numbers</categories>
		<detailsUrl>https://datacleaner.org/ws/regex/IPv4%20address
		</detailsUrl>
		<author>datacleaner.org</author>
		<timestamp>0</timestamp>
	</regex>
	<regex>
		<name>Dutch Postal Code</name>
		<expression>^\s*?[0-9]{4}\s?[a-z|A-Z]{2}\s*?$</expression>
		<description>Postal code check for the Netherlands.
			Leading and trailing spaces are filtered.
		</description>
		<positiveVotes>0</positiveVotes>
		<negativeVotes>0</negativeVotes>
		<categories>postal address</categories>
		<detailsUrl>https://datacleaner.org/ws/regex/Dutch%20Postal%20Code
		</detailsUrl>
		<author>datacleaner.org</author>
		<timestamp>0</timestamp>
	</regex>
	<regex>
		<name>DISCO 88</name>
		<expression>\d{4}|110</expression>
		<description>Danish version of ISCO 88.</description>
		<positiveVotes>0</positiveVotes>
		<negativeVotes>0</negativeVotes>
		<categories>denmark</categories>
		<detailsUrl>https://datacleaner.org/ws/regex/DISCO%2088</detailsUrl>
		<author>datacleaner.org</author>
		<timestamp>0</timestamp>
	</regex>
	<regex>
		<name>DISCO AMS</name>
		<expression>\d{7}|110(001|101|102)</expression>
		<description>Sub version of DISCO 88 used by the Danish National
			Labour Market Authority</description>
		<positiveVotes>1</positiveVotes>
		<negativeVotes>0</negativeVotes>
		<categories>denmark</categories>
		<detailsUrl>https://datacleaner.org/ws/regex/DISCO%20AMS</detailsUrl>
		<author>datacleaner.org</author>
		<timestamp>0</timestamp>
	</regex>
	<regex>
		<name>Danish Car registrationnumber</name>
		<expression>^[a-z|A-Z]{2}[1-9]\d{4}$</expression>
		<description></description>
		<positiveVotes>0</positiveVotes>
		<negativeVotes>0</negativeVotes>
		<categories>denmark,identity,numbers,vehicles</categories>
		<detailsUrl>https://datacleaner.org/ws/regex/Danish%20Car%20registrationnumber
		</detailsUrl>
		<author>datacleaner.org</author>
		<timestamp>0</timestamp>
	</regex>
	<regex>
		<name>ITA Post Code</name>
		<expression>^[0-9]{5}$</expression>
		<description>Italian Postal Code</description>
		<positiveVotes>0</positiveVotes>
		<negativeVotes>0</negativeVotes>
		<categories>identity</categories>
		<detailsUrl>https://datacleaner.org/ws/regex/ITA%20Post%20Code
		</detailsUrl>
		<author>datacleaner.org</author>
		<timestamp>0</timestamp>
	</regex>
	<regex>
		<name>Java class name</name>
		<expression>^(([a-z])+.)+[A-Z]([A-Za-z])+$</expression>
		<description>Matches a fully qualified Java class name. Requires an
			Uppercase starting char, which is a commonly followed convention for
			Java classes.

			Matched example:

			* dk.eobjects.datacleaner.gui.DataCleanerGui

			Unmatched example:

			* dk.eobjects.datacleaner
		</description>
		<positiveVotes>3</positiveVotes>
		<negativeVotes>1</negativeVotes>
		<categories>computers,programming</categories>
		<detailsUrl>https://datacleaner.org/ws/regex/Java%20class%20name
		</detailsUrl>
		<author>datacleaner.org</author>
		<timestamp>0</timestamp>
	</regex>
	<regex>
		<name>Integer</name>
		<expression>^[-+]?[1-9][[0-9]]*$</expression>
		<description>This regex matches positive and negative integers and
			nothing else.</description>
		<positiveVotes>1</positiveVotes>
		<negativeVotes>1</negativeVotes>
		<categories>numbers</categories>
		<detailsUrl>https://datacleaner.org/ws/regex/Integer</detailsUrl>
		<author>datacleaner.org</author>
		<timestamp>0</timestamp>
	</regex>
	<regex>
		<name>Credit Card (MasterCard)</name>
		<expression>^5[1-5][0-9]{14}$</expression>
		<description>MasterCard Credit Card number.</description>
		<positiveVotes>1</positiveVotes>
		<negativeVotes>0</negativeVotes>
		<categories>creditcards</categories>
		<detailsUrl>https://datacleaner.org/ws/regex/Credit%20Card%20(MasterCard)
		</detailsUrl>
		<author>datacleaner.org</author>
		<timestamp>0</timestamp>
	</regex>
	<regex>
		<name>Month number (01-12)</name>
		<expression>^((0[1-9])|(1[0-2]))$</expression>
		<description>Defines a number between (0)1 and 12. Useful for
			composition of regular expressions with other expressions.
		</description>
		<positiveVotes>1</positiveVotes>
		<negativeVotes>0</negativeVotes>
		<categories>time,partials</categories>
		<detailsUrl>https://datacleaner.org/ws/regex/Month%20number%20(01-12)
		</detailsUrl>
		<author>datacleaner.org</author>
		<timestamp>0</timestamp>
	</regex>
	<regex>
		<name>Credit Card (American Express)</name>
		<expression>^3[47][0-9]{13}$</expression>
		<description>American Express Credit Card number.</description>
		<positiveVotes>1</positiveVotes>
		<negativeVotes>0</negativeVotes>
		<categories>creditcards</categories>
		<detailsUrl>https://datacleaner.org/ws/regex/Credit%20Card%20(American%20Express)
		</detailsUrl>
		<author>datacleaner.org</author>
		<timestamp>0</timestamp>
	</regex>
	<regex>
		<name>US Social Security Number</name>
		<expression>^([[:digit:]]{3}[ -][[:digit:]]{2}[ -][[:digit:]]{4}|[[:digit:]]{9})$</expression>
		<description>US / American social security number regex.</description>
		<positiveVotes>1</positiveVotes>
		<negativeVotes>0</negativeVotes>
		<categories>geographic,social security,united states</categories>
		<detailsUrl>https://datacleaner.org/ws/regex/US%20Social%20Security%20Number
		</detailsUrl>
		<author>datacleaner.org</author>
		<timestamp>0</timestamp>
	</regex>
	<regex>
		<name>Week day</name>
		<expression>^(Sun|Mon|(T(ues|hurs))|Fri)(day|\.)?$|Wed(\.|nesday)?$|Sat(\.|urday)?$|T((ue?)|(hu?r?))\.?$
		</expression>
		<description>Defines an expression for matching weekdays. This
			expression validates both the short and long version of weekdays, but
			requires an upper-case first character, such as:

			* Mon
			* Tuesday
			* Thursday
			* Sat
			* Saturday

			Hope it's useful, if not for anything else, then for composition with
			other expressions.
		</description>
		<positiveVotes>2</positiveVotes>
		<negativeVotes>0</negativeVotes>
		<categories>time,partials</categories>
		<detailsUrl>https://datacleaner.org/ws/regex/Week%20day</detailsUrl>
		<author>datacleaner.org</author>
		<timestamp>0</timestamp>
	</regex>
	<regex>
		<name>DE Postal Code</name>
		<expression>^(D\-)?[0-9]{5}$</expression>
		<description>Postal codes for Germany / German cities.</description>
		<positiveVotes>0</positiveVotes>
		<negativeVotes>0</negativeVotes>
		<categories>geographic,germany,postal address</categories>
		<detailsUrl>https://datacleaner.org/ws/regex/DE%20Postal%20Code
		</detailsUrl>
		<author>datacleaner.org</author>
		<timestamp>0</timestamp>
	</regex>
	<regex>
		<name>Email address</name>
		<expression>[a-zA-Z0-9._%+-]*@[a-zA-Z0-9._%+-]*\.[a-zA-Z]{2,4}
		</expression>
		<description>This is definitely not the final word on an email address
			regular expression, but it's useful and definitely narrows down the
			options. There are, however, some known issues, such as no literal
			top-level domain check.</description>
		<positiveVotes>2</positiveVotes>
		<negativeVotes>0</negativeVotes>
		<categories>identity,internet</categories>
		<detailsUrl>https://datacleaner.org/ws/regex/Email%20address
		</detailsUrl>
		<author>datacleaner.org</author>
		<timestamp>0</timestamp>
	</regex>
	<regex>
		<name>MD5 hash</name>
		<expression>^([a-f0-9]{32})$</expression>
		<description>Matches a 32-digit MD5 hash. Useful for validating hashed
			database-columns.</description>
		<positiveVotes>2</positiveVotes>
		<negativeVotes>0</negativeVotes>
		<categories>computers,identity,programming</categories>
		<detailsUrl>https://datacleaner.org/ws/regex/MD5%20hash</detailsUrl>
		<author>datacleaner.org</author>
		<timestamp>0</timestamp>
	</regex>
	<regex>
		<name>ID CARD of China</name>
		<expression>\d{15}|\d{18}</expression>
		<description>identity card of China
		</description>
		<positiveVotes>0</positiveVotes>
		<negativeVotes>0</negativeVotes>
		<categories>identity</categories>
		<detailsUrl>https://datacleaner.org/ws/regex/ID%20CARD%20of%20China
		</detailsUrl>
		<author>datacleaner.org</author>
		<timestamp>0</timestamp>
	</regex>
	<regex>
		<name>Danish CPR number</name>
		<expression>^(0[1-9]|[12]\d|3[01])((0[1-9])|(1[0-2]))[0-9]{2}(\Q-\E)?[0-9]{4}$
		</expression>
		<description>Definitely not the most advanced and accurate Danish CPR
			(social security) number validation regex, but it's pretty useful and
			easy to figure out.</description>
		<positiveVotes>0</positiveVotes>
		<negativeVotes>0</negativeVotes>
		<categories>denmark,geographic,social security</categories>
		<detailsUrl>https://datacleaner.org/ws/regex/Danish%20CPR%20number
		</detailsUrl>
		<author>datacleaner.org</author>
		<timestamp>0</timestamp>
	</regex>
	<regex>
		<name>Day number (01-31)</name>
		<expression>^(((0)[1-9])|((1|2)[0-9])|(3[0-1]))$</expression>
		<description>Defines a day number between 01 and 31. Useful for
			composition with other regexes for custom date masks and the like.
		</description>
		<positiveVotes>0</positiveVotes>
		<negativeVotes>0</negativeVotes>
		<categories>partials,time</categories>
		<detailsUrl>https://datacleaner.org/ws/regex/Day%20number%20(01-31)
		</detailsUrl>
		<author>datacleaner.org</author>
		<timestamp>0</timestamp>
	</regex>
	<regex>
		<name>italian fixed phone numbers</name>
		<expression>(\+|00)?+\s*+(39|\(39\))+\s*+0+[1-9]{1}+[0-9]{0,1,2}+\s*+[0-9]{6,10}
		</expression>
		<description>geographic phone numbers</description>
		<positiveVotes>0</positiveVotes>
		<negativeVotes>0</negativeVotes>
		<categories>phone</categories>
		<detailsUrl>https://datacleaner.org/ws/regex/italian%20fixed%20phone%20numbers
		</detailsUrl>
		<author>datacleaner.org</author>
		<timestamp>0</timestamp>
	</regex>
	<regex>
		<name>Italy Cell Phone</name>
		<expression>^((\+|00)?+(\s)*+(39| \(39\))?+(/s)*+((38[3,8,9,0])|(39[1-3])|(34[0,3,7-9])|(36[0,3,6,8])|(33[0, 1, 3-9])|(32[0,4,7-9]))(/s)*([0-9]{6,7}))?$</expression>
		<description>for client numbers only</description>
		<positiveVotes>0</positiveVotes>
		<negativeVotes>0</negativeVotes>
		<categories>phone</categories>
		<detailsUrl>https://datacleaner.org/ws/regex/Italy%20Cell%20Phone
		</detailsUrl>
		<author>datacleaner.org</author>
		<timestamp>0</timestamp>
	</regex>
	<regex>
		<name>Duplicate works (Ignores content between &lt;style&gt; and
			&lt;script&gt; tags)</name>
		<expression>\b((?!(((?!&lt;(\w|/|\?)).)+?((?&lt;!\?)&gt;|&lt;/script&gt;|&lt;/style&gt;)))[a-zA-Z]{2,})\s+\1\b
		</expression>
		<description>\b # start at a word boundary;
			( # group 1 start
			(?!(
			((?!&lt; # -negative lookahead to validate tag starts only when it
			(\w|/|\?) # -is followed by {any 1 word character or
			).)+? # -/(to match closing html tags) or ?(to match &lt;?php..)}

			((?&lt;!\?)&gt;| # Validate if text does not belong to a
			&lt;script&gt;, &lt;style&gt; tag
			&lt;/script&gt;| # this is needed to ignore invalid matches
			&lt;/style&gt;) # such as 0px 0px or same css class names within a .tpl file
			))
			[a-zA-Z]{2,} # consider words that are &gt;= 2 characters
			) # group 1 end
			\s+ # words could be separated by 1 or more spaces
			\1 # compare current word with group 1
			\b # the word should end at a word boundary
		</description>
		<positiveVotes>0</positiveVotes>
		<negativeVotes>0</negativeVotes>
		<categories>programming</categories>
		<detailsUrl>https://datacleaner.org/ws/regex/Duplicate%20works%20(Ignores%20content%20between%20%3Cstyle%3E%20and%20%3Cscript%3E%20tags)
		</detailsUrl>
		<author>datacleaner.org</author>
		<timestamp>0</timestamp>
	</regex>
	<regex>
		<name>Find duplicate words (ignoring content between script and style
			tags)</name>
		<expression>\b((?!(((?!&lt;(\w|/|\?)).)+?((?&lt;!\?)&gt;|&lt;/script&gt;|&lt;/style&gt;)))[a-zA-Z]{2,})\s+\1\b
		</expression>
		<description>\b # start at a word boundary;
			( # group 1 start
			(?!(
			((?!&lt; # -negative lookahead to validate tag starts only when it
			(\w|/|\?) # -is followed by {any 1 word character or
			).)+? # -/(to match closing html tags) or ?(to match &lt;?php..)}

			((?&lt;!\?)&gt;| # Validate if text does not belong to a
			&lt;script&gt;, &lt;style&gt; tag
			&lt;/script&gt;| # this is needed to ignore invalid matches
			&lt;/style&gt;) # such as 0px 0px or same css class names within a .tpl file
			))
			[a-zA-Z]{2,} # consider words that are &gt;= 2 characters
			) # group 1 end
			\s+ # words could be separated by 1 or more spaces
			\1 # compare current word with group 1
			\b # the word should end at a word boundary
		</description>
		<positiveVotes>1</positiveVotes>
		<negativeVotes>0</negativeVotes>
		<categories>programming</categories>
		<detailsUrl>https://datacleaner.org/ws/regex/Find%20duplicate%20words%20(ignoring%20content%20between%20script%20and%20style%20tags)
		</detailsUrl>
		<author>datacleaner.org</author>
		<timestamp>0</timestamp>
	</regex>
</regexes>
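
Before bundling these, it may be worth checking which expressions the target regex engine accepts at all. The following is a minimal sketch, assuming `java.util.regex.Pattern` is the engine; the class and method names (`RegexCompileCheck`, `findUncompilable`) are made up for illustration and are not part of DataCleaner. As far as I can tell, a quantifier like `{0,1,2}` (used in "italian fixed phone numbers") is rejected by that engine.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical helper, not part of DataCleaner.
class RegexCompileCheck {

    // Returns the expressions that java.util.regex.Pattern refuses to compile.
    static List<String> findUncompilable(List<String> expressions) {
        final List<String> failures = new ArrayList<>();
        for (String expression : expressions) {
            try {
                // trim() drops the line breaks the XML formatter left around some expressions
                Pattern.compile(expression.trim());
            } catch (PatternSyntaxException e) {
                failures.add(expression);
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        // Two expressions copied from the dump above
        final List<String> sample = Arrays.asList(
                "^(DK(-| )?)?[0-9]{4}$",
                "(\\+|00)?+\\s*+(39|\\(39\\))+\\s*+0+[1-9]{1}+[0-9]{0,1,2}+\\s*+[0-9]{6,10}");
        for (String bad : findUncompilable(sample)) {
            System.out.println("Does not compile: " + bad);
        }
    }
}
```

A pattern that compiles is not necessarily correct: this only catches syntax errors, not expressions that match something other than what their description claims.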
@kaspersorensen
Member Author

I made a little tool to spit out some tentative output that would fit into the conf.xml format. Not great code, but works as a one-off:

import java.io.File;
import java.io.FileInputStream;

import org.datacleaner.util.StringUtils;
import org.datacleaner.util.xml.XmlUtils;
import org.junit.Test;
import org.springframework.util.xml.DomUtils;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class RegexSwapDataExtractor {

    @Test
    public void testExtractRegexPatterns() throws Exception {
        final File regexSwapDump = new File("src/test/resources/old-regexswap-patterns.xml");
        final Document doc = XmlUtils.parseDocument(new FileInputStream(regexSwapDump));
        final NodeList regexNodes = doc.getElementsByTagName("regex");
        for (int i = 0; i < regexNodes.getLength(); i++) {
            final Node regexNode = regexNodes.item(i);
            final String str = toConfXmlRegexPattern((Element) regexNode);
            System.out.print(str);
        }
    }

    private String toConfXmlRegexPattern(Element regexNode) {
        // <regex-pattern name="Website URL" description="Matches a HTTP or HTTPS based URL for a website. Does not
        // handle HTTP query parameters.">
        // <expression>^https?://[a-z0-9_-][\.[a-z0-9_-]]*\.(com|edu|org|net|int|info|eu|biz|mil|gov|aero|travel|pro|name|museum|coop|asia|[a-z][a-z])+(:[0-9]+)?[/[a-zA-Z0-9\._#-]]*/?$</expression>
        // </regex-pattern>

        String name = DomUtils.getChildElementByTagName(regexNode, "name").getTextContent();
        name = StringUtils.replaceAll(StringUtils.replaceWhitespaces(name, " "), "  ", " ");
        name = trim(name);

        String expression = DomUtils.getChildElementByTagName(regexNode, "expression").getTextContent();
        expression = trim(expression);

        String description = DomUtils.getChildElementByTagName(regexNode, "description").getTextContent();
        description = trim(description);
        if (description.indexOf('\n') != -1) {
            description = null;
        }

        return (description == null ? "\n<regex-pattern name=\"" + name + "\">"
                : "\n<regex-pattern name=\"" + name + "\" description=\"" + description + "\">") + "\n\t<expression>" + expression + "</expression>"
                + "\n</regex-pattern>";
    }

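    // Trims the text and escapes XML special characters so the value is safe
    // both as element content and inside a double-quoted attribute.
    private String trim(String str) {
        str = str.trim();
        // '&' has to be escaped first so the entities added afterwards stay intact
        str = str.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;").replace("\"", "&quot;");
        return str;
    }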
}
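
For anyone who wants to rerun the conversion without the DataCleaner and Spring helpers, here is a rough JDK-only sketch of the same idea. It is an illustration under assumptions, not the tool above: the class name `ConfXmlSketch` and its methods are invented, and the `<regex-pattern>` output shape is taken only from the code comment in `toConfXmlRegexPattern`.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical JDK-only variant of the extractor.
class ConfXmlSketch {

    // Trims and escapes characters that are unsafe in XML text and double-quoted attributes.
    static String escape(String s) {
        return s.trim().replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    // Text of the first descendant element with the given tag name, or "" if there is none.
    static String childText(Element parent, String tag) {
        final NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? "" : nodes.item(0).getTextContent();
    }

    // Renders one <regex> element of the dump as a <regex-pattern> element.
    static String toRegexPattern(Element regex) {
        // Collapse any run of whitespace in the name into a single space
        final String name = escape(childText(regex, "name").replaceAll("\\s+", " "));
        final String expression = escape(childText(regex, "expression"));
        final String description = escape(childText(regex, "description"));
        final StringBuilder sb = new StringBuilder("<regex-pattern name=\"").append(name).append('"');
        // Multi-line descriptions are dropped rather than flattened; empty ones are omitted too.
        if (!description.isEmpty() && description.indexOf('\n') == -1) {
            sb.append(" description=\"").append(description).append('"');
        }
        return sb.append(">\n\t<expression>").append(expression)
                .append("</expression>\n</regex-pattern>\n").toString();
    }

    // Converts a whole dump (as a string) into a sequence of <regex-pattern> elements.
    static String convert(String dumpXml) throws Exception {
        final Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(dumpXml.getBytes(StandardCharsets.UTF_8)));
        final NodeList regexes = doc.getElementsByTagName("regex");
        final StringBuilder out = new StringBuilder();
        for (int i = 0; i < regexes.getLength(); i++) {
            out.append(toRegexPattern((Element) regexes.item(i)));
        }
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        final String dump = "<regexes><regex><name>German postal code</name>"
                + "<expression>^[0-9]{5}$</expression>"
                + "<description>Quite simple - it's five digits.</description></regex></regexes>";
        System.out.print(convert(dump));
    }
}
```

One deliberate difference: an empty description is left out instead of being written as `description=""`.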

@LosD
Contributor

LosD commented Dec 10, 2017

I'm not sure it really matters, since no one ever really contributed patterns, but wouldn't RegexSwap be really easy to just dump in a GitHub Page? That way, it's also easy to contribute a pattern, just make a PR against the RegexSwap GH page repo.

(of course, dynamic things like voting would not survive, but I'm not sure that's really a big loss. There's the possibility of discussing them in the repo's issues list, and improving them through a PR. This seems better than a simple voting system)

@kaspersorensen
Member Author

Good point. We could even put it up on https://datacleaner.github.io somewhere, just like the new version endpoint that it has (https://datacleaner.github.io/meta/versions.json) which I intended for something similar (update notifications).

@LosD
Contributor

LosD commented Dec 11, 2017

Cool. Both separate and combined make perfect sense, so let's go with whatever you prefer :)

@kaspersorensen
Member Author

I've made the regexes available at https://datacleaner.github.io/content/regexes.json
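
In case it helps anyone consuming that file, here is a small sketch that just downloads it as text. It makes no assumption about the JSON structure (parse it with whatever JSON library you already use), and the class name `RegexesJsonFetcher` is invented for this example.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of DataCleaner.
class RegexesJsonFetcher {

    // URL as posted in this thread
    static final String REGEXES_URL = "https://datacleaner.github.io/content/regexes.json";

    // Reads the whole stream into a string, decoding it as UTF-8.
    static String readAll(InputStream in) throws IOException {
        final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        final byte[] chunk = new byte[8192];
        int read;
        while ((read = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, read);
        }
        return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }

    // Downloads the published document and returns its raw text.
    static String fetch() throws IOException {
        try (InputStream in = URI.create(REGEXES_URL).toURL().openStream()) {
            return readAll(in);
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(fetch());
    }
}
```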

@kaspersorensen
Member Author

It seems to me that they're not very well maintained though. I'm gonna do a bit of cleanup in the descriptions and such, but I'm sure more people than me can help too, so let this be an open invite to any contributor to pitch in with their good regex contributions :-)
