# Data Cleansing

## Regular Expressions

https://docs.python.org/3/library/re.html

https://realpython.com/regex-python/

https://realpython.com/regex-python-part-2/

https://developers.google.com/edu/python/regular-expressions

RegexGolf: https://alf.nu/RegexGolf?world=regex&level=r00

https://realpython.com/python-strings/

https://realpython.com/introduction-to-python-generators/

Regular expressions are complex. You describe a pattern of text using text: most characters in a pattern simply stand for themselves. Ordinary letters and the digits 0-9 match literally, so all of the common characters are available as they are.

abcdefghjklmnopqrstuvwxyz0123456789 (e.g), The end!

The special characters `^`, `$`, `[]` and `*` do not appear that frequently in ordinary text, but they are used frequently in regular expressions. To match one of them literally, escape it with a backslash (`\`).

In [6]:
import re

## Python's 're' module

```python 
# \w is shorthand for the character class [a-zA-Z0-9_]
# \W is shorthand for the negated class  [^a-zA-Z0-9_]
```

Any character listed inside the square brackets counts: \w matches any single character included in the square brackets.
A caret (^) as the first character inside the brackets negates the set, so anything but those characters is matched.
'+' means one or more of the preceding character or class, matching as many as it can.

What split does is look for the substrings that match the pattern and split the original string at those substrings.

'r' means raw: backslashes in the string are kept as literal characters instead of starting an escape sequence. \ is a special character even in normal strings, where it normally escapes the next character; '\n', for example, normally means a newline in a normal string.
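As a small illustration of the difference (the variable names are only for this sketch), the lengths of a normal and a raw string written with the same characters can be compared:

```python
normal = '\n'   # escape sequence: a single newline character
raw = r'\n'     # raw string: a backslash followed by the letter n

print(len(normal))   # 1
print(len(raw))      # 2
print(raw == '\\n')  # True: the raw form equals an escaped backslash plus 'n'
```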

CSV files are comma-separated values files, but often it is not a comma that separates the values. There can be other delimiters as well.
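A minimal sketch of splitting on several possible delimiters at once. The sample line and the choice of delimiters (comma, semicolon, pipe) are assumptions made for the illustration:

```python
import re

# Hypothetical record that mixes three different delimiters
line = 'alice;30,Dublin|IE'

# Inside square brackets, ';', ',' and '|' are all literal characters
fields = re.split(r'[;,|]', line)
print(fields)  # ['alice', '30', 'Dublin', 'IE']
```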

In [8]:
# A string to be manipulated
original = 'Words, words, words.'

#the pattern/regular expression to use on the above string
pattern = r'\W+'

#Split a string into substrings using the regular expression
result = re.split(pattern, original)

#Print the result
print(result)

# returns back a list of strings.

['Words', 'words', 'words', '']


## Exercise to explain what happens in the following lines of code

In [9]:
re.split(r'(\W+)', 'Words, words, words.')

['Words', ', ', 'words', ', ', 'words', '.', '']

As per the https://docs.python.org/3/library/re.html documentation, if capturing parentheses are used in the pattern, the text of all groups in the pattern is also returned as part of the resulting list. In the above code the pattern r'(\W+)' uses parentheses, so the split returns the runs of word characters ([a-zA-Z0-9_]) together with the separators that matched \W = '[^a-zA-Z0-9_]', which are the commas followed by a space (', ') and the full stop ('.'). The trailing empty string appears because the original string ends with a separator. Parentheses are used to create groups, written (<regex>); a regex in parentheses just matches the contents of the parentheses.
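One way to see what the capturing split returns is to separate the words from the separators and then rebuild the string. This is only a sketch; the names `parts`, `words` and `separators` are chosen for the illustration:

```python
import re

original = 'Words, words, words.'
parts = re.split(r'(\W+)', original)

# Even positions hold the text between separators,
# odd positions hold the captured separators themselves.
words = parts[0::2]
separators = parts[1::2]
print(words)
print(separators)

# Because nothing was thrown away, joining the parts restores the input.
print(''.join(parts) == original)
```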

In [10]:
re.split(r'\W+', 'Words, words, words.', 1)

['Words', 'words, words.']

In the above example maxsplit is also given, as the integer 1, which means that at most one split occurs and the remainder of the string is returned as the final element of the list. In this case the split occurs at the first comma and space, and the remaining text ('words, words.') is returned as a single string.

In [11]:
re.split('[a-f]+', '0a3B9', flags=re.IGNORECASE)

['0', '3', '9']

The above code splits wherever it finds a run of letters from A to F ('[a-f]+'), ignoring whether they are lower case or upper case (flags=re.IGNORECASE).
The returned values are '0', '3' and '9', as the splits were executed when 'a' and 'B' characters were matched. 

## Real Python (https://realpython.com/regex-python/)

In [12]:
'abccba' == 'abccba'

True

In [13]:
'abccba' == 'abcbba'

False

In [14]:
'abc' in 'cbaabc'

True

In [15]:
'abc' .index('a')

0

In [16]:
'cbaabc'[2]

'a'

In [17]:
'cbaabc' .find('aa')

2

Perl is an important programming language in the history of regular expressions: most of the regex syntax in Python is modelled on Perl's. The syntax of Perl itself is very difficult. Podcast interviews with Guido van Rossum (for example on the Lex Fridman podcast) can help with understanding programming languages.

``` python re.search(<regex>, <string>)``` scans the string for the first location where the regex matches; the whole string does not have to match.

In [18]:
s = 'foo123bar'

re.search('123', s)

<re.Match object; span=(3, 6), match='123'>

Match is a class defined in the re module. The span runs from index 3 up to, but not including, index 6.

In [20]:
s[3:6]

'123'

Any string that contains at least one character is a truthy value. An empty string ('') is a falsy value, as is zero.

In [22]:
if 'abc':
    print("No way")
else:
    print("Correct")

No way


In [23]:
if 1 == 2:
    print("No way")
else:
    print("Correct")

Correct


In [24]:
if 0:
    print("No way")
else:
    print("Correct")

Correct


In [21]:
if re.search('123', s):
    print('Found a match.')
else:
    print('No match.')

Found a match.


Metacharacters = special characters. 

In [25]:
s = 'foo123bar'
re.search('[0-9][0-9][0-9]', s)

<re.Match object; span=(3, 6), match='123'>

This matches any three digits in a row. See the additional examples below:

In [26]:
re.search('[0-9][0-9][0-9]', 'foo456bar')

<re.Match object; span=(3, 6), match='456'>

In [27]:
re.search('[0-9][0-9][0-9]', '234baz')

<re.Match object; span=(0, 3), match='234'>

In [28]:
re.search('[0-9][0-9][0-9]', 'qux678')

<re.Match object; span=(3, 6), match='678'>

Using the metacharacter {} to specify the number of times the preceding regex is repeated, the above examples can be expressed as below. It matches an explicitly specified number of repetitions.

In [34]:
re.search('[0-9]{3}', 'qux678')

<re.Match object; span=(3, 6), match='678'>

On the other hand, a string that doesn’t contain three consecutive digits won’t match:

In [31]:
print(re.search('[0-9][0-9][0-9]', '12foo34'))

None


The dot (.) metacharacter matches any character except a newline, so it functions like a wildcard

In [29]:
s = 'foo123bar'
re.search('1.3', s)

<re.Match object; span=(3, 6), match='123'>

In [30]:
s = 'foo13bar'
print(re.search('1.3', s))

None


In the first example, the regex 1.3 matches '123' because the '1' and '3' match literally, and the . matches the '2'. Here, you’re essentially asking, “Does s contain a '1', then any character (except a newline), then a '3'?” The answer is yes for 'foo123bar' but no for 'foo13bar'.

Metacharacter []

In [35]:
re.search('ba[artz]', 'foobarqux')

<re.Match object; span=(3, 6), match='bar'>

In [40]:
re.search('b[ar][tz]', 'foobrtzqux')

<re.Match object; span=(3, 6), match='brt'>

The metacharacter sequence [artz] matches any single 'a', 'r', 't', or 'z' character. In the example, the regex ba[artz] matches both 'bar' and 'baz' (and would also match 'baa' and 'bat').

In [42]:
re.search('[a-z][a-z]', 'FOObar')

<re.Match object; span=(3, 5), match='ba'>

In [44]:
re.search('[0-9][0-9][a-z]', 'foo123bar')

<re.Match object; span=(4, 7), match='23b'>

In [45]:
re.search('[0-9a-fA-F]', '--- a0 ---')

<re.Match object; span=(4, 5), match='a'>

In the above examples, the return value is always the leftmost possible match. re.search() scans the search string from left to right, and as soon as it locates a match for <regex>, it stops scanning and returns the match.

You can complement a character class by specifying ^ as the first character, in which case it matches any character that isn’t in the set. In the following example, [^0-9] matches any character that isn’t a digit:

In [46]:
re.search('[^0-9]', '12345foo')

<re.Match object; span=(5, 6), match='f'>

If a ^ character appears in a character class but isn’t the first character, then it has no special meaning and matches a literal '^' character:

In [47]:
re.search('[#:^]', 'foo^bar:baz#qux')

<re.Match object; span=(3, 4), match='^'>

As you’ve seen, you can specify a range of characters in a character class by separating characters with a hyphen. What if you want the character class to include a literal hyphen character? You can place it as the first or last character or escape it with a backslash (``` \ ```):

In [48]:
re.search('[-abc]', '123-456')

<re.Match object; span=(3, 4), match='-'>

In [56]:
re.search('[abc-]', '123-ab56')

<re.Match object; span=(3, 4), match='-'>

In [50]:
re.search(r'[ab\-c]', '123-456')

<re.Match object; span=(3, 4), match='-'>

In [57]:
re.search('[]]', 'foo[1]')

<re.Match object; span=(5, 6), match=']'>

In [59]:
re.search(r'[ab\]cd]', 'foo[1]')

<re.Match object; span=(5, 6), match=']'>

dot (.)

Specifies a wildcard.

The . metacharacter matches any single character except a newline:

In [61]:
print(re.search('foo.bar', 'fooxbar'))


print(re.search('foo.bar', 'foobar'))

print(re.search('foo.bar', 'foo\nbar'))

<re.Match object; span=(0, 7), match='fooxbar'>
None
None


As a regex, foo.bar essentially means the characters 'foo', then any character except newline, then the characters 'bar'. The first string shown above, 'fooxbar', fits the bill because the . metacharacter matches the 'x'.

 Word characters are uppercase and lowercase letters, digits, and the underscore (_) character, so \w is essentially shorthand for [a-zA-Z0-9_]:

In [62]:
re.search(r'\w', '#(.a$@&')

<re.Match object; span=(3, 4), match='a'>

In [63]:
re.search('[a-zA-Z0-9_]', '#(.a$@&')

<re.Match object; span=(3, 4), match='a'>

\W is the opposite. It matches any non-word character and is equivalent to [^a-zA-Z0-9_]:

In [64]:
re.search(r'\W', 'a_1*3Qb')

<re.Match object; span=(3, 4), match='*'>

In [65]:

re.search('[^a-zA-Z0-9_]', 'a_1*3Qb')

<re.Match object; span=(3, 4), match='*'>

\d
\D

Match based on whether a character is a decimal digit.

\d matches any decimal digit character. \D is the opposite. It matches any character that isn’t a decimal digit:



In [68]:
re.search(r'\d', 'a4def')

<re.Match object; span=(1, 2), match='4'>

In [69]:
re.search(r'\D', 'A234Q678')

<re.Match object; span=(0, 1), match='A'>

\s
\S

Match based on whether a character represents whitespace.

\s matches any whitespace character.
(Note that, unlike the dot wildcard metacharacter, \s does match a newline character.)

\S is the opposite of \s. It matches any character that isn’t whitespace:

In [70]:
re.search(r'\s', 'foo\nbar baz')

<re.Match object; span=(3, 4), match='\n'>

In [71]:
re.search(r'\s', 'foobar baz')

<re.Match object; span=(6, 7), match=' '>

In [72]:
re.search(r'\S', '  \n foo  \n  ')

<re.Match object; span=(4, 5), match='f'>

backslash (\)

Removes the special meaning of a metacharacter.

As you’ve just seen, the backslash character can introduce special character classes like word, digit, and whitespace. There are also special metacharacter sequences called anchors that begin with a backslash, which you’ll learn about below.

When it’s not serving either of these purposes, the backslash escapes metacharacters. A metacharacter preceded by a backslash loses its special meaning and matches the literal character instead. Consider the following examples:

    

In [3]:
s = r'foo\bar'
print(s)

foo\bar


In [7]:
re.search('\\', s)

error: bad escape (end of pattern) at position 0

In [6]:
# using a raw string is a cleaner way to resolve this error, and it returns the correct position of the "\".
re.search(r'\\', s)

<re.Match object; span=(3, 4), match='\\'>

It’s good practice to use a raw string to specify a regex in Python whenever it contains backslashes.
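The same escaping idea applies to other metacharacters, such as the dot. A short sketch follows; the sample string `s` is an assumption for the illustration, while `re.escape()` is the standard-library helper that adds the backslashes automatically:

```python
import re

s = 'price: 1.3 or 123'

print(re.search(r'1.3', '123'))       # unescaped dot is a wildcard, so '123' matches
print(re.search(r'1\.3', '123'))      # escaped dot must be a literal '.', so None
print(re.search(r'1\.3', s).group())  # '1.3'

# re.escape() escapes the metacharacters in a string for you
literal = re.escape('1.3')
print(literal)                        # 1\.3
print(re.search(literal, s).span())   # (7, 10)
```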

Anchors
Anchors are zero-width matches. They don’t match any actual characters in the search string, and they don’t consume any of the search string during parsing. Instead, an anchor dictates a particular location in the search string where a match must occur.

In [None]:
re.search('^foo', 'foobar')

In [None]:

print(re.search('^foo', 'barfoo'))

In [None]:
re.search(r'\Afoo', 'foobar')


In [5]:

print(re.search(r'\Afoo', 'barfoo'))

None


The same outcome can be obtained by using either ^ or \A.

When the regex parser encounters ^ or \A, the parser’s current position must be at the beginning of the search string for it to find a match.

$

\Z

Anchor a match to the end of <string>.

When the regex parser encounters $ or \Z, the parser’s current position must be at the end of the search string for it to find a match. Whatever precedes $ or \Z must constitute the end of the search string:


In [82]:
re.search('bar$', 'foobar')


<re.Match object; span=(3, 6), match='bar'>

In [83]:

print(re.search('bar$', 'barfoo'))



None


In [84]:

re.search(r'bar\Z', 'foobar')


<re.Match object; span=(3, 6), match='bar'>

In [85]:

print(re.search(r'bar\Z', 'barfoo'))


None


$ and \Z behave slightly differently from each other in MULTILINE mode. See the section below on flags for more information on MULTILINE mode.
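A brief sketch of that difference, using a three-line sample string chosen for the illustration:

```python
import re

s = 'foo\nbar\nbaz'

# Without MULTILINE, 'bar' is not at the end of the string, so neither anchor matches.
print(re.search(r'bar$', s))    # None
print(re.search(r'bar\Z', s))   # None

# With MULTILINE, $ also matches just before each newline; \Z is unaffected.
print(re.search(r'bar$', s, re.MULTILINE))   # matches 'bar' at span (4, 7)
print(re.search(r'bar\Z', s, re.MULTILINE))  # None
```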

#### \b

Anchors a match to a word boundary.

\b asserts that the regex parser’s current position must be at the beginning or end of a word. A word consists of a sequence of alphanumeric characters or underscores ([a-zA-Z0-9_]), the same as for the \w character class:

In [86]:
re.search(r'\bbar', 'foo bar')
#There is a space boundary before bar


<re.Match object; span=(4, 7), match='bar'>

In [87]:
re.search(r'\bbar', 'foo.bar')
#there is a dot boundary before bar


<re.Match object; span=(4, 7), match='bar'>

In [94]:

print(re.search(r'\bbar', 'foobar'))
# there is neither a space nor a dot before bar; it follows foo directly, so there is no word boundary and the search returns None

None


In [89]:

# there is a boundary (a blank) after foo, therefore it matches foo below.
re.search(r'foo\b', 'foo bar')


<re.Match object; span=(0, 3), match='foo'>

In [92]:
# there is a . (dot) after foo which is also considered a boundary
re.search(r'foo\b', 'foo.bar')


<re.Match object; span=(0, 3), match='foo'>

In [93]:
# foo in the below example is followed by bar straight away, therefore it cannot match foo with a boundary after it.
print(re.search(r'foo\b', 'foobar'))

None


In [95]:
re.search(r'\bbar\b', 'foo bar baz')
# spaces around "bar" qualify for a boundary, hence there is a match

<re.Match object; span=(4, 7), match='bar'>

In [96]:
# parentheses form a boundary around bar, therefore there is a match
re.search(r'\bbar\b', 'foo(bar)baz')


<re.Match object; span=(4, 7), match='bar'>

In [97]:
# Result is none as there is no boundary before and after "bar"
print(re.search(r'\bbar\b', 'foobarbaz'))

None
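A common use of \b is to find whole words only. The sketch below, with a sample sentence made up for the illustration, counts 'cat' as a whole word and as a substring:

```python
import re

text = 'cat concat cat, catalog'

whole_words = re.findall(r'\bcat\b', text)  # 'cat' with a boundary on both sides
anywhere = re.findall(r'cat', text)         # 'cat' as any substring

print(len(whole_words))  # 2
print(len(anywhere))     # 4
```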


### \B

Anchors a match to a location that isn’t a word boundary.

\B does the opposite of \b. It asserts that the regex parser’s current position must not be at the start or end of a word:



In [101]:
print(re.search(r'\Bfoo\B', 'foo'))
# the start and the end of the string are word boundaries for 'foo', so \B cannot match => None

None


In [102]:
# each "." creates a word boundary next to foo, so \B cannot match and the result is None
print(re.search(r'\Bfoo\B', '.foo.'))

None


In [103]:
# no boundaries around foo, hence it matches.
re.search(r'\Bfoo\B', 'barfoobaz')

<re.Match object; span=(3, 6), match='foo'>

## *

Matches zero or more repetitions of the preceding regex.

For example, a* matches zero or more 'a' characters. That means it would match an empty string, 'a', 'aa', 'aaa', and so on.

Consider these examples:

In [104]:
re.search('foo-*bar', 'foobar')                     # Zero dashes


<re.Match object; span=(0, 6), match='foobar'>

In [105]:

re.search('foo-*bar', 'foo-bar')                    # One dash


<re.Match object; span=(0, 7), match='foo-bar'>

In [106]:

re.search('foo-*bar', 'foo--bar')                   # Two dashes

<re.Match object; span=(0, 8), match='foo--bar'>

In [107]:
# In this example, .* matches everything between 'foo' and 'bar', excluding # at the end of the original string:

re.search('foo.*bar', '# foo $qux@grault % bar #')

<re.Match object; span=(2, 23), match='foo $qux@grault % bar'>

## +

Matches one or more repetitions of the preceding regex.

This is similar to *, but the quantified regex must occur at least once:



## ?

Matches zero or one repetitions of the preceding regex.

Again, this is similar to * and +, but in this case there’s only a match if the preceding regex occurs once or not at all:

## Examples of * , +, ?

In [108]:
re.match('foo[1-9]*bar', 'foobar')

# * Matches zero or more repetitions of the preceding regex. therefore, found foobar

<re.Match object; span=(0, 6), match='foobar'>

In [109]:
re.match('foo[1-9]*bar', 'foo42bar')


<re.Match object; span=(0, 8), match='foo42bar'>

In [114]:

# + needs at least one occurrence of the regex, so there is no match, as there are no digits between foo and bar
print(re.match('foo[1-9]+bar', 'foobar'))


None


In [111]:
# Match found as there are numeric characters in string.
re.match('foo[1-9]+bar', 'foo42bar')


<re.Match object; span=(0, 8), match='foo42bar'>

In [112]:
# there’s only a match if the preceding regex occurs once or not at all:

re.match('foo[1-9]?bar', 'foobar')


<re.Match object; span=(0, 6), match='foobar'>

In [121]:
# No match, as the regex allows only one digit or none; compare cells In [122] and In [123] below
print(re.match('foo[1-9]?bar', 'foo42bar'))


None


In [122]:
print(re.match('foo[1-9]?bar', 'foo4bar'))

<re.Match object; span=(0, 7), match='foo4bar'>


In [123]:
print(re.match('foo[1-9]?bar', 'foobar'))

<re.Match object; span=(0, 6), match='foobar'>
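The cells above can be summarised in one loop. This is only a sketch that reuses the same pattern and test strings; the `results` dictionary is a name introduced for the illustration:

```python
import re

results = {}
for quantifier in ('*', '+', '?'):
    pattern = f'foo[1-9]{quantifier}bar'
    for s in ('foobar', 'foo4bar', 'foo42bar'):
        # True when the pattern matches from the start of s
        results[(quantifier, s)] = re.match(pattern, s) is not None
        print(f'{pattern:14} {s:10} {results[(quantifier, s)]}')
```

Only + rejects 'foobar' (it needs at least one digit), and only ? rejects 'foo42bar' (it allows at most one digit).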


## {m}

Matches exactly m repetitions of the preceding regex.

This is similar to * or +, but it specifies exactly how many times the preceding regex must occur for a match to succeed:



In [124]:
print(re.search('x-{3}x', 'x--x'))                # Two dashes


None


In [125]:

re.search('x-{3}x', 'x---x')                      # Three dashes

<re.Match object; span=(0, 5), match='x---x'>

In [126]:

print(re.search('x-{3}x', 'x----x'))              # Four dashes

None


Here, x-{3}x matches 'x', followed by exactly three instances of the '-' character, followed by another 'x'. The match fails when there are fewer or more than three dashes between the 'x' characters.

## {m,n}

Matches any number of repetitions of the preceding regex from m to n, inclusive.

In the following example, the quantified <regex> is -{2,4}. The match succeeds when there are two, three, or four dashes between the 'x' characters but fails otherwise:

In [127]:
for i in range(1, 6):
    s = f"x{'-' * i}x"
    print(f'{i}  {s:10}', re.search('x-{2,4}x', s))

1  x-x        None
2  x--x       <re.Match object; span=(0, 4), match='x--x'>
3  x---x      <re.Match object; span=(0, 5), match='x---x'>
4  x----x     <re.Match object; span=(0, 6), match='x----x'>
5  x-----x    None


Omitting m implies a lower bound of 0, and omitting n implies an unlimited upper bound.
If you omit all of m, n, and the comma, then the curly braces no longer function as metacharacters. {} matches just the literal string '{}'.
In fact, to have any special meaning, a sequence with curly braces must fit one of the following patterns in which m and n are nonnegative integers:

{m}
{m,n}
{m,}
{,n}
{,}

{m,n} will match as many repetitions as possible, and {m,n}? will match as few as possible.

In [128]:
# Greedy (w/o ?) includes all, as many as possible
re.search('a{3,5}', 'aaaaaaaa')


<re.Match object; span=(0, 5), match='aaaaa'>

In [131]:
# Non-greedy (with ?) includes the least number of characters as per regex
re.search('a{3,5}?', 'aaaaaaaa')

<re.Match object; span=(0, 3), match='aaa'>
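The same greedy and non-greedy distinction applies to * and +. A small sketch with a made-up string containing two bracketed tags:

```python
import re

s = '<a> text <b>'

greedy = re.search(r'<.*>', s).group()   # .* runs to the last '>'
lazy = re.search(r'<.*?>', s).group()    # .*? stops at the first '>'

print(greedy)  # <a> text <b>
print(lazy)    # <a>
```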

## Grouping Constructs and Backreferences
Grouping constructs break up a regex in Python into subexpressions or groups. This serves two purposes:

#### Grouping: A group represents a single syntactic entity. Additional metacharacters apply to the entire group as a unit. 

#### Capturing: Some grouping constructs also capture the portion of the search string that matches the subexpression in the group. You can retrieve captured matches later through several different mechanisms.
Here’s a look at how grouping and capturing work.

(```<regex>```)

Defines a subexpression or group.

This is the most basic grouping construct. A regex in parentheses just matches the contents of the parentheses

In [132]:
re.search('(bar)', 'foo bar baz')



<re.Match object; span=(4, 7), match='bar'>

In [133]:

re.search('bar', 'foo bar baz')


<re.Match object; span=(4, 7), match='bar'>

#### Treating a Group as a Unit
A quantifier metacharacter that follows a group operates on the entire subexpression specified in the group as a single unit.

The regex (ba[rz]){2,4}(qux)? matches 2 to 4 occurrences of either 'bar' or 'baz', optionally followed by 'qux':

In [134]:
re.search('(ba[rz]){2,4}(qux)?', 'bazbarbazqux')


<re.Match object; span=(0, 12), match='bazbarbazqux'>

In [135]:

re.search('(ba[rz]){2,4}(qux)?', 'barbar')


<re.Match object; span=(0, 6), match='barbar'>

#### Capturing Groups
Grouping isn’t the only useful purpose that grouping constructs serve. Most (but not quite all) grouping constructs also capture the part of the search string that matches the group. You can retrieve the captured portion or refer to it later in several different ways.

Remember the match object that re.search() returns? There are two methods defined for a match object that provide access to captured groups: .groups() and .group().

####  m.groups()

Returns a tuple containing all the captured groups from a regex match.

Consider this example:

In [136]:
m = re.search(r'(\w+),(\w+),(\w+)', 'foo,quux,baz')
m

<re.Match object; span=(0, 12), match='foo,quux,baz'>

Each of the three (\w+) expressions matches a sequence of word characters. The full regex (\w+),(\w+),(\w+) breaks the search string into three comma-separated tokens.

Because the (\w+) expressions use grouping parentheses, the corresponding matching tokens are captured. To access the captured matches, you can use .groups(), which returns a tuple containing all the captured matches in order:

In [137]:
m.groups()

('foo', 'quux', 'baz')

#### m.group(```<n>```)

Returns a string containing the ```<n>```th captured match.

With one argument, .group() returns a single captured match. Note that the arguments are one-based, not zero-based. So, m.group(1) refers to the first captured match, m.group(2) to the second, and so on.
    
m.group(0) returns the entire match, and m.group() does the same.

In [138]:
m = re.search(r'(\w+),(\w+),(\w+)', 'foo,quux,baz')
m.groups()


('foo', 'quux', 'baz')

In [139]:
m.group(1)

'foo'

In [143]:
m.group(2)

'quux'

In [142]:
m.group(3)

'baz'

In [144]:
m.group(0)

'foo,quux,baz'
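Two further details of the match object, shown here as a short sketch: .group() also accepts several group numbers at once and then returns a tuple, and .span() accepts a group number to report where that group matched:

```python
import re

m = re.search(r'(\w+),(\w+),(\w+)', 'foo,quux,baz')

print(m.group(1, 3))       # ('foo', 'baz')
print(m.span(2))           # (4, 8): where 'quux' sits in the string
print(m.start(), m.end())  # 0 12: bounds of the whole match
```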

## Google Regular Expressions

Regular expressions are a powerful language for matching text patterns. This page gives a basic introduction to regular expressions themselves sufficient for our Python exercises and shows how regular expressions work in Python. The Python "re" module provides regular expression support.


match = re.search(pat, str)

In [10]:
import re

In [11]:
str = 'an example word:cat!!'
match = re.search(r'word:\w\w\w', str)
# If-statement after search() tests if it succeeded
if match:
  print('found', match.group()) ## 'found word:cat'
else:
  print('did not find')

found word:cat


The basic rules of regular expression search for a pattern within a string are:

- The search proceeds through the string from start to end, stopping at the first match found
- All of the pattern must be matched, but not all of the string
- If match = re.search(pat, str) is successful, match is not None and in particular match.group() is the matching text

In [16]:
  ## Search for pattern 'iii' in string 'piiig'.
  ## All of the pattern must match, but it may appear anywhere.
  ## On success, match.group() is matched text.
match = re.search(r'iii', 'piiig') # found, match.group() == "iii"
match

<re.Match object; span=(1, 4), match='iii'>

In [17]:
match = re.search(r'igs', 'piiig') # not found, match == None
match

In [19]:
## . = any char but \n
match = re.search(r'..g', 'piiig') # found, match.group() == "iig"
match

<re.Match object; span=(2, 5), match='iig'>

In [20]:

## \d = digit char, \w = word char
match = re.search(r'\d\d\d', 'p123g') # found, match.group() == "123"
match

<re.Match object; span=(1, 4), match='123'>

In [22]:
match = re.search(r'\w\w\w+', '@@abcd!!') # found, match.group() == "abcd"
match

<re.Match object; span=(2, 6), match='abcd'>

In [26]:
## i+ = one or more i's, as many as possible.
match = re.search(r'pi+', 'piiig') # found, match.group() == "piii"
match

<re.Match object; span=(0, 4), match='piii'>

In [27]:
## Finds the first/leftmost solution, and within it drives the +
## as far as possible (aka 'leftmost and largest').
## In this example, note that it does not get to the second set of i's.
match = re.search(r'i+', 'piigiiii') # found, match.group() == "ii"
match

<re.Match object; span=(1, 3), match='ii'>

In [28]:
## \s* = zero or more whitespace chars
## Here look for 3 digits, possibly separated by whitespace.
match = re.search(r'\d\s*\d\s*\d', 'xx1 2   3xx') # found, match.group() == "1 2   3"
match

<re.Match object; span=(2, 9), match='1 2   3'>

In [29]:
re.search(r'\d\s*\d\s*\d', 'xx12  3xx') # found, match.group() == "12  3"


<re.Match object; span=(2, 7), match='12  3'>

In [30]:
re.search(r'\d\s*\d\s*\d', 'xx123xx') # found, match.group() == "123"
# * -- 0 or more occurrences of the pattern to its left

<re.Match object; span=(2, 5), match='123'>

In [31]:
re.search(r'\d\s*\d\s*\d', 'xx1 2 3 xx')

<re.Match object; span=(2, 7), match='1 2 3'>

In [32]:
## ^ = matches the start of string, so this fails:
match = re.search(r'^b\w+', 'foobar') # not found, match == None
match

In [35]:
## but without the ^ it succeeds and finds a string that starts with "b":
match = re.search(r'b\w+', 'foobar') # found, match.group() == "bar"
match

<re.Match object; span=(3, 6), match='bar'>

## Email example

In [36]:
str = 'purple alice-b@google.com monkey dishwasher'
match = re.search(r'\w+@\w+', str)
if match:
    print(match.group())  ## 'b@google'

b@google


The search does not get the whole email address in this case because the \w does not match the '-' or '.' in the address. We'll fix this using the regular expression features below.

Square brackets can be used to indicate a set of chars, so [abc] matches 'a' or 'b' or 'c'. The codes \w, \s etc. work inside square brackets too with the one exception that dot (.) just means a literal dot. For the emails problem, the square brackets are an easy way to add '.' and '-' to the set of chars which can appear around the @ with the pattern r'[\w.-]+@[\w.-]+' to get the whole email address:

In [37]:
match = re.search(r'[\w.-]+@[\w.-]+', str)
if match:
    print(match.group())  ## 'alice-b@google.com'

alice-b@google.com


In [44]:
str1 = 'purple a-lice-b@google.com monkey dishwasher'

match = re.search(r'[\w.-]+@[\w.-]+', str1)
if match:
    print(match.group())

a-lice-b@google.com


### Group extraction using () parentheses

In [46]:
str = 'purple alice-b@google.com monkey dishwasher'
match = re.search(r'([\w.-]+)@([\w.-]+)', str)
if match:
    print(match.group())   ## 'alice-b@google.com' (the whole match)
    print(match.group(1))  ## 'alice-b' (the username, group 1)
    print(match.group(2))  ## 'google.com' (the host, group 2)

alice-b@google.com
alice-b
google.com


On a successful search, match.group(1) is the match text corresponding to the 1st left parenthesis, and match.group(2) is the text corresponding to the 2nd left parenthesis. The plain match.group() is still the whole match text as usual.

#### findall

findall() is probably the single most powerful function in the re module. Above we used re.search() to find the first match for a pattern. findall() finds *all* the matches and returns them as a list of strings, with each string representing one match.

In [47]:
## Suppose we have a text with many email addresses
str = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'

## Here re.findall() returns a list of all the found email strings
emails = re.findall(r'[\w\.-]+@[\w\.-]+', str) ## ['alice@google.com', 'bob@abc.com']
for email in emails:
    # do something with each found email string
    print(email)


alice@google.com
bob@abc.com
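When the pattern contains capturing groups, findall() returns a list of tuples, one tuple per match, instead of a list of plain strings. A minimal sketch on the same sample text (the variable names are chosen for the illustration):

```python
import re

text = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'

# Two groups in the pattern, so each match becomes a (user, host) tuple
pairs = re.findall(r'([\w.-]+)@([\w.-]+)', text)
print(pairs)  # [('alice', 'google.com'), ('bob', 'abc.com')]

for user, host in pairs:
    print(user, host)
```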


#### findall With Files
For files, you may be in the habit of writing a loop to iterate over the lines of the file, and you could then call findall() on each line. Instead, let findall() do the iteration for you -- much better! Just feed the whole file text into findall() and let it return a list of all the matches in a single step (recall that f.read() returns the whole text of a file in a single string):

In [49]:
# Open file
## f = open('test.txt', 'r')
# Feed the file text into findall(); it returns a list of all the found strings
## strings = re.findall(r'some pattern', f.read())
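A runnable sketch of the same idea. It writes a small temporary file first so that it is self-contained; the file contents and the email pattern are assumptions for the illustration, not part of the original exercise:

```python
import os
import re
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'test.txt')

    # Create a small sample file to search
    with open(path, 'w', encoding='utf-8') as f:
        f.write('alice@google.com\nno address here\nbob@abc.com\n')

    # Feed the whole file text into findall() in a single step
    with open(path, encoding='utf-8') as f:
        found = re.findall(r'[\w.-]+@[\w.-]+', f.read())

print(found)  # ['alice@google.com', 'bob@abc.com']
```

Using `with` closes the file automatically, which the commented-out `open()` call above would not do on its own.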

#### Substitution (optional)
The re.sub(pat, replacement, str) function searches for all the instances of pattern in the given string, and replaces them. The replacement string can include '\1', '\2' which refer to the text from group(1), group(2), and so on from the original matching text.

Here's an example which searches for all the email addresses, and changes them to keep the user (\1) but have yo-yo-dyne.com as the host.

In [50]:
str = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'
## re.sub(pat, replacement, str) -- returns new string with all replacements,
## \1 is group(1), \2 group(2) in the replacement
print(re.sub(r'([\w\.-]+)@([\w\.-]+)', r'\1@yo-yo-dyne.com', str))
## purple alice@yo-yo-dyne.com, blah monkey bob@yo-yo-dyne.com blah dishwasher

purple alice@yo-yo-dyne.com, blah monkey bob@yo-yo-dyne.com blah dishwasher


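Pattern = ([\w\.-]+)@([\w\.-]+), in which group 1 is the user name before the @ sign and group 2 is the host after it.

Replacement = r'\1@yo-yo-dyne.com', which keeps group 1 and puts the new host after the @ sign; group 2 is discarded.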
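Both backreferences can be used in the replacement. As a sketch on the same sample text, swapping \1 and \2 exchanges the user and the host (the result is no longer a meaningful address; it only shows how the groups are reused):

```python
import re

text = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'

# \2 is the host and \1 is the user, so this swaps the two sides of the @
swapped = re.sub(r'([\w.-]+)@([\w.-]+)', r'\2@\1', text)
print(swapped)
# purple google.com@alice, blah monkey abc.com@bob blah dishwasher
```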

## Exercise 1
***

Remember to do these exercises in your own notebook in your assessment repository.

Write a Python function to remove all non-alphanumeric characters from a string.

In [58]:
# Function to remove all non-word characters from a string
def alpha_num(text):
    # re.sub replaces every run of non-word characters (\W) with the empty string ''.
    # Note: \W treats the underscore as a word character, so underscores are kept.
    result = re.sub(r'\W+', '', text)
    return result # returns the result

#Importing random module to generate a text that includes both \w and \W characters
import random

#defining chars as the string for character selection
chars = 'abcdefghijklmnopqrstuvwxyz 0123456789_*$%^-{}[]ABCDEFGHIJKLMNOPQRSTUVWXYZ'

# Generating a list of 300 characters
str1 = random.choices(chars, k=300)

# Joining the characters into a string
str1 = ''.join(str1)
# Printing the string
print(str1)


ts[6%F28rP UfXBNA6{L]4rbw$r}lz$uf%sraiv-nbY5yXc{{3FC%QCO6kpEFhi{k9l8KisByqr114[dDripPf_-%CUH%XFEnQ2dScKItiS*qWG}hvIIt38L4Zl[-g5iJ[kYPraP2UtJ1}A84j fIzAF8J1kKtdyd1]kTFFq3kB2z*1T_xNCAfLy40 [9WNd93qHb3rhYSLB6QDYG_eC[UlH$F*s-vOIbkXNUri3M$ISloq^xRV-1pjOgoy[Ty^v7dx-YrybquPZu[mZYpifml}[Dz1Bls9][LPAIi3s0rmA


In [60]:
#applying the function on the string
alpha_num(str1)

'ts6F28rPUfXBNA6L4rbwrlzufsraivnbY5yXc3FCQCO6kpEFhik9l8KisByqr114dDripPf_CUHXFEnQ2dScKItiSqWGhvIIt38L4Zlg5iJkYPraP2UtJ1A84jfIzAF8J1kKtdyd1kTFFq3kB2z1T_xNCAfLy409WNd93qHb3rhYSLB6QDYG_eCUlHFsvOIbkXNUri3MISloqxRV1pjOgoyTyv7dxYrybquPZumZYpifmlDz1Bls9LPAIi3s0rmA'
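The output above still contains underscores, because \w counts the underscore as a word character. A stricter variant is sketched below; the function name `alpha_num_strict` is hypothetical, and note that for ordinary (str) patterns \w also accepts non-ASCII letters and digits:

```python
import re

def alpha_num_strict(text):
    # [\W_] matches any non-word character or an underscore,
    # so only letters and digits survive the substitution.
    return re.sub(r'[\W_]+', '', text)

print(alpha_num_strict('a_b-c 1!'))  # abc1
```

To restrict the result to ASCII letters and digits only, the pattern r'[^a-zA-Z0-9]+' could be used instead.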

## Second part of Real Python

In [7]:
re.search(r'\d+', '123foobar')

<re.Match object; span=(0, 3), match='123'>

In [8]:
re.search(r'\d+', 'foo123bar')

<re.Match object; span=(3, 6), match='123'>

In [21]:
re.match(r'\d+', '123foobar')

<re.Match object; span=(0, 3), match='123'>

In [22]:
print(re.match(r'\d+', 'foo123bar')) # foo is in front of the 123, therefore no match

None


### Multiline

In [23]:
s = 'foo\nbar\nbaz'

In [24]:
re.search('^foo', s)

<re.Match object; span=(0, 3), match='foo'>

In [25]:
re.search('^bar', s, re.MULTILINE)

<re.Match object; span=(4, 7), match='bar'>

In [27]:
s = 'foo\nbar\nbaz'

re.match('^foo', s)


<re.Match object; span=(0, 3), match='foo'>

In [28]:
print(re.match('^bar', s, re.MULTILINE))

None


Even with the MULTILINE flag set, re.match() will match the caret (^) anchor only at the beginning of <string>, not at the beginning of lines contained within <string>.

Note that, although it illustrates the point, the caret (^) anchor in the re.match('^foo', s) call above is redundant. With re.match(), matches are essentially always anchored at the beginning of the string.
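
The effect of the flag is perhaps easiest to see with re.findall(), which reports every position where the anchored pattern matches. The following is a small sketch using the same string as above; the variable names are only illustrative.

```python
import re

s = 'foo\nbar\nbaz'

# Without re.MULTILINE, ^ matches only at the very start of the string.
first_only = re.findall(r'^\w+', s)

# With re.MULTILINE, ^ also matches immediately after each newline.
every_line = re.findall(r'^\w+', s, re.MULTILINE)

# re.match() still only tries position 0, whatever the flag says.
match_result = re.match(r'^bar', s, re.MULTILINE)

print(first_only)    # ['foo']
print(every_line)    # ['foo', 'bar', 'baz']
print(match_result)  # None
```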

### Fullmatch

In [14]:
print(re.fullmatch(r'\d+', '123foo')) # the full string has to match the full regular expression

None


In [29]:
re.fullmatch(r'\d+', '123') # the whole string consists of digits, therefore a match is found.

<re.Match object; span=(0, 3), match='123'>

In [11]:
print(re.fullmatch(r'\d+', '123foo'))

None


In [12]:
print(re.fullmatch(r'\d+', 'foo123'))

None


In [15]:
print(re.fullmatch(r'\d+', 'foo123bar'))

None


In [19]:
re.search(r'^\d+$', '123') 

<re.Match object; span=(0, 3), match='123'>

In [None]:
# re.match() only returns a match if one is found at the beginning of the string.
# re.search() has a harder job than re.match(): it scans the whole string for the first position where the pattern matches.


In [32]:
print(re.search(r'^\d+$', 'foo123foobar'))

None


re.search() can be made to behave like re.match() by starting the pattern with ^. Then it will only match at the start of the string.
$ anchors the pattern at the end of the string. Using ^ and $ together, as in r'^\d+$' above, goes further than re.match(): it closely mimics re.fullmatch(), because the whole string, from start to end, has to consist of digits.
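
One subtlety is worth a quick check: $ also matches just before a single trailing newline, so r'^\d+$' with re.search() is not exactly the same as re.fullmatch(). The sketch below compares the two, plus a stricter variant using the \A and \Z anchors (start and end of string only). The test strings are made up for illustration.

```python
import re

results = {}
for text in ['123', '123foo', 'foo123', '123\n']:
    anchored = re.search(r'^\d+$', text) is not None    # ^ ... $
    strict = re.search(r'\A\d+\Z', text) is not None    # \A ... \Z
    full = re.fullmatch(r'\d+', text) is not None       # fullmatch
    results[text] = (anchored, strict, full)
    print(repr(text), anchored, strict, full)

# '123'    True  True  True
# '123foo' False False False
# 'foo123' False False False
# '123\n'  True  False False
```

For input that may end in a newline (lines read from a file, for example), re.fullmatch() or \A...\Z is the safer choice.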

### Findall

In [34]:
re.findall(r'\d+', '123foo456bar789.')

['123', '456', '789']

In [35]:
re.search(r'\d+', '123foo456bar789.')

<re.Match object; span=(0, 3), match='123'>

In [36]:
re.match(r'\d+', '123foo456bar789.')

<re.Match object; span=(0, 3), match='123'>

In [38]:
print(re.fullmatch(r'\d+', '123foo456bar789.'))

None


### Finditer

re.finditer(<regex>, <string>) scans <string> for non-overlapping matches of <regex> and returns an iterator that yields the match objects from any it finds. It scans the search string from left to right and returns matches in the order it finds them:

In [40]:
matches = re.finditer(r'\d+', '123foo456bar789.')
matches

<callable_iterator at 0xa578247430>

In [41]:
for match in matches:
    print(match)

<re.Match object; span=(0, 3), match='123'>
<re.Match object; span=(6, 9), match='456'>
<re.Match object; span=(12, 15), match='789'>


In [42]:
it = re.finditer(r'\w+', '...foo,,,,bar:%$baz//|')

In [43]:
next(it)

<re.Match object; span=(3, 6), match='foo'>

In [44]:
next(it)

<re.Match object; span=(10, 13), match='bar'>

In [45]:
next(it)

<re.Match object; span=(16, 19), match='baz'>

In [47]:
for i in re.finditer(r'\w+', '...foo,,,,bar:%$baz//|'):
    print(i)

<re.Match object; span=(3, 6), match='foo'>
<re.Match object; span=(10, 13), match='bar'>
<re.Match object; span=(16, 19), match='baz'>


re.findall() and re.finditer() are very similar, but they differ in two respects:

re.findall() returns a list, whereas re.finditer() returns an iterator.

The items in the list that re.findall() returns are the actual matching strings, whereas the items yielded by the iterator that re.finditer() returns are match objects.

Any task that you could accomplish with one, you could probably also manage with the other. Which one you choose will depend on the circumstances. As you’ll see later in this tutorial, a lot of useful information can be obtained from a match object. If you need that information, then re.finditer() will probably be the better choice.
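
As a small illustration of that difference, the sketch below runs both functions on the string used earlier. With re.finditer() the start and end offsets of each match are available; with re.findall() only the matched text is returned. The variable names are illustrative.

```python
import re

text = '123foo456bar789.'

# findall: plain strings, the positions are lost
numbers = re.findall(r'\d+', text)

# finditer: match objects, so the offsets can be read from each match
positions = [(m.group(), m.start(), m.end()) for m in re.finditer(r'\d+', text)]

print(numbers)    # ['123', '456', '789']
print(positions)  # [('123', 0, 3), ('456', 6, 9), ('789', 12, 15)]
```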

## Substitution Functions

re.sub()	Scans a string for regex matches, replaces the matching portions of the string with the specified replacement string, and returns the result

re.subn()	Behaves just like re.sub() but also returns information regarding the number of substitutions made

Both re.sub() and re.subn() create a new string with the specified substitutions and return it. The original string remains unchanged. (Remember that strings are immutable in Python, so it wouldn’t be possible for these functions to modify the original string.)

re.sub(```<regex>, <repl>, <string>```, count=0, flags=0)

re.sub(```<regex>, <repl>, <string>```) finds the leftmost non-overlapping occurrences of ```<regex>``` in ```<string>```, replaces each match as indicated by ```<repl>```, and returns the result. ```<string>``` remains unchanged.

<repl> can be either a string or a function, as explained below.

In [48]:
s = 'foo.123.bar.789.baz'

re.sub(r'\d+', '#', s)

'foo.#.bar.#.baz'

In [49]:
re.sub('[a-z]+', '(*)', s)

'(*).123.(*).789.(*)'

In [52]:
re.sub(r'([a-z]+)([0-9]+)', r'\2\1' , 'foo123')

'123foo'

In [53]:
re.sub(r'([a-z]+)([0-9]+)', r'\1\2\1' , 'foo123')

'foo123foo'

In [50]:
re.sub(r'(\w+),bar,baz,(\w+)', r'\2,bar,baz,\1', 'foo,bar,baz,qux')

'qux,bar,baz,foo'

The parentheses () capture whatever they match as a group. Each group is remembered and can be referenced in the replacement string by its number: \1 is the first group and \2 is the second.
In the example above, qux moved to the first position and foo moved to the final position.


Groups can capture whole runs of characters, because the sub-pattern inside the parentheses can be any regular expression.


In [55]:
re.sub(r'([a-z]+)([0-9]+)', r'\1\2\1' , 'foo123bar456') # the regex searches multiple times, it does not stop at the first match.

'foo123foobar456bar'

In [57]:
re.sub(r'foo,(?P<w1>\w+),(?P<w2>\w+),qux',
       r'foo,\g<w2>,\g<w1>,qux',
       'foo,bar,baz,qux') # (?P<w1>...) and (?P<w2>...) name the groups in the regex,
# and the replacement refers back to them (backreferences) as \g<w1> and \g<w2>.

'foo,baz,bar,qux'

In [None]:
re.sub(r'foo,(\w+),(\w+),qux',
       r'foo,\g<2>,\g<1>,qux',
       'foo,bar,baz,qux') 
# here the \g<> backreference is used without names, referring to each group by its number:
# the first () is group 1 and the second () is group 2 in the regex.

In [58]:
re.sub(r'(\d+)', r'\10', 'foo 123 bar')
# \10 is read as a backreference to group 10, but the regex defines only one group (one pair of parentheses), so an error is raised.

error: invalid group reference 10 at position 1

re.sub() keeps searching until the end of the string and rewrites every match according to the replacement. Backreferences make this very powerful, but they only work for text that was captured by a group.
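
If the intention in the failing example was "group 1 followed by a literal 0", the \g<> syntax removes the ambiguity. A minimal sketch:

```python
import re

# r'\10' would be read as a reference to group 10.
# r'\g<1>0' means: group 1, followed by a literal '0'.
result = re.sub(r'(\d+)', r'\g<1>0', 'foo 123 bar')

print(result)  # foo 1230 bar
```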

### Substitution by function
If you specify ```<repl>``` as a function, then re.sub() calls that function for each match found. It passes each corresponding match object as an argument to the function to provide information about the match. The function return value then becomes the replacement string:


In [60]:
re.sub(r'\w+', 'xxx', 'foo.bar.baz.qux')

'xxx.xxx.xxx.xxx'

In [62]:
re.subn(r'\w+', 'xxx', 'foo.bar.baz.qux')

('xxx.xxx.xxx.xxx', 4)

In [63]:
re.subn(r'\w+', 'xxx', 'foo.bar.baz.qux', count=2)

('xxx.xxx.baz.qux', 2)

In [64]:
def f(match_obj):
    s = match_obj.group(0)  # The matching string

    # s.isdigit() returns True if all characters in s are digits
    if s.isdigit():
        return str(int(s) * 10)
    else:
        return s.upper()

re.sub(r'\w+', f, 'foo.10.bar.20.baz.30')

#In this example, f() gets called for each match. As a result, 
# re.sub() converts each alphanumeric portion of <string> to all uppercase and multiplies each numeric portion by 10.

'FOO.100.BAR.200.BAZ.300'

## Utility Functions

### re.split( )  Splits a string into substrings using a regex as a delimiter

In [65]:
# The following example splits the specified string into substrings delimited 
# by a comma (,), semicolon (;), or slash (/) character, surrounded by any amount of whitespace:

re.split(r'\s*[,;/]\s*', 'foo,bar  ;  baz / qux')

['foo', 'bar', 'baz', 'qux']

In [None]:
#  If <regex> contains capturing groups, then the return list includes the matching delimiter strings as well:

re.split(r'(\s*[,;/]\s*)', 'foo,bar  ;  baz / qux')

#This time, the return list contains not only the substrings 'foo', 'bar', 'baz', and 'qux'
# but also several delimiter strings:

# ','
# '  ;  '
# ' / '

In [67]:
string = 'foo,bar  ;  baz / qux'
regex = r'(\s*[,;/]\s*)'
a = re.split(regex, string)
# List of tokens and delimiters
a


# Enclose each token in <>'s
for i, s in enumerate(a):

    # This will be True for the tokens but not the delimiters
    if not re.fullmatch(regex, s):
        a[i] = f'<{s}>'


# Put the tokens back together using the same delimiters
''.join(a)


'<foo>,<bar>  ;  <baz> / <qux>'

In [68]:
s = 'foo, bar, baz, qux, quux, corge'

re.split(r',\s*', s) # split at each comma and any whitespace after it (\s* matches zero or more whitespace characters)

['foo', 'bar', 'baz', 'qux', 'quux', 'corge']

In [70]:
re.split(r',\s*', s, maxsplit=3) # maxsplit=3: at most three splits are made; the rest of the string is returned as the final element

['foo', 'bar', 'baz', 'qux, quux, corge']

In [71]:
re.split(r',\s*', s, maxsplit=-3) # If maxsplit is negative, no splits are made and re.split() returns the whole string as a single-element list

['foo, bar, baz, qux, quux, corge']

In [72]:
re.split(r',\s*', s, maxsplit=0) # Explicitly specifying maxsplit=0 is equivalent to omitting it entirely.

['foo', 'bar', 'baz', 'qux', 'quux', 'corge']

In [73]:
re.split('(/)', '/foo/bar/baz/') 
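# If <regex> contains capturing groups so that the return list includes delimiters, and <regex> matches at the start of
# <string>, then re.split() places an empty string as the first element in the return list.
# The same happens at the end: a match at the end of <string> produces an empty string as the last element.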

['', '/', 'foo', '/', 'bar', '/', 'baz', '/', '']
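
If those empty strings are not wanted, one simple option is to filter them out afterwards. The sketch below shows this for the same input, with and without the capturing group; the variable names are illustrative.

```python
import re

# With a capturing group: delimiters are kept, empty strings are filtered out
parts = re.split(r'(/)', '/foo/bar/baz/')
non_empty = [p for p in parts if p]

# Without a capturing group: only the tokens remain
tokens = [p for p in re.split(r'/', '/foo/bar/baz/') if p]

print(non_empty)  # ['/', 'foo', '/', 'bar', '/', 'baz', '/']
print(tokens)     # ['foo', 'bar', 'baz']
```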

### re.escape(```<regex>```) returns a copy of ```<regex>``` with each character that has a special meaning in a regex preceded by a backslash (before Python 3.7, every character other than a letter, digit, or underscore was escaped)

In [74]:
print(re.match('foo^bar(baz)|qux', 'foo^bar(baz)|qux'))
#  the regex 'foo^bar(baz)|qux' contains special characters that behave as metacharacters

None


In [75]:
re.match(r'foo\^bar\(baz\)\|qux', 'foo^bar(baz)|qux')
#  they’re explicitly escaped with backslashes, so a match occurs.

<re.Match object; span=(0, 16), match='foo^bar(baz)|qux'>

In [77]:
re.escape('foo^bar(baz)|qux') == r'foo\^bar\(baz\)\|qux'

True

In [78]:
re.match(re.escape('foo^bar(baz)|qux'), 'foo^bar(baz)|qux')

<re.Match object; span=(0, 16), match='foo^bar(baz)|qux'>
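
A typical use for re.escape() is searching for a literal string that comes from somewhere else (user input, a file) and may contain metacharacters. The sketch below assumes a hypothetical search term '1.5'; the sample text is made up.

```python
import re

text = '1.5 105 1x5'
literal = '1.5'  # hypothetical user-supplied search term

# Unescaped, the dot is a metacharacter that matches any character
unescaped = re.findall(literal, text)

# Escaped, the dot only matches a literal '.'
escaped = re.findall(re.escape(literal), text)

print(unescaped)  # ['1.5', '105', '1x5']
print(escaped)    # ['1.5']
```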

## Compiled Regex Objects in Python

The re module supports the capability to precompile a regex in Python into a regular expression object that can be repeatedly used later.

re.compile(```<regex>```, flags=0)

In [80]:
my_regex = re.compile(r'([0-9]+)')
my_regex

re.compile(r'([0-9]+)', re.UNICODE)

In [82]:
my_regex.search('foo123bar456')

<re.Match object; span=(3, 6), match='123'>

In [83]:
my_regex.findall('foo123bar456')

['123', '456']

In [85]:
# my_regex takes the place of re. here: the compiled object already contains the regex, so only the replacement and the string are passed.
my_regex.sub(r'...', 'foo123bar456')

'foo...bar...'
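
Flags can be passed once at compile time and then apply to every later call on the compiled object. The sketch below assumes a hypothetical pattern and a few made-up input strings, and reuses one compiled regex across a list.

```python
import re

# Hypothetical pattern: runs of letters, matched case-insensitively
word_regex = re.compile(r'[a-z]+', re.IGNORECASE)

lines = ['Foo123', 'BAR456baz', '789']

# The same compiled object is reused for every line
words_per_line = [word_regex.findall(line) for line in lines]

print(words_per_line)  # [['Foo'], ['BAR', 'baz'], []]
```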

## Regular Expressions on Iris Dataset

In [126]:
# https://stackoverflow.com/a/1393367

import urllib.request

url = r'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'

iris = [line.decode('utf-8').strip() for line in urllib.request.urlopen(url)]
# the line above is a list comprehension: each line is decoded from bytes to text and stripped of surrounding whitespace

iris 

['5.1,3.5,1.4,0.2,Iris-setosa',
 '4.9,3.0,1.4,0.2,Iris-setosa',
 '4.7,3.2,1.3,0.2,Iris-setosa',
 '4.6,3.1,1.5,0.2,Iris-setosa',
 '5.0,3.6,1.4,0.2,Iris-setosa',
 '5.4,3.9,1.7,0.4,Iris-setosa',
 '4.6,3.4,1.4,0.3,Iris-setosa',
 '5.0,3.4,1.5,0.2,Iris-setosa',
 '4.4,2.9,1.4,0.2,Iris-setosa',
 '4.9,3.1,1.5,0.1,Iris-setosa',
 '5.4,3.7,1.5,0.2,Iris-setosa',
 '4.8,3.4,1.6,0.2,Iris-setosa',
 '4.8,3.0,1.4,0.1,Iris-setosa',
 '4.3,3.0,1.1,0.1,Iris-setosa',
 '5.8,4.0,1.2,0.2,Iris-setosa',
 '5.7,4.4,1.5,0.4,Iris-setosa',
 '5.4,3.9,1.3,0.4,Iris-setosa',
 '5.1,3.5,1.4,0.3,Iris-setosa',
 '5.7,3.8,1.7,0.3,Iris-setosa',
 '5.1,3.8,1.5,0.3,Iris-setosa',
 '5.4,3.4,1.7,0.2,Iris-setosa',
 '5.1,3.7,1.5,0.4,Iris-setosa',
 '4.6,3.6,1.0,0.2,Iris-setosa',
 '5.1,3.3,1.7,0.5,Iris-setosa',
 '4.8,3.4,1.9,0.2,Iris-setosa',
 '5.0,3.0,1.6,0.2,Iris-setosa',
 '5.0,3.4,1.6,0.4,Iris-setosa',
 '5.2,3.5,1.5,0.2,Iris-setosa',
 '5.2,3.4,1.4,0.2,Iris-setosa',
 '4.7,3.2,1.6,0.2,Iris-setosa',
 '4.8,3.1,1.6,0.2,Iris-setosa',
 '5.4,3.

In [127]:
strip_iris = re.compile(r'([0-9]\.[0-9]),([0-9]\.[0-9]),([0-9]\.[0-9]),([0-9]\.[0-9]),Iris-([a-z]+)')

In [135]:
[strip_iris.sub(r'(\5,\4,\3,\2,\1', line) for line in iris if line]

['(setosa,0.2,1.4,3.5,5.1',
 '(setosa,0.2,1.4,3.0,4.9',
 '(setosa,0.2,1.3,3.2,4.7',
 '(setosa,0.2,1.5,3.1,4.6',
 '(setosa,0.2,1.4,3.6,5.0',
 '(setosa,0.4,1.7,3.9,5.4',
 '(setosa,0.3,1.4,3.4,4.6',
 '(setosa,0.2,1.5,3.4,5.0',
 '(setosa,0.2,1.4,2.9,4.4',
 '(setosa,0.1,1.5,3.1,4.9',
 '(setosa,0.2,1.5,3.7,5.4',
 '(setosa,0.2,1.6,3.4,4.8',
 '(setosa,0.1,1.4,3.0,4.8',
 '(setosa,0.1,1.1,3.0,4.3',
 '(setosa,0.2,1.2,4.0,5.8',
 '(setosa,0.4,1.5,4.4,5.7',
 '(setosa,0.4,1.3,3.9,5.4',
 '(setosa,0.3,1.4,3.5,5.1',
 '(setosa,0.3,1.7,3.8,5.7',
 '(setosa,0.3,1.5,3.8,5.1',
 '(setosa,0.2,1.7,3.4,5.4',
 '(setosa,0.4,1.5,3.7,5.1',
 '(setosa,0.2,1.0,3.6,4.6',
 '(setosa,0.5,1.7,3.3,5.1',
 '(setosa,0.2,1.9,3.4,4.8',
 '(setosa,0.2,1.6,3.0,5.0',
 '(setosa,0.4,1.6,3.4,5.0',
 '(setosa,0.2,1.5,3.5,5.2',
 '(setosa,0.2,1.4,3.4,5.2',
 '(setosa,0.2,1.6,3.2,4.7',
 '(setosa,0.2,1.6,3.1,4.8',
 '(setosa,0.4,1.5,3.4,5.4',
 '(setosa,0.1,1.5,4.1,5.2',
 '(setosa,0.2,1.4,4.2,5.5',
 '(setosa,0.1,1.5,3.1,4.9',
 '(setosa,0.2,1.2,3.
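
The same capturing groups can also be read directly from the match object instead of being rearranged with sub(). The sketch below is one possible way to turn lines into typed records, using two rows copied from the data above plus an empty string; the names `iris_line`, `sample_lines` and `records` are illustrative, not part of the original notebook.

```python
import re

iris_line = re.compile(
    r'([0-9]\.[0-9]),([0-9]\.[0-9]),([0-9]\.[0-9]),([0-9]\.[0-9]),Iris-([a-z]+)'
)

# Two rows copied from the data shown above, plus an empty string
sample_lines = ['5.1,3.5,1.4,0.2,Iris-setosa', '4.9,3.0,1.4,0.2,Iris-setosa', '']

records = []
for line in sample_lines:
    m = iris_line.fullmatch(line)
    if m is None:
        continue  # skip blank or malformed lines
    # groups() returns the five captured strings in order
    *measurements, species = m.groups()
    records.append(([float(x) for x in measurements], species))

print(records)
# [([5.1, 3.5, 1.4, 0.2], 'setosa'), ([4.9, 3.0, 1.4, 0.2], 'setosa')]
```

Using fullmatch() here means a line with anything unexpected before or after the five fields is skipped rather than partially matched.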

## Exercise 2
***

Remember to do these exercises in your own notebook in your assessment repository.

Adapt the above code to capitalise the first letter of the iris species, using regular expressions.

In [208]:
# Using the above code to remove Iris- from class names
strip_iris = re.compile(r'([0-9]\.[0-9]),([0-9]\.[0-9]),([0-9]\.[0-9]),([0-9]\.[0-9]),Iris-([a-z]+)')

In [209]:
#Assigning the substituted content to a new list (new_iris)
new_iris = [strip_iris.sub(r'(\5,\4,\3,\2,\1', line) for line in iris if line]

In [210]:
# Defining a function that capitalizes a matched string if it is alphabetic (a-z):
def f(match_obj):
    s = match_obj.group()
    if s.isalpha():
        return s.capitalize()
    return s  # leave any non-alphabetic match unchanged

In [218]:
#Use re.sub to replace each run of lowercase letters with its capitalized form (first character upper case)
[re.sub(r'[a-z]+', f, line) for line in new_iris]

['(Setosa,0.2,1.4,3.5,5.1',
 '(Setosa,0.2,1.4,3.0,4.9',
 '(Setosa,0.2,1.3,3.2,4.7',
 '(Setosa,0.2,1.5,3.1,4.6',
 '(Setosa,0.2,1.4,3.6,5.0',
 '(Setosa,0.4,1.7,3.9,5.4',
 '(Setosa,0.3,1.4,3.4,4.6',
 '(Setosa,0.2,1.5,3.4,5.0',
 '(Setosa,0.2,1.4,2.9,4.4',
 '(Setosa,0.1,1.5,3.1,4.9',
 '(Setosa,0.2,1.5,3.7,5.4',
 '(Setosa,0.2,1.6,3.4,4.8',
 '(Setosa,0.1,1.4,3.0,4.8',
 '(Setosa,0.1,1.1,3.0,4.3',
 '(Setosa,0.2,1.2,4.0,5.8',
 '(Setosa,0.4,1.5,4.4,5.7',
 '(Setosa,0.4,1.3,3.9,5.4',
 '(Setosa,0.3,1.4,3.5,5.1',
 '(Setosa,0.3,1.7,3.8,5.7',
 '(Setosa,0.3,1.5,3.8,5.1',
 '(Setosa,0.2,1.7,3.4,5.4',
 '(Setosa,0.4,1.5,3.7,5.1',
 '(Setosa,0.2,1.0,3.6,4.6',
 '(Setosa,0.5,1.7,3.3,5.1',
 '(Setosa,0.2,1.9,3.4,4.8',
 '(Setosa,0.2,1.6,3.0,5.0',
 '(Setosa,0.4,1.6,3.4,5.0',
 '(Setosa,0.2,1.5,3.5,5.2',
 '(Setosa,0.2,1.4,3.4,5.2',
 '(Setosa,0.2,1.6,3.2,4.7',
 '(Setosa,0.2,1.6,3.1,4.8',
 '(Setosa,0.4,1.5,3.4,5.4',
 '(Setosa,0.1,1.5,4.1,5.2',
 '(Setosa,0.2,1.4,4.2,5.5',
 '(Setosa,0.1,1.5,3.1,4.9',
 '(Setosa,0.2,1.2,3.
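
The two steps above (strip the prefix, then capitalise) could also be folded into a single substitution. The sketch below is one possible alternative, not the only solution: the pattern captures just the first letter after "Iris-" and a lambda returns it in upper case. It is shown on two rows copied from the data above.

```python
import re

sample_lines = ['5.1,3.5,1.4,0.2,Iris-setosa', '4.9,3.0,1.4,0.2,Iris-setosa']

# 'Iris-' plus the first letter is matched; only the upper-cased letter is put back
capitalised = [
    re.sub(r'Iris-([a-z])', lambda m: m.group(1).upper(), line)
    for line in sample_lines
]

print(capitalised)
# ['5.1,3.5,1.4,0.2,Setosa', '4.9,3.0,1.4,0.2,Setosa']
```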

In [205]:
#Creating a new version of Iris that keeps the original column order
# Compiling the regex that captures every field of each line, with the "Iris-" prefix left outside the groups
iris_version = re.compile(r'([0-9]\.[0-9]),([0-9]\.[0-9]),([0-9]\.[0-9]),([0-9]\.[0-9]),Iris-([a-z]+)')

In [216]:
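# assigning the new content to a new list
# (note: the name `list` shadows Python's built-in list type for the rest of the notebook; a different name would be safer)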
list = [iris_version.sub(r'\1,\2,\3,\4,\5', line) for line in iris if line]
list

['5.1,3.5,1.4,0.2,setosa',
 '4.9,3.0,1.4,0.2,setosa',
 '4.7,3.2,1.3,0.2,setosa',
 '4.6,3.1,1.5,0.2,setosa',
 '5.0,3.6,1.4,0.2,setosa',
 '5.4,3.9,1.7,0.4,setosa',
 '4.6,3.4,1.4,0.3,setosa',
 '5.0,3.4,1.5,0.2,setosa',
 '4.4,2.9,1.4,0.2,setosa',
 '4.9,3.1,1.5,0.1,setosa',
 '5.4,3.7,1.5,0.2,setosa',
 '4.8,3.4,1.6,0.2,setosa',
 '4.8,3.0,1.4,0.1,setosa',
 '4.3,3.0,1.1,0.1,setosa',
 '5.8,4.0,1.2,0.2,setosa',
 '5.7,4.4,1.5,0.4,setosa',
 '5.4,3.9,1.3,0.4,setosa',
 '5.1,3.5,1.4,0.3,setosa',
 '5.7,3.8,1.7,0.3,setosa',
 '5.1,3.8,1.5,0.3,setosa',
 '5.4,3.4,1.7,0.2,setosa',
 '5.1,3.7,1.5,0.4,setosa',
 '4.6,3.6,1.0,0.2,setosa',
 '5.1,3.3,1.7,0.5,setosa',
 '4.8,3.4,1.9,0.2,setosa',
 '5.0,3.0,1.6,0.2,setosa',
 '5.0,3.4,1.6,0.4,setosa',
 '5.2,3.5,1.5,0.2,setosa',
 '5.2,3.4,1.4,0.2,setosa',
 '4.7,3.2,1.6,0.2,setosa',
 '4.8,3.1,1.6,0.2,setosa',
 '5.4,3.4,1.5,0.4,setosa',
 '5.2,4.1,1.5,0.1,setosa',
 '5.5,4.2,1.4,0.2,setosa',
 '4.9,3.1,1.5,0.1,setosa',
 '5.0,3.2,1.2,0.2,setosa',
 '5.5,3.5,1.3,0.2,setosa',
 

In [217]:
# Using the previously created function to capitalize the Iris classes
[re.sub(r'[a-z]+', f, line) for line in list]

['5.1,3.5,1.4,0.2,Setosa',
 '4.9,3.0,1.4,0.2,Setosa',
 '4.7,3.2,1.3,0.2,Setosa',
 '4.6,3.1,1.5,0.2,Setosa',
 '5.0,3.6,1.4,0.2,Setosa',
 '5.4,3.9,1.7,0.4,Setosa',
 '4.6,3.4,1.4,0.3,Setosa',
 '5.0,3.4,1.5,0.2,Setosa',
 '4.4,2.9,1.4,0.2,Setosa',
 '4.9,3.1,1.5,0.1,Setosa',
 '5.4,3.7,1.5,0.2,Setosa',
 '4.8,3.4,1.6,0.2,Setosa',
 '4.8,3.0,1.4,0.1,Setosa',
 '4.3,3.0,1.1,0.1,Setosa',
 '5.8,4.0,1.2,0.2,Setosa',
 '5.7,4.4,1.5,0.4,Setosa',
 '5.4,3.9,1.3,0.4,Setosa',
 '5.1,3.5,1.4,0.3,Setosa',
 '5.7,3.8,1.7,0.3,Setosa',
 '5.1,3.8,1.5,0.3,Setosa',
 '5.4,3.4,1.7,0.2,Setosa',
 '5.1,3.7,1.5,0.4,Setosa',
 '4.6,3.6,1.0,0.2,Setosa',
 '5.1,3.3,1.7,0.5,Setosa',
 '4.8,3.4,1.9,0.2,Setosa',
 '5.0,3.0,1.6,0.2,Setosa',
 '5.0,3.4,1.6,0.4,Setosa',
 '5.2,3.5,1.5,0.2,Setosa',
 '5.2,3.4,1.4,0.2,Setosa',
 '4.7,3.2,1.6,0.2,Setosa',
 '4.8,3.1,1.6,0.2,Setosa',
 '5.4,3.4,1.5,0.4,Setosa',
 '5.2,4.1,1.5,0.1,Setosa',
 '5.5,4.2,1.4,0.2,Setosa',
 '4.9,3.1,1.5,0.1,Setosa',
 '5.0,3.2,1.2,0.2,Setosa',
 '5.5,3.5,1.3,0.2,Setosa',
 