# Data Cleansing 

Code Referenced: 

Ian Mc Loughlin: Cleansing

https://docs.python.org/3/library/re.html

https://realpython.com/regex-python/

https://developers.google.com/edu/python/regular-expressions

***

In [32]:
# Regular Expression

import re

In [33]:
# \w matches a "word" character: a letter, digit or underscore.
# For ASCII text it is equivalent to the character class below.
word_char = r'[a-zA-Z0-9_]'

# \W matches any non-word character, the negation of the class above.
non_word_char = r'[^a-zA-Z0-9_]'

In [None]:
# A string to be manipulated.
original = 'Words, words, words.'

# The pattern/regular expression to use on the above string.
pattern = r'\W+'

# Splits a string into substrings using a regular expression.
result = re.split(pattern, original)

# Print the result.
print(result)

In [None]:
# A string to be manipulated.
original = 'Words, words, words.'

# The pattern/regular expression to use on the above string.
pattern = r'(\W+)'

# Splits a string into substrings using a regular expression.
result = re.split(pattern, original)

# Print the result.
print(result)
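
The two cells above differ only in the parentheses around `\W+`. As a minimal sketch of the difference (the variable names `without_group` and `with_group` are illustrative, not from the referenced material): with a capturing group in the pattern, `re.split` also returns the separators it matched.

```python
import re

original = 'Words, words, words.'

# Without a capturing group, the matched separators are discarded.
without_group = re.split(r'\W+', original)

# With a capturing group, each matched separator is kept in the list.
with_group = re.split(r'(\W+)', original)

print(without_group)  # ['Words', 'words', 'words', '']
print(with_group)     # ['Words', ', ', 'words', ', ', 'words', '.', '']
```

The trailing empty string appears in both results because the final `.` is itself a separator with nothing after it.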

In [None]:
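# Split at most once; maxsplit is passed by keyword, since passing it positionally is deprecated in recent Python versions.
re.split(r'\W+', 'Words, words, words.', maxsplit=1)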


In [None]:
re.split('[a-f]+', '0a3B9', flags=re.IGNORECASE)

## Real Python

***

In [None]:
'abccba' == 'abccba'

In [None]:
'abccba' == 'cbaabc'


In [None]:
'abc' in 'cbaabc'


In [None]:
'cbaabc'.index('a')

In [None]:
'cbaabc'[2]

In [None]:
'cbaaabc'.find('aa')

In [None]:
s = 'foo123bar'

re.search('123', s)

In [None]:
s[3:6]

In [None]:
re.search(r'[0-9][0-9][0-9]', 'foo456bar')

In [None]:
re.search(r'[0-9][0-9][0-9]', '234baz')

In [None]:
re.search(r'[0-9][0-9][0-9]', 'qux678')

In [None]:
print(re.search(r'[0-9][0-9][0-9]', '12foo34'))

In [None]:
re.search(r'[0-9]{3}', 'qux678')
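
The searches above can be gathered into one small comparison. This is only a sketch: the helper names `three_digits_long` and `three_digits_short` are hypothetical, chosen here to show that `[0-9]{3}` is shorthand for repeating `[0-9]` three times.

```python
import re

def three_digits_long(s):
    """Hypothetical helper: search using the repeated character class."""
    match = re.search(r'[0-9][0-9][0-9]', s)
    return match.group() if match else None

def three_digits_short(s):
    """Hypothetical helper: search using the {3} quantifier instead."""
    match = re.search(r'[0-9]{3}', s)
    return match.group() if match else None

# The same four strings used in the cells above.
samples = ['foo456bar', '234baz', 'qux678', '12foo34']

for s in samples:
    # Both forms should agree on every input.
    print(s, three_digits_long(s), three_digits_short(s))
```

The last sample, `'12foo34'`, contains four digits but never three in a row, so both helpers return `None` for it.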

## Google for Education

***

In [None]:
import re

# Named text rather than str, to avoid shadowing the built-in str type.
text = 'an example word:cat!!'
match = re.search(r'word:\w\w\w', text)
# The if-statement after search() tests whether it succeeded.
if match:
    print('found', match.group())  # prints: found word:cat
else:
    print('did not find')

In [None]:
string = 'aaaabaa'
pattern = r'a+'

re.search(pattern, string)

In [None]:
string = 'aaaabaa'
pattern = r'a*'

re.search(pattern, string)
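
On `'aaaabaa'` both `a+` and `a*` match the leading `'aaaa'`, so the two cells above look identical. A short sketch with a different input (the string `'baaa'` is chosen here for illustration) shows where they diverge: `a*` may match zero characters, so it succeeds with an empty match at the very start of the string.

```python
import re

# 'a+' needs at least one 'a', so it skips the leading 'b'.
plus_match = re.search(r'a+', 'baaa')

# 'a*' accepts zero 'a' characters, so it matches the empty string at index 0.
star_match = re.search(r'a*', 'baaa')

print(plus_match.span(), repr(plus_match.group()))  # (1, 4) 'aaa'
print(star_match.span(), repr(star_match.group()))  # (0, 0) ''
```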

In [None]:
## i+ = one or more i's, as many as possible.
match = re.search(r'pi+', 'piiig') # found, match.group() == "piii"
match

In [None]:
## Finds the first/leftmost solution, and within it drives the +
## as far as possible (aka 'leftmost and largest').
## In this example, note that it does not get to the second set of i's.
match = re.search(r'i+', 'piigiiii') # found, match.group() == "ii"
match

In [None]:
## \s* = zero or more whitespace chars
## Here look for 3 digits, possibly separated by whitespace.
match = re.search(r'\d\s*\d\s*\d', 'xx1 2   3xx') # found, match.group() == "1 2   3"
print(match)
match = re.search(r'\d\s*\d\s*\d', 'xx12  3xx') # found, match.group() == "12  3"
print(match)
match = re.search(r'\d\s*\d\s*\d', 'xx123xx') # found, match.group() == "123"
print(match)

In [None]:
## ^ = matches the start of string, so this fails:
match = re.search(r'^b\w+', 'foobar') # not found, match is None
match

In [None]:
## but without the ^ it succeeds:
match = re.search(r'b\w+', 'foobar') # found, match.group() == "bar"
match

In [34]:
# Named text rather than str, to avoid shadowing the built-in str type.
text = 'purple alice-b@google.com monkey dishwasher'
# \w does not match '-' or '.', so the match stops at those characters.
match = re.search(r'\w+@\w+', text)
if match:
    print(match.group())  ## 'b@google'


b@google


In [35]:
# Square Brackets
# For the emails problem, square brackets are an easy way to add '.' and '-' to the set of characters
# that can appear around the @. The pattern r'[\w.-]+@[\w.-]+' matches the whole email address:

match = re.search(r'[\w.-]+@[\w.-]+', text)
if match:
    print(match.group())  ## 'alice-b@google.com'

alice-b@google.com


In [37]:
# Group Extraction
# The "group" feature of a regular expression lets you pick out parts of the matching text, using ().

# Named text rather than str, to avoid shadowing the built-in str type.
text = 'purple alice-b@google.com monkey dishwasher'
match = re.search(r'([\w.-]+)@([\w.-]+)', text)
if match:
    print(match.group())   ## 'alice-b@google.com' (the whole match)
    print(match.group(1))  ## 'alice-b' (the username, group 1)
    print(match.group(2))  ## 'google.com' (the host, group 2)

alice-b@google.com
alice-b
google.com


In [38]:
## Findall

# findall() finds *all* the matches and returns them as a list of strings, one string per match.

## Suppose we have a text with many email addresses.
## It is named text rather than str, to avoid shadowing the built-in str type.
text = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'

## Here re.findall() returns a list of all the found email strings.
## Inside square brackets '.' is already literal, so it needs no backslash.
emails = re.findall(r'[\w.-]+@[\w.-]+', text) ## ['alice@google.com', 'bob@abc.com']
for email in emails:
    # do something with each found email string
    print(email)



alice@google.com
bob@abc.com
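
Group extraction and `findall()` can be combined. As a sketch, reusing the email pattern from the group extraction cell (the variable name `pairs` is illustrative): when the pattern contains two or more groups, `re.findall()` returns a list of tuples, one tuple per match, holding the text of each group.

```python
import re

text = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'

# Two capturing groups: group 1 is the username, group 2 is the host.
pairs = re.findall(r'([\w.-]+)@([\w.-]+)', text)

for username, host in pairs:
    print(username, host)
```

For the text above this yields `('alice', 'google.com')` and `('bob', 'abc.com')`.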


## Exercise 1

Remember to do these exercises in your own notebook in your assessment repository.

Write a Python function to remove all non-alphanumeric characters from a string.
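
One possible approach, offered as a sketch rather than a model answer: use `re.sub` to replace every character outside an explicit alphanumeric class with the empty string. The function name `remove_non_alphanumeric` is hypothetical, and the sketch assumes "alphanumeric" means ASCII letters and digits only. Note that `\W` alone would not be enough here, because `\w` also matches the underscore.

```python
import re

def remove_non_alphanumeric(text):
    """Return text with every character outside a-z, A-Z and 0-9 removed.

    Assumption: only ASCII letters and digits count as alphanumeric.
    """
    # [^a-zA-Z0-9] matches any single character that is not a letter or digit.
    return re.sub(r'[^a-zA-Z0-9]', '', text)

print(remove_non_alphanumeric('Words, words, words.'))  # Wordswordswords
print(remove_non_alphanumeric('snake_case-123!'))       # snakecase123
```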