## String Manipulation and Regular Expressions

Strings in Python can be defined with either single or double quotes; the two forms are functionally equivalent:

In [1]:
x = 'a string'
y = "a string"
x == y

True

It is possible to define multiline strings using a triple-quote syntax:

In [6]:
multiline = """
 one 
 two
 three
"""

In [7]:
multiline

'\n one \n two\n three\n'

### Simple String Manipulation in Python

Python makes it quite easy to adjust the case of a string. Here we'll look at the upper(), lower(), capitalize(), title(), and swapcase() methods, using the following messy string as an example:

In [8]:
fox = "tHe qUICk bROWn fOx."

In [9]:
fox.upper()

'THE QUICK BROWN FOX.'

In [11]:
fox.lower()

'the quick brown fox.'

A common formatting need is to capitalize just the first letter of each word, or perhaps the first letter of each sentence. This can be done with the title() and capitalize() methods:

In [12]:
fox.title()

'The Quick Brown Fox.'

In [15]:
fox.capitalize()  # capitalize only the first letter of the string

'The quick brown fox.'

In [13]:
fox.swapcase()

'ThE QuicK BrowN FoX.'

Another common need is to remove spaces (or other characters) from the beginning or end of the string. The basic method of removing characters is the strip() method, which strips whitespace from the beginning and end of the line:

In [17]:
line = '        this is the content        '

In [18]:
line.strip()

'this is the content'

In [19]:
line

'        this is the content        '

In [20]:
line.rstrip()

'        this is the content'

In [21]:
line.lstrip()

'this is the content        '

In [26]:
num = "0000000000004035"
num.strip('0')

'4035'

The opposite of this operation, adding spaces or other characters, can be accomplished using the center(), ljust(), and rjust() methods.

In [27]:
line = "this is the content"
line.center(30)

'     this is the content      '

Similarly, ljust() and rjust() will left-justify or right-justify the string within spaces of a given length:

In [29]:
line.ljust(30)  # left-justify the string

'this is the content           '

In [31]:
line.rjust(30)

'           this is the content'

In [32]:
'435'.rjust(10, '0')

'0000000435'

In [33]:
'435'.zfill(10)

'0000000435'

If you want to find occurrences of a certain character in a string, the find()/rfind(), index()/rindex(), and replace() methods are the best built-in methods.

find() and index() are very similar, in that they search for the first occurrence of a character or substring within a string, and return the index of the substring:

In [34]:
line = 'the quick brown fox jumped over the lazy dog'
line.find('fox')

16

In [35]:
line.index('fox')

16

The only difference between find() and index() is their behavior when the search string is not found; find() returns -1, while index() raises a ValueError:

In [36]:
line.find('bear')

-1

In [37]:
line.index('bear')

ValueError: substring not found

In [38]:
debug

> <ipython-input-37-adf599eb4c79>(1)<module>()
----> 1 line.index('bear')

ipdb> 'bear' in line
False
ipdb> q

The related rfind() and rindex() work similarly, except they search from the end of the string rather than the beginning, so they return the index of the last occurrence.
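As a small illustration of the reverse searches, here is a sketch that reuses the line string from above. The names first_the, last_the, and last_o are introduced only for this example.

```python
line = 'the quick brown fox jumped over the lazy dog'

# find() reports the first 'the'; rfind() reports the last one
first_the = line.find('the')
last_the = line.rfind('the')
print(first_the, last_the)

# rindex() behaves like rfind(), but raises ValueError when nothing is found
last_o = line.rindex('o')
print(last_o)
```

As with find() and index(), the choice between rfind() and rindex() mostly depends on whether a missing substring should be a sentinel value (-1) or an exception.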

For the special case of checking for a substring at the beginning or end of a string, Python provides the startswith() and endswith() methods:

In [40]:
line.endswith('dog')

True

In [41]:
line.startswith('fox')

False

To go one step further and replace a given substring with a new string, you can use the replace() method. For example, line.replace('brown', 'red') swaps 'brown' for 'red'.
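A minimal sketch of that replacement, assuming line still holds the sentence defined earlier (the names replaced and all_replaced are only for illustration):

```python
line = 'the quick brown fox jumped over the lazy dog'

# replace() returns a new string; the original is left unchanged
replaced = line.replace('brown', 'red')
print(replaced)

# Every occurrence of the substring is replaced, not just the first
all_replaced = line.replace('o', '--')
print(all_replaced)
```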

### Splitting and partitioning strings

If you would like to find a substring and then split the string based on its location, the partition() and/or split() methods are what you're looking for. Both will return a sequence of substrings.

The partition() method returns a tuple with three elements: the substring before the first instance of the split-point, the split-point itself, and the substring after:

In [42]:
line.partition('fox')

('the quick brown ', 'fox', ' jumped over the lazy dog')

The rpartition() method is similar, but searches from the right of the string.
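To make the difference concrete, the sketch below partitions the same sentence around 'the', which occurs twice. The names head and tail are hypothetical and used only here.

```python
line = 'the quick brown fox jumped over the lazy dog'

# partition() splits at the first 'the', rpartition() at the last one
head = line.partition('the')
tail = line.rpartition('the')
print(head)
print(tail)
```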

The split() method is perhaps more useful; it finds all instances of the split-point and returns the substrings in between. The default is to split on any whitespace, returning a list of the individual words in a string.
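A short sketch of both behaviors, assuming the same line string as before (words and fields are illustrative names):

```python
line = 'the quick brown fox jumped over the lazy dog'

# With no argument, split() breaks the string on runs of whitespace
words = line.split()
print(words)

# With an explicit separator, consecutive separators yield empty strings
fields = 'a,b,,c'.split(',')
print(fields)
```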

A related method is splitlines(), which splits on newline characters.

In [43]:
haiku = """matsushima-ya
aah matsushima-ya
matsushima-ya"""

In [44]:
haiku.splitlines()

['matsushima-ya', 'aah matsushima-ya', 'matsushima-ya']

The join() method does the opposite of split(): it builds a single string from a list of strings, placing the separator between them:

In [46]:
'--'.join(['1', '2', '3'])

'1--2--3'

In [48]:
print("\n".join(Out[44]))

matsushima-ya
aah matsushima-ya
matsushima-ya


### Format Strings

Another use of string methods is to manipulate string representations of values of other types. Of course, string representations can always be found using the str() function; for example:

In [49]:
import math

In [51]:
pie = math.pi

In [53]:
str(pie)

'3.141592653589793'

In [54]:
"The value of pie is " + str(pie)

'The value of pie is 3.141592653589793'

In [55]:
"The value of pie is {}".format(pie)

'The value of pie is 3.141592653589793'

In [56]:
"""First letter: {0}. Last letter: {1}.""".format('A','z')

'First letter: A. Last letter: z.'

In [57]:
"""First letter: {first}. Last letter: {last}.""".format(last='Z', first='A')

'First letter: A. Last letter: Z.'

In [60]:
"pi = {0:.3f} and sin2 = {1:.4f}".format(pie,math.sin(2))

'pi = 3.142 and sin2 = 0.9093'

As before, here the "0" refers to the index of the value to be inserted. The ":" marks that format codes will follow. The ".3f" encodes the desired precision: three digits beyond the decimal point, floating-point format.
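The same format codes also work in formatted string literals (f-strings), which assume Python 3.6 or later. This is an alternative to format(), sketched here with the same pie variable; the name formatted is only for this example.

```python
import math

pie = math.pi

# The text after ':' inside the braces is the same format code used with format()
formatted = f"pi = {pie:.3f} and sin2 = {math.sin(2):.4f}"
print(formatted)
```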

### Flexible Pattern Matching with Regular Expressions

The Python interface to regular expressions is contained in the built-in re module; as a simple example, let's use it to duplicate the functionality of the string split() method:

In [64]:
import re
regex = re.compile(r'\s+')  # raw string keeps the backslash intact
regex.split(line)

['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']

In [65]:
line2 = 'The           quick broen       fox jumped over the dog'

In [66]:
line2

'The           quick broen       fox jumped over the dog'

In [67]:
regex.split(line2)

['The', 'quick', 'broen', 'fox', 'jumped', 'over', 'the', 'dog']

In [68]:
line2.split()

['The', 'quick', 'broen', 'fox', 'jumped', 'over', 'the', 'dog']

Just as Python's split() method returns a list of all substrings between whitespace, the regular expression split() method returns a list of all substrings between matches to the input pattern.

In this case, the input is "\s+": "\s" is a special character that matches any whitespace (space, tab, newline, etc.), and the "+" is a character that indicates one or more of the entity preceding it. Thus, the regular expression matches any substring consisting of one or more whitespace characters.

The split() method here is basically a convenience routine built upon this pattern matching behavior; more fundamental are the match() method, which tells you whether the beginning of a string matches the pattern, and the search() method, which finds the first match anywhere in the string:

In [75]:
regex = re.compile('fox')
match = regex.search(line)
match.start()

16

In [76]:
line_ex = 'mera fox tera fox kyun karta h fox fox'
match1 = regex.search(line_ex)
match1

<_sre.SRE_Match object; span=(5, 8), match='fox'>

In [79]:
match2 = regex.findall(line_ex)
match2

['fox', 'fox', 'fox', 'fox']

In [85]:
regex = re.compile(r"\s+")
for s in ["     ", "abc  ", "  abc"]:
    if regex.match(s):
        print(repr(s), "matches")
    else:
        print(repr(s), "does not match")

'     ' matches
'abc  ' does not match
'  abc' matches


In [86]:
line

'the quick brown fox jumped over the lazy dog'

In [87]:
line.replace('fox', 'BEAR')

'the quick brown BEAR jumped over the lazy dog'

In [89]:
line2

'The           quick broen       fox jumped over the dog'

In [93]:
regex = re.compile(r'\s+')
regex.sub(' ',line2)

'The quick broen fox jumped over the dog'

In [97]:
line2

'The           quick broen       fox jumped over the dog'

In [96]:
' '.join(line2.split())

'The quick broen fox jumped over the dog'

In [98]:
email = re.compile(r'\w+@\w+\.[a-z]{3}')

In [99]:
text = "To email Guido, try guido@python.org or the older address guido@google.com."
email.findall(text)

['guido@python.org', 'guido@google.com']

In [100]:
email.sub('--@--.--',text)

'To email Guido, try --@--.-- or the older address --@--.--.'

In [101]:
email.findall('barack.obama@whitehouse.gov')

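['obama@whitehouse.gov']

The pattern misses 'barack.' because "\w" does not match a period; a more permissive pattern that handles this case is developed further below.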

### Basics of regular expression syntax

Simple strings are matched exactly:

In [102]:
regex = re.compile('ion')

In [103]:
regex.findall('Great Expectations')

['ion']

#### Some characters have special meaning

. ^ $ * + ? { } [ ] \ | ( )

We will discuss the meaning of some of these momentarily. In the meantime, you should know that if you'd like to match any of these characters directly, you can escape them with a backslash ("\"):

In [105]:
regex = re.compile(r'\$')
regex.findall("the cost is $20")

['$']

The r preface in r'\$' indicates a raw string; in standard Python strings, the backslash is used to indicate special characters. For example, a tab is indicated by "\t":

In [106]:
print('a\tb\tc')

a	b	c


In [107]:
print(r'a\tb\tc')

a\tb\tc


For this reason, whenever you use backslashes ("\") in a regular expression, it is good practice to use a raw string.

Just as the "\" character within regular expressions can escape special characters, turning them into normal characters, it can also be used to give normal characters special meaning. These special characters match specified groups of characters, and we've seen them before. In the email address regexp from before, we used the character "\w", which is a special marker matching any alphanumeric character. Similarly, in the simple split() example, we also saw "\s", a special marker indicating any whitespace character.

Putting these together, we can create a regular expression that will match any two letters/digits with whitespace between them:

In [108]:
regex = re.compile(r'\w\s\w')
regex.findall('the fox is 9 years old')

['e f', 'x i', 's 9', 's o']

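In the next example, note that '2 l' is not selected: findall() returns non-overlapping matches, and the '2' has already been consumed by the preceding match 's 2'.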

In [110]:
regex.findall('1 yes 2 little 31 nah')

['1 y', 's 2', 'e 3', '1 n']

"\d" =>	Match any digit		
"\D" =>	Match any non digit

"\s" =>	Match any whitespace		
"\S" =>	Match any non-whitespace

"\w" =>	Match any alphanumeric char		
"\W" =>	Match any non-alphanumeric char

We will explore things in more detail later

#### Square brackets match custom character groups

If the built-in character groups aren't specific enough for you, you can use square brackets to specify any set of characters you're interested in. For example, the following will match any lower-case vowel:

In [112]:
regex = re.compile('[aeiou]')
regex.split('consequential')

['c', 'ns', 'q', '', 'nt', '', 'l']

Between 'q' and 'nt' there are two adjacent vowels ('u' and 'e'), so split() returns an empty string ('') for the gap between them.

In [113]:
regex.split('goose')

['g', '', 's', '']

In [114]:
regex.split('facetious')

['f', 'c', 't', '', '', 's']

Similarly, you can use a dash to specify a range: for example, "[a-z]" will match any lower-case letter, and "[1-3]" will match any of "1", "2", or "3". For instance, you may need to extract from a document specific numerical codes that consist of a capital letter followed by a digit. You could do this as follows:

In [115]:
regex = re.compile('[A-Z][0-9]')
regex.findall('1043879, G2, H6')

['G2', 'H6']

#### Wildcards match repeated characters

If you would like to match a string with, say, three alphanumeric characters in a row, it is possible to write, for example, "\w\w\w". Because this is such a common need, there is a specific syntax to match repetitions – curly braces with a number:

In [116]:
regex = re.compile(r'\w{3}')
regex.findall('The quick brown fox')

['The', 'qui', 'bro', 'fox']

In [117]:
regex = re.compile(r'\w{3,5}')
regex.findall('To preach is easy but to implement is tough')

['preac', 'easy', 'but', 'imple', 'ment', 'tough']

"\b" matches word breaks
(word boundaries). The regex below uses it to extract only whole words of 3 to 5 characters:

In [118]:
regex = re.compile(r'\b\w{3,5}\b')
regex.findall('To preach is easy but to implement is tough')

['easy', 'but', 'tough']

| Character | Description | Example |
|---|---|---|
| ? | Match zero or one repetitions of preceding | "ab?" matches "a" or "ab" |
| \* | Match zero or more repetitions of preceding | "ab*" matches "a", "ab", "abb", "abbb"... |
| \+ | Match one or more repetitions of preceding | "ab+" matches "ab", "abb", "abbb"... but not "a" |
| {n} | Match n repetitions of preceding | "ab{2}" matches "abb" |
| {m,n} | Match between m and n repetitions of preceding | "ab{2,3}" matches "abb" or "abbb" |

In [122]:
regex = re.compile(r'\b\w\w?\b')
regex.findall('Practicing programming is a good thing')

['is', 'a']

In [126]:
regex = re.compile(r'\w+\.?\w+@\w+\.[a-z]{3}')
regex.findall('barack.obama@whitehouse.gov')

['barack.obama@whitehouse.gov']

In [127]:
text = "To email Guido, try guido@python.org or the older address guido@google.com."
regex.findall(text)

['guido@python.org', 'guido@google.com']

In [132]:
email2 = re.compile(r'\w[\w.]+@\w+\.[a-z]{3}')
email2.findall('barack.obama@whitehouse.gov')

['barack.obama@whitehouse.gov']

We have changed "\w+" to "[\w.]+", so we will match any alphanumeric character or a period. 

### Parentheses indicate groups to extract

For compound regular expressions like our email matcher, we often want to extract their components rather than the full match. This can be done using parentheses to group the results:

In [137]:
email3 = re.compile(r'(\w[\w.]+)@(\w+)\.([a-z]{3})')

In [138]:
email3.findall(text)

[('guido', 'python', 'org'), ('guido', 'google', 'com')]

We can go a bit further and name the extracted components using the "(?P<name> )" syntax, in which case the groups can be extracted as a Python dictionary:

In [146]:
email4 = re.compile(r'(?P<user>\w[\w.]+)@(?P<domain>\w+)\.(?P<ORG>[a-z]{3})')

In [147]:
match = email4.match('guido@python.org')
match.groupdict()

{'user': 'guido', 'domain': 'python', 'ORG': 'org'}

https://docs.python.org/3.7/howto/regex.html#lookahead-assertions

Metacharacters (except \) are not active inside classes. For example, "[akm$]" will match any of the characters 'a', 'k', 'm', or '$'; '$' is usually a metacharacter, but inside a character class it is stripped of its special nature.

Note how "." is stripped of its special nature inside the character class in (\w[\w.]+) in the email pattern, whereas to match a literal period outside a character class we have to write "\." explicitly.
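The following sketch contrasts the behavior of "$" and "." inside and outside a character class. The sample strings and the names in_class, any_char, and literal_dot are made up for this illustration.

```python
import re

# Inside square brackets, '$' is just a literal dollar sign
in_class = re.findall(r'[akm$]', 'make $5')
print(in_class)

# Outside a class, an unescaped '.' matches any character except a newline
any_char = re.findall(r'\w.\w', 'a-b a.b')
print(any_char)

# Escaping the dot restricts the match to a literal period
literal_dot = re.findall(r'\w\.\w', 'a-b a.b')
print(literal_dot)
```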

You can match the characters not listed within the class by complementing the set. This is indicated by including a '^' as the first character of the class; a '^' that appears anywhere else inside the class simply matches the '^' character. For example, "[^5]" will match any character except '5'.
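A brief sketch of how the position of the caret changes its meaning. The input strings and variable names are chosen only for this example.

```python
import re

# '^' as the first character of the class complements the set
no_five = re.findall(r'[^5]', '15253')
print(no_five)

# Anywhere else inside the class, '^' is an ordinary character
caret_literal = re.findall(r'[5^]', '5^2')
print(caret_literal)

# Outside a class, '^' anchors the match at the start of the string
anchored = re.findall(r'^5', '55')
print(anchored)
```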

Perhaps the most important metacharacter is the backslash, \. As in Python string literals, the backslash can be followed by various characters to signal various special sequences. It's also used to escape all the metacharacters so you can still match them in patterns; for example, if you need to match a [ or \, you can precede them with a backslash to remove their special meaning: \[ or \\.

Some of the special sequences beginning with '\' represent predefined sets of characters that are often useful, such as the set of digits, the set of letters, or the set of anything that isn’t whitespace.

\d
Matches any decimal digit; this is equivalent to the class "[0-9]".

\D
Matches any non-digit character; this is equivalent to the class [^0-9].

\s
Matches any whitespace character; this is equivalent to the class [ \t\n\r\f\v].

\S
Matches any non-whitespace character; this is equivalent to the class [^ \t\n\r\f\v].

\w
Matches any alphanumeric character; this is equivalent to the class [a-zA-Z0-9_].

\W
Matches any non-alphanumeric character; this is equivalent to the class [^a-zA-Z0-9_].

These sequences can be included inside a character class. For example, [\s,.] is a character class that will match any whitespace character, or ',' or '.'.
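For instance, a class such as [\s,.] can serve as a delimiter set when splitting text. The sketch below adds a "+" so that runs of delimiters count as one split point; the sample sentence and the name tokens are assumptions for illustration.

```python
import re

# Split on any run of whitespace, commas, or periods
tokens = re.split(r'[\s,.]+', 'one, two. three four')
print(tokens)
```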



The final metacharacter in this section is the dot, ".". It matches anything except a newline character, and there's an alternate mode (re.DOTALL) where it will match even a newline. "." is often used where you want to match "any character".
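A minimal sketch of the difference the re.DOTALL flag makes, using a made-up two-line string (the names text, default_mode, and dotall_mode are only for this example):

```python
import re

text = 'line one\nline two'

# By default '.' refuses to match the newline between the two lines
default_mode = re.findall(r'one.line', text)
print(default_mode)

# With re.DOTALL, '.' matches the newline as well
dotall_mode = re.findall(r'one.line', text, flags=re.DOTALL)
print(dotall_mode)
```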

There are two more repeating qualifiers. The question mark character, ?, matches either once or zero times; you can think of it as marking something as being optional. For example, home-?brew matches either 'homebrew' or 'home-brew'.

The most complicated repeated qualifier is {m,n}, where m and n are decimal integers. This qualifier means there must be at least m repetitions, and at most n. For example, a/{1,3}b will match 'a/b', 'a//b', and 'a///b'. It won’t match 'ab', which has no slashes, or 'a////b', which has four.

Readers of a reductionist bent may notice that the three other qualifiers can all be expressed using this notation. {0,} is the same as *, {1,} is equivalent to +, and {0,1} is the same as ?. It’s better to use *, +, or ? when you can, simply because they’re shorter and easier to read.
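The qualifier examples above can be checked with a short sketch. It uses fullmatch() so that the whole string has to match the pattern; the helper names optional, slashes, and sample are introduced only here.

```python
import re

# '?' makes the hyphen optional
optional = re.compile(r'home-?brew')
optional_results = [bool(optional.fullmatch(s))
                    for s in ['homebrew', 'home-brew', 'home--brew']]
print(optional_results)

# {1,3} allows between one and three slashes
slashes = re.compile(r'a/{1,3}b')
slash_results = [bool(slashes.fullmatch(s))
                 for s in ['ab', 'a/b', 'a//b', 'a///b', 'a////b']]
print(slash_results)

# '*' and '{0,}' describe the same repetition
sample = 'a ab abb'
star_matches = re.findall(r'ab*', sample)
brace_matches = re.findall(r'ab{0,}', sample)
print(star_matches == brace_matches)
```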

In [150]:
p = re.compile('ab*', re.IGNORECASE)

| Method/Attribute | Purpose |
|---|---|
| match() | Determine if the RE matches at the beginning of the string. |
| search() | Scan through a string, looking for any location where this RE matches. |
| findall() | Find all substrings where the RE matches, and return them as a list. |
| finditer() | Find all substrings where the RE matches, and return them as an iterator. |

| Method/Attribute | Purpose |
|---|---|
| group() | Return the string matched by the RE |
| start() | Return the starting position of the match |
| end() | Return the ending position of the match |
| span() | Return a tuple containing the (start, end) positions of the match |

These methods are called on a match object, such as one returned by match() or search().

In [160]:
regex = re.compile(r'[a-z]+')
regex.findall('tempo my name')

['tempo', 'my', 'name']

In [161]:
m = regex.match('tempo my name')


In [162]:
m.group()

'tempo'

In [163]:
m.start(),m.end()

(0, 5)

In [164]:
m.span()

(0, 5)

In [165]:
p = re.compile(r'\d+')

In [166]:
iterator = p.finditer('12 drummers drumming, 11 ... 10 ...')


In [167]:
for match in iterator:
    print(match.span())

(0, 2)
(22, 24)
(29, 31)


The "^" metacharacter matches at the beginning of a line. If you wish to match the word From only at the beginning of a line, the RE to use is ^From.

In [3]:
import re

In [5]:
print(re.search('^From', 'From Here to Eternity'))  


<_sre.SRE_Match object; span=(0, 4), match='From'>


In [6]:
print(re.search('^From', 'Reciting From Memory'))

None



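$ matches at the end of a line, which is defined as either the end of the string or any location followed by a newline character (the latter applies to every line only in re.MULTILINE mode; by default, only a newline at the very end of the string counts).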

In [7]:
print(re.search('}$', '{block}')) 

<_sre.SRE_Match object; span=(6, 7), match='}'>
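To see how the notion of "end of a line" depends on the mode, here is a sketch using the re.MULTILINE flag on a made-up two-line string (text, default_end, and each_line are illustrative names):

```python
import re

text = 'alpha one\nbeta two'

# By default, '$' only matches at the end of the whole string
default_end = re.findall(r'\w+$', text)
print(default_end)

# With re.MULTILINE, '$' also matches just before each newline
each_line = re.findall(r'\w+$', text, flags=re.MULTILINE)
print(each_line)
```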


\b
Word boundary. This is a zero-width assertion that matches only at the beginning or end of a word. A word is defined as a sequence of alphanumeric characters, so the end of a word is indicated by whitespace or a non-alphanumeric character.

In [8]:
p = re.compile(r'\bclass\b')

In [9]:
print(p.search('no class at all'))

<_sre.SRE_Match object; span=(3, 8), match='class'>


In [10]:
print(p.search('the declassified algorithm'))

None


In [14]:
result = re.match(r'AV', 'AV Analytics Vidhya AV')
print(result.group())

AV


In [15]:
result=re.split(r'i','Analytics Vidhya',maxsplit=1)

In [16]:
result

['Analyt', 'cs Vidhya']

In [17]:
result=re.sub(r'India','the World','AV is largest Analytics community of India')
result

'AV is largest Analytics community of the World'

In [18]:
result=re.findall(r'.','AV is largest Analytics community of India')
print(result)

['A', 'V', ' ', 'i', 's', ' ', 'l', 'a', 'r', 'g', 'e', 's', 't', ' ', 'A', 'n', 'a', 'l', 'y', 't', 'i', 'c', 's', ' ', 'c', 'o', 'm', 'm', 'u', 'n', 'i', 't', 'y', ' ', 'o', 'f', ' ', 'I', 'n', 'd', 'i', 'a']


Above, the spaces are also extracted. To avoid this, use "\w" instead of ".".

In [19]:
result=re.findall(r'\w','AV is largest Analytics community of India')
print(result)


['A', 'V', 'i', 's', 'l', 'a', 'r', 'g', 'e', 's', 't', 'A', 'n', 'a', 'l', 'y', 't', 'i', 'c', 's', 'c', 'o', 'm', 'm', 'u', 'n', 'i', 't', 'y', 'o', 'f', 'I', 'n', 'd', 'i', 'a']


In [20]:
result=re.findall(r'\w*','AV is largest Analytics community of India')
print(result)

['AV', '', 'is', '', 'largest', '', 'Analytics', '', 'community', '', 'of', '', 'India', '']


In [21]:
result=re.findall(r'\w+','AV is largest Analytics community of India')
result

['AV', 'is', 'largest', 'Analytics', 'community', 'of', 'India']

In [22]:
result=re.findall(r'^\w+','AV is largest Analytics community of India')
print(result)

['AV']


In [23]:
result=re.findall(r'\w+$','AV is largest Analytics community of India')
print(result)

['India']


In [24]:
result=re.findall(r'\b\w.','AV is largest Analytics community of India')
result

['AV', 'is', 'la', 'An', 'co', 'of', 'In']

If you want to match a date:

In [25]:
result=re.findall(r'\d{2}-\d{2}-\d{4}','Amit 34-3456 12-05-2007, XYZ 56-4532 11-11-2011, ABC 67-8945 12-01-2009')
result

['12-05-2007', '11-11-2011', '12-01-2009']

If you want to extract only the year, put that part of the pattern in parentheses:

In [26]:
result=re.findall(r'\d{2}-\d{2}-(\d{4})','Amit 34-3456 12-05-2007, XYZ 56-4532 11-11-2011, ABC 67-8945 12-01-2009')


In [27]:
result

['2007', '2011', '2009']

Return all words of a string that start with a vowel:

In [28]:
result=re.findall(r'\b[aeiouAEIOU]\w+','AV is largest Analytics community of India')
result

['AV', 'is', 'Analytics', 'of', 'India']

In a similar way, we can try to extract words that start with a consonant by using "^" within the square brackets:

In [32]:
result=re.findall(r'\b[^aeiouAEIOU]\w+','AV is largest Analytics community of India')
result

[' is', ' largest', ' Analytics', ' community', ' of', ' India']

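The matches above begin with a space, because a space is also a non-vowel character. Include the space inside the square brackets to exclude it: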

In [33]:
result=re.findall(r'\b[^aeiouAEIOU ]\w+','AV is largest Analytics community of India')
result

['largest', 'community']

Validate a phone number (it must have 10 digits and start with 8 or 9):

r'[8-9]{1}[0-9]{9}'
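One way to apply this pattern is sketched below. The helper is_valid_phone and the sample numbers are hypothetical; fullmatch() is used on the assumption that the entire string must be the phone number, since search() would also accept longer digit strings that merely contain a valid number.

```python
import re

phone = re.compile(r'[8-9]{1}[0-9]{9}')

def is_valid_phone(number):
    """Return True if the whole string is 10 digits starting with 8 or 9."""
    return phone.fullmatch(number) is not None

print(is_valid_phone('9876543210'))   # 10 digits, starts with 9
print(is_valid_phone('7876543210'))   # starts with 7
print(is_valid_phone('98765432101'))  # 11 digits

# search() only needs a valid number somewhere inside the string
contains_number = phone.search('98765432101') is not None
print(contains_number)
```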

Split a string with multiple delimiters:


In [34]:
line = 'asdf fjdk;afed,fjek,asdf,foo' # String has multiple delimiters (";",","," ").
result= re.split(r'[;,\s]', line)
print(result)

['asdf', 'fjdk', 'afed', 'fjek', 'asdf', 'foo']


We can also use the re.sub() function to replace each of these delimiters with a single space " ":



In [35]:
import re
line = 'asdf fjdk;afed,fjek,asdf,foo'
result= re.sub(r'[;,\s]',' ', line)
print(result)

asdf fjdk afed fjek asdf foo


https://www.analyticsvidhya.com/blog/2015/06/regular-expression-python/

In [36]:
text = 'The quick brown fox jumped over the lazy brown bear.'

In [37]:
# Find any of fox, snake, or bear
re.findall(r'fox|snake|bear', text)

['fox', 'bear']

In [39]:
# Find any of fox, snake, or bear, but only as whole words
re.findall(r'\b(fox|snake|bear)\b', text)


['fox', 'bear']

In [42]:
text = 'My birthday is 09/15/1983. My brother\'s birthday is 01/01/01. My other two brothers have birthdays of 9/3/2001 and 09/1/83.'
re.findall(r'\d{1,2}\/\d{1,2}/\d{2,4}',text)

['09/15/1983', '01/01/01', '9/3/2001', '09/1/83']

In [45]:
re.findall(r'\d\d?\/\d\d?/\d{2,4}',text)

['09/15/1983', '01/01/01', '9/3/2001', '09/1/83']