[regular expression cheat sheet](https://www.dataquest.io/blog/regex-cheatsheet/)

. - A period. Matches any single character except newline character.

In [9]:
re.search(r'Co.k.e', 'Cookie').group()

'Cookie'

\w - Lowercase w. Matches any single letter, digit or underscore.

In [10]:
re.search(r'Co\wk\we', 'Cookie').group()

'Cookie'

\W - Uppercase w. Matches any character not part of \w (lowercase w).

In [11]:
re.search(r'C\Wke', 'C@ke').group()

'C@ke'

\s - Lowercase s. Matches a single whitespace character like: space, newline, tab, return.

In [12]:
re.search(r'Eat\scake', 'Eat cake').group()

'Eat cake'

\S - Uppercase s. Matches any character not part of \s (lowercase s).

In [13]:
re.search(r'Cook\Se', 'Cookie').group()

'Cookie'

\t - Lowercase t. Matches tab.

\n - Lowercase n. Matches newline.

\r - Lowercase r. Matches return.

\d - Lowercase d. Matches decimal digit 0-9.


In [20]:
re.search(r'c\d\dkie', 'c00kie').group()

'c00kie'

^ - Caret. Matches a pattern at the start of the string.

In [21]:
re.search(r'^Eat', 'Eat cake').group()

'Eat'

$ - Dollar. Matches a pattern at the end of the string.

In [22]:
re.search(r'cake$', 'Eat cake').group()

'cake'

[abc] - Matches a or b or c.

[a-zA-Z0-9] - Matches any letter from (a to z) or (A to Z) or (0 to 9). Characters that are not within a range can be matched by complementing the set. If the first character of the set is ^, all the characters that are not in the set will be matched.

In [26]:
re.search(r'Number: [0-9]', 'Number: 5').group()

'Number: 5'

In [27]:
# Matches any character except 5
re.search(r'Number: [^5]', 'Number: 0').group()

'Number: 0'

\A - Uppercase a. Matches only at the very start of the string. Unlike ^, it is not affected by the MULTILINE flag, so it never matches at the start of a later line.

In [28]:
re.search(r'\A[A-E]ookie', 'Cookie').group()

'Cookie'
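To make the difference between \A and ^ concrete, here is a minimal sketch on a made-up two-line string (the sample text is an assumption, not part of the original notebook):

```python
import re

text = "Eat cake\nEat cookie"  # made-up two-line sample

# With re.MULTILINE, ^ matches at the start of every line.
caret_matches = re.findall(r'^Eat', text, flags=re.MULTILINE)
# \A ignores the flag and matches only at the very start of the string.
anchor_matches = re.findall(r'\AEat', text, flags=re.MULTILINE)

print(caret_matches)   # ['Eat', 'Eat']
print(anchor_matches)  # ['Eat']
```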

\b - Lowercase b. Matches only at the beginning or end of a word.

In [29]:
re.search(r'\b[A-E]ookie', 'Cookie').group()

'Cookie'

\ - Backslash. If the character following the backslash forms a recognized escape sequence, the special meaning of that sequence is used. For example, \n is treated as a newline. If the following character is a punctuation character, the backslash removes any special meaning and the character is matched literally: \. matches a period and \\ matches a single backslash. Unknown escapes made of a backslash and an ASCII letter are reported as errors by current versions of the re module.

In [30]:
# The doubled backslash matches a literal '\', so the '\s' here is not treated as whitespace
re.search(r'Back\\stail', r'Back\stail').group()

'Back\\stail'

In [31]:
# Here '\s' is a single escape sequence, so it matches the space in 'Back tail'
re.search(r'Back\stail', 'Back tail').group()

'Back tail'
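The escaping rules above are easier to follow with raw strings. The following is a small sketch, assuming a hypothetical Windows-style path as input; `re.escape()` is the standard-library helper that escapes special characters for you:

```python
import re

path = r'C:\temp\new'  # hypothetical sample; the raw string keeps both backslashes

# In a raw string, \\ is the regex for one literal backslash.
raw_match = re.search(r'C:\\temp', path).group()
# Without the r prefix, each backslash must be doubled again for Python itself.
plain_match = re.search('C:\\\\temp', path).group()
# re.escape() builds an escaped pattern that matches the text literally.
escaped_match = re.search(re.escape(path), path).group()

print(raw_match == plain_match == 'C:\\temp')  # True
print(escaped_match == path)                   # True
```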

## Repetitions

It becomes quite tedious if you are looking to find long patterns in a sequence. Fortunately, the re module handles repetitions using the following special characters:

    + -> Matches one or more occurrences of the character (or group) to its left.


In [32]:
re.search(r'Co+kie', 'Cooookie').group()

'Cooookie'

    * -> Matches zero or more occurrences of the character (or group) to its left.

In [33]:
# Checks for any occurrence of a or o or both in the given sequence
re.search(r'Ca*o*kie', 'Caokie').group()

'Caokie'

In [35]:
re.search(r'Ca*o*kie', 'Caaooookie').group()

'Caaooookie'

In [36]:
re.search(r'Ca*o*kie', 'Caaaaoooooooooookie').group()

'Caaaaoooooooooookie'

    ? -> Matches zero or one occurrence of the character (or group) to its left.

In [37]:
# Matches zero or one occurrence of 'u', so both 'Color' and 'Colour' match
re.search(r'Colou?r', 'Color').group()

'Color'

{x} - Repeat exactly x times.

{x,} - Repeat at least x times.

{x,y} - Repeat at least x times but no more than y times. Do not put a space after the comma: Python then treats the braces as literal text instead of a quantifier.

In [42]:
re.search(r'\d{9,10}', '0987654321').group()

'0987654321'
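As a quick illustration of the three brace forms, here is a sketch that reuses the same digit string; the variable names are only for illustration:

```python
import re

digits = '0987654321'

exact = re.search(r'\d{3}', digits).group()      # exactly three digits
at_least = re.search(r'\d{3,}', digits).group()  # three or more (greedy)
bounded = re.search(r'\d{3,5}', digits).group()  # between three and five (greedy)
# With a space inside the braces, the braces are matched as literal text.
spaced = re.search(r'\d{3, 5}', digits)

print(exact)     # 098
print(at_least)  # 0987654321
print(bounded)   # 09876
print(spaced)    # None
```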

    The + and * quantifiers are said to be greedy: they match as many characters as possible.

### Greedy vs Non-Greedy Matching

When a quantifier matches as much of the search sequence (string) as possible, it is said to be a "Greedy Match". This is the default behavior of a regular expression, but sometimes it is not desired. Adding ? after the quantifier (as in *? or +?) makes it non-greedy, so it matches as little as possible:

In [43]:
heading  = r'<h1>TITLE</h1>'
re.match(r'<.*>', heading).group()

'<h1>TITLE</h1>'

In [44]:
heading  = r'<h1>TITLE</h1>'
re.match(r'<.*?>', heading).group()

'<h1>'
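The effect is easier to see with `findall()` on a string that contains several tags. This is a minimal sketch on a made-up snippet; the negated character class `[^>]*` is a common alternative to the non-greedy form, shown here only for comparison:

```python
import re

html = '<h1>TITLE</h1><p>Body</p>'  # made-up sample markup

greedy_tags = re.findall(r'<.*>', html)     # one match spanning the whole string
lazy_tags = re.findall(r'<.*?>', html)      # stops at the first '>' each time
negated_tags = re.findall(r'<[^>]*>', html) # same result without a lazy quantifier

print(greedy_tags)   # ['<h1>TITLE</h1><p>Body</p>']
print(lazy_tags)     # ['<h1>', '</h1>', '<p>', '</p>']
print(negated_tags)  # ['<h1>', '</h1>', '<p>', '</p>']
```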

### search(pattern, string, flags=0)

With this function, you scan through the given string/sequence looking for the first location where the regular expression produces a match. It returns a corresponding match object if found, else returns None if no position in the string matches the pattern. Note that None is different from finding a zero-length match at some point in the string.

In [45]:
pattern = "cookie"
sequence = "Cake and cookie"

re.search(pattern, sequence).group()

'cookie'

#### search() versus match()

The match() function checks for a match only at the beginning of the string (by default) whereas the search() function checks for a match anywhere in the string.

### match(pattern, string, flags=0)

Returns a corresponding match object if zero or more characters at the beginning of string match the pattern. Else it returns None, if the string does not match the given pattern.

In [1]:
import re

pattern = "C"
sequence1 = "IceCream"

# No match since "C" is not at the start of "IceCream"
re.match(pattern, sequence1)

In [5]:
sequence2 = "Cake"

re.match(pattern,sequence2).group()

'C'
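To compare the two functions side by side, here is a short sketch on the same "IceCream" string. It also shows `re.fullmatch()`, a related standard-library function that requires the whole string to match:

```python
import re

sequence = "IceCream"

match_c = re.match(r'C', sequence)              # 'C' is not at the start
search_pos = re.search(r'C', sequence).start()  # found later in the string
match_ice = re.match(r'Ice', sequence).group()  # matches a prefix
full_ice = re.fullmatch(r'Ice', sequence)       # the whole string must match

print(match_c)     # None
print(search_pos)  # 3
print(match_ice)   # Ice
print(full_ice)    # None
```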

### findall(pattern, string, flags=0)

Finds all the possible matches in the entire sequence and returns them as a list of strings. Each returned string represents one match.

In [6]:
email_address = "Please contact us at: support@datacamp.com, xyz@datacamp.com"

# 'addresses' is a list that stores all the matches
addresses = re.findall(r'[\w\.-]+@[\w\.-]+', email_address)
for address in addresses: 
    print(address)

support@datacamp.com
xyz@datacamp.com
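One detail worth knowing: when the pattern contains capturing groups, `findall()` returns the groups instead of the whole match. A small sketch with the same sentence, adding two groups to the pattern:

```python
import re

email_address = "Please contact us at: support@datacamp.com, xyz@datacamp.com"

# Two groups: the part before '@' and the part after it.
pairs = re.findall(r'([\w.-]+)@([\w.-]+)', email_address)

print(pairs)  # [('support', 'datacamp.com'), ('xyz', 'datacamp.com')]
```

Each element is a tuple with one entry per group, which is handy when only parts of the match are needed.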


### sub(pattern, repl, string, count=0, flags=0)

This is the substitute function. It returns the string obtained by replacing or substituting the leftmost non-overlapping occurrences of pattern in string by the replacement repl. If the pattern is not found then the string is returned unchanged.

In [7]:
email_address = "Please contact us at: xyz@datacamp.com"
new_email_address = re.sub(r'([\w\.-]+)@([\w\.-]+)', r'support@datacamp.com', email_address)
print(new_email_address)

Please contact us at: support@datacamp.com
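The groups in the pattern can also be reused in the replacement through backreferences such as `\1`. The sketch below keeps the user name and swaps only the domain; `example.org` is a placeholder chosen for illustration. It also shows the `count` argument limiting the number of replacements:

```python
import re

email_address = "Please contact us at: xyz@datacamp.com"

# \1 refers to the first group (the part before '@').
swapped = re.sub(r'([\w.-]+)@([\w.-]+)', r'\1@example.org', email_address)

# count=2 replaces only the first two occurrences.
limited = re.sub(r'o', '0', 'cookie cookie', count=2)

print(swapped)  # Please contact us at: xyz@example.org
print(limited)  # c00kie cookie
```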


### compile(pattern, flags=0)

Compiles a regular expression pattern into a regular expression object. When you need to use an expression several times in a single program, using the compile() function to save the resulting regular expression object for reuse is more efficient. Note that the compiled versions of the most recent patterns passed to compile() and the module-level matching functions are cached, so programs that use only a few regular expressions at a time need not worry about compiling them.

In [8]:
pattern = re.compile(r"cookie")
sequence = "Cake and cookie"
pattern.search(sequence).group()

'cookie'

**Tip:** An expression's behavior can be modified by specifying a flags value. You can pass a flag as an extra argument to the various functions that you have seen in this tutorial. Some of the flags used are: IGNORECASE, DOTALL, MULTILINE, VERBOSE, etc.
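Here is a minimal sketch of three of these flags in action. The sample strings are made up, and the verbose date pattern is just one way to lay such a pattern out:

```python
import re

# IGNORECASE: letters match regardless of case.
ignore_case = re.search(r'cookie', 'Cake and COOKIE', flags=re.IGNORECASE).group()

# DOTALL: '.' also matches a newline.
no_dotall = re.search(r'Eat.cake', 'Eat\ncake')
with_dotall = re.search(r'Eat.cake', 'Eat\ncake', flags=re.DOTALL).group()

# VERBOSE: whitespace and # comments inside the pattern are ignored.
date_pattern = re.compile(r"""
    (\d{2})   # day
    -
    (\d{2})   # month
    -
    (\d{4})   # year
""", re.VERBOSE)
date_parts = date_pattern.search('12-05-2007').groups()

print(ignore_case)                 # COOKIE
print(no_dotall)                   # None
print(with_dotall == 'Eat\ncake')  # True
print(date_parts)                  # ('12', '05', '2007')
```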

### Case Study: Working with Regular Expressions

In [46]:
import re
import requests
the_idiot_url = 'https://www.gutenberg.org/files/2638/2638-0.txt'

In [47]:
def get_book(url):
    # Sends an HTTP request to get the text from Project Gutenberg
    raw = requests.get(url).text
    # Discards the metadata from the beginning of the book
    start = re.search(r"\*\*\* START OF THIS PROJECT GUTENBERG EBOOK .* \*\*\*", raw).end()
    # Stops at the first occurrence of "II" in the text (presumably the heading of chapter II),
    # so only the opening of the book is kept rather than everything up to the closing metadata
    stop = re.search(r"II", raw).start()
    # Keeps the relevant text
    text = raw[start:stop]
    return text

In [48]:
def preprocess(sentence): 
    return re.sub(r'[^A-Za-z0-9.]+', ' ', sentence).lower()

book = get_book(the_idiot_url)
processed_book = preprocess(book)
print(processed_book)

 produced by martin adamson david widger with corrections by andrew sly the idiot by fyodor dostoyevsky translated by eva martin part i i. towards the end of november during a thaw at nine o clock one morning a train on the warsaw and petersburg railway was approaching the latter city at full speed. the morning was so damp and misty that it was only with great difficulty that the day succeeded in breaking and it was impossible to distinguish anything more than a few yards away from the carriage windows. some of the passengers by this particular train were returning from abroad but the third class carriages were the best filled chiefly with insignificant persons of various occupations and degrees picked up at the different stations nearer town. all of them seemed weary and most of them had sleepy eyes and a shivering expression while their complexions generally appeared to have taken on the colour of the fog outside. when day dawned two passengers in one of the third class carriages fou

In [49]:
print(book)






Produced by Martin Adamson, David Widger, with corrections by Andrew Sly










THE IDIOT

By Fyodor Dostoyevsky


Translated by Eva Martin




PART I

I.

Towards the end of November, during a thaw, at nine o’clock one morning,
a train on the Warsaw and Petersburg railway was approaching the latter
city at full speed. The morning was so damp and misty that it was only
with great difficulty that the day succeeded in breaking; and it was
impossible to distinguish anything more than a few yards away from the
carriage windows.

Some of the passengers by this particular train were returning from
abroad; but the third-class carriages were the best filled, chiefly with
insignificant persons of various occupations and degrees, picked up at
the different stations nearer town. All of them seemed weary, and
most of them had sleepy eyes and a shivering expression, while their
complexions generally appeared to have taken on the colour of the fog
o

In [50]:
# Count the occurrences of "the" in the corpus. Hint: use the len() function.
# Note that this pattern also counts "the" inside longer words such as "then" or "other".
len(re.findall(r'the',processed_book))

302
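To count only the stand-alone word, the pattern can be wrapped in `\b` word boundaries. The sketch below uses a short made-up sentence instead of the downloaded book, so the numbers are not comparable with the count above:

```python
import re

sample = "the other day they went there, then the end"  # made-up sample text

substring_count = len(re.findall(r'the', sample))      # also counts 'other', 'they', ...
whole_word_count = len(re.findall(r'\bthe\b', sample)) # only the word 'the'

print(substring_count)   # 6
print(whole_word_count)  # 2
```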

In [54]:
# Try to convert every single stand-alone instance of 'i' to 'I' in the corpus. 
# Make sure not to change the 'i' occurring in a word:
processed_book= re.sub(r'\si\s'," I ",processed_book)
print(processed_book)


 produced by martin adamson david widger with corrections by andrew sly the idiot by fyodor dostoyevsky translated by eva martin part I i. towards the end of november during a thaw at nine o clock one morning a train on the warsaw and petersburg railway was approaching the latter city at full speed. the morning was so damp and misty that it was only with great difficulty that the day succeeded in breaking and it was impossible to distinguish anything more than a few yards away from the carriage windows. some of the passengers by this particular train were returning from abroad but the third class carriages were the best filled chiefly with insignificant persons of various occupations and degrees picked up at the different stations nearer town. all of them seemed weary and most of them had sleepy eyes and a shivering expression while their complexions generally appeared to have taken on the colour of the fog outside. when day dawned two passengers in one of the third class carriages found

In [57]:
# Count the closing quotation marks (”) in the corpus to estimate how many times anyone was quoted.
len(re.findall(r'”', book))


96

In [59]:
# Which phrases are connected by '--' in the corpus?
# The character class includes a space, so each match can span several words.
re.findall(r'[A-Za-z0-9 ]*--[A-Za-z0-9 ]*', book)

[' ironical--it might almost be called a',
 'malicious--smile',
 ' He wore a large fur--or rather',
 'astrachan--overcoat',
 'large cape to it--the sort of cloak one sees upon travellers during the',
 'winter months in Switzerland or North Italy--was by no means adapted to',
 'nervous malady--a kind of epilepsy',
 ' but my doctor gave me money--and he had very',
 'little--to pay my journey back',
 'No--Mr',
 'That is--where am I going to stay',
 ' I--I really don',
 'I--',
 '--though of course poverty is no crime',
 '--we must remember that',
 'or--judge',
 'from your costume and gaiters--still',
 '--if you can add to your possessions',
 'error through--well',
 ' say--through a too luxuriant fancy',
 'however--and that is commendable',
 ' Epanchin--oh yes',
 ' I know him too--at least',
 ' A fine fellow he was--and had a property of four thousand',
 ' Nicolai Andreevitch--that was his name',
 'They are people who know everyone--that is',
 ' which they reduce--or raise',
 '--to the stan

### Some Examples of Regular Expressions

### Problem 1: Return the first word of a given string

#### Extract each character (using “.” or “\w”)

In [60]:
import re
result=re.findall(r'.','AV is largest Analytics community of India')
print (result)

['A', 'V', ' ', 'i', 's', ' ', 'l', 'a', 'r', 'g', 'e', 's', 't', ' ', 'A', 'n', 'a', 'l', 'y', 't', 'i', 'c', 's', ' ', 'c', 'o', 'm', 'm', 'u', 'n', 'i', 't', 'y', ' ', 'o', 'f', ' ', 'I', 'n', 'd', 'i', 'a']


In [61]:
result=re.findall(r'\w','AV is largest Analytics community of India')
print (result)

['A', 'V', 'i', 's', 'l', 'a', 'r', 'g', 'e', 's', 't', 'A', 'n', 'a', 'l', 'y', 't', 'i', 'c', 's', 'c', 'o', 'm', 'm', 'u', 'n', 'i', 't', 'y', 'o', 'f', 'I', 'n', 'd', 'i', 'a']


####  Extract each word (using “*” or “+“)

In [62]:
result=re.findall(r'\w*','AV is largest Analytics community of India')
print (result)

['AV', '', 'is', '', 'largest', '', 'Analytics', '', 'community', '', 'of', '', 'India', '']


In [63]:
result=re.findall(r'\w+','AV is largest Analytics community of India')
print (result)

['AV', 'is', 'largest', 'Analytics', 'community', 'of', 'India']


#### Extract the first word (using “^”) and the last word (using “$”)

In [64]:
result=re.findall(r'^\w+','AV is largest Analytics community of India')
print (result)

['AV']


In [65]:
result=re.findall(r'\w+$','AV is largest Analytics community of India')
print (result)

['India']


### Problem 2: Return the first two characters of each word

#### Extract two consecutive characters of each word, excluding spaces

In [66]:
result=re.findall(r'\w\w','AV is largest Analytics community of India')
print (result)

['AV', 'is', 'la', 'rg', 'es', 'An', 'al', 'yt', 'ic', 'co', 'mm', 'un', 'it', 'of', 'In', 'di']


#### Extract two consecutive characters at the start of each word (using “\b”)

In [68]:
result=re.findall(r'\b\w.','AV is largest Analytics community of India')
print (result)

['AV', 'is', 'la', 'An', 'co', 'of', 'In']


### Problem 3: Return the domain type of given email-ids

#### Extract all characters after “@”

In [72]:
result=re.findall(r'@\w+','abc.test@gmail.com, xyz@test.in, test.first@analyticsvidhya.com, first.test@rest.biz') 
print (result )

['@gmail', '@test', '@analyticsvidhya', '@rest']


In [74]:
result=re.findall(r'@\w+\.\w+','abc.test@gmail.com, xyz@test.in, test.first@analyticsvidhya.com, first.test@rest.biz') 
print (result) 

['@gmail.com', '@test.in', '@analyticsvidhya.com', '@rest.biz']


#### Extract only the domain type using “( )”

In [75]:
result=re.findall(r'@\w+\.(\w+)','abc.test@gmail.com, xyz@test.in, test.first@analyticsvidhya.com, first.test@rest.biz') 
print (result) 

['com', 'in', 'com', 'biz']
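Whether the dot in these patterns is escaped matters: an unescaped `.` matches any character, not only a period. The following sketch uses made-up addresses to show the difference, plus one possible pattern for domains with several dots (an illustration, not a full email validator):

```python
import re

odd = 'user@host-name'  # made-up input without any period after '@'

unescaped = re.findall(r'@\w+.\w+', odd)   # '.' happily matches the '-'
escaped = re.findall(r'@\w+\.\w+', odd)    # '\.' requires a real period

# For multi-part domains, allow dots and hyphens before the final period.
last_part = re.findall(r'@[\w.-]+\.(\w+)', 'a@mail.example.co.uk')

print(unescaped)  # ['@host-name']
print(escaped)    # []
print(last_part)  # ['uk']
```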


### Problem 4: Return date from given string

- Here we will use “\d” to extract digits.

In [76]:
result=re.findall(r'\d{2}-\d{2}-\d{4}','Amit 34-3456 12-05-2007, XYZ 56-4532 11-11-2011, ABC 67-8945 12-01-2009')
print (result)

['12-05-2007', '11-11-2011', '12-01-2009']


In [77]:
# If you want to extract only the year, parentheses “( )” will help you again.

result=re.findall(r'\d{2}-\d{2}-(\d{4})','Amit 34-3456 12-05-2007, XYZ 56-4532 11-11-2011, ABC 67-8945 12-01-2009')
print (result)

['2007', '2011', '2009']
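When more than one part of the date is needed, named groups keep the pieces readable. This is a small sketch using the `(?P<name>...)` syntax on the first record; the group names are chosen here for illustration:

```python
import re

record = 'Amit 34-3456 12-05-2007'

date_match = re.search(r'(?P<day>\d{2})-(?P<month>\d{2})-(?P<year>\d{4})', record)

year = date_match.group('year')
parts = date_match.groupdict()

print(year)   # 2007
print(parts)  # {'day': '12', 'month': '05', 'year': '2007'}
```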


### Problem 5: Return all words of a string that start with a vowel

In [78]:
# Return each word

result=re.findall(r'\w+','AV is largest Analytics community of India')
print (result)

['AV', 'is', 'largest', 'Analytics', 'community', 'of', 'India']


In [79]:
# Match a vowel followed by word characters (using []). Note that a match can start in the middle of a word.

result=re.findall(r'[aeiouAEIOU]\w+','AV is largest Analytics community of India')
print (result)

['AV', 'is', 'argest', 'Analytics', 'ommunity', 'of', 'India']


In [80]:
# Adding \b (word boundary) restricts the matches to words that start with a vowel.
result=re.findall(r'\b[aeiouAEIOU]\w+','AV is largest Analytics community of India')
print (result)

['AV', 'is', 'Analytics', 'of', 'India']


In [81]:
# Similarly, we can extract words that start with a consonant using “^” within the square brackets.

result=re.findall(r'\b[^aeiouAEIOU]\w+','AV is largest Analytics community of India')
print (result)

[' is', ' largest', ' Analytics', ' community', ' of', ' India']


In [82]:
# Above you can see that it has returned words starting with a space. To drop them from the output, include a space in the square brackets [].

result=re.findall(r'\b[^aeiouAEIOU ]\w+','AV is largest Analytics community of India')
print (result)

['largest', 'community']
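Excluding the space works for this sentence, but digits or punctuation at the start of a word would still slip through. One possible alternative, sketched below, is a negated class that excludes non-word characters, digits, the underscore and vowels, which leaves only consonant letters:

```python
import re

sentence = 'AV is largest Analytics community of India'

# [^\W\d_aeiouAEIOU] = a word character that is not a digit, underscore or vowel.
consonant_words = re.findall(r'\b[^\W\d_aeiouAEIOU]\w+', sentence)

print(consonant_words)  # ['largest', 'community']
```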


### Problem 6: Validate a phone number (the number must have 10 digits and start with 8 or 9)

In [84]:
import re
li = ['9999999999', '999999-999', '99999x9999']
for val in li:
    # The first digit must be 8 or 9, followed by nine more digits
    if re.match(r'[89][0-9]{9}', val) and len(val) == 10:
        print('yes')
    else:
        print('no')

yes
no
no


### Problem 6 (variation): Validate a phone number (the number must have 10 digits and start with 7, 8 or 9)

In [85]:
import re
li = ['9999999999', '9999998999', '99999x9999', '7897675452', '8923456721']
for val in li:
    # [7-9] matches 7, 8 or 9 as the first digit
    if re.match(r'[7-9][0-9]{9}', val) and len(val) == 10:
        print('yes')
    else:
        print('no')

yes
yes
no
yes
yes
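The combination of `re.match()` and a length check can also be expressed with `re.fullmatch()`, which only succeeds when the whole string matches. A minimal sketch, where `is_valid_phone` is a hypothetical helper name introduced for illustration:

```python
import re

def is_valid_phone(number):
    """Return True for a 10-digit number that starts with 7, 8 or 9."""
    # fullmatch() requires the entire string to match, so no len() check is needed.
    return re.fullmatch(r'[7-9][0-9]{9}', number) is not None

candidates = ['9999999999', '999999-999', '99999x9999', '7897675452', '6897675452', '99999999999']
results = [is_valid_phone(c) for c in candidates]

print(results)  # [True, False, False, True, False, False]
```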


### Problem 7: Split a string with multiple delimiters

In [86]:
import re
line = 'asdf fjdk;afed,fjek,asdf,foo?shrikant'  # The string has multiple delimiters (";", ",", " ", "?").
result= re.split(r'[;,?\s]', line)
print (result)

['asdf', 'fjdk', 'afed', 'fjek', 'asdf', 'foo', 'shrikant']


In [87]:
# We can also use re.sub() to replace each of these delimiters with a single space " ".

import re
line = 'asdf fjdk;afed,fjek,asdf,foo'
result= re.sub(r'[;,\s]',' ', line)
print (result)

asdf fjdk afed fjek asdf foo
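If delimiters can appear back to back (for example a comma followed by a space), splitting on a single character leaves empty strings in the result. Adding `+` to the character class treats a run of delimiters as one. A sketch on a made-up line:

```python
import re

line = 'asdf fjdk; afed, fjek,asdf,  foo'  # made-up sample with repeated delimiters

single = re.split(r'[;,\s]', 'a, b')   # one delimiter at a time
merged = re.split(r'[;,\s]+', line)    # a run of delimiters counts once

print(single)  # ['a', '', 'b']
print(merged)  # ['asdf', 'fjdk', 'afed', 'fjek', 'asdf', 'foo']
```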


### Problem 8: Retrieve Information from HTML file

I want to extract information from an HTML file (see the sample data below). Here we need to extract the information available between <td> and </td>, except the first numerical index. I have assumed here that the HTML code below is stored in a string str.

Sample HTML file (str)
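The sample HTML is not reproduced in this section, so the sketch below uses a made-up table with the layout described above (a numerical index followed by two cells per row). The markup, the names in it and the variable `html` are all assumptions for illustration, and the pattern is one possible approach rather than the original solution:

```python
import re

# Hypothetical sample data: index cell, then two value cells per row.
html = """
<tr align="center"><td>1</td> <td>Noah</td> <td>Emma</td></tr>
<tr align="center"><td>2</td> <td>Liam</td> <td>Olivia</td></tr>
"""

# Skip the numeric index cell and capture the two cells that follow it.
rows = re.findall(r'<td>\d+</td>\s*<td>(\w+)</td>\s*<td>(\w+)</td>', html)

print(rows)  # [('Noah', 'Emma'), ('Liam', 'Olivia')]
```

For anything beyond small, regular snippets like this, a dedicated HTML parser is usually a safer choice than regular expressions.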


| Operator | Description |
|---|---|
| `.` | Matches any single character except newline `\n`. |
| `?` | Matches 0 or 1 occurrence of the pattern to its left. |
| `+` | Matches 1 or more occurrences of the pattern to its left. |
| `*` | Matches 0 or more occurrences of the pattern to its left. |
| `\w` | Matches an alphanumeric character or underscore, whereas `\W` (upper case W) matches any other character. |
| `\d` | Matches digits [0-9], and `\D` (upper case D) matches non-digits. |
| `\s` | Matches a single whitespace character (space, newline, return, tab, form feed), and `\S` (upper case S) matches any non-whitespace character. |
| `\b` | Matches the boundary between a word and a non-word character; `\B` is the opposite of `\b`. |
| `[..]` | Matches any single character in the square brackets, and `[^..]` matches any single character not in the square brackets. |
| `\` | Escapes characters with a special meaning, for example `\.` to match a period or `\+` for a plus sign. |
| `^` and `$` | Match the start and the end of the string, respectively. |
| `{n,m}` | Matches at least n and at most m occurrences of the preceding expression. Written as `{,m}`, it matches from 0 up to m occurrences. |
| `a\|b` | Matches either a or b. |
| `( )` | Groups regular expressions and returns the matched text. |
| `\t`, `\n`, `\r` | Match tab, newline, return. |