## __Introduction to Regular Expressions__

Regular expressions are used to match patterns in strings. We define a regular expression by combining ordinary characters with special characters that describe what to match.


Before looking at an example, let's go over a few of the special characters that regular expressions use.
- Square brackets ([ ]) represent a set of characters.
- Backslash ( \ ) introduces a special sequence or escapes a special character.
- Caret (^) matches at the start of the string.
- Dollar ($) matches at the end of the string.
- Asterisk (*) matches zero or more occurrences.
- Plus sign (+) matches one or more occurrences.
- Question mark (?) matches zero or one occurrence.
- Round brackets (()) capture a group.
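
As a quick illustration of the characters listed above, here is a minimal sketch. The sample texts ('regex', 'a ab abb', and so on) are made up for this example and are not part of the walkthrough that follows.

```python
import re

# [ ] : a set of characters; here, any one vowel.
vowels = re.findall(r'[aeiou]', 'regex')             # ['e', 'e']

# ^ and $ : anchor the match to the start / end of the string.
starts = bool(re.search(r'^reg', 'regex'))           # True
ends = bool(re.search(r'ex$', 'regex'))              # True

# * + ? : zero or more, one or more, zero or one of the preceding item.
star = re.findall(r'ab*', 'a ab abb')                # ['a', 'ab', 'abb']
plus = re.findall(r'ab+', 'a ab abb')                # ['ab', 'abb']
optional = re.findall(r'colou?r', 'color colour')    # ['color', 'colour']

# ( ) : capture only part of the match.
group = re.search(r'([a-z]+)@', 'user@example.com').group(1)  # 'user'

# \ : escape a special character so it matches literally.
dots = re.findall(r'\.', 'a.b.c')                    # ['.', '.']
```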


Now, let's apply regular expressions to some strings. We start by importing the re module.

## Step 1: Import the Regular Expression Library

- Import the regular expression (re) library:


In [None]:
#Import library
import re

## Step 2: Define a Sample String

Let's start with a simple example by creating a string to search in.
- Define a sample string:


In [None]:
s = 'This is a sample string.'

## Step 3: Perform a Simple Search

Let's perform a simple search for the pattern **is** in the string. This can be done using the search method.

In [None]:
pattern = 'is'

In [None]:
re.search(pattern,s)

<re.Match object; span=(2, 4), match='is'>

__Observation__

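- This returns a match object for the first occurrence of **is**, which sits inside the word **This**. The span (2, 4) means the match starts at index 2 and ends just before index 4.

Calling start() on the match object gives the index where the match begins, and end() gives the index just past its last character.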



In [None]:
re.search(pattern,s).start()

2

In [None]:
re.search(pattern,s).end()

4

Slicing the string from the start index to the end index gives back the text that matched the pattern.


In [None]:
s[2:4]

'is'
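
One detail worth keeping in mind is that the search method returns None when the pattern is not found, so calling start() or end() directly on the result can fail. The following is a minimal sketch of a guarded lookup. The pattern 'xyz' is an arbitrary string chosen because it does not occur in the sample.

```python
import re

s = 'This is a sample string.'

match = re.search('is', s)
if match is not None:
    # group() returns the matched text, the same as s[match.start():match.end()].
    found = match.group()
    span = match.span()

# A pattern that does not occur yields None rather than a match object.
missing = re.search('xyz', s)
```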

## Step 4: Search for the Pattern at the Beginning of the String

Next, let's use the caret to check whether the string starts with **This**.


In [None]:
pattern = '^This'
re.search(pattern,s)

<re.Match object; span=(0, 4), match='This'>

__Observation__

- It matched the word **This** at (0, 4).
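
The dollar sign works the same way at the other end of the string. Here is a small sketch using the same sample string. The two patterns are illustrative choices, not part of the original walkthrough.

```python
import re

s = 'This is a sample string.'

# The final character of s is a literal dot, so it is escaped with a backslash.
end_match = re.search(r'string\.$', s)

# '^is' finds nothing: 'is' occurs in s, but not at position 0.
no_match = re.search('^is', s)
```

Here end_match covers the span (17, 24), the last seven characters of the string, while no_match is None.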

## __Special Sequences__

Now, let's look at special sequences.

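* Backslash d (\d) stands for digits.
* Backslash D (\D) stands for nondigits.
* Backslash w (\w) stands for word characters: lowercase letters a to z, uppercase letters A to Z, digits 0-9, and underscore (_). For Python 3 strings, it also matches letters and digits from other scripts unless the re.ASCII flag is used.
* Backslash W (\W) is the negation of \w, that is, any character that is not a word character.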
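
To make the four sequences above concrete, here is a short sketch that applies each one to the same text. The sample text 'Room_42: open 9-5!' is invented for this illustration.

```python
import re

sample = 'Room_42: open 9-5!'   # made-up text for illustration

digits = re.findall(r'\d', sample)        # ['4', '2', '9', '5']
non_digits = re.findall(r'\D+', sample)   # ['Room_', ': open ', '-', '!']
words = re.findall(r'\w+', sample)        # ['Room_42', 'open', '9', '5']
non_words = re.findall(r'\W+', sample)    # [': ', ' ', '-', '!']
```

Note that the underscore counts as a word character, which is why 'Room_42' comes back as a single piece.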

Let's explore one of the methods that the regular expression library offers: the findall method.

## Step 5: Use the findall Method


The findall method returns a list of all the non-overlapping matches found in a string. Let's see an example of it.



In [None]:
s = 'This is a sentence. This is also a sentence. Is this a sentence too?'

In [None]:
re.findall('sentence',s)

['sentence', 'sentence', 'sentence']

__Observation__


- Because the pattern is the literal word **sentence**, each match is that word itself, and the list contains one entry per occurrence.
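
Matching is case-sensitive by default, which matters for a string like this one that contains both **is** and **Is**. The sketch below compares the default behavior with the re.IGNORECASE flag. Using **is** as the pattern here is an illustrative choice.

```python
import re

s = 'This is a sentence. This is also a sentence. Is this a sentence too?'

# Default: only lowercase 'is' matches, including inside 'This' and 'this'.
case_sensitive = re.findall('is', s)

# With re.IGNORECASE, the capitalized 'Is' is found as well.
case_insensitive = re.findall('is', s, flags=re.IGNORECASE)
```

The first list has five entries and the second has six, because findall matches substrings rather than whole words.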

## Step 6: Extract Emails from a String


Now, we have a string that contains email addresses. Let's write regular expressions to pull out their parts: the user names, the domain names, and the type of website (the last part of the address, such as com or edu).

In [None]:
emails = '''
nithin@gmail.com
thomas@stanford.edu
ram_11@fashion.now.inc'''

Here, instead of passing the pattern string directly to re.search or re.findall, we can first compile the regular expression and store it in a variable. This is done using the re.compile method.


For example, let's match one or more word characters (letters, digits, or underscores) in a row.

In [None]:
# Use a raw string so the backslash reaches the regex engine unchanged
pattern = re.compile(r'\w+')
re.findall(pattern,emails)

['nithin',
 'gmail',
 'com',
 'thomas',
 'stanford',
 'edu',
 'ram_11',
 'fashion',
 'now',
 'inc']

__Observation__


- The output lists every run of word characters. The @ sign and the dots are not word characters, so each address is split into its parts.
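
A compiled pattern also has its own search and findall methods, so it does not have to be passed back into the module-level functions. The following minimal sketch shows that both call styles give the same result for this example.

```python
import re

emails = '''
nithin@gmail.com
thomas@stanford.edu
ram_11@fashion.now.inc'''

pattern = re.compile(r'\w+')

# Method on the compiled pattern versus the module-level function.
via_method = pattern.findall(emails)
via_module = re.findall(pattern, emails)

# search() on the compiled pattern returns only the first match.
first = pattern.search(emails)
```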


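To get only the types of websites, we can capture that part in a group by enclosing it in round brackets, as given below. Each dot inside the brackets stands for any one character, so the group captures the last three characters of each address.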

In [None]:
pattern = re.compile(r'\w+@\w+\..*(...)')
re.findall(pattern,emails)

['com', 'edu', 'inc']

__Observation__


- Because the pattern contains one group, findall returns only the text captured by that group: the type of website for each address.
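
The same idea extends to several groups at once. The sketch below uses a hypothetical pattern, not one from the walkthrough above, that captures the user name, the domain name, and the type of website together. It assumes the type of website is whatever follows the last dot.

```python
import re

emails = '''
nithin@gmail.com
thomas@stanford.edu
ram_11@fashion.now.inc'''

# Three groups: user name, domain (may itself contain dots), final part.
parts_pattern = re.compile(r'(\w+)@([\w.]+)\.(\w+)')

# With more than one group, findall returns a list of tuples.
parts = parts_pattern.findall(emails)

for user, domain, kind in parts:
    print(user, domain, kind)
```

For the third address, the domain group captures 'fashion.now' and the last group captures 'inc'.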

## Step 7: Extract Phone Numbers from a String


Let's look at an example with phone numbers where we will try to find area codes.

In [None]:
phone_numbers = '''
0821-1234567
0800-1234567
1234-1234567'''

Let's write a regular expression that matches four digits, then a hyphen, and then seven digits.

In [None]:
pattern = re.compile(r'\d\d\d\d-\d\d\d\d\d\d\d')
re.findall(pattern,phone_numbers)

['0821-1234567', '0800-1234567', '1234-1234567']

__Observations__


- This returns all the phone numbers. However, repeating \d this many times is cumbersome to read. The repetition can be written more compactly with a count in curly brackets, as given below.

In [None]:
pattern = re.compile(r'\d{4}-\d{7}')
re.findall(pattern,phone_numbers)

['0821-1234567', '0800-1234567', '1234-1234567']

__Observations__


- This gives the same output.
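
Curly brackets can also take a range of counts. The sketch below first confirms that the long and compact patterns agree, and then tries a range of three to four digits before the hyphen. The numbers in the last string are made up for illustration.

```python
import re

phone_numbers = '''
0821-1234567
0800-1234567
1234-1234567'''

long_form = re.compile(r'\d\d\d\d-\d\d\d\d\d\d\d')
short_form = re.compile(r'\d{4}-\d{7}')
same = long_form.findall(phone_numbers) == short_form.findall(phone_numbers)

# {m,n} allows between m and n repetitions: 3 or 4 digits before the hyphen.
flexible = re.compile(r'\d{3,4}-\d{7}')
mixed = flexible.findall('080-1234567 0821-1234567 08-1234567')
```

The last number in the made-up string is skipped because it has only two digits before the hyphen.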


Now that the pattern matches all the phone numbers, we can capture just the area codes by enclosing that part of the pattern in a group.

In [None]:
pattern = re.compile(r'(\d{4})-\d{7}')
re.findall(pattern,phone_numbers)

['0821', '0800', '1234']
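
If both parts of each number are needed, a second group can be added. The following is a minimal sketch along those lines. The two-group pattern is an extension of the one above, not part of the original walkthrough.

```python
import re

phone_numbers = '''
0821-1234567
0800-1234567
1234-1234567'''

# Group 1 is the area code, group 2 is the seven-digit number.
pattern = re.compile(r'(\d{4})-(\d{7})')

# With two groups, findall returns one (area_code, number) tuple per match.
pairs = pattern.findall(phone_numbers)

# On a match object, group(0) is the whole match and group(1), group(2) are the parts.
first = pattern.search(phone_numbers)
whole = first.group(0)
area_code = first.group(1)
number = first.group(2)
```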