## Regular expressions in Python

Sample text used in the examples

In [122]:
sentence = "As for households economic projection, policy interest rates have only partially passed through to the 29 per cent of households that have a mortgage, depending to a large degree on whether mortgage rates are fixed or variable."

paragraph = """Last week ECB staff published their latest projections for the euro area and this week we published our latest outlook for the Irish economy.  Let me give some reflections on both.
At our meeting last week, my Governing Council colleagues and I decided to keep the three key ECB interest rates unchanged.  In the latest projections, inflation has been revised down, in particular for 2024 which mainly reflects lower energy prices.  
The current data we have suggests that measures of underlying inflation have eased further, which gives us more confidence about returning to our 2 per cent medium-term target.  Against this, domestic inflation remains high, in part driven by strong gr
The projection for economic growth in 2024 has been revised down to 0.6 per cent, with activity expected to remain weak in the near term. Growth is expected to pick up to 1.5 per cent in 2025 and 1.6 per cent in 2026. Given downward revisions to growth
In light of these new projections, how do I see the interest rate path?"""

website = "https://www.shorelineleisure.ie/"

In [123]:
import re

In [124]:
sentence.split(" ")  # split the sentence on single spaces; punctuation stays attached to the words

['As',
 'for',
 'households',
 'economic',
 'projection,',
 'policy',
 'interest',
 'rates',
 'have',
 'only',
 'partially',
 'passed',
 'through',
 'to',
 'the',
 '29',
 'per',
 'cent',
 'of',
 'households',
 'that',
 'have',
 'a',
 'mortgage,',
 'depending',
 'to',
 'a',
 'large',
 'degree',
 'on',
 'whether',
 'mortgage',
 'rates',
 'are',
 'fixed',
 'or',
 'variable.']
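The plain `split(" ")` above leaves punctuation attached to words such as `'projection,'`. As a minimal sketch, `re.split` can split on a pattern instead of a fixed separator. The sample string below is a made-up assumption, not part of the original text.

```python
import re

# Hypothetical short sample, chosen only for illustration
text = "fixed, variable or tracker."

# Split on runs of spaces, commas and full stops
tokens = re.split(r"[ ,.]+", text)

# A delimiter at the very end leaves an empty string behind, so drop empties
words = [t for t in tokens if t]
print(words)  # ['fixed', 'variable', 'or', 'tracker']
```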

### To find matching characters we can use:

`findall(pattern, string)` - returns a list of all matches

`search(pattern, string)` - returns a match object for the first match in the string, or `None` if there is no match

`sub(pattern, replacement, string)` - replaces one or many matches
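The three functions can be compared side by side on one short string. This is only a sketch; the sample sentence is an assumption made for illustration.

```python
import re

# Hypothetical sample sentence
text = "Rates rose in 2023 and fell in 2024."

years = re.findall(r"\d{4}", text)        # list of every 4-digit run
first = re.search(r"\d{4}", text)         # match object for the first one only
masked = re.sub(r"\d{4}", "YYYY", text)   # new string with every match replaced
missing = re.search(r"euro", text)        # no match, so this is None

print(years)          # ['2023', '2024']
print(first.group())  # 2023
print(masked)         # Rates rose in YYYY and fell in YYYY.
print(missing)        # None
```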

In [125]:
pattern = "economic"  # find the word 'economic' in lowercase
re.findall(pattern, sentence)

['economic']

In [126]:
pattern = "Economic"  # search for 'Economic' with a capital 'E'; matching is case-sensitive, so nothing is found
re.findall(pattern, sentence)

[]

In [127]:
pattern = "Economic"  # search for 'economic' ignoring upper or lower case
re.findall(pattern, sentence, re.IGNORECASE)

['economic']
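Case-insensitive matching can be requested in more than one way. The sketch below shows three equivalent spellings on a made-up sample string (an assumption, not the sentence used above).

```python
import re

# Hypothetical sample with both capitalisations
text = "Economic growth and economic projections"

# 1. Pass the flag as an argument
by_flag = re.findall(r"economic", text, flags=re.IGNORECASE)

# 2. Put the inline flag (?i) at the start of the pattern
by_inline = re.findall(r"(?i)economic", text)

# 3. Compile once and reuse the compiled pattern
compiled = re.compile(r"economic", re.IGNORECASE)
by_compiled = compiled.findall(text)

print(by_flag)  # ['Economic', 'economic']
```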

In [128]:
pattern = "economic"  # search for 'economic' in the paragraph and report the position (span) of the first match
re.search(pattern, paragraph)

<re.Match object; span=(706, 714), match='economic'>
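A match object carries the matched text and its position. As a small sketch on a made-up string (an assumption), the span can be used to slice the original text. Checking for `None` first avoids an error when nothing matches.

```python
import re

# Hypothetical sample string
text = "The projection for economic growth"

match = re.search(r"economic", text)
if match is not None:
    start, end = match.span()   # start index and end index (end is exclusive)
    print(match.group())        # the matched text
    print(text[start:end])      # slicing with the span gives the same text
```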

### Matching literal characters

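Most characters in a pattern match themselves literally. The exceptions are the metacharacters `[\^$.|?*+()`: to match one of these literally, put a backslash `\` in front of it (for example, `\.` matches a full stop). A backslash followed by certain letters forms a special sequence instead, such as `\d` (a digit), `\w` (a word character) or `\D` (a non-digit).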
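To illustrate the role of the backslash, the sketch below compares an unescaped and an escaped dot on the `website` string defined earlier. `re.escape` is a standard-library helper that adds the backslashes for you.

```python
import re

website = "https://www.shorelineleisure.ie/"

# An unescaped dot is a metacharacter: it matches any character except a newline
unescaped = re.findall(r".", website)

# An escaped dot matches only a literal full stop
escaped = re.findall(r"\.", website)

# re.escape builds a pattern that matches the given text literally
literal_pattern = re.escape("www.")

print(len(unescaped) == len(website))  # True
print(escaped)                         # ['.', '.']
print(literal_pattern)                 # www\.
```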

In [129]:
pattern = r"\d"  # raw string keeps the backslash intact; matches each single digit in the paragraph
re.findall(pattern, paragraph)

['2',
 '0',
 '2',
 '4',
 '2',
 '2',
 '0',
 '2',
 '4',
 '0',
 '6',
 '1',
 '5',
 '2',
 '0',
 '2',
 '5',
 '1',
 '6',
 '2',
 '0',
 '2',
 '6']

In [130]:
pattern = r"\w+"  # matches runs of word characters (letters, digits, underscore); spaces and punctuation are excluded
re.findall(pattern, paragraph)

['Last',
 'week',
 'ECB',
 'staff',
 'published',
 'their',
 'latest',
 'projections',
 'for',
 'the',
 'euro',
 'area',
 'and',
 'this',
 'week',
 'we',
 'published',
 'our',
 'latest',
 'outlook',
 'for',
 'the',
 'Irish',
 'economy',
 'Let',
 'me',
 'give',
 'some',
 'reflections',
 'on',
 'both',
 'At',
 'our',
 'meeting',
 'last',
 'week',
 'my',
 'Governing',
 'Council',
 'colleagues',
 'and',
 'I',
 'decided',
 'to',
 'keep',
 'the',
 'three',
 'key',
 'ECB',
 'interest',
 'rates',
 'unchanged',
 'In',
 'the',
 'latest',
 'projections',
 'inflation',
 'has',
 'been',
 'revised',
 'down',
 'in',
 'particular',
 'for',
 '2024',
 'which',
 'mainly',
 'reflects',
 'lower',
 'energy',
 'prices',
 'The',
 'current',
 'data',
 'we',
 'have',
 'suggests',
 'that',
 'measures',
 'of',
 'underlying',
 'inflation',
 'have',
 'eased',
 'further',
 'which',
 'gives',
 'us',
 'more',
 'confidence',
 'about',
 'returning',
 'to',
 'our',
 '2',
 'per',
 'cent',
 'medium',
 'term',
 'target',
 '

In [131]:
pattern = r"\D+"  # matches runs of any characters other than digits
re.findall(pattern, paragraph)

['Last week ECB staff published their latest projections for the euro area and this week we published our latest outlook for the Irish economy.  Let me give some reflections on both.\nAt our meeting last week, my Governing Council colleagues and I decided to keep the three key ECB interest rates unchanged.  In the latest projections, inflation has been revised down, in particular for ',
 ' which mainly reflects lower energy prices.  \nThe current data we have suggests that measures of underlying inflation have eased further, which gives us more confidence about returning to our ',
 ' per cent medium-term target.  Against this, domestic inflation remains high, in part driven by strong gr\nThe projection for economic growth in ',
 ' has been revised down to ',
 '.',
 ' per cent, with activity expected to remain weak in the near term. Growth is expected to pick up to ',
 '.',
 ' per cent in ',
 ' and ',
 '.',
 ' per cent in ',
 '. Given downward revisions to growth\nIn light of these new 

In [132]:
string = "householdseconomicprojection"
pattern = "^hous"                       # look for 'hous' at the beginning of the string
print(re.findall(pattern, string))

['hous']


In [133]:
pattern = "hous$"                     # look for 'hous' at the end of the string
print(re.findall(pattern, string))    # returns an empty list [] because the string does not end with 'hous'

[]
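By default `^` and `$` anchor to the start and end of the whole string. With the `re.MULTILINE` flag they also match at the start and end of each line. The sketch below uses a made-up three-line string (an assumption) to show the difference.

```python
import re

# Hypothetical three-line sample
text = "town house\nhousehold debt\nwarehouse"

start_default = re.findall(r"^hous", text)                 # whole string starts with 'town'
start_multi = re.findall(r"^hous", text, re.MULTILINE)     # second line starts with 'hous'

end_default = re.findall(r"house$", text)                  # whole string ends with 'house'
end_multi = re.findall(r"house$", text, re.MULTILINE)      # lines 1 and 3 end with 'house'

print(start_default)  # []
print(start_multi)    # ['hous']
print(end_default)    # ['house']
print(end_multi)      # ['house', 'house']
```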


In [134]:
pattern = "c"                        # look for every letter 'c' in the string
print(re.findall(pattern, string))

['c', 'c', 'c']


In [135]:
re.findall(r"\w{1,}", sentence)  # {1,} means "one or more", so this returns every run of word characters in the sentence

['As',
 'for',
 'households',
 'economic',
 'projection',
 'policy',
 'interest',
 'rates',
 'have',
 'only',
 'partially',
 'passed',
 'through',
 'to',
 'the',
 '29',
 'per',
 'cent',
 'of',
 'households',
 'that',
 'have',
 'a',
 'mortgage',
 'depending',
 'to',
 'a',
 'large',
 'degree',
 'on',
 'whether',
 'mortgage',
 'rates',
 'are',
 'fixed',
 'or',
 'variable']

In [136]:
re.findall(r"\w+", sentence)  # '+' is shorthand for {1,}, so the output is the same

['As',
 'for',
 'households',
 'economic',
 'projection',
 'policy',
 'interest',
 'rates',
 'have',
 'only',
 'partially',
 'passed',
 'through',
 'to',
 'the',
 '29',
 'per',
 'cent',
 'of',
 'households',
 'that',
 'have',
 'a',
 'mortgage',
 'depending',
 'to',
 'a',
 'large',
 'degree',
 'on',
 'whether',
 'mortgage',
 'rates',
 'are',
 'fixed',
 'or',
 'variable']
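The quantifiers `+`, `{n}`, `{n,m}` and `{n,}` differ only in how many repetitions they allow. A brief sketch on a made-up string (an assumption) shows the effect of each.

```python
import re

# Hypothetical sample string
text = "a 29 per cent rise in 2024"

one_or_more = re.findall(r"\d+", text)       # whole runs of digits
exactly_four = re.findall(r"\d{4}", text)    # runs of exactly 4 digits
one_or_two = re.findall(r"\d{1,2}", text)    # at most 2 digits per match, so '2024' is split
four_plus = re.findall(r"\w{4,}", text)      # words of 4 or more characters

print(one_or_more)   # ['29', '2024']
print(exactly_four)  # ['2024']
print(one_or_two)    # ['29', '20', '24']
print(four_plus)     # ['cent', 'rise', '2024']
```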

In [137]:
phone_numbers = """46 70 712 34 56
                    311 345 678
                    27 311 234 45 78
                    44-7473-2345
                    31-3115-4875
                    46-8596-7842
                    456.785.333  #IP address
                    895.223.478  #IP address
                    +65 9341 3004
                    +65 9646 4785
                    +65 8823 3412"""
print(phone_numbers)

46 70 712 34 56
                    311 345 678
                    27 311 234 45 78
                    44-7473-2345
                    31-3115-4875
                    46-8596-7842
                    456.785.333  #IP address
                    895.223.478  #IP address
                    +65 9341 3004
                    +65 9646 4785
                    +65 8823 3412


In [138]:
# find the phone numbers with the shape: 2 digits, '-', 4 digits, '-', 4 digits
re.findall(r"\d{2}\-\d{4}\-\d{4}", phone_numbers)

['44-7473-2345', '31-3115-4875', '46-8596-7842']

In [139]:
# find the phone numbers with the shape: optional '+', 2 digits, then two groups of 4 digits, each preceded by '-' or a space
re.findall(r"\+?\d{2}[\- ]\d{4}[\- ]\d{4}", phone_numbers)

['44-7473-2345',
 '31-3115-4875',
 '46-8596-7842',
 '+65 9341 3004',
 '+65 9646 4785',
 '+65 8823 3412']

In [140]:
# return only the entries labelled as IP addresses: three groups of 3 digits separated by '.'
re.findall(r"\d{3}\.\d{3}\.\d{3}", phone_numbers)

['456.785.333', '895.223.478']

In [141]:
# show all numbers that start with '+'
pattern = r"\+\d{2}[\- ]\d{4}[\- ]\d{4}"
re.findall(pattern, phone_numbers)

['+65 9341 3004', '+65 9646 4785', '+65 8823 3412']

In [144]:
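# replace the leading '+' with '00'; the parentheses capture the rest of the number and \1 puts it back
pattern = r"\+(\d{2}[\- ]\d{4}[\- ]\d{4})"
replacement = r"00\1"
print(re.sub(pattern, replacement, phone_numbers))  # re.sub returns a new string
print(phone_numbers)                                # the original string is unchanged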

46 70 712 34 56
                    311 345 678
                    27 311 234 45 78
                    44-7473-2345
                    31-3115-4875
                    46-8596-7842
                    456.785.333  #IP address
                    895.223.478  #IP address
                    0065 9341 3004
                    0065 9646 4785
                    0065 8823 3412
46 70 712 34 56
                    311 345 678
                    27 311 234 45 78
                    44-7473-2345
                    31-3115-4875
                    46-8596-7842
                    456.785.333  #IP address
                    895.223.478  #IP address
                    +65 9341 3004
                    +65 9646 4785
                    +65 8823 3412
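The replacement string can reference more than one captured group. As a sketch under stated assumptions (a two-line subset of the sample numbers and a dash-separated output format chosen only for illustration), `\g<1>` is the unambiguous spelling of `\1`, which helps when a group reference is followed by digits. The optional `count` argument limits how many matches are replaced.

```python
import re

# Two of the sample numbers from above
numbers = "+65 9341 3004\n+65 9646 4785"

# Three capture groups: country code and the two 4-digit blocks
pattern = r"\+(\d{2})[\- ](\d{4})[\- ](\d{4})"

# Rebuild each number as 00<country>-<block>-<block>
normalised = re.sub(pattern, r"00\g<1>-\g<2>-\g<3>", numbers)

# count=1 replaces only the first match
first_only = re.sub(pattern, r"00\g<1>-\g<2>-\g<3>", numbers, count=1)

print(normalised)
print(first_only)
```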
