# Overview of Regular Expressions

Regular expressions (sometimes called regex for short) allow a user to search for strings using almost any sort of rule they can come up with. For example, finding all capital letters in a string, or finding a phone number in a document.

Regular expressions are notorious for their seemingly strange syntax. This strange syntax is a byproduct of their flexibility. Regular expressions have to be able to filter out any string pattern you can imagine, which is why they have a complex string pattern format.

Regex is hard to memorize, so don't try to memorize it. Focus on understanding it, so that the next time you need it you can look up the specific pattern directly.

In [2]:
# example pattern
# Phone Number
# (888)-888-8888
# \d == digit
# literal parentheses must be escaped as \( and \)
# r"\(\d\d\d\)-\d\d\d-\d\d\d\d" or r"\(\d{3}\)-\d{3}-\d{4}"
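
The pattern in the comment above can be tried directly. This is a minimal sketch; the `sample` string is a hypothetical input that is not part of the original notebook.

```python
import re

# Hypothetical sample string, used only for illustration.
sample = 'Call (888)-888-8888 today'

# \( and \) match literal parentheses; \d{3} means exactly three digits.
pattern = r'\(\d{3}\)-\d{3}-\d{4}'

found = re.search(pattern, sample)
print(found.group())  # (888)-888-8888
```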

### Basic Patterns

In [3]:
text = 'My phone number is 15251886383'

In [4]:
'phone' in text

True

Now let's use regex.

In [5]:
import re

In [6]:
pattern = 'phone'

In [7]:
# search for the pattern ('phone') inside text
re.search(pattern,text)

<re.Match object; span=(3, 8), match='phone'>

In [10]:
# now try a pattern that does not occur in the text
re.search(pattern='phones',string=text)

# nothing is displayed because no match was found (re.search returns None)
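
Because `re.search` returns `None` when nothing matches, it is usually worth checking the result before calling methods on it. A small sketch of that check (the `status` variable is just an illustrative name):

```python
import re

text = 'My phone number is 15251886383'

# re.search returns None when the pattern is absent,
# so test the result before calling .group() or .span() on it.
result = re.search('phones', text)
if result is None:
    status = 'no match'
else:
    status = result.group()

print(status)  # no match
```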

In [11]:
# now let's store the match in a variable
match = re.search(pattern,text)

In [12]:
match

<re.Match object; span=(3, 8), match='phone'>

In [14]:
match.span()

(3, 8)

In [15]:
match.start()

3

In [16]:
match.end()

8
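
The values returned by `span()`, `start()`, and `end()` are ordinary string indices, so they can be used to slice the original text. A brief sketch using the same sentence as above:

```python
import re

text = 'My phone number is 15251886383'
match = re.search('phone', text)

# span() gives (start, end); end is exclusive, just like slicing.
start, end = match.span()
print(text[start:end])  # phone
```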

But if the sentence contains "phone" twice, `re.search` only returns the first occurrence.

In [17]:
text = 'my new phone is the phone I bought yesterday'

In [18]:
match = re.search(pattern,text)

In [20]:
match.span()

(7, 12)

To find all of them, use `findall`.

In [21]:
matches = re.findall(pattern,text)

In [22]:
matches

['phone', 'phone']

In [23]:
len(matches)

2

But if you want the match objects themselves, use an iterator (`finditer`).

In [24]:
for match in re.finditer(pattern,text):
    print(match)

<re.Match object; span=(7, 12), match='phone'>
<re.Match object; span=(20, 25), match='phone'>


In [26]:
for match in re.finditer(pattern,text):
    print(match.span())

(7, 12)
(20, 25)


In [27]:
for match in re.finditer(pattern,text):
    print(match.group())

phone
phone
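
The three loops above can also be combined into a single pass that collects the matched text together with its position. A minimal sketch; `occurrences` is an illustrative name:

```python
import re

text = 'my new phone is the phone I bought yesterday'

# Collect (matched text, start, end) for every occurrence in one pass.
occurrences = [(m.group(), m.start(), m.end())
               for m in re.finditer('phone', text)]

print(occurrences)  # [('phone', 7, 12), ('phone', 20, 25)]
```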


# Patterns

So far we've learned how to search for a basic string. What about more complex examples, such as trying to find a telephone number in a large string of text, or an email address?

We could just use the search method if we know the exact phone number or email, but what if we don't know it? We may know the general format, and we can use that along with regular expressions to search the document for strings that match a particular pattern.

This is where the syntax may appear strange at first, but take your time with this. Often it's just a matter of looking up the pattern code.

Let's begin!

## Identifiers for Characters in Patterns

Character types such as a digit or a whitespace character have codes that represent them. You can use these to build up a pattern string. Notice how these make heavy use of the backslash \ . Because of this, when defining a pattern string for a regular expression we use the format:

    r'mypattern'
    
Placing the r in front of the string tells Python that the \ characters in the pattern string are not meant to be escape characters.
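
To see what the r prefix changes, compare a normal string with a raw string. This is a small sketch; the variable names are illustrative only.

```python
# In a normal string, '\n' is a single newline character.
normal = '\n'
# In a raw string, r'\n' is two characters: a backslash and the letter n.
raw = r'\n'
print(len(normal), len(raw))  # 1 2

# A raw pattern is the same as a normal string with every backslash doubled.
same = (r'\d\d' == '\\d\\d')
print(same)  # True
```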

Below you can find a table of the most common identifiers:

<table ><tr><th>Character</th><th>Description</th><th>Example Pattern Code</th><th >Example Match</th></tr>

<tr ><td><span >\d</span></td><td>A digit</td><td>file_\d\d</td><td>file_25</td></tr>

<tr ><td><span >\w</span></td><td>Alphanumeric (or underscore)</td><td>\w-\w\w\w</td><td>A-b_1</td></tr>



<tr ><td><span >\s</span></td><td>White space</td><td>a\sb\sc</td><td>a b c</td></tr>



<tr ><td><span >\D</span></td><td>A non digit</td><td>\D\D\D</td><td>ABC</td></tr>

<tr ><td><span >\W</span></td><td>Non-alphanumeric</td><td>\W\W\W\W\W</td><td>*-+=)</td></tr>

<tr ><td><span >\S</span></td><td>Non-whitespace</td><td>\S\S\S\S</td><td>Yoyo</td></tr></table>
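
The identifiers in the table can be checked quickly in Python. This is a minimal sketch; the `sample` string is made up from the example matches in the table.

```python
import re

# Hypothetical sample built from the table's example matches.
sample = 'file_25 A-b_1'

digits = re.findall(r'\d', sample)               # every single digit
file_name = re.search(r'file_\d\d', sample).group()
word_code = re.search(r'\w-\w\w\w', sample).group()
spaces = re.findall(r'\s', sample)               # whitespace characters
non_word = re.findall(r'\W', sample)             # anything \w does not match

print(digits)     # ['2', '5', '1']
print(file_name)  # file_25
print(word_code)  # A-b_1
print(spaces)     # [' ']
print(non_word)   # [' ', '-']
```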

In [28]:
# example
text = 'My number is 1525-188-6383'

In [29]:
# we want to grab the phone number
phone = re.search('1525-188-6383',text)

In [30]:
# found, but this does not use a pattern and requires knowing the number in advance
phone

<re.Match object; span=(13, 26), match='1525-188-6383'>

In [31]:
# use a pattern
# the r prefix is needed because Python has built-in escape sequences that use "\"
# such as \n and \t, so we use r to keep the backslashes intact for the regex
phone = re.search(r'\d\d\d\d-\d\d\d-\d\d\d\d',text)

In [32]:
phone

<re.Match object; span=(13, 26), match='1525-188-6383'>

In [33]:
# group() returns the matched text
phone.group()

'1525-188-6383'

Notice the repetition of \d. That is a bit of an annoyance, especially if we are looking for very long strings of numbers: with hundreds of digits, typing \d hundreds of times is not practical. Let's explore the possible quantifiers.

## Quantifiers

Now that we know the special character designations, we can use them along with quantifiers to define how many we expect.

<table ><tr><th>Character</th><th>Description</th><th>Example Pattern Code</th><th >Example Match</th></tr>

<tr ><td><span >+</span></td><td>Occurs one or more times</td><td>Version \w-\w+</td><td>Version A-b1_1</td></tr>

<tr ><td><span >{3}</span></td><td>Occurs exactly 3 times</td><td>\D{3}</td><td>abc</td></tr>



<tr ><td><span >{2,4}</span></td><td>Occurs 2 to 4 times</td><td>\d{2,4}</td><td>123</td></tr>



<tr ><td><span >{3,}</span></td><td>Occurs 3 or more</td><td>\w{3,}</td><td>anycharacters</td></tr>

<tr ><td><span >*</span></td><td>Occurs zero or more times</td><td>A*B*C*</td><td>AAACC</td></tr>

<tr ><td><span >?</span></td><td>Once or none</td><td>plurals?</td><td>plural</td></tr></table>
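
A short sketch of the quantifiers in action. The input strings are hypothetical and chosen only to show where each quantifier starts and stops matching.

```python
import re

# {2,4}: between 2 and 4 digits; a lone digit is skipped, and a run of
# five digits yields its first four (the leftover single digit is too short).
ranged = re.findall(r'\d{2,4}', '1 22 333 4444 55555')
print(ranged)  # ['22', '333', '4444', '5555']

# {3,}: three or more word characters.
long_words = re.findall(r'\w{3,}', 'an any anycharacters')
print(long_words)  # ['any', 'anycharacters']

# ?: the preceding character is optional.
plurals = re.findall(r'plurals?', 'plural plurals')
print(plurals)  # ['plural', 'plurals']

# *: zero or more, so the missing B's are fine.
star = re.fullmatch(r'A*B*C*', 'AAACC') is not None
print(star)  # True
```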

In [34]:
# rewrite the pattern using quantifiers
phone = re.search(r'\d{4}-\d{3}-\d{4}',text)

In [35]:
phone.group()

'1525-188-6383'

## Groups

What if we wanted to do two tasks: find phone numbers, and also quickly extract their prefix (the first block of digits, such as an area code)? We can use groups for any general task that involves grouping together parts of a regular expression (so that we can later break them down).

Using the phone number example, we can separate groups of regular expressions using parentheses:

In [36]:
# another approach: split the pattern into several parts
# each part is wrapped in ()
# and the pattern is built with re.compile
phone_pattern = re.compile(r'(\d{4})-(\d{3})-(\d{4})')

In [37]:
results = re.search(phone_pattern,text)

In [38]:
results.group()

'1525-188-6383'

In [39]:
# the benefit is that we can access each group separately
# e.g. the first 4 digits identify the SIM operator
# so we can retrieve just those first 4 digits
results.group(1)

'1525'

#### Note: group indexing starts at 1, not 0 (group(0) returns the entire match)!

In [40]:
results.group(2)

'188'

In [41]:
results.group(3)

'6383'
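
Instead of calling `group(1)`, `group(2)`, and `group(3)` one at a time, the match object's `groups()` method returns all captured groups as a tuple. A minimal sketch; the names `prefix`, `middle`, and `last` are illustrative.

```python
import re

text = 'My number is 1525-188-6383'
phone_pattern = re.compile(r'(\d{4})-(\d{3})-(\d{4})')

# A compiled pattern has its own search method.
results = phone_pattern.search(text)

whole = results.group(0)   # group(0) is the entire match
parts = results.groups()   # all captured groups, starting from group 1
prefix, middle, last = parts

print(whole)  # 1525-188-6383
print(parts)  # ('1525', '188', '6383')
```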

## Additional Regex Syntax

### Or operator |

Use the pipe operator to create an **or** statement. For example:

In [42]:
re.search(r'cat','This cat is so cute')

<re.Match object; span=(5, 8), match='cat'>

In [43]:
re.search(r'dog','This cat is so cute')

In [45]:
# use or to match either one alternative or the other
re.search(r'cat|dog','This cat is so cute')

<re.Match object; span=(5, 8), match='cat'>
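
The or operator also works with `findall`, which returns whichever alternative matched at each position. A small sketch with a hypothetical sentence:

```python
import re

# Hypothetical sentence containing both alternatives.
sentence = 'The dog chased the cat, then the dog slept'

animals = re.findall(r'cat|dog', sentence)
print(animals)  # ['dog', 'cat', 'dog']

# If neither alternative is present, search returns None.
nothing = re.search(r'cat|dog', 'This bird is so cute')
print(nothing)  # None
```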

### The Wildcard Character

Use a "wildcard" as a placeholder that will match any character placed there (by default, any character except a newline). You can use a simple period **.** for this. For example:

In [47]:
re.findall(r'at','The cat in the hat sat here')

['at', 'at', 'at']

In [48]:
# we want to grab the words that end in 'at'
# each dot adds one wildcard slot in front of 'at'
re.findall(r'.at','The cat in the hat sat here')

['cat', 'hat', 'sat']

In [49]:
# what if one of the words is 5 letters long
re.findall(r'...at','The cat in the hat splat here')

['e cat', 'e hat', 'splat']

In [51]:
# but the other matches come out wrong, so it should be written like this:
# one or more non-whitespace characters followed by 'at'
re.findall(r'\S+at','The cat in the hat splat here')

['cat', 'hat', 'splat']
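
The reason the dot produced odd matches is that it also matches a space. A minimal sketch with a hypothetical phrase that makes the difference visible:

```python
import re

# Hypothetical phrase where 'at' appears as its own word.
phrase = 'look at that'

# The dot accepts the space before 'at', so ' at' is returned.
with_dot = re.findall(r'.at', phrase)
print(with_dot)  # [' at', 'hat']

# \S+ requires at least one non-whitespace character before 'at',
# so the standalone word 'at' is skipped and the whole word 'that' is returned.
with_non_space = re.findall(r'\S+at', phrase)
print(with_non_space)  # ['that']
```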

### Starts with and Ends With

We can use the **^** to signal starts with, and the **$** to signal ends with:

In [52]:
# the string ends with a digit
re.findall(r'\d$','This ends with a number 2')

['2']

In [55]:
# the string starts with a digit
re.findall(r'^\d','1 is gonna start this text.')

['1']

Remember that this applies to the entire string, not to individual words!

In [56]:
# a digit in the middle of the string is not found
re.findall(r'^\d','I ate 1 piece of cake')

[]
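
If the text spans several lines and you want **^** and **$** to apply to each line rather than to the whole string, Python's `re.MULTILINE` flag does that. This goes slightly beyond what the notebook covers, so treat it as an optional sketch; the `lines` string is a hypothetical input.

```python
import re

# Hypothetical three-line string.
lines = '1 apple\nbanana 2\n3 cherries'

# By default ^ only matches at the very start of the string.
default_start = re.findall(r'^\d', lines)
print(default_start)  # ['1']

# With re.MULTILINE, ^ and $ also match at the start and end of each line.
line_starts = re.findall(r'^\d', lines, re.MULTILINE)
line_ends = re.findall(r'\d$', lines, re.MULTILINE)
print(line_starts)  # ['1', '3']
print(line_ends)    # ['2']
```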

### Exclusion

To exclude characters, we can use the **^** symbol in conjunction with a set of brackets **[]**. Anything inside the brackets is excluded. For example:

In [57]:
phrase = "there are 3 numbers 34 inside 5 this sentence."

In [58]:
# to exclude digits, write ^ followed by whatever should be excluded
# and remember to wrap it in []
re.findall(r'[^\d]',phrase)

['t',
 'h',
 'e',
 'r',
 'e',
 ' ',
 'a',
 'r',
 'e',
 ' ',
 ' ',
 'n',
 'u',
 'm',
 'b',
 'e',
 'r',
 's',
 ' ',
 ' ',
 'i',
 'n',
 's',
 'i',
 'd',
 'e',
 ' ',
 ' ',
 't',
 'h',
 'i',
 's',
 ' ',
 's',
 'e',
 'n',
 't',
 'e',
 'n',
 'c',
 'e',
 '.']

To get the words back together, add a + sign after the brackets.

In [60]:
# join consecutive non-digit characters using []+
re.findall(r'[^\d]+',phrase)

['there are ', ' numbers ', ' inside ', ' this sentence.']

In [None]:
# another example

Remove the punctuation from the sentence.

In [67]:
test_phrase = 'Kamu kenapa sih diem aja! Ngomong donk! Semoga kamu cepat sembuh.'

In [71]:
# exclude !, ., and ?
re.findall(r'[^!.?]+',test_phrase)

['Kamu kenapa sih diem aja', ' Ngomong donk', ' Semoga kamu cepat sembuh']

In [72]:
# split into individual words by also excluding the space ' '
re.findall(r'[^!.? ]+',test_phrase)

['Kamu',
 'kenapa',
 'sih',
 'diem',
 'aja',
 'Ngomong',
 'donk',
 'Semoga',
 'kamu',
 'cepat',
 'sembuh']

In [73]:
# then join the words back together
clean = ' '.join(re.findall(r'[^!.? ]+',test_phrase))

In [74]:
clean

'Kamu kenapa sih diem aja Ngomong donk Semoga kamu cepat sembuh'
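
An alternative to `findall` plus `join` is to delete the unwanted characters directly with `re.sub`. This is a sketch of a different technique from the one used above; for this particular sentence it produces the same result.

```python
import re

test_phrase = 'Kamu kenapa sih diem aja! Ngomong donk! Semoga kamu cepat sembuh.'

# Replace every !, ., or ? with an empty string.
# Inside [], the dot and question mark are literal characters.
clean = re.sub(r'[!.?]', '', test_phrase)

print(clean)  # Kamu kenapa sih diem aja Ngomong donk Semoga kamu cepat sembuh
```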

## Brackets for Grouping

As we showed above we can use brackets to group together options, for example if we wanted to find hyphenated words:

In [75]:
text = 'Only find the hypen-words in this sentence. But you do not know how long-ish they are'

In [76]:
# drop the dash (-) by matching only runs of word characters
re.findall(r'[\w]+',text)

['Only',
 'find',
 'the',
 'hypen',
 'words',
 'in',
 'this',
 'sentence',
 'But',
 'you',
 'do',
 'not',
 'know',
 'how',
 'long',
 'ish',
 'they',
 'are']

In [77]:
# or, if we only want the hyphenated words
re.findall(r'[\w]+-[\w]+',text)

['hypen-words', 'long-ish']

In [78]:
# this also works without []; around a single \w the brackets are optional
re.findall(r'\w+-\w+',text)

['hypen-words', 'long-ish']
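
Brackets become more useful when they list several allowed characters, since any one character from the set can match at that position. A minimal sketch, assuming we only care about lowercase letters; the `[a-z]` range syntax is standard but is not used elsewhere in this notebook.

```python
import re

text = 'Only find the hypen-words in this sentence. But you do not know how long-ish they are'

# [aeiou] matches any single vowel from the set.
vowels = re.findall(r'[aeiou]', 'regex')
print(vowels)  # ['e', 'e']

# [a-z] is a range: any one lowercase letter.
hyphenated = re.findall(r'[a-z]+-[a-z]+', text)
print(hyphenated)  # ['hypen-words', 'long-ish']
```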

## Parentheses for Multiple Options

If we have multiple options for matching, we can use parentheses to list out these options. For example:

Use [] for grouping characters, and () for multiple options.

In [79]:
# Find words that start with cat and end with one of these options: 'fish','nap', or 'claw'
text = 'Hello, would you like some catfish?'
texttwo = "Hello, would you like to take a catnap?"
textthree = "Hello, have you seen this caterpillar?"

In [81]:
re.search(r'cat(fish|nap|claw)',text)

<re.Match object; span=(27, 34), match='catfish'>

In [82]:
re.search(r'cat(fish|nap|claw)',texttwo)

<re.Match object; span=(32, 38), match='catnap'>

In [83]:
# no match, because 'erpillar' is not one of the listed options
re.search(r'cat(fish|nap|claw)',textthree)

In [84]:
# now it matches
re.search(r'cat(fish|nap|erpillar)',textthree)

<re.Match object; span=(26, 37), match='caterpillar'>
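
One thing to watch for: when a pattern contains parentheses, `re.findall` returns only the text captured by the group, not the whole match. A sketch of two ways around that, using a hypothetical sentence; `(?:...)` is Python's non-capturing group syntax.

```python
import re

# Hypothetical sentence with several 'cat' words.
sentence = 'catfish, catnap and caterpillar'

# With a capturing group, findall returns just the group contents.
captured = re.findall(r'cat(fish|nap|claw)', sentence)
print(captured)  # ['fish', 'nap']

# A non-capturing group (?:...) keeps the options but returns the full match.
full = re.findall(r'cat(?:fish|nap|claw)', sentence)
print(full)  # ['catfish', 'catnap']

# finditer with group() also gives the full match for each occurrence.
via_iter = [m.group() for m in re.finditer(r'cat(fish|nap|claw)', sentence)]
print(via_iter)  # ['catfish', 'catnap']
```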

### Conclusion

Excellent work! For full information on all possible patterns, check out: https://docs.python.org/3/howto/regex.html