# Enthusiastics Statistics Weekend 2017

## Intro to Text Processing Using Python
#### Modules used: re, nltk,

Author : Ade Ihsan Hidayatullah

## What are various methods of Regular Expressions?

The ‘re’ package provides several methods for querying an input string. The most commonly used ones, discussed here, are:

- re.match()
- re.search()
- re.findall()
- re.split()
- re.sub()
- re.compile()

Let’s look at them one by one.

### re.match(pattern, string):

This method finds a match only if it occurs at the start of the string. For example, calling match() on the string ‘Stat UII alias Statistika Universitas Islam Indonesia’ with the pattern ‘Stat’ succeeds. However, if we look for ‘Statistika’ instead, the pattern will not match, because that word is not at the beginning of the string. Let’s try it in Python.

In [1]:
import re
result = re.match(r'Stat', 'Stat UII alias Statistika Universitas Islam Indonesia')
print(result)

<_sre.SRE_Match object; span=(0, 4), match='Stat'>


In [2]:
result = re.match(r'Stat', 'Stat UII alias Statistika Universitas Islam Indonesia')
print(result.group())

Stat


In [3]:
result = re.match(r'Stat', 'Stat UII alias Statistika Universitas Islam Indonesia')
print(result)

<_sre.SRE_Match object; span=(0, 4), match='Stat'>


In [4]:
print(result.span())
print(result.start())
print(result.end())

(0, 4)
0
4
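The no-match case described above is worth seeing as well. When the pattern is not at the start of the string, `re.match()` returns `None`, so calling `.group()` on the result without a check would raise an `AttributeError`. A minimal sketch (the variable names `text` and `no_match` are illustrative, not from the original notebook):

```python
import re

text = 'Stat UII alias Statistika Universitas Islam Indonesia'

# 'Statistika' occurs in the text, but not at position 0, so match() fails.
no_match = re.match(r'Statistika', text)
print(no_match)  # None

# Check for None before reading the match.
if no_match is None:
    print('pattern not found at the start of the string')
else:
    print(no_match.group())
```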


### re.search(pattern, string):

It is similar to match(), but it does not restrict matches to the beginning of the string. Unlike the previous method, searching for the pattern ‘Statistika’ here returns a match.

In [5]:
result = re.search(r'Statistika', 'Stat UII alias Statistika Universitas Islam Indonesia')
print (result.group(0))

Statistika


In [6]:
print (result)

<_sre.SRE_Match object; span=(15, 25), match='Statistika'>
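To make the difference between the two methods concrete, the sketch below runs both on the same text. It also shows that `re.search()` stops at the first occurrence it finds. The variable names are illustrative.

```python
import re

text = 'Stat UII alias Statistika Universitas Islam Indonesia'

at_start = re.match(r'Statistika', text)   # anchored at position 0
anywhere = re.search(r'Statistika', text)  # scans the whole string

print(at_start)         # None
print(anywhere.span())  # (15, 25)

# search() returns only the first hit: 'Stat' is found at position 0, not 15.
first_stat = re.search(r'Stat', text)
print(first_stat.start())  # 0
```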


### re.findall(pattern, string):

It returns a list of all non-overlapping matches of the pattern and is not constrained to the start or end of the string. If we use findall() to search for ‘Stat’ in the given string, it returns both occurrences of ‘Stat’. I would recommend re.findall() as a general-purpose choice when searching a string, since it can cover many uses of re.search() and re.match(). Note, however, that it returns plain strings rather than match objects.

In [7]:
result = re.findall(r'Stat', 'Stat UII alias Statistika Universitas Islam Indonesia')
print (result)

['Stat', 'Stat']
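Because `re.findall()` returns only the matched text, the positions of the matches are lost. When the positions matter, `re.finditer()` is one option, since it yields match objects. A short sketch with illustrative variable names:

```python
import re

text = 'Stat UII alias Statistika Universitas Islam Indonesia'

# findall() gives only the matched strings.
found = re.findall(r'Stat', text)
print(found)  # ['Stat', 'Stat']

# finditer() yields match objects, so the spans are available too.
spans = [m.span() for m in re.finditer(r'Stat', text)]
print(spans)  # [(0, 4), (15, 19)]
```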


### re.split(pattern, string, [maxsplit=0]):

This method splits a string at each occurrence of the given pattern.

In [8]:
result=re.split(r'-','Statistika-Universitas-Islam-Indonesia')
result

['Statistika', 'Universitas', 'Islam', 'Indonesia']

In [9]:
result=re.split(r'-','Statistika-Universitas-Islam-Indonesia', maxsplit=1 )
result

['Statistika', 'Universitas-Islam-Indonesia']
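One more behaviour of `re.split()` that can be useful: if the pattern contains a capturing group, the delimiters themselves are kept in the result. The sketch below is a small illustration, not part of the original notebook.

```python
import re

# Without a group, the delimiter is dropped.
plain = re.split(r'-', 'Statistika-Universitas-Islam')
print(plain)  # ['Statistika', 'Universitas', 'Islam']

# With a capturing group, each delimiter is returned as its own item.
with_delims = re.split(r'(-)', 'Statistika-Universitas-Islam')
print(with_delims)  # ['Statistika', '-', 'Universitas', '-', 'Islam']
```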

### re.sub(pattern, repl, string):

It searches for a pattern and replaces each match with a new substring. If the pattern is not found, the string is returned unchanged.

In [10]:
result=re.sub(r'jogja','D.I.Yogya','Statistika UII bertempat di jogjakarta')
result

'Statistika UII bertempat di D.I.Yogyakarta'
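The sketch below illustrates three details of `re.sub()`: the optional `count` argument, the unchanged result when nothing matches, and `re.subn()`, which also reports the number of replacements. The sample text is made up for illustration.

```python
import re

text = 'jogja, jogja, jogja'

# count=1 limits the replacement to the first occurrence.
once = re.sub(r'jogja', 'Yogya', text, count=1)
print(once)  # Yogya, jogja, jogja

# No match: the input comes back unchanged.
same = re.sub(r'bandung', 'Yogya', text)
print(same == text)  # True

# subn() returns the new string together with the number of replacements.
replaced, n = re.subn(r'jogja', 'Yogya', text)
print(replaced, n)  # Yogya, Yogya, Yogya 3
```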

### re.compile(pattern, flags=0):

We can compile a regular expression pattern into a pattern object, which can then be used for matching. This lets us reuse the same pattern without rewriting it.

In [11]:
import re
pattern=re.compile('Stat')
result=pattern.findall('Stat UII alias Statistika Universitas Islam Indonesia')
print (result)
result2=pattern.findall('Prodi Statistika UII Jogjakarta Terbaik Dahhh')
print (result2)

['Stat', 'Stat']
['Stat']
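A compiled pattern can also carry flags. As a small sketch, the example below compiles a case-insensitive pattern with `re.IGNORECASE` and reuses it on two made-up strings.

```python
import re

# The flag is stored in the compiled pattern and applies to every call.
pattern = re.compile(r'stat', re.IGNORECASE)

mixed = pattern.findall('Stat UII alias Statistika')
print(mixed)  # ['Stat', 'Stat']

upper = pattern.search('prodi STATISTIKA').group()
print(upper)  # STAT
```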


## What are the most commonly used operators?

Regular expressions can specify patterns, not just fixed characters. Here are the most commonly used operators, which help build an expression that represents the required characters in a string or file. They are commonly used in web scraping and text mining to extract the required information.

| Operator | Description |
|---|---|
| `.` | Matches any single character except newline `\n`. |
| `?` | Matches 0 or 1 occurrence of the pattern to its left. |
| `+` | Matches 1 or more occurrences of the pattern to its left. |
| `*` | Matches 0 or more occurrences of the pattern to its left. |
| `\w` | Matches an alphanumeric character (or underscore), whereas `\W` (upper case W) matches a non-alphanumeric character. |
| `\d` | Matches a digit [0-9], and `\D` (upper case D) matches a non-digit. |
| `\s` | Matches a single whitespace character (space, newline, return, tab, form feed), and `\S` (upper case S) matches any non-whitespace character. |
| `\b` | Matches the boundary between a word and a non-word character; `\B` is the opposite of `\b`. |
| `[..]` | Matches any single character listed in the square brackets; `[^..]` matches any single character not listed in the square brackets. |
| `\` | Escapes characters with a special meaning, for example `\.` to match a period or `\+` to match a plus sign. |
| `^` and `$` | `^` matches the start and `$` matches the end of the string. |
| `{n,m}` | Matches at least n and at most m occurrences of the preceding expression; written as `{,m}`, it matches from 0 up to m occurrences. |
| `a\|b` | Matches either a or b. |
| `( )` | Groups regular expressions and captures the matched text. |
| `\t`, `\n`, `\r` | Match tab, newline, and return. |
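A few of these operators can be tried on a short string. The sample sentence and variable names below are made up for illustration.

```python
import re

sample = 'Item A1, B22 and C333.'

single_digits = re.findall(r'\d', sample)       # one digit at a time
digit_runs = re.findall(r'\d+', sample)         # 1 or more digits
two_to_three = re.findall(r'\d{2,3}', sample)   # runs of 2 to 3 digits
letter_digit = re.findall(r'[A-C]\d?', sample)  # A, B or C plus an optional digit
either = re.findall(r'A1|C333', sample)         # alternation
final_dot = re.findall(r'\.$', sample)          # a literal period at the end

print(single_digits)  # ['1', '2', '2', '3', '3', '3']
print(digit_runs)     # ['1', '22', '333']
print(two_to_three)   # ['22', '333']
print(letter_digit)   # ['A1', 'B2', 'C3']
print(either)         # ['A1', 'C333']
print(final_dot)      # ['.']
```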

## Some Examples of Regular Expressions

 

### Problem 1: Return the first word of a given string

#### Solution-1  Extract each character (using “\w“)

In [None]:
import re
result=re.findall(r'.','Stat UII alias Statistika Universitas Islam Indonesia')
print (result)

In [None]:
result=re.findall(r'\w','Stat UII alias Statistika Universitas Islam Indonesia')
print (result)

#### Solution-2  Extract each word (using “*” or “+“)

In [None]:
result=re.findall(r'\w*','Stat UII alias Statistika Universitas Islam Indonesia')
print (result)

In [None]:
result=re.findall(r'\w+','Stat UII alias Statistika Universitas Islam Indonesia')
print (result)

#### Solution-3 Extract the first or last word (using “^” and “$”)

In [None]:
result=re.findall(r'^\w+','Stat UII alias Statistika Universitas Islam Indonesia')
print (result)

In [None]:
result=re.findall(r'\w+$','Stat UII alias Statistika Universitas Islam Indonesia')
print (result)
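Since only one word is wanted, `re.match()` and `re.search()` are an alternative to `re.findall()` here, because they return a single match instead of a list. A minimal sketch, with illustrative variable names:

```python
import re

text = 'Stat UII alias Statistika Universitas Islam Indonesia'

# match() is anchored at the start, so no ^ is needed for the first word.
first = re.match(r'\w+', text)
first_word = first.group() if first else None
print(first_word)  # Stat

# search() with $ finds the run of word characters that ends the string.
last = re.search(r'\w+$', text)
last_word = last.group() if last else None
print(last_word)  # Indonesia
```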

### Problem 2: Return the first two characters of each word

#### Solution-1  Extract consecutive pairs of characters, excluding spaces (using “\w“)

In [None]:
result=re.findall(r'\w\w','Stat UII alias Statistika Universitas Islam Indonesia')
print (result)

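In [None]:
# \b anchors each match at a word boundary, so only the first two characters of each word are returned
result=re.findall(r'\b\w\w','Stat UII alias Statistika Universitas Islam Indonesia')
print (result)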

### Problem 3: Return the domain type of given email IDs

To keep it simple, I will again take a stepwise approach:

#### Solution-1  Extract all characters after “@”

In [None]:
result=re.findall(r'@\w+','ade@gmail.com, xyz@test.in, admin@masterstatistik.com, statistik@uii.ac.id')
print (result)

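In [None]:
# Escape the dot so that it matches a literal "." rather than any character
result=re.findall(r'@\w+\.\w+','ade@gmail.com, xyz@test.in, admin@masterstatistik.com, statistik@uii.ac.id')
print (result)

In [None]:
# The parentheses capture only the part right after the first dot
result=re.findall(r'@\w+\.(\w+)','ade@gmail.com, xyz@test.in, admin@masterstatistik.com, statistik@uii.ac.id')
print (result)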
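For an address such as statistik@uii.ac.id, capturing the part after the first dot yields "ac" rather than "id". If the last segment is what is wanted, one possible variant lets the host part include dots and relies on backtracking to stop at the final one. This is a sketch for the sample data only, not a general email parser.

```python
import re

emails = 'ade@gmail.com, xyz@test.in, admin@masterstatistik.com, statistik@uii.ac.id'

# [\w.]+ consumes the whole host, then backtracks to the last dot.
last_segments = re.findall(r'@[\w.]+\.(\w+)', emails)
print(last_segments)  # ['com', 'in', 'com', 'id']

# Capturing the class itself returns the full domain.
domains = re.findall(r'@([\w.]+)', emails)
print(domains)  # ['gmail.com', 'test.in', 'masterstatistik.com', 'uii.ac.id']
```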

### Problem 4: Return date from given string

Here we will use “\d” to extract digits.

#### Solution:

In [None]:
result=re.findall(r'\d{2}-\d{2}-\d{4}','UII 34-3456 08-07-1945, XYZ 56-4532 11-11-2011, ABC 67-8945 12-01-2009')
print (result)

In [None]:
result=re.findall(r'(\d{4})','Amit 34-3456 12-05-2007, XYZ 56-4532 11-11-2011, ABC 67-8945 12-01-2009')
print (result)

In [None]:
result=re.findall(r'\d{2}-\d{2}-(\d{4})','Amit 34-3456 12-05-2007, XYZ 56-4532 11-11-2011, ABC 67-8945 12-01-2009')
print (result)
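When a pattern has more than one group, `re.findall()` returns a list of tuples, one tuple per match. This makes it possible to pull out day, month, and year in one pass. A short sketch on the first sample string:

```python
import re

text = 'UII 34-3456 08-07-1945, XYZ 56-4532 11-11-2011, ABC 67-8945 12-01-2009'

# One group each for the two-digit day, two-digit month, and four-digit year.
parts = re.findall(r'(\d{2})-(\d{2})-(\d{4})', text)
print(parts)  # [('08', '07', '1945'), ('11', '11', '2011'), ('12', '01', '2009')]

years = [year for _, _, year in parts]
print(years)  # ['1945', '2011', '2009']
```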

### Problem 5: Return all words of a string that start with a vowel

#### Solution-1  Return each word

In [None]:
result=re.findall(r'\w+','Stat UII alias Statistika Universitas Islam Indonesia')
print (result)

#### Solution-2  Match from a vowel onward (using [ ])

In [None]:
result=re.findall(r'[aeiouAEIOU]\w+','Stat UII alias Statistika Universitas 1slam Indonesia')
print (result)
# [aeiouAEIOU]\w+ : matches start at the first vowel found, even in the middle of a word.

#### Solution-3

In [None]:
result=re.findall(r'\b[aeiouAEIOU]\w+','Stat UII alias Statistika Universitas Islam Indonesia')
print (result)
# \b[aeiouAEIOU]\w+ : words whose first character is a vowel (\b anchors the match at a word boundary)

In [None]:
result=re.findall(r'\b[^aeiouAEIOU]\w+','Stat UII alias Statistika Universitas Islam Indonesia')
print (result)
# ^ inside [ ] negates the characters that follow it,
# ' UII' appears because the space in front of it is matched by the negated class

In [None]:
result=re.findall(r'\b[^aeiouAEIOU ]\w+','UII Stat alias Statistika Universitas 1slam Indonesia')
print (result)
# ^ inside [ ] negates the characters that follow it, namely aeiou, AEIOU, and the space.
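The same idea can be written a little more compactly. The sketch below uses `re.IGNORECASE` so the vowels are listed once, `\w*` so that one-letter words would also match, and the class `[^\WaeiouAEIOU]` to mean "a word character that is not a vowel". Variable names are illustrative.

```python
import re

text = 'Stat UII alias Statistika Universitas Islam Indonesia'

# Words that start with a vowel, in either case.
vowel_words = re.findall(r'\b[aeiou]\w*', text, flags=re.IGNORECASE)
print(vowel_words)  # ['UII', 'alias', 'Universitas', 'Islam', 'Indonesia']

# Words that start with a non-vowel word character; spaces are excluded by \W.
other_words = re.findall(r'\b[^\WaeiouAEIOU]\w*', text)
print(other_words)  # ['Stat', 'Statistika']
```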

### Problem 6: Validate a phone number (phone number must consist of exactly 12 digits)

We have a list of phone numbers in the list “li”, and here we will validate them using regular expressions.

#### Solution

In [None]:
import re
li=['081997612458','999999-99999','999999x999999']
for val in li:
    # valid only if the value starts with 12 digits and is exactly 12 characters long
    if re.match(r'[0-9]{12}',val) and len(val) == 12:
        print ('Your phone number is valid')
    else:
        print ('Your phone number is invalid')

In [None]:
re.match(r'[0-9]{10}','0819976124xx')
# { } sets the required number of repetitions (here, 10 digits); match() only anchors at the start, so the trailing 'xx' is ignored
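Because `re.match()` only anchors at the start, the extra length check is needed to reject trailing characters. `re.fullmatch()` requires the whole string to match, so it can do both jobs at once. A minimal sketch, where `is_valid_phone` is a hypothetical helper name and the 12-digit rule is taken from the example above:

```python
import re

def is_valid_phone(number):
    """Return True only if the whole string is exactly 12 digits."""
    return re.fullmatch(r'[0-9]{12}', number) is not None

numbers = ['081997612458', '999999-99999', '999999x999999', '0819976124581']
checks = [is_valid_phone(number) for number in numbers]
print(checks)  # [True, False, False, False]
```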

### Problem 7: Split a string with multiple delimiters

#### Solution

In [None]:
import re
line = 'esw stat;uii,jogja,keren,sekali' # sentence to split on multiple delimiters (";", ",", " ").
result= re.split(r'[;,\s]', line)
print (result)

In [None]:
import re
line = 'esw stat;uii,jogja,keren,sekali'
result= re.sub(r'[;,\s]',' ', line)
print (result)
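If two delimiters appear next to each other, splitting on a single character produces empty strings. Adding `+` to the character class treats a run of delimiters as one. The input line below is made up to show the difference.

```python
import re

line = 'esw stat;; uii, jogja'

# One delimiter at a time: adjacent delimiters leave empty strings behind.
single = re.split(r'[;,\s]', line)
print(single)  # ['esw', 'stat', '', '', 'uii', '', 'jogja']

# One or more delimiters at a time: no empty strings.
collapsed = re.split(r'[;,\s]+', line)
print(collapsed)  # ['esw', 'stat', 'uii', 'jogja']
```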

### Problem 8: Retrieve Information from HTML file

I want to extract information from an HTML file (see the sample data below). Here we need to extract the information between < td> and < /td>, except for the first numerical index. I assume that the HTML code below is stored in a string named str.

##### Sample HTML file (str)

< tr align="center">< td>1< /td> < td>Noah< /td> < td>Emma< /td>< /tr> <br>
< tr align="center">< td>2< /td> < td>Liam< /td> < td>Olivia< /td>< /tr> <br>
< tr align="center">< td>3< /td> < td>Mason< /td> < td>Sophia< /td>< /tr> <br>
< tr align="center">< td>4< /td> < td>Jacob< /td> < td>Isabella< /td>< /tr> <br>
< tr align="center">< td>5< /td> < td>William< /td> < td>Ava< /td>< /tr> <br>
< tr align="center">< td>6< /td> < td>Ethan< /td> < td>Mia< /td>< /tr> <br>
< tr align="center">< td>7< /td> < td HTML>Michael< /td> < td>Emily< /td>< /tr> <br>

#### Solution

In [None]:
str = '''
<tr align="center"><td>1</td> <td>Noah</td> <td>Emma</td></tr>
<tr align="center"><td>2</td> <td>Liam</td> <td>Olivia</td></tr>
<tr align="center"><td>3</td> <td>Mason</td> <td>Sophia</td></tr>
<tr align="center"><td>4</td> <td>Jacob</td> <td>Isabella</td></tr>
<tr align="center"><td>5</td> <td>William</td> <td>Ava</td></tr>
<tr align="center"><td>6</td> <td>Ethan</td> <td>Mia</td></tr>
<tr align="center"><td>7</td> <td HTML>Michael</td> <td>Emily</td></tr>
'''
result=re.findall(r'<td>\w+</td>\s<td>(\w+)</td>\s<td>(\w+)</td>',str)
print (result)
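Note that the pattern above requires a bare `<td>` tag, so the row whose cell is written as `<td HTML>` is skipped. One possible way to tolerate attributes is to allow any characters up to the closing `>`. The sketch below uses a shortened copy of the sample data and the name `html` (instead of `str`, which shadows a Python built-in). For real web pages a dedicated HTML parser is usually a safer choice than regular expressions.

```python
import re

html = '''
<tr align="center"><td>6</td> <td>Ethan</td> <td>Mia</td></tr>
<tr align="center"><td>7</td> <td HTML>Michael</td> <td>Emily</td></tr>
'''

# Strict version: only bare <td> tags are accepted, so row 7 is missed.
strict = re.findall(r'<td>\w+</td>\s<td>(\w+)</td>\s<td>(\w+)</td>', html)
print(strict)  # [('Ethan', 'Mia')]

# Tolerant version: <td[^>]*> also accepts tags with attributes.
cell = r'<td[^>]*>'
row = cell + r'\w+</td>\s' + cell + r'(\w+)</td>\s' + cell + r'(\w+)</td>'
tolerant = re.findall(row, html)
print(tolerant)  # [('Ethan', 'Mia'), ('Michael', 'Emily')]
```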

In [None]:
import autocorrect

In [None]:
autocorrect.spell('conect')