In [1]:
import re
from IPython import display

# Regular Expressions (Regex)

## Introduction

Regular expressions (also called *regex*) are *patterns used to search text*.

A regex is usually a combination of several special characters, each with its own meaning, that together form a single query

<br />

Example usage of regex:
1. Validate length
2. Validate a pattern (e.g. an email address)
3. Find text that follows a specific pattern (e.g. the website name in a link, numbers in text)
4. Select a specific part of the text and ignore the rest (e.g. select the text without punctuation marks)
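
The four use cases above can be sketched with Python's `re` module. The patterns and sample strings are illustrative assumptions, not part of the original notebook, and the email pattern in particular is deliberately simplistic rather than a complete validator.

```python
import re

# 1. Validate length: between 3 and 8 word characters, nothing else.
is_valid_length = re.fullmatch(r"\w{3,8}", "nlp_2022") is not None

# 2. Validate a pattern: a simplified (assumed) email shape.
is_email = re.fullmatch(r"[\w.+-]+@[\w-]+\.[\w.]+", "user@example.com") is not None

# 3. Find text that follows a pattern: every run of digits.
numbers = re.findall(r"\d+", "Order 66 shipped in 2022")

# 4. Keep part of the text and drop the rest: remove punctuation marks.
no_punctuation = re.sub(r"[^\w\s]", "", "Hello, world!")

print(is_valid_length, is_email, numbers, no_punctuation)
```

`re.fullmatch` is used for validation because it succeeds only when the whole string fits the pattern, while `re.findall` and `re.sub` work on every match inside a longer text.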

<br />

Remember:
- Regular expressions aren't easy to master, and nobody remembers every expression
- Whenever you have a case that needs a regex, search for an existing solution first
- Go to [regexr](https://regexr.com/) and test your regex, adjusting it until you're satisfied


### Regex Basics

In [2]:
display.IFrame('https://devhints.io/regexp', width=800, height=500)

The best way to learn is by example, so let's explore the basics first

### Example 1

Select alphanumeric characters only (letters and digits) from the following tweet **(with changes)**

In [3]:
display.HTML('<blockquote class="twitter-tweet"><p lang="en" dir="ltr">How to get started in Natural Language Processing (NLP) thread 🧵🪡</p>&mdash; Munich🥨NLP (@MunichNlp) <a href="https://twitter.com/MunichNlp/status/1551257503018717184?ref_src=twsrc%5Etfw">July 24, 2022</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script> ')

#### Alternative I

In [4]:
tweet = "How to get started in Natural Language Processing (NLP) ?! thread 🧵🪡 #NLP #datascience #nlp_2022 ..."

In [5]:
# Raw string (r'...') so the backslash reaches the regex engine unchanged
matches = re.findall(pattern=r'\w', string=tweet)
print(matches)

['H', 'o', 'w', 't', 'o', 'g', 'e', 't', 's', 't', 'a', 'r', 't', 'e', 'd', 'i', 'n', 'N', 'a', 't', 'u', 'r', 'a', 'l', 'L', 'a', 'n', 'g', 'u', 'a', 'g', 'e', 'P', 'r', 'o', 'c', 'e', 's', 's', 'i', 'n', 'g', 'N', 'L', 'P', 't', 'h', 'r', 'e', 'a', 'd', 'N', 'L', 'P', 'd', 'a', 't', 'a', 's', 'c', 'i', 'e', 'n', 'c', 'e', 'n', 'l', 'p', '_', '2', '0', '2', '2']


What does this RegEx mean?

It means match any **alphanumeric** character or the **underscore** character ***(one at a time)***

In [6]:
"".join(matches)

'HowtogetstartedinNaturalLanguageProcessingNLPthreadNLPdatasciencenlp_2022'

This solution yields the alphanumeric characters (plus the underscore), but they are no longer separated by spaces. Why?

Because the whitespace characters in the tweet weren't matched

In [7]:
matches = re.findall(pattern=r'[\w\s]', string=tweet)
print(matches)

['H', 'o', 'w', ' ', 't', 'o', ' ', 'g', 'e', 't', ' ', 's', 't', 'a', 'r', 't', 'e', 'd', ' ', 'i', 'n', ' ', 'N', 'a', 't', 'u', 'r', 'a', 'l', ' ', 'L', 'a', 'n', 'g', 'u', 'a', 'g', 'e', ' ', 'P', 'r', 'o', 'c', 'e', 's', 's', 'i', 'n', 'g', ' ', 'N', 'L', 'P', ' ', ' ', 't', 'h', 'r', 'e', 'a', 'd', ' ', ' ', 'N', 'L', 'P', ' ', 'd', 'a', 't', 'a', 's', 'c', 'i', 'e', 'n', 'c', 'e', ' ', 'n', 'l', 'p', '_', '2', '0', '2', '2', ' ']


What does this RegEx mean?

It means match any **alphanumeric or underscore** <ins>***or***</ins> **whitespace** character ***(one at a time)***
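
Square brackets accept literal characters next to shorthand classes like `\w` and `\s`. As a small sketch (the sample string and the idea of keeping `#` are assumptions made for illustration), adding `#` to the set keeps hashtags intact while still dropping `?` and `!`:

```python
import re

sample = "#NLP ?! #nlp_2022"

# One character per match: word characters, whitespace, or a literal '#'.
kept = re.findall(r"[\w\s#]", sample)

# '?' and '!' are not in the set, so they disappear from the joined text.
joined = "".join(kept)
print(joined)
```

The order of items inside the brackets doesn't matter: `[#\s\w]` matches exactly the same characters.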

In [8]:
"".join(matches)

'How to get started in Natural Language Processing NLP  thread  NLP datascience nlp_2022 '
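
The joined string above keeps doubled spaces wherever punctuation or emoji were dropped between two spaces. One possible cleanup, which is not part of the original notebook, is to collapse each run of whitespace into a single space with `re.sub` and then trim the ends:

```python
import re

joined = "How to get started in Natural Language Processing NLP  thread  NLP datascience nlp_2022 "

# \s+ matches a run of one or more whitespace characters;
# each run is replaced by a single space, then the ends are trimmed.
cleaned = re.sub(r"\s+", " ", joined).strip()
print(cleaned)
```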

#### Alternative II

In [9]:
matches = re.findall(pattern=r'\w+', string=tweet)
print(matches)

['How', 'to', 'get', 'started', 'in', 'Natural', 'Language', 'Processing', 'NLP', 'thread', 'NLP', 'datascience', 'nlp_2022']


What does this RegEx mean?

1. `\w`: match any alphanumeric character or underscore
2. `+`: match one or more of the preceding token in a row
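
To see what `+` changes, here is a minimal comparison on a made-up string. It uses `\d` (any digit), which the notebook has not introduced, purely to keep the output short:

```python
import re

sample = "nlp_2022 and 24"

# Without a quantifier, every digit is its own match.
single_digits = re.findall(r"\d", sample)

# With +, consecutive digits are grouped into one match.
digit_runs = re.findall(r"\d+", sample)

print(single_digits)
print(digit_runs)
```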

In [10]:
" ".join(matches)

'How to get started in Natural Language Processing NLP thread NLP datascience nlp_2022'

Takeaways:
1. A single problem usually has many RegEx solutions
2. Unless you specify otherwise, RegEx matches a single character at a time
3. `\w`: match an alphanumeric or underscore character
4. `\s`: match a whitespace character
5. `+`: match one or more of the preceding token
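
As an illustration of the first takeaway, the words can also be extracted by splitting on everything that is *not* a word character (`\W+`) instead of matching the word characters directly. This is a sketch on a shortened, made-up version of the tweet, not a solution from the original notebook:

```python
import re

sample = "How to get started (NLP) ?! #nlp_2022 ..."

# Solution A: match runs of word characters.
by_matching = re.findall(r"\w+", sample)

# Solution B: split on runs of non-word characters,
# then drop the empty strings left at the edges.
by_splitting = [word for word in re.split(r"\W+", sample) if word]

print(by_matching == by_splitting)
```

Splitting leaves an empty string when the text starts or ends with a separator, which is why the filter is needed.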

### Example 2

Select Arabic characters only from the following tweet

In [11]:
display.HTML(data='<blockquote class="twitter-tweet"><p lang="ar" dir="rtl">غابات دبين 🇯🇴<br>محمية طبيعية استثنائية في مدينة جرش العريقة، وتزخر في فصلي الربيع والصيف بالزوار نظراً لكونها واحدة من أهم المعالم السياحية في الأردن، حيث تضم مساحاتها الشاسعة أنواع عديدة من أنواع الاشجار والنباتات المختلفة والزهور الخلابة ⬇️<a href="https://twitter.com/hashtag/%D8%A7%D9%84%D8%A3%D8%B1%D8%AF%D9%86_%D8%A8%D8%B9%D9%8A%D9%88%D9%86_%D8%A3%D8%B1%D8%AF%D9%86%D9%8A%D8%A9?src=hash&amp;ref_src=twsrc%5Etfw">#الأردن_بعيون_أردنية</a> 🇯🇴 <a href="https://t.co/JR8eNo7qzi">pic.twitter.com/JR8eNo7qzi</a></p>&mdash; يزن ⚖︎🇯🇴 (@YazanTheeb22) <a href="https://twitter.com/YazanTheeb22/status/1552275609426006017?ref_src=twsrc%5Etfw">July 27, 2022</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script> ')

In [12]:
tweet = """
غابات دبين 🇯🇴
محمية طبيعية استثنائية في مدينة جرش العريقة، وتزخر في فصلي الربيع والصيف بالزوار نظراً لكونها واحدة من أهم المعالم السياحية في الأردن، حيث تضم مساحاتها الشاسعة أنواع عديدة من أنواع الاشجار والنباتات المختلفة والزهور الخلابة ⬇️
#الأردن_بعيون_أردنية 🇯🇴
"""

In [13]:
re.findall("[\u0600-\u06FF]+", tweet)

['غابات',
 'دبين',
 'محمية',
 'طبيعية',
 'استثنائية',
 'في',
 'مدينة',
 'جرش',
 'العريقة،',
 'وتزخر',
 'في',
 'فصلي',
 'الربيع',
 'والصيف',
 'بالزوار',
 'نظراً',
 'لكونها',
 'واحدة',
 'من',
 'أهم',
 'المعالم',
 'السياحية',
 'في',
 'الأردن،',
 'حيث',
 'تضم',
 'مساحاتها',
 'الشاسعة',
 'أنواع',
 'عديدة',
 'من',
 'أنواع',
 'الاشجار',
 'والنباتات',
 'المختلفة',
 'والزهور',
 'الخلابة',
 'الأردن',
 'بعيون',
 'أردنية']

What does this RegEx mean?

1. `[]`: match any single character within this set
2. `\u0600-\u06FF`: the range of *Unicode* code points that covers the Arabic block
3. `+`: match one or more at a time
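
Notice that some matches in the output above, such as the words ending in `،`, still carry the Arabic comma. That happens because the comma (U+060C) sits inside the `\u0600-\u06FF` range. A narrower range is one way to keep letters only. The bounds `\u0621-\u064A` below are an assumption about what counts as a letter, and they also exclude diacritics (U+064B onward), so a word like `نظراً` would lose its final mark:

```python
import re

# "\u062c\u0631\u0634" is the word for Jerash from the tweet,
# "\u060c" is the Arabic comma, and "\u062d\u064a\u062b" is the next sample word.
text = "\u062c\u0631\u0634\u060c \u062d\u064a\u062b"

# Whole Arabic block: the comma is matched together with the word.
whole_block = re.findall("[\u0600-\u06FF]+", text)

# Assumed letters-only range: the comma is left out.
letters_only = re.findall("[\u0621-\u064A]+", text)

print(whole_block)
print(letters_only)
```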


#### What is Unicode?

It's a standard for representing alphabets, characters, emojis, symbols, etc. in computer systems.

Unicode is widely adopted and has become the de facto standard. Older character sets such as ASCII cover far fewer characters (ASCII defines only 128).
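
In Python, each character maps to a numeric Unicode code point, which is what the `\uXXXX` escapes in the regex refer to. A small sketch with the built-ins `ord` and `chr` (the sample characters are chosen for illustration, and `str.isascii` requires Python 3.7 or later):

```python
# Every character has a numeric code point.
latin_a = ord("A")            # 65, which is also a valid ASCII code
arabic_jeem = ord("\u062c")   # the Arabic letter jeem, written as an escape

# chr() goes the other way, from code point to character.
same_letter = chr(0x062C)

# ASCII stops at code point 127, so Arabic letters fall outside it.
in_ascii = [ch.isascii() for ch in ("A", same_letter)]

print(latin_a, hex(arabic_jeem), same_letter, in_ascii)
```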

Resources:
- [Official Site](https://home.unicode.org/)
- [Arabic language in Unicode](https://unicode.org/charts/PDF/U0600.pdf)