# Welcome to the Notebook 
---

## Task 1
### What is Regex?

Regular expressions (regex) let us find, extract, or replace substrings that follow a specific pattern in a text.



##### Metacharacters:
Characters that have a special meaning inside a pattern, such as `.`, `+`, `*`, `?`, `[]` and `|`.

<img src="images/t1.png" >

##### Special Sequences:
Backslash-prefixed shorthands for common character sets, such as `\d` (a digit) and `\w` (a word character).

<img src="images/t2.png" >
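
A small sketch may help make the difference between metacharacters and special sequences concrete. The sample string is made up for illustration; the calls are standard `re` functions.

```python
import re

# Hypothetical sample text, used only for this illustration.
sample = "Order 66 shipped to Room_12 at 9am."

# Special sequence \d: any digit; the metacharacter + means "one or more".
digits = re.findall(r'\d+', sample)

# Metacharacters [ ] define a set of allowed characters (here: uppercase letters).
capitals = re.findall(r'[A-Z]', sample)

# Special sequence \w: letters, digits and underscore, so "Room_12" stays one token.
words = re.findall(r'\w+', sample)

print(digits)
print(capitals)
print(words)
```

Note that `\w+` keeps `9am` together as well, because digits count as word characters.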
    

In [1]:
import re

In [2]:
paragraph = """John is 24 years old and Sara is 23 and Maiki is 15 years old."""

Let's extract all the ages.

In [4]:
pattern_for_ages = r'\d+'


pattern_for_ages_2 = r'[0-9]+'
ages = re.findall(pattern_for_ages_2, paragraph)
ages

['24', '23', '15']

Let's extract all the names.

In [7]:
pattern_for_names = r'[A-Z][a-z]+'

names = re.findall(pattern_for_names, paragraph)
names

['John', 'Sara', 'Maiki']

## Task 2
- extracting phone numbers
- formatting the phone numbers so that they all have the same format
- extracting names
- storing the data in a Python dictionary

In [8]:
phone_numbers = """
john: 145-202-9330
Sara: 156.201.3333
maiki: 111*505*1254
"""

Extracting phone numbers 

In [9]:
phone_pattern = r'\d\d\d.\d\d\d.\d\d\d\d'
phones = re.findall(phone_pattern, phone_numbers)

phones

['145-202-9330', '156.201.3333', '111*505*1254']

Reformatting phone numbers and then extracting them again

In [13]:
replace_pattern = r'[.*]'
phone_numbers = re.sub(replace_pattern, "-", phone_numbers)

phone_pattern = r'\d\d\d.\d\d\d.\d\d\d\d'
phone_pattern_2 = r'\d{3}.\d{3}.\d{4}'
phones = re.findall(phone_pattern, phone_numbers)

phones


['145-202-9330', '156-201-3333', '111-505-1254']
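
Note that an unescaped `.` in `phone_pattern` matches any character, not only a separator. A stricter variant, sketched here with hypothetical names (`raw_text`, `strict_phone_pattern`) and one made-up line that should be rejected, lists the allowed separators in a character class and normalizes them in one pass:

```python
import re

# Hypothetical input: the three numbers from above plus a line that should NOT match.
raw_text = """
john: 145-202-9330
Sara: 156.201.3333
maiki: 111*505*1254
bogus: 123a456b7890
"""

# [-.*] allows only '-', '.' or '*' between the digit groups.
strict_phone_pattern = r'(\d{3})[-.*](\d{3})[-.*](\d{4})'

# Rebuild every match with '-' as the separator, then collect the results.
normalized_text = re.sub(strict_phone_pattern, r'\1-\2-\3', raw_text)
strict_phones = re.findall(r'\d{3}-\d{3}-\d{4}', normalized_text)

print(strict_phones)
```

Inside a character class, `.` and `*` lose their special meaning, so they do not need to be escaped.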

Let's extract the names

In [14]:
name_pattern = r'[A-Z]?[a-z]+'
names = re.findall(name_pattern, phone_numbers)
names

['john', 'Sara', 'maiki']

Let's store it into a dictionary

In [15]:
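contacts = dict()
# zip pairs each name with the number at the same position;
# a nested loop would overwrite every name with the last number.
for name, number in zip(names, phones):
    contacts[name] = number
contacts

{'john': '145-202-9330', 'Sara': '156-201-3333', 'maiki': '111-505-1254'}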
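
Pairing two separately extracted lists relies on both lists having the same order and length. One alternative, sketched under the assumption that every line has the form `name: number`, is to capture both fields with a single pattern so each name stays attached to its own number. The names `normalized_numbers`, `contact_pattern` and `contacts_by_line` are illustrative.

```python
import re

# Assumed input format: one "name: number" entry per line, separators already normalized.
normalized_numbers = """
john: 145-202-9330
Sara: 156-201-3333
maiki: 111-505-1254
"""

# Two groups per match: the name, then the phone number.
contact_pattern = r'([A-Za-z]+):\s*(\d{3}-\d{3}-\d{4})'

# findall returns (name, number) tuples, which dict() accepts directly.
contacts_by_line = dict(re.findall(contact_pattern, normalized_numbers))

print(contacts_by_line)
```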

## Task 3
The user enters an email address, and we want to check whether it is in a correct format.

<img height= 400 width=600 src="images/emailparts.png">

In [None]:
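email = input("please enter your email address ")
email_format = r'[A-Za-z0-9_.-]+@[A-Za-z.-]+\.(com|net|edu|uk)'

# re.match only anchors at the start, so "a@b.com!!" would pass;
# re.fullmatch requires the whole input to fit the pattern.
if re.fullmatch(email_format, email):
    print("it's ok")
else:
    print("wrong format")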

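- Exercise: write a regex pattern that <b>does not</b> allow the user to enter an email address that begins with a digit.

In [None]:
email = input("please enter your email address ")
# The first character must not be a digit; later characters of the local part may be.
email_format = r'[A-Za-z_.-][A-Za-z0-9_.-]*@[A-Za-z.-]+\.(com|net|edu|uk)'

# fullmatch checks the entire input, so trailing characters are rejected as well.
if re.fullmatch(email_format, email):
    print("it's ok")
else:
    print("wrong format")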
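
Because `input()` makes the check awkward to try repeatedly, the rule from the exercise can also be wrapped in a function and run against a few addresses at once. This is one possible sketch, not the only valid answer; `EMAIL_FORMAT`, `is_valid_email` and the sample addresses are made up for illustration.

```python
import re

# One possible pattern: the first character must not be a digit,
# the rest of the local part may contain digits.
EMAIL_FORMAT = r'[A-Za-z_.-][A-Za-z0-9_.-]*@[A-Za-z.-]+\.(com|net|edu|uk)'

def is_valid_email(address):
    """Return True when the whole address fits EMAIL_FORMAT."""
    # fullmatch requires the entire string to match, so trailing text is rejected.
    return re.fullmatch(EMAIL_FORMAT, address) is not None

# Hypothetical sample addresses, made up for this illustration.
samples = ["sara23@example.com", "23sara@example.com", "sara@example.com/extra"]
results = {address: is_valid_email(address) for address in samples}
print(results)
```

Only the first address passes: the second starts with a digit and the third has extra text after the top-level domain.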

## Task 4
- write a regex pattern to recognize the following URLs
- reformat the URLs into domain name + top-level domain, e.g. coursera.org
    
<img height= 400 width=600 src="images/urlparts.png">

In [None]:
urls = """
https://www.google.com
http://youtube.com
https://www.nasa.gov
https://coursera.org
"""

url_pattern = r'https?://(www\.)?(\w+)(\.\w+)'

urls_list = re.findall(url_pattern, urls)
urls_list

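new_urls = ""  # start from an empty string before appending
for g1, g2, g3 in urls_list:
    # g1 is the optional "www." prefix, g2 the domain name, g3 the top-level domain
    new_urls += g2 + g3 + "\n"

print(new_urls)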
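
When a pattern contains several groups, `re.findall` returns one tuple per match rather than the whole matched string, and an optional group that did not take part in the match shows up as an empty string. A quick sketch on two of the URLs above (the names `two_urls`, `parts` and `short_names` are illustrative):

```python
import re

url_pattern = r'https?://(www\.)?(\w+)(\.\w+)'
two_urls = "https://www.google.com\nhttp://youtube.com"

# Each tuple holds (optional "www." prefix, domain name, top-level domain).
parts = re.findall(url_pattern, two_urls)
print(parts)

# Joining groups 2 and 3 gives domain name + top-level domain.
short_names = [domain + tld for _, domain, tld in parts]
print(short_names)
```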

Let's reformat the URLs directly with `re.sub`.


In [None]:
new_urls = re.sub(url_pattern, r'\2\3', urls)
print(new_urls)

- Exercise: below is a list of dates in different formats. Reformat all the dates into dd/mm/yyyy using regex.

In [None]:
dates = """
12 01 2020
15.05.2021
07/03/2020
10-3-2019
"""


In [None]:
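# Day and month may have one or two digits (as in 10-3-2019);
# the separator is a space, '.', '/' or '-'.
date_pattern = r"(\d{1,2})[ ./-](\d{1,2})[ ./-](\d{4})"
# \1, \2 and \3 refer to the day, month and year groups; one-digit values are not zero-padded.
new_dates = re.sub(date_pattern, r'\1/\2/\3', dates)
print(new_dates)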
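
A plain replacement string cannot zero-pad a one-digit day or month, so `10-3-2019` needs extra handling to reach dd/mm/yyyy. One way, sketched here with the hypothetical names `sample_dates`, `flexible_date_pattern` and `to_dd_mm_yyyy`, is to pass a function to `re.sub` that rebuilds each match:

```python
import re

# Same sample dates as in the exercise above.
sample_dates = """
12 01 2020
15.05.2021
07/03/2020
10-3-2019
"""

# Day and month may have one or two digits; the separator is a space, '.', '/' or '-'.
flexible_date_pattern = r'(\d{1,2})[ ./-](\d{1,2})[ ./-](\d{4})'

def to_dd_mm_yyyy(match):
    """Rebuild one matched date with zero-padded day and month."""
    day, month, year = match.groups()
    return f"{int(day):02d}/{int(month):02d}/{year}"

padded_dates = re.sub(flexible_date_pattern, to_dd_mm_yyyy, sample_dates)
print(padded_dates)
```

`re.sub` calls the function once per match and inserts whatever string it returns.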

## Task 5

### Text Mining using RegEx

We have a dataset containing some personal notes.

In [None]:
import pandas as pd
data = pd.read_csv("dataset.csv")
data

We want to extract some useful information from this text data and store it in a dataframe. <br>
So let's create the dataframe.

In [None]:
information = pd.DataFrame(columns=["day", "month", "year", "weekday", "time"])
information

Counting the number of words in each note

In [None]:
data["notes"].str.count(r"\w+")


Get the list of all of the words in each note

In [None]:
data["notes"].str.findall(r"\w+")


Exercise: find all of the dates in each note

In [None]:
# Raw string, so that \d reaches the regex engine unchanged.
data['notes'].str.findall(r"\d{2}.\d{2}.\d{4}")


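Let's clean these dates and extract them

In [None]:
# A callable replacement needs regex=True; groups[1], groups[2] and groups[3] are day, month and year.
dates = data['notes'].str.replace(r'(\d{2}).(\d{2}).(\d{4})', lambda groups: groups[1]+"/"+groups[2]+"/"+groups[3], regex=True)
# Named groups become the column names of the extracted dataframe.
dates_df = dates.str.extract(r'(?P<day>\d{2}).(?P<month>\d{2}).(?P<year>\d{4})')
dates_df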

Let's extract the times

In [None]:
time_df = data['notes'].str.extract(r'(?P<time>\d+:\d+)')
time_df

Exercise: Extract the weekday names

In [None]:
# The named group must be wrapped in parentheses: (?P<name>...)
weekday_df = data['notes'].str.extract(r"(?P<weekday>\w+day)")
weekday_df

Now let's merge these three dataframes

In [None]:
information = dates_df.join(time_df).join(weekday_df)
information
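
Since the contents of `dataset.csv` are not shown, the extraction and merge steps can be tried on a small stand-in. The three notes below are invented for illustration, and all `demo_*` names are hypothetical; the pandas calls are the same ones used in the cells above.

```python
import pandas as pd

# Hypothetical notes standing in for dataset.csv, whose contents are not shown here.
demo = pd.DataFrame({"notes": [
    "Monday 12.01.2021 dentist at 10:30",
    "Friday 15/01/2021 team lunch at 12:00",
    "Monday 01-02-2021 call the bank at 9:15",
]})

# Named groups become column names in the DataFrame returned by str.extract.
demo_dates = demo["notes"].str.extract(r'(?P<day>\d{2}).(?P<month>\d{2}).(?P<year>\d{4})')
demo_times = demo["notes"].str.extract(r'(?P<time>\d+:\d+)')
demo_weekdays = demo["notes"].str.extract(r'(?P<weekday>\w+day)')

# All three frames share the index of demo, so join lines them up row by row.
demo_info = demo_dates.join(demo_times).join(demo_weekdays)
print(demo_info)
print(demo_info.weekday.value_counts())
```

`str.extract` keeps only the first match in each note, which is enough here because every sample note holds a single date, time and weekday.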

Now we can do a lot of analysis based on this dataframe. <br>

Let's answer this question: <b> On which days of the week am I busiest? </b>

Aggregate the dataframe on the weekday column and count.

In [None]:
information.weekday.value_counts()

Another analytical question: in which months of this year have I been busiest?

In [None]:
information[information.year=="2021"].month.value_counts()