# Welcome to the Notebook 
---

## Task 1
### What is Regex?

Regular expressions (Regex) describe text patterns. We can use them to find, extract, and replace substrings that match a specific pattern in a text.



##### Meta characters: 
Characters that have a special meaning inside a pattern instead of matching themselves.

<img src="images/t1.png" >
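As a small illustration of a few common metacharacters, here is a minimal sketch. The sample string and variable names are made up for this example and are not taken from the table above.

```python
import re

sample = "cat cot cut c.t ct"

# '.' matches any single character except a newline
any_char = re.findall(r'c.t', sample)

# '[...]' matches exactly one character from the listed set
from_set = re.findall(r'c[ao]t', sample)

# '\.' escapes the dot so that it matches a literal '.'
literal_dot = re.findall(r'c\.t', sample)

# '?' makes the preceding item optional (zero or one occurrence)
optional = re.findall(r'c[a-z]?t', sample)

print(any_char)     # ['cat', 'cot', 'cut', 'c.t']
print(from_set)     # ['cat', 'cot']
print(literal_dot)  # ['c.t']
print(optional)     # ['cat', 'cot', 'cut', 'ct']
```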

##### Special Sequences: 
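Escape sequences (a backslash followed by a character) that stand for a whole class of characters or a position, such as `\d` for any digit.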

<img src="images/t2.png" >
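A short sketch of four frequently used special sequences follows. The sample string is invented for illustration only.

```python
import re

sample = "Room 42, floor_3 at 9:05"

digits = re.findall(r'\d+', sample)          # \d: any digit
words = re.findall(r'\w+', sample)           # \w: letter, digit, or underscore
spaces = re.findall(r'\s', sample)           # \s: any whitespace character
whole_word = re.findall(r'\bat\b', sample)   # \b: word boundary

print(digits)       # ['42', '3', '9', '05']
print(words)        # ['Room', '42', 'floor_3', 'at', '9', '05']
print(len(spaces))  # 4
print(whole_word)   # ['at']
```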
    

In [1]:
import re

In [2]:
paragraph = """John is 24 years old and Sara is 23 and Maiki is 15 years old."""

Let's extract all the ages.

In [5]:
pattern_for_ages = r'\d+'

pattern_for_ages2 = r'[0-9]+'

ages = re.findall(pattern_for_ages, paragraph)
ages2 = re.findall(pattern_for_ages2, paragraph)
ages2

['24', '23', '15']

Let's extract all the names.

In [6]:
pattern_for_names = r'[A-Z][a-z]+'

names = re.findall(pattern_for_names, paragraph)
names

['John', 'Sara', 'Maiki']
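Names and ages can also be captured together with capture groups. This is only a sketch: it assumes every name is directly followed by " is " and the age, which holds for this paragraph but not for text in general. The names `pairs` and `ages_by_name` are introduced here for illustration.

```python
import re

paragraph = """John is 24 years old and Sara is 23 and Maiki is 15 years old."""

# Two capture groups: a capitalized name, then the number that follows " is "
pairs = re.findall(r'([A-Z][a-z]+) is (\d+)', paragraph)

# With several groups, findall returns one tuple per match
ages_by_name = {name: int(age) for name, age in pairs}
print(ages_by_name)  # {'John': 24, 'Sara': 23, 'Maiki': 15}
```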

## Task 2
- Extracting phone numbers
- Formatting the phone numbers so that all of them have the same format
- Extracting names
- Storing the data in a Python dictionary

In [9]:
phone_numbers = """
john: 145-202-9330
Sara: 156.201.3333
maiki: 111*505*1254
"""

Extracting phone numbers 

In [10]:
phone_pattern = r'\d\d\d.\d\d\d.\d\d\d\d'
phones = re.findall(phone_pattern, phone_numbers)
phones

['145-202-9330', '156.201.3333', '111*505*1254']
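Note that an unescaped `.` matches any character, so the pattern above would also accept a digit or a letter between the groups. The sketch below compares it with a pattern that lists the allowed separators explicitly. The test string, including the long digit run at the end, is a made-up example.

```python
import re

text = "145-202-9330 156.201.3333 111*505*1254 1234567890123"

# '.' accepts anything as a separator, including another digit
loose = re.findall(r'\d{3}.\d{3}.\d{4}', text)

# '[-.*]' accepts only '-', '.', or '*' as a separator
strict = re.findall(r'\d{3}[-.*]\d{3}[-.*]\d{4}', text)

print(loose)   # ['145-202-9330', '156.201.3333', '111*505*1254', '123456789012']
print(strict)  # ['145-202-9330', '156.201.3333', '111*505*1254']
```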

Reformatting phone numbers and then extracting them again

In [13]:
# Replace every '.' or '*' separator with '-'
replace_pattern = r'[.*]'
phone_numbers = re.sub(replace_pattern, '-', phone_numbers)

phone_pattern = r'\d\d\d.\d\d\d.\d\d\d\d'

# Same pattern, written with explicit repetition counts
phone_pattern2 = r'\d{3}.\d{3}.\d{4}'

phones = re.findall(phone_pattern2, phone_numbers)
phones

['145-202-9330', '156-201-3333', '111-505-1254']

Let's extract the names

In [14]:
name_pattern = r'[A-Z]?[a-z]+'
names = re.findall(name_pattern, phone_numbers)
names

['john', 'Sara', 'maiki']

Let's store it into a dictionary

In [15]:
contacts = {}
# zip pairs each name with the number at the same position
for name, number in zip(names, phones):
    contacts[name] = number
contacts

{'john': '145-202-9330', 'Sara': '156-201-3333', 'maiki': '111-505-1254'}
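Pairing two separately extracted lists relies on both lists having the same length and order. As an alternative sketch, each name can be captured together with its number in a single pass. The pattern `entry_pattern` and the group names in the loop are assumptions made for this example.

```python
import re

phone_numbers = """
john: 145-202-9330
Sara: 156.201.3333
maiki: 111*505*1254
"""

# One name, a colon, then three digit groups separated by '-', '.', or '*'
entry_pattern = r'([A-Za-z]+):\s*(\d{3})[-.*](\d{3})[-.*](\d{4})'

# Rebuild every number with '-' as the separator while filling the dictionary
contacts = {
    name: f'{area}-{prefix}-{line}'
    for name, area, prefix, line in re.findall(entry_pattern, phone_numbers)
}
print(contacts)
# {'john': '145-202-9330', 'Sara': '156-201-3333', 'maiki': '111-505-1254'}
```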

## Task 3
The user enters an email address, and we want to check that the address is in a valid format.

<img height= 400 width=600 src="images/emailparts.png">

In [2]:
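import re
email = input('Please enter your email: ')
# '-' goes last in the character class so it is read as a literal hyphen
email_format = r'[a-zA-Z0-9_.-]+@[a-zA-Z.-]+\.(com|net|edu|uk)'
# fullmatch requires the whole input to match, not just a prefix
if re.fullmatch(email_format, email):
    print('its okay')
else:
    print('wrong format')

Please enter your email: anika45@yahoo.com
its okay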
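`re.match` only anchors the pattern at the start of the string, whereas `re.fullmatch` requires the entire string to match. The sketch below shows the difference on a few invented addresses. `is_valid_email` is a hypothetical helper used here so the check can run without `input`.

```python
import re

email_format = r'[a-zA-Z0-9_.-]+@[a-zA-Z.-]+\.(com|net|edu|uk)'

def is_valid_email(address):
    """Return True only if the whole address matches email_format."""
    return re.fullmatch(email_format, address) is not None

candidates = ['anika45@yahoo.com', 'anika45@yahoo.com!!!', 'anika45yahoo.com']
results = {c: is_valid_email(c) for c in candidates}

# re.match accepts the second address because it ignores the trailing '!!!'
prefix_only = re.match(email_format, 'anika45@yahoo.com!!!') is not None

print(results)
print(prefix_only)  # True
```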


- Exercise: write a RegEx pattern that <b>does not</b> allow the user to enter an email address that begins with a digit.

In [None]:
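import re
email = input('Please enter your email: ')
# The first character must be a letter; the rest of the local part is unchanged
email_format = r'[a-zA-Z][a-zA-Z0-9_.-]+@[a-zA-Z.-]+\.(com|net|edu|uk)'
# fullmatch requires the whole input to match, not just a prefix
if re.fullmatch(email_format, email):
    print('its okay')
else:
    print('wrong format')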
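The leading-letter rule only works when the pattern is anchored at the start of the input. A brief sketch with made-up addresses shows this. The domain class is written as `[a-zA-Z.-]`, which matches the same characters as the class in the cell above.

```python
import re

email_format = r'[a-zA-Z][a-zA-Z0-9_.-]+@[a-zA-Z.-]+\.(com|net|edu|uk)'

samples = ['anika45@yahoo.com', '45anika@yahoo.com', 'a@yahoo.com']

# re.match starts at position 0, so a leading digit is rejected
matched = [bool(re.match(email_format, s)) for s in samples]

# re.search may start anywhere, so it skips the digits and finds 'anika@yahoo.com'
searched = bool(re.search(email_format, '45anika@yahoo.com'))

print(matched)   # [True, False, False]
print(searched)  # True
```

The last sample is rejected because the pattern needs at least two characters before the `@`. Replacing the `+` after the second character class with `*` would accept one-letter local parts.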

## Task 4
- Write a RegEx pattern that recognizes the following URLs
- Reformat the URLs into domain name + top-level domain, e.g. coursera.org
    
<img height= 400 width=600 src="images/urlparts.png">

In [7]:
urls = """
https://www.google.com
http://youtube.com
https://www.nasa.gov
https://coursera.org
"""
url_pattern = r'https?://(www\.)?(\w+)(\.\w+)'

url_list = re.findall(url_pattern, urls)
new_urls = ''
for g1,g2,g3 in url_list:
    new_urls += g2+g3+'\n'
print(new_urls)

google.com
youtube.com
nasa.gov
coursera.org



Let's do the same reformatting in a single step with `re.sub` and backreferences.


In [11]:
new_urls = re.sub(url_pattern, r'\2\3', urls)
print(new_urls)


google.com
youtube.com
nasa.gov
coursera.org
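Numbered backreferences become hard to read as a pattern grows. A possible variation, sketched below, uses named groups and a non-capturing group for the optional `www.` prefix. The group names `domain` and `tld` are chosen for this example.

```python
import re

urls = """
https://www.google.com
http://youtube.com
https://www.nasa.gov
https://coursera.org
"""

# (?:...) groups without capturing; (?P<name>...) captures under a name
named_pattern = r'https?://(?:www\.)?(?P<domain>\w+)(?P<tld>\.\w+)'

# \g<name> refers to a named group in the replacement string
short_urls = re.sub(named_pattern, r'\g<domain>\g<tld>', urls)
hosts = short_urls.split()
print(hosts)  # ['google.com', 'youtube.com', 'nasa.gov', 'coursera.org']
```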



- Exercise: below is a list of dates in different formats. Reformat all of the dates into dd/mm/yyyy using Regex.

In [12]:
dates = """
12 01 2020
15.05.2021
07/03/2020
10-3-2019
"""


In [20]:
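# Separators seen in the data: space, dot, slash, hyphen; day and month may have 1 or 2 digits
date_pattern = r'(\d{1,2})[ ./-](\d{1,2})[ ./-](\d{4})'
# Zero-pad day and month so that every date really is dd/mm/yyyy
new_dates = re.sub(date_pattern, lambda m: f'{int(m[1]):02d}/{int(m[2]):02d}/{m[3]}', dates)
print(new_dates)


12/01/2020
15/05/2021
07/03/2020
10/03/2019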



## Task 5

### Text Mining using RegEx

We have a dataset consisting of some personal notes.

In [1]:
import pandas as pd
data = pd.read_csv('dataset.csv')
data

Unnamed: 0,notes
0,Friends reunion on Thursday 25-05-2021 at 6:00 pm
1,on Saturday-night 29-05-2021 at 3:00 pm champi...
2,the doctor's appointment is on Tuesday 12-04-2...
3,Meeting with friends on Friday 14-12-2020 at 8...
4,On Wednesday 06.01.2021 at 9:30 pm there is a ...
5,Don't forget to call Dani on Friday 22/07/2020...
6,"Wednesday 25/05/2021 at 5:30 pm, meeting with ..."
7,Job interview at 9:00 am Monday 02.02.2021


We want to extract some useful information from this text data and store it in a dataframe. <br>
So let's create the dataframe.

In [2]:
information = pd.DataFrame(columns = ['day','month','year','weekday','time'])
information

Unnamed: 0,day,month,year,weekday,time


Counting the number of words in each note

In [4]:
data['notes'].str.count(r'\w+')

0    11
1    15
2    14
3    12
4    14
5    15
6    12
7    10
Name: notes, dtype: int64

Get the list of all of the words in each note

In [5]:
data['notes'].str.findall(r'\w+')

0    [Friends, reunion, on, Thursday, 25, 05, 2021,...
1    [on, Saturday, night, 29, 05, 2021, at, 3, 00,...
2    [the, doctor, s, appointment, is, on, Tuesday,...
3    [Meeting, with, friends, on, Friday, 14, 12, 2...
4    [On, Wednesday, 06, 01, 2021, at, 9, 30, pm, t...
5    [Don, t, forget, to, call, Dani, on, Friday, 2...
6    [Wednesday, 25, 05, 2021, at, 5, 30, pm, meeti...
7    [Job, interview, at, 9, 00, am, Monday, 02, 02...
Name: notes, dtype: object

Exercise: find all of the dates in each note

In [6]:
data['notes'].str.findall(r'\d{2}.\d{2}.\d{4}')

0    [25-05-2021]
1    [29-05-2021]
2    [12-04-2021]
3    [14-12-2020]
4    [06.01.2021]
5    [22/07/2020]
6    [25/05/2021]
7    [02.02.2021]
Name: notes, dtype: object

Let's clean these dates and extract their parts.

In [21]:
# regex=True is needed for a callable replacement (the default is False since pandas 2.0)
dates = data['notes'].str.replace(r'(\d{2}).(\d{2}).(\d{4})', lambda groups : groups[1]+'/'+groups[2]+'/'+groups[3], regex=True)
dates_df = dates.str.extract(r'(?P<Day>\d{2}).(?P<Month>\d{2}).(?P<year>\d{4})')
dates_df

Unnamed: 0,Day,Month,year
0,25,05,2021
1,29,05,2021
2,12,04,2021
3,14,12,2020
4,06,01,2021
5,22,07,2020
6,25,05,2021
7,02,02,2021


Let's extract the times

In [10]:
time_df = data['notes'].str.extract(r'(?P<Time>\d+:\d+)')
time_df

Unnamed: 0,Time
0,6:00
1,3:00
2,4:40
3,8:30
4,9:30
5,12:00
6,5:30
7,9:00


Exercise: Extract the weekday names

In [16]:
weekdays_df = data['notes'].str.extract(r'(?P<Weekday>\w+day)')
weekdays_df

Unnamed: 0,Weekday
0,Thursday
1,Saturday
2,Tuesday
3,Friday
4,Wednesday
5,Friday
6,Wednesday
7,Monday
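The pattern `\w+day` works for these notes, but it would also match other words that end in "day". The sketch below compares it with a pattern that lists the seven weekday names. The sample note is invented for illustration.

```python
import re

note = 'Holiday party today, moved from Friday to Saturday-night'

# Any run of word characters that ends in 'day'
loose = re.findall(r'\w+day', note)

# Only the seven weekday names, bounded by word boundaries
weekday_pattern = r'\b(?:Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day\b'
strict = re.findall(weekday_pattern, note)

print(loose)   # ['Holiday', 'today', 'Friday', 'Saturday']
print(strict)  # ['Friday', 'Saturday']
```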


Now let's merge these three dataframes

In [22]:
information = dates_df.join(time_df).join(weekdays_df)
information

Unnamed: 0,Day,Month,year,Time,Weekday
0,25,05,2021,6:00,Thursday
1,29,05,2021,3:00,Saturday
2,12,04,2021,4:40,Tuesday
3,14,12,2020,8:30,Friday
4,06,01,2021,9:30,Wednesday
5,22,07,2020,12:00,Friday
6,25,05,2021,5:30,Wednesday
7,02,02,2021,9:00,Monday
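The extraction and join steps can be tried without `dataset.csv` on a small in-memory Series. This is a sketch that assumes pandas is installed. The two notes are copied from rows 0 and 7 of the dataset, and `\D` is used here for the date separator instead of `.`.

```python
import pandas as pd

notes = pd.Series([
    'Friends reunion on Thursday 25-05-2021 at 6:00 pm',
    'Job interview at 9:00 am Monday 02.02.2021',
])

# Each named group becomes a column of the resulting dataframe
dates_df = notes.str.extract(r'(?P<Day>\d{2})\D(?P<Month>\d{2})\D(?P<year>\d{4})')
time_df = notes.str.extract(r'(?P<Time>\d+:\d+)')
weekdays_df = notes.str.extract(r'(?P<Weekday>\w+day)')

# join aligns the three dataframes on their shared row index
information = dates_df.join(time_df).join(weekdays_df)
print(information)
```

All extracted values are strings, so the month of the first note is `'05'`, not the number 5.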


Now we can do a lot of analysis based on this dataframe. <br>

Let's answer this question: <b>On which days of the week am I busiest?</b>

Aggregate the dataframe on the Weekday column and count the entries.

In [24]:
information.Weekday.value_counts()

Wednesday    2
Friday       2
Tuesday      1
Monday       1
Saturday     1
Thursday     1
Name: Weekday, dtype: int64

Another analytical question: in which months of 2021 have I been busiest?

In [29]:
information[information.year == '2021'].Month.value_counts()

05    3
04    1
02    1
01    1
Name: Month, dtype: int64