## Grouping and capturing

In [2]:
text = """
Clary has 2 friends who she spends a lot of time with.
Susan has 3 brothers while Jhon has 4 sisters."""

In [3]:
# match name, verb, number and noun; no capturing groups yet
import re
regx = r'[A-Za-z]+\s\w+\s\d+\s\w+'

re.findall(regx,text)

['Clary has 2 friends', 'Susan has 3 brothers', 'Jhon has 4 sisters']

## Capturing groups
![](https://i.imgur.com/VasrAqZ.png)





In [5]:
# capture only the name with one group
regx = r'([A-Za-z]+)\s\w+\s\d+\s\w+'

re.findall(regx,text)

['Clary', 'Susan', 'Jhon']

## Capturing groups

![](https://i.imgur.com/Isx7Oy8.png)


In [6]:
regx = r'([A-Za-z]+)\s\w+\s(\d+)\s(\w+)'

re.findall(regx,text)

[('Clary', '2', 'friends'),
 ('Susan', '3', 'brothers'),
 ('Jhon', '4', 'sisters')]

In [7]:
# organize data: findall returns one tuple per match
t = 'Clary has 2 dogs but John has 3 cats'
regx = r'([A-Za-z]+)\s\w+\s(\d+)\s(\w+)'

pets = re.findall(regx,t)
pets

[('Clary', '2', 'dogs'), ('John', '3', 'cats')]

In [8]:
pets[0][0]

'Clary'

### Capturing groups
- A quantifier applies only to the element *immediately to its left*
- `r"apple+"`: `+` applies to `e` and not to `apple`; wrap the sequence in a group to repeat all of it
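A minimal sketch of the difference, using made-up test strings: without a group, `+` repeats only the final `e`; with a group, it repeats the whole word.

```python
import re

# Without a group, + repeats only the trailing "e".
e_repeated = re.fullmatch(r"apple+", "appleee") is not None
word_repeated_no_group = re.fullmatch(r"apple+", "appleapple") is not None

# With a group, + repeats the whole word "apple".
word_repeated_group = re.fullmatch(r"(apple)+", "appleapple") is not None

print(e_repeated, word_repeated_no_group, word_repeated_group)
```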

In [16]:
regx = r"(\d[A-Za-z])+"
t = "My user name is 3e4r5fg"

re.search(regx,t)


<_sre.SRE_Match object; span=(16, 22), match='3e4r5f'>

### Capturing groups
- Capturing a repeated group, `(\d+)`, returns the whole run of digits; repeating a capturing group, `(\d)+`, keeps only the last repetition

In [17]:
my_string = "My lucky numbers are 8755 and 33"
re.findall(r"(\d)+", my_string)

['5', '3']

In [18]:
re.findall(r"(\d+)", my_string)

['8755', '33']

## problem
```
Exercise: Try another name
You are still working on your Twitter sentiment analysis. You are now analyzing some things that caught your attention. You noticed that there are email addresses inserted in some tweets. Now, you are curious to find out which is the most common name.

You want to extract the first part of the email. E.g. if you have the email marysmith90@gmail.com, you are only interested in marysmith90.
You need to match the entire expression, so that you extract only names present in emails. Also, you are only interested in names containing uppercase letters (e.g. A, B, Z), lowercase letters (e.g. a, d, z) and numbers.

The list sentiment_analysis containing the text of three tweets as well as the re module were loaded in your session. You can use print() to view it in the IPython Shell.
```
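One possible approach, sketched under assumptions: the real `sentiment_analysis` list is not shown here, so the three tweets below are invented stand-ins. The pattern matches the whole address but captures only the part before `@`.

```python
import re

# Hypothetical stand-in for the exercise's sentiment_analysis list.
sentiment_analysis = [
    "Contact me at marysmith90@gmail.com for details",
    "Lost my phone, mail JohnDoe7@example.org instead",
    "No address in this tweet",
]

# Group 1: letters and digits before "@"; the rest of the address is matched but not captured.
regex_email = r"([A-Za-z0-9]+)@\S+"

names = []
for tweet in sentiment_analysis:
    names.extend(re.findall(regex_email, tweet))

print(names)  # ['marysmith90', 'JohnDoe7']
```

Because `findall` is given a single capturing group, it returns just the captured names rather than the full addresses.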

## Alternation and non-capturing groups
### pipe
- The vertical bar or pipe `|` matches either the expression on its left or the one on its right

In [19]:
my_string = "I want to have a pet. But I don't know if I want a cat, a dog or a bird."
re.findall(r"cat|dog|bird", my_string)

['cat', 'dog', 'bird']

### find all animals after a number

In [20]:
my_string = "I want to have a pet. But I don't know if I want 2 cats, 1 dog or a bird."
re.findall(r"\d+\scat|dog|bird", my_string)

['2 cat', 'dog', 'bird']

![](https://i.imgur.com/a44LGoK.png)

## Alternation

![](https://i.imgur.com/zsfH0KR.png)


In [22]:
my_string = "I want to have a pet. But I don't know if I want 2 cats, 1 dog or a bird."
re.findall(r"\d+\s(cat|dog|bird)", my_string)

['cat', 'dog']

![](https://i.imgur.com/3IvFw7C.png)


In [23]:
my_string = "I want to have a pet. But I don't know if I want 2 cats, 1 dog or a bird."
re.findall(r"(\d+)\s(cat|dog|bird)", my_string)

[('2', 'cat'), ('1', 'dog')]

## Non-capturing groups
- Match a group but do **not capture** it
- Useful when the group is not backreferenced or returned
- Add `?:` after the opening parenthesis: `(?:regex)`

![](https://i.imgur.com/Lo62ukN.png)


In [24]:
# extract last part of pattern

my_string = "John Smith: 34-34-34-042-980, Rebeca Smith: 10-10-10-434-425"
re.findall(r"(?:\d{2}-){3}(\d{3}-\d{3})", my_string)

['042-980', '434-425']

In [26]:
# extract the number but not the ordinal suffix (th or rd) that follows it
my_date = "Today is 23rd May 2019. Tomorrow is 24th May 19."
re.findall(r"(\d+)(?:th|rd)", my_date)

['23', '24']
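To make the effect of `?:` visible, the sketch below runs the same date string through a capturing and a non-capturing version of the suffix group. The variable names are only illustrative.

```python
import re

my_date = "Today is 23rd May 2019. Tomorrow is 24th May 19."

# Two capturing groups: findall returns a tuple per match.
with_capture = re.findall(r"(\d+)(th|rd)", my_date)

# The suffix group is non-capturing: findall returns only the number.
without_capture = re.findall(r"(\d+)(?:th|rd)", my_date)

print(with_capture)     # [('23', 'rd'), ('24', 'th')]
print(without_capture)  # ['23', '24']
```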

## Backreferences

### Numbered groups
![](https://i.imgur.com/dEh3gR1.png)



In [27]:
text = "Python 3.0 was released on 12-03-2008."
information = re.search(r'(\d{1,2})-(\d{2})-(\d{4})', text)
information.group(3)

'2008'

In [28]:
information.group(0)

'12-03-2008'

### Named groups

![](https://i.imgur.com/68daewC.png)


In [29]:
text = "Austin, 78701"
cities = re.search(r"(?P<city>[A-Za-z]+).*?(?P<zipcode>\d{5})", text)
cities.group("city")

'Austin'

In [30]:
cities.group("zipcode")

'78701'
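Named groups can also be read all at once. The sketch below reuses the pattern above and relies on the standard match methods `groupdict()` and `groups()`. Named groups remain reachable by number as well.

```python
import re

text = "Austin, 78701"
cities = re.search(r"(?P<city>[A-Za-z]+).*?(?P<zipcode>\d{5})", text)

# All named groups as a dictionary.
by_name = cities.groupdict()

# Named groups are still numbered from left to right.
by_number = (cities.group(1), cities.group(2))

print(by_name)          # {'city': 'Austin', 'zipcode': '78701'}
print(by_number)        # ('Austin', '78701')
print(cities.groups())  # ('Austin', '78701')
```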

### Backreferences

![](https://i.imgur.com/DyvXox2.png)


In [31]:
sentence = "I wish you a happy happy birthday!"
# no backreference yet: a word, a whitespace character, then a literal space never occurs here
re.findall(r"(\w+)\s ", sentence)

[]

In [32]:
sentence = "I wish you a happy happy birthday!"
re.findall(r"(\w+)\s\1", sentence)

['happy']

In [33]:
sentence = "I wish you a happy happy birthday!"
re.sub(r"(\w+)\s\1", r"\1", sentence)

'I wish you a happy birthday!'

![](https://i.imgur.com/N7MKNOo.png)


In [34]:
sentence = "Your new code number is 23434. Please, enter 23434 to open the door."
re.findall(r"(?P<code>\d{5}).*?(?P=code)", sentence)

['23434']

![](https://i.imgur.com/3qvDn57.png)


In [36]:
sentence = "This app is not working! It's repeating the last word word."
re.sub(r"(?P<word>\w+)\s(?P=word)", r"\g<word>", sentence)

"This app is not working! It's repeating the last word."
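Backreferences in the replacement string can also reorder captured text, not just remove duplicates. As an illustrative sketch, the date pattern from the numbered-groups example is reused here to rewrite the date with the year first.

```python
import re

text = "Python 3.0 was released on 12-03-2008."

# \1, \2 and \3 refer to the first, second and third captured groups; emit them in reverse order.
reordered = re.sub(r"(\d{1,2})-(\d{2})-(\d{4})", r"\3-\2-\1", text)

print(reordered)  # Python 3.0 was released on 2008-03-12.
```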

## Look around

![](https://i.imgur.com/MfVNZFt.png)


In [37]:
my_text = "tweets.txt transferred, mypass.txt transferred, keywords.txt error"
re.findall(r"\w+\.txt ", my_text)

['tweets.txt ', 'mypass.txt ', 'keywords.txt ']

In [38]:
my_text = "tweets.txt transferred, mypass.txt transferred, keywords.txt error"
re.findall(r"\w+\.txt(?=\stransferred)", my_text)

['tweets.txt', 'mypass.txt']

## Negative look-ahead
- Non-capturing group
- Checks that the first part of the expression is not followed by the look-ahead expression
- Returns only the first part of the expression

In [39]:
my_text = "tweets.txt transferred, mypass.txt transferred, keywords.txt error"
re.findall(r"\w+\.txt ", my_text)

['tweets.txt ', 'mypass.txt ', 'keywords.txt ']

In [40]:
my_text = "tweets.txt transferred, mypass.txt transferred, keywords.txt error"
re.findall(r"\w+\.txt(?!\stransferred)", my_text)

['keywords.txt']

![](https://i.imgur.com/cITSH3B.png)

## Positive look-behind

- Non-capturing group
- Get all the matches that are preceded by a specific pattern.
- Return pattern after look-behind expression

In [43]:
my_text = "Member: Angus Young, Member: Chris Slade, Past: Malcolm Young, Past: Cliff Williams."
re.findall(r" \w+\s\w+", my_text)

[' Angus Young', ' Chris Slade', ' Malcolm Young', ' Cliff Williams']

In [44]:
my_text = "Member: Angus Young, Member: Chris Slade, Past: Malcolm Young, Past: Cliff Williams."
re.findall(r"(?<=Member:\s)\w+\s\w+", my_text)

['Angus Young', 'Chris Slade']

## Negative look-behind

- Non-capturing group
- Get all the matches that are not preceded by a specific pattern.
- Return pattern after look-behind expression

In [45]:
my_text = "My white cat sat at the table. However, my brown dog was lying on the couch."
re.findall(r"(?<!brown\s)(cat|dog)", my_text)

['cat']
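One practical limit worth knowing: Python's `re` module only accepts look-behind expressions of fixed width, so a quantifier such as `\s+` inside `(?<=...)` is rejected when the pattern is compiled. The sketch below demonstrates this on an invented string with irregular spacing and shows one possible workaround, matching the prefix normally and capturing only the name.

```python
import re

# Made-up input with two spaces after the second "Member:".
my_text = "Member: Angus Young, Member:  Chris Slade"

# A variable-width look-behind fails at compile time.
try:
    re.compile(r"(?<=Member:\s+)\w+\s\w+")
    variable_width_ok = True
except re.error:
    variable_width_ok = False

# Workaround: consume the prefix and capture what follows it.
names = re.findall(r"Member:\s+(\w+\s\w+)", my_text)

print(variable_width_ok)  # False
print(names)              # ['Angus Young', 'Chris Slade']
```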