# Regular Expressions (regex) - Part 2

### We'll now connect ```regex``` to ```Python```.

**Pages we need:**

- Live testing: https://regex101.com/
- Sandeep's <a href="https://docs.google.com/spreadsheets/d/1A39lM4SiGZbzZPxrjEXkR05aksAewudccWmojoIU0Gk/edit?usp=sharing">REGEX Tip sheet</a>
- Download <a href="https://drive.google.com/file/d/1thZ7xDTb0IXiauaSvxMskDL_BAFjM_fi/view?usp=sharing">this notebook demo text</a>
- Download these <a href="https://drive.google.com/file/d/1nZ6GMwQKuQEQ2zKftkod4oHh-RqxfwXB/view?usp=sharing">confession judgment excerpts</a>

## `Regex` Packages

* `re` is the oldest and most widely used library. It's showing its age!
* `regex` is a newer, fresher library with major advantages over `re` as you get more advanced.
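
If `regex` is not installed on your machine, a fallback import keeps the basic cells in this notebook runnable. This is only a sketch and assumes you use nothing beyond the calls that both libraries share (`compile`, `search`, `findall`); the printed object types will say `re` instead of `regex` when the fallback is used.

```python
# Prefer the third-party `regex` package; fall back to the standard
# library `re` when it is missing (an assumption about your setup).
try:
    import regex as re
except ImportError:
    import re

pattern = re.compile(r"d\wg")
first = pattern.search("The dog barked.")
print(first.group())  # dog
```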

In [1]:
# import libraries
import regex as re
import pandas as pd
import glob

In [2]:
## Run this cell
some_text = "The dog barked at the other dog, cat and kitten."

### Simple Python literal search

In [3]:
# is dog in my text
"dog" in some_text

True

### 1. `regex` pattern compilation

To use a regex pattern effectively, we first need to compile it.

Compiling takes a regex pattern string and turns it into something Python can reuse: a `regex Python object` with its own methods, such as `search()` and `findall()`.

`regex.compile(some_pattern)` or `re.compile(some_pattern)`

In [4]:
# set a pattern to find 'dog': a "d", any one word character, then a "g"
dog_pat = re.compile(r"d\wg")
type(dog_pat)

_regex.Pattern
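
A compiled pattern and the module-level functions should give the same results; compiling mainly saves you from retyping the pattern. The sketch below uses the standard-library `re` (an assumption; the notebook's `import regex as re` exposes the same calls) and a made-up sample sentence.

```python
import re  # assumption: standard-library `re`

text = "The dog dug a hole."  # hypothetical sample sentence

# Compile once, then reuse the pattern object's methods.
d_g_pat = re.compile(r"d\wg")
compiled_result = d_g_pat.findall(text)

# The module-level function takes the pattern string as its first argument.
direct_result = re.findall(r"d\wg", text)

print(compiled_result)  # ['dog', 'dug']
print(direct_result)    # ['dog', 'dug']
print(d_g_pat.pattern)  # d\wg
```

Note that `\w` matches any single word character, so the pattern also picks up `dug`.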

### 2. `search`

* Like `find()` in BeautifulSoup, `search()` returns ONLY the first instance of text found by a pattern.
* It also tells you where a pattern was found (the `span`).
* Knowing that an instance of a pattern match exists can be used to our advantage (more on this later).

`a_result = pattern_variable.search(some_string)`

In [5]:
# search for pattern
dog_pat.search(some_text)

<regex.Match object; span=(4, 7), match='dog'>

In [6]:
## run this
other_text = "The military vet who loves her dog was able to overcome her PTSD with the help of a service dog."

In [7]:
# what type of object
type(dog_pat.search(other_text))

_regex.Match

In [8]:
# search for pattern
dog_pat.search(other_text)

<regex.Match object; span=(31, 34), match='dog'>
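
When nothing matches, `search()` returns `None` rather than a match object, which is what makes "does this pattern exist here?" a useful test. A minimal sketch, assuming the standard-library `re` and a made-up sentence with no match:

```python
import re  # assumption: standard-library `re`

dog_pat = re.compile(r"d\wg")

hit = dog_pat.search("The dog barked at the other dog, cat and kitten.")
miss = dog_pat.search("No canines here.")  # hypothetical text without a match

if hit is not None:
    print(hit.span())              # (4, 7)
    print(hit.start(), hit.end())  # 4 7

print(miss)  # None
```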

### 3. Returning the match objects

We usually want to do more than simply know that a pattern exists or return its first instance. Ideally, we want to capture it and place it in a spreadsheet. 

The `group()` method returns the matched pattern as text.

In [9]:
# run dog pattern and save the match in vet_dog
vet_dog = dog_pat.search(other_text)
vet_dog

<regex.Match object; span=(31, 34), match='dog'>

In [10]:
# pull out the matched pattern itself
vet_dog.group()

'dog'

In [11]:
# type
type(vet_dog.group())

str
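
The `span` shown in the match object and the text returned by `group()` describe the same match: slicing the original string with the span gives the same text. A short sketch, assuming the standard-library `re`:

```python
import re  # assumption: standard-library `re`

other_text = ("The military vet who loves her dog was able to overcome "
              "her PTSD with the help of a service dog.")

m = re.compile(r"d\wg").search(other_text)

matched = m.group()                      # the matched text
sliced = other_text[m.start():m.end()]   # same text, cut out by position

print(matched, sliced, m.span())  # dog dog (31, 34)
```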

## Read our demo text

In [12]:
# read the file
with open("regex-notebook-demo.txt", "r") as text_obj:
    demo_text = text_obj.read()

In [13]:
print(demo_text)

12534 127 ab aba abba sandeepj@bloomberg.net abbba, abbbbba, (518) 469-4581 abcde.The dog is a not a hog. ABA ABBA ABBBA.

Florence smith
John Smith
Carol Brown Tel. 212-452-5728


Ab_CD123  123456 and 12456	tor 12531245134562. 123867584789. $40.44 or $3 or $52,583.08 or $610,235.11
Tel. (917) 492-8601

The cat sat down and called 514-957-3453 while the other caaaaaat purred. This cat is in California while this caaaat is in Iraq, but none are in ct. My dog prefers cat food to dog food but hates fish food.My food tastes yummy!
Peter Trapp Jon Smythe Luca Smith John Smythe
Maggie Veach Tel. 718 601 7567

AB_cd 	<+>-.,!@# $%^&*();\/|_^@1# (917) 488-5410

*!dsar2d1

I told him to search Jon Smith
Katie Pine
John Smth

TELEPHONE NUMBERS
Tel. 718-392-4793
Tel. (646) 469-3211
Tel. 917.425.2796

NUMBERS
1
12
123
123,456
123,456,789
123,456,789.01

EMAILS
sandeep.junnarkar@journalism.cuny.edu
sandeepj@gmail.com
sjunnarkar@bloomberg.net
sandeep.junnarkar@news.co.uk


POLICE
Det. Frank Castille


### 4. `findall` finds every item that matches a pattern

`pattern.findall(text)`

In [14]:
# use findall to find all matching text instances of grouped letters, numbers, and underscores
pat1 = re.compile(r"\w+")
pat1.findall(demo_text)

['12534',
 '127',
 'ab',
 'aba',
 'abba',
 'sandeepj',
 'bloomberg',
 'net',
 'abbba',
 'abbbbba',
 '518',
 '469',
 '4581',
 'abcde',
 'The',
 'dog',
 'is',
 'a',
 'not',
 'a',
 'hog',
 'ABA',
 'ABBA',
 'ABBBA',
 'Florence',
 'smith',
 'John',
 'Smith',
 'Carol',
 'Brown',
 'Tel',
 '212',
 '452',
 '5728',
 'Ab_CD123',
 '123456',
 'and',
 '12456',
 'tor',
 '12531245134562',
 '123867584789',
 '40',
 '44',
 'or',
 '3',
 'or',
 '52',
 '583',
 '08',
 'or',
 '610',
 '235',
 '11',
 'Tel',
 '917',
 '492',
 '8601',
 'The',
 'cat',
 'sat',
 'down',
 'and',
 'called',
 '514',
 '957',
 '3453',
 'while',
 'the',
 'other',
 'caaaaaat',
 'purred',
 'This',
 'cat',
 'is',
 'in',
 'California',
 'while',
 'this',
 'caaaat',
 'is',
 'in',
 'Iraq',
 'but',
 'none',
 'are',
 'in',
 'ct',
 'My',
 'dog',
 'prefers',
 'cat',
 'food',
 'to',
 'dog',
 'food',
 'but',
 'hates',
 'fish',
 'food',
 'My',
 'food',
 'tastes',
 'yummy',
 'Peter',
 'Trapp',
 'Jon',
 'Smythe',
 'Luca',
 'Smith',
 'John',
 'Smythe',
 '

In [15]:
type(pat1.findall(demo_text))

list
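
Because `findall()` always returns a list, a pattern with no matches gives an empty list instead of `None`. A small sketch on a made-up string in the style of the demo text, assuming the standard-library `re`:

```python
import re  # assumption: standard-library `re`

sample = "ab aba abba (518) 469-4581"  # hypothetical excerpt

digits = re.compile(r"\d+").findall(sample)
nothing = re.compile(r"z+").findall(sample)

print(digits)        # ['518', '469', '4581']
print(nothing)       # []
print(len(nothing))  # 0
```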

In [16]:
# "a" followed by 1 or 2 "b"
pat2 = re.compile(r"ab{1,2}")

In [17]:
pat2.findall(demo_text)

['ab', 'ab', 'abb', 'abb', 'abb', 'ab', 'ab', 'ab', 'abb']

In [18]:
pat2

regex.Regex('ab{1,2}', flags=regex.V0)

### 5. Flags

Regex flags are optional modifiers that alter the behavior of regular expressions, such as making a pattern case-insensitive.

`re.IGNORECASE` or `re.I` for ignore case

`re.compile(some_pattern, re.I)`

In [19]:
pat2 = re.compile(r"ab{1,2}", re.I)

In [20]:
pat2.findall(demo_text)

['ab',
 'ab',
 'abb',
 'abb',
 'abb',
 'ab',
 'AB',
 'ABB',
 'ABB',
 'Ab',
 'AB',
 'ab',
 'ab',
 'AB',
 'abb']

In [21]:
# find dog, ignore case
pat3 = re.compile(r"dog", re.I)

In [22]:
pat3.findall(demo_text)

['dog', 'dog', 'dog', 'Dog', 'dog', 'DOG', 'dog']

In [23]:
# without ignore case
pat4 = re.compile(r"dog")

In [24]:
pat4.findall(demo_text)

['dog', 'dog', 'dog', 'dog', 'dog']

## Put Regex to work

In [25]:
## text string for demo
string = '''line 1 with more to follow.

this is line 2

LINE 3 has more...

Text before line 4
'''

In [26]:
print(string)

line 1 with more to follow.

this is line 2

LINE 3 has more...

Text before line 4



### `string` analysis

* The string starts with `line 1`.
* While there appear to be four lines with the line #, and three blank lines, this is actually considered one string block by `regex`.
* To treat the block as having multiple lines, we need to use `re.MULTILINE` or `re.M`.

In [27]:
# pattern to find all "line number"
pat = re.compile(r"line\s\d")

In [28]:
pat.findall(string)

['line 1', 'line 2', 'line 4']

In [29]:
# pattern anchored with ^: without re.M it only matches at the very start of the string
pat2 = re.compile(r"^line\s\d")

In [30]:
pat2.findall(string)

['line 1']

In [31]:
# pattern to find every "line number", ignoring case (no ^ or $ here, so re.M changes nothing)
pat3 = re.compile(r"line\s\d", flags=re.I | re.M)

In [32]:
pat3.findall(string)

['line 1', 'line 2', 'LINE 3', 'line 4']

In [33]:
# pattern to find all line with number at the end of the line
pat4 = re.compile(r"line\s\d$", flags=re.I | re.M)

In [34]:
pat4.findall(string)

['line 2', 'line 4']
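
The cells above never combine the `^` anchor with `re.M`. The sketch below fills that gap using the same demo string, assuming the standard-library `re`: without `re.M`, `^` only matches at the start of the whole string; with it, `^` matches at the start of every line.

```python
import re  # assumption: standard-library `re`

string = """line 1 with more to follow.

this is line 2

LINE 3 has more...

Text before line 4
"""

# ^ anchors to the start of the whole string
start_only = re.findall(r"^line\s\d", string, flags=re.I)

# with re.M, ^ anchors to the start of each line
each_line = re.findall(r"^line\s\d", string, flags=re.I | re.M)

print(start_only)  # ['line 1']
print(each_line)   # ['line 1', 'LINE 3']
```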

#### `re.DOTALL` or `re.S` makes the period match newlines too

In [35]:
# without a dotall pattern
pat5 = re.compile(r".*")

In [36]:
pat5.search(string)

<regex.Match object; span=(0, 27), match='line 1 with more to follow.'>

In [37]:
pat6 = re.compile(r".*", re.DOTALL)

In [38]:
pat6.search(string)

<regex.Match object; span=(0, 84), match='line 1 with more to follow.\n\nthis is line 2\n\nLINE 3 has more...\n\nText before line 4\n'>
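
`re.DOTALL` matters most when the text you want sits between two markers on different lines. A minimal sketch, assuming the standard-library `re`; the `BEGIN`/`END` markers and rows are made up for illustration:

```python
import re  # assumption: standard-library `re`

block = "BEGIN\nfirst row\nsecond row\nEND"  # hypothetical text

# Without DOTALL the period stops at the first newline, so END is never reached.
without = re.search(r"BEGIN(.*)END", block)

# With DOTALL the period also matches newlines.
with_dotall = re.search(r"BEGIN(.*)END", block, flags=re.S)

print(without)                     # None
print(repr(with_dotall.group(1)))  # '\nfirst row\nsecond row\n'
```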

## Capture groups vs no capture groups

In [39]:
# no capture groups
animal_pat = re.compile(r"^[a-z]+\s+\d{1,4}\s+\w+", re.M)

In [40]:
# find targets
animals_data = animal_pat.findall(demo_text)
animals_data

['cat 3452 black',
 'dog 234 white',
 'fish 3562 silver',
 'rat 94 white',
 'bird 1957 purple',
 'horse 3 black',
 'chameleon  5724 green',
 'snake 6730 red']

In [41]:
# turn into df
df = pd.DataFrame(animals_data)
df

Unnamed: 0,0
0,cat 3452 black
1,dog 234 white
2,fish 3562 silver
3,rat 94 white
4,bird 1957 purple
5,horse 3 black
6,chameleon 5724 green
7,snake 6730 red


In [42]:
# set pattern, using capture groups
animal_pat2 = re.compile(r"^([a-z]+)\s+(\d{1,4})\s+(\w+)", re.M)

In [43]:
# find targets
data = animal_pat2.findall(demo_text)
data

[('cat', '3452', 'black'),
 ('dog', '234', 'white'),
 ('fish', '3562', 'silver'),
 ('rat', '94', 'white'),
 ('bird', '1957', 'purple'),
 ('horse', '3', 'black'),
 ('chameleon', '5724', 'green'),
 ('snake', '6730', 'red')]

In [44]:
# turn into df
df = pd.DataFrame(data)
df.columns = ["animal", "quantity", "color"]
df

Unnamed: 0,animal,quantity,color
0,cat,3452,black
1,dog,234,white
2,fish,3562,silver
3,rat,94,white
4,bird,1957,purple
5,horse,3,black
6,chameleon,5724,green
7,snake,6730,red


In [45]:
type(data)

list

In [46]:
type(data[0])

tuple

In [47]:
# what if we don't want to capture the number? make that group non-capturing with `?:`
animal_pat3 = re.compile(r"^([a-z]+)\s+(?:\d{1,4})\s+(\w+)", re.M)

# give a capture group a name with `?P<some_name>` (findall still returns plain tuples)
animal_pat4 = re.compile(r"^(?P<Animal>[a-z]+)\s+(?:\d{1,4})\s+(\w+)", re.M)

In [48]:
more_data = animal_pat3.findall(demo_text)
more_data

[('cat', 'black'),
 ('dog', 'white'),
 ('fish', 'silver'),
 ('rat', 'white'),
 ('bird', 'purple'),
 ('horse', 'black'),
 ('chameleon', 'green'),
 ('snake', 'red')]

In [49]:
new_data = animal_pat4.findall(demo_text)
new_data

[('cat', 'black'),
 ('dog', 'white'),
 ('fish', 'silver'),
 ('rat', 'white'),
 ('bird', 'purple'),
 ('horse', 'black'),
 ('chameleon', 'green'),
 ('snake', 'red')]
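
Group names do not show up in `findall()` results, but they are available on match objects. The sketch below is one way to use them, assuming the standard-library `re`; the two sample lines and the lowercase group names are made up, and `finditer()` and `groupdict()` are not used elsewhere in this notebook.

```python
import re  # assumption: standard-library `re`

sample = "cat 3452 black\ndog 234 white\n"  # hypothetical lines in the same layout

named_pat = re.compile(
    r"^(?P<animal>[a-z]+)\s+(?P<quantity>\d{1,4})\s+(?P<color>\w+)", re.M
)

# finditer() yields match objects; groupdict() maps each group name to its text
rows = [m.groupdict() for m in named_pat.finditer(sample)]
print(rows[0])  # {'animal': 'cat', 'quantity': '3452', 'color': 'black'}

first = named_pat.search(sample)
print(first.group("animal"), first.group(1))  # cat cat
```

A list of dictionaries like `rows` can be passed straight to `pd.DataFrame()`, with the group names becoming column labels.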

# Class Challenge

## Confession Judgments

Read <a href="https://drive.google.com/file/d/1nZ6GMwQKuQEQ2zKftkod4oHh-RqxfwXB/view?usp=sharing">confession judgment excerpts</a> and place into a dataframe with column labels:

- case number
- date received
- total repayment amount

In [50]:
# regex for case number (the period after NO is escaped so it matches a literal ".")
pat_case_number = re.compile(r"INDEX NO\.\s+(\d+\/\d{4})", re.I)

In [51]:
# defining regex for date received
pat_date_received = re.compile(r"NYSCEF:\s{0,3}(\d{2}\/\d{2}\/\d{4})", re.I)

# {0,} also works like *

In [52]:
# regex for repayment amount
pat_repay_total = re.compile(r"\(\$(\d{1,3}(?:,?\d{3})*\.\d{2})\)\W? within", re.I)
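
Before running the repayment pattern on the real file, it can help to try it on a few hand-written strings. The sketch below assumes the standard-library `re`; the test sentence is invented, and stripping commas before calling `float()` is just one possible way to get numbers out of the captured text.

```python
import re  # assumption: standard-library `re`

pat_repay_total = re.compile(r"\(\$(\d{1,3}(?:,?\d{3})*\.\d{2})\)\W? within", re.I)

# hypothetical test sentence: two well-formed amounts and one without cents
trial = ("sum of ($2100.00) within 30 days; ($52,583.08), within a year; "
         "($3) within a week")

amounts = pat_repay_total.findall(trial)
print(amounts)  # ['2100.00', '52,583.08']

# the captures are strings; remove commas before converting to numbers
as_numbers = [float(a.replace(",", "")) for a in amounts]
print(as_numbers)  # [2100.0, 52583.08]
```

The `($3)` amount is skipped because the pattern requires a decimal point followed by two digits.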

In [53]:
# read file
with open("confession-judgments.txt", "r") as file:
    confessions_text = file.read()

In [54]:
pat_case_number.findall(confessions_text)

['811678/2020', '813094/2020', '801263/2020']

In [55]:
pat_date_received.findall(confessions_text)

['08/03/2020', '10/15/2020', '01/28/2020']

In [56]:
pat_repay_total.findall(confessions_text)

['2100.00', '214.59', '862.00']

### Breaking up the file into cases

In [57]:
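# break up text at each file marker
# (the third positional argument of re.split() is maxsplit, not flags, so no flag is passed)
split_pat = re.compile(r"FILE_Info: conversion_img")
cases = split_pat.split(confessions_text)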
cases

['',
 '/811678_2020_Marie_A_Cannon_Commi_v_Marie_A_Cannon_Commi_CONFESSION_OF_JUDGM_1_page_1.jpg \n\n\n              FILED: ERIE COUNTY CLERK 08/03/2020 02:27 PM INDEX NO. 811678/2020\n\nNYSCEF DOC. NO. 1 RECEIVED NYSCEF: 08/03/2020\n\nSTATE OF NEW YORK\nSUPREME COURT: COUNTY OF ERIE\n\nAL Dirschberger PhD, as Acting Commissioner of\nErie County Department of Social Services\n\nPLAINTIFF Affidavit of Confession of Judgment\nOliveras, Ana\nVS Index No.\n\nDEFENDANT\nSID 6301\n\nSTATE OF NEW YORK |\nCOUNTY OF ERIE ] SS:\nCITY OF BUFFALO]\n\nThe Deponent being duly sworn, deposes and says:\n1. I am the defendant in the above entitled action.\n\n2. [reside at 78 Elgas Street (upper), Burraco 14207, County of Ene, State of New York. I authorize entry of\njudgment in Erie County, State of New York if my residence is not in New York State.\n\n3. confess judgment in this court, in favor of the plaintiff and against the defendant in the sum of Two\nThousand One Hundred Doilars and 00/100 ($2100

In [58]:
len(cases)

4
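
The length is 4 even though there are three cases because the text begins with the marker, so `split()` returns an empty string as the first piece. A minimal sketch, assuming the standard-library `re` and a made-up, shortened marker and text:

```python
import re  # assumption: standard-library `re`

# hypothetical text with a marker at the start of each case
combined = "FILE_Info: page_1\ncase A text\nFILE_Info: page_2\ncase B text\n"

marker = re.compile(r"FILE_Info: ")

pieces = marker.split(combined)
print(len(pieces))      # 3
print(repr(pieces[0]))  # ''

# maxsplit limits how many splits are made
one_split = marker.split(combined, maxsplit=1)
print(len(one_split))   # 2

# drop empty pieces if you only want real cases
real_cases = [p for p in pieces if p.strip()]
print(len(real_cases))  # 2
```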

In [59]:
## MY EXERCISE

# initializing
check_case = re.compile(r"FILED:\s+ERIE\s+COUNTY\s+CLERK")
errors_list = [] # holds our errors
counter = 0

# lists to hold our variables
case_number = []
date_received = []
repay_total = []

for item in cases:
    counter += 1
    if check_case.search(item):
        try:
            # extract all three values first so the lists stay the same length
            case = pat_case_number.findall(item)[0]
            date = pat_date_received.findall(item)[0]
            repayment = pat_repay_total.findall(item)[0]
        except Exception as e:
            print(f"There is an error for item #{counter}: {e}")
            errors_list.append((item, e)) # append() takes one argument, so store a tuple
        else:
            # only runs when nothing went wrong
            case_number.append(case)
            date_received.append(date)
            repay_total.append(repayment)
            print(f"Case #{counter} extracted.")
    else:
        print(f"Nothing to extract for case #{counter}.")

# zip the lists once, after the loop
temp_list = list(zip(case_number, date_received, repay_total))
main_df = pd.DataFrame(temp_list, columns=[
    "case_number", "date_received", "repayment_amount"
    ])
main_df

Nothing to extract for case #1.
Case #2 extracted.
Case #3 extracted.
Case #4 extracted.


Unnamed: 0,case_number,date_received,repayment_amount
0,811678/2020,08/03/2020,2100.0
1,813094/2020,10/15/2020,214.59
2,801263/2020,01/28/2020,862.0


In [60]:
## SANDEEP'S SOLUTION

judgments_list = []

for case in cases:
    if pat_case_number.search(case) is not None:
        case_no = pat_case_number.findall(case)[0]
        date_received = pat_date_received.findall(case)[0]
        total_due = pat_repay_total.findall(case)[0]

        judgments_list.append({
            "case": case_no,
            "date_received": date_received,
            "total_due": total_due
        })

judgments_list

[{'case': '811678/2020',
  'date_received': '08/03/2020',
  'total_due': '2100.00'},
 {'case': '813094/2020', 'date_received': '10/15/2020', 'total_due': '214.59'},
 {'case': '801263/2020', 'date_received': '01/28/2020', 'total_due': '862.00'}]

In [61]:
## MY REFINED EXERCISE SOLUTION

# initializing
check_case = re.compile(r"FILED:\s+ERIE\s+COUNTY\s+CLERK")
errors_list = [] # holds our errors
counter = 0

all_data = []

for item in cases:
    counter += 1
    if pat_case_number.search(item) is not None:
        try:
            case_num = pat_case_number.findall(item)[0]
            date = pat_date_received.findall(item)[0]
            repayment = pat_repay_total.findall(item)[0]
        except Exception as e:
            print(f"There is an error for item #{counter}: {e}")
            errors_list.append((item, e)) # append() takes one argument, so store a tuple
        else:
            # only runs when all three values were found
            all_data.append({
                "case_number": case_num,
                "date_received": date,
                "repayment_amount": repayment
            })
            print(f"Case #{counter} extracted.")
    else:
        print(f"Nothing to extract for case #{counter}.")
    
main_df = pd.DataFrame(all_data)
main_df

Nothing to extract for case #1.
Case #2 extracted.
Case #3 extracted.
Case #4 extracted.


Unnamed: 0,case_number,date_received,repayment_amount
0,811678/2020,08/03/2020,2100.0
1,813094/2020,10/15/2020,214.59
2,801263/2020,01/28/2020,862.0
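
The `findall(...)[0]` calls in the solutions above raise an `IndexError` when a field is missing from a case. One possible alternative is a small helper that returns `None` instead. This is only a sketch: `first_or_none` is a hypothetical name, the standard-library `re` is assumed, and the sample line is taken from the first case in the file.

```python
import re  # assumption: standard-library `re`

def first_or_none(pattern, text):
    """Return the first findall() result, or None when nothing matches."""
    results = pattern.findall(text)
    return results[0] if results else None

case_pat = re.compile(r"INDEX NO\.\s+(\d+/\d{4})", re.I)

good = "FILED: ERIE COUNTY CLERK 08/03/2020 02:27 PM INDEX NO. 811678/2020"
bad = "page with no index line"  # hypothetical text without a case number

print(first_or_none(case_pat, good))  # 811678/2020
print(first_or_none(case_pat, bad))   # None
```

With a helper like this, a missing field shows up as an empty cell in the dataframe rather than stopping the extraction for that case.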
