
In [None]:
import pandas as pd
import re
import numpy as np

## print multiple things from same cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Load data and show examples

In [None]:
cep_optin = pd.read_excel("https://frac.org/wp-content/uploads/2021SY-CEP_Database_Export.xlsx")


In [None]:
## clean colnames
new_colnames = [re.sub("[^A-Za-z0-9]+", "", col.lower()) for col in cep_optin.columns]

## add back
cep_optin.columns = new_colnames

cep_optin.head()

cep_optin['schoolname_lower'] = cep_optin.schoolname.str.lower()

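## check whether name of school contains
## whitespace followed by elem (e.g., " elem", " elementary")
## raw string so \s reaches the regex engine unchanged;
## na = False treats missing names as non-matches
cep_optin['is_elem'] = cep_optin.schoolname_lower.str.contains(r"\s+elem", regex = True, na = False)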


test_schools = cep_optin.loc[(cep_optin.is_elem) &
               (cep_optin.schoolname_lower.str.contains("elem\\.")) |
               (cep_optin.schoolname_lower.str.contains("elem")) |
               (cep_optin.schoolname_lower.str.contains("elementary")) |
               (cep_optin.schoolname_lower.str.contains("esd")),
            'schoolname_lower'].sample(n = 30, random_state = 470)

test_schools_show = test_schools.iloc[13:23]

test_schools_show
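A note on the filter used to build `test_schools`: it chains `&` and `|` without grouping the `|` terms. The sketch below uses plain Python booleans with made-up values (the three flag names are hypothetical) to show how such an expression is parsed; the same precedence applies to pandas boolean Series. Whether the ungrouped form was intended is not stated in the notebook.

```python
# Hypothetical flags for a single school name, for illustration only
is_elem = False          # name does not contain whitespace + "elem"
has_elem_dot = False     # name does not contain "elem."
has_esd = True           # name contains "esd"

# & binds more tightly than |, so this parses as (is_elem & has_elem_dot) | has_esd
unparenthesized = is_elem & has_elem_dot | has_esd

# explicit parentheses make is_elem apply to every alternative
parenthesized = is_elem & (has_elem_dot | has_esd)

print(unparenthesized, parenthesized)
```

With the ungrouped form, a name that only contains `esd` passes the filter even though `is_elem` is false.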

# Re.sub illustrations

**Task**: in the `schoolname` field, replace the different variants of "elementary school" with `elemschool`

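## Incorrect approach

This returns incorrect results, which we'll see below: alternatives in an OR pattern are tried left to right, so the shorter `elementary` and `elem` match before `elementary school` or the `elem.` variant is ever tried.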

In [None]:
elem_pattern = r"elementary|elem|elem\\.|elementary school"

new_schools = [re.sub(elem_pattern, "elemschool", school) for school in test_schools_show]

old_and_new = pd.DataFrame({'orig_name': test_schools_show,
                           'cleaned_name': new_schools})

#print(old_and_new.to_latex(index = False))
old_and_new
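To make the failure easier to see outside the sampled data, here is a minimal sketch that applies the same pattern to three made-up names (the names are hypothetical, chosen to cover each variant).

```python
import re

# same pattern as the incorrect approach
elem_pattern = r"elementary|elem|elem\\.|elementary school"

# hypothetical names, one per variant
examples = ["stove prairie elementary school", "lincoln elem.", "oak elem"]

cleaned = [re.sub(elem_pattern, "elemschool", name) for name in examples]

for before, after in zip(examples, cleaned):
    print(before, "->", after)
# "elementary" wins before "elementary school" is tried, leaving a stray " school";
# "elem" wins on "elem.", leaving the period behind
```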

### Question in class: would it work to change order of OR statement?


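Answer: it gets closer (e.g., stewart county and stove prairie are fixed!), but names containing `elem.` are still wrong. Inside a raw string, `elem\\.` means `elem`, then a literal backslash, then any character, so it never matches `elem.`; the plain `elem` alternative matches instead and the period is left behind.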

In [None]:
elem_pattern_difforder = r"elementary school|elementary|elem\\.|elem"

new_schools_difforder = [re.sub(elem_pattern_difforder, "elemschool", school) for school in test_schools_show]

new_schools_difforder
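The remaining `elem.` problem comes from the escaping rather than the ordering. A small sketch, using two made-up names, compares the reordered pattern as written with a variant that uses a single backslash (`elem\.`) inside the raw string; the variant is an illustration, not something shown in class.

```python
import re

# reordered pattern as written: \\. is a literal backslash plus any character
as_written = r"elementary school|elementary|elem\\.|elem"

# variant with a single backslash: \. is a literal period
single_backslash = r"elementary school|elementary|elem\.|elem"

# hypothetical names
examples = ["lincoln elem.", "stove prairie elementary school"]

out_as_written = [re.sub(as_written, "elemschool", name) for name in examples]
out_single = [re.sub(single_backslash, "elemschool", name) for name in examples]

print(out_as_written)
print(out_single)
```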

## A correct approach

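Addresses issues above with `elementary school` and `elem.`. Note that `elem.*` is greedy: it matches from `elem` through the end of the name, so any text that follows (not just `school`) is replaced as well.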

In [None]:
elem_pattern_try2 = r"(elem.*)(\s+)?(school)?"
    
new_schools_try2 = [re.sub(elem_pattern_try2, "elemschool", school) 
                   for school in test_schools_show]    


old_and_new_try2 = pd.DataFrame({'orig_name': test_schools_show,
                           'cleaned_name': new_schools_try2})

#print(old_and_new_try2.to_latex(index = False))
old_and_new_try2
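The pattern works for names that end in some form of "elementary school". As a hedged sketch of what happens when more text follows, the block below compares it with a tighter pattern on made-up names. The tighter pattern is an assumption for illustration, not the one used in class.

```python
import re

# pattern from the cell above: elem.* runs to the end of the string
greedy = r"(elem.*)(\s+)?(school)?"

# hypothetical tighter alternative: elem, optionally "entary" or ".",
# optionally followed by whitespace and "school"
tighter = r"elem(?:entary|\.)?(?:\s+school)?"

# hypothetical names
examples = ["oak elem. school annex", "pine elementary", "birch elem"]

out_greedy = [re.sub(greedy, "elemschool", name) for name in examples]
out_tighter = [re.sub(tighter, "elemschool", name) for name in examples]

print(out_greedy)    # "annex" is dropped from the first name
print(out_tighter)   # "annex" is kept
```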

## Question from class - how do we tell the `re` functions to ignore case?

Answer: pass the optional argument `flags = re.IGNORECASE` to the `re` function (here, `re.sub`)

In [None]:
orig_case_schools = cep_optin.schoolname.sample(n = 10, random_state = 54)

orig_case_schools

## do same pattern but with the re.ignorecase
orig_case_schools_sub = [re.sub(elem_pattern_try2, "elemschool", school, flags=re.IGNORECASE) 
                           for school in orig_case_schools]


## see that it matches things like Elementary despite capitalization
## leaves the capitalization the same but just does the replacement despite that
orig_case_schools_sub

## example also shows we may want to modify pattern to capture things like El
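A compact before/after comparison on made-up names (hypothetical, mixed capitalization) shows the effect of the flag. Passing it by keyword matters: the fourth positional argument of `re.sub` is `count`, not `flags`.

```python
import re

elem_pattern_try2 = r"(elem.*)(\s+)?(school)?"

# hypothetical names with mixed capitalization
names = ["Oak ELEMENTARY SCHOOL", "Pine Elem.", "birch elem"]

# case-sensitive: only the all-lowercase name is changed
case_sensitive = [re.sub(elem_pattern_try2, "elemschool", n) for n in names]

# case-insensitive: all three are changed; untouched text keeps its capitalization
case_insensitive = [re.sub(elem_pattern_try2, "elemschool", n, flags=re.IGNORECASE)
                    for n in names]

print(case_sensitive)
print(case_insensitive)
```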

# re.findall and re.search illustrations

**Task**: create a pattern that, for charter schools, extracts the part of the school name that comes before `charter`. School names without `charter` will have no match.

## re.findall 

In [None]:

test_patterns = ["rebeccajohnson8", "rebeccajohnson88", "rebeccajohnson796"]

[re.findall(r"[a-z]+\d+", pat)[0] for pat in test_patterns]

In [None]:
## pull some charter examples and other examples
charter_examples = cep_optin.schoolname_lower[cep_optin.schoolname_lower.astype(str).str.contains("charter")].sample(n = 8,
                    random_state = 422).to_list()
other_examples = cep_optin.schoolname_lower[~cep_optin.schoolname_lower.astype(str).str.contains("charter")].sample(n = 8,
                    random_state = 422).to_list()


combined_examples = charter_examples + other_examples
combined_examples


In [None]:
## charter pattern
charter_pattern = r"(.*)\s+(charter)(\s+)?(\w+)?"

## findall 
test_charter_findall = [re.findall(charter_pattern, 
                    school) for school in combined_examples]

## print result
test_charter_findall



In [None]:
## show example of one
print(test_charter_findall[0][0][0])
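The triple index above can look odd at first. When a pattern has several groups, `re.findall` returns a list with one tuple per match and one tuple entry per group, and groups that did not take part in the match come back as empty strings. A minimal sketch with the same pattern (the second and third names are made up):

```python
import re

charter_pattern = r"(.*)\s+(charter)(\s+)?(\w+)?"

with_suffix = re.findall(charter_pattern, "thomas edison charter academy")
no_suffix = re.findall(charter_pattern, "sunrise charter")   # hypothetical name
no_charter = re.findall(charter_pattern, "oak elementary")   # hypothetical name

print(with_suffix)    # one match, four groups
print(no_suffix)      # optional groups that did not match are ''
print(no_charter)     # no match gives an empty list

# [0] picks the first match, the second [0] picks group 1 within it
print(with_suffix[0][0])
```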

## re.search

In [None]:
## get matches
test_charter_search = [re.search(charter_pattern, 
                    school) for school in combined_examples]

test_charter_search


In [None]:
## extract matches

### here, we're just focusing on the 2nd match (thomas edison charter academy)
### and we're getting the first group from that match
thomas_match = test_charter_search[1]
thomas_match

### example where we're just getting the first group
### (name of school before charter)
thomas_firstgroup = thomas_match.group(1)
thomas_firstgroup


In [None]:
### iterate over all groups and print
for i in range(0, len(thomas_match.groups())+1):
    print("Group " + str(i) + " is: ")
    print(thomas_match.group(i))

## group() raises an IndexError if we go beyond
## the actual number of groups, e.g.:
## thomas_match.group(5)
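As a small self-contained sketch of the group numbering: group 0 is the whole match, groups 1 through 4 are the parenthesized parts of the pattern, and asking for a group that does not exist raises an `IndexError`. The match is rebuilt here from the literal string so the block runs on its own.

```python
import re

charter_pattern = r"(.*)\s+(charter)(\s+)?(\w+)?"
match = re.search(charter_pattern, "thomas edison charter academy")

whole_match = match.group(0)        # the full matched text
n_groups = len(match.groups())      # number of parenthesized groups in the pattern

try:
    match.group(5)                  # the pattern only has 4 groups
    error_raised = False
except IndexError:
    error_raised = True

print(whole_match, n_groups, error_raised)
```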

## Question from class - is there a way to pull multiple matched groups at once by feeding .group() something like a list of indices

Response: if you call object.groups() with no index, it returns a tuple of all groups. You can then slice/subset that tuple using indices. `.group()` also accepts several indices at once, e.g. `.group(1, 2)`, and returns the requested groups as a tuple.

In [None]:
## example- want to return group 1 and group 2 and paste together
thomas_groups_all = thomas_match.groups()
thomas_groups_all

## slice the tuple
thomas_groups_all[0:2]

## do in one step


thomas_groups_12 = thomas_match.groups()[0:2]
thomas_groups_12
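For completeness, a short sketch of both routes to "group 1 and group 2 together": slicing the tuple from `.groups()` and passing several indices to `.group()`. The match object is rebuilt from the literal string so the block is self-contained.

```python
import re

charter_pattern = r"(.*)\s+(charter)(\s+)?(\w+)?"
match = re.search(charter_pattern, "thomas edison charter academy")

# slice the tuple of all groups (tuple index 0 is group 1)
by_slice = match.groups()[0:2]

# ask for groups 1 and 2 directly
by_index = match.group(1, 2)

# paste the two pieces together
pasted = " ".join(by_index)

print(by_slice, by_index, pasted)
```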

In [None]:
## can generalize to the full list with an if/else
def get_precharter_name(one_matchobj):
    
    if one_matchobj:
        school_name = one_matchobj.group(1)
    else:
        school_name = ""
    
    return school_name

all_charter_match = [get_precharter_name(one_search) 
                    for one_search in test_charter_search]

all_charter_match
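When the names already live in a pandas Series, the same extraction can be done without a helper function. This is a sketch of one alternative, assuming pandas is available; the second and third names are made up. `Series.str.extract` returns one column per group, with missing values where the pattern does not match.

```python
import pandas as pd

charter_pattern = r"(.*)\s+(charter)(\s+)?(\w+)?"

# hypothetical names: two charter schools and one non-charter school
names = pd.Series(["thomas edison charter academy", "oak elementary", "sunrise charter"])

# one column per capture group; rows without a match are all missing
extracted = names.str.extract(charter_pattern)

# column 0 holds group 1 (the name before "charter"); fill non-matches with ""
precharter = extracted[0].fillna("").tolist()

print(precharter)
```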

# Group activity

- Return to the full list of school names in the original data
- You want to find the names of high schools. Try out some patterns to standardize the high school names (e.g., `high school` and `high` could both become `highschool`)
- Then, using some example results, try writing a regex pattern and using re.match to get the name of the school that precedes the `highschool` part of the name (e.g., `new trier highschool` -> `new trier`)



## Standardizing high school name

In [None]:

### first pull out some examples to test on
hs_examples = cep_optin.schoolname_lower[cep_optin.schoolname_lower.astype(str).str.contains("high|hs")].sample(n = 15,
                    random_state = 422).to_list()

hs_examples


In [None]:

## for now, ignoring jr/senior distinction
## and matching on high school, high, and hs

## to avoid matching things like highland, 
## after high or hs, add the lookahead (?=\s|$) that 
## requires either a space or the end of the string 
## to come next without consuming it, so a word that 
## follows stays separated from highschool
hs_sub_pattern = r"\s(?:high|hs)(?=\s|$)(?:\s+school(?=\s|$))?"
test_pat_examples = [re.sub(hs_sub_pattern, " highschool", example) 
                    for example in hs_examples]

test_pat_examples

In [None]:
### apply over all and assign as a new column
### since we're pulling from original df
### casting it to string since was object
hs_clean_all = [re.sub(hs_sub_pattern, " highschool", str(oneschool)) 
                    for oneschool in cep_optin.schoolname_lower.to_list()]


### assign as col
cep_optin['school_cleanhs'] = hs_clean_all
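One detail worth checking when a substitution pattern tests for "a space or the end of the string": whether the space is consumed by the match. The sketch below compares a consuming group with a lookahead on made-up names; both patterns are simplified illustrations handling only `high`.

```python
import re

# consuming version: the trailing space becomes part of the match and is replaced
consuming = r"\shigh(\s|$)"

# lookahead version: the space must be there but is left in place
lookahead = r"\shigh(?=\s|$)"

# hypothetical names
examples = ["north high academy", "huron high", "east highland"]

out_consuming = [re.sub(consuming, " highschool", name) for name in examples]
out_lookahead = [re.sub(lookahead, " highschool", name) for name in examples]

print(out_consuming)   # the first name loses the space before "academy"
print(out_lookahead)   # the space is preserved; "highland" is untouched in both
```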

## With some examples, pulling out name of school before high

In [None]:
## using the test_pat_examples and want to get things like huron, thomson, clovis east
prehs_pattern = r"(.*)\s+(highschool)(\s+)?(\w+)?"

schoolname_preh_matchobj  = [re.search(prehs_pattern, 
                    school) for school in test_pat_examples]

schoolname_preh_matchobj

## get the first group if exists; else return empty string
schoolname_preh = [obj.group(1) if obj else "" for obj in schoolname_preh_matchobj]
schoolname_preh
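The activity text mentions `re.match`, while the cell above uses `re.search`. As a brief sketch of the difference: `re.match` only looks for a match starting at the beginning of the string, whereas `re.search` scans the whole string. Because `prehs_pattern` begins with `(.*)`, both give the same result here.

```python
import re

name = "new trier highschool"

# a bare pattern: match fails because the string does not start with "highschool"
bare_match = re.match(r"highschool", name)
bare_search = re.search(r"highschool", name)

# the full pattern starts with (.*), so it can match from position 0 either way
prehs_pattern = r"(.*)\s+(highschool)(\s+)?(\w+)?"
via_match = re.match(prehs_pattern, name).group(1)
via_search = re.search(prehs_pattern, name).group(1)

print(bare_match, bare_search.group(0), via_match, via_search)
```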