# Dataframes and Text

We have previously seen that we can search and extract text from single variables using
regular expressions. The same searches can be applied to entire columns, returning new
columns of results. We write the patterns passed to the vectorized search functions as Python
[raw strings](https://docs.python.org/3/reference/lexical_analysis.html#escape-sequences) to
avoid having to escape the backslashes. For more details you can read the official Python
[RegEx How-To](https://docs.python.org/3/howto/regex.html). As always,
[Regex101](https://regex101.com/) is a fantastic resource for testing and debugging regular
expressions against test strings.
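
As a quick illustration of why raw strings matter, the sketch below compares the escaped and raw spellings of the same pattern. The sample text is made up for illustration.

```python
import re

# A raw string keeps each backslash literally, so both spellings give the same pattern.
escaped = "\\d{4}"
raw = r"\d{4}"
same_pattern = escaped == raw
print(same_pattern)  # True

# Without the r prefix, "\b" is a single backspace character, not a word boundary.
lengths = (len("\b"), len(r"\b"))
print(lengths)  # (1, 2)

# Hypothetical text: only the stand-alone four digit number is a whole "word".
found = re.findall(r"\b\d{4}\b", "pin 1234 and id 123456")
print(found)  # ['1234']
```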

In [None]:
import pandas as pd
import re

## Credit Card Example

We will load the `Credit Card Info.xlsx` file from our data directory. Our goal is to
extract the credit card information from emailed free-text responses. In this scenario we
are given a small sample file that represents the variability in the larger dataset. We
will develop our extraction methods using the sample file.

In [22]:
samples = pd.read_excel(
    "../data/Credit Card Info.xlsx",
    dtype = {
        "id": pd.Int64Dtype(),
        "Event Log": pd.StringDtype()
    }
)
print(samples.dtypes)

id                    Int64
Event Log    string[python]
dtype: object


Each pandas column, called a Series, supports string processing with regular expressions
through its `.str` accessor, including:

* [`Contains`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.contains.html) test whether there is a match anywhere in the string.
* [`Extract`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.extract.html) return the capture groups of the first match.
* [`Extract All`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.extractall.html) return the capture groups of all matches.
* [`Find`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.find.html) return the index of the first occurrence of a plain substring (this method does not take a regular expression).
* [`Find All`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.findall.html) return a list of all the matches.
* [`Full Match`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.fullmatch.html) test whether the entire string matches.
* [`Replace`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.replace.html) replace each occurrence with a value.
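
The methods listed above can be compared side by side on a tiny Series. This is a minimal sketch: the `notes` values are invented for illustration and are not taken from the sample file.

```python
import pandas as pd

# Hypothetical sample values, invented for illustration.
notes = pd.Series(["order 12 and 345", "no digits here", "7"], dtype="string")

has_digits = notes.str.contains(r"\d+", regex=True)    # is there a match anywhere?
first_run = notes.str.extract(r"(\d+)", expand=False)  # capture group of the first match
all_runs = notes.str.findall(r"\d+")                   # list of every match
only_digits = notes.str.fullmatch(r"\d+")              # does the whole string match?
masked = notes.str.replace(r"\d", "#", regex=True)     # substitute each occurrence
offset = notes.str.find("digits")                      # plain substring index, -1 if absent

print(pd.DataFrame({
    "contains": has_digits,
    "extract": first_run,
    "findall": all_runs,
    "fullmatch": only_digits,
    "replace": masked,
    "find": offset,
}))
```

Passing `expand=False` to `extract` returns a Series when the pattern has a single capture group, which is convenient for chaining further `.str` calls.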

In [None]:
samples["Event Log"].str.contains(
    r"(\d{4}-){2,3}\d{4}",
    regex = True
)

We can use this to extract the first matching credit card number. Note the non-capturing
`(?: ...)` group used to match the individual sequences of numbers in the repetitions. By
default all brackets form capturing groups; that is, the search returns a list of strings,
one for the text matched inside each pair of brackets. We can tell the search engine that
the brackets are used for organizing only by making the group non-capturing.

The distinction is between what gets matched, the whole regular expression, and what gets
returned, the grouping brackets `(...)` within the regular expression. By default brackets
mean "please return this part of the match".
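
To make the matched-versus-returned distinction concrete, here is a minimal sketch using the standard `re` module. The `text` value is a made-up card number, not data from the sample file.

```python
import re

# Hypothetical input, invented for illustration.
text = "card: 4444-3333-2222-1111"

# Capturing group: findall returns only what the brackets captured,
# and a repeated group keeps just its final repetition.
captured = re.findall(r"(\d{4}-){3}\d{4}", text)
print(captured)  # ['2222-']

# Non-capturing group: the brackets only organize, so the whole match is returned.
whole = re.findall(r"(?:\d{4}-){3}\d{4}", text)
print(whole)  # ['4444-3333-2222-1111']

# One outer capturing group around everything returns the full number as group 1.
outer = re.search(r"((?:\d{4}-){3}\d{4})", text).group(1)
print(outer)  # 4444-3333-2222-1111
```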

### Business Analysis Question

The first match is a complete number. Should we return the other three incomplete numbers?
Try out different combinations of the values in the repetition indicator `{start,stop}`,
the minimum and maximum number of repetitions. By default regular expression quantifiers
are *greedy*: starting from the first position where a match is possible, they take as many
repetitions as the rest of the pattern allows rather than stopping at the minimum.
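
The effect of the repetition bounds can be sketched on a single made-up number. The lazy form `{2,3}?` is shown only for contrast with the greedy default and is not used elsewhere in this notebook.

```python
import re

# Hypothetical input: one complete 16 digit number.
text = "1111-2222-3333-4444"

# Greedy (default): take three repetitions when three are available.
greedy = re.search(r"(?:\d{4}-){2,3}\d{4}", text).group()
# Lazy: stop at the minimum number of repetitions that still lets the pattern match.
lazy = re.search(r"(?:\d{4}-){2,3}?\d{4}", text).group()
# Fixed count: exactly two repetitions, so only twelve digits are returned.
fixed = re.findall(r"(?:\d{4}-){2}\d{4}", text)

print(greedy)  # 1111-2222-3333-4444
print(lazy)    # 1111-2222-3333
print(fixed)   # ['1111-2222-3333']
```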

In [None]:
ISCREDIT = r"((?:(?:\d{4}[-]){2,3}|(?:\d{4}[ ]){2,3}|(?:\d{4}){2,3})\d{4})"
samples["Event Log"].str.extract(ISCREDIT).replace(
    r"\D",
    "",
    regex = True
)

As an exercise in *or* groups `(...|...|...)`, a more general form matches numbers that are
delimited only by dashes, or only by spaces, or that have no delimiters at all:
```
((?:(?:\d{4}[-]){2,3}|(?:\d{4}[ ]){2,3}|(?:\d{4}){2,3})\d{4})
```
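
A minimal check of the three alternatives, using the pattern above with the standard `re` module. The candidate strings are invented for illustration; the last one mixes delimiters and is expected to be rejected.

```python
import re

ISCREDIT = r"((?:(?:\d{4}[-]){2,3}|(?:\d{4}[ ]){2,3}|(?:\d{4}){2,3})\d{4})"

# Hypothetical inputs, one per delimiter style plus a mixed case.
candidates = [
    "1234-5678-9012-3456",  # dashes only
    "1234 5678 9012 3456",  # spaces only
    "1234567890123456",     # no delimiters
    "1234-5678 9012-3456",  # mixed delimiters
]

results = []
for candidate in candidates:
    match = re.search(ISCREDIT, candidate)
    # Store the captured number, or None when no alternative matches.
    results.append(match.group(1) if match else None)
    print(repr(candidate), "->", results[-1])
```

Each alternative fixes one delimiter for the whole number, which is why the mixed case finds no match.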

As an exercise in negated character classes, such as "do *not* match whitespace", we will
extract the email addresses.
```
[^\s@]+@[^\s@]+\.[^\s@\.]+
```
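
The pattern above can be tried on a few made-up lines with the standard `re` module. The addresses below are hypothetical.

```python
import re

ISEMAIL = r"[^\s@]+@[^\s@]+\.[^\s@\.]+"

# Hypothetical free-text lines, invented for illustration.
plain = re.findall(ISEMAIL, "Contact jane.doe@example.com today")
trailing = re.findall(ISEMAIL, "Reply to bob@site.org.")
missing = re.findall(ISEMAIL, "no address @ all")

print(plain)     # ['jane.doe@example.com']
print(trailing)  # ['bob@site.org']
print(missing)   # []
```

Because the final class also excludes the dot, a sentence-ending period is left out of the match.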

In [None]:
ISEMAIL = r"([^@\s]+@[^@\s]+\.[^@\s\.]+)"
samples["Event Log"].str.extract(ISEMAIL)

In [None]:
ISEXPIRY = r"(\d{2}[/\\]\d{2})"
samples["Event Log"].str.extract(ISEXPIRY).replace(
    r"\D",
    "",
    regex = True
)

### Data Extraction

We can combine all the extraction steps into one cell that creates columns for each of
the extracted data.

In [25]:
# Store the data in fields
ISEMAIL = r"([^@\s]+@[^@\s]+\.[^@\s\.]+)"
ISEXPIRY = r"(?<!\d[/\\ -])(\d\d[/\\ -]?\d\d)(?![/\\ -]?\d)"
ISCREDIT = r"((?:(?:\d{4}[-]){2,3}|(?:\d{4}[ ]){2,3}|(?:\d{4}){2,3})\d{4})"
ISSOCIAL = r"(?<!\d)((?:\d{3}-\d\d-\d{4}|(?:\d{3} \d\d \d{4})|(?:\d{9})))(?![- ]?\d)"
samples["Email"] = samples["Event Log"].str.extract(ISEMAIL)
samples["Credit Card"] = samples["Event Log"].str.extract(ISCREDIT).replace(
    r"\D",
    "",
    regex = True
)
samples["Expiry"] = samples["Event Log"].str.extract(ISEXPIRY).replace(
    r"\D",
    "",
    regex = True
)
samples["Social Security"] = samples["Event Log"].str.extract(ISSOCIAL).replace(
    r"\D",
    "",
    regex = True
)
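
As a possible variation on the cell above, the sketch below runs one of the patterns on a tiny stand-in Series. The `logs` values are invented, and `expand=False` is a design choice of this sketch rather than what the notebook cell uses: it makes `extract` return a Series so the clean-up can be chained with `.str.replace`.

```python
import pandas as pd

# Hypothetical rows standing in for the "Event Log" column.
logs = pd.Series(
    [
        "Card 1234-5678-9012-3456 for jane@example.com",
        "No payment details supplied",
    ],
    dtype="string",
)

ISCREDIT = r"((?:(?:\d{4}[-]){2,3}|(?:\d{4}[ ]){2,3}|(?:\d{4}){2,3})\d{4})"

# Extract the first card number per row, then strip every non-digit character.
cards = logs.str.extract(ISCREDIT, expand=False).str.replace(r"\D", "", regex=True)
print(cards)
```

Rows without a match stay missing, so they can be counted later with `cards.isna().sum()`.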

### Additional References

* [Negative and positive lookaheads and lookbehinds](https://www.regular-expressions.info/lookaround.html).
* Regular Expression Puzzle Games, like [RegEx Golf](https://alf.nu/RegexGolf).

### Look Ahead and Behind

The problem a look-around addresses is that an ordinary match must consume every character
it tests. The technical name for a look-around is a *"zero-length assertion"*: it requires
the pattern inside the look-around to hold at the current position without consuming any
characters. Zero-length assertions do **not** contribute to the positions matched. The
distinction becomes important when using regular expressions for string replacement. While
you can use non-capturing groups to imitate a look-around assertion for text extraction,
they will cause the non-captured parts to be replaced when used in string substitution.

Zero-length assertions are useful for improving the sensitivity and specificity of matches
by incorporating information about the wider context in which the match was made.

Let's start by working through the *"q"* example from the tutorial.

In [26]:
phrase = "How many qs are qs not followed by u? The queen of Iraq is quick at quizzes. How to find this q"
# Substitute with non-capturing groups: the character after q is consumed and replaced too
noncapneg = re.compile(r"q(?:[^u]|$)")
print(noncapneg.sub("Q", phrase))
noncappos = re.compile(r"q(?:u)")
print(noncappos.sub("Q", phrase))
# Substitute with zero-length assertions: only the q itself is replaced
zeroneg = re.compile(r"q(?!u)")
print(zeroneg.sub("Q", phrase))
zeropos = re.compile(r"q(?=u)")
print(zeropos.sub("Q", phrase))

# Extract by both means: the non-capturing versions include the character after q
extnoncapneg = re.compile(r"....q(?:[^u]|$)")
print(extnoncapneg.findall(phrase))
extzeroneg = re.compile(r"....q(?!u)")
print(extzeroneg.findall(phrase))
extnoncappos = re.compile(r"....q(?:u)")
print(extnoncappos.findall(phrase))
extzeropos = re.compile(r"....q(?=u)")
print(extzeropos.findall(phrase))

How many Q are Q not followed by u? The queen of IraQis quick at quizzes. How to find this Q
How many qs are qs not followed by u? The Qeen of Iraq is Qick at Qizzes. How to find this q
How many Qs are Qs not followed by u? The queen of IraQ is quick at quizzes. How to find this Q
How many qs are qs not followed by u? The Queen of Iraq is Quick at Quizzes. How to find this q
['any qs', 'are qs', ' Iraq ', 'his q']
['any q', 'are q', ' Iraq', 'his q']
['The qu', ' is qu', ' at qu']
['The q', ' is q', ' at q']
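
The same difference shows up in the span of a match. A minimal sketch on a single word, chosen here for illustration:

```python
import re

word = "quick"

# The non-capturing group consumes the u; the look-ahead only checks that it is there.
consuming = re.search(r"q(?:u)", word)
asserting = re.search(r"q(?=u)", word)

print(consuming.group(), consuming.span())  # qu (0, 2)
print(asserting.group(), asserting.span())  # q (0, 1)
```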


Find all the expiry dates, four-digit numbers with a variety of delimiters, that are **not**
part of any other number. The general problem is to find runs of exactly $n$ digits that are
**not** a substring of a run of $m>n$ digits.
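
A simplified sketch of the general problem, ignoring delimiters entirely: a negative look-behind and a negative look-ahead for a digit on either side of the run. The helper name `exact_digit_runs` and the sample text are hypothetical, introduced only for this illustration.

```python
import re

def exact_digit_runs(text: str, n: int) -> list:
    """Return runs of exactly n digits that are not part of a longer run of digits."""
    # (?<!\d) rejects a digit just before the run; (?!\d) rejects one just after it.
    pattern = rf"(?<!\d)\d{{{n}}}(?!\d)"
    return re.findall(pattern, text)

# Hypothetical text, invented for illustration.
sample = "pin 1234, id 123456, year 2024"
four = exact_digit_runs(sample, 4)
six = exact_digit_runs(sample, 6)

print(four)  # ['1234', '2024']
print(six)   # ['123456']
```

Handling optional delimiters, as the expiry pattern in this notebook does, means the look-arounds must also account for a delimiter sitting between the run and a neighbouring digit.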

The remaining challenges are:
* Extract the telephone number when available. Hint: use the social security expression as a
template and play with the number of digits.
* Extract the two "words" following the key `Name:`