# Regular expressions and Python

Regular expressions are a formalism for extracting structured information from unstructured text. 
Using this formalism, we can specify a pattern that selects target strings based on their structure. 

For example, we might want to keep only the strings that contain at least one number; a regular expression lets us specify and detect those strings.

In this chapter, we introduce the Python `re` module. We will then use it to identify the figures referenced in scientific papers and count how many times each one is referenced.

## Regular expression formalism

Regular expressions describe text in a generic way, so that we can match patterns that occur in long strings. 

We will focus on a few basic concepts:
<table>
<tr>
    <th>Expression</th>
    <th>Meaning</th>
    <th>Examples that match</th>
    <th>Examples that don't match</th>
</tr>
<tr>
<td>[A-Z]</td>
<td>Matches any single character from A to Z</td>
<td>A, B, C</td>
<td>a, AA, 0</td>
</tr>
<tr>
<td>[A-Z]+</td>
<td>Matches one or more characters from A to Z</td>
<td>A, AA, AAA, AAB, ABCD, JAMES, COFFEE, SPAM</td>
<td>a, aaa, james, coffee, Coffee, or the empty string</td>
</tr>
<tr>
<td>[A-Za-z]+</td>
<td>Matches one or more characters from A to Z or a to z</td>
<td>James, Aa, Abc</td>
<td>Test123, C O F F E E</td>
</tr>
<tr>
<td>[A-Za-z0-9]+</td>
<td>Matches one or more characters from A to Z, a to z or 0 to 9</td>
<td>James, Aa, Abc, Test123</td>
<td>C O F F E E, Coffee?, Coffee!</td>
</tr>
</table>

Note that the examples above refer to full-string matches: the entire string must match the pattern.

Let's try some of these in Python.


In [None]:
import re

def match(pattern, string):
    # If zero or more characters at the beginning of string match this regular expression,
    # re.match returns a corresponding match object.
    # It returns None if the start of the string does not match the pattern.
    m = re.match(pattern, string)

    # result is True only when a match object was returned
    result = m is not None

    print("Testing if {} will match {}. Result: {}".format(pattern, string, result))

    return m


match("[A-Z]", "A")
match("[A-Z]", "a")
match("[A-Z]", "0")
match("[A-Z]", "AA")
print(match("[A-Z]", "AA"))

Let's try with `[A-Z]+`:

In [None]:
match("[A-Z]+", "A")
match("[A-Z]+", "a")
match("[A-Z]+", "0")

print("")

result = match("[A-Z]+", "AA")
print("Matched against substring '{}'\n".format(result.group(0)))

result = match("[A-Z]+", "C O F F E E")
print("Matched against substring '{}'\n".format(result.group(0)))

result = match("[A-Z]+", "James")
print("Matched against substring '{}'".format(result.group(0)))

result = match("[A-Z]+", "JAMes")
print("Matched against substring '{}'".format(result.group(0)))
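
The table earlier describes full-string matches, whereas `re.match` only requires the pattern to match at the start of the string. As a minimal sketch of the difference, `re.fullmatch` can be used to reproduce the table's behaviour. The helper name `full_match_results` is introduced here only for illustration.

```python
import re

def full_match_results(pattern, strings):
    """Map each string to True if the whole string matches the pattern."""
    return {s: re.fullmatch(pattern, s) is not None for s in strings}

# re.match accepts "James" because the leading "J" matches [A-Z]+ ...
prefix = re.match("[A-Z]+", "James").group(0)
print(prefix)  # J

# ... but re.fullmatch rejects it, as in the table.
upper_only = full_match_results("[A-Z]+", ["JAMES", "James", "C O F F E E", ""])
alphanumeric = full_match_results("[A-Za-z0-9]+", ["Test123", "Coffee?"])
print(upper_only)
print(alphanumeric)
```

Only `JAMES` and `Test123` are reported as full matches, which agrees with the "Examples that match" column.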

## More examples of regular expression syntax

Here are a few more constructs that are useful for the examples that follow:

<table>
<tr>
<th>Regular Expression</th>
<th>Meaning</th>
</tr>
<tr>
<td>.</td>
<td>Matches any character except a newline. Note that this includes spaces and punctuation.</td>
</tr>
<tr>
<td>*</td>
<td>Matches zero or more repetitions of the preceding pattern. For example, .* matches any sequence of characters that does not contain a newline, including the empty string.</td>
</tr>
<tr>
<td>?</td>
<td>Matches the preceding pattern zero or one times. This is useful for specifying that something is optional.</td>
</tr>
<tr>
<td>\s</td>
<td>Matches a whitespace character: space, tab, newline and similar characters. No flag is needed for it to match newlines.</td>
</tr>
</table>

To find out more about what Python supports, check out [the documentation page](https://docs.python.org/3/library/re.html#regular-expression-syntax) on regular expressions.
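
As a quick, hedged illustration of the four constructs in the table, the sketch below checks each one on small made-up strings. The variable names are chosen only for this example.

```python
import re

# "." matches any single character except a newline, including a space.
dot_matches_space = re.fullmatch(r"a.c", "a c") is not None
dot_matches_newline = re.fullmatch(r"a.c", "a\nc") is not None
print(dot_matches_space, dot_matches_newline)  # True False

# "*" allows zero or more repetitions, so ".*" also matches the empty string.
star_matches_empty = re.fullmatch(r".*", "") is not None
print(star_matches_empty)  # True

# "?" makes the preceding item optional.
optional_u = [re.fullmatch(r"colou?r", w) is not None
              for w in ["color", "colour", "colouur"]]
print(optional_u)  # [True, True, False]

# "\s" matches spaces, tabs and newlines without any flag.
whitespace = re.findall(r"\s", "a b\tc\nd")
print(whitespace)  # [' ', '\t', '\n']
```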

## Real-world application

Let's find out how many figures there are in the ART corpus (https://www.aber.ac.uk/en/cs/research/cb/projects/art/art-corpus/) and how many times they are referenced. 

### Loading and parsing all ART corpus papers
We first read a previously prepared pickle file containing the filename, ID and text of each sentence in the corpus.

In [None]:
import pickle

# 'all_sentences' is a list of tuples (filename, id, text), one per sentence.
with open("Datasets/art_dataset.pickle", "rb") as f:
    all_sentences = pickle.load(f)

print("Number of sentences loaded: ", len(all_sentences))

# Print a few sample sentences
for s in all_sentences[:3]:
    print("\nS: ", s)

### Defining a regular expression

Now, we are interested in finding out where the authors reference figures in the papers. Depending on their writing style, some authors might use "Figure 1", some "Fig. 1" and others "Fig 1" (without the dot). We should check and account for each of these.

Also, figures sometimes have subfigures (e.g. Fig 1.A or 1.B), so we need to match these too.

In [None]:
# Use a raw string (r"...") for the pattern; otherwise Python treats the backslash as an escape sequence in string literals
# The dot is escaped (\.?) so that it matches only an optional literal "." rather than any character
pattern = r"Fig(ure)?\.?\s+([0-9A-B](\.[A-Za-z0-9])*)"

print(re.match(pattern, "Fig. 1"))
print(re.match(pattern, "Fig 1"))
print(re.match(pattern, "Figure 1"))

# re.match() checks for a match only at the beginning of the string
print("\nMatch at the start of the string:", re.match(pattern, "Fig 1.A"))

print("")

# re.fullmatch() requires the entire string to match
print("Full match:")
print(re.fullmatch(pattern, "Fig 1.A"))

The parentheses around 'ure', followed by `?`, mean that the 'ure' in 'Figure' is optional (the author might just write "Fig"). Parentheses let you define "groups", which capture parts of the match and delimit sub-patterns. 

We also put parentheses around the portion that describes the figure number, so that we can extract it separately later. 
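
To make the role of the groups concrete, here is a small sketch that inspects each group of a match. It writes the figure pattern with the dot escaped (`\.?`) so that it only matches a literal full stop. The name `fig_pattern` is used only in this example.

```python
import re

# Figure pattern with an escaped dot; `fig_pattern` is a name used only in this sketch.
fig_pattern = r"Fig(ure)?\.?\s+([0-9A-B](\.[A-Za-z0-9])*)"

m = re.match(fig_pattern, "Figure 2.C")
print(m.group(0))  # whole match: 'Figure 2.C'
print(m.group(1))  # optional suffix: 'ure'
print(m.group(2))  # figure identifier: '2.C'
print(m.group(3))  # last repetition of the subfigure group: '.C'

# A group that did not take part in the match yields None.
short_groups = re.match(fig_pattern, "Fig. 1").groups()
print(short_groups)  # (None, '1', None)
```

Group 2 is the one that carries the figure identifier, which is why it is the group used when counting references.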

Now we will perform a quick check of our current regular expression:

In [None]:
tests = ["Fig. 1", "Fig 1", "Figure 1", "Fig. 1.A", "Figure 2.C", "Figure. 3"]

for t in tests:
    m = re.match(pattern,t)
    
    if not m:
        print("Test failed for ", t)
    else:
        # group() lets you extract the groups denoted by `()` in your expression
        # Group 0 is always the substring matched by the whole expression
        print("Matched span: '{}'".format(m.group(0)))
        print("--------------------")

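The `re.match` and `re.search` functions both return a `Match` object, or `None` if the pattern did not match. Unlike `re.match`, `re.search` looks for a match anywhere in the string, not only at the start. 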
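
The difference between the two functions matters for sentences in which the figure reference does not come first. The following sketch uses a made-up sentence and a pattern named `fig_pattern` (both are assumptions for illustration) and shows why the result should be checked against `None` before it is used.

```python
import re

fig_pattern = r"Fig(ure)?\.?\s+([0-9A-B](\.[A-Za-z0-9])*)"
sentence = "The results are shown in Fig. 2.A and discussed below."  # made-up sentence

# re.match only tries the start of the string, so it finds nothing here.
at_start = re.match(fig_pattern, sentence)
print(at_start)  # None

# re.search scans the string and returns the first match found anywhere in it.
found = re.search(fig_pattern, sentence)
if found is not None:
    print(found.group(0))  # Fig. 2.A
    print(found.span())    # (start, end) offsets of the match in the sentence
```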

Now let's find out how many times figures are brought up in papers in the ART corpus.

In [None]:
from collections import Counter

# A Counter is like a dictionary that maps each key to a count
# Example of Counter usage:
# Counter(['blue', 'red', 'blue', 'yellow', 'blue', 'red'])
# gives Counter({'blue': 3, 'red': 2, 'yellow': 1})

# Create a dictionary of Counters
# We want figs to be: {filename: Counter({fig_id: number_of_mentions_in_filename})}
figs = {filename: Counter() for filename, _, _ in all_sentences}

# The dot is escaped (\.?) so that it matches only an optional literal "."
pattern = r"Fig(ure)?\.?\s+([0-9A-B](\.[A-Za-z0-9])*)"

# Return the list of figures mentioned in "sentence"
def match_sent(sentence):
    filename, sentence_id, sentence_text = sentence
    sfigs = []  # matched figure ids in "sentence"

    # Because the pattern contains several groups, findall() returns a list of tuples,
    # one per match, holding the text captured by each group
    # The string is scanned left-to-right, and matches are returned in the order found
    for m in re.findall(pattern, sentence_text):
        sfigs.append(m[1])  # second group: the figure id

    return filename, sentence_id, sfigs


# map() lazily applies match_sent to every item of all_sentences and yields the results
for filename, sentence_id, sentencefigs in map(match_sent, all_sentences):
    # For each filename, count every figure mentioned in the sentence
    figs[filename].update(sentencefigs)
    

for file in figs:
    print("\nFile ", file)
    print("References to figures...: ")
    print(figs[file])

So, now we know which figures are referenced in which papers, and we can find out which paper references the largest number of different figures and which one references figures most often.
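
The counting step can be checked on a tiny made-up dataset before trusting it on the whole corpus. The sketch below is only an illustration: the filenames, sentences and the names `toy_sentences`, `toy_figs` and `fig_pattern` are assumptions, not part of the ART corpus.

```python
import re
from collections import Counter

fig_pattern = r"Fig(ure)?\.?\s+([0-9A-B](\.[A-Za-z0-9])*)"

# Made-up (filename, sentence_id, text) tuples standing in for the corpus.
toy_sentences = [
    ("paper_a.xml", 1, "Fig. 1 shows the setup, and Figure 2.B the results."),
    ("paper_a.xml", 2, "As seen in Fig 1, the signal is noisy."),
    ("paper_b.xml", 1, "No figures are mentioned here."),
]

toy_figs = {}
for filename, _, text in toy_sentences:
    # With several groups in the pattern, findall returns one tuple per match;
    # index 1 holds the figure identifier.
    ids = [groups[1] for groups in re.findall(fig_pattern, text)]
    toy_figs.setdefault(filename, Counter()).update(ids)

print(toy_figs["paper_a.xml"].most_common())  # [('1', 2), ('2.B', 1)]
print(toy_figs["paper_b.xml"])                # Counter()
```

Here `sum(counter.values())` gives the total number of references in a paper (3 for `paper_a.xml`) and `len(counter)` gives the number of distinct figures (2), the same two quantities used for ranking the papers.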


In [None]:
# List of tuples: [(file_name, number of figure mentions in 'file_name')]
sorted_figs_by_refcount = [(x, sum(figs[x].values())) for x in 
                               sorted(figs, key=lambda x: sum(figs[x].values()), reverse=True)] 

# List of tuples: [(file_name, number of different figure ids in 'file_name')]
sorted_figs_by_variety = [(x, len(figs[x])) for x in 
                           sorted(figs, key=lambda x: len(figs[x]), reverse=True)] 

print("Top 5 papers by number of references to figures (frequency)")
for paper,count in sorted_figs_by_refcount[0:5]:
    print("Title: {} Count: {}".format(paper,count))
print("\n\n")
print("Top 5 papers by number of different figures in paper (variety)")
for paper,count in sorted_figs_by_variety[0:5]:
    print("Title: {} Count: {}".format(paper,count))



## Conclusion

We have used regular expressions to parse semi-structured data inside the ART Corpus and determine which of the papers have the most diverse and most frequent references to figures.

For more functions supported by the `re` module, check the Python documentation or this tutorial: [W3Schools](https://www.w3schools.com/python/python_regex.asp)