<div class="alert alert-block alert-info"><br>
    <h2>Week7_Part 2. Regular Expressions (Regexes)</h2> <br>
    <p>A regular expression (Regex) is a special text string that combines letters, digits and special characters for describing a search pattern. In programming, regexes are useful to validate user input, e.g. to check whether a user submitted their email or phone number in the right format. In research, regexes can help us to find specific items in the ocean of unstructured data.</p><br>
    <p>We'll need to use a built-in Python module <b>'re'</b> to match information in text files by Regex patterns.</p><br>
    <p>Check the list of references at the bottom of this file. It includes case studies that used regexes to find data crucial for answering research questions or advancing data analysis.</p>
</div>
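
To make the input-validation idea concrete, here is a minimal sketch. The two patterns and the function names are illustrative assumptions, kept deliberately simple; real-world email and phone validation is usually stricter.

```python
import re

# Hypothetical, deliberately simple patterns for illustration only.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(\.[\w-]+)+")
PHONE_PATTERN = re.compile(r"\d{3}-\d{3}-\d{4}")


def looks_like_email(text):
    """Return True if the whole string has the shape name@domain.tld."""
    return EMAIL_PATTERN.fullmatch(text) is not None


def looks_like_phone(text):
    """Return True if the whole string has the shape 123-456-7890."""
    return PHONE_PATTERN.fullmatch(text) is not None


print(looks_like_email("blake@example.org"))  # True
print(looks_like_email("blake.example.org"))  # False (no @ sign)
print(looks_like_phone("555-010-1234"))       # True
```

`re.fullmatch` succeeds only when the pattern covers the entire string, which is usually what you want for validation; `re.findall`, used in the rest of this notebook, instead collects every match found anywhere in a longer text.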

<div class="alert alert-block alert-danger">
    <h3>Regex matches in one text file</h3>
</div>

In [1]:
# Open and read the 'Tyger.txt' file, which should be in your working directory.

with open('Tyger.txt', 'r') as file:  # the file is closed automatically after this block
    poem = file.read()

print(poem)

# We won't clean this poem this time. 
# We won't be looking for word counts, but for patterns, including those that combine function and content words.

In [2]:
# We can use regular expressions to match and extract the recurring textual items, e.g. phrases, or collocations.
# Import the re module which is equipped with regex matching tools.

import re

# The 'r' prefix makes the pattern a raw string, so backslashes (as in \b or \w) reach the regex engine unchanged
# The basic pattern may be a specific full word or phrase as below

phrase1 = re.findall(r"burning bright", poem)

print(phrase1)

['burning bright', 'burning bright']
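
The effect of the `r` prefix is easiest to see side by side. The sketch below uses a made-up sample string rather than the poem file, and compares the same pattern written with and without the prefix.

```python
import re

sample = "Tyger Tyger, burning bright"

# Without the r prefix, Python turns "\b" into a single backspace character.
print(len("\b"), len(r"\b"))  # 1 2

plain = re.findall("\bburning\b", sample)  # searches for backspace + 'burning' + backspace
raw = re.findall(r"\bburning\b", sample)   # searches for 'burning' as a whole word

print(plain)  # []
print(raw)    # ['burning']
```

Writing every regex pattern as a raw string is a safe habit, even when the pattern contains no backslashes yet.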


In [3]:
# You might have noticed with the naked eye that 'what' is rather repetitive in The Tyger.
# Let's match the instances of this word and any other word that follows it: 'what' + any word

# \b marks a word boundary, i.e. the edge of the word(s) we want to match.
# Without \b on both sides of 'what', our regex could also match inside 'whatever', 'whatsoever' or 'somewhat'.
# \w+ matches one or more word characters (letters, digits, underscore), i.e. the word that follows 'what'

phrase2 = re.findall(r"\bwhat\b \w+", poem)

# Note that regex syntax in Python differs slightly from the sandbox demoed in the video
# A space in the pattern matches a literal space, so put one between the sub-patterns for consecutive words

phrase2 # returns a list of strings

['what distant',
 'what wings',
 'what shoulder',
 'what art',
 'what dread',
 'what the',
 'what furnace',
 'what dread']
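
The role of `\b` can be checked on a short made-up line (an assumption for illustration, not a quotation from the poem). The sketch also shows that matching is case-sensitive unless the `re.IGNORECASE` flag is passed.

```python
import re

sample = "whatever what dread hand, What dread feet somewhat late"

no_boundary = re.findall(r"what \w+", sample)
with_boundary = re.findall(r"\bwhat\b \w+", sample)
ignore_case = re.findall(r"\bwhat\b \w+", sample, flags=re.IGNORECASE)

print(no_boundary)    # ['what dread', 'what late']  ('what late' comes from 'somewhat late')
print(with_boundary)  # ['what dread']
print(ignore_case)    # ['what dread', 'What dread']
```

If capitalised line-initial words matter for your analysis, the `re.IGNORECASE` flag is one way to include them.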

In [4]:
# Find a match for 'what + word + word'

phrase3 = re.findall(r"\bwhat\b \w+ \w+", poem)

phrase3 

['what distant deeps',
 'what wings dare',
 'what dread feet',
 'what the chain',
 'what furnace was',
 'what dread grasp']
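
Repeating `\w+` by hand becomes tedious for longer phrases. As a sketch of one alternative, a quantifier such as `{2}` can repeat a non-capturing group `(?: \w+)`; the sample string below is made up for illustration.

```python
import re

sample = "what dread hand and what dread feet"

spelled_out = re.findall(r"\bwhat\b \w+ \w+", sample)
with_quantifier = re.findall(r"\bwhat\b(?: \w+){2}", sample)  # 'what' + exactly two more words

print(spelled_out)      # ['what dread hand', 'what dread feet']
print(with_quantifier)  # ['what dread hand', 'what dread feet']
```

Changing `{2}` to `{3}` would extend the phrase by one more word without rewriting the rest of the pattern.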

In [5]:
# Searching for patterns built around prepositions can be rewarding in the analysis of poetry.
# Patterns of function words are also important in forensic linguistics!
# Let's explore which lexical words the preposition 'of' attracts in Blake's poem.

phrase4 = re.findall(r"\w+ \bof\b \w+", poem)

# As you learn, say out loud what the regex above matches.

phrase4                

['forests of the', 'fire of thine', 'sinews of thy', 'forests of the']
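
If only the words around 'of' are of interest, parentheses can capture them separately. This is a minimal sketch on a made-up sample line; when a pattern contains several capturing groups, `re.findall` returns a list of tuples instead of whole matches.

```python
import re

sample = "In the forests of the night, burnt the fire of thine eyes"

# Each pair of parentheses captures one word; 'of' itself is not captured.
pairs = re.findall(r"(\w+) \bof\b (\w+)", sample)
left_words = [left for left, right in pairs]

print(pairs)       # [('forests', 'the'), ('fire', 'thine')]
print(left_words)  # ['forests', 'fire']
```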

In [6]:
# We can combine several regex patterns to find words that share a rhyme.
# Separate the patterns with the pipe symbol | to say 'find either this or that'
# If you know which rhyme/ending you want to fetch, use \w+ for the start of the word plus the ending, e.g. \w+ight
# Note: the \b below belongs only to the second alternative, so \w+ight could also match the start of a longer word

phrase5 = re.findall(r"\w+ight|\w+ire\b", poem)


phrase5

['bright', 'night', 'fire', 'aspire', 'fire', 'bright', 'night']
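
In the pattern above, `\b` is attached only to the `\w+ire` alternative. One possible way to anchor both endings is to wrap them in a non-capturing group `(?:...)`. The sketch below uses a made-up sentence to compare the variants, including the pitfall of using a plain capturing group with `re.findall`.

```python
import re

sample = "The bright fire burned brightly all night"

original = re.findall(r"\w+ight|\w+ire\b", sample)
grouped = re.findall(r"\b\w+(?:ight|ire)\b", sample)   # whole words only
capturing = re.findall(r"\b\w+(ight|ire)\b", sample)   # pitfall: returns only the group

print(original)   # ['bright', 'fire', 'bright', 'night']  (second 'bright' is from 'brightly')
print(grouped)    # ['bright', 'fire', 'night']
print(capturing)  # ['ight', 'ire', 'ight']
```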

In [7]:
# Finally, write the extracted data to an external file with open().
# Pass the name of the new file and the mode 'w' ('write') to the open() function.

with open('TygerPhrases.txt', 'w') as file:
    file.write(str(phrase3)) # phrase3 is a list, so it is converted to a string before being saved in a .txt file

# The with block closes the file automatically.
# Check the newly written file in your Jupyter Notebook directory.
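
Saving `str(phrase3)` stores the list with its brackets and quotes. If you would rather have one phrase per line, a sketch like the following may help. The list is a stand-in for `phrase3`, and the file name and temporary folder are assumptions chosen so the example does not overwrite any real file.

```python
import tempfile
from pathlib import Path

# Stand-in for the phrase3 list produced earlier in the notebook.
phrases = ["what distant deeps", "what wings dare", "what dread feet"]

# A temporary folder keeps this sketch from touching real files;
# in the notebook you could pass a plain file name to open() instead.
with tempfile.TemporaryDirectory() as folder:
    out_path = Path(folder) / "TygerPhrasesLines.txt"  # hypothetical file name

    with open(out_path, "w", encoding="utf-8") as out_file:
        out_file.write("\n".join(phrases) + "\n")  # one phrase per line

    with open(out_path, "r", encoding="utf-8") as in_file:
        restored = in_file.read().splitlines()  # read the phrases back as a list

print(restored)  # ['what distant deeps', 'what wings dare', 'what dread feet']
```

A line-per-item file is easier to reopen later, since `splitlines()` turns it straight back into a list.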

<div class="alert alert-block alert-danger">
    <h3>Regex matches in multiple text files</h3>
</div>

In [8]:
# Create a list with multiple .txt files

filenames = ["Tyger.txt", "TheAngel.txt", "EcchoingGreen.txt"]

filenames

['Tyger.txt', 'TheAngel.txt', 'EcchoingGreen.txt']

In [9]:
# Loop through each text file in 'filenames' to 1) open it, 2) read it and 3) match the rhyme pattern as in [6]
# Store the matches in the list 'patterns', which starts out empty

patterns = []

for filename in filenames:
    with open(filename, "r") as text_file: # the file is closed automatically after this block
        text = text_file.read()
    matches = re.findall(r"\w+ight|\w+ire\b", text) # the re module has already been imported
    patterns.append(matches) # append this file's list of matches


print(patterns)

# The patterns variable holds a list containing 3 nested lists, one per file.
# The last nested list is empty because no words matching the pattern were found in that file.

[['bright', 'night', 'fire', 'aspire', 'fire', 'bright', 'night'], ['night', 'night', 'delight'], []]
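
With a list of lists, you have to remember which position belongs to which file. As a sketch of one alternative, a dictionary can map each file name to its matches. The file names and text snippets below are made up and written to a temporary folder so the example runs on its own.

```python
import re
import tempfile
from pathlib import Path

# Made-up snippets standing in for the poem files.
samples = {
    "first.txt": "burning bright in the night",
    "second.txt": "my heart's delight",
    "third.txt": "on the green",
}
rhyme = re.compile(r"\w+ight|\w+ire\b")  # same pattern as in the loop above

matches_by_file = {}
with tempfile.TemporaryDirectory() as folder:
    # Create the sample files first.
    for name, text in samples.items():
        Path(folder, name).write_text(text, encoding="utf-8")

    # Then read each file and store its matches under the file name.
    for name in samples:
        with open(Path(folder, name), "r", encoding="utf-8") as handle:
            matches_by_file[name] = rhyme.findall(handle.read())

for name, found in matches_by_file.items():
    print(name, len(found), found)
```

Looking up `matches_by_file["second.txt"]` then returns that file's matches directly, without counting list positions.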


<div style="background-color:#ccccff">
    <h3>Use of Regex in Research:</h3>  

<ul> 
<li><a href="https://historicaltexts.jisc.ac.uk/help">The guidelines for searching information, including the use of regexes, on the Historical Texts platform</a></li>
<li><a href="https://essay.utwente.nl/73817/1/chenet_MA_EEMCS.pdf">Paper on using regex in bibliographical research</a></li>
<li><a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2656039/">Paper on using regex in medical data</a></li>
<li><a href="http://ceur-ws.org/Vol-1267/LD4IE2014_Petrovski.pdf">Paper on using regex in e-commerce data</a></li>  
</ul>
<h3>Regex Tutorials:</h3><br>
<ul>
<li><a href="https://www.regular-expressions.info/quickstart.html">Quick Start on Regex</a></li>
<li><a href="https://regexr.com/">Sandbox for Learning and Building with Regex</a></li>
<li><a href="http://www.themacroscope.org/?page_id=643">Big Data in History and Regex</a></li>  
<li><a href="https://programminghistorian.org/en/lessons/understanding-regular-expressions">Understanding Regular Expressions on The Programming Historian</a></li> 
</ul>

</div>