<div class="alert alert-block alert-info">
Author:<br>Felix Gonzalez, P.E. <br> Adjunct Instructor, <br> Division of Professional Studies <br> Computer Science and Electrical Engineering <br> University of Maryland Baltimore County <br> fgonzale@umbc.edu
</div>

Many data science tasks require working with text (strings). Text data can be structured or unstructured. Examples include:
- File names (which may come in various formats)
- Text stored in various data collections (e.g., variables, lists, tuples, sets, dictionaries, dataframes, tables, etc.)
- Magic methods, whose names have two underscores before and after (e.g., `__contains__`)
- Parsing/extracting text data from webpages, documents, reports, and other files that include text data.
- Natural language processing (NLP) models

There are many ways of working with text data, and many reasons to do so. Previous notebooks showed various methods of working with string data in lists. This notebook discusses other functions for working with text data as well as Regular Expressions (RegEx).

RegEx are text-matching patterns described with a formal syntax. They can include a variety of rules and are used for tasks such as finding repetition, matching text, crawling data, and handling file names. As you advance in Python you'll see that many parsing problems can be solved with regular expressions.

Documentation References:

- https://docs.python.org/3/library/stdtypes.html#string-methods
- https://docs.python.org/3/library/string.html
- https://docs.python.org/3/library/re.html

# Table of Contents
[Strings and String Operations](#Strings-and-String-Operations)

[Finding String Patterns: Common Functions](#Finding-String-Patterns:-Common-Functions)

[Regular Rexpressions (RegEx or Re)](#Regular-Rexpressions-(RegEx-or-Re))

[Re: Construction Basics](#Re:-Construction-Basics)

[Re: Functions](#Re:-Functions)

[Re: Character Classes](#Re:-Character-Classes)

[Re: Creating Objects](#Re:-Creating-Objects)

[Re: Grouping - Use of "()"](#Re:-Grouping---Use-of-"()")

[Re: Matching Alternatives](#Re:-Matching-Alternatives)

[Re: Matching zero or one - Use of "?" symbol](#Re:-Matching-zero-or-one---Use-of-"?"-symbol)

[Matching zero or more - Use of "*" symbol](#Matching-zero-or-more---Use-of-"*"-symbol)

[Re: Match One or More](#Re:-Match-One-or-More)

[Re: Matching a Fixed Amount of Repeats - use of "{}"](#Re:-Matching-a-Fixed-Amount-of-Repeats---use-of-"{}")

[Re: Matching Start/End - use of Caret and dollar symbols](#Re:-Matching-Start/End---use-of-Caret-and-dollar-symbols)

[Re: Flags: Case Insensitive](#Re:-Flags:-Case-Insensitive)

[Re: Split Function](#Re:-Split-Function)

[Re: Complex String](#Re:-Complex-String)

# Strings and String Operations
[Return to Table of Contents](#Table-of-Contents)

There are numerous string methods and functions for working with strings. The ones used most in data science include:
- counting characters or words (e.g., the `len` function and the `count` method)
- formatting (e.g., the `lower`, `upper`, and `format` methods)
- removing or replacing strings or substrings (e.g., the `strip`, `rstrip`, `lstrip`, and `replace` methods)
- splitting and joining (e.g., the `split` and `join` methods)
- finding string patterns (e.g., the `in` and `not in` operators and the `endswith` and `find` methods)

Other techniques include RegEx and functions found in other libraries, such as the Pandas `str.contains` method. Triple quotes allow a string to span multiple lines.

* A Python raw string is created by prefixing a string literal with ‘r’ or ‘R’. 
* A raw string treats the backslash (\\) as a literal character. 
* This is useful when a string contains backslashes that should not be treated as escape characters.

Documentation Reference:
- https://docs.python.org/3/library/stdtypes.html#string-methods
- https://docs.python.org/3/library/string.html
- [More on join method](https://www.w3schools.com/python/ref_string_join.asp)
- [Follow Chapter 6 from Automate-the-boring-stuff-with-python](https://automatetheboringstuff.com/)
- [W3-School String Methods](https://www.w3schools.com/python/python_ref_string.asp)
- [AnalyticsVidhya Magic Methods](https://www.analyticsvidhya.com/blog/2021/08/explore-the-magic-methods-in-python/)

#### Output String Formatting

Python allows fancier formatting of output (https://docs.python.org/3/tutorial/inputoutput.html).

In [1]:
# format() can be used with numbers.
# {:-10} pads the number to a minimum width of 10 characters (numbers are right-aligned).
# The "-" only asks for a sign on negative numbers, which is the default behavior.
# {:2.2%} multiplies the value by 100, shows two decimal places, and appends a % sign.
# The braces are filled with the format() arguments in the same order.
'{}, {:-10} YES votes  {:2.2%}'.format(123, 789, 0.135234)
# This also works with variable names in place of the numbers passed to format().

'123,        789 YES votes  13.52%'
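The same format specifications can also be used inside f-strings (available since Python 3.6). A minimal sketch, assuming hypothetical variable names `precinct`, `yes_votes`, and `share` that are not part of the original example:

```python
# Hypothetical variable names chosen for illustration only.
precinct = 123
yes_votes = 789
share = 0.135234

# The text after the colon is the same format spec accepted by str.format().
summary = f'{precinct}, {yes_votes:10} YES votes  {share:.2%}'
print(summary)
```

Whether to use `format()` or an f-string is largely a matter of style; f-strings keep the variable next to its format spec.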

In [4]:
my_percentage = 0.135234
print('This is my percentage {:2.1%}'.format(my_percentage))
# Formatting does not affect the variable's value.
print(my_percentage)

This is my percentage 13.5%
0.135234


In [5]:
# Works also with strings.
'We are {} at "{}!"'.format('students', 'UMBC')

'We are students at "UMBC!"'

In [6]:
# We can assign the formatted string to a variable.
my_sentence = 'We are {} at "{}!"'.format('students', 'UMBC')
# Checking the type
type(my_sentence)

str

In [7]:
# Recall that we can also change the data type with the str() function.
my_percentage_as_string = str(my_percentage)
type(my_percentage_as_string)

str

#### Single and Double Quotes

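Recall that in Python a string can be delimited with either single or double quotes. If the string itself contains single quotes, delimit it with double quotes, and vice versa (or escape the inner quote with a backslash). The escape sequence <code>\n</code> behaves in a related way: it is shown as written when the variable is displayed, and it becomes a line break when the string is passed to the print function.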

In [8]:
# We have the following string in the variable sentence1:
sentence1 = '''University of 
Maryland Baltimore

County\nUniversities at Shady Grove      '''
sentence1 # Note that the \n appears when displaying the variable.

'University of \nMaryland Baltimore\n\nCounty\nUniversities at Shady Grove      '

In [9]:
print(sentence1)

University of 
Maryland Baltimore

County
Universities at Shady Grove      


#### Raw String Literals

Raw string literals make it easier to include backslashes as literal characters. In a normal string literal the backslash starts an escape sequence; in a raw string it is kept as written.

In [10]:
# Let's redefine the sentence as a raw string.
sentence2 = r'''University of 
Maryland Baltimore

County\nUniversities at Shady Grove      '''
print(sentence2) # Note that the typed \n is printed literally, while the real line breaks inside the triple quotes are kept.

University of 
Maryland Baltimore

County\nUniversities at Shady Grove      


#### String Operations

In [11]:
# Let's redefine the sentence without the multiple spaces.
sentence = 'University of Maryland Baltimore County\nUniversities at Shady Grove      '
sentence

'University of Maryland Baltimore County\nUniversities at Shady Grove      '

In [12]:
len(sentence) # Count number of characters 

73

In [13]:
print(sentence) # Note that the backslash-n sequence (\n) produces a line break when printed.

University of Maryland Baltimore County
Universities at Shady Grove      


In [14]:
sentence = sentence.replace('\n', ' ') # Replaces the \n with a space
sentence

'University of Maryland Baltimore County Universities at Shady Grove      '

In [15]:
sentence.lower() # Makes every character lower case

'university of maryland baltimore county universities at shady grove      '

In [16]:
sentence.upper() # Makes every character upper case

'UNIVERSITY OF MARYLAND BALTIMORE COUNTY UNIVERSITIES AT SHADY GROVE      '

In [17]:
# Removes leading and trailing spaces in a string.
sentence = sentence.strip()
sentence

'University of Maryland Baltimore County Universities at Shady Grove'

In [18]:
len(sentence) # Count number of characters after stripping the spaces at the end.
# The number of characters went from 73 to 67.

67

In [19]:
sentence.lower().count('UNIversity'.lower()) # Counts occurrences of the word university.
# Note that lower case is applied to both strings before counting.
# In natural language processing we will learn about lemmatization and stemming.
# These techniques reduce a word to a simpler form such as its base or root.
# With stemming or lemmatization, the words university and universities
# could be counted together because they share the same root.

1

In [20]:
sentence[5::2] # Starting at index 5, select every other character.

'riyo ayadBlioeCut nvriisa hd rv'

In [21]:
sentence_list_of_words = sentence.split(' ') # Splits the sentence using the space separator
sentence_list_of_words

['University',
 'of',
 'Maryland',
 'Baltimore',
 'County',
 'Universities',
 'at',
 'Shady',
 'Grove']

In [22]:
len(sentence_list_of_words) # Number of words in sentence

9

In [23]:
' '.join(sentence_list_of_words) # Joins elements of list and adds a space.

'University of Maryland Baltimore County Universities at Shady Grove'

# Finding String Patterns: Common Functions
[Return to Table of Contents](#Table-of-Contents)

Many of the functions used to find string patterns return a boolean (True/False) that indicates whether a substring occurs within another string.

In [24]:
sentence

'University of Maryland Baltimore County Universities at Shady Grove'

In [25]:
sentence.__contains__('Maryland') # Magic method that checks if a substring is in the string.
# We will later use the Pandas str.contains() method.

True

In [26]:
'Maryland' in sentence # The in operator checks if the substring is in the string.

True

In [27]:
'Maryland' not in sentence # The not in operator checks if the substring is NOT in the string.

False

In [28]:
sentence.endswith('Grove') # Checks if the string ends with the specified substring.
# This function is super useful to filter for file extensions in a list of filenames.

True

In [31]:
sentence.find('County') # Returns the index of the first character of the first occurrence (-1 if not found).

33
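As a short sketch of the file-extension use case mentioned above, and of what `find` returns when nothing is found, assuming a hypothetical list of file names:

```python
# Hypothetical file names used only for illustration.
file_names = ['report.pdf', 'data.csv', 'notes.txt', 'results.csv']

# endswith() returns True/False, so it works well as a filter condition.
csv_files = [name for name in file_names if name.endswith('.csv')]
print(csv_files)

# endswith() also accepts a tuple of suffixes.
text_like = [name for name in file_names if name.endswith(('.csv', '.txt'))]
print(text_like)

# find() returns -1 when the substring is absent instead of raising an error.
position = 'University of Maryland'.find('County')
print(position)
```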

# Regular Rexpressions (RegEx or Re)
[Return to Table of Contents](#Table-of-Contents)

Regular Expressions (RegEx) are text-matching patterns described with a formal syntax. They can include a variety of rules and are used for tasks such as finding repetition, matching text, crawling data, and handling file names. As you advance in Python you'll see that many parsing problems can be solved with regular expressions. Base Python string functions can often do the same job, but RegEx patterns are often more compact and flexible.

Note that other string methods can be <b>FASTER</b> than RegEx functions. Benchmarking may help in choosing the best method. Various functions in other libraries also accept RegEx symbols, which we will explore as well.

Documentation References:
- [Re module - Documentation](https://docs.python.org/3/library/re.html)
- [RE Cheatsheet](https://www.dataquest.io/cheat-sheet/regular-expressions-cheat-sheet/)

In [32]:
import re # Importing the re library.

# Re: Construction Basics
[Return to Table of Contents](#Table-of-Contents)

When constructing a regular expression, the following rules generally apply:
* Letters and numbers match themselves
* Normally case sensitive
* Watch out for punctuation as most of it has special meanings.

#### Matching one of several alternatives
Square brackets mean that any of the listed characters will do <br>
* <code>[ab]</code> means either ”a” or ”b” <br>

You can also give a range:
* <code>[a-d]</code> means ”a” ”b” ”c” or ”d” <br>


Negation: caret means ”not” <br>
* <code>[^a-d]</code> # anything but a, b, c or d

#### Wild cards
* ”.” means ”any character”. For example <br>
To match file names like ”hw3.pdf” and ”hw5.txt”: <br>
<code> hw.\.... </code> <br>


* If you really mean ”.” you must use a backslash
* Once again:
  * Backslash is special in Python strings
  * It’s special again in REs  


* The dollar sign (<code>\$</code>) matches the end of the string or just before the newline at the end of the string, which is useful, for example, when working with certain file types. 

#### Zero or more copies
The asterisk (<code>*</code>) repeats the previous character 0 or more times
* ”<code>ca*t</code>” matches ”ct”, ”cat”, ”caat”, ”caaat” etc. <br>

The plus sign (<code>+</code>) repeats the previous character 1 or more times
* ”<code>ca+t</code>” matches ”cat”, ”caat” etc. but not ”ct”

#### Repeats
* Braces are a more detailed way to indicate repeats
* <code>A{1,3}</code> means at least one and no more than three A’s
* <code>A{4,4}</code> means exactly four A’s
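The construction rules above can be checked with a small sketch. It assumes a hypothetical helper named `fits` (not part of the `re` module) that wraps `re.fullmatch`, so a pattern only counts as matching when it covers the whole string:

```python
import re

# Hypothetical helper: True only if the whole string fits the pattern.
def fits(pattern, text):
    return re.fullmatch(pattern, text) is not None

in_range = fits(r'[a-d]', 'c')              # character inside the range
negated = fits(r'[^a-d]', 'c')              # negated set rejects it
wildcard = fits(r'hw.\....', 'hw3.pdf')     # wildcard, literal dot, three more characters
star_zero = fits(r'ca*t', 'ct')             # zero copies of "a" are allowed
plus_zero = fits(r'ca+t', 'ct')             # at least one "a" is required
too_many = fits(r'A{1,3}', 'AAAA')          # four A's exceed the upper limit

print(in_range, negated, wildcard, star_zero, plus_zero, too_many)
```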

# Re: Functions
[Return to Table of Contents](#Table-of-Contents)

#### Functions offered by the re module and by compiled pattern objects:
* <code>match()</code>–does it match at the beginning of my string? Returns None or a match object
* <code>search()</code>–does it match anywhere in my string? Returns None or a match object
* <code>findall()</code>–does it match anywhere in my string? Returns a list of strings (or an empty list)
  * Note that <code>findall()</code> does NOT return a Match object!

#### Methods offered by a Match object:
* <code>group()</code>–return the string that matched
  * <code>group()</code>–the whole match
  * <code>group(1)</code>–the substring matching the 1st parenthesized sub-pattern
  * <code>group(1,3)</code>–tuple of substrings matching the 1st and 3rd parenthesized sub-patterns
* <code>start()</code>–return the starting position of the match
* <code>end()</code>–return the ending position of the match
* <code>span()</code>–return (start,end) as a tuple  
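A minimal sketch of how these calls differ, assuming a hypothetical sample string and pattern:

```python
import re

text = 'hw3.py and hw5.py'   # hypothetical sample string
rule = re.compile(r'hw\d\.py')

at_start = rule.match(text)            # matches only at position 0
nowhere = rule.match('see ' + text)    # None: the string starts with "see"
anywhere = rule.search('see ' + text)  # first match anywhere in the string
every = rule.findall(text)             # list of strings, not Match objects

print(at_start.group(), at_start.span())
print(nowhere)
print(anywhere.start(), anywhere.end())
print(every)
```

Because `match()` and `search()` can return None, it is usually safer to test the result before calling `group()` on it.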

# Re: Character Classes
[Return to Table of Contents](#Table-of-Contents)

![image.png](attachment:image.png)
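Python's `re` module provides shorthand character classes such as `\d` (digit), `\w` (word character), and `\s` (whitespace), with uppercase versions for the opposite. A brief sketch, assuming a hypothetical sample string:

```python
import re

sample = 'Room 204, Bldg_7'  # hypothetical sample string

digits = re.findall(r'\d', sample)       # \d: any decimal digit
non_digits = re.findall(r'\D+', sample)  # \D: anything that is not a digit
words = re.findall(r'\w+', sample)       # \w: letters, digits, and underscore
spaces = re.findall(r'\s', sample)       # \s: whitespace characters

print(digits, non_digits, words, spaces)
```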

In [33]:
# EXAMPLE: Does this string contain a legal Python filename? 

mystring = 'This contains two files, hw3.py and uppercase.py.'
# The .compile() function, compiles a regular expression pattern. Here defined as "myrule".
myrule = re.compile(r".+\.py") # Recall ”.” means any character, "+" repeats the previous character 1 or more times
mymatch = myrule.search(mystring)
mymatch.group()

'This contains two files, hw3.py and uppercase.py'

Not what we expected! Why? <br>
* Our RE matches ”hw3.py”
* Unfortunately it also matches ”This contains two files, hw3.py”
* And it even matches ”This contains two files, hw3.py and uppercase.py”
* The <code>+</code> quantifier is greedy, so starting from the leftmost possible position Python returns the longest match
* We could break the string into words first
* Or we could specify that no spaces are allowed in the match
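The first option, breaking the string into words before matching, might look like the sketch below. The list name `file_names` is a hypothetical choice for illustration:

```python
import re

mystring = 'This contains two files, hw3.py and uppercase.py.'
myrule = re.compile(r'.+\.py')

# Split on whitespace first, then search each word separately.
file_names = []
for word in mystring.split():
    found = myrule.search(word)
    if found:
        file_names.append(found.group())

print(file_names)
```

Since each word contains no spaces, the greedy `.+` can no longer run across several words.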

In [34]:
mystring = "This contains two files, hw3.py and uppercase.py."
# Recall that [^a-d] means anything but a, b, c or d.
myrule = re.compile(r"[^ ]+\.py") # In this case anything but space.
mymatch = myrule.search(mystring) # Search finds the first item.
print(mymatch)
mymatch.group()

<re.Match object; span=(25, 31), match='hw3.py'>


'hw3.py'

In [35]:
allmymatches = myrule.findall(mystring) # Returns a list of all elements with .py in this case.
allmymatches

['hw3.py', 'uppercase.py']

# Re: Creating Objects
[Return to Table of Contents](#Table-of-Contents)

Let's say we would like to find phone number(s) in a document or a website

In [36]:
# Compile a pattern to match a typical phone number, e.g. 410-455-1000
pattern = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d') # \d is a digit. The r prefix keeps the backslashes literal.

In [37]:
ASentence = 'This is my phone number: 410-455-1000: call me!'
## let's find the match
mo = pattern.search(ASentence)

Let's find the index of the first and last characters

In [38]:
mo # Gives all the data, including the span, and the match.

<re.Match object; span=(25, 37), match='410-455-1000'>

In [39]:
mo.start() # String character index at start of the match object.

25

In [40]:
mo.end() # Index just past the last character of the match (usable as a slice end).

37

In [41]:
ASentence[mo.start():mo.end()] # Selecting the string using the character location.

'410-455-1000'

In [42]:
ASentence[25:37] # Same as above but specifying manually.

'410-455-1000'
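When a text may contain more than one phone number, `re.finditer` (or the `finditer` method of a compiled pattern) yields one Match object per match, so the positions remain available. A sketch assuming a hypothetical sentence with two numbers:

```python
import re

pattern = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
# Hypothetical sentence with two phone numbers.
text = 'Office: 410-455-1000, fax: 410-455-1001.'

# finditer() yields one Match object per non-overlapping match.
found = [(m.group(), m.span()) for m in pattern.finditer(text)]
print(found)
```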

# Re: Grouping - Use of "()"
[Return to Table of Contents](#Table-of-Contents)

In [43]:
# Compile a pattern for 410-455-1000 so that we can separate the area code later on
pattern = re.compile(r'(\d\d\d)-\d\d\d-\d\d\d\d')

In [44]:
# Search for a match
BSentence = "This is my phone number: 410-455-1000: call me!"
mo = pattern.search(BSentence)
mo.group()

'410-455-1000'

In [45]:
## check group 0
mo.group(0)

'410-455-1000'

In [46]:
## check group 1
mo.group(1)

'410'

In [47]:
## Note that \( matches a literal ( instead of opening a group.
pattern = re.compile(r'\(\d\d\d\)-\(\d\d\d-\d\d\d\d\)')
mo = pattern.search('(410)-(455-1000)')
mo.group()

'(410)-(455-1000)'

In [48]:
mo.group(0)

'(410)-(455-1000)'

In [None]:
# The line below raises an IndexError because the escaped parentheses are literal characters, so the pattern has no group 1. Uncomment and run.
#mo.group(1)
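As a sketch of working with more than one group, the pattern below (an illustrative variation, not one of the patterns used earlier) captures the area code and the local number separately, and shows the error raised when a group does not exist:

```python
import re

# Two capturing groups: area code and local number.
pattern = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')
mo = pattern.search('This is my phone number: 410-455-1000: call me!')

area_code, local_number = mo.groups()  # tuple with one entry per group
print(area_code, local_number)

# Asking for a group that the pattern does not define raises IndexError.
try:
    mo.group(3)
    raised = False
except IndexError as error:
    raised = True
    print('IndexError:', error)
```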

# Re: Matching Alternatives
[Return to Table of Contents](#Table-of-Contents)

<code>|</code>
* <code>A|B</code>, where A and B can be arbitrary REs, creates a regular expression that will match either A or B. 
* An arbitrary number of REs can be separated by the '|' in this way. 
* This can be used inside groups (see below) as well. 
* As the target string is scanned, REs separated by '|' are tried from left to right. When one pattern completely matches, that branch is accepted. <b> This means that once A matches, B will not be tested further, even if it would produce a longer overall match. </b>

In [50]:
# Compile a regex for matching MD or Maryland
pattern = re.compile(r'MD|Maryland')
# Search for the match
case1 = pattern.search("UMBC is in Maryland (MD)")
case1.group()

'Maryland'

In [51]:
case2 = pattern.search("UMBC is in MD, which is short for Maryland.")
case2.group()

'MD'

In [52]:
# Create a pattern that matches with 3 digit number or 4 digit number
pattern = re.compile(r'\d\d\d|\d\d\d\d')
mo = pattern.search('1521-312-212414')
mo.group()

'152'

In [53]:
# See the difference?
pattern = re.compile(r'\d\d\d\d|\d\d\d')
mo = pattern.search('1521-312-212414')
mo.group()

'1521'

__Matching Alternatives with Groups__

In [56]:
# Compile a regex for matching "jupyter " followed by notebooks, notebook, or lab
pattern = re.compile(r'jupyter (notebooks|notebook|lab)')
# Check for a match
case1 = pattern.search('We use jupyter notebooks jupyter notebook lab')
case1.group()

'jupyter notebooks'

In [57]:
case2 = pattern.search('We will have a jupyter lab. Where is my jupyter notebook?')
case2.group()

'jupyter lab'

In [63]:
# Find instances of jupyter notebook, jupyter notebooks or jupyter lab
# (because the pattern has one group, findall returns only the text captured by that group)
case3 = pattern.findall('We use jupyter notebooks jupyter lab notebook')
case3

['notebooks', 'lab']
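If the full matched text is wanted from `findall` rather than only the captured alternative, one option is a non-capturing group, written `(?:...)`. A minimal sketch using the same sentence:

```python
import re

text = 'We use jupyter notebooks jupyter lab notebook'

capturing = re.compile(r'jupyter (notebooks|notebook|lab)')
non_capturing = re.compile(r'jupyter (?:notebooks|notebook|lab)')

captured_only = capturing.findall(text)     # only the captured alternative
full_matches = non_capturing.findall(text)  # the full matched text

print(captured_only)
print(full_matches)
```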

# Re: Matching zero or one - Use of "?" symbol
[Return to Table of Contents](#Table-of-Contents)

Remember that <code>?</code> means zero or one repetition of the preceding item. If we write <code>X?Y</code>, the pattern matches both <code>XY</code> and <code>Y</code>. Let's make use of this.

In [None]:
# Find a phone number that might be entered as XYZ-ASD-ZXCF or XYZASDZXCF or XYZ ASD ZXCF
# What we need is as follows:
# Look for three digits, then an optional dash or space, so we insert (-| )? between the digit groups.
pattern = re.compile(r'\d\d\d(-| )?\d\d\d(-| )?\d\d\d\d')
case1 = pattern.search('First, we write in this format 4104551000')
case1.group()

In [None]:
case2 = pattern.search('Then, we write in this format 410-455-1000')
case2.group()

In [None]:
case3 = pattern.search('Then, we write in this format 410 455 1000')
case3.group()

Now let's compile a pattern that can match phone numbers with and without area codes, e.g. (410)-455-1000 or 455-1000. <br>


In [None]:
pattern = re.compile(r'(\(\d\d\d\)-)?(\d\d\d-\d\d\d\d)')

In [None]:
string1= '(410)-455-1000'
string2 = '455-1000'
string3 = '-455-1000'
string4= '410-455-1000'

mo = pattern.search(string1)
mo.group()

In [None]:
mo = pattern.search(string2)
mo.group()

In [None]:
mo = pattern.search(string3)
mo.group()

In [None]:
mo = pattern.search(string4)
mo.group()
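The four cases can also be compared side by side. The sketch below stores, for each string, the full match and the content of the optional first group; the names `samples` and `results` are hypothetical:

```python
import re

pattern = re.compile(r'(\(\d\d\d\)-)?(\d\d\d-\d\d\d\d)')
samples = ['(410)-455-1000', '455-1000', '-455-1000', '410-455-1000']

results = {}
for sample in samples:
    mo = pattern.search(sample)
    # group(1) is None when the optional area-code group did not take part.
    results[sample] = (mo.group(), mo.group(1))
    print(sample, '->', results[sample])
```

Note that the area code in the last string is skipped because the pattern only accepts an area code written in parentheses.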

In [None]:
# Note that ? allows at most one occurrence of the preceding character
pattern = re.compile(r'\d-?\d')

mo = pattern.search('1-2')
mo.group()

In [None]:
mo = pattern.search('1--2')
# The double dash prevents a match, so search() returns None and mo.group() raises an AttributeError.
# Uncomment and run the line below to see the error.
#mo.group()

In [None]:
mo = pattern.search('1-24')
mo.group()

# Matching zero or more - Use of "*" symbol
[Return to Table of Contents](#Table-of-Contents)

Remember, the asterisk (<code>*</code>) repeats the previous character 0 or more times
- ”<code>ca*t</code>” matches ”ct”, ”cat”, ”caat”, ”caaat” etc. <br>

The plus sign (<code>+</code>) repeats the previous character 1 or more times
- ”<code>ca+t</code>” matches ”cat”, ”caat” etc. but not ”ct”

In [None]:
# Define the string to match: 410-455-1000
string = '410-455-1000'

In [None]:
# Match pattern with the Asterisk.
pattern = re.compile(r'\d*-\d*-\d*')

In [None]:
mo = pattern.search(string)
mo.group()

In [None]:
# pattern.search(string) can also be written as re.search(pattern, string), e.g.
re.search(pattern, string).group()

<b>What might go wrong here? </b>

In [None]:
string2 = '41--10000000'
pattern = re.compile(r'(\d*)-(\d*)-(\d*)')
mo = pattern.search(string2)
mo.group()

Read the definition again: <b> 0 or more times </b>.
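Looking at the individual groups makes the problem visible: the middle group matched zero digits. A short sketch with the same pattern and string:

```python
import re

pattern = re.compile(r'(\d*)-(\d*)-(\d*)')
mo = pattern.search('41--10000000')

# The middle group matched zero digits, so it is an empty string.
parts = mo.groups()
print(parts)
```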

# Re: Match One or More
[Return to Table of Contents](#Table-of-Contents)

In [None]:
## Compile a pattern for matching 410-455-1000
pattern = re.compile(r'\d+-\d+-\d+')

In [None]:
string = '410-455-1000'

In [None]:
# Match the pattern
mo = pattern.search(string)
mo.group()

In [None]:
string2 = '41--10000000'
mo2 = pattern.search(string2) 

In [None]:
# This will fail because search() found no match and returned None. Uncomment and run.
#mo2.group()

In [None]:
# This displays nothing because mo2 is None (no match was found).
#mo2

What might go wrong?

In [None]:
# pattern.search(string2) can also be written as re.search(pattern, string2).
# Returns None because nothing in the string meets the conditions.
re.search(pattern, string2)

In [None]:
# This will fail because nothing was found, so there is no Match object to call group() on.
#re.search(pattern, string2).group()

In [None]:
# Another case: Try to find a case where this might not be an ideal pattern for phone number matching
string3 = '100000000-1020202020202-102020202020202'
re.search(pattern, string3)

Read the definition again: <b>one or more times. But we don't know how many more!</b> <br>

# Re: Matching a Fixed Amount of Repeats - use of "{}"
[Return to Table of Contents](#Table-of-Contents)

In [None]:
# Compile a pattern for matching 410-455-1000
pattern = re.compile(r'\d{3}-\d{3}-\d{4}') # Same as previously but specifying the number of digits.

In [None]:
string2 = '410-455-1000'

pattern = re.compile(r'(\d*)-(\d*)-(\d*)')
mo = pattern.search(string2)
mo.group()

In [None]:
string = '410-455-1000'
pattern = re.compile(r'\d{3}-\d{3}-\d{4}')
re.search(pattern, string)

In [None]:
string2 = '1201212-121212-121'
re.search(pattern, string2) # Returns nothing (None).

__Use of `{i,j}`__

In [None]:
# Suppose each street address in the United States starts with a number of 2 to 7 digits,
# e.g., 10117 West Street, Baltimore, MD or 13 Elm Street, Baltimore, MD

# Compile a pattern that catches the street number
pattern = re.compile(r'\d{2,7}')

In [None]:
umbc = "1000 Hilltop Circle Baltimore, MD 21250"
re.search(pattern, umbc)

In [None]:
location = '101 Independence Ave SE'

In [None]:
re.search(pattern, location)

# Re: Matching Start/End - use of Caret and dollar symbols
[Return to Table of Contents](#Table-of-Contents)

- Use the caret symbol (^) at the start of a regex to indicate that a match must occur at the beginning of the searched text. 
- Use a dollar sign ($) at the end of the regex to indicate the string must end with this regex pattern.

In [None]:
# Let's take a look at 'Hello world!'
beginsWithHello = re.compile(r'^Hello')
beginsWithHello.search('Hello world!')#.group()

In [None]:
# Note that there is no match in the following case
beginsWithHello.search('He said Hello.') is None

In [None]:
# Similarly let's use $ symbol
endsWithNumber = re.compile(r'\d$')
endsWithNumber.search('Your lucky number is 13') is None

In [None]:
endsWithNumber.search('Your lucky number is 13')
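Using both anchors together requires the whole string to fit the pattern, which is handy for validating an input. A sketch with a hypothetical pattern name `whole_phone`; `re.fullmatch` is shown as a closely related alternative:

```python
import re

whole_phone = re.compile(r'^\d{3}-\d{3}-\d{4}$')

exact = whole_phone.search('410-455-1000') is not None
embedded = whole_phone.search('Call 410-455-1000 now') is not None

# re.fullmatch() applies a similar whole-string requirement without the anchors.
full = re.fullmatch(r'\d{3}-\d{3}-\d{4}', '410-455-1000') is not None

print(exact, embedded, full)
```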

# Re: Flags: Case Insensitive
[Return to Table of Contents](#Table-of-Contents)

In [58]:
# Note that we can also use flags for case-insensitive matching (re.I means case-insensitive).
robocop = re.compile(r'robocop', flags = re.I)
robocop.search('RoboCop is part man, part machine, all cop.').group()

'RoboCop'
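The flag can also be passed directly to module-level functions such as `re.findall`. A small sketch with a hypothetical sample string, using `re.IGNORECASE`, the long name of `re.I`:

```python
import re

text = 'RoboCop, ROBOCOP and robocop'  # hypothetical sample string

# re.IGNORECASE is the long name of re.I.
with_flag = re.findall(r'robocop', text, flags=re.IGNORECASE)
without_flag = re.findall(r'robocop', text)

print(with_flag)
print(without_flag)
```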

# Re: Split Function
[Return to Table of Contents](#Table-of-Contents)

In [59]:
split_term = '/'

phrase = '11/29/2021'

# Split the phrase
re.split(split_term, phrase)

['11', '29', '2021']
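For a single fixed separator the string method `split` gives the same result. `re.split` becomes more useful when the separator varies. A sketch assuming hypothetical dates written with three different separators:

```python
import re

# Hypothetical dates written with different separators.
dates = ['11/29/2021', '11-29-2021', '11.29.2021']

# A character set lets one pattern split on any of the three separators.
parts = [re.split(r'[/.-]', date) for date in dates]
print(parts)
```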

# Re: Complex String
[Return to Table of Contents](#Table-of-Contents)

In this example we have a complex string with some letters and numbers. The letters and numbers follow a pattern and we want to extract the letter with its corresponding numbers as separate elements. 

In [60]:
string_list = ['H41RW6574_Pulley_Lc11_G743_W550_Mode1',
               'H435RW832_Curved_Lc12_G100_W575_Mode1',
               'H243RW85_PulleyCurved_Lc13_G432_W600_Mode1']

values_list = []

# Iterates through each of the elements in the string_list
for element in string_list:
    values = {} # Defines initial blank dictionary for each element.
    for string in element.split('_'): # Iterates through the split sub-elements.
        matches = re.findall(r'(H|RW|Lc|G|W)(\d+)', string) # Finds H, RW, Lc, G, or W and the digits that follow.
        # For each match, store the letters as the key and the number as the value.
        for match in matches:
            key, value = match
            values[key] = int(value)
    values_list.append(values) # Appends the completed dictionary to the list of dictionaries.

In [61]:
values_list

[{'H': 41, 'RW': 6574, 'Lc': 11, 'G': 743, 'W': 550},
 {'H': 435, 'RW': 832, 'Lc': 12, 'G': 100, 'W': 575},
 {'H': 243, 'RW': 85, 'Lc': 13, 'G': 432, 'W': 600}]
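A more compact variation, offered only as a sketch: for sample names like these, `findall` can scan the whole string without splitting on underscores first, and a dictionary comprehension builds the result. This assumes that no other part of the name (such as "Pulley" or "Mode1") contains one of the key letters directly followed by digits.

```python
import re

name = 'H41RW6574_Pulley_Lc11_G743_W550_Mode1'

# findall() returns (letters, digits) tuples when the pattern has two groups.
pairs = re.findall(r'(H|RW|Lc|G|W)(\d+)', name)
values = {key: int(value) for key, value in pairs}
print(values)
```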

# NOTEBOOK END