
# String operations

This section covers some of Python's built-in string methods and formatting operations, then gives a quick guide to the extremely useful subject of regular expressions. Such string manipulation patterns come up often in data science work, and they are one big perk of Python in this context.

Strings in Python can be defined using either single or double quotes (they are functionally equivalent).

In [1]:
x = 'a string'
y = "a string"
x == y

True

In addition, it is possible to define multi-line strings using a triple-quote syntax. The newlines are represented as `\n` when the string is displayed.

In [2]:
multiline_string = """
one
two
three
"""
multiline_string

'\none\ntwo\nthree\n'

## Simple string manipulation

For basic manipulation of strings, Python's built-in string methods can be extremely convenient. If you have a background working in C or another low-level language, you will likely find the simplicity of Python's methods extremely refreshing. We introduced Python's string type and a few of these methods earlier; here we'll dive a bit deeper.

Python makes it quite easy to adjust the case of a string. Here we'll look at the `upper()`, `lower()`, `title()`, and `capitalize()` methods, using the following messy string as an example:

In [3]:
str1 = "tHe qUICk bROWn fOx."

In [4]:
str1.upper()

'THE QUICK BROWN FOX.'

In [5]:
str1.lower()

'the quick brown fox.'

In [6]:
str1.title()

'The Quick Brown Fox.'

In [7]:
str1.capitalize()

'The quick brown fox.'
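Two practical points follow from these methods. The sketch below is only illustrative (the variable names `same_text` and `titled` and the second example string are assumptions, not part of the text above): lower-casing both sides gives a simple case-insensitive comparison, and `title()` treats any non-letter, including an apostrophe, as a word boundary.

```python
str1 = "tHe qUICk bROWn fOx."

# Lower-casing both sides gives a simple case-insensitive comparison.
same_text = str1.lower() == "The Quick Brown Fox.".lower()

# title() capitalizes the first letter after any non-letter, apostrophes included.
titled = "they're bill's friends".title()

print(same_text)  # True
print(titled)     # They'Re Bill'S Friends
```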

Another common need is to remove spaces (or other characters) from the beginning or end of the string. The basic tool for this is the `strip()` method, which strips whitespace from the beginning and end of the string:

In [8]:
str2 = '         this is the content         '
str2.strip()

'this is the content'

To remove whitespace from only the right or the left, use `rstrip()` or `lstrip()`, respectively:

In [9]:
str2.rstrip()

'         this is the content'

To remove characters other than spaces, you can pass the desired character to the `strip()` method:

In [10]:
str3 = "000000000000435"
str3.strip('0')

'435'
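One detail worth keeping in mind is that the argument to `strip()` is treated as a set of characters to remove, not as a literal prefix or suffix. The following sketch uses made-up example strings (`padded` and `host` are illustrative names) to show both the convenient and the surprising side of this behavior:

```python
# The argument is a set of characters: any mix of them is stripped from both ends.
padded = "xxyyhelloyyxx"
stripped_both = padded.strip("xy")

# lstrip() and rstrip() accept the same kind of argument.
left_only = "000000000000435".lstrip("0")

# Pitfall: "www." means the characters 'w' and '.', so the leading 'w'
# of 'wonderful' is removed as well.
host = "www.wonderful.com"
over_stripped = host.lstrip("www.")

print(stripped_both)  # hello
print(left_only)      # 435
print(over_stripped)  # onderful.com
```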

## Finding and replacing substrings

To check for a substring at the beginning or end of a string, Python provides the `startswith()` and `endswith()` methods:

In [12]:
str4 = 'the quick brown fox jumped over a lazy dog'
str4.startswith('a')

False

In [13]:
str4.endswith('dog')

True
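For substrings that may appear anywhere in the string, a minimal sketch using the `in` operator and the built-in `find()` method might look like the following (the variable names are illustrative). Note that `find()` returns `-1` rather than raising an error when nothing is found, and that `startswith()` also accepts a tuple of candidate prefixes:

```python
str4 = 'the quick brown fox jumped over a lazy dog'

# Membership test: is the substring present anywhere?
has_fox = 'fox' in str4

# find() returns the index of the first occurrence, or -1 if absent.
fox_index = str4.find('fox')
missing_index = str4.find('bear')

# startswith() accepts a tuple and returns True if any prefix matches.
starts_with_article = str4.startswith(('the', 'a'))

print(has_fox, fox_index, missing_index, starts_with_article)
# True 16 -1 True
```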

To go one step further and replace a given substring with a new string, you can use the `replace()` method. Here, let's replace 'brown' with 'RED':

In [14]:
str4.replace("brown", "RED")

'the quick RED fox jumped over a lazy dog'

In [16]:
str4.replace('o', '*')

'the quick br*wn f*x jumped *ver a lazy d*g'
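By default `replace()` changes every occurrence. As a small sketch, the optional third argument limits the number of replacements, and because strings are immutable the original is left untouched (the name `first_only` is illustrative):

```python
str4 = 'the quick brown fox jumped over a lazy dog'

# The third argument caps the number of replacements made.
first_only = str4.replace('o', '*', 1)

print(first_only)  # the quick br*wn fox jumped over a lazy dog
print(str4)        # the original string is unchanged
```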

For a more flexible approach to this `replace()` functionality, see the discussion of regular expressions below.

## Splitting and partitioning strings

If you would like to find a substring and then split the string based on its location, the `partition()` and `split()` methods are what you're looking for. Both will return a sequence of substrings.

The `partition()` method returns a tuple with three elements: the substring before the first instance of the split-point, the split-point itself, and the substring after:

In [17]:
str4.partition('fox')

('the quick brown ', 'fox', ' jumped over a lazy dog')

The `split()` method is perhaps more useful; it finds all instances of the split-point and returns the substrings in between. The default is to split on any whitespace, returning a list of the individual words in a string:

In [18]:
str4.split()

['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'a', 'lazy', 'dog']
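A few common variations on these methods are sketched below, using illustrative strings and variable names that do not appear in the text above: `partition()` for a key/value pair, `split()` with an explicit delimiter or a `maxsplit` limit, and `join()` as the inverse of `split()`.

```python
str4 = 'the quick brown fox jumped over a lazy dog'

# partition() is handy for "key=value" style strings.
key, sep, value = 'name=Ada Lovelace'.partition('=')

# With an explicit delimiter, empty fields are preserved.
fields = 'a,b,,c'.split(',')

# maxsplit limits the number of splits; the remainder stays in one piece.
head = str4.split(maxsplit=2)

# join() reverses split(): glue a list of strings together with a separator.
rejoined = '-'.join(str4.split())

print(key, value)  # name Ada Lovelace
print(fields)      # ['a', 'b', '', 'c']
print(head)        # ['the', 'quick', 'brown fox jumped over a lazy dog']
print(rejoined)    # the-quick-brown-fox-jumped-over-a-lazy-dog
```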

## Flexible pattern matching with regular expressions

The methods of Python's `str` type give you a powerful set of tools for formatting, splitting, and manipulating string data. But even more powerful tools are available in Python's built-in regular expression module. Regular expressions are a huge topic; there are entire books written on them (including [Jeffrey E.F. Friedl’s Mastering Regular Expressions, 3rd Edition](https://www.oreilly.com/library/view/mastering-regular-expressions/0596528124/)), so it is hard to do them justice within a single subsection.

Fundamentally, regular expressions are a means of flexible pattern matching in strings. If you frequently use the command line, you are probably familiar with this type of flexible matching with the "`*`" character, which acts as a wildcard.

Regular expressions generalize this "wildcard" idea to a wide range of flexible string-matching syntaxes. The Python interface to regular expressions is contained in the built-in `re` module; as a simple example, let's use it to duplicate the functionality of the string `split()` method:

In [22]:
import re

In [23]:
regex1 = re.compile(r"\s+")  # raw string (see below) keeps the backslash intact
regex1.split(str4)

['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'a', 'lazy', 'dog']

In the code above, we first compile a regular expression and then use it to split a string.

Just as Python's `split()` method returns a list of all substrings between whitespace, the regular expression `split()` method returns a list of all substrings between matches to the input pattern.

In this case, the regular expression is `\s+`.

+ `\s` is a special character that matches any whitespace (space, tab, newline, etc.), and `+` is a character that indicates one or more of the entity preceding it.

+ Therefore, the regular expression matches any substring consisting of one or more whitespace characters.

Similarly, the `sub()` method of a compiled pattern operates much like `str.replace()`:

In [25]:
regex2 = re.compile("brown")
regex2.sub("RED", str4)

'the quick RED fox jumped over a lazy dog'
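For one-off use, the `re` module also offers module-level `re.split()` and `re.sub()` functions that take the pattern as their first argument, so no explicit compile step is needed. The sketch below repeats the two examples in that style and adds one thing `str.replace()` cannot do in a single call, collapsing runs of mixed whitespace (the `messy` string is an illustrative assumption):

```python
import re

str4 = 'the quick brown fox jumped over a lazy dog'

# Module-level equivalents of regex1.split() and regex2.sub().
words = re.split(r'\s+', str4)
swapped = re.sub(r'brown', 'RED', str4)

# Collapse any run of spaces, tabs, or newlines into a single space.
messy = 'the   quick\tbrown\nfox'
normalized = re.sub(r'\s+', ' ', messy)

print(words[:3])   # ['the', 'quick', 'brown']
print(swapped)     # the quick RED fox jumped over a lazy dog
print(normalized)  # the quick brown fox
```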

## A more complicated example

But, you might ask, why would you want to use the more complicated and verbose syntax of regular expressions rather than the more intuitive and simple string methods? The advantage is that regular expressions offer far more flexibility.

Here we'll consider a more complicated example: the common task of matching email addresses. I'll start by simply writing a (somewhat indecipherable) regular expression, and then walk through what is going on. Here it goes:

In [27]:
regex_email = re.compile(r'\w+@\w+\.[a-z]{3}')

Using this, if we're given a line from a document, we can quickly extract things that look like email addresses.

In [29]:
str5 = "To email representative, try rep@python.org or the older address rep@google.com."
regex_email.findall(str5)

['rep@python.org', 'rep@google.com']

We can do further operations, like replacing these email addresses with another string, perhaps to hide addresses in the output:

In [30]:
regex_email.sub('--@--.--', str5)

'To email representative, try --@--.-- or the older address --@--.--.'

Finally, note that if you really want to match any email address, the preceding regular expression is far too simple. For example, it only allows addresses whose user and domain names are made of alphanumeric characters (or underscores) and that end in a suffix of exactly three lower-case letters. A period in the user name therefore breaks the match, and we find only part of the address:

In [31]:
regex_email.findall('barack.obama@whitehouse.gov')

['obama@whitehouse.gov']

This goes to show how unforgiving regular expressions can be if you're not careful! If you search around online, you can find some suggestions for regular expressions that will match all valid emails, but beware: they are much more involved than the simple expression used here!
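As a small illustration of how such a pattern might be loosened, the sketch below allows periods in the user name by replacing `\w+` with the character set `[\w.]+`. The name `regex_email_dots` and the second test address are assumptions made for this example, and the result is still far from a complete email matcher, as the last line shows:

```python
import re

# Assumed variation: allow periods as well as word characters before the "@".
regex_email_dots = re.compile(r'[\w.]+@\w+\.[a-z]{3}')

full_address = regex_email_dots.findall('barack.obama@whitehouse.gov')

# Still too simple: a multi-part domain is cut off after three letters.
truncated = regex_email_dots.findall('user@mail.example.co.uk')

print(full_address)  # ['barack.obama@whitehouse.gov']
print(truncated)     # ['user@mail.exa']
```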

## Basics of regular expressions

The syntax of regular expressions is much too large a topic for this short section. Still, a bit of familiarity can go a long way: I will walk through some of the basic constructs here, and then list some more complete resources from which you can learn more. My hope is that the following quick primer will enable you to use these resources effectively.

While simple letters or numbers are direct matches, there are a handful of characters that have special meanings within regular expressions. They are:

```
. ^ $ * + ? { } [ ] \ | ( )
```

We will discuss the meaning of some of these momentarily. In the meantime, you should know that if you'd like to match any of these characters directly, you can escape them with a backslash.

In the following example, the `r` prefix in `r"\$"` indicates a raw string:

In [33]:
regex3 = re.compile(r"\$")
regex3.findall("the cost is $20")

['$']

In standard Python strings, the backslash is used to indicate special characters. For example, a tab is indicated by "\t":

In [34]:
print('a\tb\tc')

a	b	c


Such substitutions are not made in a raw string:

In [35]:
print(r'a\tb\tc')

a\tb\tc
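The difference is easy to check directly. In this brief sketch (variable names are illustrative), the ordinary string `'\t'` is a single tab character, while the raw string `r'\t'` is two characters, a backslash followed by `t`. Without the `r` prefix, the backslash in a pattern has to be doubled to reach the regular expression engine:

```python
import re

tab = '\t'    # one character: a tab
raw = r'\t'   # two characters: a backslash and the letter t
lengths = (len(tab), len(raw))

# These two spellings produce the same pattern string.
same_pattern = '\\$' == r'\$'

# Both therefore find the literal dollar sign.
dollars = re.findall('\\$', 'the cost is $20')

print(lengths)       # (1, 2)
print(same_pattern)  # True
print(dollars)       # ['$']
```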


Just as the "`\`" character within regular expressions can escape special characters, turning them into normal characters, it can also be used to give normal characters special meaning. These special sequences match specified groups of characters, and we've seen them before. In the email address regexp from before, we used "`\w`", which is a special marker matching any alphanumeric character or underscore.

Similarly, in the simple `split()` example, we also saw "`\s`", a special marker indicating any whitespace character.

Putting these together, we can create a regular expression that will match any two letters/digits with whitespace between them:

In [36]:
regex4 = re.compile(r'\w\s\w')
regex4.findall('the fox is 9 years old')

['e f', 'x i', 's 9', 's o']

The following list gives a few of these markers that are commonly useful:

+ "`\d`": Match any digit

+ "`\D`": Match any non-digit

+ "`\s`": Match any whitespace

+ "`\S`": Match any non-whitespace

+ "`\w`": Match any alphanumeric char (or underscore)

+ "`\W`": Match any char that is neither alphanumeric nor an underscore

This is not a comprehensive list or description; for more details, see [Python's regular expression syntax documentation](https://docs.python.org/3/library/re.html#re-syntax).
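A few of the markers listed above can be tried out as follows. This is only a sketch with made-up input strings; note in the last example that the underscore counts as a "`\w`" character and so is not reported by "`\W`":

```python
import re

# \d+ picks out runs of digits.
numbers = re.findall(r'\d+', 'Order 66 shipped 3 items on 2024-01-15')

# \D matches every non-digit, so removing those keeps only the digits.
digits_only = re.sub(r'\D', '', '(555) 123-4567')

# \S+ picks out runs of non-whitespace characters.
tokens = re.findall(r'\S+', ' a  b\tc\n')

# \W matches anything that is not a letter, digit, or underscore.
non_word = re.findall(r'\W', 'snake_case, kebab-case!')

print(numbers)      # ['66', '3', '2024', '01', '15']
print(digits_only)  # 5551234567
print(tokens)       # ['a', 'b', 'c']
print(non_word)     # [',', ' ', '-', '!']
```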

## Square brackets match custom character groups

If the built-in character groups aren't specific enough for you, you can use square brackets to specify any set of characters you're interested in. For example, the following will match any lower-case vowel:

In [37]:
regex5 = re.compile('[aeiou]')
regex5.split('consequential')

['c', 'ns', 'q', '', 'nt', '', 'l']

Similarly, you can use a dash to specify a range: for example, "`[a-z]`" will match any lower-case letter, and "`[1-3]`" will match any of "1", "2", or "3". For instance, you may need to extract from a document specific numerical codes that consist of a capital letter followed by a digit. You could do this as follows:

In [38]:
regex6 = re.compile('[A-Z][0-9]')
regex6.findall('1043879, G2, H6')

['G2', 'H6']
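Two further variations on square brackets may be useful, shown here as a sketch with illustrative inputs: several ranges can be combined inside one pair of brackets, and a leading "`^`" inside the brackets negates the set, matching any character that is not listed.

```python
import re

# Combine ranges: digits or the capital letters A-F.
hex_codes = re.findall(r'[0-9A-F]+', 'codes: 1A2B, FF, 7')

# A leading ^ negates the set: anything that is NOT a digit.
digits_kept = re.sub(r'[^0-9]', '', 'tel: 555-1234')

# Negated set of vowels and the space character.
consonants = re.findall(r'[^aeiou ]', 'big cat')

print(hex_codes)    # ['1A2B', 'FF', '7']
print(digits_kept)  # 5551234
print(consonants)   # ['b', 'g', 'c', 't']
```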

If you would like to match a string with, say, three alphanumeric characters in a row, it is possible to use "`\w\w\w`". Because this is such a common need, there is a specific syntax to match repetitions -- curly braces with a number:

In [39]:
regex7 = re.compile(r'\w{3}')
regex7.findall('The quick brown fox')

['The', 'qui', 'bro', 'fox']

There are also markers available to match any number of repetitions. For example, the "`+`" character will match one or more repetitions of what precedes it:

In [40]:
regex8 = re.compile(r'\w+')
regex8.findall('The quick brown fox')

['The', 'quick', 'brown', 'fox']

The following is a list of the repetition markers available for use in regular expressions, each with a description and an example:

+ `?`: Match zero or one repetitions of the preceding element. For example, "`ab?`" matches "a" or "ab"

+ `*`: Match zero or more repetitions of the preceding element. For example, "`ab*`" matches "a", "ab", "abb", "abbb"...

+ `+`: Match one or more repetitions of the preceding element. For example, "`ab+`" matches "ab", "abb", "abbb"... but not "a"

+ `{n}`: Match n repetitions of the preceding element. For example, "`ab{2}`" matches "abb"

+ `{m,n}`: Match between m and n repetitions of the preceding element. For example, "`ab{2,3}`" matches "abb" or "abbb"
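The examples in this list can be checked with `re.fullmatch()`, which succeeds only when the entire string matches the pattern. The helper `matches()` below is a hypothetical convenience function written for this sketch, not part of the `re` module:

```python
import re

def matches(pattern, text):
    """Return True if the whole of `text` matches `pattern` (illustrative helper)."""
    return re.fullmatch(pattern, text) is not None

optional = [matches(r'ab?', s) for s in ['a', 'ab', 'abb']]
star = [matches(r'ab*', s) for s in ['a', 'ab', 'abbb']]
plus = [matches(r'ab+', s) for s in ['a', 'ab', 'abbb']]
exact = [matches(r'ab{2}', s) for s in ['ab', 'abb', 'abbb']]
ranged = [matches(r'ab{2,3}', s) for s in ['ab', 'abb', 'abbb', 'abbbb']]

print(optional)  # [True, True, False]
print(star)      # [True, True, True]
print(plus)      # [False, True, True]
print(exact)     # [False, True, False]
print(ranged)    # [False, True, True, False]
```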

With these basics in mind, let's return to our email address matcher.

```
regex_email = re.compile(r'\w+@\w+\.[a-z]{3}')
```

We can now understand what this means: we want one or more alphanumeric characters ("`\w+`") followed by the at sign ("`@`"), followed by one or more alphanumeric characters ("`\w+`"), followed by a period ("`\.`" -- note the need for a backslash escape), followed by exactly three lower-case letters.

## Parentheses indicate groups to extract

For compound regular expressions like our email matcher, we often want to extract their components rather than the full match. This can be done using parentheses to group the results:

In [42]:
regex_email2 = re.compile(r'([\w.]+)@(\w+)\.([a-z]{3})')
regex_email2.findall(str5)

[('rep', 'python', 'org'), ('rep', 'google', 'com')]

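As we see, with groups in the pattern `findall()` returns a list of tuples, one per match, each holding the sub-components of the email address. Note that the first group here uses `[\w.]+` rather than `\w+`, so periods in the user name are matched as well.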
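Groups can also be read from a single match object. The following sketch assumes a variant of the pattern with named groups, written `(?P<name>...)`; the names `user`, `domain`, and `suffix` and the variable `regex_email3` are illustrative choices, not part of the earlier examples:

```python
import re

str5 = "To email representative, try rep@python.org or the older address rep@google.com."

# Same structure as regex_email2, but each group is given a name.
regex_email3 = re.compile(r'(?P<user>[\w.]+)@(?P<domain>\w+)\.(?P<suffix>[a-z]{3})')

# search() returns a match object for the first match (or None if there is none).
match = regex_email3.search(str5)

full = match.group(0)          # group 0 is the entire match
domain = match.group('domain') # groups can be looked up by name...
second = match.group(2)        # ...or by position
parts = match.groupdict()      # all named groups as a dictionary

print(full)    # rep@python.org
print(domain)  # python
print(parts)   # {'user': 'rep', 'domain': 'python', 'suffix': 'org'}
```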

## Further resources on regular expressions

This section is meant to give you an idea of the types of problems that might be addressed using regular expressions, as well as a basic idea of how to use them in Python. Some references for learning more are listed below:

1. [Python's re package Documentation](https://docs.python.org/3/library/re.html): I find that I promptly forget how to use regular expressions just about every time I use them. Now that I have the basics down, I have found this page to be an incredibly valuable resource to recall what each specific character or sequence means within a regular expression.

2. [Python's official regular expression HOWTO](https://docs.python.org/3/howto/regex.html): A more narrative approach to regular expressions in Python.

3. [Mastering Regular Expressions (O'Reilly, 2006)](https://www.oreilly.com/library/view/mastering-regular-expressions/0596528124/) is a 500+ page book on the subject. If you want a really complete treatment of this topic, this is the resource for you.