<center> 
# R406: Using Python for data analysis and modelling

<br> <br> 

## Lecture 5: Strings and regular expressions

<br>

<center> **Andrey Vassilev**

<br> 

<center> **2016/2017**
 

# Outline

1. Strings and string manipulation
2. Regular expressions

# A one-minute review of strings

Strings in Python can be defined using single or double quotes:

In [None]:
s1 = 'String One'
s2 = "string two"

Multiline strings are defined as follows:

In [None]:
s3 = '''Multi
line
string'''
s4 = """Multi
line
string"""
s3 == s4 # The quotes chosen don't matter

In [None]:
print(s3) # Note the way this string prints

# A one-minute review of strings

Strings can be indexed and sliced:

In [None]:
print(s1[2])
s1[0:6]

We can iterate over a string:

In [None]:
for l in s1:
    print(l, end = "  ")

Strings are immutable: an assignment such as `s2[0]=3` will raise a `TypeError`.
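
A minimal sketch of what immutability means in practice. The variable names `error_message` and `s2_fixed` are introduced here for illustration only; the usual workaround is to build a new string from pieces of the old one.

```python
s2 = "string two"

try:
    s2[0] = "S"  # item assignment is not supported for str objects
except TypeError as err:
    error_message = str(err)
    print("TypeError:", error_message)

# Build a new string instead of modifying the existing one
s2_fixed = "S" + s2[1:]
print(s2_fixed)  # String two
print(s2)        # string two (the original is unchanged)
```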

# Case manipulation

The following are self-explanatory. Note that they return copies of the original string (unsurprisingly for an immutable object). Try them out!

In [None]:
s1.upper() # Convert to uppercase

In [None]:
s1.lower() # Convert to lowercase

In [None]:
print(s2)
print(s2.title()) # Convert to titlecase

In [None]:
s2.capitalize() # Capitalize (make first letter capital)

In [None]:
print(s1)
print(s1.swapcase()) # Change the case

# Adding and removing spaces

&nbsp;

## Removing whitespace

Leading and trailing whitespace can be removed by using the `strip()` method.

In [None]:
s = "    some text            "
s.strip()

Removing only leading or only trailing whitespace can be accomplished with `lstrip()` and `rstrip()`.

In [None]:
s.lstrip()

In [None]:
s.rstrip()

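The `strip()` method and its variations also accept as an argument a string of characters to be stripped. Every leading or trailing character that appears in this string is removed, regardless of order.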

In [None]:
s = "++++++pure text+++++"
s.strip("+")

In [None]:
s.rstrip("+")
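
The argument is treated as a set of characters rather than as a prefix or suffix. The short sketch below, using made-up sample strings, shows the consequences: characters inside the text are left alone, and `lstrip()` may remove more than a literal prefix.

```python
decorated = "+-+-pure + text-+-+"
cleaned = decorated.strip("+-")  # strips any mix of '+' and '-' at both ends
print(cleaned)                   # pure + text

# Not a prefix removal: every leading 'a', 'b' or 'c' is stripped
label = "abcabc-data"
stripped_label = label.lstrip("abc")
print(stripped_label)            # -data
```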

## Adding space

These operations serve primarily a formatting purpose. For example, the `center()` method inserts additional whitespace to center a string, producing an appropriately padded (longer) string.

In [None]:
s = "small piece of text"
s.center(30) # The argument is the overall length 
             # of the resulting string

We can also specify the padding character:

In [None]:
s.center(30,"^")

The `ljust()` and `rjust()` methods perform the respective justifications by means of appropriate one-sided padding. They work similarly to `center()`.

In [None]:
s.ljust(40)

In [None]:
# You can also specify the string directly
# instead of defining a variable
"Python is sooo cool".rjust(25,"☯")

There is also a special zero-fill method `zfill()` which pads with zeros from the left (i.e. it right-justifies text):

In [None]:
x = 777
str(x).zfill(10)
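
As an illustration of the formatting use case, here is a small sketch that lines up a two-column table with `ljust()` and `rjust()`. The data in `rows` is invented for the example. It also shows that `zfill()` keeps a leading sign in front of the zeros.

```python
rows = [("apples", 3), ("kiwis", 12), ("figs", 107)]

# Left-justify the names (padded with dots), right-justify the quantities
table = [name.ljust(8, ".") + str(qty).rjust(4) for name, qty in rows]
for line in table:
    print(line)

# zfill() places the zeros after a leading sign
signed = "-42".zfill(6)
print(signed)  # -00042
```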

# Finding and replacing text

&nbsp;

## Finding text

We can find the first occurrence of a substring by using the methods `find()` and `index()`. The search is performed from the left. Both methods return the position where the substring is found.

In [None]:
s = "A yellow python is prettier than a black python."
s.find("python")

In [None]:
s.index("python")

The difference between the two methods is how they handle the case when the substring is not found: `find()` returns -1, while `index()` throws ~~a tantrum~~ an error (a `ValueError`).

In [None]:
s.find("anaconda")

In [None]:
s.index("anaconda")
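
Which method to use depends on how you want to handle a failed search. The sketch below shows both styles. The helper `position_or_none` is hypothetical and exists only for this example. Note that -1 is itself a valid index (the last character), so the result of `find()` should be checked before it is used for indexing.

```python
s = "A yellow python is prettier than a black python."

def position_or_none(text, sub):
    """Hypothetical helper: position of sub in text, or None if absent."""
    try:
        return text.index(sub)
    except ValueError:
        return None

print(position_or_none(s, "python"))    # 9
print(position_or_none(s, "anaconda"))  # None

pos = s.find("anaconda")
if pos == -1:
    print("Not found")
else:
    print(s[pos:])

# Pitfall: using -1 as an index silently picks the last character
last_char = s[s.find("anaconda")]
print(last_char)  # .
```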

Finding a string by performing the search **from the end** can be done with the methods `rfind()` and `rindex()`.

In [None]:
s.rfind("python")

The `find()` and `index()` methods can optionally take a second and third argument specifying a starting position and an end position in the string which will confine the search to the respective range.

In [None]:
s.find("python", 15)

In [None]:
s.find("python", 15, 30)
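
The starting position is handy for locating every occurrence of a substring, not just the first one. The following is one possible sketch; `find_all` is a name made up for this example and not a built-in method.

```python
s = "A yellow python is prettier than a black python."

def find_all(text, sub):
    """Hypothetical helper: list the start positions of sub in text."""
    positions = []
    start = text.find(sub)
    while start != -1:
        positions.append(start)
        # Resume the search just past the previous hit
        start = text.find(sub, start + 1)
    return positions

positions = find_all(s, "python")
print(positions)  # [9, 41]
```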

There are also methods that check whether a string begins or ends with a specific substring. These are called `startswith()` and `endswith()`, respectively. They return Boolean values.

In [None]:
s.startswith("A blue")

In [None]:
s.endswith("python.")

## Replacing text

**Once again, these operations do not modify the original string but return a new string containing the changes!**

We can replace a substring with another one using the `replace()` method. The syntax is `s.replace("old", "new")`.

In [None]:
s.replace("python", "anaconda")
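
By default all occurrences are replaced. An optional third argument limits the number of replacements, as this short sketch shows (the variable names are chosen for the example):

```python
s = "A yellow python is prettier than a black python."

all_replaced = s.replace("python", "anaconda")
first_only = s.replace("python", "anaconda", 1)  # replace at most 1 occurrence

print(all_replaced)
print(first_only)
print(s)  # the original string is unchanged
```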

# Splitting strings

One can break a string into substrings using different methods.

The `partition()` method finds the first occurrence of a target substring and returns a tuple in the following manner:

In [None]:
s.partition("python")

There is also an `rpartition()` method which performs the same operation from the end:

In [None]:
s.rpartition("python")

The `split()` method splits a string on runs of whitespace or on a separator string passed as an argument, and returns a list of the resulting substrings. Note that the separator is not included in the result.

In [None]:
s.split()

In [None]:
s.split("y")
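
Splitting on whitespace and splitting on an explicit separator behave differently when separators are repeated. The sketch below uses invented sample strings to show this, together with the optional `maxsplit` argument.

```python
padded = "  two   words "

no_arg = padded.split()         # runs of whitespace count as one separator
explicit = padded.split(" ")    # every single space is a separator
print(no_arg)    # ['two', 'words']
print(explicit)  # ['', '', 'two', '', '', 'words', '']

# Empty fields are kept when a separator is given explicitly
fields = "a,b,,c".split(",")
print(fields)    # ['a', 'b', '', 'c']

# maxsplit limits the number of splits performed
key, value = "key=value=more".split("=", 1)
print(key, value)  # key value=more
```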

A multiline string can be split into its constituent lines using the method `splitlines()`. It splits on line boundaries such as `\n` and `\r\n`.

In [None]:
Zen = """In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch."""

Zen.splitlines() # Try passing True as an argument
                 # to include newline characters in the result
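
A brief comparison of `splitlines()` with `split("\n")`, using a made-up string with mixed line endings and a trailing newline:

```python
text = "first\nsecond\r\nthird\n"

lines = text.splitlines()
lines_with_ends = text.splitlines(True)  # keep the line endings
naive = text.split("\n")

print(lines)            # ['first', 'second', 'third']
print(lines_with_ends)  # ['first\n', 'second\r\n', 'third\n']
print(naive)            # ['first', 'second\r', 'third', '']
```

Here `split("\n")` leaves a stray `\r` and produces an empty last element, which is why `splitlines()` is usually the more convenient choice for text read from files.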

# Joining strings

Strings are joined using the `join()` method. The method takes an iterable of strings and concatenates them using a concatenating string:

In [None]:
concatstring = " "
iterstring = ("A","fine","day")
# concatstring = " - ahem - "
concatstring.join(iterstring)

The `join()` method can be used to produce a multiline string from an iterable of strings. This is done using the newline character `"\n"` as the concatenating string.

**Note:** See [here](https://docs.python.org/3/library/stdtypes.html#str.splitlines) for a table of line boundary characters.

In [None]:
lines = ["Gaudeamus igitur.", 
         "Iuvenes dum sumus.",
         "Post iucundam iuventutem.", 
         "Post molestam senectutem."]
result = "\n".join(lines)
result
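
Two things worth checking yourself: `join()` only accepts strings, and joining with `"\n"` is undone by `splitlines()`. The sketch below uses an invented list of numbers to illustrate the first point.

```python
numbers = [3, 1, 4]

try:
    ", ".join(numbers)  # fails: the items are ints, not strings
except TypeError as err:
    print("TypeError:", err)

# Convert the items to strings first
joined = ", ".join(str(n) for n in numbers)
print(joined)  # 3, 1, 4

# Joining with "\n" and splitting into lines are inverse operations here
lines = ["Gaudeamus igitur.", "Iuvenes dum sumus."]
round_trip = "\n".join(lines).splitlines()
print(round_trip == lines)  # True
```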

# Miscellaneous string methods


The `count()` method counts the number of occurrences of a substring in a given string. 

In [None]:
RB = """Gin a body meet a body
Comin thro' the rye,
Gin a body kiss a body,
Need a body cry?"""

RB.count("body")

It also has an extended syntax giving starting and ending positions for the search, similar to `find()` and `index()`:

In [None]:
RB.count("body",23,67)

The `in` operator checks whether one string is part of another:

In [None]:
"body" in RB

In [None]:
"somebody" in RB

There is a variety of methods that perform different checks on strings and return Boolean values.  

For instance, the `isalnum()` method checks whether a string consists of letters and digits only.

In [None]:
print("abc".isalnum())
print("123".isalnum())
print("a1B3".isalnum())
print("".isalnum()) # False: the empty string does not qualify
print("ab1@".isalnum())
print("45.2".isalnum())

The `isdigit()` method checks whether the string consists of digits only. 

In [None]:
print("123".isdigit())
print("abc".isdigit())
print("123.5".isdigit())
print("000556".isdigit())
print("VI".isdigit())
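
As the output shows, `isdigit()` is not a test for "is this a number": signs and decimal points make it return `False`. One common alternative is simply to attempt the conversion. The helper `is_number` below is a hypothetical name used only for this sketch, and it accepts everything `float()` accepts (including forms such as `"1e3"`).

```python
def is_number(text):
    """Hypothetical helper: True if float() can parse the text."""
    try:
        float(text)
    except ValueError:
        return False
    return True

for candidate in ["123", "-5", "123.5", "VI"]:
    print(repr(candidate), candidate.isdigit(), is_number(candidate))

# isdigit() also accepts some characters that int() cannot convert,
# e.g. the superscript two; isdecimal() is stricter
print("²".isdigit(), "²".isdecimal())  # True False
```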

The list of methods goes on. Check out [this part](https://docs.python.org/3/library/stdtypes.html#string-methods) of the Python documentation for more info.

# Regular expressions

- Regular expressions are a way to describe a pattern we want to discover in a string.
- A rudimentary relative of regular expressions is the wildcard in a command line instruction such as **`dir *.ipynb`**, which asks the operating system to list all `ipynb` files in a given location.
- Regular expressions can be thought of as a mechanism to greatly enrich the simple string search and manipulation methods we saw earlier.
- Access to regular expression functionality in Python is gained via the `re` module.

# Basics of regular expressions

In very general terms, working with REs can be outlined as follows:
1. Describe the pattern of interest, using special conventions.
  - A pattern can be something like:  
    *Find the sequence "ID" or "id", followed by the symbols "#" or "№", followed by exactly 9 digits.*
2. Apply an operation to a string using the constructed pattern.
  - An operation can be something like finding the pattern in the string, counting instances of the pattern or replacing it with something.

**A disclaimer and a few pointers:**  
- Regular expressions are a huge topic with many intricacies. 
- Here we'll only try to get the big picture and work out a few basic things to get us started.
- To get deeper into the topic, try the [documentation](https://docs.python.org/3/library/re.html) for the `re` module, the [HOWTO](https://docs.python.org/3.5/howto/regex.html) (in essence an introductory tutorial) on regular expressions or the numerous resources on the WWW.
- In particular, the book *Mastering Regular Expressions* by Jeffrey Friedl is often cited as an authoritative reference on the subject. However, recent editions do not cover Python.

# Implementing regular expressions in Python

0. Import the `re` module.
1. Construct the RE as a string using the special conventions. 
2. Compile the RE string to a pattern object.
3. Use the pattern object to apply different operations (methods) to other strings.
  - These "other" strings will be the ones containing the actual information we are processing, e.g. lines read from a file.
  - The operations can be chained (= applied in succession).
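
The steps above can be sketched as follows, using the "ID" pattern described earlier as the pattern of interest. The sample text in `records` is invented, and the pattern shown is only one possible way to write the rule. In particular, the trailing `(?!\d)` is a lookahead added here to enforce "exactly 9 digits"; it is an extra not covered in this lecture.

```python
import re  # step 0: import the module

# Step 1: describe the pattern as a (raw) string:
# "ID" or "id", then "#" or "№", then exactly 9 digits
id_re = r"(?:ID|id)[#№]\d{9}(?!\d)"

# Step 2: compile the string to a pattern object
p = re.compile(id_re)

# Step 3: apply operations to the text being processed
records = "Customer ID#123456789 and supplier id№987654321; ref ID#12345."
found = p.findall(records)               # find all matches
count = len(found)                       # count them
redacted = p.sub("<redacted>", records)  # replace them

print(found)     # ['ID#123456789', 'id№987654321']
print(count)     # 2
print(redacted)
```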

# Pattern construction rules

First things first:

In [None]:
%reset
import re

- Some people call these rules a "mini-language". This is an indication that they can be hard to decipher when combined. 
- We'll cover only a few basic ones to get the main ideas.

- A string can be matched literally, e.g. find the word "happy" in another string.
- This is simple and can obviously be accomplished more economically by the methods covered previously.
- We'll use it to illustrate the creation of the pattern object, the match object and the use of some of the methods.

In [None]:
text = "The Ring, which Gollum referred to as 'my precious' or 'precious', extended his life far beyond natural limits."
p = re.compile("precious") # p is a pattern object
m = p.search(text) # m is a match object
x,y = m.start(),m.end() # starting and ending positions 
                        # of first match
print(text[x:]) 
print(text[y:])
l = p.findall(text) # find all matches and return them in a list
# The result is trivial in this case
l

If no match is found, `search()` returns `None` instead of a match object, so we typically check for this situation explicitly. The same holds for the method `match()`, which checks whether the pattern matches at the beginning of the string (unlike `search()`, which scans through the string). It won't find anything in the previous example.

In [None]:
m = p.match(text)
if m:
    print(m)
else:
    print("No match!")
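
To make the difference concrete, here is a small side-by-side sketch on two invented strings. It also includes `fullmatch()`, which is not discussed above and which requires the whole string to match the pattern.

```python
import re

p = re.compile("precious")

m_start = p.match("precious metals")  # pattern is at the beginning: match
m_none = p.match("my precious")       # not at the beginning: None
m_search = p.search("my precious")    # found further along the string

print(m_start is not None)  # True
print(m_none)               # None
print(m_search.start())     # 3

# fullmatch() succeeds only if the entire string matches
full_ok = p.fullmatch("precious")
full_none = p.fullmatch("precious metals")
print(full_ok is not None, full_none)  # True None
```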

We can also perform more complex matches using special notation.
- The `[]` construct is used to define a character class, i.e. a group of characters we want to match. For example, `[abc]` matches `a`, `b` or `c`.
- The character `^` at the beginning of a character class means negation. Thus, `[^abc]` matches any single character except `a`, `b` or `c`.

In [None]:
text = "A bee can see a zee from afar"
p = re.compile("[bs]ee")
# p = re.compile("[^bs]ee")
itr = p.finditer(text) # Produces an iterable object
for e in itr:
    print("The word '%s' can be found at position %d."%(e.group(), e.start()))

- `.`  In the default mode, this matches any character except a newline.
- `?`  Causes the resulting RE to match 0 or 1 repetitions of the preceding RE. `ab?` will match either 'a' or 'ab'.
- `*`  Causes the resulting RE to match 0 or more repetitions of the preceding RE, as many repetitions as are possible. `ab*` will match 'a', 'ab', or 'a' followed by any number of 'b's.
- `+`  Causes the resulting RE to match 1 or more repetitions of the preceding RE. `ab+` will match 'a' followed by any non-zero number of 'b's; it will not match just 'a'.

- Grouping can be done using `()` to enforce a different scope of an operator and capture the result.
- The `|` symbol has the meaning of "or".
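
The rules above can be checked quickly with a tiny helper. The function `matches` is a hypothetical convenience defined only for this sketch; it uses `re.fullmatch()` so that the whole test string has to fit the pattern.

```python
import re

def matches(pattern, s):
    """Hypothetical helper: True if the whole string s fits the pattern."""
    return re.fullmatch(pattern, s) is not None

print(matches("ab?", "a"), matches("ab?", "ab"), matches("ab?", "abb"))  # True True False
print(matches("ab*", "a"), matches("ab*", "abbb"))                       # True True
print(matches("ab+", "a"), matches("ab+", "ab"))                         # False True
print(matches("a.c", "abc"), matches("a.c", "a\nc"))                     # True False

# Grouping changes the scope of +, and | means "or"
print(matches("(ab)+", "ababab"), matches("ab+", "ababab"))              # True False
print(matches("cat|dog", "dog"))                                         # True

# * is greedy: it takes as many repetitions as possible
greedy = re.search("ab*", "abbbc").group()
print(greedy)  # abbb
```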

In [None]:
text = """112 blue bottles hanging on the wall,
10 green bottles hanging on the wall,
9 red bottles hanging on the wall,
1 green bottle hanging on the wall."""
p = re.compile("([0-9]+ (?:blue|green) bottle[s]?)")
# Here (?:blue|green) is a non-capturing group
# The simpler (blue|green) will capture these matches separately

# Try also:
# p = re.compile("[0-9]+.*bottle[s]?")
L = p.findall(text)
L
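
The effect of capturing versus non-capturing groups on `findall()` is easiest to see side by side. The sketch below uses a shortened, single-line version of the text; the variable names are chosen for the example.

```python
import re

text = "112 blue bottles, 10 green bottles, 9 red bottles, 1 green bottle"

# No capturing group: findall() returns the full matches
whole = re.findall(r"[0-9]+ (?:blue|green) bottles?", text)
print(whole)   # ['112 blue bottles', '10 green bottles', '1 green bottle']

# One capturing group: findall() returns only what the group captured
colours = re.findall(r"[0-9]+ (blue|green) bottles?", text)
print(colours)  # ['blue', 'green', 'green']

# Several capturing groups: findall() returns tuples
pairs = re.findall(r"([0-9]+) (blue|green) bottles?", text)
print(pairs)   # [('112', 'blue'), ('10', 'green'), ('1', 'green')]
```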

- `\d` Matches any decimal digit; this is equivalent to the class `[0-9]`.
- `\D` Matches any non-digit character; this is equivalent to the class `[^0-9]`.
- `\s` Matches any whitespace character; this is equivalent to the class `[ \t\n\r\f\v]`.
- `\S` Matches any non-whitespace character; this is equivalent to the class `[^ \t\n\r\f\v]`.

- `\w` Matches any alphanumeric character; this is equivalent to the class `[a-zA-Z0-9_]`.
- `\W` Matches any non-alphanumeric character; this is equivalent to the class `[^a-zA-Z0-9_]`.

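These sequences can be included inside a character class. Note that in Python 3 the equivalences above are exact only for byte patterns or when the `re.ASCII` flag is set; for ordinary (Unicode) strings `\d`, `\s` and `\w` also match digits, whitespace and letters from other scripts.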

Because the backslash `\` is an escape character both in ordinary Python string literals and in REs, it is recommended to write patterns as so-called *raw strings*, in which Python leaves backslashes untouched. Here is the difference:

In [None]:
print("this is a regular \n string")
print(r"this is a raw \n string")
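
Why this matters for REs can be seen with a sequence that means one thing to Python and another to the `re` module. The sketch below uses `\b`, which in an ordinary string literal is the backspace character, while in an RE it stands for a word boundary (a feature not covered above and used here only for illustration).

```python
import re

newline_len = len("\n")  # 1: a single newline character
raw_len = len(r"\n")     # 2: a backslash followed by the letter n
print(newline_len, raw_len)

# Doubling the backslash is the alternative to a raw string
same = "\\d" == r"\d"
print(same)  # True

sentence = "a bee can see"
plain = re.findall("\bbee\b", sentence)  # looks for backspace characters
raw = re.findall(r"\bbee\b", sentence)   # looks for the whole word 'bee'
print(plain)  # []
print(raw)    # ['bee']
```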

- `{m}` Specifies that exactly `m` copies of the previous RE should be matched; fewer matches cause the entire RE not to match. For example, `a{6}` will match exactly six 'a' characters, but not five.
- `{m,n}` Causes the resulting RE to match from `m` to `n` repetitions of the preceding RE, attempting to match as many repetitions as possible. For example, `a{3,5}` will match from 3 to 5 'a' characters. Omitting `m` specifies a lower bound of zero, and omitting `n` specifies an infinite upper bound. As an example, `a{4,}b` will match 'aaaab' or a thousand 'a' characters followed by 'b', but not 'aaab'.
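
The examples in the two bullets above can be verified directly. This is a minimal sketch using `fullmatch()` and `search()`; the variable names are chosen for the example.

```python
import re

six_ok = re.fullmatch(r"a{6}", "aaaaaa") is not None
five_ok = re.fullmatch(r"a{6}", "aaaaa") is not None
print(six_ok, five_ok)  # True False

# {m,n} is greedy: with 7 characters available it takes 5
longest = re.search(r"a{3,5}", "aaaaaaa").group()
print(longest)  # aaaaa

three = re.fullmatch(r"a{4,}b", "aaab") is not None
four = re.fullmatch(r"a{4,}b", "aaaab") is not None
thousand = re.fullmatch(r"a{4,}b", "a" * 1000 + "b") is not None
print(three, four, thousand)  # False True True
```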

# Example: extracting Sofia car licence plates

- We have a piece of text containing multiple identifiers (car licence plate numbers, dates, ID numbers).
- We are interested in extracting the licence plate numbers for cars registered in Sofia.
- We know that Sofia plate numbers can take three possible forms: `CooooXX`, `CAooooXX` and `CBooooXX`, where `o` stands for a digit and `X` stands for an uppercase letter.
- (Please don't rush in with counterexamples of other possible forms of Sofia car plates ☺)

In [None]:
plates = """A police officer with badge number PO31254 wrote 
speeding tickets for vehicles with licence plates 
CA6542HP, 234GH856, C1234AA and B4455TK. Another officer, 
holding badge numbered CA98765 and born on 11.11.1970, 
wrote tickets for vehicles CO7391KK, CA3571ET, T1213MA and CB6534EH."""
p = re.compile(r"C[AB]?\d{4}[A-Z]{2}")
L = p.findall(plates)
L
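
One limitation of the pattern above is that it also matches inside longer identifiers. A possible refinement, sketched below on an invented piece of text, is to anchor the pattern with the word boundary `\b` (not covered in this lecture). Whether this is appropriate depends on what the real data look like.

```python
import re

sample = "Tickets: CA6542HP, XC1234AA, C1234AAB and CB6534EH."

loose = re.compile(r"C[AB]?\d{4}[A-Z]{2}")
strict = re.compile(r"\bC[AB]?\d{4}[A-Z]{2}\b")

loose_hits = loose.findall(sample)
strict_hits = strict.findall(sample)

print(loose_hits)   # ['CA6542HP', 'C1234AA', 'C1234AA', 'CB6534EH']
print(strict_hits)  # ['CA6542HP', 'CB6534EH']
```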