### CS102/CS103

Prof. Götz Pfeiffer<br />
School of Mathematics, Statistics and Applied Mathematics<br />
NUI Galway

# Lecture 13: String Processing

Strings are an important type of data in any computing environment,
as they provide a highly convenient form of communicating information
between a (human) user and a (computer based) application.
We have already seen (and used) `python`'s basic string operations.
Here, we will extend that toolset by more advanced
string formatting operations.

But first, some comments on recent programming tasks.

## Practicals

...

## Simple String Processing

Recall the basic string operations.

Operator | Meaning
:-:|:-:
`+` | Concatenation
`*` | Repetition
`string[...]` | Indexing
`string[...:...]` | Slicing
`len(string)` | Length
`for c in string:` | Loop over Characters

### Example: User Names

Suppose we want to generate usernames (other than student IDs)
for user accounts on a computer. Given a person's name,
that is their first name and their last name, the person's username
should consist of their first initial, followed by up to 7 letters
of their last name.
Skipping over the details of the software development process in this case,
here is a first, simple version of a `python` function which does this.

In [1]:
def username(first, last):
    "compute a username from a person's name"    
    uname = first[0] + last[:7]
    return uname

In [2]:
username("goetz", "pfeiffer")

'gpfeiffe'

Let's try it on last week's student names.

In [3]:
# the with-statement closes the file automatically
with open("students.csv") as students_file:
    lines = students_file.readlines()

In [4]:
from random import randrange
line = lines[randrange(0,len(lines))]
line

'DOYLE,DILLON\n'

In [5]:
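def csv2name(line):
    "split a line of the form LAST,FIRST into the pair (first, last)"
    name = line.strip()        # remove the trailing newline
    name = name.split(',')     # ['LAST', 'FIRST']
    return name[1], name[0]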

In [6]:
first, last = csv2name(line)
username(first, last)

'DDOYLE'

Great!  Except that usernames should be all lowercase ...
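One possible fix is sketched below, assuming the string method `lower()` is good enough for this purpose. The function name `username_lower` is made up for this illustration and is not part of the code above.

```python
def username_lower(first, last):
    "compute an all-lowercase username from a person's name"
    uname = first[0] + last[:7]   # first initial + up to 7 letters
    return uname.lower()          # lower() returns a new, lowercase string

print(username_lower("DILLON", "DOYLE"))   # ddoyle
```

Since strings are immutable, `lower()` does not change `uname` but returns a new string.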

### Example:  Month Abbreviations

Suppose we want to compute the 3-letter abbreviation of the month's name
that corresponds to a given month number, so that `10` becomes `Oct`.
This would be just another conversion problem, but unfortunately
there is no nice conversion formula. Or is there?

Taking advantage of the fact that all 3-letter abbreviations are exactly, well,
3 letters long, a clever use of slicing gives something like a conversion formula.
Suppose we put all the months into one long string:

In [7]:
months = "JanFebMarAprMayJunJulAugSepOctNovDec"

Then we can pick one of the names as a slice of length 3:

In [8]:
months[5:5+3]

'bMa'

In [9]:
months[6:6+3]

'Mar'

How to compute the correct positions of the slice from the month number?

Month | Number | from:to
:-:|:-:|:-:
`Jan` | 1 | 0:3
`Feb` | 2 | 3:6
`Mar` | 3 | 6:9
... | ... | ...

So the slice positions are all multiples of 3. For `Jan` ($=1$)
the slice goes from $0$ to (but not including) $3$, for `Feb` ($= 2$) from $3$ 
to $6$, etc.  The **upper** slice index is 3 times the month's number.
This observation can now be implemented as follows.

In [10]:
def month(number):
    "compute a 3-letter month name"
    months = "JanFebMarAprMayJunJulAugSepOctNovDec" # lookup table
    pos = 3 * number
    name = months[pos-3:pos]
    return name
    

In [11]:
month(3)

'Mar'

In [12]:
month(10)

'Oct'
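Note that the slicing formula silently returns an empty string for numbers outside the range 1 to 12 (for example `0` or `13`). The following is a minimal sketch of a variant that rejects such input. The name `month_checked` and the choice of raising a `ValueError` are assumptions made for this illustration.

```python
def month_checked(number):
    "compute a 3-letter month name, rejecting numbers outside 1..12"
    months = "JanFebMarAprMayJunJulAugSepOctNovDec"  # lookup table
    if not 1 <= number <= 12:
        raise ValueError("month number must be between 1 and 12")
    pos = 3 * number              # upper slice index
    return months[pos-3:pos]

print(month_checked(10))   # Oct
```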

## Character Encoding

We have seen that strings are represented internally by sequences of numbers,
one for each character.  Originally, the underlying character set was limited to the
symbols you find on an American computer keyboard, standardised as the
128 characters of the ASCII code.  These days, the much larger set of symbols
in the [Unicode](https://en.wikipedia.org/wiki/Unicode) standard can be used.

`python` has built-in functions `ord()` and `chr()` that convert characters into
integer codes and back.  Looping over strings  can be used to write
encoder and decoder functions.
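As a quick illustration, here is a small round trip between characters and their codes. The particular characters are chosen arbitrarily for this sketch.

```python
code_A = ord("A")        # code of an ASCII letter
char_65 = chr(65)        # and back again
code_euro = ord("€")     # a symbol outside the ASCII range
char_euro = chr(8364)

print(code_A, char_65)         # 65 A
print(code_euro, char_euro)    # 8364 €
```

For every character `c`, the expression `chr(ord(c))` gives back `c`.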

### Encoder

In [13]:
def text2unicode(message):
    "convert a textual message into a sequence of Unicode code points"
    codes = []
    for c in message:
        code = ord(c)
        codes.append(code)
    return codes

In [14]:
text2unicode("Hi there!")

[72, 105, 32, 116, 104, 101, 114, 101, 33]

### Decoder

In [15]:
def unicode2text(codes):
    "convert a sequence of Unicode code points into text"
    message = ""
    for code in codes:
        c = chr(code)
        message = message + c
    return message

In [16]:
unicode2text([72, 105, 32, 116, 104, 101, 114, 101, 33])

'Hi there!'

Actually, while string concatenation is handy, it is not the most
efficient way to build a string.  Because strings are immutable,
new characters cannot simply be appended.  Instead, a new string object
has to be created every time a new character arrives.  It is more efficient
to collect the characters in a list and to join them in a single step at the end.

In [17]:
def unicode2text2(codes):
    "efficiently convert Unicode into text"
    chars = []
    for code in codes:
        c = chr(code)
        chars.append(c)
    message = "".join(chars)
    return message

In [18]:
unicode2text2([72, 105, 32, 116, 104, 101, 114, 101, 33])

'Hi there!'
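The same idea can be written more compactly by passing a generator expression directly to `join()`. This is only a sketch of an alternative spelling; the name `unicode2text3` is made up for this illustration.

```python
def unicode2text3(codes):
    "convert Unicode code points into text with a single join"
    # join() consumes the characters one by one and builds the string once
    return "".join(chr(code) for code in codes)

print(unicode2text3([72, 105, 32, 116, 104, 101, 114, 101, 33]))   # Hi there!
```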

## Converting Dates

Sometimes, dates are given in the format `dd/mm/yyyy`, but
a date reads nicer in the format `month day, year`.   String operations
can be used to convert one format into the other.
This time, we want to work with the full month name, and cannot use the
trick above.  Still, the names can be stored in a lookup table.

In [19]:
def nice_date(date):
    "convert dd/mm/yyyy into month day, year"
    months = [ "January", "February", "March", "April",
             "May", "June", "July", "August",
             "September", "October", "November", "December"]
    dd, mm, yyyy = date.split("/")
    day = dd
    month = months[int(mm) - 1]
    year = yyyy
    text = month + " " + day + ", " + year
    return text
    

In [20]:
nice_date("02/03/2003")


'March 02, 2003'

This works, but string concatenation is somewhat clumsy.  It is better to use **string formatting**.

In [21]:
def nice_date2(date):
    "convert dd/mm/yyyy into month day, year"
    months = [ "January", "February", "March", "April",
             "May", "June", "July", "August",
             "September", "October", "November", "December"]
    dd, mm, yyyy = date.split("/")
    text = "{1} {0}, {2}".format(dd, months[int(mm) - 1], yyyy)
    return text


In [22]:
nice_date2("17/10/2017")

'October 17, 2017'
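Both versions copy the day as it is, so a day like `02` keeps its leading zero. If that is not wanted, one option is to convert the day to an integer before formatting. The following is a sketch under that assumption; the name `nice_date3` is made up for this illustration.

```python
def nice_date3(date):
    "convert dd/mm/yyyy into month day, year, without a leading zero in the day"
    months = ["January", "February", "March", "April",
              "May", "June", "July", "August",
              "September", "October", "November", "December"]
    dd, mm, yyyy = date.split("/")
    # int("02") is 2, so the leading zero disappears when it is formatted
    return "{1} {0}, {2}".format(int(dd), months[int(mm) - 1], yyyy)

print(nice_date3("02/03/2003"))   # March 2, 2003
```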

##  String Formatting

The `format()` method applies to **format strings** and can be used to solve many common
string manipulation problems.

The format string contains **replacement fields** surrounded by curly braces (`{}`).
Anything outside curly braces is copied to the result string.
The replacement fields are replaced by the values of the arguments passed to the
`format()` call.

* The arguments of `format` are numbered `0`, `1`, ..., and can be referred to
by those numbers: `{1}` stands for the second argument:

In [23]:
"replace {1}".format("you", "me")

'replace me'

* If the positions are omitted, the replacement fields are filled with the arguments
in their original order:

In [24]:
"{} {} {}".format(1, 2, 3)

'1 2 3'
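Two further details, sketched with arbitrarily chosen example strings: a numbered field can be used more than once, and doubled braces produce literal braces in the result.

```python
# the same argument can appear in several replacement fields
both = "{0} and {1}, or {1} and {0}".format("tea", "coffee")
print(both)      # tea and coffee, or coffee and tea

# doubled braces are copied to the result as single, literal braces
literal = "{{}} is replaced by {}".format("this")
print(literal)   # {} is replaced by this
```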

Field names can be accompanied by format specifications which indicate how
the value should be presented: field width, alignment, padding, decimal precision, etc.

### Alignment

Option | Meaning
---|---
`<`| left aligned
`>`| right aligned
`^`| centered

Alignment options only have an effect if a minimum field width is given.
If the width is not specified the field width is determined by its content.
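Alignment is particularly handy for printing data in columns. Here is a small sketch; the data in `rows` and the column widths are made up for this illustration.

```python
rows = [("Jan", 31), ("Feb", 28), ("Mar", 31)]   # made-up sample data

table_lines = []
for name, days in rows:
    # name left aligned in 6 columns, number right aligned in 4 columns
    table_lines.append("{:<6}|{:>4}".format(name, days))

for table_line in table_lines:
    print(table_line)

# without a width, the alignment option has no visible effect
no_width = "{:>}".format("x")
```

The first line printed is `Jan   |  31`, and `no_width` is just `'x'`.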

In [25]:
"{0:<5}|{0:^5}|{0:>5}".format("x")

'x    |  x  |    x'

### Precision

For floating point and decimal values, a **precision** parameter
indicates how many digits should be shown.  Depending on a further
**type** parameter, these are the digits after the decimal point
(types `e` and `f`), or the significant digits in total (type `g`).

In [26]:
from math import pi
"{0:.4f}|{0:12.4e}|{0:8.6g}".format(1000*pi)

'3141.5927|  3.1416e+03| 3141.59'
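To see the effect of the type parameter more clearly, the following sketch formats two arbitrarily chosen numbers with the same precision but different types.

```python
small, large = 3.14159, 31415.9   # arbitrary sample values

fixed = "{:.2f}".format(small)        # 2 digits after the decimal point
scientific = "{:.2e}".format(large)   # 2 digits after the point, plus exponent
general_small = "{:.3g}".format(small)   # 3 significant digits
general_large = "{:.3g}".format(large)   # 3 significant digits, switches to exponent form

print(fixed, scientific, general_small, general_large)
# 3.14 3.14e+04 3.14 3.14e+04
```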

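### Padding

A **fill character** placed in front of the alignment option is used to pad the field
up to the given width.  Here, `0>2` right-aligns each value in a field of width 2 and
fills any empty position with `0`.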

In [27]:
"{:0>2}/{:0>2}/{}".format(5, 8, 2015)

'05/08/2015'
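Any character can serve as fill character, and the width is only a minimum. A short sketch with arbitrarily chosen values:

```python
centered = "{:*^8}".format("hi")            # fill with * on both sides
dotted = "{:.<10}{:>5}".format("Tea", 3)    # dots lead over to the number
too_long = "{:0>2}".format(123)             # wider values are not cut off

print(centered)   # ***hi***
print(dotted)     # Tea.......    3
print(too_long)   # 123
```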

## Summary: Strings

* Strings are **immutable** sequences of **Unicode** characters.

* Like lists, strings can be manipulated with the built-in **sequence operations** for **concatenation** (`+`),
**repetition** (`*`),
**indexing** (`[]`),
**slicing** (`[:]`),
and **length** (`len()`).

* A `for` loop can be used to iterate through the characters of a string.

* One way of **converting numerical information** into text
is to use a single string, or a list of strings as a **lookup table**.

* Strings are represented on the computer as **numerical codes**
using the Unicode standard.

* The `python` functions `ord()` and `chr()` **convert**
between Unicode codes and characters.

* Data processing often involves **string processing**.

* The **string formatting** method `format()` is particularly useful
for producing nicely formatted text.