# Representing text

In this chapter, we're going to meet another of Python's native data types &mdash; strings. You'll use strings very often indeed, and will become quite familiar with them. If they seem a little unwieldy at first, don't worry — with practice, you will learn the things you need to know, and learn how to look up the things you don't!

We often need text in our programs. For example:

- We might want to print some useful information, as we did in the previous chapter.
- We will want to read data from files, like headers in a well log data file.
- We sometimes need to pass around text items like URLs, usernames, or access keys.

We already use word-like things for variable names:

In [2]:
impedance = 8e6

So we need a way to give Python a word without it thinking it's a variable name — otherwise it'll raise a `NameError`. Python uses quotation marks for this:

In [3]:
formation = "McMurray Formation"

In this assignment, `formation` is the name of a Python string, or `str`, containing the data `"McMurray Formation"`.

The quotes can be single or double quotes, but they must match. If you need an apostrophe, then double quotes are the way to go. In general, many programmers use double quotes for sentences of natural (that is, human) language, and single quotes for short strings, especially ones that are used as keys in a Python dictionary (which you will meet in a minute), or as arguments in a Python function.

In [4]:
formation = "Boar's Head Canyon"
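As a quick check, here is a minimal sketch showing that the choice of quotation mark does not change the string itself. The names `single`, `double` and `canyon` are illustrative only.

```python
# Single and double quotes create exactly the same kind of object.
single = 'McMurray Formation'
double = "McMurray Formation"
print(single == double)  # True

# Double quotes let us include an apostrophe without any extra work.
canyon = "Boar's Head Canyon"
print(type(canyon))      # <class 'str'>
```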

## Iterables: `len` and indexing

Python strings belong to a general category of sequence-like types called 'iterables'. We'll meet other iterables later, and we'll become very familiar with how they work over the course of this book.

As the name suggests, iterables are sequences over which we can step or iterate. To put it another way, they are collections of other things. In the case of strings, they are collections of characters. 

A fundamental property of sequences is their 'length'. The function `len()`, which takes any sequence as its only argument, returns this length. Let's call `len()` on the string we called `formation`:

In [5]:
len(formation)

18

Indeed, `"Boar's Head Canyon"` has 18 characters (the quotes are required to define the string, but are not part of it).

A fundamental feature of Python sequence types is 'indexing', the action of selecting a single element from the sequence. In the case of strings, the elements are individual characters. These characters are themselves strings. Using indexing, we can ask Python for individual characters in the string. For example, let's ask for the 3rd character:

In [6]:
formation[3]

'r'

Wait a minute. `r` is not the 3rd character in `"Boar's Head Canyon"`; it's the fourth. What's going on?

## Zero-based indexing

Like several other programming languages, including C and Java &mdash; but unlike MATLAB/Octave, Fortran, and Julia &mdash; Python uses zero-based indexing. So the first element in a sequence is retrieved using `0`:

In [7]:
formation[0]

'B'

It might help to think of the index as an 'offset' from the beginning of the sequence (indeed, this is the origin of zero-based indexing in C: it's easier to address memory using offsets).

You might be surprised to hear that we can have negative indices. These are interpreted as stepping backwards from the end of the sequence, starting with `-1` for the last element, then `-2` for the second-last, and so on:

In [8]:
formation[-3]

'y'

At this point, we recommend making some strings and indexing into them. You'll quickly get a feel for how it works. We will use indexing a lot in scientific computing with Python (and with any numeric language).
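Here is one possible starting point for that experimentation: a short sketch relating positive and negative indices to each other. The variable names are illustrative only.

```python
formation = "Boar's Head Canyon"
n = len(formation)  # 18

# Positive indices count from the start, beginning at 0.
first = formation[0]        # 'B'
last = formation[n - 1]     # 'n'

# Negative indices count back from the end, beginning at -1.
also_last = formation[-1]   # 'n'
also_first = formation[-n]  # 'B'

print(first, last, also_last, also_first)

# Asking for an index beyond the end raises an IndexError.
try:
    formation[n]
except IndexError as err:
    print("IndexError:", err)
```

Notice that the largest valid index is `n - 1`, not `n`, because counting starts at zero.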

## Slicing

Often we want to work on a chunk of a sequence, not just a single element. A chunk is called a 'slice' in Python. Perhaps we'd like to extract the first few characters of a string, or a group of formations from a list, or a 10-metre section from a log.

The indexing idea extends into specifying the offsets to the start and end of the slice we want, separated by a colon, like so:

In [9]:
formation[7:11]

'Head'

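If the slice starts at the beginning of the sequence, or runs all the way to its end, we can omit the corresponding offset. (Note that an end of `-1` is not the same as omitting the end: `formation[:-1]` stops just before the last character.) So `formation[:6]` means "the first six characters of `formation`":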

In [10]:
formation[:6]

"Boar's"

Notice that `[6]` denotes the *seventh* element of the sequence, which is the first space in the string, but Python does not return this character as part of the slice. So `[:x]` can be interpreted to mean "up to but not including the character at index `x`" or, perhaps more naturally, "up to and including the $x$th character".
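A small sketch may make the "up to but not including" rule more concrete. It uses the same string as above; `head` and `tail` are illustrative names.

```python
formation = "Boar's Head Canyon"

# The slice stops just before the end offset, so index 6 (a space) is excluded.
head = formation[:6]    # "Boar's"
tail = formation[6:]    # " Head Canyon"

# Splitting at the same offset and rejoining gives back the original string.
print(head + tail == formation)   # True

# The length of a slice is simply the end offset minus the start offset.
print(len(formation[7:11]))       # 4
```

This is one practical benefit of excluding the end offset: adjacent slices never overlap and never leave a gap.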

Let's try combining some of what we've done so far. Can you predict the value of `result` after these commands?

In [11]:
start, end = 7, -1
result = formation[start:end - 2]

The slicing interface is richer yet — as well as a start and an end to the slice, we can also specify a stride. Here's how we would select every other character in a string:

In [12]:
formation[::2]

"Ba' edCno"

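The stride can also start from an offset other than zero, or be negative. The following is a brief sketch of both ideas, again using the same string; the variable names are illustrative.

```python
formation = "Boar's Head Canyon"

evens = formation[::2]       # characters at offsets 0, 2, 4, ...
odds = formation[1::2]       # characters at offsets 1, 3, 5, ...
backwards = formation[::-1]  # a negative stride steps from the end to the start

print(evens)      # Ba' edCno
print(odds)       # orsHa ayn
print(backwards)  # noynaC daeH s'raoB
```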
## Composing and concatenating

We have already seen one way to compose strings, using format strings (also known as f-strings):

In [13]:
f"{formation} Formation"

"Boar's Head Canyon Formation"

You can add formatting to the expression inside the curly braces, and you can have as many substitutions as you like:

In [25]:
age = 122.65
f"{formation:>24} Formation: {age:.0f} Ma"

"      Boar's Head Canyon Formation: 123 Ma"

What's happening here? We're inserting 2 variables into this string. In each case, the variable to substitute is named before the colon. After the colon are the formatting instructions. For `formation` we're asking for the text to be right-justified (`>`) in a column 24 characters wide. For `age`, we're asking for a float (`f`) to be represented with no decimals.
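A few other formatting instructions that you are likely to meet are sketched below. This is only a small sample of what is available, and the pipe characters are there just to make the column edges visible.

```python
formation = "Boar's Head Canyon"
age = 122.65

left = f"|{formation:<24}|"    # left-justified in a 24-character column
centre = f"|{formation:^24}|"  # centred in a 24-character column
fixed = f"{age:.3f}"           # three decimal places
wide = f"{age:8.2f}"           # two decimals, right-justified in 8 characters

print(left)        # |Boar's Head Canyon      |
print(centre)      # |   Boar's Head Canyon   |
print(fixed)       # 122.650
print(repr(wide))  # '  122.65'
```

Strings are left-justified by default and numbers are right-justified, which is why `wide` is padded on the left even though no alignment character was given.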

Format strings were new in Python 3.6, so if you're reading older code, you might see people using the `format` method:

In [26]:
"{} Formation".format(formation)

"Boar's Head Canyon Formation"

This might seem a little backwards, and a little repetitive — hence the introduction of format strings. You can use the same formatting instructions in each pair of curly braces.
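As a small sketch of that last point, the formatting instructions used in the f-string earlier should carry over unchanged to the `format` method. The names `old_style` and `new_style` are illustrative.

```python
formation = "Boar's Head Canyon"
age = 122.65

# Each pair of braces is filled, in order, by the arguments to format().
old_style = "{:>24} Formation: {:.0f} Ma".format(formation, age)
new_style = f"{formation:>24} Formation: {age:.0f} Ma"

print(old_style)
print(old_style == new_style)  # True
```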

If all we want to do is combine the string we named `formation` with the string `" Formation"`, we can just use concatenation. This is achieved with the `+` operator, which Python interprets differently for strings than for numbers:

In [27]:
formation + " Formation"

"Boar's Head Canyon Formation"

The `*` operator, which multiplies numbers, has also been 'overloaded' in a similar way:

In [28]:
5 * "-" + " Cretaceous " + 5 * "-"

'----- Cretaceous -----'
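Two details of these operators are worth checking for yourself; the sketch below is one way to do so. The variable `period` is illustrative.

```python
period = "Cretaceous"

# Repetition works with the number on either side of the string.
print(3 * "=-" == "=-" * 3)     # True

# Concatenation joins strings exactly as they are; it adds no spaces.
print("Lower" + period)         # LowerCretaceous
print("Lower" + " " + period)   # Lower Cretaceous

# A rule with one dash per character of the title.
print("-" * len(period))        # ----------
```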

## Type casting

Sometimes you need to convert something from one type to another.

For example, this gives an error:

In [38]:
duration = 79
"Length of Cretaceous = " + duration + " Ma"

TypeError: must be str, not int

Python doesn't implicitly understand that you want to turn the `79`, which is an `int`, into a string, so that you can concatenate it with the other string. Python thinks you might want to add another number to an `int`. So instead of guessing your intent, it throws a `TypeError`.

We can explicitly cast the `int` to a `str` with a built-in function called `str`:

In [39]:
"Length of Cretaceous = " + str(duration) + " Ma"

'Length of Cretaceous = 79 Ma'

As we have already seen, we can use format strings for this sort of thing. You start to appreciate why they are so useful, because the conversion of the number to a string is implicit:

In [41]:
f"Length of Cretaceous = {duration} Ma"

'Length of Cretaceous = 79 Ma'

We may also sometimes need to cast from a string to a number, perhaps after reading data from a file. Again, here's Python not knowing what to do when we use the `-` (minus) operator on strings:

In [42]:
'145' - '66'

TypeError: unsupported operand type(s) for -: 'str' and 'str'

Converting each string to a number first fixes the problem:

In [43]:
float('145') - float('66')

79.0

We could also use `int()`, but `float()` is a bit safer in this case, because we might care about decimal places in some ages.
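The difference between the two functions shows up as soon as the text contains a decimal point. Here is a minimal sketch, assuming the values arrived as strings (for example, from a file); `top` and `base` are illustrative names.

```python
top, base = "145", "66"    # e.g. values read from a text file

print(float(top) - float(base))  # 79.0
print(int(top) - int(base))      # 79

# int() rejects text with a decimal point, whereas float() accepts it.
try:
    int("145.5")
except ValueError as err:
    print("ValueError:", err)

print(float("145.5"))            # 145.5
```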

## `strip` and other methods

Python's strings have quite a few convenient methods. Just enter this into a Notebook or at an IPython prompt:

In [44]:
s = "McMurray Fm: Sand (unconsolidated),,,, \n"

Now type `s.` and press the **tab** key. You should see a list of methods pop up, starting with `s.capitalize`. Take a few minutes to explore some of them. Remember, you can find out what something does by typing, for example, `s.capitalize?` and executing.

In [45]:
s.capitalize?

You should see the following:

    Docstring:
    S.capitalize() -> str

    Return a capitalized version of S, i.e. make the first character
    have upper case and the rest lower case.
    Type:      builtin_function_or_method
    
This says that calling `capitalize()` on `s` returns another `str`, and explains how the string will be transformed. Let's try it.

In [46]:
s.capitalize()

'Mcmurray fm: sand (unconsolidated),,,, \n'

What's that `\n` doing at the end of the string? That's the newline character. If we call the `print` function on a string, those get printed as actual new lines:

In [47]:
print("PERIOD\n\nCretaceous\nJurassic\nTriassic")

PERIOD

Cretaceous
Jurassic
Triassic


Note that although we type it as two characters, it represents a single character:

In [48]:
len("PERIOD\n\n")  # PERIOD followed by 2 newlines.

8

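The backslash is Python's 'escape' character: it changes the meaning of the character that follows it. For example, it lets us put an apostrophe inside a string defined with single quotes:

In [49]: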
'this\'s'

"this's"

Some other occasions on which you might need the backslash:

- To make a tab character: `\t`.
- To type a backslash on its own, you must 'escape' it: `\\`.
- If you need to type a `'` or `"` in your string, and you can't just use the other mark to define the string itself, you must escape them with `\`. So these both work: `"She said, \"Don't do it!\""` or `'She said "Don\'t do it!"'`.
- To use a special character like `å`, you can give its octal value, `"\345"`, or its hex value, `"\xe5"`.
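The escapes in the list above can be tried out with a short sketch like this one. The strings are made up for illustration.

```python
tab = "depth\tvalue"        # \t is a single tab character
backslash = "C:\\data"      # \\ is a single backslash
quote = 'She said "Don\'t do it!"'

print(len("\t"), len("\\"))  # 1 1
print(backslash)             # C:\data
print(quote)                 # She said "Don't do it!"

# The octal and hex escapes describe the same single character.
print("\345" == "\xe5")      # True
print(len("\xe5"))           # 1
```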

As well as the newline, there are also some junk commas at the end of our string, probably left over from a CSV file or something. We can remove them by slicing:

In [50]:
s[:-6]

'McMurray Fm: Sand (unconsolidated)'

But this requires us to know how many commas there are. Another string method, `strip()`, lets us remove any number of characters from the start or end of a string. It takes one optional argument: a string containing all the characters you want to strip off, in any order.

In [51]:
s.strip(' ,\n')

'McMurray Fm: Sand (unconsolidated)'

There are also two other strip methods that only operate on the left- or right-hand end of the string, `lstrip()` and `rstrip()`.
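Here is a brief sketch of how the three methods differ. The variable `padded` is an illustrative string with some extra leading spaces, and `repr()` is used only to make the whitespace visible in the printed output.

```python
s = "McMurray Fm: Sand (unconsolidated),,,, \n"
padded = "   " + s

print(repr(padded.lstrip()))        # leading spaces removed, tail untouched
print(repr(padded.rstrip(" ,\n")))  # tail cleaned, leading spaces kept
print(repr(padded.strip(" ,\n")))   # both ends cleaned

# The argument is a set of characters, not a prefix or suffix.
print("Fm: Sand".strip("Fmd"))      # : San
```

With no argument, all three methods remove whitespace, which includes spaces, tabs and newlines.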


## Exercises

- Explore the indexing and slicing of strings. 
- Take some time to explore the various methods on Python's `str` object, especially:
  - `upper()` and `lower()`
  - `isupper()` and `islower()`
  - `startswith()` and `endswith()`
  - `find()` and `replace()`


- Predict the output of these commands:

      x = "Jurassic Period"
      x[:x.lower().find('p') - 1]

In [53]:
# SOLUTION
x = "Jurassic Period"
x[:x.lower().find('p') - 1]

'Jurassic'

## Formatting strings

ToDo: some commonly used ways of formatting strings.

## Moving on

In this chapter you've met Python's string objects, and found out about their built-in methods. If you tried the `s.split()` method, you might have noticed that it does not return another string like `s.strip()`, or a number like `s.find()`. Instead, it returns a Python `list`: in this case, a sequence of strings that results from breaking the original string wherever there is whitespace (spaces, tabs, or newlines):

In [54]:
s.split()

['McMurray', 'Fm:', 'Sand', '(unconsolidated),,,,']
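The `split()` method can also break a string at a separator of your choosing. As a minimal sketch, assuming the formation name and the lithology are always separated by a colon and a space, we could clean the string first and then split it:

```python
s = "McMurray Fm: Sand (unconsolidated),,,, \n"

clean = s.strip(" ,\n")
parts = clean.split(": ")  # break at the colon-and-space separator only

print(parts)       # ['McMurray Fm', 'Sand (unconsolidated)']
print(len(parts))  # 2
```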

In the next chapter, we'll find out all about lists.