# Chapter 8: Strings

## TL;DR

A *string* is an **immutable sequence** (= ordered collection) of characters enclosed in (double or single) quotes.

In [1]:
school = "WHU - Otto Beisheim School of Management"

Remember that everything is an object.

In [2]:
id(school)

139856617815184

In [3]:
type(school)

str

In [4]:
school

'WHU - Otto Beisheim School of Management'

## Sequences

**Sequences** are an abstract concept. In particular, any concrete data type (like `str` in this chapter) that simultaneously behaves like a container and an iterable and also implements the notion of a *length* is considered a sequence.
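As a quick check of this definition, the sketch below tests a string against the abstract base classes in the standard library's `collections.abc` module. The variable names are chosen for illustration only.

```python
from collections.abc import Container, Iterable, Sequence, Sized

text = "WHU"

# A string supports the `in` operator, can be looped over, and has a length ...
is_container = isinstance(text, Container)
is_iterable = isinstance(text, Iterable)
is_sized = isinstance(text, Sized)

# ... and `str` is also registered as a sequence in the standard library.
is_sequence = isinstance(text, Sequence)

print(is_container, is_iterable, is_sized, is_sequence)  # True True True True
```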

Being such a sequence, strings can be treated like lists in many cases. For example, [len()](https://docs.python.org/3/library/functions.html#len) tells us how many elements (i.e., characters) make up the entire string.

In [5]:
len(school)

40

They can also be traversed with a `for` loop.

In [6]:
for letter in school:
    print(letter, end=" ")

W H U   -   O t t o   B e i s h e i m   S c h o o l   o f   M a n a g e m e n t 

The `in` operator checks if a given object is a member of a sequence. In the context of strings, it checks if a single character or a shorter string (then called a **substring**) is contained in a longer string.

In [7]:
"O" in school

True

In [8]:
"WHU" in school

True

In [9]:
"EBS" in school

False

## Indexing

As strings have the additional property of being inherently *ordered*, we can index into them with integers to obtain individual letters just like we did with lists in chapter 2.

In [10]:
school[0]

'W'

In [11]:
school[1]

'H'

The index must be of type integer.

In [12]:
school[1.0]

TypeError: string indices must be integers

The last valid index is one less than the length of the string obtained above.

In [13]:
school[39]

't'

In [14]:
school[40]

IndexError: string index out of range

We can use negative indexes to start counting from the end of the string.

In [15]:
school[-1]

't'

One reason why programmers like to start counting at $0$ is that a positive index and the absolute value of its equivalent negative index always add up to the length of the sequence. Here, $6 + 34 = 40$.

In [16]:
school[6]

'O'

In [17]:
school[-34]

'O'
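The relationship between a positive index and its negative counterpart can be checked for every position at once. The following is a minimal sketch; the names `length`, `positive`, and `negative` are introduced only for this illustration.

```python
school = "WHU - Otto Beisheim School of Management"
length = len(school)

# For every valid positive index, subtracting the length gives the
# equivalent negative index.
for positive in range(length):
    negative = positive - length
    assert school[positive] == school[negative]
    assert positive + abs(negative) == length

counterpart = 6 - length
print(counterpart)  # -34
```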

## Slicing

A *slice* is a segment (i.e., a part) of a string. The **slicing operator** is just a generalization of the indexing operator. We can put one, two, or three integers within the brackets, separated by colons (":"). The three integers are then referred to as the *start*, *end*, and *step* values.

In [18]:
school[0:3]

'WHU'

Whereas the *start* is always included in the result, the *end* is not. Counter-intuitive at first, this makes working with slices easier: two slices that share a boundary index add up to the original string again (= "string concatenation"). As the *end* is not included, we have to end the second slice with $40$ below.

In [19]:
school[0:3] + school[3:40]

'WHU - Otto Beisheim School of Management'

For convenience, the indexes do not need to lie in the range from 0 to the string's length when slicing. This is different from indexing, where an out-of-range index raises an `IndexError`.

In [20]:
school[0:999]

'WHU - Otto Beisheim School of Management'

Commonly, we leave out the *start* if it is $0$ and the *end* if it is equal to the length.

In [21]:
school[:3] + school[3:]

'WHU - Otto Beisheim School of Management'

Slicing makes it easy to obtain shorter versions of the original string.

In [22]:
school[:3] + school[5:26]

'WHU Otto Beisheim School'

A *step* value of $i$ can be used to obtain only every $i$th letter.

In [23]:
school[::2]

'WU-Ot esemSho fMngmn'

A negative step size reverses the order of the sequence.

In [24]:
school[::-1]

'tnemeganaM fo loohcS miehsieB ottO - UHW'
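The examples above never use all three slice values together. As a small sketch on a made-up string (`letters` is not part of the chapter's running example), the values can be combined like this:

```python
letters = "abcdefghij"

# start=1, end=8, step=3 selects the indexes 1, 4, and 7.
every_third = letters[1:8:3]

# With a negative step, the start must lie to the right of the end.
# Here, the indexes 8, 5, and 2 are selected.
backwards = letters[8:1:-3]

print(every_third)  # beh
print(backwards)    # ifc
```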

## Immutability

Whereas elements of **a list can be re-assigned**, as briefly hinted at in chapter 2 (and covered in much more depth in the next chapter), this is **not possible for strings**. Once created, they **cannot be changed**. Why this is useful is discussed extensively in chapter 11 on **dictionaries**.

In [25]:
school[0] = "E"

TypeError: 'str' object does not support item assignment

The only thing we can do is to create a *new* object in memory.

In [26]:
new_school = "EBS" + school[3:]

In [27]:
new_school

'EBS - Otto Beisheim School of Management'

In [28]:
id(new_school)

139856617815088

In [29]:
id(school)

139856617815184

## String Operations

As mentioned before, the `+` and `*` operators are overloaded: `+` is used for string concatenation and `*` for string repetition.

In [30]:
greeting = "Hello "

In [31]:
greeting + school[:3]

'Hello WHU'

In [32]:
10 * school[:4]

'WHU WHU WHU WHU WHU WHU WHU WHU WHU WHU '

## String Methods

Objects of type string come with many functions **bound** to them, also called **methods** (see the [documentation](https://docs.python.org/3/library/stdtypes.html#string-methods) for a full list). They basically work like *normal* functions and are accessed via the **dot operator**. Calling a method is also known as **method invocation**.

The [find()](https://docs.python.org/3/library/stdtypes.html#str.find) method returns the index of the first occurrence of a character or a substring. If no match is found, it returns $-1$.

In [33]:
school.find("O")

6

In [34]:
school.find("z")

-1

In [35]:
school.find("Beisheim")

11

[find()](https://docs.python.org/3/library/stdtypes.html#str.find) takes optional *start* and *end* indices that restrict the search to a part of the string. This allows us to find occurrences other than the first.

In [36]:
school.find("e")

12

In [37]:
school.find("e", 13)  # 13 not 12 as otherwise the same character is found

16

In [38]:
school.find("e", 13, 15)  # "e" does not occur between index 13 and 15

-1
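Building on the *start* argument, all occurrences of a substring can be collected by repeatedly searching to the right of the previous match. The helper `find_all()` below is a hypothetical function written for this illustration; it is not part of Python's standard library.

```python
def find_all(text, sub):
    """Return the indexes of all (possibly overlapping) occurrences of sub in text."""
    positions = []
    index = text.find(sub)
    while index != -1:
        positions.append(index)
        # Continue the search one position to the right of the last match.
        index = text.find(sub, index + 1)
    return positions


school = "WHU - Otto Beisheim School of Management"
e_positions = find_all(school, "e")
print(e_positions)  # [12, 16, 35, 37]
```

The loop ends as soon as `find()` returns $-1$, so a substring that never occurs yields an empty list.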

[count()](https://docs.python.org/3/library/stdtypes.html#str.count) does what we would expect.

In [39]:
school.count("o")

4

As [count()](https://docs.python.org/3/library/stdtypes.html#str.count) is case-sensitive, we need to **chain** it with either the [lower()](https://docs.python.org/3/library/stdtypes.html#str.lower) or [upper()](https://docs.python.org/3/library/stdtypes.html#str.upper) methods we have seen before, to get the count of all "o"s.

In [40]:
school.lower().count("o")

5

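Because strings are immutable, string methods never change a string in place. Instead, they return a string object as their result. In the example below, [lower()](https://docs.python.org/3/library/stdtypes.html#str.lower) returns a new object even though it does not change the content at all. Whether an unchanged result is a new object is an implementation detail and should not be relied upon.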

In [41]:
example = "test"

In [42]:
id(example)

139856825720704

In [43]:
lower = example.lower()

In [44]:
id(lower)

139856272789432

In [45]:
example is lower

False

In [46]:
example == lower

True

Another popular string method is [split()](https://docs.python.org/3/library/stdtypes.html#str.split), which separates a string into a list of smaller strings. Without an argument, it splits at whitespace.

In [47]:
for word in school.split():
    print(word)

WHU
-
Otto
Beisheim
School
of
Management


The opposite of splitting can be achieved with the [join()](https://docs.python.org/3/library/stdtypes.html#str.join) method. It is typically invoked on a string that represents some sort of separator and connects the strings in the list (or other iterable) passed to it as its argument.

In [48]:
words = ["This", "will", "become", "a", "sentence"]

In [49]:
sentence = " ".join(words)

In [50]:
sentence

'This will become a sentence'

With the [replace()](https://docs.python.org/3/library/stdtypes.html#str.replace) method, we can replace parts of a string.

In [51]:
sentence.replace("will become", "is")

'This is a sentence'
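The following sketch combines the methods above on made-up data (`record` and `numbers` are assumptions for this illustration). It shows `split()` with an explicit separator and the fact that `join()` only accepts strings.

```python
record = "red,green,blue"

# Passing a separator to split() cuts the string at every comma.
fields = record.split(",")

# join() only accepts strings, so other objects must be converted first.
numbers = [1, 2, 3]
joined = "-".join(str(number) for number in numbers)

# Splitting and joining with the same separator restores the original.
restored = ",".join(fields)

print(fields)              # ['red', 'green', 'blue']
print(joined)              # 1-2-3
print(restored == record)  # True
```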

## String Comparison

The relational operators also work with strings (another example of overloading). Comparison is done one character at a time until the first pair differs or one string ends. However, the resulting order may look "weird" at first. The reason for this is that computers store characters internally as numbers (after all, they only understand $0$s and $1$s, as we saw in the chapter on numbers), and strings are compared by these numbers. Which number represents which character depends on the character encoding. Commonly, the characters and symbols used in American English are encoded with the numbers 0 through 127, the so-called [ASCII standard](https://en.wikipedia.org/wiki/ASCII). However, Python works with the more general [Unicode/UTF-8 standard](https://en.wikipedia.org/wiki/UTF-8) that covers the characters of virtually all writing systems and even emojis.

In [52]:
A = "Apple"  # let's ignore snake_case for this example
a = "apple"
B = "Banana"

In [53]:
A < B

True

In [54]:
a < B

False

One way to fix this is to only compare lower-case strings.

In [55]:
a < B.lower()

True

To provide a simple intuition for the "weird" sorting above, let's think of the English alphabet as being represented by the numbers listed below. As every upper-case letter has a smaller number than any lower-case letter, "Banana" is clearly "smaller" than "apple".

In [56]:
for upper_i in range(65, 91):
    lower_i = upper_i + 32  # each lower-case character is offset by 32
    upper_char = chr(upper_i)  # from its upper-case counterpart
    lower_char = chr(lower_i)
    print(f"{upper_char} -> {upper_i} \t {lower_char} -> {lower_i}")

A -> 65 	 a -> 97
B -> 66 	 b -> 98
C -> 67 	 c -> 99
D -> 68 	 d -> 100
E -> 69 	 e -> 101
F -> 70 	 f -> 102
G -> 71 	 g -> 103
H -> 72 	 h -> 104
I -> 73 	 i -> 105
J -> 74 	 j -> 106
K -> 75 	 k -> 107
L -> 76 	 l -> 108
M -> 77 	 m -> 109
N -> 78 	 n -> 110
O -> 79 	 o -> 111
P -> 80 	 p -> 112
Q -> 81 	 q -> 113
R -> 82 	 r -> 114
S -> 83 	 s -> 115
T -> 84 	 t -> 116
U -> 85 	 u -> 117
V -> 86 	 v -> 118
W -> 87 	 w -> 119
X -> 88 	 x -> 120
Y -> 89 	 y -> 121
Z -> 90 	 z -> 122
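The numbers in this table can also be looked up with the built-in `ord()` function, the inverse of `chr()`. As a sketch, the example below additionally sorts a made-up list (`fruits`) with and without taking the case into account.

```python
fruits = ["apple", "Banana", "Apple"]

# ord() is the inverse of chr() and returns a character's number.
codes = [ord(char) for char in "aAB"]

# By default, the upper-case ASCII letters sort before the lower-case ones.
default_order = sorted(fruits)

# Comparing lower-case versions only gives a case-insensitive order.
ignoring_case = sorted(fruits, key=str.lower)

print(codes)          # [97, 65, 66]
print(default_order)  # ['Apple', 'Banana', 'apple']
print(ignoring_case)  # ['apple', 'Apple', 'Banana']
```

Because `sorted()` is stable, "apple" stays in front of "Apple" in the last result: both have the same lower-case version and keep their original relative order.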


## String Interpolation

The previous code cell shows an example of a so-called **f-string** (introduced by [PEP 498](https://www.python.org/dev/peps/pep-0498/)).

So far, we have used the [print()](https://docs.python.org/3/library/functions.html#print) function only with normal strings (e.g., "example") or variables. Sometimes, it is more convenient to fill in a "draft" of a string with a value determined only at runtime. This process is called **string interpolation**. There are three ways to achieve that but only two are commonly used these days.

### f-strings

f-strings are the new and most readable way. Just prepend the string with an `f` and put variables / expressions within curly braces.

In [57]:
name = "Alexander"
time_of_day = "morning"
pi = 3.141592653

In [58]:
print(f"Hello {name}! Good {time_of_day}.")

Hello Alexander! Good morning.


Separated from the variable by a colon, many formatting options are available. In the beginning, only the ability to round is important. It can be achieved by adding `:.2f` to the variable name, which formats the number as a float rounded to two decimal places.

In [59]:
print(f"Pi is {pi:.2f}")

Pi is 3.14


In [60]:
print(f"Pi is {pi:.3f}")

Pi is 3.142
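Beyond rounding, a few other format specifications come up frequently. The sketch below uses made-up values (`revenue`, `share`, and `label` are assumptions for this illustration) to show a thousands separator, a percentage, and alignment within a fixed width.

```python
revenue = 1234567.891
share = 0.256
label = "WHU"

with_separator = f"{revenue:,.2f}"  # thousands separator and two decimals
as_percent = f"{share:.1%}"         # percentage with one decimal
right_aligned = f"[{label:>8}]"     # right-aligned in a field of width 8
left_aligned = f"[{label:<8}]"      # left-aligned in a field of width 8

print(with_separator)  # 1,234,567.89
print(as_percent)      # 25.6%
print(right_aligned)   # [     WHU]
print(left_aligned)    # [WHU     ]
```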


### format() Method

String objects also provide a [format()](https://docs.python.org/3/library/stdtypes.html#str.format) method that accepts an arbitrary number of positional arguments. They replace the curly brackets in the string in the order in which they are passed. See the [Python Documentation](https://docs.python.org/3/library/string.html#formatspec) for a full specification. This is the traditional way of string interpolation and many code examples on the internet use it.

In [61]:
print("Hello {}! Good {}.".format(name, time_of_day))

Hello Alexander! Good morning.


Use index numbers if the order is different in the draft string.

In [62]:
print("Good {1}, {0}".format(name, time_of_day))

Good morning, Alexander


The [format()](https://docs.python.org/3/library/stdtypes.html#str.format) method can alternatively be used with keyword arguments. Then we need to put the keyword names within the curly brackets.

In [63]:
print("Hello {name}! Good {time}.".format(name=name, time=time_of_day))

Hello Alexander! Good morning.


Numbers are treated as in the f-strings case.

In [64]:
print("Pi is {:.2f}".format(pi))

Pi is 3.14


## Special Characters

Some symbols have a special meaning in strings. Most notable are the newline (`\n`) and tab (`\t`) "characters". The backslash symbol is also referred to as an **escape character** in this context, indicating that the following character has a meaning other than its literal meaning.

In [65]:
print("This is a sentence\nthat is printed\non three lines.")

This is a sentence
that is printed
on three lines.


In [66]:
print("Words\taligned\twith\ttabs.")

Words	aligned	with	tabs.


As emojis are important as well, they can be inserted with the corresponding **Unicode code point** number, written as `\U` followed by eight hexadecimal digits. See this [list](https://en.wikipedia.org/wiki/List_of_Unicode_characters) of Unicode characters for an overview.

In [67]:
print("\U0001f604")

😄
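To make the link between the escape sequence and the code point more tangible, the following sketch converts between the two with the built-in `ord()` and `chr()` functions. The variable names are chosen for illustration only.

```python
smiley = "\U0001f604"

# The escape sequence produces a single character ...
length = len(smiley)

# ... whose code point can be converted back and forth with ord() and chr().
code_point = hex(ord(smiley))
same_character = chr(0x1F604) == smiley

# Code points that fit into four hexadecimal digits can use the shorter "\u" form.
short_form = "\u00e4" == chr(0xE4)

print(length)          # 1
print(code_point)      # 0x1f604
print(same_character)  # True
print(short_form)      # True
```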


## Raw Strings

Sometimes we want the backslash and its following character to not be converted into special characters.

For example, let's print a typical installation path on a Windows system. Obviously, the new line does not make sense here.

In [68]:
print("C:\Programs\new_application")

C:\Programs
ew_application


Some strings even produce a `SyntaxError` because the "\U..." cannot be converted into a Unicode code point.

In [69]:
print("C:\Users\Administrator\Desktop\Project_Folder")

SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape (<ipython-input-69-0aad8a365b02>, line 1)

A simple solution would be to just escape the escape character with another backslash.

In [70]:
print("C:\\Programs\\new_application")

C:\Programs\new_application


In [71]:
print("C:\\Users\\Administrator\\Desktop\\Project_Folder")

C:\Users\Administrator\Desktop\Project_Folder


However, Python allows us to treat any string literal in its "raw" meaning by prefixing it with an `r`. This way, we do not have to modify the text within the quotes, which is often the better solution.

In [72]:
print(r"C:\Programs\new_application")

C:\Programs\new_application


In [73]:
print(r"C:\Users\Administrator\Desktop\Project_Folder")

C:\Users\Administrator\Desktop\Project_Folder
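A raw string is not a different data type. The prefix only changes how the literal in the source code is read. The short sketch below, with names chosen for illustration, compares an escaped and a raw literal.

```python
escaped = "C:\\new"
raw = r"C:\new"

# Both literals describe the very same six characters.
same = escaped == raw
raw_length = len(raw)

# Without the prefix, "\n" is a single newline character.
newline_length = len("\n")
raw_newline_length = len(r"\n")

print(same, raw_length)                    # True 6
print(newline_length, raw_newline_length)  # 1 2
```

One caveat: a raw string literal cannot end in a single backslash, so `r"C:\"` is a `SyntaxError`.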


## Multi-line Strings

Sometimes it is better to split a string's content across multiple lines. This is done with triple-double or triple-single quotes. Docstrings are exactly that (by convention always in triple-double quotes).

In [74]:
multi_line = """
I am a multi-line string
consisting of 4 lines.
"""

Linebreaks are kept and implicitly converted into "\n".

In [75]:
multi_line

'\nI am a multi-line string\nconsisting of 4 lines.\n'

So we see two empty lines when we print this string: one before and one after the text.

In [76]:
print(multi_line)


I am a multi-line string
consisting of 4 lines.



Using [split()](https://docs.python.org/3/library/stdtypes.html#str.split) with the optional *sep* argument confirms that `multi_line` consists of four lines with the first linebreak being the very first character in the string.

In [77]:
for i, line in enumerate(multi_line.split("\n")):
    print(i, line)

0 
1 I am a multi-line string
2 consisting of 4 lines.
3 
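If the surrounding empty lines are not wanted, they can be removed before splitting. The sketch below uses the `strip()` and `splitlines()` string methods, which have not been introduced in this chapter so far; the names `stripped` and `lines` are chosen for illustration.

```python
multi_line = """
I am a multi-line string
consisting of 4 lines.
"""

# strip() removes leading and trailing whitespace, including linebreaks.
stripped = multi_line.strip()

# splitlines() splits at linebreaks and yields no trailing empty string.
lines = stripped.splitlines()
unstripped_count = len(multi_line.splitlines())

print(lines)             # ['I am a multi-line string', 'consisting of 4 lines.']
print(unstripped_count)  # 3
```

Unlike `split("\n")`, `splitlines()` does not produce an empty string for the final linebreak, which is why it returns three instead of four elements for the original string.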
