# String Manipulation

Staff : Joshua Schäuble <br>
Support Material : [08_text_manipulation.ipynb](https://github.com/dtaantwerp/dtaantwerp.github.io/blob/master/exercises2021/08_text_manipulation.ipynb) <br/>
Support Sessions :  Thursday October 7 <br>


## 1. Introduction
As we learned before, strings are a *data type* (other data types we looked at: `bool`, `int`, `float`, `tuple`, `list`, `dict`, ...). In this session we will learn how to manipulate strings by looking at two of their main characteristics: (1) Strings are ordered sequences of characters and (2) strings are so-called objects. The latter will be explained in detail only in week 3. However, we can already attempt to understand what it means in practice today.

## 2. Strings as Sequences / Sequence Operations on Strings

Strings are a sequential data type. A string is an ordered sequence of characters. We can use this characteristic to manipulate strings with **sequence operations**, some of which we have already used in previous sessions. For example, we can add strings together or multiply them. The most important sequence operations will be introduced here:

### 2.1 Concatenating strings with `+`
When applied to numbers (integer or float), the plus operator (`+`) performs a mathematical addition. When used on strings, the `+` operator concatenates two or more strings into a new one.

In [2]:
greeting = "Welcome "
name = "Jane"
sentence = "How are you doing?"

all_together = greeting + name + ". " + sentence
print(all_together)

Welcome Jane. How are you doing?


### 2.2 Multiplying strings with `*` 
Although this is not used very often, strings can also be multiplied with the multiplication operator, which we also know from numerical data types:

In [3]:
pete_and_repeat = "Pete and Repeat are sitting in a room. Pete leaves. Who is left? Repeat. \n"
eternity = 20 * pete_and_repeat
print (eternity)

Pete and Repeat are sitting in a room. Pete leaves. Who is left? Repeat. 
Pete and Repeat are sitting in a room. Pete leaves. Who is left? Repeat. 
Pete and Repeat are sitting in a room. Pete leaves. Who is left? Repeat. 
Pete and Repeat are sitting in a room. Pete leaves. Who is left? Repeat. 
Pete and Repeat are sitting in a room. Pete leaves. Who is left? Repeat. 
Pete and Repeat are sitting in a room. Pete leaves. Who is left? Repeat. 
Pete and Repeat are sitting in a room. Pete leaves. Who is left? Repeat. 
Pete and Repeat are sitting in a room. Pete leaves. Who is left? Repeat. 
Pete and Repeat are sitting in a room. Pete leaves. Who is left? Repeat. 
Pete and Repeat are sitting in a room. Pete leaves. Who is left? Repeat. 
Pete and Repeat are sitting in a room. Pete leaves. Who is left? Repeat. 
Pete and Repeat are sitting in a room. Pete leaves. Who is left? Repeat. 
Pete and Repeat are sitting in a room. Pete leaves. Who is left? Repeat. 
[...] (the sentence is printed 20 times in total)

Can you guess what will be printed by executing the block below?

In [6]:
char1 = "a"
char2 = "b"
char3 = "c"
print(3*(2*(char1+char2)+char3))

ababcababcababc
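
The nested expression above can be unpacked one step at a time, from the innermost brackets outwards. The following is only a sketch; the intermediate variable names (`inner`, `doubled`, `group`, `result`) are introduced here purely for illustration.

```python
char1 = "a"
char2 = "b"
char3 = "c"

inner = char1 + char2     # concatenation: "ab"
doubled = 2 * inner       # multiplication: "abab"
group = doubled + char3   # concatenation: "ababc"
result = 3 * group        # multiplication: "ababcababcababc"
print(result)
```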


### 2.3 Slicing Strings via the Character Index

Strings are ordered sequences of characters. This means that every character of the sequence has a position, or index. The index starts at 0 and increases from left to right.

| mytext =  | " | H | e | l | l | o |   | W | o | r | l | d  | " |
|-----------|---|---|---|---|---|---|---|---|---|---|---|----|---|
| Index     |   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |   |

The following examples will show how we can cut out slices of a string by using the index.

In [7]:
mytext = "Hello World"
print(mytext[0])       # H
print(mytext[0:5])     # Hello       note: the last character printed is the one at index 4 (not 5!)
print(mytext[6:11])    # World       note: the last character printed is the one at index 10 (there is no index 11!)
print(mytext[6:11:2])  # Wrd         index 6 up to (but not including) 11, only every second character
print(mytext[2:])      # llo World   from index 2 until the end

H
Hello
World
Wrd
llo World


We can also access the index from right to left, using negative numbers. The last character always has the index `-1`.

| mytext =  | " | H   | e   | l  | l  | o  |    | W  | o  | r  | l  | d  | " |
|-----------|---|-----|-----|----|----|----|----|----|----|----|----|----|---|
| Index     |   | -11 | -10 | -9 | -8 | -7 | -6 | -5 | -4 | -3 | -2 | -1 |   |

In [8]:
print (mytext[-1])
print (mytext[-5:-1]) # note: the last character printed is the index -2!
print (mytext[-5:])

d
Worl
World


Note:
- Take care! Don't access non-existent indexes. `print(mytext[-15])` will raise an `IndexError`, because `mytext` has only 11 characters!
- We can only use the index to access the sequence, but not to alter it directly:

In [9]:
myword = "mouse"
myword[0] = "h"  # error! to replace the m with an h, use a string method (see below, e.g. str.replace())
print(myword)    # this line is never reached

TypeError: 'str' object does not support item assignment
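
The two notes above can be tried out safely with the following sketch. It assumes the `try`-`except` construct (covered later in the course) to catch the error instead of stopping the program; the variable names `clipped` and `newword` are made up for this illustration.

```python
mytext = "Hello World"

# Accessing a single index outside the valid range raises an IndexError ...
try:
    print(mytext[-15])
except IndexError as error:
    print("IndexError:", error)

# ... whereas a slice is simply clipped to the characters that exist.
clipped = mytext[6:100]
print(clipped)  # World

# A string cannot be altered in place, but we can build a NEW string
# from a replacement character and a slice of the old string.
myword = "mouse"
newword = "h" + myword[1:]
print(newword)  # house
print(myword)   # mouse (the original is unchanged)
```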

### 2.4 The functions `len(string)`, `min(string)` and `max(string)`
The function `len(string)` takes a string as an argument and returns the length (= number of characters) of the string as an integer:

In [11]:
catbreed = "Siamese Cat"
len(catbreed) # note: the whitespace is also counted! 

11

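The function `min(string)` returns the smallest character of the string. Correspondingly, the function `max(string)` returns the largest character in the string. Characters are compared by their Unicode code points.

For the most common characters, the order (small to large) is roughly: whitespace --> most special characters (e.g. `!`, `,`, `.`) --> digits 0-9 --> capital letters A-Z --> lowercase letters a-z. A few special characters break this pattern: `_`, for example, lies between the capital and the lowercase letters, and `~` comes after `z`.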

In [12]:
word1 = "Baby"
print (min(word1)) # B
print (max(word1)) # y
word1 += "!"
print (min(word1)) # !

mysentence = "These 11 apes ran down the street!"
print( min(mysentence) ) # whitespace!
print( max(mysentence) ) # w


B
y
!
 
w
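
To see where this order comes from, we can look at the code point of each character. The sketch below assumes the built-in functions `ord()` and `sorted()`, which have not been introduced in this session; the selection of characters is arbitrary.

```python
# ord() returns the Unicode code point (an integer) of a single character.
for char in [" ", "!", "1", "B", "_", "a", "y", "~"]:
    print(repr(char), ord(char))

# sorted() orders the characters of a string by the same rule
# that min() and max() use.
ordered = sorted("Baby!")
print(ordered)  # ['!', 'B', 'a', 'b', 'y']
```

The first element of the sorted list is what `min()` returns, the last element is what `max()` returns.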


### 2.5 The comparison operators `in` and `not in`

With the operators `in` and `not in` we can test if a character or string is (not) included in another string. The comparison returns a boolean value (`True` or `False`).

In [13]:
word = "death"
print ("eat" in word)      #True
print ("eat" not in word)  #False
print ("cat" in word)      #False
print ("cat" not in word)  #True

True
False
False
True
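
Note that `in` compares characters exactly, so uppercase and lowercase letters are treated as different. A minimal sketch, assuming the method `lower()` (introduced in section 3.5) to ignore the case; the variable names `strict` and `relaxed` are made up for this example:

```python
word = "Death"

strict = "death" in word           # "D" and "d" are different characters
relaxed = "death" in word.lower()  # compare against an all-lowercase copy

print(strict)   # False
print(relaxed)  # True
```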


## 3. Strings as Objects: Manipulating Strings with *Object-Methods*

Strings are *objects*. To understand objects, we need more concepts than we currently know. Hence, we will only learn in week 3 what precisely objects are. However, in this lecture we can already try to understand string objects simply by using them.  

Each instance of an object (here: each *string*) has predefined object functions (so-called *methods*) attached to it. These methods can be used to get information about the object and to manipulate the object. 

In comparison to normal functions (such as `print(string)` and `len(string)`), methods are always attached to a specific instance of the object. Therefore, the object does not need to be passed into the function as a parameter (as we would do in `print(string)`). Instead, the method is called on a specific instance (this will become clear in a moment).

To explain the difference between a normal function and a method, we can use the following metaphor. A normal function is like a speed camera at the side of the road. It detects the speed of any car with speed sensors. It generally works on all cars, but not on animals that pass the camera. A method, on the other hand, is like the speedometer of a specific car. It also measures the speed, but only of the specific car (= object) it is connected to. Every car object has its own speedometer attached to it.

Following the same logic, string objects have specific methods attached to them. A selection of these methods will be introduced here.


### 3.1 `string.format()`
The method `string.format()` allows us to add placeholders to a string and fill them on demand.

In [14]:
name = input("please enter your name: ")
welcome = "Hello {0}, welcome to my website!".format(name)
print(welcome)

please enter your name: m
Hello m, welcome to my website!


In [15]:
name = input("Please enter your name: ")
age = input("Please enter your age: ")
welcome = "Hello {0}. You are {1} years old, right? You are the only {0} I know, who is {1} years old".format(name,age)

print(welcome)

Please enter your name: d
Please enter your age: d
Hello d. You are d years old, right? You are the only d I know, who is d years old


Note: `string.format()` accepts integers and automatically transforms them into strings (= type casting). This does not work with the plus operator we learned above:

In [20]:
my_name = "Peter"  # type: string
my_age = 65  # type: integer

In [21]:
# this does not work! we need to explicitly transform the integer my_age into a string with str(my_age)
welcome = "Hello " + my_name + ". You are " + my_age + " years old."

TypeError: can only concatenate str (not "int") to str

In [22]:
# this, however, works without explicitly casting the integer to a string:
welcome2 = "Hello {0}. You are {1} years old".format(my_name, my_age)
print(welcome2)

Hello Peter. You are 65 years old
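
For comparison, here is a sketch of both approaches side by side. With the plus operator the integer has to be converted explicitly with `str()`; with `format()` the conversion happens automatically. The variable names `welcome_plus` and `welcome_format` are chosen for this illustration only.

```python
my_name = "Peter"  # type: string
my_age = 65        # type: integer

# plus operator: the integer must be cast to a string by hand
welcome_plus = "Hello " + my_name + ". You are " + str(my_age) + " years old."

# format(): the integer is converted automatically
welcome_format = "Hello {0}. You are {1} years old.".format(my_name, my_age)

print(welcome_plus)
print(welcome_format)
```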

### 3.2 `string.count(substring)`
`string.count(substring)` allows us to count how often a given substring occurs in the string:

In [23]:
myword = "butterpot"
countTs = myword.count("t")
countTTs = myword.count("tt")
print ("The word {0} contains {1} times 't' and {2} times 'tt'.".format(myword,countTs, countTTs))

The word butterpot contains 3 times 't' and 1 times 'tt'.


In [24]:
myText = "Paranoids are not paranoid because they’re paranoid, but because they keep putting themselves deliberately into paranoid situations."
print(myText.count("paranoid"))  # the capitalised "Paranoids" is not counted: count() is case-sensitive

3
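
If the capitalised occurrence should be counted as well, one possible approach is to count in a lowercase copy of the text. This is a sketch that assumes the method `lower()` from section 3.5; the variable names are hypothetical.

```python
my_text = "Paranoids are not paranoid because they’re paranoid, but because they keep putting themselves deliberately into paranoid situations."

case_sensitive = my_text.count("paranoid")            # ignores "Paranoids"
case_insensitive = my_text.lower().count("paranoid")  # also finds "Paranoids"

print(case_sensitive)    # 3
print(case_insensitive)  # 4
```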


### 3.3 `string.replace(what, with)` and `string.replace(what, with, count)`
`string.replace()` allows us to replace selected parts of a string with another string. We can either replace all occurrences or define a maximum number of replacements.

In [25]:
my_text = "Death"
new_text = my_text.replace("eat", "art")
print (new_text)

Darth


In [26]:
my_text = "Hurry Harry! We are late"
print(my_text.replace("r","mph",3))  # the first 3 r's are replaced by mph

Humphmphy Hamphry! We are late
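
Because strings cannot be altered in place (see section 2.3), `replace()` never changes the string it is called on. It returns a new string, which we have to store if we want to keep it. A small sketch with made-up variable names:

```python
original = "Death"

# The result of this call is thrown away, "original" stays as it is.
original.replace("eat", "art")
print(original)  # Death

# To keep the result, assign it to a variable (or back to the same name).
changed = original.replace("eat", "art")
print(changed)   # Darth
```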


### 3.4 Find the index of a substring:  `string.find()`, `string.rfind()` (and `string.index()`)
Sometimes it is useful to identify where within a string we find a given substring. To identify the index (= position), we can use the methods `string.find(what)` and `string.rfind(what)`. 

- `string.find(what)` returns the lowest index of the string *what* in the *string*. If *what* is not found, it returns -1.

- `string.rfind(what)` returns the highest index of the string *what* in the *string*. If *what* is not found, it returns -1.

In [28]:
my_word = "Butterbeer"
print(my_word.find("e"))   # index of the first e in my_word
print(my_word.rfind("e"))  # index of the last e in my_word
print(my_word.find("q"))   # there is no q in my butterbeer!

4
8
-1


Equally, we can use `string.index(what)` and `string.rindex(what)`. However, these two methods raise an error (a `ValueError`) if the searched substring is not found.

> *Tip*: Only use these two, if you are certain that the substring occurs in the string, or if you add error-handling (`try`-`except` - covered later in the course).
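
The difference between the two families of methods can be sketched as follows. The example assumes the `try`-`except` construct mentioned in the tip; the flag `index_failed` is a hypothetical name used only to record what happened.

```python
my_word = "Butterbeer"

# find() signals "not found" with the return value -1
position = my_word.find("q")
if position == -1:
    print("find: there is no q")

# index() signals "not found" by raising a ValueError
index_failed = False
try:
    position = my_word.index("q")
except ValueError:
    index_failed = True
    print("index: there is no q")
```

Keep in mind that -1 is also a valid index (the last character), so the result of `find()` should be checked before it is used for slicing.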

### 3.5 Cases: Switch from uppercase to lowercase

Python provides a series of string methods to change cases:

- `string.capitalize()`: changes the first character of the string to uppercase and all other characters to lowercase
- `string.title()`: changes the first character of **each word** to uppercase and all other characters to lowercase
- `string.upper()`: changes all characters of the string to uppercase
- `string.lower()`: changes all characters of the string to lowercase
- `string.swapcase()`: changes the case of all characters - uppercase to lowercase and vice versa
- `string.casefold()`: more aggressive lowercase-transformation for special characters, e.g. German ß --> ss. This method is useful to test if two strings differ only in their cases.

In [29]:
my_sentence = "this is a sentence."
print(my_sentence.capitalize())  # This is a sentence.
print(my_sentence.title())       # This Is A Sentence.

my_sentence = my_sentence.upper()
print(my_sentence)               # THIS IS A SENTENCE.
print(my_sentence.lower())       # this is a sentence.

my_sentence = my_sentence.title()
print(my_sentence)               # This Is A Sentence.
print(my_sentence.swapcase())    # tHIS iS a sENTENCE.

german_word = "Fluß"
german_word2 = "fluss"
# casefold() turns ß into ss, so both words compare as equal
if german_word.casefold() == german_word2.casefold():
    print("{0} and {1} are basically the same.".format(german_word, german_word2))
else:
    print("the two words are different.")

This is a sentence.
This Is A Sentence.
THIS IS A SENTENCE.
this is a sentence.
This Is A Sentence.
tHIS iS a sENTENCE.
Fluß and fluss are basically the same.
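
The comparison with `casefold()` can be wrapped in a small helper. This is only a sketch: the function name `same_ignoring_case` is hypothetical, and defining our own functions is not part of this session.

```python
def same_ignoring_case(first, second):
    """Return True if the two strings differ at most in their case."""
    return first.casefold() == second.casefold()

with_casefold = same_ignoring_case("Fluß", "FLUSS")
with_lower = "Fluß".lower() == "FLUSS".lower()

print(with_casefold)  # True:  casefold() turns ß into ss
print(with_lower)     # False: lower() keeps the ß
```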


***Additionally***, there are some methods to check what case the string is in. These methods return `True` or `False`, i.e. a boolean:

- `string.islower()`: returns `True` if all letters are lowercase.
- `string.isupper()`: returns `True` if all letters are uppercase.
- `string.istitle()`: returns `True` if the first character of each word is uppercase and all other characters are lowercase.

In [31]:
print( "this is a sentence".islower()) #True
print( "this is A sentence".islower()) #False
print( "this is a sentEnce".islower()) #False

True
False
False


In [32]:
print( "THIS IS A SENTENCE".isupper()) #True
print( "This Is A Sentence".isupper()) #False
print( "THIS IS A FLUß".isupper()) #False  --> ß is considered a lowercase character

True
False
False


In [33]:
print( "This Is A Sentence".istitle()) #True
print( "This Is a Sentence".istitle()) #False
print( "THIS IS A SENTENCE".istitle()) #False

True
False
False


### 3.6 Handling Whitespaces
Whitespaces are often a problem, for example when we want to test whether two strings are the same except for double whitespaces and linebreaks. To handle such whitespace problems, Python provides a series of string methods.

- `string.strip()`: removes whitespaces at the beginning AND end of a string
- `string.lstrip()`: removes whitespaces at the beginning (l for left) of the string
- `string.rstrip()`: removes whitespaces at the end (r for right) of the string

In [34]:
print (" This is a string. "           + "And another one.") 
print (" This is a string. ".strip()   + "And another one.")
print (" This is a string. ".lstrip()  + "And another one.")
print (" This is a string. ".rstrip()  + "And another one.")

 This is a string. And another one.
This is a string.And another one.
This is a string. And another one.
 This is a string.And another one.
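
The three strip methods only touch the beginning and end of a string. For the comparison problem mentioned at the start of this section (double whitespaces and linebreaks in the middle), one possible approach is sketched below. It assumes `split()` (see section 3.9) and the string method `join()`, which is not covered in this session; the helper name `normalise_whitespace` is made up.

```python
text_a = "The  cat\nsat on   the mat. "
text_b = "The cat sat on the mat."

def normalise_whitespace(text):
    # split() without arguments splits at every run of whitespace
    # (spaces, tabs, linebreaks) and ignores whitespace at both ends;
    # join() glues the pieces together again with single spaces.
    return " ".join(text.split())

print(text_a == text_b)  # False
print(normalise_whitespace(text_a) == normalise_whitespace(text_b))  # True
```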


Conversely, we can add whitespaces at the beginning and/or end of a string until it has a desired length. This can be useful to format our output.

- `string.center(length)`: adds whitespaces at the beginning AND end of the string until it has the given length
- `string.ljust(length)`: left-justifies the string, i.e. adds whitespaces at the end until it has the given length
- `string.rjust(length)`: right-justifies the string, i.e. adds whitespaces at the beginning until it has the given length

In the following example 2 sentences are printed several times. Note how the first sentence is extended with whitespaces and how this indents the second sentence neatly to always start at the same position. 

In [35]:
print("This is a string."             + " And another one.") # The first part "This is a string." has 17 characters!
print("This is a string.".center(25)  + " And another one.") # add 8 whitespaces (25-17=8): 4 at the beginning, 4 at the end
print("This is a string.".ljust(25)   + " And another one.") # add 8 whitespaces (25-17=8): all 8 at the end
print("This is a string.".rjust(25)   + " And another one.") # add 8 whitespaces (25-17=8): all 8 at the beginning

This is a string. And another one.
    This is a string.     And another one.
This is a string.         And another one.
        This is a string. And another one.


Optionally, we can pass a second parameter to these three methods to customize which character is added (instead of a whitespace). Here are the same 4 prints again, but with an underscore "_" as the fill character:

In [36]:
print("This is a string."                  + " And another one.") # The first part "This is a string." has 17 characters!
print("This is a string.".center(25, "_")  + " And another one.") # add 8 _ (25-17=8): 4 at the beginning, 4 at the end
print("This is a string.".ljust(25, "_")   + " And another one.") # add 8 _ (25-17=8): all 8 at the end
print("This is a string.".rjust(25, "_")   + " And another one.") # add 8 _ (25-17=8): all 8 at the beginning

This is a string. And another one.
____This is a string.____ And another one.
This is a string.________ And another one.
________This is a string. And another one.
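
A typical use of these methods is to line up several rows of output in columns. The following sketch uses invented data (`rows`) and arbitrary column widths (14 and 5); it pads the names on the right with dots and the numbers on the left with whitespaces.

```python
rows = [("apples", 3), ("kiwis", 12), ("watermelons", 150)]

lines = []
for name, amount in rows:
    # left-justify the name (fill with dots), right-justify the number
    line = name.ljust(14, ".") + str(amount).rjust(5)
    lines.append(line)
    print(line)
```

Every printed line has the same length (14 + 5 = 19 characters), so the numbers end in the same column.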


### 3.7 Leading Zeros when Printing Numbers as Strings
Often we want to print numbers with leading zeros for a better-looking format. This can be achieved with the method `string.zfill(width)`. Since `zfill` is a string method, a number must first be converted into a string with `str(number)`.

In [37]:
number1 = input("enter a number: ").strip()
number2 = input("enter another number:").strip()

print(number1.zfill(8))
print(number2.zfill(8))

enter a number: k
enter another number:j9
0000000k
000000j9


A common way of using this is to first identify the length of the highest number in a collection (tuple, list or dictionary), and then to add leading zeros to all numbers according to the highest number's length.

In [38]:
#a list of random numbers (feel free to add more)
list_of_numbers = [12312,765,1233212,23,923681233,15613183,8,345345,43]

#identify the highest number in the list with the function max(list)
highest_number = max(list_of_numbers)
print("The highest number is: " + str(highest_number)) # highest_number must be converted into a string!

#identify the length (number of characters) of this number
length_of_highest_number = len(str(highest_number)) # length of highest_number as a string

# run over all numbers in the list:
for number in list_of_numbers:
    # turn them into a string, fill them with leading 0s until they reach the length of the highest number:
    print(str(number).zfill(length_of_highest_number))

The highest number is: 923681233
000012312
000000765
001233212
000000023
923681233
015613183
000000008
000345345
000000043
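
Two details of `zfill()` are worth knowing: a leading sign stays in front of the zeros, and strings that are already longer than the requested width are returned unchanged. A brief sketch with arbitrary sample values (the list name `padded` is made up):

```python
padded = []
for text in ["42", "-42", "123456"]:
    padded.append(text.zfill(5))

print(padded[0])  # 00042
print(padded[1])  # -0042   the sign counts towards the width
print(padded[2])  # 123456  longer than 5 characters: unchanged
```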


### 3.8 What's in the String? Inspecting Numeric and Alphabetical Strings

With the following functions we can inspect our strings to a certain degree. For more complex tests we will learn so-called regular expressions at a later stage.

- `string.isdigit()`: returns `True` if the string contains only digits, i.e. 0,1,2,3,4,5,6,7,8 and 9 (but also superscript digits such as ²)

In [40]:
print("5 apples".isdigit()) # False
print("5 32 34".isdigit())  # False
print("532.34".isdigit())   # False
print("532,34".isdigit())   # False
print(" 53234 ".isdigit())  # False --> leading whitespaces!

print("一二三四五".isdigit()) # False --> these are Chinese numerals

print("53234".isdigit())    # True
print("053234".isdigit())   # True

print("2²".isdigit())       # True --> 2nd power of 2 (the superscript 2) is considered a digit!

False
False
False
False
False
False
True
True
True


- `string.isnumeric()`: very similar to `string.isdigit()`, but it also accepts numeric characters from other writing systems, such as the Chinese numerals 一，二，三，四，五...  Note: for English/Dutch writing `string.isdigit()` usually does the trick!

In [41]:
print("5 apples".isnumeric())    #False
print("5 32 34".isnumeric())     #False
print("532.34".isnumeric())      #False
print("532,34".isnumeric())      #False
print(" 53234 ".isnumeric())     #False --> leading whitespaces!

print("53234".isnumeric())       #True

print("一二三四五".isnumeric())   #True
print("一二三四五23".isnumeric()) #True --> Take care, this number is dubious, but the method does not care!

print("2²".isnumeric())         #True --> 2nd power of 2 (the superscript 2) is considered numeric!

False
False
False
False
False
True
True
True
True


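To make the confusion complete, there is a third method to check if a string contains numbers:
- `string.isdecimal()`: the strictest of the three. Returns `True` ONLY if all characters are decimal digits that can form a base-10 number, i.e. 0,1,2,3,4,5,6,7,8 and 9 (and their equivalents in other scripts, e.g. Arabic-Indic digits). Superscripts do NOT count!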

In [100]:
print("22".isdecimal())   #True
print("2²".isdecimal())   #False  (string.isnumeric() and string.isdigit() are True here!)
print("三四".isdecimal()) #False  (string.isnumeric() is True here, string.isdigit() is also False here!)

True
False
False
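
The three methods can be compared side by side on a few single characters. The sketch below also shows one possible consequence for type casting: `int()` rejects a superscript such as "²", so `isdecimal()` is the safer check before a conversion. The helper `to_int_or_none` is hypothetical and deliberately simple (it does not accept negative numbers, for example).

```python
samples = ["2", "²", "½", "三"]

results = {}
for char in samples:
    # order of the three values: isdecimal, isdigit, isnumeric
    results[char] = (char.isdecimal(), char.isdigit(), char.isnumeric())
    print(char, results[char])

def to_int_or_none(text):
    """Hypothetical helper: cast text to an integer only if it is safe."""
    text = text.strip()
    if text.isdecimal():
        return int(text)
    return None

print(to_int_or_none(" 53234 "))  # 53234
print(to_int_or_none("2²"))       # None
```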


- `string.isalnum()`: returns `True` if the string contains only alphabetical characters and/or numeric characters
- `string.isalpha()`: returns `True` if the string contains only alphabetical characters

In [90]:
print("7 eagles flew over the 3 mountains".isalnum()) # False --> whitespace is neither alphabetical nor numeric
print("7eaglesflewoverthe3mountains".isalnum())       # True

False
True


In [42]:
print("7 eagles flew over the 3 mountains".isalpha())   # False --> whitespace is not alphabetical
print("7eaglesflewoverthe3mountains".isalpha())         # False --> 7 and 3 are not alphabetical
print("seveneaglesflewoverthethreemountains".isalpha()) # True

False
False
True


### 3.9 Splitting Strings into Lists

Often we want to split a string into a list (and process them e.g. into dictionaries or dataframes). This is the basis to tokenize texts, e.g. to count all words, to normalize them, find word stems, and word frequency distributions (most/least common words). The simplest way to turn a string into a list of word tokens is by splitting the string at all whitespace characters (not a perfect solution!):

- `string.split()`: Splits a string into a list, with whitespace as a default delimiter
- `string.split(",")`: Use the comma as a delimiter to split the string

In [43]:
my_sentence = "However, the cat was sad."
word_list = my_sentence.split()
print(word_list)
print(word_list[0]) # note: the word contains the comma!


['However,', 'the', 'cat', 'was', 'sad.']
However,


In [44]:
my_sentence = "However, the cat was sad."
word_list = my_sentence.split(",")
print(word_list)
print(word_list[0]) # note: The list has only 2 elements.
print(word_list[1]) #      The second element has a leading whitespace!

['However', ' the cat was sad.']
However
 the cat was sad.
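
The leftover whitespace around the pieces can be removed by combining `split()` with `strip()` from section 3.6. A minimal sketch with an invented input string (`record`):

```python
record = "Jane , Doe,  Antwerp ,Belgium"

fields = []
for field in record.split(","):
    # strip() removes the whitespace left over around each piece
    fields.append(field.strip())

print(fields)  # ['Jane', 'Doe', 'Antwerp', 'Belgium']
```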


Sometimes we want to split a string into 3 parts: the part *before* a certain word/phrase, the phrase itself, and the part *after* the phrase. This can be achieved with:

- `string.partition(separatingString)`: splits a string into a 3-item tuple at the first occurrence of the separatingString
- `string.rpartition(separatingString)`: splits a string into a 3-item tuple at the last occurrence of the separatingString

In [45]:
sentence = "My cat is the most beautiful cat in the world."
print(sentence.partition("cat"))  # Split at first cat. note: 3-item tuple, not a list!

print(sentence.rpartition("cat"))  # Split at last cat. 

('My ', 'cat', ' is the most beautiful cat in the world.')
('My cat is the most beautiful ', 'cat', ' in the world.')
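
Since the result always has exactly three items, it can be unpacked directly into three variables. The sketch below uses an invented "key = value" line; it also shows what happens if the separator does not occur in the string at all.

```python
line = "title = The Cat in the Hat"

# unpack the 3-item tuple into three variables
key, separator, value = line.partition("=")
key = key.strip()
value = value.strip()
print(key)    # title
print(value)  # The Cat in the Hat

# if the separator is not found, the whole string ends up in the first item
missing = "no separator here".partition("=")
print(missing)  # ('no separator here', '', '')
```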


Often we need to split a multi-line string into a list of lines. This is achieved with the method `string.splitlines()`.
- `string.splitlines()`: Splits string into list of lines with all linebreak characters as a delimiter

In [46]:
multiline = """This is one way
of writing a multiline string
in Python. Simply use
three quotation marks at
the beginning and the
end."""

print (multiline.splitlines())

['This is one way', 'of writing a multiline string', 'in Python. Simply use', 'three quotation marks at', 'the beginning and the', 'end.']


In [47]:
multiline2 =  "This is another\nway of doing the same,\nbut this time I use backslash-n to start a new\nline"
print (multiline2.splitlines())

['This is another', 'way of doing the same,', 'but this time I use backslash-n to start a new', 'line']
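
At first sight `string.split("\n")` seems to do the same as `string.splitlines()`. The following sketch, with made-up sample strings, shows two cases in which the results differ: a linebreak at the very end of the text, and Windows-style linebreaks (`\r\n`).

```python
text = "first line\nsecond line\n"
lines_a = text.splitlines()
lines_b = text.split("\n")
print(lines_a)  # ['first line', 'second line']
print(lines_b)  # ['first line', 'second line', '']  <-- empty string at the end

windows_text = "first line\r\nsecond line"
lines_c = windows_text.splitlines()
lines_d = windows_text.split("\n")
print(lines_c)  # ['first line', 'second line']
print(lines_d)  # ['first line\r', 'second line']  <-- the \r is left behind
```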
