<div style="background-color:lightgrey;
            padding:10px;
            color:black;
            border:black dashed 2px; 
            border-radius:5px;
            margin: 20px 0;">
            
            
# String Manipulation



**Staff:** Joshua Schäuble <br/>
**Support Material:** [Exercises](https://github.com/dtaantwerp/dtaantwerp.github.io/blob/12b652d6955e285f35484c4220b4bb92b7fd9644/exercises/08_text_manipulation.ipynb) <br/>
**Support Sessions:**  Thursday October 7

</div>


## 1. Introduction
As we learned before, strings are a *data type* (other data types we looked at: `bool`, `int`, `float`, `tuple`, `list`, `dict`, ...). In this session we will learn how to manipulate strings by looking at two of their main characteristics: (1) Strings are ordered sequences of characters and (2) strings are so-called objects. The latter will be explained in detail only in week 3. However, we can already attempt to understand what it means in practice today.

## 2. Strings as Sequences / Sequence Operations on Strings

Strings are a sequential data type. A string is an ordered sequence of characters. We can use this characteristic to manipulate strings with **sequence operations**, some of which we have already used in previous sessions. For example, we can add strings together or multiply them. The most important sequence operations will be introduced here:

### 2.1 Concatenating strings with `+`
When applied to numbers (integer or float), the plus operator (`+`) performs a mathematical addition. When used on strings, the `+` operator concatenates two or more strings into a new one.

In [1]:
#CODE
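
A minimal sketch of concatenation; the variable names and values are illustrative assumptions, not taken from the course material:

```python
# Illustrative names and values (assumptions for this sketch).
first_name = "Ada"
last_name = "Lovelace"

# `+` joins the strings exactly as they are,
# so the space between the names must be added explicitly.
full_name = first_name + " " + last_name
print(full_name)  # Ada Lovelace
```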

### 2.2 Multiplying strings with `*` 
Although this is not used very often, strings can also be multiplied with the multiplication operator, which we also know from numerical data types:

In [2]:
#CODE
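
As a small sketch with made-up values, multiplying a string by an integer repeats it that many times:

```python
# Repeat a short string three times.
laugh = "ha" * 3
print(laugh)  # hahaha

# One practical use: drawing a separator line of a fixed width.
separator = "-" * 20
print(separator)
```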

Can you guess what will be printed by executing the block below?

In [None]:
#CODE


### 2.3 Slicing Strings via the Character Index

Strings are ordered sequences of characters. This means that every character of the sequence has a position, called its index. The index starts at 0 and increases from left to right.

| mytext =  | " | H | e | l | l | o |   | W | o | r | l | d  | " |
|-----------|---|---|---|---|---|---|---|---|---|---|---|----|---|
| Index     |   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |   |

The following examples will show how we can cut out slices of a string by using the index.

In [3]:
mytext = "Hello World"
#CODE
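
The following sketch shows a few typical slices of `mytext`. The result names (`first_char`, `hello`, ...) are illustrative assumptions:

```python
mytext = "Hello World"

first_char = mytext[0]       # 'H'      -> a single character
hello = mytext[0:5]          # 'Hello'  -> the end index 5 is NOT included
world = mytext[6:]           # 'World'  -> from index 6 to the end
every_second = mytext[::2]   # 'HloWrd' -> every second character (step 2)

print(first_char, hello, world, every_second)
```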





We can also access the index from right to left. The last index position is always [-1].

| mytext =  | " | H   | e   | l  | l  | o  |    | W  | o  | r  | l  | d  | " |
|-----------|---|-----|-----|----|----|----|----|----|----|----|----|----|---|
| Index     |   | -11 | -10 | -9 | -8 | -7 | -6 | -5 | -4 | -3 | -2 | -1 |   |

In [4]:
#CODE
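
A short sketch of negative indexing on the same string (result names are illustrative):

```python
mytext = "Hello World"

last_char = mytext[-1]       # 'd'          -> the last character
world = mytext[-5:]          # 'World'      -> the last five characters
without_last = mytext[:-1]   # 'Hello Worl' -> everything except the last character

print(last_char, world, without_last)
```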




Note:
- Take care! Don't access non-existent indexes. `print(mytext[-15])` will raise an `IndexError`, because the string has only 11 characters.
- We can only use the index to access the sequence, but not to alter it directly:

In [5]:
#CODE
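
Both notes can be tried out safely with `try`/`except`. This is only a sketch; `new_text` is an illustrative name:

```python
mytext = "Hello World"

# Accessing an index that does not exist raises an IndexError.
try:
    mytext[-15]
except IndexError as error:
    print("Index problem:", error)

# Assigning to an index raises a TypeError, because strings cannot be changed in place.
try:
    mytext[0] = "J"
except TypeError as error:
    print("Not allowed:", error)

# Instead, build a NEW string from slices of the old one.
new_text = "J" + mytext[1:]
print(new_text)  # Jello World
```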



### 2.4 The functions `len(string)`, `min(string)` and `max(string)`
The function `len(string)` takes a string as an argument and returns the length (= number of characters) of the string as an integer:

In [6]:
#CODE


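The function `min(string)` returns the smallest character of the string. Correspondingly, the function `max(string)` returns the largest character in the string. Characters are compared by their Unicode code point.

For the most common characters, the order (small to large) is roughly: whitespace and most punctuation --> digits --> capital letters A-Z --> lowercase letters a-z. Note that a few special characters (e.g. `_` or `~`) sort after the capital letters.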

In [7]:
#CODE
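
A small sketch of all three functions on a made-up string. The built-in `ord()` is used here only to show the code points behind the ordering:

```python
word = "Python3!"

length = len(word)     # 8   -> number of characters
smallest = min(word)   # '!' -> lowest code point in the string
largest = max(word)    # 'y' -> highest code point in the string

print(length, smallest, largest)

# ord() returns the Unicode code point that min() and max() compare.
print(ord("!"), ord("3"), ord("P"), ord("y"))  # 33 51 80 121
```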



### 2.5 The comparison operators `in` and `not in`

With the operators `in` and `not in` we can test whether a character or substring is (not) contained in another string. The test is case-sensitive and returns a boolean value (`True` or `False`).

In [8]:
#CODE
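
A minimal sketch with an illustrative sentence:

```python
sentence = "The quick brown fox"

has_fox = "fox" in sentence        # True
has_cat = "cat" in sentence        # False
no_cat = "cat" not in sentence     # True
case_matters = "the" in sentence   # False -> "The" is written with an uppercase T

print(has_fox, has_cat, no_cat, case_matters)
```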


## 3. Strings as Objects: Manipulating Strings with *Object-Methods*

Strings are *objects*. To understand objects, we need more concepts than we currently know. Hence, we will only learn in week 3 what precisely objects are. However, in this lecture we can already try to understand string objects simply by using them.  

Each instance of an object (here: each *string*) has predefined object functions (so-called *methods*) attached to it. These methods can be used to get information about the object and to manipulate the object. 

In contrast to normal functions (such as `print(string)` and `len(string)`), methods are always attached to a specific instance of the object. Therefore, the object does not need to be passed into the function as an argument (as we would do in `print(string)`). Instead, the method is called on a specific instance (this will become clear in a moment).

To explain the difference between a normal function and a method, we can use the following metaphor. A normal function is like a speed camera at the side of the road. It detects the speed of any car with speed sensors. It generally works on all cars, but not on animals that pass the camera. A method, on the other hand, is like the speedometer of a specific car. It also measures the speed, but only of the specific car (= object) it is connected to. Every car object has its own speedometer attached to it.

Following the same logic, string objects have specific methods attached to them. A selection of these methods will be introduced here.
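
The difference in call syntax can be seen in a minimal sketch (the variable name is illustrative):

```python
greeting = "hello"

# Function: the string is passed in as an argument.
length = len(greeting)
print(length)    # 5

# Method: called on the string itself with dot notation.
shouted = greeting.upper()
print(shouted)   # HELLO
```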


### 3.1 String.format()
The method `string.format()` allows us to add placeholders (`{}`) to a string and fill them on demand.

In [9]:
#CODE


In [10]:
#CODE
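
A short sketch of positional and named placeholders. The template texts and values are assumptions chosen for illustration:

```python
# Positional placeholders are filled from left to right.
template = "My name is {} and I am {} years old."
sentence = template.format("Hermione", 17)
print(sentence)

# Named placeholders are filled by keyword, in any order.
named = "{name} studies {subject}.".format(subject="Arithmancy", name="Hermione")
print(named)
```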


Note: `string.format()` accepts integers and automatically transforms them into strings (= type casting). This does not work with the plus operator we learned above:

In [11]:
#CODE


In [12]:
#CODE


In [13]:
#CODE
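
The contrast can be sketched as follows (illustrative values). `format()` converts the integer for us, while `+` needs an explicit `str()`:

```python
age = 17

# format() converts the integer to a string automatically.
with_format = "I am {} years old.".format(age)

# The plus operator cannot mix strings and integers.
try:
    broken = "I am " + age + " years old."
except TypeError as error:
    print("Concatenation failed:", error)

# With an explicit conversion, concatenation works as well.
with_str = "I am " + str(age) + " years old."

print(with_format)
print(with_str)
```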


### 3.2 string.count(substring)
`string.count(substring)` allows us to count how often a given substring occurs in the string:

In [14]:
myword = "butterpot"
#CODE


In [15]:
myText = "Paranoids are not paranoid because they’re paranoid, but because they keep putting themselves deliberately into paranoid situations."
#CODE
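
A minimal sketch of `count()`. The second example uses a shortened, made-up sentence to show that counting is case-sensitive:

```python
myword = "butterpot"

t_count = myword.count("t")     # 3
tt_count = myword.count("tt")   # 1
x_count = myword.count("x")     # 0 -> not found, but no error

# count() is case-sensitive; lowercasing first counts both spellings.
text = "Paranoid people stay paranoid."
case_sensitive = text.count("paranoid")             # 1
case_insensitive = text.lower().count("paranoid")   # 2

print(t_count, tt_count, x_count, case_sensitive, case_insensitive)
```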


### 3.3 `String.replace(what, with)` and `String.replace(what, with, count)`
`string.replace()` allows us to replace selected parts of a string with another string. We can either replace all occurrences or define a maximum number of replacements. The method returns a new string; the original string is not changed.

In [16]:
my_text = "Death"
#CODE


In [17]:
my_text = "Hurry Harry! We are late"
#CODE
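
A sketch of both variants on the sentence above (the replacement strings are illustrative):

```python
my_text = "Hurry Harry! We are late"

# Replace every occurrence of "rr".
all_replaced = my_text.replace("rr", "pp")
print(all_replaced)   # Huppy Happy! We are late

# Replace only the first occurrence (count = 1).
first_only = my_text.replace("rr", "pp", 1)
print(first_only)     # Huppy Harry! We are late

# The original string is unchanged.
print(my_text)        # Hurry Harry! We are late
```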


### 3.4 Find the index of a substring:  `string.find()`, `string.rfind()` (and `string.index()`)
Sometimes it is useful to identify where within a string we find a given substring. To identify the index (= position), we can use the methods `string.find(what)` and `string.rfind(what)`. 

- `string.find(what)` returns the lowest index of the string *what* in the *string*. If *what* is not found, it returns -1.

- `string.rfind(what)` returns the highest index of the string *what* in the *string*. If *what* is not found, it returns -1.

In [18]:
my_word = "Butterbeer"
#CODE
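
A minimal sketch of `find()` and `rfind()` on the word above (result names are illustrative):

```python
my_word = "Butterbeer"

first_e = my_word.find("e")      # 4  -> first "e"
last_e = my_word.rfind("e")      # 8  -> last "e"
beer_at = my_word.find("beer")   # 6  -> index where the substring starts
missing = my_word.find("x")      # -1 -> not found

print(first_e, last_e, beer_at, missing)
```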


Equally, we can use `string.index(what)` and `string.rindex(what)`. However, these two methods raise an error (`ValueError`) if the substring is not found.

> *Tip*: Only use these two, if you are certain that the substring occurs in the string, or if you add error-handling (`try`-`except`).
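
The tip can be sketched like this. The fallback value `None` is an assumption; any marker for "not found" would do:

```python
my_word = "Butterbeer"

# Safe: we know that "beer" occurs in the word.
position = my_word.index("beer")
print(position)   # 6

# Unsure: catch the ValueError that index() raises for a missing substring.
try:
    missing_position = my_word.index("wine")
except ValueError:
    missing_position = None

print(missing_position)   # None
```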

### 3.5 Cases: Switch from uppercase to lowercase

Python provides a series of string methods to change cases:

- `string.capitalize()`: changes the first character of the string to uppercase and all other characters to lowercase
- `string.title()`: changes the first character of **each word** to uppercase and all other characters to lowercase
- `string.upper()`: changes all characters of the string to uppercase
- `string.lower()`: changes all characters of the string to lowercase
- `string.swapcase()`: changes the case of all characters - uppercase to lowercase and vice versa
- `string.casefold()`: more aggressive lowercase transformation for special characters, e.g. German ß --> ss. This method is useful to test if two strings differ only in their cases.

In [7]:
#CODE


This is a sentence.
This Is A Sentence.
THIS IS A SENTENCE.
this is a sentence.
This Is A Sentence.
tHIS iS a sENTENCE.
Fluß and fluss are basically the same.
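
The following sketch, with illustrative strings, shows the case methods side by side and why `casefold()` is the safer choice for case-insensitive comparisons:

```python
text = "hello WORLD"

capitalized = text.capitalize()   # 'Hello world' -> the rest becomes lowercase
titled = text.title()             # 'Hello World'
swapped = text.swapcase()         # 'HELLO world'

# Case-insensitive comparison: lower() keeps the ß, casefold() turns it into "ss".
a = "Fluß"
b = "FLUSS"
same_with_lower = a.lower() == b.lower()            # False: 'fluß' vs 'fluss'
same_with_casefold = a.casefold() == b.casefold()   # True:  'fluss' vs 'fluss'

print(capitalized, titled, swapped, same_with_lower, same_with_casefold)
```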


***Additionally***, there are some methods to check what case the string is in. These methods return `True` or `False`, i.e. a boolean:

- `string.islower()`: returns `True` if all characters are lowercase.
- `string.isupper()`: returns `True` if all characters are uppercase.
- `string.istitle()`: returns `True` if the first character of each word is uppercase and all other characters are lowercase.

In [19]:
#CODE


In [20]:
#CODE


In [21]:
#CODE
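
A minimal sketch of the three checks on made-up strings. Note that characters without a case (digits, punctuation, whitespace) are ignored, but at least one cased character must be present:

```python
lower_ok = "hello world 2".islower()   # True  -> the digit and spaces are ignored
upper_ok = "HELLO!".isupper()          # True
title_ok = "Hello World".istitle()     # True
title_bad = "Hello world".istitle()    # False -> "world" starts with a lowercase letter
no_letters = "123".isupper()           # False -> there is no cased character at all

print(lower_ok, upper_ok, title_ok, title_bad, no_letters)
```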


### 3.6 Handling Whitespaces
Whitespace is often a problem, for example if we want to check whether two strings are the same apart from double whitespaces and linebreaks. To handle such whitespace problems, Python provides a series of string methods.

- `string.strip()`: removes whitespaces at the beginning AND end of a string
- `string.lstrip()`: removes whitespaces at the beginning (l for left) of the string
- `string.rstrip()`: removes whitespaces at the end (r for right) of the string

In [22]:
#CODE
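
A short sketch of the three methods on an illustrative string. The last two lines show one possible way (an assumption, not part of the methods above) to also collapse whitespace inside a string, by combining `split()` and `join()`:

```python
padded = "   Hello World \n"

both = padded.strip()     # 'Hello World'     -> both sides cleaned (the linebreak too)
left = padded.lstrip()    # 'Hello World \n'  -> only the left side cleaned
right = padded.rstrip()   # '   Hello World'  -> only the right side cleaned

print(repr(both), repr(left), repr(right))

# strip() does not touch whitespace INSIDE the string.
# Splitting at whitespace and re-joining with single spaces normalizes that as well.
normalized = " ".join("Hello   World\n".split())
print(normalized == both)   # True
```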


Equally, we can add whitespaces at the beginning and/or end of a string, until it has a desired length. This can be useful to format our output.

- `string.center(length)`: adds whitespace at the beginning AND end of the string until it has the given length
- `string.ljust(length)`: left-justifies the string, i.e. adds whitespace at the end (right side) until it has the given length
- `string.rjust(length)`: right-justifies the string, i.e. adds whitespace at the beginning (left side) until it has the given length

In the following example 2 sentences are printed several times. Note how the first sentence is extended with whitespaces and how this indents the second sentence neatly to always start at the same position. 

In [23]:
#CODE


Optionally, we can pass a second argument to these 3 methods to customize which character is added (instead of a whitespace). Here are the same 4 prints again, but with an underscore "_" as the fill character:

In [24]:
#CODE
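
A compact sketch of all three methods, with and without a fill character. The word and the width of 7 are illustrative:

```python
word = "cat"

left_justified = word.ljust(7)    # 'cat    ' -> padding is added on the right
right_justified = word.rjust(7)   # '    cat' -> padding is added on the left
centered = word.center(7)         # '  cat  ' -> padding on both sides

# The optional second argument replaces the space as fill character.
left_filled = word.ljust(7, "_")      # 'cat____'
right_filled = word.rjust(7, "_")     # '____cat'
center_filled = word.center(7, "_")   # '__cat__'

print(repr(left_justified), repr(right_justified), repr(centered))
print(left_filled, right_filled, center_filled)
```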


### 3.7 Leading Zeros when Printing Numbers as Strings:
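Often we want to print numbers with leading zeros for a better-looking format. Since `zfill` is a string method, the number must first be converted into a string with `str()`. The method `string.zfill(width)` then pads the string with zeros on the left until it is `width` characters long.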

In [25]:
#CODE


A common way of using this is to first determine the length of the largest number in a collection (tuple, list or dictionary), and then to pad all numbers with leading zeros to that length.

In [26]:
#a list of random numbers (feel free to add more)
list_of_numbers = [12312,765,1233212,23,923681233,15613183,8,345345,43]

#CODE
#CODE
#CODE
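
One possible way to implement this idea is sketched below. It is an assumption about how the steps could look, and the names `width` and `padded` are illustrative:

```python
list_of_numbers = [12312, 765, 1233212, 23, 923681233, 15613183, 8, 345345, 43]

# Step 1: length (number of digits) of the largest number.
width = len(str(max(list_of_numbers)))   # 9

# Step 2: convert each number to a string and pad it with leading zeros.
padded = [str(number).zfill(width) for number in list_of_numbers]

# Step 3: print the neatly aligned numbers.
for item in padded:
    print(item)
```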


### 3.8 What's in the String? Inspecting Numeric and Alphabetical Strings

With the following functions we can inspect our strings to a certain degree. For more complex tests we will learn so-called regular expressions at a later stage.

- `string.isdigit()`: returns `True` if the string is not empty and contains only digits such as 0,1,2,3,4,5,6,7,8 and 9 (superscript digits such as ² also count)

In [10]:
print("5 apples".isdigit()) # False
print("5 32 34".isdigit())  # False
print("532.34".isdigit())   # False
print("532,34".isdigit())   # False
print(" 53234 ".isdigit())  # False --> leading and trailing whitespace!

print("一二三四五".isdigit()) # False --> these are Chinese numbers

print("53234".isdigit())    # True
print("053234".isdigit())   # True

print("2²".isdigit())       # True --> 2nd power of 2 (the superscript 2) is considered a digit!

False
False
False
False
False
False
True
True
True


- `string.isnumeric()`: very similar to `string.isdigit()`, but it also considers numeric characters in other languages, such as the Chinese characters 一，二，三，四， 五...  Note: for English/Dutch writing `string.isdigit()` usually does the trick!

In [11]:
print("5 apples".isnumeric())    #False
print("5 32 34".isnumeric())     #False
print("532.34".isnumeric())      #False
print("532,34".isnumeric())      #False
print(" 53234 ".isnumeric())     #False --> leading and trailing whitespace!

print("53234".isnumeric())       #True

print("一二三四五".isnumeric())   #True
print("一二三四五23".isnumeric()) #True --> Take care, this number is dubious, but the method does not care!

print("2²".isnumeric())         #True --> 2nd power of 2 (the superscript 2) is considered numeric!

False
False
False
False
False
True
True
True
True


To make the confusion complete, there is a third method to check if a string contains numbers:
- `string.isdecimal()`: returns `True` only for decimal digits such as 0,1,2,3,4,5,6,7,8 and 9 (and their counterparts in other scripts, e.g. Arabic-Indic digits). Superscripts do NOT count!

In [100]:
print("22".isdecimal())   #True
print("2²".isdecimal())   #False  (string.isnumeric() and string.isdigit() are True here!)
print("三四".isdecimal()) #False  (string.isnumeric() is True here, string.isdigit() is also False here!)

True
False
False


- `string.isalnum()`: returns `True` if the string contains only alphabetical characters and/or numbers
- `string.isalpha()`: returns `True` if the string contains only alphabetical characters

In [90]:
print("7 eagles flew over the 3 mountains".isalnum()) # False --> whitespace is neither alphabetical nor numeric
print("7eaglesflewoverthe3mountains".isalnum())       # True

False
True


In [42]:
print("7 eagles flew over the 3 mountains".isalpha())   # False --> whitespace is not alphabetical
print("7eaglesflewoverthe3mountains".isalpha())         # False --> 7 and 3 are not alphabetical
print("seveneaglesflewoverthethreemountains".isalpha()) # True

False
False
True


### 3.9 Splitting Strings into Lists

Often we want to split a string into a list (and process it further, e.g. into dictionaries or dataframes). This is the basis for tokenizing texts, e.g. to count all words, to normalize them, to find word stems, and to compute word frequency distributions (most/least common words). The simplest way to turn a string into a list of word tokens is by splitting the string at all whitespace characters (not a perfect solution!):

- `string.split()`: Splits a string into a list, with whitespace as a default delimiter
- `string.split(",")`: Use the comma as a delimiter to split the string

In [43]:
my_sentence = "However, the cat was sad."
#CODE



['However,', 'the', 'cat', 'was', 'sad.']
However,


In [44]:
my_sentence = "However, the cat was sad."
#CODE


['However', ' the cat was sad.']
However
 the cat was sad.
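
One detail worth knowing is sketched below with a made-up string: `split()` without an argument treats any run of whitespace as a single delimiter, whereas an explicit delimiter such as `" "` splits at every single occurrence and can therefore produce empty strings:

```python
spaced = "the  cat   was sad"

# Default: runs of whitespace count as ONE delimiter.
default_split = spaced.split()
print(default_split)   # ['the', 'cat', 'was', 'sad']

# Explicit delimiter: every single space splits, leaving empty strings in between.
space_split = spaced.split(" ")
print(space_split)     # ['the', '', 'cat', '', '', 'was', 'sad']
```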


Sometimes we want to split a string into 3 parts: a part *before* a certain word/phrase, the phrase, and the part: *after* the phrase. This can be achieved with 

- `string.partition(separatingString)`: splits a string into a 3-item tuple at the first occurrence of the separatingString
- `string.rpartition(separatingString)`: splits a string into a 3-item tuple at the last occurrence of the separatingString

In [45]:
sentence = "My cat is the most beautiful cat in the world."
# Split at first cat. note: 3-item tuple, not a list!
#CODE


# Split at last cat. 
#CODE


('My ', 'cat', ' is the most beautiful cat in the world.')
('My cat is the most beautiful ', 'cat', ' in the world.')
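
Because the result is always a 3-item tuple, it can be unpacked directly into three variables. The sketch below (variable names are illustrative) also shows what happens when the separator does not occur:

```python
sentence = "My cat is the most beautiful cat in the world."

# Unpack the three parts into separate variables.
before, phrase, after = sentence.partition("cat")
print(before)   # 'My '
print(after)    # ' is the most beautiful cat in the world.'

# If the separator is missing, the whole string ends up in the first item
# (partition) or in the last item (rpartition); the other two are empty.
not_found = "no match here".partition("cat")
not_found_r = "no match here".rpartition("cat")
print(not_found, not_found_r)
```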


Often we need to split a multi-line string into a list of lines. This is achieved with the method `string.splitlines()`.
- `string.splitlines()`: Splits string into list of lines with all linebreak characters as a delimiter

In [27]:
multiline = """This is one way
of writing a multiline string
in Python. Simply use
three quotation marks at
the beginning and the
end."""

#CODE


In [28]:
multiline2 =  "This is another\nway of doing the same,\nbut this time I use backslash-n to start a new\nline"

#CODE
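
A minimal sketch of `splitlines()` on a short made-up string. It recognizes different linebreak styles (`\n` as well as `\r\n`) and removes them from the resulting lines:

```python
text = "first line\nsecond line\r\nthird line"

lines = text.splitlines()
print(lines)        # ['first line', 'second line', 'third line']
print(len(lines))   # 3

# Each line can then be processed on its own, e.g. with a loop.
for number, line in enumerate(lines, start=1):
    print(number, line)
```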
