# Manipulating Strings

## Introduction

In this chapter you will learn how to manipulate strings. There are a lot of different operations available on strings, like removing parts of them, converting letters from uppercase into lowercase and much more. You've used many of those string methods in previous labs already, so some of the material here will be a repetition of things you've already seen.

**This notebook covers some parts of [chapter 6](https://automatetheboringstuff.com/2e/chapter6/) of the book.**

### Optional resources

#### Strings in general

- [Python Tutorial: Strings](https://docs.python.org/3/tutorial/introduction.html#strings)
- [Strings and Character Data in Python – Real Python](https://realpython.com/python-strings/)

#### Encoding and str/bytes difference

- [Ned Batchelder: Pragmatic Unicode](https://nedbatchelder.com/text/unipain.html) (ignore the Python 2 parts)
- [The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) – Joel on Software](https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/)
- [What Every Programmer Absolutely, Positively Needs to Know About Encodings and Character Sets to Work With Text](https://kunststube.net/encoding/)

## Summary

Strings represent human-readable *text*. You can access a single element of the string with square brackets. Fortunately Python has many helpful string methods to process text.

It is important to understand the distinction between strings (text, for humans) and bytes (data, for computers). If you get data from an external source (a web service, a subprocess, or even just a `.txt` file on disk), you need to know how to *decode* it to turn the data into text.
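As a minimal sketch of that distinction (the sample text and the choice of UTF-8 are assumptions made for illustration), encoding turns text into bytes, and decoding turns bytes back into text:

```python
# Text (str) is for humans; the sample value is made up for illustration.
text = "Zürich"

# Encoding turns text into bytes; UTF-8 is assumed here.
data = text.encode("utf-8")
print(data)       # b'Z\xc3\xbcrich'
print(len(text))  # 6 characters
print(len(data))  # 7 bytes, because "ü" takes two bytes in UTF-8

# Decoding turns the bytes back into text, using the same encoding.
decoded = data.decode("utf-8")
print(decoded == text)  # True
```

Decoding only gives the right text if you use the encoding the data was written with.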

### Single and Double Quotes

In Python you can produce string literals in two ways: by surrounding the whole text with single quotation marks (`'text'`) or with double quotation marks (`"text"`).

A problem arises if you use single quote marks to surround the text and use them in the text itself again. 

```python
print('That's a lot of fun.')
```

As you can see from the syntax highlighting, the code doesn't work. Python reads the string only up to the second single quote mark. So you get just `That` as a string, followed by an error because the remaining part `s a lot of fun.'` can't be parsed. The same happens if you use double quote marks inside a text that is surrounded by double quote marks.

```python
print("He asked, "What?", and left.")
```

If you want to have single- or double-quotes in a string, you'd normally use the other kind of quote for your string:

```python
print("That's a lot of fun.")
print('He asked, "What?", and left.')
```

or you can *escape* the quote character by prefixing it with a backslash (`\`):

```python
print('That\'s a lot of fun.')
print("He asked, \"What?\", and left.")
```

### Escape Characters

![XKCD 1638: Backslashes](https://imgs.xkcd.com/comics/backslashes.png)

*([XKCD 1638: Backslashes](https://xkcd.com/1638/))*

Above, we've seen how to use `\'` and `\"` to *escape* those characters, so that they can be used inside a string with the same kind of quotes. With other *escape sequences*, certain *non-printable* ("invisible") special characters can be added to a string. This allows us to easily add things like a newline (`\n`) to strings:

| Escape sequence | Prints as            |
| :-------------- | :------------------- |
| `\'`            | Single quote         |
| `\"`            | Double quote         |
| `\t`            | Tab                  |
| `\n`            | Newline (line break) |
| `\\`            | Backslash            |

The three characters space (` `), tab (`\t`) and newline (`\n`) are commonly called "whitespace". All those escape characters count as one character: A tab `\t` is one character (even if displayed like 4 or 8 spaces often); a newline `\n` is a single character as well.
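A quick way to check this is `len(...)`. The following is a small sketch with made-up sample strings:

```python
# Each escape sequence is a single character in the resulting string.
print(len("\t"))      # 1
print(len("\n"))      # 1
print(len("a\tb\n"))  # 4: "a", tab, "b", newline

# str.isspace() is True if the string consists only of whitespace.
whitespace = " \t\n"
print(whitespace.isspace())  # True
```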

### Raw Strings

Let's say you want to print a Windows file path as string. Windows uses backslashes (`\`) as directory separators, so they need to be doubled, which is tedious:

```python
print('C:\\Users\\Bob\\Desktop')
```

An easier and faster way to do this is to use *raw strings*. Similar to f-strings with their `f` prefix, this kind of string is prefixed with an `r` character:

```python
print(r'C:\Users\Bob\Desktop')
```

Inside raw strings, every character retains its literal meaning - thus, `r"\n"` is not a line break, but a string with two characters, `\` and `n`.
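The difference can be checked directly. This is a short sketch, with variable names chosen for illustration:

```python
normal = "\n"  # one character: a newline
raw = r"\n"    # two characters: a backslash and the letter n

print(len(normal))    # 1
print(len(raw))       # 2
print(raw == "\\n")   # True: same as doubling the backslash by hand
```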

### Multiline Strings

It would be really time-consuming if you had to use `\n` several times to write a longer text as a string. To simplify this, you can use multiline strings with triple quotes (using either single quotes or double quotes). Inside triple quotes, you can use both `"` and `'` freely, thus we don't need a backslash for `haven't`.

```python
# easier & faster to write
print('''Dear Bob, 

You still haven't answered my question.

Sincerely,
Ross''')

# other possibility, but takes a while to write
# (a string in single quotes must stay on one line)
print('Dear Bob,\n\nYou still haven\'t answered my question.\n\nSincerely,\nRoss')
```

### Indexing and Slicing Strings

Just like lists, you can work with strings by using indices and slices.

```python
text = 'Hello, world!'
text[0]   # output = 'H'
text[0:5] # output = 'Hello' 
text[-1]  # output = '!' 
text[:5]  # output = 'Hello' 
text[7:]  # output = 'world!' 
```
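Negative indices and a step value work in slices as well. Here is a small sketch using the same sample string:

```python
text = 'Hello, world!'

print(text[-6:-1])  # 'world' (negative indices count from the end)
print(text[::2])    # 'Hlo ol!' (every second character)
print(text[::-1])   # '!dlrow ,olleH' (reversed)
print(text[7:100])  # 'world!' (a slice past the end is cut off, no error)
```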

### Formatting Strings

Python offers three common ways to format strings based on placeholders:

```python
greeting = 'Hello'
person = 'Peter'

text = '%s, %s!' % (greeting, person)       # very old (Python 2), avoid
text = '{}, {}!'.format(greeting, person)   # old (Python < 3.6)
text = f'{greeting}, {person}!'             # new (Python >= 3.6) and most simple/readable
```

These placeholders can optionally contain instructions on how to fill in the provided values:

```python
value = 1.2345
print(f'{value:.3}')  # only print 3 significant digits: 1.23
```
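A few more format specifications, sketched with made-up values (the variable names are only for illustration):

```python
value = 1.2345
count = 42
name = 'Peter'

print(f'{value:.2f}')   # 1.23 (2 digits after the decimal point)
print(f'{value:8.2f}')  # '    1.23' (padded with spaces to a width of 8)
print(f'{count:05}')    # 00042 (padded with zeros to a width of 5)
print(f'{name:>10}|')   # '     Peter|' (right-aligned in 10 characters)
print(f'{name:<10}|')   # 'Peter     |' (left-aligned in 10 characters)
```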

A good overview of all possibilities is available at:

- [pyformat.info](https://pyformat.info/) for `.format()`
- [fstring.help](https://fstring.help) for f-strings

### String Methods

The full list of available string methods is [documented in the Python documentation](https://docs.python.org/3/library/stdtypes.html#string-methods).
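As a rough sketch, here are a few commonly used methods from that list, applied to a made-up sample string:

```python
text = '  Hello, World!  '  # sample value, made up for illustration

stripped = text.strip()
print(stripped)                          # Hello, World!
print(stripped.upper())                  # HELLO, WORLD!
print(stripped.lower())                  # hello, world!
print(stripped.replace('World', 'Bob'))  # Hello, Bob!
print(stripped.startswith('Hello'))      # True
print(stripped.split(', '))              # ['Hello', 'World!']
print('-'.join(['a', 'b', 'c']))         # a-b-c
```

Each method returns a new string (or list); the original `text` is left unchanged.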

## Exercises

### Exercise 1: Defining Strings
Rewrite the following text using a single string on one line. Use single quotes.

Make sure your version equals the given one.

In [2]:
# No changes to this!
text = r"""
I'm a text containing some "other text" and some \n!
"""
text_new = '\nI\'m a text containing some "other text" and some \\n!\n'

### Exercise 2: String Functions

#### a) Cleaning up
Complete the `clean_text` function to clean up the given string. Cleaning up means removing all whitespace characters at the beginning and the end of the string. Additionally, count how many characters have been removed.

Hint: The required code should be simple enough to fit on a single line each (but a longer solution is okay too).

In [7]:
def clean_text(text):
    cleaned_text = text.strip()
    count = len(text) - len(cleaned_text)
    return cleaned_text, count


# Your code should work with the example below, but you're free to change it.
cleaned_text, count = clean_text("   \tThis is a messy text\n\t")
print(f"cleaned: {cleaned_text} ({count} characters removed)")

cleaned: This is a messy text (6 characters removed)


#### b) Line endings

Often files edited by different users contain different line endings:

- `\r\n` on Windows, also called CRLF for "carriage return, line feed"
- `\n` on Mac and Linux, also called LF for "line feed"

Normalize the line endings by completing the functions below:

- Replace all `\r\n` endings with `\n` in `normalize_to_lf`
- Replace all `\n` endings with `\r\n` in `normalize_to_crlf`

The given print calls use [the `repr(...)` built-in](https://docs.python.org/3/library/functions.html#repr) - it returns the "debug representation" of an object, which lets you see which kind of line endings the modified strings contain, rather than those actually being printed as newlines.
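To illustrate the difference between printing a string and printing its `repr(...)`, here is a short sketch with a made-up sample string:

```python
sample = "one\r\ntwo\n"  # sample value, made up for illustration

print(sample)        # prints the line breaks themselves
print(repr(sample))  # 'one\r\ntwo\n'
```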

In [11]:
def normalize_to_lf(text):
    return text.replace("\r\n", "\n")

def normalize_to_crlf(text):
    # normalize to LF first, so existing \r\n endings are not doubled to \r\r\n
    return normalize_to_lf(text).replace("\n", "\r\n")

# Your code should work with the example below, but you're free to change it.
text = " \nThis text has different\r\nline\nendings"
print("LF:")
print(repr(normalize_to_lf(text)))
print("CRLF:")
print(repr(normalize_to_crlf(text)))

LF:
' \nThis text has different\nline\nendings'
CRLF:
' \r\nThis text has different\r\nline\r\nendings'


#### c) Splitting
Implement `split_text` which splits a given text on commas. Make sure additional whitespace around the commas is removed, too - but your code should also work without any spaces around the comma.

In [7]:
def split_text(text):
    textlist = text.split(",")
    return [item.strip() for item in textlist]
    

# Your code should work with the example below, but you're free to change it.
print(split_text("apples, oranges  , coconuts       "))

['apples', 'oranges', 'coconuts']


#### d) Joining

Implement `join_words` which joins the given list of words into a string, with the words separated with spaces.

In [1]:
def join_words(words):
    return " ".join(words)

# Your code should work with the example below, but you're free to change it.
print(join_words([]))




#### e) Burning house

Implement `change_case`, which takes a string and returns a new one, changed as follows:

- Uppercase words are lowercase afterwards
- Lowercase words are capitalized
- Capitalized words are uppercase

`"My HOUSE is Burning"` becomes `"MY house Is BURNING"`.

Hints:

- Make use of the splitting function to get individual words
- Either use `enumerate` to modify list items in-place, or...
- ...use a list comprehension with a helper function (taking a single word) to build a new list

In [6]:
def change_case(text):
    words = text.split()
    for i, word in enumerate(words):
        if word.isupper():
            words[i] = word.lower()
        elif word.islower():
            words[i] = word.capitalize()
        elif word[0].isupper() and not word[1].isupper():
            # capitalized word: first letter uppercase, second one not
            words[i] = word.upper()
    return " ".join(words)

# Your code should work with the example below, but you're free to change it.
print(change_case("My HOUSE is Burning"))

MY house Is BURNING


### Exercise 3: Indexing and Slicing
Here are the first few sentences of [Oscar Wilde's 'The Selfish Giant'](https://standardebooks.org/ebooks/oscar-wilde/childrens-stories):

> Every afternoon, as they were coming from school, the children used to go and play in the Giant's garden.
>
> It was a large lovely garden, with soft green grass. Here and there over the grass stood beautiful flowers like stars, and there were twelve peach-trees that in the spring-time broke out into delicate blossoms of pink and pearl, and in the autumn bore rich fruit. The birds sat on the trees and sang so sweetly that the children used to stop their games in order to listen to them. 'How happy we are here!' they cried to each other.

In [1]:
text = """
Every afternoon, as they were coming from school, the children used to go and play in the Giant's garden.

It was a large lovely garden, with soft green grass. Here and there over the grass stood beautiful flowers like stars, and there were twelve peach-trees that in the spring-time broke out into delicate blossoms of pink and pearl, and in the autumn bore rich fruit. The birds sat on the trees and sang so sweetly that the children used to stop their games in order to listen to them. 'How happy we are here!' they cried to each other.
"""

For the following exercises, assume that:

- Sentences are separated by periods (`.`).
- A paragraph is a single line containing one or more sentences.
- The entire text contains one or more paragraphs.

Your code should also work with a different text, as long as it's structured in the same way.
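The structure described above can be sketched with a short, hypothetical text (the `sample` string and the variable names are made up for illustration, and this is only one possible way to take a text apart):

```python
sample = """
First sentence. Second sentence.

Third sentence. Fourth one here.
"""  # hypothetical text, structured like the one above

# One paragraph per non-empty line.
paragraphs = [line for line in sample.splitlines() if line]
print(len(paragraphs))  # 2

# Sentences of the second paragraph: split on periods, drop empty pieces.
sentences = [s.strip() for s in paragraphs[1].split(".") if s.strip()]
print(sentences)  # ['Third sentence', 'Fourth one here']
```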

#### a) Letters

Return the first and the last letter (`E`, `r`) of the text:

- First, clean up the text: Remove leading/trailing newlines and periods (`.`)
- Then access the required characters using indices

In [26]:
def first_and_last_char(text):
    text = text.strip("\n")
    text = text.strip(".")
    first_char = text[0]
    last_char = text[-1]
    return first_char, last_char

# Your code should work with the example below, but you're free to change it.
first_char, last_char = first_and_last_char(text)
print(f"first char: {first_char}")
print(f"last char: {last_char}")

first char: E
last char: r


#### b) Words

Now print the first and last word (`Every`, `other`) of the text. Clean the text as above, then split it up into words and finally access the first and last word using indices.

In [27]:
def first_and_last_word(text):
    text = text.strip("\n")
    text = text.strip(".")
    words = text.split(" ")
    first_word = words[0]
    last_word = words[-1]
    return first_word, last_word

# Your code should work with the example below, but you're free to change it.
first_word, last_word = first_and_last_word(text)
print(f"first word: {first_word}")
print(f"last word: {last_word}")

first word: Every
last word: other


#### c) Cherry-picking

Finally, take the second paragraph of the text. Then, for each sentence of that paragraph, get the second-to-last character of each of the last two words.

Example: "It was a large lovely garden, with soft gre**e**n gra**s**s."

Your `cherry_pick_chars` function should return a list containing the highlighted characters: ... gre**e**n gra**s**s ... ri**c**h fru**i**t ... **t**o th**e**m ... ea**c**h oth**e**r

In [2]:
def cherry_pick_chars(text):
    paragraph2 = text.split("\n")[3]
    sentences = paragraph2.split(". ")
    toreturn = []
    for sentence in sentences:
        words = sentence.split()
        toreturn.append(''.join(filter(str.isalpha, words[-2]))[-2])
        toreturn.append(''.join(filter(str.isalpha, words[-1]))[-2])
    return toreturn

# Your code should work with the example below, but you're free to change it.
print(cherry_pick_chars(text))

['e', 's', 'c', 'i', 't', 'e', 'c', 'e']


### Exercise 4: String Formatting

#### a) Bob Solo

Create the string `Bob Solo from Seattle is 41 years old.` based on the given inputs. Make sure the age is always rounded down.

In [3]:
import math

def describe_human(age, first_name, last_name, location):
    return f"{first_name.strip().title()} {last_name.strip().title()} from {location.strip().title()} is {math.floor(age)} years old."

# Your code should work with the example below, but you're free to change it.
print(describe_human(age=41.25, first_name="bob ", last_name=" SOLO", location="   Seattle"))

Bob Solo from Seattle is 41 years old.


#### b) Formatted values

With a dictionary such as:

```python
{
    "a": 101,
    "b": 120.3,
    "c": 130.223,
}
```

print its contents like this:

```
A: 101.0
B: 120.3
C: 130.2
```

Requirements:

- Print the key in ALL-CAPS
- Print the value with 1 digit after the decimal point

Hint: You won't need to use `round(...)` or `float(...)` manually.

In [4]:
def print_values(values):
    for key, value in values.items():
        print(f"{key.upper()}: {value:.1f}")


# Your code should work with the example below, but you're free to change it.
values = {
    "a": 101,
    "b": 120.3,
    "c": 130.223,
}
print_values(values)

A: 101.0
B: 120.3
C: 130.2


### Exercise 5: Counting Things

Create a function `count_things` which returns a dictionary containing the following information about the given text:

* the number of lines (key: `lines`). The final newline (i.e. the empty line `""`) should *not* count as a line, so 4 lines are expected with the given text. However, `"Hello World"` should be counted as 1 line, not 0.
* ... commas (key: `commas`)
* ... periods (key: `periods`)
* ... words (key: `words`). Words are considered whitespace-separated, so `foo\nbar` is two words, and so is `foo\n\tbar`.
* ... unique words (key: `unique words`). To make things easier, you don't need to account for punctuation, and words that differ only in upper-/lower-case count as different words. Thus, `"Test test! test"` is counted as three different words.
* ... whitespace characters (key: `whitespace characters`)

Hint: For unique words, think of an existing Python data structure that allows no duplicate entries.

In [38]:
def count_things(text):
    words = text.split()
    return {
        "lines": len(text.splitlines()),
        "commas": text.count(","),
        "periods": text.count("."),
        "words": len(words),
        # a set keeps every word only once
        "unique words": len(set(words)),
        # str.isspace() covers spaces, tabs, "\n" and "\r"
        "whitespace characters": sum(char.isspace() for char in text),
    }
    


# Your code should work with the example below, but you're free to change it.
text = """ Lorem ipsum dolor sit amet, consetetur sadipscing elitr,
sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat,
sed diam voluptua. At vero eos et \taccusam et justo duo dolores et ea rebum.
Stet clita kasd gubergren, no sea takimata sanctus est Lorem   ipsum dolor sit amet.
"""

# text = " Hello Hello  ,"
print(count_things(text))

{'lines': 4, 'commas': 4, 'periods': 3, 'words': 50, 'unique words': 41, 'whitespace characters': 54}
