In [30]:
from IPython.display import HTML
from IPython.display import display

tag = HTML('''
<style>
.advanced-cell {
    background-color: #e84c2250;
}
.advanced-cell::after {
    position: absolute;
    display: block;
    top: -2px;
    right: -2px;
    width: 5px;
    height: calc(100% + 3px);
    content: '';
    background: #e84c22;
}
.advanced-label-row {
    border-bottom: 1px solid #e84c22;
    display: flex;
    font-weight: bold;
}
.advanced-label {
    margin-left: auto;
    background-color: #e84c22;
    padding: 5px 8px;
    color: white;
    margin-right: -2px;
}
</style>
<script>

// Toggle the highlighting of advanced topics in the notebook
var highlighted = false;
function highlight_advanced_topics() {
    $(".advanced-cell").removeClass("advanced-cell");
    $(".advanced-label-row").remove();
    if(highlighted) {
        highlighted = false;
        return;
    }
    var advanced = false;
    $(".jp-Cell.jp-MarkdownCell,.jp-Cell.jp-CodeCell").each(function(){
        if(!advanced) {
            if($(this).find(".advanced-start").length > 0) {
                $(this).before("<div class='advanced-label-row'><span class='advanced-label'>Advanced Topic</span></div>");
                $(this).addClass("advanced-cell");
                advanced = true;
            }        
        } else {
            if($(this).find(".advanced-stop").length > 0) {
                if($(this).find(".advanced-start").length > 0) {
                    $(this).before("<div class='advanced-label-row' style='margin-top: 10px;'><span class='advanced-label'>Advanced Topic</span></div>");
                    $(this).addClass("advanced-cell");
                } else {
                    advanced = false;
                }
            } else {
                $(this).addClass("advanced-cell");
            }
        }
    });
    highlighted = true
}

(function() {
  // Load the script
  const script = document.createElement("script");
  script.src = 'https://ajax.googleapis.com/ajax/libs/jquery/3.6.0/jquery.min.js';
  script.type = 'text/javascript';
  script.addEventListener('load', () => {
    $(document).ready(highlight_advanced_topics);
  });
  document.head.appendChild(script);
})();
</script>
<div class="m-5 p-5"><span class="alert alert-block alert-danger">Advanced topics in notebook are highlighted!</span></div>''')
display(tag)

# Strings in Python

Strings are sequences of characters. Whatever our application may be, more often than not we end up processing strings at least to some extent in our programs. In Python, strings are objects of type `str`, and a rich set of operators, functions, and methods is available for working with them.

Strings are a fundamental data type but they also have some quirks you should know about. Let's have a closer look.

## String literals
A **string literal** is an expression that we can use to represent a specific string value. String literals can be defined by enclosing a sequence of characters using either single or double quotes.

In [56]:
str1 = "StringLiteral"
str2 = 'StringLiteral'
print(type(str1), type(str2))
print(str1 == str2)

<class 'str'> <class 'str'>
True


A string in Python can contain as many characters as our memory allows and can also be empty. We can use the built-in function `len()` to get the length of a string.

In [119]:
empty = ""
print(empty)
print(len(empty))


0


We can use a backslash (`\`) to suppress or apply a special interpretation of certain characters within a string. A backslash followed by one or more characters is referred to as an escape sequence. For instance, we can suppress the special interpretation of the single quote in a string literal defined using single quotes:

In [53]:
string_with_quotes = 'I\'m a string that\'s not afraid to use quotes!'
print(string_with_quotes)

I'm a string that's not afraid to use quotes!


Alternatively, we can use escape sequences to apply a special meaning to the characters `t` and `n`, so that they are interpreted as a tab and a newline character respectively, instead of as a lowercase T and a lowercase N.

In [54]:
print("Item\t\tPrice\nApples\t\t2.56€\nBread\t\t1.86€\nCola\t\t1.99€")

Item		Price
Apples		2.56€
Bread		1.86€
Cola		1.99€


When we want strings in which sequences of characters starting with a backslash are not interpreted as escape sequences, we can use **raw string literals**. To define a raw string literal, simply prepend an `r` or `R` to the opening quote:

In [23]:
print(r"Item\t\tPrice\nApples\t\t2.56€\nBread\t\t1.86€\nCola\t\t1.99€")

Item\t\tPrice\nApples\t\t2.56€\nBread\t\t1.86€\nCola\t\t1.99€
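
A quick way to see the difference is to compare lengths. The sketch below also shows a typical use for raw strings, a Windows-style path; the path itself is made up for illustration.

```python
# In a regular literal "\n" is a single newline character;
# in a raw literal it is a backslash followed by the letter n.
regular = "\n"
raw = r"\n"
print(len(regular), len(raw))

# Hypothetical Windows-style path: without the r prefix, "\n" and "\t"
# would be interpreted as a newline and a tab.
path = r"C:\new_folder\table.txt"
print(path)
```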


## Immutability of strings

Strings are immutable: once a string has been created, its contents cannot be changed. This means that when we combine two strings we are actually creating a new string.

In [64]:
prefix = "un"
suffix = "helpful"
word = prefix + suffix
print(word)
print(word is prefix)
print(word is suffix)
print(id(word), id(prefix), id(suffix))

unhelpful
False
False
140123375706032 140123375994224 140123375993904
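
Immutability also means that we cannot modify a string in place. As a minimal sketch, trying to assign to a single character raises a `TypeError`, so to "change" a string we build a new one:

```python
word = "unhelpful"
try:
    word[0] = "U"   # strings do not support item assignment
except TypeError as err:
    print(f"TypeError: {err}")

# Build a new string from the pieces we want; the original is untouched.
capitalized = "U" + word[1:]
print(capitalized, word)
```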


<span class="advanced-start"></span>
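As with objects of numeric types, because strings are immutable Python can save computing resources by having string literals with the same value refer to the same object in memory. This is an optimization of the interpreter (CPython in particular) rather than a guarantee of the language.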


In [65]:
same = "helpful"
print(suffix is same)

True
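
In practice this is why strings should be compared with `==` and not with `is`. In the sketch below, a string assembled at run time has the same value as the literal, but whether it is the same object depends on the interpreter, so no particular result should be assumed for the second line.

```python
literal = "helpful"
# Built at run time, so the interpreter may or may not reuse the literal's object.
built = "".join(["help", "ful"])

print(literal == built)   # value comparison: True
print(literal is built)   # identity comparison: implementation-dependent
```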


<span class="advanced-stop"></span>
## Python Strings are Unicode

So, strings are sequences of characters, but what exactly is a **character**? Starting from Python 3, all characters in strings are actually [Unicode](https://en.wikipedia.org/wiki/Unicode) characters. Unicode is a standard that attempts to provide a numeric code for every possible character, in every possible language, on every possible platform. In practice, this means that we can have strings in Python that contain all kinds of interesting characters:

In [71]:
silly_string = "ൠᴥ❽❋✈﷽🤣A☣✌東京"
quiz = "🔨✈️📰🧬🛏️🐜⚙️ = ❓"
print(silly_string)
print(quiz)

ൠᴥ❽❋✈﷽🤣A☣✌東京
🔨✈️📰🧬🛏️🐜⚙️ = ❓


<span class="advanced-start"></span>
The Unicode standard explicitly separates the identity of characters from their specific representation in bytes. The identity of a character is represented by a **code point**, which is just a number assigned to a specific symbol in the Unicode standard. Code points are usually expressed using 4-6 hex digits with a "U+" prefix. For instance, the letter "A" and the "rolling on the floor laughing" emoji have code points U+0041 and U+1F923, respectively. We can use the `ord()` and `chr()` built-in functions to convert from character to code point and vice versa.

In [69]:
print(f"0x{ord('A'):04x}")
print(f"0x{ord('🤣'):04x}")
print(f"{chr(0x0041)}")
print(f"{chr(0x1f923)}")

0x0041
0x1f923
A
🤣


The actual byte representation of a character depends on the **encoding** in use. Encoding is the process that converts code points into a sequence of bytes; *decoding* is the opposite process. The Unicode standard defines several encodings, such as UTF-8, UTF-16 and UTF-32. Encoding and decoding of strings is a huge topic on its own and we're not going to get into it here. We will just point out that, depending on the encoding used, a code point may require a different number of bytes to be represented.

Internally, Python stores strings as a sequence of code points. When we define a string literal we don't need to bother with its actual representation in bytes; we are just defining a sequence of code points. We need to take encoding into account only when dealing with input/output of text data. In that case, the rule of thumb is: on input, decode bytes into a Python `str` as early as possible, do all processing exclusively on the `str` objects and, on output, encode them to bytes as late as possible.
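
As a small illustration of these two points, the sketch below uses `str.encode()` and `bytes.decode()` to count how many bytes the same two code points need under three encodings, and then does a round trip through UTF-8. The variable names are only for illustration.

```python
text = "A🤣"

# Number of bytes needed per character under three Unicode encodings.
sizes = {enc: [len(ch.encode(enc)) for ch in text]
         for enc in ("utf-8", "utf-16-le", "utf-32-le")}
for enc, per_char in sizes.items():
    print(f"{enc}: {per_char}")

# Round trip: encode to bytes on output, decode back to str on input.
data = text.encode("utf-8")
restored = data.decode("utf-8")
print(data, restored == text)
```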

### A pesky problem with Unicode strings

Consider the following strings:

In [82]:
moria1 = "Khazad-dûm"
moria2 = "Khazad-dûm"
print(moria1)
print(moria2)

Khazad-dûm
Khazad-dûm


If we compare them, they should compare equal, right?

In [83]:
print(moria1 == moria2)
print(len(moria1), len(moria2))
print(list(moria1))
print(list(moria2))

False
10 11
['K', 'h', 'a', 'z', 'a', 'd', '-', 'd', 'û', 'm']
['K', 'h', 'a', 'z', 'a', 'd', '-', 'd', 'u', '̂', 'm']


It turns out there may be multiple ways of representing the same character in Unicode. This may complicate things when making comparisons. For instance, Unicode has *combining characters*: diacritics or other marks that attach to the preceding character, appearing as one when printed. In the example above, the combining circumflex accent (U+0302) in `moria2` attaches to the preceding character 'u', rendering as the single 'û' character when the string is printed. Despite printing the same, the two strings have a different number of code points and therefore do not compare equal.

The solution to this problem is to use Unicode normalization. We will not cover normalization in depth here, but it is useful for you to know that it may be needed if you end up processing text that allows for this kind of character to appear.
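
Without going into the details, a minimal sketch using the standard-library `unicodedata` module could look like the following. The two spellings are written with explicit escape sequences so that the difference is visible in the source; NFC is chosen here as one reasonable normalization form, not the only one.

```python
import unicodedata

composed = "Khazad-d\u00fbm"      # 'û' as a single code point
decomposed = "Khazad-du\u0302m"   # 'u' followed by a combining circumflex

print(composed == decomposed)

# NFC composes characters where possible; NFD would decompose them instead.
nfc_a = unicodedata.normalize("NFC", composed)
nfc_b = unicodedata.normalize("NFC", decomposed)
print(nfc_a == nfc_b, len(nfc_a), len(nfc_b))
```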

<span class="advanced-stop"></span>
## Convert objects to strings

The built-in functions `str()`, `repr()` and `ascii()` allow us to retrieve the string representation of an object. The purpose of `str()` is to provide a printable and nicely readable string representation of an object. The purpose of `repr()` is to provide a printable and unambiguous string representation of an object.

In [116]:
print(str(42), repr(42))
print(str("42"), repr("42"))
print(str(42)==str("42"))
print(repr(42)==repr("42"))

42 42
42 '42'
True
False


`ascii()` is the same as `repr()` but escapes all non-ASCII characters in the string.

In [117]:
print(ascii(silly_string))

'\u0d60\u1d25\u277d\u274b\u2708\uf8ff\ufdfd\U0001f923A\u2623\u270c\u6771\u4eac'


<span class="advanced-start"></span>
We will see in Module 3 how these functions are tied to the magic methods `__str__` and `__repr__`.

<span class="advanced-stop"></span>
## String interpolation

**String interpolation** is the process of evaluating a string literal containing one or more placeholders, yielding a result in which the placeholders are replaced with their corresponding values (from [Wikipedia](https://en.wikipedia.org/wiki/String_interpolation)).

Python provides different ways to perform string interpolation:
- Formatted String Literals
- str.format()
- %-formatting
- string.Template

In [112]:
passing = 42
tot = 129
print(f"Passing students: {passing}/{tot} ({passing/tot*100}%)")
print("Passing students: {}/{} ({}%)".format(passing, tot, passing/tot*100))
print("Passing students: %d/%d (%f%%)" % (passing, tot, passing/tot*100))

Passing students: 42/129 (32.55813953488372%)
Passing students: 42/129 (32.55813953488372%)
Passing students: 42/129 (32.558140%)
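
The fourth option in the list, `string.Template`, is not shown in the cell above. A minimal sketch of the same message could look like this; the placeholder name `pct` and the rounding to two decimals are choices made for this example only.

```python
from string import Template

passing = 42
tot = 129

# Placeholders are written as $name and filled in by substitute().
template = Template("Passing students: $passing/$tot ($pct%)")
line = template.substitute(passing=passing, tot=tot,
                           pct=round(passing / tot * 100, 2))
print(line)
```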


Among these methods, the currently preferred way is to use *Formatted String Literals* (also known as **f-strings**). We have already used f-strings extensively in this course. Every time we prepend `f` to a string and use curly braces to insert the value of a variable or expression at a particular place inside the string, we are defining a formatted string literal.

In [93]:
passing = 42
tot = 129
print("Passing students: ", passing, "/", tot, " (", passing/tot*100, "%)")
print(f"Passing students: {passing}/{tot} ({passing/tot*100}%)")

Passing students:  42 / 129  ( 32.55813953488372 %)
Passing students: 42/129 (32.55813953488372%)


<span class="advanced-start"></span>
f-strings allow us to add explicit *type conversions* and *format specifiers*. Expressions we put inside curly braces are implicitly converted to a string using `str()`. We can force an f-string to use a different function for the conversion by adding a trailing `!r` (for `repr()`), `!a` (for `ascii()`) or `!s` (for `str()`) to the expression.

In [96]:
msg = "ROFL 🤣"
sender = "Doug"
print(f"{sender} : {msg}")
print(f"{sender} : {msg!r}")
print(f"{sender} : {msg!a}")

Doug : ROFL 🤣
Doug : 'ROFL 🤣'
Doug : 'ROFL \U0001f923'


Format specifiers allow us to further define how the value of the expressions that we want to substitute in the string will be presented. The syntax of these format specifiers defines its own [mini-language](https://docs.python.org/3/library/string.html#formatspec).

In [101]:
print(f"Passing students: {passing}/{tot} ({passing/tot*100:>.2f}%)")

Passing students: 42/129 (32.56%)
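
A few more specifiers from the mini-language, as a small sketch; the values and variable names are only illustrative:

```python
item, price = "Apples", 2.56

row = f"{item:<10}|{price:>8.2f}|"   # left-align in 10 columns, right-align in 8
big = f"{1234567:,}"                 # thousands separator
hexa = f"{255:#06x}"                 # hex with 0x prefix, zero-padded to width 6
pct = f"{42/129:.1%}"                # multiply by 100 and append a percent sign

print(row, big, hexa, pct, sep="\n")
```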


A specifier that is very useful for debugging is `=` (available since Python 3.8). If we put it after an expression, it inserts both the text of the expression and its value, separated by "=", in the resulting string. Unless a format specifier follows, the value is shown using `repr()`.

In [122]:
print(f"{msg=}")
print(f"{passing/tot*100=:.2f}")

msg='ROFL 🤣'
passing/tot*100=32.56


<span class="advanced-stop"></span>
## Type-specific methods

The `str` type defines [many useful methods](https://docs.python.org/3/library/stdtypes.html#text-sequence-type-str) to manipulate strings. We will not go through each of these methods individually here. Instead, we will see how to use some of them in the context of an exercise. Suppose we have a list of email addresses in which the *local-part* (all characters that precede the symbol `@`) of the address contains the full name of a person, using the character `.` as a separator. We want to write a function that extracts a capitalized list of names from these addresses, optionally filtering them on the basis of a given domain.

In [6]:
emails = "sheev.palpatine@senate.republic.gov,  count.dooku@council.separatists.gov, padme.amidala@senate.republic.gov,bail.organa@senate.republic.gov, nute.gunray@trade.federation.com"

def get_names(emails, domain=None):
    names = []
    for email in emails.split(","):
        temp = email.strip()
        if not domain or temp.endswith(f"@{domain}"):
            components = temp[:temp.find("@")].split(".")
            names.append(" ".join([comp.capitalize() for comp in components]))
    return ", ".join(names)

print(f"All names: {get_names(emails)}")
print(f"Only senators: {get_names(emails, 'senate.republic.gov')}")

All names: Sheev Palpatine, Count Dooku, Padme Amidala, Bail Organa, Nute Gunray
Only senators: Sheev Palpatine, Padme Amidala, Bail Organa
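
To make the individual steps easier to follow, here is a sketch that applies the same kind of methods to a single address. It uses `str.partition()` to split at the `@`, which is an alternative to the `find()` plus slicing used in `get_names`, not what the function above does.

```python
address = "  padme.amidala@senate.republic.gov "

clean = address.strip()                     # drop surrounding whitespace
local, sep, domain = clean.partition("@")   # split once at the first "@"
name = " ".join(part.capitalize() for part in local.split("."))

print(repr(clean))
print(local, domain)
print(name)
print(clean.endswith("@senate.republic.gov"))
```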
