<header>
    <div style="overflow: auto;">
        <img src="https://digital-skills.tudelft.nl/nb_style/figures/TUDelft.jpg" style="float: left;" />
        <img src="https://digital-skills.tudelft.nl/nb_style/figures/DUT_Flame.png" style="float: right; width: 100px;" />
    </div>
    <div style="text-align: center;">
        <h2><large>Digital Skills</large> -- Python Basic Programming --</h2>
        <h6>&copy; 2019, TU Delft. Creative Commons</h6>     
    </div>
    <br>   
    <br>
</header>

# Strings

## In this Notebook

In this notebook, we discuss the basic use of strings:
* Python's UTF-8 default string encoding
* Python's string en-quoting
* review of the creation of string variables from literals
* string interpretation, special characters, escaping, and raw strings
* embedding quoted string literals within strings
* strings and characters
* review of indexing, slices, and substrings
* string manipulation
* string operations
* lexicographical comparison of strings
* number-to-string and string-to-number
* composing formatted strings and tables

All topics will be discussed at a basic programming level.

#### Assumed prior knowledge and skills
- Notebook PythonBasicProgramming_20 on Variables

## String encoding

String encoding has to do with how characters are represented and manipulated. In Python 3, the default encoding for source code is UTF-8, which means that string literals in your code can contain non-ASCII characters directly. Going into this matter is beyond the scope of this notebook; just check the example below. Important for now is that:
* Unicode, the character set that UTF-8 (UTF-16, ...) encodes, is a superset of ASCII, and in UTF-8 the ASCII characters keep their original single-byte codes
* ASCII as well as Unicode contain printable and non-printable characters, letters, digits, spaces, ...; these are properties you can query using operations on strings
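
As a minimal sketch of querying such character properties, the cell below uses the built-in `str` methods `isalpha()`, `isdigit()`, `isspace()`, `isprintable()`, and `isascii()`. The list `samples` is just an illustration, and `isascii()` assumes Python 3.7 or later.

```python
# A few sample characters: a letter, a digit, a space, a tab, a Chinese character
samples = ['a', '7', ' ', '\t', '零']

for ch in samples:
    # repr() makes invisible characters such as the tab visible as '\t'
    print(repr(ch),
          'alpha:', ch.isalpha(),
          'digit:', ch.isdigit(),
          'space:', ch.isspace(),
          'printable:', ch.isprintable(),
          'ascii:', ch.isascii())
```

Note that the tab counts as a space but not as a printable character, and that `'零'` is a letter outside the ASCII range.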

Further background: 
* [UTF-characters](https://en.wikipedia.org/wiki/UTF-8)
* [Python Howto](https://docs.python.org/3/howto/unicode.html)

In [1]:
my_count_list = [ '(líng, 零)', '(yī, 一)', '(èr, 二)', '(sān, 三)', '(sì, 四)', '(wǔ, 五)' ]      # 0, 1, 2, 3, 4, 5
print('my_count_list :', my_count_list)

my_zero  = my_count_list[0][-2]  # no utf encoding needed
my_one   = my_count_list[1][-2]
my_two   = my_count_list[2][-2]
my_three = my_count_list[3][-2]
my_four  = my_count_list[4][-2]
my_five  = my_count_list[5][-2]

# no utf decoding needed ...
print('counting 0..5 :', my_zero, my_one, my_two, my_three, my_four, my_five)

my_count_list : ['(líng, 零)', '(yī, 一)', '(èr, 二)', '(sān, 三)', '(sì, 四)', '(wǔ, 五)']
counting 0..5 : 零 一 二 三 四 五


## String en-quoting

A string variable represents text: a sequence of characters (letters, digits, punctuation, spaces, tabs, ...) that is written in code as an en-quoted literal. Python has multiple ways of quoting:

| option | quotes | example |
|:---:|:---:|:---:|
| 1 | single-quotes | `'Hi! I am in a good mood'`|
| 2 | double-quotes | `"Hi! I am in a good mood"`|
| 3 | triple-quotes | `'''Hi! I am in a good mood'''`|
| . | or: | `"""Hi! I am in a good mood"""`|
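
The sketch below illustrates the table: all quoting styles produce the same string value, triple quotes additionally allow a literal to span several lines, and quotes of the other kind can be embedded without escaping. The variable names are chosen for this example only.

```python
single = 'Hi! I am in a good mood'
double = "Hi! I am in a good mood"
triple = '''Hi! I am in a good mood'''

# the quoting style is not part of the string value
print(single == double == triple)

# a triple-quoted literal may span lines; the line break becomes part of the string
multi_line = """Hi!
I am in a good mood"""
print(multi_line)

# double quotes embedded in a single-quoted literal need no escaping
quote = 'She said: "Hi!"'
print(quote)
```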

## String interpretation, special characters, escaping, raw strings
 
Some characters have a *special meaning* and are interpreted by Python. We can cancel their special meaning by *escaping them*, putting a `\` in front of the special character. Which characters need escaping depends on the en-quoting used:
```
    my_law = 'Murphy\'s law'  # we have to escape ' here
    my_law = "Murphy's law"   # no escape needed
```
To escape-the-escape, again use an escape:
```
    my_tip = "use a newline `\\n` in a string to continue printing on the next line"
```

Python can also be prevented from interpreting the backslash escapes in a string *altogether* by turning the string into a so-called *raw string*, by putting an `r` in front of the opening quote. A typical use case is passing on a string in *(La)TeX* format to Matplotlib, preventing it from any interpretation by Python. For example: `title_plot = r'$\Phi(\omega)=\omega^2-\frac{3}{2}\pi$'`, plotting $\Phi(\omega)=\omega^2-\frac{3}{2}\pi$.

For additional background information, see [this article](https://realpython.com/python-strings/).

In [2]:
# escaping the escape: `\\n` in the literal is printed as a backslash followed by an n ...
my_tip = "use a newline `\\n` in a string to continue printing on the next line"
print(my_tip)

use a newline `\n` in a string to continue printing on the next line


The `\` is used in a similar fashion when a code line is too long to fit on a single line: a `\` at the very end of a line escapes the (invisible) newline, so that Python treats the next line as a continuation instead of a new code line. The indentation of the continuation line is ignored. White space in front of the `\` is fine, but nothing may follow the `\`, not even a space; otherwise Python raises a `SyntaxError`.

In [3]:
# escaping (invisible) newlines while embedding a visible string literal `\\n` in a string ...
my_longer_tip = \
   "Sometimes you have a tip that takes a longer literal string to type. "             \
   "Break it up in parts this way, and Python will concatenate them for you. "         \
   "In this example, we use the escape character `\\` to cancel the newlines `\\n`, "  \
   "basically expanding the code line on which the literal string value is specified " \
   "to one long-stretched line. Don\'t be misled by the green output terminal below "  \
   "folding this long line in its output."
   
print(my_longer_tip)

Sometimes you have a tip that takes a longer literal string to type. Break it up in parts this way, and Python will concatenate them for you. In this example, we use the escape character `\` to cancel the newlines `\n`, basically expanding the code line on which the literal string value is specified to one long-stretched line. Don't be misled by the green output terminal below folding this long line in its output.
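
As an alternative sketch, the same effect can be had without any `\` at the line ends: inside parentheses Python continues a code line implicitly, and adjacent string literals are still concatenated. The variable name `my_tip_in_parts` is made up for this example.

```python
# inside ( ... ) a statement may span several lines without `\` continuations
my_tip_in_parts = (
    "Adjacent string literals are concatenated by Python, "
    "and inside parentheses a statement may span several lines "
    "without any `\\` at the line ends."
)
print(my_tip_in_parts)
```

This form is less fragile, because an accidental space after a `\` cannot break the code.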


In [4]:
# using a raw string ...
title_plot = r'$\Phi(\omega)=\omega^2-\frac{3}{2}\pi$'
print("raw string in LaTeX:", title_plot)

yet_another_line = r'part one (\n\n\t) and part two'    # try removing the `r`
print(yet_another_line)

raw string in LaTeX: $\Phi(\omega)=\omega^2-\frac{3}{2}\pi$
part one (\n\n\t) and part two
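
To make the difference between a normal and a raw string tangible, the sketch below compares their lengths. The names `normal` and `raw` are illustrative.

```python
normal = 'a\tb\n'    # \t and \n are each interpreted as ONE special character
raw    = r'a\tb\n'   # the backslashes are kept literally

print(len(normal), repr(normal))   # 4 characters
print(len(raw), repr(raw))         # 6 characters

# a raw string equals the normal string in which every backslash is escaped
print(raw == 'a\\tb\\n')
```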


## Strings and characters
Some languages distinguish between single-character data types and string data types. Python does not do that and just uses the `str` string data type. Run the code cell below yourself to verify this:

In [5]:
char = 'c'
print('type of char:', type(char))

type of char: <class 'str'>


However, single characters can be converted back and forth to integers: their Unicode code points, which for ASCII characters coincide with their positions in the ASCII table. We just give some examples.

In [6]:
char = 'c'
int_c = ord('c')   # given a char, give its integer code (for ASCII chars: position in the ASCII table)
ch_c  = chr(int_c) # given an integer code, look up the corresponding char

print('In the ASCII-table, char {:s} is represented by integer {:d}: check: {:s}'.\
     format(char, int_c, ch_c))
print('ASCII lowercase letters: {:s}-{:s} have ord()-representation: {:d}-{:d}'. \
      format('a', 'z', ord('a'), ord('z')))

In the ASCII-table, char c is represented by integer 99: check: c
ASCII lowercase letters: a-z have ord()-representation: 97-122
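
`ord()` and `chr()` are not limited to ASCII. The sketch below, with sample characters chosen for illustration, shows the round trip for a few characters and a small piece of arithmetic on character codes.

```python
# round trip: character -> integer code -> character
for ch in ['c', '€', '一']:
    code = ord(ch)
    print('{:s} -> {:d} (hex {:x}) -> {:s}'.format(ch, code, code, chr(code)))

# character codes are ordinary integers, so you can compute with them
shifted = chr(ord('a') + 1)
print('the letter after a is', shifted)
```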


## String indexing, slices, and substrings

A string is much like a list: you can access individual characters and character ranges (substrings) by indexing and slicing. Strings are indexed starting from 0. Try it yourself below. Furthermore, for formatted printing of a string, use `{:s}` as a placeholder:

In [7]:
my_string = "kdsokdokwokdowkokokdew"
print('all   chars:', list( my_string ))
print('first  char: {:s}'.format(my_string[0]))
print('last   char: {:s}'.format(my_string[ len(my_string)-1 ]))
print('mid   chars: {:s}'.format(my_string[ 1:len(my_string)-1 ]))

# Note: join() is explained further down

my_string = '0123456789'
print('my  numbers: {:s}'.format(my_string))
print('even number: {:s}'.format(','.join(my_string[0:len(my_string):2])))

odds = slice(1,len(my_string),2)
print('odd numbers: {:s}'.format(','.join(my_string[ odds ])))

all   chars: ['k', 'd', 's', 'o', 'k', 'd', 'o', 'k', 'w', 'o', 'k', 'd', 'o', 'w', 'k', 'o', 'k', 'o', 'k', 'd', 'e', 'w']
first  char: k
last   char: w
mid   chars: dsokdokwokdowkokokde
my  numbers: 0123456789
even number: 0,2,4,6,8
odd numbers: 1,3,5,7,9
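
The explicit `len(my_string)` arithmetic above can often be avoided. As a short sketch, negative indexes count from the end of the string, and omitted slice bounds default to the start and the end:

```python
my_string = '0123456789'

print(my_string[-1])     # last character: 9
print(my_string[:3])     # first three characters: 012
print(my_string[-3:])    # last three characters: 789
print(my_string[1:-1])   # everything but the first and last: 12345678
print(my_string[::2])    # every second character: 02468
print(my_string[::-1])   # reversed: 9876543210
```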


## String operations
 
Strings come with a lot of helpful functions (of which `.format()` is just one!). We show some below. See [this article](https://realpython.com/python-strings/) for additional background. Experiment with them yourself! Some of the most important ones (apart from `.format()`, `+`, and `*`) are:
* `len(my_string)         # the length of my_string`
* `str( var )             # convert var to string`
* `my_string.upper()      # make my_string uppercase`
* `my_string.lower()      # same, lowercase`
* `my_string.capitalize() # capitalize the first character, lower all others`
* `my_char.join(my_list)  # join the strings in my_list, using my_char as separator`
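
A quick sketch of the operations listed above, applied to an example string chosen for illustration:

```python
my_string = 'digital Skills'

print(len(my_string))                # 14
print(my_string.upper())             # DIGITAL SKILLS
print(my_string.lower())             # digital skills
print(my_string.capitalize())        # Digital skills
print(str(2019) + '!')               # 2019!   (+ concatenates strings)
print('ab' * 3)                      # ababab  (* repeats a string)
print('-'.join(['a', 'b', 'c']))     # a-b-c
```

Note that `join()` expects all items of the list to be strings; convert numbers with `str()` first.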

## String manipulation

Strings are immutable, so you cannot modify their content in place. The exercise below demonstrates this.

#### DO THIS
1. in `my_name = "james bond"` in the code below, try to change the `j` into a `J` and the `b` into a `B` by assigning to the characters `my_name[0]` and `my_name[6]`:
```
    
    my_name[0] = "J"
    my_name[6] = "B"
```
Python will raise a `TypeError`, complaining that strings do not support item assignment. Instead, do one of the following:
1. do this assignment:
    ```
        my_name = "james bond".title()   # title case: capitalize all words
    ```
2. do this:
    ```
        my_name = "james bond"
        my_real_name = my_name.title()
    ```
3. or just re-assign:
    ```
        my_name = "james bond"
        my_name = "James Bond"
    ```
    

In [8]:
my_name = "james bond"
my_real_name = my_name.title()
print("my name is", my_name, ';', my_real_name, '!')

my name is james bond ; James Bond !
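
As a sketch of what happens in the exercise, the cell below catches the `TypeError` and then builds a *new* string from slices of the old one. The names `error_text` and `fixed_name` are made up for this example.

```python
my_name = "james bond"

try:
    my_name[0] = "J"              # strings are immutable: this raises a TypeError
except TypeError as err:
    error_text = str(err)
    print('TypeError:', error_text)

# build a new string from slices of the old one instead
fixed_name = "J" + my_name[1:6] + "B" + my_name[7:]
print(fixed_name)                 # James Bond
print(my_name)                    # the original is unchanged: james bond
```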


## String comparison

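When comparing strings (in sorting, for instance), remember that strings are compared *lexicographically*, character by character by character code, and not *numerically* (like numbers). In the example below the first characters already decide the outcome: `'M'` comes before `'T'`, and the lengths of the strings play no role.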

In [9]:
my_string = "This string is 32 positions long"
print('my_string: {:s}, length: {:d}'.format(my_string, len(my_string)))

my_other_string = "My other string is somewhat longer"
print('my_other_string: {:s}, length: {:d}'.format(my_other_string, len(my_other_string)))

print()

print('my_other_string <  my_string? : {:s}'.format( str(my_other_string < my_string) ))
print('my_other_string == my_string? : {:s}'.format( str(my_other_string == my_string) ))
print('my_other_string  > my_string? : {:s}'.format( str(my_other_string >  my_string) ))

my_string: This string is 32 positions long, length: 32
my_other_string: My other string is somewhat longer, length: 34

my_other_string <  my_string? : True
my_other_string == my_string? : False
my_other_string  > my_string? : False
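
The sketch below shows two classic surprises of lexicographical comparison, numbers stored as strings and mixed upper/lowercase, together with the `key` argument of `sorted()` as one possible remedy. The lists `versions` and `names` are example data.

```python
print('10' < '9')          # True:  '1' comes before '9'
print(10 < 9)              # False: numerical comparison
print('Zebra' < 'apple')   # True:  uppercase letters have lower codes than lowercase

versions = ['10', '9', '100', '2']
print(sorted(versions))               # ['10', '100', '2', '9']
print(sorted(versions, key=int))      # ['2', '9', '10', '100']

names = ['bob', 'Zoe', 'alice']
print(sorted(names))                  # ['Zoe', 'alice', 'bob']
print(sorted(names, key=str.lower))   # ['alice', 'bob', 'Zoe']
```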


## Strings-to-numbers and numbers-to-strings

How do you handle numbers that come in the form of a string? If you want to handle them *as numbers*, for instance to do computations or to compare them numerically, convert the string to a number (see below). If you want to handle them as strings, convert the number to a string.

In [10]:
my_string = '0.1'
my_eps    =  0.1

if  my_eps == float(my_string): # turn my_string into a float and compare ...
    print('my_string represents the value of my_eps: {:f}'.format(float(my_string)))
else:
    print('values cannot be compared')
    
# the other way around ... Don't forget to change the placeholder! 
# observe how default numeric precision works if you don't specify the decimals to print ...

if str(my_eps) == my_string:    # turn my_eps into a string
    print('my_eps represents the value of my_string: {:s}'.format(str(my_eps)))
else:
    print('values cannot be compared')

my_string represents the value of my_eps: 0.100000
my_eps represents the value of my_string: 0.1
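
A few more conversions, as a minimal sketch: `int()` and `float()` turn strings into numbers and raise a `ValueError` when the text does not fit, while `str()` and `.format()` turn numbers into strings. Comparing floats through their string form is fragile, as the last lines show. All variable names are illustrative.

```python
count = int('42') + 1        # string -> integer, then compute: 43
ratio = float('3.5') * 2     # string -> float, then compute: 7.0
print(count, ratio)

try:
    int('4.2')               # not a valid integer literal
except ValueError as err:
    conversion_error = str(err)
    print('ValueError:', conversion_error)

total = 0.1 + 0.2
print(str(total))               # 0.30000000000000004 (floating-point rounding)
print('{:.2f}'.format(total))   # 0.30
print(str(total) == '0.3')      # False
```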


## String formatting

Strings can be formatted during creation, *not just when printing*. For formatted printing of strings, see *Notebook_23 on Formatted Printing*. For further background on string formatting, see [here](https://www.geeksforgeeks.org/python-format-function/). You should have observed that `.format()` is actually a string operation (or: string method). Below we show a few more.

In [11]:
# Compose a table of 4 fields and 'tab' from field to field ...

f_widths = (20, 24, 12, 9)       # field  width field 1 .. 4
f_aligns = ('<', '<', '^', '>')  # alignment of the field 1..4

next_fld = '\t'
next_ln  = '\n'

# compose the field format specifications ...
field_1  = '{:' + str(f_aligns[0]) + str(f_widths[0]) + 's}' # composes: '{:<20s}'
field_2  = '{:' + str(f_aligns[1]) + str(f_widths[1]) + 's}'
field_3  = '{:' + str(f_aligns[2]) + str(f_widths[2]) + 's}'
field_4  = '{:' + str(f_aligns[3]) + str(f_widths[3]) + 's}'

# the layout of a single line: fields as defined above ...
line_tmpl = field_1 + next_fld + \
            field_2 + next_fld + \
            field_3 + next_fld + \
            field_4

# use the line template to fill header lines and data line ...
header   = line_tmpl.format("URL / domain", "path", "ranked pages", "coverage")
hbars    = line_tmpl.format("-"*f_widths[0],"-"*f_widths[1],"-"*f_widths[2],"-"*f_widths[3])

# now comes the data in each of the 'cells' ...
data_ln_1= line_tmpl.format("http://www.tudelft.nl", "/en/eemcs/study/", "20%", "40%")
data_ln_2= line_tmpl.format("http://www.tudelft.nl", "/en/3me/education/", "18%", "13%")
data_ln_3= line_tmpl.format("http://www.uu.nl", "/bachelors/en", "22%", "63%")

# pack all data lines in a single string  ...
data = data_ln_1 + next_ln + data_ln_2 + next_ln +  data_ln_3

# ... and print all at once ...
print(header + next_ln + hbars + next_ln + data)

URL / domain        	path                    	ranked pages	 coverage
--------------------	------------------------	------------	---------
http://www.tudelft.nl	/en/eemcs/study/        	    20%     	      40%
http://www.tudelft.nl	/en/3me/education/      	    18%     	      13%
http://www.uu.nl    	/bachelors/en           	    22%     	      63%
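
The same line layout can also be sketched more compactly with f-strings (assuming Python 3.6 or later), in which the alignment and width are filled in as nested fields. The helper `format_row` is a hypothetical name introduced for this example, not part of the notebook above.

```python
f_widths = (20, 24, 12, 9)       # field widths, as in the cell above
f_aligns = ('<', '<', '^', '>')  # field alignments, as in the cell above

def format_row(cells):
    """Format one table row: align each cell in its field, separate fields by tabs."""
    fields = [f'{cell:{align}{width}s}'
              for cell, align, width in zip(cells, f_aligns, f_widths)]
    return '\t'.join(fields)

print(format_row(("URL / domain", "path", "ranked pages", "coverage")))
print(format_row(["-" * width for width in f_widths]))
print(format_row(("http://www.uu.nl", "/bachelors/en", "22%", "63%")))
```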


Here is another approach, using concatenation, repetition, and `.join()`:

In [12]:
nl      = "\n"
vbar    = "| "
space   = " "

# try to create 'my_hdr' using: ' '.join(["header", "above", "this", "table"]), or even
# using: space.join(["header", "above", "this", "table"]) ...
my_hdr  = "header" + space + "above" + space + "this" + space + "table"
my_hbar = "-" * len(my_hdr)                          # horizontal bar of length header
my_vals = ['1.0', '2.0'] + ['3', '4', '5'] + ["OK"]  # some list of values (strings)
my_data = vbar.join( my_vals )                       # connect all values by a vbar 
my_row  = vbar + my_data + vbar                      # add leftmost and rightmost bar

print(my_hdr.upper() + nl + my_hbar + nl + my_row + nl + my_row + nl + my_row)

HEADER ABOVE THIS TABLE
-----------------------
| 1.0| 2.0| 3| 4| 5| OK| 
| 1.0| 2.0| 3| 4| 5| OK| 
| 1.0| 2.0| 3| 4| 5| OK| 


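Below, the use of the backspace character `\b` is demonstrated. It moves the cursor one position back, so its typical use is to overwrite some values in a displayed number, status, input value, or something similar. Whether the overwriting is actually visible depends on where the output is shown: a terminal usually honours `\b`, while a notebook output area may simply ignore it.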

In [13]:
print("Printing: 1\b2\b3\b4\b5   " \
      "(12345, but `\\b` 'backspaces', overwriting the same position all the time)")

Printing: 12345   (12345, but `\b` 'backspaces', overwriting the same position all the time)


## Done