## Strings

In [2]:
"foobar" == 'foobar'

True

In [3]:
'"Where are you?"'

'"Where are you?"'

In [4]:
"I'm here"

"I'm here"

In [8]:
"""foo
bar
"""

'foo\nbar\n'

In [12]:
from textwrap import dedent

def render_code():
    cpp = dedent("""\
        #include <iostream>

        int main() {
            std::cout << "Hello World!";
            return 0;
        }
    """)
    print(cpp)
    
render_code()

#include <iostream>

int main() {
    std::cout << "Hello World!";
    return 0;
}



In [9]:
"foo" "bar"

'foobar'

String literals can include escape sequences. For example:

`\'`    Single quote

`\"`    Double quote

`\t`    ASCII Horizontal Tab (TAB)

`\n`    ASCII Linefeed (LF)

`\xhh`  Character with hex value hh

https://python-reference.readthedocs.io/en/latest/docs/str/escapes.html

In [10]:
print("\tell me more")

	ell me more


In [11]:
print(r"\tell me more")

\tell me more
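
One way to see what an escape sequence does is to compare lengths: an escape stands for a single character, while the same text in a raw string keeps the backslash. A minimal sketch using only built-ins (the variable names are illustrative):

```python
# Each escape sequence below stands for exactly one character.
tab = "\t"
newline = "\n"
letter_a = "\x41"          # hex 41 is the code point of 'A'

# A raw string keeps the backslash, so r"\t" is two characters.
raw_tab = r"\t"

escape_lengths = (len(tab), len(newline), len(letter_a), len(raw_tab))
print(escape_lengths)      # (1, 1, 1, 2)
print(letter_a == "A")     # True
```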


## Encodings

* [ASCII](https://en.wikipedia.org/wiki/ASCII)
* [Unicode](https://en.wikipedia.org/wiki/Unicode)
* [A Programmer’s Introduction to Unicode](http://reedbeta.com/blog/programmers-intro-to-unicode/)
* [The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)](https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/)
* [What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text](http://kunststube.net/encoding/)

## Unicode

In [1]:
a = "e"

In [2]:
b = chr(0x301)

In [3]:
c = "é"

In [4]:
(a + b, c)

('é', 'é')

In [5]:
a + b == c

False

In [6]:
from unicodedata import normalize

In [7]:
normalize("NFKC", a + b) == c

True
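
The comparison above uses `NFKC`, which also folds compatibility characters. For plain equality checks the canonical forms `NFC` and `NFD` are often enough. The sketch below is only an illustration; `equal_after` is a hypothetical helper, not part of `unicodedata`:

```python
from unicodedata import normalize

decomposed = "e" + chr(0x301)   # 'e' followed by COMBINING ACUTE ACCENT
composed = "\u00e9"             # precomposed LATIN SMALL LETTER E WITH ACUTE


def equal_after(form, left, right):
    """Compare two strings after normalizing both to the same form."""
    return normalize(form, left) == normalize(form, right)


print(equal_after("NFC", decomposed, composed))   # True
print(len(normalize("NFD", composed)))            # 2 (base letter + accent)
print(len(normalize("NFC", decomposed)))          # 1 (single code point)

# Compatibility forms go further: NFKC rewrites the 'fi' ligature, NFC does not.
ligature = "\ufb01"             # LATIN SMALL LIGATURE FI
print(normalize("NFC", ligature) == "fi")         # False
print(normalize("NFKC", ligature) == "fi")        # True
```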

In [9]:
len("🇩🇪")

2

**Unicode is complicated...**

---

In [12]:
s = "Я строка"

In [13]:
list(s)

['Я', ' ', 'с', 'т', 'р', 'о', 'к', 'а']

Python doesn't have a separate type for characters:

In [14]:
s[0], type(s[0])

('Я', str)

How are strings represented in memory?

`UTF-8`
`UTF-16`
`UTF-32`
`UCS-2`
`UCS-4`

[PEP 393 -- Flexible String Representation](https://www.python.org/dev/peps/pep-0393/)

In [15]:
[ord(c) for c in "hello"]  # 1 byte / code point, UCS-1

[104, 101, 108, 108, 111]

In [16]:
[ord(c) for c in "привет"]  # 2 bytes, UCS-2

[1087, 1088, 1080, 1074, 1077, 1090]

In [17]:
[ord(c) for c in "🇩🇪"]  # 4 bytes, UCS-4

[127465, 127466]
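
The storage widths noted in the comments above can be observed indirectly with `sys.getsizeof`. This is a rough sketch that assumes CPython's PEP 393 layout; other implementations may report different numbers, and `bytes_per_code_point` is a hypothetical helper:

```python
import sys


def bytes_per_code_point(ch, extra=1000):
    """Estimate storage per code point by growing a string made of `ch`."""
    short = ch
    long = ch * (extra + 1)
    # The fixed object header cancels out in the difference.
    return (sys.getsizeof(long) - sys.getsizeof(short)) // extra


print(bytes_per_code_point("h"))          # ASCII
print(bytes_per_code_point("п"))          # fits in 2 bytes
print(bytes_per_code_point(chr(127465)))  # needs 4 bytes
```

On CPython the three calls should report 1, 2 and 4 respectively. The widest code point in a string decides the width used for the whole string.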

### chr & ord

In [20]:
"\u0068", "\U00000068"

('h', 'h')

In [21]:
chr(0x68)

'h'

In [22]:
chr(1087)

'п'

In [23]:
def identity(ch):
    return chr(ord(ch))

In [24]:
identity('п')

'п'

### One Interesting Example

In [26]:
"a".upper().lower()

'a'

In [13]:
"\N{HEAVY BLACK HEART}"

'❤'

In [25]:
"\N{LATIN SMALL LETTER SHARP S}"

'ß'

In [27]:
ch = "\N{LATIN SMALL LETTER SHARP S}"

In [28]:
ch.upper()

'SS'

In [29]:
ch.upper().lower()

'ss'
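
Because `upper()` and `lower()` do not round-trip for characters like `ß`, case-insensitive comparison is usually better done with `str.casefold()`. A small sketch (the helper name `same_ignoring_case` is made up for this example):

```python
def same_ignoring_case(left, right):
    """Compare two strings using Unicode case folding."""
    return left.casefold() == right.casefold()


print("Straße".lower() == "STRASSE".lower())      # False: 'straße' vs 'strasse'
print(same_ignoring_case("Straße", "STRASSE"))    # True: both fold to 'strasse'
```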

## String Methods

[Common string operations](https://docs.python.org/3/library/string.html)

In [30]:
"python is cool".capitalize()

'Python is cool'

In [31]:
"python is cool".title()

'Python Is Cool'

In [32]:
"python is cool".upper()

'PYTHON IS COOL'

In [33]:
"python is cool".lower()

'python is cool'

In [34]:
"python is cool".title().swapcase()

'pYTHON iS cOOL'

### Alignment

💡 _These methods are very helpful when you develop console programs._

A space is the default fill character.

In [35]:
"python is cool".ljust(16, "~")

'python is cool~~'

In [36]:
"python is cool".rjust(16, "~")

'~~python is cool'

In [37]:
"python is cool".center(16, "~")

'~python is cool~'
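
As an illustration of the console use case, the alignment methods can be combined to print a simple two-column table. The data and the `render_table` helper below are hypothetical:

```python
rows = [("apples", 3), ("kiwis", 12), ("figs", 150)]  # made-up sample data


def render_table(rows, name_width=10, count_width=5):
    """Left-align names (padded with dots) and right-align counts."""
    lines = []
    for name, count in rows:
        left = name.ljust(name_width, ".")
        right = str(count).rjust(count_width)  # default fill is a space
        lines.append(left + right)
    return "\n".join(lines)


table = render_table(rows)
print(table)
# apples....    3
# kiwis.....   12
# figs......  150
```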

### Strip

In [38]:
"]>>python 2020<<[".lstrip("]>")

'python 2020<<['

In [41]:
"]>>python 2020<<[".rstrip("[<")

']>>python 2020'

In [42]:
"]>>python 2020<<[".strip("[]<>")

'python 2020'

In [43]:
# most frequent use case
"\t python 2020 \r\n  ".strip()

'python 2020'
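
Note that the argument to `strip`, `lstrip` and `rstrip` is a set of characters, not a prefix or suffix. The sketch below shows the difference; it assumes Python 3.9 or newer, where `str.removeprefix` and `str.removesuffix` are available:

```python
filename = "happy.py"

# rstrip removes any trailing '.', 'p' or 'y', so it eats into the name.
stripped = filename.rstrip(".py")
print(stripped)                 # ha

# removesuffix removes the exact suffix once, if present.
suffix_removed = filename.removesuffix(".py")
print(suffix_removed)           # happy

# The same applies on the left side.
print("py-python".lstrip("py-"))        # thon
print("py-python".removeprefix("py-"))  # python
```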

### Split

In [45]:
"python 2020".split()

['python', '2020']

In [47]:
"python,2020".split(",")

['python', '2020']

In [48]:
"python,,,,2020".split(",")

['python', '', '', '', '2020']

In [51]:
"archive.tar.gz".split(".", 1)

['archive', 'tar.gz']

In [54]:
"archive.file.tag.gz".rsplit(".", 2)

['archive.file', 'tag', 'gz']

---

In [49]:
"foo,bar,baz".partition(",")

('foo', ',', 'bar,baz')

In [50]:
"foo,bar,baz".rpartition(",")

('foo,bar', ',', 'baz')

Sometimes `partition` is more predictable than `split`: it always returns a tuple of three elements, so no conditional logic is needed to handle a missing separator.

In [55]:
"archive".rsplit(".", 1)

['archive']

In [57]:
"archive".rpartition(".")

('', '', 'archive')
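
For example, parsing `key=value` options with `partition` works the same way whether or not the separator is present. `parse_option` is a hypothetical helper written for this sketch:

```python
def parse_option(line):
    """Split 'key=value' into a pair; the value is '' when '=' is missing."""
    key, _, value = line.partition("=")
    return key.strip(), value.strip()


print(parse_option("mode = fast"))   # ('mode', 'fast')
print(parse_option("debug"))         # ('debug', '')
print(parse_option("url=a=b"))       # ('url', 'a=b')
```

With `split("=", 1)` the second case would return a one-element list, and tuple unpacking would raise `ValueError`.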

### Join

ℹ️ [Efficient String Concatenation in Python](https://waymoot.org/home/python_string/)

In [58]:
", ".join(["python", "is", "cool"])

'python, is, cool'

In [59]:
", ".join(filter(None, ["", "python"]))

'python'

In [60]:
", ".join("python")

'p, y, t, h, o, n'

In [61]:
", ".join(range(10))

TypeError: sequence item 0: expected str instance, int found
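
`join` accepts only strings, so non-string items have to be converted first. A minimal sketch of two common ways to do that:

```python
# Convert every item with map() ...
joined = ", ".join(map(str, range(10)))
print(joined)      # 0, 1, 2, 3, 4, 5, 6, 7, 8, 9

# ... or with a generator expression, which leaves room for formatting.
padded = ", ".join(str(n).rjust(2, "0") for n in range(3))
print(padded)      # 00, 01, 02
```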

### Substrings

#### Check For Substring

In [62]:
"py" in "python"

True

In [64]:
"clj" not in "python"

True

In [65]:
"python".startswith("py")

True

In [66]:
"python".endswith("on")

True

In [71]:
"python".startswith(("py", "clo"))

True

In [69]:
"python".endswith(("on", "ava"))

True

#### Search For Index

In [72]:
"python".find("th")

2

In [73]:
"python".find("th", 0, 3)  # ≃ [:3].find("th")

-1

In [74]:
"python".index("th", 0, 3)

ValueError: substring not found
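
The two methods differ only in how they report a miss: `find` returns `-1`, `index` raises. One pitfall with `find` is that a match at position 0 is falsy. The sketch below illustrates both points; `position_or_none` is a hypothetical wrapper:

```python
def position_or_none(text, needle):
    """Return the index of `needle` in `text`, or None when it is absent."""
    try:
        return text.index(needle)
    except ValueError:
        return None


found_at_start = "python".find("py")
print(found_at_start)            # 0
print(bool(found_at_start))      # False, so compare with -1 explicitly
print(found_at_start != -1)      # True

print(position_or_none("python", "th"))    # 2
print(position_or_none("python", "java"))  # None
```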

#### Replace

In [76]:
"python".replace("p", "j")

'jython'

In [80]:
"pythonpython".replace("py", "**", 2)

'**thon**thon'

In [81]:
translation_map = {ord("p"): "*", ord("n"): "?"}
"pythonpython".translate(translation_map)

'*ytho?*ytho?'
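
The translation map does not have to be written by hand: `str.maketrans` builds it from plain strings. A short sketch, including the optional third argument that lists characters to delete:

```python
# Two equal-length strings: characters in the first map to the second.
table = str.maketrans("pn", "*?")
masked = "pythonpython".translate(table)
print(masked)        # *ytho?*ytho?

# The result is an ordinary dict keyed by code points.
print(table == {ord("p"): ord("*"), ord("n"): ord("?")})   # True

# A third argument names characters to remove entirely.
no_vowels = "pythonpython".translate(str.maketrans("", "", "aeiou"))
print(no_vowels)     # pythnpythn
```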

#### Predicates

In [82]:
"100500".isdigit()

True

In [83]:
"100500".isalnum()

True

In [85]:
"python".isalpha()

True

In [87]:
"python".islower()

True

In [88]:
"PYTHON".isupper()

True

In [89]:
"Python Code".istitle()

True

In [90]:
"\r     \n\t     \r\n".isspace()

True
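
The digit-related predicates are Unicode-aware, which can be surprising: `isdigit()` accepts characters that `int()` rejects. A hedged sketch of the difference (`parse_int_or_none` is a hypothetical helper and deliberately ignores signs and whitespace):

```python
superscript_two = "\u00b2"       # '²'
one_half = "\u00bd"              # '½'
arabic_42 = "\u0664\u0662"       # '٤٢', ARABIC-INDIC digits four and two

print(superscript_two.isdigit(), superscript_two.isdecimal())  # True False
print(one_half.isdigit(), one_half.isnumeric())                # False True
print(arabic_42.isdecimal(), int(arabic_42))                   # True 42


def parse_int_or_none(text):
    """Convert text made only of decimal digits; return None otherwise."""
    return int(text) if text.isdecimal() else None


print(parse_int_or_none("100500"))         # 100500
print(parse_int_or_none(superscript_two))  # None
```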

## String Representation

In [93]:
str("I'm a string")

"I'm a string"

In [94]:
# Always define __repr__ for your objects!
repr("I'm a string")

'"I\'m a string"'
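
To make the advice about `__repr__` concrete, here is a minimal sketch of a class that defines both `__repr__` and `__str__`. The `Point` class is hypothetical and exists only for this example:

```python
class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __repr__(self):
        # Unambiguous, ideally valid Python that recreates the object.
        return "Point(x={!r}, y={!r})".format(self.x, self.y)

    def __str__(self):
        # Friendly form, used by print() and str().
        return "({}, {})".format(self.x, self.y)


point = Point(0, 10)
print(repr(point))    # Point(x=0, y=10)
print(str(point))     # (0, 10)
print([point])        # [Point(x=0, y=10)]  (containers use repr)
```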

In [96]:
ascii("я строка")

"'\\u044f \\u0441\\u0442\\u0440\\u043e\\u043a\\u0430'"

## String Format

[PyFormat: Using % and .format() for great good!](https://pyformat.info/)

### .format()

In [91]:
"{}, {}, how are you?".format("Hello", "Andrey")

'Hello, Andrey, how are you?'

In [92]:
"Today is April, {}st".format(1)

'Today is April, 1st'

---

In [99]:
"{!s}".format("I'm a string")  # str

"I'm a string"

In [100]:
"{!r}".format("I'm a string")  # repr

'"I\'m a string"'

In [101]:
"{!a}".format("я строка")  # ascii

"'\\u044f \\u0441\\u0442\\u0440\\u043e\\u043a\\u0430'"

#### Format Specification

[Format String Syntax](https://docs.python.org/3.9/library/string.html#format-string-syntax)

In [1]:
"{:~^16}".format("python")

'~~~~~python~~~~~'

In [2]:
"int {0:d} hex: {0:x}".format(42)

'int 42 hex: 2a'

In [3]:
"oct {0:o} bin: {0:b}".format(42)

'oct 52 bin: 101010'

In [4]:
"{:+08.2f}".format(-42.42)

'-0042.42'

In [5]:
"{!r:~^16}".format("foo bar")

"~~~'foo bar'~~~~"

In [6]:
"{0}, {1}, {0}".format("Hello", "Andrey")

'Hello, Andrey, Hello'

In [7]:
"{0}, {who}, {0}".format("Hello", who="Andrey")

'Hello, Andrey, Hello'

In [8]:
# working with containers
point = 0, 10
"x = {0[0]}, y = {0[1]}".format(point)

'x = 0, y = 10'

In [9]:
point = {"x": 0, "y": 10}
"x = {0[x]}, y = {0[y]}".format(point)

'x = 0, y = 10'
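
Replacement fields can also read attributes, and a format specification may itself contain nested fields, which is handy when the width or fill is only known at run time. A brief sketch:

```python
# Attribute access inside a replacement field.
complex_text = "{0.real:.1f} + {0.imag:.1f}j".format(3 + 4j)
print(complex_text)      # 3.0 + 4.0j

# Nested fields: fill and width are taken from keyword arguments.
centered = "{:{fill}^{width}}".format("python", fill="~", width=16)
print(centered)          # ~~~~~python~~~~~
```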

### % – The Old One

In [11]:
"%s, %s, how are you?" % ("Hello", "Andrey")

'Hello, Andrey, how are you?'

In [12]:
point = {"x": 0, "y": 10}
"x = %(x)+2d, y = %(y)+2d" % point

'x = +0, y = +10'

#### Gotchas

In [13]:
"%s" % (1, 2)

TypeError: not all arguments converted during string formatting

In [14]:
"%s" % [1, 2]

'[1, 2]'
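
The first gotcha happens because `%` treats a tuple on the right-hand side as the list of arguments. Wrapping the value in a one-element tuple avoids the ambiguity. A minimal sketch (`format_value` is a hypothetical helper):

```python
def format_value(value):
    """Format any single value with %s, even if it is a tuple."""
    return "%s" % (value,)


print(format_value((1, 2)))    # (1, 2)
print(format_value([1, 2]))    # [1, 2]
print("%s and %s" % (1, 2))    # 1 and 2
```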

## String Constants

In [17]:
import string

In [18]:
string.ascii_letters

'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'

In [19]:
string.digits

'0123456789'

In [20]:
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [21]:
string.whitespace

' \t\n\r\x0b\x0c'
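
These constants are convenient for simple validation. The following sketch checks that a name uses only ASCII letters, digits and underscores; `is_simple_identifier` and `ALLOWED` are names invented for this example:

```python
import string

ALLOWED = set(string.ascii_letters + string.digits + "_")


def is_simple_identifier(name):
    """True if `name` is non-empty, ASCII-only and does not start with a digit."""
    return bool(name) and name[0] not in string.digits and set(name) <= ALLOWED


print(is_simple_identifier("user_42"))   # True
print(is_simple_identifier("42user"))    # False
print(is_simple_identifier("naïve"))     # False
```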

## Bytes & Byte Arrays

> Bytes and bytearray objects contain single bytes – the former is immutable while the latter is a mutable sequence. Bytes objects can be constructed using the constructor, bytes(), and from literals; use a b prefix with normal string syntax: b'xyzzy'. To construct byte arrays, use the bytearray() function.

In [22]:
# Byte literals
b"\00\42\24\00"

b'\x00"\x14\x00'

In [23]:
rb"\00\42\24\00"

b'\\00\\42\\24\\00'
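
The difference in mutability described in the quote can be sketched as follows (variable names are illustrative):

```python
frozen = b"xyzzy"
buffer = bytearray(frozen)     # mutable copy of the same bytes

# A bytearray supports item assignment and in-place growth.
buffer[0] = ord("X")
buffer.extend(b"!")
print(buffer)                  # bytearray(b'Xyzzy!')

# A bytes object does not.
assignment_failed = False
try:
    frozen[0] = ord("X")
except TypeError:
    assignment_failed = True
print(assignment_failed)       # True
```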

### Bytes and strings are related

```
>>> help(open)
```
Help on built-in function open in module io:

```python
open(file, mode="r", buffering=-1, encoding=None,
     errors=None, newline=None,
     closefd=True, opener=None)
```
Open file and return a stream.

modes:

- `r` — read (default)
- `t` — decode bytes into text (default)
- `b` — do not decode
- `w` — write (clear file)
- `a` — write to the end of file
- `+` — read + write
- `x` — exclusive creation
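
To see how the `t` and `b` modes relate, the same file can be written as text and read back both ways. This is a small sketch that uses a temporary directory so it does not touch real files; the file name `demo.txt` is arbitrary:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.txt")

    # Text mode: str in, encoded with the given encoding on the way out.
    with open(path, "w", encoding="utf-8") as f:
        f.write("привет")

    # Text mode: bytes are decoded back into str.
    with open(path, "r", encoding="utf-8") as f:
        text = f.read()

    # Binary mode: the raw bytes, no decoding.
    with open(path, "rb") as f:
        raw = f.read()

print(type(text), len(text))   # <class 'str'> 6
print(type(raw), len(raw))     # <class 'bytes'> 12
```

Using `with` also closes the file automatically, even if an exception is raised inside the block.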

In [1]:
f = open('README.md', 'r')
byte_string = f.read(10)
type(byte_string)

str

In [2]:
byte_string

'# Jupyter '

In [3]:
f = open('README.md', 'rb')
byte_string = f.read(100)
type(byte_string)

bytes

In [4]:
byte_string

b'# Jupyter Notebook on Python\n\n```shell\n$ jupyter notebook\n```\n\n## Copyright\n\nCopyright (C) 2019 Andr'

In [5]:
b'\u20ac', len(b'\u20ac')

(b'\\u20ac', 6)

In [6]:
byte_string[0]

35

In [7]:
chr(byte_string[0])

'#'

In [8]:
b'\012\x0a'

b'\n\n'

In [9]:
b'\377\oxfe' + bytes(i for i in range(128, 137))

b'\xff\\oxfe\x80\x81\x82\x83\x84\x85\x86\x87\x88'

In [10]:
ord(b' '), B' '[0], ord(' '), ' '[0]

(32, 32, 32, ' ')

In [11]:
print(bytes(5))
print(bytes(i for i in range(ord('A'), ord('A') + 26)))

b'\x00\x00\x00\x00\x00'
b'ABCDEFGHIJKLMNOPQRSTUVWXYZ'


In [12]:
print(bytes('Your bill is $9.99'))

TypeError: string argument without an encoding

In [13]:
print(bytes('Your bill is $9.99', 'UTF-8'))

b'Your bill is $9.99'


In [14]:
print(bytes('Your bill is $9.99', 'UTF-16'))

b'\xff\xfeY\x00o\x00u\x00r\x00 \x00b\x00i\x00l\x00l\x00 \x00i\x00s\x00 \x00$\x009\x00.\x009\x009\x00'


In [15]:
any((c > 127) for c in open('README.md', 'rb').read())

False

In [16]:
any((c > 127) for c in open('README.md', 'r').read())

TypeError: '>' not supported between instances of 'str' and 'int'

In [18]:
fb = open('README.md', 'rb')

In [19]:
type(fb.read())

bytes

---

In [28]:
"boo" in b"foobar"

TypeError: a bytes-like object is required, not 'str'

In [29]:
b"foobar".replace("o", "")

TypeError: a bytes-like object is required, not 'str'

### encode & decode

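The optional `errors` argument controls how characters that cannot be encoded or decoded are handled:
* `strict` – raise an exception (the default)
* `ignore` – skip the offending characters
* `replace` – substitute a replacement marker: `"\ufffd"` when decoding, `"?"` when encoding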

In [25]:
chunk = "я строка".encode("cp1251")
chunk.decode("utf-8", "ignore")

' '

In [26]:
chunk.decode("utf-8", "replace")

'� ������'

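`str.encode()` and `bytes.decode()` use UTF-8 when no encoding is given, which matches the value reported by `sys.getdefaultencoding()`: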

In [27]:
import sys
sys.getdefaultencoding()

'utf-8'
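
The same error handlers apply in the other direction, when encoding. A short sketch using the ASCII codec, which cannot represent Cyrillic letters:

```python
text = "я строка"

# strict (the default) raises an exception.
strict_failed = False
try:
    text.encode("ascii")
except UnicodeEncodeError:
    strict_failed = True
print(strict_failed)                             # True

print(text.encode("ascii", "ignore"))            # b' '
print(text.encode("ascii", "replace"))           # b'? ??????'
print(text.encode("ascii", "backslashreplace"))

# Decoding with the codec that was used for encoding restores the text.
round_trip = text.encode("cp1251").decode("cp1251")
print(round_trip == text)                        # True
```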

## Files

- [PEP 383 -- Non-decodable Bytes in System Character Interfaces](https://www.python.org/dev/peps/pep-0383/)
- [Python 3.1 surrogateescape error handler (PEP 383)](https://vstinner.github.io/pep-383.html)
- [Security implications of PEP 383](http://blog.omega-prime.co.uk/2011/03/29/security-implications-of-pep-383/)

In [37]:
handle = open("tmp.txt", "r+")

In [38]:
handle.fileno()

57

In [39]:
handle.tell()

0

In [40]:
handle.seek(8)

8

In [41]:
handle.tell()

8

In [42]:
handle.write("Writing to files in Python")

26

In [43]:
handle.flush()

In [44]:
handle.close()
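
The same sequence of calls can be wrapped in `with` blocks so the file is closed automatically. The sketch below is an assumption-laden variant of the session above: it uses binary mode (`r+b`), where `seek` offsets are plain byte positions, and a temporary directory instead of a real `tmp.txt`:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "tmp.txt")

    with open(path, "wb") as handle:
        handle.write(b"0123456789ABCDEF")

    # r+ opens for reading and writing without truncating the file.
    with open(path, "r+b") as handle:
        handle.seek(8)
        position_before = handle.tell()
        written = handle.write(b"xy")      # overwrites two bytes in place
        position_after = handle.tell()

    with open(path, "rb") as handle:
        content = handle.read()

print(position_before, written, position_after)   # 8 2 10
print(content)                                    # b'01234567xyABCDEF'
```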

## stdin, stdout & stderr

In [45]:
input("Name: ")

Name: Andrey


'Andrey'

In [46]:
print("Hello, `sys.stdout`!", file=sys.stdout)

Hello, `sys.stdout`!


In [47]:
print("Hello, `sys.stderr`!", file=sys.stderr)

Hello, `sys.stderr`!


### print

In [48]:
print(*range(4))

0 1 2 3


In [49]:
print(*range(4), sep="_")

0_1_2_3


In [50]:
print(*range(4), end="\n--\n")

0 1 2 3
--


In [51]:
handle = open("tmp.txt", "w")

In [52]:
print(*range(4), file=handle, flush=True)

## StringIO & BytesIO

In [54]:
import io

handle = io.StringIO("foo\nbar")
handle.readline()

'foo\n'

In [55]:
handle.write("boo")
handle.getvalue()

'foo\nboo'
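
The write above lands at the current position (right after the first line), so it overwrites the rest of the buffer. To append instead, move to the end first. A minimal sketch:

```python
import io

handle = io.StringIO("foo\nbar")
first_line = handle.readline()     # 'foo\n'; the position is now 4

handle.seek(0, io.SEEK_END)        # jump to the end of the buffer
handle.write("\nbaz")

appended = handle.getvalue()
print(repr(appended))              # 'foo\nbar\nbaz'
```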

In [57]:
handle = io.BytesIO(b"foobar")
handle.read(3)

b'foo'

In [58]:
handle.getvalue()

b'foobar'
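
In-memory streams are useful wherever a file-like object is expected. As a sketch, `contextlib.redirect_stdout` can capture what `print` writes, and a `BytesIO` can be rewound with `seek` just like a real file:

```python
import contextlib
import io

# Capture printed output in a StringIO instead of the console.
buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    print(*range(4), sep="_")
captured = buffer.getvalue()

# Rewind a BytesIO to read the same bytes again.
stream = io.BytesIO(b"foobar")
head = stream.read(3)
stream.seek(0)
head_again = stream.read(3)
rest = stream.read()

print(repr(captured))              # '0_1_2_3\n'
print(head, head_again, rest)      # b'foo' b'foo' b'bar'
```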