# Chapter 2. Strings
A string is a sequence of characters. Each character is encoded as a sequence
of bits.

Given a sequence of bits, a data type indicator tells us whether those bits
represent a number, a string, or something else.
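
As a minimal sketch of that idea, the snippet below reads the same two bytes once as text and once as a number. The variable names (`raw`, `as_text`, `as_number`) are made up for illustration, and big-endian byte order is an arbitrary choice.

```python
# The same two bytes, read under two different type interpretations.
raw = bytes([0x68, 0x69])

as_text = raw.decode('ascii')           # read the bytes as ASCII characters
as_number = int.from_bytes(raw, 'big')  # read the bytes as one big-endian integer
as_bits = format(as_number, '016b')     # the underlying 16 bits

print(as_text)    # hi
print(as_number)  # 26729
print(as_bits)    # 0110100001101001
```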

## Section 2.1 String Basics

There are many ways to specify string literals.

In [1]:
print('hello, world')  # single quote
print("hello, world")  # double quote
print('''hello
world
''')  # 3 single quotes, as is. In this case, 3 lines.
print("""hello
world
""")  # 3 double quotes.

hello, world
hello, world
hello
world

hello
world



In [2]:
# invisible and escape characters
print('hello,\tworld!\nEnd\\')  # tab, new line, and backslash
print('hello, "O\'Brian"')  # ' inside
print("hello, O'Brian")  # a simpler way

print(r'Raw data \t')  # raw string, no tab

hello,	world!
End\
hello, "O'Brian"
hello, O'Brian
Raw data \t


In [3]:
# operators
print('hello ' + 'world')
print('-' * 80)


hello world
--------------------------------------------------------------------------------


Each character in a string is defined in a character table. The most basic
character table is the ASCII table (https://en.wikipedia.org/wiki/ASCII):
![ascii](https://upload.wikimedia.org/wikipedia/commons/thumb/1/1b/ASCII-Table-wide.svg/875px-ASCII-Table-wide.svg.png)

In [4]:
print(ord('a'))
print(chr(97))
# we will discuss other character sets later.

97
a


In [1]:
# string = list of chars => slice and dice
s = 'hello, world'
print(s[7])  # character by position
print(s[2:])  # llo, world, start with position 2 (index starts from 0)
print(s[2:5])  # llo, start included, end excluded
print(s[-1])  # d, last element.
print(s[::2])  # hlo ol, every other character (step of 2)
print(s[::-1])  # reverse order

print(len(s))

w
llo, world
llo
d
hlo ol
dlrow ,olleh
12


In [6]:
try:
    s[4] = 'y' # string is immutable, this will fail
except TypeError:
    import traceback
    traceback.print_exc()

Traceback (most recent call last):
  File "<ipython-input-6-e55d5583569f>", line 2, in <module>
    s[4] = 'y' # string is immutable, this will fail
TypeError: 'str' object does not support item assignment
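
Since a string cannot be changed in place, the usual workaround is to build a new one. The sketch below shows two possible ways, slicing and going through a list of characters; the names `t`, `chars`, and `u` are illustrative only.

```python
s = 'hello, world'

# Slicing: everything before index 4, the new character, everything after it.
t = s[:4] + 'y' + s[5:]
print(t)  # helly, world
print(s)  # hello, world (the original is unchanged)

# Alternative: edit a mutable list of characters, then join it back.
chars = list(s)
chars[4] = 'y'
u = ''.join(chars)
print(u)  # helly, world
```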


In [7]:
# Types and casting
print(type("Hello"))  # class 'str'

# casting
print(int('01234'))  # 1234; raises ValueError if the string is not a valid integer
print(float('3.14'))

print(type('1234'))  # <class 'str'>
print(type(int('1234')))  # <class 'int'>
print(isinstance(1234, int))  # True

<class 'str'>
1234
3.14
<class 'str'>
<class 'int'>
True
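
When the input may not be a valid integer, the conversion can be wrapped in a `try`/`except`. The helper `to_int` and its `default` parameter below are hypothetical names for this sketch, not part of the standard library.

```python
def to_int(text, default=None):
    """Return int(text), or default when text is not a valid integer."""
    try:
        return int(text)
    except ValueError:
        return default


print(to_int('01234'))            # 1234
print(to_int('12a'))              # None
print(to_int('3.14', default=0))  # 0, because int() does not parse floats
```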


## Section 2.2 String Functions

Python strings have many convenient methods:
https://docs.python.org/3/library/stdtypes.html#string-methods
They are very handy, so get familiar with them.

In [8]:
s = 'hello, world'
print(s.capitalize())
print(s.upper())
print(s.title())

print(s.index('world'))
print('llo' in s)
print(s.index('o'))  # 4, first index of o
# how to find the second occurrence? what about all occurrences?
print(s.index('o', s.index('o') + 1))  # 8, search again starting after the first hit
print(s.rfind('o'))  # 8, the last occurrence, which here is also the second
print(s.find('wold'))  # not found, return -1

Hello, world
HELLO, WORLD
Hello, World
7
True
4
8
8
-1
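
One possible answer to the "all occurrences" question is a generator that keeps calling `str.find` from just past the previous hit. `find_all` is a hypothetical helper written for this sketch.

```python
def find_all(text, sub):
    """Yield the start index of every (possibly overlapping) occurrence of sub."""
    start = text.find(sub)
    while start != -1:
        yield start
        start = text.find(sub, start + 1)  # resume one position after the last hit


print(list(find_all('hello, world', 'o')))  # [4, 8]
print(list(find_all('hello, world', 'l')))  # [2, 3, 10]
print(list(find_all('hello, world', 'z')))  # []
```

Using `find` rather than `index` avoids an exception when there are no more matches, since `find` returns -1 instead.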


In [9]:
print('[' + ' hello, world '.strip() + ']')
print('[' + ' hello, world '.lstrip() + ']')
print('[' + ' hello, world '.rstrip() + ']')

print('[' + 'Hello' + ']')
print('[' + 'Hello'.ljust(15) + ']')
print('[' + 'Hello'.rjust(15) + ']')

[hello, world]
[hello, world ]
[ hello, world]
[Hello]
[Hello          ]
[          Hello]


In [10]:
s = "I do not like green eggs and ham. Sam I am."
print(s.split(' '))
print('-'.join(s.split()))
# a string is iterable character by character, so join works on it directly
print('-'.join(s))

# to find all positions of a substring, the re module works if the pattern has
# no special regex characters: [m.start() for m in re.finditer(':', s)]
# or use a loop/generator

['I', 'do', 'not', 'like', 'green', 'eggs', 'and', 'ham.', 'Sam', 'I', 'am.']
I-do-not-like-green-eggs-and-ham.-Sam-I-am.
I- -d-o- -n-o-t- -l-i-k-e- -g-r-e-e-n- -e-g-g-s- -a-n-d- -h-a-m-.- -S-a-m- -I- -a-m-.


## Section 2.3 String Formatting

In [11]:
# string format
print('I do not like {} and {}'.format('green eggs', 'ham'))
print('I am {0}. I am {0}. {0}-I-Am'.format('Sam'))  # reuse one argument, no need to repeat it
print('I do not like them {1} or {0}'.format('there', 'here'))  # by positional index
print('I do not like them {h} or {t}'.format(t='there', h='here'))  # by keyword

I do not like green eggs and ham
I am Sam. I am Sam. Sam-I-Am
I do not like them here or there
I do not like them here or there


In [12]:
# f-string is more powerful: https://www.python.org/dev/peps/pep-0498/
there = 'there'
here = 'here'
print(f'I do not like them {here} or {there}')  # use variable names directly
print(f'I do not like them {here!s:20} or {there}')  # width 20; same as <20, strings are left justified by default
print(f'I do not like them {here!s:>20} or {there}')  # right justified
print(f'I do not like them {here!s:<20} or {there}')  # left justified
print(f'I do not like them {here!s:^20} or {there}')  # center justified
print(f'I do not like them {here!s:_<20} or {there}')  # fill with _

I do not like them here or there
I do not like them here                 or there
I do not like them                 here or there
I do not like them here                 or there
I do not like them         here         or there
I do not like them here________________ or there
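
These alignment specifiers are handy for printing aligned columns. The sketch below lines up a small table; the `rows` data and the column widths (12 and 5) are invented for illustration.

```python
# Hypothetical data: (item name, count)
rows = [('green eggs', 2), ('ham', 12)]

# Left-justify names in 12 columns, right-justify counts in 5 columns.
lines = [f'{name:<12}|{count:>5}' for name, count in rows]
for line in lines:
    print(line)
# green eggs  |    2
# ham         |   12
```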


In [13]:
# number format, can take python expression
print(f'2 + 3 = {2 + 3}')
a = 123.456789
print(f'{a:.3f}')  # 123.457
print(f'{a:.10f}')  # 123.4567890000
print(f'{a:12.3f}')  # total width 12: 3 decimals + '.' leaves 8 positions before the point
print(f'{a:<.3f}')
print(f'{a:.6e}')

2 + 3 = 5
123.457
123.4567890000
     123.457
123.457
1.234568e+02


In [14]:
b = 12345
print(f'{b:07}')  # 0012345
print(f'{b:<10}')
print(f'{b:>10}')
print(f'{b:<+10}')
print(f'{b:>+10}')
b = -12345
print(f'{b:>+10}')

c = 520
print(f'{c:x}')  # hex 208
print(f'{c:#X}')
print(f'{c:o}')  # oct
print(f'{c:b}')  # binary
print(f'{c:e}')  # scientific notation

0012345
12345     
     12345
+12345    
    +12345
    -12345
208
0X208
1010
1000001000
5.200000e+02


In [15]:
x = 12
print(format(x, '08b'))
print(format(x, '010b'))

print('{:,}'.format(1000000000000))  # comma as thousands separator, common in finance


# https://realpython.com/python-f-strings/

# https://en.wikipedia.org/wiki/Binary_code

# exercise: reformat phone number 1234567890 as 123-456-7890 or (123) 456-7890

00001100
0000001100
1,000,000,000,000
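
The phone number case can be sketched with slicing plus an f-string. `format_phone` and its `style` parameter are hypothetical, and the function assumes a 10-digit number given as a string.

```python
def format_phone(digits, style='dash'):
    """Format a 10-digit string as 123-456-7890 or (123) 456-7890."""
    if len(digits) != 10 or not (digits.isascii() and digits.isdigit()):
        raise ValueError('expected exactly 10 ASCII digits')
    area, prefix, line = digits[:3], digits[3:6], digits[6:]
    if style == 'dash':
        return f'{area}-{prefix}-{line}'
    return f'({area}) {prefix}-{line}'


print(format_phone('1234567890'))                 # 123-456-7890
print(format_phone('1234567890', style='paren'))  # (123) 456-7890
```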


## Section 2.4 Unicode and Encoding

To handle the many languages of the world, the ASCII character set was expanded
into Unicode (https://home.unicode.org/).

This happened in stages. ASCII itself is a 7-bit set, and its 8-bit extensions
(https://www.ascii-code.com/) fit in 1 byte. Unicode started as a 2-byte design
and later outgrew it. Today the common Unicode encodings, such as UTF-8, are
variable length.

This is related to i18n (https://en.wikipedia.org/wiki/Internationalization_and_localization).

A character set is often called a charset for short.

Before Unicode, many charsets were built for languages such as Chinese.
Mismatched or missing charsets were painful to work around, especially when
playing games:

![missing_charset](wrong_charset.jpg)

(from: https://m.3dmgame.com/news/201705/3657112_2.html)

Given a series of bits, you had to guess which charset it mapped to.

With Unicode, we rarely need to guess anymore, although the problem still shows
up occasionally.

We can input Greek, Chinese, or emoji characters.

Here is a good reference for the details:
https://stackoverflow.com/questions/643694/what-is-the-difference-between-utf-8-and-unicode

All Python string literals are Unicode. That's the good news. However, data
retrieved from files sent by others, databases, or web pages may still use a
different encoding. In that case, you have to guess which charset was used.
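
One simple way to "guess" is to try a short list of candidate encodings in order and keep the first that decodes without error. This is only a sketch: `decode_with_fallback` is a hypothetical helper, the candidate list `('utf-8', 'gbk')` is an assumption, and a decode that succeeds does not prove the guess is right.

```python
def decode_with_fallback(data, encodings=('utf-8', 'gbk')):
    """Return (text, encoding) for the first candidate that decodes cleanly."""
    for name in encodings:
        try:
            return data.decode(name), name
        except UnicodeDecodeError:
            continue  # these bytes are not valid in this encoding, try the next
    raise ValueError('none of the candidate encodings fit')


gbk_bytes = '你好'.encode('gbk')
utf8_bytes = '你好'.encode('utf-8')
print(decode_with_fallback(gbk_bytes))   # ('你好', 'gbk')
print(decode_with_fallback(utf8_bytes))  # ('你好', 'utf-8')
```

Stricter encodings such as UTF-8 should come first in the list, because permissive ones accept almost any byte sequence and would hide a wrong guess.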

In [16]:
print("你好，世界")

# https://pythonforundergradengineers.com/unicode-characters-in-python.html
print('\u03B1 \u03B4 \u03B5')  # greek letters: α δ ε

# utf-8 is a variable-length encoding
print(len('a'.encode('utf-8')))  # 1 byte for an ASCII letter
print(len('ε'.encode('utf-8')))  # 2 bytes for a Greek letter
print(len('好'.encode('utf-8')))  # 3 bytes for a Chinese character


你好，世界
α δ ε
1
2
3
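
To make the variable-length point concrete, the sketch below compares how many bytes the same characters take in UTF-8, UTF-16, and UTF-32. The emoji is an extra sample added for illustration, and the `-le` codec variants are used so that no byte order mark is counted.

```python
# Bytes needed per character in three Unicode encodings.
samples = ['a', 'ε', '好', '\U0001F600']  # the last one is an emoji

sizes = {}
for ch in samples:
    sizes[ch] = (
        len(ch.encode('utf-8')),
        len(ch.encode('utf-16-le')),
        len(ch.encode('utf-32-le')),
    )
    print(f'U+{ord(ch):04X}: utf-8={sizes[ch][0]} utf-16={sizes[ch][1]} utf-32={sizes[ch][2]}')
# U+0061: utf-8=1 utf-16=2 utf-32=4
# U+03B5: utf-8=2 utf-16=2 utf-32=4
# U+597D: utf-8=3 utf-16=2 utf-32=4
# U+1F600: utf-8=4 utf-16=4 utf-32=4
```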


In [17]:
# Unicode defines characters (code points); an encoding turns them into bytes.
print('\u03B5')  # this is unicode
print(len('\u03B5'))  # 1, just one unicode character
print('\u03B5'.encode('utf-8'))  # b'\xce\xb5', encode unicode to byte string, 2 bytes
print(len(b'\xce\xb5'))  # 2 bytes
print(b'\xce\xb5'.decode('utf-8'))  # ε, decode byte array to unicode

ε
1
b'\xce\xb5'
2
ε


In [18]:
# another encoding: cp936 (GBK) can encode ε; latin-1 cannot, it has no Greek letters
print('\u03B5'.encode('cp936'))  # b'\xa6\xc5'

print(b'abc')  # binary abc, as bytes
print(b'abc'.decode('utf-8'))
print(b'\xc2\xb5'.decode('utf-8'))  # µ

b'\xa6\xc5'
b'abc'
abc
µ
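
When a character does not fit the target encoding, `str.encode` raises `UnicodeEncodeError` unless an `errors` policy is given. The sketch below tries a few of the standard policies on a made-up sample string.

```python
text = 'ε = 0.1'  # sample text, chosen for illustration

try:
    text.encode('latin-1')  # latin-1 has no Greek epsilon
except UnicodeEncodeError as exc:
    print('cannot encode:', exc.reason)

replaced = text.encode('latin-1', errors='replace')          # substitute '?'
ignored = text.encode('latin-1', errors='ignore')            # drop the character
escaped = text.encode('latin-1', errors='backslashreplace')  # keep an escape sequence

print(replaced)  # b'? = 0.1'
print(ignored)   # b' = 0.1'
print(escaped)   # b'\\u03b5 = 0.1'
```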


https://blog.csdn.net/ywcpig/article/details/52250277
https://blog.csdn.net/tgvincent/article/details/93884725
https://blog.csdn.net/cy_baicai/article/details/82897724

Python source code is itself Unicode, so we may use Unicode variable names.

In [19]:
# unicode variable name
你好 = 1024
print(你好)

Σ = 1.618
print(Σ)

normalText = 'Pythön rocks'
print(ascii(normalText))

1024
1.618
'Pyth\xf6n rocks'


In [9]:
# bytes([source[, encoding[, errors]]])
# bytearray([source[, encoding[, errors]]])

x = b"monkey"
y = b'''apple,
banana,
pear
'''
z = bytes('Python, bytes', 'utf8')

print(x)
print(y)
print(z)

# convert bytes to string
p = b'El ni\xc3\xb1o come camar\xc3\xb3n'
s = p.decode()
print(s)

# length of bytes object
a = bytes("Python, Bytes", "utf8")
print(len(a))

# + and * operators
print(x * 5)
print(x + y)

b'monkey'
b'apple,\nbanana,\npear\n'
b'Python, bytes'
El niño come camarón
13
b'monkeymonkeymonkeymonkeymonkey'
b'monkeyapple,\nbanana,\npear\n'
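
Like `str`, a `bytes` object is immutable, while `bytearray` is its mutable counterpart. A short sketch of the difference, with illustrative variable names:

```python
x = b'monkey'
print(x[0])  # 109, indexing bytes gives an integer, not a 1-byte string

data = bytearray(x)  # mutable copy of the bytes
data[0] = ord('d')   # item assignment works on a bytearray
data.extend(b'!')    # and it can grow in place
print(data)          # bytearray(b'donkey!')

frozen = bytes(data)  # convert back to immutable bytes
print(frozen)         # b'donkey!'
```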


In [20]:
print(type(b'abc'))  # <class 'bytes'>
print(type(u'abc'))  # <class 'str'>

print(r'abc\n\t')  # raw string, backslashes kept as is


<class 'bytes'>
<class 'str'>
abc\n\t


In [1]:
import sys, locale
print(sys.getdefaultencoding())  # utf-8
print(locale.getpreferredencoding())  # platform dependent, e.g. UTF-8 or cp936
print(sys.stdout.encoding)  # utf-8

help(locale.getpreferredencoding)  # print the doc for this function

utf-8
UTF-8
UTF-8
Help on function getpreferredencoding in module locale:

getpreferredencoding(do_setlocale=True)
    Return the charset that the user is likely using,
    according to the system configuration.


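
Because the preferred encoding differs between machines, and text file I/O typically falls back to it when no encoding is given, it is usually safer to pass `encoding` explicitly. The sketch below writes and reads a temporary file; the file name `sample.txt` and the sample text are made up for illustration.

```python
import tempfile
from pathlib import Path

text = '你好, ε'

with tempfile.TemporaryDirectory() as folder:
    path = Path(folder) / 'sample.txt'
    path.write_text(text, encoding='utf-8')      # explicit encoding on write
    raw = path.read_bytes()                      # the bytes actually stored on disk
    restored = path.read_text(encoding='utf-8')  # same explicit encoding on read

print(len(text), len(raw))  # 5 characters, 10 bytes
print(restored == text)     # True
```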