## Strings

Strings are used to record both textual information and arbitrary collections
of bytes. They are our first example of what in Python we call a sequence:
a positionally ordered collection of other objects.
Sequences maintain a left-to-right order among the items they contain:
their items are stored and fetched by their relative positions. Strictly speaking, strings
are sequences of one-character strings; other, more general sequence types include
lists and tuples, covered later.

### Sequence Operations

In [1]:
# if we have a four-character string coded inside quotes,
# we can verify its length with the built-in len function and fetch its
# components with indexing expressions

S = 'Spam' # Make a 4-character string, and assign it to a name
len(S)     # Length

4

### In Python, indexes are coded as offsets from the front, and so start from 0: the first item is at index 0, the second is at index 1, and so on.

In [7]:
S[0] # The first item in S, indexing by zero-based position

'S'

In [3]:
S[1] # The second item from the left

'p'

### Notice how we assign the string to a variable named S here. A variable is created when you assign it a value, may be assigned any type of object, and is replaced with its value when it shows up in an expression. It must also have been previously assigned by the time you use its value.

In [4]:
# In Python, we can also index backward, from the end—positive indexes count from
# the left, and negative indexes count back from the right:
S[-1] # The last item from the end in S

'm'

In [5]:
S[-2] # The second-to-last item from the end

'a'

In [6]:
# Formally, a negative index is simply added to the string’s length
S[len(S)-1]

'm'
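
The equivalence between negative and positive offsets can be checked for every position at once. This is a minimal sketch that assumes the same four-character S as above; the names `pairs` and `all_equal` are introduced here only for illustration.

```python
S = 'Spam'

# For each k from 1 to len(S), S[-k] should name the same item as S[len(S) - k]
pairs = [(S[-k], S[len(S) - k]) for k in range(1, len(S) + 1)]
all_equal = all(neg == pos for neg, pos in pairs)

print(pairs)
print(all_equal)
```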

### In addition to simple positional indexing, sequences also support a more general form of indexing known as slicing, which is a way to extract an entire section (slice) in a single step.

In [8]:
S[1:3] # Slice of S from offsets 1 through 2 (not 3)

'pa'

### Probably the easiest way to think of slices is that they are a way to extract an entire column from a string in a single step. Their general form, X[I:J], means “give me everything in X from offset I up to but not including offset J.”

In [9]:
S[1:] # Everything past the first (1:len(S))

'pam'

In [10]:
S[0:3] # Everything but the last

'Spa'

In [11]:
S[:3] # Same as S[0:3]

'Spa'

In [12]:
S[:-1] # Everything but the last again, but simpler (0:-1)

'Spa'

In [13]:
S[:] # All of S as a top-level copy (0:len(S))

'Spam'
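
The slice defaults shown above can be restated as a few equalities. The sketch below is only illustrative (the variable names are made up for this example): it checks that an omitted bound falls back to the start or end of the string, and that the two halves of any split point rebuild the original.

```python
S = 'Spam'

# An omitted lower bound defaults to 0, an omitted upper bound to len(S)
same_start = S[:3] == S[0:3]
same_end = S[1:] == S[1:len(S)]

# For any offset i, S[:i] + S[i:] gives back the whole string
rebuilt = [S[:i] + S[i:] for i in range(len(S) + 1)]

print(same_start, same_end)
print(rebuilt)
```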

In [14]:
# strings also support concatenation with the plus sign (joining two strings 
# into a new string) and repetition (making a new string by repeating another):

S + 'xyz' # Concatenation

'Spamxyz'

In [15]:
S * 8 # Repetition

'SpamSpamSpamSpamSpamSpamSpamSpam'

### Immutability

Every string operation is defined to produce a new string as its result, because strings are immutable in Python—they cannot be changed in place after they are created. You can’t change a string by assigning to one of its positions, but you can always build a new one and assign it to the same name.

In [16]:
S

'Spam'

In [17]:
S[0] = 'z' # Immutable objects cannot be changed

TypeError: 'str' object does not support item assignment

In [18]:
S = 'z' + S[1:] # But we can run expressions to make new objects

In [19]:
S

'zpam'
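
The same steps can be run as one script by catching the error instead of letting it stop the session. This is a small sketch; `error_message` and `original` are names assumed here for illustration.

```python
S = 'Spam'
original = S                  # Second reference to the same string object

try:
    S[0] = 'z'                # In-place change is not supported for str
except TypeError as exc:
    error_message = str(exc)

S = 'z' + S[1:]               # Build a new string and rebind the name S

print(error_message)
print(original, S)            # The original object is unchanged
```

Rebinding S does not alter the string that `original` still refers to, which is what immutability means in practice.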

### Every object in Python is classified as either immutable (unchangeable) or not. In terms of the core types, numbers, strings, and tuples are immutable; lists, dictionaries, and sets are not—they can be changed in place freely, as can most new objects you’ll code with classes.

Strictly speaking, you can change text-based data in place if you either expand it into a list of individual characters and join it back together with nothing between, or use the bytearray type.

In [20]:
S = 'shrubbery'

In [21]:
L = list(S) # Expand to a list: [...]

In [22]:
L

['s', 'h', 'r', 'u', 'b', 'b', 'e', 'r', 'y']

In [23]:
L[1] = 'c' # Change it in place

In [24]:
''.join(L) # Join with empty delimiter

'scrubbery'

In [25]:
B = bytearray(b'spam') # A bytes/list hybrid (ahead)

In [26]:
B.extend(b'eggs') # 'b' needed in 3.X

In [27]:
B # B[i] = ord(c) works here too

bytearray(b'spameggs')

In [28]:
B.decode() # Translate to normal string

'spameggs'
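
The comment above notes that `B[i] = ord(c)` also works on a bytearray. A short sketch of that in-place change, assuming ASCII text so that each character maps to a single byte:

```python
B = bytearray(b'spam')

B[0] = ord('S')        # Items are small integers, so assign a character's code
B.extend(b'eggs')      # Grow the same object in place

text = B.decode()      # Translate the bytes back to a normal str

print(B)
print(text)
```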

### Type-Specific Methods

In addition to generic sequence operations, though, strings also have operations all
their own, available as methods—functions that are attached to and act upon a specific
object, which are triggered with a call expression.

In [29]:
# The find method is the basic substring search operation (it returns
# the offset of the passed-in substring, or -1 if it is not present)

S = 'Spam'
S.find('pa') # Find the offset of a substring in S

1

In [31]:
# the string replace method performs global searches and replacements
S.replace('pa', 'XYZ') # Replace occurrences of a string in S with another

'SXYZm'
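
Two details are easy to miss here: find reports a missing substring with -1 rather than an error, and replace returns a new string without touching the original. A minimal sketch, with variable names chosen only for this example:

```python
S = 'Spam'

found = S.find('pa')                 # Offset of the first match
missing = S.find('xyz')              # -1 signals "not present"
replaced = S.replace('pa', 'XYZ')    # New string; S itself is unchanged

print(found, missing)
print(replaced, S)
```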

Other methods split a string into substrings on a delimiter (handy as
a simple form of parsing), perform case conversions, test the content of the string (digits,
letters, and so on), and strip whitespace characters off the ends of the string:

In [33]:
line = 'aaa,bbb,ccccc,dd'
line.split(',') # Split on a delimiter into a list of substrings

['aaa', 'bbb', 'ccccc', 'dd']

In [34]:
S = 'spam'
S.upper() # Upper- and lowercase conversions

'SPAM'

In [35]:
S.isalpha() # Content tests: isalpha, isdigit, etc.

True

In [36]:
line = 'aaa,bbb,ccccc,dd\n'

In [37]:
line.rstrip() # Remove whitespace characters on the right side

'aaa,bbb,ccccc,dd'

In [38]:
line.rstrip().split(',') # Combine two operations

['aaa', 'bbb', 'ccccc', 'dd']
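
Chained together, these methods make a simple line parser. The sketch below is one possible arrangement; it uses list comprehensions, which the text has not introduced, and the names `fields`, `upper_fields`, and `all_letters` are assumptions for illustration.

```python
line = 'aaa,bbb,ccccc,dd\n'

fields = line.rstrip().split(',')               # Strip the newline, then split
upper_fields = [f.upper() for f in fields]      # Case conversion per field
all_letters = all(f.isalpha() for f in fields)  # Content test per field

print(fields)
print(upper_fields)
print(all_letters)
```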

In [39]:
# Strings also support an advanced substitution operation known as formatting
'{0}, eggs, and {1}'.format('spam', 'SPAM!') # Formatting method

'spam, eggs, and SPAM!'

In [40]:
'{:,.2f}'.format(296999.2567) # Separators, decimal digits

'296,999.26'

In [43]:
'%.2f | %+05d' % (3.14159, -42) # Digits, padding, signs

'3.14 | -0042'
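
The format method and the % expression overlap in what they can do. As a rough comparison, here are the same values pushed through both forms; the variable names are made up for this sketch.

```python
value = 296999.2567

by_method = '{:,.2f}'.format(value)    # Thousands separators, two decimals
by_expr = '%.2f' % value               # Two decimals, no separators

by_position = '{0}, eggs, and {1}'.format('spam', 'SPAM!')
by_percent = '%s, eggs, and %s' % ('spam', 'SPAM!')

print(by_method, by_expr)
print(by_position == by_percent)
```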

### As a rule of thumb, Python’s toolset is layered: generic operations that span multiple types show up as built-in functions or expressions (e.g., len(X), X[0]), but type-specific operations are method calls (e.g., aString.upper()).
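
A quick way to see the layering is to apply the generic tools to several types and then look for a string method on a non-string. This sketch borrows lists and tuples, which are covered later, purely as stand-ins for "other sequence types".

```python
# Generic operations: the same built-in and expression work across sequences
lengths = [len('spam'), len([1, 2, 3]), len((1, 2))]
firsts = ['spam'[0], [1, 2, 3][0], (1, 2)[0]]

# Type-specific operation: upper is a method of strings only
str_has_upper = hasattr('spam', 'upper')
list_has_upper = hasattr([1, 2, 3], 'upper')

print(lengths, firsts)
print(str_has_upper, list_has_upper)
```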

## Getting Help

For more details, you can always call the built-in dir function. This
function lists variables assigned in the caller’s scope when called with no argument;
more usefully, it returns a list of all the attributes available for any object passed to it.

In [45]:
dir(S)

['__add__',
 '__class__',
 '__contains__',
 '__delattr__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__getnewargs__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__mod__',
 '__mul__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__rmod__',
 '__rmul__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 'capitalize',
 'casefold',
 'center',
 'count',
 'encode',
 'endswith',
 'expandtabs',
 'find',
 'format',
 'format_map',
 'index',
 'isalnum',
 'isalpha',
 'isascii',
 'isdecimal',
 'isdigit',
 'isidentifier',
 'islower',
 'isnumeric',
 'isprintable',
 'isspace',
 'istitle',
 'isupper',
 'join',
 'ljust',
 'lower',
 'lstrip',
 'maketrans',
 'partition',
 'removeprefix',
 'removesuffix',
 'replace',
 'rfind',
 'rindex',
 'rjust',
 'rpartition',
 'rsplit',
 'rstrip',
 'split',
 'splitlines',
 'startswith',
 'strip',
 'swapcase',


You probably won’t care about the names with double underscores in this list until later
in the book, when we study operator overloading in classes—they represent the implementation
of the string object and are available to support customization. The
__add__ method of strings, for example, is what really performs concatenation:

In [46]:
S + 'NI!'

'spamNI!'

In [47]:
S.__add__('NI!')

'spamNI!'
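
Since dir returns an ordinary list of names, it can be filtered to hide the double-underscore entries. A small sketch, assuming S is still 'spam'; `public_names` is a name invented for this example.

```python
S = 'spam'

# Keep only the names that do not start with a double underscore
public_names = [name for name in dir(S) if not name.startswith('__')]

# The + operator and the __add__ method produce the same result
concat_operator = S + 'NI!'
concat_method = S.__add__('NI!')

print(public_names[:5])
print(concat_operator == concat_method)
```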

#### The dir function simply gives the methods’ names. To ask what they do, you can pass them to the help function

In [48]:
help(S.replace)

Help on built-in function replace:

replace(old, new, count=-1, /) method of builtins.str instance
    Return a copy with all occurrences of substring old replaced by new.
    
      count
        Maximum number of occurrences to replace.
        -1 (the default value) means replace all occurrences.
    
    If the optional argument count is given, only the first count occurrences are
    replaced.



### Other Ways to Code Strings

Special characters can be represented as backslash
escape sequences, which Python displays in \xNN hexadecimal escape notation, unless
they represent printable characters:

In [52]:
S = 'A\nB\tC' # \n is end-of-line, \t is tab
S

'A\nB\tC'

In [51]:
len(S) # Each stands for just one character

5

In [53]:
ord('\n') # \n is a single character with the integer code 10 in ASCII

10

In [55]:
S = 'A\0B\0C' # \0, a binary zero byte, does not terminate string
len(S)

5

In [56]:
S # Non-printables are displayed as \xNN hex escapes

'A\x00B\x00C'
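
To confirm that each escape stands for exactly one character, the codes of every item can be listed with ord. This is only a sketch; `codes` and `Z` are names assumed for the example.

```python
S = 'A\nB\tC'
codes = [ord(c) for c in S]      # One integer code per character

Z = 'A\0B\0C'                    # Zero characters are kept, not terminators
shown = repr(Z)                  # The as-code display form of the string

print(len(S), codes)
print(len(Z), shown)
```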

Python allows strings to be enclosed in single or double quote characters. They mean
the same thing, but each allows the other type of quote to be embedded without an escape. It also allows multiline string literals enclosed in
triple quotes (single or double). When this form is used, all the lines are concatenated
together, and end-of-line characters are added where line breaks appear.

In [58]:
msg = """
aaaaaaaaaaaaa
bbb'''bbbbbbbbbb""bbbbbbb'bbbb
cccccccccccccc
"""

In [59]:
msg

'\naaaaaaaaaaaaa\nbbb\'\'\'bbbbbbbbbb""bbbbbbb\'bbbb\ncccccccccccccc\n'
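
The quoting forms differ only in how the literal is written, not in the string they produce. A brief sketch with made-up sample text:

```python
single = 'knight"s'        # A double quote inside single quotes needs no escape
double = "knight's"        # A single quote inside double quotes needs no escape
escaped = 'knight\'s'      # Same text as double, but written with a backslash

block = """line one
line two"""                # The line break becomes a \n character

print(double == escaped)
print(repr(block))
```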

#### Python also supports a raw string literal that turns off the backslash escape mechanism. Such literals start with the letter r and are useful for strings like directory paths on Windows (e.g., r'C:\text\new').
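
The difference is easiest to see by coding the same path both ways. In this sketch, `plain` and `raw` are names chosen for illustration; the path is the one from the note above.

```python
plain = 'C:\text\new'     # \t and \n are read as a tab and a newline
raw = r'C:\text\new'      # Backslashes are kept as literal characters

print(len(plain), len(raw))
print(repr(plain))
print(repr(raw))
```

The plain literal is shorter because each two-character escape collapses into one character.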

## Unicode Strings

Python’s strings also come with full Unicode support required for processing text in
internationalized character sets. Characters in the Japanese and Russian alphabets, for
example, are outside the ASCII set. Such non-ASCII text can show up in web pages,
emails, GUIs, JSON, XML, or elsewhere.

In [60]:
'sp\xc4m' # normal str strings are Unicode text

'spÄm'

In [61]:
b'a\x01c' # bytes strings are byte-based data

b'a\x01c'

Formally, in both 2.X and 3.X, non-Unicode strings are sequences of 8-bit bytes that
print with ASCII characters when possible, and Unicode strings are sequences of Unicode
code points: identifying numbers for characters, which do not necessarily map to
single bytes when encoded to files or stored in memory.

In [62]:
'spam' # Characters may be 1, 2, or 4 bytes in memory

'spam'

In [63]:
'spam'.encode('utf8') # Encoded to 4 bytes in UTF-8 in files

b'spam'

In [64]:
'spam'.encode('utf16') # But encoded to 10 bytes in UTF-16

b'\xff\xfes\x00p\x00a\x00m\x00'
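
Comparing lengths makes the gap between characters and encoded bytes concrete. The sketch below reuses the strings from the cells above; the variable names are assumptions for this example.

```python
text = 'spam'
utf8_bytes = text.encode('utf8')       # One byte per ASCII character
utf16_bytes = text.encode('utf16')     # Two bytes per character plus a 2-byte marker

non_ascii = 'sp\xc4m'                  # Four characters, one outside ASCII
non_ascii_utf8 = non_ascii.encode('utf8')

print(len(text), len(utf8_bytes), len(utf16_bytes))
print(len(non_ascii), len(non_ascii_utf8))
print(utf16_bytes.decode('utf16') == text)
```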

Both 3.X and 2.X also support coding non-ASCII characters with
\x hexadecimal and short \u and long \U Unicode escapes, as well as file-wide encodings declared in program source files.





In [65]:
'sp\xc4\u00c4\U000000c4m'

'spÄÄÄm'

What these values mean and how they are used differs between text strings, which are
the normal string in 3.X and Unicode in 2.X, and byte strings, which are bytes in 3.X
and the normal string in 2.X. Byte strings use only \x hexadecimal escapes to embed the 
encoded form of text, not its decoded code point values—encoded bytes are the same as 
code points, only for some encodings and characters.

In [66]:
'\u00A3', '\u00A3'.encode('latin1'), b'\xA3'.decode('latin1')

('£', b'\xa3', '£')

#### Python 3.X has a tighter model that never allows its normal and byte strings to mix without explicit conversion:

u'x' + b'y' # Fails in 3.3 (where u is optional and ignored) \
u'x' + 'y' # Works in 3.3: 'xy'

'x' + b'y'.decode() # Works in 3.X if decode bytes to str: 'xy' \
'x'.encode() + b'y' # Works in 3.X if encode str to bytes: b'xy'
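
Run under 3.X, the mixed expression raises an error, while either explicit conversion works. A minimal sketch that catches the error so the script can continue; the names `mixed`, `as_text`, and `as_bytes` are invented here.

```python
try:
    mixed = 'x' + b'y'             # str + bytes is rejected in 3.X
except TypeError:
    mixed = None

as_text = 'x' + b'y'.decode()      # Decode bytes to str first
as_bytes = 'x'.encode() + b'y'     # Or encode str to bytes first

print(mixed)
print(as_text, as_bytes)
```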

## Pattern Matching

To do pattern matching in Python, we import a module called re. This module has calls for searching, splitting, and replacement
that are analogous to the string methods, but because we can use patterns to specify substrings, we can be much more
general:

In [68]:
import re

# This example searches for a substring that begins with the word “Hello,” followed by
# zero or more tabs or spaces, followed by arbitrary characters to be saved as a matched
# group, terminated by the word “world.”

match = re.match('Hello[ \t]*(.*)world', 'Hello Python world')

In [69]:
match.group(1)

'Python '

In [70]:
# The following pattern, for example, picks out three groups separated by slashes
# or colons, and is similar to splitting by an alternatives pattern:
match = re.match('[/:](.*)[/:](.*)[/:](.*)', '/usr/home:lumberjack')
match.groups()

('usr', 'home', 'lumberjack')

In [71]:
re.split('[/:]', '/usr/home/lumberjack')

['', 'usr', 'home', 'lumberjack']
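
The searching and replacement calls are not shown above, so here is a hedged sketch of them next to split. It uses re.search and re.sub from the standard re module; the sample path and the variable names are chosen only for this example.

```python
import re

path = '/usr/home:lumberjack'

parts = re.split('[/:]', path)         # Split on either delimiter
cleaned = re.sub('[/:]', '-', path)    # Replace every delimiter match
hit = re.search('home', path)          # Search anywhere in the string
offset = hit.start()                   # Offset where the match begins

print(parts)
print(cleaned)
print(offset, path.find('home'))
```

For a fixed substring such as 'home', re.search reports the same offset as the string find method; the pattern forms pay off once the target is not a literal.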