# 1.4 Strings

This section introduces ways to work with text.

**Representing Literal Text**

String literals are written in programs with quotes.

In [2]:
# Single quote
a = 'Yeah but no but yeah but...'

# Double quote
b = "computer says no"

# Triple quotes
c = '''
Look into my eyes, look into my eyes, the eyes, the eyes, the eyes,
not around the eyes,
don't look around the eyes,
look into my eyes, you're under.
'''

print(c)


Look into my eyes, look into my eyes, the eyes, the eyes, the eyes,
not around the eyes,
don't look around the eyes,
look into my eyes, you're under.



Normally strings may only span a single line. Triple quotes capture all text enclosed across multiple lines including all formatting.

There is no difference between using single (') versus double (") quotes. However, the same type of quote used to start a string must be used to terminate it.
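Because both quote styles are equivalent, one practical use is to pick whichever style avoids escaping quotes inside the text. A minimal sketch (the variable names are illustrative only):

```python
# Pick the quote style that avoids escaping the quotes inside the text.
a = "don't look around the eyes"     # apostrophe inside double quotes
b = 'computer says "no"'             # double quotes inside single quotes
c = 'don\'t look around the eyes'    # same text as a, written with an escape

print(a == c)   # True
print(b)
```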

**String escape codes**

Escape codes are used to represent control characters and characters that can't be easily typed directly at the keyboard. Here are some common escape codes:

'\n'      Line feed         # Next line
'\r'      Carriage return   # Move to the leftmost column
'\t'      Tab
'\''      Literal single quote
'\"'      Literal double quote
'\\'      Literal backslash
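As a small illustration of the table above, each escape code stands for a single character in the resulting string. The variable names below are made up for this sketch:

```python
table = 'Name\tShares\nIBM\t100'   # \t separates columns, \n separates rows
print(table)

newline_len = len('\n')      # 1, an escape code stands for one character
backslash_len = len('\\')    # 1, a literal backslash
quote = 'It\'s'              # same text as "It's"
print(newline_len, backslash_len, quote)
```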

**String Representation**

Each character in a string is stored internally as a so-called Unicode "code-point" which is an integer. You can specify an exact code-point value using the following escape sequences:

In [4]:
a = '\xf1'          # a = 'ñ'
b = '\u2200'        # b = '∀'
c = '\U0001D122'    # c = '𝄢'
d = '\N{FOR ALL}'   # d = '∀'
print(a,b,c,d)

ñ ∀ 𝄢 ∀


The Unicode Character Database is a reference for all available character codes. https://unicode.org/charts/
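To go between a character and its integer code-point, the built-in functions ord() and chr() can be used. A short sketch with illustrative variable names:

```python
code = ord('ñ')          # 241, the code-point as an integer
as_hex = hex(code)       # '0xf1', matching the '\xf1' escape
forall = chr(0x2200)     # '∀', the character for code-point U+2200
print(code, as_hex, forall)

same = (forall == '\u2200' == '\N{FOR ALL}')
print(same)              # True
```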

**String Indexing**

Strings work like an array for accessing individual characters. You use an integer index, starting at 0. Negative indices specify a position relative to the end of the string.

In [6]:
a = 'Hello world'
b = a[0]          # 'H'
c = a[4]          # 'o'
d = a[-1]         # 'd' (end of string)
print(b, c, d)

H o d


You can also slice or select substrings specifying a range of indices with :.

In [7]:
d = a[:5]     # 'Hello' (indices 0 through 4)
e = a[6:]     # 'world'
f = a[3:8]    # 'lo wo'
g = a[-5:]    # 'world'
print(d)
print(e)
print(f)
print(g)

Hello
world
lo wo
world


The character at the ending index is not included. 

Missing indices assume the beginning or ending of the string.
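A few more slicing cases, sketched here to show how missing and out-of-range indices behave (the variable names are illustrative):

```python
a = 'Hello world'

whole = a[:]            # 'Hello world', both indices omitted
rejoined = a[:5] + a[5:]   # splitting at any index and rejoining gives a back
clipped = a[3:100]      # 'lo world', a slice past the end is clipped
empty = a[2:2]          # '', start equals end so nothing is selected

print(whole, rejoined == a, clipped, repr(empty))

# Unlike slicing, indexing past the end is an error.
try:
    a[100]
except IndexError as err:
    print('IndexError:', err)
```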

**String operations**

Concatenation, length, membership and replication.

In [8]:
# Concatenation (+)
a = 'Hello' + 'World'   # 'HelloWorld'
b = 'Say ' + a          # 'Say HelloWorld'

# Length (len)
s = 'Hello'
len(s)                  # 5

# Membership test (`in`, `not in`)
t = 'e' in s            # True
f = 'x' in s            # False
g = 'hi' not in s       # True

# Replication (s * n)
rep = s * 5             # 'HelloHelloHelloHelloHello'
print(rep)

HelloHelloHelloHelloHello


**String methods**

Strings have methods that perform various operations with the string data.

Example: stripping any leading / trailing white space.

In [16]:
s = '  Hello '
t = s.strip()     # 'Hello'
z = '  Hello '.strip()
print(s)
print(t)
print(z)

  Hello 
Hello
Hello


Example: Replacing text.

In [29]:
s = 'Hello world'
t = s.replace('Hello' , 'Hallo')   # 'Hallo world'
print(s, t)
s.upper()

Hello world Hallo world


'HELLO WORLD'

**More string methods:**

Strings have a wide variety of other methods for testing and manipulating the text data. This is a small sample of methods:

s.endswith(suffix)     # Check if string ends with suffix
s.find(t)              # Index of first occurrence of t in s (-1 if not found)
s.index(t)             # Index of first occurrence of t in s (ValueError if not found)
s.isalpha()            # Check if characters are alphabetic
s.isdigit()            # Check if characters are numeric
s.islower()            # Check if characters are lower-case
s.isupper()            # Check if characters are upper-case
s.join(slist)          # Join a list of strings using s as delimiter
s.lower()              # Convert to lower case
s.replace(old,new)     # Replace text
s.rfind(t)             # Index of last occurrence of t in s (-1 if not found)
s.rindex(t)            # Index of last occurrence of t in s (ValueError if not found)
s.split([delim])       # Split string into list of substrings
s.startswith(prefix)   # Check if string starts with prefix
s.strip()              # Strip leading/trailing whitespace
s.upper()              # Convert to upper case
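A short sketch of a few of these methods in use. It highlights that find() and index() differ only in how they report a missing substring, and that split() and join() are inverses of each other. The variable names are illustrative:

```python
s = 'Hello world'

found = s.find('world')     # 6
missing = s.find('xyz')     # -1, find() signals "not found" with -1
print(found, missing)

try:
    s.index('xyz')          # index() raises an exception instead
except ValueError as err:
    print('ValueError:', err)

parts = 'AAPL,IBM,MSFT'.split(',')   # ['AAPL', 'IBM', 'MSFT']
joined = '-'.join(parts)             # 'AAPL-IBM-MSFT'
print(parts, joined)
```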

**String Mutability**

Strings are "immutable" or read-only. Once created, the value can't be changed.

In [30]:
s = 'Hello World'
s[1] = 'a'

TypeError: 'str' object does not support item assignment

**All operations and methods that manipulate string data always create new strings.**
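To get the effect of changing a character, you can build a new string from slices of the old one. A minimal sketch (the name t is illustrative):

```python
s = 'Hello World'

# Take everything before index 1, the new character, and everything after it.
t = s[:1] + 'a' + s[2:]

print(s)   # Hello World (the original is unchanged)
print(t)   # Hallo World
```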

**String Conversions**

Use str() to convert any value to a string. The result is a string holding the same text that the print() function would have produced.

In [31]:
x = 42
str(x)

'42'
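A common reason to call str() is string concatenation, which only works between strings. The following sketch uses made-up variable names:

```python
shares = 100
label = 'Shares: ' + str(shares)    # 'Shares: 100'
print(label)

# Without str(), mixing a string and an integer is an error.
try:
    label = 'Shares: ' + shares
except TypeError as err:
    print('TypeError:', err)

print(str(3.5), str(True), str(None))   # 3.5 True None
```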

**Byte Strings**

A string of 8-bit bytes, commonly encountered with low-level I/O, is written as follows:

In [32]:
data = b'Hello World\r\n'

By putting a little b before the first quotation, you specify that it is a byte string as opposed to a text string.

Most of the usual string operations work.

In [33]:
len(data)                         # 13
data[0:5]                         # b'Hello'
data.replace(b'Hello', b'Cruel')  # b'Cruel World\r\n'

b'Cruel World\r\n'

Indexing is a bit different because it returns byte values as integers.

In [34]:
data[0]   # 72 (ASCII code for 'H')

72
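If you want a one-byte byte string rather than an integer, use a slice. The sketch below also shows converting between bytes and a list of integer values (variable names are illustrative):

```python
data = b'Hello World\r\n'

first = data[0]            # 72, indexing gives an integer
first_slice = data[0:1]    # b'H', slicing gives bytes
values = list(data[:5])    # [72, 101, 108, 108, 111]
rebuilt = bytes(values)    # b'Hello'

print(first, first_slice, values, rebuilt)
```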

Conversion to/from text strings.

In [36]:
text = data.decode('utf-8') # bytes -> text
print(text)
data = text.encode('utf-8') # text -> bytes
print(data)

Hello World

b'Hello World\r\n'


The 'utf-8' argument specifies a character encoding. 

Other common values include 'ascii' and 'latin1'.
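The choice of encoding matters as soon as the text contains non-ASCII characters. A small sketch, using an example word that is not from the text above:

```python
text = 'Jalape\xf1o'                # 'Jalapeño'

utf8 = text.encode('utf-8')         # b'Jalape\xc3\xb1o', ñ takes two bytes
latin1 = text.encode('latin1')      # b'Jalape\xf1o', ñ takes one byte
print(len(text), len(utf8), len(latin1))   # 8 9 8

# 'ascii' cannot represent ñ at all.
try:
    text.encode('ascii')
except UnicodeEncodeError as err:
    print('UnicodeEncodeError:', err)

# Decoding with the matching encoding recovers the original text.
roundtrip = utf8.decode('utf-8')
print(roundtrip == text)            # True
```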

**Raw Strings**

Raw strings are string literals with an uninterpreted backslash. They are specified by prefixing the initial quote with a lowercase "r".

In [37]:
rs = r'c:\newdata\test' # Raw (uninterpreted backslash)
rs

'c:\\newdata\\test'

The string is the literal text enclosed inside, exactly as typed. This is useful in situations where the backslash has special significance, such as file names and regular expressions.
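To see the difference, compare the same characters typed with and without the r prefix. In this sketch the names plain and raw are illustrative:

```python
plain = 'c:\newdata\test'    # \n and \t are interpreted as newline and tab
raw = r'c:\newdata\test'     # backslashes are kept as typed

print(len(plain))   # 13
print(len(raw))     # 15
print(raw == 'c:\\newdata\\test')   # True
```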

**f-Strings**

A string with formatted expression substitution.

In [38]:
name = 'IBM'
shares = 100
price = 91.1
a = f'{name:>10s} {shares:10d} {price:10.2f}' # 10 = minimum field width, s = string, d = decimal integer, f = fixed-point float
a

'       IBM        100      91.10'

In [39]:
b = f'Cost = ${shares*price:0.2f}' # float with 2 decimal places
b

'Cost = $9110.00'
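The following sketch shows the alignment characters (<, ^, >) side by side. It also shows that the field width is a minimum, so longer values are not truncated. The brackets and variable names are only there to make the padding visible:

```python
name = 'IBM'
price = 91.1

left = f'[{name:<10s}]'       # '[IBM       ]'
center = f'[{name:^10s}]'     # '[   IBM    ]'
right = f'[{name:>10s}]'      # '[       IBM]'
number = f'[{price:10.2f}]'   # '[     91.10]'
long_name = f'[{"MICROSOFT CORP":10s}]'   # '[MICROSOFT CORP]'

for item in (left, center, right, number, long_name):
    print(item)
```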

# Exercises

In these exercises, you'll experiment with operations on Python's string type. You should do this at the Python interactive prompt where you can easily see the results. 

Start by defining a string containing a series of stock ticker symbols like this:

In [40]:
symbols = 'AAPL,IBM,MSFT,YHOO,SCO'

**Exercise 1.13: Extracting individual characters and substrings**

Strings are arrays of characters. Try extracting a few characters:

In [42]:
symbols[0]

'A'

In [43]:
symbols[1]

'A'

In [44]:
symbols[2]

'P'

In [45]:
symbols[-1]  # Last character

'O'

In [49]:
symbols[-2] # Negative indices are from end of string

'C'

In Python, strings are read-only.

Verify this by trying to change the first character of symbols to a lower-case 'a'.

In [50]:
symbols[0] = 'a'

TypeError: 'str' object does not support item assignment

**Exercise 1.14: String concatenation**

Although string data is read-only, you can always reassign a variable to a newly created string.

Try the following statement which concatenates a new symbol "GOOG" to the end of symbols:

In [51]:
symbols = symbols + 'GOOG'
symbols

'AAPL,IBM,MSFT,YHOO,SCOGOOG'

Oops! That's not what you wanted. Fix it so that the symbols variable holds the value 'AAPL,IBM,MSFT,YHOO,SCO,GOOG'.

In [53]:
symbols = 'AAPL,IBM,MSFT,YHOO,SCO'
symbols = symbols + ',GOOG'
symbols

'AAPL,IBM,MSFT,YHOO,SCO,GOOG'

Add 'HPQ' to the front of the string:

In [54]:
symbols = 'HPQ,' + symbols
symbols

'HPQ,AAPL,IBM,MSFT,YHOO,SCO,GOOG'

In these examples, it might look like the original string is being modified, in an apparent violation of strings being read-only. Not so. Operations on strings create an entirely new string each time. When the variable name symbols is reassigned, it points to the newly created string. Afterwards, the old string is destroyed since it's not being used anymore.
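One way to convince yourself of this is to keep a second name pointing at the old string. In the sketch below (the name original is illustrative), the old string survives only because that second name still refers to it:

```python
symbols = 'AAPL,IBM'
original = symbols             # a second name for the same string

symbols = symbols + ',GOOG'    # builds a new string and rebinds symbols

print(original)                # AAPL,IBM (never modified)
print(symbols)                 # AAPL,IBM,GOOG
print(original is symbols)     # False, two different string objects
```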

**Exercise 1.15: Membership testing (substring testing)**

Experiment with the `in` operator to check for substrings. At the interactive prompt, try these operations:

In [55]:
'IBM' in symbols

True

In [56]:
'AA' in symbols

True

In [57]:
'CAT' in symbols

False

Why did the check for 'AA' return True? Because the `in` operator tests for substrings, and 'AA' appears at the start of 'AAPL'.
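If you want to test for a whole symbol rather than any substring, one possible approach is to split the string on commas first and test membership in the resulting list. A minimal sketch (the name names is illustrative):

```python
symbols = 'HPQ,AAPL,IBM,MSFT,YHOO,SCO,GOOG'
names = symbols.split(',')     # ['HPQ', 'AAPL', 'IBM', ...]

print('AA' in symbols)         # True, substring match
print('AA' in names)           # False, no symbol is exactly 'AA'
print('AAPL' in names)         # True
```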

**Exercise 1.16: String Methods**

At the Python interactive prompt, try experimenting with some of the string methods.

In [58]:
symbols.lower()

'hpq,aapl,ibm,msft,yhoo,sco,goog'

Remember, strings are always read-only. If you want to save the result of an operation, you need to place it in a variable:

In [59]:
lowersyms = symbols.lower()

Try some more operations:

In [60]:
symbols.find('MSFT')

13

In [61]:
symbols[13:17]

'MSFT'

In [63]:
symbols = symbols.replace('SCO','DOA')
symbols

'HPQ,AAPL,IBM,MSFT,YHOO,DOA,GOOG'

In [65]:
name = '   IBM   \n'
name = name.strip()    # Remove surrounding whitespace
name

'IBM'

**Exercise 1.17: f-strings**

Sometimes you want to create a string and embed the values of variables into it.

To do that, use an f-string. For example:

In [67]:
name = 'IBM'
shares = 100
price = 91.1

f'{shares} shares of {name} at ${price:.2f}'

'100 shares of IBM at $91.10'

**Exercise 1.18: Regular Expressions**

One limitation of the basic string operations is that they don't support any kind of advanced pattern matching. For that, you need to turn to Python's `re` module and regular expressions. Regular expression handling is a big topic, but here is a short example:

In [71]:
text = 'Today is 3/27/2018. Tomorrow is 3/28/2018.'
# Find all occurrences of a date
import re
re.findall(r'\d+/\d+/\d+',text)

['3/27/2018', '3/28/2018']

In [74]:
# Replace all occurrences of a date with replacement text
re.sub(r'(\d+)/(\d+)/(\d+)', r'\3-\1-\2', text)

'Today is 2018-3-27. Tomorrow is 2018-3-28.'

For more information about the re module, see the official documentation at https://docs.python.org/library/re.html.
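Note that the substitution above leaves the month and day unpadded (2018-3-27). As a sketch of one way to zero-pad them, re.sub() also accepts a function as the replacement. The helper name to_iso is hypothetical and not part of the re module:

```python
import re

text = 'Today is 3/27/2018. Tomorrow is 3/28/2018.'

def to_iso(match):
    # match.groups() holds the three captured strings: month, day, year
    month, day, year = match.groups()
    return f'{year}-{int(month):02d}-{int(day):02d}'

result = re.sub(r'(\d+)/(\d+)/(\d+)', to_iso, text)
print(result)   # Today is 2018-03-27. Tomorrow is 2018-03-28.
```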

**Commentary**

As you start to experiment with the interpreter, you often want to know more about the operations supported by different objects. For example, how do you find out what operations are available on a string?

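Depending on your Python environment, you might be able to see a list of available methods via tab-completion: type `s.` and press the Tab key. If that does nothing, you can fall back on the built-in dir() function. For example, try typing this: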

In [76]:
s = 'hello world'
dir(s)

['__add__',
 '__class__',
 '__contains__',
 '__delattr__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__getnewargs__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__mod__',
 '__mul__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__rmod__',
 '__rmul__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 'capitalize',
 'casefold',
 'center',
 'count',
 'encode',
 'endswith',
 'expandtabs',
 'find',
 'format',
 'format_map',
 'index',
 'isalnum',
 'isalpha',
 'isascii',
 'isdecimal',
 'isdigit',
 'isidentifier',
 'islower',
 'isnumeric',
 'isprintable',
 'isspace',
 'istitle',
 'isupper',
 'join',
 'ljust',
 'lower',
 'lstrip',
 'maketrans',
 'partition',
 'removeprefix',
 'removesuffix',
 'replace',
 'rfind',
 'rindex',
 'rjust',
 'rpartition',
 'rsplit',
 'rstrip',
 'split',
 'splitlines',
 'startswith',
 'strip',
 'swapcase',


dir() produces a list of all operations that can appear after the dot (.). Use the help() function to get more information about a specific operation:

In [77]:
help(s.upper)

Help on built-in function upper:

upper() method of builtins.str instance
    Return a copy of the string converted to uppercase.
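The names that start and end with double underscores are special methods used internally by Python. As a sketch, you can filter them out to see only the ordinary methods. The name public is illustrative:

```python
s = 'hello world'

# Keep only the names that do not start with an underscore.
public = [name for name in dir(s) if not name.startswith('_')]

print(len(public))
print(public[:5])   # ['capitalize', 'casefold', 'center', 'count', 'encode']
```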

