# Class 4 Python Strings

## 4.1 Python Sequences

> Strings, lists and tuples are sequences of various Python objects.

> Sequence types all share the same access model: ordered set with sequentially indexed offsets to get to each element.

> Indexing starts at zero (0) and ends at one less than the length of the sequence, because counting began at 0.

### 4.1.1 Membership (`in`, `not in`)

> The `in` and `not in` operators are Boolean in nature; they return True if the membership is confirmed and False otherwise.
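
A minimal sketch of the membership operators on the three sequence types. The variable names are illustrative only, and the code is written to run on Python 3 as well as 2.7.

```python
# Membership tests work the same way on strings, lists, and tuples.
vowels = 'aeiou'
primes = [2, 3, 5, 7]
point = (1.5, -2.0)

in_string = 'e' in vowels        # 'e' is a character of the string
substring = 'ei' in vowels       # strings also match whole substrings
in_list = 4 in primes            # 4 is not an element of the list
not_in_tuple = 0 not in point    # 0 is absent from the tuple

print((in_string, substring, in_list, not_in_tuple))
```

Note that only strings match a run of several elements (`'ei' in vowels`); for lists and tuples, `in` compares against individual elements.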

### 4.1.2 Concatenation (`+`)

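Concatenation of sequences differs from addition of numbers: `+` joins the operands end to end into a new sequence instead of summing values.

Repeated concatenation is less efficient than `join`, especially when many items are merged, because each `+` creates a new intermediate object. For a handful of short literals, as measured below, the difference is negligible.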

In [12]:
%load_ext memory_profiler
%timeit %memit 'a' + 'b' + 'c' + 'd' + 'e'

peak memory: 38.86 MiB, increment: 0.49 MiB
peak memory: 38.86 MiB, increment: 0.00 MiB
peak memory: 38.86 MiB, increment: 0.00 MiB
peak memory: 38.87 MiB, increment: 0.00 MiB
1 loop, best of 3: 253 ms per loop


In [13]:
%timeit %memit ''.join(('a', 'b', 'c', 'd', 'e'))

peak memory: 38.87 MiB, increment: 0.00 MiB
peak memory: 38.87 MiB, increment: 0.00 MiB
peak memory: 38.87 MiB, increment: 0.00 MiB
peak memory: 38.87 MiB, increment: 0.00 MiB
1 loop, best of 3: 248 ms per loop
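
The two measurements above merge only five one-character literals, so they cannot separate the two approaches. The following is a rough sketch of a comparison on a larger input using the standard `timeit` module. The helper names `concat_with_plus` and `concat_with_join`, the piece count, and the repeat count are assumptions chosen for illustration. Timings depend on the interpreter and machine, and CPython can optimize some `+` patterns, so the gap may be small.

```python
import timeit

# 10,000 short strings to merge; the size is an arbitrary choice.
pieces = [str(n) for n in range(10000)]

def concat_with_plus(items):
    """Build the result by repeated concatenation."""
    result = ''
    for item in items:
        result = result + item   # may copy the partial result each time
    return result

def concat_with_join(items):
    """Build the result in a single call to str.join."""
    return ''.join(items)

plus_seconds = timeit.timeit(lambda: concat_with_plus(pieces), number=20)
join_seconds = timeit.timeit(lambda: concat_with_join(pieces), number=20)
print('plus: %.4f s, join: %.4f s' % (plus_seconds, join_seconds))
```

Both helpers return the same string; only the way the result is assembled differs.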


### 4.1.3 Slices

__Individual slicing__

> Sequences are structured data types whose elements are placed sequentially in an ordered manner.

`sequences[index]`

Index can be either positive or negative.

1. A positive index counts from left (0) to right (`len(sequence)-1`);
2. A negative index counts from right (-1) to left (`-len(sequence)`).
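
The two numbering directions can be sketched as follows. The sample string and variable names are illustrative only.

```python
letters = 'python'

first = letters[0]                     # leftmost element
last = letters[len(letters) - 1]       # rightmost element, positive index
also_last = letters[-1]                # rightmost element, negative index
also_first = letters[-len(letters)]    # leftmost element, negative index

# An index outside the valid range raises IndexError.
try:
    letters[len(letters)]
    out_of_range = False
except IndexError:
    out_of_range = True
```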
    
> Accessing a group of elements is similar to accessing just a single item. Starting and ending indexes may be given, separated by a colon ( : ).


__Group slicing__

`sequence[starting_index:ending_index]`

> Both starting_index and ending_index are optional, and if not provided, or if None is used as an index, the slice will go from the beginning of the sequence or until the end of the sequence, respectively.

`sequence == sequence[:]`

`sequence[0:ending_index]` or `sequence[:ending_index]`

`sequence[starting_index:len(sequence)]` or `sequence[starting_index:]`

Reversed sequence: `sequence[::-1]`

> The starting and ending indices can exceed the length of the string.

> Using None as an index has the same effect as a missing index.
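
A short sketch of the slicing rules above, using a small list. The variable names are illustrative only.

```python
seq = ['a', 'b', 'c', 'd', 'e']

whole_copy = seq[:]        # every element, as a new list
head = seq[:2]             # same as seq[0:2]
tail = seq[2:]             # same as seq[2:len(seq)]
with_none = seq[None:2]    # None behaves like a missing index
clipped = seq[1:100]       # an out-of-range bound is clipped, no error
backwards = seq[::-1]      # a step of -1 walks the sequence in reverse
```

Unlike a single out-of-range index, an out-of-range slice bound does not raise an exception; the slice simply stops at the end of the sequence.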

### 4.1.4 Built-in Functions

> If you pass a list to list(), a (shallow) copy of the list’s objects will be made and inserted into the new list.

> A shallow copy is where only __references__ are copied…no new objects are made!
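
A minimal sketch of what a shallow copy means in practice, assuming a list that contains another list. The names are illustrative only.

```python
inner = ['x', 'y']
original = [inner, 'z']
copied = list(original)    # shallow copy: a new outer list

# The outer lists are distinct objects...
outer_is_new = copied is not original
# ...but both hold a reference to the same inner list.
inner_is_shared = copied[0] is original[0]

copied[0].append('w')      # mutates the shared inner list
copied.append('extra')     # affects only the new outer list
```

After these two calls, `original[0]` also contains `'w'`, while `original` itself still has two elements.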

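> Note that `len()` and `reversed()` need a sequence (or another object that supports a length or reverse iteration), while the rest, including `sum()`, `enumerate()`, `sorted()`, and `zip()`, can take any iterable.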

In [17]:
for order, item in enumerate(['a', 'b', 'c']): print(order, item)

(0, 'a')
(1, 'b')
(2, 'c')


In [18]:
for order, item in enumerate(['a', 'b', 'c'], start=1): print(order, item)

(1, 'a')
(2, 'b')
(3, 'c')


In [19]:
sorted(['a', 'c', 'b'])

['a', 'b', 'c']

In [21]:
sorted(['a', 'c', 'b'], reverse=True)

['c', 'b', 'a']

In [22]:
reversed(['a', 'c', 'b'])

<listreverseiterator at 0x10e7c3510>

In [23]:
list(reversed(['a', 'c', 'b']))

['b', 'c', 'a']

In [24]:
list(reversed(sorted(['a', 'c', 'b'])))

['c', 'b', 'a']

In [25]:
zip(['a', 'c', 'b'], ['1', '2', '0'])

[('a', '1'), ('c', '2'), ('b', '0')]

## 4.2 Strings

> Python treats single quotes the same as double quotes.

> Strings are a literal or scalar type, meaning they are treated by the interpreter as a singular value and are not containers that hold other Python objects. 

> Strings are immutable, meaning that changing an element of a string requires creating a new string.
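
Immutability can be observed directly: assigning to one position fails, so a changed string has to be built as a new object. A small sketch with illustrative names:

```python
greeting = 'hello'

try:
    greeting[0] = 'H'      # strings do not support item assignment
    assignment_failed = False
except TypeError:
    assignment_failed = True

# Build a new string from pieces of the old one instead.
capitalized = 'H' + greeting[1:]
```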

In [26]:
#!/usr/bin/env python

import string

alphas = string.ascii_letters + '_'
nums = string.digits 
alphnum = alphas + nums

print 'Welcome to the Identifier Checker v1.0'
print 'Testees must be at least 2 chars long.'

myInput = raw_input('Identifier to test? ')

if len(myInput) > 1:
    if myInput[0] not in alphas:
        print '''invalid: first symbol must be alphabetic'''
    else:
        for otherChar in myInput[1:]:
            if otherChar not in alphnum: # ! not efficient
                print '''invalid: remaining symbols must be alphanumeric'''
                break
        else:  # for-else: runs only when the loop finished without break
            print "okay as an identifier"

Welcome to the Identifier Checker v1.0
Testees must be at least 2 chars long.
Identifier to test? as
okay as an identifier
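
The same check can be packaged as a function that returns a Boolean instead of printing. This is only a sketch: `is_identifier`, `ALPHAS`, and `ALPHANUMS` are hypothetical names, the use of `frozenset` is one possible response to the "not efficient" remark in the script, and reserved keywords are not rejected.

```python
import string

# Sets give constant-time membership tests for single characters.
ALPHAS = frozenset(string.ascii_letters + '_')
ALPHANUMS = ALPHAS | frozenset(string.digits)

def is_identifier(candidate):
    """Return True if candidate looks like an ASCII identifier of 2+ characters."""
    if len(candidate) < 2:
        return False
    if candidate[0] not in ALPHAS:
        return False
    # all() stops at the first character that fails the test.
    return all(ch in ALPHANUMS for ch in candidate[1:])

print(is_identifier('as'))
```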


### 4.2.1 Concatenation

> When concatenating regular and Unicode strings, regular strings are converted to Unicode first before the operation occurs.

`'Hello' + u' ' + 'World' + u'!'`

### 4.2.2 Format string

`format_string % (arguments_to_convert)`

In [51]:
"MM/DD/YY = %02d/%02d/%d" % (2, 15, 67)  # d for int

'MM/DD/YY = 02/15/67'

In [52]:
'Your host is: %s' % 'earth' # s for str

'Your host is: earth'

In [53]:
'%f' % 1234.567890 # f for floating

'1234.567890'

In [54]:
'%.2f' % 1234.567890 # .2f for floating with 2 decimals

'1234.57'

## 4.3 String Built-in Methods

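`string.decode(encoding, errors='strict')`

`string.encode(encoding, errors='strict')`

In Python 2 the encoding defaults to the interpreter's default encoding (normally ASCII), not UTF-8, so pass it explicitly.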

`string.join(seq)`

`string.split([sep[, maxsplit]])`

`string.translate(table[, deletechars])`
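
A brief sketch of `split`, `join`, `encode`, and `decode`, written so that it runs on Python 3 as well as 2.7. The sample strings are illustrative only. `translate` is left out because its table argument is built differently in the two versions.

```python
text = 'one,two,three'

words = text.split(',')         # split on every comma
limited = text.split(',', 1)    # split at most once
joined = '-'.join(words)        # the separator is the string the method is called on

# encode turns text into bytes, decode turns bytes back into text.
encoded = u'caf\u00e9'.encode('utf-8')
decoded = encoded.decode('utf-8')
```

The accented character occupies two bytes in UTF-8, so `encoded` is one byte longer than `decoded` has characters.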

## 4.4 Special Features of Strings

`\t` # tab

`\n` # new line

`\"` # double quote

`\'` # single quote

`\\` # backslash
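
Each escape sequence above stands for a single character in the resulting string, which the following sketch checks by length. The sample strings are illustrative only.

```python
tabbed = 'name\tvalue'          # \t is one tab character
two_lines = 'first\nsecond'     # \n is one newline character
quoted = 'She said \"hi\" and it\'s fine'
path = 'C:\\temp'               # \\ is one backslash

print(tabbed)
print(two_lines)
print(quoted)
print(path)
```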


## 4.5 Unicode


### 4.5.1 ASCII VS Unicode VS UTF-8

> ASCII was simple. Every English character was stored in the computer as a seven bit number between 32 and 126.

> Unicode can currently represent over 90,000 characters.

> UTF-8 may use one or two bytes to represent a letter, three (mainly) for CJK/East Asian characters, and four for some rare, special use, or historic characters.
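
The variable width of UTF-8 can be checked by encoding one character from each range and counting the bytes. The sample characters below are chosen for illustration only.

```python
samples = [
    u'A',             # ASCII letter
    u'\u00e9',        # Latin small letter e with acute
    u'\u4e2d',        # a CJK ideograph
    u'\U0001d11e',    # musical symbol G clef, outside the Basic Multilingual Plane
]

# Number of bytes each character needs in UTF-8.
byte_lengths = [len(ch.encode('utf-8')) for ch in samples]
print(byte_lengths)
```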

### 4.5.2 Using unicode in real life

__Always convert str to unicode in Python.__

__UTF-8 in, unicode() processes, and UTF-8 out.__

> • Always prefix your string literals with u.

> • Never use str()… always use unicode() instead.

> • Never use the outdated string module—it blows up when you pass it any non-ASCII characters.

> • Only call the encode() method right before you write your text to a file, database, or the network, and only call the decode() method when you are reading it back in.

In [67]:
#!/usr/bin/env python
''' An example of reading and writing Unicode strings: 
Writes a Unicode string to a file in utf-8 and reads it back in.'''

CODEC = 'utf-8' 
FILE = 'unicode.txt'

hello_out = u"Hello world\n"
bytes_out = hello_out.encode(CODEC)
# Binary mode, because encoded bytes are written and read back.
with open(FILE, "wb") as f:
    f.write(bytes_out)
with open(FILE, "rb") as f:
    bytes_in = f.read()
hello_in = bytes_in.decode(CODEC)
print hello_in

Hello world


In [68]:
def to_uni(string):
    return string.decode('utf-8', 'ignore') if isinstance(string, str) else string


def to_str(value):
    """
    convert both unicode and int to str; return anything else unchanged
    """
    if isinstance(value, unicode):
        return value.encode('utf-8', 'ignore')
    if isinstance(value, int):
        return str(value)  # int has no encode() method
    return value