
![Py4Eng](img/logo.png)

# Sequences: strings, lists, for loops
## Yoav Ram

# Strings

Strings are ordered collections of _characters_. 

_Ordered collections_ means that elements are numbered with _indexes_: 0, 1, 2, 3, 4...  
Note that the first index is 0, __not__ 1!

We can create new strings using single or double quotes: `'` or `"`.

In [4]:
x = "Jupyter"
y = 'I love Python'
print(x)
print(y)

Jupyter
I love Python


Strings are objects of type `str`:

In [5]:
type(x)

str

We can concatenate (join end to end) strings:

In [6]:
print(x + "2018")

Jupyter2018


We can convert strings to numbers and vice versa (when the content is appropriate):

In [7]:
x = "4"
y = int(x)
print("y + 1 =", y + 1)

y + 1 = 5


Otherwise, we get an error message...

In [8]:
print("x + 1 =", x + 1)

TypeError: must be str, not int

In [9]:
x = str(y)
print("x =", x)

x = 4


In [10]:
x = "3.14"
y = float(x)
print("y*2 =", y * 2)

y*2 = 6.28
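To make the difference between text and numbers concrete, here is a small sketch (the variable names `as_text`, `as_number` and `round_trip` are ours, for illustration only). The `+` operator concatenates when both sides are strings and adds when both sides are numbers:

```python
x = "4"

as_text = x + "1"        # both operands are strings: concatenation
as_number = int(x) + 1   # both operands are integers: addition

print(as_text)
print(as_number)

# Converting to int and back to str recovers the original text here
round_trip = str(int(x))
print(round_trip == x)
```
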


Strings are text but can represent other things, too. For example, DNA sequences.

Again we can concat strings:

In [11]:
upstream = "AAA"
downstream = "GGG"
dna = upstream + "ATG" + downstream
print(dna)

AAAATGGGG


We can find the length of a string using the built-in function `len`:

In [12]:
n = len(dna)
print("The length of the DNA variable is", n)

dna = dna + "AGCTGA"
print("Now it is", len(dna))

The length of the DNA variable is 9
Now it is 15


We can use *syntactic sugar* to make `dna = dna + x` into `dna += x`:

In [13]:
print(dna)
dna += "AGCTGA"
print(dna)

AAAATGGGGAGCTGA
AAAATGGGGAGCTGAAGCTGA


This also works with numbers and with other operators:

In [14]:
x = 10
x *= 7
print(x)

70


## Access: Indexing

We can access specific characters (sequence items) in a string using square brackets (`[]`):

In [25]:
text = "A musician wakes from a terrible nightmare."

In [26]:
print(text[0])
print(text[5])

A
i


Python uses **zero-based** indexing: the first element has index 0.

In addition, there is also support for reverse indexing using negative numbers:

In [27]:
print(text[-1])
print(text[-4])

.
a


Here, the last element is accessed using -1 index, and so on.
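As a sanity check on negative indexes, the following sketch (variable names are ours) shows that index `-k` refers to the same character as index `len(text) - k`:

```python
text = "A musician wakes from a terrible nightmare."
n = len(text)

last = text[-1]
same_last = text[n - 1]          # the last character, counted from the start

fourth_from_end = text[-4]
same_fourth = text[n - 4]        # the same character, counted from the start

print(last == same_last)
print(fourth_from_end == same_fourth)
```
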

## Access: Slicing
We can extract subsets of a string by using _slicing_, with the corresponding indexes.  
Remember: indexes start from **0**!

As a reminder, we can access individual characters of the string by index (_starting from 0_):

In [28]:
# get the 1st and 6th letters
print(text[0])
print(text[5])

A
i


Indexes work from the tail as well, using negative indices:

In [29]:
# get the last letter
print(text[-1])
# get 5th letter from the end
print(text[-5])

.
m


We can get a range of indexes using _\[start:end\]_

In [31]:
# get the 3rd to 8th letters
print(text[2:8])

musici


Notice that the _start_ position is included, but the _end_ position is not. We actually take the characters with indexes 2, 3, 4, 5, 6, 7.
And what do we get?

In [32]:
type(text[2:8])

str

There are shortcuts for taking the first and last characters:

In [33]:
# get the first 5 letters
print(text[0:5])
# or simply:
print(text[:5])

# from the 4th letter (index 3) to the end:
print(text[3:])

# last 3 letters
print(text[-3:])

A mus
A mus
usician wakes from a terrible nightmare.
re.
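Two consequences of the "start included, end excluded" rule are worth checking by hand. The sketch below (names `piece` and `rejoined` are ours) shows that a slice `[start:end]` has `end - start` characters, and that cutting a string at any index and gluing the two halves back gives the original:

```python
text = "A musician wakes from a terrible nightmare."

start, end = 2, 8
piece = text[start:end]
print(piece)
print(len(piece) == end - start)

# text[:5] stops right before index 5, text[5:] starts at index 5,
# so no character is lost or duplicated
rejoined = text[:5] + text[5:]
print(rejoined == text)
```
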


## Exercise

The sequence below (named _seq_) consists of 20 characters. 

1. Print the 2nd and 7th characters.
2. Print the 2nd character from the end.
3. Slice the first half of the sequence.  
4. Slice the second half of the sequence.  
5. Slice the middle 10 characters

In [34]:
seq = "CAAGTAATGGCAGCCATTAA"


## Formatting strings

There are three ways to do this:
1. The "old" way, using `%`
2. The "new" way, using `format` method
3. The "new-3.6" way, using `f`

Let's see the two "new" ways.

### `format` method

The `format` method works on a string template, with placeholders marked by curly brackets (who said Python doesn't like curly brackets?). The method arguments are parsed to be the values for the placeholders, by order:

In [35]:
message = "Hello {}, would you like {} or {} apples?"
message = message.format("Adam Price", 1, 2)
print(message)

Hello Adam Price, would you like 1 or 2 apples?


We can also specify which argument fills each placeholder by using indices:

In [36]:
message = 'Hello {0}, my name is {1}, if your name is not {0}, please let me know'
message = message.format('Adam', 'Wendy')
print(message)

Hello Adam, my name is Wendy, if your name is not Adam, please let me know


Finally, we can also use named placeholders and specify the values as keyword arguments:

In [37]:
message = 'Hello {guest}, my name is {host}, if your name is not {guest}, please let me know'
message = message.format(guest='Adam', host='Wendy')
print(message)

Hello Adam, my name is Wendy, if your name is not Adam, please let me know


Format automatically handles numbers and other string conversions:

In [38]:
print("Snowhite and the {} dwarfs".format(7))
print("Snowhite and the {} dwarfs".format(7.0))
print("Snowhite and the {} dwarfs".format(7+0j))

Snowhite and the 7 dwarfs
Snowhite and the 7.0 dwarfs
Snowhite and the (7+0j) dwarfs


But we can specify how to convert numbers, if we want; for example, we can specify the number of decimal digits we want:

In [41]:
x = 7.0554332
print("Snowhite and the {:.0f} dwarfs".format(x))
print("Snowhite and the {:.4f} dwarfs".format(x))
print("Snowhite and the {:.6f} dwarfs".format(x))

Snowhite and the 7 dwarfs
Snowhite and the 7.0554 dwarfs
Snowhite and the 7.055433 dwarfs


See all formatting options in the [docs](https://docs.python.org/3.6/library/string.html#format-string-syntax).

Python 3.6 added a new string formatting option using formatted string literals, or [f-strings](https://docs.python.org/3/reference/lexical_analysis.html#f-strings).

In [42]:
name = "John Levin"
age = 31
address = "42 Main st., Sunnyvale, CA"

print(f"His name is {name}, he is {age} and he lives in {address}.")

His name is John Levin, he is 31 and he lives in 42 Main st., Sunnyvale, CA.


Note the `f` before the printed string!
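The number formatting options shown for `format` can also be used inside f-strings, after a colon within the curly brackets. A minimal sketch (the variable names and messages are ours):

```python
x = 7.0554332
age = 31

rounded = f"x with 2 decimal digits: {x:.2f}"
no_decimals = f"x with no decimal digits: {x:.0f}"
# the curly brackets may hold an expression, not just a variable name
next_year = f"Next year he will be {age + 1}"

print(rounded)
print(no_decimals)
print(next_year)
```
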

## Exercise - bottles of beer

Write a template and fill it with values using either `format` or f-strings to produce the following text:

```
3 bottles of beer on the wall, 3 bottles of beer.
Take one down, pass it around, 2 bottles of beer on the wall...
2 bottles of beer on the wall, 2 bottles of beer.
Take one down, pass it around, 1 bottles of beer on the wall...
1 bottles of beer on the wall, 1 bottles of beer.
Take one down, pass it around, 0 bottles of beer on the wall...
```

## String methods

We can change a string to lowercase:

In [43]:
text = text.lower()
print(text)

a musician wakes from a terrible nightmare.


and back to uppercase:

In [44]:
text = text.upper()
print(text)

A MUSICIAN WAKES FROM A TERRIBLE NIGHTMARE.


We can replace characters:

In [45]:
dna = 'AAAATGGGGAGCTGAAGCTGA'
rna = dna.replace("T", "U")
print(rna)

AAAAUGGGGAGCUGAAGCUGA


#### Count
We can count characters. 

For example, let's count the number of histidine (`H`) and proline (`P`) residues in the [amino-acid](http://upload.wikimedia.org/wikipedia/commons/a/a9/Amino_Acids.svg) sequence of the [Human Insulin](http://www.uniprot.org/blast/?about=P01308) hormone:

In [46]:
insulin = 'MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN'
print("# of histidine:", insulin.count('H'))
print("# of proline:", insulin.count('P'))

# of histidine: 2
# of proline: 6


#### Find and Index
We can find a substring within a string.
For example, we can look for the character `D` in the insulin sequence.

In [47]:
pos = insulin.index('D')
print(pos)

19


In [48]:
type(pos)

int

In [49]:
print(insulin[pos])

D


The result is the index (position) of the first `D` found in the sequence.

We can also look for longer substrings, representing motifs. For example, let's find the position of the Insulin [B-chain](http://www.uniprot.org/blast/?about=P01308[25-54]) - a specific subsequence - in the entire protein sequence:

In [50]:
b_chain = "FVNQHLCGSHLVEALYLVCGERGFFYTPKT"
position = insulin.index(b_chain)
print("Position:", position)

Position: 24


In [51]:
print(len(b_chain))

30


In [52]:
found = insulin[position : position + len(b_chain)] # slicing (notice the ':')
print(b_chain == found)
print("Original:", b_chain)
print("Found:   ", found)

True
Original: FVNQHLCGSHLVEALYLVCGERGFFYTPKT
Found:    FVNQHLCGSHLVEALYLVCGERGFFYTPKT
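Strings also have a `find` method, which behaves like `index` when the substring is present. When the substring is absent, `find` returns `-1`, whereas `index` raises a `ValueError`. A small sketch, using only the first 30 characters of the insulin sequence to keep it short (the name `prefix` is ours):

```python
prefix = "MALWMRLLPLLALLALWGPDPAAAFVNQHL"  # first 30 characters of insulin

found_at = prefix.find("D")      # same result as prefix.index("D")
missing = prefix.find("X")       # "X" does not appear, so we get -1

print(found_at)
print(missing)

# -1 is a signal, not a position: always test for it before indexing
if missing == -1:
    print("X was not found")
```
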


#### Split

We can split a string on every occurrence of a separator:

In [53]:
names = "banana,ananas,potato,tomato"
foods = names.split(",")
print(foods)

['banana', 'ananas', 'potato', 'tomato']


What do we get?

In [54]:
type(foods)

list
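The opposite of `split` is the string method `join`, which glues a list of strings together with a separator between the items. A short sketch (the names `rejoined` and `pretty` are ours):

```python
names = "banana,ananas,potato,tomato"
foods = names.split(",")

# join is called on the separator, with the list as its argument
rejoined = ",".join(foods)
print(rejoined == names)

# any string can serve as the separator
pretty = " | ".join(foods)
print(pretty)
```
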

# Lists

Lists are similar to strings in being sequential, only they can contain **any type of data**, not just characters. They are also mutable (we'll get back to that distinction).

Lists could even include mixed variable types.

We define a list just like any other variable, using square brackets `[ ]` around the elements and commas `,` to separate them.

In [66]:
# a list of strings
apes = ["Human", "Gorilla", "Chimpanzee"]
print(apes)

['Human', 'Gorilla', 'Chimpanzee']


![Gorilla](http://upload.wikimedia.org/wikipedia/commons/thumb/c/c0/Western_Lowland_Gorilla_at_Bronx_Zoo_2_cropped.jpg/338px-Western_Lowland_Gorilla_at_Bronx_Zoo_2_cropped.jpg)

In [67]:
# a list of numbers
nums = [7, 13, 2, 400]
print(nums)

[7, 13, 2, 400]


In [68]:
# a mixed list
mixed = [12, 'Mouse', True]
print(mixed)

[12, 'Mouse', True]


You can access list elements just like strings, using indexes (starting from 0):

In [69]:
print(apes[0])
print(apes[-1])

Human
Chimpanzee


Lists are dynamic and mutable - you can append, remove and insert into them. This is done using _list methods_.

We can access and change list elements:

In [70]:
new_apes = apes.copy() # make a copy of the apes list
new_apes[2] = 'Bonobo'
print(new_apes)

['Human', 'Gorilla', 'Bonobo']


This __does NOT__ work with strings though...

In [71]:
print(dna)
dna[5] = 'G'

AAAATGGGGAGCTGAAGCTGA


TypeError: 'str' object does not support item assignment

This is because strings are **immutable** whereas lists are **mutable**. We'll get back to this notion soon.
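Since a string cannot be changed in place, one common way to "edit" it is to build a new string from slices of the old one. A minimal sketch (the name `mutated` and the choice of `"C"` are ours):

```python
dna = "AAAATGGGGAGCTGAAGCTGA"

# everything before index 5, the new character, everything after index 5
mutated = dna[:5] + "C" + dna[6:]

print(dna)       # the original string is unchanged
print(mutated)
```
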

### More list methods

Add element to the end of the list:

In [72]:
apes.append("Macaco")
print(apes)

['Human', 'Gorilla', 'Chimpanzee', 'Macaco']


Insert element at a given index:

In [73]:
apes.insert(2, "Kofiko")
print(apes)

['Human', 'Gorilla', 'Kofiko', 'Chimpanzee', 'Macaco']


Remove element from list:

In [74]:
apes.remove("Human")
print(apes)

['Gorilla', 'Kofiko', 'Chimpanzee', 'Macaco']


To remove a list item by index:

In [75]:
print(apes.pop(3))
print(apes)

Macaco
['Gorilla', 'Kofiko', 'Chimpanzee']


We can concat lists, just like strings:

In [76]:
print(apes + ["Orangutan", "Baboon"])

['Gorilla', 'Kofiko', 'Chimpanzee', 'Orangutan', 'Baboon']


![Orangutan](http://upload.wikimedia.org/wikipedia/commons/thumb/b/be/Orang_Utan%2C_Semenggok_Forest_Reserve%2C_Sarawak%2C_Borneo%2C_Malaysia.JPG/220px-Orang_Utan%2C_Semenggok_Forest_Reserve%2C_Sarawak%2C_Borneo%2C_Malaysia.JPG)

Searching in lists is done using `index` (not `find`):

In [79]:
i = apes.index('Kofiko')
print(i)
print(apes[i])

1
Kofiko


If the value is not found an error is raised. We'll learn how to deal with exceptions in another session. 
For now, we can just find it with a loop, which we will learn in a few minutes.

You can also check if something is in a list (works as well for strings):

In [80]:
if 'Panda' in apes:
    print('Panda is an ape')
else:
    print('Panda is not an ape')

Panda is not an ape
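Combining `in` with `index` gives a simple way to search a list without triggering an error. In this sketch, using `-1` to mean "not found" is our own convention, not something Python does for lists:

```python
apes = ["Gorilla", "Kofiko", "Chimpanzee"]
name = "Panda"

if name in apes:
    position = apes.index(name)   # safe: we know the value is there
else:
    position = -1                 # our own marker for "not found"
print(position)

# `in` on strings checks for a substring, not just a single character
has_motif = "ATG" in "AAAATGGGG"
print(has_motif)
```
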


## Lists of numbers

Suppose we have a list of experimental measurements and we want to do basic statistics: count the number of results, calculate the average, and find the maximum and minimum.

In [81]:
measurements = [33,55,45,87,88,95,34,76,87,56,45,98,87,89,45,67,45,67,76,73,33,87,12,100,77,89,92]

count = len(measurements)
avg = sum(measurements) / len(measurements)
maximum = max(measurements)
minimum = min(measurements)

print(count, "measurements with average", avg, "maximum", maximum, "minimum", minimum)

27 measurements with average 68.07407407407408 maximum 100 minimum 12


We'll see a better way to work with sequences of numbers, though, using NumPy.

## Sorting lists
  
We can sort lists using the built-in `sorted` function, which returns a new sorted list.  
If the list is made __entirely__ of numbers, then sorting is straightforward:

In [82]:
sorted_measurements = sorted(measurements)
print(sorted_measurements)

[12, 33, 33, 34, 45, 45, 45, 45, 55, 56, 67, 67, 73, 76, 76, 77, 87, 87, 87, 87, 88, 89, 89, 92, 95, 98, 100]


A list of strings will be sorted lexicographically (think about the way '<' and '>' work on strings):

In [83]:
sorted_apes = sorted(apes)
print(sorted_apes)

['Chimpanzee', 'Gorilla', 'Kofiko']
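Two details of `sorted` are easy to miss: it leaves the original list untouched, and lexicographic order compares characters by their code, so uppercase letters come before lowercase ones. A short sketch (the `fruits` list is ours, for illustration):

```python
apes = ["Gorilla", "Kofiko", "Chimpanzee"]

ascending = sorted(apes)
descending = sorted(apes, reverse=True)   # reverse=True flips the order

print(ascending)
print(descending)
print(apes)   # still in the original order

# 'C' sorts before 'a' and 'b' because uppercase letters have smaller codes
fruits = sorted(["banana", "apple", "Cherry"])
print(fruits)
```
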


But beware of mixed lists!

In [84]:
mixed = apes + measurements
print(mixed)
print(sorted(mixed))

['Gorilla', 'Kofiko', 'Chimpanzee', 33, 55, 45, 87, 88, 95, 34, 76, 87, 56, 45, 98, 87, 89, 45, 67, 45, 67, 76, 73, 33, 87, 12, 100, 77, 89, 92]


TypeError: '<' not supported between instances of 'int' and 'str'

## List of lists (nested lists)
  
List elements can be of any type, including lists!  
For example:

In [85]:
birds = ['Gallus gallus', 'Corvus corone', 'Passer domesticus']
snakes = ['Ophiophagus hannah', 'Vipera palaestinae', 'Python bivittatus']
animals = [apes, birds, snakes]
print(animals)

[['Gorilla', 'Kofiko', 'Chimpanzee'], ['Gallus gallus', 'Corvus corone', 'Passer domesticus'], ['Ophiophagus hannah', 'Vipera palaestinae', 'Python bivittatus']]


We access lists of lists using double-indexes. For example, to get the 3rd snake:

In [86]:
print(animals[2][2])

Python bivittatus


Note that the elements of the outer list are __lists__ themselves, not strings. For example:

In [87]:
type(animals[1])

list

## Slicing
  
We can slice lists just like we did with strings, to get partial lists.  
For example:

In [88]:
# get the first 10 measurements
print(measurements[:10])
# get the last 3 measurements
print(measurements[-3:])

[33, 55, 45, 87, 88, 95, 34, 76, 87, 56]
[77, 89, 92]


## Exercise

- Use the lists `birds` and `snakes` defined above to create a single list of strings with the animal names. 
- Add the string `Mus musculus` to the list. 
- Remove the `Corvus corone` from the list. 
- Print the 2nd to 5th elements of the resulting list, sorted alphabetically.

# `for` loops

Say we want to print each element of our list:

Python’s `for` loop syntax allows us to iterate over the elements of a `list`, or any `iterable` value. Python's `for` is similar to the `foreach` statement in other languages, rather than `for(i=0; i<n; i++)`:

```py
for loop_variable in iterable:
    statement1
    statement2
    statement3
    ...
```

In [89]:
for ape in apes:
    print(ape, "is an ape")

Gorilla is an ape
Kofiko is an ape
Chimpanzee is an ape


![Python loop](http://2.bp.blogspot.com/-7lXe1_Gou3k/UX92PWche3I/AAAAAAAAAFA/JxD4u8St-9g/s1600/python+loop.jpg)

A more complex loop will go over each ape name and print some stats:

In [90]:
for ape in apes:
    name_length = len(ape)
    first_letter = ape[0]
    print(ape, "is an ape. Its name starts with", first_letter)
    print("Its name has", name_length, "letters")

Gorilla is an ape. Its name starts with G
Its name has 7 letters
Kofiko is an ape. Its name starts with K
Its name has 6 letters
Chimpanzee is an ape. Its name starts with C
Its name has 10 letters


### String loop

Let's go over the Insulin AA sequence and count the number of prolines manually. Reminder: `insulin` is a `str`, not a `list`.

In [91]:
count = 0
for aa in insulin:
    # the next line is equivalent to
    # if aa == "P": count = count + 1
    count += aa == "P"
print("# of prolines:", count)

# of prolines: 6


Do you remember another way of doing this?

Let's count how many measurements (see above) are above the average:

In [92]:
print(measurements)
print(avg)

[33, 55, 45, 87, 88, 95, 34, 76, 87, 56, 45, 98, 87, 89, 45, 67, 45, 67, 76, 73, 33, 87, 12, 100, 77, 89, 92]
68.07407407407408


In [94]:
over = 0
for x in measurements:
    over += x > avg
print(over, "measurements are over the average.")

15 measurements are over the average.
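The `over += x > avg` trick works because Python treats `True` as 1 and `False` as 0 in arithmetic. The sketch below (the names `over_bool` and `over_if` are ours) runs the compact form next to the explicit `if` form to show that they agree:

```python
measurements = [33,55,45,87,88,95,34,76,87,56,45,98,87,89,45,67,45,67,76,73,33,87,12,100,77,89,92]
avg = sum(measurements) / len(measurements)

over_bool = 0
over_if = 0
for x in measurements:
    over_bool += x > avg      # adds 1 when the comparison is True, 0 otherwise
    if x > avg:               # the explicit version of the same idea
        over_if += 1

print(True + True)            # booleans behave like the integers 1 and 0
print(over_bool, over_if)
```
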


## Exercise

Complete the code below to count the _ratio_ of electrically-charged amino acids in the Insulin sequence.

In [95]:
charged = ['R','H','K','D','E']
insulin = 'MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN'

# Your code here

print("Ratio of charged amino acids is:", charged_ratio)

NameError: name 'charged_ratio' is not defined

# `range`

Sometimes we want to loop over consecutive numbers.

This is accomplished using the `range` function.

`range` accepts one, two, or three arguments: the lower limit, the upper limit, and the step size.  
The lower limit can be omitted (the default is zero), and the step can be omitted, too (the default is one).
The upper limit is __not__ included.

In [96]:
for i in range(10): # == range(0, 10, 1)
    print(i)

0
1
2
3
4
5
6
7
8
9


In [97]:
for i in range(10, 20):
    print(i, end=' ')    # print ends with space instead of newline

10 11 12 13 14 15 16 17 18 19 

In [98]:
for i in range(100, 1000, 10):
    print(i, end=' ')

100 110 120 130 140 150 160 170 180 190 200 210 220 230 240 250 260 270 280 290 300 310 320 330 340 350 360 370 380 390 400 410 420 430 440 450 460 470 480 490 500 510 520 530 540 550 560 570 580 590 600 610 620 630 640 650 660 670 680 690 700 710 720 730 740 750 760 770 780 790 800 810 820 830 840 850 860 870 880 890 900 910 920 930 940 950 960 970 980 990 

We can turn the range into a list (more on this in the [iteration session](iteration.ipynb)):

In [101]:
list(range(10))

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
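Converting to a list is also a convenient way to explore how the three arguments interact. The sketch below (variable names are ours) includes a negative step, which counts downwards; in that case the first argument should be the larger one:

```python
evens = list(range(0, 10, 2))       # step of 2, upper limit 10 is excluded
countdown = list(range(5, 0, -1))   # negative step, lower limit 0 is excluded
empty = list(range(5, 5))           # start equals stop: nothing to produce

print(evens)
print(countdown)
print(empty)
```
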

### Example  - primality check

Let's check if the number `n` is a prime number - that is, if its only divisors are 1 and itself:

In [102]:
n = 97 # try other numbers
divider = 1

for k in range(2, n): # why start at 2? can we choose a different limit to range? a different step perhaps?
    if n % k == 0:
        divider = k
if divider != 1:
    print(n, "is divided by", divider)
else:
    print(n, "is a prime number")

97 is a prime number
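One possible answer to the questions in the code comment: if `n` has a divisor other than 1 and itself, it must have one that is no larger than the square root of `n`, so the loop can stop there. The sketch below is one way to write this, assuming `n` is an integer of at least 2 and of modest size; it also uses the `break` statement, not covered above, to leave the loop at the first divisor found:

```python
n = 91  # try other numbers
divider = 1

# candidates above the square root of n are not needed
limit = int(n ** 0.5) + 1
for k in range(2, limit):
    if n % k == 0:
        divider = k
        break  # stop at the first (smallest) divisor

if divider != 1:
    print(n, "is divided by", divider)
else:
    print(n, "is a prime number")
```
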


We can also use `range()` to loop on the indices of a list instead of the elements themselves. This is useful in some cases.

In [103]:
for i in range(len(apes)):
    print(apes[i])

Gorilla
Kofiko
Chimpanzee


## `enumerate`

Another elegant way to iterate over lists is with the `enumerate` function. `enumerate` provides two loop variables for every item in the list -- the index and the element:

In [104]:
cities = ['Tel-Aviv', 'Jerusalem', 'Haifa', 'Rehovot']
for i, city in enumerate(cities):
    print("The", i, "city is", city)

The 0 city is Tel-Aviv
The 1 city is Jerusalem
The 2 city is Haifa
The 3 city is Rehovot
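When counting for human readers it is often nicer to start from 1. `enumerate` accepts a `start` argument for this; it changes only the counter, not the order or content of the items. A short sketch (the `lines` list and the message wording are ours):

```python
cities = ['Tel-Aviv', 'Jerusalem', 'Haifa', 'Rehovot']

lines = []
for i, city in enumerate(cities, start=1):
    # i starts at 1, but city is still taken from the first item onwards
    lines.append(f"City number {i} is {city}")

for line in lines:
    print(line)
```
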


## Exercise

Write a nested `for` loop that creates the identity matrix of size `n`. A matrix is represented by a list of lists. Finally, print the matrix.

In [6]:
n = 4


[[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]

# Tuples

[Tuples](https://docs.python.org/3.5/tutorial/datastructures.html#tuples-and-sequences) are another data structure for sequential data. They, too, can contain any type and mixed types. The main difference between tuples and lists is that tuples are **immutable**.

Tuples are denoted by round brackets `()`:

In [105]:
t = (15, 76, 'a')
print(t)
type(t)

(15, 76, 'a')


tuple

Tuples are commonly packed and unpacked in Python:

In [106]:
a, b, c = t # unpacking
print('a:', a, 'b:', b, 'c:', c)
t = a, b # packing
print(t)

a: 15 b: 76 c: a
(15, 76)


You can also create empty and singleton tuples:

In [107]:
t0 = ()
type(t0)

tuple

In [108]:
t1 = (5,) # notice the comma
type(t1)

tuple
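Packing and unpacking together give a compact way to swap two variables, and converting to a list and back is one way to get a modified copy of an immutable tuple. A minimal sketch (the names `as_list` and `t2` are ours):

```python
# swap: the right-hand side is packed into a tuple, then unpacked
a, b = 1, 2
a, b = b, a
print(a, b)

# tuples cannot be changed in place, so we go through a list
t = (15, 76, 'a')
as_list = list(t)
as_list[0] = 16
t2 = tuple(as_list)

print(t)    # the original tuple is unchanged
print(t2)
```
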

## Colophon
This notebook was written by [Yoav Ram](http://python.yoavram.com) and is part of the [_Python for Engineers_](https://github.com/yoavram/Py4Eng) course.

The notebook was written using [Python](http://python.org/) 3.6.1.
Dependencies listed in [environment.yml](../environment.yml), full versions in [environment_full.yml](../environment_full.yml).

This work is licensed under a CC BY-NC-SA 4.0 International License.

![Python logo](https://www.python.org/static/community_logos/python-logo.png)