# Worksheet2.ipynb
# Data structures, Data types and manipulations
   - Strings, lists and tuples
   - Collaborative exercises

## Strings

In [None]:
gene1 = 'BRCA1'

In [None]:
gene1

#### double quoted strings

#### Multiline (para) string

In [None]:
para = "BRCA1 is a very famous gene.\nSo is TP53."

In [None]:
para

#### print function produces a more readable output

In [None]:
print(para)

#### Triple quoted strings

In [None]:
para = """
BRCA1 is a very famous gene.
So is TP53.
"""

In [None]:
print(para)

#### Substring search

In [None]:
'BRCA1' in para

In [None]:
'BRCA2' in para

In [None]:
'BRCA2' not in para
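
Substring search with `in` is case-sensitive. A minimal sketch of what that means in practice; the variable `sentence` is an illustrative name, not part of the worksheet, and it reuses the text from above:

```python
# `sentence` is an illustrative variable holding the same text as above
sentence = "BRCA1 is a very famous gene.\nSo is TP53."

print('brca1' in sentence)           # False: lowercase does not match 'BRCA1'
print('brca1' in sentence.lower())   # True: compare after lowercasing the text
```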

#### String concatenation

In [None]:
'BRCA1 is a ' + 'very famous gene'

#### String repetition

In [None]:
print(para * 3)

#### String indexing

In [None]:
gene1

In [None]:
gene1[0]

In [None]:
gene1[1]

In [None]:
gene1[3]

In [None]:
gene1[0], gene1[1], gene1[2], gene1[3], gene1[4]

### Q: Can you guess the output of `gene1[5]`?
- 1-2 mins

#### String in reverse

In [None]:
gene1[-1], gene1[-2], gene1[-3], gene1[-4], gene1[-5]

### Q: What is the output of `gene1[-6]`?
- 1-2 mins

#### len function

In [None]:
len(gene1)
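
The length ties the two indexing directions together. As a small sketch (the name `n` is introduced here only for illustration), the same character can be reached from either end of the string:

```python
gene1 = 'BRCA1'   # same value as above
n = len(gene1)    # n is 5

# The last character: index n - 1, or -1 counting from the end.
print(gene1[n - 1], gene1[-1])
# The first character: index 0, or -n counting from the end.
print(gene1[0], gene1[-n])
```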

#### String slicing

In [None]:
# characters from position 0 (included) to 3 (excluded)
gene1[0:3], gene1[:3]

In [None]:
# characters from position 2 (included) to the end
gene1[2:]

In [None]:
# characters from position 0 (included) to 3 (excluded), taking every 2nd character
gene1[0:3:2]
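
The start and end of a slice can be left out while keeping the step. A short sketch of two common cases, assuming the same `gene1` string as above:

```python
gene1 = 'BRCA1'

print(gene1[::2])    # every second character of the whole string: 'BC1'
print(gene1[::-1])   # a negative step walks backwards: '1ACRB'
```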

#### Slicing creates a new string

In [None]:
gene_slice = gene1[0:3]
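
A quick check, as a sketch: the slice of a string is itself a string, and taking it leaves `gene1` untouched.

```python
gene1 = 'BRCA1'
gene_slice = gene1[0:3]

print(gene_slice)         # 'BRC'
print(type(gene_slice))   # <class 'str'>
print(gene1)              # 'BRCA1': the original is unchanged
```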

### Q: Prove that strings are immutable
- Try to set position 0 of `gene1` to a different character
- What do you notice?
- 1-2 mins

### Some popular string methods
- https://docs.python.org/3.12/library/stdtypes.html#string-methods

#### str.find() 

In [None]:
para

In [None]:
para.find('BRCA1')

In [None]:
para.find('is')

In [None]:
# caution: returns -1 when the substring is not found
para.find('BRCA2')
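
Why the `-1` return value deserves care: any non-zero number counts as true in Python, while `0` counts as false. A hedged sketch of the pitfall, using the illustrative names `sentence` and `position` (not part of the worksheet):

```python
# `sentence` is an illustrative variable; the text starts directly with 'BRCA1'
sentence = "BRCA1 is a very famous gene.\nSo is TP53."

position = sentence.find('BRCA2')
print(position)                        # -1: 'BRCA2' does not occur
print(bool(position))                  # True: -1 is non-zero, so it counts as true
print(bool(sentence.find('BRCA1')))    # False: the match starts at index 0

# Safer checks: compare against -1, or use `in` when only presence matters.
print(position != -1)                  # False
print('BRCA2' in sentence)             # False
```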

### Q: Repeat the experiment above with `str.index()`
- search for `BRCA2`
- https://docs.python.org/3.12/library/stdtypes.html#string-methods
- Compare the behaviour with `str.find()`. What do you see?
- 1-2 mins

#### str.format()

In [None]:
'{} gene length'.format(gene1)
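
`str.format()` accepts several values and fills the `{}` placeholders in order. The sketch below also shows the equivalent f-string (available from Python 3.6), which is an alternative not otherwise used in this worksheet:

```python
gene1 = 'BRCA1'

# Placeholders are filled in the order the arguments are given.
print('{} has {} characters'.format(gene1, len(gene1)))
# An f-string embeds the expressions directly inside the string.
print(f'{gene1} has {len(gene1)} characters')
```

Both lines print `BRCA1 has 5 characters`.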

#### str.upper(), str.lower()

In [None]:
gene1.upper(), gene1.lower()

In [None]:
gene1.isupper(), gene1.islower()

#### str.join()

In [None]:
'_'.join(['BRCA1', 'BRCA2', 'TP53'])
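
The string that `join()` is called on is used as the separator between the items. A small sketch with a few other separators; `genes` is an illustrative name:

```python
genes = ['BRCA1', 'BRCA2', 'TP53']   # illustrative list of gene names

print(', '.join(genes))   # 'BRCA1, BRCA2, TP53'
print(''.join(genes))     # empty separator: 'BRCA1BRCA2TP53'
print('\t'.join(genes))   # tab-separated, as in many data files
```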

### Sequence types
- list
- tuple
- range
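
The three sequence types share a common set of operations. A minimal sketch, with illustrative variable names and values, showing that `len()`, indexing and `in` behave the same way on each:

```python
gene_list = ['BRCA1', 'BRCA2', 'TP53']      # list
gene_tuple = ('BRCA1', 'Chr17', 'Coding')   # tuple
positions = range(0, 10, 2)                 # range: 0, 2, 4, 6, 8

print(len(gene_list), len(gene_tuple), len(positions))              # 3 3 5
print(gene_list[0], gene_tuple[0], positions[0])                    # BRCA1 BRCA1 0
print('TP53' in gene_list, 'Chr17' in gene_tuple, 4 in positions)   # True True True
```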

## Lists: Compound data types

#### Different ways to construct lists
- simple assignment
- list constructor

In [None]:
# simple assignment
para_list = ['BRCA1', 'is', 'a', 'very', 'famous', 'gene.', 'So', 'is', 'TP53.']

In [None]:
para_list

In [None]:
# it's not practical to type out a list like the one above
# `para.split()` builds the list from a string instead
para.split()

### Q: What is the output of `str.splitlines()`?
- Experiment with the same `str` as above, the one used in `str.split()`
- https://docs.python.org/3.12/library/stdtypes.html#string-methods
- Explain the difference compared to `str.split()`
- What is in position 0 of `str.splitlines()` and why?
- 2-3 mins

In [None]:
# using the list constructor
list(gene1)

### Q: Can you save `my_fav_gene_string` below into a list?
- 2-3 mins

In [None]:
my_fav_gene_string = 'BRCA1,BRCA2,TP53'

#### List indexing

In [None]:
para_list

In [None]:
para_list[0]

#### Lists are mutable, contrary to strings

In [None]:
para_list[0] = 'BRCA2'

In [None]:
para_list

#### List methods
- https://docs.python.org/3/tutorial/datastructures.html

##### append

In [None]:
para_list.append('and')

In [None]:
para_list

In [None]:
para_list.append('KRAS')

In [None]:
para_list
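
`append` always adds exactly one element. For adding several elements at once, lists also offer `extend`, which is not covered elsewhere in this worksheet. A short sketch with an illustrative list `genes`:

```python
genes = ['BRCA1', 'TP53']         # illustrative list

genes.append('KRAS')              # adds a single element to the end
print(genes)                      # ['BRCA1', 'TP53', 'KRAS']

genes.extend(['EGFR', 'MYC'])     # adds each element of another list
print(genes)                      # ['BRCA1', 'TP53', 'KRAS', 'EGFR', 'MYC']
```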

##### sort

In [None]:
gene_lengths = [400, 600, 500, 300]

In [None]:
gene_lengths.sort(reverse=True)

In [None]:
gene_lengths

##### sorted

In [None]:
gene_lengths = [400, 600, 500, 300]
sorted_gene_lengths = sorted(gene_lengths)

In [None]:
sorted_gene_lengths
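
The difference between the two, side by side, as a small sketch: `sorted()` returns a new list, while `list.sort()` reorders the existing list and returns `None`. The names `new_list` and `result` are illustrative.

```python
gene_lengths = [400, 600, 500, 300]

# sorted() returns a new list and leaves the original untouched.
new_list = sorted(gene_lengths)
print(new_list)       # [300, 400, 500, 600]
print(gene_lengths)   # [400, 600, 500, 300]

# list.sort() reorders the list in place and returns None.
result = gene_lengths.sort()
print(result)         # None
print(gene_lengths)   # [300, 400, 500, 600]
```

A common slip is writing `gene_lengths = gene_lengths.sort()`, which replaces the list with `None`.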

## Tuples
- immutable sequences
- Ideal for representing heterogeneous data you don't want altered

In [None]:
gene1 = ('BRCA1', 'Chr17', 'Coding')

In [None]:
gene2 = ('BRCA2', 'Chr13', 'Coding')

In [None]:
gene1[1]
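
Besides indexing, the fields of a tuple can be unpacked into separate variables in one assignment. A minimal sketch; the names `name`, `chromosome` and `gene_type` are chosen here for illustration:

```python
gene1 = ('BRCA1', 'Chr17', 'Coding')

# The number of names on the left must match the number of items in the tuple.
name, chromosome, gene_type = gene1
print(name, chromosome, gene_type)   # BRCA1 Chr17 Coding
```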

### Q: Test the immutability of tuples
- Example: Can you change the gene name or chromosome number, or type of gene (`coding`, `non-coding`) in the tuples?
- What error is raised?
- 1-2 mins

# Collaborative exercises

## Exercise 1

- Make up an amino acid sequence for your favorite gene with your team. Max 15 characters. Look at the codon table for allowed characters if unsure.
- Store the sequence in a string variable and calculate the length
- Store the sequence in a list and re-calculate the length
- Ask one of your teammates to select a position in the sequence and a mutation. Introduce it into the sequence and print the result. Can you do this in the string? Why not? How about in the list?
- Now ask your teammate to introduce mutations at multiple positions, e.g. think of this as a strange translocation affecting positions 4-8 of the protein. Can you introduce this and print the result? Can you do this in the string? Why, or why not? How about in the list?

## Exercise 2

- Store the original amino acid sequence you started with in `Exercise 1` in a list
- Create a copy of this list to store a protein copy
- Introduce some mutations only in the copy, not in the original protein (gene duplication and divergence model). Decide which positions you want to change in the copy with your colleague
- Print out the original protein sequence list, and the sequence of the copy

## Exercise 3
- Can you take the `gene1` tuple created in class and make it mutable, i.e. replace `BRCA1` with a different gene on `Chr17`?
- Hint: change the data structure from a tuple to something else

## Exercise 4
- Create a list of lists: one outer list containing two sublists
- each sublist should have three gene names in it
- Decide gene names with your team
- Can you access the first sublist and the first element of the first sublist?