### Week 1 - Python Basics

Covers:

- Importing modules
- Reading / writing files
- Passing arguments
- Building functions
- Docstrings & commenting
- Styling & good practices

### Running Python
- via the interpreter => type `python` at the cmd line and write code line by line
- Jupyter notebooks => code blocks that can be run individually, good for one-off analysis and plotting
- `.py` scripts => normal Python scripts, executed as `python my_script.py`

#### Importing modules

Imports always go at the top of the script.

PEP8 recommends grouping them as standard library, 3rd party, then local package imports, with a blank line between groups. Ordering alphabetically within each group keeps them easy to scan:




In [63]:
# system
import math
import os
import sys
from time import sleep

# 3rd party installed
import numpy as np
import pandas as pd

# local packages
import week1_example_script as example

In [64]:
# functions imported from other scripts may be called
example.say_hello()

Hello


### Data types

Python has several built-in data types. The most common are:
- integers (int): whole numbers
- floats (float): floating point numbers
- strings (str): plain text
- lists (list): ordered, mutable collections of other data types
- dictionaries (dict): store other data types as key to value pairs

Less commonly used:
- tuples (tuple): immutable (unchangeable) collections of other data types
- sets (set): unordered collections of unique values



In [4]:
print(type(2021))
print(type(3.14))
print(type("string"))
print(type(["cat", "dog"]))
print(type({"food": "m&ms", "animal": ["cat", "dog"]}))

print(type(("cat", "dog")))
print(type(set(["cat", "dog"])))

<class 'int'>
<class 'float'>
<class 'str'>
<class 'list'>
<class 'dict'>
<class 'tuple'>
<class 'set'>
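A short sketch of how these types behave differently in practice. The variable names and values are made up for illustration.

```python
# lists are ordered and mutable: items can be added or changed
animals = ["cat", "dog"]
animals.append("fish")

# dictionaries look values up by key
favourites = {"food": "m&ms", "animal": ["cat", "dog"]}
favourite_food = favourites["food"]

# tuples are immutable: trying to change an item raises a TypeError
pets = ("cat", "dog")
try:
    pets[0] = "fish"
    tuple_changed = True
except TypeError:
    tuple_changed = False

# sets keep only unique values, so the duplicate "cat" is dropped
unique_animals = set(["cat", "dog", "cat"])

print(animals)
print(favourite_food)
print(tuple_changed)
print(len(unique_animals))
```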


#### Reading & writing files

Use the built-in `open()` function to open files, both for reading and for writing. The mode is `'r'` to read (the default), `'w'` to write and `'a'` to append.


`open(filename, mode)`


In [65]:
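# open file and read the whole contents into a single string
f = open('./example_file.tsv', 'r')
contents = f.read()
f.close()  # files opened without a with statement should be closed
contents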

'hgnc_id\tsymbol\tname\nHGNC:5\tA1BG\talpha-1-B glycoprotein\nHGNC:37133\tA1BG-AS1\tA1BG antisense RNA 1\nHGNC:24086\tA1CF\tAPOBEC1 complementation factor\nHGNC:7\tA2M\talpha-2-macroglobulin\nHGNC:27057\tA2M-AS1\tA2M antisense RNA 1\nHGNC:23336\tA2ML1\talpha-2-macroglobulin like 1\nHGNC:41022\tA2ML1-AS1\tA2ML1 antisense RNA 1\nHGNC:41523\tA2ML1-AS2\tA2ML1 antisense RNA 2\nHGNC:8\tA2MP1\talpha-2-macroglobulin pseudogene 1\nHGNC:30005\tA3GALT2\talpha 1,3-galactosyltransferase 2\nHGNC:18149\tA4GALT\talpha 1,4-galactosyltransferase (P blood group)\nHGNC:17968\tA4GNT\talpha-1,4-N-acetylglucosaminyltransferase\nHGNC:13666\tAAAS\taladin WD repeat nucleoporin\nHGNC:21298\tAACS\tacetoacetyl-CoA synthetase\nHGNC:18226\tAACSP1\tacetoacetyl-CoA synthetase pseudogene 1\nHGNC:17\tAADAC\tarylacetamide deacetylase\n'

In [67]:
# read specific line
f = open('./example_file.tsv', 'r')
f_lines = f.readlines() # read in all lines and store in f_lines variable
f.close()

# read just the 6th line
f_lines[5]

['hgnc_id\tsymbol\tname\n', 'HGNC:5\tA1BG\talpha-1-B glycoprotein\n', 'HGNC:37133\tA1BG-AS1\tA1BG antisense RNA 1\n', 'HGNC:24086\tA1CF\tAPOBEC1 complementation factor\n', 'HGNC:7\tA2M\talpha-2-macroglobulin\n', 'HGNC:27057\tA2M-AS1\tA2M antisense RNA 1\n', 'HGNC:23336\tA2ML1\talpha-2-macroglobulin like 1\n', 'HGNC:41022\tA2ML1-AS1\tA2ML1 antisense RNA 1\n', 'HGNC:41523\tA2ML1-AS2\tA2ML1 antisense RNA 2\n', 'HGNC:8\tA2MP1\talpha-2-macroglobulin pseudogene 1\n', 'HGNC:30005\tA3GALT2\talpha 1,3-galactosyltransferase 2\n', 'HGNC:18149\tA4GALT\talpha 1,4-galactosyltransferase (P blood group)\n', 'HGNC:17968\tA4GNT\talpha-1,4-N-acetylglucosaminyltransferase\n', 'HGNC:13666\tAAAS\taladin WD repeat nucleoporin\n', 'HGNC:21298\tAACS\tacetoacetyl-CoA synthetase\n', 'HGNC:18226\tAACSP1\tacetoacetyl-CoA synthetase pseudogene 1\n', 'HGNC:17\tAADAC\tarylacetamide deacetylase\n']


'HGNC:27057\tA2M-AS1\tA2M antisense RNA 1\n'

In [68]:
# loop over the file line by line, allows doing operations on each line as it is read
f = open('./example_file.tsv', 'r')
for line in f:
    if 'A2M' in line:
        print(line)
f.close()

HGNC:7	A2M	alpha-2-macroglobulin

HGNC:27057	A2M-AS1	A2M antisense RNA 1

HGNC:23336	A2ML1	alpha-2-macroglobulin like 1

HGNC:41022	A2ML1-AS1	A2ML1 antisense RNA 1

HGNC:41523	A2ML1-AS2	A2ML1 antisense RNA 2

HGNC:8	A2MP1	alpha-2-macroglobulin pseudogene 1



In [69]:
# using a with statement closes the file automatically
with open('./example_file.tsv') as f:
    # read file into a Pandas dataframe
    df = pd.read_csv(f, sep='\t')

print(df)


       hgnc_id     symbol                                             name
0       HGNC:5       A1BG                           alpha-1-B glycoprotein
1   HGNC:37133   A1BG-AS1                             A1BG antisense RNA 1
2   HGNC:24086       A1CF                   APOBEC1 complementation factor
3       HGNC:7        A2M                            alpha-2-macroglobulin
4   HGNC:27057    A2M-AS1                              A2M antisense RNA 1
5   HGNC:23336      A2ML1                     alpha-2-macroglobulin like 1
6   HGNC:41022  A2ML1-AS1                            A2ML1 antisense RNA 1
7   HGNC:41523  A2ML1-AS2                            A2ML1 antisense RNA 2
8       HGNC:8      A2MP1               alpha-2-macroglobulin pseudogene 1
9   HGNC:30005    A3GALT2                alpha 1,3-galactosyltransferase 2
10  HGNC:18149     A4GALT  alpha 1,4-galactosyltransferase (P blood group)
11  HGNC:17968      A4GNT        alpha-1,4-N-acetylglucosaminyltransferase
12  HGNC:13666       AAAS

In [70]:
# open a new file, use 'w' mode to write to file
f = open('./output/new_example_file.txt', 'w')
f.write('line1\n')
f.close()

In [71]:
# open file in 'a' mode to append to file, else it will be overwritten
f = open('./output/new_example_file.txt', 'a')
f.write('line2\n')
f.close()
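The same write and append steps can also be done using `with`, so the file is closed even if an error occurs part way through. This is a minimal sketch that writes to a temporary directory (an assumption, so that it runs anywhere) rather than `./output/`.

```python
import os
import tempfile

# write to a temporary directory so the example runs anywhere
output_dir = tempfile.mkdtemp()
output_file = os.path.join(output_dir, "new_example_file.txt")

# 'w' creates the file, or overwrites it if it already exists
with open(output_file, "w") as f:
    f.write("line1\n")

# 'a' appends to the end of the existing file
with open(output_file, "a") as f:
    f.write("line2\n")

# 'r' is the default mode; the file is closed when the block ends
with open(output_file) as f:
    lines = f.readlines()

print(lines)
print(f.closed)
```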

In [72]:
# many packages have their own functions for writing e.g. Pandas

df.to_csv('./output/saved_df.tsv', sep='\t')
df.to_excel('./output/saved_df.xlsx')

#### Passing Arguments

Standard Python module `sys` provides `sys.argv`
    - a list of the cmd line arguments passed to the script, accessed by position. The first item is always the script name, i.e.:

        - `python my_script.py file1 file2 file3` =>
        
            - sys.argv[0] = my_script.py
            - sys.argv[1] = file1
            - sys.argv[2] = file2
            - sys.argv[3] = file3
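A minimal sketch of how the positions work. `sys.argv` is a plain list whose first item is the script name, so the arguments themselves start at index 1. The helper `describe_arguments` and the example list are hypothetical, used here so the code can be run without a real cmd line.

```python
import sys


def describe_arguments(argv):
    """Returns the script name and the list of arguments after it"""
    script_name = argv[0]
    arguments = argv[1:]
    return script_name, arguments


# same shape as sys.argv for: python my_script.py file1 file2 file3
example_argv = ["my_script.py", "file1", "file2", "file3"]
script_name, arguments = describe_arguments(example_argv)
print(script_name)
print(arguments)

# in a real script the arguments come from sys.argv
print(sys.argv[1:])
```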

The `argparse` package can also be used. It builds user-friendly cmd line argument passing, generates help / usage messages and raises errors for improperly passed arguments.

`import argparse`

`parser = argparse.ArgumentParser(description='Script to do a thing')`

`parser.add_argument("--file1", help="arg for file1", required=True)`

`parser.add_argument("--file2", help="arg for file2", required=False)`

`my_args = parser.parse_args()`

Each argument accepts many options, e.g. `type`, `nargs`, `default`, `action` etc.

docs: https://docs.python.org/3/howto/argparse.html
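Put together, the lines above might look like the sketch below. The `--threshold` argument is a hypothetical addition to show `type` and `default`. Passing a list to `parse_args()` is used here only so the example runs without a cmd line. With no list given, `argparse` reads from `sys.argv[1:]`.

```python
import argparse


def parse_args(argv=None):
    """Parses cmd line arguments, reads sys.argv[1:] if argv is None"""
    parser = argparse.ArgumentParser(description="Script to do a thing")
    parser.add_argument("--file1", help="arg for file1", required=True)
    parser.add_argument("--file2", help="arg for file2", required=False)
    # hypothetical extra argument to show the type and default options
    parser.add_argument(
        "--threshold", type=int, default=20, help="example integer arg"
    )
    return parser.parse_args(argv)


my_args = parse_args(["--file1", "a.tsv", "--threshold", "30"])
print(my_args.file1)
print(my_args.file2)
print(my_args.threshold)
```

Arguments are accessed as attributes of the returned object, and optional arguments that are not passed are set to `None` unless a `default` is given.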






#### Building Functions

Functions in Python are defined with the `def` keyword. A function is a block of code that can be passed arguments when it is called and can return results.

In [None]:
def read_file(file_to_open):
    with open(file_to_open) as f:
        df = pd.read_csv(f, sep='\t')

    return df

In [73]:
df = read_file('./example_file.tsv')
print(df)

       hgnc_id     symbol                                             name
0       HGNC:5       A1BG                           alpha-1-B glycoprotein
1   HGNC:37133   A1BG-AS1                             A1BG antisense RNA 1
2   HGNC:24086       A1CF                   APOBEC1 complementation factor
3       HGNC:7        A2M                            alpha-2-macroglobulin
4   HGNC:27057    A2M-AS1                              A2M antisense RNA 1
5   HGNC:23336      A2ML1                     alpha-2-macroglobulin like 1
6   HGNC:41022  A2ML1-AS1                            A2ML1 antisense RNA 1
7   HGNC:41523  A2ML1-AS2                            A2ML1 antisense RNA 2
8       HGNC:8      A2MP1               alpha-2-macroglobulin pseudogene 1
9   HGNC:30005    A3GALT2                alpha 1,3-galactosyltransferase 2
10  HGNC:18149     A4GALT  alpha 1,4-galactosyltransferase (P blood group)
11  HGNC:17968      A4GNT        alpha-1,4-N-acetylglucosaminyltransferase
12  HGNC:13666       AAAS

Key things for a good function:

- does one 'thing'
- sensibly named, often useful to begin with a verb (e.g. read, get, split etc.)
- verbose names are better than abbreviated ones
- used when doing the same thing repeatedly (e.g. a calculation or filtering data etc.)
- has a docstring

In [None]:
def get_sample_id(name):
    if '_' in name:
        sample_id = name.split('_')[0]
        print(f'Sample id is: {sample_id}')
        return sample_id
    else:
        print('Given name does not contain "_"')
        return


In [74]:
name = "sample1_abc_123"
sample_id = get_sample_id(name)


Sample id is: sample1


In [76]:
name = 'somethingElse'
sample_id = get_sample_id(name)

Given name does not contain "_"


#### Docstrings & comments

Use these to make code more understandable and readable: pairs of triple quotes for docstrings and \# for comments.

Code should be written to be readable on its own, but comments help to explain more complex functions and / or the rationale for why something is being done, e.g.:

In [None]:
def read_file(file_to_open):
    """
    Reads in given file to Pandas dataframe

    Args:
        - file_to_open (str): path to file to read in
    
    Returns:    
        - df (pd.DataFrame): DataFrame of data in given file
    """
    with open(file_to_open) as f:
        df = pd.read_csv(f, sep='\t')
 
    return df
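A further sketch showing a docstring and a comment working together: the docstring describes what the function does, while the comment explains why a particular approach was taken. The function `get_file_extension` is hypothetical and written only for this example.

```python
def get_file_extension(file_name):
    """
    Gets the extension of the given file name

    Args:
        - file_name (str): name of file, e.g. 'example_file.tsv'

    Returns:
        - extension (str): text after the last '.', or '' if none
    """
    # rpartition always returns 3 parts, so this does not raise an
    # error for names without a '.', unlike split('.')[1]
    _, separator, extension = file_name.rpartition('.')

    return extension if separator else ''


print(get_file_extension('example_file.tsv'))

# the docstring is stored on the function and is shown by help()
print(get_file_extension.__doc__)
```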

#### Styling & Good Practices

PEP8 style guide: http://www.python.org/dev/peps/pep-0008/

Easiest to use a linter extension in your IDE, which will automatically highlight most issues

Things to follow:

- variables, functions, methods: `lower_case_with_underscores`
- classes: `ClassesInCapWords` (CamelCase with a leading capital)
- consistent use of spaces between lines and functions etc.
- **No single character variables** except in very short blocks, e.g.:

&nbsp;&nbsp;&nbsp;&nbsp;`for i in range(0, 10):`

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;`print(i)`

- better to be more verbose in naming to make it easier to read

- limit line length:
    - comments/docstrings 72 characters, code 79 (doesn't have to be strict if it makes the code less readable)
    - adding rulers in IDE makes this easier to follow
    - can split strings in different ways, preferred is to use parentheses:

&nbsp;&nbsp;&nbsp;&nbsp;`long_string = (`
    
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;`"this is a really really really really really really really "`

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;`"really really really really really long string"`

&nbsp;&nbsp;&nbsp;&nbsp;`)`

- formatting strings using f-strings to add in variables: https://www.python.org/dev/peps/pep-0498/

    ```
    name = 'John'
    formatted_string = f"His name is {name}"
    ```


- add single line doc strings for simple/obvious functions
- multi line with args/returns/outputs for more complex, can also include short usage if appropriate
- prefer code being readable over lots of comments

- using `__name__` global variable:
    - `if __name__ == "__main__":` used to define what to run when the script is called
    - typically all code should be in functions, then the functions called in this section
    - https://docs.python.org/3/library/__main__.html
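A minimal sketch of this layout, with all code in functions and a `main()` function called from the `__name__` check. The function names are illustrative only.

```python
def get_sample_id(name):
    """Returns the text before the first '_' in the given name"""
    return name.split('_')[0]


def main():
    """Entry point: calls the functions that do the work"""
    sample_id = get_sample_id('sample1_abc_123')
    print(f'Sample id is: {sample_id}')
    return sample_id


# True when run as `python my_script.py`, False when imported
if __name__ == "__main__":
    main()
```

Because nothing runs on import, another script can `import` this file and reuse `get_sample_id()` without triggering `main()`.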

- other global variables:
    - generally should **not** use globals => variables defined outside of functions (in global scope)
    - these are bad practice as they can cause conflicts and unexpected side effects, and make it difficult to identify errors
    - the exception is constants, which should be `ALWAYS_CAPITALISED`
    - constants are variables that should never be changed after being defined (Python does not enforce this, it is only a convention), normally defined in a file then imported (e.g. for tokens):

&nbsp;&nbsp;&nbsp;&nbsp;token.py:

&nbsp;&nbsp;&nbsp;&nbsp;```AUTH_TOKEN = 'XXXXXX'```


&nbsp;&nbsp;&nbsp;&nbsp;my_script.py:

&nbsp;&nbsp;&nbsp;&nbsp;```from token import AUTH_TOKEN```
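A short sketch of using a constant instead of a changeable global. The constant is defined once at the top of the script and given to the function as a default argument, so the function never modifies anything in global scope. `MINIMUM_COVERAGE` and `remove_low_coverage` are hypothetical names for illustration.

```python
# constant: defined once at module level and never reassigned
MINIMUM_COVERAGE = 20


def remove_low_coverage(coverage_values, threshold=MINIMUM_COVERAGE):
    """Returns only the values at or above the given threshold"""
    return [value for value in coverage_values if value >= threshold]


coverage = [5, 20, 31, 12]
print(remove_low_coverage(coverage))
print(remove_low_coverage(coverage, threshold=10))
```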




#### Example (relatively) short scripts:

- https://github.com/eastgenomics/eggd_generate_bed/blob/master/resources/home/dnanexus/generate_bed.py
- https://github.com/eastgenomics/dx_job_monitor/blob/main/dx_job_monitor.py
- https://github.com/eastgenomics/hermes/blob/main/hermes.py
- https://github.com/eastgenomics/athena/blob/master/bin/coverage_stats_single.py
- https://github.com/Addy81/BroadWork/blob/master/bin/2.hap.py-analysis.ipynb

#### Useful resources
- common useful functions with worked examples: https://www.w3schools.com/python/python_ref_functions.asp
- StackOverflow, W3schools, learnpython.org, docs.python.org, realpython

Books:
- Learn Python the Hard Way
    - Lots of exercises, teaches Python through doing lots of coding practice
    - https://www.valeacademy.org.uk/documents/download/5e7e0291-cfa4-4f7e-84af-64cf0a01017d.pdf
- Think Python
    - Separate chapters covering core concepts, some examples & mini projects to practice
    - https://greenteapress.com/wp/think-python/
- Automate the Boring Stuff:
    - Lots of projects, good for learning how to put things together
    - https://automatetheboringstuff.com/
    