# Introduction to Python

## What is python?

## Using Python

### The python shell
The python shell, similar to the bash shell, allows us to use python interactively. You enter one command at a time and the result is returned immediately. The python shell can be started directly from the terminal with the command `python`.

This is a very convenient tool for testing python commands and for getting started with the language, but it is not very useful for creating actual programs or scripts. To make an actual program, you will need to put your code in a text file and save it with a `.py` extension.

### Python scripts
To create a python script, we can click on File > New > Text File in the top menu above. This will open the Jupyter Lab text editor in which we are going to create our very first python program. ["Hello World" demonstration]
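
As a minimal sketch of what such a first script might contain (the file name `hello.py` and the variable name `message` are illustrative assumptions, not part of the original demonstration):

```python
# hello.py (hypothetical file name): a first python program
message = "Hello World"   # store the text in a variable
print(message)            # display it on the screen
```

Saved as `hello.py`, a script like this could be run from the terminal with `python hello.py`.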

### IPython Notebooks
Another way to use python is within a notebook, as we are doing here.

## The Basics
If you ever get lost or need more information about how a function or object works in python, you can use the `help()` function. 

In [None]:
help()

In [None]:
help('+')

In [None]:
help(open)

### Variables

### Data Types
In python, as in most programming languages, 'data type' is an important concept. When you store data in variables, you need to be aware of _how_ that data is being stored. This is referred to as its data type. Different data types can do different things and interact with different 'operators' (e.g. `+`, `-`, `*`, `&`, etc.) in distinct ways. For example, the `+` operator will sum the values of two variables that contain numbers:

In [None]:
a = 2
b = 3
a+b

but it will concatenate two variables that contain strings:

In [None]:
a = '2'
b = '3'
a+b

The following data types are 'built-in' by default in python:

* Text Type: `str`
* Numeric Types: `int`, `float`
* Sequence Types: `list`, `tuple`, `range`
* Mapping Type: `dict`
* Set Types: `set`
* Boolean Type: `bool`


You can always check what 'type' a variable is by using the `type()` function:

In [None]:
x = 5
type(x)

Python automatically sets the data type of a variable based on the value you assign to it:

In [None]:
x = "myString" # str
x = 10         # int
x = 10.0       # float
x = ['pizza','apple','hotdog']  # list
x = ('pizza','apple','hotdog')  # tuple
x = {'pizza','apple','hotdog'}  # set
x = range(10)  # range
x = {"name": "Loyal", "age": 41, "department" : "neuroscience"}  # dict
x = True       # bool


You can also explicitly set the data type that you want by using standard functions. This is known as 'casting': forcing or changing a variable to be a specific type.

In [None]:
x = str(5)
x

In [None]:
x = float(4)
x

### Number types

`int` (integer) numbers are whole numbers (positive or negative) without decimals.  There is no limit to the size of an integer in python.

In [None]:
a = 3
b = 29398752398757573292375982737575
c = -42

The integer type should be used for numbers that will always be whole for their operations. Think of counting the number of bases in a DNA sequence, or counting the number of times something happens. You will almost never have a 'partial' quantity for these values. But be careful of how `int` types are handled in certain operations. For example, division with `/` always returns a `float`, even when both operands are `int` and the result is a whole number:

In [None]:
a = 5
b = 10
c = b/a   # dividing two ints with '/' returns a float
print(c)
type(c)

The `float` type is for 'floating point numbers'. These can be positive or negative and can contain one or more decimal places. The `float` type can be specified by including a decimal point in a number when it is assigned to a variable:

In [None]:
a = 5.0
b = 1.467283
c = -14.22

The `float` type can also be written in scientific notation by adding an `e` to indicate the power of 10:

In [None]:
x = 2e4
print(x)

y = -63.5e100
print(y)

You can convert from one type to another with the `int()` and `float()` functions.
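
A short sketch of these conversions (the variable names are illustrative). Note that `int()` truncates toward zero rather than rounding, and that both functions can also parse numeric strings:

```python
whole = int(3.9)        # drops the fractional part (does not round)
negative = int(-3.9)    # truncates toward zero
as_float = float(7)     # an int becomes a float
from_text = int("42")   # a string of digits becomes an int
parsed = float("2.5")   # a numeric string becomes a float

print(whole, negative, as_float, from_text, parsed)
```

Passing a string that does not look like a number, such as `int("abc")`, raises a `ValueError`.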

### Strings
Strings or string literals are generally denoted by enclosing them in quotes; either single (`'`) or double (`"`) quotes:

In [None]:
print("Welcome to BCMB!")

You can assign a string to a variable in the same way.

In [None]:
a = "Welcome to BCMB!"
print(a)

Sometimes strings will span multiple lines.  If this is the case, it is convention to enclose them in triple quotes:

In [None]:
b = """Welcome to BCMB!
You've chosen the most exciting graduate program at JHU."""

print(b)

### Strings, substrings, and slicing
Strings in python are stored as ordered sequences of characters, which basically means that "Hello" is stored in something akin to a list: `["H","e","l","l","o"]`. So we can access and extract different parts of a string (or any other sequence) by 'slicing' with square brackets. Unlike lists, strings are immutable: slicing returns a new string rather than editing the original. It's important to recognize when doing this that python is a '0-indexed' language, which means that any time you count positions in python, you always start with 0. Let's see how this works in practice:

In [None]:
# Create a string literal and store in a variable
a = "Genomics is fun"

# To retrieve the second character in this string we index it with square brackets
print(a[1])

# To retrieve a range of positions, we will separate the start and end using ':'
print(a[2:6])

# You can use negative values to start the slice from the end of the string
print(a[-6:-1])

# You can leave either side of the ':' blank to represent the beginning or end of a string respectively
print(a[:8])

print(a[9:])

Slicing is a fundamental concept in python for accessing any set of elements in a sequence (not just strings). You will use this often.

Strings have several built-in functions and methods available for some common queries and manipulations:

In [None]:
# Get the length of the string
print(len(a))

# Convert the string to lower case
print(a.lower())

# Convert the string to upper case
print(a.upper())

# You can strip off excess white space
b = "My favorite gene is Pantr2.   "
print(b.strip())

# You can replace portions of the string
print(a.replace("fun","hard"))

# or split a string into substrings based on a 'separator'
print(a.split(" "))


You can also test for instances of a substring within a string (note the use of a new syntax here that we will explore later):

In [None]:
a = "She sells seashells by the seashore."
x = "sea" in a
print(x)

y = "sho" in a
print(y)

z = "sho" not in a
print(z)
print(type(z))

You can combine strings directly (concatenate) using the `+` operator as we discussed above:

In [None]:
a = "Toad"
b = "the"
c = "wet"
d = "Sprocket"

print(a + b + c + d)

Whoops...we may want to format this a bit better to add a separator. Fortunately, python provides a convenient way to format strings called 'f-strings'. Simply prefix a new string with `f` and enclose the variables within it in curly braces `{}`.

In [None]:
text = f"The best band ever is {a} {b} {c} {d}!"
print(text)

This is a _very_ useful tool for formatting output strings containing useful pieces of information in your code/scripts:

In [None]:
gc = 56
name = 'Pantr2'
chromosome = 'chr4'

summary = f"The {name} gene is located on chromosome '{chromosome}' and has a GC content of {gc}%"

print(summary)

There are a number of methods for manipulating/searching/testing strings that are built into python. Feel free to [check them out](https://docs.python.org/3/library/stdtypes.html#string-methods) and test them on your own.
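
As a starting point, here is a small sketch using a few of these built-in string methods. The sequence and variable names are made up for illustration:

```python
seq = "ATGGCCATTGTAATGGGCCGC"   # a made-up DNA sequence

starts_with_atg = seq.startswith("ATG")   # does the string begin with 'ATG'?
first_gcc = seq.find("GCC")               # index of the first 'GCC' (-1 if absent)
n_atg = seq.count("ATG")                  # number of non-overlapping 'ATG' matches
joined = "-".join(["chr4", "1589182", "+"])  # glue a list of strings together

print(starts_with_atg, first_gcc, n_atg, joined)
```

Remember that positions reported by `find()` are 0-indexed, just like slices.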

### Boolean type
The Boolean type is used predominantly for logical tests, and a `bool` can only have one of two values: `True` or `False` (case sensitive). There are often times when you need to test a value or an expression. In python the value returned from these tests is a `bool`:

In [None]:
print(14 > 3)
print(14 == 3)
print(14 < 3)

You can use `bool` values and variables to help control the flow of your code. For example, we could print a message based on whether or not a condition is `True`.

In [None]:
a = 50
b = 100

if a < b:
    print("a is the smaller value")
else:
    print("b is the smaller value")

## Collection Data types
Collection data types store groups of `items`. Items can be named variables, or objects of other data types, including other collections (nested). There are four main collection data types, each with their own properties/assets:
    
1. A *List* is an _ordered_ collection and is _mutable_. It can also hold duplicate items.
2. A *Tuple* is an _ordered_ collection and is _immutable_. It also allows for duplicate items.
3. A *Set* is an _unordered_ collection and _unindexed_. It does _not_ allow for duplicate items.
4. A *Dictionary* is a collection of key:value pairs which is _mutable_ and _indexed_ by key. It does not allow for duplicate keys. (Since python 3.7, dictionaries preserve the order in which items were inserted.)
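
A brief sketch contrasting the four collection types listed above. The variable names and contents are illustrative only:

```python
foods_list = ['pizza', 'apple', 'pizza']    # list: ordered, mutable, duplicates kept
foods_tuple = ('pizza', 'apple', 'pizza')   # tuple: ordered, immutable, duplicates kept
foods_set = {'pizza', 'apple', 'pizza'}     # set: duplicates are collapsed
record = {'name': 'Sox2', 'strand': '-'}    # dict: values looked up by unique key

foods_list[0] = 'hotdog'   # allowed, because lists are mutable
record['strand'] = '+'     # allowed, because dictionaries are mutable

print(len(foods_list), len(foods_tuple), len(foods_set))
print(foods_tuple.count('pizza'))
```

The set ends up with only two items because the repeated `'pizza'` is stored once.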


### Lists
A list is instantiated using square brackets `[]`:

In [None]:
genes = ['Gapdh','Mef2c','Pax6','Cxcl1','Msi1']

You can access list items by referring to the index number (remember that python is zero-indexed).

In [None]:
print(genes[1])

print(genes[-2]) # negative indexing to select items from the end of the list. (-1 refers to the last item)

print(genes[1:3]) # you select a range using ":" (returns a new list)

Since `list` items are mutable, you can change any specific item by referring to its index number:

In [None]:
genes[2] = 'Sox10'
print(genes)

`list` collections (like all collection types) are _iterable_, meaning you can loop through their elements.

In [None]:
for x in genes:
    print(x.upper())

To add items to a list (at the end) you can use the `append()` method:

In [None]:
genes.append('Foxp1')

print(genes)

Conversely, you can remove items using several methods:

In [None]:
genes.remove('Mef2c') # removes a 'specific' item
print(genes)

genes.pop() # removes a specified index position or the last item in the list if index is not specified.
print(genes)

genes.clear() # empties the entire list
print(genes)

To join two lists, you can use the `+` operator:

In [None]:
fruit = ['apple', 'banana','pear']
veg = ['carrot','celery','potato']

food = fruit + veg

print(food)

### Tuples

Tuples operate very similarly to lists, but once instantiated, the items in a tuple cannot be changed. Tuples are created with round brackets `()`. This is a useful data type to hold values associated with a single 'record'. For example, if you wanted to record specific information about a single gene like its name, chromosome, and start position:

In [None]:
a = ('Sox2','chr4',1589182)
b = ('Xist','chrX',23564335)

You access individual elements of a tuple in the same way as a list:

In [None]:
print(a[0])

print(b[1])

You can also loop through a tuple since it is iterable.

In [None]:
for val in a:
    print(val)

Once you create a tuple, you cannot change the values, and you cannot add items to it.
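
To see this immutability in action, here is a small sketch. It uses `try`/`except`, a construct not covered in this section, only to catch the error so the example keeps running. The names `record`, `changed`, and `extended` are illustrative:

```python
record = ('Sox2', 'chr4', 1589182)

try:
    record[1] = 'chr3'      # attempting to change an item in a tuple...
    changed = True
except TypeError as err:    # ...raises a TypeError
    changed = False
    print(f"Cannot modify a tuple: {err}")

# "Adding" to a tuple actually builds a brand-new tuple; the original is untouched
extended = record + ('+',)
print(record)
print(extended)
```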

### Dictionaries
Dictionaries are 'indexed' collections, meaning the _values_ within the collection each must have a unique _key_. You can create a dictionary using curly braces `{}`.

In [None]:
myGene = {
    'name': 'Sox2',
    'entrezID': 6657,
    'Ensembl': 'ENSG00000181449',
    'chromosome': 'chr3',
    'start': 181711925,
    'end': 181714436,
    'strand': '-'
}
print(myGene)

To access elements of a dictionary, you use square brackets (`[]`) as with other collections, but you must specify a 'key' rather than a position.

In [None]:
print(myGene['chromosome'])

Dictionaries are mutable, so you can change/assign values in the same way:

In [None]:
myGene['strand'] = '+'

print(myGene['strand'])

When you loop through a dictionary, the values returned are the keys:

In [None]:
for key in myGene:
    print(key)

You can also iterate over the values, or key:value pairs

In [None]:
for val in myGene.values():
    print(val)

In [None]:
for k, v in myGene.items():
    print(f'key: {k}  value:{v}')

Sometimes it may be useful to check if a key exists in a dictionary.

In [None]:
lookup = "Ensembl"
if lookup in myGene:
    print(f"Found {lookup} in myGene dictionary keys.")
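
A related option, sketched here as an aside, is the dictionary `get()` method, which returns a fallback value instead of raising a `KeyError` when the key is missing. The shortened dictionary, the `'biotype'` key, and the `'unknown'` default are illustrative assumptions:

```python
myGene = {'name': 'Sox2', 'chromosome': 'chr3', 'strand': '-'}

strand = myGene.get('strand', 'unknown')     # key exists: its value is returned
biotype = myGene.get('biotype', 'unknown')   # key is missing: the default is returned

print(strand, biotype)
```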

## Control Flow
An important aspect of programming is controlling the order in which the program executes commands/functions/operations/etc. This is how programs and scripts make decisions and execute different things depending on different situations. The structure of most control flow elements in python is fairly similar: a statement that evaluates certain conditions, followed by a colon (`:`). The subsequent *code block* is below this statement and always indented. The block ends when the indentation ends. There are three main types of control flow statements in python: `if`, `for`, and `while`. Each operates a bit differently.

### If...Else
The `if` statement first evaluates whether a given expression is `True`. If so, the associated code block is executed. We have seen a few examples above, but let's make sure we understand how it's organized.

In [None]:
a = 5
if a < 10:
    print(f'{a} is less than 10.')
    
if a >= 10:
    print(f'{a} is greater than or equal to 10.')

Here we've constructed two `if` statements to test the variable `a`. Notice that the code block under the first statement (which evaluates to `True`) is executed but the second (`False`) is not. We can also use the `else` and `elif` (read: 'else if') statements to further control how python responds to our conditional tests.

In [None]:
a = 50
if a < 10:
    print(f'{a} is less than 10.')
elif a == 10:
    print(f'{a} is equal to 10.')
else:
    print(f'{a} is greater than 10.')

There are a few things to note here. First, with the `elif` statement we are providing _another_ conditional test for the variable `a`. If the first `if` evaluates to `False`, then the next `elif` will be tested. If all of the `if` and `elif` conditions evaluate to `False`, then the code block under the `else` statement is executed. In this way, `else` acts as a 'catch all' if none of the other conditions are `True`.

*Only one* of the code blocks above will be executed: the one belonging to the first condition that evaluates to `True`. Once this happens, python steps out of the `if...else` statement and continues on with the rest of the program.

The second point to make from the above is the use of the double `=` in the `elif`. This is the equality 'comparison operator'. A single `=` is the 'assignment operator' that we have been using to assign values to variables. If we want to test that two values are in fact equal, then we _must_ use `==`.

### Operators
The construction of boolean logical tests is an important part of how you control the flow of your python program. Often we want to test whether a variable has a certain value, or even exists at all, or to perform some mathematical transformation on a value. To do this, we use different 'operators'. Operators are the constructs which can manipulate the value of individual items (or operands). You are familiar with many of these, for example 5 + 3 = 8. In this expression 5 and 3 are operands and '+' is the operator. Python has several types of operators; here we will distinguish between a few of them.

#### Arithmetic operators
You should already be familiar with most of these. Assume that a = 10 and b = 20:

* `+` Addition: Adds values on either side of the operator. `a + b = 30`
* `-` Subtraction: Subtracts right hand operand from left hand operand. `a - b = -10`
* `*` Multiplication: Multiplies values on either side of the operator. `a * b = 200`
* `/` Division: Divides left hand operand by right hand operand (always returns a `float`). `b / a = 2.0`
* `%` Modulus: Divides left hand operand by right hand operand and returns the remainder. `b % a = 0`
* `**` Exponent: Performs exponential (power) calculation on the operands. `a ** b` = 10 to the power 20
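
The operators above can be tried directly. This sketch also includes `//` (floor division), which is not in the list above but is a standard python operator that returns a whole-number quotient. The result variable names are illustrative:

```python
a = 10
b = 20

total = a + b             # 30
difference = a - b        # -10
product = a * b           # 200
quotient = b / a          # 2.0 (true division always returns a float)
floor_quotient = b // a   # 2   (floor division keeps ints as ints)
remainder = b % a         # 0
power = a ** 2            # 100

print(total, difference, product, quotient, floor_quotient, remainder, power)
```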


#### Comparison operators
These are the operators that allow you to compare two items/variables. Each of these operators returns a `bool` value (`True` or `False`).

In [None]:
a = 10
b = 20

print(a == b) # evaluates whether two operands are equal
print(a != b) # evaluates whether two operands are _not_ equal
print(a < b) # less than
print(a > b) # greater than
print(a <= b) # less than or equal to
print(a >= b) # greater than or equal to

#### Assignment operators
These _assign_ values to a variable:

In [None]:
a = 10 # assigns the value 10 to the variable a
a += 5 # Adds the value on the right to the value in a and then assigns the new value to a
print(a)

a -= 10 # subtracts the right value from the value in a and then assigns the new value to a
print(a)

a *= 5 # multiplies the right value with the value of a and then assigns to a
print(a)

a /= 5 # divides the value of a by the right value and then assigns to a (the result is a float)
print(a)

#### Logical and Membership operators
Logical operators let you combine or negate boolean expressions.

In [None]:
a = True
b = True
c = False

# and: if both values are True then the condition is True
print(a and b)
print(a and c)

# or: if _either_ value is True then the condition is True
print(a or c)

# not: reverses the logical state of the condition
print(not (a or c))


Membership operators test whether a value (operand) is a member of a collection (as in a string, list or tuple).

In [None]:
fruits = ['apple','banana','pineapple']
a = "orange"

print(a in fruits)

print(a not in fruits)

print("o" in a)


These are the basic operators that you will need to know to create conditional expressions to guide your control flow.

### Looping/Iteration

#### While loops
The `while` statement executes its code block as long as a conditional expression remains `True`. The loop repeats until the expression is no longer `True`.

In [None]:
i = 1

while i < 10:
    print(i)
    i += 1


#### For loops
The `for..in` statement also performs loops. In this case however, the loop _iterates_ over a sequence or collection, and in doing so it assigns each element of the collection to a specific variable. Any collection that is _iterable_ can be used to construct a for loop. In the example below `range(0,10)` creates an iterable collection of numbers from 0-9.  Each instance of the loop places one of these values (in order) into the newly created variable `i`, and then executes the code block associated with this loop:

In [None]:
for i in range(0,10):
    print(i)

An optional `else` statement can be used to execute a code block after the iterations have completed (it is skipped if the loop is exited early with `break`).

In [30]:
genes = ['Gapdh','Mef2c','Pax6','Cxcl1','Msi1']

for gene in genes:
    print(gene)
else:
    print("No more genes!")

Gapdh
Mef2c
Pax6
Cxcl1
Msi1
No more genes!


_*Homework*_: Look into how the `break` and `continue` statements can be used in conjunction with the above statements to further control the flow of a program.

## Functions
Many times when writing programs, you will find yourself doing the same task(s) over and over again. Functions allow you to create reusable pieces of your programs so you can run a specified block of code anywhere in your program, as many times as you need, with any modifications or variables that you might need. As an example, let's start by figuring out what the GC% is for a given nucleic acid string. To do this, we will need to learn the length of the oligonucleotide sequence, as well as the number of 'G' and 'C' bases. Let's start by figuring out a way to do this for one DNA sequence.

In [None]:
dna = 'ATTAGCGTATTCGAGCTATCGATCTAGCGAGCTAGCTATCAGCGACGTACG'

dnaLength = len(dna)
print(f'Length of dna: {dnaLength}')

nG = dna.count('G') # count() returns the number of times a substring occurs
print(f'Number of Gs: {nG}')

nC = dna.count('C')
print(f'Number of Cs: {nC}')

nGC = nG + nC

GC_content = (nGC/dnaLength)*100
print(GC_content)

While that's not terribly tedious, what if we needed to do this for 10,000 different DNA sequences? We will definitely need a way to 'functionalize' this process. So let's create a function to do all of this for us. We start by defining the function with the `def` statement to give the function a name, followed by a pair of parentheses `()` where we can name a few 'parameters' (another name for variables passed to a function) that we will use within our function. This is then followed by the code block to execute when we 'call' the function. At the end of the function, we specify what value we want to return with the `return` statement.

In [None]:
def getGC(seq):
    dnaLength = len(seq)
    nG = seq.count('G')
    nC = seq.count('C')
    nGC = nG + nC
    GC_content = (nGC/dnaLength)*100
    return GC_content


Once we've created our function, we can then call it as many times as we want with different values passed to the `seq` parameter as needed. **Note the terminology difference: the names given in the function definition are called parameters, but the values you supply in the function call are called arguments.**

In [None]:
gc = getGC(dna)
print(gc)

Now try to use your function within a `for` loop to iterate over a `list` of different DNA sequences.

In [None]:

myDNASet = [
            'ACTGATGCTAGCTGACTGATCTAGCTGA',
            'TGCATTTTCGAGCTATCGAGCATTCTACGTACT',
            'CACTATCTACGGATCGGAGCGGATTCGTAGCTATGC',
            'GTATCGGATCTAGCGGCGGCATTATCG'
           ] # this is a list of strings, each containing a DNA sequence.

for ...
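
One possible way to complete this exercise is sketched below. It repeats a compact version of `getGC()` so that it runs on its own; the `results` list and the output formatting are illustrative choices, not the only correct answer:

```python
def getGC(seq):
    # Percentage of bases in seq that are G or C
    nGC = seq.count('G') + seq.count('C')
    return (nGC / len(seq)) * 100

myDNASet = [
    'ACTGATGCTAGCTGACTGATCTAGCTGA',
    'TGCATTTTCGAGCTATCGAGCATTCTACGTACT',
    'CACTATCTACGGATCGGAGCGGATTCGTAGCTATGC',
    'GTATCGGATCTAGCGGCGGCATTATCG',
]

results = []
for seq in myDNASet:
    gc = getGC(seq)        # call the function once per sequence
    results.append(gc)     # keep the value for later use
    print(f'{seq}: {gc:.1f}% GC')
```

Collecting the values in a list, rather than only printing them, makes them available for later steps such as averaging or filtering.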


When you assign variables within a function definition, they are isolated from any variables you may have created outside the definition. All variables created within a function definition are considered 'local'. You cannot access them from outside the function, and they don't override variables of the same name outside it either. This is called the 'scope' of a variable. In python, scope is set by the function (or module) in which a variable is assigned; indented blocks such as `if`, `for`, and `while` do not create a new scope of their own. There are ways to change this, but for now we just need to realize that there are in fact different scopes.
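
A small sketch of these scoping rules. The function `report()` and all variable names are hypothetical, and `try`/`except` is used only to catch the expected error:

```python
gc = 56                      # a variable defined outside any function

def report():
    gc = 12                  # a *local* gc; the outer gc is not modified
    local_only = 'visible only inside report()'
    return gc

inner = report()
print(inner)                 # the value of the local variable
print(gc)                    # the outer variable is unchanged

try:
    print(local_only)        # not defined outside the function
    leaked = True
except NameError:
    leaked = False

# Loops do not create a new scope: 'last' is still available afterwards
for i in range(3):
    last = i
print(last)
```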

When defining a function, you may often want to set default values for your parameters, making those parameters optional. You can also use 'keyword' arguments to specify them by name when you call a function like so:

In [None]:
def checkDNA(seq, query='AGC'): # checks to see if the DNA sequence has a specific substring
    return query in seq

print(checkDNA(dna))

print(checkDNA(dna,query='TGCA'))
    

Now let's try to create a new function that takes a DNA sequence as an argument and calculates the melting temperature (the temperature at which two DNA strands will separate from each other). This property is a function of the DNA sequence itself. For sequences less than 14 nucleotides in length the formula is: `Tm= (A+T) * 2 + (G+C) * 4`. For sequences longer than 13 nucleotides in length, there is a different equation: `Tm= 64.9 +41*(G+C-16.4)/(A+T+G+C)`. In each equation, `A, T, G, and C` correspond to the number of each nucleotide in the DNA string. Can you define a single function that can handle a DNA sequence of any length?

In [None]:
# define and use your function below...
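
One possible solution, offered as a sketch rather than the definitive answer. The function name `melting_temp` is hypothetical, and converting the input to upper case is an added assumption so that lower-case sequences are counted too:

```python
def melting_temp(seq):
    """Estimate Tm using the two formulas given above."""
    seq = seq.upper()  # assumption: treat 'atgc' the same as 'ATGC'
    nA, nT, nG, nC = (seq.count(base) for base in 'ATGC')

    if len(seq) < 14:
        # Short sequences: Tm = (A+T)*2 + (G+C)*4
        return (nA + nT) * 2 + (nG + nC) * 4
    # Longer sequences: Tm = 64.9 + 41*(G+C-16.4)/(A+T+G+C)
    return 64.9 + 41 * (nG + nC - 16.4) / (nA + nT + nG + nC)

short_tm = melting_temp('ATGC')              # 4 nt, uses the first formula
long_tm = melting_temp('ATGCATGCATGCATGC')   # 16 nt, uses the second formula
print(short_tm, long_tm)
```

The `if` test on `len(seq)` is what lets a single function choose the right formula for any sequence length.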






## Importing modules
One of the more useful features of python, as with many other languages, is that you can add features, tools, data, etc. to extend the functionality of python using _modules_. Python has a very large and diverse user base that has already developed many thousands of modules to perform broadly useful tasks, or very specialized tasks for different needs. You can also develop your own modules to split projects/workflows into manageable pieces for easier maintenance and reusability.

A module is nothing more than python code. Within a module, you can define classes (objects), functions, and variables.  Any existing python file can be referenced as a module and the elements within can then be used. For example, if you have a python file named `sequencing.py` you can import the file/module with the name `sequencing`.

To use a module in your code you must first load it. To do this, we use the `import` statement:

In [None]:
import math

This tells the python interpreter to load the `math` module and make all of its functions and elements available in the current session under the name `math`. `math` is a 'standard module' that is included with python. To use a specific function or variable defined in `math` you can call it using the dot (`.`) operator along with the module name:

In [None]:
a = 25
math.sqrt(a)

In [None]:
math.cos(a)

In [None]:
math.factorial(a)

It is important to be aware of modules as they can often save you from having to re-write or re-invent code that others have already solved. It is arguably the basis for the relevance of python as a programming language in the sciences as well!

There are a few syntactically different ways to import a module. You've already seen the direct import method above, but you can also import only the functions you need from a module by using the `from .. import` syntax:

In [None]:
from math import log10, log
log10(a)

Here we've imported just two of the functions in the `math` module. Notice that this time, you don't have to use `math.` before calling these functions. You can access them directly without using the module name.

You can rename the module for brevity or any other reason by using `as` with your import statement.

In [None]:
import datetime as dt

today = dt.date.today()
print("Today is", today)

Here, we imported the `datetime` module as `dt`. Notice that the `datetime` module (now available as `dt`) defines a class called `date`. We then used the `date.today()` method to get the current local date, which is printed with some accessory text using the `print()` function.

## Working with files

### Reading/opening files
Reading and writing to files in python is achieved through the `open()` function. When you 'open' a file, you are creating an object that communicates with the file on disk. We need to specify how we would like to interact with the file: do we want to read the file (`'r'`) or write to the file (`'w'`)? This is the 'mode' that we will need to specify when we make the call to `open()`.

In [None]:
file = open('Foxp1.gbk',mode = 'r')

# iterate through all lines of a file
for line in file:
    print(line)

file.close() # close the file when we are done with it

In [None]:
# Read only the first n lines of a file
nLines = 5
i = 0

# Need to re-open the file since we've already iterated through it completely
file = open('Foxp1.gbk',mode = 'r')

while i < nLines:
    print(next(file))
    i += 1

file.close()

### Fetching files/data from the internet

In [None]:
# Let's deconstruct what's happening here.

import urllib.request # 'import urllib' alone does not reliably load the 'request' submodule
import gzip

url = "https://hgdownload.soe.ucsc.edu/goldenPath/sacCer3/chromosomes/chrIII.fa.gz"
urllib.request.urlretrieve(url, "chrIII.fa.gz")

chrIII = gzip.open('chrIII.fa.gz',mode = 'rb') # note the additional use of 'b' here meaning we are going to 'read' a 'binary' (zipped) file.

print(next(chrIII))

numLines = 10

head = [next(chrIII) for x in range(numLines)] # Using list comprehension to return a list of values from some operation

print(head)
    
for line in head:
    print(line)


### Writing output to a file
When we write to a file, we need to indicate whether we are going to overwrite anything in an existing file (`mode = 'w'`), or append (`mode = 'a'`) the content to the end of the existing file. When we are done with a file, we should always close it with `close()`.

In [31]:
file = open('myOutput.txt','w')
file.write("Here is my awesome file content!\n")
file.write("It's probably the most important information I'll need for my thesis!\n")
for gene in genes:
    file.write(f'{gene}\n')
file.close()

file = open('myOutput.txt','a') # Try changing the mode from 'a' to 'w'
file.write("I forgot to add this as well\n")
file.close()

file = open('myOutput.txt','r')

for line in file:
    print(line)
file.close()


Here is my awesome file content!

It's probably the most important information I'll need for my thesis!

Gapdh

Mef2c

Pax6

Cxcl1

Msi1

I forgot to add this as well
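
As an alternative to calling `close()` by hand, python's `with` statement closes the file automatically when its code block ends, even if an error occurs inside it. The sketch below is an addition to the approach shown above; it writes to a temporary directory (via the standard `tempfile` module) so that it does not touch your real files, and the shortened `genes` list and variable names are illustrative:

```python
import os
import tempfile

genes = ['Gapdh', 'Mef2c', 'Pax6']
path = os.path.join(tempfile.mkdtemp(), 'myOutput.txt')  # throwaway location

# The file is closed automatically at the end of the 'with' block
with open(path, 'w') as out:
    for gene in genes:
        out.write(f'{gene}\n')

with open(path, 'r') as infile:
    lines = [line.strip() for line in infile]  # strip() removes the trailing newline

print(lines)
print(infile.closed)  # the file object reports that it has been closed
```

Stripping each line also avoids the blank lines seen in the output above, which come from printing lines that already end in `\n`.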



## Intro to Python Scripting
Now we're going to tie all of this together into a python script that is a self-contained program. First, let's figure out what our objectives are.
