# Introduction to Python Programming

<img align=left src="https://www.python.org/static/community_logos/python-logo-master-v3-TM.png">

This seminar is part of the __NIAID BCBB Python Programming for Biologist Series__. You can find materials for this and other no-cost seminars at the [NIAID Bioinformatics Portal](http://bioinformatics.niaid.nih.gov) or [my GitHub page](http://github.com/burkesquires/python_biologist).

---

## Learning Objectives

- Enable you to recognize and read code written in the Python programming language
- Enable you to adapt Python source code for your purpose
- Enable you to run Python code in three or more ways
- Enable you to understand the advantages and disadvantages of using integrated development environments (IDE)
- Prepare you for the remaining seminars

---

## Outline
- Why learn Python?
- Learning Python by solving a task


# Why Python?

<img src="http://imgs.xkcd.com/comics/python.png">

---

    import this

---

The "Hello World" program in four different programming languages:

C++
```C++
#include <iostream>
int main()
{
    std::cout << "Hello World!" << std::endl;
    return 0;
}
```
Java
```Java
public class HelloWorld { 
    public static void main (String[] args) {
        System.out.println("Hello World!"); 
    }
}
```
Python
```python
print("Hello World!")
```

R
```R
print("Hello World!", quote = FALSE)
```

## Why Learn Python?

**Easy Syntax**
Python's syntax is easy to learn, so both non-programmers and programmers can start programming right away.

**Readability**
Python's syntax is very clear, so it is easy to understand program code. Python is often referred to as "executable pseudo-code" because its syntax mostly follows the conventions used by programmers to outline their ideas without the formal verbosity of code in most programming languages.

**High-Level Language**
Python reads more like a human language than a low-level language does, which lets you write programs faster than a low-level language would allow.

**Object oriented programming**
Object-oriented programming allows you to create data structures that can be re-used, which reduces the amount of repetitive work that you'll need to do. Object-oriented languages usually define objects with a keyword such as `class`, and an object can refer to and modify itself through a name such as `this` or, in Python, `self`.
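
As a minimal sketch of this idea, assuming a hypothetical `Sequence` class that is not part of the seminar materials, a class bundles data with the functions that act on it, and each method reaches its own object through `self`:

```python
class Sequence:
    """A hypothetical container for a named DNA sequence."""

    def __init__(self, name, bases):
        # self refers to the object being created
        self.name = name
        self.bases = bases

    def gc_content(self):
        """Return the fraction of bases that are G or C."""
        gc = self.bases.count("G") + self.bases.count("C")
        return gc / len(self.bases)


# The same class can be re-used for any number of sequences
seq = Sequence("example", "ATGC")
print(seq.name, seq.gc_content())
```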

**It's Free**
Python is both free and open-source. The Python Software Foundation distributes freely available, pre-built binaries of the reference implementation, called `CPython`, for all major operating systems. You can get CPython's source code, too.

**Cross-platform**
Python runs on all major operating systems, including Microsoft Windows, Linux, and macOS. An implementation called [`micropython`](https://micropython.org/) can also run on microcontrollers and small boards, such as some Arduino and Raspberry Pi models.

**Easy installation**
Python, and many of its most popular packages, especially for data science, are easily installable with the [Anaconda distribution](https://www.anaconda.com/download/).

**Widely Supported**
Python has an active support community with many web sites, mailing lists, and USENET "netnews" groups that attract a large number of knowledgeable and helpful contributors.

**It's Safe**
Unlike C and C++, Python does not expose raw pointers, which avoids a whole class of memory errors. Its design philosophy also holds that errors should never pass silently unless they are explicitly silenced, so when a program crashes you get a traceback that shows where the error occurred and why.
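
A small sketch of what "errors never pass silently" looks like in practice. The helper name `parse_year` is made up for illustration; the point is that a bad conversion raises an exception you can read, rather than quietly producing a wrong number:

```python
def parse_year(text):
    """Convert text to an integer year; int() raises ValueError on bad input."""
    return int(text)


try:
    parse_year("year")  # a header cell, not a number
except ValueError as error:
    # The error is reported instead of silently producing a wrong value
    message = str(error)
    print("Could not convert:", message)
```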

**Batteries Included**
Python is famous for being the "batteries included" language. There are over 300 standard library modules, which contain functions and classes for a wide variety of programming tasks.

**Extensible**
In addition to the standard libraries there are extensive collections of freely available add-on modules, libraries, frameworks, and tool-kits. These generally conform to similar standards and conventions.

Source: [wikiversity/python_concept](https://en.wikiversity.org/wiki/Python_Concepts/Why_learn_Python)

## Popular Python Packages - "Scientific Python Stack"

<img src="images/scientific_python_stack_2017.png">

- [Python](https://www.python.org/), a general purpose programming language. It is interpreted and dynamically typed and is well suited to interactive work and quick prototyping, while being powerful enough to write large applications in.
- [NumPy](http://www.numpy.org/), the fundamental package for numerical computation. It defines the numerical array and matrix types and basic operations on them.
- [SciPy](https://www.scipy.org/scipylib/index.html), a collection of numerical algorithms and domain-specific toolboxes, including signal processing, optimization, statistics and much more.
- [Matplotlib](http://matplotlib.org/), a mature and popular plotting package that provides publication-quality 2D plotting as well as rudimentary 3D plotting.

__Data and computation__:

- [pandas](http://pandas.pydata.org/), providing high-performance, easy to use data structures.
- [SymPy](http://www.sympy.org/), for symbolic mathematics and computer algebra.
- [scikit-image](http://scikit-image.org/) is a collection of algorithms for image processing.
- [scikit-learn](http://scikit-learn.org/) is a collection of algorithms and tools for machine learning.
- [h5py](http://www.h5py.org/) and [PyTables](http://www.pytables.org/) can both access data stored in the HDF5 format.

__Productivity and high-performance computing__:

- [IPython](http://ipython.org/), a rich interactive interface, letting you quickly process data and test ideas.
- [Jupyter](http://jupyter.org/) notebook provides IPython functionality and more in your web browser, allowing you to document your computation in an easily reproducible form.
- [Cython](http://cython.org/) extends Python syntax so that you can conveniently build C extensions, either to speed up critical code, or to integrate with C/C++ libraries.
- [Dramatiq](https://dramatiq.io), [Dask](https://dask.readthedocs.io/), [Joblib](https://joblib.readthedocs.io/) or [IPyParallel](https://ipyparallel.readthedocs.io/) for distributed processing with a focus on numeric data.

__Quality assurance__:

- [nose](https://nose.readthedocs.org/en/latest/), a framework for testing Python code, being phased out in favor of [pytest](https://docs.pytest.org/).
- [numpydoc](https://github.com/numpy/numpydoc), a standard and library for documenting Scientific Python libraries.

__Bioinformatics__:

- [Biopython](https://www.biopython.org) - Python package for bioinformatics, computational biology, and molecular biology
- [Bioconda](https://bioconda.github.io/) - `conda` channel and repository of common Unix bioinformatics software

[Adapted from https://www.scipy.org/about.html]

## Python IDEs

We will be using the `Jupyter notebook` and `IPython` kernel in the seminar series, but there are a number of other Python development environments. These include:

- [PyCharm](https://www.jetbrains.com/pycharm/), Community and Pro editions
- Spyder, included in the Anaconda install
- [Thonny - Python IDE for beginners](https://thonny.org/)
- many others

---

# Learning Python with an Example

## Analyze a given data file

Source: http://www.gapminder.org/data/

Let's use our command line / Linux skills to view the data file we have been given. The data file is in a new directory that shares a parent directory with this folder.

Remember that you can get a list of the files in a folder using the Jupyter / IPython magic:

    !ls -l <directory>

In the Windows OS you can use `!dir` to get the list.

Let's use some more magic to take a quick look at the data file:

    !head ../data/gapminder.csv

What can we tell about our data file?

---

### Variables

Create a new variable and use it to store the name of our file

```python
data_file = '../data/gapminder.csv'
```

Let's print the variable to make sure it has been stored:

```python
print(data_file)
```

**Side Notes**

Python variable names _MUST_ start with a letter or an underscore, and can contain any combination of letters, digits, and underscores after that. They cannot be reserved keywords such as `class` or `for`.
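
One way to check a name before using it, sketched with the standard library only. The candidate names are invented for illustration. Note that a leading underscore is legal, while a leading digit, a hyphen, or a reserved keyword is not:

```python
import keyword

candidates = ["data_file", "_private", "file2", "2file", "data-file", "class"]

for name in candidates:
    # str.isidentifier() checks the spelling rules;
    # keyword.iskeyword() flags reserved words such as "class"
    usable = name.isidentifier() and not keyword.iskeyword(name)
    print(name, usable)

valid_names = [n for n in candidates if n.isidentifier() and not keyword.iskeyword(n)]
```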

[Python Style Guide: Function and Variable Names](https://www.python.org/dev/peps/pep-0008/#function-and-variable-names)
> Function names should be lowercase, with words separated by underscores as necessary to improve readability.

> Variable names follow the same convention as function names.

> mixedCase is allowed only in contexts where that's already the prevailing style (e.g. threading.py), to retain backwards compatibility.

We can also see many more examples of syntax on the [`Learn Python3 in Y Minutes`](https://learnxinyminutes.com/docs/python3/) website.

## Everything in Python is an Object!

Everything in Python is an object. Let's see what type of object our variable is:

```python
type(data_file)
```
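
To illustrate that the same question can be asked of any value, here is a short sketch with a few made-up values (a string, two numbers, a list, and even the built-in `print` function):

```python
values = ["../data/gapminder.csv", 2007, 78.2, ["a", "b"], print]

for value in values:
    # type() reports the class of any object, including functions
    print(type(value).__name__)

type_names = [type(value).__name__ for value in values]
```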

### Comments

Anything after the pound sign (#) is ignored by the Python interpreter, which makes comments great for explaining what is going on in your code.

```python
# Single line comments start with a number symbol.

""" Multiline strings can be written
    using three "s, and are often used
    as documentation.
"""

type(data_file)
```

### File I/O


Open a file, process its contents, and make sure to close it

```python
f = open(data_file)

type(f)
```

```python
f = open(data_file)

print(f.readline())

f.close()  # release the file when we are finished with it
```

A more "pythonic" way of using _open_ is to precede it with __with__, which closes the file automatically when the block ends:

```python
with open(data_file) as f:

    # the indented lines run while the file is open
    print(f.readline())
```
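
To see that a `with` block really does close the file, the following sketch writes and reads a throwaway file. The temporary directory and the file name `example.csv` are assumptions made so the example does not depend on the seminar data:

```python
import os
import tempfile

# Write a small throwaway file so the example does not depend on the seminar data
with tempfile.TemporaryDirectory() as tmp_dir:
    example_path = os.path.join(tmp_dir, "example.csv")

    with open(example_path, "w") as f:
        f.write("country,year\n")

    with open(example_path) as f:
        first_line = f.readline()
        print(f.closed)   # still open inside the block

    # Leaving the with block closed the file for us
    file_closed_after = f.closed
    print(file_closed_after)
```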

## "Batteries Included"

"Batteries included" refers to the large amount of functionality that is built into the [Python standard library](https://docs.python.org/3/library/index.html).

One example of a module from the Python standard library is __csv__, which enables one to read and write CSV files. Modules in the standard library are already installed on your computer, BUT you do have to _import_ them before their first use in a Python script.

```python
import csv
```

Let's get some information on _csv_:

```python
help(csv)
```

```python
# open a file, process its contents, and make sure to close it

with open(data_file) as f:

    reader = csv.reader(f)
    
    for row in reader:

        print(row)
```

We will use a package called pandas when we learn about data analysis but the csv module is useful here.

We could have read the file line-by-line and used the string method __split__ to split each line into elements.

```python
fields = line.split(",")
```

`fields` would be the equivalent of `row`.
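
The two approaches only agree when no field contains a comma. The sketch below uses one made-up record with a quoted comma to show where they differ; `io.StringIO` stands in for an open file:

```python
import csv
import io

# One made-up record whose first field contains a comma inside quotes
line = '"Korea, Rep.",Asia,2007\n'

fields = line.strip().split(",")
row = next(csv.reader(io.StringIO(line)))

print(fields)  # the quoted comma splits the country name in two
print(row)     # csv.reader keeps the quoted field together
```

This is one reason to prefer the `csv` module over `split` for real data files.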

```python
# open a file, process its contents, and make sure to close it

with open(data_file) as f:

    reader = csv.reader(f)

    for row in reader:

        country = row[0]
        continent = row[1]
        year = row[2]
        lifeExp = row[3]
        pop = row[4]
        gdpPercap = row[5]

        print(country, continent, year, lifeExp, pop, gdpPercap)
```

## Functions

```python
def extract_column_data(row):
    
    country = row[0]
    continent = row[1]
    year = row[2]
    lifeExp = row[3]
    pop = row[4]
    gdpPercap = row[5]

    return (country, continent, year, lifeExp, pop, gdpPercap)
```

```python
with open(data_file) as f:

    reader = csv.reader(f)
    
    for row in reader:
        
        col_data = extract_column_data(row)

        print(col_data)
```

We can also set default arguments by assigning names and values:

```python
def extract_column_data(row, country_col=0, continent_col=1, year_col=2, lifeExp_col=3, pop_col=4, gdpPercap_col=5):
    
    country = row[country_col]
    continent = row[continent_col]
    year = row[year_col]
    lifeExp = row[lifeExp_col]
    pop = row[pop_col]
    gdpPercap = row[gdpPercap_col]

    return (country, continent, year, lifeExp, pop, gdpPercap)
```

```python
with open(data_file) as f:

    reader = csv.reader(f)
    
    for row in reader:
        
        country, continent, year, lifeExp, pop, gdpPercap = extract_column_data(row)

        print(country, continent, year, lifeExp, pop, gdpPercap)
```
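
Default arguments become useful when a caller needs to change only one of them. A minimal sketch, using a shortened hypothetical function `pick_columns` and a hand-typed row rather than the seminar's own function:

```python
def pick_columns(row, country_col=0, year_col=2):
    """Return (country, year) from a row; column positions are keyword defaults."""
    return row[country_col], row[year_col]


row = ["Afghanistan", "Asia", "1952", "28.801"]

default_result = pick_columns(row)
# Override one default by name when a file uses a different layout
override_result = pick_columns(row, year_col=3)

print(default_result)
print(override_result)
```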

### More _pythonic_

Let Python automatically unpack the values!

```python
# unpack all the variables on a single line

with open(data_file) as f:

    reader = csv.reader(f)
    
    for row in reader:

        country, continent, year, lifeExp, pop, gdpPercap = row

        print(country, continent, year, lifeExp, pop, gdpPercap)
```
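
Unpacking works on any sequence, not just rows from a file. The sketch below uses a hand-typed sample row to show plain unpacking, a starred name that collects the remaining items, and the error raised when the number of names does not match:

```python
row = ["Afghanistan", "Asia", "1952", "28.801", "8425333", "779.4453145"]

# Names on the left are matched to items by position
country, continent, year, lifeExp, pop, gdpPercap = row

# A starred name collects whatever is left over
country, continent, *numbers = row

# Unpacking fails loudly when the counts do not match
try:
    country, continent = row
except ValueError as error:
    unpack_error = str(error)
    print(unpack_error)
```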

[Python function guidance](https://jeffknupp.com/blog/2018/10/11/write-better-python-functions/)

How do we store the data that we have?

One way to do it is to save the data to a file as we extract it.

# Reading and Writing Files

## Writing to a File

We have to open a file for writing BEFORE we can write to it. We can use the relative path or we can also give an absolute path like `/Users/username/Documents/dna.txt`

```python
# create an output file
output_file_handle = open("../data/gapminder_output.csv", "w")

# unpack all the variables on a single line

with open(data_file) as f:

    reader = csv.reader(f)
    
    for row in reader:

        country, continent, year, lifeExp, pop, gdpPercap = row

        output_file_handle.write('%s, %s, %s, %s, %s, %s\n' % (country, continent, year, lifeExp, pop, gdpPercap))

output_file_handle.close()
```
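
An alternative to formatting each line by hand is `csv.writer`, which adds quotes where a field needs them. This is only a sketch: the rows are illustrative values and the output goes to a temporary directory rather than `../data`:

```python
import csv
import os
import tempfile

rows = [
    ["country", "year", "lifeExp"],
    ["Korea, Rep.", "2007", "78.6"],   # illustrative values
]

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "output.csv")

    # newline="" lets the csv module control line endings
    with open(out_path, "w", newline="") as out_file:
        writer = csv.writer(out_file)
        writer.writerows(rows)

    with open(out_path, newline="") as in_file:
        round_trip = list(csv.reader(in_file))

print(round_trip)
```

Reading the file back with `csv.reader` returns the same rows, including the field that contains a comma.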

You can see what type of object `output_file_handle` is by using the `type` function.

```python
type(output_file_handle)
```

## Reading a file

Let's see if we did in fact create a file...

    !ls

    !cat ../data/gapminder_output.csv

```python
output_file_handle = open("../data/gapminder_output.csv", "r")

output_file_contents = output_file_handle.read()

print(output_file_contents)

# remember to close the file

output_file_handle.close()
```

## Looping through File, Line by Line

In a new cell below type, and run:

```python
with open("../data/gapminder_output.csv") as file:

    for line in file:

        # each line already ends with a newline, so suppress print's own
        print(line, end="")
```

## Lists

```python
country_list = []
year_list = []
lifeExp_list = []
```

```python
with open(data_file) as f:

    reader = csv.reader(f)
    
    for row in reader:
        
        country, continent, year, lifeExp, pop, gdpPercap = row

        country_list.append(country)
        
        year_list.append(year)
        
        lifeExp_list.append(lifeExp)
```

```python
len(country_list)
```

```python
len(year_list)
```

```python
len(lifeExp_list)
```

```python
country_list[0]
```
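
The basic list operations used here can be tried on a small hand-typed list. The values are examples only; note that everything read from a CSV file is a string until you convert it:

```python
years = ["1952", "1957", "1962"]

years.append("1967")                 # add one item to the end
first, last = years[0], years[-1]    # negative indexes count from the end
middle = years[1:3]                  # slices return a new list

# Values read from a CSV file are strings; convert them before doing maths
year_numbers = [int(y) for y in years]

print(len(years), first, last, middle, year_numbers)
```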

So we have lists of country, year, and life expectancy values. Can we just plot this? No...

Let's see what the unique values of the country list are:

```python
set(country_list)
```
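
A set drops duplicates but does not keep any particular order. A short sketch with made-up values shows how `sorted` turns the set back into an ordered list:

```python
country_values = ["Chile", "Chile", "Angola", "Chile", "Angola", "Brazil"]

unique_countries = set(country_values)       # duplicates removed, order not guaranteed
sorted_countries = sorted(unique_countries)  # a list in alphabetical order

print(len(country_values), len(unique_countries))
print(sorted_countries)
```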

Remember that in our graph we want to plot each country as a separate line.

So we need to save each country's data to a new list.

## Conditionals

Remember what the code looked like before:

```python
with open(data_file) as f:

    reader = csv.reader(f)

    for row in reader:

        country, continent, year, lifeExp, pop, gdpPercap = row

        country_list.append(country)

        year_list.append(year)

        lifeExp_list.append(lifeExp)
```

```python
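# start from empty lists so earlier runs do not leave duplicates behind
country_list = []
year_list = []
lifeExp_list = []

with open(data_file) as f:

    reader = csv.reader(f)

    for row in reader:

        country, continent, year, lifeExp, pop, gdpPercap = row

        # keep only the rows for one country
        if country == 'United States':

            country_list.append(country)

            year_list.append(year)

            lifeExp_list.append(lifeExp)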
```
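
An `if` can be extended with `elif` and `else` when there are more than two cases. In this sketch the function name and the cut-off values are hypothetical, chosen only to show the syntax:

```python
def describe_life_expectancy(value):
    """Label a life expectancy using hypothetical cut-offs chosen for illustration."""
    if value < 50:
        return "low"
    elif value < 70:
        return "medium"
    else:
        return "high"


labels = [describe_life_expectancy(v) for v in (28.8, 65.0, 78.2)]
print(labels)
```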

# Dictionaries

```python
from collections import defaultdict    
    
gap_dict = defaultdict(list)

with open(data_file) as f:

    reader = csv.reader(f)
    
    next(reader) # skip the header line
    
    for row in reader:
        
        country, continent, year, lifeExp, pop, gdpPercap = row

        country_year = "%s-%s" % (country, year)
        
        # convert the strings to numbers so they can be plotted on numeric axes
        gap_dict[country].append([int(year), float(lifeExp)])
```

```python
gap_dict
```
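
To see what `defaultdict(list)` saves us, the sketch below groups a few made-up records twice, once with a plain dictionary and once with a `defaultdict`:

```python
from collections import defaultdict

records = [("Chile", 1952), ("Angola", 1952), ("Chile", 1957)]

plain = {}
for country, year in records:
    # a plain dict needs the key to exist before we can append
    if country not in plain:
        plain[country] = []
    plain[country].append(year)

grouped = defaultdict(list)
for country, year in records:
    # defaultdict(list) creates the empty list on first access
    grouped[country].append(year)

print(plain)
print(dict(grouped))
```

Both loops build the same mapping; the `defaultdict` version simply removes the membership check.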

```python
# tuples

nucleotides = ('A', 'T', 'C', 'G')

years = (2000, 2001, 2002)

countries = ('United States', 'South Africa')
```
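
Tuples look like lists but cannot be changed after they are created. A brief sketch of what that means in practice; the one-entry `codon_table` is only an illustration:

```python
nucleotides = ("A", "T", "C", "G")

print(nucleotides[0], len(nucleotides))

try:
    nucleotides[0] = "U"   # tuples cannot be changed in place
except TypeError as error:
    tuple_error = str(error)
    print(tuple_error)

# Because they are immutable, tuples can be used as dictionary keys
codon_table = {("A", "T", "G"): "Met"}
print(codon_table[("A", "T", "G")])
```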

```python
import numpy as np
import matplotlib.pyplot as plt
    
for country in countries:
    
    data_list = gap_dict[country]

    x, y = zip(*data_list)

    plt.scatter(x, y)

    plt.plot(x, y, '-o')

plt.show()
```

To see examples of possible plots and code, check out the gallery at the matplotlib [website](http://matplotlib.org/gallery.html).

## Additional Dictionary example

```python
dna = "AATGATCGATCGTACGCTGAAATGATCGATCGTACGCTGAAATGATCGATCGTACGCTGAAATGATCGATCGTACGCTGAAATGATCGATCGTACGCTGA"

counts = {}

for base1 in ['A', 'T', 'G', 'C']:
    for base2 in ['A', 'T', 'G', 'C']:
        for base3 in ['A', 'T', 'G', 'C']:
            trinucleotide = base1 + base2 + base3
            count = dna.count(trinucleotide)
            counts[trinucleotide] = count
            
print(counts)
```
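
One caveat about the example above: `str.count` counts non-overlapping occurrences only. If overlapping trinucleotides matter, a sliding window with `collections.Counter` is one possible approach. The short sequence here is made up to make the difference visible:

```python
from collections import Counter

dna = "AAAAGC"

# str.count skips ahead after each hit, so overlapping copies are missed
non_overlapping = dna.count("AAA")

# Slide a window of width 3 along the sequence, one base at a time
windows = [dna[i:i + 3] for i in range(len(dna) - 2)]
overlapping = Counter(windows)

print(non_overlapping, overlapping["AAA"])
```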

---

# Regular Expressions

## Patterns in Biology

There are a lot of patterns in biology:

- protein domains
- DNA transcription factor binding motifs
- restriction enzyme cut sites
- runs of mononucleotides

Many problems that we want to solve require more flexible patterns:

- Given a DNA sequence, what's the length of the poly-A tail?
- Given a gene accession name, extract the part between the third character and the underscore
- Given a protein sequence, determine if it contains this highly-redundant domain motif

## Regular expression module

To search for these patterns, we use the regular expression module `re`.

```python
re.search(pattern, string)
```

```python
import re
import numpy as np
```

```python
dna = "ATCGCGAATTCAC"

if re.search(r"GAATTC", dna):
    print("restriction site found!")
```

```python
if re.search(r"GC(A|T|G|C)AA", dna):
    print("restriction site found!")
```

```python
dna = "ATCGCA"
if re.search(r"GC[ATGC]+", dna):
    print("restriction site found!")
```
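
Two of the questions raised earlier can be sketched with the same function. The sequence and the accession `xkAB123_v2` are invented, and the accession pattern assumes that "between the third character and the underscore" means starting at the third character and stopping before the first underscore:

```python
import re

dna = "ATGCGTACGT" + "A" * 8

# A+ means one or more A; $ anchors the run to the end of the sequence
tail = re.search(r"A+$", dna)
tail_length = len(tail.group()) if tail else 0

# Hypothetical accession format: two characters, then the part we want, then "_"
accession = "xkAB123_v2"
middle = re.search(r"^..(.+?)_", accession)
middle_part = middle.group(1) if middle else None

print(tail_length, middle_part)
```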

## Get String and Position of Match

Get the string that matched

In a new cell below type, and run:

```python
dna = "ATGAATAACGTACGTACGACTG"

# store the match object in the variable m

m = re.search(r"GA([AT]{3})AC([ATGC]{2})AC", dna)

print("entire match: " + m.group())

print("first bit: " + m.group(1))

print("second bit: " + m.group(2))
```

Get the positions of the match

```python
print("start: " + str(m.start()))

print("end: " + str(m.end()))
```
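
`re.search` stops at the first match. When a sequence may contain several sites, `re.finditer` returns a match object for each one. A sketch using a made-up sequence with two EcoRI-style `GAATTC` sites:

```python
import re

dna = "AT" + "GAATTC" + "GG" + "GAATTC" + "AT"

# finditer yields one match object per non-overlapping hit
sites = [(m.start(), m.end(), m.group()) for m in re.finditer(r"GAATTC", dna)]

for start, end, text in sites:
    print(start, end, text)
```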

```python
import re
m = re.search("[ATGC]", 'ATCG')
# in IPython / Jupyter, a trailing ? displays help for an object
re?
```

# Files, Programs, & User Input

## Basic File Manipulation

Rename a file

In a new cell below type, and run:

```python
import os

os.rename("old.txt", "new.txt")
```

Rename a folder

```python
os.rename("/home/martin/old_folder", "/home/martin/new_folder")
```

Check to see if a file exists

```python
if os.path.exists("/home/martin/email.txt"):
    print("You have mail!")
```

Remove a file

```python
os.remove("/home/martin/unwanted_file.txt")
```

Remove an empty folder

```python
os.rmdir("/home/martin/empty")
```


To delete a folder and all the files in it, use shutil.rmtree

```python
import shutil

shutil.rmtree("/home/martin/full")
```
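
Because the paths above are placeholders, those cells will fail unless the files exist. The following sketch runs the same operations inside a temporary directory, so it is safe to try as-is; the file names are arbitrary:

```python
import os
import shutil
import tempfile

# Work inside a throwaway directory so nothing important can be deleted
work_dir = tempfile.mkdtemp()

old_path = os.path.join(work_dir, "old.txt")
new_path = os.path.join(work_dir, "new.txt")

with open(old_path, "w") as f:
    f.write("example\n")

os.rename(old_path, new_path)
renamed = os.path.exists(new_path) and not os.path.exists(old_path)

os.remove(new_path)
shutil.rmtree(work_dir)       # removes the folder and anything left in it
cleaned_up = not os.path.exists(work_dir)

print(renamed, cleaned_up)
```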

## Running External Programs

Run an external program

In a new cell below type, and run:
    
```python
import subprocess

subprocess.call("/bin/date")
```

Run an external program with options

```python
subprocess.call("/bin/date +%B", shell=True)
```


Saving program output

```python
# check_output returns bytes; decode() turns them into a string
current_month = subprocess.check_output("/bin/date +%B", shell=True).decode().strip()
```
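
`/bin/date` exists only on Unix-like systems. As a portable sketch, `subprocess.run` (Python 3.7 or later for `capture_output`) can launch the Python interpreter itself; passing the command as a list avoids the need for `shell=True`:

```python
import subprocess
import sys

# sys.executable is the running Python interpreter, so this works on any OS
result = subprocess.run(
    [sys.executable, "-c", "print('hello from a child process')"],
    capture_output=True,   # collect stdout and stderr
    text=True,             # decode bytes to str
    check=True,            # raise CalledProcessError on a non-zero exit code
)

output = result.stdout.strip()
print(output)
```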

Now using IPython magic:

    !date

## User Input

Interactive user input

In a new cell below type, and run:

```python
accession = input("Enter the accession name: ")

# do something with the accession variable
```

Capture command line arguments

```python
import sys

# run from a terminal as: python myprogram.py one two three
print(sys.argv)

# sys.argv[0] is the script name; the arguments start at sys.argv[1]
```
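
For anything beyond a couple of arguments, the standard library's `argparse` module can name and document them. This is a sketch only: the argument names `data_file` and `--country` are made up, and the list passed to `parse_args` stands in for a real command line so the cell can run in a notebook:

```python
import argparse

parser = argparse.ArgumentParser(description="Summarize a data file")
parser.add_argument("data_file", help="path to the CSV file")
parser.add_argument("--country", default="United States", help="country to report")

# Passing a list makes the example runnable in a notebook;
# calling parse_args() with no arguments reads sys.argv[1:] instead
args = parser.parse_args(["../data/gapminder.csv", "--country", "Chile"])

print(args.data_file, args.country)
```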

---

# To conclude…

## Learning Objectives

- Enable you to recognize and read code written in the Python programming language
- Enable you to adapt Python source code for your purpose
- Enable you to run Python code in three or more ways
- Enable you to understand the advantages and disadvantages of using integrated development environments (IDE)
- Prepare you for the remaining seminars

---

# Resources

Official Documentation

* [The Python Official Docs](http://docs.python.org/3/)
* [Official Style Guide for Python](https://www.python.org/dev/peps/pep-0008/)

Awesome Python
* [A curated list of awesome Python frameworks, libraries and software](https://github.com/vinta/awesome-python)

Training Resources

* [A Crash Course in Python for Scientists](http://nbviewer.jupyter.org/gist/anonymous/5924718)
* [First Steps With Python](https://realpython.com/learn/python-first-steps/)
- http://interactivepython.org/runestone/default/user/login
- http://www.pythonforbeginners.com
- http://www.pythontutor.com/visualize.html#mode=display
- https://groklearning.com
- https://www.pythonanywhere.com/
* [Hitchhiker's Guide to Python](http://docs.python-guide.org/en/latest/)
* [Python 3 Computer Science Circles](http://cscircles.cemc.uwaterloo.ca/)
* [Dive Into Python 3](http://www.diveintopython3.net/index.html)
* [Python Course](http://www.python-course.eu/index.php)
* [30 Python Language Features and Tricks You May Not Know About](http://sahandsaba.com/thirty-python-language-features-and-tricks-you-may-not-know.html)


Free eBook in HTML / PDF

* [Automate the Boring Stuff with Python](https://automatetheboringstuff.com)
- [Python for Everybody: Exploring Data In Python 3 (free ebook)](https://www.pythonlearn.com/book.php)
- [How to Think Like A Computer Scientist (free ebook)](http://www.greenteapress.com/thinkpython/)
- [Python for Biologists](http://pythonforbiologists.com)
- [PyData 101 talk](https://speakerdeck.com/jakevdp/pydata-101)
- http://openbookproject.net/books/bpp4awd/index.html


Python Regular Expressions (pattern matching)

- http://www.pyregex.com
- http://pythex.org


Python CheatSheets

- https://www.pythonsheets.com

Project Ideas

* [Ideas for Python Projects](http://pythonpracticeprojects.com)

Python Developer Survey

- https://www.jetbrains.com/research/python-developers-survey-2017/

Video

- [PyVideo.org](https://pyvideo.org/)

---

# Q & A

Collaborations welcome

R. Burke Squires - richard dot squires at nih dot gov

or

ScienceApps at niaid.nih.gov