![Python Logo](python_logo_400.png)

# Welcome to the Data Analysis Workshop!

### Copy course materials to your desktop from:
* BRC or VPN: ```T:\TempTransfer\000_NIA_Python_Course```
* Otherwise: [tinyurl.com/y2dnbsl2](https://tinyurl.com/y2dnbsl2)

## Goals of this course

Students of this course should gain familiarity with:
* Python's capabilities, strengths and weaknesses
* Running Python code within a Jupyter Notebook environment
* Generating tables/figures and doing simple stats
* Keywords/names of concepts to help you Google things yourself


## Target audience

* Bench scientists who want to analyze their own data
* Microsoft Excel users
* For beginners: No prior programming knowledge assumed

## What will be covered

<img align="right" alt="Data science workflow graphic" src="data-science-pipline-thumb.png">

<ul>
    <li>Day 1: Basic syntax, data types and operators</li>
    <li>Day 2: Fancy data types (arrays and data frames)</li>
    <li>Day 3: Exploratory data analysis</li>
    <li>Day 4: Case study: Gene expression ribbon plot</li>
</ul>

<div style="align: right; float: right; clear: both;">Image Credit: <a href="https://www.wolfram.com/broadcast/video.php?v=1836">Wolfram</a></div>

## About me

* Christopher Coletta, M.S.
* Computer vision, signal processing, machine learning, longitudinal data analysis
* Computer Scientist in Computational Biology & Genomics Core (CBGC)

## About the Computational Biology & Genomics Core (CBGC)

<img style="align: right; float: right;" alt="Computational Biology Core logo" src="CBGC_logo_300.png">

* Core facility housed in LGG
* Room 10C222
* Seminar or training every month
* Two powerful Windows computers with lots of software and remote access available
* NIA IRP cloud computing like Amazon AWS via [OpenStack](https://niairpcloud.irp.nia.nih.gov/horizon/auth/login/)
* NIAIRPGPU1 - server for deep learning (8x NVIDIA V100 GPUs)

## What WILL be covered

### Wrangle and Clean
* Simple and complex sort
* Filter
* Missing Data
* Group-by operations (Split/apply/combine)
* Merge two spreadsheets (JOIN operations)
* Bin continuous variables into categorical variables

### Exploratory Data Analysis
* Summary statistics
* Pivot table
* Histogram
* Scatter plot
* Box and whiskers
* Pairwise scatterplot matrix

### Model
* Linear regression

### Visualize
* Facet plot
* Heatmap
* Manhattan plot

## Why Python (vs. Excel, Matlab, R, etc)

* Free (no cost) and free ([open-source](https://en.wikipedia.org/wiki/Open-source_software))
* General-purpose: data, web, apps, microcontrollers
* Readable: simple, non-cluttered syntax
* Expressive: do more with fewer lines of code (relative to C)
* Popular: #3 language in [TIOBE Index](https://www.tiobe.com/tiobe-index/)

### Relative Strengths

Extensive, mature libraries for:

* Image analysis
* Video analysis (self-driving car platforms)
* Machine learning, especially artificial neural networks
* Natural language processing (sentiment analysis)

### Relative Weaknesses

* Mobile apps

## Python Learning Resources


* SoloLearn phone app: Python 3 tutorial
* [Python for Scientists and Engineers](http://pythonforengineers.com/python-for-scientists-and-engineers/) - Free Book by Shantnu Tiwari
* [PyData YouTube channel](https://www.youtube.com/user/PyDataTV)
* Questions and answers on StackOverflow.com
* [r/LearnPython](https://www.reddit.com/r/learnpython) on Reddit
* Use Jupyter built-in operator <code>?</code>

## Ecosystem of Python Data Analysis Software

<a href="http://chris35wills.github.io/courses/pydata_stack/"><img src="http://chris35wills.github.io/courses/pydata_stack.png" style="width:540px;height:300px;"></a>

[Anaconda](https://www.continuum.io/downloads) is one of many Python "distributions" that bundle core Python, essential 3rd party packages, and various IDEs.

# Integrated Development Environment

The software app you use to build and test your code.

## Python IDE Choices
* [Spyder](https://pythonhosted.org/spyder/)
* [PyCharm](https://www.jetbrains.com/pycharm/)
* [Jupyter](http://jupyter.org/) - IDE for this workshop
* And more ...

## Jupyter

* IDE for Python, R, Bash, many others
* Web browser interface: communicates with either local or remote back-end ("kernel")
* Creates a [sharable document](https://nbviewer.jupyter.org/) called a notebook.
* Notebook divided up into cells that contain code, output and documentation ("Markdown cell").


## Jupyter Markdown Cell

* [Markdown](https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet): Document-formatting style that is easily convertible to HTML
* Headings preceded by #
* Unordered lists preceded by a \*
* Ordered lists preceded by a number
* Math equations go in between two Dollar signs, example: $t=\frac{\hat{\beta}-\beta_{{H}_0}}{s.e.(\hat{\beta})}$
* Create links like [this](https://www.nia.nih.gov/)

## Code Cells
* Python code goes in here
* <code>Shift+Enter</code> to run and go to the next cell
* <code>Ctrl+Enter</code> to run code and stay on current cell
* Upon execution, a number shows up on the left indicating order of execution.
* You don't have to run code cells in the order they appear in the notebook.

In [1]:
print( "Hello, world!" )

Hello, world!


## Interacting with cells: Command mode
* Press Esc - box turns blue
* Useful shortcuts:
    * b = Insert cell below
    * a = Insert cell above
    * dd = Delete cell
    * Shift + up or down = select/highlight two or more cells
    * M = merge highlighted cells into one

## Edit mode
* Double click to edit - box turns green
* Useful shortcuts
    * Ctrl + Shift + - = split cell at cursor location
    * Enter = gives you a new line inside the same cell
    * Shift + Enter = Run the code in this cell and go to the next one
    * Ctrl + Enter = Run the code in this cell and stay on this one

# Basic Python Syntax

## Comments
Anything after a hash symbol "#" on a line is ignored by the Python interpreter.

In [2]:
# Run me! nothing happens!!! 
# askfdjhdsakfadhsfadsk
print("before the hash") # after the hash

before the hash


## Assignment, i.e., give a value a name

* An assignment statement puts a name on the left side of an equals sign and a value on the right.
* It gives a name to a value.
* Names can have upper and lowercase letters, digits (as long as a digit is not the first character), as well as underscores (Shift + -).
* Don't use a name that is also a [Python Syntax keyword](https://docs.python.org/3/reference/lexical_analysis.html#keywords)
* Assignment statements in Python do not copy objects, they create bindings between a target and an object.

In [3]:
a_value = 42

See the value attached to the name by typing the name

In [4]:
a_value

42
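
The last bullet above says that assignment binds a name to an object instead of copying it. Here is a minimal sketch of what that means in practice. It uses a list, which is covered later in this notebook, and the names <code>original</code>, <code>alias</code> and <code>a_copy</code> are made up for this illustration.

```python
# Two names bound to the same list object (all names here are illustrative)
original = [1, 2, 3]
alias = original            # no copy is made; both names refer to one list

alias.append(4)             # change the list through one name...
print(original)             # ...and the change is visible through the other

# list() builds a new, separate list with the same contents
a_copy = list(original)
a_copy.append(5)
print(original)             # not affected by the change to a_copy

print(alias is original)    # True: same object
print(a_copy is original)   # False: different object
```

For plain numbers like <code>a_value = 42</code> this distinction rarely matters, because numbers cannot be changed in place.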

## <code>print()</code> function
Use the <code>print</code> function to output one or more values at once.

In [5]:
print( a_value )

42


## Code-completion using <code>TAB</code> key

Hit the TAB key to use code completion to help you type faster. Most IDEs have this option. Usually a pop-up menu will appear.

In [6]:
a_value

42

## Python Data Types: what are they, and why do we care?

* Different types of data, different data types
* Each type has its own "superpowers," i.e., functionality.
* Advanced programmers often define their own types with their own functionality
* Here, "simple" means that these are types that are built into core Python, and you can use them right away.
* "Fancy" means simply that you need to use the <code>import</code> command before you use them.



### Scalar Data Types (simple)
* integer <code>int</code>: whole numbers
* float <code>float</code>: decimal numbers
* boolean <code>bool</code>: true/false

### Iterable Data Types (simple)
* string <code>str</code>: words
* list <code>list</code>: collection of things (ordered)
* dictionary <code>dict</code>: map one value to another (keeps insertion order in Python 3.7+)
* set <code>set</code>: unique collection of things (unordered)

### What's the difference between "scalar" and "iterable"?
* You can't loop over a scalar. 

### Iterable Data Types (fancy)
* NumPy multi-dimensional <code>array</code>: data, images
* Pandas <code>DataFrame</code>: spreadsheet analog

And many more...

## Scalar Data Types: Integer (<code>int</code>)

* A whole number, positive, negative or zero: 1, 2, 3, -89, 0, ...

In [7]:
-23

-23

In [8]:
type( 2345 )

int

In [9]:
type( a_value )

int

## Scalar Data Types: Float (<code>float</code>)

* Decimal numbers
* An accurate approximation to many many decimal places, but technically not an EXACT representation
* If you want to know more about why decimal numbers are called "floats", click [here](https://en.wikipedia.org/wiki/Floating-point_arithmetic).

In [10]:
type( 3.14159 )

float

In [11]:
type( 1/3 )

float
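
A short sketch of what "not an EXACT representation" looks like in practice. The name <code>total</code> is made up for this example; <code>math.isclose</code> is a standard-library function that is one common way to compare floats.

```python
import math

total = 0.1 + 0.2

print(total)                     # very close to 0.3, but not exactly 0.3
print(total == 0.3)              # False
print(math.isclose(total, 0.3))  # True: equal within a small tolerance
```

For everyday data analysis the tiny error is harmless, but it is a reason to avoid testing floats with <code>==</code>.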

## PEMDAS operators

1. Parentheses - <code>()</code>
2. Exponent - <code>**</code>
3. Multiplication - <code>*</code>
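4. Division - <code>/</code> (same precedence as multiplication; evaluated left to right)
5. Addition - <code>+</code>
6. Subtraction - <code>-</code> (same precedence as addition; evaluated left to right)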

Example: What is $9-3\div\frac{1}{3}+1=?$

In [12]:
9 - 3 / 1/3 + 1

9.0

In [13]:
9 - 3 / (1/3) + 1

1.0
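
A few more one-liners that sketch how Python groups operators. In Python, <code>*</code> and <code>/</code> share one precedence level, as do <code>+</code> and <code>-</code>, and ties are resolved left to right; <code>**</code> is the exception and groups right to left. The variable names are made up for this example.

```python
div_then_mul = 8 / 4 * 2     # grouped as (8 / 4) * 2
sub_then_add = 10 - 4 + 2    # grouped as (10 - 4) + 2
stacked_power = 2 ** 3 ** 2  # grouped as 2 ** (3 ** 2)
negative_power = -2 ** 2     # grouped as -(2 ** 2)

print(div_then_mul)    # 4.0
print(sub_then_add)    # 8
print(stacked_power)   # 512
print(negative_power)  # -4
```

When in doubt, add parentheses: they cost nothing and make the intent obvious.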

## Using the <code>type()</code> function

Use this to have Python tell you the data type of any expression or named value.

In [14]:
type( a_value )

int

In [15]:
type( 3.14159 )

float

## Scalar Data Types: Boolean (<code>bool</code>)

Bools can only have a value of <code>True</code> or <code>False</code>.

In [16]:
True

True

In [17]:
False

False

In [18]:
type( True )

bool

## Boolean operators <code>and</code>, <code>or</code>, and <code>not</code>

* <code>and</code> and <code>or</code> are "binary operators", meaning you slap them in between two truth values to make one value.
* Expression is evaluated left-to-right

In [19]:
False and False

False

In [20]:
True and True

True

In [21]:
True or False

True

In [22]:
False or True

True

In [23]:
True or True

True

In [24]:
False or False

False

In [25]:
my_bool_value = True and False
print( my_bool_value )

False


<code>not</code> is a unary operator that negates the value after it.

In [26]:
not True

False

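A computer science subtlety: <code>and</code> binds more tightly than <code>or</code>, and <code>or</code> is a [short-circuit operator](https://en.wikipedia.org/wiki/Short-circuit_evaluation): once its left side is <code>True</code>, the right side is never evaluated.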

In [27]:
True or False and False

True

In [28]:
True or False and False # True

True
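
To make short-circuiting visible, here is a minimal sketch that uses a helper function to record which operands actually get evaluated. The helper <code>noisy</code> and the list <code>calls</code> are hypothetical names invented for this illustration, and defining functions with <code>def</code> is beyond what Day 1 covers.

```python
calls = []  # labels of the operands that were actually evaluated

def noisy(value, label):
    """Record that this operand was evaluated, then hand back its value."""
    calls.append(label)
    return value

# `or` stops as soon as the left side is True
result_or = noisy(True, "left") or noisy(False, "right")
calls_during_or = list(calls)
print(result_or, calls_during_or)    # True ['left']

# `and` stops as soon as the left side is False
calls.clear()
result_and = noisy(False, "left") and noisy(True, "right")
calls_during_and = list(calls)
print(result_and, calls_during_and)  # False ['left']

# `and` groups more tightly than `or`
print(True or False and False)       # True: read as True or (False and False)
print((True or False) and False)     # False: parentheses change the grouping
```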

## Some comparison operators

* ```<``` less than
* ```<=``` less than or equal to
* ```>``` greater than
* ```>=``` greater than or equal to
* ```==``` is equal to
* ```!=``` is not equal to

Note that the double equals sign <code>==</code> is a comparison operator, not an assignment!!

In [29]:
5 < 6

True

In [30]:
6 <= 6

True

In [31]:
-6 <= 6

True

In [32]:
6 != 6

False

In [33]:
True == True

True

In [34]:
not (True == True)

False
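
Comparisons produce bools, so they combine naturally with <code>and</code>, <code>or</code> and <code>not</code>. The sketch below checks whether a value falls inside a range; the name <code>ph</code> and its value are hypothetical. Python also allows the two comparisons to be chained into one expression.

```python
ph = 7.4  # hypothetical measurement

in_range = 6.5 <= ph and ph <= 8.0   # two comparisons joined by `and`
in_range_chained = 6.5 <= ph <= 8.0  # the same test, written as a chain

print(in_range, in_range_chained)    # True True

is_neutral = (ph == 7.0)  # `==` compares; a single `=` assigns
print(is_neutral)         # False
```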

## Using <code>whos</code> command to keep track of named values

In [35]:
whos

Variable        Type    Data/Info
---------------------------------
a_value         int     42
my_bool_value   bool    False


## Iterable Data Types: Strings (<code>str</code>)

* A data type that contains zero or more characters
* Strings are surrounded, a.k.a. "delimited" by matching single or double quotes
* You choose whether to use single or double quotes based on what's in the string.
* Escape characters: a backslash followed by another character renders a special character
    * ```\n```: New line
    * ```\t```: Tab
    * ```\"```: Quote character (not end of string)

In [36]:
"Hello, world!"

'Hello, world!'

In [37]:
'Hello, world!'

'Hello, world!'

I repeat: ***No difference between single-quoted and double-quoted strings!!!!*** I promise!

In [38]:
"Can't"

"Can't"

In [39]:
'"Really," she said?'

'"Really," she said?'

In [40]:
"I said, \"Hi my name is Chris\""

'I said, "Hi my name is Chris"'

In [41]:
" First line\n Second line"

' First line\n Second line'

In [42]:
print( " First line\n Second line" )

 First line
 Second line


By the way, I'm ***not*** talking about the backtick `, which shares a key with the tilde ~ character. Backtick is ***different*** than a single quote ', which shares a key with the double quote ".
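
As a small sketch of the escape characters listed above, the cell below builds a two-line, tab-separated table. The gene name and count are made-up placeholder data.

```python
header = "gene\tcount"   # \t puts a tab between the two column names
row = "TP53\t12"         # made-up example data
table = header + "\n" + row   # \n starts a new line

print(table)

quoted = "She said, \"Hi\""   # \" keeps the quote from ending the string
print(quoted)

# Each escape sequence counts as a single character
print(len("a\tb"))   # 3
```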

## Iterable Data Types: Lists (<code>list</code>)

* Container for a collection of values
* Can all be the same type or different, doesn't matter.
* Items delimited by commas, all surrounded by brackets [], not parentheses ()
* The order of the values in the list is remembered

![list indexing in python](elementsinalists.png)

In [43]:
a_list = [ 1, 2, 3, 1, "a dog", 'a cat' ]

### Get the ith element from a list using bracket notation

In [44]:
a_list[0]

1

In [45]:
a_list[4]

'a dog'

### Negative index counts from the back of the list

In [46]:
a_list[-1]

'a cat'

### Get position of a value within a list using <code>.index()</code>

In [47]:
a_list.index('a cat')

5
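
Two details of <code>.index()</code> are worth a quick sketch: it reports only the first match, and it raises a <code>ValueError</code> when the value is not in the list. The <code>try</code>/<code>except</code> syntax used to catch the error is not covered on Day 1, and the name <code>position</code> is made up for this example.

```python
a_list = [1, 2, 3, 1, "a dog", "a cat"]

# The value 1 appears twice; .index() reports the first position only
print(a_list.index(1))   # 0

# Negative indexes keep counting backwards from the end
print(a_list[-2])        # a dog

# Asking for a value that is not in the list raises ValueError
try:
    position = a_list.index("a bird")
except ValueError:
    position = None
    print("'a bird' is not in the list")
```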

### Use the "unpacking" syntax to get values out of small lists

In [48]:
a_few_things = [ "hello", "goodbye", 42 ]

In [49]:
a_few_things

['hello', 'goodbye', 42]

In [50]:
first, second, third = a_few_things

In [51]:
first

'hello'

In [52]:
second

'goodbye'

In [53]:
third

42
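
Unpacking only works when the number of names matches the number of values. A minimal sketch of what happens otherwise (the names <code>only_one</code>, <code>only_two</code> and <code>message</code> are made up, and <code>try</code>/<code>except</code> is not covered on Day 1):

```python
a_few_things = ["hello", "goodbye", 42]

# Three names for three values: works
first, second, third = a_few_things
print(first, second, third)

# Two names for three values: Python raises a ValueError
try:
    only_one, only_two = a_few_things
except ValueError as err:
    message = str(err)
    print("ValueError:", message)
```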

## Iterable Data Types: Dictionaries (<code>dict</code>)

* A <code>dict</code> is a one-way associative array, where "keys" are mapped to "values."
* Note: Since Python 3.7, a <code>dict</code> remembers the order in which you inserted the key-value pairs
    * in older versions of Python you needed <code>collections.OrderedDict</code> for that

![](dict.png)

### Create a <code>dict</code> with stuff in it

The keys are separated from the values by a colon (:), and the key-value pairs are separated by commas.

In [54]:
toy_dict = { 1 : 'a', 2 : 'b', 3: 'c'}

In [55]:
toy_dict

{1: 'a', 2: 'b', 3: 'c'}

### Access an element in a <code>dict</code> using its key and bracket notation <code>[]</code>

In [56]:
info = { 'first name': "Chris",
         "last name" : "Coletta"}

In [57]:
info

{'first name': 'Chris', 'last name': 'Coletta'}

In [58]:
info['first name']

'Chris'

### Keys, not values, go inside the brackets <code>[]</code>, or you get an error

In [59]:
info['Chris']

KeyError: 'Chris'
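
If you are not sure whether a key exists, two common ways to avoid the <code>KeyError</code> are the <code>in</code> operator and the <code>.get()</code> method. A minimal sketch, where the key <code>'middle name'</code> and the fallback <code>'n/a'</code> are made up for illustration:

```python
info = {"first name": "Chris", "last name": "Coletta"}

# `in` checks the keys, not the values
print("first name" in info)   # True
print("Chris" in info)        # False

# .get() returns None (or a fallback you choose) instead of raising KeyError
middle = info.get("middle name")
middle_or_default = info.get("middle name", "n/a")
print(middle)              # None
print(middle_or_default)   # n/a
```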

### Create an empty <code>dict</code>

Declare empty dict with {}, or dict().

In [60]:
type( {} )

dict

### Add a new key-value pair to an existing <code>dict</code> using bracket notation <code>[]</code>

In [61]:
toy_dict['new_key'] = 'new_value'

In [62]:
toy_dict

{1: 'a', 2: 'b', 3: 'c', 'new_key': 'new_value'}

In [63]:
toy_dict = {}

In [64]:
toy_dict

{}

In [65]:
toy_dict[1] = 'a'
toy_dict[2] = 'b'
toy_dict[3] = 'c'
toy_dict['new_key'] = 'new_value'

In [66]:
toy_dict

{1: 'a', 2: 'b', 3: 'c', 'new_key': 'new_value'}

### Get just the keys or just the values

Every <code>dict</code> has the built-in functions ("methods" in Pythonic speak) <code>.keys()</code> and <code>.values()</code>

In [67]:
toy_dict.keys()

dict_keys([1, 2, 3, 'new_key'])

In [68]:
toy_dict.values()

dict_values(['a', 'b', 'c', 'new_value'])

In [69]:
toy_dict

{1: 'a', 2: 'b', 3: 'c', 'new_key': 'new_value'}

### Values can be any other type, including iterables

In [70]:
{ "former_value": ['a', 'b', 'c' ] }

{'former_value': ['a', 'b', 'c']}

## Iterable Data Types: Sets (<code>set</code>)

* Similar to math concept of sets; has operations like union, intersection, etc.
* Sets are unindexed, unordered, and contain no duplicates.
* My personal favorite of the Python standard types!

### Create a <code>set</code> with stuff in it

Declare a set by putting values inside braces, or by passing an iterable to <code>set()</code>.

In [71]:
set('GATTACA')

{'A', 'C', 'G', 'T'}

In [72]:
a_set = {'set', 'of', 'words', 'of'}

In [73]:
a_set

{'of', 'set', 'words'}

### Create an empty <code>set</code>

Make an empty set using <code>set()</code>.

In [74]:
empty_set = set() # not {}, that would be an empty dict

In [75]:
empty_set

set()

In [76]:
empty_set.add( 'hi' )

In [77]:
empty_set

{'hi'}

In [78]:
first = {1,2,3,4,5}
second = {4,5,6,7,8}

### Set Union Operator (<code>|</code>) - "or"

![](set-union.jpg)

In [79]:
first | second

{1, 2, 3, 4, 5, 6, 7, 8}

### Set Intersection Operator (<code>&</code>) - "and"

![](set-intersection.jpg)

In [80]:
first & second 

{4, 5}

### Set Difference Operator (<code>-</code>)

![](set-difference.jpg)

In [81]:
first - second

{1, 2, 3}

### Set Symmetric Difference Operator (<code>^</code>)

![](p2c6-symmdiff.png)

In [82]:
first ^ second

{1, 2, 3, 6, 7, 8}
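
The four set operators pulled together in one sketch, using two made-up gene lists. The set names and their contents are hypothetical. Each operator also has a method spelling (<code>.union()</code>, <code>.intersection()</code>, <code>.difference()</code>, <code>.symmetric_difference()</code>) that gives the same result.

```python
# Hypothetical gene lists, for illustration only
upregulated = {"TP53", "MYC", "EGFR"}
in_pathway = {"MYC", "EGFR", "KRAS"}

either = upregulated | in_pathway       # union: in one set or the other
both = upregulated & in_pathway         # intersection: in both sets
only_up = upregulated - in_pathway      # difference: in the first set only
exactly_one = upregulated ^ in_pathway  # symmetric difference: in one set but not both

# sorted() is used only to print the members in a predictable order
print(sorted(either))
print(sorted(both))
print(sorted(only_up))
print(sorted(exactly_one))

# The method spelling gives the same answer as the operator
print(both == upregulated.intersection(in_pathway))   # True
```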

## How many elements in an iterable? Use <code>len()</code>

In [83]:
len( first ^ second )

6

In [84]:
a_list

[1, 2, 3, 1, 'a dog', 'a cat']

In [85]:
len(a_list)

6

## Can you change a value's type? Yes!

Use these functions to "[coerce](https://en.wikipedia.org/wiki/Type_conversion)" a value from one type to another:

* ```int()```
* ```float()```
* ```bool()```
* ```list()```
* ```dict()```
* ```set()```
* et al.

In [86]:
type( 45 )

int

In [87]:
type( '45' )

str

In [88]:
a_string = '45'

In [89]:
56 + a_string

TypeError: unsupported operand type(s) for +: 'int' and 'str'

In [90]:
a_string

'45'

In [91]:
int( a_string )

45

In [92]:
56 + int(a_string)

101

In [93]:
float( '-45.0345' )

-45.0345

In [94]:
int( float( '-45.0345' ) )

-45

In [95]:
float( 45 )

45.0

In [96]:
str( 56 ) + a_string

'5645'

In [97]:
list( "listify me!" )

['l', 'i', 's', 't', 'i', 'f', 'y', ' ', 'm', 'e', '!']

In [98]:
round( 9.9 )

10

In [99]:
round( -9.9 )

-10

In [100]:
int( round( 9.9 ) )

10

In [101]:
set( "listify me!" )

{' ', '!', 'e', 'f', 'i', 'l', 'm', 's', 't', 'y'}

In [102]:
float( 3 )

3.0

In [103]:
int( 3.14159 )

3

In [104]:
bool( "a_string" )

True

In [105]:
bool( "" )

False

In [106]:
bool(  )

False
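
A few conversion gotchas, sketched in one cell: <code>int()</code> chops off the decimal part rather than rounding, <code>round()</code> sends exact halves to the nearest even number, and <code>bool()</code> of any non-empty string is <code>True</code>. The name <code>converted</code> is made up, and <code>try</code>/<code>except</code> is not covered on Day 1.

```python
# int() truncates toward zero; round() goes to the nearest whole number
print(int(9.9), round(9.9))     # 9 10
print(int(-9.9), round(-9.9))   # -9 -10

# Exact halves are rounded to the nearest even number
print(round(2.5), round(3.5))   # 2 4

# int() cannot read a decimal string directly; go through float() first
try:
    converted = int("45.0345")
except ValueError:
    converted = int(float("45.0345"))
print(converted)                # 45

# Any non-empty string is True, even the text "False"
print(bool("False"))            # True
print(bool(0), bool([]))        # False False
```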

## Iterating over items in a <code>list</code> using a <code>for</code> loop

* Statements you want to be repeated inside the loop should be *indented* below the first line.
* Use the <code>TAB</code> key to indent.
* In between the <code>for</code> keyword and the <code>in</code> is the placeholder name whose value changes each time through the loop.

In [107]:
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'June']

In [108]:
print( "before the loop" )

for m in months:
    print( m )


print( "after the loop" )

before the loop
Jan
Feb
Mar
Apr
May
June
after the loop
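
Indentation is what decides which statements belong to the loop. In this minimal sketch, the two indented lines run once per month and the final, unindented line runs once at the end. The names <code>total_chars</code> and <code>times_through_loop</code> are made up for this example.

```python
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'June']

total_chars = 0          # running total of letters in the month names
times_through_loop = 0   # how many times the loop body has run

for m in months:
    # Both indented lines below run once per month
    total_chars = total_chars + len(m)
    times_through_loop = times_through_loop + 1

# Not indented, so this runs once, after the loop has finished
print(times_through_loop, "months,", total_chars, "characters in total")
```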


### FYI: your temporary "placeholder" variable remains after the for loop

In [109]:
m

'June'

In [110]:
del m

In [111]:
m

NameError: name 'm' is not defined

## Iterating over items in a <code>dict</code> with a <code>for</code> loop and <code>.items()</code>

Use the unpacking syntax within the <code>for .. in</code> syntax to directly assign names to the key and value separately.

In [112]:
num_days_in_month = { 'Jan' : 31, 
                     'Feb' : 28,
                     'Mar' : 31, 
                     'Apr' : 30 }

In [113]:
num_days_in_month

{'Jan': 31, 'Feb': 28, 'Mar': 31, 'Apr': 30}

In [114]:
num_days_in_month.items()

dict_items([('Jan', 31), ('Feb', 28), ('Mar', 31), ('Apr', 30)])

In [115]:
for m, d in num_days_in_month.items():
    print( "There are", d, "days in", m )

There are 31 days in Jan
There are 28 days in Feb
There are 31 days in Mar
There are 30 days in Apr


### Without using <code>.items()</code> iterating over a dict will give you just the keys

In [116]:
for thing in num_days_in_month:
    print( thing )

Jan
Feb
Mar
Apr
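
Both styles of loop can do the same job. The sketch below adds up the days twice, once by looping over the keys and looking up each value, and once with <code>.items()</code>. The names <code>total_from_keys</code> and <code>total_from_items</code> are made up for this example.

```python
num_days_in_month = {'Jan': 31, 'Feb': 28, 'Mar': 31, 'Apr': 30}

# Style 1: loop over the keys and look up each value with brackets
total_from_keys = 0
for month in num_days_in_month:
    total_from_keys = total_from_keys + num_days_in_month[month]

# Style 2: let .items() hand over the key and the value together
total_from_items = 0
for month, days in num_days_in_month.items():
    total_from_items = total_from_items + days

print(total_from_keys, total_from_items)   # 120 120
```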


### Advanced: Use a "dict comprehension" to switch directionality from value to key

In [117]:
{ value: key for key, value in num_days_in_month.items() }

{31: 'Mar', 28: 'Feb', 30: 'Apr'}
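
Notice that 'Jan' is missing from the output above: keys must be unique, so when two months share the value 31, the later one overwrites the earlier one. One possible way to keep both is to map each value to a list of keys, sketched below. The name <code>months_by_length</code> is made up, and the <code>if</code> statement is not covered on Day 1.

```python
num_days_in_month = {'Jan': 31, 'Feb': 28, 'Mar': 31, 'Apr': 30}

# Straight inversion: 'Jan' is overwritten by 'Mar' because both map to 31
inverted = {days: month for month, days in num_days_in_month.items()}
print(inverted)

# Alternative: collect every month that shares a length into a list
months_by_length = {}
for month, days in num_days_in_month.items():
    if days not in months_by_length:
        months_by_length[days] = []      # first month seen with this length
    months_by_length[days].append(month)

print(months_by_length)   # {31: ['Jan', 'Mar'], 28: ['Feb'], 30: ['Apr']}
```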

## Day 1 review

1. Python ecosystem of tools
2. Jupyter Notebook is code, output and documentation all in one document
3. Type code into cells, and to run them you press Shift-Enter
4. Tab completion is nice
5. Different data types for different data
6. Operators take one or more input values and turn them into other values *based on the input values' types*
7. Converting data from one type to another using the function syntax, e.g., <code>int()</code>
8. Iterating over iterables using a for loop