NIA Intro to Python Class - May 17, 2017

# Day 3: Control Flow, plus advanced data types

Part 1 of today's talk focuses on control flow ([wiki article](http://en.wikipedia.org/wiki/Control_flow)), the part of a programming language's syntax that lets a program follow one or more branches of instructions conditionally, or repeat instructions in loops.
* <code>if</code>/<code>elif</code>/<code>else</code>
* <code>while</code> loops
* <code>for</code> loops
* nested <code>for</code> loops

Part 2 of today's talk covers two new data types that are third-party extensions to Python but are used almost universally in data analysis.
* Matrices using [NumPy arrays](https://docs.scipy.org/doc/numpy-dev/user/quickstart.html)
* DataFrames using [Pandas](http://pandas.pydata.org/pandas-docs/stable/10min.html)

---

## Preview for tomorrow:

[These](https://seaborn.pydata.org/examples/index.html) are just a few of the types of data visualizations you can do in Python.

## Review of material thus far

1. Familiarizing yourself with the Jupyter Notebook IDE
    * code completion using Tab key
    * print out all your variables
    * syntax highlighting
    * more
2. Python scalar data types
    * <code>int</code>
    * <code>float</code>
    * <code>bool</code>
3. Python iterable data types
    * <code>str</code>
    * <code>list</code>
    * <code>tuple</code>
    * <code>dict</code>
    * <code>set</code>
4. Operators and Operations on Iterables
    * add items to and delete items from iterables
    * split, slice and concatenate
    * Nested iterables
    * Basic sorting

## Conditional Statements

### The <code>if</code> statement (a simple conditional)

* Use the keyword <code>if</code>, followed by the test, followed by a colon.
* Lines that should run only if the test is true are indented underneath.

In [4]:
if False:
    print( "True fact.")
    print( "yup" )
print( "This line prints regardless." )

This line prints regardless.


### <code>if</code>/<code>else</code> statements (a one-alternative conditional)

* The <code>else</code> statement goes at the same indentation level as the matching <code>if</code> statement:

In [6]:
if False:
    print( "True fact.")
    print( "Yup." )
else:
    print( "This ain't gonna print." )
print( "This line prints regardless.")

This ain't gonna print.
This line prints regardless.


### Simple tests and compound tests

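Use a comparison operator (such as <code>&lt;</code> or <code>==</code>) inside the conditional, and use the boolean operators <code>and</code>, <code>or</code> and <code>not</code> to combine tests.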

In [10]:
some_value = -1

In [8]:
some_value < 0

True

In [11]:
if some_value < 0:
    print( str( some_value), "has a negative sign.")
else:
    print( str( some_value) , "doesn't have a negative sign.")

-1 has a negative sign.


In [15]:
test_value = 0

if test_value < 10 or test_value > 20:
    print( str( test_value ), "is out-of-bounds.")
else:
    print( str( test_value ) , "is inbounds.")

0 is out-of-bounds.


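Boolean expressions are evaluated left to right and follow an order of operations: <code>not</code> binds tightest, then <code>and</code>, then <code>or</code>. Use parentheses to make the intended grouping explicit.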
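
A minimal sketch of why the parentheses matter. The flag names are made up for illustration; the point is that the same three values give different answers depending on how the tests are grouped.

```python
# Hypothetical flags, chosen only to illustrate grouping.
is_weekend = True
is_holiday = False
is_raining = False

# `and` binds tighter than `or`, so Python reads this as
# is_weekend or (is_holiday and is_raining)
implicit = is_weekend or is_holiday and is_raining

# Parentheses force the `or` to be evaluated first.
explicit = (is_weekend or is_holiday) and is_raining

print( implicit )   # True
print( explicit )   # False
```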

### <code>any()</code> and <code>all()</code>

In [16]:
some_conditions = [False, False, False, True, True]

In [17]:
any( some_conditions )

True

In [18]:
all( some_conditions )

False
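
<code>any()</code> returns True if at least one item is true, and <code>all()</code> returns True only if every item is true. In practice the tests are often computed on the fly rather than stored in a list first. A small sketch, with a hypothetical list of readings:

```python
readings = [12, 18, 25, 7]   # hypothetical values

any_high = any( r > 20 for r in readings )     # is at least one above 20?
all_positive = all( r > 0 for r in readings )  # is every one above 0?

print( any_high )       # True, because 25 > 20
print( all_positive )   # True

# Edge case: on an empty iterable, any() is False and all() is True.
print( any([]), all([]) )   # False True
```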

### By the way....

Coercing True or False values into integers is one way to count them.

In [19]:
int(True)

1

In [20]:
int(False)

0

In [21]:
some_conditions

[False, False, False, True, True]

In [22]:
sum( some_conditions )

2

### <code>if</code>/<code>elif</code>/<code>else</code> (multi-test conditional)

In [25]:
test_value = 15

if test_value < 10:
    print( str( test_value ), "is too low")
elif test_value > 20:
    print( str( test_value ), "is too high")
else:
    print( str( test_value ) , "is just right.")

15 is just right.


### The <code>pass</code> statement: no naked <code>if</code> statements!
If you want Python to do nothing when the condition is true, you can't just leave a blank line; you have to use the keyword <code>pass</code>, properly indented.

In [30]:
planet_earth = { 'population' : 6e9, 'color' : 'blue' }

# There's nothing I can do...
if planet_earth['color'] != 'blue':
    print( "Floating in my tin can." )
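
The cell above has a real statement in its body, so it does not actually need <code>pass</code>. A minimal sketch of the do-nothing case, reusing the same <code>planet_earth</code> dictionary (the messages are made up for illustration):

```python
planet_earth = { 'population' : 6e9, 'color' : 'blue' }

if planet_earth['color'] == 'blue':
    pass    # placeholder: do nothing, but keep the block syntactically valid
else:
    print( "Something has changed." )

print( "Done." )
```

Deleting the <code>pass</code> line would leave the <code>if</code> with an empty body, which Python rejects with an <code>IndentationError</code>.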

### One-liner if statements

You can put a simple conditional on one line if you want.

In [32]:
if 'man' == 5: the_devil = 6

## <a name="while"><code>while</code> loops</a>

A while loop evaluates a boolean expression and runs the code in the loop body over and over for as long as the expression evaluates to true.

In [34]:
age = 15
while True:
    print( "No beer, you", str(age) + "-year-old, wait until next year.")
    age += 1
    if age >= 21:
        break

print( "You're 21, it's party time!")

No beer, you 15-year-old, wait until next year.
No beer, you 16-year-old, wait until next year.
No beer, you 17-year-old, wait until next year.
No beer, you 18-year-old, wait until next year.
No beer, you 19-year-old, wait until next year.
No beer, you 20-year-old, wait until next year.
You're 21, it's party time!


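The loop above is structured to "loop forever" with <code>while True</code> and exits through a conditional statement with the <code>break</code> keyword. The other way to structure a while loop is to put the test in the <code>while</code> line itself, for example <code>while age &lt; 21:</code>.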
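
As a sketch of the version with the test in the <code>while</code> line, the cell below reuses the <code>age</code> example and should produce the same output as the <code>break</code> version. No <code>break</code> is needed because the loop stops as soon as the test is false.

```python
age = 15
while age < 21:    # the test is checked before every pass through the loop
    print( "No beer, you", str(age) + "-year-old, wait until next year." )
    age += 1

print( "You're 21, it's party time!" )
```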

## <a name="for"><code>for</code> loops</a>

* "Python’s <code>for</code> statement iterates over the items of any sequence (a list or a string), in the order that they appear in the sequence." [reference](http://docs.python.org/2/tutorial/controlflow.html).
* Often, if you know exactly how many times you need to loop, you'll use the <code>range()</code> function, which produces a sequence of numbers for the for loop to iterate over. In Python 3 this is a <code>range</code> object rather than a list.
* Each time through the loop, Python will put the next item in the sequence into the variable whose name you declare by putting it between the <code>for</code> and <code>in</code> keywords.
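
A short sketch of <code>range()</code> in Python 3, where the numbers are produced on demand and wrapping the result in <code>list()</code> is only needed to display them. The start, stop and step values are arbitrary.

```python
r = range(2, 11, 3)      # start at 2, stop before 11, step by 3
print( r )               # range(2, 11, 3)
print( list(r) )         # [2, 5, 8]

total = 0
for n in range(5):       # n takes the values 0, 1, 2, 3, 4
    total += n
print( total )           # 10
```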

### Iterate N times

In [35]:
# range counts from 0
list( range(10) )

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [36]:
for i in range(10):
    if i == 1:
        suffix = 'st'
    elif i == 2:
        suffix = 'nd'
    elif i == 3:
        suffix = 'rd'
    else:
        suffix = 'th'
    print( str(i) + suffix, "time through the for loop." )

0th time through the for loop.
1st time through the for loop.
2nd time through the for loop.
3rd time through the for loop.
4th time through the for loop.
5th time through the for loop.
6th time through the for loop.
7th time through the for loop.
8th time through the for loop.
9th time through the for loop.


### Iterate over a list of objects

In [37]:
name_list = ['dick', 'jane', 'spot', 'mom', 'dad' ]

In [38]:
name_list

['dick', 'jane', 'spot', 'mom', 'dad']

In [39]:
for name in name_list:
    print( "see", name, "run!" )

see dick run!
see jane run!
see spot run!
see mom run!
see dad run!


### Unpacking nested iterables inside the <code>for</code> loop
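* A tuple can be unpacked into separate variables, either with an assignment or right inside the <code>for</code> statement (<code>for i, name in zipped_together:</code>). The loop at the end of this section takes the longer route and indexes into each tuple.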

In [40]:
name_list

['dick', 'jane', 'spot', 'mom', 'dad']

In [41]:
num_names = len( name_list )

In [42]:
num_names

5

In [43]:
indices = list( range( num_names ) )

In [45]:
indices

[0, 1, 2, 3, 4]

In [46]:
zipped_together = list( zip( indices, name_list ) )

In [47]:
zipped_together

[(0, 'dick'), (1, 'jane'), (2, 'spot'), (3, 'mom'), (4, 'dad')]

In [50]:
name1, name2, name3, name4, name5 = name_list

In [51]:
name2

'jane'

In [49]:
for the_tuple in zipped_together:
    i = the_tuple[0]
    name = the_tuple[1]
    print( "Line", i, "- See", name, "run!" )

Line 0 - See dick run!
Line 1 - See jane run!
Line 2 - See spot run!
Line 3 - See mom run!
Line 4 - See dad run!
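
Unpacking can also happen in the <code>for</code> line itself, which removes the two indexing lines. A minimal sketch, rebuilding <code>zipped_together</code> so that the cell runs on its own:

```python
name_list = ['dick', 'jane', 'spot', 'mom', 'dad']
zipped_together = list( zip( range( len(name_list) ), name_list ) )

for i, name in zipped_together:    # each 2-tuple is unpacked into i and name
    print( "Line", i, "- See", name, "run!" )
```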


### Using <code>enumerate()</code> to count off for you

* The code above is equivalent to using <code>enumerate()</code>.
* Use the function <code>enumerate()</code> if you need to slap an index onto an iterable you already have.
* Each time through the loop <code>enumerate()</code> returns a tuple of two values, the first being the index, and the second being the value.

In [52]:
for i, name in enumerate( name_list ):
    print( "Line", i, "- See", name, "run!" )

Line 0 - See dick run!
Line 1 - See jane run!
Line 2 - See spot run!
Line 3 - See mom run!
Line 4 - See dad run!
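
If the count should begin at 1 instead of 0, <code>enumerate()</code> accepts a <code>start</code> argument. A small sketch using the same list (the variable name <code>line_number</code> is just for illustration):

```python
name_list = ['dick', 'jane', 'spot', 'mom', 'dad']

for line_number, name in enumerate( name_list, start=1 ):
    print( "Line", line_number, "- See", name, "run!" )
```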


## Nested <code>for</code> loops

You can put for loops inside other for loops. For example, here's a brute-force way to create a multiplication table.

In [53]:
all_rows = []

for i in range(1,13):
    a_row = []
    for j in range(1,13):
        a_row.append( i * j )
    all_rows.append( a_row )

In [54]:
all_rows

[[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
 [2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24],
 [3, 6, 9, 12, 15, 18, 21, 24, 27, 30, 33, 36],
 [4, 8, 12, 16, 20, 24, 28, 32, 36, 40, 44, 48],
 [5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60],
 [6, 12, 18, 24, 30, 36, 42, 48, 54, 60, 66, 72],
 [7, 14, 21, 28, 35, 42, 49, 56, 63, 70, 77, 84],
 [8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96],
 [9, 18, 27, 36, 45, 54, 63, 72, 81, 90, 99, 108],
 [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120],
 [11, 22, 33, 44, 55, 66, 77, 88, 99, 110, 121, 132],
 [12, 24, 36, 48, 60, 72, 84, 96, 108, 120, 132, 144]]

In [59]:
[ row[-1] for row in all_rows ]

[12, 24, 36, 48, 60, 72, 84, 96, 108, 120, 132, 144]
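
The same table can also be built with a nested list comprehension, the construct used in the last cell above. This is a sketch of an alternative, not a replacement for the explicit loops, which are easier to read when you are starting out.

```python
# Inner comprehension builds one row; outer comprehension builds all rows.
table = [ [ i * j for j in range(1, 13) ] for i in range(1, 13) ]

last_column = [ row[-1] for row in table ]

print( table[2][3] )        # 12, the row for 3 and the column for 4
print( last_column[:3] )    # [12, 24, 36]
```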

## NumPy arrays

* Rather than nested lists, you can use a matrix.
* Use one if all of your data is of the same type (ints, floats, bools).
* Row indices and column indices count from 0!
* NumPy matrices have basic statistics built in.

In [60]:
# import the package and give it a nickname
import numpy as np

### Initialize a new matrix from a nested list

In [61]:
mult_table = np.array( all_rows )

In [62]:
mult_table

array([[  1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12],
       [  2,   4,   6,   8,  10,  12,  14,  16,  18,  20,  22,  24],
       [  3,   6,   9,  12,  15,  18,  21,  24,  27,  30,  33,  36],
       [  4,   8,  12,  16,  20,  24,  28,  32,  36,  40,  44,  48],
       [  5,  10,  15,  20,  25,  30,  35,  40,  45,  50,  55,  60],
       [  6,  12,  18,  24,  30,  36,  42,  48,  54,  60,  66,  72],
       [  7,  14,  21,  28,  35,  42,  49,  56,  63,  70,  77,  84],
       [  8,  16,  24,  32,  40,  48,  56,  64,  72,  80,  88,  96],
       [  9,  18,  27,  36,  45,  54,  63,  72,  81,  90,  99, 108],
       [ 10,  20,  30,  40,  50,  60,  70,  80,  90, 100, 110, 120],
       [ 11,  22,  33,  44,  55,  66,  77,  88,  99, 110, 121, 132],
       [ 12,  24,  36,  48,  60,  72,  84,  96, 108, 120, 132, 144]])

### The <code>.shape</code> attribute

In [72]:
type(mult_table.shape)

tuple

In [73]:
type(mult_table.T)

numpy.ndarray

In [74]:
type(mult_table.sum)

builtin_function_or_method

In [76]:
mult_table.sum?

In [65]:
mult_table.sum()

6084

In [66]:
mult_table.mean()

42.25

In [67]:
mult_table.mean(axis=0)

array([  6.5,  13. ,  19.5,  26. ,  32.5,  39. ,  45.5,  52. ,  58.5,
        65. ,  71.5,  78. ])
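
To make <code>.shape</code> and the <code>axis</code> argument concrete, here is a sketch on a matrix small enough to check by hand. The array <code>small</code> is made up for illustration.

```python
import numpy as np

small = np.array( [ [1, 2, 3],
                    [4, 5, 6] ] )   # hypothetical 2-row, 3-column matrix

print( small.shape )          # (2, 3), meaning (rows, columns)
print( small.sum(axis=0) )    # [5 7 9]   one value per column
print( small.sum(axis=1) )    # [ 6 15]   one value per row
print( small.mean() )         # 3.5       over all elements
```

With <code>axis=0</code> the operation runs down the rows and leaves one result per column, which is why the cell above returns 12 column means.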

### Declare an empty matrix of a fixed size

In [79]:
np.imag?

In [81]:
mult_table = np.empty((12,12))
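
One caveat worth a sketch: <code>np.empty</code> only reserves memory and does not set the values, so the contents are arbitrary until you assign them. <code>np.zeros</code> is a common alternative when a known starting value is wanted. The sizes below are arbitrary.

```python
import numpy as np

scratch = np.empty( (2, 3) )    # values are whatever happened to be in memory
zeros = np.zeros( (2, 3) )      # every element is 0.0

print( scratch.shape, scratch.dtype )   # (2, 3) float64
print( zeros.sum() )                    # 0.0
```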

### Indexing on NumPy Arrays

* Use brackets <code>[]</code>
* Inside brackets, row/column indices are separated by a comma

In [83]:

for i in range(1,13):
    for j in range(1,13):
        mult_table[ i-1, j-1 ] = i * j

mult_table

array([[   1.,    2.,    3.,    4.,    5.,    6.,    7.,    8.,    9.,
          10.,   11.,   12.],
       [   2.,    4.,    6.,    8.,   10.,   12.,   14.,   16.,   18.,
          20.,   22.,   24.],
       [   3.,    6.,    9.,   12.,   15.,   18.,   21.,   24.,   27.,
          30.,   33.,   36.],
       [   4.,    8.,   12.,   16.,   20.,   24.,   28.,   32.,   36.,
          40.,   44.,   48.],
       [   5.,   10.,   15.,   20.,   25.,   30.,   35.,   40.,   45.,
          50.,   55.,   60.],
       [   6.,   12.,   18.,   24.,   30.,   36.,   42.,   48.,   54.,
          60.,   66.,   72.],
       [   7.,   14.,   21.,   28.,   35.,   42.,   49.,   56.,   63.,
          70.,   77.,   84.],
       [   8.,   16.,   24.,   32.,   40.,   48.,   56.,   64.,   72.,
          80.,   88.,   96.],
       [   9.,   18.,   27.,   36.,   45.,   54.,   63.,   72.,   81.,
          90.,   99.,  108.],
       [  10.,   20.,   30.,   40.,   50.,   60.,   70.,   80.,   90.,
         100.,  110.

In [86]:
mult_table[3:5,-1]

array([ 48.,  60.])
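
A few more indexing patterns, sketched on a small made-up matrix so that each result can be checked by eye:

```python
import numpy as np

grid = np.arange(12).reshape(3, 4)   # hypothetical 3x4 matrix holding 0..11
# [[ 0  1  2  3]
#  [ 4  5  6  7]
#  [ 8  9 10 11]]

print( grid[1, 2] )      # 6           single element: row 1, column 2
print( grid[0] )         # [0 1 2 3]   entire first row
print( grid[:, -1] )     # [ 3  7 11]  entire last column
print( grid[1:3, 0] )    # [4 8]       rows 1 and 2 of the first column
```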

### Missing data

* Missing data is often represented as <code>np.nan</code>, which stands for Not a Number.
* <code>np.nan</code> is a float, so integer arrays have no way to represent missing data; convert to float first.
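
A sketch of what happens when you try to store <code>np.nan</code> in an integer array, and of the float conversion that avoids the problem. The array <code>counts</code> is made up for illustration.

```python
import numpy as np

counts = np.array( [1, 2, 3] )      # integer dtype
try:
    counts[0] = np.nan              # an integer array cannot store nan
except ValueError as err:
    print( "ValueError:", err )

as_float = counts.astype(float)     # make a float copy first
as_float[0] = np.nan                # now the assignment works
print( as_float )
print( np.isnan(as_float).sum() )   # 1, the number of missing values
```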

In [88]:
decimal_mult_table = mult_table.astype(float)

In [89]:
decimal_mult_table

array([[   1.,    2.,    3.,    4.,    5.,    6.,    7.,    8.,    9.,
          10.,   11.,   12.],
       [   2.,    4.,    6.,    8.,   10.,   12.,   14.,   16.,   18.,
          20.,   22.,   24.],
       [   3.,    6.,    9.,   12.,   15.,   18.,   21.,   24.,   27.,
          30.,   33.,   36.],
       [   4.,    8.,   12.,   16.,   20.,   24.,   28.,   32.,   36.,
          40.,   44.,   48.],
       [   5.,   10.,   15.,   20.,   25.,   30.,   35.,   40.,   45.,
          50.,   55.,   60.],
       [   6.,   12.,   18.,   24.,   30.,   36.,   42.,   48.,   54.,
          60.,   66.,   72.],
       [   7.,   14.,   21.,   28.,   35.,   42.,   49.,   56.,   63.,
          70.,   77.,   84.],
       [   8.,   16.,   24.,   32.,   40.,   48.,   56.,   64.,   72.,
          80.,   88.,   96.],
       [   9.,   18.,   27.,   36.,   45.,   54.,   63.,   72.,   81.,
          90.,   99.,  108.],
       [  10.,   20.,   30.,   40.,   50.,   60.,   70.,   80.,   90.,
         100.,  110.

In [90]:
decimal_mult_table[0]

array([  1.,   2.,   3.,   4.,   5.,   6.,   7.,   8.,   9.,  10.,  11.,
        12.])

In [91]:
decimal_mult_table[0].sum()

78.0

In [93]:
decimal_mult_table[0, -1]

12.0

In [94]:
decimal_mult_table[0, -1] = np.nan

In [95]:
decimal_mult_table[0]

array([  1.,   2.,   3.,   4.,   5.,   6.,   7.,   8.,   9.,  10.,  11.,
        nan])

In [96]:
decimal_mult_table.sum()

nan

In [99]:
np.nansum( decimal_mult_table[0] )

66.0

In [100]:
np.nanmean( decimal_mult_table[0] )

6.0

In [101]:
np.nanmean?

In [103]:
decimal_mult_table[:, -1] = np.nan

In [None]:
decimal_mult_table[:, -2] *= 100

In [106]:
decimal_mult_table

array([[  1.00000000e+00,   2.00000000e+00,   3.00000000e+00,
          4.00000000e+00,   5.00000000e+00,   6.00000000e+00,
          7.00000000e+00,   8.00000000e+00,   9.00000000e+00,
          1.00000000e+01,   1.10000000e+03,              nan],
       [  2.00000000e+00,   4.00000000e+00,   6.00000000e+00,
          8.00000000e+00,   1.00000000e+01,   1.20000000e+01,
          1.40000000e+01,   1.60000000e+01,   1.80000000e+01,
          2.00000000e+01,   2.20000000e+03,              nan],
       [  3.00000000e+00,   6.00000000e+00,   9.00000000e+00,
          1.20000000e+01,   1.50000000e+01,   1.80000000e+01,
          2.10000000e+01,   2.40000000e+01,   2.70000000e+01,
          3.00000000e+01,   3.30000000e+03,              nan],
       [  4.00000000e+00,   8.00000000e+00,   1.20000000e+01,
          1.60000000e+01,   2.00000000e+01,   2.40000000e+01,
          2.80000000e+01,   3.20000000e+01,   3.60000000e+01,
          4.00000000e+01,   4.40000000e+03,              nan],
    

In [110]:
the_copy = decimal_mult_table[:, -2].copy() * 100

In [111]:
the_copy

array([  110000.,   220000.,   330000.,   440000.,   550000.,   660000.,
         770000.,   880000.,   990000.,  1100000.,  1210000.,  1320000.])

## Pandas DataFrame

* Emulates R's <code>data.frame</code> structure.
* Basically a NumPy matrix that
    * has row and column names
    * can have columns of different types
    * handles missing data better
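
The three points above can be seen on a tiny DataFrame built by hand. This is a sketch with made-up data (the gene names, values and row labels are hypothetical and not taken from <code>samplefile.xlsx</code>).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one text column, one numeric column with a gap.
toy = pd.DataFrame(
    { 'Symbol' : ['geneA', 'geneB', 'geneC'],
      'signal' : [2.5, np.nan, 7.5] },
    index=['row1', 'row2', 'row3'] )

print( toy.dtypes )                   # one dtype per column
print( toy.loc['row3', 'Symbol'] )    # geneC, looked up by row and column name
print( toy['signal'].mean() )         # 5.0, the missing value is skipped
```

Unlike <code>ndarray.mean()</code>, which returned <code>nan</code> as soon as one value was missing, the pandas <code>mean()</code> skips missing values by default.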

In [112]:
import pandas as pd

In [113]:
pwd

'C:\\Users\\colettace\\Desktop'

In [114]:
ls

 Volume in drive C has no label.
 Volume Serial Number is AE9B-5918

 Directory of C:\Users\colettace\Desktop

05/17/2017  02:23 PM    <DIR>          .
05/17/2017  02:23 PM    <DIR>          ..
05/17/2017  01:05 PM    <DIR>          .ipynb_checkpoints
05/15/2017  01:07 PM             2,255 Google Chrome.lnk
05/15/2017  12:05 PM           215,550 MicroarrayAnalysisUsingPython.ipynb
05/15/2017  01:41 PM    <DIR>          NewFolder
05/15/2017  02:38 PM            37,661 NIAPythonDay1.ipynb
05/16/2017  02:37 PM            47,238 NIAPythonDay2.ipynb
05/17/2017  02:23 PM            39,988 NIAPythonDay3.ipynb
05/15/2017  12:05 PM        20,074,758 samplefile.xlsx
               6 File(s)     20,417,450 bytes
               4 Dir(s)  254,650,126,336 bytes free


In [115]:
df = pd.read_excel('samplefile.xlsx')

In [116]:
len(df)

59734

In [117]:
len(df.columns)

33

In [118]:
df.shape

(59734, 33)

In [120]:
df.head()

Unnamed: 0,ArrayID,Symbol,AVG_Signal_BR1_TEST_O_1,AVG_Signal_BR1_TEST_O_2,AVG_Signal_BR1_TEST_O_3,AVG_Signal_BR1_TEST_O_4,AVG_Signal_BR1_TEST_Y_1,AVG_Signal_BR1_TEST_Y_2,AVG_Signal_BR1_TEST_Y_3,AVG_Signal_BR1_TEST_Y_4,...,AVG_Signal_BR2_TEST_Y_3,AVG_Signal_BR2_TEST_Y_4,AVG_Signal_BR2_CONTROL_O_1,AVG_Signal_BR2_CONTROL_O_2,AVG_Signal_BR2_CONTROL_O_3,AVG_Signal_BR2_CONTROL_O_4,AVG_Signal_BR2_CONTROL_Y_1,AVG_Signal_BR2_CONTROL_Y_2,AVG_Signal_BR2_CONTROL_Y_3,AVG_Signal_BR2_CONTROL_Y_4
0,1,NA1,49209.04,55571.09,28678.79,25506.29,44113.21,38091.49,29641.85,49992.2,...,44737.37,22252.72,49358.38,43576.56,26025.02,37025.12,37762.74,58349.68,20307.62,25211.6
1,2,NA2,2.463345,3.040068,2.177575,2.220648,3.854862,4.541973,4.131402,4.389443,...,2.206764,2.059939,2.537871,2.222115,2.451152,2.370934,2.21816,2.299958,2.403679,2.306745
2,3,NA3,2.481884,3.074572,2.201559,2.244312,3.891572,4.587771,4.168455,4.438354,...,2.228056,2.08548,2.562563,2.245733,2.47762,2.396659,2.244386,2.326588,2.434108,2.333681
3,4,Tbc1d19,772.9165,631.2797,584.1135,377.6097,436.1088,553.6884,457.3865,640.4461,...,489.7733,355.0931,541.5174,750.3656,383.2357,430.4508,318.9993,315.6893,329.8668,342.941
4,5,Cfc1,24.28796,3.138412,3.287079,2.287869,3.957448,4.674656,4.421931,8.066015,...,2.270093,2.132824,3.875421,2.290862,2.528369,2.446485,2.295713,2.373987,10.21956,2.383641


In [122]:
df.tail(5)

Unnamed: 0,ArrayID,Symbol,AVG_Signal_BR1_TEST_O_1,AVG_Signal_BR1_TEST_O_2,AVG_Signal_BR1_TEST_O_3,AVG_Signal_BR1_TEST_O_4,AVG_Signal_BR1_TEST_Y_1,AVG_Signal_BR1_TEST_Y_2,AVG_Signal_BR1_TEST_Y_3,AVG_Signal_BR1_TEST_Y_4,...,AVG_Signal_BR2_TEST_Y_3,AVG_Signal_BR2_TEST_Y_4,AVG_Signal_BR2_CONTROL_O_1,AVG_Signal_BR2_CONTROL_O_2,AVG_Signal_BR2_CONTROL_O_3,AVG_Signal_BR2_CONTROL_O_4,AVG_Signal_BR2_CONTROL_Y_1,AVG_Signal_BR2_CONTROL_Y_2,AVG_Signal_BR2_CONTROL_Y_3,AVG_Signal_BR2_CONTROL_Y_4
59729,62972,LOC100911030,54.22469,35.75484,31.88958,35.30971,18.8773,25.68657,16.59089,11.67915,...,32.57816,16.67825,21.44451,34.14653,21.40752,25.2632,15.73026,29.26188,20.60742,23.11786
59730,62973,NA11371,2.554798,3.154194,2.375917,2.294845,3.994322,4.788048,4.358372,4.492544,...,2.315197,2.154653,2.606643,2.304856,2.526632,2.3947,2.429271,2.378576,2.417137,2.497314
59731,62974,NA11372,2.540203,3.123994,2.352405,2.276225,3.973528,4.760257,4.334032,4.46073,...,2.297506,2.135165,2.582128,2.286663,2.504557,2.37428,2.404696,2.360261,2.394571,2.478401
59732,62975,NA11373,46838.7,50219.1,26781.29,21699.99,56597.2,34399.53,24484.38,42579.33,...,39437.84,19065.65,45468.39,38192.24,23391.61,34276.04,33938.49,57972.02,17723.86,23862.66
59733,62976,NA11374,49877.94,58878.92,28401.58,23473.5,63054.37,35892.01,25030.26,44512.54,...,45687.15,22055.6,49109.15,42386.76,25735.49,38670.31,37094.9,56548.62,19186.92,25652.13


In [123]:
df[ df['Symbol'] == 'NA11374' ]

Unnamed: 0,ArrayID,Symbol,AVG_Signal_BR1_TEST_O_1,AVG_Signal_BR1_TEST_O_2,AVG_Signal_BR1_TEST_O_3,AVG_Signal_BR1_TEST_O_4,AVG_Signal_BR1_TEST_Y_1,AVG_Signal_BR1_TEST_Y_2,AVG_Signal_BR1_TEST_Y_3,AVG_Signal_BR1_TEST_Y_4,...,AVG_Signal_BR2_TEST_Y_3,AVG_Signal_BR2_TEST_Y_4,AVG_Signal_BR2_CONTROL_O_1,AVG_Signal_BR2_CONTROL_O_2,AVG_Signal_BR2_CONTROL_O_3,AVG_Signal_BR2_CONTROL_O_4,AVG_Signal_BR2_CONTROL_Y_1,AVG_Signal_BR2_CONTROL_Y_2,AVG_Signal_BR2_CONTROL_Y_3,AVG_Signal_BR2_CONTROL_Y_4
59733,62976,NA11374,49877.94,58878.92,28401.58,23473.5,63054.37,35892.01,25030.26,44512.54,...,45687.15,22055.6,49109.15,42386.76,25735.49,38670.31,37094.9,56548.62,19186.92,25652.13
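
The last cell filters rows with a boolean test. A sketch of what happens in each step, on a small made-up DataFrame (the names and values are hypothetical):

```python
import pandas as pd

toy = pd.DataFrame( { 'Symbol' : ['geneA', 'geneB', 'geneC'],
                      'signal' : [2.5, 40.0, 7.5] } )

mask = toy['Symbol'] == 'geneB'     # a Series of True/False, one per row
print( mask.tolist() )              # [False, True, False]

selected = toy[ mask ]              # keeps only the rows where mask is True
print( len(selected) )              # 1

high = toy[ toy['signal'] > 5 ]     # any comparison works the same way
print( high['Symbol'].tolist() )    # ['geneB', 'geneC']
```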
