# Intermediate Python: Programming
# Class 1: Review, getting set up, and working with multiple files

Welcome to Intermediate Python: Programming from fredhutch.io! 
This class assumes you've completed Introduction to Python from fredhutch.io,
or have equivalent knowledge about Python as used for data analysis.
This course builds on that foundation.
By the end of this course,
you should be able to create fully documented and automated workflows to perform data analysis tasks.

We'll begin this class by reviewing a few basic features of Python relevant to data analysis: 
loading data, assigning to variables, and general syntax.
By the end of this class,
you should be able to:
- use `numpy` to create and subset arrays and perform summary statistics
- create plots with `matplotlib`
- use `for` loops to repeat tasks across multiple data files

## Getting set up

In [None]:
# load libraries
import os
import urllib.request
import zipfile
import numpy

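About each library: `os` works with files and directories, `urllib.request` downloads files from URLs, `zipfile` reads and extracts zip archives, and `numpy` provides arrays and numerical functions.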

In [None]:
# download data
urllib.request.urlretrieve("http://swcarpentry.github.io/python-novice-inflammation/data/python-novice-inflammation-data.zip", "python-novice-inflammation-data.zip")
# unzip data
zipData = zipfile.ZipFile("python-novice-inflammation-data.zip")
zipData.extractall()

View the unzipped files in the new `data` directory to confirm the download and extraction worked.
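
One way to check the directory from Python is with the `os` library loaded earlier. The sketch below wraps the check in a hypothetical helper, `list_data_files`, which is not part of the original notes; it returns an empty list when the directory does not exist instead of raising an error.

```python
import os


def list_data_files(directory="data"):
    """Return the sorted names in `directory`, or an empty list if it is missing."""
    if not os.path.isdir(directory):
        return []
    return sorted(os.listdir(directory))


# Assumes the archive was unzipped into a directory called "data"
print(list_data_files("data"))
```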

In [None]:
# assign data to variable (so we can recall it later)
data = numpy.loadtxt(fname="data/inflammation-01.csv", delimiter=",")

`numpy.loadtxt` reads the file into an array; assigning the result to the variable `data` lets us recall it later.
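
To see what `numpy.loadtxt` does with the `delimiter` argument without needing the downloaded files, the sketch below reads a small, made-up block of comma-separated text. The `csv_text` and `example` names and values are assumptions for illustration only.

```python
import io
import numpy

# Hypothetical stand-in for a CSV file: two patients (rows), three days (columns)
csv_text = "0,1,2\n3,4,5\n"

# loadtxt accepts a file name or a file-like object; delimiter="," splits on commas
example = numpy.loadtxt(io.StringIO(csv_text), delimiter=",")

print(example)
print(example.shape)
```

Each line of text becomes one row of the array, and each comma-separated value becomes one column.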

The next cells use `numpy` functions and array attributes to inspect `data`.

In [None]:
# what is in the variable?
print(data)

In [None]:
# what type of thing is data?
print(type(data))

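In a notebook, the shortcut `type(data)` shows the same information without `print`, because the value of the last expression in a cell is displayed automatically.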

In [None]:
# what type of data is contained within the array?
data.dtype

In [None]:
# show shape of data
data.shape

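The output is (rows, columns): rows are the individual patients, and columns are their daily inflammation measurements.

Arrays have members, or attributes, which are accessed with dot notation (such as `data.shape`). The dot is used because an attribute is part of the array in the same way that a function such as `numpy.loadtxt` is part of the `numpy` library.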
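
The three inspection tools above can be tried side by side on a small array. This is a minimal sketch using a hypothetical array of zeros, `small`, rather than the inflammation data.

```python
import numpy

# Hypothetical array: 3 patients (rows) by 5 days (columns), filled with zeros
small = numpy.zeros((3, 5))

print(type(small))   # the kind of object: a numpy array
print(small.dtype)   # the type of the values stored inside the array
print(small.shape)   # (rows, columns)

# shape is a tuple, so it can be unpacked into two variables
n_patients, n_days = small.shape
print("patients:", n_patients, "days:", n_days)
```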

**Challenge:** import `small-01.csv` and determine whether its type or shape differs from that of the `data` object

## Manipulating arrays

In [None]:
# extract or reference first element in array
data[0, 0]

Indexing is [row index, column index]; Python indices start at 0.

In [None]:
# extract middle value
data[30, 20]
print("middle value in the data:", data[30, 20]) 

Including the value in a `print` statement with a label makes the output easier to read.

In [None]:
# slicing data
data[0:4, 0:10] # end bounds not inclusive
data[:3, 36:] # empty values mean beginning or end
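
Slicing rules are easier to check on an array small enough to read in full. The sketch below uses a hypothetical 3 x 4 array, `grid`, whose values are simply 0 through 11.

```python
import numpy

# Hypothetical 3 x 4 array holding the values 0 through 11
grid = numpy.arange(12).reshape(3, 4)

first = grid[0, 0]          # row 0, column 0
top_left = grid[0:2, 0:3]   # rows 0-1 and columns 0-2; the end bound is excluded
last_cols = grid[:2, 2:]    # empty bounds: from the first row, and to the last column

print(grid)
print(top_left)
print(last_cols)
```

Because the end bound is excluded, `0:2` selects two rows and `0:3` selects three columns.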

In [None]:
# perform math on an entire array
doubledata = data * 2.0
doubledata[0:4, 0:10]
data[0:4, 0:10] # compare with original

In [None]:
# add arrays together
tripledata = doubledata + data
tripledata[0:4, 0:10]
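
Arithmetic on arrays is applied element by element. A minimal sketch on a hypothetical 2 x 2 array, `values`, makes this easy to verify by hand.

```python
import numpy

values = numpy.array([[1.0, 2.0], [3.0, 4.0]])  # hypothetical 2 x 2 array

doubled = values * 2.0       # every element is multiplied by 2
tripled = doubled + values   # arrays of the same shape are added element by element

print(tripled)
print(numpy.array_equal(tripled, values * 3.0))
```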

In [None]:
# perform summaries across entire array
print(numpy.mean(data))

**Challenge:** find max, min, standard deviation across the entire array data, and print with meaningful print statements

In [None]:
# multiple assignment: assign multiple variables at once
maxval, minval, stdval = numpy.max(data), numpy.min(data), numpy.std(data)
print(stdval)

In [None]:
# specify an axis to summarize (axis=0 collapses the rows, giving one mean per day)
numpy.mean(data, axis=0)

In [None]:
# check shape of output
numpy.mean(data, axis=0).shape # 40 values, this is number of days

`axis=1` instead collapses the columns, summarizing across days to give one value per patient.
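
A toy array can make the two axes concrete. The sketch below uses a hypothetical 2 x 3 array, `toy`, rather than the inflammation data, so the means can be checked by hand.

```python
import numpy

# Hypothetical data: 2 patients (rows) by 3 days (columns)
toy = numpy.array([[1.0, 2.0, 3.0],
                   [3.0, 4.0, 5.0]])

per_day = numpy.mean(toy, axis=0)      # collapses the rows: one mean per day
per_patient = numpy.mean(toy, axis=1)  # collapses the columns: one mean per patient

print(per_day, per_day.shape)
print(per_patient, per_patient.shape)
```

The length of each result matches the dimension that was kept: three days for `axis=0`, two patients for `axis=1`.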

## Visualizing data

In [None]:
import matplotlib.pyplot
%matplotlib inline

In [None]:
image = matplotlib.pyplot.imshow(data) # imshow displays the array as an image (a 2D raster)

In [None]:
matplotlib.pyplot.show() # not always needed: a notebook using %matplotlib inline displays figures automatically
matplotlib.pyplot.imshow(data) # another shortcut: the image appears without assigning it to a variable

In [None]:
# plot inflammation over time as average across all patients
ave_inflammation = numpy.mean(data, axis=0)
matplotlib.pyplot.plot(ave_inflammation)

**Challenge:** using one line of code, print the maximum inflammation across all patients

## Repeating actions with loops

What if we wanted to repeat the plotting across all data files? How many lines of code would that take, given the methods used so far?

In [None]:
# there are multiple ways to show what is contained in a variable
# create a variable for a word
word = "hutchinson"
word
word[0]
word[7]

In [None]:
# what if we change word?
word = "hutch"
word[0]

In [None]:
# word[7] # IndexError: there is no index 7 in word now!

In [None]:
# for loop: accesses each item in a sequence, one at a time
for char in word: # in some interpreters, the loop must be run starting from this top line
    print(char) # have to include print statement for values to appear!
# the loop repeats the action AND does not depend on the length of word
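
To see that the same loop works for words of any length, the sketch below wraps it in a hypothetical helper, `count_characters`, and runs it on both versions of `word` used above.

```python
def count_characters(word):
    """Count the characters in `word` by visiting each one with a for loop."""
    count = 0
    for char in word:
        count = count + 1
    return count


for word in ["hutch", "hutchinson"]:
    print(word, "has", count_characters(word), "characters")
```

Nothing in the loop body refers to a specific index, so no `IndexError` can occur when the word gets shorter.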


In [None]:
# importing multiple data files
import glob

glob.glob("data/inflammation*.csv")

In [None]:
# create a list of files (* is a wildcard that matches any characters)
filenames = sorted(glob.glob("data/inflammation*.csv")) # sorted is alphabetical, which matches numerical order here because the numbers are zero-padded
filenames
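
The wildcard and the effect of `sorted` can be tested without the downloaded data. The sketch below creates a throwaway directory containing three empty files with hypothetical names, then matches them with the same pattern style.

```python
import glob
import os
import tempfile

# Build a throwaway directory containing empty files with hypothetical names
with tempfile.TemporaryDirectory() as tmpdir:
    for name in ["inflammation-02.csv", "inflammation-01.csv", "small-01.csv"]:
        open(os.path.join(tmpdir, name), "w").close()

    # * matches any run of characters, so small-01.csv is excluded
    pattern = os.path.join(tmpdir, "inflammation*.csv")
    matches = sorted(glob.glob(pattern))
    found = [os.path.basename(path) for path in matches]

print(found)
```

`glob.glob` makes no promise about the order of its results, which is why the list is passed through `sorted`.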

In [None]:
# loop across all filenames
for f in filenames:
    print(f)


In [None]:
## Challenge: Are all 12 data files the same shape? (hint: write a for loop)

In [None]:
# plot average inflammation for each file in a separate plot
import glob
import numpy
import matplotlib.pyplot

filenames = sorted(glob.glob("data/inflammation*.csv"))
for f in filenames:
    print(f)

    data = numpy.loadtxt(fname=f, delimiter=",")

    fig_ave = numpy.mean(data, axis=0)
    ave_plot = matplotlib.pyplot.plot(fig_ave)
    matplotlib.pyplot.show() # why is this necessary?
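
On the question in the last comment: in many notebook setups, `show()` finishes the current figure, so each file gets its own plot instead of every line being drawn on one shared set of axes. Another way to get one figure per data set is to create each figure explicitly. The sketch below assumes a non-interactive `Agg` backend and uses two hypothetical arrays in place of the data files; `datasets` and `lines_per_figure` are illustrative names.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, so no window is needed
import matplotlib.pyplot
import numpy

# Hypothetical stand-ins for two data files: 4 patients by 6 days each
datasets = {"example-01": numpy.arange(24.0).reshape(4, 6),
            "example-02": numpy.ones((4, 6))}

lines_per_figure = []
for name, values in datasets.items():
    fig, ax = matplotlib.pyplot.subplots()   # a new, empty figure for each data set
    ax.plot(numpy.mean(values, axis=0))      # average across patients for each day
    ax.set_title(name)
    lines_per_figure.append(len(ax.lines))   # record how many lines this figure holds
    matplotlib.pyplot.close(fig)             # release the figure before the next one

print(lines_per_figure)
```

Each figure ends up with exactly one line, which is the behavior the loop above relies on `show()` to provide.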