# Data Analysis with Python
_T.J. Langford_  
_Wright Lab & YCRC_  
_May 1, 2019_  

# Overview

- Introduction to `numpy` and `matplotlib`
- Data processing and analysis with `numpy`
- Data visualization with `matplotlib`

# Tools and Requirements

- Language: Python 3.6
- Modules: `numpy`, `matplotlib`
- Jupyter notebook

# Comment: Python 2 versus 3
- Major modules will be dropping Python 2 support in 2019
    - Including `numpy`, `pandas`, and `matplotlib`
- This tutorial uses Python 3
- See https://python3statement.org for details


# GitHub Repository

- The materials from this tutorial are available on [GitHub](https://github.com/WrightLaboratory/data_analysis)
- You can also launch a live version in a [Binder environment]()

# Data Processing with `numpy`

In [1]:
import numpy as np

# What is NumPy?

NumPy is the fundamental package for scientific computing with Python. It contains among other things:

- a powerful N-dimensional array object
- sophisticated (broadcasting) functions
- tools for integrating C/C++ and Fortran code
- useful linear algebra, Fourier transform, and random number capabilities

[User Guide](https://docs.scipy.org/doc/numpy-1.16.1/)

## N-dimensional array objects

- The foundation of `numpy` is the `array` object (`numpy.ndarray`)
- 1D array ~ vector
- 2D array ~ matrix  
- nD array (n > 2) ~ tensor

### Creating arrays

Arrays can be created in a variety of ways. The most common are either filled with zeros:

In [2]:
a = np.zeros(10)
a

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

or you can create them from an existing `list`:

In [3]:
b = np.array([0,1,2,3,4,5,6,7,8,9])
b

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
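
A few other constructors come up often. The sketch below is not part of the original notebook; the variable names are illustrative, and it relies only on `np.zeros` with a shape tuple and `dtype`, `np.ones`, `np.arange`, and `np.linspace`:

```python
import numpy as np

# 2x3 array of zeros with an explicit integer dtype
z = np.zeros((2, 3), dtype=int)

# Array of ones, same idea as np.zeros
o = np.ones(4)

# Evenly spaced integers: start (inclusive), stop (exclusive), step
evens = np.arange(0, 10, 2)

# Five evenly spaced points from 0 to 1, both endpoints included
pts = np.linspace(0.0, 1.0, 5)

print(z.shape, o, evens, pts)
```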

### Array properties

Arrays have a few key properties:

- Data type (float, int, etc.)
- Number of dimensions
- Shape 
- Size


In [28]:
print(a.dtype)
print(b.dtype)

float64
int64


In [6]:
a.shape

(10,)

In [32]:
c = np.array([[0,1,2,3],[4,5,6,7]])
c

array([[0, 1, 2, 3],
       [4, 5, 6, 7]])

In [33]:
c.shape

(2, 4)
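
The remaining properties from the list above can be read the same way. A short sketch, re-creating `c` so that it runs on its own:

```python
import numpy as np

c = np.array([[0, 1, 2, 3], [4, 5, 6, 7]])

print(c.ndim)   # number of dimensions (axes)
print(c.shape)  # length of each axis
print(c.size)   # total number of elements
print(c.dtype)  # element type; the default integer width depends on the platform
```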

### Array indexing and slicing

Arrays are indexed (starting at `0`) and we can slice them:

In [9]:
b

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [10]:
b[2:4]

array([2, 3])

In [11]:
b[0:-2]

array([0, 1, 2, 3, 4, 5, 6, 7])

In [12]:
b[::2]

array([0, 2, 4, 6, 8])

For n-dimensional arrays, we give one index or slice per axis, separated by commas. Here we select column `2` from every row:

In [13]:
c[:,2]

array([2, 6])
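
The NumPy documentation uses "fancy" (or advanced) indexing for selection with an integer array or a boolean mask rather than a slice. A minimal sketch of both forms; the names `picked`, `large`, and `outer_cols` are illustrative and not from the original notebook:

```python
import numpy as np

b = np.arange(10)
c = np.array([[0, 1, 2, 3], [4, 5, 6, 7]])

# Integer-array indexing: pick elements 1, 3 and 8, in that order
picked = b[[1, 3, 8]]

# Boolean mask: keep only the elements greater than 5
large = b[b > 5]

# Combine a slice with an integer list: all rows, columns 0 and 3
outer_cols = c[:, [0, 3]]

print(picked, large, outer_cols)
```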

### Array manipulation

Arrays can be manipulated in-place, without creating a second copy: 

In [14]:
b[3] = 10
b

array([ 0,  1,  2, 10,  4,  5,  6,  7,  8,  9])

You can assign a range of values at once, either to a single value or from another array:

In [15]:
a[0:2] = 9
a[3:7] = b[0:4]

In [16]:
a

array([ 9.,  9.,  0.,  0.,  1.,  2., 10.,  0.,  0.,  0.])

The assignment copies the values, so `a` and `b` are not linked. Changing `b` now doesn't change `a`:

In [17]:
b[3] = 0
a

array([ 9.,  9.,  0.,  0.,  1.,  2., 10.,  0.,  0.,  0.])
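
By contrast, a plain slice bound to a new name is a view onto the same memory, so writes through it show up in the original array. A minimal sketch with illustrative names (`view`, `snapshot`) that are not part of the original notebook:

```python
import numpy as np

b = np.arange(10)

view = b[0:4]             # basic slicing returns a view, not a copy
snapshot = b[0:4].copy()  # .copy() makes an independent array

view[0] = 99              # writes through to b; snapshot is unaffected

print(b[0], snapshot[0])
print(np.shares_memory(b, view), np.shares_memory(b, snapshot))
```

Calling `.copy()` is the safe choice whenever a slice needs to be modified without touching the source array.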

### Array operations

We can also act on these arrays with specific operations:

- add, subtract, multiply, and divide by scalars or other arrays

In [18]:
np.multiply(a, b)

array([ 0.,  9.,  0.,  0.,  4., 10., 60.,  0.,  0.,  0.])

- extract statistics about the array (minimum, maximum, mean, median, RMS, etc.)

In [19]:
np.max(b)

9

In [20]:
np.mean(a)

3.1

In [21]:
np.median(a)

0.5

- sum array elements along an axis: 

In [22]:
np.sum(c, axis=1)

array([ 6, 22])
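
The operations above also have operator forms, and the `axis` argument picks the direction of a reduction. The sketch below re-creates `a`, `b`, and `c` with the values they hold at this point; the RMS line is one possible way to build that statistic from `np.mean` and `np.sqrt`, and is not taken from the original notebook:

```python
import numpy as np

a = np.array([9., 9., 0., 0., 1., 2., 10., 0., 0., 0.])
b = np.array([0, 1, 2, 0, 4, 5, 6, 7, 8, 9])
c = np.array([[0, 1, 2, 3], [4, 5, 6, 7]])

# The arithmetic operators are element-wise, so a * b matches np.multiply(a, b)
product = a * b

# A scalar is broadcast across every element
shifted = b + 1

# Root mean square, computed here from mean and sqrt
rms = np.sqrt(np.mean(a**2))

# axis=0 sums down the columns; axis=1 (used above) sums along the rows
col_sums = np.sum(c, axis=0)

print(product, shifted, rms, col_sums)
```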

### Import and Export Arrays 

NumPy has two main ways of importing and exporting data:

- human-readable text file:

In [23]:
np.savetxt('test.txt', c, fmt='%f', delimiter=',', header='My favorite array')

In [24]:
cat test.txt

# My favorite array
0.000000,1.000000,2.000000,3.000000
4.000000,5.000000,6.000000,7.000000


- higher-efficiency binary data:

In [25]:
np.save('test.npy', c)
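
Reading the data back uses the matching functions `np.loadtxt` and `np.load`. A minimal round-trip sketch; writing into a temporary directory is an assumption made here so that the example leaves no files behind:

```python
import os
import tempfile

import numpy as np

c = np.array([[0, 1, 2, 3], [4, 5, 6, 7]])

with tempfile.TemporaryDirectory() as tmpdir:
    txt_path = os.path.join(tmpdir, 'test.txt')
    npy_path = os.path.join(tmpdir, 'test.npy')

    np.savetxt(txt_path, c, fmt='%f', delimiter=',', header='My favorite array')
    np.save(npy_path, c)

    # loadtxt skips '#' comment lines by default and returns floats
    from_txt = np.loadtxt(txt_path, delimiter=',')

    # load restores the original dtype and shape from the .npy file
    from_npy = np.load(npy_path)

print(from_txt.dtype, from_npy.dtype)
```

The text file loses the integer type because the values were written with `fmt='%f'`, while the binary file keeps it.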

## Random Number Generation with `numpy`

NumPy has a full suite of tools for generating random numbers, which is very helpful for Monte Carlo simulations or toy data.

Here we will generate 100k random floats from a normal distribution with `mean = 2.0` and `sigma = 1.0`. 

In [36]:
r = np.random.normal(loc=2, scale=1, size=100000)
print(r[0:10])

[3.02765183 1.38807201 3.38206878 3.72932418 1.83487814 2.82455317
 1.64369677 3.40061566 1.84350903 1.91978341]
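
A quick sanity check is to compare the sample mean and standard deviation with the requested `loc` and `scale`. The seed below is an arbitrary choice, added only to make this sketch repeatable; the original notebook does not set one:

```python
import numpy as np

np.random.seed(42)  # arbitrary seed, an assumption for repeatability

r = np.random.normal(loc=2, scale=1, size=100000)

# With 100k samples these should land close to 2.0 and 1.0
print(np.mean(r), np.std(r))
```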


We can randomly select elements from an array:

In [56]:
np.random.choice(a, size=2)

array([ 0., 10.])

All the "heavy-lifting" is done in `C`, so `numpy`-based work can be _very_ fast.
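
One way to see the difference is to time an explicit Python loop against the equivalent `numpy` call. This is a rough sketch, not a benchmark: `python_sum` is a hypothetical helper written for the comparison, and the absolute timings depend on the machine:

```python
import timeit

import numpy as np

values = np.random.uniform(size=100000)

def python_sum(arr):
    # Explicit Python loop: one interpreter step per element
    total = 0.0
    for v in arr:
        total += v
    return total

# Time 10 repetitions of each approach
loop_time = timeit.timeit(lambda: python_sum(values), number=10)
numpy_time = timeit.timeit(lambda: np.sum(values), number=10)

print(f"loop: {loop_time:.4f} s, numpy: {numpy_time:.4f} s")
```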

### Random Number Example: Monte Carlo `pi`

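We can estimate `pi` with a Monte Carlo simulation: draw points uniformly in the unit square and count the fraction that fall inside the quarter circle of radius 1. That fraction approaches `pi/4`, so multiplying it by 4 gives an estimate of `pi`.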

In [41]:
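def mc_pi(num_trials):
    # Draw num_trials points uniformly in the unit square
    x = np.random.uniform(low=0.0, high=1.0, size=num_trials)
    y = np.random.uniform(low=0.0, high=1.0, size=num_trials)

    # Squared distance of each point from the origin
    r2 = x**2 + y**2

    # Fraction of points inside the quarter circle, times 4
    return 4 * np.count_nonzero(r2 < 1) / num_trials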

In [55]:
for n in [10,100,1000,10000,100000,1000000]:
    print(f"{n}: {mc_pi(n)}")

10: 4.0
100: 3.32
1000: 3.044
10000: 3.1564
100000: 3.13872
1000000: 3.142352
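
The scatter in these estimates follows from counting statistics. Each trial lands inside the quarter circle with probability `pi/4`, so the number of hits is binomial and the standard error of the estimate shrinks as `1/sqrt(num_trials)`. A minimal sketch of that scaling; `mc_pi_with_error` is a hypothetical helper, not part of the original notebook, and the seed is arbitrary:

```python
import numpy as np

def mc_pi_with_error(num_trials):
    x = np.random.uniform(low=0.0, high=1.0, size=num_trials)
    y = np.random.uniform(low=0.0, high=1.0, size=num_trials)

    # Fraction of points inside the quarter circle of radius 1
    p = np.count_nonzero(x**2 + y**2 < 1) / num_trials

    # Binomial standard error on p, scaled by the same factor of 4
    err = 4 * np.sqrt(p * (1 - p) / num_trials)
    return 4 * p, err

np.random.seed(0)  # arbitrary seed, an assumption for repeatability
for n in [100, 10000, 1000000]:
    est, err = mc_pi_with_error(n)
    print(f"{n}: {est:.4f} +/- {err:.4f}")
```

Gaining one more decimal digit of precision therefore takes roughly 100 times as many trials.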


## Data Processing with `numpy`

Now that we have a basic familiarity with `numpy`, we will process some low-level data produced by a photomultiplier tube (PMT) connected to a scintillator cell.