![DSB Logo](img/Dolan.jpg)
# The NumPy Library
## NumPy Arrays to the Rescue
[The NumPy Quickstart Tutorial](https://docs.scipy.org/doc/numpy-dev/user/quickstart.html)  
[NumPy Reference Docs](https://docs.scipy.org/doc/numpy-dev/reference/index.html#reference)

# Learning Objectives

## Theory / Be able to explain ...
- The benefits of NumPy for data analysts
- Multidimensional Arrays, Type Coercion, Element-wise Operations, Universal Functions, Structured Arrays, etc.
- Data types as column specs in NumPy

## Skills / Know how to  ...
- Create and manipulate multidimensional arrays
- Perform various array operations (without for loops)
- Use universal functions to calculate descriptive stats
- Import and export structured arrays

# Basics
## NumPy as a List Replacement

# Lists are Great! … except when they’re not
Lists are very flexible containers for ordered collections:
- Allow mixed data types 
- Built in indexing and slicing schemes
- Easy concatenation and copying

However, lists are also somewhat inefficient:
- Handling mixed data types requires extra processing
- Most operations require `for` loops to iterate over the list items; slicing and concatenating are the exceptions

# What’s wrong with `for` loops?
```python
for i in numbers:
    print(i)
for i in range(len(numbers)):
    print(numbers[i])
```
- Prone to programmer error
   - What type is the loop variable `i` in each case above?
   - What if `numbers` is an empty list? What do we want to happen then? 
- Serialized execution (one cycle at a time) doesn’t give the interpreter much room to optimize
   - Modern computers can do many things at once!

# Standard Arrays to the Rescue?!?
Standard library (‘vanilla’) arrays solve *some* of the errors and inefficiencies associated with mixed data types by requiring all data to be of the same type:
```python
from array import array
array('i', [1, 2])            # array of ints
array('f', [1.0, 5.0, 7.8])   # array of floats
array('u', 'a↑☺')             # array of Unicode characters
``` 
However, they still require us to use loops to do anything meaningful with them!
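
A minimal sketch of that limitation, using a made-up `prices` array: doubling every element of a standard-library array still takes a loop (here written as a comprehension), and the `*` operator does something else entirely.

```python
from array import array

prices = array('f', [1.0, 5.0, 7.5])   # hypothetical example data

# There is no element-wise multiply, so we loop over the items ourselves
doubled = array('f', [2 * p for p in prices])

# The * operator repeats the array (as with lists) instead of scaling it
repeated = prices * 2

print(doubled.tolist())   # [2.0, 10.0, 15.0]
print(len(repeated))      # 6
```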


Note: we had to import the `array` module; the `array` data type is included in the Specialized Types portion of the Standard Library.
![Data Type Hierarchy](img/L4_Standard_Data_Types.png)

# Introducing NumPy Arrays (the `ndarray` type)
NumPy arrays are like standard arrays on steroids, with indexing, slicing, copying, etc. **plus** 
- **Type coercion** so you don’t get errors when you mix strings with floats
- **Elementwise** versions of `*`, `/`, `+`, `-`, boolean comparisons, etc. that **eliminate the need for most for loops**
- Methods for **descriptive statistics** and other common calculations
- Support for **linear algebra** operations (dot prod, cross prod, etc.)
- **Streamlined file Input / Output** for 2D tabular data
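
As a rough illustration of the element-wise point, the sketch below converts some made-up Celsius readings to Fahrenheit twice: once with a vanilla list and once with an `ndarray`. The variable names and data are assumptions for the example only.

```python
import numpy as np

celsius_list = [0.0, 21.5, 100.0]    # hypothetical example data

# Vanilla Python: a comprehension (an implicit for loop) does the conversion
fahrenheit_list = [c * 9 / 5 + 32 for c in celsius_list]

# NumPy: the same arithmetic is applied to every element at once
celsius = np.array(celsius_list)
fahrenheit = celsius * 9 / 5 + 32

print(fahrenheit_list)
print(fahrenheit)
print(fahrenheit.mean())             # descriptive stats come built in
```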

# Implications for Analysts
- Working with tabular data in **vanilla Python can be tedious and error-prone.**
- NumPy simplifies things by **automating away** most of the tedium of loops, if statements, etc. 
- Tabular data in an `ndarray` **feels more like a spreadsheet**, only without the copy/paste and drag fills. 
- Except, of course, **NumPy can handle data sets far too large for a spreadsheet** (if you have enough memory and time).

# Using NumPy Arrays
## `ndarray` basics

# Importing NumPy
- NumPy is a third-party library that you have to install separately. (That’s why we don’t consider it vanilla.)
- It’s probably a good idea to import numpy near the top of every script/notebook that needs it. 
- For the remaining slides, assume that we have already imported NumPy as follows: 
```python
import numpy as np  # np always refers to NumPy
```

In [1]:
import numpy as np

# Creating a NumPy Array
We can create a new `ndarray` from any ordered collection (list, tuple, etc.).

In [2]:
np_array = np.array([1,5,2,9]) # Note: from a list
type(np_array)

numpy.ndarray

In [3]:
np_array

array([1, 5, 2, 9])

In [4]:
print(np_array[1])

5


In [5]:
len(np_array)

4

# Type Coercion
To prevent mixed types within an `ndarray`, NumPy will coerce (convert) all elements to a ‘lowest common denominator’ type that can represent all of the data, where `int` → `float` → `str`

In [6]:
np.array([1,2.0]) # coerces everything to float

array([1., 2.])

In [7]:
np.array([1,2.0,'3']) # coerces everything to str ('<U32' means up to 32 Unicode characters)

array(['1', '2.0', '3'], dtype='<U32')
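
One consequence worth seeing, sketched here under the assumption that every string in the array really is a valid number: once the data has been coerced to strings, it has to be converted back explicitly (for example with `astype`) before any math will work.

```python
import numpy as np

mixed = np.array([1, 2.0, '3'])      # everything becomes str
print(mixed.dtype.kind)              # 'U' means Unicode string

# Convert back to numbers explicitly before doing any math
numbers = mixed.astype(float)
print(numbers)                       # [1. 2. 3.]
print(numbers.sum())                 # 6.0
```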

# `ndarray` Attributes
The `ndarray` type keeps metadata about your arrays:

In [8]:
np_array = np.array([[1,5,2,9],[2,1,9,5]])
np_array.ndim 	# dimensions 1D, 2D, 3D, etc.

2

In [9]:
np_array.shape 	# rows and columns for 2D

(2, 4)

In [10]:
np_array.dtype # data type; ‘int64’ is int

dtype('int64')

Note: there are no parentheses because these are metadata attributes (data), not methods (functions).

# Element-wise Operations
Arithmetic operations work element-wise, iterating over the elements one at a time (without a for loop!)

In [11]:
x =   np.array([[1,5, 2,9], [2, 1,9,5]])
y =   np.array([[1,3, 4,2], [0, 5,8,3]])

In [12]:
x-y # pairwise operator

array([[ 0,  2, -2,  7],
       [ 2, -4,  1,  2]])

In [13]:
2*x # scalar operator

array([[ 2, 10,  4, 18],
       [ 4,  2, 18, 10]])

In [14]:
x.dot(np.transpose(y)) # dot product; not pairwise

array([[42, 68],
       [51, 92]])

# Boolean Comparisons
The built-in boolean comparators `==`, `!=`, `>`, `>=`, `<=`, etc. also apply element-wise:

In [15]:
x = np.array([2,1,9,5])
x>2

array([False, False,  True,  True])

In [16]:
x==2

array([ True, False, False, False])

In [17]:
y = np.array([1,3,7,2])
x>y

array([ True, False,  True,  True])

# In-Place Operations
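If we want to do arithmetic on an `ndarray` without creating a copy, we can use `*=`, `/=`, `+=`, and `-=`. The result must fit the array's existing dtype, so `/=` on an integer array raises an error; create the array with a float dtype first if you need true division.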

In [18]:
np_array = np.array([1,5,2,9])
np_array *= 3     # multiply each element by 3
np_array

array([ 3, 15,  6, 27])
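
In-place operators cannot change an array's dtype, which matters most for division. The sketch below (variable names are made up for the example) shows what happens with an integer array and two ways around it.

```python
import numpy as np

counts = np.array([1, 5, 2, 9])      # integer dtype

try:
    counts /= 2                      # true division would produce floats
except TypeError as err:
    print("in-place division failed:", type(err).__name__)

counts //= 2                         # floor division stays integer
print(counts)                        # [0 2 1 4]

halves = np.array([1, 5, 2, 9], dtype=float)
halves /= 2                          # fine: the array already holds floats
print(halves)
```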

# `ndarray` Operations
## Methods and Functions

# Array Methods
`ndarray` has callable methods for descriptive statistics and other common **unary** calculations. 

In [19]:
x = np.array([1,5,2,9])
x.sum()

17

In [20]:
x.min()

1

In [21]:
x.mean()

4.25

In [22]:
x.sort() # modifies x!
x

array([1, 2, 5, 9])

Ref: https://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.html

# NumPy Universal Functions
In addition to array methods, NumPy also supplies a bunch of useful array functions (that take `ndarray` inputs). There are too many to cover here, but RTFM (below) for more.  
![Universal Functions](img/L5_Universal_Functions.png)  
Excel geeks: don't the function names look familiar? That's somewhat intentional.    
Ref: https://docs.scipy.org/doc/numpy-1.13.0/reference/ufuncs.html#available-ufuncs
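
A small sketch of three universal functions in action; the arrays `x` and `y` are made-up example data. Each function is applied element by element, with no loop in sight.

```python
import numpy as np

x = np.array([1.0, 4.0, 9.0, 16.0])
y = np.array([2.0, 3.0, 10.0, 12.0])

roots = np.sqrt(x)          # unary ufunc: one input array
larger = np.maximum(x, y)   # binary ufunc: element-wise max of two arrays
total = np.add(x, y)        # the function behind the + operator

print(roots)                # [1. 2. 3. 4.]
print(larger)
print(total)
```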

# Indexing Tricks
## Selecting just the items you need without a for loop

# Indexing and Slicing as Usual
All the usual indexing & slicing rules apply to 1D `ndarrays`:

In [23]:
x=np.array([1,2,3])
x[1:]

array([2, 3])

2D and higher arrays allow comma separated slices:

In [24]:
x=np.array([[1,2,3],[4,5,6]]) 
x[1][1:] # vanilla Python

array([5, 6])

In [25]:
x[1,1:]  # with commas instead

array([5, 6])

# Indexed Selections
We can use an array of indexes to select elements from another array.

In [26]:
x = np.array([1,5,2,9])  # array of data
i = np.array([1,3])      # array of selected indexes
x[i]                     # x 'reduced' to i

array([5, 9])

# Boolean Selections
Booleans can also be used as selectors.

In [27]:
x = np.array([2,1,9,5])
y=x>2     # y is an array of booleans
y

array([False, False,  True,  True])

In [28]:
x[y]      # use the booleans to select items from x

array([9, 5])

In [29]:
x[x>2]    # all in one statement; it's weird but it works

array([9, 5])
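
Boolean selectors can also be combined. As a sketch, using the same `x` as above: NumPy uses `&`, `|`, and `~` for element-wise and, or, and not, and each comparison needs its own parentheses.

```python
import numpy as np

x = np.array([2, 1, 9, 5])

# Combine conditions with & (and), | (or), ~ (not); the parentheses are required
middle = x[(x > 1) & (x < 9)]
extremes = x[(x <= 1) | (x >= 9)]

print(middle)            # [2 5]
print(extremes)          # [1 9]
print((x > 2).sum())     # 2, since each True counts as 1
```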

# Note about Iterating in 2D, 3D, etc.
Take care when using for loops with NumPy arrays. They always iterate over the first axis (dimension)! 

In [30]:
x=np.array([[1,2,3],[4,5,6]])
for i in x:
    print(i) # i is an array, not a number

[1 2 3]
[4 5 6]


In [31]:
# What we probably wanted
for row in x:
    for i in row:
        print(i,end=" ") # end argument replaces newline with a space   
    print('\n') # the newline at the end of each row

1 2 3 

4 5 6 
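
Two alternatives to nested loops, offered as a sketch rather than the one right way: the `flat` attribute visits every element regardless of the number of dimensions, and many methods accept an `axis` argument so that no loop is needed at all.

```python
import numpy as np

x = np.array([[1, 2, 3], [4, 5, 6]])

# x.flat visits every element in row-major order
for value in x.flat:
    print(value, end=" ")
print()

# axis=0 collapses the rows (column totals); axis=1 collapses the columns (row totals)
col_totals = x.sum(axis=0)
row_totals = x.sum(axis=1)
print(col_totals)    # [5 7 9]
print(row_totals)    # [ 6 15]
```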



# Structured Arrays
## When columns have names and types

# Tables are more than just data ...
Structured arrays let us specify metadata like column names and data types, **just like database tables**.

In [32]:
# a 1D array of records (rows) whose named fields act as columns
people = np.array([(1,'Paca','Al'),(2,'Loblaw','Bob')],          # the data
            dtype=[('id',int),('lname', 'U25'),('fname','U25')]) # the columns
people

array([(1, 'Paca', 'Al'), (2, 'Loblaw', 'Bob')],
      dtype=[('id', '<i8'), ('lname', '<U25'), ('fname', '<U25')])

# The `dtype` Specification
```python
dtype=[('id',int),('lname', 'U25'),('fname','U25')]
```
Metadata for each column is encoded in the dtype spec, which is a list of tuples  
```python
('<col name>',<type spec>)
```

Some common type specs ([RTFM](https://docs.scipy.org/doc/numpy/reference/arrays.dtypes.html))
- `int`
- `float`
- `'U#'` (unicode string of up to # characters)  
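
To see what the dtype spec buys us, here is a short sketch that rebuilds the `people` array from the earlier slide and then uses the column names to pull data out of it.

```python
import numpy as np

people = np.array([(1, 'Paca', 'Al'), (2, 'Loblaw', 'Bob')],
                  dtype=[('id', int), ('lname', 'U25'), ('fname', 'U25')])

print(people.dtype.names)    # ('id', 'lname', 'fname')
print(people['lname'])       # one column: ['Paca' 'Loblaw']
print(people[0])             # one row (a single record)
print(people.shape)          # (2,)
```

Note that the shape is `(2,)`: a structured array is a 1D array of records, and the columns live inside each record.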

# Record Arrays
Record arrays let us refer to columns as attributes with dot notation. All we have to do is create the array with the `np.rec.array` function instead of `np.array`.

In [33]:
people = np.rec.array([(1,'Paca','Al'),(2,'Loblaw','Bob')],
         dtype=[('id','i'),('lname', 'U25'),('fname','U25')])
people[1].lname # second row, with the lname column referred to by name

'Loblaw'

Note that this is about as close as we can get to a SQL table in NumPy. That's why it's called a **record** array (as in, an array of records).

# File I/O
## Fast and (relatively) easy import and export of tabular data

# NumPy Data Sources
NumPy can read and write data to:
- Strings (and streams)
- **CSV files**
- Formatted text files
- Binary files (raw, ‘pickled,’ or arrays)
- Compressed archives (gzip)
- Web URLs (with ftp, sftp, http, https)

Ref: https://docs.scipy.org/doc/numpy-1.13.0/reference/routines.io.html


# The `genfromtxt()` function
The workhorse of NumPy I/O, `genfromtxt()` makes reading from CSV files almost automatic. No more opening, reading, splitting, stripping, closing…  
```python
my_table = np.genfromtxt('my_file.csv',delimiter=',')
print(type(my_table)) # outputs <class 'numpy.ndarray'>
```

Ref: https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.genfromtxt.html



# `genfromtxt()` options
`genfromtxt()` has lots of optional arguments:
- `dtype` specifies the data type(s) of the columns
- `delimiter` specifies the column separator
- `autostrip` removes white space characters
- `skip_header` skips the indicated number of lines at the start of the file
- `usecols` indicates which columns to import
- `names` provides column names (not needed if the full dtype is given)

Ref: https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.genfromtxt.html
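
The sketch below exercises several of these options together. To keep it self-contained, it reads from an in-memory `io.StringIO` object instead of a real file; the CSV contents are made up for the example.

```python
import io
import numpy as np

# Hypothetical file contents; io.StringIO stands in for a file on disk
csv_text = """ID, LName, FName
1, Paca, Al
2, Loblaw, Bob
"""

people = np.genfromtxt(
    io.StringIO(csv_text),
    delimiter=',',
    skip_header=1,       # ignore the header line
    autostrip=True,      # drop the spaces around each value
    dtype=[('id', int), ('lname', 'U25'), ('fname', 'U25')],
    encoding='utf-8',
)

# usecols limits the import to the listed columns (here, just the IDs)
ids = np.genfromtxt(io.StringIO(csv_text), delimiter=',',
                    skip_header=1, usecols=0, dtype=int)

print(people['lname'])   # ['Paca' 'Loblaw']
print(ids)               # [1 2]
```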


## Examples
```python
# people.csv is a csv file with cols ID,LName,FName

# This works if all data is numerical and no headers 
people = np.genfromtxt('people.csv',delimiter=',')

# Skips the first line and uses mixed data types
people = np.genfromtxt('people.csv',skip_header=1,
           dtype=(int,'U25','U25'),names="id,lname,fname",
           delimiter=',')
```


```python
# Reads column names from the first line of the file
people = np.genfromtxt('people.csv',names=True,
           dtype=(int,'U25','U25'),delimiter=',')

# Shorthand function for CSV; returns a rec.array
# (removed from the top-level namespace in NumPy 2.0; use genfromtxt there)
people = np.recfromcsv('people.csv',names=True,
           dtype=(int,'U25','U25'))
```

# Some Helpful Advice About Overspecifying Your Data Import: Don't
With all the `genfromtxt` optional arguments, it's tempting to use all of them. However, they interact in sometimes unintuitive ways. 

NumPy does a surprisingly good job of guessing data types, etc. *if you don't short-circuit the guessing with unnecessary options.* So let it try *and then* add options to fix any problems you encounter. 
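
As a sketch of the "let it guess" approach: passing `dtype=None` asks `genfromtxt` to infer a type for each column, and `names=True` takes the column names from the first line. The CSV text (including the `score` column) is invented for the example, and `encoding='utf-8'` is passed so that text columns come back as ordinary strings.

```python
import io
import numpy as np

# Hypothetical file contents; io.StringIO stands in for a file on disk
csv_text = """id,lname,fname,score
1,Paca,Al,3.5
2,Loblaw,Bob,4.0
"""

# No per-column dtype: NumPy works out int, str, and float on its own
people = np.genfromtxt(io.StringIO(csv_text), delimiter=',',
                       names=True, dtype=None, encoding='utf-8')

print(people.dtype.names)        # ('id', 'lname', 'fname', 'score')
print(people.dtype)              # the types NumPy inferred
print(people['score'].mean())    # 3.75
```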

# Output with `savetxt()`
The `savetxt()` function does the reverse of `genfromtxt()`.
```python
# people is a structured array; fmt gives one format per column
# Save to the CSV file "out.csv"
np.savetxt("out.csv", people, fmt=['%d', '%s', '%s'], delimiter=',')

# Save as a gzip file, detected from the filename
np.savetxt("out.csv.gz", people, fmt=['%d', '%s', '%s'], delimiter=',')
```
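
A round-trip sketch, assuming the same `people` structured array as in the earlier slides: write it out with `savetxt()`, read it back with `genfromtxt()`, and check that nothing was lost. A temporary folder is used so the example leaves no files behind; the `header` and `comments` arguments write a plain column-name line at the top of the file.

```python
import os
import tempfile
import numpy as np

people = np.array([(1, 'Paca', 'Al'), (2, 'Loblaw', 'Bob')],
                  dtype=[('id', int), ('lname', 'U25'), ('fname', 'U25')])

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'out.csv')

    # One format per column; comments='' keeps the header line free of '# '
    np.savetxt(path, people, fmt=['%d', '%s', '%s'], delimiter=',',
               header='id,lname,fname', comments='')

    # Read it back, skipping the header and reusing the original dtype
    restored = np.genfromtxt(path, delimiter=',', skip_header=1,
                             dtype=people.dtype, encoding='utf-8')

print(restored.tolist() == people.tolist())
```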

# Classwork (Start here in class)
- If time permits, start in on your homework 

# Homework (Do at home over Fall Break)
The following is due before class next week:
- Any remaining classwork from tonight


Please email chuntley@fairfield.edu if you have any problems or questions.