<a href="https://colab.research.google.com/github/brendenwest/ad450/blob/master/2_numpy_pandas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NumPy & Pandas
___

### Reading

- McKinney, Chapters 4 - 5
- Molin - Introduction to Data Analysis
- Molin - Working with Pandas DataFrames
- https://blog.growingdata.com.au/a-guided-introduction-to-exploratory-data-analysis-eda-using-python/

### Practice
- https://www.datacamp.com/community/tutorials/python-numpy-tutorial
- https://www.datacamp.com/community/tutorials/pandas-tutorial-dataframe-python
- https://github.com/guipsamora/pandas_exercises 

### Learning Outcomes

- NumPy multi-dimensional array objects
- Array arithmetic and indexing
- Vectorized array operations
- Conditional logic as array operations
- Group-wise data manipulation
- Pandas data structures - Series and DataFrame
- Data selecting & filtering
- Computing descriptive statistics


# NumPy

---

NumPy is a foundational package for numerical computing in Python.

*   NumPy provides `ndarray`, an efficient multi-dimensional array supporting fast array-oriented arithmetic operations.
*   Performs math operations on entire arrays without `for` loops
*   Performs common array operations such as sorting, finding unique values, and set operations
*   Linear algebra, random number generation, and Fourier transform capabilities
*   C API for connecting with C, C++, and Fortran libraries
*   Can map data directly onto underlying disk or memory representation


### Efficiency
- NumPy stores data in contiguous memory blocks
- NumPy stores data with a single type per array, so operations don’t require per-element type checking
- Performs complex computations on entire arrays without the need for loops
- Slicing returns views rather than copies, so selecting a subset of an array does not duplicate its data
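
The points above can be checked directly on a small array. This is a minimal sketch; the array contents and variable names are illustrative only.

```python
import numpy as np

arr = np.arange(6, dtype=np.int64).reshape(2, 3)

print(arr.flags['C_CONTIGUOUS'])  # True: elements live in one contiguous block
print(arr.dtype)                  # int64: a single type for every element
print(arr.itemsize, arr.nbytes)   # 8 bytes per element, 48 bytes in total

# Slicing returns a view that shares memory with the original array
view = arr[:, 1:]
print(np.shares_memory(arr, view))  # True
```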


In [0]:
import numpy as np

### ndarray
NumPy provides the `ndarray`, a multi-dimensional array structure optimized for fast numeric operations.

ndarrays are constructed from sequences of homogeneous values.

NumPy has built-in functions to create special arrays, e.g. `zeros`, `ones`, `empty`, `arange`
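
As a brief illustration of the creation functions named above, the following sketch builds one array with each. The shapes and variable names are arbitrary choices for the example.

```python
import numpy as np

zeros = np.zeros((2, 3))        # 2x3 array of 0.0 (float64 by default)
ones = np.ones(4, dtype=int)    # 1-D array of four integer ones
empty = np.empty((2, 2))        # uninitialized: contents are arbitrary
steps = np.arange(0, 10, 2)     # like Python's range, but returns an ndarray

print(zeros.shape, zeros.dtype)  # (2, 3) float64
print(ones)                      # [1 1 1 1]
print(empty.shape)               # (2, 2)
print(steps)                     # [0 2 4 6 8]
```

Note that `np.empty` only allocates memory, so its values should be overwritten before use.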

In [0]:
arr1 = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
arr1  # print ndarray

print("dimensions:",arr1.ndim)  # ndarrays have dimensions
print("shape:\t",arr1.shape)    # ndarrays have shape (number of rows & columns)
print("datatype",arr1.dtype)    # numpy determines the datatype

dimensions: 2
shape:	 (2, 4)
datatype int64


### Vectorized array operations

NumPy is designed & optimized for batch operations on array data without `for` loops.

- Arithmetic operations between equal-size arrays apply the operation element-wise - multiplication, addition, subtraction, division
- Scalar operations propagate the scalar argument to each element in the array
- Comparisons between equal-size arrays yield boolean arrays
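
The three rules above can be sketched with two small arrays. The values are made up for the example.

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])
b = np.array([[6, 5, 4], [3, 2, 1]])

total = a + b      # element-wise addition
product = a * b    # element-wise multiplication (not matrix multiplication)
scaled = a * 10    # the scalar is applied to every element
mask = a > b       # element-wise comparison yields a boolean array

print(total)
print(product)
print(scaled)
print(mask)
```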

### Indexing & Slicing

NumPy supports data access with Python-like indexing & slicing

- One-dimensional ndarrays behave much like Python lists
- `ndarray` dimensions are sometimes referred to as axes - e.g. in a 2d array axis 0 is the ‘rows’ and axis 1 is the ‘columns’
- ndarray slices are **views** on the original array, not copies. Any changes to the view are reflected in the source array


In [0]:
arr1d = np.array([1, 2, 3, 4, 5, 6, 7, 8])

arr1d[5:8]  # select data based on index position

# Scalar values can be propagated (aka broadcast) to each element in a slice
arr1d[5:8] = 12
print(arr1d)

# to get an independent array rather than a view, copy the slice explicitly

new_array = arr1d[5:8].copy()
print(new_array)


[ 1  2  3  4  5 12 12 12]
[12 12 12]


In [0]:
arr2d = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])

# elements in multi-dimensional arrays can be accessed with either syntax:

arr2d[0][2]
arr2d[0, 2]

# in multi-dimensional arrays, indexing that omits later indices returns a lower-dimensional ndarray

# Subarrays can be accessed with slices in place of indices:

arr2d[:2, 1:] # select first two rows and all but first column
arr2d[:2, 2]  # select first two rows and just 3rd column


array([3, 7])

### Boolean Indexing

NumPy supports boolean expressions in place of indices, where the expression evaluates to an array of boolean values with the same length as the axis it’s indexing.

In [0]:
import numpy as np
names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])
data = np.random.randn(7, 4) # create an array of random values
data[names == 'Bob'] # returns the rows of 'data' at positions where the boolean array is True

array([[-0.0113059 , -0.60401834, -0.69463262,  0.18446081],
       [ 0.94473667,  0.11833578,  0.52714475,  0.02470312]])

- The expression can be assigned to a variable

```
  cond = names == 'Bob'
  data[cond]
```
- The expression can be negated

```
  cond = names == 'Bob'
  data[~cond]
```
- The expression can be combined with other indices

```
  cond = names == 'Bob'
  data[cond, 2:]
```
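- Boolean expressions can be combined using `&` (and) and `|` (or); the Python keywords `and` and `or` do not work with boolean arrays, and each comparison needs its own parentheses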

```
    cond = (names == 'Bob') | (names == 'Will')
```
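
The snippets above can be put together into one runnable sketch. Here `values` is a deterministic stand-in (an assumption for this example) for the random `data` array, so that the results are reproducible.

```python
import numpy as np

names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])
values = np.arange(28).reshape(7, 4)   # stand-in for the random data above

cond = (names == 'Bob') | (names == 'Will')
selected = values[cond]        # rows 0, 2, 3 and 4
print(selected.shape)          # (4, 4)

# Boolean indexing returns a copy, so assign through the mask to modify in place
values[~cond] = 0              # zero out the rows for 'Joe'
print(values[1])               # [0 0 0 0]
```

Unlike slices, a boolean selection such as `selected` is a copy, so changing it does not affect `values`.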

### Universal Functions

NumPy provides universal functions (ufuncs) that perform element-wise operations on array data. They are fast vectorized wrappers for simple functions that take one or more scalar values and produce one or more scalar results.

- Unary functions - e.g. sqrt, exp - perform element-wise transformations
- Binary functions - e.g. add, maximum - take two arrays and return a single array as the result
- Ufuncs accept an optional `out` parameter to write the result into an existing array, which allows in-place transformations
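
A short sketch of each kind of ufunc described above, using made-up input values:

```python
import numpy as np

x = np.array([1.0, 4.0, 9.0])
y = np.array([2.0, 3.0, 10.0])

roots = np.sqrt(x)           # unary ufunc: element-wise square root
larger = np.maximum(x, y)    # binary ufunc: element-wise maximum of two arrays

print(roots)                 # [1. 2. 3.]
print(larger)                # [ 2.  4. 10.]

# With out=, the result is written into an existing array instead of a new one
np.sqrt(x, out=x)
print(x)                     # [1. 2. 3.]
```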


# Pandas
___

pandas provides data structures and data manipulation tools designed for fast & easy data cleaning and analysis.

pandas adopts array-based computing from NumPy, but is designed for tabular or heterogeneous data.
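
To make the contrast with NumPy concrete, the sketch below builds a small table whose columns have different types. The column names and values are hypothetical, chosen only for illustration.

```python
import pandas as pd

# Each column holds its own type, unlike a single-dtype ndarray
df = pd.DataFrame({
    'name': ['Bob', 'Joe', 'Will'],
    'age': [34, 28, 45],
    'score': [88.5, 92.0, 79.5],
})

print(df.dtypes)             # one dtype per column

# Column arithmetic is vectorized, as with NumPy arrays
doubled = df['score'] * 2
print(doubled.tolist())      # [177.0, 184.0, 159.0]
```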

### Series

### DataFrame