<a href="https://colab.research.google.com/github/brendenwest/ad450/blob/master/2_numpy_pandas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NumPy & Pandas
___

### Reading

- McKinney, Chapters 4 - 5
- Molin - Introduction to Data Analysis
- Molin - Working with Pandas DataFrames
- https://blog.growingdata.com.au/a-guided-introduction-to-exploratory-data-analysis-eda-using-python/

### Practice
- https://www.datacamp.com/community/tutorials/python-numpy-tutorial
- https://www.datacamp.com/community/tutorials/pandas-tutorial-dataframe-python
- https://github.com/guipsamora/pandas_exercises 

### Learning Outcomes

- NumPy multi-dimensional array objects
- Array arithmetic and indexing
- Vectorized array operations
- Conditional logic as array operations
- Group-wise data manipulation
- Pandas data structures - Series and DataFrame
- Data selecting & filtering
- Computing descriptive statistics


# NumPy

---

NumPy is a foundational package for numerical computing in Python.

*   NumPy provides `ndarray`, an efficient multi-dimensional array supporting fast array-oriented arithmetic operations.
*   Can perform math operations on entire arrays without using for loops
*   Can perform common array operations like sorting, finding unique values, and set operations
*   Linear Algebra, random number generation and Fourier transform capabilities
*   C API for connecting with C, C++, & Fortran libraries
*   Can map data directly onto underlying disk or memory representation


### Efficiency
- NumPy stores data in contiguous memory blocks
- NumPy stores data with single type so operations don’t require type checking
- Performs complex computations on entire arrays without the need for loops
- Operations don’t copy arrays by default


In [0]:
import numpy as np

### ndarray
NumPy provides the `ndarray`, a multi-dimensional array structure optimized for fast numeric operations.

ndarrays are constructed from sequences of homogeneous values.

NumPy has built-in functions to create special arrays - e.g. `zeros`, `ones`, `empty`, `arange`
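As a minimal sketch of the array-creation helpers mentioned above (the variable names are illustrative and not part of the original notebook):

```python
import numpy as np

zeros = np.zeros((2, 3))      # 2x3 array of 0.0 (float64 by default)
ones = np.ones(4)             # 1-D array of four 1.0 values
steps = np.arange(0, 10, 2)   # like range(): start, stop (exclusive), step
scratch = np.empty((2, 2))    # allocated but uninitialized; contents are arbitrary

print(zeros.shape, zeros.dtype)
print(steps)
```

Note that `np.empty` only allocates memory, so its values should be overwritten before they are read.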

In [0]:
arr1 = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
arr1  # print ndarray

print("dimensions:",arr1.ndim)  # ndarrays have dimensions
print("shape:\t",arr1.shape)    # ndarrays have shape (number of rows & columns)
print("datatype",arr1.dtype)    # numpy determines the datatype

dimensions: 2
shape:	 (2, 4)
datatype int64


### Vectorized array operations

NumPy is designed & optimized for batch operations on array data without `for` loops

- Arithmetic operations between equal-size arrays apply the operation element-wise - multiplication, addition, subtraction, division
- Scalar operations propagate the scalar argument to each element in the array
- Comparisons between equal size arrays yield boolean arrays
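The three rules above can be sketched as follows. The arrays `a` and `b` are made-up sample data, not taken from the notebook:

```python
import numpy as np

a = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
b = np.array([[6.0, 5.0, 4.0], [3.0, 2.0, 1.0]])

total = a + b      # element-wise addition of equal-size arrays
product = a * b    # element-wise multiplication (not a matrix product)
halved = a / 2     # the scalar is applied to every element
larger = a > b     # element-wise comparison yields a boolean array

print(total)
print(larger)
```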

### Indexing & Slicing

NumPy supports data access with Python-like indexing & slicing

- One-dimensional ndarrays act similar to Python lists
- `ndarray` dimensions are sometimes referred to as axes - e.g. in a 2d array axis 0 is the ‘rows’ and axis 1 is the ‘columns’
- ndarray slices are **views** on the original array and not copied. Any changes to the view are reflected in the source array


In [0]:
arr1d = np.array([1, 2, 3, 4, 5, 6, 7, 8])

arr1d[5:8]  # select data based on index position

# Scalar values can be propagated (aka broadcast) to each element in a slice
arr1d[5:8] = 12
print(arr1d)

# array subsets must be copied explicitly

new_array = arr1d[5:8].copy()
print(new_array)


[ 1  2  3  4  5 12 12 12]
[12 12 12]


In [0]:
arr2d = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])

# elements in multi-dimensional arrays can be accessed with either syntax:

arr2d[0][2]
arr2d[0, 2]

# in multi-dimensional arrays, slicing that omits later indices will return a lower-dimensional ndarray

# Subarrays can be accessed with slices in place of indices:

arr2d[:2, 1:] # select first two rows and all but first column
arr2d[:2, 2]  # select first two rows and just 3rd column


array([3, 7])

### Boolean Indexing

NumPy supports boolean expressions in place of indices, where the expression results in an array of boolean values with the same length as the axis it’s indexing

In [0]:
import numpy as np
names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])
data = np.random.randn(7, 4) # create an array of random values
data[names == 'Bob'] # returns rows from 'data' at the positions where the boolean array is True

array([[-0.0113059 , -0.60401834, -0.69463262,  0.18446081],
       [ 0.94473667,  0.11833578,  0.52714475,  0.02470312]])

- The expression can be assigned to a variable

```
  cond = names == 'Bob'
  data[cond]
```
- The expression can be negated

```
  cond = names == 'Bob'
  data[~cond]
```
- The expression can be combined with other indices

```
  cond = names == 'Bob'
  data[cond, 2:]
```
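- Boolean expressions can be combined using `&` (and) and `|` (or). The Python keywords `and` and `or` do not work with boolean arrays, and each comparison needs its own parentheses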

```
    cond = (names == 'Bob') | (names == 'Will')
```

### Universal Functions

NumPy provides universal functions (ufuncs) that perform element-wise operations on array data. They are fast vectorized wrappers for simple functions that take a scalar value and produce one or more scalar results.

- Unary functions - e.g. sqrt, exp - perform element-wise transformations
- Binary functions - e.g. add, maximum - take two arrays and return a single array as the result
- Ufuncs can use an optional `out` parameter to perform in-place transformations
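A short sketch of unary and binary ufuncs and the `out` parameter. The arrays `x` and `y` are illustrative sample data:

```python
import numpy as np

x = np.array([1.0, 4.0, 9.0, 16.0])
y = np.array([2.0, 3.0, 10.0, 5.0])

roots = np.sqrt(x)           # unary ufunc: new array holding 1, 2, 3, 4
biggest = np.maximum(x, y)   # binary ufunc: element-wise maximum of x and y

np.sqrt(x, out=x)            # write the result back into x instead of allocating
print(x)
```

The `out` array must have a compatible shape and dtype, which is why `x` is created as a float array here.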


### Array Oriented Programming

NumPy supports 'vectorization', where data processing is executed as array expressions without `for` loops. Vectorized operations can be one to two orders of magnitude faster than pure Python equivalents.
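One rough way to see the difference is to time the same computation both ways. This is only a sketch: the array size is arbitrary and the measured times will vary by machine, so no specific speedup is claimed here.

```python
import time
import numpy as np

values = np.arange(100_000, dtype=np.float64)

# Pure Python: one squaring per loop iteration
start = time.perf_counter()
squares_loop = [v ** 2 for v in values]
loop_seconds = time.perf_counter() - start

# Vectorized: a single array expression
start = time.perf_counter()
squares_vec = values ** 2
vec_seconds = time.perf_counter() - start

print(f"loop: {loop_seconds:.4f}s  vectorized: {vec_seconds:.4f}s")
```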


### Conditional Logic

- np.where() is a vectorized ternary expression. For example, to take a value from `arr1` whenever the corresponding value in `cond` is True, and otherwise take the value from `arr2`:

`result = np.where(cond, arr1, arr2)`

- The second and third arguments can be arrays or scalars
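A minimal sketch of both forms of `np.where`, first with two arrays and then with scalars. The names `xarr`, `yarr`, and `data` are illustrative:

```python
import numpy as np

xarr = np.array([1.1, 1.2, 1.3, 1.4, 1.5])
yarr = np.array([2.1, 2.2, 2.3, 2.4, 2.5])
cond = np.array([True, False, True, True, False])

# Take from xarr where cond is True, otherwise from yarr
result = np.where(cond, xarr, yarr)
print(result)

# Scalars work too: replace positive values with 1 and the rest with -1
data = np.array([[-1.5, 0.3], [2.0, -0.7]])
signs = np.where(data > 0, 1, -1)
print(signs)
```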


### Math & Statistical Methods

NumPy can compute statistics for an entire array or the data along a single axis.
- Can compute aggregations by invoking the array instance method or the top-level NumPy function
- Can specify whether to compute across rows or columns
- Can use `sum` to count number of True values in a boolean array 
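The points above can be sketched with a small 2x3 array (sample data chosen for illustration):

```python
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])

overall = arr.mean()           # instance method over the whole array
same = np.mean(arr)            # equivalent top-level function
col_totals = arr.sum(axis=0)   # collapse the rows: one total per column
row_totals = arr.sum(axis=1)   # collapse the columns: one total per row
count_big = (arr > 2).sum()    # True counts as 1, so this counts matches

print(overall, col_totals, row_totals, count_big)
```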

# Pandas
___

pandas supports data structures and data manipulation tools designed for fast & easy data cleaning and analysis.

pandas adopts array-based computing from NumPy, but is designed for tabular or heterogeneous data.

pandas has two primary data structures:

- **Series** - one-dimensional array-like object with a sequence of values having the same datatype
- **DataFrame** - rectangular table of data with ordered collection of columns, each of which can be a different value type

### Series

A Series has a sequence of values and an associated array of data labels called its index. It is sort of like an ordered dict that maps index values to data values.

- If not specified otherwise, the index values are sequential integers
- Index values can be strings
- Index-value link is preserved when the Series is filtered or modified
- Series data can be selected by indexing on label 
- Both the Series and the index have a name attribute

A Series can be created directly from a Python dict. In current pandas versions the Series keeps the key order of the dict (older versions sorted the keys), but you can specify a different order by passing an index.

pandas can automatically determine datatype of values when a Series is created, but datatype can also be specified.

A Series index can be altered in-place by assigning new values.
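The Series behaviors described above can be sketched as follows. The state names and numbers are made-up sample data:

```python
import pandas as pd

populations = pd.Series({"Ohio": 35000, "Texas": 71000, "Oregon": 16000})
populations.name = "population"
populations.index.name = "state"

texas = populations["Texas"]               # select by label
large = populations[populations > 20000]   # filtering keeps the labels

# Passing an explicit index fixes the order; a label with no data gets NaN
states = ["California", "Ohio", "Oregon", "Texas"]
reordered = pd.Series({"Ohio": 35000, "Texas": 71000, "Oregon": 16000}, index=states)

# The dtype can be given explicitly instead of being inferred
as_float = pd.Series([1, 2, 3], dtype="float64")

# Assigning to .index relabels the Series in place
populations.index = ["OH", "TX", "OR"]
print(populations)
```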


### DataFrame

A pandas DataFrame is sort of a dict of Series all sharing the same index.

- DataFrames have both row and column indices
- DataFrames are physically 2D, but can represent higher-dimensional data using hierarchical indexing
- DataFrame rows are sometimes referred to as axis=0
- DataFrame columns are sometimes referred to as axis=1

By default, DataFrame columns follow the key order of the source dict (older pandas versions sorted the keys). Order can also be specified.

A column can be retrieved as a Series by bracket or dot notation:
```
frame['column']
frame.column
```
The returned Series will have the same index as the DataFrame and with the name attribute appropriately set.

Column data can be assigned new values, either a scalar or array-like values. Lists or arrays assigned to a column must have the same length as the column. 

A Series can be assigned to a column, with its index aligning to the DataFrame’s index.

Syntax for creating a new column is similar to assignment, but only bracket notation works.

Columns can be removed with the **del** statement.

DataFrames can be populated from a variety of array-like structures.

Missing values in the source data will be populated with `NaN` or another missing-value marker (such as `None` or `NaT`), depending on the column data type.
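A minimal sketch of column retrieval, assignment, and deletion. The `frame` contents and column names are illustrative sample data:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({
    "state": ["Ohio", "Ohio", "Nevada"],
    "year": [2000, 2001, 2001],
    "population": [1.5, 1.7, 2.4],
})

state_col = frame["state"]   # bracket notation
year_col = frame.year        # dot notation (valid identifiers only)

frame["debt"] = 16.5              # a scalar fills the whole column
frame["debt"] = np.arange(3.0)    # an array must match the column length

# A Series is aligned on the index; rows without a match become NaN
frame["debt"] = pd.Series([-1.2, -1.5], index=[0, 2])

frame["eastern"] = frame["state"] == "Ohio"   # new columns need bracket notation
del frame["eastern"]                          # remove the column again
print(frame)
```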


#### Indexes & Slicing

Index objects 
- are immutable and hold axis labels & metadata 
- behave like sets, but can contain duplicate labels
- Can be accessed like so:
```
	frame.index	# returns rows index
	frame.columns	# returns columns index
```
- Support a number of set methods and properties

DataFrames can be reindexed to rearrange data according to a new index, as long as labels are unique.

Reindex can introduce missing values or fill in missing values explicitly.

Reindex can alter row index, columns, or both.

Entries can be dropped from an axis using index labels. For DataFrames, drop will automatically drop rows unless columns axis is specified.

Methods like `drop`, which modify the size or shape of a Series or DataFrame, return a new object unless you specify `inplace=True`.
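Reindexing and dropping can be sketched like this. The labels and values are sample data chosen for illustration:

```python
import pandas as pd

obj = pd.Series([4.5, 7.2, -5.3], index=["d", "b", "a"])

# Rearranges to the new index; the unseen label "c" becomes NaN
reindexed = obj.reindex(["a", "b", "c", "d"])

# Missing entries can be filled explicitly instead
filled = obj.reindex(["a", "b", "c", "d"], fill_value=0)

frame = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]}, index=["a", "b", "c"])

without_row = frame.drop("a")                  # drops a row by default
without_col = frame.drop("y", axis="columns")  # drops a column when the axis is given

frame.drop("c", inplace=True)                  # modifies frame itself, returns None
print(frame)
```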

Index labels can be used for data selection or filtering.

Series data can be indexed by label or filtered by a boolean condition on the data values. A label selection can be a single value or a sequence.

Slicing a Series with labels differs from normal Python - the endpoint is inclusive.

DataFrame indexing with `[]` returns one or more columns using column names.

As a convenience, DataFrames also support row selection with `[]`, using either a Python-style slice or a boolean array.

DataFrame rows can also be selected by axis labels (`loc`) or integer positions (`iloc`).

Each indexer takes a row selection and an optional column selection. Each selection can be a scalar, a sequence, or a slice.

DataFrames also support selecting a single value at a row & column position (`at` by label, `iat` by integer position).
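The selection styles above can be sketched on a small frame. The row and column labels are sample data, and `at` is used here as one way to fetch a single value by label:

```python
import numpy as np
import pandas as pd

ser = pd.Series([0, 1, 2, 3], index=["a", "b", "c", "d"])
label_slice = ser["b":"c"]   # label slices include the endpoint

data = pd.DataFrame(
    np.arange(16).reshape(4, 4),
    index=["Ohio", "Colorado", "Utah", "New York"],
    columns=["one", "two", "three", "four"],
)

cols = data[["three", "one"]]        # [] with names selects columns
first_rows = data[:2]                # [] with a slice selects rows
filtered = data[data["three"] > 5]   # [] with a boolean array selects rows

by_label = data.loc["Colorado", ["two", "three"]]   # labels: row, then columns
by_position = data.iloc[2, [3, 0, 1]]               # integer positions
up_to_utah = data.loc[:"Utah", "two"]               # label slice, inclusive
single = data.at["Utah", "two"]                     # one value by label
print(by_label)
```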


#### Arithmetic & Data Alignment

When pandas objects are added, the resulting index is a union of the source indexes. Where index labels don’t overlap, missing values are inserted.

By default, arithmetic between DataFrame and Series matches the index of the Series on the DataFrame’s columns, broadcasting down the rows.

  `frame - series`

To broadcast over columns instead, matching on the rows, you have to use an arithmetic method and specify the axis to match on:

  `frame.sub(series, axis='index')`
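A small sketch of index alignment and of the two broadcasting directions. All labels and values are illustrative:

```python
import numpy as np
import pandas as pd

s1 = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
s2 = pd.Series([10.0, 20.0, 30.0], index=["b", "c", "d"])
summed = s1 + s2   # index is the union a..d; "a" and "d" have no partner, so NaN

frame = pd.DataFrame(
    np.arange(6.0).reshape(3, 2),
    index=["r1", "r2", "r3"],
    columns=["x", "y"],
)

row = frame.iloc[0]       # Series indexed by the column labels
by_row = frame - row      # matched on columns, repeated down the rows

col = frame["x"]          # Series indexed by the row labels
by_col = frame.sub(col, axis="index")   # matched on rows, repeated across columns
print(by_col)
```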

NumPy ufuncs can be applied to pandas objects.

Pandas can also **apply()** an array function to each column or row of a DataFrame. The applied function can return a scalar or a Series.

By default, apply() is invoked per column, but you can specify `axis='columns'` instead to invoke for each row.

Element-wise functions can be applied with **applymap()** (deprecated since pandas 2.1 in favor of **DataFrame.map()**).
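The following sketch applies a ufunc, a column-wise and row-wise function, and an element-wise function. The helper names `spread` and `min_max` are hypothetical. For the element-wise step it uses `Series.map` on a single column, which behaves the same across pandas versions, rather than the whole-frame method named above.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    {"b": [1.0, -2.0, 3.0], "d": [4.0, 0.5, -6.0]},
    index=["Utah", "Ohio", "Texas"],
)

absolute = np.abs(frame)   # NumPy ufuncs work on pandas objects

def spread(values):
    """Difference between the largest and smallest value."""
    return values.max() - values.min()

per_column = frame.apply(spread)                  # called once per column
per_row = frame.apply(spread, axis="columns")     # called once per row

def min_max(values):
    """Return several statistics as a Series."""
    return pd.Series([values.min(), values.max()], index=["min", "max"])

stats = frame.apply(min_max)   # a Series result per column builds a DataFrame

formatted = frame["b"].map(lambda v: f"{v:.2f}")  # element-wise on one column
print(stats)
```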


#### Sorting & Ranking

pandas objects can be sorted by index label using **sort_index()** in ascending or descending order.

DataFrames can be sorted by either axis.

**sort_values()** has similar syntax, but sorts the object by values instead. Data in one or more columns can be used as sort keys when sorting a DataFrame.

**rank()** returns the rank of data points in an array. By default, rank() breaks ties by assigning each tied item the mean rank of its group. Ranks can also be assigned according to the order in which values are observed in the data (`method='first'`).

For DataFrames, rank can be computed over the rows or columns.
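Sorting and ranking can be sketched as follows, using made-up sample data:

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4], index=["d", "a", "b", "c"])

by_label = obj.sort_index()                    # labels in order a, b, c, d
by_value = obj.sort_values(ascending=False)    # largest values first

mean_ranks = obj.rank()                  # the tied 7s share the mean of ranks 3 and 4
first_ranks = obj.rank(method="first")   # ties ranked in order of appearance

frame = pd.DataFrame({"b": [4, 7, -3, 2], "a": [0, 1, 0, 1]})

sorted_cols = frame.sort_index(axis="columns")   # sort the column labels
sorted_rows = frame.sort_values(["a", "b"])      # sort rows by "a", then by "b"
row_ranks = frame.rank(axis="columns")           # rank within each row
print(sorted_rows)
```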


#### Computing Descriptive Statistics

pandas objects support common math & statistical methods, mostly for **reductions** or summary statistics - methods that extract a single value from a Series or a Series of values from the rows or columns of a DataFrame.

NA values are excluded by default.

**idxmin()** and **idxmax()** return the index value where minimum or maximum values are found.

**describe()** returns multiple summary statistics in one pass.
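A brief sketch of reductions, NA handling, and the summary helpers. The frame `df` holds illustrative values with some missing entries:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
    index=["a", "b", "c", "d"],
    columns=["one", "two"],
)

col_sums = df.sum()                   # one value per column, NaN skipped
row_sums = df.sum(axis="columns")     # one value per row
strict_means = df.mean(axis="columns", skipna=False)   # NaN if any value is missing

max_labels = df.idxmax()   # index label of the maximum in each column
summary = df.describe()    # count, mean, std, min, quartiles, max per column
print(summary)
```

With the default `skipna=True`, the row that contains only missing values sums to 0.0 rather than NaN.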
