<center> <h1>Unit 2</h1> </center>
<center> <h1>Numpy, Pandas, Matplotlib</h1> </center>
<center>(Follow along by running "Numpy, Pandas, and Matplotlib.ipynb")</center>
<br>
<br>
<br>
<center> <h3>IST 718 – Big Data Analytics</h3> </center>
<center> <h3>Daniel E. Acuna</h3> </center>
<center> <h3>http://acuna.io</h3> </center>

# The Scientific Python Ecosystem
- The following commonly used Python modules are part of an ecosystem referred to as ***SciPy***:  

<center><img src="./images/unit-02/unit-02-0_pfds5.png" width="90%" align="center"></center>

# Python Modules and Packages
- A ***module*** is a single Python file that is intended to be imported into other Python scripts.
  - The Python Standard Library ships with Python, so its modules (for example, `math`) need no installation, but they still have to be imported. Only built-ins such as `len` and `print` are available without an import.
  - Other reusable code is imported as modules.
  
  
- A ***package*** is a collection of Python modules under a common namespace.
  - Think of packages as folders, and modules as files in a folder.
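
As a small illustration of the difference, the sketch below imports one module (`math`) and one package (`json`) from the standard library. The `__path__` check is just one convenient way to tell them apart, not the only one.

```python
import math            # a single module from the standard library
import json            # a package: a folder of modules
import json.decoder    # one module inside the json package

# Packages expose a __path__ attribute (the folder they live in); plain modules do not.
is_math_package = hasattr(math, "__path__")
is_json_package = hasattr(json, "__path__")

print("math is a package:", is_math_package)
print("json is a package:", is_json_package)
print("sqrt from the math module:", math.sqrt(16))
print("class from json.decoder:", json.decoder.JSONDecoder.__name__)
```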

# Numpy

- Plain Python lists do not support standard linear algebra operations. For example, multiplying a list by a scalar repeats the list instead of scaling its elements.


In [None]:
A = [1, 2, 3, 4]
4*A

# Numpy (2)

- Numpy stores multi-dimensional arrays and performs linear algebra computations on them very efficiently

In [None]:
import numpy as np
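
To make the contrast with plain lists concrete, here is a minimal sketch that multiplies the same data by 4 as a list and as an array. The variable names are only for illustration.

```python
import numpy as np

A_list = [1, 2, 3, 4]
A_array = np.array(A_list)

repeated = 4 * A_list   # list repetition: four copies back to back
scaled = 4 * A_array    # element-wise scaling

print(len(repeated))    # the list now has 16 elements
print(scaled)           # the array holds 4, 8, 12, 16
```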

# Creating arrays

- You can create arrays in several ways
  - From a Python list or tuple
  - From a special matrix generator
  - Other libraries generate Numpy arrays

In [None]:
print("From vanilla Python", np.array([1,2,3,4]))
print("From generators", np.zeros((3, 3)))
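
Besides `np.zeros`, a few other generators come up often. The sketch below shows some of them; which ones you need depends on the task.

```python
import numpy as np

ones = np.ones((2, 3))              # 2x3 matrix of ones
identity = np.eye(3)                # 3x3 identity matrix
steps = np.arange(0, 10, 2)         # like range(): 0, 2, 4, 6, 8
grid = np.linspace(0.0, 1.0, 5)     # 5 evenly spaced points from 0 to 1 (inclusive)

print(ones.shape, identity.shape)
print(steps)
print(grid)
```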

# ND-arrays store only one data type
- For efficiency reasons, an array stores only one data type; mixed inputs are converted to a common type

In [None]:
A0 = np.array([0, 2, "cat"])
print(A0)
print(A0.dtype)
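
The sketch below shows a few more cases of how the single data type is chosen, and how to request or change it explicitly. The exact string type reported for `text` can vary by platform, so only its kind is inspected here.

```python
import numpy as np

ints = np.array([1, 2, 3])                       # integer array
mixed = np.array([1, 2.5, 3])                    # ints are upcast to float
as_float = np.array([1, 2, 3], dtype=float)      # request a type explicitly
text = np.array([0, 2, "cat"])                   # everything becomes a string
back_to_int = np.array(["1", "2"]).astype(int)   # convert after the fact

print(ints.dtype, mixed.dtype, as_float.dtype)
print(text.dtype.kind)   # 'U' stands for unicode strings
print(back_to_int)
```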

# Accessing elements

- You access individual elements with `A[i]` syntax
- Slices with `A[start:stop:step]`
- In multiple dimensions: `A[i, j]` or `A[start:stop:step, start:stop:step]`

In [None]:
# some examples
A1 = np.array([1, 2, 3, 4])
A2 = np.array([[1, 2, 3],
               [4, 5, 6],
               [7, 8, 9]])

In [None]:
print('A1[0] => ', A1[0])
print('A1[0:3:2] =>', A1[0:3:2])

In [None]:
print('A2[0] =>', A2[0])
print('A2[0:3:2] =>', A2[0:3:2])

In [None]:
print('A2[0:3:2, 0:2] =>', A2[0:3:2, 0:2])
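
A few more indexing patterns on the same 3 by 3 matrix, as a quick sketch: negative indices, whole columns, and stepping in both dimensions.

```python
import numpy as np

A2 = np.array([[1, 2, 3],
               [4, 5, 6],
               [7, 8, 9]])

last_row = A2[-1]          # negative indices count from the end
second_col = A2[:, 1]      # ":" keeps every row, 1 picks the second column
corners = A2[::2, ::2]     # every other row and every other column
single = A2[1, 2]          # row 1, column 2

print(last_row, second_col, single)
print(corners)
```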

# Matrix operations

- Standard linear algebra operations work in Numpy
- Scalar times matrix
- Matrix times matrix
- Operations such as inverse, transpose

In [None]:
# some operations are in another package
import numpy.linalg as la

A = np.array([[2, 3],
             [3, 4],
             [5, 6]])
B = np.array([[1, 2, 3],
             [4, 5, 6]])
C = np.array([[1, 3],
             [3, 2],
             [-1, 6]])

In [None]:
# Matrix transpose
A.T

In [None]:
2*A

In [None]:
4 + A

In [None]:
# matrix multiplication: (3x2) times (2x3) gives a 3x3 matrix
D = A.dot(B)
D

In [None]:
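# matrix "division" means multiplying by an inverse, e.g. A*C^-1
# C is not square, so it has no inverse; we use the pseudo-inverse instead
# (np.invert is a bitwise NOT, not a matrix inverse)
A.dot(la.pinv(C))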

In [None]:
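# checking an identity: inverting twice should recover the original matrix
# D = A.dot(B) has rank 2 and is singular, so we test with the invertible A.T.dot(A)
E = A.T.dot(A)
la.inv(la.inv(E))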
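
As a complement to the cells above, this sketch shows the `@` operator (equivalent to `.dot` for 2-D arrays), the difference from element-wise `*`, and an inverse check on a small matrix `M` that is made up for illustration.

```python
import numpy as np
import numpy.linalg as la

A = np.array([[2, 3],
              [3, 4],
              [5, 6]])
B = np.array([[1, 2, 3],
              [4, 5, 6]])

D = A @ B                      # same result as A.dot(B); shape (3, 3)
elementwise = A * A            # "*" multiplies element by element

M = np.array([[2.0, 1.0],
              [1.0, 3.0]])     # a small invertible matrix (hypothetical example)
M_inv = la.inv(M)
identity_check = M @ M_inv     # should be close to the 2x2 identity

print(D)
print(np.round(identity_check, 10))
```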

# Random number generation

- Sometimes it is important to generate random numbers to do simulations

In [None]:
# there are many random number generators
list(filter(lambda x: '_' not in x, dir(np.random)))

In [None]:
# generate 5 by 5 matrix with uniform numbers from 0 to 1
np.random.random(size=(5, 5))

In [None]:
# generate 5 by 5 matrix with normal (Gaussian) distribution
# mean 5 and standard deviation 2.
np.random.normal(size=(5, 5), loc=5., scale=2.)
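
Simulations are easier to debug when they are reproducible. One option, sketched below, is Numpy's `default_rng` with a fixed seed; the seed value 42 is arbitrary.

```python
import numpy as np

# Two generators with the same seed produce the same stream of numbers
rng_a = np.random.default_rng(seed=42)
rng_b = np.random.default_rng(seed=42)

uniform_a = rng_a.random(size=(5, 5))     # uniform on [0, 1)
uniform_b = rng_b.random(size=(5, 5))
normal_a = rng_a.normal(loc=5.0, scale=2.0, size=(5, 5))   # mean 5, std 2

print(np.array_equal(uniform_a, uniform_b))
print(normal_a.shape)
```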

# Aggregate operations

- Summary operations such as the mean, sum, or maximum can be computed over the whole array or along one dimension
- For example, the average per row or column

In [None]:
A3 = np.random.random(size=(3, 5))

In [None]:
# mean across all dimensions
A3.mean()

In [None]:
# mean of each column (averages over the rows, dimension 0)
A3.mean(axis=0)

In [None]:
# mean of each row (averages over the columns, dimension 1)
A3.mean(axis=1)

In [None]:
# there are many such operations
list(filter(lambda x: '_' not in x, dir(A3)))
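
Because `A3` is random, it can be hard to see what `axis` does. The sketch below uses a small fixed matrix so the results can be checked by hand.

```python
import numpy as np

M = np.arange(6).reshape(2, 3)   # [[0, 1, 2], [3, 4, 5]]

overall = M.mean()               # 2.5
col_means = M.mean(axis=0)       # [1.5, 2.5, 3.5], one value per column
row_means = M.mean(axis=1)       # [1.0, 4.0], one value per row
col_max = M.max(axis=0)          # [3, 4, 5]

print(overall, col_means, row_means, col_max)
```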

# Chaining operations

- Because each Numpy operation returns an array, we can easily chain operations

In [None]:
A4 = np.random.random(size=(3, 5))

In [None]:
A4.T.mean(axis=0).sum()

# Selecting elements with "masks"

In [None]:
# we can create boolean masks
A4 > 0.5

In [None]:
# and then use the mask as an index to get the matching values
A4[A4>0.5]

In [None]:
# you can also select entire rows using the same idea
A4.sum(axis=1)>2.5

In [None]:
# select rows
A4[A4.sum(axis=1)>2.5]
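
The same masking ideas on a small fixed matrix, as a sketch: combining conditions with `&` (each condition needs its own parentheses) and assigning through a mask.

```python
import numpy as np

M = np.array([[1, 8, 3],
              [7, 2, 9],
              [4, 6, 5]])

big = M[M > 5]                       # 1-D array of matches: [8, 7, 9, 6]
mid = M[(M > 2) & (M < 7)]           # [3, 4, 6, 5]
heavy_rows = M[M.sum(axis=1) > 14]   # rows whose sum exceeds 14

clipped = M.copy()
clipped[clipped > 5] = 5             # masks also work on the left-hand side

print(big, mid)
print(heavy_rows)
print(clipped)
```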

# Operation broadcasting

- Sometimes we want to apply an operation to each row or column
- For example, subtract the mean row from every row of a matrix with $n$ rows (this centers each column)
$$ A_{\text{centered}} = \left(a_{ij} - \frac{1}{n}\sum_z a_{zj}\right)_{ij}$$
- Or "standardize" columns
$$ A_{\text{standardized}} = \left(\frac{a_{ij} - \frac{1}{n}\sum_z a_{zj}}{\text{std}(a_{:j})}\right)_{ij}$$

In [None]:
A5 = np.random.random(size=(5, 3))

In [None]:
# centered
A5 - A5.mean(axis=0)

In [None]:
# standardized
(A5 - A5.mean(axis=0))/A5.std(axis=0)
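
A quick way to convince yourself that broadcasting did the right thing is to check the result. The sketch below verifies that standardized columns have mean 0 and standard deviation 1, and shows that centering rows (rather than columns) needs `keepdims=True` so the shapes line up. The seed is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
A5 = rng.random(size=(5, 3))

col_means = A5.mean(axis=0)                        # shape (3,), broadcast over rows
standardized = (A5 - col_means) / A5.std(axis=0)

# Row-wise centering: keepdims=True gives shape (5, 1) instead of (5,)
row_centered = A5 - A5.mean(axis=1, keepdims=True)

print(np.round(standardized.mean(axis=0), 10))
print(np.round(standardized.std(axis=0), 10))
print(np.round(row_centered.mean(axis=1), 10))
```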

# Apply other functions to arrays
There are many functions that you can apply to arrays. In fact, the convenient operators are shorthand for more explicit functions. For example, `A + 3` is shorthand for `np.add(A, 3)`.

In [None]:
A + 3

In [None]:
np.add(A, 3)

In [None]:
np.sin(A)

In [None]:
# exponential

np.exp(A)
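
The sketch below checks a few operator and function pairs side by side, and adds `np.minimum` as an example of a function with no operator form.

```python
import numpy as np

A = np.array([[2, 3],
              [3, 4],
              [5, 6]])

same_sum = np.array_equal(A + 3, np.add(A, 3))
same_product = np.array_equal(A * 2, np.multiply(A, 2))
same_power = np.array_equal(A ** 2, np.power(A, 2))
capped = np.minimum(A, 4)      # element-wise minimum against a scalar

print(same_sum, same_product, same_power)
print(capped)
```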

# Activity: Linear regression
- Linear regression is a simple linear prediction method
- $age = (30 \; 20\;  33\;  25\;  50)^T$
- $income = (25000\;  22000\;  21000\;  27000\;  40000)^T$
- Let's assume a simple model
$$ \hat{income} = b_0 + b_1 age $$
- Use $b = (20000 \; 5000)^T$
- Define the matrices X, y, and b for the predictions $Xb$

In [None]:
# let's define a simple model for making predictions
# define X, y, and b
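
One possible way to set this up is sketched below. It assumes the intercept $b_0$ is handled by adding a column of ones to `X`, which is a common convention but not the only one.

```python
import numpy as np

age = np.array([30, 20, 33, 25, 50])
income = np.array([25000, 22000, 21000, 27000, 40000])

# Column of ones for the intercept b_0, then the age column for b_1
X = np.column_stack([np.ones(len(age)), age])
y = income
b = np.array([20000, 5000])

predictions = X.dot(b)   # b_0 + b_1 * age for each person

print(X.shape)
print(predictions)
```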

# Activity: Compute the Mean Squared Error
- The mean squared error (MSE) measures how far a model's predictions are from the observed values
$$MSE(b) = \frac{1}{n}\sum_{i=1}^{n} (\hat{income}_i - income_i)^2$$
- Write a function `mse` that takes X, b, and y, and computes MSE

In [None]:
def mse(X, b, y):
    pass
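
A minimal sketch of one way to fill in `mse`, assuming `X` already contains the column of ones for the intercept. The toy data below is hypothetical and chosen so the answers are easy to check by hand.

```python
import numpy as np

def mse(X, b, y):
    """Mean squared error of the linear predictions X.dot(b) against y."""
    residuals = X.dot(b) - y
    return np.mean(residuals ** 2)

# Hypothetical toy data: y = 1 + 2 * x exactly
X_toy = np.array([[1.0, 0.0],
                  [1.0, 1.0],
                  [1.0, 2.0]])
y_toy = np.array([1.0, 3.0, 5.0])

perfect = mse(X_toy, np.array([1.0, 2.0]), y_toy)      # predictions match y exactly
off_by_one = mse(X_toy, np.array([2.0, 2.0]), y_toy)   # every prediction is 1 too high

print(perfect, off_by_one)
```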

# Activity: Computing the gradient of MSE
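$$\nabla MSE(b) = \left(\frac{\partial MSE(b)}{\partial b_0} \quad \frac{\partial MSE(b)}{\partial b_1}\right)^T$$
- Define a function `grad` that takes X and b and returns the gradient of MSE (the gradient also depends on y, which `grad` can read from the enclosing scope)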

In [None]:
def grad(X, b):
    pass
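
Differentiating the MSE with respect to $b$ gives $\nabla MSE(b) = \frac{2}{n} X^T (Xb - y)$. The sketch below implements that formula and compares it against a numerical estimate. Note that it passes `y` as an explicit third argument, which departs from the two-argument stub above; the toy data is hypothetical.

```python
import numpy as np

def mse(X, b, y):
    return np.mean((X.dot(b) - y) ** 2)

def grad(X, b, y):
    """Gradient of the MSE with respect to b: (2/n) * X^T (Xb - y)."""
    n = len(y)
    return (2.0 / n) * X.T.dot(X.dot(b) - y)

X_toy = np.array([[1.0, 0.0],
                  [1.0, 1.0],
                  [1.0, 2.0]])
y_toy = np.array([1.0, 3.0, 5.0])
b_toy = np.array([0.5, -1.0])

# Numerical check with central differences, one coordinate of b at a time
eps = 1e-6
numeric = np.array([
    (mse(X_toy, b_toy + eps * e, y_toy) - mse(X_toy, b_toy - eps * e, y_toy)) / (2 * eps)
    for e in np.eye(2)
])
analytic = grad(X_toy, b_toy, y_toy)

print(analytic)
print(numeric)
```

If the two printed vectors agree to several decimal places, the analytic gradient is very likely correct.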

# Activity: Gradient descent

- The following algorithm finds the $b$ that minimizes MSE

```
b = random vector
for i in [1, ..., n_iterations]:
    b = b - L * grad(X, b)
```

- where L is known as the learning rate.
- Display the `mse` after each iteration. The `mse` should decrease after each iteration, provided L is small enough
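
The loop above could be sketched as follows. The learning rate `L = 1e-4`, the seed, and the 20 iterations are assumptions chosen for this data: because the ages are not rescaled, much larger rates such as 0.01 make the error grow instead of shrink. As in the previous sketch, `grad` takes `y` explicitly.

```python
import numpy as np

age = np.array([30, 20, 33, 25, 50], dtype=float)
income = np.array([25000, 22000, 21000, 27000, 40000], dtype=float)
X = np.column_stack([np.ones(len(age)), age])
y = income

def mse(X, b, y):
    return np.mean((X.dot(b) - y) ** 2)

def grad(X, b, y):
    return (2.0 / len(y)) * X.T.dot(X.dot(b) - y)

rng = np.random.default_rng(0)
b = rng.normal(size=2)        # random starting vector
L = 1e-4                      # learning rate (assumed value)
n_iterations = 20

history = [mse(X, b, y)]      # MSE before any update
for i in range(n_iterations):
    b = b - L * grad(X, b, y)
    history.append(mse(X, b, y))
    print(i + 1, history[-1])
```

Twenty iterations are enough to see the error fall, but not enough to reach the minimum; the intercept in particular converges slowly on unscaled data.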

# Pandas

- One of the problems with Numpy arrays is that the columns and rows do not have names
- Also, Numpy arrays can hold only one data type
- Sometimes, we want to store something like a "spreadsheet" with names for columns and different datatypes

# What is pandas?
- ***pandas*** is an open-source library with easy-to-use data structures and functions that simplify data analysis and modeling in Python, including:
  - `DataFrame` and `Series` data structures
  - tools for reading and writing data in CSV format, Excel, and SQL databases
  - "group by" operator on data sets
  - merging and joining data sets  


- pandas data structures are used in many other Python libraries, so it is a good library to be familiar with.
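
To give a flavor of the "group by" and merge features, here is a minimal sketch. The two small tables and their column names are made up for illustration.

```python
import pandas as pd

# Hypothetical data: individual sales and a lookup table of stores
sales = pd.DataFrame({
    "store": ["A", "A", "B", "B"],
    "amount": [10, 20, 5, 15],
})
stores = pd.DataFrame({
    "store": ["A", "B"],
    "city": ["Syracuse", "Albany"],
})

totals = sales.groupby("store")["amount"].sum()        # one total per store
merged = sales.merge(stores, on="store", how="left")   # adds the city to each sale

print(totals)
print(merged)
```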

# Using Pandas in Your Programs
- We’ll need to import some packages and modules to use Pandas
```python
import pandas as pd
import numpy as np
from pandas import DataFrame, Series
```

# Essential Pandas: Series and DataFrames
- Pandas has two data structures
  - `Series` $\rightarrow$ A labeled list of data
  - `DataFrame` $\rightarrow$ A dictionary of `Series`
  

- The `DataFrame` is a table of data, and the `Series` represents one column in that table.


- NOTE: Pandas is “smart enough” to create a `DataFrame` from a list of dictionaries, too.
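
The sketch below builds a `Series`, a `DataFrame` from a dictionary of `Series`, and a `DataFrame` from a list of dictionaries. The row labels are hypothetical.

```python
import pandas as pd

labels = ["ann", "bob", "cy"]   # hypothetical row labels
ages = pd.Series([30, 20, 33], index=labels, name="age")

# A DataFrame as a dictionary of Series that share an index
from_series = pd.DataFrame({
    "age": ages,
    "income": pd.Series([25000, 22000, 21000], index=labels),
})

# A DataFrame from a list of dictionaries (one dictionary per row)
from_records = pd.DataFrame([
    {"age": 30, "income": 25000},
    {"age": 20, "income": 22000},
])

one_column = from_series["age"]   # selecting a column returns a Series

print(from_series)
print(type(one_column).__name__)
```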

# Demo: Exploring Pandas DataFrame and Series

# Demo: Pandas Data Manipulations
<h3><font color="gray">Row and Column Extractions</font></h3>
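
A minimal sketch of the most common extraction patterns, reusing the age and income values from the regression activity. The row labels `a` to `e` are made up.

```python
import pandas as pd

df = pd.DataFrame(
    {"age": [30, 20, 33, 25, 50],
     "income": [25000, 22000, 21000, 27000, 40000]},
    index=["a", "b", "c", "d", "e"],   # hypothetical row labels
)

age_col = df["age"]                    # one column -> Series
two_cols = df[["age", "income"]]       # list of columns -> DataFrame
row_by_label = df.loc["c"]             # row by index label
row_by_position = df.iloc[0]           # row by integer position
older = df[df["age"] > 30]             # boolean mask, as with Numpy arrays
cell = df.loc["e", "income"]           # single value by row and column label

print(older)
print(cell)
```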

# Loading From a File to Pandas is Easy
- Use the pandas `read_csv` function to load data.
- It assumes the first row is a header, but this can be overridden with the `header` parameter.
- http://pandas.pydata.org/pandas-docs/stable/io.html
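
The sketch below reads CSV text with and without a header row. It uses `io.StringIO` in place of a file on disk so that it runs anywhere; with a real file you would pass the path instead.

```python
import io
import pandas as pd

# First row is a header (the default assumption)
with_header = io.StringIO("age,income\n30,25000\n20,22000\n")
df1 = pd.read_csv(with_header)

# No header row: say so with header=None and supply column names
no_header = io.StringIO("30,25000\n20,22000\n")
df2 = pd.read_csv(no_header, header=None, names=["age", "income"])

print(df1)
print(df2)
```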

# Demo: Exploring a data set in pandas

# matplotlib
- A 2D and 3D plotting library that can be used by Python as well as other frameworks.
- Has a large API with functions that can graph just about any plot imaginable.
<center><img src="./images/unit-02/unit-02-0_pfds6.png" width="60%" align="center"></center>
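
As a first taste of the API, here is a minimal sketch of a scatter plot using the age and income values from the regression activity. The figure size, labels, and title are arbitrary choices.

```python
import numpy as np
import matplotlib.pyplot as plt

age = np.array([30, 20, 33, 25, 50])
income = np.array([25000, 22000, 21000, 27000, 40000])

fig, ax = plt.subplots(figsize=(5, 3))
ax.scatter(age, income, label="observed")   # one point per person
ax.set_xlabel("age")
ax.set_ylabel("income")
ax.set_title("Income by age")
ax.legend()
```

In a notebook the figure is displayed at the end of the cell; in a plain script, call `plt.show()` afterwards.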
