# Intro to Numpy

Today we will be doing an introduction to the Python library [Numpy](https://docs.scipy.org/doc/numpy-1.10.0/user/basics.html). Numpy uses its own data structure, an array, to do numerical computations. The Numpy library is often used in scientific and engineering contexts for doing data manipulation.

For reference, here's a link to the official [Numpy documentation](https://docs.scipy.org/doc/numpy/reference/routines.html).

In [1]:
## An import statement for getting the Numpy library:
import numpy as np
## Also import csv to process the data file (black magic for now):
import csv

### Numpy Arrays

In [3]:
# Example problem with numpy array
lst = [1, 2, 3]
values = np.array(lst)
print(values)
print(lst)

[1 2 3]
[1, 2, 3]
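The two printouts above look similar, but lists and arrays behave differently under arithmetic. The short sketch below is an addition to the notebook and reuses the same `lst` and `values` names to show the difference:

```python
import numpy as np

lst = [1, 2, 3]
values = np.array(lst)

print(lst * 2)          # a list is repeated: [1, 2, 3, 1, 2, 3]
print(values * 2)       # an array is multiplied element by element
print(values + values)  # arrays of the same shape add element by element
```

Element-wise arithmetic like this is what makes arrays convenient for numerical work.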


In [4]:
# Example problem with multidimensional numpy array
lst = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
values = np.array(lst)
print(values)

[[1 2 3]
 [4 5 6]
 [7 8 9]]
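Every array carries a few attributes that describe its layout. The sketch below is an addition to the notebook; it inspects the same 3 x 3 array and reads one element with a row and column index:

```python
import numpy as np

values = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

print(values.shape)  # (number of rows, number of columns)
print(values.ndim)   # number of axes
print(values.size)   # total number of elements
print(values[1, 2])  # row 1, column 2 (indices start at 0)
```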


### Cleveland Heart Disease Dataset

In order to demonstrate the power of the Numpy library, we first needed a dataset to work on. We selected a clean text file of heart disease data, which we got from the [UCI website](https://archive.ics.uci.edu/ml/datasets/Heart+Disease). Take a moment to review the website and get a feel for the dataset.

Now that we're acquainted with the dataset, let's read the data into a matrix so that we can perform calculations on it. Below we will use the Python CSV library to read from the CSV file; we'll learn these techniques later in the semester.

In [6]:
"""
As we saw on the website, the 14 attributes used in the published
experiment are as follows. We will use these fields to retrieve
data from the CSV file, and write them into a list of Numpy arrays.
"""
fields = ["age", "sex", "chest_pain_type", "rest_blood_pressure",
          "cholestoral", "fasting_blood_sugar","rest_ecg", "max_hr",
          "ex_ang", "oldpeak", "slope", "ca", "thal", "num"]

In [8]:
"""
As we saw above, in order to construct an array, we first create a
list and then use the Numpy np.array(lst) constructor. Now we will
apply this technique to construct a multidimensional array, or matrix.

Below, we use the Python CSV library to read from the CSV data file,
which is a technique we will learn later in the semester. Once we
read the data into a list, we construct an array that has one row of
values. We add each row to a list.
"""
arrays = []

with open('processed_cleveland_data.csv', 'r') as csvfile: # open file
    reader = csv.DictReader(csvfile, fields) # create a reader
    for row in reader: # loop through rows
        lst = []
        for field in fields:
            val = row[field]
            # Replace missing values (marked '?') with 0
            if val == '?':
                val = 0
            lst.append(float(val))
        arr = np.array(lst)
        arrays.append(arr)
        
print(arrays)

[array([  63. ,    1. ,    1. ,  145. ,  233. ,    1. ,    2. ,  150. ,
          0. ,    2.3,    3. ,    0. ,    6. ,    0. ]), array([  67. ,    1. ,    4. ,  160. ,  286. ,    0. ,    2. ,  108. ,
          1. ,    1.5,    2. ,    3. ,    3. ,    2. ]), array([  67. ,    1. ,    4. ,  120. ,  229. ,    0. ,    2. ,  129. ,
          1. ,    2.6,    2. ,    2. ,    7. ,    1. ]), array([  37. ,    1. ,    3. ,  130. ,  250. ,    0. ,    0. ,  187. ,
          0. ,    3.5,    3. ,    0. ,    3. ,    0. ]), array([  41. ,    0. ,    2. ,  130. ,  204. ,    0. ,    2. ,  172. ,
          0. ,    1.4,    1. ,    0. ,    3. ,    0. ]), array([  56. ,    1. ,    2. ,  120. ,  236. ,    0. ,    0. ,  178. ,
          0. ,    0.8,    1. ,    0. ,    3. ,    0. ]), array([  62. ,    0. ,    4. ,  140. ,  268. ,    0. ,    2. ,  160. ,
          0. ,    3.6,    3. ,    2. ,    3. ,    3. ]), array([  57. ,    0. ,    4. ,  120. ,  354. ,    0. ,    0. ,  163. ,
          1. ,    0.6,    1. ,  
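Replacing a missing value with 0 keeps the code simple, but the 0 is then treated as a real measurement in later statistics. One alternative, sketched here with made-up values rather than the dataset itself, is to store `np.nan` and use the NaN-aware function `np.nanmean`:

```python
import numpy as np

# Hypothetical raw values for one column, with '?' marking a missing entry
raw = ['3', '?', '1', '2']

as_zero = np.array([0.0 if v == '?' else float(v) for v in raw])
as_nan = np.array([np.nan if v == '?' else float(v) for v in raw])

print(np.mean(as_zero))    # the 0 counts as a measurement
print(np.nanmean(as_nan))  # the missing entry is skipped
```

Which choice is appropriate depends on the analysis; the rest of this notebook keeps the 0 replacement.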

In [9]:
"""
Now we have a list of row arrays, which is stored in the arrays variable.
We want a matrix, so we will feed this list into the array constructor,
and that will yield an array of arrays.

Print out the matrix to see how Python represents the data.
"""

matrix = np.array(arrays)
print(matrix)

[[ 63.   1.   1. ...,   0.   6.   0.]
 [ 67.   1.   4. ...,   3.   3.   2.]
 [ 67.   1.   4. ...,   2.   7.   1.]
 ..., 
 [ 57.   1.   4. ...,   1.   7.   3.]
 [ 57.   0.   2. ...,   1.   3.   1.]
 [ 38.   1.   3. ...,   0.   3.   0.]]


Now we have a matrix representation of the research data. How might this be useful to us?
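One immediate benefit is that rows, columns, and filtered subsets can be pulled out with a single indexing expression. The sketch below is an addition to the notebook; `small` is a stand-in holding the first four attributes of the first three records printed above:

```python
import numpy as np

# Stand-in for part of the data: age, sex, chest_pain_type, rest_blood_pressure
small = np.array([[63.0, 1.0, 1.0, 145.0],
                  [67.0, 1.0, 4.0, 160.0],
                  [67.0, 1.0, 4.0, 120.0]])

first_record = small[0]            # one row: every attribute for one record
blood_pressure = small[:, 3]       # one column: one attribute for every record
over_65 = small[small[:, 0] > 65]  # boolean mask keeps rows where age > 65

print(first_record)
print(blood_pressure)
print(over_65.shape)
```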

### Numpy Array Manipulation Methods

Our new matrix is shaped such that each row holds the values that belong to one record. Sometimes, however, we realize after constructing our data that it would be more useful in a different form. What if, for example, instead of having each row hold the 14 attributes of one record, we wanted 14 arrays that were organized by attribute?

In this case, we can achieve that reshaping by taking the transpose of the matrix. Taking the transpose means swapping the columns and rows of the matrix.

In Numpy, this is an exceptionally easy operation. If we have a matrix `M`, the transpose of that matrix is available as the attribute `M.T`. Here is an example below:

In [12]:
transposed = matrix.T
print(transposed)
print(len(transposed))
print(len(transposed[0]))
print(len(matrix))

[[ 63.  67.  67. ...,  57.  57.  38.]
 [  1.   1.   1. ...,   1.   0.   1.]
 [  1.   4.   4. ...,   4.   2.   3.]
 ..., 
 [  0.   3.   2. ...,   1.   1.   0.]
 [  6.   3.   7. ...,   7.   3.   3.]
 [  0.   2.   1. ...,   3.   1.   0.]]
14
303
303
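The same idea is easier to check by hand on a tiny array. The sketch below is an addition to the notebook, with `m` made up for illustration. It confirms that the shape flips and shows that `.T` returns a view, so writing to the transpose also changes the original:

```python
import numpy as np

m = np.array([[1, 2, 3],
              [4, 5, 6]])
t = m.T

print(m.shape)  # 2 rows, 3 columns
print(t.shape)  # 3 rows, 2 columns

# Element (i, j) of the transpose is element (j, i) of the original.
print(t[2, 0] == m[0, 2])

# The transpose shares memory with m, so this assignment changes m as well.
t[0, 1] = 40
print(m[1, 0])
```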


In addition to the transpose, Numpy has a function, `swapaxes`, which takes in an array and two ints, `a` and `b`, and swaps axes `a` and `b` of that array. In the case of our two-dimensional array, swapping axes 0 and 1 has the same effect as taking the transpose of the matrix.

In [13]:
transposed = np.swapaxes(matrix, 0, 1)
print(transposed)
print(len(transposed))
print(transposed[0][0])

[[ 63.  67.  67. ...,  57.  57.  38.]
 [  1.   1.   1. ...,   1.   0.   1.]
 [  1.   4.   4. ...,   4.   2.   3.]
 ..., 
 [  0.   3.   2. ...,   1.   1.   0.]
 [  6.   3.   7. ...,   7.   3.   3.]
 [  0.   2.   1. ...,   3.   1.   0.]]
14
63.0


Here we swap the axes of a smaller array, so that the effects are easier to see.

In [19]:
x = np.array([[1, 2, 3]])
x2 = np.swapaxes(x, 0, 1)
print(x)
print(x2)

[[1 2 3]]
[[1]
 [2]
 [3]]
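With more than two axes, `swapaxes` does something the plain transpose of a matrix cannot express as simply. The sketch below is an addition to the notebook. It builds a made-up array of shape (2, 3, 4) using `arange` and `reshape`, which are covered later, and swaps the first and last axes:

```python
import numpy as np

# 24 values arranged with shape (2, 3, 4)
cube = np.arange(24).reshape(2, 3, 4)
swapped = np.swapaxes(cube, 0, 2)

print(cube.shape)
print(swapped.shape)

# The element at index (i, j, k) moves to index (k, j, i).
print(cube[1, 2, 3], swapped[3, 2, 1])
```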


In addition to reshaping the array by swapping axes, it is possible to use a one-dimensional iterator over the array. Unlike `swapaxes`, this iterator is an attribute of Numpy arrays, so we do not call a function to create it. To access it, we use `arr.flat` for some Numpy array `arr`. The iterator visits the elements row by row, and it can also be indexed like a one-dimensional array. See the examples below:

In [20]:
# First, a simple example with a 1D array

x = np.array([1, 2, 3])
print(x.flat[0])
print(x.flat[2])

1
3


In [21]:
# Now, an example with a 2D array

x = np.array([[1, 2],[3, 4],[5, 6]])
print(x.flat[0])
print(x.flat[2])
print(x.flat[5])

1
3
6
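Beyond indexing, `.flat` can be looped over and assigned through. The sketch below is an addition to the notebook and contrasts it with `flatten()`, which returns a separate one-dimensional copy:

```python
import numpy as np

x = np.array([[1, 2], [3, 4], [5, 6]])

# Looping over .flat visits every element in row-by-row order.
print([int(v) for v in x.flat])

# flatten() returns a new 1D array, so changing it leaves x alone.
flat_copy = x.flatten()
flat_copy[0] = 100
print(x[0, 0])

# Assigning through .flat writes into the original array.
x.flat[5] = 60
print(x[2, 1])
```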


Now that we have arrays containing all values for each attribute in the dataset, it would be nice to calculate average values for each of those attributes. Again, in Numpy this is exceptionally easy.

Numpy has a built-in function `np.mean(arr)` that returns the mean of an array's values. Let's write a function that takes in a matrix and returns an array of means for each row in the matrix. Note that the solution below converts each mean to an `int`, which drops the fractional part.

In [26]:
"""
Input: matrix (array of arrays)
Output: array of average values

Hint: Iterate through the matrix and use np.mean!
"""
def matrix_means(matrix):
    means = []
    for row in matrix:
        means.append(int(np.mean(row)))
    return np.array(means)

matrix_means(transposed)

array([ 54,   0,   3, 131, 246,   0,   0, 149,   0,   1,   1,   0,   4,   0])
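The loop above works, but `np.mean` can also compute all the row means in one call through its `axis` argument. The sketch below is an addition to the notebook. It uses a made-up 2 x 3 matrix and keeps the fractional part of each mean:

```python
import numpy as np

# Hypothetical 2 x 3 matrix used only for illustration
m = np.array([[1.0, 2.0, 6.0],
              [4.0, 5.0, 9.0]])

def matrix_means_loop(matrix):
    """Mean of each row, one row at a time, without int() truncation."""
    return np.array([np.mean(row) for row in matrix])

row_means = np.mean(m, axis=1)  # average across the columns: one mean per row
col_means = np.mean(m, axis=0)  # average across the rows: one mean per column

print(row_means)
print(col_means)
print(np.allclose(row_means, matrix_means_loop(m)))
```

With `axis=0` on the original, untransposed matrix, the per-attribute means would come out directly, without needing the transpose first.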

### Other Numpy Techniques

Above, we learned how to manipulate arrays and matrices of data that already existed. Often in scientific computing, it is useful to [create a multidimensional array](https://en.wikipedia.org/wiki/State-transition_matrix) or generate [random numbers](https://en.wikipedia.org/wiki/Randomized_algorithm). Numpy makes both much easier with the following techniques.

#### Random Numbers

Numpy has a built-in function that can generate random numbers, or arrays of random numbers. Called as `np.random.rand(d1, d2, ...)`, it takes optional dimension parameters. Without any dimensions, it returns a single random float drawn uniformly from the interval [0, 1). If you pass dimensions, it returns an array of such values with the specified shape. Below are some examples:

In [27]:
"""
Here, we will generate single random numbers.
"""
r = np.random.rand()
print(r)

0.49237287649764006


In [29]:
"""
Here, we will generate arrays of random numbers.
"""
r_arr = np.random.rand(2, 3)
print(r_arr)

[[ 0.67727392  0.55399923  0.28479283]
 [ 0.64930206  0.66069841  0.26740901]]
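Because the numbers are random, the outputs above will differ on every run. When repeatable results are needed, a seeded generator helps. The sketch below is an addition to the notebook and assumes Numpy 1.17 or later, where `np.random.default_rng` is available; the seed 42 is arbitrary:

```python
import numpy as np

# Two generators built from the same seed produce the same sequence.
rng_a = np.random.default_rng(42)
rng_b = np.random.default_rng(42)

sample_a = rng_a.random((2, 3))  # the shape is passed as one tuple here
sample_b = rng_b.random((2, 3))

print(np.array_equal(sample_a, sample_b))            # same seed, same numbers
print(((sample_a >= 0.0) & (sample_a < 1.0)).all())  # values lie in [0, 1)
```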


#### Array Generation

As we saw in the previous subsection, it is possible to generate arrays of random numbers in Numpy. It is also possible to generate arrays with a range of values, or zeroes. Here are some examples:

In [30]:
"""
Here, we will generate a multidimensional array of zeros. This might be
useful as a starting value that could be filled in.

As with most array constructors in Numpy, the input is the shape of the
array, passed here as a tuple of dimensions.
"""
z = np.zeros((10, 2))
print(z)

[[ 0.  0.]
 [ 0.  0.]
 [ 0.  0.]
 [ 0.  0.]
 [ 0.  0.]
 [ 0.  0.]
 [ 0.  0.]
 [ 0.  0.]
 [ 0.  0.]
 [ 0.  0.]]
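As a small illustration of the "starting value" idea, the sketch below (an addition to the notebook, with made-up contents) fills a zero array row by row and shows two related constructors, `np.ones` and `np.full`, which also take a shape:

```python
import numpy as np

# Start from zeros, then fill each row with a number and its square.
table = np.zeros((4, 2))
for i in range(4):
    table[i, 0] = i
    table[i, 1] = i ** 2

print(table)

# Related constructors that also take a shape
ones = np.ones((2, 3))
sevens = np.full((2, 3), 7)
print(ones.sum(), sevens.sum())
```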


In [31]:
"""
Here, we will generate an array of values within some range.

Unlike np.zeros, np.arange does not take dimensions. With a single
argument n, it returns the integers from 0 up to (but not including) n.
"""
a = np.arange(6)
print(a)

[0 1 2 3 4 5]


In [34]:
"""
Here, we will take the array from the previous cell and reshape it.

We will take the array of size 6 and reshape it into two rows of
size 3.

We will also see what happens if we reshape 'a' into one row of
size 6.
"""
b = a.reshape(2, 3)
c = a.reshape(1, 6)

print(b)
print(c)

[[0 1 2]
 [3 4 5]]
[[0 1 2 3 4 5]]


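Note that in the above example, `a.reshape(1, 6)` did not give back a single-dimension array of size 6. Because we passed two dimensions, the result is a two-dimensional array with one row and six columns, which is why it prints with double brackets. Reshape returns an array with exactly the dimensions we ask for, so `a.reshape(6)` would return a single-dimension array.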
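The sketch below is an addition to the notebook that compares a few reshape calls on the same array. It also uses `-1`, which asks Numpy to work out one dimension from the others, and shows the error raised when the requested shape does not match the number of elements:

```python
import numpy as np

a = np.arange(6)

one_d = a.reshape(6)     # still one-dimensional
row = a.reshape(1, 6)    # two-dimensional: 1 row, 6 columns
auto = a.reshape(3, -1)  # -1 is replaced by whatever size makes the total 6

print(one_d.shape, row.shape, auto.shape)

# The total number of elements must stay the same.
try:
    a.reshape(4, 2)
except ValueError as err:
    print("reshape failed:", err)
```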

Another very useful feature of Numpy lets us generate an array of evenly-spaced values within a specified range. For example, if we wanted an array of 30 values from 1 to 5, with both endpoints included, it would be pretty tedious to have to calculate every separate value. The `linspace` function takes care of this for us.

In [35]:
"""
Generating 30 evenly-spaced values between 1 and 5:
"""
vals = np.linspace(1, 5, 30)
print(vals)

[ 1.          1.13793103  1.27586207  1.4137931   1.55172414  1.68965517
  1.82758621  1.96551724  2.10344828  2.24137931  2.37931034  2.51724138
  2.65517241  2.79310345  2.93103448  3.06896552  3.20689655  3.34482759
  3.48275862  3.62068966  3.75862069  3.89655172  4.03448276  4.17241379
  4.31034483  4.44827586  4.5862069   4.72413793  4.86206897  5.        ]
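With 30 points there are 29 gaps, so the spacing is (5 - 1) / 29, which matches the 0.1379... step visible in the output. The sketch below is an addition to the notebook; it asks `linspace` to report the step through `retstep=True` and shows the effect of `endpoint=False`:

```python
import numpy as np

vals, step = np.linspace(1, 5, 30, retstep=True)

print(step)               # spacing between neighbouring values
print(vals[0], vals[-1])  # both endpoints are included by default

# With endpoint=False the stop value is left out and the spacing changes.
open_vals = np.linspace(1, 5, 4, endpoint=False)
print(open_vals)
```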


If we want to specify a step size instead of the number of values, we can use the `arange` function. Let's make an array with values from 0 up to (but not including) 100, with a step size of 3:

In [36]:
"""
Generating an array with values between 0 and 100, with a step size of 3:
"""
vals = np.arange(0, 100, 3)
print(vals)

[ 0  3  6  9 12 15 18 21 24 27 30 33 36 39 42 45 48 51 54 57 60 63 66 69 72
 75 78 81 84 87 90 93 96 99]
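The stop value passed to `arange` is never part of the result, which is easy to overlook. The sketch below is an addition to the notebook that checks the array above and then uses two small ranges to show the boundary behaviour:

```python
import numpy as np

vals = np.arange(0, 100, 3)
print(len(vals), vals[-1])  # how many values, and the last one

# The stop value is excluded, even when it lands exactly on a step.
short = np.arange(0, 9, 3)
longer = np.arange(0, 10, 3)
print(short)
print(longer)
```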


#### Statistics

Numpy also makes it easy to compute a variety of statistics of large data sets. Let's return to our heart disease data set and take a look at the attribute at index 3, which is resting blood pressure. To start off, let's compute some basic statistics on this specific attribute.

In [37]:
"""
First, we'll transpose the array containing our dataset so that each row
contains one attribute, instead of each row containing all attributes for
one person. The row at index 3 of our transposed matrix should contain
all of the resting blood pressure values in the data.
"""
transposed = matrix.T
bp = transposed[3]
print(bp)

[ 145.  160.  120.  130.  130.  120.  140.  120.  130.  140.  140.  140.
  130.  120.  172.  150.  110.  140.  130.  130.  110.  150.  120.  132.
  130.  120.  120.  150.  150.  110.  140.  117.  140.  135.  130.  140.
  120.  150.  132.  150.  150.  140.  160.  150.  130.  112.  110.  150.
  140.  130.  105.  120.  112.  130.  130.  124.  140.  110.  125.  125.
  130.  142.  128.  135.  120.  145.  140.  150.  170.  150.  155.  125.
  120.  110.  110.  160.  125.  140.  130.  150.  104.  130.  140.  180.
  120.  140.  138.  128.  138.  130.  120.  160.  130.  108.  135.  128.
  110.  150.  134.  122.  115.  118.  128.  110.  120.  108.  140.  128.
  120.  118.  145.  125.  118.  132.  130.  135.  140.  138.  130.  135.
  130.  150.  100.  140.  138.  130.  200.  110.  120.  124.  120.   94.
  130.  140.  122.  135.  145.  120.  120.  125.  140.  170.  128.  125.
  105.  108.  165.  112.  128.  102.  152.  102.  115.  160.  120.  130.
  140.  125.  140.  118.  101.  125.  110.  100.  1

In [38]:
"""
We can quickly find the mean, median, and standard deviation of blood pressure values:
"""
mean = np.mean(bp)
median = np.median(bp)
standard_dev = np.std(bp)
print(mean)
print(median)
print(standard_dev)

131.689768977
130.0
17.5706812395
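By default `np.std` computes the population standard deviation, dividing by the number of values n. To get the sample standard deviation, which divides by n - 1, pass `ddof=1`. The sketch below is an addition to the notebook and uses made-up readings chosen so the arithmetic is easy to check:

```python
import numpy as np

# Hypothetical readings: mean 5, squared deviations sum to 32
readings = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

population_std = np.std(readings)      # divides by n (the default, ddof=0)
sample_std = np.std(readings, ddof=1)  # divides by n - 1

print(population_std)
print(sample_std)
```

For a dataset with about 300 records, the two values are close, but the distinction matters for small samples.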


Let's now take a look at the relationship between resting blood pressure and cholesterol levels.

In [40]:
"""
The cholesterol attribute is at index 4 of our data set, so we'll first create an array with just that attribute:
"""
cholesterol = transposed[4]
print(cholesterol)
print(np.mean(cholesterol))

[ 233.  286.  229.  250.  204.  236.  268.  354.  254.  203.  192.  294.
  256.  263.  199.  168.  229.  239.  275.  266.  211.  283.  284.  224.
  206.  219.  340.  226.  247.  167.  239.  230.  335.  234.  233.  226.
  177.  276.  353.  243.  225.  199.  302.  212.  330.  230.  175.  243.
  417.  197.  198.  177.  290.  219.  253.  266.  233.  172.  273.  213.
  305.  177.  216.  304.  188.  282.  185.  232.  326.  231.  269.  254.
  267.  248.  197.  360.  258.  308.  245.  270.  208.  264.  321.  274.
  325.  235.  257.  216.  234.  256.  302.  164.  231.  141.  252.  255.
  239.  258.  201.  222.  260.  182.  303.  265.  188.  309.  177.  229.
  260.  219.  307.  249.  186.  341.  263.  203.  211.  183.  330.  254.
  256.  407.  222.  217.  282.  234.  288.  239.  220.  209.  258.  227.
  204.  261.  213.  250.  174.  281.  198.  245.  221.  288.  205.  309.
  240.  243.  289.  250.  308.  318.  298.  265.  564.  289.  246.  322.
  299.  300.  293.  277.  197.  304.  214.  248.  2

Now let's find the linear least squares fit for the data, where blood pressure is on the x axis and cholesterol level is on the y axis. The function `polyfit` takes in two arrays of data (x values first, then y values) and the degree of the polynomial you want to fit, and returns the coefficients of the polynomial in decreasing order of power, with the highest power coefficient first.

In [41]:
"""
Now let's find the linear least squares fit for the data, where blood pressure is on the x axis,
and cholesterol level on the y axis:
"""
fit = np.polyfit(bp, cholesterol, 1)
print("Our fit is: y = " + str(fit[0]) + "x + " + str(fit[1]))

Our fit is: y = 0.382801971459x + 196.281966122
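A fitted line does not say how strong the relationship is. The correlation coefficient does, and Numpy provides it through `np.corrcoef`. The sketch below is an addition to the notebook and uses made-up paired values rather than the dataset. It also checks the standard identity linking the slope of a straight-line fit to the correlation coefficient:

```python
import numpy as np

# Hypothetical paired measurements, not taken from the dataset
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# corrcoef returns a 2 x 2 matrix; the off-diagonal entry is r for x and y.
r = np.corrcoef(x, y)[0, 1]
slope, intercept = np.polyfit(x, y, 1)

print(r)
# For a straight-line fit, slope = r * std(y) / std(x).
print(np.isclose(slope, r * np.std(y) / np.std(x)))
```

Running `np.corrcoef(bp, cholesterol)` in the notebook would give the corresponding value for the heart disease data.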


Let's now use our polynomial to predict a person's cholesterol level if their resting blood pressure is at an optimal 120 mm Hg systolic. The function `polyval` takes in the coefficients of a polynomial and an x value, and returns the polynomial evaluated at that x value.

In [42]:
prediction = np.polyval(fit, 120)
print(prediction)

242.218202697
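For a degree-1 fit, `polyval` simply computes slope times x plus intercept. The sketch below is an addition to the notebook. It uses made-up coefficients for y = 2x + 5 instead of the fitted ones, so the result can be checked by hand, and shows that `polyval` also accepts an array of x values:

```python
import numpy as np

# Hypothetical coefficients for y = 2x + 5, highest power first
coeffs = [2.0, 5.0]

by_polyval = np.polyval(coeffs, 120)
by_hand = coeffs[0] * 120 + coeffs[1]
print(by_polyval, by_hand)

# polyval evaluates the polynomial at every x in an array.
several = np.polyval(coeffs, np.array([100, 120, 140]))
print(several)
```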
