# Intro to NumPy 

Now, we're going to start off with a foray into the `NumPy` library, which is one of the fundamental packages for scientific computing in Python. 


It turns out that the `pandas DataFrames` we worked with are actually built off the `NumPy array` (which we'll get to), so it's important to have some basic knowledge of what's running under the hood of our `DataFrames`. 


We started with `DataFrames` as opposed to `NumPy arrays` because they are a little bit more intuitive and we're able to interact with them from a higher level (this is largely due to the ability to label our data). 

While `NumPy` offers an amazing amount of functionality (see the [docs](http://www.numpy.org/) for a better idea), one of its mainstays is the `NumPy array` (an n-dimensional array), which is what we'll focus on today. 


There are loads of things that you can do with `NumPy arrays`, and today we're going to introduce some of their amazing capabilities. 


Learning about everything `NumPy arrays` can do really just takes working with them day in and day out, and so now we'll try to aim for breadth over depth. 

## Learning Objectives

At the end of this notebook you should:

- be able to create NumPy arrays
- have an idea of the many things you can do with `NumPy` and `NumPy arrays`
- and the types of situations where you would want to use them


## The basics of the Array

### What's the big deal with NumPy Arrays?

What's so special about a `NumPy array`? 

From a high level, they are kind of like lists - they just store a bunch of stuff in a container. It turns out, though, that a NumPy array is much faster to interact with and perform certain types of calculations with than a standard Python list. Why is that, though? The two main reasons that they are faster are: 

1. They are stored as one contiguous block of memory, rather than being spread out across multiple locations like a list. 
2. Each item in a NumPy array is of the same data type (i.e. all integers, all floats, etc.), rather than a conglomerate of any number of data types (as a list is). We call this idea homogeneity, as opposed to the possible heterogeneity of Python lists.
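Both properties can be inspected directly on an array. The snippet below is a small illustrative sketch (the variable names `arr` and `lst` are made up for this example), showing that an array has a single dtype, a fixed size per element, and contiguous storage, whereas a list can hold anything.

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5], dtype=np.int64)
lst = [1, 2.0, "three"]  # a list can mix types freely

# Every element of the array shares a single dtype and a fixed size in bytes.
print(arr.dtype)      # int64
print(arr.itemsize)   # 8 (bytes per element)
print(arr.nbytes)     # 40 (5 elements * 8 bytes)

# The data sits in one contiguous block of memory.
print(arr.flags["C_CONTIGUOUS"])  # True

# The list, by contrast, holds references to objects of different types.
print([type(item).__name__ for item in lst])  # ['int', 'float', 'str']
```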

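Just how much faster are they? Let's take the numbers from 0 up to (but not including) 1 million and sum them, timing it with both plain Python and a NumPy array. Note that in Python 3 `range()` gives us a lazy `range` object rather than an actual list, but the built-in `sum()` still has to walk through it one element at a time.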

In [2]:
import numpy as np # Standard alias when importing NumPy - follow this convention!
numpy_array = np.arange(0, 1000000)
python_list = range(1000000)

In [3]:
python_list

range(0, 1000000)

In [4]:
%timeit np.sum(numpy_array)

111 µs ± 739 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


In [5]:
%timeit sum(python_list)

8.22 ms ± 53 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [6]:
%timeit sum(numpy_array)

36 ms ± 431 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


So, `np.sum()` on the NumPy array runs roughly 75 times faster than the built-in `sum()` on the Python range (111 µs versus 8.22 ms) - that's nearly two orders of magnitude! This is because of those two points above. Because NumPy arrays store data in contiguous blocks of memory, they are able to take advantage of **vectorization**, which is the ability of a CPU to perform one operation on multiple pieces of data at once. In addition, since a NumPy array knows what type each object it is storing is (and those types don't change), it doesn't have to waste time checking what type each object is (as Python must when working through a list). The combo of these two things speeds up our calculation quite a bit.

Notice, too, that we had to specify `np.sum()` - NumPy's implementation of sum. When we just used the built-in Python `sum()` on the NumPy array, the calculation was actually slower (36 ms)! This is because the built-in `sum()` iterates over the array one element at a time, turning each value into a Python object along the way, so none of NumPy's vectorized machinery gets used and we pay a price. 
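Outside of a notebook, where the `%timeit` magic isn't available, the same comparison can be sketched with the standard-library `timeit` module. This is only a rough sketch: the repetition counts are arbitrary choices, and the timings depend on the machine, so no expected numbers are shown.

```python
import timeit
import numpy as np

numpy_array = np.arange(0, 1_000_000, dtype=np.int64)
python_range = range(1_000_000)

# Both approaches compute the same total.
np_total = int(np.sum(numpy_array))
py_total = sum(python_range)
print(np_total == py_total)  # True

# Time each approach (arbitrary repetition counts, results vary by machine).
vectorized = timeit.timeit(lambda: np.sum(numpy_array), number=20) / 20
element_wise = timeit.timeit(lambda: sum(numpy_array), number=2) / 2
print(f"np.sum(numpy_array): {vectorized:.6f} s per call")
print(f"sum(numpy_array):    {element_wise:.6f} s per call")
```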

It's also worth noting that all we did above was a sum - just a **simple** sum. When we move to doing more complicated operations, we'll save even more time! Let's look at what else NumPy arrays can do...

### Making a NumPy Array

Now that we know how awesome `NumPy arrays` can be, let's dive into them. We're not going to cover everything that you can do with `NumPy arrays` (see the [methods docs](http://docs.scipy.org/doc/numpy-1.10.0/reference/generated/numpy.ndarray.html#numpy.ndarray) for that), but we'll look at the basics. 

Let's start with how we can create a `NumPy array`. To do this, we use the `np.array()` constructor, which expects some kind of array or something that exposes the array interface (i.e. acts like an array - lists and tuples are examples). So, this means that we can create a NumPy array by passing in a list or tuple. 

In [7]:
my_lst_ndarray = np.array([1, 2, 3, 4, 5])
# You can specify the data type upon creation. 
my_tuple_ndarray = np.array((1, 2, 3, 4, 5), np.int32) 

Just like we have the shape attribute on pandas DataFrames, we also have it on NumPy arrays.

In [8]:
print(my_lst_ndarray.shape)
print(my_tuple_ndarray.shape)

(5,)
(5,)


We also have the dtype attribute, which will tell us the data type of the objects in our ndarray (n-dimensional array).

In [9]:
print(my_lst_ndarray.dtype)
print(my_tuple_ndarray.dtype)

int64
int32


In [10]:
my_lst_ndarray2 = np.array(["1", 2, 3, "10", 5])
print(my_lst_ndarray2.dtype)

<U21


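"U" stands for Unicode string, and "21" is the maximum number of characters each element can hold (NumPy picked a width large enough to store any 64-bit integer as text). The leading "<" denotes little-endian byte order. Every element in the array was converted to a string.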

In [11]:
# Every element is now a string.
type(my_lst_ndarray2[1])

numpy.str_
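If the strings all happen to look like numbers, we can convert the array back explicitly with `astype()`. A minimal sketch (the variable names are made up for this example):

```python
import numpy as np

mixed = np.array(["1", 2, 3, "10", 5])
print(mixed.dtype.kind)  # U  (the "kind" code for Unicode strings)
print(mixed)             # ['1' '2' '3' '10' '5']

# Explicitly convert the strings back to 64-bit integers.
as_ints = mixed.astype(np.int64)
print(as_ints)           # [ 1  2  3 10  5]
print(as_ints.sum())     # 21
```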

If you try to tell the ndarray to be a certain data type, it will try to cast every object you pass in to that data type (here a 32-bit integer), and fail if it can't cast it to that data type. Below, we are able to cast "10" to a 32-bit integer, so this is fine. 

In [12]:
my_lst_ndarray3 = np.array([1, 2, 3, "10", 5], np.int32) 
print(my_lst_ndarray3.dtype)

int32


In [13]:
# This will not work, because NumPy can't cast the string 'bozo' to a 32-bit integer. 
my_lst_ndarray3 = np.array([1, 2, 3, "bozo", 5], np.int32)

ValueError: invalid literal for int() with base 10: 'bozo'
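When the input might contain values that can't be cast, one option is to catch the `ValueError`. The helper below, `to_int32_array`, is a hypothetical name invented for this sketch and is not part of NumPy.

```python
import numpy as np

def to_int32_array(values):
    """Hypothetical helper: return an int32 array, or None if a value can't be cast."""
    try:
        return np.array(values, dtype=np.int32)
    except ValueError as err:
        print(f"Could not cast to int32: {err}")
        return None

good = to_int32_array([1, 2, 3, "10", 5])    # "10" can be cast
bad = to_int32_array([1, 2, 3, "bozo", 5])   # "bozo" cannot

print(good)         # [ 1  2  3 10  5]
print(bad is None)  # True
```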

Here are some other methods of constructing a NumPy array. It's helpful to know these exist. 

In [14]:
zeros_arr = np.zeros((3,4)) # Create a matrix of zeros with 3 rows and 4 columns. 
zeros_arr

array([[0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.]])

In [15]:
ones_arr = np.ones((10,5))  # Create a matrix of ones with 10 rows and 5 columns.
ones_arr

array([[1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.]])

In [16]:
identity_arr = np.identity(50) # Create an identity matrix with 50 rows and 50 columns. 
identity_arr

array([[1., 0., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 0., 0., 1.]])

In [17]:
random_arr = np.random.rand(2, 2) # Create a 2x2 array of random floats ranging from 0 to 1. 
random_arr

array([[0.6690457 , 0.1697127 ],
       [0.65936437, 0.5918214 ]])

In [18]:
range_arr = np.arange(0, 20, 0.5) # Create a NumPy array with arguments (start, stop, step_size); stop is excluded. 
range_arr

array([ 0. ,  0.5,  1. ,  1.5,  2. ,  2.5,  3. ,  3.5,  4. ,  4.5,  5. ,
        5.5,  6. ,  6.5,  7. ,  7.5,  8. ,  8.5,  9. ,  9.5, 10. , 10.5,
       11. , 11.5, 12. , 12.5, 13. , 13.5, 14. , 14.5, 15. , 15.5, 16. ,
       16.5, 17. , 17.5, 18. , 18.5, 19. , 19.5])
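A related constructor that wasn't used above is `np.linspace()`, which takes the number of points rather than a step size and includes the endpoint by default. A short sketch contrasting the two:

```python
import numpy as np

# arange: the stop value (20) is excluded.
by_step = np.arange(0, 20, 0.5)
print(by_step[-1], len(by_step))    # 19.5 40

# linspace: ask for a number of points; the endpoint (20) is included by default.
by_count = np.linspace(0, 20, 41)
print(by_count[-1], len(by_count))  # 20.0 41
```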

## NumPy Array Math

When working with a `NumPy` array, we have all of the basic mathematics operators available to us: `+`, `-`, `*`, `/`, `**`, and `%`. Frequently, we'll be working with two arrays that are the same size, in which case these operators will just be performed **element-wise**. For example:

In [19]:
first_arr = np.array([1, 2, 3, 4])
second_arr = np.array([5, 6, 7, 8])
first_arr + second_arr # Each element is lined up with its corresponding element in the other 
                       # array, and the addition is then performed. 

array([ 6,  8, 10, 12])

In [20]:
first_arr = np.array([[1, 2], [3, 4]]) # This is now a two-dimensional array.
second_arr = np.array([[5, 6], [7, 8]]) # This is now a two-dimensional array. 
first_arr * second_arr # Each element is lined up with its corresponding element in the other 
                       # array, and the multiplication is then performed. 

array([[ 5, 12],
       [21, 32]])

It turns out that our numerical operations can also work when we want to perform an operation between a `NumPy array` and a single value. For example, let's say that we want to subtract `4` from `first_arr` above, or multiply it by `5`, or find the remainder when everything is divided by `3`. We can do that via the following: 

In [21]:
first_arr = np.array([[1, 2], [3, 4]])

In [22]:
first_arr - 4

array([[-3, -2],
       [-1,  0]])

In [23]:
first_arr * 5

array([[ 5, 10],
       [15, 20]])

In [24]:
first_arr % 3

array([[1, 2],
       [0, 1]])

The concept that allows this to happen is referred to as [broadcasting](https://numpy.org/doc/stable/user/basics.broadcasting.html). It is a concept that will be particularly useful when working with `NumPy` arrays. Basically, it takes that single number on the right (the `4`, `5`, or `3` above) and **broadcasts** it to match the shape of `first_arr` (`2 x 2`). After doing so, it then performs the operation element-wise like we saw before. 
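To make the "stretching" concrete, `np.broadcast_to()` lets us build the broadcast version of the scalar by hand. NumPy doesn't need us to do this (it handles it internally); the sketch below is only meant to illustrate the idea.

```python
import numpy as np

first_arr = np.array([[1, 2], [3, 4]])

# Conceptually, the scalar 4 is stretched to the shape of first_arr.
stretched = np.broadcast_to(4, first_arr.shape)
print(stretched)
# [[4 4]
#  [4 4]]

# Subtracting the stretched array gives the same result as subtracting the scalar.
print(np.array_equal(first_arr - 4, first_arr - stretched))  # True
```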

It turns out that things can get a little more intricate than this. If we wanted, we could perform mathematical operations like the above at a column level, or row level. For example, we could subtract off `4` from the first column and `5` from the second column, or `4` from the first row and `5` from the second row. We would do that via the following: 

In [25]:
first_arr = np.array([[1, 2], [3, 4]])

In [26]:
# Here, we subtract 4 off the first column and 5 off the second column. 
first_arr - [4, 5]

array([[-3, -3],
       [-1, -1]])

In [27]:
# Here, we subtract 4 from the first row and 5 from the second row. 
first_arr - [[4], [5]]

array([[-3, -2],
       [-2, -1]])
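What decides between "per column" and "per row" is the shape of the thing being subtracted. The sketch below spells out those shapes (the names `per_column` and `per_row` are made up for this example) and shows that shapes which can't be lined up raise an error.

```python
import numpy as np

first_arr = np.array([[1, 2], [3, 4]])   # shape (2, 2)
per_column = np.array([4, 5])            # shape (2,): one value per column
per_row = per_column[:, np.newaxis]      # shape (2, 1): one value per row

print(first_arr - per_column)
# [[-3 -3]
#  [-1 -1]]
print(first_arr - per_row)
# [[-3 -2]
#  [-2 -1]]

# Shapes that cannot be lined up raise an error instead of guessing.
try:
    first_arr - np.array([1, 2, 3])      # shape (3,) does not fit (2, 2)
except ValueError as err:
    print("Broadcasting failed:", err)
```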

## A Little bit more of NumPy Arrays 

Okay, so now that we know a little bit about a `NumPy array`, what else can we do with it? There are actually quite a number of things we can do. We can index into them, perform calculations, ask for aggregation type metrics, etc. 


### Indexing 




Let's begin by indexing into them. With `NumPy arrays`, we don't have the `.loc[]` or `.iloc[]` indexers like we do on a DataFrame - we simply index into them like we would a list. A NumPy array behaves like a multidimensional list, though, so we can pass it one indexing value per dimension. Let's take a look.

In [28]:
# Reshape will reshape the data to the shape that you tell it to (here 5 rows, 4 columns). 
range_arr = np.arange(0, 20, 1).reshape(5, 4)
range_arr

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15],
       [16, 17, 18, 19]])

In [29]:
# Grab every row, but only the element at index 2 in those rows. 
range_arr[:, 2]

array([ 2,  6, 10, 14, 18])

In [30]:
# With no second index, this defaults to taking the rows. 
range_arr[0:2]

array([[0, 1, 2, 3],
       [4, 5, 6, 7]])

In [31]:
# The first set of numbers refers to the rows to grab, the second set the columns.
range_arr[0:2, 1:3] 


array([[1, 2],
       [5, 6]])
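A few more indexing patterns are worth seeing alongside the slices above: single elements, negative indices, and boolean masks. The sketch also shows that a slice is a view onto the original data, so writing to it changes the original array (it rebuilds `range_arr` so that it runs on its own).

```python
import numpy as np

range_arr = np.arange(0, 20, 1).reshape(5, 4)

# A single element: row index 1, column index 2.
print(range_arr[1, 2])                 # 6

# Negative indices count from the end, just like lists.
print(range_arr[-1])                   # [16 17 18 19]

# A boolean mask picks out the elements where the condition is True.
print(range_arr[range_arr % 5 == 0])   # [ 0  5 10 15]

# Slices are views, so writing to a slice changes the original array.
block = range_arr[0:2, 1:3]
block[0, 0] = 100
print(range_arr[0, 1])                 # 100
```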

### Other methods

Now let's look at some of the other methods that are available. Again, there is a ton we can do, and we're aiming here to at least get your eyes on a lot of the things that are possible. We also want to give you a notebook here that you can look back at to see what is possible (Google is also amazing for this). 

We can perform sums in any direction with a method on the arrays.

In [32]:
# Sum along the rows (i.e. get column totals)
range_arr.sum(axis=0)

array([40, 45, 50, 55])

In [33]:
# Sum along the columns (i.e. get row totals)
range_arr.sum(axis=1)

array([ 6, 22, 38, 54, 70])

In [34]:
# Get sum of all elements in NumPy array. 
range_arr.sum() 

190

We can also grab the mean, standard deviation, max, and min values along the rows (i.e. for the columns). We could also do this along the columns, or for the array as a whole (just like we did with `.sum()`).

In [35]:
range_arr.mean(axis=0)

array([ 8.,  9., 10., 11.])

In [36]:
range_arr.std(axis=0)

array([5.65685425, 5.65685425, 5.65685425, 5.65685425])

In [37]:
range_arr.max(axis=0)

array([16, 17, 18, 19])

In [38]:
range_arr.min(axis=0)

array([0, 1, 2, 3])
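These axis-wise aggregations combine nicely with broadcasting. As one illustrative use (not something from the cells above), we can center each column on its mean, or each row on its mean by using the `keepdims=True` argument so that the shapes line up.

```python
import numpy as np

range_arr = np.arange(0, 20, 1).reshape(5, 4)

# Column means have shape (4,), so they broadcast across the rows.
col_means = range_arr.mean(axis=0)
col_centered = range_arr - col_means
print(col_centered.mean(axis=0))   # [0. 0. 0. 0.]

# keepdims=True keeps the reduced axis, giving shape (5, 1) instead of (5,),
# which is what we need to broadcast across the columns.
row_means = range_arr.mean(axis=1, keepdims=True)
print(row_means.shape)             # (5, 1)
row_centered = range_arr - row_means
print(row_centered[0])             # [-1.5 -0.5  0.5  1.5]
```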

If we want to instead grab the **index** at which those min and max values occur (either along the rows or columns), then we can use the `argmin()` and `argmax()` methods available on our NumPy array. 

In [39]:
# We see that the mins of each column occur at row 1 (index 0).
range_arr.argmin(axis=0) 

array([0, 0, 0, 0])

In [40]:
# We see that the maxs of each column occur at row 5 (index 4).
range_arr.argmax(axis=0)

array([4, 4, 4, 4])

In [41]:
# With no axis given, we get the index of the overall minimum in the flattened array (the 0th index).
range_arr.argmin()

0

In [42]:
# Here we get the index of the overall maximum (the last index). 
range_arr.argmax()

19
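The overall `argmin()` and `argmax()` results are positions in the flattened array. If we want the row and column instead, `np.unravel_index()` converts a flat index back into coordinates for a given shape. A short sketch:

```python
import numpy as np

range_arr = np.arange(0, 20, 1).reshape(5, 4)

flat_index = range_arr.argmax()    # position in the flattened array
row, col = np.unravel_index(flat_index, range_arr.shape)

print(flat_index)                  # 19
print(row, col)                    # 4 3
print(range_arr[row, col])         # 19
```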

We can get the cumulative sum or product with the following.

In [43]:
# Here it gets the cumsum along the rows (i.e. from top to bottom)
range_arr.cumsum(axis=0)

array([[ 0,  1,  2,  3],
       [ 4,  6,  8, 10],
       [12, 15, 18, 21],
       [24, 28, 32, 36],
       [40, 45, 50, 55]])

In [44]:
# Gets the cumprod along the rows
range_arr.cumprod(axis=0)

array([[    0,     1,     2,     3],
       [    0,     5,    12,    21],
       [    0,    45,   120,   231],
       [    0,   585,  1680,  3465],
       [    0,  9945, 30240, 65835]])

In [45]:
# We can flatten our arrays as follows. 
range_arr.flatten()  # Always returns a copy. 
range_arr.ravel()  # Returns a view when possible; the values look the same in this case. 

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19])
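Although the two results print identically, they differ in whether they share memory with the original array. The sketch below checks this with `np.shares_memory()` and then shows the practical consequence of writing to each result.

```python
import numpy as np

range_arr = np.arange(0, 20, 1).reshape(5, 4)

flat_copy = range_arr.flatten()   # independent copy
flat_view = range_arr.ravel()     # view onto the same data (when possible)

print(np.shares_memory(range_arr, flat_copy))  # False
print(np.shares_memory(range_arr, flat_view))  # True

# Writing to the copy leaves the original untouched...
flat_copy[0] = -1
print(range_arr[0, 0])            # 0

# ...while writing to the view changes the original.
flat_view[0] = -1
print(range_arr[0, 0])            # -1
```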

## A brief look at a cool NumPy method

The majority, if not all, of the methods that we looked at for NumPy arrays are available on pandas columns. They might have some slightly different naming conventions (`idxmax` on a column versus `argmax` on a NumPy array, for example), but since pandas DataFrames are built on NumPy arrays, the methods available on NumPy arrays largely coincide with the methods available on pandas DataFrames. 

Many of these methods are available as functions on the `NumPy` module itself, as well. Just like we can call the `argmax()` method on a NumPy array, we can call `np.argmax()` and pass in a list or tuple. Before we move back to DataFrames, let's look at one last method that is available in `NumPy`, `np.where()`. `np.where()` can help us to find what elements in a NumPy array meet some condition. 

In [46]:
my_ndarray = np.array([2, 4, 6, 8, 24, 3, 8, 9, 12])

In [47]:
# Returns the indices where the data meet the condition.
print(np.where(my_ndarray <= 2))  
print(np.where(my_ndarray == 8)) 
print(np.where(my_ndarray > 6)) 

(array([0]),)
(array([3, 6]),)
(array([3, 4, 6, 7, 8]),)
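Two close relatives of the one-argument `np.where()` are worth knowing: indexing with the boolean condition itself returns the matching values rather than their indices, and the three-argument form `np.where(condition, x, y)` picks from `x` where the condition holds and from `y` elsewhere. A minimal sketch:

```python
import numpy as np

my_ndarray = np.array([2, 4, 6, 8, 24, 3, 8, 9, 12])
mask = my_ndarray > 6

# Boolean indexing returns the values that meet the condition.
print(my_ndarray[mask])               # [ 8 24  8  9 12]

# Three-argument form: keep values above 6, replace everything else with 0.
print(np.where(mask, my_ndarray, 0))  # [ 0  0  0  8 24  0  8  9 12]
```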


## Pivot Tables

From [wiki](https://en.wikipedia.org/wiki/Pivot_table): "Among other functions, a pivot table can automatically sort, count total, or give the average of the data stored in one table or spreadsheet, displaying the results in a second table showing the summarized data. Pivot tables are also useful for quickly creating unweighted cross tabulations."

As you might have guessed, we have functionality to create pivot tables available for our use in pandas. The way that we do this is by calling the `pivot_table()` function that is available on the pandas module (which we've stored as `pd`). As the [docs](https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html) tell us, the `pivot_table()` expects a number of different arguments: 

1. `data`: A DataFrame object
2. `values`: a column or a list of columns to aggregate
3. `index`: a column, Grouper, array which has the same length as data, or list of them. Keys to group by on the pivot table index. If an array is passed, it is used in the same manner as column values.
4. `columns`: a column, Grouper, array which has the same length as data, or list of them. Keys to group by on the pivot table column. If an array is passed, it is used in the same manner as column values.
5. `aggfunc`: function to use for aggregation, defaulting to np.mean

Notice that by default this uses the mean for the `aggfunc` parameter. 

In [48]:
# Let's recall what the data looks like. 
import pandas as pd
red_wines_df = pd.read_csv('data/winequality-red.csv', sep=';')
red_wines_df.head()

|   | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |


Let's take a moment to quickly learn about another pandas function called `cut()` that allows us to turn a column with continuous data into categoricals by specifying bins to place them in.

In [49]:
pd.cut(red_wines_df['fixed acidity'], bins=np.arange(4, 17)).head()

0      (7, 8]
1      (7, 8]
2      (7, 8]
3    (11, 12]
4      (7, 8]
Name: fixed acidity, dtype: category
Categories (12, interval[int64, right]): [(4, 5] < (5, 6] < (6, 7] < (7, 8] ... (12, 13] < (13, 14] < (14, 15] < (15, 16]]

In [50]:
fixed_acidity_bins = np.arange(4, 17)
fixed_acidity_series = pd.cut(red_wines_df['fixed acidity'], bins=fixed_acidity_bins, 
                              labels=fixed_acidity_bins[:-1])
fixed_acidity_series.name = 'fa_bin'
red_wines_df = pd.concat([red_wines_df, fixed_acidity_series], axis=1)

In [51]:
red_wines_df.head()

|   | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality | fa_bin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 | 7 |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 | 7 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5 | 7 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 6 | 11 |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 | 7 |
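To see exactly how `pd.cut()` assigns values to bins, here is a small sketch on a handful of made-up values (they are not taken from the wine data). By default the bins are closed on the right, so a value sitting exactly on an edge falls into the bin below it.

```python
import numpy as np
import pandas as pd

values = pd.Series([7.4, 8.0, 11.2, 4.6])   # made-up sample values
bins = np.arange(4, 17)                      # bin edges 4, 5, ..., 16

# Bins are right-closed by default: (7, 8] contains 8.0 but not 7.0.
# Each bin is labeled with its left edge, as in the cell above.
binned = pd.cut(values, bins=bins, labels=bins[:-1])
print(binned.astype(int).tolist())           # [7, 7, 11, 4]
```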


Now we can get the mean residual sugar for each quality category/fixed acidity bin like we did earlier, but with a pivot_table (mean is the default aggregation function).

In [52]:
pd.pivot_table(red_wines_df, values='residual sugar', index='quality', columns='fa_bin')

| quality \ fa_bin | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | NaN | NaN | 1.5 | 3.5375 | 3.4 | NaN | 1.8 | 2.2 | NaN | NaN | NaN | NaN |
| 4 | 1.75 | 5.3 | 2.714286 | 2.453846 | 2.583333 | 2.433333 | 2.566667 | 1.5 | 4.5 | NaN | NaN | NaN |
| 5 | 1.6 | 1.85 | 2.492623 | 2.441331 | 2.496786 | 2.675 | 3.238889 | 2.77 | 2.393333 | 3.133333 | NaN | 5.025 |
| 6 | 2.35 | 2.886538 | 2.556767 | 2.167027 | 2.281731 | 2.801563 | 2.910345 | 2.524359 | 2.9125 | 2.85 | 1.8 | NaN |
| 7 | 2.1 | 1.9 | 2.595 | 2.655 | 2.796429 | 2.8625 | 2.718 | 2.638889 | 4.15 | 2.8 | 2.2 | 3.7 |
| 8 | 2.0 | 1.6 | NaN | 2.316667 | 1.8 | 2.166667 | 3.866667 | 5.2 | 2.2 | NaN | NaN | NaN |


In [53]:
# We can also specify a function to aggregate with (by default it is mean)
pd.pivot_table(red_wines_df, values='residual sugar', index='quality', 
               columns='fa_bin', aggfunc=np.max)

| quality \ fa_bin | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | NaN | NaN | 1.8 | 5.7 | 3.4 | NaN | 2.1 | 2.2 | NaN | NaN | NaN | NaN |
| 4 | 2.1 | 12.9 | 5.6 | 4.4 | 6.3 | 3.4 | 3.4 | 1.6 | 4.5 | NaN | NaN | NaN |
| 5 | 1.6 | 2.5 | 7.9 | 8.1 | 7.9 | 13.8 | 15.5 | 5.15 | 4.6 | 4.8 | NaN | 7.5 |
| 6 | 4.3 | 13.9 | 10.7 | 5.5 | 5.1 | 11.0 | 15.4 | 6.2 | 4.3 | 3.8 | 1.8 | NaN |
| 7 | 2.1 | 2.2 | 6.0 | 8.3 | 6.2 | 8.9 | 6.55 | 4.4 | 5.8 | 2.8 | 2.2 | 3.7 |
| 8 | 2.0 | 1.8 | NaN | 3.6 | 1.8 | 2.8 | 6.4 | 5.2 | 2.2 | NaN | NaN | NaN |
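To make the connection between `pivot_table()` and a grouped aggregation explicit, here is a sketch on a tiny made-up DataFrame (the values in `toy` are invented for illustration and are not from the wine data). It computes the same means with `pivot_table()` and with a `groupby()` followed by `unstack()`.

```python
import pandas as pd

# A tiny made-up DataFrame reusing the column names from above.
toy = pd.DataFrame({
    "quality": [5, 5, 6, 6, 6],
    "fa_bin": [7, 8, 7, 7, 8],
    "residual sugar": [1.0, 2.0, 3.0, 5.0, 4.0],
})

# Mean residual sugar for each quality (rows) and fa_bin (columns).
pivoted = pd.pivot_table(toy, values="residual sugar", index="quality",
                         columns="fa_bin", aggfunc="mean")

# The same numbers via groupby: group on both keys, then move fa_bin to the columns.
grouped = toy.groupby(["quality", "fa_bin"])["residual sugar"].mean().unstack()

print(pivoted)
print(grouped)
```

Both results have `quality` as the index and `fa_bin` as the columns; the pivot table is essentially a more convenient spelling of the grouped version.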
