Some of the content in this notebook is based on information in **Python for Data Analysis** by Wes McKinney, Chapter 4 (pg 85-107).

## A List is not an Array

You have probably used lists in Python to store collections of data. Lists store references to objects and because of this can contain objects of mixed types (floats and strings and anything else!). They are not true arrays.  

NumPy provides objects that implement true arrays. NumPy arrays store the data contiguously in memory as opposed to storing references to remote locations. They have many built-in operations that run quickly and eliminate the need for writing complex code with loops/functions. All elements in a NumPy array must be of the **same type**.

In [1]:
import numpy as np
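
As a small illustration of the difference described above, the sketch below puts the same mixed values into a list and into an array. The variable names are made up for this example. The list keeps each object's own type, while NumPy converts every element to one common type (here, strings):

```python
import numpy as np

mixed_list = [1, 2.5, "three"]

# A list stores references, so each item keeps its own type
type_names = [type(x).__name__ for x in mixed_list]
print(type_names)

# An array needs a single type, so NumPy converts every element to a string
mixed_arr = np.array(mixed_list)
print(mixed_arr)
print(mixed_arr.dtype.kind)  # 'U' is the code for a Unicode string dtype
```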

## How to create arrays

One of the most common ways to create an array is by casting a list.

In [2]:
my_list = [5, 2, 3.5, 7]
my_arr = np.array(my_list)
print(my_arr)

[5.  2.  3.5 7. ]


Notice that all of the numbers in **my_arr** are floats. Remember that arrays contain only a single type, so when the array is created, NumPy infers a type that can represent every element. Here the integers are converted to floats to match 3.5. You can see the data type using **dtype**:

In [None]:
print(my_arr.dtype)
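
If the inferred type is not the one you want, you can usually request one. The sketch below (with illustrative variable names) uses the `dtype` argument of `np.array` and the `astype` method. Note that converting floats to integers drops the fractional part rather than rounding:

```python
import numpy as np

ints = np.array([1, 2, 3])                       # all integers, so an integer dtype
floats = np.array([1, 2, 3.5])                   # one float, so everything becomes float
forced = np.array([1, 2, 3], dtype=np.float64)   # ask for floats explicitly
truncated = floats.astype(np.int64)              # 3.5 becomes 3 (truncated, not rounded)

print(ints.dtype.kind, floats.dtype, forced.dtype)
print(truncated)
```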

What about multi-dimensional lists and arrays?

In [3]:
big_list = [[4, 5, 2, 7], [10, 100, 1000, 10000], [5,10,50,89]]
big_arr = np.array(big_list)
print(big_arr)
print(big_arr.dtype)

[[    4     5     2     7]
 [   10   100  1000 10000]
 [    5    10    50    89]]
int64


In [4]:
print(big_arr.ndim) #how many dimensions
print(big_arr.shape) #(number of rows, number of columns); the number of columns is the number of items per row

2
(3, 4)
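
For comparison, here is a short sketch of the same attributes on a 1-D and a 2-D array, using made-up example arrays. The shape of a 1-D array is a tuple with a single entry, and `size` gives the total number of elements:

```python
import numpy as np

one_d = np.array([1, 2, 3, 4])
two_d = np.array([[1, 2, 3, 4],
                  [5, 6, 7, 8]])

print(one_d.ndim, one_d.shape, one_d.size)  # 1 (4,) 4
print(two_d.ndim, two_d.shape, two_d.size)  # 2 (2, 4) 8
```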


There are other ways to make arrays. We can make arrays that contain all zeros or all ones:

In [None]:
z = np.zeros((5,2))
print(z)

In [None]:
o = np.ones(7)
print(o)

We can also make them empty:

In [None]:
emp = np.empty((2,4,2))
print(emp) #notice it is 3D and is filled with meaningless values
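
A few related details, sketched with illustrative names: `np.zeros` and `np.ones` produce floats by default but accept a `dtype` argument, and `np.full` (not used elsewhere in this notebook) fills an array with a value of your choice. Because `np.empty` does not initialize its contents, it is safest to use it only when you plan to overwrite every element:

```python
import numpy as np

z_float = np.zeros((2, 3))          # floats by default
z_int = np.zeros(3, dtype=int)      # integers on request
sevens = np.full((2, 2), 7)         # every element set to 7

print(z_float.dtype, z_int.dtype.kind)
print(sevens)

scratch = np.empty(4)               # contents are arbitrary until assigned
scratch[:] = 1.5                    # overwrite every element before using it
print(scratch)
```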

NumPy's **arange** is also useful. It is analogous to Python's **range** (the end value is excluded), but unlike range, which produces its numbers one at a time as they are needed, it builds the entire collection at once as an array:

In [None]:
r = np.arange(1,15)
print(r)
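
Like **range**, arange also accepts a single argument or a step size. A brief sketch with illustrative names:

```python
import numpy as np

from_zero = np.arange(5)            # start defaults to 0: 0, 1, 2, 3, 4
by_threes = np.arange(1, 15, 3)     # step of 3: 1, 4, 7, 10, 13
same_values = list(range(1, 15, 3)) # the built-in range gives the same numbers

print(from_zero)
print(by_threes)
print(same_values)
```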

## How arrays make things easier

Suppose that we had a list and we wanted to multiply every item in the list by 4. We might try:

In [5]:
a_list = [10,100,400,20]
result = a_list * 4
print(result)

[10, 100, 400, 20, 10, 100, 400, 20, 10, 100, 400, 20, 10, 100, 400, 20]


That didn't do what we wanted! For lists, multiplication is defined as repetition. This gets worse if we try to divide by 4:

In [6]:
a_list = [10,100,400,20]
result = a_list / 4
print(result)

TypeError: unsupported operand type(s) for /: 'list' and 'int'

To make this work with a list, we would need to use a loop. We could do this in one line with a list comprehension:

In [7]:
a_list = [10,100,400,20]
result = [item/4 for item in a_list]
print(result)

[2.5, 25.0, 100.0, 5.0]


We don't need a loop when we use an array:

In [8]:
an_arr = np.array([10,100,400,20])
result = an_arr/4
print(result)

[  2.5  25.  100.    5. ]
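
To check that the two approaches agree, the sketch below compares the list comprehension with the array operation, and shows that other arithmetic operators also work element by element. The names are made up for this example:

```python
import numpy as np

prices = [10, 100, 400, 20]

loop_result = [p / 4 for p in prices]   # explicit loop over a list
arr = np.array(prices)
array_result = arr / 4                  # the same division, applied to every element

print(np.allclose(loop_result, array_result))  # True

times_four = arr * 4    # multiplication is elementwise, not repetition
doubled = arr + arr     # adding two arrays adds matching elements
print(times_four)
print(doubled)
```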


Will this work in more dimensions? Let's create some random data and try to multiply by 4:

In [9]:
my_arr = np.random.random((5,4))
print(my_arr*4)

[[1.42084877 1.71872706 2.54991833 3.93221009]
 [0.53039895 3.26428705 1.47312933 2.45825375]
 [3.93212133 0.31628031 1.88317036 0.88718806]
 [1.29006293 0.46872559 0.67535269 3.00407151]
 [0.95012393 0.66424563 3.96073401 2.04997933]]


We can do other neat things too. Let's add up all of the values in my_arr:

In [10]:
my_arr.sum()

9.357457251622328

Or maybe we want the row totals:

In [11]:
my_arr.sum(axis = 1) #use axis = 0 for column totals
#you can also do this with np.sum(my_arr, axis = 1)

array([2.40542606, 1.93151727, 1.75469001, 1.35955318, 1.90627072])

Or the mean:

In [12]:
my_arr.mean()
#or np.mean(my_arr)

0.46787286258111643

In [13]:
my_arr.mean(axis = 1)
#or np.mean(my_arr, axis = 1)

array([0.60135652, 0.48287932, 0.4386725 , 0.33988829, 0.47656768])
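
Because random data is hard to check by eye, here is the same idea on a small array with known values (an illustrative example, not part of the data above). With `axis=0` the result has one entry per column, and with `axis=1` it has one entry per row:

```python
import numpy as np

small = np.array([[1, 2, 3],
                  [4, 5, 6]])

total = small.sum()               # 21, all six values
col_totals = small.sum(axis=0)    # one total per column: 5, 7, 9
row_totals = small.sum(axis=1)    # one total per row: 6, 15
col_means = small.mean(axis=0)    # one mean per column: 2.5, 3.5, 4.5

print(total)
print(col_totals)
print(row_totals)
print(col_means)
```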

## Indexing arrays

Indexing works similarly for arrays as it does for lists. Consider a 1-D array:

In [None]:
rand = np.random.random(5)
print(rand)
print(rand[0])
print(rand[2])

What about higher dimensions?

In [None]:
rand = np.random.random((3,4))
print(rand)

In [None]:
print(rand[0]) #this is the entire first row

In [None]:
print(rand[0,:]) #this does the same thing, the : means all columns

In [None]:
print(rand[:,1]) #this is the second column, the : means all rows

In [None]:
#print(rand[1][2]) #you could write this, but more commonly...
print(rand[1,2])
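
The same indexing patterns are easier to follow on an array with known values. The sketch below uses a made-up 3x4 array called `grid`. Negative indices count from the end, just as they do for lists:

```python
import numpy as np

grid = np.array([[0, 1, 2, 3],
                 [4, 5, 6, 7],
                 [8, 9, 10, 11]])

first_row = grid[0]         # 0, 1, 2, 3
second_col = grid[:, 1]     # 1, 5, 9
one_item = grid[1, 2]       # row 1, column 2 -> 6
last_item = grid[-1, -1]    # last row, last column -> 11

print(first_row)
print(second_col)
print(one_item, last_item)
```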

### Slicing

Similar to lists, we can take slices with arrays:

In [None]:
data = np.random.random((4,5))
print(data)

In [None]:
print(data[1:3]) #rows 1 and 2, 3 is not included!

In [None]:
print(data[1:3,2:4]) #columns 2 and 3 from rows 1 and 2

In [None]:
print(data[:,1:3]) #columns 1 and 2, all rows
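
Here is a sketch of the same slices on a 4x5 array with known values, so the selected rows and columns are easy to verify. The array `table` is made up for this example; each value encodes its position as row * 10 + column:

```python
import numpy as np

table = np.array([[0, 1, 2, 3, 4],
                  [10, 11, 12, 13, 14],
                  [20, 21, 22, 23, 24],
                  [30, 31, 32, 33, 34]])

rows = table[1:3]          # rows 1 and 2 (all columns)
block = table[1:3, 2:4]    # rows 1 and 2, columns 2 and 3
cols = table[:, 1:3]       # all rows, columns 1 and 2

print(rows)
print(block)
print(cols)
```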

### Indexing with boolean expressions

Very commonly, we will want to identify the elements in an array that meet a certain condition and select only those. That is, we want to take a subset of the array. Suppose we have data for three types of iris flowers: setosa, virginica, and versicolor. We have observations for 4 variables on 10 flowers in one array (the data), and we have the type of each flower (the name/label) in a separate array, like this:

In [None]:
names = np.array(['setosa','setosa','virginica','versicolor','virginica',
                  'setosa','versicolor','virginica','setosa','setosa'])
data = np.random.random((10,4))

In [None]:
print(data)

There is one name that goes with each row of data. So the first row is a setosa flower, the second row is a setosa flower, the third row is a virginica flower and so forth. Suppose we only want data for virginica flowers. 

In [None]:
idx = names == 'virginica'
print(idx)

We have asked NumPy to compare each item in the names array with 'virginica' and mark whether it matches. This results in an array of booleans; notice there are 10 because we had 10 names. We can then use this array of booleans to subset our data array:

In [None]:
virginica_data = data[idx]
print(virginica_data)

How did it know which rows to pick? Anything that was marked True in the array called **idx** is selected. In **idx** we see that the indices 2, 4, and 7 are True, so when we ask for **data[idx]** it will select the data (rows) at indices 2, 4, and 7.
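
The selection can be traced on a smaller example with known values. In this sketch the names `labels`, `values`, and `mask` are made up. `np.flatnonzero` lists the positions that are True, and `~` inverts the mask to select everything else:

```python
import numpy as np

labels = np.array(['setosa', 'virginica', 'versicolor', 'virginica'])
values = np.array([[1.0, 2.0],
                   [3.0, 4.0],
                   [5.0, 6.0],
                   [7.0, 8.0]])

mask = labels == 'virginica'        # False, True, False, True
positions = np.flatnonzero(mask)    # indices where the mask is True: 1 and 3
selected = values[mask]             # rows 1 and 3
others = values[~mask]              # rows 0 and 2 (everything that is not virginica)

print(mask)
print(positions)
print(selected)
print(others)
```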

## Reshaping and Raveling

Imagine that we have data for a seating chart in a room. There are 3 rows of 4 seats. We've recorded all of our data in a 1 dimensional array:

In [None]:
seats = np.array(['apple','carrot','turtle','potato','umbrella','dog','cat',
                 'tortoise','orange','book','computer','pencil']) #imagine these are names

But we'd really like the data to be arranged like the room, in 3 rows of 4. We can **reshape** the data:

In [None]:
seats = seats.reshape(3,4)
seats

Notice that the first number we passed to reshape was the number of rows, and the second was the number of columns.
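
Two details are worth sketching here, using a made-up array of the numbers 0 to 11. First, reshape fills the new array row by row. Second, the new shape has to account for every element: passing -1 for one dimension asks NumPy to work out that dimension for you, and a shape that does not fit raises a ValueError:

```python
import numpy as np

numbers = np.arange(12)

chart = numbers.reshape(3, 4)       # 3 rows of 4, filled row by row
inferred = numbers.reshape(3, -1)   # -1 lets NumPy compute the 4

print(chart)
print(chart.shape, inferred.shape)

# 12 items cannot be arranged as 5 rows of 3, so this fails
try:
    numbers.reshape(5, 3)
    failed = False
except ValueError as err:
    failed = True
    print("reshape failed:", err)
```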

The inverse of this operation, to go from higher dimensions to lower dimensions, is known as flattening or raveling. There are two methods, **flatten** and **ravel**, which do this, but they work slightly differently: flatten always returns a copy of the data, while ravel returns a view of the original array whenever it can. For our purposes, we can almost always use either without seeing a difference:

In [None]:
data = np.random.random((4,5))
print(data)

In [None]:
a = data.flatten()
print(a)

In [None]:
data = np.random.random((3,3))
print(data)

In [None]:
b = data.ravel()
print(b)
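
The one situation where the choice matters is when you modify the result. The sketch below, with made-up names, shows that changing the output of flatten leaves the original array alone, while changing the output of ravel can change the original, because ravel returns a view when the array's memory layout allows it (as it does for this freshly created array):

```python
import numpy as np

grid = np.array([[1, 2],
                 [3, 4]])

flat_copy = grid.flatten()   # always an independent copy
flat_view = grid.ravel()     # a view here, sharing memory with grid

flat_copy[0] = 100           # grid is unchanged
flat_view[1] = 200           # grid[0, 1] changes too

print(grid)
print(np.shares_memory(grid, flat_copy))  # False
print(np.shares_memory(grid, flat_view))  # True
```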