### Fancy Indexing

In NumPy, **fancy indexing** is a way to use an array of indices to specify multiple elements of an array at once, instead of just a single index.

In [1]:
import numpy as np

In [2]:
arr = np.arange(10, 100, 10)
arr

array([10, 20, 30, 40, 50, 60, 70, 80, 90])

If we wanted to form an array consisting of the 1st, 3rd and 4th elements of `arr`, we could do it this way:

In [3]:
sub = np.array([arr[0], arr[2], arr[3]])
sub

array([10, 30, 40])

But using fancy indexing, we can make an array of those indices to do the same thing:

In [4]:
sub = arr[np.array([0, 2, 3])]
sub

array([10, 30, 40])

In this example, `[0, 2, 3]` is called the **index array**: simply an array of indices.

Be careful with fancy indexing, though. The index can be an `ndarray`, and in some cases even a Python `list`, but it cannot be a `tuple`: NumPy interprets a tuple as specifying a single index for each of several dimensions.

In [5]:
try:
    arr[(0, 2, 3)]
except Exception as ex:
    print(type(ex), ex)

<class 'IndexError'> too many indices for array


We get that exception because NumPy interprets our tuple as specifying indices for 3 axes (dimensions), but our array has only one - hence `too many indices`.
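To make the list/tuple distinction concrete, here is a minimal sketch. The 3-D array `cube` is a made-up example (it is not part of the notebook above), chosen so that a 3-element tuple is a valid index:

```python
import numpy as np

arr = np.arange(10, 100, 10)

# A list is treated as an index array: one result element per index.
from_list = arr[[0, 2, 3]]

# A tuple is treated as one index per axis, so it only works when the
# array has at least that many dimensions.
cube = np.arange(24).reshape(2, 3, 4)  # hypothetical 3-D array
single = cube[(0, 2, 3)]               # same as cube[0, 2, 3]

print(from_list)  # [10 30 40]
print(single)     # 11
```

So the same three numbers select three elements when given as a list, and a single element when given as a tuple.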

Unlike slicing, which returns a view, fancy indexing returns a copy: the array we get back is **not** "linked" to the original.

In [6]:
arr

array([10, 20, 30, 40, 50, 60, 70, 80, 90])

In [7]:
sub

array([10, 30, 40])

In [8]:
sub[1] = 300
sub

array([ 10, 300,  40])

In [9]:
arr

array([10, 20, 30, 40, 50, 60, 70, 80, 90])
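One way to check this, sketched below, is `np.shares_memory`, which reports whether two arrays overlap in memory. The names `view` and `copy` are just illustrative:

```python
import numpy as np

arr = np.arange(10, 100, 10)

view = arr[0:3]                   # basic slicing returns a view
copy = arr[np.array([0, 2, 3])]   # fancy indexing returns a copy

print(np.shares_memory(arr, view))  # True
print(np.shares_memory(arr, copy))  # False

view[0] = -1   # writes through to arr
copy[0] = -2   # leaves arr untouched
print(arr[0])  # -1
```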

One of the interesting things about fancy indexing is that the resulting array has the shape of the index array.

In [10]:
arr = np.arange(1, 10)
arr

array([1, 2, 3, 4, 5, 6, 7, 8, 9])

In [11]:
arr[np.array([0, 1, 1, 5])]

array([1, 2, 2, 6])

Here the index array was `[0, 1, 1, 5]`, a 1-D array of shape `(4,)`, and so our result was an array of that same shape. (Note also that we can specify the same index more than once.)

But we could also do the following, using a 2-D `ndarray` for the indices:

In [12]:
arr[np.array(
    [
        [0, 1], 
        [1, 5]
    ]
)]

array([[1, 2],
       [2, 6]])

You'll notice that the resulting array, like the previous example, uses the elements of `arr` at indices `0`, `1`, `1`, and `5`, but the resulting **shape** conforms to the shape of the indices we specified.
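A quick sketch that compares the shapes directly (the names `idx_1d` and `idx_2d` are introduced here only for illustration):

```python
import numpy as np

arr = np.arange(1, 10)

idx_1d = np.array([0, 1, 1, 5])
idx_2d = np.array([[0, 1], [1, 5]])

# The result always takes the shape of the index array,
# not the shape of the array being indexed.
print(idx_1d.shape, arr[idx_1d].shape)  # (4,) (4,)
print(idx_2d.shape, arr[idx_2d].shape)  # (2, 2) (2, 2)
```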

We can also use fancy indexing on multi-dimensional arrays.

In [13]:
m = np.arange(25).reshape(5, 5)
m

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19],
       [20, 21, 22, 23, 24]])

Let's pick all the columns from rows `0`, `1` and `3` - not something we can do with standard slicing:

In [14]:
m[[0, 1, 3]]

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [15, 16, 17, 18, 19]])

Furthermore, we can specify a single index, a slice, or a fancy index for the second axis. Here is a single index:

In [15]:
m[[0, 1, 3], 2]

array([ 2,  7, 17])

As you can see, we ended up picking the 3rd element of each row, for rows `0`, `1`, and `3`.

We could also specify a slice:

In [16]:
m

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19],
       [20, 21, 22, 23, 24]])

In [17]:
m[[0, 1, 3], 0::2]

array([[ 0,  2,  4],
       [ 5,  7,  9],
       [15, 17, 19]])
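If combining a fancy index with a slice feels opaque, one way to reason about it is as two steps: pick the rows first, then slice the columns of the result. The sketch below checks that both routes agree for this example; `rows`, `one_step` and `two_step` are illustrative names:

```python
import numpy as np

m = np.arange(25).reshape(5, 5)
rows = [0, 1, 3]

one_step = m[rows, 0::2]     # fancy index on axis 0, slice on axis 1
two_step = m[rows][:, 0::2]  # select the rows, then slice the columns

print(np.array_equal(one_step, two_step))  # True
print(one_step.shape)                      # (3, 3)
```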

We can also specify an index array for both dimensions. This is a bit more difficult to understand and not commonly used, but certainly possible:

In [18]:
m

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19],
       [20, 21, 22, 23, 24]])

In [19]:
m[np.array([0, 1, 3]), np.array([1, 2, 4])]

array([ 1,  7, 19])

Because both index arrays had the same shape, they were paired up position by position, so the result contains the elements at indices `(0, 1)`, `(1, 2)` and `(3, 4)`.

You can think of the result as being the elements at the tuples `(i, j)` formed by zipping the two arrays:

```
list(zip([0, 1, 3], [1, 2, 4]))
```
which corresponds to these indices:
```
[(0, 1), (1, 2), (3, 4)]
```
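The zipping analogy can be checked with a short sketch that builds the same result one element at a time (the variable names are illustrative):

```python
import numpy as np

m = np.arange(25).reshape(5, 5)
rows = np.array([0, 1, 3])
cols = np.array([1, 2, 4])

fancy = m[rows, cols]

# Pair up the row and column indices by hand and look up each element.
zipped = np.array([m[i, j] for i, j in zip(rows, cols)])

print(fancy)                          # [ 1  7 19]
print(np.array_equal(fancy, zipped))  # True
```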

You can even use 2-D index arrays, but this gets quite a bit more complicated to understand, and is not commonly used.

For example:

In [20]:
m[np.array([[0, 4], [2, 3]]), np.array([[1, 3], [2, 4]])]

array([[ 1, 23],
       [12, 19]])

Again you can think of this as "zipping" the first and second dimension indices this way:

```
0 4
2 3
```

"zipped" with

```
1 3
2 4
```

which results in these combined indices:

```
(0, 1) (4, 3)
(2, 2) (3, 4)
```

And again, the result shape is the same as the shape of the two index arrays.
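The same element-by-element pairing can be written out explicitly. This is only a sketch to illustrate the pairing, not how NumPy implements it; `manual` is a hypothetical name for the hand-built result:

```python
import numpy as np

m = np.arange(25).reshape(5, 5)
rows = np.array([[0, 4], [2, 3]])
cols = np.array([[1, 3], [2, 4]])

fancy = m[rows, cols]

# Build the same result by visiting every position of the index arrays
# and pairing the row index with the column index at that position.
manual = np.empty(rows.shape, dtype=m.dtype)
for pos in np.ndindex(rows.shape):
    manual[pos] = m[rows[pos], cols[pos]]

print(fancy.shape == rows.shape)      # True
print(np.array_equal(fancy, manual))  # True
```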

##### Example

Let's look at a more practical application of using fancy indexing.

We're going to load some daily quotes from a file: `AAPL.csv`

In [21]:
import csv

with open('AAPL.csv') as f:
    reader = csv.reader(f, skipinitialspace=True)
    headers = next(reader)
    data = list(reader)

In [22]:
headers

['Symbol', 'Date', 'Close', 'Volume', 'Open', 'High', 'Low']

In [23]:
data

[['AAPL', '10/29/2020', '115.32', '146129200', '112.37', '116.93', '112.2'],
 ['AAPL', '10/28/2020', '111.2', '143937800', '115.05', '115.43', '111.1'],
 ['AAPL', '10/27/2020', '116.6', '92276770', '115.49', '117.28', '114.5399'],
 ['AAPL', '10/26/2020', '115.05', '111850700', '114.01', '116.55', '112.88'],
 ['AAPL', '10/23/2020', '115.04', '82572650', '116.39', '116.55', '114.28'],
 ['AAPL', '10/22/2020', '115.75', '101988000', '117.45', '118.04', '114.59'],
 ['AAPL', '10/21/2020', '116.87', '89945980', '116.67', '118.705', '116.45'],
 ['AAPL', '10/20/2020', '117.51', '124423700', '116.2', '118.98', '115.63'],
 ['AAPL', '10/19/2020', '115.98', '120639300', '119.96', '120.419', '115.66'],
 ['AAPL', '10/16/2020', '119.02', '115393800', '121.28', '121.548', '118.81'],
 ['AAPL', '10/15/2020', '120.71', '112559200', '118.72', '121.2', '118.15'],
 ['AAPL', '10/14/2020', '121.19', '151062300', '121', '123.03', '119.62'],
 ['AAPL', '10/13/2020', '121.1', '262330500', '125.27', '125.39', '119.

What we want to do here is extract the dates into one array, and the numerical values for `Open` and `Close` into another array. As long as we keep both arrays in the same order, we can always associate the date with the data by using the same index on both arrays.

We'll start by putting all this data into a NumPy array:

In [24]:
data = np.array(data)
data

array([['AAPL', '10/29/2020', '115.32', '146129200', '112.37', '116.93',
        '112.2'],
       ['AAPL', '10/28/2020', '111.2', '143937800', '115.05', '115.43',
        '111.1'],
       ['AAPL', '10/27/2020', '116.6', '92276770', '115.49', '117.28',
        '114.5399'],
       ['AAPL', '10/26/2020', '115.05', '111850700', '114.01', '116.55',
        '112.88'],
       ['AAPL', '10/23/2020', '115.04', '82572650', '116.39', '116.55',
        '114.28'],
       ['AAPL', '10/22/2020', '115.75', '101988000', '117.45', '118.04',
        '114.59'],
       ['AAPL', '10/21/2020', '116.87', '89945980', '116.67', '118.705',
        '116.45'],
       ['AAPL', '10/20/2020', '117.51', '124423700', '116.2', '118.98',
        '115.63'],
       ['AAPL', '10/19/2020', '115.98', '120639300', '119.96', '120.419',
        '115.66'],
       ['AAPL', '10/16/2020', '119.02', '115393800', '121.28', '121.548',
        '118.81'],
       ['AAPL', '10/15/2020', '120.71', '112559200', '118.72', '121.2',
        '11

You'll notice that our array data type is a string type - so we'll have to deal with that in a bit.

First, we want to extract just the dates, so we can use a slice for that:

In [25]:
dates = data[:, 1]
dates

array(['10/29/2020', '10/28/2020', '10/27/2020', '10/26/2020',
       '10/23/2020', '10/22/2020', '10/21/2020', '10/20/2020',
       '10/19/2020', '10/16/2020', '10/15/2020', '10/14/2020',
       '10/13/2020', '10/12/2020', '10/09/2020', '10/08/2020',
       '10/07/2020', '10/06/2020', '10/05/2020', '10/02/2020',
       '10/01/2020', '09/30/2020', '09/29/2020'], dtype='<U10')

Yes, those are strings too, and we can convert those values to `datetime` objects (NumPy can hold other Python types, but may not be able to vectorize operations on them automatically). All we really need these dates for is looking up the date at a particular index, so we can just stick to a Python list or tuple for that.

In [26]:
from dateutil import parser

In [27]:
dates = [parser.parse(d) for d in dates]

In [28]:
dates

[datetime.datetime(2020, 10, 29, 0, 0),
 datetime.datetime(2020, 10, 28, 0, 0),
 datetime.datetime(2020, 10, 27, 0, 0),
 datetime.datetime(2020, 10, 26, 0, 0),
 datetime.datetime(2020, 10, 23, 0, 0),
 datetime.datetime(2020, 10, 22, 0, 0),
 datetime.datetime(2020, 10, 21, 0, 0),
 datetime.datetime(2020, 10, 20, 0, 0),
 datetime.datetime(2020, 10, 19, 0, 0),
 datetime.datetime(2020, 10, 16, 0, 0),
 datetime.datetime(2020, 10, 15, 0, 0),
 datetime.datetime(2020, 10, 14, 0, 0),
 datetime.datetime(2020, 10, 13, 0, 0),
 datetime.datetime(2020, 10, 12, 0, 0),
 datetime.datetime(2020, 10, 9, 0, 0),
 datetime.datetime(2020, 10, 8, 0, 0),
 datetime.datetime(2020, 10, 7, 0, 0),
 datetime.datetime(2020, 10, 6, 0, 0),
 datetime.datetime(2020, 10, 5, 0, 0),
 datetime.datetime(2020, 10, 2, 0, 0),
 datetime.datetime(2020, 10, 1, 0, 0),
 datetime.datetime(2020, 9, 30, 0, 0),
 datetime.datetime(2020, 9, 29, 0, 0)]
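As an aside, if `dateutil` is not available, the standard library can do the same job when the format is known in advance. The sketch below swaps in `datetime.strptime` for `parser.parse`, and assumes every date is in `MM/DD/YYYY` form, as the values above appear to be:

```python
from datetime import datetime

# A few of the date strings from the data above.
raw_dates = ['10/29/2020', '10/09/2020', '09/30/2020']

# An explicit format avoids any guessing about month/day order.
parsed = [datetime.strptime(d, '%m/%d/%Y') for d in raw_dates]

print(parsed[1])  # 2020-10-09 00:00:00
```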

Next, we'll want to extract our numerical data - and this time we'll definitely want to use a NumPy array for this.

The columns we want to extract are the `Open` and `Close` columns:

In [29]:
headers

['Symbol', 'Date', 'Close', 'Volume', 'Open', 'High', 'Low']

So this means we are interested in indices `4` and `2` - in that order.
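Rather than counting positions by eye, the indices could also be looked up from `headers`. This is a small sketch: the names `wanted` and `col_idx` are introduced here for illustration, and `sample` holds just the first two rows of the data shown earlier:

```python
import numpy as np

headers = ['Symbol', 'Date', 'Close', 'Volume', 'Open', 'High', 'Low']
wanted = ['Open', 'Close']  # in the order we want the columns

# list.index returns the position of the first match.
col_idx = [headers.index(name) for name in wanted]
print(col_idx)  # [4, 2]

sample = np.array([
    ['AAPL', '10/29/2020', '115.32', '146129200', '112.37', '116.93', '112.2'],
    ['AAPL', '10/28/2020', '111.2', '143937800', '115.05', '115.43', '111.1'],
])
print(sample[:, col_idx])
```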

We can use fancy indexing to extract these two columns:

In [30]:
oc = data[:, np.array([4, 2])]
oc

array([['112.37', '115.32'],
       ['115.05', '111.2'],
       ['115.49', '116.6'],
       ['114.01', '115.05'],
       ['116.39', '115.04'],
       ['117.45', '115.75'],
       ['116.67', '116.87'],
       ['116.2', '117.51'],
       ['119.96', '115.98'],
       ['121.28', '119.02'],
       ['118.72', '120.71'],
       ['121', '121.19'],
       ['125.27', '121.1'],
       ['120.06', '124.4'],
       ['115.28', '116.97'],
       ['116.25', '114.97'],
       ['114.62', '115.08'],
       ['115.7', '113.16'],
       ['113.91', '116.5'],
       ['112.89', '113.02'],
       ['117.64', '116.79'],
       ['113.79', '115.81'],
       ['114.55', '114.09']], dtype='<U10')

So this is almost where we want to be - the only remaining step is to make all these strings into floats.

And for that we can use the `astype` method:

In [31]:
oc = data[:, np.array([4, 2])].astype(float)
oc

array([[112.37, 115.32],
       [115.05, 111.2 ],
       [115.49, 116.6 ],
       [114.01, 115.05],
       [116.39, 115.04],
       [117.45, 115.75],
       [116.67, 116.87],
       [116.2 , 117.51],
       [119.96, 115.98],
       [121.28, 119.02],
       [118.72, 120.71],
       [121.  , 121.19],
       [125.27, 121.1 ],
       [120.06, 124.4 ],
       [115.28, 116.97],
       [116.25, 114.97],
       [114.62, 115.08],
       [115.7 , 113.16],
       [113.91, 116.5 ],
       [112.89, 113.02],
       [117.64, 116.79],
       [113.79, 115.81],
       [114.55, 114.09]])
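Two properties of `astype` are worth keeping in mind: it returns a new array and leaves the original untouched, and it raises a `ValueError` if any string cannot be parsed as a number. A minimal sketch, using a made-up `'n/a'` value to trigger the error:

```python
import numpy as np

raw = np.array([['112.37', '115.32'], ['121', '121.19']])
as_float = raw.astype(float)

print(raw.dtype.kind)       # U  (still a Unicode string array)
print(as_float.dtype.kind)  # f  (a new floating-point array)

conversion_failed = False
try:
    np.array(['1.5', 'n/a']).astype(float)  # 'n/a' is a hypothetical bad value
except ValueError as ex:
    conversion_failed = True
    print('conversion failed:', ex)
```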

Now, if we want to calculate the difference between the close and the open, we can use vectorized operations:

In [32]:
diffs = oc[:, 1] - oc[:, 0]
diffs

array([ 2.95, -3.85,  1.11,  1.04, -1.35, -1.7 ,  0.2 ,  1.31, -3.98,
       -2.26,  1.99,  0.19, -4.17,  4.34,  1.69, -1.28,  0.46, -2.54,
        2.59,  0.13, -0.85,  2.02, -0.46])

Or maybe we want to calculate the % difference from the open:

In [33]:
diff_percs = ((oc[:, 1] - oc[:, 0]) / oc[:, 0]) * 100
diff_percs

array([ 2.62525585, -3.34637114,  0.96112218,  0.91220068, -1.15989346,
       -1.44742444,  0.17142367,  1.12736661, -3.31777259, -1.86345646,
        1.67621294,  0.15702479, -3.32880977,  3.61485924,  1.46599584,
       -1.10107527,  0.40132612, -2.19533276,  2.27372487,  0.11515635,
       -0.72254335,  1.77519993, -0.40157137])
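Since the dates and the numerical data were kept in the same order, an index found in one can be used to look up the other. The sketch below uses only the first three rows of the data above, and finds the dates with the largest and smallest % change via `np.argmax` and `np.argmin`; `best` and `worst` are illustrative names:

```python
import numpy as np
from datetime import datetime

# First three rows of the dates and Open/Close values shown above.
dates = [datetime(2020, 10, 29), datetime(2020, 10, 28), datetime(2020, 10, 27)]
oc = np.array([[112.37, 115.32],
               [115.05, 111.2],
               [115.49, 116.6]])

diff_percs = (oc[:, 1] - oc[:, 0]) / oc[:, 0] * 100

best = int(np.argmax(diff_percs))   # position of the largest % change
worst = int(np.argmin(diff_percs))  # position of the smallest % change

# The same positions index into the dates list.
print(dates[best].date(), round(diff_percs[best], 2))    # 2020-10-29 2.63
print(dates[worst].date(), round(diff_percs[worst], 2))  # 2020-10-28 -3.35
```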