# Arrays

The first step in doing data science is to collect a *data set*. That is, if we want to answer a question -- such as, "How much money does the average data scientist make per year?" -- we don't go out and ask only one person; we survey a *lot* of people and analyze the results. As such, we need ways of working with large *collections* of data. In this section, we'll see two data types -- lists and arrays -- which enable us to work with sequential data.

## Lists

In Python, the simplest way to make a collection of data is by creating a *list*. You can do this by surrounding a group of items with square brackets, `[ ]`, and separating each item with a comma `,`:

In [None]:
salaries = [110_000, 95_000, 100_000]
salaries

Lists are their own data type:

In [None]:
type(salaries)

Any type of data is allowed inside a list (including other lists), and you can include variables:

In [None]:
x = 42
random_stuff = ['oranges', x, True, [1, 2, 3]]
random_stuff

Python's lists are versatile and easy to use, but they have a big problem: they are slow. As data scientists, we will be working with sequences of millions, if not billions, of entries -- so speed is of the essence. Therefore, we will use another type of collection to store our sequential data: the *array*.

## Arrays

Arrays are like lists, but optimized for the types of heavy calculations done in data science. They are blazing fast, and memory-efficient.

Arrays aren't included with Python, however.
Remember that Python wasn't originally designed specifically for data scientists. Instead, it is a *general purpose* language, used by web developers, software engineers, and artists, too. So in order to give Python what it needs -- a way of efficiently working with large sequences of numbers -- a group of scientists independently developed an extension to Python called [NumPy](https://numpy.org/) (short for "Numerical Python").

```{figure} ../images/numpy-logo.svg
---
height: 150px
name: my-figure
---
The *NumPy* logo.
```

```{tip}
Avoid the embarrassment -- it's pronounced "num-pie"  
<small>(not "num-pee")</small>
```

To get access to arrays, we'll need to import *NumPy*, just as we did with the `math` module in the previous chapter:

```{margin}

Note that while the `math` module is included with Python, *NumPy* is not (it has to be installed separately). But we import them the same way.
```

In [None]:
import numpy as np

The notation `as np` means that we are giving `numpy` a new, shorter name that will be faster to type. Whenever we want to use a function in the package, we'll write `np.` instead of `numpy.`.

```{margin}

You could change `numpy`'s name to whatever you want, for instance: `import numpy as my_favorite_library`. However, importing it as `np` is a *de facto* standard in data science. Don't import it as anything else unless you have a good reason.
```

Let's create an array. We do so by calling the `np.array()` function with a list of data:

In [None]:
hours_slept_array = np.array([8, 7, 7, 8, 5, 8, 9])
hours_slept_array

```{note}

Note the square brackets! If you try to create an array without them, you'll see an error.
```
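As a small sketch of what the note describes, the code below calls `np.array` once with square brackets and once without. The `try`/`except` wrapper is Python's standard way of catching an error so that the rest of the code keeps running; it is used here only for illustration, and the variable `error_message` is a made-up name.

```python
import numpy as np

# With square brackets: one list is passed to np.array, which works.
hours_slept_array = np.array([8, 7, 7, 8, 5, 8, 9])
print(hours_slept_array)

# Without square brackets: seven separate arguments are passed, which fails.
error_message = None
try:
    np.array(8, 7, 7, 8, 5, 8, 9)
except TypeError as error:
    # Store the message so we can look at it without stopping the program.
    error_message = str(error)

print("TypeError:", error_message)
```

The exact wording of the error message may vary between NumPy versions, but in each case it is a `TypeError` complaining about the arguments.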

Arrays are their own data type:

In [None]:
type(hours_slept_array)

The array we've created contains numbers, but arrays can also contain other types of data, like strings or bools. *But* in order to maximize their efficiency, a single array should only contain a single data type.

In [None]:
np.array(['this', 'is', 'also', 'fine'])

Remember what happened when we evaluated expressions that contained both ints and floats? The result was always a float. The same thing will happen if we try to make an array containing ints and floats:

In [None]:
np.array([1, 2, 3.0])

If possible, NumPy will always try to convert everything you give it to the same type. That means if you give it strings and numbers, it'll turn everything into strings!

Why is this? Because you can always convert a number into a string (just place quotes around it!), but only some strings can be reliably converted into a number. For the sake of consistency, NumPy turns it all into strings.


```{margin}
No need to be worried about the weird-looking "dtype" -- it just tells us that the array's elements are stored as [Unicode](https://en.wikipedia.org/wiki/Unicode) strings (`U`) with a maximum possible length of 21 characters (`21`).
```

In [None]:
np.array([1, 2, '3'])
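One way to check which type NumPy settled on is to look at an array's `dtype` attribute. The sketch below builds the two mixed arrays from this section and prints their types; the variable names are made up for illustration.

```python
import numpy as np

ints_and_floats = np.array([1, 2, 3.0])
numbers_and_strings = np.array([1, 2, '3'])

# Ints mixed with floats: everything becomes a float.
print(ints_and_floats.dtype)  # float64

# Numbers mixed with strings: everything becomes a string.
print(numbers_and_strings.dtype)

# The first element is now the string '1', not the number 1.
print(numbers_and_strings[0] == '1')  # True
```

The string `dtype` printed in the middle starts with `<U` on most machines; the number after it may differ from system to system.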

Sometimes it is useful to know how many elements are in an array. We can determine this with the `len` function:

In [None]:
arr = np.array([1, 2, 3])
len(arr)

## Array Methods

NumPy comes packed with additional functions and methods that perform a vast number of useful calculations on arrays. Better yet: these functions and methods are *fast*.

The NumPy functions can be called just like we called the math functions. Once we've imported the `numpy` library (abbreviated as `np`) we can just type `np.` followed by a function name to access the function. For instance, you can use the `np.mean` function to calculate the average value of a sequence:

In [None]:
example_array = np.array([1, 1, 2, 3, 3])
np.mean(example_array)

There are loads of more complex functions, such as the `np.diff` function which calculates the difference between each consecutive pair of elements:

In [None]:
np.diff(example_array)

Just like strings, arrays also have their own special methods that can perform calculations. A few useful ones are shown below:

In [None]:
example_array.min()

In [None]:
example_array.max()

In [None]:
example_array.sum()

In [None]:
example_array.mean()

Don't worry, you don't need to memorize all of the different functions/methods (there are a lot!) -- we'll include references when necessary.
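To make the claim about speed a little more concrete, here is a rough timing sketch. It assumes a made-up data set of one million identical sleep measurements and uses Python's built-in `timeit` module to time adding them up, once as a list and once as an array. The exact timings depend on your machine, so treat the numbers as a ballpark only.

```python
import timeit

import numpy as np

# One million made-up measurements, stored as a list and as an array.
n = 1_000_000
hours_list = [7.5] * n           # a list containing 7.5 repeated n times
hours_array = np.array(hours_list)

# Both approaches give the same total.
list_total = sum(hours_list)
array_total = hours_array.sum()
print(list_total, array_total)

# Time each approach by running it five times.
list_seconds = timeit.timeit(lambda: sum(hours_list), number=5)
array_seconds = timeit.timeit(lambda: hours_array.sum(), number=5)

print(f"list:  {list_seconds:.4f} seconds")
print(f"array: {array_seconds:.4f} seconds")
```

On a typical machine the array version should finish noticeably sooner, though the size of the gap varies.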

### Example

Every year, the programming community forum [StackOverflow](http://stackoverflow.com) [surveys](https://insights.stackoverflow.com/survey/2019#overview) its users, asking them such important questions as: what is your salary? and, how many computer monitors are on your desk at home? The results are publicly available. Since many of those who respond are data scientists, we can use the data to get an idea of a typical data scientist's salary.

In [None]:
salaries = np.loadtxt('../../data/salaries.csv')

The variable `salaries` is a NumPy array containing the salaries of every US-based data scientist in the survey. How many were there? We can answer that with `len`:

In [None]:
len(salaries)

What was the mean salary?

In [None]:
salaries.mean()

Nice. What about the *median* salary?

In [None]:
salaries.median()

Oops. It turns out that NumPy arrays have no method called `median`. There is, however, an `np.median` function:

In [None]:
np.median(salaries)

Notice that the median is about \$10,000 less than the mean. As a data scientist would point out, the mean is more "sensitive" to "outliers", meaning that a few people who make a very large amount of money can skew the mean. Let's see what the largest salary is:

In [None]:
salaries.max()

2 million dollars! Remember, though: these salaries are *self-reported*.
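To see how a single outlier pulls the mean but not the median, consider the small sketch below. The salaries in it are invented for illustration and are not taken from the survey.

```python
import numpy as np

# Made-up salaries for illustration only (not from the survey).
typical = np.array([90_000, 95_000, 100_000, 105_000, 110_000])

# The same salaries, except the highest earner now reports 2 million.
with_outlier = np.array([90_000, 95_000, 100_000, 105_000, 2_000_000])

print(np.mean(typical), np.median(typical))            # 100000.0 100000.0
print(np.mean(with_outlier), np.median(with_outlier))  # 478000.0 100000.0
```

Changing one value nearly quintupled the mean, while the median did not move at all.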

### Accessing array items

An array is an *ordered sequence* of items that has a beginning and an end. We can retrieve an element by specifying its {dterm}`index`. The index of the first item in an array is zero, the index of the second item is one, and so on. For example, let's say we have an array with three elements:

In [None]:
names = np.array(['Xanthippe', 'Yvonne', 'Zelda'])
names

To get the first element out of the array, we write:

In [None]:
names[0]

To get the second, we write:

In [None]:
names[1]

And to get the third (i.e., last) element, we write:

In [None]:
names[2]

Here's a useful trick: if you use a negative number to retrieve an element, Python starts counting from the *back* of the array. So, for instance, to retrieve the last element we can also write:

In [None]:
names[-1]
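As a quick sketch of how negative indices line up with the usual ones, the code below walks backwards through the same three names. An index of `-1` refers to the same element as `len(names) - 1`.

```python
import numpy as np

names = np.array(['Xanthippe', 'Yvonne', 'Zelda'])

print(names[-1])  # Zelda
print(names[-2])  # Yvonne
print(names[-3])  # Xanthippe

# The last element can be reached from the front or from the back.
print(names[-1] == names[len(names) - 1])  # True
```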

The array above has only three things in it, and their indices are 0, 1, and 2. What happens if we try to access the array at an index that doesn't exist, such as 99?

In [None]:
names[99]

## Element-wise operations

Arrays really start to shine when math is involved: they can quickly perform an operation on every element they contain. To begin, let's create a simple array of numbers:

In [None]:
array1 = np.array([1, 2, 3])

To subtract 3 from all of these numbers, we can simply write:

In [None]:
array1 - 3

To multiply each of the numbers by 2, we would write:

In [None]:
array1 * 2

And so on. In practice this means we could do something like convert an entire array of temperatures measured in Fahrenheit to Celsius by writing a single expression:

In [None]:
temperatures_f = np.array([0.5, 32.0, 71.6, 212.0])

Remember that the formula for converting a measurement in Fahrenheit to Celsius is $C = (F - 32) * (5/9)$. Therefore:

In [None]:
temperatures_c = (temperatures_f - 32) * (5 / 9)
temperatures_c

In the above example, first `(temperatures_f - 32)` is evaluated and produces an array with 32 subtracted from every temperature. Then `(5 / 9)` is evaluated. Then every element in the new array is multiplied by 5/9, producing the final output array.
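The same evaluation can be spelled out one step at a time. In the sketch below, the intermediate names `shifted` and `factor` are made up for illustration; the final result is the same array as the single-expression version.

```python
import numpy as np

temperatures_f = np.array([0.5, 32.0, 71.6, 212.0])

# Step 1: subtract 32 from every temperature.
shifted = temperatures_f - 32

# Step 2: evaluate the conversion factor (an ordinary float).
factor = 5 / 9

# Step 3: multiply every element of the new array by the factor.
temperatures_c = shifted * factor

print(shifted)
print(factor)
print(temperatures_c)
```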

We can also do element-wise operations between pairs of data from two arrays.

For this to work, both arrays must have the same size. The arrays are then lined up next to each other, and the operation is performed between every corresponding pair of elements. This is best demonstrated with some examples:

In [None]:
array1 = np.array([1, 2, 3])
array2 = np.array([2, 4, 6])

In [None]:
array1 * array2

In [None]:
array1 - array2

In [None]:
array1**array2

Both paired element-wise operations and standalone element-wise operations can be used in the same expression, since we're always producing another array as a result of each expression.

In [None]:
(array1 * 2) - array2

Watch out for the new errors you might encounter! Let's see what happens if our other array isn't the same size.

In [None]:
array_short = np.array([2, 4])
array1 * array_short

The error message is a little cryptic -- what is this about "broadcasting"? Nevertheless, we can kind of understand that there is some issue with the "shape" of the two arrays not being compatible.
In fact, this error is telling us that the first array has three elements but the other only has two, so the two arrays couldn't be pushed into the same shape.
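To connect the error message to the arrays themselves, we can look at each array's `shape` attribute, which reports how many elements it has. The sketch below prints both shapes and then catches the error using `try`/`except`, Python's standard error-handling syntax, so the message can be printed without stopping the program. The name `broadcast_message` is made up for illustration.

```python
import numpy as np

array1 = np.array([1, 2, 3])
array_short = np.array([2, 4])

# The shapes differ: three elements versus two.
print(array1.shape)       # (3,)
print(array_short.shape)  # (2,)

broadcast_message = None
try:
    array1 * array_short
except ValueError as error:
    # The message mentions both shapes, (3,) and (2,).
    broadcast_message = str(error)

print("ValueError:", broadcast_message)
```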

## Ranges

Often it's useful to create an array of consecutive numbers, such as:

In [None]:
np.array([0, 1, 2, 3, 4, 5, 6, 7])

Rather than write this array by hand, we can use the `np.arange` function to do it for us:

In [None]:
np.arange(8)

Notice that just like indices, ranges start at zero by default and exclude the endpoint. So calling `np.arange(12)`, for instance, will create an array with twelve elements whose first element is 0 and whose last element is 11.
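A quick check of the length and the two ends of `np.arange(12)` is sketched below; the variable name `twelve` is made up for illustration.

```python
import numpy as np

twelve = np.arange(12)

print(len(twelve))  # 12
print(twelve[0])    # 0
print(twelve[-1])   # 11
```

Because counting starts at 0 and stops just before the endpoint, the number of elements equals the endpoint itself.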

We have seen `np.arange` called with one argument, but it can be called with one, two, or three arguments:

- `np.arange(endpoint)` Consecutive integers from 0 to endpoint (exclusive).
- `np.arange(start, endpoint)` Consecutive numbers from start to endpoint (exclusive), increasing by 1 each step.
- `np.arange(start, endpoint, stepsize)` Consecutive numbers from start to endpoint (exclusive), changing by stepsize each step.

Some examples might make this clearer:

In [None]:
np.arange(10)

In [None]:
np.arange(5.5, 10)

In [None]:
np.arange(0, 1, 0.2)

In [None]:
np.arange(-1, -4)

In [None]:
np.arange(-1, -4, -1)
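The last two examples are worth comparing side by side. With the default step size of 1, counting upward from -1 never gets to -4, so the result is an empty array; a negative step size is needed to count downward. The sketch below uses made-up variable names to show both results.

```python
import numpy as np

# Default step of +1: starting at -1, we are already past -4, so nothing is produced.
empty = np.arange(-1, -4)

# Step of -1: count down from -1, stopping before -4.
counting_down = np.arange(-1, -4, -1)

print(len(empty))     # 0
print(counting_down)  # [-1 -2 -3]
```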

The result of `np.arange` is an array like any other, so we can write things like:

In [None]:
np.arange(5) + 3

```{hiddenanswer}
---
question: |
    How would you use `np.arange` to create the array containing the first 6 powers of 2: 1, 2, 4, 8, 16, and 32?
answer: |
    `2**np.arange(6)`
```

---
## Summary

- To get multiple pieces of data in one place, we create a **collection**. If the collection is ordered then it is a **sequence**.
- Each item in a sequence has an **index** -- its position, starting at zero.
- **Lists** are the most basic sequence, and are created by surrounding a group of items with square brackets and separating each item by commas: `[item, item, ...]`
- **Arrays** are a sequence type from the NumPy library, and are created by passing a list into the `np.array` function: `np.array([item, item, ...])` or `np.array(my_list)`
- NumPy offers lots of additional functions that can be called on sequences. These can be accessed using `np.function_name(arguments, ...)`.
- An item can be selected from an array by using brackets with the index of the item: `my_array[index]`
- Arrays support **element-wise operations**, such as adding or multiplying all elements by a single number.
- Arrays of the same length support paired element-wise operations between the two arrays, such as adding or multiplying each element in one array with each element in the same position of another array.
- An array of numbers with constant spacing can be easily constructed using `np.arange`.
- A range will always *exclude* the endpoint -- so `np.arange(3)` will count `0 1 2`.