# Collections of Data - Sequences

Now we have a decent understanding of individual data values in Python. But as a data scientist, you'll need to work with an entire set of data values -- not just one!

We can start heading towards the notion of a 'data set' by learning how to create collections of values.

## Lists

In Python, the simplest way to make a collection is by creating a list. You can do this by surrounding a group of items with square brackets, `[ ]`, and separating each item with a comma `,`. Any data type is allowed inside a list (including other lists)!

In [None]:
my_amazing_list = ['one', True, 3, [5.0 - 1.0]]
my_amazing_list

In [None]:
type(my_amazing_list)

The end result is an ordered sequence of items that we can read from beginning to end. Each item has a position, called an **index**, which starts at 0. So in the list above, which has four elements, the first element has index 0 and the last element has index 3.

Now that your amazing list is created, it's possible to add, remove, reorder, and search for items in the list, as well as calculate some interesting statistics!

### Functions on lists

Collections are handy because they allow us to perform calculations that require multiple pieces of data to be computed.

For instance, if you've been keeping track of the hours of sleep you got each night for the past week (Monday to Sunday), you might be interested in the shortest and longest amounts of time you slept. Neither of these calculations can be done easily with simple expressions, but both are easy to do by calling a function on a list!

Let's start by creating our collection of data and assigning it to a meaningful name.

In [None]:
hours_slept = [8, 7, 7, 8, 5, 8, 9]

Some functions are specifically designed to perform calculations on sequences. For example, `max` and `min` are functions that will find the largest or smallest item in a list, respectively.

If we're curious about the largest number of hours you slept in a single night last week, we can pass our list into the `max` function.

In [None]:
max([8, 7, 7, 8, 5, 8, 9])

And if we want to find the smallest number of hours you slept, then we can use the `min` function on our list. It was tedious to type out all the hours when we calculated the max, so let's take the smarter route and just use the variable name of our collection instead.

In [None]:
min(hours_slept)

We can also calculate the total number of hours we slept last week by using `sum`, and we can find out the number of items in a list by using `len`. Feel free to play with these functions in the interactive notebook!
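
As a quick illustration of the two functions just mentioned, here is a minimal sketch using the same `hours_slept` list. The names `total_hours` and `nights` are chosen only for this example.

```python
hours_slept = [8, 7, 7, 8, 5, 8, 9]

total_hours = sum(hours_slept)  # adds up every item in the list
nights = len(hours_slept)       # counts how many items the list holds

print(total_hours)  # 52
print(nights)       # 7
```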

````{hiddenanswer}
---
question: Using `sum` and `len` how would you write an expression that calculates the average number of hours you slept last week?
answer: |
    ```
    sum(hours_slept) / len(hours_slept)
    ```
    7.428571428571429
````

Finally, just like strings had their own set of special functions (called methods), lists have their own set of methods.

These methods allow us to do things like add and remove items from the list. We'll go over these methods later when we need them!
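
As a small preview, the sketch below shows a few list methods for adding, removing, reordering, and searching. The list `naps` is a hypothetical example made up for this illustration; the methods themselves (`append`, `remove`, `sort`, `index`) are standard Python list methods.

```python
# A hypothetical list, used only for this illustration
naps = [1, 2]

naps.append(3)            # add an item to the end      -> [1, 2, 3]
naps.remove(1)            # remove the first 1 we find  -> [2, 3]
naps.sort(reverse=True)   # reorder the list in place   -> [3, 2]
position = naps.index(2)  # search for the index of 2   -> 1

print(naps)      # [3, 2]
print(position)  # 1
```

Notice that these methods change the list itself rather than producing a new one.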

### Selecting items from a list

Recall that every item in a sequence has an **index** -- this is how the computer stores its position in the sequence. Indices (plural of index) start at 0.

To grab a specific item from a list, we can use square brackets containing the index of the item we want.

In [None]:
days_of_the_week = [
    'Monday',
    'Tuesday',
    'Wednesday',
    'Thursday',
    'Friday',
    'Saturday',
    'Sunday'
]

In [None]:
days_of_the_week[0]

In [None]:
days_of_the_week[len(days_of_the_week) - 1]

Notice that if we try to select an item at an index that's outside of the list, we'll get a helpful `IndexError`.

In [None]:
days_of_the_week[7]
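
To see why index 7 fails, it can help to compare it against the last valid index. The sketch below catches the error so the code keeps running; `try`/`except` is not covered in this chapter and is used here only to display the message. The names `last_valid_index`, `day`, and `message` are illustrative.

```python
days_of_the_week = [
    'Monday', 'Tuesday', 'Wednesday', 'Thursday',
    'Friday', 'Saturday', 'Sunday'
]

# Seven items means the valid indices run from 0 to 6
last_valid_index = len(days_of_the_week) - 1

try:
    day = days_of_the_week[7]  # one past the last valid index
except IndexError as error:
    day = None
    message = str(error)

print(last_valid_index)  # 6
print(message)           # list index out of range
```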

## Arrays

Python’s lists are useful and easy to work with, but can be slow. As data scientists, we will eventually be working with sequences of millions, if not billions, of entries -- so speed is of the essence. Additionally, a lot of the calculations we've seen above aren't going to work if our list contains mixed types! After all, have you ever tried calculating the sum of `2` and `'orange'`?
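
The mixed-type problem is easy to reproduce. In this minimal sketch, the list `mixed` is a made-up example, and `try`/`except` is used only so that the error message can be printed instead of stopping the program.

```python
# A hypothetical list that mixes a number with a string
mixed = [2, 'orange']

try:
    total = sum(mixed)  # Python cannot add an int and a str
except TypeError as error:
    total = None
    print('TypeError:', error)
```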

The numerical analysis library `numpy` fixes this by introducing a new type of sequence: *arrays*. Let's import the `numpy` library (calling it `np` just to make it quicker to type) and get started!

In [None]:
import numpy as np

```{tip}
Avoid the embarrassment -- it's pronounced "num-pie"  
<small>(not "num-pee")</small>
```

Arrays are stricter than lists, with two main restrictions that are instantly noticeable:

1. All of the items need to be the same data type
2. Once created, we cannot directly add or remove items

However, NumPy arrays offer incredible power and speed which have made them one of the most commonly used collections in data science.

A lot of the same concepts from lists carry over when trying to understand arrays. In fact, to create an array we pass in a list to the `np.array()` function.

In [None]:
hours_slept_array = np.array([8, 7, 7, 8, 5, 8, 9])
hours_slept_array
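
With an array in hand, the second restriction listed earlier is easy to see in action. The sketch below assumes we want to tack one more night onto the array: arrays have no `append` method of their own, and the NumPy function `np.append` builds a brand-new array instead of changing the original. The names `extended` and `has_append_method` are illustrative.

```python
import numpy as np

hours_slept_array = np.array([8, 7, 7, 8, 5, 8, 9])

# Unlike lists, arrays do not have an `append` method
has_append_method = hasattr(hours_slept_array, 'append')

# np.append returns a new, longer array; the original is left untouched
extended = np.append(hours_slept_array, 6)

print(has_append_method)       # False
print(len(hours_slept_array))  # 7
print(len(extended))           # 8
```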

Arrays can also contain other types of data, like strings or bools or other arrays, *but* remember that a single array can only contain a single type of data.

In [None]:
np.array(['this', 'is', 'also', 'fine'])

Remember what happened when we evaluated expressions that contained both ints and floats? The result was always a float. The same thing will happen if we try to make an array containing ints and floats -- remember, an array can only contain one data type.

In [None]:
np.array([1, 2, 3.0])

If possible, NumPy will always try to convert everything you give it to the same type. That means if you give it strings and numbers, it'll turn everything into strings!

Why is this? Because you can always convert a number into a string (just place quotes around it!), but only some strings can be reliably converted into a number. For the sake of consistency, NumPy turns it all into strings.

```{margin}
No need to be worried about the weird-looking "dtype" -- it just tells us that the items in the array are stored as unicode strings (`U`) with a maximum possible length of 21 characters (`21`).
```

In [None]:
np.array([1, 2, '3'])
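
One way to check which type NumPy settled on is the array's `dtype` attribute. The sketch below repeats the two mixed arrays from above; the variable names are chosen for this example, and `tolist` is used only to show the converted items as ordinary Python values.

```python
import numpy as np

ints_and_floats = np.array([1, 2, 3.0])
numbers_and_strings = np.array([1, 2, '3'])

# Every item became a float
print(ints_and_floats.dtype)     # float64
print(ints_and_floats.tolist())  # [1.0, 2.0, 3.0]

# Every item became a string ('U' stands for unicode string)
print(numbers_and_strings.dtype.kind)  # U
print(numbers_and_strings.tolist())    # ['1', '2', '3']
```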

### Functions on Arrays

The functions we looked at that can be used on lists can also be used on arrays, but NumPy also comes with many additional functions and methods that perform a wide range of useful calculations on arrays.

The NumPy functions can be called just like we called the math functions. Once we've imported the `numpy` library (abbreviated as `np`) we can just type `np.` followed by a function name to access the function.

The same functions we used to find the shortest, longest, and total amount of time slept can also be used on arrays. NumPy also offers its own versions of these functions.

In [None]:
min(hours_slept_array) == np.min(hours_slept_array)

In [None]:
max(hours_slept_array) == np.max(hours_slept_array)

In [None]:
sum(hours_slept_array) == np.sum(hours_slept_array)

The `np` variants of functions are usually faster and able to handle more types of data. But NumPy also offers tons of functions that *aren't* built in to Python. For instance, you can use the `np.mean` function to calculate the average value of a sequence -- a metric which is extremely useful and common in data analysis.

In [None]:
example_array = np.array([1, 1, 2, 3, 3])
np.mean(example_array)
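
To connect `np.mean` back to what we already know, this sketch compares it with the `sum` and `len` approach from earlier. The names `mean_by_hand` and `mean_with_numpy` are illustrative.

```python
import numpy as np

example_array = np.array([1, 1, 2, 3, 3])

# The average is the total divided by the number of items
mean_by_hand = sum(example_array) / len(example_array)
mean_with_numpy = np.mean(example_array)

print(mean_by_hand, mean_with_numpy)  # 2.0 2.0
```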

There are loads of more complex calculations, such as the `np.cumsum` function which calculates the 'cumulative sum' of a sequence -- a running total of the sum of the sequence at each point along the sequence.

In [None]:
np.cumsum(example_array)
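
To make the idea of a running total concrete, the sketch below writes out each partial sum by hand and compares the result with `np.cumsum`. The list `by_hand` exists only for this illustration.

```python
import numpy as np

example_array = np.array([1, 1, 2, 3, 3])

# Each entry adds one more item from the sequence to the total
by_hand = [
    1,
    1 + 1,
    1 + 1 + 2,
    1 + 1 + 2 + 3,
    1 + 1 + 2 + 3 + 3,
]

running_totals = np.cumsum(example_array)

print(by_hand)                  # [1, 2, 4, 7, 10]
print(running_totals.tolist())  # [1, 2, 4, 7, 10]
```

The last entry of the cumulative sum always matches the plain `sum` of the whole sequence.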

Just like strings and lists, arrays also own special methods that can perform calculations! All of the NumPy functions we've looked at so far can also be called using the same dot notation that we used with string and list methods.

In [None]:
example_array.min()

In [None]:
example_array.max()

In [None]:
example_array.sum()

In [None]:
example_array.mean()

In [None]:
example_array.cumsum()

Don't worry, you don't need to memorize all of the different functions/methods (there are a lot!) -- we'll include references when necessary.

### Selecting items from arrays

You can select specific items from arrays the exact same way you did with lists, using square brackets and the index of the item you want to select.

In [None]:
days_of_week_array = np.array(days_of_the_week)

In [None]:
days_of_week_array[0]

In [None]:
days_of_week_array[len(days_of_week_array) - 1]

In [None]:
days_of_week_array[7]

### Element-wise operations

The power of arrays really starts to shine when math is involved. Arrays have the power to quickly perform operations over each element they contain.

In [None]:
array1 = np.array([1, 2, 3])

In [None]:
array1 - 3

In [None]:
array1 * 2

In [None]:
# Remember, we can use the modulus operator (`% 2`) to check whether something is even
array1 % 2

In [None]:
array1**2
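
It's worth contrasting this with how plain lists behave, since the same operator can mean something quite different. In this minimal sketch, `plain_list`, `repeated_list`, and `doubled_array` are names made up for the comparison.

```python
import numpy as np

plain_list = [1, 2, 3]
array1 = np.array(plain_list)

# On a list, * repeats the whole list
repeated_list = plain_list * 2

# On an array, * multiplies every element
doubled_array = array1 * 2

print(repeated_list)           # [1, 2, 3, 1, 2, 3]
print(doubled_array.tolist())  # [2, 4, 6]
```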

In practice this means we could do something like convert an entire array of temperatures measured in Fahrenheit to Celsius by writing a single expression.

```{margin}
The formula for converting a measurement in Fahrenheit to Celsius is $C = (F - 32) \times (5/9)$
```

In [None]:
temperatures_f = np.array([0.5, 32.0, 71.6, 212.0])

In [None]:
temperatures_c = (temperatures_f - 32) * (5 / 9)
temperatures_c

In the above example, `(temperatures_f - 32)` is evaluated first, producing an array with 32 subtracted from every temperature. Next, `(5 / 9)` is evaluated. Finally, every element in the new array is multiplied by 5/9, producing the final output array.
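
The same evaluation can be spelled out one step at a time. This is only a sketch of the order of operations; the intermediate names `shifted` and `scale` are introduced here for illustration and are not needed in practice.

```python
import numpy as np

temperatures_f = np.array([0.5, 32.0, 71.6, 212.0])

shifted = temperatures_f - 32     # step 1: subtract 32 from every element
scale = 5 / 9                     # step 2: a single ordinary number
temperatures_c = shifted * scale  # step 3: multiply every element by it

# Rounded for display: roughly -17.5, 0.0, 22.0 and 100.0 degrees Celsius
print(np.round(temperatures_c, 1))
```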

We can also do element-wise operations between pairs of data from two arrays!

For this to work, both arrays must have the same size. Then, each element of the first array will be operated on with the element in the same position of the second array.

In [None]:
array1 = np.array([1, 2, 3])
array2 = np.array([2, 4, 6])

In [None]:
array1 - array2

In [None]:
array1 * array2

In [None]:
array1**array2
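
To check what "the element in the same position" means, we can select individual items and repeat the calculation ourselves. The names `product`, `first_pair`, and `last_pair` are illustrative.

```python
import numpy as np

array1 = np.array([1, 2, 3])
array2 = np.array([2, 4, 6])

product = array1 * array2

# Each result comes from the pair of items sharing an index
first_pair = array1[0] * array2[0]  # 1 * 2
last_pair = array1[2] * array2[2]   # 3 * 6

print(product.tolist())       # [2, 8, 18]
print(first_pair, last_pair)  # 2 18
```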

Both paired element-wise operations and standalone element-wise operations can be used in the same expression, since we're always producing another array as a result of each expression.

In [None]:
(array1 * 2) - array2

Watch out for the new errors you might encounter! Let's see what happens if our other array isn't the same size.

In [None]:
array_short = np.array([2,4])
array1 * array_short

The error is telling us that the first array has three elements but the other only has two, so the two arrays couldn't be pushed into the same shape.
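
One way to spot this problem before it happens is to look at each array's `shape` attribute. The sketch below also catches the `ValueError` so that its message can be printed; `try`/`except` is used here only for that purpose, and `result` is an illustrative name.

```python
import numpy as np

array1 = np.array([1, 2, 3])
array_short = np.array([2, 4])

print(array1.shape)       # (3,)
print(array_short.shape)  # (2,)

try:
    result = array1 * array_short  # three elements cannot pair up with two
except ValueError as error:
    result = None
    print('ValueError:', error)
```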

## Ranges

It's often useful to create an array of consecutive numbers. Creating an array-**range** allows us to do just that.

In fact, we've already seen an example of this when we looked at **indices**. Remember that indices are how the computer keeps track of the items in a sequence, starting at 0 and counting up by 1 for each element in the sequence.

The indices for a sequence with four elements would be `0 1 2 3`, and we can call the `np.arange` function to easily create this sequence!

In [None]:
np.arange(4)

Notice that just like indices, ranges will start at zero by default and exclude the last number.
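
The link between ranges and indices can be sketched by building the indices of a sequence from its length. This example reuses the `days_of_the_week` list from earlier; `indices`, `first_day`, and `last_day` are names chosen for illustration.

```python
import numpy as np

days_of_the_week = [
    'Monday', 'Tuesday', 'Wednesday', 'Thursday',
    'Friday', 'Saturday', 'Sunday'
]

# Seven items, so the indices count from 0 up to (but not including) 7
indices = np.arange(len(days_of_the_week))

first_day = days_of_the_week[indices[0]]
last_day = days_of_the_week[indices[len(indices) - 1]]

print(indices.tolist())     # [0, 1, 2, 3, 4, 5, 6]
print(first_day, last_day)  # Monday Sunday
```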

While we saw an example of the range function being called with one argument, it can be called with one, two, or three arguments:

- `np.arange(endpoint)` Consecutive integers from 0 to endpoint (exclusive)
- `np.arange(start, endpoint)` Consecutive numbers from start to endpoint (exclusive), increasing by 1 each step.
- `np.arange(start, endpoint, stepsize)` Evenly spaced numbers from start to endpoint (exclusive), changing by stepsize each step.

In [None]:
np.arange(10)

In [None]:
np.arange(5.5, 10)

In [None]:
np.arange(0, 1, 0.2)

In [None]:
np.arange(-1, -4)

In [None]:
np.arange(-1, -4, -1)

Again, notice that a range includes the `start` value (unless the range is empty, as with `np.arange(-1, -4)`) but never includes the `endpoint` value.
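
A consequence of excluding the endpoint is that, for whole numbers counted one at a time, the length of the range is simply `endpoint - start`. The sketch below checks this and also shows why counting upward from -1 to -4 produces nothing. The values 2 and 6 are arbitrary choices for this example.

```python
import numpy as np

start = 2
endpoint = 6

counting = np.arange(start, endpoint)
print(counting.tolist())  # [2, 3, 4, 5]
print(len(counting))      # 4, the same as endpoint - start

# Counting up by 1 from -1 can never reach -4, so the range is empty
empty = np.arange(-1, -4)
print(len(empty))         # 0

# A negative step size counts downward instead
countdown = np.arange(-1, -4, -1)
print(countdown.tolist())  # [-1, -2, -3]
```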

Now that we can create an array of sequential numbers with even spacing, we can use our range in element-wise arithmetic!

In [None]:
my_range = np.arange(4)
my_range

In [None]:
my_range * 2

In [None]:
my_range ** 2

In [None]:
np.array([10, 5, 10, 5]) * my_range

---
## Summary

To get multiple pieces of data in one place, we create a **collection**. If the collection is ordered then it is a **sequence**.

Each item in a sequence has an **index** -- its position, starting at zero.

| Index | 0 | 1 | 2 | ... | n-1 |
| --- | --- | --- | --- | --- | --- |
| Sequence (length n) | [item, | item, | item, | ... | item] |

**Lists** are the most basic sequence, and are created by surrounding a group of items with square brackets and separating each item by commas: `[item, item, ...]`

Any data type is allowed inside of lists, and the items can have different types.

Some functions can be called on lists to perform calculations, such as `min`, `max`, `sum`, and `len`.

Lists own a handful of list **methods** -- functions that belong solely to the data type of lists.

Methods are called using **dot notation**, by placing a dot after a list or variable name of a list, then calling the function: `my_list.function_name(arguments, ...)`

Some list methods allow you to add and remove items from lists.

An item can be selected from a sequence by using brackets with the index of the item: `my_list[index]`

**Arrays** are a sequence from the NumPy library, and are created by passing a list into the `np.array` function: `np.array([item, item, ...])` or `np.array(my_list)`

All items in an array must have the same data type.

Functions that can be called on lists can also be called on arrays.

NumPy offers lots of additional functions that can be called on sequences. These can be accessed using `np.function_name(arguments, ...)`.

Arrays own a handful of methods, called using dot notation.

Some array methods allow you to perform calculations on the array.

An item can be selected from an array the same way as any sequence, by using brackets with the index of the item: `my_array[index]`

Arrays support **element-wise operations**, such as adding or multiplying all elements by a single number.

Arrays of the same length support paired element-wise operations between the two arrays, such as adding or multiplying each element in one array with the element in the same position of another array.

An array of numbers with constant spacing can be easily constructed using a NumPy array-**range**: `np.arange(endpoint)`, `np.arange(start, endpoint)`, or `np.arange(start, endpoint, stepsize)`

A range will always *exclude* the endpoint -- so `np.arange(3)` will count `0 1 2`.

Since `np.arange` produces an array, it can be used in element-wise operations.