### Pandas Lesson 1: Series
This tutorial introduces the fundamental building block of pandas: the Series. By the end of this section, you will know how to create different types of Series, subset them, modify them, and summarize them.

### 1. What is a Series?
In the simplest terms, a Series is an ordered collection of values, generally all of the same type. For example, you can have a Series that contains the ages of everyone in your class (a numeric Series), or a Series of all the names of people in your family (a string Series).

This may sound familiar: isn’t that how we described numpy vectors (i.e. one-dimensional numpy arrays)? Yes! In fact, Series are basically one-dimensional numpy arrays with lots of extra features added on top of them. As we’ll see, most everything you could do with a numpy array you can do with a Series; Series can just do more.
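As a rough illustration of that overlap, here is a minimal sketch. The ages are made-up values, and the variable names are only for this example:

```python
import numpy as np
import pandas as pd

# Hypothetical ages, first as a plain numpy array, then as a Series.
ages_array = np.array([21, 35, 48])
ages = pd.Series(ages_array)

# Arithmetic is element-wise, just as with a numpy array.
ages_next_year = ages + 1
print(ages_next_year.tolist())  # [22, 36, 49]

# numpy functions accept a Series directly.
mean_age = np.mean(ages)

# .to_numpy() hands the values back as a plain numpy array.
back_to_array = ages.to_numpy()
print(type(back_to_array))  # <class 'numpy.ndarray'>
```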

Series are central to pandas because pandas was designed for statistics, and Series are a perfect way to collect lots of different observations of a variable.

There are lots of ways to create Series, but the easiest is to just pass a list or an array to the pd.Series constructor.

To illustrate, let me tell you about a week at the zoo I wish I owned. Here’s what attendance looked like at my zoo last week:

| Day of Week | Attendees   |
|   :----     |    ----:    |   
| Monday      | 132 people  | 
| Tuesday     | 94 people   | 
| Wednesday   | 112 people  |
| Thursday    | 84 people   |
| Friday      | 254 people  |
| Saturday    | 322 people  |
| Sunday      | 472 people  |

Let’s make a Series for this attendance pattern:

In [1]:
import pandas as pd # We have to import pandas to use Series!

attendance = pd.Series([132, 94, 112, 84, 254, 322, 472])
attendance

0    132
1     94
2    112
3     84
4    254
5    322
6    472
dtype: int64

### Indices
One of the fundamental differences between numpy arrays and Series is that every Series is associated with an index. An index is a set of labels, one for each observation in a Series. If you don’t specify an index when you create a Series, pandas will create a default index that labels each row with its initial row number, but you can specify an index if you want.

In this case, for example, we know that these entries are associated with different days of the week, so let’s specify an index for our attendance Series:

In [2]:

attendance = pd.Series([132, 94, 112, 84, 254, 322, 472],
                       index=['Monday', 'Tuesday', 'Wednesday', 'Thursday',
                              'Friday', 'Saturday', 'Sunday'])
attendance

Monday       132
Tuesday       94
Wednesday    112
Thursday      84
Friday       254
Saturday     322
Sunday       472
dtype: int64

As we can see, the rows are now labeled with days of the week on the left side, rather than with initial row numbers.

Note that you can always access a Series’ index with the .index property:

In [3]:
attendance.index

Index(['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday',
       'Sunday'],
      dtype='object')
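For comparison, here is a small sketch of what .index returns when no index was specified. The values are arbitrary and only for illustration:

```python
import pandas as pd

# No index given, so pandas falls back to the default row-number labels.
unlabeled = pd.Series([10, 20, 30])
default_index = unlabeled.index

print(default_index)          # RangeIndex(start=0, stop=3, step=1)
print(default_index.tolist()) # [0, 1, 2]
```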

An important property of index labels is that they stay with each row, even if you sort your data. So if I sort my Series by attendance, not only will rows re-order, but so will the index labels:

In [None]:
attendance = attendance.sort_values()
attendance
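To make the point concrete, the sketch below rebuilds the same attendance Series and checks where the labels end up after sorting. The name sorted_attendance is introduced here only for illustration:

```python
import pandas as pd

attendance = pd.Series([132, 94, 112, 84, 254, 322, 472],
                       index=['Monday', 'Tuesday', 'Wednesday', 'Thursday',
                              'Friday', 'Saturday', 'Sunday'])

# sort_values() orders rows from lowest to highest attendance.
sorted_attendance = attendance.sort_values()

# The labels have moved along with their values.
print(sorted_attendance.index.tolist())
# ['Thursday', 'Tuesday', 'Wednesday', 'Monday', 'Friday', 'Saturday', 'Sunday']

# Each label is still attached to the same number as before the sort.
print(sorted_attendance["Friday"])  # 254
```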

Note: This seems intuitive with days of the week as our index labels, but it can be confusing when your index starts out as row numbers. For example, if we had not changed our index to days of the week, the default index labels would look just like row numbers. But if we then sort the Series, those labels move with their values, so they no longer correspond to row positions:

In [None]:
attendance = pd.Series([132, 94, 112, 84, 254, 322, 472])
attendance

In [None]:
attendance = attendance.sort_values()
attendance
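The sketch below shows the shuffled labels explicitly. It also uses reset_index(drop=True), a standard pandas method not otherwise covered in this lesson, as one possible way to renumber the rows if the original labels are no longer needed:

```python
import pandas as pd

attendance = pd.Series([132, 94, 112, 84, 254, 322, 472])
sorted_attendance = attendance.sort_values()

# The default labels travelled with their values, so they are out of order.
print(sorted_attendance.index.tolist())  # [3, 1, 2, 0, 4, 5, 6]

# Optional: throw away the old labels and number the rows 0..6 again.
renumbered = sorted_attendance.reset_index(drop=True)
print(renumbered.index.tolist())  # [0, 1, 2, 3, 4, 5, 6]
```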

### 2. Subsetting Series
Extracting a subset of elements from a Series is an extremely important task, not least because it generalizes nicely to working with bigger datasets (which are at the heart of data science). This process — whether applied to a Series or a dataset — is often referred to as “taking a subset”, “subsetting”, or “filtering”. If there is one skill you need to master as quickly as possible, it’s this.

In pandas, there are three ways to filter a Series: using a separate logical Series, using row-number indexing, and using index labels. I tend to use the first method most, but all three are useful. The first and second of these you will recognize from numpy arrays, while the last one (since it uses index labels, which only exist in pandas) is unique to pandas.

### Subsetting using row-number indexing
One way to subset a Series is to specify the row numbers you want to keep using the iloc indexer. (iloc stands for “integer location”, since row numbers are always integers.) This will give you the behavior you’re more familiar with from R or numpy. Just remember that, as in all of Python, the first row is numbered 0!

In [None]:
fruits = pd.Series(["apple", "banana"])
fruits.iloc[0]

You can also subset with lists of rows, or ranges, just like in numpy:

In [None]:
fruits.iloc[[0, 1]]

In [None]:
fruits.iloc[0:2]
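With only two fruits it is hard to see the difference between these forms, so here is a sketch on a slightly longer, made-up Series (the name letters is just for this example):

```python
import pandas as pd

letters = pd.Series(["a", "b", "c", "d", "e"])

# A single row number returns the value itself.
last_letter = letters.iloc[-1]       # negative numbers count from the end
print(last_letter)  # e

# A list of row numbers returns a Series with just those rows.
picked = letters.iloc[[0, 2, 4]]
print(picked.tolist())  # ['a', 'c', 'e']

# A range stops just before its end point, as in numpy.
middle = letters.iloc[1:4]
print(middle.tolist())  # ['b', 'c', 'd']
```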

### Subsetting using index values
We can also subset our rows using the index values associated with each row, via the loc indexer.

In [None]:
attendance = pd.Series([132, 94, 112, 84, 254, 322, 472],
                       index=['Monday', 'Tuesday', 'Wednesday', 'Thursday',
                              'Friday', 'Saturday', 'Sunday'])

In [None]:
attendance.loc["Monday"]

You can also ask for ranges of index labels. Note that unlike integer ranges (like the 0:2 we used above to get rows 0 and 1), index label ranges include the last item in the range. So, for example, if I ask for .loc["Monday":"Friday"], Friday will be included, even though .iloc[0:2] doesn’t include row 2.

In [None]:
attendance.loc["Monday":"Friday"]
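A side-by-side sketch of the two kinds of ranges on the attendance Series may help. The variable names by_label, by_position, and weekend are assumptions made for this example:

```python
import pandas as pd

attendance = pd.Series([132, 94, 112, 84, 254, 322, 472],
                       index=['Monday', 'Tuesday', 'Wednesday', 'Thursday',
                              'Friday', 'Saturday', 'Sunday'])

# Label range: the end label (Friday) IS included, so we get 5 rows.
by_label = attendance.loc["Monday":"Friday"]
print(by_label.index.tolist())
# ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday']

# Integer range: the end position (4, which is Friday) is NOT included.
by_position = attendance.iloc[0:4]
print(by_position.index.tolist())
# ['Monday', 'Tuesday', 'Wednesday', 'Thursday']

# loc also accepts a list of labels.
weekend = attendance.loc[["Saturday", "Sunday"]]
print(weekend.tolist())  # [322, 472]
```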

### Subsetting with logicals
Let’s jump right into an example, using our Zoo attendance Series:

In [None]:
attendance = pd.Series([132, 94, 112, 84, 254, 322, 472],
                       index=['Monday', 'Tuesday', 'Wednesday', 'Thursday',
                              'Friday', 'Saturday', 'Sunday'])

Suppose we want to get only the days with more than 100 people attending. We can subset our Series by using a simple test to build a Series of booleans (True and False values), then asking pandas for the rows of our Series for which the entry in our test Series is True:

In [None]:
was_busy = attendance > 100
was_busy

In [None]:
busy_days = attendance.loc[was_busy]
busy_days
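For reference, here is the same filter as one self-contained sketch, along with a combined test. The second filter (moderately_busy, with its upper cutoff of 300) is an illustrative addition, not part of the original example:

```python
import pandas as pd

attendance = pd.Series([132, 94, 112, 84, 254, 322, 472],
                       index=['Monday', 'Tuesday', 'Wednesday', 'Thursday',
                              'Friday', 'Saturday', 'Sunday'])

# The test produces a boolean Series with the same index as attendance.
was_busy = attendance > 100
print(was_busy.tolist())
# [True, False, True, False, True, True, True]

# Only rows where the test is True are kept.
busy_days = attendance.loc[was_busy]
print(busy_days.index.tolist())
# ['Monday', 'Wednesday', 'Friday', 'Saturday', 'Sunday']

# Tests can be combined with & (and) or | (or); each test needs its own
# parentheses because & and | bind more tightly than > and <.
moderately_busy = attendance.loc[(attendance > 100) & (attendance < 300)]
print(moderately_busy.index.tolist())
# ['Monday', 'Wednesday', 'Friday']
```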

There is one really important distinction between how subsetting works in pandas and in most other languages, though, and it has to do with indices. Suppose we want to subset a Series of fruits to get only the entry “apple”. We could do the following:

In [None]:
fruits = pd.Series(["apple", "banana"])
apple_selector = pd.Series([True, False])
fruits.loc[apple_selector]

This looks familiar from numpy, but:

A very important difference between pandas and other languages and libraries (like R and numpy) is that when a logical Series is passed into loc, evaluation is done not on the basis of the order of entries, but on the basis of index values. In the case above, because we did not specify indices for either fruits or apple_selector, they both got the usual default index values of their initial row numbers. But let’s see what happens if we change their indices so they don’t match their order:

In [None]:
fruits # We can leave fruits as they are

In [None]:
apple_selector = pd.Series([True, False], index=[1, 0])
apple_selector

Note that we’ve flipped the index order for apple_selector: the first row has index value 1, and the second row has index value 0. Now watch what happens when we pass apple_selector to loc:

In [None]:
fruits.loc[apple_selector]

We get banana! That’s because in apple_selector, the index value associated with the True entry is 1, and the row of fruits with index value 1 is banana, even though the two entries are in different row positions. This is called index alignment, and it is absolutely crucial to keep in mind while using pandas.
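Index alignment is not only a trap. The sketch below, which assumes the zoo attendance data from earlier, shows a case where it does what you would want: a test built on the Series in its original order still picks the right days after the Series has been re-sorted. The names sorted_attendance and busy_sorted are made up for this example:

```python
import pandas as pd

attendance = pd.Series([132, 94, 112, 84, 254, 322, 472],
                       index=['Monday', 'Tuesday', 'Wednesday', 'Thursday',
                              'Friday', 'Saturday', 'Sunday'])

# The test is built while rows are in Monday..Sunday order.
was_busy = attendance > 100

# The data is then re-ordered from busiest to quietest day.
sorted_attendance = attendance.sort_values(ascending=False)

# Rows are matched by label, not by position, so the right days are kept
# even though the two Series are in different orders.
busy_sorted = sorted_attendance.loc[was_busy]
print(busy_sorted.index.tolist())
# ['Sunday', 'Saturday', 'Friday', 'Monday', 'Wednesday']
```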

But note this only happens if your boolean array is a Series (and thus has an index). If you pass a numpy boolean array or a list of booleans (neither of which has any concept of an index), then despite using loc, selection will be based on row order, not index values (because there are no index values to align on).

In [None]:
fruits.loc[[True, False]]
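One practical consequence, sketched below: if you have a boolean Series but want it applied by row order rather than by label, you can strip its index first with .to_numpy(). The names aligned and positional are only for this example:

```python
import pandas as pd

fruits = pd.Series(["apple", "banana"])
apple_selector = pd.Series([True, False], index=[1, 0])

# As a Series, the selector is aligned on index values: True sits at label 1.
aligned = fruits.loc[apple_selector]
print(aligned.tolist())  # ['banana']

# As a plain numpy array, the selector has no labels, so row order is used.
positional = fruits.loc[apple_selector.to_numpy()]
print(positional.tolist())  # ['apple']
```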