In [None]:
%reload_ext postcell
%postcell register

# Pandas Series - an overview

The Pandas library provides a fantastic interface to tabular data, made up of rows and columns. Such data usually represents "business" datasets such as customer or product information. Generally, tables are organized so rows represent observations and columns represent features. Such "tabular" datasets are different from other datasets used in machine learning, such as images, videos, graphs, etc.

We will study three main objects within the Pandas library: Dataframes and two objects which make up dataframes, Series and Indexes.

![](images/dataframes.jpg)

*Hint* Review earlier lectures which provide quickstart introductions to Numpy and Pandas. These notes take a more systematic approach to describing Pandas.

In [None]:
import numpy as np
import pandas as pd

In [None]:
pd.__version__

# Series

Pandas series are similar to Python's built-in lists and numpy arrays. Here is an example:

In [None]:
#Python list
["Bart", "Homer", "Lisa", "Maggie", "Marge"]

In [None]:
pd.Series(["Bart", "Homer", "Lisa", "Maggie", "Marge"])

In [None]:
#Numpy array
np.array(["Bart", "Homer", "Lisa", "Maggie", "Marge"])

#### Series, like Numpy arrays, have types

In [None]:
type([1, 2, 3])

In [None]:
type([1.2, 2, 3])

In [None]:
type([1.2, 2, "Home"])

In [None]:
pd.Series([1, 2, 3])

In [None]:
pd.Series([1.2, 2, 3])

In [None]:
pd.Series([1.2, 2, "Homer"])

In [None]:
pd.Series(["Marge", "Lisa", "Homer"])

Notice that, unlike Python lists, Series objects have types associated with them. One reason Pandas and Numpy are so much faster than native Python is that all elements are expected to be the same type, so operations on them can be optimized for performance.

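**String type** Pandas 1.0 introduced a dedicated `string` data type for text, which can be requested explicitly (for example with `dtype="string"`). `object` remains the generic, catch-all type, and it is what a series of mixed values gets.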
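As a minimal sketch of the difference between the two types (the variable names `as_string` and `mixed` are only for illustration), the dedicated string type can be requested explicitly, while a series of mixed values ends up as `object`:

```python
import pandas as pd

names = ["Marge", "Lisa", "Homer"]

# Explicitly request the dedicated string dtype (available since Pandas 1.0)
as_string = pd.Series(names, dtype="string")
print(as_string.dtype)   # string

# A series mixing numbers and text falls back to the generic object dtype
mixed = pd.Series([1.2, 2, "Homer"])
print(mixed.dtype)       # object
```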

**Exercise** Create a Pandas series, containing numbers from zero to 10

In [None]:
%%postcell exercise_030_110_a

#type your answer here

#### More complex keys

Notice that Pandas series are nothing more than a wrapper around Numpy arrays (at least in Pandas 1.0):

In [None]:
ss = pd.Series(["Bart", "Homer", "Lisa", "Maggie", "Marge"])
ss

In [None]:
ss.values

In [None]:
ss.index

In [None]:
type(ss.values)

In [None]:
type(ss.index)

In [None]:
type(np.array(["Bart", "Homer", "Lisa", "Maggie", "Marge"]))

Recall that Numpy was created to be a library for numeric matrix manipulation. It makes sense to ask, "what is the value at index zero?" It makes no sense, in matrix math, to ask, "what is the value at index 'Homer'?"

Pandas, however, is designed to work with datasets which may have categories and texts. If a matrix contains ages of people, it is perfectly reasonable to ask, what is the 'age' value at index 'Homer.'

In [None]:
ss

In [None]:
ss[1]

In [None]:
pd.Series([12, 38, 10, 2, 36])


In [None]:
ss2 = pd.Series([12, 38, 10, 2, 36], index=["Bart", "Homer", "Lisa", "Maggie", "Marge"])
ss2

Notice that Series provide the ability to add a custom `index`:

In [None]:
ss2["Homer"]

**Exercise** Create a series of ages and names of characters as the index:
Ned 41, Daenerys 16, Tyrion 32, Jon 16

In [None]:
%%postcell exercise_030_110_b

#type your answer here

#### Series combine the properties of lists and dictionaries

In [None]:
pd.Series([41, 16, 32, 16])

In [None]:
pd.Series({'Ned': 41, 'Daenerys': 16, 'Tyrion': 32, 'Jon': 16})

Recall that Python's core lists are accessed by zero based integer values. Elements in a list are ordered, which means we can use the _slice_ notation to access multiple items:

In [None]:
["Bart", "Homer", "Lisa", "Maggie", "Marge"][2:4]

Dictionaries let us provide our own keys. However, they cannot be sliced:

In [None]:
{"Homer":38, "Marge":36, "Bart":12, "Lisa":10, "Maggie":6}["Bart":"Maggie"] #<= dictionaries don't understand slicing

Pandas Series combine properties of lists and dictionaries, along with the performance of Numpy:

In [None]:
ss2

In [None]:
ss2["Bart"] #as dictionary

In [None]:
ss2[0] # as list

In [None]:
ss2[2:4] # as list

In [None]:
ss2["Lisa":"Marge"] # as ??

**Caveat** Notice that when we sliced with the implicit (built-in) index values, `2:4`, the value at position 4 wasn't included in the result set (as expected). However, when we sliced with explicit index values (the index we provided), `"Lisa":"Marge"`, the last value _is_ included. Series provides a way around this confusion via `.loc` and `.iloc`.
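The two slicing behaviours can be made explicit with `.iloc` (positions) and `.loc` (labels). This is a small sketch that rebuilds the same ages series under the illustrative name `ages`:

```python
import pandas as pd

ages = pd.Series([12, 38, 10, 2, 36],
                 index=["Bart", "Homer", "Lisa", "Maggie", "Marge"])

# Position-based slice: the end position (4) is excluded, as with lists
by_position = ages.iloc[2:4]
print(list(by_position.index))   # ['Lisa', 'Maggie']

# Label-based slice: the end label ("Marge") is included
by_label = ages.loc["Lisa":"Marge"]
print(list(by_label.index))      # ['Lisa', 'Maggie', 'Marge']
```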

![](images/series.jpg)

In [None]:
ss2

In [None]:
ss2.values

In [None]:
ss2.index

### Caveat: `s.columnA` vs `s['columnA']`

So far, we have accessed elements of Series using the syntax `s['columnA']`. However, the syntax `s.columnA` is also allowed. For example:

In [None]:
ss2

In [None]:
ss2['Marge']

In [None]:
ss2.Marge

In [None]:
ss2.

In [None]:
ss2.shape

In [None]:
ss2.Mr. Burns #<= syntax error! labels with spaces or punctuation cannot be used with the dot syntax

The 'dot' syntax is very convenient, since it looks like calling a normal method on an object. It is slightly shorter to type and provides better IDE help. In a cell, type `ss2.` then press the [TAB] key. You will see a dropdown list of suggestions. Type "M" and you will see Marge's and Maggie's names pop up. You are using autocomplete to find elements of a series!

Also notice that you can call actual operations on a series:

In [None]:
ss2.sum()

If you type `series.XYZ`, is that referring to an element indexed by the key "XYZ", or is it referring to the function "XYZ()"? This confusion is the reason why, unless we are sure that there is no conflict, the safer option is to use the syntax `s['columnA']`.
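To see the conflict concretely, here is a small sketch with a hypothetical series (`stats`, not part of the lecture data) whose labels happen to match Series method names. The dot syntax finds the method, while the bracket syntax finds the element:

```python
import pandas as pd

# Hypothetical series whose labels collide with Series method names
stats = pd.Series([2, 38, 98], index=["min", "max", "sum"])

by_key = stats["sum"]    # the element stored under the label "sum"
by_dot = stats.sum       # the built-in method, NOT the element
total = stats.sum()      # calling the method adds up all the values

print(by_key)            # 98
print(callable(by_dot))  # True
print(total)             # 138
```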

In [None]:
ss2['Homer J']
ss2.Homer J #<= syntax error!

### Indexing Series

Similar to Numpy, values in series can be retrieved in several ways:
1. Implicit index (similar to lists)
2. Explicit index or label (similar to dictionaries)
3. Slicing
4. Boolean Masking
5. Fancy indexing

#### Implicit index

In [None]:
ss2

Given the series above, where the names of the Simpson family are the index and their ages are the values, we can get Marge's age, which is in row 5, by directly requesting the value at position 4 (remember that Python is zero based):

In [None]:
ss2[4]

**Exercise** Given the series `ss2`, get the second element

In [None]:
%%postcell exercise_030_110_c

#type your answer here

#### Explicit index or labels

Much like dictionaries, the key associated with Marge's name will return her age:

In [None]:
ss2

In [None]:
ss2["Marge"]

**Caveat** Notice that the same syntax, `series[index]`, is used to access data both implicitly and explicitly. If `index` is an integer, Pandas assumes it is an implicit index; if it is not an integer, Pandas assumes it is an explicit index. What if our explicit index is also an integer? See `.loc` and `.iloc`, later in the lecture.

**Exercise** Given the series `ss2`, get the element corresponding to "Maggie"

In [None]:
%%postcell exercise_030_110_d

#type your answer here

#### Slicing

Similar to Numpy arrays and Python lists, Series can be sliced. Note that keys are sliced in terms of their location in the series, **not alphabetically**

In [None]:
ss2

In [None]:
ss2[2:4]

In [None]:
ss2["Lisa":"Marge"]

Notice that slicing with explicit values includes the last item, while slicing with implicit indexes does not.

Remember that negative indexes can be used, just like normal Python lists. Below we select items which start at "second to last" and end at the last item:

In [None]:
ss2

In [None]:
ss2[-2:]

**Exercise** Given the series `ss2`, get the second, third and fourth elements (use slicing)

In [None]:
%%postcell exercise_030_110_e

#type your answer here

#### Boolean (Masking)

Providing a `True` or `False` value for each element filters out the elements that correspond to `False`.

In [None]:
ss2

In [None]:
mask = [True, False, True, True, True]
ss2[mask]

In [None]:
ss2[[True, False, True, True, True]]

Btw, this does not work with Python lists

In [None]:
['B', 'H', 'L', 'M', 'M'][mask]

Filter on the index

In [None]:
ss2[ss2 != "Homer"] #<= compares the values (ages), not the index, so nothing is filtered out

In [None]:
ss2.index != "Homer"

In [None]:
ss2[ss2.index != "Homer"]

Filter on the values

In [None]:
ss2.values <30

In [None]:
ss2[ss2.values < 30]

In [None]:
ss2 < 30

Notice that you don't have to call `.values`; comparing the series directly gives cleaner code:

In [None]:
ss2[ss2 < 30]
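Masks on the values and on the index can also be combined. The sketch below rebuilds the ages series under the illustrative name `ages` and uses the element-wise `&` operator; the parentheses matter because `&` binds more tightly than the comparisons:

```python
import pandas as pd

ages = pd.Series([12, 38, 10, 2, 36],
                 index=["Bart", "Homer", "Lisa", "Maggie", "Marge"])

# Each comparison produces one boolean per element;
# & keeps only the elements where both conditions are True
young_not_bart = ages[(ages < 30) & (ages.index != "Bart")]
print(young_not_bart)
```

Only Lisa and Maggie satisfy both conditions, so they are the only rows returned.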

**Exercise** Find everyone older than 2 years

In [None]:
%%postcell exercise_030_110_f

#type your answer here

**Exercise** Find everyone, except "Maggie"

In [None]:
%%postcell exercise_030_110_g

#type your answer here

#### Fancy indexing

If you know the specific list of items, you can ask the Series to return them directly:

In [None]:
ss2

In [None]:
ss2[[1,3,4]]

Notice that the Series returns the elements in the order in which they are requested:

In [None]:
ss2[[4,3,1]]

In [None]:
ss2[[4,3,2, 2, 2, 2]]

You can also use the explicit index, instead of the implicit index

In [None]:
ss2[["Marge", "Marge", "Lisa"]]

**Exercise** Given the series `ss2`, get the second, third and fifth elements (using fancy indexing)

In [None]:
%%postcell exercise_030_110_h

#type your answer here

### `.loc` and `.iloc` or confusion when explicit indexes are integers

Notice that the series we have been working with has strings as the explicit index and integers as the values:

In [None]:
pd.Series([12, 38, 10, 2, 36], index=["Bart", "Homer", "Lisa", "Maggie", "Marge"])

What if we flip that around, so that the ages become the index and the names become the values?

In [None]:
ss3 = pd.Series(["Bart", "Homer", "Lisa", "Maggie", "Marge"], index=[12, 38, 10, 2, 36])
ss3

**Caveat** If I ask for `ss3[2]`, am I asking for the element at location 2 or for the character with age 2?

In [None]:
ss3[2]

In order to avoid this confusion, Pandas provides methods `.loc`, which expects explicit index values (aka labels) and `.iloc`, which expects implicit index values:

In [None]:
ss3.loc[2]

In [None]:
ss3.iloc[2]

#### Special indexer: `.at[]`

1. Implicit index (similar to lists)
2. Explicit index (similar to dictionaries)
3. Slicing
4. Masking
5. Fancy indexing

Notice that we have several ways of accessing elements: index, labels, slicing, masking and fancy indexing. The same syntax `s[XXX]`, `s.loc[XXX]` or `s.iloc[XXX]` can be used with any of the methods above. There are times when you want exactly a single, scalar value to be returned. In such cases, you use the syntax `s.at[XXX]`:

In [None]:
ss3.at[2]

In [None]:
ss2.at['Maggie']

In [None]:
ss3.at[1:2] # <= slicing is prohibited, since it would return multiple values
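The difference in what comes back can be sketched as follows (the series is rebuilt here under the illustrative name `ages`): `.at` hands back the bare value, while `.loc` with a list of labels hands back a Series, even when the list has only one label.

```python
import pandas as pd

ages = pd.Series([12, 38, 10, 2, 36],
                 index=["Bart", "Homer", "Lisa", "Maggie", "Marge"])

scalar = ages.at["Maggie"]       # a single value
one_row = ages.loc[["Maggie"]]   # a Series containing one element

print(scalar)                    # 2
print(len(one_row))              # 1
print(isinstance(one_row, pd.Series))  # True
```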

### Indexing summary

In [None]:
ss2

```python
#implicit (like lists)
ss2[integer] # get value at location number

#explicit labels (like dictionaries)
ss2[key]     # get value at index key 

#slicing
ss2[integer_start:integer_end] # get values between locations integer_start and integer_end (integer_end not included)
ss2[key_start:key_end]         # get values between keys key_start and key_end (key_end INCLUDED, by locations of keys, not alphabetical)

#fancy indexing
ss2[[integer1, integer3]]  # get values at locations integer1 and integer3
ss2[[key1, key3]] # get values at keys key1 and key3

ss2.loc[...]  # always operate in terms of keys

ss2.iloc[...] # always operate in terms of location

ss2.at[...]   # always return a single value
```

### Creating Series

We have already seen series being created with a list and an index: `pd.Series(data, index=None)`.

Another common way of creating a series is via a dictionary:

In [None]:
pd.Series({"Homer":38, "Marge":36})

Create it from a list

In [None]:
pd.Series(['Homer', 'Marge', "Maggie"])

There are less common methods of creating series, such as:

In [None]:
pd.Series(44, index=["Homer", "Marge", "Maggie"])

You can always combine what you learned for lists with series:
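For instance, a list comprehension can build the values of a series. This is only one possible illustration; the names `names` and `name_lengths` are made up for the example:

```python
import pandas as pd

names = ["Homer", "Marge", "Maggie"]

# The list comprehension computes the values; the original list is the index
name_lengths = pd.Series([len(name) for name in names], index=names)
print(name_lengths)
```

Each name is paired with its length: Homer 5, Marge 5, Maggie 6.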

### Converting to other formats

Series can be converted to a dictionary very easily:

In [None]:
ss2

In [None]:
ss2.to_dict()

Other, similar methods:
1. ss2.to_excel
2. ss2.to_frame (convert to dataframe)
3. ss2.to_json
4. ss2.to_sql

And many others

You often need to convert a series to a DataFrame:

In [None]:
ss2.to_frame()

In [None]:
pd.DataFrame(ss2)

In [None]:
pd.DataFrame(ss2).shape

One of the most useful patterns is to convert a series and its index to a dataframe with two columns:

In [None]:
ss2

In [None]:
ss2.reset_index(name="age")

In [None]:
ss2.reset_index(name="age").rename(columns={"index":"name"})
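An alternative sketch of the same pattern names the index first with `rename_axis`, so that `reset_index` produces the desired column name without a separate `rename` step (the series is rebuilt here as `ages` for illustration):

```python
import pandas as pd

ages = pd.Series([12, 38, 10, 2, 36],
                 index=["Bart", "Homer", "Lisa", "Maggie", "Marge"])

# rename_axis names the index; reset_index then uses that name for the new column
people = ages.rename_axis("name").reset_index(name="age")
print(people.columns.tolist())   # ['name', 'age']
print(people.shape)              # (5, 2)
```

Which of the two forms to use is mostly a matter of taste; both yield a two-column dataframe.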

### Operating on series

Much like Numpy arrays, several operations can be done on a series:

In [None]:
ss2

In [None]:
ss2.sum(), ss2.count(), ss2.min(), ss2.max(), ss2.mean()

In [None]:
ss2.isin([10, 13, 12, 46, 38])

In [None]:
ss2.isin(["Homer", "Mr. Burns", "Barney", "Maggie", "Dr. Hibert"])

In [None]:
ss2.index.isin(["Homer", "Mr. Burns", "Barney", "Maggie", "Dr. Hibert"])

In [None]:
ss2.index

**Exercise** Explain the previous two cells

In [None]:
ss2

In [None]:
ss2.sample(2)

In [None]:
ss2.sample(10, replace=True)

In [None]:
ss2.nlargest(2)

In [None]:
ss2.astype('float')

In [None]:
ss2.sort_values()

In [None]:
ss2.sort_index()

In [None]:
ss2

**Exercise** Add two years to everyone's age (recall the Numpy lecture) in series `ss2`

In [None]:
%%postcell exercise_030_110_i

#type your answer here

### Combining series and intelligently handling missing data

The ability to intelligently combine series is one of the most powerful features of Pandas. Note the series we have already been working on, which contains Simpson family ages:

In [None]:
ss2

In [None]:
weights = pd.Series([240, 85], index=["Homer", "Bart"])
weights

Notice that we don't have everyone's weight. What happens when we combine these two series:

In [None]:
pd.DataFrame({'ages':ss2, 'weights':weights})

Series are designed for single dimensional data. DataFrames, which we will study soon, contain multiple series. Notice that the two series have been intelligently combined!

Let's add two series. Notice that we can add them as if they were scalars or numpy arrays. Further notice that Pandas intelligently inserts `NaN` in appropriate places

In [None]:
ss2 + weights

In [None]:
ss2.add(weights, fill_value=0)

**Intelligent handling of missing values is one of the reasons to use Pandas**

Note that aggregate functions _know_ to ignore missing values when adding series:

In [None]:
(ss2 + weights).sum()

In [None]:
97+278
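The skipping of missing values can be switched off to see what the aggregate would otherwise look like. This is a small sketch using the `skipna` parameter of `sum`, with the two series rebuilt under the illustrative names `ages` and `weights`:

```python
import math
import pandas as pd

ages = pd.Series([12, 38, 10, 2, 36],
                 index=["Bart", "Homer", "Lisa", "Maggie", "Marge"])
weights = pd.Series([240, 85], index=["Homer", "Bart"])

combined = ages + weights            # NaN wherever a weight is missing

print(combined.sum())                # 375.0, missing values are skipped
print(combined.sum(skipna=False))    # nan, a single NaN poisons the total
print(combined.isna().sum())         # 3 missing entries (Lisa, Maggie, Marge)
```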