In this lecture we are going to explore the pandas Series structure. By the end of this lecture you should be familiar with how to store and manipulate single dimensional indexed data in the Series object.

The Series is one of the core data structures in pandas. You can think of it as a cross between a list and a dictionary. The items are all stored in order, and there are labels with which you can retrieve them. An easy way to visualize this is as two columns of data. The first is the special index, a lot like the keys in a dictionary, while the second is your actual data. It's important to note that the data column has a label of its own, which can be retrieved using the .name attribute. This is different from dictionaries and is useful when it comes to merging multiple columns of data, which we'll talk about later on in the course.
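
As a rough illustration of the two-column picture described above, the sketch below builds a small Series with an explicit index and a name. The student data and the name `course` are made up for this example and are not part of the lecture's dataset.

```python
import pandas as pd

# Hypothetical example data: three students and the course each one takes
courses = pd.Series(['Physics', 'Chemistry', 'English'],
                    index=['Alice', 'Jack', 'Molly'],
                    name='course')

# the index plays the role of dictionary keys
print(courses.index.tolist())

# the data column carries its own label, available through .name
print(courses.name)

# items can be retrieved by label, like a dictionary...
print(courses['Jack'])

# ...or by position, like a list
print(courses.iloc[1])
```

Both lookups return the same item here, which is the sense in which a Series behaves like a list and a dictionary at once.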

In [1]:
# let's import pandas to get started
import pandas as pd

In [None]:
# as you might expect, you can create a series by passing in a list of values.
# when you do this, pandas automatically assigns an index starting with zero and
# sets the name of the series to None. Let's work through an example of this.

# One of the easiest ways to create a series is to use an array-like object, like
# a list.

# Here we'll make a list of the three students, Alice, Jack, and Molly, all as strings
students = ['Alice', 'Jack', 'Molly']

# Now we just call the Series function in pandas and pass in the students
pd.Series(students)

0    Alice
1     Jack
2    Molly
dtype: object

In [None]:
# we don't have to use strings. If we pass in a list of whole numbers, for instance,
# we can see that pandas sets the type to int64. Underneath, pandas stores series values in a
# typed array using the NumPy library. This offers a significant speedup when processing data
# versus traditional python lists.

# let's create a little list of numbers
numbers = [1, 2, 3]
# and turn that into a series
pd.Series(numbers)

0    1
1    2
2    3
dtype: int64

In [None]:
# we can see that the resulting series has a dtype of int64
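
To make the point about typed storage a little more concrete, here is a minimal sketch that pulls the underlying NumPy array out of a Series with `to_numpy()` and applies an operation to every element at once. It only illustrates the idea and does not measure the speedup itself.

```python
import numpy as np
import pandas as pd

numbers = pd.Series([1, 2, 3])

# the values live in a typed NumPy array rather than a python list
values = numbers.to_numpy()
print(type(values))
print(numbers.dtype)

# because the array is typed, arithmetic is applied to all elements at once
doubled = numbers * 2
print(doubled.tolist())
```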

In [4]:
# there are some other typing details that exist for performance and are important to know.
# the most important is how numpy, and thus pandas, handles missing data.

# in python, we have the None type to indicate a lack of data. But what do we do if we want
# to have a typed list like we do in the series object?

# underneath, pandas does some type conversion. if we create a list of strings and one
# element is None, pandas inserts it as a None and uses the type object for the
# underlying array.

# let's recreate our list of students, but leave the last one as a None
students = ['Alice', 'Jack', None]
# and let's convert this to a series
pd.Series(students)

0    Alice
1     Jack
2     None
dtype: object

In [10]:
# however, if we create a list of numbers, integers or floats, and put in the None type,
# pandas automatically converts this to a special floating point value designated as NaN
# which stands for "Not a Number"

# so let's create a list with a None value in it
numbers = [1, 2, None]
# and turn that into a series
pd.Series(numbers)

0    1.0
1    2.0
2    NaN
dtype: float64

In [11]:
# you'll notice a couple of things. first, NaN is a different value from None. second, pandas
# set the dtype of this series to floating point numbers instead of object or ints. That's
# maybe a bit of a surprise - why not just leave this as an integer? underneath, pandas
# represents NaN as a floating point number, and because integers can be typecast to
# floats, pandas went and converted our integers to floats. So when you are wondering why
# the list of integers you put into a Series is not int, it's probably because there is
# some missing data.
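
One way to check for this situation is to compare dtypes directly and count the missing entries. The sketch below uses two made-up lists and the `isna()` method on Series; the variable names are only for illustration.

```python
import pandas as pd

complete = pd.Series([1, 2, 3])
with_missing = pd.Series([1, 2, None])

# the complete list stays an integer type
print(complete.dtype)

# a single None is enough to turn the whole series into floats
print(with_missing.dtype)

# counting the missing values explains where the float dtype came from
missing_count = int(with_missing.isna().sum())
print(missing_count)
```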

In [2]:
# None and NaN might be used by the data scientist in the same way, to denote
# missing data, but underneath these are not represented by pandas in the same way.

# NaN is NOT equivalent to None and when we try the equality test, the result is False.

# let's bring in numpy, which allows us to generate a NaN value
import numpy as np
# and let's compare it to None
np.nan == None

False

In [3]:
# it turns out that you actually can't do an equality test of NaN against itself. when you do,
# the answer is always False.
np.nan == np.nan

False

In [4]:
# instead, you need to use special functions to test for the presence of not a number,
# such as the isnan() function in the numpy library
np.isnan(np.nan)

True

In [5]:
# so keep in mind that when you see NaN, its meaning is similar to None, but it's a
# numeric value and is treated differently for efficiency reasons.
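
Since None and NaN are both used to mark missing data, it can help to have one test that treats them alike. The sketch below uses `pd.isna()` and the `isna()` method on Series, which are not shown in the lecture so far; it is offered as one possible approach rather than the only one.

```python
import numpy as np
import pandas as pd

# pd.isna() reports both None and NaN as missing
none_is_missing = pd.isna(None)
nan_is_missing = pd.isna(np.nan)
print(none_is_missing, nan_is_missing)

# the same test works element by element on a series,
# whether the missing entry sits among strings or among numbers
names = pd.Series(['Alice', 'Jack', None])
scores = pd.Series([1, 2, None])
names_missing = names.isna().tolist()
scores_missing = scores.isna().tolist()
print(names_missing)
print(scores_missing)
```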

In [6]:
# let's talk more about how pandas Series can be created. while a list might be a common
# way to create some play data, often you have labeled data that you want to manipulate.
# a series can be created directly from dictionary data. if you do this, the index is
# automatically assigned to the keys of the dictionary that you provided, and not just
# incrementing integers.

# here's an example using some data of students and their classes.
students_class = {
    'Alice': 'Physics',
    'Jack': 'Chemistry',
    'Molly': 'English'
}
s = pd.Series(students_class)
s

Alice      Physics
Jack     Chemistry
Molly      English
dtype: object

In [7]:
# we see that, since it was string data, pandas set the data type of the series to "object".
# we also see that the index, the first column, is a list of strings.

In [8]:
# once the series has been created, we can get the index object using the index attribute.
s.index

Index(['Alice', 'Jack', 'Molly'], dtype='object')

In [9]:
# as you play more with pandas you will notice that a lot of things are implemented as
# numpy arrays, and have the dtype value set. this is true of indices, and here pandas
# inferred that we were using objects for the index.

In [10]:
# now this is kind of interesting. the dtype of object is not just for strings, but for
# arbitrary objects. let's create a more complex type of data, say, a list of tuples.
students = [("Alice", "Brown"), ("Jack", "White"), ("Molly", "Green")]
pd.Series(students)

0    (Alice, Brown)
1     (Jack, White)
2    (Molly, Green)
dtype: object

In [11]:
# we see that each of the tuples is stored in the series object, and the type is object.

In [12]:
# you can also separate your index creation from the data by passing in the index as a
# list explicitly to the series.

s = pd.Series(['Physics', 'Chemistry', 'English'], index=['Alice', 'Jack', 'Molly'])
s

Alice      Physics
Jack     Chemistry
Molly      English
dtype: object

In [13]:
s.index

Index(['Alice', 'Jack', 'Molly'], dtype='object')

In [14]:
# so what happens if the list of values in your index is not aligned with the keys
# in the dictionary you use to create the series? well, pandas overrides the automatic creation
# to favor only and all of the index values that you provided. so it will ignore all keys in
# your dictionary which are not in your index, and pandas will add None or NaN values
# for any index value you provide which is not among your dictionary's keys.

# Here's an example. I'll pass in a dictionary of three items, in this case students and
# their courses
students_class = {'Alice': 'Physics',
                  'Jack': 'Chemistry',
                  'Molly': 'English'}
# when i create the series object though, i'll only ask for an index with three students,
# and i will exclude Jack
pd.Series(students_class, index=['Alice', 'Molly', 'Sam'])

Alice    Physics
Molly    English
Sam          NaN
dtype: object

In [15]:
# the result is that the series object doesn't have Jack in it, even though he was in our
# original dataset, but it explicitly does have Sam in it as a missing value.
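
The alignment behaviour described above can be confirmed programmatically. The following sketch rebuilds the same dictionary and index, then checks which labels survived and which entries are missing; the variable names `requested` and `missing_flags` are introduced here only for illustration.

```python
import pandas as pd

students_class = {'Alice': 'Physics',
                  'Jack': 'Chemistry',
                  'Molly': 'English'}

# ask only for these three labels; Sam is not a key in the dictionary
requested = ['Alice', 'Molly', 'Sam']
s = pd.Series(students_class, index=requested)

# Jack was dropped because he is not in the requested index
print('Jack' in s.index)

# Sam is present, but flagged as a missing value
missing_flags = s.isna().tolist()
print(missing_flags)
```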