In this lecture, we will talk about one of the primary data types of the Pandas library, the Series. You will learn about the structure of the Series, how to query and merge Series objects together, and the importance of thinking about parallelization when engaging in data science programming.

In [2]:
# a pandas Series can be queried either by the index position or the index label. if you don't give an
# index to the series when creating it, the position and the label are effectively the same values.
# to query by numeric location, starting at zero, use the iloc attribute. to query by the index label,
# you can use the loc attribute.

# let's start with an example. we'll use students enrolled in classes from a dictionary
import pandas as pd
students_classes = {
    'Alice': 'Physics',
    'Jack': 'Chemistry',
    'Molly': 'English',
    'Sam': 'History'
}
s = pd.Series(students_classes)
s

Alice      Physics
Jack     Chemistry
Molly      English
Sam        History
dtype: object
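
As a small sketch of the point about positions and labels (the names no_index and with_index are illustrative and not part of the lecture): a Series built from a plain list gets a default integer index, so position and label coincide, while a Series built from a dictionary uses the keys as labels.

```python
import pandas as pd

# A Series built from a list gets a default integer index (0, 1, 2, ...),
# so the label of each entry equals its position.
no_index = pd.Series(['Physics', 'Chemistry', 'English'])
print(list(no_index.index))                  # [0, 1, 2]
print(no_index.iloc[1] == no_index.loc[1])   # True

# A Series built from a dictionary uses the keys as index labels.
with_index = pd.Series({'Alice': 'Physics', 'Jack': 'Chemistry'})
print(list(with_index.index))                # ['Alice', 'Jack']
```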

In [2]:
# so, for this series, if you wanted to see the fourth entry you would use the iloc
# attribute with the parameter 3 (because positions start from 0)
s.iloc[3]

'History'

In [3]:
# if you wanted to see what class Molly has, you would use the loc attribute with a parameter
# of 'Molly'
s.loc['Molly']

'English'

In [4]:
# keep in mind that iloc and loc are not methods, they are attributes. so you don't use
# parentheses to query them, but square brackets instead, which is called the indexing operator.
# in Python this calls __getitem__ or __setitem__ for an item depending on the context of its use.
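
To make the get/set distinction concrete, here is a minimal sketch that reuses the students example. Calling __getitem__ directly is done only for illustration; in normal code you would always write the square brackets.

```python
import pandas as pd

s = pd.Series({'Alice': 'Physics', 'Jack': 'Chemistry',
               'Molly': 'English', 'Sam': 'History'})

# Reading with square brackets calls __getitem__ on the loc/iloc object.
by_label = s.loc['Molly']
same_thing = s.loc.__getitem__('Molly')
print(by_label, same_thing)   # English English

# Assigning with square brackets calls __setitem__ instead.
s.iloc[3] = 'Geography'
print(s.loc['Sam'])           # Geography
```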


In [5]:
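# Pandas tries to make our code a bit more readable and provides a sort of smart syntax using
# the indexing operator directly on the series itself. for instance, if you pass in an integer parameter
# and the index labels are not integers, the operator will behave as if you want it to query via the
# iloc attribute. note that recent versions of pandas deprecate this positional fallback, so
# s.iloc[3] is the more future-proof spelling.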
s[3]

'History'

In [7]:
# if you pass in a label, such as a string, it will query as if you wanted to use the label-based loc attribute.
s['Molly']

'English'

In [8]:
# so what happens if your index is a list of integers? this is a bit complicated, and Pandas can't
# determine automatically whether you're intending to query by index position or index label. so
# you need to be careful when using the indexing operator on the Series itself. the safer option
# is to be more explicit and use the iloc or loc attributes directly.

# here's an example using classes and their class code information, where classes are indexed by
# class codes, in the form of integers
class_code = {
    99: 'Physics',
    100: 'Chemistry',
    101: 'English',
    102: 'History'
}
s = pd.Series(class_code)
s

99       Physics
100    Chemistry
101      English
102      History
dtype: object

In [9]:
# if we try to call s[0] we get a KeyError because there's no item in the series with
# an index label of zero. instead we have to call iloc explicitly if we want the first item.
s[0]

KeyError: ignored

In [10]:
s.iloc[0]

'Physics'

In [11]:
s.loc[99]

'Physics'
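
The integer-label case can be sketched in one self-contained snippet. The names codes and lookup_failed are illustrative; the try/except is only there so the snippet runs to completion instead of stopping at the error.

```python
import pandas as pd

codes = pd.Series({99: 'Physics', 100: 'Chemistry',
                   101: 'English', 102: 'History'})

# With integer labels, the bare indexing operator looks up the label,
# so 0 is not found even though a first entry exists.
try:
    codes[0]
    lookup_failed = False
except KeyError:
    lookup_failed = True
print(lookup_failed)      # True

# Being explicit removes the ambiguity.
print(codes.iloc[0])      # Physics   (first position)
print(codes.loc[100])     # Chemistry (label 100)
```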

In [4]:
# now that we know how to get data out of the series, let's talk about working with the data.
# a common task is to want to consider all of the values inside of a series and do some
# sort of operation. this could be trying to find a certain number, or summarizing data or
# transforming the data in some way.

In [1]:
# Pandas and the underlying numpy library support a method of computation called vectorization.
# vectorization works with most of the functions in the numpy library, including the sum function.

In [3]:
import numpy as np

grades = pd.Series([90, 80, 70, 60])

# we can just call np.sum and pass in an iterable item. in this case, our pandas series.
total = np.sum(grades)
total/len(grades)

75.0
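
As a hedged aside, the same average can be obtained in a few equivalent vectorized ways, and ordinary arithmetic on a Series is vectorized too. The names manual, via_numpy, via_pandas, and curved are illustrative, and the five-point adjustment is an invented example.

```python
import numpy as np
import pandas as pd

grades = pd.Series([90, 80, 70, 60])

# Three ways to compute the same average without writing a loop.
manual = np.sum(grades) / len(grades)
via_numpy = np.mean(grades)
via_pandas = grades.mean()
print(manual, via_numpy, via_pandas)   # 75.0 75.0 75.0

# Vectorized arithmetic applies an operation to every element at once.
curved = grades + 5
print(curved.tolist())                 # [95, 85, 75, 65]
```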

In [5]:
# is vectorization actually faster than the traditional way? the jupyter notebook has a magic
# function which can help.

# first, let's create a big series of random numbers. this is used a lot when demonstrating
# techniques with Pandas
numbers = pd.Series(np.random.randint(0, 1000, 100000))

# now let's look at the top five items in that series to make sure they actually seem random.
# we can do this with the head() method
numbers.head()

0    574
1    489
2    488
3    578
4     88
dtype: int64

In [6]:
# we can actually verify that the length of the series is correct using the len function
len(numbers)

100000

In [7]:
# ok, we're confident now that we have a big series. the ipython interpreter has something called
# magic functions, which begin with a percentage sign. if you type this sign and then hit the Tab key,
# you can see a list of the available magic functions. you could write your own magic functions
# too, but that's a little bit outside of the scope of this course.

In [8]:
# here, we're actually going to use what's called a cell magic function. these start with
# two percentage signs and wrap the code in the current jupyter cell. the function we're going
# to use is called timeit. this function will run our code a few times to determine, on average,
# how long it takes.

# let's run timeit with the traditional way to find the average. you can give timeit the number of
# loops that you would like to run. by default, timeit picks a suitable number of loops
# automatically. i'll ask timeit here to use 100 loops.
# note that in order to use a cell magic function, it has to be the first line in the cell.
%%timeit -n 100
total = 0
for number in numbers:
    total += number
total/len(numbers)

100 loops, best of 5: 12.7 ms per loop


In [9]:
# not bad. timeit ran the code and it doesn't seem to take very long at all. now let's try
# with vectorization

In [10]:
%%timeit -n 100
total = np.sum(numbers)
total/len(numbers)

100 loops, best of 5: 157 µs per loop
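
Outside of a notebook, a similar comparison can be sketched with the standard library timeit module. This is an assumption-laden illustration rather than the lecture's method: the helper names loop_mean and vectorized_mean are made up here, and the timings will vary by machine, so no expected numbers are shown.

```python
import timeit

import numpy as np
import pandas as pd

numbers = pd.Series(np.random.randint(0, 1000, 100000))

def loop_mean(series):
    # Traditional approach: add the values one at a time in Python.
    total = 0
    for number in series:
        total += number
    return total / len(series)

def vectorized_mean(series):
    # Vectorized approach: let numpy sum the whole series in one call.
    return np.sum(series) / len(series)

# timeit.timeit returns the total seconds taken for `number` runs.
runs = 10
loop_seconds = timeit.timeit(lambda: loop_mean(numbers), number=runs)
vector_seconds = timeit.timeit(lambda: vectorized_mean(numbers), number=runs)

print(f"loop:       {loop_seconds / runs * 1000:.3f} ms per run")
print(f"vectorized: {vector_seconds / runs * 1000:.3f} ms per run")
```

Both helpers return the same average; only the time taken to compute it differs.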


In [11]:
# wow, this is a pretty shocking difference in speed, and it demonstrates why one should be
# aware of parallel computing features and start thinking in functional programming terms.
# put more simply, vectorization lets a single operation be applied to many values at once
# instead of looping over them one by one in Python, and with high performance chips, especially
# graphics cards, you can get dramatic speedups. modern graphics cards can run thousands of
# instructions in parallel.