### Querying Series

In [39]:
# A pandas Series can be queried either by the index position or the index label.
# To query by numeric location, starting at zero, use the iloc attribute. 
# To query by the index label, you can use the loc attribute.

import pandas as pd
students_classes = {'Alice': 'Physics',
                   'Jack': 'Chemistry',
                   'Molly': 'English',
                   'Sam': 'History'}
s = pd.Series(students_classes)
s

Alice      Physics
Jack     Chemistry
Molly      English
Sam        History
dtype: object

In [40]:
# To see the fourth entry in the Series, we use the iloc attribute
# with position 3, since positions start at zero
s.iloc[3]

'History'

In [41]:
# To query by index label, we use the loc attribute
s.loc['Alice']

'Physics'

In [42]:
# Keep in mind that iloc and loc are not methods, they are attributes. So you don't query them 
# with parentheses, but with square brackets instead, which is called the indexing operator. 
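
The indexing operator on loc and iloc accepts more than a single key. The sketch below rebuilds the same student Series and shows slices and lists; the variable names `by_label`, `by_position`, and `picked` are illustrative only.

```python
import pandas as pd

s = pd.Series({'Alice': 'Physics',
               'Jack': 'Chemistry',
               'Molly': 'English',
               'Sam': 'History'})

# Label-based slice: both endpoints are included
by_label = s.loc['Alice':'Molly']

# Position-based slice: the end position is excluded, as with Python lists
by_position = s.iloc[0:2]

# A list of labels selects several entries at once
picked = s.loc[['Jack', 'Sam']]

print(by_label)
print(by_position)
print(picked)
```

Note the asymmetry: the loc slice returns three entries, while the iloc slice returns two.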

In [43]:
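# We can also use the indexing operator directly on the Series itself. With a non-integer
# index, an integer key is treated as a position, like iloc. Recent pandas versions deprecate
# this positional fallback, so s.iloc[3] is the more explicit choice.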
s[3]

'History'

In [44]:
# If we pass in an index label, it behaves like the loc attribute
s['Alice']

'Physics'

In [45]:
# So what happens if your index labels are integers? This is a bit complicated, because pandas 
# can't determine automatically whether you intend to query by index position or by index label. 
# So you need to be careful when using the indexing operator on the Series itself. The safer 
# option is to be more explicit and use the iloc or loc attributes directly.

class_code = {90: 'Physics',
             100: 'Chemistry',
             101: 'English',
             102: 'History'}
s = pd.Series(class_code)

In [46]:
# If we try to call s[0] we get a KeyError, because there's no item in the Series with 
# an index label of zero. Instead, we have to call iloc explicitly if we want the first item.

# This will result in an error
s[0]

KeyError: 0

In [47]:
# Using the iloc attribute makes the positional lookup explicit
s.iloc[0]

'Physics'
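
As a minimal sketch of the explicit style with an integer index, the cell below uses iloc for positions and loc for labels, and guards against a missing label. The names `class_codes`, `has_zero_label`, and `lookup_failed` are made up for this example.

```python
import pandas as pd

class_codes = pd.Series({90: 'Physics',
                         100: 'Chemistry',
                         101: 'English',
                         102: 'History'})

first_by_position = class_codes.iloc[0]  # position 0
physics_by_label = class_codes.loc[90]   # label 90

# A label lookup for a missing key raises KeyError, so either check
# membership in the index first or catch the exception.
has_zero_label = 0 in class_codes.index
try:
    class_codes.loc[0]
    lookup_failed = False
except KeyError:
    lookup_failed = True

print(first_by_position, physics_by_label, has_zero_label, lookup_failed)
```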

In [48]:
# Iterate over a series in pandas
# To find average score
grades = pd.Series([60, 70, 80, 90])

total = 0
for grade in grades:
    total += grade
print(total/len(grades))

75.0


In [49]:
# This works, but it's slow. Modern processors can perform many operations simultaneously, 
# particularly (but not only) mathematical ones.

# Pandas and the underlying numpy libraries support a method of computation called vectorization. 
# Vectorization works with most of the functions in the numpy library, including the sum function.

In [50]:
# Here's how we would really write the code, using the numpy sum function. First we need to 
# import the numpy module

import numpy as np

# Then we just call np.sum and pass in an iterable item. In this case, our pandas Series.

total = np.sum(grades)
print(total/len(grades))

75.0
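
A Series also exposes vectorized methods of its own, so the same average can be written without calling NumPy directly. This is a small sketch comparing three equivalent spellings; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

grades = pd.Series([60, 70, 80, 90])

mean_via_numpy = np.sum(grades) / len(grades)  # NumPy function on the Series
mean_via_sum = grades.sum() / len(grades)      # Series.sum method
mean_via_method = grades.mean()                # Series.mean does it in one step

print(mean_via_numpy, mean_via_sum, mean_via_method)
```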


In [51]:
# Create a Series of 10,000 random integers
numbers = pd.Series(np.random.randint(1, 1000, 10000))
numbers.head()

0    904
1    968
2    584
3    461
4    136
dtype: int32

In [52]:
# Verify the length of the Series
len(numbers)

10000

In [53]:
# The IPython interpreter has something called magic functions, which begin with a percentage 
# sign. If you type this sign and then hit the Tab key, you can see a list of the available 
# magic functions. You could write your own magic functions too.

In [54]:
# Here, we're going to use what's called a cell magic function. These start with two 
# percentage signs and apply to all of the code in the current Jupyter cell. The function we're 
# going to use is called timeit. It runs our code several times to determine, on average, how 
# long it takes.

# Let's run timeit with our original iterative code. You can give timeit the number of loops 
# you would like to run with the -n flag; if you omit it, timeit picks a suitable number 
# automatically. Here we ask for 100 loops to keep the run short. Note that a cell magic 
# function has to be on the first line of the cell.

In [55]:
%%timeit -n 100
total = 0
for number in numbers:
    total += number
total/len(numbers)

3.69 ms ± 650 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [56]:
# Now trying it with vectorization.

In [57]:
%%timeit -n 100
total = np.sum(numbers)
total/len(numbers)

129 µs ± 55.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
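
Outside IPython the %%timeit magic isn't available, but a similar comparison can be sketched with Python's standard timeit module. The helper names `loop_mean` and `vectorized_mean` are made up for this example, and the timings will vary from machine to machine.

```python
import timeit

import numpy as np
import pandas as pd

numbers = pd.Series(np.random.randint(1, 1000, 10000))

def loop_mean(series):
    """Average computed with an explicit Python loop."""
    total = 0
    for number in series:
        total += number
    return total / len(series)

def vectorized_mean(series):
    """Average computed with a vectorized NumPy sum."""
    return np.sum(series) / len(series)

# number=100 mirrors the -n 100 flag; timeit.timeit returns the total
# time in seconds for all 100 calls.
loop_seconds = timeit.timeit(lambda: loop_mean(numbers), number=100)
vector_seconds = timeit.timeit(lambda: vectorized_mean(numbers), number=100)

print(f"loop:       {loop_seconds / 100 * 1e6:.1f} us per call")
print(f"vectorized: {vector_seconds / 100 * 1e6:.1f} us per call")
```

Unlike %%timeit, this reports a single average per function rather than a mean and standard deviation over several runs.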


In [58]:
# A related feature in pandas and NumPy is called broadcasting. With broadcasting, you can 
# apply an operation to every value in the Series, changing the Series. For instance, if we
# wanted to increase every random value by 2, we could do so quickly using the += operator 
# directly on the Series object. 

numbers.head()

0    904
1    968
2    584
3    461
4    136
dtype: int32

In [59]:
# Increase everything in the series by 2
numbers += 2
numbers.head()

0    906
1    970
2    586
3    463
4    138
dtype: int32
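
Broadcasting isn't limited to +=. As a small sketch on a toy Series (the name `scores` is illustrative), other arithmetic and comparison operators are also applied to every element at once:

```python
import pandas as pd

scores = pd.Series([10, 20, 30])

doubled = scores * 2    # arithmetic returns a new Series
above_15 = scores > 15  # comparisons return a boolean Series
scores += 2             # augmented assignment updates scores itself

print(doubled.tolist())
print(above_15.tolist())
print(scores.tolist())
```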

In [60]:
# The procedural way of doing this would be to iterate through all of the items in the 
# Series and increase the values directly. Pandas does support iterating through a Series 
# much like a dictionary, allowing you to unpack values easily.

# We can use the items method, which returns a label and value for each entry
# (the older iteritems method was removed in pandas 2.0)
for label, value in numbers.items():
    # now for the item which is returned, let's call Series.at[],
    # which allows access to a single value in a Series by its label
    numbers.at[label] = value+2
# Check result of this computation
numbers.head()

# Earlier, set_value was used instead of Series.at[], but it was deprecated in later pandas versions.

0    908
1    972
2    588
3    465
4    140
dtype: int32

In [65]:
%%timeit -n 10
# Comparing speeds between looping and broadcasting in pandas
s = pd.Series(np.random.randint(1, 1000, 10000))
# Loop over the freshly created Series s, so both timing cells do the same work
for label, value in s.items():
    s.at[label] = value+2

193 ms ± 23.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [66]:
%%timeit -n 10
s = pd.Series(np.random.randint(1, 1000, 10000))
# Broadcast with +=
s += 2

315 µs ± 73.7 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [67]:
# A note on using the indexing operators to access Series data: the .loc attribute lets you 
# not only modify data in place, but also add new data. If the label you pass in as the index 
# doesn't exist, then a new entry is added. Keep in mind that indices can have mixed types. 
# While it's important to be aware of the typing going on underneath, pandas will automatically 
# change the underlying NumPy types as appropriate.

In [70]:
s = pd.Series([1, 2, 3])

# Adding a value using .loc
s.loc['English'] = 4
s

0          1
1          2
2          3
English    4
dtype: int64
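
To make the difference between modifying and adding concrete, here is a minimal sketch that does both with .loc on the same small Series. After the second assignment the index holds both integers and a string.

```python
import pandas as pd

s = pd.Series([1, 2, 3])

s.loc[0] = 10         # label 0 already exists: the value is modified in place
s.loc['English'] = 4  # label 'English' is new: an entry is added

print(list(s.index))
print(s)
```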

In [71]:
# The following is an example where index values are not unique, which makes a pandas
# Series a little different conceptually than, for instance, a relational database.

students_classes = pd.Series({'Alice': 'Physics',
                   'Jack': 'Chemistry',
                   'Molly': 'English',
                   'Sam': 'History'})
students_classes

Alice      Physics
Jack     Chemistry
Molly      English
Sam        History
dtype: object

In [72]:
# A Series just for some new student Kelly.
kelly_classes = pd.Series(['Psychology', 'EVS', 'Chemistry'], index=['Kelly', 'Kelly', 'Kelly'])
kelly_classes

Kelly    Psychology
Kelly           EVS
Kelly     Chemistry
dtype: object

In [75]:
# We can append all of the data in this new Series to the first. Series.append() was removed
# in pandas 2.0, so we use pd.concat(), which produces the same result.
all_students_classes = pd.concat([students_classes, kelly_classes])
all_students_classes

Alice       Physics
Jack      Chemistry
Molly       English
Sam         History
Kelly    Psychology
Kelly           EVS
Kelly     Chemistry
dtype: object

In [76]:
# There are a couple of important considerations when appending Series. First, pandas will take 
# the Series and try to infer the best data types to use. In this example, everything is a string, 
# so there are no problems here. Second, appending doesn't actually change the underlying Series
# objects; it instead returns a new Series made up of the two appended together. This is
# a common pattern in pandas - by default returning a new object instead of modifying in place - and
# one you should come to expect. By printing the original Series we can see that it hasn't
# changed.
students_classes

Alice      Physics
Jack     Chemistry
Molly      English
Sam        History
dtype: object
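
Both considerations can be seen with numeric data. This sketch uses pd.concat on two toy Series (the names `ints` and `floats` are illustrative): the combined Series is given a type that fits both inputs, and the inputs themselves are left untouched.

```python
import pandas as pd

ints = pd.Series([1, 2], index=['a', 'b'])
floats = pd.Series([0.5], index=['c'])

# pd.concat returns a new Series; neither input is modified
combined = pd.concat([ints, floats])

print(combined.dtype)  # integers are upcast so they fit alongside the float
print(ints.dtype)      # the original Series keeps its integer type
print(len(ints), len(combined))
```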

In [77]:
# Finally, we see that when we query the appended series for Kelly, we don't get a single value, 
# but a series itself. 
all_students_classes.loc['Kelly']

Kelly    Psychology
Kelly           EVS
Kelly     Chemistry
dtype: object
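
Because .loc returns a scalar for a unique label and a Series for a repeated one, code that handles arbitrary labels may want to check first. The sketch below rebuilds a smaller version of the data with pd.concat and shows two options: testing `index.is_unique`, or passing a list of labels so the result is a Series either way.

```python
import pandas as pd

students = pd.Series({'Alice': 'Physics', 'Jack': 'Chemistry'})
kelly = pd.Series(['Psychology', 'EVS', 'Chemistry'], index=['Kelly'] * 3)
all_classes = pd.concat([students, kelly])

# False here, because 'Kelly' appears three times
print(all_classes.index.is_unique)

alice = all_classes.loc['Alice']         # unique label: a single value
kelly_rows = all_classes.loc['Kelly']    # repeated label: a Series
alice_rows = all_classes.loc[['Alice']]  # list of labels: a Series either way

print(type(alice).__name__, type(kelly_rows).__name__, type(alice_rows).__name__)
```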