# Querying `Series`


* A pandas Series can be queried either by the **`index position`** or the **`index label`**. 
* If you don't give an index to the series when creating it, pandas assigns default integer labels starting at zero, so the position and the label are `effectively the same values`. 

* To query by <i>numeric location</i>, starting at zero, use the **`iloc`** attribute. To query by the <i>index label</i>, you can use the **`loc`** attribute. 
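
As a small sketch of the second point above (the values are made up for illustration): when a Series is built from a plain list, pandas assigns the default labels 0, 1, 2, ..., so `iloc` and `loc` should return the same element.

```python
import pandas as pd

# Illustrative data; no index is passed, so pandas assigns the labels 0, 1, 2.
courses = pd.Series(['Physics', 'Chemistry', 'English'])

by_position = courses.iloc[1]  # second element, selected by position
by_label = courses.loc[1]      # element whose label is 1

print(by_position, by_label)
```

Once a custom index is supplied, as in the next cell, the two lookups no longer coincide.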


In [6]:
import pandas as pd
students_classes = {'Alice': 'Physics',
                   'Jack': 'Chemistry',
                   'Molly': 'English',
                   'Sam': 'History'}
s = pd.Series(students_classes)
print(s)
print('\nFourth entry using iloc[3]: ',s.iloc[3])  # to see the fourth entry
# To see which class Molly takes, use the loc attribute with the
# label 'Molly'.
print("Printing using the loc['Molly']:", s.loc['Molly'] )

Alice      Physics
Jack     Chemistry
Molly      English
Sam        History
dtype: object

Fourth entry using iloc[3]:  History
Printing using the loc['Molly']: English


In [17]:
# what happens if your index is a list of integers?

class_code = {99: 'Physics',
              100: 'Chemistry',
              101: 'English',
              102: 'History'}
s = pd.Series(class_code)

In [22]:
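# s[0]  # raises a KeyError: the index holds integers, so pandas treats 0 as a label, and no label 0 exists

# To select by position regardless of the labels, use iloc explicitly; to select by label, use loc.
# s.iloc[1]  # second entry by position: 'Chemistry'
s.loc[99]    # entry with the label 99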

'Physics'
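
The failure mentioned in the comment above can be checked without crashing the notebook. This is a minimal sketch that rebuilds the same Series under a different name (`class_code_series` and `lookup_failed` are names introduced here for illustration) and catches the `KeyError`.

```python
import pandas as pd

class_code_series = pd.Series({99: 'Physics',
                               100: 'Chemistry',
                               101: 'English',
                               102: 'History'})

try:
    class_code_series[0]       # 0 is looked up as a label, not as a position
    lookup_failed = False
except KeyError:
    lookup_failed = True

print('s[0] raised KeyError:', lookup_failed)
print('iloc[0]:', class_code_series.iloc[0])  # first entry by position
print('loc[99]:', class_code_series.loc[99])  # entry with the label 99
```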

In [23]:
grades = pd.Series([90, 80, 70, 60])

total = 0
for grade in grades:
    total+=grade
print(total/len(grades))

# This works, but a Python-level loop can be slow on large data.


75.0


In [25]:
# using the numpy sum method we can improve the runtime

import numpy as np

total = np.sum(grades)
total/len(grades) # find the avg

75.0

In [31]:
# both of these methods create the same value, but which one is actually faster?

numbers = pd.Series(np.random.randint(1,1000,10000)) # Series of 10000 random integers
print(len(numbers))
numbers.head() # just to see the first 5

10000


0    527
1    946
2    139
3    392
4    255
dtype: int32

In [32]:
# Use a cell magic function. These start with "%%" and apply to all the code in the current Jupyter cell. 
# The one we're going to use is called timeit. It runs our code several times to determine, 
# on average, how long it takes.


# We are going to compare the two methods used above to compute the average.

In [35]:
%%timeit -n 100
total = 0

for number in numbers:
    total+=number
total/len(numbers)

1.44 ms ± 134 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [42]:
%%timeit -n 1000

total = np.sum(numbers)
total/len(numbers)

80.2 µs ± 16 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
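
The `%%timeit` magic only works inside Jupyter/IPython. As a rough sketch of the same comparison in a plain Python script, the standard-library `timeit` module can be used instead. The helper names `loop_mean` and `numpy_mean` and the repeat count are choices made for this example, and the timings will vary by machine.

```python
import timeit

import numpy as np
import pandas as pd

numbers = pd.Series(np.random.randint(1, 1000, 10000))

def loop_mean(series):
    """Average computed with an explicit Python loop."""
    total = 0
    for number in series:
        total += number
    return total / len(series)

def numpy_mean(series):
    """Average computed with NumPy's vectorized sum."""
    return np.sum(series) / len(series)

runs = 20  # arbitrary repeat count for this sketch
loop_seconds = timeit.timeit(lambda: loop_mean(numbers), number=runs)
numpy_seconds = timeit.timeit(lambda: numpy_mean(numbers), number=runs)

print(f'loop:  {loop_seconds / runs:.6f} s per call')
print(f'numpy: {numpy_seconds / runs:.6f} s per call')
```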


In [43]:
# The difference is clear: in this run the NumPy version is roughly 18 times faster.
# It pays to be aware of parallel computing features and to start thinking in functional programming terms.

# Vectorization lets a computer apply the same operation to many values
# at once, and with high-performance chips, especially graphics cards, you can get dramatic
# speedups. Modern graphics cards can run thousands of operations in parallel.

# ============= more examples ============
# We should avoid explicit Python loops wherever possible, since iterating element by element
# slows the program down.

In [46]:
%%timeit -n 10
# We'll create a new series of random numbers to work with
s = pd.Series(np.random.randint(0,1000,1000))
# And we'll just rewrite our loop from above.
# (Series.iteritems() was removed in pandas 2.0; items() is the replacement.)
for label, value in s.items():
    s.loc[label]= value+2

25.8 ms ± 968 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [54]:
%%timeit -n 10
# We need to recreate a series
s = pd.Series(np.random.randint(0,1000,1000))
# And we just broadcast with +=
s+=2

370 µs ± 108 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [None]:
# Not only is it significantly faster, but it's more concise and even easier 
# to read too.
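
To check that the two approaches really compute the same thing, here is a minimal sketch that applies both to copies of one Series and compares the results. The names `original`, `looped`, and `broadcast` are introduced for this example only.

```python
import numpy as np
import pandas as pd

original = pd.Series(np.random.randint(0, 1000, 1000))

# Element-by-element update on a copy, using items() to iterate.
looped = original.copy()
for label, value in looped.items():
    looped.loc[label] = value + 2

# Broadcast: the scalar 2 is added to every element in one operation.
# Unlike +=, the + operator returns a new Series and leaves `original` alone.
broadcast = original + 2

same_result = bool((looped == broadcast).all())
print(same_result)
```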

In [56]:
# Here's an example using a Series of a few numbers. 
s = pd.Series([1, 2, 3])

# We could add some new value, maybe a university course
s.loc['History'] = 102  # since this is not found, it creates a new entry

s

0            1
1            2
2            3
History    102
dtype: int64
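
The comment above describes what happens when the label is not found. As a short sketch of both cases, assigning through `loc` overwrites the value when the label already exists and adds an entry when it does not. The value `10` is arbitrary, and `numbers_with_course` is a name introduced for this example.

```python
import pandas as pd

numbers_with_course = pd.Series([1, 2, 3])

numbers_with_course.loc[0] = 10           # label 0 exists, so its value is overwritten
numbers_with_course.loc['History'] = 102  # label 'History' is new, so an entry is added

print(numbers_with_course)
print(numbers_with_course.index)  # the index now mixes integers and a string
```

A mixed index like this is legal, but it makes later `loc` lookups harder to reason about, so it is usually worth keeping index labels of one type.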

In [62]:
#  Dealing with non-unique index

students_classes = pd.Series({'Alice': 'Physics',
                   'Jack': 'Chemistry',
                   'Molly': 'English',
                   'Sam': 'History'})
print('All stud classes: \n',students_classes)


# create a Series just for some new student Kelly, which lists all of the courses
# she has taken. We'll set the index to Kelly, and the data to be the names of courses.

kelly_classes = pd.Series(['Philosophy', 'Arts', 'Math'], index=['Kelly', 'Kelly', 'Kelly'])
print('\nKelly classes: \n', kelly_classes)


# Finally, we can append all of the data in this new Series to the first. Older pandas
# releases had a Series.append() method for this; it was removed in pandas 2.0, so we use pd.concat().
all_students_classes = pd.concat([students_classes, kelly_classes])

# This creates a series which has our original people in it as well as all of Kelly's courses
all_students_classes

All stud classes: 
 Alice      Physics
Jack     Chemistry
Molly      English
Sam        History
dtype: object

Kelly classes: 
 Kelly    Philosophy
Kelly          Arts
Kelly          Math
dtype: object



Alice       Physics
Jack      Chemistry
Molly       English
Sam         History
Kelly    Philosophy
Kelly          Arts
Kelly          Math
dtype: object

In [64]:
# There are a couple of important considerations when appending one Series to another. 
# First, pandas will take the series and try to infer the best data types to use. In this 
# example, everything is a string, so there are no problems here. 

# Second, appending doesn't actually change the underlying Series
# objects; it returns a new series made up of the two joined together. This is
# a common pattern in pandas - by default returning a new object instead of modifying in place - and
# one you should come to expect. By printing the original series we can see that it hasn't
# changed.


students_classes  # the original Series is unchanged

Alice      Physics
Jack     Chemistry
Molly      English
Sam        History
dtype: object
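
Both considerations can be seen in a small sketch using `pd.concat()`, which works on current pandas releases where `Series.append()` is no longer available. The data is trimmed for brevity, and `students`, `kelly`, `combined`, and `mixed` are names introduced for this example.

```python
import pandas as pd

students = pd.Series({'Alice': 'Physics', 'Jack': 'Chemistry'})
kelly = pd.Series(['Philosophy', 'Arts'], index=['Kelly', 'Kelly'])

# concat returns a new Series; the inputs are left untouched.
combined = pd.concat([students, kelly])
print(len(students), len(kelly), len(combined))

# When the inputs hold different kinds of data, pandas picks a dtype
# that can represent both (for integers plus strings, typically object).
mixed = pd.concat([pd.Series([1, 2]), pd.Series(['a'])])
print(mixed.dtype)
```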

In [65]:
# Finally, we see that when we query the appended series for Kelly, we don't get a single value, 
# but a series itself. 
all_students_classes.loc['Kelly']

Kelly    Philosophy
Kelly          Arts
Kelly          Math
dtype: object
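
Because the return type of `loc` depends on whether the label is repeated, it can help to check the index first. This is a minimal sketch with trimmed data; `classes`, `alice`, and `kelly` are names introduced for this example.

```python
import pandas as pd

classes = pd.Series(['Physics', 'Philosophy', 'Arts'],
                    index=['Alice', 'Kelly', 'Kelly'])

# is_unique reports whether every index label appears only once.
print(classes.index.is_unique)

alice = classes.loc['Alice']  # label appears once: a single value
kelly = classes.loc['Kelly']  # label is repeated: a Series of matches

print(type(alice).__name__, type(kelly).__name__)
```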