<a href="https://colab.research.google.com/github/brunofbpaula/DataScience-UM-Coursera/blob/main/Pandas/QueryingSeries.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Querying a Series

This notebook studies the pandas Series object: how to query it, how to combine two or more Series, and why it pays to think about parallelization in data science programming.

In [1]:
import pandas as pd

A pandas Series can be queried either by index position or by index label. If a Series isn't given an index when it is created, the position and the label are effectively the same values.

To query by index label, use the 'loc' attribute. To query by numeric position, starting at zero, use the 'iloc' attribute.

Pandas also provides a sort of smart syntax: the indexing operator can be used directly on the Series itself, without 'loc' or 'iloc'.

Even so, 'loc' and 'iloc' are recommended as the safer and more explicit option. For example, if the index is a list of integers, Pandas cannot determine automatically whether the intention is to query by index position or by index label.
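As a minimal sketch of the default-index case described above: when no index is passed, pandas assigns the integer labels 0, 1, 2, ..., so 'loc' and 'iloc' return the same element. The Series `letters` is a hypothetical example, not part of the original notebook.

```python
import pandas as pd

# Hypothetical Series created without an explicit index
letters = pd.Series(['a', 'b', 'c'])

# pandas assigns the default integer labels 0, 1, 2
default_labels = list(letters.index)

# With the default index, label 1 and position 1 are the same element
by_label = letters.loc[1]
by_position = letters.iloc[1]

print(default_labels)
print(by_label, by_position)
```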

In [2]:
# Let's create a dictionary with Jujutsu Kaisen characters as keys and their cursed techniques as values
jjk_ct = {
    'Satoru Gojo': 'Hollow Purple',
    'Yuji Itadori': 'Kokusen',
    'Megumi Fushiguro': 'Divine Dogs'
}

# Turn the dictionary into a Series object
jjk_ct = pd.Series(jjk_ct)

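# Query by position, then by label
print(jjk_ct.iloc[2])  # Same as jjk_ct[2] in older pandas; recent versions deprecate that positional fallback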

jjk_ct.loc['Satoru Gojo']  # Same as jjk_ct['Satoru Gojo']

Divine Dogs


'Hollow Purple'

In [3]:
# Now let's show why 'loc' and 'iloc' matter with another example
last_titles = {
    2012: 'FIFA Club World Cup',
    2013: 'Recopa Sudamericana',
    2017: 'Domestic League Brasileirão',
}

last_titles = pd.Series(last_titles)

# This will raise an error
last_titles[0]

KeyError: 0

In [4]:
last_titles.iloc[0]

'FIFA Club World Cup'

Fixed it. With 'iloc', the query is explicitly positional, so the integer is no longer mistaken for a label.
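The difference can be sketched side by side. This is an illustrative example that rebuilds the same data under the hypothetical name `titles`, so it does not depend on the cells above.

```python
import pandas as pd

# Same data as last_titles above: the index labels are years (integers)
titles = pd.Series({
    2012: 'FIFA Club World Cup',
    2013: 'Recopa Sudamericana',
    2017: 'Domestic League Brasileirão',
})

# loc interprets the integer as a label
by_label = titles.loc[2012]

# iloc interprets the integer as a zero-based position
by_position = titles.iloc[0]

# The plain indexing operator treats 0 as a label, which does not exist
try:
    titles[0]
    lookup_failed = False
except KeyError:
    lookup_failed = True

print(by_label, by_position, lookup_failed)
```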

## Iterating through a Series



In [8]:
# Iterating through a list of grades to figure out the average grade
grades = pd.Series([100, 90, 95, 97, 85, 84])

# Using a for loop
total = 0
for grade in grades:
  total += grade

print(total/len(grades))

91.83333333333333


In [11]:
# Using NumPy

import numpy as np

total = np.sum(grades)
print(total/len(grades))

91.83333333333333


Both give us the same result, but one of them is actually faster than the other. Jupyter Notebook has a function to check this.

In [16]:
# Big Series of random numbers

numbers = pd.Series(np.random.randint(0, 1000, 10000))

# head() is a pandas method that returns the first five items
numbers.head()

0    375
1     71
2     14
3    279
4    247
dtype: int64

The IPython interpreter has something called magic functions, which begin with a percentage sign. Below is a cell magic function, written with two percentage signs. It applies to all the code in the current cell.

In [20]:
%%timeit -n 100
# This one is called timeit. It runs the code several times to determine how long it takes on average.
# The '-n 100' sets how many loops are executed in each run.
# Note: a cell magic must be the first line of the cell.

# For loop
total = 0
for number in numbers:
  total += number

total/len(numbers)

5.11 ms ± 2.07 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [22]:
%%timeit -n 100

# Using NumPy
total = np.sum(numbers)
total/len(numbers)

123 µs ± 45.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


We can see that the second form is much faster. This technique is called vectorization: a single operation is applied to many values at once, rather than to one value per step of a loop.

With high-performance chips, especially graphics cards, dramatic speedups are achieved. Modern graphics cards can run thousands of instructions in parallel.
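The `%%timeit` magic only exists inside IPython and Jupyter. As a rough sketch of the same comparison in a plain Python script, the standard-library `timeit` module can be used instead. The helper names `loop_mean` and `vectorized_mean` and the repeat count of 20 are assumptions made for illustration; actual timings will vary by machine.

```python
import timeit

import numpy as np
import pandas as pd

numbers = pd.Series(np.random.randint(0, 1000, 10000))


def loop_mean(series):
    """Average computed one element at a time in a Python loop."""
    total = 0
    for number in series:
        total += number
    return total / len(series)


def vectorized_mean(series):
    """Average computed with a single vectorized NumPy call."""
    return np.sum(series) / len(series)


# Both approaches should agree on the result
loop_result = loop_mean(numbers)
vectorized_result = vectorized_mean(numbers)

# number=20 is an arbitrary choice to keep the run short
loop_seconds = timeit.timeit(lambda: loop_mean(numbers), number=20)
vectorized_seconds = timeit.timeit(lambda: vectorized_mean(numbers), number=20)

print(f"loop:       {loop_seconds / 20:.6f} s per call")
print(f"vectorized: {vectorized_seconds / 20:.6f} s per call")
```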

## Broadcasting

It's possible to apply an operation to every value in the Series at once, changing the Series. This is called broadcasting. Broadcast operations are usually faster than procedural loops and more readable as well.

In [30]:
numbers.head()

0    377
1     73
2     16
3    281
4    249
dtype: int64

In [32]:
numbers += 2
numbers.head()

0    379
1     75
2     18
3    283
4    251
dtype: int64

In [38]:
%%timeit -n 100

# Broadcasting
n = pd.Series(np.random.randint(0, 1000, 10000))
n += 2

246 µs ± 54.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [39]:
%%timeit -n 10

# For loop (note: this Series has 1,000 values, ten times fewer than the broadcasting cell)
n = pd.Series(np.random.randint(0, 1000, 1000))

for label, value in n.items():
  n.loc[label] = value + 2

42.4 ms ± 1.43 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
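Broadcasting is not limited to in-place addition. As a small sketch reusing the grades from earlier in the notebook, other operators also work element by element, and without an assignment operator they return a new Series instead of modifying the original. The names `curved` and `passed` are hypothetical.

```python
import pandas as pd

grades = pd.Series([100, 90, 95, 97, 85, 84])

# Arithmetic without '=' returns a new Series and leaves the original untouched
curved = grades + 2

# Comparison operators broadcast as well, producing a boolean Series
passed = grades >= 90

print(curved.tolist())
print(passed.tolist())
print(grades.tolist())
```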


## New Index

It's possible to add a new entry to the Series object using the 'loc' attribute by passing an index label that doesn't exist in it.

In [42]:
new_series = pd.Series(np.random.randint(0, 3, 3))
new_series.loc['Satoru Gojo'] = 'Infinity'

new_series

0                     1
1                     1
2                     2
Satoru Gojo    Infinity
dtype: object
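Note the `dtype: object` in the output above: adding a string to a Series of integers changes the type of the whole Series. A minimal sketch of that effect, using a hypothetical Series named `values` with fixed numbers instead of random ones:

```python
import pandas as pd

# Hypothetical integer Series with the default index 0, 1, 2
values = pd.Series([1, 1, 2])
dtype_before = values.dtype

# Assigning to a label that does not exist appends a new entry
values.loc['Satoru Gojo'] = 'Infinity'
dtype_after = values.dtype

print(dtype_before, '->', dtype_after)
print(values.index.tolist())
```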

## Non-Unique Indexes

Index labels in a Series don't have to be unique. This makes a Pandas Series a little different conceptually than, for instance, a relational database.

In [51]:
jjk = pd.Series({
    'Yuji Itadori': 'Kokusen',
    'Megumi Fushiguro': 'Divine Dogs',
    'Kento Nanami': 'Ratio Technique'
})

satoru_ct = pd.Series(['Red', 'Blue', 'Purple'], index=['Satoru Gojo', 'Satoru Gojo', 'Satoru Gojo'])

# pd.concat returns a new Series with the data of the last object appended to the end of the first
jjk_ct = pd.concat([jjk, satoru_ct])

jjk_ct

Yuji Itadori                Kokusen
Megumi Fushiguro        Divine Dogs
Kento Nanami        Ratio Technique
Satoru Gojo                     Red
Satoru Gojo                    Blue
Satoru Gojo                  Purple
dtype: object

In [53]:
jjk_ct.loc['Satoru Gojo']

Satoru Gojo       Red
Satoru Gojo      Blue
Satoru Gojo    Purple
dtype: object
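One consequence worth keeping in mind, sketched below with a smaller hypothetical Series named `techniques`: a query with 'loc' returns a single value when the label appears once, but a Series when the label is repeated. The `is_unique` attribute of the index can be used to check for repeated labels.

```python
import pandas as pd

techniques = pd.Series(
    ['Kokusen', 'Red', 'Blue'],
    index=['Yuji Itadori', 'Satoru Gojo', 'Satoru Gojo'],
)

# is_unique reports whether every label appears only once
has_unique_labels = techniques.index.is_unique

# A label that appears once yields a single value
single = techniques.loc['Yuji Itadori']

# A repeated label yields a Series with every matching entry
repeated = techniques.loc['Satoru Gojo']

print(has_unique_labels)
print(type(single).__name__, type(repeated).__name__)
```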