<br>
In this lecture, we'll learn about:

* how to query and merge Series objects.
* why it is important to think about parallelization.

# Querying Series 

In this section, we'll see how to get data out of the Series object.

A pandas Series can be queried either by the index position or the index label. 

**To query by numeric location**, starting at zero, use the **iloc** attribute. 

**To query by the index label**, you can use the **loc** attribute. 

In [1]:
import pandas as pd 

In [2]:
student_classes = {'Amir' : 'internship',
                  'Reza' : 'internet engineering',
                  'Saleh' : 'history',
                  'Sara' : 'software project'}

serie = pd.Series(student_classes)
serie

Amir               internship
Reza     internet engineering
Saleh                 history
Sara         software project
dtype: object

Keep in mind that iloc and loc are not methods; they are attributes. So you don't use parentheses to query them but square brackets, which are called the indexing operator.

In [3]:
serie.iloc[3]

'software project'

In [4]:
serie.loc["Amir"]

'internship'

Pandas tries to make our code a bit more readable and provides a sort of smart syntax using the indexing operator directly on the series itself. 

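If we pass in an integer, the operator behaves as if we wanted to query via the **iloc** attribute. Note that recent pandas versions (2.1 and later) deprecate this positional fallback and emit a FutureWarning, so using **iloc** explicitly is the more future-proof spelling.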

In [5]:
serie[2]

'history'

If you pass in a label, such as a string, it will query as if you wanted to use the label-based **loc** attribute.

In [6]:
serie["Sara"]

'software project'

If you don't give the series an index when creating it, pandas assigns the default labels 0, 1, 2, and so on, so querying by index position and by index label returns the same values.

In [7]:
s = pd.Series([1, 2, None])
s

0    1.0
1    2.0
2    NaN
dtype: float64

In [8]:
s[1]

2.0

In [9]:
s.iloc[2]

nan

In [10]:
s.loc[2]

nan

In [11]:
s.iloc[2] == s.loc[2]

False
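The `False` above is not a querying problem: both lookups return NaN, and NaN never compares equal to anything, itself included. A minimal sketch of NaN-aware comparisons, assuming the same three-element Series (the variable names are only illustrative):

```python
import pandas as pd

s = pd.Series([1, 2, None])  # None is stored as NaN in a float64 Series

by_position = s.iloc[2]
by_label = s.loc[2]

# NaN is never equal to itself, so a plain == comparison is False
plain_equal = by_position == by_label

# pd.isna tests for missing values explicitly
both_missing = pd.isna(by_position) and pd.isna(by_label)

# Series.equals treats NaNs in the same location as equal
same_series = s.equals(s.copy())

print(plain_equal, both_missing, same_series)
```

So when a comparison may involve missing values, `pd.isna` or `Series.equals` is usually a safer choice than `==`.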

What happens if your index is a list of integers? 

This is a bit complicated, because Pandas can't determine automatically whether you intend to query by index position or by index label. So you need to be careful when using the indexing operator on the Series itself. The safer option is to be more explicit and use the iloc or loc attributes directly.

In [12]:
class_code = {100 : 'History',
             102 : 'Physics',
             98 : 'Mathematics',
             142 : 'software engineering'}

s = pd.Series(class_code)

The error below happens because the index labels are integers, so pandas looks for the label 0 instead of calling s.iloc[0] underneath, as one might expect.

In [13]:
s[0]

KeyError: 0

In [14]:
s.iloc[0]

'History'

In [15]:
s.loc[142]

'software engineering'

# Parallel Computing

<br>
A typical programmatic approach to performing some operation on all of the values in a Series is to iterate over the items one by one.

In [16]:
student_grades = pd.Series([90, 70, 80, 100, 95, 60])

total = 0
for grade in student_grades:
    total += grade
    
total / len(student_grades)

82.5

<br>
This works, but it's slow. Modern computers can do many tasks simultaneously, especially, but not only, tasks involving mathematics.

Pandas and the underlying numpy libraries support a method of computation called vectorization.

Vectorization works **with most of the functions in the numpy library**, including the sum function.

In [17]:
import numpy as np

In [18]:
total = np.sum(student_grades)
total / len(student_grades)

82.5
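As a small aside, pandas Series also expose vectorized reductions as methods, so the average does not have to be assembled by hand. A minimal sketch, assuming the same grades as above (the variable names are illustrative):

```python
import numpy as np
import pandas as pd

student_grades = pd.Series([90, 70, 80, 100, 95, 60])

# Vectorized sum divided by the number of items, as in the cell above
mean_via_numpy = np.sum(student_grades) / len(student_grades)

# The Series.mean() method performs the same reduction in one call
mean_via_method = student_grades.mean()

print(mean_via_numpy, mean_via_method)
```

Both expressions give 82.5 for this data.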

Is the second one actually faster? 

The Jupyter Notebook has a magic function which can help us to know.

When we use **np.random.randint()**, we specify the range the numbers should fall in (the lower bound is included, the upper bound is not) and how many random integers we want.

In [19]:
numbers = pd.Series(np.random.randint(0, 1000, 10000))

To look at the first five items in the Series, we can use the **.head()** method.

In [20]:
numbers.head()

0    161
1     51
2    241
3     48
4    964
dtype: int32

In [21]:
len(numbers)

10000

The IPython interpreter has something called magic functions, which begin with a percentage sign. If we type this sign and then hit the Tab key, we can see a list of the available magic functions.

We're actually going to use what's called a **cell magic function**. These start with **two percentage signs** and **wrap the code in the current Jupyter cell**. The function we're going to use is called **timeit**. This function **will run our code a number of times to determine, on average, how long it takes**.

We can give **timeit** the **number of loops** that we would like to **run** with the -n option. **By default, it chooses a suitable number of loops automatically**, which is why the two cells below report 1,000 and 10,000 loops.

Note that in order to use a cell magic function, it has to be the first line in the cell.

In [22]:
%%timeit 

total = 0

for num in numbers:
    total += num
    
total / len(numbers)

1.56 ms ± 114 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [23]:
%%timeit

total = np.sum(numbers)
total / len(numbers)

98.3 µs ± 3.71 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
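Outside Jupyter, where the `%%timeit` magic is not available, a similar comparison can be sketched with the standard-library `timeit` module. The helper names `loop_mean` and `vectorized_mean` and the choice of 20 repetitions are assumptions made for illustration, and the absolute timings will differ from machine to machine:

```python
import timeit

import numpy as np
import pandas as pd

numbers = pd.Series(np.random.randint(0, 1000, 10000))

def loop_mean(series):
    """Average computed by iterating over every item."""
    total = 0
    for num in series:
        total += num
    return total / len(series)

def vectorized_mean(series):
    """Average computed with numpy's vectorized sum."""
    return np.sum(series) / len(series)

repeats = 20  # arbitrary; raise it for more stable measurements
loop_seconds = timeit.timeit(lambda: loop_mean(numbers), number=repeats)
vector_seconds = timeit.timeit(lambda: vectorized_mean(numbers), number=repeats)

print(f"loop:       {loop_seconds / repeats:.6f} s per run")
print(f"vectorized: {vector_seconds / repeats:.6f} s per run")
```

Both helpers return the same average; only the time they need differs.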


<br>
This is a pretty shocking difference in speed and demonstrates why one should be aware of parallel computing features and start thinking in functional programming terms.

Vectorization is the ability to apply a single operation to many values at once instead of one value at a time, and with high-performance chips, especially graphics cards, we can get dramatic speedups. 

Modern graphics cards can run thousands of instructions in parallel.

### Broadcasting

A related feature in pandas and numpy is called broadcasting. 

With broadcasting, we can apply an operation to every value in the series, changing the series. For instance, if we wanted to increase every random number by 2, we could do so quickly using the += operator directly on the Series object. 

In [24]:
numbers.head()

0    161
1     51
2    241
3     48
4    964
dtype: int32

In [25]:
numbers += 2
numbers.head()

0    163
1     53
2    243
3     50
4    966
dtype: int32
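Broadcasting is not limited to `+=`. As a small sketch, assuming a short hand-written Series in place of the random one (all variable names are illustrative), other arithmetic and comparison operators are applied to every element in the same way:

```python
import pandas as pd

numbers = pd.Series([161, 51, 241, 48, 964])

# Arithmetic with a scalar returns a new Series; numbers itself is unchanged
doubled = numbers * 2

# A comparison is broadcast too and yields a boolean Series
is_large = numbers > 100

# The boolean Series can be used as a mask to select matching values
large_only = numbers[is_large]

print(doubled.tolist())
print(large_only.tolist())
```

Unlike `+=`, the plain `*` and `>` operators leave the original Series untouched and return new objects.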

The procedural way of doing this would be to iterate through all of the items in the series and increase the values directly.

Pandas does support **iterating** through a series much like a dictionary, allowing you to unpack values easily.

The **items()** method returns **a label and value** pair for each entry (older pandas versions called it iteritems(), which was removed in pandas 2.0).

In [35]:
%%timeit -n 100

numbers = pd.Series(np.random.randint(0, 1000, 1000))

for label, value in numbers.items():
    numbers.loc[label] = value + 2 

64 ms ± 1.73 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [36]:
%%timeit -n 100

numbers = pd.Series(np.random.randint(0, 1000, 1000))

numbers += 2

273 µs ± 43.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


Not only is it significantly faster, but it's more concise and easier to read too. 

**The typical mathematical operations** we would expect are **vectorized in numpy**.

The **loc** attribute lets you not only **modify data** in place, but also **add new data**. If the label you pass in doesn't exist in the index, then a new entry is added.

Keep in mind that we can have **mixed types** for **data values or index labels**.

In [40]:
s = pd.Series([1,2,3])
s.loc["Mathematics"] = (9, 18)
s

0                    1
1                    2
2                    3
Mathematics    (9, 18)
dtype: object
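To make the two behaviours of `loc` assignment easier to tell apart, here is a minimal sketch that first assigns to an existing label and then to a new one. The label "History" and the values are made up for illustration:

```python
import pandas as pd

s = pd.Series([1, 2, 3])
length_before = len(s)

# Label 0 already exists, so the value is modified in place
s.loc[0] = 10
length_after_modify = len(s)

# Label "History" does not exist yet, so a new entry is added at the end
s.loc["History"] = 4
length_after_add = len(s)

print(length_before, length_after_modify, length_after_add)
print(s.index.tolist())
```

The length only grows in the second case, and the index now mixes integer and string labels.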

# Merging Series 

**In a Series**, the **index labels do not have to be unique**.

This makes a pandas Series a little different conceptually from a relational database.

In [56]:
student_classes = {"Amir" : "Internship",
                  "Reza" : "Internet Engineering",
                  "Saleh" : "History",
                  "Sara" : "Software Project"}

s1 = pd.Series(student_classes)
s1

Amir               Internship
Reza     Internet Engineering
Saleh                 History
Sara         Software Project
dtype: object

In [60]:
Arman_classes = ["DB", "AI", "Internet Engineering"]

s2 = pd.Series(Arman_classes, ["Arman", "Arman", "Arman"])
s2

Arman                      DB
Arman                      AI
Arman    Internet Engineering
dtype: object

We can combine the two Series with **pd.concat** (older pandas versions also offered an **append** method on the Series, which was removed in pandas 2.0). There are a couple of important considerations when doing so.

* First, Pandas will take the series and try to infer the best data types to use.
* Second, pd.concat doesn't change the underlying Series objects; it instead returns a new Series made up of the two joined together.

In [61]:
s3 = pd.concat([s1, s2])
s3

Amir               Internship
Reza     Internet Engineering
Saleh                 History
Sara         Software Project
Arman                      DB
Arman                      AI
Arman    Internet Engineering
dtype: object

In [62]:
s1

Amir               Internship
Reza     Internet Engineering
Saleh                 History
Sara         Software Project
dtype: object

In [63]:
s2

Arman                      DB
Arman                      AI
Arman    Internet Engineering
dtype: object

<br>
We see that when we query the combined series for Arman, we don't get a single value but a Series, because the label appears three times. 

In [65]:
s3.loc["Arman"]


Arman                      DB
Arman                      AI
Arman    Internet Engineering
dtype: object
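Because the type of the result depends on whether a label is repeated, it can help to check the index before querying. A minimal sketch, assuming a shortened version of the data above and using `pd.concat` to combine the Series (the variable names are illustrative):

```python
import pandas as pd

s1 = pd.Series({"Amir": "Internship", "Sara": "Software Project"})
s2 = pd.Series(["DB", "AI"], index=["Arman", "Arman"])

combined = pd.concat([s1, s2])

# is_unique tells us whether any label occurs more than once
has_unique_labels = combined.index.is_unique

# A label that occurs once yields a single value ...
single = combined.loc["Amir"]

# ... while a repeated label yields a Series with one row per occurrence
several = combined.loc["Arman"]

# value_counts on the index shows how often each label appears
arman_count = combined.index.value_counts().loc["Arman"]

print(has_unique_labels, single, len(several), arman_count)
```

Code that expects a single value from `loc` should therefore be prepared for a Series whenever `is_unique` is False.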