# Pandas
___

<img src="https://i1.wp.com/numfocus.org/wp-content/uploads/2016/07/pandas-logo-300.png?fit=300%2C300&ssl=1" width="200">

Now that we have a better understanding of Numpy, we notice 2 things:
1. Numpy is extremely useful for working with numerical values of the same type.
1. Numpy lacks some flexibility when it comes to working with heterogeneous data (say, strings alongside floats), as well as performing more data-science-related operations such as groupings and pivots.

So we would like a way to benefit from numpy's ability to work efficiently with numerical values while also enjoying the flexibility to work with heterogeneous data. You might have guessed the answer by now: __Pandas__ (the name is derived from *Panel Data*, which is multi-dimensional data involving measurements over time).

## Hello Pandas
***
Pandas is a Python module built on top of numpy. It harnesses numpy's numerical efficiency while enabling us to work with heterogeneous data. It does so by wrapping ndarrays in its own objects: the pandas DataFrame and the pandas Series, which we will discuss after the import.
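To make the motivation concrete, here is a minimal sketch of the difference. The names `mixed` and `fruit` and the values in them are made up for illustration only.

```python
import numpy as np
import pandas as pd

# A NumPy array holds a single dtype, so mixing text and numbers
# forces every element to become a string.
mixed = np.array(["apple", 1.5, 3])
print(mixed.dtype.kind)  # 'U' stands for a unicode string dtype

# A DataFrame stores each column separately, so every column keeps its own dtype.
fruit = pd.DataFrame({"name": ["apple", "pear"], "price": [1.5, 2.25], "count": [3, 7]})
print(fruit.dtypes)
```

The array silently loses the numeric types, while the dataframe keeps a float column and an integer column next to the text column.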

In [0]:
import numpy as np  # It is always advised to import numpy when working with pandas
import pandas as pd # The official alias of pandas is pd.

In [0]:
pd.__version__

'0.25.3'

## Pandas Objects
Pandas supplies 3 objects for us to work with: 
1. __Index__
1. __Series__
1. __DataFrame__  
The list is ordered this way because each Series object contains an Index, and each DataFrame contains both Series objects and an Index object. However, we will talk about them in this order: Series, DataFrame, Index.

### Series
***
<img src="https://www.straightpokersupplies.com/media/catalog/product/cache/1/image/1200x1200/9df78eab33525d08d6e5fb8d27136e95/w/o/world-series-of-poker-playing-cards-modiano-2015-16.jpg" width="200">

The first object we are going to discuss is the pandas Series. A series is a "wrapped" __1d__ numpy array. Let's start with creating our very first pandas Series :

In [0]:
s = pd.Series([10, 20, 30, 40], name='random stuff') 

In [0]:
s

0    10
1    20
2    30
3    40
Name: random stuff, dtype: int64

In [0]:
s.name

'random stuff'

In [0]:
s[0], s[1], s[2], s[3]

(10, 20, 30, 40)

In [0]:
s[1:3] # Slicing works as well

1    20
2    30
Name: random stuff, dtype: int64

Ok, so when we print our series we see the values it contains plus the index of each value.  
Let's jump back to numpy for a second:

In [0]:
a = np.array([10, 20, 30, 40])

In [0]:
a

array([10, 20, 30, 40])

In [0]:
a[0], a[1], a[2], a[3]

(10, 20, 30, 40)

Series can only be __1 dimensional__!

In [0]:
pd.Series(np.zeros((2, 2))) # this throws an exception

Exception: ignored

Well, at first glance it looks like the only difference between a series and a numpy array is that the index is printed for us. And the truth is that this is the major difference: the index.  
A series object contains 2 objects as attributes: 
1. __values__ - a numpy array of values.
1. __index__ - a pandas Index object. (guess what holds the values under the hood? a numpy array)  

Let's explore:

In [0]:
s_values, s_index = s.values, s.index

In [0]:
s_values, s_index

(array([10, 20, 30, 40]), RangeIndex(start=0, stop=4, step=1))

In [0]:
type(s_values), type(s_index)

(numpy.ndarray, pandas.core.indexes.range.RangeIndex)

Ok, so we see that the values attribute is a numpy array and that the index is a RangeIndex object (a subclass of the pandas Index object, which we will talk about soon).  
This means that unlike numpy, which only has an implicit index, a series lets us set an index explicitly. Let's see this in action:

In [0]:
s = pd.Series([10, 20, 30, 40], index=list('abcd'))

In [0]:
s.name  # This new series was created without a name, so this is None and nothing is displayed

In [0]:
s

a    10
b    20
c    30
d    40
dtype: int64

Now we understand why printing the index makes sense. What's more, we can use the index to access the values in our series

In [0]:
s['a'], s['b'], s['c']

Slicing works here as well!

In [0]:
s['a':'c']

But, we did not lose our previous capabilities and we can still access our elements using a numeric index. This comes with a caveat we will soon see.

In [0]:
s[0]

10

__Danger__. Be careful when mixing explicit indices and implicit indices, as in some cases this can cause confusion. The worst scenario is when you set a different numeric index on your series, because then the brackets refer to your explicit labels and you can no longer use implicit indexing.

In [0]:
s = pd.Series([10, 20, 30, 30], index=range(1, 5))
s

In [0]:
s[0] # Raises a KeyError because 0 is not one of the labels (long traceback omitted)

In [0]:
s[1] # Some might expect the 2nd value in the array, but this returns the first one, whose label is 1
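One way to sidestep this ambiguity is to state explicitly whether a label or a position is meant, using the `loc` and `iloc` indexers that are covered in more detail later on. A minimal sketch, where the name `prices` is only illustrative:

```python
import pandas as pd

# Same data as above: integer labels 1..4 that do not match the positions 0..3.
prices = pd.Series([10, 20, 30, 30], index=range(1, 5))

first_by_position = prices.iloc[0]   # position 0, the first element
first_by_label = prices.loc[1]       # label 1, which is also the first element
second_by_position = prices.iloc[1]  # position 1, the second element

print(first_by_position, first_by_label, second_by_position)
```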

One way to look at a pandas series is as a sort of Python dictionary, where the index holds the keys and the values are, well, the values.  
This similarity is so apparent that you can use Python dictionaries to build pandas series.

In [0]:
d = {
    'dollar' : 3.14,
    'euro'   : 3.5,
    'pound'  : 4.29
}
s = pd.Series(d)
s

dollar    3.14
euro      3.50
pound     4.29
dtype: float64

In [0]:
s['dollar']

3.14

In [0]:
s['euro':]

In [0]:
d.keys(), s.index

(dict_keys(['dollar', 'euro', 'pound']),
 Index(['dollar', 'euro', 'pound'], dtype='object'))

__Summing up__ : A pandas series is an enhanced 1d numpy array which lets us set an explicit index.

***
## Exercise
***

__Create a series of size 10 with random values and an implicit index__

In [0]:
# Your code starts here
# Your code ends here

0    0.135622
1    0.098460
2    0.909627
3    0.298143
4    0.441150
5    0.948057
6    0.024713
7    0.381613
8    0.642526
9    0.731618
dtype: float64


__Create the following series: (index on left, values on the right)__
```py
2     1
4     3
6     5
8     7
10    9
```

In [0]:
a = None
# Your code starts here
# Your code ends here


2     1.0
4     3.0
6     5.0
8     7.0
10    9.0
dtype: float64

__Use slicing to access the values of `a` with index 4 and 8.__

In [0]:
# Your code starts here

# Your code ends here


(3.0, 7.0)

__Create the following series (index on the left, values on the right)__
```py
2squared     4
3squared     9
4squared    16
5squared    25
```

In [0]:
# Your code starts here

# Your code ends here

TypeError: ignored

## DataFrame
***

<img src="https://www.tutorialspoint.com/python_pandas/images/structure_table.jpg" width="300">

A pandas dataframe is basically $n$ pandas series placed side by side, each one forming a column. Think of a dataframe as an Excel sheet where you can name your columns. Another option is to think of it as an enhanced 2d numpy array.  
Let's explore:

In [0]:
data = np.random.randint(low=10, high=50, size=(15, 3))
df = pd.DataFrame(data, columns=['A', 'B', 'C'])
data

array([[14, 22, 36],
       [32, 27, 43],
       [11, 47, 21],
       [11, 40, 23],
       [16, 24, 20],
       [22, 34, 38],
       [45, 39, 26],
       [21, 14, 29],
       [40, 27, 31],
       [28, 42, 32],
       [24, 29, 23],
       [30, 43, 25],
       [36, 47, 41],
       [21, 24, 49],
       [46, 44, 48]])

In [0]:
df

     A   B   C
0   14  22  36
1   32  27  43
2   11  47  21
3   11  40  23
4   16  24  20
5   22  34  38
6   45  39  26
7   21  14  29
8   40  27  31
9   28  42  32
10  24  29  23
11  30  43  25
12  36  47  41
13  21  24  49
14  46  44  48


We can access each column by its name:

In [0]:
df['A']

0     14
1     32
2     11
3     11
4     16
5     22
6     45
7     21
8     40
9     28
10    24
11    30
12    36
13    21
14    46
Name: A, dtype: int64

And each column is a pandas series, as we mentioned above:

In [0]:
type(df['A']), type(df['B']), type(df['C'])

(pandas.core.series.Series,
 pandas.core.series.Series,
 pandas.core.series.Series)

And as mentioned above each of these series contains a numpy array.

In [0]:
type(df['A'].values), type(df['B'].values), type(df['C'].values)

(numpy.ndarray, numpy.ndarray, numpy.ndarray)

We mentioned earlier we can view the pandas series as a dictionary, mapping from index to value. We can also think of a dataframe as a dictionary mapping from index (column name) to a series.  
This helps in remembering the following: when we use the square bracket notation with a column name, we get back a Series.
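The dictionary view can be sketched in code by building a dataframe from a dictionary of series. The names `a`, `b` and `frame` and their values are placeholders chosen for this illustration.

```python
import pandas as pd

# Two series that share the same row labels.
a = pd.Series({"x": 1, "y": 2})
b = pd.Series({"x": 10, "y": 20})

# The dict keys become the column names; the series index becomes the row index.
frame = pd.DataFrame({"A": a, "B": b})

print(list(frame.columns))   # column labels, much like dict keys
print("B" in frame)          # membership tests look at the column labels
print(type(frame["B"]))      # a single column comes back as a Series
```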

The dataframe object also contains yet another Index object, which maps to the dataframe rows. We will see how we use these to access the rows later on.  

In [0]:
type(df.index), type(df.columns), type(df.values[0, 0])

(pandas.core.indexes.range.RangeIndex,
 pandas.core.indexes.base.Index,
 numpy.int64)

***
## Exercise
***

__Create the following dataframe__
```py
	 A	 B	 C	 D	 E
0	0	 1	 2	 3	 4
1	5	 6	 7	 8	 9
2	10	11	12	13	14
3	15	16	17	18	19
4	20	21	22	23	24
```

In [0]:
# Your code starts here

# Your code ends here

[[ 0  1  2  3  4]
 [ 5  6  7  8  9]
 [10 11 12 13 14]
 [15 16 17 18 19]
 [20 21 22 23 24]]


    A   B   C   D   E
0   0   1   2   3   4
1   5   6   7   8   9
2  10  11  12  13  14
3  15  16  17  18  19
4  20  21  22  23  24


__Create the following dataframe__
```py
	0	1	2	3	4
0	A	B	C	D	E
```

In [0]:
# Your code starts here

# Your code ends here

[['A', 'B', 'C', 'D', 'E']]


   0  1  2  3  4
0  A  B  C  D  E


__Create the following dataframe__:
```py
	Israel	USA	Japan
first	0	1	2
second	3	4	5
```

In [0]:
# Your code starts here

# Your code ends here

[[0 1 2]
 [3 4 5]]


        Israel  USA  Japan
first        0    1      2
second       3    4      5


## Index
*** 

<img src="https://ecuinc.biz/wp-content/uploads/2017/11/index.jpg" width="200">

We just saw that each of our objects uses the pandas Index. We can think of the pandas Index as an immutable numpy array. Meaning, we can perform on the Index object a lot of the operations we used on the numpy array, __BUT__ we can't change any of its values (which makes sense for an index).

In [0]:
idx = pd.Index([1, 2, 3, 4, 5, 6, 7, 8])

In [0]:
idx.shape, idx.ndim, idx.size, idx.dtype

((8,), 1, 8, dtype('int64'))

And we can use a lot of the same techniques to access elements in the index:

In [0]:
idx[0], idx[1:5], idx[::2]

(1,
 Int64Index([2, 3, 4, 5], dtype='int64'),
 Int64Index([1, 3, 5, 7], dtype='int64'))

But, since this is an immutable object we can not change the values

In [0]:
idx[0] = 12

TypeError: ignored

And guess what the Index object contains under the hood?

In [0]:
type(idx.values)

A great feature of the Index object is that it supports set operations. We can perform various set operations between indices to create new indices.

Some set operation reminder:  

* __Union__        : $A \cup B = \{a | a\in A ~or~ a\in B \}$ All the elements which are either in A or in B.  
* __Intersection__ : $A \cap B = \{a | a\in A ~and~ a\in B \}$ All the elements which are both in A and in B.  
* __Symmetric difference__ : $A \triangle B = \{a | a \in A ~or~ a\in B~ but ~not ~both \}$

And the operators in python:  
* __Union__ - |  
* __Intersection__ - &  
* __Symmetric Difference__ - ^  

Let's see that in action:

In [0]:
ind_1 = pd.Index(np.arange(10))
ind_2 = pd.Index(np.arange(5, 12))
ind_1, ind_2

(Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='int64'),
 Int64Index([5, 6, 7, 8, 9, 10, 11], dtype='int64'))

In [0]:
ind_1 | ind_2

Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11], dtype='int64')

In [0]:
ind_1 & ind_2

Int64Index([5, 6, 7, 8, 9], dtype='int64')

In [0]:
ind_1 ^ ind_2

Int64Index([0, 1, 2, 3, 4, 10, 11], dtype='int64')
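The same three results can also be obtained with named Index methods. This may be the safer spelling on newer installations: to our knowledge, from pandas 2.0 onward the `|`, `&` and `^` operators on an Index act as element-wise logical operations rather than set operations. A minimal sketch, with `left` and `right` as illustrative names:

```python
import numpy as np
import pandas as pd

left = pd.Index(np.arange(10))
right = pd.Index(np.arange(5, 12))

either = left.union(right)                    # elements in left or in right
both = left.intersection(right)               # elements in left and in right
only_one = left.symmetric_difference(right)   # elements in exactly one of the two

print(either)
print(both)
print(only_one)
```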

In most of the scenarios you will encounter, there will be no need to construct an Index object outside a Series or a DataFrame.

## Exercise
***

__Get all the values which are in ind_1 minus those in ind_2__

In [0]:
ind_1 = pd.Index(np.arange(10))
ind_2 = pd.Index(np.arange(5, 12))
print(ind_1)
print(ind_2)

# Your code starts here

# Your code ends here

Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='int64')
Int64Index([5, 6, 7, 8, 9, 10, 11], dtype='int64')


Int64Index([0, 1, 2, 3, 4], dtype='int64')

__Get all the indices which are in either `ind_1` or `ind_2` but not in `ind_3`__

In [0]:
ind_1 = pd.Index(np.arange(0, 10, 2))
ind_2 = pd.Index(np.arange(1, 11, 2))
ind_3 = pd.Index(np.arange(0, 11, 3))

# Your code starts here

# Your code ends here

Int64Index([1, 2, 4, 5, 7, 8], dtype='int64')

## Accessing elements
***

### Series
We talked about the fact that a series is similar to 2 different objects : a 1d numpy array and a dictionary.  
So we know we can access elements in the same way we access elements in each of those objects.  
But there are some nuances we should pay attention to:

In [0]:
s = pd.Series(np.arange(1, 5), list('bcde'))

In [0]:
s

b    1
c    2
d    3
e    4
dtype: int64

Accessing like a numpy array :

In [0]:
s[1], s[2:4]

(2, d    3
 e    4
 dtype: int64)

Accessing like a dictionary:

In [0]:
s['e'], s['b':'d'] # Notice that this does take the last element!

(4, b    1
 c    2
 d    3
 dtype: int64)

By the way, slicing does not work for python dictionaries.

In [0]:
d = {l:i for i, l in zip(range(5), list('abcde'))}
d

{'a': 0, 'b': 1, 'c': 2, 'd': 3, 'e': 4}

In [0]:
d['a':'d']

TypeError: ignored

And we can start getting 'fancy' with our Series with much of what we saw for numpy.  
We can use boolean indexing:

In [0]:
np.random.seed(2611)
s = pd.Series(np.random.randint(low=0, high=20, size=10))
s

0    18
1    18
2     1
3     6
4    16
5    15
6    19
7     0
8    17
9     9
dtype: int64

In [0]:
x = s[s > 6]
# x.loc[2] would raise a KeyError here: the label 2 was filtered out
x

0    18
1    18
4    16
5    15
6    19
8    17
9     9
dtype: int64

In [0]:
s[(s<6) | (s > 17)]

0    18
1    18
2     1
6    19
7     0
dtype: int64
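Conditions can also be combined with `&` (and) and negated with `~` (not). The sketch below reuses the values of the series above; the names `in_range` and `not_large` are only illustrative.

```python
import pandas as pd

s = pd.Series([18, 18, 1, 6, 16, 15, 19, 0, 17, 9])

# Each comparison yields a boolean Series. Wrap every comparison in parentheses,
# because &, | and ~ bind more tightly than the comparison operators.
in_range = s[(s >= 6) & (s <= 16)]   # both conditions hold
not_large = s[~(s > 6)]              # the condition does not hold

print(in_range)
print(not_large)
```

Note that the Python keywords `and`, `or` and `not` do not work element-wise here, which is why the operator forms are used.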

And we can use fancy indexing as well:

In [0]:
np.random.seed(2611)
index_lettters = [chr(ord('a') + i) for i in range(10)]
s = pd.Series(np.random.randint(low=0, high=20, size=10), index=index_lettters)
s

a    18
b    18
c     1
d     6
e    16
f    15
g    19
h     0
i    17
j     9
dtype: int64

In [0]:
s[['a', 'd', 'e']]

a    18
d     6
e    16
dtype: int64

In [0]:
s[[0, 3, 4]]

a    18
d     6
e    16
dtype: int64

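Don't forget that, in the pandas version used here, you can use both the defined index and numeric positions when your defined index is not numeric. Newer pandas releases (2.x) deprecate this positional fallback for integer keys in plain brackets; `iloc`, discussed next, is the explicit alternative.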

#### loc and iloc 
As you might notice, the fact that we can use both the implicit and explicit index can cause confusion.  
If you want to make sure that both you and the person reading your code know which index you are referring to, you can use the __loc__ and __iloc__ indexers.
* __iloc__ - refers to the implicit (positional) index.
* __loc__ - refers to the explicit index.  
If your explicit index is the same as the implicit index, these 2 indexers return the same result for single elements and lists. Slices still differ: `loc` includes the end label, while `iloc` excludes the end position.
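A minimal sketch of the slicing difference between the two indexers, using a small series like the one from the start of this section (the result names are illustrative):

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4], index=list("bcde"))

by_label = s.loc["b":"d"]    # label slice: the end label 'd' is included
by_position = s.iloc[0:2]    # positional slice: the end position 2 is excluded

print(by_label)
print(by_position)
```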

In [0]:
np.random.seed(2611)
index_lettters = [chr(ord('a') + i) for i in range(10)]
s = pd.Series(np.random.randint(low=0, high=20, size=10), index=index_lettters)
s

a    18
b    18
c     1
d     6
e    16
f    15
g    19
h     0
i    17
j     9
dtype: int64

In [0]:
s.iloc[[1, 2, 3]]

In [0]:
s.loc['a':'c']

In [0]:
# This will cause an exception.
s.loc[:4]

In [0]:
# This will cause an exception.
s.iloc['a':'d']

### DataFrame
As we saw earlier, it's easy to think of a DataFrame as a dictionary mapping column names to columns, and as we saw, we can access a column using dictionary-like bracket notation:

In [0]:
np.random.seed(1010)
index_lettters = [chr(ord('a') + i) for i in range(10)]
data = np.random.randint(low=0, high=100, size=(10, 3))
df = pd.DataFrame(data, columns=['A', 'B', 'C'], index=index_lettters)
df

    A   B   C
a  36  72  74
b  18  78  67
c  22  42  76
d  98  81  53
e  64  22  90
f  29  24  97
g  95  80  49
h  58  71  70
i  90  45  78
j  35  88  99


In [0]:
df['A']

a    36
b    18
c    22
d    98
e    64
f    29
g    95
h    58
i    90
j    35
Name: A, dtype: int64

So the question arises: how do we access rows in our dataframes? The answer is through the loc and iloc indexers. Given a single indexer, they select rows; they also accept an optional second indexer that selects columns.

In [0]:
# Accessing the rows at positions 2 and 3 (the 3rd and 4th rows)
df.iloc[[2, 3]]

    A   B   C
c  22  42  76
d  98  81  53


In [0]:
df.iloc[1:5]

    A   B   C
b  18  78  67
c  22  42  76
d  98  81  53
e  64  22  90


Let's take a look at a loc example as well:


In [0]:
df.loc['a']

A    36
B    72
C    74
Name: a, dtype: int64

In [0]:
df.loc[['a', 'b', 'c']]

    A   B   C
a  36  72  74
b  18  78  67
c  22  42  76
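Both indexers also accept a second entry that selects columns, so rows and columns can be picked in a single step. A minimal sketch on a small made-up frame (the values and the names `cell`, `block` and `corner` are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(4, 3), columns=["A", "B", "C"], index=list("abcd"))

cell = df.loc["b", "C"]               # one row label and one column label
block = df.loc["a":"c", ["A", "C"]]   # a label slice of rows and a list of columns
corner = df.iloc[:2, :2]              # the first two rows and columns, by position

print(cell)
print(block)
print(corner)
```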

Ok, so if you think you are starting to get the hang of it, now let's get you confused.  
If you use slicing or a boolean mask with bracket notation, it will work on the rows.

<img src="https://media0.giphy.com/media/3oz8xZvvOZRmKay4xy/giphy.gif?cid=790b7611c7f9d5575bd5514b74f3bc3d9757e4daf8ae570f&rid=giphy.gif" width="300">

So slicing works on the rows:

In [0]:
df[1:3]

    A   B   C
b  18  78  67
c  22  42  76


Boolean masking works on the rows:

In [0]:
mask = df['A'] > 50 # Get rows where A value is bigger than 50
df[mask]

    A   B   C
d  98  81  53
e  64  22  90
g  95  80  49
h  58  71  70
i  90  45  78


Fancy indexing, on the other hand, works on the columns:

In [0]:
df[['A', 'B']]

    A   B
a  36  72
b  18  78
c  22  42
d  98  81
e  64  22
f  29  24
g  95  80
h  58  71
i  90  45
j  35  88


To Summarize : 
If you want to access the columns:
- Standard dictionary-like access works.
- Fancy indexing with a list of column labels works as well.

If you want to access the rows:
- use iloc for the implicit index.
- use loc for the explicit index.
- use slicing with bracket notation.
- use boolean masking with bracket notation.

***
## Exercise
***

In [0]:
np.random.seed(1000)
index_lettters = [chr(ord('a') + i) for i in range(20)]
df = pd.DataFrame(np.random.randint(10, 100, size=(20, 4)), 
                  columns=['House', 'Garden', 'Shed', 'Basement'],
                  index=index_lettters)
df

__Display the *House* and *Garden* Columns__

In [0]:
# Your code starts here

# Your code ends here

__Display all the even rows__

In [0]:
# Your code starts here

# Your code ends here

__Display all the rows where the *Garden* value is bigger than 50.__

In [0]:
# Your code starts here

# Your code ends here

__Display all the rows where either the *Shed* value is bigger than 50 or the *Basement* value is lower than 50, but not both.__

In [0]:
# Your code starts here

# Your code ends here

__Display the 4th, 8th and 10th rows (don't use letters).__

In [0]:
# Your code starts here

# Your code ends here

## Universal Functions
***
We saw that numpy universal functions (ufuncs) are the magic sauce which gives us an amazing speed-up when performing element-wise operations. Since pandas uses numpy under the hood, we want to leverage this functionality with pandas as well.  
When dealing with series, numpy ufuncs work as you would expect, with the "side effect" that the index is preserved as is.

In [0]:
s = pd.Series([1, 2, 3, 4])
s

In [0]:
s**2

We can see that each of the elements was squared and our index stayed the same.  
If you have a homogeneous dataframe, you can perform the same on dataframes as well:

In [0]:
df = pd.DataFrame(np.arange(20).reshape(4, 5))
df * 10

Again, the index stays the same while each element in the matrix is multiplied by 10.  
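The arithmetic operators are not the only option: named numpy ufuncs can be applied to pandas objects directly. A minimal sketch with made-up values, where `roots` and `exponentials` are illustrative names:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 4.0, 9.0], index=list("xyz"))
roots = np.sqrt(s)           # a numpy ufunc applied to a Series returns a Series

df = pd.DataFrame(np.arange(6).reshape(2, 3), columns=["A", "B", "C"])
exponentials = np.exp(df)    # the same works on a numeric DataFrame

print(roots)
print(exponentials)
```

In both cases the result is again a pandas object carrying the original index and column labels.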

If one of the columns holds strings, operations that are not defined for strings (such as exponentiation) will fail.

In [0]:
df_with_str = pd.concat((df, pd.Series(['a', 'b', 'c', 'd'])), axis=1)
df_with_str

In [0]:
df_with_str ** 2 # This fails since we have a string column in our dataframe.
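One possible way around this failure is to restrict the operation to the numeric columns first. The sketch below uses `select_dtypes` on a small made-up frame; the column names and values are assumptions for illustration.

```python
import pandas as pd

mixed = pd.DataFrame({"n": [1, 2, 3], "x": [0.5, 1.5, 2.5], "label": ["a", "b", "c"]})

numeric = mixed.select_dtypes(include="number")   # keeps only the numeric columns
squared = numeric ** 2                            # now every column supports the operation

print(squared)
```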

## Operating on two pandas objects
If you try to perform a binary operation between 2 pandas objects, the operation is aligned on the index, and pandas fills the result with NaN wherever an index label appears in only one of the objects.

In [0]:
ser_a = pd.Series(np.arange(5), index=list('abcde'))
ser_b = pd.Series(np.arange(5, 10), index=list('gfedc'))

ser_a, ser_b

In [0]:
ser_a / ser_b

The result is a Series whose index is the union of both indices. Where both series had a value we get a float, but where only one had a value we get NaN, which is pandas' way of indicating a missing value.  
If you want to avoid these missing values, you can call the operation as an explicit method and pass the fill_value parameter, which substitutes the given value for a value that is missing on one side of the operation.

In [0]:
ser_a.div(ser_b, fill_value=1)  # If one of the values is missing, perform the action with 1.
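Since the cells above are shown without output, here is a smaller sketch of the same alignment behaviour using addition. The series `a` and `b` and their values are made up for illustration.

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=list("abc"))
b = pd.Series([10, 20, 30], index=list("bcd"))

plain = a + b                     # labels found on one side only become NaN
filled = a.add(b, fill_value=0)   # a value missing on one side is treated as 0

print(plain)
print(filled)
```

Here `plain` has NaN at the labels `a` and `d`, while `filled` keeps the single available value at those labels.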

If you try to perform an action between 2 dataframes, the same logic takes place, only this time the elements are aligned on both the column index and the row index.

In [0]:
A = pd.DataFrame(np.arange(18).reshape(6,3), columns=['A', 'B', 'C'])
A

In [0]:
B = pd.DataFrame(np.arange(4).reshape(2, 2) * 10, columns=['D', 'E'])
B

In [0]:
A + B
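A smaller sketch of two-axis alignment, with frames that overlap in only one column and two rows. The names `left`, `right` and `total` and all values are illustrative.

```python
import pandas as pd

left = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})    # rows 0, 1, 2
right = pd.DataFrame({"B": [10, 20], "C": [30, 40]})     # rows 0, 1

total = left + right
# Only cells present in both frames (column B, rows 0 and 1) hold numbers;
# every other cell in the union of rows and columns is NaN.
print(total)
```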

If we try to perform an action between a series and a dataframe, the index of the series is aligned with the columns of the dataframe, and the operation is applied to every row:

In [0]:
ser_a = pd.Series([1, 2, 3], index=list('ABC'))
ser_a

In [0]:
A - ser_a
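A minimal sketch of this behaviour, together with the method form `sub(..., axis=0)`, which aligns the series with the row labels instead. The frame, the two series and their values are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(6).reshape(2, 3), columns=["A", "B", "C"])
per_column = pd.Series([1, 2, 3], index=["A", "B", "C"])
per_row = pd.Series([10, 20], index=[0, 1])

by_columns = df - per_column        # series index matched to column labels, applied to each row
by_rows = df.sub(per_row, axis=0)   # series index matched to row labels, applied to each column

print(by_columns)
print(by_rows)
```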

***
## Exercise
***

In [0]:
np.random.seed(1002)
df_1 = pd.DataFrame(np.random.randint(0, 50, size=(20, 3)), columns=['Sunday', 'Monday', 'Tuesday'])
df_2 = pd.DataFrame(np.random.randint(0, 50, size=(10, 3)), columns=['Sunday', 'Monday', 'Tuesday'])

In [0]:
df_1

In [0]:
df_2

__Get all the values from `df_1` Sunday column squared__

In [0]:
# Your code starts here

# Your code ends here

__Get the value of `df_2` Monday + `df_1` Tuesday. Complete missing values on either side with 9.__

In [0]:
# Your code starts here

# Your code ends here

__Get all the rows from `df_2` where the *Monday* value is bigger than the *Monday* value of `df_1`__

In [0]:
# Your code starts here

# Your code ends here

## Concatenate and Append
Again, going back to numpy, we can concatenate series and dataframes, this time using the pandas concat function:

In [0]:
ser_a = pd.Series(np.arange(5))
ser_b = pd.Series(np.arange(5, 10), index=np.arange(5, 10))
ser_a, ser_b

In [0]:
pd.concat([ser_a, ser_b])

What's important to know is that when concatenating objects, pandas concat keeps the indices of the original objects. This can cause unwanted behaviour, such as duplicate index labels, so you should pay attention to it.

In [0]:
ser_a = pd.Series(np.arange(5))
ser_b = pd.Series(np.arange(5, 10))
pd.concat([ser_a, ser_b])

If you do want the new object to have a new "organized" index you can use the ignore_index flag.

In [0]:
pd.concat([ser_a, ser_b], ignore_index=True)
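To see why duplicate labels matter, here is a small sketch with two made-up series (`first` and `second` are illustrative names):

```python
import pandas as pd

first = pd.Series([0, 1, 2])
second = pd.Series([3, 4, 5])

stacked = pd.concat([first, second])                         # labels 0, 1, 2, 0, 1, 2
renumbered = pd.concat([first, second], ignore_index=True)   # labels 0 through 5

print(stacked.index.is_unique)   # False, because every label appears twice
print(stacked.loc[0].tolist())   # the label 0 now matches two values
print(renumbered.index.is_unique)
```

With duplicate labels, a lookup such as `stacked.loc[0]` returns a series rather than a single value, which is easy to miss.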

We can concat dataframes as well:

In [0]:
df_1 = pd.DataFrame(np.zeros((3, 2)), columns=['A', 'B'])
df_2 = pd.DataFrame(np.ones((4, 2)), columns=['A', 'B'])
pd.concat([df_1, df_2], axis=0)

We can also use append, similar to what we saw early on with lists. Unlike `list.append`, it returns a new object and does not modify `df_1` in place.

In [0]:
df_1.append(df_2)
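Note that `DataFrame.append` works in the pandas version used in this notebook but was later deprecated and, in pandas 2.0, removed. On recent versions the same result can be sketched with `pd.concat`; the name `combined` is illustrative.

```python
import numpy as np
import pandas as pd

df_1 = pd.DataFrame(np.zeros((3, 2)), columns=["A", "B"])
df_2 = pd.DataFrame(np.ones((4, 2)), columns=["A", "B"])

# pd.concat returns a new frame; df_1 and df_2 are left unchanged.
combined = pd.concat([df_1, df_2], ignore_index=True)

print(combined)
```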

# References

- [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/) A thorough tour into Numpy. 