# Pandas
* Let us now descend into the beauty which is pandas!
* Created by Wes McKinney while working at the hedge fund AQR Capital Management
* Has strong time series features
* Built on top of numpy - reusing what you know!
* Is insanely popular (read: jobs)

## Series
* There are two main structures that are almost the same, the Series, and the DataFrame
* The Series is one-dimensional data; the DataFrame is two-dimensional. Let's talk Series first, since it's a lot like a numpy array

In [1]:
import pandas as pd
pd.Series(['Alice', 'Jack', 'Molly'])

0    Alice
1     Jack
2    Molly
dtype: object

* check out that dtype. `object`
* also, what's the deal with the numbers at the front of the series list?
  * these are indexes, we had them with numpy, but with numpy they were implicit, here they seem to be explicit

In [2]:
pd.Series([1, 2, 3])

0    1
1    2
2    3
dtype: int64

In [4]:
pd.Series(['Alice', 'Jack', None])

0    Alice
1     Jack
2     None
dtype: object

In [6]:
import numpy as np
np.array(['Alice', 'Jack', None])

array(['Alice', 'Jack', None], dtype=object)

In [7]:
numbers = [1, 2, None]
pd.Series(numbers)

0    1.0
1    2.0
2    NaN
dtype: float64

* Notice the insertion of `np.nan` as the missing value
* What else has changed?
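
One way to answer that question is to compare the dtypes directly. This is a minimal sketch, assuming default pandas behaviour (no nullable integer dtype requested); the variable names `ints` and `with_missing` are only for illustration.

```python
import pandas as pd

# No missing values: the integers stay integers.
ints = pd.Series([1, 2, 3])

# One None: pandas stores it as NaN, and NaN is a float,
# so the whole Series is upcast to a float dtype.
with_missing = pd.Series([1, 2, None])

print(ints.dtype)
print(with_missing.dtype)
```

The values `1` and `2` became `1.0` and `2.0` because a numpy array holds a single dtype, and the float dtype is the one that can represent NaN.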

# Thinking about NaNs!

In [9]:
np.nan == None

False

In [10]:
np.nan == np.nan

False

In [13]:
np.nan is None

False

In [11]:
np.isnan(np.nan)

True

In [12]:
None == None

True

In [15]:
# maps each student's name to the class they take
students_classes = {'Alice': 'Physics',
                    'Jack': 'Chemistry',
                    'Molly': 'English'}
s = pd.Series(students_classes)
s

Alice      Physics
Jack     Chemistry
Molly      English
dtype: object

* Wait, this is new! That index isn't just a pointer into an array! We have labels!

In [None]:
s.index

* So an index can be an object, eh? Hrm...
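
A short sketch of what labels buy you, rebuilding the same Series from a dictionary. The name `by_name` is only for illustration.

```python
import pandas as pd

by_name = pd.Series({'Alice': 'Physics',
                     'Jack': 'Chemistry',
                     'Molly': 'English'})

# The index is its own object holding the labels.
print(list(by_name.index))

# Labels work much like dictionary keys for lookup.
print(by_name['Jack'])
```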

In [24]:
# things can get weird fast
s = pd.Series([['Physics','Chemistry'], 'Chemistry', np.arange(0,10,2)],
              index=[("Alice","Brown"), 'Jack', 24])
s

(Alice, Brown)    [Physics, Chemistry]
Jack                         Chemistry
24                     [0, 2, 4, 6, 8]
dtype: object

# Quick summary of what we know thus far
* the `Series` is based on numpy ndarray and shares many characteristics
* the `Series` is a one dimensional array which has an index
* the index can be seemingly anything! Same with the data!
* `np.nan != np.nan`, but `np.isnan(np.nan)` returns `True` 🤯
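
Since `==` can never find a NaN, here is a minimal sketch of how missing values can be located in a `Series` instead, using `isna()`. The variable names are only for illustration.

```python
import numpy as np
import pandas as pd

s_missing = pd.Series([1, 2, None])

# Comparing with == never matches NaN, because NaN != NaN.
eq_result = (s_missing == np.nan).tolist()
print(eq_result)   # [False, False, False]

# isna() is the reliable way to locate missing values.
isna_result = s_missing.isna().tolist()
print(isna_result)  # [False, False, True]

# pd.isna treats both None and NaN as missing.
print(pd.isna(None), pd.isna(np.nan))
```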

# Now, the DataFrame
* this is the object you'll be using, so let's get acquainted
* it's essentially a two dimensional `Series`, which means:
  1. You can think of it as if it were a table, so it has rows and columns
  2. The rows have an index, the columns have a name. You can refer to a cell by cross referencing
  3. The rows have an order -- just keep this in mind.

* You can create a dataframe from several series objects (e.g. think of them each as a column), or from lists, dictionaries, etc. etc.
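
As a sketch of the "several series" route, each `Series` below becomes one column. The names `names`, `classes`, `scores`, and `df_from_series` are only for illustration.

```python
import pandas as pd

names = pd.Series(['Alice', 'Jack', 'Mark'])
classes = pd.Series(['Physics', 'Chemistry', 'Biology'])
scores = pd.Series([85, 82, 90])

# Each dictionary key becomes a column name;
# the Series are lined up on their (default 0, 1, 2) index.
df_from_series = pd.DataFrame({'Name': names,
                               'Class': classes,
                               'Score': scores})

print(df_from_series.shape)
print(list(df_from_series.columns))
```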

In [26]:
students = [{'Name': 'Alice', 'Class': 'Physics', 'Score': 85},
            {'Name': 'Jack', 'Class': 'Chemistry', 'Score': 82},
            {'Name': 'Mark', 'Class': 'Biology', 'Score': 90}]
df = pd.DataFrame(students, index=['U-M', 'MSU', 'U-M'])

In [27]:
df

|     | Name  | Class     | Score |
|-----|-------|-----------|-------|
| U-M | Alice | Physics   | 85    |
| MSU | Jack  | Chemistry | 82    |
| U-M | Mark  | Biology   | 90    |


* Oh man, that looks purdy!! HTML rendering in Jupyter FTW!

* We can extract data from the rows using the location (`.loc`) attribute
* Watch carefully....

In [28]:
df.loc["MSU"]

Name          Jack
Class    Chemistry
Score           82
Name: MSU, dtype: object

* Two important considerations:
  1. The return value seems to be a `Series` -- neat!
  2. `.loc` is **not** a function.

In [29]:
# wtf is this loc all about?  
type(df.loc)

pandas.core.indexing._LocIndexer

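* [https://github.com/pandas-dev/pandas/blob/9ef67b1a88e3a4c59cdb436d49479eae0a5b32fe/pandas/core/indexing.py#L1388](https://github.com/pandas-dev/pandas/blob/9ef67b1a88e3a4c59cdb436d49479eae0a5b32fe/pandas/core/indexing.py#L1388)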

* Key takeaway: there is no magic here. Just go look it up if you want to know how and why something works the way it does.
* Other important takeaway: ~~loc()~~ is not a thing. It's loc\[\]. Think about this as a numpy array and it will make more sense -- you're just indexing into the array.
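
To see why square brackets work on an attribute, here is a toy sketch in plain Python. `LabelIndexer` and `TinyFrame` are hypothetical classes made up for this illustration; this is not how pandas implements `_LocIndexer`, only the same general idea of an object that defines `__getitem__`.

```python
class LabelIndexer:
    """Hypothetical stand-in for an indexer object (not pandas' implementation)."""

    def __init__(self, data):
        self._data = data

    def __getitem__(self, key):
        # Python calls this method whenever you write obj[key].
        return self._data[key]


class TinyFrame:
    """Hypothetical container that exposes an indexer as an attribute."""

    def __init__(self, rows):
        self.loc = LabelIndexer(rows)


tf = TinyFrame({'MSU': {'Name': 'Jack', 'Score': 82}})
print(tf.loc['MSU'])
```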

In [31]:
# reminder what our dataframe looks like
df

|     | Name  | Class     | Score |
|-----|-------|-----------|-------|
| U-M | Alice | Physics   | 85    |
| MSU | Jack  | Chemistry | 82    |
| U-M | Mark  | Biology   | 90    |


In [33]:
# we can use loc to index in (you saw this)
df.loc["MSU"]
# we can also add the second dimension, column names, to the index
df.loc["U-M","Score"]

U-M    85
U-M    90
Name: Score, dtype: int64

In [34]:
# what if we want two columns?
df.loc["U-M",["Class","Score"]]

|     | Class   | Score |
|-----|---------|-------|
| U-M | Physics | 85    |
| U-M | Biology | 90    |


* So, `.loc` allows us to index in both dimensions of the dataframe, and allows us to slice by both index and column.
* `.loc` has a sibling though, `.iloc`. This stands for integer location, so you can index and slice by row or column number instead of by label.
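
A side-by-side sketch of the two indexers on the same dataframe as above. The names `by_label` and `by_position` are only for illustration.

```python
import pandas as pd

students = [{'Name': 'Alice', 'Class': 'Physics', 'Score': 85},
            {'Name': 'Jack', 'Class': 'Chemistry', 'Score': 82},
            {'Name': 'Mark', 'Class': 'Biology', 'Score': 90}]
df = pd.DataFrame(students, index=['U-M', 'MSU', 'U-M'])

# Label-based: the row labelled 'MSU', the column named 'Score'.
by_label = df.loc['MSU', 'Score']

# Position-based: second row, third column (counting from 0).
by_position = df.iloc[1, 2]

print(by_label, by_position)

# A repeated label returns every matching row;
# an integer position always refers to exactly one row.
print(len(df.loc['U-M']))
print(df.iloc[0]['Name'])
```

Both lookups reach the same cell here; which one to use depends on whether you know the label or the position.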

In [36]:
df

|     | Name  | Class     | Score |
|-----|-------|-----------|-------|
| U-M | Alice | Physics   | 85    |
| MSU | Jack  | Chemistry | 82    |
| U-M | Mark  | Biology   | 90    |


In [37]:
df.iloc[0,1]

'Physics'

In [38]:
# Oh, and slicing? Check ✅
df.iloc[0:2,0:2]

|     | Name  | Class     |
|-----|-------|-----------|
| U-M | Alice | Physics   |
| MSU | Jack  | Chemistry |
