# Pandas
* Let us now descend into the beauty which is pandas!
* Created by Wes McKinney while working at AQR Capital Management, a quantitative hedge fund
* Has strong time series features
* Built on top of numpy - reusing what you know!
* Is insanely popular (read: jobs)

## Series
* There are two main structures that are almost the same, the Series, and the DataFrame
* The Series is one-dimensional data and the DataFrame is two-dimensional. Let's talk Series first; it's a lot like a numpy array

In [1]:
import pandas as pd
pd.Series(['Alice', 'Jack', 'Molly'])

0    Alice
1     Jack
2    Molly
dtype: object

* check out that dtype. `object`
* also, what's the deal with the numbers at the front of the series list?
  * these are indexes. We had them with numpy too, but there they were implicit; here they are explicit

In [2]:
pd.Series([1, 2, 3])

0    1
1    2
2    3
dtype: int64

In [3]:
pd.Series(['Alice', 'Jack', None])

0    Alice
1     Jack
2     None
dtype: object

In [4]:
import numpy as np
np.array(['Alice', 'Jack', None])

array(['Alice', 'Jack', None], dtype=object)

In [5]:
numbers = [1, 2, None]
pd.Series(numbers)

0    1.0
1    2.0
2    NaN
dtype: float64

* Notice the insertion of `np.nan` as the missing value
* What else has changed?
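One thing that changed is the dtype. Here's a minimal sketch of that upcasting. The nullable `Int64` dtype at the end is not part of the notes above; it's included only as an assumption-labelled extra to show that pandas has another way to hold a gap in integer data.

```python
import pandas as pd

# A list of ints gets an integer dtype...
ints = pd.Series([1, 2, 3])

# ...but a missing value forces an upcast to float64, because NaN is a float
with_missing = pd.Series([1, 2, None])

# Extra, not from the notes: the nullable "Int64" extension dtype keeps the
# integers as integers and marks the gap as <NA> instead of NaN
nullable = pd.Series([1, 2, None], dtype="Int64")

print(ints.dtype)          # int64
print(with_missing.dtype)  # float64
print(nullable.dtype)      # Int64
```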

# Thinking about NaNs!

In [6]:
np.nan == None

False

In [7]:
np.nan == np.nan

False

In [8]:
np.nan is None

False

In [9]:
np.isnan(np.nan)

True

In [10]:
None == None

True

In [11]:
students_classes = {'Alice': 'Physics',
                    'Jack': 'Chemistry',
                    'Molly': 'English'}
s = pd.Series(students_classes)
s

Alice      Physics
Jack     Chemistry
Molly      English
dtype: object

* Wait, this is new! That index isn't just a pointer into an array! We have labels!

In [12]:
s.index

Index(['Alice', 'Jack', 'Molly'], dtype='object')

* So an index can be an object, eh? Hrm...
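Since the index holds labels, you can use those labels to pull values out. A small sketch, rebuilding the same three-student Series so the block runs on its own:

```python
import pandas as pd

s = pd.Series({'Alice': 'Physics', 'Jack': 'Chemistry', 'Molly': 'English'})

# Look a value up by its label rather than its position
print(s['Jack'])                 # Chemistry

# The index itself can tell you where a label sits
print(s.index.get_loc('Jack'))   # 1
print('Molly' in s.index)        # True
```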

In [13]:
# things can get weird fast
s = pd.Series([['Physics','Chemistry'], 'Chemistry', np.arange(0,10,2)],
              index=[("Alice","Brown"), 'Jack', 24])
s

(Alice, Brown)    [Physics, Chemistry]
Jack                         Chemistry
24                     [0, 2, 4, 6, 8]
dtype: object

# Quick summary of what we know thus far
* the `Series` is based on numpy ndarray and shares many characteristics
* the `Series` is a one dimensional array which has an index
* the index can be seemingly anything! Same with the data!
* `np.nan != np.nan`, but `np.isnan(np.nan)` returns `True` 🤯
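That last point matters as soon as you go hunting for missing values in a Series. A short sketch of why `==` won't find them and what to reach for instead; `isna()` and `dropna()` are standard Series methods that the notes above haven't introduced yet:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, None])

# Comparing against NaN with == never matches, not even for the NaN itself
eq_mask = (s == np.nan)

# isna() is the reliable test for missing values
na_mask = s.isna()

print(eq_mask.tolist())     # [False, False, False]
print(na_mask.tolist())     # [False, False, True]
print(s.dropna().tolist())  # [1.0, 2.0]
```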

# Now, the DataFrame
* this is the object you'll be using, so let's get acquainted
* it's essentially a two dimensional `Series`, which means:
  1. You can think of it as if it were a table, so it has rows and columns
  2. The rows have an index, the columns have a name. You can refer to a cell by cross referencing
  3. The rows have an order -- just keep this in mind.

* You can create a dataframe from several series objects (e.g. think of them each as a column), or from lists, dictionaries, etc. etc.
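To make the "several series, each as a column" idea concrete, here is a minimal sketch. The variable names `names` and `scores` are made up for illustration.

```python
import pandas as pd

names = pd.Series(['Alice', 'Jack', 'Mark'])
scores = pd.Series([85, 82, 90])

# Each Series becomes a column; the dict keys become the column names
from_series = pd.DataFrame({'Name': names, 'Score': scores})

# A dict of plain lists works the same way
from_lists = pd.DataFrame({'Name': ['Alice', 'Jack', 'Mark'],
                           'Score': [85, 82, 90]})

print(from_series.shape)          # (3, 2)
print(list(from_lists.columns))   # ['Name', 'Score']
```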

In [14]:
students = [{'Name': 'Alice', 'Class': 'Physics', 'Score': 85},
            {'Name': 'Jack', 'Class': 'Chemistry', 'Score': 82},
            {'Name': 'Mark', 'Class': 'Biology', 'Score': 90}]
df = pd.DataFrame(students, index=['U-M', 'MSU', 'U-M'])

In [15]:
df

|     | Name  | Class     | Score |
|-----|-------|-----------|-------|
| U-M | Alice | Physics   | 85    |
| MSU | Jack  | Chemistry | 82    |
| U-M | Mark  | Biology   | 90    |


* Oh man, that looks purdy!! HTML rendering in Jupyter FTW!

* We can extract data from the rows using the location (`.loc`) attribute
* Watch carefully....

In [18]:
df.loc["MSU"]

Name          Jack
Class    Chemistry
Score           82
Name: MSU, dtype: object

* Two important considerations:
1. The return value seems to be a `Series` -- neat!
2. `.loc` is **not** a function.  

In [21]:
# wtf is this loc all about?  
df.loc._getitem_scalar("bob")

IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices

* https://github.com/pandas-dev/pandas/blob/9ef67b1a88e3a4c59cdb436d49479eae0a5b32fe/pandas/core/indexing.py#L1388

* Key takeaway: there is no magic here. Just go look it up if you want to know how and why something works the way it does.
* Other important takeaway: you don't look things up with `loc()`; it's `loc[]`. Think about this as a numpy array and it will make more sense -- you're just indexing into the array.
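If the square brackets still feel odd, it may help to see that they're ordinary Python indexing on the object that `.loc` returns. A sketch using a cut-down, two-row version of the student data:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Jack'], 'Score': [85, 82]},
                  index=['U-M', 'MSU'])

# Square brackets are sugar for a __getitem__ call on the indexer object
via_brackets = df.loc['MSU']
via_dunder = df.loc.__getitem__('MSU')

print(via_brackets.equals(via_dunder))  # True
print(via_brackets['Name'])             # Jack
```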

In [23]:
# reminder what our dataframe looks like
df

|     | Name  | Class     | Score |
|-----|-------|-----------|-------|
| U-M | Alice | Physics   | 85    |
| MSU | Jack  | Chemistry | 82    |
| U-M | Mark  | Biology   | 90    |


In [27]:
type(df.loc["MSU"])

pandas.core.series.Series

In [28]:
df.loc["U-M"]

|     | Name  | Class   | Score |
|-----|-------|---------|-------|
| U-M | Alice | Physics | 85    |
| U-M | Mark  | Biology | 90    |


In [31]:
df.loc["U-M","Score"]

U-M    85
U-M    90
Name: Score, dtype: int64

In [30]:
# we can use loc to index in (you saw this)
#df.loc["MSU"]
# we can also add the second dimension, column names, to the index
df.loc["U-M",["Score"]]

|     | Score |
|-----|-------|
| U-M | 85    |
| U-M | 90    |


In [32]:
# what if we want two columns?
df.loc["U-M",["Class","Score"]]

|     | Class   | Score |
|-----|---------|-------|
| U-M | Physics | 85    |
| U-M | Biology | 90    |


* So, `.loc` allows us to index in both dimensions of the dataframe, and allows us to slice by both index and column.
* `.loc` has a sibling though, `.iloc`, which stands for integer location: you select by row or column number instead of by label.
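Here are the two side by side as a quick sketch. The row labels `'a'`, `'b'`, `'c'` are hypothetical, chosen to be unique so the label slice is unambiguous. One difference to watch for: label slices include the end label, while position slices stop before the end, like Python lists.

```python
import pandas as pd

# Hypothetical frame with unique row labels
df = pd.DataFrame({'Name': ['Alice', 'Jack', 'Mark'],
                   'Score': [85, 82, 90]},
                  index=['a', 'b', 'c'])

cell_by_label = df.loc['b', 'Score']   # row label, column name
cell_by_position = df.iloc[1, 1]       # row number, column number

label_slice = df.loc['a':'b']   # label slices include the end label
pos_slice = df.iloc[0:1]        # position slices exclude the end

print(cell_by_label, cell_by_position)    # 82 82
print(len(label_slice), len(pos_slice))   # 2 1
```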

In [None]:
df

In [None]:
df.iloc[0,1]

In [33]:
# Oh, and slicing? Check ✅
df.iloc[0:2,0:2]

|     | Name  | Class     |
|-----|-------|-----------|
| U-M | Alice | Physics   |
| MSU | Jack  | Chemistry |


* Cool, we have a DataFrame. A two dimensional data storage object with row indexes and column names.
* We can get data out a row or column at a time, or narrow down to specific row/column combinations.
* And we can pull data out using nice labels (strings!) or integer locations.
* But the fun doesn't stop there! The devs have also set up the DataFrame's own indexing operator to do column projection!
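A quick sketch of column projection, rebuilding the student DataFrame so the block runs on its own. What you get back depends on whether you pass one name or a list of names:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Jack', 'Mark'],
                   'Class': ['Physics', 'Chemistry', 'Biology'],
                   'Score': [85, 82, 90]},
                  index=['U-M', 'MSU', 'U-M'])

one_col = df['Name']              # a single name gives back a Series
two_cols = df[['Name', 'Score']]  # a list of names gives back a DataFrame

print(type(one_col).__name__)   # Series
print(type(two_cols).__name__)  # DataFrame
print(list(two_cols.columns))   # ['Name', 'Score']
```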

In [34]:
df["Name"]==df.loc[:,"Name"]

U-M    True
MSU    True
U-M    True
Name: Name, dtype: bool

In [36]:
df

|     | Name  | Class     | Score |
|-----|-------|-----------|-------|
| U-M | Alice | Physics   | 85    |
| MSU | Jack  | Chemistry | 82    |
| U-M | Mark  | Biology   | 90    |


In [35]:
df["Name"]

U-M    Alice
MSU     Jack
U-M     Mark
Name: Name, dtype: object

In [38]:
df.loc[:,"Name"]

U-M    Alice
MSU     Jack
U-M     Mark
Name: Name, dtype: object

In [41]:
w=df["Name"]==df.loc[:,"Name"]

In [46]:
w.iloc[1]

True

Super handy. In fact, you'll use this all the time. Oh, and remember how the return value of a column is a series? Check this out...

In [None]:
df["Name"]["MSU"]

I'm going to show you something I would encourage you to never use

Please, honest, I mean, it looks nice, but it's really going to bite you later....

In [47]:
df.Name.MSU

'Jack'

In [49]:
df["advanced"]=[True,False,True]
df

|     | Name  | Class     | Score | advanced |
|-----|-------|-----------|-------|----------|
| U-M | Alice | Physics   | 85    | True     |
| MSU | Jack  | Chemistry | 82    | False    |
| U-M | Mark  | Biology   | 90    | True     |


In [50]:
df.advanced

U-M     True
MSU    False
U-M     True
Name: advanced, dtype: bool

In [52]:
df["advanced two"]=[1,2,3]

In [53]:
df

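|     | Name  | Class     | Score | advanced | advanced two |
|-----|-------|-----------|-------|----------|--------------|
| U-M | Alice | Physics   | 85    | True     | 1            |
| MSU | Jack  | Chemistry | 82    | False    | 2            |
| U-M | Mark  | Biology   | 90    | True     | 3            |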


In [54]:
df.advanced==df['advanced two']

U-M     True
MSU    False
U-M    False
dtype: bool

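Pandas devs expose each column name as an attribute on the DataFrame, and this is used to index directly into the dataframe.🤯 It only works when the name is a valid Python identifier that doesn't clash with something the DataFrame already has, so you can't write `df.advanced two`.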

Please, just forget you saw this. Squirrel it away in a dark corner of your mind and pretend it isn't there.
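If you want to see the bite marks before you forget it, here is a small sketch of two ways attribute access goes wrong. The column names `count` and `passed` are made up for illustration; `count` is chosen because it collides with an existing DataFrame method.

```python
import warnings
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Jack'],
                   'count': [3, 5]})

# 'count' is also a DataFrame method, so the attribute is the method, not the column
print(callable(df.count))    # True
print(df['count'].tolist())  # [3, 5]

# Assigning to a new attribute does not create a column (pandas warns about it)
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    df.passed = [True, False]
print('passed' in df.columns)  # False

# Bracket assignment always creates the column
df['passed'] = [True, False]
print('passed' in df.columns)  # True
```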

* Operations on DataFrames rarely change the DataFrame, instead they tend to return a view or a copy
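* For instance, you can `drop()` data from the DataFrame, but the original still has it: `drop()` hands back a new DataFrame and leaves `df` alone unless you assign the result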

In [60]:
df.drop('MSU')
df

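|     | Name  | Class     | Score | advanced | advanced two |
|-----|-------|-----------|-------|----------|--------------|
| U-M | Alice | Physics   | 85    | True     | 1            |
| MSU | Jack  | Chemistry | 82    | False    | 2            |
| U-M | Mark  | Biology   | 90    | True     | 3            |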


In [59]:
df

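|     | Name  | Class     | Score | advanced | advanced two |
|-----|-------|-----------|-------|----------|--------------|
| U-M | Alice | Physics   | 85    | True     | 1            |
| MSU | Jack  | Chemistry | 82    | False    | 2            |
| U-M | Mark  | Biology   | 90    | True     | 3            |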


* it's easy to drop columns too, but the norm is to project the columns you want instead of dropping the ones you don't
* `df = df[['Col1', 'col2']]` (note the list inside the indexing brackets)
* and you can get a list of the columns with `df.columns`
* But, you can also delete a column with `del df['col']`.
  * what is happening here?
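A runnable sketch of both approaches, on a cut-down version of the student data. As for what is happening with `del`: Python turns `del df['col']` into a call to the DataFrame's `__delitem__`, which removes the column from that DataFrame itself.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Jack'],
                   'Class': ['Physics', 'Chemistry'],
                   'Score': [85, 82]})

# Project the columns to keep; note the list inside the indexing brackets
kept = df[['Name', 'Score']]
print(list(kept.columns))  # ['Name', 'Score']
print(list(df.columns))    # ['Name', 'Class', 'Score']

# del removes the column from df itself (it calls df.__delitem__('Class'))
del df['Class']
print(list(df.columns))    # ['Name', 'Score']
```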

* Many methods accept an `inplace` parameter, and setting `inplace=True` actually changes the DataFrame. More common is to just assign the result to a new variable. Really, the only benefit to dropping is when you are *sure* you want to nuke the data.
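The two styles side by side, as a minimal sketch. The names `without_msu` and `scratch` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Jack', 'Mark'],
                   'Score': [85, 82, 90]},
                  index=['U-M', 'MSU', 'U-M'])

# Default behaviour: drop() returns a new DataFrame and df is untouched
without_msu = df.drop('MSU')
print(len(df), len(without_msu))  # 3 2

# With inplace=True the object itself is modified and the call returns None
scratch = df.copy()
result = scratch.drop('MSU', inplace=True)
print(result, len(scratch))  # None 2
```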

(Data Scientists often have a hoarding behavior....)

* Last big DataFrame manipulation insight is this: to add a column, just assign it like it's already there!

In [61]:
df["Coolness"]=["High","Low","High"]

In [62]:
df

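|     | Name  | Class     | Score | advanced | advanced two | Coolness |
|-----|-------|-----------|-------|----------|--------------|----------|
| U-M | Alice | Physics   | 85    | True     | 1            | High     |
| MSU | Jack  | Chemistry | 82    | False    | 2            | Low      |
| U-M | Mark  | Biology   | 90    | True     | 3            | High     |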
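
The value you assign doesn't have to be a list. As a sketch, here is a column computed from an existing one and a column filled from a single scalar; the column names `Passed` and `Term` and the cutoff of 85 are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Jack', 'Mark'],
                   'Score': [85, 82, 90]})

df['Passed'] = df['Score'] >= 85   # computed from an existing column
df['Term'] = 'Fall'                # a scalar is repeated for every row

print(df['Passed'].tolist())  # [True, False, True]
print(df['Term'].tolist())    # ['Fall', 'Fall', 'Fall']
```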


In [66]:
df["yet another col"]=None

In [69]:
df

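|     | Name  | Class     | Score | advanced | advanced two | Coolness | yet another col |
|-----|-------|-----------|-------|----------|--------------|----------|-----------------|
| U-M | Alice | Physics   | 85    | True     | 1            | High     | None            |
| MSU | Jack  | Chemistry | 82    | False    | 2            | Low      | None            |
| U-M | Mark  | Biology   | 90    | True     | 3            | High     | None            |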


In [78]:
df.drop("Class",axis='columns')

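|     | Name  | Score | advanced | advanced two | Coolness | yet another col |
|-----|-------|-------|----------|--------------|----------|-----------------|
| U-M | Alice | 85    | True     | 1            | High     | None            |
| MSU | Jack  | 82    | False    | 2            | Low      | None            |
| U-M | Mark  | 90    | True     | 3            | High     | None            |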


In [73]:
df

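|     | Name  | Class     | Score | advanced | advanced two | Coolness | yet another col |
|-----|-------|-----------|-------|----------|--------------|----------|-----------------|
| U-M | Alice | Physics   | 85    | True     | 1            | High     | None            |
| MSU | Jack  | Chemistry | 82    | False    | 2            | Low      | None            |
| U-M | Mark  | Biology   | 90    | True     | 3            | High     | None            |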
