# Pandas
* Let us now descend into the beauty which is pandas!
* Created by Wes McKinney while a consultant for hedge funds
* Has strong time series features
* Built on top of numpy - reusing what you know!
* Is insanely popular (read: jobs)

## Series
* There are two main structures that are almost the same, the Series, and the DataFrame
* The Series is one-dimensional data and the DataFrame is two-dimensional. Let's talk Series first, since it's a lot like a numpy array

In [None]:
import pandas as pd
pd.Series(['Alice', 'Jack', 'Molly'])

* check out that dtype. `object`
* also, what's the deal with the numbers at the front of the series list?
  * these are indexes, we had them with numpy, but with numpy they were implicit, here they seem to be explicit
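
To make the explicit index a bit more concrete, here is a minimal sketch. The labels `'a'`, `'b'`, `'c'` are made up for illustration and are not part of the lecture data.

```python
import pandas as pd

# Default: pandas supplies the positions 0, 1, 2 as an explicit index object
names = pd.Series(['Alice', 'Jack', 'Molly'])
print(names.index)

# Hypothetical labels, chosen only for this illustration
labeled = pd.Series(['Alice', 'Jack', 'Molly'], index=['a', 'b', 'c'])
print(labeled['b'])       # look up by label
print(labeled.iloc[1])    # look up by position
```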

In [None]:
pd.Series([1, 2, 3])

In [None]:
pd.Series(['Alice', 'Jack', None])

In [None]:
import numpy as np
np.array(['Alice', 'Jack', None])

In [None]:
numbers = [1, 2, None]
pd.Series(numbers)

* Notice the insertion of `np.nan` as the missing value
* What else has changed?
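
One way to answer that question is to compare the dtypes side by side. This is just a sketch that reuses the same values as the cells above.

```python
import pandas as pd

ints = pd.Series([1, 2, 3])
with_gap = pd.Series([1, 2, None])

print(ints.dtype)          # an integer dtype
print(with_gap.dtype)      # float64, because NaN is a floating point value
print(with_gap.tolist())   # the two integers are now floats as well
```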

# Thinking about NaNs!

In [None]:
np.nan == None

In [None]:
np.nan == np.nan

In [None]:
np.nan is None

In [1]:
np.isnan(np.nan)

In [None]:
None == None

In [None]:
# maps each student to a class (not a score), so name it accordingly
students_classes = {'Alice': 'Physics',
                    'Jack': 'Chemistry',
                    'Molly': 'English'}
s = pd.Series(students_classes)
s

* Wait, this is new! That index isn't just a pointer into an array! We have labels!

In [None]:
s.index

* So an index can be an object, eh? Hrm...

In [None]:
# things can get weird fast
s = pd.Series([['Physics','Chemistry'], 'Chemistry', np.arange(0,10,2)],
              index=[("Alice","Brown"), 'Jack', 24])
s

# Quick summary of what we know thus far
* the `Series` is based on numpy ndarray and shares many characteristics
* the `Series` is a one dimensional array which has an index
* the index can be seemingly anything! Same with the data!
* `np.nan != np.nan`, but `np.isnan(np.nan)` returns `True` 🤯
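
Since equality can't find a NaN, here is a small sketch of the checks that do work. `pd.isna` and `Series.isna` are standard pandas calls; the sample values are the same ones used earlier.

```python
import numpy as np
import pandas as pd

print(np.nan == np.nan)     # False, so == is useless for finding NaN
print(np.isnan(np.nan))     # True
print(pd.isna(None))        # True, pandas treats None as missing too

s = pd.Series([1, 2, None])
print(s.isna().sum())       # number of missing values in the Series
```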

# Now, the DataFrame
* this is the object you'll be using, so let's get acquainted
* it's essentially a two dimensional `Series`, which means:
  1. You can think of it as if it were a table, so it has rows and columns
  2. The rows have an index, the columns have a name. You can refer to a cell by cross referencing
  3. The rows have an order -- just keep this in mind.

* You can create a dataframe from several series objects (e.g. think of them each as a column), or from lists, dictionaries, etc. etc.
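
As a rough sketch of two of those routes, the block below builds the same small table from Series objects and from plain lists. The variable names are made up for illustration.

```python
import pandas as pd

# Hypothetical data that mirrors the student records in this notebook
names = pd.Series(['Alice', 'Jack', 'Mark'])
scores = pd.Series([85, 82, 90])

# Each Series becomes a column; the dict keys become the column names
from_series = pd.DataFrame({'Name': names, 'Score': scores})

# The same table from a dict of plain lists
from_lists = pd.DataFrame({'Name': ['Alice', 'Jack', 'Mark'],
                           'Score': [85, 82, 90]})

print(from_series.equals(from_lists))
```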

In [None]:
students = [{'Name': 'Alice', 'Class': 'Physics', 'Score': 85},
            {'Name': 'Jack', 'Class': 'Chemistry', 'Score': 82},
            {'Name': 'Mark', 'Class': 'Biology', 'Score': 90}]
df = pd.DataFrame(students, index=['U-M', 'MSU', 'U-M'])

In [None]:
df

* Oh man, that looks purdy!! HTML rendering in Jupyter FTW!

* We can extract data from the rows using the location (`.loc`) attribute
* Watch carefully....

In [None]:
df.loc["MSU"]

* Two important considerations:
1. The return value seems to be a `Series` -- neat!
2. `.loc` is **not** a function.  

In [None]:
# wtf is this loc all about?  
type(df.loc)

* [pandas/core/indexing.py](https://github.com/pandas-dev/pandas/blob/9ef67b1a88e3a4c59cdb436d49479eae0a5b32fe/pandas/core/indexing.py#L1388)

* Key takeaway: there is no magic here. Just go look it up if you want to know how and why something works the way it does.
* Other important takeaway: ~~loc()~~ is not a thing. It's loc\[\]. Think about this as a numpy array and it will make more sense -- you're just indexing into the array.
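
One thing worth poking at with this particular DataFrame: the label `'U-M'` appears twice in the index. A small sketch, rebuilding the same `df` so the block runs on its own, shows what `.loc[]` hands back in each case.

```python
import pandas as pd

students = [{'Name': 'Alice', 'Class': 'Physics', 'Score': 85},
            {'Name': 'Jack', 'Class': 'Chemistry', 'Score': 82},
            {'Name': 'Mark', 'Class': 'Biology', 'Score': 90}]
df = pd.DataFrame(students, index=['U-M', 'MSU', 'U-M'])

one = df.loc['MSU']     # label appears once -> a Series
many = df.loc['U-M']    # label appears twice -> a DataFrame with two rows

print(type(one).__name__, type(many).__name__)
```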

In [None]:
# reminder what our dataframe looks like
df

In [None]:
# we can use loc to index in (you saw this)
df.loc["MSU"]
# we can also add the second dimension, column names, to the index
df.loc["U-M","Score"]

In [None]:
# what if we want two columns?
df.loc["U-M",["Class","Score"]]

* So, `.loc` allows us to index in both dimensions of the dataframe, and allows us to slice by both index and column.
* `.loc` has a sibling though, `.iloc`. This stands for integer location. So you can slice by the row or column number
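
One difference between the two siblings that is easy to miss: a label slice includes its end point, while an integer slice does not. The sketch below uses a hypothetical frame with unique row labels so the label slice is unambiguous.

```python
import pandas as pd

# Hypothetical frame with unique row labels (not the lecture's df)
df_unique = pd.DataFrame({'Class': ['Physics', 'Chemistry', 'Biology'],
                          'Score': [85, 82, 90]},
                         index=['Alice', 'Jack', 'Mark'])

by_label = df_unique.loc['Alice':'Jack', 'Score']   # includes the 'Jack' row
by_position = df_unique.iloc[0:1, 1]                # stops before row 1

print(len(by_label), len(by_position))
```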

In [None]:
df

In [None]:
df.iloc[0,1]

In [None]:
# Oh, and slicing? Check ✅
df.iloc[0:2,0:2]

* Cool, we have a DataFrame. A two dimensional data storage object with row indexes and column names.
* We can get data out a row or column at a time, or narrow down to specific row/column combinations.
* And we can pull data out using nice labels (strings!) or integer locations.
* But the fun doesn't stop there! The devs have also set the indexing operator for the DataFrame directly as column projection!

In [None]:
df["Name"]==df.loc[:,"Name"]

Super handy. In fact, you'll use this all the time. Oh, and remember how the return value of a column is a series? Check this out...

In [None]:
df["Name"]["MSU"]

I'm going to show you something I would encourage you to never use

Please, honest, I mean, it looks nice, but it's really going to bite you later....

In [None]:
df.Name.MSU

Pandas devs expose each column name as an attribute on the DataFrame, and that attribute is used to index directly into the dataframe.🤯

Please, just forget you saw this. Squirrel it away in a dark corner of your mind and pretend it isn't there.
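
For the curious, here is a sketch of two ways the attribute style can bite. The column names `count` and `Grade` are made up for illustration.

```python
import warnings

import pandas as pd

# Hypothetical frame with a column that shares its name with a DataFrame method
df_attr = pd.DataFrame({'Name': ['Alice', 'Jack'], 'count': [1, 2]})

print(callable(df_attr.count))      # the attribute is the method, not the column
print(df_attr['count'].tolist())    # brackets always reach the column

# Assigning through a new attribute name does not create a column
with warnings.catch_warnings():
    warnings.simplefilter('ignore')  # pandas warns about exactly this mistake
    df_attr.Grade = ['A', 'B']
print('Grade' in df_attr.columns)
```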

* Operations on DataFrames rarely change the DataFrame itself; instead they tend to return a view or a copy
* For instance, you can `drop()` data from the DataFrame, but the original still has it

In [None]:
df.drop('MSU')

In [None]:
df

* It's easy to drop columns too. The norm is, instead of dropping the column, to just project the columns you want
* `df = df[['Col1', 'col2']]` (note the list inside the brackets)
* and you can see the column names with `df.columns`
* But you can also delete a column with `del df['col']`.
  * what is happening here?
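
A small sketch of both approaches, on a made-up two-row frame. As far as the Python side goes, `del df['col']` is the `del` statement calling the DataFrame's `__delitem__`, which removes the column in place.

```python
import pandas as pd

# Hypothetical frame for illustration
df_cols = pd.DataFrame({'Name': ['Alice', 'Jack'],
                        'Class': ['Physics', 'Chemistry'],
                        'Score': [85, 82]})

# Projection: pass a list of column names inside the brackets
projected = df_cols[['Name', 'Score']]
print(list(projected.columns))

# Deletion: modifies df_cols itself
del df_cols['Class']
print(list(df_cols.columns))
```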

* Many methods accept an `inplace=True` parameter, which makes them actually change the DataFrame, but it's more common to just assign the result to a new variable. Really, the only benefit to dropping is when you are *sure* you want to nuke the data.
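
The "assign to a new variable" habit looks roughly like this. The sketch rebuilds the lecture's `df` so it runs on its own; `dropped` and `no_score` are just illustrative names.

```python
import pandas as pd

students = [{'Name': 'Alice', 'Class': 'Physics', 'Score': 85},
            {'Name': 'Jack', 'Class': 'Chemistry', 'Score': 82},
            {'Name': 'Mark', 'Class': 'Biology', 'Score': 90}]
df = pd.DataFrame(students, index=['U-M', 'MSU', 'U-M'])

dropped = df.drop('MSU')                # new DataFrame without the MSU row
no_score = df.drop(columns='Score')     # same idea for a column

print(len(df), len(dropped))            # the original keeps all its rows
print('Score' in df.columns, 'Score' in no_score.columns)
```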

(Data Scientists often have a hoarding behavior....)

* Last big DataFrame manipulation insight is this: to add a column, just assign it like it's already there!

In [None]:
df["Coolness"]=["High","Low","High"]

In [None]:
df
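
The same assignment pattern also seems worth trying with a computed column or a single value. This is a sketch on a made-up frame; the `Curved` and `Term` columns are hypothetical.

```python
import pandas as pd

# Hypothetical frame for illustration
df_new = pd.DataFrame({'Name': ['Alice', 'Jack', 'Mark'],
                       'Score': [85, 82, 90]})

df_new['Curved'] = df_new['Score'] + 5   # computed from an existing column
df_new['Term'] = 'Fall'                  # a single value is repeated for every row

print(df_new)
```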