# Pandas
* Let us now descend into the beauty which is pandas!
* Created by Wes McKinney while he was working at a hedge fund (AQR Capital Management)
* Has strong time series features
* Built on top of numpy - reusing what you know!
* Is insanely popular (read: jobs)

## Series
* There are two main structures that are almost the same: the Series and the DataFrame
* The Series is one dimensional data, the DataFrame is two dimensional. Let's talk Series first, since it's a lot like a numpy array

In [2]:
# A Series is one dimensional, a DataFrame is two dimensional
import pandas as pd
pd.Series(['Alice', 'Jack', 'Molly'])  # the 0, 1, 2 on the left are the index: explicit here, implicit in numpy

0    Alice
1     Jack
2    Molly
dtype: object

* check out that dtype. `object`
* also, what's the deal with the numbers at the front of the series list?
  * these are indexes, we had them with numpy, but with numpy they were implicit, here they seem to be explicit

In [3]:
pd.Series([1, 2, 3])  # the index is in the first column and the values are in the second column

0    1
1    2
2    3
dtype: int64

In [4]:
pd.Series(['Alice', 'Jack', None]) #you are able to mix types

0    Alice
1     Jack
2     None
dtype: object

In [5]:
import numpy as np  # numpy, for comparison
np.array(['Alice', 'Jack', None])  # a numpy array of the same data also gets dtype=object, much like the pd.Series

array(['Alice', 'Jack', None], dtype=object)

In [6]:
numbers = [1, 2, None]
pd.Series(numbers)  # with numeric data, pandas replaces None with NaN

0    1.0
1    2.0
2    NaN
dtype: float64

* Notice the insertion of `np.nan` as the missing value
* What else has changed?
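
One way to answer that question is to compare the dtypes directly. This is a minimal sketch; the names `complete` and `with_gap` are made up for illustration.

```python
import numpy as np
import pandas as pd

complete = pd.Series([1, 2, 3])
with_gap = pd.Series([1, 2, None])

print(complete.dtype)   # int64
print(with_gap.dtype)   # float64

# NaN is itself a float, so the integers are upcast to floats to sit beside it
print(type(np.nan))     # <class 'float'>
```

So the missing value changed the type of every element, not just the one that was missing.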

## Thinking about NaNs!

In [7]:
import numpy as np
np.nan == None    #returns False
np.nan is None    #returns False
None == None      #returns True
np.nan == np.nan  #returns False
np.isnan(np.nan)  #returns True

True
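
Since equality checks can't find a NaN, a practical alternative is `pd.isna`, which treats both `None` and `np.nan` as missing. A small sketch, with an illustrative `values` list:

```python
import numpy as np
import pandas as pd

values = [1.0, np.nan, None]

# Equality never finds NaN, because NaN is not equal to anything (itself included)
found_by_equality = [v == np.nan for v in values]

# pd.isna reports both None and NaN as missing
found_by_isna = [pd.isna(v) for v in values]

print(found_by_equality)  # [False, False, False]
print(found_by_isna)      # [False, True, True]
```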

## More on Series

In [8]:
students_scores = {'Alice': 'Physics',
                   'Jack': 'Chemistry',
                   'Molly': 'English'}  #created a dictionary
s = pd.Series(students_scores)  # build a Series from the dictionary: the keys (Alice, Jack, Molly) become the index labels
s

Alice      Physics
Jack     Chemistry
Molly      English
dtype: object

* Wait, this is new! That index isn't just a pointer into an array! We have labels!

In [9]:
s.index #get an index object showing the labels of the names from the dictionary

Index(['Alice', 'Jack', 'Molly'], dtype='object')

* So an index can be an object? Hmm...
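
Labels mean a value can be looked up two ways: by its label or by its position. A minimal sketch, rebuilding the same Series so the snippet runs on its own:

```python
import pandas as pd

s = pd.Series({'Alice': 'Physics', 'Jack': 'Chemistry', 'Molly': 'English'})

by_label = s.loc['Jack']     # look up by index label
by_position = s.iloc[1]      # look up by integer position
is_member = 'Molly' in s     # membership tests check the index, like dictionary keys

print(by_label, by_position, is_member)  # Chemistry Chemistry True
```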

In [10]:
# Things can get weird fast (an extreme example).
# The data is a list of three elements: a list of strings, a plain string, and a numpy array (0 to 10, stepping by 2).
# The index holds a tuple of strings, the string 'Jack', and the integer 24.
s = pd.Series([['Physics', 'Chemistry'], 'Chemistry', np.arange(0, 10, 2)],
              index=[('Alice', 'Brown'), 'Jack', 24])
s

(Alice, Brown)    [Physics, Chemistry]
Jack                         Chemistry
24                     [0, 2, 4, 6, 8]
dtype: object

# Quick summary of what we know thus far
* the `Series` is based on numpy ndarray and shares many characteristics
* the `Series` is a one dimensional array which has an index
* the index can be seemingly anything! Same with the data!
* `np.nan != np.nan`, but `np.isnan(np.nan)` returns `True` 🤯
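
To make the first point concrete, here is a short sketch of numpy-style operations on a Series. The `scores` data is illustrative and not taken from the cells above.

```python
import numpy as np
import pandas as pd

scores = pd.Series([85, 82, 90], index=['Alice', 'Jack', 'Mark'])

curved = scores + 5              # element-wise arithmetic, as with an ndarray
above_83 = scores[scores > 83]   # boolean masking works too, and keeps the labels
raw = scores.to_numpy()          # the underlying values as a numpy array

print(curved.tolist())               # [90, 87, 95]
print(above_83.index.tolist())       # ['Alice', 'Mark']
print(isinstance(raw, np.ndarray))   # True
```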

# Now, the DataFrame
* this is the object you'll be using, so let's get acquainted
* it's essentially a two dimensional `Series`, which means:
  1. You can think of it as if it were a table, so it has rows and columns
  2. The rows have an index, the columns have a name. You can refer to a cell by cross referencing
  3. The rows have an order -- just keep this in mind.

* You can create a dataframe from several series objects (e.g. think of them each as a column), or from lists, dictionaries, etc. etc.
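
As a sketch of the "several series objects" route, each Series below becomes one column. The variable names are illustrative.

```python
import pandas as pd

names = pd.Series(['Alice', 'Jack', 'Mark'])
classes = pd.Series(['Physics', 'Chemistry', 'Biology'])
scores = pd.Series([85, 82, 90])

# Each dictionary key becomes a column name; rows are aligned on the Series index
df = pd.DataFrame({'Name': names, 'Class': classes, 'Score': scores})

print(df.shape)             # (3, 3)
print(list(df.columns))     # ['Name', 'Class', 'Score']
print(df.loc[1, 'Class'])   # Chemistry
```

The last line is the "cross referencing" idea: row label `1` and column name `'Class'` together pick out one cell.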

In [4]:
import pandas as pd
import numpy as np

students = [{'Name': 'Alice', 'Class': 'Physics', 'Score': 85},
            {'Name': 'Jack', 'Class': 'Chemistry', 'Score': 82},
            {'Name': 'Mark', 'Class': 'Biology', 'Score': 90}] 
df = pd.DataFrame(students, index=['U-M', 'MSU', 'U-M'])  # pd.DataFrame is the constructor: the list of dictionaries is the data, and the optional index= sets the row labels
df  # the index values are the row names, and the dictionary keys are the column names

|     | Name  | Class     | Score |
|-----|-------|-----------|-------|
| U-M | Alice | Physics   | 85    |
| MSU | Jack  | Chemistry | 82    |
| U-M | Mark  | Biology   | 90    |


* That looks nice! HTML rendering in Jupyter ftw!

In [5]:
df["Name"]  # a single column name in the brackets gives you a Series, not a DataFrame
type(df['Name'])  # pandas.core.series.Series

pandas.core.series.Series

Super handy. In fact, you'll use this all the time. Oh, and remember how the return value of a column is a series? Check this out...

In [7]:
type(df[['Name', 'Class']])  # pandas.core.frame.DataFrame
df[['Name', 'Class']]  # a list inside the indexing brackets selects those columns and gives you a DataFrame

|     | Name  | Class     |
|-----|-------|-----------|
| U-M | Alice | Physics   |
| MSU | Jack  | Chemistry |
| U-M | Mark  | Biology   |


In [8]:
type(df['Name']['MSU'])  # str
df['Name']['MSU']  # df['Name'] is a Series that you can index into using ['MSU']

'Jack'
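
The same cell can also be reached in one step with `.loc[row, column]`, which is not used in the cells above. The sketch below rebuilds the DataFrame so it runs on its own, and shows what happens with the duplicated `'U-M'` label.

```python
import pandas as pd

students = [{'Name': 'Alice', 'Class': 'Physics', 'Score': 85},
            {'Name': 'Jack', 'Class': 'Chemistry', 'Score': 82},
            {'Name': 'Mark', 'Class': 'Biology', 'Score': 90}]
df = pd.DataFrame(students, index=['U-M', 'MSU', 'U-M'])

one_cell = df.loc['MSU', 'Name']   # the label appears once: a single value
several = df.loc['U-M', 'Name']    # the label appears twice: a Series with both matches

print(one_cell)           # Jack
print(several.tolist())   # ['Alice', 'Mark']
```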

In [10]:
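# df[0] raises a KeyError: a single value in plain brackets is looked up as a column name
df[0:1]  # a slice in plain brackets refers to rows, so this gets the first row
df[:2]   # gets the first two rows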

|     | Name  | Class     | Score |
|-----|-------|-----------|-------|
| U-M | Alice | Physics   | 85    |
| MSU | Jack  | Chemistry | 82    |
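
If you want rows by position without relying on slice syntax, `.iloc` is one option. A minimal sketch with a cut-down version of the same data:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Jack', 'Mark'],
                   'Score': [85, 82, 90]},
                  index=['U-M', 'MSU', 'U-M'])

first_row = df.iloc[0]      # one row by position, returned as a Series
first_two = df.iloc[0:2]    # a positional slice, returned as a DataFrame

try:
    df[0]                   # plain brackets look for a column named 0
except KeyError:
    print('no column named 0')

print(first_row['Name'])          # Alice
print(first_two.index.tolist())   # ['U-M', 'MSU']
```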


## Assignments
* Last big DataFrame manipulation insight is this: to add a column, just assign it like it's already there!

In [23]:
df['Coolness'] = ['High', 'Low', 'Medium']  # creates a new column named Coolness, with one value per row

In [24]:
df

|     | Name  | Class     | Score | Coolness |
|-----|-------|-----------|-------|----------|
| U-M | Alice | Physics   | 85    | High     |
| MSU | Jack  | Chemistry | 82    | Low      |
| U-M | Mark  | Biology   | 90    | Medium   |


In [25]:
df['yet another column'] = None  # creates another column; a single value is broadcast, so every row in the column gets that value

In [26]:
df

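|     | Name  | Class     | Score | Coolness | yet another column |
|-----|-------|-----------|-------|----------|--------------------|
| U-M | Alice | Physics   | 85    | High     | None               |
| MSU | Jack  | Chemistry | 82    | Low      | None               |
| U-M | Mark  | Biology   | 90    | Medium   | None               |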
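
A new column can also be computed from an existing one. In this sketch the `Passed` and `Term` columns are hypothetical, made up for the example:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Jack', 'Mark'],
                   'Score': [85, 82, 90]})

df['Passed'] = df['Score'] >= 85   # computed element-wise from another column
df['Term'] = 'Fall'                # a single value is broadcast to every row

print(df['Passed'].tolist())   # [True, False, True]
print(df['Term'].tolist())     # ['Fall', 'Fall', 'Fall']
```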


In [27]:
df.drop('Class', axis='columns')  # drops the Class column; you specify the axis so drop() knows to look for the label among the columns

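|     | Name  | Score | Coolness | yet another column |
|-----|-------|-------|----------|--------------------|
| U-M | Alice | 85    | High     | None               |
| MSU | Jack  | 82    | Low      | None               |
| U-M | Mark  | 90    | Medium   | None               |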


In [28]:
df  # Class is still here: drop() returned a new DataFrame and left the original unchanged

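|     | Name  | Class     | Score | Coolness | yet another column |
|-----|-------|-----------|-------|----------|--------------------|
| U-M | Alice | Physics   | 85    | High     | None               |
| MSU | Jack  | Chemistry | 82    | Low      | None               |
| U-M | Mark  | Biology   | 90    | Medium   | None               |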


## Persisting Operations

* Operations on DataFrames rarely change the DataFrame itself; instead they tend to return a view or a copy
* For instance, you can `drop()` data from a DataFrame, but the original still has it

* it's easy to drop columns too, but the norm is to project the columns you want rather than drop the ones you don't
* `df = df[['Col1', 'col2']]`
* and you can get a list of the columns with `df.columns`
* But, you can also delete a column with `del df['col']`.
  * what is happening here?
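
One way to see the difference is to try both on a small DataFrame. A minimal sketch with illustrative data:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Jack'],
                   'Class': ['Physics', 'Chemistry'],
                   'Score': [85, 82]})

dropped = df.drop('Class', axis='columns')   # returns a new DataFrame
print(list(df.columns))        # ['Name', 'Class', 'Score']
print(list(dropped.columns))   # ['Name', 'Score']

del df['Score']                # the del statement calls df.__delitem__('Score')
print(list(df.columns))        # ['Name', 'Class']
```

Unlike `drop()`, `del` modifies the DataFrame it is applied to, so there is nothing to assign back.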

In [29]:
df = df.drop('MSU')  # to make a drop permanent, assign the result back to df; with no axis given, drop() removes rows by label
df

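|     | Name  | Class   | Score | Coolness | yet another column |
|-----|-------|---------|-------|----------|--------------------|
| U-M | Alice | Physics | 85    | High     | None               |
| U-M | Mark  | Biology | 90    | Medium   | None               |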


In [30]:
df = df[['Name', 'Class', 'Score', 'Coolness']]  # you can also pick out just the columns you want and assign the result back to df
df  # now df refers to the new DataFrame

|     | Name  | Class   | Score | Coolness |
|-----|-------|---------|-------|----------|
| U-M | Alice | Physics | 85    | High     |
| U-M | Mark  | Biology | 90    | Medium   |


* Many methods accept an `inplace` parameter; setting `inplace=True` changes the DataFrame itself instead of returning a new one. More common, though, is to just make views into new variables. Really, the only benefit to dropping is when you are *sure* you want to nuke the data.

(Data Scientists often have a hoarding behavior....)
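
One thing to watch for with `inplace=True` is the return value. A small sketch with illustrative data:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Mark'],
                   'Coolness': ['High', 'Medium']})

result = df.drop('Coolness', axis='columns', inplace=True)

print(result)             # None
print(list(df.columns))   # ['Name']
```

Because nothing is returned, writing `df = df.drop(..., inplace=True)` would replace `df` with `None`. Pick one style or the other, not both.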

In [1]:
df.drop('Coolness', axis=1, inplace=True)  # inplace=True changes df itself, so no reassignment is needed; pandas may show a warning here because df was selected from another DataFrame

In [32]:
df  # Coolness is now gone

|     | Name  | Class   | Score |
|-----|-------|---------|-------|
| U-M | Alice | Physics | 85    |
| U-M | Mark  | Biology | 90    |
