# Pandas
Pandas is a library built on NumPy that provides labeled data structures for manipulating and analyzing tabular and numerical data.
## Index
* [Series](#Series)
    * [Creating a Series](#Creating-a-Series)
    * [Operations on Series](#Using-Operations-on-Series)
    * [Series and Dictionaries](#Series-and-Dictionaries)
    * [Name Attribute](#Name-Attribute)
* [DataFrames](#DataFrames)
    * [Creating DataFrames](#Creating-DataFrames)
    * [Columns](#Columns)
    * [Nested Dictionaries in DataFrames](#Nested-Dictionaries-in-DataFrames)
* [Index Objects](#Index-Objects)

### Resources
* [Pandas Tutorial](https://www.geeksforgeeks.org/pandas-tutorial/)
* [Learn Python Pandas](https://www.tutorialspoint.com/python_pandas/index.htm)
* [Python Pandas Tutorial](https://pythonexamples.org/pandas-examples/)
* [Pandas: Data Analysis in Python](https://www.youtube.com/playlist?list=PL-osiE80TeTsWmV9i9c58mdDCSskIFdDS)
* [Getting Started](https://pandas.pydata.org/getting_started.html)

In [1]:
import pandas as pd
from pandas import Series, DataFrame

## Series
Pandas series are 1D array-like objects that contain a sequence of values with an associated index.
### Creating a Series
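A Series can be created by passing a sequence of values to the `pd.Series` constructor. If no index is given, a default integer index starting at 0 is used.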

In [2]:
obj = pd.Series([4, 7, -5, 3])
print('Series:')
print(obj)  # String representation (shows index)
print('Values:', obj.values)  # Array representation

Series:
0    4
1    7
2   -5
3    3
dtype: int64
Values: [ 4  7 -5  3]


Custom labels for each data point can be supplied with the `index=` parameter.

In [3]:
obj2 = pd.Series([4, 7, -5, 3], index=['d','b','a','c'])
print(obj2)
print(obj2.index)

d    4
b    7
a   -5
c    3
dtype: int64
Index(['d', 'b', 'a', 'c'], dtype='object')


You can use the labels in the index to select values within a Series. Below, `['a','d']` is interpreted as a list of index labels.

In [4]:
print(obj2['c'])
print()
print(obj2[['a', 'd']])

3

a   -5
d    4
dtype: int64


### Using Operations on Series
Operations such as filtering with boolean arrays, scalar multiplication, math functions, and NumPy functions preserve the index-value link.

In [5]:
print(obj2[obj2 > 0])  # Outputs all values > 0
print()

print(obj2 * 2)  # Multiplies all values by 2
print()

import numpy as np
print(np.exp(obj2))  # Numpy function on series

d    4
b    7
c    3
dtype: int64

d     8
b    14
a   -10
c     6
dtype: int64

d      54.598150
b    1096.633158
a       0.006738
c      20.085537
dtype: float64


### Series and Dictionaries
A Series can also be seen as a fixed-length, ordered dict, since it maps index values to data values.

In [6]:
print('b' in obj2)
print('z' in obj2)

True
False
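
The dict analogy extends to a few other methods. As a minimal sketch, rebuilding `obj2` so the block runs on its own, `get`, `items`, and `to_dict` behave much like their dict counterparts:

```python
import pandas as pd

obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

# Dict-style lookup with a fallback for a label that is not in the index
missing = obj2.get('z', 0)

# Iterate over (label, value) pairs, like dict.items()
pairs = list(obj2.items())

# Convert the Series back to a plain Python dict
as_dict = obj2.to_dict()

print(missing)
print(pairs)
print(as_dict)
```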


You can create a Series from a dictionary by passing it to `pd.Series()`. The dict keys become the index.

In [7]:
sdata = {'Ohio': 35000, 'Texas':71000, 'Oregon':16000, 'Utah':5000}
obj3 = pd.Series(sdata)
obj3

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

You can override the order of the dict's keys by passing the `index=` parameter. Keys that are not listed in the index (here, Utah) are dropped.

In [8]:
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = pd.Series(sdata, index=states)
obj4

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

Because California has no associated value, it appears as NaN (Not a Number). The `isnull` and `notnull` functions can be used to detect missing data.

In [9]:
print(pd.isnull(obj4))  # Outputs True when there is missing data
print()
print(pd.notnull(obj4))  # Outputs False when there is missing data

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool
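
`isnull` and `notnull` are also available as Series methods. A small sketch, rebuilding `obj4` so the block runs on its own, that counts the missing entries and filters them out:

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = pd.Series(sdata, index=states)

# Boolean mask: True where the value is missing
mask = obj4.isnull()

# True counts as 1, so summing the mask counts the missing entries
count_missing = int(mask.sum())

# Keep only the entries that have a value
present = obj4[obj4.notnull()]

print(count_missing)
print(present)
```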


Data alignment: in arithmetic operations, Series are automatically aligned by index label. A label that is missing from either operand produces NaN in the result.

In [10]:
print(obj3 + obj4)

California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64
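
The alignment above can be inspected directly. The sketch below rebuilds `obj3` and `obj4`, checks that the result is indexed by the union of both indexes, and then uses the `add` method with `fill_value=0`, which is one possible way to treat a label that is missing from only one side as zero:

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = pd.Series(sdata)
obj4 = pd.Series(sdata, index=['California', 'Ohio', 'Oregon', 'Texas'])

total = obj3 + obj4

# The result carries every label that appears in either Series
labels = set(total.index)

# 'Utah' is absent from obj4, so its sum is NaN
utah_missing = bool(pd.isnull(total['Utah']))

# With fill_value, a value missing from one side only is replaced by 0.
# 'California' is missing on both sides, so it stays NaN.
filled = obj3.add(obj4, fill_value=0)

print(labels)
print(utah_missing)
print(filled)
```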


### Name Attribute
Both a Series and its index have a `name` attribute, which integrates with other key areas of pandas functionality.

In [11]:
obj4.name = 'Population'
obj4.index.name = 'State'
print(obj4)

State
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
Name: Population, dtype: float64


The index can also be altered in-place by assignment:

In [12]:
print(obj)
print()
obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']
print(obj)

0    4
1    7
2   -5
3    3
dtype: int64

Bob      4
Steve    7
Jeff    -5
Ryan     3
dtype: int64
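
Assigning a new index only works when the number of labels matches the number of values. A minimal sketch of what happens otherwise (the three-label list is made up for illustration):

```python
import pandas as pd

obj = pd.Series([4, 7, -5, 3])

try:
    # Three labels for four values: pandas rejects the assignment
    obj.index = ['Bob', 'Steve', 'Jeff']
    raised = False
except ValueError as exc:
    raised = True
    print('ValueError:', exc)

# The original index is left unchanged after the failed assignment
print(list(obj.index))
```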


## DataFrames
A pandas DataFrame is a mutable, 2D tabular data structure with labeled rows and columns. Each column can hold a different data type.

### Creating DataFrames
DataFrames can be created from lists, dictionaries, Series, and ndarrays. The example below constructs a DataFrame from a dict of equal-length lists.

In [13]:
data = {'State':['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'Year':[2000,2001,2002,2001,2002,2003],
        'Pop':[1.5,1.7,3.6,2.4,2.9,3.2]}
frame = pd.DataFrame(data)
frame

    State  Year  Pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9
5  Nevada  2003  3.2
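
The other input types mentioned above work in a similar way. The following sketch, using small made-up values, builds a DataFrame from a list of rows, from a 2D ndarray, and from a dict of Series:

```python
import numpy as np
import pandas as pd

# From a list of rows; column names are supplied separately
from_rows = pd.DataFrame([['Ohio', 2000, 1.5], ['Nevada', 2001, 2.4]],
                         columns=['State', 'Year', 'Pop'])

# From a 2D ndarray; each array column becomes a DataFrame column
from_array = pd.DataFrame(np.arange(6).reshape(3, 2), columns=['a', 'b'])

# From a dict of Series; rows are aligned on the Series labels,
# and a label missing from one Series yields NaN in that column
from_series = pd.DataFrame({
    'Pop': pd.Series([1.5, 2.4], index=['Ohio', 'Nevada']),
    'Debt': pd.Series([0.3], index=['Ohio']),
})

print(from_rows)
print(from_array)
print(from_series)
```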


The `head` method returns only the first rows of a DataFrame (five by default), which is useful for previewing a large DataFrame.

In [14]:
frame.head()

    State  Year  Pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9
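
`head` also accepts the number of rows to return, and `tail` does the same from the end. A short sketch, rebuilding `frame` so the block runs on its own:

```python
import pandas as pd

data = {'State': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'Year': [2000, 2001, 2002, 2001, 2002, 2003],
        'Pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)

first_two = frame.head(2)  # First two rows
last_two = frame.tail(2)   # Last two rows

print(first_two)
print(last_two)
```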


### Columns

Specifying a sequence of columns will arrange the DataFrame in that order.

In [15]:
pd.DataFrame(data, columns=['Year', 'State', 'Pop'])

   Year   State  Pop
0  2000    Ohio  1.5
1  2001    Ohio  1.7
2  2002    Ohio  3.6
3  2001  Nevada  2.4
4  2002  Nevada  2.9
5  2003  Nevada  3.2


Passing a column name that isn't a key in the dict creates that column filled with NaN values.

In [16]:
frame2 = pd.DataFrame(data, columns=['Year', 'State', 'Pop', 'Debt'],
                      index =['One', 'Two', 'Three', 'Four', 'Five', 'Six'])
frame2

       Year   State  Pop Debt
One    2000    Ohio  1.5  NaN
Two    2001    Ohio  1.7  NaN
Three  2002    Ohio  3.6  NaN
Four   2001  Nevada  2.4  NaN
Five   2002  Nevada  2.9  NaN
Six    2003  Nevada  3.2  NaN


A column in a DataFrame can be retrieved as a Series, either with dict-like notation or as an attribute.

In [17]:
print(frame2['State'])  # Retrieving through dict notation
print()
print(frame2.Year)  # Retrieving through attribute

One        Ohio
Two        Ohio
Three      Ohio
Four     Nevada
Five     Nevada
Six      Nevada
Name: State, dtype: object

One      2000
Two      2001
Three    2002
Four     2001
Five     2002
Six      2003
Name: Year, dtype: int64
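
Attribute access only works when the column name is a valid Python identifier that does not clash with an existing DataFrame attribute or method. The sketch below uses made-up column names to show the two cases where dict notation is needed:

```python
import pandas as pd

df = pd.DataFrame({'Year': [2000, 2001],
                   'pop density': [1.5, 1.7],  # Contains a space
                   'count': [1, 2]})           # Same name as DataFrame.count

# A name with a space cannot be written as an attribute, so use brackets
density = df['pop density']

# df.count refers to the DataFrame method, not the column
is_method = callable(df.count)

# Dict notation always returns the column
count_col = df['count']

print(is_method)
print(list(count_col))
```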


Columns can be modified by assignment. Assigning to a column that does not exist creates a new one.

In [18]:
# By Assignment
frame2['Debt'] = 16.5  # Modifying debt column
frame2

       Year   State  Pop  Debt
One    2000    Ohio  1.5  16.5
Two    2001    Ohio  1.7  16.5
Three  2002    Ohio  3.6  16.5
Four   2001  Nevada  2.4  16.5
Five   2002  Nevada  2.9  16.5
Six    2003  Nevada  3.2  16.5


In [19]:
# By NumPy Functions
frame2['Debt'] = np.arange(6.0)  # Values 0.0 through 5.0, one per row
frame2

       Year   State  Pop  Debt
One    2000    Ohio  1.5   0.0
Two    2001    Ohio  1.7   1.0
Three  2002    Ohio  3.6   2.0
Four   2001  Nevada  2.4   3.0
Five   2002  Nevada  2.9   4.0
Six    2003  Nevada  3.2   5.0


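When assigning a list or array to a column, its length must match the length of the DataFrame. When assigning a Series, its labels are aligned to the DataFrame's index, and any labels missing from the Series are filled with NaN.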

In [20]:
val = pd.Series([-1.2,-1.5,-1.7], index=['Two','Four','Five'])
frame2['Debt'] = val
frame2

       Year   State  Pop  Debt
One    2000    Ohio  1.5   NaN
Two    2001    Ohio  1.7  -1.2
Three  2002    Ohio  3.6   NaN
Four   2001  Nevada  2.4  -1.5
Five   2002  Nevada  2.9  -1.7
Six    2003  Nevada  3.2   NaN
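
The difference between assigning a list and assigning a Series can be seen side by side. A minimal sketch with a small made-up DataFrame:

```python
import pandas as pd

frame = pd.DataFrame({'State': ['Ohio', 'Nevada', 'Utah'],
                      'Pop': [1.5, 2.4, 0.9]})

try:
    # Two values for three rows: the lengths do not match
    frame['Debt'] = [1.0, 2.0]
    raised = False
except ValueError as exc:
    raised = True
    print('ValueError:', exc)

# A shorter Series is accepted because it is aligned by label;
# rows 0 and 2 have no matching label and become NaN
frame['Debt'] = pd.Series([0.5], index=[1])
print(frame)
```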


Columns can also be deleted using the `del` keyword

In [21]:
# First, let's create a new column called Eastern.

frame2['Eastern'] = frame2.State == 'Ohio'  # True where State is Ohio
frame2

       Year   State  Pop  Debt  Eastern
One    2000    Ohio  1.5   NaN     True
Two    2001    Ohio  1.7  -1.2     True
Three  2002    Ohio  3.6   NaN     True
Four   2001  Nevada  2.4  -1.5    False
Five   2002  Nevada  2.9  -1.7    False
Six    2003  Nevada  3.2   NaN    False


In [22]:
# Now, the del keyword can be used to remove it.
print('Before:', frame2.columns)

del frame2['Eastern']

print('After:', frame2.columns)

Before: Index(['Year', 'State', 'Pop', 'Debt', 'Eastern'], dtype='object')
After: Index(['Year', 'State', 'Pop', 'Debt'], dtype='object')
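
`del` modifies the DataFrame in place. When the original should be kept, one alternative is the `drop` method, which returns a new DataFrame. A short sketch with a small made-up DataFrame:

```python
import pandas as pd

frame = pd.DataFrame({'State': ['Ohio', 'Nevada'],
                      'Pop': [1.5, 2.4],
                      'Eastern': [True, False]})

# drop returns a new DataFrame without the listed columns
trimmed = frame.drop(columns=['Eastern'])

print(list(frame.columns))    # The original still has 'Eastern'
print(list(trimmed.columns))  # The copy does not
```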


### Nested Dictionaries in DataFrames
If a nested dict is passed to a DataFrame, the columns will be the outer dict keys, and the row indices are the inner keys.

In [23]:
pop = {'Nevada':{2001: 2.4, 2002:2.9},
       'Ohio':{2000:1.5, 2001: 1.7, 2002:3.6}}
frame3 = pd.DataFrame(pop)
frame3

      Nevada  Ohio
2001     2.4   1.7
2002     2.9   3.6
2000     NaN   1.5


Rows and columns can also be swapped (transposed) using the `T` attribute.

In [24]:
frame3.T

        2001  2002  2000
Nevada   2.4   2.9   NaN
Ohio     1.7   3.6   1.5


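An explicit index can also be specified. By default, the keys of the inner dicts are combined to form the index. Labels in an explicit index that match no inner key produce rows of NaN.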

In [25]:
pd.DataFrame(pop, index=[2001,2002,2003])

      Nevada  Ohio
2001     2.4   1.7
2002     2.9   3.6
2003     NaN   NaN


## Index Objects

Index objects hold axis labels and other metadata. Any sequence of labels when creating Series/DataFrames is internally converted to an Index.

In [26]:
obj = pd.Series(range(3), index=['a','b','c'])
index = obj.index
print(index)
print(index[1::])

Index(['a', 'b', 'c'], dtype='object')
Index(['b', 'c'], dtype='object')


Index objects are immutable (cannot be modified by the user). This makes it safer to share Index objects among data structures.

In [27]:
# index[1] = 'd'  # Would raise TypeError because an Index is immutable

labels = pd.Index(np.arange(3))
print(labels)

Int64Index([0, 1, 2], dtype='int64')
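
The commented-out assignment above can be tried safely inside a `try` block. The sketch below also shows one possible way to change a label, which is to build a new list of labels and assign it as a whole:

```python
import pandas as pd

index = pd.Index(['a', 'b', 'c'])

try:
    index[1] = 'd'  # Item assignment is not supported on an Index
    raised = False
except TypeError as exc:
    raised = True
    print('TypeError:', exc)

# The Index itself is unchanged
print(list(index))

# To change a label, copy the labels to a list, edit it,
# and replace the whole index of the Series
obj = pd.Series(range(3), index=index)
new_labels = index.tolist()
new_labels[1] = 'd'
obj.index = new_labels
print(obj)
```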


In [28]:
obj2 = pd.Series([1.5,-2.5,0], index=labels)
obj2

0    1.5
1   -2.5
2    0.0
dtype: float64

In [29]:
obj2.index is labels

True