<a href="https://colab.research.google.com/github/djgreen/AI-BootCamp/blob/main/AIBootCampPandasIntro1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Intro to pandas Data Structures
**Authors**: 
- Dr. Jany Chan, The Ohio State University
- Dr. Chaitanya Kulkarni, The Ohio State University
- Prof. Raghu Machiraju, The Ohio State University

---

## Context
The material here was developed by the authors for a professional master's course in data analytics. The enrolled students come from a wide range of academic backgrounds (MDs, PharmDs, MBAs, etc.). The goal of that program is to teach data storytelling in context.

---

## Objectives
- Learn about pandas and what it provides
- Learn about Series and DataFrame objects
- Learn about indexing and selection
- Learn about using functions in pandas

---
## pandas

Before we start on our pandas journey, remember that the functionality of pandas is built on NumPy, meaning:
1. Conversions are easy from built-in Python data structures to NumPy `ndarray` objects to pandas `DataFrame` and `Series` objects
2. Accessing elements by indexing and slicing works the same way when integer positions are used

**What are pandas objects in Python?**

- Pandas data structures are enhanced versions of NumPy structured arrays 
- Rows and columns can be identified with labels as well as simple integer indices
    - One basic tenet to keep in mind: **data alignment is intrinsic.**
        - The link between labels and data will not be broken unless you break it explicitly.
        - This means that **the ordering of the data does not matter** when manipulating the data structures.
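
As a minimal sketch of what intrinsic alignment means in practice, the example below multiplies two `Series` whose labels are listed in different orders. The fruit names and numbers are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Two Series holding the same labels in a different order (hypothetical data)
prices = pd.Series({'apple': 0.50, 'banana': 0.25, 'cherry': 3.00})
quantities = pd.Series({'cherry': 2, 'apple': 10, 'banana': 4})

# Multiplication pairs values by label, not by position
totals = prices * quantities
print(totals)
```

Each total combines the price and quantity that share a label, even though the two `Series` list the fruits in different orders.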

We will cover two fundamental pandas data structures:
- ``Series``
- ``DataFrame``

In [None]:
# Let's import the necessary libraries
import numpy as np
import pandas as pd


## Pandas Series Object

A pandas ``Series`` is a Python class that defines a one-dimensional array of indexed data. You can think of this data structure as a single column in an Excel spreadsheet.

### Constructing Series objects

There are multiple ways to construct a pandas ``Series`` and they all follow the same pattern:
```python
>>> pd.Series(data, index=index)
```
where ``index`` is an optional argument, and ``data`` can be one of many entities.


For example, ``data`` can be a list or NumPy array and then ``index`` defaults to an integer sequence:

In [None]:
# We'll create a Series using the NumPy function `arange`
# The default (implicit) index is the familiar Python index starting at 0
data = pd.Series(np.arange(0, 1, 0.1))
data

0    0.0
1    0.1
2    0.2
3    0.3
4    0.4
5    0.5
6    0.6
7    0.7
8    0.8
9    0.9
dtype: float64
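
The `data` argument does not have to be a list or an array. As a small sketch (the variable names and values are illustrative assumptions), a scalar or a dictionary can also be passed, and the optional `index` then controls which labels appear in the result:

```python
import pandas as pd

# A scalar is repeated to fill the given index
constant = pd.Series(5, index=[100, 200, 300])

# With a dictionary, an explicit index selects and orders the entries to keep
subset = pd.Series({2: 'a', 1: 'b', 3: 'c'}, index=[3, 2])

print(constant)
print(subset)
```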

A pandas ``Series`` includes both:
- a sequence of values
- a sequence of indices

which are accessed with the ``values`` and ``index`` attributes:
- `values` is simply a familiar data structure: the NumPy array
- `index` is an object of type ``pd.Index``

In [None]:
print(data.values)
print(data.index)

[0.  0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9]
RangeIndex(start=0, stop=10, step=1)


### ``Series`` as generalized NumPy array

- A ``Series`` can be used much like a 1-D NumPy array

The essential difference is the presence of the `index`:
- A NumPy `array` has an *implicitly defined* integer index used to access the values
- A pandas ``Series`` has an *explicitly defined* index associated with the values
- In NumPy, arrays are accessed by position using the familiar `start:stop:step`
- In pandas, the index is not restricted to consecutive integers, meaning:
    - `index` does not need to be an integer, but can consist of values of any desired type
    - Strings can be used as an index (i.e., labels)

**Note**: We'll explore this more in the next section [Pandas Data Indexing and Selection](https://colab.research.google.com/drive/1Uafik9EwMJdvi_Vf8MYbFtu581z-i1Lf?usp=sharing#scrollTo=Pandas_Data_Indexing_and_Selection)

In [None]:
# Aside from the familiar [start:stop:step], pandas has another method for indexing: labels
# Here, we've used a Python list to create a Series and defined the index
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['alice', 'bob', 'charlie', 'david'])

# Now we can access the element we want using its label
data['alice']

0.25

### Series as specialized dictionary
Recall that a Python dictionary is a data structure that maps **arbitrary** `keys` to a set of **arbitrary** `values`.
- A pandas ``Series`` is a structure that maps **typed** `keys` to a set of **typed** `values`.
- This typing makes a pandas ``Series`` much more efficient than a Python dictionary for certain operations.
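
One way to see the effect of typing is to compare the same calculation on a `Series` and on a plain dictionary. This is only a sketch; it reuses two of the population figures from this section, and the variable names `in_millions` and `in_millions_dict` are illustrative.

```python
import pandas as pd

population_dict = {'California': 38332521, 'Texas': 26448193}
population = pd.Series(population_dict)

# Every value shares one dtype, so arithmetic applies to the whole Series at once
print(population.dtype)
in_millions = population / 1e6

# The plain dictionary needs an explicit loop for the same calculation
in_millions_dict = {state: count / 1e6 for state, count in population_dict.items()}
```

Both versions produce the same numbers; the `Series` version avoids the Python-level loop.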

The ``Series``-as-dictionary analogy can be made even more clear by constructing a ``Series`` object directly from a Python dictionary:

In [None]:
# Here, we've created a dictionary mapping states to population
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}

# This dictionary can be directly transformed into a pandas Series
population = pd.Series(population_dict)
population

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64

By default, the index of the resulting ``Series`` is drawn from the dictionary keys, in the order in which they appear in the dictionary. Thus, typical dictionary-style item access can be performed:

In [None]:
population['California']

38332521

Unlike a dictionary, though, the ``Series`` also supports array-style operations such as slicing:

In [None]:
population['California':'New York']

California    38332521
Texas         26448193
New York      19651127
dtype: int64

## Pandas DataFrame Object

The next fundamental structure in pandas is the ``DataFrame``. Think of this as a single sheet in Excel.

In [None]:
# First, let's start with another Python dictionary
area_dict = {'Texas': 695662, 'New York': 141297, 'Illinois': 149995, 
             'Florida': 170312, 'California': 423967}

# The dictionary is used to create another pandas Series
area = pd.Series(area_dict)

# Recall the population Series that we created in the Series section above.
# Both Series are indexed by state name, thus we can use them to construct 
# a single 2-D object containing all the information:
states = pd.DataFrame({'population': population,
                       'area': area})
states

# Note: The order of the elements in each Series does not matter because they 
# are indexed. When generating or manipulating a new DataFrame, values are 
# matched by index.

Unnamed: 0,population,area
California,38332521,423967
Florida,19552860,170312
Illinois,12882135,149995
New York,19651127,141297
Texas,26448193,695662


Like the ``Series`` object, the ``DataFrame`` has an ``index`` attribute that gives access to the index labels.

Additionally, ``DataFrame`` has a ``columns`` attribute, which is an ``Index`` object holding the column labels.

In [None]:
print(states.index)
print(states.columns)

Index(['California', 'Florida', 'Illinois', 'New York', 'Texas'], dtype='object')
Index(['population', 'area'], dtype='object')


### DataFrame as specialized dictionary

- We can also think of a ``DataFrame`` as a specialization of a dictionary.
- A dictionary maps a key to a value, while a ``DataFrame`` maps a column name to a ``Series`` of column data.
- For instance, asking for the ``'area'`` key returns the ``Series`` object containing the areas.

In [None]:
states['area']

California    423967
Florida       170312
Illinois      149995
New York      141297
Texas         695662
Name: area, dtype: int64

### Constructing DataFrame objects

A pandas ``DataFrame`` can be constructed in a variety of ways.
Here we'll give several examples.

#### From a single Series object

A ``DataFrame`` is a collection of ``Series`` objects, and a single-column ``DataFrame`` can be constructed from a single ``Series``:

In [None]:
pd.DataFrame(population, columns=['population'])

Unnamed: 0,population
California,38332521
Texas,26448193
New York,19651127
Florida,19552860
Illinois,12882135


#### From a list of dicts

Any list of dictionaries can be made into a ``DataFrame``.
We'll use a simple list comprehension to create some data:

In [None]:
data = [{'a': i, 'b': 2 * i} for i in range(3)]
pd.DataFrame(data)

Unnamed: 0,a,b
0,0,0
1,1,2
2,2,4


Note that if some keys are missing from one of the dictionaries, pandas fills the corresponding entries with ``NaN`` (i.e., "not a number"):

In [None]:
pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])

Unnamed: 0,a,b,c
0,1.0,2,
1,,3,4.0
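
The decimal points in columns `a` and `c` above are a side effect of the missing entries. The sketch below rebuilds the same `DataFrame` and inspects it; `isna()` is one way to locate the gaps.

```python
import pandas as pd

df = pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])

# Columns with missing entries are stored as floats, because NaN is a float value
print(df.dtypes)

# isna() marks the missing cells; sum() counts them per column
missing_per_column = df.isna().sum()
print(missing_per_column)
```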


#### From a two-dimensional NumPy array

- Given a two-dimensional array of data, we can create a ``DataFrame`` with any specified column and index names.
- If omitted, an integer index will be used for each:

In [None]:
pd.DataFrame(np.random.rand(3, 2),
             columns=['foo', 'bar'],
             index=['a', 'b', 'c'])

Unnamed: 0,foo,bar
a,0.759882,0.611878
b,0.548366,0.921287
c,0.614557,0.265547
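
For comparison, here is a short sketch of the case where both `columns` and `index` are omitted. The array contents are arbitrary.

```python
import numpy as np
import pandas as pd

arr = np.arange(6).reshape(3, 2)

# No index or columns given: both default to integer ranges
default_labels = pd.DataFrame(arr)
print(default_labels.index)    # integer labels for the rows
print(default_labels.columns)  # integer labels for the columns
```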


# Pandas Data Indexing and Selection

When we covered NumPy, we studied methods and tools to access, set, and modify values in NumPy arrays, including:
- indexing (e.g., ``arr[2, 1]``)
- slicing (e.g., ``arr[:, 1:5]``)
- masking or Boolean indexing (e.g., ``arr[arr > 0]``)
- fancy indexing (e.g., ``arr[0, [1, 5]]``)
- combinations thereof (e.g., ``arr[:, [1, 5]]``)

These all hold true for accessing and modifying values in pandas 1-D `Series` and 2-D `DataFrame` objects.
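
As a quick refresher, the sketch below applies each of the access patterns listed above to a small NumPy array. The array and the chosen positions are arbitrary and only meant to make each pattern concrete.

```python
import numpy as np

arr = np.arange(12).reshape(3, 4)   # rows 0..2, columns 0..3

single = arr[2, 1]          # indexing: row 2, column 1
block = arr[:, 1:3]         # slicing: all rows, columns 1 and 2
large = arr[arr > 8]        # masking: values greater than 8
picked = arr[0, [1, 3]]     # fancy indexing: row 0, columns 1 and 3
columns = arr[:, [1, 3]]    # combination: all rows, columns 1 and 3
```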


## Data Selection in Series

To recap, a ``Series`` object is akin to:
- a 1-D NumPy array
- a standard Python dictionary.


### Series as dictionary
A ``Series`` provides a mapping from a collection of keys to a collection of values:

In [None]:
# Here's a simple pandas Series generated from a list with an explicit index
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['w', 'x', 'y', 'z'])

# To list the keys in a Series
print(data.keys())

# To check if an element is found within a collection:
'z' in data

Index(['w', 'x', 'y', 'z'], dtype='object')


True

``Series`` objects can be modified just like a dictionary using the same syntax:
- A Python dictionary can be extended by assigning to a new key
- A pandas ``Series`` can be extended by assigning to a new label, such as `e`

In [None]:
# Adding the new data 'e' to the Series `data`
data['e'] = 1.25
data

w    0.25
x    0.50
y    0.75
z    1.00
e    1.25
dtype: float64

### Series as one-dimensional array

A ``Series``:
- builds on this dictionary-like interface
- provides array-style item selection via NumPy mechanisms, including:
  - *slicing*
  - *masking*

In [None]:
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])

# Array-like slicing by explicit index...
print(data['a':'c'])

# ...selects the same rows here as slicing by implicit integer position
data[0:3]

# Note: Slicing is a bit inconsistent in pandas. There are two kinds of slicing:
# (1) with an explicit index
#     -- data['a':'c'], where the final index is included in the slice
# (2) with an implicit index
#     -- data[0:3], where the final index is excluded from the slice



a    0.25
b    0.50
c    0.75
dtype: float64


a    0.25
b    0.50
c    0.75
dtype: float64

In [None]:
# We can also perform masking or Boolean indexing 
# Recall that | stands for OR and & stands for AND
data[(data < 0.33) | (data > 0.66)]

a    0.25
c    0.75
d    1.00
dtype: float64

### Indexers: loc and iloc
Pandas provides special *indexer* attributes that explicitly expose particular indexing schemes.

The ``loc`` attribute allows indexing and slicing with the explicit index:

In [None]:
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])

data.loc['a']

0.25

In [None]:
# The line below is the same as data['a':'c'] from above
data.loc['a':'c']

a    0.25
b    0.50
c    0.75
dtype: float64

Also, the ``iloc`` attribute allows indexing and slicing with the implicit Python-style index:

In [None]:
# This returns the same value as data['a'] and data.loc['a']
data.iloc[0]

0.25

In [None]:
# This is the same as data['a':'c'] and data.loc['a':'c']
data.iloc[0:3]

a    0.25
b    0.50
c    0.75
dtype: float64

**Note**:
- Choose one method of indexing and **stay consistent in its usage**
- Explicit indexing is better than implicit.
- The explicit nature of ``loc`` and ``iloc`` makes code easier to read and helps prevent subtle bugs caused by mixed indexing/slicing conventions.
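
The difference matters most when the labels are themselves integers. The following sketch uses a hypothetical `Series` whose labels are `1, 3, 5`, so that label-based and position-based access give different answers:

```python
import pandas as pd

s = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

by_label = s.loc[1]           # label 1 is the first element
by_position = s.iloc[1]       # position 1 is the second element

label_slice = s.loc[1:3]      # labels 1 through 3, endpoint included
position_slice = s.iloc[1:3]  # positions 1 and 2, endpoint excluded
```

Writing `loc` or `iloc` makes it clear to the reader which of the two meanings is intended.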

## Data Selection in DataFrame

A ``DataFrame`` acts like:
- a 2-D structured array, or
- a dictionary of ``Series`` structures sharing the same index.


### DataFrame as a dictionary
Let us consider a ``DataFrame`` as a dictionary of related ``Series`` objects.

Now, let us go back to the US states example.

In [None]:
# Let's rebuild the states data. 
# Recall that order does not matter if there is an index
area = pd.Series({'California': 423967, 'Texas': 695662, 'New York': 141297, 
                  'Florida': 170312, 'Illinois': 149995})
pop = pd.Series({'New York': 19651127, 'Illinois': 12882135, 'Florida': 19552860, 
                 'Texas': 26448193, 'California': 38332521})

data = pd.DataFrame({'area':area, 'pop':pop})
data

Unnamed: 0,area,pop
California,423967,38332521
Florida,170312,19552860
Illinois,149995,12882135
New York,141297,19651127
Texas,695662,26448193


The individual ``Series`` that make up the columns of the ``DataFrame`` are accessed via dictionary-style indexing with the column name:

In [None]:
data['area']

California    423967
Florida       170312
Illinois      149995
New York      141297
Texas         695662
Name: area, dtype: int64

Equivalently, when the column name is a string, attribute-style access can be used:

In [None]:
# Note that this method fails if the column name is not a valid Python identifier
# (e.g., it contains spaces) or if it clashes with a DataFrame method or attribute
data.area

California    423967
Florida       170312
Illinois      149995
New York      141297
Texas         695662
Name: area, dtype: int64
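
One pitfall of attribute-style access shows up with the `pop` column used in this example, because `DataFrame` also has a method called `pop`. The sketch below rebuilds a two-row version of the data to illustrate; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'area': [423967, 695662], 'pop': [38332521, 26448193]},
                  index=['California', 'Texas'])

# 'area' does not clash with anything, so attribute access returns the column
area_column = df.area

# 'pop' is also the name of a DataFrame method, so df.pop is that method
pop_attribute = df.pop

# Bracket access always returns the column
pop_column = df['pop']
```

For this reason, dictionary-style access is the safer default, especially when assigning to columns.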

We can do a quick sanity check to see if attribute-style column access yields the exact same object as dictionary-style access:

In [None]:
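# Recall that the keyword `is` tests whether two names refer to the very same object
# How is this different from using `==`, which compares values?
# (The answer to this check may differ between pandas versions)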
data.area is data['area']


True

In [None]:
# Like with Series, dictionary-style syntax can be used to modify the object. 
# In this case, we can add a new column named density simply by defining it:
data['density'] = data['pop'] / data['area']
data

Unnamed: 0,area,pop,density
California,423967,38332521,90.413926
Florida,170312,19552860,114.806121
Illinois,149995,12882135,85.883763
New York,141297,19651127,139.076746
Texas,695662,26448193,38.01874


### DataFrame as two-dimensional array

A ``DataFrame`` is an enhanced two-dimensional array, and we can examine the underlying data using the ``values`` attribute:

In [None]:
# What data structure is returned by data.values?
# Recall the built-in Python function: type()
# How is this different from the dtype (try `data.values.dtype`)?
data.values

array([[4.23967000e+05, 3.83325210e+07, 9.04139261e+01],
       [1.70312000e+05, 1.95528600e+07, 1.14806121e+02],
       [1.49995000e+05, 1.28821350e+07, 8.58837628e+01],
       [1.41297000e+05, 1.96511270e+07, 1.39076746e+02],
       [6.95662000e+05, 2.64481930e+07, 3.80187404e+01]])

In [None]:
# Let's rebuild the original DataFrame
# Both Series now list the states in the same order, so the DataFrame keeps that order
area = pd.Series({'California': 423967, 'Texas': 695662,
                  'New York': 141297, 'Florida': 170312,
                  'Illinois': 149995})

pop = pd.Series({'California': 38332521, 'Texas': 26448193,
                 'New York': 19651127, 'Florida': 19552860,
                 'Illinois': 12882135})

data = pd.DataFrame({'area':area, 'pop':pop})

data['density'] = data['pop'] / data['area']
data

Unnamed: 0,area,pop,density
California,423967,38332521,90.413926
Texas,695662,26448193,38.01874
New York,141297,19651127,139.076746
Florida,170312,19552860,114.806121
Illinois,149995,12882135,85.883763
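
Regarding the questions about `values` and `dtype` above, here is a small sketch with a hypothetical two-column `DataFrame`. Each column keeps its own dtype, while the extracted NumPy array has a single dtype that can hold every column. `to_numpy()` is used here as a method that returns the same array as the `values` attribute.

```python
import pandas as pd

df = pd.DataFrame({'count': [1, 2], 'ratio': [0.5, 1.5]})

print(df.dtypes)          # one dtype per column
arr = df.to_numpy()       # same data as df.values
print(type(arr), arr.dtype)
```

This is why the integer `area` and `pop` columns appear in floating-point notation in the array output.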

For array-style indexing, pandas uses the ``loc`` and ``iloc`` indexers. 

Using the ``iloc`` indexer, you can index the underlying array as a simple NumPy array (using the implicit Python-style index). However, ``DataFrame`` index and column labels are maintained in the result:

In [None]:
# Just like with 2-D NumPy arrays, iloc is formatted as [row, col]
# And for each dimension, we need to define start:stop:step when slicing
data.iloc[:3, :3]

Unnamed: 0,area,pop,density
California,423967,38332521,90.413926
Texas,695662,26448193,38.01874
New York,141297,19651127,139.076746


In [None]:
# However, with pandas, we can do the same using the `loc` indexer and explicit index and column names:
data.loc[:'Florida', :'density']

Unnamed: 0,area,pop,density
California,423967,38332521,90.413926
Texas,695662,26448193,38.01874
New York,141297,19651127,139.076746
Florida,170312,19552860,114.806121


In [None]:
# Any familiar NumPy-style data access patterns can be used. 
# For example, we can combine masking and fancy indexing:
# In English, what are we actually doing in Line 4?
data.loc[data.density > 100, ['area', 'density']]


Unnamed: 0,area,density
New York,141297,139.076746
Florida,170312,114.806121


In [None]:
# Similarly, if we can select specific elements, we can also set or modify their values
data.iloc[0, 2] = 90
data

Unnamed: 0,area,pop,density
California,423967,38332521,90.0
Texas,695662,26448193,38.01874
New York,141297,19651127,139.076746
Florida,170312,19552860,114.806121
Illinois,149995,12882135,85.883763


# Operating on Data in pandas


Pandas inherits much of its element-wise operation functionality from NumPy, especially from NumPy's universal functions or `ufuncs`.

However, pandas does have some differences:
- for unary operations like negation and trigonometric functions, ufuncs will *preserve index and column labels* in the output
- for binary operations such as addition and multiplication, pandas will automatically *align indices* when passing the objects to the ufunc.

Note: Just like with NumPy, don't worry about memorizing these. Be aware that they exist and know how to find their documentation.

Python operators and equivalent pandas object methods:

| Python Operator | Pandas Method(s)                      |
|-----------------|---------------------------------------|
| ``+``           | ``add()``                             |
| ``-``           | ``sub()``, ``subtract()``             |
| ``*``           | ``mul()``, ``multiply()``             |
| ``/``           | ``truediv()``, ``div()``, ``divide()``|
| ``//``          | ``floordiv()``                        |
| ``%``           | ``mod()``                             |
| ``**``          | ``pow()``                             |


## Ufuncs: Index Preservation

Any NumPy ufunc will work on pandas ``Series`` and ``DataFrame`` objects.

In [None]:
# Let's generate a random Series using a NumPy function
rng = np.random.RandomState(42)  # `42` is the seed, which allows others to duplicate our random array
ser = pd.Series(rng.randint(0, 10, 4))
ser

0    6
1    3
2    7
3    4
dtype: int64

In [None]:
# Similarly, we can generate a random DataFrame
df = pd.DataFrame(rng.randint(0, 10, (3, 4)),
                  columns=['A', 'B', 'C', 'D'])
df

Unnamed: 0,A,B,C,D
0,6,9,2,6
1,7,4,3,7
2,7,2,5,4


In [None]:
# Applying a NumPy ufunc on either of these pandas objects will generate another
# pandas object *with the indices preserved.* 
# This means we can perform calculations without worrying about the order of the values.
print(np.exp(ser))

# So we can utilize more complex calculations:
np.sin(df * np.pi / 4)

Unnamed: 0,A,B,C,D
0,-1.0,0.7071068,1.0,-1.0
1,-0.707107,1.224647e-16,0.707107,-0.7071068
2,-0.707107,1.0,-0.707107,1.224647e-16
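
Index preservation is easier to see with string labels. In this sketch (the labels and values are made up), the result of `np.sqrt` is still a `Series` carrying the original labels:

```python
import numpy as np
import pandas as pd

areas = pd.Series([4.0, 9.0, 16.0], index=['small', 'medium', 'large'])

# The ufunc is applied element-wise and the labels carry over to the result
sides = np.sqrt(areas)
print(sides)
```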


## Ufuncs: Index Alignment

For binary operations on two ``Series`` or ``DataFrame`` objects, pandas will align indices in the process of performing the operation.

### Index alignment in Series

In [None]:
# Suppose we are combining two different data sources --> 
# one contains US states by *area* and the other by *population*:
area = pd.Series({'Alaska': 1723337, 'Texas': 695662,
                  'California': 423967}, name='area')

population = pd.Series({'California': 38332521, 'Texas': 26448193,
                        'New York': 19651127}, name='population')

In [None]:
# Note: the order of the states from above does not matter in the calculation below
population / area

Alaska              NaN
California    90.413926
New York            NaN
Texas         38.018740
dtype: float64

In [None]:
# What if each source contains labels that the other one lacks?
A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])
A + B

0    NaN
1    5.0
2    9.0
3    NaN
dtype: float64

In [None]:
# Note: If NaN values are not desired, a fill value can be used for missing values:
A.multiply(B, fill_value=0)

0     0.0
1     4.0
2    18.0
3     0.0
dtype: float64
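
To see what `fill_value` does for addition, the sketch below repeats `A` and `B` from above and compares the plain operator with the method form. A label missing from one `Series` is treated as `0` there, so the value from the other `Series` passes through unchanged.

```python
import pandas as pd

A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])

plain = A + B                     # labels 0 and 3 become NaN
filled = A.add(B, fill_value=0)   # labels 0 and 3 keep the value that exists

print(plain)
print(filled)
```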

### Index alignment in DataFrame

In [None]:
# Similar alignments take place for *both* columns and rows when operating on DataFrames:
rng = np.random.RandomState(42)
A = pd.DataFrame(rng.randint(0, 20, (2, 2)),
                 columns=list('AB'))
A

Unnamed: 0,A,B
0,6,19
1,14,10


In [None]:
B = pd.DataFrame(rng.randint(0, 10, (3, 3)),
                 columns=list('BAC'))
B

Unnamed: 0,B,A,C
0,7,4,6
1,9,2,6
2,7,4,3


In [None]:
A + B

Unnamed: 0,A,B,C
0,10.0,26.0,
1,16.0,19.0,
2,,,


In [None]:
# Indices are aligned correctly irrespective of order, and the resulting index and columns are sorted.
# Similar to Series, we can pass a fill_value in place of missing entries.
fill = A.stack().mean()
print(fill)
A.add(B, fill_value=fill)

12.25


Unnamed: 0,A,B,C
0,10.0,26.0,18.25
1,16.0,19.0,18.25
2,16.25,19.25,15.25


## Ufuncs: Operations Between DataFrame and Series

In [None]:
# When performing operations between a DataFrame and a Series, index and column 
# alignments are similarly maintained.

# Let's start with a NumPy array A
rng = np.random.RandomState(42)
A = rng.randint(10, size=(3, 4))
A

array([[6, 3, 7, 4],
       [6, 9, 2, 6],
       [7, 4, 3, 7]])

In [None]:
# Here, we're subtracting the values in the first row from every other row in 
# the NumPy ndarray A
A - A[0]

array([[ 0,  0,  0,  0],
       [ 0,  6, -5,  2],
       [ 1,  1, -4,  3]])

In [None]:
# According to NumPy's broadcasting rules, subtraction between a 2D array 
# and a row is applied row-wise. Pandas also follows this convention by default.

# Let's convert the ndarray into a DataFrame
df = pd.DataFrame(A, columns=list('QRST'))
df

Unnamed: 0,Q,R,S,T
0,6,3,7,4
1,6,9,2,6
2,7,4,3,7


In [None]:
# As expected, subtraction from the DataFrame produces the same row-wise output
df - df.iloc[0]

Unnamed: 0,Q,R,S,T
0,0,0,0,0
1,0,6,-5,2
2,1,1,-4,3


In [None]:
# To operate on columns in a DataFrame, we need to explicitly set the `axis`:
df.subtract(df['Q'], axis=0)

Unnamed: 0,Q,R,S,T
0,0,-3,1,-2
1,0,3,-4,0
2,0,-3,-4,0
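
A common use of the `axis` argument is to center each row around its own mean. The sketch below rebuilds the same `df` by hand and is only one possible illustration; `sub()` is the short alias of `subtract()`.

```python
import pandas as pd

df = pd.DataFrame([[6, 3, 7, 4], [6, 9, 2, 6], [7, 4, 3, 7]],
                  columns=list('QRST'))

# One mean per row, labeled by the row index
row_means = df.mean(axis=1)

# axis=0 matches the Series to the row index, so each row loses its own mean
centered = df.sub(row_means, axis=0)
print(centered)
```

After centering, the values in every row sum to zero.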


Because pandas preserves and aligns indices and columns, operations keep each value attached to its labels. This prevents the kinds of errors that arise when working with heterogeneous and/or misaligned data in NumPy.