# Video: Computing with Data Frames

This video shows how to do various computations with Pandas data frames, and how to fall back to NumPy and other functions where Pandas does not have direct support.

## What Can We Compute with a Data Frame?

* Anything that you can compute with NumPy
* Anything that you can express as a Python function

Script:
* Data frames do not help you compute things that you could not compute with NumPy or pure Python.
* Data frames do make some computations a lot easier.
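
As a small sketch of both points, the following applies a NumPy function and a plain Python function to columns of a data frame. The frame `df` and the helper `describe_size` are hypothetical names made up for this illustration, not part of the video.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, not from the video.
df = pd.DataFrame({"a": [1, 4, 9], "b": [10, 20, 30]})

# A NumPy function applied to a column returns a Series with the same index.
roots = np.sqrt(df["a"])

# Any Python function can be applied element by element with Series.map.
def describe_size(value):
    return "big" if value >= 20 else "small"

sizes = df["b"].map(describe_size)

print(roots.tolist())  # [1.0, 2.0, 3.0]
print(sizes.tolist())  # ['small', 'big', 'big']
```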

## What Is Easier to Compute with a Data Frame?

* Cases with missing data (NaN)
* Cases with mismatched data

Script:
* Data frame functions make handling missing data a lot easier.
* NumPy does have a nan_to_num function to replace missing data, but you have to call it separately for each input.
* Pandas arithmetic methods such as add usually have a fill_value option that takes care of all the inputs.
* The other thing that gets a lot easier with data frames is that they take care of matching up the rows between data frames.
* If you have sources of data with slightly different coverage, even a few missing rows at the beginning will leave every later row misaligned.
* Instead of computing nonsensical answers from mismatched pairs of rows, pandas will take care of the alignment and match up rows with the same labels.
* For the cases where the labels are missing on one side or the other, pandas will put NaN, or Not a Number, in the output.
* Unless you used the fill_value option which will handle those cases too.
* Let's see some examples now.
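
For contrast with the pandas examples that follow, here is a rough sketch of handling missing data with NumPy alone. The arrays `x` and `y` are made up for illustration. NumPy adds arrays by position, so there are no labels to align on, and each input has to be cleaned separately (here with `np.nan_to_num`).

```python
import numpy as np

# Hypothetical inputs with missing values marked as NaN.
x = np.array([1.0, np.nan, 3.0])
y = np.array([4.0, 5.0, np.nan])

# Plain addition propagates NaN wherever either input is missing.
plain = x + y

# Each input is cleaned separately before adding.
filled = np.nan_to_num(x, nan=0.0) + np.nan_to_num(y, nan=0.0)

print(np.isnan(plain).tolist())  # [False, True, True]
print(filled.tolist())           # [5.0, 5.0, 3.0]
```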

In [None]:
import pandas as pd

In [None]:
df1 = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
df1

| | a | b |
|---|---|---|
| 0 | 1 | 4 |
| 1 | 2 | 5 |
| 2 | 3 | 6 |


Script:
* Just a small sample data frame to see what is going on easily.

In [None]:
df2 = pd.DataFrame({"a": [4, 5, 6], "c": [7, 8, 9]}, index=[0, 2, 3])
df2

| | a | c |
|---|---|---|
| 0 | 4 | 7 |
| 2 | 5 | 8 |
| 3 | 6 | 9 |


Script:
* This data frame skipped index value 1, and added 3 at the end.
* So each side has an index value that the other does not.
* So what happens if we add the A columns together?

In [None]:
df1["a"] + df2["a"]

0    5.0
1    NaN
2    8.0
3    NaN
Name: a, dtype: float64

Script:
* We got a new series that has NaN, or not a number, for both labels where one side was missing.

In [None]:
(df1["a"] + df2["a"]).fillna(0.0)

0    5.0
1    0.0
2    8.0
3    0.0
Name: a, dtype: float64

Script:
* We can fill those in afterwards, but we completely ignored the input values that were available.

In [None]:
df1["a"].add(df2["a"], fill_value=0)

0    5.0
1    2.0
2    8.0
3    6.0
Name: a, dtype: float64

Script:
* Specifying the fill_value lets us use the values that were present.
* Getting the same effect with just fillna takes longer code.

In [None]:
(df1["a"] + df2["a"]).fillna(df1["a"]).fillna(df2["a"])

0    5.0
1    2.0
2    8.0
3    6.0
Name: a, dtype: float64

Script:
* That fills in the original values as defaults, but it is more work than necessary and error-prone.
* This would not be so easy with just NumPy because the rows were not aligned.
* The examples so far were working with individual series.
* What about data frames?

In [None]:
df1.add(df2)

| | a | b | c |
|---|---|---|---|
| 0 | 5.0 | NaN | NaN |
| 1 | NaN | NaN | NaN |
| 2 | 8.0 | NaN | NaN |
| 3 | NaN | NaN | NaN |


Script:
* Operating on data frames matches up both rows and columns, or more precisely, index values and column names.
* In the previous output, the only real results are from the intersection of shared labels and shared columns.
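
To make that intersection concrete, here is a minimal sketch that computes the shared index values and shared column names directly. It rebuilds `df1` and `df2` so it can run on its own; the variable names `shared_rows`, `shared_cols`, and `total` are chosen just for this illustration.

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
df2 = pd.DataFrame({"a": [4, 5, 6], "c": [7, 8, 9]}, index=[0, 2, 3])

# Labels present on both sides.
shared_rows = df1.index.intersection(df2.index)
shared_cols = df1.columns.intersection(df2.columns)

print(shared_rows.tolist())  # [0, 2]
print(shared_cols.tolist())  # ['a']

# Only these cells hold real values in the sum; everything else is NaN.
total = df1.add(df2)
print(total.loc[shared_rows, shared_cols])
```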

In [None]:
df1.add(df2, fill_value=0)

| | a | b | c |
|---|---|---|---|
| 0 | 5.0 | 4.0 | 7.0 |
| 1 | 2.0 | 5.0 | NaN |
| 2 | 8.0 | 6.0 | 8.0 |
| 3 | 6.0 | NaN | 9.0 |


Script:
* Like before, you can use the fill_value argument.
* This can still leave holes, or rather NaN (not a number) values.
* These happen when neither input data frame had a particular index and column combination.
* Generally that would mean one side was missing the index value and the other side was missing the column name.

In [None]:
df1.add(df2, fill_value=0).fillna(0)

| | a | b | c |
|---|---|---|---|
| 0 | 5.0 | 4.0 | 7.0 |
| 1 | 2.0 | 5.0 | 0.0 |
| 2 | 8.0 | 6.0 | 8.0 |
| 3 | 6.0 | 0.0 | 9.0 |


Script:
* You can fill those values in with one last fillna call.
* So you essentially are specifying defaults when one side is missing with fill_value, and both sides are missing with fillna.
* Bear in mind, many use cases will not be expecting so many missing values on both sides, so double check if you need to fill in so much.
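
One way to do that double check, sketched here as a suggestion rather than a required step, is to count what is missing before filling anything in. The names `unmatched` and `missing_per_column` are made up for this example.

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
df2 = pd.DataFrame({"a": [4, 5, 6], "c": [7, 8, 9]}, index=[0, 2, 3])

# Index values that appear on only one side.
unmatched = df1.index.symmetric_difference(df2.index)
print(unmatched.tolist())  # [1, 3]

# NaN values left over per column after using fill_value.
total = df1.add(df2, fill_value=0)
missing_per_column = total.isna().sum()
print(missing_per_column.to_dict())  # {'a': 0, 'b': 1, 'c': 1}
```

If these counts are larger than expected, the mismatch may point to a problem in the input data that filling with defaults would hide.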

## Function Coverage in the Library

* See the pandas data frame documentation for the list of directly supported functions.
* NumPy functions should work on numeric data in Pandas.
  * Handle one column at a time...
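
A minimal sketch of the one-column-at-a-time approach, assuming a hypothetical frame `df` that mixes a text column with a numeric one:

```python
import numpy as np
import pandas as pd

# Hypothetical data: one text column and one numeric column.
df = pd.DataFrame({"name": ["x", "y", "z"], "value": [1.0, 2.0, 8.0]})

# Apply the NumPy function to the numeric column only; the text column
# is left out because it is not numeric.
df["log2_value"] = np.log2(df["value"])

print(df["log2_value"].tolist())  # [0.0, 1.0, 3.0]
```

The result is a Series with the same index as the input column, so it can be assigned straight back as a new column.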


Script:
* What functions are available?
* The most commonly used functions are directly available at the data frame or series level.
* Check the pandas documentation for the list.
* Going forward, you should be able to find that documentation on your own.
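
As one illustration of directly supported functions, the sketch below uses `sub` and `mul`, which sit alongside `add` and accept the same fill_value option, plus a column-wise `sum`. The fill values chosen here (0 for subtraction, 1 for multiplication) are just reasonable defaults for this example.

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
df2 = pd.DataFrame({"a": [4, 5, 6], "c": [7, 8, 9]}, index=[0, 2, 3])

# Subtraction with missing values treated as 0.
diff = df1["a"].sub(df2["a"], fill_value=0)
print(diff.tolist())  # [-3.0, 2.0, -2.0, -6.0]

# Multiplication with missing values treated as 1.
prod = df1["a"].mul(df2["a"], fill_value=1)
print(prod.tolist())  # [4.0, 2.0, 15.0, 6.0]

# Aggregations such as sum work column by column on a data frame.
print(df1.sum().to_dict())  # {'a': 6, 'b': 15}
```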