# Operating on Data in Pandas

## Ufuncs: Index Preservation

Because Pandas is designed to work with NumPy, any NumPy ufunc will work on Pandas ``Series`` and ``DataFrame`` objects.
Let's start by defining a simple ``Series`` and ``DataFrame`` on which to demonstrate this:

In [73]:
import pandas as pd
import numpy as np

In [74]:
rng = np.random.RandomState(42)
ser = pd.Series(rng.randint(0, 10, 4))
ser

0    6
1    3
2    7
3    4
dtype: int64

In [75]:
# Interoperability between a NumPy 1D array and a Pandas Series

np.sum(ser)

20

In [76]:
type(ser)

pandas.core.series.Series
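
As a minimal sketch of what index preservation means for a unary ufunc, the following applies ``np.exp`` to a small ``Series``. The values and the string index are made up for illustration; they are not the data used above.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with a non-default index, chosen only for illustration
s = pd.Series([0.0, 1.0, 2.0], index=['a', 'b', 'c'])

# A unary ufunc returns a new Series; the index labels are carried over
result = np.exp(s)

print(type(result).__name__)   # Series
print(list(result.index))      # ['a', 'b', 'c']
```

A reduction such as ``np.sum`` collapses the values to a scalar, so there is no index left to preserve; element-wise ufuncs are where preservation is visible.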

In [77]:
df = pd.DataFrame(rng.randint(0, 10, (3, 4)),
                  columns=['A', 'B', 'C', 'D'],
                  index=[1, 2, 3])
df

   A  B  C  D
1  6  9  2  6
2  7  4  3  7
3  7  2  5  4


In [78]:
np.sin(df * np.pi / 4)

# The index and columns are preserved when a NumPy ufunc is applied to a Pandas DataFrame

          A             B         C             D
1 -1.000000  7.071068e-01  1.000000 -1.000000e+00
2 -0.707107  1.224647e-16  0.707107 -7.071068e-01
3 -0.707107  1.000000e+00 -0.707107  1.224647e-16


### Index alignment in Series

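For binary operations on two ``Series`` objects, Pandas aligns the indices before performing the operation, which is convenient when working with incomplete data. As an example, suppose we are combining two different data sources and have only the top three US states by *area* and the top three US states by *population*: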

In [79]:
area = pd.Series({'Alaska': 1723337, 'Texas': 695662,
                  'California': 423967}, name='area')
population = pd.Series({'California': 38332521, 'Texas': 26448193,
                        'New York': 19651127}, name='population')

In [80]:

population

California    38332521
Texas         26448193
New York      19651127
Name: population, dtype: int64

In [81]:
area

Alaska        1723337
Texas          695662
California     423967
Name: area, dtype: int64

Let's see what happens when we divide these to compute the population density:

In [82]:
population / area

Alaska              NaN
California    90.413926
New York            NaN
Texas         38.018740
dtype: float64
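
The result above contains every label that appears in either input, and a label present in only one input yields ``NaN``. The sketch below rebuilds the two ``Series`` so that it runs on its own and checks this with ``Index.union``; it compares labels as sets so that it does not rely on any particular ordering of the result.

```python
import pandas as pd

area = pd.Series({'Alaska': 1723337, 'Texas': 695662,
                  'California': 423967}, name='area')
population = pd.Series({'California': 38332521, 'Texas': 26448193,
                        'New York': 19651127}, name='population')

density = population / area

# The labels of the result are the union of the labels of both inputs
union = area.index.union(population.index)
print(set(density.index) == set(union))   # True

# Labels found in only one input have no partner value, so they become NaN
missing = density[density.isna()].index
print(sorted(missing))                    # ['Alaska', 'New York']
```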

In [83]:
# Plain NumPy arrays are matched by position, not by index label, so this
# divides unrelated entries (e.g. California's population by Alaska's area)
population.values / area.values

array([22.24319503, 38.01874042, 46.35060512])

In [84]:
A = pd.Series([21, 41, 61], index=[0, 1, 2])
B = pd.Series([11, 31, 51], index=[1, 2, 3])
A

0    21
1    41
2    61
dtype: int64

In [85]:
B

1    11
2    31
3    51
dtype: int64

In [86]:
A + B

0     NaN
1    52.0
2    92.0
3     NaN
dtype: float64

In [87]:
A.add(B, fill_value=0)

# fill_value=0 treats a missing entry as 0; whether that is appropriate depends on the data

0    21.0
1    52.0
2    92.0
3    51.0
dtype: float64
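
Whether 0 is a sensible fill depends on the operation. As a hedged illustration: 0 is neutral for addition, while 1 plays that role for multiplication. The sketch also shows that ``fill_value`` only substitutes for an entry missing on one side; where both sides are missing, the result stays ``NaN``. The ``Series`` ``C`` and ``D`` are hypothetical and exist only to demonstrate that case.

```python
import numpy as np
import pandas as pd

A = pd.Series([21, 41, 61], index=[0, 1, 2])
B = pd.Series([11, 31, 51], index=[1, 2, 3])

# For multiplication, 1 leaves the surviving operand unchanged
product = A.mul(B, fill_value=1)
print(product.tolist())   # [21.0, 451.0, 1891.0, 51.0]

# Hypothetical data: label 2 is NaN in both inputs
C = pd.Series([1.0, np.nan], index=[1, 2])
D = pd.Series([np.nan, 5.0], index=[2, 3])

# Labels 1 and 3 are filled on the missing side; label 2 remains NaN
both_missing = C.add(D, fill_value=0)
print(both_missing.isna().tolist())   # [False, True, False]
```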

### Index alignment in DataFrame

A similar type of alignment takes place for *both* columns and indices when performing operations on ``DataFrame`` objects:

In [88]:
A = pd.DataFrame(rng.randint(0, 20, (2, 2)),
                 columns=list('AB'))
A

   A   B
0  1  11
1  5   1


In [89]:
B = pd.DataFrame(rng.randint(0, 10, (3, 3)),
                 columns=list('BAC'))
B

   B  A  C
0  4  0  9
1  5  8  0
2  9  2  6


In [90]:
A + B

      A     B   C
0   1.0  15.0 NaN
1  13.0   6.0 NaN
2   NaN   NaN NaN


In [91]:
A.add(B, fill_value=0)

      A     B    C
0   1.0  15.0  9.0
1  13.0   6.0  0.0
2   2.0   9.0  6.0


In [92]:

A.add(B, fill_value=A.mean().mean())

      A     B     C
0   1.0  15.0  13.5
1  13.0   6.0   4.5
2   6.5  13.5  10.5
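
The fill value used above, ``A.mean().mean()``, is the mean of the column means. For a complete frame like ``A`` this equals the mean of all values (4.5 here), but the two can differ once columns hold different numbers of non-missing entries. The sketch below illustrates this; the frame ``gappy`` is hypothetical and not part of the example above.

```python
import numpy as np
import pandas as pd

# Same values as the DataFrame A shown above
A = pd.DataFrame([[1, 11], [5, 1]], columns=list('AB'))

# With no missing data, both approaches agree
print(A.mean().mean())        # 4.5
print(A.to_numpy().mean())    # 4.5

# Hypothetical frame with one missing value in column B
gappy = pd.DataFrame({'A': [1.0, 5.0], 'B': [11.0, np.nan]})

mean_of_means = gappy.mean().mean()           # (3.0 + 11.0) / 2
overall_mean = np.nanmean(gappy.to_numpy())   # (1 + 5 + 11) / 3

print(mean_of_means, overall_mean)
```

If the intent is "the average of all observed values", computing it directly from the values is the less surprising choice.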


The following table lists Python operators and their equivalent Pandas object methods:

| Python Operator | Pandas Method(s)                      |
|-----------------|---------------------------------------|
| ``+``           | ``add()``                             |
| ``-``           | ``sub()``, ``subtract()``             |
| ``*``           | ``mul()``, ``multiply()``             |
| ``/``           | ``truediv()``, ``div()``, ``divide()``|
| ``//``          | ``floordiv()``                        |
| ``%``           | ``mod()``                             |
| ``**``          | ``pow()``                             |
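
As a quick check of the table, the following sketch compares each operator with its method counterparts on two small ``Series``. The data are made up for illustration.

```python
import pandas as pd

# Hypothetical inputs sharing the same index
x = pd.Series([7, 8, 9], index=['a', 'b', 'c'])
y = pd.Series([2, 3, 4], index=['a', 'b', 'c'])

# Each entry is True when the operator and all listed methods agree
checks = {
    '+':  (x + y).equals(x.add(y)),
    '-':  (x - y).equals(x.sub(y)) and (x - y).equals(x.subtract(y)),
    '*':  (x * y).equals(x.mul(y)) and (x * y).equals(x.multiply(y)),
    '/':  ((x / y).equals(x.truediv(y)) and (x / y).equals(x.div(y))
           and (x / y).equals(x.divide(y))),
    '//': (x // y).equals(x.floordiv(y)),
    '%':  (x % y).equals(x.mod(y)),
    '**': (x ** y).equals(x.pow(y)),
}

print(all(checks.values()))   # True
```

The practical difference is that the methods accept extra keywords such as ``fill_value`` and ``axis``, which the operators cannot.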


## Ufuncs: Operations Between DataFrame and Series

When performing operations between a ``DataFrame`` and a ``Series``, the index and column alignment is similarly maintained.
Operations between a ``DataFrame`` and a ``Series`` are similar to operations between a two-dimensional and one-dimensional NumPy array.
Consider one common operation, where we find the difference of a two-dimensional array and one of its rows:

In [98]:
A = rng.randint(10, size=(3, 4))
type(A)

numpy.ndarray

In [99]:
A

array([[1, 9, 8, 9],
       [4, 1, 3, 6],
       [7, 2, 0, 3]])
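
Under NumPy's broadcasting rules, subtracting one row from a two-dimensional array is applied row-wise. A minimal sketch using the same values as ``A`` above:

```python
import numpy as np

A = np.array([[1, 9, 8, 9],
              [4, 1, 3, 6],
              [7, 2, 0, 3]])

# A[0] has shape (4,), so it is broadcast across each of the 3 rows
diff = A - A[0]
print(diff)
# [[ 0  0  0  0]
#  [ 3 -8 -5 -3]
#  [ 6 -7 -8 -6]]
```

Pandas follows the same row-wise convention by default when a ``Series`` is subtracted from a ``DataFrame``.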

In [100]:
df = pd.DataFrame(A, columns=list('QRST'))
df

   Q  R  S  T
0  1  9  8  9
1  4  1  3  6
2  7  2  0  3


In [101]:
type(df.iloc[0])

pandas.core.series.Series

In [102]:
df - df.iloc[0]

   Q  R  S  T
0  0  0  0  0
1  3 -8 -5 -3
2  6 -7 -8 -6


If you would instead like to operate column-wise, you can use the object methods mentioned earlier, while specifying the ``axis`` keyword:

In [103]:
df.subtract(df['R'], axis=0)

   Q  R  S  T
0 -8  0 -1  0
1  3  0  2  5
2  5  0 -2  1
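
To make the role of ``axis`` concrete, the sketch below rebuilds ``df`` with the same values and shows that ``axis=0`` and ``axis='index'`` are interchangeable. It also compares the result with plain NumPy, where the column has to be reshaped to a column vector by hand before broadcasting.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 9, 8, 9],
                   [4, 1, 3, 6],
                   [7, 2, 0, 3]], columns=list('QRST'))

# axis=0 (or 'index') matches the Series index against the row labels
by_number = df.subtract(df['R'], axis=0)
by_name = df.sub(df['R'], axis='index')
print(by_number.equals(by_name))   # True

# NumPy equivalent: turn column R into shape (3, 1) so it broadcasts per row
manual = df.to_numpy() - df['R'].to_numpy()[:, np.newaxis]
print(np.array_equal(by_number.to_numpy(), manual))   # True
```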


Note that these ``DataFrame``/``Series`` operations, like the operations discussed above, automatically align indices between the two objects:

In [104]:
halfrow = df.iloc[0, ::2]
halfrow

Q    1
S    8
Name: 0, dtype: int64

In [105]:
df

   Q  R  S  T
0  1  9  8  9
1  4  1  3  6
2  7  2  0  3


In [106]:
df - halfrow

     Q   R    S   T
0  0.0 NaN  0.0 NaN
1  3.0 NaN -5.0 NaN
2  6.0 NaN -8.0 NaN
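
If the ``NaN`` columns are not wanted, one option is to expand the partial row to the full set of columns before subtracting. The sketch below uses ``Series.reindex`` with ``fill_value=0``; treating the absent columns as 0 is an assumption made for illustration and may not suit every dataset.

```python
import pandas as pd

df = pd.DataFrame([[1, 9, 8, 9],
                   [4, 1, 3, 6],
                   [7, 2, 0, 3]], columns=list('QRST'))

halfrow = df.iloc[0, ::2]   # only columns Q and S

# Assumption: columns missing from halfrow (R and T) should count as 0
full_row = halfrow.reindex(df.columns, fill_value=0)

result = df - full_row
print(result)
#    Q  R  S  T
# 0  0  9  0  9
# 1  3  1 -5  6
# 2  6  2 -8  3
```

Columns R and T pass through unchanged because 0 is subtracted from them, and the result keeps its integer dtype since no ``NaN`` is introduced.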
