## Combining Data

In [None]:
import pandas as pd

pandas provides several ways to combine Series and DataFrame objects, with set logic for the indexes and relational-algebra functionality for join / merge-type operations. The main entry points are:

* merge() for combining data on common columns or indices
* join() for combining data on a key column or an index
* concat() for combining DataFrames across rows or columns

pandas also provides utilities to compare two Series or DataFrame objects and summarize their differences.
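As a minimal sketch of the comparison utilities mentioned above, the example below uses `DataFrame.compare()`. The frames `before` and `after` are hypothetical and exist only for illustration.

```python
import pandas as pd

# Two small frames with identical labels; only one cell differs.
before = pd.DataFrame({"col1": ["a", "b", "c"], "col2": [1.0, 2.0, 3.0]})
after = before.copy()
after.loc[1, "col2"] = 20.0

# compare() keeps only the rows and columns that differ and labels the
# two sides "self" and "other" in a second column level.
diff = before.compare(after)
print(diff)
```

Only row 1 of `col2` appears in `diff`, because every other cell is equal in both frames.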



## concat

The pandas.concat() function does all of the heavy lifting of performing concatenation operations along an axis while performing optional set logic (union or intersection) of the indexes (**if any**) on the other axes. Note that I say “if any” because there is only a single possible axis of concatenation for Series.

![merge](../images/pd_merging_concat_basic.png)

In [None]:
df1 = pd.DataFrame(
    {
        "A": ["A0", "A1", "A2", "A3"],
        "B": ["B0", "B1", "B2", "B3"],
        "C": ["C0", "C1", "C2", "C3"],
        "D": ["D0", "D1", "D2", "D3"],
    },
    index=[0, 1, 2, 3],
)


df2 = pd.DataFrame(
    {
        "A": ["A4", "A5", "A6", "A7"],
        "B": ["B4", "B5", "B6", "B7"],
        "C": ["C4", "C5", "C6", "C7"],
        "D": ["D4", "D5", "D6", "D7"],
    },
    index=[4, 5, 6, 7],
)


df3 = pd.DataFrame(
    {
        "A": ["A8", "A9", "A10", "A11"],
        "B": ["B8", "B9", "B10", "B11"],
        "C": ["C8", "C9", "C10", "C11"],
        "D": ["D8", "D9", "D10", "D11"],
    },
    index=[8, 9, 10, 11],
)


In [None]:
frames = [df1, df2, df3]
result = pd.concat(frames)
result

pd.concat() takes a list or dict of homogeneously-typed objects and concatenates them, with configurable handling of the other axes.

Documentation: [concat](https://pandas.pydata.org/docs/reference/api/pandas.concat.html)


Suppose we wanted to associate specific keys with each of the pieces of the chopped up DataFrame. We can do this using the keys argument:

![concat_keys](../images/pd_merging_concat_keys.png)

In [None]:
result = pd.concat(frames, keys=["x", "y", "z"])
result

The result has a hierarchical index, so each piece can be selected by its key as usual:

In [None]:
result.loc["y"]

concat() makes a full copy of the data, so calling it repeatedly (for example inside a loop) can cause a significant performance hit. If you need to combine several datasets, build a list first, for example with a list comprehension, and call concat() once.

```
frames = [ process_your_file(f) for f in files ]
result = pd.concat(frames)
```
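A runnable version of this pattern might look like the sketch below. The in-memory CSV strings and the `load_chunk` helper are hypothetical stand-ins for real files and a real loading function.

```python
import io

import pandas as pd

# Hypothetical stand-ins for files on disk: three small CSV payloads.
csv_chunks = ["a,b\n1,2\n", "a,b\n3,4\n", "a,b\n5,6\n"]


def load_chunk(text):
    # Hypothetical loader; a real one would probably take a file path.
    return pd.read_csv(io.StringIO(text))


# Collect every piece first, then concatenate exactly once.
loaded_frames = [load_chunk(text) for text in csv_chunks]
combined = pd.concat(loaded_frames, ignore_index=True)
print(combined)
```

The data is copied once, in the single `pd.concat()` call, instead of once per loop iteration.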

In [None]:
df4 = pd.DataFrame(
    {
        "B": ["B2", "B3", "B6", "B7"],
        "D": ["D2", "D3", "D6", "D7"],
        "F": ["F2", "F3", "F6", "F7"],
    },
    index=[2, 3, 6, 7],
)


result = pd.concat([df1, df4], axis=1)
result

In [None]:
result = pd.concat([df1, df4], axis=1, join="inner")
result

Suppose we just wanted to reuse the exact index from the original DataFrame:

In [None]:
result = pd.concat([df1, df4], axis=1).reindex(df1.index)
result

Ignoring indexes on the concatenation axis

In [None]:
result = pd.concat([df1, df4], ignore_index=True, sort=False)
result

![concat_ignore](../images/pd_merging_concat_ignore_index.png)
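The effect of ignore_index is easiest to see when the inputs share index labels. The sketch below uses two hypothetical frames, `top` and `bottom`, that both use the labels 0 and 1.

```python
import pandas as pd

top = pd.DataFrame({"A": ["A0", "A1"]}, index=[0, 1])
bottom = pd.DataFrame({"A": ["A2", "A3"]}, index=[0, 1])

# Default behaviour: the original labels are kept, duplicates included.
kept = pd.concat([top, bottom])
# ignore_index=True: the labels are replaced by a fresh 0..n-1 range.
renumbered = pd.concat([top, bottom], ignore_index=True)

print(list(kept.index))        # [0, 1, 0, 1]
print(list(renumbered.index))  # [0, 1, 2, 3]
```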

Concatenating mixed dimensions

In [None]:
s1 = pd.Series(["X0", "X1", "X2", "X3"], name="X")

result = pd.concat([df1, s1], axis=1)
result

#### Concatenating with group keys

In [None]:
s3 = pd.Series([0, 1, 2, 3], name="foo")

s4 = pd.Series([0, 1, 2, 3])

s5 = pd.Series([0, 1, 4, 5])

pd.concat([s3, s4, s5], axis=1)

In [None]:
pd.concat([s3, s4, s5], axis=1, keys=["red", "blue", "yellow"])

In [None]:
pd.concat(frames, keys=["x", "y", "z"])

In [None]:
pieces = {"x": df1, "y": df2, "z": df3}

result = pd.concat(pieces)
result

In [None]:
result = pd.concat(pieces, keys=["z", "y"])
result

In [None]:
result.index.levels

In [None]:
result = pd.concat(
    pieces, keys=["x", "y", "z"], levels=[["z", "y", "x", "w"]], names=["group_key"]
)
result.index.levels

## Appending rows to a DataFrame

To append rows, convert them to a DataFrame and pass them to concat(). Use ignore_index=True to instruct pandas to discard the existing index. If you wish to preserve the index, construct an appropriately-indexed DataFrame and concatenate it.

In [None]:
s2 = pd.Series(["X0", "X1", "X2", "X3"], index=["A", "B", "C", "D"])

result = pd.concat([df1, s2.to_frame().T], ignore_index=True)
result
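The same approach works when the new row arrives as a plain dict. This is a small sketch; `table` and `new_row` are hypothetical names used only here.

```python
import pandas as pd

table = pd.DataFrame({"A": ["A0", "A1"], "B": ["B0", "B1"]})
new_row = {"A": "A2", "B": "B2"}  # hypothetical record to append

# Wrapping the dict in a list turns it into a one-row DataFrame.
extended = pd.concat([table, pd.DataFrame([new_row])], ignore_index=True)
print(extended)
```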

## Merge

pandas has full-featured, high performance in-memory join operations idiomatically very similar to relational databases like SQL.

See the [merge cookbook](https://pandas.pydata.org/docs/user_guide/cookbook.html#cookbook-merge)

pandas provides a single function, merge(), as the entry point for all standard database join operations between DataFrame or named Series objects.

Note: The related join() method uses merge internally for the index-on-index (by default) and column(s)-on-index join. If you are joining on index only, you may wish to use DataFrame.join to save yourself some typing.
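To illustrate the relationship between the two, the sketch below performs the same index-on-index left join with join() and with merge(). The frames `left_idx` and `right_idx` are hypothetical.

```python
import pandas as pd

left_idx = pd.DataFrame({"A": ["A0", "A1", "A2"]}, index=["K0", "K1", "K2"])
right_idx = pd.DataFrame({"C": ["C0", "C2", "C3"]}, index=["K0", "K2", "K3"])

# join() aligns on the index and uses how="left" by default.
via_join = left_idx.join(right_idx)

# The equivalent merge() call has to spell out the index alignment.
via_merge = pd.merge(
    left_idx, right_idx, left_index=True, right_index=True, how="left"
)

print(via_join)
print(via_join.equals(via_merge))
```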

Three cases are important to understand:
 * one-to-one joins: for example when joining two DataFrame objects on their indexes (which must contain unique values).
 * many-to-one joins: for example when joining an index (unique) to one or more columns in a different DataFrame.
 * many-to-many joins: joining columns on columns.
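The cases differ mainly in how many rows the result can have. The sketch below uses hypothetical tables (`stores`, `sales`, `visits`) to contrast a many-to-one join with a many-to-many join.

```python
import pandas as pd

# Hypothetical lookup table with exactly one row per key.
stores = pd.DataFrame({"key": ["K0", "K1"], "city": ["Oslo", "Lima"]})
# Hypothetical tables in which a key may repeat.
sales = pd.DataFrame({"key": ["K0", "K0", "K1"], "amount": [10, 20, 30]})
visits = pd.DataFrame({"key": ["K0", "K0", "K1"], "n_visits": [1, 2, 3]})

# Many-to-one: every sales row matches exactly one store row.
many_to_one = pd.merge(sales, stores, on="key")
# Many-to-many: K0 appears twice on both sides, giving 2 * 2 rows for K0.
many_to_many = pd.merge(sales, visits, on="key")

print(len(many_to_one))   # 3
print(len(many_to_many))  # 5
```

In a many-to-many join the row count is the sum, over keys, of the product of the match counts on each side, which is why it can grow quickly.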
 
![merge_key](../images/pd_merging_merge_on_key.png)

In [None]:
left = pd.DataFrame(
    {
        "key": ["K0", "K1", "K2", "K3"],
        "A": ["A0", "A1", "A2", "A3"],
        "B": ["B0", "B1", "B2", "B3"],
    }
)


right = pd.DataFrame(
    {
        "key": ["K0", "K1", "K2", "K3"],
        "C": ["C0", "C1", "C2", "C3"],
        "D": ["D0", "D1", "D2", "D3"],
    }
)


result = pd.merge(left, right, on="key")
result

A more complicated example uses multiple join keys. Only the key combinations appearing in both left and right are present (the intersection), since **how='inner' by default**.

![merge_key2](../images/pd_merging_merge_on_key_multiple.png)

In [None]:
left = pd.DataFrame(
    {
        "key1": ["K0", "K0", "K1", "K2"],
        "key2": ["K0", "K1", "K0", "K1"],
        "A": ["A0", "A1", "A2", "A3"],
        "B": ["B0", "B1", "B2", "B3"],
    }
)


right = pd.DataFrame(
    {
        "key1": ["K0", "K1", "K1", "K2"],
        "key2": ["K0", "K0", "K0", "K0"],
        "C": ["C0", "C1", "C2", "C3"],
        "D": ["D0", "D1", "D2", "D3"],
    }
)


result = pd.merge(left, right, on=["key1", "key2"])
result

The **how** argument to merge specifies which keys are included in the resulting table. If a key combination is missing from the left or the right table, the values taken from that table are NA in the joined result. Here is a summary of the how options and their SQL equivalent names:

* left (LEFT OUTER JOIN): Use keys from left frame only
* right (RIGHT OUTER JOIN): Use keys from right frame only
* outer (FULL OUTER JOIN): Use union of keys from both frames
* inner (INNER JOIN): Use intersection of keys from both frames
* cross (CROSS JOIN): Create the cartesian product of rows of both frames
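As a compact illustration of these options, the sketch below merges two tiny hypothetical frames (`left_small`, `right_small`) that share only the key K1 and records which keys survive under each value of how.

```python
import pandas as pd

left_small = pd.DataFrame({"key": ["K0", "K1"], "lval": [1, 2]})
right_small = pd.DataFrame({"key": ["K1", "K2"], "rval": [3, 4]})

# Collect the surviving keys for each join type.
keys_by_how = {}
for how in ["left", "right", "inner", "outer"]:
    merged = pd.merge(left_small, right_small, on="key", how=how)
    keys_by_how[how] = merged["key"].tolist()
    print(how, keys_by_how[how])

# In the outer join, K0 has no match on the right, so rval is NA there.
outer = pd.merge(left_small, right_small, on="key", how="outer")
print(outer)

# A cross join ignores keys and pairs every row with every row.
cross = pd.merge(left_small, right_small, how="cross")
print(len(cross))  # 4
```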



left:
![pd_merging_merge_on_key_left.png](../images/pd_merging_merge_on_key_left.png)

right:
![pd_merging_merge_on_key_right.png](../images/pd_merging_merge_on_key_right.png)

inner:
![pd_merging_merge_on_key_inner.png](../images/pd_merging_merge_on_key_inner.png)

outer:
![pd_merging_merge_on_key_outer.png](../images/pd_merging_merge_on_key_outer.png)


In [None]:
result = pd.merge(left, right, how="left", on=["key1", "key2"])
result

In [None]:
result = pd.merge(left, right, how="right", on=["key1", "key2"])
result

In [None]:
result = pd.merge(left, right, how="inner", on=["key1", "key2"])
result

In [None]:
result = pd.merge(left, right, how="outer", on=["key1", "key2"])
result

Pay special attention to cross, which ignores keys and pairs every row of left with every row of right:

cross:
![pd_merging_merge_on_key_cross.png](../images/pd_merging_merge_cross.png)


In [None]:
result = pd.merge(left, right, how="cross")
result

## Merging and MultiIndex

You can merge a multi-indexed Series and a DataFrame if the names of the MultiIndex levels correspond to columns of the DataFrame. Transform the Series to a DataFrame using Series.reset_index() before merging.

In [None]:
df = pd.DataFrame({"Let": ["A", "B", "C"], "Num": [1, 2, 3]})
df

In [None]:
ser = pd.Series(
    ["a", "b", "c", "d", "e", "f"],
    index=pd.MultiIndex.from_arrays(
        [["A", "B", "C"] * 2, [1, 2, 3, 4, 5, 6]], names=["Let", "Num"]
    ),
)
ser

In [None]:
pd.merge(df, ser.reset_index(), on=["Let", "Num"])

### Another example

![merging_merge_on_key_dup.png](../images/pd_merging_merge_on_key_dup.png)


In [None]:
left = pd.DataFrame({"A": [1, 2], "B": [2, 2]})

right = pd.DataFrame({"A": [4, 5, 6], "B": [2, 2, 2]})

result = pd.merge(left, right, on="B", how="outer")
result

## Validate

Users can use the validate argument to automatically check whether there are unexpected duplicates in their merge keys. Key uniqueness is checked before merge operations and so should protect against memory overflows. Checking key uniqueness is also a good way to ensure user data structures are as expected.



In [None]:
left = pd.DataFrame({"A": [1, 2], "B": [1, 2]})
left

In [None]:
right = pd.DataFrame({"A": [4, 5, 6], "B": [2, 2, 2]})
right

In [None]:
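try:
    result = pd.merge(left, right, on="B", how="outer", validate="one_to_one")
except pd.errors.MergeError as err:
    print(err)  # right has duplicate B values, so the one_to_one check fails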

If you are aware of the duplicates in the right DataFrame but want to ensure there are no duplicates in the left DataFrame, use validate='one_to_many' instead, which does not raise an exception here.

In [None]:
pd.merge(left, right, on="B", how="outer", validate="one_to_many")

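## Merge indicator

merge() accepts the argument indicator. If True, a Categorical-type column called _merge is added to the output. It takes the value left_only when the merge key appears only in the left frame, right_only when it appears only in the right frame, and both when it appears in both.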

In [None]:
df1 = pd.DataFrame({"col1": [0, 1], "col_left": ["a", "b"]})

df2 = pd.DataFrame({"col1": [1, 2, 2], "col_right": [2, 2, 2]})

pd.merge(df1, df2, on="col1", how="outer", indicator=True)
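One possible use of the _merge column is an "anti-join", keeping only the rows whose key has no partner in the other frame. The sketch below rebuilds the same data under the hypothetical names `a_side` and `b_side` so that it is self-contained.

```python
import pandas as pd

a_side = pd.DataFrame({"col1": [0, 1], "col_left": ["a", "b"]})
b_side = pd.DataFrame({"col1": [1, 2, 2], "col_right": [2, 2, 2]})

flagged = pd.merge(a_side, b_side, on="col1", how="outer", indicator=True)

# Keep only rows whose key exists in the left frame but not in the right.
left_only = flagged[flagged["_merge"] == "left_only"].drop(columns="_merge")
print(left_only)
```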

## Merging ordered data

A merge_ordered() function allows combining time series and other ordered data. In particular it has an optional fill_method keyword to fill/interpolate missing data:

In [None]:
left = pd.DataFrame(
    {"k": ["K0", "K1", "K1", "K2"], "lv": [1, 2, 3, 4], "s": ["a", "b", "c", "d"]}
)
right = pd.DataFrame({"k": ["K1", "K2", "K4"], "rv": [1, 2, 3]})

pd.merge_ordered(left, right, fill_method="ffill", left_by="s")
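The effect of fill_method is easier to see without left_by. The sketch below merges two hypothetical series of observations, `prices` and `volumes`, on an ordered column `t`, once without filling and once with forward filling.

```python
import pandas as pd

prices = pd.DataFrame({"t": [1, 3, 5], "price": [10.0, 11.0, 12.0]})
volumes = pd.DataFrame({"t": [2, 3, 6], "volume": [100, 200, 300]})

# Without fill_method, values missing on one side stay NA.
plain = pd.merge_ordered(prices, volumes, on="t")
# With fill_method="ffill", gaps take the previous available value.
filled = pd.merge_ordered(prices, volumes, on="t", fill_method="ffill")

print(plain)
print(filled)
```

The first row of `volume` stays NA even after forward filling, because there is no earlier value to carry forward.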

## Resampling

resample() performs frequency conversion and resampling of time series.

The object must have a datetime-like index (DatetimeIndex, PeriodIndex, or TimedeltaIndex), or the caller must pass the label of a datetime-like series/index to the on/level keyword parameter.
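When the timestamps live in a regular column rather than in the index, the on keyword can be used. A minimal sketch, with a hypothetical `readings` frame and `when` column:

```python
import pandas as pd

readings = pd.DataFrame(
    {
        "when": pd.date_range("2000-01-01", periods=6, freq="min"),
        "value": [0, 1, 2, 3, 4, 5],
    }
)

# Resample on the "when" column instead of the (integer) index.
totals = readings.resample("3min", on="when")["value"].sum()
print(totals)
```

The six one-minute readings fall into two three-minute bins, so `totals` has two entries.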

In [None]:
index = pd.date_range('1/1/2000', periods=9, freq='min')  # 'min' replaces the deprecated 'T' alias
series = pd.Series(range(9), index=index)
series

In [None]:
series.resample('3min').sum()

In [None]:
series.resample('30s').asfreq()[0:5]

In [None]:
series.resample('30s').ffill()[0:5]

In [None]:
series.resample('30s').bfill()[0:5]

In [None]:
import numpy as np


def custom_resampler(arraylike):
    # Sum each 3-minute bin and add a constant offset of 5.
    return np.sum(arraylike) + 5


series.resample('3min').apply(custom_resampler)

## Apply

apply() applies a function along an axis of the DataFrame: axis=0 passes each column to the function, axis=1 passes each row.

In [None]:
df = pd.DataFrame([[4, 9]] * 3, columns=['A', 'B'])
df

In [None]:
df.apply(np.sqrt)

In [None]:
df.apply(np.sum, axis=0)

In [None]:
df.apply(np.sum, axis=1)


In [None]:
df.apply(lambda x: [1, 2], axis=1, result_type='expand')

In [None]:
# Returning a Series inside the function is similar to passing result_type='expand'.
# The resulting column names will be the Series index.

df.apply(lambda x: pd.Series([1, 2], index=['foo', 'bar']), axis=1)

In [None]:
# Passing result_type='broadcast' will ensure the same shape result, 
# whether list-like or scalar is returned by the function, 
# and broadcast it along the axis. 
# The resulting column names will be the originals.

df.apply(lambda x: [1, 2], axis=1, result_type='broadcast')