# pandas: Combining Data with `merge`, `join`, and `concat`

In [None]:
import pandas as pd

In [None]:
climate_temp = pd.read_csv("climate_temp.csv")
climate_temp.head()

In [None]:
climate_temp.shape

In [None]:
climate_precip = pd.read_csv("climate_precip.csv")
climate_precip.head()

In [None]:
climate_precip.shape

## merge()


### Inner Join

Here you will do an inner join using `merge()`. This function merges DataFrame or named Series objects with a database-style join.


Doc: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html?highlight=merge#pandas.DataFrame.merge

In [None]:
precip_one_station = climate_precip.query("STATION == 'GHCND:USC00045721'")
precip_one_station.head()

In [None]:
precip_one_station.shape

In [None]:
inner_merged = pd.merge(precip_one_station, climate_temp)
inner_merged.head()

In [None]:
inner_merged.shape

#### You can specify a single _key column_ with a string, or multiple _key columns_ with a list.

In [None]:
inner_merged_total = pd.merge(
    climate_temp, climate_precip,on=["STATION", "DATE"]
)
inner_merged_total.head()

In [None]:
inner_merged_total.shape

In [None]:
inner_merged_total.columns

#### Column differences?! 

Why 48 columns instead of 47? Because you specified the key columns to join on, pandas doesn't try to merge on every column the two DataFrames share. Shared columns that are not keys are kept from both DataFrames, which can result in "duplicate" column names whose values may or may not differ.

These duplicated columns are given new names: by default, `_x` and `_y` are appended to them. You can also use the `suffixes` parameter to control what is appended to the column names.
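
As a minimal sketch of the renaming described above, the following uses two small made-up DataFrames (`temp`, `precip`, and the shared `value` column are hypothetical, not taken from the climate files) to compare the default suffixes with custom ones.

```python
import pandas as pd

# Hypothetical toy data: both frames share a non-key column named "value".
temp = pd.DataFrame({"STATION": ["A", "B"], "DATE": [1, 1], "value": [20.5, 18.0]})
precip = pd.DataFrame({"STATION": ["A", "B"], "DATE": [1, 1], "value": [0.0, 1.2]})

# Default behavior: the overlapping non-key column gets "_x" and "_y".
default_names = pd.merge(temp, precip, on=["STATION", "DATE"])

# Custom suffixes make it clearer which frame each column came from.
custom_names = pd.merge(
    temp, precip, on=["STATION", "DATE"], suffixes=("_temp", "_precip")
)

print(list(default_names.columns))
print(list(custom_names.columns))
```

Descriptive suffixes tend to be easier to read later than `_x` and `_y`, especially when the merged result is passed on to other code.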

### Outer Join
With the outer join, you will also retain rows that don't have matches. For this example, you will use the smaller precipitation DataFrame `precip_one_station` with the full `climate_temp` DataFrame and join on the `STATION` and `DATE` columns as the key columns.

In [None]:
outer_merged = pd.merge(
    precip_one_station, climate_temp, how="outer", on=["STATION", "DATE"]
)
outer_merged.head()

In [None]:
outer_merged.shape

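The number of rows in `outer_merged` matches the number of rows in `climate_temp`. An outer join never drops rows the way an inner join does, so the result has at least as many rows as the larger DataFrame. The counts are equal here, which is what you would expect when every key in `precip_one_station` also appears in `climate_temp`.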
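
To check where each row of an outer join came from, `merge()` accepts `indicator=True`, which adds a `_merge` column. The sketch below uses hypothetical toy frames (`small`, `large`, and the `key` column are made up for illustration).

```python
import pandas as pd

# Hypothetical toy data: "small" holds a subset of the keys in "large".
small = pd.DataFrame({"key": [1, 2], "precip": [0.1, 0.2]})
large = pd.DataFrame({"key": [1, 2, 3], "temp": [10, 11, 12]})

# indicator=True adds a "_merge" column with the values
# "left_only", "right_only", or "both" for each row.
outer = pd.merge(small, large, how="outer", on="key", indicator=True)

print(outer)
print(outer["_merge"].value_counts())
```

Rows marked `right_only` exist only in the second DataFrame, so their columns from the first DataFrame are filled with `NaN`.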

### Left Join
Also known as a left outer join, the left join retains unmatched rows only from the left (or first) DataFrame to be merged.

In [None]:
left_merged = pd.merge(
    climate_temp, precip_one_station, how="left", on=["STATION", "DATE"]
)
left_merged.head()

In [None]:
left_merged.shape

The number of rows in the resulting DataFrame matches that of the rows in the `climate_temp` DataFrame.

### Right Join
This works the same as the left join, except that unmatched rows are retained only from the _right_ DataFrame. In the next example, you will recreate the `left_merged` DataFrame but with a right join.

In [None]:
right_merged = pd.merge(
    precip_one_station, climate_temp, how="right", on=["STATION", "DATE"]
)
right_merged.head()

In [None]:
right_merged.shape
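
The relationship between left and right joins can be sketched on a small scale: swapping the order of the two DataFrames and switching `how` should give the same rows, with only the column order differing. The frames and values below are hypothetical stand-ins, not the climate data.

```python
import pandas as pd

# Hypothetical toy data: "precip" only has a match for one row of "temp".
temp = pd.DataFrame(
    {"STATION": ["A", "A", "B"], "DATE": [1, 2, 1], "TEMP": [20, 21, 15]}
)
precip = pd.DataFrame({"STATION": ["A"], "DATE": [1], "PRCP": [0.3]})

# Keep every row of "temp", whichever side it is passed on.
left = pd.merge(temp, precip, how="left", on=["STATION", "DATE"])
right = pd.merge(precip, temp, how="right", on=["STATION", "DATE"])

print(left.shape, right.shape)

# Compare the two results after putting the columns in the same order.
print(left.equals(right[left.columns]))
```

Which form to use is mostly a matter of readability: many people find it easier to keep the "main" DataFrame on the left and use `how="left"`.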


## .join()
`.join()` uses `merge()` under the hood, but it provides a simpler interface and joins on indexes by default.

Doc: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.join.html?highlight=join#pandas.DataFrame.join

In [None]:
precip_one_station.join(climate_temp, lsuffix="_left", rsuffix="_right")

The overlapping columns are kept and renamed to be unique. If you flip this around and call `.join()` on the larger DataFrame instead, you'll notice that the result is larger, and data that doesn't exist in the smaller DataFrame (`precip_one_station`) is filled in with `NaN` values.

In [None]:
climate_temp.join(precip_one_station, lsuffix="_left", rsuffix="_right")

If you must use `.join()` and want to join on columns, the DataFrame you pass in must have those columns set as its index, while the calling DataFrame can name them with the `on` parameter. First take a look at this previously used `merge()` operation:

In [None]:
inner_merged_total = pd.merge(
    climate_temp, climate_precip, on=["STATION", "DATE"]
)
inner_merged_total.head()

In [None]:
inner_joined_total = climate_temp.join(
    climate_precip.set_index(["STATION", "DATE"]),
    on=["STATION", "DATE"],
    how="inner",
    lsuffix="_x",
    rsuffix="_y",
)
inner_joined_total.head()
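
To make the correspondence between the two calls concrete, here is a minimal sketch on hypothetical toy frames (the values are made up) that runs both forms and compares the results. The comparison ignores the index because `.join()` with `on` may carry over the calling DataFrame's row labels.

```python
import pandas as pd

# Hypothetical toy data with one unmatched row on each side.
temp = pd.DataFrame(
    {"STATION": ["A", "A", "B"], "DATE": [1, 2, 1], "TEMP": [20, 21, 15]}
)
precip = pd.DataFrame(
    {"STATION": ["A", "B", "C"], "DATE": [1, 1, 1], "PRCP": [0.3, 0.0, 1.1]}
)

# Column-based inner join with merge().
merged = pd.merge(temp, precip, on=["STATION", "DATE"])

# Same join with .join(): the other frame must be indexed by the key columns.
joined = temp.join(
    precip.set_index(["STATION", "DATE"]),
    on=["STATION", "DATE"],
    how="inner",
)

# Compare column by column, ignoring the row labels.
print(merged.to_dict("list") == joined.to_dict("list"))
```

In this sketch no suffixes are needed because the only shared columns are the keys.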

## concat()

First, you will see a basic concatenation along axis 0.

In [None]:
double_precip = pd.concat([precip_one_station, precip_one_station])
double_precip.head()

To reset the index, use the `ignore_index` parameter.

In [None]:
reindexed = pd.concat(
    [precip_one_station, precip_one_station], ignore_index=True
)
reindexed.head()

When axis labels for the axis you are **not** concatenating along don't match (for example, column labels when concatenating along rows), then all columns are preserved and missing data is filled in with `NaN`. 

In [None]:
outer_joined = pd.concat([climate_precip, climate_temp])
outer_joined.head()

In [None]:
inner_joined = pd.concat([climate_temp, climate_precip], join="inner")
inner_joined.head()

In [None]:
inner_joined.shape
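
The difference between the default `join="outer"` and `join="inner"` is easier to see on a tiny example. This sketch assumes two hypothetical frames that share only the `STATION` column.

```python
import pandas as pd

# Hypothetical toy data: only "STATION" is common to both frames.
temp = pd.DataFrame({"STATION": ["A", "B"], "TEMP": [20, 15]})
precip = pd.DataFrame({"STATION": ["A", "C"], "PRCP": [0.3, 1.1]})

# Default join="outer": union of columns, gaps filled with NaN.
outer = pd.concat([temp, precip])

# join="inner": only the columns present in every input are kept.
inner = pd.concat([temp, precip], join="inner")

print(outer)
print(inner)
```

In both cases all four rows are kept. Unlike `merge()`, `concat()` along rows stacks the inputs and does not match rows on key values.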

To see how an inner join applies to row labels, concatenate along columns instead:

In [None]:
inner_joined_cols = pd.concat(
    [climate_temp, climate_precip], axis="columns", join="inner"
)
inner_joined_cols.head()

In [None]:
inner_joined_cols.shape
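
When concatenating along columns, the inputs are aligned on their row labels, and `join="inner"` keeps only the labels present in every input. A minimal sketch with hypothetical frames whose indexes only partly overlap:

```python
import pandas as pd

# Hypothetical toy data: the row labels overlap only on 1 and 2.
left = pd.DataFrame({"TEMP": [20, 15, 12]}, index=[0, 1, 2])
right = pd.DataFrame({"PRCP": [0.3, 1.1]}, index=[1, 2])

# Place the frames side by side, keeping only the shared row labels.
side_by_side = pd.concat([left, right], axis="columns", join="inner")

print(side_by_side)
```

With the default `join="outer"`, row label `0` would be kept as well, with `NaN` in the `PRCP` column.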

You can also use the `keys` parameter to set hierarchical axis labels. This is useful, for example, when you want to preserve the original labels while adding labels that tell you which dataset each row or column came from.

In [None]:
hierarchical_keys = pd.concat(
    [climate_temp, climate_precip], keys=["temp", "precip"]
)
hierarchical_keys.head()

In [None]:
hierarchical_keys.tail()
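
Once the outer level of the index records the source, `.loc` can pull one dataset's rows back out. The sketch below uses hypothetical toy frames in place of the climate data.

```python
import pandas as pd

# Hypothetical toy data standing in for the two climate frames.
temp = pd.DataFrame({"STATION": ["A", "B"], "TEMP": [20, 15]})
precip = pd.DataFrame({"STATION": ["A", "C"], "PRCP": [0.3, 1.1]})

# The keys become the outer level of a MultiIndex on the rows.
labeled = pd.concat([temp, precip], keys=["temp", "precip"])
print(labeled.index.tolist())

# Selecting on the outer level returns the rows from that input only.
temp_rows = labeled.loc["temp"]
print(temp_rows)
```

The selected rows keep their original labels, but they still include the columns from the other input, filled with `NaN`.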