# Merging

You've already merged datasets, but so far our examples have been "well-behaved", so it was easy to just proceed. Real-world datasets are messy (bad variable names, poor documentation) and big, so merging isn't always as easy as "just do it".

## A nice overview

The [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/03.07-merge-and-join.html) has a wonderful breakdown of the mechanics of merging. You should read it! I'm going to borrow from it, with some alterations.

## Important parameters of `pd.merge`

Read through the [parameters of the function here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html). 

- `left`, `right` - the two datasets (DataFrames) you're merging
- `on` - what variable(s) to use to match from the left and right datasets. `on = ` a single variable **or a list of variables** you're using
    - if the variable names aren't the same in the datasets (e.g. "ID" in one and "identity" in the other), use `left_on` and `right_on` instead of `on`
    - if the variables are the index variables, use `left_index = True` and/or `right_index = True` instead of `on`
- `how` - what observations are in the resulting dataset

    | option | observations in resulting dataset |
    | :--- | :--- |
    | `how = "inner"` | Observations that are in both datasets |
    | `how = "left"` | "inner" + all unmatched obs in left |
    | `how = "right"` | "inner" + all unmatched obs in right |
    | `how = "outer"` | "inner" + all unmatched obs in left and right |
- `suffixes` - when a variable (other than the keys) is in both datasets, how should we name each copy?
    - **It's a good idea to always use this option and specify the source, because the default option (`_x` and `_y`) is uninformative!**
- `indicator = True` will create a variable (`_merge`) saying which dataset each observation came from
- `validate` = "one_to_one", "one_to_many", or "many_to_one". Will check if the merge is actually what you think it is. Useful!
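
To make the parameters above concrete, here is a minimal sketch on toy data. The dataframes, column names ("ID", "identity", "value"), and suffix labels are all made up for illustration; only the `merge` parameters come from the list above.

```python
import pandas as pd

# Hypothetical toy data: the key is called "ID" on the left and "identity" on the right,
# and both datasets have a non-key column called "value"
left = pd.DataFrame({"ID": [1, 2, 3], "value": [10, 20, 30]})
right = pd.DataFrame({"identity": [2, 3, 4], "value": [200, 300, 400]})

# How many observations does each `how` option keep?
rows = {
    how: len(left.merge(right, left_on="ID", right_on="identity", how=how))
    for how in ["inner", "left", "right", "outer"]
}
print(rows)

merged = left.merge(
    right,
    left_on="ID", right_on="identity",  # key names differ, so `on` won't work
    how="outer",                        # keep matched and unmatched obs from both sides
    suffixes=("_left", "_right"),       # label the source of the overlapping "value" column
    indicator=True,                     # adds the `_merge` column
    validate="one_to_one",              # raises an error if a key is duplicated on either side
)
print(merged)
```

Only keys 2 and 3 appear in both toy datasets, so the inner merge is the smallest and the outer merge is the largest. The `_merge` column shows which rows matched and which came from one side only.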

## Tips and best practices

**THESE ARE IMPORTANT**

1. What variable (**or variables**) should you be merging on? For example: Should you merge based on the firm, or the firm AND the year? It depends on the observational units in each dataset and the variable(s) you're using as a key. 
1. What are the observation units in your datasets? 
1. **Before your merge, examine the "keys"** _(keys are the variables you'll use in the `on` parameter)_
    1. Drop any observations with missing keys in each dataset
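    2. How many unique keys are in each dataset? Replace `df` and `keys` in this code with your dataframe and your list of key variables:

    ```python
    # keys is a list of column names, e.g. ["firm", "year"]
    len(df[keys].drop_duplicates())
    ```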
    3. What are the observation units in your datasets? What will the observation unit be after your merge? 
1. **Always specify `how`, `on`, `indicator`, and `validate`**
    1. This will force you to think about the observation levels in each dataset you're merging before you try the merge, and whether the merge you're doing is 1:1, 1:M, M:M, or M:1.
    2. Guess how many observations you'll have after the merge (more or fewer than left? more or fewer than right?) and then check afterwards.
1. **After the merge**, check that it did what you expected, and give it a _good_ name.    
    1. Examine the `_merge` variable (value_counts, e.g.)
    1. Good names: I often name the dataframe after its new observation level.
    
    _For example, I know exactly how `state_industry_year_df` and `state_industry_df` should differ._     
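
The checklist above can be sketched as a short before/merge/after routine. This is only an illustration under assumed data: the firm-year and firm-level dataframes, their column names, and the name of the merged dataframe are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: one row per firm-year on the left, one row per firm on the right
firm_year_df = pd.DataFrame({
    "firm": ["A", "A", "B", "B", None],
    "year": [2019, 2020, 2019, 2020, 2020],
    "sales": [1.0, 1.2, 3.0, 3.1, 9.9],
})
firm_df = pd.DataFrame({"firm": ["A", "B", "C"], "state": ["PA", "NY", "CA"]})

keys = ["firm"]  # the right dataset is at the firm level, so merge on firm only

# Before the merge: drop observations with missing keys, then count unique keys
firm_year_df = firm_year_df.dropna(subset=keys)
firm_df = firm_df.dropna(subset=keys)
n_left_keys = len(firm_year_df[keys].drop_duplicates())
n_right_keys = len(firm_df[keys].drop_duplicates())
print(n_left_keys, n_right_keys)

# The merge: many firm-years per firm, so this should be many_to_one.
# A left merge that is M:1 should leave the number of rows unchanged.
firm_year_state_df = firm_year_df.merge(
    firm_df, on=keys, how="left", indicator=True, validate="many_to_one",
)

# After the merge: did it do what we expected?
print(firm_year_state_df["_merge"].value_counts())
print(len(firm_year_state_df) == len(firm_year_df))
```

The merged dataframe is still at the firm-year level, which is why its (made-up) name keeps `firm_year` in it.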


## Illustration

### `how`

