Combining and merging data
==========================

Data contained in pandas objects can be combined in several ways:

- [pandas.merge](https://pandas.pydata.org/docs/reference/api/pandas.merge.html) joins rows in DataFrames based on one or more keys. This function is familiar from SQL or other relational databases, as it implements database join operations.

- [pandas.concat](https://pandas.pydata.org/docs/reference/api/pandas.concat.html) concatenates or *stacks* objects along an axis.

- The instance methods [pandas.DataFrame.combine_first](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.combine_first.html) or [pandas.Series.combine_first](https://pandas.pydata.org/docs/reference/api/pandas.Series.combine_first.html) allow overlapping data to be joined.

- [pandas.merge_asof](https://pandas.pydata.org/docs/reference/api/pandas.merge_asof.html) performs time-series-based window joins between DataFrame objects, matching on the nearest key rather than an equal key.
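
The last two entries in the list are not revisited later in this section, so here is a minimal sketch of each. The Series `a` and `b` and the `trades` and `quotes` frames are made-up illustration data, not part of the datasets used below.

```python
import numpy as np
import pandas as pd

# combine_first: keep the caller's values and fill its gaps from the other object
a = pd.Series([1.0, np.nan, 3.0], index=["x", "y", "z"])
b = pd.Series([10.0, 20.0, 30.0], index=["x", "y", "z"])
patched = a.combine_first(b)
print(patched)

# merge_asof: for each left row, take the most recent right row at or before its time
# (both frames must already be sorted by the key column)
trades = pd.DataFrame(
    {
        "time": pd.to_datetime(["2024-01-01 09:00:05", "2024-01-01 09:00:12"]),
        "qty": [100, 200],
    }
)
quotes = pd.DataFrame(
    {
        "time": pd.to_datetime(["2024-01-01 09:00:00", "2024-01-01 09:00:10"]),
        "price": [10.0, 10.5],
    }
)
matched = pd.merge_asof(trades, quotes, on="time")
print(matched)
```

In `patched`, only the missing value at `y` is taken from `b`. In `matched`, each trade is paired with the latest quote that is not newer than the trade.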

Database-like DataFrame joins
-------------------------------

Merge or join operations combine data sets by linking rows with one or more keys. These operations are especially important in relational, SQL-based databases. The merge function in pandas is the main entry point for applying these algorithms to your data.

In [1]:
import pandas as pd
# First dataset: User engagement data for a social media platform
engg = pd.DataFrame(
    {
        'UserID': [101, 102, 103, 104, 105],
        'Likes': [250, 300, 350, 400, 450],
        'Comments': [50, 60, 70, 80, 90],
        'Shares': [25, 30, 35, 40, 45],
    }
)

# Second dataset: User profile data
prof = pd.DataFrame(
    {
        'UserID': [101, 102, 103, 106, 107],
        'Name': ['Dipesh', 'Bob', 'Charlie', 'Diana', 'Ethan'],
        'Country': ['Nepal', 'USA', 'Australia', 'Canada', 'UK'],
    }
)

print(engg) 

print(prof)


   UserID  Likes  Comments  Shares
0     101    250        50      25
1     102    300        60      30
2     103    350        70      35
3     104    400        80      40
4     105    450        90      45
   UserID     Name    Country
0     101   Dipesh      Nepal
1     102      Bob        USA
2     103  Charlie  Australia
3     106    Diana     Canada
4     107    Ethan         UK


When we call `merge` with these objects, we get:

In [2]:
pd.merge(prof, engg)

   UserID     Name    Country  Likes  Comments  Shares
0     101   Dipesh      Nepal    250        50      25
1     102      Bob        USA    300        60      30
2     103  Charlie  Australia    350        70      35


By default, `merge` performs a so-called *inner join*: the keys in the result are the intersection of the keys found in both tables.

Note:

We did not specify which column to join on. If this information is omitted, `merge` uses the overlapping column names as keys. However, it is good practice to specify the key explicitly:


In [3]:
pd.merge(prof, engg, on='UserID')

   UserID     Name    Country  Likes  Comments  Shares
0     101   Dipesh      Nepal    250        50      25
1     102      Bob        USA    300        60      30
2     103  Charlie  Australia    350        70      35


If the key columns have different names in each object, you can specify them separately with `left_on` and `right_on`. In the following example, the key column in `profile` is named `Profile_ID` rather than `UserID`:

In [6]:
profile = pd.DataFrame(
    {
        'Profile_ID': [101, 102, 103, 106, 107],
        'Name': ['Dipesh', 'Bob', 'Charlie', 'Diana', 'Ethan'],
        'Country': ['Nepal', 'USA', 'Australia', 'Canada', 'UK'],
    }
)
# print(profile)

# pd.merge(profile, engg) would raise a MergeError: the two frames share no column names


pd.merge(profile, engg, left_on='Profile_ID', right_on='UserID')

   Profile_ID     Name    Country  UserID  Likes  Comments  Shares
0         101   Dipesh      Nepal     101    250        50      25
1         102      Bob        USA     102    300        60      30
2         103  Charlie  Australia     103    350        70      35


However, `merge` is not limited to inner joins. The `how` argument selects the join type:

| Option | Behavior |
| ------ | -------- |
| `how='inner'` | uses only the key combinations observed in both tables |
| `how='left'`  | uses all key combinations found in the left table |
| `how='right'` | uses all key combinations found in the right table |
| `how='outer'` | uses all key combinations observed in either table |

In [20]:
pd.merge(prof, engg, on="UserID", how="left")

   UserID     Name    Country  Likes  Comments  Shares
0     101   Dipesh      Nepal  250.0      50.0    25.0
1     102      Bob        USA  300.0      60.0    30.0
2     103  Charlie  Australia  350.0      70.0    35.0
3     106    Diana     Canada    NaN       NaN     NaN
4     107    Ethan         UK    NaN       NaN     NaN


In [21]:
pd.merge(engg, prof, on="UserID", how="outer")

   UserID  Likes  Comments  Shares     Name    Country
0     101  250.0      50.0    25.0   Dipesh      Nepal
1     102  300.0      60.0    30.0      Bob        USA
2     103  350.0      70.0    35.0  Charlie  Australia
3     104  400.0      80.0    40.0      NaN        NaN
4     105  450.0      90.0    45.0      NaN        NaN
5     106    NaN       NaN     NaN    Diana     Canada
6     107    NaN       NaN     NaN    Ethan         UK
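
The remaining option from the table, `how='right'`, works the same way with the roles of the tables swapped. The sketch below also passes `indicator=True`, an optional `merge` argument that adds a `_merge` column recording whether each row came from the left table, the right table, or both. The frames are trimmed copies of `engg` and `prof` so that the block runs on its own.

```python
import pandas as pd

# Trimmed copies of the engagement and profile data used above
engg = pd.DataFrame(
    {"UserID": [101, 102, 103, 104, 105], "Likes": [250, 300, 350, 400, 450]}
)
prof = pd.DataFrame(
    {
        "UserID": [101, 102, 103, 106, 107],
        "Name": ["Dipesh", "Bob", "Charlie", "Diana", "Ethan"],
    }
)

# Right join: keep every key of the right table (engg)
right_joined = pd.merge(prof, engg, on="UserID", how="right")
print(right_joined)

# Outer join with an indicator column showing the origin of each row
outer_joined = pd.merge(engg, prof, on="UserID", how="outer", indicator=True)
print(outer_joined[["UserID", "_merge"]])
```

Filtering on `_merge` is a convenient way to find keys that exist in only one of the two tables.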


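In contrast to `merge`, `concat` does not match rows on keys. It stacks the objects along an axis and fills columns that are missing from one of the objects with `NaN`:

In [7]:
concatenated = pd.concat([prof, engg])
print(concatenated)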

   UserID     Name    Country  Likes  Comments  Shares
0     101   Dipesh      Nepal    NaN       NaN     NaN
1     102      Bob        USA    NaN       NaN     NaN
2     103  Charlie  Australia    NaN       NaN     NaN
3     106    Diana     Canada    NaN       NaN     NaN
4     107    Ethan         UK    NaN       NaN     NaN
0     101      NaN        NaN  250.0      50.0    25.0
1     102      NaN        NaN  300.0      60.0    30.0
2     103      NaN        NaN  350.0      70.0    35.0
3     104      NaN        NaN  400.0      80.0    40.0
4     105      NaN        NaN  450.0      90.0    45.0
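
The output above keeps each object's original index, so the labels 0 to 4 appear twice. `concat` is usually more useful for objects that share the same columns. The following is a minimal sketch with two hypothetical monthly frames, `jan` and `feb`, that are not part of the original data; it uses the `ignore_index` and `keys` arguments of `concat`.

```python
import pandas as pd

# Hypothetical frames with identical columns
jan = pd.DataFrame({"UserID": [101, 102], "Likes": [250, 300]})
feb = pd.DataFrame({"UserID": [103, 104], "Likes": [350, 400]})

# ignore_index=True discards the old labels and builds a fresh 0..n-1 index
stacked = pd.concat([jan, feb], ignore_index=True)
print(stacked)

# keys=... keeps the origin of each row as the outer level of a MultiIndex
labeled = pd.concat([jan, feb], keys=["jan", "feb"])
print(labeled.loc["feb"])
```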


For `merge`, the join method (`how`) only affects which distinct key values appear in the result.

To join on multiple keys, you can pass a list of column names:

In [22]:
# pd.merge(engg, prof, on=["UserID", ...], how="outer")
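
The commented-out call above leaves the second key unspecified. As a minimal sketch, assume both tables also carry a hypothetical `Platform` column (not present in the original data), so that a row is identified by the pair (`UserID`, `Platform`):

```python
import pandas as pd

# Hypothetical per-platform data; "Platform" and "Followers" are illustrative columns
left = pd.DataFrame(
    {
        "UserID": [101, 101, 102],
        "Platform": ["web", "mobile", "web"],
        "Likes": [250, 120, 300],
    }
)
right = pd.DataFrame(
    {
        "UserID": [101, 102, 102],
        "Platform": ["web", "web", "mobile"],
        "Followers": [1000, 800, 150],
    }
)

# Rows match only when both UserID and Platform are equal
inner = pd.merge(left, right, on=["UserID", "Platform"], how="inner")
outer = pd.merge(left, right, on=["UserID", "Platform"], how="outer")
print(inner)
print(outer)
```

With multiple keys, pandas treats each combination of key values as a single composite key, so (101, "web") and (101, "mobile") are different keys.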

Data cleaning using pandas
--------------------------

Generally, the following tasks are involved in data cleaning:

- Rename columns using the `rename()` method.
- Update values using the `at[]` or `iat[]` accessors to access and modify specific elements.
- Create a copy of a Series or DataFrame using the `copy()` method.
- Check for NULL values using the `isnull()` method, and drop them using the `dropna()` method.
- Check for duplicate values using the `duplicated()` method. Drop them using the `drop_duplicates()` method.
- Replace NULL values with a specified value using the `fillna()` method.
- Replace values using the `replace()` method.
- Sort values using the `sort_values()` method.
- Rank values using the `rank()` method.
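
The tasks above can be sketched in one short pass. The `raw` frame below is made-up illustration data with a duplicate row, a missing value and a placeholder string. The order of the steps is one reasonable choice, not a fixed recipe.

```python
import numpy as np
import pandas as pd

# Made-up messy data: awkward column names, one duplicate row, one NaN, one "N/A"
raw = pd.DataFrame(
    {
        "user id": [101, 102, 102, 103, 104],
        "likes": [250, 300, 300, np.nan, 400],
        "country": ["Nepal", "USA", "USA", "Australia", "N/A"],
    }
)

clean = raw.copy()  # work on a copy so raw stays untouched
clean = clean.rename(
    columns={"user id": "UserID", "likes": "Likes", "country": "Country"}
)

n_duplicates = clean.duplicated().sum()   # number of fully duplicated rows
clean = clean.drop_duplicates()           # keeps the first occurrence

n_missing = clean["Likes"].isnull().sum()  # number of missing Likes values
clean["Likes"] = clean["Likes"].fillna(0)  # or clean.dropna() to drop such rows

clean["Country"] = clean["Country"].replace("N/A", "Unknown")
clean.at[0, "Likes"] = 260                 # correct one value by row label and column

clean = clean.sort_values("Likes", ascending=False)
clean["Rank"] = clean["Likes"].rank(ascending=False)
print(clean)
```

Note that `drop_duplicates()` keeps the original row labels, so `at[0, "Likes"]` still refers to the row for user 101.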