### Combining datasets in Pandas

Combining datasets is a core skill for data analysis in Pandas. The library offers three main tools, each covered below: `pd.concat` for stacking DataFrames, `pd.merge` for SQL-style joins on key columns, and `DataFrame.join` for joins on the index.

#### Common Ways to Combine Datasets
1. Concatenation (`pd.concat`)

`pd.concat` stacks DataFrames either vertically (adding rows) or horizontally (adding columns).

Stack rows (same columns)



In [2]:
import pandas as pd

In [3]:
df1 = pd.DataFrame({'ID': [1, 2], 'Name': ['Alice', 'Bob']})
df2 = pd.DataFrame({'ID': [3, 4], 'Name': ['Carol', 'David']})

result = pd.concat([df1, df2])

In [4]:
result

|   | ID | Name |
|---|---|---|
| 0 | 1 | Alice |
| 1 | 2 | Bob |
| 0 | 3 | Carol |
| 1 | 4 | David |
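
Note that the stacked result keeps each frame's original row labels, so 0 and 1 each appear twice. If those labels carry no meaning, passing `ignore_index=True` is one way to get a fresh 0..n-1 index. A minimal sketch reusing the two frames above (the variable names `stacked` and `renumbered` are only illustrative):

```python
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2], 'Name': ['Alice', 'Bob']})
df2 = pd.DataFrame({'ID': [3, 4], 'Name': ['Carol', 'David']})

# Default behaviour: the original row labels are kept, so they repeat.
stacked = pd.concat([df1, df2])

# ignore_index=True discards the old labels and renumbers the rows.
renumbered = pd.concat([df1, df2], ignore_index=True)

print(stacked.index.tolist())
print(renumbered.index.tolist())
```

Duplicate labels are not an error, but label-based lookups such as `stacked.loc[0]` then return more than one row.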


Stack columns (same rows)

In [7]:
df1 = pd.DataFrame({'ID': [1, 2], 'Name': ['Alice', 'Bob']})
df2 = pd.DataFrame({'ID': [3, 4], 'Name': ['Carol', 'David']})

result = pd.concat([df1, df2], axis=1)
result

|   | ID | Name | ID | Name |
|---|---|---|---|---|
| 0 | 1 | Alice | 3 | Carol |
| 1 | 2 | Bob | 4 | David |
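
Because both inputs have `ID` and `Name` columns, the side-by-side result ends up with repeated column names. One possible way to keep them apart is the `keys=` argument of `pd.concat`, which adds an outer column level. In this sketch the labels `'left'` and `'right'` are arbitrary choices for illustration, not names Pandas requires:

```python
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2], 'Name': ['Alice', 'Bob']})
df2 = pd.DataFrame({'ID': [3, 4], 'Name': ['Carol', 'David']})

# keys= adds an outer column level that names the source of each block.
side_by_side = pd.concat([df1, df2], axis=1, keys=['left', 'right'])

# Columns are now addressed as (source, column) pairs.
print(side_by_side.columns.tolist())
print(side_by_side[('right', 'Name')].tolist())
```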


2. Merging (`pd.merge`)

`pd.merge` works like a SQL join: rows from two DataFrames are matched on one or more key columns.

Example Scenario: Customers and Orders

Dataset 1: Customers (df_customers)

In [16]:
df_customers = pd.DataFrame({
    'CustomerID': [1, 2, 3],
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Country': ['Kenya', 'Uganda', 'Kenya']
})
df_customers

|   | CustomerID | Name | Country |
|---|---|---|---|
| 0 | 1 | Alice | Kenya |
| 1 | 2 | Bob | Uganda |
| 2 | 3 | Charlie | Kenya |


Dataset 2: Orders (df_orders)

In [17]:
df_orders = pd.DataFrame({
    'OrderID': [101, 102, 103, 104],
    'CustomerID': [2, 1, 2, 4],
    'Amount': [250, 150, 300, 500]
})
df_orders

|   | OrderID | CustomerID | Amount |
|---|---|---|---|
| 0 | 101 | 2 | 250 |
| 1 | 102 | 1 | 150 |
| 2 | 103 | 2 | 300 |
| 3 | 104 | 4 | 500 |


1. Inner Join – Only customers who have orders, and only orders whose customer is known

In [11]:
pd.merge(df_customers, df_orders, on='CustomerID', how='inner')

|   | CustomerID | Name | Country | OrderID | Amount |
|---|---|---|---|---|---|
| 0 | 1 | Alice | Kenya | 102 | 150 |
| 1 | 2 | Bob | Uganda | 101 | 250 |
| 2 | 2 | Bob | Uganda | 103 | 300 |


2. Left Join – All customers, even if they didn’t order

In [12]:
pd.merge(df_customers, df_orders, on='CustomerID', how='left')

|   | CustomerID | Name | Country | OrderID | Amount |
|---|---|---|---|---|---|
| 0 | 1 | Alice | Kenya | 102.0 | 150.0 |
| 1 | 2 | Bob | Uganda | 101.0 | 250.0 |
| 2 | 2 | Bob | Uganda | 103.0 | 300.0 |
| 3 | 3 | Charlie | Kenya | NaN | NaN |
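
In the left join output, `OrderID` and `Amount` show up as floats (102.0, 150.0). This happens because Charlie's missing values are stored as NaN, which an ordinary integer column cannot hold. If whole numbers are preferred, one option is Pandas' nullable `'Int64'` dtype. A small sketch, where `left_float` and `left_int` are illustrative names:

```python
import pandas as pd

df_customers = pd.DataFrame({
    'CustomerID': [1, 2, 3],
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Country': ['Kenya', 'Uganda', 'Kenya']
})
df_orders = pd.DataFrame({
    'OrderID': [101, 102, 103, 104],
    'CustomerID': [2, 1, 2, 4],
    'Amount': [250, 150, 300, 500]
})

# The unmatched customer introduces NaN, so both columns become float64.
left_float = pd.merge(df_customers, df_orders, on='CustomerID', how='left')
print(left_float.dtypes)

# The nullable 'Int64' dtype keeps whole numbers and marks gaps as <NA>.
left_int = left_float.astype({'OrderID': 'Int64', 'Amount': 'Int64'})
print(left_int)
```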


3. Right Join – All orders, even if customer not found

In [13]:
pd.merge(df_customers, df_orders, on='CustomerID', how='right')

|   | CustomerID | Name | Country | OrderID | Amount |
|---|---|---|---|---|---|
| 0 | 2 | Bob | Uganda | 101 | 250 |
| 1 | 1 | Alice | Kenya | 102 | 150 |
| 2 | 2 | Bob | Uganda | 103 | 300 |
| 3 | 4 | NaN | NaN | 104 | 500 |


4. Outer Join – All customers and all orders

In [14]:
pd.merge(df_customers, df_orders, on='CustomerID', how='outer')

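|   | CustomerID | Name | Country | OrderID | Amount |
|---|---|---|---|---|---|
| 0 | 1 | Alice | Kenya | 102.0 | 150.0 |
| 1 | 2 | Bob | Uganda | 101.0 | 250.0 |
| 2 | 2 | Bob | Uganda | 103.0 | 300.0 |
| 3 | 3 | Charlie | Kenya | NaN | NaN |
| 4 | 4 | NaN | NaN | 104.0 | 500.0 |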
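
With an outer join it can be hard to tell which side each row came from. `pd.merge` accepts `indicator=True`, which adds a `_merge` column holding `both`, `left_only`, or `right_only`. The sketch below reuses the customers and orders data; the variable name `outer` is only illustrative:

```python
import pandas as pd

df_customers = pd.DataFrame({
    'CustomerID': [1, 2, 3],
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Country': ['Kenya', 'Uganda', 'Kenya']
})
df_orders = pd.DataFrame({
    'OrderID': [101, 102, 103, 104],
    'CustomerID': [2, 1, 2, 4],
    'Amount': [250, 150, 300, 500]
})

# indicator=True records the origin of every row in a '_merge' column.
outer = pd.merge(df_customers, df_orders, on='CustomerID',
                 how='outer', indicator=True)
print(outer[['CustomerID', '_merge']])

# Customers without orders, and orders without a known customer.
no_orders = outer.loc[outer['_merge'] == 'left_only', 'CustomerID'].tolist()
unknown_customer = outer.loc[outer['_merge'] == 'right_only', 'CustomerID'].tolist()
print(no_orders, unknown_customer)
```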


### Joins

In Pandas, `.join()` is another way to combine datasets. It is similar to `merge()` but matches rows on the index by default.

It is best used when the keys already live in the index or when you want a shorter call than `merge()` for a simple key-based join.

Syntax
```python 
df1.join(df2, how='left', on=None)
```

- `df1` is the left DataFrame.
- `df2` is the right DataFrame.
- `on` names a column of `df1` to match against the index of `df2`; with the default `on=None`, the index of `df1` is used.
- `how` can be `'left'` (the default), `'right'`, `'inner'`, or `'outer'`.

Join on Index

In [18]:

df1 = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35]
}, index=[1, 2, 3])
df1

|   | Name | Age |
|---|---|---|
| 1 | Alice | 25 |
| 2 | Bob | 30 |
| 3 | Charlie | 35 |


In [19]:
df2 = pd.DataFrame({
    'Salary': [50000, 60000, 70000]
}, index=[1, 2, 3])
df2

|   | Salary |
|---|---|
| 1 | 50000 |
| 2 | 60000 |
| 3 | 70000 |


In [21]:
result = df1.join(df2)
result  

|   | Name | Age | Salary |
|---|---|---|---|
| 1 | Alice | 25 | 50000 |
| 2 | Bob | 30 | 60000 |
| 3 | Charlie | 35 | 70000 |
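
To make the relationship with `merge()` concrete, the index join above can also be written as a merge with `left_index=True` and `right_index=True`. This is a sketch for comparison only, using the same two frames; `joined` and `merged` are illustrative names:

```python
import pandas as pd

df1 = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35]
}, index=[1, 2, 3])
df2 = pd.DataFrame({
    'Salary': [50000, 60000, 70000]
}, index=[1, 2, 3])

# .join() aligns on the index and defaults to a left join.
joined = df1.join(df2)

# The same operation spelled out with pd.merge.
merged = pd.merge(df1, df2, left_index=True, right_index=True, how='left')

print(joined.equals(merged))
```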


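Join on Column Instead of Index

Here both DataFrames start with `CustomerID` as an ordinary column. Setting it as the index on each side lets `.join()` align the rows on it.
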
In [None]:
df1 = pd.DataFrame({
    'CustomerID': [1, 2, 3],
    'Name': ['Alice', 'Bob', 'Charlie']
})
df1

|   | CustomerID | Name |
|---|---|---|
| 0 | 1 | Alice |
| 1 | 2 | Bob |
| 2 | 3 | Charlie |


In [23]:
df2 = pd.DataFrame({
    'CustomerID': [1, 2],
    'Salary': [50000, 60000]
})
df2

|   | CustomerID | Salary |
|---|---|---|
| 0 | 1 | 50000 |
| 1 | 2 | 60000 |


In [24]:
df1 = df1.set_index('CustomerID')
df1

| CustomerID | Name |
|---|---|
| 1 | Alice |
| 2 | Bob |
| 3 | Charlie |


In [25]:
df2 = df2.set_index('CustomerID')
df2

| CustomerID | Salary |
|---|---|
| 1 | 50000 |
| 2 | 60000 |


In [26]:
result = df1.join(df2, how='left')
result

| CustomerID | Name | Salary |
|---|---|---|
| 1 | Alice | 50000.0 |
| 2 | Bob | 60000.0 |
| 3 | Charlie | NaN |
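
The example above moves `CustomerID` into the index on both sides. The `on` parameter offers an alternative: it names a column in the left DataFrame to match against the index of the right one, so only the right frame needs re-indexing. A minimal sketch with the same data, where `customers` and `salaries` are illustrative names:

```python
import pandas as pd

customers = pd.DataFrame({
    'CustomerID': [1, 2, 3],
    'Name': ['Alice', 'Bob', 'Charlie']
})
salaries = pd.DataFrame({
    'CustomerID': [1, 2],
    'Salary': [50000, 60000]
})

# on= names the left column; it is matched against the right frame's index.
result = customers.join(salaries.set_index('CustomerID'), on='CustomerID')
print(result)
```

In this form the left frame keeps its original 0-based index, and `CustomerID` stays an ordinary column.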


#### When to Use `.join()`:
- When joining on the index.
- When you want a simpler syntax than `merge()` for small tasks.
- When doing one-to-one or one-to-many joins.
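
As a rough illustration of the one-to-many case, the sketch below joins customers (one row per `CustomerID`) to orders (several rows per `CustomerID`) on the index. The data is adapted from the customers and orders example earlier in this section, and the variable names are illustrative:

```python
import pandas as pd

# One row per customer, indexed by CustomerID.
customers = pd.DataFrame(
    {'Name': ['Alice', 'Bob', 'Charlie']},
    index=pd.Index([1, 2, 3], name='CustomerID')
)

# Several orders can share a CustomerID, so the index has repeats.
orders = pd.DataFrame(
    {'OrderID': [101, 102, 103], 'Amount': [250, 150, 300]},
    index=pd.Index([2, 1, 2], name='CustomerID')
)

# Each customer row is repeated once per matching order.
one_to_many = customers.join(orders, how='inner')
print(one_to_many)
```

Bob has two orders and therefore appears twice in the result, while Charlie, who has none, is dropped by the inner join.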