### 4.6: Merging and exporting data

#### **1. Methods of Combining Data:**
##### **1.1: Concatenating Data**
        a: Create dictionaries of data (to experiment with)
        b: Convert the dictionaries into dataframes
        c: Concatenate the dataframes
##### **1.2: Joining Data**
##### **1.3: Merging Data**

In [1]:
# Import libraries

import pandas as pd
import numpy as np
import os

In [2]:
# Set display options for better viewing

pd.set_option('display.width', 1000)
pd.set_option('display.max_columns', 100)  # Limit columns
pd.set_option('display.max_rows', 50)      # Limit rows

#### **1.1: Concatenating Data**
##### *a: Create dictionaries of data (to experiment with)*

In [3]:
# Define a dictionary containing January 2020 data
data1 = {'customer_id':['6732', '767', '890', '635'],
    'month':['Jan-20', 'Jan-20', 'Jan-20', 'Jan-20'],
    'purchased_meat':[0, 13, 3, 4],
    'purchased_alcohol':[1, 2, 10, 0],
    'purchased_snacks':[10, 5, 1, 7]}

In [4]:
# Define a dictionary containing February 2020 data
data2 = {'customer_id':['6732', '767', '890', '635'],
    'month':['Feb-20', 'Feb-20', 'Feb-20', 'Feb-20'],
    'purchased_meat':[0, 10, 5, 3],
    'purchased_alcohol':[2, 4, 14, 0],
    'purchased_snacks':[15, 3, 2, 6]}

##### *b: Convert the dictionaries into dataframes*

In [5]:
df = pd.DataFrame(data1, index=[0, 1, 3, 4])
df_1 = pd.DataFrame(data2, index=[0, 1, 3, 4])

In [6]:
df

|   | customer_id | month | purchased_meat | purchased_alcohol | purchased_snacks |
|---|---|---|---|---|---|
| 0 | 6732 | Jan-20 | 0 | 1 | 10 |
| 1 | 767 | Jan-20 | 13 | 2 | 5 |
| 3 | 890 | Jan-20 | 3 | 10 | 1 |
| 4 | 635 | Jan-20 | 4 | 0 | 7 |


In [7]:
df_1

|   | customer_id | month | purchased_meat | purchased_alcohol | purchased_snacks |
|---|---|---|---|---|---|
| 0 | 6732 | Feb-20 | 0 | 2 | 15 |
| 1 | 767 | Feb-20 | 10 | 4 | 3 |
| 3 | 890 | Feb-20 | 5 | 14 | 2 |
| 4 | 635 | Feb-20 | 3 | 0 | 6 |


##### *c: Concatenate the dataframes*

In [8]:
# Collect df and df_1 in a list of dataframes
frames = [df, df_1]

# Concatenate the dataframes row-wise (long format); each row keeps its original index label
df_concat = pd.concat(frames)

In [9]:
df_concat

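|   | customer_id | month | purchased_meat | purchased_alcohol | purchased_snacks |
|---|---|---|---|---|---|
| 0 | 6732 | Jan-20 | 0 | 1 | 10 |
| 1 | 767 | Jan-20 | 13 | 2 | 5 |
| 3 | 890 | Jan-20 | 3 | 10 | 1 |
| 4 | 635 | Jan-20 | 4 | 0 | 7 |
| 0 | 6732 | Feb-20 | 0 | 2 | 15 |
| 1 | 767 | Feb-20 | 10 | 4 | 3 |
| 3 | 890 | Feb-20 | 5 | 14 | 2 |
| 4 | 635 | Feb-20 | 3 | 0 | 6 |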

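The concatenated output keeps each dataframe's original index labels, so every label appears twice. If unique row labels are needed, `pd.concat` accepts `ignore_index=True`. The following is a minimal sketch using two small stand-in dataframes (`jan` and `feb` are illustrative names, not part of the notebook):

```python
import pandas as pd

# Small stand-ins for df and df_1 (illustrative data, same columns in both)
jan = pd.DataFrame({'customer_id': ['6732', '767'],
                    'month': ['Jan-20', 'Jan-20']}, index=[0, 1])
feb = pd.DataFrame({'customer_id': ['6732', '767'],
                    'month': ['Feb-20', 'Feb-20']}, index=[0, 1])

# Default behaviour: original index labels are kept, so they repeat
stacked = pd.concat([jan, feb])

# ignore_index=True discards the old labels and numbers the rows 0..n-1
renumbered = pd.concat([jan, feb], ignore_index=True)

print(stacked.index.tolist())
print(renumbered.index.tolist())
```

Repeated labels are not an error, but a lookup such as `stacked.loc[0]` then returns more than one row, which is easy to overlook.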

In [10]:
# Concatenate the dataframes in wide format (side by side, aligned on the index)
df_concat_wide = pd.concat(frames, axis = 1)

In [11]:
# Return the concatenated dataframe in wide format
df_concat_wide

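|   | customer_id | month | purchased_meat | purchased_alcohol | purchased_snacks | customer_id.1 | month.1 | purchased_meat.1 | purchased_alcohol.1 | purchased_snacks.1 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6732 | Jan-20 | 0 | 1 | 10 | 6732 | Feb-20 | 0 | 2 | 15 |
| 1 | 767 | Jan-20 | 13 | 2 | 5 | 767 | Feb-20 | 10 | 4 | 3 |
| 3 | 890 | Jan-20 | 3 | 10 | 1 | 890 | Feb-20 | 5 | 14 | 2 |
| 4 | 635 | Jan-20 | 4 | 0 | 7 | 635 | Feb-20 | 3 | 0 | 6 |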

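Wide concatenation lines rows up by index label, not by position. The two dataframes above share the same labels, so every row finds a partner. When the labels differ, pandas keeps the union of the labels and fills the gaps with `NaN`. A minimal sketch with made-up frames (`left` and `right` are illustrative names):

```python
import pandas as pd

# Two illustrative frames whose index labels only partly overlap
left = pd.DataFrame({'purchased_meat': [0, 13]}, index=[0, 1])
right = pd.DataFrame({'purchased_snacks': [10, 5]}, index=[1, 2])

# axis=1 places the frames side by side, matching rows on the index label
wide = pd.concat([left, right], axis=1)

print(wide)
```

Only label 1 exists in both frames, so it is the only row with values in both columns.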

#### **1.2: Joining Data**
*The **df.join()** method is not a focus of this course because it is very similar to **df.merge()**. It combines dataframes on their index by default, so it is mainly useful when the index holds meaningful information rather than plain row numbers.*

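For completeness, here is a minimal sketch of an index-based join. It assumes `customer_id` is moved into the index first; the frame names `purchases` and `visits` are illustrative and do not appear elsewhere in the notebook.

```python
import pandas as pd

# Illustrative frames with customer_id used as the index
purchases = pd.DataFrame({'customer_id': ['6732', '767'],
                          'purchased_meat': [0, 13]}).set_index('customer_id')
visits = pd.DataFrame({'customer_id': ['767', '6732'],
                       'days_purchased_on': [10, 0]}).set_index('customer_id')

# join() matches rows on the index and performs a left join by default
joined = purchases.join(visits)

print(joined)
```

The row order of `visits` does not matter here, because rows are matched on the index label rather than on position.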
#### **1.3: Merging Data**
*Merging combines **DataFrames** on one or more common **'key'** columns, which makes it well suited to datasets of different shapes. It is the most powerful and versatile way to combine data, and it applies the same join logic used in Excel and SQL.*

In [12]:
# Step 1: Create data dictionary
data3 = {'customer_id':['6732', '767', '890', '635'],
    'month':['Jan-20', 'Jan-20', 'Jan-20', 'Jan-20'],
    'days_purchased_on':[0, 10, 4, 1]}

# Step 2: Convert data dictionary into dataframe
df_2 = pd.DataFrame(data3, index=[0, 1, 2, 3])

In [13]:
# Return df_2
df_2

|   | customer_id | month | days_purchased_on |
|---|---|---|---|
| 0 | 6732 | Jan-20 | 0 |
| 1 | 767 | Jan-20 | 10 |
| 2 | 890 | Jan-20 | 4 |
| 3 | 635 | Jan-20 | 1 |


In [14]:
# Merge df and df_2 on the key column "customer_id"
df_merged = df.merge(df_2, on = 'customer_id')

In [15]:
# Return df_merged
df_merged

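|   | customer_id | month_x | purchased_meat | purchased_alcohol | purchased_snacks | month_y | days_purchased_on |
|---|---|---|---|---|---|---|---|
| 0 | 6732 | Jan-20 | 0 | 1 | 10 | Jan-20 | 0 |
| 1 | 767 | Jan-20 | 13 | 2 | 5 | Jan-20 | 10 |
| 2 | 890 | Jan-20 | 3 | 10 | 1 | Jan-20 | 4 |
| 3 | 635 | Jan-20 | 4 | 0 | 7 | Jan-20 | 1 |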


##### *Both dataframes share two columns ("customer_id" and "month"), but only "customer_id" was specified as the key. pandas therefore keeps both "month" columns and renames them "month_x" and "month_y". This is not an error, but it is redundant here; merging on both columns, as in the next step, avoids the duplication.*

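When a shared column is deliberately left out of the key (for example, because its values can differ between the two tables), the default `_x` and `_y` suffixes can be replaced with more descriptive ones through the `suffixes` parameter of `merge`. A minimal sketch, with illustrative frame names and suffix labels:

```python
import pandas as pd

# Illustrative frames that share both "customer_id" and "month"
left = pd.DataFrame({'customer_id': ['6732', '767'],
                     'month': ['Jan-20', 'Jan-20'],
                     'purchased_meat': [0, 13]})
right = pd.DataFrame({'customer_id': ['6732', '767'],
                      'month': ['Jan-20', 'Jan-20'],
                      'days_purchased_on': [0, 10]})

# Only customer_id is the key; the overlapping "month" columns get custom suffixes
merged = left.merge(right, on='customer_id', suffixes=('_purchases', '_visits'))

print(merged.columns.tolist())
```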
In [16]:
# Merge df and df_2 on both common columns, "customer_id" and "month"
df_merged = df.merge(df_2, on = ['customer_id', 'month'])  # each key column appears only once in the result

In [17]:
df_merged

|   | customer_id | month | purchased_meat | purchased_alcohol | purchased_snacks | days_purchased_on |
|---|---|---|---|---|---|---|
| 0 | 6732 | Jan-20 | 0 | 1 | 10 | 0 |
| 1 | 767 | Jan-20 | 13 | 2 | 5 | 10 |
| 2 | 890 | Jan-20 | 3 | 10 | 1 | 4 |
| 3 | 635 | Jan-20 | 4 | 0 | 7 | 1 |


In [18]:
# Add a "_merge" indicator column that records whether each row was found in both dataframes
df_merged = df.merge(df_2, on = ['customer_id', 'month'], indicator = True)

In [19]:
df_merged

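|   | customer_id | month | purchased_meat | purchased_alcohol | purchased_snacks | days_purchased_on | _merge |
|---|---|---|---|---|---|---|---|
| 0 | 6732 | Jan-20 | 0 | 1 | 10 | 0 | both |
| 1 | 767 | Jan-20 | 13 | 2 | 5 | 10 | both |
| 2 | 890 | Jan-20 | 3 | 10 | 1 | 4 | both |
| 3 | 635 | Jan-20 | 4 | 0 | 7 | 1 | both |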


In [20]:
# Frequency check for "_merge"
df_merged['_merge'].value_counts()

_merge
both          4
left_only     0
right_only    0
Name: count, dtype: int64

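One caveat about the frequency check: `merge` performs an inner join by default (`how='inner'`), which drops unmatched rows before the indicator is written, so `_merge` can only ever read `both`. To actually detect rows that exist in just one dataframe, the merge needs `how='outer'`. A minimal sketch with made-up, partly overlapping customers (frame names are illustrative):

```python
import pandas as pd

# Illustrative frames: '6732' exists only on the left, '635' only on the right
left = pd.DataFrame({'customer_id': ['6732', '767', '890'],
                     'purchased_meat': [0, 13, 3]})
right = pd.DataFrame({'customer_id': ['767', '890', '635'],
                      'days_purchased_on': [10, 4, 1]})

# An outer join keeps unmatched rows from both sides, so the indicator is informative
checked = left.merge(right, on='customer_id', how='outer', indicator=True)

counts = checked['_merge'].value_counts()
print(counts)
```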
In [21]:
# Run the same merge without assigning the result, so df_merged is not overwritten
pd.merge(df, df_2, on = ['customer_id', 'month'], indicator = True)

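|   | customer_id | month | purchased_meat | purchased_alcohol | purchased_snacks | days_purchased_on | _merge |
|---|---|---|---|---|---|---|---|
| 0 | 6732 | Jan-20 | 0 | 1 | 10 | 0 | both |
| 1 | 767 | Jan-20 | 13 | 2 | 5 | 10 | both |
| 2 | 890 | Jan-20 | 3 | 10 | 1 | 4 | both |
| 3 | 635 | Jan-20 | 4 | 0 | 7 | 1 | both |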

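The section title also mentions exporting, although no export step appears in this part of the notebook. As a rough sketch only, a merged dataframe could be written to CSV with `to_csv`. The temporary folder, the file name, and the small stand-in dataframe below are all assumptions made for illustration; they are not the course's actual paths or data.

```python
import os
import tempfile

import pandas as pd

# Small stand-in for a merged dataframe (illustrative data)
df_example = pd.DataFrame({'customer_id': ['6732', '767'],
                           'days_purchased_on': [0, 10]})

# Hypothetical output location: a temporary folder and a made-up file name
out_dir = tempfile.mkdtemp()
csv_path = os.path.join(out_dir, 'df_merged_example.csv')

# index=False leaves the row labels out of the file
df_example.to_csv(csv_path, index=False)

# Read the file back, keeping customer_id as text rather than a number
reloaded = pd.read_csv(csv_path, dtype={'customer_id': str})
print(reloaded)
```

Writing with `index=False` avoids an extra unnamed index column when the file is read back later.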