Pandas Merge, Join, Concat examples documentation

https://pandas.pydata.org/docs/user_guide/merging.html

## Three main functions to combine datasets

* Merge

* Join 

* Concatenate

## Different ways to join/merge datasets


* left: use keys from the left frame only

* right: use keys from the right frame only

* inner: use the intersection of keys from both frames

* outer: use the union of keys from both frames

* cross: pair every row of the left frame with every row of the right frame (Cartesian product)
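As a quick illustration of the options listed above, the sketch below merges two tiny frames with each `how` value and records which keys survive. The frames `left` and `right` and the name `keys_by_how` are made up for this example and are not part of the notebook's data.

```python
import pandas as pd

# Small hypothetical frames: K0 exists only on the left, K3 only on the right.
left = pd.DataFrame({"key": ["K0", "K1", "K2"], "A": ["A0", "A1", "A2"]})
right = pd.DataFrame({"key": ["K1", "K2", "K3"], "C": ["C1", "C2", "C3"]})

# Collect the keys that survive each key-based join type.
keys_by_how = {
    how: sorted(pd.merge(left, right, on="key", how=how)["key"])
    for how in ["left", "right", "inner", "outer"]
}
print(keys_by_how)

# A cross join takes no key: 3 left rows x 3 right rows.
cross_rows = len(pd.merge(left[["A"]], right[["C"]], how="cross"))
print(cross_rows)
```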

## MERGE

In [1]:
import pandas as pd

In [2]:
left_df = pd.DataFrame( 
    
    
            {
                "key": ["K0", "K1", "K2", "K3", "K4"],

                "A": ["A0", "A1", "A2", "A3", "A4"],

                "B": ["B0", "B1", "B2", "B3", "B4"]
       
            }

                      )



right_df = pd.DataFrame(

            {

                "key": ["K1", "K2", "K3", "K4", "K5"],

                "C": ["C1", "C2", "C3", "C4", "C5"],

                "D": ["D1", "D2", "D3", "D4", "D5"],

            }

                        )

In [3]:
left_df

Unnamed: 0,key,A,B
0,K0,A0,B0
1,K1,A1,B1
2,K2,A2,B2
3,K3,A3,B3
4,K4,A4,B4


In [4]:
right_df

Unnamed: 0,key,C,D
0,K1,C1,D1
1,K2,C2,D2
2,K3,C3,D3
3,K4,C4,D4
4,K5,C5,D5


In [5]:
pd.merge(left_df, right_df, on='key', indicator=True)  # by default, this is an inner join

Unnamed: 0,key,A,B,C,D,_merge
0,K1,A1,B1,C1,D1,both
1,K2,A2,B2,C2,D2,both
2,K3,A3,B3,C3,D3,both
3,K4,A4,B4,C4,D4,both


Note how swapping the order of the two frames changes the column order in the result.

In [6]:
pd.merge(right_df, left_df, on='key')

Unnamed: 0,key,C,D,A,B
0,K1,C1,D1,A1,B1
1,K2,C2,D2,A2,B2
2,K3,C3,D3,A3,B3
3,K4,C4,D4,A4,B4


**inner**

Note that inner is the default.

In [7]:
pd.merge(left_df, right_df, on='key', how='inner')

Unnamed: 0,key,A,B,C,D
0,K1,A1,B1,C1,D1
1,K2,A2,B2,C2,D2
2,K3,A3,B3,C3,D3
3,K4,A4,B4,C4,D4


**left**

In [8]:
left_df

Unnamed: 0,key,A,B
0,K0,A0,B0
1,K1,A1,B1
2,K2,A2,B2
3,K3,A3,B3
4,K4,A4,B4


In [9]:
right_df

Unnamed: 0,key,C,D
0,K1,C1,D1
1,K2,C2,D2
2,K3,C3,D3
3,K4,C4,D4
4,K5,C5,D5


In [10]:
pd.merge(left_df, right_df, on='key', how='left', indicator=True)

Unnamed: 0,key,A,B,C,D,_merge
0,K0,A0,B0,,,left_only
1,K1,A1,B1,C1,D1,both
2,K2,A2,B2,C2,D2,both
3,K3,A3,B3,C3,D3,both
4,K4,A4,B4,C4,D4,both
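The `_merge` column produced by `indicator=True` can also be used to find rows that have no partner in the other frame. The sketch below is one possible way to do that, using shortened copies of the frames above (only columns `A` and `C`); the names `left_only` and `right_only` are chosen for this example.

```python
import pandas as pd

left_df = pd.DataFrame({"key": ["K0", "K1", "K2", "K3", "K4"],
                        "A": ["A0", "A1", "A2", "A3", "A4"]})
right_df = pd.DataFrame({"key": ["K1", "K2", "K3", "K4", "K5"],
                         "C": ["C1", "C2", "C3", "C4", "C5"]})

# An outer merge keeps every key; _merge records where each row came from.
merged = pd.merge(left_df, right_df, on="key", how="outer", indicator=True)

# Keys present in only one of the two frames.
left_only = merged.loc[merged["_merge"] == "left_only", "key"].tolist()
right_only = merged.loc[merged["_merge"] == "right_only", "key"].tolist()
print(left_only, right_only)
```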


In [11]:
pd.merge(right_df, left_df, on='key', how='left')

Unnamed: 0,key,C,D,A,B
0,K1,C1,D1,A1,B1
1,K2,C2,D2,A2,B2
2,K3,C3,D3,A3,B3
3,K4,C4,D4,A4,B4
4,K5,C5,D5,,


**right**

In [12]:
left_df

Unnamed: 0,key,A,B
0,K0,A0,B0
1,K1,A1,B1
2,K2,A2,B2
3,K3,A3,B3
4,K4,A4,B4


In [13]:
right_df

Unnamed: 0,key,C,D
0,K1,C1,D1
1,K2,C2,D2
2,K3,C3,D3
3,K4,C4,D4
4,K5,C5,D5


In [14]:
pd.merge(left_df, right_df, on='key', how='right')

Unnamed: 0,key,A,B,C,D
0,K1,A1,B1,C1,D1
1,K2,A2,B2,C2,D2
2,K3,A3,B3,C3,D3
3,K4,A4,B4,C4,D4
4,K5,,,C5,D5


**outer**


In [15]:
left_df

Unnamed: 0,key,A,B
0,K0,A0,B0
1,K1,A1,B1
2,K2,A2,B2
3,K3,A3,B3
4,K4,A4,B4


In [16]:
right_df

Unnamed: 0,key,C,D
0,K1,C1,D1
1,K2,C2,D2
2,K3,C3,D3
3,K4,C4,D4
4,K5,C5,D5


In [17]:
pd.merge(left_df, right_df, on='key', how='outer')

Unnamed: 0,key,A,B,C,D
0,K0,A0,B0,,
1,K1,A1,B1,C1,D1
2,K2,A2,B2,C2,D2
3,K3,A3,B3,C3,D3
4,K4,A4,B4,C4,D4
5,K5,,,C5,D5


**cross**

In [18]:
data1 = {'A': [1, 2], 'B': [3, 4]}
df1 = pd.DataFrame(data1)

data2 = {'X': ['a', 'b'], 'Y': ['c', 'd']}
df2 = pd.DataFrame(data2)

In [19]:
df1

Unnamed: 0,A,B
0,1,3
1,2,4


In [20]:
df2

Unnamed: 0,X,Y
0,a,c
1,b,d


In [21]:
pd.merge(df1, df2, how='cross')

Unnamed: 0,A,B,X,Y
0,1,3,a,c
1,1,3,b,d
2,2,4,a,c
3,2,4,b,d


In [22]:
data1 = {'A': [1, 2], 'B': [3, 4]}
df1 = pd.DataFrame(data1)

data2 = {'X': ['a', 'b','c'], 'Y': ['d', 'e', 'f']}
df2 = pd.DataFrame(data2)

In [23]:
df1

Unnamed: 0,A,B
0,1,3
1,2,4


In [24]:
df2

Unnamed: 0,X,Y
0,a,d
1,b,e
2,c,f


In [25]:
pd.merge(df1, df2, how='cross')

Unnamed: 0,A,B,X,Y
0,1,3,a,d
1,1,3,b,e
2,1,3,c,f
3,2,4,a,d
4,2,4,b,e
5,2,4,c,f


In [26]:
df1

Unnamed: 0,A,B
0,1,3
1,2,4


In [27]:
df2

Unnamed: 0,X,Y
0,a,d
1,b,e
2,c,f


In [28]:
pd.merge(df2, df1, how='cross')

Unnamed: 0,X,Y,A,B
0,a,d,1,3
1,a,d,2,4
2,b,e,1,3
3,b,e,2,4
4,c,f,1,3
5,c,f,2,4


In [None]:
data1 = {'A': [1, 2], 'B': [3, 4]}
df1 = pd.DataFrame(data1)

data2 = {'A': ['a', 'b'], 'Y': ['c', 'd']}
df2 = pd.DataFrame(data2)

In [None]:
df1

In [None]:
df2

In [None]:
pd.merge(df1, df2, how='cross', suffixes=['_left', '_right'])
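Both frames here have a column named `A`, so the suffixes are what keep the two apart in the result. A minimal sketch of what to expect, with the suffixes passed as a tuple instead of a list (an arbitrary choice for this example) and the result stored under the made-up name `crossed`:

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
df2 = pd.DataFrame({"A": ["a", "b"], "Y": ["c", "d"]})

# Only the overlapping column name "A" receives a suffix; B and Y are unchanged.
crossed = pd.merge(df1, df2, how="cross", suffixes=("_left", "_right"))
print(list(crossed.columns))
print(crossed.shape)  # 2 rows x 2 rows = 4 rows
```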

## Duplicate keys

In [None]:
left_df = pd.DataFrame( 
    
    
            {
                "key": ["K0", "K1", "K2", "K3", "K4"],

                "A": ["A0", "A1", "A2", "A3", "A4"],

                "B": ["B0", "B1", "B2", "B3", "B4"]
       
            }

                      )



right_df = pd.DataFrame(

            {

                "key": ["K1", "K4", "K3", "K1", "K5"],

                "C": ["C1", "C2", "C3", "C4", "C5"],

                "D": ["D1", "D2", "D3", "D4", "D5"],

            }

                        )

In [None]:
left_df

In [None]:
right_df

In [None]:
pd.merge(left_df, right_df, on='key', how='left')
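Because `K1` occurs twice in the right frame, the left row for `K1` is repeated once per match, so the result has more rows than `left_df`. The sketch below reproduces this with shortened frames (columns `A` and `C` only) and also tries the `validate` argument of `pd.merge`, which is not used in the notebook, as one way to catch unexpected duplicates.

```python
import pandas as pd

left_df = pd.DataFrame({"key": ["K0", "K1", "K2", "K3", "K4"],
                        "A": ["A0", "A1", "A2", "A3", "A4"]})
right_df = pd.DataFrame({"key": ["K1", "K4", "K3", "K1", "K5"],
                         "C": ["C1", "C2", "C3", "C4", "C5"]})

# Left merge: 5 left rows, but K1 matches two right rows, giving 6 rows.
merged = pd.merge(left_df, right_df, on="key", how="left")
k1_matches = merged.loc[merged["key"] == "K1", "C"].tolist()
print(len(merged), k1_matches)

# validate= asks pandas to check the expected key relationship up front.
try:
    pd.merge(left_df, right_df, on="key", how="left", validate="one_to_one")
    duplicates_detected = False
except pd.errors.MergeError:
    duplicates_detected = True
print(duplicates_detected)
```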

**Not specifying the column to merge on (bad practice)**

In [None]:
left_df = pd.DataFrame( 
    
    
            {
                "key": ["K0", "K1", "K2", "K3", "K4"],

                "A": ["A0", "A1", "A2", "A3", "A4"],

                "B": ["B0", "B1", "B2", "B3", "B4"]
       
            }

                      )



right_df = pd.DataFrame(

            {

                "key": ["K1", "K2", "K3", "K4", "K5"],

                "C": ["C1", "C2", "C3", "C4", "C5"],

                "D": ["D1", "D2", "D3", "D4", "D5"],

            }

                        )

In [None]:
left_df

In [None]:
right_df

In [None]:
pd.merge(left_df, right_df, how='inner')
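When `on` is left out, pandas merges on every column name the two frames have in common. Here that is only `key`, so the result matches the explicit version, but an extra shared column silently changes the outcome. The sketch below shows both cases; `right_with_b` is a hypothetical variant of the right frame made up for this example.

```python
import pandas as pd

left_df = pd.DataFrame({"key": ["K0", "K1", "K2"],
                        "A": ["A0", "A1", "A2"],
                        "B": ["B0", "B1", "B2"]})
right_df = pd.DataFrame({"key": ["K1", "K2", "K3"],
                         "C": ["C1", "C2", "C3"]})

# Only "key" is shared, so the implicit and explicit merges agree.
implicit = pd.merge(left_df, right_df, how="inner")
explicit = pd.merge(left_df, right_df, on="key", how="inner")
same_result = implicit.equals(explicit)

# Hypothetical: the right frame also has a column "B" with unrelated values.
# The implicit merge now uses both "key" and "B" and finds no matches.
right_with_b = right_df.assign(B="other")
surprise = pd.merge(left_df, right_with_b, how="inner")
print(same_result, len(surprise))
```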

**Trying to merge on a column that is not shared**

In [None]:
pd.merge(left_df, right_df, on='A', how='inner')
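The column `A` exists only in the left frame, so this call fails instead of returning a result. A small sketch of the failure, with shortened frames and a `try`/`except` added here just to capture the exception type:

```python
import pandas as pd

left_df = pd.DataFrame({"key": ["K0", "K1"], "A": ["A0", "A1"]})
right_df = pd.DataFrame({"key": ["K1", "K2"], "C": ["C1", "C2"]})

# "A" is missing from right_df, so pandas cannot find the merge key there.
try:
    pd.merge(left_df, right_df, on="A", how="inner")
    error_name = None
except KeyError as exc:
    error_name = type(exc).__name__
print(error_name)
```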

**Merging on several columns at once**

In [None]:
left_df = pd.DataFrame(

    {

        "key1": ["K0", "K0", "K1", "K2"],

        "key2": ["K0", "K1", "K0", "K1"],

        "A": ["A0", "A1", "A2", "A3"],

        "B": ["B0", "B1", "B2", "B3"],

    }

)



right_df = pd.DataFrame(

    {

        "key1": ["K0", "K1", "K1", "K2"],

        "key2": ["K0", "K0", "K0", "K0"],

        "C": ["C0", "C1", "C2", "C3"],

        "D": ["D0", "D1", "D2", "D3"],

    }

)


In [None]:
left_df

In [None]:
right_df

In [None]:
pd.merge(left_df, right_df, on=['key1', 'key2'], how='right')

In [None]:
pd.merge(left_df, right_df, on=['key1', 'key2'], how='left')

In [None]:
pd.merge(left_df, right_df, on=['key1', 'key2'], how='inner')
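With two key columns, a row matches only when the pair `(key1, key2)` is equal in both frames. The sketch below repeats the three merges on the same key columns (with `A` and `C` as the only value columns) and counts the rows. The pair `(K1, K0)` occurs twice on the right, which is why the left and inner results each contain one extra row.

```python
import pandas as pd

left_df = pd.DataFrame({"key1": ["K0", "K0", "K1", "K2"],
                        "key2": ["K0", "K1", "K0", "K1"],
                        "A": ["A0", "A1", "A2", "A3"]})
right_df = pd.DataFrame({"key1": ["K0", "K1", "K1", "K2"],
                         "key2": ["K0", "K0", "K0", "K0"],
                         "C": ["C0", "C1", "C2", "C3"]})

# Count the rows produced by each join type on the key pair.
row_counts = {
    how: len(pd.merge(left_df, right_df, on=["key1", "key2"], how=how))
    for how in ["right", "left", "inner"]
}
print(row_counts)
```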

**An equivalent way to merge**

In [None]:
left_df = pd.DataFrame( 
    
    
            {
                "key": ["K0", "K1", "K2", "K3", "K4"],

                "A": ["A0", "A1", "A2", "A3", "A4"],

                "B": ["B0", "B1", "B2", "B3", "B4"]
       
            }

                      )



right_df = pd.DataFrame(

            {

                "key": ["K1", "K2", "K3", "K4", "K5"],

                "C": ["C1", "C2", "C3", "C4", "C5"],

                "D": ["D1", "D2", "D3", "D4", "D5"],

            }

                        )

In [None]:
left_df

In [None]:
right_df

In [None]:
left_df.merge(right_df, on='key', how='inner')

In [None]:
pd.merge(left_df, right_df, on='key', how='inner')

**Merging on columns with different names**

In [None]:
left_df = pd.DataFrame( 
    
    
            {
                "key": ["K0", "K1", "K2", "K3", "K4"],

                "A": ["A0", "A1", "A2", "A3", "A4"],

                "B": ["B0", "B1", "B2", "B3", "B4"]
       
            }

                      )



right_df = pd.DataFrame(

            {

                "nyckel": ["K1", "K2", "K3", "K4", "K5"],

                "C": ["C1", "C2", "C3", "C4", "C5"],

                "D": ["D1", "D2", "D3", "D4", "D5"],

            }

                        )

In [None]:
left_df

In [None]:
right_df

In [None]:
pd.merge(left_df, right_df, left_on='key', right_on='nyckel')
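With `left_on` and `right_on`, both key columns are kept in the result, holding identical values for an inner merge. One possible follow-up, not shown in the notebook, is to drop the redundant column afterwards. The sketch uses shortened frames, and the name `tidy` is made up for this example.

```python
import pandas as pd

left_df = pd.DataFrame({"key": ["K0", "K1", "K2"], "A": ["A0", "A1", "A2"]})
right_df = pd.DataFrame({"nyckel": ["K1", "K2", "K3"], "C": ["C1", "C2", "C3"]})

# Both key columns appear in the merged frame.
merged = pd.merge(left_df, right_df, left_on="key", right_on="nyckel")
print(list(merged.columns))

# For an inner merge the two key columns are identical, so one can be dropped.
tidy = merged.drop(columns="nyckel")
print(list(tidy.columns))
```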

## JOIN

In [30]:
left_df = pd.DataFrame(

    {"A": ["A0", "A1", "A2"], "B": ["B0", "B1", "B2"]}, index=["K0", "K1", "K2"]

)



right_df = pd.DataFrame(

    {"C": ["C0", "C2", "C3"], "D": ["D0", "D2", "D3"]}, index=["K0", "K2", "K3"]

)

In [31]:
left_df

Unnamed: 0,A,B
K0,A0,B0
K1,A1,B1
K2,A2,B2


In [32]:
right_df

Unnamed: 0,C,D
K0,C0,D0
K2,C2,D2
K3,C3,D3


In [36]:
left_df.join(right_df)

Unnamed: 0,A,B,C,D
K0,A0,B0,C0,D0
K1,A1,B1,,
K2,A2,B2,C2,D2


In [37]:
left_df.join(right_df, how='left')

Unnamed: 0,A,B,C,D
K0,A0,B0,C0,D0
K1,A1,B1,,
K2,A2,B2,C2,D2


In [38]:
left_df.join(right_df, how='inner')

Unnamed: 0,A,B,C,D
K0,A0,B0,C0,D0
K2,A2,B2,C2,D2


In [39]:
left_df.join(right_df, how='right')

Unnamed: 0,A,B,C,D
K0,A0,B0,C0,D0
K2,A2,B2,C2,D2
K3,,,C3,D3


In [40]:
left_df.join(right_df, how='outer')

Unnamed: 0,A,B,C,D
K0,A0,B0,C0,D0
K1,A1,B1,,
K2,A2,B2,C2,D2
K3,,,C3,D3
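`join` matches rows on the index rather than on a column. As a sanity check, the sketch below compares each `join` call above with `pd.merge` using `left_index=True` and `right_index=True`, which should give the same frames. The dictionary name `matches` is made up for this example.

```python
import pandas as pd

left_df = pd.DataFrame(
    {"A": ["A0", "A1", "A2"], "B": ["B0", "B1", "B2"]}, index=["K0", "K1", "K2"]
)
right_df = pd.DataFrame(
    {"C": ["C0", "C2", "C3"], "D": ["D0", "D2", "D3"]}, index=["K0", "K2", "K3"]
)

# Compare join() with an index-on-index merge for every join type.
matches = {}
for how in ["left", "right", "inner", "outer"]:
    joined = left_df.join(right_df, how=how)
    merged = pd.merge(left_df, right_df, left_index=True, right_index=True, how=how)
    matches[how] = joined.equals(merged)
print(matches)
```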


## CONCATENATE

In [41]:
df1 = pd.DataFrame(

    {

        "A": ["A0", "A1", "A2", "A3"],

        "B": ["B0", "B1", "B2", "B3"],

        "C": ["C0", "C1", "C2", "C3"],

        "D": ["D0", "D1", "D2", "D3"],

    },

    index=[0, 1, 2, 3],

)



df2 = pd.DataFrame(

    {

        "A": ["A4", "A5", "A6", "A7"],

        "B": ["B4", "B5", "B6", "B7"],

        "C": ["C4", "C5", "C6", "C7"],

        "D": ["D4", "D5", "D6", "D7"],

    },

    index=[4, 5, 6, 7],

)



df3 = pd.DataFrame(

    {

        "A": ["A8", "A9", "A10", "A11"],

        "B": ["B8", "B9", "B10", "B11"],

        "C": ["C8", "C9", "C10", "C11"],

        "D": ["D8", "D9", "D10", "D11"],

    },

    index=[8, 9, 10, 11],

)

In [42]:
df1

Unnamed: 0,A,B,C,D
0,A0,B0,C0,D0
1,A1,B1,C1,D1
2,A2,B2,C2,D2
3,A3,B3,C3,D3


In [43]:
df2

Unnamed: 0,A,B,C,D
4,A4,B4,C4,D4
5,A5,B5,C5,D5
6,A6,B6,C6,D6
7,A7,B7,C7,D7


In [44]:
df3

Unnamed: 0,A,B,C,D
8,A8,B8,C8,D8
9,A9,B9,C9,D9
10,A10,B10,C10,D10
11,A11,B11,C11,D11


In [47]:
pd.concat([df1, df2, df3])

Unnamed: 0,A,B,C,D
0,A0,B0,C0,D0
1,A1,B1,C1,D1
2,A2,B2,C2,D2
3,A3,B3,C3,D3
4,A4,B4,C4,D4
5,A5,B5,C5,D5
6,A6,B6,C6,D6
7,A7,B7,C7,D7
8,A8,B8,C8,D8
9,A9,B9,C9,D9
10,A10,B10,C10,D10
11,A11,B11,C11,D11
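After stacking, nothing in the result says which frame a row came from. The `keys` argument of `pd.concat`, which the notebook does not use, is one way to keep that information. The sketch below uses shortened frames and made-up labels `"first"` and `"second"`.

```python
import pandas as pd

df1 = pd.DataFrame({"A": ["A0", "A1"], "B": ["B0", "B1"]}, index=[0, 1])
df2 = pd.DataFrame({"A": ["A4", "A5"], "B": ["B4", "B5"]}, index=[4, 5])

# keys= adds an outer index level that labels the source of each row.
stacked = pd.concat([df1, df2], keys=["first", "second"])
print(stacked.index.nlevels)

# Selecting on the outer level recovers the rows of one source frame.
second_part = stacked.loc["second"]
print(second_part["A"].tolist())
```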


In [51]:
df1

Unnamed: 0,A,B,C,D
0,A0,B0,C0,D0
1,A1,B1,C1,D1
2,A2,B2,C2,D2
3,A3,B3,C3,D3


In [52]:
df2

Unnamed: 0,A,B,C,D
4,A4,B4,C4,D4
5,A5,B5,C5,D5
6,A6,B6,C6,D6
7,A7,B7,C7,D7


In [50]:
pd.concat([df1, df2], axis='columns')

Unnamed: 0,A,B,C,D,A.1,B.1,C.1,D.1
0,A0,B0,C0,D0,,,,
1,A1,B1,C1,D1,,,,
2,A2,B2,C2,D2,,,,
3,A3,B3,C3,D3,,,,
4,,,,,A4,B4,C4,D4
5,,,,,A5,B5,C5,D5
6,,,,,A6,B6,C6,D6
7,,,,,A7,B7,C7,D7
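The missing values above appear because `pd.concat` with `axis='columns'` aligns rows on the index, and the two frames share no index labels. The sketch below, with shortened frames, shows the same effect and two ways of handling it: `join="inner"`, which keeps only shared labels, and resetting both indexes first, which places the frames side by side by position.

```python
import pandas as pd

df1 = pd.DataFrame({"A": ["A0", "A1"]}, index=[0, 1])
df2 = pd.DataFrame({"A": ["A4", "A5"]}, index=[4, 5])

# Default outer alignment: union of the indexes, gaps filled with NaN.
wide = pd.concat([df1, df2], axis="columns")
print(wide.shape, int(wide.isna().sum().sum()))

# Inner alignment keeps only shared index labels; there are none here.
inner = pd.concat([df1, df2], axis="columns", join="inner")
print(inner.shape)

# Resetting both indexes lines the rows up by position instead.
side_by_side = pd.concat(
    [df1.reset_index(drop=True), df2.reset_index(drop=True)], axis="columns"
)
print(side_by_side.shape)
```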


In [53]:
df1 = pd.DataFrame(

    {

        "A": ["A0", "A1", "A2", "A3"],

        "B": ["B0", "B1", "B2", "B3"],

        "C": ["C0", "C1", "C2", "C3"],

        "D": ["D0", "D1", "D2", "D3"],

    },

    index=[0, 1, 2, 3],

)



df2 = pd.DataFrame(

    {

        "A": ["A4", "A5", "A6", "A7"],

        "B": ["B4", "B5", "B6", "B7"],

        "C": ["C4", "C5", "C6", "C7"],

        "D": ["D4", "D5", "D6", "D7"],

    },

    index=[0, 1, 2, 3],

)

In [54]:
df1

Unnamed: 0,A,B,C,D
0,A0,B0,C0,D0
1,A1,B1,C1,D1
2,A2,B2,C2,D2
3,A3,B3,C3,D3


In [55]:
df2

Unnamed: 0,A,B,C,D
0,A4,B4,C4,D4
1,A5,B5,C5,D5
2,A6,B6,C6,D6
3,A7,B7,C7,D7


In [61]:
result = pd.concat([df1, df2], axis='columns')

result['A']

Unnamed: 0,A,A.1
0,A0,A4
1,A1,A5
2,A2,A6
3,A3,A7
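Concatenating along the columns keeps both columns named `A`, so `result['A']` returns a two-column DataFrame rather than a Series. One possible way to keep the columns distinguishable is the `keys` argument, sketched below with shortened frames and made-up labels `"first"` and `"second"`.

```python
import pandas as pd

df1 = pd.DataFrame({"A": ["A0", "A1"], "B": ["B0", "B1"]})
df2 = pd.DataFrame({"A": ["A4", "A5"], "B": ["B4", "B5"]})

# Without keys the column labels repeat.
result = pd.concat([df1, df2], axis="columns")
print(result.columns.is_unique, result["A"].shape)

# With keys each source gets its own top-level column label.
labeled = pd.concat([df1, df2], axis="columns", keys=["first", "second"])
print(labeled[("second", "A")].tolist())
```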


In [62]:
df1

Unnamed: 0,A,B,C,D
0,A0,B0,C0,D0
1,A1,B1,C1,D1
2,A2,B2,C2,D2
3,A3,B3,C3,D3


In [63]:
df2

Unnamed: 0,A,B,C,D
0,A4,B4,C4,D4
1,A5,B5,C5,D5
2,A6,B6,C6,D6
3,A7,B7,C7,D7


In [65]:
df1.join(df2, rsuffix='_r', lsuffix='_l')

Unnamed: 0,A_l,B_l,C_l,D_l,A_r,B_r,C_r,D_r
0,A0,B0,C0,D0,A4,B4,C4,D4
1,A1,B1,C1,D1,A5,B5,C5,D5
2,A2,B2,C2,D2,A6,B6,C6,D6
3,A3,B3,C3,D3,A7,B7,C7,D7


In [66]:
df1

Unnamed: 0,A,B,C,D
0,A0,B0,C0,D0
1,A1,B1,C1,D1
2,A2,B2,C2,D2
3,A3,B3,C3,D3


In [67]:
df2

Unnamed: 0,A,B,C,D
0,A4,B4,C4,D4
1,A5,B5,C5,D5
2,A6,B6,C6,D6
3,A7,B7,C7,D7


In [70]:
pd.concat([df1, df2]).reset_index(drop=True)

Unnamed: 0,A,B,C,D
0,A0,B0,C0,D0
1,A1,B1,C1,D1
2,A2,B2,C2,D2
3,A3,B3,C3,D3
4,A4,B4,C4,D4
5,A5,B5,C5,D5
6,A6,B6,C6,D6
7,A7,B7,C7,D7
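`pd.concat` also accepts `ignore_index=True`, which should give the same renumbered index as calling `reset_index(drop=True)` afterwards. A short check with shortened frames:

```python
import pandas as pd

df1 = pd.DataFrame({"A": ["A0", "A1"], "B": ["B0", "B1"]}, index=[0, 1])
df2 = pd.DataFrame({"A": ["A4", "A5"], "B": ["B4", "B5"]}, index=[0, 1])

# Two ways to get a fresh 0..n-1 index after stacking the rows.
via_reset = pd.concat([df1, df2]).reset_index(drop=True)
via_flag = pd.concat([df1, df2], ignore_index=True)
print(via_flag.index.tolist())
print(via_reset.equals(via_flag))
```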


## SCRAPE WIKIPEDIA

You will probably need to install lxml in your environment first:

**conda install lxml**

In [86]:
swedish_demo = pd.read_html('https://sv.wikipedia.org/wiki/Sveriges_demografi')

len(swedish_demo)


21

In [87]:
isinstance(swedish_demo, list)

True

In [88]:
swedish_demo[1]

Unnamed: 0_level_0,Vid utgången av år,Folkmängd,Årlig tillväxt,Årlig tillväxt
Unnamed: 0_level_1,Vid utgången av år,Folkmängd,Totalt,Promille
0,1570,900 000,—,—
1,1650,1 225 000,4 063,386
2,1700,1 485 000,5 200,386
3,1720,1 350 000,−6 750,"−4,75"
4,1755,1 878 000,15 086,948
5,1815,2 465 000,9 783,454
6,1865,4 099 000,32 680,1022
7,1900,5 140 000,29 743,648
8,2000,8 861 000,,
9,2020,10 379 000,,


In [92]:
swedish_demo[1]['Årlig tillväxt']['Promille']

0                                                     —
1                                                   386
2                                                   386
3                                                 −4,75
4                                                   948
5                                                   454
6                                                  1022
7                                                   648
8                                                   NaN
9                                                   NaN
10    Datan avser folkmängden inom Sveriges nuvarand...
Name: Promille, dtype: object

In [93]:
min_df = swedish_demo[1]

In [94]:
min_df

Unnamed: 0_level_0,Vid utgången av år,Folkmängd,Årlig tillväxt,Årlig tillväxt
Unnamed: 0_level_1,Vid utgången av år,Folkmängd,Totalt,Promille
0,1570,900 000,—,—
1,1650,1 225 000,4 063,386
2,1700,1 485 000,5 200,386
3,1720,1 350 000,−6 750,"−4,75"
4,1755,1 878 000,15 086,948
5,1815,2 465 000,9 783,454
6,1865,4 099 000,32 680,1022
7,1900,5 140 000,29 743,648
8,2000,8 861 000,,
9,2020,10 379 000,,


In [83]:
swedish_demo[1].columns = swedish_demo[1].columns.droplevel(level=0)

In [84]:
swedish_demo[1]

Unnamed: 0,Vid utgången av år,Folkmängd,Totalt,Promille
0,1570,900 000,—,—
1,1650,1 225 000,4 063,386
2,1700,1 485 000,5 200,386
3,1720,1 350 000,−6 750,"−4,75"
4,1755,1 878 000,15 086,948
5,1815,2 465 000,9 783,454
6,1865,4 099 000,32 680,1022
7,1900,5 140 000,29 743,648
8,2000,8 861 000,,
9,2020,10 379 000,,


In [79]:
swedish_demo[1].columns.droplevel(level=0)

Index(['Vid utgången av år', 'Folkmängd', 'Totalt', 'Promille'], dtype='object')
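The scraped columns have dtype `object`: numbers such as `900 000` are strings with a space as the thousands separator. The sketch below is one possible way to turn such a column into integers. It runs on a hand-typed stand-in Series rather than on the live page, and it assumes the separators may be ordinary or non-breaking spaces. The last value is a placeholder for a footnote row, not the actual footnote text.

```python
import pandas as pd

# Hand-typed stand-in for the scraped column (no network access needed).
# "\xa0" is a non-breaking space; "footnote text" is a placeholder value.
raw = pd.Series(
    ["900 000", "1\xa0225\xa0000", "10 379 000", "footnote text"],
    name="Folkmängd",
)

# Remove all whitespace, then convert; non-numeric entries become NaN.
no_spaces = raw.str.replace(r"\s+", "", regex=True)
numbers = pd.to_numeric(no_spaces, errors="coerce")

# Drop the rows that could not be parsed and store the rest as integers.
population = numbers.dropna().astype(int)
print(population.tolist())
```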