# Big Data Real-Time Analytics with Python and Spark

## Chapter 5 - Exercise

This is a challenging set of exercises, but working through it will consolidate your knowledge of data manipulation with Pandas.

In [1]:
# Python version
from platform import python_version
print('The version used in this notebook is: ', python_version())

The version used in this notebook is:  3.8.13


In [2]:
# imports
import pandas as pd
import numpy as np

In [3]:
# Package versions used in this notebook
%reload_ext watermark
%watermark -a 'Bianca Amorim' --iversions

Author: Bianca Amorim

pandas: 1.4.2
numpy : 1.22.3



## SQL Join and joining tables with Pandas

In [4]:
from IPython.display import Image
Image(url = 'images/sql-join.png')

In [5]:
# Data Dictionary
data1 = {
    'subject_id': ['1', '2', '3', '4', '5'],
    'name': ['Bob', 'Maria', 'Mateus', 'Ivo', 'Gerson'],
    'surname': ['Anderson', 'Teixeira', 'Amoedo', 'Trindade', 'Vargas']}

# Create the dataframe a
df_a = pd.DataFrame(data1, columns = ['subject_id', 'name', 'surname'])
df_a

Unnamed: 0,subject_id,name,surname
0,1,Bob,Anderson
1,2,Maria,Teixeira
2,3,Mateus,Amoedo
3,4,Ivo,Trindade
4,5,Gerson,Vargas


In [6]:
# Data Dictionary
data2 = {
    'subject_id': ['4', '5', '6', '7', '8'],
    'name': ['Roberto', 'Mariana', 'Ana', 'Marcos', 'Maria'],
    'surname': ['Sampaio', 'Fernandes', 'arantes', 'Menezes', 'martins']}

# Create the dataframe b
df_b = pd.DataFrame(data2, columns = ['subject_id', 'name', 'surname'])
df_b

Unnamed: 0,subject_id,name,surname
0,4,Roberto,Sampaio
1,5,Mariana,Fernandes
2,6,Ana,arantes
3,7,Marcos,Menezes
4,8,Maria,martins


In [7]:
# Data Dictionary
data3 = {
    'subject_id': ['1', '2', '3', '4', '5', '7', '8', '9', '10', '11'],
    'test_id': [81,  75, 75, 71, 76, 84, 95, 61, 57, 90]}

# Create the dataframe c
df_c = pd.DataFrame(data3, columns = ['subject_id', 'test_id'])
df_c

Unnamed: 0,subject_id,test_id
0,1,81
1,2,75
2,3,75
3,4,71
4,5,76
5,7,84
6,8,95
7,9,61
8,10,57
9,11,90


### Exercise 1 - Concatenate dataframes a and b by rows and create a new dataframe called df_ab_row.

In [8]:
# axis = 0 (row-wise) is the default, so passing it explicitly is optional
df_ab_row = pd.concat([df_a, df_b],
                     axis = 0,
                     ignore_index = True)
df_ab_row

Unnamed: 0,subject_id,name,surname
0,1,Bob,Anderson
1,2,Maria,Teixeira
2,3,Mateus,Amoedo
3,4,Ivo,Trindade
4,5,Gerson,Vargas
5,4,Roberto,Sampaio
6,5,Mariana,Fernandes
7,6,Ana,arantes
8,7,Marcos,Menezes
9,8,Maria,martins


### Exercise 2 - Concatenate dataframes a and b by columns and create a new dataframe called df_ab_column.

In [11]:
df_ab_column = pd.concat([df_a, df_b],
                     axis = 1)
df_ab_column

Unnamed: 0,subject_id,name,surname,subject_id,name,surname
0,1,Bob,Anderson,4,Roberto,Sampaio
1,2,Maria,Teixeira,5,Mariana,Fernandes
2,3,Mateus,Amoedo,6,Ana,arantes
3,4,Ivo,Trindade,7,Marcos,Menezes
4,5,Gerson,Vargas,8,Maria,martins


### Exercise 3 - Merge dataframes df_ab_row and df_c using the subject_id column.

In [12]:
df_abc = pd.merge(df_ab_row, df_c, on = ['subject_id'])
df_abc

Unnamed: 0,subject_id,name,surname,test_id
0,1,Bob,Anderson,81
1,2,Maria,Teixeira,75
2,3,Mateus,Amoedo,75
3,4,Ivo,Trindade,71
4,4,Roberto,Sampaio,71
5,5,Gerson,Vargas,76
6,5,Mariana,Fernandes,76
7,7,Marcos,Menezes,84
8,8,Maria,martins,95


### Exercise 4 - Merge dataframes df_ab_row and df_c by subject_id column using left and right.

In [18]:
df_abc_left = pd.merge(df_ab_row, df_c, left_on = 'subject_id' , right_on = 'subject_id')
df_abc_left

Unnamed: 0,subject_id,name,surname,test_id
0,1,Bob,Anderson,81
1,2,Maria,Teixeira,75
2,3,Mateus,Amoedo,75
3,4,Ivo,Trindade,71
4,4,Roberto,Sampaio,71
5,5,Gerson,Vargas,76
6,5,Mariana,Fernandes,76
7,7,Marcos,Menezes,84
8,8,Maria,martins,95
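
The solution above passes `left_on` and `right_on`, which only name the key columns; the join itself is still the default inner join. If the exercise is instead read as asking for a left join and a right join, a minimal sketch could look like the following. The names `df_abc_how_left` and `df_abc_how_right` are hypothetical, and the dataframes are rebuilt here only so the block runs on its own.

```python
import pandas as pd

# Same data as df_a, df_b and df_c defined earlier in this notebook
df_a = pd.DataFrame({
    'subject_id': ['1', '2', '3', '4', '5'],
    'name': ['Bob', 'Maria', 'Mateus', 'Ivo', 'Gerson'],
    'surname': ['Anderson', 'Teixeira', 'Amoedo', 'Trindade', 'Vargas']})
df_b = pd.DataFrame({
    'subject_id': ['4', '5', '6', '7', '8'],
    'name': ['Roberto', 'Mariana', 'Ana', 'Marcos', 'Maria'],
    'surname': ['Sampaio', 'Fernandes', 'arantes', 'Menezes', 'martins']})
df_c = pd.DataFrame({
    'subject_id': ['1', '2', '3', '4', '5', '7', '8', '9', '10', '11'],
    'test_id': [81, 75, 75, 71, 76, 84, 95, 61, 57, 90]})
df_ab_row = pd.concat([df_a, df_b], ignore_index=True)

# how='left' keeps every row of df_ab_row; subject_id '6' has no test_id
df_abc_how_left = pd.merge(df_ab_row, df_c, how='left', on='subject_id')

# how='right' keeps every subject_id of df_c; '9', '10' and '11' have no name
df_abc_how_right = pd.merge(df_ab_row, df_c, how='right', on='subject_id')

print(df_abc_how_left)
print(df_abc_how_right)
```

The left join returns 10 rows and the right join 12, because subject_id 4 and 5 each appear twice in `df_ab_row`.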


An outer join (full outer join) produces the set of all records in Table A and Table B, pairing matching records from both sides where available. Where there is no match, the missing side contains null.

### Exercise 5 - Perform an outer join between dataframes df_a and df_b using the subject_id column.

In [15]:
df_ab_outer = pd.merge(df_a, df_b, how = 'outer' , on = ['subject_id'])
df_ab_outer

Unnamed: 0,subject_id,name_x,surname_x,name_y,surname_y
0,1,Bob,Anderson,,
1,2,Maria,Teixeira,,
2,3,Mateus,Amoedo,,
3,4,Ivo,Trindade,Roberto,Sampaio
4,5,Gerson,Vargas,Mariana,Fernandes
5,6,,,Ana,arantes
6,7,,,Marcos,Menezes
7,8,,,Maria,martins


### Exercise 6 - Exercise 5 generates missing values. Don't let this happen when you merge!

In [20]:
df_ab_outer = pd.merge(df_a, df_b, how = 'outer' , on = ['subject_id']).replace(np.nan, 'Outro')
df_ab_outer

Unnamed: 0,subject_id,name_x,surname_x,name_y,surname_y
0,1,Bob,Anderson,Outro,Outro
1,2,Maria,Teixeira,Outro,Outro
2,3,Mateus,Amoedo,Outro,Outro
3,4,Ivo,Trindade,Roberto,Sampaio
4,5,Gerson,Vargas,Mariana,Fernandes
5,6,Outro,Outro,Ana,arantes
6,7,Outro,Outro,Marcos,Menezes
7,8,Outro,Outro,Maria,martins


An inner join produces only the records that match in both Table A and Table B.

### Exercise 7 - Perform the merge inner join between dataframes df_a and df_b using the subject_id column.

In [None]:
df_ab_inner = pd.merge(df_a, df_b, how = 'inner' , on = ['subject_id'])
df_ab_inner

A left join (left outer join) produces the complete set of records from Table A, with the matching records (where available) from Table B. Where there is no match, the right-hand side contains null.

### Exercise 8 - Perform a left join between dataframes df_a and df_b using the subject_id column.
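
The notebook leaves this exercise unsolved. One possible approach, assuming the exercise intends df_a as the left table and df_b as the right table, is sketched below. The name `df_ab_left` is hypothetical, and the dataframes are rebuilt only so the block runs on its own.

```python
import pandas as pd

# Same data as df_a and df_b defined earlier in this notebook
df_a = pd.DataFrame({
    'subject_id': ['1', '2', '3', '4', '5'],
    'name': ['Bob', 'Maria', 'Mateus', 'Ivo', 'Gerson'],
    'surname': ['Anderson', 'Teixeira', 'Amoedo', 'Trindade', 'Vargas']})
df_b = pd.DataFrame({
    'subject_id': ['4', '5', '6', '7', '8'],
    'name': ['Roberto', 'Mariana', 'Ana', 'Marcos', 'Maria'],
    'surname': ['Sampaio', 'Fernandes', 'arantes', 'Menezes', 'martins']})

# Keep every row of df_a; rows without a match in df_b
# get NaN in name_y and surname_y
df_ab_left = pd.merge(df_a, df_b, how='left', on='subject_id')
print(df_ab_left)
```

Only subject_id 4 and 5 exist in both tables, so the rows for 1, 2 and 3 carry missing values on the right-hand side.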

### Exercise 9 - Exercise 8 generates missing values. Don't let this happen when you merge!
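
A minimal sketch of one way to approach this, assuming a left join of df_a with df_b and reusing the `'Outro'` ("Other") placeholder from Exercise 6. It uses `fillna`, which for this data should behave like the `.replace(np.nan, ...)` call used earlier. The name `df_ab_left_filled` is hypothetical.

```python
import pandas as pd

# Same data as df_a and df_b defined earlier in this notebook
df_a = pd.DataFrame({
    'subject_id': ['1', '2', '3', '4', '5'],
    'name': ['Bob', 'Maria', 'Mateus', 'Ivo', 'Gerson'],
    'surname': ['Anderson', 'Teixeira', 'Amoedo', 'Trindade', 'Vargas']})
df_b = pd.DataFrame({
    'subject_id': ['4', '5', '6', '7', '8'],
    'name': ['Roberto', 'Mariana', 'Ana', 'Marcos', 'Maria'],
    'surname': ['Sampaio', 'Fernandes', 'arantes', 'Menezes', 'martins']})

# Left join, then replace the NaN produced by unmatched rows with a placeholder
df_ab_left_filled = (
    pd.merge(df_a, df_b, how='left', on='subject_id')
      .fillna('Outro')
)
print(df_ab_left_filled)
```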

A right join is the mirror image of a left join: it keeps every record from Table B, with the matching records (where available) from Table A.

### Exercise 10 - Perform a right join between dataframes df_a and df_b using the subject_id column.
**Don't allow missing values!**
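
One possible approach, assuming the exercise intends df_a as the left table and df_b as the right table, and reusing the `'Outro'` ("Other") placeholder from Exercise 6. The name `df_ab_right` is hypothetical.

```python
import pandas as pd

# Same data as df_a and df_b defined earlier in this notebook
df_a = pd.DataFrame({
    'subject_id': ['1', '2', '3', '4', '5'],
    'name': ['Bob', 'Maria', 'Mateus', 'Ivo', 'Gerson'],
    'surname': ['Anderson', 'Teixeira', 'Amoedo', 'Trindade', 'Vargas']})
df_b = pd.DataFrame({
    'subject_id': ['4', '5', '6', '7', '8'],
    'name': ['Roberto', 'Mariana', 'Ana', 'Marcos', 'Maria'],
    'surname': ['Sampaio', 'Fernandes', 'arantes', 'Menezes', 'martins']})

# Keep every row of df_b; unmatched rows would have NaN in name_x and
# surname_x, so fill them with a placeholder
df_ab_right = (
    pd.merge(df_a, df_b, how='right', on='subject_id')
      .fillna('Outro')
)
print(df_ab_right)
```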

### Exercise 11 - Notice that the previous exercise generated very similar column names. Add suffixes to tell the columns apart.
**Do not allow missing values.**
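
A sketch of how this could be done with the `suffixes` parameter of `pd.merge`, which replaces the default `_x` and `_y`. The suffixes `_a` and `_b`, the name `df_ab_suffix`, and the choice of a right join are assumptions made for illustration.

```python
import pandas as pd

# Same data as df_a and df_b defined earlier in this notebook
df_a = pd.DataFrame({
    'subject_id': ['1', '2', '3', '4', '5'],
    'name': ['Bob', 'Maria', 'Mateus', 'Ivo', 'Gerson'],
    'surname': ['Anderson', 'Teixeira', 'Amoedo', 'Trindade', 'Vargas']})
df_b = pd.DataFrame({
    'subject_id': ['4', '5', '6', '7', '8'],
    'name': ['Roberto', 'Mariana', 'Ana', 'Marcos', 'Maria'],
    'surname': ['Sampaio', 'Fernandes', 'arantes', 'Menezes', 'martins']})

# suffixes labels the overlapping columns by source table;
# fillna removes the missing values left by unmatched rows
df_ab_suffix = (
    pd.merge(df_a, df_b, how='right', on='subject_id',
             suffixes=('_a', '_b'))
      .fillna('Outro')
)
print(df_ab_suffix)
```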

### Exercise 12 - Merge dataframes df_a and df_b based on index

In [17]:
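# Join on the row index of both dataframes; "on" cannot be combined with left_index / right_index
df_ab_index = pd.merge(df_a, df_b, left_index = True, right_index = True)
df_ab_index

Unnamed: 0,subject_id_x,name_x,surname_x,subject_id_y,name_y,surname_y
0,1,Bob,Anderson,4,Roberto,Sampaio
1,2,Maria,Teixeira,5,Mariana,Fernandes
2,3,Mateus,Amoedo,6,Ana,arantes
3,4,Ivo,Trindade,7,Marcos,Menezes
4,5,Gerson,Vargas,8,Maria,martins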

# The End