<h2>Introduction</h2>
<div style="font-size:18px;font-family:Calibri">
    Data wrangling is a broad term, often used informally, for the <b>process of transforming raw data into a clean, organized format</b> that is ready to use. It is one of the most important steps in preprocessing data.
</div>

In [2]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

<h2>Creating a DataFrame</h2>

In [21]:
dictionary = {
    "Name": ["Anaya","Morris","Ayesha","Johnny","Katherine"],
    "Age": [24, 35, 12, 45, 20],
    "Driver": [True, True, False, np.nan, False]
}

data = pd.DataFrame(dictionary)
data.head(2)

Unnamed: 0,Name,Age,Driver
0,Anaya,24,True
1,Morris,35,True


In [17]:
data["EyesColor"] = ["Brown", "Black", "Blue", np.nan, "Green"]
data

Unnamed: 0,Name,Age,Driver,EyesColor
0,Anaya,24,True,Brown
1,Morris,35,True,Black
2,Ayesha,12,False,Blue
3,Johnny,45,,
4,Katherine,20,False,Green


<h2>Getting Information about the Data</h2>

In [11]:
url = 'https://raw.githubusercontent.com/chrisalbon/sim_data/master/titanic.csv'
df = pd.read_csv(url)
df.head()

Unnamed: 0,Name,PClass,Age,Sex,Survived,SexCode
0,"Allen, Miss Elisabeth Walton",1st,29.0,female,1,1
1,"Allison, Miss Helen Loraine",1st,2.0,female,0,1
2,"Allison, Mr Hudson Joshua Creighton",1st,30.0,male,0,0
3,"Allison, Mrs Hudson JC (Bessie Waldo Daniels)",1st,25.0,female,0,1
4,"Allison, Master Hudson Trevor",1st,0.92,male,1,0


In [24]:
df.shape

(1313, 6)

In [34]:
df.describe(include="all").round(2)

Unnamed: 0,Name,PClass,Age,Sex,Survived,SexCode
count,1313,1313,756.0,1313,1313.0,1313.0
unique,1310,4,,2,,
top,"Connolly, Miss Kate",3rd,,male,,
freq,2,711,,851,,
mean,,,30.4,,0.34,0.35
std,,,14.26,,0.47,0.48
min,,,0.17,,0.0,0.0
25%,,,21.0,,0.0,0.0
50%,,,28.0,,0.0,0.0
75%,,,39.0,,1.0,1.0


In [276]:
df.columns

Index(['Name', 'PClass', 'Age', 'Sex', 'Survived', 'SexCode'], dtype='object')

In [38]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1313 entries, 0 to 1312
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Name      1313 non-null   object 
 1   PClass    1313 non-null   object 
 2   Age       756 non-null    float64
 3   Sex       1313 non-null   object 
 4   Survived  1313 non-null   int64  
 5   SexCode   1313 non-null   int64  
dtypes: float64(1), int64(2), object(3)
memory usage: 61.7+ KB


<h2>Slicing the DataFrame</h2>

In [59]:
df.iloc[0]

Name        Allen, Miss Elisabeth Walton
PClass                               1st
Age                                 29.0
Sex                               female
Survived                               1
SexCode                                1
Name: 0, dtype: object

In [61]:
df.iloc[:5]

Unnamed: 0,Name,PClass,Age,Sex,Survived,SexCode
0,"Allen, Miss Elisabeth Walton",1st,29.0,female,1,1
1,"Allison, Miss Helen Loraine",1st,2.0,female,0,1
2,"Allison, Mr Hudson Joshua Creighton",1st,30.0,male,0,0
3,"Allison, Mrs Hudson JC (Bessie Waldo Daniels)",1st,25.0,female,0,1
4,"Allison, Master Hudson Trevor",1st,0.92,male,1,0


In [69]:
df.iloc[1305::] # Last 8 records

Unnamed: 0,Name,PClass,Age,Sex,Survived,SexCode
1305,"Youssef, Mr Gerios",3rd,,male,0,0
1306,"Zabour, Miss Hileni",3rd,,female,0,1
1307,"Zabour, Miss Tamini",3rd,,female,0,1
1308,"Zakarian, Mr Artun",3rd,27.0,male,0,0
1309,"Zakarian, Mr Maprieder",3rd,26.0,male,0,0
1310,"Zenni, Mr Philip",3rd,22.0,male,0,0
1311,"Lievens, Mr Rene",3rd,24.0,male,0,0
1312,"Zimmerman, Leo",3rd,29.0,male,0,0


In [73]:
df = df.set_index(df["Name"])
df.head()

Unnamed: 0_level_0,Name,PClass,Age,Sex,Survived,SexCode
Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
"Allen, Miss Elisabeth Walton","Allen, Miss Elisabeth Walton",1st,29.0,female,1,1
"Allison, Miss Helen Loraine","Allison, Miss Helen Loraine",1st,2.0,female,0,1
"Allison, Mr Hudson Joshua Creighton","Allison, Mr Hudson Joshua Creighton",1st,30.0,male,0,0
"Allison, Mrs Hudson JC (Bessie Waldo Daniels)","Allison, Mrs Hudson JC (Bessie Waldo Daniels)",1st,25.0,female,0,1
"Allison, Master Hudson Trevor","Allison, Master Hudson Trevor",1st,0.92,male,1,0


In [77]:
df.loc["Allen, Miss Elisabeth Walton"]

Name        Allen, Miss Elisabeth Walton
PClass                               1st
Age                                 29.0
Sex                               female
Survived                               1
SexCode                                1
Name: Allen, Miss Elisabeth Walton, dtype: object

In [89]:
df.iloc[0]

Name        Allen, Miss Elisabeth Walton
PClass                               1st
Age                                 29.0
Sex                               female
Survived                               1
SexCode                                1
Name: Allen, Miss Elisabeth Walton, dtype: object

<div style="font-size:18px;font-family:Calibri">
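The main difference between iloc[] and loc[] is that iloc[] selects rows by integer position within the DataFrame (counting from 0), while loc[] selects rows by index label (e.g., a string). That is why, after setting "Name" as the index, df.iloc[0] and df.loc["Allen, Miss Elisabeth Walton"] return the same row.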
</div>
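<div style="font-size:18px;font-family:Calibri">
A minimal sketch of the difference, using a hypothetical toy frame (<code>people</code> is not part of the Titanic data). It also shows one detail worth knowing: an iloc[] slice excludes its stop position, while a loc[] slice includes its stop label.
</div>

```python
import pandas as pd

# Hypothetical toy frame with string labels as the index.
people = pd.DataFrame(
    {"Age": [24, 35, 12]},
    index=["Anaya", "Morris", "Ayesha"],
)

by_position = people.iloc[0]     # first row, whatever its label is
by_label = people.loc["Anaya"]   # row whose index label is "Anaya"

# Slicing: iloc excludes the stop position, loc includes the stop label.
first_two_by_position = people.iloc[0:2]
first_two_by_label = people.loc["Anaya":"Morris"]

print(by_position.equals(by_label))
print(first_two_by_position.equals(first_two_by_label))
```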

<h2>Selecting Rows Based on Conditions</h2>

In [106]:
df = df.drop("Name", axis=1).reset_index()

In [112]:
df[df["Sex"] == "female"].head()

Unnamed: 0,Name,PClass,Age,Sex,Survived,SexCode
0,"Allen, Miss Elisabeth Walton",1st,29.0,female,1,1
1,"Allison, Miss Helen Loraine",1st,2.0,female,0,1
3,"Allison, Mrs Hudson JC (Bessie Waldo Daniels)",1st,25.0,female,0,1
6,"Andrews, Miss Kornelia Theodosia",1st,63.0,female,1,1
8,"Appleton, Mrs Edward Dale (Charlotte Lamson)",1st,58.0,female,1,1


In [126]:
# Female first-class passengers who survived the disaster (134 in total; head() shows the first five).
df[(df["Sex"] == "female") & (df["PClass"] == "1st") & (df["Survived"] == 1)].head()

Unnamed: 0,Name,PClass,Age,Sex,Survived,SexCode
0,"Allen, Miss Elisabeth Walton",1st,29.0,female,1,1
6,"Andrews, Miss Kornelia Theodosia",1st,63.0,female,1,1
8,"Appleton, Mrs Edward Dale (Charlotte Lamson)",1st,58.0,female,1,1
11,"Astor, Mrs John Jacob (Madeleine Talmadge Force)",1st,19.0,female,1,1
12,"Aubert, Mrs Leontine Pauline",1st,,female,1,1


<h2>Sorting The Values</h2>

In [139]:
df.sort_values(by=["Age","Sex"], ascending=False).head(4)

Unnamed: 0,Name,PClass,Age,Sex,Survived,SexCode
9,"Artagaveytia, Mr Ramon",1st,71.0,male,0,0
119,"Goldschmidt, Mr George B",1st,71.0,male,0,0
505,"Mitchell, Mr Henry Michael",2nd,71.0,male,0,0
72,"Crosby, Captain Edward Gifford",1st,70.0,male,0,0


<h2>Replacing Values</h2>

In [146]:
df["Sex"] = df["Sex"].replace("female", "women")  # assign back instead of inplace=True on a column selection

In [152]:
df["Sex"] = df["Sex"].replace({"women": "Female", "male": "Male"})

In [154]:
df.head()

Unnamed: 0,Name,PClass,Age,Sex,Survived,SexCode
0,"Allen, Miss Elisabeth Walton",1st,29.0,Female,1,1
1,"Allison, Miss Helen Loraine",1st,2.0,Female,0,1
2,"Allison, Mr Hudson Joshua Creighton",1st,30.0,Male,0,0
3,"Allison, Mrs Hudson JC (Bessie Waldo Daniels)",1st,25.0,Female,0,1
4,"Allison, Master Hudson Trevor",1st,0.92,Male,1,0


In [158]:
df.replace(1, "One").head()

Unnamed: 0,Name,PClass,Age,Sex,Survived,SexCode
0,"Allen, Miss Elisabeth Walton",1st,29.0,Female,One,One
1,"Allison, Miss Helen Loraine",1st,2.0,Female,0,One
2,"Allison, Mr Hudson Joshua Creighton",1st,30.0,Male,0,0
3,"Allison, Mrs Hudson JC (Bessie Waldo Daniels)",1st,25.0,Female,0,One
4,"Allison, Master Hudson Trevor",1st,0.92,Male,One,0


In [164]:
df.replace(r"1st", "First", regex=True).head()

Unnamed: 0,Name,PClass,Age,Sex,Survived,SexCode
0,"Allen, Miss Elisabeth Walton",First,29.0,Female,1,1
1,"Allison, Miss Helen Loraine",First,2.0,Female,0,1
2,"Allison, Mr Hudson Joshua Creighton",First,30.0,Male,0,0
3,"Allison, Mrs Hudson JC (Bessie Waldo Daniels)",First,25.0,Female,0,1
4,"Allison, Master Hudson Trevor",First,0.92,Male,1,0


<h2>Renaming the Columns in the DataFrame</h2>

In [177]:
df.rename(columns={"PClass": "Passenger Class", "Sex": "Gender"}, inplace=True)

In [179]:
df.head()

Unnamed: 0,Name,Passenger Class,Age,Gender,Survived,SexCode
0,"Allen, Miss Elisabeth Walton",1st,29.0,Female,1,1
1,"Allison, Miss Helen Loraine",1st,2.0,Female,0,1
2,"Allison, Mr Hudson Joshua Creighton",1st,30.0,Male,0,0
3,"Allison, Mrs Hudson JC (Bessie Waldo Daniels)",1st,25.0,Female,0,1
4,"Allison, Master Hudson Trevor",1st,0.92,Male,1,0


<h2> Finding the Minimum, Maximum, Sum, Average and Count</h2>

In [184]:
df["Age"].min()

0.17

In [186]:
df["Age"].max()

71.0

In [188]:
df["Age"].mean()

30.397989417989418

In [240]:
df["Age"].median()

28.0

In [190]:
df["Age"].sum()

22980.88

In [192]:
df["Age"].count()

756

In [194]:
df.count()

Name               1313
Passenger Class    1313
Age                 756
Gender             1313
Survived           1313
SexCode            1313
dtype: int64

In [196]:
df["Age"].std()

14.259048710359023

In [198]:
df["Age"].var()

203.32047012439133

In [200]:
# Kurtosis measures how heavy the tails of a distribution are compared with the tails of a
# normal distribution. pandas reports excess kurtosis, so a normal distribution scores 0.
df["Age"].kurt()

-0.036536168924722556

In [202]:
# Skewness is a measure of the asymmetry of a distribution.
df["Age"].skew()

0.36851087371648295

In [204]:
# Standard error of the mean: an estimate of how far the sample mean is likely to be from the population mean.
df["Age"].sem()

0.5185965877244655
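
<div style="font-size:18px;font-family:Calibri">
The standard error of the mean is the sample standard deviation divided by the square root of the number of non-null observations. The sketch below checks that relationship on a small illustrative Series (<code>ages</code> and <code>manual_sem</code> are names chosen for this example); the same arithmetic is consistent with the std() and count() results for the Age column.
</div>

```python
import numpy as np
import pandas as pd

# Illustrative values only: five ages plus one missing entry.
ages = pd.Series([29.0, 2.0, 30.0, 25.0, 0.92, np.nan])

n = ages.count()                      # counts non-null observations only
manual_sem = ages.std() / np.sqrt(n)  # std() uses ddof=1 by default, as sem() does

print(np.isclose(manual_sem, ages.sem()))
```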

<h2>Finding the Unique Values</h2>

In [211]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1313 entries, 0 to 1312
Data columns (total 6 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Name             1313 non-null   object 
 1   Passenger Class  1313 non-null   object 
 2   Age              756 non-null    float64
 3   Gender           1313 non-null   object 
 4   Survived         1313 non-null   int64  
 5   SexCode          1313 non-null   int64  
dtypes: float64(1), int64(2), object(3)
memory usage: 61.7+ KB


In [209]:
df["Gender"].unique()

array(['Female', 'Male'], dtype=object)

In [213]:
df["Passenger Class"].unique()

array(['1st', '2nd', '*', '3rd'], dtype=object)

In [217]:
df.Gender.value_counts()

Gender
Male      851
Female    462
Name: count, dtype: int64

In [221]:
df["Passenger Class"].value_counts()

Passenger Class
3rd    711
1st    322
2nd    279
*        1
Name: count, dtype: int64

In [225]:
df["Age"].nunique()
# Number of unique values

75

<h2>Handling the Missing Values</h2>

In [238]:
df[df["Age"].isna()].shape
# 557 records have no age recorded.

(557, 6)

In [250]:
df[df["Age"].isna()].fillna(df["Age"].median()).head()

Unnamed: 0,Name,Passenger Class,Age,Gender,Survived,SexCode
12,"Aubert, Mrs Leontine Pauline",1st,28.0,Female,1,1
13,"Barkworth, Mr Algernon H",1st,28.0,Male,1,0
14,"Baumann, Mr John D",1st,28.0,Male,0,0
29,"Borebank, Mr John James",1st,28.0,Male,0,0
32,"Bradley, Mr George",1st,28.0,Male,1,0
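
<div style="font-size:18px;font-family:Calibri">
The cell above only previews the filled rows; df itself is unchanged. To keep the imputed values, one option is to fill just the Age column and assign it back. A minimal sketch on a hypothetical toy frame (<code>toy</code> stands in for the Titanic data):
</div>

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the Titanic data.
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Age": [20.0, np.nan, 40.0, np.nan],
})

median_age = toy["Age"].median()            # NaN is skipped: median of 20 and 40
toy["Age"] = toy["Age"].fillna(median_age)  # assign back so the change persists

print(toy)
```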


<h2>Deleting a Column</h2>
<div style="font-size:18px;font-family:Calibri">
    A straightforward way to delete a column is to call drop() with the $axis=1$ parameter.
</div>

In [264]:
df.drop(["Gender"], axis=1).head()
# By default, drop() returns a new DataFrame and leaves df unchanged;
# pass inplace=True to modify df itself.

Unnamed: 0,Name,Passenger Class,Age,Survived,SexCode
0,"Allen, Miss Elisabeth Walton",1st,29.0,1,1
1,"Allison, Miss Helen Loraine",1st,2.0,0,1
2,"Allison, Mr Hudson Joshua Creighton",1st,30.0,0,0
3,"Allison, Mrs Hudson JC (Bessie Waldo Daniels)",1st,25.0,0,1
4,"Allison, Master Hudson Trevor",1st,0.92,1,0


In [268]:
df.drop(["SexCode"], axis=1, inplace=True)

In [280]:
df.drop([df.columns[0]], axis=1).head()
# Columns can also be dropped by position: df.columns[0] looks up the name of the first column.

Unnamed: 0,Passenger Class,Age,Gender,Survived
0,1st,29.0,Female,1
1,1st,2.0,Female,0
2,1st,30.0,Male,0
3,1st,25.0,Female,0
4,1st,0.92,Male,1


<h2>Deleting a Row</h2>
<div style="font-size:18px;font-family:Calibri">
    Use a boolean condition to create a new dataframe excluding the rows you want to delete.
</div>
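
<div style="font-size:18px;font-family:Calibri">
A minimal sketch of that idea on a hypothetical toy frame (<code>toy</code> is an assumption for illustration): the boolean mask keeps the rows where the condition is True and returns a new DataFrame. The cells that follow reach the same kind of result with drop().
</div>

```python
import pandas as pd

# Hypothetical toy frame, not the Titanic data.
toy = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Gender": ["Female", "Male", "Male"],
})

males_only = toy[toy["Gender"] == "Male"]  # keep rows where the mask is True
without_b = toy[toy["Name"] != "B"]        # "delete" one row by excluding it

print(males_only)
print(without_b)
```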

In [300]:
df.drop(df[df["Gender"] != "Male"].index.tolist()).head()

Unnamed: 0,Name,Passenger Class,Age,Gender,Survived
2,"Allison, Mr Hudson Joshua Creighton",1st,30.0,Male,0
4,"Allison, Master Hudson Trevor",1st,0.92,Male,1
5,"Anderson, Mr Harry",1st,47.0,Male,1
7,"Andrews, Mr Thomas, jr",1st,39.0,Male,0
9,"Artagaveytia, Mr Ramon",1st,71.0,Male,0


In [306]:
df.drop([0, 1, 2, 3, 4, 5], axis = 0).head(2)

Unnamed: 0,Name,Passenger Class,Age,Gender,Survived
6,"Andrews, Miss Kornelia Theodosia",1st,63.0,Female,1
7,"Andrews, Mr Thomas, jr",1st,39.0,Male,0


<h2>Dropping Duplicate Rows</h2>

In [312]:
df.drop_duplicates().head(2)

Unnamed: 0,Name,Passenger Class,Age,Gender,Survived
0,"Allen, Miss Elisabeth Walton",1st,29.0,Female,1
1,"Allison, Miss Helen Loraine",1st,2.0,Female,0


In [322]:
df.drop_duplicates(subset = ["Name"]).tail()
# Dropping the rows with duplicate names.

Unnamed: 0,Name,Passenger Class,Age,Gender,Survived
1308,"Zakarian, Mr Artun",3rd,27.0,Male,0
1309,"Zakarian, Mr Maprieder",3rd,26.0,Male,0
1310,"Zenni, Mr Philip",3rd,22.0,Male,0
1311,"Lievens, Mr Rene",3rd,24.0,Male,0
1312,"Zimmerman, Leo",3rd,29.0,Male,0


In [328]:
df.duplicated()

0       False
1       False
2       False
3       False
4       False
        ...  
1308    False
1309    False
1310    False
1311    False
1312    False
Length: 1313, dtype: bool

<h2> Grouping Rows By Values </h2>

In [340]:
df.head(3)

Unnamed: 0,Name,Passenger Class,Age,Gender,Survived
0,"Allen, Miss Elisabeth Walton",1st,29.0,Female,1
1,"Allison, Miss Helen Loraine",1st,2.0,Female,0
2,"Allison, Mr Hudson Joshua Creighton",1st,30.0,Male,0


In [346]:
df.groupby("Passenger Class")[["Age", "Survived"]].mean()  # select the columns first; mean() does not take a column list

Unnamed: 0_level_0,Age,Survived
Passenger Class,Unnamed: 1_level_1,Unnamed: 2_level_1
*,,0.0
1st,39.667788,0.599379
2nd,28.300142,0.426523
3rd,25.208585,0.194093


In [354]:
df.groupby("Survived")["Name"].count()

Survived
0    863
1    450
Name: Name, dtype: int64

In [356]:
df.groupby(["Gender", "Survived"])[["Age"]].mean()

Unnamed: 0_level_0,Unnamed: 1_level_0,Age
Gender,Survived,Unnamed: 2_level_1
Female,0,24.901408
Female,1,30.867143
Male,0,32.32078
Male,1,25.951875


<h2> Grouping Rows By Time </h2>

In [382]:
time_index = pd.date_range("06/06/2017", periods = 100000, freq = "30s")
time_index

DatetimeIndex(['2017-06-06 00:00:00', '2017-06-06 00:00:30',
               '2017-06-06 00:01:00', '2017-06-06 00:01:30',
               '2017-06-06 00:02:00', '2017-06-06 00:02:30',
               '2017-06-06 00:03:00', '2017-06-06 00:03:30',
               '2017-06-06 00:04:00', '2017-06-06 00:04:30',
               ...
               '2017-07-10 17:15:00', '2017-07-10 17:15:30',
               '2017-07-10 17:16:00', '2017-07-10 17:16:30',
               '2017-07-10 17:17:00', '2017-07-10 17:17:30',
               '2017-07-10 17:18:00', '2017-07-10 17:18:30',
               '2017-07-10 17:19:00', '2017-07-10 17:19:30'],
              dtype='datetime64[ns]', length=100000, freq='30s')

In [384]:
data = pd.DataFrame(index = time_index)

In [386]:
data["Sale_Amount"] = np.random.randint(1, 10, 100000)

In [388]:
data.head()

Unnamed: 0,Sale_Amount
2017-06-06 00:00:00,1
2017-06-06 00:00:30,6
2017-06-06 00:01:00,4
2017-06-06 00:01:30,3
2017-06-06 00:02:00,7


In [392]:
# Group the rows by week, and calculate the sum per week
data.resample('W').sum()

Unnamed: 0,Sale_Amount
2017-06-11,86314
2017-06-18,100218
2017-06-25,100795
2017-07-02,100093
2017-07-09,101056
2017-07-16,10406


In [396]:
data.resample("2W").mean()
# Group by 2 weeks and calculate the mean

Unnamed: 0,Sale_Amount
2017-06-11,4.995023
2017-06-25,4.985441
2017-07-09,4.988814
2017-07-23,5.002885


In [398]:
data.resample("M").count()  # month-end bins; pandas 2.2+ deprecates the "M" alias in favor of "ME"

Unnamed: 0,Sale_Amount
2017-06-30,72000
2017-07-31,28000


In [400]:
data.resample("M", label = "left").count()

Unnamed: 0,Sale_Amount
2017-05-31,72000
2017-06-30,28000


<h2>Aggregation Operations & Statistics</h2>
<div style="font-size:18px;font-family:Calibri">
    Aggregate functions are especially useful during Exploratory Data Analysis to learn information about different subpopulations of data and the relationship between variables.
</div>

In [5]:
url = 'https://raw.githubusercontent.com/chrisalbon/sim_data/master/titanic.csv'
df = pd.read_csv(url)
df.head()

Unnamed: 0,Name,PClass,Age,Sex,Survived,SexCode
0,"Allen, Miss Elisabeth Walton",1st,29.0,female,1,1
1,"Allison, Miss Helen Loraine",1st,2.0,female,0,1
2,"Allison, Mr Hudson Joshua Creighton",1st,30.0,male,0,0
3,"Allison, Mrs Hudson JC (Bessie Waldo Daniels)",1st,25.0,female,0,1
4,"Allison, Master Hudson Trevor",1st,0.92,male,1,0


In [7]:
df.agg("min")

Name        Abbing, Mr Anthony
PClass                       *
Age                       0.17
Sex                     female
Survived                     0
SexCode                      0
dtype: object

In [9]:
df.agg({"Age": ["min", "max", "mean"], "SexCode": ["min", "max"]})

Unnamed: 0,Age,SexCode
min,0.17,0.0
max,71.0,1.0
mean,30.397989,


In [17]:
df.groupby(["PClass", "Survived"]).agg({"Survived": ["count"]}).reset_index()

Unnamed: 0_level_0,PClass,Survived,Survived
Unnamed: 0_level_1,Unnamed: 1_level_1,Unnamed: 2_level_1,count
0,*,0,1
1,1st,0,129
2,1st,1,193
3,2nd,0,160
4,2nd,1,119
5,3rd,0,573
6,3rd,1,138
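
<div style="font-size:18px;font-family:Calibri">
Passing a dictionary of lists to agg() produces two-level column headers, as in the output above. If flat column names are preferred, named aggregation is one alternative. A hedged sketch on a hypothetical toy frame (<code>toy</code>, <code>passengers</code> and <code>survival_rate</code> are names chosen for this example):
</div>

```python
import pandas as pd

# Hypothetical toy frame standing in for the Titanic data.
toy = pd.DataFrame({
    "PClass": ["1st", "1st", "3rd", "3rd"],
    "Survived": [1, 0, 0, 0],
})

# Named aggregation: new_column=(source_column, function)
summary = (
    toy.groupby("PClass")
    .agg(
        passengers=("Survived", "count"),
        survival_rate=("Survived", "mean"),
    )
    .reset_index()
)

print(summary)
```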


<h2>Looping Over Columns</h2>

In [35]:
for name in df["Name"][:5]:
    print(name.upper())

ALLEN, MISS ELISABETH WALTON
ALLISON, MISS HELEN LORAINE
ALLISON, MR HUDSON JOSHUA CREIGHTON
ALLISON, MRS HUDSON JC (BESSIE WALDO DANIELS)
ALLISON, MASTER HUDSON TREVOR


In [37]:
[name.upper() for name in df["Name"][:5]]

['ALLEN, MISS ELISABETH WALTON',
 'ALLISON, MISS HELEN LORAINE',
 'ALLISON, MR HUDSON JOSHUA CREIGHTON',
 'ALLISON, MRS HUDSON JC (BESSIE WALDO DANIELS)',
 'ALLISON, MASTER HUDSON TREVOR']

<h2>Applying a Function Over All Elements in a Column</h2>

In [43]:
def uppercase(x):
    return x.upper()

df["Name"].apply(uppercase)[:5]

0                     ALLEN, MISS ELISABETH WALTON
1                      ALLISON, MISS HELEN LORAINE
2              ALLISON, MR HUDSON JOSHUA CREIGHTON
3    ALLISON, MRS HUDSON JC (BESSIE WALDO DANIELS)
4                    ALLISON, MASTER HUDSON TREVOR
Name: Name, dtype: object

<h2>Applying a Function To Each Group</h2>

In [55]:
# here, "x" represents the individual group.
df.groupby("Sex").apply(lambda x : x.count())

Unnamed: 0_level_0,Name,PClass,Age,Sex,Survived,SexCode
Sex,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
female,462,462,288,462,462,462
male,851,851,468,851,851,851


<h2>Concatenate DataFrames</h2>

In [58]:
dictionary = {
    "Name": ["Anaya","Morris","Ayesha","Johnny","Katherine"],
    "Age": [24, 35, 12, 45, 20],
    "Driver": [True, True, False, np.nan, False]
}
data_a = pd.DataFrame(dictionary)

dictionary = {
    "Name": ["Anu", "Meghna", "Shaima"],
    "Age": [59, 45, 25],
    "Driver": [True, True, False]
}
data_b = pd.DataFrame(dictionary)

In [66]:
df2 = pd.concat([data_a, data_b], axis=0, ignore_index=True)
# Stack the two frames vertically; ignore_index=True renumbers the rows from 0.

df2.head()

Unnamed: 0,Name,Age,Driver
0,Anaya,24,True
1,Morris,35,True
2,Ayesha,12,False
3,Johnny,45,
4,Katherine,20,False


<h2> Merging DataFrames </h2>

In [69]:
emp_data = {
    "id": [1, 2, 3, 4, 5],
    "name": ["Amy Jones", "Catherine", "Arjun Roy", "Alice", "Kim Jung"] 
}
df_emp = pd.DataFrame(emp_data)

sales_data = {
    "id": [3, 4, 5, 6, 7, 8],
    "tot_sales": [1234, 5262, 9821, 2756, 5123, 6789]
}
df_sales = pd.DataFrame(sales_data)

In [87]:
pd.merge(df_emp, df_sales, on = "id", how = "inner")

Unnamed: 0,id,name,tot_sales
0,3,Arjun Roy,1234
1,4,Alice,5262
2,5,Kim Jung,9821


In [89]:
pd.merge(df_emp, df_sales, on = "id", how = "left")

Unnamed: 0,id,name,tot_sales
0,1,Amy Jones,
1,2,Catherine,
2,3,Arjun Roy,1234.0
3,4,Alice,5262.0
4,5,Kim Jung,9821.0


In [91]:
pd.merge(df_emp, df_sales, on = "id", how = "right")

Unnamed: 0,id,name,tot_sales
0,3,Arjun Roy,1234
1,4,Alice,5262
2,5,Kim Jung,9821
3,6,,2756
4,7,,5123
5,8,,6789


In [93]:
pd.merge(df_emp, df_sales, on = "id", how = "outer")

Unnamed: 0,id,name,tot_sales
0,1,Amy Jones,
1,2,Catherine,
2,3,Arjun Roy,1234.0
3,4,Alice,5262.0
4,5,Kim Jung,9821.0
5,6,,2756.0
6,7,,5123.0
7,8,,6789.0
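
<div style="font-size:18px;font-family:Calibri">
When checking an outer merge, it can help to see which side each row came from. The sketch below uses merge's <code>indicator=True</code> option, which adds a <code>_merge</code> column; the two small frames are cut-down, illustrative versions of the ones above.
</div>

```python
import pandas as pd

# Cut-down, illustrative versions of df_emp and df_sales.
df_emp = pd.DataFrame({"id": [1, 2, 3], "name": ["Amy Jones", "Catherine", "Arjun Roy"]})
df_sales = pd.DataFrame({"id": [3, 6], "tot_sales": [1234, 2756]})

# indicator=True adds a "_merge" column: "left_only", "right_only" or "both".
merged = pd.merge(df_emp, df_sales, on="id", how="outer", indicator=True)

print(merged)
```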


<h2> Done with Day 3 :)</h2>