# Pandas Basics for Machine Learning

## Section 0: Introduction to Pandas and Its Importance for Machine Learning

### Why Pandas is Important for Machine Learning

The name "Pandas" is derived from the term "Panel Data", an econometrics term for multidimensional structured data sets. Pandas is a powerful and flexible Python library used for data manipulation and analysis. It is essential for machine learning for several reasons:

#### 1. Handling Tabular Data
- **Tabular Data Management**: Machine learning often involves working with structured, tabular data. Pandas provides efficient data structures like DataFrames and Series to handle such data. These structures are intuitive and allow for complex operations with simple syntax.
- **Data Importing**: Pandas can read data from various file formats such as CSV, Excel, SQL databases, and more, making it easy to import datasets for analysis and modeling.

#### 2. Data Cleaning and Preparation
- **Handling Missing Data**: Real-world data is often messy and contains missing values. Pandas provides straightforward methods to detect, handle, and fill missing data, which is crucial for preparing clean datasets for machine learning models.
- **Filtering and Sorting**: Pandas allows for easy filtering, sorting, and subsetting of data based on specific criteria, helping to prepare and clean data efficiently.

#### 3. Data Transformation
- **Feature Engineering**: Creating new features from existing data is a key step in improving model performance. Pandas offers powerful tools for manipulating and transforming data, enabling effective feature engineering.
- **Merging and Joining**: Combining multiple datasets is a common task in data analysis. Pandas provides robust methods for merging, joining, and concatenating datasets, facilitating comprehensive data preparation.

#### 4. Exploratory Data Analysis (EDA)
- **Descriptive Statistics**: Pandas allows for quick calculation of summary statistics, such as mean, median, standard deviation, etc., providing insights into the data distribution and helping in identifying patterns and anomalies.
- **Data Visualization**: Although not a visualization library, Pandas integrates well with libraries like Matplotlib and Seaborn, enabling easy creation of plots and charts for data exploration.

#### 5. Integration with Machine Learning Libraries
- **Seamless Integration & DataFrame Compatibility**: Many machine learning functions and models in libraries like Scikit-learn accept Pandas DataFrames as input, making it convenient to directly use preprocessed data for training and evaluation.

#### Conclusion
Pandas is the backbone of data manipulation and preparation in the machine learning workflow. Its rich functionality and ease of use make it a standard tool for data scientists and machine learning practitioners. Fluency with Pandas streamlines data handling, and cleaner, better-prepared data generally leads to more reliable machine learning models.


## Section 1: Getting Started with Pandas

In this section, we'll cover the basics of getting started with pandas, including how to import the library and understand its core data structures: Series and DataFrame.

In [27]:
import pandas as pd
import numpy as np
# Display version
print(pd.__version__)

# Creating a Series
s = pd.Series([1, 3, 5, np.nan, 6, 8])
print(s)

# Creating a DataFrame
dates = pd.date_range('20230101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
print(df)

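# Build a DataFrame from a dict that maps column names to column values
df = pd.DataFrame(
    {"a": [4, 5, 6],
     "b": [7, 8, 9],
     "c": [10, 11, 12]},
    index=[1, 2, 3])
print(df)

# Build the same DataFrame from a list of rows plus explicit column names
df = pd.DataFrame(
    [[4, 7, 10],
     [5, 8, 11],
     [6, 9, 12]],
    index=[1, 2, 3],
    columns=['a', 'b', 'c'])
print(df)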

2.2.2
0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
                   A         B         C         D
2023-01-01 -0.231556 -0.323562 -0.575303  1.338373
2023-01-02 -0.500776  0.071741  1.811697  1.138624
2023-01-03 -1.183986  0.072321  1.288304 -0.855652
2023-01-04 -1.053116  0.266195  1.654831 -1.845351
2023-01-05  0.342120  1.313130  0.287180  0.107517
2023-01-06  0.684319  0.174792  1.255920 -0.090869
   a  b   c
1  4  7  10
2  5  8  11
3  6  9  12
   a  b   c
1  4  7  10
2  5  8  11
3  6  9  12


In [28]:
df = pd.DataFrame(
    {"a" : [4 ,5, 6],
    "b" : [7, 8, 9],
    "c" : [10, 11, 12]},
    index = pd.MultiIndex.from_tuples(
    [('d', 1), ('d', 2),
    ('e', 2)], names=['n', 'v'])
    )
print(df)

     a  b   c
n v          
d 1  4  7  10
  2  5  8  11
e 2  6  9  12
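
A frame with a MultiIndex can be sliced by either index level. The sketch below rebuilds the frame from the cell above under the illustrative name `df_mi` and shows three common selections; it is a minimal example, not an exhaustive tour of MultiIndex indexing.

```python
import pandas as pd

# Same data as the cell above; the name df_mi is only for this illustration
df_mi = pd.DataFrame(
    {"a": [4, 5, 6], "b": [7, 8, 9], "c": [10, 11, 12]},
    index=pd.MultiIndex.from_tuples(
        [("d", 1), ("d", 2), ("e", 2)], names=["n", "v"]
    ),
)

# All rows whose outer level 'n' equals 'd' (the 'n' level is dropped from the result)
rows_d = df_mi.loc["d"]

# A single row addressed by the full (n, v) tuple; the result is a Series
row_d2 = df_mi.loc[("d", 2)]

# All rows whose inner level 'v' equals 2, regardless of 'n'
rows_v2 = df_mi.xs(2, level="v")

print(rows_d)
print(row_d2)
print(rows_v2)
```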


## Section 2: Loading Data with Pandas
In this section, we'll learn how to load data into a Pandas DataFrame. We'll use the `pd.read_csv()` function to read a CSV file.

In [29]:
import pandas as pd

# Load a sample dataset
url = 'https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv'
df = pd.read_csv(url) # read_csv can be replaced with read_excel, read_json, etc.

df.head(10)

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True
5,0,3,male,,0,0,8.4583,Q,Third,man,True,,Queenstown,no,True
6,0,1,male,54.0,0,0,51.8625,S,First,man,True,E,Southampton,no,True
7,0,3,male,2.0,3,1,21.075,S,Third,child,False,,Southampton,no,False
8,1,3,female,27.0,0,2,11.1333,S,Third,woman,False,,Southampton,yes,False
9,1,2,female,14.0,1,0,30.0708,C,Second,child,False,,Cherbourg,yes,False
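
`read_csv` accepts a path, a URL, or any file-like object, and a few of its parameters control what gets loaded. The sketch below uses a short, made-up CSV string (not the real Titanic file) so that it runs without network access; the variable names are illustrative.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file on disk or at a URL
csv_text = """survived,pclass,sex,age,fare
0,3,male,22.0,7.25
1,1,female,38.0,71.2833
0,3,male,,8.4583
"""

df_small = pd.read_csv(
    io.StringIO(csv_text),               # a path or URL string would also work here
    usecols=["survived", "sex", "age"],  # load only the columns that are needed
    dtype={"survived": "int8"},          # override the dtype pandas would infer
)

print(df_small.dtypes)
print(df_small)   # the empty 'age' field is read as NaN
```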


## Section 3: Exploring the Data
In this section, we'll learn how to explore our dataset. We'll use various functions to understand the structure and summary statistics of the data.

In [30]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   survived     891 non-null    int64  
 1   pclass       891 non-null    int64  
 2   sex          891 non-null    object 
 3   age          714 non-null    float64
 4   sibsp        891 non-null    int64  
 5   parch        891 non-null    int64  
 6   fare         891 non-null    float64
 7   embarked     889 non-null    object 
 8   class        891 non-null    object 
 9   who          891 non-null    object 
 10  adult_male   891 non-null    bool   
 11  deck         203 non-null    object 
 12  embark_town  889 non-null    object 
 13  alive        891 non-null    object 
 14  alone        891 non-null    bool   
dtypes: bool(2), float64(2), int64(4), object(7)
memory usage: 92.4+ KB


In [31]:
df.describe()

Unnamed: 0,survived,pclass,age,sibsp,parch,fare
count,891.0,891.0,714.0,891.0,891.0,891.0
mean,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,0.0,1.0,0.42,0.0,0.0,0.0
25%,0.0,2.0,20.125,0.0,0.0,7.9104
50%,0.0,3.0,28.0,0.0,0.0,14.4542
75%,1.0,3.0,38.0,1.0,0.0,31.0
max,1.0,3.0,80.0,8.0,6.0,512.3292
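
By default `describe()` summarizes only the numeric columns, as the output above shows. For text columns, `value_counts()` and `describe(exclude=...)` are common complements. The sketch below uses a small hypothetical sample that borrows a few Titanic column names.

```python
import numpy as np
import pandas as pd

# Small made-up sample; only the column names mirror the Titanic data
sample = pd.DataFrame({
    "sex": ["male", "female", "female", "male", "male"],
    "class": ["Third", "First", "Third", "First", "Third"],
    "age": [22.0, 38.0, 26.0, None, 35.0],
})

# Frequency of each category in one column
class_counts = sample["class"].value_counts()

# Share of each category instead of raw counts
sex_share = sample["sex"].value_counts(normalize=True)

# count / unique / top / freq for the non-numeric columns
text_summary = sample.describe(exclude=[np.number])

print(class_counts)
print(sex_share)
print(text_summary)
```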


## Section 4: Sorting & Filtering & apply

In this section, we'll learn how to sort and filter data in a DataFrame and how to derive new columns. We'll use `sort_values()` to sort, boolean masks to filter rows, `loc`/`iloc` to select rows and columns, and `apply()` to compute new columns.

In [32]:
# Display the first few rows of the dataset
print("First few rows of the Titanic dataset:")
df.head()

First few rows of the Titanic dataset:


Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [33]:
# Sort the DataFrame by 'age'
df_sorted = df.sort_values(by='age', ascending=False)
print("\nDataFrame sorted by 'age':")
df_sorted.head()


DataFrame sorted by 'age':


Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
630,1,1,male,80.0,0,0,30.0,S,First,man,True,A,Southampton,yes,True
851,0,3,male,74.0,0,0,7.775,S,Third,man,True,,Southampton,no,True
493,0,1,male,71.0,0,0,49.5042,C,First,man,True,,Cherbourg,no,True
96,0,1,male,71.0,0,0,34.6542,C,First,man,True,A,Cherbourg,no,True
116,0,3,male,70.5,0,0,7.75,Q,Third,man,True,,Queenstown,no,True
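
`sort_values` also accepts several sort keys, a sort direction per key, and a choice of where missing values go (they are placed last by default). A minimal sketch on made-up data:

```python
import numpy as np
import pandas as pd

# Hypothetical sample; one age is missing on purpose
sample = pd.DataFrame({
    "pclass": [3, 1, 3, 1, 2],
    "age": [22.0, 38.0, np.nan, 35.0, 14.0],
})

# Sort by class ascending, then by age descending within each class;
# na_position="first" puts the row with a missing age ahead of the others in its class
sorted_sample = sample.sort_values(
    by=["pclass", "age"],
    ascending=[True, False],
    na_position="first",
)

print(sorted_sample)
```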


In [34]:
df_filtered = df[df['age'] > 30]

df_filtered.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True
6,0,1,male,54.0,0,0,51.8625,S,First,man,True,E,Southampton,no,True
11,1,1,female,58.0,0,0,26.55,S,First,woman,False,C,Southampton,yes,True


In [35]:
df_filtered_multi = df[(df['age'] > 30) & (df['sex'] == 'female')]

df_filtered_multi.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
11,1,1,female,58.0,0,0,26.55,S,First,woman,False,C,Southampton,yes,True
15,1,2,female,55.0,0,0,16.0,S,Second,woman,False,,Southampton,yes,True
18,0,3,female,31.0,1,0,18.0,S,Third,woman,False,,Southampton,no,False
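
Boolean masks like the ones above can be combined with `&`, `|`, and `~`, and pandas offers a few helpers that express common filters more compactly. The sketch below shows some of them on a small hypothetical sample.

```python
import numpy as np
import pandas as pd

# Made-up rows using a few Titanic column names
sample = pd.DataFrame({
    "sex": ["male", "female", "female", "male", "female"],
    "age": [22.0, 38.0, 26.0, np.nan, 58.0],
    "embarked": ["S", "C", "S", "Q", "S"],
})

# Each condition needs its own parentheses because & binds tighter than > and ==
women_over_30 = sample[(sample["age"] > 30) & (sample["sex"] == "female")]

# isin tests membership in a list of values
from_s_or_c = sample[sample["embarked"].isin(["S", "C"])]

# between is inclusive on both ends by default; NaN never matches
aged_20_to_30 = sample[sample["age"].between(20, 30)]

# query expresses the first filter as a string
women_over_30_q = sample.query("age > 30 and sex == 'female'")
```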


In [36]:
# Use loc to select specific rows and columns
df_loc = df.loc[0:5, ['survived', 'pclass', 'sex', 'age']]
print("\nUsing loc to select specific rows and columns:")
df_loc


Using loc to select specific rows and columns:


Unnamed: 0,survived,pclass,sex,age
0,0,3,male,22.0
1,1,1,female,38.0
2,1,3,female,26.0
3,1,1,female,35.0
4,0,3,male,35.0
5,0,3,male,


In [37]:
# Use iloc to select specific rows and columns by integer position
df_iloc = df.iloc[0:5, [0, 1, 2, 3]]
print("\nUsing iloc to select specific rows and columns by index:")
df_iloc


Using iloc to select specific rows and columns by index:


Unnamed: 0,survived,pclass,sex,age
0,0,3,male,22.0
1,1,1,female,38.0
2,1,3,female,26.0
3,1,1,female,35.0
4,0,3,male,35.0
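
The two outputs above differ in length because `loc` slices by label and includes the end label, while `iloc` slices by position and excludes the end position. The sketch below makes the difference visible with an index whose labels deliberately differ from the row positions; the data is made up.

```python
import pandas as pd

sample = pd.DataFrame(
    {"survived": [0, 1, 1, 1, 0, 0], "pclass": [3, 1, 3, 1, 3, 3]},
    index=[10, 11, 12, 13, 14, 15],   # labels are not the same as positions
)

by_label = sample.loc[10:12, ["survived"]]   # labels 10, 11, 12: end label included
by_position = sample.iloc[0:2, [0]]          # positions 0 and 1: end position excluded

print(by_label)
print(by_position)
```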


In [38]:
# Adding a new column 'age_group' based on age
# Note: NaN < 18 evaluates to False, so rows with a missing age are labeled 'adult'
df['age_group'] = df['age'].apply(lambda x: 'child' if x < 18 else 'adult')
print("\nAdding a new column 'age_group' based on age:")
df[['age', 'age_group']].head(10)


Adding a new column 'age_group' based on age:


Unnamed: 0,age,age_group
0,22.0,adult
1,38.0,adult
2,26.0,adult
3,35.0,adult
4,35.0,adult
5,,adult
6,54.0,adult
7,2.0,child
8,27.0,adult
9,14.0,child
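
Row 5 in the output above has no age yet is labeled `adult`, because any comparison with `NaN` is false and the lambda falls through to its `else` branch. One possible alternative, sketched here on a made-up Series, is `pd.cut`, which leaves missing ages missing.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, including one missing value
ages = pd.Series([22.0, 2.0, np.nan, 14.0, 18.0], name="age")

# right=False makes the bins [0, 18) and [18, inf), so 18 counts as 'adult';
# NaN falls in no bin and therefore stays missing
age_group = pd.cut(
    ages,
    bins=[0, 18, np.inf],
    labels=["child", "adult"],
    right=False,
)

print(age_group)
```

Whether a missing age should stay missing or be imputed first depends on the model and the data, so treat this as one option rather than the required fix.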


In [39]:
# Create a new column 'family_size' with vectorized column arithmetic (no apply needed)
df['family_size'] = df['sibsp'] + df['parch'] + 1
print("\nAdding a new column 'family_size':")
df[['sibsp', 'parch', 'family_size']].head(10)


Adding a new column 'family_size':


Unnamed: 0,sibsp,parch,family_size
0,1,0,2
1,1,0,2
2,0,0,1
3,1,0,2
4,0,0,1
5,0,0,1
6,0,0,1
7,3,1,5
8,0,2,3
9,1,0,2


In [40]:
# Another example of using apply with lambda to categorize family size
df['family_size_category'] = df['family_size'].apply(lambda x: 'small' if x <= 3 else 'large')
print("\nCategorizing 'family_size' into 'family_size_category':")
df[['family_size', 'family_size_category']].head(10)


Categorizing 'family_size' into 'family_size_category':


Unnamed: 0,family_size,family_size_category
0,2,small
1,2,small
2,1,small
3,2,small
4,1,small
5,1,small
6,1,small
7,5,large
8,3,small
9,2,small
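
`apply` with a lambda calls a Python function once per element. For a simple two-way condition, a vectorized expression such as `np.where` gives the same labels and is usually faster on large frames. A minimal sketch on made-up data:

```python
import numpy as np
import pandas as pd

# Hypothetical sibling/spouse and parent/child counts
sample = pd.DataFrame({"sibsp": [1, 0, 3, 0], "parch": [0, 0, 1, 2]})
sample["family_size"] = sample["sibsp"] + sample["parch"] + 1

# Element-wise version with apply and a lambda
via_apply = sample["family_size"].apply(lambda x: "small" if x <= 3 else "large")

# Vectorized version: the condition is evaluated for the whole column at once
via_where = np.where(sample["family_size"] <= 3, "small", "large")

print((via_apply == via_where).all())
```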


## Section 5: Data Cleaning

Data cleaning is a crucial step in preparing data for analysis. This section covers handling missing data and removing duplicates.

In [41]:
import numpy as np
import pandas as pd

# Load the dataset
url = 'https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv'
df = pd.read_csv(url)
df

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.2500,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.9250,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1000,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.0500,S,Third,man,True,,Southampton,no,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,0,2,male,27.0,0,0,13.0000,S,Second,man,True,,Southampton,no,True
887,1,1,female,19.0,0,0,30.0000,S,First,woman,False,B,Southampton,yes,True
888,0,3,female,,1,2,23.4500,S,Third,woman,False,,Southampton,no,False
889,1,1,male,26.0,0,0,30.0000,C,First,man,True,C,Cherbourg,yes,True


In [42]:
# Handling missing data
print("Missing values in each column:\n", df.isnull().sum())

Missing values in each column:
 survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64
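
Raw counts are easier to judge as a share of all rows. Because the mean of a boolean column is the fraction of `True` values, `isna().mean()` gives the missing share per column. The sketch below uses a made-up frame; the 0.5 threshold is an arbitrary illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with missing values in two columns
sample = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, np.nan],
    "deck": [None, "C", None, None],
    "fare": [7.25, 71.28, 7.92, 8.05],
})

# isna() performs the same check as isnull()
missing_share = sample.isna().mean().sort_values(ascending=False)
print(missing_share)

# Columns where more than half of the values are missing
mostly_missing = missing_share[missing_share > 0.5].index.tolist()
print(mostly_missing)
```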


In [43]:
# Drop rows with missing data
df_dropped = df.dropna()

df_dropped

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
3,1,1,female,35.0,1,0,53.1000,S,First,woman,False,C,Southampton,yes,False
6,0,1,male,54.0,0,0,51.8625,S,First,man,True,E,Southampton,no,True
10,1,3,female,4.0,1,1,16.7000,S,Third,child,False,G,Southampton,yes,False
11,1,1,female,58.0,0,0,26.5500,S,First,woman,False,C,Southampton,yes,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
871,1,1,female,47.0,1,1,52.5542,S,First,woman,False,D,Southampton,yes,False
872,0,1,male,33.0,0,0,5.0000,S,First,man,True,B,Southampton,no,True
879,1,1,female,56.0,0,1,83.1583,C,First,woman,False,C,Cherbourg,yes,False
887,1,1,female,19.0,0,0,30.0000,S,First,woman,False,B,Southampton,yes,True


In [44]:
# Fill missing data with a specified value
df_filled = df.fillna(value=0)
df_filled

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.2500,S,Third,man,True,0,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.9250,S,Third,woman,False,0,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1000,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.0500,S,Third,man,True,0,Southampton,no,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,0,2,male,27.0,0,0,13.0000,S,Second,man,True,0,Southampton,no,True
887,1,1,female,19.0,0,0,30.0000,S,First,woman,False,B,Southampton,yes,True
888,0,3,female,0.0,1,2,23.4500,S,Third,woman,False,0,Southampton,no,False
889,1,1,male,26.0,0,0,30.0000,C,First,man,True,C,Cherbourg,yes,True


In [45]:
# Fill missing data using forward fill
df_ffill = df.ffill()
df_ffill

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.2500,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.9250,S,Third,woman,False,C,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1000,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.0500,S,Third,man,True,C,Southampton,no,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,0,2,male,27.0,0,0,13.0000,S,Second,man,True,C,Southampton,no,True
887,1,1,female,19.0,0,0,30.0000,S,First,woman,False,B,Southampton,yes,True
888,0,3,female,19.0,1,2,23.4500,S,Third,woman,False,B,Southampton,no,False
889,1,1,male,26.0,0,0,30.0000,C,First,man,True,C,Cherbourg,yes,True


In [46]:
# Fill missing data using backward fill
df_bfill = df.bfill()
df_bfill

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.2500,S,Third,man,True,C,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.9250,S,Third,woman,False,C,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1000,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.0500,S,Third,man,True,E,Southampton,no,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,0,2,male,27.0,0,0,13.0000,S,Second,man,True,B,Southampton,no,True
887,1,1,female,19.0,0,0,30.0000,S,First,woman,False,B,Southampton,yes,True
888,0,3,female,26.0,1,2,23.4500,S,Third,woman,False,C,Southampton,no,False
889,1,1,male,26.0,0,0,30.0000,C,First,man,True,C,Cherbourg,yes,True
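
Filling every column with one constant, or with a neighboring row's value, is rarely what a model needs. A common alternative is to choose a fill value per column, for example the median for a numeric column and the most frequent value for a categorical one. The sketch below shows this on a small made-up frame; the choice of median and mode is an assumption for illustration, not a recommendation for every dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing value per column
sample = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 40.0],
    "embarked": ["S", "C", None, "S"],
})

filled = sample.copy()   # leave the original frame untouched

# Numeric column: the median is less sensitive to outliers than the mean
filled["age"] = filled["age"].fillna(filled["age"].median())

# Categorical column: mode() returns a Series, so take its first entry
filled["embarked"] = filled["embarked"].fillna(filled["embarked"].mode()[0])

print(filled)
```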


In [47]:
# Other options from scikit-learn (sklearn.impute):
# KNNImputer(n_neighbors=5)
# SimpleImputer
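
Removing duplicates, the other cleaning task named at the start of this section, can be sketched with `duplicated()` and `drop_duplicates()`. The frame and its column names below are made up for the example.

```python
import pandas as pd

# Hypothetical sample: row 2 repeats row 0 exactly
sample = pd.DataFrame({
    "name": ["Ann", "Bob", "Ann", "Cid"],
    "fare": [7.25, 8.05, 7.25, 7.25],
})

# duplicated() marks each row that repeats an earlier row in full
dup_mask = sample.duplicated()

# drop_duplicates() keeps the first occurrence of each row by default
deduped = sample.drop_duplicates()

# subset restricts the comparison to the listed columns
unique_fares = sample.drop_duplicates(subset=["fare"])

print(dup_mask)
print(deduped)
print(unique_fares)
```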

## Section 6: Groupby
In this section, we'll learn how to group data using the `groupby` function and perform aggregate operations.

In [48]:
df

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.2500,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.9250,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1000,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.0500,S,Third,man,True,,Southampton,no,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,0,2,male,27.0,0,0,13.0000,S,Second,man,True,,Southampton,no,True
887,1,1,female,19.0,0,0,30.0000,S,First,woman,False,B,Southampton,yes,True
888,0,3,female,,1,2,23.4500,S,Third,woman,False,,Southampton,no,False
889,1,1,male,26.0,0,0,30.0000,C,First,man,True,C,Cherbourg,yes,True


In [49]:
# Group by 'class' and then calculate mean only for numeric columns
grouped = df.groupby('class').mean(numeric_only=True)
# grouped
grouped['survived']

class
First     0.629630
Second    0.472826
Third     0.242363
Name: survived, dtype: float64

In [50]:
# Group by 'class' and 'who', and then calculate mean only for numeric columns
grouped_multi = df.groupby(['class', 'who']).mean(numeric_only=True)
grouped_multi['survived']

class   who  
First   child    0.833333
        man      0.352941
        woman    0.978022
Second  child    1.000000
        man      0.080808
        woman    0.909091
Third   child    0.431034
        man      0.119122
        woman    0.491228
Name: survived, dtype: float64
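
When different columns need different aggregations, `agg` with named aggregation computes them in one pass and names the output columns. A minimal sketch on made-up data; the output column names are illustrative.

```python
import pandas as pd

# Hypothetical sample using a few Titanic column names
sample = pd.DataFrame({
    "class": ["First", "First", "Third", "Third", "Third"],
    "survived": [1, 0, 0, 1, 0],
    "fare": [71.0, 53.0, 7.0, 8.0, 9.0],
})

summary = sample.groupby("class").agg(
    passengers=("survived", "size"),      # number of rows in each group
    survival_rate=("survived", "mean"),   # the mean of a 0/1 column is a rate
    median_fare=("fare", "median"),
)

print(summary)
```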

![Medallion Architecture](../images/Medallian_Architecture.png)

## Section 7: Merging & Concatenating DataFrames
In this section, we'll learn how to merge and concatenate DataFrames using `merge` and `concat` functions.

### Concatenation
Concatenation is used to combine DataFrames either along rows (vertically) or columns (horizontally).

#### Example 1: Concatenating DataFrames Vertically

In [51]:
import pandas as pd

# Create sample DataFrames
df1 = pd.DataFrame({
    'A': ['A0', 'A1', 'A2'],
    'B': ['B0', 'B1', 'B2']
})
df2 = pd.DataFrame({
    'A': ['A3', 'A4', 'A5'],
    'B': ['B3', 'B4', 'B5']
})

# Concatenate DataFrames vertically
df_concat_vert = pd.concat([df1, df2], axis=0)
print("Concatenated DataFrame (Vertically):")
print(df_concat_vert)

Concatenated DataFrame (Vertically):
    A   B
0  A0  B0
1  A1  B1
2  A2  B2
0  A3  B3
1  A4  B4
2  A5  B5
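
The output above keeps each input's original index, so the labels 0, 1, 2 appear twice. If the old labels carry no meaning, `ignore_index=True` renumbers the result. A short sketch with smaller made-up frames:

```python
import pandas as pd

df1 = pd.DataFrame({"A": ["A0", "A1"], "B": ["B0", "B1"]})
df2 = pd.DataFrame({"A": ["A2", "A3"], "B": ["B2", "B3"]})

# Original labels are kept, so the index contains duplicates
kept = pd.concat([df1, df2], axis=0)

# ignore_index=True discards the old labels and builds a fresh 0..n-1 index
renumbered = pd.concat([df1, df2], axis=0, ignore_index=True)

print(kept.index.tolist())
print(renumbered.index.tolist())
```

With duplicate labels, `kept.loc[0]` returns two rows, which is often a surprise further down a pipeline.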


#### Example 2: Concatenating DataFrames Horizontally

In [52]:
# Create sample DataFrames
df3 = pd.DataFrame({
    'C': ['C0', 'C1', 'C2'],
    'D': ['D0', 'D1', 'D2']
})

# Concatenate DataFrames horizontally
df_concat_horiz = pd.concat([df1, df3], axis=1)
print("\nConcatenated DataFrame (Horizontally):")
print(df_concat_horiz)


Concatenated DataFrame (Horizontally):
    A   B   C   D
0  A0  B0  C0  D0
1  A1  B1  C1  D1
2  A2  B2  C2  D2


### Merging
Merging combines DataFrames based on common columns or indices, similar to SQL joins.

#### Example 1: Inner Merge

In [53]:
# Create sample DataFrames
left = pd.DataFrame({
    'key': ['K0', 'K1', 'K2', 'K3'],
    'A': ['A0', 'A1', 'A2', 'A3'],
    'B': ['B0', 'B1', 'B2', 'B3']
})
right = pd.DataFrame({
    'key': ['K0', 'K1', 'K2', 'K4'],
    'C': ['C0', 'C1', 'C2', 'C3'],
    'D': ['D0', 'D1', 'D2', 'D3']
})

# Perform an inner merge
df_inner_merge = pd.merge(left, right, on='key', how='inner')
print("\nInner Merge DataFrame:")
print(df_inner_merge)


Inner Merge DataFrame:
  key   A   B   C   D
0  K0  A0  B0  C0  D0
1  K1  A1  B1  C1  D1
2  K2  A2  B2  C2  D2


#### Example 2: Outer Merge

In [54]:
# Perform an outer merge
df_outer_merge = pd.merge(left, right, on='key', how='outer')
print("\nOuter Merge DataFrame:")
print(df_outer_merge)


Outer Merge DataFrame:
  key    A    B    C    D
0  K0   A0   B0   C0   D0
1  K1   A1   B1   C1   D1
2  K2   A2   B2   C2   D2
3  K3   A3   B3  NaN  NaN
4  K4  NaN  NaN   C3   D3


#### Example 3: Left Merge

In [55]:
# Perform a left merge
df_left_merge = pd.merge(left, right, on='key', how='left')
print("\nLeft Merge DataFrame:")
print(df_left_merge)


Left Merge DataFrame:
  key   A   B    C    D
0  K0  A0  B0   C0   D0
1  K1  A1  B1   C1   D1
2  K2  A2  B2   C2   D2
3  K3  A3  B3  NaN  NaN


#### Example 4: Right Merge

In [56]:
# Perform a right merge
df_right_merge = pd.merge(left, right, on='key', how='right')
print("\nRight Merge DataFrame:")
print(df_right_merge)


Right Merge DataFrame:
  key    A    B   C   D
0  K0   A0   B0  C0  D0
1  K1   A1   B1  C1  D1
2  K2   A2   B2  C2  D2
3  K4  NaN  NaN  C3  D3
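
To see which side each row of a merge came from, `merge` accepts `indicator=True`, which adds a `_merge` column. The sketch below uses smaller made-up frames than the cells above.

```python
import pandas as pd

left = pd.DataFrame({"key": ["K0", "K1", "K3"], "A": ["A0", "A1", "A3"]})
right = pd.DataFrame({"key": ["K0", "K1", "K4"], "C": ["C0", "C1", "C4"]})

# indicator=True adds a '_merge' column with 'both', 'left_only', or 'right_only'
merged = pd.merge(left, right, on="key", how="outer", indicator=True)
print(merged)

# Keys that appear in only one of the two inputs
unmatched = merged.loc[merged["_merge"] != "both", "key"].tolist()
print(unmatched)
```

Checking the unmatched keys before training is a cheap way to catch rows silently lost to an inner merge.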


### Joining

`DataFrame.join` is a convenience method for combining DataFrames on their indices rather than on a column.

#### Example 1: Join DataFrames

In [57]:
# Move 'key' from a column into the index of each DataFrame
left = left.set_index('key')
right = right.set_index('key')

# Perform a join
df_join = left.join(right, how='inner')
print("\nJoined DataFrame:")
print(df_join)


Joined DataFrame:
      A   B   C   D
key                
K0   A0  B0  C0  D0
K1   A1  B1  C1  D1
K2   A2  B2  C2  D2
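
The cell above passes `how='inner'` explicitly. Without it, `join` performs a left join and keeps every row of the calling frame. The sketch below, on smaller made-up frames, shows that default next to the equivalent index-on-index `merge` call.

```python
import pandas as pd

left = pd.DataFrame({"A": ["A0", "A1", "A3"]}, index=["K0", "K1", "K3"])
right = pd.DataFrame({"C": ["C0", "C1", "C4"]}, index=["K0", "K1", "K4"])

# join defaults to how='left': K3 is kept and gets NaN for column C
via_join = left.join(right)

# The same result written as a merge on both indices
via_merge = pd.merge(left, right, left_index=True, right_index=True, how="left")

print(via_join)
print(via_join.equals(via_merge))
```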
