# [Database Fundamentals for Data Processing](#)

Databases are fundamental to modern data processing and management. They provide a structured way to store, retrieve, and manipulate large amounts of data efficiently. Understanding database concepts will help you better grasp how Pandas operates and why certain operations are structured the way they are.


A **database** is an organized collection of data stored and accessed electronically. It's designed to efficiently manage, store, and retrieve information. Databases can range from simple collections stored in a single file to large, complex systems distributed across multiple servers.


<img src="../images/database.png" width="800">

Key characteristics of databases include:

- **Data integrity**: Ensuring the accuracy and consistency of data over its lifecycle.
- **Data security**: Controlling access to data and protecting it from unauthorized use.
- **Data independence**: The ability to modify the database structure without affecting the programs that use it.
- **Concurrent access**: Allowing multiple users to access and modify data simultaneously.


Databases come in many varieties. Three categories you are likely to encounter are:

1. **Relational Databases**:
   - Use tables to store data, with rows representing records and columns representing fields.
   - Employ Structured Query Language (SQL) for managing and querying data.
   - Examples: MySQL, PostgreSQL, Oracle, Microsoft SQL Server.

2. **NoSQL Databases**:
   - Designed for specific data models with flexible schemas.
   - Types include document, key-value, wide-column, and graph databases.
   - Examples: MongoDB (document), Redis (key-value), Cassandra (wide-column), Neo4j (graph).

3. **Time-Series Databases**:
   - Optimized for handling time-series data, such as stock prices, sensor data, and server logs.
   - Examples: InfluxDB, Prometheus, TimescaleDB.

<img src="../images/types-of-databases.png" width="800">

For this lecture, we'll focus primarily on concepts related to relational databases, as they align closely with Pandas' DataFrame structure.


Databases play a crucial role in data processing for several reasons:

1. **Efficient data storage and retrieval**: Databases are optimized for handling large volumes of data quickly.

2. **Data integrity and consistency**: They enforce rules and constraints to maintain data quality.

3. **Concurrent access**: Multiple users or applications can work with the same data simultaneously.

4. **Scalability**: Databases can grow to accommodate increasing amounts of data and users.

5. **Security**: They provide mechanisms to control access to sensitive information.

6. **Data relationships**: Relational databases can represent complex relationships between different data entities.


Understanding database concepts will help you:
- Organize your data more effectively in Pandas
- Perform complex data manipulations and analyses
- Understand the logic behind operations like merging, joining, and aggregating in Pandas
- Work more efficiently with large datasets


Let's create a simple example to illustrate how data might be structured in a database-like format using Pandas:


In [1]:
import pandas as pd

In [2]:
# Creating DataFrames to represent database tables
customers = pd.DataFrame({
    'customer_id': [1, 2, 3, 4],
    'name': ['John Doe', 'Jane Smith', 'Bob Johnson', 'Alice Brown'],
    'email': ['john@example.com', 'jane@example.com', 'bob@example.com', 'alice@example.com']
})

In [3]:
orders = pd.DataFrame({
    'order_id': [101, 102, 103, 104, 105],
    'customer_id': [1, 2, 1, 3, 4],
    'product': ['Widget A', 'Widget B', 'Widget C', 'Widget A', 'Widget B'],
    'quantity': [2, 1, 3, 1, 2],
    'order_date': ['2023-05-01', '2023-05-02', '2023-05-03', '2023-05-04', '2023-05-05']
})

In [4]:
print("Customers table:")
customers

Customers table:


Unnamed: 0,customer_id,name,email
0,1,John Doe,john@example.com
1,2,Jane Smith,jane@example.com
2,3,Bob Johnson,bob@example.com
3,4,Alice Brown,alice@example.com


In [5]:
print("\nOrders table:")
orders


Orders table:


Unnamed: 0,order_id,customer_id,product,quantity,order_date
0,101,1,Widget A,2,2023-05-01
1,102,2,Widget B,1,2023-05-02
2,103,1,Widget C,3,2023-05-03
3,104,3,Widget A,1,2023-05-04
4,105,4,Widget B,2,2023-05-05


In this example, we've created two DataFrames that represent tables you might find in a simple e-commerce database. The `customers` table contains information about customers, while the `orders` table contains information about orders placed by these customers.
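
To see how the same rows look inside an actual relational database, one option is to round-trip a DataFrame through SQLite. The sketch below assumes Python's built-in `sqlite3` module and an in-memory database; the SQL table name `customers` and the variable `from_db` are illustrative choices, not part of the lecture's own code.

```python
import sqlite3

import pandas as pd

# Same sample rows as the customers DataFrame above.
customers = pd.DataFrame({
    'customer_id': [1, 2, 3, 4],
    'name': ['John Doe', 'Jane Smith', 'Bob Johnson', 'Alice Brown'],
    'email': ['john@example.com', 'jane@example.com', 'bob@example.com', 'alice@example.com']
})

# An in-memory SQLite database: nothing is written to disk.
conn = sqlite3.connect(':memory:')

# Write the DataFrame as a SQL table; index=False skips the pandas row index.
customers.to_sql('customers', conn, index=False)

# Read the table back with a SQL query.
from_db = pd.read_sql_query('SELECT * FROM customers ORDER BY customer_id', conn)
conn.close()

print(from_db)
```

With a connection like this, the SQL snippets later in the lecture could be run against real tables and compared with their Pandas counterparts.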


As we progress through this lecture, we'll explore how to work with this data in ways that mimic database operations, setting the stage for more advanced Pandas operations in subsequent lectures.

## <a id='toc1_'></a>[Relational Database Concepts](#toc0_)

Relational databases are the most common type of database used in data processing. They organize data into tables with predefined relationships between them. Understanding these concepts will help you work more effectively with structured data in Pandas.


### <a id='toc1_1_'></a>[Tables, Rows, and Columns](#toc0_)


In a relational database:

- **Tables** (also called relations) represent entities or concepts (e.g., customers, orders).
- **Rows** (or records) represent individual instances of that entity.
- **Columns** (or fields) represent attributes of the entity.


<img src="../images/table-row-column.png" width="800">

Let's visualize this with our previous example:


In [6]:
print("Customers table:")
customers

Customers table:


Unnamed: 0,customer_id,name,email
0,1,John Doe,john@example.com
1,2,Jane Smith,jane@example.com
2,3,Bob Johnson,bob@example.com
3,4,Alice Brown,alice@example.com


In [7]:
print("\nOrders table:")
orders


Orders table:


Unnamed: 0,order_id,customer_id,product,quantity,order_date
0,101,1,Widget A,2,2023-05-01
1,102,2,Widget B,1,2023-05-02
2,103,1,Widget C,3,2023-05-03
3,104,3,Widget A,1,2023-05-04
4,105,4,Widget B,2,2023-05-05


Here, `customers` and `orders` are tables. Each row in `customers` represents a unique customer, and each row in `orders` represents a unique order. The columns represent attributes like name, email, product, etc.


### <a id='toc1_2_'></a>[Primary Keys and Foreign Keys](#toc0_)


- **Primary Key**: A column (or set of columns) that uniquely identifies each row in a table.
- **Foreign Key**: A column (or set of columns) that refers to the primary key of another table, establishing a relationship between the two tables.


<img src="../images/primary-key.png" width="800">

<img src="../images/foreign-key.png" width="800">

In our example:
- `customer_id` is the primary key in the `customers` table.
- In the `orders` table, `order_id` is the primary key, and `customer_id` is a foreign key referencing the `customers` table.
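
A database enforces these key rules for you; Pandas does not, so it can be worth checking them by hand. The following is a minimal sketch of such checks on trimmed-down copies of the sample tables. The variable names `pk_ok` and `fk_ok` are hypothetical.

```python
import pandas as pd

# Only the key columns of the sample tables are needed here.
customers = pd.DataFrame({'customer_id': [1, 2, 3, 4]})
orders = pd.DataFrame({
    'order_id': [101, 102, 103, 104, 105],
    'customer_id': [1, 2, 1, 3, 4],
})

# Primary-key check: values must be unique and non-null.
pk_ok = bool(customers['customer_id'].is_unique
             and customers['customer_id'].notna().all())

# Foreign-key check: every orders.customer_id must exist in customers.
fk_ok = bool(orders['customer_id'].isin(customers['customer_id']).all())

print("Primary key valid:", pk_ok)
print("Foreign key valid:", fk_ok)
```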


### <a id='toc1_3_'></a>[Relationships Between Tables](#toc0_)


There are three main types of relationships between tables:

1. **One-to-One (1:1)**: Each record in Table A is related to exactly one record in Table B, and vice versa.

2. **One-to-Many (1:N)**: Each record in Table A can be related to multiple records in Table B, but each record in Table B is related to only one record in Table A.

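3. **Many-to-Many (M:N)**: Multiple records in Table A can be related to multiple records in Table B, and vice versa. In a relational database this is usually implemented with a third (junction) table that holds pairs of foreign keys.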


<img src="../images/many-to-many.png" width="800">

<img src="../images/one-to-many.png" width="800">

<img src="../images/one-to-one.png" width="800">

In our example, we have a One-to-Many relationship between `customers` and `orders`. One customer can have many orders, but each order belongs to only one customer.
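
Pandas can verify the kind of relationship while merging, through the `validate` parameter of `pd.merge`. The sketch below uses shortened copies of the sample tables; the flag `is_one_to_one` is a hypothetical name used only for illustration.

```python
import pandas as pd

customers = pd.DataFrame({
    'customer_id': [1, 2, 3, 4],
    'name': ['John Doe', 'Jane Smith', 'Bob Johnson', 'Alice Brown'],
})
orders = pd.DataFrame({
    'order_id': [101, 102, 103, 104, 105],
    'customer_id': [1, 2, 1, 3, 4],
})

# Passes: customer_id is unique in customers (the "one" side).
merged = pd.merge(customers, orders, on='customer_id', validate='one_to_many')

# Raises MergeError: customer 1 appears twice in orders,
# so the relationship is not one-to-one.
try:
    pd.merge(customers, orders, on='customer_id', validate='one_to_one')
    is_one_to_one = True
except pd.errors.MergeError:
    is_one_to_one = False

print(len(merged), is_one_to_one)
```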


Let's demonstrate these concepts with some Pandas operations:


In [8]:
# Displaying the primary key (customer_id) of the customers table
print("Primary Key of customers table:")
customers['customer_id']

Primary Key of customers table:


0    1
1    2
2    3
3    4
Name: customer_id, dtype: int64

In [9]:
# Showing the foreign key (customer_id) in the orders table
print("\nForeign Key in orders table:")
orders['customer_id']


Foreign Key in orders table:


0    1
1    2
2    1
3    3
4    4
Name: customer_id, dtype: int64

In [10]:
# Demonstrating the One-to-Many relationship
# Count of orders per customer
order_counts = orders['customer_id'].value_counts().sort_index()
print("\nNumber of orders per customer:")
order_counts


Number of orders per customer:


customer_id
1    2
2    1
3    1
4    1
Name: count, dtype: int64

In [11]:
# Joining tables based on the relationship
merged_data = pd.merge(customers, orders, on='customer_id')
print("\nMerged data (customers and their orders):")
merged_data


Merged data (customers and their orders):


Unnamed: 0,customer_id,name,email,order_id,product,quantity,order_date
0,1,John Doe,john@example.com,101,Widget A,2,2023-05-01
1,1,John Doe,john@example.com,103,Widget C,3,2023-05-03
2,2,Jane Smith,jane@example.com,102,Widget B,1,2023-05-02
3,3,Bob Johnson,bob@example.com,104,Widget A,1,2023-05-04
4,4,Alice Brown,alice@example.com,105,Widget B,2,2023-05-05


In this example:
- We show the primary key of the `customers` table and the corresponding foreign key in the `orders` table.
- We demonstrate the One-to-Many relationship by counting orders per customer.
- We perform a merge operation, which is similar to a JOIN in SQL, to combine data from both tables based on their relationship.


Understanding these relational database concepts is crucial because:

1. It helps in organizing data efficiently in Pandas DataFrames.
2. Many Pandas operations (like merging, grouping, and reshaping) are based on these database concepts.
3. When working with real databases, you'll be able to translate SQL operations to Pandas operations more easily.


As we progress through more advanced Pandas operations in future lectures, you'll see how these database concepts underpin many of the data manipulation techniques we'll explore.

## <a id='toc2_'></a>[Basic SQL Operations](#toc0_)

While we're focusing on Pandas, understanding basic SQL operations is valuable because many Pandas functions have direct parallels in SQL. This knowledge will help you translate database operations to Pandas and vice versa.


Although a SQL query is written starting with SELECT, its clauses are logically evaluated in the following order:

1. **FROM**: Specifies the table(s) from which to retrieve data.
2. **JOIN**: Combines data from multiple tables based on a related column.
3. **WHERE**: Filters rows based on a condition.
4. **GROUP BY**: Groups rows that have the same values into summary rows.
5. **HAVING**: Filters groups based on a condition.
6. **SELECT**: Specifies the columns to retrieve.
7. **ORDER BY**: Sorts the result set by one or more columns.
8. **LIMIT**: Limits the number of rows returned.

Keep in mind that this is the logical order. Before running a query, the SQL engine builds an execution plan based on the query structure, the available indexes, and other factors, and it may physically reorder steps as long as the result stays the same. The engine then executes the plan and returns the result set.
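
A Pandas method chain is usually written in roughly this logical order. As a sketch, the query in the comment below (an illustrative query, not one from the lecture) can be mirrored step by step; the column name `total_quantity` is an assumption made for the example.

```python
import pandas as pd

orders = pd.DataFrame({
    'order_id': [101, 102, 103, 104, 105],
    'customer_id': [1, 2, 1, 3, 4],
    'quantity': [2, 1, 3, 1, 2],
    'order_date': ['2023-05-01', '2023-05-02', '2023-05-03', '2023-05-04', '2023-05-05']
})

# SELECT customer_id, SUM(quantity) AS total_quantity
# FROM orders
# WHERE order_date >= '2023-05-02'
# GROUP BY customer_id
# HAVING SUM(quantity) > 1
# ORDER BY total_quantity DESC
# LIMIT 2;
result = (
    orders                                            # FROM
    .loc[orders['order_date'] >= '2023-05-02']        # WHERE (ISO date strings compare correctly)
    .groupby('customer_id', as_index=False)           # GROUP BY
    .agg(total_quantity=('quantity', 'sum'))          # aggregate used by SELECT and HAVING
    .query('total_quantity > 1')                      # HAVING
    .sort_values('total_quantity', ascending=False)   # ORDER BY
    .head(2)                                          # LIMIT
)

print(result)
```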

<img src="../images/sql-query-execution.png" width="500">

Let's explore some fundamental SQL operations and their Pandas equivalents using our sample data.


### <a id='toc2_1_'></a>[SELECT Statements](#toc0_)


In SQL, the SELECT statement is used to retrieve data from one or more tables.


SQL:
```sql
SELECT * FROM customers;
```


Pandas equivalent:

In [12]:
# Selecting all columns from customers
customers

Unnamed: 0,customer_id,name,email
0,1,John Doe,john@example.com
1,2,Jane Smith,jane@example.com
2,3,Bob Johnson,bob@example.com
3,4,Alice Brown,alice@example.com


In [13]:
# Selecting specific columns
customers[['name', 'email']]

Unnamed: 0,name,email
0,John Doe,john@example.com
1,Jane Smith,jane@example.com
2,Bob Johnson,bob@example.com
3,Alice Brown,alice@example.com


### <a id='toc2_2_'></a>[Filtering with WHERE](#toc0_)


The WHERE clause in SQL is used to filter rows based on specified conditions.


SQL:

```sql
SELECT * FROM orders WHERE quantity > 1;
```


Pandas equivalent:

In [14]:
# Filtering orders with quantity greater than 1
orders[orders['quantity'] > 1]

Unnamed: 0,order_id,customer_id,product,quantity,order_date
0,101,1,Widget A,2,2023-05-01
2,103,1,Widget C,3,2023-05-03
4,105,4,Widget B,2,2023-05-05
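
WHERE clauses often combine several conditions with AND or OR. In Pandas, boolean masks are combined with `&` and `|` (each condition in parentheses), or the condition can be written as a string for `DataFrame.query`. A small sketch on the sample orders, with `both` and `either` as illustrative variable names:

```python
import pandas as pd

orders = pd.DataFrame({
    'order_id': [101, 102, 103, 104, 105],
    'customer_id': [1, 2, 1, 3, 4],
    'product': ['Widget A', 'Widget B', 'Widget C', 'Widget A', 'Widget B'],
    'quantity': [2, 1, 3, 1, 2],
})

# SELECT * FROM orders WHERE quantity > 1 AND product = 'Widget A';
both = orders[(orders['quantity'] > 1) & (orders['product'] == 'Widget A')]

# SELECT * FROM orders WHERE quantity > 2 OR product = 'Widget B';
either = orders.query("quantity > 2 or product == 'Widget B'")

print(both)
print(either)
```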


### <a id='toc2_3_'></a>[Sorting with ORDER BY](#toc0_)


ORDER BY is used to sort the result set in ascending or descending order.


SQL:

```sql
SELECT * FROM orders ORDER BY quantity DESC;
```


Pandas equivalent:

In [15]:
# Sorting orders by quantity in descending order
orders.sort_values('quantity', ascending=False)

Unnamed: 0,order_id,customer_id,product,quantity,order_date
2,103,1,Widget C,3,2023-05-03
0,101,1,Widget A,2,2023-05-01
4,105,4,Widget B,2,2023-05-05
1,102,2,Widget B,1,2023-05-02
3,104,3,Widget A,1,2023-05-04
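
When several rows share the same sort value, as orders 101 and 105 do here, their relative order is not guaranteed unless a second sort key is given. The sketch below sorts by `quantity` descending and breaks ties by `order_id` ascending, the equivalent of `ORDER BY quantity DESC, order_id ASC`.

```python
import pandas as pd

orders = pd.DataFrame({
    'order_id': [101, 102, 103, 104, 105],
    'quantity': [2, 1, 3, 1, 2],
})

# One entry in `ascending` per sort column.
sorted_orders = orders.sort_values(['quantity', 'order_id'], ascending=[False, True])

print(sorted_orders)
```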


### <a id='toc2_4_'></a>[Limiting Results with LIMIT](#toc0_)


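The LIMIT clause specifies the maximum number of rows to return. Without an ORDER BY, the database does not guarantee which rows those will be. LIMIT is supported by MySQL, PostgreSQL, and SQLite; some other systems use a different syntax, such as TOP in Microsoft SQL Server.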


SQL:

```sql
SELECT * FROM orders LIMIT 3;
```


Pandas equivalent:

In [16]:
# Selecting the first 3 rows from orders
orders.head(3)

Unnamed: 0,order_id,customer_id,product,quantity,order_date
0,101,1,Widget A,2,2023-05-01
1,102,2,Widget B,1,2023-05-02
2,103,1,Widget C,3,2023-05-03
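
Two common variations are paging with an offset and taking the top rows by some column. The following sketch shows one way to express each in Pandas; `page` and `top2` are illustrative names.

```python
import pandas as pd

orders = pd.DataFrame({
    'order_id': [101, 102, 103, 104, 105],
    'quantity': [2, 1, 3, 1, 2],
})

# SELECT * FROM orders ORDER BY order_id LIMIT 2 OFFSET 2;
offset, limit = 2, 2
page = orders.sort_values('order_id').iloc[offset:offset + limit]

# SELECT * FROM orders ORDER BY quantity DESC LIMIT 2;
# nlargest keeps the first occurrence when quantities tie.
top2 = orders.nlargest(2, 'quantity')

print(page)
print(top2)
```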


### <a id='toc2_5_'></a>[Combining Operations](#toc0_)


In practice, these operations are often combined. Let's look at a more complex example:


SQL:

```sql
SELECT customer_id, product, quantity
FROM orders
WHERE quantity > 1
ORDER BY quantity DESC
LIMIT 2;
```


Pandas equivalent:

In [17]:
# Combining filtering, sorting, and limiting
(
    orders[orders['quantity'] > 1]
    .sort_values('quantity', ascending=False)
    [['customer_id', 'product', 'quantity']]
    .head(2)
)

Unnamed: 0,customer_id,product,quantity
2,1,Widget C,3
0,1,Widget A,2


### <a id='toc2_6_'></a>[Aggregate Functions](#toc0_)


SQL provides several aggregate functions like COUNT, SUM, AVG, MAX, and MIN.


SQL:

```sql
SELECT COUNT(*) as order_count, SUM(quantity) as total_quantity
FROM orders;
```


Pandas equivalent:

In [18]:
# Calculating count of orders and sum of quantities
{
    'order_count': len(orders),
    'total_quantity': orders['quantity'].sum()
}

{'order_count': 5, 'total_quantity': 9}
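
Aggregates are often combined with DISTINCT. As a brief sketch, `COUNT(DISTINCT customer_id)` corresponds to `Series.nunique()`, and `SELECT DISTINCT product` corresponds to `drop_duplicates()`:

```python
import pandas as pd

orders = pd.DataFrame({
    'order_id': [101, 102, 103, 104, 105],
    'customer_id': [1, 2, 1, 3, 4],
    'product': ['Widget A', 'Widget B', 'Widget C', 'Widget A', 'Widget B'],
})

# SELECT COUNT(DISTINCT customer_id) FROM orders;
distinct_customers = orders['customer_id'].nunique()

# SELECT DISTINCT product FROM orders;
distinct_products = orders['product'].drop_duplicates().tolist()

print(distinct_customers)
print(distinct_products)
```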

### <a id='toc2_7_'></a>[GROUP BY](#toc0_)


The GROUP BY clause is used with aggregate functions to group the result set by one or more columns.


SQL:

```sql
SELECT customer_id, COUNT(*) as order_count
FROM orders
GROUP BY customer_id;
```


Pandas equivalent:

In [19]:
# Grouping orders by customer and counting
orders.groupby('customer_id').size().reset_index(name='order_count')

Unnamed: 0,customer_id,order_count
0,1,2
1,2,1
2,3,1
3,4,1


### <a id='toc2_8_'></a>[HAVING](#toc0_)


The HAVING clause is used to filter the results of GROUP BY based on a specified condition.


SQL:

```sql
SELECT customer_id, COUNT(*) as order_count
FROM orders
GROUP BY customer_id
HAVING COUNT(*) > 1;
```


Pandas equivalent:

In [20]:
# Grouping by customer, counting orders, and filtering groups with more than 1 order
(
    orders.groupby('customer_id')
    .size()
    .reset_index(name='order_count')
    .query('order_count > 1')
)

Unnamed: 0,customer_id,order_count
0,1,2
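
HAVING returns one summary row per group. If you instead want the original rows that belong to the qualifying groups, `GroupBy.filter` is one option. A minimal sketch, with `repeat_customer_orders` as an illustrative name:

```python
import pandas as pd

orders = pd.DataFrame({
    'order_id': [101, 102, 103, 104, 105],
    'customer_id': [1, 2, 1, 3, 4],
    'quantity': [2, 1, 3, 1, 2],
})

# Keep every order row whose customer has more than one order.
repeat_customer_orders = orders.groupby('customer_id').filter(lambda group: len(group) > 1)

print(repeat_customer_orders)
```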


Understanding these SQL operations and their Pandas equivalents will help you:

1. Translate SQL queries to Pandas operations when working with data.
2. Understand the logic behind many Pandas functions and methods.
3. Efficiently manipulate and analyze data in Pandas DataFrames.


As we progress to more advanced Pandas operations in the upcoming lectures, you'll see how these basic SQL concepts form the foundation for complex data manipulations in Pandas.

## <a id='toc3_'></a>[Joining Tables](#toc0_)

Joining tables is a fundamental operation in relational databases and data processing. It allows you to combine data from multiple tables based on related columns. Understanding joins is crucial for working with complex datasets in Pandas.


### <a id='toc3_1_'></a>[Types of Joins](#toc0_)


There are four main types of joins:

1. **INNER JOIN**: Returns only the rows that have matching values in both tables.
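2. **LEFT JOIN**: Returns all rows from the left table, plus the matching rows from the right table. Left rows without a match get NULL (NaN in Pandas) in the right table's columns.
3. **RIGHT JOIN**: Returns all rows from the right table, plus the matching rows from the left table. Right rows without a match get NULL (NaN in Pandas) in the left table's columns.
4. **FULL OUTER JOIN**: Returns all rows from both tables, matching them where possible and filling in NULL (NaN in Pandas) where one side has no match.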


<img src="../images/sql-joins.png" width="800">

Let's demonstrate these joins using our `customers` and `orders` DataFrames:


In [21]:
# Reminder of our data
print("Customers:")
customers

Customers:


Unnamed: 0,customer_id,name,email
0,1,John Doe,john@example.com
1,2,Jane Smith,jane@example.com
2,3,Bob Johnson,bob@example.com
3,4,Alice Brown,alice@example.com


In [22]:
print("\nOrders:")
orders


Orders:


Unnamed: 0,order_id,customer_id,product,quantity,order_date
0,101,1,Widget A,2,2023-05-01
1,102,2,Widget B,1,2023-05-02
2,103,1,Widget C,3,2023-05-03
3,104,3,Widget A,1,2023-05-04
4,105,4,Widget B,2,2023-05-05


In [23]:
# Adding a customer with no orders and an order with no matching customer
customers.loc[len(customers)] = {'customer_id': 5, 'name': 'Eva Green', 'email': 'eva@example.com'}
orders.loc[len(orders)] = {'order_id': 106, 'customer_id': 6, 'product': 'Widget D', 'quantity': 1, 'order_date': '2023-05-06'}

### <a id='toc3_2_'></a>[INNER JOIN](#toc0_)


SQL:

```sql
SELECT *
FROM customers
INNER JOIN orders ON customers.customer_id = orders.customer_id;
```


Pandas equivalent:

In [25]:
# Inner join
print("Inner Join Result:")
pd.merge(customers, orders, on='customer_id', how='inner')

Inner Join Result:


Unnamed: 0,customer_id,name,email,order_id,product,quantity,order_date
0,1,John Doe,john@example.com,101,Widget A,2,2023-05-01
1,1,John Doe,john@example.com,103,Widget C,3,2023-05-03
2,2,Jane Smith,jane@example.com,102,Widget B,1,2023-05-02
3,3,Bob Johnson,bob@example.com,104,Widget A,1,2023-05-04
4,4,Alice Brown,alice@example.com,105,Widget B,2,2023-05-05


The inner join returns only the rows where there's a match in both tables. Customers without orders (Eva Green, customer 5) and orders without a matching customer (order 106, customer 6) are excluded.


### <a id='toc3_3_'></a>[LEFT JOIN](#toc0_)


SQL:

```sql
SELECT *
FROM customers
LEFT JOIN orders ON customers.customer_id = orders.customer_id;
```


Pandas equivalent:

In [26]:
# Left join
print("Left Join Result:")
pd.merge(customers, orders, on='customer_id', how='left')

Left Join Result:


Unnamed: 0,customer_id,name,email,order_id,product,quantity,order_date
0,1,John Doe,john@example.com,101.0,Widget A,2.0,2023-05-01
1,1,John Doe,john@example.com,103.0,Widget C,3.0,2023-05-03
2,2,Jane Smith,jane@example.com,102.0,Widget B,1.0,2023-05-02
3,3,Bob Johnson,bob@example.com,104.0,Widget A,1.0,2023-05-04
4,4,Alice Brown,alice@example.com,105.0,Widget B,2.0,2023-05-05
5,5,Eva Green,eva@example.com,,,,


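The left join returns all rows from the left table (customers) and the matched rows from the right table (orders). Customers without orders will have NaN values for order columns. Note that `order_id` and `quantity` now display as floats (`101.0`, `2.0`): a regular integer column cannot hold NaN, so Pandas converts it to float.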
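
If integer columns should stay integers after a left join, one option is Pandas' nullable `Int64` dtype, which represents missing values as `<NA>` instead of NaN. The sketch below uses a cut-down version of the sample data; `orders_nullable` is a hypothetical name.

```python
import pandas as pd

customers = pd.DataFrame({'customer_id': [4, 5], 'name': ['Alice Brown', 'Eva Green']})
orders = pd.DataFrame({'order_id': [105], 'customer_id': [4], 'quantity': [2]})

# Default behavior: unmatched rows force int64 columns to float64.
default_left = pd.merge(customers, orders, on='customer_id', how='left')

# Convert the integer columns to the nullable Int64 dtype before merging.
orders_nullable = orders.astype({'order_id': 'Int64', 'quantity': 'Int64'})
nullable_left = pd.merge(customers, orders_nullable, on='customer_id', how='left')

print(default_left.dtypes)
print(nullable_left.dtypes)
```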


### <a id='toc3_4_'></a>[RIGHT JOIN](#toc0_)


SQL:

```sql
SELECT *
FROM customers
RIGHT JOIN orders ON customers.customer_id = orders.customer_id;
```


Pandas equivalent:

In [27]:
# Right join
print("Right Join Result:")
pd.merge(customers, orders, on='customer_id', how='right')

Right Join Result:


Unnamed: 0,customer_id,name,email,order_id,product,quantity,order_date
0,1,John Doe,john@example.com,101,Widget A,2,2023-05-01
1,2,Jane Smith,jane@example.com,102,Widget B,1,2023-05-02
2,1,John Doe,john@example.com,103,Widget C,3,2023-05-03
3,3,Bob Johnson,bob@example.com,104,Widget A,1,2023-05-04
4,4,Alice Brown,alice@example.com,105,Widget B,2,2023-05-05
5,6,,,106,Widget D,1,2023-05-06


The right join returns all rows from the right table (orders) and the matched rows from the left table (customers). Orders without matching customers will have NaN values for customer columns.


### <a id='toc3_5_'></a>[FULL OUTER JOIN](#toc0_)


SQL:

```sql
SELECT *
FROM customers
FULL OUTER JOIN orders ON customers.customer_id = orders.customer_id;
```


Pandas equivalent:

In [28]:
# Full outer join
print("Full Outer Join Result:")
pd.merge(customers, orders, on='customer_id', how='outer')

Full Outer Join Result:


Unnamed: 0,customer_id,name,email,order_id,product,quantity,order_date
0,1,John Doe,john@example.com,101.0,Widget A,2.0,2023-05-01
1,1,John Doe,john@example.com,103.0,Widget C,3.0,2023-05-03
2,2,Jane Smith,jane@example.com,102.0,Widget B,1.0,2023-05-02
3,3,Bob Johnson,bob@example.com,104.0,Widget A,1.0,2023-05-04
4,4,Alice Brown,alice@example.com,105.0,Widget B,2.0,2023-05-05
5,5,Eva Green,eva@example.com,,,,
6,6,,,106.0,Widget D,1.0,2023-05-06


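The full outer join returns all rows from both tables. Matched rows are combined, while the customer without orders (Eva Green) and the order without a matching customer (order 106) still appear, with NaN in the columns that come from the other table.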
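
To see where each row of a join came from, `pd.merge` accepts `indicator=True`, which adds a `_merge` column with the values `both`, `left_only`, or `right_only`. A sketch using the key columns of the extended sample data; `no_orders` and `orphan_orders` are illustrative names.

```python
import pandas as pd

customers = pd.DataFrame({
    'customer_id': [1, 2, 3, 4, 5],
    'name': ['John Doe', 'Jane Smith', 'Bob Johnson', 'Alice Brown', 'Eva Green'],
})
orders = pd.DataFrame({
    'order_id': [101, 102, 103, 104, 105, 106],
    'customer_id': [1, 2, 1, 3, 4, 6],
})

outer = pd.merge(customers, orders, on='customer_id', how='outer', indicator=True)

# Customers that never ordered (present only in the left table).
no_orders = outer.loc[outer['_merge'] == 'left_only', 'customer_id'].tolist()

# Orders whose customer is unknown (present only in the right table).
orphan_orders = outer.loc[outer['_merge'] == 'right_only', 'order_id'].tolist()

print(no_orders, orphan_orders)
```

Filtering on `left_only` like this is a common way to express an anti-join, which SQL would write with `NOT EXISTS` or a `LEFT JOIN ... WHERE ... IS NULL`.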


Understanding these join operations is crucial because:

1. They allow you to combine data from different sources or tables.
2. They're essential for analyzing relationships between different entities in your data.
3. Many real-world data analysis tasks require joining data from multiple sources.


In the upcoming lectures on merging, joining, and concatenating DataFrames, we'll explore these concepts in more depth and learn about additional Pandas functions for combining data. The foundation we've built here with SQL-like joins will make those advanced operations more intuitive and easier to grasp.

## <a id='toc4_'></a>[Aggregation and Grouping](#toc0_)

Aggregation and grouping are powerful techniques in data analysis that allow you to summarize and compute statistics on groups of data. These operations are fundamental in both SQL and Pandas, enabling you to extract meaningful insights from your datasets.


### <a id='toc4_1_'></a>[Aggregate Functions](#toc0_)


Common aggregate functions include:

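- COUNT: Counts rows (`COUNT(*)`) or non-null values in a column (`COUNT(column)`); in Pandas, `len()` counts rows while `.count()` counts non-null values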
- SUM: Calculates the sum of values
- AVG (MEAN): Calculates the average of values
- MAX: Finds the maximum value
- MIN: Finds the minimum value


Let's start with some basic aggregations on our `orders` DataFrame:


In [29]:
# Basic aggregations
print("Total number of orders:", len(orders))
print("Total quantity ordered:", orders['quantity'].sum())
print("Average quantity per order:", orders['quantity'].mean())
print("Maximum quantity in an order:", orders['quantity'].max())
print("Minimum quantity in an order:", orders['quantity'].min())

# Multiple aggregations at once
order_stats = orders['quantity'].agg(['count', 'sum', 'mean', 'max', 'min'])
print("\nOrder Statistics:")
order_stats

Total number of orders: 6
Total quantity ordered: 10
Average quantity per order: 1.6666666666666667
Maximum quantity in an order: 3
Minimum quantity in an order: 1

Order Statistics:


count     6.000000
sum      10.000000
mean      1.666667
max       3.000000
min       1.000000
Name: quantity, dtype: float64
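
The sample data has no missing values, so the difference between counting rows and counting non-null values does not show up above. The sketch below uses a small made-up Series with one missing entry (hypothetical data, not from the lecture) to make it visible.

```python
import numpy as np
import pandas as pd

# Hypothetical quantities with one missing value.
quantities = pd.Series([2, 1, np.nan, 3], name='quantity')

row_count = len(quantities)          # like COUNT(*): counts every row
non_null_count = quantities.count()  # like COUNT(quantity): skips missing values
mean_quantity = quantities.mean()    # like AVG(quantity): missing values are ignored

print(row_count, non_null_count, mean_quantity)
```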

### <a id='toc4_2_'></a>[GROUP BY Clause](#toc0_)


The GROUP BY clause is used to group rows that have the same values in specified columns. It's often used with aggregate functions to perform calculations on each group.


SQL equivalent:

```sql
SELECT customer_id, COUNT(*) as order_count, SUM(quantity) as total_quantity
FROM orders
GROUP BY customer_id;
```


Pandas implementation:


In [30]:
# Group by customer_id and calculate order count and total quantity
customer_orders = orders.groupby('customer_id').agg({
    'order_id': 'count',
    'quantity': 'sum'
}).rename(columns={'order_id': 'order_count', 'quantity': 'total_quantity'})

print("Customer Order Summary:")
customer_orders

Customer Order Summary:


customer_id,order_count,total_quantity
1,2,5
2,1,1
3,1,1
4,1,2
6,1,1
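
An alternative to aggregating and then renaming is named aggregation, where each output column is declared as `new_name=(source_column, function)`. A sketch on the key columns of the six-row `orders` table:

```python
import pandas as pd

orders = pd.DataFrame({
    'order_id': [101, 102, 103, 104, 105, 106],
    'customer_id': [1, 2, 1, 3, 4, 6],
    'quantity': [2, 1, 3, 1, 2, 1],
})

# Output column names map directly onto the SQL aliases.
customer_orders = orders.groupby('customer_id').agg(
    order_count=('order_id', 'count'),
    total_quantity=('quantity', 'sum'),
)

print(customer_orders)
```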


### <a id='toc4_3_'></a>[Multiple Aggregations](#toc0_)


You can perform multiple aggregations on different columns:


In [31]:
# Multiple aggregations on different columns
order_analysis = orders.groupby('customer_id').agg({
    'order_id': 'count',
    'quantity': ['sum', 'mean', 'max'],
    'order_date': ['min', 'max']
})

# Flatten column names
order_analysis.columns = ['_'.join(col).strip() for col in order_analysis.columns.values]

print("Detailed Order Analysis:")
order_analysis

Detailed Order Analysis:


customer_id,order_id_count,quantity_sum,quantity_mean,quantity_max,order_date_min,order_date_max
1,2,5,2.5,3,2023-05-01,2023-05-03
2,1,1,1.0,1,2023-05-02,2023-05-02
3,1,1,1.0,1,2023-05-04,2023-05-04
4,1,2,2.0,2,2023-05-05,2023-05-05
6,1,1,1.0,1,2023-05-06,2023-05-06
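
The `min` and `max` of `order_date` work here because the dates are ISO-formatted strings, which sort correctly as text. Converting the column with `pd.to_datetime` makes them real timestamps and allows date arithmetic. A sketch, where `days_between` is a hypothetical column name:

```python
import pandas as pd

orders = pd.DataFrame({
    'order_id': [101, 102, 103, 104, 105, 106],
    'customer_id': [1, 2, 1, 3, 4, 6],
    'order_date': ['2023-05-01', '2023-05-02', '2023-05-03',
                   '2023-05-04', '2023-05-05', '2023-05-06']
})

# Parse the strings into datetime64 values.
orders['order_date'] = pd.to_datetime(orders['order_date'])

# First and last order date per customer, plus the span in days.
span = orders.groupby('customer_id')['order_date'].agg(['min', 'max'])
span['days_between'] = (span['max'] - span['min']).dt.days

print(span)
```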


### <a id='toc4_4_'></a>[Grouping by Multiple Columns](#toc0_)


You can group by multiple columns to create more detailed summaries:


In [32]:
# Group by both customer_id and product
product_analysis = orders.groupby(['customer_id', 'product']).agg({
    'quantity': ['count', 'sum', 'mean']
})

print("Product Analysis by Customer:")
product_analysis

Product Analysis by Customer:


,,quantity,quantity,quantity
customer_id,product,count,sum,mean
1,Widget A,1,2,2.0
1,Widget C,1,3,3.0
2,Widget B,1,1,1.0
3,Widget A,1,1,1.0
4,Widget B,1,2,2.0
6,Widget D,1,1,1.0


### <a id='toc4_5_'></a>[HAVING Clause (Filtering Groups)](#toc0_)


In SQL, the HAVING clause is used to filter groups based on aggregate results. In Pandas, we can achieve this by applying a filter after grouping:


SQL equivalent:

```sql
SELECT customer_id, COUNT(*) as order_count
FROM orders
GROUP BY customer_id
HAVING COUNT(*) > 1;
```


Pandas implementation:


In [33]:
# Filter groups to show only customers with more than one order
frequent_customers = orders.groupby('customer_id').size().reset_index(name='order_count')
frequent_customers = frequent_customers[frequent_customers['order_count'] > 1]

print("Customers with Multiple Orders:")
frequent_customers

Customers with Multiple Orders:


Unnamed: 0,customer_id,order_count
0,1,2


Understanding aggregation and grouping is crucial because:

1. They allow you to summarize large datasets into meaningful insights.
2. They're essential for performing analyses at different levels of granularity.
3. Many data analysis tasks involve understanding patterns and trends within groups of data.


As we move forward to more advanced Pandas operations, you'll see how these aggregation and grouping concepts form the basis for complex data manipulations and analyses. The ability to efficiently group and summarize data is a key skill in data processing and analysis.

## <a id='toc5_'></a>[Conclusion](#toc0_)

In this lecture on Database Fundamentals for Data Processing, we've covered essential concepts that form the foundation of working with structured data, both in traditional databases and in Pandas. Let's recap the key points:

1. **Introduction to Databases**: We learned about the importance of databases in data processing, different types of databases, and their key characteristics.

2. **Relational Database Concepts**: We explored tables, rows, columns, primary keys, foreign keys, and relationships between tables. These concepts directly translate to how we structure and work with data in Pandas DataFrames.

3. **Basic SQL Operations**: We covered fundamental SQL operations like SELECT, WHERE, ORDER BY, LIMIT, GROUP BY, and HAVING, along with aggregate functions, and saw how they correspond to Pandas operations. This knowledge helps in translating database queries to Pandas code and vice versa.

4. **Joining Tables**: We examined different types of joins (INNER, LEFT, RIGHT, FULL OUTER) and their implementations in both SQL and Pandas. Understanding joins is crucial for combining data from multiple sources or tables.

5. **Aggregation and Grouping**: We explored aggregate functions, the GROUP BY clause, and advanced grouping operations. These techniques are essential for summarizing data and extracting meaningful insights.


In the upcoming lectures, we'll build upon these fundamentals to explore more advanced Pandas operations:

- Merging, Joining, and Concatenating DataFrames
- Advanced Grouping and Aggregating Data
- Reshaping DataFrames with Dummies
- Pivoting, Melting, Stacking, and Unstacking Data


The concepts you've learned here will serve as a strong foundation for understanding these more complex operations. You'll see how the principles of relational databases and SQL queries are reflected in Pandas' powerful data manipulation capabilities.