# Welcome to the Notebook

By the end of this notebook, you will be able to work with both Pandas and SQLAlchemy to efficiently perform essential data analysis tasks, understanding when and how to use each tool effectively.

We will achieve the following learning objectives:
- **Set Up and Configure the Environment:** Configure and initialize your environment to work seamlessly with Pandas and SQLAlchemy.
- **Load and Explore Data:** Analyze and interpret data by loading and performing exploratory analysis with both Pandas and SQLAlchemy.
- **Filter and Query Data:** Construct and execute data filtering and querying techniques to retrieve relevant information from datasets.
- **Group and Aggregate Data:** Synthesize data by performing grouping and aggregation operations to derive meaningful insights.
- **Merge and Join Data:** Integrate and combine data from multiple tables using efficient merging and joining techniques.




### 1. Introduction and Environment Setup



**Overview of Data Analysis Using Pandas**
Pandas is a powerful Python library for data analysis, offering efficient data structures like `DataFrame` and `Series`. It simplifies data manipulation, transformation, and exploration, making it easy to work with structured data from formats like CSV and Excel.

**Key Features:**
- **Data Wrangling & Cleaning**: Tools for reshaping, merging, and handling missing data.
- **Data Exploration**: Built-in functions for summarizing and exploring data.
- **Ease of Use**: A user-friendly API designed for fast and intuitive data operations.

**Overview of Database Interaction with SQLAlchemy**
SQLAlchemy is a robust database toolkit and ORM library for Python. It provides a unified interface for interacting with relational databases, making it suitable for handling large datasets that require efficient querying and storage.

**Key Features:**
- **Database Abstraction**: Works seamlessly with databases like SQLite, PostgreSQL, and MySQL.
- **SQL Queries in Python**: Supports both raw SQL queries and ORM for object-oriented interactions.
- **Connection Management**: Efficiently handles database connections and transactions.

**Benefits and Limitations of Each Approach**

**Pandas:**
- **Benefits**: Easy to use, ideal for in-memory data processing, rich functionality for data analysis.
- **Limitations**: Limited by memory, less efficient for very large datasets.

**SQLAlchemy:**
- **Benefits**: Efficient for large datasets, supports complex queries, ensures data integrity.
- **Limitations**: More complex to set up, steeper learning curve for ORM features.

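Let's start by installing the required packages. The MySQL connection URL used later in this notebook relies on the `pymysql` driver, so it is installed here as well.

In [32]:
!pip install pandas sqlalchemy==2.0.36 pymysql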






Import the required modules:

In [33]:
import pandas as pd
from sqlalchemy import create_engine, text

print("Modules are imported.")

Modules are imported.


### 2. Data Loading and Exploration

**Setting Up the SQLAlchemy Engine and Loading the `actor` Table from the `sakila` Database** 

The engine is a central part of SQLAlchemy and is responsible for managing the connection to the database. It provides an interface to execute SQL queries and retrieve data. 

In [35]:
# Create the engine
engine = create_engine("mysql+pymysql://root:1234@localhost/sakila")

# Establish the connection
connection = engine.connect()

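# `text()` wraps a raw SQL string so that SQLAlchemy 2.0 can execute it.
# Note: `text()` alone does not prevent SQL injection. Pass user input as bound
# parameters (for example `:name`) instead of formatting it into the string.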
result = connection.execute(text("SELECT * FROM actor"))

# Fetch and print the results
for row in result:
    print(row)

# Close the connection
# Explanation: It's important to close the connection to free up resources
# and avoid potential memory leaks or connection exhaustion in the database.
connection.close()

(1, 'PENELOPE', 'GUINESS', datetime.datetime(2006, 2, 15, 4, 34, 33))
(2, 'NICK', 'WAHLBERG', datetime.datetime(2006, 2, 15, 4, 34, 33))
(3, 'ED', 'CHASE', datetime.datetime(2006, 2, 15, 4, 34, 33))
(4, 'JENNIFER', 'DAVIS', datetime.datetime(2006, 2, 15, 4, 34, 33))
(5, 'JOHNNY', 'LOLLOBRIGIDA', datetime.datetime(2006, 2, 15, 4, 34, 33))
(6, 'BETTE', 'NICHOLSON', datetime.datetime(2006, 2, 15, 4, 34, 33))
(7, 'GRACE', 'MOSTEL', datetime.datetime(2006, 2, 15, 4, 34, 33))
(8, 'MATTHEW', 'JOHANSSON', datetime.datetime(2006, 2, 15, 4, 34, 33))
(9, 'JOE', 'SWANK', datetime.datetime(2006, 2, 15, 4, 34, 33))
(10, 'CHRISTIAN', 'GABLE', datetime.datetime(2006, 2, 15, 4, 34, 33))
(11, 'ZERO', 'CAGE', datetime.datetime(2006, 2, 15, 4, 34, 33))
(12, 'KARL', 'BERRY', datetime.datetime(2006, 2, 15, 4, 34, 33))
(13, 'UMA', 'WOOD', datetime.datetime(2006, 2, 15, 4, 34, 33))
(14, 'VIVIEN', 'BERGEN', datetime.datetime(2006, 2, 15, 4, 34, 33))
(15, 'CUBA', 'OLIVIER', datetime.datetime(2006, 2, 15, 4, 34,
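
The open/execute/close pattern above can also be written with context managers, which return the connection to the pool even when an error occurs. The following is a minimal sketch under stated assumptions: an in-memory SQLite database stands in for the MySQL `sakila` server so the cell runs anywhere, and the two sample rows are copied from the output above.

```python
from sqlalchemy import create_engine, text

# Assumption: in-memory SQLite replaces the MySQL server used in this notebook.
engine = create_engine("sqlite:///:memory:")

# engine.begin() opens a connection and commits when the block succeeds
# (it rolls back if an exception is raised).
with engine.begin() as conn:
    conn.execute(text(
        "CREATE TABLE actor ("
        "actor_id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT)"
    ))
    conn.execute(
        text("INSERT INTO actor (actor_id, first_name, last_name) "
             "VALUES (:id, :first, :last)"),
        [
            {"id": 1, "first": "PENELOPE", "last": "GUINESS"},
            {"id": 2, "first": "NICK", "last": "WAHLBERG"},
        ],
    )

# engine.connect() used as a context manager releases the connection
# automatically, so no explicit close() call is needed.
with engine.connect() as conn:
    rows = conn.execute(
        text("SELECT actor_id, first_name, last_name FROM actor ORDER BY actor_id")
    ).fetchall()

for row in rows:
    print(row)
```

Against the real database, only the URL passed to `create_engine()` would change.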

**Loading the Data into a Pandas DataFrame**

Once we have the data retrieved from the actor table using SQLAlchemy, we can easily load it into a Pandas DataFrame for further analysis. Pandas provides convenient methods to read data directly from a SQL query or a database table.

Steps to load data into a Pandas DataFrame:

1. Use the `read_sql()` function in Pandas to execute the SQL query and load the data.
2. Pass the SQL query and the SQLAlchemy engine as arguments to `read_sql()`.
3. The data is loaded into a Pandas DataFrame, allowing you to use Pandas functions to explore and manipulate it.

Here's how to load the data into a Pandas DataFrame:

In [36]:
# Load data into a Pandas DataFrame
query = "SELECT * FROM actor"
df = pd.read_sql(query, engine)

# Display the first few rows of the DataFrame
df.head()

   actor_id first_name     last_name         last_update
0         1   PENELOPE       GUINESS 2006-02-15 04:34:33
1         2       NICK      WAHLBERG 2006-02-15 04:34:33
2         3         ED         CHASE 2006-02-15 04:34:33
3         4   JENNIFER         DAVIS 2006-02-15 04:34:33
4         5     JOHNNY  LOLLOBRIGIDA 2006-02-15 04:34:33
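
`read_sql()` does not have to pull the whole table. The sketch below selects only two columns and restricts the rows with a bound parameter. It assumes an in-memory SQLite database seeded with three rows from the output above (not the MySQL server), and the names `sample` and `df_small` are illustrative.

```python
import pandas as pd
from sqlalchemy import create_engine, text

# Assumption: in-memory SQLite stands in for the MySQL `sakila` database.
engine = create_engine("sqlite:///:memory:")

# Seed a small `actor` table from a DataFrame.
sample = pd.DataFrame({
    "actor_id": [1, 2, 3],
    "first_name": ["PENELOPE", "NICK", "ED"],
    "last_name": ["GUINESS", "WAHLBERG", "CHASE"],
})
sample.to_sql("actor", engine, index=False)

# Select only the needed columns and rows; :max_id is a bound parameter.
with engine.connect() as conn:
    df_small = pd.read_sql(
        text("SELECT actor_id, last_name FROM actor "
             "WHERE actor_id <= :max_id ORDER BY actor_id"),
        conn,
        params={"max_id": 2},
    )

print(df_small)
```

Narrowing the query this way keeps memory use proportional to the data actually needed for the analysis.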


**Count the Total Number of Records in the `actor` Table**

Let's compare the performance of counting the total number of records in the actor table using SQLAlchemy and Pandas. We'll measure the time taken by each method to understand the efficiency difference.

Using SQLAlchemy:

In [51]:
import time

# Measure the start time
start_time = time.time()

# Establish the connection and execute the query
connection = engine.connect()
result = connection.execute(text("SELECT COUNT(*) FROM actor"))
print("Total records in 'actor' table (SQLAlchemy):", result.scalar())

# Measure the end time and calculate the duration
end_time = time.time()
print("Time taken (SQLAlchemy):", end_time - start_time, "seconds")

# Close the connection
connection.close()

Total records in 'actor' table (SQLAlchemy): 200
Time taken (SQLAlchemy): 0.0009763240814208984 seconds


Using Pandas:

In [52]:
# Measure the start time
start_time = time.time()

# Load the entire table into a Pandas DataFrame
df = pd.read_sql("SELECT * FROM actor", engine)

# Count the total number of records
total_records = df.shape[0]
print("Total records in 'actor' table (Pandas):", total_records)

# Measure the end time and calculate the duration
end_time = time.time()
print("Time taken (Pandas):", end_time - start_time, "seconds")

Total records in 'actor' table (Pandas): 200
Time taken (Pandas): 0.004912853240966797 seconds


Comparison:

- SQLAlchemy: Sends a `SELECT COUNT(*)` query, so the database does the counting and only a single number is transferred. This stays efficient as the table grows.

- Pandas: Loads every row and column into memory and then reads the count from `.shape[0]`. The measured time includes the cost of transferring and parsing all the data, which makes this approach inefficient for very large tables.

In the run above, the SQL count took about 0.001 seconds and the Pandas version about 0.005 seconds. With only 200 rows both are fast, and a single timing can vary from run to run.
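
A single timing is easily distorted by caching or background activity. One way to get a steadier comparison is to repeat each measurement and keep the fastest run. This is a sketch only: it assumes an in-memory SQLite table with 10,000 synthetic rows, and `best_of`, `count_in_database`, and `count_in_pandas` are hypothetical helper names.

```python
import time

import pandas as pd
from sqlalchemy import create_engine, text

# Assumption: synthetic data in in-memory SQLite, not the real `actor` table.
engine = create_engine("sqlite:///:memory:")
pd.DataFrame({"actor_id": range(1, 10_001)}).to_sql("actor", engine, index=False)


def best_of(fn, repeats=5):
    """Run fn several times; return its result and the fastest elapsed time."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic, high-resolution clock
        result = fn()
        best = min(best, time.perf_counter() - start)
    return result, best


def count_in_database():
    # The database counts the rows and returns a single number.
    with engine.connect() as conn:
        return conn.execute(text("SELECT COUNT(*) FROM actor")).scalar_one()


def count_in_pandas():
    # Every row is transferred and parsed before counting.
    return len(pd.read_sql("SELECT * FROM actor", engine))


sql_count, sql_time = best_of(count_in_database)
pandas_count, pandas_time = best_of(count_in_pandas)
print(f"database: {sql_count} rows in {sql_time:.6f} s")
print(f"pandas:   {pandas_count} rows in {pandas_time:.6f} s")
```

The absolute numbers depend on the machine and the database, so only the relative difference is meaningful.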

### 3. Filtering and Querying Data

We’ll filter rows from the `actor` table where `first_name` is 'PENELOPE'.

Let’s see how to do this using both Pandas and SQLAlchemy.

**Using Pandas:**

Pandas provides a simple and intuitive way to filter data using conditional expressions. Here's how to filter rows based on a condition:

In [None]:
# Filter rows where the first_name is 'PENELOPE'
filtered_df = df[df['first_name'] == 'PENELOPE']

# Display the filtered rows
print("Filtered rows (Pandas):")
print(filtered_df)

Filtered rows (Pandas):
     actor_id first_name last_name         last_update
0           1   PENELOPE   GUINESS 2006-02-15 04:34:33
53         54   PENELOPE   PINKETT 2006-02-15 04:34:33
103       104   PENELOPE    CRONYN 2006-02-15 04:34:33
119       120   PENELOPE    MONROE 2006-02-15 04:34:33


- `df['first_name'] == 'PENELOPE'` creates a Boolean mask: a `Series` that is `True` for each row where `first_name` is 'PENELOPE' and `False` otherwise.
- Indexing the DataFrame with that mask (written generically as `df[mask]`) keeps only the rows where the mask is `True`.
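
Masks can be stored in a variable and combined. The sketch below uses a small hand-built DataFrame (four rows taken from the outputs in this notebook) to show `&` for "and" and `isin()` for matching several values. The names `df_demo`, `both`, and `either` are illustrative.

```python
import pandas as pd

# Small stand-in for the `actor` DataFrame.
df_demo = pd.DataFrame({
    "actor_id": [1, 2, 54, 104],
    "first_name": ["PENELOPE", "NICK", "PENELOPE", "PENELOPE"],
    "last_name": ["GUINESS", "WAHLBERG", "PINKETT", "CRONYN"],
})

# A mask is a Boolean Series aligned with the DataFrame's rows.
mask = df_demo["first_name"] == "PENELOPE"
print(mask.tolist())

# Combine conditions with & (and) or | (or); each condition needs parentheses.
both = df_demo[mask & (df_demo["actor_id"] > 50)]
print(both)

# isin() matches any value from a list.
either = df_demo[df_demo["first_name"].isin(["PENELOPE", "NICK"])]
print(either)
```

Python's `and` and `or` keywords do not work element-wise on a `Series`, which is why the `&` and `|` operators are used here.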

**Using SQLAlchemy: Direct Query with `text()`**

We can use a raw SQL query to filter rows by wrapping it in SQLAlchemy's `text()` function.

In [40]:
connection = engine.connect()

# Use a SQL query to filter rows
result = connection.execute(text("SELECT * FROM actor WHERE first_name = 'PENELOPE'"))

# Fetch the filtered results
filtered_rows = result.fetchall()
print("Filtered rows (SQLAlchemy with Direct Query using text()):")
for row in filtered_rows:
    print(row)

# Close the connection
connection.close()

Filtered rows (SQLAlchemy with Direct Query using text()):
(1, 'PENELOPE', 'GUINESS', datetime.datetime(2006, 2, 15, 4, 34, 33))
(54, 'PENELOPE', 'PINKETT', datetime.datetime(2006, 2, 15, 4, 34, 33))
(104, 'PENELOPE', 'CRONYN', datetime.datetime(2006, 2, 15, 4, 34, 33))
(120, 'PENELOPE', 'MONROE', datetime.datetime(2006, 2, 15, 4, 34, 33))
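
When the filter value comes from user input, it is safer to pass it as a bound parameter than to write it into the SQL string. The sketch below assumes an in-memory SQLite copy of a few `actor` rows, and `find_by_first_name` is a hypothetical helper.

```python
from sqlalchemy import create_engine, text

# Assumption: in-memory SQLite with three rows stands in for `sakila`.
engine = create_engine("sqlite:///:memory:")
with engine.begin() as conn:
    conn.execute(text(
        "CREATE TABLE actor ("
        "actor_id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT)"
    ))
    conn.execute(
        text("INSERT INTO actor VALUES (:id, :first, :last)"),
        [
            {"id": 1, "first": "PENELOPE", "last": "GUINESS"},
            {"id": 2, "first": "NICK", "last": "WAHLBERG"},
            {"id": 54, "first": "PENELOPE", "last": "PINKETT"},
        ],
    )


def find_by_first_name(first_name):
    """Return (actor_id, last_name) tuples for the given first name."""
    stmt = text(
        "SELECT actor_id, last_name FROM actor "
        "WHERE first_name = :first_name ORDER BY actor_id"
    )
    with engine.connect() as conn:
        # The value is sent separately from the SQL text, never spliced into it.
        return [tuple(r) for r in conn.execute(stmt, {"first_name": first_name})]


print(find_by_first_name("PENELOPE"))
# A typical injection attempt is treated as an ordinary string and matches nothing.
print(find_by_first_name("x' OR '1'='1"))
```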


**Using SQLAlchemy: Filtering with `select()` and `where()`**

We can use the `select()` construct together with its `where()` method to filter rows in a more Pythonic way. On a `select()` statement, `filter()` is a synonym for `where()`, so both apply the same filtering conditions to a query.

To build queries this way, we first need to load the table schema from the database. This is done with SQLAlchemy's `MetaData` object, which lets us reflect the table structure and discover the columns available for querying.

In [None]:
from sqlalchemy import select, Table, MetaData

# Setup metadata to reflect the actor table structure from the database
# Explanation: MetaData() is needed to load the table schema from the database,
# allowing SQLAlchemy to understand the structure and columns of the table.
metadata = MetaData()
actor_table = Table('actor', metadata, autoload_with=engine)

# Construct a query to select rows where first_name is 'PENELOPE'
query_statement = select(actor_table).where(actor_table.c.first_name == 'PENELOPE')

# Alternatively, use the .filter() method, which is a synonym for .where()
# query_statement = select(actor_table).filter(actor_table.c.first_name == 'PENELOPE')

# Open the connection
connection = engine.connect()

# Execute the query
result = connection.execute(query_statement)

# Fetch the filtered results
filtered_rows = result.fetchall()
print("Filtered rows (SQLAlchemy with select() and where()):")
for row in filtered_rows:
    print(row)

# Close the connection
connection.close()


Filtered rows (SQLAlchemy with select() and where()):
(1, 'PENELOPE', 'GUINESS', datetime.datetime(2006, 2, 15, 4, 34, 33))
(54, 'PENELOPE', 'PINKETT', datetime.datetime(2006, 2, 15, 4, 34, 33))
(104, 'PENELOPE', 'CRONYN', datetime.datetime(2006, 2, 15, 4, 34, 33))
(120, 'PENELOPE', 'MONROE', datetime.datetime(2006, 2, 15, 4, 34, 33))
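
Reflection and condition building can be tried without a MySQL server. The sketch below is an illustration under assumptions: it creates a tiny `actor` table in in-memory SQLite, reflects it, and combines conditions with `or_()` and a second `where()` (chained `where()` calls are joined with AND). The variable names are illustrative.

```python
from sqlalchemy import MetaData, Table, create_engine, or_, select, text

# Assumption: in-memory SQLite with four rows stands in for `sakila`.
engine = create_engine("sqlite:///:memory:")
with engine.begin() as conn:
    conn.execute(text(
        "CREATE TABLE actor ("
        "actor_id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT)"
    ))
    conn.execute(
        text("INSERT INTO actor VALUES (:id, :first, :last)"),
        [
            {"id": 1, "first": "PENELOPE", "last": "GUINESS"},
            {"id": 2, "first": "NICK", "last": "WAHLBERG"},
            {"id": 3, "first": "ED", "last": "CHASE"},
            {"id": 104, "first": "PENELOPE", "last": "CRONYN"},
        ],
    )

# Reflect the table: SQLAlchemy reads the column definitions from the database.
metadata = MetaData()
actor = Table("actor", metadata, autoload_with=engine)
print(actor.c.keys())  # column names discovered by reflection

stmt = (
    select(actor.c.actor_id, actor.c.last_name)
    .where(or_(actor.c.first_name == "PENELOPE", actor.c.first_name == "NICK"))
    .where(actor.c.actor_id < 100)  # a second where() is combined with AND
    .order_by(actor.c.actor_id)
)
print(stmt)  # the SQL that will be sent, with placeholders for the values

with engine.connect() as conn:
    rows = [tuple(r) for r in conn.execute(stmt)]
print(rows)
```

Printing the statement before executing it is a quick way to check that the Python expression produces the SQL you expect.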


**Exercise:** Filter the `actor` table to retrieve rows where `last_name` is 'WAHLBERG' using SQLAlchemy (`select()` and `where()`).

In [None]:
### Write your code here 

### Solution for instructor: 
# query_statement = select(actor_table).where(actor_table.c.last_name == 'WAHLBERG')
# connection = engine.connect()
# result = connection.execute(query_statement)

# # Fetch the filtered results
# filtered_rows = result.fetchall()
# print("Filtered rows (SQLAlchemy):")
# for row in filtered_rows:
#     print(row)

# # Close the connection
# connection.close()

### 4. Grouping and Calculating Aggregations


In this part, we'll group the data by `last_name` and calculate the total count of actors for each unique `last_name`. We'll do this using both Pandas and SQLAlchemy.

**Using Pandas**
Pandas makes it easy to group data and calculate aggregations using built-in methods like `groupby()` and `size()`.

In [46]:
# Group the data by 'last_name' and calculate the count of actors
grouped_df = df.groupby('last_name').size().reset_index(name='actor_count')

# Display the grouped and aggregated data
grouped_df

       last_name  actor_count
0         AKROYD            3
1          ALLEN            3
2        ASTAIRE            1
3         BACALL            1
4         BAILEY            2
..           ...          ...
116      WINSLET            2
117  WITHERSPOON            1
118         WOOD            2
119         WRAY            1


- `groupby('last_name')`: Groups the DataFrame by the last_name column.
- `size()`: Counts the number of occurrences in each group.
- `reset_index(name='actor_count')`: Resets the index and names the new count column as actor_count.
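
When more than one statistic per group is needed, `agg()` with named aggregations is a common alternative to `size()`. The sketch below uses a hypothetical five-row DataFrame (the counts are made up for illustration and do not match `sakila`).

```python
import pandas as pd

# Hypothetical data: three AKROYD rows and two ALLEN rows.
df_demo = pd.DataFrame({
    "actor_id": [1, 2, 3, 4, 5],
    "last_name": ["AKROYD", "ALLEN", "AKROYD", "ALLEN", "AKROYD"],
})

summary = (
    df_demo.groupby("last_name")
    .agg(
        actor_count=("actor_id", "count"),  # number of actors per last name
        first_id=("actor_id", "min"),       # smallest actor_id in each group
    )
    .reset_index()
    .sort_values("actor_count", ascending=False)
)
print(summary)
```

Each keyword passed to `agg()` becomes an output column, defined by a `(source column, function)` pair.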

**Using SQLAlchemy**

SQLAlchemy allows us to perform grouping and aggregation using SQL queries. We use SQLAlchemy's `func` object to generate SQL functions such as `COUNT`, `SUM`, and `AVG`, which are commonly used for aggregations. Here’s how to group data by `last_name` and calculate the count:

In [47]:
from sqlalchemy import func  # Import func to use SQL functions like COUNT

# Construct a query to group by 'last_name' and calculate the count of actors
query_statement = select(
    actor_table.c.last_name,  # Select the 'last_name' column
    func.count(actor_table.c.actor_id).label('actor_count')  # Use func.count() to count actor_id and label it as 'actor_count'
).group_by(actor_table.c.last_name)  # Group the results by 'last_name'

# Execute the query
connection = engine.connect()
result = connection.execute(query_statement)

# Fetch the grouped and aggregated results
grouped_rows = result.fetchall()
print("Grouped and aggregated data (SQLAlchemy):")
for row in grouped_rows:
    print(row)

# Close the connection
# Closing the connection is important to free up resources and maintain database performance
connection.close()

Grouped and aggregated data (SQLAlchemy):
('AKROYD', 3)
('ALLEN', 3)
('ASTAIRE', 1)
('BACALL', 1)
('BAILEY', 2)
('BALE', 1)
('BALL', 1)
('BARRYMORE', 1)
('BASINGER', 1)
('BENING', 2)
('BERGEN', 1)
('BERGMAN', 1)
('BERRY', 3)
('BIRCH', 1)
('BLOOM', 1)
('BOLGER', 2)
('BRIDGES', 1)
('BRODY', 2)
('BULLOCK', 1)
('CAGE', 2)
('CARREY', 1)
('CHAPLIN', 1)
('CHASE', 2)
('CLOSE', 1)
('COSTNER', 1)
('CRAWFORD', 2)
('CRONYN', 2)
('CROWE', 1)
('CRUISE', 1)
('CRUZ', 1)
('DAMON', 1)
('DAVIS', 3)
('DAY-LEWIS', 1)
('DEAN', 2)
('DEE', 2)
('DEGENERES', 3)
('DENCH', 2)
('DEPP', 2)
('DERN', 1)
('DREYFUSS', 1)
('DUKAKIS', 2)
('DUNST', 1)
('FAWCETT', 2)
('GABLE', 1)
('GARLAND', 3)
('GIBSON', 1)
('GOLDBERG', 1)
('GOODING', 2)
('GRANT', 1)
('GUINESS', 3)
('HACKMAN', 2)
('HARRIS', 3)
('HAWKE', 1)
('HESTON', 1)
('HOFFMAN', 3)
('HOPE', 1)
('HOPKINS', 3)
('HOPPER', 2)
('HUDSON', 1)
('HUNT', 1)
('HURT', 1)
('JACKMAN', 2)
('JOHANSSON', 3)
('JOLIE', 1)
('JOVOVICH', 1)
('KEITEL', 3)
('KILMER', 5)
('LEIGH', 1)
('LOLLOBR
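
A grouped query can also be filtered and sorted on the aggregate itself. The sketch below adds `having()` and `order_by()` to the same kind of statement. It assumes a hand-defined `actor` table in in-memory SQLite with hypothetical rows, so the counts are illustrative only.

```python
from sqlalchemy import (
    Column, Integer, MetaData, String, Table, create_engine, func, insert, select,
)

# Assumption: in-memory SQLite with a hand-defined table instead of reflection.
engine = create_engine("sqlite:///:memory:")
metadata = MetaData()
actor = Table(
    "actor", metadata,
    Column("actor_id", Integer, primary_key=True),
    Column("last_name", String(45)),
)
metadata.create_all(engine)

# Hypothetical rows: three KILMER, two AKROYD, one BALE.
names = ["KILMER", "KILMER", "KILMER", "AKROYD", "AKROYD", "BALE"]
with engine.begin() as conn:
    conn.execute(
        insert(actor),
        [{"actor_id": i, "last_name": n} for i, n in enumerate(names, start=1)],
    )

actor_count = func.count(actor.c.actor_id).label("actor_count")
stmt = (
    select(actor.c.last_name, actor_count)
    .group_by(actor.c.last_name)
    .having(func.count(actor.c.actor_id) > 1)        # keep names shared by 2+ actors
    .order_by(actor_count.desc(), actor.c.last_name)  # largest groups first
)

with engine.connect() as conn:
    rows = [tuple(r) for r in conn.execute(stmt)]
print(rows)
```

`where()` filters rows before grouping, while `having()` filters the groups after the aggregate has been computed.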

**Exercise**: Filtering and Aggregation with SQLAlchemy

Use SQLAlchemy to perform the following operations on the actor table:

- Filter the rows to include only actors whose first_name is 'PENELOPE' or 'NICK'.
- Group the filtered data by last_name and calculate the total count of actors for each last_name.

Good Luck! 

In [None]:
### Write your solution here 

### Solution for instructor:

# Construct the query:
# 1. Filter rows where first_name is 'PENELOPE' or 'NICK'
# 2. Group by last_name and count the number of actors

# query_statement = select(
#     actor_table.c.last_name,  # Select the 'last_name' column
#     func.count(actor_table.c.actor_id).label('actor_count')  # Count actor_id and label it as 'actor_count'
# ).where(
#     actor_table.c.first_name.in_(['PENELOPE', 'NICK'])  # Filter for 'PENELOPE' or 'NICK'
# ).group_by(
#     actor_table.c.last_name  # Group the results by 'last_name'
# )

# # Execute the query
# connection = engine.connect()
# result = connection.execute(query_statement)

# # Fetch and print the results
# filtered_and_grouped_rows = result.fetchall()
# print("Filtered and grouped data (SQLAlchemy):")
# for row in filtered_and_grouped_rows:
#     print(row)

# # Close the connection
# connection.close()

Filtered and grouped data (SQLAlchemy):
('CRONYN', 1)
('DEGENERES', 1)
('GUINESS', 1)
('MONROE', 1)
('PINKETT', 1)
('STALLONE', 1)
('WAHLBERG', 1)


**Checking for Missing Data and Filling Missing Values**

Handling missing data is an essential part of data cleaning. We'll learn how to check for and fill missing values using both Pandas and SQLAlchemy.


Using Pandas

Pandas provides convenient methods to check for and handle missing data in a DataFrame:

In [53]:
# Checking for missing data
missing_data_summary = df.isnull().sum()
print("Missing data summary (Pandas):")
print(missing_data_summary)

# Explanation:
# - `df.isnull()` returns a DataFrame of the same shape as `df` with `True` for missing values and `False` otherwise.
# - `.sum()` counts the number of missing values in each column.

# Filling missing values
# Fill missing values in the 'last_name' column with 'Unknown'.
# Assigning the result back avoids `df[col].fillna(..., inplace=True)`, which
# raises a FutureWarning in recent pandas versions and will stop working in pandas 3.0.
df['last_name'] = df['last_name'].fillna('Unknown')
print("\nData after filling missing values (Pandas):")
print(df.head())

# Explanation:
# - `fillna('Unknown')` returns the 'last_name' column with all missing values replaced by 'Unknown'.
# - Assigning that result back to `df['last_name']` updates the DataFrame.


Missing data summary (Pandas):
actor_id       0
first_name     0
last_name      0
last_update    0
dtype: int64

Data after filling missing values (Pandas):
   actor_id first_name     last_name         last_update
0         1   PENELOPE       GUINESS 2006-02-15 04:34:33
1         2       NICK      WAHLBERG 2006-02-15 04:34:33
2         3         ED         CHASE 2006-02-15 04:34:33
3         4   JENNIFER         DAVIS 2006-02-15 04:34:33
4         5     JOHNNY  LOLLOBRIGIDA 2006-02-15 04:34:33
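
Using SQLAlchemy

The same two steps can be expressed in SQL: count the `NULL` values with `IS NULL`, and substitute a default with `COALESCE` when reading. This is a minimal sketch under assumptions: it uses an in-memory SQLite table with one deliberately missing `last_name` (the real `actor` table has none, as the summary above shows), and it fills values only in the query result, not in the stored data.

```python
from sqlalchemy import (
    Column, Integer, MetaData, String, Table, create_engine, func, insert, select,
)

# Assumption: in-memory SQLite with a hypothetical missing value in row 2.
engine = create_engine("sqlite:///:memory:")
metadata = MetaData()
actor = Table(
    "actor", metadata,
    Column("actor_id", Integer, primary_key=True),
    Column("last_name", String(45), nullable=True),
)
metadata.create_all(engine)

with engine.begin() as conn:
    conn.execute(insert(actor), [
        {"actor_id": 1, "last_name": "GUINESS"},
        {"actor_id": 2, "last_name": None},
        {"actor_id": 3, "last_name": "CHASE"},
    ])

# Count rows where last_name IS NULL.
missing_stmt = (
    select(func.count()).select_from(actor).where(actor.c.last_name.is_(None))
)

# Replace NULL with 'Unknown' in the result set using COALESCE.
filled_stmt = select(
    actor.c.actor_id,
    func.coalesce(actor.c.last_name, "Unknown").label("last_name"),
).order_by(actor.c.actor_id)

with engine.connect() as conn:
    missing = conn.execute(missing_stmt).scalar_one()
    filled = [tuple(r) for r in conn.execute(filled_stmt)]

print("Missing last_name values:", missing)
print(filled)
```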


### 5. Merging and Joining Data

**Merging Two DataFrames and Performing Data Joins in SQL**

Data merging and joining are essential operations in data analysis, especially when working with relational data. We’ll cover how to merge two DataFrames using Pandas and how to perform a SQL join using SQLAlchemy.

Before we start, let's look at the data tables in MySQL Workbench.

Using Pandas: Merging Two DataFrames

We can use the `merge()` method in Pandas to join two DataFrames. Let's assume we have loaded two DataFrames: `df_actor` and `df_film_actor`.


In [55]:
# Example DataFrames (for demonstration purposes)
df_actor = pd.read_sql("SELECT * FROM actor", engine)
df_film_actor = pd.read_sql("SELECT * FROM film_actor", engine)

# Merging the two DataFrames on 'actor_id'
merged_df = pd.merge(df_actor, df_film_actor, on='actor_id')

# Display the first few rows of the merged DataFrame
print("Merged DataFrame (Pandas):")
merged_df.head()

# Explanation:
# - `pd.merge()` merges the two DataFrames based on the 'actor_id' column.
# - `on='actor_id'` specifies the common key to join on.

Merged DataFrame (Pandas):


   actor_id first_name last_name       last_update_x  film_id       last_update_y
0         1   PENELOPE   GUINESS 2006-02-15 04:34:33        1 2006-02-15 05:05:03
1         1   PENELOPE   GUINESS 2006-02-15 04:34:33       23 2006-02-15 05:05:03
2         1   PENELOPE   GUINESS 2006-02-15 04:34:33       25 2006-02-15 05:05:03
3         1   PENELOPE   GUINESS 2006-02-15 04:34:33      106 2006-02-15 05:05:03
4         1   PENELOPE   GUINESS 2006-02-15 04:34:33      140 2006-02-15 05:05:03
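
The `_x` and `_y` suffixes appear because both tables contain a `last_update` column. `merge()` accepts `suffixes` to name such columns explicitly, `how` to choose the join type, and `indicator` to show where each row came from. The sketch below uses small hypothetical DataFrames rather than the full tables.

```python
import pandas as pd

# Hypothetical stand-ins for the `actor` and `film_actor` tables.
df_actor_demo = pd.DataFrame({
    "actor_id": [1, 2, 3],
    "last_name": ["GUINESS", "WAHLBERG", "CHASE"],
    "last_update": ["2006-02-15 04:34:33"] * 3,
})
df_film_actor_demo = pd.DataFrame({
    "actor_id": [1, 1, 2],
    "film_id": [1, 23, 3],
    "last_update": ["2006-02-15 05:05:03"] * 3,
})

merged = pd.merge(
    df_actor_demo,
    df_film_actor_demo,
    on="actor_id",
    how="left",                            # keep actors that have no films
    suffixes=("_actor", "_film_actor"),    # readable names for clashing columns
    indicator=True,                        # adds a `_merge` column
)
print(merged)
```

In this sample, actor 3 has no matching film rows, so its `film_id` is missing and `_merge` reports `left_only`.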


Using SQLAlchemy: Performing Data Joins

We can perform a SQL join between the `actor` and `film_actor` tables using SQLAlchemy and execute the query to get the results.

In [56]:
# Reflect both tables into a fresh MetaData object (the engine was created earlier)
metadata = MetaData()
actor_table = Table('actor', metadata, autoload_with=engine)
film_actor_table = Table('film_actor', metadata, autoload_with=engine)

# Construct the join query
join_query = select(
    actor_table, film_actor_table
).select_from(
    actor_table.join(film_actor_table, actor_table.c.actor_id == film_actor_table.c.actor_id)
)

# Execute the join query
connection = engine.connect()
result = connection.execute(join_query)

# Fetch and display the joined results
joined_rows = result.fetchall()
print("Joined data (SQLAlchemy):")
for row in joined_rows[:5]:  # Display only the first 5 rows for brevity
    print(row)

# Close the connection
connection.close()

Joined data (SQLAlchemy):
(1, 'PENELOPE', 'GUINESS', datetime.datetime(2006, 2, 15, 4, 34, 33), 1, 1, datetime.datetime(2006, 2, 15, 5, 5, 3))
(1, 'PENELOPE', 'GUINESS', datetime.datetime(2006, 2, 15, 4, 34, 33), 1, 23, datetime.datetime(2006, 2, 15, 5, 5, 3))
(1, 'PENELOPE', 'GUINESS', datetime.datetime(2006, 2, 15, 4, 34, 33), 1, 25, datetime.datetime(2006, 2, 15, 5, 5, 3))
(1, 'PENELOPE', 'GUINESS', datetime.datetime(2006, 2, 15, 4, 34, 33), 1, 106, datetime.datetime(2006, 2, 15, 5, 5, 3))
(1, 'PENELOPE', 'GUINESS', datetime.datetime(2006, 2, 15, 4, 34, 33), 1, 140, datetime.datetime(2006, 2, 15, 5, 5, 3))
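
The joined output above repeats `actor_id` because every column from both tables is selected. A join can instead select only the columns of interest and load the result straight into a DataFrame. The sketch below is one way to do that under assumptions: two small hand-defined tables in in-memory SQLite with hypothetical rows, not the reflected `sakila` tables.

```python
import pandas as pd
from sqlalchemy import (
    Column, ForeignKey, Integer, MetaData, String, Table,
    create_engine, insert, select,
)

# Assumption: in-memory SQLite with simplified table definitions.
engine = create_engine("sqlite:///:memory:")
metadata = MetaData()
actor = Table(
    "actor", metadata,
    Column("actor_id", Integer, primary_key=True),
    Column("last_name", String(45)),
)
film_actor = Table(
    "film_actor", metadata,
    Column("actor_id", Integer, ForeignKey("actor.actor_id"), primary_key=True),
    Column("film_id", Integer, primary_key=True),
)
metadata.create_all(engine)

with engine.begin() as conn:
    conn.execute(insert(actor), [
        {"actor_id": 1, "last_name": "GUINESS"},
        {"actor_id": 2, "last_name": "WAHLBERG"},
    ])
    conn.execute(insert(film_actor), [
        {"actor_id": 1, "film_id": 1},
        {"actor_id": 1, "film_id": 23},
        {"actor_id": 2, "film_id": 3},
    ])

# Select only the needed columns and join on actor_id.
stmt = (
    select(actor.c.actor_id, actor.c.last_name, film_actor.c.film_id)
    .join(film_actor, actor.c.actor_id == film_actor.c.actor_id)
    .order_by(actor.c.actor_id, film_actor.c.film_id)
)

# read_sql() accepts a SQLAlchemy statement as well as a SQL string.
with engine.connect() as conn:
    joined_df = pd.read_sql(stmt, conn)

print(joined_df)
```

This combines the two approaches from this notebook: the database performs the join, and Pandas receives only the result for further analysis.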
