# Notebook 1: Introduction and Setup

# Learning Objectives:
- Understand the workshop's purpose and structure.
- Set up the Python environment with the necessary libraries.
- Verify that all tools are correctly installed and configured.

Welcome to the workshop on "Introduction to SQL for Data Science Using Pandas and SQLite".
In this notebook, we'll ensure that your environment is set up correctly for the exercises that follow.

Let's start by importing the necessary libraries.

In [41]:
import pandas as pd
import sqlite3

# Verify the versions of the libraries
print("Pandas version:", pd.__version__)
print("SQLite version:", sqlite3.sqlite_version)

Pandas version: 2.2.3
SQLite version: 3.46.1


The sqlite3 module ships with Python's standard library, so it needs no separate installation.

If pandas is missing, uncomment and run the following command:

In [42]:
# %pip install pandas

# Key Points:
- Pandas is a powerful library for data manipulation in Python.
- SQLite3 allows us to interact with SQLite databases using Python.
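
To confirm that SQLite works end to end, and not only that the module imports, a short round trip through a throwaway in-memory database can help. This is a minimal sketch; the table name `check_table` and the helper `sqlite_smoke_test` are hypothetical names chosen for this example.

```python
import sqlite3


def sqlite_smoke_test() -> int:
    """Create an in-memory database, insert two rows, and return the row count."""
    conn = sqlite3.connect(":memory:")  # nothing is written to disk
    try:
        conn.execute("CREATE TABLE check_table (id INTEGER PRIMARY KEY, label TEXT)")
        conn.executemany(
            "INSERT INTO check_table (label) VALUES (?)", [("a",), ("b",)]
        )
        (count,) = conn.execute("SELECT COUNT(*) FROM check_table").fetchone()
        return count
    finally:
        conn.close()


row_count = sqlite_smoke_test()
print("Rows written and read back:", row_count)
```

If the printed count matches the number of inserted rows, the SQLite engine bundled with your Python installation is usable.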

# Notebook 2: Pandas Recap

# Learning Objectives:
- Review key Pandas concepts and operations.
- Understand DataFrame creation, selection, filtering, and aggregation.
- Prepare for comparing Pandas operations with SQL queries.

In [43]:
import pandas as pd

# Let's create a simple DataFrame to recap Pandas basics.
data = {
    'Country': ['A', 'B', 'C', 'D', 'E'],
    'Value': [100, 200, 300, 400, 500],
    'Category': ['X', 'Y', 'X', 'Y', 'X']
}
df = pd.DataFrame(data)
df

Unnamed: 0,Country,Value,Category
0,A,100,X
1,B,200,Y
2,C,300,X
3,D,400,Y
4,E,500,X


In [44]:
# Selecting a single column
df['Country']

0    A
1    B
2    C
3    D
4    E
Name: Country, dtype: object

In [45]:
# Selecting multiple columns
df[['Country', 'Value']]

Unnamed: 0,Country,Value
0,A,100
1,B,200
2,C,300
3,D,400
4,E,500


In [46]:
# Filtering rows based on a condition
df[df['Value'] > 250]

Unnamed: 0,Country,Value,Category
2,C,300,X
3,D,400,Y
4,E,500,X


In [47]:
# Grouping and aggregating data
grouped = df.groupby('Category')['Value'].sum().reset_index()
grouped

Unnamed: 0,Category,Value
0,X,900
1,Y,600


# Key Points:
- Pandas DataFrames are 2-dimensional labeled data structures with columns of potentially different types.
- You can select, filter, and manipulate data using intuitive syntax.
- Grouping and aggregation are tools for summarizing data.

In [48]:
df

Unnamed: 0,Country,Value,Category
0,A,100,X
1,B,200,Y
2,C,300,X
3,D,400,Y
4,E,500,X
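
The selection, filtering, and aggregation steps shown in this notebook can also be chained into one expression. The sketch below rebuilds the same sample data and uses named aggregation to compute two summaries per category; the threshold of 150 and the output column names `total_value` and `n_rows` are illustrative choices, not part of the workshop material.

```python
import pandas as pd

# Same sample data as the recap DataFrame used in this notebook
recap = pd.DataFrame({
    "Country": ["A", "B", "C", "D", "E"],
    "Value": [100, 200, 300, 400, 500],
    "Category": ["X", "Y", "X", "Y", "X"],
})

# Filter rows first, then compute several aggregates per category
summary = (
    recap[recap["Value"] > 150]
    .groupby("Category")
    .agg(total_value=("Value", "sum"), n_rows=("Value", "count"))
    .reset_index()
)
print(summary)
```

This filter-then-aggregate shape maps directly onto a SQL query with WHERE and GROUP BY clauses, which is the comparison the later notebooks build on.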


# Notebook 3: Introduction to the Datasets

# Learning Objectives:
- Familiarize yourself with the World Bank and ESS datasets.
- Load datasets into Pandas DataFrames.
- Perform initial data exploration.

In [49]:
import pandas as pd

# Load the World Bank GDP data
# For the purpose of this notebook, we'll create a sample dataset
gdp_data = {
    'CountryCode': ['USA', 'CAN', 'MEX'],
    'CountryName': ['United States', 'Canada', 'Mexico'],
    'Year': [2020, 2020, 2020],
    'GDP': [21137518, 1647126, 1074314]
}
gdp = pd.DataFrame(gdp_data)
gdp

Unnamed: 0,CountryCode,CountryName,Year,GDP
0,USA,United States,2020,21137518
1,CAN,Canada,2020,1647126
2,MEX,Mexico,2020,1074314


In [50]:
# Load the World Bank Population data
population_data = {
    'CountryCode': ['USA', 'CAN', 'MEX'],
    'CountryName': ['United States', 'Canada', 'Mexico'],
    'Year': [2020, 2020, 2020],
    'Population': [331002651, 37742154, 128932753]
}
population = pd.DataFrame(population_data)
population

Unnamed: 0,CountryCode,CountryName,Year,Population
0,USA,United States,2020,331002651
1,CAN,Canada,2020,37742154
2,MEX,Mexico,2020,128932753


In [51]:
# Load the ESS data
# Again, we'll create a sample dataset
ess_data = {
    'CountryCode': ['USA', 'CAN', 'MEX'],
    'Year': [2020, 2020, 2020],
    'HappinessScore': [7.0, 7.5, 6.5]
}
ess = pd.DataFrame(ess_data)
ess

Unnamed: 0,CountryCode,Year,HappinessScore
0,USA,2020,7.0
1,CAN,2020,7.5
2,MEX,2020,6.5


In [52]:
# Initial data exploration
print("GDP DataFrame Info:")
gdp.info()

GDP DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   CountryCode  3 non-null      object
 1   CountryName  3 non-null      object
 2   Year         3 non-null      int64 
 3   GDP          3 non-null      int64 
dtypes: int64(2), object(2)
memory usage: 228.0+ bytes


In [53]:
print("\nPopulation DataFrame Info:")
population.info()


Population DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   CountryCode  3 non-null      object
 1   CountryName  3 non-null      object
 2   Year         3 non-null      int64 
 3   Population   3 non-null      int64 
dtypes: int64(2), object(2)
memory usage: 228.0+ bytes


In [54]:
print("\nESS DataFrame Info:")
ess.info()


ESS DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   CountryCode     3 non-null      object 
 1   Year            3 non-null      int64  
 2   HappinessScore  3 non-null      float64
dtypes: float64(1), int64(1), object(1)
memory usage: 204.0+ bytes


# Key Points:
- We've loaded sample datasets representing GDP, population, and happiness scores.
- Initial data exploration helps us understand the structure and contents of the datasets.
- Consistent 'CountryCode' and 'Year' columns will allow us to merge these datasets later.
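
Before relying on 'CountryCode' and 'Year' as merge keys, it can be worth checking that each key pair is unique and that both tables cover the same keys. The following is one possible sketch of such a check; the variable names (`gdp_sample`, `ess_sample`, `key_columns`) are hypothetical and the data repeats the sample values from this notebook.

```python
import pandas as pd

gdp_sample = pd.DataFrame({
    "CountryCode": ["USA", "CAN", "MEX"],
    "Year": [2020, 2020, 2020],
    "GDP": [21137518, 1647126, 1074314],
})
ess_sample = pd.DataFrame({
    "CountryCode": ["USA", "CAN", "MEX"],
    "Year": [2020, 2020, 2020],
    "HappinessScore": [7.0, 7.5, 6.5],
})

key_columns = ["CountryCode", "Year"]

# A key is safe to join on only if each (CountryCode, Year) pair occurs once
gdp_has_duplicates = bool(gdp_sample.duplicated(subset=key_columns).any())
ess_has_duplicates = bool(ess_sample.duplicated(subset=key_columns).any())

# Keys present in only one table would be silently dropped by an inner join
gdp_keys = set(zip(gdp_sample["CountryCode"], gdp_sample["Year"]))
ess_keys = set(zip(ess_sample["CountryCode"], ess_sample["Year"]))
unmatched_keys = gdp_keys ^ ess_keys

print("Duplicate keys in GDP:", gdp_has_duplicates)
print("Duplicate keys in ESS:", ess_has_duplicates)
print("Keys found in only one table:", unmatched_keys)
```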

In [55]:
# Notebook 4: Connecting Python to SQLite

# Learning Objectives:
# - Create a SQLite database and establish a connection.
# - Import Pandas DataFrames into SQLite tables.
# - Verify data transfer and explore the database structure.

import pandas as pd
import sqlite3

# Create a connection to a new SQLite database
conn = sqlite3.connect('workshop.db')

# Write the DataFrames to SQLite tables
gdp.to_sql('gdp', conn, if_exists='replace', index=False)
population.to_sql('population', conn, if_exists='replace', index=False)
ess.to_sql('ess', conn, if_exists='replace', index=False)

# Verify that the tables have been created
tables = pd.read_sql("SELECT name FROM sqlite_master WHERE type='table';", conn)
print("Tables in the database:")
print(tables)

# View the first few rows of the 'gdp' table using SQL query
gdp_sql = pd.read_sql_query("SELECT * FROM gdp LIMIT 5;", conn)
gdp_sql

# Key Points:
# - Established a connection to a SQLite database using Python.
# - Imported Pandas DataFrames into SQLite as tables.
# - Verified data transfer by querying the SQLite database.


Tables in the database:
             name
0  gdp_multi_year
1       employees
2             gdp
3      population
4             ess


Unnamed: 0,CountryCode,CountryName,Year,GDP
0,USA,United States,2020,21137518
1,CAN,Canada,2020,1647126
2,MEX,Mexico,2020,1074314
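
To explore the database structure beyond listing table names, SQLite's `PRAGMA table_info` reports the columns and declared types that `to_sql` created. The sketch below uses a separate in-memory database (an assumption made so the example does not touch 'workshop.db'); `conn_demo` and `gdp_sample` are hypothetical names.

```python
import sqlite3

import pandas as pd

gdp_sample = pd.DataFrame({
    "CountryCode": ["USA", "CAN", "MEX"],
    "CountryName": ["United States", "Canada", "Mexico"],
    "Year": [2020, 2020, 2020],
    "GDP": [21137518, 1647126, 1074314],
})

conn_demo = sqlite3.connect(":memory:")
gdp_sample.to_sql("gdp", conn_demo, if_exists="replace", index=False)

# PRAGMA table_info returns one row per column:
# (cid, name, type, notnull, dflt_value, pk)
schema_rows = conn_demo.execute("PRAGMA table_info(gdp)").fetchall()
column_names = [row[1] for row in schema_rows]
for cid, name, declared_type, *rest in schema_rows:
    print(f"{name}: {declared_type}")

# Row counts are a quick way to confirm the transfer was complete
row_count = conn_demo.execute("SELECT COUNT(*) FROM gdp").fetchone()[0]
print("Rows in gdp:", row_count)
conn_demo.close()
```

Comparing the column list and row count against the original DataFrame is a simple way to verify the data transfer.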


In [56]:
# Notebook 5: SQL Basics Through Pandas Comparison

# Learning Objectives:
# - Perform basic SQL queries and compare them with Pandas operations.
# - Understand SELECT statements, WHERE clauses, and GROUP BY clauses.
# - Develop intuition for translating between SQL and Pandas syntax.

import pandas as pd
import sqlite3

# Connect to the existing SQLite database
conn = sqlite3.connect('workshop.db')

# SELECT Statements vs. DataFrame Selection

# SQL: Select specific columns from the 'gdp' table
query_sql = "SELECT CountryName, GDP FROM gdp;"
gdp_sql = pd.read_sql_query(query_sql, conn)
print("SQL Query Result:")
gdp_sql

# Pandas: Select specific columns from the 'gdp' DataFrame
gdp_pandas = gdp[['CountryName', 'GDP']]
print("Pandas Selection Result:")
gdp_pandas

# WHERE Clauses vs. Boolean Indexing

# SQL: Filter countries with GDP greater than 1 trillion USD
# (assuming GDP is recorded in millions of USD, so 1,000,000 million = 1 trillion)
query_sql = "SELECT * FROM gdp WHERE GDP > 1000000;"
large_gdp_sql = pd.read_sql_query(query_sql, conn)
print("SQL Query Result with WHERE clause:")
large_gdp_sql

# Pandas: Filter countries with GDP greater than 1 trillion USD (same unit assumption)
large_gdp_pandas = gdp[gdp['GDP'] > 1000000]
print("Pandas Filtering Result:")
large_gdp_pandas

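# Aggregate Functions in SQL vs. Aggregations in Pandas
# (there is no grouping column here; GROUP BY / groupby() would apply the same aggregate per group)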

# SQL: Calculate average GDP
query_sql = "SELECT AVG(GDP) as Avg_GDP FROM gdp;"
avg_gdp_sql = pd.read_sql_query(query_sql, conn)
print("SQL Average GDP:")
avg_gdp_sql

# Pandas: Calculate average GDP
avg_gdp_pandas = pd.DataFrame({'Avg_GDP': [gdp['GDP'].mean()]})
print("Pandas Average GDP:")
avg_gdp_pandas

# Key Points:
# - SQL and Pandas can perform similar data manipulation tasks.
# - SELECT statements in SQL correspond to column selection in Pandas.
# - WHERE clauses in SQL are similar to boolean indexing in Pandas.
# - GROUP BY in SQL is analogous to groupby() in Pandas.


SQL Query Result:
Pandas Selection Result:
SQL Query Result with WHERE clause:
Pandas Filtering Result:
SQL Average GDP:
Pandas Average GDP:


Unnamed: 0,Avg_GDP
0,7952986.0
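
The key points mention GROUP BY and groupby(), so a grouped comparison may be useful alongside the plain average. The sketch below reuses the Category/Value sample from the Pandas recap, loads it into a separate in-memory database (an assumption, so 'workshop.db' is left untouched), and computes the per-category sum both ways; `recap` and `conn_demo` are hypothetical names.

```python
import sqlite3

import pandas as pd

recap = pd.DataFrame({
    "Category": ["X", "Y", "X", "Y", "X"],
    "Value": [100, 200, 300, 400, 500],
})

conn_demo = sqlite3.connect(":memory:")
recap.to_sql("recap", conn_demo, index=False)

# SQL: one output row per Category, with the Values summed within each group
sql_totals = pd.read_sql_query(
    "SELECT Category, SUM(Value) AS Value FROM recap "
    "GROUP BY Category ORDER BY Category;",
    conn_demo,
)

# Pandas: the same grouping and aggregation on the DataFrame
pandas_totals = recap.groupby("Category")["Value"].sum().reset_index()
conn_demo.close()

print(sql_totals)
print(pandas_totals)
```

Note that Pandas sorts group keys by default, whereas SQL makes no ordering guarantee without an explicit ORDER BY.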


In [8]:
# Notebook 6: Combining Datasets Using Joins and Merges

# Learning Objectives:
# - Combine datasets using SQL JOIN operations.
# - Merge DataFrames in Pandas.
# - Understand the pros and cons of combining data in SQL vs. Pandas.

import pandas as pd
import sqlite3

# Connect to the SQLite database
conn = sqlite3.connect('workshop.db')

# SQL INNER JOIN
query_sql = """
SELECT gdp.CountryCode, gdp.GDP, ess.HappinessScore
FROM gdp
INNER JOIN ess
ON gdp.CountryCode = ess.CountryCode;
"""
combined_sql = pd.read_sql_query(query_sql, conn)
print("Combined Data using SQL INNER JOIN:")
combined_sql

# Pandas Merge
combined_pandas = pd.merge(gdp, ess, on='CountryCode', how='inner')
print("Combined Data using Pandas merge:")
combined_pandas[['CountryCode', 'GDP', 'HappinessScore']]

# Discussing Data Alignment Issues
# Ensure that the 'CountryCode' is consistent across datasets
# In real-world datasets, you might need to clean or standardize keys before merging

# Key Points:
# - SQL JOIN operations allow you to combine tables based on common keys.
# - Pandas merge() function serves a similar purpose for DataFrames.
# - Understanding the join/merge type ('inner', 'left', 'right', 'outer') is crucial.
# - Data alignment and key consistency are important to avoid mismatches.


Combined Data using SQL INNER JOIN:
Combined Data using Pandas merge:


Unnamed: 0,CountryCode,GDP,HappinessScore
0,USA,21137518,7.0
1,CAN,1647126,7.5
2,MEX,1074314,6.5
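
Joining on 'CountryCode' alone works here because every table holds a single year. With several years per country, the same join pairs every GDP row with every ESS row of that country. The sketch below illustrates this with two years of data; the 2019 happiness score is a hypothetical value invented for the example, and `gdp_years` / `ess_years` are hypothetical names.

```python
import pandas as pd

gdp_years = pd.DataFrame({
    "CountryCode": ["USA", "USA"],
    "Year": [2019, 2020],
    "GDP": [21433226, 21137518],
})
# Hypothetical happiness scores, for illustration only
ess_years = pd.DataFrame({
    "CountryCode": ["USA", "USA"],
    "Year": [2019, 2020],
    "HappinessScore": [6.9, 7.0],
})

# Joining on CountryCode alone pairs every GDP year with every ESS year
code_only = pd.merge(gdp_years, ess_years, on="CountryCode", how="inner")

# Joining on both keys keeps one row per country and year;
# validate="one_to_one" raises an error if either side has duplicate keys
both_keys = pd.merge(
    gdp_years,
    ess_years,
    on=["CountryCode", "Year"],
    how="inner",
    validate="one_to_one",
)

print("Rows when joining on CountryCode only:", len(code_only))
print("Rows when joining on CountryCode and Year:", len(both_keys))
```

The SQL counterpart would extend the join condition to `ON gdp.CountryCode = ess.CountryCode AND gdp.Year = ess.Year`.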


In [9]:
# Notebook 7: Advanced SQL Concepts

# Learning Objectives:
# - Understand and apply SQL subqueries and nested SELECT statements.
# - Utilize SQL window functions for advanced analytical tasks.
# - Recognize scenarios where SQL has advantages over Pandas.

import pandas as pd
import sqlite3

# Connect to the SQLite database
conn = sqlite3.connect('workshop.db')

# Subqueries and Nested SELECTs

# Example: Find countries with GDP above the global average

# Step 1: Calculate global average GDP
query_avg = "SELECT AVG(GDP) as GlobalAvgGDP FROM gdp;"
global_avg = pd.read_sql_query(query_avg, conn).iloc[0]['GlobalAvgGDP']
print(f"Global Average GDP: {global_avg}")

# Step 2: Use a subquery so SQLite computes the average inside the same statement
query_sql = """
SELECT CountryName, GDP
FROM gdp
WHERE GDP > (SELECT AVG(GDP) FROM gdp);
"""
above_avg_sql = pd.read_sql_query(query_sql, conn)
print("Countries with GDP above the global average:")
above_avg_sql

# Window Functions in SQL

# For window functions, we'll need a dataset with multiple years
# Let's create a sample 'gdp_multi_year' table
gdp_multi_year_data = {
    'CountryName': ['USA', 'USA', 'USA', 'CAN', 'CAN', 'CAN'],
    'Year': [2018, 2019, 2020, 2018, 2019, 2020],
    'GDP': [20580230, 21433226, 21137518, 1712515, 1736426, 1647126]
}
gdp_multi_year = pd.DataFrame(gdp_multi_year_data)
gdp_multi_year.to_sql('gdp_multi_year', conn, if_exists='replace', index=False)

# SQL Window Function: Calculate moving average GDP over 3 years for each country
query_sql = """
SELECT
    CountryName,
    Year,
    GDP,
    AVG(GDP) OVER (
        PARTITION BY CountryName
        ORDER BY Year
        ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
    ) as MovingAvgGDP
FROM gdp_multi_year
ORDER BY CountryName, Year;
"""
moving_avg_sql = pd.read_sql_query(query_sql, conn)
print("Moving Average GDP over 3 years:")
moving_avg_sql

# Attempting the Equivalent in Pandas
gdp_multi_year_sorted = gdp_multi_year.sort_values(['CountryName', 'Year'])
gdp_multi_year_sorted['MovingAvgGDP'] = (
    gdp_multi_year_sorted.groupby('CountryName')['GDP']
    .rolling(window=3, min_periods=1)
    .mean()
    .reset_index(level=0, drop=True)
)
print("Moving Average GDP using Pandas:")
gdp_multi_year_sorted

# Discussion:
# - The SQL window function provides a concise and efficient way to calculate moving averages.
# - The Pandas equivalent is more verbose and may be less efficient for large datasets.

# Key Points:
# - Advanced SQL features like subqueries and window functions enable complex analytical tasks.
# - SQL can perform operations that are cumbersome or less efficient in Pandas.
# - Understanding these SQL features expands your data analysis capabilities.


Global Average GDP: 7952986.0
Countries with GDP above the global average:
Moving Average GDP over 3 years:
Moving Average GDP using Pandas:


Unnamed: 0,CountryName,Year,GDP,MovingAvgGDP
3,CAN,2018,1712515,1712515.0
4,CAN,2019,1736426,1724470.0
5,CAN,2020,1647126,1698689.0
0,USA,2018,20580230,20580230.0
1,USA,2019,21433226,21006730.0
2,USA,2020,21137518,21050320.0
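
One way to build confidence that the SQL window function and the Pandas rolling mean are equivalent is to compute both and compare them numerically. The sketch below does this in a separate in-memory database; it assumes a SQLite build with window function support (version 3.25 or later), and `gdp_multi`, `conn_demo`, and `max_difference` are hypothetical names.

```python
import sqlite3

import pandas as pd

gdp_multi = pd.DataFrame({
    "CountryName": ["USA", "USA", "USA", "CAN", "CAN", "CAN"],
    "Year": [2018, 2019, 2020, 2018, 2019, 2020],
    "GDP": [20580230, 21433226, 21137518, 1712515, 1736426, 1647126],
})

conn_demo = sqlite3.connect(":memory:")
gdp_multi.to_sql("gdp_multi_year", conn_demo, index=False)

# SQL: 3-row moving average per country, ordered by year
sql_result = pd.read_sql_query(
    """
    SELECT
        CountryName,
        Year,
        AVG(GDP) OVER (
            PARTITION BY CountryName
            ORDER BY Year
            ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
        ) AS MovingAvgGDP
    FROM gdp_multi_year
    ORDER BY CountryName, Year;
    """,
    conn_demo,
)
conn_demo.close()

# Pandas: same calculation, with rows put in the same order as the SQL result
pandas_result = gdp_multi.sort_values(["CountryName", "Year"]).reset_index(drop=True)
pandas_result["MovingAvgGDP"] = pandas_result.groupby("CountryName")["GDP"].transform(
    lambda s: s.rolling(window=3, min_periods=1).mean()
)

# Largest absolute disagreement between the two approaches
max_difference = (
    sql_result["MovingAvgGDP"] - pandas_result["MovingAvgGDP"]
).abs().max()
print("Largest difference between SQL and Pandas:", max_difference)
```

Here `transform` keeps the original row index, which avoids the index reshuffling needed after `groupby(...).rolling(...)`.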


In [10]:
# Notebook 8: Handling Large Datasets

# Learning Objectives:
# - Recognize the limitations of Pandas with large datasets.
# - Use SQLite to handle datasets that exceed memory limitations.
# - Compare approaches to processing large datasets in Pandas and SQL.

import pandas as pd
import sqlite3

# Simulating a large dataset by replicating the 'gdp' DataFrame
replication_factor = 100000  # Adjust this number to simulate a large dataset
large_gdp = pd.concat([gdp] * replication_factor, ignore_index=True)

# Estimate the size of the DataFrame
import sys
print(f"Size of large_gdp DataFrame: {sys.getsizeof(large_gdp) / (1024 ** 2):.2f} MB")

# Attempting to process the large DataFrame in Pandas
try:
    total_gdp = large_gdp['GDP'].sum()
    print(f"Total GDP calculated in Pandas: {total_gdp}")
except MemoryError as e:
    print("MemoryError encountered in Pandas:", e)

# Using SQLite to handle the large dataset
conn_large = sqlite3.connect('large_workshop.db')

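# Start from an empty table so re-running this cell does not append duplicate rows
conn_large.execute("DROP TABLE IF EXISTS large_gdp")

# Write the large DataFrame to SQLite in chunks to keep each insert batch small
chunk_size = 100000
for i in range(0, large_gdp.shape[0], chunk_size):
    chunk = large_gdp.iloc[i:i+chunk_size]
    chunk.to_sql('large_gdp', conn_large, if_exists='append', index=False)
    print(f"Processed rows {i} to {min(i + chunk_size, large_gdp.shape[0])}")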

# Querying the total GDP using SQL
query_sql = "SELECT SUM(GDP) as Total_GDP FROM large_gdp;"
total_gdp_sql = pd.read_sql_query(query_sql, conn_large)
print("Total GDP calculated using SQLite:")
total_gdp_sql

# Key Points:
# - Pandas may not handle very large datasets efficiently due to memory constraints.
# - SQLite and other databases can process large datasets without loading all data into memory.
# - Using databases allows for efficient querying and aggregation on large datasets.


Size of large_gdp DataFrame: 35.86 MB
Total GDP calculated in Pandas: 2385895800000
Processed rows 0 to 100000
Processed rows 100000 to 200000
Processed rows 200000 to 300000
Total GDP calculated using SQLite:


Unnamed: 0,Total_GDP
0,2385895800000
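
The claim that a database can aggregate without loading everything into Python memory can be illustrated with the standard library alone. In the sketch below the SUM runs inside SQLite and returns a single row, while `fetchmany()` streams raw rows in fixed-size batches when they are needed. An in-memory database is used only to keep the example self-contained; with an on-disk file the same calls apply. The batch size of 500 and the names `conn_demo`, `sql_total`, and `streamed_total` are illustrative assumptions.

```python
import sqlite3

base_rows = [("USA", 21137518), ("CAN", 1647126), ("MEX", 1074314)]

conn_demo = sqlite3.connect(":memory:")
conn_demo.execute("CREATE TABLE large_gdp (CountryCode TEXT, GDP INTEGER)")
# Replicate the three sample rows 1000 times (3000 rows in total)
conn_demo.executemany("INSERT INTO large_gdp VALUES (?, ?)", base_rows * 1000)

# The aggregation runs inside SQLite; only one row comes back to Python
(sql_total,) = conn_demo.execute("SELECT SUM(GDP) FROM large_gdp").fetchone()

# When raw rows are needed, fetchmany() retrieves them in fixed-size batches
cursor = conn_demo.execute("SELECT GDP FROM large_gdp")
streamed_total = 0
batch_count = 0
while True:
    batch = cursor.fetchmany(500)
    if not batch:
        break
    streamed_total += sum(gdp for (gdp,) in batch)
    batch_count += 1
conn_demo.close()

print("Total from SQL SUM:", sql_total)
print("Total from streamed batches:", streamed_total)
```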


In [15]:
%pip install "dask[dataframe]"

Collecting dask-expr<1.2,>=1.1 (from dask[dataframe])
  Downloading dask_expr-1.1.15-py3-none-any.whl.metadata (2.5 kB)
Collecting pyarrow>=14.0.1 (from dask-expr<1.2,>=1.1->dask[dataframe])
  Downloading pyarrow-17.0.0-cp312-cp312-macosx_11_0_arm64.whl.metadata (3.3 kB)
Downloading dask_expr-1.1.15-py3-none-any.whl (242 kB)
Downloading pyarrow-17.0.0-cp312-cp312-macosx_11_0_arm64.whl (27.2 MB)
Installing collected packages: pyarrow, dask-expr
Successfully installed dask-expr-1.1.15 pyarrow-17.0.0
Note: you may need to restart the kernel to use updated packages.


In [16]:
# Notebook 9: Scaling Data Analysis Beyond Pandas

# Learning Objectives:
# - Explore options for handling datasets larger than memory.
# - Understand when to use tools like Dask.
# - Recognize the trade-offs between different data scaling solutions.

# Option 1: Upgrading Hardware
# - Increase your system's RAM to handle larger datasets.
# - This can be expensive and has physical limitations.

# Option 2: Using Dask
import dask.dataframe as dd

# Reading a CSV file with Dask
# In practice this would be a very large file (e.g. 'large_world_bank_gdp.csv').
# For demonstration purposes, we save the small 'gdp' DataFrame to disk
# and read it back to simulate loading from a file.
gdp.to_csv('gdp.csv', index=False)

# Read the CSV file using Dask
ddf = dd.read_csv('gdp.csv')

# Perform computations using Dask
total_gdp_dask = ddf['GDP'].sum().compute()
print(f"Total GDP calculated using Dask: {total_gdp_dask}")

# Option 3: Leveraging Advanced SQL Databases
# - Use databases like PostgreSQL or MySQL for better performance and features.
# - Suitable for handling large-scale data and concurrent access.

# Key Points:
# - Scaling data analysis requires choosing the right tools based on data size and complexity.
# - Dask extends Pandas to handle larger-than-memory datasets with parallel computing.
# - Advanced SQL databases offer robust solutions for large datasets but require more setup and maintenance.


Total GDP calculated using Dask: 23858958
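
For aggregations that can be computed piece by piece, a lighter-weight option than Dask is to stream the CSV through Pandas in chunks. This is a sketch of that alternative, not part of the workshop's Dask example; the file name `gdp_sample.csv`, the chunk size of 2, and the variable names are assumptions made for the demonstration.

```python
import os
import tempfile

import pandas as pd

gdp_sample = pd.DataFrame({
    "CountryCode": ["USA", "CAN", "MEX"],
    "GDP": [21137518, 1647126, 1074314],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "gdp_sample.csv")
    gdp_sample.to_csv(csv_path, index=False)

    # With chunksize set, read_csv yields DataFrames of at most that many rows,
    # so only one chunk is held in memory at a time
    chunked_total = 0
    for chunk in pd.read_csv(csv_path, chunksize=2):
        chunked_total += int(chunk["GDP"].sum())

print("Total GDP calculated in chunks:", chunked_total)
```

This approach suits sums, counts, and similar running totals; operations that need all rows at once, such as sorting or joins, are where Dask or a database becomes the better fit.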


In [17]:
# Notebook 10: Relational Databases and Data Structures

# Learning Objectives:
# - Understand the concepts of relational databases and normal forms.
# - Discuss the suitability of Pandas for relational data.
# - Recognize when to use relational databases over Pandas.

# Relational Database Concepts
# - Data is organized into tables with rows and columns.
# - Tables can have relationships (one-to-one, one-to-many, many-to-many).
# - Normalization reduces data redundancy and ensures data integrity.

# Normal Forms
# - First Normal Form (1NF): Each column contains atomic values, with no repeating groups.
# - Second Normal Form (2NF): Meets 1NF, and every non-key attribute depends on the whole primary key (no partial dependency on part of a composite key).
# - Third Normal Form (3NF): Meets 2NF, and no non-key attribute depends on another non-key attribute (no transitive dependencies).

# Pandas and Relational Data
# - Pandas DataFrames can represent tables but do not enforce relational constraints.
# - No built-in support for foreign keys or normalization rules.
# - Suitable for flat data or when relational integrity is managed manually.

# When to Use Relational Databases
# - Complex data relationships that require integrity constraints.
# - Multi-user environments where concurrent access is needed.
# - Large datasets that benefit from database indexing and query optimization.

# Key Points:
# - Relational databases provide structure and integrity for complex data relationships.
# - Pandas is flexible but lacks built-in relational database features.
# - Choose the appropriate tool based on the complexity and requirements of your data.
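
The difference between enforced and unenforced relationships can be seen in a small sketch. Below, country names live in their own table (so each name is stored once), and a foreign key makes SQLite reject a GDP row for an unknown country. The schema, the code 'XXX', and the names `conn_demo` and `orphan_rejected` are hypothetical; the example also assumes a SQLite build with foreign key support, which must be switched on per connection.

```python
import sqlite3

conn_demo = sqlite3.connect(":memory:")
# Foreign key enforcement is off by default in SQLite and is enabled per connection
conn_demo.execute("PRAGMA foreign_keys = ON")

conn_demo.execute(
    "CREATE TABLE countries (CountryCode TEXT PRIMARY KEY, CountryName TEXT NOT NULL)"
)
conn_demo.execute(
    """
    CREATE TABLE gdp (
        CountryCode TEXT NOT NULL REFERENCES countries(CountryCode),
        Year INTEGER NOT NULL,
        GDP INTEGER,
        PRIMARY KEY (CountryCode, Year)
    )
    """
)

conn_demo.execute("INSERT INTO countries VALUES ('USA', 'United States')")
conn_demo.execute("INSERT INTO gdp VALUES ('USA', 2020, 21137518)")

# 'XXX' has no matching row in countries, so the insert should be rejected
try:
    conn_demo.execute("INSERT INTO gdp VALUES ('XXX', 2020, 1)")
    orphan_rejected = False
except sqlite3.IntegrityError as error:
    orphan_rejected = True
    print("Rejected by SQLite:", error)
conn_demo.close()
```

A Pandas DataFrame would accept the same orphan row without complaint, which is why relational integrity has to be checked manually there.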


In [18]:
# Notebook 11: Working with Hierarchical Data

# Learning Objectives:
# - Identify hierarchical data structures and common use cases.
# - Use SQL recursive queries to navigate hierarchical data.
# - Attempt hierarchical data operations in Pandas and discuss limitations.

import pandas as pd
import sqlite3

# Connect to the SQLite database
conn = sqlite3.connect('workshop.db')

# Create an 'employees' table to represent hierarchical data
employees_data = {
    'EmployeeID': [1, 2, 3, 4, 5],
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'ManagerID': [None, 1, 1, 2, 2]
}
employees = pd.DataFrame(employees_data)
employees.to_sql('employees', conn, if_exists='replace', index=False)

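# SQL Recursive Query to get all subordinates of 'Alice'
# The anchor selects Alice's direct reports, so Alice herself is not part of the result
query_sql = """
WITH RECURSIVE subordinates AS (
    SELECT * FROM employees
    WHERE ManagerID = (SELECT EmployeeID FROM employees WHERE Name = 'Alice')
    UNION ALL
    SELECT e.* FROM employees e INNER JOIN subordinates s ON e.ManagerID = s.EmployeeID
)
SELECT * FROM subordinates;
"""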
subordinates_sql = pd.read_sql_query(query_sql, conn)
print("Subordinates of Alice using SQL recursive query:")
subordinates_sql

# Attempting the same in Pandas
def get_subordinates(df, manager_id):
    subs = df[df['ManagerID'] == manager_id]
    result = []
    for _, row in subs.iterrows():
        result.append(row)
        result.extend(get_subordinates(df, row['EmployeeID']))
    return result

# Get subordinates of 'Alice' (EmployeeID 1)
subordinates = get_subordinates(employees, 1)
subordinates_df = pd.DataFrame(subordinates)
print("Subordinates of Alice using Pandas:")
subordinates_df

# Discussion:
# - SQL handles hierarchical data efficiently using recursive queries.
# - Pandas requires custom recursive functions, which may be less efficient and harder to maintain.

# Key Points:
# - Hierarchical data represents relationships like organizational structures or file systems.
# - SQL provides built-in support for recursive queries to navigate hierarchical data.
# - Pandas can handle hierarchical data but may not be the most efficient tool for complex hierarchies.


Subordinates of Alice using SQL recursive query:
Subordinates of Alice using Pandas:


Unnamed: 0,EmployeeID,Name,ManagerID
1,2,Bob,1.0
3,4,David,2.0
4,5,Eve,2.0
2,3,Charlie,1.0
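
A recursive query can also report how far each subordinate is from the starting manager. The sketch below adds a `Level` column to the recursion and passes the manager's ID as a query parameter. It uses a separate in-memory database with the same employee data; the `Level` column and the names `conn_demo` and `subordinate_levels` are additions made for this example.

```python
import sqlite3

conn_demo = sqlite3.connect(":memory:")
conn_demo.execute(
    "CREATE TABLE employees (EmployeeID INTEGER PRIMARY KEY, Name TEXT, ManagerID INTEGER)"
)
conn_demo.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [(1, "Alice", None), (2, "Bob", 1), (3, "Charlie", 1), (4, "David", 2), (5, "Eve", 2)],
)

query = """
WITH RECURSIVE subordinates(EmployeeID, Name, Level) AS (
    -- Anchor: direct reports of the given manager are at level 1
    SELECT EmployeeID, Name, 1 FROM employees WHERE ManagerID = ?
    UNION ALL
    -- Recursive step: reports of anyone already found sit one level deeper
    SELECT e.EmployeeID, e.Name, s.Level + 1
    FROM employees e
    INNER JOIN subordinates s ON e.ManagerID = s.EmployeeID
)
SELECT Name, Level FROM subordinates ORDER BY Level, EmployeeID;
"""

# EmployeeID 1 is Alice
subordinate_levels = conn_demo.execute(query, (1,)).fetchall()
conn_demo.close()

for name, level in subordinate_levels:
    print(f"{name}: level {level}")
```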


# Notebook 12: Conclusion and Next Steps

# Learning Objectives:
- Summarize the key concepts learned throughout the workshop.
- Identify areas for further study and practice.
- Feel confident in applying SQL and Pandas to data analysis tasks.

# Summary of Key Concepts:

- SQL and Pandas are powerful tools for data manipulation and analysis.
- We can perform similar operations in both, but each has its strengths.
- SQL excels at handling large datasets and complex queries, especially with advanced features like window functions.
- Pandas is great for in-memory data manipulation and integrates well with the Python ecosystem.
- Understanding when to use each tool is crucial for efficient data analysis.
- Scaling data analysis may require additional tools like Dask or moving to more robust databases.
- Relational databases provide structure and integrity for complex data relationships.
- Handling hierarchical data may be more efficient in SQL due to built-in recursive query support.

# Next Steps:

- Practice applying these concepts to your own datasets.
- Explore more advanced SQL features and functions.
- Learn about database optimization and indexing.
- Experiment with big data tools and distributed computing frameworks.
- Continue developing your data science skills by integrating SQL and Pandas into your workflow.

# Key Takeaways:
- You now have a foundational understanding of integrating SQL with Python and Pandas.
- You can perform data manipulation tasks using both SQL queries and Pandas operations.
- You've learned how to handle large datasets and when to scale beyond Pandas.
- You're equipped to choose the right tools for different data analysis scenarios.
