### Chapter: Mastering SQL for Data Science with Python

This chapter introduces Structured Query Language (SQL), the standard language for communicating with and managing relational databases and an essential tool for data professionals. It focuses on SQL's application in data science, demonstrating how to use it for data retrieval, manipulation, and analysis. It also covers the integration of SQL with Python, a crucial skill for any data scientist.

### Introduction to SQL for Data Science
SQL is a declarative language, meaning you tell the database what you want, not how to get it. The database engine decides how to execute the request, which lets it optimize queries over large datasets. For data science, SQL is invaluable for:

* **Data Retrieval:** Extracting specific subsets of data from large databases.

* **Data Cleaning and Transformation:**  Handling missing values, standardizing formats, and creating new features.

* **Exploratory Data Analysis (EDA):**  Performing quick summaries, aggregations, and data profiling.

* **Feature Engineering:**  Creating new variables from existing ones before feeding them into machine learning models.
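As a rough illustration of the cleaning and feature engineering uses listed above, the sketch below runs a single query against an in-memory SQLite database. The `raw_sales` table, its `region` column, and the 100 threshold for `is_large_order` are assumptions made up for this example; they are not part of the chapter's dataset.

```python
import sqlite3

# In-memory database so the example leaves no files behind.
conn = sqlite3.connect(":memory:")

# Hypothetical raw table with messy text and missing values.
conn.execute(
    "CREATE TABLE raw_sales ("
    "order_id INTEGER PRIMARY KEY, region TEXT, sale_date TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO raw_sales VALUES (?, ?, ?, ?)",
    [
        (1, " north ", "2023-01-15", 150.0),
        (2, None, "2023-02-03", None),
        (3, "SOUTH", "2023-02-20", 75.25),
    ],
)

query = """
SELECT
    order_id,
    COALESCE(UPPER(TRIM(region)), 'UNKNOWN') AS region,       -- standardize format, fill missing
    COALESCE(amount, 0.0) AS amount,                          -- replace missing amounts with 0
    strftime('%m', sale_date) AS sale_month,                  -- new feature: month of sale
    CASE WHEN amount >= 100 THEN 1 ELSE 0 END AS is_large_order  -- new feature: binary flag
FROM raw_sales
ORDER BY order_id;
"""

cleaned = conn.execute(query).fetchall()
for row in cleaned:
    print(row)

conn.close()
```

Whether filling a missing amount with 0 is appropriate depends on the analysis; the point here is only that this kind of preparation can happen in the query itself, before any data reaches Python.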


A typical data science workflow using SQL might look like this:

* **Data Extraction:** You use a `SELECT` query to pull a specific subset of data relevant to your project.

* **Data Wrangling:** You perform initial cleaning, filtering (`WHERE`), and aggregation (`GROUP BY`) directly in the database.

* **Analysis:** The prepared data is loaded into Python (often as a Pandas DataFrame) for more sophisticated analysis, modeling, and visualization.

### Core SQL Commands: Your Essential Toolkit

Let's begin with the building blocks of SQL. We'll use a hypothetical `sales` table with columns like `order_id`, `customer_id`, `product_id`, `sale_date`, and `amount`.

* **SELECT:** The most fundamental command. It specifies which columns you want to see. Example: `SELECT customer_id, amount FROM sales;`

* **FROM:** Specifies the table you're querying.

* **WHERE:** Filters rows based on one or more conditions. Example: `SELECT * FROM sales WHERE amount > 100 AND sale_date >= '2023-01-01';`

* **ORDER BY:** Sorts the result set. `DESC` is for descending order, `ASC` for ascending. Example: `SELECT product_id, amount FROM sales ORDER BY amount DESC;`

* **GROUP BY:** Aggregates rows that have the same values into summary rows. This is perfect for calculating metrics like total sales per customer.

* **HAVING:** Filters the results of a `GROUP BY` clause. This is similar to `WHERE` but operates on the aggregated data.

* **JOIN:** This is how you combine data from multiple tables. The most common type is `INNER JOIN` (a plain `JOIN` is shorthand for it). Example: `SELECT s.order_id, p.product_name FROM sales s JOIN products p ON s.product_id = p.product_id;`

The diagram above illustrates how a JOIN connects records from two tables based on a common key.

### SQL and Python: The Colab Workflow

The `sqlite3` module is part of Python's standard library, so it is available in Google Colab without any installation, making Colab a convenient sandbox for practicing SQL with Python. We will create a simple, file-based database in Colab's temporary storage, allowing you to run everything in a single notebook cell. We will also use the popular pandas library, which can load the results of a SQL query directly into a DataFrame.

Let's walk through a complete example.
You can copy and paste the code below directly into a Colab cell.

```python
# Step 1: Import necessary libraries
import sqlite3
import pandas as pd

# Step 2: Connect to a SQLite database.
# The database file 'sales_data.db' will be created in the Colab environment.
conn = sqlite3.connect('sales_data.db')
cursor = conn.cursor()

# Step 3: Create tables and insert data using SQL queries
# We'll create two tables: 'sales' and 'products'
print("Creating and populating tables...")

# Create the sales table
cursor.execute('''
CREATE TABLE IF NOT EXISTS sales (
    order_id INTEGER PRIMARY KEY,
    customer_id INTEGER,
    product_id INTEGER,
    sale_date TEXT,
    amount REAL
);
''')

# Create the products table
cursor.execute('''
CREATE TABLE IF NOT EXISTS products (
    product_id INTEGER PRIMARY KEY,
    product_name TEXT
);
''')

# Insert data into the sales table
sales_data = [
    (1, 101, 1, '2023-01-01', 150.00),
    (2, 102, 2, '2023-01-02', 200.50),
    (3, 101, 3, '2023-01-03', 75.25),
    (4, 103, 1, '2023-01-04', 150.00),
    (5, 102, 2, '2023-01-05', 200.50)
]
cursor.executemany("INSERT OR IGNORE INTO sales VALUES (?, ?, ?, ?, ?)", sales_data)

# Insert data into the products table
products_data = [
    (1, 'Laptop'),
    (2, 'Monitor'),
    (3, 'Mouse')
]
cursor.executemany("INSERT OR IGNORE INTO products VALUES (?, ?)", products_data)

# Commit the changes to the database
conn.commit()
print("Database populated successfully!")

# Step 4: Execute SQL queries and fetch results
# Query 1: Find all sales with an amount greater than $150
query_1 = "SELECT * FROM sales WHERE amount > 150;"
cursor.execute(query_1)
results_1 = cursor.fetchall()

print("\n--- Sales with Amount > $150 ---")
for row in results_1:
    print(row)

# Step 5: Integrating SQL with Pandas
# This is a key data science workflow in Colab.
# We'll run a JOIN query and load the results directly into a DataFrame.

# Query 2: Get a list of all sales with product names
query_2 = """
SELECT
    s.order_id,
    s.sale_date,
    s.amount,
    p.product_name
FROM
    sales AS s
JOIN
    products AS p
ON
    s.product_id = p.product_id;
"""

print("\n--- Sales data loaded into a Pandas DataFrame ---")
df_sales = pd.read_sql_query(query_2, conn)
print(df_sales)

# Query 3: Find the total sales per customer using GROUP BY
query_3 = """
SELECT
    customer_id,
    SUM(amount) AS total_amount
FROM
    sales
GROUP BY
    customer_id
ORDER BY
    total_amount DESC;
"""

print("\n--- Total sales per customer ---")
df_summary = pd.read_sql_query(query_3, conn)
print(df_summary)

# Step 6: Close the database connection
conn.close()
print("\nConnection to database closed.")
```
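The walkthrough above exercises `WHERE`, `JOIN`, and `GROUP BY`, but not `HAVING`, and it only shows the inner form of a join. The sketch below fills those two gaps using the same sales rows in a throwaway in-memory database. The fourth product, `Keyboard`, and the threshold of 200 are assumptions added for illustration; `LEFT JOIN` is one of the other join types and is shown here only as a contrast to `INNER JOIN`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (
    order_id INTEGER PRIMARY KEY,
    customer_id INTEGER,
    product_id INTEGER,
    sale_date TEXT,
    amount REAL
);
CREATE TABLE products (
    product_id INTEGER PRIMARY KEY,
    product_name TEXT
);
""")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?, ?)",
    [
        (1, 101, 1, "2023-01-01", 150.00),
        (2, 102, 2, "2023-01-02", 200.50),
        (3, 101, 3, "2023-01-03", 75.25),
        (4, 103, 1, "2023-01-04", 150.00),
        (5, 102, 2, "2023-01-05", 200.50),
    ],
)
conn.executemany(
    "INSERT INTO products VALUES (?, ?)",
    # 'Keyboard' is a hypothetical product with no sales, added for this example.
    [(1, "Laptop"), (2, "Monitor"), (3, "Mouse"), (4, "Keyboard")],
)

# HAVING filters groups after aggregation; WHERE could not reference SUM(amount).
having_query = """
SELECT customer_id, SUM(amount) AS total_amount
FROM sales
GROUP BY customer_id
HAVING SUM(amount) > ?
ORDER BY total_amount DESC;
"""
big_customers = conn.execute(having_query, (200,)).fetchall()
print(big_customers)

# LEFT JOIN keeps every product, even those with no matching sales rows.
left_join_query = """
SELECT p.product_name, COUNT(s.order_id) AS n_orders
FROM products AS p
LEFT JOIN sales AS s ON s.product_id = p.product_id
GROUP BY p.product_id, p.product_name
ORDER BY p.product_id;
"""
order_counts = conn.execute(left_join_query).fetchall()
print(order_counts)

conn.close()
```

Passing the threshold through a `?` placeholder instead of formatting it into the string keeps the query text fixed and lets the driver handle quoting. With an inner join, the product without sales would simply not appear in the second result.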

### Next Steps: Beyond the Basics

This chapter provides a solid foundation. In a production environment, you would typically connect to a more robust database like PostgreSQL, MySQL, or a cloud-based service like BigQuery. The syntax remains largely the same, but you would use a different Python library, such as psycopg2 for PostgreSQL. The workflow of connecting, querying, and loading into Pandas remains consistent.

Remember to experiment with different queries and clauses. The best way to learn is by doing!
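One way to see why the workflow carries over is that `sqlite3` and `psycopg2` both follow Python's DB-API 2.0 interface (connections, cursors, `execute`, `fetchall`). The sketch below wraps that shared pattern in a small helper; `fetch_rows` is a hypothetical name introduced here, and the demo runs against SQLite only because a PostgreSQL server cannot be assumed. One difference to keep in mind: `sqlite3` uses `?` as its parameter placeholder, while `psycopg2` uses `%s`, so query strings with parameters need adjusting when switching drivers.

```python
import sqlite3


def fetch_rows(conn, query, params=()):
    """Run a query on a DB-API 2.0 connection and return (column_names, rows).

    Hypothetical helper for illustration; it relies only on the cursor
    methods that DB-API drivers are expected to provide.
    """
    cursor = conn.cursor()
    try:
        cursor.execute(query, params)
        # cursor.description holds one entry per result column; the name comes first.
        columns = [col[0] for col in cursor.description]
        rows = cursor.fetchall()
    finally:
        cursor.close()
    return columns, rows


# Demo with SQLite; a different driver would supply the connection object instead.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (order_id INTEGER PRIMARY KEY, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [(1, 150.0), (2, 200.5), (3, 75.25)],
)

columns, rows = fetch_rows(
    conn,
    "SELECT order_id, amount FROM sales WHERE amount > ? ORDER BY order_id",
    (100,),
)
print(columns)
print(rows)

conn.close()
```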

## Knowledge Check

<iframe src="https://docs.google.com/forms/d/e/1FAIpQLSduyV-41gyQSCvhPZVweI7VZjrayWSMa2OFB-ra-BsTnRPgeQ/viewform?embedded=true" width="100%" height="800px" frameborder="0" style="min-height: 800px; height: 100vh">Loading…</iframe>


