<a href="https://colab.research.google.com/github/brendanpshea/data-science/blob/main/DataScience_05_WriteBetterQueries.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Science Chapter 5: Writing Better Queries

When you work with large amounts of data, writing effective queries is a core skill that can significantly affect the success of your projects. **Efficient queries** are the cornerstone of data analysis: they let you extract meaningful insights from large volumes of information with precision and speed.

But why does "better" matter when it comes to writing queries? The answer lies in the **three critical factors** that define the quality of your data interactions:

1. *Performance*. Better queries run faster and consume fewer computational resources. In a world where time is money, this translates to cost savings and quicker insights.

2. *Accuracy*. Improved query writing ensures you're extracting exactly the data you need. This precision minimizes errors and misinterpretations that could lead to flawed analyses.

3. *Scalability*. As your data grows, well-crafted queries continue to perform efficiently, allowing your analyses to scale seamlessly.

Throughout this chapter, we'll explore techniques to enhance your query writing skills, focusing on data manipulation, optimization strategies, and best practices. By the end, you'll be equipped to craft queries that not only retrieve data but do so with elegance and efficiency.

Remember, in the world of big data, the difference between a good query and a great one can be the difference between drowning in information and surfing the waves of insight. Let's dive in and learn how to write better queries!


## Sample Data Set: Zombie Attacks!
For this chapter, we'll be working with a data set about zombie attacks. Let's start by loading our data set and taking a look.

In [1]:
!wget https://github.com/brendanpshea/data-science/raw/main/data/zombie_attacks.csv -q -nc

In [2]:
## load csv file into a sqlite database
import pandas as pd
import sqlite3

# Load the CSV file into a DataFrame
df = pd.read_csv('zombie_attacks.csv')

# Save to SQLite
conn = sqlite3.connect('zombie_attacks.db')
df.to_sql('ZombieAttacks', conn, if_exists='replace', index=False)
conn.close()

### Getting to Know Our Data
Now, let's connect to the database and take a look at our data.

In [3]:
%reload_ext sql
%config SqlMagic.autopandas = True
%sql sqlite:///zombie_attacks.db

In [4]:
%%sql
--Get table schema
PRAGMA table_info(ZombieAttacks);

 * sqlite:///zombie_attacks.db
Done.


Unnamed: 0,cid,name,type,notnull,dflt_value,pk
0,0,Date,TEXT,0,,0
1,1,Location,TEXT,0,,0
2,2,ZombieType,TEXT,0,,0
3,3,VictimCount,REAL,0,,0
4,4,SurvivalRate,REAL,0,,0
5,5,WeatherCondition,TEXT,0,,0
6,6,MoonPhase,TEXT,0,,0
7,7,TemperatureCelsius,REAL,0,,0
8,8,HumidityPercent,REAL,0,,0
9,9,WindSpeedKmh,REAL,0,,0


In [5]:
%%sql
SELECT *
FROM ZombieAttacks
LIMIT 5;

 * sqlite:///zombie_attacks.db
Done.


Unnamed: 0,Date,Location,ZombieType,VictimCount,SurvivalRate,WeatherCondition,MoonPhase,TemperatureCelsius,HumidityPercent,WindSpeedKmh,PopulationDensity,EmergencyResponseTime,Month
0,2023-02-24,Des Moines,Runner,25.5,0.2215,Foggy,Full Moon,12.8,58.9,8.1,65.1,8.7,2.0
1,2023-09-29,Rochester,Walker,13.0,0.3739,Stormy,Waxing Crescent,21.2,46.4,35.6,594.5,4.0,9.0
2,2023-06-01,Rochester,Walker,12.0,0.1924,Cloudy,New Moon,20.6,31.9,9.9,236.4,13.2,6.0
3,2023-02-14,St. Louis,Crawler,7.0,0.7949,Stormy,New Moon,10.6,57.9,5.3,156.8,4.7,2.0
4,2023-08-31,Winnipeg,Runner,39.0,0.0678,Sunny,Full Moon,16.7,45.1,3.8,19.6,6.9,8.0


### Data Dictionary for `zombie_attacks.csv`

| **Column Name** | **Data Type** | **Description** |
| --- | --- | --- |
| `Date` | `datetime` | The date of the recorded zombie attack. |
| `Location` | `string` | The location where the zombie attack occurred, centered around major cities near Minneapolis. |
| `ZombieType` | `string` | The type of zombie involved in the attack, with possible values: 'Walker', 'Runner', 'Crawler', 'Jumper'. |
| `VictimCount` | `integer` | The number of victims in the zombie attack. |
| `SurvivalRate` | `float` | The survival rate of victims, represented as a proportion between 0 and 1. |
| `WeatherCondition` | `string` | The weather condition at the time of the attack, with possible values: 'Sunny', 'Rainy', 'Cloudy', 'Foggy', 'Stormy'. |
| `MoonPhase` | `string` | The phase of the moon at the time of the attack, with possible values: 'New Moon', 'Waxing Crescent', 'First Quarter', 'Waxing Gibbous', 'Full Moon', 'Waning Gibbous', 'Last Quarter', 'Waning Crescent'. |
| `TemperatureCelsius` | `float` | The temperature in degrees Celsius at the time of the attack, adjusted for weather conditions and location-specific patterns. |
| `HumidityPercent` | `float` | The humidity percentage at the time of the attack. |
| `WindSpeedKmh` | `float` | The wind speed in kilometers per hour at the time of the attack. |
| `PopulationDensity` | `float` | The population density of the location where the attack occurred. |
| `EmergencyResponseTime` | `float` | The time in minutes for emergency response to arrive at the scene of the attack. |
| `Month` | `integer` | The month of the year when the attack occurred, extracted from the `Date` column. |

## Filtering Data

Filtering is a fundamental operation in data analysis, allowing you to extract specific subsets of data based on certain conditions. In SQL, filtering is primarily done using the `WHERE` clause. Let's explore various filtering techniques using our Zombie Attacks dataset.

### Basic Filtering

The simplest form of filtering uses comparison operators.

**Example: Find all zombie attacks with more than 20 victims**

In [15]:
%%sql
SELECT
    Date,
    Location,
    ZombieType,
    VictimCount
FROM ZombieAttacks
WHERE VictimCount > 20
LIMIT 5;

 * sqlite:///zombie_attacks.db
Done.


Unnamed: 0,Date,Location,ZombieType,VictimCount
0,2023-02-24,Des Moines,Runner,25.5
1,2023-08-31,Winnipeg,Runner,39.0
2,2023-01-03,Chicago,Walker,25.0
3,2023-03-08,,Walker,23.0
4,2023-01-08,Chicago,Runner,28.5


This query selects attacks where the victim count exceeds 20, and `LIMIT 5` restricts the output to the first five matches. Other comparison operators include `<` (less than), `=` (equal to), `>=` (greater than or equal to), `<=` (less than or equal to), and `<>` (not equal to).
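
If you want to run this kind of comparison filter from plain Python rather than through the `%%sql` magic, one option is to bind the threshold as a query parameter. The following is a minimal sketch, assuming a small in-memory table; the four rows and the helper `attacks_above` are hypothetical stand-ins, not the real contents of `zombie_attacks.csv`.

```python
import sqlite3

# Hypothetical sample rows; the real table is loaded from zombie_attacks.csv.
rows = [
    ("2023-02-24", "Des Moines", "Runner", 25.5),
    ("2023-09-29", "Rochester", "Walker", 13.0),
    ("2023-02-14", "St. Louis", "Crawler", 7.0),
    ("2023-08-31", "Winnipeg", "Runner", 39.0),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ZombieAttacks "
    "(Date TEXT, Location TEXT, ZombieType TEXT, VictimCount REAL)"
)
conn.executemany("INSERT INTO ZombieAttacks VALUES (?, ?, ?, ?)", rows)


def attacks_above(threshold):
    """Return (Location, VictimCount) pairs with more victims than threshold."""
    query = """
        SELECT Location, VictimCount
        FROM ZombieAttacks
        WHERE VictimCount > ?
        ORDER BY VictimCount
    """
    # The ? placeholder is filled in by sqlite3, not by string formatting.
    return conn.execute(query, (threshold,)).fetchall()


big_attacks = attacks_above(20)
print(big_attacks)
```

Binding the value with `?` keeps the SQL text fixed and lets you reuse the same query with different thresholds.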

#### Filtering with Multiple Conditions

You can combine multiple conditions using logical operators like `AND`, `OR`, and `NOT`.

**Example: Find Runner zombies in Rochester with more than 10 victims**

In [16]:
%%sql
SELECT
    Date,
    Location,
    ZombieType,
    VictimCount
FROM ZombieAttacks
WHERE ZombieType = 'Runner'
  AND Location = 'Rochester'
  AND VictimCount > 10
LIMIT 5;

 * sqlite:///zombie_attacks.db
Done.


Unnamed: 0,Date,Location,ZombieType,VictimCount
0,2023-11-06,Rochester,Runner,19.5
1,2023-06-22,Rochester,Runner,19.5
2,2023-06-02,Rochester,Runner,18.0
3,2023-08-08,Rochester,Runner,10.5
4,2023-11-15,Rochester,Runner,24.0


This query demonstrates the use of `AND` to combine multiple conditions.
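
When you mix `AND` and `OR`, keep in mind that `AND` binds more tightly than `OR`, so parentheses can change which rows come back. Here is a small sketch of that difference, assuming a hypothetical four-row table rather than the real dataset.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ZombieAttacks (Location TEXT, ZombieType TEXT, VictimCount REAL)"
)
# Hypothetical rows chosen to make the difference visible.
conn.executemany(
    "INSERT INTO ZombieAttacks VALUES (?, ?, ?)",
    [
        ("Rochester", "Runner", 19.5),
        ("Rochester", "Walker", 5.0),
        ("Chicago", "Runner", 4.0),
        ("Chicago", "Walker", 30.0),
    ],
)

# Parsed as: Runner OR (Walker AND VictimCount > 10)
unparenthesized = conn.execute("""
    SELECT Location, ZombieType, VictimCount
    FROM ZombieAttacks
    WHERE ZombieType = 'Runner' OR ZombieType = 'Walker' AND VictimCount > 10
    ORDER BY VictimCount
""").fetchall()

# Parentheses force: (Runner OR Walker) AND VictimCount > 10
parenthesized = conn.execute("""
    SELECT Location, ZombieType, VictimCount
    FROM ZombieAttacks
    WHERE (ZombieType = 'Runner' OR ZombieType = 'Walker') AND VictimCount > 10
    ORDER BY VictimCount
""").fetchall()

print(len(unparenthesized), len(parenthesized))
```

The first query keeps the Chicago Runner attack with only 4 victims, because the victim-count condition applies only to Walkers there.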

#### Filtering with IN Clause

The `IN` clause is useful when you want to match against multiple possible values.

**Example: Find attacks in either Rochester, Minneapolis, or Madison**

In [17]:
%%sql
SELECT
    Date,
    Location,
    ZombieType,
    VictimCount
FROM ZombieAttacks
WHERE Location IN ('Rochester', 'Minneapolis', 'Madison')
LIMIT 5;

 * sqlite:///zombie_attacks.db
Done.


Unnamed: 0,Date,Location,ZombieType,VictimCount
0,2023-09-29,Rochester,Walker,13.0
1,2023-06-01,Rochester,Walker,12.0
2,2023-08-06,Minneapolis,Walker,6.0
3,2023-02-21,Rochester,Walker,10.0
4,2023-01-16,Minneapolis,Walker,6.0


This query retrieves attacks from any of the specified cities.
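
The opposite filter, `NOT IN`, has a subtlety worth knowing: rows where the column is NULL are not returned, because comparing NULL to anything yields NULL rather than true. The sketch below illustrates this with a hypothetical table that includes one row with a missing location.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ZombieAttacks (Location TEXT, VictimCount REAL)")
# Hypothetical rows; None becomes NULL in SQLite.
conn.executemany(
    "INSERT INTO ZombieAttacks VALUES (?, ?)",
    [("Rochester", 13.0), ("Minneapolis", 6.0), ("Chicago", 25.0), (None, 23.0)],
)


def locations(where_clause):
    """Return the Location values matching a fixed, hard-coded WHERE clause."""
    sql = f"SELECT Location FROM ZombieAttacks WHERE {where_clause} ORDER BY Location"
    return [row[0] for row in conn.execute(sql)]


in_cities = locations("Location IN ('Rochester', 'Minneapolis')")
not_in_cities = locations("Location NOT IN ('Rochester', 'Minneapolis')")
# Add an explicit NULL check if rows with missing locations should be kept.
not_in_or_missing = locations(
    "Location NOT IN ('Rochester', 'Minneapolis') OR Location IS NULL"
)

print(in_cities, not_in_cities, not_in_or_missing)
```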

#### Filtering with LIKE Clause

The `LIKE` clause is used for pattern matching in string fields.

**Example: Find all attacks where the weather condition includes the word "Rain"**

In [18]:
%%sql
SELECT
    Date,
    Location,
    ZombieType,
    WeatherCondition
FROM ZombieAttacks
WHERE WeatherCondition LIKE '%Rain%'
LIMIT 5;

 * sqlite:///zombie_attacks.db
Done.


Unnamed: 0,Date,Location,ZombieType,WeatherCondition
0,2023-03-06,Kansas City,Crawler,Rainy
1,2023-02-21,Rochester,Walker,Rainy
2,2023-09-10,Milwaukee,Walker,Rainy
3,2023-12-03,Des Moines,Walker,Rainy
4,2023-05-28,Minneapolis,Runner,Rainy


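In general, this pattern matches any value that contains "Rain", such as 'Rainy', 'Light Rain', or 'Heavy Rain'; in this dataset, only 'Rainy' qualifies. The `%` wildcard matches any sequence of zero or more characters. Note that in SQLite, `LIKE` is case-insensitive for ASCII letters by default, so `'%rain%'` would return the same rows.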
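
Here is a short sketch of how the wildcards behave, assuming a hypothetical table of weather values. Note that 'Light Rain' is made up for illustration and does not appear in the real dataset; the `_` wildcard, which matches exactly one character, is also shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ZombieAttacks (WeatherCondition TEXT)")
# Hypothetical values; 'Light Rain' is invented for this example.
conn.executemany(
    "INSERT INTO ZombieAttacks VALUES (?)",
    [("Rainy",), ("Light Rain",), ("Sunny",), ("Stormy",)],
)


def matching(pattern):
    """Return the weather values that match a LIKE pattern, sorted."""
    sql = """
        SELECT WeatherCondition
        FROM ZombieAttacks
        WHERE WeatherCondition LIKE ?
        ORDER BY WeatherCondition
    """
    return [row[0] for row in conn.execute(sql, (pattern,))]


contains_rain = matching("%rain%")   # lowercase pattern still matches 'Rainy'
single_char = matching("S_nny")      # _ stands for exactly one character
starts_with_s = matching("S%")       # % stands for zero or more characters

print(contains_rain, single_char, starts_with_s)
```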

#### Filtering with BETWEEN Clause

`BETWEEN` is used to filter values within a range.

**Example: Find attacks with temperatures between 15°C and 25°C**

In [19]:
%%sql
SELECT
    Date,
    Location,
    ZombieType,
    TemperatureCelsius
FROM ZombieAttacks
WHERE TemperatureCelsius BETWEEN 15 AND 25
LIMIT 5;

 * sqlite:///zombie_attacks.db
Done.


Unnamed: 0,Date,Location,ZombieType,TemperatureCelsius
0,2023-09-29,Rochester,Walker,21.2
1,2023-06-01,Rochester,Walker,20.6
2,2023-08-31,Winnipeg,Runner,16.7
3,2023-04-24,Chicago,Walker,17.4
4,2023-03-06,Kansas City,Crawler,24.0


This query retrieves all attacks that occurred when the temperature was between 15°C and 25°C, inclusive.
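
To see what "inclusive" means in practice, the following sketch compares `BETWEEN` with the equivalent pair of `>=` and `<=` comparisons. The temperatures are hypothetical values picked to sit right at and just outside the boundaries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ZombieAttacks (TemperatureCelsius REAL)")
# Hypothetical temperatures around the 15 and 25 degree boundaries.
conn.executemany(
    "INSERT INTO ZombieAttacks VALUES (?)",
    [(14.9,), (15.0,), (20.6,), (25.0,), (25.1,)],
)

with_between = [
    row[0]
    for row in conn.execute("""
        SELECT TemperatureCelsius
        FROM ZombieAttacks
        WHERE TemperatureCelsius BETWEEN 15 AND 25
        ORDER BY TemperatureCelsius
    """)
]

with_comparisons = [
    row[0]
    for row in conn.execute("""
        SELECT TemperatureCelsius
        FROM ZombieAttacks
        WHERE TemperatureCelsius >= 15 AND TemperatureCelsius <= 25
        ORDER BY TemperatureCelsius
    """)
]

print(with_between)
```

Both boundary values (15.0 and 25.0) are kept, while 14.9 and 25.1 are filtered out.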

#### Filtering Null Values

Sometimes, you need to filter based on the presence or absence of data.

**Example: Find attacks where the emergency response time is not recorded**

In [20]:
%%sql
SELECT
    Date,
    Location,
    ZombieType,
    EmergencyResponseTime
FROM ZombieAttacks
WHERE EmergencyResponseTime IS NULL
LIMIT 5;

 * sqlite:///zombie_attacks.db
Done.


Unnamed: 0,Date,Location,ZombieType,EmergencyResponseTime
0,2023-10-23,Chicago,Walker,
1,2023-01-21,Madison,Crawler,
2,2023-05-16,Minneapolis,Jumper,
3,2023-08-28,Fargo,Runner,
4,2023-02-25,Fargo,Runner,


Use `IS NULL` to find rows where a column has no value, and `IS NOT NULL` to find rows where a column has any value.
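
A common mistake is to write `= NULL` instead of `IS NULL`. The comparison `column = NULL` never evaluates to true, so it matches no rows. The sketch below shows the difference on a hypothetical four-row table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ZombieAttacks (Location TEXT, EmergencyResponseTime REAL)")
# Hypothetical rows; None is stored as NULL.
conn.executemany(
    "INSERT INTO ZombieAttacks VALUES (?, ?)",
    [("Des Moines", 8.7), ("Chicago", None), ("Rochester", 4.0), ("Fargo", None)],
)


def count_where(condition):
    """Count rows matching a fixed, hard-coded condition."""
    sql = f"SELECT COUNT(*) FROM ZombieAttacks WHERE {condition}"
    return conn.execute(sql).fetchone()[0]


equals_null = count_where("EmergencyResponseTime = NULL")   # never true
is_null = count_where("EmergencyResponseTime IS NULL")
is_not_null = count_where("EmergencyResponseTime IS NOT NULL")

print(equals_null, is_null, is_not_null)
```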

#### Filtering with Subqueries

Subqueries allow you to use the result of one query to filter another.

**Example: Find attacks with above-average victim counts**

In [21]:
%%sql
SELECT
    Date,
    Location,
    ZombieType,
    VictimCount
FROM ZombieAttacks
WHERE VictimCount > (SELECT AVG(VictimCount) FROM ZombieAttacks)
LIMIT 5;

 * sqlite:///zombie_attacks.db
Done.


Unnamed: 0,Date,Location,ZombieType,VictimCount
0,2023-02-24,Des Moines,Runner,25.5
1,2023-09-29,Rochester,Walker,13.0
2,2023-06-01,Rochester,Walker,12.0
3,2023-08-31,Winnipeg,Runner,39.0
4,2023-10-23,Chicago,Walker,12.0


This query first calculates the average victim count across all attacks, then returns only the attacks that exceed this average.
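
A subquery can also refer to the current row of the outer query, which is usually called a correlated subquery. As a hedged sketch, the example below compares each attack against the average for its own zombie type rather than the overall average. The rows and the aliases `a` and `b` are hypothetical choices for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ZombieAttacks (ZombieType TEXT, VictimCount REAL)")
# Hypothetical rows: overall average is 12.8, Runner average 20, Walker average 8.
conn.executemany(
    "INSERT INTO ZombieAttacks VALUES (?, ?)",
    [("Runner", 10.0), ("Runner", 30.0), ("Walker", 4.0), ("Walker", 8.0), ("Walker", 12.0)],
)

# Uncorrelated: one average for the whole table.
above_overall = conn.execute("""
    SELECT ZombieType, VictimCount
    FROM ZombieAttacks
    WHERE VictimCount > (SELECT AVG(VictimCount) FROM ZombieAttacks)
    ORDER BY ZombieType
""").fetchall()

# Correlated: the inner query is evaluated per zombie type of the outer row.
above_type_average = conn.execute("""
    SELECT a.ZombieType, a.VictimCount
    FROM ZombieAttacks AS a
    WHERE a.VictimCount > (
        SELECT AVG(b.VictimCount)
        FROM ZombieAttacks AS b
        WHERE b.ZombieType = a.ZombieType
    )
    ORDER BY a.ZombieType
""").fetchall()

print(above_overall, above_type_average)
```

The Walker attack with 12 victims is below the overall average but above the Walker average, so only the second query returns it.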

## Sorting Data

Sorting is a crucial operation in data analysis, allowing you to organize your query results in a specific order. In SQL, sorting is primarily done using the `ORDER BY` clause. Let's explore various sorting techniques using our Zombie Attacks dataset.

#### Basic Sorting

The simplest form of sorting arranges data based on a single column, either in ascending (ASC) or descending (DESC) order.

**Example: Sort zombie attacks by date, showing the most recent first**

In [22]:
%%sql
SELECT
    Date,
    Location,
    ZombieType,
    VictimCount
FROM ZombieAttacks
ORDER BY Date DESC
LIMIT 5;

 * sqlite:///zombie_attacks.db
Done.


Unnamed: 0,Date,Location,ZombieType,VictimCount
0,2023-12-31,Des Moines,Runner,6.0
1,2023-12-31,Des Moines,Walker,6.0
2,2023-12-31,Winnipeg,Runner,18.0
3,2023-12-30,Milwaukee,Walker,9.0
4,2023-12-30,Milwaukee,Walker,8.0


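This query returns the 5 most recent zombie attacks. By default, `ORDER BY` sorts in ascending order, so we use `DESC` to get the most recent dates first. This works because the `Date` column stores dates as 'YYYY-MM-DD' text, which sorts chronologically even when compared as plain strings.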
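
Sorting dates stored as text depends on the format. The sketch below, which assumes a hypothetical table holding the same three dates in two formats, shows that 'YYYY-MM-DD' strings sort chronologically while 'MM/DD/YYYY' strings do not.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Attacks (IsoDate TEXT, UsDate TEXT)")
# Hypothetical dates, each stored in two text formats.
conn.executemany(
    "INSERT INTO Attacks VALUES (?, ?)",
    [
        ("2023-12-31", "12/31/2023"),
        ("2024-01-15", "01/15/2024"),
        ("2023-02-01", "02/01/2023"),
    ],
)

# ISO 8601 text: string order equals chronological order.
by_iso = [r[0] for r in conn.execute("SELECT IsoDate FROM Attacks ORDER BY IsoDate DESC")]

# US-style text: sorted by month digits first, so the newest date ends up last.
by_us = [r[0] for r in conn.execute("SELECT UsDate FROM Attacks ORDER BY UsDate DESC")]

print(by_iso)
print(by_us)
```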

#### Sorting by Multiple Columns

You can sort by multiple columns to create a hierarchical order.

**Example: Sort attacks by location, then by date within each location**

In [25]:
%%sql
SELECT
    Date,
    Location,
    ZombieType,
    VictimCount
FROM ZombieAttacks
WHERE Location IS NOT NULL
AND Date IS NOT NULL
ORDER BY Location, Date
LIMIT 5;

 * sqlite:///zombie_attacks.db
Done.


Unnamed: 0,Date,Location,ZombieType,VictimCount
0,2023-01-03,Chicago,Walker,25.0
1,2023-01-05,Chicago,Crawler,8.0
2,2023-01-05,Chicago,Walker,17.0
3,2023-01-07,Chicago,Walker,5.0
4,2023-01-07,Chicago,Runner,7.5


This query first sorts the attacks alphabetically by location, and then within each location, it sorts by date in ascending order (the default), so the earliest attacks appear first.
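
Each column in `ORDER BY` can have its own direction. As a small sketch with hypothetical rows, the query below sorts locations alphabetically but lists the most recent attack first within each location.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ZombieAttacks (Date TEXT, Location TEXT)")
# Hypothetical rows: two attacks in each of two cities.
conn.executemany(
    "INSERT INTO ZombieAttacks VALUES (?, ?)",
    [
        ("2023-01-03", "Chicago"),
        ("2023-03-01", "Fargo"),
        ("2023-01-07", "Chicago"),
        ("2023-02-01", "Fargo"),
    ],
)

# Location ascending (A to Z), Date descending (newest first) within each location.
mixed_order = conn.execute("""
    SELECT Location, Date
    FROM ZombieAttacks
    ORDER BY Location ASC, Date DESC
""").fetchall()

for row in mixed_order:
    print(row)
```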

#### Sorting with Expressions

You can use expressions in the `ORDER BY` clause to sort based on computed values.

**Example: Sort attacks by estimated casualties, computed from the survival rate and victim count**

In [27]:
%%sql
SELECT
    Date,
    Location,
    ZombieType,
    (1 - SurvivalRate) * VictimCount AS Casualties
FROM ZombieAttacks
ORDER BY Casualties DESC
LIMIT 5;

 * sqlite:///zombie_attacks.db
Done.


Unnamed: 0,Date,Location,ZombieType,Casualties
0,2023-08-31,Winnipeg,Runner,36.3558
1,2023-08-07,Minneapolis,Runner,31.0707
2,2023-07-19,Winnipeg,Runner,29.99745
3,2023-04-12,Des Moines,Runner,29.11545
4,2023-01-08,Chicago,Runner,26.7843


This query calculates a 'Casualties' value and sorts the results based on this calculated field, showing the 5 attacks with the highest casualty counts.
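
The expression does not have to appear in the `SELECT` list; it can be written directly in `ORDER BY`. The sketch below, using hypothetical rows, contrasts sorting by raw victim count with sorting by the computed casualty estimate.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ZombieAttacks (Location TEXT, VictimCount REAL, SurvivalRate REAL)"
)
# Hypothetical rows; Chicago has many victims but a high survival rate.
conn.executemany(
    "INSERT INTO ZombieAttacks VALUES (?, ?, ?)",
    [
        ("Winnipeg", 39.0, 0.0678),
        ("Chicago", 30.0, 0.9),
        ("Des Moines", 25.5, 0.2215),
        ("St. Louis", 7.0, 0.7949),
    ],
)

by_victims = [
    r[0]
    for r in conn.execute("SELECT Location FROM ZombieAttacks ORDER BY VictimCount DESC")
]

# The sort key is computed per row and never returned to the caller.
by_casualties = [
    r[0]
    for r in conn.execute("""
        SELECT Location
        FROM ZombieAttacks
        ORDER BY (1 - SurvivalRate) * VictimCount DESC
    """)
]

print(by_victims)
print(by_casualties)
```

Chicago drops from second to third place once survival is taken into account.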

#### Sorting with Case Statements

Case statements in the `ORDER BY` clause allow for complex, conditional sorting.

**Example: Sort zombie types in a specific order, then by date**

In [29]:
%%sql
SELECT
    Date,
    Location,
    ZombieType,
    VictimCount
FROM ZombieAttacks
ORDER BY
    CASE ZombieType
        WHEN 'Runner' THEN 1
        WHEN 'Jumper' THEN 2
        WHEN 'Walker' THEN 3
        WHEN 'Crawler' THEN 4
        ELSE 5
    END,
    Date DESC
LIMIT 5;

 * sqlite:///zombie_attacks.db
Done.


Unnamed: 0,Date,Location,ZombieType,VictimCount
0,2023-12-31,Des Moines,Runner,6.0
1,2023-12-31,Winnipeg,Runner,18.0
2,2023-12-25,Des Moines,Runner,18.0
3,2023-12-24,Des Moines,Runner,9.0
4,2023-12-22,Fargo,Runner,25.5


This query sorts the attacks first by a custom zombie type order, and then by date within each type.
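
The `ELSE` branch decides where unexpected values land. In the sketch below, a made-up zombie type ('Shambler', which is not in the real dataset) falls through to `ELSE 5` and is therefore sorted last.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ZombieAttacks (ZombieType TEXT)")
# Hypothetical rows; 'Shambler' is an invented type with no WHEN branch.
conn.executemany(
    "INSERT INTO ZombieAttacks VALUES (?)",
    [("Walker",), ("Shambler",), ("Crawler",), ("Runner",), ("Jumper",)],
)

custom_order = [
    row[0]
    for row in conn.execute("""
        SELECT ZombieType
        FROM ZombieAttacks
        ORDER BY
            CASE ZombieType
                WHEN 'Runner' THEN 1
                WHEN 'Jumper' THEN 2
                WHEN 'Walker' THEN 3
                WHEN 'Crawler' THEN 4
                ELSE 5
            END
    """)
]

print(custom_order)
```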

#### Sorting with Nulls

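By default, NULL values are sorted differently in different database systems. SQLite treats NULL as smaller than any other value, so NULLs come first in ascending order and last in descending order. The `NULLS FIRST` and `NULLS LAST` keywords (supported in SQLite 3.30.0 and later) let you control their position explicitly.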

**Example: Sort attacks by emergency response time, with missing values last**

In [30]:
%%sql
SELECT *
FROM ZombieAttacks
ORDER BY EmergencyResponseTime DESC NULLS LAST
LIMIT 5;

 * sqlite:///zombie_attacks.db
Done.


Unnamed: 0,Date,Location,ZombieType,VictimCount,SurvivalRate,WeatherCondition,MoonPhase,TemperatureCelsius,HumidityPercent,WindSpeedKmh,PopulationDensity,EmergencyResponseTime,Month
0,2023-09-20,Des Moines,Runner,12.0,0.2408,,Waning Gibbous,34.6,43.2,0.4,149.9,50.7,9.0
1,2023-07-07,Madison,Runner,12.0,0.1891,Rainy,Last Quarter,14.8,75.1,19.4,225.8,42.4,7.0
2,2023-03-04,Winnipeg,Crawler,6.0,0.2017,Rainy,Full Moon,1.4,60.5,13.9,29.7,39.9,3.0
3,2023-04-28,Minneapolis,Crawler,9.0,0.1004,Cloudy,Waning Crescent,20.3,51.3,7.2,32.1,38.9,4.0
4,2023-09-12,Madison,Jumper,8.0,0.4428,Foggy,Last Quarter,14.9,80.9,20.2,590.7,35.5,9.0


This query ensures that attacks with no recorded emergency response time appear at the end of the sorted list.
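
If you cannot rely on `NULLS LAST` being available, one commonly used alternative is to sort on `column IS NULL` first, since that expression is 0 for real values and 1 for NULLs. The following sketch shows SQLite's default placement and this workaround on a hypothetical three-row table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ZombieAttacks (EmergencyResponseTime REAL)")
# Hypothetical rows; None is stored as NULL.
conn.executemany("INSERT INTO ZombieAttacks VALUES (?)", [(8.7,), (None,), (4.0,)])


def ordered(order_by):
    """Return response times using a fixed, hard-coded ORDER BY clause."""
    sql = f"SELECT EmergencyResponseTime FROM ZombieAttacks ORDER BY {order_by}"
    return [row[0] for row in conn.execute(sql)]


default_asc = ordered("EmergencyResponseTime")         # NULL sorts first
default_desc = ordered("EmergencyResponseTime DESC")   # NULL sorts last
# Sort non-NULL rows (IS NULL = 0) ahead of NULL rows (IS NULL = 1).
asc_nulls_last = ordered("EmergencyResponseTime IS NULL, EmergencyResponseTime")

print(default_asc, default_desc, asc_nulls_last)
```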

## Date Functions

Working with dates and times is a common task in data analysis. SQLite, despite its lightweight nature, provides several useful functions for handling date and time data. Let's explore some of the basic date functions in SQLite that are particularly useful for beginners.

#### Basic SQLite Date Functions

**date(timestring, modifier)**: This function returns the date in the format 'YYYY-MM-DD'. Example:

In [37]:
%%sql
SELECT date('now');

 * sqlite:///zombie_attacks.db
Done.


Unnamed: 0,date('now')
0,2024-07-05


**time(timestring, modifier)**: This function returns the time in the format 'HH:MM:SS'. Example:

In [38]:
%%sql
SELECT time('now');

 * sqlite:///zombie_attacks.db
Done.


Unnamed: 0,time('now')
0,13:14:53


**datetime(timestring, modifier)**: This function returns the date and time in the format 'YYYY-MM-DD HH:MM:SS'. Example:

In [39]:
%%sql
SELECT datetime('now');

 * sqlite:///zombie_attacks.db
Done.


Unnamed: 0,datetime('now')
0,2024-07-05 13:15:27


**strftime(format, timestring, modifier)**: This versatile function allows you to format date and time in various ways. Example:

In [42]:
%%sql
-- in US style
SELECT strftime('%m-%d-%Y', 'now');

 * sqlite:///zombie_attacks.db
Done.


Unnamed: 0,"strftime('%m-%d-%Y', 'now')"
0,07-05-2024


In [44]:
%%sql
-- try to spell out the month; SQLite's strftime has no %B specifier, so this returns NULL
SELECT strftime('%B %d, %Y', 'now');

 * sqlite:///zombie_attacks.db
Done.


Unnamed: 0,"strftime('%B %d, %Y', 'now')"
0,
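
The empty result above comes from the fact that SQLite's `strftime` does not support `%B` (full month name). One possible workaround, sketched here under the assumption that English month names are wanted, is to map the two-digit month from `%m` to a name with a `CASE` expression. The `MONTHS` list and the helper `long_date` are hypothetical names introduced for this example.

```python
import sqlite3

MONTHS = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]

# Build "WHEN '01' THEN 'January' WHEN '02' THEN 'February' ..." from the fixed list.
when_clauses = " ".join(
    f"WHEN '{number:02d}' THEN '{name}'" for number, name in enumerate(MONTHS, start=1)
)

# :d is a named parameter; the date value itself is never pasted into the SQL text.
query = f"""
    SELECT CASE strftime('%m', :d) {when_clauses} END
           || strftime(' %d, %Y', :d)
"""

conn = sqlite3.connect(":memory:")


def long_date(iso_date):
    """Format a 'YYYY-MM-DD' string with the month spelled out."""
    return conn.execute(query, {"d": iso_date}).fetchone()[0]


print(long_date("2023-02-24"))
```

Another option is to fetch the ISO date and format it in Python or pandas, which may be simpler when the formatting is only needed for display.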


These functions can be used with our Zombie Attacks database. For instance, to get the day of the week for each attack (`%w` returns a number from 0 to 6, where 0 is Sunday):

In [43]:
%%sql
SELECT
    Date,
    Location,
    ZombieType,
    VictimCount,
    strftime('%w', Date) AS DayOfWeek
FROM ZombieAttacks
LIMIT 5;

 * sqlite:///zombie_attacks.db
Done.


Unnamed: 0,Date,Location,ZombieType,VictimCount,DayOfWeek
0,2023-02-24,Des Moines,Runner,25.5,5
1,2023-09-29,Rochester,Walker,13.0,5
2,2023-06-01,Rochester,Walker,12.0,4
3,2023-02-14,St. Louis,Crawler,7.0,2
4,2023-08-31,Winnipeg,Runner,39.0,4
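
Since `%w` gives a digit rather than a name, you may want to translate it. A minimal sketch, assuming English day names and a hypothetical helper `day_name`, could look like this:

```python
import sqlite3

# Index 0 is Sunday, matching the numbering used by strftime('%w', ...).
DAY_NAMES = ["Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"]

conn = sqlite3.connect(":memory:")


def day_name(iso_date):
    """Return the weekday name for a 'YYYY-MM-DD' string."""
    (weekday_text,) = conn.execute("SELECT strftime('%w', ?)", (iso_date,)).fetchone()
    # strftime returns text such as '5', so convert it before indexing.
    return DAY_NAMES[int(weekday_text)]


print(day_name("2023-02-24"))
print(day_name("2023-06-01"))
```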


#### Modifiers in SQLite Date Functions

SQLite date functions also accept modifiers that allow you to perform date arithmetic. Some common modifiers include:

-   '+N days'
-   '-N days'
-   '+N months'
-   '+N years'
-   'start of month'
-   'start of year'

For example, to get the date 7 days from now:

In [47]:
%%sql
SELECT date('now', '+7 days');

 * sqlite:///zombie_attacks.db
Done.


Unnamed: 0,"date('now', '+7 days')"
0,2024-07-12


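Modifiers can also be chained; they are applied from left to right. For example, to get the date ten years, two months, and three weeks from now: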

In [51]:
%%sql
SELECT date('now', '+10 years', '+2 months', '+21 days');

 * sqlite:///zombie_attacks.db
Done.


Unnamed: 0,"date('now', '+10 years', '+2 months', '+21 days')"
0,2034-09-26
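
The 'start of ...' modifiers combine well with date arithmetic. The sketch below uses fixed dates instead of 'now' so that the results are reproducible; the helpers `month_end` and `year_start` are hypothetical names for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")


def month_end(iso_date):
    """Last day of the month: go to the 1st, add a month, step back one day."""
    sql = "SELECT date(?, 'start of month', '+1 month', '-1 day')"
    return conn.execute(sql, (iso_date,)).fetchone()[0]


def year_start(iso_date):
    """First day of the year containing the given date."""
    return conn.execute("SELECT date(?, 'start of year')", (iso_date,)).fetchone()[0]


print(month_end("2023-02-14"))
print(month_end("2024-02-10"))   # leap year
print(year_start("2023-08-31"))
```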


#### Comparison with Other Databases

While these functions are specific to SQLite, other database systems have their own ways of handling dates. Here's a comparison table showing equivalent operations in SQLite, PostgreSQL, and MySQL (the latter two are widely used open source relational databases):

| Operation | SQLite | PostgreSQL | MySQL |
| --- | --- | --- | --- |
| Current Date | `date('now')` | `CURRENT_DATE` | `CURDATE()` |
| Current Time | `time('now')` | `CURRENT_TIME` | `CURTIME()` |
| Current Date and Time | `datetime('now')` | `CURRENT_TIMESTAMP` | `NOW()` |
| Extract Year | `strftime('%Y', date)` | `EXTRACT(YEAR FROM date)` | `YEAR(date)` |
| Extract Month | `strftime('%m', date)` | `EXTRACT(MONTH FROM date)` | `MONTH(date)` |
| Extract Day | `strftime('%d', date)` | `EXTRACT(DAY FROM date)` | `DAY(date)` |
| Format Date | `strftime('%Y-%m-%d', date)` | `TO_CHAR(date, 'YYYY-MM-DD')` | `DATE_FORMAT(date, '%Y-%m-%d')` |
| Add Days | `date(date, '+N days')` | `date + INTERVAL 'N days'` | `DATE_ADD(date, INTERVAL N DAY)` |
| Subtract Days | `date(date, '-N days')` | `date - INTERVAL 'N days'` | `DATE_SUB(date, INTERVAL N DAY)` |

This table provides a quick reference for equivalent date operations across these three popular database systems. While the specific syntax may differ, the general concepts remain the same.
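
One detail to keep in mind on the SQLite side is that `strftime` returns text, so an extracted month comes back as '02' rather than the number 2. If you need a number, for example to compare against a numeric month column, a `CAST` is one way to convert it. The sketch below uses a fixed sample date.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

sample_date = "2023-02-24"
year_text, month_text, month_number = conn.execute(
    """
    SELECT strftime('%Y', :d),
           strftime('%m', :d),
           CAST(strftime('%m', :d) AS INTEGER)
    """,
    {"d": sample_date},
).fetchone()

# The first two values are strings; only the CAST result is an integer.
print(repr(year_text), repr(month_text), repr(month_number))
```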