<a href="https://colab.research.google.com/github/dottybusch/exercises/blob/main/query_quests.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Query Quests: An Introduction to SQL

This Notebook will guide you through your first steps in SQL.

`Structured Query Language (SQL)`, also known as **Sequel**, allows you to interact with a _Relational Database_.

**Relational Database**: A relational database is a collection of related tables\
**Table**: It's a collection of related data organized in rows and columns\
**Column (Field)**: Each column represents a specific attribute or category of data\
**Row (Record)**: Each row in the table is a single entry or instance of data

---

**Data Hygiene:**\
Ideally, the entries in each column should have a consistent data type, i.e. numerical (integer or float), string, datetime, etc. Preferably, column names should not contain whitespace, e.g. using `First_Name` or `FirstName` instead of `First Name` as a column name/header is considered better practice.

---

You can perform various operations on a database using SQL like adding, deleting, modifying, or updating data or tables. Based on this, SQL commands or instructions can be categorized into five types:
- Data Definition Language (DDL):
  - Function: To define, alter, or delete Database structures i.e. table, database
  - Example commands: CREATE, DROP, ALTER, TRUNCATE, COMMENT, RENAME
- Data Query Language (DQL):
  - Function: To query data from a database i.e. data retrieval
  - Example commands: SELECT ... FROM, WHERE, SELECT DISTINCT
- Data Manipulation Language (DML):
  - Function: To manipulate (add, delete, update) data in a database
  - Example commands: INSERT, UPDATE, DELETE
- Data Control Language (DCL):
  - Function: To control access to the database
  - Example commands: GRANT, REVOKE
- Transaction Control Language (TCL):
  - Function: To group a set of tasks/instructions into a single execution unit
  - Example commands: BEGIN TRANSACTION, COMMIT, ROLLBACK
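
The categories above can be tried out in a few lines. The following is a minimal sketch, assuming a throwaway in-memory SQLite database and a made-up `Employees` table that is not part of this notebook's dataset. DCL is left out because SQLite does not support `GRANT` and `REVOKE`.

```python
import sqlite3

# In-memory database: nothing is written to disk
con = sqlite3.connect(":memory:")

# DDL: define the structure of a (hypothetical) table
con.execute("CREATE TABLE Employees (Name TEXT, Department TEXT)")

# DML: add and modify rows
con.execute("INSERT INTO Employees VALUES ('Ana', 'Sales')")
con.execute("INSERT INTO Employees VALUES ('Ben', 'IT')")
con.execute("UPDATE Employees SET Department = 'HR' WHERE Name = 'Ben'")

# TCL: make the changes so far permanent
con.commit()

# TCL: a change that is rolled back never becomes part of the table
con.execute("DELETE FROM Employees")
con.rollback()

# DQL: read the data back
rows = con.execute(
    "SELECT Name, Department FROM Employees ORDER BY Name"
).fetchall()
print(rows)
con.close()
```

Both rows are still there at the end because the `DELETE` was rolled back before it was committed.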

---

To practice the commands we will import a dataset containing a list of countries along with their respective population (2024), capital city and the region they belong to.

Please run the cells below to complete the setup required to activate the SQL environment in this notebook. Note that they need to be rerun each time you close and reopen the notebook.

---
#### Name of the table: `Countries`
#### Column names: `Country`, `Population`, `Capital City`, `Region`


In [1]:
# Let's first install the Pandas library
!pip install -q pandas

print("\033[92mPandas installed successfully\033[00m")

Pandas installed successfully


In [2]:
# Import the libraries
import pandas as pd

print("\033[92mPandas library imported successfully\033[00m")

# The URL where the data is located
file_url = 'https://github.com/FootlooseNFree/my_files/blob/main/countries.csv?raw=true'

# Import the data from the URL
df = pd.read_csv(file_url)
print('\033[92mData imported successfully\033[00m')

Pandas library imported successfully
Data imported successfully


In [3]:
# Import sqlite3
import sqlite3

# Run a SQL query typed in by the user against the Countries table
def run_query():
  con = sqlite3.connect('Countries.db')

  # Create Countries table and add the data to it
  df.to_sql('Countries', con, if_exists='replace', index=False)
  pd.set_option('display.max_columns', None)
  pd.set_option('display.max_rows', None)

  query = input('Enter your SQL Query:  ')
  try:
    df_output = pd.read_sql_query(query, con)
  except Exception as e:
    print(e)
    print('Try again!!!')
    return None
  finally:
    # Close the connection whether or not the query succeeded
    con.close()

  return df_output

In these introductory sessions we will mainly deal with the Data Query Language (DQL).

The core query has the syntax:\
`SELECT ____________ FROM _____________;`\
This could be interpreted as `SELECT column(s) FROM table;`\
The `column(s)` can either be listed by name, separated by commas, or an _asterisk_ `*` can be used to specify `all` columns.

`SELECT * FROM table_name;`

`SELECT column1, column2, ... FROM table_name;`

For better readability, the query is often written on multiple lines as follows:
```
SELECT column1, column2, ...
FROM table_name;
```
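
To see both forms in action without giving away the exercises, here is a small sketch. It assumes a made-up `Employees` table in an in-memory SQLite database; the table, its columns and its rows are purely illustrative.

```python
import sqlite3

# In-memory database with a small hypothetical table (made-up rows)
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Employees (Name TEXT, Department TEXT, Salary INTEGER)")
con.executemany(
    "INSERT INTO Employees VALUES (?, ?, ?)",
    [("Ana", "Sales", 4200), ("Ben", "IT", 4800)],
)

# The asterisk selects every column of the table
cur = con.execute("SELECT * FROM Employees;")
all_column_names = [col[0] for col in cur.description]
all_rows = cur.fetchall()

# Naming the columns returns only those columns, in the order written
cur = con.execute("SELECT Salary, Name FROM Employees;")
some_column_names = [col[0] for col in cur.description]
some_rows = cur.fetchall()

print(all_column_names, all_rows)
print(some_column_names, some_rows)
con.close()
```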
---

### Query 1

Write a query to retrieve `all` columns and rows from the table `Countries`

In [None]:
query_result = run_query()
query_result

### Query 2

This time run a query to fetch only the columns `Country`, `Capital City` and `Population`

**Note:** The column name `Capital City` contains a whitespace that might cause problems while parsing the column name. Hence, it should be wrapped with [backticks (or back quotes)](https://i.sstatic.net/TOn1U.png)

\`Capital City\`
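
As a quick illustration of the quoting rule, the sketch below uses a made-up table with a hypothetical column called `Home City`. SQLite accepts backticks around such a name, and it also accepts double quotes, which is the standard SQL way of quoting identifiers.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Hypothetical table whose second column name contains a whitespace
con.execute('CREATE TABLE Employees (Name TEXT, "Home City" TEXT)')
con.execute("INSERT INTO Employees VALUES ('Ana', 'Lisbon')")

# Backticks around the column name
with_backticks = con.execute(
    "SELECT Name, `Home City` FROM Employees"
).fetchall()

# Double quotes around the column name give the same result in SQLite
with_double_quotes = con.execute(
    'SELECT Name, "Home City" FROM Employees'
).fetchall()

print(with_backticks)
print(with_double_quotes)
con.close()
```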

In [None]:
query_result = run_query()
query_result

### Query 2a

This time, rewrite Query 2 so that the output contains the columns in the following order: `Region`, `Country`, `Capital City`, `Population`

In [None]:
query_result = run_query()
query_result

## Limiting the output to specific number of rows

The queries above retrieved every row in the dataset, which can waste time if your data contains thousands of rows. The number of rows returned can be limited with the `LIMIT` keyword.

Syntax:
```
SELECT column1, column2
FROM table_name
LIMIT n
```
OR
```
SELECT column1, column2 FROM table_name LIMIT n;
```

Where `n` is the number of rows you want to retrieve
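
A minimal sketch of `LIMIT`, again assuming a made-up `Employees` table rather than the notebook's dataset. Note that without any sorting, SQL does not promise which rows come first, so the example only checks how many rows are returned.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Employees (Name TEXT, Salary INTEGER)")
con.executemany(
    "INSERT INTO Employees VALUES (?, ?)",
    [("Ana", 4200), ("Ben", 3900), ("Cara", 5100), ("Dev", 4800), ("Eli", 3600)],
)

# Without LIMIT every row is returned
all_rows = con.execute("SELECT Name, Salary FROM Employees").fetchall()

# With LIMIT 3 at most three rows are returned
limited_rows = con.execute("SELECT Name, Salary FROM Employees LIMIT 3").fetchall()

print(len(all_rows), len(limited_rows))
con.close()
```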

---

### Query 3

Write a query to get the `Country`, `Capital City` and `Population` of the first **15** countries from the table `Countries`

In [None]:
query_result = run_query()
query_result

### Retrieving the unique values from a column or multiple columns

Sometimes you might want to know which different values occur in a single column or in a combination of columns. `SELECT DISTINCT` extracts them for you.

Syntax:
```
SELECT DISTINCT column1, column2, ...
FROM table_name;
```
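
The sketch below shows `SELECT DISTINCT` on one column and on a combination of two columns. It assumes a made-up `Employees` table with hypothetical `Department` and `City` columns.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Employees (Name TEXT, Department TEXT, City TEXT)")
con.executemany(
    "INSERT INTO Employees VALUES (?, ?, ?)",
    [
        ("Ana", "Sales", "Lisbon"),
        ("Ben", "Sales", "Lisbon"),
        ("Cara", "IT", "Lisbon"),
        ("Dev", "IT", "Porto"),
        ("Eli", "HR", "Porto"),
    ],
)

# Unique values of a single column
departments = con.execute("SELECT DISTINCT Department FROM Employees").fetchall()

# Unique combinations of two columns
pairs = con.execute("SELECT DISTINCT Department, City FROM Employees").fetchall()

print(sorted(departments))
print(sorted(pairs))
con.close()
```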
---
### Query 4

Can you find out what are the different regions in the `Region` column of the `Countries` table?

In [None]:
query_result = run_query()
query_result

### Sorting the table

The `ORDER BY` command will sort the data for you w.r.t. the column or columns you mention. The default sorting is ascending.

Syntax:
```
SELECT column1, column2, ...
FROM table_name
ORDER BY column1, column2, ... ASC|DESC;
```
Sorting can also be done w.r.t. a column that is not selected in the query.
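
Here is a short sketch of the three sorting variants: descending order, sorting by more than one column, and sorting by a column that is not selected. The `Employees` table and its rows are made up for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Employees (Name TEXT, Department TEXT, Salary INTEGER)")
con.executemany(
    "INSERT INTO Employees VALUES (?, ?, ?)",
    [
        ("Ana", "Sales", 4200),
        ("Ben", "Sales", 3900),
        ("Cara", "IT", 5100),
        ("Dev", "IT", 4800),
        ("Eli", "HR", 3600),
    ],
)

# Highest salary first
by_salary_desc = con.execute(
    "SELECT Name, Salary FROM Employees ORDER BY Salary DESC"
).fetchall()

# Departments A-Z, and within each department the highest salary first
by_dept_then_salary = con.execute(
    "SELECT Name, Department FROM Employees ORDER BY Department ASC, Salary DESC"
).fetchall()

# Sorting by Salary although only Name is selected
names_by_salary = con.execute(
    "SELECT Name FROM Employees ORDER BY Salary"
).fetchall()

print(by_salary_desc)
print(by_dept_then_salary)
print(names_by_salary)
con.close()
```

Each `ASC` or `DESC` applies only to the column directly before it, which is why the second query can mix both directions.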

---
### Query 5

Sort the entire `Countries` table in the ascending order w.r.t the `Region` names

In [None]:
query_result = run_query()
query_result

### Query 6

Sort the `Countries` table in the descending order w.r.t. `Population`. Only `Country` and `Population` columns should be retrieved.

In [None]:
query_result = run_query()
query_result

### Querying with conditions

After performing the simple queries above, you will probably want to get more insights from the data based on certain conditions. SQL provides this flexibility with the `WHERE` clause combined with one or more conditions; it retrieves only the rows for which the condition is met, i.e. is True.

The table below explains what kind of conditions can be stated:

|Operator|Description|
|:---|:---|
|`=`|Equal|
|`>`|Greater than|
|`<`|Less than|
|`>=`|Greater than or equal to|
|`<=`|Less than or equal to|
|`!=`|Not equal to (some versions of SQL may use `<>`)|
|`BETWEEN`|Between a certain range|
|`LIKE`|Search for a pattern using `%` or `_`|
|`IN`|Match exactly in a list of values|

To specify a **string** value in the condition, single quotes should be used. Some database systems also accept double quotes.\
Multiple conditions can be combined with `AND` and `OR`.

Syntax:
```
SELECT column1, column2, ...
FROM table_name
WHERE condition
```
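
The following sketch runs a few `WHERE` conditions against a made-up `Employees` table (hypothetical names and salaries), covering a comparison, `AND`, `OR` and `BETWEEN`. Note that `BETWEEN` includes both end values.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Employees (Name TEXT, Department TEXT, Salary INTEGER)")
con.executemany(
    "INSERT INTO Employees VALUES (?, ?, ?)",
    [
        ("Ana", "Sales", 4200),
        ("Ben", "Sales", 3900),
        ("Cara", "IT", 5100),
        ("Dev", "IT", 4800),
        ("Eli", "HR", 3600),
    ],
)

def names(condition):
    # Helper for this example only: names matching a condition, sorted A-Z
    sql = f"SELECT Name FROM Employees WHERE {condition} ORDER BY Name"
    return [row[0] for row in con.execute(sql)]

high_paid = names("Salary > 4500")
sales_and_high = names("Department = 'Sales' AND Salary >= 4000")
hr_or_top = names("Department = 'HR' OR Salary > 5000")
mid_range = names("Salary BETWEEN 3900 AND 4800")

print(high_paid, sales_and_high, hr_or_top, mid_range)
con.close()
```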
---
### Query 7

Retrieve the list of countries from the `Countries` table whose `Population` is more than 500000000 (500 million)

In [None]:
query_result = run_query()
query_result

### Query 8

Which countries from the `Region` **Europe** have a population of more than 20000000 (20 million)?

**Hint:** The two conditions can be combined by using `AND`

In [None]:
query_result = run_query()
query_result

### Query 9

Write a query to retrieve the countries with population between 10000000 (10 million) and 100000000 (100 million)

**Hint:** The `BETWEEN ... AND ...` syntax can be used

In [None]:
query_result = run_query()
query_result

### Conditions with pattern using `LIKE`

```
SELECT column1, column2, ...
FROM table_name
WHERE column LIKE pattern;
```

The `LIKE` keyword can be used to match a pattern. It's helpful when the entire string value is not known and a similar value is sought. `%` is a placeholder that matches zero, one or many characters. `_` matches exactly one character.

E.g. the pattern `Seq%` matches `Sequel` (and any other string starting with `Seq`), whereas `Seq_` does not match `Sequel`; it only matches four-character strings such as `Sequ`.
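
The two placeholders can be compared side by side. This is a small sketch assuming a made-up `Words` table holding a handful of strings, including `Sequel` from the example above.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Words (Word TEXT)")
con.executemany(
    "INSERT INTO Words VALUES (?)",
    [("Sequel",), ("Sequ",), ("Sequence",), ("SQL",)],
)

def matching(pattern):
    # Helper for this example only: words matching a LIKE pattern, sorted A-Z
    sql = "SELECT Word FROM Words WHERE Word LIKE ? ORDER BY Word"
    return [row[0] for row in con.execute(sql, (pattern,))]

starts_with_seq = matching("Seq%")   # % stands for any number of characters
seq_plus_one = matching("Seq_")      # _ stands for exactly one character
three_letters = matching("S_L")      # one character between S and L

print(starts_with_seq, seq_plus_one, three_letters)
con.close()
```

The pattern always has to cover the whole value, so `Seq_` finds only the four-character entry.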

---
### Query 10

Find the entries in the `Country` column of the `Countries` table that end in `-land`.

In [None]:
query_result = run_query()
query_result

### Conditions for matching entries from a list using `IN`

```
SELECT column1, column2, ...
FROM table_name
WHERE column IN (item1, item2, ...);
```
The above condition will be satisfied for the entries in the column matching one of the items specified between the parentheses.
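
A brief sketch of `IN`, using a made-up `Employees` table. It also runs the same filter written with `OR` to show that both forms select the same rows.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Employees (Name TEXT, Department TEXT)")
con.executemany(
    "INSERT INTO Employees VALUES (?, ?)",
    [("Ana", "Sales"), ("Ben", "Sales"), ("Cara", "IT"), ("Dev", "IT"), ("Eli", "HR")],
)

# Rows whose Department is one of the listed values
with_in = con.execute(
    "SELECT Name, Department FROM Employees "
    "WHERE Department IN ('IT', 'HR') ORDER BY Name"
).fetchall()

# The same filter spelled out with OR
with_or = con.execute(
    "SELECT Name, Department FROM Employees "
    "WHERE Department = 'IT' OR Department = 'HR' ORDER BY Name"
).fetchall()

print(with_in)
con.close()
```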

---
### Query 11

Which countries in the `Countries` table are from the `Region`s `'Europe'` and `'Americas'`? Also include their respective `Capital City`.

In [None]:
query_result = run_query()
query_result

### Aggregate Functions

An aggregate function is a function that performs a calculation on a set of values, and returns a single value.

Aggregate functions are often used with the `GROUP BY` clause of the `SELECT` statement. The GROUP BY clause splits the result-set into groups of values and the aggregate function can be used to return a single value for each group.

The most commonly used SQL aggregate functions are:

`MIN(column_name)` - returns the smallest value within the selected column\
`MAX(column_name)` - returns the largest value within the selected column\
`COUNT(column_name)` - returns the number of non-NULL values in the selected column (`COUNT(*)` counts all rows)\
`SUM(column_name)` - returns the total sum of a numerical column\
`AVG(column_name)` - returns the average value of a numerical column

---
The basic syntax for aggregate functions is
```
SELECT aggregate_function(column_name)
FROM table_name;
```
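
The five functions can be tried together on a made-up `Employees` table. The sketch deliberately includes one hypothetical row with a missing salary to show that `COUNT(column_name)`, `SUM` and `AVG` skip NULL values, while `COUNT(*)` counts every row.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Employees (Name TEXT, Salary INTEGER)")
con.executemany(
    "INSERT INTO Employees VALUES (?, ?)",
    [
        ("Ana", 4200),
        ("Ben", 3900),
        ("Cara", 5100),
        ("Dev", 4800),
        ("Eli", 3600),
        ("Fay", None),  # salary unknown, stored as NULL
    ],
)

# Several aggregate functions can be used in one SELECT
lowest, highest, known_salaries, all_rows, total, average = con.execute(
    "SELECT MIN(Salary), MAX(Salary), COUNT(Salary), COUNT(*), "
    "SUM(Salary), AVG(Salary) FROM Employees"
).fetchone()

print(lowest, highest, known_salaries, all_rows, total, average)
con.close()
```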
---
### Query 12

Which `Country` in the `Countries` table has the lowest population?

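**Hint:** Alongside the aggregate function, you can also select the `Country` column to show the name of the country. This works in SQLite, which this notebook uses, because it returns the value from the row holding the minimum; many other database systems reject such a query.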

In [None]:
query_result = run_query()
query_result

### Query 13
Can you count the total number of entries in the `Country` column of the `Countries` table?

In [None]:
query_result = run_query()
query_result

### Aggregate functions with conditions

By adding the `WHERE` clause it is also possible to use the aggregate functions to calculate values based on the stated condition.

The syntax will be as follows:
```
SELECT aggregate_function(column_name)
FROM table_name
WHERE condition;
```
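
As a sketch of this combination, the example below aggregates only the rows that pass a `WHERE` condition. The `Employees` table and its values are made up for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Employees (Name TEXT, Department TEXT, Salary INTEGER)")
con.executemany(
    "INSERT INTO Employees VALUES (?, ?, ?)",
    [
        ("Ana", "Sales", 4200),
        ("Ben", "Sales", 3900),
        ("Cara", "IT", 5100),
        ("Dev", "IT", 4800),
        ("Eli", "HR", 3600),
    ],
)

# WHERE filters the rows first, then the aggregate is calculated
it_total = con.execute(
    "SELECT SUM(Salary) FROM Employees WHERE Department = 'IT'"
).fetchone()[0]

sales_average = con.execute(
    "SELECT AVG(Salary) FROM Employees WHERE Department = 'Sales'"
).fetchone()[0]

print(it_total, sales_average)
con.close()
```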
---
### Query 14
Find the total `Population` of all the countries in `'Europe'`.

**Hint:** Europe is in the `Region` column.

In [None]:
query_result = run_query()
query_result

### Query 15

What is the average population of the countries in `Americas`?

In [None]:
query_result = run_query()
query_result

### Counting unique elements

By using the `DISTINCT` keyword inside `COUNT()`, you can find the number of unique elements in that column.

```
SELECT COUNT(DISTINCT column_name)
FROM table_name;
```
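
A short sketch contrasting `COUNT(column_name)` with `COUNT(DISTINCT column_name)`, assuming a made-up `Employees` table:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Employees (Name TEXT, Department TEXT)")
con.executemany(
    "INSERT INTO Employees VALUES (?, ?)",
    [("Ana", "Sales"), ("Ben", "Sales"), ("Cara", "IT"), ("Dev", "IT"), ("Eli", "HR")],
)

# Counts every entry in the column, duplicates included
all_entries = con.execute("SELECT COUNT(Department) FROM Employees").fetchone()[0]

# Counts each different value only once
unique_entries = con.execute(
    "SELECT COUNT(DISTINCT Department) FROM Employees"
).fetchone()[0]

print(all_entries, unique_entries)
con.close()
```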
---
### Query 16
How many unique regions are there in the `Region` column?

In [None]:
query_result = run_query()
query_result

### Aggregate functions with GROUP BY

To get deeper insights from your data, you might sometimes want to see how the data is distributed for different categories. The `GROUP BY` statement allows you to do that.\
It allows you to find answers to questions like _How many employees are there in each department?_ or _What is the average salary by gender in a particular company?_

Syntax:
```
SELECT aggregate_function
FROM table_name
GROUP BY column_name;
```
```
SELECT column1, aggregate_function, column2, ...
FROM table_name
GROUP BY column_name
```
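
The employee question above can be sketched as follows, assuming a made-up `Employees` table with hypothetical departments and salaries. Each department becomes one row in the result.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Employees (Name TEXT, Department TEXT, Salary INTEGER)")
con.executemany(
    "INSERT INTO Employees VALUES (?, ?, ?)",
    [
        ("Ana", "Sales", 4200),
        ("Ben", "Sales", 3900),
        ("Cara", "IT", 5100),
        ("Dev", "IT", 4800),
        ("Eli", "HR", 3600),
    ],
)

# One result row per department: head count and average salary
per_department = con.execute(
    "SELECT Department, COUNT(*), AVG(Salary) "
    "FROM Employees "
    "GROUP BY Department "
    "ORDER BY Department"
).fetchall()

for row in per_department:
    print(row)
con.close()
```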
---
### Query 17
How many countries are there in each `Region` of the `Countries` table? The retrieved data should show the columns `Region` followed by the count of countries in it.

In [None]:
query_result = run_query()
query_result

### Query 18
Find the total `Population` by `Region` in the `Countries` table.

In [None]:
query_result = run_query()
query_result

### GROUP BY with conditions using `HAVING`

The `HAVING` keyword is analogous to `WHERE`, but it is used to specify conditions on aggregated values, which `WHERE` cannot do.

Syntax:
```
SELECT aggregate_function
FROM table_name
GROUP BY column_name
HAVING condition(aggregated)
```
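
A minimal sketch of `HAVING`, assuming a made-up `Employees` table. The groups are formed first, and only the groups whose aggregated value meets the condition are kept.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Employees (Name TEXT, Department TEXT, Salary INTEGER)")
con.executemany(
    "INSERT INTO Employees VALUES (?, ?, ?)",
    [
        ("Ana", "Sales", 4200),
        ("Ben", "Sales", 3900),
        ("Cara", "IT", 5100),
        ("Dev", "IT", 4800),
        ("Eli", "HR", 3600),
    ],
)

# Departments with more than one employee
bigger_departments = con.execute(
    "SELECT Department, COUNT(*) FROM Employees "
    "GROUP BY Department HAVING COUNT(*) > 1 ORDER BY Department"
).fetchall()

# Departments whose salaries add up to more than 9000
costly_departments = con.execute(
    "SELECT Department, SUM(Salary) FROM Employees "
    "GROUP BY Department HAVING SUM(Salary) > 9000"
).fetchall()

print(bigger_departments)
print(costly_departments)
con.close()
```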
---
### Query 19
In the `Countries` table, find the `Region`s which have more than 50 countries.

**Hint:**
- Select the column `Region` and the count of `Country` (an aggregate function)
- Group it by `Region`
- Specify the condition with `HAVING` the count of `Country` greater than 50

In [None]:
query_result = run_query()
query_result

### Query 20

Which `Region` has a combined population of more than 1 billion (1000000000) across all its countries?

**Hint:**
- Select the columns `Region` and the sum of the `Population`
- Group the `Region`s
- Specify the condition using `HAVING` that the sum of `Population` should be more than 1 billion

In [None]:
query_result = run_query()
query_result

### Order of writing SQL statements and clauses

The SQL statements and clauses can be combined to perform more complex queries suited to your requirements. However, they must be written in the order shown below:
```
SELECT
------------
DISTINCT
------------
FROM
------------
WHERE
------------
GROUP BY
------------
HAVING
------------
ORDER BY
------------
LIMIT
------------
```
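
To see the order above in one piece, here is a sketch of a single query that uses `WHERE`, `GROUP BY`, `HAVING`, `ORDER BY` and `LIMIT` together. It assumes a made-up `Employees` table; the thresholds are arbitrary and chosen only for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Employees (Name TEXT, Department TEXT, Salary INTEGER)")
con.executemany(
    "INSERT INTO Employees VALUES (?, ?, ?)",
    [
        ("Ana", "Sales", 4200),
        ("Ben", "Sales", 3900),
        ("Cara", "IT", 5100),
        ("Dev", "IT", 4800),
        ("Eli", "HR", 3600),
    ],
)

# The clauses appear in the required order, one per line
query = """
SELECT Department, SUM(Salary)
FROM Employees
WHERE Salary > 3700
GROUP BY Department
HAVING SUM(Salary) > 5000
ORDER BY SUM(Salary) DESC
LIMIT 1;
"""

top_department = con.execute(query).fetchall()
print(top_department)
con.close()
```

Swapping any two of these clauses, for example writing `LIMIT` before `ORDER BY`, makes SQLite report a syntax error.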