## SQL

SQL, which stands for Structured Query Language, is a language for interacting with data stored in something called a relational database.

You can think of a relational database as a collection of tables. A table is just a set of rows and columns, like a spreadsheet, which represents exactly one type of entity. For example, a table might represent employees in a company or purchases made, but not both.

Each row, or record, of a table contains information about a single entity. For example, in a table representing employees, each row represents a single person. Each column, or field, of a table contains a single attribute for all rows in the table. For example, in a table representing employees, we might have a column containing first and last names for all employees.

A query is a request for data from a database table (or a combination of tables). For example, this query selects the name column from the people table:

> SELECT name FROM people;

This query selects two columns, name and birthdate, from the people table:
>SELECT name, birthdate<br>
FROM people;

Sometimes, you may want to select all columns from a table. Typing out every column name would be a pain, so there's a handy shortcut:

>SELECT *<br>
FROM people;

If you only want to return a certain number of results, you can use the LIMIT keyword to limit the number of rows returned:

>SELECT *<br>
FROM people<br>
LIMIT 10;
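
To try these queries without a database server, here is a minimal sketch using Python's built-in sqlite3 module. The three rows in people are made up for illustration; only the table and column names come from the queries above.

```python
import sqlite3

# In-memory database with a tiny, made-up people table (illustrative data only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, birthdate TEXT)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?)",
    [("Ada", "1815-12-10"), ("Grace", "1906-12-09"), ("Alan", "1912-06-23")],
)

# Select two specific columns.
two_columns = conn.execute("SELECT name, birthdate FROM people").fetchall()

# Select every column, but return at most two rows.
limited = conn.execute("SELECT * FROM people LIMIT 2").fetchall()

print(two_columns)
print(len(limited))  # 2
```

Each row comes back as a tuple whose items follow the order of the columns in the SELECT list.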

### DISTINCT, COUNT

If you want to select all the unique values from a column, you can use the DISTINCT keyword.
>SELECT DISTINCT country<br>
FROM films;

What if you want to count the number of employees in your employees table? The COUNT() function lets you do this: COUNT(*) returns the number of rows in a table, while COUNT(column_name) returns the number of non-NULL values in that column. For example, this code gives the number of rows in the people table:

>SELECT COUNT(*)<br>
FROM people;

It's also common to combine COUNT() with DISTINCT to count the number of distinct values in a column. For example, this query counts the number of distinct birth dates contained in the people table:

>SELECT COUNT(DISTINCT birthdate)<br>
FROM people;
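
The three forms of COUNT() can give three different answers on the same table. The sketch below uses Python's built-in sqlite3 module and a made-up people table with one missing and one repeated birth date.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, birthdate TEXT)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?)",
    [
        ("Ada", "1815-12-10"),
        ("Grace", "1906-12-09"),
        ("Twin A", "1990-01-01"),
        ("Twin B", "1990-01-01"),  # same birth date as Twin A
        ("Unknown", None),         # missing birth date (NULL)
    ],
)

total_rows, known_dates, distinct_dates = conn.execute(
    """
    SELECT COUNT(*),                   -- every row
           COUNT(birthdate),           -- rows where birthdate is not NULL
           COUNT(DISTINCT birthdate)   -- unique non-NULL birth dates
    FROM people
    """
).fetchone()

print(total_rows, known_dates, distinct_dates)  # 5 4 3
```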

### WHERE
In SQL, the WHERE keyword allows you to filter based on both text and numeric values in a table. There are a few different comparison operators you can use:

 - `=` equal
 - `<>` not equal
 - `<` less than
 - `>` greater than
 - `<=` less than or equal to
 - `>=` greater than or equal to

For example, you can filter text records such as title. The following code returns all films with the title 'Metropolis':

>SELECT title<br>
FROM films<br>
WHERE title = 'Metropolis';

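You can filter on numeric values in the same way. The following returns all films released after 2000:

>SELECT title, release_year<br>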
FROM films<br>
WHERE release_year > 2000;

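To require several conditions at once, combine them with the AND operator. The following returns all Spanish-language films released before 2000:

>SELECT title, release_year<br>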
FROM films<br>
WHERE release_year < 2000 AND language = 'Spanish';


What if you want to select rows based on multiple conditions where some but not all of the conditions need to be met? For this, SQL has the OR operator. For example, the following returns all films released in either 1994 or 2000:

>SELECT title<br>
FROM films<br>
WHERE release_year = 1994<br>
OR release_year = 2000;

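When combining AND and OR, enclose the individual clauses in parentheses so they are evaluated as intended:

>SELECT title<br>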
FROM films<br>
WHERE (release_year = 1994 OR release_year = 1995)<br>
AND (certification = 'PG' OR certification = 'R');

>SELECT release_year, title<br>
FROM films<br>
WHERE (release_year >= 1990 AND release_year < 2000)<br>
AND (language = 'Spanish' OR language = 'French')<br>
AND (gross > 2000000);
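
The parentheses in these queries matter because AND binds more tightly than OR. Here is a small sketch of the difference, using Python's built-in sqlite3 module; the five films and the titles() helper are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE films (title TEXT, release_year INTEGER, certification TEXT)"
)
conn.executemany(
    "INSERT INTO films VALUES (?, ?, ?)",
    [("A", 1994, "PG"), ("B", 1994, "G"), ("C", 1995, "R"),
     ("D", 1995, "G"), ("E", 2000, "PG")],
)

def titles(where_clause):
    """Return sorted titles matching a WHERE clause (hypothetical helper, trusted input only)."""
    sql = f"SELECT title FROM films WHERE {where_clause} ORDER BY title"
    return [row[0] for row in conn.execute(sql)]

# With parentheses: (1994 or 1995) and (PG or R).
grouped = titles("(release_year = 1994 OR release_year = 1995) "
                 "AND (certification = 'PG' OR certification = 'R')")

# Without parentheses this reads as: 1994 OR (1995 AND PG) OR R.
ungrouped = titles("release_year = 1994 OR release_year = 1995 "
                   "AND certification = 'PG' OR certification = 'R'")

print(grouped)    # ['A', 'C']
print(ungrouped)  # ['A', 'B', 'C']
```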

### WHERE IN

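The IN operator lets you test a column against a list of values in a WHERE clause, which is shorter than chaining many OR conditions:

>SELECT name<br>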
FROM kids<br>
WHERE age IN (2, 4, 6, 8, 10);

>SELECT title, release_year<br>
FROM films<br>
WHERE release_year IN (1990, 2000)<br>
AND duration > 120;

>SELECT title, language<br>
FROM films<br>
WHERE language IN ('English', 'Spanish', 'French');

### BETWEEN
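BETWEEN filters for values within a range, including both endpoints, so age BETWEEN 2 AND 12 is equivalent to age >= 2 AND age <= 12. It is used inside a WHERE clause and can be combined with further AND and OR conditions.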

>SELECT name<br>
FROM kids<br>
WHERE age BETWEEN 2 AND 12<br>
AND nationality = 'USA';

>SELECT title, release_year<br>
FROM films<br>
WHERE release_year BETWEEN 1990 AND 2000<br>
AND budget > 100000000<br>
AND (language = 'Spanish' OR language = 'French');
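
A quick way to see how BETWEEN and IN behave at the edges is to run them on a few rows. This sketch uses Python's built-in sqlite3 module; the four kids and the names() helper are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kids (name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO kids VALUES (?, ?)",
    [("Ann", 2), ("Ben", 7), ("Cal", 12), ("Dee", 13)],
)

def names(where_clause):
    """Return sorted names matching a WHERE clause (hypothetical helper, trusted input only)."""
    sql = f"SELECT name FROM kids WHERE {where_clause} ORDER BY name"
    return [row[0] for row in conn.execute(sql)]

between = names("age BETWEEN 2 AND 12")           # both endpoints included
explicit_range = names("age >= 2 AND age <= 12")  # same condition spelled out
in_list = names("age IN (2, 12)")                 # only the listed values

print(between)   # ['Ann', 'Ben', 'Cal']
print(in_list)   # ['Ann', 'Cal']
```

BETWEEN keeps the rows at ages 2 and 12, while IN matches only the exact values listed and so skips Ben.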

### NULL and IS NULL

In SQL, NULL represents a missing or unknown value. You can check for NULL values using the expression IS NULL. For example, to count the number of missing birth dates in the people table:

>SELECT COUNT(*)<br>
FROM people<br>
WHERE birthdate IS NULL;<br>

Moreover, IS NULL is useful when combined with WHERE to figure out what data you're missing.

Sometimes, you'll want to filter out missing values so you only get results which are not NULL. To do this, you can use the IS NOT NULL operator. For example, this query gives the names of all people whose birth dates are not missing in the people table.

>SELECT name<br>
FROM people<br>
WHERE birthdate IS NOT NULL;


>SELECT name<br>
FROM people<br>
WHERE happydate IS NULL;

>SELECT COUNT(*)<br>
FROM films<br>
WHERE language IS NULL;
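
One detail worth checking for yourself: comparing a column to NULL with = never matches, because a comparison with NULL is itself unknown. The sketch below, using Python's built-in sqlite3 module and a made-up people table, contrasts = NULL with IS NULL and IS NOT NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, birthdate TEXT)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?)",
    [("Ada", "1815-12-10"), ("Grace", "1906-12-09"), ("Unknown", None)],
)

def count_where(where_clause):
    """Count rows matching a WHERE clause (hypothetical helper, trusted input only)."""
    sql = f"SELECT COUNT(*) FROM people WHERE {where_clause}"
    return conn.execute(sql).fetchone()[0]

equals_null = count_where("birthdate = NULL")      # never true, so no rows
is_null = count_where("birthdate IS NULL")         # the one missing birth date
is_not_null = count_where("birthdate IS NOT NULL") # the two known birth dates

print(equals_null, is_null, is_not_null)  # 0 1 2
```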

### LIKE and NOT LIKE

In SQL, the LIKE operator can be used in a WHERE clause to search for a pattern in a column. To accomplish this, you use something called a wildcard as a placeholder for some other values. There are two wildcards you can use with LIKE.

The % wildcard will match zero, one, or many characters in text. For example, the following query matches companies like 'Data', 'DataC', 'DataCamp', 'DataMind', and so on:

>SELECT name<br>
FROM companies<br>
WHERE name LIKE 'Data%';

The _ wildcard will match a single character. For example, the following query matches companies like 'DataCamp', 'DataComp', and so on:

>SELECT name<br>
FROM companies<br>
WHERE name LIKE 'DataC_mp';

Get the names of all people whose names begin with 'B'. The pattern you need is 'B%'.
>SELECT name<br>
FROM people<br>
WHERE name LIKE 'B%';

Get the names of people whose names have 'r' as the second letter. The pattern you need is '_r%'.
>SELECT name<br>
FROM people<br>
WHERE name LIKE '_r%';

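You can also use the NOT LIKE operator to find records that don't match the pattern you specify. Get the names of people whose names don't start with 'A'. The pattern you need is 'A%', combined with NOT LIKE.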
>SELECT name<br>
FROM people<br>
WHERE name NOT LIKE 'A%';
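
The two wildcards can be compared side by side on a handful of names. This is a sketch using Python's built-in sqlite3 module with made-up company names. One assumption to keep in mind: SQLite's LIKE ignores case for ASCII letters by default, whereas some other databases (PostgreSQL, for example) treat LIKE as case-sensitive, so results may differ on mixed-case data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE companies (name TEXT)")
conn.executemany(
    "INSERT INTO companies VALUES (?)",
    [("Data",), ("DataCamp",), ("DataComp",), ("DataMind",), ("Other Co",)],
)

def names(where_clause):
    """Return sorted names matching a WHERE clause (hypothetical helper, trusted input only)."""
    sql = f"SELECT name FROM companies WHERE {where_clause} ORDER BY name"
    return [row[0] for row in conn.execute(sql)]

prefix = names("name LIKE 'Data%'")        # % matches zero or more characters
single = names("name LIKE 'DataC_mp'")     # _ matches exactly one character
negated = names("name NOT LIKE 'Data%'")   # everything that does not match

print(prefix)   # ['Data', 'DataCamp', 'DataComp', 'DataMind']
print(single)   # ['DataCamp', 'DataComp']
print(negated)  # ['Other Co']
```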

### Aggregate functions

To perform some calculation on the data in a database, SQL provides a few functions, called aggregate functions.

For example, the AVG() function gives you the average value from the budget column of the films table:

>SELECT AVG(budget)<br>
FROM films;<br>

Similarly, the MAX() function returns the highest budget:
>SELECT MAX(budget)<br>
FROM films;

The SUM() function returns the result of adding up the numeric values in a column:
>SELECT SUM(budget)<br>
FROM films;

Aggregate functions can be combined with the WHERE clause to summarize only the rows that match a condition:

>SELECT SUM(gross)<br>
FROM films<br>
WHERE release_year >= 2000;

>SELECT AVG(gross)<br>
FROM films<br>
WHERE title LIKE 'A%';

>SELECT MAX(gross)<br>
FROM films<br>
WHERE release_year BETWEEN 2000 AND 2012;

### A note on arithmetic

In addition to using aggregate functions, you can perform basic arithmetic with symbols like +, -, *, and /.

So, for example, this gives a result of 12:

>SELECT (4 * 3);

However, the following gives a result of 1:

>SELECT (4 / 3);

What's going on here?

Many SQL databases, PostgreSQL among them, assume that if you divide an integer by an integer, you want to get an integer back, so the fractional part is discarded. So be careful when dividing!

If you want more precision when dividing, you can add decimal places to your numbers. For example,

>SELECT (4.0 / 3.0) AS result;

gives you the result you would expect: 1.333.
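
The same behaviour can be reproduced with Python's built-in sqlite3 module, since SQLite also performs integer division on two integers. The CAST form is an extra variation not shown above, useful when the operands are integer columns and not literals.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

product, int_division, real_division, cast_division = conn.execute(
    """
    SELECT 4 * 3,                -- integer arithmetic
           4 / 3,                -- integer / integer discards the fraction
           4.0 / 3.0,            -- decimal operands keep the fraction
           CAST(4 AS REAL) / 3   -- casting one operand has the same effect
    """
).fetchone()

print(product, int_division, round(real_division, 3), round(cast_division, 3))
# 12 1 1.333 1.333
```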

### ALIAS

If you use the same aggregate function twice in one query, as in the following, the result has two columns both named max, which isn't very useful.

>SELECT MAX(budget), MAX(duration)<br>
FROM films;

To avoid situations like this, SQL allows you to do something called aliasing. Aliasing simply means you assign a temporary name to something. To alias, you use the AS keyword.

For example, in the above example we could use aliases to make the result clearer:

>SELECT MAX(budget) AS max_budget, MAX(duration) AS max_duration <br>
FROM films;

>SELECT title, duration/60.0 AS duration_hours<br>
FROM films;

>SELECT AVG(duration)/60.0 AS avg_duration_hours<br>
FROM films;

>SELECT (MAX(release_year) - MIN(release_year)) / 10.0 AS number_of_decades<br>
FROM films;
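
Aliases become the column names of the result, which you can inspect from client code. The sketch below uses Python's built-in sqlite3 module and two made-up films; cursor.description is the standard DB-API attribute that lists result columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE films (title TEXT, budget INTEGER, duration INTEGER)")
conn.executemany(
    "INSERT INTO films VALUES (?, ?, ?)",
    [("Short", 1000, 90), ("Long", 5000, 150)],
)

cursor = conn.execute(
    "SELECT MAX(budget) AS max_budget, MAX(duration) AS max_duration FROM films"
)
# The first item of each description entry is the column name.
column_names = [column[0] for column in cursor.description]
maxima = cursor.fetchone()

# An alias also names a computed column; 60.0 avoids integer division.
hours = conn.execute(
    "SELECT title, duration / 60.0 AS duration_hours FROM films ORDER BY title"
).fetchall()

print(column_names)  # ['max_budget', 'max_duration']
print(maxima)        # (5000, 150)
print(hours)         # [('Long', 2.5), ('Short', 1.5)]
```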

### ORDER BY

In SQL, the ORDER BY keyword is used to sort results in ascending or descending order according to the values of one or more columns.

By default ORDER BY will sort in ascending order.

If you want to sort the results in descending order, you can use the DESC keyword. For example,

>SELECT title<br>
FROM films<br>
ORDER BY release_year DESC;<br>

gives you the titles of films sorted by release year, from newest to oldest.

>SELECT birthdate, name<br>
FROM people<br>
ORDER BY birthdate;

>SELECT title<br>
FROM films<br>
WHERE release_year IN (2000, 2012)<br>
ORDER BY release_year;

>SELECT title, gross<br>
FROM films<br>
WHERE title LIKE 'M%'<br>
ORDER BY title;

>SELECT imdb_score, film_id<br>
FROM reviews<br>
ORDER BY imdb_score DESC;

#### Sorting multiple columns

ORDER BY can also be used to sort on multiple columns. It will sort by the first column specified, then sort by the next, then the next, and so on. For example,

>SELECT birthdate, name<br>
FROM people<br>
ORDER BY birthdate, name;<br>

sorts on birth dates first (oldest to newest) and then sorts on the names in alphabetical order.

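The order of columns is important. The following query sorts by name first and uses birthdate only to break ties between identical names: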

>SELECT name, birthdate<br>
FROM people<br>
ORDER BY name, birthdate;
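
To see the effect of column order in ORDER BY, here is a sketch using Python's built-in sqlite3 module. The four people are made up, and birth dates are stored as ISO-formatted text so that alphabetical order matches chronological order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, birthdate TEXT)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?)",
    [("Bob", "1990-01-01"), ("Alice", "1990-01-01"),
     ("Carol", "1985-05-05"), ("Alice", "1970-03-03")],
)

# Birth date decides the order; name only breaks ties.
by_date_then_name = conn.execute(
    "SELECT birthdate, name FROM people ORDER BY birthdate, name"
).fetchall()

# Name decides the order; birth date only breaks ties.
by_name_then_date = conn.execute(
    "SELECT name, birthdate FROM people ORDER BY name, birthdate"
).fetchall()

print(by_date_then_name)
print(by_name_then_date)
```

In the first result the two people born on 1990-01-01 appear last, ordered Alice then Bob; in the second the two Alices appear first, ordered by birth date.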

### GROUP BY

Often you'll need to aggregate results. For example, you might want to count the number of male and female employees in your company. Here, what you want is to group all the males together and count them, and group all the females together and count them. In SQL, GROUP BY allows you to group a result by one or more columns, like so:

>SELECT sex, count(*)<br>
FROM employees<br>
GROUP BY sex;

Commonly, GROUP BY is used with aggregate functions like COUNT() or MAX(). Note that GROUP BY always goes after the FROM clause.


A word of warning: most SQL databases (PostgreSQL, for example) will return an error if you try to SELECT a field that is not in your GROUP BY clause without using it to calculate some kind of value about the entire group.

Note that you can combine GROUP BY with ORDER BY to group your results, calculate something about them, and then order your results. For example,

>SELECT sex, count(*)<br>
FROM employees<br>
GROUP BY sex<br>
ORDER BY count DESC;

>SELECT release_year, MAX(budget)<br>
FROM films<br>
GROUP BY release_year;

>SELECT imdb_score, COUNT(*)<br>
FROM reviews<br>
GROUP BY imdb_score;

>SELECT release_year, MIN(gross)<br>
FROM films<br>
GROUP BY release_year<br>
ORDER BY MIN(gross);

>SELECT release_year, country, MAX(budget)<br>
FROM films<br>
GROUP BY release_year, country<br>
ORDER BY release_year, country;

>SELECT country, SUM(budget)<br>
FROM films<br>
GROUP BY country<br>
ORDER BY SUM(budget);
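
Here is a small runnable sketch of grouping, aggregating, and then ordering, using Python's built-in sqlite3 module. The five films and their budgets are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE films (title TEXT, country TEXT, budget INTEGER)")
conn.executemany(
    "INSERT INTO films VALUES (?, ?, ?)",
    [("A", "USA", 100), ("B", "USA", 300), ("C", "France", 50),
     ("D", "Spain", 80), ("E", "Spain", 20)],
)

# One output row per country, with values computed over each group.
per_country = conn.execute(
    """
    SELECT country, COUNT(*) AS n_films, SUM(budget) AS total_budget
    FROM films
    GROUP BY country
    ORDER BY total_budget DESC
    """
).fetchall()

print(per_country)  # [('USA', 2, 400), ('Spain', 2, 100), ('France', 1, 50)]
```

Aliasing the aggregates lets ORDER BY refer to them by name without repeating the function call.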

### HAVING

In SQL, aggregate functions can't be used in WHERE clauses. For example, the following query is invalid:

>SELECT release_year<br>
FROM films<br>
GROUP BY release_year<br>
WHERE COUNT(title) > 10;

This means that if you want to filter based on the result of an aggregate function, you need another way.

That's where the HAVING clause comes in. For example,

>SELECT release_year<br>
FROM films<br>
GROUP BY release_year<br>
HAVING COUNT(title) > 10;

shows only those years in which more than 10 films were released.


>SELECT release_year, AVG(budget) AS avg_budget, AVG(gross) AS avg_gross<br>
FROM films<br>
WHERE release_year > 1990<br>
GROUP BY release_year<br>
HAVING AVG(budget) > 60000000<br>
ORDER BY release_year;

Result
>release_year	avg_budget	avg_gross<br>
2005	70.	41.32<br>
2006	93.	39.23

Another example:
>-- select country, average budget, average gross<br>
SELECT country, AVG(budget) AS avg_budget, AVG(gross) AS avg_gross

>-- from the films table<br>
FROM films<br>

>-- group by country<br>
GROUP BY country<br>

>-- where the country has more than 10 titles<br>
HAVING COUNT(country) > 10<br>

>-- order by country<br>
ORDER BY country<br>

>-- limit to only show 5 results<br>
LIMIT 5;
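
The difference between WHERE and HAVING is easiest to see when both appear in one query: WHERE filters individual rows before grouping, and HAVING filters whole groups after aggregation. A sketch using Python's built-in sqlite3 module, with made-up films and small budgets chosen so the averages are easy to check by hand:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE films (title TEXT, release_year INTEGER, budget INTEGER)")
conn.executemany(
    "INSERT INTO films VALUES (?, ?, ?)",
    [("a", 1989, 500),                  # removed by WHERE (year too early)
     ("b", 1995, 10), ("c", 1995, 30),  # average 20, removed by HAVING
     ("d", 2005, 60), ("e", 2005, 80)], # average 70, kept
)

big_budget_years = conn.execute(
    """
    SELECT release_year, AVG(budget) AS avg_budget
    FROM films
    WHERE release_year > 1990      -- row filter, applied before grouping
    GROUP BY release_year
    HAVING AVG(budget) > 50        -- group filter, applied after aggregation
    ORDER BY release_year
    """
).fetchall()

print(big_budget_years)  # [(2005, 70.0)]
```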

### CASE
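
A CASE expression checks its WHEN conditions in order and returns the value of the first one that is true, falling back to ELSE, or to NULL when there is no ELSE. The sketch below is a minimal illustration using Python's built-in sqlite3 module; its matches table and four rows are made up and are not the tables used in the examples that follow.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE matches (id INTEGER, home_goal INTEGER, away_goal INTEGER)"
)
conn.executemany(
    "INSERT INTO matches VALUES (?, ?, ?)",
    [(1, 2, 1), (2, 0, 3), (3, 1, 1), (4, 4, 0)],
)

# CASE in SELECT: label every row.
outcomes = [
    row[0]
    for row in conn.execute(
        """
        SELECT CASE WHEN home_goal > away_goal THEN 'home win'
                    WHEN home_goal < away_goal THEN 'away win'
                    ELSE 'tie' END AS outcome
        FROM matches
        ORDER BY id
        """
    )
]

# CASE inside an aggregate: two equivalent ways to count home wins.
home_wins_sum, home_wins_count = conn.execute(
    """
    SELECT SUM(CASE WHEN home_goal > away_goal THEN 1 ELSE 0 END),
           COUNT(CASE WHEN home_goal > away_goal THEN id END)
    FROM matches
    """
).fetchone()

print(outcomes)                        # ['home win', 'away win', 'tie', 'home win']
print(home_wins_sum, home_wins_count)  # 2 2
```

The COUNT version works because a CASE with no ELSE yields NULL for non-matching rows, and COUNT skips NULLs.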


#----------------------------------------------------------------------------------------------------
# Example 0
"""
-- Select matches where Barcelona was the away team
SELECT  
    m.date,
    t.team_long_name AS opponent,
    CASE WHEN m.home_goal < m.away_goal THEN 'Barcelona win!'
        WHEN m.home_goal > m.away_goal THEN 'Barcelona loss :(' 
        ELSE 'Tie' END AS outcome

FROM matches_spain AS m

-- Join teams_spain to matches_spain
LEFT JOIN teams_spain AS t 
ON m.hometeam_id = t.team_api_id
WHERE m.awayteam_id = 8634;
"""
#----------------------------------------------------------------------------------------------------
# Example 1
"""
SELECT 
    date,
    -- Identify the home team as Barcelona or Real Madrid
    CASE WHEN hometeam_id = 8634 THEN 'FC Barcelona' 
        WHEN hometeam_id = 8633 THEN  'Real Madrid CF' END AS home,
        
    -- Identify the away team as Barcelona or Real Madrid
    CASE WHEN awayteam_id = 8634 THEN 'FC Barcelona' 
        WHEN awayteam_id = 8633 THEN 'Real Madrid CF' END AS away
        
FROM matches_spain

WHERE (awayteam_id = 8634 OR hometeam_id = 8634)
    AND (awayteam_id = 8633 OR hometeam_id = 8633);
"""

#----------------------------------------------------------------------------------------------------
# Example 2
"""
-- Select the season, date, home_goal, and away_goal columns
SELECT 
    season,
    date,
    home_goal,
    away_goal
FROM matches_italy

WHERE 
-- Exclude games not won by Bologna
    -- Identify when Bologna won a match
    CASE WHEN hometeam_id = 9857  AND home_goal > away_goal  THEN 'Bologna Win'
        WHEN awayteam_id = 9857  AND away_goal > home_goal  THEN 'Bologna Win' 
        END IS NOT null;
"""

#----------------------------------------------------------------------------------------------------
# Example 3
"""
SELECT 
    c.name AS country,
    -- Count matches in each of the 3 seasons
    COUNT(CASE WHEN m.season = '2012/2013' THEN m.id END) AS matches_2012_2013,
    COUNT(CASE WHEN m.season = '2013/2014' THEN m.id END) AS matches_2013_2014,
    COUNT(CASE WHEN m.season = '2014/2015' THEN m.id END) AS matches_2014_2015
    
FROM country AS c

LEFT JOIN match AS m
ON c.id = m.country_id

-- Group by country name alias
GROUP BY country;

"""

#----------------------------------------------------------------------------------------------------
# Example 4
"""
SELECT 
    c.name AS country,
    -- Sum the total records in each season where the home team won
    SUM(CASE WHEN m.season = '2012/2013' AND m.home_goal > m.away_goal 
        THEN 1 ELSE 0 end) AS matches_2012_2013,
    SUM(CASE WHEN m.season  = '2013/2014' AND m.home_goal > m.away_goal
        THEN 1 ELSE 0 end) AS matches_2013_2014,
    SUM(CASE WHEN m.season = '2014/2015' AND m.home_goal > m.away_goal
       THEN 1 ELSE 0 end) AS matches_2014_2015

FROM country AS c
LEFT JOIN match AS m
ON c.id = m.country_id
-- Group by country name alias
GROUP BY country;
"""

### Subqueries

 - You can have a subquery in the FROM, WHERE, SELECT, or GROUP BY part of the main query.
 - Subqueries allow you to compare groups to summarized values, reshape data, or combine data that cannot be joined.
 - SQL first processes the information inside the subquery and then moves to the outer query.
 
 
 - Subqueries are incredibly powerful for performing complex filters and transformations. You can filter data based on single, scalar values using a subquery in ways you cannot by using WHERE statements or joins. Subqueries can also be used for more advanced manipulation of your data set.
 
 
 - In addition to filtering using a single-value (scalar) subquery, you can create a list of values in a subquery to filter data based on a complex set of conditions. This type of subquery generates a one column reference list for the main query. As long as the values in your list match a column in your main query's table, you don't need to use a join -- even if the list is from a separate table.
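
The two kinds of WHERE subquery described above, a scalar value and a one-column list, can be sketched as follows with Python's built-in sqlite3 module. The teams and matches tables, their columns, and their rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE teams (team_id INTEGER, name TEXT)")
conn.execute("CREATE TABLE matches (home_team_id INTEGER, home_goal INTEGER)")
conn.executemany(
    "INSERT INTO teams VALUES (?, ?)",
    [(1, "Reds"), (2, "Blues"), (3, "Greens")],
)
conn.executemany(
    "INSERT INTO matches VALUES (?, ?)",
    [(1, 8), (2, 1), (3, 9), (1, 0)],
)

# Scalar subquery: compare each row to one aggregated value (the average, 4.5).
above_average = conn.execute(
    """
    SELECT COUNT(*)
    FROM matches
    WHERE home_goal > (SELECT AVG(home_goal) FROM matches)
    """
).fetchone()[0]

# List subquery: filter one table by ids found in another, without a join.
high_scorers = [
    row[0]
    for row in conn.execute(
        """
        SELECT name
        FROM teams
        WHERE team_id IN (SELECT home_team_id FROM matches WHERE home_goal >= 8)
        ORDER BY name
        """
    )
]

print(above_average)  # 2
print(high_scorers)   # ['Greens', 'Reds']
```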

#----------------------------------------------------------------------------------------------------
# Example 0 (WHERE)
"""
SELECT
    -- Select the team long and short names
    team_long_name,
    team_short_name
FROM team
-- Filter for teams with 8 or more home goals
WHERE team_api_id IN
      (SELECT hometeam_ID 
       FROM match
       WHERE home_goal >= 8);
"""

 - A subquery in FROM is an effective way of answering detailed questions that require filtering or transforming data before including it in your final results.

#----------------------------------------------------------------------------------------------------
# Example 1 (FROM)
"""
SELECT
    -- Select country, date, home, and away goals from the subquery
    country,
    date,
    home_goal,
    away_goal
FROM
    -- Select country name, date, home_goal, away_goal, and total goals in the subquery
    (SELECT c.name AS country, 
            m.date, 
            m.home_goal, 
            m.away_goal,
           (m.home_goal + m.away_goal) AS total_goals
    FROM match AS m
    LEFT JOIN country AS c
    ON m.country_id = c.id) AS subquery
-- Filter by total goals scored in the main query
WHERE total_goals >= 10;
"""

- Subqueries in SELECT statements generate a single value, which lets you attach an aggregate value to every row of your result. This is useful for performing calculations on data within your database.

- Subqueries in SELECT are a useful way to create calculated columns in a query. A subquery in SELECT can be treated as a single numeric value to use in your calculations. When writing subqueries in SELECT, it's important to remember that filtering the main query does not filter the subquery -- and vice versa.
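
The point about filters not carrying over can be checked with a small sketch using Python's built-in sqlite3 module. The matches table here is made up and stores a single goals column per match to keep the arithmetic obvious.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE matches (season TEXT, goals INTEGER)")
conn.executemany(
    "INSERT INTO matches VALUES (?, ?)",
    [("2012/2013", 2), ("2012/2013", 4), ("2013/2014", 6), ("2013/2014", 8)],
)

season_avg, overall_avg = conn.execute(
    """
    SELECT AVG(goals) AS season_avg,
           -- The subquery has no WHERE, so it averages every season.
           (SELECT AVG(goals) FROM matches) AS overall_avg
    FROM matches
    WHERE season = '2013/2014'   -- only restricts the main query
    """
).fetchone()

print(season_avg, overall_avg)  # 7.0 5.0
```

To compare like with like, the same season filter would have to be repeated inside the subquery.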

#----------------------------------------------------------------------------------------------------
# Example 2 (SELECT)
"""
SELECT 
    l.name AS league,
    
    -- Select and round the league's average total goals
    ROUND(AVG(m.home_goal + m.away_goal), 2) AS avg_goals,
    
    -- Select & round the average total goals for the season
    (SELECT ROUND(AVG(home_goal + away_goal), 2) 
     FROM match
     WHERE season = '2013/2014') AS overall_avg
     
FROM league AS l

LEFT JOIN match AS m
ON l.country_id = m.country_id

-- Filter for the 2013/2014 season
WHERE season = '2013/2014'

GROUP BY league;
"""

#----------------------------------------------------------------------------------------------------
# Example 3 (SELECT)
"""
SELECT
    -- Select the league name and average goals scored
    l.name AS league,
    ROUND(AVG(m.home_goal + m.away_goal),2) AS avg_goals,
    
    -- Subtract the overall average from the league average
    ROUND(AVG(m.home_goal + m.away_goal) - 
        (SELECT AVG(home_goal + away_goal)
         FROM match 
         WHERE season = '2013/2014'),2) AS diff
         
FROM league AS l

LEFT JOIN match AS m
ON l.country_id = m.country_id

-- Only include 2013/2014 results

WHERE season = '2013/2014'

GROUP BY l.name;
"""

### Correlated subqueries

 - Correlated subqueries are subqueries that reference one or more columns in the main query. Correlated subqueries depend on information in the main query to run, and thus, cannot be executed on their own.

 - Correlated subqueries are evaluated in SQL once per row of data retrieved -- a process that takes a lot more computing power and time than a simple subquery.
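
A minimal correlated subquery can be sketched with Python's built-in sqlite3 module. The matches table and its four rows are made up; the inner query refers to main.season, so it cannot run on its own.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE matches "
    "(id INTEGER, season TEXT, home_goal INTEGER, away_goal INTEGER)"
)
conn.executemany(
    "INSERT INTO matches VALUES (?, ?, ?, ?)",
    [(1, "2012/2013", 1, 0), (2, "2012/2013", 3, 2),
     (3, "2013/2014", 4, 4), (4, "2013/2014", 2, 0)],
)

# For each outer row, the subquery finds the highest total in that row's season.
top_per_season = [
    row[0]
    for row in conn.execute(
        """
        SELECT main.id
        FROM matches AS main
        WHERE (main.home_goal + main.away_goal) =
              (SELECT MAX(sub.home_goal + sub.away_goal)
               FROM matches AS sub
               WHERE sub.season = main.season)
        ORDER BY main.id
        """
    )
]

print(top_per_season)  # [2, 3]
```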

"""
SELECT 
    -- Select country ID, date, home, and away goals from match
    main.country_id,
    main.date,
    main.home_goal,
    main.away_goal

FROM match AS main

WHERE 
    -- Filter for matches with the highest number of goals scored

    (home_goal + away_goal) = 
    
        (SELECT MAX(sub.home_goal + sub.away_goal)
        
         FROM match AS sub
         
         WHERE main.country_id = sub.country_id
         
               AND main.season = sub.season
        );
"""

### Nested subqueries

 - Nested subqueries can be either simple or correlated. Just like an unnested subquery, a nested subquery's components can be executed independently of the outer query, while a correlated subquery requires both the outer and inner subquery to run and produce results.
 
 - Nesting subqueries lets you work one step at a time: perform a set of transformations, wrap the result in a subquery, and then perform the next set of transformations. This is often the easiest way to yield accurate information about your data.
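
The step-by-step approach can be sketched with Python's built-in sqlite3 module: the innermost query filters rows, the middle one counts per season, and the outermost one averages those counts. The matches table, the season labels 'A' and 'B', and the rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE matches "
    "(country_id INTEGER, season TEXT, home_goal INTEGER, away_goal INTEGER)"
)
conn.executemany(
    "INSERT INTO matches VALUES (?, ?, ?, ?)",
    [(1, "A", 5, 0), (1, "A", 0, 6), (1, "A", 1, 1), (1, "B", 7, 2),
     (2, "A", 5, 5), (2, "B", 0, 0)],
)

avg_high_scoring = conn.execute(
    """
    SELECT country_id, AVG(n_matches) AS avg_per_season
    FROM (
        -- Step 2: count the high-scoring matches per country and season.
        SELECT country_id, season, COUNT(*) AS n_matches
        FROM (
            -- Step 1: keep matches where either team scored 5 or more.
            SELECT country_id, season
            FROM matches
            WHERE home_goal >= 5 OR away_goal >= 5
        ) AS inner_s
        GROUP BY country_id, season
    ) AS outer_s
    GROUP BY country_id
    ORDER BY country_id
    """
).fetchall()

print(avg_high_scoring)  # [(1, 1.5), (2, 1.0)]
```

Seasons with no high-scoring match produce no row in step 2, so they do not pull the average down; whether that is the desired behaviour depends on the question being asked.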

"""
SELECT
    -- Select the season and max goals scored in a match
    season,
    
    MAX(home_goal + away_goal) AS max_goals,
    
    -- Select the overall max goals scored in a match
   (SELECT MAX(home_goal + away_goal) FROM match) AS overall_max_goals,
   
    -- Select the max number of goals scored in any match in July
   (SELECT MAX(home_goal + away_goal) 
    FROM match
    WHERE id IN (
          -- Nested subquery
          SELECT id FROM match WHERE EXTRACT(MONTH FROM date) = 07)
    ) AS july_max_goals
          
FROM match
GROUP BY season;
"""

What's the average number of matches per season where a team scored 5 or more goals? How does this differ by country?

"""
SELECT
    c.name AS country,

    -- Calculate the average matches per season
    AVG(outer_s.matches) AS avg_seasonal_high_scores

FROM country AS c

-- Left join outer_s to country

LEFT JOIN (
  SELECT country_id, season,
         COUNT(id) AS matches

  FROM (
    SELECT country_id, season, id
    FROM match
    WHERE home_goal >= 5 OR away_goal >= 5) AS inner_s
    
  -- Close parentheses and alias the subquery
  GROUP BY country_id, season) AS outer_s
  
ON c.id = outer_s.country_id

GROUP BY country;
"""

### Common table expressions