- **Distinct**: select all the unique values from a column

In [None]:
select distinct language from films;

- **count**: count the number of rows (`count(*)`) or of non-NULL values in a column

In [None]:
select count(*) from people;

select count(distinct birthdate) from people;
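
The difference between the three forms of `count` is easiest to see when a column contains NULLs and repeated values. A minimal sketch, assuming a small hypothetical table `people_demo` whose rows are made up for illustration:

```sql
-- hypothetical sample data, for illustration only
create temporary table people_demo (name text, birthdate date);
insert into people_demo values
    ('Ann',  '1980-01-01'),
    ('Bob',  '1980-01-01'),   -- same birthdate as Ann
    ('Cleo', null);           -- unknown birthdate

select count(*)                  as n_rows,      -- counts every row: 3
       count(birthdate)          as n_known,     -- skips NULLs: 2
       count(distinct birthdate) as n_distinct   -- unique non-NULL values: 1
from people_demo;
```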

- **round**: round(number, number of decimal places)

In [None]:
select round(3.141592, 2);  -- 3.14

# Filtering

In SQL, the **WHERE** keyword allows you to filter based on both text and numeric values in a table. There are a few different comparison operators you can use:
- = equal
- <>  / != not equal
- < less than
- \> greater than
- <= less than or equal to
- \>= greater than or equal to

- **between**: between is inclusive

In [None]:
select title from films
    where release_year
    between 1994 and 2000;

- **in** operator

In [None]:
select name from kids
    where age in (2, 4, 6, 8, 10);

- **is null** / **is not null**: test for missing values

In [None]:
select name from kids
    where age is null;
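
A comparison with `=` never matches a NULL, which is why the dedicated `is null` test exists. A minimal sketch, assuming a hypothetical table `kids_demo` with made-up rows:

```sql
-- hypothetical sample data, for illustration only
create temporary table kids_demo (name text, age integer);
insert into kids_demo values ('Ann', 4), ('Bob', null);

-- returns no rows: `age = null` evaluates to NULL (unknown), never to true
select name from kids_demo where age = null;

-- returns Bob
select name from kids_demo where age is null;

-- returns Ann
select name from kids_demo where age is not null;
```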

- **like** and **not like**
    - search for a pattern in a column
    - % wildcard will match zero, one, or many characters in text.
    - _ wildcard will match a single character.

In [None]:
select name from companies
    where name like 'Data%';
    
select name from companies
    where name like 'DataC_mp';
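
The cell above only shows `like`. A minimal sketch of `not like` next to the `_` wildcard, assuming a hypothetical table `companies_demo` with made-up names:

```sql
-- hypothetical sample data, for illustration only
create temporary table companies_demo (name text);
insert into companies_demo values
    ('DataCamp'), ('DataComp'), ('Database Inc'), ('Acme');

-- names that do NOT start with 'Data': returns Acme
select name from companies_demo
    where name not like 'Data%';

-- '_' stands for exactly one character: returns DataCamp and DataComp
select name from companies_demo
    where name like 'DataC_mp';
```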

- aggregate functions
    - **avg(), max(), min(), sum()**

In [None]:
select avg(budget) from films;

select sum(budget) from films;

- arithmetic: 

In [None]:
select title, (gross - budget) as net_profit from films;
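
One thing to watch with arithmetic is division: in PostgreSQL, dividing one integer by another gives an integer, so ratios are silently truncated. A minimal sketch, assuming PostgreSQL and a hypothetical table `films_demo` with integer columns:

```sql
-- hypothetical sample data, for illustration only
create temporary table films_demo (title text, budget integer, gross integer);
insert into films_demo values ('A', 10, 45), ('B', 20, 10);

select title,
       gross / budget       as ratio_int,   -- integer division: 4 and 0
       gross * 1.0 / budget as ratio_num    -- numeric division: 4.5 and 0.5
from films_demo;
```

Multiplying by `1.0` (or casting one operand) before dividing is enough to get the fractional result.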

- **order by**
    - **DESC** keyword: descending order
    - **ORDER BY** can also be used to sort on multiple columns. It will sort by the first column specified, then sort by the next, then the next, and so on.
    - Put the ORDER BY clause after WHERE, GROUP BY, and HAVING; only LIMIT comes after it. You can't sort values that you haven't calculated yet!

In [None]:
select title from films
    order by release_year DESC;
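
Sorting on several columns can be sketched as follows, assuming a hypothetical table `films_sort_demo` with made-up rows. Each column in the list can have its own direction:

```sql
-- hypothetical sample data, for illustration only
create temporary table films_sort_demo (title text, release_year integer);
insert into films_sort_demo values
    ('Beta', 2000), ('Alpha', 2000), ('Gamma', 1994);

-- newest year first; within the same year, titles alphabetically
select title, release_year from films_sort_demo
    order by release_year desc, title;
-- returns Alpha (2000), Beta (2000), Gamma (1994)
```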

- **group by**
    - **GROUP BY** is used with aggregate functions like **COUNT()** or **MAX()**.
    - Note that **GROUP BY always goes after the FROM clause** (and after WHERE, when there is one)!

In [None]:
select sex, count(*) from employees
    group by sex;

- **having**
    - In SQL, **aggregate functions can't be used in WHERE clauses.**
    - if you want to filter based on the result of an aggregate function, you need another way! That's where the **HAVING clause** comes in. 

In [None]:
select release_year from films
    group by release_year
    having count(title) > 10;

- **limit**
    - limit the number of rows returned

In [None]:
select country, avg(budget) as avg_budget, avg(gross) as avg_gross from films
    group by country
    having count(title) > 10
    order by country
    limit 5;

# Joining
- inner join
- full (outer) join
- left/right (outer) join: fetches all rows from the left/right table, with matching data from the right/left table if present
- cross join
    - all possible pairs
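
The four join types can be compared on the same pair of tables. A minimal sketch, assuming two hypothetical tables `left_demo` and `right_demo` that share only the id 2:

```sql
-- hypothetical sample data, for illustration only
create temporary table left_demo  (id integer, l_val text);
create temporary table right_demo (id integer, r_val text);
insert into left_demo  values (1, 'L1'), (2, 'L2');
insert into right_demo values (2, 'R2'), (3, 'R3');

-- inner join: only id 2, which is present in both tables
select l.id, l_val, r_val
    from left_demo as l
    inner join right_demo as r on l.id = r.id;

-- left join: ids 1 and 2; r_val is NULL for id 1
select l.id, l_val, r_val
    from left_demo as l
    left join right_demo as r on l.id = r.id;

-- full join: ids 1, 2 and 3; the missing side is NULL
select coalesce(l.id, r.id) as id, l_val, r_val
    from left_demo as l
    full join right_demo as r on l.id = r.id;

-- cross join: every pairing, 2 x 2 = 4 rows
select l.id as l_id, r.id as r_id
    from left_demo as l
    cross join right_demo as r;
```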

## inner join

In [None]:
select p1.country, p1.continent, prime_minister, president 
    from prime_ministers as p1
    inner join presidents as p2
    on p1.country = p2.country;

In [None]:
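-- inner join w/ using
-- (left and right are reserved words, so the tables are named left_table and right_table here)

select l.id as L_id, l.val as L_val, r.val as R_val
    from left_table as l
    inner join right_table as r
    using (id);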

## self join

In [None]:
select p1.country_code, p1.size as size2010, p2.size as size2015,
    ((p2.size - p1.size)/p1.size * 100.) as growth_perc
    from populations as p1
    inner join populations as p2   -- join to itself
    on p1.country_code = p2.country_code
        and p1.year = p2.year - 5;

## cross join

In [None]:
select prime_minister, president 
    from prime_ministers as p1
    cross join presidents as p2
    where p1.continent in ('North America', 'Oceania');

# Set Theory Venn Diagrams

## Union
- **union**: stacks the results of two queries and removes duplicate rows; it can also be used to determine all occurrences of a field across multiple tables.
- **union all**: same as union, but keeps duplicates

In [None]:
select prime_minister as leader, country
    from prime_ministers
union
select monarch, country
    from monarchs
order by country;
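
The effect of `union` versus `union all` shows up as soon as the two inputs overlap. A minimal sketch, assuming two hypothetical single-column tables with made-up rows:

```sql
-- hypothetical sample data, for illustration only
create temporary table pm_demo (country text);
create temporary table monarch_demo (country text);
insert into pm_demo values ('Norway'), ('Spain'), ('India');
insert into monarch_demo values ('Norway'), ('Spain'), ('Oman');

-- union: 4 rows (India, Norway, Oman, Spain); duplicates removed
select country from pm_demo
union
select country from monarch_demo
order by country;

-- union all: 6 rows; Norway and Spain appear twice
select country from pm_demo
union all
select country from monarch_demo
order by country;
```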

## intersect

In [None]:
select country from prime_ministers
intersect
select country from presidents;

## except

In [None]:
select monarch, country from monarchs
except
select prime_minister, country from prime_ministers;

## Semi-join (an intro to subqueries)

In [None]:
select president, country, continent from presidents
    where country in 
        (select name from states
            where indep_year < 1800);

## Anti-join 

In [None]:
select president, country, continent from presidents
    where continent like '%America'
        and country not in 
            (select name from states
                where indep_year < 1800);
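
One caveat with `not in`: if the subquery returns even a single NULL, the condition is never true and the anti-join returns nothing. `not exists` does not have this problem. A minimal sketch, assuming two hypothetical tables with made-up rows:

```sql
-- hypothetical sample data, for illustration only
create temporary table presidents_demo (country text);
create temporary table states_demo (name text);
insert into presidents_demo values ('Chile'), ('USA');
insert into states_demo values ('USA'), (null);

-- returns no rows: the NULL in the list makes `not in` unknown for Chile
select country from presidents_demo
    where country not in (select name from states_demo);

-- returns Chile: not exists is unaffected by the NULL
select country from presidents_demo as p
    where not exists
        (select 1 from states_demo as s
            where s.name = p.country);
```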

In [None]:
-- Identify the country codes that are included in either economies or currencies but not in populations.
-- Use that result to determine the names of cities in the countries that match the specification in the previous instruction.

select name from cities as c1
    where c1.country_code in 
        (select e.code from economies as e
         union    -- get all additional (unique) values of the currencies table
         select c2.code from currencies as c2
         except
         select p.country_code from populations as p);

# Subqueries

- can be in any part of a query:
  **select, from, where, group by**
- can return
	- scalar quantities
	- a list
	- a table
- why?
	- comparing groups to summarized value
	- reshaping data
	- combining data that cannot be joined

## subqueries inside **where/select** clauses

In [None]:
select name, fert_rate from states
    where continent = 'Asia'
        and fert_rate < 
            (select avg(fert_rate) from states);

In [None]:
select distinct continent, 
    (select count(*) from states 
        where prime_ministers.continent = states.continent) as countries_num
    from prime_ministers;

In [None]:
/*
SELECT countries.name AS country, COUNT(*) AS cities_num
    FROM cities
    INNER JOIN countries
    ON countries.code = cities.country_code
    GROUP BY country
    ORDER BY cities_num DESC, country
    LIMIT 9;
*/

select countries.name as country,
    (select count(*) from cities
        where countries.code = cities.country_code) as cities_num
    from countries
    order by cities_num desc, country
    limit 9;

## subquery filtering list with **in**

In [None]:
select team_long_name, team_short_name as abbr
    from team
    where team_api_id in 
        (select hometeam_id from match
            where country_id = 15722);

## Subquery inside the **From** clause

- Restructure and transform your data
	- transforming data from long to wide before selecting
	- pre-filtering data
- Calculating aggregates of aggregates, e.g., which 3 teams have the highest average of home goals scored?
    1. calculate the avg for each team
    2. get the 3 highest of the avg values
- you can create multiple subqueries in one **from** statement
	- alias them / join them !!
- you can join a subquery to a table in **from**
	- include a joining column in both tables !

In [None]:
select team, home_avg
from (select t.team_long_name as team, avg(m.home_goal) as home_avg
      from match as m
      left join team as t
      on m.hometeam_id = t.team_api_id
      where season = '2011/2012'
      group by team) as subquery
order by home_avg desc
limit 3;

In [None]:
select distinct monarchs.continent, subquery.max_perc
    from monarchs,
        (select continent, max(women_parli_perc) as max_perc
            from states
            group by continent)
            as subquery
    where monarchs.continent = subquery.continent
    order by continent;

In [None]:
SELECT name, continent, inflation_rate
    FROM countries
    INNER JOIN economies
    on countries.code = economies.code
        WHERE year = 2015
        and inflation_rate in (
            SELECT MAX(inflation_rate) AS max_inf
                FROM (
                     SELECT name, continent, inflation_rate
                         FROM countries
                         INNER JOIN economies
                             on countries.code = economies.code
                             WHERE year = 2015) AS subquery
                         GROUP BY continent);

## subquery in **select** clause
- returns a ***single value***
	- include aggregate values to compare to individual values
- used in mathematical calculations
	- deviation from the average

In [None]:
select season, count(id) as matches, 
    (select count(id) from match) as total_matches
    from match
    group by season;

In [None]:
-- calculate the difference from the average value

select date, (home_goal + away_goal) as goals,
    (home_goal + away_goal) - 
        (select avg(home_goal + away_goal) 
         from match
         where season = '2011/2012') as diff
from match
where season = '2011/2012';

## Correlated subquery
- Uses values from the outer query to generate a result
- re-run for every row generated in the final data set
- used for advanced joining, filtering, and evaluating data

In [None]:
-- simple query: what is the average number of goals scored in each country?

select c.name as country, avg(m.home_goal + m.away_goal) as avg_goals
    from country as c
    left join match as m
        on c.id = m.country_id
    group by country;

In [None]:
-- correlated query: what is the average number of goals scored in each country?

select c.name as country,
    (select avg(home_goal + away_goal)
     from match as m
     where m.country_id = c.id) as avg_goals -- match inner query to outer query
from country as c;

# Nested subqueries
- subquery inside another subquery
- Inner subquery using **EXTRACT**

In [None]:
select
    extract(MONTH from date) as month,
    sum(home_goal + away_goal) as goals
from match
group by month;

In [None]:
select
    extract(MONTH from date) as month,
    sum(m.home_goal + m.away_goal) as total_goals,
    sum(m.home_goal + m.away_goal) - 
        (select avg(goals)     -- average total goals by month
         from (select
                  extract(MONTH from date) as month,
                  sum(home_goal + away_goal) as goals
               from match
               group by month) as s) as diff
from match as m
group by month;

# Case when and then
- categorizing data
- filtering data
- aggregating data

In [None]:
select name, continent, indep_year,
    case when indep_year < 1900 then 'before 1900'
         when indep_year <= 1930 then 'between 1900 and 1930'
         else 'after 1930'
    end as indep_year_group
    from states
    order by indep_year_group;

In [None]:
-- exclude null values

select date, season,
    case when hometeam_id = 8455 and home_goal>away_goal then 'Chelsea home win!'
         when awayteam_id = 8455 and home_goal < away_goal then 'Chelsea away win!'
    end as outcome
    from match
    where case when hometeam_id = 8455 and home_goal>away_goal then 'Chelsea home win!'
               when awayteam_id = 8455 and home_goal < away_goal then 'Chelsea away win!'
          end is not null;

## Case when w/ **count**

In [None]:
select season, 
    count(case when hometeam_id = 8650 
                and home_goal > away_goal then id 
          end) as home_wins
    from match
    group by season;

## Case when w/ **sum**

In [None]:
select season,
    sum(case when hometeam_id = 8650 then home_goal end) as home_goals,
    sum(case when awayteam_id = 8650 then away_goal end) as away_goals
from match
group by season;

## Case when w/ **avg**

In [None]:
AVG(CASE WHEN condition_is_met THEN 1
         WHEN condition_is_not_met THEN 0 END)
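
Filled in with real conditions, the template above gives a proportion. A minimal sketch, assuming a hypothetical table `match_demo` with made-up scores; rows that match neither branch become NULL and are ignored by `avg`:

```sql
-- hypothetical sample data, for illustration only
create temporary table match_demo (id integer, home_goal integer, away_goal integer);
insert into match_demo values (1, 2, 0), (2, 1, 1), (3, 0, 3), (4, 3, 1);

-- share of decided matches won by the home side
-- the tie (match 2) matches neither branch, so avg sees 1, 0, 1: 2 wins out of 3
select avg(case when home_goal > away_goal then 1
                when home_goal < away_goal then 0 end) as home_win_share
from match_demo;
```

Adding `else 0` instead of the second `when` would count ties as non-wins and divide by all 4 matches.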

# into clause
- save the result of the query

In [None]:
select country_code, size,
    case when size > 500000 then 'large'
         when size > 10000  then 'medium'
         else 'small'
    end as popsize_group
    into pop_plus   -- into table
    from populations
    where year = 2015;
    
select * from pop_plus;
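
`select ... into` for creating a table is not available in every database. `create table ... as` does the same job and is more widely supported. A minimal sketch, assuming a hypothetical table `populations_demo` with made-up rows:

```sql
-- hypothetical sample data, for illustration only
create temporary table populations_demo (country_code text, year integer, size bigint);
insert into populations_demo values
    ('AAA', 2015, 900000), ('BBB', 2015, 50000),
    ('CCC', 2015, 800),    ('AAA', 2010, 850000);

-- same result as `select ... into pop_plus_demo`, written as create table as
create temporary table pop_plus_demo as
select country_code, size,
    case when size > 500000 then 'large'
         when size > 10000  then 'medium'
         else 'small'
    end as popsize_group
    from populations_demo
    where year = 2015;

select * from pop_plus_demo;   -- 3 rows: large, medium, small
```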

# Common Table Expressions (CTEs)
- Improve readability compared with complicated nested subquery structures
- Table declared before the main query
- Named and referenced later in FROM statement
- Setting up CTEs

In [None]:
with cte as (
    select col1, col2 from some_table)   -- placeholder name; "table" itself is a reserved word

select avg(col1) as avg_col from cte;

- Why CTEs?
	- Typically executed once
		- the CTE result can then be kept in memory
		- this often improves query performance (the details depend on the database engine)
	- improving organization of queries
	- reference other CTEs
	- Reference itself (SELF JOIN)
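
One way a CTE can reference itself is recursion, written `with recursive` in PostgreSQL. A minimal sketch that needs no table at all; the name `counter` is made up for illustration:

```sql
-- a CTE that references itself: counts from 1 to 5
with recursive counter as (
    select 1 as n            -- anchor row
    union all
    select n + 1             -- recursive step, reads the previous row
    from counter
    where n < 5              -- stop condition
)
select n from counter;       -- returns 1, 2, 3, 4, 5
```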

In [None]:
with s1 as (
    select country_id, id from match
        where (home_goal + away_goal) >= 10), 
s2 as (
    select country_id, id from match
        where (home_goal + away_goal) <= 1)

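-- count(distinct ...) because joining both CTEs pairs every s1 row with every s2 row of a country
select c.name as country, count(distinct s1.id) as high_scores, count(distinct s2.id) as low_scores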
    from country as c
    inner join s1
    on c.id = s1.country_id
    inner join s2
    on c.id = s2.country_id
    group by country;

# Window functions
- Working w/ aggregate values
    - a plain aggregate requires you to group by all non-aggregate columns; window functions compute the aggregate without collapsing the rows

In [None]:
select country_id, season, date, avg(home_goal) as avg_home
    from match
    group by country_id;
    
-- -> Error: "match.season" must appear in the group by clause or be used in an aggregate function

## over() clause
- aggregate function for entire range
- often runs faster than an equivalent subquery in SELECT

In [None]:
/* overall average by using subquery */
select date, 
    (select avg(home_goal + away_goal)
     from match
     where season = '2011/2012') as overall_avg
from match
where season = '2011/2012';

In [None]:
/* overall average by over() clause */
select date,
       avg(home_goal + away_goal) over() as overall_avg
from match
where season = '2011/2012';

## rank() function + over(order by)
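- e.g., what is the rank of matches based on number of goals scored? (The default order is ascending, so the lowest-scoring match gets rank 1; add **DESC** to rank the highest-scoring match first.)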

In [None]:
select date, (home_goal + away_goal) as goals,
    rank() over(order by home_goal + away_goal) as goals_rank
from match
where season = '2011/2012';
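
`rank()` leaves gaps after ties. The related functions `dense_rank()` and `row_number()` treat ties differently. A minimal sketch, assuming a hypothetical table `goals_demo` with made-up totals:

```sql
-- hypothetical sample data, for illustration only
create temporary table goals_demo (match_id integer, goals integer);
insert into goals_demo values (1, 5), (2, 3), (3, 5), (4, 1);

select match_id, goals,
    rank()       over(order by goals desc) as rnk,        -- 1, 1, 3, 4: gap after a tie
    dense_rank() over(order by goals desc) as dense_rnk,  -- 1, 1, 2, 3: no gap
    row_number() over(order by goals desc) as row_num     -- 1, 2, 3, 4: ties broken arbitrarily
from goals_demo
order by goals desc;
```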

## Over w/ a **partition by**
- Calculate separate values for different categories
- Compute different values in the same column, one per category

In [None]:
avg(home_goal) over(partition by season)

In [None]:
select date, avg(home_goal + away_goal) over(partition by season) as season_avg
    from match
    where season = '2011/2012';

## partition by multiple columns

In [None]:
select c.name, m.season, (home_goal+away_goal) as goals,
    avg(home_goal + away_goal) over(partition by m.season, c.name) as season_ctry_avg
    from country as c
    left join match as m
    on c.id = m.country_id;

# Sliding windows
- perform calculations relative to the current row (e.g., cumulative)
- can be used to calculate running totals, sums, averages, etc
- can be partitioned by one or more columns

``
rows between <start> and <finish>
``
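- n **PRECEDING**, n **FOLLOWING**: a number of rows before / after the current row
- **UNBOUNDED PRECEDING**, **UNBOUNDED FOLLOWING**: the first / last row of the partition (start / end point)
- **CURRENT ROW**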

In [None]:
select date, home_goal, away_goal,
    sum(home_goal) over(order by date 
        rows between unbounded preceding and current row) 
        as running_total
from match
where hometeam_id = 8456 and season = '2011/2012';

In [None]:
select date, home_goal, away_goal,
    sum(home_goal) over(order by date 
            rows between 1 preceding and current row) 
        as last2
from match
where hometeam_id = 8456 and season = '2011/2012';
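
The frame can also extend past the current row. A minimal sketch comparing a running total with a centered three-row window, assuming a hypothetical table `home_demo` with made-up results:

```sql
-- hypothetical sample data, for illustration only
create temporary table home_demo (date date, home_goal integer);
insert into home_demo values
    ('2011-08-01', 1), ('2011-08-08', 3), ('2011-08-15', 2), ('2011-08-22', 0);

select date, home_goal,
    -- running total: 1, 4, 6, 6
    sum(home_goal) over(order by date
        rows between unbounded preceding and current row) as running_total,
    -- previous, current and next row: 4, 6, 5, 2
    sum(home_goal) over(order by date
        rows between 1 preceding and 1 following) as sum_3
from home_demo
order by date;
```

At the edges the window simply contains fewer rows, which is why the first and last values of `sum_3` add up only two matches.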

# Other Questions

### What are the benefits of performing in-database analytics?

- Speed: the computation runs where the data is stored, giving a rapid and effective way to obtain, alter, or store data without exporting it first.
- SQL is a reliable and efficient language for communicating with the database.

### Under what conditions would a window function be useful when doing data science with SQL?
- When you need an aggregate value next to row-level detail
    - a plain aggregate requires you to group by all non-aggregate columns, which collapses the rows
- e.g., over() clause
    - aggregate function for entire range
    - often runs faster than an equivalent subquery in SELECT

### Explain the difference between databases, database management systems, and querying languages
- database: an organized collection of structured information, or a collection of virtual tables that store information
- DBMS: software used either to manipulate the data in the database or to manage the database structure itself
- querying language: a language used for retrieving, updating, or deleting information in the database

### Describe a situation where left join, but not a right join, is appropriate
- When there are more than two tables: put the main table on the left, then attach each further table with a left join, so that all rows of the main table are kept.

### What are the main benefits of using a relational database over a large excel spreadsheet?
- A relational database is a type of database that stores and provides access to data points that are related to one another.
- It is the structure to opt for when you need easy retrieval and updating of data, efficiency, data consistency, data integrity, speed, and security.

### Can you explain the difference between the WHERE and HAVING filters? Can you exemplify a situation where just one, but not the other, of these filters is appropriate?
- WHERE clause:
    - filters individual rows before they are grouped
    - aggregate functions are not allowed
- HAVING clause:
    - filters groups after GROUP BY, by posing a condition on each group
    - aggregate functions are allowed
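
Both filters can appear in one query, each doing the job the other cannot. A minimal sketch, assuming a hypothetical table `films_wh_demo` with made-up rows:

```sql
-- hypothetical sample data, for illustration only
create temporary table films_wh_demo (title text, release_year integer, budget integer);
insert into films_wh_demo values
    ('A', 1999, 10), ('B', 1999, 30), ('C', 2005, 50), ('D', 2005, 70), ('E', 1985, 90);

select release_year, avg(budget) as avg_budget
    from films_wh_demo
    where release_year > 1990      -- row filter: drops film E before grouping
    group by release_year
    having avg(budget) > 40;       -- group filter: keeps only 2005 (average 60)
```

The condition on `release_year` belongs in WHERE because it concerns single rows; the condition on `avg(budget)` can only go in HAVING because the average does not exist until the groups are formed.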

### What does the LIMIT command do?
- caps the number of rows returned

### How can you use the BETWEEN to select from a range of values? Can you provide an example and recode the same example using an AND statement?

In [None]:
select title from films
    where release_year
    between 1994 and 2000;
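
The same range written with AND uses `>=` and `<=`, because BETWEEN includes both endpoints. A minimal sketch, assuming a hypothetical table `films_btw_demo` with made-up rows:

```sql
-- hypothetical sample data, for illustration only
create temporary table films_btw_demo (title text, release_year integer);
insert into films_btw_demo values ('A', 1993), ('B', 1994), ('C', 2000), ('D', 2001);

-- equivalent to `between 1994 and 2000`: both endpoints are included
select title from films_btw_demo
    where release_year >= 1994
      and release_year <= 2000;    -- returns B and C
```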

### How would you find records where the first name of an employee started with the letter P?

In [None]:
select first_name from employees   -- first_name is an assumed column name
    where first_name like 'P%';

### What are the wild card operators?
- %: Represents zero or more characters: bl% finds bl, black, blue, and blob
- \_: Represents a single character: h_t finds hot, hat, and hit
- []: Represents any single character within the brackets: h[oa]t finds hot and hat, but not hit
- ^: Represents any character not in the brackets: h[^oa]t finds hit, but not hot and hat
- \-: Represents any single character within the specified range: c[a-b]t finds cat and cbt
- Note: % and \_ are standard SQL; the bracket forms ([], ^, \-) work with LIKE in SQL Server but not in PostgreSQL.

### How would you select values NOT IN a query result given a where statement?

In [None]:
select name from companies
    where name not in ('adam', 'john');

### How could you find the number of unique neighborhoods that customers come from?

In [None]:
select count(distinct neighborhoods) as n_unique_nbh
    from customers;

### How would you concatenate two columns together to make a new column?

In [None]:
select concat(c.FIRSTNAME, ',', c.LASTNAME) as name
    from customers c;
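
An alternative to `concat()` is the `||` operator. In PostgreSQL the two differ on missing values: `||` returns NULL when any operand is NULL, while `concat()` skips NULL arguments. A minimal sketch, assuming PostgreSQL and a hypothetical table `customers_demo` with made-up rows:

```sql
-- hypothetical sample data, for illustration only
create temporary table customers_demo (firstname text, lastname text);
insert into customers_demo values ('Ada', 'Lovelace'), ('Cher', null);

-- 'Ada Lovelace' for Ada, NULL for Cher (her lastname is NULL)
select firstname || ' ' || lastname as name
    from customers_demo;
```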