# Filtering, sorting and calculating data with SQL

## Basics of Filtering with SQL

* Why filter?
    * Be specific about the data you want to retrieve
    * Reduce the number of records you retrieve
    * Increase query performance
    * Reduce the strain on the client application
    * Governance limitations

* Filtering is done with WHERE Clause Operators:

In [None]:
# WHERE Clause Operators
statement = """

SELECT column_name, column_name
FROM table_name
WHERE column_name operator value;

"""

* Possible operators:
    * = equal
    * <> not equal
    * \> greater than
    * < less than
    * \>= greater than or equal
    * <= less than or equal
    * BETWEEN ... AND
    * IS NULL
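
As a minimal sketch of the operators listed above, the cell below runs a few WHERE clauses against an in-memory SQLite database. The table contents and the `names` helper are hypothetical and exist only for illustration.

```python
import sqlite3

# Hypothetical in-memory table, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Products (ProductName TEXT, UnitPrice REAL, UnitsInStock INTEGER)"
)
conn.executemany(
    "INSERT INTO Products VALUES (?, ?, ?)",
    [("Tofu", 23.25, 35), ("Konbu", 6.0, 24), ("Chai", 18.0, None), ("Ikura", 31.0, 0)],
)


def names(where_clause):
    """Return the product names matching a WHERE clause, in alphabetical order."""
    sql = f"SELECT ProductName FROM Products WHERE {where_clause} ORDER BY ProductName"
    return [row[0] for row in conn.execute(sql)]


at_least_18 = names("UnitPrice >= 18")
not_tofu = names("ProductName <> 'Tofu'")
mid_priced = names("UnitPrice BETWEEN 6 AND 18")  # BETWEEN includes both endpoints
missing_stock = names("UnitsInStock IS NULL")     # NULL is tested with IS NULL, not =
zero_stock = names("UnitsInStock = 0")            # 0 is a value, not a NULL
```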



## Advanced filtering: IN, OR, and NOT

* IN Operator
    * Specifies a list of values, any of which can match
    * Comma delimited list of values
    * Enclosed in ()

In [None]:
# IN Operator example
statement = """

SELECT
ProductID
,UnitPrice
,SupplierID
FROM Products
WHERE SupplierID IN (9, 10, 11);

"""

* OR Operator
    * A DBMS may skip evaluating the second condition in a WHERE clause if the first condition is already met
    * Use for any rows matching the specific conditions

In [None]:
# OR Operator example
statement = """

SELECT
ProductName
,ProductID
,UnitPrice
,SupplierID
FROM Products
-- Each side of OR must be a complete condition
WHERE ProductName = 'Tofu' OR ProductName = 'Konbu';

"""

* IN vs. OR
    * IN works the same as OR
    * Benefits of IN:
        * Long list of options
        * IN often executes faster than a long chain of ORs (this depends on the DBMS)
        * Don't have to think about the order with IN
        * Can contain another SELECT for subqueries
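
To make the comparison concrete, here is a small sketch using an in-memory SQLite database. The `Suppliers` and `Products` rows and the `product_ids` helper are hypothetical. It checks that IN and the equivalent OR chain return the same rows, and shows IN fed by a subquery.

```python
import sqlite3

# Hypothetical tables, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Suppliers (SupplierID INTEGER, Country TEXT);
CREATE TABLE Products (ProductID INTEGER, SupplierID INTEGER);
INSERT INTO Suppliers VALUES (9, 'Sweden'), (10, 'Brazil'), (11, 'Germany');
INSERT INTO Products VALUES (1, 9), (2, 10), (3, 11), (4, 12);
""")


def product_ids(where_clause):
    """Return the ProductIDs matching a WHERE clause, in ascending order."""
    sql = f"SELECT ProductID FROM Products WHERE {where_clause} ORDER BY ProductID"
    return [row[0] for row in conn.execute(sql)]


with_in = product_ids("SupplierID IN (9, 10, 11)")
with_or = product_ids("SupplierID = 9 OR SupplierID = 10 OR SupplierID = 11")

# IN can also take its list of values from another SELECT (a subquery).
with_subquery = product_ids(
    "SupplierID IN (SELECT SupplierID FROM Suppliers WHERE Country <> 'Brazil')"
)
```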

* Using OR with AND
    * Be careful: AND is evaluated before OR, so the same conditions can return different results depending on where you place parentheses
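
The effect of parentheses can be sketched with a tiny in-memory SQLite table. The rows and the `product_ids` helper below are hypothetical. The two WHERE clauses use the same three conditions but group them differently.

```python
import sqlite3

# Hypothetical table, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Products (ProductID INTEGER, SupplierID INTEGER, UnitPrice REAL)"
)
conn.executemany(
    "INSERT INTO Products VALUES (?, ?, ?)",
    [(1, 9, 10.0), (2, 9, 20.0), (3, 11, 10.0), (4, 11, 20.0)],
)


def product_ids(where_clause):
    """Return the ProductIDs matching a WHERE clause, in ascending order."""
    sql = f"SELECT ProductID FROM Products WHERE {where_clause} ORDER BY ProductID"
    return [row[0] for row in conn.execute(sql)]


# AND binds tighter than OR, so this reads as:
#   SupplierID = 9 OR (SupplierID = 11 AND UnitPrice > 15)
without_parens = product_ids("SupplierID = 9 OR SupplierID = 11 AND UnitPrice > 15")

# Parentheses make the price filter apply to both suppliers.
with_parens = product_ids("(SupplierID = 9 OR SupplierID = 11) AND UnitPrice > 15")
```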

In [5]:
# Example of NOT operator
statement = """
SELECT *
FROM Employees
WHERE NOT City='London' AND
NOT City='Seattle';
"""
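
As a rough illustration of the NOT example, the sketch below runs it on a hypothetical `Employees` table in SQLite and compares it with the equivalent `NOT IN` form. Note that the row with a NULL city is returned by neither query, because a comparison with NULL is never true.

```python
import sqlite3

# Hypothetical rows, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employees (LastName TEXT, City TEXT)")
conn.executemany(
    "INSERT INTO Employees VALUES (?, ?)",
    [("Davolio", "Seattle"), ("Buchanan", "London"), ("Fuller", "Tacoma"), ("Smith", None)],
)


def last_names(where_clause):
    """Return the last names matching a WHERE clause, in alphabetical order."""
    sql = f"SELECT LastName FROM Employees WHERE {where_clause} ORDER BY LastName"
    return [row[0] for row in conn.execute(sql)]


not_and = last_names("NOT City = 'London' AND NOT City = 'Seattle'")
not_in = last_names("City NOT IN ('London', 'Seattle')")  # same filter, shorter to write
```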

## Using wildcards in SQL

* What are wildcards?
    * Special character used to match parts of a value
    * Search pattern made from literal text, wild character, or a combination
    * Uses LIKE as an operator (though it is technically a predicate)
    * Can only be used with strings, not with non-text datatypes
    * Helpful for data scientists as they explore string variables

* Using % wildcards
    * **'%Pizza'** - Grabs anything ending with the word Pizza
    * **'Pizza%'** - Grabs anything starting with the word Pizza
    * **'%Pizza%'** - Grabs anything containing the word Pizza, with any text before or after it
    * **'S%E'** - Grabs anything that starts with "S" and ends with "E" (Like 'Sadie')
    * **'t%@gmail.com'** - Grabs gmail addresses that start with "t" (hoping to find 'Tom')
    * **OBS**: wildcards will not match NULLs

In [None]:
statement = """

WHERE size LIKE '%pizza'

"""

output = """
    spizza
    mpizza
"""
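
The `%` patterns can be tried out with a short sketch on an in-memory SQLite table. The `Menu` table, its rows, and the `matches` helper are hypothetical. One assumption to keep in mind: SQLite's LIKE is case-insensitive for ASCII letters by default, which other DBMS may not be.

```python
import sqlite3

# Hypothetical table, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Menu (size TEXT)")
conn.executemany(
    "INSERT INTO Menu VALUES (?)",
    [("spizza",), ("mpizza",), ("pizza slice",), ("Sadie",), (None,)],
)


def matches(pattern):
    """Return the values of size matching a LIKE pattern, in sorted order."""
    sql = "SELECT size FROM Menu WHERE size LIKE ? ORDER BY size"
    return [row[0] for row in conn.execute(sql, (pattern,))]


ends_with = matches("%pizza")
starts_with = matches("pizza%")
contains = matches("%pizza%")
s_to_e = matches("S%E")  # matches 'Sadie' because SQLite's LIKE ignores ASCII case

# '%' matches any string, but the NULL row is still not returned.
match_all_count = conn.execute("SELECT COUNT(*) FROM Menu WHERE size LIKE '%'").fetchone()[0]
```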

* Bracket [] wildcard
    * Used to specify a set of characters in a specific location
    * Does not work with all DBMS
    * Does **not** work with **SQLite**
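
In SQLite, LIKE treats brackets as literal characters. A possible workaround, shown here only as a sketch, is SQLite's separate `GLOB` operator. It uses Unix-style patterns (`*` and `?` instead of `%` and `_`), supports `[...]` character sets, and is case-sensitive. The `Menu` table and its rows are hypothetical.

```python
import sqlite3

# Hypothetical table, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Menu (size TEXT)")
conn.executemany("INSERT INTO Menu VALUES (?)", [("spizza",), ("mpizza",), ("lpizza",)])

# LIKE in SQLite has no bracket wildcard: '[sm]pizza' is compared literally.
like_rows = [
    row[0]
    for row in conn.execute("SELECT size FROM Menu WHERE size LIKE '[sm]pizza' ORDER BY size")
]

# GLOB understands [sm] as "one character that is either s or m".
glob_rows = [
    row[0]
    for row in conn.execute("SELECT size FROM Menu WHERE size GLOB '[sm]pizza' ORDER BY size")
]
```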

* Downsides of wildcards
    * Takes longer to run
    * Better to use another operator (if possible): =, <, >=, etc.
    * Statements with wildcards typically take longest to run when the wildcard is at the beginning of the search pattern, because an index usually cannot be used
    * Placement of wildcards is important

## Sorting with ORDER BY

* Why sort data?
    * Data displayed appears in the order of the underlying tables
    * Updated and deleted data can change this order
    * Sequence of retrieved data cannot be assumed if order was not specified
    * Sorting data logically helps keep information you want on top
    * ORDER BY clause allows user to sort data by particular columns

In [3]:
statement = """

SELECT column_name
FROM table_name
ORDER BY column_name

"""

* Rules for ORDER BY
    * Takes the name of one or more columns
    * Separate multiple column names with commas
    * Can sort by a column not retrieved
    * Must come after the other clauses of a SELECT statement (WHERE, GROUP BY, HAVING); only a row-limiting clause such as LIMIT may follow it

* Sorting by column position
    * ORDER BY 2,3 (meaning 2nd and 3rd columns)

* Sort direction
    * DESC descending order
    * ASC ascending order
    * Applies only to the column name it directly follows
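
The ORDER BY rules above can be sketched on a small in-memory SQLite table. The rows and the `ordered_names` helper are hypothetical. The sketch sorts by a column that is not retrieved, mixes sort directions, and sorts by column position.

```python
import sqlite3

# Hypothetical table, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Products (ProductName TEXT, SupplierID INTEGER, UnitPrice REAL)")
conn.executemany(
    "INSERT INTO Products VALUES (?, ?, ?)",
    [("Chai", 1, 18.0), ("Tofu", 6, 23.25), ("Konbu", 6, 6.0), ("Chang", 1, 19.0)],
)


def ordered_names(sql):
    """Run a query and return the first column of each row, preserving row order."""
    return [row[0] for row in conn.execute(sql)]


# Sorting by a column that is not in the SELECT list.
by_price_desc = ordered_names("SELECT ProductName FROM Products ORDER BY UnitPrice DESC")

# DESC applies to SupplierID only; UnitPrice falls back to ascending order.
mixed = ordered_names(
    "SELECT ProductName, SupplierID, UnitPrice FROM Products "
    "ORDER BY SupplierID DESC, UnitPrice"
)

# Column positions: 2 is SupplierID and 3 is UnitPrice in this SELECT list.
by_position = ordered_names(
    "SELECT ProductName, SupplierID, UnitPrice FROM Products ORDER BY 2, 3"
)
```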

## Math Operations

* Math operators
    * \+ addition
    * \- subtraction
    * \* multiplication
    * / division

In [None]:
# Multiplication Example
statement = """

SELECT
ProductID
,UnitsOnOrder
,UnitPrice
,UnitsOnOrder * UnitPrice AS Total_Order_Cost
FROM Products

"""

# Combining Math Operations
statement = """

SELECT
ProductId
,Quantity
,UnitPrice
,Discount
,(UnitPrice - Discount) * Quantity AS Total_Cost
FROM Products

"""
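
The following sketch runs calculated columns on a hypothetical `Products` table in SQLite and shows how parentheses change the order of operations. It also illustrates a SQLite-specific detail worth knowing: dividing two integers yields an integer.

```python
import sqlite3

# Hypothetical table, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Products (ProductID INTEGER, UnitsOnOrder INTEGER, UnitPrice REAL)")
conn.executemany(
    "INSERT INTO Products VALUES (?, ?, ?)",
    [(1, 10, 2.5), (2, 0, 4.0), (3, 3, 1.5)],
)

rows = conn.execute("""
    SELECT ProductID,
           UnitsOnOrder * UnitPrice       AS Total_Order_Cost,
           UnitPrice - 1 * UnitsOnOrder   AS no_parens,   -- multiplication runs first
           (UnitPrice - 1) * UnitsOnOrder AS with_parens  -- subtraction runs first
    FROM Products
    ORDER BY ProductID
""").fetchall()

# In SQLite, integer / integer is integer division; use a REAL operand to keep decimals.
int_div = conn.execute("SELECT 7 / 2, 7 / 2.0").fetchone()
```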

## Aggregate Functions

* What are Aggregate Functions?
    * Used to summarize data
    * Finding the highest and lowest values
    * Finding the total number of rows
    * Finding the average value

* Aggregate Functions
    * AVG()
    * COUNT()
    * MIN()
    * MAX()
    * SUM()

In [4]:
# AVERAGE Function
statement = """
SELECT AVG(UnitPrice) AS avg_price
FROM products
"""

# COUNT (*) - Counts all the rows in a table
# containing values or NULL Values
statement = """
SELECT COUNT (*) AS
total_customers
FROM Customers
"""

# COUNT (column) - Counts all the rows in a
# specific column ignoring NULL values
statement = """
SELECT COUNT(CustomerID) AS
total_customers
FROM Customers
"""

# MAX and MIN Function
statement = """
SELECT MAX(UnitPrice) AS max_prod_price
FROM Products
"""

statement = """
SELECT MAX(UnitPrice) AS max_prod_price
,MIN(UnitPrice) AS min_prod_price
FROM Products
"""

# SUM Function
statement = """
SELECT SUM(UnitPrice) AS
    total_prod_price
FROM Products
"""

statement = """
SELECT SUM(UnitPrice*UnitsInStock)
       AS total_price
FROM Products
WHERE SupplierID = 23;
"""
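
To see the aggregate functions side by side, here is a minimal sketch on a hypothetical `Products` table in SQLite that contains one NULL price. The difference between `COUNT(*)` and `COUNT(column)` shows up directly in the result.

```python
import sqlite3

# Hypothetical table with one NULL price, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Products (ProductID INTEGER, UnitPrice REAL)")
conn.executemany(
    "INSERT INTO Products VALUES (?, ?)",
    [(1, 10.0), (2, 20.0), (3, None), (4, 30.0)],
)

summary = conn.execute("""
    SELECT AVG(UnitPrice)   AS avg_price,     -- NULL prices are ignored
           COUNT(*)         AS total_rows,    -- counts every row
           COUNT(UnitPrice) AS priced_rows,   -- skips the NULL price
           MIN(UnitPrice)   AS min_price,
           MAX(UnitPrice)   AS max_price,
           SUM(UnitPrice)   AS total_price
    FROM Products
""").fetchone()

avg_price, total_rows, priced_rows, min_price, max_price, total_price = summary
```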

* Using DISTINCT on Aggregate Functions
    * If DISTINCT is not specified, ALL is assumed
    * Cannot use DISTINCT on COUNT(*)
    * DISTINCT adds no value with MIN and MAX, since duplicates do not change the result

In [None]:
statement = """

SELECT COUNT(DISTINCT CustomerID)
FROM Customers

"""
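
A short sketch of DISTINCT inside aggregates, run on a hypothetical `Orders` table in SQLite (the `Freight` column and all rows are assumptions made for illustration):

```python
import sqlite3

# Hypothetical table with a repeated customer and amount, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Orders (CustomerID TEXT, Freight REAL)")
conn.executemany(
    "INSERT INTO Orders VALUES (?, ?)",
    [("A", 10.0), ("A", 10.0), ("B", 20.0)],
)

result = conn.execute("""
    SELECT COUNT(CustomerID),           -- ALL is assumed: every non-NULL value counts
           COUNT(DISTINCT CustomerID),  -- each customer counted once
           SUM(Freight),
           SUM(DISTINCT Freight),       -- each distinct amount added once
           MAX(Freight),
           MAX(DISTINCT Freight)        -- same as MAX without DISTINCT
    FROM Orders
""").fetchone()
```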

## Grouping Data with SQL

* Grouping example: count customers per region rather than across the whole table

In [None]:
statement = """

SELECT
Region
,COUNT(CustomerID) AS total_customers
FROM Customers
GROUP BY Region;

"""

* Additional GROUP BY Information
    * GROUP BY clauses can contain multiple columns
    * Every column in your SELECT statement must be present in a GROUP BY clause, except for aggregated calculations
    * NULLs will be grouped together if your GROUP BY column contains NULLs
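
The NULL behaviour can be checked with a small sketch on a hypothetical `Customers` table in SQLite. The two customers without a region end up in a single group.

```python
import sqlite3

# Hypothetical table, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Customers (CustomerID INTEGER, Region TEXT)")
conn.executemany(
    "INSERT INTO Customers VALUES (?, ?)",
    [(1, "WA"), (2, "WA"), (3, "OR"), (4, None), (5, None)],
)

rows = conn.execute("""
    SELECT Region
          ,COUNT(CustomerID) AS total_customers
    FROM Customers
    GROUP BY Region
""").fetchall()

# Map each region (including None for NULL) to its customer count.
customers_per_region = dict(rows)
```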

* HAVING Clause - Filtering for Groups
    * WHERE does not work for groups
    * WHERE filters on rows
    * Instead use HAVING clause to filter for groups
    * GROUP BY does not sort data

In [None]:
# HAVING example
statement = """

SELECT
CustomerID
,COUNT (*) AS orders
FROM Orders
GROUP BY CustomerID
HAVING COUNT (*) >=2;

"""

* WHERE vs. HAVING
    * WHERE filters before data is grouped
    * HAVING filters after data is grouped
    * Rows eliminated by the WHERE clause will not be included in the group
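
The difference between the two filters can be sketched on a hypothetical `Orders` table in SQLite (the `Freight` column and the `order_counts` helper are assumptions made for illustration). Adding a WHERE clause removes rows before grouping, which changes which groups pass the HAVING test.

```python
import sqlite3

# Hypothetical table, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Orders (CustomerID TEXT, Freight REAL)")
conn.executemany(
    "INSERT INTO Orders VALUES (?, ?)",
    [("A", 5.0), ("A", 50.0), ("A", 60.0), ("B", 70.0), ("B", 80.0), ("C", 90.0)],
)


def order_counts(sql):
    """Run a grouped query and return {CustomerID: order count}."""
    return dict(conn.execute(sql).fetchall())


# HAVING only: groups are built from all rows, then filtered.
having_only = order_counts("""
    SELECT CustomerID, COUNT(*) AS orders
    FROM Orders
    GROUP BY CustomerID
    HAVING COUNT(*) >= 2
""")

# WHERE + HAVING: rows with Freight <= 55 are dropped before the groups are counted.
where_then_having = order_counts("""
    SELECT CustomerID, COUNT(*) AS orders
    FROM Orders
    WHERE Freight > 55
    GROUP BY CustomerID
    HAVING COUNT(*) >= 2
""")
```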

## Suggested Reading
[Python-SQL Package Documentation](https://pypi.python.org/pypi/python-sql)