# Guided Project: Answering Business Questions using SQL

## Helper Functions

We will create some helper functions in Python and use a context manager to handle the connection to the SQLite database.

In [2]:
import sqlite3

import pandas as pd

def run_query(q):
    with sqlite3.connect('chinook.db') as conn:
        return pd.read_sql(q, conn)

The `run_query(q)` function returns a pandas DataFrame. If we call it on the last line of a Jupyter cell, the notebook renders the result as a formatted table.

We will also create a `run_command(c)` function to run SQL statements that do not return a table.

In [3]:
def run_command(c):
    with sqlite3.connect('chinook.db') as conn:
        conn.isolation_level = None
        conn.execute(c)

def show_tables():
    q = '''
        SELECT 
            name, 
            type 
          FROM sqlite_master 
         WHERE type IN ('table','view');
        '''
    return run_query(q)

We have also created a `show_tables()` function to return a list of all tables and views in our database. This will be handy to quickly check the state of our database as we work.

In [4]:
show_tables()

Unnamed: 0,name,type
0,album,table
1,artist,table
2,customer,table
3,employee,table
4,genre,table
5,invoice,table
6,invoice_line,table
7,media_type,table
8,playlist,table
9,playlist_track,table
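
One detail about the helpers above is worth a short sketch: using a `sqlite3` connection as a context manager commits or rolls back the transaction, but it does not close the connection. The variant below wraps the connection in `contextlib.closing` so that it is also closed. The names `run_command_closing` and `list_tables`, and the throwaway `demo.db` file, are assumptions made for this illustration and are not part of the project.

```python
import os
import sqlite3
import tempfile
from contextlib import closing


def run_command_closing(db_path, c):
    """Run one SQL statement that returns no rows, then close the connection."""
    # closing() guarantees conn.close(); the inner `with conn` commits on
    # success and rolls back if the statement raises.
    with closing(sqlite3.connect(db_path)) as conn:
        with conn:
            conn.execute(c)


def list_tables(db_path):
    """Return (name, type) pairs for every table and view, sorted by name."""
    q = """
        SELECT name, type
          FROM sqlite_master
         WHERE type IN ('table', 'view')
         ORDER BY name;
    """
    with closing(sqlite3.connect(db_path)) as conn:
        return conn.execute(q).fetchall()


# Demo on a throwaway database file (hypothetical, not chinook.db).
demo_dir = tempfile.mkdtemp()
demo_db = os.path.join(demo_dir, "demo.db")
run_command_closing(demo_db, "CREATE TABLE genre (genre_id INTEGER PRIMARY KEY, name TEXT);")
run_command_closing(demo_db, "CREATE VIEW genre_names AS SELECT name FROM genre;")
tables = list_tables(demo_db)
print(tables)
```

For a short notebook session the difference rarely matters, but explicitly closing connections avoids leaving file handles open when the helpers are called many times.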


## Top Genres in the USA

The Chinook record store has just signed a deal with a new record label, and you've been tasked with selecting the first three albums that will be added to the store, from a list of four. All four albums are by artists that don't have any tracks in the store right now - we have the artist names, and the genre of music they produce:

| Artist Name | Genre |
| --- | --- |
| Regal | Hip-Hop |
| Red Tone | Punk |
| Meteor and the Girls | Pop |
| Slim Jim Bites | Blues |


The record label specializes in artists from the USA, and they have given Chinook some money to advertise the new albums in the USA, so we're interested in finding out which genres sell the best in the USA.

In [5]:
q = '''
WITH quantity_per_track AS
     (
     SELECT
        il.track_id,
        SUM(il.quantity) total_quantity
     FROM invoice_line il
     INNER JOIN invoice i ON i.invoice_id = il.invoice_id
     WHERE i.billing_country = 'USA'
     GROUP BY 1
     ),
     tracks_per_genre AS
     (
     SELECT
         g.name genre,
         SUM(qpt.total_quantity) no_tracks
     FROM quantity_per_track qpt
     INNER JOIN track t ON t.track_id = qpt.track_id
     INNER JOIN genre g ON g.genre_id = t.genre_id
     GROUP BY 1
     ORDER BY 2 DESC
     ),
     sum_tracks AS
     (
     SELECT SUM(no_tracks) sum FROM tracks_per_genre
     )
SELECT tpg.*, 
CAST(tpg.no_tracks AS float)/(s.sum) ratio
FROM tracks_per_genre tpg, sum_tracks s;
    '''
genre_sales_usa = run_query(q)
genre_sales_usa

Unnamed: 0,genre,no_tracks,ratio
0,Rock,561,0.533777
1,Alternative & Punk,130,0.123692
2,Metal,124,0.117983
3,R&B/Soul,53,0.050428
4,Blues,36,0.034253
5,Alternative,35,0.033302
6,Latin,22,0.020932
7,Pop,22,0.020932
8,Hip Hop/Rap,20,0.019029
9,Jazz,14,0.013321


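Of the four genres on our list, Hip-Hop has the fewest track purchases in the USA (20, listed as Hip Hop/Rap). Punk (130, listed as Alternative & Punk), Blues (36), and Pop (22) sell the most, in that order, so the three albums to add are those by Red Tone, Slim Jim Bites, and Meteor and the Girls.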
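
The selection can be double-checked with a few lines of Python. This is only a sketch: the counts are copied by hand from the query output above, and the mapping from each artist's genre label to the store's genre name (for example Punk to "Alternative & Punk") is an assumption.

```python
# Track purchases in the USA, copied from the query output above.
usa_sales = {
    "Alternative & Punk": 130,
    "Blues": 36,
    "Pop": 22,
    "Hip Hop/Rap": 20,
}

# Assumed mapping from each new artist to the closest genre name in the store.
candidates = {
    "Regal": "Hip Hop/Rap",
    "Red Tone": "Alternative & Punk",
    "Meteor and the Girls": "Pop",
    "Slim Jim Bites": "Blues",
}

# Rank the artists by how many tracks of their genre were sold in the USA.
ranked = sorted(
    candidates,
    key=lambda artist: usa_sales[candidates[artist]],
    reverse=True,
)
top_three = ranked[:3]
print(top_three)
```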

## Sales Support Performance

Each customer for the Chinook store gets assigned to a sales support agent within the company when they first make a purchase.

We will analyze the purchases of customers belonging to each employee to see if any sales support agent is performing either better or worse than the others.

In [6]:
q = '''
WITH total_sales AS (
     SELECT
         c.support_rep_id,
         SUM(i.total) total
     FROM customer c
     INNER JOIN invoice i ON c.customer_id = i.customer_id
     GROUP BY 1
     ),
     customers AS
     (
     SELECT
         c.support_rep_id,
         COUNT(c.customer_id) number_of_customers
     FROM customer c
     GROUP BY 1
     )
SELECT 
    e.first_name || ' ' || e.last_name employee_name,
    ts.total,
    cs.number_of_customers,
    ts.total / cs.number_of_customers average_sale_per_customer,
    e.hire_date
FROM employee e
INNER JOIN total_sales ts ON ts.support_rep_id = e.employee_id
INNER JOIN customers cs ON cs.support_rep_id = e.employee_id
GROUP BY 1, 2
ORDER BY 3 DESC;
'''
run_query(q)

Unnamed: 0,employee_name,total,number_of_customers,average_sale_per_customer,hire_date
0,Jane Peacock,1731.51,21,82.452857,2017-04-01 00:00:00
1,Margaret Park,1584.0,20,79.2,2017-05-03 00:00:00
2,Steve Johnson,1393.92,18,77.44,2017-10-17 00:00:00


Jane has the highest total sales, but she has also been with the company the longest. Steve has the lowest total sales, but he joined roughly half a year later than the other two.

The average sales per customer range from about $77 to $82, a difference of about $5 per customer between the top and bottom employee.
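
As a rough check on this comparison, the sketch below recomputes the per-customer averages from the totals in the output above and compares the gap between the top and bottom agent in total sales with the gap per customer. The variable names are made up for this example.

```python
# (total sales, number of customers) per agent, copied from the output above.
agents = {
    "Jane Peacock": (1731.51, 21),
    "Margaret Park": (1584.00, 20),
    "Steve Johnson": (1393.92, 18),
}

avg_per_customer = {name: total / n for name, (total, n) in agents.items()}

# Absolute and relative gap between the best and worst per-customer average.
best = max(avg_per_customer.values())
worst = min(avg_per_customer.values())
spread = best - worst
relative_spread = spread / worst

# Relative gap in total sales between the top and bottom agent.
totals = [total for total, _ in agents.values()]
total_gap = max(totals) / min(totals) - 1

for name, avg in avg_per_customer.items():
    print(f"{name}: {avg:.2f} per customer")
print(f"per-customer gap: {relative_spread:.1%}, total-sales gap: {total_gap:.1%}")
```

The per-customer gap is much smaller than the gap in total sales, which suggests that most of the difference comes from the number of customers each agent has rather than from how much each customer spends.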

## Sales By Country

Our next task is to analyze the sales data for customers from each country. We have been given guidance to use the country value from the `customer` table, and ignore the country from the billing address in the `invoice` table.

In particular, we have been directed to calculate data, for each country, on the:

- total number of customers
- total value of sales
- average value of sales per customer
- average order value

In [7]:
q = '''
WITH customers_per_country AS
    (
    SELECT
        c.country,
        COUNT(c.customer_id) customers
    FROM customer c
    GROUP BY 1
    ORDER BY 2
    ),
    total_sales AS
    (
    SELECT
        c.country,
        SUM(i.total) sales,
        AVG(i.total) average
    FROM customer c
    INNER JOIN invoice i ON i.customer_id = c.customer_id
    GROUP BY 1
    ),
    purchases_per_country AS
    (
    SELECT
        cpc.country,
        cpc.customers,
        ts.sales,
        ts.average average_order_value
    FROM customers_per_country cpc
    INNER JOIN total_sales ts ON ts.country = cpc.country
    ORDER BY 3
    ),
    purchases_per_country_others AS 
    (
    SELECT
    CASE
        WHEN ppc.customers = 1 THEN 'Other'
        ELSE ppc.country
    END AS country,
    SUM(customers) customers,
    SUM(sales) sales,
    SUM(sales)/SUM(customers) average_sales_per_customer,
    average_order_value
    FROM purchases_per_country ppc
    GROUP BY 1
    )

SELECT
    country,
    customers,
    sales,
    average_sales_per_customer,
    average_order_value
FROM 
    (
     SELECT
         ppco.*,
         CASE
             WHEN ppco.country = 'Other' THEN 1
             ELSE 0
         END AS sort
     FROM purchases_per_country_others ppco
    )
ORDER BY sort ASC, 3 DESC;


'''
run_query(q)

Unnamed: 0,country,customers,sales,average_sales_per_customer,average_order_value
0,USA,13,1040.49,80.037692,7.942672
1,Canada,8,535.59,66.94875,7.047237
2,Brazil,5,427.68,85.536,7.011148
3,France,5,389.07,77.814,7.7814
4,Germany,4,334.62,83.655,8.161463
5,Czech Republic,2,273.24,136.62,9.108
6,United Kingdom,3,245.52,81.84,8.768571
7,Portugal,2,185.13,92.565,6.383793
8,India,2,183.15,91.575,8.721429
9,Other,15,1094.94,72.996,8.833846


Based on the data, there may be opportunity in the following countries, which have the highest average order values among the individually listed countries:

- Czech Republic
- United Kingdom
- India

It's worth keeping in mind that the amount of data from each of these countries is relatively low. We should be cautious about spending too much money on new marketing campaigns, as the sample size is not large enough to give us high confidence. A better approach would be to run small campaigns in these countries, then collect and analyze data on the new customers to make sure that these trends hold.
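
One caveat about the query in this section: `average_order_value` is selected without an aggregate function in the final grouped step. SQLite allows this and takes the value from an arbitrary row of each group, so the figure shown for "Other" appears to come from a single country rather than from all of the pooled orders. A minimal sketch of one way to pool the value instead is to carry an order count per country and divide the summed sales by the summed orders. The toy tables and data below are invented for illustration and are not the Chinook data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, country TEXT);
    CREATE TABLE invoice (invoice_id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
    INSERT INTO customer VALUES (1, 'A'), (2, 'A'), (3, 'B'), (4, 'C');
    INSERT INTO invoice VALUES
        (1, 1, 10.0), (2, 1, 20.0), (3, 2, 30.0),
        (4, 3, 5.0), (5, 3, 15.0), (6, 3, 10.0),
        (7, 4, 30.0);
""")

q = """
WITH per_country AS (
    SELECT
        c.country,
        COUNT(DISTINCT c.customer_id) AS n_customers,
        SUM(i.total) AS total_sales,
        COUNT(i.invoice_id) AS n_orders
    FROM customer c
    INNER JOIN invoice i ON i.customer_id = c.customer_id
    GROUP BY c.country
)
SELECT
    -- Countries with a single customer are pooled into 'Other'.
    CASE WHEN n_customers = 1 THEN 'Other' ELSE country END AS country_group,
    SUM(n_customers) AS customers,
    SUM(total_sales) AS sales,
    SUM(total_sales) / SUM(n_customers) AS average_sales_per_customer,
    -- Pooled average: all sales in the group divided by all orders in the group.
    SUM(total_sales) / SUM(n_orders) AS average_order_value
FROM per_country
GROUP BY country_group
ORDER BY country_group = 'Other', sales DESC;
"""
rows = conn.execute(q).fetchall()
conn.close()

for row in rows:
    print(row)
```

In the toy data, countries B and C have per-country average order values of 10 and 30, while the pooled value for "Other" is 15, because B contributes three of the four orders.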

## Albums vs Individual Tracks


In [21]:
q = '''
WITH invoices AS
    (
    SELECT
        il.invoice_id invoice_id,
        MIN(il.track_id) first_track
    FROM invoice_line il
    INNER JOIN track t ON t.track_id = il.track_id
    GROUP BY 1
    ),
    tracks_per_album AS
    (
    SELECT
        album_id,
        COUNT(track_id) no_of_tracks
    FROM track
    GROUP BY 1
    ),
    album_or_not AS
    (
    SELECT
    i.*,
    CASE
        WHEN
            (
            (
            SELECT il.track_id 
            FROM invoice_line il
            WHERE il.invoice_id = i.invoice_id
            EXCEPT
            SELECT t.track_id
            FROM track t
            WHERE t.album_id = (
                            SELECT t2.album_id 
                            FROM track t2
                            WHERE t2.track_id = i.first_track
                            )
            ) IS NULL

            AND

            (
            SELECT t.track_id
            FROM track t
            WHERE t.album_id = (
                                SELECT t2.album_id 
                                FROM track t2
                                WHERE t2.track_id = i.first_track
                                )
            EXCEPT
            SELECT il.track_id 
            FROM invoice_line il
            WHERE il.invoice_id = i.invoice_id
            ) IS NULL
            )

            THEN 'yes'
            ELSE 'no'
        END AS 'album_purchase'
    FROM invoices i
    )
    
SELECT
    album_purchase,
    COUNT(invoice_id) no_of_invoices,
    CAST(COUNT(invoice_id) AS FLOAT) / 
    (SELECT COUNT(*) FROM invoice) ratio
FROM album_or_not
GROUP BY 1;


'''
run_query(q)


Unnamed: 0,album_purchase,no_of_invoices,ratio
0,no,500,0.814332
1,yes,114,0.185668


Album purchases account for 18.6% of invoices (114 of 614). Based on this data, I would recommend against purchasing only select tracks from albums from record companies, since doing so could put nearly one fifth of purchases, and the revenue attached to them, at risk.
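
The query above treats an invoice as an album purchase when the two `EXCEPT` comparisons are both empty, that is, when the set of tracks on the invoice equals the set of tracks on the album that the invoice's first track belongs to. The same idea can be sketched with Python sets. The function name `is_album_purchase` and the small data set are made up for this illustration.

```python
def is_album_purchase(invoice_tracks, track_album):
    """Return True if the invoice contains exactly the tracks of one album.

    invoice_tracks: iterable of track ids on the invoice.
    track_album: dict mapping every track id in the store to its album id.
    """
    invoice_set = set(invoice_tracks)
    # Like MIN(il.track_id) in the query: pick one track to identify the album.
    first_track = min(invoice_set)
    album_id = track_album[first_track]
    album_set = {t for t, a in track_album.items() if a == album_id}
    # Set equality is the same as both EXCEPT differences being empty.
    return invoice_set == album_set


# Hypothetical store: album 10 has tracks 1-3, album 20 has tracks 4-5.
track_album = {1: 10, 2: 10, 3: 10, 4: 20, 5: 20}

invoices = {
    101: [1, 2, 3],  # every track of album 10
    102: [1, 2],     # album 10 with one track missing
    103: [4, 5, 1],  # album 20 plus an extra track
    104: [5, 4],     # every track of album 20
}

results = {inv: is_album_purchase(tracks, track_album) for inv, tracks in invoices.items()}
print(results)
```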