# Answering Business Questions with SQL

This project is from the intermediate SQL course in the DataQuest data engineering certificate.

### Import Modules 

In [30]:
import sqlite3
import pandas as pd

db = 'chinook.db'

### Define some Helper Functions

In [31]:
# Run a SQL query and return the result as a pandas DataFrame
def run_query(q):
    with sqlite3.connect(db) as conn:
        return pd.read_sql(q, conn)
    
# Execute a SQL command with the sqlite3 module.
# Note: does not return any rows.
def run_command(q):
    with sqlite3.connect(db) as conn:
        conn.isolation_level = None
        conn.execute(q)

# Function that calls run_query and returns a list of all tables and views in the database
def show_tables():
    q = '''
    SELECT
        name,
        type
    FROM sqlite_master
    WHERE type IN ('table', 'view');
    '''
    return run_query(q)


In [32]:
# Show the tables
show_tables()

|   | name | type |
|---|------|------|
| 0 | album | table |
| 1 | artist | table |
| 2 | customer | table |
| 3 | employee | table |
| 4 | genre | table |
| 5 | invoice | table |
| 6 | invoice_line | table |
| 7 | media_type | table |
| 8 | playlist | table |
| 9 | playlist_track | table |
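
One detail of the helper pattern above is worth illustrating: using a `sqlite3` connection as a context manager commits or rolls back the transaction, but it does not close the connection. The sketch below is one possible variant, not the notebook's own code, that wraps the connection in `contextlib.closing` so it is closed after each call. The names `run_query_rows`, `list_tables`, and `demo`, and the throwaway `demo.db` file, are assumptions made for illustration; it returns plain rows instead of a DataFrame to stay within the standard library.

```python
import os
import sqlite3
import tempfile
from contextlib import closing


def run_query_rows(db_path, q, params=()):
    """Run a SELECT and return (column_names, rows); hypothetical helper."""
    # closing() guarantees conn.close() is called when the block exits.
    with closing(sqlite3.connect(db_path)) as conn:
        cur = conn.execute(q, params)
        columns = [d[0] for d in cur.description]
        return columns, cur.fetchall()


def list_tables(db_path):
    """List tables and views, mirroring show_tables() without pandas."""
    q = """
    SELECT name, type
    FROM sqlite_master
    WHERE type IN ('table', 'view')
    ORDER BY name;
    """
    return run_query_rows(db_path, q)


def demo():
    """Build a throwaway database (not chinook.db) and list its objects."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.db")
        with closing(sqlite3.connect(path)) as conn:
            conn.execute(
                "CREATE TABLE genre (genre_id INTEGER PRIMARY KEY, name TEXT)"
            )
            conn.execute("CREATE VIEW genre_names AS SELECT name FROM genre")
            conn.commit()
        return list_tables(path)


print(demo())
```

For a short notebook the difference rarely matters, but closing connections explicitly avoids leaving file handles open when the helpers are called many times.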


### Finding Top 3 Music Genres in the USA

The first question is which three music genres sell best in the USA.

Write a query that returns each genre with the number of tracks sold in the USA:
1. in absolute numbers
2. as a percentage of all tracks sold in the USA (reported below as a fraction between 0 and 1).

In [33]:
top3 = '''
    WITH usa_tracks_sold AS
       (
        SELECT il.* FROM invoice_line il
        INNER JOIN invoice i on il.invoice_id = i.invoice_id
        INNER JOIN customer c on i.customer_id = c.customer_id
        WHERE c.country = 'USA'
       )

    SELECT
        g.name genre,
        count(uts.invoice_line_id) tracks_sold,
        cast(count(uts.invoice_line_id) AS FLOAT) / (
            SELECT COUNT(*) from usa_tracks_sold
        ) percentage_sold
    FROM usa_tracks_sold uts
    INNER JOIN track t on t.track_id = uts.track_id
    INNER JOIN genre g on g.genre_id = t.genre_id
    GROUP BY 1
    ORDER BY 2 DESC
    LIMIT 3;
        '''
run_query(top3)

|   | genre | tracks_sold | percentage_sold |
|---|-------|-------------|-----------------|
| 0 | Rock | 561 | 0.533777 |
| 1 | Alternative & Punk | 130 | 0.123692 |
| 2 | Metal | 124 | 0.117983 |


The top three genres in the USA are therefore Rock, Alternative & Punk, and Metal. Rock alone accounts for about 53% of tracks sold, so these genres should be the most heavily stocked in the store.
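
As a quick sanity check on the output above, the reported counts and shares can be used to back out the total number of tracks sold in the USA and the combined share of the top three genres. This is a small arithmetic sketch using only the three rows printed above; the total is inferred from those rows rather than read from the database, and the function names are made up for illustration.

```python
# Rows copied from the query output above: (genre, tracks_sold, percentage_sold).
top3_rows = [
    ("Rock", 561, 0.533777),
    ("Alternative & Punk", 130, 0.123692),
    ("Metal", 124, 0.117983),
]


def implied_total(tracks_sold, share):
    """Back out the denominator from a count and its share of the total."""
    return round(tracks_sold / share)


def combined_share(rows):
    """Sum the shares of the listed genres."""
    return sum(share for _, _, share in rows)


# Every row should imply the same denominator if the shares are consistent.
totals = {implied_total(n, s) for _, n, s in top3_rows}
print(totals)
print(round(combined_share(top3_rows), 3))
```

All three rows imply the same total, and together the three genres make up roughly 78% of tracks sold in the USA.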

### Top Sales by Employee

Next, we want to find out who the top salespeople are.

Write a query that finds the total dollar amount of sales assigned to each sales support agent within the company.

In [41]:
sales_query = '''
            SELECT 
                e.first_name || ' ' || e.last_name employee,
                e.title,
                e.hire_date,
                e.city,
                e.country,
                sum(i.total) total_sales
            FROM employee e
            INNER JOIN customer c ON e.employee_id = c.support_rep_id
            INNER JOIN invoice i ON c.customer_id = i.customer_id
            GROUP BY e.employee_id
            ORDER BY total_sales DESC
            '''
run_query(sales_query)

|   | employee | title | hire_date | city | country | total_sales |
|---|----------|-------|-----------|------|---------|-------------|
| 0 | Jane Peacock | Sales Support Agent | 2017-04-01 00:00:00 | Calgary | Canada | 1731.51 |
| 1 | Margaret Park | Sales Support Agent | 2017-05-03 00:00:00 | Calgary | Canada | 1584.0 |
| 2 | Steve Johnson | Sales Support Agent | 2017-10-17 00:00:00 | Calgary | Canada | 1393.92 |


The three sales support agents have broadly similar totals. Steve Johnson, who has the lowest total, is also the most recent hire, and the totals fall in the same order as the hire dates, so the gap may reflect time with the company rather than performance.
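
One way to separate tenure from performance is to divide each agent's total by the number of days since their hire date. The sketch below shows that idea on a tiny in-memory database; the table layout mirrors only the columns used in the query above, the rows are invented toy data (not Chinook data), and the `as_of` reference date is a placeholder that would need to be replaced with a sensible cutoff such as the date of the latest invoice.

```python
import sqlite3

# Same joins as the sales query above, plus sales per day of tenure.
TENURE_QUERY = """
SELECT
    e.first_name || ' ' || e.last_name AS employee,
    SUM(i.total) AS total_sales,
    SUM(i.total) / (julianday(:as_of) - julianday(e.hire_date)) AS sales_per_day
FROM employee e
INNER JOIN customer c ON e.employee_id = c.support_rep_id
INNER JOIN invoice i ON c.customer_id = i.customer_id
GROUP BY e.employee_id
ORDER BY sales_per_day DESC;
"""


def sales_per_day_of_tenure(conn, as_of):
    """Return (employee, total_sales, sales_per_day) rows; as_of is assumed."""
    return conn.execute(TENURE_QUERY, {"as_of": as_of}).fetchall()


def build_toy_db():
    """Create an in-memory database with invented rows for illustration."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE employee (
            employee_id INTEGER PRIMARY KEY,
            first_name TEXT, last_name TEXT, hire_date TEXT
        );
        CREATE TABLE customer (
            customer_id INTEGER PRIMARY KEY, support_rep_id INTEGER
        );
        CREATE TABLE invoice (
            invoice_id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL
        );
        INSERT INTO employee VALUES
            (1, 'Early', 'Hire', '2020-01-01 00:00:00'),
            (2, 'Late', 'Hire', '2020-01-11 00:00:00');
        INSERT INTO customer VALUES (1, 1), (2, 2);
        INSERT INTO invoice VALUES (1, 1, 100.0), (2, 1, 100.0), (3, 2, 150.0);
    """)
    return conn


toy = build_toy_db()
result = sales_per_day_of_tenure(toy, "2020-01-21")
toy.close()
for row in result:
    print(row)
```

In the toy data the later hire has the lower total but the higher daily rate, which is the kind of reversal this normalization is meant to reveal. Whether that happens for the real agents depends on the actual invoice dates.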

### Sales Data by Customer and Country

The next task is to analyze purchases by customer and country. Countries with only one customer are grouped together as "Other". For each country, include:

1. total number of customers
2. total value of sales
3. average value of sales per customer
4. average order value
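
The four metrics listed above, together with the "Other" bucket for single-customer countries used by the query that follows, can be sketched in plain Python on a handful of invented invoice lines. This is an illustration of the definitions, not the notebook's method: `summarize_by_country` is a hypothetical helper, the rows are toy data, and customers are counted from the purchase rows themselves rather than from the `customer` table.

```python
from collections import defaultdict


def summarize_by_country(lines):
    """lines: list of (country, customer_id, invoice_id, unit_price) tuples."""
    # First pass: how many distinct customers does each country have?
    customers_in = defaultdict(set)
    for country, customer_id, _, _ in lines:
        customers_in[country].add(customer_id)

    # Second pass: accumulate per group, folding single-customer
    # countries into one "Other" group.
    groups = defaultdict(
        lambda: {"customers": set(), "invoices": set(), "total": 0.0}
    )
    for country, customer_id, invoice_id, unit_price in lines:
        label = "Other" if len(customers_in[country]) == 1 else country
        group = groups[label]
        group["customers"].add(customer_id)
        group["invoices"].add(invoice_id)
        group["total"] += unit_price

    summary = {}
    for label, group in groups.items():
        n_customers = len(group["customers"])
        n_invoices = len(group["invoices"])
        summary[label] = {
            "customers": n_customers,
            "total_sales": group["total"],
            "customer_lifetime_value": group["total"] / n_customers,
            "average_order": group["total"] / n_invoices,
        }
    return summary


# Invented rows: two USA customers, one customer each in Chile and Peru.
toy_lines = [
    ("USA", 1, 10, 1.0),
    ("USA", 1, 10, 1.0),
    ("USA", 2, 11, 1.0),
    ("Chile", 3, 12, 1.0),
    ("Peru", 4, 13, 1.0),
    ("Peru", 4, 14, 1.0),
]
summary = summarize_by_country(toy_lines)
print(summary)
```

Here Chile and Peru each have a single customer, so their lines are pooled into "Other", which then reports two customers and three invoices.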


In [42]:

sales_by_country = '''
WITH country_or_other AS
    (
     SELECT
       CASE
           WHEN (
                 SELECT count(*)
                 FROM customer
                 where country = c.country
                ) = 1 THEN 'Other'
           ELSE c.country
       END AS country,
       c.customer_id,
       il.*
     FROM invoice_line il
     INNER JOIN invoice i ON i.invoice_id = il.invoice_id
     INNER JOIN customer c ON c.customer_id = i.customer_id
    )

SELECT
    country,
    customers,
    total_sales,
    average_order,
    customer_lifetime_value
FROM
    (
    SELECT
        country,
        count(distinct customer_id) customers,
        SUM(unit_price) total_sales,
        SUM(unit_price) / count(distinct customer_id) customer_lifetime_value,
        SUM(unit_price) / count(distinct invoice_id) average_order,
        CASE
            WHEN country = 'Other' THEN 1
            ELSE 0
        END AS sort
    FROM country_or_other
    GROUP BY country
    ORDER BY sort ASC, total_sales DESC
    );
'''

run_query(sales_by_country)

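|   | country | customers | total_sales | average_order | customer_lifetime_value |
|---|---------|-----------|-------------|---------------|-------------------------|
| 0 | USA | 13 | 1040.49 | 7.942672 | 80.037692 |
| 1 | Canada | 8 | 535.59 | 7.047237 | 66.94875 |
| 2 | Brazil | 5 | 427.68 | 7.011148 | 85.536 |
| 3 | France | 5 | 389.07 | 7.7814 | 77.814 |
| 4 | Germany | 4 | 334.62 | 8.161463 | 83.655 |
| 5 | Czech Republic | 2 | 273.24 | 9.108 | 136.62 |
| 6 | United Kingdom | 3 | 245.52 | 8.768571 | 81.84 |
| 7 | Portugal | 2 | 185.13 | 6.383793 | 92.565 |
| 8 | India | 2 | 183.15 | 8.721429 | 91.575 |
| 9 | Other | 15 | 1094.94 | 7.448571 | 72.996 |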


The USA is the top-selling individual country and has the most customers. Canada's second place is notable: it has roughly one tenth of the USA's population, but its sales are slightly more than half of the USA's. It would be worthwhile to study why Canada performs proportionally better than the USA and whether the same strategy could be applied there, since the two countries are fairly similar markets.
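
The comparison between Canada and the USA can be checked directly from the output table. The short sketch below uses only the customer counts and sales totals printed above; the helper names are made up for illustration.

```python
# (customers, total_sales) copied from the output table above.
country_rows = {
    "USA": (13, 1040.49),
    "Canada": (8, 535.59),
}


def sales_ratio(a, b):
    """Total sales of country a as a fraction of country b's."""
    return country_rows[a][1] / country_rows[b][1]


def sales_per_customer(country):
    """Total sales divided by number of customers for one country."""
    customers, total_sales = country_rows[country]
    return total_sales / customers


print(round(sales_ratio("Canada", "USA"), 3))
print(round(sales_per_customer("USA"), 2))
print(round(sales_per_customer("Canada"), 2))
```

Canada's sales come to a little over half of the USA's, but its sales per customer are lower. On these numbers, Canada's relative strength appears to come from having many customers for its size rather than from higher spending per customer.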

### Analyzing Full Album Sales

Finally, we want to determine whether customers buy full albums or only a few tracks from each album (i.e. just the popular songs).

In [43]:
albums_vs_tracks = '''
WITH invoice_first_track AS
    (
     SELECT
         il.invoice_id invoice_id,
         MIN(il.track_id) first_track_id
     FROM invoice_line il
     GROUP BY 1
    )

SELECT
    album_purchase,
    COUNT(invoice_id) number_of_invoices,
    CAST(count(invoice_id) AS FLOAT) / (
                                         SELECT COUNT(*) FROM invoice
                                      ) percent
FROM
    (
    SELECT
        ifs.*,
        CASE
            WHEN
                 (
                  SELECT t.track_id FROM track t
                  WHERE t.album_id = (
                                      SELECT t2.album_id FROM track t2
                                      WHERE t2.track_id = ifs.first_track_id
                                     ) 

                  EXCEPT 

                  SELECT il2.track_id FROM invoice_line il2
                  WHERE il2.invoice_id = ifs.invoice_id
                 ) IS NULL
             AND
                 (
                  SELECT il2.track_id FROM invoice_line il2
                  WHERE il2.invoice_id = ifs.invoice_id

                  EXCEPT 

                  SELECT t.track_id FROM track t
                  WHERE t.album_id = (
                                      SELECT t2.album_id FROM track t2
                                      WHERE t2.track_id = ifs.first_track_id
                                     ) 
                 ) IS NULL
             THEN 'yes'
             ELSE 'no'
         END AS "album_purchase"
     FROM invoice_first_track ifs
    )
GROUP BY album_purchase;
'''

run_query(albums_vs_tracks)

|   | album_purchase | number_of_invoices | percent |
|---|----------------|--------------------|---------|
| 0 | no | 500 | 0.814332 |
| 1 | yes | 114 | 0.185668 |
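
The two `EXCEPT` subqueries in the query above amount to a set-equality test: an invoice counts as an album purchase only if no album track is missing from the invoice and no invoice track falls outside the album. The sketch below expresses the same check with Python sets. The helper name, the `track_album` mapping, and the track and album ids are all invented for illustration.

```python
from collections import defaultdict


def is_album_purchase(invoice_track_ids, track_album):
    """True if the invoice contains exactly the tracks of one album.

    track_album maps track_id -> album_id (toy stand-in for the track table).
    """
    invoice_tracks = set(invoice_track_ids)
    if not invoice_tracks:
        return False

    # Group track ids by album, as the inner subqueries do per invoice.
    album_tracks = defaultdict(set)
    for track_id, album_id in track_album.items():
        album_tracks[album_id].add(track_id)

    # Like the SQL, pick the album of the invoice's lowest track id.
    first_track = min(invoice_tracks)
    album = album_tracks[track_album[first_track]]

    # Equality holds exactly when both set differences are empty.
    return invoice_tracks == album


# Invented catalogue: album 'A' has tracks 1-3, album 'B' has tracks 4-5.
toy_track_album = {1: "A", 2: "A", 3: "A", 4: "B", 5: "B"}

print(is_album_purchase([1, 2, 3], toy_track_album))     # whole album A
print(is_album_purchase([1, 2], toy_track_album))        # partial album
print(is_album_purchase([1, 2, 3, 4], toy_track_album))  # album plus an extra
```

Checking both directions matters: testing only that the album's tracks appear on the invoice would also count an invoice that contains a full album plus unrelated tracks.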


It looks like most purchases are of individual tracks rather than full albums: about 81% of invoices are not album purchases. We may be able to reduce costs by purchasing only the popular songs instead of full albums.