# Answering Business Questions using SQL

In this project, we will explore the Chinook database which stores data on music purchases for fictional retail record shop. Using a combination of SQL and a few Python libraries (SQLite, pandas, etc.), we will present different findings within the transactions stored within to make informed business decisions based on common retail inquiries.

First, let's construct some helper functions to assist in our analysis.

In [1]:
import sqlite3
import pandas as pd

def run_query(q):
    with sqlite3.connect('chinook.db') as conn:
        return pd.read_sql(q, conn)
    
def run_command(c):
    with sqlite3.connect('chinook.db') as conn:
        conn.isolation_level = None
        return conn.execute(c)

def show_tables():
    show = '''
            SELECT name, type 
            FROM sqlite_master WHERE type IN ("table", "view");
            '''
    print(run_query(show))
    
show_tables()

                    name   type
0                  album  table
1                 artist  table
2               customer  table
3               employee  table
4                  genre  table
5                invoice  table
6           invoice_line  table
7             media_type  table
8               playlist  table
9         playlist_track  table
10                 track  table
11  customers_by_country   view


## Finding the Best Selling Genres in USA

A record label has asked us to advertise four of their newest American artists' releases in our stores in the USA. The artists are labeled by the following genres: *Blues, Hip-Hop, Pop,* and *Punk*. Since the budget provided by the label is not going to cover all four artists, our goal is to evaluate which genres sell best in that market and select three of the artists based on their likelyhood to sell records.

In [2]:
genre_sales_usa = '''
                WITH track_sales_usa AS 
                ( 
                    SELECT il.track_id, il.quantity units_sold, billing_country 
                    FROM invoice_line il 
                    INNER JOIN invoice i ON il.invoice_id = i.invoice_id 
                    WHERE billing_country = 'USA' 
                ), 
                tracks_genre AS
                ( 
                    SELECT u.track_id, u.units_sold, t.genre_id, g.name genre 
                    FROM track_sales_usa u 
                    INNER JOIN track t ON u.track_id = t.track_id 
                    INNER JOIN genre g ON g.genre_id = t.genre_id 
                ) 
                SELECT COUNT(units_sold) total, genre FROM tracks_genre 
                GROUP BY genre 
                ORDER BY total DESC;
                '''
print(run_query(genre_sales_usa))

    total               genre
0     561                Rock
1     130  Alternative & Punk
2     124               Metal
3      53            R&B/Soul
4      36               Blues
5      35         Alternative
6      22               Latin
7      22                 Pop
8      20         Hip Hop/Rap
9      14                Jazz
10     13      Easy Listening
11      6              Reggae
12      5   Electronica/Dance
13      4           Classical
14      3         Heavy Metal
15      2          Soundtrack
16      1            TV Shows


It appears that of the four genres we are interested in, the top three sellers are *Punk* (130), *Blues* (36) and *Pop* (22). *Hip-Hop* (20) was the least so it would likely be a safer bet to not feature the artist that plays in that genre.

## Exploring Employee Sales

Each customer for the Chinook store gets assigned to a sales support agent within the company when they first make a purchase. We want to analyze the purchases of customers belonging to each employee to see if any sales support agent is performing either better or worse than their peers.

It may be possible that there are other factors that influence the relative sales of each employee. Let's consider whether any extra columns from the employee table may explain any variance seen, or whether the variance might instead be indicative of employee performance. It stands to reason that hire date would impact an employee's overall sales to date, so that data will be extracted with our query.

In [3]:
employee_sales = '''
                WITH total_sales AS 
                ( 
                    SELECT c.support_rep_id employee_id, SUM(i.total) total_sales 
                    FROM invoice i 
                    INNER JOIN customer c ON c.customer_id = i.customer_id 
                    GROUP BY employee_id 
                ) 
                SELECT e.first_name || ' ' || e.last_name name,
                        t.total_sales,
                        e.hire_date
                FROM employee e
                INNER JOIN total_sales t ON e.employee_id = t.employee_id
                ORDER BY t.total_sales DESC;
                '''
print(run_query(employee_sales))

            name  total_sales            hire_date
0   Jane Peacock      1731.51  2017-04-01 00:00:00
1  Margaret Park      1584.00  2017-05-03 00:00:00
2  Steve Johnson      1393.92  2017-10-17 00:00:00


As hypothesized, total sales is positively correlated with the amount of time an employee has been working with the earliest hire, Jane Peacock, having the highest total sales ($1,731.51).

## Analyzing Customer Sales by Country

We have been tasked with calculating the following data on our customers by country of origin:

- total number of customers
- total value of sales
- average value of sales per customers
- average order value

It is known that some countries have only one customer and as such will be grouped together as "Other" for our analysis. To do this, we will create a view to facilitate exploration of the data. By assigning adding another column, `Other`, to indicate which countries have more than one customer (1) and which only have one (0), we can use this column to group and sort more easily.

In [4]:
d = 'DROP VIEW customers_by_country;'
run_command(d)
q = '''
    CREATE VIEW customers_by_country AS
    SELECT 
        COUNT(customer_id) cust_count,
        country,
        CASE
            WHEN COUNT(customer_id) > 1 THEN 1
            ELSE 0
        END as other
    FROM customer
    GROUP BY country;
    '''
run_command(q)

<sqlite3.Cursor at 0x7fdc8e55fce0>

In [5]:
print(run_query('SELECT * FROM customers_by_country;'))

    cust_count         country  other
0            1       Argentina      0
1            1       Australia      0
2            1         Austria      0
3            1         Belgium      0
4            5          Brazil      1
5            8          Canada      1
6            1           Chile      0
7            2  Czech Republic      1
8            1         Denmark      0
9            1         Finland      0
10           5          France      1
11           4         Germany      1
12           1         Hungary      0
13           2           India      1
14           1         Ireland      0
15           1           Italy      0
16           1     Netherlands      0
17           1          Norway      0
18           1          Poland      0
19           2        Portugal      1
20           1           Spain      0
21           1          Sweden      0
22          13             USA      1
23           3  United Kingdom      1


Let's first look at all individually countries in our data set to see our target values sorted by `Other`:

In [6]:
customer_sales = '''
                    SELECT 
                        country,
                        cust_count,
                        total_sales,
                        avg_sales_per_cust,
                        avg_order_val,
                        other
                    FROM
                        (
                        SELECT 
                            cbc.*,
                            SUM(i.total) total_sales,
                            CAST(SUM(i.total) as Float) / CAST(COUNT(DISTINCT(c.customer_id)) as Float) avg_sales_per_cust,
                            AVG(i.total) avg_order_val,
                            CASE
                                WHEN cbc.other = 1 THEN 1
                                ELSE 0
                            END AS other
                        FROM customers_by_country cbc
                        INNER JOIN customer c ON cbc.country = c.country
                        INNER JOIN invoice i ON i.customer_id = c.customer_id
                        GROUP BY c.country
                        )
                    ORDER BY total_sales, other ASC;
                '''
print(run_query(customer_sales))

           country  cust_count  total_sales  avg_sales_per_cust  \
0          Denmark           1        37.62           37.620000   
1        Argentina           1        39.60           39.600000   
2            Italy           1        50.49           50.490000   
3          Belgium           1        60.39           60.390000   
4      Netherlands           1        65.34           65.340000   
5          Austria           1        69.30           69.300000   
6           Norway           1        72.27           72.270000   
7           Sweden           1        75.24           75.240000   
8           Poland           1        76.23           76.230000   
9          Hungary           1        78.21           78.210000   
10         Finland           1        79.20           79.200000   
11       Australia           1        81.18           81.180000   
12           Chile           1        97.02           97.020000   
13           Spain           1        98.01           98.01000

Now for a more complex query to aggregate all countries with only a single customer into one row for evaluation. We will sort the resulting query by both the `Other` column and `Total_Sales` in descending order:

In [7]:
customer_sales = '''
SELECT 
    country,
    cust_count,
    total_sales,
    avg_sales_per_cust,
    avg_order_val,
    other
FROM
    (
    SELECT 
        cbc.*,
        SUM(i.total) total_sales,
        CAST(SUM(i.total) as Float) / CAST(COUNT(DISTINCT(c.customer_id)) as Float) avg_sales_per_cust,
        AVG(i.total) avg_order_val
    FROM customers_by_country cbc
    INNER JOIN customer c ON cbc.country = c.country
    INNER JOIN invoice i ON i.customer_id = c.customer_id
    WHERE c.country IN (SELECT country FROM customers_by_country WHERE other = 1)
    GROUP BY c.country
    )

UNION

SELECT 
    CASE
        WHEN other = 0 THEN 'Other'
        ELSE NULL
    END AS country,
    SUM(cust_count) cust_count,
    SUM(total_sales) total_sales,
    CAST(SUM(total_sales) as Float) / CAST(SUM(cust_count) as Float) avg_sales_per_cust,
    CAST(SUM(total_sales) as Float) / c_count avg_order_val,
    other
FROM
    (
    SELECT 
        cbc.*,
        SUM(i.total) total_sales,
        CAST(SUM(i.total) as Float) / CAST(COUNT(DISTINCT(c.customer_id)) as Float) avg_sales_per_cust,
        AVG(i.total) avg_order_val,
        COUNT(c.customer_id) c_count
    FROM customers_by_country cbc
    INNER JOIN customer c ON cbc.country = c.country
    INNER JOIN invoice i ON i.customer_id = c.customer_id
    WHERE c.country IN (SELECT country FROM customers_by_country WHERE other = 0)
    GROUP BY c.country
    )

ORDER BY other DESC, total_sales DESC;
'''

print(run_query(customer_sales))

          country  cust_count  total_sales  avg_sales_per_cust  avg_order_val  \
0             USA          13      1040.49           80.037692       7.942672   
1          Canada           8       535.59           66.948750       7.047237   
2          Brazil           5       427.68           85.536000       7.011148   
3          France           5       389.07           77.814000       7.781400   
4         Germany           4       334.62           83.655000       8.161463   
5  Czech Republic           2       273.24          136.620000       9.108000   
6  United Kingdom           3       245.52           81.840000       8.768571   
7        Portugal           2       185.13           92.565000       6.383793   
8           India           2       183.15           91.575000       8.721429   
9           Other          15      1094.94           72.996000     109.494000   

   other  
0      1  
1      1  
2      1  
3      1  
4      1  
5      1  
6      1  
7      1  
8      1 

Given the resulting figures of aggregating our "Other" group, it would benefit our marketing team to focus their efforts on applying their budgets in such a way to capture more sales in markets where we have few customers. Clearly there is demand in these regions that could be untapped and further push the total sales overall.

## Sales by Individual Tracks vs Whole Albums

In order to cut cost on our purchasing strategy, our company is considering purchasing only the most popular tracks instead of whole albums. We have been tasked with find what percentage of purchases are individual tracks vs whole albums. Since tracks are purchased as line items in each transaction, we must determine which invoices contain whole albums.

In [8]:
albums_vs_tracks = '''
WITH invoice_first_track AS
(
 SELECT
     il.invoice_id invoice_id,
     MIN(il.track_id) first_track_id
 FROM invoice_line il
 GROUP BY 1
)

SELECT
album_purchase,
COUNT(invoice_id) number_of_invoices,
CAST(count(invoice_id) AS FLOAT) / (
                                     SELECT COUNT(*) FROM invoice
                                  ) percent
FROM
(
SELECT
    ifs.*,
    CASE
        WHEN
             (
              SELECT t.track_id FROM track t
              WHERE t.album_id = (
                                  SELECT t2.album_id FROM track t2
                                  WHERE t2.track_id = ifs.first_track_id
                                 ) 

              EXCEPT 

              SELECT il2.track_id FROM invoice_line il2
              WHERE il2.invoice_id = ifs.invoice_id
             ) IS NULL
         AND
             (
              SELECT il2.track_id FROM invoice_line il2
              WHERE il2.invoice_id = ifs.invoice_id

              EXCEPT 

              SELECT t.track_id FROM track t
              WHERE t.album_id = (
                                  SELECT t2.album_id FROM track t2
                                  WHERE t2.track_id = ifs.first_track_id
                                 ) 
             ) IS NULL
         THEN "yes"
         ELSE "no"
     END AS "album_purchase"
 FROM invoice_first_track ifs
)
GROUP BY album_purchase;
'''

run_query(albums_vs_tracks)

Unnamed: 0,album_purchase,number_of_invoices,percent
0,no,500,0.814332
1,yes,114,0.185668


Considering our findings demonstrate that the vast majority of invoices (81.5%) involve sales of individual tracks vs whole album purchases (18.5%), it would be a wise move to cut costs and purchase only the most popular tracks from albums going forward.