# Answering Business Questions Using SQL

In this project, we look at the 'Chinook' database, a sample database for a digital media store, and answer some business questions using SQL. The data model includes information about artists, albums, media tracks, invoices, and customers. More information about this database can be found in the [Chinook repository](https://github.com/lerocha/chinook-database).

--------

## Preparing before the analysis

We import the necessary libraries.

In [1]:
import sqlite3 
import pandas as pd

We create some helper functions to make working with the database easier.

In [2]:
def run_query(query):
    with sqlite3.connect('chinook.db') as conn:
        return pd.read_sql(query, conn)
    
def run_command(command):
    with sqlite3.connect('chinook.db') as conn:
        conn.isolation_level = None
        conn.execute(command)
        
def show_tables():
    query = "SELECT name, type FROM sqlite_master WHERE type IN ('table', 'view');"
    return run_query(query)
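
One detail worth knowing about the helpers above: used as a context manager, a `sqlite3` connection commits or rolls back the transaction on exit, but it does not close the connection. The sketch below shows one way to close it explicitly with `contextlib.closing`. The name `fetch_rows` and the `db_path` parameter are hypothetical, and the example uses plain `sqlite3` with an in-memory database instead of pandas so that it runs without `chinook.db`.

```python
import sqlite3
from contextlib import closing


def fetch_rows(query, db_path=":memory:"):
    """Run a read-only query and return (column_names, rows)."""
    # closing() calls conn.close() on exit, even if the query raises.
    with closing(sqlite3.connect(db_path)) as conn:
        cursor = conn.execute(query)
        columns = [col[0] for col in cursor.description]
        return columns, cursor.fetchall()


columns, rows = fetch_rows("SELECT 1 AS one, 'a' AS letter")
print(columns, rows)  # ['one', 'letter'] [(1, 'a')]
```

The same wrapper could be applied inside `run_query` by passing the closed-over connection to `pd.read_sql`.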

We run the `show_tables()` function to list the tables and views in the database.

In [3]:
show_tables()

| | name | type |
|---|---|---|
| 0 | album | table |
| 1 | artist | table |
| 2 | customer | table |
| 3 | employee | table |
| 4 | genre | table |
| 5 | invoice | table |
| 6 | invoice_line | table |
| 7 | media_type | table |
| 8 | playlist | table |
| 9 | playlist_track | table |


## Starting the analysis

### Best-selling genres in the USA

We want to know which music genres sell best in the USA. To answer this question, we count the tracks sold in each genre and calculate each genre's share of the total, using only invoices billed to the USA. The share is reported as a fraction of 1, so 0.53 means 53%.

In [4]:
query = ''' SELECT g.name genre, COUNT(t.name) sold_tracks,
            ROUND(CAST(COUNT(t.name) AS FLOAT) / (SELECT COUNT(il.track_id)
                                                  FROM invoice_line il
                                                  INNER JOIN invoice i ON i.invoice_id = il.invoice_id
                                                  WHERE i.billing_country = 'USA'), 2) sales_percentage
            FROM genre g
            INNER JOIN track t ON t.genre_id = g.genre_id
            INNER JOIN invoice_line il ON il.track_id = t.track_id
            INNER JOIN invoice i ON i.invoice_id = il.invoice_id
            WHERE i.billing_country = 'USA'
            GROUP BY 1
            ORDER BY 2 DESC'''

run_query(query)

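| | genre | sold_tracks | sales_percentage |
|---|---|---|---|
| 0 | Rock | 561 | 0.53 |
| 1 | Alternative & Punk | 130 | 0.12 |
| 2 | Metal | 124 | 0.12 |
| 3 | R&B/Soul | 53 | 0.05 |
| 4 | Blues | 36 | 0.03 |
| 5 | Alternative | 35 | 0.03 |
| 6 | Pop | 22 | 0.02 |
| 7 | Latin | 22 | 0.02 |
| 8 | Hip Hop/Rap | 20 | 0.02 |
| 9 | Jazz | 14 | 0.01 |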


**Comment:** 'Rock', 'Alternative & Punk', and 'Metal' are the best-selling genres in the USA, together accounting for roughly three quarters of the tracks sold there. If the store wants to expand its catalog for the USA, adding more music from these genres gives the new tracks a higher chance of being sold.
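
The share calculation can be illustrated on a toy dataset. The sketch below is not the notebook's query: it uses a hypothetical single table `sale` in an in-memory database and a common table expression so that the country filter is written only once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical toy data: one row per track sold.
# The Canadian row is there to show that the filter excludes it.
conn.executescript("""
    CREATE TABLE sale (genre TEXT, country TEXT);
    INSERT INTO sale VALUES
        ('Rock', 'USA'), ('Rock', 'USA'), ('Rock', 'USA'),
        ('Metal', 'USA'),
        ('Rock', 'Canada');
""")

share_query = """
    WITH usa_sale AS (
        SELECT genre FROM sale WHERE country = 'USA'
    )
    SELECT genre,
           COUNT(*) AS sold_tracks,
           ROUND(COUNT(*) * 1.0 / (SELECT COUNT(*) FROM usa_sale), 2) AS share
    FROM usa_sale
    GROUP BY genre
    ORDER BY sold_tracks DESC
"""
shares = conn.execute(share_query).fetchall()
conn.close()
print(shares)  # [('Rock', 3, 0.75), ('Metal', 1, 0.25)]
```

Multiplying by `1.0` forces floating-point division, which plays the same role as the `CAST(... AS FLOAT)` in the query above.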

### Employee sales performance

We want to know which employees, specifically those with the 'Sales Support Agent' title, perform best. To answer this question, we calculate the total amount sold by each agent. Hire date also matters when judging performance, because employees who have worked longer have had more time to sell.

To compare the agents more fairly, we calculate the amount sold per day for each employee over the store's first year (let's pretend that the store opened on '2017-04-01', when the first employee was hired, so the first year ends on '2018-04-01'). We count the days each employee worked up to that date and divide their total sales by that number to get the average sold per day.

In [5]:
query = ''' SELECT e.first_name || ' ' || e.last_name AS employee_name,
                   DATE(e.hire_date) hire_date,
                   SUM(i.total) total_sold,
                   -- days between the hire date and the end of the store's first year
                   strftime('%J','2018-04-01') - strftime('%J',e.hire_date) working_days,
                   SUM(i.total) / (strftime('%J','2018-04-01') - strftime('%J',e.hire_date)) sold_per_day
            FROM employee e
            LEFT JOIN customer c ON c.support_rep_id = e.employee_id
            LEFT JOIN invoice i ON i.customer_id = c.customer_id
            WHERE e.title = 'Sales Support Agent'
            GROUP BY 1'''

run_query(query)

| | employee_name | hire_date | total_sold | working_days | sold_per_day |
|---|---|---|---|---|---|
| 0 | Jane Peacock | 2017-04-01 | 1731.51 | 365.0 | 4.743863 |
| 1 | Margaret Park | 2017-05-03 | 1584.0 | 333.0 | 4.756757 |
| 2 | Steve Johnson | 2017-10-17 | 1393.92 | 166.0 | 8.397108 |


**Comment:** Steve Johnson has the best performance. Despite being hired several months later, he has sold considerably more per day than his colleagues. Meanwhile, Jane Peacock and Margaret Park perform at a similar level.
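
The day count relies on Julian day numbers: `strftime('%J', date)` returns the Julian day as text, and subtracting two of them yields the number of days between the dates. A minimal sketch, using SQLite's `julianday()` function (which returns the same value as a number) and the hire dates from the table above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cutoff = "2018-04-01"  # end of the store's assumed first year
hire_dates = ["2017-04-01", "2017-05-03", "2017-10-17"]

# julianday(a) - julianday(b) is the number of days from b to a.
days = [
    conn.execute("SELECT julianday(?) - julianday(?)", (cutoff, hired)).fetchone()[0]
    for hired in hire_dates
]
conn.close()
print(days)  # [365.0, 333.0, 166.0]
```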

### Summary by country

We want to summarize sales by country: the number of customers, the total sales, the average sales per customer, and the average order value. Countries with only one customer are grouped into an 'Other' category.

In [6]:
query = ''' SELECT CASE
                    WHEN country IN (SELECT country
                                     FROM customer
                                     GROUP BY country
                                     HAVING COUNT(customer_id) > 1) THEN country
                    ELSE 'Other'
                    END AS country, 
                COUNT(DISTINCT(c.customer_id)) num_customers, 
                SUM(i.total) total_sales,
                ROUND(SUM(i.total) / COUNT(DISTINCT(c.customer_id)), 2) avg_sale_per_customer,
                ROUND(SUM(i.total) / COUNT(i.invoice_id), 2) avg_order_value
            FROM (SELECT c.*,
                         CASE
                             WHEN c.country IN (SELECT country
                                                 FROM customer
                                                 GROUP BY country
                                                 HAVING COUNT(customer_id) > 1) THEN 0
                            ELSE 1
                            END AS sort
                  FROM customer c) AS c
            INNER JOIN invoice i ON i.customer_id = c.customer_id
            GROUP BY 1
            ORDER BY sort ASC, total_sales DESC'''

run_query(query)

| | country | num_customers | total_sales | avg_sale_per_customer | avg_order_value |
|---|---|---|---|---|---|
| 0 | USA | 13 | 1040.49 | 80.04 | 7.94 |
| 1 | Canada | 8 | 535.59 | 66.95 | 7.05 |
| 2 | Brazil | 5 | 427.68 | 85.54 | 7.01 |
| 3 | France | 5 | 389.07 | 77.81 | 7.78 |
| 4 | Germany | 4 | 334.62 | 83.66 | 8.16 |
| 5 | Czech Republic | 2 | 273.24 | 136.62 | 9.11 |
| 6 | United Kingdom | 3 | 245.52 | 81.84 | 8.77 |
| 7 | Portugal | 2 | 185.13 | 92.57 | 6.38 |
| 8 | India | 2 | 183.15 | 91.58 | 8.72 |
| 9 | Other | 15 | 1094.94 | 73.0 | 7.45 |


**Comment:** The USA has the largest number of customers as well as the highest total sales of any single country. Customers from the Czech Republic spend the most on average and also have the highest average order value. The 15 customers in the 'Other' group together account for more sales than the USA (1094.94 vs. 1040.49), so single-customer countries make up a large part of the store's total sales.
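
The 'Other' grouping can be sketched on a toy dataset. This is an illustrative variant, not the notebook's query: it labels each customer once in a common table expression (the name `labelled` and the sample rows are hypothetical) and sorts the 'Other' group last with a boolean expression in `ORDER BY`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical toy data: two customers in the USA, one each in Chile and Spain.
conn.executescript("""
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, country TEXT);
    CREATE TABLE invoice (invoice_id INTEGER PRIMARY KEY,
                          customer_id INTEGER, total REAL);
    INSERT INTO customer VALUES (1, 'USA'), (2, 'USA'), (3, 'Chile'), (4, 'Spain');
    INSERT INTO invoice VALUES
        (1, 1, 10.0), (2, 1, 5.0), (3, 2, 5.0), (4, 3, 8.0), (5, 4, 2.0);
""")

summary_query = """
    WITH labelled AS (
        SELECT c.customer_id,
               CASE WHEN (SELECT COUNT(*) FROM customer c2
                          WHERE c2.country = c.country) > 1
                    THEN c.country ELSE 'Other' END AS country_group
        FROM customer c
    )
    SELECT l.country_group,
           COUNT(DISTINCT l.customer_id) AS num_customers,
           SUM(i.total) AS total_sales,
           ROUND(SUM(i.total) / COUNT(DISTINCT l.customer_id), 2) AS avg_sale_per_customer,
           ROUND(SUM(i.total) / COUNT(i.invoice_id), 2) AS avg_order_value
    FROM labelled l
    INNER JOIN invoice i ON i.customer_id = l.customer_id
    GROUP BY l.country_group
    ORDER BY l.country_group = 'Other', total_sales DESC
"""
summary = conn.execute(summary_query).fetchall()
conn.close()
print(summary)
# [('USA', 2, 20.0, 10.0, 6.67), ('Other', 2, 10.0, 5.0, 5.0)]
```

`COUNT(DISTINCT ...)` matters here because the join produces one row per invoice, so a customer with several invoices would otherwise be counted several times.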

### Album purchases vs individual songs

We want to know whether most purchases are full-album purchases or selections of individual songs. To answer this question, we first build a common table expression (CTE) that looks up the album of each track on an invoice and keeps the lowest album id per invoice. A second CTE then compares the tracks on each invoice with the tracks of that album. If the two sets match exactly, we label the invoice as a full-album purchase. Finally, we group the invoices by that label, count the invoices in each group, and calculate each group's share of all invoices.

In [7]:
query = ''' WITH invoice_first_album AS
                (SELECT il.invoice_id, MIN(t.album_id) first_album_id
                FROM invoice_line il
                INNER JOIN track t ON t.track_id = il.track_id
                GROUP BY 1),
                
                album_purchase AS
                (SELECT ifa.invoice_id, 
                        CASE WHEN (SELECT t.track_id FROM track t
                                   WHERE t.album_id = ifa.first_album_id
                                   EXCEPT
                                   SELECT il.track_id FROM invoice_line il
                                   WHERE il.invoice_id = ifa.invoice_id) IS NULL
                             AND (SELECT il.track_id FROM invoice_line il
                                   WHERE il.invoice_id = ifa.invoice_id
                                   EXCEPT
                                   SELECT t.track_id FROM track t
                                   WHERE t.album_id = ifa.first_album_id) IS NULL
                            THEN 'True'
                            ELSE 'False'
                        END AS album_purchase
                FROM invoice_first_album ifa)
                
            SELECT album_purchase, 
            COUNT(invoice_id) number_of_invoices, 
            ROUND(CAST(COUNT(invoice_id) AS FLOAT) / (SELECT COUNT(*) FROM album_purchase), 2) percentage 
            FROM album_purchase
            GROUP BY album_purchase'''

run_query(query)

| | album_purchase | number_of_invoices | percentage |
|---|---|---|---|
| 0 | False | 500 | 0.81 |
| 1 | True | 114 | 0.19 |


**Comment:** Most purchases are not full-album purchases. Out of 614 invoices, only 114 are album purchases, which is only 19% of all invoices. With this information, when adding more music to the store, it may be better to acquire the most popular individual songs rather than the most popular full albums.
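
The two `EXCEPT` subqueries in the query above together test whether two sets of track ids are equal: nothing is in the album that is missing from the invoice, and nothing is on the invoice that is missing from the album. The same check can be sketched with Python sets on a toy dataset. The function name `is_album_purchase` and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical toy data: albums 10 and 20 with two tracks each.
# Invoice 100 buys all of album 10; invoice 101 mixes tracks from both albums.
conn.executescript("""
    CREATE TABLE track (track_id INTEGER PRIMARY KEY, album_id INTEGER);
    CREATE TABLE invoice_line (invoice_id INTEGER, track_id INTEGER);
    INSERT INTO track VALUES (1, 10), (2, 10), (3, 20), (4, 20);
    INSERT INTO invoice_line VALUES (100, 1), (100, 2), (101, 1), (101, 3);
""")


def is_album_purchase(conn, invoice_id):
    """Return True if the invoice contains exactly the tracks of one album."""
    bought = {row[0] for row in conn.execute(
        "SELECT track_id FROM invoice_line WHERE invoice_id = ?", (invoice_id,))}
    # Pick the lowest album id among the purchased tracks as the candidate album.
    first_album = conn.execute(
        "SELECT MIN(t.album_id) FROM invoice_line il "
        "JOIN track t ON t.track_id = il.track_id "
        "WHERE il.invoice_id = ?", (invoice_id,)).fetchone()[0]
    album = {row[0] for row in conn.execute(
        "SELECT track_id FROM track WHERE album_id = ?", (first_album,))}
    # Set equality is the same as both EXCEPT differences being empty.
    return bought == album


results = {inv: is_album_purchase(conn, inv) for inv in (100, 101)}
conn.close()
print(results)  # {100: True, 101: False}
```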