# Guided Project: Answering Business Questions using SQL

https://github.com/dataquestio/solutions/blob/master/Mission191Solutions.ipynb

I will walk through answering business questions by writing SQL queries. Each question is stated first and then answered through a series of SQL queries.

The schema of the ```chinook.db``` database that I will be working with is shown below.

<img src="chinook database schema.jpg" width=400px>

<br>

Initial code for connecting to the database, running queries, running commands, and showing tables.

In [1]:
import sqlite3
import pandas as pd

In [2]:
# Database
db = 'chinook.db'


def run_query(q):
    '''
    Function for running query on chinook.db
    '''
    with sqlite3.connect(db) as conn:
        return pd.read_sql(q, conn)

def run_command(c):
    '''
    Function for creating views
    '''
    with sqlite3.connect(db) as conn:
        conn.isolation_level = None
        conn.execute(c)

def show_tables():
    '''
    Function for showing current state of database
    '''
    
    q = 'SELECT \
             name, \
             type \
         FROM sqlite_master \
         WHERE type IN ("table","view");'
    
    with sqlite3.connect(db) as conn:
        print(conn.execute(q).fetchall())

The ```conn.isolation_level = None``` line in the ```run_command``` function above tells SQLite to autocommit any changes.
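
To see what that setting changes, here is a minimal sketch using two throwaway in-memory databases (the `demo` table is purely illustrative and not part of `chinook.db`). Under Python's default transaction handling, `sqlite3` opens a transaction before an `INSERT` and keeps it open until `commit()` is called; with `isolation_level = None` nothing is left pending.

```python
import sqlite3

# Autocommit connection, configured the same way as in run_command
auto = sqlite3.connect(":memory:")
auto.isolation_level = None
auto.execute("CREATE TABLE demo (id INTEGER)")
auto.execute("INSERT INTO demo VALUES (1)")
auto_pending = auto.in_transaction        # is a transaction still open?

# Default connection: the INSERT implicitly opens a transaction
default = sqlite3.connect(":memory:")
default.execute("CREATE TABLE demo (id INTEGER)")
default.execute("INSERT INTO demo VALUES (1)")
default_pending = default.in_transaction  # is a transaction still open?

print("autocommit pending:", auto_pending)
print("default pending:", default_pending)

default.commit()
auto.close()
default.close()
```

Note that the `with sqlite3.connect(db) as conn:` block used in the helpers also commits on a clean exit, so the autocommit setting mainly matters for statements that should take effect immediately.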

## Business Question #1

The Chinook record store has just signed a deal with a new record label, and you've been tasked with selecting the first three albums that will be added to the store, from a list of four. All four albums are by artists that don't have any tracks in the store right now - we have the artist names, and the genre of music they produce:

|Artist Name         |Genre|
|:-|:-:|
|Regal               |Hip-Hop|
|Red Tone            |Punk|
|Meteor and the Girls|Pop|
|Slim Jim Bites      |Blues|


The record label specializes in artists from the USA, and they have given Chinook some money to advertise the new albums in the USA, so we're interested in finding out which genres sell the best in the USA.

I will write a query to find out which genres sell the most tracks in the USA, and then create a visualization of that data using pandas.

1. Write a query that returns each genre, with the number of tracks sold in the USA:
   <br> - in absolute numbers
   <br> - in percentages
2. Write a paragraph that interprets the data and makes a recommendation for the three artists whose albums we should purchase for the store, based on sales of tracks from their genres.

In [3]:
# Query
q = '''SELECT 
         g.name AS genre, 
         SUM(il.quantity) AS total_sold, 
         SUM(CAST(il.quantity AS FLOAT)) / 
            (SELECT SUM(il.quantity) AS total_tracks_sold FROM invoice_line AS il) 
             AS percent_sold
     FROM track AS t 
     INNER JOIN genre AS g ON g.genre_id = t.genre_id 
     INNER JOIN invoice_line AS il ON il.track_id = t.track_id 
     GROUP BY genre 
     ORDER BY 2 DESC
     '''

genre_sales = run_query(q)

print("Sales By Genre")
genre_sales

Sales By Genre


| |genre|total_sold|percent_sold|
|:-|:-|:-:|:-:|
|0|Rock|2635|0.553921|
|1|Metal|619|0.130124|
|2|Alternative & Punk|492|0.103427|
|3|Latin|167|0.035106|
|4|R&B/Soul|159|0.033424|
|5|Blues|124|0.026067|
|6|Jazz|121|0.025436|
|7|Alternative|117|0.024595|
|8|Easy Listening|74|0.015556|
|9|Pop|63|0.013244|


In [4]:
# Query for the remaining genres and their share of tracks sold
q = '''SELECT 
         COUNT(DISTINCT(g.name)) AS count_remaining_genres,
         SUM(il.quantity) AS total_sold, 
         SUM(CAST(il.quantity AS FLOAT)) /
            (SELECT SUM(il.quantity) AS total_tracks_sold FROM invoice_line AS il)
             AS percent_sold
     FROM track AS t
     INNER JOIN genre AS g ON g.genre_id = t.genre_id
     INNER JOIN invoice_line AS il ON il.track_id = t.track_id
     WHERE NOT g.name = "Rock" AND NOT g.name = "Metal" AND NOT g.name = "Alternative & Punk"
     '''

genre_sales_exclude_top_three = run_query(q)

print("Sales By Genre Excluding Top Three")
genre_sales_exclude_top_three

Sales By Genre Excluding Top Three


| |count_remaining_genres|total_sold|percent_sold|
|:-|:-:|:-:|:-:|
|0|15|1011|0.212529|


In [5]:
# Query
q = '''SELECT
         g.name AS genre,
         SUM(il.quantity) AS total_sold,
         SUM(CAST(il.quantity AS FLOAT)) /
            (SELECT SUM(il.quantity) AS total_tracks_sold FROM invoice_line AS il
             INNER JOIN invoice AS i ON i.invoice_id = il.invoice_id
             WHERE i.billing_country = "USA")
             AS percent_sold,
          i.billing_country AS country
     FROM track AS t
     INNER JOIN genre AS g ON g.genre_id = t.genre_id
     INNER JOIN invoice_line AS il ON il.track_id = t.track_id
     INNER JOIN invoice AS i ON i.invoice_id = il.invoice_id
     WHERE i.billing_country = "USA"
     GROUP BY genre
     ORDER BY 2 DESC;
     '''

genre_sales_usa = run_query(q)

print("Sales By Genre In The USA")
genre_sales_usa

Sales By Genre In The USA


| |genre|total_sold|percent_sold|country|
|:-|:-|:-:|:-:|:-:|
|0|Rock|561|0.533777|USA|
|1|Alternative & Punk|130|0.123692|USA|
|2|Metal|124|0.117983|USA|
|3|R&B/Soul|53|0.050428|USA|
|4|Blues|36|0.034253|USA|
|5|Alternative|35|0.033302|USA|
|6|Pop|22|0.020932|USA|
|7|Latin|22|0.020932|USA|
|8|Hip Hop/Rap|20|0.019029|USA|
|9|Jazz|14|0.013321|USA|


When all tracks sold are considered, regardless of country, Rock is by far the most popular genre, accounting for 55% of total track sales. The next two genres are Metal and Alternative & Punk, with 13% and 10% respectively. The remaining 15 genres together account for only about 21% of tracks sold.
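
As a quick arithmetic check on those shares, the sketch below recomputes them from the counts in the two result tables above. The total is derived by adding the top three genres to the combined figure for the remaining genres.

```python
# Counts copied from the two result tables above
top_three = {"Rock": 2635, "Metal": 619, "Alternative & Punk": 492}
remaining_sold = 1011  # the 15 other genres combined

total_sold = sum(top_three.values()) + remaining_sold
shares = {genre: sold / total_sold for genre, sold in top_three.items()}
remaining_share = remaining_sold / total_sold

for genre, share in shares.items():
    print(f"{genre}: {share:.1%}")
print(f"Remaining genres: {remaining_share:.1%}")
```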

When only sales in the USA are considered, Rock still leads with 53% of total track sales. Alternative & Punk and Metal remain the next two genres but swap positions, at about 12% each (130 and 124 tracks respectively).

As for which three of the four artists should be chosen, I would choose <i>Red Tone (Punk)</i>, <i>Meteor and the Girls (Pop)</i>, and <i>Slim Jim Bites (Blues)</i>. I would exclude <i>Regal (Hip-Hop)</i> because Hip-Hop has the fewest tracks sold in the USA of the four genres represented by the new albums.
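
The selection can be reproduced with a few lines of Python. This is only a sketch: the mapping from the label's genre names to Chinook's genre names (Punk to "Alternative & Punk", Hip-Hop to "Hip Hop/Rap") is my assumption, and the counts are copied from the USA table above.

```python
# USA track sales by genre, copied from the "Sales By Genre In The USA" table
usa_sold = {"Alternative & Punk": 130, "Blues": 36, "Pop": 22, "Hip Hop/Rap": 20}

# Assumed mapping from each new artist to the closest Chinook genre name
artist_genre = {
    "Regal": "Hip Hop/Rap",
    "Red Tone": "Alternative & Punk",
    "Meteor and the Girls": "Pop",
    "Slim Jim Bites": "Blues",
}

# Rank artists by how many tracks of their genre were sold in the USA
ranked = sorted(
    artist_genre,
    key=lambda artist: usa_sold[artist_genre[artist]],
    reverse=True,
)
selected = ranked[:3]
print(selected)
```

The gap between Pop (22) and Hip Hop/Rap (20) is small, so the last pick is sensitive to a handful of sales.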

## Business Question #2

Each customer for the Chinook store gets assigned to a sales support agent within the company when they first make a purchase. You have been asked to analyze the purchases of customers belonging to each employee to see if any sales support agent is performing either better or worse than the others.

You might like to consider whether any extra columns from the employee table explain any variance you see, or whether the variance might instead be indicative of employee performance.

1. Write a query that finds the total dollar amount of sales assigned to each sales support agent within the company. Add any extra attributes for that employee that you find are relevant to the analysis.
2. Write a short statement describing your results, and providing a possible interpretation.

In [6]:
# The employees, customers assigned to employee, and total sales
run_query('''SELECT
               e.employee_id AS id,
               e.title as title,
               e.first_name || ' ' || e.last_name AS employee,
               COUNT( DISTINCT c.customer_id) AS num_cust,
               SUM(i.total) AS tot_sales,
               SUM(i.total) / COUNT( DISTINCT c.customer_id) AS dol_sold_per_cust,
               MIN(i.total) AS min_sale_1_tran,
               MAX(i.total) AS max_sale_1_tran
           FROM employee AS e
           LEFT JOIN customer AS c ON e.employee_id = c.support_rep_id 
           LEFT JOIN invoice as i ON i.customer_id = c.customer_id
           GROUP BY e.employee_id;
           ''')

| |id|title|employee|num_cust|tot_sales|dol_sold_per_cust|min_sale_1_tran|max_sale_1_tran|
|:-|:-:|:-|:-|:-:|:-:|:-:|:-:|:-:|
|0|1|General Manager|Andrew Adams|0|NaN|NaN|NaN|NaN|
|1|2|Sales Manager|Nancy Edwards|0|NaN|NaN|NaN|NaN|
|2|3|Sales Support Agent|Jane Peacock|21|1731.51|82.452857|0.99|23.76|
|3|4|Sales Support Agent|Margaret Park|20|1584.0|79.2|0.99|19.8|
|4|5|Sales Support Agent|Steve Johnson|18|1393.92|77.44|0.99|16.83|
|5|6|IT Manager|Michael Mitchell|0|NaN|NaN|NaN|NaN|
|6|7|IT Staff|Robert King|0|NaN|NaN|NaN|NaN|
|7|8|IT Staff|Laura Callahan|0|NaN|NaN|NaN|NaN|


The table above shows, for each employee, the number of customers assigned (num_cust) and the average sales per customer (dol_sold_per_cust). Only the three sales support agents have customers assigned, and there is little difference between them: dol_sold_per_cust ranges narrowly from 77.44 to 82.45. Based on this narrow range, all three agents are performing at a similar level.
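
To put a number on "narrow", the sketch below computes the relative gap between the highest and lowest dol_sold_per_cust values, copied from the table above.

```python
# dol_sold_per_cust values copied from the table above
per_customer = {
    "Jane Peacock": 82.452857,
    "Margaret Park": 79.2,
    "Steve Johnson": 77.44,
}

lowest = min(per_customer.values())
highest = max(per_customer.values())

# Relative gap between the best and worst agent, as a percentage of the lowest
spread_pct = (highest - lowest) / lowest * 100
print(f"Gap between highest and lowest: {spread_pct:.1f}%")
```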

## Business Question #3

Your next task is to analyze the sales data for customers from each different country. You have been given guidance to use the country value from the customers table, and ignore the country from the billing address in the invoice table.

In particular, you have been directed to calculate data, for each country, on the:

- total number of customers
- total value of sales
- average value of sales per customer
- average order value

<br>

1. Write a query that collates data on purchases from different countries. Where a country has only one customer, collect them into an "Other" group. The results should be sorted by the total sales from highest to lowest, with the "Other" group at the very bottom. For each country, include:
   <br> - total number of customers
   <br> - total value of sales
   <br> - average value of sales per customer
   <br> - average order value
2. Write a few sentences interpreting your data, and make one or more recommendations to the marketing team on which countries have potential for growth.

First, the country data in the customer table and the invoice table need to be compared to ensure that both tables hold the same country information about the customers.

In [7]:
country_by_cust_table = run_query('SELECT \
                                       c.country AS country_by_cust_table, \
                                       COUNT( DISTINCT c.customer_id) AS num_customers\
                                   FROM customer AS c\
                                   INNER JOIN invoice AS i ON i.customer_id = c.customer_id \
                                   GROUP BY c.country\
                                   ')
country_by_invoice_table  = run_query('SELECT \
                                           i.billing_country AS country_by_invoice_table, \
                                           COUNT( DISTINCT c.customer_id) AS num_customers \
                                       FROM customer AS c\
                                       INNER JOIN invoice AS i ON i.customer_id = c.customer_id \
                                       GROUP BY i.billing_country \
                                      ')

compare = country_by_cust_table.merge(country_by_invoice_table, left_index=True, right_index=True, indicator=True)
compare

| |country_by_cust_table|num_customers_x|country_by_invoice_table|num_customers_y|_merge|
|:-|:-|:-:|:-|:-:|:-:|
|0|Argentina|1|Argentina|1|both|
|1|Australia|1|Australia|1|both|
|2|Austria|1|Austria|1|both|
|3|Belgium|1|Belgium|1|both|
|4|Brazil|5|Brazil|5|both|
|5|Canada|8|Canada|8|both|
|6|Chile|1|Chile|1|both|
|7|Czech Republic|2|Czech Republic|2|both|
|8|Denmark|1|Denmark|1|both|
|9|Finland|1|Finland|1|both|


Both the customer and invoice tables show the same countries and the same number of customers in each. So now we will look at total value of sales, average value of sales per customer, number of orders, and average order value.
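
A merge on the row index pairs rows purely by position, so its `_merge` column reads `both` whenever the two frames have the same number of rows. A stricter check, sketched here on toy data (the frames and the deliberately mismatched countries are made up for illustration), joins on the country names with an outer merge so that any country present in only one table is flagged.

```python
import pandas as pd

# Toy stand-ins for the two query results; "Chile" and "Peru" are
# deliberately mismatched to show what a discrepancy looks like
by_customer = pd.DataFrame({
    "country_by_cust_table": ["Argentina", "Brazil", "Chile"],
    "num_customers": [1, 5, 1],
})
by_invoice = pd.DataFrame({
    "country_by_invoice_table": ["Argentina", "Brazil", "Peru"],
    "num_customers": [1, 5, 1],
})

# Outer merge on the country names keeps unmatched rows from both sides
check = by_customer.merge(
    by_invoice,
    left_on="country_by_cust_table",
    right_on="country_by_invoice_table",
    how="outer",
    indicator=True,
)

# A row is a mismatch if it exists on one side only or the counts differ
mismatches = check[
    (check["_merge"] != "both")
    | (check["num_customers_x"] != check["num_customers_y"])
]
print(mismatches)
```

An empty `mismatches` frame would confirm that the two tables agree on both the countries and the customer counts.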

In [8]:
print("Sales Values By Country")

run_query('''
          SELECT
               c.country AS country_by_cust_table,
               COUNT( DISTINCT c.customer_id) AS num_customers,
               SUM(i.total) AS total_sales,
               ROUND(SUM(i.total) / COUNT( DISTINCT c.customer_id),2) AS avg_sale_per_cust,
               COUNT(i.total) AS num_orders,
               ROUND(AVG(i.total), 2) AS avg_order_value
           FROM customer AS c
           INNER JOIN invoice AS i ON i.customer_id = c.customer_id
           GROUP BY c.country 
           ORDER BY total_sales DESC
           ''')

Sales Values By Country


| |country_by_cust_table|num_customers|total_sales|avg_sale_per_cust|num_orders|avg_order_value|
|:-|:-|:-:|:-:|:-:|:-:|:-:|
|0|USA|13|1040.49|80.04|131|7.94|
|1|Canada|8|535.59|66.95|76|7.05|
|2|Brazil|5|427.68|85.54|61|7.01|
|3|France|5|389.07|77.81|50|7.78|
|4|Germany|4|334.62|83.66|41|8.16|
|5|Czech Republic|2|273.24|136.62|30|9.11|
|6|United Kingdom|3|245.52|81.84|28|8.77|
|7|Portugal|2|185.13|92.57|29|6.38|
|8|India|2|183.15|91.58|21|8.72|
|9|Ireland|1|114.84|114.84|13|8.83|


Of the 24 countries that have customers, 15 have only one customer. These 15 countries will be grouped together into an <i>Other</i> category.
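
Before expressing it in SQL, the grouping rule can be sketched in plain Python. The list of countries below is hypothetical and simply stands in for one country value per row of the customer table.

```python
from collections import Counter

# Hypothetical data: one country entry per customer
customer_countries = ["USA", "USA", "Canada", "Canada", "Chile", "Norway", "Spain"]

# Count customers per country and find the countries with exactly one
customers_per_country = Counter(customer_countries)
single_customer = sorted(
    country for country, n in customers_per_country.items() if n == 1
)

# Relabel each customer's country, mirroring the intent of a CASE expression
grouped = [
    "Other" if customers_per_country[country] == 1 else country
    for country in customer_countries
]
print(single_customer)
print(Counter(grouped))
```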

In [9]:
print("Sales Values By Country with Other")

run_query('''WITH country_and_other AS
           (
           SELECT
           CASE
               WHEN (
                     SELECT count(*)
                     FROM customer
                     where country = c.country
                    ) = 1 THEN "Other"
               ELSE c.country
           END AS country,
           c.customer_id,
           i.*
           FROM invoice AS i
           INNER JOIN customer c ON c.customer_id = i.customer_id
           ) 
           
           SELECT
               country,
               num_customers,
               total_sales,
               avg_sale_per_cust,
               num_orders,
               avg_order_value
           FROM
               (
                SELECT
                    c.country AS country,
                    COUNT( DISTINCT c.customer_id) AS num_customers,
                    SUM(c.total) AS total_sales,
                    ROUND(SUM(c.total) / COUNT( DISTINCT c.customer_id),2) AS avg_sale_per_cust,
                    COUNT(c.total) AS num_orders,
                    ROUND(AVG(c.total), 2) AS avg_order_value,
                    CASE
                        WHEN country = "Other" THEN 1 
                        ELSE 0 
                    END AS sort
                FROM country_and_other AS c 
                GROUP BY c.country
                ORDER BY sort, total_sales DESC)''')

Sales Values By Country with Other


| |country|num_customers|total_sales|avg_sale_per_cust|num_orders|avg_order_value|
|:-|:-|:-:|:-:|:-:|:-:|:-:|
|0|USA|13|1040.49|80.04|131|7.94|
|1|Canada|8|535.59|66.95|76|7.05|
|2|Brazil|5|427.68|85.54|61|7.01|
|3|France|5|389.07|77.81|50|7.78|
|4|Germany|4|334.62|83.66|41|8.16|
|5|Czech Republic|2|273.24|136.62|30|9.11|
|6|United Kingdom|3|245.52|81.84|28|8.77|
|7|Portugal|2|185.13|92.57|29|6.38|
|8|India|2|183.15|91.58|21|8.72|
|9|Other|15|1094.94|73.0|147|7.45|


Based on the "Sales Values By Country with Other" table above, the USA has the most customers of any single country. As for the countries with the most potential for growth, I would point to those with the highest average sales per customer (avg_sale_per_cust) and a low number of customers (num_customers): <b>Czech Republic</b>, <b>Portugal</b>, and <b>India</b>. Because each of these has only two customers, the sample is small and the recommendation should be treated as tentative. This addresses question #2 in this section.
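
The same shortlist can be derived mechanically. In this sketch the rows are copied from the result table above, and the cutoff of three customers for a "low" customer count is my own assumption rather than anything given in the task.

```python
import pandas as pd

# Rows copied from the "Sales Values By Country with Other" table
sales = pd.DataFrame({
    "country": ["USA", "Canada", "Brazil", "France", "Germany",
                "Czech Republic", "United Kingdom", "Portugal", "India", "Other"],
    "num_customers": [13, 8, 5, 5, 4, 2, 3, 2, 2, 15],
    "avg_sale_per_cust": [80.04, 66.95, 85.54, 77.81, 83.66,
                          136.62, 81.84, 92.57, 91.58, 73.0],
})

MAX_CUSTOMERS = 3  # assumed cutoff for a "low" customer count

# Keep real countries with few customers, then rank by spend per customer
candidates = (
    sales[(sales["country"] != "Other") & (sales["num_customers"] <= MAX_CUSTOMERS)]
    .sort_values("avg_sale_per_cust", ascending=False)
    .head(3)
)
print(candidates)
```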

## Business Question #4

Management is currently considering changing their purchasing strategy to save money. The strategy they are considering is to purchase only the most popular tracks from each album from record companies, instead of purchasing every track from an album. We have been asked to find out what percentage of purchases are individual tracks vs whole albums, so that management can use this data to understand the effect this decision might have on overall revenue.

1. Write a query that categorizes each invoice as either an album purchase or not, and calculates the following summary statistics:
   <br> - Number of invoices
   <br> - Percentage of invoices
2. Write one to two sentences explaining your findings, and making a prospective recommendation on whether the Chinook store should continue to buy full albums from record companies.

One "edge case" that will be excluded is albums that have only two tracks.

Examples:

|invoice_id|album_id|full album purchase|tracks_on_album|tracks_on_invoice|
|:-|:-:|:-:|:-:|:-:|
|1|91|Yes|16|16|
|2|Multi|No|NaN|10|
|3|Multi|No|NaN|2|
|23|1|Yes|10|10|
|414|1|Yes|10|10|

In [10]:
print("Comparison To Examples")

run_query('''
          WITH track_album_invoice AS
          (
           SELECT
               t.track_id,
               a.album_id,
               il.invoice_id
           FROM invoice_line AS il
           INNER JOIN track AS t ON il.track_id = t.track_id
           INNER JOIN album AS a ON a.album_id = t.album_id
          ),
          
          track_album AS
          (
           SELECT
               t.track_id,
               a.album_id
           FROM track AS t
           INNER JOIN album AS a ON a.album_id = t.album_id
          )
          
          SELECT
              invoice_id,
              album_id,
              track_id,
              CASE
                  WHEN 
                  (
                        (
                        SELECT track_id
                        FROM track_album
                        WHERE album_id = t.album_id
                            EXCEPT
                        SELECT track_id
                        FROM track_album_invoice
                        WHERE invoice_id = t.invoice_id
                        ) IS NULL
                        AND
                        (
                        SELECT track_id
                        FROM track_album_invoice
                        WHERE invoice_id = t.invoice_id
                            EXCEPT
                        SELECT track_id
                        FROM track_album
                        WHERE album_id = t.album_id
                        ) IS NULL
                   ) IS TRUE THEN 1
                   ELSE 0
              END AS full_album_purchase,
              (    SELECT count(track_id)
                   FROM track_album
                   WHERE album_id = t.album_id 
              ) AS num_of_tracks_on_album,
              (    SELECT count(track_id)
                   FROM track_album_invoice
                   WHERE invoice_id = t.invoice_id
              ) AS num_of_tracks_on_invoice
          FROM track_album_invoice AS t
          GROUP BY invoice_id
          HAVING invoice_id = 1 OR invoice_id = 2 OR invoice_id = 3
          OR invoice_id = 23 OR invoice_id = 414
          ''')

Comparison To Examples


| |invoice_id|album_id|track_id|full_album_purchase|num_of_tracks_on_album|num_of_tracks_on_invoice|
|:-|:-:|:-:|:-:|:-:|:-:|:-:|
|0|1|91|1158|1|16|16|
|1|2|322|3476|0|11|10|
|2|3|203|2516|0|17|2|
|3|23|1|1|1|10|10|
|4|414|1|1|1|10|10|


Now that I have compared my known cases with the query output and shown that they agree on which invoices are full album purchases, I will use the above query to answer question #4 in the next cell.
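
The pair of `EXCEPT` subqueries in the query amounts to a set-equality test: an invoice counts as a full album purchase only when no album track is missing from the invoice and no invoice track falls outside the album. A minimal Python sketch of that idea, using a hypothetical helper name and made-up track ids:

```python
def is_full_album_purchase(invoice_tracks, album_tracks):
    """Return True when the invoice holds exactly the tracks on the album."""
    # Album tracks that were not bought (first EXCEPT in the SQL)
    missing_from_invoice = set(album_tracks) - set(invoice_tracks)
    # Bought tracks that are not on the album (second EXCEPT in the SQL)
    extra_on_invoice = set(invoice_tracks) - set(album_tracks)
    return not missing_from_invoice and not extra_on_invoice


# Hypothetical track ids for illustration
album = {1, 2, 3, 4}
full = is_full_album_purchase([1, 2, 3, 4], album)       # whole album
partial = is_full_album_purchase([1, 2], album)          # some tracks only
mixed = is_full_album_purchase([1, 2, 3, 4, 99], album)  # album plus an extra track
print(full, partial, mixed)
```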

In [11]:
print("Number of Full Album Purchases Per Invoice")

run_query('''
          WITH track_album_invoice AS
          (
           SELECT
               t.track_id,
               a.album_id,
               il.invoice_id
           FROM invoice_line AS il
           INNER JOIN track AS t ON il.track_id = t.track_id
           INNER JOIN album AS a ON a.album_id = t.album_id
          ),
          
          track_album AS
          (
           SELECT
               t.track_id,
               a.album_id
           FROM track AS t
           INNER JOIN album AS a ON a.album_id = t.album_id
          ),
          
          album_purchases AS
          (
          SELECT
              invoice_id,
              album_id,
              track_id,
              CASE
                  WHEN 
                  (
                        (
                        SELECT track_id
                        FROM track_album
                        WHERE album_id = t.album_id
                            EXCEPT
                        SELECT track_id
                        FROM track_album_invoice
                        WHERE invoice_id = t.invoice_id
                        ) IS NULL
                        AND
                        (
                        SELECT track_id
                        FROM track_album_invoice
                        WHERE invoice_id = t.invoice_id
                            EXCEPT
                        SELECT track_id
                        FROM track_album
                        WHERE album_id = t.album_id
                        ) IS NULL
                   ) IS TRUE THEN 1
                   ELSE 0
              END AS full_album_purchase,
              (    SELECT count(track_id)
                   FROM track_album
                   WHERE album_id = t.album_id 
              ) AS num_of_tracks_on_album,
              (    SELECT count(track_id)
                   FROM track_album_invoice
                   WHERE invoice_id = t.invoice_id
              ) AS num_of_tracks_on_invoice
          FROM track_album_invoice AS t
          GROUP BY invoice_id
          )
          
          SELECT
              SUM(full_album_purchase) AS full_album_purchases,
              COUNT(DISTINCT invoice_id) AS invoices,
              ROUND((CAST(SUM(full_album_purchase) AS FLOAT) / COUNT(DISTINCT invoice_id)) * 100, 2)
              AS percent_full_album_purchases
          FROM album_purchases
          ''')

Number of Full Album Purchases Per Invoice


| |full_album_purchases|invoices|percent_full_album_purchases|
|:-|:-:|:-:|:-:|
|0|114|614|18.57|


Based on the above table, 18.57% of invoices are full album purchases. Management can now use that number to determine whether it would be more cost effective to purchase only the most popular tracks from each album or to keep purchasing whole albums.
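
The task asks for the number and percentage of invoices in both categories, so here is a short sketch that derives the complementary figures from the two counts in the result table above.

```python
# Counts copied from the result table above
full_album_invoices = 114
total_invoices = 614

other_invoices = total_invoices - full_album_invoices

# (number of invoices, percentage of invoices) for each category
summary = {
    "album purchase": (
        full_album_invoices,
        round(full_album_invoices / total_invoices * 100, 2),
    ),
    "not an album purchase": (
        other_invoices,
        round(other_invoices / total_invoices * 100, 2),
    ),
}

for label, (count, pct) in summary.items():
    print(f"{label}: {count} invoices ({pct}%)")
```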