# Interacting with SQLite using Python
  
  <br/><br/>
 

This notebook illustrates how to interact with **SQLite** using **Python**, by connecting to the **Chinook** database, which holds information about a digital music store.  
The Chinook database can be downloaded from [this](https://www.kaggle.com/code/alaasedeeq/chinook-sql/input) page and includes:
- 11 tables
- A variety of indexes, primary and foreign key constraints
- Over 15,000 rows of data  


After connecting to the database and testing the connection, we will query it to answer some business questions about the store:  
1. Which countries have the most invoices?  
2. Which cities have the best customers?  
3. Who is the best customer?  
4. Who writes the most rock music?  
5. Which artist earned the most?  
6. Which customer spent the most on a single purchase?  
7. Which customers listen to rock music?  
8. What is the most popular genre for each country?  
9. How many songs are longer than the average song length?  
10. Which customer has spent the most for each country?  
11. What genre has the longest song on average?  
12. What is the most popular genre for each city?  
13. What month had the highest sales in the USA?  
14. What media type had the most sales?


  
Finally, we will export the database tables in **CSV** format, which will be used to create a semantic model in **Power BI** and related dashboards in the next step.  
  
  <br/><br/>

Chinook database Entity Relationship Diagram (ERD)  

![Chinook database Schema](images/chinook-schema.png?raw=true)

## Importing Libraries and Connecting to the database

In [1]:
import pandas as pd
import sqlite3

We will connect to the SQLite database using `sqlite3.connect()`, then create a cursor, which executes the SQL queries.  



In [2]:
# connect to the SQLite database 
connection = sqlite3.connect('../data/0-external/chinook.db')

# create a cursor object
cursor = connection.cursor()
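
The notebook keeps one connection open for the whole session. As a minimal sketch of the full connection life cycle, the example below opens a throwaway in-memory database (`:memory:`), runs a statement through a cursor, and closes the connection when done. The names `demo_connection` and `demo_cursor` and the one-column-pair `invoice` table are illustrative assumptions, not part of the Chinook notebook.

```python
import sqlite3
from contextlib import closing

# ":memory:" creates a temporary database; a file path such as the one
# used for chinook.db would go here instead.
# Note: a sqlite3 connection used directly in a "with" block only manages
# transactions; contextlib.closing is what actually closes it on exit.
with closing(sqlite3.connect(":memory:")) as demo_connection:
    demo_cursor = demo_connection.cursor()

    # create and fill a tiny illustrative table
    demo_cursor.execute("CREATE TABLE invoice (invoice_id INTEGER PRIMARY KEY, total REAL)")
    demo_cursor.execute("INSERT INTO invoice (total) VALUES (15.84)")
    demo_connection.commit()  # persist the insert

    # run a query through the same cursor
    row_count = demo_cursor.execute("SELECT COUNT(*) FROM invoice").fetchone()[0]

print(row_count)
```

For a read-only analysis like this one, `commit()` is not needed; it only matters when the data is modified.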

## Testing the database  

There are different ways to run a SQL query and retrieve its results: first we are going to test the cursor's `execute()` method, whose results can be retrieved with `fetchall()`, `fetchmany()`, or `fetchone()`, and then pandas' `read_sql()` and `read_sql_query()`, which return the result as a pandas DataFrame.

In [3]:
# define the SQL command
test_01 = "SELECT * FROM invoice;"

In [4]:
# execute the SQL command
cursor.execute(test_01)

# fetch all the records
all_invoices = cursor.fetchall()

# display results
print("All Invoices")
for invoice in all_invoices:
        print(invoice)

All Invoices
(1, 18, '2017-01-03 00:00:00', '627 Broadway', 'New York', 'NY', 'USA', '10012-2612', 15.84)
(2, 30, '2017-01-03 00:00:00', '230 Elgin Street', 'Ottawa', 'ON', 'Canada', 'K2P 1L7', 9.9)
(3, 40, '2017-01-05 00:00:00', '8, Rue Hanovre', 'Paris', 'None', 'France', '75002', 1.98)
(4, 18, '2017-01-06 00:00:00', '627 Broadway', 'New York', 'NY', 'USA', '10012-2612', 7.92)
(5, 27, '2017-01-07 00:00:00', '1033 N Park Ave', 'Tucson', 'AZ', 'USA', '85719', 16.83)
(6, 31, '2017-01-10 00:00:00', '194A Chain Lake Drive', 'Halifax', 'NS', 'Canada', 'B3S 1C5', 1.98)
(7, 49, '2017-01-12 00:00:00', 'Ordynacka 10', 'Warsaw', 'None', 'Poland', '00-358', 10.89)
(8, 59, '2017-01-13 00:00:00', '3,Raj Bhavan Road', 'Bangalore', 'None', 'India', '560001', 9.9)
(9, 18, '2017-01-18 00:00:00', '627 Broadway', 'New York', 'NY', 'USA', '10012-2612', 8.91)
(10, 31, '2017-01-18 00:00:00', '194A Chain Lake Drive', 'Halifax', 'NS', 'Canada', 'B3S 1C5', 1.98)
(11, 38, '2017-01-20 00:00:00', 'Barbarossastra

In [5]:
# execute the SQL command
cursor.execute(test_01)

# fetch the first ten records
ten_invoices = cursor.fetchmany(10)

# display results
print("Ten Invoices")
for invoice in ten_invoices:
        print(invoice)

Ten Invoices
(1, 18, '2017-01-03 00:00:00', '627 Broadway', 'New York', 'NY', 'USA', '10012-2612', 15.84)
(2, 30, '2017-01-03 00:00:00', '230 Elgin Street', 'Ottawa', 'ON', 'Canada', 'K2P 1L7', 9.9)
(3, 40, '2017-01-05 00:00:00', '8, Rue Hanovre', 'Paris', 'None', 'France', '75002', 1.98)
(4, 18, '2017-01-06 00:00:00', '627 Broadway', 'New York', 'NY', 'USA', '10012-2612', 7.92)
(5, 27, '2017-01-07 00:00:00', '1033 N Park Ave', 'Tucson', 'AZ', 'USA', '85719', 16.83)
(6, 31, '2017-01-10 00:00:00', '194A Chain Lake Drive', 'Halifax', 'NS', 'Canada', 'B3S 1C5', 1.98)
(7, 49, '2017-01-12 00:00:00', 'Ordynacka 10', 'Warsaw', 'None', 'Poland', '00-358', 10.89)
(8, 59, '2017-01-13 00:00:00', '3,Raj Bhavan Road', 'Bangalore', 'None', 'India', '560001', 9.9)
(9, 18, '2017-01-18 00:00:00', '627 Broadway', 'New York', 'NY', 'USA', '10012-2612', 8.91)
(10, 31, '2017-01-18 00:00:00', '194A Chain Lake Drive', 'Halifax', 'NS', 'Canada', 'B3S 1C5', 1.98)


In [6]:
# execute the SQL command
cursor.execute(test_01)

# fetch the first record
invoice = cursor.fetchone()

# display results
print("First Invoice")
print(invoice)

First Invoice
(1, 18, '2017-01-03 00:00:00', '627 Broadway', 'New York', 'NY', 'USA', '10012-2612', 15.84)
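
The cells above call `cursor.execute(test_01)` again before each fetch because a cursor hands out each row only once. The sketch below shows this on a small in-memory table; the table contents and the names `demo` and `cur` are assumptions made for illustration.

```python
import sqlite3

# build a four-row illustrative table in a temporary database
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE invoice (invoice_id INTEGER PRIMARY KEY, total REAL)")
demo.executemany("INSERT INTO invoice (total) VALUES (?)",
                 [(15.84,), (9.9,), (1.98,), (7.92,)])

cur = demo.cursor()
cur.execute("SELECT invoice_id, total FROM invoice ORDER BY invoice_id")

first = cur.fetchone()        # the next row, as a tuple
next_two = cur.fetchmany(2)   # up to two further rows, as a list of tuples
remaining = cur.fetchall()    # every row that is still unread
exhausted = cur.fetchone()    # None, because the result set is used up

demo.close()

print(first, next_two, remaining, exhausted)
```

Each fetch continues where the previous one stopped, so mixing the three methods on one executed query splits the result set rather than repeating it.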


In [7]:
# display SQL query in a DataFrame
pd.read_sql(test_01, con=connection)

Unnamed: 0,invoice_id,customer_id,invoice_date,billing_address,billing_city,billing_state,billing_country,billing_postal_code,total
0,1,18,2017-01-03 00:00:00,627 Broadway,New York,NY,USA,10012-2612,15.84
1,2,30,2017-01-03 00:00:00,230 Elgin Street,Ottawa,ON,Canada,K2P 1L7,9.90
2,3,40,2017-01-05 00:00:00,"8, Rue Hanovre",Paris,,France,75002,1.98
3,4,18,2017-01-06 00:00:00,627 Broadway,New York,NY,USA,10012-2612,7.92
4,5,27,2017-01-07 00:00:00,1033 N Park Ave,Tucson,AZ,USA,85719,16.83
...,...,...,...,...,...,...,...,...,...
609,610,55,2020-12-21 00:00:00,421 Bourke Street,Sidney,NSW,Australia,2010,6.93
610,611,52,2020-12-27 00:00:00,202 Hoxton Street,London,,United Kingdom,N1 5LH,1.98
611,612,33,2020-12-27 00:00:00,5112 48 Street,Yellowknife,NT,Canada,X1A 1N6,11.88
612,613,20,2020-12-29 00:00:00,541 Del Medio Avenue,Mountain View,CA,USA,94040-111,8.91


In [8]:
# display SQL query in a DataFrame
pd.read_sql_query(test_01, connection)

Unnamed: 0,invoice_id,customer_id,invoice_date,billing_address,billing_city,billing_state,billing_country,billing_postal_code,total
0,1,18,2017-01-03 00:00:00,627 Broadway,New York,NY,USA,10012-2612,15.84
1,2,30,2017-01-03 00:00:00,230 Elgin Street,Ottawa,ON,Canada,K2P 1L7,9.90
2,3,40,2017-01-05 00:00:00,"8, Rue Hanovre",Paris,,France,75002,1.98
3,4,18,2017-01-06 00:00:00,627 Broadway,New York,NY,USA,10012-2612,7.92
4,5,27,2017-01-07 00:00:00,1033 N Park Ave,Tucson,AZ,USA,85719,16.83
...,...,...,...,...,...,...,...,...,...
609,610,55,2020-12-21 00:00:00,421 Bourke Street,Sidney,NSW,Australia,2010,6.93
610,611,52,2020-12-27 00:00:00,202 Hoxton Street,London,,United Kingdom,N1 5LH,1.98
611,612,33,2020-12-27 00:00:00,5112 48 Street,Yellowknife,NT,Canada,X1A 1N6,11.88
612,613,20,2020-12-29 00:00:00,541 Del Medio Avenue,Mountain View,CA,USA,94040-111,8.91


In [9]:
# define the SQL command
test_02 = """SELECT i.billing_city, COUNT(i.invoice_id) as invoices
            FROM invoice i
            GROUP BY i.billing_city
            ORDER BY invoices DESC"""

In [10]:
# display SQL query in a DataFrame
pd.read_sql_query(test_02, connection)

Unnamed: 0,billing_city,invoices
0,Prague,30
1,São Paulo,22
2,Mountain View,20
3,Berlin,20
4,London,19
5,Paris,18
6,Porto,16
7,Brasília,15
8,São José dos Campos,13
9,Santiago,13


In [11]:
# display SQL query in a DataFrame
pd.read_sql_query(test_02, connection).head()

Unnamed: 0,billing_city,invoices
0,Prague,30
1,São Paulo,22
2,Mountain View,20
3,Berlin,20
4,London,19


## Querying the database  

After testing the functionality of the SQL connection, we will answer some business questions.  
  
  <br/><br/>
  
###  Which countries have the most invoices?  


In [12]:
# define the SQL command
query_01 = """SELECT i.billing_country as 'Billing Country', COUNT(i.invoice_id) as Invoices
              FROM invoice i
              GROUP BY i.billing_country
              ORDER BY Invoices DESC"""

In [13]:
# display SQL query in a DataFrame
pd.read_sql_query(query_01, connection)

Unnamed: 0,Billing Country,Invoices
0,USA,131
1,Canada,76
2,Brazil,61
3,France,50
4,Germany,41
5,Czech Republic,30
6,Portugal,29
7,United Kingdom,28
8,India,21
9,Ireland,13


  <br/><br/>

###  Which cities have the best customers?

In [14]:
# define the SQL command
query_02 = """SELECT i.billing_city as 'Billing City', SUM(i.total) as 'Invoice Totals (USD)'
              FROM invoice i
              GROUP BY i.billing_city
              ORDER BY SUM(i.total) DESC
              LIMIT 10"""

In [15]:
# display SQL query in a DataFrame
pd.read_sql_query(query_02, connection)

Unnamed: 0,Billing City,Invoice Totals (USD)
0,Prague,273.24
1,Mountain View,169.29
2,London,166.32
3,Berlin,158.4
4,Paris,151.47
5,São Paulo,129.69
6,Dublin,114.84
7,Delhi,111.87
8,São José dos Campos,108.9
9,Brasília,106.92


  <br/><br/>

###  Who is the best customer?

In [16]:
# define the SQL command
query_03 = """SELECT c.customer_id as 'Customer ID', SUM(i.total) as 'Invoice Totals (USD)'
              FROM customer c
              JOIN invoice i
              ON c.customer_id = i.customer_id
              GROUP BY c.customer_id
              ORDER BY SUM(i.total) DESC
              LIMIT 1"""

In [17]:
# display SQL query in a DataFrame
pd.read_sql_query(query_03, connection)

Unnamed: 0,Customer ID,Invoice Totals (USD)
0,5,144.54


  <br/><br/>

###  Who writes the most rock music?

In [18]:
# define the SQL command
query_04 = """SELECT a.artist_id as 'Artist ID', a.name as 'Artist Name', COUNT(t.track_id) as Songs
              FROM artist a
              JOIN album b
              ON a.artist_id = b.artist_id
              JOIN track t
              ON b.album_id = t.album_id
              JOIN genre g
              ON t.genre_id = g.genre_id
              WHERE g.name='Rock'
              GROUP BY a.artist_id, a.name
              ORDER BY Songs DESC
              LIMIT 20"""

In [19]:
# display SQL query in a DataFrame
pd.read_sql_query(query_04, connection)

Unnamed: 0,Artist ID,Artist Name,Songs
0,22,Led Zeppelin,114
1,150,U2,112
2,58,Deep Purple,92
3,90,Iron Maiden,81
4,118,Pearl Jam,54
5,152,Van Halen,52
6,51,Queen,45
7,142,The Rolling Stones,41
8,76,Creedence Clearwater Revival,40
9,52,Kiss,35


  <br/><br/>

###  Which artist earned the most?

In [20]:
# define the SQL command
query_05 = """SELECT a.name as 'Artist Name', SUM(l.unit_price * l.quantity) as 'Amount Spent (USD)'
              FROM artist a
              JOIN album b
              ON a.artist_id = b.artist_id
              JOIN track t
              ON b.album_id = t.album_id
              JOIN invoice_line l
              ON t.track_id = l.track_id
              JOIN invoice i
              ON l.invoice_id = i.invoice_id
              JOIN customer c
              ON i.customer_id = c.customer_id
              GROUP BY a.name
              ORDER BY SUM(l.unit_price * l.quantity) DESC
              LIMIT 10"""

In [21]:
# display SQL query in a DataFrame
pd.read_sql_query(query_05, connection)

Unnamed: 0,Artist Name,Amount Spent (USD)
0,Queen,190.08
1,Jimi Hendrix,185.13
2,Red Hot Chili Peppers,128.7
3,Nirvana,128.7
4,Pearl Jam,127.71
5,Guns N' Roses,122.76
6,AC/DC,122.76
7,Foo Fighters,119.79
8,The Rolling Stones,115.83
9,Metallica,104.94


  <br/><br/>

###  Which customer spent the most on a single purchase?

In [22]:
# define the SQL command
query_06 = """SELECT a.name as 'Artist Name', SUM(l.unit_price * l.quantity) as 'Amount Spent (USD)',
              c.first_name as 'Customer Name', c.last_name as 'Customer Surname', c.customer_id as 'Customer ID'
              FROM artist a
              JOIN album b
              ON a.artist_id = b.artist_id
              JOIN track t
              ON b.album_id = t.album_id
              JOIN invoice_line l
              ON t.track_id = l.track_id
              JOIN invoice i
              ON l.invoice_id = i.invoice_id
              JOIN customer c
              ON i.customer_id = c.customer_id
              GROUP BY a.name, c.customer_id, c.first_name, c.last_name
              ORDER BY SUM(l.unit_price * l.quantity) DESC
              LIMIT 10"""

In [23]:
# display SQL query in a DataFrame
pd.read_sql_query(query_06, connection)

Unnamed: 0,Artist Name,Amount Spent (USD),Customer Name,Customer Surname,Customer ID
0,Queen,27.72,Hugh,O'Reilly,46
1,Frank Sinatra,23.76,Wyatt,Girard,42
2,Creedence Clearwater Revival,19.8,Robert,Brown,29
3,James Brown,19.8,Aaron,Mitchell,32
4,Kiss,19.8,František,Wichterlová,5
5,Red Hot Chili Peppers,19.8,Helena,Holý,6
6,The Who,19.8,François,Tremblay,3
7,House Of Pain,18.81,Heather,Leacock,22
8,Nirvana,18.81,Hugh,O'Reilly,46
9,Queen,18.81,Niklas,Schröder,38


  <br/><br/>

###  Which customers listen to rock music?

In [24]:
# define the SQL command
query_07 = """SELECT c.email as 'Customer Email', c.first_name as 'Customer Name', c.last_name as 'Customer Surname',
              g.name as Genre
              FROM customer c
              JOIN invoice i
              ON c.customer_id = i.customer_id
              JOIN invoice_line l
              ON i.invoice_id = l.invoice_id
              JOIN track t
              ON l.track_id = t.track_id
              JOIN genre g
              ON t.genre_id = g.genre_id
              WHERE g.name='Rock'
              GROUP BY c.email, c.first_name, c.last_name, g.name
              ORDER BY c.email"""

In [25]:
# display SQL query in a DataFrame
pd.read_sql_query(query_07, connection)

Unnamed: 0,Customer Email,Customer Name,Customer Surname,Genre
0,aaronmitchell@yahoo.ca,Aaron,Mitchell,Rock
1,alero@uol.com.br,Alexandre,Rocha,Rock
2,astrid.gruber@apple.at,Astrid,Gruber,Rock
3,bjorn.hansen@yahoo.no,Bjørn,Hansen,Rock
4,camille.bernard@yahoo.fr,Camille,Bernard,Rock
5,daan_peeters@apple.be,Daan,Peeters,Rock
6,diego.gutierrez@yahoo.ar,Diego,Gutiérrez,Rock
7,dmiller@comcast.com,Dan,Miller,Rock
8,dominiquelefebvre@gmail.com,Dominique,Lefebvre,Rock
9,edfrancis@yachoo.ca,Edward,Francis,Rock
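
The genre name `'Rock'` is written directly into the SQL text of `query_04` and `query_07`. As a sketch of an alternative, the example below passes the genre through a `?` placeholder so that one query string serves any genre. The two small tables and the names `demo` and `tracks_by_genre` are hypothetical stand-ins for the Chinook schema.

```python
import sqlite3

# tiny stand-in for the genre and track tables
demo = sqlite3.connect(":memory:")
demo.executescript("""
    CREATE TABLE genre (genre_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE track (track_id INTEGER PRIMARY KEY, name TEXT, genre_id INTEGER);
    INSERT INTO genre VALUES (1, 'Rock'), (2, 'Jazz');
    INSERT INTO track VALUES (1, 'Track A', 1), (2, 'Track B', 2), (3, 'Track C', 1);
""")

# the ? placeholder is filled from the tuple passed to execute()
tracks_by_genre = """SELECT t.name
                     FROM track t
                     JOIN genre g
                     ON t.genre_id = g.genre_id
                     WHERE g.name = ?
                     ORDER BY t.name"""

rock_tracks = demo.execute(tracks_by_genre, ("Rock",)).fetchall()
jazz_tracks = demo.execute(tracks_by_genre, ("Jazz",)).fetchall()
demo.close()

print(rock_tracks, jazz_tracks)
```

`pd.read_sql_query` accepts the same values through its `params` argument, so the placeholder form also works with the DataFrame-based cells.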


In [26]:
# Note: the following queries use subqueries and common table expressions (CTEs);
# the indentation inside the SQL strings is for readability only and is not mandatory

  <br/><br/>

###  What is the most popular genre for each country?

In [27]:
# define the SQL command
query_08 = """WITH GenrePerCountry AS
                   (SELECT SUM(l.quantity) as Purchases, c.country, g.name, g.genre_id
                   FROM customer c
                   JOIN invoice i
                   ON c.customer_id = i.customer_id
                   JOIN invoice_line l
                   ON i.invoice_id = l.invoice_id
                   JOIN track t
                   ON l.track_id = t.track_id
                   JOIN genre g
                   ON t.genre_id = g.genre_id
                   GROUP BY c.country, g.name, g.genre_id
                   ORDER BY c.country)

              SELECT a.country as Country, a.name as Genre, a.genre_id as 'Genre ID', a.Purchases
              FROM GenrePerCountry a
              WHERE a.Purchases = (SELECT MAX(Purchases)
                                  FROM GenrePerCountry
                                  WHERE a.country = Country
                                  GROUP BY Country)
              ORDER BY Country"""

In [28]:
# display SQL query in a DataFrame
pd.read_sql_query(query_08, connection)

Unnamed: 0,Country,Genre,Genre ID,Purchases
0,Argentina,Alternative & Punk,4,17
1,Australia,Rock,1,34
2,Austria,Rock,1,40
3,Belgium,Rock,1,26
4,Brazil,Rock,1,205
5,Canada,Rock,1,333
6,Chile,Rock,1,61
7,Czech Republic,Rock,1,143
8,Denmark,Rock,1,24
9,Finland,Rock,1,46


  <br/><br/>

###  How many songs are longer than the average song length?

In [29]:
# define the SQL command
query_09 = """SELECT a.name as 'Artist Name', t.name as 'Track Name', (t.milliseconds / 1000.0) as Seconds
              FROM track t
              JOIN album b
              ON t.album_id = b.album_id
              JOIN artist a
              ON b.artist_id = a.artist_id
              GROUP BY t.name, Seconds
              HAVING milliseconds > (SELECT AVG(milliseconds)
                                     FROM track)
              ORDER BY Seconds DESC"""

In [30]:
# display SQL query in a DataFrame
pd.read_sql_query(query_09, connection)

Unnamed: 0,Artist Name,Track Name,Seconds
0,Battlestar Galactica,Occupation / Precipice,5286.953
1,Lost,Through a Looking Glass,5088.838
2,Battlestar Galactica (Classic),"Greetings from Earth, Pt. 1",2960.293
3,Battlestar Galactica (Classic),The Man With Nine Lives,2956.998
4,Battlestar Galactica (Classic),"Battlestar Galactica, Pt. 2",2956.081
...,...,...,...
489,Iron Maiden,22 Acacia Avenue,395.572
490,Metallica,The Unforgiven II,395.520
491,Metallica,The Shortest Straw,395.389
492,"Berliner Philharmoniker, Claudio Abbado & Sabi...","Concerto for Clarinet in A Major, K. 622: II. ...",394.482
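
The heading asks how many songs exceed the average length, while `query_09` lists them and filters with `GROUP BY ... HAVING`. Grouping by track name and length merges tracks that share both values, so the number of listed rows can be lower than the number of qualifying tracks. The sketch below contrasts a direct `COUNT(*)` with `WHERE` against the grouped form on a hypothetical five-row `track` table; whether Chinook actually contains such duplicates is not checked here.

```python
import sqlite3

# illustrative tracks; 'Long' appears twice with the same length
demo = sqlite3.connect(":memory:")
demo.executescript("""
    CREATE TABLE track (track_id INTEGER PRIMARY KEY, name TEXT, milliseconds INTEGER);
    INSERT INTO track (name, milliseconds) VALUES
        ('Intro', 100000), ('Outro', 100000),
        ('Long', 400000), ('Long', 400000), ('Epic', 500000);
""")

# count every track above the average length (the average here is 300000 ms)
longer_than_average = demo.execute("""
    SELECT COUNT(*)
    FROM track
    WHERE milliseconds > (SELECT AVG(milliseconds) FROM track)
""").fetchone()[0]

# the grouped form collapses the two identical 'Long' rows into one
grouped_rows = demo.execute("""
    SELECT name, milliseconds
    FROM track
    GROUP BY name, milliseconds
    HAVING milliseconds > (SELECT AVG(milliseconds) FROM track)
""").fetchall()
demo.close()

print(longer_than_average, len(grouped_rows))
```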


  <br/><br/>

###  Which customer has spent the most for each country?

In [31]:
# define the SQL command
query_10 = """WITH CustomerPerCountry AS
                   (SELECT c.country, SUM(i.total) as TotalSpent, c.first_name, c.last_name, c.customer_id
                   FROM customer c
                   JOIN invoice i
                   ON c.customer_id = i.customer_id
                   GROUP BY c.country, c.first_name, c.last_name, c.customer_id
                   ORDER BY TotalSpent DESC)

              SELECT a.country as Country, a.TotalSpent as 'Total Spent (USD)', a.first_name as 'Customer Name',
              a.last_name as 'Customer Surname', a.customer_id as 'Customer ID'
              FROM CustomerPerCountry a
              WHERE a.TotalSpent = (SELECT MAX(TotalSpent)
                                   FROM CustomerPerCountry
                                   WHERE a.country = Country
                                   GROUP BY Country)
              ORDER BY Country"""

In [32]:
# display SQL query in a DataFrame
pd.read_sql_query(query_10, connection)

Unnamed: 0,Country,Total Spent (USD),Customer Name,Customer Surname,Customer ID
0,Argentina,39.6,Diego,Gutiérrez,56
1,Australia,81.18,Mark,Taylor,55
2,Austria,69.3,Astrid,Gruber,7
3,Belgium,60.39,Daan,Peeters,8
4,Brazil,108.9,Luís,Gonçalves,1
5,Canada,99.99,François,Tremblay,3
6,Chile,97.02,Luis,Rojas,57
7,Czech Republic,144.54,František,Wichterlová,5
8,Denmark,37.62,Kara,Nielsen,9
9,Finland,79.2,Terhi,Hämäläinen,44


  <br/><br/>

###  What genre has the longest song on average?

In [33]:
# define the SQL command
query_11 = """SELECT g.name AS "Genre", ROUND(AVG(t.milliseconds)/1000, 2) AS "Average Length of Songs (sec)"
              FROM track t
              JOIN genre g
              ON t.genre_id = g.genre_id
              GROUP BY 1
              ORDER BY 2 DESC"""

In [34]:
# display SQL query in a DataFrame
pd.read_sql_query(query_11, connection)

Unnamed: 0,Genre,Average Length of Songs (sec)
0,Sci Fi & Fantasy,2911.78
1,Science Fiction,2625.55
2,Drama,2575.28
3,TV Shows,2145.04
4,Comedy,1585.26
5,Metal,309.75
6,Electronica/Dance,302.99
7,Heavy Metal,297.45
8,Classical,293.87
9,Jazz,291.76
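
`query_09` divides by `1000.0` while `query_11` divides an `AVG(...)` by `1000`. Both keep the fractional seconds, because SQLite only truncates when both operands are integers and `AVG` returns a floating-point value. A small sketch with made-up numbers:

```python
import sqlite3

demo = sqlite3.connect(":memory:")

int_div, float_div, avg_div = demo.execute("""
    SELECT 5500 / 1000,      -- integer / integer: the fraction is dropped
           5500 / 1000.0,    -- a REAL operand keeps the fraction
           (SELECT AVG(ms) FROM (SELECT 5000 AS ms UNION ALL SELECT 6001)) / 1000
""").fetchone()
demo.close()

print(int_div, float_div, avg_div)
```

Dividing the raw `milliseconds` column by `1000` without the `.0` would therefore round every length down to whole seconds.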


  <br/><br/>

###  What is the most popular genre for each city?

In [35]:
# define the SQL command
query_12 = """WITH GenrePerCity AS
                   (SELECT SUM(l.quantity) AS Purchases, c.city, c.country, g.name
                   FROM customer c
                   JOIN invoice i
                   ON c.customer_id = i.customer_id
                   JOIN invoice_line l
                   ON i.invoice_id = l.invoice_id
                   JOIN track t
                   ON l.track_id = t.track_id
                   JOIN genre g
                   ON t.genre_id = g.genre_id
                   GROUP BY 2, 3, 4
                   ORDER BY 2)

              SELECT a.city AS "City", a.country AS "Country", a.name AS "Genre", a.Purchases AS "Total Purchases"
              FROM GenrePerCity a
              WHERE a.Purchases = (SELECT MAX(Purchases)
                                  FROM GenrePerCity
                                  WHERE a.city = City
                                  GROUP BY City)
              ORDER BY Purchases DESC"""

In [36]:
# display SQL query in a DataFrame
pd.read_sql_query(query_12, connection)

Unnamed: 0,City,Country,Genre,Total Purchases
0,Prague,Czech Republic,Rock,143
1,London,United Kingdom,Rock,109
2,Paris,France,Rock,91
3,Mountain View,USA,Rock,90
4,Berlin,Germany,Rock,84
5,Montréal,Canada,Rock,75
6,Dublin,Ireland,Rock,72
7,São José dos Campos,Brazil,Rock,72
8,Lisbon,Portugal,Rock,68
9,Frankfurt,Germany,Rock,65


  <br/><br/>

###  What month had the highest sales in the USA?

In [37]:
# define the SQL command
query_13 = """SELECT DATE(i.invoice_date, 'start of month') AS "Month", SUM(i.total) AS "Total Purchases (USD)"
              FROM invoice i
              JOIN customer c
              ON i.customer_id = c.customer_id
              WHERE c.country = 'USA'
              GROUP BY 1
              ORDER BY 2 DESC"""

In [38]:
# display SQL query in a DataFrame
pd.read_sql_query(query_13, connection)

Unnamed: 0,Month,Total Purchases (USD)
0,2017-01-01,61.38
1,2020-10-01,52.47
2,2019-07-01,52.47
3,2017-04-01,52.47
4,2018-03-01,42.57
5,2019-09-01,40.59
6,2018-01-01,38.61
7,2020-09-01,37.62
8,2020-03-01,37.62
9,2018-02-01,34.65
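
`DATE(invoice_date, 'start of month')` maps every invoice to the first day of its month, so the query ranks individual year-month pairs such as January 2017. If the question is read as "which calendar month, across all years", `strftime('%m', ...)` groups by month number instead. The sketch below compares both readings on a hypothetical four-row `invoice` table.

```python
import sqlite3

# illustrative invoices spanning two years
demo = sqlite3.connect(":memory:")
demo.executescript("""
    CREATE TABLE invoice (invoice_date TEXT, total REAL);
    INSERT INTO invoice VALUES
        ('2017-01-03 00:00:00', 10.0), ('2017-01-20 00:00:00', 5.0),
        ('2018-01-11 00:00:00', 8.0), ('2018-02-02 00:00:00', 20.0);
""")

# one group per year-month pair, as in query_13
by_year_month = demo.execute("""
    SELECT DATE(invoice_date, 'start of month') AS Month, SUM(total)
    FROM invoice
    GROUP BY 1
    ORDER BY 2 DESC
""").fetchall()

# one group per calendar month, summed over all years
by_calendar_month = demo.execute("""
    SELECT strftime('%m', invoice_date) AS Month, SUM(total)
    FROM invoice
    GROUP BY 1
    ORDER BY 2 DESC
""").fetchall()
demo.close()

print(by_year_month)
print(by_calendar_month)
```

In this toy data the two readings disagree: February 2018 is the best single month, but January is the best calendar month overall.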


  <br/><br/>

###  What media type had the most sales?

In [39]:
# define the SQL command
query_14 = """SELECT m.name AS "Media Type", SUM(l.unit_price*l.quantity) AS "Total Purchases (USD)" 
              FROM media_type m
              JOIN track t
              ON m.media_type_id = t.media_type_id
              JOIN invoice_line l
              ON t.track_id = l.track_id
              JOIN invoice i
              ON l.invoice_id = i.invoice_id
              GROUP BY 1
              ORDER BY 2 DESC"""

In [40]:
# display SQL query in a DataFrame
pd.read_sql_query(query_14, connection)

Unnamed: 0,Media Type,Total Purchases (USD)
0,MPEG audio file,4216.41
1,Protected AAC audio file,434.61
2,Purchased AAC audio file,34.65
3,AAC audio file,20.79
4,Protected MPEG-4 video file,2.97


In [41]:
# load the album table into a DataFrame
album = pd.read_sql_query('SELECT * FROM album', connection)

# export the DataFrame as CSV
album.to_csv('../data/1-raw/album.csv', encoding='utf-8', index=False)

In [51]:
tables = ['album', 'artist', 'customer', 'employee', 'genre', 'invoice', 'invoice_line', 'media_type',
          'playlist', 'playlist_track', 'track']

for t in tables:
    # read the whole table into a DataFrame
    df = pd.read_sql_query('SELECT * FROM {t}'.format(t=t), connection)
    # export the DataFrame as CSV, one file per table
    df.to_csv('../data/1-raw/{t}.csv'.format(t=t), encoding='utf-8', index=False)



In [59]:
tables = ['album', 'artist', 'customer', 'employee', 'genre', 'invoice', 'invoice_line', 'media_type',
          'playlist', 'playlist_track', 'track']

for table in tables:
    # read the whole table into a DataFrame
    df = pd.read_sql_query(f'SELECT * FROM {table}', connection)
    # export the DataFrame as CSV, one file per table
    df.to_csv(f'../data/1-raw/{table}.csv', encoding='utf-8', index=False)
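
The table names above are typed in by hand. As a sketch of an alternative, the names can be read from SQLite's built-in `sqlite_master` catalog, so the export loop follows whatever tables the database contains. The example uses only the standard library (`csv` instead of pandas) on a hypothetical two-table database, and writes to a temporary folder standing in for `../data/1-raw`.

```python
import csv
import sqlite3
import tempfile
from pathlib import Path

# hypothetical two-table database standing in for chinook.db
demo = sqlite3.connect(":memory:")
demo.executescript("""
    CREATE TABLE album (album_id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE genre (genre_id INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO album VALUES (1, 'Album A');
    INSERT INTO genre VALUES (1, 'Rock'), (2, 'Jazz');
""")

# sqlite_master lists every object in the database; keep user tables only
table_names = [row[0] for row in demo.execute(
    "SELECT name FROM sqlite_master "
    "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name")]

output_dir = Path(tempfile.mkdtemp())  # stand-in for ../data/1-raw

for table in table_names:
    rows = demo.execute(f'SELECT * FROM "{table}"')
    # cursor.description holds one entry per column; item 0 is the column name
    header = [column[0] for column in rows.description]
    with open(output_dir / f"{table}.csv", "w", newline="", encoding="utf-8") as csv_file:
        writer = csv.writer(csv_file)
        writer.writerow(header)
        writer.writerows(rows)

exported = sorted(path.name for path in output_dir.iterdir())
demo.close()

print(exported)
```

The same `table_names` list could feed the pandas loop in place of the hard-coded `tables` list.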

