# Python and Databases

We'll once again split this homework into two portions, a lesson for you to work through and some practice problems.

In [5]:
import pandas as pd

## Continued Learning

### `sqlalchemy`

This package is widely used in industry, so it's worth being at least somewhat familiar with it.

`sqlalchemy` was designed so that you can interact with true `SQL` databases in `python`. For the remainder of this notebook we'll see how we can use it to read in data from a database and then turn it into a `pandas` `DataFrame`. If you'd like to learn more check out the docs, <a href="https://www.sqlalchemy.org/">https://www.sqlalchemy.org/</a>.

In [1]:
## Try to import sqlalchemy,
## If this doesn't work you'll need to install it
import sqlalchemy


Just like with `sqlite3` we'll go step by step.

##### Creating an Engine

In [2]:
## The first step is to create an engine
## The sqlalchemy engine is how we 
## communicate with the database
from sqlalchemy import create_engine

In [3]:
## When we create the engine we give it a database URL.
## The first part of the URL names the dialect, the
## flavor of SQL the database speaks. For us this is SQLite.
## The rest of the URL says where the database lives,
## here the file census.sqlite in the working directory
engine = create_engine("sqlite:///census.sqlite")

##### Connect to the Database

In [4]:
## next we have to actually connect the engine
## to the database
conn = engine.connect()

##### Execute a Statement Then Fetch

In [6]:
## Just like with sqlite3 we can
## use the connection to execute a query
## and fetch the rows of the data we want

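## Unlike sqlite3 we need to store the result of execute
## because the rows are fetched from the object it returns.
## Note: SQLAlchemy 1.4+ expects raw SQL to be wrapped in text(),
## and 2.0 no longer accepts a plain string here
from sqlalchemy import text
results_proxy = conn.execute(text("SELECT * FROM state_fact"))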

## here the column names are stored in the keys 
## of the results_proxy object
pd.DataFrame(results_proxy.fetchall(),columns = results_proxy.keys()).head()

Unnamed: 0,id,name,abbreviation,country,type,sort,status,occupied,notes,fips_state,assoc_press,standard_federal_region,census_region,census_region_name,census_division,census_division_name,circuit_court
0,13,Illinois,IL,USA,state,10,current,occupied,,17,Ill.,V,2,Midwest,3,East North Central,7
1,30,New Jersey,NJ,USA,state,10,current,occupied,,34,N.J.,II,1,Northeast,2,Mid-Atlantic,3
2,34,North Dakota,ND,USA,state,10,current,occupied,,38,N.D.,VIII,2,Midwest,4,West North Central,8
3,37,Oregon,OR,USA,state,10,current,occupied,,41,Ore.,X,4,West,9,Pacific,9
4,51,Washington DC,DC,USA,capitol,10,current,occupied,,11,,III,3,South,5,South Atlantic,D.C.


##### A Short Cut Using `pandas`

In [8]:
## pandas offers a nice shortcut called read_sql()
## we first input the query statement
## then the engine we want to run it on
pd.read_sql("SELECT * FROM state_fact", engine).head()

Unnamed: 0,id,name,abbreviation,country,type,sort,status,occupied,notes,fips_state,assoc_press,standard_federal_region,census_region,census_region_name,census_division,census_division_name,circuit_court
0,13,Illinois,IL,USA,state,10,current,occupied,,17,Ill.,V,2,Midwest,3,East North Central,7
1,30,New Jersey,NJ,USA,state,10,current,occupied,,34,N.J.,II,1,Northeast,2,Mid-Atlantic,3
2,34,North Dakota,ND,USA,state,10,current,occupied,,38,N.D.,VIII,2,Midwest,4,West North Central,8
3,37,Oregon,OR,USA,state,10,current,occupied,,41,Ore.,X,4,West,9,Pacific,9
4,51,Washington DC,DC,USA,capitol,10,current,occupied,,11,,III,3,South,5,South Atlantic,D.C.


In [9]:
## The short cut even works with statements that 
## subset the data further
pd.read_sql("SELECT name, abbreviation FROM state_fact WHERE census_region = 2", engine)

Unnamed: 0,name,abbreviation
0,Illinois,IL
1,North Dakota,ND
2,Wisconsin,WI
3,Kansas,KS
4,Nebraska,NE
5,Michigan,MI
6,Missouri,MO
7,Ohio,OH
8,Indiana,IN
9,South Dakota,SD
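
When the filter value comes from a variable, `pd.read_sql` can take it through its `params` argument instead of pasting it into the query string. The sketch below is only an illustration: it uses a plain `sqlite3` connection to a small in-memory table that stands in for `state_fact`, and the helper name `states_in_region` is hypothetical.

```python
import sqlite3

import pandas as pd

# Hypothetical in-memory stand-in for the state_fact table (three rows only)
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE state_fact (name TEXT, abbreviation TEXT, census_region INTEGER)"
)
conn.executemany(
    "INSERT INTO state_fact VALUES (?, ?, ?)",
    [("Illinois", "IL", 2), ("Oregon", "OR", 4), ("Ohio", "OH", 2)],
)


def states_in_region(connection, region):
    """Return name and abbreviation for every state in one census region."""
    # The ? placeholder is filled in by the database driver, not by string pasting
    query = (
        "SELECT name, abbreviation FROM state_fact "
        "WHERE census_region = ? ORDER BY name"
    )
    return pd.read_sql(query, connection, params=(region,))


midwest = states_in_region(conn, 2)
print(midwest)

conn.close()
```

The `?` placeholder style belongs to `sqlite3`; other drivers may use a different placeholder style.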


In [10]:
## When we're done we close the connection
conn.close()

## then dispose the engine
engine.dispose()
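
The connect, execute, and close steps can also be written with a `with` block, which closes the connection for us even if a query raises an error. This is a minimal sketch rather than part of the lesson's data: it builds a throwaway in-memory SQLite database with a made-up two-column `state_fact` table, so the table contents here are assumptions.

```python
from sqlalchemy import create_engine, text

# In-memory SQLite database, used only so the sketch runs without census.sqlite
engine = create_engine("sqlite:///:memory:")

# engine.begin() opens a connection and a transaction;
# it commits on success, rolls back on error, and closes the connection
with engine.begin() as conn:
    conn.execute(text("CREATE TABLE state_fact (name TEXT, census_region INTEGER)"))
    conn.execute(
        text("INSERT INTO state_fact (name, census_region) VALUES (:name, :region)"),
        [{"name": "Illinois", "region": 2}, {"name": "Oregon", "region": 4}],
    )
    # :region is a bound parameter, filled in from the dictionary
    result = conn.execute(
        text("SELECT name FROM state_fact WHERE census_region = :region"),
        {"region": 2},
    )
    midwest_names = [row[0] for row in result]

# The connection is already closed here; we still dispose of the engine
engine.dispose()

print(midwest_names)
```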

#### But I Already Knew How To Do This With `sqlite3`!

That might be true, but `sqlite3` only works with databases whose dialect is `SQLite`. That may be sufficient for personal projects, but companies often use other dialects like PostgreSQL, MySQL, or Oracle for various reasons. `sqlalchemy` supports these non-`SQLite` dialects, so it's good to have some familiarity with it.

## Practice Problems

You can use either `sqlite3` or `sqlalchemy` for these problems.

1. Copy and paste your cat_store.db into the Data Gathering Homework Folder
2. Create a new table that tracks purchases made by your customers
    - It should track what product was purchased, note the product should exist in the database already
    - It should track who made that purchase, note the customer should exist in the database already
    - It should have a unique purchase id
    - It should track how expensive the purchase was
3. Add some purchases to your purchases table
4. You can look at the chinook.db database's layout by looking at the sqlite-sample-database-diagram-color.pdf file. Answer the following:
    - Examine the tracks table. What is the most popular genre? The least popular?
    - Write a function that takes in an ArtistId and returns a list of their tracks.

In [11]:
## Your Code

## Sample Solution

# import sqlite3
import sqlite3

In [14]:
## Your Code

## 2. SAMPLE SOLUTION
## Make a connection to the database
conn = sqlite3.connect("cat_store.db")

## Create a cursor
c = conn.cursor()

# I will now create a purchases table
# Only run this once!
c.execute("""CREATE TABLE purchases(
                    purchase_id int,
                    customer_id int,
                    product_id int,
                    price real,
                    PRIMARY KEY (purchase_id),
                    FOREIGN KEY (customer_id) REFERENCES customers(customer_id),
                    FOREIGN KEY (product_id) REFERENCES products(product_id)
                )""")

conn.commit()

In [19]:
## Your Code

## 3. SAMPLE SOLUTION 

c.execute("INSERT INTO purchases VALUES (1,1,1,12.5)")

## commit the insert so it is saved before we close
conn.commit()

conn.close()
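
One thing worth knowing about the requirement that the product and customer "should exist in the database already": SQLite only enforces `FOREIGN KEY` constraints on connections where `PRAGMA foreign_keys = ON` has been run. The sketch below shows this on a throwaway in-memory database. The `customers` and `products` tables here are simplified stand-ins, so their `name` columns and rows are assumptions, not your actual `cat_store.db` layout.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Foreign keys are not enforced by default; turn enforcement on for this connection
conn.execute("PRAGMA foreign_keys = ON")

# Simplified stand-in tables (hypothetical columns)
conn.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE products (product_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    """CREATE TABLE purchases (
           purchase_id INTEGER PRIMARY KEY,
           customer_id INTEGER REFERENCES customers(customer_id),
           product_id INTEGER REFERENCES products(product_id),
           price REAL
       )"""
)

conn.execute("INSERT INTO customers VALUES (1, 'Ada')")
conn.execute("INSERT INTO products VALUES (1, 'Cat Tree')")

# A valid purchase: customer 1 and product 1 both exist
conn.execute("INSERT INTO purchases VALUES (?, ?, ?, ?)", (1, 1, 1, 12.5))
conn.commit()

# An invalid purchase: customer 99 does not exist, so SQLite rejects the row
try:
    conn.execute("INSERT INTO purchases VALUES (?, ?, ?, ?)", (2, 99, 1, 5.0))
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

purchase_count = conn.execute("SELECT COUNT(*) FROM purchases").fetchone()[0]
print(rejected, purchase_count)

conn.close()
```

Without the `PRAGMA` line the second insert would be accepted, leaving a purchase that points at a customer who is not in the database.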

In [25]:
## Your Code

## 4. SAMPLE SOLUTION

## Make a connection to the database
conn = sqlite3.connect("chinook.db")

## Create a cursor
c = conn.cursor()

## Turn the tracks table into a dataframe
c.execute("SELECT * FROM tracks")

tracks_df = pd.DataFrame(c.fetchall(), columns = [x[0] for x in c.description])

## turn the genres table into a dataframe
c.execute("SELECT * FROM genres")

genres_df = pd.DataFrame(c.fetchall(), columns = [x[0] for x in c.description])

## replace the genreid in the tracks table with the genre string
## by mapping each GenreId to its Name in the genres table
tracks_df['GenreId'] = tracks_df.GenreId.map(genres_df.set_index('GenreId').Name)
    
## This function gets the album_ids for our artist
def get_album_ids(artist_id):
    ## use a ? placeholder rather than pasting the id into the query string
    c.execute("SELECT AlbumId FROM albums WHERE ArtistId = ?", (artist_id,))
    return [album_id[0] for album_id in c.fetchall()]

## This function gets the list of tracks
def get_track_list(artist_id):
    album_ids = get_album_ids(artist_id)
    
    c.execute("SELECT * FROM tracks")
    tracks_df = pd.DataFrame(c.fetchall(), columns = [x[0] for x in c.description])
    
    return list(tracks_df.loc[tracks_df.AlbumId.isin(album_ids),'Name'].values)

In [26]:
# value_counts
tracks_df.GenreId.value_counts()

Rock                  1297
Latin                  579
Metal                  374
Alternative & Punk     332
Jazz                   130
TV Shows                93
Blues                   81
Classical               74
Drama                   64
R&B/Soul                61
Reggae                  58
Pop                     48
Soundtrack              43
Alternative             40
Hip Hop/Rap             35
Electronica/Dance       30
Heavy Metal             28
World                   28
Sci Fi & Fantasy        26
Easy Listening          24
Comedy                  17
Bossa Nova              15
Science Fiction         13
Rock And Roll           12
Opera                    1
Name: GenreId, dtype: int64
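
The same counts can be computed inside the database with a `JOIN` and `GROUP BY`, so that only the summary comes back to Python. This is a sketch on a tiny in-memory stand-in for the `tracks` and `genres` tables. Only the column names used in the solution above are kept, and the rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Tiny stand-in tables with made-up rows
conn.execute("CREATE TABLE genres (GenreId INTEGER PRIMARY KEY, Name TEXT)")
conn.execute(
    "CREATE TABLE tracks (TrackId INTEGER PRIMARY KEY, Name TEXT, GenreId INTEGER)"
)
conn.executemany("INSERT INTO genres VALUES (?, ?)", [(1, "Rock"), (2, "Jazz")])
conn.executemany(
    "INSERT INTO tracks VALUES (?, ?, ?)",
    [(1, "Track A", 1), (2, "Track B", 1), (3, "Track C", 2)],
)

# Count the tracks per genre name, most popular first
genre_counts = conn.execute(
    """
    SELECT g.Name, COUNT(*) AS n_tracks
    FROM tracks AS t
    JOIN genres AS g ON t.GenreId = g.GenreId
    GROUP BY g.Name
    ORDER BY n_tracks DESC
    """
).fetchall()

print(genre_counts)

conn.close()
```

The first row of the result is the most popular genre and the last row is the least popular.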

In [27]:
get_track_list(10)

['Quadrant',
 "Snoopy's search-Red baron",
 'Spanish moss-"A sound portrait"-Spanish moss',
 'Moon germs',
 'Stratus',
 'The pleasant pheasant',
 'Solo-Panhandler',
 'Do what cha wanna']

In [28]:
get_track_list(100)

['Are You Gonna Go My Way',
 'Fly Away',
 'Rock And Roll Is Dead',
 'Again',
 "It Ain't Over 'Til It's Over",
 "Can't Get You Off My Mind",
 'Mr. Cab Driver',
 'American Woman',
 'Stand By My Woman',
 'Always On The Run',
 'Heaven Help',
 'I Belong To You',
 'Believe',
 'Let Love Rule',
 'Black Velveteen',
 'Johnny B. Goode',
 "Don't Look Back",
 'Jah Seh No',
 "I'm The Toughest",
 'Nothing But Love',
 'Buk-In-Hamm Palace',
 'Bush Doctor',
 'Wanted Dread And Alive',
 'Mystic Man',
 'Coming In Hot',
 'Pick Myself Up',
 'Crystal Ball',
 'Equal Rights Downpresser Man',
 'Holding Back The Years',
 "Money's Too Tight To Mention",
 'The Right Thing',
 "It's Only Love",
 'A New Flame',
 "You've Got It",
 "If You Don't Know Me By Now",
 'Stars',
 'Something Got Me Started',
 'Thrill Me',
 'Your Mirror',
 'For Your Babies',
 'So Beautiful',
 'Angel',
 'Fairground',
 'Still Of The Night',
 'Here I Go Again',
 'Is This Love',
 "Love Ain't No Stranger",
 'Looking For Love',
 "Now You're Gone",
 'S

In [29]:
conn.close()
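
As an alternative to the two helper functions in the sample solution, the album lookup and the track lookup can be combined into a single `JOIN` query. The sketch below runs against a small in-memory stand-in for the `albums` and `tracks` tables with made-up rows. The function name `get_track_list_sql` is hypothetical, and ordering by `TrackId` is an assumption about how you might want the list sorted.

```python
import sqlite3


def get_track_list_sql(connection, artist_id):
    """Return the track names for one artist using a single JOIN query."""
    rows = connection.execute(
        """
        SELECT t.Name
        FROM tracks AS t
        JOIN albums AS a ON t.AlbumId = a.AlbumId
        WHERE a.ArtistId = ?
        ORDER BY t.TrackId
        """,
        (artist_id,),
    ).fetchall()
    # Each row is a 1-tuple, so unpack the name
    return [name for (name,) in rows]


# Small stand-in database with made-up rows
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE albums (AlbumId INTEGER PRIMARY KEY, ArtistId INTEGER)")
conn.execute(
    "CREATE TABLE tracks (TrackId INTEGER PRIMARY KEY, Name TEXT, AlbumId INTEGER)"
)
conn.executemany("INSERT INTO albums VALUES (?, ?)", [(1, 10), (2, 20)])
conn.executemany(
    "INSERT INTO tracks VALUES (?, ?, ?)",
    [(1, "Quadrant", 1), (2, "Stratus", 1), (3, "Some Other Track", 2)],
)

tracks_for_10 = get_track_list_sql(conn, 10)
print(tracks_for_10)

conn.close()
```

This avoids loading the whole `tracks` table into a `DataFrame` on every call.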

This notebook was written for the Erdős Institute Cőde Data Science Boot Camp by Matthew Osborne, Ph.D., 2021.

Redistribution of the material contained in this repository is conditional on acknowledgement of Matthew Tyler Osborne, Ph.D.'s original authorship and sponsorship of the Erdős Institute as subject to the license (see License.md).