# HW4 - SQL

This homework has you working with a new database of information on ticket sales for various types of events.  Your job will be to do some initial exploring and then demonstrate your ability to do all the different types of SQL queries we learned over the last week.  You'll also need to make one function that'll make looking at the tables easier.

These questions are written the way someone would actually ask them.  In other words, I'm using plain-English questions rather than spelling out exactly which columns and tables to use.  Your exploration of the database, and the functions you write to ease that process, will come in handy here!  

The database has been created using a set of data from Amazon. You can read more about what each table contains here: https://docs.aws.amazon.com/redshift/latest/dg/c_sampledb.html.  

**Submission Instruction**

1- Replace the blank with your name (e.g. DE_HW4-Sara_Riazi)

2- Run your notebook (all the outputs must be visible).

3- Download .ipynb  

4- Submit on Gradescope

## Libraries and import functions

In [None]:
!pip install mysql-connector-python

Collecting mysql-connector-python
  Downloading mysql_connector_python-9.4.0-cp312-cp312-manylinux_2_28_x86_64.whl.metadata (7.5 kB)
Downloading mysql_connector_python-9.4.0-cp312-cp312-manylinux_2_28_x86_64.whl (33.9 MB)
   33.9/33.9 MB 47.8 MB/s eta 0:00:00
Installing collected packages: mysql-connector-python
Successfully installed mysql-connector-python-9.4.0


First, import the libraries we'll need!

In [2]:
import mysql.connector
import pandas as pd

Now bring in the `get_conn_cur` and `run_query` functions, as well as the connection information, as we used in the SQL practice notebook (Lab 3)!  

In [3]:
#connection
mysql_address  = '131.193.32.85'
mysql_username='de_student'
mysql_password='DE_Student_PaSS'


#We are going to use a single database for all schemas in this course.
#To avoid confusion, we use databasename_tablename naming convention.
mysql_database = 'my_dataengineering_dbs'
def get_conn_cur():
    cnx = mysql.connector.connect(user=mysql_username, password=mysql_password,
                                  host=mysql_address,
                                  database=mysql_database, port=3306)
    return (cnx, cnx.cursor())

In [4]:
#run_query
def run_query(query_string):

  conn, cur = get_conn_cur() # get connection and cursor

  cur.execute(query_string) # executing string as before

  my_data = cur.fetchall() # fetch query data as before

  result_df = pd.DataFrame(my_data, columns=cur.column_names)


  cur.close() # close
  conn.close() # close

  return result_df
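
If `cur.execute` raises an error, the two `close()` calls in `run_query` are never reached. The following is a minimal sketch of the same flow with `try`/`finally`, so that cleanup always happens. It assumes an in-memory SQLite database as a stand-in for the course MySQL server so that it runs anywhere; `run_query_safely` and `conn_factory` are hypothetical names, not part of the lab code.

```python
import sqlite3

def run_query_safely(conn_factory, query_string):
    """Run a query and always close the cursor and connection, even on error."""
    conn = conn_factory()
    try:
        cur = conn.cursor()
        try:
            cur.execute(query_string)
            rows = cur.fetchall()
            # DB-API cursors expose column names through cursor.description.
            columns = [d[0] for d in cur.description]
        finally:
            cur.close()
    finally:
        conn.close()
    return columns, rows

# Stand-in for get_conn_cur(): an in-memory SQLite database (assumption).
columns, rows = run_query_safely(lambda: sqlite3.connect(":memory:"),
                                 "SELECT 1 AS one, 'a' AS letter")
print(columns, rows)
```

With the course database, the same structure could wrap the existing `get_conn_cur()` call and build the DataFrame from `rows` and `columns`.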

## Make a SQL head function - 5 points

Make a function that gives the pandas equivalent of `.head()`.

This function should be called `sql_head` and take a single argument, `table_name`, which specifies the table you want the head information from.  It should return the column names along with the first five rows of the table.  

**For full points, return a pandas dataframe with this information so it displays nicely :)**

In [5]:
# make sql_head function
def sql_head(table_name: str):
    return run_query(f"SELECT * FROM {table_name} LIMIT 5")


In [6]:
# Check that it works!
sql_head(table_name = 'ticketsdb_sales')

Unnamed: 0,salesid,listid,sellerid,buyerid,eventid,dateid,qtysold,pricepaid,commission,saletime
0,1,1,36861,21191,7872,1875,4,728.0,109.2,2008-02-17 20:36:48
1,2,4,8117,11498,4337,1983,2,76.0,11.4,2008-06-06 00:00:16
2,3,5,1616,17433,8647,1983,2,350.0,52.5,2008-06-06 03:26:17
3,4,5,1616,19715,8647,1986,1,175.0,26.25,2008-06-09 03:38:52
4,5,6,47402,14115,8240,2069,2,154.0,23.1,2008-08-31 04:17:02
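
Table names cannot be passed as query parameters, so `sql_head` has to build the SQL string itself. One cautious option is to check the name before interpolating it. This is a minimal sketch, assuming table names contain only letters, digits, and underscores; `safe_table_name` and `head_query` are hypothetical helpers, not part of the lab code.

```python
import re

# Hypothetical rule: accept only plain identifiers such as 'ticketsdb_sales'.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

def safe_table_name(table_name: str) -> str:
    """Return table_name unchanged if it is a plain identifier, else raise."""
    if not _IDENTIFIER.fullmatch(table_name):
        raise ValueError(f"unexpected table name: {table_name!r}")
    return table_name

def head_query(table_name: str, n: int = 5) -> str:
    """Build the SELECT statement that a sql_head-style function would run."""
    return f"SELECT * FROM {safe_table_name(table_name)} LIMIT {int(n)}"

print(head_query("ticketsdb_sales"))
```

The resulting string could then be handed to `run_query` exactly as `sql_head` does now.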


## Explore and SELECT - 5 points

Let's start this homework with some basic queries to get a look at what's in the various tables. Remember that we are using one Database for all schemas in this course. So running "show tables" will list all tables from previous schemas too.
* Use run_query first to run "show tables".
* Look at the column name; we only want the tables that start with 'ticketsdb', which is the schema for this notebook.
* Run the "show tables where Tables_in_my_dataengineering_dbs like 'ticketsdb_%' " query using run_query to see all tables for the ticketsdb schema.
* Now use the `sql_head()` function you created to get the first five rows of all tables in the ticketsdb schema.

In [None]:
tables = run_query("show tables WHERE Tables_in_my_dataengineering_dbs like 'ticketsdb_%'")
for table in tables['Tables_in_my_dataengineering_dbs']:
  head = sql_head(table_name = table)
  print(head)

   catid catgroup catname                            catdesc
0      1   Sports     MLB            Major League Baseball\n
1      2   Sports     NHL           National Hockey League\n
2      3   Sports     NFL         National Football League\n
3      4   Sports     NBA  National Basketball Association\n
4      5   Sports     MLS              Major League Soccer\n
   dateid     caldate day  week month qtr  year  holiday
0    1827  2008-01-01  WE     1   JAN   1  2008        0
1    1828  2008-01-02  TH     1   JAN   1  2008        0
2    1829  2008-01-03  FR     1   JAN   1  2008        0
3    1830  2008-01-04  SA     2   JAN   1  2008        0
4    1831  2008-01-05  SU     2   JAN   1  2008        0
   eventid  venueid  catid  dateid                    eventname  \
0        1      305      8    1851              Gotterdammerung   
1        2      306      8    2114                Boris Godunov   
2        3      302      8    1935                       Salome   
3        4      309     

## WHERE - 5 points

Now let's do a bit of filtering with WHERE.  Write and run queries to get the following results.  
**LIMIT all results to the first five rows.**

* Get venues with >= 10000 seats from the venues table
* Get venues in Arizona
* Get users who have a first name that starts with H
* Get **just email addresses** of users who gave a .edu email address
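
The pattern filters in the list above rely on `LIKE` with the `%` wildcard, combined with `LIMIT` to cap the result. Here is a small sketch of that combination, run against an in-memory SQLite table so it works without the course server. The `users` table and its rows are made-up toy data, and the `ORDER BY` is added only to make the toy output deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (userid INTEGER, firstname TEXT, email TEXT)")
# Toy rows, invented for illustration only.
conn.executemany("INSERT INTO users VALUES (?, ?, ?)", [
    (1, "Hana", "hana@example.edu"),
    (2, "Omar", "omar@example.com"),
    (3, "Hugo", "hugo@example.org"),
    (4, "Ivy", "ivy@education.example.com"),
])

# 'H%' matches values that start with H; '%' stands for any run of characters.
h_names = conn.execute(
    "SELECT firstname FROM users WHERE firstname LIKE 'H%' ORDER BY userid LIMIT 5"
).fetchall()

# '%.edu' matches values that end with .edu, and only the email column is selected.
edu = conn.execute(
    "SELECT email FROM users WHERE email LIKE '%.edu' ORDER BY userid LIMIT 5"
).fetchall()
conn.close()

print(h_names)
print(edu)
```

Note that the address containing "education" is not matched, because the pattern anchors `.edu` at the end of the string.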




In [9]:
# Get big venues... so those with >= 10000 seats
query = run_query("SELECT * FROM ticketsdb_venue WHERE venueseats >= 10000 LIMIT 5")
query

Unnamed: 0,venueid,venuename,venuecity,venuestate,venueseats
0,5,Gillette Stadium,Foxborough,MA,68756
1,6,New York Giants Stadium,East Rutherford,NJ,80242
2,15,McAfee Coliseum,Oakland,CA,63026
3,18,Madison Square Garden,New York City,NY,20000
4,67,Ralph Wilson Stadium,Orchard Park,NY,73967


In [12]:
# Get venues in AZ
query = run_query("SELECT * FROM ticketsdb_venue WHERE venuestate = 'AZ' LIMIT 5")
query

Unnamed: 0,venueid,venuename,venuecity,venuestate,venueseats
0,38,US Airways Center,Phoenix,AZ,0
1,65,Jobing.com Arena,Glendale,AZ,0
2,92,University of Phoenix Stadium,Glendale,AZ,0
3,117,Chase Field,Phoenix,AZ,0


In [14]:
# Get users who have a first name that starts with H
query = run_query("SELECT * FROM ticketsdb_users WHERE firstname LIKE 'H%' LIMIT 5")
query

Unnamed: 0,userid,username,firstname,lastname,city,state,email,phone,likesports,liketheatre,likeconcerts,likejazz,likeclassical,likeopera,likerock,likevegas,likebroadway,likemusicals
0,13,QTF33MCG,Henry,Cochran,Bossier City,QC,Aliquam.vulputate.ullamcorper@amalesuada.org,(783) 105-0989,0,0,0,0,0,0,0,0,0,0
1,22,RHT62AGI,Hermione,Trevino,Walnut,WI,non.justo.Proin@ametconsectetuer.edu,(245) 110-6540,0,0,0,0,0,0,0,0,0,0
2,29,HUH27PKK,Helen,Avery,Garland,PE,in.faucibus.orci@ultrices.edu,(385) 925-3875,0,0,0,0,0,0,0,0,0,0
3,56,MHU11LZP,Howard,Wiley,Oklahoma City,NU,accumsan@vulputateullamcorper.ca,(277) 315-5682,0,0,0,0,0,0,0,0,0,0
4,67,TWU10MZT,Herman,Myers,Basin,PE,Mauris@neque.com,(471) 895-6189,0,0,0,0,0,0,0,0,0,0


In [18]:
# Get all .edu email addresses... just the email addresses
query = run_query("SELECT email FROM ticketsdb_users WHERE email LIKE '%.edu' LIMIT 5")
query

Unnamed: 0,email
0,Etiam.laoreet.libero@sodalesMaurisblandit.edu
1,Suspendisse.tristique@nonnisiAenean.edu
2,ullamcorper.nisl@Cras.edu
3,vel.est@velitegestas.edu
4,justo.nec.ante@quismassa.edu


## GROUP BY and HAVING - 5 points

Time to practice some GROUP BY and HAVING operations. Please write and run queries that do the following:

GROUP BY application
* Find the top five venues that hosted the most events: Alias the count of events as 'events_hosted'. Also return the venue ID
* Get the number of events hosted in each month. You'll need to use `month()` in your select to select just the months. Alias this as 'month' and then the count of the number of events hosted as 'events_hosted'.
* Get the top five sellers who made the most commission. Alias their total commission made as 'total_com'. Also get their average commission made and alias as 'avg_com'.  Be sure to also display the seller_id.  

HAVING application
* Using the same query as the last one, instead of getting the top five sellers get all sellers who have made a total commission greater than $4000.
* Using the same query as the first groupby, instead of returning the top five venues, return just the ID's of venues that have had greater than 60 events.
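
A "top N" question combines `GROUP BY` with `ORDER BY ... DESC` and `LIMIT`, while `HAVING` filters on the aggregated value after grouping. The sketch below shows both patterns on an in-memory SQLite table so it runs without the course server. The `sales` table and its rows are made-up toy data, and the thresholds are chosen only to fit that toy data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sellerid INTEGER, commission REAL)")
# Toy rows, invented for illustration only.
conn.executemany("INSERT INTO sales VALUES (?, ?)", [
    (1, 10.0), (1, 30.0),
    (2, 5.0),
    (3, 50.0), (3, 25.0), (3, 15.0),
])

# Top two sellers by total commission: aggregate, sort descending, then cap.
top_two = conn.execute("""
    SELECT sellerid, SUM(commission) AS total_com, AVG(commission) AS avg_com
    FROM sales
    GROUP BY sellerid
    ORDER BY total_com DESC
    LIMIT 2
""").fetchall()

# HAVING filters groups on the aggregate; WHERE could not see SUM(commission).
big = conn.execute("""
    SELECT sellerid, SUM(commission) AS total_com
    FROM sales
    GROUP BY sellerid
    HAVING SUM(commission) > 35
    ORDER BY sellerid
""").fetchall()
conn.close()

print(top_two)
print(big)
```

Without the `ORDER BY ... DESC`, a `LIMIT` would return an arbitrary five groups rather than the five largest.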

In [23]:
### GROUP BY application
# Find the top five venues that hosted the most events: Alias the count of events as 'events_hosted'. Also return the venue ID
query = run_query("SELECT venueid, COUNT(venueid) as events_hosted from ticketsdb_event GROUP by venueid")
query

Unnamed: 0,venueid,events_hosted
0,1,49
1,2,39
2,3,35
3,4,28
4,5,32
...,...,...
199,305,54
200,306,56
201,307,49
202,308,53


In [26]:
# Get the number of events hosted in each month. You'll need to use `month()` in your select to select just the months.
# Alias this as 'month' and then the count of the number of events hosted as 'events_hosted'
query = run_query("SELECT month(starttime) as month, COUNT(month(starttime)) as events_hosted from ticketsdb_event GROUP by month(starttime)")
query

Unnamed: 0,month,events_hosted
0,1,778
1,2,711
2,3,753
3,4,725
4,5,727
5,6,709
6,7,729
7,8,737
8,9,746
9,10,735


In [30]:
# Get the top five sellers who made the most commission. Alias their total commission made as 'total_com'.
# Also get their average commission made and alias as 'avg_com'. Be sure to also display the seller_id
query = run_query("SELECT sellerid, AVG(commission) as avg_com, SUM(commission) as total_com from ticketsdb_sales GROUP BY sellerid ORDER BY total_com DESC LIMIT 5")
query

Unnamed: 0,sellerid,avg_com,total_com
0,1140,347.132143,4859.85
1,43551,470.475,4704.75
2,13385,388.568182,4274.25
3,25433,518.49375,4147.95
4,2372,678.975,4073.85


In [31]:
### HAVING application
# Using the same query as the last groupby, instead of getting the top five sellers get all sellers who have made a total commission greater than $4000
query = run_query("SELECT sellerid, AVG(commission) as avg_com, SUM(commission) as total_com from ticketsdb_sales GROUP BY sellerid HAVING total_com > 4000")
query

Unnamed: 0,sellerid,avg_com,total_com
0,1140,347.132143,4859.85
1,2372,678.975,4073.85
2,13385,388.568182,4274.25
3,25433,518.49375,4147.95
4,43551,470.475,4704.75


In [33]:
# Using the same query as the first groupby, instead of returning the top five venues, return just the ID's of venues that have had greater than 60 events
query = run_query("SELECT venueid, COUNT(venueid) as events_hosted from ticketsdb_event GROUP by venueid HAVING events_hosted > 60")
query

Unnamed: 0,venueid,events_hosted
0,201,62
1,203,80
2,205,70
3,207,67
4,208,69
5,209,66
6,215,62
7,216,72
8,217,81
9,218,70


## JOIN - 5 points

Time for some joins. You've probably noticed by now that there is at least one relational key in each table, and some tables have more.  For example, sales has a unique sale id plus a listing id, seller id, buyer id, event id, and date id.  These keys let you link each sale to relevant information in other tables.  

Please write queries to do the following items:

* Join information of users to each sale made (using seller id).  
* Join information about each venue to each event.
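
The choice between `INNER JOIN` and `LEFT JOIN` matters only when some rows on the left have no match on the right. The sketch below illustrates the difference on in-memory SQLite tables so it runs without the course server. The `venue` and `event` tables and their rows are made-up toy data, including one event that deliberately points at a missing venue.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE venue (venueid INTEGER, venuename TEXT)")
conn.execute("CREATE TABLE event (eventid INTEGER, venueid INTEGER, eventname TEXT)")
# Toy rows, invented for illustration only; event 12 has no matching venue.
conn.executemany("INSERT INTO venue VALUES (?, ?)", [(1, "Toy Hall"), (2, "Toy Stadium")])
conn.executemany("INSERT INTO event VALUES (?, ?, ?)", [
    (10, 1, "Recital"), (11, 2, "Match"), (12, 99, "Orphan Show"),
])

# INNER JOIN keeps only events whose venueid exists in venue.
inner_rows = conn.execute("""
    SELECT e.eventid, e.eventname, v.venuename
    FROM event e
    INNER JOIN venue v ON e.venueid = v.venueid
    ORDER BY e.eventid
""").fetchall()

# LEFT JOIN keeps every event and fills missing venue columns with NULL (None).
left_rows = conn.execute("""
    SELECT e.eventid, e.eventname, v.venuename
    FROM event e
    LEFT JOIN venue v ON e.venueid = v.venueid
    ORDER BY e.eventid
""").fetchall()
conn.close()

print(inner_rows)
print(left_rows)
```

Listing columns explicitly, as done here, also avoids getting the join key twice in the result, which is what happens with `SELECT *`.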

In [None]:
# Join users information to each sale using seller id (correct solution has 172456 rows)
query = run_query("""
    SELECT *
    FROM ticketsdb_sales s
    LEFT JOIN ticketsdb_users u ON s.sellerid = u.userid
""")
query
# sql_head(table_name = "ticketsdb_users")


Unnamed: 0,salesid,listid,sellerid,buyerid,eventid,dateid,qtysold,pricepaid,commission,saletime,...,likesports,liketheatre,likeconcerts,likejazz,likeclassical,likeopera,likerock,likevegas,likebroadway,likemusicals
0,5657,6111,35,80,2041,2167,4,984.00,147.60,2008-12-06 21:08:55,...,0,0,0,0,0,0,0,0,0,0
1,2439,2614,38,14643,4999,1894,4,836.00,125.40,2008-03-09 00:17:14,...,0,0,0,0,0,0,0,0,0,0
2,2440,2614,38,38705,4999,1923,3,627.00,94.05,2008-04-07 01:18:55,...,0,0,0,0,0,0,0,0,0,0
3,2441,2614,38,34014,4999,1917,2,418.00,62.70,2008-04-01 01:22:03,...,0,0,0,0,0,0,0,0,0,0
4,2442,2614,38,21305,4999,1902,1,209.00,31.35,2008-03-17 01:22:39,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
172451,171017,232760,49950,8832,1132,1913,4,8756.00,1313.40,2008-03-27 20:33:35,...,0,0,0,0,0,0,0,0,0,0
172452,171018,232760,49950,9703,1132,1914,2,4378.00,656.70,2008-03-28 20:33:38,...,0,0,0,0,0,0,0,0,0,0
172453,168027,227012,49961,14319,8247,2052,4,9764.00,1464.60,2008-08-14 01:14:19,...,0,0,0,0,0,0,0,0,0,0
172454,169108,229107,49966,16280,3296,1933,1,1482.00,222.30,2008-04-16 23:54:04,...,0,0,0,0,0,0,0,0,0,0


In [59]:
# For each event attach the venue information (correct solution has 8659 rows)
query = run_query("""
    SELECT *
    FROM ticketsdb_event e
    INNER JOIN ticketsdb_venue v ON e.venueid = v.venueid
""")
query

Unnamed: 0,eventid,venueid,catid,dateid,eventname,starttime,venueid.1,venuename,venuecity,venuestate,venueseats
0,1,305,8,1851,Gotterdammerung,2008-01-25 08:30:00,305,Lyric Opera House,Chicago,IL,0
1,2,306,8,2114,Boris Godunov,2008-10-15 15:00:00,306,Lyric Opera House,Baltimore,MD,0
2,3,302,8,1935,Salome,2008-04-19 09:30:00,302,Detroit Opera House,Detroit,MI,0
3,4,309,8,2090,La Cenerentola (Cinderella),2008-09-21 09:30:00,309,Los Angeles Opera,Los Angeles,CA,0
4,5,302,8,1982,Il Trovatore,2008-06-05 14:00:00,302,Detroit Opera House,Detroit,MI,0
...,...,...,...,...,...,...,...,...,...,...,...
8654,8793,79,9,1861,The Rowdy Frynds Tour,2008-02-04 13:00:00,79,Arrowhead Stadium,Kansas City,MO,79451
8655,8794,37,9,1938,Greg Kihn Band,2008-04-22 09:00:00,37,Staples Center,Los Angeles,CA,0
8656,8796,28,9,1947,John Mayer,2008-05-01 09:00:00,28,American Airlines Arena,Miami,FL,0
8657,8797,96,9,2082,Keith Urban,2008-09-13 09:00:00,96,Oriole Park at Camden Yards,Baltimore,MD,48876


## Subqueries - 5 points

To wrap up let's do several subqueries. Please do the following:

* Get all purchases made by users who live in Arizona
* Get event information for all events that took place in a venue where the venue name ends with 'Stadium'.
* Get event information for all events where the total ticket sales were greater than $50,000.  
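
A subquery in a `WHERE ... IN (...)` clause first builds a list of keys, and the outer query then keeps only the rows whose key is in that list. The sketch below shows this with an aggregated subquery, using in-memory SQLite tables so it runs without the course server. The `event` and `sales` tables, their rows, and the 500 threshold are made-up toy values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE event (eventid INTEGER, eventname TEXT)")
conn.execute("CREATE TABLE sales (eventid INTEGER, pricepaid REAL)")
# Toy rows, invented for illustration only.
conn.executemany("INSERT INTO event VALUES (?, ?)", [
    (1, "Opera Night"), (2, "Rock Show"), (3, "Jazz Evening"),
])
conn.executemany("INSERT INTO sales VALUES (?, ?)", [
    (1, 300.0), (1, 250.0),   # event 1 totals 550
    (2, 100.0),               # event 2 totals 100
    (3, 400.0), (3, 50.0),    # event 3 totals 450
])

# Inner query: event ids whose summed sales exceed the threshold.
# Outer query: look up the names of exactly those events.
big_events = conn.execute("""
    SELECT eventname
    FROM event
    WHERE eventid IN (
        SELECT eventid
        FROM sales
        GROUP BY eventid
        HAVING SUM(pricepaid) > 500
    )
""").fetchall()
conn.close()

print(big_events)
```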

In [61]:
# Get all purchases from users who live in Arizona (correct solution has 1855 rows)
query = run_query("SELECT * FROM ticketsdb_sales s WHERE s.buyerid IN ( select userid FROM ticketsdb_users u WHERE u.state = 'AZ' )")
query

Unnamed: 0,salesid,listid,sellerid,buyerid,eventid,dateid,qtysold,pricepaid,commission,saletime
0,43,47,49346,33489,8577,2141,2,378.00,56.70,2008-11-11 03:51:06
1,79,101,37592,7079,3340,1878,1,36.00,5.40,2008-02-21 04:32:10
2,81,103,26314,7079,15,2033,1,181.00,27.15,2008-07-26 06:04:13
3,83,106,12538,7079,250,1884,1,109.00,16.35,2008-02-27 05:58:35
4,154,162,27703,46451,2906,1907,3,426.00,63.90,2008-03-22 00:21:40
...,...,...,...,...,...,...,...,...,...,...
1850,171925,234507,9246,28626,6249,1902,2,3242.00,486.30,2008-03-16 21:07:33
1851,171966,234574,21923,24479,5355,1870,1,868.00,130.20,2008-02-12 20:09:14
1852,172101,234786,19290,22441,409,2046,1,2275.00,341.25,2008-08-07 21:14:11
1853,172272,235145,10079,31571,8271,2111,1,529.00,79.35,2008-10-11 21:19:47


In [63]:
# Get event information for all events that took place in a venue where the name ended in 'Stadium' (correct solution has 1029 rows)
query = run_query("SELECT * FROM ticketsdb_event e WHERE e.venueid IN ( select venueid FROM ticketsdb_venue v WHERE v.venuename LIKE '%Stadium')")
query

Unnamed: 0,eventid,venueid,catid,dateid,eventname,starttime
0,3803,2,9,2181,Dropkick Murphys,2008-12-21 08:00:00
1,3816,11,9,2139,Keb Mo,2008-11-09 13:00:00
2,3821,79,9,1885,Charlie Daniels Band,2008-02-28 13:30:00
3,3824,98,9,1885,Govt Mule,2008-02-28 08:00:00
4,3835,74,9,2073,LeAnn Rimes,2008-09-04 09:30:00
...,...,...,...,...,...,...
1024,8764,119,9,1933,Billy Idol,2008-04-17 10:00:00
1025,8773,92,9,1995,Motorhead,2008-06-18 14:30:00
1026,8780,129,9,1950,Eddie Money,2008-05-01 09:00:00
1027,8785,14,9,1895,Jimmy Buffett,2008-03-10 09:00:00


In [67]:
# Get event name where the total sales for that event were greater than $50000 (correct solution has three rows: Adriana Lecouvreur, Phantom of the Opera, and Janet Jackson)
# Note that we are looking for the event name!
query = run_query("""
    SELECT eventname
    FROM ticketsdb_event e
    WHERE e.eventid IN (
        SELECT eventid
        FROM ticketsdb_sales s
        GROUP BY eventid
        HAVING SUM(pricepaid) > 50000
    )
""")
query


Unnamed: 0,eventname
0,Adriana Lecouvreur
1,Phantom of the Opera
2,Janet Jackson
