# SQL Joins

---

## ✨ Joining Tables

Today, we will review basic SQL joins.

▶️ First, run the code cell below to import modules used for **🧭 Check Your Work** sections and the autograder.

In [1]:
import unittest
import base64
tc = unittest.TestCase()

---

### 🎯 Pre-exercise: Import Packages

#### 👇 Tasks

- ✔️ Import the following Python packages.
    1. `pandas`: Use alias `pd`.
    2. `numpy`: Use alias `np`.
    3. `sqlite3`: No alias.

In [2]:
# YOUR CODE BEGINS
import pandas as pd
import numpy as np
import sqlite3
# YOUR CODE ENDS

#### 🧭 Check your work

In [3]:
tc.assertTrue('pd' in globals(), 'Check whether you have correctly imported Pandas with an alias.')
tc.assertTrue('np' in globals(), 'Check whether you have correctly imported NumPy with an alias.')
tc.assertTrue('sqlite3' in globals(), 'Check whether you have correctly imported the sqlite3 package.')

---

### 📌 Transactions

Throughout this exercise, you'll work with the Starbucks Customer Rewards Program dataset. ☕ The dataset was originally released as part of a capstone project for a Udacity course.

For this course, the original JSON dataset has been converted to CSV files for convenience. You can download the original JSON files [here](https://www.kaggle.com/blacktile/starbucks-app-customer-reward-program-data?select=portfolio.json).

The table below describes the columns in the `transactions` table.

| Field | Description |
|---|---|
| member_id | Member ID |
| event | Type of event ('offer received', 'offer completed', 'transaction') |
| time | Time as continuous value |
| offer_id | Offer ID |
| amount | Amount of purchase or NaN if non-purchase |

---

### 📌 Customer profiles

We want to go further with the purchase data analysis by looking at the customers. Unfortunately, the information we currently have about the customers in each purchase is limited. We only have their unique IDs (e.g., `02c083884c7d45b39cc68e1314fec56c`, `9fa9ae8f57894cc9a3b8a9bbe0fc1b2f`).

In the next few steps, you'll read the customer profiles data and link it to the `transactions` table using `member_id`.

The table below describes the columns in the `profiles` table.

| Field | Description |
|---|---|
| member_id | Unique identifier for each Starbucks app member |
| gender | Gender (NaN if unknown) |
| age | Age (NaN if unknown) |
| became_member_on | Sign-up date |
| income | Annual income (NaN if unknown) |
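
As a minimal sketch of how `member_id` links the two tables, the example below builds a throwaway in-memory SQLite database. The rows, the short IDs (`m1`, `m2`), and the name `conn_demo` are made up for illustration and are not taken from the dataset.

```python
import sqlite3

# Throwaway in-memory database; all rows below are hypothetical.
conn_demo = sqlite3.connect(':memory:')
conn_demo.execute('CREATE TABLE transactions (member_id TEXT, event TEXT, amount REAL)')
conn_demo.execute('CREATE TABLE profiles (member_id TEXT, gender TEXT)')
conn_demo.executemany('INSERT INTO transactions VALUES (?, ?, ?)', [
    ('m1', 'transaction', 5.0),
    ('m2', 'transaction', 7.5),
    ('m1', 'offer received', None),  # non-purchase event, so amount is NULL
])
conn_demo.executemany('INSERT INTO profiles VALUES (?, ?)', [('m1', 'F'), ('m2', None)])

# Each transaction row is paired with the profile row that shares its member_id.
demo_rows = conn_demo.execute('''
SELECT transactions.member_id, event, amount, gender
FROM transactions
INNER JOIN profiles
ON transactions.member_id = profiles.member_id
ORDER BY transactions.rowid;
''').fetchall()
conn_demo.close()

for row in demo_rows:
    print(row)
```

Both `m1` rows pick up the same profile values, which is why one profile can appear many times in a joined result.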

---

### 📌 Populate database tables from CSV files

▶️ Run the code below to populate the `transactions` and `profiles` tables.

In [4]:
df_transactions = pd.read_csv('https://github.com/bdi475/datasets/raw/main/starbucks-rewards/transcript.v2.csv.gz')
df_profiles = pd.read_csv('https://github.com/bdi475/datasets/raw/main/starbucks-rewards/profile.csv')

conn = sqlite3.connect('starbucks.db')
c = conn.cursor()

tables = list(pd.read_sql_query("SELECT * FROM sqlite_master WHERE type='table';", con=conn)['tbl_name'])
tables_to_populate = ['transactions', 'profiles']

for t in tables_to_populate:
    if t in tables:
        c.execute(f'DELETE FROM {t}')
        conn.commit()

df_transactions.to_sql(name='transactions', index=False, con=conn, if_exists='append')
df_profiles.to_sql(name='profiles', index=False, con=conn, if_exists='append')

conn.close()
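
The cell above empties each table before appending, so running it more than once does not duplicate rows. The sketch below isolates that pattern with the standard `sqlite3` module only. The helper `repopulate`, the toy rows, and the two-column schema are assumptions made for illustration.

```python
import sqlite3

def repopulate(conn, table, rows):
    """Empty `table` if it already exists, then insert `rows` (hypothetical helper)."""
    existing = [r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")]
    if table in existing:
        conn.execute(f'DELETE FROM {table}')
    else:
        # Toy schema for this sketch only.
        conn.execute(f'CREATE TABLE {table} (member_id TEXT, gender TEXT)')
    conn.executemany(f'INSERT INTO {table} VALUES (?, ?)', rows)
    conn.commit()

conn_demo = sqlite3.connect(':memory:')
toy_rows = [('m1', 'F'), ('m2', 'M')]

# Calling the helper twice mimics re-running the notebook cell.
repopulate(conn_demo, 'profiles', toy_rows)
repopulate(conn_demo, 'profiles', toy_rows)

row_count = conn_demo.execute('SELECT COUNT(*) FROM profiles').fetchone()[0]
conn_demo.close()
print(row_count)
```

The table still holds two rows after the second call. With pandas, passing `if_exists='replace'` to `to_sql` is another way to get a similar effect, since it drops and recreates the table instead of appending.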

#### 🧭 Check your work

In [5]:
# DO NOT CHANGE THE CODE IN THIS CELL
conn_checker = sqlite3.connect('starbucks.db')
tables_to_check = ['transactions', 'profiles']

# Check if table exists
user_tables = list(pd.read_sql_query("SELECT * FROM sqlite_master WHERE type='table';", con=conn_checker)['tbl_name'])

for t in tables_to_check:
    tc.assertTrue(t in user_tables, f'{t} does not exist in your starbucks.db file!')

conn_checker.close()

---

### 📌 Read SQLite Database File

▶️ Run the code below to select the first 5 rows from the `transactions` table.

In [6]:
# DO NOT CHANGE THE CODE IN THIS CELL
conn = sqlite3.connect('starbucks.db')
display(pd.read_sql_query('SELECT * FROM transactions LIMIT 5;', con=conn))
conn.close()

Unnamed: 0,member_id,event,time,offer_id,amount
0,78afa995795e4d85b5d9ceeca43f5fef,offer received,0,9b98b8c7a33c4b65b9aebfe6a799e6d9,
1,a03223e636434f42ac4c3df47e8bac43,offer received,0,0b1e1539f2cc45b7b9fa7c272da2e1d7,
2,e2127556f4f64592b11af22de27a7932,offer received,0,2906b810c7d4411798c6938adc9daaa5,
3,8ec6ce2a7e7949b1bf142def7d0e0586,offer received,0,fafdcd668e3743c1bb461111dcafc2a4,
4,68617ca6246f4fbc85e91a2a49552598,offer received,0,4d5c57ea9a6940dd891ad53e9dbe8da0,


▶️ Run the code below to select the first 5 rows from the `profiles` table.

In [7]:
# DO NOT CHANGE THE CODE IN THIS CELL
conn = sqlite3.connect('starbucks.db')
display(pd.read_sql_query('SELECT * FROM profiles LIMIT 5;', con=conn))
conn.close()

Unnamed: 0,member_id,gender,age,became_member_on,income
0,68be06ca386d4c31939f3a4f0e3dd783,,,2017-02-12,
1,0610b486422d4921ae7d2bf64640c50b,F,55.0,2017-07-15,112000.0
2,38fe809add3b4fcf9315a9694bb96ff5,,,2018-07-12,
3,78afa995795e4d85b5d9ceeca43f5fef,F,75.0,2017-05-09,100000.0
4,a03223e636434f42ac4c3df47e8bac43,,,2017-08-04,


#### 🧭 Check your work

In [8]:
# DO NOT CHANGE THE CODE IN THIS CELL
conn_checker = sqlite3.connect('starbucks.db')
tables_to_check = ['transactions', 'profiles']

# Check if table exists
user_tables = list(pd.read_sql_query("SELECT * FROM sqlite_master WHERE type='table';", con=conn_checker)['tbl_name'])

for table_to_check in tables_to_check:
    tc.assertTrue(table_to_check in user_tables, f'{table_to_check} does not exist in your starbucks.db file!')

conn_checker.close()

---

### 🎯 Exercise 1: Join `profiles` into `transactions` table

#### 👇 Tasks

- ✔️ Write a query that joins the `profiles` table into `transactions`.
- ✔️ Use an inner join.
- ✔️ Select all columns.
- ✔️ Store your query to a new variable named `query_joined1`.

In [9]:
# YOUR CODE BEGINS
query_joined1 = '''
SELECT *
FROM transactions
INNER JOIN profiles
ON transactions.member_id = profiles.member_id;
'''
# YOUR CODE ENDS

conn = sqlite3.connect('starbucks.db')
df_result = pd.read_sql_query(query_joined1, con=conn)
display(df_result)
conn.close()

Unnamed: 0,member_id,event,time,offer_id,amount,member_id.1,gender,age,became_member_on,income
0,78afa995795e4d85b5d9ceeca43f5fef,offer received,0,9b98b8c7a33c4b65b9aebfe6a799e6d9,,78afa995795e4d85b5d9ceeca43f5fef,F,75.0,2017-05-09,100000.0
1,a03223e636434f42ac4c3df47e8bac43,offer received,0,0b1e1539f2cc45b7b9fa7c272da2e1d7,,a03223e636434f42ac4c3df47e8bac43,,,2017-08-04,
2,e2127556f4f64592b11af22de27a7932,offer received,0,2906b810c7d4411798c6938adc9daaa5,,e2127556f4f64592b11af22de27a7932,M,68.0,2018-04-26,70000.0
3,8ec6ce2a7e7949b1bf142def7d0e0586,offer received,0,fafdcd668e3743c1bb461111dcafc2a4,,8ec6ce2a7e7949b1bf142def7d0e0586,,,2017-09-25,
4,68617ca6246f4fbc85e91a2a49552598,offer received,0,4d5c57ea9a6940dd891ad53e9dbe8da0,,68617ca6246f4fbc85e91a2a49552598,,,2017-10-02,
...,...,...,...,...,...,...,...,...,...,...
306529,b3a1272bc9904337b331bf348c3e8c17,transaction,714,,1.59,b3a1272bc9904337b331bf348c3e8c17,M,66.0,2018-01-01,47000.0
306530,68213b08d99a4ae1b0dcb72aebd9aa35,transaction,714,,9.53,68213b08d99a4ae1b0dcb72aebd9aa35,M,52.0,2018-04-08,62000.0
306531,a00058cf10334a308c68e7631c529907,transaction,714,,3.61,a00058cf10334a308c68e7631c529907,F,63.0,2013-09-22,52000.0
306532,76ddbd6576844afe811f1a3c0fbb5bec,transaction,714,,3.53,76ddbd6576844afe811f1a3c0fbb5bec,M,57.0,2016-07-09,40000.0
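
The result above contains the join key twice (shown as `member_id` and `member_id.1`), because `SELECT *` returns every column from both tables. The sketch below reproduces this on toy tables and shows one way to keep a single key column by naming columns explicitly. The rows and the names `conn_demo` and `join_clause` are made up for illustration.

```python
import sqlite3

# Toy tables with made-up rows; IDs and values are illustrative only.
conn_demo = sqlite3.connect(':memory:')
conn_demo.execute('CREATE TABLE transactions (member_id TEXT, amount REAL)')
conn_demo.execute('CREATE TABLE profiles (member_id TEXT, gender TEXT)')
conn_demo.execute("INSERT INTO transactions VALUES ('m1', 5.0)")
conn_demo.execute("INSERT INTO profiles VALUES ('m1', 'F')")

join_clause = '''
FROM transactions
INNER JOIN profiles
ON transactions.member_id = profiles.member_id
'''

# SELECT * returns every column from both tables, so member_id appears twice.
cur_all = conn_demo.execute('SELECT *' + join_clause)
all_columns = [d[0] for d in cur_all.description]

# Listing columns explicitly (with aliases) keeps a single member_id.
cur_named = conn_demo.execute(
    'SELECT transactions.member_id AS member_id, amount AS amount, gender AS gender'
    + join_clause)
named_columns = [d[0] for d in cur_named.description]

conn_demo.close()
print(all_columns)
print(named_columns)
```

Exercise 1 asks for all columns, so `SELECT *` is the expected answer there. The explicit column list is only an option for later analysis where a duplicated key gets in the way.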


#### 🧭 Check your work

In [10]:
# DO NOT CHANGE THE CODE IN THIS CELL
conn = sqlite3.connect('starbucks.db')
df_check = pd.read_sql_query(query_joined1, con=conn)
tc.assertEqual(df_check.shape, (306534, 10), 'Incorrect number of rows and/or columns')
conn.close()

---

### 🎯 Exercise 2: Join `profiles` into `transactions` table where gender is not missing

#### 👇 Tasks

- ✔️ Write a query that joins the `profiles` table into `transactions`.
- ✔️ Only keep rows where the `gender` column's value in the `profiles` table is NOT NULL.
- ✔️ Use an inner join.
- ✔️ Select all columns.
- ✔️ Store your query to a new variable named `query_joined2`.

In [11]:
# YOUR CODE BEGINS
query_joined2 = '''
SELECT *
FROM transactions
INNER JOIN profiles
ON transactions.member_id = profiles.member_id
WHERE profiles.gender IS NOT NULL;
'''
# YOUR CODE ENDS

conn = sqlite3.connect('starbucks.db')
df_result = pd.read_sql_query(query_joined2, con=conn)
display(df_result)
conn.close()

Unnamed: 0,member_id,event,time,offer_id,amount,member_id.1,gender,age,became_member_on,income
0,0610b486422d4921ae7d2bf64640c50b,offer completed,528,9b98b8c7a33c4b65b9aebfe6a799e6d9,,0610b486422d4921ae7d2bf64640c50b,F,55.0,2017-07-15,112000.0
1,0610b486422d4921ae7d2bf64640c50b,offer received,408,9b98b8c7a33c4b65b9aebfe6a799e6d9,,0610b486422d4921ae7d2bf64640c50b,F,55.0,2017-07-15,112000.0
2,0610b486422d4921ae7d2bf64640c50b,offer received,504,3f207df678b143eea3cee63160fa8bed,,0610b486422d4921ae7d2bf64640c50b,F,55.0,2017-07-15,112000.0
3,0610b486422d4921ae7d2bf64640c50b,transaction,18,,21.51,0610b486422d4921ae7d2bf64640c50b,F,55.0,2017-07-15,112000.0
4,0610b486422d4921ae7d2bf64640c50b,transaction,144,,32.28,0610b486422d4921ae7d2bf64640c50b,F,55.0,2017-07-15,112000.0
...,...,...,...,...,...,...,...,...,...,...
272757,e4052622e5ba45a8b96b59aba68cf068,transaction,84,,25.19,e4052622e5ba45a8b96b59aba68cf068,F,62.0,2017-07-22,82000.0
272758,e4052622e5ba45a8b96b59aba68cf068,transaction,96,,21.53,e4052622e5ba45a8b96b59aba68cf068,F,62.0,2017-07-22,82000.0
272759,e4052622e5ba45a8b96b59aba68cf068,transaction,480,,30.57,e4052622e5ba45a8b96b59aba68cf068,F,62.0,2017-07-22,82000.0
272760,e4052622e5ba45a8b96b59aba68cf068,transaction,486,,19.47,e4052622e5ba45a8b96b59aba68cf068,F,62.0,2017-07-22,82000.0
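
The filter in Exercise 2 uses `IS NOT NULL` and not a comparison operator. In SQL, any comparison with `NULL` evaluates to `NULL`, which `WHERE` treats as not true, so `gender != NULL` keeps no rows at all. The sketch below checks this on a toy `profiles` table. The rows and the helper `count_rows` are assumptions made for illustration.

```python
import sqlite3

# Toy profiles table: one of three members has no recorded gender.
conn_demo = sqlite3.connect(':memory:')
conn_demo.execute('CREATE TABLE profiles (member_id TEXT, gender TEXT)')
conn_demo.executemany('INSERT INTO profiles VALUES (?, ?)',
                      [('m1', 'F'), ('m2', None), ('m3', 'M')])

def count_rows(where_clause):
    """Count the rows of the toy table that pass the given WHERE clause."""
    query = f'SELECT COUNT(*) FROM profiles WHERE {where_clause}'
    return conn_demo.execute(query).fetchone()[0]

kept_is_not_null = count_rows('gender IS NOT NULL')  # keeps m1 and m3
kept_not_equal = count_rows('gender != NULL')        # comparison yields NULL for every row

conn_demo.close()
print(kept_is_not_null, kept_not_equal)
```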


#### 🧭 Check your work

In [12]:
# DO NOT CHANGE THE CODE IN THIS CELL
conn = sqlite3.connect('starbucks.db')
df_check = pd.read_sql_query(query_joined2, con=conn)
tc.assertEqual(df_check.shape[0], 272762, 'Incorrect number of rows')
conn.close()

---

### 🎯 Exercise 3: Average transaction amount by gender

#### 👇 Tasks

- ✔️ Write a query that finds the average transaction amount by gender.
- ✔️ Only display rows with a non-missing `gender` value.
- ✔️ Only display the following two columns:
    - `gender`
    - `average_amount`
- ✔️ Sort the result by `average_amount` in descending order.
- ✔️ Store your query to a new variable named `query_joined3`.


#### 🧭 Expected Output

|    | gender   |   average_amount |
|---:|:---------|-----------------:|
|  0 | F        |          17.4901 |
|  1 | O        |          14.7987 |
|  2 | M        |          11.6066 |

In [13]:
# YOUR CODE BEGINS
query_joined3 = '''
SELECT gender, AVG(amount) AS average_amount
FROM transactions
INNER JOIN profiles
ON transactions.member_id = profiles.member_id
GROUP BY profiles.gender
HAVING gender IS NOT NULL
ORDER BY average_amount DESC;
'''
# YOUR CODE ENDS

conn = sqlite3.connect('starbucks.db')
df_result = pd.read_sql_query(query_joined3, con=conn)
display(df_result)
conn.close()

Unnamed: 0,gender,average_amount
0,F,17.490077
1,O,14.798731
2,M,11.6066
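
Two details of the query above are easy to miss: `AVG` skips `NULL` amounts, so non-purchase events do not affect the average, and the missing-gender group can be removed either after grouping (`HAVING`) or before it (`WHERE`). The sketch below compares both forms on toy tables. The rows and the names `conn_demo` and `template` are made up for illustration.

```python
import sqlite3

conn_demo = sqlite3.connect(':memory:')
conn_demo.execute('CREATE TABLE transactions (member_id TEXT, amount REAL)')
conn_demo.execute('CREATE TABLE profiles (member_id TEXT, gender TEXT)')
conn_demo.executemany('INSERT INTO transactions VALUES (?, ?)', [
    ('m1', 10.0), ('m1', 20.0),
    ('m1', None),   # mimics a non-purchase event; AVG ignores it
    ('m2', 4.0),
    ('m3', 100.0),  # member with unknown gender
])
conn_demo.executemany('INSERT INTO profiles VALUES (?, ?)',
                      [('m1', 'F'), ('m2', 'M'), ('m3', None)])

template = '''
SELECT gender, AVG(amount) AS average_amount
FROM transactions
INNER JOIN profiles
ON transactions.member_id = profiles.member_id
{where}
GROUP BY profiles.gender
{having}
ORDER BY average_amount DESC;
'''

# Drop the NULL-gender group after aggregation...
rows_having = conn_demo.execute(
    template.format(where='', having='HAVING gender IS NOT NULL')).fetchall()
# ...or drop the NULL-gender rows before aggregation.
rows_where = conn_demo.execute(
    template.format(where='WHERE gender IS NOT NULL', having='')).fetchall()

conn_demo.close()
print(rows_having)
print(rows_where)
```

On this toy data both forms return the same two groups. Filtering with `WHERE` discards the unwanted rows before they are grouped, which is the more common choice when the condition does not involve an aggregate.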


#### 🧭 Check your work

In [14]:
# DO NOT CHANGE THE CODE IN THIS CELL
conn = sqlite3.connect('starbucks.db')
decoded_query = base64.b64decode(b'ClNFTEVDVCBnZW5kZXIsIEFWRyhhbW91\
bnQpIEFTIGF2ZXJhZ2VfYW1vdW50CkZST00gdHJhbnNhY3Rpb25zCklOTkVSIEpPSU4\
gcHJvZmlsZXMKT04gdHJhbnNhY3Rpb25zLm1lbWJlcl9pZCA9PSBwcm9maWxlcy5tZW\
1iZXJfaWQKR1JPVVAgQlkgcHJvZmlsZXMuZ2VuZGVyCkhBVklORyBnZW5kZXIgSVMgT\
k9UIE5VTEwKT1JERVIgQlkgYXZlcmFnZV9hbW91bnQgREVTQzsK').decode()
df_check = pd.read_sql_query(decoded_query, con=conn)
pd.testing.assert_frame_equal(df_result.reset_index(drop=True),
                              df_check.reset_index(drop=True))