# Exercise 2
Answer three of the following questions with at least one question coming from the closed-ended and one from the open-ended question set. Each question should be answered using one query.

Closed-ended questions:
- **What are the top 5 brands by receipts scanned among users 21 and over?**
- **What are the top 5 brands by sales among users that have had their account for at least six months?**
- What is the percentage of sales in the Health & Wellness category by generation?

Open-ended questions: for these, make assumptions and clearly state them when answering the question.
- Who are Fetch’s power users?
- Which is the leading brand in the Dips & Salsa category?
- **At what percent has Fetch grown year over year?**

## 1. What are the top 5 brands by receipts scanned among users 21 and over?

One major caveat: the Users data is incomplete, and most `user_id` values in Transactions have no matching row in Users. After the join, very few receipts remain (the top brands below have only 2 or 3 each), so I do not think this ranking is an accurate depiction of brand popularity.
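A quick way to quantify this caveat is to count how many distinct `user_id` values in Transactions also appear in Users. The sketch below is illustrative only: it runs against a small in-memory database with made-up rows rather than `fetch.sqlite`, and it assumes nothing about the real tables beyond the `user_id` and `receipt_id` columns used elsewhere in this notebook.

```python
import sqlite3

# Hypothetical toy rows for illustration; the real tables live in fetch.sqlite.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table users (user_id text primary key);
    create table transactions (receipt_id text, user_id text);
    insert into users values ('u1'), ('u2');
    insert into transactions values
        ('r1', 'u1'), ('r2', 'u1'), ('r3', 'u9'), ('r4', 'u8');
""")

# A left join keeps every transaction; u.user_id is NULL when there is no match,
# and count(distinct ...) ignores NULLs.
n_users, n_matched = conn.execute("""
    select count(distinct t.user_id)
         , count(distinct u.user_id)
    from transactions t
    left join users u on t.user_id = u.user_id
""").fetchone()
conn.close()

match_rate = round(100.0 * n_matched / n_users, 1)
print(f"{n_matched} of {n_users} transaction users matched ({match_rate}%)")
```

Pointing the same query at `fetch.sqlite` would give the actual match rate, which indicates how much of the Transactions data survives the inner join.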

In [22]:
# What are the top 5 brands by receipts scanned among users 21 and over?

import sqlite3
import tabulate
conn = sqlite3.connect('fetch.sqlite')
cur = conn.cursor()

cur.execute('''
select p.brand
, count(distinct receipt_id) as n_receipts
from transactions t
inner join products_clean p on t.barcode = p.barcode
inner join users u on t.user_id = u.user_id
where p.brand is not null
    and date(u.birth_date) <= date(current_date, '-21 year')
group by p.brand 
order by n_receipts desc
limit 5
''')

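# Fetch the rows first, then actually close the cursor and connection
# (close is a method and needs parentheses to be called).
rows = cur.fetchall()
cur.close()
conn.close()

tabulate.tabulate(rows, tablefmt='html')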

| brand | n_receipts |
| --- | --- |
| NERDS CANDY | 3 |
| DOVE | 3 |
| TRIDENT | 2 |
| SOUR PATCH KIDS | 2 |
| MEIJER | 2 |


## 2. What are the top 5 brands by sales among users that have had their account for at least six months?

The same caveat as in the previous question applies: the Users data is incomplete, and most `user_id` values in Transactions have no matching row in Users, so the number of receipts is very small and I do not think this ranking is an accurate depiction of brand sales. In addition, not all receipts have a value for `final_sale`, which further reduces the accuracy of the ranking.
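The share of rows without a usable `final_sale` can be measured directly. This is a minimal sketch on made-up rows in an in-memory database, not on `fetch.sqlite`. It assumes missing values are stored as NULL, an empty string, or a whitespace-only string; the notebook only says that some values are missing, so the exact encoding is an assumption.

```python
import sqlite3

# Hypothetical toy rows; how the real data encodes a missing final_sale is an assumption.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table transactions (receipt_id text, final_sale text);
    insert into transactions values
        ('r1', '1.49'), ('r2', ' '), ('r3', ''), ('r4', '3.00'), ('r5', null);
""")

# Count all rows and the rows whose final_sale is NULL or blank after trimming.
n_rows, n_missing = conn.execute("""
    select count(*)
         , sum(case when final_sale is null or trim(final_sale) = ''
                    then 1 else 0 end)
    from transactions
""").fetchone()
conn.close()

pct_missing = round(100.0 * n_missing / n_rows, 1)
print(f"{n_missing} of {n_rows} rows have no final_sale ({pct_missing}%)")
```

A high percentage here would mean the sales totals understate true spend, and could do so unevenly across brands.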

In [26]:
import sqlite3
import tabulate
conn = sqlite3.connect('fetch.sqlite')
cur = conn.cursor()

cur.execute('''
select p.brand
, sum(cast(final_sale as decimal)) as total_sales
from transactions t
inner join products_clean p on t.barcode = p.barcode
inner join users u on t.user_id = u.user_id
where p.brand is not null
    and u.created_at <= datetime(current_timestamp, '-6 month')
group by p.brand 
order by total_sales desc
limit 5
''')

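# Fetch the rows first, then actually close the cursor and connection
# (close is a method and needs parentheses to be called).
rows = cur.fetchall()
cur.close()
conn.close()

tabulate.tabulate(rows, tablefmt='html')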

| brand | total_sales |
| --- | --- |
| CVS | 72.0 |
| TRIDENT | 46.72 |
| DOVE | 42.88 |
| COORS LIGHT | 34.96 |
| QUAKER | 16.6 |


## 3. At what percent has Fetch grown year over year?

The Transactions data covers only 2024-06-12 through 2024-09-08, which is too short a window for a year-over-year comparison of sales or receipts, so I measure growth as the number of new user accounts created per month. Because of the missing user data discussed above, these numbers are accurate for the dataset provided but likely do not reflect Fetch's overall business.

The User data ends on September 11, 2024, so I compare only the first eight full months of 2024 (January through August) against the same months of 2023.

New-user sign-ups were down sharply year over year in January (-52.7%) and February (-45.6%), stayed within roughly -12% to +8% of 2023 levels from March through June, and then rose well above 2023 levels in July (+59.6%) and August (+48.7%).

In [62]:
import sqlite3
import tabulate
conn = sqlite3.connect('fetch.sqlite')
conn.row_factory = sqlite3.Row
cur = conn.cursor()

cur.execute('''
with monthly_users as
(
    select 
        strftime('%Y-%m', created_at) as created_year_month   
        , count(*) as n_users_created
    from users
        where created_at >= '2023-01-01 00:00:00' 
        and created_at < '2024-09-01 00:00:00'
    group by created_year_month
    order by created_year_month
)
, lagged_users as
(
    select created_year_month
    , n_users_created 
    -- a 12-row lag equals "same month last year" only if no month is missing from monthly_users
    , lag(n_users_created, 12) over (order by created_year_month) as py_users_created
    from monthly_users
)
select created_year_month 
    , n_users_created
    , py_users_created
    , round(100.0*(1.0*n_users_created/py_users_created - 1),1) as yoy_change
from lagged_users
where created_year_month >= '2024-01'
order by created_year_month
;
''')

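# Fetch the rows first, then actually close the cursor and connection
# (close is a method and needs parentheses to be called).
rows = cur.fetchall()
cur.close()
conn.close()

# Four columns are returned, so the month column gets its own header.
tabulate.tabulate(rows, tablefmt='html', headers=["Month", "This Year", "Previous Year", "Percent Change"])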

| Month | This Year | Previous Year | Percent Change |
| --- | --- | --- | --- |
| 2024-01 | 1044 | 2207 | -52.7 |
| 2024-02 | 958 | 1760 | -45.6 |
| 2024-03 | 1200 | 1356 | -11.5 |
| 2024-04 | 1200 | 1114 | 7.7 |
| 2024-05 | 1146 | 1209 | -5.2 |
| 2024-06 | 1260 | 1198 | 5.2 |
| 2024-07 | 2037 | 1276 | 59.6 |
| 2024-08 | 1807 | 1215 | 48.7 |
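
The percent-change column can be reproduced from the two count columns with the same formula the query uses, `100 * (current / previous - 1)`. The sketch below copies the monthly counts from the table above. The year-to-date roll-up at the end is not part of the original query; it is an additional summary that sums the eight months on each side, offered as one possible single-number answer to the question.

```python
# Monthly new-user counts copied from the result table (Jan-Aug 2024 vs. Jan-Aug 2023).
this_year = [1044, 958, 1200, 1200, 1146, 1260, 2037, 1807]
prev_year = [2207, 1760, 1356, 1114, 1209, 1198, 1276, 1215]


def yoy_change(current, previous):
    """Percent change versus the same period one year earlier, to one decimal."""
    return round(100.0 * (current / previous - 1), 1)


# Month-by-month change; matches the Percent Change column.
monthly = [yoy_change(c, p) for c, p in zip(this_year, prev_year)]

# Year-to-date roll-up (an added summary, not computed in the original query).
ytd = yoy_change(sum(this_year), sum(prev_year))

print(monthly)
print(ytd)
```

On these counts, 10,652 users were created in January through August 2024 versus 11,335 in the same months of 2023, a year-to-date change of about -6.0%. The strong July and August figures therefore do not fully offset the weak start to the year.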
