## Andrew Byrnes: Fetch Rewards Coding Exercise - Data Analyst
### 2.1-EDA_first_pass.ipynb

This notebook is a first draft as I explore the exercise questions. It does not necessarily contain my final answers, but for the most part it reflects how I would start to approach these problems. My final answers are presented in 2.2-FINAL-ANSWERS.ipynb.

### Data Sources
- fetch.db: SQLite file from notebook 1-Data_Prep
- fetch_db_erd.jpg: ERD image, documenting fetch.db schema designed and created in notebook 1-Data_Prep

### Changes
- 09-21-2022 : Set up notebook
- 09-22-2022 : 

In [1]:
import pandas as pd
from pathlib import Path
import os
from datetime import datetime
import sqlite3

### File Locations

In [5]:
today = datetime.today()
print(today)
db_path = Path.cwd() / "data" / "processed" / "fetch.db"
erd_fetch_db = Path.cwd() / "data" / "interim" / "fetch_db_erd.jpg" 

2022-09-22 11:25:26.195513


### Formatting and options

In [3]:
pd.set_option('display.max_colwidth', None)
# pd.set_option('display.max_rows', None)
pd.reset_option('display.max_rows')
pd.set_option('display.max_columns', None)
# pd.reset_option('display.max_columns')

## First: Review Existing Unstructured Data and Diagram a New Structured Relational Data Model

1-Data_Prep.ipynb captures my process from start to finish: reviewing the data, performing some transformations, and loading 4 tables of data into a SQLite database. 

I've structured the data as presented in this ERD:
![fetch_db_erd.jpg](attachment:fetch_db_erd.jpg)  
*Note: brands.barcode <-> receipt_items is not implemented in fetch.db as the zero-or-one to zero-or-many relationship depicted above. brands.barcode contains duplicate values, which are flagged in brands.dupe_barcode for easy filtering. Without that filter, this join may produce unintended duplicate rows; in a production environment this would need to be addressed.*
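
To make the duplicate-row risk in the note concrete, here is a minimal sketch on an in-memory SQLite database. The table and column names follow the note; the rows are made up for illustration and are not taken from fetch.db, and the `coalesce` filter is one possible way to exclude flagged rows, not necessarily the one used elsewhere.

```python
import sqlite3

# Toy data only: rows are invented, not taken from fetch.db.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table brands (barcode text, name text, dupe_barcode real);
    create table receipt_items (receipt_id text, barcode text);
    insert into brands values
        ('111', 'Brand A', 1.0),   -- same barcode listed for two brands
        ('111', 'Brand B', 1.0),
        ('222', 'Brand C', null);
    insert into receipt_items values ('r1', '111'), ('r1', '222');
""")

join_sql = """
    select count(*)
    from receipt_items ri
    join brands b on b.barcode = ri.barcode
"""
# Without a filter, the item with barcode 111 matches two brand rows.
unfiltered_rows = conn.execute(join_sql).fetchone()[0]
# Excluding flagged barcodes leaves one row per remaining item.
filtered_rows = conn.execute(
    join_sql + " where coalesce(b.dupe_barcode, 0) <> 1"
).fetchone()[0]
conn.close()

print(unfiltered_rows)  # 3
print(filtered_rows)    # 1
```

Two receipt items become three joined rows when the duplicated barcode is left in, which is the kind of inflation that would distort any count built on this join.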

## Second: Write a query that directly answers a predetermined question from a business stakeholder

In [None]:
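# Scratch template for the queries below: fill in the select / from / where clauses before running.
query = f"""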
        select 
            
        from
            
        where
            
    """
#execute the above query and save results to dataframe 
conn = sqlite3.connect(db_path)
df = pd.read_sql_query(query, conn)
conn.close()
df

###  What are the top 5 brands by receipts scanned for most recent month?
I don't want to overcount receipts for a brand if multiple different items of the same brand were purchased on the same receipt. Is that a possibility with this data, and something I need to account for?

Question: Are there duplicate brand names in the brands table? Is barcode 1-to-1 with product? Can a brand have multiple barcodes (products) on the same receipt? If so, does that receipt count once or twice?
- Only if you include the known duplicate barcodes. Barcodes and brands are 1-to-1 when the known duplicates are excluded.

Assumptions: If the duplicate barcodes are completely ignored, we can assume any given receipt will not contain separate kinds of items from the same brand. A receipt may count for multiple brands when multiple kinds of items are purchased on the same receipt. 
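
If that assumption ever fails, the overcounting concern can be handled in the query itself by counting distinct receipts. The sketch below uses a simplified, hypothetical `receipt_items` table that carries the brand name directly (the real schema goes through `brands.barcode`), with made-up rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical simplified table: brand stored directly on the item row.
    create table receipt_items (receipt_id text, brand text);
    insert into receipt_items values
        ('r1', 'Brand A'), ('r1', 'Brand A'),  -- two Brand A items on one receipt
        ('r2', 'Brand A'),
        ('r2', 'Brand B');
""")
rows = conn.execute("""
    select brand,
           count(receipt_id)          as item_rows,
           count(distinct receipt_id) as receipts
    from receipt_items
    group by brand
    order by brand
""").fetchall()
conn.close()

print(rows)  # [('Brand A', 3, 2), ('Brand B', 1, 1)]
```

`count(receipt_id)` counts item rows, so Brand A shows 3, while `count(distinct receipt_id)` counts each receipt once per brand and shows 2.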

In [17]:
# Does a brand name appear multiple times in the brands table? No, if you exclude the dupe barcodes identified earlier.

query = f"""
        select 
            b.name,
            count(b.barcode) as barcodes
        from
            brands b
        where
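            -- dupe_barcode is a boolean flag stored as 1.0 or NULL (NaN);
            -- coalesce NULL to 0 so unflagged rows are kept (NULL <> 1 evaluates to NULL, which drops the row)
            coalesce(b.dupe_barcode, 0) <> 1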
        group by
            b.name
        having
            -- only return results where the count of barcodes is greater than 1
            barcodes > 1
        order by
            barcodes desc
            
    """
#execute the above query and save results to dataframe 
conn = sqlite3.connect(db_path)
df = pd.read_sql_query(query, conn)
conn.close()
df

Unnamed: 0,name,barcodes
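
Because dupe_barcode is stored as 1.0 or NaN (NULL once loaded into SQLite), the way the filter treats NULL matters for this check: a plain `<> 1` comparison against NULL evaluates to NULL, and the row is dropped. A minimal sketch with made-up rows, comparing it to a `coalesce`-based filter:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table brands (name text, dupe_barcode real);
    insert into brands values ('flagged', 1.0), ('unflagged', null);
""")
# Plain inequality: NULL <> 1 is NULL, so the unflagged row is filtered out too.
plain = conn.execute(
    "select name from brands where dupe_barcode <> 1"
).fetchall()
# Treat NULL as 0 before comparing, so unflagged rows are kept.
null_safe = conn.execute(
    "select name from brands where coalesce(dupe_barcode, 0) <> 1"
).fetchall()
conn.close()

print(plain)      # []
print(null_safe)  # [('unflagged',)]
```

An empty result from a query filtered this way is therefore worth a second look before concluding that no brand has multiple barcodes.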


In [None]:
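# Top 5 brands by distinct receipts scanned in the most recent month of data.
# Assumption: receipts.dateScanned is stored in a format SQLite's date functions
# can parse (e.g. ISO-8601 text); adjust the date expression if it is stored differently.
query = """
        select 
            b.name,
            -- count each receipt once per brand, even if it holds several items of that brand
            count(distinct ri.receipt_id) as receipts_scanned
        from
            brands b
            join receipt_items ri on ri.barcode = b.barcode
            join receipts r on r.id = ri.receipt_id
        where
            -- dupe_barcode is 1.0 or NULL; coalesce so unflagged (NULL) rows are kept
            coalesce(b.dupe_barcode, 0) <> 1
            -- first day of the month that contains the latest scan date
            and r.dateScanned >= (select date(max(dateScanned), 'start of month') from receipts)
        group by
            b.name
        order by
            receipts_scanned desc
        limit 5            
        """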
#execute the above query and save results to dataframe 
conn = sqlite3.connect(db_path)
df = pd.read_sql_query(query, conn)
conn.close()
df
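
The cutoff for "most recent month" can be derived inside SQLite with the `start of month` date modifier. This is a minimal sketch that assumes `dateScanned` is stored as ISO-8601 text; if fetch.db stores it another way (for example as epoch milliseconds), the expression would need to change. The rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table receipts (id text, dateScanned text);
    insert into receipts values
        ('r1', '2021-01-15 10:00:00'),
        ('r2', '2021-02-03 09:30:00'),
        ('r3', '2021-02-27 18:45:00');
""")
# First day of the month that contains the latest scan date.
month_start = conn.execute(
    "select date(max(dateScanned), 'start of month') from receipts"
).fetchone()[0]
# ISO-8601 text sorts chronologically, so a string comparison is enough here.
recent_ids = [row[0] for row in conn.execute(
    "select id from receipts where dateScanned >= ? order by id", (month_start,)
)]
conn.close()

print(month_start)  # 2021-02-01
print(recent_ids)   # ['r2', 'r3']
```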

### How does the ranking of the top 5 brands by receipts scanned for the recent month compare to the ranking for the previous month?

### When considering average spend from receipts with 'rewardsReceiptStatus' of 'Accepted' or 'Rejected', which is greater?

### When considering total number of items purchased from receipts with 'rewardsReceiptStatus' of 'Accepted' or 'Rejected', which is greater?

### Which brand has the most spend among users who were created within the past 6 months?

### Which brand has the most transactions among users who were created within the past 6 months?

## Third: Evaluate Data Quality Issues in the Data Provided

Using the programming language of your choice (SQL, Python, R, Bash, etc.), identify at least one data quality issue. We are not expecting a full-blown review of all the data provided, but instead want to know how you explore and evaluate data of questionable provenance.
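
One simple check of this kind is a key-uniqueness test, which is how an issue like the duplicated brands.barcode values can surface. The sketch below is illustrative only: it runs against an in-memory table with invented rows, not against fetch.db.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table brands (barcode text, name text);
    insert into brands values
        ('111', 'Brand A'),
        ('111', 'Brand B'),
        ('222', 'Brand C');
""")
# If barcode were a true key, total rows would equal distinct barcodes.
total, distinct_barcodes = conn.execute(
    "select count(*), count(distinct barcode) from brands"
).fetchone()
# List the barcodes that appear more than once.
repeated = conn.execute("""
    select barcode, count(*) as n
    from brands
    group by barcode
    having count(*) > 1
""").fetchall()
conn.close()

print(total, distinct_barcodes)  # 3 2
print(repeated)                  # [('111', 2)]
```

A mismatch between the two counts signals that the column cannot safely be used as a join key without further cleanup.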

## Fourth: Communicate with Stakeholders

Construct an email or Slack message that is understandable to a product or business leader who isn't familiar with your day-to-day work. This part of the exercise should show off how you communicate and reason about data with others. Commit your answers to the git repository along with the rest of your exercise.

- What questions do you have about the data?
- How did you discover the data quality issues?
- What do you need to know to resolve the data quality issues?
- What other information would you need to help you optimize the data assets you're trying to create?
- What performance and scaling concerns do you anticipate in production and how do you plan to address them?