# Task 2: A Sample of Owners

Working from a local sample of owners is far more efficient than querying the full dataset every time. Here, we will generate a file that contains every record for each sampled owner. To accomplish this, the Python script below carries out the following tasks:

1. Connects to my Google BigQuery instance.

2. Builds a list of owners. 

3. Takes a sample of the owners. 

4. Extracts all records associated with the sample of owners and writes them to a local text file. 

In [1]:
import os
import sqlite3
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import polars as pl
import zipfile
from datetime import datetime
# Import only the names that are needed instead of a wildcard import
from pandas_gbq import read_gbq, to_gbq

In [2]:
from google.cloud import bigquery
from google.oauth2 import service_account

In [18]:
# Set up the Google BigQuery connection
service_path = "C:/Users/breni/Documents/"
service_file = 'niekampbreannawedge-8bbebeea1dda.json'
project_id = 'niekampbreannawedge'
data_id = 'wedge24'
table_id = 'wedge_transactions'

# Full path to the service-account key file
beans_key = service_path + service_file

# Load the credentials from the key file
credentials = service_account.Credentials.from_service_account_file(beans_key)

# Create a BigQuery client for this project
client = bigquery.Client(credentials=credentials, project=project_id)
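
A mistyped key path only surfaces when the credentials are loaded. The sketch below shows one way to fail early with a clearer message. `resolve_key_path` is a hypothetical helper introduced here for illustration, not part of the original notebook.

```python
from pathlib import Path


def resolve_key_path(service_path: str, service_file: str) -> Path:
    """Join directory and file name, failing early if the key file is missing.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    key_path = Path(service_path) / service_file
    if not key_path.is_file():
        raise FileNotFoundError(f"Service-account key not found: {key_path}")
    return key_path
```

Under these assumptions, `beans_key = str(resolve_key_path(service_path, service_file))` could stand in for the plain string concatenation.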


In [19]:
# Build a List of Owners
query_owners = """
    SELECT DISTINCT card_no
    FROM `umt-msba.wedge_transactions.transArchive*`
    WHERE card_no != 3
"""

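# Pass the service-account credentials explicitly rather than relying on default authentication
df_owners = read_gbq(query_owners, project_id=project_id, credentials=credentials)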

Downloading: 100%|██████████|


In [20]:
print(f"Total number of owners: {len(df_owners)}")

Total number of owners: 27207


In [21]:
# Take Sample of the Owners
sample_size = 740  # Number of owners to sample; adjust so the output file is around 250 MB
df_sampled_owners = df_owners.sample(n=sample_size)
print(f"Sampled owners:\n{df_sampled_owners}")

Sampled owners:
       card_no
10879  39311.0
12472  16229.0
9882   51278.0
5631   36848.0
3608   12541.0
...        ...
6181   13116.0
15533  22935.0
18532  52923.0
2057   16108.0
3032   25292.0

[740 rows x 1 columns]
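
The sample above changes on every run because no seed is set. If a repeatable sample is wanted, a fixed `random_state` is one option. The sketch below uses a small stand-in frame (`df_owners_demo`) and an arbitrary seed value; both are assumptions made for illustration.

```python
import pandas as pd

# Stand-in for df_owners; the real frame comes from the BigQuery query.
df_owners_demo = pd.DataFrame({"card_no": [float(n) for n in range(10000, 10100)]})

SEED = 42  # assumed value; any fixed integer works

# Two draws with the same seed select the same rows in the same order.
first_draw = df_owners_demo.sample(n=10, random_state=SEED)
second_draw = df_owners_demo.sample(n=10, random_state=SEED)

print(first_draw.equals(second_draw))
```

In the notebook this would amount to `df_owners.sample(n=sample_size, random_state=SEED)`.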


In [22]:
# Extract All Records Associated with the Sample of Owners
sampled_owner_list = df_sampled_owners['card_no'].tolist()


In [23]:

query_transactions = f"""
    SELECT *
    FROM `umt-msba.wedge_transactions.transArchive*`
    WHERE card_no IN ({','.join(map(str, sampled_owner_list))})
"""
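
Because `card_no` comes back as a float, the `IN` list above is rendered as `39311.0,16229.0,...`. The sketch below shows one way to render whole numbers and skip missing values defensively. `format_card_list` is a hypothetical helper, and whether integer literals suit the query depends on the column type in BigQuery, which is not confirmed here.

```python
import math


def format_card_list(card_numbers):
    """Render card numbers as a comma-separated list of integers for an IN clause.

    Hypothetical helper: skips NaN values and rejects non-whole numbers.
    """
    formatted = []
    for value in card_numbers:
        number = float(value)
        if math.isnan(number):
            continue  # missing card numbers appear as NaN in a float column
        if not number.is_integer():
            raise ValueError(f"Unexpected non-integer card number: {value}")
        formatted.append(str(int(number)))
    return ",".join(formatted)


print(format_card_list([39311.0, 16229.0, float("nan"), 51278.0]))
# prints 39311,16229,51278
```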

In [24]:

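# Run the query with the same project and credentials as the owner query
df_transactions = read_gbq(query_transactions, project_id=project_id, credentials=credentials)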


Downloading: 100%|██████████|


In [25]:
print(f"Number of records for sampled owners: {len(df_transactions)}")

Number of records for sampled owners: 1537436


In [26]:
# Write the extracted records to a local file
output_file = 'sampled_owner_transactions.csv'


In [27]:
df_transactions.to_csv(output_file, index=False)
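
The sample size was chosen to produce a file of roughly 250 MB, so it helps to check the size of the written file. The sketch below measures a file and suggests a rescaled sample size. Both helpers are hypothetical, and the linear scaling is only a rough estimate because owners differ in how many records they have.

```python
import os


def file_size_mb(path: str) -> float:
    """Return the size of a file in megabytes (1 MB = 1,000,000 bytes)."""
    return os.path.getsize(path) / 1_000_000


def suggest_sample_size(current_n: int, current_mb: float, target_mb: float = 250.0) -> int:
    """Scale the sample size linearly toward the target file size (rough estimate)."""
    return max(1, round(current_n * target_mb / current_mb))
```

For example, `file_size_mb(output_file)` reports the size of the CSV, and `suggest_sample_size(sample_size, file_size_mb(output_file))` gives a starting point for the next run.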