### Take-home Assessment for Gemini     
Andrew Caide      
28-Jan-2025

----

## Task:

**Part 1.** Coding        
Write a script (python/similar) to export data from https://www.gemini.com/ for the trading pair BTC-USD.       
Use 1 minute interval candlestick data and store it in the assumed data lake.      
Refer to the Gemini API clients for this task.      
* https://docs.gemini.com/rest-api/#introduction      
* https://docs.gemini.com/rest-api/#candles       


**Requirements**     
Required columns in the output are:      
● Trading Pair, Open Price, Close Price, High Price, Low Price, BTC Volume, USD Volume, Number of Trades, Candle Open Time, Candle Close Time
    
Expected output:      
● Python/similar code which can be executed to collect data in required format from Gemini exchange and store into the Data Lake (assume a file storage system).

----

**Notes and Assumptions:**   

1. The format is not specified, but let's assume this code may be used in production and the data will grow very, very large as we are a popular trading platform with many users. For this reason, we shall **save files as parquet**.       
2. Additionally, we will store our data in a data lake/warehouse (tbd), so let's choose to **store files in s3**.
3. It was communicated that "**Number of Trades** can be weekly/monthly"; however, the data pulled from the candle endpoint only spans **2 days** (see below). To make the dataset interesting, I'll compute this at the *hourly* level.
4. The **candle open-close** time will be recorded at the *hourly level* as well.
5. I'm unable to find a resource on the APIs which *tracks price changes* at the granularity we're interested in. I'll (wrongly) assume the price is fixed, for this assignment.


--------


In [1]:
# load important libraries
import requests
import json
import pandas as pd
import boto3

# essential variables
ALL_PAIRS = [
    'btcusd', 'ethbtc', 'ethusd', 'bchusd', 'bchbtc', 'bcheth', 'ltcusd', 'ltcbtc', 'ltceth', 'ltcbch', 
    'batusd', 'daiusd', 'linkusd', 'oxtusd', 'linkbtc', 'linketh', 'ampusd', 'compusd', 'paxgusd', 'mkrusd', 
    'zrxusd', 'manausd', 'storjusd', 'crvusd', 'uniusd', 'renusd', 'umausd', 'yfiusd', 'aaveusd', 'filusd', 
    'btceur', 'btcgbp', 'etheur', 'ethgbp', 'btcsgd', 'ethsgd', 'sklusd', 'grtusd', 'lrcusd', 'sandusd', 
    'cubeusd', 'lptusd', 'maticusd', 'injusd', 'sushiusd', 'dogeusd', 'ftmusd', 'ankrusd', 'btcgusd', 'ethgusd', 
    'ctxusd', 'xtzusd', 'axsusd', 'dogebtc', 'dogeeth', 'rareusd', 'qntusd', 'maskusd', 'fetusd', 'api3usd', 
    'usdcusd', 'shibusd', 'rndrusd', 'galausd', 'ensusd', 'elonusd', 'ldousd', 'solusd', 'apeusd', 'gusdsgd', 
    'chzusd', 'jamusd', 'gmtusd', 'aliusd', 'gusdgbp', 'dotusd', 'ernusd', 'galusd', 'samousd', 'imxusd', 'iotxusd', 
    'avaxusd', 'atomusd', 'usdtusd', 'btcusdt', 'ethusdt', 'pepeusd', 'xrpusd', 'hntusd', 'wifusd', 'bonkusd', 
    'popcatusd', 'opusd', 'moodengusd', 'pnutusd', 'goatusd', 'mewusd', 'bomeusd', 'flokiusd', 'pythusd', 'chillguyusd']
ALL_TIMEFRAMES = ['1m','5m','15m','30m','1hr','6hr','1day']

In [2]:
# collect data

# candle 
symbol = ALL_PAIRS[0]
time_frame = ALL_TIMEFRAMES[0]
data_request = requests.get(f"https://api.gemini.com/v2/candles/{symbol}/{time_frame}")

# price of currency (btc)
price_request = requests.get("https://api.gemini.com/v1/pricefeed")
prices = price_request.json()
btc_price = float([pairs for pairs in prices if pairs['pair'].lower()==symbol][0]['price'])
print(f'Estimating the price of BTC at ${btc_price} USD.')

Estimating the price of BTC at $104515.53 USD.


In [3]:
# organize the dataset

candle_data = pd.DataFrame(data_request.json())
candle_data.columns = ['time_ms', 'open', 'high', 'low', 'close','volume']
candle_data['date-time'] = pd.to_datetime(candle_data.time_ms, unit='ms')
candle_data['time'] = candle_data['date-time'].dt.time
candle_data['trading_pair'] = symbol
candle_data['btc_volume'] = candle_data['volume']
candle_data['usd_volume'] = btc_price * candle_data['volume']
candle_data.sort_values('time_ms',ascending=True,inplace=True)
candle_data.head(3)

Unnamed: 0,time_ms,open,high,low,close,volume,date-time,time,trading_pair,btc_volume,usd_volume
1439,1738203660000,104075.33,104176.56,104075.33,104134.3,0.002141,2025-01-30 02:21:00,02:21:00,btcusd,0.002141,223.777156
1438,1738203720000,104134.3,104213.4,104134.3,104213.4,0.007438,2025-01-30 02:22:00,02:22:00,btcusd,0.007438,777.415776
1437,1738203780000,104213.4,104213.4,104145.17,104145.17,0.001453,2025-01-30 02:23:00,02:23:00,btcusd,0.001453,151.812988
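
The `usd_volume` column above multiplies every candle's BTC volume by one spot price taken at collection time. One possible refinement, sketched here as an assumption rather than anything Gemini documents, is to price each candle at the average of its own open, high, low and close. The helper name `estimate_usd_volume` is hypothetical.

```python
def estimate_usd_volume(candle):
    """Approximate the USD volume of one candle row.

    `candle` follows the [time_ms, open, high, low, close, volume] layout
    used above. The OHLC average is an assumed proxy for the traded price,
    not an exact figure.
    """
    _, open_, high, low, close, volume = candle
    typical_price = (open_ + high + low + close) / 4
    return typical_price * volume


# First row of the candle output shown above.
row = [1738203660000, 104075.33, 104176.56, 104075.33, 104134.3, 0.002141]
print(round(estimate_usd_volume(row), 2))
```

This is still an estimate: only trade-level data gives the exact USD amount exchanged.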


In [4]:
candle_summary = candle_data\
    .groupby([candle_data['date-time'].dt.day,candle_data['date-time'].dt.hour])\
    .agg(
        high_price=('high','max'),
        low_price=('low','min'),
        btc_volume=('btc_volume','sum'),
        usd_volume=('usd_volume','sum'),
        candle_open_time=('time_ms','min'),
        candle_close_time=('time_ms','max')
    )
candle_summary.index.names = ['day','hour']
candle_summary.reset_index(inplace=True)
candle_summary.head(5)

Unnamed: 0,day,hour,high_price,low_price,btc_volume,usd_volume,candle_open_time,candle_close_time
0,30,2,104673.26,104075.33,8.935182,933865.3,1738203660000,1738205940000
1,30,3,105150.0,104520.52,82.254785,8596902.0,1738206000000,1738209540000
2,30,4,105250.0,104746.69,115.226542,12042960.0,1738209600000,1738213140000
3,30,5,105400.0,105031.04,46.143723,4822736.0,1738213200000,1738216740000
4,30,6,105587.12,105142.13,48.751117,5095249.0,1738216800000,1738220340000


---

# Observations

Due to limitations of the `/candles` endpoint, we can only use what it readily returns. This means we're unable to gather:

1. Data beyond the default time range (i.e. a week or a month).
2. The **number of trades** made in the candle period. This is **required**, so this method is a no-go.
3. Lastly, we're estimating the **price of `btc`** - surely we can do better! 

------

## New Strategy

We'll use data from the `trades` API. This API allows us to query up to a week of data, query **all** trades made, and it provides us with the **exact price** of `btc`! This solves all of our problems. Additionally, we can use the candle summary to **QA** our work.

For this project, let's consider the time range covered by the candle data, plus the five days prior.

In [5]:
import datetime

# URL
base_url = "https://api.gemini.com/v1"

In [6]:
# Initializing book-ends of the requests.

# Start time = first_trade - 5 days (in milliseconds)
first_trade_ms = candle_data['time_ms'].min()

# Remove 5 days in ms from first_trade_ms
days = 5 
backtrack_ms = days * 24 * 60 * 60 * 1000
first_trade_minus_5_days_ms = first_trade_ms - backtrack_ms

# Double-check our work
print(f'Dataset start-time: {datetime.datetime.utcfromtimestamp(first_trade_ms / 1000.0)}\n'\
     f'Dataset start-time minus 5 days: {datetime.datetime.utcfromtimestamp(first_trade_minus_5_days_ms / 1000.0)}')

# Looks great!

Dataset start-time: 2025-01-30 02:21:00
Dataset start-time minus 5 days: 2025-01-25 02:21:00
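
The window arithmetic above can also be written with timezone-aware datetimes, which avoids the deprecated `utcfromtimestamp`. This is a minimal sketch; `window_start_ms` and `ms_to_utc` are hypothetical helper names, and the sample timestamp is the first candle time shown earlier.

```python
import datetime as dt

MS_PER_DAY = 24 * 60 * 60 * 1000


def window_start_ms(first_trade_ms, days_back):
    """Return the epoch timestamp (ms) `days_back` days before `first_trade_ms`."""
    return first_trade_ms - days_back * MS_PER_DAY


def ms_to_utc(ts_ms):
    """Convert epoch milliseconds to a timezone-aware UTC datetime."""
    return dt.datetime.fromtimestamp(ts_ms / 1000.0, tz=dt.timezone.utc)


start_ms = window_start_ms(1738203660000, days_back=5)
print(ms_to_utc(start_ms).isoformat())  # 2025-01-25T02:21:00+00:00
```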




In [7]:
# Book-ends for our loop.
current_trade = first_trade_minus_5_days_ms
last_trade = candle_data['time_ms'].max()

# Keep track of time in EST (fixed UTC-5 offset; daylight saving is ignored).
# Convert from UTC instead of relabelling the UTC clock time as EST.
est_offset = datetime.timedelta(hours=-5)
est_tz = datetime.timezone(est_offset)
timestamp_ending = datetime.datetime.fromtimestamp(last_trade / 1000.0, tz=est_tz)

# Collecting results
all_trades_made = []
n_trade_requests = 0

# Loop until we collect all of the data, ending at the last date in our candle DF.
while last_trade > current_trade:
    params = {
        'timestamp': current_trade,
        'limit_trades': '100'
    }
    response = requests.get(base_url + "/trades/btcusd", params=params)
    response.raise_for_status()  # fail loudly on HTTP errors
    btcusd_trades = response.json()
    if not btcusd_trades:
        break  # no more trades to collect
    all_trades_made += btcusd_trades

    # Update loop condition
    newest_trade = max(x['timestampms'] for x in btcusd_trades)
    if newest_trade <= current_trade:
        break  # no progress; stop instead of looping forever
    current_trade = newest_trade

    # Good ol' command-line logging. 
    n_trade_requests += 1
    if n_trade_requests % 200 == 0:
        timestamp_est = datetime.datetime.fromtimestamp(newest_trade / 1000.0, tz=est_tz)
        log = (
            f"Retrieved data for {n_trade_requests} requests, {len(all_trades_made)} records collected.\n"
            f"Current date-time: {timestamp_est}.\n"
            f"Terminating date-time: {timestamp_ending}.\n"
        )
        #print(log)



In [8]:
print(f'Total number of records collected: {len(all_trades_made)}')

Total number of records collected: 192900
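
Each request in the loop restarts from the newest timestamp already seen. If the endpoint treats `timestamp` as inclusive (an assumption, not verified here), a trade on a page boundary could be collected twice. The sketch below shows one way to page while dropping repeated `tid` values. `collect_trades`, `fetch_page` and the in-memory `fake_fetch` are hypothetical stand-ins for the HTTP call, so the example runs without network access.

```python
def collect_trades(fetch_page, start_ms, end_ms):
    """Page through trades from start_ms to end_ms, dropping duplicate tids.

    `fetch_page(since_ms)` is a hypothetical callable that returns a list of
    trade dicts containing at least `tid` and `timestampms`.
    """
    seen_tids = set()
    trades = []
    cursor = start_ms
    while cursor < end_ms:
        page = fetch_page(cursor)
        if not page:
            break  # nothing newer is available
        for trade in page:
            if trade["tid"] not in seen_tids:
                seen_tids.add(trade["tid"])
                trades.append(trade)
        newest = max(t["timestampms"] for t in page)
        if newest <= cursor:
            break  # cursor did not advance; stop rather than loop forever
        cursor = newest
    return trades


# In-memory stand-in for the API: up to 2 trades at or after `since_ms`.
FAKE_TRADES = [{"tid": i, "timestampms": 1000 + 10 * i} for i in range(5)]


def fake_fetch(since_ms):
    return [t for t in FAKE_TRADES if t["timestampms"] >= since_ms][:2]


result = collect_trades(fake_fetch, start_ms=1000, end_ms=1040)
print([t["tid"] for t in result])  # [0, 1, 2, 3, 4]
```

One limitation of this sketch: if more trades share a single millisecond than fit in one page, the cursor cannot advance and collection stops early.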


In [9]:
# Tidy-up our results
all_trades = pd.DataFrame(all_trades_made)
all_trades.sort_values('timestamp',ascending=True,inplace=True)
all_trades['real_time'] = pd.to_datetime(all_trades.timestampms, unit='ms')
all_trades['amount'] = all_trades['amount'].astype(float)
all_trades['price']  = all_trades['price'].astype(float)
all_trades['volume'] = all_trades['amount']*all_trades['price']

all_trades.head(2)

Unnamed: 0,timestamp,timestampms,tid,price,amount,exchange,type,real_time,volume
99,1737771661,1737771661019,2840140890597762,104684.85,3.8e-05,gemini,buy,2025-01-25 02:21:01.019,3.98954
98,1737771672,1737771672050,2840140890597774,104684.85,0.0005,gemini,buy,2025-01-25 02:21:12.050,52.342425


In [10]:
## Summarize

trades_summary = all_trades\
    .groupby([all_trades['real_time'].dt.day,all_trades['real_time'].dt.hour])\
    .agg(
        n_trades=('tid', 'count'),
        high_price=('price','max'),
        low_price=('price','min'),
        btc_volume=('amount','sum'),
        usd_volume=('volume','sum'),
        candle_open_time=('timestampms','min'),
        candle_close_time=('timestampms','max')
    )
trades_summary.index.names = ['day','hour']
trades_summary.reset_index(inplace=True)
trades_summary.head(5)

Unnamed: 0,day,hour,n_trades,high_price,low_price,btc_volume,usd_volume,candle_open_time,candle_close_time
0,25,2,151,104765.06,104449.93,1.076861,112651.586683,1737771661019,1737773900461
1,25,3,413,105242.61,104445.11,5.644841,592489.677508,1737774002903,1737777599311
2,25,4,224,105152.33,104607.39,1.356886,142293.770562,1737777612831,1737781196171
3,25,5,387,104654.88,104253.24,3.15725,329946.803732,1737781269891,1737784742364
4,25,6,307,104435.59,104200.0,2.798284,291956.054932,1737784803455,1737788119143
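
The hourly summary above leaves out three of the required columns (trading pair, open price and close price), and the task asks for 1-minute candles. A minimal sketch of how the same trade records could be rolled up into candles with every required column follows. `build_candles` is a hypothetical helper, and it assumes `price` and `amount` are already floats. Unlike the summary above, it sets the candle open and close times to the interval boundaries rather than the first and last trade times, which is a design assumption.

```python
def build_candles(trades, symbol, interval_ms=60_000):
    """Aggregate trade dicts into fixed-width candles (default: 1 minute).

    Each trade needs `timestampms`, `price` and `amount`.
    Returns a list of candle dicts ordered by open time.
    """
    buckets = {}
    for trade in sorted(trades, key=lambda t: t["timestampms"]):
        # Floor the trade time to the start of its interval.
        open_time = trade["timestampms"] // interval_ms * interval_ms
        price, amount = trade["price"], trade["amount"]
        candle = buckets.get(open_time)
        if candle is None:
            buckets[open_time] = {
                "trading_pair": symbol,
                "open_price": price,
                "close_price": price,
                "high_price": price,
                "low_price": price,
                "btc_volume": amount,
                "usd_volume": price * amount,
                "n_trades": 1,
                "candle_open_time": open_time,
                "candle_close_time": open_time + interval_ms - 1,
            }
            continue
        candle["close_price"] = price  # trades are sorted, so the last one wins
        candle["high_price"] = max(candle["high_price"], price)
        candle["low_price"] = min(candle["low_price"], price)
        candle["btc_volume"] += amount
        candle["usd_volume"] += price * amount
        candle["n_trades"] += 1
    return [buckets[key] for key in sorted(buckets)]


# Made-up trades for illustration only.
sample_trades = [
    {"timestampms": 60_000, "price": 100.0, "amount": 1.0},
    {"timestampms": 90_000, "price": 110.0, "amount": 2.0},
    {"timestampms": 119_999, "price": 105.0, "amount": 1.0},
    {"timestampms": 120_000, "price": 104.0, "amount": 0.5},
]
candles = build_candles(sample_trades, "btcusd")
print(len(candles), candles[0]["n_trades"], candles[0]["close_price"])
```

Flooring on the full timestamp also avoids merging the same day-of-month and hour from different months, which a `(day, hour)` grouping would do over longer ranges.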


----

### QA: Let's compare to our original results 

We performed calculations from raw data; let's see how they compare to the aggregated candle data. We'll treat the candle data as our source of truth.

**NOTE** Change the `qa_day` variable to align with the current day. The candle data only provides 1-2 days.

In [16]:
# Let's QA against the following parameters.
qa_day = 30
qa_hours = [4,5]
qa_columns = ['day','hour','high_price','low_price','btc_volume','usd_volume']


candle_summary.loc[((candle_summary.day == qa_day)&(candle_summary.hour.isin(qa_hours))),][qa_columns]

Unnamed: 0,day,hour,high_price,low_price,btc_volume,usd_volume
2,30,4,105250.0,104746.69,115.226542,12042960.0
3,30,5,105400.0,105031.04,46.143723,4822736.0


In [17]:
trades_summary.loc[((trades_summary.day == qa_day)&(trades_summary.hour.isin(qa_hours))),][qa_columns]

Unnamed: 0,day,hour,high_price,low_price,btc_volume,usd_volume
122,30,4,105250.0,104746.69,115.926436,12173110.0
123,30,5,105400.0,105031.04,46.965677,4943322.0


## QA Summary

The high and low prices are the same! This is great. The volumes are slightly off: the BTC volume differs by about 0.6% for hour 4 (computation below) and by about 1.8% for hour 5, taking the candle volume as "correct". 

Moving forward, we'll use the trade data as our de facto submission, as it contains all the relevant information!

In [21]:
candle_value = candle_summary.loc[((candle_summary.day == qa_day)&(candle_summary.hour==qa_hours[0])),'btc_volume'].values[0]
trades_value = trades_summary.loc[((trades_summary.day == qa_day)&(trades_summary.hour==qa_hours[0])),'btc_volume'].values[0]
percent_difference = round(float(abs(candle_value-trades_value)/candle_value*100),3)

print(f'We observe a {percent_difference} percent difference between the candle and trade btc_volume.')

We observe a 0.607 percent difference between the candle and trade btc_volume.
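
The check above covers a single hour. A sketch of the same comparison for every hour present in both summaries follows. `volume_differences` is a hypothetical helper that takes plain dict rows, and the sample values are the hour-4 and hour-5 figures from the two QA tables above.

```python
def volume_differences(candle_rows, trade_rows, column="btc_volume"):
    """Percent difference per shared (day, hour), relative to the candle value."""
    candle_by_key = {(r["day"], r["hour"]): r[column] for r in candle_rows}
    diffs = {}
    for row in trade_rows:
        key = (row["day"], row["hour"])
        reference = candle_by_key.get(key)
        if reference:  # skip hours missing from the candle data (or zero volume)
            diffs[key] = abs(reference - row[column]) / reference * 100
    return diffs


candle_rows = [
    {"day": 30, "hour": 4, "btc_volume": 115.226542},
    {"day": 30, "hour": 5, "btc_volume": 46.143723},
]
trade_rows = [
    {"day": 30, "hour": 4, "btc_volume": 115.926436},
    {"day": 30, "hour": 5, "btc_volume": 46.965677},
]
for key, pct in volume_differences(candle_rows, trade_rows).items():
    print(key, round(pct, 3))
```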


## Datalake

We'll store our dataset in S3. 

Storage architecture:      
* [year]/[month]/trade_data_[ticker_name]_[time_collected].parquet
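
The path layout can be captured in a small function so that every writer builds keys the same way. This is a sketch: `build_s3_path` is a hypothetical helper, and zero-padding the month (so keys sort lexicographically) is an assumption that differs from the unpadded month used in this notebook.

```python
import datetime as dt


def build_s3_path(bucket, symbol, collected_at):
    """Build s3://<bucket>/<year>/<month>/trade_data_<symbol>_<HH-MM-SS>.parquet."""
    return (
        f"s3://{bucket}/{collected_at.year}/{collected_at.month:02d}/"
        f"trade_data_{symbol}_{collected_at:%H-%M-%S}.parquet"
    )


when = dt.datetime(2025, 1, 30, 21, 30, 39, tzinfo=dt.timezone.utc)
print(build_s3_path("gemini-takehome", "btcusd", when))
# s3://gemini-takehome/2025/01/trade_data_btcusd_21-30-39.parquet
```

Because the file name carries only the time of day, two runs at the same clock time on different days of one month would share a key; adding the day to the name is one way to avoid that.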

In [22]:
s3_bucket = 'gemini-takehome'
# Get current date and time.
# Reuse the `datetime` module imported earlier; `from datetime import datetime`
# would shadow it and break the earlier cells if they were re-run.
now = datetime.datetime.now()
current_year = now.year
current_month = now.month
current_time = now.strftime("%H-%M-%S")  # Format time as hour-minute-second
file_type = '.parquet'

s3_path = f's3://{s3_bucket}/{current_year}/{current_month}/trade_data_{symbol}_{current_time}{file_type}'
print(f'Saving data to s3_path: {s3_path}')

Saving data to s3_path: s3://gemini-takehome/2025/1/trade_data_btcusd_21-30-39.parquet


In [23]:
# Save the DataFrame as a Parquet file to S3.
# Writing to an s3:// path requires the s3fs package and valid AWS credentials.
trades_summary.to_parquet(s3_path, engine='pyarrow')

print(f"File successfully saved to {s3_path}")

File successfully saved to s3://gemini-takehome/2025/1/trade_data_btcusd_21-30-39.parquet


In [26]:
# Save our datasets locally, just in case. 
trades_summary.to_csv('data/trade_summary.csv',index=False)
candle_summary.to_csv('data/candle_summary.csv',index=False)

----
Great, we've finally made it to the end of our journey. We've successfully:
1. Collected accurate trade data
2. Analyzed the data and QA'd it against Gemini's in-house candle dataset
3. Uploaded the results to S3!

This was fun. Thanks for your consideration.

## Thank you!

-Andrew