# Initial Project Exploration


## Project Steps
- Step 1: Scope the Project and Gather Data
- Step 2: Explore and Assess the Data
- Step 3: Define the Data Model
- Step 4: Run ETL to Model the Data
- Step 5: Complete Project Write Up

## Project Requirements

### Write up
- Project scope outlines project steps and defines purpose of final data model
- Other scenarios are addressed
    - data increased by 100x
    - pipelines run daily by 7 am
    - database needs access by 100+ people
- Choice of tools and tech defended well

### Execution
- Code is clean and modular
- Includes **2 data quality checks**
- ETL results in the planned data model
- Data dictionary for the data model is included
- **2 data sources** (and data formats)
- More than **1 million rows**

In [1]:
import pandas as pd
import requests
import datetime
import time 
import copy
import coingecko

## Initial Project Idea
- My initial project idea is to create a data lake/warehouse that supports decision making for cryptocurrency trades and can be used to develop machine learning models for price prediction.
- To keep the universe relatively small, I've decided to primarily collect data on currencies listed on the Osmosis exchange (a DEX with low trading fees).
- I've also included top-ranked coins, since these can serve as overall market proxies.

## CoinGecko API for metadata and prices
After exploring a variety of options, I've decided to use CoinGecko's free-tier API to collect price data and metadata:
- Swagger documentation can be found <a href='https://www.coingecko.com/en/api/documentation?'>here</a>
- Has relatively high call limits
- Provides very granular price and volume data
- Best option I found for free-tier usage
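The `coingecko` module imported above is a local helper whose source is not shown here. As a rough illustration of the kind of call it wraps, the sketch below queries CoinGecko's public `/coins/{id}` endpoint and flattens the response into the columns that appear in `meta_df`. The function names and the field mapping are assumptions for illustration, not the helper's actual implementation, and the sketch uses `urllib` from the standard library rather than `requests` to stay dependency-free.

```python
import json
import urllib.request

API_ROOT = "https://api.coingecko.com/api/v3"  # public CoinGecko v3 base URL


def flatten_coin_metadata(payload):
    """Reduce a /coins/{id} JSON payload to one flat record (hypothetical mapping)."""
    links = payload.get("links") or {}
    github_repos = (links.get("repos_url") or {}).get("github") or []
    return {
        "id": payload.get("id"),
        "symbol": payload.get("symbol"),
        "name": payload.get("name"),
        "block_time_in_minutes": payload.get("block_time_in_minutes"),
        "hashing_algorithm": payload.get("hashing_algorithm"),
        "genesis_date": payload.get("genesis_date"),
        "twitter_screen_name": links.get("twitter_screen_name"),
        "subreddit_url": links.get("subreddit_url"),
        "description": (payload.get("description") or {}).get("en"),
        # Keep only the first GitHub repository, if any is listed
        "github_url": github_repos[0] if github_repos else None,
    }


def fetch_coin_metadata(coin_id, timeout=30):
    """Fetch one coin's metadata; the query flags skip the large optional sections."""
    url = (
        f"{API_ROOT}/coins/{coin_id}"
        "?localization=false&tickers=false&market_data=false"
    )
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return flatten_coin_metadata(json.load(resp))
```

Keeping the flattening step separate from the network call makes it possible to check the mapping against a saved response without spending API calls.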

In [6]:
# Below we create and display our metadata DataFrame, which contains high-level info about all the coins we are collecting data on
osmosis_coins = ['osmosis','cosmos','terrausd','terra-luna','juno-network','stargaze','secret','comdex','crypto-com-chain','akash-network','ion','sentinel','chihuahua-token','e-money-eur','regen','persistence','lum-network','e-money','bitcanna','iris-network','desmos','ki','bitsong','likecoin','cheqd-network','ixo','starname','vidulum','microtick']
top_coins = ['bitcoin','ethereum', 'binancecoin', 'cardano', 'solana', 'ripple','polkadot','dogecoin','avalanche-2','shiba-inu','matic-network','crypto-com-chain']
all_coins = osmosis_coins+top_coins

all_meta_data = []

for coin in all_coins:
    all_meta_data.append(coingecko.get_coin_metadata(coin))
    
print('done')

meta_df = pd.DataFrame(all_meta_data)

osmosis : 200
cosmos : 200
terrausd : 200
terra-luna : 200
juno-network : 200
stargaze : 200
secret : 200
comdex : 200
crypto-com-chain : 200
akash-network : 200
ion : 200
sentinel : 200
chihuahua-token : 200
e-money-eur : 200
regen : 200
persistence : 200
lum-network : 200
e-money : 200
bitcanna : 200
iris-network : 200
desmos : 200
ki : 200
bitsong : 200
likecoin : 200
cheqd-network : 200
ixo : 200
starname : 200
vidulum : 200
microtick : 200
bitcoin : 200
ethereum : 200
binancecoin : 200
cardano : 200
solana : 200
ripple : 200
polkadot : 200
dogecoin : 200
avalanche-2 : 200
shiba-inu : 200
matic-network : 200
crypto-com-chain : 200
done
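Note that `crypto-com-chain` appears in both `osmosis_coins` and `top_coins`, so it is requested twice in the log above and would end up as a duplicate row in `meta_df`. One possible way to avoid this is an order-preserving de-duplication before the loop, sketched here with shortened lists:

```python
# Shortened lists for illustration; the id in question appears in both
osmosis_coins = ["osmosis", "cosmos", "crypto-com-chain"]
top_coins = ["bitcoin", "ethereum", "crypto-com-chain"]

# dict.fromkeys keeps the first occurrence of each id and preserves order
all_coins = list(dict.fromkeys(osmosis_coins + top_coins))
print(all_coins)
# ['osmosis', 'cosmos', 'crypto-com-chain', 'bitcoin', 'ethereum']
```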


In [7]:
meta_df.head()

| | id | symbol | name | block_time_in_minutes | hashing_algorithm | genesis_date | twitter_screen_name | subreddit_url | description | github_url |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | osmosis | osmo | Osmosis | 0 | | | osmosiszone | | Token of the Osmosis Hub, first DEX for IBC co... | https://github.com/osmosis-labs/osmosis |
| 1 | cosmos | atom | Cosmos | 0 | | | cosmos | https://www.reddit.com/r/cosmosnetwork | The Cosmos network consists of many independen... | https://github.com/cosmos/cosmos |
| 2 | terrausd | ust | TerraUSD | 0 | SHA-256 | | terra_money | | Terra USD (UST) is an algorithmic stablecoin t... | https://github.com/terra-project |
| 3 | terra-luna | luna | Terra | 0 | | | terra_money | https://www.reddit.com/r/terraluna/ | Terra is a decentralized financial payment net... | https://github.com/terra-project/core |
| 4 | juno-network | juno | JUNO | 0 | | | JunoNetwork | https://www.reddit.com | Use & create cross-chain applications on Juno.... | https://github.com/CosmosContracts |


In [10]:
coin_ids = list(meta_df['id'])


start_date = datetime.datetime(2000, 1,1) - datetime.timedelta(hours=8) #to fix strange timezone error
end_date = datetime.datetime.now()

r_count = len(coin_ids)

all_coins_dfs = []

for coin_id in coin_ids:
    response = coingecko.get_hourly_prices(coin_id, 'usd', start_date, end_date, r_count)
    r_count = response['request_count']
    all_coins_dfs.append(response['df'])
    
    
price_df = pd.concat(all_coins_dfs)

['osmosis',
 'cosmos',
 'terrausd',
 'terra-luna',
 'juno-network',
 'stargaze',
 'secret',
 'comdex',
 'crypto-com-chain',
 'akash-network',
 'ion',
 'sentinel',
 'chihuahua-token',
 'e-money-eur',
 'regen',
 'persistence',
 'lum-network',
 'e-money',
 'bitcanna',
 'iris-network',
 'desmos',
 'ki',
 'bitsong',
 'likecoin',
 'cheqd-network',
 'ixo',
 'starname',
 'vidulum',
 'microtick',
 'bitcoin',
 'ethereum',
 'binancecoin',
 'cardano',
 'solana',
 'ripple',
 'polkadot',
 'dogecoin',
 'avalanche-2',
 'shiba-inu',
 'matic-network',
 'crypto-com-chain']
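The `r_count` value passed into and returned from `get_hourly_prices` looks like a running request counter, presumably used to stay under the API's call limit. That reading is an assumption, since the helper's source is not shown. A minimal sketch of one way to enforce such a limit is a sliding-window throttle; the class name and the limit values below are hypothetical and should be set from the API's documented limits.

```python
import time
from collections import deque


class RequestThrottle:
    """Allow at most `max_calls` requests per `period` seconds (sliding window)."""

    def __init__(self, max_calls, period=60.0, clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.period = period
        self._clock = clock    # injectable for testing
        self._sleep = sleep    # injectable for testing
        self._sent = deque()   # timestamps of requests inside the window

    def wait(self):
        """Block until one more request fits in the window, then record it."""
        while True:
            now = self._clock()
            # Forget requests that have left the sliding window
            while self._sent and now - self._sent[0] >= self.period:
                self._sent.popleft()
            if len(self._sent) < self.max_calls:
                self._sent.append(now)
                return
            # Window is full: sleep until the oldest request expires
            self._sleep(self.period - (now - self._sent[0]))
```

Calling `throttle.wait()` immediately before each API request would keep the loop over `coin_ids` within the configured rate without tracking a counter by hand.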

In [2]:
coin_id = 'cosmos'

start_date = datetime.datetime(2000, 1,1) - datetime.timedelta(hours=8) #to fix strange timezone error
end_date = datetime.datetime.now()



response = coingecko.get_hourly_prices(coin_id, 'usd', start_date, end_date)

date_range: 2019-02-24 00:00:00-2022-01-30 00:00:00; frequency: daily ; cosmos: 200
date_range: 2019-02-24 09:00:08.444000-2019-05-25 06:01:41.312000; frequency: hourly ; cosmos: 200
date_range: 2019-05-25 14:00:03.615000-2019-08-13 12:01:43.267000; frequency: hourly ; cosmos: 200
date_range: 2019-08-13 19:03:40.871000-2019-11-01 18:05:18.871000; frequency: hourly ; cosmos: 200
date_range: 2019-11-02 02:04:57.861000-2020-01-21 01:01:45.183000; frequency: hourly ; cosmos: 200
date_range: 2020-01-21 09:07:55.831000-2020-04-10 07:06:46.200000; frequency: hourly ; cosmos: 200
date_range: 2020-04-10 15:04:57.962000-2020-06-29 14:02:11.647000; frequency: hourly ; cosmos: 200
date_range: 2020-06-29 21:03:18.280000-2020-09-17 20:04:34.196000; frequency: hourly ; cosmos: 200
date_range: 2020-09-18 03:07:16.451000-2020-12-07 03:06:35.364000; frequency: hourly ; cosmos: 200
date_range: 2020-12-07 11:15:02.423000-2021-02-25 11:03:48.383000; frequency: hourly ; cosmos: 200
date_range: 2021-02-25 19
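The log above suggests that the helper first makes one daily-frequency request over the full range and then walks the coin's history in windows of roughly 90 days to obtain hourly data. The windowing step could be sketched as follows; the function name and the 90-day default are assumptions, not the helper's actual code. The sketch also uses timezone-aware UTC datetimes, because `.timestamp()` interprets a naive `datetime` in local time. That may be the source of the timezone issue that the 8-hour adjustment to `start_date` works around, though this is a guess.

```python
import datetime


def split_into_windows(start, end, max_days=90):
    """Yield consecutive (window_start, window_end) pairs covering start..end."""
    step = datetime.timedelta(days=max_days)
    cursor = start
    while cursor < end:
        window_end = min(cursor + step, end)
        yield cursor, window_end
        cursor = window_end


utc = datetime.timezone.utc
start = datetime.datetime(2019, 2, 24, tzinfo=utc)
end = datetime.datetime(2022, 1, 30, tzinfo=utc)

windows = list(split_into_windows(start, end))

# Unix timestamps (seconds) for each window, independent of the local timezone
unix_windows = [(int(a.timestamp()), int(b.timestamp())) for a, b in windows]
```

Each pair in `unix_windows` could then be used as the start and end of one hourly price request.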

In [3]:
merged_df = response['df']

In [6]:
len(merged_df)

25369