# Collecting Data
When I first started looking into how I'd like to use machine learning for trading, I thought about what data was out there for me to look at that could *potentially* help me with informed trading. It's possible that data I collect could be informative or uninformative, but I needed to collect the data first and perform some analysis before I could arrive at any conclusion. There were three sources of data that I wanted to take a look at when I started:  
1. Traditional price data (open, close, volume, etc.)
2. Wikipedia pageviews
3. Reddit metrics

This notebook goes over how I gathered that data and how I stored them as simple .csv's. (Note that as data grows in volume and variety, using .csv's is non-ideal)

# Price Data
At the end of the day, my expectations of future prices is what is going to drive my trading bot. It's pretty obvious that price data will be needed for any informed trading I wish to do. The ccxt library offers a convenient way to gather OHLCV (open, high, low, close, volume) data for trading cryptocurrency.

In [1]:
import pandas as pd

from datetime import datetime
import ccxt
import plotly.graph_objects as go

In [2]:
binance = ccxt.binance()
pairs = ['BTC/USDT', 'ETH/USDT', 'XRP/USDT']

In [3]:
dfs = []

for p in pairs:
    candles = binance.fetch_ohlcv(p, '1d')
    
    pair = []
    date = []
    open_rate = []
    high_rate = []
    low_rate = []
    close_rate = []
    volume = []
    
    for candle in candles:
        pair.append(p)
        date.append(datetime.fromtimestamp(candle[0] / 1000.0).strftime('%Y-%m-%d'))
        open_rate.append(candle[1])
        high_rate.append(candle[2])
        low_rate.append(candle[3])
        close_rate.append(candle[4])
        volume.append(candle[5])
    
    df = pd.DataFrame({
        'pair' : pair,
        'date' : date,
        'open' : open_rate,
        'high' : high_rate,
        'low' : low_rate,
        'close' : close_rate,
        'volume' : volume
    })
    
    dfs.append(df)
    
crypto_prices = pd.concat(dfs).reset_index(drop=True)

In [4]:
crypto_prices.head()

Unnamed: 0,pair,date,open,high,low,close,volume
0,BTC/USDT,2018-10-25,6528.13,6600.0,6515.01,6538.63,9443.504156
1,BTC/USDT,2018-10-26,6538.63,6558.35,6463.04,6505.6,7776.907864
2,BTC/USDT,2018-10-27,6505.6,6514.17,6453.0,6489.93,5743.933728
3,BTC/USDT,2018-10-28,6489.93,6505.01,6315.0,6344.5,13827.224067
4,BTC/USDT,2018-10-29,6344.5,6395.0,6317.01,6330.87,8877.175385


In [18]:
crypto_prices.to_csv('../data/crypto_prices.csv')

# Wikipedia Pageviews
Another variable that interested me is seeing how many people are looking up topics such as cryptocurrency, Bitcoin, etc. on Wikipedia. The Wikimedia provides the mwviews library to collect this information. My intuition tells me that higher page views would correlate with higher interest in a certain cryptocurrency, but it could also be because they were mentioned a lot in the news recently (whether good or bad). 

In [5]:
from mwviews.api import PageviewsClient

In [6]:
start_date = crypto_prices.date.min().replace('-','')
end_date = crypto_prices.date.max().replace('-','')

In [7]:
pv = PageviewsClient(user_agent='Gathering cryptocurrency pageview information')

In [8]:
views = pv.article_views('en.wikipedia', ['Bitcoin', 'Ripple_(payment_protocol)', 'Ethereum', 'Cryptocurrency'],
                granularity='daily', start=start_date, end=end_date)

In [21]:
wiki_views = pd.DataFrame.from_dict(views, orient='index')
wiki_views = wiki_pv.rename(str.lower, axis='columns')
wiki_views = wiki_pv.reset_index().rename(columns={'index':'date', 'ripple_(payment_protocol)':'ripple'})

In [22]:
wiki_views.head()

Unnamed: 0,date,date.1,bitcoin,ripple,ethereum,cryptocurrency
0,0,2018-10-25,10190,1079,1899,5334
1,1,2018-10-26,9548,952,1788,5563
2,2,2018-10-27,9565,717,1543,3802
3,3,2018-10-28,8465,802,1842,4097
4,4,2018-10-29,11526,973,2391,5107


In [23]:
wiki_views.to_csv('../data/wiki_views.csv')

# Reddit Metrics
There are countless subreddits on Reddit relating to cryptocurrency, whether they are to discuss specific coins (/r/bitcoin, /r/ethereum, /r/ripple, and more), trading (/r/cryptocurrencytrading), general cryptocurrency (/r/cryptocurrency), or more. They provide countless variables that can be looked at for trading insights: comment and submission count, subreddit growth, or text data in the form of comments and submissions.  

Unlike the data for exchange rates or Wikipedia pageviews, retrieving data from Reddit requires signing up for an account and using your own API key. Reddit provides the conditions and instructions to use their API [here](https://www.reddit.com/wiki/api).

# References
[1] https://medium.com/coinmonks/python-scripts-for-ccxt-crypto-candlestick-ohlcv-charting-data-83926fa16a13  
[2] https://blog.wikimedia.org/2015/12/14/pageview-data-easily-accessible/