# Overview
When I first started looking into using machine learning for trading, I thought about what data was available that could *potentially* inform my trades. Any data I collect may turn out to be informative or uninformative, but I need to collect it and analyze it before I can draw any conclusion. I started with three sources of data:  
1. Traditional price data (open, close, volume, etc.)
2. Wikipedia pageviews
3. Reddit submissions

This notebook goes over how I gathered that data and stored it as simple .csv files. (Note that as the data grows in volume and variety, .csv files become less than ideal.)

# Price Data
At the end of the day, my expectations of future prices are what will drive my trading bot, so price data is needed for any informed trading I wish to do. The ccxt library offers a convenient way to gather OHLCV (open, high, low, close, volume) data for cryptocurrency trading pairs.
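
To make the shape of that data concrete: `fetch_ohlcv` returns a list of candles, each a list of `[timestamp_ms, open, high, low, close, volume]`. Assuming that layout, a list of candles can be loaded into a DataFrame in one step. This is only a sketch; the helper name `candles_to_frame` and the hand-written `sample` candles are for illustration and are not part of ccxt.

```python
import pandas as pd

def candles_to_frame(candles, pair):
    """Turn rows of [timestamp_ms, open, high, low, close, volume] into a DataFrame."""
    df = pd.DataFrame(candles, columns=['date', 'open', 'high', 'low', 'close', 'volume'])
    # Timestamps are milliseconds since the Unix epoch (UTC).
    df['date'] = pd.to_datetime(df['date'], unit='ms').dt.strftime('%Y-%m-%d')
    # Put the pair name in the first column.
    df.insert(0, 'pair', pair)
    return df

# Two hand-written candles in the assumed layout (illustrative values).
sample = [
    [1545350400000, 3840.25, 3979.0, 3785.0, 3948.91, 42822.350872],
    [1545436800000, 3948.91, 4021.53, 3870.0, 3929.71, 40117.531529],
]
sample_df = candles_to_frame(sample, 'BTC/USDT')
print(sample_df)
```

This does the same job as building each column with a list, just with less bookkeeping.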

In [1]:
import pandas as pd

from datetime import datetime
import ccxt
import plotly.graph_objects as go

In [2]:
binance = ccxt.binance()
pairs = ['BTC/USDT', 'ETH/USDT', 'XRP/USDT']

In [3]:
dfs = []

for p in pairs:
    candles = binance.fetch_ohlcv(p, '1d')
    
    pair = []
    date = []
    open_rate = []
    high_rate = []
    low_rate = []
    close_rate = []
    volume = []
    
    for candle in candles:
        pair.append(p)
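        # Candle timestamps are UTC milliseconds; convert in UTC so the date doesn't shift with the local timezone
        date.append(pd.to_datetime(candle[0], unit='ms').strftime('%Y-%m-%d'))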
        open_rate.append(candle[1])
        high_rate.append(candle[2])
        low_rate.append(candle[3])
        close_rate.append(candle[4])
        volume.append(candle[5])
    
    df = pd.DataFrame({
        'pair' : pair,
        'date' : date,
        'open' : open_rate,
        'high' : high_rate,
        'low' : low_rate,
        'close' : close_rate,
        'volume' : volume
    })
    
    dfs.append(df)
    
crypto_prices = pd.concat(dfs).reset_index(drop=True)

In [4]:
crypto_prices.head()

|   | pair | date | open | high | low | close | volume |
|---|---|---|---|---|---|---|---|
| 0 | BTC/USDT | 2018-12-21 | 3840.25 | 3979.0 | 3785.0 | 3948.91 | 42822.350872 |
| 1 | BTC/USDT | 2018-12-22 | 3948.91 | 4021.53 | 3870.0 | 3929.71 | 40117.531529 |
| 2 | BTC/USDT | 2018-12-23 | 3929.71 | 4198.0 | 3924.83 | 4008.01 | 64647.809129 |
| 3 | BTC/USDT | 2018-12-24 | 4010.11 | 4020.0 | 3646.41 | 3745.79 | 62725.629432 |
| 4 | BTC/USDT | 2018-12-25 | 3745.56 | 3837.15 | 3656.74 | 3777.74 | 42629.375817 |


In [5]:
crypto_prices.to_csv('../data/crypto_prices.csv')

# Wikipedia Pageviews
Another variable that interested me is how many people are looking up topics such as cryptocurrency, Bitcoin, etc. on Wikipedia. The mwviews library, a Python client for the Wikimedia pageview API, makes it easy to collect this information. My intuition is that higher pageviews could correlate with higher interest in a given cryptocurrency, which could make them an indicator of price movements.
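
As a rough picture of the data involved: `article_views` returns a nested dictionary keyed by date, with one inner dictionary of article name to view count per day. Assuming that structure, the sketch below shapes a few hand-written values into a table without calling the API. The `sample_views` dictionary is made up for illustration.

```python
from datetime import datetime

import pandas as pd

# Hand-written stand-in for the API response: {date: {article: views}}.
sample_views = {
    datetime(2018, 12, 21): {'Bitcoin': 12817, 'Ethereum': 1445},
    datetime(2018, 12, 22): {'Bitcoin': 10507, 'Ethereum': 1216},
}

# One row per date, one column per article.
sample_wiki = pd.DataFrame.from_dict(sample_views, orient='index')
sample_wiki = sample_wiki.rename(str.lower, axis='columns')
# Move the date index into a regular 'date' column.
sample_wiki = sample_wiki.rename_axis('date').reset_index()
print(sample_wiki)
```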

In [6]:
from mwviews.api import PageviewsClient

In [7]:
start_date = crypto_prices.date.min().replace('-','')
end_date = crypto_prices.date.max().replace('-','')

In [8]:
pv = PageviewsClient(user_agent='Gathering cryptocurrency pageview information')

In [9]:
views = pv.article_views('en.wikipedia', ['Bitcoin', 'Ripple_(payment_protocol)', 'Ethereum', 'Cryptocurrency'],
                granularity='daily', start=start_date, end=end_date)

In [10]:
wiki_views = pd.DataFrame.from_dict(views, orient='index')
wiki_views = wiki_views.rename(str.lower, axis='columns')
wiki_views = wiki_views.reset_index().rename(columns={'index':'date', 'ripple_(payment_protocol)':'ripple'})

In [11]:
wiki_views.head()

|   | date | bitcoin | ripple | ethereum | cryptocurrency |
|---|---|---|---|---|---|
| 0 | 2018-12-21 | 12817 | 863 | 1445 | 3482 |
| 1 | 2018-12-22 | 10507 | 691 | 1216 | 3627 |
| 2 | 2018-12-23 | 9330 | 792 | 1301 | 3158 |
| 3 | 2018-12-24 | 9452 | 854 | 1299 | 3398 |
| 4 | 2018-12-25 | 9181 | 824 | 1134 | 3117 |


In [12]:
wiki_views.to_csv('../data/wiki_views.csv')

# Reddit Submissions
There are countless subreddits on Reddit relating to cryptocurrency: some discuss specific coins (/r/bitcoin, /r/ethereum, /r/ripple, and more), some cover trading (/r/cryptocurrencytrading), and others cover cryptocurrency in general (/r/cryptocurrency). They offer many variables that can be examined for trading insights: comment and submission counts, subreddit growth, or the text of comments and submissions. For this example, I'm only saving the titles of the top submissions (along with their score and creation date) so that I can later perform sentiment analysis on them.

Unlike the data for exchange rates or Wikipedia pageviews, retrieving data from Reddit requires signing up for an account and using your own API key. Reddit provides the conditions and instructions to use their API [here](https://www.reddit.com/wiki/api).

I'm not going to share my Reddit keys here, but once you have your own CLIENT_ID and CLIENT_SECRET for Reddit, see [this article](https://towardsdatascience.com/how-to-hide-your-api-keys-in-python-fb2e1a61b0a0) for how to set them as environment variables like I did.
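
One thing to watch for: `os.environ.get` returns `None` when a variable isn't set, so a missing key would presumably only surface later as a less obvious error from the Reddit client. A minimal sketch of failing early instead is shown here; `require_env` is a hypothetical helper, and `EXAMPLE_CLIENT_ID` is a dummy variable used only so the snippet runs on its own.

```python
import os

def require_env(name):
    """Return the value of an environment variable, or raise if it is missing or empty."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f'Environment variable {name} is not set')
    return value

# Illustrative only: set a dummy value so the example is self-contained.
os.environ.setdefault('EXAMPLE_CLIENT_ID', 'dummy-id')
example_id = require_env('EXAMPLE_CLIENT_ID')
```

With real keys, the same helper could be called with `'CLIENT_ID'` and `'CLIENT_SECRET'` in place of the dummy name.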

In [13]:
import praw
import os

In [14]:
CLIENT_ID = os.environ.get('CLIENT_ID')
CLIENT_SECRET = os.environ.get('CLIENT_SECRET')

In [15]:
reddit = praw.Reddit(client_id=CLIENT_ID, client_secret=CLIENT_SECRET,
                    user_agent='gathering cryptocurrency community data - /u/crafting_vh')

In [16]:
subreddits = ['bitcoin', 'ethereum', 'ripple', 'cryptocurrency']

dfs = []
for sub in subreddits:
    subreddit = []
    title = []
    score = []
    created_utc = []
    
    # Top 100 submissions of all time; time_filter is passed by keyword for clarity
    for s in reddit.subreddit(sub).top(time_filter='all', limit=100):
        subreddit.append(sub)
        title.append(s.title)
        score.append(s.score)
        created_utc.append(s.created_utc)
        
    df = pd.DataFrame({
        'subreddit' : subreddit,
        'title' : title,
        'score' : score,
        'created_utc' : created_utc
    })
    
    dfs.append(df)
        
reddit_submissions = pd.concat(dfs).reset_index(drop=True)
reddit_submissions.created_utc = pd.to_datetime(reddit_submissions.created_utc, unit='s').dt.strftime('%Y-%m-%d')

In [17]:
reddit_submissions.head()

|   | subreddit | title | score | created_utc |
|---|---|---|---|---|
| 0 | bitcoin | It's official! 1 Bitcoin = $10,000 USD | 48506 | 2017-11-29 |
| 1 | bitcoin | The last 3 months in 47 seconds. | 48471 | 2018-02-04 |
| 2 | bitcoin | It's over 9000!!! | 42435 | 2017-11-26 |
| 3 | bitcoin | Everyone who's trading BTC right now | 42042 | 2018-01-06 |
| 4 | bitcoin | Quick, upvote this to confuse everyone into th... | 36853 | 2019-07-24 |


In [18]:
reddit_submissions.to_csv('../data/reddit_submissions.csv')

# References
[1] https://medium.com/coinmonks/python-scripts-for-ccxt-crypto-candlestick-ohlcv-charting-data-83926fa16a13  
[2] https://blog.wikimedia.org/2015/12/14/pageview-data-easily-accessible/  
[3] https://towardsdatascience.com/how-to-hide-your-api-keys-in-python-fb2e1a61b0a0