# COGS 108 - Data Checkpoint

# Names

- Andrew Hernandez
- Austin Nguyen
- Christian Kim
- Kevin De Silva Jayasinghe

<a id='research_question'></a>
# Research Question

*Does an increase in tweets with the hashtag description '#Bitcoin' correlate with an increase in Bitcoin’s price on a weekly basis?*

# Dataset(s)

*Fill in your dataset information here*

(Copy this information for each dataset)
- Dataset Name: Bitcoin USD Price Data
- Link to the dataset: https://www.marketwatch.com/investing/cryptocurrency/btcusd/download-data
- Number of observations: 106

This data set contains price history between 2019-2020 for Bitcoin on a weekly basis showing open, high, low and close price. 


- Dataset Name: Bitcoin Tweets Historical chart
- Link to the dataset: https://bitinfocharts.com/comparison/tweets-btc-ltc-eth.html#3y
- Number of observations: 730

This dataset is from a website that tracks the number of tweets with '#Bitcoin' per day from 4-09-2014 to present day, however we will only be using data from 01/01/2019 - 01/01/2021. By analyzing the bitcoin dataset we observed that this time frame was one of the most volatile periods for Bitcoin. We have narrowed it down to this time period in order to decrease our sample size.


# Setup

In [17]:
import pandas as pd
import numpy as np
import requests 
from bs4 import BeautifulSoup
import re

# Data Cleaning

Describe your data cleaning steps here.

In [27]:

# weekly price data from 01/01/2019 - 01/01/2021
df = pd.read_csv('btc_2019.csv')

#remove commas from values in order to typcast into int
df = df.replace(',','', regex=True)
df = df.astype({'Open': 'int32', 'High': 'int32', 'Low': 'int32', 'Close': 'int32'})

# create a new column containg the percent change for the week
df = df.assign(percent_change = (df['Close'] - df['Open']) / df['Open'])

#weekly price data from 01/01/2020 - 01/01/2021
df_2020 = pd.read_csv('btc_2020.csv')
df_2020 = df.replace(',','', regex=True)
df_2020 = df.astype({'Open': 'int32', 'High': 'int32', 'Low': 'int32', 'Close': 'int32'})
df_2020 = df_2020.assign(percent_change = (df['Close'] - df['Open']) / df['Open'])

# concatenate both years' data
df.append(df_2020)

# Our data was already mostly clean we only had to typecast the strings into integers and remove commas
# We then added a column which contains percent change which is what we will use to determine
# if bitcoins price correlates to the amount of tweets containing #Bitcoin
# The data sets were split up by year so we had to concatenate them as well


Unnamed: 0,Date,Open,High,Low,Close,percent_change
0,1/1/2021,26666,29650,25824,29259,0.097240
1,12/26/2020,23988,26753,22080,26648,0.110889
2,12/19/2020,18800,24128,18734,24004,0.276809
3,12/12/2020,19019,19429,17604,18794,-0.011830
4,12/5/2020,17765,19928,17556,19019,0.070588
...,...,...,...,...,...,...
48,2/1/2020,8377,9559,8294,9371,0.118658
49,1/25/2020,8910,9176,8230,8376,-0.059933
50,1/18/2020,8102,8992,7964,8910,0.099728
51,1/11/2020,7349,8454,7327,8106,0.103007


In [29]:
# We found this web scraping script at https://stackoverflow.com/questions/47730259/installing-urllib-in-python3-6 
# and made some minor modifications in order to fit to our needs in terms of retrieving the amount of tweets on a daily basis.
# We will have to further clean the data by extracting all days between 01/01/2019 - 01/01/2021 and then 
# grouping the data into a weekly format (sum amount of tweets  for all days in a given week) 
# in order to match our Bitcoin price dataset

def parse_strlist(sl):
    clean = re.sub("[\[\],\s]","",sl)
    splitted = re.split("[\'\"]",clean)
    values_only = [s for s in splitted if s != '']
    return values_only


url = 'https://bitinfocharts.com/comparison/tweets-btc.html#1y'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

scripts = soup.find_all('script', text=True)
for script in scripts:
    if 'd = new Dygraph(document.getElementById("container")' in str(script):
        StrList = str(script)
        StrList = '[[' + StrList.split('[[')[-1]
        StrList = StrList.split(']]')[0] +']]'
        StrList = StrList.replace("new Date(", '').replace(')','')
        dataList = parse_strlist(StrList)

        date = []
        tweet = []
        for each in dataList:
            if (dataList.index(each) % 2) == 0:
                date.append(each)
            else:
                tweet.append(each)
        df = pd.DataFrame(list(zip(date, tweet)), columns=["Date","BTC- Tweets"])
df

Unnamed: 0,Date,BTC- Tweets
0,2014/04/09,8193
1,2014/04/10,15039
2,2014/04/11,14907
3,2014/04/12,7582
4,2014/04/13,10674
...,...,...
2580,2021/05/02,78756
2581,2021/05/03,89710
2582,2021/05/04,87287
2583,2021/05/05,93866
