# COGS 108 - Data Checkpoint

# Names

- Andrew Hernandez
- Austin Nguyen
- Christian Kim
- Kevin De Silva Jayasinghe

<a id='research_question'></a>
# Research Question

*Does an increase in tweets with the hashtag description '#Bitcoin' correlate with an increase in Bitcoin’s price on a weekly basis?*

# Dataset(s)

*Fill in your dataset information here*

(Copy this information for each dataset)
- Dataset Name: Bitcoin USD Price Data
- Link to the dataset: https://www.marketwatch.com/investing/cryptocurrency/btcusd/download-data
- Number of observations: 108

This data set contains price history between 2019-2020 for Bitcoin on a weekly basis showing open, high, low and close price. 


- Dataset Name: Bitcoin Tweets Historical chart
- Link to the dataset: https://bitinfocharts.com/comparison/tweets-btc-ltc-eth.html#3y
- Number of observations: 730

This dataset is from a website that tracks the number of tweets with '#Bitcoin' per day from 4-09-2014 to present day, however we will only be using data from 01/05/2019 - 01/12/2021. By analyzing the bitcoin dataset we observed that this time frame was one of the most volatile periods for Bitcoin. We have narrowed it down to this time period in order to decrease our sample size.


# Setup

In [2]:
import pandas as pd
import numpy as np
import requests 
from bs4 import BeautifulSoup
import re

# Data Cleaning

Describe your data cleaning steps here.

In [6]:

# weekly price data from 01/05/2019 - 01/06/2020
df = pd.read_csv('btc_2019.csv')

#remove commas from values in order to typcast into int
df = df.replace(',','', regex=True)
df = df.astype({'Open': 'int32', 'High': 'int32', 'Low': 'int32', 'Close': 'int32'})

# create a new column containg the percent change for the week
df = df.assign(percent_change = (df['Close'] - df['Open']) / df['Open'])

#weekly price data from 01/11/2020 - 01/12/2021
df_2020 = pd.read_csv('btc_2020.csv')
df_2020 = df.replace(',','', regex=True)
df_2020 = df.astype({'Open': 'int32', 'High': 'int32', 'Low': 'int32', 'Close': 'int32'})
df_2020 = df_2020.assign(percent_change = (df['Close'] - df['Open']) / df['Open'])

# concatenate both years' data
df.append(df_2020)

# Our data was already mostly clean we only had to typecast the strings into integers and remove commas
# We then added a column which contains percent change which is what we will use to determine
# if bitcoins price correlates to the amount of tweets containing #Bitcoin
# The data sets were split up by year so we had to concatenate them as well


Unnamed: 0,Date,Open,High,Low,Close,percent_change
0,1/6/2020,7349,7603,7327,7574,0.030616
1,1/4/2020,7334,7529,6880,7349,0.002045
2,12/28/2019,7145,7696,7087,7334,0.026452
3,12/21/2019,7092,7447,6459,7147,0.007755
4,12/14/2019,7510,7665,7022,7092,-0.055659
...,...,...,...,...,...,...
49,2/2/2019,3583,3591,3371,3462,-0.033771
50,1/26/2019,3728,3731,3447,3583,-0.038895
51,1/19/2019,3656,3784,3516,3727,0.019420
52,1/12/2019,3880,4136,3608,3655,-0.057990


In [25]:
# We found this web scraping script at https://stackoverflow.com/questions/47730259/installing-urllib-in-python3-6 
# and made some minor modifications in order to fit to our needs in terms of retrieving the amount of tweets on a daily basis.
# We have to further clean the data by extracting all days between 01/05/2019 - 01/12/2021 and then 
# grouping the data into a weekly format (sum amount of tweets  for all days in a given week) 
# in order to match our Bitcoin price dataset

def parse_strlist(sl):
    clean = re.sub("[\[\],\s]","",sl)
    splitted = re.split("[\'\"]",clean)
    values_only = [s for s in splitted if s != '']
    return values_only


url = 'https://bitinfocharts.com/comparison/tweets-btc.html#3y'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

scripts = soup.find_all('script', text=True)
for script in scripts:
    if 'd = new Dygraph(document.getElementById("container")' in str(script):
        StrList = str(script)
        StrList = '[[' + StrList.split('[[')[-1]
        StrList = StrList.split(']]')[0] +']]'
        StrList = StrList.replace("new Date(", '').replace(')','')
        dataList = parse_strlist(StrList)

        date = []
        tweet = []
        for each in dataList:
            if (dataList.index(each) % 2) == 0:
                date.append(each)
            else:
                tweet.append(each)
        tweet_df = pd.DataFrame(list(zip(date, tweet)), columns=["Date","BTC- Tweets"])
tweet_df

Unnamed: 0,Date,BTC- Tweets
0,2014/04/09,8193
1,2014/04/10,15039
2,2014/04/11,14907
3,2014/04/12,7582
4,2014/04/13,10674
...,...,...
2581,2021/05/03,89710
2582,2021/05/04,87287
2583,2021/05/05,93866
2584,2021/05/06,84935


In [38]:
# select all records from 2019
tweets_2019 = tweet_df[tweet_df.Date.str.contains("2019", na=False)]
# select all records from 2020
tweets_2020 = tweet_df[tweet_df.Date.str.contains("2020", na=False)]
# select all records from 2021
tweets_2021 = tweet_df[tweet_df.Date.str.contains("2021", na=False)]

#combine into one df 
tweets = tweets_2019.append(tweets_2020)
tweets = tweets.append(tweets_2021)
tweets

Unnamed: 0,Date,BTC- Tweets
1728,2019/01/01,17069
1729,2019/01/02,18830
1730,2019/01/03,26754
1731,2019/01/04,21139
1732,2019/01/05,20096
...,...,...
2581,2021/05/03,89710
2582,2021/05/04,87287
2583,2021/05/05,93866
2584,2021/05/06,84935


In [39]:
# drop any dates in 2021 which are not in januaray since our sample is from 01/05/2019 - 01/12/2021
exclude = tweets[tweets.Date.str.contains('2021/02')]
tweets = tweets[~tweets.isin(exclude)].dropna()
exclude = tweets[tweets.Date.str.contains('2021/03')]
tweets = tweets[~tweets.isin(exclude)].dropna()
exclude = tweets[tweets.Date.str.contains('2021/04')]
tweets = tweets[~tweets.isin(exclude)].dropna()
exclude = tweets[tweets.Date.str.contains('2021/05')]
tweets = tweets[~tweets.isin(exclude)].dropna()
tweets





Unnamed: 0,Date,BTC- Tweets
1728,2019/01/01,17069
1729,2019/01/02,18830
1730,2019/01/03,26754
1731,2019/01/04,21139
1732,2019/01/05,20096
...,...,...
2485,2021/01/27,60652
2486,2021/01/28,84933
2487,2021/01/29,200783
2488,2021/01/30,93578
