# COGS 108 - Data Checkpoint

# Names

- Marco Paredes (A15951023)
- Alan Hang (A16409702)
- Zehong Li (A15852954)
- Danyal Iqbal (A16687740)

<a id='research_question'></a>
# Research Question

*Are covid cases a good predictor of the price of cryptocurrency?*

# Dataset(s)

- Dataset Name: Daily Ethereum (ETH) data
- Link to the dataset: https://raw.githubusercontent.com/alhang/csv-files/master/gemini_ETHUSD_day.csv
- Number of observations: ~700
- Description: Daily Ethereum (ETH) data will be used to track Ethereum price trends between 2019 and 2021. The data show how the ETH price changes from day to day over this period.


- Dataset Name: Daily Bitcoin (BTC) data
- Link to the dataset: https://raw.githubusercontent.com/alhang/csv-files/master/gemini_BTCUSD_day.csv
- Number of observations: ~700
- Description: Daily Bitcoin (BTC) data will be used to track Bitcoin price trends between 2019 and 2021. The data show how the BTC price changes from day to day over this period.


- Dataset Name: Covid data
- Link to the dataset: https://raw.githubusercontent.com/nytimes/covid-19-data/master/us.csv
- Number of observations: 751 (as of February 9, 2022)
- Description: The Covid data provides the cumulative number of covid cases in the US, one row per day, from January 2020 until now.


- Dataset Name: NASDAQ Historical data
- Link to the dataset: https://raw.githubusercontent.com/alhang/csv-files/master/NASDAQ_Historical.csv
- Number of observations: 521
- Description: Stock market data from March 2019 to December 2021 showing daily stock prices and their major changes. We will compare this dataset with the cryptocurrency prices, since stock prices are a good indicator of the state of the economy and of how it was affected during covid.



# Setup

In [1]:
import pandas as pd
eth_df_raw = pd.read_csv('https://raw.githubusercontent.com/alhang/csv-files/master/gemini_ETHUSD_day.csv', skiprows=1)
btc_df_raw = pd.read_csv('https://raw.githubusercontent.com/alhang/csv-files/master/gemini_BTCUSD_day.csv', skiprows=1)
covid_df_raw = pd.read_csv('https://raw.githubusercontent.com/nytimes/covid-19-data/master/us.csv')
nasdaq_df_raw = pd.read_csv('https://raw.githubusercontent.com/alhang/csv-files/master/NASDAQ_Historical.csv')

# Data Cleaning

We clean the Ethereum data to keep only the date and the closing price from January 21, 2020 to now. The raw file lists the newest rows first, so we also sort it into chronological order and reset the index.

In [2]:
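# Keep the most recent rows (the raw file is newest-first)
eth_df = eth_df_raw.iloc[:749]
# Sort oldest-first and rebuild the index from 0
eth_df = eth_df.sort_values(by=['Date'], ascending=True).reset_index(drop=True)
# Keep only the date and the closing price
eth_df = eth_df.drop(columns=['Unix Timestamp','Symbol','Open','High','Low','Volume'])
eth_df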

Unnamed: 0,Date,Close
0,2020-01-21 04:00:00,168.40
1,2020-01-22 04:00:00,164.86
2,2020-01-23 04:00:00,159.13
3,2020-01-24 04:00:00,158.89
4,2020-01-25 04:00:00,161.18
...,...,...
744,2022-02-03 04:00:00,2685.66
745,2022-02-04 04:00:00,3000.46
746,2022-02-05 04:00:00,3011.36
747,2022-02-06 04:00:00,3072.90
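
The fixed row count in `iloc[:749]` only matches the intended window on the day the file was downloaded. A minimal sketch of selecting the same window by date instead is shown below. The helper name `clean_crypto` and the sample frame are illustrative and not part of the original notebook; the oldest sample row uses a made-up closing price.

```python
import pandas as pd

def clean_crypto(raw, start="2020-01-21"):
    """Keep Date and Close from `start` onward, oldest row first.

    Hypothetical helper: assumes `raw` has 'Date' and 'Close' columns,
    like the Gemini files loaded in Setup.
    """
    df = raw[["Date", "Close"]].copy()
    df["Date"] = pd.to_datetime(df["Date"])        # strings -> datetimes
    df = df[df["Date"] >= pd.Timestamp(start)]     # drop rows before the window
    return df.sort_values("Date").reset_index(drop=True)

# Small newest-first sample mimicking the raw layout.
# The 2020-01-20 close is a made-up value for illustration only.
sample = pd.DataFrame({
    "Date": ["2020-01-22 04:00:00", "2020-01-21 04:00:00", "2020-01-20 04:00:00"],
    "Symbol": ["ETHUSD"] * 3,
    "Close": [164.86, 168.40, 166.90],
})
cleaned = clean_crypto(sample)
print(cleaned)
```

The same helper would apply unchanged to the Bitcoin file, since both Gemini files share the same column layout.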


We clean the Bitcoin data to keep only the date and the closing price from January 21, 2020 to now. As with Ethereum, the raw file lists the newest rows first, so we sort it into chronological order and reset the index.

In [3]:
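# Keep the most recent rows (the raw file is newest-first)
btc_df = btc_df_raw.iloc[:749]
# Sort oldest-first and rebuild the index from 0
btc_df = btc_df.sort_values(by=['Date'], ascending=True).reset_index(drop=True)
# Keep only the date and the closing price
btc_df = btc_df.drop(columns=['Unix Timestamp','Symbol','Open','High','Low','Volume'])
btc_df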

Unnamed: 0,Date,Close
0,2020-01-21 04:00:00,8697.93
1,2020-01-22 04:00:00,8559.71
2,2020-01-23 04:00:00,8310.20
3,2020-01-24 04:00:00,8286.66
4,2020-01-25 04:00:00,8346.01
...,...,...
744,2022-02-03 04:00:00,37330.90
745,2022-02-04 04:00:00,41487.51
746,2022-02-05 04:00:00,41485.00
747,2022-02-06 04:00:00,42863.79


We clean the covid dataset to keep only the date and the cumulative case count. The rows are already in chronological order, so no reordering is needed.

In [4]:
covid_df = covid_df_raw.drop(columns=['deaths'])
covid_df

Unnamed: 0,date,cases
0,2020-01-21,1
1,2020-01-22,1
2,2020-01-23,1
3,2020-01-24,2
4,2020-01-25,3
...,...,...
747,2022-02-06,76419014
748,2022-02-07,76767122
749,2022-02-08,76961614
750,2022-02-09,77188002
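
The output above runs from the oldest date to the newest, and the case counts only grow, which is what a cumulative series should look like. A quick sketch of how both properties could be checked, using the first rows copied from the table (the variable names are illustrative):

```python
import pandas as pd

# First five rows of the covid table shown above
covid_sample = pd.DataFrame({
    "date": ["2020-01-21", "2020-01-22", "2020-01-23", "2020-01-24", "2020-01-25"],
    "cases": [1, 1, 1, 2, 3],
})

dates = pd.to_datetime(covid_sample["date"])
dates_sorted = dates.is_monotonic_increasing                      # oldest row first?
cases_cumulative = covid_sample["cases"].is_monotonic_increasing  # counts never decrease?
print(dates_sorted, cases_cumulative)
```

Running the same two checks on the full `covid_df` would confirm the ordering for the whole dataset rather than just the visible rows.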


We clean the Nasdaq data to keep only the date and the closing price from January 21, 2020 to now. We also sort the rows into chronological order, reset the index, and convert the date from MM/DD/YYYY strings to datetime values, which display as YYYY-MM-DD.

In [5]:
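# Parse the date strings into datetime values
nasdaq_df_raw['Date'] = pd.to_datetime(nasdaq_df_raw['Date'])
# Keep the most recent rows (the raw file is newest-first)
nasdaq_df = nasdaq_df_raw.iloc[:521]
# Sort oldest-first and rebuild the index from 0
nasdaq_df = nasdaq_df.sort_values(by=['Date'], ascending=True).reset_index(drop=True)
# Keep only the date and the closing price
nasdaq_df = nasdaq_df.drop(columns=['Volume','Open','High','Low'])
nasdaq_df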

Unnamed: 0,Date,Close/Last
0,2020-01-21,9370.81
1,2020-01-22,9383.77
2,2020-01-23,9402.48
3,2020-01-24,9314.91
4,2020-01-27,9139.31
...,...,...
516,2022-02-04,14098.01
517,2022-02-07,14015.67
518,2022-02-08,14194.45
519,2022-02-09,14490.37
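
A small sketch of the date conversion on its own, assuming the raw file stores dates as MM/DD/YYYY strings. The two sample rows are written in that assumed format, using dates and closing prices from the output above:

```python
import pandas as pd

# Two rows in newest-first order, dates in the assumed MM/DD/YYYY format
sample = pd.DataFrame({
    "Date": ["01/22/2020", "01/21/2020"],
    "Close/Last": [9383.77, 9370.81],
})

# An explicit format avoids any day/month ambiguity during parsing
sample["Date"] = pd.to_datetime(sample["Date"], format="%m/%d/%Y")
sample = sample.sort_values("Date").reset_index(drop=True)
print(sample)
```

Sorting after the conversion matters: sorting the original strings would order them by month first, which only matches chronological order within a single year.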


To check for missing values in all datasets, we run:

In [6]:
eth_df.isnull().values.any()

False

In [7]:
btc_df.isnull().values.any()

False

In [8]:
covid_df.isnull().values.any()

False

In [9]:
nasdaq_df.isnull().values.any()

False
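
The four checks above can also be run in one pass that reports how many values are missing per dataset. The sketch below uses two tiny stand-in frames, one of them with a deliberately missing value, so it runs on its own; in the notebook the dictionary would hold `eth_df`, `btc_df`, `covid_df`, and `nasdaq_df` instead.

```python
import pandas as pd

# Stand-in frames for illustration; "with_gap" has one missing value on purpose
frames = {
    "complete": pd.DataFrame({"Date": ["2020-01-21", "2020-01-22"],
                              "Close": [168.40, 164.86]}),
    "with_gap": pd.DataFrame({"Date": ["2020-01-21", "2020-01-22"],
                              "Close": [168.40, float("nan")]}),
}

# Count missing cells per frame: per-column counts, then summed
missing = {name: int(df.isnull().sum().sum()) for name, df in frames.items()}
print(missing)
```

A count is slightly more informative than `.values.any()`, since it shows how much data is affected when a check does fail.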

All four datasets are now free of missing values, sorted chronologically, and reduced to the date and the value column we need for analysis.
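
Before the datasets can be compared, their date columns have to line up: the crypto tables carry a `04:00:00` time component, the covid table names its column `date`, and the Nasdaq table has no rows for weekends. A minimal sketch of one way to align them, on a few rows taken from the outputs above (the merged column names `eth_close` and `nasdaq_close` are illustrative choices, not names used in the notebook):

```python
import pandas as pd

# A few rows copied from the cleaned outputs; 2020-01-25 is a Saturday
eth = pd.DataFrame({"Date": ["2020-01-24 04:00:00", "2020-01-25 04:00:00"],
                    "Close": [158.89, 161.18]})
covid = pd.DataFrame({"date": ["2020-01-24", "2020-01-25"], "cases": [2, 3]})
nasdaq = pd.DataFrame({"Date": pd.to_datetime(["2020-01-24"]),
                       "Close/Last": [9314.91]})

# Strip the time-of-day from the crypto dates so they match the other tables
eth["Date"] = pd.to_datetime(eth["Date"]).dt.normalize()

# Give the covid table the same key name and dtype
covid = covid.rename(columns={"date": "Date"})
covid["Date"] = pd.to_datetime(covid["Date"])

merged = (
    eth.rename(columns={"Close": "eth_close"})
    .merge(covid, on="Date", how="inner")
    # Left join keeps weekend rows; Nasdaq values are NaN on those days
    .merge(nasdaq.rename(columns={"Close/Last": "nasdaq_close"}),
           on="Date", how="left")
)
print(merged)
```

Whether weekend rows should be kept with missing Nasdaq values or dropped is a design choice that depends on the later analysis; an inner join in the last step would drop them.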