# COGS 108 - Data Checkpoint

# Names

- Nathaniel Wong
- Ethan Tan
- Judy Liu
- Clara Pozuelos
- Aidan Twul

<a id='research_question'></a>
# Research Question

Was the 2020 stock market performance of the technology sector, as represented by the QQQ, and of physical entertainment companies, as represented by the Las Vegas Sands (LVS) corporation, directly influenced by the rise and fall of COVID-19 infection rates?

# Dataset(s)

- Dataset Name: QQQ Historical Data
- Link to the dataset: https://finance.yahoo.com/quote/QQQ/history?period1=1568592000&period2=1644537600&interval=1d&filter=history&frequency=1d&includeAdjustedClose=true
- Number of observations: 604
QQQ: This dataset records the daily open and close prices of the Invesco QQQ Trust Series 1 fund, which tracks major technology stocks. We will use it to track the general sentiment toward technology companies in relation to coronavirus infections.

- Dataset Name: LVS Historical Data
- Link to the dataset: https://finance.yahoo.com/quote/LVS/history?period1=1568592000&period2=1644537600&interval=1d&filter=history&frequency=1d&includeAdjustedClose=true
- Number of observations: 609
LVS: This dataset records the daily open and close prices of Las Vegas Sands, a major casino entertainment conglomerate. We will use it to track the general sentiment toward physical entertainment companies in relation to coronavirus infections.

- Dataset Name: WHO Covid-19 Global Data
- Link to the dataset: https://data.humdata.org/dataset/coronavirus-covid-19-cases-and-deaths
- Number of observations: 182727
WHO: This dataset records daily new and cumulative COVID-19 cases and deaths globally. We will use it to track the number of COVID-19 cases in the United States during 2020.

Because we are tracking the relationship between COVID-19 infections and the performance of stocks in the technology and physical entertainment sectors, we will ideally combine the COVID-19 infection data with the QQQ and LVS performance data over time. We want to see whether rising infection rates are correlated with rising technology stock performance, and whether they are correlated with decreasing physical entertainment stock performance.
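One possible way to combine the datasets is sketched below. It assumes the stock data keeps the Yahoo Finance `Date` and `Close` columns and the case data keeps the WHO `Date_reported` and `New_cases` columns. The helper name `merge_stock_with_cases` and the toy numbers are hypothetical and only illustrate the join; they are not real prices or case counts.

```python
import pandas as pd

def merge_stock_with_cases(stock_df, covid_df):
    """Join daily closes with daily new cases on calendar date.

    Assumes stock_df has 'Date' and 'Close' columns and covid_df has
    'Date_reported' and 'New_cases' columns.
    """
    stock = stock_df[['Date', 'Close']].copy()
    stock['Date'] = pd.to_datetime(stock['Date'])
    covid = covid_df[['Date_reported', 'New_cases']].copy()
    covid['Date_reported'] = pd.to_datetime(covid['Date_reported'])
    # An inner join keeps only the dates present in both tables,
    # i.e. the trading days that also have a case report
    merged = stock.merge(covid, left_on='Date', right_on='Date_reported', how='inner')
    return merged.drop(columns='Date_reported')

# Made-up values for illustration only
toy_stock = pd.DataFrame({'Date': ['2020-03-02', '2020-03-03', '2020-03-04'],
                          'Close': [100.0, 98.0, 101.0]})
toy_covid = pd.DataFrame({'Date_reported': ['2020-03-01', '2020-03-02',
                                            '2020-03-03', '2020-03-04'],
                          'New_cases': [5, 10, 20, 40]})
toy_merged = merge_stock_with_cases(toy_stock, toy_covid)

# Pearson correlation between closing price and new cases on the shared dates
toy_corr = toy_merged['Close'].corr(toy_merged['New_cases'])
```

A correlation computed this way describes co-movement only; it would not by itself show that infections drove the price changes.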

# Setup

In [None]:
from datetime import datetime
import pandas as pd
import numpy as np
import seaborn as sns
import scipy.stats as stats
from scipy.stats import ttest_ind, chisquare, normaltest
import matplotlib.pyplot as plt

In [None]:
QQQ_df = pd.read_csv('./datasets/QQQ.csv')
LVS_df = pd.read_csv('./datasets/LVS.csv')
WHO_df = pd.read_csv('./datasets/WHO-COVID-19-global-data.csv')

In [None]:
QQQ_df.head()

In [None]:
LVS_df.head()

In [None]:
WHO_df.head()

# Data Cleaning

With regard to cleaning up the QQQ and LVS stock data, there were minimal steps we had to take 
because the data we collected from Yahoo Finance provided everything we needed in a concise form, 
without any extraneous data that had to be removed. We only check for missing values below.

In [None]:
QQQ_df.isna().sum()

In [None]:
LVS_df.isna().sum()

With regard to the WHO COVID-19 data, the dataset was already clean because it is provided by a service directly affiliated with the World Health Organization. For our purposes, we filter the dataset so that we only look at cases from the US that were reported in 2020.

In [None]:
# Check the unique country names and how many rows each one has
WHO_df['Country'].value_counts()

In [None]:
# Create a new DataFrame with just the cases regarding the US
# (.copy() makes it independent of WHO_df so later column edits are safe)
us_covid_df = WHO_df[WHO_df['Country'] == 'United States of America'].copy()

# Checking to see if any information is missing in the Dataframe
us_covid_df.isna().sum()

In [None]:
# Change the 'Date_reported' column to actual datetime type data.
# Reassigning through .assign() builds a new DataFrame, which avoids the
# SettingWithCopyWarning raised when writing into a slice of WHO_df
us_covid_df = us_covid_df.assign(Date_reported=pd.to_datetime(us_covid_df['Date_reported']))

# Rechecking that the column types are correct
us_covid_df.dtypes

In [None]:
# Checking the column names of the dataframe
us_covid_df.columns

In [None]:
# Selecting only the necessary columns needed to do our analysis
us_covid_df = us_covid_df[['Date_reported', 'Country', 'New_cases', 'Cumulative_cases', 
                          'New_deaths', 'Cumulative_deaths']]

# Previewing the rows reported in the year 2020 (this does not modify us_covid_df)
us_covid_df.loc[us_covid_df['Date_reported'].dt.year == 2020]

In [None]:
# extract a subset of us_covid_df with data from the year of 2020
# (.copy() so that columns can be added to the subset later without warnings)
us_covid_2020_df = us_covid_df.loc[us_covid_df['Date_reported'].dt.year == 2020].copy()

In [None]:
# extract a subset of QQQ_df with data from the year of 2020
QQQ_df_2020 = QQQ_df[QQQ_df['Date'].str[:4] == '2020']
QQQ_df_2020

In [None]:
# extract a subset of LVS_df with data from the year of 2020
LVS_df_2020 = LVS_df[LVS_df['Date'].str[:4] == '2020']
LVS_df_2020

# Data Analysis

In [None]:
# line plot that observes new cases change with time (date reported)
sns.lineplot(x="Date_reported", y="New_cases", data=us_covid_2020_df)

From the lineplot above, we can observe that new cases come in waves. There is an upward trend in late March and another from June to August 2020, each followed by a downward trend (from April to June and from August to October 2020). There is also a sharp rise from October to the end of the year.
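Because daily counts can be noisy, the waves may be easier to read after smoothing. The sketch below adds a trailing 7-day rolling mean; the helper name `add_rolling_mean`, the 7-day window, and the toy values are our own assumptions for illustration, not part of the WHO data.

```python
import pandas as pd

def add_rolling_mean(df, column='New_cases', window=7):
    """Return a copy of df with a trailing rolling mean of `column`."""
    out = df.copy()
    # min_periods=1 averages over however many days are available
    # at the start of the series instead of producing NaN
    out[f'{column}_{window}d_avg'] = (
        out[column].rolling(window=window, min_periods=1).mean()
    )
    return out

# Made-up counts for illustration only
toy = pd.DataFrame({'New_cases': [0, 7, 14, 21, 28, 35, 42, 49]})
smoothed = add_rolling_mean(toy)
```

Applied to `us_covid_2020_df`, the new `New_cases_7d_avg` column could be passed as `y` to `sns.lineplot` in place of `New_cases`.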

In [None]:
# line plot that observes cumulative cases with time (date reported)
sns.lineplot(x="Date_reported", y="Cumulative_cases", data=us_covid_2020_df)

From the lineplot above, we can see a steep increase in cumulative cases.

In [None]:
# add a new column that stores the day-to-day change in new cases
# (today's count minus the previous day's count; the first row is NaN)
us_covid_2020_df = us_covid_2020_df.assign(New_cases_change=us_covid_2020_df['New_cases'].diff())

In [None]:
# an enhanced boxplot that observes the pattern for change in new cases
sns.boxenplot(x="New_cases_change", data=us_covid_2020_df)

In [None]:
# boxplot for the change in new cases to observe outliers
sns.boxplot(x="New_cases_change", data=us_covid_2020_df)

In [None]:
# set our plot size
sns.set(rc= {'figure.figsize':(15,8)})

In [None]:
ax = sns.lineplot(x='Date', y='Close', data=QQQ_df_2020)
x_ticks = ax.set_xticks([i*30 for i in range(10)])

In [None]:
ax = sns.lineplot(x='Date', y='Close', data=LVS_df_2020)
x_ticks = ax.set_xticks([i*30 for i in range(10)])

## Combined Stock Data

In [None]:
fig, ax = plt.subplots()
sns.lineplot(x='Date', y='Close', data=QQQ_df_2020)
plt.ylabel('Close QQQ', size=5)
ax2 = ax.twinx()
sns.lineplot(x='Date', y='Close', data=LVS_df_2020, color='r')
plt.ylabel('Close LVS', size=5)
plt.xlabel('Date', size=5)
x_ticks = ax.set_xticks([i*30 for i in range(10)])

As we can see in the initial data, with the onset of COVID there is a harsh drop in the stock price of both LVS and QQQ. This is to be expected: with worldwide fears over the pandemic, few companies are left untouched as the public pulls its money out of the market.

After the _ date period, however, we see that there is a marked change in sentiment: after the large initial drop, investors start putting their money back into the market at a slow rate.

We see that technology has a remarkably stronger performance in that period in comparison to LVS, the Las Vegas Sands stock we are using as a marker for "travel" stocks.

We reason that this is because technology requires little social contact in order to be useful, whilst physical travel companies require individuals to be in actual locations, which was restricted by government lockdowns and quarantine measures around the globe.
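Since the two stocks trade at very different price levels, one way to compare them on a common scale is to index each closing series to its first trading day. The sketch below is a minimal illustration under that assumption; `index_to_first_day` and the toy prices are hypothetical and are not taken from the QQQ or LVS data.

```python
import pandas as pd

def index_to_first_day(close, base=100.0):
    """Rescale a price series so that its first value equals `base`."""
    return close / close.iloc[0] * base

# Made-up prices for illustration only
toy_high = pd.Series([200.0, 150.0, 250.0])  # a higher-priced series
toy_low = pd.Series([50.0, 25.0, 37.5])      # a lower-priced series

high_indexed = index_to_first_day(toy_high)
low_indexed = index_to_first_day(toy_low)
```

With both series starting at 100, they can share a single y-axis, and a value of 125 reads directly as a 25% gain since the first day.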

### Revised graph

In [None]:
# generate QQQ data in the form of date, change since previous day
# (current day's close minus the previous trading day's close)
QQQ_data = pd.DataFrame({
    'Date': QQQ_df_2020['Date'].iloc[1:].to_numpy(),
    'Closing Change': QQQ_df_2020['Close'].diff().iloc[1:].to_numpy(),
})

In [None]:
ax = sns.lineplot(x='Date', y='Closing Change', data=QQQ_data)
x_ticks = ax.set_xticks([i*30 for i in range(10)])

In [None]:
# generate LVS data in the form of date, change since previous day
# (current day's close minus the previous trading day's close)
LVS_data = pd.DataFrame({
    'Date': LVS_df_2020['Date'].iloc[1:].to_numpy(),
    'Closing Change': LVS_df_2020['Close'].diff().iloc[1:].to_numpy(),
})

In [None]:
fig, ax = plt.subplots()
sns.lineplot(x='Date', y='Closing Change', data=QQQ_data)
plt.ylabel('Closing Change QQQ', size=5)
ax2 = ax.twinx()
sns.lineplot(x='Date', y='Closing Change', data=LVS_data, color='r')
plt.ylabel('Closing Change LVS', size=5)
plt.xlabel('Date', size=5)
x_ticks = ax.set_xticks([i*30 for i in range(10)])

When we look at the revised graphs, which plot the day-to-day change in closing price (the slope of each price curve), we see that the QQQ's price increases are larger than those of LVS. This is in line with our analysis that technology stock performance greatly outweighed that of LVS: the rate at which the QQQ price typically increased was much larger than that of the travel stock. Furthermore, for the QQQ the change in closing price from the past day to the current day is typically positive, and much larger than the corresponding change for LVS.

In other words, when we count the number of "positive change" days as well as the size of each "positive change", the QQQ results vastly outweigh those of LVS in terms of both total positive value and number of positive days.
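The count described above could be computed along the following lines. This is a minimal sketch: `summarize_positive_days` is a hypothetical helper, and the toy series stands in for a 'Closing Change' column such as the one in `QQQ_data` or `LVS_data`.

```python
import pandas as pd

def summarize_positive_days(changes):
    """Count the positive-change days and total their size."""
    positive = changes[changes > 0]
    return {
        'positive_days': int(positive.count()),        # number of up days
        'positive_total': float(positive.sum()),       # summed size of the gains
        'share_positive': float((changes > 0).mean()), # fraction of days that were up
    }

# Made-up daily changes for illustration only
toy_changes = pd.Series([1.5, -0.5, 2.0, -3.0, 0.25])
toy_summary = summarize_positive_days(toy_changes)
```

Calling the helper on `QQQ_data['Closing Change']` and `LVS_data['Closing Change']` would give the two summaries to compare. Because the stocks trade at different price levels, the totals in dollars are not directly comparable, so the share of positive days is likely the fairer figure.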