<a href="https://colab.research.google.com/github/du-hr/covid-atals/blob/main/MiniProject_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# COVID Atlas: Analyzing COVID-19 Search Trends and Hospitalization with ML


> MiniProject 1 of COMP 551 (Fall 2020) at McGill University


> Authors (G68): Haoran Du, Cong Zhu, Matthew Kourlas





In [173]:
import numpy as np
import scipy as sci
import pandas as pd

## Task 1: Acquire, preprocess, and analyze the data.

### 1.1 Download the datasets. Load the datasets into Pandas dataframes or NumPy objects (i.e., arrays or matrices) in Python.

In [174]:

url_search = "https://raw.githubusercontent.com/google-research/open-covid-19-data/master/data/exports/search_trends_symptoms_dataset/United%20States%20of%20America/2020_US_weekly_symptoms_dataset.csv"
url_hospital = "https://raw.githubusercontent.com/google-research/open-covid-19-data/master/data/exports/cc_by/aggregated_cc_by.csv"

df_search = pd.read_csv(url_search, low_memory=False)
df_hospital = pd.read_csv(url_hospital, low_memory=False)

pd.set_option('display.max_columns', None)


### 1.2 Clean the data. Remove regions and features that have too many missing or invalid data entries.

In [175]:
print("Length of search dataset before cleaning = " + str(len(df_search)))
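# Keep only rows (region-weeks) with at least 8 non-missing values.
df_search.dropna(thresh=8, inplace=True)
# Drop symptom columns that are empty for every remaining row.
df_search.dropna(axis=1, how="all", inplace=True)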
print("Length of search dataset after cleaning = " + str(len(df_search)))

print("\nLength of hospital dataset before cleaning = " + str(len(df_hospital)))

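# Keep only US state-level rows (codes between "US-A" and "US-Z"). The .copy() makes the
# filtered frame independent, so the in-place dropna below does not trigger
# pandas' SettingWithCopyWarning.
df_hospital = df_hospital[ (df_hospital["open_covid_region_code"] >= "US-A") & (df_hospital["open_covid_region_code"] <= "US-Z") ].copy()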
df_hospital.dropna(axis=1, how="all", inplace=True)
print("Length of hospital dataset after cleaning = " + str(len(df_hospital)))

Length of search dataset before cleaning = 608
Length of search dataset after cleaning = 569

Length of hospital dataset before cleaning = 98434
Length of hospital dataset after cleaning = 12194


### 1.3 Merge the two datasets. Bring both datasets to weekly resolution, then merge them into one array (NumPy or pandas).

In [176]:



df_search["date"] = pd.to_datetime(df_search["date"])

df_hospital["date"] = pd.to_datetime(df_hospital["date"])

df_search = df_search.set_index(["date"])
df_search = df_search.shift(periods=6, freq="D")
df_hospital = df_hospital.set_index(["date"])


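# Bin the daily hospital records into weeks (ending on Sunday) for each region.
grouper = df_hospital.groupby([pd.Grouper(freq='1W'), 'open_covid_region_code'])
# Select the two columns with a list; tuple-style selection is deprecated and fails on recent pandas versions.
# Note: count() returns the number of non-missing daily records per week, not weekly totals.
df_hospital = grouper[['hospitalized_new', 'hospitalized_cumulative']].count()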

df_search = df_search.groupby(by=["date", "open_covid_region_code"]).sum()

df_hospital = df_hospital.drop(["2020-10-11", "2020-10-04"])
df_search = df_search.drop(["2020-01-12", "2020-01-19"])

df = pd.concat([df_hospital, df_search], axis=1, sort=False)

df.dropna(thresh=122, inplace=True)

print(df)
# Final combined dataset, indexed by "date" and "open_covid_region_code".
# "hospitalized_new" and "hospitalized_cumulative" come from the hospitalization dataset;
# all other columns come from the search trends dataset.




                                   hospitalized_new  hospitalized_cumulative  \
date       open_covid_region_code                                              
2020-03-01 US-RI                                1.0                      1.0   
2020-03-08 US-AK                                3.0                      3.0   
           US-DC                                4.0                      4.0   
           US-DE                                3.0                      3.0   
           US-HI                                2.0                      2.0   
...                                             ...                      ...   
2020-09-27 US-RI                                7.0                      7.0   
           US-SD                                7.0                      7.0   
           US-VT                                7.0                      7.0   
           US-WV                                7.0                      7.0   
           US-WY                        
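The values of 7.0 in the printed frame are what `count()` yields for a complete week: the number of non-missing daily records, not a hospitalization total. As a hedged alternative, the sketch below aggregates daily records to weeks with `sum` for new admissions and `max` for the running total. The region code `US-XX` and every number in it are invented for illustration, and this is only one reasonable choice of aggregation, not necessarily the one the assignment intends.

```python
import pandas as pd

# Toy daily records for one hypothetical region (all values are invented).
daily = pd.DataFrame(
    {
        "date": pd.date_range("2020-03-02", periods=14, freq="D"),
        "open_covid_region_code": "US-XX",  # hypothetical region code
        "hospitalized_new": [1, 2, 0, 3, 1, 0, 2, 4, 1, 0, 2, 2, 1, 3],
    }
)
daily["hospitalized_cumulative"] = daily["hospitalized_new"].cumsum()

# Weekly bins end on Sunday (the pandas default for freq="W").
weekly = (
    daily.set_index("date")
    .groupby([pd.Grouper(freq="W"), "open_covid_region_code"])
    .agg(
        hospitalized_new=("hospitalized_new", "sum"),  # weekly total of new cases
        hospitalized_cumulative=("hospitalized_cumulative", "max"),  # running total at week end
        days_reported=("hospitalized_new", "count"),  # what count() alone would return
    )
)
print(weekly)
```

Keeping a `days_reported` column alongside the totals makes incomplete weeks easy to spot and filter.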

  


## Task 2: Visualize and cluster the data

### 2.1 Visualize how the distribution of search frequency of each symptom aggregated across different regions changes over time.
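One possible way to approach this task is to average each symptom column across regions for every week and draw one line per symptom. The sketch below uses a small random frame in place of the merged dataset; the region codes, the two symptom column names, and the choice of the mean as the aggregate are all assumptions made for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Toy stand-in for the merged frame: random values, hypothetical regions and symptoms.
rng = np.random.default_rng(0)
dates = pd.date_range("2020-03-01", periods=6, freq="W")
regions = ["US-AA", "US-BB", "US-CC"]
index = pd.MultiIndex.from_product(
    [dates, regions], names=["date", "open_covid_region_code"]
)
toy = pd.DataFrame(
    rng.random((len(index), 2)),
    index=index,
    columns=["symptom:Cough", "symptom:Fever"],
)

# Aggregate across regions: one row per week, one column per symptom.
by_week = toy.groupby(level="date").mean()

# One line per symptom over time.
fig, ax = plt.subplots(figsize=(8, 4))
by_week.plot(ax=ax, marker="o")
ax.set_xlabel("Week ending")
ax.set_ylabel("Mean relative search frequency")
ax.set_title("Symptom search trends averaged across regions")
fig.tight_layout()
```

With more than a hundred symptom columns, a heatmap (weeks on one axis, symptoms on the other) may be easier to read than overlapping lines.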

### 2.2 Use Principal Component Analysis (PCA) to reduce the data dimensionality.
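To make the mechanics of PCA concrete, here is a minimal NumPy sketch based on the singular value decomposition of the centered data matrix. The toy matrix `X` is synthetic (two latent factors mixed into five features) and stands in for the symptom columns; in practice a library implementation such as scikit-learn's `PCA` would likely be used instead.

```python
import numpy as np

# Synthetic data: 100 samples, 5 correlated features driven by 2 latent factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(100, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.05 * rng.normal(size=(100, 5))

# PCA via SVD: center the data, then decompose.
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Squared singular values are proportional to the variance along each component.
explained_variance_ratio = S**2 / np.sum(S**2)

# Project onto the leading components (rows of Vt are the principal axes).
n_components = 2
components = Vt[:n_components]
X_reduced = X_centered @ components.T

print(explained_variance_ratio.round(3))
print(X_reduced.shape)
```

If the features are on very different scales, standardizing each column before the decomposition is a common additional step.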

### 2.3 Explore using a clustering method (k-means) to evaluate possible groups in the search trends dataset. Do the clusters remain consistent between the raw and the PCA-reduced data?
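One way to check consistency is to cluster the raw and the PCA-reduced data separately and compare the two labelings with the adjusted Rand index, which is 1.0 when the partitions agree exactly regardless of how the cluster IDs are numbered. The sketch below assumes scikit-learn is available and uses synthetic, well-separated blobs; the number of clusters and components are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score

# Synthetic data: three well-separated groups in six dimensions.
rng = np.random.default_rng(0)
centers = np.array([[0] * 6, [5] * 6, [-5, 5, -5, 5, -5, 5]], dtype=float)
X = np.vstack([c + rng.normal(scale=0.5, size=(40, 6)) for c in centers])

# Cluster the raw features.
labels_raw = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Cluster the same samples after reducing to two principal components.
X_pca = PCA(n_components=2).fit_transform(X)
labels_pca = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_pca)

# Compare the two partitions; label permutations do not affect the score.
ari = adjusted_rand_score(labels_raw, labels_pca)
print(f"Adjusted Rand index (raw vs. PCA): {ari:.2f}")
```

On real search-trend data the agreement may be noticeably lower than on this toy example, which is itself informative about how much structure the discarded components carry.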

## Task 3: Supervised Learning

### 3.1 Split the data (region): keep all data from some regions in the validation set and train on the rest (keep 80% of the regions in the training set and 20% in the validation set, repeating the split several times to estimate cross-validation results).
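A region-based split can be sketched by shuffling the unique region codes and holding out one fifth of them per fold, so that no region contributes rows to both sets. The helper `region_folds` below is hypothetical (it is not part of the notebook), and the region codes are made up.

```python
import numpy as np
import pandas as pd

def region_folds(regions, n_splits=5, seed=0):
    """Yield (train_mask, val_mask) pairs that hold out whole regions per fold."""
    regions = np.asarray(regions, dtype=str)
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(np.unique(regions))
    for held_out in np.array_split(shuffled, n_splits):
        val_mask = np.isin(regions, held_out)
        yield ~val_mask, val_mask

# Toy frame: 4 weeks x 10 hypothetical regions, indexed like the merged dataset.
dates = pd.date_range("2020-03-01", periods=4, freq="W")
codes = [f"US-R{i}" for i in range(10)]
index = pd.MultiIndex.from_product(
    [dates, codes], names=["date", "open_covid_region_code"]
)
toy = pd.DataFrame({"x": np.arange(len(index))}, index=index)
regions = toy.index.get_level_values("open_covid_region_code").to_numpy()

folds = list(region_folds(regions))
for i, (train_mask, val_mask) in enumerate(folds):
    print(i, sorted(set(regions[val_mask])), int(train_mask.sum()), int(val_mask.sum()))
```

With five folds, each fold validates on 20% of the regions and trains on the remaining 80%, which matches the ratio in the task statement.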

### 3.2 Split the data (time): hold out the most recent timepoints (data after `2020-08-10`) from all regions as the validation set and train on the rest of the data.
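The time-based split reduces to a comparison against the cutoff date on the `date` index level. A minimal sketch on a toy frame, with hypothetical region codes and a placeholder column `x`:

```python
import numpy as np
import pandas as pd

# Toy frame: 10 weekly dates x 2 hypothetical regions.
dates = pd.date_range("2020-07-05", periods=10, freq="W")
codes = ["US-AA", "US-BB"]
index = pd.MultiIndex.from_product(
    [dates, codes], names=["date", "open_covid_region_code"]
)
toy = pd.DataFrame({"x": np.arange(len(index))}, index=index)

# Everything up to the cutoff is training data; later weeks are validation data.
cutoff = pd.Timestamp("2020-08-10")
week = toy.index.get_level_values("date")
train = toy[week <= cutoff]
val = toy[week > cutoff]

print(len(train), len(val))
```

Unlike the region split, every region appears in both sets here, so this setup measures how well a model extrapolates forward in time.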

### 3.3 Supervised Learning: KNN (region) (5-fold cross-validation)
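Assuming scikit-learn is available and hospitalization is treated as a regression target, region-wise 5-fold cross-validation for KNN could look like the sketch below. `GroupKFold` keeps all rows of a region in the same fold. The data is synthetic, and `n_neighbors=5` is an arbitrary starting value rather than a tuned one.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data: 10 regions x 12 weeks, 4 features, roughly linear target.
rng = np.random.default_rng(0)
n_regions, n_weeks, n_features = 10, 12, 4
groups = np.repeat(np.arange(n_regions), n_weeks)  # region ID for every row
X = rng.random((n_regions * n_weeks, n_features))
y = X @ np.array([3.0, 1.0, 0.0, 2.0]) + rng.normal(scale=0.1, size=len(X))

# Each fold validates on regions that the model never saw during training.
model = KNeighborsRegressor(n_neighbors=5)
cv = GroupKFold(n_splits=5)
scores = cross_val_score(
    model, X, y, groups=groups, cv=cv, scoring="neg_mean_squared_error"
)
mse_per_fold = -scores  # scikit-learn reports negated MSE so that higher is better

print(mse_per_fold.round(3), round(float(mse_per_fold.mean()), 3))
```

Repeating this for several values of `n_neighbors` and comparing the mean fold error is a simple way to choose the hyperparameter.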

### 3.4 Supervised Learning: KNN (time)

### 3.5 Supervised Learning: Decision Tree (region) (5-fold cross-validation)

### 3.6 Supervised Learning: Decision Tree (time)
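As a rough sketch of the time-based evaluation, a decision tree can be fit on weeks up to the cutoff and scored on the later weeks. The example assumes scikit-learn and a regression target; the feature names `s1` to `s3`, the region codes, and `max_depth=4` are placeholders chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: 14 weeks x 5 hypothetical regions, 3 placeholder features.
rng = np.random.default_rng(0)
dates = pd.date_range("2020-06-07", periods=14, freq="W")
codes = [f"US-R{i}" for i in range(5)]
index = pd.MultiIndex.from_product(
    [dates, codes], names=["date", "open_covid_region_code"]
)
features = pd.DataFrame(
    rng.random((len(index), 3)), index=index, columns=["s1", "s2", "s3"]
)
target = 10 * features["s1"] + rng.normal(scale=0.5, size=len(index))

# Train on weeks up to the cutoff, validate on the weeks after it.
cutoff = pd.Timestamp("2020-08-10")
is_train = index.get_level_values("date") <= cutoff

tree = DecisionTreeRegressor(max_depth=4, random_state=0)
tree.fit(features[is_train], target[is_train])
predictions = tree.predict(features[~is_train])
mse = mean_squared_error(target[~is_train], predictions)

print(f"Validation MSE: {mse:.3f}")
```

Limiting `max_depth` is one simple guard against overfitting on a dataset with few rows per region.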

### 3.7 **(Optional)** Explore other prediction strategies. For example, one strategy could be to learn separate models for predicting hospitalization in each region or cluster from Task 2.