HackTheValley x Distributed Compute Labs Challenge
You can find the full challenge details here
Users can call the function defined in the prediction model:
model_prediction(data_csv, number_of_days)
data_csv: location of a csv file
number_of_days: number of days desired to be predicted
Users need to download their own COVID case-count data and pass the file location along with the number of days they want predicted. The function writes the predictions to a CSV file, prediction.csv, and returns the mean squared error.
I was provided with three CSV files, which can be downloaded from the challenge document or found in the repo here. The CSV files contain the following columns: infected_unvaccinated, infected_vaccinated and total_vaccinated. I manually added total_pop because it is given in the full challenge details. Note that each CSV file corresponds to a different population size. I began with data exploration and feature engineering, creating three new features: daily infected, total infected and date. I then graphed the new features to see how they related to each other.
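The feature engineering step could look roughly like the sketch below. It rests on assumptions that the write-up does not confirm: that each row is one consecutive day, that the two infected columns are cumulative counts, and that total infected is their sum. The function name `add_features`, the derived column names and the start date are hypothetical.

```python
import pandas as pd


def add_features(df: pd.DataFrame, start_date: str = "2021-01-01") -> pd.DataFrame:
    """Return a copy of df with date, total_infected and daily_infected columns."""
    out = df.copy()
    # Assumption: one row per day in chronological order, starting at start_date.
    out["date"] = pd.date_range(start=start_date, periods=len(out), freq="D")
    # Assumption: both infected columns are cumulative, so their sum is the running total.
    out["total_infected"] = out["infected_unvaccinated"] + out["infected_vaccinated"]
    # Day-over-day change; the first day has no previous row, so it keeps its own total.
    out["daily_infected"] = out["total_infected"].diff().fillna(out["total_infected"])
    return out


# Tiny made-up sample, only to show the shape of the result.
sample = pd.DataFrame({
    "infected_unvaccinated": [10, 25, 45],
    "infected_vaccinated": [0, 5, 10],
    "total_vaccinated": [100, 150, 210],
})
features = add_features(sample)
print(features[["date", "total_infected", "daily_infected"]])
```

If the infected columns are already daily counts rather than cumulative ones, the `diff()` step would be replaced by a cumulative sum in the other direction.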
After data exploration and feature engineering, I chose fbprophet to predict the number of daily infected, since this is a time series problem. I used the date and daily infected as my ds and y variables respectively. From my data exploration I knew that the relation between ds and y was stationary and seasonal, so I set a seasonality modifier, a high Fourier order and a seasonality mode. I then fit the model and predicted 100 days ahead. Next, I added yhat and y to a new dataframe and used sklearn's mean_squared_error to get the MSE. Finally, I removed the excess rows and saved the predicted output. I did this for all three CSV files.
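A minimal sketch of the fit, forecast and score steps described above. The seasonality settings are placeholders, not the values used in this repo, and `fit_and_forecast` and `mse` are hypothetical helper names. The `mse` helper is a plain-Python stand-in for the mean of squared differences between y and yhat, which is what sklearn's mean_squared_error returns for 1-D inputs.

```python
from typing import Sequence


def fit_and_forecast(history, periods: int = 100):
    """Fit Prophet on a DataFrame with `ds` and `y` columns and forecast `periods` days."""
    # Imported lazily so the MSE helper below works without Prophet installed.
    # The package is named `prophet` in recent releases and `fbprophet` in older ones.
    try:
        from prophet import Prophet
    except ImportError:
        from fbprophet import Prophet

    # Placeholder settings (assumptions): tune these against your own data.
    model = Prophet(seasonality_mode="multiplicative", seasonality_prior_scale=10.0)
    model.add_seasonality(name="monthly", period=30.5, fourier_order=10)
    model.fit(history)
    future = model.make_future_dataframe(periods=periods)
    # The result has a yhat value for every historical and every future date.
    return model.predict(future)


def mse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean of squared differences between observed and predicted values."""
    if len(y_true) == 0 or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up numbers: squared errors are 1, 0 and 4, so the mean is 5/3.
print(mse([3.0, 5.0, 8.0], [2.0, 5.0, 10.0]))
```

Because the forecast covers both the historical dates and the future ones, y and yhat only overlap on the historical rows, so the MSE is computed on that overlap.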
The challenge asks for a function that takes two parameters, data_csv and number_of_days. data_csv is the location of a CSV file and number_of_days is the number of days to predict. The function saves a dataframe containing the predicted case count and returns the mean squared error and the predicted case count.
I use fbprophet to make a prediction based on the previous 300 days of infection and vaccination data, for three CSV files covering different population sizes. At the end there is a function that saves a dataframe containing the predicted case count and returns the mean squared error and the predicted case count.
10/19/2021: Added Base Cases (base model with no hyperparameters) and modified the algorithm slightly for seasonality
Figure out how to add a floor so that predictions cannot fall below zero, try out an ARIMA + ARMA model, and normalize everything to prevent overfitting in predictions #2 and #3
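One simple way to approach the floor item is to clip the forecast after prediction. This is a post-processing sketch, not something the current model does, and `clip_at_zero` is a hypothetical helper name. Prophet also documents a saturating minimum via a `floor` column, but that requires logistic growth with a `cap`, so clipping is the lighter option.

```python
from typing import Iterable, List


def clip_at_zero(predictions: Iterable[float]) -> List[float]:
    """Replace negative predicted counts with 0.0 and leave the rest unchanged."""
    return [max(0.0, float(p)) for p in predictions]


# Made-up yhat values, including one negative prediction.
clipped = clip_at_zero([12.4, -3.1, 0.0, 7.9])
print(clipped)  # [12.4, 0.0, 0.0, 7.9]
```

Clipping changes the reported values only; the fitted model is unchanged, so the MSE should be recomputed on the clipped values if it is reported alongside them.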