Course project for UCLA CS145, Introduction to Data Mining
The main driver script is run.py. It takes a single argument, the ML model type, which must be one of [NN, PR, AR, ARIMA, ARMA, MA, SARIMA]:
PR: Polynomial Regression
NN: Neural Network
AR: Autoregression
MA: Moving Average
ARIMA: Autoregressive Integrated Moving Average
ARMA: Autoregressive Moving Average
SARIMA: Seasonal ARIMA
For example:
python run.py NN
This generates a result CSV file in the Kaggle submission format. To change the configuration, edit the constants declared in run.py, polynomial_regression.py, neural_network.py, or prediction_model.py (the superclass of all prediction models).
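The single-argument interface described above could be sketched as follows. This is only an illustration and not the actual contents of run.py; the names `MODEL_TYPES` and `parse_model_type` are hypothetical.

```python
import argparse

# Model type codes listed in this README.
MODEL_TYPES = ["NN", "PR", "AR", "ARIMA", "ARMA", "MA", "SARIMA"]


def parse_model_type(argv=None):
    """Parse the single positional argument and return the chosen model type.

    Hypothetical helper: argparse rejects any value outside MODEL_TYPES
    and exits with a usage message.
    """
    parser = argparse.ArgumentParser(description="Run one prediction model.")
    parser.add_argument("model", choices=MODEL_TYPES, help="model type to run")
    return parser.parse_args(argv).model


# Equivalent to running: python run.py NN
selected = parse_model_type(["NN"])
print(selected)
```

Restricting the argument with `choices` means a typo such as `python run.py NM` fails immediately instead of partway through a run.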
To transform input data, run:
python transform_input.py
This creates one CSV file per state, each containing that state's daily reports. Miscellaneous states in the input data set are ignored.
NOTE: Each time this script is run, all the <state>.csv files are truncated and refilled from the daily report files.
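The per-state split could look roughly like the sketch below. It is not the code in transform_input.py: the function name `split_reports_by_state`, its parameters, and the idea of passing in an explicit set of states to keep are all assumptions made for illustration.

```python
import csv
import os
from collections import defaultdict


def split_reports_by_state(report_paths, out_dir, states):
    """Write one <state>.csv per state in `states`, rebuilt from all daily reports.

    Hypothetical helper; returns the sorted list of states that were written.
    """
    rows_by_state = defaultdict(list)
    fieldnames = []
    for path in report_paths:
        with open(path, newline="") as f:
            reader = csv.DictReader(f)
            # Remember every column seen, in first-seen order.
            for name in reader.fieldnames or []:
                if name not in fieldnames:
                    fieldnames.append(name)
            for row in reader:
                state = row.get("Province_State")
                if state in states:  # rows for any other entry are skipped
                    rows_by_state[state].append(row)

    os.makedirs(out_dir, exist_ok=True)
    for state, rows in rows_by_state.items():
        # Mode "w" truncates any existing file before it is refilled.
        out_path = os.path.join(out_dir, f"{state}.csv")
        with open(out_path, "w", newline="") as f:
            writer = csv.DictWriter(
                f, fieldnames=fieldnames, restval="", extrasaction="ignore"
            )
            writer.writeheader()
            writer.writerows(rows)
    return sorted(rows_by_state)
```

Because every output file is opened in write mode, rerunning the function rebuilds each file from scratch rather than appending duplicate rows.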
Data format (copied from https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data)
USA daily state reports (csse_covid_19_daily_reports_us)
This table contains an aggregation of each USA State level data.
To create the test.csv file, run:
python create_test_csv.py
To get MAPE of the prediction vs truth data, run:
python mape.py
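For reference, MAPE (mean absolute percentage error) can be computed as in the sketch below. This is not the code in mape.py; the function signature is hypothetical, and skipping pairs whose true value is zero is an assumption made here to avoid division by zero.

```python
def mape(truth, prediction):
    """Mean absolute percentage error, in percent, over paired values."""
    if len(truth) != len(prediction):
        raise ValueError("truth and prediction must have the same length")
    # Assumption: skip pairs whose true value is zero, where the ratio is undefined.
    pairs = [(t, p) for t, p in zip(truth, prediction) if t != 0]
    if not pairs:
        raise ValueError("no non-zero truth values to compare against")
    return 100.0 * sum(abs((t - p) / t) for t, p in pairs) / len(pairs)


# Each prediction here is off by 10% of the true value.
error = mape([100, 200], [110, 180])
print(error)
```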
Daily report files are named MM-DD-YYYY.csv, with the date in UTC. Each file contains the following fields:
- Province_State - The name of the State within the USA.
- Country_Region - The name of the Country (US).
- Last_Update - The most recent date the file was pushed.
- Lat - Latitude.
- Long_ - Longitude.
- Confirmed - Aggregated case count for the state.
- Deaths - Aggregated death toll for the state.
- Recovered - Aggregated Recovered case count for the state.
- Active - Aggregated confirmed cases that have not been resolved (Active cases = total cases - total recovered - total deaths).
- FIPS - Federal Information Processing Standards code that uniquely identifies counties within the USA.
- Incident_Rate - cases per 100,000 persons.
- People_Tested - Total number of people who have been tested.
- People_Hospitalized - Total number of people hospitalized. (Nullified on Aug 31, see Issue #3083)
- Mortality_Rate - Number of recorded deaths * 100 / Number of confirmed cases.
- UID - Unique Identifier for each row entry.
- ISO3 - Officially assigned country code identifiers.
- Testing_Rate - Total test results per 100,000 persons. The "total test results" are equal to "Total test results (Positive + Negative)" from COVID Tracking Project.
- Hospitalization_Rate - US Hospitalization Rate (%): = Total number hospitalized / Number cases. The "Total number hospitalized" is the "Hospitalized – Cumulative" count from COVID Tracking Project. The "hospitalization rate" and "Total number hospitalized" are only presented for those states which provide cumulative hospital data. (Nullified on Aug 31, see Issue #3083)
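Several of the fields above are derived from others, and the file name encodes the report date. The sketch below works through those relationships using the formulas given in the list. The function names are hypothetical, and `population` is an assumed input that is not a column in the data set; it is only needed to illustrate the "per 100,000 persons" scaling.

```python
from datetime import datetime


def report_date(filename):
    """Parse the date from a daily report name such as '04-12-2020.csv'."""
    return datetime.strptime(filename, "%m-%d-%Y.csv").date()


def derived_fields(confirmed, deaths, recovered, population):
    """Recompute derived columns from the raw counts (illustrative only)."""
    return {
        # Active cases = total cases - total recovered - total deaths
        "Active": confirmed - recovered - deaths,
        # Number of recorded deaths * 100 / number of confirmed cases
        "Mortality_Rate": deaths * 100 / confirmed if confirmed else None,
        # Cases per 100,000 persons (population is an assumed input)
        "Incident_Rate": confirmed * 100_000 / population if population else None,
    }


day = report_date("04-12-2020.csv")
fields = derived_fields(confirmed=1000, deaths=20, recovered=300, population=500_000)
print(day, fields)
```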
For more details on the neural network model, refer to neural_network.py.
This class trains a neural network and uses GridSearch to find the best parameters.
You can add or remove parameters and their values to search for the optimal NN settings. Please modify only the following in neural_network.py:
self.parameters = {
    'hidden_layer_sizes': [(80, 80), (70, 70), (60, 60)],
    'activation': ['relu'],
    'solver': ['adam'],
    'learning_rate': ['adaptive'],
    'learning_rate_init': [0.0001, 0.001, 0.005, 0.0005]
}
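To get a feel for how much work this grid implies, the sketch below enumerates every combination it contains. It assumes that GridSearch means an exhaustive search over the Cartesian product of the listed values; `expand_grid` is a hypothetical helper and not part of neural_network.py.

```python
from itertools import product

# Same grid as above, copied here so the example is self-contained.
parameters = {
    "hidden_layer_sizes": [(80, 80), (70, 70), (60, 60)],
    "activation": ["relu"],
    "solver": ["adam"],
    "learning_rate": ["adaptive"],
    "learning_rate_init": [0.0001, 0.001, 0.005, 0.0005],
}


def expand_grid(grid):
    """Yield one dict per combination of values in the grid."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


candidates = list(expand_grid(parameters))
print(len(candidates))  # 3 * 1 * 1 * 1 * 4 = 12 combinations
```

Each value added to a list multiplies the number of candidate models to train, so widening several parameters at once can make the search much slower.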