## Deadline + Late Penalty

**Note:** This project will take considerable time to complete, so we strongly recommend that you start working on it as early as possible.


* The submission deadline for the project is **20:59:59 on 23rd Apr, 2021 (Sydney Time)**.
* **Late Penalty: 10% on day 1 and 20% on each subsequent day.**
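
As a rough illustration of the rule above, the sketch below computes a penalty percentage from the number of days late. It assumes that the daily penalties add up and that the total is capped at 100%; neither assumption is stated in this spec, and `late_penalty_percent` is a hypothetical helper name.

```python
def late_penalty_percent(days_late):
    """Cumulative late penalty in percent (assumed additive, capped at 100)."""
    if days_late <= 0:
        return 0
    # 10% for the first late day, 20% for each subsequent late day
    penalty = 10 + 20 * (days_late - 1)
    return min(penalty, 100)


for days in range(0, 4):
    print(days, late_penalty_percent(days))
```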

## Instructions
1. This notebook contains instructions for **COMP9318-Project**.

2. You are required to complete your implementation in a file `submission.py` provided along with this notebook.

3. You are not allowed to print unnecessary output. We will not consider anything printed to the screen. All results should be returned in the appropriate data structures via the corresponding functions.

4. You can submit your implementation for the **Project** via the following link: http://kg.cse.unsw.edu.au/submit/

5. For each part, we have provided you with detailed instructions. In case of any problem, you can post your query on Ed.

6. You are allowed to add other functions (you may have to for this project), but you are not allowed to define global variables. **Only functions are allowed** in `submission.py`.

7. You are allowed to import other modules, but only from the following modules/libraries.
 * **Scikit-Learn 0.24.1**
 * **Numpy 1.19.5**
 * **Pandas 1.1.5**
 * **Python 3.6.5**

 Importing other modules will lead to errors.

8. For some parts of the project, i.e., **Project-Part1**, we will provide immediate feedback on your submission **based on the dataset provided with the specs (1st April 2021 onwards)**. You can view the feedback using the online submission portal on the same day.

9. You are allowed a limited number of Feedback Attempts **(15 Attempts for each Student)**. We will use your **LAST** submission for Final Evaluation. Please **DO NOT** forget to submit **Report.pdf** along with your last submission.

## Project-Part1: Predict COVID-19 Confirmed Cases (45 Points)

With the world exposed to COVID-19, this project aims to model the time series of COVID-19 cases as a function of past COVID-19 cases and weather conditions.

In this question, you are required to formulate a model that can predict confirmed COVID-19 cases for a state $X$ by analyzing the time series data of the COVID-19 cases along with the weather conditions. Specifically, you are required to complete the function `predict_COVID_part1()` in the file `submission.py`. The inputs and the outputs of the function are explained as follows:

## Input and Output formats

### Inputs:

1. `svm_model`, Scikit-learn's Support Vector Regression model with hyper-parameters initialized. **Note** that for part1 of the project, you are not required to change the model or its hyper-parameters; we recommend using the hyper-parameter settings provided with the model input.

* `train_df`, pandas dataframe corresponding to the csv file `COVID_train_data.csv`. The format of all fields of the data set is explained below. This dataset is intended for model training.

* `train_labels_df`, pandas dataframe corresponding to the csv file: `COVID_train_labels.csv`. It comprises the number of COVID-19 confirmed cases for each single day. This dataset is intended for model training.

* `past_cases_interval`, an integer value representing number of past days of COVID-19 cases to consider for the model training.

* `past_weather_interval`, an integer value representing the number of past days of weather conditions to consider for the model training.

* `test_feature`, a feature vector encompassing a subset of features from the file `test_features.csv`, used for predicting future COVID-19 cases. We provide details of the file `test_features.csv` in the following section.

### Outputs:

Based on the feature space used for model training (i.e., features constructed from `train_df`), you are required to select the corresponding subset of features from the test features `test_features.csv` and predict the number of COVID-19 cases for each day.

**NOTE: You should use math.floor(x) to convert the prediction result to an integer.**
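
Flooring and rounding can give different integers, which affects the reported error. The short illustration below uses made-up values; the only library call is `math.floor` from the standard library.

```python
import math

raw_predictions = [95.9, 120.0, 87.6]   # made-up model outputs

floored = [math.floor(p) for p in raw_predictions]
rounded = [round(p) for p in raw_predictions]

print(floored)   # [95, 120, 87]
print(rounded)   # [96, 120, 88]
```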

### Data Format Explained

#### 1. `train_df:`

The `train_df` dataframe encompasses time series data of weather conditions and COVID-19 cases for the state $X$ in the increasing order of time. The contents of the dataframe are explained below:

1. **day**: The day number of each observation in `train_df`. The day numbers increase in order of time.

* **temp**: Temperature of state $X$ in $^{\circ}F$. We provide $maximum$, $average$ and $minimum$ temperature of state $X$ in the fields `max_temp`, `avg_temp` and `min_temp` respectively.

* **dew**: Dew point of state $X$ in $^{\circ}F$. We provide $maximum$, $average$ and $minimum$ dew point of state $X$ in the fields `max_dew`, `avg_dew` and `min_dew` respectively.

* **humid**: % Humidity of state $X$. We provide $maximum$, $average$ and $minimum$ humidity of state $X$ in the fields `max_humid`, `avg_humid` and `min_humid` respectively.

* **wind_speed**: Wind speed in state $X$ measured in mph. We provide $maximum$, $average$ and $minimum$ wind speed of state $X$ in the fields `max_wind_speed`, `avg_wind_speed` and `min_wind_speed` respectively.

* **pressure**: Sea level pressure of state $X$ measured in $Hg$. We provide $maximum$, $average$ and $minimum$ sea level pressure of state $X$ in the fields `max_pressure`, `avg_pressure` and `min_pressure` respectively.

* **precipitation**: Total daily precipitation of state $X$ measured in inches, represented as variable `precipitation`.

* **dailly_cases**: Total number of confirmed COVID-19 cases for state $X$ reported each single day.


#### 2. `train_labels_df:`

The `train_labels_df` encompasses the number of COVID-19 confirmed cases of state $X$ being reported every single day. The contents of this file are shown below.

1. **day**: The day number of each observation in `train_labels_df`. The day numbers increase in order of time.

* **dailly_cases**: Total number of confirmed COVID-19 cases for state $X$ reported each single day.


#### 3. `test_features.csv:`

The file `test_features.csv` contains all candidate test features, i.e., the weather conditions and the COVID-19 cases for the past `N` days (`N=30`). For each past day, the corresponding day-id is attached to the feature name. For this project, you may use all of the features or any subset of them, as required.


**Note:** 

* We restrict the look-back window for past features (weather and COVID-19 cases) to at most `N=30` days.
* The features in the file `test_features.csv` will follow the same sequential order unless specified otherwise.
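
One way to pick a matching subset from a test row is to select by column name. The sketch below assumes that test feature names follow a `<feature>-<days_back>` pattern (e.g. `max_temp-10`), as in the example implementation later in this notebook; the actual header of `test_features.csv` should be checked before relying on this. The data is synthetic and `select_lags` is a hypothetical helper.

```python
import pandas as pd

N = 30
all_features = ["max_temp", "max_dew", "max_humid", "dailly_cases"]

# Synthetic test row: every feature for the past N days, plus 'day'
names = ["day"] + [f"{f}-{i}" for f in all_features for i in range(N, 0, -1)]
test_feature = pd.Series(range(len(names)), index=names, dtype=float)


def select_lags(row, features, interval):
    # Keep the most recent `interval` days of each feature, oldest first
    wanted = [f"{f}-{i}" for f in features for i in range(interval, 0, -1)]
    return row[wanted]


subset = select_lags(test_feature, all_features, 10)
print(subset.shape)   # (40,)
```

Selecting by name rather than by position keeps the test columns in the same order as the training columns, provided both are generated by the same naming rule.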

## How to construct feature matrix
Because we restrict the look-back for past weather and past cases to `N=30` days, a feature matrix constructed from all the parameters in the training data will have the shape `162 x 510`. In this project we require you to formulate the COVID-19 case prediction as a regression problem.
Specifically, you will use the information for `days: 1,...,t-1` to predict the cases for `day = t`, as in the simple linear regression model below (shown for illustration purposes only):


<center>$\sum_{i=1}^{t-1}a_{i}*max\_temp_{i} +...+ \sum_{i=1}^{t-1}p_{i}*precipitation_{i} + \sum_{i=1}^{t-1}q_{i}*Cases_{i}= Cases_{day = t}$</center>
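
The `510` columns follow from the field list given earlier: 15 max/avg/min weather fields plus `precipitation` and `dailly_cases` give 17 per-day values, each repeated for `N=30` past days. The sketch below reproduces that count; the column ordering it generates is an assumption, not necessarily the order used in the data files.

```python
N = 30
quantities = ["temp", "dew", "humid", "wind_speed", "pressure"]
stats = ["max", "avg", "min"]

# 5 quantities x 3 statistics, plus two single-valued fields
per_day = [f"{s}_{q}" for q in quantities for s in stats]
per_day += ["precipitation", "dailly_cases"]

# One column per (field, days_back) pair
lagged = [f"{name}-{i}" for name in per_day for i in range(N, 0, -1)]

print(len(per_day))   # 17
print(len(lagged))    # 510
```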

### Training Feature Matrix:

For the project-part1, we require you to form a feature matrix (`x_train`) encompassing the maximum values of the following subset of weather information and the past cases, in the same order as mentioned below: 

`[max_temp, max_dew, max_humid, past_cases]`

For the input parameters `past_weather_interval=10` and `past_cases_interval=10`, the training matrix formed from `train_df` using the features mentioned above (i.e., `[max_temp, max_dew, max_humid, past_cases]`) will have the shape `162x40`.
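
A vectorized way to assemble such a matrix is to shift each column by the required number of days. The following is a minimal sketch on synthetic data, not the reference implementation: `build_x_train` is a hypothetical helper, the 50-row frame is made up, and skipping the first `N=30` days mirrors the example implementation in this notebook.

```python
import numpy as np
import pandas as pd


def build_x_train(train_df, past_weather_interval, past_cases_interval, n_skip=30):
    cols = {}
    for feature in ["max_temp", "max_dew", "max_humid"]:
        for i in range(past_weather_interval, 0, -1):
            # shift(i) places the value from i days earlier on the current row
            cols[f"{feature}-{i}"] = train_df[feature].shift(i)
    for i in range(past_cases_interval, 0, -1):
        cols[f"dailly_cases-{i}"] = train_df["dailly_cases"].shift(i)
    # Drop the first n_skip days (the N=30 look-back limit)
    return pd.DataFrame(cols).iloc[n_skip:].to_numpy()


rng = np.random.default_rng(0)
n_days = 50   # synthetic length, chosen only for the example
train_df = pd.DataFrame({
    "day": np.arange(1, n_days + 1),
    "max_temp": rng.uniform(40, 90, n_days),
    "max_dew": rng.uniform(30, 70, n_days),
    "max_humid": rng.uniform(50, 100, n_days),
    "dailly_cases": rng.integers(0, 500, n_days),
})

x_train = build_x_train(train_df, 10, 10)
print(x_train.shape)   # (20, 40)
```

With 50 synthetic days and the first 30 skipped, 20 rows remain; each row has 3 x 10 weather columns followed by 10 past-case columns.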

## How to run your implementation Project-Part1 (Example)

In [1]:
```python
import submission
import numpy as np
import pandas as pd
from sklearn.svm import SVR
from sklearn.metrics import mean_absolute_error
import math

## Parameter settings
past_cases_interval = 10
past_weather_interval = 10


## Read training data
train_file = './data/COVID_train_data.csv'
train_df = pd.read_csv(train_file)

## Read Training labels
train_label_file = './data/COVID_train_labels.csv'
train_labels_df = pd.read_csv(train_label_file)


## Read testing Features
test_fea_file = './data/test_features.csv'
test_features = pd.read_csv(test_fea_file)


## Set hyper-parameters for the SVM Model
svm_model = SVR()
svm_model.set_params(**{'kernel': 'rbf', 'degree': 1, 'C': 5000,
                        'gamma': 'scale', 'coef0': 0.0, 'tol': 0.001, 'epsilon': 10})
```

```
SVR(C=5000, cache_size=200, coef0=0.0, degree=1, epsilon=10, gamma='scale',
  kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False)
```

In [2]:
```python
def predict_COVID_part1(svm_model, train_df, train_labels_df, past_cases_interval, past_weather_interval, test_feature):

    # Training rows start at day 31 (N=30 look-back limit)
    x_train = pd.DataFrame(columns=['day'], data=[i for i in range(31, len(train_df) + 1)])
    y_train = train_labels_df.iloc[30:]          # labels from day 31 onwards
    consider_features = ["max_temp", "max_dew", "max_humid"]

    ###### processing train data
    # Row idx of x_train is the day at position idx + 30 in train_df,
    # so the value from i days earlier sits at position idx + 30 - i.
    for feature in consider_features:
        for i in range(past_weather_interval, 0, -1):
            n_col = feature + "-" + str(i)       # name of col, ***-10 to ***-1
            x_train[n_col] = -1                  # init with value -1
            for idx in x_train.index:
                x_train.loc[idx, n_col] = train_df.iloc[idx + 30 - i][feature]  # assign the value

    for i in range(past_cases_interval, 0, -1):
        n_col = "dailly_cases-" + str(i)
        x_train[n_col] = -1                      # init with -1
        for idx in x_train.index:
            x_train.loc[idx, n_col] = train_df.iloc[idx + 30 - i]['dailly_cases']
    train_features = x_train.columns.tolist()

    # drop the col 'day'
    x_train = x_train.drop(["day"], axis=1)
    y_train = y_train["dailly_cases"]
    # convert to np.array
    x_train = np.array(x_train)
    y_train = np.array(y_train)
    # fit model
    svm_model.fit(x_train, y_train)

    ##### processing test data
    # keep only the test features whose names were used for training
    test_fts = test_feature.index.tolist()
    for ft in test_fts:
        if ft not in train_features:
            test_feature = test_feature.drop([ft])
    test_feature = test_feature.drop(['day'])
    x_test = [np.array(test_feature)]            # shape must be (1, n_features)

    # predict() returns a one-element array; take the scalar before flooring
    return math.floor(svm_model.predict(x_test)[0])
```


In [3]:
```python
## Generate Prediction Results
predicted_cases_part1 = []
for idx in range(len(test_features)):
    test_feature = test_features.loc[idx]
    prediction = submission.predict_COVID_part1(svm_model, train_df, train_labels_df,
                                                past_cases_interval, past_weather_interval, test_feature)
    predicted_cases_part1.append(prediction)


print(predicted_cases_part1)
```

TypeError: must be real number, not str

## Project-Part2: (45 Points)

In this part, you are required to formulate a model that can improve the performance of the model proposed in the Project-Part1 by a significant margin in terms of `Mean Absolute Error(MAE)`, explained below:

<br>

<center> $MAE = (\frac{1}{test\_interval})\sum_{i=1}^{test\_interval}\left | prediction_{i} - ground\_truth_{i} \right |$ </center>

<br>
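
For reference, the formula above corresponds to the following few lines of plain Python. The numbers are made up, and `mean_absolute_error_py` is a hypothetical name chosen to avoid clashing with scikit-learn's `mean_absolute_error`.

```python
def mean_absolute_error_py(predictions, ground_truth):
    if len(predictions) != len(ground_truth):
        raise ValueError("both sequences must cover the same test interval")
    # Sum of |prediction_i - ground_truth_i|, divided by the interval length
    total = sum(abs(p - g) for p, g in zip(predictions, ground_truth))
    return total / len(predictions)


predictions = [100, 120, 90, 110]     # made-up values
ground_truth = [110, 115, 95, 110]
print(mean_absolute_error_py(predictions, ground_truth))   # 5.0
```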

For part2, unlike part1, you are allowed to design any approach and/or propose a new model in order to boost the model performance.

Specifically, you are required to complete the method `predict_COVID_part2()` in the file `submission.py`. The inputs and outputs, along with their formats, are defined below:


**Note:** For part-2, you are only allowed to use the Python libraries already explained above.

## Input and Output formats

### Inputs:

1. `train_df`, pandas dataframe corresponding to the csv file `COVID_train_data.csv`. The format of all fields of the data set is explained above. This dataset is intended for model training.

* `train_labels_df`, pandas dataframe corresponding to the csv file `COVID_train_labels.csv`. It comprises the number of COVID-19 confirmed cases for each single day.

* `test_feature`, a feature vector constructed using a subset of features from the file `test_features.csv`, used for predicting future COVID-19 cases.



### Outputs:

Based on the feature space of the model (i.e., features constructed from `train_df`), you are required to select the corresponding subset of features from the file `test_features.csv` and predict the number of COVID-19 cases for each day.

**NOTE: You should use math.floor(x) to convert the prediction result to an integer.**

## How to run your implementation Project-Part2 (Example)

In [6]:
```python
import submission
import numpy as np
import pandas as pd
from sklearn.svm import SVR
from sklearn.metrics import mean_absolute_error


## Read training data
train_file = './data/COVID_train_data.csv'
train_df = pd.read_csv(train_file)

## Read Training labels
train_label_file = './data/COVID_train_labels.csv'
train_labels_df = pd.read_csv(train_label_file)


## Read testing Features
test_fea_file = './data/test_features.csv'
test_features = pd.read_csv(test_fea_file)


## Generate Prediction Results
predicted_cases_part2 = []
for idx in range(len(test_features)):
    test_feature = test_features.loc[idx]
    prediction = submission.predict_COVID_part2(train_df, train_labels_df, test_feature)
    predicted_cases_part2.append(prediction)
```

## Error Computation
We compare the prediction results for each day against the ground truth values to compute the absolute error of each day. Later, we compute the mean over all the absolute error terms corresponding to the `test_interval` to compute the `MAE`.


In [5]:
```python
## MeanAbsoluteError Computation...!

test_label_file = './data/COVID_test_labels.csv'
test_labels_df = pd.read_csv(test_label_file)
ground_truth = test_labels_df['dailly_cases'].to_list()


# scikit-learn's signature is mean_absolute_error(y_true, y_pred)
MeanAbsError = mean_absolute_error(ground_truth, predicted_cases_part1)
print('MeanAbsError = ', MeanAbsError)
```

```
MeanAbsError =  95.1
```


### Evaluation


Your implementation will be tested using multiple different training and test data sets. 

1. For the `project-part1`, we test the correctness of implementation. For a given set of input parameters, you are required to correctly compute the feature vectors and generate the results for the predicted COVID-19 cases.


<center>$
score_{part1} =  \begin{cases}
    \sum_{i=1}^{3}15; & \text{if}\;\; \text{Correctly implemented}\\
    0,              & \text{otherwise}
\end{cases} $</center>

* For the `project-part2`, we will test your implementation in terms of performance improvement compared to the project-part1. We will be using the following linear function to assign scores:


<center>$
score_{part2} = \begin{cases}
    \text{math.floor}(-1.32 * MAE_{avg} + 125.52), & \text{if} \;\; 61.0 \leq MAE_{avg} \leq 95.0\\
    45, & \text{if} \;\; MAE_{avg} < 61.0
\end{cases} $</center>

where $MAE_{avg}$ is the average of the mean absolute errors over $N$ different test data sets, as shown below.

<center>$MAE_{avg} = \frac{1}{N}\sum_{i=1}^{N}MAE_{i}$</center>

**NOTE: For the project part-2, we will be using the same training data as provided along with the specs (i.e., `train_df`)**.
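
The part-2 scoring rule can be sketched as a small function. The spec does not define the score for $MAE_{avg} > 95.0$; returning `0` there is an assumption (the linear term already floors to 0 at 95.0), and `score_part2` is a hypothetical name.

```python
import math


def score_part2(mae_values):
    # MAE_avg: mean of the per-dataset MAE values
    mae_avg = sum(mae_values) / len(mae_values)
    if mae_avg < 61.0:
        return 45
    if mae_avg <= 95.0:
        return math.floor(-1.32 * mae_avg + 125.52)
    return 0   # assumption: this case is not specified in the spec


print(score_part2([70.0, 80.0]))   # MAE_avg = 75.0 gives 26
```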

## BONUS Points 

We will award BONUS points to the top-20 students with the best scores for part-2. The bonus points will be awarded in decreasing order of performance.

* The best performing student will be awarded `10 points`.
* The second best performing student will be awarded `9.5 points`.
* The third best performing student will be awarded `9 points`, and so on.
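
Reading "and so on" as a 0.5-point step per rank, the bonus schedule can be sketched as below. The step size beyond third place and the helper name `bonus_points` are assumptions.

```python
def bonus_points(rank):
    # Ranks 1..20 earn 10, 9.5, 9, ... points; all other ranks earn 0
    if 1 <= rank <= 20:
        return 10 - 0.5 * (rank - 1)
    return 0.0


print([bonus_points(r) for r in (1, 2, 3, 20, 21)])
# [10.0, 9.5, 9.0, 0.5, 0.0]
```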

## Project Submission and Feedback

For project submission and feedback, you are required to submit the following files:

1. Your implementation in a python file `submission.py`.

2. A report `Project.pdf` (**10 points**). You need to write a concise and simple report illustrating:
    - Implementation details of part 1.
    - Implementation details of part 2. Especially, it should include:
        * Comprehensive feature analysis, i.e., which features were used to boost the performance of part 2 compared with part 1, why?
        * What additional techniques were used to augment the performance of the model compared to part1?


**Note:** 
1. Every student will be entitled to **15 Feedback Attempts** (use them wisely). We will use the last submission for final evaluation.
* We will not provide any feedback for the project-part2.
* It is mandatory to submit the report along with the last submission.
* **Students who fail to submit the report will be penalized by 10 points.**