# Advanced Data Science

## Time Series

### Table of contents
0. [Submission instructions](#si)
1. [Understanding the problem](#1)
2. [Data splitting](#2)
3. [EDA](#3)
4. [Feature engineering](#4)
5. [Preprocessing and transformations](#5)
6. [Baseline model](#6)
7. [Different models](#7)
8. [Interpretation and feature importances](#8)
9. [Results on the test set](#9)
10. [Summary of the results](#10)

For this project we will use a dataset from [IPMA](https://www.ipma.pt/en/oclima/series.longas/list.jsp)
* Region: Lisboa (Temperature and Precipitation)


## Goals
The goal of this project is to develop a complete project using Time Series Data. For this, you are expected to implement two final models:
1. Temperature Forecast
2. Predict whether it's going to rain or not

You can forecast the precipitation as well. You should present your predictions for 10 days, a month and a year.

### Submission instructions <a name="si"></a>


- **You may work on this assignment in a group (group size <= 3) and submit your assignment as a group.**
- Below are some instructions on working as a group.
    - The maximum group size is 3.
    - You can choose your own group members.
    - Use group work as an opportunity to collaborate and learn new things from each other.
    - Be respectful to each other and make sure you understand all the concepts in the assignment well.
    - It's your responsibility to make sure that the assignment is submitted by one of the group members before the deadline.
- Submit your work using [GitHub assignment](https://classroom.github.com/a/zBt-9qVV)
- Make sure that your plots/output are rendered properly on GitHub.

## Imports

Add your imports here. Make sure to have an environment with the needed libraries.

In [None]:
# import ...

## Introduction <a name="in"></a>

In this homework you will be working on an open-ended mini-project, where you will put all the different things you have learned about time series so far together to solve an interesting problem.

A few notes and tips when you work on this mini-project:

#### Tips
1. This mini-project is open-ended, and while working on it, there might be some situations where you'll have to use your own judgment and make your own decisions (as you would be doing when you work as a data scientist). Make sure you explain your decisions whenever necessary.
2. **Do not include everything you ever tried in your submission** -- it's fine just to have your final code. That said, your code should be reproducible and well-documented. For example, if you chose your hyperparameters based on some hyperparameter optimization experiment, you should leave in the code for that experiment so that someone else could re-run it and obtain the same hyperparameters, rather than mysteriously just setting the hyperparameters to some (carefully chosen) values in your code.
3. If you realize that you are repeating a lot of code, try to organize it into functions. Clear presentation of your code, experiments, and results is key to success in this lab. You may use code from lecture notes or previous lab solutions with appropriate attribution.


#### A final note
It'll be up to you to decide when you're "done" -- in fact, this is one of the hardest parts of real projects. But please don't spend WAY too much time on this... perhaps "a few hours" (15-20 hours???) is a good guideline for this project.

## 1. Pick your problem and explain the prediction problem <a name="1"></a>
<hr>
rubric={points:10}

You can choose a problem and use your own dataset to develop this project. If that is the case, share with the professor the problem and dataset.

Or, you can work with the proposed problem: A weather dataset from IPMA.
You can download the dataset from [IPMA Long Series](https://www.ipma.pt/en/oclima/series.longas/list.jsp). Choose the Lisbon option. This dataset contains 4 files (`Temperature`, `Precipitation`, `Pressure` and `Variables List`), and the goal is to forecast the temperature for the next 10 days, one month and one year, and to predict whether it is going to rain or not. `Temperature` is a dataset with almost 60000 examples and 5 columns. `Precipitation` is a dataset with around 56000 examples and 4 columns; it shows the daily precipitation in `mm`. `Pressure` has around 52000 examples and 8 columns. You can use the information from all the datasets together.
Beware of the notes in each file, where IPMA describes the transformations it has applied to the data.

**Your tasks:**

1. Spend some time understanding the problem and what each feature means. You can find this information in the documentation on [IPMA Long Series](https://www.ipma.pt/en/oclima/series.longas/list.jsp).
2. Write a few sentences on your initial thoughts on the problem and the dataset.
3. Download the dataset and read it as a pandas dataframe.
4. Prepare the data to build a working dataset with all the columns that you think might be important.
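
As a starting point for tasks 3 and 4, here is a minimal sketch of reading two daily files and joining them on the date. The in-memory CSV text and the column names (`date`, `tmax`, `tmin`, `precip_mm`) are hypothetical stand-ins; the real IPMA files have their own layout, separators and header notes, so the `read_csv` arguments will need adjusting. The rain threshold (any precipitation above 0 mm) is also an assumption you should justify or change.

```python
import io

import pandas as pd

# Hypothetical stand-ins for the downloaded files (not the real IPMA layout).
temperature_csv = io.StringIO(
    "date,tmax,tmin\n2000-01-01,14.2,8.1\n2000-01-02,13.5,7.9\n2000-01-03,15.0,9.3\n"
)
precipitation_csv = io.StringIO(
    "date,precip_mm\n2000-01-01,0.0\n2000-01-02,4.6\n2000-01-03,0.2\n"
)

temperature = pd.read_csv(temperature_csv, parse_dates=["date"], index_col="date")
precipitation = pd.read_csv(precipitation_csv, parse_dates=["date"], index_col="date")

# Align both tables on the daily date index and keep chronological order.
df = temperature.join(precipitation, how="outer").sort_index()

# Binary target for the classification task (assumed threshold: > 0 mm).
df["rain"] = (df["precip_mm"] > 0).astype(int)

print(df)
```

An outer join keeps days that are missing in one of the files, which makes gaps visible as `NaN` instead of silently dropping them.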

## 2. Data splitting <a name="2"></a>
<hr>
rubric={points:10}

**Your tasks:**

1. Split the data into train and test considering the forecast window (10 days, one month, one year).

> If your computer cannot handle the dataset, consider using a smaller amount of historical data
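
One possible way to split chronologically is to hold out the last block of days for each forecast window. The sketch below uses a synthetic daily series and a hypothetical helper `split_by_horizon`; approximating one month as 30 days and one year as 365 days is an assumption.

```python
import numpy as np
import pandas as pd

# Synthetic daily temperatures standing in for the real data.
idx = pd.date_range("2015-01-01", "2020-12-31", freq="D")
rng = np.random.default_rng(0)
doy = idx.dayofyear.to_numpy()
df = pd.DataFrame(
    {"temp": 17 + 6 * np.sin(2 * np.pi * doy / 365.25) + rng.normal(0, 1.5, len(idx))},
    index=idx,
)


def split_by_horizon(data, horizon_days):
    """Hold out the last `horizon_days` rows as the test set (no shuffling)."""
    return data.iloc[:-horizon_days], data.iloc[-horizon_days:]


horizons = {"10 days": 10, "one month": 30, "one year": 365}
for name, horizon in horizons.items():
    train, test = split_by_horizon(df, horizon)
    print(name, "| train ends:", train.index.max().date(),
          "| test starts:", test.index.min().date(), "| test size:", len(test))
```

The split is never shuffled: every training observation is older than every test observation, which is what a real forecast would face.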

## 3. EDA <a name="3"></a>
<hr>
rubric={points:15}

**Your tasks:**

1. Perform exploratory data analysis on the train set.
2. Include at least two summary statistics and two visualizations that you find useful, and accompany each one with a sentence explaining it.
3. Summarize your initial observations about the data.
4. Pick appropriate metric/metrics for assessment.
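
For task 4, a small sketch of how candidate metrics could be computed with `scikit-learn`. The choice of MAE and RMSE for the forecast and F1 for the rain classifier is only an illustration, and the arrays are made-up toy values, not results.

```python
import numpy as np
from sklearn.metrics import f1_score, mean_absolute_error, mean_squared_error

# Toy values for illustration only.
temp_true = np.array([15.0, 16.0, 14.0, 13.0])
temp_pred = np.array([14.0, 17.0, 14.0, 15.0])

mae = mean_absolute_error(temp_true, temp_pred)
rmse = np.sqrt(mean_squared_error(temp_true, temp_pred))

rain_true = np.array([0, 0, 1, 1, 0, 1])
rain_pred = np.array([0, 1, 1, 0, 0, 1])

f1 = f1_score(rain_true, rain_pred)

print(f"MAE={mae:.2f} RMSE={rmse:.2f} F1={f1:.2f}")
# prints: MAE=1.00 RMSE=1.22 F1=0.67
```

MAE and RMSE are in the same unit as the target (degrees), and RMSE penalizes large errors more heavily. If rainy days are the minority class, accuracy alone can look good for a model that never predicts rain, which is why a metric such as F1 may be more informative.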

## 4. Feature engineering <a name="4"></a>
<hr>
rubric={points:5}

**Your tasks:**

1. Carry out feature engineering. In other words, extract new features relevant for the problem and work with your new feature set in the following exercises. You may have to go back and forth between feature engineering and preprocessing.
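
Typical time series features include lagged values, rolling statistics and calendar encodings. The sketch below shows one way to build them with pandas; the helper `add_features`, the lag choices and the column names are hypothetical, and the input is a synthetic series.

```python
import numpy as np
import pandas as pd

# Synthetic daily series standing in for the real temperature column.
idx = pd.date_range("2020-01-01", periods=400, freq="D")
df = pd.DataFrame({"temp": np.linspace(10, 20, len(idx))}, index=idx)


def add_features(data, target="temp", lags=(1, 7, 365)):
    """Add lag, rolling and calendar features; drop rows with missing lags."""
    out = data.copy()
    for lag in lags:
        out[f"{target}_lag{lag}"] = out[target].shift(lag)
    # Rolling mean over the previous 7 days; shift(1) excludes the current day.
    out[f"{target}_roll7"] = out[target].shift(1).rolling(7).mean()
    # Cyclical encoding of the day of year captures the yearly seasonality.
    doy = out.index.dayofyear.to_numpy()
    out["doy_sin"] = np.sin(2 * np.pi * doy / 365.25)
    out["doy_cos"] = np.cos(2 * np.pi * doy / 365.25)
    out["month"] = out.index.month
    return out.dropna()


features = add_features(df)
print(features.head())
```

Keep the forecast horizon in mind: a lag shorter than the horizon (for example `temp_lag1` when forecasting one year ahead) is not observed at prediction time, so it either has to be filled recursively with earlier predictions or replaced by longer lags.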

## 5. Preprocessing and transformations <a name="5"></a>
<hr>
rubric={points:5}

**Your tasks:**

1. Identify different feature types and the transformations you would apply on each feature type.
2. Define a column transformer, if necessary.
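
If a column transformer turns out to be useful, it could look roughly like this. The feature names and the chosen transformations (median imputation and scaling for numeric columns, one-hot encoding for the month) are assumptions for illustration, not a prescribed solution.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny hypothetical feature table.
X = pd.DataFrame(
    {
        "temp_lag1": [12.1, 13.4, np.nan, 15.0],
        "pressure": [1015.2, 1012.8, 1009.5, 1020.1],
        "month": [1, 1, 2, 2],
    }
)

numeric_features = ["temp_lag1", "pressure"]
categorical_features = ["month"]

preprocessor = ColumnTransformer(
    [
        ("num", make_pipeline(SimpleImputer(strategy="median"), StandardScaler()),
         numeric_features),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
    ]
)

X_transformed = preprocessor.fit_transform(X)
print(X_transformed.shape)
```

Fitting the transformer on the training portion only, ideally inside a pipeline with the model, keeps statistics such as the scaling mean from leaking information from later periods.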

## 6. Baseline model <a name="6"></a>
<hr>
rubric={points:5}

We have seen many different types of forecasting models. You should pick a simple one to be the baseline.

**Your tasks:**
1. Try a simple forecast model to predict the temperature and report results.
2. Try `scikit-learn`'s baseline model to predict the rainy days and report results.
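
A sketch of what the two baselines could look like, on synthetic data. The seasonal naive forecast (repeat the value observed 365 days earlier) is just one possible simple model, and `DummyClassifier` with `strategy="most_frequent"` is one of several strategies it supports.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.metrics import mean_absolute_error

# Synthetic daily temperature and rain indicator.
idx = pd.date_range("2018-01-01", "2020-12-31", freq="D")
rng = np.random.default_rng(42)
doy = idx.dayofyear.to_numpy()
temp = pd.Series(
    17 + 6 * np.sin(2 * np.pi * doy / 365.25) + rng.normal(0, 1.5, len(idx)), index=idx
)
rain = pd.Series((rng.random(len(idx)) < 0.3).astype(int), index=idx)

horizon = 30  # must be <= 365 so the shifted values come from the training period
test_temp = temp.iloc[-horizon:]

# Seasonal naive forecast: the value observed 365 days before each test day.
seasonal_naive = temp.shift(365).loc[test_temp.index]
mae_naive = mean_absolute_error(test_temp, seasonal_naive)

# DummyClassifier ignores the features, so a placeholder column is enough.
X_dummy = np.zeros((len(rain), 1))
dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X_dummy[:-horizon], rain.iloc[:-horizon])
dummy_accuracy = dummy.score(X_dummy[-horizon:], rain.iloc[-horizon:])

print(f"Seasonal naive MAE: {mae_naive:.2f}")
print(f"Dummy accuracy: {dummy_accuracy:.2f}")
```

Any later model should beat these numbers on the same split; if it does not, the added complexity is not paying off.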

## 7. Different models <a name="7"></a>
<hr>
rubric={points:20}

**Your tasks:**
1. Try at least 3 other models aside from the baseline for the forecast and classification tasks.
2. Summarize your results in terms of overfitting/underfitting and fit and score times.
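
To compare models on train scores, validation scores and fit/score times, one option is `cross_validate` with `TimeSeriesSplit`, which keeps every validation fold later in time than its training fold. The two models and the synthetic features below are placeholders, not a recommendation.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit, cross_validate

# Synthetic features and target in chronological order.
rng = np.random.default_rng(0)
n = 600
doy = np.arange(n) % 365
X = pd.DataFrame(
    {"doy_sin": np.sin(2 * np.pi * doy / 365), "doy_cos": np.cos(2 * np.pi * doy / 365)}
)
y = 17 + 6 * X["doy_sin"] + rng.normal(0, 1.5, n)

models = {
    "ridge": Ridge(alpha=1.0),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

cv = TimeSeriesSplit(n_splits=5)
rows = {}
for name, model in models.items():
    scores = cross_validate(
        model, X, y, cv=cv, scoring="neg_mean_absolute_error", return_train_score=True
    )
    # Average fit_time, score_time, test_score and train_score over the folds.
    rows[name] = pd.DataFrame(scores).mean()

results = pd.DataFrame(rows).T
print(results)
```

A large gap between `train_score` and `test_score` suggests overfitting, while two equally poor scores suggest underfitting. Scores are negated MAE, so values closer to zero are better.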

## 8. Interpretation and feature importances <a name="8"></a>
<hr>
rubric={points:10}

**Your tasks:**

1. Use the methods we saw in class (or any other methods of your choice) to examine the most important features of one of the non-linear models.
2. Summarize your observations.
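
Permutation importance is one model-agnostic option for this step. The sketch below fits a random forest on synthetic data with one informative feature and one pure-noise feature (both names are hypothetical) and measures how much the validation score drops when each column is shuffled.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# Synthetic data: the target depends on temp_lag1 only.
rng = np.random.default_rng(0)
n = 400
X = pd.DataFrame(
    {"temp_lag1": rng.normal(17, 5, n), "random_noise": rng.normal(0, 1, n)}
)
y = 0.9 * X["temp_lag1"] + rng.normal(0, 1, n)

# Chronological split: the last 100 rows act as the validation set.
X_train, X_valid = X.iloc[:300], X.iloc[300:]
y_train, y_valid = y.iloc[:300], y.iloc[300:]

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

result = permutation_importance(model, X_valid, y_valid, n_repeats=5, random_state=0)
importances = pd.Series(result.importances_mean, index=X.columns).sort_values(
    ascending=False
)
print(importances)
```

Computing the importances on held-out data rather than on the training set shows which features help the model generalize, not just which ones it memorized.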

## 9. Results on the test set <a name="9"></a>
<hr>
rubric={points:10}

**Your tasks:**

1. Try your best performing model on the test data and report test scores.
2. Do the test scores agree with the validation scores from before? To what extent do you trust your results? Do you think you've had issues with optimization bias?
3. Take one or two test predictions and explain these individual predictions.

## 10. Summary of results <a name="10"></a>
<hr>
rubric={points:10}

Imagine that you want to present the summary of these results to your boss and co-workers.

**Your tasks:**

1. Create a table summarizing important results.
2. Write concluding remarks.
3. Discuss other ideas that you did not try but could potentially improve the performance/interpretability.
4. Report your final test score along with the metric you used at the top of this notebook in the [Submission instructions section](#si).

**PLEASE READ BEFORE YOU SUBMIT:**

When you are ready to submit your assignment do the following:

1. Run all cells in your notebook to make sure there are no errors by doing `Kernel -> Restart Kernel and Clear All Outputs` and then `Run -> Run All Cells`.
2. Notebooks with cell execution numbers out of order or not starting from "1" will have marks deducted. Notebooks without the output displayed may not be graded at all.
3. Export the notebook as a Python script (.py) for feedback purposes on GitHub.
4. Upload the assignment using GitHub by opening a pull request.
5. Make sure that the plots and output are rendered properly in your submitted file. If the .ipynb file is too big and doesn't render on GitHub, also upload a PDF or HTML version alongside the .ipynb.