# Precipitation Forecasting
In this project, we train several types of machine learning models on weather data to predict precipitation.

## Introduction
_Weather forecasting_ uses data about the current state of the atmosphere to predict how it will change. Weather warnings protect lives and property, forecasts improve transportation safety, and precipitation forecasts are important to agriculture. There are many approaches to weather prediction; here we train machine learning models and compare their predictions with the observed values.
#### Study area
Basel is a city in northwestern Switzerland. On average, $51\%$ of the days in a year have more than $0.1$ mm of precipitation, and the total annual precipitation is around $850$ mm.
#### Methodology
First, the data is collected. Then, preprocessing techniques prepare it for the machine learning models. Finally, different machine learning techniques are applied and the accuracy of each is reported.\
The following libraries are used: `numpy`, `pandas`, `matplotlib`, `scikit-learn`, and `keras`.

In [None]:
from getdata import get_data, write_daily_data, read_daily_data
from preprocessing import drop_missing_data, change_resolution_to_daily, split_data, normalization
from visualize import show_histogram

## Dataset
**You can find and download the dataset at [this link](https://www.meteoblue.com/en/weather/archive/export).**
#### About dataset
This dataset contains weather attributes for Basel from *January 2014* to *November 2023* at hourly resolution. The first nine rows hold basic information about the location of the city and the units of measurement, which we do not need.
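As an illustration of skipping those leading rows, here is a minimal sketch using `pandas.read_csv`. It is not the code behind `get_data`; the toy file layout (nine metadata lines followed by a header row) and the column names are assumptions made for the example.

```python
import io

import pandas as pd

# Toy stand-in for the exported file: nine metadata lines, then a header row
# and hourly records. The real export's layout and column names may differ.
metadata = "\n".join(f"meta_{i},value_{i}" for i in range(9))
records = (
    "timestamp,temperature,precipitation\n"
    "20140101T0000,3.1,0.0\n"
    "20140101T0100,2.9,0.2\n"
)
raw_csv = io.StringIO(metadata + "\n" + records)

# skiprows=9 discards the metadata block, so the tenth line becomes the header.
df = pd.read_csv(raw_csv, skiprows=9)
print(df.shape)
```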

In [None]:
data = get_data()
print(data.shape)

At the initial stage, there are $86664$ _samples_ and $7$ parameters.
#### Cleaning dataset
Some rows at the end of the dataset are empty (_missing data_), so we simply drop them. Once they are removed, the final day is no longer complete, so for simplicity we remove that day as well.

In [None]:
missing_data = drop_missing_data(data)
print(missing_data)

As we can see, the number of missing rows is $191$.
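The cleaning step can be sketched with plain `pandas` as follows. This is only an illustration on toy data, not the implementation of `drop_missing_data`; the single column `T` and the date range are made up for the example.

```python
import numpy as np
import pandas as pd

# Toy hourly frame: two full days, 5 valid hours of a third day, then 3 empty rows.
index = pd.date_range("2023-01-01", periods=56, freq=pd.Timedelta(hours=1))
df = pd.DataFrame({"T": np.arange(56, dtype=float)}, index=index)
df.iloc[53:] = np.nan  # trailing empty rows

# 1) Drop rows in which every column is missing.
n_before = len(df)
df = df.dropna(how="all")
n_missing = n_before - len(df)

# 2) Keep only the days that still have all 24 hourly records.
hours_per_day = df.groupby(df.index.normalize())["T"].transform("size")
df = df[hours_per_day == 24]

print(n_missing, len(df))
```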
#### Parameters
In our data, each row represents a sample and each column represents a feature. The columns, all measured hourly, are:
- Temperature (*T*)
- Precipitation Total (*PT*)
- Relative Humidity (*RH*)
- Wind Speed (*WS*)
- Wind Direction (*WD*)
- Cloud Cover Total (*CCT*)
- Mean Sea Level Pressure (*MSLP*)

This list is raw; we apply several operations to it before it is ready for the models.

## Preprocessing
#### Make samples daily
A forecast for a whole day is more general than one for a single hour, so we merge every $24$ hourly samples into one daily sample by taking the mean of each feature, with two exceptions:
- For temperature, we keep the daily _maximum_, _minimum_, and _mean_.
- For precipitation, we take the daily _sum_ instead of the _mean_.
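The aggregation rules above can be sketched with a `pandas` group-by on the calendar day. This is a hedged illustration on toy data, not the body of `change_resolution_to_daily`; the output column names (`T_max`, `T_min`, `T_mean`) are assumptions, and only three of the seven raw columns are shown.

```python
import numpy as np
import pandas as pd

# Toy hourly data for two days; column names are illustrative only.
index = pd.date_range("2023-01-01", periods=48, freq=pd.Timedelta(hours=1))
hourly = pd.DataFrame(
    {
        "T": np.tile(np.arange(24.0), 2),  # 0..23 on each day
        "PT": np.full(48, 0.5),            # 0.5 mm every hour
        "RH": np.full(48, 80.0),
    },
    index=index,
)

# Group by calendar day and apply a different reduction per column.
daily = hourly.groupby(hourly.index.normalize()).agg(
    T_max=("T", "max"),
    T_min=("T", "min"),
    T_mean=("T", "mean"),
    PT=("PT", "sum"),    # precipitation accumulates over the day
    RH=("RH", "mean"),   # the remaining features are averaged
)
print(daily)
```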

In [None]:
daily_data = change_resolution_to_daily(data)
daily_data

Now we have $8$ _features_ and $3603$ _samples_. _PT_ is the target value.\
To save time, we write the daily data to a file.

In [None]:
write_daily_data(daily_data)
_, _, X, y = read_daily_data()

#### Splitting data
We split the data into *train* and *test* sets with a $70/30$ ratio.
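A $70/30$ split can be sketched with scikit-learn's `train_test_split`. Whether `split_data` shuffles the samples or fixes a random seed is not stated here, so the `random_state` below is an assumption, and the arrays are toy data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 samples with 2 features each.
X = np.arange(200).reshape(100, 2)
y = np.arange(100)

# test_size=0.3 reserves 30% of the samples for testing.
# random_state is fixed here only to make the sketch reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)
print(X_train.shape, X_test.shape)
```

For time-ordered data such as daily weather, passing `shuffle=False` keeps the test set strictly after the training set, which may be worth considering.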

In [None]:
X_train, X_test, y_train, y_test = split_data(X, y)
print(X_train, X_test, y_train, y_test)

#### Normalization
In general, many learning algorithms, such as linear models, benefit from standardization of the data set.\
We compute the normalization parameters from the *training set* only and **do not** recalculate them on the *test set*.
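A minimal sketch of this rule, assuming the `normalization` helper performs standardization (zero mean, unit variance) in the way scikit-learn's `StandardScaler` does. The arrays are toy values.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[0.0, 10.0], [2.0, 30.0], [4.0, 50.0]])
X_test = np.array([[2.0, 30.0], [6.0, 70.0]])

scaler = StandardScaler()
# fit_transform learns the mean and standard deviation from the training rows only.
X_train_std = scaler.fit_transform(X_train)
# transform reuses those training statistics; nothing is re-estimated on the test set.
X_test_std = scaler.transform(X_test)

print(scaler.mean_)
print(X_test_std)
```

Because the test set is scaled with the training statistics, its columns generally do not end up with exactly zero mean and unit variance, which is expected.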

In [None]:
X_train_std, X_test_std = normalization(X_train, X_test)

#### Visualize parameters distribution
A [histogram](https://www.investopedia.com/terms/h/histogram.asp#:~:text=A%20histogram%20is%20a%20graph,how%20often%20that%20variable%20appears) is a graph that shows the frequency of numerical data using rectangles. The height of each rectangle represents how often values fall within its range, and the width of the rectangle represents that range of values. Here is the histogram for the features and the target value.
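To make the relation between bins and counts concrete, the numbers behind a histogram can be computed with `numpy.histogram`. This is a small illustration on made-up values and is independent of the `show_histogram` helper.

```python
import numpy as np

values = np.array([0.0, 0.1, 0.2, 1.5, 1.6, 3.9])

# Four equal-width bins over [0, 4]: each bin is one rectangle,
# its count is the rectangle's height and its edges give the width.
counts, edges = np.histogram(values, bins=4, range=(0.0, 4.0))
print(counts)  # [3 2 0 1]
print(edges)   # [0. 1. 2. 3. 4.]
```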

In [None]:
show_histogram(X_test)

After applying normalization we have:

In [None]:
show_histogram(X_test_std)

## Models
For precipitation forecasting, we have two options:
- Quantitative Precipitation Forecast (QPF), which forecasts the amount of precipitation.
- Predicting whether it will rain or not.
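The two options differ only in how the target is built from the daily precipitation total. In the sketch below, the $0.1$ mm threshold for a rainy day is an assumption borrowed from the study-area description, and the values are made up.

```python
import numpy as np

# Daily precipitation totals in mm (toy values).
pt_mm = np.array([0.0, 0.05, 0.1, 0.3, 12.4])

RAIN_THRESHOLD_MM = 0.1  # assumed cut-off for a "rainy" day

y_regression = pt_mm                                        # QPF: predict the amount
y_classification = (pt_mm > RAIN_THRESHOLD_MM).astype(int)  # rain (1) or no rain (0)

print(y_classification)  # [0 0 0 1 1]
```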
### Regression
We use the coefficient of determination, $R^2$, to measure accuracy.
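For reference, with observed values $y_i$, predictions $\hat{y}_i$, and the mean of the observed values $\bar{y}$, the coefficient of determination is

$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$$

A value of $1$ means perfect predictions, $0$ means the model does no better than always predicting $\bar{y}$, and negative values are possible for models that do worse than that.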
#### Linear Model
In this model, the target value is expected to be a linear combination of the features.

The accuracy ($R^2$) of the linear model is about $40$ percent, which is not good.
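For readers who want to reproduce this kind of experiment, a linear model can be fitted and scored as sketched below with scikit-learn. The data is synthetic and nearly linear by construction, so its score says nothing about the result reported above; the weights, noise level, and split point are all made up for the illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic data: the target is a linear combination of three features plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.7 + rng.normal(scale=0.1, size=200)

# Fit on the first 140 samples, score on the remaining 60.
model = LinearRegression().fit(X[:140], y[:140])
r2 = r2_score(y[140:], model.predict(X[140:]))

print(model.coef_.round(2), round(r2, 3))
```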

### Classification

## Conclusion

##### Useful links
The following links were used for the sections named below:
- [Normalization](https://datascience.stackexchange.com/questions/54908/data-normalization-before-or-after-train-test-split)
- [Classification](https://stackoverflow.com/questions/77607029/use-logisticregression-to-predict-precipitation-in-sklearn)