# Precipitation Forecasting
In this project, we train several types of machine learning models on weather data to predict precipitation.

## Introduction
Weather forecasting uses data about the current state of the atmosphere to predict how it will change. Weather warnings protect lives and property, forecasts improve transportation safety, and precipitation forecasting is important to agriculture. There are many approaches to weather prediction. We use machine learning models and compare their predictions with the actual values.
#### Study area
Basel is a city in northwestern Switzerland. On average, 32% of the days in a year are rainy or snowy. The total precipitation is around 840 mm annually. May is the wettest month in Basel, with an average of 98 mm of rain.

## Methodology
The following libraries are used.

In [9]:
import numpy as np
import pandas as pd
# import matplotlib.pyplot as plt
# import sklearn

from preprocessing import get_data, drop_missing_data, change_resolution_to_daily, write_daily_data
# from visualize import plot_temperature

First, the data is collected. Then, preprocessing techniques are applied to prepare the data for the machine learning models. Finally, different machine learning techniques are applied and the accuracy of each is reported.

## Dataset
**You can find and download the dataset [here](https://www.meteoblue.com/en/weather/archive/export).**
#### About dataset
This dataset contains weather attributes for Basel from January 2014 to November 2023 at hourly resolution. The first nine rows hold basic information about the location of the city and the units of measurement, which we do not need.
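
Skipping the nine metadata rows can be sketched as follows. This is a minimal illustration and not the actual implementation of `get_data`: the function name `read_hourly_export`, the miniature in-memory file, and its column names are all assumptions made for the example.

```python
import io

import pandas as pd

# Hypothetical miniature export: nine metadata lines, then a header and hourly rows.
# The layout and column names of the real file may differ.
raw_text = "\n".join(
    [f"metadata line {i}" for i in range(1, 10)]
    + [
        "timestamp,Temperature,Precipitation Total",
        "20140101T0000,3.1,0.0",
        "20140101T0100,2.9,0.1",
    ]
)


def read_hourly_export(source, n_metadata_rows=9):
    """Read the export while skipping the leading metadata rows."""
    return pd.read_csv(source, skiprows=n_metadata_rows)


hourly = read_hourly_export(io.StringIO(raw_text))
print(hourly.shape)
```

Passing an integer to `skiprows` tells pandas to ignore that many lines at the start of the file, so the tenth line is treated as the header.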

In [3]:
data = get_data()
print(f"Number of samples:      {data.shape[0]}")
print(f"Number of features:     {data.shape[1]}")

Number of samples:      86664
Number of features:     7


### Cleaning dataset
Some rows at the end of the dataset are empty (missing data), so we simply drop them. As a result, one day is left incomplete; for simplicity, we remove that day as well.
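
One possible way to express this cleaning step is sketched below. It is not the project's `drop_missing_data`: the function name, the `Year`/`Month`/`Day` key columns, and the toy data are assumptions for illustration. The idea is to drop rows with missing values and then keep only days that still have all 24 hourly rows.

```python
import numpy as np
import pandas as pd


def drop_missing_and_incomplete_days(df, rows_per_day=24):
    """Drop rows with missing values, then drop days with fewer than rows_per_day rows."""
    n_missing = int(df.isna().any(axis=1).sum())
    cleaned = df.dropna()
    day_keys = ["Year", "Month", "Day"]
    # Number of remaining rows in each day, aligned with the rows of `cleaned`.
    rows_in_day = cleaned.groupby(day_keys)[day_keys[0]].transform("count")
    cleaned = cleaned[rows_in_day == rows_per_day]
    return cleaned.reset_index(drop=True), n_missing


# Toy data (assumed): two full days plus a third day with only 3 rows,
# the last 2 of which are missing.
days = np.repeat([1, 2, 3], [24, 24, 3])
toy = pd.DataFrame({
    "Year": 2014,
    "Month": 1,
    "Day": days,
    "Temperature": np.linspace(0.0, 5.0, days.size),
})
toy.loc[toy.index[-2:], "Temperature"] = np.nan

cleaned, n_missing = drop_missing_and_incomplete_days(toy)
print(n_missing, len(cleaned))
```

Unlike the project helper, this sketch returns a new frame instead of modifying its argument in place.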

In [4]:
missing_data = drop_missing_data(data)
print(f"Number of missing rows: {missing_data}")
print(f"Number of samples:      {data.shape[0]}")

Number of missing rows: 191
Number of samples:      86472


In [3]:
import pandas as pd

file_path = "data/daily_data.csv"

weather_df = pd.read_csv(file_path)

print(weather_df)

      Year  Month  Day  Temperature  Precipitation Total  Relative Humidity  \
0     2014      1    1     4.494412             0.058333          90.203694   
1     2014      1    2     5.978995             0.283333          92.044333   
2     2014      1    3     6.586079             0.033333          88.778159   
3     2014      1    4     6.358579             0.687500          94.172092   
4     2014      1    5     4.995662             0.229167          86.622807   
...    ...    ...  ...          ...                  ...                ...   
3599  2023     11    9     7.729829             0.175000          82.235617   
3600  2023     11   10     9.101078             0.387500          76.940065   
3601  2023     11   11     7.350245             0.133333          82.186650   
3602  2023     11   12     5.818162             0.583333          93.582846   
3603  2023     11   13    11.380245             0.700000          91.976036   

      Wind Speed  Cloud Cover Total  
0      16.615

In [5]:
weather_df.head()

Unnamed: 0,Year,Month,Day,Temperature,Precipitation Total,Relative Humidity,Wind Speed,Cloud Cover Total
0,2014,1,1,4.494412,0.058333,90.203694,16.615975,59.916667
1,2014,1,2,5.978995,0.283333,92.044333,20.621631,63.875
2,2014,1,3,6.586079,0.033333,88.778159,22.263927,50.108334
3,2014,1,4,6.358579,0.6875,94.172092,15.272616,66.666667
4,2014,1,5,4.995662,0.229167,86.622807,16.897822,59.383333


In [7]:
X = weather_df.drop(["Year", "Month", "Day", "Precipitation Total"], axis=1) 
y = weather_df["Precipitation Total"] 

In [8]:
X.head()

Unnamed: 0,Temperature,Relative Humidity,Wind Speed,Cloud Cover Total
0,4.494412,90.203694,16.615975,59.916667
1,5.978995,92.044333,20.621631,63.875
2,6.586079,88.778159,22.263927,50.108334
3,6.358579,94.172092,15.272616,66.666667
4,4.995662,86.622807,16.897822,59.383333


In [9]:
y.head()

0    0.058333
1    0.283333
2    0.033333
3    0.687500
4    0.229167
Name: Precipitation Total, dtype: float64

## Features
In our data, each row represents a sample and each column represents a feature. Here is the list of columns:
- Temperature (T)
- Precipitation Total (PT)
- Relative Humidity (RH)
- Wind Speed (WS)
- Wind Direction (WD)
- Cloud Cover Total (CCT)
- Mean Sea Level Pressure (MSLP)

### Make samples daily
Forecasting for a whole day is more general than forecasting for one hour, so we merge every 24 hourly samples to convert the resolution to daily. A good question is how to merge them. One way is to take the mean of each feature.
- For temperature, keeping the maximum, minimum, and mean is better.
- Precipitation should be the sum instead of the mean.
- Now we have 9 columns (features) and 3603 rows (samples).
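
The aggregation rules above can be sketched with a pandas `groupby`. This is only an illustration under assumptions, not the code of `change_resolution_to_daily`: the function name `to_daily`, the hourly column names `T`, `PT`, and `RH`, and the toy values are made up, and only three of the features are included to keep the example short.

```python
import numpy as np
import pandas as pd


def to_daily(hourly):
    """Aggregate hourly rows to one row per day."""
    day_keys = ["Year", "Month", "Day"]
    return hourly.groupby(day_keys).agg(
        MEANT=("T", "mean"),  # temperature: keep mean, maximum, and minimum
        MAXT=("T", "max"),
        MINT=("T", "min"),
        PT=("PT", "sum"),     # precipitation: daily total, not the mean
        RH=("RH", "mean"),    # other features: daily mean
    )


# Toy data (assumed): two days with 24 hourly rows each.
hours = np.arange(48)
hourly = pd.DataFrame({
    "Year": 2014,
    "Month": 1,
    "Day": hours // 24 + 1,
    "T": hours.astype(float),
    "PT": 0.5,
    "RH": 80.0,
})

daily = to_daily(hourly)
print(daily)
```

The result is indexed by `(Year, Month, Day)`, which matches the layout of the daily table shown in this section.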

In [6]:
daily_data = change_resolution_to_daily(data)
daily_data

Year,Month,Day,MEANT,PT,RH,WS,WD,CCT,MSLP,MAXT,MINT
2014,1,1,4.494412,1.4,90.203694,16.615975,166.495278,59.916667,1014.029167,8.720245,0.310245
2014,1,2,5.978995,6.8,92.044333,20.621631,184.099628,63.875000,1008.670833,9.240245,3.530245
2014,1,3,6.586079,0.8,88.778159,22.263927,183.268500,50.108334,1013.070833,10.400246,2.340245
2014,1,4,6.358579,16.5,94.172092,15.272616,158.277724,66.666667,1007.625000,8.480246,4.600245
2014,1,5,4.995662,5.5,86.622807,16.897822,225.090521,59.383333,1010.754167,7.480245,0.440245
...,...,...,...,...,...,...,...,...,...,...,...
2023,11,8,7.268995,0.0,82.063136,17.965620,202.643658,33.016667,1019.379167,12.380245,3.760245
2023,11,9,7.729829,4.2,82.235617,25.143419,196.132513,69.020834,1010.795833,10.380245,3.690246
2023,11,10,9.101078,9.3,76.940065,27.342357,228.518643,61.875000,1005.745833,10.670245,7.960245
2023,11,11,7.350245,3.2,82.186650,22.030794,242.243293,49.875000,1010.391667,8.660245,4.260245


To save time, we write the daily data to a file.
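
A minimal sketch of such a round trip is shown below, assuming the data is stored as CSV. The implementation of `write_daily_data` is not shown in this notebook, so the values and the temporary file location here are placeholders for illustration.

```python
import os
import tempfile

import pandas as pd

# Toy daily frame (assumed values) indexed by (Year, Month, Day).
index = pd.MultiIndex.from_tuples(
    [(2014, 1, 1), (2014, 1, 2)], names=["Year", "Month", "Day"]
)
daily = pd.DataFrame({"MEANT": [4.5, 6.0], "PT": [1.4, 6.8]}, index=index)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "daily_data.csv")
    daily.to_csv(path)            # index levels are written as ordinary columns
    restored = pd.read_csv(path)  # read back with a default integer index

print(restored.columns.tolist())
```

After reading the file back, `Year`, `Month`, and `Day` are regular columns rather than index levels.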

In [None]:
write_daily_data(daily_data)

### Visualize parameters distribution


## Models
Let's look at each model in depth.

In [None]:
import torch
import torch.nn as nn

# Run on the GPU when one is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


class LogReg(nn.Module):
    """A single linear layer that maps the input features to one output value."""

    def __init__(self, in_features=20000):
        super().__init__()
        self.linear1 = nn.Linear(in_features, 1)
        # Deeper variant, currently disabled:
        # self.linear2 = nn.Linear(10000,5000)
        # self.linear3 = nn.Linear(5000, 2500)
        # self.linear4 = nn.Linear(2500, 1250)
        # self.linear5 = nn.Linear(1250, 625)
        # self.linear6 = nn.Linear(625, 125)
        # self.linear7 = nn.Linear(125, 1)

    def forward(self, xb):
        # Tensor.to() returns a new tensor, so its result must be assigned.
        # The model itself also has to live on `device`, e.g. LogReg().to(device).
        xb = xb.to(device)
        out = self.linear1(xb)
        return out

### Linear Model
This is like ...
If we apply it, results:
###### SOME FIGURES AND PLOTS TO SHOW THE RESULTS
Calculate performance, error, advantages and disadvantages
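
As a sketch of how the error could be calculated once predictions exist, the snippet below computes the mean absolute error (MAE) and the root mean squared error (RMSE) with NumPy. The helper names are assumptions, and the predicted values are made-up placeholders for illustration, not the output of any model in this project.

```python
import numpy as np


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between actual and predicted values."""
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_pred - y_true)))


def root_mean_squared_error(y_true, y_pred):
    """Square root of the average squared difference."""
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))


# Daily precipitation totals (mm) for the first five days of the daily table.
y_true = [1.4, 6.8, 0.8, 16.5, 5.5]
# Placeholder predictions, invented only to exercise the functions.
y_pred = [2.0, 5.0, 1.0, 12.0, 6.0]

mae = mean_absolute_error(y_true, y_pred)
rmse = root_mean_squared_error(y_true, y_pred)
print(f"MAE:  {mae:.3f} mm")
print(f"RMSE: {rmse:.3f} mm")
```

RMSE penalizes large misses more strongly than MAE, which matters for precipitation because a few very wet days can dominate the error.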

### Other Models
This is like ...
If we apply it, results:
###### SOME FIGURES AND PLOTS TO SHOW THE RESULTS
Calculate performance, error, advantages and disadvantages

## Conclusion
- Forecasting is good.
- Understand that these features have a stronger effect on the result.
- This Model is better in performance.
- Add references if necessary.