#### 07 Forecasting

In this session we practice time series forecasting on a weather dataset recorded at the Barcelona airport. It contains daily observations such as temperature, precipitation, and wind speed.

The dataset comes from the [AEMET's Open Data](https://opendata.aemet.es/centrodedescargas/inicio) initiative.



In [158]:
import io
import json
import pandas as pd
import datetime
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import math

In [159]:
INPUT_FILE = "Data/07_Forecasting/aemet-barcelona-airport-2016-2022.json"
weatherdf = pd.read_json(INPUT_FILE)

weatherdf.head(5)

,fecha,indicativo,nombre,provincia,altitud,tmed,prec,tmin,horatmin,tmax,horatmax,dir,velmedia,racha,horaracha,sol,presMax,horaPresMax,presMin,horaPresMin
0,2016-07-01,0201D,BARCELONA,BARCELONA,6,241,0,219,03:40,263,11:50,99.0,31,58,10:50,,,,,
1,2016-07-02,0201D,BARCELONA,BARCELONA,6,246,0,223,04:40,270,11:50,13.0,36,75,12:10,,,,,
2,2016-07-03,0201D,BARCELONA,BARCELONA,6,234,0,219,11:00,249,08:10,20.0,42,89,08:50,,,,,
3,2016-07-04,0201D,BARCELONA,BARCELONA,6,238,0,218,04:40,259,10:30,25.0,33,67,23:20,,,,,
4,2016-07-05,0201D,BARCELONA,BARCELONA,6,253,0,232,00:10,274,14:00,35.0,31,64,02:00,,,,,


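We now rename the columns to English using the following dictionary. Two ways of doing this are shown. The first assigns a list of new names to the <b>columns</b> property; here the list is taken from the dictionary's values. This assignment is positional, so it is only correct when the list follows the DataFrame's column order. That is not the case here (the data has prec before tmin, the dictionary does not), so the first table below has several mislabeled columns. The second uses the rename() function, which takes the dictionary, matches columns by name, and gives the correct result.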

In [160]:
COLUMN_NAMES = {
  "fecha" : "date",
  "indicativo" : "station_num",
  "nombre" : "station_name",
  "provincia" : "station_province",
  "altitud" : "station_altitude",
  "tmed" : "temp_avg",
  "tmin" : "temp_min",
  "tmax" : "temp_max",
  "horatmin" : "time_temp_min",
  "horatmax" : "time_temp_max",
  "prec" : "rainfall",
  "dir" : "windspeed_dir",
  "velmedia" : "windspeed_avg",
  "racha" : "windspeed_gusts",
  "horaracha" : "time_windspeed_gusts",
  "sol" : "sun",
  "presMax" : "pressure_max",
  "horaPresMax" : "time_pressure_max",
  "presMin" : "pressure_min",
  "horaPresMin" : "time_pressure_min"
}

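# Approach 1: positional assignment through the columns property.
# The labels follow the dictionary order, not the DataFrame's column order.
weather1 = weatherdf.copy()
cols = list(COLUMN_NAMES.values())
weather1.columns = cols
display(weather1.head(5))

# Approach 2: rename() matches each column by name, so order does not matter.
weatherdf.rename(columns=COLUMN_NAMES, inplace=True)
display(weatherdf.head(5))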

,date,station_num,station_name,station_province,station_altitude,temp_avg,temp_min,temp_max,time_temp_min,time_temp_max,rainfall,windspeed_dir,windspeed_avg,windspeed_gusts,time_windspeed_gusts,sun,pressure_max,time_pressure_max,pressure_min,time_pressure_min
0,2016-07-01,0201D,BARCELONA,BARCELONA,6,241,0,219,03:40,263,11:50,99.0,31,58,10:50,,,,,
1,2016-07-02,0201D,BARCELONA,BARCELONA,6,246,0,223,04:40,270,11:50,13.0,36,75,12:10,,,,,
2,2016-07-03,0201D,BARCELONA,BARCELONA,6,234,0,219,11:00,249,08:10,20.0,42,89,08:50,,,,,
3,2016-07-04,0201D,BARCELONA,BARCELONA,6,238,0,218,04:40,259,10:30,25.0,33,67,23:20,,,,,
4,2016-07-05,0201D,BARCELONA,BARCELONA,6,253,0,232,00:10,274,14:00,35.0,31,64,02:00,,,,,


,date,station_num,station_name,station_province,station_altitude,temp_avg,rainfall,temp_min,time_temp_min,temp_max,time_temp_max,windspeed_dir,windspeed_avg,windspeed_gusts,time_windspeed_gusts,sun,pressure_max,time_pressure_max,pressure_min,time_pressure_min
0,2016-07-01,0201D,BARCELONA,BARCELONA,6,241,0,219,03:40,263,11:50,99.0,31,58,10:50,,,,,
1,2016-07-02,0201D,BARCELONA,BARCELONA,6,246,0,223,04:40,270,11:50,13.0,36,75,12:10,,,,,
2,2016-07-03,0201D,BARCELONA,BARCELONA,6,234,0,219,11:00,249,08:10,20.0,42,89,08:50,,,,,
3,2016-07-04,0201D,BARCELONA,BARCELONA,6,238,0,218,04:40,259,10:30,25.0,33,67,23:20,,,,,
4,2016-07-05,0201D,BARCELONA,BARCELONA,6,253,0,232,00:10,274,14:00,35.0,31,64,02:00,,,,,
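Assigning a list to columns is positional, so it is only safe when the list follows the DataFrame's own column order. A minimal sketch of an order-safe variant looks up each existing column name in the dictionary instead. The frame toy, the dictionary names, and all values are made up for illustration and are not part of the dataset.

```python
import pandas as pd

# Toy frame whose column order differs from the dictionary order (made-up values).
toy = pd.DataFrame({"fecha": ["2016-07-01"], "prec": ["0,0"], "tmin": ["21,9"]})
names = {"fecha": "date", "tmin": "temp_min", "prec": "rainfall"}

# Positional assignment: labels follow the dictionary order, so two of them end up swapped.
positional = toy.copy()
positional.columns = list(names.values())

# Order-safe assignment: look up each existing column name in the dictionary.
by_name = toy.copy()
by_name.columns = [names.get(col, col) for col in toy.columns]

print(positional.columns.tolist())
print(by_name.columns.tolist())
```

Using names.get(col, col) keeps any column that is missing from the dictionary under its original name, which mirrors how rename() treats unmapped columns.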


Now we get a subset of the columns.

In [161]:
column_subset = ["date", "temp_avg", "temp_min", "temp_max", "windspeed_dir", "windspeed_avg", "windspeed_gusts"]

weather2 = weatherdf[column_subset]

weather2


,date,temp_avg,temp_min,temp_max,windspeed_dir,windspeed_avg,windspeed_gusts
0,2016-07-01,241,219,263,99.0,31,58
1,2016-07-02,246,223,270,13.0,36,75
2,2016-07-03,234,219,249,20.0,42,89
3,2016-07-04,238,218,259,25.0,33,67
4,2016-07-05,253,232,274,35.0,31,64
...,...,...,...,...,...,...,...
2275,2022-09-25,200,164,237,1.0,42,89
2276,2022-09-26,203,150,256,25.0,56,133
2277,2022-09-27,200,153,247,35.0,33,89
2278,2022-09-28,219,168,270,30.0,50,125


Let's check the data type of each column. 

In [162]:
weather2.dtypes

date                object
temp_avg            object
temp_min            object
temp_max            object
windspeed_dir      float64
windspeed_avg       object
windspeed_gusts     object
dtype: object

As can be seen, only windspeed_dir is stored as float64. The other measurement columns are numeric too, but their values use a <b>comma</b> as the decimal separator, so pandas reads them as strings (dtype object). The date column is also stored as object. We first convert the date column to datetime and set it as the index, and then fix the remaining columns.

The errors='coerce' parameter replaces any value that cannot be parsed with NaT (Not a Time) instead of raising an error.
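The effect of errors='coerce' can be illustrated with a short sketch. The series raw below contains made-up values, one valid date and one invalid string; it is not taken from the dataset.

```python
import pandas as pd

# Made-up values: one valid date and one string that is not a date.
raw = pd.Series(["2016-07-01", "not a date"])

# With errors="coerce", the unparseable entry becomes NaT instead of raising.
parsed = pd.to_datetime(raw, errors="coerce")

print(parsed)
print(parsed.isna().sum())  # number of values that failed to parse
```

Counting the NaT values after parsing is a quick way to check whether any dates were silently dropped.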

In [163]:
weather2["date"] = pd.to_datetime(weather2["date"], errors='coerce')
display(weather2.dtypes)
weather2.set_index("date", inplace=True)
display(weather2)


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  weather2["date"] = pd.to_datetime(weather2["date"], errors='coerce')


date               datetime64[ns]
temp_avg                   object
temp_min                   object
temp_max                   object
windspeed_dir             float64
windspeed_avg              object
windspeed_gusts            object
dtype: object

date,temp_avg,temp_min,temp_max,windspeed_dir,windspeed_avg,windspeed_gusts
2016-07-01,241,219,263,99.0,31,58
2016-07-02,246,223,270,13.0,36,75
2016-07-03,234,219,249,20.0,42,89
2016-07-04,238,218,259,25.0,33,67
2016-07-05,253,232,274,35.0,31,64
...,...,...,...,...,...,...
2022-09-25,200,164,237,1.0,42,89
2022-09-26,203,150,256,25.0,56,133
2022-09-27,200,153,247,35.0,33,89
2022-09-28,219,168,270,30.0,50,125
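The warning in the output above is pandas' SettingWithCopyWarning: weather2 was created by selecting columns from weatherdf, so pandas cannot tell whether the assignment was meant to affect the original frame as well. One common way to avoid it, sketched here on a made-up frame (full and subset are illustrative names), is to take an explicit copy when creating the subset.

```python
import pandas as pd

# Made-up frame standing in for the full weather table.
full = pd.DataFrame({
    "date": ["2016-07-01", "2016-07-02"],
    "temp_avg": ["24,1", "24,6"],
    "station_name": ["BARCELONA", "BARCELONA"],
})

# .copy() makes the subset an independent DataFrame, so later assignments are unambiguous.
subset = full[["date", "temp_avg"]].copy()
subset["date"] = pd.to_datetime(subset["date"], errors="coerce")
subset = subset.set_index("date")

print(subset.index.dtype)
print(full["date"].iloc[0])  # the original frame still holds the raw string
```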


In [164]:
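def convert_to_float(value):
    """Convert a comma-decimal string to a float; return NaN if that is not possible."""
    try:
        return float(value.replace(',', '.'))
    except (AttributeError, ValueError):
        # AttributeError: value is not a string (e.g. NaN); ValueError: text is not a number
        return np.nan

# Apply the conversion function to every column that is not already float64
weather2 = weather2.apply(lambda col: col.map(convert_to_float) if col.dtype != 'float64' else col)
weather2 = weather2.dropna()
display(weather2.dtypes)
display(weather2.head(5))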

temp_avg           float64
temp_min           float64
temp_max           float64
windspeed_dir      float64
windspeed_avg      float64
windspeed_gusts    float64
dtype: object

date,temp_avg,temp_min,temp_max,windspeed_dir,windspeed_avg,windspeed_gusts
2016-07-01,24.1,21.9,26.3,99.0,3.1,5.8
2016-07-02,24.6,22.3,27.0,13.0,3.6,7.5
2016-07-03,23.4,21.9,24.9,20.0,4.2,8.9
2016-07-04,23.8,21.8,25.9,25.0,3.3,6.7
2016-07-05,25.3,23.2,27.4,35.0,3.1,6.4
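As an alternative to the element-wise helper above, the same conversion can be sketched with vectorized string methods and pd.to_numeric. The frame df below is made up for illustration, and the sketch assumes that every non-numeric column holds comma-decimal strings.

```python
import pandas as pd

# Made-up frame: one comma-decimal string column (with one bad value) and one float column.
df = pd.DataFrame({
    "temp_avg": ["24,1", "24,6", "bad"],
    "windspeed_dir": [99.0, 13.0, 20.0],
})

# Convert only the non-numeric columns; numeric columns are left untouched.
for col in df.columns:
    if not pd.api.types.is_numeric_dtype(df[col]):
        as_text = df[col].str.replace(",", ".", regex=False)
        df[col] = pd.to_numeric(as_text, errors="coerce")  # unparseable values become NaN

# Drop the rows in which any value could not be converted.
df = df.dropna()

print(df.dtypes)
print(df)
```

The vectorized form avoids calling a Python function per cell, which may matter on larger tables; the helper-based version is easier to extend with custom parsing rules.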



This is a new dataset called [Weather Data for COVID-19 Data Analysis](https://www.kaggle.com/datasets/davidbnn92/weather-data-for-covid19-data-analysis).


The columns of this dataset combine location, COVID-19 counts, and meteorological variables. A brief interpretation of each:
- Lat: Latitude
- Long: Longitude
- Date: Date of the observation
- ConfirmedCases: Number of confirmed COVID-19 cases
- Fatalities: Number of fatalities (deaths)
- day_from_jan_first: Number of days elapsed since January 1st
- temp: Temperature
- min: Minimum temperature
- max: Maximum temperature
- stp: Mean station pressure
- slp: Mean sea-level pressure
- dewp: Dew point temperature
- rh: Relative humidity
- ah: Absolute humidity
- wdsp: Wind speed
- prcp: Precipitation
- fog: Presence or absence of fog


