# Brainstorming

(Timeseries) Rainfall in Ireland
- 4,660,423 rows × 18 columns
- [Ideas](https://www.kaggle.com/datasets/dariasvasileva/hourly-weather-data-in-ireland-from-24-stations)
    - EDA
        - What are the most prominent seasonal weather patterns in Ireland?
        - How does the weather conditions affect city life?
            - [Pedestrian footfall](https://data.smartdublin.ie/dataset/dublin-city-centre-footfall-counters)
            - [Bikeshare sevices](https://data.smartdublin.ie/dataset/bleeperbike)
            - Road accidents
            - Taxi 
    - For ML and Neural Networks modeling:
        - Can you predict the probability of rain using weather data obtained from a single station in the previous 24, 36 or 48 hours?
        - How does the addition of data recorded by neighbouring stations affect the accuracy of the model?
        
**To do:** Build a model that can predict the probability of rain using weather data obtained from a single station in the previous 12, 24, 36, or 48 hours.  



**Data**

This dataset contains data from 25 stations across 15 counties in Ireland. Hourly data is available for these weather stations from the start of their record keeping until the end of 2018. All data was sourced from the Irish Meteorological Service - Met Éireann.

Met Éireann is a scientific organization that undertakes research in numerous fields such as Numerical Weather Prediction and Climate Modeling. 

**Variables measured by stations**
* `date`: Date and Time of observation
* `ind`: Encoded Rainfall Indicators (see KeyHourly.txt for details)
* `rain`: Precipitation Amount, mm
* `ind.1`: Encoded Temperature Indicators (see KeyHourly.txt for details)
* `temp`: Air Temperature, °C
* `ind.2`: Encoded Wet Bulb Indicators (see KeyHourly.txt for details)
* `wetb`: Wet Bulb Air Temperature, °C
* `dewpt`: Dew Point Air Temperature, °C
* `vappr`: Vapour Pressure, hPa
* `rhum`: Relative Humidity, %
* `msl`: Mean Sea Level Pressure, hPa
* `ind.3`: Encoded Wind Speed Indicators (see KeyHourly.txt for details)
* `wdsp`: Mean Hourly Wind Speed, knot
* `ind.4`: Encoded Wind Direction Indicators (see KeyHourly.txt for details)
* `wddir`: Predominant Hourly wind Direction, degree
* `ww`: Synop Code Present Weather (see KeyHourly.txt for details)
* `w`: Synop Code Past Weather (see KeyHourly.txt for details)
* `sun`: Sunshine duration, hours
* `vis`: Visibility, m
* `clht`: Cloud Ceiling Height (if none value is 999), 100s of feet
* `clamt`: Cloud Amount, okta

**Wikipedia page for "Wind direction"**
"Wind direction is usually reported in cardinal (or compass) direction, or in degrees. Consequently, a wind blowing from the north has a wind direction referred to as 0° (360°); a wind blowing from the east has a wind direction referred to as 90°, etc."


[Table: Common Cardinal (or compass) direction vs degrees](https://uni.edu/storm/Wind%20Direction%20slide.pdf)

**Information on the stations:**

* `county`: County the station is located in
* `st_id`: Station number
* `st_name`: Station name
* `st_height`: Station Height, m
* `st_lat`: Station Latitude, sexagesimal degrees (degrees, minutes, and seconds - DMS notation)
* `st_long`: Station Longitude, sexagesimal degrees (degrees, minutes, and seconds - DMS notation)

Latitude and longitude are presented in sexagesimal degrees (degrees, minutes, and seconds - DMS notation). To convert them into decimal degrees (DD) which are used in GIS and GPS apply the following formula: DD = D + M/60 + S/3600. More details can be found [here](https://en.wikipedia.org/wiki/Decimal_degrees).

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import statistics

In [2]:
irish_rain = pd.read_csv('Data/hrly_Irish_weather.csv')
irish_rain = irish_rain.sample(n=5000) # choose the number of rows you want to include

  irish_rain = pd.read_csv('Data/hrly_Irish_weather.csv')


In [3]:
irish_rain

Unnamed: 0,county,station,latitude,longitude,date,rain,temp,wetb,dewpt,vappr,rhum,msl,wdsp,wddir,sun,vis,clht,clamt
2407687,Galway,MACE HEAD,53.326,-9.901,11-dec-2016 04:00,0.0,9.8,9.1,8.4,11.0,90,1021.5,14,240,,,,
4370022,Cork,SherkinIsland,51.476,-9.428,14-sep-2017 00:00,0.0,11.9,10.1,8.4,11.0,79,1010.0,13,290,,,,
3284022,Westmeath,MULLINGAR,53.537,-7.362,27-jun-2014 01:00,0.4,12.2,12.0,11.7,13.8,96,1010.1,5,80,,,,
3504523,Carlow,OAK PARK,52.861,-6.915,22-may-2007 14:00,0.0,17.8,14.0,10.8,13.0,63,1021.8,8,280,,,,
1147020,Cork,CORK AIRPORT,51.847,-8.486,03-jun-2005 22:00,0.0,10.5,9.8,9.0,11.5,91,1008.8,10,240,0.0,30000,999,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1969936,Wexford,JOHNSTOWNII,52.298,-6.497,22-aug-2006 07:00,0.0,14.5,12.7,11.1,13.2,79,1018.6,2,250,,,,
3807551,Cork,ROCHES POINT,51.793,-8.244,08-dec-1999 21:00,0.0,6.0,4.8,3.3,,,992.6,20,250,,,,
1621951,Meath,DUNSANY,53.516,-6.660,06-oct-2016 17:00,0.0,12.1,9.5,6.7,9.8,69,1025.2,9,100,,,,
2126565,Mayo,KNOCK AIRPORT,53.906,-8.817,02-sep-2000 18:00,0.0,16.8,14.1,11.9,13.9,73,1016.1,3,70,0.0,40000,50,7


## Data Preprocessing

Change index for Timeseries

In [4]:
# Convert date from string format to datetime format
irish_rain['date'] = pd.to_datetime(irish_rain['date'], format='%d-%b-%Y %H:%M')

# Set date as the index
irish_rain.set_index('date', inplace=True)

### Handeling missing values

In [5]:
irish_rain.replace(' ', np.nan, inplace=True)

In [6]:
# Check for missing values
missing_values = irish_rain.isnull().sum()

# print the number of missing values in each column
print(missing_values)

county          0
station         0
latitude        0
longitude       0
rain          135
temp           40
wetb           48
dewpt          47
vappr         217
rhum          183
msl            67
wdsp          346
wddir         367
sun          3055
vis          3101
clht         3055
clamt        3055
dtype: int64


`rain`

In [7]:
unique_values = irish_rain['rain'].unique()
print(unique_values)

['0.0' '0.4' 0.0 nan '0.2' '0.1' '1.9' 0.1 0.3 2.4 0.8 0.6 0.5 '1.0' '1.6'
 '0.8' 0.9 '1.4' 0.4 '0.5' '5.4' '0.7' '0.3' 0.2 '0.6' '1.3' '0.9' '3.4'
 1.2 5.3 '1.2' 4.7 1.7 1.0 '3.9' '3.8' 2.6 1.3 '1.5' 1.9 '2.2' '1.7' 8.0
 '3.0' '3.7' '2.3' 1.4 '4.8' '2.4' '1.8' '2.7' 1.5 2.8 0.7 2.0 2.3 1.1
 '3.1' 5.4 3.8 1.6 '2.8' 2.2 '4.7' 3.5 '3.2' '2.1' '2.6' 9.3 '1.1' '2.0'
 3.2 3.1 1.8 '4.0' 5.6 5.2 '6.8' 3.7 4.2 2.5]


In [8]:
# Drop the missing values
irish_rain.dropna(subset=['rain'], inplace=True)

`temp`

In [9]:
unique_values = irish_rain['temp'].unique()
print(unique_values)

[9.8 11.9 '12.2' '17.8' 10.5 '1.4' '13.2' 11.4 7.3 4.4 '6.5' 14.8 5.1
 '9.2' 3.1 5.0 13.8 10.7 11.0 '7.8' '15.4' '15.5' '20.3' 12.6 7.1 '16.5'
 '12.5' 5.3 '17.5' 5.7 3.5 16.7 15.9 '12.3' 15.1 6.5 7.9 '10.6' 3.6 5.5
 11.3 '2.8' 4.3 11.7 8.1 14.2 6.6 9.3 -0.3 15.2 '8.2' 4.2 10.6 20.4 6.2
 '11.4' 13.9 6.7 8.2 5.4 2.1 '11.1' -0.1 4.6 12.9 10.0 -1.0 '10.8' 13.7
 14.0 '18.5' 11.1 '8.9' 8.6 '0.5' '19.2' 6.4 13.2 0.0 '11.2' 10.4 '0.8'
 17.3 '10.9' 12.5 8.3 9.4 '14.0' nan '8.0' 15.0 9.5 1.4 '9.8' 4.5 14.6
 '8.3' 3.3 '12.0' 10.2 5.9 10.1 '11.7' 15.4 8.8 '5.2' '6.0' '12.6' '6.1'
 14.9 '9.7' '13.8' 14.5 9.2 '17.2' '5.5' 10.8 3.7 '12.4' '8.8' '13.9'
 '15.8' '10.2' '9.4' 9.0 12.8 9.9 8.4 6.3 '9.9' 16.8 10.3 14.3 '2.5' 3.0
 11.6 '8.7' '5.1' '14.4' '4.8' '2.7' 7.5 7.0 13.4 8.9 '5.9' '10.0' 3.4
 '8.1' 9.6 13.0 '20.0' '14.7' '6.4' 7.7 21.7 '11.5' '3.6' '13.1' 15.5
 '14.9' '12.1' 5.8 '10.5' '-2.7' 9.1 15.6 '17.0' -0.8 16.9 7.2 '4.7' 23.1
 2.3 20.8 '16.1' 17.4 5.2 '11.8' 12.2 '7.7' 12.4 4.9 '5.6' 11.2 20.

In [10]:
irish_rain.dropna(subset=['temp'], inplace=True)

`wetb`

In [11]:
unique_values = irish_rain['wetb'].unique()
print(unique_values)

[9.1 '10.1' '12.0' '14.0' 9.8 '-0.1' '11.5' 10.2 7.1 2.4 '5.5' 13.5 '3.5'
 '8.3' '2.1' 5.0 10.3 8.2 '5.1' '13.5' '14.5' '17.7' 10.7 5.6 '12.7' '9.9'
 '4.8' '13.8' 2.6 14.7 '6.3' 14.5 '11.3' 11.6 4.0 '7.9' '9.4' 3.3 5.1 10.4
 '2.7' 3.5 7.7 13.3 5.3 9.3 -0.7 13.1 '6.6' 12.3 10.1 18.5 5.5 '10.8' 13.6
 10.0 '2.3' 5.2 4.4 1.3 '9.6' -0.4 4.2 12.9 8.4 6.3 '-1.5' '9.0' '11.7'
 '4.3' 9.0 9.7 '15.0' '9.8' '7.8' '8.0' '0.2' '15.8' '6.4' 4.3 -1.2 '10.6'
 9.4 '0.1' '13.2' 8.5 '10.2' 11.9 4.8 '11.2' '11.0' '12.2' 12.5 6.1 7.8
 2.2 '1.1' 11.8 '10.3' '8.7' 13.4 8.0 '7.5' '3.0' '11.4' 8.3 5.8 '9.3'
 13.2 8.1 4.7 '3.8' '11.8' '5.2' '5.7' '8.9' '13.1' 8.7 '7.0' '13.9'
 '15.2' '4.7' '10.0' 6.8 '7.7' '6.8' '9.2' '2.5' '13.6' 12.2 8.6 6.4 4.1
 '5.6' '12.5' 14.8 '8.6' '14.2' 11.0 13.7 '2.2' 2.0 '3.6' 12.7 6.7 11.4
 7.2 5.7 '8.5' 1.6 '5.3' '13.4' '2.6' 6.6 '6.9' '4.4' '18.1' '3.3' '8.1'
 '16.1' '11.1' '3.2' '7.1' '14.7' 13.9 3.6 10.6 2.8 '-3.0' 7.5 12.8 -1.0
 '6.1' 16.5 4.6 '4.2' 7.3 '1.7' 15.1 '12.6' 13.0 4.

In [12]:
irish_rain.dropna(subset=['wetb'], inplace=True)

`vappr`

In [13]:
unique_values = irish_rain['vappr'].unique()
print(unique_values)

[11.0 '11.0' '13.8' '13.0' '11.5' '4.9' '12.2' 9.9 5.7 '8.2' 14.4 '6.5'
 '10.1' '6.3' 8.7 15.2 12.2 8.6 '6.6' '14.0' '15.7' '18.2' 11.3 7.9 '11.6'
 '12.9' 8.2 6.6 15.1 '8.4' 15.4 '12.6' 10.9 6.1 '10.6' '10.8' 7.5 8.5 11.9
 '7.4' 7.2 11.2 10.2 '14.5' '11.7' 5.5 13.5 '8.5' 7.1 14.0 12.0 19.8
 '12.5' 15.3 11.7 '6.8' 6.5 7.6 12.4 '6.1' '10.7' 14.9 9.7 8.3 '5.1' 12.1
 5.4 11.4 '14.2' '11.1' '9.6' '6.0' '15.2' 4.6 '12.3' '5.6' '11.9' 10.5
 13.4 10.0 8.4 '12.8' 12.5 9.2 '5.4' '6.4' 13.0 '10.4' 8.0 '9.7' '12.1'
 7.3 9.4 7.7 9.3 '10.9' '9.8' '11.8' '8.3' '13.3' '7.8' '8.9' 11.5 '14.6'
 10.4 10.3 '15.4' 9.6 '15.1' '9.9' '7.1' 13.8 10.1 7.8 9.8 '14.9' '7.0'
 6.3 '10.0' 9.0 '6.7' '15.6' '11.2' 13.9 '8.6' '9.2' 8.8 10.6 13.1 '13.7'
 6.7 '9.1' 13.7 '19.3' '13.9' nan '5.3' '10.2' '7.3' '13.5' '16.6' 14.3
 10.8 6.8 7.0 '4.7' 9.1 '13.6' 15.8 '13.1' '9.4' 14.5 18.4 6.4 '10.3' 12.7
 '7.9' 14.1 '8.7' 4.7 '8.8' '12.7' 11.6 '13.4' 12.6 '15.5' '13.2' 5.3 6.9
 '14.8' '6.9' 16.9 '18.1' '14.1' '9.0' '11.4' '7.2

In [14]:
# Drop the missing values
irish_rain.dropna(subset=['vappr'], inplace=True)

`rhum`

In [15]:
unique_values = irish_rain['rhum'].unique()
print(unique_values)

[90 '79' '96' '63' '91' '72' '80' '85' 96 68 86 '74' '86' '82' 100 95 66
 '62' '89' 76 78 '61' '69' '92' '64' 89 85 80 '87' 63 '100' '84' 94 '98'
 87 82 '90' 91 '78' 93 83 '93' 97 59 84 79 51 47 '67' 92 '94' 65 69 '60'
 '77' '99' 73 '65' 67 '71' 77 81 '95' '55' '57' '97' '83' '59' 75 '76'
 '88' 99 74 61 88 53 '75' 70 '73' 71 48 57 '68' 55 '50' 98 56 '81' 62 '70'
 64 39 46 54 '56' 60 58 '52' 72 '49' '42' '58' '45' '66' '51' 49 '0' '48'
 '54' '40' '53' 50 44 52 '39' 45 '46' '43' '47']


No more missing values after taking care of the other ones! But in case I bring in more data, I will still put code in to drop missing values.

In [16]:
# Drop the missing values
irish_rain.dropna(subset=['rhum'], inplace=True)

`wdsp`

In [17]:
unique_values = irish_rain['wdsp'].unique()
print(unique_values)

[14 13 '5' '8' 10 '3' nan 9 16 '11' 2 '19' '17' 6 4 19 '18' '10' '14' 12 8
 '4' '13' 7 21.0 5 20 '0' '7' 23 '2' 29 17 3 '1' 11 '9' 1 '24' '15' 18 15
 '16' '12' '6' 30 '21' 26 '32' 24 31 '26' '20' 22 '28' '22' 27 25 '23' 0
 '30' '34' 28 '35' '37' 36 '29' '38' 40 32 '25' '27' 33 '31' 37]


In [18]:
irish_rain['wdsp'] = pd.to_numeric(irish_rain['wdsp'], errors='coerce')

In [19]:
unique_values = irish_rain['wdsp'].unique()
print(unique_values)

[14. 13.  5.  8. 10.  3. nan  9. 16. 11.  2. 19. 17.  6.  4. 18. 12.  7.
 21. 20.  0. 23. 29.  1. 24. 15. 30. 26. 32. 31. 22. 28. 27. 25. 34. 35.
 37. 36. 38. 40. 33.]


In [20]:
# calculate the median value of the 'wdsp' column
median_value = np.median(irish_rain['wdsp'].dropna())

# replace missing values with the median value
irish_rain['wdsp'] = irish_rain['wdsp'].fillna(median_value)

`wwdir`

In [21]:
unique_values = irish_rain['wddir'].unique()
print(unique_values)

[240 290 '80' '280' '50' nan 230 '270' 50 '330' 350 260 170 10 '250' '90'
 190 320 '20' '230' '140' '160' 180 270 280 '200' 100 130 '120' 70 30 '0'
 160 '340' 110 '310' 330 '350' '300' 80 '130' 220 '100' '290' 360 '220'
 '10' '260' 40 140 150 '210' 250 '180' '170' '320' 120 20 '30' '240' 200
 '110' '190' '360' 210 '150' 60 300 90.0 '40' 340 '70' '60' 310 0]


In [22]:
irish_rain['wddir'] = pd.to_numeric(irish_rain['wddir'], errors='coerce')

In [23]:
unique_values = irish_rain['wddir'].unique()
print(unique_values)

[240. 290.  80. 280.  50.  nan 230. 270. 330. 350. 260. 170.  10. 250.
  90. 190. 320.  20. 140. 160. 180. 200. 100. 130. 120.  70.  30.   0.
 340. 110. 310. 300. 220. 360.  40. 150. 210.  60.]


Since `wddir` represents wind direction, which is a circular variable, it may be more appropriate to impute missing values using the circular mean or median. However, in this case, the number of missing values is relatively small compared to the total number of values in the column, so imputing with the mode value should be sufficient.

In [24]:
mode_value = irish_rain['wddir'].mode()[0]
irish_rain['wddir'] = irish_rain['wddir'].fillna(mode_value)

`sun`

In [25]:
unique_values = irish_rain['sun'].unique()
print(unique_values)

[nan 0.0 1.0 0.2 0.7 0.6 '0.0' 0.4 0.1 0.3 0.5 0.9 '0.8' 0.8 '0.3' '0.6'
 '1.0' '0.2' '0.5']


Since the `sun` column represents the duration of sunshine, which is a continuous variable, you may consider imputing the missing values using mean or median imputation. However, note that imputing such a large proportion of the data may introduce bias in your analysis or modeling.

In [26]:
irish_rain['sun'] = pd.to_numeric(irish_rain['sun'], errors='coerce')

In [27]:
median_value = irish_rain['sun'].median()
irish_rain['sun'] = irish_rain['sun'].fillna(median_value)

Note that this approach assumes that the missing values in the `sun` column are missing at random and that the distribution of the 'sun' values is approximately normal. If these assumptions do not hold, other imputation methods such as regression imputation, KNN imputation, or multiple imputation may be more appropriate.

`vis`

In [28]:
unique_values = irish_rain['vis'].unique()
print(unique_values)

[nan 30000 40000 25000 4000 50000 15000 35000 '6000' 3000 '30000' 10000
 2500 24000 20000 3500 45000 '12000' '15000' 60000 12000 65000 16000 8000
 22000 '35000' 200 '25000' '40000' '20000' 7000 '22000' 2600 55000 '400'
 '18000' 800 '3000' '17000' 75000 5000 26000.0 4500 '50000' '60000' 14000
 18000 '75000' 28000 '70000' 400 11000 100 9000 '16000' '10000' '28000'
 6000 '900' '8000' '800' '45000' 1800 70000 21000 17000 '14000' '2500'
 2000 '4000' 700.0 '11000' '7000' '200' '2700' 1200.0 '5000' '1200' '1500'
 '2200' 900 1100 '4500' 19000 300 500 '9000' '24000' 1500 '100' '300' 1600
 27000.0 3200 '4400' 13000 '55000']


Without the NaNs, the `vis` is:
* Mean: 26897.97343722213
* Median: 25000.0
* Mode: 30000.0

In [29]:
# Convert the 'vis' column to numeric type
irish_rain['vis'] = pd.to_numeric(irish_rain['vis'], errors='coerce')

# Calculate the mean of the 'vis' column
mean_value = irish_rain['vis'].mean()

# Fill the NaN values with the mean value
irish_rain['vis'] = irish_rain['vis'].fillna(mean_value)

`clht`

In [30]:
unique_values = irish_rain['clht'].unique()
print(unique_values)

[nan 999 250 200 130 38 5 40 0 75 18 4 33 '999' 15 42 2 26 36 27 120 70 12
 150 22 30 180 45 110 35 34 20 50 80 60 1 8 90 17 3 100 23 '3' 7 48.0 46
 21 37 25.0 13 16 28 190 49.0 '38' 230 220 10 160 '11' '48' 14 24.0 32
 '12' '14' 6 9 '80' 240 47 19 '60' '35' '4' 39 '300' 29 '26' '10' 11 300
 '20' '50' 210 140 '90' '15' 41 '25' '19' 31 280 44 '36' '40' '250' '18'
 43 170 '9' 69 '5' '22' '16' '220']


In [31]:
irish_rain['clht'] = pd.to_numeric(irish_rain['clht'], errors='coerce') # convert to numeric
irish_rain['clht'] = irish_rain['clht'].interpolate(method='linear') # interpolate

Technique used: Interpolated values are estimates of missing data points based on the values of neighboring data points. In other words, when data is missing for certain points, interpolation is a technique used to estimate what those values would have been if they had been measured. Interpolation works by using mathematical algorithms to fill in the missing values based on the patterns observed in the available data. The resulting values are typically considered "interpolated values" and can be used to analyze the data as if no data were missing.

In [32]:
irish_rain['clht'] = pd.to_numeric(irish_rain['clht'], errors='coerce')
irish_rain['clht'] = irish_rain['clht'].interpolate(method='linear')
irish_rain['clht'] = irish_rain['clht'].fillna(irish_rain['clht'].mean())

`clamt`

In [33]:
unique_values = irish_rain['clamt'].unique()
print(unique_values)

[nan 3 7 6 8 2 4 '3' 5 1 0 '8' '7' '2' '6' '0' '1' '4' '5']


In the case of `clamt` cloud amount I am going to use the most common value, or mode, to fill in the missing values. 

In [34]:
mode_value = statistics.mode(irish_rain['clamt'])

In [35]:
irish_rain['clamt'] = irish_rain['clamt'].fillna(mode_value)

`msl`

In [36]:
unique_values = irish_rain['msl'].unique()
print(unique_values)

[1021.5 '1010.0' '1010.1' ... 1034.6 '1027.1' 979.2]


In [37]:
# Drop the missing values
irish_rain.dropna(subset=['msl'], inplace=True)

In [38]:
# Sanity check
# Check for missing values
missing_values = irish_rain.isnull().sum()

# print the number of missing values in each column
print(missing_values)

county       0
station      0
latitude     0
longitude    0
rain         0
temp         0
wetb         0
dewpt        0
vappr        0
rhum         0
msl          0
wdsp         0
wddir        0
sun          0
vis          0
clht         0
clamt        0
dtype: int64


In [39]:
irish_rain

Unnamed: 0_level_0,county,station,latitude,longitude,rain,temp,wetb,dewpt,vappr,rhum,msl,wdsp,wddir,sun,vis,clht,clamt
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1
2016-12-11 04:00:00,Galway,MACE HEAD,53.326,-9.901,0.0,9.8,9.1,8.4,11.0,90,1021.5,14.0,240.0,0.0,27306.635071,259.742779,7
2017-09-14 00:00:00,Cork,SherkinIsland,51.476,-9.428,0.0,11.9,10.1,8.4,11.0,79,1010.0,13.0,290.0,0.0,27306.635071,259.742779,7
2014-06-27 01:00:00,Westmeath,MULLINGAR,53.537,-7.362,0.4,12.2,12.0,11.7,13.8,96,1010.1,5.0,80.0,0.0,27306.635071,259.742779,7
2007-05-22 14:00:00,Carlow,OAK PARK,52.861,-6.915,0.0,17.8,14.0,10.8,13.0,63,1021.8,8.0,280.0,0.0,27306.635071,259.742779,7
2005-06-03 22:00:00,Cork,CORK AIRPORT,51.847,-8.486,0.0,10.5,9.8,9.0,11.5,91,1008.8,10.0,240.0,0.0,30000.000000,999.000000,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2016-06-15 00:00:00,Cavan,BALLYHAISE,54.051,-7.310,0.0,11.0,10.1,9.3,11.7,89,996.5,3.0,310.0,0.0,27306.635071,524.500000,7
2006-08-22 07:00:00,Wexford,JOHNSTOWNII,52.298,-6.497,0.0,14.5,12.7,11.1,13.2,79,1018.6,2.0,250.0,0.0,27306.635071,366.333333,7
2016-10-06 17:00:00,Meath,DUNSANY,53.516,-6.660,0.0,12.1,9.5,6.7,9.8,69,1025.2,9.0,100.0,0.0,27306.635071,208.166667,7
2000-09-02 18:00:00,Mayo,KNOCK AIRPORT,53.906,-8.817,0.0,16.8,14.1,11.9,13.9,73,1016.1,3.0,70.0,0.0,40000.000000,50.000000,7


### Numerical data

In [40]:
# Set selected columns to numerical
numerical_cols = ['latitude', 'longitude', 'rain', 'temp', 'wetb', 'dewpt', 'vappr', 'rhum', 'msl', 'wdsp', 'wddir', 'sun', 'vis', 'clht', 'clamt']
numerical_cols_set = set(numerical_cols)

# Selecting only numeric columns
numeric_irish_rain = irish_rain.select_dtypes(include='number')

# Comparing the selected columns with the expected numerical columns
if set(numeric_irish_rain.columns) == numerical_cols_set:
    print("All columns are numerical.")
else:
    print("Some columns are not numerical.")


Some columns are not numerical.


In [41]:
# Check for numerical columns
numeric_cols = irish_rain.select_dtypes(include='number').columns
print(numeric_cols)

Index(['latitude', 'longitude', 'wdsp', 'wddir', 'sun', 'vis', 'clht'], dtype='object')


In [42]:
# rain, temp, wetb, dewpt, vappr, rhum, msl, and clamt should be numerical

# Convert to numerical data types
irish_rain['rain'] = pd.to_numeric(irish_rain['rain'], errors='coerce')
irish_rain['temp'] = pd.to_numeric(irish_rain['temp'], errors='coerce')
irish_rain['wetb'] = pd.to_numeric(irish_rain['wetb'], errors='coerce')
irish_rain['dewpt'] = pd.to_numeric(irish_rain['dewpt'], errors='coerce')
irish_rain['vappr'] = pd.to_numeric(irish_rain['vappr'], errors='coerce')
irish_rain['rhum'] = pd.to_numeric(irish_rain['rhum'], errors='coerce')
irish_rain['msl'] = pd.to_numeric(irish_rain['msl'], errors='coerce')
irish_rain['clamt'] = pd.to_numeric(irish_rain['clamt'], errors='coerce')

# Print data types of all columns to confirm they are numerical
print(irish_rain.dtypes)

county        object
station       object
latitude     float64
longitude    float64
rain         float64
temp         float64
wetb         float64
dewpt        float64
vappr        float64
rhum           int64
msl          float64
wdsp         float64
wddir        float64
sun          float64
vis          float64
clht         float64
clamt        float64
dtype: object


### Scaling the data

In [43]:
# Select the numerical columns
num_cols = irish_rain.select_dtypes(include=['float64', 'int64']).columns.tolist()

print(num_cols)

['latitude', 'longitude', 'rain', 'temp', 'wetb', 'dewpt', 'vappr', 'rhum', 'msl', 'wdsp', 'wddir', 'sun', 'vis', 'clht', 'clamt']


In [44]:
from sklearn.preprocessing import StandardScaler

# Select the columns to scale
cols_to_scale = ['latitude', 'longitude', 'rain', 'temp', 'wetb', 'dewpt', 'vappr', 'rhum', 'msl', 'wdsp', 'wddir', 'sun', 'vis', 'clht', 'clamt']

# Create a StandardScaler object
scaler = StandardScaler()

# Scale the selected columns
irish_rain[cols_to_scale] = scaler.fit_transform(irish_rain[cols_to_scale])

In [45]:
# Print the summary statistics of the scaled columns
print(irish_rain[cols_to_scale].describe())

           latitude     longitude          rain          temp          wetb   
count  4.765000e+03  4.765000e+03  4.765000e+03  4.765000e+03  4.765000e+03  \
mean   5.055068e-16 -6.165990e-16  4.026160e-17 -4.111903e-16  1.282407e-16   
std    1.000105e+00  1.000105e+00  1.000105e+00  1.000105e+00  1.000105e+00   
min   -1.783717e+00 -1.657198e+00 -2.599409e-01 -3.378197e+00 -3.657390e+00   
25%   -9.438219e-01 -6.562559e-01 -2.599409e-01 -6.894210e-01 -6.882037e-01   
50%    1.453851e-01 -7.157128e-02 -2.599409e-01  3.040871e-02  5.993766e-02   
75%    6.991846e-01  6.935716e-01 -2.599409e-01  7.078955e-01  7.613202e-01   
max    2.197100e+00  1.550949e+00  1.848868e+01  3.693072e+00  2.701812e+00   

              dewpt         vappr          rhum           msl          wdsp   
count  4.765000e+03  4.765000e+03  4.765000e+03  4.765000e+03  4.765000e+03  \
mean   2.967429e-16 -5.099803e-16  2.982341e-18 -1.113904e-15 -2.534990e-17   
std    1.000105e+00  1.000105e+00  1.000105e+00  1.

The `mean` values of all the columns are close to 0, indicating that the columns are centered around 0 after scaling. The `std` values of all the columns are close to 1, indicating that the columns have a similar scale after scaling.

### Categorical data

In [46]:
irish_rain = pd.get_dummies(irish_rain, columns=['county', 'station'], prefix='', prefix_sep='')


### Ensure the data is spaced evenly in time 

In [47]:
# Use the resample method
irish_rain = irish_rain.resample('H').mean()

In [48]:
# Interpolate missing values
irish_rain = irish_rain.interpolate(method='linear')

## Baseline Model

In [53]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression

# Define the target variable
target = 'rain'

# Define the lagged features for the different time intervals
lags = [12, 24, 36, 48]

for lag in lags:
    irish_rain[f'{target}_lag{lag}'] = irish_rain[target].shift(lag)

# Remove rows with missing values
irish_rain.dropna(inplace=True)

# Split the data into training and testing sets
X = irish_rain.drop(columns=target)
y = irish_rain[target]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit a linear regression model to the training data
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions on the test data
y_pred = model.predict(X_test)

# Calculate the mean squared error (MSE) in mm^2
mse = mean_squared_error(y_test, y_pred)

# Calculate the root mean squared error (RMSE) in mm
rmse = np.sqrt(mse)
print(f"RMSE: {rmse:.2f} mm")


RMSE: 0.34 mm


### What are the most important features?

In [54]:
from sklearn.tree import DecisionTreeRegressor

# Instantiate a decision tree regressor
model = DecisionTreeRegressor()

# Fit the model to the training data
model.fit(X_train, y_train)

# Compute feature importances
importances = model.feature_importances_

# Print the feature importances
for feature, importance in zip(X_train.columns, importances):
    print(f"{feature}: {importance:.4f}")

latitude: 0.0044
longitude: 0.0040
temp: 0.0026
wetb: 0.0058
dewpt: 0.0019
vappr: 0.0067
rhum: 0.0135
msl: 0.0059
wdsp: 0.0069
wddir: 0.0081
sun: 0.0012
vis: 0.0128
clht: 0.0057
clamt: 0.0016
Carlow: 0.0003
Cavan: 0.0017
Clare: 0.0012
Cork: 0.0027
Donegal: 0.0003
Dublin: 0.0008
Galway: 0.0009
Kerry: 0.0003
Mayo: 0.0014
Meath: 0.0003
Roscommon: 0.0003
Sligo: 0.0002
Tipperary: 0.0003
Westmeath: 0.0015
Wexford: 0.0014
ATHENRY: 0.0007
BALLYHAISE: 0.0026
BELMULLET: 0.0013
CASEMENT: 0.0004
CLAREMORRIS: 0.0013
CORK AIRPORT: 0.0006
DUBLIN AIRPORT: 0.0004
DUNSANY: 0.0001
FINNER: 0.0002
GURTEEN: 0.0006
JOHNSTOWNII: 0.0009
KNOCK AIRPORT: 0.0003
MACE HEAD: 0.0002
MALIN HEAD: 0.0008
MARKREE: 0.0021
MOORE PARK: 0.0012
MT DILLON: 0.0008
MULLINGAR: 0.0012
NEWPORT: 0.0009
OAK PARK: 0.0021
PHOENIX PARK: 0.0004
ROCHES POINT: 0.0009
SHANNON AIRPORT: 0.0002
SherkinIsland: 0.0004
VALENTIA OBSERVATORY: 0.0012
rain_lag12: 0.8523
rain_lag24: 0.0242
rain_lag36: 0.0047
rain_lag48: 0.0022


## 1st iteration

Start with top 7 features: `latitude`, `longitude`, `temp`, `wetb`, `dewpt`, `vappr`, `rhum`

In [138]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Select the top 7 features
features = ['latitude', 'longitude', 'temp', 'wetb', 'dewpt', 'vappr', 'rhum']

# Create the lagged rain features
lags = [12, 24, 36, 48]
for lag in lags:
    irish_rain[f'rain_lag{lag}'] = irish_rain['rain'].shift(lag)

# Drop the rows with missing values
irish_rain.dropna(inplace=True)

# Split the data into training and testing sets
train_size = int(len(irish_rain) * 0.8)
train = irish_rain[:train_size]
test = irish_rain[train_size:]

# Train the linear regression model
lr = LinearRegression()
lr.fit(train[features], train['rain'])

# Make predictions on the test set
y_pred = lr.predict(test[features])

# Calculate the mean squared error for each time interval
mse_12h = mean_squared_error(test['rain_lag12'], y_pred)
mse_24h = mean_squared_error(test['rain_lag24'], y_pred)
mse_36h = mean_squared_error(test['rain_lag36'], y_pred)
mse_48h = mean_squared_error(test['rain_lag48'], y_pred)

# Calculate the root mean squared error for each time interval in mm
rmse_12h = np.sqrt(mse_12h)
rmse_24h = np.sqrt(mse_24h)
rmse_36h = np.sqrt(mse_36h)
rmse_48h = np.sqrt(mse_48h)

# Print the RMSE for each time interval in mm
print(f"RMSE for 12h: {rmse_12h:.2f} mm")
print(f"RMSE for 24h: {rmse_24h:.2f} mm")
print(f"RMSE for 36h: {rmse_36h:.2f} mm")
print(f"RMSE for 48h: {rmse_48h:.2f} mm")

RMSE for 12h: 0.88 mm
RMSE for 24h: 0.89 mm
RMSE for 36h: 0.90 mm
RMSE for 48h: 0.90 mm
