# Rainfall Trends and Patterns in Dublin (Ringsend)

The final project for Programming for Data Analytics '24-'25

Author: Atacan Buyuktalas

## Introduction

- Objective

    This project analyzes rainfall data from the Dublin (Ringsend) weather station from 1941 to August 2024. It aims to uncover trends, seasonal patterns, and significant rainfall events.

- Key Questions

    1. How has total annual rainfall changed over time?
    2. Which months experience the highest and lowest rainfall?
    3. What trends exist in the number of rain days (rd) and wet days (wd)?
    4. How has the greatest daily rainfall (gdf) varied?
    5. Can we predict future annual rainfall based on historical data?

## Loading and Exploring Dataset

The dataset is taken from [Met Eireann](https://www.met.ie/climate/available-data/historical-data).

In [31]:
# Import required libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
import numpy as np

In [32]:
# Read the data
file_path = 'data/dublin_1941_2024.csv'
df = pd.read_csv(file_path, skiprows=13)

# Display the first 5 rows of the data
print(df.head())

# Check for missing values and data types
print(df.info())

   year  month  ind   rain   gdf  rd  wd
0  1941      1    0  112.8  13.0  18  18
1  1941      2    0   69.5  13.0  22  15
2  1941      3    0  111.0  50.0  21  13
3  1941      4    0   68.6  16.5  15  12
4  1941      5    0   66.4  20.1  13  10
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 971 entries, 0 to 970
Data columns (total 7 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   year    971 non-null    int64 
 1   month   971 non-null    int64 
 2   ind     971 non-null    int64 
 3   rain    971 non-null    object
 4   gdf     971 non-null    object
 5   rd      971 non-null    object
 6   wd      971 non-null    object
dtypes: int64(3), object(4)
memory usage: 53.2+ KB
None


In [33]:
# Convert the object-typed columns to numeric; entries that cannot be parsed become NaN
df['rain'] = pd.to_numeric(df['rain'], errors='coerce')
df['gdf'] = pd.to_numeric(df['gdf'], errors='coerce')
df['rd'] = pd.to_numeric(df['rd'], errors='coerce')
df['wd'] = pd.to_numeric(df['wd'], errors='coerce')
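
The `errors='coerce'` behaviour used above can be illustrated on a toy column. This is only a sketch: the sample values, including the `'n/a'` entry, are made up for illustration and are not taken from the Met Eireann file.

```python
import pandas as pd

# Toy column (hypothetical values) mixing numeric strings with placeholders
raw = pd.Series(['112.8', ' ', '69.5', 'n/a'], dtype='object')

# errors='coerce' turns anything that cannot be parsed into NaN
# instead of raising a ValueError
converted = pd.to_numeric(raw, errors='coerce')

print(converted.dtype)         # the column is now a float dtype
print(converted.isna().sum())  # number of entries that became NaN
```

Because the coerced entries become `NaN`, they show up later as missing values rather than silently staying as text.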

## Data Cleaning and Preparation 


### Convert date columns

- Used [`to_datetime`](https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html) to build a datetime column from `year` and `month`. The data is monthly and has no day field, so [`DataFrame.assign(day=1)`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.assign.html#pandas.DataFrame.assign) supplies the first of the month as a placeholder day.

In [34]:
# Create a datetime column for easy plotting and analysis
df['date'] = pd.to_datetime(df[['year', 'month']].assign(day=1))
df.set_index('date', inplace=True)

# Drop the 'ind' column, which is not used in this analysis
df = df.drop(columns=['ind'])

print(df.head())

            year  month   rain   gdf    rd    wd
date                                            
1941-01-01  1941      1  112.8  13.0  18.0  18.0
1941-02-01  1941      2   69.5  13.0  22.0  15.0
1941-03-01  1941      3  111.0  50.0  21.0  13.0
1941-04-01  1941      4   68.6  16.5  15.0  12.0
1941-05-01  1941      5   66.4  20.1  13.0  10.0
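
The reason for `assign(day=1)` can be seen in a small sketch on a toy frame (the frame `toy` and the flag `assembled_without_day` are illustrative names, not part of the project): when assembling dates from columns, `pd.to_datetime` expects `year`, `month` and `day` to all be present.

```python
import pandas as pd

# Toy frame with only year and month, like the rainfall data
toy = pd.DataFrame({'year': [1941, 1941], 'month': [1, 2]})

# Without a 'day' column, pandas cannot assemble a full date
try:
    pd.to_datetime(toy[['year', 'month']])
    assembled_without_day = True
except ValueError:
    assembled_without_day = False

# Adding a constant day completes the date; the original frame is unchanged
dates = pd.to_datetime(toy[['year', 'month']].assign(day=1))

print(assembled_without_day)
print(dates.tolist())
```

Since `assign` returns a new frame, the temporary `day` column never appears in `toy` itself.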


### Handling missing values

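- Used [`replace()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.replace.html#pandas.DataFrame.replace) to convert blank-string placeholders to `NaN`, then counted the missing values per column with [`isnull()`](https://pandas.pydata.org/docs/reference/api/pandas.isnull.html#pandas-isnull) and `sum()`.

- Used [`interpolate(method='time')`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.interpolate.html) to fill the missing values.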

In [35]:
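# Replace any remaining blank-string placeholders with NaN
# (a safeguard: to_numeric(errors='coerce') above already converts unparseable entries)
df.replace(' ', np.nan, inplace=True)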

# Check for missing values
print(df.isnull().sum())

print(df.info())

year       0
month      0
rain      50
gdf      104
rd        91
wd        91
dtype: int64
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 971 entries, 1941-01-01 to 2024-08-01
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   year    971 non-null    int64  
 1   month   971 non-null    int64  
 2   rain    921 non-null    float64
 3   gdf     867 non-null    float64
 4   rd      880 non-null    float64
 5   wd      880 non-null    float64
dtypes: float64(4), int64(2)
memory usage: 53.1 KB
None


In [37]:
# Fill missing values by interpolating along the datetime index (time-weighted)
df['rain'] = df['rain'].interpolate(method='time')
df['gdf'] = df['gdf'].interpolate(method='time')
df['rd'] = df['rd'].interpolate(method='time')
df['wd'] = df['wd'].interpolate(method='time')

# Sanity check: every column should now report 971 non-null values
print(df.info())

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 971 entries, 1941-01-01 to 2024-08-01
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   year    971 non-null    int64  
 1   month   971 non-null    int64  
 2   rain    971 non-null    float64
 3   gdf     971 non-null    float64
 4   rd      971 non-null    float64
 5   wd      971 non-null    float64
dtypes: float64(4), int64(2)
memory usage: 53.1 KB
None
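
To show what `method='time'` does differently from the default, here is a minimal sketch on a three-point toy series (the values are invented for illustration). With a `DatetimeIndex`, the gap is filled in proportion to the elapsed days rather than by position.

```python
import numpy as np
import pandas as pd

# Toy monthly series with one missing value (hypothetical numbers)
idx = pd.to_datetime(['2024-01-01', '2024-02-01', '2024-03-01'])
s = pd.Series([10.0, np.nan, 70.0], index=idx)

# 'linear' ignores the index and treats the points as equally spaced
linear = s.interpolate(method='linear')

# 'time' weights by elapsed time: 31 days of a 60-day span have passed
# at 2024-02-01, so the fill is 10 + 60 * 31 / 60
by_time = s.interpolate(method='time')

print(linear.iloc[1])
print(by_time.iloc[1])
```

Two points are worth keeping in mind when reading the later results: interpolated `rd` and `wd` values can be fractional even though they are day counts, and long runs of missing months are filled with a straight line between the nearest observed values.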


### Add derived column

In [39]:
# Add 'season' column for seasonal analysis
df['month'] = df.index.month
df['season'] = df['month'].apply(
    lambda x: 'Winter' if x in [12, 1, 2] 
                else 'Spring' if x in [3, 4, 5] 
                else 'Summer' if x in [6, 7, 8] 
                else 'Autumn')

print(df.head(12))

            year  month   rain   gdf    rd    wd  season
date                                                    
1941-01-01  1941      1  112.8  13.0  18.0  18.0  Winter
1941-02-01  1941      2   69.5  13.0  22.0  15.0  Winter
1941-03-01  1941      3  111.0  50.0  21.0  13.0  Spring
1941-04-01  1941      4   68.6  16.5  15.0  12.0  Spring
1941-05-01  1941      5   66.4  20.1  13.0  10.0  Spring
1941-06-01  1941      6   13.6   3.8   5.0   4.0  Summer
1941-07-01  1941      7   33.4   6.6  14.0  10.0  Summer
1941-08-01  1941      8   58.2   6.6  22.0  17.0  Summer
1941-09-01  1941      9   19.6   5.1  10.0   8.0  Autumn
1941-10-01  1941     10   51.2  12.4  17.0  11.0  Autumn
1941-11-01  1941     11   81.2  10.4  26.0  17.0  Autumn
1941-12-01  1941     12   38.5  16.5  18.0   9.0  Winter
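
As an alternative to the chained conditional, the same month-to-season assignment could be sketched with a lookup dictionary and `Series.map`. The names `SEASON_BY_MONTH` and `toy` are illustrative and not part of the project code.

```python
import pandas as pd

# Hypothetical lookup table with the same season boundaries as above
SEASON_BY_MONTH = {
    12: 'Winter', 1: 'Winter', 2: 'Winter',
    3: 'Spring', 4: 'Spring', 5: 'Spring',
    6: 'Summer', 7: 'Summer', 8: 'Summer',
    9: 'Autumn', 10: 'Autumn', 11: 'Autumn',
}

# Toy frame covering one calendar year of month-start dates
idx = pd.date_range('1941-01-01', periods=12, freq='MS')
toy = pd.DataFrame({'month': idx.month}, index=idx)

# map() looks each month number up in the dictionary
toy['season'] = toy['month'].map(SEASON_BY_MONTH)

print(toy['season'].value_counts())
```

With either approach, December is labelled "Winter" within its own calendar year, so grouping by `year` and `season` combines a December with the January and February that came before it, not the ones that follow.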
