# I. SINGAPORE'S WEATHER DATA: 1983 - 2019 (JUNE)
This is one of the few detailed, multi-year datasets that I've been able to find in the public domain in Singapore. I believe it can be useful for data science students in this part of the world who want to test their skills on a local dataset or build a personal project.

I'll be using this dataset for small projects on data visualisation (see notebook 2.0_visualisation_cch), time series analysis and machine learning. Ping me on LinkedIn or Twitter if you do something interesting with this set of data:

Twitter: @chinhon

LinkedIn: https://www.linkedin.com/in/chuachinhon/

## FILE ORGANISATION:
The original data files, as downloaded from the [Singapore Met Office](http://www.weather.gov.sg/climate-historical-daily/) and Data.gov.sg, are in the raw folder. The files are mostly clean, save for some missing values for mean and max wind speed. I've lightly processed the files and saved the output to the data folder so that I can call them up easily for future data projects.

You can make a different version of the dataset by concatenating the raw files over a different time frame, or by adding more elaborate feature engineering.
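As a rough sketch of the first option, the combined frame can be restricted to a span of years before it is saved. The helper name `combine_years` and its arguments are hypothetical, and the toy frames stand in for the raw files; the only assumption about the data is that each file carries a `Year` column.

```python
import pandas as pd


def combine_years(frames, start_year, end_year):
    """Concatenate daily frames and keep rows within [start_year, end_year].

    Hypothetical helper: assumes every frame has a "Year" column.
    """
    combined = pd.concat(frames, ignore_index=True)
    in_range = combined["Year"].between(start_year, end_year)  # inclusive on both ends
    return combined[in_range].reset_index(drop=True)


# Toy frames standing in for pd.read_csv(f) on each raw file
frames = [
    pd.DataFrame({"Year": [1999, 2000], "Daily Rainfall Total (mm)": [0.0, 4.2]}),
    pd.DataFrame({"Year": [2001, 2010], "Daily Rainfall Total (mm)": [1.0, 0.0]}),
]
subset = combine_years(frames, 2000, 2009)
```

Filtering on the `Year` column rather than on file names avoids relying on any particular naming pattern for the raw files.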

What you'll find in the raw folder:
- 438 CSV files containing daily weather data for Singapore from 1983 - 2019 (June)

- a "monthly_data" sub-folder containing monthly average data for rainfall, maximum and mean temperatures.

The files in the data folder have been processed by the code below.

In [2]:
import glob
import pandas as pd

## 1. DAILY WEATHER DATA

In [3]:
# Combining the separate CSV files into one
raw = pd.concat(
    [pd.read_csv(f) for f in glob.glob("../raw/*.csv")], ignore_index=True
)

In [4]:
# Adding a datetime col in the year-month-day format
raw["Date"] = pd.to_datetime(
    raw["Year"].astype(str)
    + "-"
    + raw["Month"].astype(str)
    + "-"
    + raw["Day"].astype(str)
)
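An alternative to joining strings, shown here only as a sketch on a toy frame: `pd.to_datetime` can assemble dates directly from a frame whose columns are named `year`, `month` and `day`. The name `raw_demo` is made up for the example.

```python
import pandas as pd

# Toy frame with the same date columns as the raw files
raw_demo = pd.DataFrame({"Year": [2019, 2019], "Month": [6, 6], "Day": [29, 30]})

# Lowercase the column names so they match the year/month/day keys
parts = raw_demo[["Year", "Month", "Day"]].rename(columns=str.lower)
raw_demo["Date"] = pd.to_datetime(parts)
```

This skips the intermediate strings, which may matter little at this size but keeps the intent explicit.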

In [5]:
raw = raw.sort_values('Date', ascending=False)

In [6]:
# Converting the Max/Mean Wind Speed cols to a numeric data type; non-numeric entries become NaN
raw["Max Wind Speed (km/h)"] = pd.to_numeric(
    raw["Max Wind Speed (km/h)"], errors="coerce"
)
raw["Mean Wind Speed (km/h)"] = pd.to_numeric(
    raw["Mean Wind Speed (km/h)"], errors="coerce"
)
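Because `errors="coerce"` silently turns anything non-numeric into NaN, it can be worth counting how many entries were affected. A minimal sketch on a toy column follows; the `"-"` placeholder is an assumption about how a gap might be recorded, not something confirmed from the raw files.

```python
import pandas as pd

# Toy column; the "-" placeholder is an assumed stand-in for a missing reading
wind = pd.Series(["10.1", "-", "13.0"], name="Mean Wind Speed (km/h)")

nulls_before = wind.isna().sum()
converted = pd.to_numeric(wind, errors="coerce")

# Entries that were present as text but could not be parsed as numbers
n_coerced = int(converted.isna().sum() - nulls_before)
```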

In [7]:
raw.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 13330 entries, 13084 to 4324
Data columns (total 14 columns):
Station                          13330 non-null object
Year                             13330 non-null int64
Month                            13330 non-null int64
Day                              13330 non-null int64
Daily Rainfall Total (mm)        13330 non-null float64
Highest 30 Min Rainfall (mm)     13330 non-null object
Highest 60 Min Rainfall (mm)     13330 non-null object
Highest 120 Min Rainfall (mm)    13330 non-null object
Mean Temperature (°C)            13330 non-null float64
Maximum Temperature (°C)         13330 non-null float64
Minimum Temperature (°C)         13330 non-null float64
Mean Wind Speed (km/h)           13320 non-null float64
Max Wind Speed (km/h)            13319 non-null float64
Date                             13330 non-null datetime64[ns]
dtypes: datetime64[ns](1), float64(6), int64(3), object(4)
memory usage: 1.5+ MB


#### Fill the missing entries in the Mean Wind Speed and Max Wind Speed columns with each column's own mean

In [8]:
raw["Max Wind Speed (km/h)"] = raw["Max Wind Speed (km/h)"].fillna(
    raw["Max Wind Speed (km/h)"].mean()
)
raw["Mean Wind Speed (km/h)"] = raw["Mean Wind Speed (km/h)"].fillna(
    raw["Mean Wind Speed (km/h)"].mean()
)
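Filling with the overall mean is the approach used here. One possible variation, sketched on a toy frame and not what this notebook does, is to fill each gap with the mean for the same calendar month, which keeps any seasonal pattern in wind speed. The frame `demo` and its values are made up.

```python
import numpy as np
import pandas as pd

col = "Mean Wind Speed (km/h)"
demo = pd.DataFrame(
    {
        "Month": [1, 1, 1, 6, 6, 6],
        col: [10.0, np.nan, 12.0, 5.0, 7.0, np.nan],
    }
)

# Mean of each calendar month, broadcast back to every row of that month
monthly_mean = demo.groupby("Month")[col].transform("mean")
demo[col] = demo[col].fillna(monthly_mean)
```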

In [9]:
# Dropping cols that I won't need for visualisation or modelling
raw = raw.drop(
    columns=[
        "Station",
        "Highest 30 Min Rainfall (mm)",
        "Highest 60 Min Rainfall (mm)",
        "Highest 120 Min Rainfall (mm)",
    ]
)

In [10]:
# Slight rearrangement of cols for clarity
cols = [
    "Date",
    "Year",
    "Month",
    "Day",
    "Daily Rainfall Total (mm)",
    "Mean Temperature (°C)",
    "Maximum Temperature (°C)",
    "Minimum Temperature (°C)",
    "Mean Wind Speed (km/h)",
    "Max Wind Speed (km/h)",
]

In [11]:
weather = raw[cols].copy()

In [12]:
weather = weather.sort_values('Date', ascending=False)

In [13]:
weather.info()
# no null values

<class 'pandas.core.frame.DataFrame'>
Int64Index: 13330 entries, 13084 to 4324
Data columns (total 10 columns):
Date                         13330 non-null datetime64[ns]
Year                         13330 non-null int64
Month                        13330 non-null int64
Day                          13330 non-null int64
Daily Rainfall Total (mm)    13330 non-null float64
Mean Temperature (°C)        13330 non-null float64
Maximum Temperature (°C)     13330 non-null float64
Minimum Temperature (°C)     13330 non-null float64
Mean Wind Speed (km/h)       13330 non-null float64
Max Wind Speed (km/h)        13330 non-null float64
dtypes: datetime64[ns](1), float64(6), int64(3)
memory usage: 1.1 MB


In [14]:
weather.columns

Index(['Date', 'Year', 'Month', 'Day', 'Daily Rainfall Total (mm)',
       'Mean Temperature (°C)', 'Maximum Temperature (°C)',
       'Minimum Temperature (°C)', 'Mean Wind Speed (km/h)',
       'Max Wind Speed (km/h)'],
      dtype='object')

In [15]:
weather.describe()
# The Daily Rainfall Total col has some obvious outliers (max of 216.2 mm vs a 75th percentile of 4.4 mm). But let's deal with them later, as and when required

| | Year | Month | Day | Daily Rainfall Total (mm) | Mean Temperature (°C) | Maximum Temperature (°C) | Minimum Temperature (°C) | Mean Wind Speed (km/h) | Max Wind Speed (km/h) |
|---|---|---|---|---|---|---|---|---|---|
| count | 13330.0 | 13330.0 | 13330.0 | 13330.0 | 13330.0 | 13330.0 | 13330.0 | 13330.0 | 13330.0 |
| mean | 2000.750863 | 6.481995 | 15.727907 | 5.851905 | 27.657524 | 31.510833 | 24.88907 | 7.402755 | 34.041722 |
| std | 10.537707 | 3.448823 | 8.799558 | 14.455764 | 1.173196 | 1.57175 | 1.260053 | 3.442466 | 8.052888 |
| min | 1983.0 | 1.0 | 1.0 | 0.0 | 22.8 | 23.6 | 20.2 | 0.2 | 4.7 |
| 25% | 1992.0 | 3.0 | 8.0 | 0.0 | 26.9 | 30.8 | 24.0 | 4.8 | 28.8 |
| 50% | 2001.0 | 6.0 | 16.0 | 0.0 | 27.7 | 31.7 | 24.9 | 6.8 | 33.1 |
| 75% | 2010.0 | 9.0 | 23.0 | 4.4 | 28.5 | 32.5 | 25.7 | 9.7 | 38.2 |
| max | 2019.0 | 12.0 | 31.0 | 216.2 | 30.9 | 36.0 | 29.1 | 22.2 | 90.7 |
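The summary above shows a maximum daily rainfall of 216.2 mm against a 75th percentile of 4.4 mm. When the time comes to look at those extreme days, one simple option is to flag rows above a high quantile. This is only a sketch on toy values: the 0.99 cut-off is an arbitrary choice, and `demo` is a made-up frame.

```python
import pandas as pd

col = "Daily Rainfall Total (mm)"

# Toy data: mostly dry days plus a few wet ones
demo = pd.DataFrame({col: [0.0] * 97 + [4.4, 18.4, 216.2]})

threshold = demo[col].quantile(0.99)  # arbitrary cut-off, tune as needed
extreme_days = demo[demo[col] > threshold]
```

A quantile cut-off is used instead of the usual IQR rule because daily rainfall is dominated by zeros, so the IQR fence would flag a large share of ordinary wet days.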


In [16]:
weather.head()

| | Date | Year | Month | Day | Daily Rainfall Total (mm) | Mean Temperature (°C) | Maximum Temperature (°C) | Minimum Temperature (°C) | Mean Wind Speed (km/h) | Max Wind Speed (km/h) |
|---|---|---|---|---|---|---|---|---|---|---|
| 13084 | 2019-06-30 | 2019 | 6 | 30 | 0.0 | 28.8 | 30.9 | 27.3 | 10.1 | 28.8 |
| 13083 | 2019-06-29 | 2019 | 6 | 29 | 18.4 | 28.6 | 32.0 | 23.9 | 13.0 | 41.8 |
| 13082 | 2019-06-28 | 2019 | 6 | 28 | 0.0 | 29.6 | 32.2 | 27.8 | 14.4 | 33.5 |
| 13081 | 2019-06-27 | 2019 | 6 | 27 | 0.0 | 29.2 | 32.2 | 26.7 | 9.7 | 34.2 |
| 13080 | 2019-06-26 | 2019 | 6 | 26 | 0.0 | 28.3 | 31.2 | 27.1 | 8.6 | 43.2 |


In [17]:
#weather.to_csv('../data/weather.csv', index=False)
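Before writing a combined file out, a quick check for gaps or repeated days in the date column can catch a missing or double-loaded raw file. The sketch below uses a hypothetical helper, `missing_dates`, on toy dates.

```python
import pandas as pd


def missing_dates(dates):
    """Return the calendar days absent between the first and last date."""
    dates = pd.to_datetime(dates)
    expected = pd.date_range(dates.min(), dates.max(), freq="D")
    return expected.difference(pd.DatetimeIndex(dates))


# Toy series with 2019-06-29 deliberately left out
demo_dates = pd.Series(pd.to_datetime(["2019-06-27", "2019-06-28", "2019-06-30"]))

gaps = missing_dates(demo_dates)
has_duplicates = bool(demo_dates.duplicated().any())
```

On the real frame this would be called as `missing_dates(weather["Date"])`; an empty result means every day in the range is present.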

## 2. MONTHLY DATA
Here, I'll do some light processing of the monthly average data for rainfall, maximum and mean temperatures. The source files are in the raw folder's "monthly_data" sub-folder.

### 2.1 MONTHLY RAINFALL RECORDS

In [18]:
monthly_rain = pd.read_csv('../raw/monthly_data/monthly_rain.csv')

In [19]:
monthly_rain["month"] = pd.to_datetime(monthly_rain["month"])
monthly_rain["year"] = monthly_rain["month"].dt.year
monthly_rain["month"] = monthly_rain["month"].dt.month

In [20]:
monthly_rain = monthly_rain.rename(
    columns={
        "year": "Year",
        "month": "Month",
        "total_rainfall": "Total_Monthly_Rainfall (mm)",
    }
)

In [21]:
# For consistency with the daily records, I'll start with entries from 1983 for the monthly datasets as well 
cols_rain = ["Total_Monthly_Rainfall (mm)", "Year", "Month"]
monthly_rain = monthly_rain[cols_rain].copy()
monthly_rain = monthly_rain[monthly_rain["Year"] >= 1983]

In [22]:
#monthly_rain.to_csv('../data/rain_monthly.csv', index=False)

In [23]:
monthly_rain.head()

| | Total_Monthly_Rainfall (mm) | Year | Month |
|---|---|---|---|
| 12 | 246.0 | 1983 | 1 |
| 13 | 5.6 | 1983 | 2 |
| 14 | 18.6 | 1983 | 3 |
| 15 | 33.6 | 1983 | 4 |
| 16 | 160.8 | 1983 | 5 |


### 2.2 MONTHLY MEAN TEMPERATURES

In [24]:
mean_temp = pd.read_csv('../raw/monthly_data/monthly_temp_mean.csv')

In [25]:
mean_temp["month"] = pd.to_datetime(mean_temp["month"])
mean_temp["year"] = mean_temp["month"].dt.year
mean_temp["month"] = mean_temp["month"].dt.month

In [26]:
mean_temp = mean_temp.rename(
    columns={
        "year": "Year",
        "month": "Month",
        "mean_temp": "Mean_Monthly_Temperature (°C)",
    }
)

In [27]:
cols_temp_mean = ["Mean_Monthly_Temperature (°C)", "Year", "Month"]
mean_temp = mean_temp[cols_temp_mean].copy()
mean_temp = mean_temp[mean_temp["Year"] >= 1983]

In [28]:
#mean_temp.to_csv('../data/mean_temp_monthly.csv', index=False)

### 2.3 MONTHLY MAX TEMPERATURES

In [29]:
max_temp = pd.read_csv('../raw/monthly_data/monthly_temp_max.csv')

In [30]:
max_temp["month"] = pd.to_datetime(max_temp["month"])
max_temp["year"] = max_temp["month"].dt.year
max_temp["month"] = max_temp["month"].dt.month

In [31]:
max_temp = max_temp.rename(
    columns={
        "year": "Year",
        "month": "Month",
        "max_temperature": "Max_Monthly_Temperature (°C)",
    }
)

In [32]:
cols_temp_max = ["Max_Monthly_Temperature (°C)", "Year", "Month"]
max_temp = max_temp[cols_temp_max].copy()
max_temp = max_temp[max_temp["Year"] >= 1983]

In [33]:
#max_temp.to_csv('../data/max_temp_monthly.csv', index=False)
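The three monthly blocks above repeat the same steps: parse the `month` column, split it into year and month, rename the value column, and keep entries from 1983. They could be folded into one function. The helper `tidy_monthly` below is hypothetical, and the toy rows assume the `month` column holds strings such as `"1983-01"`.

```python
import pandas as pd


def tidy_monthly(df, value_col, new_name, start_year=1983):
    """Split the "month" column into Year/Month and rename the value column.

    Hypothetical helper mirroring the three monthly blocks above.
    """
    parsed = pd.to_datetime(df["month"])
    out = pd.DataFrame(
        {
            new_name: df[value_col],
            "Year": parsed.dt.year,
            "Month": parsed.dt.month,
        }
    )
    return out[out["Year"] >= start_year]


# Toy rows; the values are made up for illustration
demo = pd.DataFrame({"month": ["1982-12", "1983-01"], "total_rainfall": [100.0, 246.0]})
monthly_rain_demo = tidy_monthly(demo, "total_rainfall", "Total_Monthly_Rainfall (mm)")
```

With the real files, the same call would be repeated with `"mean_temp"` and `"max_temperature"` as `value_col`, matching the column names used in the rename steps above.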