# V. Weather Predictions: 'Classic' Machine Learning Models Vs Keras

There's no shortage of online tutorials on specific data science tasks. What's harder to find are tutorials that connect the dots for newcomers and help them explore the next phase once they've built up some familiarity with an area such as machine learning basics.

After spending a good part of 2019 learning the basics of machine learning, I was keen to start experimenting with some rudimentary deep learning. But there wasn't an obvious way to start. So I decided to pull together the materials I had found on the subject and rustle up a series of notebooks that will hopefully help others who are looking to do the same.

In these notebooks, I use a mix of machine learning and deep learning techniques to try to predict the rain pattern in Singapore in December 2019 (the validation set). The models are trained on 37 years of weather data in Singapore, from Jan 01 1983 to the end of November 2019.

CAVEAT: While this dataset spans 37 years, it contains just under 13,500 rows of data. It is fair to ask whether you need deep learning for a dataset like this, and whether it necessarily produces better results.

Frankly, these questions don't matter much to me as a newcomer to data science. Massive real-world datasets are hard to come by, especially in Singapore. I would much rather keep experimenting and learning new techniques than wait for the perfect dataset to drop into my lap.
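For a sense of scale, the row count in the caveat can be sanity-checked with a quick calendar calculation. This is only a sketch: it assumes the dataset holds exactly one row per calendar day for a single station, which the real files may not satisfy if any days are missing or duplicated.

```python
import pandas as pd

# Assumption: one row per calendar day, 1 Jan 1983 to 30 Nov 2019 inclusive.
expected_days = pd.date_range("1983-01-01", "2019-11-30", freq="D")
n_expected = len(expected_days)

print(n_expected)  # 13483, i.e. "just under 13,500 rows"
```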

In [1]:
import glob
import numpy as np
import pandas as pd

# 1. CREATE NEW WEATHER DATAFRAME

I've been using the Singapore Met Service's [historic daily records](http://www.weather.gov.sg/climate-historical-daily) for a series of [data science projects](https://medium.com/@chinhonchua).

For this new project, let's create a new CSV file containing weather data from Jan 1983 - Nov 2019. The weather data for Dec 2019 will be used to test the models' predictions.
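The notebook keeps the December 2019 data out of the combined file. If the training and hold-out rows ever end up in the same DataFrame, one simple way to separate them is a date cutoff. The sketch below uses a made-up toy frame; the rainfall values and the `cutoff`, `train` and `validation` names are illustrative and do not come from the actual dataset.

```python
import pandas as pd

# Toy stand-in for the real data; the rainfall values are invented.
toy = pd.DataFrame(
    {
        "Date": pd.date_range("2019-11-28", "2019-12-03", freq="D"),
        "Daily Rainfall Total (mm)": [0.0, 12.4, 3.2, 0.0, 25.8, 1.0],
    }
)

# Everything before 1 Dec 2019 is training data; December is held out.
cutoff = pd.Timestamp("2019-12-01")
train = toy[toy["Date"] < cutoff]
validation = toy[toy["Date"] >= cutoff]

print(len(train), len(validation))
```

Splitting on the date rather than at random keeps the hold-out period strictly after the training period, which matches how the predictions would be used.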

In [2]:
# Combining the separate CSV files into one
raw = pd.concat(
    [pd.read_csv(f) for f in glob.glob("../raw/*.csv")], ignore_index=True
)

In [3]:
# Adding a datetime col in the year-month-day format
raw["Date"] = pd.to_datetime(
    raw["Year"].astype(str)
    + "-"
    + raw["Month"].astype(str)
    + "-"
    + raw["Day"].astype(str)
)

# Converting the Max/Mean Wind Speed cols to a numeric dtype;
# entries that can't be parsed as numbers become NaN (errors="coerce")
raw["Max Wind Speed (km/h)"] = pd.to_numeric(
    raw["Max Wind Speed (km/h)"], errors="coerce"
)
raw["Mean Wind Speed (km/h)"] = pd.to_numeric(
    raw["Mean Wind Speed (km/h)"], errors="coerce"
)
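As an aside on the date column: the cell above builds it by joining the year, month and day into a string. pandas can also assemble dates directly from columns named `year`, `month` and `day`. The sketch below compares the two approaches on a toy frame (the `toy`, `from_strings` and `from_parts` names are just for illustration), and both should give the same result.

```python
import pandas as pd

toy = pd.DataFrame({"Year": [1983, 2019], "Month": [1, 11], "Day": [1, 30]})

# Approach used in the notebook: join the parts into a "Y-M-D" string.
from_strings = pd.to_datetime(
    toy["Year"].astype(str)
    + "-"
    + toy["Month"].astype(str)
    + "-"
    + toy["Day"].astype(str)
)

# Alternative: hand pandas the year/month/day columns directly.
# Column names are lower-cased here to match the documented keys.
parts = toy[["Year", "Month", "Day"]].rename(columns=str.lower)
from_parts = pd.to_datetime(parts)

print((from_strings == from_parts).all())
```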

#### Fill the missing entries in the Mean Wind Speed and Max Wind Speed columns with each column's own mean

In [4]:
raw["Max Wind Speed (km/h)"] = raw["Max Wind Speed (km/h)"].fillna(
    raw["Max Wind Speed (km/h)"].mean()
)
raw["Mean Wind Speed (km/h)"] = raw["Mean Wind Speed (km/h)"].fillna(
    raw["Mean Wind Speed (km/h)"].mean()
)
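To see what the last two cells do, here is a minimal sketch on a toy series. The `"-"` placeholder is a hypothetical stand-in for whatever non-numeric marker appears in the raw files. `errors="coerce"` turns anything unparseable into `NaN`, and `fillna` then replaces those gaps with the mean of the values that remain.

```python
import pandas as pd

# Toy wind-speed column: two readings, one placeholder string, one missing value.
wind = pd.Series(["10.5", "-", "7.5", None])

# Unparseable entries ("-") and None both end up as NaN.
numeric = pd.to_numeric(wind, errors="coerce")

# Series.mean() skips NaN by default, so the mean is (10.5 + 7.5) / 2 = 9.0.
filled = numeric.fillna(numeric.mean())

print(filled.tolist())  # [10.5, 9.0, 7.5, 9.0]
```

Mean imputation is a simple default. It keeps every row, but it also flattens the variation in the filled column, which is worth remembering when reading the model results later.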

In [5]:
# Dropping cols that I won't need
raw = raw.drop(
    columns=[
        "Station",
        "Highest 30 Min Rainfall (mm)",
        "Highest 60 Min Rainfall (mm)",
        "Highest 120 Min Rainfall (mm)",
    ]
)

In [6]:
# Slight rearrangement of cols for clarity
cols = [
    "Date",
    "Year",
    "Month",
    "Day",
    "Daily Rainfall Total (mm)",
    "Mean Temperature (°C)",
    "Maximum Temperature (°C)",
    "Minimum Temperature (°C)",
    "Mean Wind Speed (km/h)",
    "Max Wind Speed (km/h)",
]

In [7]:
weather = raw[cols].copy()

In [8]:
weather = weather.sort_values('Date', ascending=False)

In [9]:
#weather.to_csv('../data/weather_model.csv', index=False)