This project creates an end-to-end data pipeline to fetch data from various reports, store it in a Google Cloud Platform (GCP) database, build a dashboard, and develop a machine learning model for price forecasting.
We use Python, with libraries such as pandas and BeautifulSoup, to scrape data from the report links. The scraped data is stored in dataframes and loaded into Google Cloud Storage buckets, then transferred to BigQuery tables for efficient processing. The extraction process is automated with a cron job or Google Cloud Scheduler.
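The scrape-and-load step above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the HTML snippet, the column names (`timestamp`, `fuel_type`, `generation_mw`), and the helper names `parse_report` and `upload_csv` are all assumptions made for the example. The upload helper uses the `google-cloud-storage` client and is defined but not called here, since it needs GCP credentials.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical report fragment; the real report pages will look different.
SAMPLE_HTML = """
<table id="fuel-report">
  <tr><th>timestamp</th><th>fuel_type</th><th>generation_mw</th></tr>
  <tr><td>2023-01-01 00:00</td><td>Nuclear</td><td>1200.5</td></tr>
  <tr><td>2023-01-01 00:00</td><td>Gas</td><td>640.0</td></tr>
</table>
"""


def parse_report(html: str) -> pd.DataFrame:
    """Parse the first HTML table in `html` into a typed dataframe."""
    soup = BeautifulSoup(html, "html.parser")
    rows = soup.find("table").find_all("tr")
    header = [th.get_text(strip=True) for th in rows[0].find_all("th")]
    records = [
        [td.get_text(strip=True) for td in row.find_all("td")]
        for row in rows[1:]
    ]
    df = pd.DataFrame(records, columns=header)
    df["timestamp"] = pd.to_datetime(df["timestamp"])
    df["generation_mw"] = df["generation_mw"].astype(float)
    return df


def upload_csv(df: pd.DataFrame, bucket_name: str, blob_name: str) -> None:
    """Write the dataframe as CSV to a Cloud Storage bucket (needs credentials)."""
    # Imported lazily so the parsing step runs without the GCP client installed.
    from google.cloud import storage

    client = storage.Client()
    blob = client.bucket(bucket_name).blob(blob_name)
    blob.upload_from_string(df.to_csv(index=False), content_type="text/csv")


report_df = parse_report(SAMPLE_HTML)
print(report_df)
```

In a scheduled run, the HTML would come from an HTTP request to each report link rather than an inline string, and `upload_csv` would be called with the project's own bucket and object names.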
We build and run several machine learning models in GCP’s BigQuery to predict future fuel/energy prices. For the time series formulation we tested univariate and multivariate LSTM models and a GRU model; for the regression formulation we tested an ANN regressor and Random Forest regression. The ANN regressor gave the best results for our use case.
After modeling, we generate a data visualization report in Google Data Studio for further insights. The report includes a pie chart of fuel generation by fuel type, a stacked column chart of fuel generation by month, and a time series of fuel generation for each quarter of the year.
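The aggregations behind these three charts could be computed along the following lines. This is only a sketch: the sample rows and the column names `timestamp`, `fuel_type`, and `generation_mw` are made up for illustration, and in the project itself the charts are built in Data Studio on top of the BigQuery tables.

```python
import pandas as pd

# Hypothetical sample rows; real data would come from the BigQuery tables.
df = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(
            ["2023-01-15", "2023-01-20", "2023-04-10", "2023-08-05"]
        ),
        "fuel_type": ["Nuclear", "Gas", "Nuclear", "Gas"],
        "generation_mw": [600.0, 200.0, 300.0, 100.0],
    }
)
df = df.assign(month=df["timestamp"].dt.month, quarter=df["timestamp"].dt.quarter)

# Pie chart: share of total generation by fuel type.
share_by_fuel = (
    df.groupby("fuel_type")["generation_mw"].sum() / df["generation_mw"].sum()
)

# Stacked column chart: one row per month, one column per fuel type.
monthly_by_fuel = (
    df.groupby(["month", "fuel_type"])["generation_mw"].sum().unstack(fill_value=0.0)
)

# Time series by quarter: total generation in each quarter.
quarterly_total = df.groupby("quarter")["generation_mw"].sum()

print(share_by_fuel, monthly_by_fuel, quarterly_total, sep="\n\n")
```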
- Mean Absolute Error (MAE): The ANN regression model achieved an MAE in the range of 7.51 to 12.
- Look Back: The LSTM/GRU models used a look back of 3, meaning each input window contained the 3 preceding hours of data.
- n_steps_in and n_steps_out: The LSTM/GRU models used n_steps_in of 3 and n_steps_out of 1, meaning they looked at 3 past hours and predicted 1 future hour.
- nb_epochs: The LSTM/GRU models were trained for 10 epochs, that is, 10 full passes over the training dataset.
- Pie Chart: Nuclear accounted for 60% of the fuel generated.
- Stacked Column Chart: Fuel generation was highest in January and August.
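
To make the look back, n_steps_in/n_steps_out, and MAE entries in the list above concrete, here is a small sketch in plain Python. The price values are invented, the helper names `make_windows` and `mean_absolute_error` are hypothetical, and the "repeat the last observed hour" predictor is only a stand-in baseline used to show how MAE is computed; it is not one of the project's models.

```python
def make_windows(series, n_steps_in=3, n_steps_out=1):
    """Split a sequence into (input window, target window) pairs."""
    inputs, targets = [], []
    for start in range(len(series) - n_steps_in - n_steps_out + 1):
        end_in = start + n_steps_in
        inputs.append(series[start:end_in])
        targets.append(series[end_in:end_in + n_steps_out])
    return inputs, targets


def mean_absolute_error(y_true, y_pred):
    """Average of the absolute differences between actual and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical hourly prices, for illustration only.
hourly_prices = [10.0, 12.0, 11.0, 13.0, 15.0, 14.0]

# 3 past hours in, 1 future hour out, matching the settings listed above.
X, y = make_windows(hourly_prices, n_steps_in=3, n_steps_out=1)

# Stand-in baseline: predict that the next hour equals the last hour in the window.
naive_pred = [window[-1] for window in X]
actual = [target[0] for target in y]
baseline_mae = mean_absolute_error(actual, naive_pred)

print(X[0], "->", y[0])
print("baseline MAE:", baseline_mae)
```

Each window pairs 3 consecutive hours with the hour that follows, so a series of 6 points yields 3 training samples. A sequence model would be fit on these pairs, and its MAE would be computed the same way as for the baseline.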
Instructions for accessing and configuring Google BigQuery, Google Cloud Storage, Google Cloud Functions, and Google Cloud Scheduler are provided in the following sections.