<a href="https://colab.research.google.com/github/alfredofosu/python.projects/blob/main/Climate_Data_Extraction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Extracting Historical Climate Data from Environment and Climate Change Canada (ECCC)
ECCC lets you download historical climate data as CSV or XML files. However, their portal only allows a search for a single day at a time, so I used their bulk-download URL to build a Python script that extracts the data in bulk.
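
As a rough sketch of how a single bulk request is addressed, the query string can be assembled with the standard library. The endpoint and parameter names are the ones used in the script in this notebook; the helper name `build_bulk_url` and its default arguments are made up for illustration.

```python
from urllib.parse import urlencode

BASE_URL = "https://climate.weather.gc.ca/climate_data/bulk_data_e.html"


def build_bulk_url(station_id: int, year: int, month: int,
                   timeframe: int = 1, fmt: str = "csv") -> str:
    """Return the bulk-download URL for one station, year and month (hypothetical helper)."""
    params = {
        "format": fmt,
        "stationID": station_id,
        "Year": year,
        "Month": month,
        "timeframe": timeframe,     # 1 = hourly, 2 = daily, 3 = monthly
        "submit": "Download Data",  # urlencode turns the space into '+'
    }
    return f"{BASE_URL}?{urlencode(params)}"


print(build_bulk_url(46507, 2022, 12))
```

Building the URL from a dictionary keeps each parameter on its own line, which makes it easier to see what a request is asking for than one long f-string.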

In [2]:
import requests
import pandas as pd
from pathlib import Path

# Since more than three arguments have to be inserted into the URL, building a class
# seems to be an efficient way of organizing this script.

class ClimateDataGetter(): # <- the class object

  timeframe = 1 # can only be set to 1, 2 or 3 for hourly, daily and monthly intervals respectively

  def __init__(self, url:str, format:str, stationID:int, from_year:int, to_year:int, from_month:int, to_month:int, 
               timeframe={"hourly":1,"daily":2,"monthly":3}):
    # Instance variables
    self.url = url
    self.format = format
    self.stationID = stationID
    self.from_year = from_year
    self.to_year = to_year
    self.from_month = from_month
    self.to_month = to_month
    #self.timeframe = timeframe #{1=hourly, 2=daily, 3=monthly}

  def download_csv_files(self):
    # iterate over each year and, within it, each month from from_month to to_month,
    # downloading one CSV per request
    all_csv = []
    for year in range(self.from_year, self.to_year + 1):
      for month in range(self.from_month, self.to_month + 1):
        request_url = (f"{self.url}?format={self.format}&stationID={self.stationID}"
                       f"&Year={year}&Month={month}&timeframe={self.timeframe}&submit=Download+Data")
        response_csv = pd.read_csv(request_url)
        all_csv.append(response_csv) # each extracted CSV is appended to a list
    data = pd.concat(all_csv) # combine all the CSVs into one DataFrame

    # look up the label ("hourly", "daily" or "monthly") for the current timeframe code
    timeframe_label = {1: "hourly", 2: "daily", 3: "monthly"}[self.timeframe]
    basename = f'/content/drive/MyDrive/data/{self.stationID}_{self.from_year}_{self.to_year}_{timeframe_label}'

    # save the extracted data to a designated path/folder.
    filepath = Path(f'{basename}.{self.format}')
    filepath.parent.mkdir(parents=True, exist_ok=True)
    data.to_csv(filepath, index=False)

    # save the pandas DataFrame summary to the same folder.
    with open(f'{basename}.txt', 'w') as file_out:
      data.info(verbose=True, buf=file_out)

    # print to screen
    print(f"Extracted {len(data)} {timeframe_label} climate data from station "
          f"{self.stationID} from {self.from_year} to {self.to_year}.")

  # the timeframe can be set to 1, 2 or 3 (hourly, daily or monthly)
  @classmethod
  def set_timeframe(cls, timeframe):
    cls.timeframe = timeframe
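
One thing to be aware of: because `set_timeframe` is a classmethod, it changes `timeframe` for every instance of the class, not just the one it is called on. The toy classes below illustrate the difference, assuming you might want two getters running at different intervals. `SharedTimeframe` and `PerInstanceTimeframe` are hypothetical names and are not part of the script.

```python
class SharedTimeframe:
    timeframe = 1  # class attribute, shared by all instances

    @classmethod
    def set_timeframe(cls, timeframe):
        cls.timeframe = timeframe


class PerInstanceTimeframe:
    VALID = {1: "hourly", 2: "daily", 3: "monthly"}

    def __init__(self, timeframe=1):
        self.set_timeframe(timeframe)

    def set_timeframe(self, timeframe):
        # reject anything other than the three supported codes
        if timeframe not in self.VALID:
            raise ValueError(f"timeframe must be one of {sorted(self.VALID)}")
        self.timeframe = timeframe  # instance attribute


a, b = SharedTimeframe(), SharedTimeframe()
a.set_timeframe(2)
print(b.timeframe)  # 2: b changed as well

c, d = PerInstanceTimeframe(), PerInstanceTimeframe()
c.set_timeframe(2)
print(d.timeframe)  # 1: d is unaffected
```

With a single getter per run, as in this notebook, the shared class attribute causes no harm; the per-instance variant only matters once several getters exist at the same time.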



List of weather stations near Lake of the Woods and the Rainy River:

1. FORT FRANCES RCS ONTARIO Weather StationID 46507 (Hourly from 2007)
2. MINE CENTRE SOUTHWEST ONTARIO Weather StationID 44343 (Daily from 2005)
3. ATIKOKAN (AUT) ONTARIO Weather StationID 10220 (Hourly from 1994)
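
A sketch of how the list above could be kept as data, so that the same download loop can be pointed at each station in turn. The `STATIONS` structure and the `output_name` helper are illustrative and not part of the original script; the file-name pattern mirrors the one used in `download_csv_files`, and the end year 2021 is just the one used later in this notebook.

```python
# Station details copied from the list above
STATIONS = [
    {"name": "FORT FRANCES RCS", "station_id": 46507, "timeframe": "hourly", "from_year": 2007},
    {"name": "MINE CENTRE SOUTHWEST", "station_id": 44343, "timeframe": "daily", "from_year": 2005},
    {"name": "ATIKOKAN (AUT)", "station_id": 10220, "timeframe": "hourly", "from_year": 1994},
]

# Codes expected by the bulk-download URL
TIMEFRAME_CODES = {"hourly": 1, "daily": 2, "monthly": 3}


def output_name(station: dict, to_year: int, fmt: str = "csv") -> str:
    """Build the output file name for one station (hypothetical helper)."""
    return f"{station['station_id']}_{station['from_year']}_{to_year}_{station['timeframe']}.{fmt}"


for station in STATIONS:
    code = TIMEFRAME_CODES[station["timeframe"]]
    print(station["name"], code, output_name(station, 2021))
```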




In [4]:
# https://climate.weather.gc.ca/climate_data/bulk_data_e.html?format=csv&stationID=46507&Year=2022&Month=12&timeframe=1&submit=Download+Data

def main():

  CDG = ClimateDataGetter("https://climate.weather.gc.ca/climate_data/bulk_data_e.html",
                          "csv", 10220, 1994, 2021, 1, 12) # an instance of the class object
  CDG.set_timeframe(1)
  CDG.download_csv_files()
  # ClimateDataGetter.download_csv_files(CDG) # equivalent to the call on the previous line

if __name__=="__main__":
  main()

Extracted 245448 hourly climate data from station 10220 from 1994 to 2021.
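
A quick sanity check on that count, assuming the hourly files hold exactly one row per hour (so that missing observations appear as empty cells rather than missing rows): the number of hours from 1 January 1994 to the end of 2021 should equal the number of rows extracted.

```python
from datetime import datetime

start = datetime(1994, 1, 1)
end = datetime(2022, 1, 1)  # exclusive upper bound: the first hour after 2021

# 28 years including 7 leap days = 10227 days of 24 hours each
expected_rows = int((end - start).total_seconds() // 3600)
print(expected_rows)  # 245448
```

The result matches the 245448 rows reported above, which suggests that no month was skipped or downloaded twice.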


In [3]:
df = pd.read_csv('/content/drive/MyDrive/data/10220_1994_2021_hourly.csv')
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 245448 entries, 0 to 245447
Data columns (total 30 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   Longitude (x)        245448 non-null  float64
 1   Latitude (y)         245448 non-null  float64
 2   Station Name         245448 non-null  object 
 3   Climate ID           245448 non-null  object 
 4   Date/Time (LST)      245448 non-null  object 
 5   Year                 245448 non-null  int64  
 6   Month                245448 non-null  int64  
 7   Day                  245448 non-null  int64  
 8   Time (LST)           245448 non-null  object 
 9   Temp (°C)            238354 non-null  float64
 10  Temp Flag            584 non-null     object 
 11  Dew Point Temp (°C)  235233 non-null  float64
 12  Dew Point Temp Flag  3852 non-null    object 
 13  Rel Hum (%)          235233 non-null  float64
 14  Rel Hum Flag         3855 non-null    object 
 15  Wind Dir (10s deg