# Jupyter Notebook Purpose

- Running all code cells downloads the most current data from each source through a series of HTTP requests
    - this works only as long as the links remain active
- The data are saved in the same format in which each source provides them
- The data are processed in later Jupyter Notebooks
    - the notebooks are numbered in the order in which they should be run (e.g. 01, 02)

## Group 2 Members

- 1. Melissa Hartwick - [Email](mailto:mhartwic@uwaterloo.ca)
- 2. McKinleigh Needham - [Email](mailto:mjneedha@uwaterloo.ca)
- 3. Daniel Adam Cebula  - [Email](mailto:dacebula@uwaterloo.ca)
- 4. Athithian Selvadurai - [Email](mailto:a6selvad@uwaterloo.ca)
- 5. Aravind Kakarala - [Email](mailto:akakaral@uwaterloo.ca)
- 6. Allan Sales - [Email](mailto:asales@uwaterloo.ca)

# Python Dependencies

In [7]:
import pandas as pd
import numpy as np
import os
import requests  # simple HTTP library for Python
import io        # Tool for working with streams (Input/Output data)
import matplotlib.pyplot as plt
import glob

%matplotlib inline

In [2]:
# function to create folders
def create_folder(cwd="DEFAULT", folder_name="DEFAULT"):
    """
    If folder_name is not an empty string, create the folder when it does not
    already exist and return its file path.
    Otherwise (empty folder_name, or the folder cannot be created) return None.
    """
    # an empty folder name is not valid
    if folder_name == "":
        return None

    # get file_path for folder_name
    file_path = os.path.join(cwd, folder_name)

    # if folder exists then just return filepath
    if os.path.exists(file_path):
        return file_path

    try:
        os.makedirs(file_path)  # create the folder (and any missing parents)
        return file_path
    except OSError:
        print("\n(꒪Д꒪)ノ\tPATH ERROR -- cannot create folder:  ", folder_name)
        return None
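
For comparison, the standard library can fold the existence check and the creation into a single call. The following is a minimal sketch, not the notebook's own helper: `ensure_folder` is a hypothetical name, and it assumes that raising an exception on failure is acceptable instead of printing a message and returning None.

```python
import os
import tempfile

def ensure_folder(parent, folder_name):
    """Create parent/folder_name if needed and return its path; None for an empty name."""
    if not folder_name:
        return None
    path = os.path.join(parent, folder_name)
    # exist_ok=True makes the call a no-op when the folder is already there
    os.makedirs(path, exist_ok=True)
    return path

# demo in a throwaway directory so nothing is left behind
with tempfile.TemporaryDirectory() as tmp:
    first = ensure_folder(tmp, "TORONTO_WEATHER")
    second = ensure_folder(tmp, "TORONTO_WEATHER")  # second call reuses the folder
    print(first == second, os.path.isdir(first))
```

Calling the function twice with the same name returns the same path both times, which is the behaviour the rest of the notebook relies on when cells are re-run.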

# Government of Canada - Historical Data - [link](https://climate.weather.gc.ca/historical_data/search_historic_data_e.html)

- Instructions for Bulk Data Download can be found on this [Google Drive](https://drive.google.com/drive/folders/1WJCDEU34c60IfOnG4rv5EPZ4IhhW9vZH)
    - [Weather Station Inventory](https://drive.google.com/file/d/1egfzGgzUb0RFu_EE5AYFZtsyXPfZ11y2/view?usp=sharing)
    - [URLs for bulk data download](https://drive.google.com/file/d/14Ifvj2oxO0IMqz1Kh3YVntwBOmXY300_/view?usp=sharing)
- Get Weather data for 2014 - 2019
    - neglect any effect that COVID-19 might have on the data

In [3]:
# Create Folder for Toronto Historical Weather
cwd = os.getcwd()

Toronto_Weather_Directory = create_folder(cwd=cwd, folder_name="TORONTO_WEATHER")

In [4]:
# Canadian Historical Weather Station List
# need to download the file and store it
station_inventory = os.path.join(Toronto_Weather_Directory, "Station Inventory EN.csv")

# open the .csv file with pandas
station_inventory_df = pd.read_csv(station_inventory, skiprows=2)
station_inventory_df.head()

Unnamed: 0,Name,Province,Climate ID,Station ID,WMO ID,TC ID,Latitude (Decimal Degrees),Longitude (Decimal Degrees),Latitude,Longitude,Elevation (m),First Year,Last Year,HLY First Year,HLY Last Year,DLY First Year,DLY Last Year,MLY First Year,MLY Last Year
0,ACTIVE PASS,BRITISH COLUMBIA,1010066,14,,,48.87,-123.28,485200000,-1231700000,4.0,1984,1996,,,1984.0,1996.0,1984.0,1996.0
1,ALBERT HEAD,BRITISH COLUMBIA,1010235,15,,,48.4,-123.48,482400000,-1232900000,17.0,1971,1995,,,1971.0,1995.0,1971.0,1995.0
2,BAMBERTON OCEAN CEMENT,BRITISH COLUMBIA,1010595,16,,,48.58,-123.52,483500000,-1233100000,85.3,1961,1980,,,1961.0,1980.0,1961.0,1980.0
3,BEAR CREEK,BRITISH COLUMBIA,1010720,17,,,48.5,-124.0,483000000,-1240000000,350.5,1910,1971,,,1910.0,1971.0,1910.0,1971.0
4,BEAVER LAKE,BRITISH COLUMBIA,1010774,18,,,48.5,-123.35,483000000,-1232100000,61.0,1894,1952,,,1894.0,1952.0,1894.0,1952.0


In [5]:
# Need to grab the weather stations that are in
# Toronto, Ontario and were active throughout the years 2014 - 2019
station_inventory_df["Name"] = station_inventory_df["Name"].str.upper()
station_inventory_df["Province"] = station_inventory_df["Province"].str.upper()

# Filters for only valid Historical Toronto Weather Stations within a geographic range
filter1 = station_inventory_df["Name"].str.contains("TORONTO")
filter2 = station_inventory_df["Province"].str.contains("ONTARIO")
filter3 = station_inventory_df["HLY First Year"] <= 2014
filter4 = station_inventory_df["HLY Last Year"] >= 2019
# Latitude and Longitude for Union Station
latitude = 43.645840
longitude = -79.379861
filter5 = station_inventory_df["Latitude (Decimal Degrees)"] >= latitude - 0.3
filter6 = station_inventory_df["Latitude (Decimal Degrees)"] <= latitude + 0.3
filter7 = station_inventory_df["Longitude (Decimal Degrees)"] >= longitude - 0.3
filter8 = station_inventory_df["Longitude (Decimal Degrees)"] <= longitude + 0.3

# take a subset of the data for consideration
station_inventory_df = station_inventory_df.loc[(((filter1&filter2)|
                                                  (filter5&filter6&filter7&filter8))
                                                 &filter3&filter4)].reset_index(drop=True)
station_inventory_df

Unnamed: 0,Name,Province,Climate ID,Station ID,WMO ID,TC ID,Latitude (Decimal Degrees),Longitude (Decimal Degrees),Latitude,Longitude,Elevation (m),First Year,Last Year,HLY First Year,HLY Last Year,DLY First Year,DLY Last Year,MLY First Year,MLY Last Year
0,TORONTO CITY,ONTARIO,6158355,31688,71508.0,XTO,43.67,-79.4,434000000,-792400000,112.5,2002,2021,2002.0,2021.0,2002.0,2021.0,2003.0,2006.0
1,TORONTO CITY CENTRE,ONTARIO,6158359,48549,71265.0,YTZ,43.63,-79.4,433739000,-792346000,76.8,2009,2021,2009.0,2021.0,2010.0,2021.0,,
2,TORONTO INTL A,ONTARIO,6158731,51459,71624.0,YYZ,43.68,-79.63,434036000,-793750000,173.4,2013,2021,2013.0,2021.0,2013.0,2021.0,,


In [6]:
# with the DataFrame download the hourly data from 2014 - 2019 timeframe for
# the identified 3 Weather Stations
Station_ID = station_inventory_df["Station ID"]
Station_Name = station_inventory_df["Name"]
Years = list(range(2014,2020))
Months = list(range(1,13))

# iterate through the 3 stations
for index, station in enumerate(Station_ID):
    
    # create a folder and get filepath for each 
    folderpath = create_folder(cwd=Toronto_Weather_Directory, folder_name=Station_Name[index])
    
    # iterate through the 6 years
    for year in Years:
        
        # iterate through each month
        for month in Months:
            
            URL = (f"http://climate.weather.gc.ca/climate_data/bulk_data_e.html?format=csv&" +
                    f"stationID={station}&Year={year}&Month={month}&Day=14" +
                    f"&timeframe=1&submit= Download+Data")

            # each request returns one month of hourly data as a .csv file
            with requests.get(URL, stream=True) as response:
                # location where it will be saved
                filepath = os.path.join(folderpath, f"{year}-{month}-{Station_Name[index]}.csv")
                # save the .csv
                with open(filepath, "wb") as file:
                    for chunk in response.iter_content(chunk_size=128):
                        file.write(chunk)
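
The three nested loops above visit every station, year and month combination and build one URL for each. The sketch below shows the same enumeration with `itertools.product` and assembles the query string with `urllib.parse.urlencode`. It is an illustration under assumptions: `build_bulk_url` is a hypothetical name, the parameter names are copied from the URL in the loop, and the leading space in the original `submit` value is assumed not to matter.

```python
from itertools import product
from urllib.parse import urlencode

BASE_URL = "http://climate.weather.gc.ca/climate_data/bulk_data_e.html"

def build_bulk_url(station_id, year, month):
    """Assemble the bulk-download URL for one station and one month."""
    params = {
        "format": "csv",
        "stationID": station_id,
        "Year": year,
        "Month": month,
        "Day": 14,
        "timeframe": 1,
        "submit": "Download Data",  # urlencode turns the space into "+"
    }
    return f"{BASE_URL}?{urlencode(params)}"

station_ids = [31688, 48549, 51459]  # Station IDs from the filtered inventory above
years = range(2014, 2020)
months = range(1, 13)

# one (station, year, month, url) tuple per download
jobs = [(s, y, m, build_bulk_url(s, y, m))
        for s, y, m in product(station_ids, years, months)]

print(len(jobs))
print(jobs[0][3])
```

Flattening the loops into a list of jobs makes it easy to see up front how many requests will be sent (3 stations × 6 years × 12 months).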

In [9]:
os.path.join(Toronto_Weather_Directory, Station_Name[0])

'C:\\Users\\danie\\OneDrive\\Documents\\Data_Science_ML-Group_2\\TORONTO_WEATHER\\TORONTO CITY'

In [18]:
# concatenate all dataframes together for each Weather Station
# Toronto City Weather Station
Toronto_City_files = glob.glob(os.path.join(os.path.join(Toronto_Weather_Directory, Station_Name[0]),
                                            "*.csv"))

df_from_each_file = (pd.read_csv(f) for f in Toronto_City_files)
Toronto_City_df   = pd.concat(df_from_each_file, ignore_index=True)
Toronto_City_df = Toronto_City_df.loc[:, ['Station Name', 'Date/Time (LST)', 'Temp (°C)',
       'Precip. Amount (mm)', 'Wind Dir (10s deg)', 'Wind Spd (km/h)',
       'Stn Press (kPa)', 'Hmdx', 'Wind Chill']]
del df_from_each_file

# Toronto City Centre Weather Station
Toronto_City_Centre_files = glob.glob(os.path.join(os.path.join(Toronto_Weather_Directory, Station_Name[1]),
                                            "*.csv"))

df_from_each_file = (pd.read_csv(f) for f in Toronto_City_Centre_files)
Toronto_City_Centre_df = pd.concat(df_from_each_file, ignore_index=True)
Toronto_City_Centre_df = Toronto_City_Centre_df.loc[:, ['Station Name', 'Date/Time (LST)', 'Temp (°C)',
       'Precip. Amount (mm)', 'Wind Dir (10s deg)', 'Wind Spd (km/h)',
       'Stn Press (kPa)', 'Hmdx', 'Wind Chill']]
del df_from_each_file

# Toronto INTL A Weather Station
Toronto_INTL_A_files = glob.glob(os.path.join(os.path.join(Toronto_Weather_Directory, Station_Name[2]),
                                            "*.csv"))

df_from_each_file = (pd.read_csv(f) for f in Toronto_INTL_A_files)
Toronto_INTL_A_df   = pd.concat(df_from_each_file, ignore_index=True)
Toronto_INTL_A_df = Toronto_INTL_A_df.loc[:, ['Station Name', 'Date/Time (LST)', 'Temp (°C)',
       'Precip. Amount (mm)', 'Wind Dir (10s deg)', 'Wind Spd (km/h)',
       'Stn Press (kPa)', 'Hmdx', 'Wind Chill']]
del df_from_each_file

# Combine all the datasets
Combined_df = pd.concat((Toronto_City_df, Toronto_City_Centre_df, Toronto_INTL_A_df), ignore_index=True)
Combined_df["Date/Time (LST)"] = pd.to_datetime(Combined_df["Date/Time (LST)"])
Combined_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 157752 entries, 0 to 157751
Data columns (total 9 columns):
 #   Column               Non-Null Count   Dtype         
---  ------               --------------   -----         
 0   Station Name         157752 non-null  object        
 1   Date/Time (LST)      157752 non-null  datetime64[ns]
 2   Temp (°C)            157254 non-null  float64       
 3   Precip. Amount (mm)  100729 non-null  float64       
 4   Wind Dir (10s deg)   101944 non-null  float64       
 5   Wind Spd (km/h)      105014 non-null  float64       
 6   Stn Press (kPa)      157213 non-null  float64       
 7   Hmdx                 24484 non-null   float64       
 8   Wind Chill           22744 non-null   float64       
dtypes: datetime64[ns](1), float64(7), object(1)
memory usage: 10.8+ MB
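
The three per-station blocks above differ only in the folder they read, so the same frames could be built by one helper called in a loop. A sketch under that assumption follows; `load_station` is a hypothetical name, and the demo writes two tiny CSV files with made-up values to a temporary folder instead of touching the downloaded data.

```python
import glob
import os
import tempfile

import pandas as pd

def load_station(folder, columns):
    """Read every .csv in folder and stack the rows, keeping only the given columns."""
    files = sorted(glob.glob(os.path.join(folder, "*.csv")))
    frames = [pd.read_csv(f) for f in files]
    return pd.concat(frames, ignore_index=True).loc[:, columns]

with tempfile.TemporaryDirectory() as tmp:
    # two stand-in monthly files with an extra column that should be dropped
    for name, speed in [("2014-1-DEMO.csv", 21.5), ("2014-2-DEMO.csv", 32.0)]:
        pd.DataFrame(
            {"Station Name": ["DEMO"], "Wind Spd (km/h)": [speed], "Extra": [0]}
        ).to_csv(os.path.join(tmp, name), index=False)
    demo_df = load_station(tmp, ["Station Name", "Wind Spd (km/h)"])

print(demo_df.shape)
```

In the notebook this would be called once per entry of `Station_Name`, and the resulting frames passed to `pd.concat` as in the final step of the cell.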


In [26]:
# Aggregate to get the average of all numerical fields
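# select the columns with a list (double brackets); indexing a groupby with a
# bare tuple of names raises a FutureWarning in older pandas and fails in newer versions
Groupby_df = Combined_df.groupby(["Date/Time (LST)"])[[
    'Temp (°C)', 'Precip. Amount (mm)', 'Wind Dir (10s deg)',
    'Wind Spd (km/h)', 'Stn Press (kPa)', 'Hmdx', 'Wind Chill'
]].mean().round(decimals=2).reset_index(drop=False)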
Groupby_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52584 entries, 0 to 52583
Data columns (total 8 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   Date/Time (LST)      52584 non-null  datetime64[ns]
 1   Temp (°C)            52582 non-null  float64       
 2   Precip. Amount (mm)  52573 non-null  float64       
 3   Wind Dir (10s deg)   52567 non-null  float64       
 4   Wind Spd (km/h)      52575 non-null  float64       
 5   Stn Press (kPa)      52582 non-null  float64       
 6   Hmdx                 9535 non-null   float64       
 7   Wind Chill           12403 non-null  float64       
dtypes: datetime64[ns](1), float64(7)
memory usage: 3.2 MB
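
The non-null counts rise after averaging because `mean` skips missing values by default: a timestamp gets a value as long as at least one station reported one. A small sketch with made-up readings (the numbers are illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# three stations reporting at two timestamps, with gaps
readings = pd.DataFrame({
    "Date/Time (LST)": pd.to_datetime(
        ["2014-01-01 00:00"] * 3 + ["2014-01-01 01:00"] * 3
    ),
    "Temp (°C)": [-9.0, -9.2, np.nan, -9.0, np.nan, np.nan],
    "Hmdx": [np.nan] * 6,  # no station reported a humidex value
})

hourly = (
    readings.groupby("Date/Time (LST)")[["Temp (°C)", "Hmdx"]]
    .mean()          # NaN entries are ignored within each group
    .round(2)
    .reset_index()
)
print(hourly)
```

A column stays empty for a timestamp only when every station is missing it, which is why `Hmdx` and `Wind Chill` remain sparse while the other columns are nearly complete.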


In [29]:
# Write The file to TORONTO_WEATHER folder
Groupby_df.to_csv(os.path.join(Toronto_Weather_Directory, "Toronto_Weather.csv"),index=False)

Groupby_df.head()

Unnamed: 0,Date/Time (LST),Temp (°C),Precip. Amount (mm),Wind Dir (10s deg),Wind Spd (km/h),Stn Press (kPa),Hmdx,Wind Chill
0,2014-01-01 00:00:00,-9.07,0.0,27.0,21.5,100.93,,-17.0
1,2014-01-01 01:00:00,-9.0,0.0,26.5,32.0,101.0,,-18.5
2,2014-01-01 02:00:00,-9.4,0.0,26.5,26.0,101.06,,-18.5
3,2014-01-01 03:00:00,-9.9,0.0,24.5,24.5,101.1,,-18.5
4,2014-01-01 04:00:00,-9.97,0.0,25.0,25.0,101.13,,-18.5


# Toronto Open Data Catalogue - TTC Delay Data - [link](https://open.toronto.ca/)

- Data can be downloaded here
    - [TTC Subway Delay](https://open.toronto.ca/dataset/ttc-subway-delay-data/)
    - [TTC Bus Delay](https://open.toronto.ca/dataset/ttc-bus-delay-data/)
    - [TTC Streetcar Delay](https://open.toronto.ca/dataset/ttc-streetcar-delay-data/)
- We will be looking at data available for January 2014 - December 2019
    - neglect any effect that COVID-19 might have on the delay data

In [30]:
# Create Folder for TTC Delay data
cwd = os.getcwd()

TTC_Directory = create_folder(cwd=cwd, folder_name="TTC")
TTC_Subway = create_folder(cwd=TTC_Directory, folder_name="SUBWAY")
TTC_Bus = create_folder(cwd=TTC_Directory, folder_name="BUS")
TTC_Streetcar = create_folder(cwd=TTC_Directory, folder_name="STREETCAR")

## Delay Metadata

In [43]:
# Get Subway Delay Codes
URL = "https://ckan0.cf.opendata.inter.prod-toronto.ca/download_resource/fece136b-224a-412a-b191-8d31eb00491e"

with requests.get(URL, stream=True) as response:
    # location where it will be saved
    filepath = os.path.join(TTC_Directory, "Subway_Codes.xlsx")
    # save the .xlsx
    with open(filepath, "wb") as file:
        for chunk in response.iter_content(chunk_size=128):
            file.write(chunk)
            
# place in dataframe
codes_df = pd.read_excel(filepath, skiprows=1)
codes_df = codes_df.loc[:, ["SUB RMENU CODE", "CODE DESCRIPTION",
                            "SRT RMENU CODE", "CODE DESCRIPTION.1"]]
# subway codes occupy the first pair of columns
subway_codes_df = codes_df.loc[:, ["SUB RMENU CODE", "CODE DESCRIPTION"]].reset_index(drop=True)
# SRT codes occupy the second pair; drop rows where no SRT code is listed
srt_codes_df = (
    codes_df.loc[codes_df["SRT RMENU CODE"].notnull(),
                 ["SRT RMENU CODE", "CODE DESCRIPTION.1"]]
    .rename(columns={"CODE DESCRIPTION.1": "CODE DESCRIPTION"})
    .reset_index(drop=True)
)

# save to .csv file
subway_codes_df.to_csv(os.path.join(TTC_Directory, "Subway_Codes.csv"), index=False)
srt_codes_df.to_csv(os.path.join(TTC_Directory, "SRT_Codes.csv"), index=False)

In [45]:
# Get Subway Delay Metadata

URL = "https://ckan0.cf.opendata.inter.prod-toronto.ca/download_resource/54247e39-5a7d-40db-a137-82b2a9ab0708"

with requests.get(URL, stream=True) as response:
    # location where it will be saved
    filepath = os.path.join(TTC_Directory, "Subway_Metadata.xlsx")
    # save the .xlsx
    with open(filepath, "wb") as file:
        for chunk in response.iter_content(chunk_size=128):
            file.write(chunk)

In [46]:
# Get Bus Delay Metadata

URL = "https://ckan0.cf.opendata.inter.prod-toronto.ca/download_resource/fed6d76e-9167-4268-9eb1-139fcde3f58a"

with requests.get(URL, stream=True) as response:
    # location where it will be saved
    filepath = os.path.join(TTC_Directory, "Bus_Metadata.xlsx")
    # save the .xlsx
    with open(filepath, "wb") as file:
        for chunk in response.iter_content(chunk_size=128):
            file.write(chunk)

In [47]:
# Get Streetcar Delay Metadata

URL = "https://ckan0.cf.opendata.inter.prod-toronto.ca/download_resource/146bfbda-8146-4ff8-b3dc-1eec3a5170fe"

with requests.get(URL, stream=True) as response:
    # location where it will be saved
    filepath = os.path.join(TTC_Directory, "Streetcar_Metadata.xlsx")
    # save the .xlsx
    with open(filepath, "wb") as file:
        for chunk in response.iter_content(chunk_size=128):
            file.write(chunk)
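
The download cells in this section repeat one pattern: request a URL, then stream the response body to a file. A sketch of a reusable helper follows. `download_file` is a hypothetical name; the `get` argument is expected to behave like `requests.get` and is passed in explicitly so that the demo can run offline with a stand-in. The status check and the larger chunk size are additions, not part of the cells above.

```python
import os
import tempfile

def download_file(url, filepath, get, chunk_size=8192):
    """Stream url to filepath using a requests.get-compatible callable."""
    with get(url, stream=True) as response:
        response.raise_for_status()  # stop instead of saving an error page
        with open(filepath, "wb") as file:
            for chunk in response.iter_content(chunk_size=chunk_size):
                file.write(chunk)
    return filepath

class FakeResponse:
    """Offline stand-in mimicking the small part of requests.Response used here."""
    def __init__(self, payload):
        self.payload = payload
    def __enter__(self):
        return self
    def __exit__(self, *exc):
        return False
    def raise_for_status(self):
        pass  # the stand-in never fails
    def iter_content(self, chunk_size):
        for i in range(0, len(self.payload), chunk_size):
            yield self.payload[i:i + chunk_size]

def fake_get(url, stream=True):
    return FakeResponse(b"demo bytes")

with tempfile.TemporaryDirectory() as tmp:
    saved = download_file("https://example.invalid/file",
                          os.path.join(tmp, "demo.bin"), get=fake_get)
    with open(saved, "rb") as handle:
        content = handle.read()

print(content)
```

With the real library the call would look like `download_file(URL, filepath, get=requests.get)`, once for each metadata file.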

## Subway Delay Data

- [2014-01 - 2017-04](https://ckan0.cf.opendata.inter.prod-toronto.ca/download_resource/6664420f-316f-4f94-9ba4-d4b4677aeea9)
- [2017-05](https://ckan0.cf.opendata.inter.prod-toronto.ca/download_resource/d1159888-0035-45a0-b238-86b546555ac0)
- [2017-06](https://ckan0.cf.opendata.inter.prod-toronto.ca/download_resource/240d8e8c-d300-4f91-b94f-cbb3d136d25e)
- [2017-07](https://ckan0.cf.opendata.inter.prod-toronto.ca/download_resource/98d4ac77-aa9f-40a3-97ee-6fc9070da252)
- [2017-08](https://ckan0.cf.opendata.inter.prod-toronto.ca/download_resource/92e7649a-cf2f-4ac7-9802-b7f92f00b384)
- [2017-09](https://ckan0.cf.opendata.inter.prod-toronto.ca/download_resource/61412f10-656b-4992-9a1a-a156dcf2f6c7)
- [2017-10](https://ckan0.cf.opendata.inter.prod-toronto.ca/download_resource/69a6db37-7982-49c7-8dbc-56919a92afca)
- [2017-11](https://ckan0.cf.opendata.inter.prod-toronto.ca/download_resource/10080217-8022-41c0-a8ba-2a455b8d9d6e)
- [2017-12](https://ckan0.cf.opendata.inter.prod-toronto.ca/download_resource/a731c4bb-630a-4530-b590-b325f9b4ef9b)
- [2018-01](https://ckan0.cf.opendata.inter.prod-toronto.ca/download_resource/5dd04139-6dd0-46c4-99a8-45e7b7607e5c)
- [2018-02](https://ckan0.cf.opendata.inter.prod-toronto.ca/download_resource/84d2cc1f-296d-4593-a6eb-504a7a5c769a)
- [2018-03](https://ckan0.cf.opendata.inter.prod-toronto.ca/download_resource/83e7a4e7-fa62-4932-afc2-8c8ca2235835)
- [2018-04](https://ckan0.cf.opendata.inter.prod-toronto.ca/download_resource/d9bf907b-413c-4028-b9bc-0b72a66bf11e)
- [2018-05](https://ckan0.cf.opendata.inter.prod-toronto.ca/download_resource/6767e18c-cbb3-461a-b4ae-ba202c9b73e4)
- [2018-06](https://ckan0.cf.opendata.inter.prod-toronto.ca/download_resource/e91ccb8b-9b0a-4479-adcd-447bfb298ab5)
- [2018-07](https://ckan0.cf.opendata.inter.prod-toronto.ca/download_resource/718061ee-8351-4cbb-818a-691f03e92041)
- [2018-08](https://ckan0.cf.opendata.inter.prod-toronto.ca/download_resource/b96a8119-9241-4392-92c7-f2fb6f4eff5f)
- [2018-09](https://ckan0.cf.opendata.inter.prod-toronto.ca/download_resource/5e22688c-e841-4249-8a18-243dc70307c1)
- [2018-10](https://ckan0.cf.opendata.inter.prod-toronto.ca/download_resource/9d4cae40-1a5e-4dea-b3c2-95818ae0b521)
- [2018-11](https://ckan0.cf.opendata.inter.prod-toronto.ca/download_resource/5e3067a3-ff0a-4ecf-a25b-5cf59e7adc25)
- [2018-12](https://ckan0.cf.opendata.inter.prod-toronto.ca/download_resource/b4e5c8d0-36b3-4e24-8f24-30bf59fded2d)
- [2019-01](https://ckan0.cf.opendata.inter.prod-toronto.ca/download_resource/9a824dba-20cc-40b1-8f26-778a34a0f3a8)
- [2019-02](https://ckan0.cf.opendata.inter.prod-toronto.ca/download_resource/e6bac74e-2da2-4429-a76f-202eba3d9193)
- [2019-03](https://ckan0.cf.opendata.inter.prod-toronto.ca/download_resource/0511879f-3233-4a42-8c28-93b432132c8b)
- [2019-04](https://ckan0.cf.opendata.inter.prod-toronto.ca/download_resource/447b4a5a-f696-4f05-86c0-9602f56922e5)
- [2019-05](https://ckan0.cf.opendata.inter.prod-toronto.ca/download_resource/a302fcab-81a1-4142-b0ec-031b0666c1df)
- [2019-06](https://ckan0.cf.opendata.inter.prod-toronto.ca/download_resource/655a138c-d381-4fe7-b3b3-a6620825161f)
- [2019-07](https://ckan0.cf.opendata.inter.prod-toronto.ca/download_resource/34d9619f-0239-4dad-a598-b6bc71ce1071)
- [2019-08](https://ckan0.cf.opendata.inter.prod-toronto.ca/download_resource/b6557580-a0f4-4c96-9ce2-82657b62e88a)
- [2019-09](https://ckan0.cf.opendata.inter.prod-toronto.ca/download_resource/e2a5e386-ddf7-4416-8e84-c3508c4f9a4f)
- [2019-10](https://ckan0.cf.opendata.inter.prod-toronto.ca/download_resource/fd837bd2-85ed-485e-ba02-46e29af52024)
- [2019-11](https://ckan0.cf.opendata.inter.prod-toronto.ca/download_resource/ac734fde-145d-4313-9090-3d8137d39852)
- [2019-12](https://ckan0.cf.opendata.inter.prod-toronto.ca/download_resource/967ea5f7-de10-4ca8-a2fd-e92a5ffd0e16)

In [None]:
Subway_Year_Month = [
    "2014-01 - 2017-04-Subway_Delay"
    , "2017-05"
    , "2017-06"
    , 
]
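
After the first multi-year file, the labels in the list above follow a regular `YYYY-MM` pattern, so they could be generated rather than typed out. A minimal sketch, assuming the monthly labels run from 2017-05 through 2019-12 as in the link list; `month_labels` is a hypothetical helper and how the labels are used afterwards is left open.

```python
def month_labels(start_year, start_month, end_year, end_month):
    """Return 'YYYY-MM' labels for every month from start to end, inclusive."""
    labels = []
    year, month = start_year, start_month
    while (year, month) <= (end_year, end_month):
        labels.append(f"{year}-{month:02d}")
        month += 1
        if month > 12:  # roll over into January of the next year
            year, month = year + 1, 1
    return labels

monthly = month_labels(2017, 5, 2019, 12)
print(len(monthly))
print(monthly[:3], monthly[-1])
```

The 32 generated labels line up one-to-one with the monthly links listed under Subway Delay Data.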