# Jupyter Notebook Purpose
- read in the .csv.bz2 compressed files into pandas DataFrames and merge them into 1 DataFrame for data analysis / manipulation and machine learning
    - holding everything in 1 DataFrame uses more memory, but makes the data easier to work with
    - final dataframe is also written as a .csv.bz2 compressed file

## Group 10 Members

- ### A. Nidhi Punja - [Email](mailto:npunja@uwaterloo.ca)
- ### B. Judith Roth - [Email](mailto:j5roth@uwaterloo.ca)
- ### C. Iman Dordizadeh Basirabad - [Email](mailto:idordiza@uwaterloo.ca)
- ### D. Daniel Adam Cebula - [Email](mailto:dacebula@uwaterloo.ca)
- ### E. Cynthia Fung - [Email](mailto:c27fung@uwaterloo.ca)
- ### F. Ben Klassen - [Email](mailto:b6klasse@uwaterloo.ca)

In [1]:
# Group 10 Collaborators
COLLABORATORS = ["Nidhi Punja",
                 "Judith Roth",
                 "Iman Dordizadeh Basirabad",
                 "Daniel Adam Cebula",
                 "Cynthia Fung",
                 "Ben Klassen"]

# Print each Group 10 member, right-aligned in a 30-character field padded with "-"
for member in COLLABORATORS:
    print(f"Group 10 Member: {member:->30}")

Group 10 Member: -------------------Nidhi Punja
Group 10 Member: -------------------Judith Roth
Group 10 Member: -----Iman Dordizadeh Basirabad
Group 10 Member: ------------Daniel Adam Cebula
Group 10 Member: ------------------Cynthia Fung
Group 10 Member: -------------------Ben Klassen


# Table of Contents

## 1. [Python Dependencies](#1.-Python-Libraries-and-Dependencies[1,2,3,4,5])
___
## 2. [Folder Creation for Data Storage](#2.-Folders-for-Data-Storage)
___
## 3. [Read in the Data](#3.-Pandas-to-read-.csv.bz2-compressed-files-into-DataFrames)
### 3a. [DataFrame Merging](#3a.-Merge-compressed-file-to-generate-final-pandas-DataFrame)
### 3b. [DataFrame Metadata](#3b.-Metadata-of-the-Final-DataFrame[6,7,8,9,10])
___
## 4. [Read in the Merged DataFrame](#4.-TFS-Fire-Incidents,-Toronto-Historical-Weather-and-TFS-Fire-Station-Locations-DataFrame)
- Steps 1 - 3 merge the datasets and create the metadata
- This step is the code needed to read in the Merged DataFrame
___
## 5. [References](#5.-Jupyter-Notebook-References)
___

# 1. Python Libraries and Dependencies<sup>[1,2,3,4,5]</sup>

In [2]:
# Python Modules for Miscellaneous reasons
from zipfile import ZipFile  # to read and write to zipped folders
import requests  # simple HTTP library for Python
import os        # portable way to use operating system functionalities
import io        # Tool for working with streams (Input/Output data)
import datetime  # python classes for manipulating dates and times
import dateutil  # powerful extensions to standard datetime Python module
import time      # used for time.sleep() to delay the HTTP requests ever so slightly
import re        # used for Python regex library
import math      # radians, cos, sin, asin and sqrt are used for haversine formula
from IPython.display import display # use this to see the entire DataFrame in the right format
from create_folder import create_folder # create folder function that I have defined and placed in create_folder.py file
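
`create_folder.py` is not included in this notebook. Here is a minimal sketch of what such a helper might look like, assuming it creates the folder when it is missing and returns the folder path (the return value is passed to `os.path.join` in section 2). The name `create_folder_sketch` and its body are hypothetical, not the actual implementation:

```python
import os
import tempfile


def create_folder_sketch(folder_name):
    """Hypothetical stand-in for create_folder: make the folder, return its path."""
    # exist_ok=True makes re-running the notebook cell harmless
    os.makedirs(folder_name, exist_ok=True)
    return folder_name


# Try it inside a throwaway directory so nothing is left behind
with tempfile.TemporaryDirectory() as base:
    raw = create_folder_sketch(os.path.join(base, "RAW_ZIPPED"))
    fire = create_folder_sketch(os.path.join(raw, "FIRE_INCIDENTS"))
    created = os.path.isdir(fire)
    print(created)
```

Returning the path is what lets the nested folders be built by chaining calls, as the cells in section 2 do.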

In [3]:
# DATA ANALYSIS / VISUALIZATION Python Dependencies
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# 2. Folders for Data Storage

In [4]:
# Here are the major directory names that will hold the data / metadata
RAW_ZIPPED_DIRECTORY = create_folder(folder_name="RAW_ZIPPED")
RAW_UNZIPPED_DIRECTORY = create_folder(folder_name="RAW_UNZIPPED")
PROCESSED_ZIPPED_DIRECTORY = create_folder(folder_name="PROCESSED_ZIPPED")
PROCESSED_UNZIPPED_DIRECTORY = create_folder(folder_name="PROCESSED_UNZIPPED")

In [5]:
# Create folders for fire_incidents
FIRE_RAW_ZIPPED_DIRECTORY = create_folder(folder_name=os.path.join(RAW_ZIPPED_DIRECTORY, "FIRE_INCIDENTS"))
FIRE_RAW_UNZIPPED_DIRECTORY = create_folder(folder_name=os.path.join(RAW_UNZIPPED_DIRECTORY, "FIRE_INCIDENTS"))
FIRE_PROCESSED_ZIPPED_DIRECTORY = create_folder(folder_name=os.path.join(PROCESSED_ZIPPED_DIRECTORY, "FIRE_INCIDENTS"))
FIRE_PROCESSED_UNZIPPED_DIRECTORY = create_folder(folder_name=os.path.join(PROCESSED_UNZIPPED_DIRECTORY, "FIRE_INCIDENTS"))

In [6]:
# Create folders for toronto_weather
WEATHER_RAW_UNZIPPED_DIRECTORY = create_folder(folder_name=os.path.join(RAW_UNZIPPED_DIRECTORY, "TORONTO_WEATHER"))
WEATHER_PROCESSED_ZIPPED_DIRECTORY = create_folder(folder_name=os.path.join(PROCESSED_ZIPPED_DIRECTORY, "TORONTO_WEATHER"))
WEATHER_PROCESSED_UNZIPPED_DIRECTORY = create_folder(folder_name=os.path.join(PROCESSED_UNZIPPED_DIRECTORY, "TORONTO_WEATHER"))

In [7]:
# Create folders for fire_stations
STATIONS_RAW_ZIPPED_DIRECTORY = create_folder(folder_name=os.path.join(RAW_ZIPPED_DIRECTORY, "FIRE_STATIONS"))
STATIONS_RAW_UNZIPPED_DIRECTORY = create_folder(folder_name=os.path.join(RAW_UNZIPPED_DIRECTORY, "FIRE_STATIONS"))
STATIONS_PROCESSED_ZIPPED_DIRECTORY = create_folder(folder_name=os.path.join(PROCESSED_ZIPPED_DIRECTORY, "FIRE_STATIONS"))
STATIONS_PROCESSED_UNZIPPED_DIRECTORY = create_folder(folder_name=os.path.join(PROCESSED_UNZIPPED_DIRECTORY, "FIRE_STATIONS"))

# 3. Pandas to read .csv.bz2 compressed files into DataFrames

## 3a. Merge compressed file to generate final pandas DataFrame
- Read in the 3 .csv.bz2 files and merge them into 1 .csv.bz2 file using pandas DataFrames

In [8]:
# fire incident data
FIRE_INCIDENT_PATH = os.path.join(FIRE_PROCESSED_ZIPPED_DIRECTORY, "2011-2018_Toronto_Fire_Incidents_PROCESSED.csv.bz2")
df_fire = pd.read_csv(FIRE_INCIDENT_PATH,
                      compression="bz2",
                      index_col="INCIDENT_NUM",
                      parse_dates=["DATETIME"])

# for faster queries I will turn the following columns to categorical data type
df_fire["CAD_TYPE"] = pd.Categorical(df_fire["CAD_TYPE"])
df_fire["CAD_CALL_TYPE"] = pd.Categorical(df_fire["CAD_CALL_TYPE"])
df_fire["FINAL_TYPE"] = pd.Categorical(df_fire["FINAL_TYPE"])
df_fire["CALL_SOURCE"] = pd.Categorical(df_fire["CALL_SOURCE"])

# Create the "DATE" column for merging - will be dropped later
df_fire["DATE"] = df_fire["DATETIME"].dt.floor("d")

df_fire.head()

INCIDENT_NUM,DATETIME,MINUTES_ARRIVAL,MINUTES_LEAVE,FIRE_STATION,FIRE_STATION_CLOSEST,LATITUDE,LONGITUDE,CAD_TYPE,CAD_CALL_TYPE,FINAL_TYPE,ALARM_LEVEL,CALL_SOURCE,PERSONS_RESCUED,DATE
F11000010,2011-01-01 00:03:43,6.317,21.267,342.0,342.0,43.679099,-79.461761,Medical,Medical,89 - Other Medical,1,03 - From Ambulance,0.0,2011-01-01
F11000011,2011-01-01 00:03:55,5.117,6.183,131.0,131.0,43.726342,-79.396401,Medical,Carbon Monoxide,89 - Other Medical,1,01 - 911,0.0,2011-01-01
F11000012,2011-01-01 00:05:03,4.517,17.617,324.0,324.0,43.668548,-79.335324,Medical,Medical,89 - Other Medical,1,03 - From Ambulance,0.0,2011-01-01
F11000013,2011-01-01 00:04:46,6.0,9.883,345.0,345.0,43.657123,-79.434313,FIG - Fire - Grass/Rubbish,Emergency Fire,"03 - NO LOSS OUTDOOR fire (exc: Sus.arson,vand...",1,01 - 911,0.0,2011-01-01
F11000014,2011-01-01 00:06:07,4.933,10.133,142.0,142.0,43.75984,-79.516182,FAHR - Alarm Highrise Residential,Emergency Fire,"33 - Human - Malicious intent, prank",1,05 - Telephone from Monitoring Agency,0.0,2011-01-01
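
The `pd.Categorical` conversion above stores each distinct label once and replaces the column values with small integer codes. For columns with few distinct values this usually lowers memory use as well. Here is a small sketch on made-up data (the `calls` DataFrame is hypothetical; only the labels are taken from the output above):

```python
import pandas as pd

# Hypothetical column with few distinct labels repeated many times
calls = pd.DataFrame(
    {"CALL_SOURCE": ["01 - 911", "03 - From Ambulance", "01 - 911"] * 1000}
)

as_text = calls["CALL_SOURCE"]
as_category = calls["CALL_SOURCE"].astype("category")

# deep=True counts the string payloads, not just the pointers
bytes_text = as_text.memory_usage(deep=True)
bytes_category = as_category.memory_usage(deep=True)

print(bytes_text, bytes_category)
print(list(as_category.cat.categories))
```

The exact byte counts depend on the pandas version, so they are not quoted here.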


In [9]:
# fire stations locations
FIRE_STATION_LOCATIONS_PATH = os.path.join(STATIONS_PROCESSED_ZIPPED_DIRECTORY, "Toronto_Fire_Station_Locations.csv.bz2")
df_locations = pd.read_csv(FIRE_STATION_LOCATIONS_PATH,
                      compression="bz2", index_col="INDEX")

# for faster queries I will turn the following columns to categorical data type
df_locations["NAME"] = pd.Categorical(df_locations["NAME"])
df_locations["ADDRESS"] = pd.Categorical(df_locations["ADDRESS"])
df_locations["WARD_NAME"] = pd.Categorical(df_locations["WARD_NAME"])
df_locations["MUN_NAME"] = pd.Categorical(df_locations["MUN_NAME"])

df_locations.head()

INDEX,NAME,ADDRESS,LATITUDE,LONGITUDE,WARD_NAME,MUN_NAME
214,FIRE STATION 214,745 MEADOWVALE RD,43.794219,-79.163605,Scarborough East (44),Scarborough
215,FIRE STATION 215,5318 LAWRENCE AVE E,43.777401,-79.148069,Scarborough East (44),Scarborough
221,FIRE STATION 221,2575 EGLINTON AVE E,43.734799,-79.255066,Scarborough Southwest (35),Scarborough
222,FIRE STATION 222,755 WARDEN AVE,43.720408,-79.284094,Scarborough Southwest (35),Scarborough
223,FIRE STATION 223,116 DORSET RD,43.723965,-79.233264,Scarborough Southwest (36),Scarborough


In [10]:
# toronto weather
TORONTO_WEATHER_PATH = os.path.join(WEATHER_PROCESSED_ZIPPED_DIRECTORY, "2010-2020_Toronto_Weather.csv.bz2")
df_weather = pd.read_csv(TORONTO_WEATHER_PATH,
                      compression="bz2", parse_dates=["DATE"],
                      index_col="DATE")
df_weather

DATE,MAX_TEMP,MIN_TEMP,MEAN_TEMP,HDD,CDD,RAIN_MM,PRECIP_MM,SNOW_CM
2010-01-01,1.9,-9.9,-3.00,21.00,0.0,0.0,0.63,0.0
2010-01-02,-9.7,-18.5,-14.05,32.05,0.0,0.0,0.33,1.0
2010-01-03,-9.3,-17.0,-12.90,30.90,0.0,0.0,1.90,1.0
2010-01-04,-6.7,-13.5,-9.85,27.85,0.0,0.0,0.27,3.5
2010-01-05,-3.6,-12.5,-7.65,25.65,0.0,0.0,1.30,4.5
...,...,...,...,...,...,...,...,...
2020-12-27,,,,,,0.0,0.00,0.0
2020-12-28,,,,,,0.0,0.00,0.0
2020-12-29,,,,,,0.0,0.00,0.0
2020-12-30,,,,,,0.0,0.00,0.0


In [11]:
# Merge the fire incidents and toronto weather dataframes together
df_merge = df_fire.merge(df_weather,
                         how="left",
                         right_index=True,
                         left_on="DATE",
                         suffixes=("", "_WEATHER"))

# drop the "DATE" column as it is no longer needed
df_merge = df_merge.drop(columns=["DATE"])

df_merge.head()

INCIDENT_NUM,DATETIME,MINUTES_ARRIVAL,MINUTES_LEAVE,FIRE_STATION,FIRE_STATION_CLOSEST,LATITUDE,LONGITUDE,CAD_TYPE,CAD_CALL_TYPE,FINAL_TYPE,...,CALL_SOURCE,PERSONS_RESCUED,MAX_TEMP,MIN_TEMP,MEAN_TEMP,HDD,CDD,RAIN_MM,PRECIP_MM,SNOW_CM
F11000010,2011-01-01 00:03:43,6.317,21.267,342.0,342.0,43.679099,-79.461761,Medical,Medical,89 - Other Medical,...,03 - From Ambulance,0.0,11.5,0.9,6.4,11.6,0.0,3.7,8.7,0.0
F11000011,2011-01-01 00:03:55,5.117,6.183,131.0,131.0,43.726342,-79.396401,Medical,Carbon Monoxide,89 - Other Medical,...,01 - 911,0.0,11.5,0.9,6.4,11.6,0.0,3.7,8.7,0.0
F11000012,2011-01-01 00:05:03,4.517,17.617,324.0,324.0,43.668548,-79.335324,Medical,Medical,89 - Other Medical,...,03 - From Ambulance,0.0,11.5,0.9,6.4,11.6,0.0,3.7,8.7,0.0
F11000013,2011-01-01 00:04:46,6.0,9.883,345.0,345.0,43.657123,-79.434313,FIG - Fire - Grass/Rubbish,Emergency Fire,"03 - NO LOSS OUTDOOR fire (exc: Sus.arson,vand...",...,01 - 911,0.0,11.5,0.9,6.4,11.6,0.0,3.7,8.7,0.0
F11000014,2011-01-01 00:06:07,4.933,10.133,142.0,142.0,43.75984,-79.516182,FAHR - Alarm Highrise Residential,Emergency Fire,"33 - Human - Malicious intent, prank",...,05 - Telephone from Monitoring Agency,0.0,11.5,0.9,6.4,11.6,0.0,3.7,8.7,0.0
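
The weather merge above joins many incidents to one daily weather row by flooring each `DATETIME` to its date. Here is a toy sketch of the same pattern, assuming made-up incident numbers and temperatures. The `validate="many_to_one"` argument is an addition that is not used in the notebook; it asks pandas to raise an error if a date appears more than once in the weather index:

```python
import pandas as pd

# Hypothetical incidents, indexed like df_fire
incidents = pd.DataFrame(
    {"DATETIME": pd.to_datetime(["2011-01-01 00:03:43",
                                 "2011-01-01 23:59:00",
                                 "2011-01-02 08:15:00"])},
    index=pd.Index(["F1", "F2", "F3"], name="INCIDENT_NUM"),
)

# Hypothetical daily weather, indexed by DATE like df_weather
weather = pd.DataFrame(
    {"MEAN_TEMP": [6.4, -1.0]},
    index=pd.DatetimeIndex(["2011-01-01", "2011-01-02"], name="DATE"),
)

# Temporary key: the timestamp truncated to midnight of the same day
incidents["DATE"] = incidents["DATETIME"].dt.floor("D")

merged = incidents.merge(
    weather,
    how="left",               # keep every incident, even without weather
    left_on="DATE",
    right_index=True,
    validate="many_to_one",   # fail loudly on duplicate weather dates
).drop(columns=["DATE"])

print(merged)
```

With `how="left"` the row count of the incident table is preserved, so a check such as `len(merged) == len(incidents)` is a cheap sanity test after the merge.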


In [12]:
# Merge the above dataframe with Fire Station Locations
df_merge = df_merge.merge(df_locations,
                          how="left",
                          right_index=True,
                          left_on="FIRE_STATION",
                          suffixes=("", "_STATION"))
df_merge.head()

INCIDENT_NUM,DATETIME,MINUTES_ARRIVAL,MINUTES_LEAVE,FIRE_STATION,FIRE_STATION_CLOSEST,LATITUDE,LONGITUDE,CAD_TYPE,CAD_CALL_TYPE,FINAL_TYPE,...,CDD,RAIN_MM,PRECIP_MM,SNOW_CM,NAME,ADDRESS,LATITUDE_STATION,LONGITUDE_STATION,WARD_NAME,MUN_NAME
F11000010,2011-01-01 00:03:43,6.317,21.267,342.0,342.0,43.679099,-79.461761,Medical,Medical,89 - Other Medical,...,0.0,3.7,8.7,0.0,FIRE STATION 342,106 ASCOT AVE,43.679375,-79.44863,Davenport (17),former Toronto
F11000011,2011-01-01 00:03:55,5.117,6.183,131.0,131.0,43.726342,-79.396401,Medical,Carbon Monoxide,89 - Other Medical,...,0.0,3.7,8.7,0.0,FIRE STATION 131,3135 YONGE ST,43.726226,-79.402161,Don Valley West (25),former Toronto
F11000012,2011-01-01 00:05:03,4.517,17.617,324.0,324.0,43.668548,-79.335324,Medical,Medical,89 - Other Medical,...,0.0,3.7,8.7,0.0,FIRE STATION 324,840 GERRARD ST E,43.667767,-79.343518,Toronto-Danforth (30),former Toronto
F11000013,2011-01-01 00:04:46,6.0,9.883,345.0,345.0,43.657123,-79.434313,FIG - Fire - Grass/Rubbish,Emergency Fire,"03 - NO LOSS OUTDOOR fire (exc: Sus.arson,vand...",...,0.0,3.7,8.7,0.0,FIRE STATION 345,1287 DUFFERIN ST,43.667401,-79.438153,Davenport (18),former Toronto
F11000014,2011-01-01 00:06:07,4.933,10.133,142.0,142.0,43.75984,-79.516182,FAHR - Alarm Highrise Residential,Emergency Fire,"33 - Human - Malicious intent, prank",...,0.0,3.7,8.7,0.0,FIRE STATION 142,2753 JANE ST,43.745991,-79.514374,York Centre (9),North York
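
Both DataFrames in the merge above have `LATITUDE` and `LONGITUDE` columns, which is why `suffixes=("", "_STATION")` is passed: the incident columns keep their names and only the station columns are renamed. Here is a toy sketch, assuming a made-up station number 999 that has no match in the station table:

```python
import pandas as pd

# Hypothetical incidents; FIRE_STATION is float, as in df_fire
incidents = pd.DataFrame(
    {"FIRE_STATION": [342.0, 131.0, 999.0],
     "LATITUDE": [43.679099, 43.726342, 43.700000],
     "LONGITUDE": [-79.461761, -79.396401, -79.400000]},
    index=pd.Index(["F1", "F2", "F3"], name="INCIDENT_NUM"),
)

# Station table indexed by station number, like df_locations
stations = pd.DataFrame(
    {"NAME": ["FIRE STATION 342", "FIRE STATION 131"],
     "LATITUDE": [43.679375, 43.726226],
     "LONGITUDE": [-79.448630, -79.402161]},
    index=pd.Index([342, 131], name="INDEX"),
)

merged = incidents.merge(
    stations,
    how="left",
    left_on="FIRE_STATION",
    right_index=True,
    suffixes=("", "_STATION"),  # only the right-hand duplicates are renamed
)

print(list(merged.columns))
print(merged["NAME"].isna().tolist())
```

Incidents whose station number is missing from the station table keep their row and get nulls in the station columns, which matches the single null seen in `NAME`, `ADDRESS` and the other station columns later in the notebook.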


In [13]:
# Reorder the Columns in a more favourable order
df_merge = df_merge.loc[:, ['DATETIME', 'MINUTES_ARRIVAL',
                            'MINUTES_LEAVE', 'FIRE_STATION',
                            'FIRE_STATION_CLOSEST', 'NAME',
                            'ADDRESS', 'LATITUDE_STATION',
                            'LONGITUDE_STATION', 'WARD_NAME',
                            'MUN_NAME', 'CAD_TYPE',
                            'CAD_CALL_TYPE', 'FINAL_TYPE',
                            'ALARM_LEVEL', 'CALL_SOURCE',
                            'PERSONS_RESCUED', 'LATITUDE',
                            'LONGITUDE', 'MAX_TEMP',
                            'MIN_TEMP', 'MEAN_TEMP',
                            'HDD', 'CDD', 'RAIN_MM',
                            'PRECIP_MM', 'SNOW_CM']]
df_merge.head()

INCIDENT_NUM,DATETIME,MINUTES_ARRIVAL,MINUTES_LEAVE,FIRE_STATION,FIRE_STATION_CLOSEST,NAME,ADDRESS,LATITUDE_STATION,LONGITUDE_STATION,WARD_NAME,...,LATITUDE,LONGITUDE,MAX_TEMP,MIN_TEMP,MEAN_TEMP,HDD,CDD,RAIN_MM,PRECIP_MM,SNOW_CM
F11000010,2011-01-01 00:03:43,6.317,21.267,342.0,342.0,FIRE STATION 342,106 ASCOT AVE,43.679375,-79.44863,Davenport (17),...,43.679099,-79.461761,11.5,0.9,6.4,11.6,0.0,3.7,8.7,0.0
F11000011,2011-01-01 00:03:55,5.117,6.183,131.0,131.0,FIRE STATION 131,3135 YONGE ST,43.726226,-79.402161,Don Valley West (25),...,43.726342,-79.396401,11.5,0.9,6.4,11.6,0.0,3.7,8.7,0.0
F11000012,2011-01-01 00:05:03,4.517,17.617,324.0,324.0,FIRE STATION 324,840 GERRARD ST E,43.667767,-79.343518,Toronto-Danforth (30),...,43.668548,-79.335324,11.5,0.9,6.4,11.6,0.0,3.7,8.7,0.0
F11000013,2011-01-01 00:04:46,6.0,9.883,345.0,345.0,FIRE STATION 345,1287 DUFFERIN ST,43.667401,-79.438153,Davenport (18),...,43.657123,-79.434313,11.5,0.9,6.4,11.6,0.0,3.7,8.7,0.0
F11000014,2011-01-01 00:06:07,4.933,10.133,142.0,142.0,FIRE STATION 142,2753 JANE ST,43.745991,-79.514374,York Centre (9),...,43.75984,-79.516182,11.5,0.9,6.4,11.6,0.0,3.7,8.7,0.0


In [14]:
# get some info from the DataFrame
df_merge.info()

<class 'pandas.core.frame.DataFrame'>
Index: 975161 entries, F11000010 to F18139242
Data columns (total 27 columns):
 #   Column                Non-Null Count   Dtype         
---  ------                --------------   -----         
 0   DATETIME              975161 non-null  datetime64[ns]
 1   MINUTES_ARRIVAL       951681 non-null  float64       
 2   MINUTES_LEAVE         951677 non-null  float64       
 3   FIRE_STATION          975160 non-null  float64       
 4   FIRE_STATION_CLOSEST  903971 non-null  float64       
 5   NAME                  975160 non-null  category      
 6   ADDRESS               975160 non-null  category      
 7   LATITUDE_STATION      975160 non-null  float64       
 8   LONGITUDE_STATION     975160 non-null  float64       
 9   WARD_NAME             975160 non-null  category      
 10  MUN_NAME              975160 non-null  category      
 11  CAD_TYPE              975161 non-null  category      
 12  CAD_CALL_TYPE         975161 non-null  category     

In [15]:
# write this merged DataFrame to a csv.bz2 file
PATH_MERGED_CSV_BZ2 = os.path.join(FIRE_PROCESSED_ZIPPED_DIRECTORY, "FINAL_DATASET.csv.bz2")
df_merge.to_csv(PATH_MERGED_CSV_BZ2, compression='bz2')

## 3b. Metadata of the Final DataFrame<sup>[6,7,8,9,10]</sup>
- columns and their descriptions are included and written to a .csv file

In [16]:
# here is the metadata for the columns
metadata_dict = {
    "INCIDENT_NUM" : "Toronto Fire Services (TFS) incident number.  Used as index for the DataFrame because it is unique for each call.",
    "DATETIME" : "Year, Month, Day, Hour, Minute, Second of when TFS was notified of the incident (alarm).",
    "MINUTES_ARRIVAL" : "Minutes it took for the first unit to arrive (after alarm).",
    "MINUTES_LEAVE" : "Minutes it took for the first unit to leave (after arrival).",
    "FIRE_STATION" : "Number of TFS Station where incident occurred.",
    "FIRE_STATION_CLOSEST" : "Number of closest (by smallest Haversine formula distance calculation) TFS Station where incident occurred.",
    "NAME" : "Name of column FIRE_STATION TFS Fire Station.",
    "ADDRESS" : "Address of column FIRE_STATION TFS Fire Station.",
    "LATITUDE_STATION" : "Latitude (Decimal Degrees) of column FIRE_STATION TFS Fire Station.",
    "LONGITUDE_STATION" : "Longitude (Decimal Degrees) of column FIRE_STATION TFS Fire Station.",
    "WARD_NAME" : "Municipality Ward Name of column FIRE_STATION TFS Fire Station.",
    "MUN_NAME" : "Name of Toronto / GTA Municipality of column FIRE_STATION TFS Fire Station.",
    "CAD_TYPE" : "First event type in CAD system of this incident.",
    "CAD_CALL_TYPE" : "First call type in CAD system of this incident. Call type is a group of event types.",
    "FINAL_TYPE" : "Final incident type.",
    "ALARM_LEVEL" : "Alarm level of the event.",
    "CALL_SOURCE" : "Source of the call to TFS.",
    "PERSONS_RESCUED" : "Number of persons rescued, if any.",
    "LATITUDE" : "Latitude (Decimal Degrees) of nearest major / minor intersection of incident.",
    "LONGITUDE" : "Longitude (Decimal Degrees) of nearest major / minor intersection of incident.",
    "MAX_TEMP" : "Maximum Temperature (Celsius) recorded across 3 Toronto Weather Stations for a given day.",
    "MIN_TEMP" : "Minimum Temperature (Celsius) recorded across 3 Toronto Weather Stations for a given day.",
    "MEAN_TEMP" : "Average Temperature (Celsius) recorded across 3 Toronto Weather Stations for a given day.",
    "HDD" : "Heating Degree Day (HDD, Celsius) recorded across 3 Toronto Weather Stations for a given day.",
    "CDD" : "Cooling Degree Day (CDD, Celsius) recorded across 3 Toronto Weather Stations for a given day.",
    "RAIN_MM" : "Measured Rain (mm / day) recorded across 3 Toronto Weather Stations for a given day.",
    "PRECIP_MM" : "Measured Precipiation (mm / day) recorded across 3 Toronto Weather Stations for a given day.",
    "SNOW_CM" : "Snow on Ground (cm) recorded across 3 Toronto Weather Stations for a given day."
}

# create a metadata DataFrame
df_metadata = pd.DataFrame(metadata_dict.items(),
                           columns=["COLUMN_NAME", "COLUMN_DESCRIPTION"]
                          ).set_index("COLUMN_NAME")

# display it
with pd.option_context('display.max_colwidth', 300):
    display(df_metadata)

COLUMN_NAME,COLUMN_DESCRIPTION
INCIDENT_NUM,Toronto Fire Services (TFS) incident number. Used as index for the DataFrame because it is unique for each call.
DATETIME,"Year, Month, Day, Hour, Minute, Second of when TFS was notified of the incident (alarm)."
MINUTES_ARRIVAL,Minutes it took for the first unit to arrive (after alarm).
MINUTES_LEAVE,Minutes it took for the first unit to leave (after arrival).
FIRE_STATION,Number of TFS Station where incident occurred.
FIRE_STATION_CLOSEST,Number of closest (by smallest Haversine formula distance calculation) TFS Station where incident occurred.
NAME,Name of column FIRE_STATION TFS Fire Station.
ADDRESS,Address of column FIRE_STATION TFS Fire Station.
LATITUDE_STATION,Latitude (Decimal Degrees) of column FIRE_STATION TFS Fire Station.
LONGITUDE_STATION,Longitude (Decimal Degrees) of column FIRE_STATION TFS Fire Station.
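
The `FIRE_STATION_CLOSEST` description above refers to the Haversine formula, and `math` is imported in section 1 for it, but the calculation itself is not part of this notebook. Here is a minimal sketch, assuming a mean Earth radius of 6371 km. The function names `haversine_km` and `closest_station` are hypothetical, and the coordinates are copied from the tables in section 3a:

```python
from math import radians, cos, sin, asin, sqrt

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def closest_station(lat, lon, stations):
    """Return the station number with the smallest Haversine distance."""
    return min(stations, key=lambda num: haversine_km(lat, lon, *stations[num]))


# Station number -> (latitude, longitude), from the station table
stations = {
    342: (43.679375, -79.448630),
    131: (43.726226, -79.402161),
    345: (43.667401, -79.438153),
}

# Incident F11000010 from the fire incident table
nearest = closest_station(43.679099, -79.461761, stations)
distance = haversine_km(43.679099, -79.461761, *stations[nearest])
print(nearest, round(distance, 2))
```

For this incident the sketch picks station 342, which agrees with the `FIRE_STATION_CLOSEST` value shown for F11000010.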


In [17]:
# write this metadata to the folders to explain what this data represents
df_metadata.to_csv(
    os.path.join(FIRE_PROCESSED_UNZIPPED_DIRECTORY, "FINAL_DATASET_METADATA.csv")
)
df_metadata.to_csv(
    os.path.join(FIRE_PROCESSED_ZIPPED_DIRECTORY, "FINAL_DATASET_METADATA.csv")
)

# 4. TFS Fire Incidents, Toronto Historical Weather and TFS Fire Station Locations DataFrame

In [18]:
# read in the metadata from .csv into memory
# use this metadata to explain the columns
df_metadata = pd.read_csv(
    os.path.join(FIRE_PROCESSED_ZIPPED_DIRECTORY, "FINAL_DATASET_METADATA.csv"),
    index_col="COLUMN_NAME")

# display it
with pd.option_context('display.max_colwidth', 300):
    display(df_metadata)

COLUMN_NAME,COLUMN_DESCRIPTION
INCIDENT_NUM,Toronto Fire Services (TFS) incident number. Used as index for the DataFrame because it is unique for each call.
DATETIME,"Year, Month, Day, Hour, Minute, Second of when TFS was notified of the incident (alarm)."
MINUTES_ARRIVAL,Minutes it took for the first unit to arrive (after alarm).
MINUTES_LEAVE,Minutes it took for the first unit to leave (after arrival).
FIRE_STATION,Number of TFS Station where incident occurred.
FIRE_STATION_CLOSEST,Number of closest (by smallest Haversine formula distance calculation) TFS Station where incident occurred.
NAME,Name of column FIRE_STATION TFS Fire Station.
ADDRESS,Address of column FIRE_STATION TFS Fire Station.
LATITUDE_STATION,Latitude (Decimal Degrees) of column FIRE_STATION TFS Fire Station.
LONGITUDE_STATION,Longitude (Decimal Degrees) of column FIRE_STATION TFS Fire Station.


In [19]:
# read the merged DataFrame from .csv.bz2 file into DataFrame
PATH_MERGED_CSV_BZ2 = os.path.join(FIRE_PROCESSED_ZIPPED_DIRECTORY, "FINAL_DATASET.csv.bz2")
df_total = pd.read_csv(PATH_MERGED_CSV_BZ2,
                       compression='bz2', index_col="INCIDENT_NUM", parse_dates=["DATETIME"])

# make the columns categorical (for faster queries)
df_total["CAD_TYPE"] = pd.Categorical(df_total["CAD_TYPE"])
df_total["CAD_CALL_TYPE"] = pd.Categorical(df_total["CAD_CALL_TYPE"])
df_total["FINAL_TYPE"] = pd.Categorical(df_total["FINAL_TYPE"])
df_total["CALL_SOURCE"] = pd.Categorical(df_total["CALL_SOURCE"])
df_total["NAME"] = pd.Categorical(df_total["NAME"])
df_total["ADDRESS"] = pd.Categorical(df_total["ADDRESS"])
df_total["WARD_NAME"] = pd.Categorical(df_total["WARD_NAME"])
df_total["MUN_NAME"] = pd.Categorical(df_total["MUN_NAME"])

# display it
with pd.option_context('display.max_columns', None):
    display(df_total.head())

INCIDENT_NUM,DATETIME,MINUTES_ARRIVAL,MINUTES_LEAVE,FIRE_STATION,FIRE_STATION_CLOSEST,NAME,ADDRESS,LATITUDE_STATION,LONGITUDE_STATION,WARD_NAME,MUN_NAME,CAD_TYPE,CAD_CALL_TYPE,FINAL_TYPE,ALARM_LEVEL,CALL_SOURCE,PERSONS_RESCUED,LATITUDE,LONGITUDE,MAX_TEMP,MIN_TEMP,MEAN_TEMP,HDD,CDD,RAIN_MM,PRECIP_MM,SNOW_CM
F11000010,2011-01-01 00:03:43,6.317,21.267,342.0,342.0,FIRE STATION 342,106 ASCOT AVE,43.679375,-79.44863,Davenport (17),former Toronto,Medical,Medical,89 - Other Medical,1,03 - From Ambulance,0.0,43.679099,-79.461761,11.5,0.9,6.4,11.6,0.0,3.7,8.7,0.0
F11000011,2011-01-01 00:03:55,5.117,6.183,131.0,131.0,FIRE STATION 131,3135 YONGE ST,43.726226,-79.402161,Don Valley West (25),former Toronto,Medical,Carbon Monoxide,89 - Other Medical,1,01 - 911,0.0,43.726342,-79.396401,11.5,0.9,6.4,11.6,0.0,3.7,8.7,0.0
F11000012,2011-01-01 00:05:03,4.517,17.617,324.0,324.0,FIRE STATION 324,840 GERRARD ST E,43.667767,-79.343518,Toronto-Danforth (30),former Toronto,Medical,Medical,89 - Other Medical,1,03 - From Ambulance,0.0,43.668548,-79.335324,11.5,0.9,6.4,11.6,0.0,3.7,8.7,0.0
F11000013,2011-01-01 00:04:46,6.0,9.883,345.0,345.0,FIRE STATION 345,1287 DUFFERIN ST,43.667401,-79.438153,Davenport (18),former Toronto,FIG - Fire - Grass/Rubbish,Emergency Fire,"03 - NO LOSS OUTDOOR fire (exc: Sus.arson,vand...",1,01 - 911,0.0,43.657123,-79.434313,11.5,0.9,6.4,11.6,0.0,3.7,8.7,0.0
F11000014,2011-01-01 00:06:07,4.933,10.133,142.0,142.0,FIRE STATION 142,2753 JANE ST,43.745991,-79.514374,York Centre (9),North York,FAHR - Alarm Highrise Residential,Emergency Fire,"33 - Human - Malicious intent, prank",1,05 - Telephone from Monitoring Agency,0.0,43.75984,-79.516182,11.5,0.9,6.4,11.6,0.0,3.7,8.7,0.0
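
The categorical columns have to be converted again after loading because a CSV file stores only text, not pandas dtypes. As an alternative to the column-by-column conversion above, here is a sketch that passes `dtype` to `pd.read_csv` so the columns arrive as categories directly. The file name `toy.csv.bz2` and the three rows are made up for the demonstration:

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame(
    {"DATETIME": pd.to_datetime(["2011-01-01 00:03:43",
                                 "2011-01-01 00:03:55",
                                 "2011-01-01 00:04:46"]),
     "CAD_TYPE": pd.Categorical(["Medical", "Medical",
                                 "FIG - Fire - Grass/Rubbish"])},
    index=pd.Index(["F11000010", "F11000011", "F11000013"],
                   name="INCIDENT_NUM"),
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.csv.bz2")  # hypothetical file name
    toy.to_csv(path, compression="bz2")

    # Without dtype, the category information is lost on the round trip
    plain = pd.read_csv(path, compression="bz2",
                        index_col="INCIDENT_NUM", parse_dates=["DATETIME"])

    # With dtype, the column is parsed straight into a categorical
    typed = pd.read_csv(path, compression="bz2",
                        index_col="INCIDENT_NUM", parse_dates=["DATETIME"],
                        dtype={"CAD_TYPE": "category"})

print(plain["CAD_TYPE"].dtype, typed["CAD_TYPE"].dtype)
```

Both approaches end with the same categorical columns, so the choice is mostly a matter of taste.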


In [20]:
# percentage of nulls in each column
# most columns have ~2.5% nulls or less; FIRE_STATION_CLOSEST, LATITUDE and LONGITUDE have ~7.3%
(df_total.isnull().sum() / len(df_total)) * 100

DATETIME                0.000000
MINUTES_ARRIVAL         2.407808
MINUTES_LEAVE           2.408218
FIRE_STATION            0.000103
FIRE_STATION_CLOSEST    7.300333
NAME                    0.000103
ADDRESS                 0.000103
LATITUDE_STATION        0.000103
LONGITUDE_STATION       0.000103
WARD_NAME               0.000103
MUN_NAME                0.000103
CAD_TYPE                0.000000
CAD_CALL_TYPE           0.000000
FINAL_TYPE              0.004922
ALARM_LEVEL             0.000000
CALL_SOURCE             0.007178
PERSONS_RESCUED         0.006973
LATITUDE                7.300333
LONGITUDE               7.300333
MAX_TEMP                0.000000
MIN_TEMP                0.000000
MEAN_TEMP               0.000000
HDD                     0.000000
CDD                     0.000000
RAIN_MM                 0.000000
PRECIP_MM               0.000000
SNOW_CM                 0.000000
dtype: float64
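
The null percentage above divides the null count by the number of rows. An equivalent and slightly shorter form takes the mean of the boolean null mask, since `True` counts as 1. Here is a sketch on a made-up four-row DataFrame (the `toy` data is hypothetical):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {"MINUTES_ARRIVAL": [6.317, np.nan, 4.517, 6.0],
     "ALARM_LEVEL": [1, 1, 1, 1]}
)

# Form used in the notebook: null count divided by row count
pct_from_sum = (toy.isnull().sum() / len(toy)) * 100

# Equivalent form: mean of the boolean mask
pct_from_mean = toy.isnull().mean() * 100

# Largest share of nulls first, to spot the problem columns quickly
print(pct_from_mean.sort_values(ascending=False))
```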

In [21]:
# get information about the completed DataSet
df_total.info()

<class 'pandas.core.frame.DataFrame'>
Index: 975161 entries, F11000010 to F18139242
Data columns (total 27 columns):
 #   Column                Non-Null Count   Dtype         
---  ------                --------------   -----         
 0   DATETIME              975161 non-null  datetime64[ns]
 1   MINUTES_ARRIVAL       951681 non-null  float64       
 2   MINUTES_LEAVE         951677 non-null  float64       
 3   FIRE_STATION          975160 non-null  float64       
 4   FIRE_STATION_CLOSEST  903971 non-null  float64       
 5   NAME                  975160 non-null  category      
 6   ADDRESS               975160 non-null  category      
 7   LATITUDE_STATION      975160 non-null  float64       
 8   LONGITUDE_STATION     975160 non-null  float64       
 9   WARD_NAME             975160 non-null  category      
 10  MUN_NAME              975160 non-null  category      
 11  CAD_TYPE              975161 non-null  category      
 12  CAD_CALL_TYPE         975161 non-null  category     

# 5. Jupyter Notebook References

[1] "Python Documentation."  *Python Software Foundation*.  [Online](https://docs.python.org/).  [Accessed August 04, 2020]

[2] G. Niemeyer.  "dateutil - powerful extensions to datetime."  *dateutil*.  [Online](https://github.com/dateutil/dateutil).  [Accessed August 04, 2020]

[3] "pandas."  *PyData*.  [Online](https://pandas.pydata.org/).  [Accessed August 04, 2020]

[4] "NumPy - The fundamental package for scientific computing with Python."  *NumPy*.  [Online](https://numpy.org/).  [Accessed August 04, 2020]

[5] "Matplotlib:  Visualization with Python."  *The Matplotlib Development team*.  [Online](https://matplotlib.org/).  [Accessed August 04, 2020]

[6] "City of Toronto Open Data Portal."  *City of Toronto*.  [Online](https://open.toronto.ca/).  [Accessed August 04, 2020]

[7] "Fire Services Basic Incident Details."  *City of Toronto*.  [Online](https://open.toronto.ca/dataset/fire-services-basic-incident-details/).  [Accessed August 04, 2020]

[8] "Historical Climate Data."  *Government of Canada*.  [Online](https://climate.weather.gc.ca/).  [Accessed August 04, 2020]

[9] "URL based procedure to automatically download data in bulk from Climate Website."  *Government of Canada*.  [Online](ftp://client_climate@ftp.tor.ec.gc.ca/Pub/Get_More_Data_Plus_de_donnees/Readme.txt).  [Accessed August 04, 2020]

[10] "Fire Station Locations."  *City of Toronto*.  [Online](https://open.toronto.ca/dataset/fire-station-locations/).  [Accessed August 04, 2020]