# Get Missing Latitudes and Longitudes

In [1]:
%load_ext lab_black
%load_ext autoreload
%autoreload 2

In [2]:
import os
from datetime import datetime
from glob import glob
from typing import Union

import pandas as pd
import snowflake.connector
from dotenv import find_dotenv, load_dotenv

Import the custom geocoding function (it is explained in detail later in this notebook)

In [3]:
%aimport src.geopy_helpers_v2
from src.geopy_helpers_v2 import geocode_missing_lat_lon

## About

This notebook retrieves the `latitude` and `longitude` co-ordinates of inspected establishments for which these two columns are missing.

These co-ordinates are needed in order to aggregate statistics (such as crimes committed, population, etc.) for the neighbourhood containing each establishment that was inspected (see `4_get_stats_by_neighbourhood.ipynb`). These aggregated counts could then be used as features by an ML model.

## User Inputs

In [4]:
geocoded_fname_prefix = "filtered_transformed_filledmissing_data"

ci_run = "no"

In [5]:
if ci_run == "no":
    load_dotenv(find_dotenv())

In [6]:
connector_dict = dict(
    account=os.getenv("SNOWFLAKE_ACCOUNT"),
    user=os.getenv("SNOWFLAKE_USER"),
    password=os.getenv("SNOWFLAKE_PASS"),
    database="dinesafe",
    schema="public",
    warehouse=os.getenv("SNOWFLAKE_WAREHOUSE"),
    role="sysadmin",
)

In [7]:
def show_sql_df(
    query: str,
    cursor,
    cnx=None,
    table_output: bool = False,
    use_manual_approach: bool = False,
    show_df: bool = True,
) -> Union[None, pd.DataFrame]:
    """Run a SQL query and optionally return (and display) its output as a DataFrame."""
    cursor.execute(query)
    if cnx:
        cnx.commit()
    if table_output:
        if use_manual_approach:
            colnames = [cdesc[0].lower() for cdesc in cursor.description]
            cur_fetched = cursor.fetchall()
            if cur_fetched:
                df_query_output = pd.DataFrame.from_records(
                    cur_fetched, columns=colnames, index=range(len(cur_fetched))
                )
                if show_df:
                    display(df_query_output)
                return df_query_output
        else:
            df_query_output = cursor.fetch_pandas_all().reset_index(drop=True)
            if show_df:
                display(df_query_output)
            return df_query_output
    return pd.DataFrame()

## Connect to SQL Database

In [8]:
conn = snowflake.connector.connect(**connector_dict)
cur = conn.cursor()

## Geocoding Latitude and Longitude

### Add latitude and longitude to Filtered and Aggregated Data

We'll load the transformed (filtered and aggregated) data (which contains one inspection per row)

In [9]:
%%time
# glob() returns paths in arbitrary order, so sort to pick the latest timestamped file
df = pd.read_csv(sorted(glob("data/processed/filtered_transformed_data__*.csv"))[-1])
df

CPU times: user 421 ms, sys: 19.4 ms, total: 440 ms
Wall time: 440 ms


Unnamed: 0,establishment_id,establishmenttype,establishment_address,inspection_id,inspection_date,establishment_status,infractions_summary,num_significant,num_crucial,num_minor,...,num_charges_withdrawn,num_pending,num_cancelled,num_conviction_suspended_sentence,num_conviction_fined,num_conviction_fined_order_to_close_by_court,num_charges_dismissed,num_null.1,num_conviction_probationary_order,is_infraction
0,1222579,Food Take Out,870 MARKHAM RD,102810896,2012-08-21,Pass,,0,0,0,...,0,0,0,0,0,0,0,1,0,0
1,1222579,Food Take Out,870 MARKHAM RD,103015259,2013-06-27,Pass,,0,0,0,...,0,0,0,0,0,0,0,1,0,0
2,1222579,Food Take Out,870 MARKHAM RD,103133558,2013-12-20,Pass,Food handler fail to wear headgear. Operator f...,0,0,6,...,0,0,0,0,0,0,0,1,0,0
3,1222579,Food Take Out,870 MARKHAM RD,103329697,2014-09-09,Pass,FAIL TO PROVIDE TOWELS IN FOOD PREPARATION ARE...,3,0,12,...,0,0,0,0,0,0,0,2,0,1
4,1222579,Food Take Out,870 MARKHAM RD,103420091,2015-01-08,Pass,Operator fail to properly wash equipment. Oper...,3,0,6,...,0,0,0,0,0,0,0,2,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
205748,10690616,Food Take Out,4698 YONGE ST,104594530,2019-10-23,Pass,,0,0,0,...,0,0,0,0,0,0,0,1,0,0
205749,10690642,Bake Shop,20 ST PATRICK ST,104594681,2019-10-23,Pass,FAIL TO PROVIDE THERMOMETER IN REFRIGERATION E...,1,0,0,...,0,0,0,0,0,0,0,1,0,1
205750,10690660,Restaurant,549 BLOOR ST W,104594800,2019-10-23,Pass,Use pallets not of readily cleanable design - ...,1,0,1,...,0,0,0,0,0,0,0,2,0,1
205751,10690679,Food Take Out,1175 ST CLAIR AVE W,104594954,2019-10-23,Pass,SANITIZE UTENSILS IN WATER FOR LESS THAN 45 SE...,1,0,0,...,0,0,0,0,0,0,0,1,0,1


**Notes**
1. For each grouping of `establishment_id`, `establishmenttype` and `establishment_address`, we have the aggregated number of each type of infraction and the combined text of the details of all infractions. This dataset was created using a SQL `GROUP BY` over these three columns in `2_sql_filter_transform.ipynb`.
2. This dataset does not contain the `latitude` and `longitude` column for each inspection. We'll query the SQL database to get the latitude and longitude for each grouping of `establishment_id`, `establishmenttype` and `establishment_address`. We can then merge that query output with this loaded dataset, on these three columns, and get the corresponding latitude and longitude for each row (inspection) in this data.

Write a SQL query to get the latitude and longitude for every establishment address in the database

In [10]:
%%time
query = """
        SELECT establishment_id,
               establishmenttype,
               establishment_address,
               MAX(latitude) AS latitude,
               MAX(longitude) AS longitude
        FROM inspections
        GROUP BY establishment_id, establishmenttype, establishment_address
        """
df_query = show_sql_df(query, cur, None, True, False)
df_query.columns = df_query.columns.str.lower()
df_query

Unnamed: 0,ESTABLISHMENT_ID,ESTABLISHMENTTYPE,ESTABLISHMENT_ADDRESS,LATITUDE,LONGITUDE
0,9390938,Restaurant,1910 YONGE ST,43.698466,-79.396994
1,9392260,Restaurant,1346 ST CLAIR AVE W,43.676875,-79.449036
2,9392365,Restaurant,202 DAVENPORT RD,43.675116,-79.396114
3,9393127,Restaurant,1448 LAWRENCE AVE E,43.741813,-79.313108
4,9393880,Meat Processing Plant,44 WELLESWORTH DR,43.666829,-79.578502
...,...,...,...,...,...
30285,10690477,Food Take Out,1071 KING ST W,43.640760,-79.417007
30286,10690535,Food Caterer,1841 LAWRENCE AVE E,43.743333,-79.303036
30287,10690660,Restaurant,549 BLOOR ST W,43.665207,-79.410220
30288,10690679,Food Take Out,1175 ST CLAIR AVE W,43.677663,-79.443365


CPU times: user 869 ms, sys: 102 ms, total: 971 ms
Wall time: 1.65 s


Unnamed: 0,establishment_id,establishmenttype,establishment_address,latitude,longitude
0,9390938,Restaurant,1910 YONGE ST,43.698466,-79.396994
1,9392260,Restaurant,1346 ST CLAIR AVE W,43.676875,-79.449036
2,9392365,Restaurant,202 DAVENPORT RD,43.675116,-79.396114
3,9393127,Restaurant,1448 LAWRENCE AVE E,43.741813,-79.313108
4,9393880,Meat Processing Plant,44 WELLESWORTH DR,43.666829,-79.578502
...,...,...,...,...,...
30285,10690477,Food Take Out,1071 KING ST W,43.640760,-79.417007
30286,10690535,Food Caterer,1841 LAWRENCE AVE E,43.743333,-79.303036
30287,10690660,Restaurant,549 BLOOR ST W,43.665207,-79.410220
30288,10690679,Food Take Out,1175 ST CLAIR AVE W,43.677663,-79.443365


**Notes**
1. This output has a single combination of `latitude` and `longitude` for each grouping of `establishment_id`, `establishmenttype` and `establishment_address`.

We will now merge these two datasets (`df_query` and `df`) using the `establishment_id`, `establishmenttype` and `establishment_address` columns in order to get the `latitude` and `longitude` for each row of the data (`df`) that was filtered and transformed in the previous notebook (`2_sql_filter_transform.ipynb`)

In [11]:
%%time
df_with_lat_lon = df.merge(
    df_query,
    on=["establishment_id", "establishmenttype", "establishment_address"],
    how="left",
)
df_with_lat_lon

CPU times: user 79.3 ms, sys: 19.4 ms, total: 98.7 ms
Wall time: 98 ms


Unnamed: 0,establishment_id,establishmenttype,establishment_address,inspection_id,inspection_date,establishment_status,infractions_summary,num_significant,num_crucial,num_minor,...,num_cancelled,num_conviction_suspended_sentence,num_conviction_fined,num_conviction_fined_order_to_close_by_court,num_charges_dismissed,num_null.1,num_conviction_probationary_order,is_infraction,latitude,longitude
0,1222579,Food Take Out,870 MARKHAM RD,102810896,2012-08-21,Pass,,0,0,0,...,0,0,0,0,0,1,0,0,43.767980,-79.229029
1,1222579,Food Take Out,870 MARKHAM RD,103015259,2013-06-27,Pass,,0,0,0,...,0,0,0,0,0,1,0,0,43.767980,-79.229029
2,1222579,Food Take Out,870 MARKHAM RD,103133558,2013-12-20,Pass,Food handler fail to wear headgear. Operator f...,0,0,6,...,0,0,0,0,0,1,0,0,43.767980,-79.229029
3,1222579,Food Take Out,870 MARKHAM RD,103329697,2014-09-09,Pass,FAIL TO PROVIDE TOWELS IN FOOD PREPARATION ARE...,3,0,12,...,0,0,0,0,0,2,0,1,43.767980,-79.229029
4,1222579,Food Take Out,870 MARKHAM RD,103420091,2015-01-08,Pass,Operator fail to properly wash equipment. Oper...,3,0,6,...,0,0,0,0,0,2,0,1,43.767980,-79.229029
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
205748,10690616,Food Take Out,4698 YONGE ST,104594530,2019-10-23,Pass,,0,0,0,...,0,0,0,0,0,1,0,0,43.759200,-79.410700
205749,10690642,Bake Shop,20 ST PATRICK ST,104594681,2019-10-23,Pass,FAIL TO PROVIDE THERMOMETER IN REFRIGERATION E...,1,0,0,...,0,0,0,0,0,1,0,1,43.650944,-79.389040
205750,10690660,Restaurant,549 BLOOR ST W,104594800,2019-10-23,Pass,Use pallets not of readily cleanable design - ...,1,0,1,...,0,0,0,0,0,2,0,1,43.665207,-79.410220
205751,10690679,Food Take Out,1175 ST CLAIR AVE W,104594954,2019-10-23,Pass,SANITIZE UTENSILS IN WATER FOR LESS THAN 45 SE...,1,0,0,...,0,0,0,0,0,1,0,1,43.677663,-79.443365


### Get Addresses with a Missing Latitude or Longitude

Some but not all inspections are missing information in these two columns. Instead of geocoding all the addresses in the above data (which will involve unnecessary calls to an API), we will now get the unique addresses for which the `latitude` and `longitude` are missing

In [12]:
df_addr_lat_lon = (
    df_with_lat_lon.query("latitude.isnull() | longitude.isnull()")
    .groupby("establishment_address", as_index=False)[["latitude", "longitude"]]
    .max()
)
df_addr_lat_lon

Unnamed: 0,establishment_address,latitude,longitude
0,1 AVONDALE AVE,,
1,1 BALDWIN ST,,
2,1 BALMORAL AVE,,
3,1 BAXTER ST,,
4,1 BLUE JAYS WAY,,
...,...,...,...
6338,997 BAY ST,,
6339,997 EGLINTON AVE W,,
6340,998 ST CLAIR AVE W,,
6341,999 ALBION RD,,


**Notes**
1. We only need to geocode these addresses and then join back with `df_with_lat_lon` in order to fill in the missing `latitude`s and `longitude`s there.

The addresses above are missing the name of the city, province and country, which are needed to allow for accurate geocoding. We'll now append a suffix to the `establishment_address` column with this information

In [13]:
unique_addresses_missing_lat_lon = (
    df_addr_lat_lon["establishment_address"].str.title() + ", Toronto, ON, Canada"
)
unique_addresses_missing_lat_lon.rename("address").to_frame()

Unnamed: 0,address
0,"1 Avondale Ave, Toronto, ON, Canada"
1,"1 Baldwin St, Toronto, ON, Canada"
2,"1 Balmoral Ave, Toronto, ON, Canada"
3,"1 Baxter St, Toronto, ON, Canada"
4,"1 Blue Jays Way, Toronto, ON, Canada"
...,...
6338,"997 Bay St, Toronto, ON, Canada"
6339,"997 Eglinton Ave W, Toronto, ON, Canada"
6340,"998 St Clair Ave W, Toronto, ON, Canada"
6341,"999 Albion Rd, Toronto, ON, Canada"


### Prepare Database Table to Append Geocoded Data

The geocoded data will be stored locally in a database. We'll now create the `addressinfo` table in the `dinesafe` database

In [None]:
_ = cur.execute("DROP TABLE IF EXISTS addressinfo")

In [None]:
create_table_query = """
                     CREATE TABLE IF NOT EXISTS addressinfo (
                         address TEXT,
                         neighbourhood TEXT,
                         locality TEXT,
                         formattedAddress TEXT,
                         postalCode TEXT,
                         latitude FLOAT,
                         longitude FLOAT
                     )
                     """
_ = cur.execute(create_table_query)

Disconnect from the database

In [14]:
cur.close()
conn.close()

**Note**
1. Geocoding is done with the Bing Maps API. Per [Bing Maps FAQ](https://www.microsoft.com/en-us/maps/faq/) (see *What is the policy on caching data?*), the geocoded attributes will **only** be stored locally in this database so they can be used in this analysis. After completion of this analysis, the entire database with the geocoded data will be deleted. Geocoded data will not be stored elsewhere.

### Geocode Addresses

Next, the addresses with a missing `latitude` or `longitude` will be geocoded using the `geopy` Python library with the [Bing Geocoder](https://geopy.readthedocs.io/en/stable/#bing). This is done using the helper function `geocode_missing_lat_lon()` from `src/geopy_helpers_v2.py`. The full contents of this helper module are shown below

```python
import os
from random import randint
from time import sleep
from typing import Dict, Union

import pandas as pd
import snowflake.connector
from geopy.exc import GeocoderTimedOut
from geopy.geocoders import Bing
from snowflake.connector.pandas_tools import write_pandas


def run_bing_geocoder(
    row_number, street_address, verbose: bool = False
) -> Dict[str, Union[str, float, None]]:
    """Geocode a single street address."""
    # Set up the Bing Geocoder
    geolocator = Bing(os.getenv("BING_MAPS_KEY"))

    # Perform geocoding
    try:
        # Geocode a single street address
        location = geolocator.geocode(
            street_address, include_neighborhood=True, exactly_one=True
        )
        # Get the street address key from the .raw attribute of the geocoded
        # output
        address_components = location.raw["address"]
        # Get the neighbourhood (if available)
        neighbourhood = (
            address_components["neighborhood"]
            if "neighborhood" in list(address_components)
            else None
        )
        # Get the locality (if available)
        locality = address_components.get("locality")
        # Get the latitude and longitude coordinates
        lat, lon = location.raw["point"]["coordinates"]
        # Store geocoded output in a dictionary
        record = {
            "address": street_address,
            "neighbourhood": neighbourhood,
            "locality": locality,
            "formattedAddress": address_components["formattedAddress"]
            if "formattedAddress" in address_components
            else None,
            "postalCode": address_components["postalCode"]
            if "postalCode" in address_components
            else None,
            "latitude": lat,
            "longitude": lon,
        }
        if verbose:
            print(
                f"{row_number}: Geocode completed for {street_address}", end=""
            )
    except GeocoderTimedOut as e:
        # If geocoding did not work, create dictionary with None for each key
        # in the dictionary where geocoding was successful
        if verbose:
            print(
                "{} - Error: geocode failed on input {} with msg: {}".format(
                    row_number, street_address, e
                )
            )
        record = {
            "address": street_address,
            "neighbourhood": None,
            "locality": None,
            "formattedAddress": None,
            "postalCode": None,
            "latitude": None,
            "longitude": None,
        }
    return record


def geocode_missing_lat_lon(
    connector_dict: Dict[str, str],
    unique_addresses_missing_lat_lon,
    db_table_name=None,
    min_delay_seconds=5,
    max_delay_seconds=10,
    verbose: bool = False,
) -> None:
    """Geocode a column with one or more street addresses."""
    conn = snowflake.connector.connect(**connector_dict)
    cur = conn.cursor()
    # Iterate over all street addresses to be geocoded
    for row_num, street_address in unique_addresses_missing_lat_lon.items():
        # Clean the street address
        street_address_clean = street_address.replace("'", "\\'")
        # Query local database for existing record with street address
        df_query = pd.read_sql(
            f"""
            SELECT COUNT(*) AS num_matching_street_addresses
            FROM {db_table_name}
            WHERE address = '{street_address_clean}'
            """,
            con=conn,
        )
        # print(df_query)
        # If geocoded output is not available in local database, then perform
        # geocoding for street address
        if df_query["NUM_MATCHING_STREET_ADDRESSES"].iloc[0] == 0:
            # Geocode
            geocoded_output = run_bing_geocoder(
                row_num, street_address, verbose
            )
            # Pause
            if verbose:
                print("...Pausing...", end="")
            sleep(randint(min_delay_seconds, max_delay_seconds))
            if verbose:
                print("Done.")
            # Convert dictionary of geocoded outputs to DataFrame
            df_geocoded = pd.DataFrame.from_dict(
                geocoded_output, orient="index"
            ).T.astype({"latitude": float, "longitude": float})
            df_geocoded.columns = df_geocoded.columns.str.upper()
            # print(df_geocoded.dtypes)
            # Append DataFrame of geocoded outputs to database
            success, _, nrows, _ = write_pandas(
                conn, df_geocoded, db_table_name
            )
            assert success
            assert nrows == 1
        else:
            # If geocoded output is available in local database, then do not
            # geocode the same street address
            if verbose:
                print(
                    f"{row_num}: Found existing record for {street_address}. "
                    "Did nothing."
                )
    cur.close()
    conn.close()
```
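
One edge case the helper above does not cover: geopy's `geocode()` returns `None` when the service finds no match, in which case `location.raw` raises an `AttributeError` rather than `GeocoderTimedOut`. A minimal sketch of a defensive extraction step is shown below. `build_record()` is a hypothetical helper (it is not part of `src/geopy_helpers_v2.py`) that takes the `.raw` dictionary, or `None`, and always returns a record with the same seven keys. The example dictionary is hand-written and only mimics the fields read by the helper above.

```python
from typing import Any, Dict, Optional

RECORD_KEYS = [
    "address",
    "neighbourhood",
    "locality",
    "formattedAddress",
    "postalCode",
    "latitude",
    "longitude",
]


def build_record(street_address: str, raw: Optional[Dict[str, Any]]) -> Dict[str, Any]:
    """Build a geocoding record, tolerating a missing result or missing keys."""
    # Start from an all-None record so every key is always present
    record: Dict[str, Any] = dict.fromkeys(RECORD_KEYS)
    record["address"] = street_address
    if raw is None:
        # The geocoder returned no match
        return record
    address_components = raw.get("address", {})
    # Bing spells this key "neighborhood"; the table column is "neighbourhood"
    record["neighbourhood"] = address_components.get("neighborhood")
    for key in ["locality", "formattedAddress", "postalCode"]:
        record[key] = address_components.get(key)
    coordinates = raw.get("point", {}).get("coordinates")
    if coordinates:
        record["latitude"], record["longitude"] = coordinates
    return record


# Hand-written dictionary (illustrative values only)
example_raw = {
    "address": {"locality": "Toronto"},
    "point": {"coordinates": [43.656091, -79.392349]},
}
found = build_record("1 Baldwin St, Toronto, ON, Canada", example_raw)
not_found = build_record("Nowhere St, Toronto, ON, Canada", None)
```

Inside `run_bing_geocoder()`, such a helper could be called as `build_record(street_address, location.raw if location else None)`.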

**Notes**
1. `geocode_missing_lat_lon()` iterates over every unique address to be geocoded, and `run_bing_geocoder()` performs the geocoding, returning a dictionary of location attributes that includes the latitude and longitude. For each address, `geocode_missing_lat_lon()` converts the returned dictionary into a single-row `DataFrame` and appends it to a table in the local `dinesafe` database. If an address has been geocoded previously (i.e. it already has a record in that table), `geocode_missing_lat_lon()` skips it in order to prevent unnecessary calls to the Bing Maps API.
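
The skip-if-already-geocoded logic can be illustrated independently of Snowflake and Bing. The sketch below is a stand-in built on assumptions: it uses an in-memory SQLite table instead of the `ADDRESSINFO` table and a stub `fake_geocoder()` in place of the API call, purely to show that a second pass over the same addresses triggers no new geocoding calls.

```python
import sqlite3
from typing import Callable, Dict, Iterable

api_calls = []  # records every address sent to the stub geocoder


def fake_geocoder(address: str) -> Dict[str, float]:
    """Stand-in for the real geocoder; returns placeholder coordinates."""
    api_calls.append(address)
    return {"latitude": 0.0, "longitude": 0.0}


def geocode_if_missing(
    conn: sqlite3.Connection,
    addresses: Iterable[str],
    geocoder: Callable[[str], Dict[str, float]],
) -> None:
    """Geocode only the addresses that are not yet stored in the table."""
    for address in addresses:
        # Parameter binding avoids manual escaping of quotes in addresses
        (n_existing,) = conn.execute(
            "SELECT COUNT(*) FROM addressinfo WHERE address = ?", (address,)
        ).fetchone()
        if n_existing:
            continue  # already geocoded, so no API call is needed
        result = geocoder(address)
        conn.execute(
            "INSERT INTO addressinfo VALUES (?, ?, ?)",
            (address, result["latitude"], result["longitude"]),
        )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE addressinfo (address TEXT, latitude REAL, longitude REAL)")
# Illustrative addresses; the second one contains a quote character
addresses = ["1 Baldwin St, Toronto, ON, Canada", "1 O'Connor Dr, Toronto, ON, Canada"]
geocode_if_missing(conn, addresses, fake_geocoder)  # first pass: two stub calls
geocode_if_missing(conn, addresses, fake_geocoder)  # second pass: no new calls
n_rows = conn.execute("SELECT COUNT(*) FROM addressinfo").fetchone()[0]
```

Because the existence check runs before each call, an interrupted run can simply be restarted and will resume from the first address that has no stored record.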

In [16]:
%%time
geocode_missing_lat_lon(connector_dict, unique_addresses_missing_lat_lon[:50], "ADDRESSINFO", 1, 3, True)

0: Found existing record for 1 Avondale Ave, Toronto, ON, Canada. Did nothing.
1: Found existing record for 1 Baldwin St, Toronto, ON, Canada. Did nothing.
2: Found existing record for 1 Balmoral Ave, Toronto, ON, Canada. Did nothing.
3: Found existing record for 1 Baxter St, Toronto, ON, Canada. Did nothing.
4: Found existing record for 1 Blue Jays Way, Toronto, ON, Canada. Did nothing.
5: Found existing record for 1 Byng Ave, Toronto, ON, Canada. Did nothing.
6: Found existing record for 1 Carlingview Dr, Toronto, ON, Canada. Did nothing.
7: Found existing record for 1 Centre Island Pk, Toronto, ON, Canada. Did nothing.
8: Found existing record for 1 Chelwood Rd, Toronto, ON, Canada. Did nothing.
9: Found existing record for 1 Concorde Gt, Toronto, ON, Canada. Did nothing.
10: Found existing record for 1 De Boers Dr, Toronto, ON, Canada. Did nothing.
11: Found existing record for 1 Dundas St W, Toronto, ON, Canada. Did nothing.
12: Found existing record for 1 Eastdale Ave, Toronto, O

Connect to the SQL database

In [17]:
conn = snowflake.connector.connect(**connector_dict)
cur = conn.cursor()

Show all records for which geocoding failed to retrieve one or more of the requested attributes

In [18]:
%%time
query = """
        SELECT *
        FROM addressinfo
        WHERE postalCode IS NULL
        OR locality IS NULL
        OR formattedAddress IS NULL
        OR latitude IS NULL
        OR longitude IS NULL
        """
_ = show_sql_df(query, cur, None, True, False)

Unnamed: 0,ADDRESS,NEIGHBOURHOOD,LOCALITY,FORMATTEDADDRESS,POSTALCODE,LATITUDE,LONGITUDE


CPU times: user 20.4 ms, sys: 0 ns, total: 20.4 ms
Wall time: 135 ms


**Observations**
1. There are no records in this table with a missing value in any of the specified geocoding attribute columns.

**Notes**
1. During the initial run of geocoding, a few addresses could not be geocoded completely and had to be retried using the steps below (see the Python comments for explanatory details)
   ```python
   # 0. Create list of incompletely geocoded addresses
   incomplete_addresses = [
       "1922 Queen St E, Toronto, ON, Canada",
       "1081 Weston Rd, Toronto, ON, Canada",
       "1105 Bay St, Toronto, ON, Canada",
   ]

   # 1. Delete rows from database table with incompletely geocoded addresses
   for incomplete_address in incomplete_addresses:
       _ = cur.execute(
           "DELETE FROM addressinfo WHERE address = %s", (incomplete_address,)
       )

   # 2. Create a new data structure with addresses for which geocoding will be retried
   unique_addresses_missing_lat_lon = pd.Series(incomplete_addresses)
   print(unique_addresses_missing_lat_lon)
   # 0          1922 Queen St E, Toronto, ON, Canada
   # 1          1081 Weston Rd, Toronto, ON, Canada
   # 2          1105 Bay St, Toronto, ON, Canada
   # dtype: object

   # 3. Re-run geocoding (arguments follow the helper's signature shown earlier)
   geocode_missing_lat_lon(
       connector_dict, unique_addresses_missing_lat_lon, "ADDRESSINFO", 1, 3, verbose=True
   )
   ```

## Replace Missing Latitude and Longitude with Geocoded Values

Finally, we can replace the missing `latitude` and `longitude` with the values retrieved from geocoding.

First, we'll query the database table with the geocoding records to get the unique geocoded addresses and their latitude and longitude

In [19]:
%%time
query = """
        SELECT REPLACE(address, ', Toronto, ON, Canada', '') AS establishment_address,
               latitude AS latitude_geo,
               longitude AS longitude_geo
        FROM addressinfo
        """
df_query = show_sql_df(query, cur, None, True, True, False)
df_query.columns = df_query.columns.str.lower()
df_query.head()

CPU times: user 7.31 ms, sys: 558 µs, total: 7.87 ms
Wall time: 214 ms


Unnamed: 0,establishment_address,latitude_geo,longitude_geo
0,1 Avondale Ave,43.758012,-79.409618
1,1 Baldwin St,43.656091,-79.392349
2,1 Balmoral Ave,43.685451,-79.393331
3,1 Baxter St,43.675269,-79.388565
4,1 Blue Jays Way,43.641739,-79.389247


**Notes**
1. The `latitude` and `longitude` columns contain the suffix `_geo` to indicate they came from geocoding.

Now, we'll merge this with the transformed data containing the latitude and longitude columns (`df_with_lat_lon`)

In [20]:
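# Geocoded addresses are title-cased; upper-case them to match `df_with_lat_lon`
df_query["establishment_address"] = df_query["establishment_address"].str.upper()
df_with_lat_lon_filled = df_with_lat_lon.merge(
    df_query, on=["establishment_address"], how="left"
)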

We'll now replace missing values in the `latitude` and `longitude` columns with the geocoded values (respective columns ending with the suffix `_geo`)

In [21]:
df_with_lat_lon_filled["latitude"] = df_with_lat_lon_filled["latitude"].fillna(
    df_with_lat_lon_filled["latitude_geo"]
)
df_with_lat_lon_filled["longitude"] = df_with_lat_lon_filled["longitude"].fillna(
    df_with_lat_lon_filled["longitude_geo"]
)

Now, we will drop the unwanted geocoded `latitude` and `longitude` columns (ending with the suffix `_geo`)

In [22]:
df_with_lat_lon_filled = df_with_lat_lon_filled.drop(
    columns=["latitude_geo", "longitude_geo"]
)
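
The merge-then-fill pattern used in the last three cells can be checked on a toy frame. The sketch below uses a handful of illustrative rows rather than the project data, and it assumes address strings are normalised to the same case on both sides before merging (a pandas merge on strings is case-sensitive). The `validate="many_to_one"` argument is an optional safeguard, not something the notebook uses: it raises an error if the geocoded table contains duplicate addresses, which would otherwise silently duplicate inspection rows.

```python
import pandas as pd

# Toy inspections: two rows share an address with missing co-ordinates
inspections = pd.DataFrame(
    {
        "establishment_address": ["1 BALDWIN ST", "1 BALDWIN ST", "870 MARKHAM RD"],
        "latitude": [None, None, 43.767980],
        "longitude": [None, None, -79.229029],
    }
)
# Toy geocoded output, stored in title case
geocoded = pd.DataFrame(
    {
        "establishment_address": ["1 Baldwin St"],
        "latitude_geo": [43.656091],
        "longitude_geo": [-79.392349],
    }
)
# Normalise case so the join keys match
geocoded["establishment_address"] = geocoded["establishment_address"].str.upper()

filled = inspections.merge(
    geocoded, on="establishment_address", how="left", validate="many_to_one"
)
# Keep existing co-ordinates; use geocoded values only where they are missing
for col in ["latitude", "longitude"]:
    filled[col] = filled[col].fillna(filled[f"{col}_geo"])
filled = filled.drop(columns=["latitude_geo", "longitude_geo"])

n_missing = int(filled[["latitude", "longitude"]].isna().sum().sum())
```

Checking `n_missing` and `len(filled)` after the fill confirms that no co-ordinates remain missing and that the merge did not change the number of rows.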

As we can see, there are now no missing values in the `latitude` and `longitude` columns

In [23]:
display(df_with_lat_lon_filled)
display(df_with_lat_lon_filled.isna().sum().rename("missing_values").to_frame())

Unnamed: 0,establishment_id,establishmenttype,establishment_address,inspection_id,inspection_date,establishment_status,infractions_summary,num_significant,num_crucial,num_minor,...,num_cancelled,num_conviction_suspended_sentence,num_conviction_fined,num_conviction_fined_order_to_close_by_court,num_charges_dismissed,num_null.1,num_conviction_probationary_order,is_infraction,latitude,longitude
0,1222579,Food Take Out,870 MARKHAM RD,102810896,2012-08-21,Pass,,0,0,0,...,0,0,0,0,0,1,0,0,43.767980,-79.229029
1,1222579,Food Take Out,870 MARKHAM RD,103015259,2013-06-27,Pass,,0,0,0,...,0,0,0,0,0,1,0,0,43.767980,-79.229029
2,1222579,Food Take Out,870 MARKHAM RD,103133558,2013-12-20,Pass,Food handler fail to wear headgear. Operator f...,0,0,6,...,0,0,0,0,0,1,0,0,43.767980,-79.229029
3,1222579,Food Take Out,870 MARKHAM RD,103329697,2014-09-09,Pass,FAIL TO PROVIDE TOWELS IN FOOD PREPARATION ARE...,3,0,12,...,0,0,0,0,0,2,0,1,43.767980,-79.229029
4,1222579,Food Take Out,870 MARKHAM RD,103420091,2015-01-08,Pass,Operator fail to properly wash equipment. Oper...,3,0,6,...,0,0,0,0,0,2,0,1,43.767980,-79.229029
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
205748,10690616,Food Take Out,4698 YONGE ST,104594530,2019-10-23,Pass,,0,0,0,...,0,0,0,0,0,1,0,0,43.759200,-79.410700
205749,10690642,Bake Shop,20 ST PATRICK ST,104594681,2019-10-23,Pass,FAIL TO PROVIDE THERMOMETER IN REFRIGERATION E...,1,0,0,...,0,0,0,0,0,1,0,1,43.650944,-79.389040
205750,10690660,Restaurant,549 BLOOR ST W,104594800,2019-10-23,Pass,Use pallets not of readily cleanable design - ...,1,0,1,...,0,0,0,0,0,2,0,1,43.665207,-79.410220
205751,10690679,Food Take Out,1175 ST CLAIR AVE W,104594954,2019-10-23,Pass,SANITIZE UTENSILS IN WATER FOR LESS THAN 45 SE...,1,0,0,...,0,0,0,0,0,1,0,1,43.677663,-79.443365


Unnamed: 0,missing_values
establishment_id,0
establishmenttype,0
establishment_address,0
inspection_id,0
inspection_date,0
establishment_status,0
infractions_summary,121406
num_significant,0
num_crucial,0
num_minor,0


## Considerations for Geocoding in a ML Training Experiment

The workflow being used here involves finding the latitude and longitude for establishment locations that are missing values in these two columns. These two columns will be used to retrieve geodata in the next notebook, that could be used later as features in a machine learning training experiment.

In other words, for every ML training run that requires geodata-based features, we will need to query the SQL database we have built up of geocoded locations and, for those that are missing values in these two columns, call the Bing Maps API to perform the geocoding. The ML model trained with such features can then be used to make predictions on unseen data.

The above workflow would also have to be used during cross-validation, i.e. for each training-validation fold pair, we would need to geocode any missing values in these two columns. This is expensive, since every address that is geocoded requires an API call followed by a pause. A better approach is to retrieve a dataset with every location of licensed establishments, geocode any missing lat-long pairs and store the locations in the SQL database. As new establishments are licensed, their locations should be added to this database. In this way, the database of establishment locations can be queried during ML experiments with no need for geocoding, which is less expensive than the approach currently used.

The city of Toronto provides a [Business Licenses and Permits dataset](https://open.toronto.ca/dataset/municipal-licensing-and-standards-business-licences-and-permits/) that might be useful for such an approach, if geodata features prove to be strong predictive features. Future iterations of this project should explore the feasibility of using that dataset. Specifically, they should count the establishments in the inspections data that are missing from the licenses and permits dataset, since inspections at these locations would have to be dropped before ML experiments. This is the disadvantage of the approach: there may be establishments that have been inspected but do not require a license, and these would be dropped before ML training experiments. A better understanding of the licensing requirements for inspected establishments would be helpful before excluding such observations from the inspections dataset that is used for ML training.
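
A first feasibility check could be as simple as counting inspected addresses that are absent from the licences data. The sketch below uses a few illustrative addresses and a hypothetical `normalise()` helper. The column names and address formatting of the real licences dataset are not known here and would need to be checked before applying anything like this.

```python
from typing import Iterable, Set


def normalise(address: str) -> str:
    """Upper-case and collapse whitespace so equivalent addresses compare equal."""
    return " ".join(address.upper().split())


def missing_from_licences(inspected: Iterable[str], licensed: Iterable[str]) -> Set[str]:
    """Return the inspected addresses that have no match in the licences data."""
    licensed_set = {normalise(a) for a in licensed}
    return {normalise(a) for a in inspected} - licensed_set


# Illustrative inputs only; the licensed list is made up
inspected_addresses = ["870 MARKHAM RD", "549 Bloor St W", "20 ST PATRICK ST"]
licensed_addresses = ["870 Markham Rd", "549  BLOOR ST W"]

unmatched = missing_from_licences(inspected_addresses, licensed_addresses)
share_unmatched = len(unmatched) / len(inspected_addresses)
```

A large `share_unmatched` on the real data would suggest that relying on the licences dataset alone would discard too many inspections.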

## Export Transformed Data, with Geocoded locations, to CSV

We'll now export this to a CSV file so that we have access to the transformed data, with latitude and longitude columns that don't contain missing values, for further analysis

In [24]:
%%time
time_now = datetime.now().strftime("%Y%m%d_%H%M%S")
df_with_lat_lon_filled.to_csv(
    f"data/processed/{geocoded_fname_prefix}__{time_now}.csv", index=False
)

CPU times: user 1.37 s, sys: 44.7 ms, total: 1.42 s
Wall time: 1.42 s
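
A side effect of the `%Y%m%d_%H%M%S` timestamp is that file names sort lexicographically in chronological order, because every field is zero-padded and ordered from year down to second. Since `glob()` returns paths in arbitrary order, a downstream notebook that wants the most recent export can sort the matched paths first. The sketch below demonstrates the ordering with hypothetical export times; no files are read or written.

```python
from datetime import datetime

prefix = "filtered_transformed_filledmissing_data"

# Hypothetical export times, deliberately listed out of order
export_times = [
    datetime(2021, 11, 3, 9, 5, 0),
    datetime(2021, 2, 14, 18, 30, 0),
    datetime(2021, 11, 3, 8, 59, 59),
]
filenames = [
    f"data/processed/{prefix}__{t.strftime('%Y%m%d_%H%M%S')}.csv"
    for t in export_times
]

# Sorting the strings is equivalent to sorting by export time
latest = sorted(filenames)[-1]
```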


In the next notebook (`4_get_stats_by_neighbourhood.ipynb`), we will
- use the `geopandas` library to determine the name of the neighbourhood containing each establishment in the above exported inspections data
- aggregate population, crimes and land area by neighbourhood and append these columns of aggregated counts to each inspection

## Disconnect from SQL Database

In [27]:
cur.close()
conn.close()