<a href="https://colab.research.google.com/github/elammertsma/COVID-19/blob/master/CovidStats_from_JH.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

First, let's install the required libraries, like FuzzyWuzzy, since a few that we need are not included in Google Colab.

In [None]:
%%capture
# %%capture must be the first line of the cell; it keeps the installation output from cluttering Colab

!pip install fuzzywuzzy
!pip install python-Levenshtein
!pip install census
!pip install countryinfo

Collecting fuzzywuzzy
  Downloading https://files.pythonhosted.org/packages/43/ff/74f23998ad2f93b945c0309f825be92e04e0348e062026998b5eefef4c33/fuzzywuzzy-0.18.0-py2.py3-none-any.whl
Installing collected packages: fuzzywuzzy
Successfully installed fuzzywuzzy-0.18.0
Collecting python-Levenshtein
  Downloading https://files.pythonhosted.org/packages/42/a9/d1785c85ebf9b7dfacd08938dd028209c34a0ea3b1bcdb895208bd40a67d/python-Levenshtein-0.12.0.tar.gz (48kB)
Building wheels for collected packages: python-Levenshtein
  Building wheel for python-Levenshtein (setup.py) ... done
  Created wheel for python-Levenshtein: filename=python_Levenshtein-0.12.0-cp36-cp36m-linux_x86_64.whl size=144796 sha256=84ff0bb54a267fef2d30f00ac859c8cbd279264c37ab7b6fbee51bae359d1362
  Stored in directory: /root/.cache/pip/wheels/de/c2/93/660fd5f7559049268ad2dc6d81c4e39e9e36518766eaf7e342
Successfully built python-Levenshtein
Installing collect

# Project outline

## Intended use
This project is meant to run as a webhook on GCP Cloud Functions for a Dialogflow integration, but for development purposes I also wanted it to run in Colab or locally.

To do this, we need to run a few checks and execute/install things depending on the environment.

We first check whether this code is running in Colab or elsewhere (e.g. as a Google Cloud Function or locally). When it runs as a webhook, the webhook function is called directly with a request that already includes a JSON-formatted location, skipping the "main" function. When it runs in Colab or locally, we execute the "main" function, which asks for a location and wraps the answer in an artificial request that is compatible with the webhook function.
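
The environment check and the artificial request could be sketched as follows. This is a minimal, standard-library-only illustration: `running_in_colab` and `make_direct_request` are hypothetical helper names, and the notebook itself wraps the query in a Flask test request rather than in a plain JSON string.

```python
import importlib.util
import json


def running_in_colab():
    """Return True when the google.colab package can be imported."""
    try:
        return importlib.util.find_spec("google.colab") is not None
    except (ImportError, ValueError):
        # The parent "google" package is missing or has no usable spec.
        return False


def make_direct_request(query):
    """Wrap a free-text location in the shape a direct webhook call uses."""
    return json.dumps({"location": query.strip()})
```

Calling `make_direct_request("Paris, France")` yields a JSON object with a single `location` key, which is the key the webhook function looks for on direct calls.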

## Getting the location
Once we have the JSON request with a location, we determine what the location means by passing it to the Google Maps geocoding API, which is very robust and can generally produce an address for even the most malformed and unintuitive location requests.
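
To make the geocoding step more concrete, here is a rough sketch of how the address components of a result can be reduced to the four fields used later. The `sample` list is hand-written in the shape of a geocoding result, not real API output, and `parse_components` is a hypothetical name. Unlike the notebook's own code, this sketch checks every entry in `types` rather than only the first, and it does not strip the " County" suffix.

```python
def parse_components(address_components):
    """Pick city, county, state/province and country out of a geocoding result."""
    wanted = {
        "locality": "l_Locality",
        "administrative_area_level_2": "l_Admin2",
        "administrative_area_level_1": "l_Province",
        "country": "l_Country",
    }
    location = {key: "" for key in wanted.values()}
    for component in address_components:
        for component_type in component.get("types", []):
            if component_type in wanted:
                location[wanted[component_type]] = component["long_name"]
    # Fall back to the city when no county-level name was returned.
    if not location["l_Admin2"]:
        location["l_Admin2"] = location["l_Locality"]
    return location


# Hand-written sample for illustration only.
sample = [
    {"long_name": "Austin", "short_name": "Austin",
     "types": ["locality", "political"]},
    {"long_name": "Travis County", "short_name": "Travis County",
     "types": ["administrative_area_level_2", "political"]},
    {"long_name": "Texas", "short_name": "TX",
     "types": ["administrative_area_level_1", "political"]},
    {"long_name": "United States", "short_name": "US",
     "types": ["country", "political"]},
]
location = parse_components(sample)
```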

## Getting the Covid statistics
With an accurate location in hand, we determine which data to retrieve from the Johns Hopkins CSSE repository on GitHub: one set of files covers US states, and another covers US counties together with international data. The repository is updated daily, so we always have recent data to work with. We filter out unneeded data and, since the data includes latitudes and longitudes, we compute the distance from the requested location to each entry and sort the data by that distance. Sometimes the JH locations don't match up with Google's locations, so we then check that the address components match, which gives us the most relevant location instead of just the nearest one. We now have the most relevant data!
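
The "nearest, then most relevant" idea can be illustrated with a small, self-contained sketch. It is not the notebook's implementation: `difflib.SequenceMatcher` stands in for FuzzyWuzzy's `fuzz.ratio`, entries are plain dictionaries instead of DataFrame rows, only the country name is compared, and the coordinates in the usage note are approximate values chosen for illustration.

```python
from difflib import SequenceMatcher
from math import atan2, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points given in degrees."""
    d_lat = radians(lat2 - lat1)
    d_lon = radians(lon2 - lon1)
    a = (sin(d_lat / 2) ** 2
         + cos(radians(lat1)) * cos(radians(lat2)) * sin(d_lon / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * atan2(sqrt(a), sqrt(1 - a))


def name_score(a, b):
    """Similarity from 0 to 100 (a stand-in for fuzz.ratio)."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def best_entry(query, entries, keep=5, threshold=70):
    """Take the nearest entries, then prefer one whose country name matches."""
    nearest = sorted(
        entries,
        key=lambda e: haversine_km(query["lat"], query["long"], e["lat"], e["long"]),
    )[:keep]
    # sorted() is stable, so entries with the same match result stay in distance order.
    return sorted(
        nearest,
        key=lambda e: name_score(e["country"], query["country"]) > threshold,
        reverse=True,
    )[0]


# Approximate coordinates, for illustration only.
paris = {"country": "France", "lat": 48.86, "long": 2.35}
entries = [
    {"country": "Belgium", "lat": 50.83, "long": 4.47},
    {"country": "France", "lat": 46.23, "long": 2.21},
]
chosen = best_entry(paris, entries)
```

With these sample values the Belgian entry is the closer of the two, but the name check moves the French entry to the top, which is the behaviour the paragraph above describes.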

## Returning results
With the best data now defined, we simply format the output as desired. Primarily, this means returning the data to Dialogflow so that the Google Assistant knows how to use it, but for other API calls and running the code directly, we output data in a more readable format.
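
As a rough illustration of the two output shapes, the sketch below builds one sentence and either returns it as plain text or wraps it in a JSON object with the `fulfillmentText` key that the webhook uses for Dialogflow. The helper name `format_response` and the sample numbers are made up for this example, and the sentence is simplified compared with the one the notebook builds.

```python
import json


def format_response(stats, for_dialogflow):
    """Return a readable sentence, or a Dialogflow-style JSON body."""
    sentence = (
        "In {name}, there are {confirmed:,} confirmed cases, of which "
        "{active:,} are active. There have been {deaths:,} deaths."
    ).format(**stats)
    if for_dialogflow:
        return json.dumps({"fulfillmentText": sentence})
    return sentence


# Invented numbers, for illustration only.
sample_stats = {"name": "Testland", "confirmed": 12345, "active": 2345, "deaths": 67}
plain = format_response(sample_stats, for_dialogflow=False)
dialogflow_body = format_response(sample_stats, for_dialogflow=True)
```

Serialising with `json.dumps` matters here: calling `str()` on a Python dictionary produces single-quoted output, which is not valid JSON.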

In [None]:
# First, check if this is running in Colab
try:
  import google.colab
  IN_COLAB = True
except ImportError:
  IN_COLAB = False

if IN_COLAB:
  print('Running in Colab!')
  get_ipython().system('pip install fuzzywuzzy')
  get_ipython().system('pip install python-Levenshtein')
  get_ipython().system('pip install census')
  get_ipython().system('pip install countryinfo')
  get_ipython().system('clear')

import os
import logging
from flask import Flask, request, make_response, jsonify # escape is unused here and is no longer exported by recent Flask releases
import pandas as pd
import numpy as np
import requests
import re
from geopy.geocoders import GoogleV3
from math import pi,sqrt,sin,cos,atan2
from fuzzywuzzy import fuzz
from census import Census
from countryinfo import CountryInfo

 
def fixData(df,country):
  orig_size = df.size

  df.drop(df[pd.isnull(df.Lat) | pd.isnull(df.Long_) | pd.isnull(df.Confirmed) | pd.isnull(df.Active)].index, inplace=True, errors='ignore')

  # Exceptions due to differences in country notation between the JH data and Google
  if country == 'United States':
    df[['Country_Region']] = df[['Country_Region']].replace('^US$', 'United States', regex=True) # this makes comparisons easier later
  if country == 'Côte d\'Ivoire':
    df[['Country_Region']] = df[['Country_Region']].replace('^Cote d\'Ivoire$', 'Côte d\'Ivoire', regex=True) # this makes comparisons easier later
  if country == 'Democratic Republic of the Congo':
    df[['Country_Region']] = df[['Country_Region']].replace(r'^Congo \(Kinshasa\)$', 'Democratic Republic of the Congo', regex=True) # this makes comparisons easier later
  if country == 'Republic of the Congo':
    df[['Country_Region']] = df[['Country_Region']].replace(r'^Congo \(Brazzaville\)$', 'Republic of the Congo', regex=True) # this makes comparisons easier later
  if country == 'South Korea':
    df[['Country_Region']] = df[['Country_Region']].replace(r'^Korea, South$', 'South Korea', regex=True) # pandas strips the CSV quotes, so match the bare value

  print('Removing anything that isn\'t ' + country)
  print(df[df.Country_Region == country].head(3))
  df.drop(df[df.Country_Region != country].index, inplace=True, errors='ignore')

  df[['Province_State']] = df[['Province_State']].replace(np.nan, '', regex=True) # some countries have territories as states with an empty state entry as the actual country. ugh.
  df[['Recovered']] = df[['Recovered']].replace(np.nan, 0, regex=True) # states have empty recovered numbers instead of zero, contrary to countries and counties. ugh.

  logging.debug('Removed ' + str(orig_size - df.size) + ' unusable locations from the data. (' + str(100 - df.size/orig_size * 100) + '%)')
  return df

def haversine(row, location):
  """This function finds the distance between the
  queried location and each row in the data set."""

  lat1 = row['Lat']
  long1 = row['Long_']
  lat2 = location['lat']
  long2 = location['long']

  degree_to_rad = float(pi / 180.0)

  d_lat = (lat2 - lat1) * degree_to_rad
  d_long = (long2 - long1) * degree_to_rad

  a = pow(sin(d_lat / 2), 2) + cos(lat1 * degree_to_rad) * cos(lat2 * degree_to_rad) * pow(sin(d_long / 2), 2)
  c = 2 * atan2(sqrt(a), sqrt(1 - a))
  km = 6367 * c
    
  return km

def fuzzymatch(row, location):
  Admin2 = str(row[['Admin2']][0])
  Province = str(row[['Province_State']][0])
  Country = str(row[['Country_Region']][0])
  l_Admin2 = location['l_Admin2']
  l_Province = location['l_Province']
  l_Country = location['l_Country']

  # str() turns missing values into the string 'nan'; treat those as empty.
  if Admin2 == 'nan':
    Admin2 = ''
  if Province == 'nan':
    Province = ''
  if Country == 'nan':
    Country = ''

  if Admin2:
    admin2_match = fuzz.ratio(Admin2.lower(),l_Admin2.lower())
  else:
    admin2_match = 0
  if Province:
    province_match = fuzz.ratio(Province.lower(),l_Province.lower())
  else:
    province_match = 0
  if Country:
    country_match = fuzz.ratio(Country.lower(),l_Country.lower())
  else:
    country_match = 0

  logging.debug('Comparing ' + Admin2.lower() + ' with ' + l_Admin2.lower() + ': ' + str(admin2_match))
  logging.debug('Comparing ' + Province.lower() + ' with ' + l_Province.lower() + ': ' + str(province_match))
  logging.debug('Comparing ' + Country.lower() + ' with ' + l_Country.lower() + ': ' + str(country_match))

  # Choose how closely results should match. Too low and it will
  # match things that shouldn't match and too high means slight 
  # variations won't match, like New York City to New York. 70 
  # seems to work well.
  match_threshhold = 70

  if admin2_match > match_threshhold:
    return 3
  elif province_match > match_threshhold:
    return 2
  elif country_match > match_threshhold:
    return 1
  else:
    return 0

def addcounty(row):
  if str(row[['Admin2']][0])[-7:].lower() == ' county':
    return str(row[['Admin2']][0])
  else:
    return str(row[['Admin2']][0]) + ' County'

def roundbig(number, retain = 3):
  l = len(str(number))
  r = int(round(number / 10 ** (l - retain)) * 10 ** (l - retain))
  number_with_commas = "{:,}".format(r)
  return number_with_commas

def getpopulation(fips):
  c = Census(os.getenv('US_CENSUS_KEY'))
  logging.debug('Getting population for FIPS ' + str(fips))
  if fips > 999:
    fips = f'{fips:05}' # pad with zeroes, since the census API needs them
    
    state_fips = fips[:-3]
    county_fips = fips[-3:]
    census_info = c.acs5.get(('NAME', 'B01003_001E'), {'for': 'county:' + county_fips, 'in': 'state:' + state_fips})
  else:
    fips = f'{fips:02}' # pad with zeroes, since the census API needs them
    census_info = c.acs5.get(('NAME', 'B01003_001E'), {'for': 'state:' + fips})
  logging.debug('census_info: ' + str(census_info))

  try:
    pop = int(census_info[0]['B01003_001E'])
  except (IndexError, KeyError, TypeError, ValueError):
    pop = 0

  logging.debug('Population is ' + str(pop))
  return pop

def getlocation(query):
  # query components are {'admin-area', 'business-name', 'city', 'country', 'island', 'shortcut', 'street-address', 'subadmin-area', 'zip-code'}

  
  # Get the full location from an incomplete query via Google Maps
  g = GoogleV3(api_key=os.getenv('GOOGLE_MAPS_KEY'))
  # l = g.geocode('Houston St',components={"administrative_area": "CA"}) # shows limiting bounds by component

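  l = g.geocode(query)
  if l is None:
    # geopy returns None when nothing matches; webhook() then reports that the location was not understood
    return None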
  l_Locality = ''
  l_Admin2 = ''
  l_Province = ''
  l_Country = ''
  county = False

  logging.info(l.address)

  for i in l.raw['address_components']:
    if i['types'][0] == 'locality':
      l_Locality = i['long_name']
    elif i['types'][0] == 'administrative_area_level_2':
      l_Admin2 = i['long_name']
      if l_Admin2[-7:] == ' County':
        l_Admin2 = l_Admin2.replace(' County', '')
        county = True
    elif i['types'][0] == 'administrative_area_level_1':
      l_Province = i['long_name']
    elif i['types'][0] == 'country':
      l_Country = i['long_name']
  if l_Admin2 == '':
    l_Admin2 = l_Locality

  location = {'l_Locality': l_Locality, 'l_Admin2': l_Admin2, 'l_Province': l_Province, 'l_Country': l_Country, 'lat': l.latitude, 'long': l.longitude, 'county': county}

  return location

def getstats(location):

  # Here, we download latest report from Johns Hopkins git.
  # Which report we download depends on the location requested.
  # 
  # There are three files:
  # 1. A file with only US states (and territories)
  # 2. A file with all countries, for some countries also
  #    states and/or territories and (inexplicably) all US
  #    counties.
  # 3. A few files with timelines of daily cases, deaths and 
  #    recoveries for the same locales as 1 and 2. We won't be
  #    using these for now.
  #
  # None of the files contain an entry for the US, so we use
  # the state data and total it if the user requests the US.

  if location['l_Country'] == 'United States' and not (location['l_Locality'] or location['l_Admin2']):
    # Download US state data, since the request is for a US state or the US.
    logging.debug('US or US state requested')
    US = True
    report_dir = '/csse_covid_19_daily_reports_us'
  else:
    # Download US county and international data (which is inexplicably in one file)
    logging.debug('US county or international location requested')
    US = False
    report_dir = '/csse_covid_19_daily_reports'

  file_list_url = 'https://api.github.com/repos/CSSEGISandData/COVID-19/contents/csse_covid_19_data' + report_dir
  git_raw_url = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data'
  git_raw_url = git_raw_url + report_dir

  github_request = requests.get(file_list_url)
  if not github_request:
    logging.error('Error: Could not get folder content from Github.')
    return False

  file_list = github_request.json()

  # Get the name of the most recent dated csv in the directory (with a
  # name like 03-31-2020.csv). File names use US date notation
  # (MM-DD-YYYY), so alphabetical order stops being chronological once
  # the folder spans more than one year: 12-31-2020.csv sorts after
  # 01-01-2021.csv. Compare on (year, month-day) instead.
  is_dated_csv = re.compile(r'^\d{2}-\d{2}-\d{4}\.csv$')
  dated_names = [item['name'] for item in file_list if is_dated_csv.match(item['name'])]
  if not dated_names:
    logging.error('Error: No dated report found on Github.')
    return False
  last_report_name = max(dated_names, key=lambda name: (name[6:10], name[:5]))

  logging.info('Getting data from ' + git_raw_url + '/' + last_report_name)
  last_report = pd.read_csv(git_raw_url + '/' + last_report_name) # returns a dataframe built from the giant CSV of COVID-19 stats
  logging.debug(last_report.head(3))

  # Fix data that causes problems.
  last_report = fixData(last_report, location['l_Country'])
  if last_report.empty:
    return False

  if not US:

    # The user is simply not looking for the US or a US state.

    # NOTE! They might still be looking for a location in the US and
    # therefore receive a US county.

    if location['l_Province'] or location['l_Admin2'] or location['l_Locality']:
      # A state/province or more detailed location is present in the request,
      # meaning we will try to find something more specific.

      # Sort locations based on distance from the query result.
      last_report['Distance'] = last_report.apply(haversine, args=(location,), axis=1)
      last_report = last_report.sort_values('Distance')
      nearest = last_report[['FIPS','Admin2','Province_State','Country_Region','Distance','Confirmed','Deaths','Recovered','Active']].head(5)

      # Of the nearest locations with data, find the one that most closely 
      # resembles the address the user was looking for. This isn't always the 
      # nearest option. For example, Paris is closer to the center of Belgium 
      # than the center of France, so asking for Paris would give an illogical 
      # result if the nearest info was returned.

      nearest['Match'] = nearest.apply(fuzzymatch, args=(location,), axis=1)

      # The data doesn't include "county" at the end of county names, so add it.
      if location['county']:
        nearest['Admin2'] = nearest.apply(addcounty, axis=1)
      sorted_df = nearest.sort_values('Match', ascending=False)

      bestmatch_specificity = int(sorted_df[['Match']].iloc[0])
      bestmatch_name = str(sorted_df.iloc[0,4-bestmatch_specificity]) # returns the county/city, state or country, depending on what matched
      if bestmatch_specificity == 1:
        # The best match is a country after all, despite the user wanting a more specific location
        is_country = True
      else:
        is_country = False
    else:
      is_country = True

      # Here, we know the user is looking for a non-US country, so we get
      # all entries for that non-US country. Some countries have a single
      # entry, but some are split into states/provinces and some have both
      # their own entry and entries for territories (which are entered in
      # the state/province field, to make it more confusing). Entries
      # specific to a country have a blank state/province, so we look for
      # this first. If that exists, we can use that and ignore the rest,
      # since those appear to always be territories. If a blank entry does
      # not exist, the country is split into states/provinces, so we total
      # those entries to get the values for the country.

      #sorted_df = last_report.loc[last_report['Country_Region'] == location['l_Country']]
      sorted_df = last_report.sort_values('Province_State', ascending=True)
      if sorted_df[['Province_State']].iloc[0][0]:
        # The top entry has a province, so the country doesn't have its own
        # entry. Therefore we have to add all provinces together.
        totalsseries = sorted_df.sum(numeric_only=True) # creates a vertical series of totals
        totalsdf = pd.DataFrame(totalsseries).transpose() # transpose to create a row and make it into a new dataframe
        sorted_df = pd.concat([totalsdf, sorted_df], ignore_index = True) # insert the totals at the top, so they get used

        bestmatch_specificity = 3
        bestmatch_name = location['l_Country']
      else:
        # the top entry has no province meaning it's the main entry for the
        # country, so we're done. (The other entries are usually territories,
        # like Curacao which lists Netherlands as the Country.)
        bestmatch_specificity = 3
        bestmatch_name = sorted_df[['Country_Region']].iloc[0][0]
  elif location['l_Province']:
    # we're looking for a state
    sorted_df = last_report.loc[last_report['Province_State'] == location['l_Province']]
    is_country = False

    bestmatch_specificity = 3
    bestmatch_name = sorted_df[['Province_State']].iloc[0][0]
  else:
    # we need to add up data for the US
    totals_series = last_report.sum(numeric_only=True) # creates a vertical series of totals
    totals_df = pd.DataFrame(totals_series).transpose() # transpose to create a row and make it into a new dataframe
    sorted_df = pd.concat([totals_df, last_report], ignore_index = True) # insert the totals at the top, so they get used
    is_country = True

    bestmatch_specificity = 3
    bestmatch_name = 'the United States'

  logging.debug(sorted_df.head(3))
  logging.debug('Location is a country? ' + str(is_country))
  logging.debug('Best matching location name: ' + bestmatch_name)
  logging.debug('Match level 1 - 3: ' + str(bestmatch_specificity))

  if not is_country and location['l_Country'] == 'United States':
    logging.debug('Getting population for ' + str(sorted_df[['FIPS']].iloc[0]))
    if bestmatch_name == 'New York City':
      # The FIPS for NYC is incorrect as it's for Manhattan, while the
      # infection counts are for all boroughs of NYC. This would give a much
      # lower population of 1.6M, while it should be over 8M, making the
      # infection rate seem much worse than it is.
      pop = 8336817
    else:
      # Use the US Census API to get populations of US areas based on FIPS#.
      pop = getpopulation(int(sorted_df[['FIPS']].iloc[0]))
    active = int(sorted_df[['Active']].iloc[0])
    if pop and active: # avoid dividing by zero when there are no active cases
      per = int(pop / active)
    else:
      per = 0
  elif is_country:
    # Use the CountryInfo library to get country populations.
    pop = CountryInfo(location['l_Country']).population()
    active = int(sorted_df[['Active']].iloc[0])
    if pop and active: # avoid dividing by zero when there are no active cases
      per = int(pop / active)
    else:
      per = 0
  else:
    # We don't know the populations of states/provinces/territories outside
    # the US, so return 0.
    pop = 0
    per = 0
  logging.debug('population = ' + str(pop))
  
  stats = {
      'name': bestmatch_name, 
      'exact_match': bestmatch_specificity, 
      'confirmed': int(sorted_df[['Confirmed']].iloc[0]), 
      'recovered': int(sorted_df[['Recovered']].iloc[0]), 
      'deaths': int(sorted_df[['Deaths']].iloc[0]), 
      'active': int(sorted_df[['Active']].iloc[0]),
      'one_in_every': per,
      'population': pop
      }

  if (stats['exact_match'] < 3 and (location['l_Admin2'] or location['l_Locality'])) or (location['l_Locality'] and location['l_Locality'] != location['l_Admin2']):
    stats['exact_match'] = False
  else:
    stats['exact_match'] = True

  return stats

def webhook(req):
  # GC Functions uses Flask and calls this function with a Flask request
  logging.info('Request: ' + str(req))

  request_json = req.get_json(silent=True)
  request_args = req.args

  if request_json:
    req = request_json
    logging.debug('using request_json')
  elif request_args and 'location' in request_args:
    req = request_args
    logging.debug('using request_args')
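    if request_args['location'] == 'DF':
      # req = json.loads(df_test)
      logging.debug('TODO: run a test DF request')
  else:
    # Neither JSON nor a location argument was sent; use an empty
    # request so the membership tests below report a bad request.
    req = {}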

  # Test if the request is from DialogFlow
  if 'queryResult' in req:
    isDF = True
    logging.debug('Request originated from Dialogflow.')
  else:
    isDF = False
    logging.debug('Request was direct.')

  ## extract the name of the intent which is detected.
  if isDF:
    try:
      detect_intent = req["queryResult"]["intent"]["displayName"]
    except:
      detect_intent = 'bad_request'
  elif 'location' in req:
    detect_intent = 'direct'
  else:
    detect_intent = 'bad_request'

  logging.info('Intent Detected: '+ str(detect_intent))

  ## Map the detected intent to the function that handles it.
  # Defaults, so the checks below work even when an intent sets no location.
  location = None
  stats = False
  res = None
  if detect_intent == 'Default Welcome Intent':
    res = welcome(req)
  elif detect_intent == 'Get Coronavirus Statistics':
    location = getlocation(req["queryResult"]["outputContexts"][0]["parameters"]["location.original"])
  elif detect_intent == 'test':
    res = 'Testing success!'
  elif detect_intent == 'bad_request':
    res = 'The request did not include a location parameter.'
  elif detect_intent == 'direct':
    location = getlocation(req["location"])
  elif detect_intent == 'raw':
    location = getlocation(req)
  else:
    res = 'Unknown intent! ' + str(detect_intent)

  if location:
    stats = getstats(location)
  else:
    res = 'I didn\'t understand that location. You can ask for any location and I will find the nearest data available.'

  if 'rawdata' in req:
    response = stats
  elif stats == False:
    response = 'Unfortunately, no location or data was found for your request.'
  else:
    if not stats['exact_match']:
      inaccuracy = 'The nearest location with data available is ' + stats['name'] + '. '
    else:
      inaccuracy = ''
    if stats['one_in_every']:
      perpop =  'In ' + stats['name'] + ', 1 in ' + str(roundbig(stats['one_in_every'], 2)) + ' people potentially have the virus. T'
    else:
      perpop = 'In ' + stats['name'] + ', t'
    if stats['recovered'] > 0:
      recoveries = str(roundbig(stats['recovered'])) + ' confirmed recoveries and '
    else:
      recoveries = ''
    response = inaccuracy + perpop + 'here are ' + str(roundbig(stats['confirmed'])) + ' confirmed cases, of which ' + str(roundbig(stats['active'])) + ' are active. There have been ' + recoveries + str(roundbig(stats['deaths'])) + ' deaths.'
  if isDF:
    followup = 'Would you like the coronavirus statistics about another place?'
    response = {'fulfillmentText': response + ' ' + followup}

  ## returning the res from the function for which intent is matched.
  logging.info("Response: " + str(response))
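  # Dialogflow expects a JSON body; str() on a dict gives a Python repr, not JSON.
  if isinstance(response, dict):
    return jsonify(response)
  return str(response)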

def main():
  logging.basicConfig(level=logging.DEBUG)

# Create a file called CovidStats_Keys.txt
# Add your Google Maps Key as:
# GOOGLE_MAPS_KEY:xxxxxxxxxxxxxxxxxxxxxxxxxxxx
# and your US Census Key as:
# US_CENSUS_KEY:xxxxxxxxxxxxxxxxxxxxxxxxxxxx

  if IN_COLAB:
    google.colab.drive.mount('/content/drive')
    keys_file_path = 'drive/My Drive/Colab Notebooks/CovidStats_Keys.txt'
  else:
    keys_file_path = 'CovidStats_Keys.txt'

  keys = []
  with open(keys_file_path) as keysfile:
    keys = keysfile.readlines()
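  for key in keys:
    key = key.strip()
    if ':' not in key:
      continue # skip blank or malformed lines
    name, value = key.split(':', 1) # split once, in case the value itself contains a colon
    os.environ[name] = value
    print('Successfully added environment variable ' + name)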

  query = str(input('Which location would you like data for?\n'))

  app = Flask(__name__)
  with app.test_request_context(query_string={'location': query}): # creates test Flask request; the query is URL-encoded for us
    webhook(request)

# In Colab the cell is executed directly; locally, only run main() when the
# file is executed as a script rather than imported (e.g. by Cloud Functions).
if IN_COLAB or __name__ == "__main__":
  main()


Running in Colab!
/bin/bash: cls: command not found
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


OSError: ignored

In [None]:
from google.colab import drive
drive.mount("/content/drive", force_remount=True)
!ls 'drive/My Drive/Colab Notebooks/'
!cat 'drive/My Drive/Colab Notebooks/CovidStats_Keys.txt'

KeyboardInterrupt: ignored