Author: Andy Sollish

Date: November 8 2017

Platform: Python 3 (Windows 10)

Class: OA3802

Assignment: GeoHash Lab 5

In [44]:
import gdelt 
import geopandas as pd
import pandas as p
import pygeohash as pgh
from haversine import haversine
from collections import Counter
import folium
import json

In [16]:
gd = gdelt.gdelt(version=2)

# Task 0

write a function that, when given a date (i.e. the current date), pulls down the last 30 days of the GDELT Event data table into an appropriate data structure (dataframe or geodataframe recommended but not required). Apply your function to dowload all events in the 30 days prior to 31 October (i.e. 1 to 30 October) and use that file to demonstrate your subsequent methods.*

The 'events' function is an implimentation to pull down information from the GDELT api and store the results in a Pandas dataframe. 

In [17]:
def events(begin_date, end_date):
    ''' Input: two dates, a start date and an end date (inclusive). 
        Output: a dataframe of event info for the date range. 
        * dates must be in string form, '2017 December 25'.'''
    
    df = gd.Search(date=[begin_date, end_date], 
                   table='events', normcols=True, coverage=False)
        
    return df

In [18]:
df1 = events('2017 October 1', '2017 October 30')



# Task 1

Write a function that takes as input a location x in Lat/Long, a distance y in kilometers, and a list of event codes z and returns a data structure containing only the events within y (great-circle) distance of point x of type z.

Provide a nicely formatted list of the 10 closest incidents when x - (15.05, 1.82), y = 300, and z = [13, 14, 18, 19, 20]. This should just be a very short summary of the events to demonstrate that the approach works.

'close_events' takes in a dataframe and set of parameters and finds all events in the dataframe that occurred inside the provided distance threshold.  The function makes use of the haversine formula to determine if the event in the dataframe was inside the distance threshold or the location of interest.

In [19]:
def close_events(DF, location, distance, event_type):
    ''' Input: latitude, longitude in form (15.05, 1.82), distance from lat / long in km, and a list of event codes
               in form [13, 14, 18, 19, 20]. 
        Output: a dataframe of events that satisfy the criteria. '''
    
    new_df = DF[DF['eventcode'].isin(event_type)]                           # create a smaller DF first by extracting
                                                                            # only events of interest
    loc2 = location
    lats = set()                                                            # use sets to filter duplicate locations
    longs = set()
    for lat, long in zip(new_df.actiongeolat, new_df.actiongeolong):        # iterate through our smaller DF
        loc1 = (lat, long)
        if haversine(loc1, loc2) < distance:                                # check if dist within given threshold
            lats.add(lat)                                                   # saved locations used below
            longs.add(long)
            
    final_df = new_df[(new_df['actiongeolat'].isin(list(lats))) &           # create our final DF will all filter
                     (new_df['actiongeolong'].isin(list(longs)))]           # criteria applied
    
    return final_df

Call the 'close_events' function that makes use of the haversine formula and time the results.

In [20]:
%time DF2 = close_events(df1, (15.05, 1.82), 300, ['013', '014', '018', '019', '020'])
#print(DF2['sourceurl'])

CPU times: user 36 ms, sys: 16 ms, total: 52 ms
Wall time: 57.8 ms


# Task 2

Implement a geohash approach to the task above. Write a function that takes as input a location x in Lat/Long, a distance y in kilometers, and a list of event codes z and returns a data structure containing all events of type z within the minimum geohash bounding box that includes distance y from point x.

Provide a nicely formatted (i.e. easy for the instructor to read) list of the 10 closest incidents when x = (15.05, 1.82), y = 300, and z = [13, 14, 18, 19, 20]. This should just be a very short summary of the events to demonstrate that the approach works.

'geohash_column' is used to create a new column added to our dataframe that includes the geohash for each of the locations in the dataframe.

In [21]:
def geohash_column(DF):
    ''' add a new column to a dataframe that is made up of geohashes.
    Input: a dataframe that includes latitudes and longitudes.
    Ouput: a dataframe with a new column that includes geohashes.'''
    
    temp_list = []
    for lat, long in zip(DF.actiongeolat, DF.actiongeolong):        # iterate through our input DF
        gh = pgh.encode(lat, long)                                  # creates a geohash string
        temp_list.append(gh)
    new_ser = p.Series(temp_list)                                   # converts list to series
    DF['geohash'] = new_ser                                         # add new column (series) to dataframe
    
    return DF

'closeGeoEvents' is the geohash implimentation of finding events that are close to our location of interest.

In [26]:
def closeGeoEvents(DF, location, distance, eventCodes):
    ''' works the same as 'close_events' function but does so comparing geohashes rather than lat / longs. 
    Input: latitude, longitude in form (15.05, 1.82), distance from lat / long in km, and a list of event codes
           in form [13, 14, 18, 19, 20]. 
    Output: a dataframe of events that satisfy the criteria. '''
    
    new_df = DF[DF['eventcode'].isin(eventCodes)]                       # create a smaller DF first by creating new
                                                                        # dataframe with only event types of interest
    distance = distance * 3280.84                                       # convert kilometers to feet
    geoSet = set()
    lat = location[0]
    long = location[1]
    loc1 = pgh.encode(lat, long)                                        # convert area of interest to a geohash
    
    for loc2 in new_df['geohash']:                                      # loc2 is a geohash string
        diff = pgh.geohash_approximate_distance(loc1, loc2)
        if  diff < distance:                                            # find distance between two geohashes
            geoSet.add(loc2)        
    
    final_df = new_df[(new_df['geohash'].isin(list(geoSet)))]           # create new dataframe with results                  
    
    return final_df

The function below takes a dataframe previously computed via 'close_events' or 'closeGeoEvents' and runs the results through a haversine distance check to make sure results fall within the distance threshold.  It also sorts the results in order of closest events and prints only the top 10 closest.

In [23]:
def extractorSorter(DF, location, distance):
    ''' pull out only the desired info from a dataframe, then sort it.
    Input: a dataframe, location of interest, and a distance threshold.
    Ouput: a sorted list with select attributes.'''

    attributes = {}
    for lat, long, code, url, ID in zip(DF.actiongeolat, DF.actiongeolong, 
                                    DF.eventcode, DF.sourceurl, DF.globaleventid):
        loc2 = (lat, long)
        dist = haversine(location, loc2)
        if dist < distance:
            attributes[(ID, lat, long, code, url)] = dist
            
    order_list = [key for (key, value) in sorted(attributes.items(), reverse=False)]
            
    for item in order_list[:10]:
        ID, lat, long, code, url = item[:]
        print('Event ID: {} Lat: {} Long: {} Event Code: {} URL: {}'.format(ID, lat, long, code, url))

Using the function 'geohash_column', we create a new dataframe with the new column composed of geohashes for each
lat / long. 

In [24]:
DF3 = geohash_column(df1)                                          # call 'geohash_column' to generate a new dataframe
#print(DF3['geohash'])                                             # with the added geohash column

Next, we call the function 'closeGeoEvents' to find the events that are close to our location of interest. Events are checked using the geohash we previously created rather than the lat / long as we did in Task 1 using 'close_events'.

It's important to note that 'closeGeoEvents' will return more than just events that are inside the distance threshold.  It will also return events that are relatively close.  If you then want to only see events within the distance threshold, you must run the results of 'closeGeoEvents' through the function 'extractorSorter.'

In [27]:
%time DF4 = closeGeoEvents(DF3, (15.05, 1.82), 300, ['013', '014', '018', '019', '020'])

CPU times: user 8 ms, sys: 4 ms, total: 12 ms
Wall time: 14.3 ms


## Displayed Results

We call the 'extractorSorter' function **twice**, once on our results using the haversine method and the second time using our geohash method.  The results are the same. The only thing that is different in the function call is the dataframe we pass.

The 'extractorSorter' makes use of haversine, which is redundant in the case of passing our first dataframe, because we already calculated the haversine distances.  The function was created to be an abstraction, so that regardless of which method you used previously, you could pass the same parameters and get the same result displayed.

In [98]:
extractorSorter(DF2, (15.05, 1.82), 300)

Event ID: 700438321 Lat: 15.0522 Long: 1.8318 Event Code: 013 URL: http://www.tbo.com/ap/national/us-general-lays-out-niger-attack-details-questions-remain-ap_nationalb298bf48455b4f38ba029bf165060897
Event ID: 700438365 Lat: 13.5167 Long: 2.11667 Event Code: 020 URL: http://dailycaller.com/2017/10/23/dunford-reveals-new-details-on-fatal-niger-ambush/
Event ID: 700438370 Lat: 13.5167 Long: 2.11667 Event Code: 020 URL: http://dailycaller.com/2017/10/23/dunford-reveals-new-details-on-fatal-niger-ambush/


In [97]:
extractorSorter(DF4, (15.05, 1.82), 300)

Event ID: 700438321 Lat: 15.0522 Long: 1.8318 Event Code: 013 URL: http://www.tbo.com/ap/national/us-general-lays-out-niger-attack-details-questions-remain-ap_nationalb298bf48455b4f38ba029bf165060897
Event ID: 700438365 Lat: 13.5167 Long: 2.11667 Event Code: 020 URL: http://dailycaller.com/2017/10/23/dunford-reveals-new-details-on-fatal-niger-ambush/
Event ID: 700438370 Lat: 13.5167 Long: 2.11667 Event Code: 020 URL: http://dailycaller.com/2017/10/23/dunford-reveals-new-details-on-fatal-niger-ambush/


# Task 3

Benchmark the performance (speed/results) of your functions for Great Circle / Euclidean Distance and Geohash when x= (15.05, 1.82), y = 300, and z = [13, 14, 18, 19, 20].

Note that building the data structure(s) that stores your data should NOT be included in your speed benchmarking. What matters is the speed of the lookup on query and the results returned. What is good and bad about each approach (compare/contrast)? How can you leverage the strengths of both approaches? Implement an improved approach for the final project and provide a demonstration of its use in Task 4.

## Discussion on Timing

Based on the way the functions were created, comparing the timing of the haversine method and geohash method is not an apples to apples comparison.  In the haversine function we compare lat / longs and throw them into a data structure if they're under our distance threshold, then we have to query for **two** items, latitude and longitude.  In the geohash function, we throw geohashes that are close into a data structure, but then only query for **one** item, the geohash.  This is probably a negligible detail.  What isn't negligible though is that we first created a new column of geohashes.  We had to convert the lat / longs of the dataframe to geohashes and append them to the dataframe.  This is a process that takes some time that the haversine method didn't have to do.  With that being said, the geohash method seems to be quite a bit faster, relatively speaking.  The timeing results appear under the function calls for 'close_events' and 'closeGeoEvents.' At scale, the gap is likely to increase.  

One of the bigger time savers though was first making a filterd copy of the original dataframe.  We know we're only interested in certain events, so filtering for those first reduces our searchable dataframe down significantly.  We make many fewer distance comparisons when working with a smaller dataframe.  This task is done in each of the functions, 'close_events,' and 'closeGeoEvents.'

The difference in timeing may be neglible and unimportant depending on the task.  But overall, the best solution might be some combination of both as is done in Task 4.

# Task 4

Write a function that returns an interactive map and an interactive report given the inputs specified above. Demonstrate the use of this on a location of your choice (make it interesting), for a distance of 250 KM for event codes [13, 14, 18, 19, 20].

**The combination of the Interactive Map and the Interactive Report makes use of both geohash and haversine** by way of calling both 'closeGeoEvents' and 'extractorSorter.'

First, the data must be pulled from Gdelt using the 'events' function as we did in Task 0.  The code below is what follows after this initial step.

In the below code, we can easily adjust the parameters by changing: loi, dist, and events.  When parameters are set, we call the 'geohash_column' function to add the geohashes to the dataframe.  Once we have the geohash column added, we call 'closeGeoEvents' which finds the events that are near our location of interest by way of geohash.  Once complete we generate our interactive map.

## Interactive Map

The interactive map displays a marker over our location of interest and a circle around it with it's radius set to our distance threshold.  The other events returned by 'closeGeoEvents' are also depicted with markers.  When you click on a marker it displays the URL for the story on the given event. 

In [71]:
loi = [14.58, 120.98]                                                      # set location of interest
dist = 250                                                                 # set distance threshold in kilometers
events = ['013', '014', '018', '019', '020']                               # set event codes of interest

DF5 = geohash_column(df1)                                                  # create geohash column
DF5 = closeGeoEvents(DF5, loi, dist, events)                               # filter for events of interest

locationlist = zip(list(DF5['actiongeolat']), list(DF5['actiongeolong']))
locationlist = list(locationlist)

m = folium.Map(location=loi,zoom_start=7, control_scale=True)             # initialize map 

# the line below draws a circle around the location of interest with a radius of distance threshold
folium.Circle(radius=(dist*1000), location=loi, popup='Starting Location', color='crimson', fill=False).add_to(m)

folium.Marker(loi, popup='Start').add_to(m)                               # display location of interest

for point in range(0, len(locationlist)):                                 # display each location returned
    folium.Marker(locationlist[point], popup=list(DF5['sourceurl'])[point]).add_to(m)

m                                                                          # display the map

## Interactive Report

We call the 'extractorSorter' function to generate a rudimentary report that allows the user to see which of the returned events fall inside the distance threshold, the event ID, location, event code, and an interactive URL that the user can click on to go to the story.

As mentioned previously, 'extractorSorter' makes use of the haversine distance formula to further reduce the list of events down to only ones that are within the distance threshold. 

In [66]:
extractorSorter(DF5, loi, 250)

Event ID: 694582027 Lat: 15.8334 Long: 120.37799999999999 Event Code: 020 URL: http://newsinfo.inquirer.net/935103/cops-in-drug-slays-seek-church-protection
Event ID: 694583543 Lat: 15.8334 Long: 120.37799999999999 Event Code: 020 URL: http://newsinfo.inquirer.net/935103/cops-in-drug-slays-seek-church-protection
Event ID: 696468129 Lat: 13.0 Long: 122.0 Event Code: 013 URL: http://newsinfo.inquirer.net/936798/sereno-pins-hopes-on-support-of-people-in-impeachment-case
Event ID: 696468159 Lat: 13.0 Long: 122.0 Event Code: 013 URL: http://newsinfo.inquirer.net/936798/sereno-pins-hopes-on-support-of-people-in-impeachment-case
Event ID: 697423975 Lat: 13.0 Long: 122.0 Event Code: 020 URL: http://www.tbo.com/news/things-to-know-in-the-world-for-oct-13/2340924
Event ID: 697741139 Lat: 15.8334 Long: 120.37799999999999 Event Code: 013 URL: http://newsinfo.inquirer.net/937810/philippine-news-updates-president-duterte-war-on-drugs-philippine-drug-enforcement-agency
Event ID: 697741146 Lat: 15.833