<a href="https://colab.research.google.com/github/gisalgs/notebooks/blob/main/spatial-big-data.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Spatial Big Data

Big data has been a buzzword for quite a few years now. The main characteristic of big data is its size, or volume. For example, think about how many tweets have ever been posted and how many are still coming in at this moment. There are a LOT. Also, think about the velocity of these data sets: new data arrive at every moment. There are also many different kinds, or variety, of data sets produced at high volume and high velocity. For example, in addition to tweets, think about how many transactions are made at any moment at Wal-Mart or on online shopping sites. These three V's (volume, velocity, and variety) are the main characteristics of [big data](https://en.wikipedia.org/wiki/Big_data#Characteristics), along with other terms, many of them also starting with v (veracity, value, and variability). In short, there is more than enough data out there!

In this tutorial, we are going to explore a few ways of collecting big data. We focus on those data sets that have spatial components and we call them spatial big data. 

## Data feeds

To some extent, data feeds are the simplest form of online data. The idea is to have a fixed online file that the data provider simply keeps overwriting with new content. A lot of government agencies use this approach to make their data available to the general public with no strings attached. The example we will use here is the data feed for global earthquakes provided by the USGS, as detailed on the page linked below:

https://earthquake.usgs.gov/earthquakes/feed/v1.0/geojson.php

More specifically, we will use all quakes from the past day, available in the following online GeoJSON file:

https://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_day.geojson

As the name suggests, this is a GeoJSON file and we can use it as we did with the coastlines data. We can plot the earthquakes along with the graticule and coastlines, which provides a quick way of exploring the data.

We will first import some necessary Python packages that allow us to request the GeoJSON file at a URL:

In [None]:
import urllib.request as request
import json 

Then, following the earlier examples, we import what we need from our own modules:

In [None]:
!git clone https://github.com/gisalgs/geom.git 

from geom.worldmap import *
from geom.plot_worldmap import *

This is how we get the quakes (again, similar to the coastlines):

In [None]:
url_1day_quakes = 'https://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_day.geojson'
with request.urlopen(url_1day_quakes) as response:
    quakes1days = json.loads(response.read())

Typically we want to know about what's inside the GeoJSON:

In [None]:
quakes1days.keys()

Let's see how many earthquakes occurred around the world in the past day:

In [None]:
len(quakes1days['features'])

In [None]:
quakes1days['features'][0].keys()

In [None]:
quakes1days['features'][0]['properties'].keys()
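
Each feature pairs a `geometry` with a `properties` dictionary. The sketch below uses a hand-made feature (the values are made up for illustration and are not real USGS output) to show how the location, magnitude, and time might be pulled out. It assumes the feed stores `time` as milliseconds since the Unix epoch and the coordinates as longitude, latitude, depth; `sample_feature` and `summarize` are hypothetical names.

```python
from datetime import datetime, timezone

# A made-up feature that mimics the layout of one item in quakes1days['features']
sample_feature = {
    'type': 'Feature',
    'properties': {'mag': 4.7, 'place': 'somewhere (made up)', 'time': 1603688400000},
    'geometry': {'type': 'Point', 'coordinates': [-122.5, 38.8, 10.0]},
}

def summarize(feature):
    '''Return (lon, lat, magnitude, UTC time) for one quake feature.'''
    lon, lat = feature['geometry']['coordinates'][:2]   # the third value is depth
    props = feature['properties']
    # assumption: 'time' is in milliseconds since the epoch
    when = datetime.fromtimestamp(props['time'] / 1000, tz=timezone.utc)
    return lon, lat, props['mag'], when

print(summarize(sample_feature))
```

The same function could be applied to `quakes1days['features'][0]` once the feed has been loaded.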

To make a quick map, let's get the coastlines and graticule:

In [None]:
url = 'https://raw.githubusercontent.com/gisalgs/data/master/ne_110m_coastline.geojson'
raw_points, numgraticule, numline = prep_projection_data(url, _use_lib='URL')

We also pull the features for convenience:

In [None]:
quakes_features = quakes1days['features']

Now we are going to make a map. We will show the quakes above magnitude 4.5 and those at or below it in two different colors.

In [None]:
_, ax = plt.subplots(1, 1)

plot_world(ax, raw_points, numgraticule, numline, 'lightgrey')


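# Some records may report no magnitude (None), so skip those before comparing
quakes_45 = [q for q in quakes_features if q['properties']['mag'] is not None and q['properties']['mag'] > 4.5]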
points = [ [q['geometry']['coordinates'][0], q['geometry']['coordinates'][1]] for q in quakes_45 ]
ax.scatter([p[0] for p in points], [p[1] for p in points], color='red', marker='o', alpha=0.8, zorder=2)

# TODO
#    Draw small (<=4.5) quakes in blue
#




plt.show()

## Air Quality Using AirNow API

Some web services provide more data but require an account. One example is the air quality data provided by [AirNow](https://www.airnowapi.org), a web service by the EPA. Anybody who wants to access their data must log in, so it is important for each of us to have a valid account at this website. Please sign up using the following link:

https://docs.airnowapi.org/account/request/

After signing up, AirNow will provide an API Key that is required to access their data sets. The key will come in an email and is also shown on their web site. 

**TODO: Sign up for an AirNow account and obtain an API key**

For example, the following is a URL that will return a JSON file for ozone, PM 2.5, and PM 10 air quality indexes for a geographic area around Columbus (please replace YOUR_API_KEY with a real key):

```
http://www.airnowapi.org/aq/data/?startDate=2016-04-21T12&endDate=2016-04-21T13&parameters=O3,PM25,PM10&BBOX=-83.368244,39.586371,-82.269611,40.344184&dataType=A&format=application/json&verbose=1&API_KEY=YOUR_API_KEY
```

API stands for application programming interface. It is a layer between us and the data that lets us control, from a programming environment, what we request. The "programming" part of the AirNow API is not directly relevant to us since they do not provide a dedicated code library. Most of their services are accessed through URLs like the one above. Each URL returns a data file and we just need to read its content. Everything here is online and we do not need to download anything.

Here, our strategy is simple: we will write code to construct a string that contains the correct URL. The URL above contains a [**query string**](https://en.wikipedia.org/wiki/Query_string): the address of the web service is followed by a question mark, and after the question mark come the parameters, each of the form parameterName=value. Multiple parameters are separated by ampersands (&).
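
As a side note, Python's standard library can assemble a query string from a dictionary. The following is only a sketch of that idea using the parameter names from the example URL above; `base_url`, `params`, and `example_url` are illustrative names, and the key is a placeholder. The `safe` argument keeps commas and slashes unescaped so the result reads like the hand-written URL.

```python
from urllib.parse import urlencode

base_url = 'https://www.airnowapi.org/aq/data/'
params = {
    'startDate': '2016-04-21T12',
    'endDate': '2016-04-21T13',
    'parameters': 'O3,PM25,PM10',
    'BBOX': '-83.368244,39.586371,-82.269611,40.344184',
    'dataType': 'A',
    'format': 'application/json',
    'verbose': '1',
    'API_KEY': 'YOUR_API_KEY',      # placeholder, replace with a real key
}

# urlencode joins name=value pairs with '&'; the '?' is added once, by hand
example_url = base_url + '?' + urlencode(params, safe=',/')
print(example_url)
```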

It should be noted that AirNow uses UTC (Coordinated Universal Time), which is 4 hours ahead of us in the Eastern time zone during daylight saving time (5 hours during standard time). So 1 AM Eastern Daylight Time is 5 AM UTC. There are modules in Python that are convenient for processing dates and times, but here we will simply set the time manually. A time of `2020-10-26T05` is October 26, 2020 at 1 AM Eastern. The following code constructs the correct URL that is ready to use. (Again, please make sure to replace **YOUR_API_KEY** with a real key.)

In [None]:
options = {
    'url': 'https://airnowapi.org/aq/data/',
    'startDate': '2020-10-26T05',
    'endDate': '2020-10-26T06',
    'parameters': 'o3,pm25',
    'bbox': '-84.815,38.4,-80.5,42',  # OHIO!
    'data_type': 'b',
    'format': 'application/json',
    'verbose': '1',
    'api_key': 'YOUR_API_KEY'
}

# API request URL: the first parameter follows '?', the rest are joined by '&'
request_url = options['url'] \
              + '?startDate=' + options['startDate'] \
              + '&endDate=' + options['endDate'] \
              + '&parameters=' + options['parameters'] \
              + '&bbox=' + options['bbox'] \
              + '&datatype=' + options['data_type'] \
              + '&format=' + options['format'] \
              + '&verbose=' + options['verbose'] \
              + '&api_key=' + options['api_key']

print(request_url)
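
The UTC hour used for `startDate` and `endDate` was set by hand. A minimal sketch of deriving it with the `datetime` module is shown below. It assumes a fixed offset of 4 hours behind UTC, which holds for the Eastern time zone only during daylight saving time; `EDT`, `local_time`, and `stamp` are illustrative names.

```python
from datetime import datetime, timedelta, timezone

# Assumption: Eastern Daylight Time, a fixed offset of UTC-4
EDT = timezone(timedelta(hours=-4), 'EDT')

local_time = datetime(2020, 10, 26, 1, tzinfo=EDT)    # October 26, 2020 at 1 AM Eastern
utc_time = local_time.astimezone(timezone.utc)        # the same instant expressed in UTC

# Format as year-month-dayThour, the layout used in the options dictionary
stamp = utc_time.strftime('%Y-%m-%dT%H')
print(stamp)
```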

We set the bounding box that contains Ohio. The following code uses the URL to get the stations in Ohio:

In [None]:
try:
    # Perform the AirNow API data request
    with request.urlopen(request_url) as response:
        responses = json.loads(response.read())
except Exception as e:
    print('Unable to perform AirNowAPI request. %s' % e)
    responses = None

Different stations report different parameters (ozone or PM 2.5). Here is the record for the first station:

In [None]:
responses[0]

In [None]:
len(responses)

Now our goal is to map all ozone air quality indexes across Ohio. We first get all the stations that have ozone measures:

In [None]:
# TODO
#    Get all the items in responses that have the Parameter value of OZONE
#    and put them in a list called ozone. (Note Parameter is one of the keys.)
#    This can be done in a list comprehension, but it will be fine to use a loop.

ozone = []


In [None]:
o = ozone[0]

In [None]:
o['Longitude'], o['Latitude'], o['SiteName'], o['AQI']

We need a base map for Ohio. Here is an example and other files can be used too:

In [None]:
oh_url = 'https://raw.githubusercontent.com/gisalgs/data/master/OH_geog.geojson'
with request.urlopen(oh_url) as response:
    oh_counties = json.loads(response.read())

The Ohio GeoJSON was converted from the [US Census cartographic boundary files](https://www.census.gov/geographies/mapping-files/time-series/geo/carto-boundary-file.html) using QGIS. It appears that QGIS saves all polygons as multipolygons. This turns out to be convenient. To draw the Ohio map along with the EPA stations, we first write a function that draws a simple polygon (meaning there are no holes or multiple parts):

In [None]:
def plot_simple_polygon(ax, points, color='lightgrey'):
    '''
    Draws the outline of a simple polygon on ax.
    points = [ [x,y], [x,y], ...]
    '''
    pts = [[p[0], p[1]] for p in points]
    # plt is matplotlib.pyplot (import matplotlib.pyplot as plt)
    poly = plt.Polygon(pts, color=color, fill=False, closed=True)
    ax.add_patch(poly)    # a Polygon is a patch, not a line
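
To make the multipolygon layout concrete: a MultiPolygon's `coordinates` is a list of polygons, each polygon is a list of rings, and the first ring of each polygon is its outer boundary. The sketch below pulls out those outer rings from a made-up geometry (two small squares, not real Ohio data); `exterior_rings` and `sample_geometry` are hypothetical names.

```python
def exterior_rings(geometry):
    '''Return the outer ring of every part of a Polygon or MultiPolygon.'''
    coords = geometry['coordinates']
    if geometry['type'] == 'Polygon':
        return [coords[0]]                          # one polygon: its first ring
    if geometry['type'] == 'MultiPolygon':
        return [polygon[0] for polygon in coords]   # first ring of each part
    raise ValueError('Unsupported geometry type: ' + geometry['type'])

# A made-up MultiPolygon with two square parts
sample_geometry = {
    'type': 'MultiPolygon',
    'coordinates': [
        [[[0, 0], [1, 0], [1, 1], [0, 1], [0, 0]]],
        [[[2, 2], [3, 2], [3, 3], [2, 3], [2, 2]]],
    ],
}

rings = exterior_rings(sample_geometry)
print(len(rings), 'outer rings')
```

Each ring returned this way is a list of points that could be passed to `plot_simple_polygon`.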


In [None]:
_, ax = plt.subplots(1,1)

for f in oh_counties['features']:
    geom = f['geometry']['coordinates']
    for part in geom:
        points = part[0]
        plot_simple_polygon(ax, points)

for s in ozone:
    ax.plot(s['Longitude'], s['Latitude'], color='lightgreen', marker='o')
    ax.text(s['Longitude'], s['Latitude'], s['AQI'], color='grey')

        
ax.axis('equal')                        # x and y on the same scale
ax.axes.get_xaxis().set_visible(False)  # don't show axis
ax.axes.get_yaxis().set_visible(False)  # don't show axis
ax.set_frame_on(False)                  # no frame either

plt.show()

In [None]:
# TODO
#    Copy the above code here and then make some changes
#    so that the labels of the top 5 stations with the highest indexes will be in red





## Twitter data and API

Twitter data is huge and we can get a piece of it using their API. Unlike AirNow, this API does come with a programming library: the Python module is called tweepy and can be installed like this:

In [None]:
# The StreamListener class used below was removed in tweepy 4.0, so install a 3.x release
!pip install "tweepy<4"

In [None]:
from tweepy.streaming import StreamListener
from tweepy import OAuthHandler, Stream
import time

Getting Twitter data requires a lot of authentication, which is handled through several tokens. To get those tokens, one must sign up for a developer account at https://developer.twitter.com. The purpose here is to provide a quick introduction to the API and some basic ways to explore the data. The following tokens were generated by Twitter under a free license and are provided here for teaching purposes. These tokens will be deactivated at some point.

In [None]:
tokens = {"consumer_key": 'ZYD8f3hGFY3tdclRWEXoUhxpM',
    "consumer_secret": 'V4cjXWVstatZMaoMf8hiE10JCCykNRU82PRTLnN34IU9rXPupT',
    "access_token": '20824990-5LQAVwk666yeGol0i0ou2vEs2sqSbusPTZOpdk5KD',
    "access_token_secret": 'x6K30ramXKr4Awu7e8S0nosy5K2R0ZsWix6mU6LOkFTGf'
}

Even with a free license, Twitter allows some limited data streaming and Twitter functions. We will stream some geo-tagged tweets from around the world. Geo-tagged tweets come in different shapes. Some tweets have a specific geolocation provided as coordinates. Some don't, but they provide some information about the place where the tweets are sent. Either way, we will stream tweets by providing a bounding box so that all the tweets sampled here must fall into this region. Our region in this specific case is the globe (note the commented numbers are for Central Ohio):

In [None]:
region_bbox = [-180, -90, 180, 90] # [-84.50, 39.50, -81.70, 41]
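
The bounding box lists the west, south, east, and north limits in that order. As a small illustration of what "falling into the region" means, the sketch below tests whether a point lies inside a box; `in_bbox` is a hypothetical helper and the Columbus coordinates are approximate values chosen for the example.

```python
def in_bbox(lon, lat, bbox):
    '''Return True if the point (lon, lat) lies inside bbox = [west, south, east, north].'''
    west, south, east, north = bbox
    return west <= lon <= east and south <= lat <= north

central_ohio = [-84.50, 39.50, -81.70, 41]    # the commented-out box above
columbus = (-83.0, 39.96)                     # approximate location, for illustration only

print(in_bbox(columbus[0], columbus[1], central_ohio))
print(in_bbox(0, 0, central_ohio))
```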

The way tweepy works is to define a listener that constantly listens to the Twitter world and harvests the tweets that the license allows into the stream. Below we extend the listener into our own class called `xGeoStdListener`.

Twitter will provide tweets as JSON objects. We will see its detailed information later, but right now we need to be able to save all the JSON objects into a file as a list.

In the `__init__` method, the user can give us the duration of this streaming process. By default, we will just stream for 10 seconds. Also in this method we get the current time (so we can check the duration down the road), and open a file called sometweets.json. We will save all the tweets sampled into this file. This file will be a list of JSON objects, so we first write the left bracket into the file. Things will pile up in the file once we get more data from Twitter. We set a counter to zero at the beginning. The exact number of tweets is not critical, but the counter tells us whether a tweet is the first one: we add a comma before every tweet (again, a JSON object) except the first (the one immediately after the left bracket).

Every time a new tweet is sampled, the API automatically calls the `on_data` method and passes the data to it.

In [None]:
class xGeoStdListener(StreamListener):
    def __init__(self, duration=10):
        super().__init__()                          # let the base listener set itself up
        self.duration = duration
        self.time0 = time.time()
        self.file = open('sometweets.json', 'w')
        self.file.write('[')
        self.count = 0
    def on_data(self, data):
        decoded = json.loads(data)                  # parse the raw JSON string
        if time.time()-self.time0 < self.duration:  # still within the duration
            if self.count > 0:                      # if it is not the first tweet sampled
                self.file.write(',\n')              # add a comma and a new line
            json.dump(decoded, self.file)           # save the tweet into sometweets.json
            self.count += 1
            return True
        else:
            self.file.write(']')                    # add the right bracket
            self.file.close()                       # make sure the file is closed
            return False                            # returning False stops the stream
    def on_error(self, status):
        print(status)
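
The bracket-and-comma bookkeeping used by the listener can be tried without Twitter. The sketch below writes a few made-up records to an in-memory buffer in the same way and checks that the result parses as a JSON list; `write_json_list` is a hypothetical helper, not part of tweepy.

```python
import io
import json

def write_json_list(records, out):
    '''Write records one at a time so that the output forms a valid JSON list.'''
    out.write('[')
    for count, record in enumerate(records):
        if count > 0:              # every record except the first gets a leading comma
            out.write(',\n')
        json.dump(record, out)
    out.write(']')

buffer = io.StringIO()             # stands in for the file sometweets.json
write_json_list([{'text': 'first'}, {'text': 'second'}], buffer)

parsed = json.loads(buffer.getvalue())
print(len(parsed), 'records read back')
```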

The following code first creates the listener object and then passes the tokens for authentication. The line `stream = Stream(auth, listener)` sets up the stream, and once `stream.filter` is called the API starts a loop that keeps listening until `on_data` returns `False`. The `locations` argument makes sure only tweets located within the bounding box are sampled here. Running the following code needs some patience because it takes about 10 seconds to finish.

It is also important to guess the best time or worst time to collect tweets. Normally most tweets are sent during the day (North America has most of the tweets). During the night things are quiet and we should not expect too many tweets. This will be important when the bounding box is very small.

Also, it is necessary to know that the number of geo-tagged tweets is [relatively small](https://gwu-libraries.github.io/sfm-ui/posts/2017-04-12-geographic-collecting), only about 2 percent of all the tweets. So we should also not expect we will get a lot of tweets here.

In [None]:
try:
    listener = xGeoStdListener()
    auth = OAuthHandler(tokens['consumer_key'], tokens['consumer_secret'])
    auth.set_access_token(tokens['access_token'], tokens['access_token_secret'])
    stream = Stream(auth, listener) # this line supports continuous harvesting
    stream.filter(locations=region_bbox)
except Exception as err:
    print(err)

We need to open the file to retrieve the saved data. Please note the file will be saved into the folder where the program is run, so typically there is no need to change the folder.

In [None]:
with open('sometweets.json') as tweets_file:
    tweets = json.load(tweets_file)

In [None]:
len(tweets)

In [None]:
tweet = tweets[0]
tweet.keys()

There are three keys useful in our case. The `text` key is where the actual tweet is stored. The `coordinates` key holds the point location of the tweet when the user shares it. A user who wants more privacy can choose to show the city instead of the coordinates. This is less ideal for us because we won't get the actual coordinates, but we can make up for it a little: the `place` key contains the bounding box of the region (a city or anything else), which can be used to compute a point.

First, let's see how many tweets have coordinates:

In [None]:
tt = [t for t in tweets if t.get('coordinates') is not None]  # .get avoids a KeyError on items without the key
len(tt)

The following code gets the tweets with `coordinates` being None and shows what's in the `place` key:

In [None]:
tt = [t for t in tweets if 'place' in t and t.get('coordinates') is None]  # tweets with a place but no point
tt[0]

Now we first get all the geo-tagged tweets by making sure we have the `place` key. Those tweets with `coordinates` not being None still have the `place` key. Sometimes things may go wrong and we will have items in the list without any of those. The following code makes sure only the geo-tagged tweets are used for further analysis. 

In [None]:
geotagged = [t for t in tweets if 'place' in t.keys()]
len(geotagged)

Now in the following loop, we make sure every tweet has a valid `coordinates` key. When the original tweet has a value of None, we assign the average of the bounding box corner coordinates to the `coordinates` key. Note the value of this key is itself a JSON object with the keys `type` and `coordinates`.

In [None]:
for gt in geotagged:
    if gt.get('coordinates') is None:       # missing or None: fall back to the bbox
        place = gt.get('place')
        if place is not None:
            coords = place['bounding_box']['coordinates'][0]   # corner points of the bbox
            x = sum(p[0] for p in coords) / len(coords)
            y = sum(p[1] for p in coords) / len(coords)
            gt['coordinates'] = {'type': 'Point',
                                 'coordinates': [x, y]}
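
To see the averaging on its own, here is a minimal sketch that computes the center of a bounding box for a made-up `place` value. The layout (a list holding one ring of four longitude/latitude corners) mirrors what the loop above expects; `bbox_center` and `sample_place` are hypothetical names and the numbers are invented.

```python
def bbox_center(place):
    '''Return [x, y], the average of the corner points of a place's bounding box.'''
    corners = place['bounding_box']['coordinates'][0]
    x = sum(p[0] for p in corners) / len(corners)
    y = sum(p[1] for p in corners) / len(corners)
    return [x, y]

# A made-up place whose box spans 2 degrees in each direction
sample_place = {
    'bounding_box': {
        'type': 'Polygon',
        'coordinates': [[[-84.0, 39.0], [-82.0, 39.0], [-82.0, 41.0], [-84.0, 41.0]]],
    }
}

center = bbox_center(sample_place)
print(center)
```

The center is only a stand-in for the true location, so points derived this way are less precise than real coordinates.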


We draw the map. To make sure all the dots are on top of the figure, we set the `zorder` to 2 for the dots. The default value of 1 will put the dots at the same level as other things in the map, making them hard to see.

In [None]:
_, ax = plt.subplots(1, 1)

plot_world(ax, raw_points, numgraticule, numline, 'lightgrey')


# keep only the tweets that ended up with usable coordinates
points = [ [t['coordinates']['coordinates'][0], t['coordinates']['coordinates'][1]] for t in geotagged if t.get('coordinates') ]
ax.scatter([p[0] for p in points], [p[1] for p in points], color='green', marker='o', s=3, alpha=0.8, zorder=2)

plt.show()

What can we do with these tweets? A lot! But how about finding some happy tweets? We will first get a list of words that indicate happiness and then check if any of the words appear in the `text`. We must note this is not conclusive at all: first, we don't have a complete list of happy words, and second, there are tweets in languages other than English.

In [None]:
happy = ['happy', 'good', 'wonderful', 'nice', 'eat', 'joyous', 'joy', 'pleased', 'cheer']

def has_keywords(keywords, text):
    text = text.lower()             # so that 'Happy' and 'HAPPY' also match
    for k in keywords:
        if k in text:               # substring test
            return True
    return False
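
One caveat of the substring test in `has_keywords` is that it also fires on words that merely contain a keyword, for example `eat` inside `great`. The sketch below is one possible alternative that compares whole words instead. It is an illustration only: `has_keywords_as_words` is a hypothetical name, and the simple pattern assumes English text.

```python
import re

def has_keywords_as_words(keywords, text):
    '''Return True if any keyword appears in text as a whole word (case-insensitive).'''
    words = set(re.findall(r"[a-z']+", text.lower()))   # split the text into lowercase words
    return any(k.lower() in words for k in keywords)

print(has_keywords_as_words(['eat'], 'What a great day'))   # 'eat' is only part of 'great'
print(has_keywords_as_words(['eat'], 'Time to EAT!'))       # 'eat' is a word of its own
```

Whole-word matching misses variants such as `cheers` for `cheer`, so neither approach is complete.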

In [None]:
# TODO
#    Use the has_keywords function to get all the happy tweets





In [None]:
_, ax = plt.subplots(1, 1)

plot_world(ax, raw_points, numgraticule, numline, 'lightgrey')


# keep only the tweets that ended up with usable coordinates
points = [ [t['coordinates']['coordinates'][0], t['coordinates']['coordinates'][1]] for t in geotagged if t.get('coordinates') ]
ax.scatter([p[0] for p in points], [p[1] for p in points], color='grey', marker='o', s=3, alpha=0.8, zorder=2)

# TODO
#    Draw the happy tweets in red
#    make sure these are on top of everything else
#



plt.show()