In [1]:
%load_ext lab_black

In [2]:
import io
import requests
import re
import zipfile
import pathlib
import time
import datetime

import numpy as np

from IPython.display import clear_output
from itertools import chain
from psycopg2 import connect
from sqlalchemy import create_engine

# Why use NLP for time series forecasting?

Time series forecasting is the science of predicting the future, usually for chaotic, hard-to-anticipate quantities. Weather, stock prices, commodity availability, and so on are all quantities that people wish they could know in advance. The problem is that these quantities are often considered to have a random component, which makes predicting them seem like a rather futile effort. That's where I disagree. There are few things in the world which are *truly* random, and most of those are limited to fundamental physical phenomena. Everything else is deterministic.

Weather is definitely chaotic, but with enough information about the initial state of the system, very accurate predictions can be made. Commodity prices depend on supply and demand, which are simple concepts, but the supply and demand of the components needed to produce a commodity may be hard to track. Local problems where the item is produced, supply-chain issues, changes to foreign import policy, product competitors, and so on can all affect how much something will cost, and it can be nearly impossible to automate the tracking of such things. However, if one *did* have all of that information, then knowing the change in an item's price in the immediate future should be very achievable.

How a stock price will change in the future is usually considered anyone's guess, but I think it would be better to say that it's *everyone's* guess. If more people think an asset will increase in price, that leads to a surplus of buyers and then (usually) the price goes up. The problem is that you have to know what people are thinking *before* they act to make use of that logic. Luckily (or, perhaps, unfortunately), thanks to social media, a large portion of the population can't help but share what they're thinking all the time. By scraping Reddit, Twitter, Facebook, and other social media sites, one could form a solid opinion on how to invest just by using the wisdom of the masses.

The point is that these examples (except weather, unless you consider climate change) all have a *human* component. Gas prices are no different. Taxes, emissions policies, what wars are being fought where, who just discovered more oil - all of these things are distinctly human, or at the very least aren't constantly tracked numerically. The federal gasoline tax, for instance, hasn't changed since 1993 ([*source*](https://www.pgpf.org/blog/2021/03/its-been-28-years-since-we-last-raised-the-gas-tax-and-its-purchasing-power-has-eroded)), so it would be foolish to include its value in a model. But if it were raised, the tracked gasoline price would rise too. State taxes have changed in that time, but I couldn't find a good source on what the tax rate was for each state going back in time. Not to mention the changes in attitudes on gasoline usage, oil spills, fracking developments, etc. which have occurred in the past few years. In all of these cases, *someone* is tracking that information, and they *are* recording it. They just aren't tracking it in some public CSV file that I can load. Instead, they're tracking it in text and reporting it to the world. And *that* is why natural language processing should be implemented: it is a way to extract as much information as possible about the human component of a problem, and to quantify it.

# Where to get the data?

The Global Database of Events, Language, and Tone [(GDELT)](https://www.gdeltproject.org/) tracks, well, everything. The project attempts to do two things. First, it tries to create a database of every event that has occurred in the world (over the past few decades, not all of history), and to record who was involved, when and where it occurred, how impactful the event was, how many sources reported on it, and many, many more aspects of the event. Additionally, they store the source URL of the first report on the incident. This is the "Events" data. Second, and far more impressively, they connect everything together in their "Global Knowledge Graph" (GKG). The GKG is massive, and attempts to piece together how different events and people are linked, and it looks for themes present in the event to connect things. However, for the purposes of this project, the GKG data is largely useless because it only expresses connections, and not facts or opinions. The events data doesn't store this either, but it *does* store the source article which, ideally, contains the facts of the situation. Therefore, for each article in the database, I will determine if it is relevant, and, if so, download it.

GDELT offers three ways to acquire their data. The first is to download tens of thousands of ZIP files and piece the events database together yourself. This is a ton of data, making this a pretty terrible idea. The second is to use GDELT's own [analysis service](https://www.gdeltproject.org/data.html#gdeltanalysisservice), which allows you to specify some conditions, after which it will fetch the data and email it to you several hours later. Unfortunately, when I attempted to use it, it redirected me to a page stating that the analysis service isn't finished yet, but should be done sometime in 2019. *Good, good*. Luckily, there is a third option, which is to use Google's BigQuery, which is kind enough to store GDELT's tables for public use. Unluckily, some SQL inspection of the tables showed that only the past week of data seemed to be stored, which was entirely useless. I'm not sure if this was a bug, or if Google may have been moving the storage location of the tables, or what. Therefore, we do what any sane person would do at this point and ~curse GDELT~ happily implement option one.

In [3]:
# gdelt maintains a list of zip files of three types
# one type for events, one for gkg, and another for a supplementary mentions table
# step one is to download this list and parse it
# extracting the download locations of the events files
def get_master_file_list(url):
    r = requests.get(url)
    if not r.ok:
        # should raise if the list is not reachable
        raise RuntimeError(f"Invalid HTTP status {r.status_code}")
    raw_list = r.text.split("\n")
    event_list = []
    for raw in raw_list:
        # looks for a url
        match = re.search(r"https?://[^\s]+", raw)
        if match is None:
            # if for some reason the line is poorly formatted/null, we skip it
            # (otherwise the url from the previous line would be reused)
            continue
        file_location = match.group()

        # events files have a .export file type
        if ".export" in file_location:
            event_list.append(file_location)

    return event_list


# next, these csv files are zipped to compress them
# because I don't want thousands of zip files on my machine
# I will extract and parse them without ever storing the file
def fetch_and_parse_zip(url):
    r = requests.get(url)
    if not r.ok:
        # if the file can't be reached, an empty in-memory file is returned
        # so the caller simply finds no lines to parse
        return io.BytesIO(b"")
    file = r.content
    # the zip content is loaded in from memory
    zip_object = zipfile.ZipFile(io.BytesIO(file), "r")
    # and the first (and only) file in the archive is opened
    csv = zip_object.open(zip_object.namelist()[0])

    # this function returns an open binary file
    return csv

# next, we have to get the lines which are actually relevant
def get_relevant_events_lines(content, words, aggregate_file):
    for line in content:
        try:
            # the lines are in binary, so we decode them to a string
            line = line.decode("utf-8")
        except UnicodeDecodeError:
            # lines which are not valid utf-8 are skipped
            continue
        # the source url is the last tab-separated field of the line
        source_url = line.split("\t")[-1].strip("\n")
        # the url is split on common separators
        terms_in_url = re.split(r"[-._/]+", source_url)
        # the terms are then checked against a list of "interesting" words
        # (exact, case-sensitive matches only)
        # if any are present, the line is written to an aggregator file
        if any(word in terms_in_url for word in words):
            aggregate_file.write(line)

    return


# this function puts everything together
def get_gdelt_events_data(words, progress_file, file_list):
    # first, so this code can start and stop, we will maintain a list
    # in storage of all files which have already been processed
    # (held as a set here so the membership check below stays fast)
    processed_files = {pf.strip() for pf in progress_file.readlines()}
    N = len(file_list)
    p = pathlib.Path("../data")
    p.mkdir(parents=True, exist_ok=True)
    q = p / "gdelt_events.csv"
    with q.open("a+") as agg:
        # keep track of time for a nice printout
        start_time = time.time()
        m = 0
        # iterate through the file list
        for n, file_url in enumerate(file_list):
            if file_url in processed_files:
                # if the file has been processed, skip
                m += 1
                continue
            
            # get the zip and parse the lines
            csv = fetch_and_parse_zip(file_url)
            get_relevant_events_lines(csv, words, agg)
            
            # everything below is just a printout formatter
            clear_output(wait=True)

            elapsed_time = time.time() - start_time
            # rounding to whole seconds first lets timedelta print as H:MM:SS
            # and avoids a seconds field that rounds up to 60
            elapsed_time_string = str(
                datetime.timedelta(seconds=round(elapsed_time))
            )

            estimated_time_remaining = (
                elapsed_time * (N - m) / (n - m + 1)
            ) - elapsed_time
            estimated_remaining_time_string = str(
                datetime.timedelta(seconds=round(estimated_time_remaining))
            )

            print(f"{n + 1}/{N} files parsed")
            print(f"Elapsed time: {elapsed_time_string}")
            print(f"Estimated remaining time: {estimated_remaining_time_string}")

            progress_file.write(file_url + "\n")

    return
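
To make the "never store the file" idea concrete, here is a minimal sketch of the same `zipfile` plus `io.BytesIO` pattern, run against a tiny archive built in memory rather than one downloaded from GDELT. The helper name `first_member_lines`, the file name, and the sample rows are all made up for illustration.

```python
import io
import zipfile


def first_member_lines(zip_bytes):
    """Return the lines of the first file in a zip archive held in memory."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes), "r") as archive:
        first_name = archive.namelist()[0]
        with archive.open(first_name) as member:
            # members are opened in binary mode, so each line is bytes
            return member.readlines()


# build a tiny stand-in archive in memory (hypothetical file name and rows)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr(
        "sample.export.CSV",
        "1\tA\thttps://example.com/oil-news\n2\tB\thttps://example.com/sports\n",
    )

lines = first_member_lines(buffer.getvalue())
print(len(lines))
# the source url is the last tab-separated field, as in the parser above
print(lines[0].decode("utf-8").split("\t")[-1].strip())
```

Nothing touches the disk here: the archive lives in a `BytesIO` buffer from start to finish, which is the behaviour the functions above rely on.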

Those functions will do the heavy lifting for processing the data. Now all we have to do is call them.

In [4]:
events_list = get_master_file_list(
    "http://data.gdeltproject.org/gdeltv2/masterfilelist.txt"
)
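
For a feel of what the parser is doing, here is a small offline sketch of the same URL extraction on a made-up master list. I am assuming each line has a `<size> <hash> <url>` layout and that the three file types can be told apart by a suffix in the file name; the sizes, hashes, host name, and the `.mentions`/`.gkg` suffixes below are placeholders, not real GDELT entries.

```python
import re

# made-up sample in the assumed "<size> <hash> <url>" layout
sample_master_list = (
    "150383 0123456789abcdef http://example.org/gdeltv2/20150218230000.export.CSV.zip\n"
    "318084 fedcba9876543210 http://example.org/gdeltv2/20150218230000.mentions.CSV.zip\n"
    "10768507 0011223344556677 http://example.org/gdeltv2/20150218230000.gkg.csv.zip\n"
    "\n"
)

url_pattern = re.compile(r"https?://\S+")
event_urls = []
for raw in sample_master_list.split("\n"):
    match = url_pattern.search(raw)
    if match is None:
        continue  # blank or malformed line, nothing to extract
    if ".export" in match.group():
        event_urls.append(match.group())

print(event_urls)
```

Only the `.export` entry survives the filter, and the blank trailing lines are skipped rather than raising.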

Next, we define a list of gasoline-related words. It's a very exhaustive list, I know.

In [5]:
OIL_WORDS = [
    "oil", # NOTE: as oil does not strictly refer to petroleum, we will also get some articles on olive oil, etc
    "gas", # same with gas and the state of matter
    "gasoline",
    "petrol",
    "fuel",
    "petroleum",
    "diesel",
]
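
Since the filter compares whole URL terms rather than substrings, it is worth seeing what does and doesn't get through. The sketch below mirrors the splitting logic from `get_relevant_events_lines` in a hypothetical helper, `url_matches`, and runs it on a few invented URLs.

```python
import re

OIL_WORDS = ["oil", "gas", "gasoline", "petrol", "fuel", "petroleum", "diesel"]


def url_matches(source_url, words):
    """Split the URL into terms, then test each word for exact membership."""
    terms_in_url = re.split(r"[-._/]+", source_url)
    return any(word in terms_in_url for word in words)


# hypothetical URLs, chosen only to show the edge cases
examples = [
    "https://example.com/news/gas-prices-rise",  # "gas" is its own term
    "https://example.com/food/olive-oil-recipes",  # false positive on "oil"
    "https://example.com/energy/oilfield-expansion",  # "oilfield" is not "oil"
    "https://example.com/markets/Oil-Futures",  # case-sensitive: "Oil" is not "oil"
]
results = [url_matches(url, OIL_WORDS) for url in examples]
for matched, url in zip(results, examples):
    print(matched, url)
```

The first two URLs match and the last two do not, so the filter lets olive oil in while missing compound words and capitalised terms. Lower-casing the URL before splitting would be one way to catch the last case, if that turns out to matter.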

Now we run the main function. If the function has never been run, the progress file must be created. Otherwise, it is read in from disk. To restart data aggregation, all one has to do is delete the existing `events_progress.txt` file.

(NOTE: The elapsed time output of this cell is indicative of how long it took the cell to run that time, and not a cumulative counter. It currently reads ten minutes, but if it were run from the beginning this cell would take roughly twenty hours to run on my network.)

(SECOND NOTE: I opted to not use multiprocessing to speed up this code. This is because I believe the functions to be IO limited and not CPU limited, so multiprocessing is likely to get very messy and offer little improvement.)

In [6]:
try:
    with open("../data/events_progress.txt", "r+") as progress_file:
        get_gdelt_events_data(OIL_WORDS, progress_file, events_list)
except FileNotFoundError:
    with open("../data/events_progress.txt", "w+") as progress_file:
        get_gdelt_events_data(OIL_WORDS, progress_file, events_list)

272839/272839 files parsed
Elapsed time: 0:10:20
Estimated remaining time: 0:00:00
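
On the note about multiprocessing: if the downloads ever did become worth speeding up, threads rather than processes are the usual fit for I/O-bound work. The following is only a sketch of that alternative using the standard library's `concurrent.futures.ThreadPoolExecutor`; I have not benchmarked it against GDELT. `fetch_all` and `fake_fetch` are hypothetical names, and `fake_fetch` sleeps instead of touching the network so the example runs offline.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch, max_workers=8):
    """Call fetch(url) for every url on a thread pool, keeping input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the same order as the inputs
        return list(pool.map(fetch, urls))


def fake_fetch(url):
    """Hypothetical stand-in for a download: sleeps instead of using the network."""
    time.sleep(0.05)
    return f"contents of {url}"


urls = [f"https://example.com/file_{i}.zip" for i in range(16)]
start = time.time()
results = fetch_all(urls, fake_fetch, max_workers=8)
elapsed = time.time() - start
print(len(results), results[0])
print(f"took {elapsed:.2f} s")
```

In a real run, `fetch` would be something like `fetch_and_parse_zip`, and the writes to the aggregate and progress files would stay in the main thread so the two files cannot get interleaved.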


# SQL Hosting

Once all the data is gathered, we will create a database to hold all the information. The cell below will execute terminal commands to create a fresh database. To execute it, you must alter the files in `etc/` so that `user.password` contains your sudoers password and `postgres.password` contains your postgres password. (Good thing this isn't a cybersecurity project)

In [7]:
%%capture
! sudo -S -i -u postgres dropdb -f gdelt < ../etc/user.password
! sudo -S -i -u postgres createdb gdelt < ../etc/user.password

In [8]:
# create a string of column names and types to save room in the sql command
# id is not a data column found in the gdelt data, but we add it to avoid trouble with duplicates

event_columns = """id serial PRIMARY KEY,
                   GlobalEventID integer, 
                   Day integer,
                   MonthYear integer,
                   Year integer,
                   FractionDate numeric,
                   Actor1Code text,
                   Actor1Name text,
                   Actor1CountryCode text,
                   Actor1KnownGroupCode text,
                   Actor1EthnicCode text,
                   Actor1Religion1Code text,
                   Actor1Religion2Code text,
                   Actor1Type1Code text,
                   Actor1Type2Code text,
                   Actor1Type3Code text,
                   Actor2Code text,
                   Actor2Name text,
                   Actor2CountryCode text,
                   Actor2KnownGroupCode text,
                   Actor2EthnicCode text,
                   Actor2Religion1Code text,
                   Actor2Religion2Code text,
                   Actor2Type1Code text,
                   Actor2Type2Code text,
                   Actor2Type3Code text,
                   IsRootEvent integer,
                   EventCode text,
                   EventBaseCode text,
                   EventRootCode text,
                   QuadClass integer,
                   GoldsteinScale text,
                   NumMentions integer,
                   NumSources integer,
                   NumArticles integer,
                   AvgTone numeric,
                   Actor1Geo_Type integer,
                   Actor1Geo_Fullname text,
                   Actor1Geo_CountryCode text,
                   Actor1Geo_ADM1Code text,
                   Actor1Geo_ADM2Code text,
                   Actor1Geo_Lat text,
                   Actor1Geo_Long text,
                   Actor1Geo_FeatureID text,
                   Actor2Geo_Type integer,
                   Actor2Geo_Fullname text,
                   Actor2Geo_CountryCode text,
                   Actor2Geo_ADM1Code text,
                   Actor2Geo_ADM2Code text,
                   Actor2Geo_Lat text,
                   Actor2Geo_Long text,
                   Actor2Geo_FeatureID text,
                   ActionGeo_Type integer,
                   ActionGeo_Fullname text,
                   ActionGeo_CountryCode text,
                   ActionGeo_ADM1Code text,
                   ActionGeo_ADM2Code text,
                   ActionGeo_Lat text,
                   ActionGeo_Long text,
                   ActionGeo_FeatureID text,
                   DATEADDED bigint,
                   SOURCEURL text"""

# NOTE: strictly speaking, many columns (such as GoldsteinScale)
# should be of type numeric. However, these columns contain
# empty strings in the .csv files. This will cause issues
# in the COPY FROM postgreSQL command. Therefore, we treat
# these columns as text initially. If reason is found to use
# these columns in the analysis, the columns can be converted later
# within the database or via python after the data is fetched

# we also want a string of column names to actually insert the data
event_columns_no_types_no_id = (
    event_columns.replace(" text", "")
    .replace(" numeric", "")
    .replace(" bigint", "")
    .replace(" integer", "")
    .replace("id serial PRIMARY KEY,", "")
)
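
As for converting the text columns later, one option inside the database is an `ALTER TABLE ... TYPE numeric USING ...` statement that maps empty strings to NULL before casting. The sketch below only builds the statements as strings; the helper `numeric_cast_statement` is a hypothetical name, and nothing here has been run against the actual table.

```python
def numeric_cast_statement(table, column):
    """Build an ALTER TABLE statement that turns '' into NULL and casts the rest."""
    # plain string formatting is only acceptable because the names are
    # hard-coded and trusted; never build identifiers from user input this way
    return (
        f"ALTER TABLE {table} "
        f"ALTER COLUMN {column} TYPE numeric "
        f"USING NULLIF({column}, '')::numeric"
    )


# columns stored as text above that hold numbers; pick whichever the analysis needs
statements = [
    numeric_cast_statement("events", column)
    for column in ["GoldsteinScale", "ActionGeo_Lat", "ActionGeo_Long"]
]
for statement in statements:
    print(statement)
```

Each statement could then be passed to `cursor.execute` once the table has been loaded.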

In [11]:
# read in the postgres password to form the connection
with open("../data/gdelt_events.csv", "r") as events_file, open(
    "../etc/postgres.password"
) as psql_pass_file:
    # strip the trailing newline that most editors leave at the end of the file
    postgres_password = psql_pass_file.read().strip()
    # form the connection
    conn = connect(
        f"host='localhost' dbname='gdelt' user='postgres' password='{postgres_password}'"
    )
    cursor = conn.cursor()
    # the database is fresh, so create an events table
    create_events_table_cmd = f"CREATE TABLE events({event_columns})"
    # copy from the aggregated csv
    copy_events_cmd = f"COPY events({event_columns_no_types_no_id}) FROM STDIN WITH (FORMAT TEXT, HEADER FALSE)"
    cursor.execute(create_events_table_cmd)
    cursor.copy_expert(copy_events_cmd, events_file)
    conn.commit()

postgres
