# osiris

![img](https://dm2301files.storage.live.com/y4mmRC1xelS6Y6MEqUnZ-k2vjpADHpo6UMZAaZWROunr9-Ml5FYDlZ6WMxCGedy7NDhwDpusZdF5E1oLR5Qn6momydHe7tYUOMwNeFeGW7pUWkBjGPSnZp2sacYWs9IKkose6xjhSySL_v2tbfItRI7T_Pw_Tayhaa2F_vrwW6ucyr6WPa6s9DWH_if9Y5Y3yAU?width=375&height=250&cropmode=none)


osiris is a Python environment for data processing and analysis aimed at computational conflict forecasting. It combines very large datasets with graph-based methods, models, and visualization, and is powered by scalable graph databases.

You can use osiris to analyze causal chains and networks of conflict and violence around the world using [automatically-encoded political event data](https://parusanalytics.com/eventdata/papers.dir/Schrodt_Yonamine_NewDirectionsInText.pdf) that is updated in real time by projects like GDELT. This notebook gives an overview of the osiris project and the [GDELT project](https://www.gdeltproject.org/) data that osiris uses. It shows how to import political event data with osiris, either from the GDELT file server or from Google BigQuery, how to visualize and analyze that data using Python, and how to load it into a TigerGraph graph server instance, where graph-centric queries can efficiently retrieve vertex-edge event data for further analysis.

## Notebook Environment Setup

In [None]:
import os, sys
# Check if running inside Colab or Kaggle
IN_COLAB = 'COLAB_GPU' in os.environ
IN_KAGGLE = 'KAGGLE_KERNEL_RUN_TYPE' in os.environ
IN_HOSTED_NB = IN_COLAB or IN_KAGGLE
os.environ['IN_HOSTED_NB'] = str(IN_HOSTED_NB)

OS_NAME = sys.platform.upper()
if OS_NAME in ['LINUX', 'DARWIN'] and IN_HOSTED_NB:
    import subprocess
    print('Installing osiris from GitHub...')
    print(subprocess.run('if [ -d "osiris" ]; then rm -Rf osiris; fi', text=True, shell=True, check=True, capture_output=True).stdout)
    print(subprocess.run('git clone https://github.com/allisterb/osiris --recurse-submodules', text=True, shell=True, check=True, capture_output=True).stdout)
    print(subprocess.run('cd osiris && ./install', text=True, shell=True, check=True, capture_output=True).stdout)
    if IN_COLAB:
        print('Installing colab-env which can pull env variable values from a file called vars.env on your GDrive.')
        print(subprocess.run('pip install colab-env --upgrade', text=True, shell=True, check=True, capture_output=True).stdout)

# If we're not in a hosted nb env assume we're running Jupyter from the osiris project directory root
OSIRIS_PATH = '..' if not IN_HOSTED_NB else 'osiris'

# Import the osiris code and set the runtime env. 
sys.path.append(os.path.join(OSIRIS_PATH, 'osiris'))
sys.path.append(os.path.join(OSIRIS_PATH, 'ext'))
from osiris_global import set_runtime_env
set_runtime_env(interactive_nb=True)

## GDELT Event Data

*From the [GDELT project](https://www.gdeltproject.org/) website*:
>The GDELT Project is a realtime network diagram and database of global human society for open research.
![gf](https://www.gdeltproject.org/images/spinningglobe.gif)

>The GDELT Project is an initiative to construct a catalog of human societal-scale behavior and beliefs across all countries of the world, connecting every person, organization, location, count, theme, news source, and event across the planet into a single massive network that captures what's happening around the world, what its context is and who's involved, and how the world is feeling about it, every single day.

The GDELT [event data](http://data.gdeltproject.org/documentation/GDELT-Event_Codebook-V2.0.pdf) contains hundreds of millions of automatically coded events, extracted daily from news stories using NLU methods and models. Each event data row contains the following kinds of fields:
1. *Actors*: People, organizations, or states that initiate or are the target of event actions. Actors may carry geographic information but no temporal information. An event references exactly 2 actors: Actor1 and Actor2.
2. *Actions*: Codes and other information that describe each event. Actions have both temporal and spatial attributes: an event time plus geographic information such as latitude and longitude. Actors and actions naturally form graphs with directed edges connecting `Actor1 --> Action --> Actor2`. An Actor-Action edge may carry attributes such as the event time, and may have a complementary reverse edge to make querying easier, e.g. `Actor1 --event1_date--> Action1 --event1_date--> Actor2 --event2_date--> Action2 --event2_date--> Actor3`.
3. *SourceURL*: A URL that locates the *story* from which the event data was extracted.
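
The Actor-Action-Actor structure described above can be sketched in plain Python. This is only an illustration of the idea, not the osiris implementation: the `Edge` type, the `event_to_edges` helper, the edge kind names, and the two sample rows are all hypothetical.

```python
from typing import NamedTuple


class Edge(NamedTuple):
    src: str
    dest: str
    date: str  # event date carried as an edge attribute
    kind: str  # hypothetical label for the edge direction


def event_to_edges(event_id, actor1, actor2, date, with_reverse=True):
    """Turn one event row into Actor1 -> Action -> Actor2 edges."""
    action = f"event:{event_id}"
    edges = [
        Edge(actor1, action, date, "initiates"),
        Edge(action, actor2, date, "targets"),
    ]
    if with_reverse:
        # Complementary reverse edges let a query walk the chain backwards.
        edges += [
            Edge(action, actor1, date, "initiated_by"),
            Edge(actor2, action, date, "targeted_by"),
        ]
    return edges


# Hypothetical rows: (event id, Actor1, Actor2, event date)
rows = [
    (1, "Actor1", "Actor2", "20220414"),
    (2, "Actor2", "Actor3", "20220415"),
]
chain_edges = [edge for row in rows for edge in event_to_edges(*row)]
forward = [e for e in chain_edges if e.kind in ("initiates", "targets")]
for e in forward:
    print(f"{e.src} --{e.date}--> {e.dest}")
```

Because Actor2 is the target of the first event and the initiator of the second, the forward edges form a single chain from Actor1 to Actor3.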

osiris can extract data directly from the GDELT file server. The advantage of this method is that you don't need any special credentials or server access (remember, we're interested in *open-source* indicators). All the data is downloaded directly to your client machine or notebook environment.

### Importing GDELT data from file server

osiris uses *DataSource* classes to manage importing tabular data. 
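
How `DataSource` fetches files is internal to osiris and is not shown in this notebook. As a rough sketch of what a file-server import involves, the snippet below builds one download URL per day for a date range. It assumes the GDELT 1.0 naming scheme of one zipped export file per day; the URL template, the `daily_event_urls` helper, and its default date format are illustrative assumptions, not part of the osiris API.

```python
from datetime import datetime, timedelta

# Assumption: GDELT 1.0 daily event exports, one zipped CSV per day.
GDELT_V1_EVENTS = "http://data.gdeltproject.org/events/{day}.export.CSV.zip"


def daily_event_urls(start, end, fmt="%b-%d-%Y"):
    """Return one export URL per day in the inclusive range [start, end]."""
    first = datetime.strptime(start, fmt).date()
    last = datetime.strptime(end, fmt).date()
    if last < first:
        raise ValueError("end date is before start date")
    n_days = (last - first).days + 1
    return [
        GDELT_V1_EVENTS.format(day=(first + timedelta(days=i)).strftime("%Y%m%d"))
        for i in range(n_days)
    ]


urls = daily_event_urls("Apr-14-2022", "Apr-20-2022")
for url in urls:
    print(url)
```

Each file could then be downloaded and parsed as tab-delimited text without a header row, with the column names taken from the GDELT codebook.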

In [None]:
# Import data directly from GDELT file server
from data.gdelt import DataSource
import pandas as pd
gdelt = DataSource()

In [None]:
# Get event data for a 1 week period
events = gdelt.import_data('events', 'Apr-14-2022', 'Apr-20-2022')

A week's worth of event data in 2022 consists of about 700K events and takes up about 340MB of RAM.
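
These two figures give a rough per-event memory footprint, which helps when deciding how large a date range to import. The sketch below is back-of-the-envelope arithmetic only: it treats both figures as approximate, takes 1 MB as 10**6 bytes, and assumes the footprint per event stays roughly constant. The 4 GB budget is a hypothetical example.

```python
EVENTS_PER_WEEK = 700_000          # approximate figure quoted above
RAM_BYTES_PER_WEEK = 340 * 10**6   # 340 MB, taking 1 MB = 10**6 bytes

bytes_per_event = RAM_BYTES_PER_WEEK / EVENTS_PER_WEEK


def max_events_for_budget(ram_budget_bytes):
    """Estimate how many events fit in a RAM budget at the same footprint."""
    return ram_budget_bytes * EVENTS_PER_WEEK // RAM_BYTES_PER_WEEK


print(f"~{bytes_per_event:.0f} bytes per event")
print(f"~{max_events_for_budget(4 * 10**9):,} events in a hypothetical 4 GB budget")
```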

In [None]:
events.info()

In [None]:
events

Event data is highly denormalized, with many redundancies for ease of querying, and is coded using a hierarchical coding system called [CAMEO](http://data.gdeltproject.org/documentation/CAMEO.Manual.1.1b3.pdf) (Conflict and Mediation Event Observations).
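
In practice, "hierarchical" means that a CAMEO code can be read at several levels of detail from its prefix: the first two digits give the root category and the first three give the base code. The helper below is a minimal sketch of that idea. `cameo_levels` is a hypothetical name, and the codes are kept as strings because leading zeros are significant.

```python
def cameo_levels(event_code):
    """Split a CAMEO event code into its root, base, and full levels."""
    code = str(event_code).strip()
    if not code.isdigit() or not 2 <= len(code) <= 4:
        raise ValueError(f"not a CAMEO-style code: {event_code!r}")
    return {
        "root": code[:2],                            # two-digit root category
        "base": code[:3] if len(code) >= 3 else None,  # three-digit base code
        "full": code,                                # most specific code given
    }


print(cameo_levels("190"))
print(cameo_levels("0231"))
```

Filtering on a prefix therefore selects a whole branch of the hierarchy, which is why string methods such as `str.startswith` are a natural fit for the `EventCode` column.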

In [None]:
events[['EventCode', 'CAMEOCodeDescription']]

We can query and filter event data directly using the pandas DataFrame.

In [None]:
# Find all events that were geolocated in Ukraine (GDELT geo fields use FIPS 10-4 country codes, where Ukraine is 'UP')
uka_events = events[events.ActionGeo_CountryCode == 'UP']
uka_events

So about 50K of the 700K events in that week were coded as happening in Ukraine, which is not surprising given recent events. Many of those relate to the use of military force.

In [None]:
# CAMEO code 190 denotes 'use of military force'
uka_events[uka_events.EventCode.str.startswith('190')]

In [None]:
# Import Folium to plot these military force events on a map
import folium
folium.Map(
    location=[48., 31.], 
    tiles="Stamen Toner",
    zoom_start=6
)

In [None]:
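uka_map = folium.Map(
    location=[48., 31.], 
    #tiles="Stamen Toner",
    zoom_start=6
)
# Keep only military force events that have coordinates, since folium rejects NaN marker locations
uka_force_events = uka_events[uka_events.EventCode.str.startswith('190')].dropna(subset=['ActionGeo_Lat', 'ActionGeo_Long'])
# Sample at most 1000 events so that sampling doesn't fail when fewer are available
uka_events_sample = uka_force_events.sample(n=min(1000, len(uka_force_events)))
for r in uka_events_sample.itertuples():
    # Tooltip format: Actor1 country -> event code and description -> Actor2 country on date
    tooltip = f"{r.Actor1CountryCode}->{r.EventCode} {r.CAMEOCodeDescription}->{r.Actor2CountryCode} on {r.SQLDATE}"
    m = folium.Marker(location=[r.ActionGeo_Lat, r.ActionGeo_Long],
                      icon=folium.Icon(color="red", icon="fire", prefix="glyphicon"),
                      tooltip=tooltip
                     )
    m.add_to(uka_map)
uka_map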

### Shaping tabular data into graph vertices and edges

The GDELT data schema is 'flat' and designed for ease of tabular querying and grouping. To support graph and network queries, it needs to be reshaped into vertices and edges.

In [None]:
from data.etl import shape_event_actor_vertices
events_vertices, actor1_vertices, actor2_vertices = shape_event_actor_vertices(uka_events_sample)

We create unique IDs for actors so that they can be linked to actions: the individual actor fields are hashed together into a single ID per actor, and all the other actor fields are then dropped from the event data.
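
The hashing step can be illustrated with the standard library. This is a minimal sketch of the general technique, not the code osiris uses: the `actor_id` helper, the choice of SHA-1, the separator, and the truncation to 16 hex characters are all assumptions made for the example.

```python
import hashlib


def actor_id(*fields):
    """Derive a stable ID from an actor's fields (hypothetical helper)."""
    # Treat missing values as empty strings so they hash consistently.
    parts = ["" if f is None else str(f) for f in fields]
    # Join with a separator that is unlikely to occur in the data, so that
    # ("AB", "C") and ("A", "BC") do not collide.
    key = "\x1f".join(parts)
    return hashlib.sha1(key.encode("utf-8")).hexdigest()[:16]


# Hypothetical actor field values: (code, name, country code)
id_a = actor_id("GOV", "GOVERNMENT", "UKR")
id_b = actor_id("GOV", "GOVERNMENT", "UKR")
id_c = actor_id("GOV", "GOVERNMENT", None)
print(id_a == id_b, id_a == id_c)
```

Identical field values always map to the same ID, so the same actor appearing in many event rows collapses into a single vertex.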

In [None]:
events_vertices

The actor information is now stored as separate entities that can be linked to actions.

In [None]:
actor1_vertices

We can visualize this data using Graphistry. First, let's 'flatten' the graph schema so that we have only one type of node and one type of edge.

In [48]:
from data.etl import flatten_event_actor_vertices
nodes, edges = flatten_event_actor_vertices(events_vertices, actor1_vertices, actor2_vertices)
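
As a rough picture of what flattening means, the sketch below merges actor and event records into a single node list and emits one generic edge type. It is not the osiris implementation: the `flatten` helper and the input record layout are hypothetical, and only the `node_id`, `src`, `dest`, and `date` column names are taken from the Graphistry binding used in this notebook.

```python
def flatten(events, actors):
    """Merge typed vertices into one node list and one edge list."""
    # Actors and events become the same kind of node, told apart by "type".
    flat_nodes = [
        {"node_id": a["id"], "type": "actor", "label": a["name"]} for a in actors
    ]
    flat_nodes += [
        {"node_id": e["id"], "type": "event", "label": e["code"]} for e in events
    ]
    flat_edges = []
    for e in events:
        # Actor1 -> event and event -> Actor2, both carrying the event date.
        flat_edges.append({"src": e["actor1_id"], "dest": e["id"], "date": e["date"]})
        flat_edges.append({"src": e["id"], "dest": e["actor2_id"], "date": e["date"]})
    return flat_nodes, flat_edges


# Hypothetical input records
sample_actors = [{"id": "a1", "name": "Actor A"}, {"id": "a2", "name": "Actor B"}]
sample_events = [
    {"id": "e1", "code": "190", "actor1_id": "a1", "actor2_id": "a2", "date": "20220414"}
]
flat_nodes, flat_edges = flatten(sample_events, sample_actors)
print(len(flat_nodes), "nodes,", len(flat_edges), "edges")
```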

In [55]:
# Start using Graphistry, you'll need GRAPHISTRY_USER and GRAPHISTRY_PASS env variables.
# Uncomment this to begin the authorization process for GDrive to use vars.env in Colab
# if IN_COLAB:
#    import colab_env

# If running from a local machine you can set these vars with the other osiris env variables
# If in a hosted nb env and not using vars.env or not in Colab you'll have to set them manually
# os.environ['GRAPHISTRY_USER'] = mygruser
# os.environ['GRAPHISTRY_PASS'] = mygrpass

HAVE_GRAPHISTRY = 'GRAPHISTRY_USER' in os.environ and 'GRAPHISTRY_PASS' in os.environ
if HAVE_GRAPHISTRY:
    from graphistry import graphistry
    graphistry.register(api=3, username=os.environ['GRAPHISTRY_USER'], password=os.environ['GRAPHISTRY_PASS'], protocol='https', server='hub.graphistry.com')

In [56]:
# Plot UP events using Graphistry
g = graphistry.bind(source="src", destination="dest", edge_title="date", node="node_id")
g.edges(edges).nodes(nodes).plot()

In [None]:
# Plot a larger sample of 10K events.
events_vertices, actor1_vertices, actor2_vertices = shape_event_actor_vertices(events.sample(10000))
nodes, edges = flatten_event_actor_vertices(events_vertices, actor1_vertices, actor2_vertices)
g = graphistry.bind(source="src", destination="dest", edge_title="date", node="node_id")
g.edges(edges).nodes(nodes).plot()