# osiris

![img](https://dm2301files.storage.live.com/y4mmRC1xelS6Y6MEqUnZ-k2vjpADHpo6UMZAaZWROunr9-Ml5FYDlZ6WMxCGedy7NDhwDpusZdF5E1oLR5Qn6momydHe7tYUOMwNeFeGW7pUWkBjGPSnZp2sacYWs9IKkose6xjhSySL_v2tbfItRI7T_Pw_Tayhaa2F_vrwW6ucyr6WPa6s9DWH_if9Y5Y3yAU?width=375&height=250&cropmode=none)


osiris is a Python data processing and analysis environment for computational conflict forecasting. It works with very large datasets, uses graph-based methods, models, and visualization, and is powered by scalable graph databases.

You can use osiris to analyze causal chains and networks of conflict and violence around the world using [automatically-encoded political event data](https://parusanalytics.com/eventdata/papers.dir/Schrodt_Yonamine_NewDirectionsInText.pdf) that is updated in real time by projects like GDELT. This notebook gives an overview of the osiris project and the [GDELT project](https://www.gdeltproject.org/) data that osiris uses. It shows how to import political event data with osiris, either from the GDELT file server or from Google BigQuery, how to visualize and analyze it using Python, and how to load it into a TigerGraph graph server instance so that graph-centric queries can efficiently retrieve vertex-edge event data for further analysis.

## Notebook Environment Setup

In [1]:
import os, sys
# Check if running inside Colab or Kaggle
IN_COLAB = 'COLAB_GPU' in os.environ
IN_KAGGLE = 'KAGGLE_KERNEL_RUN_TYPE' in os.environ
IN_HOSTED_NB = IN_COLAB or IN_KAGGLE
os.environ['IN_HOSTED_NB'] = str(IN_HOSTED_NB)

OS_NAME = sys.platform.upper()
if OS_NAME in ['LINUX', 'DARWIN'] and IN_HOSTED_NB:
  import subprocess
  print('Installing osiris from GitHub...')
  print(subprocess.run('if [ -d "osiris" ]; then rm -Rf osiris; fi', text=True, shell=True, check=True, capture_output=True).stdout)
  print(subprocess.run('git clone https://github.com/allisterb/osiris --recurse-submodules', text=True, shell=True, check=True, capture_output=True).stdout)
  print(subprocess.run('cd osiris && ./install', text=True, shell=True, check=True, capture_output=True).stdout)

# If we're not in a hosted nb env assume we're running Jupyter from the osiris project directory root
OSIRIS_PATH = '..' if not IN_HOSTED_NB else 'osiris'

# Import the osiris code and set the runtime env. 
sys.path.append(os.path.join(OSIRIS_PATH, 'osiris'))
sys.path.append(os.path.join(OSIRIS_PATH, 'ext'))
from osiris_global import set_runtime_env
set_runtime_env(interactive_nb=True)

## GDELT Event Data

*From the [GDELT project](https://www.gdeltproject.org/) website*:
>The GDELT Project is a realtime network diagram and database of global human society for open research.
![gf](https://www.gdeltproject.org/images/spinningglobe.gif)

>The GDELT Project is an initiative to construct a catalog of human societal-scale behavior and beliefs across all countries of the world, connecting every person, organization, location, count, theme, news source, and event across the planet into a single massive network that captures what's happening around the world, what its context is and who's involved, and how the world is feeling about it, every single day.

The GDELT [event data](http://data.gdeltproject.org/documentation/GDELT-Event_Codebook-V2.0.pdf) contains hundreds of millions of automatically coded events, extracted daily from news stories using NLU methods and models. Each event data row contains the following groups of fields:
1. *Actors*: People, organizations, or states that initiate or are the target of event actions. Actors may have geographic information but no temporal information. An event references exactly 2 actors: Actor1 and Actor2.
2. *Actions*: Codes and other information that describe each event. Actions have both temporal and spatial attributes: an event time plus geographic information such as latitude and longitude.
3. *SourceURL*: a URL that locates the *story* from which the event data was extracted.
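
To make the three groups of fields concrete, here is a minimal sketch that splits one flat event row into its actor, action, and source parts. The column names follow the GDELT event codebook, but the row values are invented for illustration and `group_event_fields` is a hypothetical helper, not part of osiris.

```python
# Hypothetical event row: column names follow the GDELT codebook, values are made up.
sample_event = {
    "GLOBALEVENTID": 1,
    "SQLDATE": 20220414,
    "Actor1Code": "RUS",
    "Actor1Name": "RUSSIA",
    "Actor1CountryCode": "RUS",
    "Actor2Code": "UKR",
    "Actor2Name": "UKRAINE",
    "Actor2CountryCode": "UKR",
    "EventCode": "190",
    "ActionGeo_CountryCode": "UP",
    "ActionGeo_Lat": 48.0,
    "ActionGeo_Long": 31.0,
    "SOURCEURL": "https://example.com/story",
}

def group_event_fields(row):
    """Split a flat event row into actor, action, and source groups by column name."""
    groups = {"actor1": {}, "actor2": {}, "action": {}, "source": {}}
    for name, value in row.items():
        if name.startswith("Actor1"):
            groups["actor1"][name] = value
        elif name.startswith("Actor2"):
            groups["actor2"][name] = value
        elif name == "SOURCEURL":
            groups["source"][name] = value
        else:
            # Everything else (IDs, dates, event codes, ActionGeo_*) describes the action.
            groups["action"][name] = value
    return groups

grouped = group_event_fields(sample_event)
print(grouped["actor1"])
print(grouped["action"]["EventCode"])
```

Because the row is denormalized, the same actor attributes repeat on every event that actor takes part in.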

osiris can extract data directly from the GDELT file server. The advantage of this method is that you don't need any special credentials or server access (remember, we're interested in *open-source* indicators). All the data is downloaded directly to your client machine or notebook environment.

In [2]:
# Import data directly from GDELT file server
from data.gdelt import DataSource
import pandas as pd
gdelt = DataSource()

In [3]:
# Get event data for a 1 week period
events = gdelt.import_data('events', 'Apr-14-2022', 'Apr-20-2022')

Importing GDELT events data for 7 day(s) from 04-14-2022 to 04-20-2022...


Import GDELT events data:   0%|          | 0/7 [00:00<?, ?day/s]

Importing GDELT events data for 7 day(s) from 04-14-2022 to 04-20-2022 completed in 90.37 s.
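
The import above covers seven daily files. As a rough illustration of what fetching directly from the file server involves, the sketch below expands a date range into one download URL per day. The URL pattern is the one used for GDELT 1.0 daily event exports and is an assumption here, as is the `daily_event_urls` helper; the osiris importer may use different files or endpoints. The sketch only builds the URLs and does not download anything.

```python
from datetime import datetime, timedelta

# Assumed URL pattern for GDELT 1.0 daily event exports; osiris may use different files.
GDELT_EVENTS_URL = "http://data.gdeltproject.org/events/{day}.export.CSV.zip"

def daily_event_urls(start, end, date_format="%b-%d-%Y"):
    """Return one export URL per day in the inclusive range [start, end]."""
    first = datetime.strptime(start, date_format).date()
    last = datetime.strptime(end, date_format).date()
    if last < first:
        raise ValueError("end date must not be before start date")
    days = (last - first).days + 1
    return [
        GDELT_EVENTS_URL.format(day=(first + timedelta(days=i)).strftime("%Y%m%d"))
        for i in range(days)
    ]

urls = daily_event_urls("Apr-14-2022", "Apr-20-2022")
print(len(urls))
print(urls[0])
```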


In 2022, a week's worth of event data consists of about 700K events and takes up about 340MB of RAM.
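
Those two figures give a quick back-of-the-envelope estimate of the memory needed for longer periods. The calculation below is a linear extrapolation from this single week and is only a rough guide, since event volume varies from week to week.

```python
# Figures quoted in the text: roughly 700K events and 340 MB of RAM for one week.
events_per_week = 700_000
ram_mb_per_week = 340

bytes_per_event = ram_mb_per_week * 1_000_000 / events_per_week
# Linear extrapolation only; real volumes vary from week to week.
ram_gb_per_year = ram_mb_per_week * 52 / 1_000

print(f"{bytes_per_event:.0f} bytes per event")
print(f"{ram_gb_per_year:.1f} GB for 52 weeks")
```

At roughly half a kilobyte per event, a full year of events would need on the order of 18 GB if held in a single in-memory dataframe, which is one reason to push larger analyses into a graph database.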

In [None]:
events.info()

In [None]:
events

Event data is highly denormalized, with many redundancies for ease of querying, and is coded using a hierarchical coding system called [CAMEO](http://data.gdeltproject.org/documentation/CAMEO.Manual.1.1b3.pdf) (Conflict and Mediation Event Observations).

In [None]:
events[['EventCode', 'CAMEOCodeDescription']]
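
The hierarchy in CAMEO is encoded in the digits of the event code: the first two digits give the root category, the first three give the base code, and a fourth digit, when present, gives the most specific code. A minimal sketch of that decomposition follows; `cameo_levels` is a hypothetical helper written for this illustration. Codes must stay strings, because many of them have a leading zero.

```python
def cameo_levels(event_code):
    """Return the (root, base, full) CAMEO codes for an event code string."""
    code = str(event_code)
    if not code.isdigit() or len(code) not in (2, 3, 4):
        raise ValueError(f"not a CAMEO event code: {event_code!r}")
    # For 2- and 3-digit codes the slices simply return the code itself.
    return code[:2], code[:3], code

print(cameo_levels("190"))
print(cameo_levels("0251"))
```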

We can query and filter event data directly using the pandas DataFrame.

In [None]:
# Find all events geolocated in Ukraine (GDELT uses FIPS 10-4 country codes, where Ukraine is 'UP')
uka_events = events[(events.ActionGeo_CountryCode == 'UP')]
uka_events

About 50K of the 700K events in that week were coded as happening in Ukraine, which is not surprising given events at the time. Many of them relate to the use of military force.

In [None]:
# CAMEO code 190 denotes 'use of military force'
uka_events[uka_events.EventCode.str.startswith('190')]
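
The filter above matches code 190 only. The more specific military force codes (191 to 196) share the two-digit root `19`, so a prefix match on the root picks up the whole family. The sketch below shows the difference on a small made-up list of codes; the sample values are invented for illustration.

```python
from collections import Counter

# Hypothetical sample of CAMEO event codes, kept as strings.
codes = ["190", "193", "195", "042", "190", "036", "194"]

exactly_190 = [c for c in codes if c == "190"]
root_19 = [c for c in codes if c.startswith("19")]
# Tally events by their two-digit root code.
by_root = Counter(c[:2] for c in codes)

print(len(exactly_190), len(root_19))
print(by_root.most_common(1))
```

On the dataframe, the equivalent prefix filter would be `uka_events.EventCode.str.startswith('19')`.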

In [None]:
# Import Folium to plot these military force events on a map
import folium
folium.Map(
    location=[48., 31.], 
    tiles="Stamen Toner",
    zoom_start=6
)

In [None]:
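# Plot a random sample of up to 100 military force events on the default base map
uka_map = folium.Map(
    location=[48., 31.], 
    zoom_start=6
)
uka_force_events = uka_events[uka_events.EventCode.str.startswith('190')]
# Markers need coordinates, so skip events without a latitude or longitude
uka_force_events = uka_force_events.dropna(subset=['ActionGeo_Lat', 'ActionGeo_Long'])
uka_events_sample = uka_force_events.sample(n=min(100, len(uka_force_events)))
for r in uka_events_sample.itertuples():
    m = folium.Marker(location=[r.ActionGeo_Lat, r.ActionGeo_Long],
                      icon=folium.Icon(color="red", icon="fire", prefix="glyphicon"),
                      tooltip=str(r.Actor1CountryCode) + '->' + str(r.EventCode) + ' ' +  str(r.CAMEOCodeDescription) + '->' + str(r.Actor2CountryCode) +' on ' + str(r.SQLDATE)
                     )
    m.add_to(uka_map)
uka_map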

In [None]:
from data.etl import shape_events_vertices

shape_events_vertices(uka_events_sample)
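
As a rough picture of what shaping events into vertex-edge form means, the sketch below turns event rows into a set of actor vertices and a list of directed event edges. This is one possible mapping only: the `events_to_graph` name, the choice of fields, and the use of actor codes as vertex keys are assumptions made for illustration, and the osiris `shape_events_vertices` function and its TigerGraph schema may be organized differently.

```python
def events_to_graph(rows):
    """Build actor vertices and directed event edges from event rows (dicts)."""
    vertices = set()
    edges = []
    for row in rows:
        source, target = row["Actor1Code"], row["Actor2Code"]
        if not source or not target:
            continue  # skip events where one of the actors is missing
        vertices.update([source, target])
        edges.append((source, target, row["EventCode"], row["SQLDATE"]))
    return vertices, edges

# Hypothetical rows using GDELT column names; values are made up.
rows = [
    {"Actor1Code": "RUS", "Actor2Code": "UKR", "EventCode": "190", "SQLDATE": 20220414},
    {"Actor1Code": "UKR", "Actor2Code": "USA", "EventCode": "042", "SQLDATE": 20220415},
    {"Actor1Code": "GOV", "Actor2Code": None, "EventCode": "010", "SQLDATE": 20220415},
]
vertices, edges = events_to_graph(rows)
print(sorted(vertices))
print(len(edges))
```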

In [None]:
from tqdm.notebook import tqdm_notebook
# Register progress_apply on pandas objects so long row-wise operations show a progress bar
tqdm_notebook.pandas(desc="my bar!")
# Concatenate a few Actor1 attributes into one string per event, treating missing values as empty strings.
# The parentheses are required: without them each line starting with + is a separate statement.
(uka_events_sample.Actor1Code.fillna('')
 + uka_events_sample.Actor1Name.fillna('')
 + uka_events_sample.Actor1CountryCode.fillna(''))

In [None]:
from hashlib import sha1
import binascii
import pandas as pd
from tqdm.auto import tqdm

# Actor attribute suffixes, in the order they are concatenated before hashing
ACTOR_ID_FIELDS = [
    'Code', 'Name', 'CountryCode', 'KnownGroupCode', 'EthnicCode',
    'Religion1Code', 'Religion2Code', 'Type1Code', 'Type2Code', 'Type3Code',
    'Geo_Type', 'Geo_FullName', 'Geo_CountryCode', 'Geo_ADM1Code', 'Geo_ADM2Code',
    'Geo_Lat', 'Geo_Long', 'Geo_FeatureID',
]

def calc_actor_id(r: pd.Series, actor: str) -> str:
    """Hash the attributes of one actor ('Actor1' or 'Actor2') in an event row into a base64-encoded SHA-1 ID."""
    # Missing values contribute an empty string; everything else is converted with str()
    key = ''.join('' if pd.isnull(r[actor + f]) else str(r[actor + f]) for f in ACTOR_ID_FIELDS)
    return binascii.b2a_base64(sha1(key.encode('utf-8')).digest()).strip().decode('utf-8')

def calc_actor1_id(r: pd.Series) -> str:
    return calc_actor_id(r, 'Actor1')

def calc_actor2_id(r: pd.Series) -> str:
    return calc_actor_id(r, 'Actor2')

def shape_events_vertices(events: pd.DataFrame) -> pd.DataFrame:
    """Insert Actor1ID and Actor2ID columns into the events dataframe in place and return it."""
    # Register progress_apply once per pass instead of once per row
    tqdm.pandas(unit='row', desc='Hashing Actor1 ID')
    events.insert(1, 'Actor1ID', events.progress_apply(calc_actor1_id, axis=1))
    tqdm.pandas(unit='row', desc='Hashing Actor2 ID')
    events.insert(2, 'Actor2ID', events.progress_apply(calc_actor2_id, axis=1))
    return events

shape_events_vertices(uka_events_sample)
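
The actor IDs above are content hashes: the same attribute values always produce the same 28-character ID, so an actor that appears in many events maps to a single vertex. The standalone sketch below reproduces the hashing step on plain lists to show that property, along with one caveat of joining fields with no separator. The `hash_id` helper and its `sep` parameter are hypothetical and are not part of osiris; the separator variant is shown only as a possible alternative.

```python
from hashlib import sha1
import binascii

def hash_id(parts, sep=""):
    """Base64-encode the SHA-1 digest of the joined parts, treating None as empty."""
    key = sep.join("" if p is None else str(p) for p in parts)
    return binascii.b2a_base64(sha1(key.encode("utf-8")).digest()).strip().decode("utf-8")

# Same attributes always give the same ID, so repeated actors map to one vertex.
assert hash_id(["RUS", "RUSSIA", None]) == hash_id(["RUS", "RUSSIA", None])

# Without a separator, different field splits can produce the same key.
print(hash_id(["AB", "C"]) == hash_id(["A", "BC"]))
# A separator that cannot occur in the fields keeps them apart (an illustrative variant).
print(hash_id(["AB", "C"], sep="\x1f") == hash_id(["A", "BC"], sep="\x1f"))
# A 20-byte SHA-1 digest encodes to 28 base64 characters.
print(len(hash_id(["RUS"])))
```

Whether such collisions matter in practice depends on the data; with fixed-format codes and coordinates in the key they are unlikely, but a separator removes the ambiguity at no cost.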

In [None]:
# Uncomment and run below if running inside Colab and you want to pull env variables from a file called vars.env on your GDrive
# !pip install colab-env --upgrade
# import colab_env