# osiris

![img](https://dm2301files.storage.live.com/y4mmRC1xelS6Y6MEqUnZ-k2vjpADHpo6UMZAaZWROunr9-Ml5FYDlZ6WMxCGedy7NDhwDpusZdF5E1oLR5Qn6momydHe7tYUOMwNeFeGW7pUWkBjGPSnZp2sacYWs9IKkose6xjhSySL_v2tbfItRI7T_Pw_Tayhaa2F_vrwW6ucyr6WPa6s9DWH_if9Y5Y3yAU?width=375&height=250&cropmode=none)


osiris is a Python data processing and analysis environment for computational conflict forecasting. It works with very large datasets, combines graph-based methods, models, and visualization, and is powered by scalable graph databases.

You can use osiris to analyze causal chains and networks of conflict and violence around the world using [automatically-encoded political event data](https://parusanalytics.com/eventdata/papers.dir/Schrodt_Yonamine_NewDirectionsInText.pdf), updated in real time, from projects like GDELT. This notebook gives an overview of the osiris project and of the [GDELT project](https://www.gdeltproject.org/) data that osiris uses. It shows how to import political event data using osiris, either from the GDELT file server or from Google BigQuery, how to visualize and analyze it using Python, and how to load it into a TigerGraph graph server instance so that graph-centric queries can efficiently retrieve vertex-edge event data for further analysis.

## Notebook Environment Setup

In [1]:
import os, sys
# Check if running inside Colab or Kaggle
IN_COLAB = 'COLAB_GPU' in os.environ
IN_KAGGLE = 'KAGGLE_KERNEL_RUN_TYPE' in os.environ
IN_HOSTED_NB = IN_COLAB or IN_KAGGLE
os.environ['IN_HOSTED_NB'] = str(IN_HOSTED_NB)

OS_NAME = sys.platform.upper()
if OS_NAME in ['LINUX', 'DARWIN'] and IN_HOSTED_NB:
  import subprocess
  print('Installing osiris from GitHub...')
  print(subprocess.run('if [ -d "osiris" ]; then rm -Rf osiris; fi', text=True, shell=True, check=True, capture_output=True).stdout)
  print(subprocess.run('git clone https://github.com/allisterb/osiris --recurse-submodule', text=True, shell=True, check=True, capture_output=True).stdout)
  print(subprocess.run('cd osiris && ./install', text=True, shell=True, check=True, capture_output=True).stdout)

# If we're not in a hosted nb env assume we're running Jupyter from the osiris project directory root
OSIRIS_PATH = '..' if not IN_HOSTED_NB else 'osiris'

# Import the osiris code and set the runtime env. 
sys.path.append(os.path.join(OSIRIS_PATH, 'osiris'))
sys.path.append(os.path.join(OSIRIS_PATH, 'ext'))
from osiris_global import set_runtime_env
set_runtime_env(interactive_nb=True)

## GDELT Event Data

*From the [GDELT project](https://www.gdeltproject.org/) website*:
>The GDELT Project is a realtime network diagram and database of global human society for open research.
![gf](https://www.gdeltproject.org/images/spinningglobe.gif)

>The GDELT Project is an initiative to construct a catalog of human societal-scale behavior and beliefs across all countries of the world, connecting every person, organization, location, count, theme, news source, and event across the planet into a single massive network that captures what's happening around the world, what its context is and who's involved, and how the world is feeling about it, every single day.

The GDELT [event data](http://data.gdeltproject.org/documentation/GDELT-Event_Codebook-V2.0.pdf) contains hundreds of millions of automatically coded events extracted from news stories daily using NLU methods and models. Each event data row contains the following fields:
1. *Actors*: People, organizations, or states that initiate or are the target of event actions. Actors may have geographic information but no temporal information. An event has exactly 2 actor slots, Actor1 and Actor2, although either can be empty when the coder could not identify an actor.
2. *Actions*: Codes and other information that describe each event. Actions have both temporal and spatial attributes: an event time plus geographic information such as latitude and longitude.
3. *SourceURL*: a URL that locates the *story* from which the event data was extracted.
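The three field groups above can be pictured as a small record type. This is only a sketch for orientation: the `Actor` and `Event` classes are hypothetical and not part of osiris, the comments name the GDELT columns each attribute would map to, and the example values are made up.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Actor:
    """One side of an event; any attribute may be missing."""
    code: Optional[str]          # e.g. Actor1Code
    name: Optional[str]          # e.g. Actor1Name
    country_code: Optional[str]  # e.g. Actor1CountryCode


@dataclass(frozen=True)
class Event:
    """Hypothetical container for the actor, action, and source fields."""
    event_id: int                # GLOBALEVENTID
    actor1: Actor
    actor2: Actor
    event_code: str              # CAMEO code, kept as a string
    date: int                    # SQLDATE as YYYYMMDD
    lat: Optional[float]         # ActionGeo_Lat
    lon: Optional[float]         # ActionGeo_Long
    source_url: str              # SOURCEURL


# Made-up values, for illustration only.
example = Event(
    event_id=1,
    actor1=Actor("RUS", "RUSSIA", "RUS"),
    actor2=Actor("UKR", "UKRAINE", "UKR"),
    event_code="190",
    date=20220420,
    lat=49.0,
    lon=32.0,
    source_url="https://example.org/story",
)
print(example.actor1.name, "->", example.event_code, "->", example.actor2.name)
```

The actors carry no date of their own, while the event holds both the date and the coordinates, which mirrors the split described in the list.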

osiris can extract data directly from the GDELT file server. The advantage of this method is that you don't need any special credentials or server access (remember, we're interested in *open-source* indicators). All the data is downloaded directly to your client machine or notebook environment.

In [2]:
# Import data directly from GDELT file server
from data.gdelt import DataSource
import pandas as pd
gdelt = DataSource()

In [3]:
# Get event data for a single day (the start and end dates are the same)
events = gdelt.import_data('events', 'Apr-20-2022', 'Apr-20-2022')

Importing GDELT events data for 1 day(s) from 04-20-2022 to 04-20-2022...


Import GDELT events data:   0%|          | 0/1 [00:00<?, ?day/s]

Importing GDELT events data for 1 day(s) from 04-20-2022 to 04-20-2022 completed in 11.45 s.
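The log above reports one day because the start and end dates passed to `import_data` are identical. The sketch below shows how such a date pair could be expanded into a list of days, assuming the range includes both endpoints (the "1 day(s)" message suggests this, but the notebook does not state it). `days_in_range` is a hypothetical helper, not an osiris function, and it relies on English month abbreviations being recognized by the current locale.

```python
from datetime import datetime, timedelta


def days_in_range(start: str, end: str, fmt: str = "%b-%d-%Y"):
    """Return every calendar day from start to end, both inclusive."""
    first = datetime.strptime(start, fmt).date()
    last = datetime.strptime(end, fmt).date()
    if last < first:
        raise ValueError("end date precedes start date")
    return [first + timedelta(days=i) for i in range((last - first).days + 1)]


one_day = days_in_range("Apr-20-2022", "Apr-20-2022")
one_week = days_in_range("Apr-14-2022", "Apr-20-2022")
print(len(one_day), len(one_week))
```

Under that assumption, a full week ending on the same date would start at `'Apr-14-2022'`.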


A week's worth of event data in 2022 consists of about 700K events and takes up about 340MB of RAM.
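To check a memory figure like this for your own import, pandas can report the footprint of a dataframe directly. The sketch below uses a tiny stand-in frame with illustrative values rather than real GDELT rows; on the imported data the same call would be made on `events`.

```python
import pandas as pd

# Tiny stand-in frame; the values are illustrative, not real GDELT rows.
sample = pd.DataFrame({
    "GLOBALEVENTID": [1, 2, 3],
    "Actor1Code": ["AUS", "UKR", None],
    "EventCode": ["057", "190", "010"],
})

# deep=True also counts the string values held by text columns,
# which the default shallow estimate can leave out.
bytes_per_column = sample.memory_usage(deep=True)
total_mb = bytes_per_column.sum() / 1024 ** 2
print(f"{total_mb:.6f} MB")
```

Because most GDELT columns are strings, the deep figure is usually the more realistic one.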

In [4]:
events.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 125670 entries, 0 to 125669
Data columns (total 62 columns):
 #   Column                 Non-Null Count   Dtype  
---  ------                 --------------   -----  
 0   GLOBALEVENTID          125670 non-null  int64  
 1   SQLDATE                125670 non-null  int64  
 2   MonthYear              125670 non-null  int64  
 3   Year                   125670 non-null  int64  
 4   FractionDate           125670 non-null  float64
 5   Actor1Code             113887 non-null  object 
 6   Actor1Name             113887 non-null  object 
 7   Actor1CountryCode      72667 non-null   object 
 8   Actor1KnownGroupCode   1744 non-null    object 
 9   Actor1EthnicCode       572 non-null     object 
 10  Actor1Religion1Code    1641 non-null    object 
 11  Actor1Religion2Code    391 non-null     object 
 12  Actor1Type1Code        53316 non-null   object 
 13  Actor1Type2Code        3732 non-null    object 
 14  Actor1Type3Code        96 non-null  

In [5]:
events

Unnamed: 0,GLOBALEVENTID,SQLDATE,MonthYear,Year,FractionDate,Actor1Code,Actor1Name,Actor1CountryCode,Actor1KnownGroupCode,Actor1EthnicCode,...,ActionGeo_Type,ActionGeo_FullName,ActionGeo_CountryCode,ActionGeo_ADM1Code,ActionGeo_ADM2Code,ActionGeo_Lat,ActionGeo_Long,ActionGeo_FeatureID,DATEADDED,SOURCEURL
0,1040201631,20210420,202104,2021,2021.3014,AFGCVL,AFGHANISTAN,AFG,,,...,4,"Kabul, Kabol, Afghanistan",AF,AF13,3580,34.5167,69.1833,-3378435,20220420010000,https://www.pressherald.com/2022/04/19/blasts-...
1,1040201632,20210420,202104,2021,2021.3014,AUS,AUSTRALIA,AUS,,,...,4,"Canberra, Australian Capital Territory, Australia",AS,AS01,4940,-35.2833,149.2170,-1563952,20220420010000,https://www.lowyinstitute.org/the-interpreter/...
2,1040201633,20210420,202104,2021,2021.3014,AUS,AUSTRALIA,AUS,,,...,4,"Canberra, Australian Capital Territory, Australia",AS,AS01,4940,-35.2833,149.2170,-1563952,20220420010000,https://www.lowyinstitute.org/the-interpreter/...
3,1040201634,20210420,202104,2021,2021.3014,AUS,AUSTRALIA,AUS,,,...,1,Ukraine,UP,UP,,49.0000,32.0000,UP,20220420010000,https://www.lowyinstitute.org/the-interpreter/...
4,1040201635,20210420,202104,2021,2021.3014,CVL,NEIGHBORHOOD,,,,...,4,"Kabul, Kabol, Afghanistan",AF,AF13,3580,34.5167,69.1833,-3378435,20220420010000,https://www.pressherald.com/2022/04/19/blasts-...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
125665,1040366484,20220420,202204,2022,2022.3014,ZAF,SOUTH AFRICAN,ZAF,,,...,1,United Kingdom,UK,UK,,54.0000,-4.0000,UK,20220420211500,https://www.702.co.za/articles/443421/simplify...
125666,1040366485,20220420,202204,2022,2022.3014,ZAF,SOWETO,ZAF,,,...,4,"Soweto, Gauteng, South Africa",SF,SF06,77364,-26.2667,27.8667,-1285576,20220420211500,https://www.news24.com/news24/southafrica/news...
125667,1040366486,20220420,202204,2022,2022.3014,cho,CHOCTAW,,,cho,...,2,"Mississippi, United States",US,USMS,,32.7673,-89.6812,MS,20220420211500,https://www.mcalesternews.com/news/local_news/...
125668,1040366487,20220420,202204,2022,2022.3014,cho,CHOCTAW,,,cho,...,3,"Choctaw, Oklahoma, United States",US,USOK,OK109,35.4976,-97.2689,1091323,20220420211500,https://www.mcalesternews.com/news/local_news/...


Event data is highly denormalized, with many redundancies for ease of querying, and is coded using a hierarchical coding system called [CAMEO](http://data.gdeltproject.org/documentation/CAMEO.Manual.1.1b3.pdf) (Conflict and Mediation Event Observations).

In [6]:
events[['EventCode', 'CAMEOCodeDescription']]

Unnamed: 0,EventCode,CAMEOCodeDescription
0,190,"Use conventional military force, not specifie..."
1,057,Sign formal agreement
2,057,Sign formal agreement
3,057,Sign formal agreement
4,190,"Use conventional military force, not specifie..."
...,...,...
125665,130,"Threaten, not specified below"
125666,190,"Use conventional military force, not specifie..."
125667,017,Engage in symbolic act
125668,010,"Make statement, not specified below"
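
CAMEO codes nest by prefix: the first two digits give the root category, the first three give the base code, and a fourth digit, where present, gives the most specific code. The sketch below splits a code into those levels. `cameo_levels` is a hypothetical helper, not part of osiris, and it assumes codes are kept as strings so that leading zeros such as the one in `'057'` survive.

```python
def cameo_levels(event_code: str):
    """Split a CAMEO event code into (root, base, full) string prefixes."""
    if not (event_code.isdigit() and 2 <= len(event_code) <= 4):
        raise ValueError(f"unexpected CAMEO code: {event_code!r}")
    root = event_code[:2]   # two-digit root category, e.g. '19'
    base = event_code[:3]   # three-digit base code, e.g. '190'
    return root, base, event_code


# Codes taken from the output above.
for code in ["190", "057", "130", "017", "010"]:
    print(code, cameo_levels(code))
```

Filtering on a two-digit prefix selects a whole root category, while a three-digit prefix selects a single base code and its more specific children.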


We can query and filter event data directly using the Pandas dataframe:

In [7]:
# Find all events that were geolocated in Ukraine
uka_events = events[(events.ActionGeo_CountryCode == 'UP')]
uka_events

Unnamed: 0,GLOBALEVENTID,SQLDATE,MonthYear,Year,FractionDate,Actor1Code,Actor1Name,Actor1CountryCode,Actor1KnownGroupCode,Actor1EthnicCode,...,ActionGeo_Type,ActionGeo_FullName,ActionGeo_CountryCode,ActionGeo_ADM1Code,ActionGeo_ADM2Code,ActionGeo_Lat,ActionGeo_Long,ActionGeo_FeatureID,DATEADDED,SOURCEURL
3,1040201634,20210420,202104,2021,2021.3014,AUS,AUSTRALIA,AUS,,,...,1,Ukraine,UP,UP,,49.0000,32.0000,UP,20220420010000,https://www.lowyinstitute.org/the-interpreter/...
73,1040201704,20220420,202204,2022,2022.3014,,,,,,...,1,Ukraine,UP,UP,,49.0000,32.0000,UP,20220420010000,https://www.theguardian.com/australia-news/202...
74,1040201705,20220420,202204,2022,2022.3014,,,,,,...,1,Ukraine,UP,UP,,49.0000,32.0000,UP,20220420010000,https://www.theguardian.com/australia-news/202...
170,1040201801,20220420,202204,2022,2022.3014,AUS,AUSTRALIA,AUS,,,...,1,Ukraine,UP,UP,,49.0000,32.0000,UP,20220420010000,https://www.theguardian.com/australia-news/202...
179,1040201810,20220420,202204,2022,2022.3014,AUS,AUSTRALIA,AUS,,,...,1,Ukraine,UP,UP,,49.0000,32.0000,UP,20220420010000,https://www.theguardian.com/australia-news/202...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
125334,1040366153,20220420,202204,2022,2022.3014,UKR,UKRAINE,UKR,,,...,4,"Kyiv, Kyyiv, Misto, Ukraine",UP,UP12,28554,50.4333,30.5167,-1044367,20220420211500,https://www.msn.com/en-us/news/world/ukraine-r...
125339,1040366158,20220420,202204,2022,2022.3014,UKRCVL,UKRAINE,UKR,,,...,4,"Zalissya, Chernihivs'ka Oblast', Ukraine",UP,UP02,25037,51.8599,31.2676,11345657,20220420211500,https://www.mirror.co.uk/news/politics/our-hea...
125340,1040366159,20220420,202204,2022,2022.3014,UKRCVL,UKRAINE,UKR,,,...,4,"Zalissya, Chernihivs'ka Oblast', Ukraine",UP,UP02,25037,51.8599,31.2676,11345657,20220420211500,https://www.mirror.co.uk/news/politics/our-hea...
125595,1040366414,20220420,202204,2022,2022.3014,USABUS,NEW YORK,USA,,,...,1,Ukraine,UP,UP,,49.0000,32.0000,UP,20220420211500,https://apnews.com/press-release/business-wire...


So about 50K of the 700K events last week were coded as happening in Ukraine, which is not surprising given recent events. Many of those relate to the use of military force.

In [8]:
# CAMEO code 190 denotes 'Use conventional military force, not specified below'
uka_events[uka_events.EventCode.str.startswith('190')]

Unnamed: 0,GLOBALEVENTID,SQLDATE,MonthYear,Year,FractionDate,Actor1Code,Actor1Name,Actor1CountryCode,Actor1KnownGroupCode,Actor1EthnicCode,...,ActionGeo_Type,ActionGeo_FullName,ActionGeo_CountryCode,ActionGeo_ADM1Code,ActionGeo_ADM2Code,ActionGeo_Lat,ActionGeo_Long,ActionGeo_FeatureID,DATEADDED,SOURCEURL
812,1040202443,20220420,202204,2022,2022.3014,RUS,RUSSIA,RUS,,,...,1,Ukraine,UP,UP,,49.0000,32.0000,UP,20220420010000,https://jp.reuters.com/jp.reuters.com/news/pic...
853,1040202484,20220420,202204,2022,2022.3014,UKR,UKRAINIAN,UKR,,,...,4,"Kyiv, Kyyiv, Misto, Ukraine",UP,UP12,28554,50.4333,30.5167,-1044367,20220420010000,https://www.necn.com/news/national-internation...
863,1040202494,20220420,202204,2022,2022.3014,UKR,UKRAINE,UKR,,,...,4,"Kharkiv, Kharkivs'ka Oblast', Ukraine",UP,UP07,25036,49.9808,36.2527,-1041320,20220420010000,https://www.necn.com/news/national-internation...
865,1040202496,20220420,202204,2022,2022.3014,UKR,UKRAINE,UKR,,,...,4,"Kramatorsk, Donets'ka Oblast', Ukraine",UP,UP05,28549,48.7230,37.5563,-1043300,20220420010000,https://www.indystar.com/story/news/politics/2...
1118,1040202749,20220420,202204,2022,2022.3014,USAMED,ASSOCIATED PRESS,USA,,,...,4,"Kharkiv, Kharkivs'ka Oblast', Ukraine",UP,UP07,25036,49.9808,36.2527,-1041320,20220420010000,https://www.necn.com/news/national-internation...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
119663,1040373136,20220420,202204,2022,2022.3014,RUS,RUSSIAN,RUS,,,...,4,"Vadym, Khersons'ka Oblast', Ukraine",UP,UP08,28553,46.1827,33.5971,-1057325,20220420221500,https://www.news8000.com/i/elderly-in-ukraine-...
119744,1040373217,20220420,202204,2022,2022.3014,UKR,UKRAINE,UKR,,,...,4,"Chernihiv, Chernihivs'ka Oblast', Ukraine",UP,UP02,28554,51.5055,31.2849,-1037057,20220420221500,http://www.msn.com/en-us/news/world/a-bomb-sni...
119819,1040373292,20220420,202204,2022,2022.3014,USA,UNITED STATES,USA,,,...,4,"Kyiv, Kyyiv, Misto, Ukraine",UP,UP12,28554,50.4333,30.5167,-1044367,20220420221500,http://www.msn.com/en-us/news/world/as-a-new-u...
124159,1040382951,20220420,202204,2022,2022.3014,UKR,UKRAINIAN,UKR,,,...,1,Ukraine,UP,UP,,49.0000,32.0000,UP,20220420234500,https://www.agassizharrisonobserver.com/news/a...
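
The two filters used above can be combined into a single boolean mask. The sketch below runs on a small made-up frame that reuses the GDELT column names. It also contrasts the prefix `'190'` with the shorter prefix `'19'`, which would select the whole root category rather than one base code. Passing `na=False` is a defensive choice, not something the notebook requires: it makes missing codes count as non-matches.

```python
import pandas as pd

# Illustrative rows only; column names match the GDELT export shown above.
toy = pd.DataFrame({
    "ActionGeo_CountryCode": ["UP", "UP", "AS", "UP", None],
    "EventCode": ["190", "057", "190", "193", "190"],
})

in_ukraine = toy["ActionGeo_CountryCode"] == "UP"
# na=False makes missing codes count as non-matches instead of NaN.
code_190 = toy["EventCode"].str.startswith("190", na=False)
root_19 = toy["EventCode"].str.startswith("19", na=False)

exact = toy[in_ukraine & code_190]       # only base code 190
whole_root = toy[in_ukraine & root_19]   # every code under root 19
print(len(exact), len(whole_root))
```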


In [9]:
# Import Folium to plot these military force events on a map
import folium
folium.Map(
    location=[48., 31.], 
    tiles="Stamen Toner",
    zoom_start=6
)

In [10]:
uka_map = folium.Map(
    location=[48., 31.], 
    #tiles="Stamen Toner",
    zoom_start=6
)
# Plot a random sample of 100 military force events so the map stays readable
uka_events_sample = uka_events[uka_events.EventCode.str.startswith('190')].sample(n=100)
for r in uka_events_sample.itertuples():
    m = folium.Marker(location=[r.ActionGeo_Lat, r.ActionGeo_Long],
                      icon=folium.Icon(color="red", icon="fire", prefix="glyphicon"),
                      tooltip=str(r.Actor1CountryCode) + '->' + str(r.EventCode) + ' ' +  str(r.CAMEOCodeDescription) + '->' + str(r.Actor2CountryCode) +' on ' + str(r.SQLDATE)
                     )
    m.add_to(uka_map)
uka_map
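
Two things can make the plotting cell above fail on other date ranges: `sample(n=100)` raises an error when fewer than 100 rows match, and markers need real coordinates. The sketch below guards against both. `plottable_sample` is a hypothetical helper, not part of osiris, and the fixed `seed` is an added assumption to make the sample repeatable.

```python
import pandas as pd


def plottable_sample(df: pd.DataFrame, n: int = 100, seed: int = 0) -> pd.DataFrame:
    """Drop rows without coordinates, then sample at most n of the rest."""
    located = df.dropna(subset=["ActionGeo_Lat", "ActionGeo_Long"])
    return located.sample(n=min(n, len(located)), random_state=seed)


# Made-up coordinates; the second row has no latitude and is dropped.
toy = pd.DataFrame({
    "ActionGeo_Lat": [49.0, None, 50.4333],
    "ActionGeo_Long": [32.0, 30.5167, 30.5167],
})
print(len(plottable_sample(toy)))
```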

In [11]:
from data.etl import shape_events_vertices
events, actors = shape_events_vertices(uka_events_sample)
actors

Hashing Actor1 ID:   0%|          | 0/100 [00:00<?, ?row/s]

Hashing Actor2 ID:   0%|          | 0/100 [00:00<?, ?row/s]

Creating Action ADM Code:   0%|          | 0/100 [00:00<?, ?row/s]

Unnamed: 0,Actor1Code,Actor1CountryCode,Actor1EthnicCode,Actor1Geo_ADM1Code,Actor1Geo_ADM2Code,Actor1Geo_CountryCode,Actor1Geo_FeatureID,Actor1Geo_FullName,Actor1Geo_Lat,Actor1Geo_Long,...,Actor2Geo_Long,Actor2Geo_Type,Actor2KnownGroupCode,Actor2Name,Actor2Religion1Code,Actor2Religion2Code,Actor2Type1Code,Actor2Type2Code,Actor2Type3Code,ActorID
75003,,,,,,,,,,,...,,,,,,,,,,W0HHb85znmKuOl4Tlp4K0WTVzEM=uFCOFAO82AEj6kQpaJ...
91186,,,,,,,,,,,...,,,,,,,,,,TlcOhTTlcyUPpktYpSNuCX4JNl4=d8jLfPi7e5+snSqEme...
62244,,,,,,,,,,,...,,,,,,,,,,475FxGnCxzbdiYaeXhYk4yBehtg=9ThWU8IVo6Nmv2iotz...
21247,,,,,,,,,,,...,,,,,,,,,,YcpY3LzjL+TA6uo7Hr6ODop8fFM=UYp/ImYs2z3kPsZS2Z...
104404,,,,,,,,,,,...,,,,,,,,,,TVWD1trCwPmpvp+F5fCpSKpoTo0=kXm/oYrTzSd1/32J2y...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
35923,,,,,,,,,,,...,,,,,,,,,,Rpzr0WcuTApnHvKAcmOowWFuPAs=KWMeiqCpVFcd8W3cJW...
20021,,,,,,,,,,,...,,,,,,,,,,JXQ6n2gyvMEzXAv9EaNn+kDW6YE=tlifxqsNyCzxIJnRwt...
2120,,,,,,,,,,,...,,,,,,,,,,v8sp9cq+wHnSUTOWTLwsFqxj6pY=VoLgw6bQZmth1LSzXf...
109420,,,,,,,,,,,...,,,,,,,,,,tlifxqsNyCzxIJnRwtQKuZToQQw=mfcCY+Az9orsQeCQ+U...
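
The progress bars above show that actor IDs are produced by hashing, but this notebook does not show how osiris computes them. As a rough illustration of the general idea only, the sketch below derives a stable ID from an actor's attributes using SHA-1 and base64. That combination is an assumption, chosen because it yields 28-character strings similar in shape to the values in the `ActorID` column. `actor_id` is a hypothetical helper.

```python
import base64
import hashlib


def actor_id(*fields) -> str:
    """Hypothetical helper: derive a stable ID from an actor's attributes."""
    # Missing values become empty strings so the key is always well defined.
    key = "|".join("" if f is None else str(f) for f in fields)
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return base64.b64encode(digest).decode("ascii")


a = actor_id("UKR", "UKRAINE", "UKR")
b = actor_id("UKR", "UKRAINE", "UKR")
c = actor_id("RUS", "RUSSIA", "RUS")
print(a == b, a == c, len(a))
```

Hashing the attributes means the same actor always maps to the same vertex ID, which is what allows events from different rows to be linked through shared actor vertices.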


In [14]:
events.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 100 entries, 75003 to 102600
Data columns (total 45 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   GLOBALEVENTID          100 non-null    int64         
 1   Actor1ID               100 non-null    object        
 2   Actor2ID               100 non-null    object        
 3   Date                   100 non-null    datetime64[ns]
 4   IsRoot                 100 non-null    bool          
 5   MonthYear              100 non-null    int64         
 6   Year                   100 non-null    int64         
 7   FractionDate           100 non-null    float64       
 8   Actor2Code             88 non-null     object        
 9   Actor2Name             88 non-null     object        
 10  Actor2CountryCode      69 non-null     object        
 11  Actor2KnownGroupCode   0 non-null      object        
 12  Actor2EthnicCode       1 non-null      object        
 13

In [13]:
# Uncomment and run below if running inside Colab and you want to pull env variables from a file called vars.env on your GDrive
# !pip install colab-env --upgrade
# import colab_env