# osiris GDELT data

This notebook describes the [GDELT project](https://www.gdeltproject.org/) data that osiris uses and shows how to import it with osiris, either from the GDELT file server or from Google BigQuery.

*From the GDELT website*:
>The GDELT Project is a realtime network diagram and database of global human society for open research.
![gf](https://www.gdeltproject.org/images/spinningglobe.gif)
>The GDELT Project is an initiative to construct a catalog of human societal-scale behavior and beliefs across all countries of the world, connecting every person, organization, location, count, theme, news source, and event across the planet into a single massive network that captures what's happening around the world, what its context is and who's involved, and how the world is feeling about it, every single day.

In [1]:
# Import the osiris code and set the runtime env 
import os, sys
sys.path.append(os.path.join('..', 'osiris'))
sys.path.append(os.path.join('..', 'ext'))
from osiris_global import set_runtime_env
set_runtime_env(interactive_nb=True)

## GDELT Event Data

The GDELT [event data](http://data.gdeltproject.org/documentation/GDELT-Event_Codebook-V2.0.pdf) contains hundreds of millions of automatically coded events, extracted daily from news stories. osiris lets you pull this data directly from the GDELT file server. The advantage of this method is that you don't need any special credentials or server access (remember, we want *open-source* indicators). All the data is downloaded directly to your client machine.

In [2]:
# Import data directly from GDELT file server
from data.gdelt import DataSource
gdelt = DataSource()

In [3]:
# Get event data for a 1 week period
events = gdelt.import_data('events', 'Apr-14-2022', 'Apr-20-2022')

Importing GDELT events data for 7 day(s) from 04-14-2022 to 04-20-2022...


Importing GDELT events data for 7 day(s) from 04-14-2022 to 04-20-2022 completed in 70.02 s.
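
The log reports 7 days for Apr-14 to Apr-20 because both endpoints are counted. The following is a minimal sketch of that arithmetic using only the standard library. The `%b-%d-%Y` format is inferred from the example strings, and `inclusive_days` is a hypothetical helper, not part of osiris, whose internal date handling is not shown in this notebook.

```python
from datetime import datetime, timedelta

# Assumed format, inferred from strings such as 'Apr-14-2022'.
DATE_FORMAT = "%b-%d-%Y"


def inclusive_days(start, end):
    """Return every calendar day from start to end, both endpoints included."""
    first = datetime.strptime(start, DATE_FORMAT).date()
    last = datetime.strptime(end, DATE_FORMAT).date()
    if last < first:
        raise ValueError("end date precedes start date")
    return [first + timedelta(days=n) for n in range((last - first).days + 1)]


days = inclusive_days("Apr-14-2022", "Apr-20-2022")
print(len(days), days[0].strftime("%m-%d-%Y"), days[-1].strftime("%m-%d-%Y"))
# prints: 7 04-14-2022 04-20-2022
```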


In [4]:
events.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 707186 entries, 0 to 125669
Data columns (total 62 columns):
 #   Column                 Non-Null Count   Dtype  
---  ------                 --------------   -----  
 0   GLOBALEVENTID          707186 non-null  int64  
 1   SQLDATE                707186 non-null  int64  
 2   MonthYear              707186 non-null  int64  
 3   Year                   707186 non-null  int64  
 4   FractionDate           707186 non-null  float64
 5   Actor1Code             640700 non-null  object 
 6   Actor1Name             640700 non-null  object 
 7   Actor1CountryCode      408112 non-null  object 
 8   Actor1KnownGroupCode   9610 non-null    object 
 9   Actor1EthnicCode       3423 non-null    object 
 10  Actor1Religion1Code    10452 non-null   object 
 11  Actor1Religion2Code    2561 non-null    object 
 12  Actor1Type1Code        296023 non-null  object 
 13  Actor1Type2Code        19713 non-null   object 
 14  Actor1Type3Code        495 non-null 

A week of 2022 event data takes up roughly 400 MB of RAM.
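
To check the memory footprint on your own machine, pandas can report it directly. The sketch below uses a small synthetic frame as a stand-in for `events` (the values are made up); the same two calls should work on the real DataFrame.

```python
import pandas as pd

# Synthetic stand-in for the events frame; values are made up for illustration.
sample = pd.DataFrame({
    "GLOBALEVENTID": [1, 2, 3],
    "Actor1Name": ["CANADA", "CHINA", None],
    "SOURCEURL": [
        "https://example.com/a",
        "https://example.com/b",
        "https://example.com/c",
    ],
})

# The shallow figure counts only the array buffers; deep=True also counts
# the Python string objects held in object columns.
shallow_bytes = int(sample.memory_usage(index=True).sum())
deep_bytes = int(sample.memory_usage(index=True, deep=True).sum())

print(f"shallow: {shallow_bytes} B, deep: {deep_bytes} B")

# info() can print the deep figure alongside the column summary.
sample.info(memory_usage="deep")
```

Because most GDELT columns are strings, the deep figure is the more realistic estimate of actual RAM use.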

In [5]:
events

Unnamed: 0,GLOBALEVENTID,SQLDATE,MonthYear,Year,FractionDate,Actor1Code,Actor1Name,Actor1CountryCode,Actor1KnownGroupCode,Actor1EthnicCode,...,ActionGeo_Type,ActionGeo_FullName,ActionGeo_CountryCode,ActionGeo_ADM1Code,ActionGeo_ADM2Code,ActionGeo_Lat,ActionGeo_Long,ActionGeo_FeatureID,DATEADDED,SOURCEURL
0,1039303078,20210414,202104,2021,2021.2849,CAN,CANADA,CAN,,,...,4,"Port Elgin, Ontario, Canada",CA,CA08,12643,44.4333,-81.38330,-571576,20220414014500,https://www.lakeshoreadvance.com/news/local-ne...
1,1039303079,20210414,202104,2021,2021.2849,CAN,CANADA,CAN,,,...,4,"Port Elgin, Ontario, Canada",CA,CA08,12643,44.4333,-81.38330,-571576,20220414014500,https://www.lakeshoreadvance.com/news/local-ne...
2,1039303080,20210414,202104,2021,2021.2849,CHN,CHINA,CHN,,,...,4,"Shanghai, Shanghai, China",CH,CH23,13243,31.2222,121.45800,-1924465,20220414014500,https://news.yahoo.com/zealand-court-rules-all...
3,1039303081,20210414,202104,2021,2021.2849,CVL,SCIENTIST,,,,...,4,"Paris, France (general), France",FR,FR00,16282,48.8667,2.33333,-1456928,20220414014500,http://www.jordantimes.com/news/features/first...
4,1039303082,20210414,202104,2021,2021.2849,MNCUSAMED,GOOGLE,USA,,,...,2,"California, United States",US,USCA,,36.1700,-119.74600,CA,20220414014500,https://menafn.com/1104016162/Google-to-invest...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
125665,1040383216,20220420,202204,2022,2022.3014,cre,CREE,,,cre,...,0,,,,,,,,20220420234500,https://www.cjvr.com/2022/04/20/first-nations-...
125666,1040383217,20220420,202204,2022,2022.3014,cre,CREE,,,cre,...,0,,,,,,,,20220420234500,https://www.cjvr.com/2022/04/20/first-nations-...
125667,1040383218,20220420,202204,2022,2022.3014,cre,CREE,,,cre,...,0,,,,,,,,20220420234500,https://www.cjvr.com/2022/04/20/first-nations-...
125668,1040383219,20220420,202204,2022,2022.3014,telOPP,TELUGU,,,tel,...,0,,,,,,,,20220420234500,https://www.deccanchronicle.com/nation/politic...


A week's worth of event data contains approximately 700K events. Each event is highly denormalized for ease of querying and is coded using a hierarchical coding system called [CAMEO](http://data.gdeltproject.org/documentation/CAMEO.Manual.1.1b3.pdf) (Conflict and Mediation Event Observations).

In [6]:
events[['EventCode', 'CAMEOCodeDescription']]

Unnamed: 0,EventCode,CAMEOCodeDescription
0,012,Make pessimistic comment
1,020,"Appeal, not specified below"
2,0213,Appeal for judicial cooperation
3,043,Host a visit
4,0311,Express intent to cooperate economically
...,...,...
125665,060,"Engage in material cooperation, not spec below"
125666,073,Provide humanitarian aid
125667,090,"Investigate, not specified below"
125668,043,Host a visit
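
The hierarchy shows up in the codes themselves: the first two digits give the root category, the first three give the base code, and a fourth digit, when present, gives the most specific level. The following is a minimal sketch of splitting a code into those levels. `cameo_levels` is a hypothetical helper, not part of osiris. Codes must stay strings, since converting them to integers would drop the leading zeros.

```python
def cameo_levels(event_code):
    """Split a CAMEO event code string into its hierarchy levels."""
    code = str(event_code).strip()
    if not code.isdigit() or not 2 <= len(code) <= 4:
        raise ValueError(f"not a 2- to 4-digit CAMEO code: {event_code!r}")
    return {
        "root": code[:2],                            # top-level category
        "base": code[:3] if len(code) >= 3 else None,  # three-digit level
        "full": code,                                # code as recorded
    }


print(cameo_levels("0213"))
# prints: {'root': '02', 'base': '021', 'full': '0213'}
print(cameo_levels("043"))
# prints: {'root': '04', 'base': '043', 'full': '043'}
```

In the sample output above, `020` ("Appeal, not specified below") and `0213` ("Appeal for judicial cooperation") share the root `02`, so grouping on the root level collects all appeal events.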


We can query and filter the event data directly through the pandas DataFrame:

In [18]:
# Find all events located in Ukraine
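# case=False makes the match case-insensitive; na=False treats rows with a missing location as non-matches
events[events.ActionGeo_FullName.str.contains('UKRAINE', case=False, na=False)]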

Unnamed: 0,GLOBALEVENTID,SQLDATE,MonthYear,Year,FractionDate,Actor1Code,Actor1Name,Actor1CountryCode,Actor1KnownGroupCode,Actor1EthnicCode,...,ActionGeo_Type,ActionGeo_FullName,ActionGeo_CountryCode,ActionGeo_ADM1Code,ActionGeo_ADM2Code,ActionGeo_Lat,ActionGeo_Long,ActionGeo_FeatureID,DATEADDED,SOURCEURL
79,1039303157,20220414,202204,2022,2022.2849,,,,,,...,4,"Gostomel, Kyyivs'ka Oblast', Ukraine",UP,UP13,28554,50.5789,30.2622,-1039545,20220414014500,http://www.jordantimes.com/news/world/ukraine-...
83,1039303161,20220414,202204,2022,2022.2849,,,,,,...,1,Ukraine,UP,UP,,49.0000,32.0000,UP,20220414014500,https://menafn.com/1104018154/World-on-the-bri...
87,1039303165,20220414,202204,2022,2022.2849,,,,,,...,4,"Kiev, Ukraine (general), Ukraine",UP,UP00,28554,50.4333,30.5167,-1044367,20220414014500,https://insidethevatican.com/news/newsflash/le...
119,1039303197,20220414,202204,2022,2022.2849,,,,,,...,1,Ukraine,UP,UP,,49.0000,32.0000,UP,20220414014500,http://dagblog.com/comment/316219
121,1039303199,20220414,202204,2022,2022.2849,,,,,,...,1,Ukraine,UP,UP,,49.0000,32.0000,UP,20220414014500,http://dagblog.com/comment/316219
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
125624,1040383175,20220420,202204,2022,2022.3014,USAGOV,THE WHITE HOUSE,USA,,,...,1,Ukraine,UP,UP,,49.0000,32.0000,UP,20220420234500,https://www.hellenicshippingnews.com/u-s-crude...
125625,1040383176,20220420,202204,2022,2022.3014,USAGOV,JOE BIDEN,USA,,,...,1,Ukraine,UP,UP,,49.0000,32.0000,UP,20220420234500,https://www.agassizharrisonobserver.com/news/t...
125648,1040383199,20220420,202204,2022,2022.3014,VAT,VATICAN,VAT,,,...,1,Ukraine,UP,UP,,49.0000,32.0000,UP,20220420234500,http://www.icatholic.org/article/christs-resur...
125651,1040383202,20220420,202204,2022,2022.3014,VAT,VATICAN,VAT,,,...,1,Ukraine,UP,UP,,49.0000,32.0000,UP,20220420234500,http://www.icatholic.org/article/christs-resur...
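
Every row in this result carries `UP` in `ActionGeo_CountryCode`, so an exact match on that column is an alternative to searching the free-text location name. The sketch below compares the two filters on a small synthetic frame (the rows are made up, with column names and the `UP` code taken from the output above). Whether both filters select identical rows on the full dataset is an assumption worth checking.

```python
import pandas as pd

# Synthetic rows shaped like the events frame; values are illustrative only.
sample = pd.DataFrame({
    "GLOBALEVENTID": [1, 2, 3, 4],
    "ActionGeo_FullName": [
        "Kiev, Ukraine (general), Ukraine",
        None,
        "Ukraine",
        "Paris, France (general), France",
    ],
    "ActionGeo_CountryCode": ["UP", None, "UP", "FR"],
})

# Substring match on the location name; rows with no location do not match.
by_name = sample[
    sample["ActionGeo_FullName"].str.contains("ukraine", case=False, na=False)
]

# Exact match on the country code; missing codes compare as False.
by_code = sample[sample["ActionGeo_CountryCode"] == "UP"]

print(by_name["GLOBALEVENTID"].tolist())
print(by_code["GLOBALEVENTID"].tolist())
```

The code comparison avoids regular-expression matching on every row and cannot accidentally match a place name that merely contains the search string.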
