# ConflictWiki Data v0.1

We introduce the ConflictWiki dataset, a large collection of [Wikipedia articles on armed conflict](https://en.wikipedia.org/wiki/Category:Conflicts) featuring full text accompanied by pre-computed Longformer representations of the text at the document and section level.
Each conflict is annotated with start and end date, location, casualty counts, group strengths and conflict outcome.
Where provided on Wikipedia, entity articles feature auxiliary tabular information on languages, religion, ISO2 code and ideology.


## Data Directory Structure

The data directory contains three subdirectories: `conflict`, `entity` and `mappings`.
`conflict` contains conflict-related data and information.
`entity` contains entity-related data and information.
`mappings` contains the conflict-entity and entity-entity relationships.
We provide most data in multiple formats (json, csv, pickle).
If a directory holds multiple files, they all contain the same content.
The contents of each subdirectory are explained below.

```
|   data
    |   conflict
        |   data
                | embedding
                | text
        |   info
    |   entity
        |   data
                | embedding
                | text
        |   info
    |   mappings
        |   conflict_entity_id
        |   ally_enemy_pairs
        |   network
                |   aggr_edge_list
                |   node_list
```
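Since most data ships in several formats, a small loader that picks the reader from the file suffix can be handy. The following is a minimal sketch, not part of the dataset's tooling: the helper name `load_any` is hypothetical, and the suffixes `.json` and `.csv` are assumptions (only `.pkl` file names appear in this document). The pickles may also require pandas to be installed, since the cells below read them with `pd.read_pickle`.

```python
import csv
import json
import pickle
from pathlib import Path


def load_any(path):
    """Load a data file, choosing the reader from the file suffix (hypothetical helper)."""
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix in {".pkl", ".pickle"}:
        with path.open("rb") as fh:
            return pickle.load(fh)
    if suffix == ".json":
        with path.open("r", encoding="utf-8") as fh:
            return json.load(fh)
    if suffix == ".csv":
        # Each row becomes a dict keyed by the header line; all values are strings.
        with path.open("r", encoding="utf-8", newline="") as fh:
            return list(csv.DictReader(fh))
    raise ValueError(f"Unsupported file format: {suffix!r}")
```

Note that JSON object keys are always strings, so ids loaded from a json file would need converting to `int` before they can be compared with ids loaded from a pickle.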

### conflict

##### data

Contains the retrieved conflict articles.

- `embedding` holds the articles' Longformer representations as a pickled dictionary.
`conflict_id` is the conflict id; `entity_id` is the id of an entity mentioned in the conflict article;
`embeddings` are the representations of the relevant entities in the article.
We use the key `0` to represent the entire conflict article as a whole, irrespective of entities.

    ```
    {conflict_id: {entity_id: embeddings}}
    ```

- `text` is the raw full text, keyed by section title.

    ```
    {conflict_id: {section_title: text}}
    ```
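
To illustrate the role of the key `0`, here is a minimal sketch that separates the whole-article representation from the per-entity ones. It assumes the embedding file is a nested dictionary as described above; the helper name `split_embeddings` is hypothetical, and the short lists stand in for the real vectors, whose type and size are not specified here.

```python
def split_embeddings(conflict_embeddings):
    """Separate the whole-article representation (key 0) from the per-entity ones."""
    document = conflict_embeddings.get(0)
    entities = {eid: emb for eid, emb in conflict_embeddings.items() if eid != 0}
    return document, entities


# Toy stand-in for the pickled dictionary; the vectors are made up for illustration.
toy_embeddings = {22738: {0: [0.1, 0.2], 27019: [0.3, 0.4], 103100: [0.5, 0.6]}}

document, entities = split_embeddings(toy_embeddings[22738])
print(document)          # representation of the article as a whole
print(sorted(entities))  # ids of the entities with their own representation
```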

##### info

Contains all information on the conflicts as extracted from the [Wikipedia military conflict infobox template](https://en.wikipedia.org/wiki/Template:Infobox_military_conflict).

Fields: `conflict_id`, `conflict_name`, `place`, `date`, `date_start`, `date_end`, `n_belligerents`, `n_entities`, `strength`, `strength_num`, `casualties`, `casualties_num`, `commander`, `commander_num`, `result`. <br>
`date` holds the date as a string; `date_start` and `date_end` are extracted from `date` and are datetime objects; <br>
`n_belligerents` and `n_entities` state the number of belligerents and entities involved in the conflict; <br>
`strength` and `casualties` hold the estimated group strengths and losses per belligerent (dictionary keys are belligerent indices); <br>
`strength_num` and `casualties_num` are the numbers extracted from `strength` and `casualties` (dictionary keys are belligerent indices); <br>
`commander` lists all military commanders per belligerent (dictionary keys are belligerent indices); <br>
`result` is a brief textual summary of the outcome. <br>


### load external libraries

```python
import pandas as pd
import pathlib
```

### Conflicts


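```python
# The data directory is assumed to be the parent of the current working directory
conflict_info_file = pathlib.Path().absolute().parent.absolute() / "conflict" / "info" / "conflict_info.pkl"
# The pickle holds a dictionary keyed by conflict id; each key becomes one row
conflict_info = pd.DataFrame.from_dict(pd.read_pickle(conflict_info_file), orient='index')
conflict_info.index.name = "conflict_id"
conflict_info
```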

| conflict_id | conflict_name | n_belligerents | n_entities | place | date | date_start | date_end | status | casualties | casualties_num | casualties_sum | strength | strength_num | strength_sum | commander | result |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 22738 | Operation Enduring Freedom | 2 | 64 | Afghanistan, Philippines, Somalia, Georgia cou... | 7 October 2001 - 28 December 2014 | 2001-10-07 | 2014-12-28 | Operation ended; Conflict ongoing War in Afgh... | {0: ' 45,000+ killed 2,438 killed(2,414 in Afg... | {0: [45000.0, 2438.0, 2414.0, 17.0, 5.0, 2.0, ... | {0: 51079.0, 1: 83632.0} | {0: '', 1: ''} | {} | {} | {0: ['MRAF', 'Martin Dempsey', 'Graham Stirrup... | |
| 43063 | Operation Anaconda | 2 | 16 | Shahi Kot Valley, Paktia Province, Islamic Sta... | March 1-18, 2002 | 2022-03-01 | 2021-03-21 | | {0: '7 Afghan fighters killed 8 (7 in the Bat... | {0: [7.0, 8.0, 7.0, 2015.0, 66.0], 1: [500.0, ... | {0: 2103.0, 1: 3382.0} | {0: '30,000', 1: '1,000'} | {0: [30000.0], 1: [1000.0]} | {0: 30000.0, 1: 1000.0} | {0: ['Australia', 'United States', 'Franklin L... | Coalition victory, Taliban evacuates but suffe... |
| 46216 | Israeli–Palestinian conflict | 2 | 7 | Middle EastPrimarily in Israel, West Bank, Gaz... | Mid-20th century - presentMain phase: 1964-1993 | 2021-03-21 | 2021-03-21 | Israeli–Palestinian peace proceslow-level figh... | {0: '', 1: '', 2: '21,500 casualties (1965–201... | {2: [21500.0, 1965.0, 2013.0, 1946.0, 2012.0, ... | {2: 35439.0} | {0: '', 1: ''} | {} | {} | {0: [], 1: []} | |
| 106346 | Battle of Mogadishu (1993) | 2 | 6 | Mogadishu, Somalia | | 2021-03-21 | | | {0: ' 19 kille73 wounde1 captured (later relea... | {0: [19.0, 73.0, 1.0, 2.0, 3.0, 1.0, 7.0, 1.0,... | {0: 109.0} | {0: '', 1: '2,000–4,000'} | {1: [2000.0, 4000.0]} | {1: 6000.0} | {0: ['Gary L. Harrell', 'William F. Garrison',... | Pyrrhic tactical USUN victory Strategic Somali... |
| 201936 | 2003 invasion of Iraq | 2 | 17 | Iraq | 20 March - 1 May 2003 | 2022-03-20 | 2003-05-01 | | {0: 'Coalition: 214 killed606 wounded (U.S.)Pe... | {0: [214.0, 606.0, 24.0, 238.0, 1000.0], 1: [3... | {0: 2082.0, 1: 172996.0, 2: 14769.0} | {0: ': 192,000 personnel: 45,000 troops : 2,00... | {0: [192000.0, 45000.0, 2000.0, 194.0, 70000.0... | {0: 619628.0, 1: 4015761.0} | {0: ['Jalal Talabani', 'Australia', 'Tommy Fra... | Coalition operational success Ba'athist Iraq ... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 64786838 | RFDG Insurgency | 2 | 3 | Guinea | 2 September 2000 - 9 March 2001 | 2000-09-02 | 2001-03-09 | | {2: '649 – 1,500 killed100,000 – 350,000 displ... | {2: [649.0, 1500.0, 100000.0, 350000.0]} | {2: 452149.0} | {0: 'Young Volunteers: 7,000–30,000', 1: 'RFDG... | {0: [7000.0, 30000.0], 1: [1800.0, 5000.0]} | {0: 37000.0, 1: 6800.0} | {} | |
| 65268074 | Second Battle of Khara | 2 | 5 | Khara, Nepal; Khara, Rukum District | 7-8 April 2005 | 2021-03-21 | 2005-04-08 | | {0: '3 dead', 1: '≈ 300 dead'} | {0: [3.0], 1: [300.0]} | {0: 3.0, 1: 300.0} | {0: ' ≈ 166 troops including 25 Nepal Police p... | {0: [166.0, 25.0], 1: [6000.0]} | {0: 191.0, 1: 6000.0} | {0: ['Pyar Jung Thapa'], 1: ['Nanda Kishor Pun... | Royal Nepalese Army Victory |
| 65431221 | 2020 Nagorno-Karabakh war | 2 | 9 | Nagorno-Karabakh and Armenian-occupied territo... | {{nowrap; - }} | 2021-03-21 | 2021-03-21 | | {0: 'Per Azerbaijan: * 2,854 servicemen killed... | {0: [2854.0, 18.0, 2021.0, 2855.0, 12.0, 2020.... | {0: 9780.0, 1: 5519.0} | {0: '{{plainlist\| * Unknown regular military *... | {0: [2580.0], 1: []} | {0: 2580.0, 1: 0} | {0: ['Hikmat Hasanov', 'Mais Barkhudarov', 'Hi... | Azerbaijani victory 2020 Nagorno-Karabakh cea... |
| 65548110 | December 2007 Turkish incursion into northern ... | 2 | 2 | northern Iraq | December 16, 2007 - December 26, 2007 | 2007-12-16 | 2007-12-26 | | {0: 'None', 1: '200 killed, thousands wounded ... | {1: [200.0], 2: [1800.0, 10.0, 1.0]} | {1: 200.0, 2: 1811.0} | {0: '52 war planes', 1: '2.320-2.640 soldiers ... | {0: [52.0], 1: [2320.0]} | {0: 52.0, 1: 2320.0} | {0: [], 1: []} | Turkish victory |
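
Judging from the rows shown, `casualties_sum` and `strength_sum` appear to be the per-belligerent sums of the lists in `casualties_num` and `strength_num`. The sketch below reproduces that relationship under this assumption; the helper name `sum_per_belligerent` is hypothetical, and the input values are copied from the Second Battle of Khara row.

```python
def sum_per_belligerent(numbers_by_belligerent):
    """Sum the extracted numbers for each belligerent index."""
    return {idx: float(sum(values)) for idx, values in numbers_by_belligerent.items()}


# Values taken from the "Second Battle of Khara" row (conflict_id 65268074).
casualties_num = {0: [3.0], 1: [300.0]}
strength_num = {0: [166.0, 25.0], 1: [6000.0]}

casualties_sum = sum_per_belligerent(casualties_num)
strength_sum = sum_per_belligerent(strength_num)
print(casualties_sum)  # {0: 3.0, 1: 300.0}
print(strength_sum)    # {0: 191.0, 1: 6000.0}
```

Some extracted values may be years rather than counts (for example `2015.0` in the Operation Anaconda row), so the sums are probably best treated as rough indicators.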


### entity

##### data

Contains the retrieved entity articles.

- `embedding` holds the articles' Longformer representations as a pickled dictionary.
`entity_id` is the entity id; `section_title` is the title of the section;
`embeddings` are the representations of the sections in the entity article.

    ```
    {entity_id: {section_title: embeddings}}
    ```

- `text` is the raw full text, keyed by section title.

    ```
    {entity_id: {section_title: text}}
    ```

##### info

Contains all information on the entities as extracted from the infobox of the Wikipedia article (see [Mali](https://en.wikipedia.org/wiki/Mali) for an example).

Fields: `entity_id`, `entity_name`, `num_conflicts`, `iso`, `language`, `religion`, `ideology`. <br>
`num_conflicts` gives the number of conflicts the entity is involved in. <br>

```python
entity_info_file = pathlib.Path().absolute().parent.absolute() / "entity" / "info" / "entity_info.pkl"
# The pickle holds a dictionary keyed by entity id; each key becomes one row
entity_info = pd.DataFrame.from_dict(pd.read_pickle(entity_info_file), orient='index')
entity_info.index.name = "entity_id"
entity_info
```

| entity_id | entity_name | num_conflicts | iso | language | ideology | religion |
|---|---|---|---|---|---|---|
| 27019 | South Korea | 6 | KR | [Korean Sign Language, Pyojun-eo, Korean, Sout... | [] | [No religion, Irreligion] |
| 1737554 | Transit Center at Manas | 1 | [] | [] | [] | [] |
| 17207794 | Djibouti | 3 | DJ | [French language, Arabic language, Arabic, Fre... | [] | [Islam] |
| 4913064 | New Zealand | 6 | NZ | [English, New Zealand English] | [] | [No religion, Buddhism, Irreligion, Islam, Hin... |
| 23235 | Pakistan | 29 | PK | [English, Urdu, English language, Kashmiri lan... | [] | [Islam, Islam in Pakistan, State religion] |
| ... | ... | ... | ... | ... | ... | ... |
| 14833748 | Nepal Police | 1 | [] | [] | [] | [] |
| 66890 | People's Liberation Army | 1 | [] | [] | [] | [] |
| 8862873 | Syrians | 1 | [] | [Turoyo language, Neo-Aramaic, Assyrian Neo-Ar... | [] | [Assyrian Church of the East, Christianity, Su... |
| 214413 | Armenian diaspora | 1 | [] | [] | [] | [] |


### Mappings

##### conflict_entity_id

Contains the ids of all conflicts and the ids of all involved entities, partitioned into belligerents.
In the example conflict below, there are two belligerents: the first is formed by four entities and the second consists of only one entity.

```
{conflict_id: [[entity_id, entity_id, entity_id, entity_id], [entity_id]]}
{1220919: [[1576797, 31717, 4887, 3434750], [40596311]]}
```

```python
conflict_entity_id_file = pathlib.Path().absolute().parent.absolute() / "mappings" / "conflict_entity_id" / "conflict_entity_id.pkl"
# Alternative: keep the raw dictionary instead of building a DataFrame
# conflict_entity_id = pd.read_pickle(conflict_entity_id_file)
# One column per belligerent; conflicts with fewer belligerents are padded with missing values
conflict_entity_id = pd.DataFrame.from_dict(pd.read_pickle(conflict_entity_id_file), orient='index', columns=["belligerent_1", "belligerent_2", "belligerent_3", "belligerent_4"])
conflict_entity_id.index.name = "conflict_id"
conflict_entity_id
```

| conflict_id | belligerent_1 | belligerent_2 | belligerent_3 | belligerent_4 |
|---|---|---|---|---|
| 22738 | [27019, 1737554, 17207794, 4913064, 23235, 318... | [103100, 51673, 737, 30635, 27358, 21462079, 3... | | |
| 43063 | [21241, 291717, 4913064, 31717, 11125639, 737,... | [389363, 737, 4512092] | | |
| 46216 | [9282173] | [24093, 12047, 24324, 13913, 53185055, 13093253] | | |
| 106346 | [1113713, 23235, 3607937, 31769, 3434750] | [8089426] | | |
| 201936 | [899044, 22936, 31717, 679693, 7515928, 256597... | [7515849, 26215470, 285632, 2185, 199866, 17633] | | |
| ... | ... | ... | ... | ... |
| 64786838 | [469639] | [43667081, 298010] | | |
| 65268074 | [12593188, 171166, 5217236, 14833748] | [66890] | | |
| 65431221 | [11125639, 73374, 746, 8862873] | [214413, 3057255, 25391, 44596915, 10918072] | | |
| 65548110 | [11125639] | [69680] | | |


##### ally_enemy_pairs

Contains the ids of entity pairs, a conflict id and the relationship of the entities in the given conflict, as displayed below.

```
((entity_pair), conflict_id, entity_relationship)
((27019, 103100), 22738, "enemies")
```

```python
ally_enemy_pairs_file = pathlib.Path().absolute().parent.absolute() / "mappings" / "ally_enemy_pairs" / "ally_enemy_pairs.pkl"
# Alternative: keep the raw list of tuples instead of building a DataFrame
# ally_enemy_pairs = pd.read_pickle(ally_enemy_pairs_file)
ally_enemy_pairs = pd.DataFrame(pd.read_pickle(ally_enemy_pairs_file), columns=["entity_pair", "conflict_id", "entity_relationship"])
ally_enemy_pairs.index.name = "entity_pair_id"
ally_enemy_pairs
```

| entity_pair_id | entity_pair | conflict_id | entity_relationship |
|---|---|---|---|
| 0 | (27019, 103100) | 22738 | enemies |
| 1 | (27019, 51673) | 22738 | enemies |
| 2 | (27019, 737) | 22738 | enemies |
| 3 | (27019, 30635) | 22738 | enemies |
| 4 | (27019, 27358) | 22738 | enemies |
| ... | ... | ... | ... |
| 17066 | (25391, 44596915) | 66219979 | enemies |
| 17067 | (25391, 10918072) | 66219979 | enemies |
| 17068 | (44596915, 10918072) | 66219979 | enemies |
| 17069 | (7515928, 3434750) | 66219979 | allies |
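
The pairs appear to follow from the belligerent partition in `conflict_entity_id`: entities within the same belligerent are allies, entities in different belligerents are enemies. The sketch below derives pairs under that assumption. It is not the code used to build the dataset; the function name `ally_enemy_pairs_for` is hypothetical, and skipping entities that are listed on both sides is a choice made here for illustration only.

```python
from itertools import combinations


def ally_enemy_pairs_for(conflict_id, belligerents):
    """Derive ((entity, entity), conflict_id, relationship) tuples for one conflict."""
    pairs = []
    for i, side in enumerate(belligerents):
        # Entities fighting on the same side are treated as allies.
        for a, b in combinations(side, 2):
            pairs.append(((a, b), conflict_id, "allies"))
        # Entities on different sides are treated as enemies.
        for other_side in belligerents[i + 1:]:
            for a in side:
                for b in other_side:
                    if a != b:  # assumption: skip an entity listed on both sides
                        pairs.append(((a, b), conflict_id, "enemies"))
    return pairs


# Example conflict from the conflict_entity_id section.
pairs = ally_enemy_pairs_for(1220919, [[1576797, 31717, 4887, 3434750], [40596311]])
for pair in pairs:
    print(pair)
```

For this example the sketch yields six ally pairs among the four entities of the first belligerent and four enemy pairs against the single entity of the second.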


##### network

Contains the precomputed, aggregated network, where nodes are entities and each edge aggregates all conflicts between the two respective entities.

###### aggr_edge_list

List of edges of shape (node_id 1, node_id 2, edge attributes)

```
(entity_id 1, entity_id 2, conflict attributes)
27019,	103100,	{'label': 'enemies', 'label_discrete': 0, 'label_continuous': -1, 'n_conflicts': 1, 'conflict_ids': [22738], 'conflict_names': ['Operation Enduring Freedom']}
```

```python
aggr_edge_list_file = pathlib.Path().absolute().parent.absolute() / "mappings" / "network" / "aggr_edge_list.pkl"
# One row per edge: the two endpoint entity ids and a dictionary of aggregated conflict attributes
aggr_edge_list = pd.DataFrame(pd.read_pickle(aggr_edge_list_file), columns=["entity_id_1", "entity_id_2", "conflict_attributes"])
aggr_edge_list.index.name = "edge"
aggr_edge_list
```

| edge | entity_id_1 | entity_id_2 | conflict_attributes |
|---|---|---|---|
| 0 | 27019 | 103100 | {'label': 'enemies', 'label_discrete': 0, 'lab... |
| 1 | 27019 | 51673 | {'label': 'enemies', 'label_discrete': 0, 'lab... |
| 2 | 737 | 27019 | {'label': 'enemies', 'label_discrete': 0, 'lab... |
| 3 | 27019 | 30635 | {'label': 'enemies', 'label_discrete': 0, 'lab... |
| 4 | 27019 | 27358 | {'label': 'allies', 'label_discrete': 1, 'labe... |
| ... | ... | ... | ... |
| 10832 | 214413 | 44596915 | {'label': 'allies', 'label_discrete': 1, 'labe... |
| 10833 | 214413 | 10918072 | {'label': 'allies', 'label_discrete': 1, 'labe... |
| 10834 | 25391 | 3057255 | {'label': 'allies', 'label_discrete': 1, 'labe... |
| 10835 | 3057255 | 44596915 | {'label': 'enemies', 'label_discrete': 0, 'lab... |
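
For neighbourhood queries, the edge list can be turned into an adjacency dictionary. The sketch below is one possible way to do so and assumes the edges are undirected; the helper names `build_adjacency` and `neighbours_by_label` are hypothetical. The first toy edge copies the example above, the second keeps only the attributes visible in the truncated output.

```python
from collections import defaultdict


def build_adjacency(edge_list):
    """Map each entity id to {neighbour id: edge attributes}, treating edges as undirected."""
    adjacency = defaultdict(dict)
    for source, target, attributes in edge_list:
        adjacency[source][target] = attributes
        adjacency[target][source] = attributes
    return dict(adjacency)


def neighbours_by_label(adjacency, entity_id, label):
    """Return the sorted neighbours of an entity whose edge carries the given label."""
    return sorted(
        other for other, attributes in adjacency.get(entity_id, {}).items()
        if attributes.get("label") == label
    )


edges = [
    (27019, 103100, {"label": "enemies", "label_discrete": 0, "label_continuous": -1,
                     "n_conflicts": 1, "conflict_ids": [22738],
                     "conflict_names": ["Operation Enduring Freedom"]}),
    (27019, 27358, {"label": "allies", "label_discrete": 1}),  # remaining attributes omitted
]

adjacency = build_adjacency(edges)
print(neighbours_by_label(adjacency, 27019, "enemies"))
print(neighbours_by_label(adjacency, 27019, "allies"))
```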


###### node_list

List of nodes of shape (node_id, node attributes)

```
(entity_id, entity attributes)
3343, {'name': 'Belgium'}
```

```python
node_list_file = pathlib.Path().absolute().parent.absolute() / "mappings" / "network" / "node_list.pkl"
# One row per node: the entity id and a dictionary of node attributes (e.g. the name)
node_list = pd.DataFrame(pd.read_pickle(node_list_file), columns=["entity_id", "entity_attributes"])
node_list.index.name = "node"
node_list
```

| node | entity_id | entity_attributes |
|---|---|---|
| 0 | 358 | {'name': 'Algeria'} |
| 1 | 701 | {'name': 'Angola'} |
| 2 | 737 | {'name': 'Afghanistan'} |
| 3 | 738 | {'name': 'Albania'} |
| 4 | 746 | {'name': 'Azerbaijan'} |
| ... | ... | ... |
| 711 | 38264234 | {'name': 'European Union Training Mission in M... |
| 712 | 40479968 | {'name': 'United Nations Force Intervention Br... |
| 713 | 43112164 | {'name': 'European Union Training Mission in S... |
| 714 | 45690137 | {'name': 'Grupos de autodefensa comunitaria'} |
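
A common use of the node list is to resolve entity ids to readable names. The following minimal sketch assumes the raw pickle is a list of `(entity_id, attributes)` tuples as shown above; the helper name `id_to_name` is hypothetical, and the sample nodes are copied from the output.

```python
def id_to_name(nodes):
    """Build an {entity_id: name} lookup from (entity_id, attributes) tuples."""
    return {entity_id: attributes.get("name") for entity_id, attributes in nodes}


# First rows of the node list as shown above.
nodes = [
    (358, {"name": "Algeria"}),
    (701, {"name": "Angola"}),
    (737, {"name": "Afghanistan"}),
]

names = id_to_name(nodes)
print(names[737])
```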
