 # Supervised Link Prediction with the Armed Conflict Location Event Database
 In this notebook, I turn relational data from the Armed Conflict
 Location & Event Data Project into several undirected "conflict"
 graphs (networks).

 The nodes in these graphs will be different "agents" in Africa, defined as
 the product of an actor type and the country that the actor comes from,
 e.g. Civilians - Ghana, Ethnic Militia - Angola.

 The edges in these graphs will be defined as follows: 1 if there was at least
 one conflict during a time period between two agents and 0 otherwise.
 The time periods will be different months spanning January 1997 - December 2018.
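
 As a minimal sketch of this edge definition (not the `ConflictGraph` implementation, which is not shown in this notebook), the snippet below labels every unordered pair of agents with 1 or 0 for a single month. The agent names and the conflict list are made up for illustration. Self-pairs are included on the assumption that two actors sharing the same category and country can be in conflict with each other.

```python
from itertools import combinations_with_replacement

# Hypothetical agents, sorted so that every pair has a canonical order
agents = sorted(["Civilians-Ghana", "Rioters-Ghana", "Ethnic militia-Angola"])

# Hypothetical conflicts recorded in one month
conflicts = [
    ("Rioters-Ghana", "Civilians-Ghana"),
    ("Civilians-Ghana", "Rioters-Ghana"),  # same pair, reversed order
]

# The graph is undirected, so store each pair as a sorted tuple
observed = {tuple(sorted(pair)) for pair in conflicts}

# 1 if at least one conflict occurred between the pair, 0 otherwise
edge_values = {
    pair: int(pair in observed)
    for pair in combinations_with_replacement(agents, 2)
}

for pair, value in edge_values.items():
    print(pair, value)
```

 Repeated conflicts between the same two agents collapse into a single edge, so only the presence of a conflict in the month is recorded, not the number of events.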

 Using various topological link prediction measures, I will create features
 for the edges that do and don't exist within these graphs. My target
 (what I am trying to predict) is whether or not a conflict occurred
 (represented as an edge existing within the conflict graph) within a certain
 time frame (a month).

 ACLED (Armed Conflict Location & Event Data Project) is a disaggregated conflict
 analysis and crisis mapping project.
 ACLED collects and analyzes data on locations, dates and types of all reported
 armed conflict and protest events in developing countries.
 It can be found here: https://www.acleddata.com/

 **NOTE**: This analysis has only been completed for Africa

 ### Import utility functions

In [1]:
from utils.data_read import load_data, load_interaction_codes
from utils.data_cleaning import (
    get_actor_categories, subset_columns, country_extractor)
from utils.feature_engineering import create_agents, get_month
from ConflictGraph import ConflictGraph
from collections import OrderedDict
from itertools import product
from functools import reduce
import pandas as pd
import time
# True if the training data has already been built and saved to disk
present = True

 ### Load conflict data
 #### The Data Dictionary is as follows:

 **ISO**: A numeric code for each individual country <br>
 **EVENT_ID_CNTY**: An individual event identifier by
 number and country acronym. <br>
 **EVENT_ID_NO_CNTY**: An individual event numeric identifier. <br>
 **EVENT_DATE**: Recorded as Year/Month/Day. <br>
 **YEAR**: The year in which an event took place. <br>
 **TIME_PRECISION**: A numeric code indicating the level of certainty of
 the date coded for the event (1-3). <br>
 **EVENT_TYPE**: The type of event. <br>
 **SUB_EVENT_TYPE**: The type of sub-event. <br>
 **ACTOR1**: A named actor involved in the event. <br>
 **ASSOC_ACTOR_1**: The named actor associated with or identifying with
 ACTOR1 in one specific event. <br>
 **INTER1**: A numeric code indicating the type of ACTOR1. <br>
 **ACTOR2**: The named actor involved in the event. If a dyadic event,
 there will also be an “Actor 1”. <br>
 **ASSOC_ACTOR_2**: The named actor associated with or identifying with
 ACTOR2 in one specific event. <br>
 **INTER2**: A numeric code indicating the type of ACTOR2. <br>
 **INTERACTION**: A numeric code indicating the interaction between types of
 ACTOR1 and ACTOR2.
 Coded as an interaction between actor types, and recorded as lowest joint
 number. <br>
 **REGION**: The region of the world where the event took place. <br>
 **COUNTRY**: The country in which the event took place. <br>
 **ADMIN1**: The largest sub-national administrative region
 in which the event took place.<br>
 **ADMIN2**: The second largest sub-national administrative region
 in which the event took place. <br>
 **ADMIN3**: The third largest sub-national administrative region
 in which the event took place. <br>
 **LOCATION**: The location in which the event took place. <br>
 **LATITUDE**: The latitude of the location. <br>
 **LONGITUDE**: The longitude of the location. <br>
 **GEO_PRECISION**: A numeric code indicating the level of certainty of the
 location coded for the event. <br>
 **SOURCE**: The source(s) used to code the event. <br>
 **SOURCE_SCALE**: The geographic scale of the sources used to code the event. <br>
 **NOTES**: A short description of the event. <br>
 **FATALITIES**: Number or estimate of fatalities due to event. <br>
 Fatality counts frequently differ across reports.

In [2]:
data = load_data('data/data.csv')

In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 178809 entries, 5022244 to 5016301
Data columns (total 30 columns):
iso                 178809 non-null int64
event_id_cnty       178809 non-null object
event_id_no_cnty    178809 non-null int64
event_date          178809 non-null datetime64[ns]
year                178809 non-null int64
time_precision      178809 non-null int64
event_type          178809 non-null object
sub_event_type      178809 non-null object
actor1              178809 non-null object
assoc_actor_1       33604 non-null object
inter1              178809 non-null int64
actor2              135656 non-null object
assoc_actor_2       23689 non-null object
inter2              178809 non-null int64
interaction         178809 non-null int64
region              178809 non-null object
country             178809 non-null object
admin1              178809 non-null object
admin2              178691 non-null object
admin3              94361 non-null object
location            1788

In [4]:
data.head()

data_id,iso,event_id_cnty,event_id_no_cnty,event_date,year,time_precision,event_type,sub_event_type,actor1,assoc_actor_1,...,location,latitude,longitude,geo_precision,source,source_scale,notes,fatalities,timestamp,iso3
5022244,180,DRC13977,13977,2019-03-23,2019,1,Riots,Mob violence,Rioters (Democratic Republic of Congo),Vigilante Group (Democratic Republic of Congo),...,Bunia,1.5667,30.25,2,Radio Okapi,National,23 March 2019. An angry mob attacked and kille...,3,1553544834,COD
5022245,180,DRC13978,13978,2019-03-23,2019,2,Violence against civilians,Attack,Unidentified Armed Group (Democratic Republic ...,,...,Nyiragongo,-1.5219,29.2496,2,Twitter,Other,23 March 2019 (on or before). The sound of gun...,0,1553544834,COD
5022284,404,KEN6858,6858,2019-03-23,2019,1,Protests,Protest with intervention,Protesters (Kenya),,...,Kibabii,0.6199,34.5275,1,Star (Kenya),National,22-23 March: Kibabi University students held d...,0,1553544834,KEN
5022293,434,LBY7472,7472,2019-03-23,2019,1,Protests,Peaceful protest,Protesters (Libya),Magarha Ethnic Group (Libya),...,Tripoli,32.8925,13.18,1,AFP,International,"On March 23, Magharha tribesmen staged a prote...",0,1553544834,LBY
5022311,466,MLI2741,2741,2019-03-23,2019,1,Violence against civilians,Attack,Dan Na Ambassagou,,...,Ogassogou,14.0088,-3.8872,1,Reuters; aBamako; Liberation (France),National-International,"On March 23, Dogon militiamen attacked the Ful...",125,1553544834,MLI


 #### Load lookup codes
 The lookup codes map the values in the `inter` columns to actor categories.
 Each actor has an associated code that represents the type of actor it is.
 For example, GIA: Armed Islamic Group is classified as a Rebel Force.

In [5]:
interaction_lookup = load_interaction_codes('data/categorycodes.csv')

In [6]:
interaction_lookup

Unnamed: 0,code,Category
0,1,Government or mutinous force
1,2,Rebel force
2,3,Political militia
3,4,Ethnic militia
4,5,Rioters
5,6,Protesters
6,7,Civilians
7,8,Outside/external force (e.g. UN)


 #### Determine the set of all countries in the data set

In [7]:
countries = set(data.country)

 #### Define all of the possible actor categories

In [8]:
categories = list(interaction_lookup.Category)

 ## Create a "conflict dataframe"
 Join the interaction lookup to each actor code to get the category
 of each actor.
 Also, extract the country that each actor comes from. Conflicts may happen
 in a certain country, but the actors may not come from that country. <br>
 For simplicity, Associated Actors are ignored in this analysis.

In [9]:
conflict_df = data.pipe(lambda x: get_actor_categories(x, interaction_lookup))\
    .pipe(subset_columns)\
    .pipe(lambda x: country_extractor(x, countries))

In [10]:
conflict_df.head()

Unnamed: 0,event_date,country,actor1,actor1_country,actor1_category,actor2,actor2_country,actor2_category
0,2019-03-23,Democratic Republic of Congo,Rioters (Democratic Republic of Congo),Republic of Congo,Rioters,Civilians (Democratic Republic of Congo),Republic of Congo,Civilians
1,2019-03-23,Uganda,Rioters (Uganda),Uganda,Rioters,Civilians (Uganda),Uganda,Civilians
2,2019-03-22,Democratic Republic of Congo,Rioters (Democratic Republic of Congo),Republic of Congo,Rioters,Civilians (Democratic Republic of Congo),Republic of Congo,Civilians
3,2019-03-21,Nigeria,Rioters (Nigeria),Nigeria,Rioters,Civilians (Nigeria),Nigeria,Civilians
4,2019-03-20,Ghana,Rioters (Ghana),Ghana,Rioters,Civilians (Ghana),Ghana,Civilians
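
 The implementation of `country_extractor` is not shown. In the sample above, `Rioters (Democratic Republic of Congo)` is assigned `Republic of Congo`, which is the kind of result a plain substring match can give when one country name contains another. The sketch below is one possible way to handle that case, assuming the country is looked up by substring and that the longest match is the intended one. `extract_country` is a hypothetical helper, not the function used in this notebook, and returning `None` for actors with no country in their name is also an assumption.

```python
def extract_country(actor, countries):
    """Return the longest country name contained in an actor label.

    Hypothetical helper: choosing the longest match keeps
    'Democratic Republic of Congo' from being shortened to
    'Republic of Congo'. Returns None when no country is found.
    """
    matches = [country for country in countries if country in actor]
    return max(matches, key=len) if matches else None


# Made-up subset of country names for illustration
sample_countries = {
    "Republic of Congo",
    "Democratic Republic of Congo",
    "Uganda",
}

for actor in ["Rioters (Democratic Republic of Congo)",
              "Rioters (Uganda)",
              "Dan Na Ambassagou"]:
    print(actor, "->", extract_country(actor, sample_countries))
```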


 ## Create a dataframe of all realised conflicts between "Agents"
 This will be a dataframe of all conflicts that have actually happened.
 We define an agent as a composite label encompassing an actor's
 country of origin and the actor category.
 This ensures that the network has the same number of nodes in every period.

In [11]:
all_realised_conflicts = conflict_df.pipe(create_agents)\
    .pipe(get_month)

In [12]:
all_realised_conflicts.head()

Unnamed: 0,event_date,agent1,agent2,period
0,2019-03-23,Rioters-Republic of Congo,Civilians-Republic of Congo,2019-3
1,2019-03-23,Rioters-Uganda,Civilians-Uganda,2019-3
2,2019-03-22,Rioters-Republic of Congo,Civilians-Republic of Congo,2019-3
3,2019-03-21,Rioters-Nigeria,Civilians-Nigeria,2019-3
4,2019-03-20,Rioters-Ghana,Civilians-Ghana,2019-3
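
 The helpers `create_agents` and `get_month` are imported from `utils` and not shown. Judging only by the output above, an agent label looks like `Category-Country` and a period looks like `year-month` without zero padding. A minimal pandas sketch that produces the same shape, assuming the column names from `conflict_df` (`make_agent_labels` is a hypothetical name):

```python
import pandas as pd


def make_agent_labels(df):
    """Build agent labels and a 'year-month' period column (illustrative)."""
    out = pd.DataFrame({"event_date": df["event_date"]})
    out["agent1"] = df["actor1_category"] + "-" + df["actor1_country"]
    out["agent2"] = df["actor2_category"] + "-" + df["actor2_country"]
    # No zero padding, e.g. '2019-3'
    out["period"] = (df["event_date"].dt.year.astype(str)
                     + "-"
                     + df["event_date"].dt.month.astype(str))
    return out


# Made-up rows in the layout of conflict_df
sample = pd.DataFrame({
    "event_date": pd.to_datetime(["2019-03-23", "1997-11-02"]),
    "actor1_category": ["Rioters", "Ethnic militia"],
    "actor1_country": ["Uganda", "Kenya"],
    "actor2_category": ["Civilians", "Ethnic militia"],
    "actor2_country": ["Uganda", "Kenya"],
})

labelled = make_agent_labels(sample)
print(labelled)
```

 The unpadded period format matters because the graphs are later selected by comparing `period` with strings such as `'1997-11'`.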


 Define a range of monthly time periods

In [13]:
periods = [str(x)+"-"+str(y) for x, y in
           product(range(1997, 2019), range(1, 13))]

 ## Create Conflict Graphs
 For each time period, create a Conflict Graph with the conflicts that
 happened and didn't happen during that period.

In [14]:
def make_conflict_graphs(all_realised_conflicts, categories,
                         countries, periods):

    """

    For each time period, create a Conflict Graph with the conflicts that
    happened and didn't happen during that period

    """
    conflict_graphs = OrderedDict()

    print('Creating Conflict Graphs....')

    for counter, period in enumerate(periods, start=1):

        # Report progress every 20 periods
        if counter % 20 == 0:
            print(str(counter) + " out of " + str(len(periods)))

        conflicts = all_realised_conflicts[
            all_realised_conflicts.period == period]

        cf = ConflictGraph(categories=categories,
                           countries=countries,
                           period=period)

        cf.set_conflicts(conflicts)
        conflict_graphs[period] = cf
    
    print('Conflict Graph Creation Complete')

    return conflict_graphs

 Let's see which metrics are included by looking at a sample
 graph.

In [15]:
conflicts_1997_11 = all_realised_conflicts[
    all_realised_conflicts.period == '1997-11']

cf = ConflictGraph(categories=categories,
                   countries=countries,
                   period='1997-11')

cf.set_conflicts(conflicts_1997_11)

 We have extracted the Jaccard coefficient, resource allocation index and
 preferential attachment score of each potential edge (every pair of agents).

In [16]:
cf.get_all_metrics()\
    .sort_values('pref_attachment', ascending=False)\
    .head()

Unnamed: 0,agent1,agent2,pref_attachment,resource_alloc_com,jaccard_coef
61623,Civilians-Kenya,Ethnic militia-Sudan,9,0.0,0.0
19232,Ethnic militia-Sudan,Political militia-Somalia,9,0.0,0.0
55539,Political militia-Somalia,Rebel force-Sierra Leone,9,0.0,0.0
67303,Political militia-South Africa,Rebel force-Sierra Leone,9,0.0,0.0
16567,Ethnic militia-Nigeria,Ethnic militia-Nigeria,9,0.5,0.5


 Let's see what the target dataframe looks like.

In [17]:
cf.get_edge_labels()\
    .sort_values('target', ascending=False)\
    .head()

Unnamed: 0,agent1,agent2,target,period
0,Ethnic militia-Kenya,Ethnic militia-Kenya,1.0,1997-11
10,Government or mutinous force-Republic of Congo,Government or mutinous force-Republic of Congo,1.0,1997-11
1,Ethnic militia-Niger,Government or mutinous force-Niger,1.0,1997-11
16,Political militia-South Africa,Political militia-South Africa,1.0,1997-11
15,Political militia-Somalia,Political militia-Somalia,1.0,1997-11


 ## Explanation of the Features that will be extracted
 $$\Gamma(x) \text{ : The neighbours of the node x. In other words, the nodes that x is connected to.} $$
 $$\left |\Gamma(x)\right| \text{: The size of the neighbour set of x} $$
 $$ \Gamma(x) \cap \Gamma(y) \text{: The shared neighbours of x and y} $$

 The topological features we will be extracting for each edge are as follows: <br>

 ##### Resource Allocation Index with Community Information (Soundarajan-Hopcroft):

 $$\sum_{w \in \Gamma(x) \cap \Gamma(y)} \frac{f(w)}{\left| \Gamma(w) \right|} \text{ for the edge } (x, y) \text{, where } f(w) = 1 \text{ if } w \text{ is in the same community as } x \text{ and } y \text{, and } 0 \text{ otherwise}$$

 ##### Jaccard Coefficient:

 $$\frac{\left| \Gamma(x)\cap\Gamma(y) \right|}{\left| \Gamma(x)\cup\Gamma(y) \right|} \text{ for the edge } (x, y)$$

 ##### Preferential Attachment
 $$\left| \Gamma(x) \right| \times \left| \Gamma(y) \right| \text{ for the edge } (x, y)$$
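
 The three measures can be checked by hand on a toy graph. The sketch below is a plain-Python illustration of the formulas above, not the code behind `get_all_metrics`. The graph, the node names and the community assignment are invented for the example.

```python
from collections import defaultdict

# Toy undirected graph: x and y share the neighbours c and d
edges = [("x", "c"), ("x", "d"), ("y", "c"), ("y", "d"), ("y", "e")]

# Hypothetical community assignment for each node
community = {"x": 0, "y": 0, "c": 0, "d": 1, "e": 1}

# Build the neighbour sets Gamma(node)
neighbours = defaultdict(set)
for u, v in edges:
    neighbours[u].add(v)
    neighbours[v].add(u)


def jaccard(x, y):
    """|common neighbours| / |all neighbours| of x and y."""
    union = neighbours[x] | neighbours[y]
    if not union:
        return 0.0
    return len(neighbours[x] & neighbours[y]) / len(union)


def preferential_attachment(x, y):
    """Product of the two neighbourhood sizes."""
    return len(neighbours[x]) * len(neighbours[y])


def resource_allocation_community(x, y):
    """Sum of 1/|Gamma(w)| over common neighbours w that share a
    community with both x and y."""
    return sum(
        1 / len(neighbours[w])
        for w in neighbours[x] & neighbours[y]
        if community[w] == community[x] == community[y]
    )


print(jaccard("x", "y"))                        # 2 shared out of 3 -> 2/3
print(preferential_attachment("x", "y"))        # 2 * 3 = 6
print(resource_allocation_community("x", "y"))  # only c counts -> 1/2 = 0.5
```

 If the graphs are held in networkx (an assumption here), the same measures are available as `jaccard_coefficient`, `preferential_attachment` and `ra_index_soundarajan_hopcroft`.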

In [18]:
del data, conflict_df

 # Feature Extraction
 The aim is: <br>
 For each 1-month window in which conflicts have happened,
 we will extract a dataframe where the target is whether or not a
 conflict/edge existed during that month and the features
 are link prediction measures about the emerging graphs
 (a representation of the interactions between agents up to a certain time),
 going back up to 12 months into the past in 1-month windows.

In [19]:
def full_merge(features, target):

    """

    Merge the dataframes with the link features to the
    link target (absence/presence)

    """

    result = reduce(
        lambda x, y: x.merge(y, on=["agent1", "agent2"]),
        features)
    result = result.merge(target, on=["agent1", "agent2"])
    return result
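
 To make the merge concrete, here is a small illustration with made-up feature frames. The column suffixes `_1` and `_2` are an assumption about how the lagged metrics are told apart; the actual names produced by `get_all_metrics(lag=...)` are not shown in this notebook.

```python
from functools import reduce

import pandas as pd

pairs = {"agent1": ["A", "A"], "agent2": ["B", "C"]}

# Hypothetical features for the two most recent periods
lag_1 = pd.DataFrame({**pairs, "jaccard_coef_1": [0.5, 0.0]})
lag_2 = pd.DataFrame({**pairs, "jaccard_coef_2": [0.25, 0.0]})

# Hypothetical edge labels for the target period
target = pd.DataFrame({**pairs, "target": [1.0, 0.0]})

# Same logic as full_merge: chain the feature frames, then add the target
merged = reduce(
    lambda x, y: x.merge(y, on=["agent1", "agent2"]),
    [lag_1, lag_2])
merged = merged.merge(target, on=["agent1", "agent2"])

print(merged)
```

 `DataFrame.merge` performs an inner join by default, so a pair of agents missing from any one frame would be dropped from the result. This works here as long as every graph reports the same set of agent pairs.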

In [20]:
def make_training_data(graphs, n_prev=12):

    """

    Implement a sliding window approach for extracting features
    and targets

    Example:
    {G1, G2, .... G12} -> G13
    {G2, G3, .... G13} -> G14

    Keyword Arguments:
    graphs -- an ordered dictionary of Conflict Graphs keyed by period
    n_prev -- The number of previous time periods used as features
              for each target period (the window size)


    """

    keys = list(graphs.keys())
    indexes = range(len(graphs)-n_prev)

    train_dfs = []

    start = time.time()

    print('Creating Training Data...')
    for index in indexes:

        if index % 10 == 0:
            print(str(index) + " out of " + str(len(indexes)))
            print(str(time.time() - start) + " seconds elapsed")

        # Select n_prev graphs in a sliding window for feature extraction
        selected_keys_X = keys[index:index+n_prev][::-1]
        # Select a graph with a period 1 after the final graph chosen in
        # selected_keys_X
        selected_keys_Y = keys[index+n_prev]

        # Extract features/metrics for each period selected in
        # selected_keys_X
        features = (graphs[key].get_all_metrics(lag=idx+1)
                    for (idx, key) in enumerate(selected_keys_X))

        # Get the target labels for a graph 1 time period after the
        # features
        target = graphs[selected_keys_Y].get_edge_labels()

        # Merge
        result = full_merge(features, target)

        train_dfs.append(result)

        del features, target, result, selected_keys_X, selected_keys_Y

    print('Training Data Created!')

    return train_dfs
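
 The indexing in `make_training_data` can be traced on a short list of periods. The sketch below reproduces only the key selection (no graphs or features), with a window of 3 instead of 12 to keep the output small. `sliding_windows` is a hypothetical helper written for this illustration.

```python
def sliding_windows(keys, n_prev):
    """Return (feature_keys, target_key) pairs, most recent period first."""
    windows = []
    for index in range(len(keys) - n_prev):
        # Reverse the slice so that lag 1 is the period just before the target
        feature_keys = keys[index:index + n_prev][::-1]
        target_key = keys[index + n_prev]
        windows.append((feature_keys, target_key))
    return windows


keys = ["1997-1", "1997-2", "1997-3", "1997-4", "1997-5"]

for feature_keys, target_key in sliding_windows(keys, n_prev=3):
    lags = dict(enumerate(feature_keys, start=1))
    print(target_key, lags)
# 1997-4 {1: '1997-3', 2: '1997-2', 3: '1997-1'}
# 1997-5 {1: '1997-4', 2: '1997-3', 3: '1997-2'}
```

 With the 264 monthly periods defined earlier (22 years of 12 months) and `n_prev=12`, the same loop yields 252 windows, the first of which targets `1998-1`.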

 # Creation of the Training Data via Graph Feature Extraction
 1. Create all of the conflict graphs from 1997-1 to 2018-12
 2. Apply a sliding window approach to get the edge labels for a certain period
 and then the graph features for the previous 12 periods
 3. Concatenate the output from all of the sliding windows
 4. Save as a Parquet file for the modelling phase

In [21]:
if not present:
    conflict_graphs = make_conflict_graphs(all_realised_conflicts,
                                           categories,
                                           countries,
                                           periods)
    train = make_training_data(conflict_graphs)
    train_df = pd.concat(train, ignore_index=True)
    train_df.to_parquet('df.parquet.gzip', compression='gzip')

In [22]:
if present:
    train_df = pd.read_parquet('df.parquet.gzip')

# Preview outside the `if` block so the notebook displays it as the cell output
train_df.head()
