# Predict Attack Type using SVM


The goal of this analysis is to explore some machine learning tools on a single practical task: assigning one of nine attack types to each event in five collections of 500,000 conflict-related events. In this notebook we will:

* load the spreadsheet contents and the categories

* extract feature vectors suitable for machine learning

* train a few models to perform categorization

* use a grid search strategy to find a good configuration of both the feature extraction components and the classifier


## Load the spreadsheet contents and the categories


One of the datasets, GTDB, contains numerous Attack Type columns that can serve as a way of labeling each record. Here I convert the spreadsheet into its own DataFrame in order to visualize a few of its many columns.

In [2]:

import pandas as pd

gtdb_path = "../data/csv/GTDB"
csv_path =  gtdb_path + '.csv'
encoding = ['latin1', 'iso8859-1', 'utf-8'][1]
gtdb_df = pd.read_csv(csv_path, encoding=encoding, low_memory=False)
gtdb_df.columns

Index(['eventid', 'iyear', 'imonth', 'iday', 'approxdate', 'extended',
       'resolution', 'country', 'country_txt', 'region',
       ...
       'addnotes', 'scite1', 'scite2', 'scite3', 'dbsource', 'INT_LOG',
       'INT_IDEO', 'INT_MISC', 'INT_ANY', 'related'],
      dtype='object', length=137)


The columns that might give away the category (the dependent variables) all have the word "attack" in them, so I separate them out here.

In [3]:

import re

attack_regex = re.compile(r"attack")
attack_column_list = [column for column in gtdb_df.columns if attack_regex.search(column)]
attack_column_list

['attacktype1',
 'attacktype1_txt',
 'attacktype2',
 'attacktype2_txt',
 'attacktype3',
 'attacktype3_txt']


It looks like the `attacktype1_txt` column is the one we need to index on. Here are all the distinct values contained in that column.

In [4]:

gtdb_df['attacktype1_txt'].unique()

array(['Assassination', 'Hostage Taking (Kidnapping)', 'Bombing/Explosion',
       'Facility/Infrastructure Attack', 'Armed Assault', 'Hijacking',
       'Unknown', 'Unarmed Assault', 'Hostage Taking (Barricade Incident)'], dtype=object)


We need to decide what order to place them in for indexing. Our categories are already summarized, in a fixed order, in their own spreadsheet. Here I convert it to a DataFrame for easy visualization.

In [5]:

attack_types_path = "../data/csv/AttackTypes"
csv_path =  attack_types_path + '.csv'
attack_types_df = pd.read_csv(csv_path, encoding=encoding, low_memory=False)
attack_types_df

Unnamed: 0,Attack Type Id,Attack Type
0,1,Assassination
1,2,Armed Assault
2,3,Bombing/Explosion
3,4,Hijacking
4,5,Hostage Taking (Barricade Incident)
5,6,Hostage Taking (Kidnapping)
6,7,Facility/Infrastructure Attack
7,8,Unarmed Assault
8,9,Unknown
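The row order of this table is what fixes the integer label of each category. As a minimal sketch (the list is copied from the table above; the names `attack_types` and `label_of` are illustrative and not part of the notebook), a zero-based label can be derived either from the position of the name in the list or from the one-based `Attack Type Id`:

```python
# Attack types in the order given by the AttackTypes spreadsheet.
attack_types = [
    "Assassination",
    "Armed Assault",
    "Bombing/Explosion",
    "Hijacking",
    "Hostage Taking (Barricade Incident)",
    "Hostage Taking (Kidnapping)",
    "Facility/Infrastructure Attack",
    "Unarmed Assault",
    "Unknown",
]

# Hypothetical helper: map each name to its zero-based position in the list.
label_of = {name: index for index, name in enumerate(attack_types)}

# A text label maps through the dictionary...
print(label_of["Bombing/Explosion"])   # 2
# ...while a one-based Attack Type Id maps by subtracting one.
attack_type_id = 3
print(attack_type_id - 1)              # 2
```

Both routes give the same label, which is what allows rows labeled by id and rows labeled by name to share one target vector.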



The problem is that none of the other datasets has anything like an _Attack Type_ column. Here I convert each spreadsheet into its own DataFrame and display its columns to show the difficulty.

In [6]:

ucdp_path = "../data/csv/UCDP"
csv_path =  ucdp_path + '.csv'
ucdp_df = pd.read_csv(csv_path, encoding=encoding, low_memory=False)
ucdp_df.columns

Index(['id', 'relid', 'year', 'active_year', 'type_of_violence',
       'conflict_dset_id', 'conflict_new_id', 'conflict_name', 'dyad_dset_id',
       'dyad_new_id', 'dyad_name', 'side_a_dset_id', 'side_a_new_id', 'side_a',
       'side_b_dset_id', 'side_b_new_id', 'side_b', 'number_of_sources',
       'source_article', 'source_office', 'source_date', 'source_headline',
       'source_original', 'where_prec', 'where_coordinates',
       'where_description', 'adm_1', 'adm_2', 'latitude', 'longitude',
       'geom_wkt', 'priogrid_gid', 'country', 'region', 'event_clarity',
       'date_prec', 'date_start', 'date_end', 'deaths_a', 'deaths_b',
       'deaths_civilians', 'deaths_unknown', 'best_est', 'high_est', 'low_est',
       'isocc', 'gwno', 'gwab'],
      dtype='object')

In [7]:

scad_path = "../data/csv/SCAD"
csv_path =  scad_path + '.csv'
scad_df = pd.read_csv(csv_path, encoding=encoding, low_memory=False)
scad_df.columns

Index(['eventid', 'id', 'ccode', 'countryname', 'startdate', 'enddate',
       'duration', 'stday', 'stmo', 'styr', 'eday', 'emo', 'eyr', 'etype',
       'escalation', 'actor1', 'actor2', 'actor3', 'target1', 'target2',
       'cgovtarget', 'rgovtarget', 'npart', 'ndeath', 'repress', 'elocal',
       'ilocal', 'sublocal', 'locnum', 'gislocnum', 'issue1', 'issue2',
       'issue3', 'issuenote', 'nsource', 'notes', 'coder', 'acd_questionable',
       'latitude', 'longitude', 'geo_comments', 'location_precision'],
      dtype='object')

In [8]:

rand_path = "../data/csv/RAND"
csv_path =  rand_path + '.csv'
rand_df = pd.read_csv(csv_path, encoding=encoding, low_memory=False)
rand_df.columns

Index(['startdate', 'city', 'country', 'perpetrator', 'weapon', 'injuries',
       'fatalities', 'description'],
      dtype='object')

In [9]:

acled_path = "../data/csv/ACLED"
csv_path =  acled_path + '.csv'
acled_df = pd.read_csv(csv_path, encoding=encoding, low_memory=False)
acled_df.columns

Index(['GWNO', 'EVENT_ID_CNTY', 'EVENT_ID_NO_CNTY', 'EVENT_DATE', 'YEAR',
       'TIME_PRECISION', 'EVENT_TYPE', 'ACTOR1', 'ALLY_ACTOR_1', 'INTER1',
       'ACTOR1_ID', 'ACTOR2', 'ALLY_ACTOR_2', 'INTER2', 'ACTOR2_ID',
       'INTERACTION', 'ACTOR_DYAD_ID', 'COUNTRY', 'ADMIN1', 'ADMIN2', 'ADMIN3',
       'LOCATION', 'LATITUDE', 'LONGITUDE', 'GEO_PRECISION', 'SOURCE', 'NOTES',
       'FATALITIES'],
      dtype='object')


### Get the training data into a bunch suitable for extracting feature vectors


In order to perform machine learning on text documents, we first need to turn the text content into numerical feature vectors. But most of the columns in the datasets contain numbers that can't easily be used for categorization. Our solution was to concatenate all the columns of each row into one string, strip out everything except letters, and then extract feature vectors from that string.

In [16]:

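nonalpha_regex = re.compile(r"[^a-zA-Z]+")
sq_regex = re.compile(r"'")

def concat_independendent_variables(df):
    """Return a Series holding one cleaned, space-separated string per row of df."""
    rows = []
    for row_index, row_series in df.iterrows():
        # Join every cell of the row into a single string
        row_concat = row_series.astype('str').str.cat(sep=' ').strip()
        # Drop apostrophes, then collapse every run of non-letters into one space
        row_concat = sq_regex.sub(r'', row_concat)
        row_concat = nonalpha_regex.sub(r' ', row_concat)
        rows.append(row_concat)

    # Collect in a list and build the Series once (Series.append was removed in pandas 2.0)
    return pd.Series(rows, dtype='object')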
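To make the cleaning step concrete, here is a small standard-library sketch of what happens to a single row. The helper `clean_row` and the sample values are hypothetical; only the two regular expressions are taken from the cell above.

```python
import re

nonalpha_regex = re.compile(r"[^a-zA-Z]+")
sq_regex = re.compile(r"'")

def clean_row(values):
    """Join one row's values into a single string made of alphabetic words."""
    text = " ".join(str(value) for value in values).strip()
    text = sq_regex.sub("", text)          # "People's" becomes "Peoples"
    return nonalpha_regex.sub(" ", text)   # digits and punctuation become one space

# Hypothetical row mixing an id, text, a number and an empty cell.
row = [197001010001, "Peru", "People's Liberation Front", 3.5, ""]
print(repr(clean_row(row)))   # ' Peru Peoples Liberation Front '
```

Numeric cells at the start or end of a row turn into a single space, which is why the concatenated strings tend to begin and end with a blank.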


Here we ingest only the rows that have already been labeled to create the training set. The analyst has manually labeled a few rows of each spreadsheet for greater accuracy. Ingestion takes a long time, so we print how long (in seconds) the last run took, to estimate how long the coffee break should last.

In [17]:

import time

t0 = time.time()
attack_type_names = attack_types_df['Attack Type'].tolist()
X_parts = []
y_parts = []
for csv_file in ['acled', 'rand', 'scad', 'ucdp', 'GTDB']:
    if csv_file == "GTDB":
        gtdb_path = "../data/csv/GTDB"
        csv_path =  gtdb_path + '.csv'
    else:
        relabeled_path = "../data/csv/mike_"
        csv_path =  relabeled_path + csv_file + '.csv'
    df = pd.read_csv(csv_path, encoding=encoding, low_memory=False)
    df.fillna(value="", inplace=True)
    if csv_file == "GTDB":
        # GTDB is already labeled: leave out the attack columns and shift the one-based id to zero-based
        important_columns = [column for column in df.columns if (column not in attack_column_list)]
        labels = df['attacktype1'].map(lambda x: int(x)-1)
    else:
        # The relabeled files keep the attack type name in their last column
        important_columns = df.columns.tolist()[:-1]
        labels = df[df.columns.tolist()[-1]].map(lambda x: attack_type_names.index(x))
    X_parts.append(concat_independendent_variables(df[important_columns]))
    y_parts.append(labels)
# Concatenate once at the end (Series.append was removed in pandas 2.0)
X = pd.concat(X_parts, ignore_index=True)
y = pd.concat(y_parts, ignore_index=True)
t1 = time.time()
print(t1-t0, time.ctime(t1))

413.7951662540436 Mon Jul  3 13:36:02 2017



Here we split `X` and `y` into training and testing sets for ease of use later on.

In [18]:

from sklearn.model_selection import train_test_split
from os import listdir
from os.path import isfile, join
import numpy as np

class Bunch(dict):
    """Container object for datasets: dictionary-like object that
       exposes its keys as attributes."""

    def __init__(self, **kwargs):
        dict.__init__(self, kwargs)
        self.__dict__ = self

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.33, random_state=0)
csv_path = "../data/csv/"
csv_files = [join(csv_path, f) for f in listdir(csv_path) if isfile(join(csv_path, f))]
gtdb_train = Bunch(filenames=np.asarray(csv_files),
                   target_names=attack_types_df['Attack Type'].tolist(),
                   DESCR=None,
                   target=np.asarray(y_train.tolist()),
                   data=X_train.tolist(),
                   description="The GTDB dataset concatenated into one column (minus the target columns)")
gtdb_test = Bunch(filenames=np.asarray(csv_files),
                  target_names=attack_types_df['Attack Type'].tolist(),
                  DESCR=None,
                  target=np.asarray(y_test.tolist()),
                  data=X_test.tolist(),
                  description="The GTDB dataset concatenated into one column (minus the target columns)")
gtdb_all = Bunch(filenames=np.asarray(csv_files),
                 target_names=attack_types_df['Attack Type'].tolist(),
                 DESCR=None,
                 target=np.asarray(y.tolist()),
                 data=X.tolist(),
                 description="The GTDB dataset concatenated into one column (minus the target columns)")
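As a quick sanity check on the split, the row counts reported elsewhere in this notebook (105,501 rows in the training matrix and a total support of 51,964 in the classification report) can be compared with `test_size=.33`. The sketch below assumes that a fractional test size is rounded up to a whole number of rows; that rounding rule is an assumption here, not something the notebook states.

```python
import math

n_train = 105501   # rows in the training count matrix reported in this notebook
n_test = 51964     # total support in the classification report
n_total = n_train + n_test

# Assumption: a fractional test_size is rounded up to a whole number of rows.
expected_test = math.ceil(0.33 * n_total)
print(n_total, expected_test, n_total - expected_test)
```

Under that assumption the arithmetic reproduces both reported counts from a labeled total of 157,465 rows.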


So, the training data looks like a bunch of words all strung together.

In [19]:

gtdb_train.data[:3]

[' Peru South America Ayacucho Ayacucho district Police Police Building headquarters station school Police post Peru Shining Path SL Unknown Attacked Unknown PGIS ',
 ' Afghanistan South Asia Herat Shaydai Assailants opened fire on Afghan National Army ANA soldiers in Shaydai area Herat province Afghanistan Two soldiers were killed in the attack The Taliban claimed responsibility for the incident Insurgency Guerilla Action Military Military Personnel soldiers troops officers forces Afghan National Army ANA Officers Afghanistan Taliban Posted to website blog etc Firearms Unknown Gun Type Unknown Gunmen Kills Afghan Soldiers in Herat Tolo News July Afghanistan Afghan army officers martyred in gunmen attack in Herat Khaama Press July Mine blast kills Afghan two soldiers Afghan Islamic Press July START Primary Collection ',
 ' Philippines Southeast Asia Metropolitian Manila Manila Business Hotel Resort Manila Garden Hotel Philippines Unknown Incendiary Incendiary defused PGIS ']


Each of these rows is labeled with an attack type:

In [20]:

for t in gtdb_train.target[:3]:
    print(gtdb_train.target_names[t])

Unknown
Armed Assault
Facility/Infrastructure Attack



## Extract feature vectors suitable for machine learning


We now have our data in a form where we can extract feature vectors suitable for machine learning. The most intuitive way to do so is the **bag of words** representation:

1. assign a fixed integer id to each word occurring in any row of the training set (for instance by building a dictionary from words to integer indices).

2. for each row `#i`, count the number of occurrences of each word `w` and store it in `X[i, j]` as the value of feature `#j`, where `j` is the index of word `w` in the dictionary.
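The two steps can be sketched in plain Python on a toy corpus. The rows and the names `rows`, `vocabulary` and `X_counts` are made up for illustration. `CountVectorizer` does the same job at scale, additionally lowercasing the text and storing the result as a sparse matrix.

```python
from collections import Counter

# Two hypothetical rows standing in for the concatenated event descriptions.
rows = [
    "police station attacked",
    "gunmen attacked police post",
]

# Step 1: assign a fixed integer id to every word (sorted for reproducibility).
all_words = sorted({word for row in rows for word in row.split()})
vocabulary = {word: j for j, word in enumerate(all_words)}

# Step 2: X_counts[i][j] holds how often word j occurs in row i.
X_counts = []
for row in rows:
    counts = Counter(row.split())
    X_counts.append([counts.get(word, 0) for word in vocabulary])

print(vocabulary)  # {'attacked': 0, 'gunmen': 1, 'police': 2, 'post': 3, 'station': 4}
print(X_counts)    # [[1, 0, 1, 0, 1], [1, 1, 1, 1, 0]]
```

Most entries of a real count matrix are zero, since each row uses only a tiny part of the vocabulary, which is why a sparse representation is the practical choice.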

In [24]:

from sklearn.feature_extraction.text import CountVectorizer

t0 = time.time()
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(gtdb_train.data)
t1 = time.time()
print(t1-t0, time.ctime(t1))
X_train_counts.shape

8.046195268630981 Mon Jul  3 14:41:27 2017


(105501, 97921)


So now we have a dictionary of feature indices:

In [27]:

count_vect.vocabulary_.get(u'ayacucho')

7241


Occurrence count is a good start but there is an issue: rows with more words in them will have higher average count values than shorter rows, even though they might talk about the same topics.

To avoid these potential discrepancies it suffices to divide the number of occurrences of each word in a row by the total number of words in the row: these new features are called `tf` for Term Frequencies.

Another refinement on top of `tf` is to downscale weights for words that occur in many rows in the corpus and are therefore less informative than those that occur only in a smaller portion of the corpus.

This downscaling is called `tf–idf` for “Term Frequency times Inverse Document Frequency”.
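Here is a hand-rolled sketch of the weighting on a toy count matrix. It assumes the smoothed formula `idf = ln((1 + n) / (1 + df)) + 1` followed by scaling each row to unit Euclidean length, which is how scikit-learn documents the default behaviour of `TfidfTransformer`. Note that this normalization differs slightly from dividing by the number of words in the row. All names below are illustrative.

```python
import math

# Toy count matrix: 3 rows (documents) by 3 words.
counts = [
    [2, 1, 0],
    [1, 0, 0],
    [1, 0, 3],
]
n_rows = len(counts)
n_words = len(counts[0])

# Document frequency: in how many rows does each word occur?
doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(n_words)]

# Smoothed inverse document frequency (assumed formula, see the text above).
idf = [math.log((1 + n_rows) / (1 + d)) + 1 for d in doc_freq]

# Weight the counts by idf, then scale each row to unit Euclidean length.
tfidf = []
for row in counts:
    weighted = [count * weight for count, weight in zip(row, idf)]
    norm = math.sqrt(sum(value * value for value in weighted))
    tfidf.append([value / norm for value in weighted])

print(doc_freq)                        # [3, 1, 1]
print([round(w, 3) for w in idf])      # [1.0, 1.693, 1.693]
```

The word that occurs in every row gets the smallest weight, while the two words that occur in a single row each are boosted.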

In [28]:

from sklearn.feature_extraction.text import TfidfTransformer

tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape

(105501, 97921)


## Train a few models to perform categorization


Now that we have our features, we can train a classifier to try to predict the attack type of a row. Let’s start with a **naïve Bayes** classifier, which provides a nice baseline for this task.

In [29]:

from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB().fit(X_train_tfidf, gtdb_train.target)

docs_new = gtdb_test.data[:3]
X_new_counts = count_vect.transform(docs_new)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)

predicted = clf.predict(X_new_tfidf)

for doc, category in zip(docs_new, predicted):
    print('%r => %s' % (doc, gtdb_train.target_names[category]))

' India South Asia Jharkhand Siladon Assailants set construction equipment on fire in Siladon area Jharkhand state India There were no reported casualties however construction equipment was damaged in the attack No group claimed responsibility for the incident however sources attributed the attack to the Peoples Liberation Front of India Business Construction Unknown Construction Equipment India Peoples Liberation Front of India The specific motive is unknown however sources posited that the attack was part of a bandh by the Peoples Liberation Front of India in demonstration against the death of two civilians Incendiary Arson Fire Minor likely million Three construction machines and a vehicle were damaged in this attack LWE outfit torch four vehicles in Jharkhands Khunti Hindustan Times September PLFI militants torch five machines and a car ahead of bandh in Jharkhand ZeeNews com September START Primary Collection ' => Bombing/Explosion
' Sri Lanka South Asia Eastern Sardhapura Insurge


In order to make the vectorizer => transformer => classifier chain easier to work with, we will use the `Pipeline` class, which behaves like a compound classifier, and then evaluate the predictive accuracy of the naïve Bayes model on the test set.

In [30]:

from sklearn.pipeline import Pipeline

text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB()),
])
text_clf = text_clf.fit(gtdb_train.data, gtdb_train.target)
predicted = text_clf.predict(gtdb_test.data)
np.mean(predicted == gtdb_test.target) 

0.66205449926872451


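I.e., we achieved 66.2% accuracy. Let’s see if we can do better with a linear **support vector machine** (SVM), which is widely regarded as one of the best text classification algorithms (although it’s also a bit slower than naïve Bayes). We can change the learner by plugging a different classifier object into our pipeline. Note that the cell below also changes the features: it counts bigrams as well as single words and turns off idf weighting, so the comparison is not purely classifier against classifier. The `modified_huber` loss it uses is a smoothed variant of the SVM hinge loss that also supports probability estimates.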

In [31]:

from sklearn.linear_model import SGDClassifier

t0 = time.time()
text_clf = Pipeline([('vect', CountVectorizer(ngram_range=(1, 2))),
                     ('tfidf', TfidfTransformer(use_idf=False)),
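                     # n_iter was replaced by max_iter in later scikit-learn releases; tol=None keeps exactly 5 passes
                     ('clf', SGDClassifier(loss='modified_huber', penalty='l2', alpha=1e-3, max_iter=5, tol=None, random_state=42)),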
])
text_clf = text_clf.fit(gtdb_train.data, gtdb_train.target)
predicted = text_clf.predict(gtdb_test.data)
t1 = time.time()
print(t1-t0, time.ctime(t1))
np.mean(predicted == gtdb_test.target)

44.02935862541199 Mon Jul  3 15:25:36 2017


0.85836348241089988
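The classifier above is trained with `loss='modified_huber'` rather than the plain hinge loss of a textbook linear SVM. As a rough sketch of the difference, the two losses can be written as functions of the margin `z = y * f(x)`, where `y` is the true label (+1 or -1) and `f(x)` the decision value. The function names are illustrative, and the formulas are the standard definitions rather than code taken from scikit-learn.

```python
def hinge(z):
    """Hinge loss of a linear SVM: zero once the margin reaches 1."""
    return max(0.0, 1.0 - z)

def modified_huber(z):
    """Squared hinge for margins of at least -1, linear below that."""
    if z >= -1.0:
        return max(0.0, 1.0 - z) ** 2
    return -4.0 * z

for z in (2.0, 0.5, -1.0, -3.0):
    print(z, hinge(z), modified_huber(z))
# 2.0 0.0 0.0
# 0.5 0.5 0.25
# -1.0 2.0 4.0
# -3.0 4.0 12.0
```

Both losses ignore confidently correct predictions. The modified Huber loss is smooth around the margin and grows only linearly for badly misclassified rows, which limits the influence of outliers.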


An accuracy of 85.8% is a clear improvement over the 66.2% of naïve Bayes. Here is a more detailed performance analysis of the results:

In [32]:

from sklearn import metrics

print(metrics.classification_report(gtdb_test.target, predicted, target_names=gtdb_test.target_names))

                                     precision    recall  f1-score   support

                      Assassination       0.74      0.48      0.58      5798
                      Armed Assault       0.76      0.89      0.82     12345
                  Bombing/Explosion       0.94      0.99      0.96     25217
                          Hijacking       0.96      0.14      0.24       187
Hostage Taking (Barricade Incident)       0.50      0.01      0.01       273
        Hostage Taking (Kidnapping)       0.87      0.74      0.80      3035
     Facility/Infrastructure Attack       0.82      0.80      0.81      2905
                    Unarmed Assault       0.82      0.32      0.47       370
                            Unknown       0.70      0.63      0.66      1834

                        avg / total       0.85      0.86      0.85     51964




Notice that the categories of _Hijacking_ and _Hostage Taking (Barricade Incident)_ have very low **support**: there are only 187 and 273 such rows in the test set, and correspondingly little for the model to train on. They also have low **recall** (0.14 and 0.01), meaning that the model found only a small fraction of the rows that actually belong to each category. **Precision** differs between the two: almost every row predicted as _Hijacking_ really was one (0.96), whereas only half the rows predicted as _Hostage Taking (Barricade Incident)_ were actually that (0.50).

We can see this replicated in the confusion matrix:

In [33]:

metrics.confusion_matrix(gtdb_test.target, predicted)

array([[ 2788,  1781,   932,     0,     0,    57,    38,     1,   201],
       [  544, 11025,   368,     0,     2,    74,   310,     5,    17],
       [   18,    97, 24952,     0,     0,     5,    53,     3,    89],
       [    5,    55,    38,    26,     0,    40,    10,     0,    13],
       [   21,   105,    56,     0,     2,    47,    14,     0,    28],
       [  213,   457,    45,     1,     0,  2231,    30,     1,    57],
       [   15,   405,    81,     0,     0,    17,  2311,    13,    63],
       [   42,    82,    57,     0,     0,    31,    23,   120,    15],
       [   99,   451,    50,     0,     0,    60,    22,     3,  1149]])


In the diagonal running from the top left to the bottom right, we can see that the _Hostage Taking (Barricade Incident)_ category was only predicted correctly twice, while the _Hijacking_ category was only predicted correctly 26 times.
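The figures in the classification report can be recomputed directly from this matrix. The sketch below copies the matrix from the output above (rows are true classes, columns are predicted classes) and uses two hypothetical helpers, `recall` and `precision`, to derive the scores for class 3 (_Hijacking_) and class 4 (_Hostage Taking (Barricade Incident)_).

```python
# Confusion matrix copied from the output above.
confusion = [
    [2788, 1781, 932, 0, 0, 57, 38, 1, 201],
    [544, 11025, 368, 0, 2, 74, 310, 5, 17],
    [18, 97, 24952, 0, 0, 5, 53, 3, 89],
    [5, 55, 38, 26, 0, 40, 10, 0, 13],
    [21, 105, 56, 0, 2, 47, 14, 0, 28],
    [213, 457, 45, 1, 0, 2231, 30, 1, 57],
    [15, 405, 81, 0, 0, 17, 2311, 13, 63],
    [42, 82, 57, 0, 0, 31, 23, 120, 15],
    [99, 451, 50, 0, 0, 60, 22, 3, 1149],
]

def recall(k):
    """Correct predictions of class k divided by all rows truly in class k (row sum)."""
    return confusion[k][k] / sum(confusion[k])

def precision(k):
    """Correct predictions of class k divided by all rows predicted as class k (column sum)."""
    return confusion[k][k] / sum(row[k] for row in confusion)

for k in (3, 4):
    print(k, round(precision(k), 2), round(recall(k), 2))
# 3 0.96 0.14
# 4 0.5 0.01

# Overall accuracy is the diagonal divided by the grand total.
accuracy = sum(confusion[k][k] for k in range(len(confusion))) / sum(map(sum, confusion))
print(round(accuracy, 4))   # 0.8584
```

The column for class 4 sums to only 4, so its precision of 0.50 rests on just four predictions.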


## Use a grid search strategy to find a good configuration of both the feature extraction components and the classifier


Instead of tweaking the parameters of the various components of the chain by hand, it is possible to run an exhaustive search for the best parameters on a grid of possible values. We try every combination of: single words only, words plus bigrams, or words plus bigrams and trigrams; with or without idf; a regularization strength `alpha` of 0.01, 0.001 or 0.0001; and a log or modified Huber loss function for the linear classifier. That makes 3 × 2 × 3 × 2 = 36 configurations, each scored by cross-validation. As you can see, the best cross-validated score from this search is 75.3%.
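To get a feel for the size of the search, the grid can be enumerated with `itertools.product`. The dictionary mirrors the parameter grid used in this section; the number of cross-validation folds is an assumption, since the default depends on the scikit-learn version.

```python
from itertools import product

grid = {
    "vect__ngram_range": [(1, 1), (1, 2), (1, 3)],
    "tfidf__use_idf": [True, False],
    "clf__alpha": [1e-2, 1e-3, 1e-4],
    "clf__loss": ["log", "modified_huber"],  # 'log' is named 'log_loss' in recent scikit-learn releases
}

# Every combination of one value per parameter.
combinations = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(combinations))             # 36

n_folds = 3  # assumption: the default number of folds varies between versions
print(len(combinations) * n_folds)   # 108 pipeline fits, plus one final refit
```

Each fit re-runs the vectorizer on the training folds, which goes some way towards explaining why the search takes far longer than a single fit.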

In [34]:

from sklearn.model_selection import GridSearchCV

t0 = time.time()

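parameters = {'vect__ngram_range': [(1, 1), (1, 2), (1, 3)],
              'tfidf__use_idf': (True, False),
              'clf__alpha': (1e-2, 1e-3, 1e-4),
              # the logistic loss was called 'log' in older scikit-learn releases
              'clf__loss': ('log_loss', 'modified_huber'),
}
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     # n_iter was replaced by max_iter in later scikit-learn releases; tol=None keeps exactly 5 passes
                     ('clf', SGDClassifier(penalty='l2', max_iter=5, tol=None, random_state=42)),
])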
gs_all_clf = GridSearchCV(text_clf, parameters, n_jobs=-1)
gs_all_clf = gs_all_clf.fit(gtdb_all.data, gtdb_all.target)

t1 = time.time()
print(t1-t0, time.ctime(t1))

# 0.7536404915378021
gs_all_clf.best_score_

1955.3238289356232 Mon Jul  3 16:25:54 2017


0.7536404915378021


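The comparative accuracy on the test set is 89.7%, up from the 85.8% of the hand-tuned pipeline. Keep in mind that the grid search was fit on `gtdb_all`, which includes the test rows, so this figure is optimistic rather than a true held-out estimate.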

In [35]:

t0 = time.time()
predicted = gs_all_clf.predict(gtdb_test.data)
t1 = time.time()
print(t1-t0, time.ctime(t1))
np.mean(predicted == gtdb_test.target)

3.8498926162719727 Mon Jul  3 17:34:29 2017


0.89656300515741671


### Add a "predicted" column to all the datasets and save them as CSVs

In [36]:

t0 = time.time()

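for df in [acled_df, rand_df, scad_df, ucdp_df, gtdb_df]:
    # Mirror the training preprocessing: blank out missing values and leave out the
    # attack type columns (only GTDB has them) so the labels do not leak into the text
    features_df = df.drop(columns=attack_column_list, errors='ignore').fillna(value="")
    data = concat_independendent_variables(features_df).tolist()
    df['predicted_id'] = gs_all_clf.predict(data)
    df['predicted_type'] = df['predicted_id'].map(lambda x: gtdb_all.target_names[x])
    df['probabilities'] = pd.Series(list(gs_all_clf.predict_proba(data)))
    # Report the probability of the predicted class as a percentage
    df['probability'] = df.apply(lambda row: "{0:.1f}%".format(row['probabilities'][row['predicted_id']]*100), axis=1)
    df.drop(['predicted_id','probabilities'], axis=1, inplace=True)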

t1 = time.time()
print(t1-t0, time.ctime(t1))

1283.0713064670563 Mon Jul  3 17:56:17 2017


In [37]:

csv_folder = "../data/csv/"
gtdb_df.to_csv(csv_folder+"gtdb_df.csv", sep=',', encoding=encoding, index=False)
acled_df.to_csv(csv_folder+"acled_df.csv", sep=',', encoding=encoding, index=False)
rand_df.to_csv(csv_folder+"rand_df.csv", sep=',', encoding=encoding, index=False)
scad_df.to_csv(csv_folder+"scad_df.csv", sep=',', encoding=encoding, index=False)
ucdp_df.to_csv(csv_folder+"ucdp_df.csv", sep=',', encoding=encoding, index=False)