<h2><center><b><i>Cluster bomb</i></b>: Uncovering Patterns in Terrorist Group Beliefs and Attacks</center></h2>

#### **COM-480: Data Visualization**

**Team**: Alexander Sternfeld, Silvia Romanato & Antoine Bonnet

**Dataset**: [Global Terrorism Database (GTD)](https://www.start.umd.edu/gtd/) 

**Additional dataset**: [Profiles of Perpetrators of Terrorism in the United States (PPTUS)](https://dataverse.harvard.edu/dataset.xhtml?persistentId=hdl%3A1902.1/17702)

## **Map**
 

In [1]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from load_data import *

pd.set_option('display.max_columns', None)

GTD = load_GTD()
PPTUS_data, PPTUS_sources = load_PPTUS()

GTD pickle file found, loading...
PPTUS pickle files found, loading...


## Creating an interactive map

Our goal is to create the data underlying an interactive map of the world with terrorist attacks shown as a flow chart. 
Our main inspiration is the [Flight Paths Edge Bundling](https://gist.github.com/sjengle/2e58e83685f6d854aa40c7bc546aeb24) project. We need to adapt its code to include all countries of the world (rather than only American states) and to draw terrorist attacks as lines from the origin country to the country where the attack took place.

We need to create the `countries.csv` and `attacks.csv` files from the GTD. 

To find each country's location on the map, we load the [World Countries Centroids](https://github.com/gavinr/world-countries-centroids) dataset ([direct link to CSV](https://cdn.jsdelivr.net/gh/gavinr/world-countries-centroids@v1/dist/countries.csv)). We also use the [alternate country names](https://www.kaggle.com/datasets/wbdill/country-aliaseslist-of-alternative-country-names?resource=download) dataset to match country names from the GTD to ISO codes.

The centroids file looks like this: 

    Columns:  longitude,latitude,COUNTRY,ISO,COUNTRYAFF,AFF_ISO
    Example:  -159.78768870952257,-21.222613253399842,Cook Islands,CK,New Zealand,NZ

The `countries.csv` file must look like:

    Columns:  COUNTRY,COUNTRY_CODE,longitude,latitude
    Example:  New Zealand,NZ,-159.78768870952257,-21.222613253399842

Note: All attacks on affiliated territories must be remapped to the country with which these territories are affiliated. 
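The remapping described in the note can be sketched as follows. This is a minimal illustration rather than the notebook's actual pipeline: it assumes the centroid columns shown above, uses four rows copied from that file, and simply relabels each territory with its affiliated country while keeping the territory's own coordinates (as in the example row). Deduplicating countries that end up with several rows is left out.

```python
import pandas as pd

# Four rows taken from the centroids file shown above
centroids = pd.DataFrame({
    'longitude': [-159.787689, -169.868781, -149.400417, -3.651625],
    'latitude': [-21.222613, -19.052309, -17.674684, 40.365008],
    'COUNTRY': ['Cook Islands', 'Niue', 'French Polynesia', 'Spain'],
    'ISO': ['CK', 'NU', 'PF', 'ES'],
    'COUNTRYAFF': ['New Zealand', 'New Zealand', 'France', 'Spain'],
    'AFF_ISO': ['NZ', 'NZ', 'FR', 'ES'],
})

# Relabel every row with its affiliated country and reorder the columns
# into the countries.csv layout: COUNTRY,COUNTRY_CODE,longitude,latitude
countries = pd.DataFrame({
    'COUNTRY': centroids['COUNTRYAFF'],
    'COUNTRY_CODE': centroids['AFF_ISO'],
    'longitude': centroids['longitude'],
    'latitude': centroids['latitude'],
})

print(countries.to_csv(index=False))
```

Sovereign countries such as Spain are unaffected because their `COUNTRYAFF` and `AFF_ISO` values equal their own name and code.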


In [2]:
# Get country locations csv from url 
locations_url = 'https://cdn.jsdelivr.net/gh/gavinr/world-countries-centroids@v1/dist/countries.csv'
locations = pd.read_csv(locations_url, keep_default_na=False, na_values=['_'])
locations.head()

Unnamed: 0,longitude,latitude,COUNTRY,ISO,COUNTRYAFF,AFF_ISO
0,-170.700732,-14.305712,American Samoa,AS,United States,US
1,166.638003,19.302046,United States Minor Outlying Islands,UM,United States,US
2,-159.787689,-21.222613,Cook Islands,CK,New Zealand,NZ
3,-149.400417,-17.674684,French Polynesia,PF,France,FR
4,-169.868781,-19.052309,Niue,NU,New Zealand,NZ
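One likely reason for passing `keep_default_na=False` in the cell above: by default, pandas parses the string `NA` as a missing value, which would erase Namibia's ISO code. The sketch below shows the difference on a two-row inline CSV written for this illustration (it is not the real centroids file).

```python
import io
import pandas as pd

# Tiny stand-in for the centroids CSV, made up for illustration
csv_text = "COUNTRY,ISO\nNamibia,NA\nNew Zealand,NZ\n"

# Default parsing: the string "NA" is interpreted as a missing value
default = pd.read_csv(io.StringIO(csv_text))

# Same options as the notebook cell: only "_" is treated as missing
strict = pd.read_csv(io.StringIO(csv_text), keep_default_na=False, na_values=['_'])

namibia_default = default.loc[default['COUNTRY'] == 'Namibia', 'ISO'].iloc[0]
namibia_strict = strict.loc[strict['COUNTRY'] == 'Namibia', 'ISO'].iloc[0]

print('default parsing:', namibia_default)
print('keep_default_na=False:', namibia_strict)
```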


In [3]:
# Get alternate names of countries
alt_names_path = os.path.join(DATA_DIR, 'country_aliases.csv')
alt_names = pd.read_csv(alt_names_path)
alt_names.head()


Unnamed: 0,iso3,Alias,AliasDescription
0,,Abkhazia,"common, English"
1,,Republic of Abkhazia,"official, English"
2,,Aphsny Axwynthkharra,"official, Abkhaz"
3,,Respublika Abkhaziya,"official, Russian"
4,,Autonomous Republic of Abkhazia,"Internationally recognized, English"


In [4]:
# Extract all (country, ISO) pairs from locations to a dictionary
country_to_iso = dict(zip(locations['COUNTRY'], locations['ISO']))

# Extract all (aff_country, aff_ISO) pairs from locations to a dictionary
aff_to_iso = dict(zip(locations['COUNTRYAFF'], locations['AFF_ISO']))

# Create a dictionary of {key: GTD country names, value: (ISO code, GTD code)} pairs
country_dict = {}
GTD_country_codes = dict(GTD[['country_txt', 'country']].drop_duplicates().values) # (GTD country name, GTD code) pairs

# Rule (checked in this order, using substring matching in both directions):
# 1. If the GTD country name matches a name in locations.COUNTRYAFF, use the ISO code from locations.AFF_ISO.
# 2. Else, if it matches a name in locations.COUNTRY, use the ISO code from locations.ISO.
# 3. Else, if it matches a name in alt_names.Alias, use the ISO code from alt_names.iso3.
# Otherwise, we have no ISO code for that country and we remove its attacks from the GTD dataframe.

def match_country_to_iso(country):
    for c in aff_to_iso.keys():
        if country in c or c in country:
            return aff_to_iso[c]
    for c in country_to_iso.keys(): 
        if country in c or c in country: 
            return country_to_iso[c]
    for c in alt_names['Alias'].values:
        if country in c or c in country:
            return alt_names[alt_names['Alias'] == c]['iso3'].values[0]
    return None

for country, GTD_code in GTD_country_codes.items():
    iso = match_country_to_iso(country)
    if iso is not None:
        country_dict[country] = (iso, GTD_code)

print('{} out of {} countries in the GTD were matched with their ISO code.'.format(len(country_dict), len(GTD['country'].unique())))
print('Unmatched countries: {}'.format(set(GTD['country_txt'].unique()) - set(country_dict.keys())))

# We now filter out attacks that have no identifiable ISO code
print('Removing attacks from those countries from the GTD dataframe...')
print('\tNumber of attacks in GTD before removing unmatched countries: ', len(GTD))
GTD = GTD[GTD['country_txt'].isin(country_dict.keys())].copy()  # copy so the column added below is not written to a view
print('\tNumber of attacks in GTD after removing unmatched countries: ', len(GTD))

# Add the ISO code to the GTD dataframe
GTD['ISO_code'] = GTD['country_txt'].map({country: ISO_code for country, (ISO_code, _) in country_dict.items()})
GTD = GTD[GTD['ISO_code'].notna()] # Remove Kosovo, which has no location data

196 out of 204 countries in the GTD were matched with their ISO code.
Unmatched countries: {'St. Lucia', 'Hong Kong', 'St. Kitts and Nevis', 'Western Sahara', 'International', 'Czechoslovakia', 'Macau', 'Bosnia-Herzegovina'}
Removing attacks from those countries from the GTD dataframe...
	Number of attacks in GTD before removing unmatched countries:  214666
	Number of attacks in GTD after removing unmatched countries:  214423
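The matcher above accepts a candidate whenever one name is a substring of the other. This is what lets "Russia" match "Russian Federation", but it may also pick the wrong country when one name is contained in another. The sketch below illustrates the risk on a toy dictionary and shows one possible mitigation: try an exact, case-insensitive match first and fall back to substring matching only if that fails. Both function names are hypothetical helpers, not part of the notebook.

```python
def substring_match(name, name_to_iso):
    """Return the ISO code of the first candidate that contains, or is contained in, `name`."""
    for candidate, iso in name_to_iso.items():
        if name in candidate or candidate in name:
            return iso
    return None


def exact_then_substring_match(name, name_to_iso):
    """Prefer an exact (case-insensitive) match, then fall back to substring matching."""
    normalized = {key.strip().lower(): iso for key, iso in name_to_iso.items()}
    key = name.strip().lower()
    if key in normalized:
        return normalized[key]
    return substring_match(name, name_to_iso)


# Toy lookup table for illustration
toy = {'Nigeria': 'NG', 'Niger': 'NE', 'Russian Federation': 'RU'}

print(substring_match('Niger', toy))             # 'Niger' is a substring of 'Nigeria'
print(exact_then_substring_match('Niger', toy))  # exact match is found first
print(exact_then_substring_match('Russia', toy)) # falls back to substring matching
```

Whether such collisions actually occur in the GTD depends on the names and on the iteration order of the dictionaries, so the matched pairs are worth inspecting by hand.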


In [5]:
# For each location, add the GTD country name by matching the ISO code
locations['GTD_country'] = locations['ISO'].map({ISO_code: country for country, (ISO_code, _) in country_dict.items()})

# For each NaN GTD country, add the GTD country name by matching the AFF_ISO code
locations['GTD_country'] = locations['GTD_country'].fillna(locations['AFF_ISO'].map({ISO_code: country for country, (ISO_code, _) in country_dict.items()}))

# Remove NaN GTD_country (i.e. locations that don't have a matching GTD country)
locations = locations[~locations['GTD_country'].isna()]
print('Number of countries with either ISO or AFF_ISO in GTD:', len(locations))

# Keep only locations whose own ISO code was matched to a GTD country
matched_ISO = [x for x, y in country_dict.values()]
locations = locations[locations['ISO'].isin(matched_ISO)].copy()  # copy so later column assignments do not write to a view
print('Number of countries with ISO in GTD:', len(locations))

locations

Number of countries with either ISO or AFF_ISO in GTD: 216
Number of countries with ISO in GTD: 172


Unnamed: 0,longitude,latitude,COUNTRY,ISO,COUNTRYAFF,AFF_ISO,GTD_country
3,-149.400417,-17.674684,French Polynesia,PF,France,FR,French Polynesia
9,-178.127356,-14.283442,Wallis and Futuna,WF,France,FR,Wallis and Futuna
10,-88.859115,13.758042,El Salvador,SV,El Salvador,SV,El Salvador
11,-90.312193,15.820879,Guatemala,GT,Guatemala,GT,Guatemala
12,-101.553997,23.874361,Mexico,MX,Mexico,MX,Mexico
...,...,...,...,...,...,...,...
239,127.337981,40.191981,North Korea,KP,North Korea,KP,North Korea
241,137.469342,36.767388,Japan,JP,Japan,JP,Japan
246,98.670499,59.039434,Russian Federation,RU,Russian Federation,RU,Russia
247,-3.651625,40.365008,Spain,ES,Spain,ES,Spain


In [6]:
# Add GTD_ISO and GTD_code columns to locations, matching first on the ISO code and then falling back to the AFF_ISO code
locations['GTD_ISO'] = locations['ISO'].map({ISO_code: ISO_code for country, (ISO_code, _) in country_dict.items()})
locations['GTD_ISO'] = locations['GTD_ISO'].fillna(locations['AFF_ISO'].map({ISO_code: ISO_code for country, (ISO_code, _) in country_dict.items()}))
locations['GTD_code'] = locations['ISO'].map({ISO_code: GTD_code for country, (ISO_code, GTD_code) in country_dict.items()})
locations['GTD_code'] = locations['GTD_code'].fillna(locations['AFF_ISO'].map({ISO_code: GTD_code for country, (ISO_code, GTD_code) in country_dict.items()}))

# Remove rows in locations where GTD_country does not match COUNTRY
locations = locations[locations['GTD_country'] == locations['COUNTRY']]
print('Number of countries with ISO in GTD and COUNTRY == GTD_country:', len(locations))

locations = locations.drop(columns=['COUNTRY', 'COUNTRYAFF', 'ISO', 'AFF_ISO'])
locations = locations.rename(columns={'GTD_country': 'COUNTRY', 'GTD_ISO': 'ISO'})

Number of countries with ISO in GTD and COUNTRY == GTD_country: 165


In [7]:
# Save locations to csv file
locations.to_csv(os.path.join(DATA_DIR, 'locations.csv'), index=False)
locations

Unnamed: 0,longitude,latitude,COUNTRY,ISO,GTD_code
3,-149.400417,-17.674684,French Polynesia,PF,71
9,-178.127356,-14.283442,Wallis and Futuna,WF,226
10,-88.859115,13.758042,El Salvador,SV,61
11,-90.312193,15.820879,Guatemala,GT,83
12,-101.553997,23.874361,Mexico,MX,130
...,...,...,...,...,...
236,121.822089,15.586542,Philippines,PH,160
237,127.762246,36.402387,South Korea,KR,184
239,127.337981,40.191981,North Korea,KP,149
241,137.469342,36.767388,Japan,JP,101



The [`attacks.csv`](https://gist.github.com/mbostock/7608400#file-flights-csv) file must look like:

    origin, origin_ISO, origin_code, location, location_ISO, location_code, casualties
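As a rough sketch of how a table with this layout could be assembled, the example below aggregates casualties per (origin, location) pair and attaches ISO and GTD codes from a `locations`-style lookup. The `raw` frame, its `origin` column and its casualty counts are invented for illustration. How the origin country of an attack is determined is not covered here and is treated as given.

```python
import pandas as pd

# Lookup rows taken from the locations table above
locations = pd.DataFrame({
    'COUNTRY': ['El Salvador', 'Guatemala', 'Mexico'],
    'ISO': ['SV', 'GT', 'MX'],
    'GTD_code': [61, 83, 130],
})

# Made-up attack records (hypothetical columns and values)
raw = pd.DataFrame({
    'origin': ['Mexico', 'Mexico', 'Guatemala'],
    'location': ['Guatemala', 'Guatemala', 'El Salvador'],
    'casualties': [2, 3, 1],
})

# One row per (origin, location) flow, with casualties summed
flows = raw.groupby(['origin', 'location'], as_index=False)['casualties'].sum()

# Attach ISO and GTD codes for both ends of each flow
lookup = locations.set_index('COUNTRY')
for side in ('origin', 'location'):
    flows[f'{side}_ISO'] = flows[side].map(lookup['ISO'])
    flows[f'{side}_code'] = flows[side].map(lookup['GTD_code'])

attacks = flows[['origin', 'origin_ISO', 'origin_code',
                 'location', 'location_ISO', 'location_code', 'casualties']]
print(attacks.to_csv(index=False))
```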


In [8]:
# Keep all attacks with ISO_code in locations ISO
print('Number of attacks in GTD before removing attacks from countries not in locations:', len(GTD))
GTD = GTD[GTD['ISO_code'].isin(locations['ISO'].unique())]
print('Number of attacks in GTD after removing attacks from countries not in locations:', len(GTD))

Number of attacks in GTD before removing attacks from countries not in locations: 214221
Number of attacks in GTD after removing attacks from countries not in locations: 206713
