<h2><center><b><i>Cluster bomb</i></b>: Uncovering Patterns in Terrorist Group Beliefs and Attacks</center></h2>

#### **COM-480: Data Visualization**

**Team**: Alexander Sternfeld, Silvia Romanato & Antoine Bonnet

**Dataset**: [Global Terrorism Database (GTD)](https://www.start.umd.edu/gtd/) 

**Additional dataset**: [Profiles of Perpetrators of Terrorism in the United States (PPTUS)](https://dataverse.harvard.edu/dataset.xhtml?persistentId=hdl%3A1902.1/17702)

## **Exploratory Data Analysis**
 

In [1]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from load_data import *

pd.set_option('display.max_columns', None)

## Part 1: Global Terrorism Database (GTD)

### Loading the data

>
> **IMPORTANT**: To run our code, please [download the GTD files](https://www.start.umd.edu/gtd/contact/download) by filling out the form. After a few minutes, you will receive two data files, `GTD1.xlsx` and `GTD2.xlsx`.
>
> Also download the `PPT-US_0517dist.xlsx` file from [this link](https://dataverse.harvard.edu/dataset.xhtml?persistentId=hdl%3A1902.1/17702). Place all files in the `/data/raw` folder. 
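
Before loading anything, it can help to confirm that the raw files are where the loaders expect them. The check below is only a sketch: the relative path `data/raw`, the exact file names, and the helper `missing_raw_files` are assumptions for illustration and are not part of `load_data`.

```python
from pathlib import Path

# Assumed location and file names, based on the download instructions;
# adjust them if your files are named or placed differently.
RAW_DIR = Path('data/raw')
EXPECTED_FILES = ['GTD1.xlsx', 'GTD2.xlsx', 'PPT-US_0517dist.xlsx']


def missing_raw_files(raw_dir=RAW_DIR, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in raw_dir."""
    raw_dir = Path(raw_dir)
    return [name for name in expected if not (raw_dir / name).is_file()]


missing = missing_raw_files()
if missing:
    print('Missing raw data files:', ', '.join(missing))
else:
    print('All raw data files found.')
```

Running this first makes a missing download show up as a short message rather than as an error inside the loading functions.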

In [None]:
# NOTE: This takes about 3 minutes to run the first time
GTD = load_GTD()

GTD.head()

### 1. Number of attacks

In [None]:
# Plot distribution of column 'iyear'
sns.histplot(GTD['iyear'])
plt.xlabel('Year')
plt.ylabel('Number of attacks')
plt.show()


### 2. Geographic breakdown

In [None]:
# For each region, plot the number of attacks per year
# create subplots
fig, ax = plt.subplots(3, 4, figsize=(16, 10))
# create a list of regions
regions = GTD['region_txt'].unique()
# create a list of axes
axes = ax.flatten()
# iterate over regions and axes
for region, ax in zip(regions, axes):
    # filter the data
    GTD_region = GTD[GTD['region_txt'] == region]
    # plot the data
    sns.histplot(GTD_region['iyear'], ax=ax)
    # set title
    ax.set_title(region)
fig.tight_layout()
plt.show()
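
The fixed 3x4 grid above holds at most 12 panels, and `zip` silently drops any region beyond the number of axes. A minimal sketch of deriving the grid from the number of regions instead; the helper `grid_shape` is hypothetical and not part of the notebook.

```python
import math


def grid_shape(n_panels, ncols=4):
    """Smallest (nrows, ncols) grid that holds n_panels subplots."""
    if n_panels < 1:
        raise ValueError('n_panels must be at least 1')
    ncols = min(ncols, n_panels)
    return math.ceil(n_panels / ncols), ncols


# Hypothetical usage with the notebook's names:
# nrows, ncols = grid_shape(len(regions))
# fig, ax = plt.subplots(nrows, ncols, squeeze=False)  # squeeze=False always returns a 2-D array
print(grid_shape(12))
```

With this, no region is dropped if the number of regions changes; any unused axes in the last row can be hidden with `ax.set_visible(False)`.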



In [None]:
# Plot the 10 countries with the most attacks (left) and the 10 countries with the fewest attacks (right).
# The number of attacks is the number of rows for each value of 'country_txt'. Tilt x-axis labels by 90 degrees.
# make subplots
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))
# remove attacks whose country is recorded as 'International'
GTD = GTD[GTD['country_txt'] != 'International']
# count attacks per country once (value_counts sorts in descending order)
country_counts = GTD['country_txt'].value_counts()
# plot the 10 countries with the most attacks
sns.barplot(x=country_counts.head(10).index,
            y=country_counts.head(10).values, ax=ax1)
ax1.set_xticklabels(ax1.get_xticklabels(), rotation=90)
ax1.set_title('10 countries with the most attacks')
# y axis label
ax1.set_ylabel('Number of attacks')
# plot the 10 countries with the fewest attacks
sns.barplot(x=country_counts.tail(10).index,
            y=country_counts.tail(10).values, ax=ax2)
ax2.set_xticklabels(ax2.get_xticklabels(), rotation=90)
ax2.set_title('10 countries with the fewest attacks')
# y axis label
ax2.set_ylabel('Number of attacks')
plt.show()


### Types of attacks

In [None]:
# For the 5 most frequently used weapon types, plot the percentage of attacks per year
# that used each weapon, in one plot with a different line for each weapon (lines, not bars).
# create a list of weapons
weapons = GTD['weaptype1_txt'].value_counts().head(5).index
# create a list of colors
colors = ['red', 'blue', 'green', 'orange', 'purple']
# create a figure
fig, ax = plt.subplots(figsize=(8, 5))
# total number of attacks per year
attacks_per_year = GTD['iyear'].value_counts().sort_index()
# iterate over weapons and colors
for weapon, color in zip(weapons, colors):
    # filter the data
    GTD_weapon = GTD[GTD['weaptype1_txt'] == weapon]
    # percentage (0-100) of attacks per year that used this weapon
    share = 100 * GTD_weapon['iyear'].value_counts().sort_index() / attacks_per_year
    # smooth with a rolling average of 5 years
    ax.plot(share.rolling(5).mean(), color=color)
ax.legend(weapons, loc='center left', bbox_to_anchor=(1, 0.5))
# add axis labels
ax.set_xlabel('Year')
ax.set_ylabel('Percentage of attacks')

plt.show()
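
To make the yearly weapon-share computation easier to check, here is a small sketch on toy data. The helper `yearly_share`, the toy frame, and the choices `fill_value=0` and `min_periods=1` are assumptions for illustration; the cell above uses a plain 5-year window and leaves years with no matching attack as missing values.

```python
import pandas as pd


def yearly_share(df, column, value, window=5):
    """Rolling mean of the yearly percentage of rows where df[column] == value."""
    total = df['iyear'].value_counts().sort_index()
    hits = df.loc[df[column] == value, 'iyear'].value_counts()
    # Assumption: years with no matching attack count as 0 rather than NaN
    hits = hits.reindex(total.index, fill_value=0)
    # Assumption: min_periods=1 keeps the first years instead of dropping them
    return (100 * hits / total).rolling(window, min_periods=1).mean()


# Toy data: two attacks per year over three years
toy = pd.DataFrame({
    'iyear': [2000, 2000, 2001, 2001, 2002, 2002],
    'weaptype1_txt': ['Explosives', 'Firearms', 'Firearms',
                      'Firearms', 'Explosives', 'Explosives'],
})
share = yearly_share(toy, 'weaptype1_txt', 'Explosives', window=2)
print(share)
```

The raw yearly percentages are 50, 0 and 100, so with a 2-year window the smoothed values are 50, 25 and 50.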


In [None]:
# Bar plot of the number of attacks per attack type (column 'attacktype1_txt'). The number of attacks is the number of rows for each attack type
sns.barplot(x=GTD['attacktype1_txt'].value_counts().head(
    20).index, y=GTD['attacktype1_txt'].value_counts().head(20).values)
# title and y axis label
plt.title('Top 20 attack types')
plt.ylabel('Number of attacks')
# tilt x-axis labels by 90 degrees
plt.xticks(rotation=90)
plt.show()

### Types of targets

In [None]:
# Do the same for the top 5 target types
# create a list of target types
targets = GTD['targtype1_txt'].value_counts().head(5).index
# create a list of colors
colors = ['red', 'blue', 'green', 'orange', 'purple']
# create a figure
fig, ax = plt.subplots(figsize=(8, 5))
# iterate over target types and colors
for target, color in zip(targets, colors):
    # filter the data
    GTD_target = GTD[GTD['targtype1_txt'] == target]
    # plot the data, do not use sns lineplot
    GTD_target.groupby('iyear')['iyear'].count().plot(ax=ax, color=color)
# set title
ax.set_title('Top 5 target types')
# set x axis label and y axis label
ax.set_xlabel('Year')
ax.set_ylabel('Number of attacks')
# add legend
ax.legend(targets)
plt.show()

In [None]:
# Do the same, but now for the column 'targtype1_txt', which contains the type of target. The number of attacks is the number of rows for each target type
# On the right hand side, do it for the column 'targsubtype1_txt', which contains the subtype of target. The number of attacks is the number of rows for each target subtype
# make subplots
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))
# plot top 20 target types
sns.barplot(x=GTD['targtype1_txt'].value_counts().head(20).index, y=GTD['targtype1_txt'].value_counts().head(20).values, ax=ax1)
ax1.set_xticklabels(ax1.get_xticklabels(), rotation=90)
ax1.set_title('Top 20 target types')
# y axis label
ax1.set_ylabel('Number of attacks')
# plot top 20 target subtypes
sns.barplot(x=GTD['targsubtype1_txt'].value_counts().head(20).index, y=GTD['targsubtype1_txt'].value_counts().head(20).values, ax=ax2)
ax2.set_xticklabels(ax2.get_xticklabels(), rotation=90)
ax2.set_title('Top 20 target subtypes')
# y axis label
ax2.set_ylabel('Number of attacks')
plt.show()



### Motives

In [None]:
# the percentage of null motive with 2 decimal places
print("The percentage of null motive is:", round(
    GTD['motive'].isnull().sum() / len(GTD) * 100, 2), "%")


In [None]:
from wordcloud import WordCloud

# Make a wordcloud of the motives
# create a list of motives
motives = GTD['motive'].dropna().values.tolist()
# create a string of motives
motives = ' '.join(motives)
# lower case all words
motives = motives.lower()
irrelevant_words = ['unknown', 'specific', 'claimed', 'responsibility', 'attack']
# Remove irrelevant words
for word in irrelevant_words:
    motives = motives.replace(word, '')
# create a wordcloud
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(motives)
# plot the wordcloud
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud)
plt.axis('off')
plt.show()
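
When filtering uninformative words from free text such as the `motive` column, the matching rule matters: plain substring replacement also cuts the word out of longer words, while whole-word matching leaves them intact. A minimal standard-library sketch of the difference; the helper `remove_words` and the sample sentence are made up for illustration.

```python
import re


def remove_words(text, words):
    """Remove the given words only where they appear as whole words."""
    pattern = r'\b(?:' + '|'.join(re.escape(w) for w in words) + r')\b'
    cleaned = re.sub(pattern, '', text)
    # Collapse the whitespace left behind by the removed words
    return re.sub(r'\s+', ' ', cleaned).strip()


sample = 'the attackers claimed the attack'
substring_result = sample.replace('attack', '')
whole_word_result = remove_words(sample, ['attack', 'claimed'])
print(repr(substring_result))
print(repr(whole_word_result))
```

Here the substring version turns `attackers` into the fragment `ers`, whereas the whole-word version keeps `attackers` and removes only the standalone `attack` and `claimed`.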



### Terrorist groups

In [None]:
# the number of groups in the dataset
print("The number of groups is:", GTD['gname'].nunique())
# the number of groups with more than 1000 attacks
print("The number of groups with more than 1000 attacks:", GTD['gname'].value_counts()[GTD['gname'].value_counts() > 1000].count())

In [None]:
# plot the distribution of attacks per group
fig, ax = plt.subplots(figsize=(8, 5))
# plot the log-log of the number of attacks per group
sns.histplot(GTD['gname'].value_counts(), log_scale=(True, True), ax=ax)
ax.set_xlabel('Number of attacks')
ax.set_ylabel('Number of groups')
plt.show()

In [None]:
# Plot the 10 largest groups in terms of number of attacks
# in one barplot

# create a dataframe with one row per group and its number of attacks;
# naming both columns explicitly avoids depending on how the installed
# pandas version labels the output of value_counts()
GTD_group = (GTD['gname'].value_counts()
             .rename_axis('group_name')
             .reset_index(name='number_of_attacks'))
# plot the top 10 groups
sns.barplot(x='group_name', y='number_of_attacks', data=GTD_group.head(10))
# tilt x-axis labels by 90 degrees
plt.xticks(rotation=90)
plt.show()



In [None]:
# the 10 groups with the most attacks (sorted in ascending order of attack count)
top_10_groups = GTD.groupby('gname').count()['eventid'].sort_values()[-10:].index
print('The top 10 groups are:', list(top_10_groups))

In [None]:
# plot the number of attacks per attack type for each of the top 10 groups
fig, ax = plt.subplots(1, 1, figsize=(15, 5))
sns.countplot(x='gname', hue='attacktype1_txt', data=GTD[GTD['gname'].isin(top_10_groups)], ax=ax)
ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
ax.set_title('Attack types used by the top 10 groups')
plt.show()

## Part 2: Profiles of Perpetrators of Terrorism in the United States

In [2]:
# NOTE: This takes about __ minutes to run the first time
PPTUS_data, PPTUS_sources = load_PPTUS()

PPTUS pickle files found, loading...


In [None]:
PPTUS_data.head()

In [None]:
PPTUS_sources.head()