<a href="https://colab.research.google.com/github/cpvivek/Global-Terrorism-Database-EDA/blob/main/Global_Terrorism_Database_EDA__Vivek_CP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# EDA on Global Terrorism Database
The Global Terrorism Database (GTD) is maintained by the National Consortium for the Study of Terrorism and Responses to Terrorism (START) and contains records of terrorist incidents around the globe since 1970.

# Scope of the Project
Since this is an exploratory data analysis project, its scope is limited to deriving meaningful insights and patterns from the dataset at the global, regional (primarily South Asian), and national (India) levels. 
The focus is not on finding solutions to problems pertaining to terrorism, but on building intuition from the dataset.

My individual contributions to the project focus on the following deliverables:

1. Visual representation of attacks across the globe.
2. Word cloud displaying the group names with font size proportional to frequency of attacks.
3. Which group has the highest success rate?
4. Success rate of different attack types.
5. Actions of major groups.
6. Which group has attacked the most countries?
7. Heat maps
8. Tree maps
9. Timelines





# Data Preparation

This is a large dataset, with over 136 fields and about 1.8 lakh (180,000) rows, so we need to drop the fields we don't need, fill NaN values appropriately, and rename the fields for convenience.

In [None]:
#let's take help of following libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px  # I use plotly.express for most of my visualisations.
import plotly.graph_objects as go   

In [None]:
# #Plotly has a new update and I love it already :'). Let's get rid of version 4.4.1 and get 5.3.1 tenacity 8.0.1
# pip install --upgrade plotly

In [4]:
#mounting the drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


I've used the latest database available on the official START website. The dataset contains records of incidents from 1970 to 2019. 
You can find it here:
[GTD download page](https://www.start.umd.edu/gtd/access/#gtd-download)

In [8]:
# Reading the dataset. 
gtd_global_primary=pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Alma Better Pro Program/Capstone Projects/EDA Capstone/Data Sets/Global Terrorism Data/Global Terrorism Data_2017.csv',encoding='ISO-8859-1')



Columns (4,6,31,33,61,62,63,76,79,90,92,94,96,114,115,121) have mixed types.Specify dtype option on import or set low_memory=False.
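The warning above appears when pandas infers column types chunk by chunk and finds both numbers and text in the same column. The sketch below shows the two remedies the warning itself names, using a tiny in-memory CSV with made-up values rather than the real GTD file. If the positions in the warning are zero-based, as they appear to be, 4 and 6 correspond to `approxdate` and `resolution` in the column list; reading them as strings is an assumption for illustration, not something the codebook prescribes.

```python
import io
import pandas as pd

# Tiny stand-in for the GTD file; the values are made up for illustration.
csv_text = (
    "eventid,iyear,approxdate,resolution\n"
    "197000000001,1970,,\n"
    "197000000002,1970,1970-01-15,\n"
    "197001000001,1970,January 1970,3/1/1970\n"
)

# Option 1: read the whole file before inferring column types.
df_whole = pd.read_csv(io.StringIO(csv_text), low_memory=False)

# Option 2: declare the type of the ambiguous columns up front.
# Treating them as strings is an assumption made for this sketch.
df_typed = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"approxdate": "string", "resolution": "string"},
)

print(df_typed.dtypes)
```

Option 1 costs more memory on a file of this size; option 2 needs a `dtype` entry for every column the warning lists.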



In [9]:
gtd_global_primary.head()

Unnamed: 0,eventid,iyear,imonth,iday,approxdate,extended,resolution,country,country_txt,region,region_txt,provstate,city,latitude,longitude,specificity,vicinity,location,summary,crit1,crit2,crit3,doubtterr,alternative,alternative_txt,multiple,success,suicide,attacktype1,attacktype1_txt,attacktype2,attacktype2_txt,attacktype3,attacktype3_txt,targtype1,targtype1_txt,targsubtype1,targsubtype1_txt,corp1,target1,...,weapsubtype4,weapsubtype4_txt,weapdetail,nkill,nkillus,nkillter,nwound,nwoundus,nwoundte,property,propextent,propextent_txt,propvalue,propcomment,ishostkid,nhostkid,nhostkidus,nhours,ndays,divert,kidhijcountry,ransom,ransomamt,ransomamtus,ransompaid,ransompaidus,ransomnote,hostkidoutcome,hostkidoutcome_txt,nreleased,addnotes,scite1,scite2,scite3,dbsource,INT_LOG,INT_IDEO,INT_MISC,INT_ANY,related
0,197000000001,1970,7,2,,0,,58,Dominican Republic,2,Central America & Caribbean,,Santo Domingo,18.456792,-69.951164,1.0,0,,,1,1,1,0.0,,,0.0,1,0,1,Assassination,,,,,14,Private Citizens & Property,68.0,Named Civilian,,Julio Guzman,...,,,,1.0,,,0.0,,,0,,,,,0.0,,,,,,,0.0,,,,,,,,,,,,,PGIS,0,0,0,0,
1,197000000002,1970,0,0,,0,,130,Mexico,1,North America,Federal,Mexico city,19.371887,-99.086624,1.0,0,,,1,1,1,0.0,,,0.0,1,0,6,Hostage Taking (Kidnapping),,,,,7,Government (Diplomatic),45.0,"Diplomatic Personnel (outside of embassy, cons...",Belgian Ambassador Daughter,"Nadine Chaval, daughter",...,,,,0.0,,,0.0,,,0,,,,,1.0,1.0,0.0,,,,Mexico,1.0,800000.0,,,,,,,,,,,,PGIS,0,1,1,1,
2,197001000001,1970,1,0,,0,,160,Philippines,5,Southeast Asia,Tarlac,Unknown,15.478598,120.599741,4.0,0,,,1,1,1,0.0,,,0.0,1,0,1,Assassination,,,,,10,Journalists & Media,54.0,Radio Journalist/Staff/Facility,Voice of America,Employee,...,,,,1.0,,,0.0,,,0,,,,,0.0,,,,,,,0.0,,,,,,,,,,,,,PGIS,-9,-9,1,1,
3,197001000002,1970,1,0,,0,,78,Greece,8,Western Europe,Attica,Athens,37.99749,23.762728,1.0,0,,,1,1,1,0.0,,,0.0,1,0,3,Bombing/Explosion,,,,,7,Government (Diplomatic),46.0,Embassy/Consulate,,U.S. Embassy,...,,,Explosive,,,,,,,1,,,,,0.0,,,,,,,0.0,,,,,,,,,,,,,PGIS,-9,-9,1,1,
4,197001000003,1970,1,0,,0,,101,Japan,4,East Asia,Fukouka,Fukouka,33.580412,130.396361,1.0,0,,,1,1,1,-9.0,,,0.0,1,0,7,Facility/Infrastructure Attack,,,,,7,Government (Diplomatic),46.0,Embassy/Consulate,,U.S. Consulate,...,,,Incendiary,,,,,,,1,,,,,0.0,,,,,,,0.0,,,,,,,,,,,,,PGIS,-9,-9,1,1,


In [11]:
#columns in the dataset:
list(gtd_global_primary)

['eventid',
 'iyear',
 'imonth',
 'iday',
 'approxdate',
 'extended',
 'resolution',
 'country',
 'country_txt',
 'region',
 'region_txt',
 'provstate',
 'city',
 'latitude',
 'longitude',
 'specificity',
 'vicinity',
 'location',
 'summary',
 'crit1',
 'crit2',
 'crit3',
 'doubtterr',
 'alternative',
 'alternative_txt',
 'multiple',
 'success',
 'suicide',
 'attacktype1',
 'attacktype1_txt',
 'attacktype2',
 'attacktype2_txt',
 'attacktype3',
 'attacktype3_txt',
 'targtype1',
 'targtype1_txt',
 'targsubtype1',
 'targsubtype1_txt',
 'corp1',
 'target1',
 'natlty1',
 'natlty1_txt',
 'targtype2',
 'targtype2_txt',
 'targsubtype2',
 'targsubtype2_txt',
 'corp2',
 'target2',
 'natlty2',
 'natlty2_txt',
 'targtype3',
 'targtype3_txt',
 'targsubtype3',
 'targsubtype3_txt',
 'corp3',
 'target3',
 'natlty3',
 'natlty3_txt',
 'gname',
 'gsubname',
 'gname2',
 'gsubname2',
 'gname3',
 'gsubname3',
 'motive',
 'guncertain1',
 'guncertain2',
 'guncertain3',
 'individual',
 'nperps',
 'nperpcap',
 

Many of these field names are hard to interpret, so here is the codebook if you're curious.

https://www.start.umd.edu/gtd/downloads/Codebook.pdf

In [12]:
#cleaning up dataset and selecting fields that I believe we need for the analysis
gtd_global=gtd_global_primary[['eventid','iyear','imonth','iday','country_txt','region_txt', 'city','provstate',
                       'latitude','longitude','summary','success','suicide','attacktype1_txt','targtype1_txt',
                       'gname','claimed','motive','weaptype1_txt','nkill','nwound','propvalue']]
                       

In [13]:
#cleaning up the NaN values.
gtd_global['country_txt'].fillna('Unknown',inplace=True)
gtd_global['region_txt'].fillna('Unknown',inplace=True)
gtd_global['city'].fillna('Unknown',inplace=True)
gtd_global['provstate'].fillna('Unknown',inplace=True)
gtd_global['latitude'].fillna('Unknown',inplace=True)
gtd_global['longitude'].fillna('Unknown',inplace=True)
gtd_global['summary'].fillna('Unknown',inplace=True)
gtd_global['success'].fillna('Unknown',inplace=True)
gtd_global['suicide'].fillna('Unknown',inplace=True)
gtd_global['attacktype1_txt'].fillna('Unknown',inplace=True)
gtd_global['targtype1_txt'].fillna('Unknown',inplace=True)
gtd_global['gname'].fillna('Unknown',inplace=True)
gtd_global['claimed'].fillna(0,inplace=True) #You can't really 'not know' if its claimed. :/. I'm taking the liberty to assume the NaN values here are unclaimed.
gtd_global['motive'].fillna('Unknown',inplace=True)
gtd_global['weaptype1_txt'].fillna('Unknown',inplace=True)
gtd_global['nkill'].fillna(0,inplace=True)
gtd_global['nwound'].fillna(0,inplace=True)
gtd_global['propvalue'].fillna(0,inplace=True)



A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
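The warning above is raised because `gtd_global` was created by selecting columns from `gtd_global_primary`, so pandas cannot tell whether the in-place fills are meant to reach the original frame. One common way to avoid it, sketched here on a toy frame with made-up values, is to take an explicit `.copy()` of the selection and fill every column with a single dict-based `fillna`. This is an alternative pattern, not what the notebook actually ran, and the names `primary` and `subset` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for gtd_global_primary; values are made up.
primary = pd.DataFrame({
    "eventid": [1, 2, 3],
    "city": ["Athens", None, "Unknown"],
    "gname": [None, "Group A", "Group B"],
    "nkill": [1.0, np.nan, 0.0],
    "nwound": [np.nan, 2.0, 0.0],
    "extra": ["x", "y", "z"],
})

# .copy() makes the selection an independent DataFrame, so later
# assignments cannot be mistaken for writes to a slice of `primary`.
subset = primary[["eventid", "city", "gname", "nkill", "nwound"]].copy()

# One fillna call with a column -> fill value mapping replaces
# the chain of per-column inplace=True calls.
subset = subset.fillna({"city": "Unknown", "gname": "Unknown", "nkill": 0, "nwound": 0})

print(subset)
```

Because `fillna` returns a new frame here, the pattern does not rely on `inplace=True` modifying a column selected from another DataFrame.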



In [14]:
#renaming the columns
gtd_global.rename(columns={'iyear':'year',
                           'imonth':'month',
                           'iday':'day',
                           'country_txt':'country',
                           'region_txt':'region',
                           'provstate':'state',
                           'attacktype1_txt':'attack_type',
                           'targtype1_txt':'target_type',
                           'gname':'organisation',
                           'weaptype1_txt':'weapon_type'},inplace=True)



A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy



In [15]:
# Adding a casualty column to the data frame: casuality = nkill + nwound
gtd_global['casuality']=gtd_global.nkill+gtd_global.nwound



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy



In [16]:
#creating subsets for the regional and national level analysis
gtd_SA=gtd_global[gtd_global.region=='South Asia']
gtd_india=gtd_global[gtd_global.country=='India']


# Global Visualisation of Attacks
Let's take a step back and look at the overall situation.

In [17]:
#creating a df with frequency of attacks grouped by countries
gtd_country=gtd_global.groupby('country')['eventid'].count().reset_index() # total number of attacks in each country over the whole period

gtd_country_timeline=gtd_global.groupby(['year','country'])['eventid'].count().reset_index() # this dataframe would help us with a timeline of every year since 1970
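
To make the shape of these two frames concrete, here is a small sketch that applies the same `groupby(...)['eventid'].count().reset_index()` pattern to a handful of made-up rows. The frame `toy` and its values are hypothetical; only the column names follow the renamed GTD fields.

```python
import pandas as pd

# Made-up incidents: one row per event, as in gtd_global.
toy = pd.DataFrame({
    "eventid": [1, 2, 3, 4, 5],
    "year": [1970, 1970, 1971, 1971, 1971],
    "country": ["India", "Greece", "India", "India", "Greece"],
})

# One row per country, with the number of events in the 'eventid' column.
per_country = toy.groupby("country")["eventid"].count().reset_index()

# One row per (year, country) pair that has at least one event.
per_year_country = toy.groupby(["year", "country"])["eventid"].count().reset_index()

print(per_country)
print(per_year_country)
```

Year and country pairs with no recorded event do not appear in the second frame, so an animation built from it only colours countries that had at least one attack in the displayed year.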


In [18]:
#visualisation
total_attacks=px.choropleth( gtd_country,locations='country',locationmode='country names',color='eventid',
                  hover_name='country',projection='orthographic',title='Total number of attacks(1970-2019)'
                  ,color_continuous_scale = px.colors.sequential.Plasma,
                  labels={'eventid':'attacks'})

total_attacks.show()


timeline=px.choropleth(gtd_country_timeline,locations='country',locationmode='country names',color='eventid',
                  hover_name='country',title='Time line of attacks in each year from 1970 to 2019',
                  color_continuous_scale = px.colors.sequential.Plasma,
                  animation_frame='year',
                  labels={'eventid':'attacks'})

timeline.show()

print('use the animation frame above to navigate through years')

use the animation frame above to navigate through years


**Remarks**
1. The figure shows that terrorist attacks are concentrated in a handful of countries, such as Afghanistan, Pakistan, Iraq, and India.

2. Things started getting grim for India in the late 1980s, when it began recording more attacks than most other countries.
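
The first remark can also be checked numerically instead of by eye. The sketch below uses made-up counts in the same shape as `gtd_country` (a `country` column and an `eventid` count column) to show how the share of attacks held by the top few countries could be computed. The names `counts`, `top`, and `share` are hypothetical, and the numbers are not GTD results.

```python
import pandas as pd

# Made-up per-country counts in the same shape as gtd_country.
counts = pd.DataFrame({
    "country": ["A", "B", "C", "D", "E"],
    "eventid": [500, 300, 100, 60, 40],
})

# The two countries with the most recorded attacks.
top = counts.nlargest(2, "eventid")

# Fraction of all attacks that those countries account for.
share = top["eventid"].sum() / counts["eventid"].sum()

print(top)
print(f"Top 2 countries account for {share:.0%} of attacks")
```

Running the same two lines on `gtd_country` with a larger `n` would put a number on how concentrated the attacks are.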