To clean the data, we follow the kernels posted by Ashwini Swain and Abigail Larion. Their kernels can be found at:

Ashwini Swain: https://www.kaggle.com/ash316/terrorism-around-the-world

Abigail Larion: https://www.kaggle.com/abigaillarion/terrorist-attacks-in-united-states

Potential issues to check for and handle: missing data, outliers (for example, how to treat extreme events like 9/11), and duplicates.

First, import the necessary modules and the data.

In [2]:
import os
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import seaborn as sns
import numpy as np
plt.style.use('fivethirtyeight')
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.tools as tls
#from mpl_toolkits.basemap import Basemap
import folium
import folium.plugins
from matplotlib import animation,rc
import io
import base64
from IPython.display import HTML, display
import warnings
warnings.filterwarnings('ignore')
#from scipy.misc import imread  # removed in SciPy 1.2+; not used in this section
import codecs
from subprocess import check_output

In [117]:
#os.chdir('Downloads')
data = pd.read_csv('globalterrorism.csv', encoding = "ISO-8859-1")

In [118]:
#Rename several of our variables
data.rename(columns={'iyear':'Year','imonth':'Month','iday':'Day','provstate':'State','country_txt':'Country','region_txt':'Region','attacktype1_txt':'AttackType1','attacktype2_txt':'AttackType2','attacktype3_txt':'AttackType3','target1':'Target1','nkill':'Killed','nwound':'Wounded','summary':'Summary','gname':'Group','targtype1_txt':'Target1_type','targsubtype1_txt':'Target1_subtype','weaptype1_txt':'Weapon1_type','motive':'Motive'},inplace=True)
#Keep only variables that we want to use
data=data[['eventid','Year','Month','Day','Country','Region','city','State','location','latitude','longitude','specificity','vicinity','crit1','crit2','crit3','multiple','success','suicide','AttackType1','Killed','Wounded','Target1','extended','Summary','Group','Target1_type','Target1_subtype','corp1','natlty1_txt','Weapon1_type','Motive','targtype2_txt','corp2','targsubtype2_txt','target2','natlty2_txt','targtype3_txt','targsubtype3_txt','corp3','target3','natlty3_txt','gsubname','gname2','gsubname2','gname3','gsubname3','guncertain1','guncertain2','guncertain3','individual','nperps','nperpcap','claimed','claimmode_txt','claim2','claimmode2_txt','claim3','claimmode3_txt','compclaim','weapsubtype1_txt','weaptype2_txt','weapsubtype2_txt','weaptype3_txt','weapsubtype3_txt','weaptype4_txt','weapsubtype4_txt','weapdetail','nkillus','nkillter','nwoundus','nwoundte','property','propextent_txt','propvalue','propcomment','ishostkid','ransom','ransomamt','ransomamtus']]
"""
Notes:
Resolution, alternative, alternative_txt have very little data
Maybe we could combine location and city into one variable ("Ithaca, Cornell University," for example) -- location has little info
Maybe combine AttackType, other variables 1, 2, and 3-- 2 and 3 have little info for basically every such variable
Drop ransompaid variables, since those are generally determined only after things are already settled down
doubtterr won't be particularly useful 
Motive, nhost, nhours, ndays, divert, kidhijcountry, hostkidoutcome variables have little info
Some variables have more data during later years
Interpret and deal with negative values in variables
Look at discussions to know how to interpret variables
Notice that the "kid" variables appear to have to do with kidnapping
"""
#Create new variable for the sum of killed and wounded
data['Hurt_Dead']=data['Killed']+data['Wounded']

In [119]:
data.isnull().sum()

eventid                  0
Year                     0
Month                    0
Day                      0
Country                  0
Region                   0
city                   434
State                  421
location            126196
latitude              4556
longitude             4557
specificity              6
vicinity                 0
crit1                    0
crit2                    0
crit3                    0
multiple                 1
success                  0
suicide                  0
AttackType1              0
Killed               10313
Wounded              16311
Target1                636
extended                 0
Summary              66129
Group                    0
Target1_type             0
Target1_subtype      10373
corp1                42550
natlty1_txt           1559
                     ...  
nperps               71115
nperpcap             69489
claimed              66120
claimmode_txt       162608
claim2              179801
claimmode2_txt      181075

In [120]:
#We drop the 'ransom' and 'claim' variables and several sub variables due to a lack of extra information
#Drop the property variables (except 'property') due to a lack of information held by them
#Drop 'summary', 'weapdetail', 'corp' variables, 'target' variables since they won't be too helpful


data = data.drop(['ransom','ransomamt','ransomamtus','claimed','claimmode_txt','claim2','claimmode2_txt','claim3','claimmode3_txt','compclaim','gsubname','gsubname2','gsubname3','Target1_subtype','targsubtype2_txt','targsubtype3_txt','propextent_txt','propvalue','propcomment'], axis=1)
data = data.drop(['Summary','weapdetail','corp1', 'corp2', 'corp3', 'Target1', 'target2', 'target3'], axis=1)

In [121]:
data.isnull().sum()

eventid                  0
Year                     0
Month                    0
Day                      0
Country                  0
Region                   0
city                   434
State                  421
location            126196
latitude              4556
longitude             4557
specificity              6
vicinity                 0
crit1                    0
crit2                    0
crit3                    0
multiple                 1
success                  0
suicide                  0
AttackType1              0
Killed               10313
Wounded              16311
extended                 0
Group                    0
Target1_type             0
natlty1_txt           1559
Weapon1_type             0
Motive              131130
targtype2_txt       170547
natlty2_txt         170863
targtype3_txt       180515
natlty3_txt         180544
gname2              179678
gname3              181367
guncertain1            380
guncertain2         179736
guncertain3         181371

In [122]:
#Let's replace weapon types with their subtypes where applicable
weapvars = ['Weapon1_type','weaptype2_txt','weaptype3_txt','weaptype4_txt']
weapsubvars = ['weapsubtype1_txt','weapsubtype2_txt','weapsubtype3_txt','weapsubtype4_txt',]
for i in range(0,4):
    var = "det_weapon_type_" + str(i+1)
    data[var] = data[weapsubvars[i]]
    data.loc[data[var].isnull(),var] = data[weapvars[i]]
    data = data.drop([weapsubvars[i], weapvars[i]], axis=1)
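
The subtype-with-fallback step in the loop above can also be written with `fillna`. The following is a minimal sketch on an invented toy frame (the values are made up; only the column names follow the notebook), not a replacement for the cell itself:

```python
import numpy as np
import pandas as pd

# Toy frame reusing the notebook's column names; the values are invented.
toy = pd.DataFrame({
    "Weapon1_type": ["Firearms", "Explosives", "Melee"],
    "weapsubtype1_txt": ["Handgun", np.nan, np.nan],
})

# Prefer the detailed subtype; fall back to the broader type where it is missing.
toy["det_weapon_type_1"] = toy["weapsubtype1_txt"].fillna(toy["Weapon1_type"])
print(toy["det_weapon_type_1"].tolist())
```

`fillna` aligns the two Series on the index, so each missing subtype takes the weapon type from the same row.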


In [123]:
data.isnull().sum()

eventid                   0
Year                      0
Month                     0
Day                       0
Country                   0
Region                    0
city                    434
State                   421
location             126196
latitude               4556
longitude              4557
specificity               6
vicinity                  0
crit1                     0
crit2                     0
crit3                     0
multiple                  1
success                   0
suicide                   0
AttackType1               0
Killed                10313
Wounded               16311
extended                  0
Group                     0
Target1_type              0
natlty1_txt            1559
Motive               131130
targtype2_txt        170547
natlty2_txt          170863
targtype3_txt        180515
natlty3_txt          180544
gname2               179678
gname3               181367
guncertain1             380
guncertain2          179736
guncertain3         

In [124]:
#Clean variables further
#Drop variables with tons of missing values, as well as "guncertain1"
data = data.drop(['guncertain1', 'guncertain2', 'guncertain3'], axis=1)

In [125]:
#Deal with negative integers (vicinity,nperps,nperpcap,property,ishostkid)
"""
num = data._get_numeric_data()

num[num < 0] = 'NaN'
"""

#Use mask() instead of chained indexing, which can silently fail to modify data
li = ['vicinity','nperps','nperpcap','property','ishostkid']
for i in li:
    data[i] = data[i].mask(data[i] < 0)
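
To illustrate the effect of blanking out negative values, here is a small sketch on invented data. It assumes, as the cell above does, that negative entries such as -9 or -99 are placeholder codes for "unknown" rather than real measurements:

```python
import numpy as np
import pandas as pd

# Invented values; negative numbers stand in for "unknown" codes (an assumption).
toy = pd.DataFrame({"nperps": [3, -99, 1], "property": [1, -9, 0]})

for col in ["nperps", "property"]:
    # mask() replaces entries where the condition is True with NaN
    # and upcasts integer columns to float as needed.
    toy[col] = toy[col].mask(toy[col] < 0)

print(toy)
```

Assigning the masked Series back to the column avoids the chained-indexing pattern `data[i][...] = ...`, which pandas may apply to a temporary copy.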


In [134]:
#Deal with unknown values
#Replace several variables' missing values with their median
#For the numbers of people killed and wounded, we assume that an unreported number is approximately the median
li = ['nkillus','nkillter','nwoundus','nwoundte','Killed','Wounded','Hurt_Dead','ishostkid','multiple','specificity','nperpcap','property']
for i in li:
    data[i] = data[i].fillna(np.nanmedian(data[i]))
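
Median imputation hides which values were originally missing. A minimal sketch, assuming one wants to keep that information, records an indicator column before filling. The name `Killed_missing` is hypothetical and the values are invented:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"Killed": [0.0, 2.0, np.nan, 1.0, 100.0]})

# Remember which rows were unreported before they are overwritten.
toy["Killed_missing"] = toy["Killed"].isna()

# Series.median() skips NaN, like np.nanmedian.
toy["Killed"] = toy["Killed"].fillna(toy["Killed"].median())
print(toy)
```

The median is used rather than the mean because a single extreme event (100 in this toy example) would otherwise pull the fill value upward.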

In [135]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 181691 entries, 0 to 181690
Data columns (total 47 columns):
eventid              181691 non-null int64
Year                 181691 non-null int64
Month                181691 non-null int64
Day                  181691 non-null int64
Country              181691 non-null object
Region               181691 non-null object
city                 171482 non-null object
State                176980 non-null object
location             55479 non-null object
latitude             177135 non-null float64
longitude            177134 non-null float64
specificity          181691 non-null float64
vicinity             181656 non-null float64
crit1                181691 non-null int64
crit2                181691 non-null int64
crit3                181691 non-null int64
multiple             181691 non-null float64
success              181691 non-null int64
suicide              181691 non-null int64
AttackType1          174415 non-null object
Killed        

In [136]:
data.describe()

Unnamed: 0,eventid,Year,Month,Day,latitude,longitude,specificity,vicinity,crit1,crit2,...,individual,nperps,nperpcap,nkillus,nkillter,nwoundus,nwoundte,property,ishostkid,Hurt_Dead
count,181691.0,181691.0,181691.0,181691.0,177135.0,177134.0,181691.0,181656.0,181691.0,181691.0,...,181691.0,28356.0,181691.0,181691.0,181691.0,181691.0,181691.0,181691.0,181691.0,181691.0
mean,200270500000.0,2002.638997,6.467277,15.505644,23.498343,-458.6957,1.451437,0.070044,0.98853,0.993093,...,0.00295,32.17044,0.077505,0.029671,0.320825,0.025076,0.066382,0.632497,0.074698,4.897139
std,1325957000.0,13.25943,3.388303,8.814045,18.569242,204779.0,0.995416,0.255223,0.106483,0.082823,...,0.054234,412.375875,1.621754,4.564308,3.346474,2.453378,1.172976,0.482126,0.262905,40.087301
min,197000000000.0,1970.0,0.0,0.0,-53.154613,-86185900.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,199102100000.0,1991.0,4.0,8.0,11.510046,4.54564,1.0,0.0,1.0,1.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,200902200000.0,2009.0,6.0,15.0,31.467463,43.24651,1.0,0.0,1.0,1.0,...,0.0,2.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0
75%,201408100000.0,2014.0,9.0,23.0,34.685087,68.71033,1.0,0.0,1.0,1.0,...,0.0,6.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,4.0
max,201712300000.0,2017.0,12.0,31.0,74.633553,179.3667,5.0,1.0,1.0,1.0,...,1.0,25000.0,406.0,1360.0,500.0,751.0,200.0,1.0,1.0,9574.0


In [137]:
#Deal with missing values in # of perpetrators
#Only 28,356 of 181,691 rows (about 16%) have a known # of perpetrators. Let's drop the variable
data = data.drop(['nperps'],axis=1)

In [138]:
data = data.replace("Unknown", np.nan)
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 181691 entries, 0 to 181690
Data columns (total 46 columns):
eventid              181691 non-null int64
Year                 181691 non-null int64
Month                181691 non-null int64
Day                  181691 non-null int64
Country              181691 non-null object
Region               181691 non-null object
city                 171482 non-null object
State                176980 non-null object
location             55479 non-null object
latitude             177135 non-null float64
longitude            177134 non-null float64
specificity          181691 non-null float64
vicinity             181656 non-null float64
crit1                181691 non-null int64
crit2                181691 non-null int64
crit3                181691 non-null int64
multiple             181691 non-null float64
success              181691 non-null int64
suicide              181691 non-null int64
AttackType1          174415 non-null object
Killed        

In [None]:
#Deal with unknown values in strings


In [None]:
#Deal with unknown latitudes/longitudes by using other location information (e.g. city, State)

#Give remaining lats/longs their countries' mean lat/long
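
The country-mean fallback described in the comment above could look roughly like the sketch below. It runs on an invented toy frame that reuses the notebook's column names. Note that a country mean is sensitive to bad coordinates (the `describe()` output earlier shows a longitude minimum of -86185900), so implausible values would need handling first:

```python
import numpy as np
import pandas as pd

# Invented data; "A" and "B" stand in for country names.
toy = pd.DataFrame({
    "Country": ["A", "A", "A", "B", "B"],
    "latitude": [10.0, 20.0, np.nan, -5.0, np.nan],
    "longitude": [30.0, 50.0, np.nan, 100.0, np.nan],
})

for col in ["latitude", "longitude"]:
    # transform("mean") returns one value per row: the mean of that row's country.
    country_mean = toy.groupby("Country")[col].transform("mean")
    toy[col] = toy[col].fillna(country_mean)

print(toy)
```

Rows from a country with no known coordinates at all would remain missing under this approach.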

In [None]:
#Now that we've dealt with lat./long., drop the following variables
data = data.drop(['location','city','State'],axis=1)

In [None]:
#Check for outliers and deal with them
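
One possible starting point for the outlier check, sketched on invented data: treat coordinates outside the valid range as missing, and flag (rather than drop) extreme casualty counts. The 99th-percentile cutoff and the column name `Hurt_Dead_extreme` are assumptions chosen for illustration:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "longitude": [43.2, -86185900.0, 179.4, np.nan],
    "Hurt_Dead": [1.0, 4.0, 9574.0, 0.0],
})

# Longitudes must lie in [-180, 180]; anything else is a data-entry error.
bad_lon = toy["longitude"].notna() & ~toy["longitude"].between(-180, 180)
toy.loc[bad_lon, "longitude"] = np.nan

# Flag extreme events so they can be analysed separately instead of deleted.
cutoff = toy["Hurt_Dead"].quantile(0.99)
toy["Hurt_Dead_extreme"] = toy["Hurt_Dead"] > cutoff
print(toy)
```

Flagging keeps events such as 9/11 in the data while still letting later steps exclude or down-weight them.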

In [66]:
#Now let's combine the variables that come in numbered variants (1, 2, and 3)
"""
data['target_types'] = (data['Target1_type'].fillna('') + " " + data['targtype2_txt'].fillna('') + " " + data['targtype3_txt'].fillna('')).str.strip()
data['natlty_types'] = (data['natlty1_txt'].fillna('') + " " + data['natlty2_txt'].fillna('') + " " + data['natlty3_txt'].fillna('')).str.strip()
#attack_types is omitted: AttackType2 and AttackType3 were not kept when selecting columns above
"""
#Let's make dummy variables for our categorical variables
data = pd.get_dummies(data)

In [None]:
#Drop duplicate dummy columns


In [64]:
data['det_weapon_type_2']

0                                       NaN
1                                       NaN
2                                       NaN
3                                       NaN
4                                       NaN
5                                       NaN
6                                       NaN
7                                       NaN
8                                       NaN
9                                       NaN
10                                      NaN
11                                      NaN
12                                  Handgun
13                                      NaN
14                                      NaN
15                                      NaN
16                                      NaN
17                                      NaN
18                                      NaN
19                                      NaN
20                                      NaN
21                                      NaN
22                              

In [69]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 181691 entries, 0 to 181690
Columns: 19655 entries, eventid to det_weapon_type_4_Unknown Weapon Type
dtypes: float64(17), int64(13), uint8(19625)
memory usage: 3.4 GB
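
The 19,655 columns and 3.4 GB reported above result from one-hot encoding every text column, including those with many distinct values. A hedged sketch of one alternative, on invented data, encodes only low-cardinality columns. The cutoff `max_levels` is a hypothetical parameter, not something the notebook defines:

```python
import pandas as pd

toy = pd.DataFrame({
    "Region": ["X", "Y", "X", "Y"],
    "Group": ["g1", "g2", "g3", "g4"],
    "Killed": [0, 1, 2, 3],
})

max_levels = 3  # hypothetical cutoff on the number of distinct values

# Non-numeric columns are the candidates for dummy encoding.
text_cols = [c for c in toy.columns if not pd.api.types.is_numeric_dtype(toy[c])]
low_card = [c for c in text_cols if toy[c].nunique() <= max_levels]

# Only the low-cardinality columns are expanded; the rest stay as they are.
encoded = pd.get_dummies(toy, columns=low_card)
print(encoded.columns.tolist())
```

High-cardinality columns left unencoded (here `Group`) would still need separate treatment, such as grouping rare values, before modelling.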


In [None]:
#keep=False would discard every copy of a duplicated row; keep='first' retains one
data.drop_duplicates(keep='first',inplace=True)
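
The `keep` argument decides which copies `drop_duplicates` removes. The sketch below contrasts `keep='first'` and `keep=False` on an invented toy frame. Comparing on every column except `eventid` is an assumption: if `eventid` is unique per event, no two full rows can be identical.

```python
import pandas as pd

toy = pd.DataFrame({
    "eventid": [1, 2, 3],
    "Year": [2001, 2001, 2005],
    "Country": ["A", "A", "B"],
})

# Compare rows on everything except the identifier.
compare_cols = [c for c in toy.columns if c != "eventid"]

# keep="first" retains one representative of each duplicated group.
kept_first = toy.drop_duplicates(subset=compare_cols, keep="first")

# keep=False removes every row that has a duplicate, including the first.
kept_none = toy.drop_duplicates(subset=compare_cols, keep=False)

print(kept_first["eventid"].tolist(), kept_none["eventid"].tolist())
```

In this dataset, rows that match on all other fields may be genuinely distinct events (for example, coordinated attacks flagged by `multiple`), so any removal should be reviewed first.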