To clean the data, we follow the kernels posted by Ashwini Swain and Abigail Larion. Their kernels can be found at:

Ashwini Swain: https://www.kaggle.com/ash316/terrorism-around-the-world

Abigail Larion: https://www.kaggle.com/abigaillarion/terrorist-attacks-in-united-states

Potential issues to check for and handle: missing data, outliers (for example, how to treat extreme events such as 9/11), and duplicates.
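
As a rough illustration of these three checks, the sketch below runs them on a small made-up frame rather than on the real dataset. The helper name `quality_report`, the toy values, and the 99th-percentile cutoff are assumptions for illustration and do not come from the kernels referenced above; the column names `eventid` and `nkill` are the raw names used later in this notebook.

```python
import numpy as np
import pandas as pd

def quality_report(df, id_col, value_col, upper_q=0.99):
    """Illustrative helper: missing counts, duplicate IDs, and extreme rows."""
    missing = df.isnull().sum()                          # NaN count per column
    n_dupes = int(df.duplicated(subset=id_col).sum())    # repeats after the first occurrence
    cutoff = df[value_col].quantile(upper_q)             # quantile skips NaN
    extreme = df[df[value_col] > cutoff]                 # rows above the cutoff
    return missing, n_dupes, extreme

# Made-up data: one repeated id, one missing value, one very large value
toy = pd.DataFrame({
    'eventid': [1, 2, 2, 3, 4, 5],
    'nkill': [0.0, 1.0, 1.0, np.nan, 2.0, 1000.0],
})

missing, n_dupes, extreme = quality_report(toy, 'eventid', 'nkill')
print(missing)
print('duplicate ids:', n_dupes)
print(extreme)
```

Flagging a row as extreme does not decide what to do with it; whether events like 9/11 are kept, capped, or analysed separately remains a modelling choice.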

First, import the necessary modules and the data.

In [8]:
import os
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import seaborn as sns
import numpy as np
plt.style.use('fivethirtyeight')
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.tools as tls
#from mpl_toolkits.basemap import Basemap
import folium
import folium.plugins
from matplotlib import animation,rc
import io
import base64
from IPython.display import HTML, display
import warnings
warnings.filterwarnings('ignore')
from imageio import imread  # scipy.misc.imread was removed in SciPy 1.2; imageio provides a replacement
import codecs
from subprocess import check_output

In [9]:
os.chdir('Downloads')
data = pd.read_csv('globalterrorism.csv', encoding = "ISO-8859-1")
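
The `encoding` argument is presumably needed because the file contains bytes that are not valid UTF-8. A minimal sketch, using a throwaway file with a made-up row (not the real dataset), of how ISO-8859-1 text fails under pandas' default UTF-8 decoding and loads once the encoding is given. Building the path with `os.path.join` is shown as one alternative to `os.chdir`; the file name and variable names are hypothetical.

```python
import os
import tempfile
import pandas as pd

tmp_dir = tempfile.mkdtemp()
csv_path = os.path.join(tmp_dir, 'toy_latin1.csv')  # hypothetical file name

city = 'Medell\u00edn'  # contains an accented i, stored as one byte (0xED) in ISO-8859-1

# Write a tiny CSV encoded as ISO-8859-1
with open(csv_path, 'w', encoding='ISO-8859-1') as f:
    f.write('eventid,city\n1,' + city + '\n')

try:
    pd.read_csv(csv_path)          # default decoding is UTF-8
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False                # the lone 0xED byte is not valid UTF-8

toy = pd.read_csv(csv_path, encoding='ISO-8859-1')
print(utf8_ok, toy.loc[0, 'city'])
```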

In [12]:
#Rename several of our variables
data.rename(columns={
    'iyear':'Year','imonth':'Month','iday':'Day','provstate':'State',
    'country_txt':'Country','region_txt':'Region',
    'attacktype1_txt':'AttackType1','attacktype2_txt':'AttackType2','attacktype3_txt':'AttackType3',
    'target1':'Target1','nkill':'Killed','nwound':'Wounded','summary':'Summary','gname':'Group',
    'targtype1_txt':'Target1_type','targsubtype1_txt':'Target1_subtype',
    'weaptype1_txt':'Weapon1_type','motive':'Motive'},inplace=True)
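#Keep only variables that we want to use
#(note: 'location' and 'corp2' are each listed twice below, so each appears as a duplicate column in the result)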
data=data[['eventid','Year','Month','Day','Country','Region','city','location','latitude','longitude','specificity','vicinity','location','crit1','crit2','crit3','doubtterr','multiple','success','suicide','AttackType1','Killed','Wounded','Target1','extended','Summary','Group','Target1_type','Target1_subtype','corp1','natlty1_txt','Weapon1_type','Motive','targtype2_txt','corp2','targsubtype2_txt','corp2','target2','natlty2_txt','targtype3_txt','targsubtype3_txt','corp3','target3','natlty3_txt','gsubname','gname2','gsubname2','gname3','gsubname3','guncertain1','guncertain2','guncertain3','individual','nperps','nperpcap','claimed','claimmode_txt','claim2','claimmode2_txt','claim3','claimmode3_txt','compclaim','weapsubtype1_txt','weaptype2_txt','weapsubtype2_txt','weaptype3_txt','weapsubtype3_txt','weaptype4_txt','weapsubtype4_txt','weapdetail','nkillus','nkillter','nwoundus','nwoundte','property','propextent_txt','propvalue','propcomment','ishostkid','ransom','ransomamt','ransomamtus']]
"""
Notes:
resolution, alternative, and alternative_txt have very little data
We could combine location and city into one variable ("Ithaca, Cornell University", for example); location has little info
We could combine AttackType1, 2, and 3 (and the other numbered variables); the 2 and 3 versions have little info for nearly every such variable
Drop the ransompaid variables, since those are generally known only after the incident has been resolved
Motive, nhost, nhours, ndays, divert, kidhijcountry, and hostkidoutcome have little info
Some variables have more data in later years
Negative values in some variables still need to be interpreted and handled
Check the discussions to learn how to interpret the variables
"""
#Create new variable for the sum of killed and wounded
data['Hurt_Dead']=data['Killed']+data['Wounded']
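
One thing to keep in mind about `Hurt_Dead`: with `+`, a missing value in either `Killed` or `Wounded` makes the sum missing too, and the `isnull` counts in the next cell show that both columns have gaps. The sketch below, on made-up numbers, compares that behaviour with `sum(axis=1, min_count=1)`, which counts a single missing value as zero but leaves the result missing when both are missing. Which convention is appropriate is a judgment call, and the column name `Hurt_Dead_alt` is hypothetical.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Killed':  [1.0, np.nan, 2.0, np.nan],
    'Wounded': [3.0, 4.0, np.nan, np.nan],
})

# Plain addition: NaN in either column propagates to the result
toy['Hurt_Dead'] = toy['Killed'] + toy['Wounded']

# Row-wise sum that skips NaN, but stays NaN when no value is present
toy['Hurt_Dead_alt'] = toy[['Killed', 'Wounded']].sum(axis=1, min_count=1)

print(toy)
```

With plain addition only the first row gets a value; the alternative fills the second and third rows as well and leaves only the last row missing.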

In [13]:
data.isnull().sum()

eventid                  0
Year                     0
Month                    0
Day                      0
Country                  0
Region                   0
city                   434
location            126196
latitude              4556
longitude             4557
specificity              6
vicinity                 0
location            126196
crit1                    0
crit2                    0
crit3                    0
doubtterr                1
multiple                 1
success                  0
suicide                  0
AttackType1              0
Killed               10313
Wounded              16311
Target1                636
extended                 0
Summary              66129
Group                    0
Target1_type             0
Target1_subtype      10373
corp1                42550
                     ...  
nperps               71115
nperpcap             69489
claimed              66120
claimmode_txt       162608
claim2              179801
claimmode2_txt      181075
...
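
`location` appears twice in the output above because the column list passed to `data[[...]]` names it twice. A minimal sketch, on a made-up frame, of how repeated column labels can be detected and dropped, and of how raw missing counts can be turned into shares that are easier to compare across columns. The toy data and variable names are assumptions for illustration.

```python
import pandas as pd

toy = pd.DataFrame({'city': ['a', None, 'c', 'd'],
                    'location': [None, None, None, 'x']})

# Selecting a label twice yields two identical columns
toy = toy[['city', 'location', 'location']]

# Labels that repeat an earlier column
dup_labels = toy.columns[toy.columns.duplicated()].tolist()

# Keep only the first occurrence of each label
deduped = toy.loc[:, ~toy.columns.duplicated()]

# Share of missing values per column, largest first
missing_share = deduped.isnull().mean().sort_values(ascending=False)

print(dup_labels)
print(missing_share)
```

Sorting by missing share makes it easier to spot sparse columns such as `location`, which the notes in the previous cell already single out as carrying little information.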

In [15]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 181691 entries, 0 to 181690
Data columns (total 83 columns):
eventid             181691 non-null int64
Year                181691 non-null int64
Month               181691 non-null int64
Day                 181691 non-null int64
Country             181691 non-null object
Region              181691 non-null object
city                181257 non-null object
location            55495 non-null object
latitude            177135 non-null float64
longitude           177134 non-null float64
specificity         181685 non-null float64
vicinity            181691 non-null int64
location            55495 non-null object
crit1               181691 non-null int64
crit2               181691 non-null int64
crit3               181691 non-null int64
doubtterr           181690 non-null float64
multiple            181690 non-null float64
success             181691 non-null int64
suicide             181691 non-null int64
AttackType1         181691 non-null 