## Gael Blanchard
###  Problem: Given the Global Terrorism Data Set, construct a predictive model that can determine which terrorist group is responsible for an event

## Global Terrorism Data
[Source: National Consortium for the Study of Terrorism and Responses to Terrorism (START). (2016). Global Terrorism Database [Data file]. Retrieved from https://www.start.umd.edu/gtd](http://www.start.umd.edu/gtd)

In [None]:
#Required Libraries
import numpy as np 
import pandas as pd 
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, hamming_loss
import matplotlib.pyplot as plt

In [None]:
#Disables SettingWithCopy Warning
pd.options.mode.chained_assignment = None
#Set the random seed for reproducible results
np.random.seed(0)

In [None]:
#Method to determine the performance of our classification models
#Does a simple comparison of predictions and actual values
def print_results(predictions,data,desired_variable):
	print("Results:")
	print("Number of mislabeled points out of a total {} points : {}, performance {:05.2f}%"
	.format(
		data.shape[0],
		(data[desired_variable] != predictions).sum(),
		100*(1-(data[desired_variable] != predictions).sum()/data.shape[0])
		)
	)
    
#Method to print the most frequent values of each factor (tail frequencies are commented out)
def frequency_data(data,factors):
    for factor in factors:
        print(factor)
        print(data[factor].value_counts().head())
        #print(data[factor].value_counts().tail())

# 1.   Data Collection
* Read in our data from the Global Terrorism Database into a Pandas Data Frame
* Isolated the variables we needed (10/135 available)
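
As a minimal sketch of the column isolation described above, the snippet below reads a tiny in-memory CSV with `usecols`. The rows and the `eventid` and `summary` columns are made up for illustration; only the `usecols` mechanism mirrors what this notebook does with the real file.

```python
import io
import pandas as pd

# Toy CSV standing in for the GTD file; the rows are invented for illustration.
csv_text = io.StringIO(
    "eventid,iyear,country,city,gname,summary\n"
    "1,1970,58,City X,Group A,text a\n"
    "2,1970,130,City Y,Unknown,text b\n"
)

wanted = ["gname", "iyear", "country", "city"]
# usecols tells the parser to keep only the listed columns,
# so the unused columns never occupy memory in the DataFrame.
toy = pd.read_csv(csv_text, usecols=wanted)
print(sorted(toy.columns))
```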

In [None]:
#Isolated features for use in our classification models
#at Data Collection stage to lower memory usage
factors_list = ["gname","iyear","country","attacktype1","targtype1","weaptype1","multiple","success","suicide","city"]
global_terrorism_data = pd.read_csv("../input/globalterrorismdb_0617dist.csv", encoding = "ISO-8859-1", usecols=factors_list, low_memory = False)

In [None]:
#Describe the data we have just read in
global_terrorism_data.describe(include="all")

[The GTD Codebook](http://www.start.umd.edu/gtd/downloads/Codebook.pdf) provides a detailed explanation of every feature in the GTD. Based on the codebook, we can isolate certain features. For example, when choosing a variable for time, year is the most complete, whereas month and day can be 0 when unknown. The codebook also let us skip feature engineering for variables such as country_txt, attacktype1_txt, targtype1_txt, and weaptype1_txt, because each has an associated numerical code that serves as a factor in this program.

Logic behind selecting these features:
* Minimal imputation (all of our features, with the exception of city, are complete for all rows)
* Complete features allow the data to be split into logical subsets, which is optimal for prediction
* Prevent overfitting


# 2. Data Preparation & Exploration
* Searched for NA data within our data
* Determined the frequency of every variable besides year
* Factorized city and gname variables for usage in our predictive models
* Separated data into known and unknown perpetrators
* Defined our train and test sets
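
The factorizing and known/unknown separation steps above can be sketched on a toy frame. The group and city names here are placeholders, and looking up the code for "Unknown" by name is an illustrative choice, not necessarily how this notebook selects those rows.

```python
import pandas as pd

# Toy frame with placeholder names; None stands in for a missing city.
toy = pd.DataFrame({
    "gname": ["Group A", "Unknown", "Group B", "Unknown", "Group A"],
    "city": ["X", None, "Y", "X", "Z"],
})

# factorize returns integer codes plus the unique labels in order of
# first appearance; missing values receive the code -1.
group_codes, group_names = pd.factorize(toy["gname"])
city_codes, _ = pd.factorize(toy["city"])
toy["gname"] = group_codes
toy["city"] = city_codes

# Look up the integer code for "Unknown" by name.
unknown_code = group_names.get_loc("Unknown")
unknown_rows = toy[toy["gname"] == unknown_code]
known_rows = toy[toy["gname"] != unknown_code]

# Map integer codes (e.g. model predictions) back to group names.
decoded = group_names[[0, 2]]
print(list(decoded))
```

Keeping `group_names` around is what makes it possible to translate a predicted code back into a perpetrator name later.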

In [None]:
#Look for missing data
missing_data = global_terrorism_data.isna().any()
print(missing_data)
#Recognize that our city variable has null values
#All of our data is categorical in nature
freq_factors = ["gname","country","attacktype1","targtype1","weaptype1","multiple","success","suicide","city"]
#Frequency
frequency_data(global_terrorism_data,freq_factors)

Note: There are several classes with single instances
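
A quick way to quantify that note is to count the classes that occur exactly once. The sketch below uses a made-up label series; on the real data the same two lines could be applied to a column such as `gname`.

```python
import pandas as pd

# Made-up labels: A occurs 3 times, B twice, C and D once each.
labels = pd.Series(["A", "B", "A", "C", "A", "D", "B"])

counts = labels.value_counts()
# Keep only the classes with a single occurrence.
singletons = counts[counts == 1]
print(len(singletons), "of", len(counts), "classes occur once")
```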

In [None]:
#Factorizing encodes missing city values as -1 and separates city into integer classes
#Key if we want to determine the associated perpetrator after running a prediction
key_names = global_terrorism_data["gname"].unique()
for_factor_key_names = key_names
key_id = pd.factorize(for_factor_key_names)[0]
#We also factorize our variable to predict (Perpetrator Group Name)
global_terrorism_data["city"] = pd.factorize(global_terrorism_data["city"])[0]
global_terrorism_data["gname"] = pd.factorize(global_terrorism_data["gname"])[0]
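#Training and Test Data Sets
#test_data is an 80% random sample of the same rows used for training
training_data = global_terrorism_data
test_data = global_terrorism_data.sample(frac=0.8, replace=False)
#gname code 2 is expected to be the "Unknown" perpetrator label (codes follow order of first appearance)
test_unknown = global_terrorism_data[global_terrorism_data["gname"] == 2]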
#list of factors to use for prediction
predictor_factors = ["iyear","country","attacktype1","targtype1","weaptype1","multiple","success","suicide","city"]

In [None]:
#Looking at our data post preparation
#Notice that our data is now fully represented as numerical values
global_terrorism_data.describe()

## Data Visualizations

In [None]:
viz_factors_list = ["gname","iyear","country_txt","attacktype1_txt","targtype1_txt","weaptype1_txt","multiple","success","suicide","city"]
gt_data_for_viz = pd.read_csv("../input/globalterrorismdb_0617dist.csv", encoding = "ISO-8859-1", usecols=viz_factors_list, low_memory = False)

In [None]:
#AttackType
gt_data_for_viz["attacktype1_txt"].value_counts(sort=False).plot.pie()
plt.show()

In [None]:
#Target Type
gt_data_for_viz["targtype1_txt"].value_counts(sort=False).plot(kind="bar")
plt.show()

In [None]:
#Weapon Type
gt_data_for_viz["weaptype1_txt"].value_counts(sort=False).plot.pie(figsize=(8, 8),fontsize=15)
plt.show()

In [None]:
#Recorded Terrorist Incidents per year
gt_data_for_viz["iyear"].value_counts().sort_index().plot(figsize=(5, 5))
plt.show()

In [None]:
#Top 10 Recorded Terrorist Incidents per country
gt_data_for_viz["country_txt"].value_counts().head(10).plot(kind="bar",figsize=(5, 5))
plt.show()

In [None]:
#Top 10 Recorded Terrorist Incidents per city
gt_data_for_viz["city"].value_counts().head(10).plot(kind="bar",figsize=(5, 5))
plt.show()

In [None]:
#Suicide attacks vs. non-suicide attacks
gt_data_for_viz["suicide"].value_counts().plot(kind="bar",figsize=(5, 5))
plt.show()

In [None]:
#Successful
gt_data_for_viz["success"].value_counts().plot(kind="bar",figsize=(5, 5))
plt.show()

In [None]:
#Top 10 Perpetrator Groups by incidents
gt_data_for_viz["gname"].value_counts().head(10).plot(kind="bar",figsize=(5, 5))
plt.show()

# 3. Training and Testing Algorithms 
*  Initialized classification models (Decision Tree and KNN; Random Forest was considered but not used)
*  Trained classification models
*  Tested algorithms on random samples large and small

In [None]:
#Initialize Classifier Models
#Decision Tree, KNN (default metric), KNN (Hamming metric)
dtree_model = DecisionTreeClassifier(max_depth = 100)
knn_model = KNeighborsClassifier(n_neighbors=1)
knn_hamming = KNeighborsClassifier(n_neighbors=1,p=2,metric="hamming")

## Why Decision Tree?
DT is a classification model that performs implicit feature selection and is not affected by non-linear relationships within the data. A DT creates a set of rules based on the features of the data, which are then used to determine the class of a test data point. I chose a DT over a random forest because a random forest is computationally expensive on our training data set.
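
To illustrate what implicit feature selection means here, the following sketch fits a tree on synthetic data where only the first column determines the label. The data and variable names are invented for the example; `feature_importances_` is the scikit-learn attribute that reports how much each feature contributed to the splits.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: the label equals the first column, the second is pure noise.
rng = np.random.RandomState(0)
informative = rng.randint(0, 4, size=200)
noise = rng.randint(0, 4, size=200)
X = np.column_stack([informative, noise])
y = informative

tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# The tree only needs to split on the informative column, so the noise
# column receives (close to) zero importance.
print(tree.feature_importances_)
```

On the GTD features, the same attribute could be inspected on `dtree_model` after fitting to see which of the nine predictors the tree relies on.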

In [None]:
#Training Algorithms
dtree_model.fit(training_data[predictor_factors],training_data["gname"])

## Why K-Nearest Neighbors?
KNN is a simple classification model that depends heavily on the similarity between a test data point and the stored data points (the neighbors) from the training data used in the model. There are multiple single-instance classes within our GTD dataset (e.g., one-time terrorist groups, cities that have been attacked once). By setting the algorithm to classify each test data point based on its single most similar neighbor, we should be able to classify those instances with high accuracy.
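
The single-neighbor behavior can be seen on a small made-up example. The rows below are invented integer codes (think country, attack type, target type) and class 3 has only one stored instance.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Invented integer-coded rows; the last row is the only member of class 3.
X_train = np.array([
    [1, 3, 2],
    [1, 3, 4],
    [5, 2, 2],
    [9, 7, 1],
])
y_train = np.array([0, 0, 1, 3])

knn = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)

# A query identical to a stored row gets that row's label, so a
# single-instance class is recovered when the event was seen in training.
seen = knn.predict([[9, 7, 1]])
# A new event is assigned the label of its closest stored row.
unseen = knn.predict([[8, 6, 1]])
print(seen, unseen)
```

Note that the default Euclidean distance treats the category codes as magnitudes. Passing `metric="hamming"`, as in `knn_hamming` above, instead compares rows by the fraction of columns that differ.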

In [None]:
knn_model.fit(training_data[predictor_factors],training_data["gname"])

In [None]:
#decision tree test
dtree_predictions = dtree_model.predict(test_data[predictor_factors])

In [None]:
#Test Unknown
dtree_unknown = dtree_model.predict(test_unknown[predictor_factors])

In [None]:
#KNN test
knn_pred = knn_model.predict(test_data[predictor_factors])

In [None]:
#KNN Unknown
knn_unknown = knn_model.predict(test_unknown[predictor_factors])

# 4. Evaluating Models
* custom function for accuracy
* accuracy score
* hamming loss

In [None]:
print("Decision Tree: ")
print_results(dtree_predictions,test_data,"gname")
print("Accuracy Score: ",accuracy_score(test_data["gname"],dtree_predictions))
print("Hamming Loss: ",hamming_loss(test_data["gname"],dtree_predictions))

In [None]:
print("Decision Tree Unknown: ")
print_results(dtree_unknown,test_unknown,"gname")
print("Accuracy Score: ",accuracy_score(test_unknown["gname"],dtree_unknown))
print("Hamming Loss: ",hamming_loss(test_unknown["gname"],dtree_unknown))

In [None]:
print("KNN: ")
print_results(knn_pred,test_data,"gname")
print("Accuracy Score: ",accuracy_score(test_data["gname"],knn_pred))
print("Hamming Loss: ",hamming_loss(test_data["gname"],knn_pred))

In [None]:
print("KNN Unknown: ")
print_results(knn_unknown,test_unknown,"gname")
print("Accuracy Score: ",accuracy_score(test_unknown["gname"],knn_unknown))
print("Hamming Loss: ",hamming_loss(test_unknown["gname"],knn_unknown))

As noted above, there are several classes with single instances. Because of these instances, I refrained from using k-fold cross-validation to determine model accuracy: the splits wouldn't provide an accurate representation of all classes as they appear within the data.
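
One possible middle ground, sketched below under stated assumptions, is a held-out split that keeps every single-instance class in the training set. This is not the procedure used in this notebook, which evaluates on a sample drawn from the training frame; the toy frame and the 30% test fraction are arbitrary choices for illustration.

```python
import pandas as pd

# Toy frame: classes 2 and 3 each have a single instance.
toy = pd.DataFrame({
    "feature": range(10),
    "gname": [0, 0, 0, 0, 1, 1, 1, 2, 3, 0],
})

# Size of each row's class, aligned with the rows of the frame.
class_sizes = toy["gname"].map(toy["gname"].value_counts())

# Rows from single-instance classes stay in training; otherwise the
# model would never see that class at all.
eligible = toy[class_sizes > 1]
test = eligible.sample(frac=0.3, random_state=0)
train = toy.drop(test.index)
print(len(train), len(test))
```

With such a split, accuracy on `test` reflects events the model has not stored, at the cost of not measuring performance on the single-instance classes.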

I used a custom function in conjunction with the built-in scikit-learn classification metrics accuracy score and Hamming loss. I didn't use the Jaccard similarity score since it is equivalent to the accuracy score in this single-label setting.
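
For single-label multiclass predictions like these, the Hamming loss is simply the fraction of mislabeled points, so it equals one minus the accuracy score. A small check on made-up labels:

```python
import numpy as np
from sklearn.metrics import accuracy_score, hamming_loss

# Made-up labels: 3 of the 5 predictions are correct.
y_true = np.array([0, 1, 2, 2, 3])
y_pred = np.array([0, 1, 2, 1, 0])

acc = accuracy_score(y_true, y_pred)  # fraction of correct predictions
ham = hamming_loss(y_true, y_pred)    # fraction of incorrect predictions
print(acc, ham, acc + ham)
```

The performance percentage printed by `print_results` is the same quantity as the accuracy score, expressed as a percentage.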

When tested purely on unknown perpetrators, a data set with a different class distribution, the predictive model performed almost identically.

# 5. Conclusion
In conclusion, we are able to construct a fairly accurate predictive model that can determine the perpetrator group related to an incident. By selecting and engineering the right features, we are able to construct a model with upwards of 90% accuracy and a Hamming loss below 10%.