***Insights and analysis of the Global Terrorism Database (GTD), maintained by the University of Maryland, to understand the current state of terrorism in the US and possibly predict future risks. I expect this analysis to be valuable to the US Department of Homeland Security, other government bodies, businesses, and individuals who care about the safety of lives and property.***

***Dataset can be found at: http://www.start.umd.edu/gtd/***

***The goal of this project is to compare prediction accuracy based on the number of classes in a target variable, and to predict whether the nationality of an attacker's group is the same as that of the attacked location.***

***Overseeing Mentor: Dr. Stylianos Kampakis***

In [1]:
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split
sns.set(style="whitegrid", color_codes=True)
np.random.seed(sum(map(ord, "categorical")))
matplotlib.style.use('ggplot')


In [None]:
%matplotlib inline

### Data wrangling and preprocessing

The dataset was downloaded as an Excel spreadsheet from the GTD website and loaded into pandas, after which we carry out some data wrangling and preprocessing.

In [None]:
file= r'C:\Users\dejavu\Desktop\git_jupyter\springboard_mini_project\capstone_projects/globalterrorismdb_0617dist.xlsx'
df= pd.read_excel(file)

In [None]:
df.shape

In [None]:
#restrict this dataset to occurrences in the US.
df1= df['country_txt'].str.contains('United States')
df2= df[df1]
df2.head(3)

In [None]:
df2.shape

In [None]:
df2.info()

In [None]:
# Show every column when displaying DataFrames
pd.set_option('display.max_columns', None)

class EDA():
    '''Helper methods for Exploratory Data Analysis.'''
    def drop_col_nan(self, x, threshold):
        '''Drop columns whose percentage of missing values exceeds threshold.'''
        for col in x.columns:
            amt = x[col].isnull().sum() / float(len(x)) * 100
            if amt > threshold:
                # axis must be passed by keyword in recent pandas versions
                x = x.drop(col, axis=1)
        return x

    def drop_noisy_col(self, y, w=None):
        '''Drop the columns listed in w.'''
        return y.drop(w or [], axis=1)

    def drop_col_txt(self, z):
        '''Drop columns whose name ends in "txt".'''
        for c in z.columns:
            if str(c).endswith('txt'):
                z = z.drop(c, axis=1)
        return z


In [None]:
my_EDA = EDA()
df3=my_EDA.drop_col_nan(df2, 80)
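
The same missing-value filter can also be written without an explicit loop. This is only a minimal sketch on a made-up toy DataFrame; the function name `drop_sparse_columns` and the column names are my own assumptions, not part of the GTD data.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(frame, threshold):
    """Keep only columns whose percentage of missing values is <= threshold."""
    missing_pct = frame.isnull().mean() * 100  # percent missing per column
    return frame.loc[:, missing_pct <= threshold]

# Hypothetical toy data: 80%, 0% and 40% missing respectively
toy = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, np.nan, 1.0],
    "complete": [1, 2, 3, 4, 5],
    "half_missing": [np.nan, 2.0, np.nan, 4.0, 5.0],
})

kept = drop_sparse_columns(toy, 50)
print(list(kept.columns))
```

With a threshold of 50, only the column that is 80% missing is removed.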

In [None]:
df3.index = range(len(df2))

In [None]:
df4=my_EDA.drop_col_txt(df3)

In [None]:
df5=my_EDA.drop_noisy_col(df4, ['corp1', 'motive', 'target1', 'weapdetail','country','addnotes', 'summary', 'scite1' , 'scite2' , 'scite3' , 'dbsource', 'INT_LOG' ,'longitude','specificity', 'eventid', 'location','region', 'propcomment', 'latitude'])

In [None]:
df5.shape

In [None]:
df5.isnull().sum()

In [None]:
df5.dtypes

***Exporting the DataFrame to Excel for more analysis***

In [None]:
#writer = pd.ExcelWriter('abc2_xlsx', engine='xlsxwriter')
#df5.to_excel(writer, sheet_name='Sheet1')
#writer.save()

***Imputing Missing Values***

In [None]:
from sklearn.base import TransformerMixin

class DataFrameImputer(TransformerMixin):

    def __init__(self):
        """Impute missing values.
        Columns of dtype object are imputed with the most frequent value 
        in column.
        Columns of other types are imputed with median of column.
        """
    def fit(self, X, y=None):

        self.fill = pd.Series([X[c].value_counts().index[0]
            if X[c].dtype == np.dtype('O') else X[c].median() for c in X],
            index=X.columns)
        return self

    def transform(self, X, y=None):
        return X.fillna(self.fill)

In [None]:
df6 = DataFrameImputer().fit_transform(df5)
df6.head()

**Removing the unknowns from the group name (gname) column**

In [None]:
df8 = df6[df6['gname'] != 'Unknown']

In [None]:
df8.shape

**Resetting the Index**


In [None]:
df9 = df8.reset_index(drop=True)
df9.head()

***Encoding objects into categorical variables, since scikit-learn requires that all cells be numeric (int or float etc)***

In [None]:
from sklearn.preprocessing import LabelEncoder
var_mod = ['gname', 'provstate', 'city']
le = LabelEncoder()
for i in var_mod:
    df9[i] = le.fit_transform(df9[i])
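
Reusing a single `LabelEncoder` across columns works for encoding, but only the mapping of the last column is kept. If the original group names are needed later, one option is to keep one encoder per column. A minimal sketch, assuming a small made-up DataFrame (the values below are hypothetical, not taken from the GTD):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data standing in for the text columns
toy = pd.DataFrame({
    "gname": ["Group B", "Group A", "Group B"],
    "city": ["Boston", "Austin", "Boston"],
})

# One encoder per column, so every mapping can be reversed later
encoders = {}
for col in ["gname", "city"]:
    encoders[col] = LabelEncoder()
    toy[col] = encoders[col].fit_transform(toy[col])

# Recover the original labels for one column
decoded = encoders["gname"].inverse_transform(toy["gname"])
print(toy)
print(list(decoded))
```

`LabelEncoder` assigns integers in sorted order of the labels, so the codes carry no meaning beyond identity.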

***Descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s distribution after excluding NaN values.***

In [None]:
df9.describe()

From the descriptive statistics above, one can tell the following:
1. whether the distribution of each feature or predictor is skewed, by comparing the mean to the median (50% mark).
2. skewness from #1, if any, suggests some outliers in the distribution.
3. the dispersion of the data points. The greater the standard deviation ('std'), the more dispersed the data points are.
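
The mean-versus-median comparison in the list above can be checked directly. This is a small sketch on made-up numbers (the column names are hypothetical), using the sample skewness that pandas provides:

```python
import pandas as pd

# Hypothetical toy columns: one symmetric, one with a large outlier
toy = pd.DataFrame({
    "symmetric": [1, 2, 3, 4, 5],
    "right_skewed": [1, 1, 2, 2, 50],
})

summary = pd.DataFrame({
    "mean": toy.mean(),
    "median": toy.median(),
    "std": toy.std(),
    "skew": toy.skew(),  # > 0 means a longer right tail
})
print(summary)
```

A mean well above the median, as in the second column, points to a right-skewed distribution and usually to a few large outliers.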

One can delve deeper into each feature using scatter plots, bar charts or histograms, depending on the algorithm one intends to use. However, this is not really necessary for my analysis.

In [None]:
df9.head()

Our dataset is now ready for machine learning analysis.

### Analysis of how the number of labels affects generalization of the trained model to out-of-sample data

***Creation of Predictors and the Target (Outcome) Variables*** 

In [None]:
X = df9.drop('gname', axis=1)
y = df9['gname']

In [None]:
X.shape

In [None]:
y.shape

In [None]:
#Shows the number of unique labels in our target variable. 
len(y.unique())

In my opinion, the number above is a lot of classes or labels relative to the number of observations in the dataset. I foresee an overfitting problem, where the trained model will not generalize to out-of-sample data.
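
One way to make this concern concrete is to look at how many observations each class has. A minimal sketch on a made-up label column (the labels below are hypothetical, not GTD group names):

```python
import pandas as pd

# Hypothetical target column: two frequent classes and three one-off classes
labels = pd.Series(["A"] * 6 + ["B"] * 3 + ["C", "D", "E"])

counts = labels.value_counts()
n_classes = counts.size
singletons = int((counts == 1).sum())    # classes seen only once
avg_per_class = len(labels) / n_classes  # average observations per class

print(n_classes, singletons, avg_per_class)
```

A class that appears only once can end up entirely in the training set or entirely in the test set, so the model either never sees it or is never tested on it.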

***Selecting the most important features with Random Forest Classifier***

In [None]:
# Split the data into 40% test and 60% training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)

In [None]:
# Create a random forest classifier
clf = RandomForestClassifier(n_estimators=1000, random_state=0, n_jobs=-1)

In [None]:
# Train the classifier
clf.fit(X_train, y_train)


In [None]:
# Print the name and gini importance of each feature
for feature in zip(X.columns, clf.feature_importances_):
    print(feature)

In [None]:
feature_rank = pd.DataFrame(clf.feature_importances_)
feature_rank.columns = ['rank']

In [None]:
plt.figure(figsize=(40,20)) # this creates a figure 40 inches wide, 20 inches high
sns.barplot(x = X.columns, y = 'rank',  data = feature_rank, order = X.columns )
plt.show()

In [None]:
# Create a selector object that will use the random forest classifier to identify
# features that have an importance of at least 0.020
sfm = SelectFromModel(clf, threshold=0.020)

In [None]:
# Train the selector
sfm.fit(X_train, y_train)

In [None]:
# Print the names of the most important features
for feature_list_index in sfm.get_support(indices=True):
    print(X.columns[feature_list_index])
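
To illustrate what the threshold does, here is a small sketch on synthetic data from `make_classification` (not the GTD data; the sizes and the 0.10 threshold are arbitrary choices for the example). `SelectFromModel` keeps the features whose importance is at least the threshold.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in data: 8 features, 3 of them informative
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=3, random_state=0)

selector = SelectFromModel(
    RandomForestClassifier(n_estimators=50, random_state=0), threshold=0.10)
selector.fit(X_demo, y_demo)

# Boolean mask of kept features vs. the importances of the fitted forest
mask = selector.get_support()
expected = selector.estimator_.feature_importances_ >= 0.10
X_selected = selector.transform(X_demo)

print(mask)
print(X_selected.shape)
```

Because the importances sum to 1, a lower threshold keeps more features and a higher one keeps fewer.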


***Create A Data Subset With Only The Most Important Features***


In [None]:
# Transform the data to create a new dataset containing only the most important features
# Note: We have to apply the transform to both the training X and test X data.
X_important_train = sfm.transform(X_train)
X_important_test = sfm.transform(X_test)

***Train A New Random Forest Classifier Using Only Most Important Features***


In [None]:
# Create a new random forest classifier for the most important features
clf_important = RandomForestClassifier(n_estimators=1000, random_state=0, n_jobs=-1)

# Train the new classifier on the new dataset containing the most important features
clf_important.fit(X_important_train, y_train)

***Compare The Accuracy Of Our Full Feature Classifier To Our Limited Feature Classifier***


In [None]:
# Apply The Full Featured Classifier To The Test Data
y_pred = clf.predict(X_test)

# View The Accuracy Of Our Full Feature Model
accuracy_score(y_test, y_pred)

In [None]:
# Apply The Limited Feature Classifier To The Test Data
y_important_pred = clf_important.predict(X_important_test)

# View The Accuracy Of Our Limited Feature Model
accuracy_score(y_test, y_important_pred)

In [None]:
from sklearn.tree import DecisionTreeClassifier
dtree = DecisionTreeClassifier()
dtree = dtree.fit(X_train, y_train)
dtree.score(X_train, y_train)

This is a classic example of overfitting: the model scores 100% accuracy when tested on the same data it was trained on, but much less (about 66%) when evaluated on the test set. This means the trained model does not generalize well to out-of-sample data. There are ways to fix this. One is feature engineering, which requires domain knowledge; it can be used to reduce the number of classes or labels to a point where the required generalization is attained.
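
The gap between training and test accuracy is easiest to read when both scores come from the same fitted model. A minimal sketch on synthetic data (generated with `make_classification`, so the numbers will differ from the GTD results; all names below are my own):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with some label noise (flip_y)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=5,
    n_classes=3, flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=0)

# A fully grown tree memorizes the training set
tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
train_acc = tree.score(X_tr, y_tr)
test_acc = tree.score(X_te, y_te)

print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```

The larger the difference between the two scores, the less the model generalizes.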

Looking at the unique classes for gname below, we see that there are 228 unique labels for a dataset of 2206 observations. I have been advised that this is far too many, and it is probably the cause of the overfitting. In a future version of this analysis I will use the feature engineering mentioned above to combine and re-categorize labels.

In [None]:
len(y.unique())
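
One simple way to combine labels, shown here only as a sketch, is to fold every class with too few observations into a single catch-all class. The function name `collapse_rare_labels`, the `"Other"` label and the `min_count` cutoff are assumptions for illustration; a grouping based on domain knowledge would likely be better.

```python
import pandas as pd

def collapse_rare_labels(labels, min_count, other="Other"):
    """Replace labels that occur fewer than min_count times with `other`."""
    counts = labels.value_counts()
    rare = counts[counts < min_count].index
    return labels.where(~labels.isin(rare), other)

# Hypothetical group names: two frequent groups and three one-off groups
groups = pd.Series(["A"] * 5 + ["B"] * 4 + ["C", "D", "E"])
collapsed = collapse_rare_labels(groups, min_count=2)

print(groups.nunique(), "->", collapsed.nunique())
print(collapsed.value_counts())
```

This must be applied to the text labels before they are label-encoded.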

I will demonstrate this using another feature, 'INT_IDEO', which is a categorical variable that specifies:

1 = "Yes"    for nationality of attack group different from the location of the attack

0 = "No"     for nationality of attack group same as the location of the attack

-9 = "Unknown" nationality of the attack group is unknown

Please refer to the Codebook PDF at http://www.start.umd.edu/gtd/

#### Repeating the same code and cells for 'INT_IDEO'

In [None]:
W = df9.drop('INT_IDEO', axis=1)
z = df9['INT_IDEO']

In [None]:
len(z.unique())

In [None]:
# Split the data into 40% test and 60% training
W_train, W_test, z_train, z_test = train_test_split(W, z, test_size=0.4, random_state=0)

In [None]:
# Create a random forest classifier
clf = RandomForestClassifier(n_estimators=1000, random_state=0, n_jobs=-1)

In [None]:
# Train the classifier
clf.fit(W_train, z_train)


In [None]:
# Print the name and gini importance of each feature
for feature in zip(W.columns, clf.feature_importances_):
    print(feature)

In [None]:
# Create a selector object that will use the random forest classifier to identify
# features that have an importance of at least 0.020
sfm = SelectFromModel(clf, threshold=0.020)

In [None]:
# Train the selector
sfm.fit(W_train, z_train)

In [None]:
# Print the names of the most important features
for feature_list_index in sfm.get_support(indices=True):
    print(W.columns[feature_list_index])

In [None]:
# Transform the data to create a new dataset containing only the most important features
# Note: We have to apply the transform to both the training X and test X data.
W_important_train = sfm.transform(W_train)
W_important_test = sfm.transform(W_test)

In [None]:
# Create a new random forest classifier for the most important features
clf_important = RandomForestClassifier(n_estimators=1000, random_state=0, n_jobs=-1)

# Train the new classifier on the new dataset containing the most important features
clf_important.fit(W_important_train, z_train)

In [None]:
# Apply The Full Featured Classifier To The Test Data
y_pred = clf.predict(W_test)

# View The Accuracy Of Our Full Feature Model
accuracy_score(z_test, y_pred)

In [None]:
# Apply The Limited Feature Classifier To The Test Data
y_important_pred = clf_important.predict(W_important_test)

# View The Accuracy Of Our Limited Feature Model
accuracy_score(z_test, y_important_pred)

In [None]:
from sklearn.tree import DecisionTreeClassifier
dtree = DecisionTreeClassifier()
dtree = dtree.fit(W_train, z_train)
dtree.score(W_train, z_train)

So we can see the importance of limiting the number of classes in the target. With just 3 labels, the model using the important predictors achieved 98% accuracy on the test set, which is close to the 100% accuracy on the training set. This generalizes well compared with the many-class prediction of 'gname'.

There is no rule of thumb for the number of labels a target variable should have. It all depends on business needs and how much error one can tolerate. This suggests that re-engineering 'gname' and regrouping it into fewer classes will improve generalization from the training set to the test set.
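
The effect of the number of classes can also be sketched on synthetic data, holding everything else fixed. This is not the GTD experiment, just an illustration under my own assumptions: `make_classification` data, a fully grown decision tree, and 2 versus 20 classes.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def train_test_accuracy(n_classes):
    """Fit a decision tree on synthetic data; return (train, test) accuracy."""
    X_syn, y_syn = make_classification(
        n_samples=1000, n_features=20, n_informative=10,
        n_classes=n_classes, n_clusters_per_class=1, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_syn, y_syn, test_size=0.4, random_state=0)
    tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
    return tree.score(X_tr, y_tr), tree.score(X_te, y_te)

# Same number of observations, very different number of classes
results = {k: train_test_accuracy(k) for k in (2, 20)}
for k, (train_acc, test_acc) in results.items():
    print(f"{k:>2} classes: train={train_acc:.2f}, test={test_acc:.2f}")
```

With the same number of observations, each of the 20 classes has far fewer training examples, so the test accuracy falls further below the training accuracy.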

Future versions of this model will include Exploratory Data Analysis (EDA) to accompany the descriptive statistics in cell 23, feature engineering to recreate some features, more compact machine learning code using OOP for reuse and modularity, and further explanation of the approach I took. Enjoy!!!