# Predicting level of fatalities in political violence and protest events in India
The Armed Conflict Location & Event Data Project (ACLED) is a disaggregated conflict collection, analysis and crisis mapping project. ACLED collects the dates, actors, types of violence, locations, and fatalities of all reported political violence and protest events. Political violence and protest include events that occur within civil wars and periods of instability, public protest and regime breakdown. The data used here were collected in India between 26 January 2016 and 26 January 2019.
#### source: https://www.acleddata.com/data/
*Raleigh, Clionadh, Andrew Linke, Håvard Hegre and Joakim Karlsen. (2010). “Introducing ACLED-Armed Conflict Location and Event Data.” Journal of Peace Research 47(5) 651-660.*

## Step 3. Model training and Evaluation

In [1]:
# Importing necessary libraries
import numpy as np
import pandas as pd
from datetime import datetime, timedelta
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [2]:
# Suppressing warnings is not generally advisable; it is done here only to keep the notebook short and clean.
# Comment out the two lines below to see the warnings.
import warnings
warnings.filterwarnings("ignore")

In [3]:
# machine learning
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import mean_squared_error
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

### Loading Data into dataframe

In [4]:
data=pd.read_csv('Cleaned_fatalities_data2.csv')

In [5]:
data.head()

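| Unnamed: 0 | time_precision | inter1 | inter2 | geo_precision | State_label | Event_label | month | No_of_actors | SourceCount | Source_scale1 | Fatality_Label |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 8 | 2 | 0 | 0 | 38 | 3 | 1 | 1 | 0 |
| 1 | 1 | 3 | 1 | 2 | 0 | 1 | 38 | 2 | 2 | 1 | 1 |
| 2 | 1 | 5 | 0 | 2 | 0 | 2 | 38 | 1 | 1 | 2 | 0 |
| 3 | 1 | 5 | 0 | 2 | 0 | 2 | 38 | 1 | 1 | 2 | 0 |
| 4 | 1 | 5 | 1 | 1 | 1 | 2 | 38 | 4 | 1 | 3 | 0 |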


### Test, Train, Valid Split
We split the data into three sets based on the month in which each event takes place:
1. training: Events taking place in month no. 1 to month no. 25 
2. valid: Events taking place in month no. 26 to month no. 33
3. test: Events taking place in month no. 34 to month no. 38

In [6]:
def Valid_range_check(mon):
    # True for months in the validation window (26 to 33 inclusive)
    return 25 < mon <= 33

# Splitting data into training, validation and test sets
# training: month 1 to 25; validation: month 26 to 33; test: month 34 to 38
Train_data_imask = data.month <= 25
Valid_data_imask = data.month.apply(Valid_range_check)
# NOTE: `> 32` also selects month 33, which is already part of the validation set.
# Use `> 33` for a test set that starts at month 34 as described above
# (the test row count printed below was produced with `> 32`).
Test_data_imask = data.month > 32
print('Number of rows: Training data=', sum(Train_data_imask))
print('Number of rows: Validation data=', sum(Valid_data_imask))
print('Number of rows: Test data=', sum(Test_data_imask))
Train_data = data[Train_data_imask]
Valid_data = data[Valid_data_imask]
Test_data = data[Test_data_imask]

Number of rows: Training data= 26818
Number of rows: Validation data= 10887
Number of rows: Test data= 10422
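
The month-based split can be sanity-checked on toy data. The sketch below is illustrative only: `split_by_month` and the toy frame are hypothetical and not part of the notebook, and it assumes the boundaries stated in the list above (months 1 to 25, 26 to 33, and 34 to 38).

```python
import pandas as pd

def split_by_month(df, train_end=25, valid_end=33):
    """Split a frame into train/valid/test by its `month` column (hypothetical helper)."""
    train_mask = df["month"] <= train_end
    valid_mask = (df["month"] > train_end) & (df["month"] <= valid_end)
    test_mask = df["month"] > valid_end
    return df[train_mask], df[valid_mask], df[test_mask]

# Toy data: one made-up event per month, months 1..38
toy = pd.DataFrame({"month": range(1, 39), "Fatality_Label": [0, 1] * 19})
train, valid, test = split_by_month(toy)

# Every row must land in exactly one split
assert len(train) + len(valid) + len(test) == len(toy)
print(len(train), len(valid), len(test))
```

Because the three masks are built from the same two boundaries, no month can fall into two sets, which keeps validation events out of the test set.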


In [7]:
X_train=Train_data.drop(['Fatality_Label'],axis=1)
Y_train=Train_data['Fatality_Label']
X_Valid=Valid_data.drop("Fatality_Label", axis=1)
Y_Valid=Valid_data['Fatality_Label']
X_test=Test_data.drop("Fatality_Label", axis=1)
Y_test=Test_data['Fatality_Label']

## Model training
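We train several supervised machine learning models. After training, we judge each model on the validation set using three metrics: **mean squared error (reported in the `RMSE_Score` column), accuracy** and, most importantly, **accuracy on fatal events**, i.e. the share of truly fatal events that the model predicts as fatal. The `R2_Score` column holds the training-set accuracy in percent, which is what `Model.score` returns for a classifier.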
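
To make these metrics concrete, here is a small sketch on made-up labels (the arrays are illustrative and not taken from the dataset). It assumes binary 0/1 labels, in which case the mean squared error equals the misclassification rate, and accuracy restricted to the truly fatal events is the recall of the fatal class.

```python
import numpy as np

# Made-up ground truth and predictions (1 = fatal event)
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 1, 1, 0, 0, 0])

# Share of all events predicted correctly
accuracy = np.mean(y_true == y_pred)

# Mean squared error and its square root; with 0/1 labels, mse == 1 - accuracy
mse = np.mean((y_true - y_pred) ** 2)
rmse = np.sqrt(mse)

# Accuracy computed only over events that were actually fatal
fatal = y_true > 0
fatal_accuracy = np.mean(y_pred[fatal] == y_true[fatal])

print(accuracy, mse, rmse, fatal_accuracy)
```

In this toy case the model is right on 6 of 10 events overall but catches only 1 of the 4 fatal ones, which is why the fatal-event score is tracked separately.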

In [8]:
def Model_train(ModelName,X_t,Y_t,X_V,Y_V):
    # Build the classifier requested by name
    if ModelName=='LogisticRegression':
        Model=LogisticRegression(solver="newton-cg",multi_class='multinomial')
    elif ModelName=='DecisionTreeClassifier':
        Model=DecisionTreeClassifier()
    elif ModelName=='RandomForestClassifier':
        Model=RandomForestClassifier(n_estimators=100,random_state=123)
    elif ModelName=='GaussianNaiveBayes':
        Model= GaussianNB()
    elif ModelName=='LinearSupportVectorMachines':
        Model=LinearSVC(max_iter=100,random_state=123)
    elif ModelName=='KNeighborsClassifier':
        Model=KNeighborsClassifier(n_neighbors=4)
    elif ModelName=='StochasticGradientDescent':
        Model= SGDClassifier(random_state=123)
    elif ModelName=='AdaBoostClassifier':
        # scikit-learn >= 1.2 renames `base_estimator` to `estimator`
        Model=AdaBoostClassifier(base_estimator=RandomForestClassifier(n_estimators=10),learning_rate=0.1,random_state=123)
    elif ModelName=='XGBoost':
        Model=XGBClassifier(seed=1,learning_rate=0.1,n_estimators=100)
    else:
        raise ValueError('Unknown model name: '+str(ModelName))
    # Fit on the training set and predict on the validation set
    Model.fit(X_t,Y_t)
    Y_pred=Model.predict(X_V)
    # For classifiers, score() returns mean accuracy: this is training accuracy in percent, not R^2
    R2=round(Model.score(X_t,Y_t)*100,2)
    # mean_squared_error returns the MSE (no square root is taken)
    RMSE=mean_squared_error(Y_V, Y_pred)
    Acc=accuracy_score(Y_V, Y_pred)
    # Accuracy restricted to events that were actually fatal
    Acc_fatal=accuracy_score(Y_V[Y_V>0],Y_pred[Y_V>0])
    return pd.DataFrame(data={'Model':[ModelName],'R2_Score':[R2],'RMSE_Score':[RMSE],
                  'Accuracy_Score':[Acc],'Fatal_Events_Accuracy_Score':[Acc_fatal]})

In [9]:
def Supervised_Learning(X_train,Y_train,X_Valid,Y_Valid):
    ModelNames=['LogisticRegression','DecisionTreeClassifier','RandomForestClassifier',
                'GaussianNaiveBayes','LinearSupportVectorMachines','KNeighborsClassifier',
                'StochasticGradientDescent','AdaBoostClassifier','XGBoost']
    # Train each model in turn and collect its one-row result frame
    Results=[Model_train(ModelName=Name,X_t=X_train,Y_t=Y_train,X_V=X_Valid,Y_V=Y_Valid)
             for Name in ModelNames]
    # pd.concat replaces DataFrame.append, which was removed in pandas 2.0
    Model_Validation=pd.concat(Results)
    return Model_Validation

In [10]:
ValidationScore1=Supervised_Learning(X_train=X_train,Y_train=Y_train,X_Valid=X_Valid,Y_Valid=Y_Valid)
ValidationScore1.sort_values(by='Fatal_Events_Accuracy_Score', ascending=False)

|  | Model | R2_Score | RMSE_Score | Accuracy_Score | Fatal_Events_Accuracy_Score |
|---|---|---|---|---|---|
| 0 | GaussianNaiveBayes | 92.62 | 0.106825 | 0.893175 | 0.933225 |
| 0 | RandomForestClassifier | 98.91 | 0.058235 | 0.941765 | 0.579805 |
| 0 | AdaBoostClassifier | 98.91 | 0.058602 | 0.941398 | 0.558632 |
| 0 | DecisionTreeClassifier | 98.91 | 0.071461 | 0.928539 | 0.496743 |
| 0 | XGBoost | 96.58 | 0.052264 | 0.947736 | 0.449511 |
| 0 | LogisticRegression | 95.02 | 0.057408 | 0.942592 | 0.13355 |
| 0 | KNeighborsClassifier | 97.03 | 0.055938 | 0.944062 | 0.127036 |
| 0 | LinearSupportVectorMachines | 95.59 | 0.05603 | 0.94397 | 0.029316 |
| 0 | StochasticGradientDescent | 95.05 | 0.055938 | 0.944062 | 0.011401 |


## Dealing with Unbalanced dataset
Even though overall accuracy is high, accuracy on events that actually led to fatalities is low (below 60%) for every model except Gaussian Naive Bayes. This points to a class imbalance: fatal events are much less frequent than non-fatal events (see the data visualization Jupyter notebook).
We want to correct for this imbalance among the classes we are trying to predict.
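
The effect of imbalance on plain accuracy can be illustrated with a trivial baseline. The sketch below uses made-up labels (95 non-fatal and 5 fatal events, chosen for illustration and not taken from the dataset) and a hypothetical "model" that always predicts the majority class.

```python
import numpy as np

# Made-up imbalanced labels: 95 non-fatal events, 5 fatal events
y_true = np.array([0] * 95 + [1] * 5)

# Baseline that always predicts the majority class (non-fatal)
y_pred = np.zeros_like(y_true)

# Overall accuracy looks good because most events are non-fatal
accuracy = np.mean(y_true == y_pred)

# Share of truly fatal events that were predicted as fatal
fatal_accuracy = np.mean(y_pred[y_true > 0] == 1)

print(accuracy, fatal_accuracy)
```

The baseline scores 95% overall while detecting no fatal event at all, so a high overall accuracy alone says little about the minority class.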

### Up-sample Minority Class

Increase the number of samples for classes with low counts.
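
As a minimal sketch of the idea, the example below upsamples the minority class of a made-up frame by drawing rows with replacement. It uses pandas `DataFrame.sample` instead of `sklearn.utils.resample`, and the toy frame and its `feature` column are assumptions made for illustration only.

```python
import pandas as pd

# Made-up training data: 8 non-fatal rows and 2 fatal rows
toy = pd.DataFrame({
    "feature": range(10),
    "Fatality_Label": [0] * 8 + [1] * 2,
})
majority = toy[toy["Fatality_Label"] == 0]
minority = toy[toy["Fatality_Label"] == 1]

# Draw minority rows with replacement until both classes have the same size
minority_upsampled = minority.sample(n=len(majority), replace=True, random_state=123)

# Combine the untouched majority class with the upsampled minority class
balanced = pd.concat([majority, minority_upsampled])

print(balanced["Fatality_Label"].value_counts().to_dict())
```

Only the training split should be resampled in this way; the validation and test sets keep their original class proportions so that the scores reflect real-world frequencies.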

In [11]:
from sklearn.utils import resample
df_fatal0=Train_data[Train_data.Fatality_Label==0]
df_fatal1=Train_data[Train_data.Fatality_Label==1]

In [13]:
# Upsample Class Fatality_Label=1
df_fatal1_upsampled = resample(df_fatal1, 
                                 replace=True,     # sample with replacement
                                 n_samples=data['Fatality_Label'].value_counts()[0],# number of Fatality_Label=0 rows in the full dataset
                                 random_state=123) # reproducible results
# Combine majority class with upsampled minority class
Train_data_upsampled = pd.concat([df_fatal0, df_fatal1_upsampled])
X_train_upsampled=Train_data_upsampled.drop(['Fatality_Label'],axis=1)
Y_train_upsampled=Train_data_upsampled['Fatality_Label']

In [14]:
ValidationScore2=Supervised_Learning(X_train_upsampled,Y_train_upsampled,X_Valid,Y_Valid)
ValidationScore2.sort_values(by='Fatal_Events_Accuracy_Score', ascending=False)

|  | Model | R2_Score | RMSE_Score | Accuracy_Score | Fatal_Events_Accuracy_Score |
|---|---|---|---|---|---|
| 0 | StochasticGradientDescent | 90.43 | 0.197667 | 0.802333 | 0.985342 |
| 0 | XGBoost | 93.91 | 0.131257 | 0.868743 | 0.967427 |
| 0 | LogisticRegression | 89.27 | 0.139249 | 0.860751 | 0.959283 |
| 0 | GaussianNaiveBayes | 89.18 | 0.136034 | 0.863966 | 0.95114 |
| 0 | KNeighborsClassifier | 96.93 | 0.09075 | 0.90925 | 0.721498 |
| 0 | RandomForestClassifier | 98.3 | 0.06494 | 0.93506 | 0.680782 |
| 0 | AdaBoostClassifier | 98.3 | 0.063378 | 0.936622 | 0.576547 |
| 0 | DecisionTreeClassifier | 98.3 | 0.069073 | 0.930927 | 0.449511 |
| 0 | LinearSupportVectorMachines | 80.11 | 0.058051 | 0.941949 | 0.30456 |


Upsampling the minority class drastically improved accuracy on fatal events for most models, at some cost in overall accuracy. We will therefore use the XGBoost classifier, trained on the upsampled data, to judge performance on the test data in the next Jupyter notebook. **On the validation set it reaches an overall accuracy of 86.9% and an accuracy of 96.7% on fatal events.**