# Predicting level of fatalities in political violence and protest events in India
The Armed Conflict Location & Event Data Project (ACLED) is a disaggregated conflict collection, analysis and crisis mapping project. ACLED collects the dates, actors, types of violence, locations, and fatalities of all reported political violence and protest events. Political violence and protest include events that occur within civil wars and periods of instability, public protest and regime breakdown. The data used here were collected from India during the period 26-January-2016 to 26-January-2019.
#### source: https://www.acleddata.com/data/
*Raleigh, Clionadh, Andrew Linke, Håvard Hegre and Joakim Karlsen. (2010). “Introducing ACLED-Armed Conflict Location and Event Data.” Journal of Peace Research 47(5) 651-660.*

# Step 1. Data Cleaning

In [1]:
# Importing necessary libraries
import numpy as np
import pandas as pd
from datetime import datetime, timedelta

In [2]:
# Suppressing warnings is not advisable, but it keeps this notebook clean and short.
# Comment out the two lines below when you want to see warnings.
import warnings
warnings.filterwarnings("ignore")

To display more columns of the data, we set the maximum number of displayed columns to 50.

In [3]:
pd.set_option('display.max_columns', 50)

### Loading Data into dataframe

In [4]:
# Loading protest and political violence data
data=pd.read_csv('ACLED_data_India_updated.csv')
# load codes description data
inter_codes=pd.read_csv('Inter_codes.csv')
geo_precision_codes=pd.read_csv('geo_precision_code.csv')
time_precision_codes=pd.read_csv('time_precision_code.csv')

After loading the data, I first need to understand the basic structure of the dataset to look for possible issues and any cleaning it needs.

In [5]:
data.head()

Unnamed: 0,data_id,iso,event_id_cnty,event_id_no_cnty,event_date,year,time_precision,event_type,actor1,assoc_actor_1,inter1,actor2,assoc_actor_2,inter2,interaction,region,country,admin1,admin2,admin3,location,latitude,longitude,geo_precision,source,source_scale,notes,fatalities,timestamp,iso3
0,3090605,356,IND46153,46153,16-Feb-19,2019,1,Battle-No change of territory,Military Forces of India (2014-),Police Forces of India (2014-) Border Security...,1,Military Forces of Pakistan (2018-),,8,18,Southern Asia,India,Jammu and Kashmir,Rajouri,Naushera,Kalal,33.3418,74.3889,2,Chandigarh Tribune,Subnational,"On 16 Feb, Indian and Pakistani forces exchang...",0,1550588555,IND
1,3090606,356,IND46154,46154,16-Feb-19,2019,1,Remote violence,Unidentified Armed Group (India),,3,Military Forces of India (2014-),,1,13,Southern Asia,India,Jammu and Kashmir,Rajouri,Naushera,Jhangar Dharmsal,33.2705,74.0508,2,Chandigarh Tribune; Daily Excelsior,Subnational,"On 16 Feb, an Indian army officer was killed a...",1,1550588555,IND
2,3090607,356,IND46155,46155,16-Feb-19,2019,1,Riots/Protests,Rioters (India),,5,,,0,50,Southern Asia,India,Jammu and Kashmir,Samba,Samba,Samba,32.5625,75.1199,2,Daily Excelsior,subnational,"On 16 Feb, people held demonstrations and burn...",0,1550588555,IND
3,3090608,356,IND46156,46156,16-Feb-19,2019,1,Riots/Protests,Rioters (India),,5,,,0,50,Southern Asia,India,Jammu and Kashmir,Reasi,Reasi,Reasi,33.0792,74.8342,2,Daily Excelsior,subnational,"On 16 Feb, people held demonstrations and burn...",0,1550588555,IND
4,3090609,356,IND46157,46157,16-Feb-19,2019,1,Riots/Protests,Rioters (India),BJP: Bharatiya Janata Party,5,Police Forces of India (2014-),Civilians (India); TMC: Trinamool Congress Par...,1,15,Southern Asia,India,West Bengal,Birbhum,Labhpur,Labhpur,23.816,87.7982,1,India Blooms News Service,National,"On February 16, locals attacked a TMC MLA and ...",0,1550588555,IND


In [6]:
inter_codes

Unnamed: 0,inter_code,description
0,0,No actor
1,1,Government and state security services
2,2,Rebel Groups
3,3,Political Militias
4,4,Identity Militias
5,5,Rioters
6,6,Protestors
7,7,Civilians
8,8,External/other forces


In [7]:
geo_precision_codes

Unnamed: 0,geo_precision_code,precision_level
0,1,highest (exact location)
1,2,intermediate (regional)
2,3,lowest (provincial)


In [8]:
time_precision_codes

Unnamed: 0,time_precision_code,precision_level
0,1,highest (day)
1,2,intermediate (week)
2,3,lowest (month)


In [9]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 46135 entries, 0 to 46134
Data columns (total 30 columns):
data_id             46135 non-null int64
iso                 46135 non-null int64
event_id_cnty       46135 non-null object
event_id_no_cnty    46135 non-null int64
event_date          46135 non-null object
year                46135 non-null int64
time_precision      46135 non-null int64
event_type          46135 non-null object
actor1              46135 non-null object
assoc_actor_1       29511 non-null object
inter1              46135 non-null int64
actor2              12993 non-null object
assoc_actor_2       3627 non-null object
inter2              46135 non-null int64
interaction         46135 non-null int64
region              46135 non-null object
country             46135 non-null object
admin1              46135 non-null object
admin2              46124 non-null object
admin3              44509 non-null object
location            46135 non-null object
latitude          

Look at the number of unique values in each feature.

In [10]:
# An explicit dtype keeps the counts as int64 across pandas versions
unique_count=pd.Series(dtype='int64')
for column in data.columns:
    unique_count[column]=data[column].unique().size
unique_count

data_id             46135
iso                     1
event_id_cnty       46135
event_id_no_cnty    46135
event_date           1143
year                    4
time_precision          3
event_type              8
actor1                503
assoc_actor_1        1338
inter1                  8
actor2                257
assoc_actor_2         396
inter2                  9
interaction            40
region                  1
country                 2
admin1                 35
admin2                718
admin3               2865
location             6513
latitude             6652
longitude            6593
geo_precision           4
source                788
source_scale           19
notes               44657
fatalities             23
timestamp             162
iso3                    1
dtype: int64

### Dropping features
First, I drop columns that contain only a single distinct value.

In [11]:
for column in unique_count.index:
    if unique_count[column]==1:
        data.drop(columns=column,inplace=True)
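
The count-and-drop step above can also be sketched with `DataFrame.nunique`. This is a minimal illustration on a made-up frame (`toy` and its values are assumptions, not the ACLED data). Passing `dropna=False` counts NaN as a value, which matches what `unique().size` does.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the ACLED data
toy = pd.DataFrame({
    "iso": [356, 356, 356],
    "region": ["Southern Asia"] * 3,
    "fatalities": [0, 1, 0],
    "actor2": ["Rioters (India)", np.nan, np.nan],
})

# Distinct values per column; dropna=False counts NaN as one value
toy_unique_count = toy.nunique(dropna=False)

# Keep only the columns that hold more than one distinct value
toy_reduced = toy.loc[:, toy_unique_count > 1]
print(list(toy_reduced.columns))
```

Selecting with a boolean mask returns a new frame instead of modifying `toy` in place, which makes the step easier to rerun.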

Second, I drop features with non-relevant information.

In [12]:
features_to_drop=['country','data_id','event_id_cnty','event_id_no_cnty','timestamp','year','interaction']
Data1=data.drop(columns=features_to_drop)

Next, I drop some features whose information is relevant but redundant. I decided not to include these features when training the prediction model.

In [13]:
Data1.drop(columns=['location','notes','latitude','longitude','admin2','admin3'],inplace=True)

### Changing date column to datetime format

In [14]:
Data1['event_date_formatted']=pd.to_datetime(Data1['event_date'])
Data1.drop(columns='event_date',inplace=True)
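
The dates shown in `data.head()` look like `16-Feb-19`. Assuming every row follows that day-month-year pattern (an assumption, not something verified here), the format can be passed explicitly so that pandas does not have to guess it. The sample strings below are illustrative only.

```python
import pandas as pd

# Sample strings in the style shown in data.head(); the second one is made up
raw_dates = pd.Series(["16-Feb-19", "26-Jan-16"])

# %d = day of month, %b = abbreviated month name, %y = two-digit year
parsed = pd.to_datetime(raw_dates, format="%d-%b-%y")
print(parsed)
```

With an explicit format, a row that does not match raises an error instead of being parsed silently in an unexpected way.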

### Fixing for Null values
Columns with NULL entries and the number of such entries in each:

In [15]:
Data1.isnull().sum()[Data1.isnull().sum()>0]

assoc_actor_1    16624
actor2           33142
assoc_actor_2    42508
source_scale        77
dtype: int64

First, I fix the NULL data in feature 'source_scale'.

The source_scale of a news report is tied to its source; for example, the newspaper 'The Times of India' is a national newspaper. So, wherever the source_scale entry is null, we look at the source and fill in the missing value with the source_scale most often recorded for that source.

In [16]:
Source_missing_scale=Data1['source'][Data1.source_scale.isnull()].unique()
Source_missing_scale_filllist=pd.Series()
for source_name in Source_missing_scale:
    Source_missing_scale_filllist[source_name]=Data1[Data1['source']==source_name]['source_scale'].dropna().value_counts()
Source_missing_scale_filllist

Telegraph (India)                   National    208
Name: source_scale, dtype: int64
Sangai Express (India)             Subnational    126
National         1
Name: so...
Times of India                     National       759
Subnational      2
Name: so...
Asian News International           Regional         79
International     8
Nation...
Chandigarh Tribune                 Subnational    710
National       153
Name: so...
Indian Express                     National       1274
Subnational       1
region...
Pioneer (India)                    National       26
Subnational     1
Name: sour...
Hindustan Times (India)            National       273
Subnational      1
Name: so...
Pioneer (India); Times of India       National    1
Name: source_scale, dtype: int64
dtype: object

In [17]:
# Normalise the lower-case variant before filling
Data1.replace('regional','Regional',inplace=True)
for source_item in Source_missing_scale_filllist.index:
    # Rows from this source whose source_scale is missing
    missing_mask=(Data1['source']==source_item)&Data1['source_scale'].isnull()
    # Fill with the most frequent source_scale recorded for the same source
    Data1.loc[missing_mask,'source_scale']=Source_missing_scale_filllist[source_item].idxmax()
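
The fill rule described above (use the most common source_scale of the same source) can also be sketched with `groupby` and `transform`. The frame `toy` and the helper `fill_with_mode` are hypothetical names introduced for illustration. Note that `Series.mode` breaks ties by sorted order, so on tied counts it may pick a different value than `value_counts().idxmax()`.

```python
import pandas as pd

# Hypothetical toy frame; values are illustrative, not taken from the dataset
toy = pd.DataFrame({
    "source": ["Times of India", "Times of India", "Times of India",
               "Sangai Express (India)", "Sangai Express (India)"],
    "source_scale": ["National", "National", None, "Subnational", None],
})

def fill_with_mode(scales):
    """Fill missing values in one group with that group's most frequent value."""
    mode = scales.mode()  # NaN is ignored by default
    if mode.empty:        # no known value for this source: leave as is
        return scales
    return scales.fillna(mode.iloc[0])

toy["source_scale"] = toy.groupby("source")["source_scale"].transform(fill_with_mode)
print(toy)
```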

Next, we combine actor1, actor2, assoc_actor_1 and assoc_actor_2 into one column and drop the original columns. This deals with the null values in 'assoc_actor_1', 'actor2' and 'assoc_actor_2'.

In [18]:
Actors=pd.Series(index=Data1.index)
Actor2_null_imask=Data1.actor2.notnull()
AssoActor1_null_imask=Data1.assoc_actor_1.notnull()
AssoActor2_null_imask=Data1.assoc_actor_2.notnull()
for idx in Data1.index:
    if Actor2_null_imask[idx] and AssoActor1_null_imask[idx] and AssoActor2_null_imask[idx]:
        Actors[idx]=[Data1.loc[idx,'actor1'],Data1.loc[idx,'actor2'],Data1.loc[idx,'assoc_actor_1'],Data1.loc[idx,'assoc_actor_2']]
    elif Actor2_null_imask[idx] and AssoActor1_null_imask[idx] and not AssoActor2_null_imask[idx]:
        Actors[idx]=[Data1.loc[idx,'actor1'],Data1.loc[idx,'actor2'],Data1.loc[idx,'assoc_actor_1']]
    elif Actor2_null_imask[idx] and not AssoActor1_null_imask[idx] and AssoActor2_null_imask[idx]:
        Actors[idx]=[Data1.loc[idx,'actor1'],Data1.loc[idx,'actor2'],Data1.loc[idx,'assoc_actor_2']]
    elif not Actor2_null_imask[idx] and AssoActor1_null_imask[idx] and AssoActor2_null_imask[idx]:
        Actors[idx]=[Data1.loc[idx,'actor1'],Data1.loc[idx,'assoc_actor_1'],Data1.loc[idx,'assoc_actor_2']]
    elif Actor2_null_imask[idx] and not AssoActor1_null_imask[idx] and not AssoActor2_null_imask[idx]:
        Actors[idx]=[Data1.loc[idx,'actor1'],Data1.loc[idx,'actor2']]
    elif not Actor2_null_imask[idx] and not AssoActor1_null_imask[idx] and AssoActor2_null_imask[idx]:
        Actors[idx]=[Data1.loc[idx,'actor1'],Data1.loc[idx,'assoc_actor_2']]
    elif not Actor2_null_imask[idx] and AssoActor1_null_imask[idx] and not AssoActor2_null_imask[idx]:
        Actors[idx]=[Data1.loc[idx,'actor1'],Data1.loc[idx,'assoc_actor_1']]
    else:
        Actors[idx]=Data1.loc[idx,'actor1']
Data1=Data1.join(pd.DataFrame(Actors,columns=['Actors']))
Data1.drop(columns=['actor1','actor2','assoc_actor_1', 'assoc_actor_2'],inplace=True)

### Creating new numerical features
To proceed with supervised learning algorithms, I created numeric features from the features with string entries.

Creating a new feature for the state label. The table below shows each state label (the row index) and the corresponding state name.

In [19]:
States=pd.DataFrame(Data1.admin1.unique(),columns=['state'])
print(States)

                     state
0        Jammu and Kashmir
1              West Bengal
2              Maharashtra
3                    Assam
4                Meghalaya
5                   Odisha
6              Uttarakhand
7                   Punjab
8                Telangana
9         Himachal Pradesh
10           Uttar Pradesh
11            NCT of Delhi
12          Andhra Pradesh
13               Jharkhand
14                 Haryana
15              Tamil Nadu
16     Andaman and Nicobar
17                 Manipur
18                   Bihar
19          Madhya Pradesh
20            Chhattisgarh
21              Puducherry
22                  Kerala
23               Karnataka
24       Arunachal Pradesh
25                     Goa
26                 Mizoram
27                 Tripura
28              Chandigarh
29                Nagaland
30                 Gujarat
31               Rajasthan
32                  Sikkim
33           Daman and Diu
34  Dadra and Nagar Haveli


In [20]:
def State_label(state):
    for i in range(len(States)):
        if state==States.loc[i,'state']:
            return i
Data1['State_label']=Data1.admin1.apply(State_label)
Data1.drop(columns=['admin1'],inplace=True)
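
Labelling each distinct string by its order of first appearance is what `pd.factorize` does, so the lookup loop above can be sketched in one call. The series `toy_states` is a made-up sample, and `state_table` is a hypothetical name for the resulting lookup table.

```python
import pandas as pd

# Made-up sample of state names, with one repeat
toy_states = pd.Series(
    ["Jammu and Kashmir", "West Bengal", "Jammu and Kashmir", "Maharashtra"]
)

# codes[i] is the integer label of row i;
# uniques lists the names in order of first appearance
codes, uniques = pd.factorize(toy_states)

# Lookup table from label (row index) to state name
state_table = pd.DataFrame({"state": uniques})
print(codes)
print(state_table)
```

The same pattern would apply to the event type labels.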

Creating a new feature for the event type label. The table below shows each event type label (the row index) and the corresponding event type.

In [21]:
Events=pd.DataFrame(Data1.event_type.unique(),columns=['Event'])
print(Events)

                                 Event
0        Battle-No change of territory
1                      Remote violence
2                       Riots/Protests
3           Violence against civilians
4                Strategic development
5     Headquarters or base established
6    Non-violent transfer of territory
7  Battle-Government regains territory


In [22]:
def Event_label(event):
    for i in range(len(Events)):
        if event==Events.loc[i,'Event']:
            return i
Data1['Event_label']=Data1.event_type.apply(Event_label)
Data1.drop(columns=['event_type'],inplace=True)

Creating a new month label feature, where 1 corresponds to January 2016, the first month in the dataset, and the labels run 1, 2, ..., 38.

In [23]:
def month_of(date_formatted):
    return date_formatted.month+12*(date_formatted.year-2016)
Data1['month']=Data1.event_date_formatted.apply(month_of)
Data1.drop(columns=['event_date_formatted'],inplace=True)
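
The month label is the calendar month plus 12 for every full year since 2016. A vectorised sketch of the same formula using the `.dt` accessor is shown below; the dates in `toy_dates` are illustrative.

```python
import pandas as pd

# Illustrative dates: first month, last month of 2016, and February 2019
toy_dates = pd.Series(pd.to_datetime(["2016-01-26", "2016-12-31", "2019-02-16"]))

# Same formula as month_of, applied to the whole column at once
month_index = toy_dates.dt.month + 12 * (toy_dates.dt.year - 2016)
print(month_index.tolist())  # [1, 12, 38]
```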

Creating a new feature for the number of actors.

In [24]:
def No_of_Actors(actors):
    if type(actors)==list:
        return len(actors)
    else:
        return 1
Data1['No_of_actors']=Data1['Actors'].apply(No_of_Actors)
Data1.drop(columns=['Actors'],inplace=True)
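
Since the combined Actors column is reduced to a count and then dropped, the same number can be obtained by counting the non-null actor columns in each row. The sketch below uses a made-up frame `toy`; the column names follow the dataset, the rows are illustrative.

```python
import numpy as np
import pandas as pd

actor_columns = ["actor1", "assoc_actor_1", "actor2", "assoc_actor_2"]

# Illustrative rows: one event with a single actor, one with three filled columns
toy = pd.DataFrame({
    "actor1": ["Rioters (India)", "Military Forces of India (2014-)"],
    "assoc_actor_1": [np.nan, "Police Forces of India (2014-)"],
    "actor2": [np.nan, "Military Forces of Pakistan (2018-)"],
    "assoc_actor_2": [np.nan, np.nan],
})

# Each non-null actor column counts as one actor, as in No_of_Actors
toy["No_of_actors"] = toy[actor_columns].notnull().sum(axis=1)
print(toy["No_of_actors"].tolist())  # [1, 3]
```

As in the loop-based version, an associated-actor cell that lists several groups still counts as one.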

Creating a new feature that counts the number of sources

In [25]:
def Source_Split(datacut):
    SourceSplit=datacut.split("; ")
    return len(SourceSplit)
Data1['SourceCount']=Data1.source.apply(Source_Split)
Data1.drop(columns='source',inplace=True)
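
Because sources are separated by "; ", the number of sources equals the number of separators plus one. A vectorised sketch with `Series.str.count` follows; the third entry of `toy_sources` is a made-up combination.

```python
import pandas as pd

toy_sources = pd.Series([
    "Daily Excelsior",
    "Chandigarh Tribune; Daily Excelsior",
    "Times of India; Indian Express; Pioneer (India)",  # made-up combination
])

# One source more than the number of "; " separators
source_count = toy_sources.str.count("; ") + 1
print(source_count.tolist())  # [1, 2, 3]
```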

Convert source_scale into columns holding a numeric source_scale label. Since each entry has at most two source_scale values, we make two columns. The table below lists the distinct source_scale values; the label is the row index plus one, with 0 reserved for a missing second value.

In [26]:
def Source_scale_Split(datacut):
    SourceScaleSplit=datacut.split("-")
    return SourceScaleSplit
Data1['source_scale']=Data1.source_scale.apply(Source_scale_Split)
Source_Scale_columns=Data1.source_scale.apply(pd.Series).fillna(0)
Source_Scale_columns=Source_Scale_columns.rename(columns={0:"Source_scale1",1:"Source_scale2"})
# Series.append was removed in pandas 2.0; pd.concat stacks the two columns the same way
SourceScaleList=pd.concat([Source_Scale_columns.Source_scale1,Source_Scale_columns.Source_scale2[Source_Scale_columns.Source_scale2.notnull()]])
S_scale=pd.DataFrame(SourceScaleList.unique(),columns=['source_scale'])
print(S_scale)

    source_scale
0    Subnational
1    subnational
2       National
3       Regional
4          Other
5  International
6              0
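
The table above lists 'Subnational' and 'subnational' as separate values, so they receive different labels. If that split is not intended, one possible fix is to normalise the case before splitting. The sketch below is an assumption about how this could be done, not part of the original pipeline, and the values in `toy_scale` are illustrative.

```python
import pandas as pd

# Illustrative values, including a lower-case variant and a combined entry
toy_scale = pd.Series(["Subnational", "subnational", "National-Regional", "regional"])

# Title-case each word so case variants collapse into one value
normalised = toy_scale.str.title()

# Split combined entries on the hyphen into separate columns
parts = normalised.str.split("-", expand=True)
print(normalised.unique())
print(parts)
```

Applying this before building the label table would give both spellings the same label.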


In [27]:
def SourceScale_label(ss):
    # 0 marks a missing second source scale
    if ss==0:
        return 0
    # Labels start at 1 and follow the row order of S_scale
    for i in range(len(S_scale)):
        if ss==S_scale.loc[i,'source_scale']:
            return i+1
Source_Scale_columns.Source_scale2=Source_Scale_columns.Source_scale2.apply(SourceScale_label)
Source_Scale_columns.Source_scale1=Source_Scale_columns.Source_scale1.apply(SourceScale_label)
Data1=pd.concat([Data1[:],Source_Scale_columns[:]],axis=1)
Data1.drop(columns=['source_scale'],inplace=True)

In [28]:
Data1.drop(columns=['Source_scale2'],inplace=True)

In [29]:
Data1.head()

Unnamed: 0,time_precision,inter1,inter2,geo_precision,fatalities,State_label,Event_label,month,No_of_actors,SourceCount,Source_scale1
0,1,1,8,2,0,0,0,38,3,1,1
1,1,3,1,2,1,0,1,38,2,2,1
2,1,5,0,2,0,0,2,38,1,1,2
3,1,5,0,2,0,0,2,38,1,1,2
4,1,5,1,1,0,1,2,38,4,1,3


In [33]:
Data1.to_csv('Cleaned_fatalities_data.csv',index=False)

In [32]:
S_scale.to_csv('Source_Scale.csv')
Events.to_csv('Events.csv')
States.to_csv('States.csv')