# SIEVCAC preprocessing
The data from SIEVCAC is collected, mapped and filtered so that it can be joined to the main table.
For warlike actions, terrorist attacks and attacks on villages, the MELTT method is applied to ensure that events are not counted twice when they appear in more than one table.

In [1]:
# load packages and modules
import pandas as pd 
from SIEVCACFunctions import *

In [2]:
# import data 
AB=pd.read_csv('RAW DATA/AB.csv')
AP=pd.read_csv('RAW DATA/AP.csv')
MA=pd.read_csv('RAW DATA/MA.csv')
AT=pd.read_csv('RAW DATA/AT.csv')
SE=pd.read_csv('RAW DATA/SE.csv')

In [138]:
pd.set_option('display.max_columns', None)

# MELTT Implementation
* Here the data will be set up for applying MELTT
* Afterward the deduplicated data will be used
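
MELTT itself runs outside this notebook. As far as I understand it, it treats entries from different tables as potential duplicates when they fall within a chosen spatial and temporal window and agree on their event taxonomies. The sketch below only illustrates the window idea in plain Python. The function names, the 5 km and 1 day thresholds and the sample coordinates are assumptions for illustration; they are not MELTT's API and not the settings used for this project.

```python
from datetime import date
from math import asin, cos, radians, sin, sqrt


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in kilometres."""
    earth_radius_km = 6371.0
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * earth_radius_km * asin(sqrt(a))


def within_window(event_a, event_b, max_km=5.0, max_days=1):
    """True if two events are close enough in space and time to be duplicate candidates."""
    km = haversine_km(event_a["latitude"], event_a["longitude"],
                      event_b["latitude"], event_b["longitude"])
    days = abs((event_a["date"] - event_b["date"]).days)
    return km <= max_km and days <= max_days


# hypothetical events coming from two different tables
ab_event = {"latitude": 7.0270, "longitude": -71.4268, "date": date(2000, 10, 20)}
at_event = {"latitude": 7.0300, "longitude": -71.4300, "date": date(2000, 10, 21)}
far_event = {"latitude": 4.0000, "longitude": -72.0000, "date": date(2000, 10, 20)}

print(within_window(ab_event, at_event))   # close in space and time
print(within_window(ab_event, far_event))  # several hundred km apart
```

Events that pass such a window would still need to agree on the taxonomy columns (armed groups, event type) before being treated as the same event.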

## Warlike Actions (AB)

Warlike actions in the framework of armed conflict are understood as ‘(...) any act carried out in the legitimate conduct of war, taking into account that it responds to a defined military objective and makes use of lawful means and weapons in combat’ (GMH, 2013). Warfare involves at least two parties: either governmental or state armed forces and organised armed groups which, under the direction of a command, directly conduct hostilities (ICRC, Vietri, Melzer), or organised armed groups among themselves.

`Translated with DeepL.com (free version)`

I assume that events which are not 100% identical in all variables are not duplicates.
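
To make this assumption concrete: pandas' `drop_duplicates` removes a row only when it matches another row on every column listed in `subset`. A minimal sketch on made-up data (the values and the `row_id` column are hypothetical; the other column names follow the tables used here):

```python
import pandas as pd

toy = pd.DataFrame({
    "row_id": [1, 2, 3],
    "a_o": [2000, 2000, 2000],
    "municipio": ["ARAUCA", "ARAUCA", "ARAUCA"],
    "total_de_v_ctimas_del_caso": [3, 3, 4],
})

# compare on everything except the leading index-like column
dup_cols = ["a_o", "municipio", "total_de_v_ctimas_del_caso"]
toy_f = toy.drop_duplicates(subset=dup_cols, ignore_index=True)

# rows 1 and 2 are identical on dup_cols, so only the first is kept;
# row 3 differs in the victim count and is therefore not a duplicate
print(len(toy), "->", len(toy_f))
```

Under this rule, two records of the same event that differ in a single field (for example the victim count) both survive, which is why the later grouping and the MELTT step are still needed.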
 

### TO DO: 
* Put clash analysis in some sort of appendix code.

In [139]:
# drop observations outside the sample period
indexYear = AB[ (AB['a_o'] < 1988) | (AB['a_o'] > 2022) ].index
AB=AB.drop(indexYear)
print('Number of observations:',len(AB))

# define columns for defining duplicate information 
dup_cols=list(AB.columns[2:])

# drop duplicates with respects to those columns 
AB_f=AB.drop_duplicates(subset=dup_cols,ignore_index=True)

Number of observations: 33467


In [140]:
# define columns for defining duplicate information 
dup_cols=list(AB.columns[2:])

# drop duplicates with respects to those columns 
AB_f=AB.drop_duplicates(subset=dup_cols,ignore_index=True)

# see how much of the original data was lost
len_nodup=len(AB_f)
print('relative observations lost',round(1-len(AB_f)/len(AB),4))

relative observations lost 0.0106


One can see that 1.06% of the observations are lost after removing the duplicates.

In [141]:
# drop observations with no information on year, municipality or point location
# (the point [-72, 4] is treated here as a placeholder for a missing location)
indexMiss = AB_f[ (AB_f['a_o'] == 0) | (AB_f['c_digo_dane_de_municipio'] == 0) | (AB_f['latitud_longitud']=="{'type': 'Point', 'coordinates': [-72, 4]}") ].index
AB_f=AB_f.drop(indexMiss)
print('total relative observations lost',round(1-len(AB_f)/len(AB),4))
print('missing info relative observations lost',round(((len_nodup-len(AB_f))/len(AB)),4))

total relative observations lost 0.0146
missing info relative observations lost 0.004


One can see that 0.4% of the observations are lost after removing observations with missing information on year, municipality or point location, for a total loss of 1.46%.
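
The two shares add up because both are expressed relative to the same denominator, the number of observations in the sample period (here 0.0106 + 0.004 = 0.0146). A small sketch of that bookkeeping, with a hypothetical helper `loss_shares` and made-up counts rather than the AB counts:

```python
def loss_shares(n_raw, n_nodup, n_clean):
    """Return (duplicate share, missing-info share, total share), all relative to n_raw."""
    dup_share = (n_raw - n_nodup) / n_raw
    missing_share = (n_nodup - n_clean) / n_raw
    total_share = 1 - n_clean / n_raw
    return dup_share, missing_share, total_share


# hypothetical counts: raw rows, rows after deduplication, rows after dropping missing info
dup, missing, total = loss_shares(n_raw=1000, n_nodup=990, n_clean=985)
print(round(dup, 4), round(missing, 4), round(total, 4))
```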

In [142]:
# map variables
AB_f['map']=[map_AB_AP(modality=mod,iniciative=ini,g1=g1,g2=g2,g3=g3) for mod,ini,g1,g2,g3 in zip(AB_f['modalidad'],
                                                                                               AB_f['iniciativa'],
                                                                                               AB_f['grupo_armado_1'],
                                                                                               AB_f['grupo_armado_2'],
                                                                                               AB_f['grupo_armado_3'])]
# create flags 
for flag in ['clash', 'govattack', 'guerrattack', 'other', 'parattack', 'posdattack']:
    AB_f[f'{flag}']=[1 if label==flag else 0 for label in AB_f['map']]

In [143]:
AB_f.head(2)

Unnamed: 0.1,Unnamed: 0,id_caso,id_caso_relacionado,a_o,mes,d_a,c_digo_dane_de_municipio,municipio,departamento,regi_n,modalidad,iniciativa,tipo_de_unidad_atacada,grupo_armado_1,descripci_n_grupo_armado,grupo_armado_2,descripci_n_grupo_armado_1,lesionados_civiles,capturados,lesionados_combatientes,militares,polic_as,otras_fuerzas_armadas,agentes_del_estado_sin,total_agentes_del_estado,guerrilleros,paramilitares,grupos_posdesmovilizaci_n,combatientes_sin_informaci,otros_grupos_armados,total_combatientes_de_grupos,total_combatientes,personas_sin_informaci_n,total_civiles,ventaja_militar,total_de_v_ctimas_del_caso,latitud_longitud,grupo_al_que_pertenecen_los,grupo_armado_3,descripci_n_grupo_armado_2,map,clash,govattack,guerrattack,other,parattack,posdattack
0,0,1,CR000497,2006,3,12,27073,BAGADO,CHOCO,ATRATO,COMBATE Y/O CONTACTO ARMADO,GRUPOS ARMADOS ORGANIZADOS,SIN INFORMACIÓN,AGENTE DEL ESTADO,EJÉRCITO NACIONAL,GUERRILLA,FARC,6,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,GRUPOS ARMADOS ORGANIZADOS,1,"{'type': 'Point', 'coordinates': [-76.41529298...",,,,clash,1,0,0,0,0,0
1,4,10000,,1996,2,17,85010,AGUAZUL,CASANARE,PIEDEMONTE LLANERO,COMBATE Y/O CONTACTO ARMADO,FUERZAS ARMADAS ESTATALES,SIN INFORMACIÓN,AGENTE DEL ESTADO,EJÉRCITO NACIONAL,GUERRILLA,ELN,0,0,0,0,0,0,0,0,2,0,0,0,0,2,2,0,0,FUERZAS ARMADAS ESTATALES,2,"{'type': 'Point', 'coordinates': [-72.45852372...",,,,clash,1,0,0,0,0,0


In [144]:
# rows that share case, date, location and armed groups are treated as one event:
# sum victims and flags, so a flag can exceed 1 while still referring to a single event
AB_g=AB_f.groupby(by=['id_caso_relacionado', 'a_o', 'mes', 'd_a', 'latitud_longitud',
       'c_digo_dane_de_municipio', 'municipio', 'departamento', 'regi_n',
        'grupo_armado_1',
       'descripci_n_grupo_armado', 'grupo_armado_2',
       'descripci_n_grupo_armado_1','grupo_armado_3',
       'descripci_n_grupo_armado_2'], dropna=False).sum(numeric_only=True)[['total_de_v_ctimas_del_caso','clash', 'govattack',
       'guerrattack', 'other', 'parattack', 'posdattack']].reset_index()

In [145]:
# check again for duplicates, now ignoring modalidad, iniciativa and tipo de unidad
AB_g[AB_g.duplicated(subset=['id_caso_relacionado', 'a_o', 'mes', 'd_a', 'latitud_longitud',
       'c_digo_dane_de_municipio', 'municipio', 'departamento', 'regi_n',
       'grupo_armado_1',
       'descripci_n_grupo_armado', 'grupo_armado_2',
       'descripci_n_grupo_armado_1','grupo_armado_3',
       'descripci_n_grupo_armado_2','clash','govattack','guerrattack','parattack'], keep=False)]

Unnamed: 0,id_caso_relacionado,a_o,mes,d_a,latitud_longitud,c_digo_dane_de_municipio,municipio,departamento,regi_n,grupo_armado_1,descripci_n_grupo_armado,grupo_armado_2,descripci_n_grupo_armado_1,grupo_armado_3,descripci_n_grupo_armado_2,total_de_v_ctimas_del_caso,clash,govattack,guerrattack,other,parattack,posdattack


In [146]:
# check if there is more than 1 observation for every day in every location
for flag in ['clash', 'govattack', 'guerrattack', 'other', 'parattack', 'posdattack']:
    print(flag,set(AB_g[f'{flag}']))

clash {0, 1, 2, 3, 4}
govattack {0, 1, 2}
guerrattack {0, 1, 2, 3}
other {0, 1, 2}
parattack {0, 1}
posdattack {0, 1, 2}
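
Values above 1 appear because rows that share all grouping keys are collapsed into one row and their flags are summed. A toy illustration (made-up rows, reduced to two grouping keys for brevity):

```python
import pandas as pd

rows = pd.DataFrame({
    "id_caso_relacionado": ["CR1", "CR1", None],
    "a_o": [2000, 2000, 2001],
    "clash": [1, 1, 0],
    "guerrattack": [0, 0, 1],
})

# dropna=False keeps the group whose related-case id is missing
grouped = (
    rows.groupby(["id_caso_relacionado", "a_o"], dropna=False)
        .sum(numeric_only=True)
        .reset_index()
)
print(grouped)
```

The two `CR1` rows end up as a single row with `clash` equal to 2, while the row without a related-case id is kept as its own group.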


Format the table so that it matches the input format for MELTT.
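
The coordinate column stores each location as a string that looks like a GeoJSON point, with longitude first and latitude second. One way to parse such strings without calling `eval` on file contents is `ast.literal_eval`, which only accepts Python literals. `parse_point` below is a hypothetical helper, not part of `SIEVCACFunctions`:

```python
import ast


def parse_point(text):
    """Parse a "{'type': 'Point', 'coordinates': [lon, lat]}" string into (lat, lon)."""
    point = ast.literal_eval(text)  # evaluates literals only, never arbitrary code
    lon, lat = point["coordinates"]
    return lat, lon


lat, lon = parse_point("{'type': 'Point', 'coordinates': [-72, 4]}")
print(lat, lon)  # 4 -72
```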

In [147]:
# map coordinates to latitude and longitude 
# create a column with the dictionary as objects 
AB_g['latitud_longitud_dict']=[eval(_dict) for _dict in AB_g['latitud_longitud']]
AB_g['latitud']=[_dict['coordinates'][1] for _dict in AB_g['latitud_longitud_dict']]
AB_g['longitud']=[_dict['coordinates'][0] for _dict in AB_g['latitud_longitud_dict']]

# create a date variable
AB_g['date']=['-'.join([str(year),str(month),str(day)]) for year,month,day in zip(AB_g['a_o'],AB_g['mes'],AB_g['d_a'])]

# create a variable for the event 
AB_g['event']=AB_g.index

# dataset variable
AB_g['dataset']='AB'

In [148]:
AB_g[['latitud_longitud_dict']]

Unnamed: 0,latitud_longitud_dict
0,"{'type': 'Point', 'coordinates': [-71.42675117..."
1,"{'type': 'Point', 'coordinates': [-70.74635030..."
2,"{'type': 'Point', 'coordinates': [-74.78996877..."
3,"{'type': 'Point', 'coordinates': [-72.91148891..."
4,"{'type': 'Point', 'coordinates': [-73.96271743..."
...,...
31913,"{'type': 'Point', 'coordinates': [-74.81383960..."
31914,"{'type': 'Point', 'coordinates': [-73.46839757..."
31915,"{'type': 'Point', 'coordinates': [-77.23558958..."
31916,"{'type': 'Point', 'coordinates': [-77.24945028..."


In [174]:
#rename variables into taxonomies 
    # TO DO: get rid of complexity if I do not use it in the end
rename_dict={'id_caso_relacionado':'relcase_tax',
             'grupo_armado_1':'g1_tax', 
             'descripci_n_grupo_armado':'dg1_tax',
             'presunto_responsable':'g1_tax', 
             'descripci_n_presunto':'dg1_tax',
            'grupo_armado_2':'g2_tax',
       'descripci_n_grupo_armado_1':'dg2_tax', 'grupo_armado_3':'g3_tax',
       'descripci_n_grupo_armado_2':'dg3_tax', 
       'total_de_v_ctimas_del_caso':'victims_tax',
       'clash':'clash_tax',
       'govattack':'govattack_tax', 
       'guerrattack':'guerrattack_tax', 
       'parattack':'parattack_tax',
       'posdattack':'posdattack_tax',
       'latitud':'latitude',
       'longitud':'longitude',
       'a_o':'year'}
AB_g.rename(columns=rename_dict, inplace=True)

In [150]:
AB_g.to_csv('MELTT/ABforMELTT_allcols.csv')

In [151]:
# drop observations with no information on month or day, since MELTT needs complete dates
indexMiss = AB_g[ (AB_g['mes'] == 0) | (AB_g['d_a']==0) ].index
len_before=len(AB_g)
AB_g=AB_g.drop(indexMiss)
print('observations lost',len_before-len(AB_g))

# create enddate column
AB_g['enddate']=AB_g['date']

# get only relevant columns 
AB_meltt=AB_g[['dataset','event','date','enddate','latitude','longitude','relcase_tax',
              'g1_tax', 'dg1_tax', 'g2_tax', 'dg2_tax', 'g3_tax', 'dg3_tax',
            'victims_tax', 'clash_tax', 'govattack_tax', 'guerrattack_tax','parattack_tax', 'posdattack_tax','c_digo_dane_de_municipio','year']]

observations lost 321


In [152]:
AB_meltt.head(2)

Unnamed: 0,dataset,event,date,enddate,latitude,longitude,relcase_tax,g1_tax,dg1_tax,g2_tax,dg2_tax,g3_tax,dg3_tax,victims_tax,clash_tax,govattack_tax,guerrattack_tax,parattack_tax,posdattack_tax,c_digo_dane_de_municipio,year
0,AB,0,2000-10-20,2000-10-20,7.026981,-71.426751,CR000004,AGENTE DEL ESTADO,EJÉRCITO NACIONAL - FUERZA AÉREA,GUERRILLA,FARC,,,13,1,0,0,0,0,81065,2000
1,AB,1,2000-12-15,2000-12-15,7.077359,-70.74635,CR000005,AGENTE DEL ESTADO,EJÉRCITO NACIONAL,GUERRILLA,FARC,,,0,0,0,1,0,0,81001,2000


In [153]:
print('Check for date missings:', AB_meltt['date'].isna().sum())
# save output
AB_meltt.to_csv('MELTT/ABforMELTT.csv')

Check for date missings: 0
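
The check above only counts missing values. Because the date is assembled as a string, an impossible combination such as `2000-2-30` would not show up as missing. A hedged sketch of a stricter check using only the standard library (`is_valid_date` is a hypothetical helper):

```python
from datetime import date


def is_valid_date(year, month, day):
    """True if year-month-day is a real calendar date."""
    try:
        date(int(year), int(month), int(day))
    except ValueError:
        return False
    return True


print(is_valid_date(2000, 10, 20))  # True
print(is_valid_date(2000, 2, 30))   # False: February has no 30th
print(is_valid_date(2000, 0, 15))   # False: month coded as 0 (missing)
```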


## Attacks on villages (AP)
An attack on a village is understood as an incursion by an armed group that involves the temporary occupation of a territory and a continued military action aimed at the destruction of a military objective within an urban area or population centre, and which is accompanied by attacks on and/or damage to the civilian population.


In [154]:
# get only observations from sample period
indexYear = AP[ (AP['a_o'] < 1988) | (AP['a_o'] > 2022) ].index
AP=AP.drop(indexYear)
# save number of observations in the sample period
len_ap=len(AP)
print('Number of observations:',len_ap)
# define columns for defining duplicate information 
dup_cols=list(AP.columns[2:])

# drop duplicates with respects to those columns 
AP_f=AP.drop_duplicates(subset=dup_cols,ignore_index=True)
len_nodup=len(AP_f)
print('relative observations lost',round(1-len(AP_f)/len_ap,4))
print('absolute observations lost',round(len_ap-len(AP_f),4))

Number of observations: 391
relative observations lost 0.0026
absolute observations lost 1


In [155]:
# drop observations with no information on year, municipality or point location
indexMiss = AP_f[ (AP_f['a_o'] == 0) | (AP_f['c_digo_dane_de_municipio'] == 0) | (AP_f['latitud_longitud']=="{'type': 'Point', 'coordinates': [-72, 4]}") ].index
AP_f=AP_f.drop(indexMiss)
print('total relative observations lost',round(1-len(AP_f)/len(AP),4))
print('missing info relative observations lost',round(((len_nodup-len(AP_f))/len(AP)),4))

total relative observations lost 0.0026
missing info relative observations lost 0.0


In [156]:
# map variables 
AP_f['map']=[map_AB_AP(modality='',iniciative='',g1=g1,g2=g2,g3=g3, ap_flag=True) for g1,g2,g3 in zip(
                                                                                               AP_f['grupo_armado_1'],
                                                                                               AP_f['grupo_armado_2'],
                                                                                               AP_f['grupo_armado_3'])]
# create flags 
for flag in ['guerrattack','parattack']:
    AP_f[f'{flag}']=[1 if label==flag else 0 for label in AP_f['map']]

In [157]:
# check duplicates 
AP_f[AP_f.duplicated(subset=['id_caso_relacionado', 'a_o', 'mes', 'd_a', 'latitud_longitud',
       'c_digo_dane_de_municipio', 'municipio', 'departamento', 'regi_n',
        'grupo_armado_1',
       'descripci_n_grupo_armado', 'grupo_armado_2',
       'descripci_n_grupo_armado_1','grupo_armado_3',
       'descripci_n_grupo_armado_2'])]
# none 

Unnamed: 0.1,Unnamed: 0,id_caso,id_caso_relacionado,a_o,mes,d_a,c_digo_dane_de_municipio,municipio,departamento,regi_n,grupo_armado_1,descripci_n_grupo_armado,grupo_armado_2,descripci_n_grupo_armado_1,abandono_o_despojo_forzado,amenaza_o_intimidaci_n,ataque_contra_misi_n_m_dica,confinamiento_o_restricci,desplazamiento_forzado,extorsi_n,lesionados_civiles,pillaje,tortura,capturados,escudo_humano,lesionados_combatientes,militares,polic_as,otras_fuerzas_armadas,agentes_del_estado_sin,total_agentes_del_estado,guerrilleros,paramilitares,grupos_posdesmovilizaci_n,combatientes_sin_informaci,otros_grupos_armados,total_combatientes_de_grupos,total_combatientes,personas_sin_informaci_n,total_civiles,ventaja_militar,total_de_v_ctimas_del_caso,latitud_longitud,grupo_armado_3,descripci_n_grupo_armado_2,otro_hecho_simult_neo,grupo_al_que_pertenecen_los,map,guerrattack,parattack


In [158]:
AP_g=AP_f.groupby(by=['id_caso_relacionado', 'a_o', 'mes', 'd_a', 'latitud_longitud',
       'c_digo_dane_de_municipio', 'municipio', 'departamento', 'regi_n',
        'grupo_armado_1',
       'descripci_n_grupo_armado', 'grupo_armado_2',
       'descripci_n_grupo_armado_1','grupo_armado_3',
       'descripci_n_grupo_armado_2'], dropna=False).sum(numeric_only=True)[['total_de_v_ctimas_del_caso',
       'guerrattack', 'parattack']].reset_index()

Format the table so that it fits the MELTT input format.
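
The same formatting steps (coordinates, date, event id, dataset label) are applied to every table that goes into MELTT. As a possible simplification, they could be collected in one function. This is only a sketch: `add_meltt_columns` is a hypothetical helper, and it uses `ast.literal_eval` instead of `eval` to parse the coordinate strings.

```python
import ast

import pandas as pd


def add_meltt_columns(df, dataset):
    """Return a copy of df with location, date, event and dataset columns added."""
    out = df.copy()
    points = out["latitud_longitud"].map(ast.literal_eval)
    out["latitude"] = points.map(lambda p: p["coordinates"][1])
    out["longitude"] = points.map(lambda p: p["coordinates"][0])
    out["date"] = (
        out["a_o"].astype(str) + "-" + out["mes"].astype(str) + "-" + out["d_a"].astype(str)
    )
    out["enddate"] = out["date"]  # single-day events: end date equals start date
    out["event"] = out.index
    out["dataset"] = dataset
    return out


# hypothetical one-row table
toy = pd.DataFrame({
    "a_o": [2009], "mes": [5], "d_a": [9],
    "latitud_longitud": ["{'type': 'Point', 'coordinates': [-77.5, 1.3]}"],
})
result = add_meltt_columns(toy, "AP")
print(result[["dataset", "event", "date", "latitude", "longitude"]])
```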


In [159]:
# add other flag variables 
AP_g['clash']=0
AP_g['govattack']=0
AP_g['posdattack']=0
# map coordinates to latitude and longitude 
# create a column with the dictionary as objects 
AP_g['latitud_longitud_dict']=[eval(_dict) for _dict in AP_g['latitud_longitud']]
AP_g['latitud']=[_dict['coordinates'][1] for _dict in AP_g['latitud_longitud_dict']]
AP_g['longitud']=[_dict['coordinates'][0] for _dict in AP_g['latitud_longitud_dict']]

# create a date variable
AP_g['date']=['-'.join([str(year),str(month),str(day)]) for year,month,day in zip(AP_g['a_o'],AP_g['mes'],AP_g['d_a'])]

# create a variable for the event 
AP_g['event']=AP_g.index

# dataset variable
AP_g['dataset']='AP'

#rename variables into taxonomies 
AP_g.rename(columns=rename_dict, inplace=True)

In [160]:
# save table
AP_g.to_csv('MELTT/APforMELTT_allcols.csv')

In [161]:
# drop observations with no information on month or day, since MELTT needs complete dates
indexMiss = AP_g[ (AP_g['mes'] == 0) | (AP_g['d_a']==0) ].index
len_before=len(AP_g)
AP_g=AP_g.drop(indexMiss)
print('observations lost',len_before-len(AP_g))

observations lost 1


In [162]:
# arrange the columns as in the MELTT example input
AP_g['enddate']=AP_g['date']
AP_meltt=AP_g[['dataset','event','date','enddate','latitude','longitude','relcase_tax',
              'g1_tax', 'dg1_tax', 'g2_tax', 'dg2_tax', 'g3_tax', 'dg3_tax',
            'victims_tax', 'clash_tax', 'govattack_tax', 'guerrattack_tax','parattack_tax', 'posdattack_tax','c_digo_dane_de_municipio','year']]

In [163]:
print('Check for date missings:', AP_meltt['date'].isna().sum())
AP_meltt.to_csv('MELTT/APforMELTT.csv')

Check for date missings: 0


## Terrorist Attacks (AT)
A terrorist attack is understood as any attack perpetrated with explosives in a densely populated area that affects multiple persons or civilian properties, regardless of whether the target of the action is civilian or military.

In [164]:
# get only observations from sample period
indexYear = AT[ (AT['a_o'] < 1988) | (AT['a_o'] > 2022) ].index
AT=AT.drop(indexYear)
# save number of observations in the sample period
len_at=len(AT)
print('Number of observations:',len_at)
# define columns for defining duplicate information 
dup_cols=list(AT.columns[2:])

# drop duplicates with respects to those columns 
AT_f=AT.drop_duplicates(subset=dup_cols,ignore_index=True)
len_nodup=len(AT_f)
print('relative observations lost',round(1-len(AT_f)/len_at,4))
print('absolute observations lost',round(len_at-len(AT_f),4))

Number of observations: 219
relative observations lost 0.0
absolute observations lost 0


In [165]:
# drop observations with no information on year, municipality or point location
indexMiss = AT_f[ (AT_f['a_o'] == 0) | (AT_f['c_digo_dane_de_municipio'] == 0)  | (AT_f['latitud_longitud']=="{'type': 'Point', 'coordinates': [-72, 4]}") ].index
AT_f=AT_f.drop(indexMiss)
print('total relative observations lost',round(1-len(AT_f)/len(AT),4))
print('missing info relative observations lost',round(((len_nodup-len(AT_f))/len(AT)),4))

total relative observations lost 0.0
missing info relative observations lost 0.0


In [166]:
# map
AT_f=map_resposable('attack',AT_f)
#create flags
for flag in ['parattack','guerrattack','posdattack']: 
    AT_f[f'{flag}']=[1 if f else 0 for f  in AT_f[f'{flag}']]

In [167]:
AT_g=AT_f.groupby(by=['id_caso_relacionado', 'a_o', 'mes', 'd_a', 'latitud_longitud',
       'c_digo_dane_de_municipio', 'municipio', 'departamento', 'regi_n',
        'presunto_responsable',
       'descripci_n_presunto'], dropna=False).sum(numeric_only=True)[['total_de_v_ctimas_del_caso',
       'guerrattack', 'parattack','posdattack']].reset_index()

In [168]:
# check if flag repeated 
for flag in ['parattack','guerrattack','posdattack']:
    print(flag,set(AT_g[f'{flag}']))

parattack {0, 1}
guerrattack {0, 1}
posdattack {0, 1}


Format the table so that it is compatible with MELTT.
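
The AT table has only one responsible actor per case, so the columns for a second and third armed group have to be added by hand before the MELTT columns can be selected. A small check like the following could list which required columns are still absent. `missing_columns` is a hypothetical helper; the column list is the selection used for the MELTT input tables in this notebook.

```python
MELTT_COLUMNS = [
    "dataset", "event", "date", "enddate", "latitude", "longitude",
    "relcase_tax", "g1_tax", "dg1_tax", "g2_tax", "dg2_tax", "g3_tax", "dg3_tax",
    "victims_tax", "clash_tax", "govattack_tax", "guerrattack_tax",
    "parattack_tax", "posdattack_tax", "c_digo_dane_de_municipio", "year",
]


def missing_columns(columns, required=MELTT_COLUMNS):
    """Return the required columns that are absent, in their original order."""
    present = set(columns)
    return [col for col in required if col not in present]


# hypothetical AT-like table that has no second or third armed group yet
at_columns = [c for c in MELTT_COLUMNS if c not in {"g2_tax", "dg2_tax", "g3_tax", "dg3_tax"}]
print(missing_columns(at_columns))  # ['g2_tax', 'dg2_tax', 'g3_tax', 'dg3_tax']
```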

In [169]:
# add other flag variables 
AT_g['clash']=0
AT_g['govattack']=0
AT_g['g2_tax']=None
AT_g['dg2_tax']=None
AT_g['g3_tax']=None
AT_g['dg3_tax']=None
# map coordinates to latitude and longitude 
# create a column with the dictionary as objects 
AT_g['latitud_longitud_dict']=[eval(_dict) for _dict in AT_g['latitud_longitud']]
AT_g['latitud']=[_dict['coordinates'][1] for _dict in AT_g['latitud_longitud_dict']]
AT_g['longitud']=[_dict['coordinates'][0] for _dict in AT_g['latitud_longitud_dict']]

# create a date variable
AT_g['date']=['-'.join([str(year),str(month),str(day)]) for year,month,day in zip(AT_g['a_o'],AT_g['mes'],AT_g['d_a'])]

# create a variable for the event 
AT_g['event']=AT_g.index

# dataset variable
AT_g['dataset']='AT'

#rename variables into taxonomies 
AT_g.rename(columns=rename_dict, inplace=True)

In [170]:
AT_g.head(2)

Unnamed: 0,relcase_tax,year,mes,d_a,latitud_longitud,c_digo_dane_de_municipio,municipio,departamento,regi_n,g1_tax,dg1_tax,victims_tax,guerrattack_tax,parattack_tax,posdattack_tax,clash_tax,govattack_tax,g2_tax,dg2_tax,g3_tax,dg3_tax,latitud_longitud_dict,latitude,longitude,date,event,dataset
0,CR000025,2009,5,9,"{'type': 'Point', 'coordinates': [-77.59445713...",52678,SAMANIEGO,NARIÑO,OCCIDENTE DE NARIÑO,GUERRILLA,ELN,0.0,1,0,0,0,0,,,,,"{'type': 'Point', 'coordinates': [-77.59445713...",1.335352,-77.594457,2009-5-9,0,AT
1,CR002067,2011,7,9,"{'type': 'Point', 'coordinates': [-76.27112721...",19821,TORIBIO,CAUCA,NORTE DEL CAUCA,GUERRILLA,FARC,4.0,1,0,0,0,0,,,,,"{'type': 'Point', 'coordinates': [-76.27112721...",2.95145,-76.271127,2011-7-9,1,AT


In [171]:
# save data
AT_g.to_csv('MELTT/ATforMELTT_allcols.csv')

In [172]:
# arrange the columns as in the MELTT example input
AT_g['enddate']=AT_g['date']
AT_meltt=AT_g[['dataset','event','date','enddate','latitude','longitude','relcase_tax',
              'g1_tax', 'dg1_tax', 'g2_tax', 'dg2_tax', 'g3_tax', 'dg3_tax',
            'victims_tax', 'clash_tax', 'govattack_tax', 'guerrattack_tax','parattack_tax', 'posdattack_tax','c_digo_dane_de_municipio','year']]

In [173]:
print('Check for date missings:', AT_meltt['date'].isna().sum())
AT_meltt.to_csv('MELTT/ATforMELTT.csv')

Check for date missings: 0


## Aggregate combined data

### TO DO: 
- Decide how to manually check whether observations without information on the day (or month) are duplicated.
- These observations should presumably be included, right?
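
One possible starting point for that manual check, sketched on made-up data: pair each undated event with dated events from the same year, municipality and armed group, and inspect the resulting candidate pairs by hand. The matching keys and the sample rows are assumptions; this is not something the notebook currently does.

```python
import pandas as pd

# hypothetical events; mes == 0 or d_a == 0 marks a missing month or day
events = pd.DataFrame({
    "event": [0, 1, 2, 3],
    "year": [2000, 2000, 2000, 2001],
    "mes": [10, 0, 5, 0],
    "d_a": [20, 0, 0, 0],
    "c_digo_dane_de_municipio": [81065, 81065, 81001, 5001],
    "g1_tax": ["GUERRILLA", "GUERRILLA", "GUERRILLA", "GUERRILLA"],
})

undated_mask = (events["mes"] == 0) | (events["d_a"] == 0)
undated = events[undated_mask]
dated = events[~undated_mask]

# pair every undated event with dated events in the same year, municipality and group
keys = ["year", "c_digo_dane_de_municipio", "g1_tax"]
candidates = undated.merge(dated, on=keys, suffixes=("_undated", "_dated"))
print(candidates[["event_undated", "event_dated"] + keys])
```

Only event 1 finds a dated counterpart here; undated events without any candidate could be added to the aggregate directly.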
 

In [3]:
# read data 
post_meltt=pd.read_csv('MELTT/matchedevents.csv')

In [4]:
# aggregate by municipality and year
siev_ca=post_meltt[["victims_tax","clash_tax", "govattack_tax","guerrattack_tax",
                    "parattack_tax","posdattack_tax",'c_digo_dane_de_municipio',
                    'year']].groupby(['c_digo_dane_de_municipio','year']).sum().reset_index()
# rename columns
rename_dict={'c_digo_dane_de_municipio':'muncode','victims_tax': 'causalities', 'clash_tax': 'clashes',
             'govattack_tax':'govattacks','guerrattack_tax':'guerrattacks','parattack_tax':'parattacks',
             'posdattack_tax':'posdattacks'}
siev_ca=siev_ca.rename(columns=rename_dict)

In [5]:
siev_ca.head(3)

Unnamed: 0,muncode,year,causalities,clashes,govattacks,guerrattacks,parattacks,posdattacks
0,5000,1991,1,1,0,0,0,0
1,5000,1992,7,0,0,1,0,0
2,5000,1993,1,1,0,1,0,0


# Massacres
A massacre is understood as the intentional homicide of four (4) or more persons in a state of defenselessness and under the same circumstances of time and place, which is distinguished by the public exposure of violence and the asymmetrical relationship between the armed actor and the civilian population, without interaction between armed actors (GMH, 2013).

In [6]:
# get only observations from sample period
indexYear = MA[ (MA['a_o'] < 1988) | (MA['a_o'] > 2022) ].index
MA=MA.drop(indexYear)
# save number of observations in the sample period
len_ma=len(MA)
print('Number of observations:',len_ma)
# define columns for defining duplicate information 
dup_cols=list(MA.columns[2:])

# drop duplicates with respects to those columns 
MA_f=MA.drop_duplicates(subset=dup_cols,ignore_index=True)
len_nodup=len(MA_f)
print('relative observations lost',round(1-len(MA_f)/len_ma,4))
print('absolute observations lost',round(len_ma-len(MA_f),4))

Number of observations: 3710
relative observations lost 0.0022
absolute observations lost 8


In [8]:
# drop observations with no information on year, municipality or point location
indexMiss = MA_f[ (MA_f['a_o'] == 0) | (MA_f['c_digo_dane_de_municipio'] == 0)  | (MA_f['latitud_longitud']=="{'type': 'Point', 'coordinates': [-72, 4]}") ].index
MA_f=MA_f.drop(indexMiss)
print('total relative observations lost',round(1-len(MA_f)/len(MA),4))
print('missing info relative observations lost',round(((len_nodup-len(MA_f))/len(MA)),4))
print('absolute observations lost',round(len_nodup-len(MA_f),4))

total relative observations lost 0.0027
missing info relative observations lost 0.0005
absolute observations lost 2


In [9]:
# map
MA_f=map_resposable('mass',MA_f)
#create flags 
for flag in ['parmass','guerrmass','posdmass']: 
    MA_f[f'{flag}']=[1 if f else 0 for f  in MA_f[f'{flag}']]

In [10]:
# group by merging variables, that should identify one case 
MA_g=MA_f.groupby(['id_caso_relacionado', 'a_o', 'mes', 'd_a',
       'c_digo_dane_de_municipio','latitud_longitud'],dropna=False).sum(numeric_only=True)[['total_de_v_ctimas_del_caso','parmass','guerrmass','posdmass']].reset_index()

In [11]:
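# check flag values (note: this inspects MA_f, i.e. the rows before grouping;
# inspect MA_g instead to see whether flags add up to more than 1 within a case)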
for flag in ['parmass','guerrmass','posdmass']:
    print(flag,set(MA_f[f'{flag}']))

parmass {0, 1}
guerrmass {0, 1}
posdmass {0, 1}


In [12]:
MA_g.head(4)

Unnamed: 0,id_caso_relacionado,a_o,mes,d_a,c_digo_dane_de_municipio,latitud_longitud,total_de_v_ctimas_del_caso,parmass,guerrmass,posdmass
0,CR000001,1997,2,27,70508,"{'type': 'Point', 'coordinates': [-75.22908444...",6,1,0,0
1,CR000002,1997,10,25,5361,"{'type': 'Point', 'coordinates': [-75.51925813...",17,1,0,0
2,CR000104,1995,9,20,5045,"{'type': 'Point', 'coordinates': [-76.62596065...",25,0,1,0
3,CR000130,1998,7,4,20060,"{'type': 'Point', 'coordinates': [-73.88876766...",4,0,1,0


In [13]:
# aggregate by municipality and year
mass_agg=MA_g[["total_de_v_ctimas_del_caso","parmass", "guerrmass","posdmass",
                "a_o",'c_digo_dane_de_municipio']].groupby(['c_digo_dane_de_municipio','a_o']).sum().reset_index()
# rename columns
rename_dict={'c_digo_dane_de_municipio':'muncode',"total_de_v_ctimas_del_caso": 'causalities', 'a_o': 'year'}
mass_agg=mass_agg.rename(columns=rename_dict)


In [14]:
mass_agg.head(3)

Unnamed: 0,muncode,year,causalities,parmass,guerrmass,posdmass
0,5000,1996,5,1,0,0
1,5001,1988,4,0,0,0
2,5001,1990,99,2,3,0


In [15]:
len(mass_agg)

2218

# Kidnappings
Kidnapping is the snatching, abduction, retention or concealment of a person against his or her will, by means of intimidation, violence or deception, by or with the participation of actors in the armed conflict. It can be simple, when it has no manifest purpose, or extortive, when it is carried out with the purpose of demanding a profit or any other benefit in exchange for the person's freedom, or so that something is done or omitted, or for publicity or political purposes (Congress of the Republic, 2000).


In [16]:
# get only observations from sample period
indexYear = SE[ (SE['a_o'] < 1988) | (SE['a_o'] > 2022) ].index
SE=SE.drop(indexYear)
# save number of observations in the sample period
len_SE=len(SE)
print('Number of observations:',len_SE)
# define columns for defining duplicate information 
dup_cols=list(SE.columns[2:])

# drop duplicates with respects to those columns 
SE_f=SE.drop_duplicates(subset=dup_cols,ignore_index=True)
len_nodup=len(SE_f)
print('relative observations lost',round(1-len(SE_f)/len_SE,4))
print('absolute observations lost',round(len_SE-len(SE_f),4))

Number of observations: 28078
relative observations lost 0.1441
absolute observations lost 4046


In [18]:
# drop observations with no information on year, municipality or point location
indexMiss = SE_f[ (SE_f['a_o'] == 0) | (SE_f['c_digo_dane_de_municipio'] == 0)  | (SE_f['latitud_longitud']=="{'type': 'Point', 'coordinates': [-72, 4]}") ].index
SE_f=SE_f.drop(indexMiss)
print('total relative observations lost',round(1-len(SE_f)/len(SE),4))
print('missing info relative observations lost',round(((len_nodup-len(SE_f))/len(SE)),4))
print('absolute observations lost',round(len_nodup-len(SE_f),4))

total relative observations lost 0.1571
missing info relative observations lost 0.013
absolute observations lost 365


In [19]:
# map
SE_f=map_resposable('sec',SE_f)

# create flags and variable with victims
for flag in ['parsec','guerrsec','posdsec']: 
    SE_f[f'{flag}']=[1 if f else 0 for f  in SE_f[f'{flag}']]
    SE_f[f'n_{flag}']=[victims if f else 0 for f,victims  in zip(SE_f[f'{flag}'],SE_f['total_de_v_ctimas_del_caso'])]


In [20]:
# group by merging variables, which should identify one case on one day in one location
    # note that it could be plausible that more than one kidnapping took place on the same day (e.g. at roadblocks, "retenes")
    # TO DO: 
        # find literature to support this statement
SE_g=SE_f.groupby(['id_caso_relacionado', 'a_o', 'mes', 'd_a',
       'c_digo_dane_de_municipio','latitud_longitud'],dropna=False).sum(numeric_only=True)[['parsec','guerrsec','posdsec','n_parsec','n_guerrsec','n_posdsec']].reset_index()
# check flag values (note: this inspects SE_f, i.e. the rows before grouping)
# expected to be 0 or 1
for flag in ['parsec','guerrsec','posdsec']:
    print(flag,set(SE_f[f'{flag}']))

parsec {0, 1}
guerrsec {0, 1}
posdsec {0, 1}
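
The `n_*` columns carry the number of victims attributed to each group: the victim count where the group's flag is 1, and 0 otherwise. Since the flags are 0/1 at this stage, the same result can be obtained by multiplying the flag by the victim count. A toy sketch with made-up rows:

```python
import pandas as pd

toy = pd.DataFrame({
    "guerrsec": [1, 0, 1],
    "parsec": [0, 1, 0],
    "total_de_v_ctimas_del_caso": [3, 2, 1],
})

for flag in ["guerrsec", "parsec"]:
    # victims are attributed to a group only where its flag equals 1
    toy[f"n_{flag}"] = toy[flag] * toy["total_de_v_ctimas_del_caso"]

print(toy[["n_guerrsec", "n_parsec"]])
```

This shortcut only holds while the flags are strictly 0 or 1, so it would have to be applied before any grouping step that sums them.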


In [21]:
SE_g.head(5)

Unnamed: 0,id_caso_relacionado,a_o,mes,d_a,c_digo_dane_de_municipio,latitud_longitud,parsec,guerrsec,posdsec,n_parsec,n_guerrsec,n_posdsec
0,CR000038,1988,5,27,68152,"{'type': 'Point', 'coordinates': [-72.62708286...",0,1,0,0,3,0
1,CR000041,1988,10,5,86001,"{'type': 'Point', 'coordinates': [-76.65027141...",0,1,0,0,4,0
2,CR000048,1989,4,21,5411,"{'type': 'Point', 'coordinates': [-75.81283226...",0,1,0,0,1,0
3,CR000052,1989,7,19,5040,"{'type': 'Point', 'coordinates': [-75.14834842...",0,1,0,0,1,0
4,CR000053,1989,8,27,68324,"{'type': 'Point', 'coordinates': [-73.70080209...",0,1,0,0,1,0


In [22]:
# aggregate by municipality and year
sec_agg=SE_g[['parsec','guerrsec','posdsec','n_parsec','n_guerrsec','n_posdsec',
                "a_o",'c_digo_dane_de_municipio']].groupby(['c_digo_dane_de_municipio','a_o']).sum().reset_index()
# rename columns
rename_dict={'c_digo_dane_de_municipio':'muncode', 'a_o': 'year'}
sec_agg=sec_agg.rename(columns=rename_dict)

In [23]:
sec_agg.head(5)

Unnamed: 0,muncode,year,parsec,guerrsec,posdsec,n_parsec,n_guerrsec,n_posdsec
0,5000,1990,0,0,0,0,0,0
1,5000,1991,0,6,0,0,41,0
2,5000,1992,0,2,0,0,3,0
3,5000,1993,0,1,0,0,1,0
4,5000,1994,0,2,0,0,4,0


# Final Merge
In this part, all tables are joined so that all variables are available by municipality and year.
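
A toy version of the join, on made-up numbers, shows the mechanics: an outer merge keeps municipality-years that appear in only one table, `validate='1:1'` raises an error if a key is repeated on either side, and a column present in both tables (here `causalities`) comes back with the suffixes `_x` and `_y`.

```python
import pandas as pd

events = pd.DataFrame({"muncode": [5000, 5000], "year": [1991, 1992],
                       "causalities": [1, 7], "clashes": [1, 0]})
massacres = pd.DataFrame({"muncode": [5000, 5001], "year": [1992, 1988],
                          "causalities": [5, 4], "parmass": [1, 0]})

# outer join on the municipality-year key; each key may appear once per table
merged = events.merge(massacres, how="outer", on=["muncode", "year"], validate="1:1")
merged = merged.fillna(0)  # no row in a table means no recorded events of that type

# add up the two casualty columns and drop the suffixed originals
merged["causalities"] = merged["causalities_x"] + merged["causalities_y"]
merged = merged.drop(columns=["causalities_x", "causalities_y"])
print(merged)
```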

In [24]:
# join massacres to meltt output
sievcac=siev_ca.merge(mass_agg, how='outer', on=['muncode','year'],validate='1:1')

# join kidnappings to main
sievcac=sievcac.merge(sec_agg, how='outer', on=['muncode','year'],validate='1:1')

# fill missing values with 0: a missing municipality-year means no events of that type were recorded
sievcac=sievcac.fillna(0)

In [25]:
# sum the casualties from massacres and from the MELTT-deduplicated events
sievcac['causalities']=sievcac['causalities_x']+sievcac['causalities_y']
sievcac=sievcac.drop(['causalities_x','causalities_y'],axis=1)

In [26]:
sievcac.head(5)

Unnamed: 0,muncode,year,clashes,govattacks,guerrattacks,parattacks,posdattacks,parmass,guerrmass,posdmass,parsec,guerrsec,posdsec,n_parsec,n_guerrsec,n_posdsec,causalities
0,5000,1991,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,6.0,0.0,0.0,41.0,0.0,1.0
1,5000,1992,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,3.0,0.0,7.0
2,5000,1993,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0
3,5000,1995,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,5000,1996,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,5.0,0.0,6.0
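
The counts show up as floats because the outer merge introduces missing values, which forces numeric columns to a float type before they are filled with 0. If integer counts are preferred in the output file, they could be cast back once no missing values are left. A minimal sketch on made-up data:

```python
import pandas as pd

toy = pd.DataFrame({"muncode": [5000, 5001], "year": [1991, 1988],
                    "clashes": [1.0, None], "parmass": [None, 2.0]})

toy = toy.fillna(0)
count_cols = [c for c in toy.columns if c not in ("muncode", "year")]
# casting is safe here because fillna(0) removed every missing value
toy = toy.astype({c: int for c in count_cols})
print(toy)
```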


In [27]:
sievcac.to_csv('SIEVCAC_data.csv')