<a href="https://colab.research.google.com/github/abroniewski/Child-Wasting-Prediction/blob/ft_eng_model_eval/notebooks/feature_engineering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [87]:
# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sn
import plotly.express as px
import warnings
import os
warnings.filterwarnings('ignore')

# fixed variables for reproducibility
Random_State = 0

In [100]:
# path settings

# Input file paths (change these paths)
acled_processed_in = "https://raw.githubusercontent.com/abroniewski/Child-Wasting-Prediction/main/data/acled/acled.csv"
prevalence_processed_in = "https://raw.githubusercontent.com/abroniewski/Child-Wasting-Prediction/main/data/acled/prevalence.csv"
conflict_df_old = pd.read_csv('../data/ZHL/conflict.csv')

# Output file paths (change these paths)
acled_out = "acled_features.csv"
# prevalence_out = "prevalence.csv"


In [101]:
# Read conflict data
conflict_df = pd.read_csv(acled_processed_in)
conflict_df.head(3)

Unnamed: 0,event_date,year,event_type,sub_event_type,actor1,assoc_actor_1,inter1,actor2,assoc_actor_2,inter2,...,admin1,admin2,location,latitude,longitude,source,source_scale,notes,fatalities,timestamp
0,2022-09-23,2022,Explosions/Remote violence,Shelling/artillery/missile attack,Al Shabaab,,2,ATMIS: African Union Transition Mission in Som...,,8,...,Lower Juba,Kismaayo,Abdale Birole,-0.4906,42.1969,Calamada; Al Furqaan,National,"On 23 September 2022, Al Shabaab militants fir...",0,1664231916
1,2022-09-23,2022,Battles,Armed clash,Al Shabaab,,2,Military Forces of Somalia (2022-),,1,...,Lower Shabelle,Afgooye,Almada,2.2187,45.2087,Al Furqaan; Calamada,National,"On 23 September 2022, overnight, Al Shabaab mi...",0,1664231916
2,2022-09-23,2022,Battles,Armed clash,Al Shabaab,,2,ATMIS: African Union Transition Mission in Som...,,8,...,Lower Shabelle,Afgooye,Awbocow,1.985,45.0018,Al Furqaan; Calamada,National,"On 23 September 2022, overnight, Al Shabaab mi...",0,1664231916


**Description of some important variables**
- **inter** - Contains 8 categories. These categories offer a way to distinguish between actors and determine how patterns of activity conform to goals and organizations. ACLED does not use a pattern of activity to designate what kind of agent a group is: it specifically observes the goals and structure of an organization, where possible, its spatial dimension and its relationships to communities. 
- **interaction** - The joined interaction code is the combination of the two ‘INTER’ codes associated with the two main actors. Single actor type codes are recorded in ‘INTER1’ and ‘INTER2’ columns, and the compounded number is recorded in the ‘INTERACTION’ column. For example, if a country’s military fights a political militia group, and the respective ‘INTER1’ and ‘INTER2’ codes are “1” and “3”, respectively, the compounded Interaction is recorded as “13”. Interaction numbers are always the smallest possible number (for example, 37 instead of 73), regardless of the order of ‘ACTOR1’ and ‘ACTOR2’. Interaction codes are recorded for all events, including non-violent activity. For one-sided events, the empty second actor category is coded as “0”. If a non-violent rebel event occurs where only ‘INTER1’ is noted with a “2”, “20” is coded in the ‘INTERACTION’ column.
- **admin1** - Somalia has 18 administrative regions - https://en.wikipedia.org/wiki/Administrative_divisions_of_Somalia
- **admin2** - According to the ACLED documentation (page 30 of the ACLED PDF), admin3 denotes the district, but in this data admin2 appears to hold the district
- **source** - The source of the event report
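
The rule for the joined interaction code can be sketched in a few lines. This is only an illustration of the description above, not ACLED's own code; the helper name `interaction_code` is made up for this notebook.

```python
def interaction_code(inter1: int, inter2: int) -> int:
    """Combine two single-actor codes as described above (hypothetical helper)."""
    low, high = sorted((inter1, inter2))
    if low == 0:
        # One-sided event: the empty second actor is coded as a trailing 0
        low, high = high, low
    return low * 10 + high


print(interaction_code(1, 3))  # military vs political militia
print(interaction_code(7, 3))  # order of the actors does not matter
print(interaction_code(2, 0))  # one-sided rebel event
```

Because the joined code is fully determined by `inter1` and `inter2`, it carries no information beyond those two columns.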


# Basic PreProcessing before feature engineering 

In [102]:
# Read prevalence data and extract the time
prevalence_df = pd.read_csv(prevalence_processed_in)
min_date, max_date = prevalence_df['date'].min(), prevalence_df['date'].max()
print(f"The data in scope for this project runs from {min_date} to {max_date}")
prevalence_df

The data in scope for this project runs from 2017-07-01 to 2021-07-01


Unnamed: 0,date,district,total population,Under-Five Population,GAM,MAM,SAM,GAM Prevalence,SAM Prevalence,SAM/GAM ratio
0,2021-07-01,Adan Yabaal,,17190.000,4930.000000,,710.000000,0.286795,0.041303,0.144016
1,2021-07-01,Afgooye,,94444.600,43800.000000,,8930.000000,0.463764,0.094553,0.203881
2,2021-07-01,Afmadow,,46703.800,18290.000000,,4150.000000,0.391617,0.088858,0.226900
3,2021-07-01,Baardheere,,34453.400,13330.000000,,2230.000000,0.386899,0.064725,0.167292
4,2021-07-01,Badhaadhe,,14272.600,5790.000000,,1330.000000,0.405672,0.093186,0.229706
...,...,...,...,...,...,...,...,...,...,...
672,2017-07-01,Wanla Weyn,227667.915,45533.583,16810.998844,13022.604738,3788.394106,0.369200,0.083200,0.225352
673,2017-07-01,Xarardheere,278193.510,55638.702,26328.233786,21120.451279,5207.782507,0.473200,0.093600,0.197802
674,2017-07-01,Xudun,42361.515,8472.303,3744.757926,2863.638414,881.119512,0.442000,0.104000,0.235294
675,2017-07-01,Xudur,113853.105,22770.621,11071.075930,7874.080742,3196.995188,0.486200,0.140400,0.288770


In [103]:
# The min date for the conflict data is 6 months before the first prevalence date, because each prevalence date
# is matched with the conflict events aggregated over the 6 months before it.
# For example, 2017-07-01 (the earliest prevalence date) reflects the events from 2017-01-01 up to, but not including, 2017-07-01,
# so our min date is 6 months before it
min_date = '2017-01-01'
min_date

'2017-01-01'
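
Instead of hard-coding the date, the start of the first window could be derived from the earliest prevalence date. A minimal sketch, assuming the earliest prevalence date is the `2017-07-01` printed above; the variable names are illustrative only.

```python
import pandas as pd

# Earliest prevalence date, as printed above
first_prevalence_date = pd.Timestamp("2017-07-01")

# Step back six calendar months to get the start of the first aggregation window
window_start = first_prevalence_date - pd.DateOffset(months=6)
min_date_derived = window_start.strftime("%Y-%m-%d")
print(min_date_derived)
```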

In [104]:
# Removing the ACLED data that is outside our project timeline: we do not need to know which features were important 10 years ago,
# and we do not have prevalence data for that period anyway
# Then parsing event_date into pandas datetime format for later use
conflict_df = conflict_df[(conflict_df['event_date'] >= min_date) & (conflict_df['event_date'] <= max_date)].sort_values('event_date').reset_index(drop=True)
conflict_df['event_date'] = pd.to_datetime(conflict_df['event_date'])
conflict_df

Unnamed: 0,event_date,year,event_type,sub_event_type,actor1,assoc_actor_1,inter1,actor2,assoc_actor_2,inter2,...,admin1,admin2,location,latitude,longitude,source,source_scale,notes,fatalities,timestamp
0,2017-01-01,2017,Violence against civilians,Attack,Al Shabaab,,2,Civilians (Somalia),,7,...,Middle Juba,Jilib,Jilib,0.4833,42.7666,Undisclosed Source,Local partner-Other,Al Shabaab publicly executed a 76 year old civ...,1,1600121131
1,2017-01-01,2017,Violence against civilians,Attack,Habar Jeclo Clan Militia (Somalia),,4,Civilians (Somalia),Dhulbahante Clan Group (Somalia),7,...,Togdheer,Buuhoodle,Buuhoodle,8.2516,46.3157,Undisclosed Source,Local partner-Other,Habar Jeclo militiamen ambushed a vehicle carr...,3,1571260135
2,2017-01-01,2017,Violence against civilians,Attack,Police Forces of Somaliland (2010-),,1,Civilians (Somalia),,7,...,Togdheer,Buuhoodle,Buuhoodle,8.2516,46.3157,Radio Kulmiye,National,Somaliland police reportedly shoot and kill tw...,2,1567462229
3,2017-01-02,2017,Strategic developments,Disrupted weapons use,AMISOM: African Union Mission in Somalia (2007...,,8,Unidentified Armed Group (Somalia),,3,...,Hiraan,Bulo Burto,Bulo Burto,3.8519,45.5651,Undisclosed Source,Local partner-Other,Defusal: An RCIED placed at an AMISOM Djibouti...,0,1571260135
4,2017-01-02,2017,Battles,Armed clash,Gaaljecel Clan Militia (Somalia),,4,Makane Clan Militia (Somalia),,4,...,Hiraan,Belet Weyne,Belet Weyne,4.7360,45.2043,Undisclosed Source,Local partner-Other,"Clan militias clash at Makiinta, near Belet We...",0,1571260135
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12476,2021-07-01,2021,Violence against civilians,Attack,Leelkayse Clan Militia (Somalia),,4,Civilians (Somalia),Majeerteen-Omar Mahmud Sub-Clan Group (Somalia),7,...,Mudug,Galdogob,Galdogob,7.0120,47.0600,Undisclosed Source,Local partner-Other,"On 1 July 2021, a Leelkayse clan militia shot ...",1,1626118834
12477,2021-07-01,2021,Explosions/Remote violence,Shelling/artillery/missile attack,Al Shabaab,,2,AMISOM: African Union Mission in Somalia (2007...,,8,...,Middle Shabelle,Jowhar,Miir-Taqwo,3.1034,45.8515,Radio Risala,National,"On 1 July 2021, Al Shabaab militants launched ...",0,1626118833
12478,2021-07-01,2021,Violence against civilians,Attack,Al Shabaab,,2,Civilians (Kenya),Labour Group (Kenya),7,...,Lower Juba,Afmadow,Dhobley,0.4063,41.0124,The Star (Kenya),Regional,"On 1 July 2021, Al Shabaab militants shot and ...",3,1625510721
12479,2021-07-01,2021,Battles,Armed clash,Al Shabaab,,2,AMISOM: African Union Mission in Somalia (2007...,,8,...,Lower Juba,Badhaadhe,Koday,-1.0351,41.9748,Somali Memo,National,"On 1 July 2021, Al Shabaab militants attacked ...",0,1625510721


In [105]:
conflict_df.shape

(12481, 21)

In [106]:
# By definition these features do not seem useful to feed to the model: latitude, longitude, notes, timestamp, location
# Our finest granularity for child wasting is the district, so where an event happens within a district is not important: latitude, longitude and location can be removed
# source and source_scale only describe where the information came from, so they can be dropped
# notes describe the event that happened; the important information is already extracted into other columns, so notes can be dropped
# year duplicates information already in event_date
# Moreover, we have very limited data to train on and do not want the model to overfit, so we need a limited number of features: first we remove the features
# that are not useful according to our knowledge, and then select among the remaining ones using tests

conflict_df.drop(columns=['latitude','longitude','notes','timestamp','location','source','source_scale','year'],inplace=True)
conflict_df.head(2)

Unnamed: 0,event_date,event_type,sub_event_type,actor1,assoc_actor_1,inter1,actor2,assoc_actor_2,inter2,interaction,admin1,admin2,fatalities
0,2017-01-01,Violence against civilians,Attack,Al Shabaab,,2,Civilians (Somalia),,7,27,Middle Juba,Jilib,1
1,2017-01-01,Violence against civilians,Attack,Habar Jeclo Clan Militia (Somalia),,4,Civilians (Somalia),Dhulbahante Clan Group (Somalia),7,47,Togdheer,Buuhoodle,3


# Feature Engineering

1. Check for Data/Data Types and Bad Data
2. Imputation of missing values
3. Checks and remove outliers
4. New features imputation
5. Feature Selection using statistical tests
6. Check for distribution type and Perform necessary transformations (if needed)
7. Scaling the data (if needed)
8. Binning Continuous data (if needed)

## 1. Check Data

In [107]:
# check if all data types make sense
conflict_df.dtypes

event_date        datetime64[ns]
event_type                object
sub_event_type            object
actor1                    object
assoc_actor_1             object
inter1                     int64
actor2                    object
assoc_actor_2             object
inter2                     int64
interaction                int64
admin1                    object
admin2                    object
fatalities                 int64
dtype: object

- inter1, inter2 and interaction are codes, so they are technically categorical variables rather than integers. This matters if we impute null values: numeric methods such as the mean or interpolation do not make sense for these variables
- year would be a categorical variable as well, but it was already dropped above

In [108]:
# as indicated above, converting those columns to string
conflict_df = conflict_df.astype({"inter1": str, "inter2":str, "interaction":str})
conflict_df.head(1)

Unnamed: 0,event_date,event_type,sub_event_type,actor1,assoc_actor_1,inter1,actor2,assoc_actor_2,inter2,interaction,admin1,admin2,fatalities
0,2017-01-01,Violence against civilians,Attack,Al Shabaab,,2,Civilians (Somalia),,7,27,Middle Juba,Jilib,1


In [109]:
### Checking categorical columns
# T will represent the transpose of the resulting dataframe, better for visualization
temp= conflict_df.describe(include='O').T
temp

Unnamed: 0,count,unique,top,freq
event_type,12481,6,Battles,5710
sub_event_type,12481,24,Armed clash,5428
actor1,12481,282,Al Shabaab,4434
assoc_actor_1,705,122,Military Forces of Somalia (2017-2022),151
inter1,12481,8,2,4507
actor2,11765,242,Civilians (Somalia),3269
assoc_actor_2,2059,331,Government of Somalia (2017-2022),382
inter2,12481,9,1,3453
interaction,12481,36,12,4160
admin1,12481,18,Banadir,2896


In [110]:
temp['unique'].sum()

1152

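**A few important points**

- Al Shabaab is the most frequent actor1 (4,434 events) and Civilians (Somalia) the most frequent actor2 (3,269 events)
- Battles are the most frequent event type (5,710 events), and Armed clash the most frequent sub-event type (5,428 events)
- The most common interaction code is 12 (MILITARY VERSUS REBELS), with 4,160 events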

In [76]:
### Checking numeric columns
conflict_df.describe(include='number').T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
fatalities,12481.0,1.693614,6.986975,0.0,0.0,0.0,1.0,587.0


## 2. Missing values imputation

1. According to the ACLED documentation - " Interaction codes are recorded for all events, including non-violent activity. For one-sided events, the empty second actor category is coded as “0”. If a non-violent rebel event occurs where only ‘INTER1’ is noted with a “2”, “20” is coded in the ‘INTERACTION’ column."
    1. So a NaN in actor2 is not a missing value: it means that only a single actor was responsible for the conflict/violence/event
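
One way to check this reading is to confirm that rows without a second actor are exactly the rows where `inter2` is 0. The sketch below uses a few made-up rows rather than the real ACLED file, so it only illustrates the check and the fill.

```python
import pandas as pd

# Toy rows (not real ACLED records) with the same column names as conflict_df
events = pd.DataFrame({
    "actor1": ["Al Shabaab", "Protesters (Somalia)", "Al Shabaab"],
    "actor2": ["Civilians (Somalia)", None, None],
    "inter1": [2, 6, 2],
    "inter2": [7, 0, 0],
})

# A missing actor2 should coincide with inter2 == 0 (one-sided event)
no_second_actor = events["actor2"].isna()
consistent = bool((no_second_actor == events["inter2"].eq(0)).all())
print(consistent)

# If the check holds, the NaN can be replaced by an explicit 'None' category
events["actor2"] = events["actor2"].fillna("None")
```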

In [77]:
# percentage of missing values
null_percentage = conflict_df.isnull().sum().sort_values(ascending = False)/len(conflict_df)*100
null_percentage

assoc_actor_1     94.351414
assoc_actor_2     83.502924
actor2             5.736720
event_date         0.000000
event_type         0.000000
sub_event_type     0.000000
actor1             0.000000
inter1             0.000000
inter2             0.000000
interaction        0.000000
admin1             0.000000
admin2             0.000000
fatalities         0.000000
dtype: float64

In [78]:
# dropping columns which have more than 60% missing values
col_to_drop = null_percentage[null_percentage>60].keys()
col_to_drop

Index(['assoc_actor_1', 'assoc_actor_2'], dtype='object')

In [79]:
conflict_df.drop(columns=col_to_drop,inplace=True)
conflict_df.head(2)

Unnamed: 0,event_date,event_type,sub_event_type,actor1,inter1,actor2,inter2,interaction,admin1,admin2,fatalities
0,2017-01-01,Violence against civilians,Attack,Al Shabaab,2,Civilians (Somalia),7,27,Middle Juba,Jilib,1
1,2017-01-01,Violence against civilians,Attack,Habar Jeclo Clan Militia (Somalia),4,Civilians (Somalia),7,47,Togdheer,Buuhoodle,3


In [80]:
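# filling actor2 NaN with 'None', where 'None' indicates that there was no actor2 involved in the event
# (actor2 is the only column that still contains NaN, so filling the whole frame only affects that column)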
conflict_df = conflict_df.fillna('None')

In [81]:
# percentage of missing values (after filling)
null_percentage = conflict_df.isnull().sum().sort_values(ascending = False)/len(conflict_df)*100
null_percentage

event_date        0.0
event_type        0.0
sub_event_type    0.0
actor1            0.0
inter1            0.0
actor2            0.0
inter2            0.0
interaction       0.0
admin1            0.0
admin2            0.0
fatalities        0.0
dtype: float64

## 3. Outlier Detection and removal
Removing outliers is debatable here because the rows are not independent: at the end we group the events into 6-month windows, so the grouping may average out an outlier.

In [82]:
conflict_df.head()

Unnamed: 0,event_date,event_type,sub_event_type,actor1,inter1,actor2,inter2,interaction,admin1,admin2,fatalities
0,2017-01-01,Violence against civilians,Attack,Al Shabaab,2,Civilians (Somalia),7,27,Middle Juba,Jilib,1
1,2017-01-01,Violence against civilians,Attack,Habar Jeclo Clan Militia (Somalia),4,Civilians (Somalia),7,47,Togdheer,Buuhoodle,3
2,2017-01-01,Violence against civilians,Attack,Police Forces of Somaliland (2010-),1,Civilians (Somalia),7,17,Togdheer,Buuhoodle,2
3,2017-01-02,Strategic developments,Disrupted weapons use,AMISOM: African Union Mission in Somalia (2007...,8,Unidentified Armed Group (Somalia),3,38,Hiraan,Bulo Burto,0
4,2017-01-02,Battles,Armed clash,Gaaljecel Clan Militia (Somalia),4,Makane Clan Militia (Somalia),4,44,Hiraan,Belet Weyne,0


In [83]:
# since we have only one numeric variable, fatalities, checking outliers for that
conflict_numeric_data = conflict_df.select_dtypes(include="number").columns
px.box(conflict_df, y=conflict_numeric_data).show()

In [84]:
# percentage count of no of fatalities to get an idea of box plot structure
conflict_df['fatalities'].value_counts().sort_values(ascending=False)/len(conflict_df)*100

0      55.059691
1      22.370002
2       7.307107
3       3.813797
10      2.996555
4       2.211361
5       1.578399
6       0.817242
7       0.697060
8       0.504767
12      0.312475
9       0.312475
11      0.256390
15      0.200304
20      0.184280
18      0.120183
17      0.120183
14      0.120183
16      0.112170
13      0.104158
30      0.088134
40      0.072110
27      0.056085
26      0.056085
25      0.048073
19      0.040061
50      0.032049
22      0.032049
35      0.032049
29      0.024037
81      0.024037
52      0.024037
100     0.024037
23      0.024037
24      0.024037
57      0.016024
37      0.016024
34      0.016024
60      0.016024
32      0.016024
21      0.016024
28      0.016024
61      0.008012
88      0.008012
38      0.008012
587     0.008012
36      0.008012
70      0.008012
93      0.008012
72      0.008012
48      0.008012
73      0.008012
41      0.008012
Name: fatalities, dtype: float64

- Since about 55% of events have 0 fatalities and about 22% have 1, the box plot is concentrated at 0 and 1
- As we group by at a later stage, removing outliers here is debatable, because they can be averaged out over a 1 or 6 month window (for example, a big explosion with nothing happening for a few months before and after)
- For the largest outlier (587 fatalities), I checked that nothing else happened in that district during those 6 months, so I am keeping it; it effectively denotes 587 fatalities in 6 months
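
If we want to see which events a box plot would mark as outliers without removing them, the usual 1.5 × IQR fence can be computed directly. This is a sketch on a made-up series, not the real `fatalities` column, and the 1.5 multiplier is the conventional box-plot default rather than a choice made in this notebook.

```python
import pandas as pd

# Made-up fatality counts with one extreme event
fatalities = pd.Series([0, 0, 0, 0, 0, 1, 1, 2, 3, 587])

q1 = fatalities.quantile(0.25)
q3 = fatalities.quantile(0.75)
upper_fence = q3 + 1.5 * (q3 - q1)

# Flag, but do not drop, the values above the fence
flagged = fatalities[fatalities > upper_fence]
print(upper_fence, flagged.tolist())
```

Flagging keeps the rows available for the later 6-month aggregation, which matches the decision above to keep the extreme event.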

In [85]:
# removing fatalities greater than 150 (disabled: we keep the outlier, see above)
# conflict_df = conflict_df[conflict_df['fatalities']<=150]

## 4. New Feature imputation

In [86]:
# number of unique values for each feature
conflict_df.nunique()

event_date        1634
event_type           6
sub_event_type      24
actor1             282
inter1               8
actor2             243
inter2               9
interaction         36
admin1              18
admin2              74
fatalities          53
dtype: int64

**Note** - 

- Since all the features except fatalities are categorical and we have to group them over 6 months, we should not create a feature for every category; otherwise there will be too many features and the model will overfit
- The information in interaction is already contained in inter1 and inter2, so keeping it would repeat information; it can be removed as well
- Moreover, from the domain knowledge (ACLED documentation) these features are already clustered: inter groups similar types of actors, and sub-event types are grouped into event types. So the finer-grained variables (actor1, actor2, sub_event_type) can be dropped

In [22]:
col_to_drop = ['sub_event_type','actor1','actor2','interaction']

In [23]:
conflict_df.drop(columns=col_to_drop,inplace=True)
conflict_df.head(2)

Unnamed: 0,event_date,event_type,inter1,inter2,admin1,admin2,fatalities
0,2017-01-01,Violence against civilians,2,7,Middle Juba,Jilib,1
1,2017-01-01,Violence against civilians,4,7,Togdheer,Buuhoodle,3


In [24]:
# The code below does the one-hot encoding for the variable inter
# Since inter is spread over two columns (inter1 and inter2), we need a multi-label OHE, which pd.get_dummies does not do directly, so we build it by hand
# For each row we set the positions given by the inter1 and inter2 codes (1 to 8) to 1; an inter2 code of 0 (no second actor) sets nothing

inter = []
inter_col_names = []

for i in range(1,9):
    s = "inter_"+str(i)
    inter_col_names.append(s)

for i,row in conflict_df.iterrows():
    temp = [0]*8
    in1 =int(row['inter1'])-1
    in2 =int(row['inter2'])-1
    temp[in1] = 1
    if in2 != -1:
        temp[in2] = 1
    
    inter.append(temp)

conflict_inter_df = pd.DataFrame(inter, columns=inter_col_names)
conflict_inter_df.head(2)

Unnamed: 0,inter_1,inter_2,inter_3,inter_4,inter_5,inter_6,inter_7,inter_8
0,0,1,0,0,0,0,1,0
1,0,0,0,1,0,0,1,0
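
The same multi-label encoding can also be built without a row loop, by one-hot encoding each column separately and combining the two results with a logical OR. This is an alternative sketch on toy data, not the method used above; it assumes the codes are stored as strings, as they are after the `astype` step.

```python
import pandas as pd

# Toy inter codes as strings; '0' means there is no second actor
toy = pd.DataFrame({"inter1": ["2", "4", "1", "8", "4", "6"],
                    "inter2": ["7", "7", "7", "3", "4", "0"]})

codes = [str(i) for i in range(1, 9)]

# One-hot encode each column, forcing the same eight columns in both frames
first = pd.get_dummies(toy["inter1"]).reindex(columns=codes, fill_value=0).astype(int)
second = pd.get_dummies(toy["inter2"]).reindex(columns=codes, fill_value=0).astype(int)

# A code is set if it appears in either column
multi_hot = (first | second).add_prefix("inter_")
print(multi_hot)
```

Reindexing to the codes 1 to 8 drops the `0` column, so one-sided events only set the `inter1` position.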


In [25]:
# dropping the inter columns and will use the encoded data for that
conflict_df.drop(columns=['inter1','inter2'],inplace=True)

In [26]:
# doing one-hot encoding for event_type as well and concatenating it with the inter columns above
conflict_df_features = pd.concat([pd.get_dummies(conflict_df, columns=['event_type']), conflict_inter_df], axis=1)
conflict_df_features.head()

Unnamed: 0,event_date,admin1,admin2,fatalities,event_type_Battles,event_type_Explosions/Remote violence,event_type_Protests,event_type_Riots,event_type_Strategic developments,event_type_Violence against civilians,inter_1,inter_2,inter_3,inter_4,inter_5,inter_6,inter_7,inter_8
0,2017-01-01,Middle Juba,Jilib,1,0,0,0,0,0,1,0,1,0,0,0,0,1,0
1,2017-01-01,Togdheer,Buuhoodle,3,0,0,0,0,0,1,0,0,0,1,0,0,1,0
2,2017-01-01,Togdheer,Buuhoodle,2,0,0,0,0,0,1,1,0,0,0,0,0,1,0
3,2017-01-02,Hiraan,Bulo Burto,0,0,0,0,0,1,0,0,0,1,0,0,0,0,1
4,2017-01-02,Hiraan,Belet Weyne,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0


In [33]:
# creating the structure of our final data frame with the new features
# columns[3:] skips event_date, admin1 and admin2, which are replaced by 'date' and 'district'
conflict_df_final_cols = ['date','district'] + list(conflict_df_features.columns[3:])
conflict_df_grouped_features = pd.DataFrame(columns=conflict_df_final_cols)
conflict_df_grouped_features

Unnamed: 0,date,district,fatalities,event_type_Battles,event_type_Explosions/Remote violence,event_type_Protests,event_type_Riots,event_type_Strategic developments,event_type_Violence against civilians,inter_1,inter_2,inter_3,inter_4,inter_5,inter_6,inter_7,inter_8


In [34]:
# the custom time intervals that we needed (6Months gap starting from '2017-01-01')
time_df = pd.DataFrame({'Start Dates': ['2017-01-01','2017-07-01','2018-01-01','2018-07-01','2019-01-01','2019-07-01','2020-01-01','2020-07-01','2021-01-01','2021-07-01']})
time_df['Start Dates'] = pd.to_datetime(time_df['Start Dates'])
time_df

Unnamed: 0,Start Dates
0,2017-01-01
1,2017-07-01
2,2018-01-01
3,2018-07-01
4,2019-01-01
5,2019-07-01
6,2020-01-01
7,2020-07-01
8,2021-01-01
9,2021-07-01


In [35]:
# -------- OLD METHOD --------------

# This function is used to do the groupby with 6 months of frequency
# we get the data for each district then do groupby with sum (which is basically count for categorical variables as it is OHE)
# We shift all the dates 6 months back and enter new end date of last 6 months (discussed below) 

# def group_features(district,data):
#     temp = data[data['admin2']==district]
#     temp = temp.groupby(pd.Grouper(key ='event_date',freq='6MS',origin_date = '2017-01-01')).sum()
#     temp['date'] = temp.index
#     end_date = temp['date'].iloc[-1]
#     new_end_date = end_date+ pd.DateOffset(months=6)
#     temp['date'] = temp['date'].shift(-1)
#     temp['date'].iloc[-1] = new_end_date
#     temp['date'] = temp['date'].dt.date
#     temp['district'] = district
#     temp = temp.reset_index(drop=True)
#     return temp



# -------- NEW METHOD --------------
# The code below uses custom bins to build the exact time windows of the target variable (child wasting) and groups by them
# Each window is half-open, [start, next start), and is labelled with its end date, which is the matching prevalence date
def group_features(district,data,time_df):
    temp = data[data['admin2']==district].copy()
    bins = pd.cut(temp['event_date'], time_df['Start Dates'], right=False, labels = time_df['Start Dates'].iloc[1:])
    # numeric_only=True skips the date and text columns; observed=False keeps windows without events (summed as 0)
    temp = temp.groupby(bins, observed=False).sum(numeric_only=True)
    temp['date'] = temp.index
    temp['date'] = pd.to_datetime(temp['date'])
    temp['date'] = temp['date'].dt.date
    temp['district'] = district
    temp = temp.reset_index(drop=True)
    return temp
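
To see how the half-open bins assign events to windows, here is a small sketch on made-up dates. The labels are written as strings for readability; `group_features` passes the start dates themselves.

```python
import pandas as pd

# Two consecutive 6-month windows, defined by three edges
edges = pd.to_datetime(["2017-01-01", "2017-07-01", "2018-01-01"])
events = pd.Series(pd.to_datetime(["2017-01-01", "2017-06-30", "2017-07-01", "2018-01-01"]))

# right=False makes each window [start, next start); each window is labelled with its end date
window = pd.cut(events, bins=edges, right=False, labels=["2017-07-01", "2018-01-01"])
print(window)
```

An event on a window's end date belongs to the next window, and an event on the last edge falls outside every window and becomes NaN.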

In [36]:
# get all the districts 
district_list = sorted(conflict_df_features['admin2'].value_counts().keys())

# iterate over all the districts and groupby 
for district in district_list:
    print(f"processing district {district}")
    district_grouped_df = group_features(district, conflict_df_features,time_df)
    print(f"total number of semi-years are {len(district_grouped_df)} \n\n")
    conflict_df_grouped_features = pd.concat([conflict_df_grouped_features,district_grouped_df])

conflict_df_grouped_features = conflict_df_grouped_features.reset_index(drop=True)

processing district Adan Yabaal
total number of semi-years are 9 


processing district Afgooye
total number of semi-years are 9 


processing district Afmadow
total number of semi-years are 9 


processing district Baardheere
total number of semi-years are 9 


processing district Badhaadhe
total number of semi-years are 9 


processing district Baki
total number of semi-years are 9 


processing district Balcad
total number of semi-years are 9 


processing district Banadir
total number of semi-years are 9 


processing district Bandarbeyla
total number of semi-years are 9 


processing district Baraawe
total number of semi-years are 9 


processing district Baydhaba
total number of semi-years are 9 


processing district Belet Weyne
total number of semi-years are 9 


processing district Belet Xaawo
total number of semi-years are 9 


processing district Berbera
total number of semi-years are 9 


processing district Borama
total number of semi-years are 9 


processing district Bos

In [46]:
conflict_df_grouped_features.columns.to_list()[2:]

['fatalities',
 'event_type_Battles',
 'event_type_Explosions/Remote violence',
 'event_type_Protests',
 'event_type_Riots',
 'event_type_Strategic developments',
 'event_type_Violence against civilians',
 'inter_1',
 'inter_2',
 'inter_3',
 'inter_4',
 'inter_5',
 'inter_6',
 'inter_7',
 'inter_8']

In [40]:
base_path = '../data/acled/'
save_path = os.path.join(base_path,acled_out)
conflict_df_grouped_features.to_csv(save_path, index=False)

**NOTE**

    - We need to group our data from 01-01-2017 to the current date with a 6-month frequency. When the Grouper function is used with freq **M**, it groups on the month end dates, so the first window runs from 31-01-2017 to 31-07-2017. That is not correct, because July data ends up in the first 6 months, which is not compatible with our prevalence data. With freq **MS** (month start), the window from 01-01-2017 to 01-07-2017 is labelled 01-01-2017, so the labels need to be shifted by 6 months to match our prevalence data; then we can merge in the same way as baseline_model (merge). This is the FIRST WAY OF USING DATA

    - In the FIRST WAY OF USING DATA we need to keep in mind that we are limiting our final data to only those 6-month windows where conflict data was present, because we only get rows where data was actually recorded.
    
    - The freq mistake described above seems to be made by the baseline model (I guess)

    - 2nd WAY OF USING DATA - we can create bins with the exact start and end dates of the prevalence data and group by them for all dates; if nothing is recorded in a window, we assume 0. (We can assume missing values are 0)


**IMPORTANT**

    - If we want to use the above conflict data along with the remaining datasets (production, population etc.), we need a value for each six-month window and each district in the conflict data; otherwise we have to drop the whole row, which also contains the information about the other variables (production, population etc.). We could treat the conflict features as NaN (missing) for the dates where no conflict information is given, but for now I have put "0" for those values. The assumption is that if nothing was reported in those 6 months, then there were actually no battles, conflicts or fatalities.
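
For aggregated data that only has rows for windows with events, one way to guarantee a row for every district and every window is to reindex onto the full district × date grid and fill the gaps with 0. This is a minimal sketch on toy data with hypothetical districts `A` and `B`, under the same "no report means no events" assumption.

```python
import pandas as pd

# Toy aggregated features: district B has no row for the second window
agg = pd.DataFrame({
    "district": ["A", "A", "B"],
    "date": ["2017-07-01", "2018-01-01", "2017-07-01"],
    "fatalities": [3, 1, 2],
})

# Every district paired with every window end date
full_grid = pd.MultiIndex.from_product(
    [["A", "B"], ["2017-07-01", "2018-01-01"]], names=["district", "date"]
)

# Missing combinations become 0 instead of disappearing from the data
complete = (
    agg.set_index(["district", "date"])
       .reindex(full_grid, fill_value=0)
       .reset_index()
)
print(complete)
```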

### EXTRA TEST CODES (maybe useful for development and checking)

In [473]:
# we need to use pd.merge_asof to match the dates
# Merge dataframes, only joining on current or previous dates to prevent data leakage
# df = pd.merge_asof(left=prevalence_df, right=ipc_df, direction='backward', on='date')
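
A small sketch of how a backward `merge_asof` avoids leakage: each prevalence row only receives the most recent feature row dated on or before it. The frames, the district `A` and the column names `gam` and `phase` are made up for illustration.

```python
import pandas as pd

prevalence = pd.DataFrame({
    "date": pd.to_datetime(["2017-07-01", "2018-01-01"]),
    "district": ["A", "A"],
    "gam": [0.30, 0.40],
})
features = pd.DataFrame({
    "date": pd.to_datetime(["2017-06-01", "2017-12-15", "2018-02-01"]),
    "district": ["A", "A", "A"],
    "phase": [2, 3, 4],
})

# Both frames must be sorted on the key; 'by' keeps the matching within each district
merged = pd.merge_asof(
    prevalence.sort_values("date"),
    features.sort_values("date"),
    on="date",
    by="district",
    direction="backward",
)
print(merged)
```

The feature row dated 2018-02-01 is never used, because it lies after both prevalence dates.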

In [439]:
# extra Useful codes for development and checking
# conflict_df_features[conflict_df_features['admin2']=='Lughaye'].groupby(pd.Grouper(key ='event_date',freq='6MS')).sum()
# conflict_df_features[conflict_df_features['admin2']=='Baardheere']
# conflict_df_features['event_date'].iloc[0] <  pd.to_datetime('2017-07-01')
# temp = group_features('Afmadow',conflict_df_features)
# temp = conflict_df_features[conflict_df_features['admin2']=='Baardheere']
# temp2 = temp.groupby(pd.Grouper(key ='event_date',freq='6MS', origin = '2017-01-01')).sum()
# t = temp.groupby(pd.cut(temp['event_date'], time_df['Start Dates'], right=False, labels = time_df['Start Dates'].iloc[1:])).sum()
# temp