# About this notebook:

* Updated to load data from MongoDB
* Sandbox version of a function that enables one-hot encoding
  * Will be updated and eventually moved into 'modules'
  * I think we should keep it outside the ACLED class, since the functions are generic and can be used on other data
 
Note that the content is still somewhat immature (as of 7 March 2017).

# Imports

In [1]:
import pandas as pd
import numpy as np
import datetime

%matplotlib inline

# Autoreloading external modules

In [2]:
# for auto-reloading external modules
# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython
%load_ext autoreload
%autoreload 2
print('Loaded')

Loaded


## Importing ACLED data from MongoDB

In [3]:
import sys
sys.path.insert(0, '../')
import modules.datasets

In [4]:
acled = modules.datasets.ACLED()
# Note: without parentheses this only references the method; it does not run the update.
acled.mongodb_update_database

<bound method ACLED.mongodb_update_database of <modules.datasets.ACLED object at 0x108ec90b8>>

In [5]:
# Load the ACLED data into a pandas.DataFrame:
df = acled.mongodb_get_entire_database()

# Mini-dataset to play with

In [6]:
df_f = df[['event_date', 'country', 'event_type', 'fatalities']].copy()

# Full one-hot encoding:
Everything needed to apply one-hot encoding is summarized in the single call below:

In [7]:
import modules.utils

df_onehot, event_type_dict = modules.utils.apply_one_hot_encoding(df_f, encode_col='event_type', col_prefix='e_type')

df_onehot.head(6)

|   | event_date | country | event_type | fatalities | e_type_1 | e_type_2 | e_type_3 | e_type_4 | e_type_5 | e_type_6 | e_type_7 | e_type_8 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2017-02-18 | Somalia | Violence against civilians | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 2017-02-18 | Libya | Battle-No change of territory | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 2017-02-18 | Ivory Coast | Riots/Protests | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 2017-02-18 | Ethiopia | Violence against civilians | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 2017-02-18 | Somalia | Battle-No change of territory | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | 2017-02-18 | Democratic Republic of Congo | Battle-No change of territory | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


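The number at the end of each e_type column name is a category number, which is decoded with:

- event_type_dict[#number]

Rows in which all e_type columns are 0 belong to category 0, which has no column of its own in the output above.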

In [8]:
for i, e_type in event_type_dict.items():
    print(i, ':', e_type)

0 : violence against civilians
1 : battle-no change of territory
2 : riots/protests
3 : strategic development
4 : remote violence
5 : battle-government regains territory
6 : battle-non-state actor overtakes territory
7 : non-violent transfer of territory
8 : headquarters or base established
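
To go back from the one-hot columns to a category name, the column numbers can be recombined and looked up in the dictionary. The following is only a minimal sketch on toy data: `decode_one_hot`, `toy_onehot` and `toy_dict` are hypothetical names that are not part of `modules.utils`, and the sketch assumes the first category (0) was dropped as the baseline, as the output above suggests.

```python
import pandas as pd

# Toy stand-ins for event_type_dict and df_onehot (hypothetical values).
toy_dict = {0: 'violence against civilians',
            1: 'battle-no change of territory',
            2: 'riots/protests'}
toy_onehot = pd.DataFrame({
    'e_type_1': [0, 1, 0],
    'e_type_2': [0, 0, 1],
})


def decode_one_hot(df, type_dict, col_prefix='e_type'):
    """Return the category name for each row of a drop-first one-hot frame."""
    # Rows where every indicator is 0 stay at the baseline category, 0.
    codes = pd.Series(0, index=df.index)
    for number in type_dict:
        col = '{}_{}'.format(col_prefix, number)
        if col in df.columns:
            codes = codes + number * df[col].astype(int)
    return codes.map(type_dict)


decoded = decode_one_hot(toy_onehot, toy_dict)
print(decoded.tolist())
```

The first toy row has no indicator set, so it decodes to category 0, just like the 'Violence against civilians' rows in the real output.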


## Ready for analysis
With the above, you can apply ML techniques :)


# Background info:

The section below shows in more detail what the one-hot encoding function does, and it also contains some tests. <font color='red'>This section is NOT required for using one-hot encoding. For that, the section above is sufficient.</font>


In [9]:
import modules.utils
event_cat_col, event_type_dict = modules.utils.encoding_col_values_to_num(df_f, encode_col='event_type')

df_f['event_cat'] = event_cat_col
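
The implementation of `encoding_col_values_to_num` is not shown in this notebook. As a rough illustration of what such a function might do, here is a sketch based on `pd.factorize`, which numbers labels in order of first appearance. The function name `encode_col_values_to_num_sketch` and the toy data are hypothetical, and the lowercase/strip normalisation is an assumption inferred from the lowercase dictionary values printed earlier.

```python
import pandas as pd


def encode_col_values_to_num_sketch(df, encode_col):
    """Hypothetical stand-in, NOT the actual modules.utils implementation."""
    # Normalise the labels so 'Riots/Protests ' and 'riots/protests' match.
    cleaned = df[encode_col].str.lower().str.strip()
    # pd.factorize assigns integer codes in order of first appearance.
    codes, uniques = pd.factorize(cleaned)
    type_dict = dict(enumerate(uniques))
    cat_col = pd.Series(codes, index=df.index, name=encode_col + '_cat')
    return cat_col, type_dict


toy = pd.DataFrame({'event_type': ['Violence against civilians',
                                   'Riots/Protests ',
                                   'violence against civilians']})
cat_col, type_dict = encode_col_values_to_num_sketch(toy, 'event_type')
print(cat_col.tolist())
print(type_dict)
```

The first and third toy rows differ only in capitalisation, so they end up in the same numeric category.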

### Quick check to verify the above does the correct thing
In essence: mapping the new numerical categories through the dictionary should reproduce the (lowercased, stripped) event type:

In [10]:
true_if_it_worked = (df_f['event_cat'].map(event_type_dict) == df_f['event_type'].str.lower().str.strip()).all()

print("If encoding did what we wanted, the last word should be True:", true_if_it_worked)

If encoding did what we wanted, the last word should be True: True


### Explanation:
We've added column 'event_cat' to the dataframe. Dictionary 'event_type_dict' translates from category number to name of category:

In [11]:
print(
    'Category: ',
    df_f['event_cat'][0],'\n',
    'From dict: ', event_type_dict[df_f['event_cat'][0]],'\n',
    'From original column: ', df_f['event_type'][0])

Category:  0 
 From dict:  violence against civilians 
 From original column:  Violence against civilians


As an example, let's extract all 'Riots/Protests':

In [12]:
df_riot = df_f[df_f.loc[:, 'event_cat']==2].copy()
print(df_riot.head(5))

   event_date       country      event_type  fatalities  event_cat
2  2017-02-18   Ivory Coast  Riots/Protests           0          2
6  2017-02-18       Algeria  Riots/Protests           0          2
8  2017-02-18  South Africa  Riots/Protests           0          2
11 2017-02-18          Mali  Riots/Protests           0          2
15 2017-02-17       Nigeria  Riots/Protests           0          2


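The same rows could also have been selected directly from the original text column (assuming the raw strings carry no stray whitespace or case differences):

    df_riot = df_f[df_f['event_type'] == 'Riots/Protests'].copy()

However, the numeric categories are more convenient for one-hot encoding with **pd.get_dummies()**, because they yield short, predictable column names (event_cat_1, event_cat_2, ...):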

### One-hot encoding applied:

In [13]:
pd.get_dummies(df_f, columns=['event_cat'], drop_first=True).head(5)

|   | event_date | country | event_type | fatalities | event_cat_1 | event_cat_2 | event_cat_3 | event_cat_4 | event_cat_5 | event_cat_6 | event_cat_7 | event_cat_8 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2017-02-18 | Somalia | Violence against civilians | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 2017-02-18 | Libya | Battle-No change of territory | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 2017-02-18 | Ivory Coast | Riots/Protests | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 2017-02-18 | Ethiopia | Violence against civilians | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 2017-02-18 | Somalia | Battle-No change of territory | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

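
The effect of `drop_first=True` is easiest to see on a small example. The sketch below uses a hypothetical three-row frame (`toy`) and compares the columns produced with and without the option. Passing `dtype=int` is an addition made here so that the result stays 0/1 on newer pandas versions, where the default dummy dtype is bool.

```python
import pandas as pd

# Toy frame with three numeric categories (hypothetical values).
toy = pd.DataFrame({'country': ['Somalia', 'Libya', 'Mali'],
                    'event_cat': [0, 1, 2]})

# One indicator column per category.
full = pd.get_dummies(toy, columns=['event_cat'], dtype=int)

# drop_first=True removes the first category; it becomes the all-zero baseline.
reduced = pd.get_dummies(toy, columns=['event_cat'], drop_first=True, dtype=int)

print(list(full.columns))
print(list(reduced.columns))
print(reduced)
```

With the first category dropped, a row of category 0 is represented by zeros in all remaining indicator columns, which matches the 'Violence against civilians' rows in the table above.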