# About notebook:

* Updated to load data from MongoDB
* Sandbox version of a function to enable one-hot encoding
  * Will be updated and eventually moved into 'modules'
  * I think we should keep it outside of the ACLED class, as the functions are generic and can be used on other data

Note that the content is still a bit immature (as of 7 March 2017).

# Imports

In [1]:
import pandas as pd
import numpy as np
import datetime

%matplotlib inline

## Importing ACLED data from MongoDB

In [2]:
import sys
sys.path.insert(0, '../')
import datasets

In [3]:
acled = datasets.ACLED()
# Note: without parentheses this only references the method; it does not run the update.
acled.mongodb_update_database

<bound method ACLED.mongodb_update_database of <datasets.ACLED object at 0x10e63f6a0>>

In [4]:
# Loading ACLED data into a pandas.DataFrame:
df = acled.mongodb_get_entire_database()

# Mini-dataset to play with

In [5]:
df_f = df[['event_date', 'country', 'event_type', 'fatalities']].copy()

# Load the new encoding function:
I will add this to 'modules'.

That said:
* I think we should keep it outside of the ACLED class, as the functions are generic and can be used on other data

In [6]:

def _invert_dict_nonunique(d):
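    """Invert the non-unique dictionary 'd': {key: value} -> {value: [keys]}
    
    Reference:
    [1] http://www.saltycrane.com/blog/2008/01/how-to-invert-dict-in-python/
    """
    newdict = {}
    # .items() works in both Python 2 and 3 (.iteritems() is Python 2 only)
    for k, v in d.items():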
        newdict.setdefault(v, []).append(k)
    return newdict

def encoding_col_values_to_num(df, col, lower=True, strip=True, preset_dict=None):
    """Encode the values of df[col] as a numerical pandas.Series plus a dict.
    
    Results in:
    category_col.map(category_dict) == df[col]           *
     
    *: Modulo str.lower() and str.strip(), if applied
    
    Keyword arguments:
        df -- pandas.DataFrame
        col -- Column (string) encoding is applied to,
               the column should contain strings.
        lower -- Ignore case? Default: True
        strip -- Strip strings? Default: True
        preset_dict -- Enables encoding using custom dictionary (custom grouping) 
                       
    
    Returns:
        category_col -- pandas.Series
        category_dict -- dict        

    Warning: Does not modify df[col] in the passed DataFrame.
    
    Todo:
        - Finalize 'preset_dict' functionality: enabling customized dictionaries.
          Currently, inversion of the dictionary will cause an error.
        - Alternatively, remove the code related to 'preset_dict' (including function '_invert_dict_nonunique')
    """
    column = df[col].copy()
    if lower:
        column = column.apply(str.lower)
    if strip:
        column = column.apply(str.strip)

    if preset_dict:
        mapping_dict = preset_dict
        # Inverting mapping_dict:
        category_dict = _invert_dict_nonunique(mapping_dict)
    else:
        categories = column.unique()
        mapping_dict = {category : i for i, category in enumerate(categories)}
        # Inverting mapping_dict:
        category_dict = dict([(v, k) for k, v in mapping_dict.items()])


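    # Mapping each value to its numerical category:
    category_col = column.map(mapping_dict)

    # Verifies that all rows were successfully coded
    # (must happen before the integer cast, which fails on NaN):
    assert not category_col.isnull().any()
    category_col = category_col.astype(int)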
           
    return category_col, category_dict
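
To make the 'preset_dict' idea in the Todo more concrete, here is a minimal sketch on toy data. The grouping dictionary and the toy labels are made up for illustration, and `invert_dict_nonunique` is a Python 3 stand-in for the helper above, not the final implementation. It suggests why a custom grouping needs extra care: inverting a many-to-one mapping gives a list of labels per code, so mapping the codes back no longer returns the original strings.

```python
import pandas as pd

def invert_dict_nonunique(d):
    """Invert a many-to-one dict: {key: value} -> {value: [keys]}."""
    inverted = {}
    for key, value in d.items():
        inverted.setdefault(value, []).append(key)
    return inverted

# Hypothetical custom grouping (assumption): two battle types share code 0.
preset_dict = {
    'battle-no change of territory': 0,
    'battle-government regains territory': 0,
    'riots/protests': 1,
}

# Toy column; normalized the same way as in the function (lower + strip).
toy = pd.Series(['Riots/Protests ', 'Battle-No change of territory'])
normalized = toy.str.lower().str.strip()

codes = normalized.map(preset_dict)           # label -> group code
groups = invert_dict_nonunique(preset_dict)   # group code -> list of labels

print(codes.tolist())
print(groups)
```

Since each code maps to a list, a check like the round trip used further down would have to test membership in the list rather than equality.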

# Apply the encoding function

In [7]:
event_cat_col, event_type_dict = encoding_col_values_to_num(df_f, 'event_type')

df_f['event_cat'] = event_cat_col

In [12]:
df_f.head(10)

|   | event_date | country | event_type | fatalities | event_cat |
|---|---|---|---|---|---|
| 0 | 2017-02-18 | Somalia | Violence against civilians | 1 | 0 |
| 1 | 2017-02-18 | Libya | Battle-No change of territory | 1 | 1 |
| 2 | 2017-02-18 | Ivory Coast | Riots/Protests | 0 | 2 |
| 3 | 2017-02-18 | Ethiopia | Violence against civilians | 1 | 0 |
| 4 | 2017-02-18 | Somalia | Battle-No change of territory | 1 | 1 |
| 5 | 2017-02-18 | Democratic Republic of Congo | Battle-No change of territory | 0 | 1 |
| 6 | 2017-02-18 | Algeria | Riots/Protests | 0 | 2 |
| 7 | 2017-02-18 | Egypt | Battle-No change of territory | 1 | 1 |
| 8 | 2017-02-18 | South Africa | Riots/Protests | 0 | 2 |
| 9 | 2017-02-18 | Somalia | Battle-No change of territory | 3 | 1 |


## Quick check to verify the above does the correct thing
... in essence, that the new numerical categories, mapped back through the dict, do in fact correspond to the event type:

In [9]:
(df_f['event_cat'].map(event_type_dict) == df_f['event_type'].apply(str.lower).apply(str.strip)).all()

True
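
The same kind of round trip can be tried on toy data without MongoDB access. The sketch below is an assumption-laden stand-in, not the function above: it uses pandas' built-in `pd.factorize`, which also assigns integer codes in order of first appearance, and the toy labels are made up.

```python
import pandas as pd

toy = pd.DataFrame({'event_type': ['Riots/Protests', ' riots/protests', 'Remote violence']})

# Normalize the strings the same way the function does (lower + strip).
normalized = toy['event_type'].str.lower().str.strip()

# pd.factorize returns integer codes plus the unique values,
# with codes assigned in order of first appearance.
codes, uniques = pd.factorize(normalized)
category_dict = dict(enumerate(uniques))

# Round trip: codes mapped back through the dict give the normalized strings.
round_trip = pd.Series(codes, index=toy.index).map(category_dict)
print((round_trip == normalized).all())
```

If the two approaches ever disagree on real data, that would point to a difference in normalization rather than in the encoding itself.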

## One-hot encoding (using pandas)
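... aaaand we're done. Note that drop_first=True drops the column for the first category (event_cat_0), so rows in that category show all zeros: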

In [11]:
pd.get_dummies(df_f, columns=['event_cat'], drop_first=True).head(10)

|   | event_date | country | event_type | fatalities | event_cat_1 | event_cat_2 | event_cat_3 | event_cat_4 | event_cat_5 | event_cat_6 | event_cat_7 | event_cat_8 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2017-02-18 | Somalia | Violence against civilians | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 2017-02-18 | Libya | Battle-No change of territory | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 2017-02-18 | Ivory Coast | Riots/Protests | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 2017-02-18 | Ethiopia | Violence against civilians | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 2017-02-18 | Somalia | Battle-No change of territory | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | 2017-02-18 | Democratic Republic of Congo | Battle-No change of territory | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6 | 2017-02-18 | Algeria | Riots/Protests | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 7 | 2017-02-18 | Egypt | Battle-No change of territory | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 8 | 2017-02-18 | South Africa | Riots/Protests | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 9 | 2017-02-18 | Somalia | Battle-No change of territory | 3 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


### Todo (one-hot encoding)
This still has to be wrapped in a function that does what we need with it :)
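
One possible shape for such a function, as a rough sketch only: `one_hot_encode_col` and its parameters are hypothetical names, not something that exists in 'modules' yet, and the toy data is made up. It skips the intermediate numerical column and one-hot encodes the normalized strings directly with `pd.get_dummies`, so the new column names carry the category labels instead of numbers.

```python
import pandas as pd

def one_hot_encode_col(df, col, lower=True, strip=True, drop_first=False):
    """Hypothetical helper: return a copy of df with df[col] one-hot encoded.

    The original column is kept; the passed DataFrame is not modified.
    """
    column = df[col].astype(str)
    if lower:
        column = column.str.lower()
    if strip:
        column = column.str.strip()
    # One indicator column per category, named '<col>_<category>'.
    dummies = pd.get_dummies(column, prefix=col, drop_first=drop_first)
    return pd.concat([df, dummies], axis=1)

# Toy data (assumption): two spellings of the same category plus one other.
toy = pd.DataFrame({
    'event_type': ['Riots/Protests', 'riots/protests ', 'Remote violence'],
    'fatalities': [0, 0, 2],
})
encoded = one_hot_encode_col(toy, 'event_type')
print(encoded.columns.tolist())
```

Whether to keep the numerical column as an intermediate step, or to default to drop_first=True as in the cell above, is a design choice left open here.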